Diagram
Diagram based on HICDEP 1.120
Structure of data
From flat files towards a normalized structure
The data collected in HIV collaborations is presented on the following pages in a set of data files/tables. Typically data would be put into one data file that would hold one line/record per patient where each field is represented as a separate column in that dataset. Often a dataset could contain more than 3000 columns of data.
The implication of going from thousands of fields to fewer fields means that data is in fact transposed from the flat format into the normalised format.
Example of a flat file structure:
PATIENT |
ALAT_D |
ALAT_V |
ALAT_U |
ASAT_D |
ASAT_V |
ASAT_U |
999999 |
01-01-2000 |
15 |
U/l |
01-01-2000 |
12 |
U/l |
The normalised structure would then be like this:
PATIENT |
TYPE_ID |
LAB_DATE |
LAB_VAL |
LAB_UNIT |
999999 |
1 |
01-01-2000 |
15 |
U/l |
999999 |
2 |
01-01-2000 |
12 |
U/l |
The type of measurement is identified through the TYPE_ID field. Here 1 codes for ALAT and 2 codes for ASAT:
Code |
Description |
1 |
ALAT - Alanin-Aminotransferase |
2 |
ASAT - Aspartat aminotransferase |
Technical considerations
To enable a normalised structure that minimises the number of columns dramatically, the one file solution must be broken into several minor tables. These breakdowns are driven by the different data characteristics.
Each table has a basic structure that includes the patient identifier, a code that represents e.g. drug, adverse event or laboratory test performed. Along with this combination values like date, result, unit etc are present for each record.
A record for a laboratory measurement would include:
- Patient identifier
- Measurement type identifier
- Measured value
- Unit of value
- Date of determination
A record for usage of an antiretroviral drug would include:
- Patient identifier
- Drug identifier
- Start date for usage
- End date for usage
- Reason for discontinuation
These issues imply that a set of distinct tables must be generated based on the “nature” of the data. Since laboratory, medication and event data both cannot and should not be mixed at least 3 tables must be designed. Additionally there are other types of information that need their own domains: background information on the patient (height, birth date etc.), visit related data (weight, blood pressure, wasting etc.), and resistance testing (the latter requires more consideration due to the diversity of data present).
In this protocol further separation of data into different tables are presented. These separations are not only based on the rules for the relational model and normalisation, but they are “culturally” related.
For example: antiretroviral treatment medication is kept in one table and other medication in another table; CD4 cell measurements and HIV-RNA measurements are put into separate tables, that are also different from the general laboratory table. These separations are done simply because data in these tables are of distinct importance in analysis and often are gathered more frequently and with more attention than other variables.
Coding Conventions Date codes
Although it is best to have precise dates in the format of YEAR-MONTH-DAY ISO standard, it might be that some cohorts are limited to representing date data at the level of the month only, or information kept on the patient in the charts only defines dates to the month and in some cases only to the year. To solve this a set of date codes are presented here.
Day unknown
In this case the date should be coded as the 15th of the month – so that 1999-12-?? becomes 1999-12-15. This enables the date to be no more than 15 days away from the actual date.
Month and day unknown
Best approach to this is to apply something similar, as with unknown dates, this would then mean that 1999-??-?? becomes 1999-07-01.
Year unknown
If the year is unknown but the presence of the date value is needed as in case of opportunistic infections or adverse events (see later in this document) a fictive date should be used that couldn’t be mistaken with an actual date. An unknown year should be coded as 1911-11-11.
Specification of precision
An alternative to the above is to apply an additional field to each date field for which it is known that there might be issues regarding the precision of the dates. The field is then used to specify at which degree of the day, month or year the date is precise:
Code |
Precision of date |
< |
Before this date |
D |
Exact to the date |
M |
Exact to the month |
Y |
Exact to the year |
> |
After this date |
U |
Unknown |
The Data Transfer Protocol for IeDEA Multi-regional Collaborations suggests that such a precision annotation variable should have the same name as the date variable with the additional Suffix _A. For example, the precision of BIRTH_D will be annotated using an optional variable with the name BIRTH_D_A.
ICD-10 codes
The coding system is the official standard for coding of diseases, however there is a wide set of “homebrew” codes used within the HIV field in data coding in general, often it’s a 3 or 4 letter code which is an abbreviation for the AIDS defining disease. ICD-10 doesn’t have single codes that represent all single CDC-C events and as a consequence of this a list of 3 to 4 letter codes is the recommended way of coding for all CDC stage C events
ICD-10 codes are however the recommended for codes AE’s since it would become impossible for this protocol to maintain a complete list of all possible AE’s. ICD-10 is also recommended for causes of death.
ATC codes
ATC is a hierarchical structure for coding medication. The structure and hierarchy are best explained with an example of how a drug code is defined. Here it is on Indinavir:
- J - ANTIINFECTIVES FOR SYSTEMIC USE (1st level, anatomical main group)
- J05 - ANTIVIRALS FOR SYSTEMIC USE (2nd level, therapeutic subgroup)
- J05A - DIRECT ACTING ANTIVIRALS (3rd level, pharmacological subgroup)
- J05AE - Protease inhibitors (4th level, chemical subgroup)
- J05AE02 - Indinavir (5th level, chemical substance)
This hierarchy has some benefits as will be explained later, but one of its limitations is that it’s impossible to “read” the code compared to the widely used 3 letter mnemonic codes for antiretroviral drugs.
Example:
Drug |
Code |
ATC code |
Indinavir |
IDV |
J05AE02 |
The difference is that the IDV code is easily readable, where the ATC code is not; going from a flat file structure to a normalised structure the human readable aspect becomes increasingly important. In the flat file format the column names and the possibility of labels makes data more or less readable; in the normalised format only the coding can help. Because of this the 3 letter codes are being presented in this document. However it must be stressed that usage of the ATC coding should be used to diminish the risk of several homebrew and non-compatible coding schemes.
Currently however, the ATC scheme does not provide sufficient detail on the specific drugs, there is e.g. no official way to code Saquinavir as hard or soft gel. Thus a slight alteration to the set of codes will be presented in the sections of the ART and MED tables. The alterations are designed to extend the existing structure of ATC.
One of the benefits is that the structure of ATC allows easier statistics on e.g. drug classes
- J05AE Protease inhibitors
- J05AE01 Saquinavir
- J05AE02 Indinavir
- J05AE03 Ritonavir
- J05AE04 Nelfinavir
- J05AE05 Amprenavir
- J05AE06 Lopinavir
- J05AF Nucleoside and nucleotide reverse transcriptase inhibitors
- J05AF01 Zidovudine
- J05AF02 Didanosine
- J05AF03 Zalcitabine
- J05AF04 Stavudine
- J05AF05 Lamivudine
- J05AF06 Abacavir
- J05AF07 Tenofovir disoproxil
- J05AF30 Combinations1
- J05AG Non-nucleoside reverse transcriptase inhibitors
- J05AG01 Nevirapine
- J05AG02 Delavirdine
- J05AG03 Efavirenz
Although the codes might be harder to read they provide grouping mechanisms in the way they are coded. Interested readers should go to the ATC Website to learn about the structure of ATC. A fully updated database of ATC codes and DDD (Defined Daily Dosage) is available for querying.
Other codes
It is often necessary to code for values like “Yes”, “No” and “Unknown”, this document suggests that the following codes should be used:
Code |
Description |
0 |
No |
1 |
Yes |
9 |
Unknown |
Unknown should be used to identify the difference between a value that has not yet been collected (Empty) and a value that cannot be collected (Unknown). Empty values should be required where Unknown values make little sense to keep querying for a value.
Example – weight:
Depending on the unit in which weight is measured, a different value for Unknown should be applied. In the case of kg the “Unknown” code should be 999 and not just 9 or 99, the last two could be actual values.
Blank values, for SAS users also known as "." and for database programmers known as NULL, should be used wherever specified in this protocol. However, sometimes it might be more correct just to omit the record if no value has been recorded, test has not been performed etc.
Quality Assurance
In order to verify the consistency and correctness of the data, QA checks are made before the data is used. The QA checks applying to a given table are listed at the bottom of its article. Additionally, a list of all QA checks, including checks which do not directly apply to the HICDEP tables themselves, is available here.