|Version 1 (modified by admin, 3 years ago) (diff)|
- General data format
- Structure of data
- Coding Conventions
- Quality Assurance
This article describes the HICDEP version 1.70 which was released on the 8th of November 2013. For a more detailed version history, please refer to the ChangeLog.
General data format
The table pages referenced in the overview describe the specific tables' structure in detail and present a list of suggested codes, both standard and human readable.
All codes apart from trivial no, yes or unknown codes are presented as lookup tables, the usage of these are described in the the article Considerations for using the format to create a database.
Along with the basic structure described in each “Core fields” section, additional fields containing additional or more specific data are described in the “Additional fields” sections. These fields were taken from several cohort collaborations but with the required changes that were needed for the specific data structures. This is presented to the reader to show that the core structure is not a fixed proposal but rather a basic structure, which can be altered by adding fields.
Issues regarding duplicates are discussed in Considerations For Data Management.
Overview of data tables
|tblCEP||holds type and date of clinical events and procedures including serious non-AIDS conditions. Former known as tblAE (adverse event).|
|tblART||holds type of antiretroviral drug, start and stop dates and reason for stopping|
|tblBAS||holds basic information such as demographics, basic clinical information and date of AIDS diagnosis|
|tblDELIVERY_CHILD||holds delivery information related to the child|
|tblDELIVERY_MUM||holds delivery information related to the mother|
|tblDIS||holds type and date of CDC-C diseases.|
|tblLAB||holds type, date, value and unit of laboratory tests.|
|tblLAB_BP||holds date, diastolic and systolic values and unit of blood pressure measurements.|
|tblLAB_CD4||holds date and value of CD4 measurements.|
|tblLAB_RNA||holds date, value, detection limit and type of viral assay.|
|tblLAB_RES||holds background information on the resistance test, laboratory, library, kit, software and type of test|
|tblLAB_RES_LVL_1||holds nucleoside sequence for the PRO and RT sequences|
|tblLAB_RES_LVL_2||holds mutations and positions of these.|
|tblLAB_RES_LVL_3||holds resistance result in relation to antiretroviral drug.|
|tblLAB_VIRO||holds test results for viro-/serological tests (hepatitis etc.)|
|tblLTFU||holds data on death and drop-out|
|tblMED||holds type, start and stop dates for other medication/treatments.|
|tblNEWBORN||holds information related to newborns|
|tblNEWBORN_ABNORM||holds information related to abnormalities of newborns|
|tblOVERLAP||holds information on the patient's participation in other cohorts|
|tblPREG||holds general pregnancy-related information|
|tblPREG_OBS||holds information on obstetrical problems during pregnancy|
|tblPREG_OUT||describes the pregnancy outcome|
|tblREFILL||holds information on prescription refills|
|tblSAMPLES||holds information on the storage of blood, urine and other biological samples|
|tblVIS||holds visit related information such as weight, wasting, smoking, occupational status etc.|
Structure of data
From flat files towards a normalized structure
The data collected in HIV collaborations is presented on the following pages in a set of data files/tables. Typically data would be put into one data file that would hold one line/record per patient where each field is represented as a separate column in that dataset. Often a dataset could contain more than 3000 columns of data.
The implication of going from thousands of fields to fewer fields means that data is in fact transposed from the flat format into the normalised format.
Example of a flat file structure:
The normalised structure would then be like this:
The type of measurement is identified through the TYPE_ID field. Here 1 codes for ALAT and 2 codes for ASAT:
|1||ALAT - Alanin-Aminotransferase|
|2||ASAT - Aspartat aminotransferase|
To enable a normalised structure that minimises the number of columns dramatically, the one file solution must be broken into several minor tables. These breakdowns are driven by the different data characteristics.
Each table has a basic structure that includes the patient identifier, a code that represents e.g. drug, adverse event or laboratory test performed. Along with this combination values like date, result, unit etc are present for each record.
A record for a laboratory measurement would include:
- Patient identifier
- Measurement type identifier
- Measured value
- Unit of value
- Date of determination
A record for usage of an antiretroviral drug would include:
- Patient identifier
- Drug identifier
- Start date for usage
- End date for usage
- Reason for discontinuation
These issues imply that a set of distinct tables must be generated based on the “nature” of the data. Since laboratory, medication and event data both cannot and should not be mixed at least 3 tables must be designed. Additionally there are other types of information that need their own domains: background information on the patient (height, birth date etc.), visit related data (weight, blood pressure, wasting etc.), and resistance testing (the latter requires more consideration due to the diversity of data present).
In this protocol further separation of data into different tables are presented. These separations are not only based on the rules for the relational model and normalisation, but they are “culturally” related.
For example: antiretroviral treatment medication is kept in one table and other medication in another table; CD4 cell measurements and HIV-RNA measurements are put into separate tables, that are also different from the general laboratory table. These separations are done simply because data in these tables are of distinct importance in analysis and often are gathered more frequently and with more attention than other variables.
Although it is best to have precise dates in the format of YEAR-MONTH-DAY ISO standard, it might be that some cohorts are limited to representing date data at the level of the month only, or information kept on the patient in the charts only defines dates to the month and in some cases only to the year. To solve this a set of date codes are presented here.
In this case the date should be coded as the 15th of the month – so that 1999-12-?? becomes 1999-12-15. This enables the date to be no more than 15 days away from the actual date.
Month and day unknown
Best approach to this is to apply something similar, as with unknown dates, this would then mean that 1999-??-?? becomes 1999-07-01.
If the year is unknown but the presence of the date value is needed as in case of opportunistic infections or adverse events (see later in this document) a fictive date should be used that couldn’t be mistaken with an actual date. An unknown year should be coded as 1911-11-11.
Specification of precision
An alternative to the above is to apply an additional field to each date field for which it is known that there might be issues regarding the precision of the dates. The field is then used to specify at which degree of the day, month or year the date is precise:
|Code||Precision of date|
|<||Before this date|
|D||Exact to the date|
|M||Exact to the month|
|Y||Exact to the year|
|>||After this date|
The coding system is the official standard for coding of diseases, however there is a wide set of “homebrew” codes used within the HIV field in data coding in general, often it’s a 3 or 4 letter codes that is an abbreviation for the AIDS defining disease. ICD-10 doesn’t have single codes that represent all single CDC-C events and as a consequence of this a list of 3 to 4 letter codes is the recommended way of coding for all CDC stage C events
ICD-10 codes are however the recommended for codes AE’s since it would become impossible for this protocol to maintain a complete list of all possible AE’s. ICD-10 is also recommended for causes of death.
ATC is a hierarchical structure for coding medication. The structure and hierarchy are best explained with an example of how a drug code is defined. Here it is on Indinavir:
- ANTIINFECTIVES FOR SYSTEMIC USE (1st level, anatomical main group)
- ANTIVIRALS FOR SYSTEMIC USE (2nd level, therapeutic subgroup)
- DIRECT ACTING ANTIVIRALS (3rd level, pharmacological subgroup)
- Protease inhibitors (4th level, chemical subgroup)
- Indinavir (5th level, chemical substance)
This hierarchy has some benefits as will be explained later, but one of its limitations is that it’s impossible to “read” the code compared to the widely used 3 letter mnemonic codes for antiretroviral drugs.
The difference is that the IDV code is easily readable, where the ATC code is not; going from a flat file structure to a normalised structure the human readable aspect becomes increasingly important. In the flat file format the column names and the possibility of labels makes data more or less readable; in the normalised format only the coding can help. Because of this the 3 letter codes are being presented in this document. However it must be stressed that usage of the ATC coding should be used to diminish the risk of several homebrew and non-compatible coding schemes.
Currently however, the ATC scheme does not provide sufficient detail on the specific drugs, there is e.g. no official way to code Saquinavir as hard or soft gel. Thus a slight alteration to the set of codes will be presented in the sections of the ART and MED tables. The alterations are designed to extend the existing structure of ATC.
One of the benefits is that the structure of ATC allows easier statistics on e.g. drug classes
- J05AE Protease inhibitors
- J05AE01 Saquinavir
- J05AE02 Indinavir
- J05AE03 Ritonavir
- J05AE04 Nelfinavir
- J05AE05 Amprenavir
- J05AE06 Lopinavir
- J05AF Nucleoside and nucleotide reverse transcriptase inhibitors
- J05AF01 Zidovudine
- J05AF02 Didanosine
- J05AF03 Zalcitabine
- J05AF04 Stavudine
- J05AF05 Lamivudine
- J05AF06 Abacavir
- J05AF07 Tenofovir disoproxil
- J05AF30 Combinations1
- J05AG Non-nucleoside reverse transcriptase inhibitors
- J05AG01 Nevirapine
- J05AG02 Delavirdine
- J05AG03 Efavirenz
Although the codes might be harder to read they provide grouping mechanisms in the way they are coded. Interested readers should go to the ATC Website to learn about the structure of ATC. A fully updated database of ATC codes and DDD (Defined Daily Dosage) is available for querying.
It is often necessary to code for values like “Yes”, “No” and “Unknown”, this document suggests that the following codes should be used:
Unknown should be used to identify the difference between a value that has not yet been collected (Empty) and a value that cannot be collected (Unknown). Empty values should be required where Unknown values make little sense to keep querying for a value.
Example – weight:
Depending on the unit in which weight is measured, a different value for Unknown should be applied. In the case of kg the “Unknown” code should be 999 and not just 9 or 99, the last two could be actual values.
Blank values, for SAS users also known as "." and for database programmers known as NULL, should be used wherever specified in this protocol. However, sometimes it might be more correct just to omit the record if no value has been recorded, test has not been performed etc.
In order to verify the consistency and correctness of the data, QA checks are made before the data is used. The QA checks applying to a given table are listed at the bottom of its article. Additionally, a list of all QA checks, including checks which do not directly apply to the HICDEP tables themselves, is available here.