Wiki

HICDEP 1.120

This article describes the current draft version HICDEP 1.120. For a more detailed version history, please refer to the ChangeLog.

General data format
The table pages referenced in the overview describe the specific tables' structure in detail and present a list of suggested codes, both standard and human readable.

All codes apart from trivial no, yes or unknown codes are presented as lookup tables, the usage of these are described in the the article Considerations for using the format to create a database.

Along with the basic structure described in each “Core fields” section, additional fields containing additional or more specific data are described in the “Additional fields” sections. These fields were taken from several cohort collaborations but with the required changes that were needed for the specific data structures. This is presented to the reader to show that the core structure is not a fixed proposal but rather a basic structure, which can be altered by adding fields.

Issues regarding duplicates are discussed in Considerations For Data Management.

Overview of data tables

Table	Content
tblART	holds type of antiretroviral drug, start and stop dates and reason for stopping
tblART_MUM	Antiretroviral Medication of mother in cases where mother is not enrolled in cohort
tblBAS	holds basic information such as demographics, basic clinical information and date of AIDS diagnosis
tblCANC	holds type and date of diagnosis of cancer
tblCENTER	holds information about the Center (e.g. geographical localisation, type of clinic) where the patient is receiving HIV care
tblCEP	holds type and date of clinical events and procedures including serious non-AIDS conditions. Former known as tblAE (adverse event).
tblDELIVERY_CHILD	holds delivery information related to the child
tblDELIVERY_MUM	holds delivery information related to the mother
tblDIS	holds type and date of CDC-C diseases and malignancies.
tblLAB	holds type, date, value and unit of laboratory tests.
tblLAB_BP	holds date, diastolic and systolic values and unit of blood pressure measurements.
tblLAB_CD4	holds date and value of CD4 measurements.
tblLAB_HCV	holds information on HCV genotype and subtype
tblLAB_RES	holds background information on the resistance test, laboratory, library, kit, software and type of test
tblLAB_RES_LVL_1	holds nucleoside sequence for the PRO and RT sequences
tblLAB_RES_LVL_2	holds mutations and positions of these.
tblLAB_RES_LVL_3	holds resistance result in relation to antiretroviral drug.
tblLAB_RNA	holds date, value, detection limit and type of viral assay.
tblLAB_VIRO	holds test results for viro-/serological tests (hepatitis etc.)
tblLTFU	All submitted participants should figure in the table. Participants who are NOT lost to follow and who have died, should be noted as DROP_Y=0 Holds data on death and drop-out
tblMED	holds type, start and stop dates for other medication/treatments.
tblNEWBORN	holds information related to newborns
tblNEWBORN_ABNORM	holds information related to abnormalities of newborns
tblOVERLAP	holds information on the patient's participation in other cohorts
tblPREG	holds general pregnancy-related information
tblPREG_OBS	holds information on obstetrical problems during pregnancy
tblPREG_OUT	describes the pregnancy outcome
tblPROGRAM	holds information on the program with which the center is associated
tblREFILL	holds information on prescription refills
tblSAMPLES	holds information on the storage of blood, urine and other biological samples
tblVIS	holds visit related information such as weight, wasting, smoking, occupational status etc.
tblVIS_SUBS	holds information on patients use of substances like alcohol, cigarettes and drugs

Diagram

Diagram based on HICDEP 1.120

Structure of data

From flat files towards a normalized structure

The data collected in HIV collaborations is presented on the following pages in a set of data files/tables. Typically data would be put into one data file that would hold one line/record per patient where each field is represented as a separate column in that dataset. Often a dataset could contain more than 3000 columns of data.

The implication of going from thousands of fields to fewer fields means that data is in fact transposed from the flat format into the normalised format.

Example of a flat file structure:

PATIENT	ALAT_D	ALAT_V	ALAT_U	ASAT_D	ASAT_V	ASAT_U
999999	01-01-2000	15	U/l	01-01-2000	12	U/l

The normalised structure would then be like this:

PATIENT	TYPE_ID	LAB_DATE	LAB_VAL	LAB_UNIT
999999	1	01-01-2000	15	U/l
999999	2	01-01-2000	12	U/l

The type of measurement is identified through the TYPE_ID field. Here 1 codes for ALAT and 2 codes for ASAT:

Code	Description
1	ALAT - Alanin-Aminotransferase
2	ASAT - Aspartat aminotransferase

Technical considerations
To enable a normalised structure that minimises the number of columns dramatically, the one file solution must be broken into several minor tables. These breakdowns are driven by the different data characteristics.

Each table has a basic structure that includes the patient identifier, a code that represents e.g. drug, adverse event or laboratory test performed. Along with this combination values like date, result, unit etc are present for each record.

A record for a laboratory measurement would include:

Patient identifier
Measurement type identifier
Measured value
Unit of value
Date of determination

A record for usage of an antiretroviral drug would include:

Patient identifier
Drug identifier
Start date for usage
End date for usage
Reason for discontinuation

These issues imply that a set of distinct tables must be generated based on the “nature” of the data. Since laboratory, medication and event data both cannot and should not be mixed at least 3 tables must be designed. Additionally there are other types of information that need their own domains: background information on the patient (height, birth date etc.), visit related data (weight, blood pressure, wasting etc.), and resistance testing (the latter requires more consideration due to the diversity of data present).

In this protocol further separation of data into different tables are presented. These separations are not only based on the rules for the relational model and normalisation, but they are “culturally” related.

For example: antiretroviral treatment medication is kept in one table and other medication in another table; CD4 cell measurements and HIV-RNA measurements are put into separate tables, that are also different from the general laboratory table. These separations are done simply because data in these tables are of distinct importance in analysis and often are gathered more frequently and with more attention than other variables.

Coding Conventions Date codes
Although it is best to have precise dates in the format of YEAR-MONTH-DAY ISO standard, it might be that some cohorts are limited to representing date data at the level of the month only, or information kept on the patient in the charts only defines dates to the month and in some cases only to the year. To solve this a set of date codes are presented here.

Day unknown
In this case the date should be coded as the 15th of the month – so that 1999-12-?? becomes 1999-12-15. This enables the date to be no more than 15 days away from the actual date.

Month and day unknown
Best approach to this is to apply something similar, as with unknown dates, this would then mean that 1999-??-?? becomes 1999-07-01.

Year unknown
If the year is unknown but the presence of the date value is needed as in case of opportunistic infections or adverse events (see later in this document) a fictive date should be used that couldn’t be mistaken with an actual date. An unknown year should be coded as 1911-11-11.

Specification of precision
An alternative to the above is to apply an additional field to each date field for which it is known that there might be issues regarding the precision of the dates. The field is then used to specify at which degree of the day, month or year the date is precise:

Code	Precision of date
<	Before this date
D	Exact to the date
M	Exact to the month
Y	Exact to the year
>	After this date
U	Unknown

The Data Transfer Protocol for IeDEA Multi-regional Collaborations suggests that such a precision annotation variable should have the same name as the date variable with the additional Suffix _A. For example, the precision of BIRTH_D will be annotated using an optional variable with the name BIRTH_D_A.

ICD-10 codes
The coding system is the official standard for coding of diseases, however there is a wide set of “homebrew” codes used within the HIV field in data coding in general, often it’s a 3 or 4 letter code which is an abbreviation for the AIDS defining disease. ICD-10 doesn’t have single codes that represent all single CDC-C events and as a consequence of this a list of 3 to 4 letter codes is the recommended way of coding for all CDC stage C events

ICD-10 codes are however the recommended for codes AE’s since it would become impossible for this protocol to maintain a complete list of all possible AE’s. ICD-10 is also recommended for causes of death.

ATC codes
ATC is a hierarchical structure for coding medication. The structure and hierarchy are best explained with an example of how a drug code is defined. Here it is on Indinavir:

J - ANTIINFECTIVES FOR SYSTEMIC USE (1st level, anatomical main group)
J05 - ANTIVIRALS FOR SYSTEMIC USE (2nd level, therapeutic subgroup)
J05A - DIRECT ACTING ANTIVIRALS (3rd level, pharmacological subgroup)
J05AE - Protease inhibitors (4th level, chemical subgroup)
J05AE02 - Indinavir (5th level, chemical substance)

This hierarchy has some benefits as will be explained later, but one of its limitations is that it’s impossible to “read” the code compared to the widely used 3 letter mnemonic codes for antiretroviral drugs.

Example:

Drug	Code	ATC code
Indinavir	IDV	J05AE02

The difference is that the IDV code is easily readable, where the ATC code is not; going from a flat file structure to a normalised structure the human readable aspect becomes increasingly important. In the flat file format the column names and the possibility of labels makes data more or less readable; in the normalised format only the coding can help. Because of this the 3 letter codes are being presented in this document. However it must be stressed that usage of the ATC coding should be used to diminish the risk of several homebrew and non-compatible coding schemes.

Currently however, the ATC scheme does not provide sufficient detail on the specific drugs, there is e.g. no official way to code Saquinavir as hard or soft gel. Thus a slight alteration to the set of codes will be presented in the sections of the ART and MED tables. The alterations are designed to extend the existing structure of ATC.

One of the benefits is that the structure of ATC allows easier statistics on e.g. drug classes

J05AE Protease inhibitors
- J05AE01 Saquinavir
- J05AE02 Indinavir
- J05AE03 Ritonavir
- J05AE04 Nelfinavir
- J05AE05 Amprenavir
- J05AE06 Lopinavir
J05AF Nucleoside and nucleotide reverse transcriptase inhibitors
- J05AF01 Zidovudine
- J05AF02 Didanosine
- J05AF03 Zalcitabine
- J05AF04 Stavudine
- J05AF05 Lamivudine
- J05AF06 Abacavir
- J05AF07 Tenofovir disoproxil
- J05AF30 Combinations1
J05AG Non-nucleoside reverse transcriptase inhibitors
- J05AG01 Nevirapine
- J05AG02 Delavirdine
- J05AG03 Efavirenz

Although the codes might be harder to read they provide grouping mechanisms in the way they are coded. Interested readers should go to the ATC Website to learn about the structure of ATC. A fully updated database of ATC codes and DDD (Defined Daily Dosage) is available for querying.

Other codes
It is often necessary to code for values like “Yes”, “No” and “Unknown”, this document suggests that the following codes should be used:

Code	Description
0	No
1	Yes
9	Unknown

Unknown should be used to identify the difference between a value that has not yet been collected (Empty) and a value that cannot be collected (Unknown). Empty values should be required where Unknown values make little sense to keep querying for a value.

Example – weight:
Depending on the unit in which weight is measured, a different value for Unknown should be applied. In the case of kg the “Unknown” code should be 999 and not just 9 or 99, the last two could be actual values.

Blank values, for SAS users also known as "." and for database programmers known as NULL, should be used wherever specified in this protocol. However, sometimes it might be more correct just to omit the record if no value has been recorded, test has not been performed etc.

Quality Assurance
In order to verify the consistency and correctness of the data, QA checks are made before the data is used. The QA checks applying to a given table are listed at the bottom of its article. Additionally, a list of all QA checks, including checks which do not directly apply to the HICDEP tables themselves, is available here.