Considerations for data management

Uniqueness of PATIENT ID

Each table in this document is shown with the PATIENT field as part of the unique identifier for each record, in many cases it might be necessary to specify both PATIENT and COHORT as identifier since the PATIENT id might not be unique across cohorts, this should however be implemented only if needed.

Another solution, which might be worth considering when building the final dataset for the analysis, is to concatenate the patient id and the cohort id into a single unique new patient id.

Duplicate records
Transposition of the flat format data, where there is one record per patient, into a normalised structure that has multiple rows per patient requires that the combination of PATIENT, TYPE_ID and LAB_DATE is unique for each row in the table.

The normalised structure and the relational model does not allow for duplicate records like in the following table to exist:

PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
999999 1 2000-01-01 15 U/l
999999 1 2000-01-01 15 U/l

The typical solution to this is to generate an auto-incremented value – RECORD_ID - for each record in the table:

RECORD_ID PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
1 999999 1 2000-01-01 15 U/l
2 999999 1 2000-01-01 15 U/l

But it can, and should be dealt with in the design of the study. The presence of duplicate records like in the example shown above might not make any sense; if they did occur it then would be on account of a recording error. If, however, it was important to keep records that show measurements taken the same day but at different times, the format of the LAB_DATE shouldn’t be date (YYYY-MM-DD) but instead should be date-time (YYYY-MM-DD hh:mm:ss):

PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
999999 1 2000-01-01 10:00:00 15 U/l
999999 1 2000-01-01 14:00:00 15 U/l

Different and more sophisticated methods to make records unique will be presented later in this document under the definition of the LAB table.