

Semantic data model of the set of common data elements for rare disease registration

This model aims to make rare disease registry data Interoperable (the 'I' in FAIR). Version 2.0. License: CC0.

Here, we present a semantic data model of the set of common data elements for rare disease registration recommended by the European Commission Joint Research Centre. There are 16 data elements: ‘Pseudonym’, ‘Date of Birth’, ‘Sex’, ‘Patient’s status’, ‘Date of death’, ‘First contact with specialised centre’, ‘Age at onset’, ‘Age at diagnosis’, ‘Diagnosis of the rare disease’, ‘Genetic diagnosis’, ‘Undiagnosed case’, ‘Agreement to be contacted for research purposes’, ‘Consent to the reuse of data’, ‘Biological sample’, ‘Link to a biobank’, ‘Classification of functioning/disability’.

The semantic data model is presented below in 11 modules that together describe the 16 data elements. Central to each module is the 'person'; each module then assigns different characteristics to that person.

Feedback

Your feedback is more than welcome; it will help us improve our semantic data model. Please use GitHub issues to provide your feedback. If you are new to GitHub, please see this video to learn more about GitHub issues.

Module: person

Link to ShEx (shape expression)

Module: Pseudonym

Notes: Pseudonym is modelled as a string.

Module: Personal information

Notes:

  • For 'Date of birth': We initially used the predicate foaf:birthday, but it describes the date of birth as mm-dd, whereas we want mm-dd-yyyy. We therefore created a new instance describing ‘date of birth’ using the NCIT class ‘Birth Date’, and attached an xsd:dateTime value to this instance with the predicate sio:has_value.
  • For 'Sex': We consider the patient's sex at birth. The possible values are 'Female', 'Male', 'Undetermined', or 'Foetus' (unknown). 'Undetermined' means that the sex cannot be determined, e.g. due to sexual disorders.
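
The two notes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model itself: the prefixed names `ncit:Birth_Date` and `ex:has_attribute` are placeholders (the real NCIT identifier and the predicate linking the person to the instance are not given here), and triples are represented as plain string tuples rather than with an RDF library.

```python
from datetime import date

# Placeholder prefixed names; the real IRIs must be looked up in NCIT and SIO.
BIRTH_DATE_CLASS = "ncit:Birth_Date"   # assumed name for the NCIT 'Birth Date' class
HAS_VALUE = "sio:has_value"            # predicate named in the notes
HAS_ATTRIBUTE = "ex:has_attribute"     # hypothetical link from person to the instance

# The four values listed for 'Sex'.
ALLOWED_SEX = {"Female", "Male", "Undetermined", "Foetus"}


def birth_date_triples(person, birth):
    """Return (subject, predicate, object) triples for one person's date of birth."""
    node = f"{person}_birthdate"
    # Midnight is used as the time part because only the date is known.
    literal = f'"{birth.isoformat()}T00:00:00"^^xsd:dateTime'
    return [
        (person, HAS_ATTRIBUTE, node),
        (node, "rdf:type", BIRTH_DATE_CLASS),
        (node, HAS_VALUE, literal),
    ]


def check_sex(value):
    """Reject values outside the four options listed for 'Sex'."""
    if value not in ALLOWED_SEX:
        raise ValueError(f"unknown sex value: {value!r}")
    return value
```

For example, `birth_date_triples("ex:person1", date(1990, 5, 17))` yields three triples: the person pointing at a birth-date instance, the instance typed with the NCIT class, and the instance carrying the full year-month-day value.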

Module: Patient status

Notes:

  • 'Patient’s status' (Patient alive or dead). The possible values are 'Alive', 'Dead', 'Lost in follow-up', or 'Opted-out'. If patients opt out, their data can no longer be collected. It is unclear whether data already collected (before the patient opted out) can be kept or needs to be removed. 'Opted-out' should also be linked in some way to 7.1. (Agreement to be contacted for research purposes) and 7.2. (Consent to the reuse of data) under 'Module: Consent', but it is unclear what this link should look like.
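
As a rough sketch of how the four status values and the one stated opt-out rule might be handled in code (the class and function names are hypothetical, and the open questions about already-collected data and the link to the Consent module are deliberately left unmodelled):

```python
from enum import Enum


class PatientStatus(Enum):
    """The four values listed for 'Patient's status'."""
    ALIVE = "Alive"
    DEAD = "Dead"
    LOST_IN_FOLLOW_UP = "Lost in follow-up"
    OPTED_OUT = "Opted-out"


def parse_status(label):
    """Map a registry label to the enum; raises ValueError for unknown labels."""
    return PatientStatus(label)


def may_collect_new_data(status):
    """Encode only the rule stated in the notes: no new data after opt-out.

    Whether the other statuses restrict collection is not specified,
    so they are not restricted here.
    """
    return status is not PatientStatus.OPTED_OUT
```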

Module: Care pathway

Notes:

  • 'First contact with specialised centre' (Date of first contact with specialised centre): Date (dd/mm/yyyy). This refers to the specialised centre in the ERN. Only specialised centres can fill in the database. Open question: what if people move or go to different specialised centres?

Module: Disease history and diagnosis

Notes:

  • 'Age at onset' (Age at which symptoms/signs first appeared). These can be 'Antenatal', 'At birth', 'Date (dd/mm/yyyy)', or 'Undetermined'.
  • 'Diagnosis of the rare disease' (Diagnosis retained by the specialised centre): Orpha code (strongly recommended - see link) / Alpha code / ICD-9 code / ICD-9 CM code / ICD-10 code. For a non-rare disease where no Orpha code is available, the Disease Ontology should be used.
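
Because 'Age at onset' mixes three categorical values with a dd/mm/yyyy date, a registry client has to tell the two apart before mapping the value into the model. A minimal sketch, assuming a hypothetical helper `parse_age_at_onset` and an ISO 8601 output format for dates (neither is prescribed by the model):

```python
from datetime import datetime

# The three non-date options listed for 'Age at onset'.
CATEGORICAL_ONSET = {"Antenatal", "At birth", "Undetermined"}


def parse_age_at_onset(raw):
    """Return ("category", label) or ("date", ISO 8601 string)."""
    value = raw.strip()
    if value in CATEGORICAL_ONSET:
        return ("category", value)
    # strptime raises ValueError when the text cannot be read as day/month/year
    parsed = datetime.strptime(value, "%d/%m/%Y").date()
    return ("date", parsed.isoformat())
```

Returning a tagged pair keeps the categorical and date cases distinct, so downstream code can attach either a coded value or a typed date literal.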

Module: Genetic diagnosis

Module: Undiagnosed

Notes: We have added ‘no diagnosis’ as an option to the model for cases where no diagnosis is available. In addition, phenotype and genotype can be collected for these patients.

Module: Consent

Module: Biobanks

Module: Disability

Notes:

  • For the 'Classification of functioning/disability' (Patient’s disability profile according to the International Classification of Functioning and Disability (ICF)): Disability profile / Score. Open question: who calculates this score (the doctor or the patient)?
  • Here, we have only modelled the final score and not all the subscores.
  • We defined an instance describing the ‘Activity’ of collecting the score.
  • TODO: review the domain and range of score-wasGeneratedBy-questionnaire.
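
The domain-range question in the last note can be illustrated with a small check. This sketch assumes the predicate is PROV-O's `prov:wasGeneratedBy`, whose domain is `prov:Entity` and whose range is `prov:Activity`. The node names and their typing are hypothetical. Treating domain and range as hard constraints is also a simplification, since in RDFS they drive inference rather than validation.

```python
# Domain and range of two PROV-O properties.
PROPERTY_SIGNATURES = {
    "prov:wasGeneratedBy": ("prov:Entity", "prov:Activity"),
    "prov:used": ("prov:Activity", "prov:Entity"),
}

# Assumed typing of the three nodes in the Disability module (hypothetical names).
NODE_TYPES = {
    "ex:score": "prov:Entity",
    "ex:questionnaire": "prov:Entity",
    "ex:score_collection": "prov:Activity",
}


def violations(triples):
    """Return the triples whose subject or object type breaks domain/range."""
    bad = []
    for s, p, o in triples:
        domain, range_ = PROPERTY_SIGNATURES[p]
        if NODE_TYPES[s] != domain or NODE_TYPES[o] != range_:
            bad.append((s, p, o))
    return bad


# Linking the score directly to the questionnaire.
direct = [("ex:score", "prov:wasGeneratedBy", "ex:questionnaire")]

# Routing the link through the 'Activity' of collecting the score.
via_activity = [
    ("ex:score", "prov:wasGeneratedBy", "ex:score_collection"),
    ("ex:score_collection", "prov:used", "ex:questionnaire"),
]
```

Under these assumptions, `violations(direct)` flags the direct link because a questionnaire typed as an entity falls outside the range of `prov:wasGeneratedBy`, while `violations(via_activity)` returns an empty list.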

General notes

  • The relationships defined in this semantic data model are all based on assigning characteristics to the patient/person. If multiple forms are filled in per patient/person, we need to include clinical visits, including dates, etc.
  • In this data model we have used SNOMED, which is licensed. Which considerations do we need to take into account for reusability across multiple countries?