The objective of these scripts is to provide diagnostic tools of the temporal quality of biological data in a clinical data warehouse. These scripts are based on: "WHAT CAN MILLIONS OF LABORATORY VALUES TELL US ABOUT THE TEMPORAL ASPECT OF DATA QUALITY? STUDY OF DATA SPANNING 17 YEARS IN A CLINICAL DATA WAREHOUSE". Comput Methods Programs Biomed. 2018 Dec 29. pii: S0169-2607(18)30708-9. doi: 10.1016/j.cmpb.2018.12.030. https://www.ncbi.nlm.nih.gov/pubmed/30612785 Abstract OBJECTIVE: To identify common temporal evolution profiles in biological data and propose a semi-automated method to these patterns in a clinical data warehouse (CDW). MATERIALS AND METHODS: We leveraged the CDW of the European Hospital Georges Pompidou and tracked the evolution of 192 biological parameters over a period of 17 years (for 445,000 + patients, and 131 million laboratory test results) RESULTS: We identified three common profiles of evolution: discretization, breakpoints, and trends. We developed computational and statistical methods to identify these profiles in the CDW. Overall, of the 192 observed biological parameters (87,814,136 values), 135 presented at least one evolution. We identified breakpoints in 30 distinct parameters, discretizations in 32, and trends in 79. DISCUSSION AND CONCLUSION: our method allowed the identification of several temporal events in the data. Considering the distribution over time of these events, we identified probable causes for the observed profiles: instruments or software upgrades and changes in computation formulas. We evaluated the potential impact for data reuse. Finally, we formulated recommendations to enable safe use and sharing of biological data collection to limit the impact of data evolution in retrospective and federated studies (e.g. the annotation of laboratory parameters presenting breakpoints or trends). # The data input format Two examples files have to be put in a directory called 'Example' in the working directory: + Count.csv + BIO.20170821101046.csv These files helps you to adapt your SQL query for the proposed data profiling algorithm. The file 'Count.csv' includes the example of the list of exams as an input. The file 'BIO.20170821101046.csv' includes the require input format for lab tests data. # How to use scripts without real data ? To run for the first time the data profiling algorithm, we recommend to try it on simulated data: + Put all the R files in a R project directory. + Launch the master script called 'main.R' Report and derived files will be create with the simulated data. The simulated data are generated with the script 'datasimu.R'. This script generates the differents patterns described in the article. # How to use scripts with real data ? After lauching programs with simulated data. You will show the data directory where to put your real data. You have also to generate Count.csv which contains the index of lab tests. If you have an I2B2 data warehouse you can also adapt the datareal.R file. # Landscape of R files There are 13 R files : + init.R: install and load packages, create directories, source fun_ R files + main.R: master script call all script + datasimu.R: simulation of data to test the programs. This script generates the differents patterns described in the article. + datareal.R: extract files from i2b2 clinical data warehouse in the right format + movingquantiles.R: compute the moving quantiles + fun_movingquantiles.R: auxiliary file associated to the movingquantiles.R file + missingdata.R: detect missing data (not described in the article, previous version) + fun_missingdata.R: auxiliary file associated to the missingdata.R file (not described in the article, previous version) + discretization.R: detect discritization + fun_discretization.R: auxiliary file associated to the discretization.R file + breakpoints.R: detect breakpoints + fun_breakpoints.R: auxiliary file associated to the breakpoints.R file + trends.R: performe regressions + fun_trends.R: auxiliary file associated to the trends.R file
vlooten/BioQuality
What can millions of laboratory values tell us about the temporal aspect of data quality? study of data spanning 17 years in a clinical data warehouse. Comput Methods Programs Biomed. 2018 Dec 29. 10.1016/j.cmpb.2018.12.030
RGPL-3.0