Perl code from a project for the UCSF HDFCCC
Archive: 125519-StudyData-04-10-2018.zip
OnCore File Name | Cleaned, Filtered TSV for Input to Perl Script |
---|---|
Demographics.xlsx | demographics_filtered.tsv |
Blood Labs V2.xlsx | blood_labs_v2_new_filtered.tsv |
ECOG-Weight V3.xlsx | patient_ECOG_scale_performance_status_and_weight_new_filtered.tsv |
Followup.xlsx | Not used |
GU-Disease Assessment V3.xlsx | gu_disease_assessment_new_v3_filtered.tsv |
Past Tissue V1.xlsx | Not used |
Prostate Diagnosis V4.xlsx | prostate_diagnosis_v4_new_filtered.tsv |
SU2C Biopsy V3.xlsx | su2c_biopsy_v3_new_filtered.tsv |
SU2C Pr Ca Tx Sumry V2.xlsx | su2c_pr_ca_tx_sumry_v2_new_filtered.tsv |
SU2C Prior TX V3.xlsx | su2c_prior_tx_v3_new_filtered.tsv |
SU2C Specimen V1.xlsx | su2c_specimen_v1_extra_progression_records.tsv |
SU2C Subsequent Treatment V1.xlsx | su2c_subsequent_treatment_v1_new_filtered.tsv |
SU2C Tissue Report V1.xlsx | Not used |
One additional input file was manually curated and maintained separately from any project database; it was provided only after the code for parsing and processing the files above had reached the beta-testing stage:
WCDT clinical database version 4-6-2018 - additional data points for Marc Perry.xlsx
which was transformed into: wcdt_clinical_additional_data.tsv
My basic workflow (based on my previous bioinformatics data-wrangling experience) was to:
- Open each Excel file with the OpenOffice/LibreOffice Calc spreadsheet program (a personal preference; any similar program should work)
- Export (or Save As…) the main table as a TSV file
- Create a master list of patient_ids based on the Demographics table; in the OnCore tables this identifier is always column 1 and is named 'SEQUENCE_NO_'
- For each table, remove all records (rows, or tuples) that lack a valid identifier in the first column, because there is no way to link that data to a patient record
- I did a rigorous QA/QC check to make sure that:
- There were no "extra" patient_ids in the other tables (i.e., patient_ids missing from the Demographics table); when I found any, I queried the data providers for clarification/correction
- I compiled a list of patient_ids that were LACKING records in the other tables (mainly for planning the logic of the parsing and processing code)
- In general, the first ~13 columns in each of the OnCore tables are named and ordered in the same pattern and contain what might be called high-level project metadata. One of these columns is named 'Not Applicable or Missing'. My Perl code logic generally skips processing any record with an entry in this field, but in the second iteration of coding I typically flagged any such patient_ids at this pre-processing step and either reviewed the record manually or queried the data providers to confirm my understanding
- After converting each table to a TSV, I used CLI tools to count the number of records and the number of columns in that table, to confirm that all of the rows were the same length. This matters because some fields in some tables can contain free-text entries, and in some cases the CRAs cut blocks of text from some other spreadsheet program and pasted them verbatim into OnCore. After export from OnCore these offending fields may contain all manner of extra carriage returns, line feeds, newlines, etc. Most of the tables were "clean" and behaved as expected. A couple had a small number of lines with fields I cleaned up manually, but one table, GU Disease Assessment, had dozens, if not scores, of records where one specific column had been mangled by saving as a TSV. Initially I attempted to clean up the offending fields in this table manually, but eventually grew so frustrated that I devised a search-and-replace step in Calc that cleaned all of them up. Typically I replaced the unwanted characters with a single blank space, or in some cases with a vertical pipe symbol (`|`).
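The row-length check and the patient_id cross-check described above can be sketched with standard CLI tools. The tiny sample files created below are stand-ins for the real OnCore exports (which are not public); real tables would be checked the same way.

```shell
# Tiny stand-in tables (the real OnCore exports are not public).
printf 'SEQUENCE_NO_\tNAME\nP001\tAA\nP002\tBB\n' > demographics_filtered.tsv
printf 'SEQUENCE_NO_\tLAB\nP001\tPSA\nP003\tPSA\n' > other_table.tsv

# Count records (rows) in a table.
wc -l < other_table.tsv

# Print the distinct tab-separated field counts: a single number means
# the table is rectangular; several numbers mean mangled rows.
awk -F'\t' '{ print NF }' other_table.tsv | sort -nu

# List "extra" patient_ids present in another table but missing from
# the Demographics master list (header rows skipped).
cut -f1 demographics_filtered.tsv | tail -n +2 | sort -u > demo_ids.txt
cut -f1 other_table.tsv | tail -n +2 | sort -u > other_ids.txt
comm -13 demo_ids.txt other_ids.txt            # prints: P003
```

`comm` requires sorted input, which the `sort -u` steps guarantee; `comm -13` shows only the lines unique to the second file, i.e. the ids absent from Demographics.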
Perhaps none of this is a big surprise. The most obvious problem/issue is that some of the tables, the Blood Labs table in particular, contain what are essentially empty records, as if placeholders were created but never filled in completely. Some tables also contain essentially duplicate records, as if two different people entered the same information on different dates. There are also copy-paste errors and typos. You have been warned.
This table has one, and only one, record for each patient_id.
You might think that every patient absolutely required a Baseline biopsy specimen in order to enter the study, but this is not the case; any processing you do must check for this, or at least not assume that a baseline biopsy will definitely be present.
Another way of stating this: not every patient enrolled in the study goes on to progress. In some cases the patient may have died from some other cause; in other cases the patient may have withdrawn, or gone off-study. However, MOST of the patients apparently do eventually have a progression event.
Any processing code has to take into account that some patients will have multiple progression events; note that this is not the same as having multiple progression biopsies. In short, your data model has to account for progression_1 (which usually seems to be recorded as plain old 'progression'), progression_2, and progression_3; otherwise some records will not be properly processed.
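One way to hold this in a data model is sketched below. This is an illustration only, not the actual module code; the field names (`progression_events`, `progression_2`, etc.) are assumptions chosen to match the labeling convention described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-patient store; keys are patient_ids (SEQUENCE_NO_ values).
my %patients;

# Record a progression EVENT for a patient.  Events are numbered in the
# order they are seen: the first becomes 'progression', the second
# 'progression_2', the third 'progression_3', and so on.
sub add_progression_event {
    my ( $patient_id, $date ) = @_;
    my $events = $patients{$patient_id}{progression_events} //= [];
    push @{$events}, $date;
    my $n     = scalar @{$events};
    my $label = $n == 1 ? 'progression' : "progression_$n";
    $patients{$patient_id}{$label} = $date;
    return $label;
}

add_progression_event( 'P001', '2016-03-01' );    # -> 'progression'
add_progression_event( 'P001', '2017-07-15' );    # -> 'progression_2'
add_progression_event( 'P001', '2018-01-20' );    # -> 'progression_3'

print "$patients{P001}{progression_2}\n";         # prints 2017-07-15
```

Keeping both the numbered fields and the raw event list means downstream code can either look up a specific timepoint or iterate over all events.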
Select patients had a second biopsy procedure from one (or more) metastatic sites; by definition, this is a "Progression Biopsy".
At first I assumed that patients without a Progression Biopsy had never progressed, but that is not true. While most patients do eventually progress (as mentioned above), only a subset of these patients also have a progression biopsy. One patient had two separate progression biopsies from two separate anatomical sites, on the same date (or on two adjacent dates). Four patients had a second subsequent progression biopsy specimen collected from a second progression event. One of these patients had a third progression biopsy collected.
Virtually every patient with a Baseline Biopsy sample/specimen received subsequent treatment (post-bx-tx), and each treatment could have a ProgressionDate; these ALSO define Progression Events.
During post-biopsy therapy (after the baseline biopsy), patients who progressed would typically acquire a 'Treatment Progression Date', recorded in the Subsequent Treatment table from OnCore. There are many, many more of these progression events than there are progression biopsies. The code in these Perl modules takes these treatment progression dates into account and constructs a timeline for each patient, allowing the proper annotation of, e.g., the PSA tests, Blood Lab tests, and Radiographic Disease Assays.
In other words, there are patients without a progression biopsy in the Biopsy table, and without a post-biopsy-therapy treatment progression date, who nonetheless are annotated in the Specimen table as having 'Progression' specimens. I missed this the first time around, and it was only after digging much deeper into the implied OnCore data model that I discovered it. The Perl script now incorporates that specimen progression data as well.
There are also patients with records in various tables, tagged or annotated as Progressions, that were not discovered using any of the three sources described above. So in total there are four different sources of information that can be used to assign a timepoint to an assay.
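Pulling the four sources together, the per-patient timeline logic might be sketched as below. This is a simplified illustration under assumed names (the source labels and the function are not from the actual script); it merges dates from all sources, de-duplicates them, and sorts them so that timepoints can be assigned in order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge progression dates gathered from the four sources described
# above into one sorted, de-duplicated timeline for a single patient.
# ISO-8601 dates (YYYY-MM-DD) sort correctly as plain strings.
sub build_progression_timeline {
    my (%dates_by_source) = @_;
    my %seen;
    return sort
           grep { !$seen{$_}++ }
           map  { @{ $dates_by_source{$_} // [] } }
           qw( biopsy_table subsequent_treatment specimen_table other_tables );
}

my @timeline = build_progression_timeline(
    biopsy_table         => ['2017-07-15'],
    subsequent_treatment => [ '2016-03-01', '2017-07-15' ],   # overlaps biopsy
    specimen_table       => ['2018-01-20'],
    other_tables         => [],
);

print join( ', ', @timeline ), "\n";   # 2016-03-01, 2017-07-15, 2018-01-20
```

De-duplication matters because the same progression event can surface in more than one table (e.g., a progression biopsy whose date also appears as a treatment progression date).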
This is a Perl script, plus several associated modules, that parses a specific collection of TSV files and then generates a monolithic TSV table that is almost ready to be uploaded/imported into a REDCap project (provided the REDCap project has the correct matching Data Dictionary).
Since REDCap ONLY accepts CSV files, you will need to convert this TSV into a CSV before importing it into your REDCap project. However, some of the data fields migrated from OnCore may contain commas, so the first step I execute after saving the STDOUT from clean_and_parse_oncore.pl is:
```
perl -pe 's/,/\|/g' monolithic_table.tsv > monolithic_table_WITH_NO_COMMAS.tsv
```
Followed by:
```
perl -pe 's/\t/,/g' monolithic_table_WITH_NO_COMMAS.tsv > monolithic_table_WITH_NO_COMMAS.csv
```
You will not be able to run this script without the matching OnCore TSV input files (which are not available in this GitHub repository). For convenience, the names of the input files are hard-coded in the body of the script itself, and the script expects to be in a directory one level beneath where the tables are stored. If you use a different directory structure, or a different set of files, then you will need to edit this section of the script.
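For orientation, a hard-coded input section of this kind typically looks something like the sketch below. The variable names, the `../` prefix, and the early-exit check are all hypothetical; consult the actual script for the real names and layout before editing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# HYPOTHETICAL sketch of a hard-coded input-file section; the real
# script's variable names may differ.  The script is assumed to live one
# directory below the tables, hence the '../' prefix.
my $dir = '../';
my %input_files = (
    demographics => $dir . 'demographics_filtered.tsv',
    blood_labs   => $dir . 'blood_labs_v2_new_filtered.tsv',
);

# Warn early, with a clear message, if an expected table is missing.
for my $name ( sort keys %input_files ) {
    warn "Missing input table for '$name': $input_files{$name}\n"
        unless -e $input_files{$name};
}

print "$input_files{demographics}\n";   # prints ../demographics_filtered.tsv
```

Editing for a different directory structure then means changing only `$dir` and the file names in `%input_files`, rather than hunting through the processing logic.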
The OnCore tables were obtained as a data dump into MS-Excel files.