

Fake Data from Scratch

Andrew Zimolzak, MD, MMSc

Generate structured medical data from "first principles."


What should we do when an algorithm, app, data model, etc. needs some data to practice on? One approach is to take real private health records and deidentify them, but we don't like this approach because it's fraught with problems (see below).

Code explanation


./make_npa_city_state.sh    # needs to be run one time only

Outputs two CSV files: patients.csv (demographics, histology, and genes) and labs.csv (lab data).

Input files:

Current features

  • Given the reference range for a lab, code will generate values that are usually within this range (for healthy patients) or less likely to be within this range (for sick patients).

  • Code can handle equations that model relationships between labs. For example, hematocrit = hemoglobin * 3. It will also add some "fuzziness" to any specified equation.

  • Code generates repeated labs for the same patient, at random intervals, where "random" is defined in a realistic way. In detail, the distribution of intervals between labs follows the exponential distribution (somewhat like real life). This is another way of saying that the number of lab measurements per unit time follows the Poisson distribution.

  • Code understands that today's lab is affected by the previous lab and how much time has passed since then. In detail, the trend of labs over time is modeled as Brownian motion.

  • Patient name. First and last names reflect the true distribution of names in the US. (But not the joint distribution of first+last, so you can get some ethnically unlikely first+last name pairs like Ahmed Krzyzewski.)

  • Fake contact details such as 315 Pine St, Davidsonville MD 21035. 410-555-0978. ZIP code of residence simulates the real population distribution of the US (all citizens, not just veterans). ZIP is decoded to a real city name. Area code is often correct to the level of city; always correct by state.

  • Which labs: CBC (currently 8 numbers), BMP (about 8 numbers), Calcium, WBC differential.

  • Age, gender. These two variables simulate the joint probability distribution of age & gender in US veterans (all comers, not limited to lung cancer patients, so you get some 29 year olds).

  • Limits on Brownian motion so it can't produce absurd or negative numbers (K of 25, Hct of 109, etc.).

  • Clearly denotes the demographics as fake.

  • Genes. Somewhat realistic distributions of lung cancer mutations based on published literature. Also plausible number of mutations per tumor.

  • Specific diagnosis (really only histology).

  • Stage of cancer including rough idea of T/N/M.
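
The lab-related features above (reference ranges, exponential intervals, bounded Brownian motion, and fuzzy equations between labs) can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names, the choice of a normal distribution, the `spread`, `volatility`, `floor`, and `ceiling` parameters, and the hemoglobin numbers are all assumptions made for the example.

```python
import random


def sample_in_range(rng, low, high, spread=1.0):
    """Draw one value from a normal distribution centered on the reference range.

    With spread=1.0 the range spans +/- 1.96 SD, so about 95% of draws land
    inside it ("healthy"). A larger spread makes out-of-range values more
    likely ("sick"). Both choices are assumptions for this sketch.
    """
    mean = (low + high) / 2
    sd = (high - low) / (2 * 1.96) * spread
    return rng.gauss(mean, sd)


def clamp(x, lowest, highest):
    """Keep x inside [lowest, highest] so the walk cannot reach absurd values."""
    return max(lowest, min(highest, x))


def simulate_series(rng, low, high, n_labs, floor, ceiling,
                    mean_gap_days=30.0, volatility=0.1, spread=1.0):
    """Return a list of (day, value) pairs for one lab in one patient."""
    day = 0.0
    value = clamp(sample_in_range(rng, low, high, spread), floor, ceiling)
    series = [(day, value)]
    for _ in range(n_labs - 1):
        # Exponentially distributed gap between consecutive labs.
        gap = rng.expovariate(1.0 / mean_gap_days)
        day += gap
        # Brownian increment: standard deviation grows with sqrt(elapsed time),
        # so the new value depends on the previous one and on the gap.
        step = rng.gauss(0.0, volatility * gap ** 0.5)
        value = clamp(value + step, floor, ceiling)
        series.append((day, value))
    return series


def derive_hematocrit(rng, hemoglobin, fuzz_sd=0.5):
    """Apply hematocrit = hemoglobin * 3, plus a little Gaussian fuzz."""
    return hemoglobin * 3 + rng.gauss(0.0, fuzz_sd)


rng = random.Random(42)  # fixed seed so the fake data is reproducible
# Illustrative hemoglobin range and hard limits (g/dL); not taken from the repo.
hgb_series = simulate_series(rng, low=13.5, high=17.5, n_labs=5,
                             floor=3.0, ceiling=22.0)
labs = [(round(day, 1), round(hgb, 1), round(derive_hematocrit(rng, hgb), 1))
        for day, hgb in hgb_series]
for row in labs:
    print(row)  # (day, hemoglobin, hematocrit)
```

Clamping is only one way to bound the walk; reflecting at the limits or pulling values back toward the mean would be alternatives with different statistical behavior.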

To do

  • Need to do some unit testing of quantile2text() function. Still fails with index out of range error intermittently.

  • ever received platinum containing chemo

  • failed first line treatment

  • date of diagnosis

  • Makefile or tup, especially for downloading.

  • Facility (hospital) name, pathologist name, PCP, oncologist, specimen number.
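
For the unit-testing item above: one common cause of intermittent index errors in a quantile-to-text lookup is floating-point rounding, where the cumulative probabilities end slightly below 1.0 and a quantile near 1 falls past the last bucket. Whether this is what actually happens in quantile2text() is only a guess. The sketch below uses hypothetical names (`build_cdf`, `quantile_to_text`) and made-up weights to show a lookup that clamps the index, along with the edge cases a unit test would want to cover.

```python
import bisect
import itertools


def build_cdf(weighted_items):
    """Turn [(text, weight), ...] into texts plus cumulative probabilities."""
    texts = [text for text, _ in weighted_items]
    total = sum(weight for _, weight in weighted_items)
    cumulative = list(itertools.accumulate(
        weight / total for _, weight in weighted_items))
    return texts, cumulative


def quantile_to_text(q, texts, cumulative):
    """Return the text whose cumulative-probability bucket contains q."""
    index = bisect.bisect_left(cumulative, q)
    # If rounding leaves cumulative[-1] a hair below 1.0, a q near 1 would
    # index one past the end. Clamping to the last bucket avoids that.
    return texts[min(index, len(texts) - 1)]


# Made-up weights for illustration only (not census figures).
texts, cumulative = build_cdf([("Smith", 3), ("Garcia", 1)])

# Edge cases worth pinning down in a unit test.
assert quantile_to_text(0.0, texts, cumulative) == "Smith"
assert quantile_to_text(0.76, texts, cumulative) == "Garcia"
assert quantile_to_text(1.0, texts, cumulative) == "Garcia"
```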

Lower priority to do

  • Rename the "distribution" csv files so they are recognizable as such.

  • which vendor ran your genotype

  • Names of meds you've received for cancer. What types of meds (oral, IV, targeted, traditional). For oral: fill dates & quantities. Other CA treatments (surg, rads)? [fancier version of "ever received platinum containing chemo"]

  • Social security number

  • Response of cancer to treatment (progressing | stable | remitting). [fancier version of "failed first line treatment"]

  • More dates: recruitment, upcoming appointments with oncology, rad-onc, chemotherapy.

  • vital signs

  • What level of consent for Precision Oncology.

  • Era of military service.

  • consider splitting out one lab per line. (id=0, date=2014-04-04, lab=hgb, val=10.2)

  • make it messy in deeper ways (messy can mean more than just out of range results).

  • curl to automate download of source data files like from census.gov

  • refactor better names for my classes and modules
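
The "one lab per line" idea above could look something like this. It is a sketch under assumptions: the wide column names (`id`, `date`, `hgb`, `hct`) and the helper `wide_to_long` are hypothetical, and the real labs.csv layout may differ. The first output row matches the example given in the list (id=0, date=2014-04-04, lab=hgb, val=10.2); the hct value is made up.

```python
import csv
import io


def wide_to_long(wide_rows, id_field="id", date_field="date"):
    """Turn one-row-per-draw records into one-row-per-lab records."""
    long_rows = []
    for row in wide_rows:
        for field, value in row.items():
            # Skip the key columns and any lab that was not measured.
            if field in (id_field, date_field) or value == "":
                continue
            long_rows.append({
                "id": row[id_field],
                "date": row[date_field],
                "lab": field,
                "val": value,
            })
    return long_rows


# Tiny in-memory stand-in for a wide-format labs file.
wide_csv = "id,date,hgb,hct\n0,2014-04-04,10.2,30.9\n"
wide_rows = list(csv.DictReader(io.StringIO(wide_csv)))
long_rows = wide_to_long(wide_rows)

writer = csv.DictWriter(io.StringIO(), fieldnames=["id", "date", "lab", "val"])
writer.writeheader()
writer.writerows(long_rows)
```

The long layout makes it easy to add a new lab without changing the file's columns, at the cost of more rows.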

What does 'fraught' mean?

Here is my argument for why we may want fake data from scratch rather than deidentified real data.

First: 45 CFR 164.514 describes two ways in which covered entities may classify information as not individually identifiable. (A.) A person with knowledge of statistical means for "rendering information not individually identifiable" must determine that the reidentification risk "is very small." (B.) The identifiers that 45 CFR specifies must be removed.

Second: Erika Holmberg’s notes on data security say "MAVERIC has not previously certified datasets as de-identified."

Third: Because of the first two points, I am not sure that it is enough to do small tinkering with dates and/or lab values.

Fourth: Data from scratch is what Vick, Ned, and I thought we would try to start with, because of all these regulatory and statistical issues. Ultimately it depends what works for Cytolon.