CSV file for patients.csv has extra column (between

Question

CSV file for patients.csv has extra column (between

Closed this issue 9 months ago · 6 comments

What happened?

I ran run_synthea with default options to create test patient, and the patient CSV file contained this row wiht an extra value:

d12cae9-54b6-a5b1-4481-b49ae3b31e1a,1984-12-30,,999-56-7435,S99992971,X66900476X,Mr.,César846,Bernal586,,,M,white,hispanic,M,Ponce  Puerto Rico  PR,936 Cummings Burg Apt 64,Haverhill,Massachusetts,Essex County,25009,01835,42.769551061336124,-71.04918584824105,9201.74,732602.04,9388

There is an extra value "25009" between County and ZIP code. The code was not found in the FHIR nor CCDA data.

On a different run with different seed, the field was empty but still present, so it seems reproducible.

Environment

- OS: Linux 22.04
- Java: openjdk version "11.0.20" 2023-07-18

Relevant log output

I made some small changes to the config file and am using the dev version, build locally

# Starting with a properties file because it requires no additional dependencies

exporter.baseDirectory = ./output/
exporter.use_uuid_filenames = false
exporter.subfolders_by_id_substring = false
# exporters that use XML or JSON can enable or disable 'pretty printing'
exporter.pretty_print = true
# number of years of history to keep in exported records, anything older than this may be filtered out
# set years_of_history = 0 to skip filtering altogether and keep the entire history
exporter.years_of_history = 10
# split records allows patients to have one record per provider organization
exporter.split_records = false
exporter.split_records.duplicate_data = false
exporter.metadata.export = true
exporter.ccda.export = true
exporter.fhir.export = true
exporter.fhir_stu3.export = false
exporter.fhir_dstu2.export = false
exporter.fhir.use_shr_extensions = false
exporter.fhir.use_us_core_ig = true
exporter.fhir.us_core_version = 5.0.1
exporter.fhir.transaction_bundle = true
# using bulk_data=true will ignore exporter.pretty_print
exporter.fhir.bulk_data = false
# included_ and excluded_resources list out the resource types to include/exclude in the csv exporters.
# only one of these may be set at a time, if both are set then both will be ignored.
# if neither is set, then all resource types will be included.
# note the Patient and Encounter resources will always be included, even if specifically listed as excluded here
exporter.fhir.included_resources =
exporter.fhir.excluded_resources =
exporter.groups.fhir.export = false
exporter.hospital.fhir.export = true
exporter.hospital.fhir_stu3.export = false
exporter.hospital.fhir_dstu2.export = false
exporter.practitioner.fhir.export = true
exporter.practitioner.fhir_stu3.export = false
exporter.practitioner.fhir_dstu2.export = false
exporter.encoding = UTF-8
exporter.json.export = false
exporter.json.include_module_history = false
exporter.csv.export = true
# if exporter.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.csv.append_mode = true
# if exporter.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.csv.folder_per_run = false
# included_files and excluded_files list out the files to include/exclude in the csv exporter
# only one of these may be set at a time, if both are set then both will be ignored
# if neither is set, then all files will be included
# see list of files at: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
# include filenames separated with a comma, ex: patients.csv,procedures.csv,medications.csv
# NOTE: the csv exporter does not actively delete files, so if Run 1 you included a file, then Run 2 you exclude that file, the version from Run 1 will still be present
exporter.csv.included_files =
exporter.csv.excluded_files = patient_expenses.csv

exporter.cpcds.export = false
exporter.cpcds.append_mode = false
exporter.cpcds.folder_per_run = false
exporter.cpcds.single_payer = false

exporter.bfd.export = false
exporter.bfd.require_code_maps = true
exporter.bfd.export_missing_codes = true
exporter.bfd.bene_id_start = -1000000
exporter.bfd.clm_id_start = -100000000
exporter.bfd.clm_grp_id_start = -100000000
exporter.bfd.pde_id_start = -100000000
exporter.bfd.fi_doc_cntl_num_start = -100000000
exporter.bfd.carr_clm_cntl_num_start = -100000000
exporter.bfd.mbi_start = 1S00-E00-AA00
exporter.bfd.hicn_start = T01000000A
exporter.bfd.partc_contract_start = Y0001
exporter.bfd.partc_contract_count = 10
exporter.bfd.plan_benefit_package_start = 800
exporter.bfd.plan_benefit_package_count = 5
exporter.bfd.partd_contract_start = Z0001
exporter.bfd.partd_contract_count = 10
exporter.bfd.clia_labs_start = 00A0000000
exporter.bfd.clia_labs_count = 10
exporter.bfd.cutoff_date=20140529

exporter.cdw.export = false
exporter.text.export = false
exporter.text.per_encounter_export = false
exporter.clinical_note.export = false

# parameters for symptoms export
exporter.symptoms.csv.export = false
# selection mode of conditions or symptom export: 0 = conditions according to  exporter.years_of_history. other values = all conditions (entire history)
exporter.symptoms.mode = 0
# if exporter.symptoms.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.symptoms.csv.append_mode = false
# if exporter.symptoms.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.symptoms.csv.folder_per_run = false
exporter.symptoms.text.export = false

# enable searching for custom exporter implementations
exporter.enable_custom_exporters = true

# the number of patients to generate, by default
# this can be overridden by passing a different value to the Generator constructor
generate.default_population = 1

# the number of threads to use for the generator, set the value to -1 to match the number of
# available processors (as per Runtime.getRuntime().availableProcessors())
# defaults to -1 if not specified
generate.thread_pool_size = -1

generate.log_patients.detail = simple
# options are "none", "simple", or "detailed" (without quotes). defaults to simple if another value is used
# none = print nothing to the console during generation
# simple = print patient names once they are generated.
# detailed = print patient names, atributes, vital signs, etc..  May slow down processing

generate.timestep = 604800000
# time is in ms
# 1000 * 60 * 60 * 24 * 7 = 604800000

# default demographics is every city in the US
generate.demographics.default_file = geography/demographics.csv
generate.geography.zipcodes.default_file = geography/zipcodes.csv
generate.geography.country_code = US
generate.geography.timezones.default_file = geography/timezones.csv
generate.geography.foreign.birthplace.default_file = geography/foreign_birthplace.json
generate.geography.sdoh.default_file = geography/sdoh.csv

# Lookup Table Folder location
generate.lookup_tables = modules/lookup_tables/

# Set to true if you want every patient to be dead.
generate.only_dead_patients = false
# Set to true if you want every patient to be alive.
generate.only_alive_patients = false
# If both only_dead_patients and only_alive_patients are set to true,
# It they will both default back to false

# if criteria are provided, (for example, only_dead_patients, only_alive_patients, or a "patient keep module" with -k flag)
# this is the maximum number of times synthea will loop over a single slot attempting to produce a matching patient.
# after this many failed attempts, it will throw an exception.
# set this to 0 to allow for unlimited attempts (but be aware of the possibility that it will never complete!)
generate.max_attempts_to_keep_patient = 1000

# if true, tracks and prints out details of transition tables for each module upon completion
# note that this may significantly slow down processing, and is intended primarily for debugging
generate.track_detailed_transition_metrics = false

# If true, person names have numbers appended to them to make them more obviously fake
generate.append_numbers_to_person_names = true

# Probability of each person having a middle name. 0 is zero, 1.0 is 100% chance.
generate.middle_names = 0.80

# if true, the entire population will use veteran prevalence data
generate.veteran_population_override = false

# these should add up to 1.0
# weighting and categories are inspired by the following but there are no specific hard numbers to point to
# http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1694190/pdf/amjph00543-0042.pdf
# http://www.ncbi.nlm.nih.gov/pubmed/8122813
generate.demographics.socioeconomic.weights.income = 0.2
generate.demographics.socioeconomic.weights.education = 0.7
generate.demographics.socioeconomic.weights.occupation = 0.1

generate.demographics.socioeconomic.score.low = 0.0
generate.demographics.socioeconomic.score.middle = 0.25
generate.demographics.socioeconomic.score.high = 0.66

generate.demographics.socioeconomic.education.less_than_hs.min = 0.0
generate.demographics.socioeconomic.education.less_than_hs.max = 0.5
generate.demographics.socioeconomic.education.hs_degree.min = 0.1
generate.demographics.socioeconomic.education.hs_degree.max = 0.75
generate.demographics.socioeconomic.education.some_college.min = 0.3
generate.demographics.socioeconomic.education.some_college.max = 0.85
generate.demographics.socioeconomic.education.bs_degree.min = 0.5
generate.demographics.socioeconomic.education.bs_degree.max = 1.0

# The average family size in the US is 3.13. The 2010 FPL for a 3-person household is $18310. Tuned it to $17550 for realistic medicaid/ACA enrollments.
generate.demographics.socioeconomic.income.poverty = 17550
generate.demographics.socioeconomic.income.high = 75000

generate.birthweights.default_file = birthweights.csv
generate.birthweights.logging = false

# in Massachusetts, the individual insurance mandate became law in 2006
# in the US, the Affordable Care Act become law in 2010,
# and individual and employer mandates took effect in 2014.
# mandate.year will determine when individuals with an occupation score above mandate.occupation
# receive employer mandated insurance (aka "private" insurance).
# prior to mandate.year, anyone with income greater than the annual cost of an insurance plan
# will purchase the insurance.
generate.insurance.mandate.year = 2006
generate.insurance.mandate.occupation = 0.2

# Defines what percent of insurance premiums are covered by employers, when employer-covered.
# According to [https://www.kff.org/report-section/ehbs-2021-summary-of-findings/],
# the average employee premium contribution is 0.17 and employers pay 0.83.
generate.insurance.employer_coverage = 0.83

# Default Costs, to be used for pricing something that we don't have a specific price for
# -- $500 for procedures is completely invented
generate.costs.default_procedure_cost = 500.00
# -- $255 for medications - also invented
generate.costs.default_medication_cost = 255.00
# -- Encounters billed using avg prices from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096340/
# -- Adjustments for initial or subsequent hospital visit and level/complexity/time of encounter
# -- not included. Assume initial, low complexity encounter (Tables 4 & 6)
generate.costs.default_encounter_cost = 125.00
# -- https://www.nytimes.com/2014/07/03/health/Vaccine-Costs-Soaring-Paying-Till-It-Hurts.html
# -- currently all vaccines cost $136.
generate.costs.default_immunization_cost = 136.00
generate.costs.default_lab_cost = 100.00
# -- assumes device costs are included in procedure cost, if not add to costs/devices.csv
generate.costs.default_device_cost = 0.00
# -- assumes supply costs are included in procedure cost, if not add to costs/supplies.csv
generate.costs.default_supply_cost = 0.00

# Providers
generate.providers.hospitals.default_file = providers/hospitals.csv
generate.providers.longterm.default_file = providers/longterm.csv
generate.providers.nursing.default_file = providers/nursing.csv
generate.providers.rehab.default_file = providers/rehab.csv
generate.providers.hospice.default_file = providers/hospice.csv
generate.providers.dialysis.default_file = providers/dialysis.csv
generate.providers.homehealth.default_file = providers/home_health_agencies.csv
generate.providers.veterans.default_file = providers/va_facilities.csv
generate.providers.urgentcare.default_file = providers/urgent_care_facilities.csv
generate.providers.primarycare.default_file = providers/primary_care_facilities.csv
generate.providers.ihs.hospitals.default_file = providers/ihs_facilities.csv
generate.providers.ihs.primarycare.default_file = providers/ihs_centers.csv

# Provider selection behavior
# How patients select a provider organization:
#  nearest - select the closest provider. See generate.providers.maximum_search_distance
#  random  - select randomly.
#  network - select a random provider in your insurance network. same as random except it changes every time the patient switches insurance provider.
#  medicare - select the nearest provider that can bill Medicare. If no Medicare provider is found, it defaults back to "nearest".
generate.providers.selection_behavior = nearest

# if a provider cannot be found for a certain type of service,
# this will default to the nearest hospital.
generate.providers.default_to_hospital_on_failure = true

# minimum number of providers linked per patient
# if this number is not met it re-runs the simulation
generate.providers.minimum = 1

# maximum distance to look for a provider for a given patient, in km
# set to 10 degrees lat/lon to support the model that veterans only seek care at VA facilities
generate.providers.maximum_search_distance = 1000

# Payers
generate.payers.insurance_companies.default_file = payers/insurance_companies.csv
generate.payers.insurance_plans.default_file = payers/insurance_plans.csv
generate.payers.insurance_plans.eligibilities_file = payers/insurance_eligibilities.csv
generate.payers.insurance_companies.medicare = Medicare
generate.payers.insurance_companies.medicaid = Medicaid
generate.payers.insurance_companies.dual_eligible = Dual Eligible
# The percentage of a person's income that they are willing to spend on health insurance premiums.
generate.payers.insurance_plans.income_premium_ratio = 0.034
# The chance of rejection
# Plan selection behavior
# How patients select a plan:
#  best_rates - select plans with best rates for person's existing conditions and medical needs
#  random  - select plans randomly.
#  priority  - select plans based on the priority level defined in the insurance plans file.
generate.payers.selection_behavior = priority

# Payer adjustment behavior
# How payers adjust claims:
#  none - the payer reimburses each claim by the full amount.
#  fixed - the payer adjusts each claim by a fixed rate (set by adjustment_rate)
#  random  - the payer adjusts each claim by a random rate (between zero and adjustment_rate).
generate.payers.adjustment_behavior = none
# Payer adjustment rate should be between zero and one (0.00 - 1.00), where 0.05 is 5%.
generate.payers.adjustment_rate = 0.10

# Experimental feature. Patients will miss care if true, but side-effects of missing that care
# are not handled. Additionally, the path the disease module might take may no longer make sense.
# It might assume things occurred that haven't actually happened it. Use with care.
generate.payers.loss_of_care = false

# Add a FHIR terminology service URL to enable the use of ValueSet URIs within code definitions.
# generate.terminology_service_url = https://r4.ontoserver.csiro.au/fhir

# Quit Smoking
lifecycle.quit_smoking.baseline = 0.01
lifecycle.quit_smoking.timestep_delta = -0.01
lifecycle.quit_smoking.smoking_duration_factor_per_year = 1.0

# Quit Alcoholism
lifecycle.quit_alcoholism.baseline = 0.001
lifecycle.quit_alcoholism.timestep_delta = -0.001
lifecycle.quit_alcoholism.alcoholism_duration_factor_per_year = 1.0

# Adherence
lifecycle.adherence.baseline = 0.05

# set this to true to enable randomized "death by natural causes"
# highly recommended if "only_dead_patients" is true
lifecycle.death_by_natural_causes = false

# set this to enable "death by loss of care" or missed care,
# e.g. not covered by insurance or otherwise unaffordable.
# only functional if "generate.payers.loss_of_care" is also true.
lifecycle.death_by_loss_of_care = false

# Use physiology simulations to generate some VitalSigns
physiology.generators.enabled = false

# Allow physiology module states to be executed
# If false, all Physiology state objects will immediately redirect to the state defined in
# the alt_direct_transition field
physiology.state.enabled = false

# set to true to introduce errors in height, weight and BMI observations for people
# under 20 years old
growtherrors = false

Answer 1 · 2023-08-29T18:26:14.000Z

I found a similar extra column for PAYERS (Column 5 seems extra) and PROVIDERS (The last column seems extra).

Answer 2 · 2023-08-29T18:30:15.000Z

Thank you for the report... we'll look into it.

Do you still have the console output?

I'm interested in the end of that output... For example,

Running with options:
Population: 5
Seed: 1693320322058
Provider Seed:1693320322058
Reference Time: 1693320322058
Location: Massachusetts
Min Age: 0
Max Age: 140
3 -- Ezra452 Ferry570 (18 y/o M) Concord, Massachusetts  (25625)
1 -- Johanne551 Gislason620 (24 y/o F) Randolph, Massachusetts  (34253)
2 -- Marta91 Villareal516 (25 y/o F) Springfield, Massachusetts  (36679)
4 -- Stefany238 Connelly992 (27 y/o F) Boston, Massachusetts  (38650)
5 -- Dolores502 Alejandra902 Alanis890 (62 y/o F) Westborough, Massachusetts DECEASED (120079)
5 -- Natalia964 Esperanza675 Padilla483 (68 y/o F) Westborough, Massachusetts  (96113)
Records: total=6, alive=5, dead=1
RNG=5
Clinician RNG=5643

Answer 3 · 2023-08-29T18:31:50.000Z

Or, the relevant ./output/metadata/*.json file.

Answer 4 · 2023-08-29T18:36:17.000Z

I have this one for the retry:

Running with options:
Population: 1
Seed: 101
Provider Seed:1693332487663
Reference Time: 1693332487663
Location: Massachusetts
Min Age: 0
Max Age: 140
1 -- Garrett899 VonRueden376 (57 y/o M) Norfolk, Massachusetts  (84735)
Records: total=1, alive=1, dead=0
RNG=1
Clinician RNG=5644

with the CSV file being this:

45746e91-d759-c73c-b033-56587d814cca,1966-05-04,,999-60-4143,S99962688,X73514809X,Mr.,Garrett899,VonRueden376,,,M,white,nonhispanic,M,Franklin  Massachusetts  US,419 Rogahn Alley,Norfolk,Massachusetts,Norfolk County,,00000,42.1328679377805,-71.29048599718199,55929.45,451474.93,749961

In that case, the zip code is missing, completely and not the one in the output (84735)

Answer 5 · 2023-08-29T18:37:23.000Z

And this is the metadata file

{
  "runID": "984705fd-dc7f-4ee8-99da-65a9a95da183",
  "seed": 101,
  "clinicianSeed": 1693332487663,
  "referenceTime": "20230829",
  "endTime": "20230829",
  "version": "master-branch-latest\n",
  "patientCount": 1,
  "providerCount": 1327,
  "payerCount": 9,
  "javaVersion": "11.0.20",
  "generate.thread_pool_size": -1,
  "generatorThreads": 20,
  "runStartTime": "2023-08-29T18:08:07Z",
  "runTimeInSeconds": 9,
  "exporter.years_of_history": "10",
  "state": "Massachusetts",
  "modules": "*"
}

Answer 6 · 2023-08-30T12:33:59.000Z

This is another case of the data dictionary being wrong.

The extra field is the FIPS county code, e.g., https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt

I updated the data dictionary to reflect the field.