Simulated Seattle Flu Study data for developing visualization, analysis, and modeling tools.

seattleFluSimulatedData

This directory contains simulated Seattle Flu Study linelist data and auspice-ready simulated phylogenetic trees. These data are intended to be realistic enough to aid in the design of visualization and modeling tools, but they fall short of full realism in the following ways:

  • Only a subset of the metadata from our various questionnaires is represented, and tools prototyped on these data should retain the flexibility to handle other fields.
  • We don't know what the epidemiology of ILI in the Seattle Metro area really looks like, so the assumptions built into this process may often be far from accurate.
  • Some assumptions in the underlying models change between diseases not for epidemiological reasons, but so that we can later test our ability to infer the differences from the observed data with downstream modeling.

The data were generated with a set of transmission and genomic epidemiology simulation tools being developed at the Institute for Disease Modeling. The simulation engine produces complete transmission trees and downsampled phylogenetic trees for SIR diseases with complex timeseries dynamics, complex metadata, and realistic sampling frames. To request access to the complete simulated truth datasets behind these observed datasets, the model configuration files, or the simulation engine (which is a prototype in private development for now), please contact @famulare.

Pathogens

Simulated observations are available for:

  • A_H1N1pdm (h1n1pdm)
  • A_H3N2 (h3n2)
  • B_Victoria (vic)
  • B_Yamagata (yam)
  • RSVa (rsva)
  • otherILI (This is a symptomatic category that includes people who tested negative for all screened pathogens or tested positive for one of the dozen+ non-notifiable pathogens in the multiplex assay.)

Observed participant metadata

simulatedParticipantDatabase.csv is a linelist of 11,550 sampled individuals who were positive for one of the above pathogens. The following metadata is available for each simulated participant:

  • samplingLocation (hospital, clinic, kiosk, daycare, shelter, atHome)
  • timeInfected (decimal date in years)
  • sex (assigned at birth)
  • age (in years)
  • fluShot (received influenza vaccine in last year?)
  • hasFever (yes, no)
  • hasCough (yes, no)
  • hasMyalgia (yes, no)
  • census tract (GEOID)
  • CRA_NAME (colloquial neighborhoods in Seattle)
  • NEIGHBORHOOD_DISTRICT_NAME (larger neighborhood regions in Seattle)
  • PUMA5CE (Federal Public Use Microdata Areas)
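
As a starting point for tools that consume the linelist, here is a minimal sketch of reading it and converting timeInfected from a decimal year to a calendar date. It assumes the CSV header uses the field names listed above (timeInfected, age); that has not been checked against the file. The dateInfected key and both function names are hypothetical additions for illustration, and the two-row sample is made up.

```python
import csv
import io
from datetime import datetime


def decimal_year_to_date(decimal_year):
    """Convert a decimal year such as 2019.5 to a calendar date."""
    year = int(decimal_year)
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    # Scale the fractional part by the length of that particular year,
    # so leap years are handled without special cases.
    return (start + (decimal_year - year) * (end - start)).date()


def read_linelist(handle):
    """Read linelist rows from an open text handle into a list of dicts.

    Assumes columns named timeInfected and age exist (an assumption about
    the file header). All other columns are kept as strings.
    """
    rows = []
    for row in csv.DictReader(handle):
        row["timeInfected"] = float(row["timeInfected"])
        row["age"] = float(row["age"])
        # dateInfected is a derived, hypothetical convenience field.
        row["dateInfected"] = decimal_year_to_date(row["timeInfected"])
        rows.append(row)
    return rows


# Made-up sample rows, only to show the call pattern.
sample = io.StringIO(
    "samplingLocation,timeInfected,age,hasFever\n"
    "clinic,2019.5,34,yes\n"
    "kiosk,2019.0,7,no\n"
)
for participant in read_linelist(sample):
    print(participant["samplingLocation"], participant["dateInfected"])
```

For the real file, pass open("simulatedParticipantDatabase.csv", newline="") in place of the in-memory sample.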

Future iterations may also include:

  • additional samplingLocations (SEA-TAC)
  • recent travel
  • work census tract
  • what do you want to see?

Sampling frame

The domain of the simulations is all of King County, WA, but the observed participants constitute a non-random sample of the population, biased toward residency within the Seattle city boundary.

Participants are labeled by the category of their sampling location, one of:

  • hospital
  • clinic
  • kiosk
  • daycare
  • shelter
  • atHome

and observations are drawn from each category within specified geographic catchments and time windows. Catchments overlap but differ in detail among categories. All preferentially sample from neighborhoods within the Seattle city boundary, but retain some probability of capturing people who live in the rest of King County.

As with the real study, analysis must be mindful of the fact that the observed population is not representative of the total population that participates in transmission.
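
One simple way to see how much the sampling bias can matter is post-stratification: reweight each participant by the ratio of their stratum's population share to its sample share. The sketch below is an illustration only, not a method used by the study. The stratum labels and the population shares are hypothetical placeholders, and the reweighting only corrects for the strata you choose to measure.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_share):
    """Weight per stratum = population share / sample share."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return {s: population_share[s] / (counts[s] / n) for s in counts}


def weighted_mean(values, strata, weights):
    """Mean of values, with each observation weighted by its stratum weight."""
    numerator = sum(weights[s] * v for v, s in zip(values, strata))
    denominator = sum(weights[s] for s in strata)
    return numerator / denominator


# Hypothetical toy sample: 8 participants from inside the city boundary,
# 2 from the rest of the county. 1 = has fever, 0 = no fever.
strata = ["seattle"] * 8 + ["restOfCounty"] * 2
has_fever = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# Placeholder shares, NOT real population figures.
population_share = {"seattle": 0.5, "restOfCounty": 0.5}

weights = poststratification_weights(strata, population_share)
raw = sum(has_fever) / len(has_fever)
adjusted = weighted_mean(has_fever, strata, weights)
print(f"unweighted: {raw:.2f}, post-stratified: {adjusted:.2f}")
```

In practice the strata could be built from fields such as PUMA5CE or the census tract, with population shares taken from census data.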

Auspice-ready derived data

For H1N1pdm, H3N2, B_Victoria, B_Yamagata, and RSVa, time-scaled and genetic-divergence phylogenetic trees are also provided, with associated metadata for both the observed tips and (most of) the ancestral nodes.

  • The time-scaled phylogenetic tree is a direct downsampling of the true transmission tree, with some zero-distance nodes added to ensure a bifurcating tree structure, and some artificial deep root nodes to join disconnected components of the simulated transmission trees for each pathogen.
  • The nucleotide divergence phylogenetic tree is just based on a Poisson mutation rate given the time tree. I don't plan to simulate any sequences in the near future.
  • The tree metadata is inherited from the true transmission tree for both sampled tips and ancestral nodes. The only nodes without metadata are deep ancestral nodes that don't really exist in the simulations, but are the results of joining independent transmission lineages with a hack-a-coalescent model. In real data, these nodes would root outside the Seattle Metro area as the origins of independent importations to the region. (If auspice needs metadata for these nodes to work, I can cook some up.)
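
To make the "Poisson mutation rate given the time tree" idea concrete, here is a minimal sketch of one way such a divergence tree could be derived: draw a Poisson number of mutations on each branch, with mean proportional to the branch duration. This is an assumption about the general technique, not the simulation engine's actual code. The tree representation, function names, substitution rate, and genome length are all hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's algorithm (fine for small means)."""
    threshold = math.exp(-mean)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def divergence_branches(time_branches, subs_per_site_per_year, genome_length, seed=0):
    """Map {node: (parent, years)} to {node: (parent, substitutions per site)}."""
    rng = random.Random(seed)
    result = {}
    for node, (parent, years) in time_branches.items():
        # Expected number of mutations on this branch across the whole genome.
        expected = subs_per_site_per_year * genome_length * years
        mutations = sample_poisson(expected, rng)
        result[node] = (parent, mutations / genome_length)
    return result


# Hypothetical three-tip time tree; branch lengths are in years.
# "internal" has a zero-length branch, like the zero-distance nodes
# added to keep the tree bifurcating.
time_tree = {
    "tipA": ("internal", 0.25),
    "tipB": ("internal", 0.40),
    "internal": ("root", 0.0),
    "tipC": ("root", 0.60),
}

# Placeholder rate and genome length, chosen only for illustration.
for node, (parent, div) in divergence_branches(time_tree, 0.003, 1000).items():
    print(node, parent, div)
```

Zero-length branches always receive zero divergence under this scheme, since a Poisson draw with mean zero is zero.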

In addition to the "observed" variables described above, the auspice-ready metadata files also include some variables that are unobservable in the real world, but are useful for thinking about model outputs that can (or will someday) be inferred from real data:

  • importation event that each infection descends from
  • beta (individual-level infectiousness)
  • infectiousDuration (individual-level)
  • SIR-cluster membership. (This is an internal simulation detail. The transmission model has simple two-scale dynamics where clusters with SIR mass-action dynamics spawn new clusters stochastically. The result is a simulation with non-mass-action large scale dynamics, where transmission mixes mass-action dynamics and bottlenecks between mass-action pools.)
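
The two-scale idea behind SIR-cluster membership can be illustrated with a toy model: each cluster runs mass-action SIR dynamics internally, and infectious individuals occasionally seed a new cluster. The sketch below is a simplified stand-in written for this README, not the private simulation engine. Every parameter name and value is hypothetical, and here beta is a single shared rate rather than the individual-level quantity in the metadata.

```python
import math
import random


def binomial(n, p, rng):
    """Number of successes in n independent trials with probability p."""
    return sum(rng.random() < p for _ in range(n))


def simulate_clusters(beta, gamma, spawn_prob, cluster_size, steps,
                      max_clusters=50, seed=1):
    """Chain-binomial SIR within clusters, plus stochastic cluster spawning.

    Returns the final list of clusters and the total number of infectious
    individuals after each time step.
    """
    rng = random.Random(seed)
    clusters = [{"S": cluster_size - 1, "I": 1, "R": 0}]
    history = []
    p_recover = 1.0 - math.exp(-gamma)
    for _step in range(steps):
        spawned = []
        for c in clusters:
            n = c["S"] + c["I"] + c["R"]
            # Mass-action force of infection inside this cluster only.
            p_infect = 1.0 - math.exp(-beta * c["I"] / n)
            infections = binomial(c["S"], p_infect, rng)
            recoveries = binomial(c["I"], p_recover, rng)
            seeds = binomial(c["I"], spawn_prob, rng)
            c["S"] -= infections
            c["I"] += infections - recoveries
            c["R"] += recoveries
            # Each seeding event is a bottleneck: one infection starts a
            # fresh, fully susceptible pool.
            for _seed_event in range(seeds):
                if len(clusters) + len(spawned) < max_clusters:
                    spawned.append({"S": cluster_size - 1, "I": 1, "R": 0})
        clusters.extend(spawned)
        history.append(sum(c["I"] for c in clusters))
    return clusters, history


final_clusters, infectious_over_time = simulate_clusters(
    beta=0.6, gamma=0.25, spawn_prob=0.02, cluster_size=40, steps=60
)
print(len(final_clusters), "clusters; peak infectious:", max(infectious_over_time))
```

Because new clusters start from a single infection, the aggregate epidemic curve is a sum of staggered within-cluster outbreaks rather than one well-mixed SIR curve.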