This is a tool to convert the fixed-width records of the National Hospital Discharge Survey into CSV files that can be more easily manipulated. This tool handles all of these steps:
- Download the raw data from the NHDS FTP service
- Convert the fixed-width records to CSV
- Generate a per-day infection rate similar to the output of the IMS data tools, described in more detail below.
- Create an sqlite database that maps the ICD9 codes to their descriptions
The entire tool is executed using make. To generate a CSV file for a
particular year, run:
$ make gen/as_csv/2009.csv
Similarly, to generate data that is similar to the IMS matrices, run:
$ make gen/match_ims/breast-cancer.txt
The list of diseases is located near the top of the Makefile. Each
entry is formatted as ICD9_PREFIX:DISEASE_NAME (e.g. 174:breast-cancer).
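As a rough illustration of how an entry in that list can be read, the sketch below splits an entry into its prefix and name and checks a diagnosis code against the prefix. The function names are hypothetical, the sample diagnosis codes are only illustrative, and plain string-prefix matching is an assumption about how the Makefile rules use ICD9_PREFIX, not a description of the actual implementation.

```python
def parse_disease_entry(entry):
    """Split an 'ICD9_PREFIX:DISEASE_NAME' entry into (prefix, name)."""
    prefix, name = entry.split(":", 1)
    return prefix, name


def matches_disease(diagnosis_code, prefix):
    """Return True when a diagnosis code starts with the ICD9 prefix.

    Assumption: codes are compared as plain strings.
    """
    return diagnosis_code.startswith(prefix)


prefix, name = parse_disease_entry("174:breast-cancer")
print(prefix, name)
print(matches_disease("1749", prefix))
```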
The format of the fixed-width records changes every year, so the
nhds/parse_nhds.py script has to keep track of which years each field
appears in. For example, the 2009 dataset recorded only 7 diagnosis
codes, but 2010 introduced 8 more.
The field formats are specified as follows:
fmt = (
    # (years the field appears in, width, (given name, clean name))
    (("09", "10"), 2, ("Survey Year", "year_end")),
    (("09", "10"), 1, _("Newborn status")),
    (("09", "10"), 1, _("Units for age")),
    # ...
)
Each entry begins with a tuple of the years for which the field is
valid (in this case, all of these fields are in the 2009 and 2010
datasets), followed by the width of the field (so the second field is
1 character long), and then a tuple of (Given Name, Clean Name). The
given name is the name used in the NHDS documentation; the clean name
is easier to parse and work with (typically all lowercase, with spaces
replaced with underscores).
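To make the layout concrete, here is a minimal sketch of how a fixed-width line could be sliced using entries of this shape. It is not the code in nhds/parse_nhds.py: the names clean_name, EXAMPLE_FMT and parse_record are hypothetical, the third field is invented for illustration, and the sketch assumes that the fields valid for a given year sit back to back in the order they are listed.

```python
def clean_name(given):
    """Hypothetical helper: lowercase the name, spaces become underscores."""
    return given.lower().replace(" ", "_")


# Illustrative entries: (years, width, (given name, clean name)).
# The last field is made up to show a column that exists in one year only.
EXAMPLE_FMT = (
    (("09", "10"), 2, ("Survey Year", "year_end")),
    (("09", "10"), 1, ("Newborn status", clean_name("Newborn status"))),
    (("10",), 3, ("Hypothetical field", clean_name("Hypothetical field"))),
)


def parse_record(line, year, fmt=EXAMPLE_FMT):
    """Slice one fixed-width line into a dict keyed by clean name."""
    record = {}
    offset = 0
    for years, width, (_given, clean) in fmt:
        if year not in years:
            # Field is absent from this year's layout, so it uses no columns.
            continue
        record[clean] = line[offset:offset + width].strip()
        offset += width
    return record


print(parse_record("092", "09"))
print(parse_record("102ABC", "10"))
```

Skipping fields that are not valid for the year being read keeps the running offset aligned with that year's layout, which is why each entry carries its own list of years.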
Similarly, the script keeps track of the target format to output. This means that it can restrict the output to the fields present in the 2009 dataset, even if it's reading a 2010 set. When no target is specified, the output includes all available columns.
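The target selection described above might look roughly like the following sketch. The names EXAMPLE_FMT, output_columns and project are hypothetical, the third field is invented for illustration, and the real script may select columns differently.

```python
# Illustrative entries: (years, width, (given name, clean name)).
EXAMPLE_FMT = (
    (("09", "10"), 2, ("Survey Year", "year_end")),
    (("09", "10"), 1, ("Newborn status", "newborn_status")),
    (("10",), 3, ("Hypothetical field", "hypothetical_field")),
)


def output_columns(fmt, target_year=None):
    """List the clean names to emit; no target means every column."""
    return [
        clean
        for years, _width, (_given, clean) in fmt
        if target_year is None or target_year in years
    ]


def project(record, columns):
    """Order a parsed record's values to match the chosen output columns."""
    return [record.get(column, "") for column in columns]


# A record parsed from a 2010 file, written with the 2009 column set.
record_2010 = {"year_end": "10", "newborn_status": "2", "hypothetical_field": "ABC"}
columns_2009 = output_columns(EXAMPLE_FMT, "09")
print(columns_2009)
print(project(record_2010, columns_2009))
```

Filling missing values with an empty string is a choice made for this sketch so that a record from an older year can still be written against a wider target.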
To change the target, edit the rule for gen/as_csv/%.csv in the Makefile.