
ETL_5

Taking Stock, Aligning, and Moving Along

Taking Stock

The primary purpose of this extended ETL example is to show competence at planning, documenting, and implementing a small ETL project.

In ETL_1 I did the following:

  1. Introduced the project
  2. Indicated the data to be ETLed would be NHIS 1994 household data.
  3. Shared five steps that would be affiliated with repositories ETL_1, ETL_2, ETL_3, ETL_4, and ETL_5.

In ETL_2 I did the following:

  1. Imported 06724-0001-Data.txt into a spreadsheet named 06724-0001-Data.xlsx.
  2. Standardized the field (i.e., column) names in 06724-0001-Data.xlsx.
  3. Documented the standardized field names in the data file in another spreadsheet named robbins_etl_06724-0001-Data-Crosswalk-1994.xlsx.
  4. Imported the spreadsheet file 06724-0001-Data.xlsx into an Access database file 06724-0001-Data.accdb.
  5. Wrote SQL queries to identify the unique values in each field as well as the number of times each unique value occurs in that field (a sketch of such a query appears after this list).
  6. Compared these results with the same information as indicated in the codebook file named 06724-Codebook.pdf.
  7. Discovered discrepancies between my results and the 06724-Codebook.pdf file (i.e., the codebook).
  8. The discrepancies were of two types:
     a. A lack of attention to detail when representing the "codes" in the PDF document; for example, a code of "10" would be represented as "10.".
     b. The codebook reported the count of code "08" in "Processing_Week_Code" as 2,820, when the actual number of occurrences of this code in this field in the data file was 3,820.
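
As an illustration of item 5, here is a minimal sketch of the kind of per-field query I mean. The field name Processing_Week_Code comes from the standardized names documented in the crosswalk spreadsheet; the table name NHIS_1994_Household is only a placeholder, since the actual table name in 06724-0001-Data.accdb may differ.

    -- A minimal sketch of one per-field query (placeholder table name).
    -- One such query per field lists each unique value and how many times
    -- it occurs, which is what was compared against 06724-Codebook.pdf.
    SELECT Processing_Week_Code,
           COUNT(*) AS occurrence_count
      FROM NHIS_1994_Household
     GROUP BY Processing_Week_Code
     ORDER BY Processing_Week_Code;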

In ETL_3 I did the following:

  1. Designed database tables (and their relationships to each other).
  2. Built a twenty-four-step plan for "manually" moving the data from the data file (06724-0001-Data.txt) to an Oracle PDB (Pluggable Database).

In ETL_4 I did the following:

  1. Applied the plan from ETL_3 using the knowledge of the data from ETL_2.
  2. "Discovered" that one planned CODE table (and its affiliated data field) seemed incorrect.
  3. Built the data table and three code tables using SQL Plus as well as SQL Developer.
  4. Used SQL Loader to load these tables.
  5. Built a "template" file for each of the following file types (hedged sketches of such templates appear after this list):
     a. a SQL script for creating a table in SQL Plus
     b. a csv file which would contain the data to be loaded
     c. a ctl file which would direct SQL Loader
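
To make item 5 concrete, here are hedged sketches of what the three templates might look like for one code table. The table name, column names, and column widths are placeholders of my own choosing, not the exact templates built in ETL_4.

Template a, a SQL script for creating a code table in SQL Plus:

    -- Placeholder table name, column names, and widths.
    CREATE TABLE processing_week_code (
        code              VARCHAR2(2)    NOT NULL,
        code_description  VARCHAR2(100)  NOT NULL,
        CONSTRAINT processing_week_code_pk PRIMARY KEY (code)
    );

Template b, a csv file which would contain the data to be loaded (layout only; the real descriptions come from the codebook):

    code,code_description
    08,"placeholder description of code 08"

Template c, a ctl file which would direct SQL Loader to load the csv file into the table:

    -- Skip the csv header row, then append the rows into the table.
    OPTIONS (SKIP=1)
    LOAD DATA
    INFILE 'processing_week_code.csv'
    APPEND
    INTO TABLE processing_week_code
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    (code, code_description)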

Aligning

Because I was uncomfortable with:

  • the amount of effort it took to understand the "household" codebook.pdf file obtained from NHIS
  • my conclusions regarding whether, for example, a particular field was a "CODE" field or an "ID" field
  • the amount of effort that would be necessary for future "person", "condition", etc. files if their codebook.pdf files were used to parse the data

I went back to square one and tried to identify a better way to access the data, one that would be much more robust (i.e., reduce uncertainty) and much more efficient.

I discovered the IHIS resource, which integrates, organizes, and documents some of the information from the NHIS resource.

I was thrilled when I discovered that IHIS provides a method to see which variables are included in each year, for all NHIS survey data from 1963 to 2014.

In the context of ETL, this supports starting the ETL process with data that is already integrated across years, instead of using a homegrown process for integrating the data across years, which could introduce errors.

In the context of predicting y from x over a time series of data, it is also more comfortable to work with already-integrated data. Further, since IHIS indicates whether a particular year has a particular variable, an analyst can select a subset of years in which all of the variables of interest are present (a sketch of such a selection appears just below). By doing so, s/he reduces issues with missing data.
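
As a hedged sketch of that selection, suppose the year-by-variable availability information shown by IHIS were exported into a simple table. The table name, column names, and variable names below are illustrative assumptions of mine, not something provided by IHIS in this form.

    -- Hypothetical table ihis_variable_availability(variable_name, survey_year),
    -- with one row per (variable, year) combination in which the variable exists.
    -- The query returns only the survey years in which all three illustrative
    -- variables are available, so an analysis restricted to those years avoids
    -- structurally missing variables.
    SELECT survey_year
      FROM ihis_variable_availability
     WHERE variable_name IN ('VAR_A', 'VAR_B', 'VAR_C')
     GROUP BY survey_year
    HAVING COUNT(DISTINCT variable_name) = 3;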

I was also thrilled to discover that IHIS provides a codebook that is standardized in how it is generated and that is text-based, as opposed to PDF-based.

In the context of ETL accuracy, a PDF file adds a large amount of markup that is meaningful to the PDF presentation software but that adds an error-prone layer for a programmer who is only trying to extract, transform, and load data. A plain text file does not have this additional layer and is therefore less likely to be misinterpreted by a programmer.

In the context of ETL efficiency, it is a wonderful discovery that the codebook generated by IHIS is built in a standardized form by the systems underneath the IHIS web site. Because the process that IHIS uses to build a codebook is standardized, the format of each generated codebook is standardized. This in turn means that I can build one program to parse any IHIS codebook and reliably generate CSV files, Oracle tables, etc.

Moving Along

In ETL_6 I have taken advantage of the clear representation of codes (because the codebook is plain text) and of the standardized format to build a program that takes all of the information in the codebook and outputs files or tables that are in line with the design from ETL_4.

Please feel free to move on to the ETL_6 repository at this point, or at whatever future moment is convenient for you.