kumc-bmi/naaccr-tumor-data

use DB to mediate access to raw records

Closed this issue · 2 comments

dckc commented
  • load raw records into a DB table (e.g. as CLOBs)
  • feed PatientFlatReader from a Reader that gets data from the DB table
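
A minimal sketch of the two steps above, not the project's actual code. It assumes Python's built-in `sqlite3` as a stand-in for the real database, plain text rows rather than CLOBs, and a hypothetical table `naaccr_lines`. The helpers `load_raw` and `TableReader` are illustrative names; `TableReader` plays the role of the Reader that would feed PatientFlatReader.

```python
import io
import sqlite3


def load_raw(conn, lines):
    """Store each raw record as one row, keyed by its line number."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS naaccr_lines "
        "(line INTEGER PRIMARY KEY, record TEXT)"
    )
    conn.executemany(
        "INSERT INTO naaccr_lines (line, record) VALUES (?, ?)",
        ((n, text.rstrip("\n")) for n, text in enumerate(lines, start=1)),
    )
    conn.commit()


class TableReader(io.TextIOBase):
    """File-like reader that streams records back out of the table in order."""

    def __init__(self, conn):
        # Hypothetical table name; the cursor is consumed one row per readline().
        self._rows = conn.execute(
            "SELECT record FROM naaccr_lines ORDER BY line"
        )

    def readable(self):
        return True

    def readline(self, size=-1):
        row = self._rows.fetchone()
        # An empty string signals end of input, as with ordinary files.
        return "" if row is None else row[0] + "\n"
```

Because `TableReader` behaves like an open text file, a parser written against a file handle can consume it unchanged (`for line in TableReader(conn): ...`), on any host that can reach the database.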

The "data lake" approach here was somewhat half-hearted: code for reading the NAACCR file had to run on the host with the NAACCR file, since we didn't deploy a distributed filesystem such as HDFS. This would treat the DB somewhat like a distributed filesystem. In any case, using the DB to mediate access to the data has been the norm in HERON development for a decade or so.

p.s. this is the motivation for the loadRaw experiment.

dckc commented

This is the norm as of the electric release cc28f83

dckc commented

Using PatientFlatReader to go from CLOBs to discrete columns performed poorly: more than 2.5 hours for ~100k tumor records.

Going straight from the flat file to JDBC gets it down to ~6 min in 912ee3a. (A critical bug was fixed in d321e28.)
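
For illustration only, here is one way to go straight from fixed-width lines to discrete columns with batched inserts. This is a sketch, not the code in 912ee3a. It assumes Python's `sqlite3` in place of JDBC, a hypothetical table `tumor`, and a made-up three-field layout in `FIELDS`; real NAACCR layouts have many more items at different offsets.

```python
import sqlite3
from itertools import islice

# Hypothetical layout: (column name, start offset, length).
FIELDS = [("item_a", 0, 1), ("item_b", 1, 8), ("item_c", 9, 4)]


def split_record(line):
    """Slice one fixed-width record into stripped column values."""
    return tuple(
        line[start:start + length].strip() for _, start, length in FIELDS
    )


def load_columns(conn, lines, batch_size=1000):
    """Insert sliced records in batches; return the number of rows loaded."""
    cols = ", ".join(name for name, _, _ in FIELDS)
    marks = ", ".join("?" for _ in FIELDS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS tumor ({cols})")
    rows = (split_record(line) for line in lines)
    total = 0
    while True:
        # Pull at most batch_size rows so memory use stays bounded.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany(
            f"INSERT INTO tumor ({cols}) VALUES ({marks})", batch
        )
        total += len(batch)
    conn.commit()
    return total
```

With an open file, `load_columns(conn, open(path))` would stream the records without holding the whole file in memory; `path` here is whatever flat file is being loaded.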