timrdf/csv2rdf4lod-automation

Option for fatal error when links_via can't locate a match (for validation)


As we collect and convert bloodwork data into RDF, we are aggregating labels, because each lab facility uses different ones. For example, one facility provides "neu#", another provides "Neu # (ANC)", and yet another provides "ne #r". To a physician the mappings are obvious, but to a machine they are not. It would be useful for CSV2RDF4LOD to have an option to stop the conversion when links_via fails to find a match, because we want coverage of the underlying data to be as complete as possible. Currently, we work around this by scanning the output for lines where the column's property appears with only a single value rather than multiple values:

$ find * -name '*.e1.ttl' -exec grep -H ofCharacteristic {} \; | grep -v ,
2013-08-20/automatic/cbc_ruby.csv.e1.ttl:   health:ofCharacteristic value_of_characteristic:Neu_ANC ;
2013-08-27/automatic/250_comprehensive_panel.csv.e1.ttl:    health:ofCharacteristic value_of_characteristic:Bilirubin_Total ;

Once the failures have been identified, we can add the missing labels to the ontology, wipe the version, and reconvert. On large datasets, however, this rinse-and-repeat procedure would be cumbersome: the conversion might take significant time, and we'd like to know about failures early in the process.
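
In the meantime, the manual scan can be wrapped in a small shell function that returns a non-zero status when any single-valued line is found, so a build script can at least halt after conversion. This is only a sketch of the workaround, not a CSV2RDF4LOD feature: the function name `check_unmatched` and its arguments are hypothetical, and it assumes (as the scan above does) that a matched row carries a comma-separated second value on the same line.

```shell
#!/bin/sh
# Hypothetical helper (not part of CSV2RDF4LOD): report lines in *.e1.ttl files
# where the given property appears without a comma, i.e. with a single value.
# Usage: check_unmatched [directory] [property]
check_unmatched() {
    dir=${1:-.}
    property=${2:-ofCharacteristic}

    # Same filter as the manual scan: property present, no comma on the line.
    # '|| true' keeps an empty result from being treated as a command failure.
    unmatched=$(find "$dir" -name '*.e1.ttl' -exec grep -H -- "$property" {} + | grep -v , || true)

    if [ -n "$unmatched" ]; then
        # Print the offending lines to stderr and signal failure to the caller.
        printf '%s\n' "$unmatched" >&2
        return 1
    fi
    return 0
}
```

A caller could then run `check_unmatched . ofCharacteristic || exit 1` after each conversion. Like the original scan, this also drops any line whose file path contains a comma, and it only detects failures after the conversion has finished, so it does not replace the requested early-exit option.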