jdurbin/wekaMine

wmTableValidator


TableFileLoader, Table, and DoubleTable are all written to be fast, so they don't do much cell-by-cell checking. However, the files we get are so often garbled in some way (headers that don't match, or rows that are not all the same length) that we need some kind of gross format checking on these files. Things to check for:

are there row or column headers?
are all the rows the same length?
does the table contain missing values or null values and how are those represented?

I hesitate to add this to the overhead of the Table classes... maybe a separate class to pre-scan and sanity check before handing it to a Table class?
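A rough sketch of what such a pre-scan could look like, kept separate from the Table classes. It is written in Python only for illustration; `prescan`, `NULL_TOKENS`, and the header heuristics are all assumptions made up for this sketch, not existing wekaMine code. The header test is a crude heuristic (a line or column counts as a header if none of its cells parse as a number), so it can be fooled by names that happen to look numeric.

```python
import csv
import io

# Hypothetical set of tokens treated as "missing"; adjust to the data at hand.
NULL_TOKENS = {"", "NA", "NaN", "null", "NULL", "?"}


def is_number(text):
    """Return True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def prescan(text, delimiter="\t"):
    """Gross format check: headers, ragged rows, and null tokens seen."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    report = {"n_rows": len(rows), "has_col_header": False,
              "has_row_header": False, "ragged_rows": [], "null_tokens": {}}
    if not rows:
        return report

    # Rows whose cell count differs from the first line (1-based line numbers).
    width = len(rows[0])
    report["ragged_rows"] = [(i + 1, len(r)) for i, r in enumerate(rows)
                             if len(r) != width]

    # Heuristic: the first line is a header if no cell after the corner is numeric.
    report["has_col_header"] = not any(is_number(c) for c in rows[0][1:])
    body = rows[1:] if report["has_col_header"] else rows

    # Heuristic: the first column is a row header if none of its cells is numeric.
    report["has_row_header"] = bool(body) and not any(
        is_number(r[0]) for r in body if r)

    # Count which null representations actually occur in the data cells.
    for r in body:
        for cell in (r[1:] if report["has_row_header"] else r):
            if cell in NULL_TOKENS:
                report["null_tokens"][cell] = report["null_tokens"].get(cell, 0) + 1
    return report
```

Running this once before handing the file to a Table class would keep the checking cost out of the fast path.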

Just got burned again by not noticing that some gene names were plain HUGO names while others had the form HUGO|1324, so some models wouldn't work with other data. Had to retrain a bunch of models.

Need a script to validate tables that we can run on the data once in an overall pipeline. It would check that all row and column names are present and that each row has the right number of columns. It should optionally output a report on whether each row or each column is all numeric or not. If numeric, it should report or verify the range of the numbers in each row/column. If nominal, it should report what set of nominal values each row/column contains. It should take a file that lists the valid possible names for a row or column and verify that the row/column names match that list.
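The numeric/nominal report could be sketched per row or column roughly like this. Python is used only for illustration; `summarize` and its return format are hypothetical, and the null marker is assumed to be a single known string.

```python
def summarize(values, null_value="null"):
    """Classify one row or column as numeric or nominal and describe it.

    values: list of cell strings; null_value: the assumed null marker.
    """
    present = [v for v in values if v != null_value]
    n_null = len(values) - len(present)

    nums = []
    for v in present:
        try:
            nums.append(float(v))
        except ValueError:
            # At least one non-numeric cell: report the set of nominal values.
            return {"kind": "nominal", "values": sorted(set(present)),
                    "n_null": n_null}

    if not nums:
        # Nothing but nulls, so neither a range nor a value set is defined.
        return {"kind": "empty", "n_null": n_null}
    return {"kind": "numeric", "min": min(nums), "max": max(nums),
            "n_null": n_null}
```

Verifying a range would then be a comparison of the reported `min` and `max` against expected bounds.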

So, to prevent the pain I experienced with the hugo names, I might run it like:

wmTableValidator -d BRCA_tumor -rowNameSet hugo_names.txt -colNameSet TCGAShortNames.txt -nullValue "null" -allNumeric TRUE
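The name-set check behind options like -rowNameSet and -colNameSet might amount to something like the following. This is a sketch in Python, not the wmTableValidator implementation; `check_names` is a hypothetical helper, and it assumes the name-set file has already been read into a list with one valid name per entry.

```python
def check_names(names, valid_names):
    """Return (unknown, duplicates) for a list of row or column names.

    unknown: names not found in valid_names, in order of appearance.
    duplicates: names that occur more than once, reported at each repeat.
    """
    valid = set(valid_names)
    seen = set()
    unknown = []
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        seen.add(name)
        if name not in valid:
            unknown.append(name)
    return unknown, duplicates


# Illustrative data only: a name with an appended "|id" suffix fails the check.
unknown, duplicates = check_names(["TP53", "BRCA1|672", "TP53"],
                                  ["TP53", "BRCA1"])
```

An exact-match test like this is what would have flagged the mixed naming before any models were trained.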

Or something. It would be harder, but maybe a companion script to massage a file into a canonical form: obviously replacing null values with a desired null value, but maybe other transformations also...
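The null-replacement part of such a companion script could be as small as this sketch. It is Python for illustration only; `canonicalize_nulls` is a hypothetical name, and the table is assumed to be plain delimiter-separated text with no quoted cells.

```python
def canonicalize_nulls(lines, null_tokens, null_value="null", delimiter="\t"):
    """Yield each line with every cell in null_tokens replaced by null_value.

    Cells are compared after stripping surrounding whitespace; all other
    cells are passed through unchanged.
    """
    tokens = set(null_tokens)
    for line in lines:
        cells = line.rstrip("\n").split(delimiter)
        yield delimiter.join(
            null_value if cell.strip() in tokens else cell for cell in cells)
```

Because it is a generator over lines, it can be run over a large file without loading the whole table into memory.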