Improve the error messages in case of a genotype data parsing failure
nevrome opened this issue · 4 comments
At the moment an issue in the genotype data is always reduced to
Issues in genotype data parsing:
SeqFormatException "Error while parsing: not enough input. Error occurred when trying to parse this chunk: \"...\""
That often does not help to identify and solve the underlying issue, because it does not say in which package + SNP (+ individual?) the problem occurred. If such an error comes up in a big forge, debugging becomes a search for the needle in the haystack. The short snippet of the relevant chunk in the error message above can be pointless when the genotype data is in a binary format.
I wonder if there is a way to include additional, crucial information in this error message.
We can definitely add a size-check for the genotype data, which I'm happy to build into our validation pipeline.
It has become clear that a size-check is not possible. But independent of that I was hoping for more.
I hope we could get an error message that looks something like this:
Issues in genotype data parsing:
Can not parse SNP A in line B of file C of package D for individual E.
Error occurred when trying to parse this chunk: "this is not the SNP you're searching for"
Is this science fiction with our current implementation?
OK, so going back to size checks: since we do have the snpSet (1240K, HumanOrigins, Other) in the YAML file, we should actually be able to give a size-check warning after all, at least in cases where it's either 1240K or HumanOrigins. We can hardcode the expected number of SNPs for these categories and then use the number of individuals to compute an expected byte size of the *.bed or the *.geno files. Of course, I think a mismatch between the expectation and the actual file size should not yield a hard error, because technically one could imagine some packages simply dropping SNPs which are uncovered (the schema doesn't forbid this, and our forging technology can handle this explicitly). But at least we can spit out a warning.
I'll work on that.
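To make the size-check idea concrete, here is a rough sketch in Python (not the project's own code). The function names and the SNP counts in EXPECTED_SNPS are assumptions for illustration and would need to be checked against the real panels. The byte formulas assume a SNP-major PLINK *.bed file (3 magic bytes, then 2 bits per genotype with each SNP padded to whole bytes) and an uncompressed, unpacked EIGENSTRAT *.geno text file with Unix line endings.

```python
# Assumed SNP counts per snpSet. Illustrative values only: verify them
# against the actual SNP panels before relying on them.
EXPECTED_SNPS = {"1240K": 1233013, "HumanOrigins": 597573}


def expected_bed_size(n_snps, n_individuals):
    """Byte size of a SNP-major PLINK .bed file.

    3 magic bytes, then for every SNP 2 bits per individual,
    rounded up to a whole number of bytes.
    """
    bytes_per_snp = (n_individuals + 3) // 4
    return 3 + n_snps * bytes_per_snp


def expected_geno_size(n_snps, n_individuals):
    """Byte size of an unpacked EIGENSTRAT .geno text file.

    One character per individual plus one newline per SNP line.
    """
    return n_snps * (n_individuals + 1)


def size_check_warning(snp_set, n_individuals, actual_size, fmt):
    """Return a warning string on a size mismatch, otherwise None.

    snpSet values without a known SNP count (e.g. "Other") are skipped.
    fmt is "bed" or "geno" (hypothetical labels for this sketch).
    """
    n_snps = EXPECTED_SNPS.get(snp_set)
    if n_snps is None:
        return None
    if fmt == "bed":
        expected = expected_bed_size(n_snps, n_individuals)
    else:
        expected = expected_geno_size(n_snps, n_individuals)
    if actual_size != expected:
        return (
            f"Warning: genotype file has {actual_size} bytes, but "
            f"{expected} were expected for snpSet {snp_set} "
            f"with {n_individuals} individuals"
        )
    return None
```

The check returns a message instead of raising, in line with the point above that a mismatch should only produce a warning, since packages may legitimately drop uncovered SNPs.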
I think it's not science fiction. My sequence-formats parsers can provide all that information; it's just a matter of having all the data ready to create that error message, which might involve some refactoring here and there. I'll look into it.
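As a language-neutral illustration of "having all the data ready", the following Python sketch threads package, file, line, SNP and individual through a toy parser so the error can name them. It is not the sequence-formats API: GenotypeParseError and parse_geno_lines are hypothetical names, and the input is assumed to be unpacked EIGENSTRAT-style text where each line is one SNP and each character (0, 1, 2 or 9) is one individual.

```python
class GenotypeParseError(Exception):
    """Hypothetical error type carrying the context of a parse failure."""

    def __init__(self, package, file, line, snp, individual, chunk):
        self.package = package
        self.file = file
        self.line = line
        self.snp = snp
        self.individual = individual
        self.chunk = chunk
        super().__init__(
            "Issues in genotype data parsing:\n"
            f"Can not parse SNP {snp} in line {line} of file {file} "
            f"of package {package} for individual {individual}.\n"
            f'Error occurred when trying to parse this chunk: "{chunk}"'
        )


VALID_CODES = set("0129")  # EIGENSTRAT genotype codes, 9 = missing


def parse_geno_lines(package, file, snp_ids, individuals, lines):
    """Yield one list of integer genotype codes per SNP line.

    The caller supplies the context (package, file, SNP IDs,
    individual names), so a failure can report exactly where it happened.
    """
    for line_no, (snp, raw) in enumerate(zip(snp_ids, lines), start=1):
        text = raw.rstrip("\n")
        for idx, code in enumerate(text):
            if code not in VALID_CODES:
                # Fall back to the position if no name is known for it.
                name = individuals[idx] if idx < len(individuals) else idx
                raise GenotypeParseError(
                    package, file, line_no, snp, name, text
                )
        yield [int(code) for code in text]
```

The main design point is that the low-level parser only sees characters, so the surrounding layer that knows the package and file has to either pass that context down or catch and re-wrap the low-level error.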