validator
Opened this issue · 7 comments
Hi,
thanks for this.
I have a lot of problems with incorrect and weird sample sheets from the lab, generated by "copy-paste" and with strange barcode schemes, such as mixed TruSeq and Nextera.
I was starting to write a very simple validator to pick up on the worst errors, but now see you have done much more.
Are you planning to write a standalone validator or is this already possible via your library?
Thanks, Colin
I certainly could! I want to make a non-short-circuiting validation refactor but have not found the time: #96
Right now, if anything looks "wrong" an exception is raised immediately. Instead, validation should compile all errors and emit them at once at the end, so you can fix your sample sheet in one pass rather than iteratively.
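A minimal sketch of what collecting errors could look like, as opposed to raising on the first one. This is not the library's API: `Check`, `check_no_semicolons`, `check_has_data_section`, and `validate_all` are hypothetical names, and the two checks are only placeholders.

```python
from typing import Callable, List

# Hypothetical check signature: takes the sample sheet lines, returns error messages.
Check = Callable[[List[str]], List[str]]


def check_no_semicolons(lines: List[str]) -> List[str]:
    """Report every line that contains a semicolon."""
    return [
        f"line {number}: found ';', expected ',' as the delimiter"
        for number, line in enumerate(lines, start=1)
        if ";" in line
    ]


def check_has_data_section(lines: List[str]) -> List[str]:
    """Report a missing [Data] section header."""
    if any(line.strip().startswith("[Data]") for line in lines):
        return []
    return ["no [Data] section found"]


def validate_all(lines: List[str], checks: List[Check]) -> List[str]:
    """Run every check, even after earlier ones report problems, and gather all errors."""
    errors: List[str] = []
    for check in checks:
        errors.extend(check(lines))
    return errors
```

Calling `validate_all(sheet_lines, [check_no_semicolons, check_has_data_section])` would return every message in one list, so the sheet can be corrected in a single edit.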
What are some validations you would like to see included beyond those that simply catch non-conformance with the spec?
It would also be awesome if it supported plugin validations so you could easily extend base validations with custom lab validations.
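One way plugin validations might work is a small registry that base and lab-specific validators both add themselves to. The sketch below is purely illustrative: `register`, `run_registered`, the registry, and both example rules are hypothetical and not part of the library.

```python
from typing import Callable, Dict, List

# Hypothetical validator signature: sample sheet lines in, error messages out.
Validator = Callable[[List[str]], List[str]]
_REGISTRY: Dict[str, Validator] = {}


def register(name: str) -> Callable[[Validator], Validator]:
    """Decorator that stores a validator in the registry under the given name."""
    def decorator(func: Validator) -> Validator:
        _REGISTRY[name] = func
        return func
    return decorator


@register("base.semicolons")
def no_semicolons(lines: List[str]) -> List[str]:
    """Base rule: semicolons are not valid delimiters."""
    return [f"line {n}: unexpected ';'" for n, line in enumerate(lines, 1) if ";" in line]


@register("lab.trailing_whitespace")
def no_trailing_whitespace(lines: List[str]) -> List[str]:
    """Example of a custom lab rule: flag lines that end in whitespace."""
    return [f"line {n}: trailing whitespace" for n, line in enumerate(lines, 1)
            if line != line.rstrip()]


def run_registered(lines: List[str]) -> List[str]:
    """Run every registered validator and prefix each message with its name."""
    errors: List[str] = []
    for name, validator in _REGISTRY.items():
        errors.extend(f"[{name}] {message}" for message in validator(lines))
    return errors
```

A lab would then only need to import its own module of `@register(...)` functions before calling `run_registered`.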
Hmm, I could use this in our lab too. I will do my best to carve out some time.
Nice. I had just started this, but these are the most common issues I see:
- TruSeq and Nextera confusion
- incorrect delimiters (they make this in Excel and inexplicably mess delimiters up on a regular basis)
- incorrect headers
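The delimiter item in the list above could be checked per line roughly as follows. This is a sketch under assumptions: the candidate delimiters and the function names `dominant_delimiter` and `check_delimiters` are made up for illustration, and counting characters is a heuristic that will misfire on fields that legitimately contain a semicolon or tab.

```python
from typing import List, Optional

# Assumed set of delimiters that Excel exports commonly end up with.
CANDIDATE_DELIMITERS = (",", ";", "\t")


def dominant_delimiter(line: str) -> Optional[str]:
    """Return the candidate delimiter that occurs most often in the line, or None."""
    best = max(CANDIDATE_DELIMITERS, key=line.count)
    return best if line.count(best) > 0 else None


def check_delimiters(lines: List[str], expected: str = ",") -> List[str]:
    """Return one message per line whose dominant delimiter is not the expected one."""
    messages: List[str] = []
    for number, line in enumerate(lines, start=1):
        found = dominant_delimiter(line)
        if found is not None and found != expected:
            messages.append(
                f"line {number}: looks delimited by {found!r}, expected {expected!r}"
            )
    return messages
```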
Like I said, I haven't gotten far at all, so this is just very preliminary I'm afraid. The idea is that the lab colleagues can run a simple .bat script, which runs a Python script natively in Windows. Output comments and errors go to an output txt file. That way they get instant feedback without having to wait for a) data and b) a bioinformatician to run bcl2fastq.
# Module-level state shared by the checks (assumed; not shown in the original snippet).
outputComments = []
indexAdaptersLineCount = 0

def checkIndexAdaptersLine(inputLine):
    # The counter is reassigned here, so it has to be declared global.
    global indexAdaptersLineCount
    if "Index Adapters" in inputLine:
        indexAdaptersLineCount = indexAdaptersLineCount + 1
        if '"Index Adapters,""TruSeq DNA CD Indexes (96 Indexes)"""' in inputLine:
            outputComments.append("Index Adapters line looks good for TruSeq\n")
        elif '"Index Adapters,""IDT-ILMN Nextera DNA UD Indexes Set A"""' in inputLine:
            outputComments.append("Index Adapters line looks good for Nextera\n")
        else:
            outputComments.append("INFO: Could not read Index Adapters line properly\n")
            outputComments.append('INFO: Typically should be Truseq: "Index Adapters,""TruSeq DNA CD Indexes (96 Indexes)"""\n')
            outputComments.append('INFO: Typically should be Nextera "Index Adapters,""IDT-ILMN Nextera DNA UD Indexes Set A"""\n')

def checkSemicolons(inputLine):
    # Any semicolon suggests the sheet was saved with the wrong delimiter.
    if ";" in inputLine:
        outputComments.append("\n\n####### ERROR !!! FOUND A SEMICOLON; SHOULD ONLY CONTAIN COMMA AS DELIMITERS !! ####### \n ")

def checkHeader(inputLine):
    # Only the [Data] column header line contains Sample_ID.
    if "Sample_ID" in inputLine:
        if "Sample_Name,Sample_Name" in inputLine:
            outputComments.append("Error: Sample_Name,Sample_Name should be Sample_ID,Sample_Name\n")
        if "Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description,,,," in inputLine:
            outputComments.append("Header looks ok\n")
        outputComments.append("INFO: Header: " + inputLine+"\n")
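For the "run a script, get a txt file" workflow described above, a driver might look something like this. It is a self-contained sketch, not the script from this thread: the file paths, `LineCheck`, `semicolon_check`, `run_line_checks`, and `write_report` are all hypothetical, and the checks return their comments instead of appending to a shared list.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical per-line check: returns the comments produced for one line.
LineCheck = Callable[[str], List[str]]


def semicolon_check(line: str) -> List[str]:
    """Flag a line that contains a semicolon."""
    if ";" in line:
        return ["ERROR: found a semicolon; only commas should be used as delimiters\n"]
    return []


def run_line_checks(lines: Iterable[str], checks: List[LineCheck]) -> List[str]:
    """Apply every check to every line and collect all comments in order."""
    comments: List[str] = []
    for line in lines:
        for check in checks:
            comments.extend(check(line))
    return comments


def write_report(sheet_path: str, report_path: str, checks: List[LineCheck]) -> int:
    """Check the sheet, write the comments to a text file, and return the comment count."""
    lines = Path(sheet_path).read_text(encoding="utf-8").splitlines()
    comments = run_line_checks(lines, checks)
    Path(report_path).write_text("".join(comments) or "No problems found\n", encoding="utf-8")
    return len(comments)
```

A .bat file would then only need to invoke a small script that calls `write_report` with the sample sheet path and the desired output path.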
Do they use the Illumina Experiment Manager to create a sample sheet? Our lab folks do and it helps cut down errors.
Nope, since they say it doesn't allow custom primers, which we use a lot of, e.g. for amplicons, Nextera, NEB etc.
@colindaven I like where you are headed:
"Output comments and errors go to an output txt file."
Right now validation is fail-fast which really hurts the turn-around for making a valid sample sheet since you have to iteratively edit and parse the sample sheet to wade through each validation exception one-by-one. I agree we should refactor validation in this toolkit so it is modular and as lazy as possible (collect all exceptions, and then emit in bulk at the end of a validation call).
We're only on 0.11.0 so this is something I'm inclined to bundle into a v1 refactor and final public API.