[Input samplesheet validation] Sample names containing a dot can result in an accidental sample merge
TimHHH opened this issue · 5 comments
Two samples whose names are identical up to the last dot can be merged unintentionally, e.g.:
Votintseva2017.614406.m
Votintseva2017.614406.1
In this case the two distinct samples are interpreted as one, namely Votintseva2017.614406.
Hence we should make a note in the manual that dots are not allowed in sample names in the sample sheet.
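To make the collision concrete, here is a minimal sketch. It assumes the name is truncated at its last dot somewhere in the pipeline; the helper `strip_last_extension` is hypothetical and only mimics that behaviour, it is not taken from the xbs-nf code.

```python
def strip_last_extension(name):
    """Hypothetical helper: drop everything after the final dot in a name."""
    return name.rsplit(".", 1)[0]


names = ["Votintseva2017.614406.m", "Votintseva2017.614406.1"]

# Both names reduce to the same string, so the two samples become indistinguishable.
collapsed = {strip_last_extension(n) for n in names}
print(collapsed)
```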
I'm thinking about addressing this more generally by creating samplesheet validation logic.
A python script would be triggered to validate the samplesheet with all the nuances - what do you think?
Notes from meeting on 30-Aug-2022
- Sample names (i.e. the `sample` column) should not have dots (non-dash symbols). Add a list of symbols not allowed.
- All fields have a value (no empty columns)
- Remove the assumptions from https://github.com/TORCH-Consortium/xbs-nf/blob/436c515a1aa1cd6773f449a25d18ff3f6a962aa8/main.nf#L22
- Exit if samplesheet validation fails
- Checks for dots in reference genome names (SNPEFF / default_configs)
- TODO: Evaluate quoted strings in the samplesheet
- Read-1 should be different from Read-2
- (OPTIONAL) Both read files (Read-1 and Read-2) should exist
@TimHHH @LennertVerboven please feel free to add other validations.
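A few of the checks from the meeting notes could be sketched roughly as follows. This is only an illustration, not the script that ended up in the repository: the column names (`Study`, `Sample`, `Library`, `Attempt`, `R1`, `R2`) and the allowed-character rule (letters, digits, underscores, dashes) are assumptions, and the file-existence check is left out.

```python
import csv
import re

# Assumed column names; the real samplesheet header may differ.
REQUIRED_COLUMNS = ["Study", "Sample", "Library", "Attempt", "R1", "R2"]

# Assumed rule: only letters, digits, underscores and dashes in sample names.
SAMPLE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_samplesheet(lines):
    """Return a list of error messages for an iterable of CSV lines (empty if valid)."""
    reader = csv.DictReader(lines)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    errors = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Every required field must have a value (short rows yield None).
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                errors.append(f"line {line_no}: empty value in column '{column}'")

        # Sample names must not contain dots or other disallowed symbols.
        sample = (row["Sample"] or "").strip()
        if sample and not SAMPLE_NAME_PATTERN.fullmatch(sample):
            errors.append(f"line {line_no}: invalid sample name '{sample}'")

        # Read-1 should be different from Read-2.
        r1 = (row["R1"] or "").strip()
        r2 = (row["R2"] or "").strip()
        if r1 and r1 == r2:
            errors.append(f"line {line_no}: R1 and R2 point to the same file '{r1}'")
    return errors
```

A wrapper script could open the samplesheet, pass the file handle to `validate_samplesheet`, print the messages, and call `sys.exit(1)` when the list is non-empty, so that the pipeline exits if validation fails.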
The item "Checks for dots in reference genome names (SNPEFF / default_configs)" can be dropped. Our pipeline is not designed for using other reference genomes because downstream processes require H37Rv. However, modifying XBS-nf to run with a different reference genome is certainly doable for those with a programming background.
Another requirement: no two rows should have exactly the same Study, Sample, Library and Attempt (at least the attempt number should differ).
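The uniqueness requirement could be checked along these lines. This is a sketch under the assumption that each row has already been parsed into a dict keyed by the column names `Study`, `Sample`, `Library` and `Attempt`; the function name `find_duplicate_keys` is made up for illustration.

```python
from collections import Counter

# Columns that together must identify a row uniquely.
KEY_COLUMNS = ("Study", "Sample", "Library", "Attempt")


def find_duplicate_keys(rows):
    """Return the sorted (Study, Sample, Library, Attempt) tuples that occur more than once."""
    counts = Counter(
        tuple(row[column].strip() for column in KEY_COLUMNS) for row in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

Two rows that differ only in the attempt number are accepted; only exact repeats of all four values are reported.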
An initial implementation was written by @LennertVerboven and added here: https://github.com/TORCH-Consortium/xbs-nf/blob/master/bin/sample_sheet_validation.py