TORCH-Consortium/MAGMA

[Input samplesheet validation] Sample names with point can result in accidental sample merge

TimHHH opened this issue · 5 comments

Two samples whose names are identical up to a point (dot) can be merged unintentionally, e.g.:
Votintseva2017.614406.m
Votintseva2017.614406.1
In this case the two distinct samples are interpreted as one, namely Votintseva2017.614406
Hence we should note in the manual that points are not allowed in sample names in the sample sheet.
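
To illustrate the collision, here is a minimal Python sketch. It assumes the merge happens because everything after the last point is dropped from the name; the issue does not say where in the pipeline this occurs, and `find_dot_collisions` is a hypothetical helper, not part of MAGMA.

```python
from collections import defaultdict


def find_dot_collisions(names):
    """Group sample names that become identical once the text after the last dot is dropped."""
    groups = defaultdict(list)
    for name in names:
        # ASSUMPTION: mimic a step that strips everything after the last dot.
        stem = name.rsplit(".", 1)[0]
        groups[stem].append(name)
    # Keep only the stems shared by more than one sample name.
    return {stem: members for stem, members in groups.items() if len(members) > 1}


names = ["Votintseva2017.614406.m", "Votintseva2017.614406.1"]
print(find_dot_collisions(names))
# {'Votintseva2017.614406': ['Votintseva2017.614406.m', 'Votintseva2017.614406.1']}
```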

I'm thinking about addressing this more generally by creating samplesheet validation logic.

A Python script would be triggered to validate the samplesheet with all the nuances - what do you think?

CC @LennertVerboven

Notes from meeting on 30-Aug-2022

  • Sample names (i.e. the sample column) should not contain dots or other non-dash symbols. Add a list of the symbols that are not allowed.
  • All fields must have a value (no empty columns)
  • Check for dots in reference genome names (SNPEFF / default_configs)
  • TODO: Evaluate quoted strings in the samplesheet
  • Read-1 should be different from Read-2
  • (OPTIONAL) Both of these files should exist
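
The checks in these notes could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the samplesheet is a CSV file, the column headers `Sample`, `R1` and `R2` are placeholders, the allowed-character rule (letters, digits, dashes) is one possible reading of "non-dash symbols", and the example rows and file names are made up.

```python
import csv
import io
import os
import re

# ASSUMPTION: only letters, digits, and dashes are accepted in sample names.
ALLOWED_SAMPLE_NAME = re.compile(r"[A-Za-z0-9-]+")


def validate_samplesheet(handle, check_files=False):
    """Return a list of human-readable problems found in a CSV samplesheet."""
    errors = []
    reader = csv.DictReader(handle)
    for row in reader:
        line_no = reader.line_num  # physical line of the row just read

        # Every field must have a value.
        for column, value in row.items():
            if column is None:
                errors.append(f"line {line_no}: more fields than header columns")
            elif value is None or not value.strip():
                errors.append(f"line {line_no}: column '{column}' is empty")

        # Sample names may not contain dots or other non-dash symbols.
        sample = (row.get("Sample") or "").strip()
        if sample and not ALLOWED_SAMPLE_NAME.fullmatch(sample):
            errors.append(f"line {line_no}: invalid sample name '{sample}'")

        # Read-1 must differ from Read-2.
        r1 = (row.get("R1") or "").strip()
        r2 = (row.get("R2") or "").strip()
        if r1 and r1 == r2:
            errors.append(f"line {line_no}: R1 and R2 point to the same file")

        # Optional: both read files should exist on disk.
        if check_files:
            for path in (r1, r2):
                if path and not os.path.isfile(path):
                    errors.append(f"line {line_no}: file not found: {path}")
    return errors


# Made-up rows: a dotted sample name, identical reads, and an empty field.
EXAMPLE = """Study,Sample,Library,Attempt,R1,R2
Votintseva2017,614406.m,1,1,a_R1.fastq.gz,a_R2.fastq.gz
Votintseva2017,614406-1,1,1,b_R1.fastq.gz,b_R1.fastq.gz
Votintseva2017,,1,1,c_R1.fastq.gz,c_R2.fastq.gz
"""

for problem in validate_samplesheet(io.StringIO(EXAMPLE)):
    print(problem)
```

Collecting all problems in a list, rather than stopping at the first one, lets a user fix the whole samplesheet in one pass. The file-existence check is off by default because the notes mark it as optional.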

@TimHHH @LennertVerboven please feel free to add other validations.

The check "Checks for dots in reference genome names (SNPEFF / default_configs)" can be dropped. Our pipeline is not designed for use with other reference genomes, because downstream processes require H37Rv. However, modifying XBS-nf to run with a different reference genome is certainly doable for those with a programming background.

Another requirement: no two rows should have exactly the same Study, Sample, Library and Attempt values (at least the attempt number should differ).
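
A minimal sketch of this uniqueness check, assuming the rows have already been parsed into dictionaries and that the columns are literally named `Study`, `Sample`, `Library` and `Attempt` (`find_duplicate_keys` is a hypothetical helper):

```python
from collections import Counter

# ASSUMPTION: these four headers identify a row in the samplesheet.
KEY_COLUMNS = ("Study", "Sample", "Library", "Attempt")


def find_duplicate_keys(rows):
    """Return every (Study, Sample, Library, Attempt) tuple that occurs more than once."""
    counts = Counter(
        tuple(row[column].strip() for column in KEY_COLUMNS) for row in rows
    )
    return [key for key, count in counts.items() if count > 1]


rows = [
    {"Study": "S1", "Sample": "A", "Library": "1", "Attempt": "1"},
    {"Study": "S1", "Sample": "A", "Library": "1", "Attempt": "1"},  # exact repeat
    {"Study": "S1", "Sample": "A", "Library": "1", "Attempt": "2"},  # new attempt, allowed
]
print(find_duplicate_keys(rows))
# [('S1', 'A', '1', '1')]
```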

TODO: @abhi18av Need to add another check for any duplicates in the samplesheet.