broadinstitute/Drop-seq

Better handling of duplicate identifiers in SAMPLE_FILE

Closed this issue · 1 comments

Instructions

When SAMPLE_FILE contains duplicate entries, validation incorrectly fails.

Affected tool(s)

All tools that take SAMPLE_FILE and VCF arguments. First detected with GatherDigitalAlleleCounts, but it should affect most tools that use the same underlying API.

Affected version(s)

  • [x] Latest public release version [version?]
  • [x] Latest development/master branch as of [date of test?]

Description

When the sample file contains duplicate identifiers, VCF validation treats the number of entries in the file as the number of requested samples. It then intersects the requested names with the set of sample names in the VCF and checks whether the intersection is the same size as the requested list. Because the requested samples are held in a list rather than a set, duplicates are retained, so the list has more elements than the intersection whenever it contains a duplicate. In the log below, 17 entries were requested but only 16 distinct samples were found.

INFO 2022-04-08 22:36:43 GatherDigitalAlleleCounts Found 16 samples in VCF and requested sample list out of 17 requested
[Fri Apr 08 22:36:43 EDT 2022] org.broadinstitute.dropseqrna.barnyard.digitalallelecounts.GatherDigitalAlleleCounts done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" java.lang.IllegalArgumentException: Did not find all of the requested samples. Can not continue.
at org.broadinstitute.dropseqrna.vcftools.SampleAssignmentVCFUtils.validateSampleNamesInVCF(SampleAssignmentVCFUtils.java:241)
at org.broadinstitute.dropseqrna.barnyard.digitalallelecounts.GatherDigitalAlleleCounts.getSNPInfoCollection(GatherDigitalAlleleCounts.java:505)
at org.broadinstitute.dropseqrna.barnyard.digitalallelecounts.GatherDigitalAlleleCounts.doWork(GatherDigitalAlleleCounts.java:265)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
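
The mismatch can be reproduced in isolation. The following is a minimal sketch of the comparison as described in this issue, not the actual code in SampleAssignmentVCFUtils; the class name, method names, and sample identifiers are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the check described in this issue; not Drop-seq source code.
class DuplicateSampleCheckDemo {

    // Number of distinct requested names that are also present in the VCF.
    static int countFound(List<String> requested, Set<String> vcfSamples) {
        Set<String> found = new HashSet<>(requested);
        found.retainAll(vcfSamples); // set intersection, so duplicates collapse
        return found.size();
    }

    // Flawed comparison: intersection size versus the raw list length.
    static boolean listBasedCheckPasses(List<String> requested, Set<String> vcfSamples) {
        return countFound(requested, vcfSamples) == requested.size();
    }

    public static void main(String[] args) {
        Set<String> vcfSamples = new HashSet<>(Arrays.asList("s1", "s2", "s3"));
        // "s2" is listed twice, as it might be in a SAMPLE_FILE with a duplicate line.
        List<String> requested = Arrays.asList("s1", "s2", "s3", "s2");

        System.out.println("Found " + countFound(requested, vcfSamples)
                + " samples out of " + requested.size() + " requested");
        System.out.println("Check passes: " + listBasedCheckPasses(requested, vcfSamples));
    }
}
```

Every requested sample is present in the VCF here, yet the check fails because 3 distinct matches are compared against a list of length 4.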

Steps to reproduce

Run an affected tool (for example, GatherDigitalAlleleCounts) with a SAMPLE_FILE that lists at least one identifier more than once.

Expected behavior

This validation should pass. The sample list should be converted to a set to remove duplicates before validation.
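
One way to get this behavior is sketched below. It is only an illustration of the proposed approach, and is not necessarily how the fix was implemented; the class and method names are hypothetical. A LinkedHashSet is used on the assumption that keeping the original sample order is desirable.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed behavior; not Drop-seq source code.
class DeduplicatedSampleCheckDemo {

    // Removes duplicate identifiers while preserving first-seen order.
    static Set<String> deduplicate(List<String> requested) {
        return new LinkedHashSet<>(requested);
    }

    // Passes when every distinct requested sample is present in the VCF.
    static boolean allRequestedSamplesPresent(List<String> requested, Set<String> vcfSamples) {
        Set<String> unique = deduplicate(requested);
        return vcfSamples.containsAll(unique);
    }

    public static void main(String[] args) {
        Set<String> vcfSamples = new HashSet<>(Arrays.asList("s1", "s2", "s3"));
        List<String> withDuplicate = Arrays.asList("s1", "s2", "s3", "s2");
        List<String> withMissing = Arrays.asList("s1", "s9");

        // Duplicates no longer cause a failure.
        System.out.println(allRequestedSamplesPresent(withDuplicate, vcfSamples));
        // A sample that is truly absent from the VCF is still rejected.
        System.out.println(allRequestedSamplesPresent(withMissing, vcfSamples));
    }
}
```

Deduplicating before the comparison keeps the check strict for samples that are missing from the VCF while tolerating repeated lines in the sample file.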

Actual behavior

Validation fails.

Fixed in 2.5.4