gdemux

Demultiplex by barcode into groups (subsets) rather than individual sample files.

Primary language: Python. License: The Unlicense.

Grouped Demultiplexing

We need to demultiplex, but want the reads for each group of barcodes joined into a single FASTQ file, and we want it to be easy.

gdemux demultiplexes FASTQs with fastq-multx (install with conda install -c bioconda fastq-multx), then concatenates the per-sample files into one FASTQ per group instead of keeping individual samples. Each grouped FASTQ is validated by comparing its total read count against the sum of the read counts of its individual samples. This effectively subsets 'Undetermined' into smaller groups of 'Undetermined' files.
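The join-and-validate step described above can be sketched roughly as follows. This is an illustration only, not gdemux's actual code: the function names count_reads and join_group are hypothetical, and the sketch assumes uncompressed FASTQs with exactly four lines per record and a trailing newline at the end of each file.

```python
import shutil


def count_reads(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle) // 4


def join_group(sample_paths, group_path):
    """Concatenate per-sample FASTQs into one group FASTQ, then validate.

    Hypothetical helper: the group's read count must equal the sum of
    the read counts of the samples that were joined into it.
    """
    with open(group_path, "wb") as out:
        for sample in sample_paths:
            with open(sample, "rb") as src:
                shutil.copyfileobj(src, out)

    expected = sum(count_reads(p) for p in sample_paths)
    actual = count_reads(group_path)
    if actual != expected:
        raise ValueError(
            f"{group_path}: found {actual} reads, expected {expected}"
        )
    return actual
```

Comparing counts after concatenation is a cheap sanity check: it catches truncated or missing sample files without re-reading any sequence content.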

Group Named Output Files

Setting --output-action to "groupid" and running:

$ gdemux -a groupid -o out test_R1.fastq test_barcodes.txt
[2016-08-03 17:00 INFO] Found 10 samples across 3 groups within test_barcodes.txt
[2016-08-03 17:00 INFO] Demultiplexing (mismatches=1, distance=2, quality=0)
[2016-08-03 17:00 INFO] Joining reads across groups
[2016-08-03 17:00 INFO] Validating group read counts with sample counts
[2016-08-03 17:00 INFO] Processing complete

$ tree out
out
├── group1_I1.fastq
├── group1_R1.fastq
├── group1_R2.fastq
├── group2_I1.fastq
├── group2_R1.fastq
├── group2_R2.fastq
├── group3_I1.fastq
├── group3_R1.fastq
└── group3_R2.fastq

Undetermined Subsets as Output Files

Setting --output-action to "undetermined" and running:

$ gdemux -a undetermined -o out test_R1.fastq test_barcodes.txt
[2016-08-03 17:01 INFO] Found 10 samples across 3 groups within test_barcodes.txt
[2016-08-03 17:01 INFO] Demultiplexing (mismatches=1, distance=2, quality=0)
[2016-08-03 17:01 INFO] Joining reads across groups
[2016-08-03 17:01 INFO] Validating group read counts with sample counts
[2016-08-03 17:01 INFO] Processing complete

$ tree out
out
├── group1
│   ├── Undetermined_I1.fastq
│   ├── Undetermined_R1.fastq
│   └── Undetermined_R2.fastq
├── group2
│   ├── Undetermined_I1.fastq
│   ├── Undetermined_R1.fastq
│   └── Undetermined_R2.fastq
└── group3
    ├── Undetermined_I1.fastq
    ├── Undetermined_R1.fastq
    └── Undetermined_R2.fastq
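
The two output layouts shown above differ only in where each group's reads are written. A minimal sketch of that mapping, assuming a hypothetical helper named output_path (gdemux's internals may be organized differently):

```python
from pathlib import Path


def output_path(out_dir, group, read, action="groupid"):
    """Return the destination FASTQ for one group and one read type.

    `read` is one of "R1", "R2" or "I1". The two `action` values mirror
    the --output-action choices shown in the examples above.
    """
    if action == "groupid":
        # Flat layout: out/group1_R1.fastq
        return Path(out_dir) / f"{group}_{read}.fastq"
    if action == "undetermined":
        # Nested layout: out/group1/Undetermined_R1.fastq
        return Path(out_dir) / group / f"Undetermined_{read}.fastq"
    raise ValueError(f"unknown output action: {action!r}")
```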

Example Metadata

groupid barcode
group1 AAGGCGCTCCTT
group1 GATCTAATCGAG
group1 CTGATGTACACG
group2 ACGTATTCGAAG
group2 GACGTTAAGAAT
group2 TGGTGGAGTTTC
group3 TTAACAAGGCAA
group3 AACCGCATAAGT
group3 CCACAACGATCA
group3 AGTTCTCATTAA

Extra columns can exist, and different column names can be used, though they will then need to be specified on the command line with --group-id and --barcode.

A header isn't necessary either, though you'll need to specify more options: pass --no-header and give each column as a 0-based integer index, e.g. --no-header --group-id 0 --barcode 1.
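
To make the column handling concrete, here is a rough sketch of how a barcode file could be read either by header name or by 0-based index. It is not taken from gdemux itself: the function read_groups and its parameters are hypothetical, and the sketch assumes whitespace-delimited columns.

```python
from collections import defaultdict


def read_groups(lines, group_id="groupid", barcode="barcode", header=True):
    """Map each group id to its list of barcodes.

    With header=True, `group_id` and `barcode` are column names looked up
    in the first row. With header=False, they are 0-based column indices.
    """
    rows = [line.split() for line in lines if line.strip()]
    if header:
        names = rows.pop(0)
        group_col = names.index(group_id)
        barcode_col = names.index(barcode)
    else:
        group_col, barcode_col = int(group_id), int(barcode)

    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[barcode_col])
    return dict(groups)
```

Called on the example metadata above, this would yield three groups holding 3, 3 and 4 barcodes, matching the "10 samples across 3 groups" reported in the log output.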