StevenWingett/FastQ-Screen

Pairing filtered FASTQ files

Closed this issue · 10 comments

FastQ Screen processes paired FASTQ files independently. This means that the order of the reads in the filtered results (derived from paired input files) will most likely not correspond to one another.

Come up with a solution that pairs reads in FASTQ files (a read may be present in one file, but not in its pair). Also, check whether the output is in the same order as the input for this processing.
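
For small files, the pairing asked for above can be sketched in plain shell and awk. This is only an illustration of the idea, not how FastQ Screen or any tool mentioned in this thread works. The function name `pair_fastq` and the output file names are hypothetical. It assumes four-line FASTQ records, uncompressed input, and read headers where the ID is everything before the first whitespace (as in Illumina headers of the form `@ID 1:N:0:...`). It holds the whole R2 file in memory, so it is not suited to large datasets.

```shell
# pair_fastq R1 R2 PREFIX   (hypothetical helper, for illustration only)
# Writes PREFIX_R1.fq and PREFIX_R2.fq (paired reads, in R1 order)
# and PREFIX_orphans.fq (reads whose mate is missing).
pair_fastq() {
    r1=$1
    r2=$2
    prefix=$3
    awk -v out1="${prefix}_R1.fq" \
        -v out2="${prefix}_R2.fq" \
        -v orph="${prefix}_orphans.fq" '
    # Read ID = header line up to the first space or tab.
    function read_id(header) { sub(/[ \t].*$/, "", header); return header }

    # First file on the command line (R2): keep every record in memory,
    # keyed by read ID.
    FNR == NR {
        if (FNR % 4 == 1) key = read_id($0)
        rec[key] = (FNR % 4 == 1) ? $0 : rec[key] "\n" $0
        next
    }

    # Second file (R1): rebuild each four-line record, then look up its mate.
    {
        if (FNR % 4 == 1) { key = read_id($0); buf = $0 }
        else              { buf = buf "\n" $0 }
        if (FNR % 4 == 0) {
            if (key in rec) {
                print buf      > out1
                print rec[key] > out2
                delete rec[key]
            } else {
                print buf > orph
            }
        }
    }

    # Whatever is left from R2 never found a mate in R1.
    END { for (k in rec) print rec[k] > orph }
    ' "$r2" "$r1"
}
```

A call might look like `pair_fastq sample.R1.tagged_filter.fastq sample.R2.tagged_filter.fastq sample`. Because R2 is looked up by ID rather than by position, the paired output does not depend on the two inputs being in the same order.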

We just ran into this problem and found https://github.com/linsalrob/fastq-pair

It doesn't handle gz files, so that's an inconvenience...

and this solution: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/

Thank you, that is worth knowing. Have you tried this software? Does it work with FASTQ files generated by FastQ Screen?

Best regards,
Steven

We will keep you apprised; we should have it worked out by Wednesday, if not sooner. Based on the documentation for both, it seems that both should work, as long as the files from fastq_screen are unzipped first.
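
The unzip step mentioned above could look something like the following. It is a minimal sketch: the function name `decompress_fastq_dir` is hypothetical, and it assumes the compressed FastQ Screen output ends in `.fastq.gz`. It keeps the original `.gz` files and skips any file whose uncompressed copy already exists.

```shell
# decompress_fastq_dir [DIR]   (hypothetical helper, for illustration only)
# Writes an uncompressed copy of every DIR/*.fastq.gz next to the original.
decompress_fastq_dir() {
    dir=${1:-.}
    for gz in "$dir"/*.fastq.gz; do
        # If the glob matched nothing, the pattern is left unexpanded.
        [ -e "$gz" ] || continue
        out=${gz%.gz}
        if [ -e "$out" ]; then
            echo "skipping $gz: $out already exists" >&2
            continue
        fi
        # -d decompress, -c write to stdout so the .gz file is left in place
        gzip -dc "$gz" > "$out"
    done
}
```

Running `decompress_fastq_dir .` in the directory holding the filtered files would leave both the `.fastq.gz` and `.fastq` versions side by side.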

repair.sh in the BBMap package works.

bash script

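# Input directory, output directory, and suffix of the FastQ Screen filtered files
INDIR=.
OUTDIR=.
EXTENSION=.tagged_filter.fastq

# -p: do not fail when the output directory already exists (it does when OUTDIR=.)
mkdir -p "$OUTDIR"

# List the filtered files, reduce each name to its sample prefix (strip the
# directory and the trailing R1/R2 plus extension), deduplicate, then run
# repair.sh on each R1/R2 pair, up to 32 jobs at a time.
ls "$INDIR"/*"$EXTENSION" | \
sed -e "s/R[12]$EXTENSION//" -e 's/.*\///g' | \
uniq | \
parallel --no-notice -j32 \
        repair.sh \
        in1="$INDIR"/{}R1"$EXTENSION" \
        in2="$INDIR"/{}R2"$EXTENSION" \
        out1="$OUTDIR"/{}repr_R1.fq \
        out2="$OUTDIR"/{}repr_R2.fq \
        outs="$OUTDIR"/{}repr_orph.fq \
        repair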

output

java -ea -Xmx161330m -cp /opt/conda/opt/bbmap-38.90-1/current/ jgi.SplitPairsAndSingles rp in1=./20190905-PIRE-Ssp-A-P1P1-L4.R1.tagged_filter.fastq in2=./20190905-PIRE-Ssp-A-P1P1-L4.R2.tagged_filter.fastq out1=./20190905-PIRE-Ssp
Executing jgi.SplitPairsAndSingles [rp, in1=./20190905-PIRE-Ssp-A-P1P1-L4.R1.tagged_filter.fastq, in2=./20190905-PIRE-Ssp-A-P1P1-L4.R2.tagged_filter.fastq, out1=./20190905-PIRE-Ssp-A-P1P1-L4.repr_R1.fq, out2=./20190905-PIRE-Ssp-A

Set INTERLEAVED to false
Started output stream.

Input:                          451303 reads            67441076 bases.
Result:                         451303 reads (100.00%)  67441076 bases (100.00%)
Pairs:                          412164 reads (91.33%)   61610055 bases (91.35%)
Singletons:                     39139 reads (8.67%)     5831021 bases (8.65%)

Time:                           3.550 seconds.
Reads Processed:        451k    127.14k reads/sec
Bases Processed:      67441k    19.00m bases/sec

Proof of success

paste <(head -n40 20190905-PIRE-Ssp-A-P1P1-L4.repr_R1.fq | paste - - - - | cut -f1) <(head -n40 20190905-PIRE-Ssp-A-P1P1-L4.repr_R2.fq | paste - - - - | cut -f1)
@K00124:439:H3LKVBBXX:4:1101:13880:1801 1:N:0:ATTACTCG+TCAGAGCC#FQST:Human:Ecoli:PhiX:Contamnts_Adaptrs:Vectors:Bacteria_24k:Protists_1k:Viruses_bt2:Fungi_RefSeq_release:Coral:Bird:Plant:All_Fish:0000000000000                   @K00124:439:H3LKVBBXX:4:1101:13880:1801 2:N:0:ATTACTCG+TCAGAGCC#FQST:Human:Ecoli:PhiX:Contamnts_Adaptrs:Vectors:Bacteria_24k:Protists_1k:Viruses_bt2:Fungi_RefSeq_release:Coral:Bird:Plant:All_Fish:0000000000000
@K00124:439:H3LKVBBXX:4:1101:15250:1854 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000      @K00124:439:H3LKVBBXX:4:1101:15250:1854 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:5822:1872 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000       @K00124:439:H3LKVBBXX:4:1101:5822:1872 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:13758:1942 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000      @K00124:439:H3LKVBBXX:4:1101:13758:1942 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:13941:1977 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000002      @K00124:439:H3LKVBBXX:4:1101:13941:1977 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:13738:2012 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000002      @K00124:439:H3LKVBBXX:4:1101:13738:2012 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:6573:2047 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000       @K00124:439:H3LKVBBXX:4:1101:6573:2047 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:7243:2047 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000       @K00124:439:H3LKVBBXX:4:1101:7243:2047 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:22952:2083 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000      @K00124:439:H3LKVBBXX:4:1101:22952:2083 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:1101:26829:2083 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000      @K00124:439:H3LKVBBXX:4:1101:26829:2083 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000

paste <(tail -n40 20190905-PIRE-Ssp-A-P1P1-L4.repr_R1.fq | paste - - - - | cut -f1) <(tail -n40 20190905-PIRE-Ssp-A-P1P1-L4.repr_R2.fq | paste - - - - | cut -f1)
@K00124:439:H3LKVBBXX:4:2224:28067:72367 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:28067:72367 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000001
@K00124:439:H3LKVBBXX:4:2224:29488:72367 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:29488:72367 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:16498:72473 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:16498:72473 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:9303:72491 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000002      @K00124:439:H3LKVBBXX:4:2224:9303:72491 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:29518:72491 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:29518:72491 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:23561:72543 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:23561:72543 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:9222:72561 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000002      @K00124:439:H3LKVBBXX:4:2224:9222:72561 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:7415:72596 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000001      @K00124:439:H3LKVBBXX:4:2224:7415:72596 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
@K00124:439:H3LKVBBXX:4:2224:16752:72631 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:16752:72631 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000002
@K00124:439:H3LKVBBXX:4:2224:30431:72631 1:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000     @K00124:439:H3LKVBBXX:4:2224:30431:72631 2:N:0:ATTACTCG+TCAGAGCC#FQST:0000000000000
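
The head and tail comparisons above cover only the first and last ten records of each file. A whole-file version of the same check can be sketched as follows. The function name `count_mispaired` is hypothetical, and the sketch assumes four-line FASTQ records with no tab characters in the header lines and a read ID that ends at the first space. It prints the number of record positions at which the R1 and R2 read IDs differ, so a correctly paired set of files should give 0.

```shell
# count_mispaired R1 R2   (hypothetical helper, for illustration only)
# Prints how many record positions have different read IDs in R1 and R2.
count_mispaired() {
    # paste joins line N of R1 and line N of R2 with a tab; if one file is
    # shorter, its side is empty and the position counts as a mismatch.
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {
            # Keep only the part of each header before the first space.
            split($1, a, " ")
            split($2, b, " ")
            if (a[1] != b[1]) n++
        }
        END { print n + 0 }
    '
}
```

For example, `count_mispaired sample.repr_R1.fq sample.repr_R2.fq` (file names here are placeholders) could be run on each pair of repaired files.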

Hi,

Thanks for your work. Just to check, this worked on files processed by FastQ Screen? I could make a note of this software in the FastQ Screen documentation to alert users to this feature if they should need it.

Thanks,
Steven

No problem, thank you for making fastq_screen!

Yes, repair.sh was applied directly to the fastq_screen output after filtering. You can see the #FQST: tags in the read names above.

In my experience, unless it is prominently noted in the documentation, the disassociation of reads in the R1 and R2 files will derail the average molecular ecology grad student for a bit. I noticed the issue before encountering a downstream error, when I was reviewing the MultiQC report and switched a graph from proportions to counts.

Lastly, we didn't try https://github.com/linsalrob/fastq-pair because it wasn't already loaded on our HPC.

Cheers,

Chris

Hi again,

Thanks for that - that is useful to know. Just to double-check, this is the tool you used: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/

Many thanks,
Steven

Yes, repair.sh in the BBMap / BBTools package.

Thanks for that. I've updated the documentation accordingly. This will make it into the next release.

Thanks,
Steven