Summary function not working with default GORG database
Opened this issue · 1 comments
juliambrosman commented
text in the summary file states:
"custom databases without the expected columns are not supported in the summary function"
I ran:
nextflow run BigelowLab/gorg-classifier -profile docker --seqs 'data/metagenomes/SRR*1.fastq.gz' --cpus 8
the metagenome I used was downloaded like so:
fastq-dump --split-files --skip-technical -N 0 -X 300000 --gzip --readids SRR6507279
brwnj commented
I fixed the bug related to this issue in 23b3f70. I did have to update these read names though because they should be the same in R1 and R2.
Here's the name fix I did:
def readfq(fp): # this is a generator function
last = None # this is a buffer keeping the last unprocessed line
while True: # mimic closure; is it a bad idea?
if not last: # the first record or a record following a fastq
for l in fp: # search for the start of the next record
if l[0] in '>@': # fasta/q header line
last = l[:-1] # save this line
break
if not last: break
name, seqs, last = last[1:].partition(" ")[0], [], None
for l in fp: # read the sequence
if l[0] in '@+>':
last = l[:-1]
break
seqs.append(l[:-1])
if not last or last[0] != '+': # this is a fasta record
yield name, ''.join(seqs), None # yield a fasta record
if not last: break
else: # this is a fastq record
seq, leng, seqs = ''.join(seqs), 0, []
for l in fp: # read the quality
seqs.append(l[:-1])
leng += len(l) - 1
if leng >= len(seq): # have read enough quality
last = None
yield name, seq, ''.join(seqs); # yield a fastq record
break
if last: # reach EOF before reading enough quality
yield name, seq, None # yield a fasta record instead
break
if __name__ == "__main__":
import sys
for name, seq, qual in readfq(sys.stdin):
name = name.rpartition(".")[0]
print("@" + name, seq, "", qual, sep="\n")
Then
zcat SRR6507279_1.fastq.gz | python update_names.py | gzip > example/SRR6507279_1.fastq.gz
# etc.