Mangled new lines in coverage gtf file
OXB-DC opened this issue · 4 comments
Hi,
I'm noting that StringTie 2.1.4 seems to mangle the output of the coverage gtf files (obtained with the -C <cov_refs.gtf> option), in that some lines do not have a new line character added.
I am expecting 9 fields in the output files but sometime I get more. Easy to verify with:
awk -F '\t' '{print NF}' <cov_refs.gtf> | sort -nu | tail -n 1
I have written a script (with some judicious use of tr and sed) to correct this which looks up the names of each chromosome and scaffold from the fasta file and uses them as line delimiters to repair the coverage gtfs (so I have got a workaround). This only works with GENCODE or UCSC named chromosomes though (prefixed with chr).
Is this something anyone else has noticed? I'm only using the coverage gtf to be conservative about which transcripts I report (only those covered end-to-end in reads).
Does the latest 2.2.0 version exhibit the same issue on your data? I would rather not go back to track this bug in older versions if it has been fixed already.
I will download and install latest version and set up a run to see. Was wondering if anyone else could at least run that awk command to check on any output they already have.
I can confirm that I am observing this issue with StringTie v2.2.1.
It appears to stem from a multithreading race condition. Lines aren't just missing line breaks, but multiple threads seem to be writing to the same file concurrently. Here is a partial output where this is evident
NC_001134.8 RefSeq transcript 420201 423041 . - . ID=NM_001178434.1;CDS=420201:423041;CDSphase=0;geneID=YBR086CNC_001134.8 RefSeq transcript 423765 424829 . + . ID=NM_001178435.3;CDS=423765:424829;CDSphase=0;geneID=YBR087W;gene_name=RFC5;coverage=0.21
;gene_name=IST2;coverage=0.00
which should read
NC_001134.8 RefSeq transcript 420201 423041 . - .ID=NM_001178434.1;CDS=420201:423041;CDSphase=0;geneID=YBR086C;gene_name=IST2;coverage=0.00
NC_001134.8 RefSeq transcript 423765 424829 . + . ID=NM_001178435.3;CDS=423765:424829;CDSphase=0;geneID=YBR087W;gene_name=RFC5;coverage=0.21
When using a single thread, the error does not occur, which further supports a race condition
Thank you for confirming this, indeed it seems to be a thread synchronization/resource locking issue. -C
was meant as an internal/checking option that we only used with the default single-threading run mode, did not expect to get wider use etc.
It's likely an easy fix but I am very busy at the moment so I cannot dev+test a fix right now.
Pull requests for such fixes are always welcome.