Support for Windows-style linebreaks
jjcrawford opened this issue · 1 comments
Hi Eileen,
I think the FASTA standard is that both Unix- and Windows-style line breaks should be valid for the format[1], yet I believe RATTLE currently segfaults when cluster_reads() is run on a FASTA or FASTQ file with Windows-style (CR + LF) line breaks.
Unfortunately, std::getline() splits only on LF by default, and on a Unix-like system the stream does no newline translation. So if you std::getline() through a Windows-format file, each string you get will still have a sneaky carriage return character (CR, '\r') hiding at the end of it.
In RATTLE, std::getline() is used in both read_fasta_file() and read_fastq_file(), and no precaution is taken to strip a possible trailing carriage return, so the instances of the read_t struct get built with each member still carrying a carriage return at the end.
Then cluster_reads() passes this read_set_t to extract_reads_from_kmer(), which in turn calls reverse_complement() from utils.cpp. The reverse_complement() function iterates through each character of the seq member of those read_t instances, calls .find() on the unordered map base_complements (utils.hpp) to look up the corresponding complement pair, and then attempts to access the second member of this pair.
This will cause a segfault when .find() is given a character which doesn't appear in base_complements: .find() then returns the map's end() iterator, and dereferencing that is undefined behaviour, which in practice shows up as a crash. This would usually be fine (albeit still not graceful), since it would mean that the sequence contained characters other than the nucleotides A, C, G, T, or U and hence was not well-formatted. In the case of a Windows-format file, however, it will erroneously attempt to find the base complement of that pesky carriage return hiding at the end of the seq and immediately segfault.
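To illustrate the lookup problem, here is a rough sketch of a checked variant. The map below is only a stand-in for base_complements (I'm assuming the five nucleotides mentioned above; the real map in utils.hpp may contain more), and reverse_complement_checked is a hypothetical name:

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Stand-in for base_complements in utils.hpp; the real contents may differ.
static const std::unordered_map<char, char> base_complements_sketch = {
    {'A', 'T'}, {'C', 'G'}, {'G', 'C'}, {'T', 'A'}, {'U', 'A'}
};

// Hypothetical checked variant: compares the iterator against end() before
// dereferencing it, and throws instead of invoking undefined behaviour.
std::string reverse_complement_checked(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (auto it = seq.rbegin(); it != seq.rend(); ++it) {
        const auto hit = base_complements_sketch.find(*it);
        if (hit == base_complements_sketch.end()) {
            const int code = static_cast<unsigned char>(*it);
            throw std::invalid_argument(
                "unexpected character in sequence (code " + std::to_string(code) + ")");
        }
        out.push_back(hit->second);
    }
    return out;
}
```

With this, a sequence such as "ACGT\r" raises an exception that names character code 13 rather than crashing.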
The most obvious solution is probably just to strip any carriage return characters from the FASTA/FASTQ file as it is being read in fasta.cpp. It might also be worth checking for any other non-nucleotide characters in the sequences as they're being read in, so that RATTLE can die gracefully with an informative error message rather than just plainly segfault.
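For the second suggestion, something along these lines could run on each sequence as it is read. This is only a sketch: validate_sequence is a hypothetical name, and the allowed alphabet is my assumption based on the nucleotides listed above (lower-case letters or ambiguity codes such as N would need adding if RATTLE accepts them):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical read-time check, not part of RATTLE. Throws with the read
// header, the offending character code, and its 0-based position.
void validate_sequence(const std::string& header, const std::string& seq) {
    static const std::string allowed = "ACGTU";  // assumed alphabet
    const std::size_t bad = seq.find_first_not_of(allowed);
    if (bad != std::string::npos) {
        const int code = static_cast<unsigned char>(seq[bad]);
        throw std::runtime_error(
            "read '" + header + "': invalid character (code " +
            std::to_string(code) + ") at position " + std::to_string(bad));
    }
}
```

The caller could catch the exception at the top level, print the message, and exit with a non-zero status.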
Thanks for finally giving the project some long-awaited maintenance!
Kind Regards,
Jack
Hi Jack,
A new update has been pushed to fix this Windows/Unix-style line break issue.
Please let me know if it doesn't work.
Thanks,
Eileen