V-Z/sondovac

Another failure during fastq to fasta conversion

johnchau opened this issue · 8 comments

Hello,

I am attempting to use Sondovac with my own data.

I ran into a problem during Step 5:

"Converting FASTQ to FASTA. This may take longer time.
fastq_to_fasta: Premature End-Of-File (filename ='output_combined_reads_no_cp_no_mt_reads.extendedFrags.fastq')

Error! Conversion of FASTQ to FASTA failed. Aborting. Check if file
output_combined_reads_no_cp_no_mt_reads.extendedFrags.fastq is correct."

I believe there is a problem with the fastq file. The fastq file, as well as the .hist and .histogram files, from step 4 are empty (zero bytes), but not the .notCombined_#.fastq files. I think it is because no combined pairs were actually made. My genomic data are from 100 bp reads for ~300 bp fragments, so I don't expect any pairs to overlap.

Is it still possible to use Sondovac?

Please first check which FASTX-Toolkit version you use. You should use version 0.0.14. In an older version of Sondovac this was stated wrongly (it said version 0.0.13.).

If this doesn't help it is quite likely that no reads were combined. Although, even if you selected for 300 bp fragments there is usually a proportion of shorter (and longer) fragments, because size selection is not so precise, and at least few reads could have been combined by FLASH. You could check that by opening the *_combined_reads_no_cp_no_mt_reads file, in case it is written.

Generally, it should be possible to design probes from non-overlapping paired-end reads, although the quality of the de novo assembly (step 7) might be lower compared to using combined reads as input for the assembly. You could outcode step 4, the FLASH step, copy the forward and reverse reads into one file and progress with step 5, if you would like to continue with Sondovac on your data.

I am using FASTX-Toolkit version 0.0.14, and the *_combined_reads_no_cp_no_mt_reads file is completely empty (0 combined pairs).

Is there a way to start with step 5 in Sondovac and specify what files the program uses?

V-Z commented

Hello,
I'm sorry for late answer, I was on holiday, and I'm sorry You have troubles running our script. I need full command You use to run Sondovač, details about Your operating system (at least name and version) and also I need content of the directory with output files (ls -la).

There is no possibility to start with step 5. I realized that for your dataset it is indeed impossible to combine the reads (I confused read length and fragment size in my recent comment), so the only possibility I see is to modify Sondovac part A in such a way that the FLASH step is omitted and instead the forward and reverse reads, after the removal of plastid (and mitochondrial) reads, are copied into one file, which should then be converted from .fastq to .fasta.

Thank you for the replies. I was able to perform steps 5 and 6 using the
uncombined reads from step 3 by using the code from Sondovac rather than
doing it within the program.

I now have a question about step 8. What exactly are considered exons vs.
loci in the program? In other words, the length of what sequences are used
to determine the length of exons and the length of loci?
When I run Sondovac part b, I am getting much higher numbers of genes of a
certain length in the table "Number of assembled sequences" (for example,
in one run, the table says there are 931 genes for genes of length >600 bp)
than in the output of that step (following the example above, the output of
step 8 has only 71 genes). Are the two numbers of genes supposed to be the
same? If not, what would cause such a large difference.

Thank you for your help!

Exons are synonymous to contigs that are obtained from the de novo assembly. A locus comprises all contigs that belong to the same transcript. As each transcript has a unique number, the names of the probes in the final probe file (and intermediate steps) should be interpreted as follows (example):
Assembly_10081_Contig_1_398
Assembly_10081_Contig_2_316
Assembly_10081_Contig_10_329
Assembly_23121_Contig_1_145
Assembly_23121_Contig_2_262
Locus / assembly (could also be called gene, although it it only part of a gene, as (1) the exons are randomly concatenated to a locus, (2) certain exons could be missing, and (3) intronic sequences are missing anyway) of transcript number 10081 comprises three exons / contigs (1, 2, 10). The exon lengths, which are given as the last number in each probe name, sum up to 1043 bp, and locus 10081 would pass the >600 bp minimum locus length filter. Locus 23121 would not pass this length filter, as the exons sum up to 407 bp.
The table and the outfile of step 8 should have the same gene number. You only loose exons and loci in steps 9 and 10, and you might need to remove few in step 11. Do you use Sondovac v.1.2? I found issues with the sequence summary stats in an older version but thought I had fixed that.

Thank you for the clarification! That makes total sense now.

I am using v.1.2. I think my issue with step 8 has to do with the fact that I am using Geneious version 9, and the contigs/assemblies are named differently. I tried running the code outside Sondovac, and when I got to the step for "# Modify labels of FASTA sequences to ensure them to work correctly", I got the error message "invalid command code S". I skipped that step and proceeded and ran into an issue with the step "# Extract and sort the exons making up genes of a certain minimum total length". The output has sequence names that are different from those in other files ("Assembly_###_Assembly" vs. "###_Assembly" in all other files). I got around it by editing the output file so the sequence names matched those in the other files. I'm not sure if the issue came up because I skipped the step "# Modify labels of FASTA sequences to ensure them to work correctly" or if it is an issue not addressed from using Geneious v.9.
Here is an example of an assembly name in the fasta file from Geneious v.9: >000000020181 Assembly
And the assembly names in the SEQUENCESTAB file looks like: 000000020181_Assembly

V-Z commented

I'm sorry, @johnchau, I'm bit confused now...
You tried to run part A with uncombined data and it failed. It's expectable, I think. Anyway, when You copy/paste the commands from the script file, where was the problem with part A? Which command did not work? To reproduce it, I need to know details about Your operating system, versions of software in use and so on. And yes, I wish to fix it, although we do not plan (for now) to support uncombined reads.
Regarding the step 8, do You copy/paste the commands instead of running the script? Which line in the script is failing? Again, what is Your operating system and versions?
I tested it with Geneious 9, but it is possible I didn't covered every possible nuance... :-/ My Geneious 9 FASTA looks like >25417_Assembly 245 reads from 25417.