'stitcher' required files are not documented
Closed this issue · 29 comments
The help menu for the 'stitcher' program describes a taxon file that must be used, but it's not clear what format this should take or how it is used. There is a tutorial file for this program but it is empty, and I don't see any test files indicating the format.
My question is, what is the format of the taxon file and how is it used? There is an error message from 'stitcher' that indicates there should be a taxon name and contig name included in the assembly, but this is ambiguous without knowing what this means in terms of the format (I made a few guesses without success but decided to stop and ask).
I'm guessing there is a formatting step after the 'assembly' phase, or maybe the sequences going into the assembly should be in a certain format? The previous steps completed without error, suggesting the formats are as expected, so I'm looking for a little advice on the formatting.
Okay, thanks for the response.
Just to be clear, should the taxon file include all species in the query and the reference species, or just the query?
And just for example, if I have one query (Lactuca sativa
) and one reference (Helianthus annuus
) species how would I format the taxon file and the assembly/reference specifically?
The taxon file does not include the reference sequences or taxa. It is
specifically referring to the taxa that you assembled loci for.
Okay, this is what I assumed would be the case. Using the example I provided, I would have:
Lactuca sativa
in the taxon file.
The reference file should have the locus names in the name line that
matches the locus name in the output file from atram - e.g. the locus names
in your original target file.
Following the README, I used the same reference/target file for all steps and the gene/locus name is in the header so I assumed this is what is intended.
The atram assembly outfile has the taxon name and the locus name in the
name of the file.
By locus are you referring to the reference species, or do you have to split the input assembly into individual FASTA files with the locus in the file name somehow? That step is easily achieved but getting there is unclear, specifically the "locus" name in the file part. What would that look like?
The stitcher will be looking at each taxon in your taxon
list and matching those with the names of the loci from your reference file
to build a single file for each locus that has all the stitched exons for
each taxon.
This part is unclear. It is not clear what matching a taxon to a locus means or how it works. A locus being a gene name would take the form HaChr01g000001
whereas the query taxon would Lactuca sativa
, using examples I provided above. There must be some formatting step I suppose or some other process (perhaps alternative term use) that has to happen.
Did you assemble a single locus? The reference file usually has many target
loci in them in a fasta format with each locus being named it the
nameline. Here are examples of those file.
The input reference file is a set of protein sequences from a single species.
Example taxon file
TaxonA
TaxonB
I think the format I'm using here is okay.
Example Reference file - the locus names are in the name line ( make sure
it is AA not DNA) ...
I think this is fine also.
so my atram output files usually looks like these:
taxonA.CytB.filtered_contigs.fasta
taxonA.Ef1a.filtered_contigs.fasta
taxonB.CytB.filtered_contigs.fasta
taxonB.Ef1a.filtered_conitigs.fasta
This is where my results deviated. I get a single output file from the assembly step called "*all_contigs.fasta" as opposed to files with the locus name.
It sounds like your reference file might be named by
the reference taxon rather than the locus? That should be fine too. If you
were trying to assemble many loci then those need to be separated in the
reference file so aTRAM can assemble them one after the other. Use the -Q
(not -q) in the atram assembly run.
My reference is named for the species (e.g.,Helianthus_annuus.faa
, using the examples I listed above). I am trying to assemble many loci and I think this should be formatted correct based on the comments above. Okay, I will look into the -Q
option. The assembly ran without error, but I think the way I ran it did no produce the output expected for the stitching step, so I will try that and update the issue here.
Thanks for the advice.
I have the assembly step running now with the -Q
option, but I estimate it will take more than two months to finish (using one query and one target species). I'm curious if there is any practical way to speed up this process. I have specified 96 processors to be used for this step but it appears that the assemblies are done in a serial fashion, so I'm not taking full advantage of the CPUs and RAM I have available.
Also, I'm using spades for the assembly with the --careful
option. It occurred to me this might slow things down, but based on how the data is being processed I don't see much time being spent on that step for each assembly.
Any advice would be helpful, thanks.
How many shards does your library have? That is will limit the number of
processors you want to use.
Based on the preprocessor log it says there are 27 BLAST DBs. Is this what you are referring to as shards?
Also you can set up more than one run.
Do you mean split the query into chunks and run them independently? That makes sense to me. Analyzing loci on a per-chromosome basis could make a huge difference.
yes ok so if you have 27 shards then you probably dont want to assign more
than 30 cores to any run. If you have 90 cores then you could set up three
runs and split the total loci to assemble into three files and run each
with 30 cores.
Okay, I think I will explore parallelizing this run. Thanks.
One more question: how does the --fraction
option influence the run time and how should this option be used ideally? I was not sure if this should be left at the default or changed to capture more loci.
Never mind the previous question. The fraction part seems pretty clear looking at the code.
I have the assembly step running in a parallel fashion now, so I should be able to get back to the original question about the 'stitching' phase once these processes are completed.
I have arrived back at the original question at least, but it is still not clear what is expected.
Here is the error message I'm getting:
2020-05-08 04:53:28 ERROR: There were no hits. Please make sure your contig file names contain both a reference gene name and taxon name in them.
The input taxon file is like so (one species):
taxonA
The input reference looks like:
>gene1
MSVNPSIGGEIPPPIPP
The input contigs are named like so:
taxonA.gene1.filtered_contigs.fasta
and these are in a directory called filtered_contigs
. My invocation of the command was:
time ./aTRAM/atram_stitcher.py \
-T taxon_file.txt \
-t /mnt/nvme2n1/tmp \
-o stitcher_out \
--assemblies-dir=filtered_contigs \
-f '*filtered_contigs.fasta' \
--reference-genes=ref.faa \
--log-level debug \
-l stitcher_debug_may7.log
The reference is the same that was used in previous steps. The error I'm getting, as noted above, says there must be a taxon and gene name in the file but this obviously is the case.
Any help on what the program is expecting would be helpful. Thanks.
Thanks for the response.
Give it a path to the contig files with the -a command. That should get you
there.
The usage suggests that the PATH to the assemblies can be given by -a, --assemblies-dir
.
I specified that in the command above and tried the short arg version for completeness, but I get the same result.
No the files are not empty. The example I posted was for simplicity by I actually pass an ENV var to the -a
arg that is the full system path.
Hey Evan, I'm sorry you're having so much trouble here. As you have found out, the stitcher still has some rough edges. Let's see if I can help... well I'm going to try to help you help yourself.
I hope you're OK with digging into databases.
I noticed that you are using the --temp-dir argument (-t), good for debugging. It you also use the --keep-temp-dir argument you'll save all of the temp files for further analysis. The one that I'm most interested in is the atram_stitcher.sqlite.db. That's where we save the temporary database that the stitcher uses.
To look at this table I use the "DB Browser for SQLite". It available for free. If you cannot get your hands on that then using the sqlite3 command line tool is a bit more cumbersome but it'll work.
Your error telling us that there are no records in the "exonerate" table. Let see if we can find out why. The first things to look at are the "contigs" table and the "reference_genes" table. Does every contig in the contigs table have a ref_name that matches a ref_name in the reference_genes table. So the steps are:
- Add the --keep-temp-dir option to the stitcher run. Then find the created temporary directory.
- Open the temp database (atram_stitcher.sqlite.db) with either the DB browser (or sqlite3 command line).
- Look at the "contigs" table and the "reference_genes" table.
- If you don't have the browser then command line
select * from contigs;
and
select * from reference_genes;
will show you the contents of the tables.
- If you don't have the browser then command line
- We're most interested in the count of contigs in the contig table. If you have the DB browser then you can just click on the "Browse Data" tab and select the contigs table. It'll show you the count of records below. If you're using the sqlite3 command line then
select count(*) from contigs;
- Make sure that every contig's ref_name has a corresponding entry in the reference_genes table.
- if you're OK with SQL then you can
select count(*) from contigs where ref_name in (select ref_name from reference_genes);
This should match the count above.
- if you're OK with SQL then you can
That's enough to get started.
Thanks for the responses. Though, I don't think it makes sense for me to pursue this idea at this time.
Looking into databases is something I am definitely comfortable with but it needs to have a clear purpose. Spending time inspecting tables to "get started" seems like I'd be going down a rabbit hole, and not I don't see a way forward with that approach.
The solution seems likely to involve diving into the code and inserting breakpoints to inspect the issue and I think that task would be better left to the authors.
I respect this decision, of course.
FYI: I was trying to bisect where the problem could be coming from. Is it a problem with the input data or with the exonerate assembly. But you are right in that depending on where the issue occurred there would be more debugging and without the offending data I can't do much here.
I'd be happy to provide the data and scripts I'm using if you think it would help. I am using data from one species obtained from the SRA and mapping against sunflower.
The ultimate goal was to add this species to a phylogenetic tree to gain insights into gene family evolution. The relationships are already known so the placement of the genes could be informative.
And to be clear, I do want to help. This is a cool project but I'm burdened with a lot of other tasks at the moment.
Your data would help for for sure. I cannot guarantee a fix but it would definitely make it much more likely. When you get the chance, please send your input files along (or links to them) as well as the related log file.
PS: I sorry if I came off as snarky and/or rude. Not my intent, even in the slightest. You've already been quite helpful in identifying the other bug and this is the kind of thing we want to foster here.
PPS: I understand being overburdened and I don't want to add to that. And just so you're aware, I'm on Dr. Allen's projects only on Fridays so if there is a lag it's because of that.
I can send the links for sure. No reason to worry about the responses, I never perceived any tone other than professional and helpful, and they were very timely. I hope my comments were received in the same manner.
Also, when if comes to code I would always err on the side of being direct and descriptive rather than trying to be cordial. People always want a quick solution and will never fault you for trying to help in that respect.
Thanks for the update. This could indeed be related, but when I have the chance to try this approach again I will share my findings related to the original issue.
I see never shared a link to the files, so I'm sorry about that. That makes debugging very hard, but I will try to provide some useful information when I get the chance.