Best practice for large datasets?
JohnsonStev opened this issue · 4 comments
Dear Isaac,
I would like to apply mccortex to a large-scale resequencing project (~400 individuals, 1GB genome size).
I read through the wiki, and here is what I think a possible workflow might look like:
- Build graphs for each sample and reference with one chosen kmer size
- Clean each of the graphs
- Merge the clean graphs
- Read threading to produce link files
- Clean link files
- Merge the clean link files
- Call the variants
Do you have any suggestions about the workflow, or are there any pitfalls I should be aware of?
Thank you so much.
That should work in principle. If you don't hear back from Isaac, you might try asking @kvg.
Thanks for the answer. I am trying to merge all the clean graphs in a single command, and it is taking a lot of time.
Would it be faster to merge a few graphs in parallel first, then merge those merged graphs?
Thank you again.
McCortex loads all the graphs into memory before joining them, and yes, this can be a bit slow. I think what you've outlined would be faster, but it's not clear that the improvement would be particularly significant (I'd imagine it depends on the contents of the graphs - particularly the number of shared k-mers between each sample).
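To make the two-stage idea concrete, here is a minimal planning sketch. It only prints which graphs would go into which merge and runs no join command. The function name, the `batch_N.ctx` and `merged.ctx` file names, and the batch size are all made up for illustration; substitute whatever join command and naming you actually use.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a two-stage merge: split the graph list into batches,
# merge each batch (these jobs could run in parallel), then merge the results.
# Nothing is executed; each planned merge is printed as "<output> <- <inputs>".
set -euo pipefail

BATCH_SIZE="${BATCH_SIZE:-3}"   # graphs per first-stage merge (illustrative value)

plan_two_stage_merge() {
  local batch_size=$1; shift
  local -a graphs=("$@")
  local -a intermediates=()
  local i n=0 out
  # Stage 1: one merge per batch of up to batch_size graphs.
  for ((i = 0; i < ${#graphs[@]}; i += batch_size)); do
    n=$((n + 1))
    out="batch_${n}.ctx"        # hypothetical intermediate file name
    echo "merge: $out <- ${graphs[*]:i:batch_size}"
    intermediates+=("$out")
  done
  # Stage 2: merge the per-batch outputs into one graph.
  echo "merge: merged.ctx <- ${intermediates[*]}"
}

plan_two_stage_merge "$BATCH_SIZE" s1.clean.ctx s2.clean.ctx s3.clean.ctx s4.clean.ctx s5.clean.ctx
```

Whether this is actually faster still depends on how many k-mers the samples share, so it is worth timing on a small subset first.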
An alternate strategy that might help you is the "Join" command we wrote in a companion tool, Corticall. This assumes your graphs are stored in sorted order (with the '-s' option in mccortex commands), and then the graphs are merged linearly. This tends to be much faster than the built-in McCortex join command; I've used this to merge a couple hundred microbial genomes. The resulting joined graphs will remain compatible with all of mccortex's subcommands.
After downloading and building Corticall, the command line for this would be:
$ java -jar build/jars/corticall.jar Join -g <graph_1.sorted.ctx> -g <graph_2.sorted.ctx> ... -g <graph_N.sorted.ctx> -o joined.ctx
Please let me know if that does or doesn't work for you.
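With a few hundred samples, typing every `-g` argument by hand gets tedious, so here is a small sketch that assembles the Join command above from a list of sorted graphs. It is a dry run that only prints the command. The function name and the sample file names are placeholders; the jar path and the `-g`/`-o` options are taken from the command above.

```shell
#!/usr/bin/env bash
# Sketch: assemble the Corticall Join command line from a list of sorted graphs.
# Dry run only: the command is printed, not executed.
set -euo pipefail

CORTICALL_JAR="${CORTICALL_JAR:-build/jars/corticall.jar}"  # jar path as in the command above

build_join_cmd() {
  local out=$1; shift
  local -a cmd=(java -jar "$CORTICALL_JAR" Join)
  local g
  for g in "$@"; do
    cmd+=(-g "$g")      # one -g per input graph
  done
  cmd+=(-o "$out")
  printf '%s\n' "${cmd[*]}"
}

# Placeholder file names; in practice you might pass a glob such as sorted/*.sorted.ctx
build_join_cmd joined.ctx sample_1.sorted.ctx sample_2.sorted.ctx sample_3.sorted.ctx
```

To run the command instead of printing it, replace the `printf` line with `"${cmd[@]}"`, which also keeps paths containing spaces intact.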
Dear KVG,
Thanks for your response, I will try it.
Meanwhile I am still working on running through the whole workflow using a subset of data.
I've done link threading plus link error cleaning for each sample. Now I am trying to merge the link files, but I don't know how to generate a "ref.ctp.gz" or "refAndSamples.ctp.gz" file. All I got after running "thread" were the "sample.ctp.gz" files.
Thank you so much for the help!