mcveanlab/mccortex

Best practice for large datasets?

JohnsonStev opened this issue · 4 comments

Dear Isaac,

I would like to apply mccortex to a large-scale resequencing project (~400 individuals, 1GB genome size).
I read through the wiki, and here is what I think a possible workflow might look like:

  1. Build graphs for each sample and reference with one chosen kmer size
  2. Clean each of the graphs
  3. Merge the clean graphs
  4. Read threading to produce link files
  5. Clean link files
  6. Merge the clean link files
  7. Call the variants

Do you have any suggestions about the workflow, or are there any pitfalls I need to be aware of?
Thank you so much.

That should work in principle. If you don't hear back from Isaac, you might try asking @kvg.

Thanks for the answer. I am trying to merge all the clean graphs in one single command, and it took a lot of time.
Would it save time to merge a few graphs in parallel first, and then merge those merged graphs?
Thank you again.

kvg commented

McCortex loads all the graphs into memory before joining them, and yes, this can be a bit slow. I think what you've outlined would be faster, but it's not clear that the improvement would be particularly significant (I'd imagine it depends on the contents of the graphs - particularly the number of shared k-mers between each sample).
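To make the "merge in batches, then merge the merged graphs" idea concrete, here is a minimal planning sketch. It only decides which graphs are merged together in each round and does not invoke mccortex itself. The function name `plan_merges`, the batch size, and the intermediate file names are all hypothetical choices for illustration.

```python
from typing import List, Tuple

# (input graph paths, output graph path) for one merge job
MergeStep = Tuple[List[str], str]


def plan_merges(graphs: List[str], batch_size: int = 4) -> List[List[MergeStep]]:
    """Group graphs into batches, round by round, until one graph remains.

    Steps within a round do not depend on each other, so they could be
    run in parallel; each round depends on the outputs of the previous one.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rounds: List[List[MergeStep]] = []
    current = list(graphs)
    round_no = 0
    while len(current) > 1:
        steps: List[MergeStep] = []
        next_level: List[str] = []
        for i in range(0, len(current), batch_size):
            batch = current[i:i + batch_size]
            if len(batch) == 1:
                # A lone leftover graph needs no merge; carry it forward.
                next_level.append(batch[0])
                continue
            # Hypothetical naming scheme for intermediate graphs.
            out = f"merged.r{round_no}.b{i // batch_size}.ctx"
            steps.append((batch, out))
            next_level.append(out)
        rounds.append(steps)
        current = next_level
        round_no += 1
    return rounds


if __name__ == "__main__":
    samples = [f"sample{n}.clean.ctx" for n in range(10)]
    for number, steps in enumerate(plan_merges(samples, batch_size=4)):
        for inputs, output in steps:
            print(f"round {number}: {' '.join(inputs)} -> {output}")
```

Whether this beats a single large merge would still depend on how many k-mers the samples share, since every intermediate graph has to be written out and loaded again.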

An alternate strategy that might help you is the "Join" command we wrote in a companion tool, Corticall. This assumes your graphs are stored in sorted order (with the '-s' option in mccortex commands), and then the graphs are merged linearly. This tends to be much faster than the built-in McCortex join command; I've used this to merge a couple hundred microbial genomes. The resulting joined graphs will remain compatible with all of mccortex's subcommands.
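As a rough illustration of why sorted inputs allow a linear merge, the sketch below combines several sorted record streams without loading any of them fully into memory. It is a toy model, not Corticall's actual implementation: a graph record is reduced to a hypothetical `(kmer, coverage)` pair, k-mers are compared as plain strings, and each input graph is given its own coverage column in the output.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Toy stand-in for one graph record: (k-mer, coverage)
Record = Tuple[str, int]


def _tag(index: int, records: Iterable[Record]) -> Iterator[Tuple[str, int, int]]:
    """Attach the index of the source graph to every record."""
    for kmer, coverage in records:
        yield kmer, index, coverage


def merge_sorted_graphs(
    graphs: List[Iterable[Record]],
) -> Iterator[Tuple[str, List[int]]]:
    """Merge k-mer-sorted record streams in a single linear pass.

    Yields (kmer, per-graph coverage list). Only the current record of
    each stream is held at any time, which is what sorting buys us.
    """
    streams = [_tag(i, g) for i, g in enumerate(graphs)]
    # Tuples compare by k-mer first, so the merged stream stays sorted.
    merged = heapq.merge(*streams)
    for kmer, group in groupby(merged, key=lambda record: record[0]):
        coverage = [0] * len(graphs)
        for _, index, cov in group:
            coverage[index] += cov
        yield kmer, coverage


if __name__ == "__main__":
    graph_a = [("AAC", 2), ("ACG", 1)]
    graph_b = [("ACG", 3), ("TTT", 5)]
    for kmer, coverage in merge_sorted_graphs([graph_a, graph_b]):
        print(kmer, coverage)
```

If any input is not actually sorted, the merged stream is no longer sorted either and the same k-mer can appear more than once in the output, so the sortedness assumption matters.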

After downloading and building Corticall, the command-line for this would be:

$ java -jar build/jars/corticall.jar Join -g <graph_1.sorted.ctx> -g <graph_2.sorted.ctx> ... -g <graph_N.sorted.ctx> -o joined.ctx
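With several hundred samples, typing one -g per graph by hand is error-prone, so the argument list could be assembled programmatically. The sketch below uses only the flags shown in the command above (-g and -o); the helper name `corticall_join_command` and the example file names are hypothetical, and the jar path is assumed to match the build location shown above.

```python
from typing import List


def corticall_join_command(
    graphs: List[str],
    output: str,
    jar: str = "build/jars/corticall.jar",  # path taken from the command above
) -> List[str]:
    """Build the argument list for a Corticall Join over sorted graphs."""
    if not graphs:
        raise ValueError("at least one input graph is required")
    command = ["java", "-jar", jar, "Join"]
    for graph in graphs:
        # One -g flag per sorted input graph, as in the command above.
        command += ["-g", graph]
    command += ["-o", output]
    return command


if __name__ == "__main__":
    # Hypothetical sample names; replace with the real sorted graph paths.
    inputs = [f"sample{n}.sorted.ctx" for n in range(1, 4)]
    print(" ".join(corticall_join_command(inputs, "joined.ctx")))
```

The returned list can be passed to `subprocess.run(command, check=True)` to launch the join, which avoids shell quoting problems with unusual file names.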

Please let me know if that does or doesn't work for you.

Dear KVG,

Thanks for your response, I will try it.

Meanwhile, I am still running through the whole workflow on a subset of the data.
I've done link threading and link error cleaning for each sample. Now I am trying to merge the link files, but I don't know how to generate a "ref.ctp.gz" or "refAndSamples.ctp.gz" file. All I got after running "thread" are "sample.ctp.gz" files.

Thank you so much for the help!