Shamir-Lab/SCAPP

SCAPP running for a long time and then, segfault

eperezv opened this issue · 2 comments

Hello,

I am trying to run SCAPP on my dataset. For some samples it runs fast (around 1 day), but for others it takes much longer; one has been running for more than 15 days. The "scapp.log" file was being updated the whole time, so I assumed it was running properly. However, today it failed with a segfault.

Is it normal for it to take that long? I set the program to run on 30 threads, but it uses only 1 almost all the time. The file I'm running SCAPP on is around 816 MB, obtained from metaSPAdes (up to k=99), with 387,000 contigs (67,000 above 1000 bp).
Is there any way to subset the fastg file from metaSPAdes (to remove short contigs) before using it as input for SCAPP?

Thank you,
Best regards

Hi, thanks for your interest in using SCAPP.

SCAPP will run for varying times depending on the sample and can take a very long time for large, complex samples as the core of the algorithm scales as O(n^3) for n nodes in a component of the assembly graph (in practice it is not that extreme).

It is hard to know how complex your graph is (# of nodes in the largest component, node degrees, lengths of potential paths, etc.), but the file size and # of contigs should provide a rough estimate. For reference, we have run SCAPP on files larger than the one you describe in less than a day (16 threads).
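To get a rough idea of the component sizes mentioned above, something like the following Python sketch could be run on the fastg file. It assumes SPAdes-style headers of the form `>EDGE_<id>_length_<len>_cov_<cov>[']:<neighbours>;` and treats each contig as one undirected node, ignoring strand. That is a simplification for estimation only and is not necessarily how SCAPP builds its internal graph; the function names are made up for this example.

```python
import re
import sys
from collections import defaultdict

# Assumed SPAdes-style edge name: EDGE_<id>_length_<len>_cov_<cov>
EDGE_ID = re.compile(r"EDGE_(\d+)_length_(\d+)_cov_")


def parse_fastg_headers(lines):
    """Yield (edge_id, length, neighbour_ids) for each recognised header line."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        header = line[1:].strip().rstrip(";")
        name, _, neighbours = header.partition(":")
        m = EDGE_ID.match(name)
        if m is None:
            continue  # header not in the assumed format
        nbr_ids = []
        for n in neighbours.split(","):
            nm = EDGE_ID.match(n)
            if nm:
                nbr_ids.append(nm.group(1))
        yield m.group(1), int(m.group(2)), nbr_ids


def component_sizes(lines):
    """Return connected-component sizes (in contigs), largest first."""
    adj = defaultdict(set)
    for edge_id, _length, nbrs in parse_fastg_headers(lines):
        adj[edge_id]  # make sure isolated contigs are counted too
        for n in nbrs:
            adj[edge_id].add(n)
            adj[n].add(edge_id)
    seen = set()
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:  # iterative depth-first search
            node = stack.pop()
            size += 1
            for n in adj[node]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        sizes.append(size)
    return sorted(sizes, reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        sizes = component_sizes(fh)
    print("components:", len(sizes))
    print("largest components (contigs):", sizes[:5])
```

A largest component with many thousands of contigs would be consistent with a long run time, given the O(n^3) worst case per component.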

It is strange that it segfaulted, that it is mostly using 1 thread, and that it takes so long.
If you can send me the log file (here or to dpellow AT post DOT tau DOT ac DOT il) and more information:

  • system details (# cores, RAM, OS),
  • SCAPP version and how you installed it (bioconda etc),
  • other sample information you might have (e.g. # of reads, # of contigs in largest component, graph file if you are able),

I can then try to debug the problem with this specific sample.

You could try to remove short contigs from the assembly graph before running SCAPP, but this will likely degrade the performance, and SCAPP should be able to run on a graph of this size.
I would first try to run it with fewer threads (the fact that it is mostly using 1 thread seems suspicious).
If that doesn't work, re-running the metaSPAdes assembly with a higher maximum k (up to 127) if you have many reads, or a lower one (77) if you don't, may result in a simpler graph that SCAPP will process quickly.
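If you do want to try removing short contigs from the fastg file, a minimal Python sketch could look like the one below. It assumes SPAdes-style headers (`>EDGE_<id>_length_<len>_cov_<cov>[']:<neighbours>;`), drops every record shorter than a threshold, and also removes the dropped contigs from the neighbour lists of the remaining headers. The function names are hypothetical, and records whose header does not match the assumed pattern are dropped as well.

```python
import re

# Assumed SPAdes-style edge name: EDGE_<id>_length_<len>_cov_<cov>
EDGE_LEN = re.compile(r"EDGE_\d+_length_(\d+)_cov_")


def edge_length(name):
    """Length encoded in an edge name, or 0 if the name is not recognised."""
    m = EDGE_LEN.match(name)
    return int(m.group(1)) if m else 0


def filter_fastg(lines, min_len):
    """Return the fastg lines with all contigs shorter than min_len removed."""
    out = []
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:].rstrip(";")
            name, _, neighbours = header.partition(":")
            keep = edge_length(name) >= min_len
            if keep:
                # Keep only links to contigs that also pass the filter.
                kept = [n for n in neighbours.split(",")
                        if n and edge_length(n) >= min_len]
                links = ":" + ",".join(kept) if kept else ""
                out.append(">" + name + links + ";")
        elif keep:
            out.append(line)  # sequence line of a kept contig
    return out


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 4:
        # usage (hypothetical): python filter_fastg.py in.fastg out.fastg 1000
        with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
            for kept_line in filter_fastg(src, int(sys.argv[3])):
                dst.write(kept_line + "\n")
```

Note that this simply cuts any path that passes through a short contig, so cycles that depend on short contigs will be lost, which is one reason the results are likely to get worse.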

If you have a really huge assembly graph (it doesn't sound like it from your description), you could also try to divide the reads, create a few smaller assemblies, and run SCAPP on each of them.
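As one possible way to divide the reads, here is a small Python sketch that splits a pair of FASTQ files round-robin into several parts while keeping mates together. How the reads should be divided is not specified above, so round-robin is just one simple choice. The sketch assumes uncompressed, 4-line-per-record FASTQ files with mates in the same order in both files; the function names and output naming scheme are made up for this example.

```python
from contextlib import ExitStack
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes no wrapped sequences)."""
    it = iter(handle)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return  # end of file (or a truncated final record)
        yield rec


def split_paired_fastq(r1_path, r2_path, out_prefix, n_parts):
    """Distribute read pairs round-robin over n_parts pairs of output files."""
    with ExitStack() as stack:
        r1 = stack.enter_context(open(r1_path))
        r2 = stack.enter_context(open(r2_path))
        outs = []
        for i in range(n_parts):
            o1 = stack.enter_context(open(f"{out_prefix}.part{i}_1.fastq", "w"))
            o2 = stack.enter_context(open(f"{out_prefix}.part{i}_2.fastq", "w"))
            outs.append((o1, o2))
        pairs = zip(fastq_records(r1), fastq_records(r2))
        for i, (rec1, rec2) in enumerate(pairs):
            o1, o2 = outs[i % n_parts]  # both mates go to the same part
            o1.writelines(rec1)
            o2.writelines(rec2)
```

Each part would then be assembled separately and SCAPP run on each resulting graph. Keep in mind that every part has only a fraction of the original coverage.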

Hello,

Thanks a lot for your answer. I've already sent the requested data. If you don't see anything wrong, I will try the steps you suggest.