labgem/PPanGGOLiN

MAFFT error when running ppanggolin MSA

jvfe opened this issue · 9 comments

jvfe commented
  • PPanGGOLiN v2.0.5

Hi, I created the pangenome file for ~1600 GFFs with this command:

ppanggolin \
    workflow \
    --cpu 24 \
    --anno ppanggolin_samplesheet.tsv \
    --output ppanggolin

I started running the MSA command:

ppanggolin \
    msa \
    --cpu 24 \
    --pangenome pangenome.h5 \
    --output ppanggolin_msa \
    --partition all

But I'm always getting this MAFFT error:


Command output:
  2024-04-04 12:56:08 utils.py:l168 INFO	Command: /usr/local/bin/ppanggolin msa --cpu 24 --pangenome copied_pangenome.h5 --output ppanggolin_msa --partition all
  2024-04-04 12:56:08 utils.py:l169 INFO	PPanGGOLiN version: 2.0.5
  2024-04-04 12:56:08 utils.py:l722 INFO	3 parameters have a non-default value.
  2024-04-04 12:56:09 readBinaries.py:l94 INFO	Getting the current pangenome status
  2024-04-04 12:56:09 readBinaries.py:l715 INFO	Reading pangenome annotations...
  2024-04-04 13:03:59 readBinaries.py:l722 INFO	Reading pangenome gene dna sequences...
  2024-04-04 13:14:16 readBinaries.py:l730 INFO	Reading pangenome gene families...
  2024-04-04 13:16:38 writeMSA.py:l310 INFO	Doing MSA for all families...
  2024-04-04 13:16:38 writeMSA.py:l203 INFO	Preparing input files for MSA...
  2024-04-04 13:42:51 writeMSA.py:l212 INFO	Computing the MSA ...

Command error:
   20%|██        | 91094/446396 [3:30:14<10:15:22,  9.62family/s]
   20%|██        | 91096/446396 [3:30:16<21:15:50,  4.64family/s]
   20%|██        | 91098/446396 [3:30:16<17:11:41,  5.74family/s]
   20%|██        | 91100/446396 [3:30:16<14:35:28,  6.76family/s]
   20%|██        | 91103/446396 [3:30:16<10:45:47,  9.17family/s]
   20%|██        | 91105/446396 [3:30:16<12:04:14,  8.18family/s]
   20%|██        | 91109/446396 [3:30:17<11:07:37,  8.87family/s]
   20%|██        | 91111/446396 [3:30:17<10:04:28,  9.80family/s]
   20%|██        | 91113/446396 [3:30:17<12:38:48,  7.80family/s]
   20%|██        | 91115/446396 [3:30:18<13:31:27,  7.30family/s]
   20%|██        | 91116/446396 [3:30:18<14:02:16,  7.03family/s]
   20%|██        | 91117/446396 [3:30:18<17:43:26,  5.57family/s]
   20%|██        | 91118/446396 [3:30:18<20:16:36,  4.87family/s]
   20%|██        | 91119/446396 [3:30:18<18:58:24,  5.20family/s]
   20%|██        | 91120/446396 [3:30:19<17:36:03,  5.61family/s]
   20%|██        | 91122/446396 [3:30:19<14:24:48,  6.85family/s]
   20%|██        | 91124/446396 [3:30:19<14:07:42,  6.98family/s]
   20%|██        | 91128/446396 [3:30:19<8:11:13, 12.05family/s] 
   20%|██        | 91130/446396 [3:30:20<10:02:37,  9.83family/s]
   20%|██        | 91131/446396 [3:30:20<13:39:58,  7.22family/s]
  multiprocessing.pool.RemoteTraceback: 
  """
  Traceback (most recent call last):
    File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 125, in worker
      result = (True, func(*args, **kwds))
    File "/usr/local/lib/python3.9/site-packages/ppanggolin/formats/writeMSA.py", line 182, in launch_multi_mafft
      launch_mafft(*args)
    File "/usr/local/lib/python3.9/site-packages/ppanggolin/formats/writeMSA.py", line 172, in launch_mafft
      subprocess.run(cmd, stdout=open(outname, "w"), stderr=subprocess.DEVNULL, check=True)
    File "/usr/local/lib/python3.9/subprocess.py", line 528, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['mafft', '--thread', '1', '/tmp/tmp8ye30tt_/BLONNJ_19495.fasta']' returned non-zero exit status 1.
  """

I couldn't figure out why, as I can't access the mafft error directly, so I'm out of ideas at the moment. I had managed to run both of these commands previously on a smaller subset of this dataset (~200 GFFs). I'm using the PPanGGOLiN docker image from biocontainers, in case that's relevant.

Thanks!

Hi,

Thank you for your issue. The error reporting from ppanggolin in this specific case is definitely something we can improve.

To find out what is happening, ideally I'd run this command again and check the content of "/tmp/tmp8ye30tt_/BLONNJ_19495.fasta", and rerun mafft on it outside of PPanGGOLiN, but I don't remember if it's easy to keep the tmp files as I have not used "msa" in a while now.
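Rerunning the failing command outside of PPanGGOLiN with stderr captured, rather than discarded as in the traceback above, could look roughly like the sketch below. The helper names `run_and_capture` and `report_failure` are hypothetical and are not part of ppanggolin, and the fasta path in the usage comment is a placeholder.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (exit_code, stdout, stderr), keeping stderr."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def report_failure(cmd):
    """Return a readable message if cmd failed, or None if it succeeded."""
    code, _stdout, stderr = run_and_capture(cmd)
    if code != 0:
        return f"{cmd[0]} exited with status {code}:\n{stderr.strip()}"
    return None


if __name__ == "__main__":
    # Usage: pass the command to rerun, for example
    #   python rerun.py mafft --thread 1 family.fasta
    # where family.fasta is a placeholder for the family's fasta file.
    if len(sys.argv) > 1:
        message = report_failure(sys.argv[1:])
        print(message if message else "command succeeded")
```

With the stderr text in hand, the actual reason for the non-zero exit status should be visible directly.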

I assume there is something odd with this family, BLONNJ_19495. Is there anything different or strange about it?
Alternatively, if you can't do what I suggested above, the things that you can check:

  • How many members does it have? (number of lines with this id in gene_families.tsv should give that answer)
  • Is it a multigenic family? (i.e. it often has more than one gene per genome in your pangenome; I think mean_persistent_duplication.tsv can tell you if this is a persistent family)
  • Is it mostly made of fragments? (the number of lines with this id and an F in the third column in gene_families.tsv should give that answer)
  • Are there unexpected characters in the fasta sequences of its genes?
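The first and third checks above can be scripted. The following is a minimal sketch, assuming gene_families.tsv has whitespace-separated columns in the order family id, gene id, and an optional F flag in the third column; `summarize_family` is a hypothetical helper and the ids in the sample are made up.

```python
def summarize_family(lines, family_id):
    """Count the members and flagged fragments of one family.

    Assumes each line is: family id, gene id, optional 'F' flag.
    """
    members = 0
    fragments = 0
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != family_id:
            continue
        members += 1
        if len(fields) >= 3 and fields[2] == "F":
            fragments += 1
    return members, fragments


# Made-up sample lines in the assumed layout.
sample = [
    "FAM_1\tGENE_A\tF",
    "FAM_1\tGENE_B",
    "FAM_2\tGENE_C",
]
print(summarize_family(sample, "FAM_1"))  # (2, 1)

# On a real file (the path is a placeholder):
# with open("gene_families.tsv") as handle:
#     print(summarize_family(handle, "BLONNJ_19495"))
```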

Adelme

jvfe commented

Hi, so I ended up removing the genome that had 'BLONNJ_19495' and tried running it again. Unfortunately I overwrote the old results, but got the exact same error, this time for a different ID, 'ELALCC_40030'. So I'll answer your questions based on this last run, since it should point to the same issue.

I can't access ELALCC_40030.fasta directly since it ran in a tmp directory under a docker container, but I can give you the gff3 file it came from (It's in contig 76, see below).
SAMEA2267045.gff3.txt

  • 2 lines with this ID in gene_families.tsv, one of them does contain F:
ELALCC_40030    NOFIGB_22955    F
ELALCC_40030    ELALCC_40030
  • It's not a multigenic family, as far as I could gather.
  • Couldn't see any unexpected characters either.

Alright I see, thank you. Maybe the fact that it's the only non-fragment member of the family is linked to the problem?
I will try to replicate the error and get back to you if there is something.

Hi,

I did not manage to reproduce this problem using our testing dataset, nor with a real dataset I was working on, nor using the genome you uploaded.

Would it be possible for you to share a (possibly small-ish) "pangenome.h5" file that resulted in a problem like this?

Adelme

Just in case: if you have no means of sharing your pangenome.h5 file, share your email address with us and someone from the ppanggolin dev team can provide you with a link where you can upload the file.

jvfe commented

Sorry for such a late reply, but here is the pangenome.h5 file that is failing in the way I described above. Unfortunately it's not that small (~3GB). I'll see if I can create a smaller one that returns this same error.

Hi

After some testing I managed to reproduce something that looks like your error... accidentally.

In my case, it was not directly related to ppanggolin but to a lack of permissions on the TMPDIR of the system on which PPanGGOLiN is executed. When mafft tries to access that directory, the access fails and mafft crashes. The error I got was the same as this one: https://forum.qiime2.org/t/plugin-error-from-phylogeny/19519

This was, however, impossible to guess, because ppanggolin discards the mafft stderr (it is sent to DEVNULL in the traceback above). The PR linked to this issue improves this.
I'll close this issue once this gets into a release.
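For anyone hitting the same symptom, a quick way to check whether the temporary directory is usable before launching a long run might look like the sketch below. `tmpdir_is_writable` and `env_with_tmpdir` are hypothetical helpers, not part of ppanggolin, and /scratch/tmp is a placeholder path.

```python
import os
import tempfile


def tmpdir_is_writable(path=None):
    """Return True if a temporary file can be created in path.

    Defaults to the directory Python itself would use for temp files.
    """
    path = path or tempfile.gettempdir()
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        # Covers a missing directory and permission errors alike.
        return False


def env_with_tmpdir(path):
    """Return a copy of the current environment with TMPDIR set to path."""
    env = dict(os.environ)
    env["TMPDIR"] = path
    return env


if __name__ == "__main__":
    print("TMPDIR env:", os.environ.get("TMPDIR", "(unset)"))
    print("writable:", tmpdir_is_writable())
    # A child process could then be started with a different TMPDIR, e.g.
    # subprocess.run(cmd, env=env_with_tmpdir("/scratch/tmp"))
```

If the check fails inside a container, pointing TMPDIR at a writable location is the kind of change that would be expected to help.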

Adelme

jvfe commented

I see! Thank you so much for your patience. I managed to fix the issue on my side as well after changing the TMPDIR that Singularity was using.

The PR #229 has been released in v2.1.0