AstrobioMike/GToTree

Alignment file empty

Closed this issue · 5 comments

Dear Mike,

Thanks for this wonderful tool! I have been using it for a while with no problem, but this time I am having a hard time getting it to work. The pipeline seems to run fine until the alignment step, which produces an empty alignment file. I guess that is why FastTree gives this error:

Error reading header line

Please, see the attached log file.

I repeated the analysis more than once, but every time it fails with the same error.

Could you please help?

Best regards,

Ali
gtotree-runlog.txt

Hi there, Ali!

Thanks for the kind words about GToTree :)

Sorry it's giving you trouble. I'm not immediately sure what's going on, unfortunately, but some things look a little funny in the log (like not giving output for each NCBI-downloaded genome).

To start, can you try making a new GToTree conda environment and see if it runs successfully with the latest version? It's up to 1.8.6. You can keep your original and just name this new one something else.

And could you email me your "GTDB_rep_ref_seqs_uniq.txt" and "fasta_files.txt" files (MikeLeebmsisorg)? If it's okay with you, sending me the actual fasta files listed in "fasta_files.txt" would let me try to reproduce the problem completely too - I'll delete them after :)

And thanks for writing in so we can catch this if there's a bug somewhere!

I figured out what the problem is. It's a temporary directory issue with GNU parallel:

    Finished aligning and formatting gene-set IPPT.
    --------------------------------------------------------------------------
    parallel: Error: Output is incomplete.
    parallel: Error: Cannot append to buffer file in /tmp.
    parallel: Error: Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
    ________________________________________________________________________________
    sed: couldn't flush stdout: No space left on device
    tail: error writing 'standard output': No space left on device
    tail: write error: No space left on device
    sort: fflush failed: 'standard output': No space left on device
    sort: write error
    paste:
    tmp/gtotree.tmp.oNwPz/building_genomes_tab_for_removed_genomes.tm
    : No such file or directory
    /home/ali/mambaforge/envs/gtotree_new/bin/GToTree: line
    2523: /tmp/gtotree.tmp.oNwPz/genomes_tab_for_retained_genomes.tmp: No
    space left on device
    cat: /tmp/gtotree.tmp.oNwPz/genomes_tab_for_retained_genomes.tmp: No
    such file or directory
    cat: write error: No space left on device
    cat: write error: No space left on device
    ##############################################################################
    sed: couldn't flush stdout: No space left on device
    ####      Catting all alignments together      ####
    sed: couldn't flush stdout: No space left on device
    #############################################################################
    sed: couldn't flush stdout: No space left on device
    It is currently 08:27 PM; the process started at 04:42 PM.
    sed: couldn't flush stdout: No space left on device
    Current process runtime: 3 hours and 45 minutes.
    sed: couldn't flush stdout: No space left on device
    Traceback (most recent call last):
    File "/home/ali/mambaforge/envs/gtotree_new/bin/gtt-cat
    alignments", line 63, in <module>
    out.write("XXXXX".join(seqs) + "\n")
    OSError: [Errno 28] No space left on device
    During handling of the above exception, another exception occurred:
    OSError: [Errno 28] No space left on device
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
    File "/home/ali/mambaforge/envs/gtotree_new/bin/gtt-cat
    alignments", line 60, in <module>
    with open(output_dir + "Aligned_SCGs.faa", "w") as out:
    OSError: [Errno 28] No space left on device
    sed: couldn't flush stdout: No space left

The bad thing is that there is no option in GToTree to modify the tempdir for parallel. Another thing that kept me from identifying the problem is that this error is not included in the log file saved in the output directory. I had to redirect stdout and stderr to a file to see it. Previously I had discarded them, relying on the output log file.

So, in general, I would be really grateful if a new GToTree option is included to modify the tempdir for parallel. I also hope that the output log file would be identical to everything printed to stdout and stderr.

Your help is really appreciated!

Thank you for figuring it out and the suggestions to improve things! Sorry i've been slow at getting on this.

I haven't had time to check yet, but i think parallel (and all the out of space messages above) is/are using the general temp directory variable $TMPDIR, which is manageable on the system as a whole, separate from GToTree. Did you happen to try setting that beforehand and seeing if that made a difference? (e.g., setting export TMPDIR="wanted/temp/location" either permanently in a startup file, or running that prior to executing GToTree). It also makes me think it's not necessarily parallel we should worry about, but just making sure whatever is set as $TMPDIR is being used. Does that make sense/sound right to you?

Oh and on the logs missing, i made some choices to minimize the reporting, but yea it's not great when there is a hidden problem like this. I think i will look into capturing separate, more detailed logs too 👍

Out of curiosity, and to help me make sure i capture what would have indicated this issue, when you say this:

> I had to redirect stdout and stderr to a file to see it.

For which process/command/step are you talking about that you redirected the stdout and stderr to get this info?

> I haven't had time to check yet, but i think parallel (and all the out of space messages above) is/are using the general temp directory variable $TMPDIR, which is manageable on the system as a whole, separate from GToTree. Did you happen to try setting that beforehand and seeing if that made a difference? (e.g., setting export TMPDIR="wanted/temp/location" either permanently in a startup file, or running that prior to executing GToTree). It also makes me think it's not necessarily parallel we should worry about, but just making sure whatever is set as $TMPDIR is being used. Does that make sense/sound right to you?

Yes, this does make sense! I actually thought that the --tmpdir option was the only way to pass the new tmpdir path to GNU parallel, but after your comment and consulting the GNU parallel tutorial, I realized I can set the path of the desired tmpdir just before the command. So, thanks for this catch!

So, I did change the tmpdir, and before that I also cleaned up some storage, just in case, and it worked!
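For anyone hitting the same error, here is a minimal sketch of that workaround. The directory name `gtotree_tmp` is only a hypothetical example, and the final GToTree call is left as a commented placeholder because the actual arguments depend on your run. It assumes that the tools involved honor `$TMPDIR`, which GNU parallel does by default.

```shell
# Hypothetical scratch location; use any directory on a filesystem with enough free space.
scratch="${HOME}/gtotree_tmp"
mkdir -p "$scratch"

# Compare free space on the current temp location and on the new one.
df -h "${TMPDIR:-/tmp}" "$scratch"

# Export the variable so that child processes (e.g. GNU parallel) inherit it.
export TMPDIR="$scratch"

# Then run GToTree as usual from this same shell, for example:
# GToTree <your usual arguments>
```

Putting the `export` line in a shell startup file makes the change permanent; running it in the terminal only affects that session.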

> Out of curiosity, and to help me make sure i capture what would have indicated this issue, when you say this:
>
> > I had to redirect stdout and stderr to a file to see it.
>
> For which process/command/step are you talking about that you redirected the stdout and stderr to get this info?

Actually, I observed that the stdout and stderr generated by GToTree and normally printed to the screen are a bit different from the log saved to file. I initially relied on the log saved to file and disregarded the output printed to the screen. However, to troubleshoot the problem at hand, I saved the screen output to a file:

    GToTree_command > file 2>&1

There, I could find the GNU parallel error related to tmpdir, but I couldn't find it in the other log file available in the GToTree output directory.
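A variation on the same idea, in case it helps others: the sketch below keeps the output visible on screen while also saving it, then searches the saved copy for disk-space errors. `run_pipeline` is a hypothetical stand-in for the real GToTree command (it just echoes two lines taken from the output above), and `full_screen_output.log` is an arbitrary file name.

```shell
# Hypothetical stand-in for the real GToTree call; writes to both stdout and stderr.
run_pipeline() {
    echo "Finished aligning and formatting gene-set IPPT."
    echo "parallel: Error: Cannot append to buffer file in /tmp." >&2
}

# Merge stderr into stdout, then copy the combined stream to a file and to the screen.
run_pipeline 2>&1 | tee full_screen_output.log

# Look for signs of a full disk in the captured output.
grep -n -i -e "No space left on device" -e "parallel: Error" full_screen_output.log
```

Note that the captured file itself needs to live on a filesystem that still has free space, otherwise the capture will fail for the same reason.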

Anyway, thank you so much! I think the problem is now solved!

Excellent! Thanks, @aelbehery!