muellan/metacache

Merging results from querying partitioned database

jaimeortiz-david opened this issue · 16 comments

Dear Andre,

I am working with a partitioned database (44 chunks in total, each 16 GB). Querying the database works well (I believe) and is reasonably fast. However, once I have the 44 txt files and merge them to get the abundance per species, the computation takes a very long time (it has now been running for three days). I am unsure whether I am doing something wrong or whether this is normal behavior (each txt file is around 100-300 GB, and the paired-end fastq files contain 100 species with 10,000 reads each, one million reads in total). Below is an example of the command I used to merge the text files:

metacache merge results_mix05 -taxonomy ncbi_taxonomy -lowest species -abundances 05_abund_extraction.txt -abundance-per species

Also, I am using the 64-bit version of MetaCache to generate k-mers of size 32.

Do you think partitioning the database into even smaller chunks would make merging the final results faster?

I would appreciate any guidance you can provide.

Best wishes,

Jaime

Hi!
Merging should scale linearly with the number of files. I would recommend using fewer database parts if possible.

I am a little surprised that the txt files are multiple hundreds of GB for only 1 million reads. Which options did you choose for querying?

Lastly, did you check the memory consumption while merging? If you run out of memory and the system starts swapping, this can lead to extremely bad runtimes.
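For example, something like this could tell you the peak memory of the merge (just a sketch; the merge command is copied from your post and merge_mem.log is an arbitrary file name):

/usr/bin/time -v metacache merge results_mix05 -taxonomy ncbi_taxonomy -lowest species -abundances 05_abund_extraction.txt -abundance-per species 2> merge_mem.log
grep "Maximum resident set size" merge_mem.log    # peak RSS of the merge run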

I don't think attachments work with email replies. Could you upload the script on GitHub or use something like Pastebin?

Hi Robin,

I do apologize for that. I am attaching the script here: slurm_merge_10Mix_mod copy.txt

If I understand correctly, you first copy multiple TB of result files to the scratch directory. Is this really necessary? Can't you access these files directly with MetaCache? Have you timed how long this copy takes? I think you can put "srun" in front of your commands (e.g. srun $c1) to get separate timings in sacct; see the sketch below.
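Roughly like this (I have not seen your script's contents, so $c1/$c2 just stand for whatever command strings it defines, and <jobid> is a placeholder):

# inside the sbatch script, with the commands stored in variables as before:
srun $c1    # copy step, becomes its own job step (e.g. <jobid>.0)
srun $c2    # merge step, becomes <jobid>.1

# after the job has finished, per-step wall time and peak memory:
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS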

I am still worried about how your result files got so large. Can you maybe upload the first few lines (head) of a result file?

Hi,

Here are the first lines of one of the result files:

Reporting per-read mappings (non-mapping lines start with '# ').

Only the lowest matching rank will be reported.

Classification will be constrained to ranks from 'species' to 'domain'.

Classification hit threshold is 5 per query

At maximum 2 classification candidates will be considered per query.

File based paired-end mode:

Reads from two consecutive files will be interleaved.

Max insert size considered 0.

Using 128 threads

TABLE_LAYOUT: query_id | query_header | top_hits | rank:taxname

/workdir/jdo53/Mix_05sp/fish_db_R1.fastq + /workdir/jdo53/Mix_05sp/fish_db_R2.fastq

11265 | CM001427.1-28530/1 | 473354:6,280673:4 | subclass:Neopterygii
11266 | CM001430.1-32512/1 | 7897:2,371672:2 | --
11267 | CM001410.1-281764/1 | 181435:4,240162:4 | --
11268 | CM001409.1-376322/1 | 7897:4,371672:4 | --
11269 | CM001411.1-6520/1 | 161453:5,7897:4 | subphylum:Craniata
11270 | CM001408.1-243296/1 | 179366:4,56726:4 | --
11271 | CM001418.1-68805/1 | 56726:4,240162:4 | --
11272 | CM001416.1-115266/1 | 280673:4,72055:4 | --
11273 | CM001406.1-281222/1 | 47706:4,7897:2 | --

OK, this looks fine. But this should not accumulate to 100 GB for 1 million reads; it should be on the order of 100 MB per file.

Nevertheless, I have no idea what's going wrong with the merging. Does the merge call produce any output? It should show a progress bar indicating how many files have been merged. Does it write anything to the output files?
Have you tried merging just 2 files at once? See the example below.
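For example (file names are placeholders for two of your actual result files; as far as I know, merge accepts individual files as well as a directory):

ls -lh results_mix05/*.txt    # each file should be on the order of 100 MB, not 100+ GB
metacache merge results_mix05/res_mix05_1.txt results_mix05/res_mix05_2.txt -taxonomy ncbi_taxonomy -lowest species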

Hi,

The merging process was stopped since it took too long to finish. I think there is an issue somewhere and the program gets stuck in an endless loop. The error file is a massive 450 GB text file; here are the first 50 lines and the last 50 lines, which are identical! It does not even get past 10% of the query.

Applied taxonomy to database.
Merging result files.
max canddidates: 2
number of files: 37
[=======> ] 10%Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.

-bash-4.2$ tail -n 50 slurm_10Mix_merge_4431989_4294967294.err
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.
Query 1: Could not read taxid.
Query 1: taxid 0 not found. Skipping hit.

Well, it seems like one of your result files is corrupt. Can you please check if all queries finished properly?
I will write a fix for the infinite loop. It should be possible to skip corrupt results.
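For that check, something along these lines might help (the directory and file pattern are taken from your paths; adjust as needed). If a query run completed, the end of each result file should contain the summary block (queries / time / speed / unclassified):

for f in results_mix10/res_mix10_*.txt; do
  echo "== $f"
  tail -n 6 "$f"    # a completed run should end with the summary lines
done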

The infinite loop is fixed in v2.2.3.
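Rebuilding from the new release should be enough to pick it up; roughly (assuming the release is tagged v2.2.3 and that the 64-bit k-mer build still uses the MC_KMER_TYPE macro described in the README):

git clone https://github.com/muellan/metacache.git
cd metacache
git checkout v2.2.3
make MACROS="-DMC_KMER_TYPE=uint64_t"    # 64-bit k-mers (k up to 32); please check the README for the exact macro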

Hi,

Thank you very much for fixing this.
Additionally, I have found out why the files are corrupt: some file names make no sense. See the attached lines from the query result text files. Maybe this can provide an additional perspective and a better understanding of why these strange names appear during the process (a bug in my script?), which corrupts the files and sends the process into an infinite loop.

491519 | CAKAKQ010001760.1-23/1 | 166741:1 | --
491520 | CAKAKQ010025327.1-63/1 | 166741:3 | --

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_10.txt + /workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_11.txt

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_13.txt + /workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_5.txt

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_6.txt + /workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_7.txt

/workdir/jdo53/4423340/Mix_10sp/results_mix10/res_mix10_8.txt + ??t??t??t"1???H??M1??fDH?71??f?H?H??????t??t??t"1???H???M1??fDH?71??f??o?????t??t??t"1???H??M1??fDH?71??f?H?H?????

queries: 1000000

time: 64370 ms

speed: 932111 queries/min

unclassified: 94.5548% (472774)

/workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_10.txt + /workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_5.txt

/workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_6.txt + /workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_7.txt

/workdir/jdo53/4423315/Mix_10sp/results_mix10/res_mix10_8.txt + ????W&??

queries: 1000000

time: 29285 ms

speed: 2.04883e+06 queries/min

unclassified: 45.5042% (227521)

Sorry, I don't really understand where those lines with the filenames appear. What's the content of the results_mix10 directory? Are there any other files besides the 44 result files? Could you maybe upload your query script, too?

In your merge script you save the merged result in the results directory. It would be included as an input file if you call the script another time; see the example below.
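For example (a sketch only; the output directory name is arbitrary, and I am assuming merge accepts the same -out option as query mode):

mkdir -p merged_mix10
metacache merge results_mix10 -taxonomy ncbi_taxonomy -lowest species -out merged_mix10/merged_mix10.txt -abundances merged_mix10/mix10_abund.txt -abundance-per species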

Dear Robin,

Thank you for your response.

I thought I had attached the query script. I am attaching it here.

The results_mix10 directory contains only the 44 result files. My initial thought was that there was something else in there that might have created the weird lines.

slurm_mix10_query_mod copy.txt

OK, so the problem is as I suspected, but in the parent directory Mix_10sp. If you give a directory to MetaCache, it will query all files in that directory and its sub-directories. If you query Mix_10sp, it will also query all files in Mix_10sp/results_mix10, which would include results from other jobs in the array. I would suggest storing the read files and result files in separate directories, e.g. as sketched below.
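Something like this layout avoids the problem (directory names are only illustrative, and the query options shown are placeholders for whatever your script already uses):

Mix_10sp/
  reads/            # only the fastq read files to be queried
    fish_db_R1.fastq
    fish_db_R2.fastq
  results_mix10/    # per-part query results go here, but this directory is never queried

metacache query <database_part> Mix_10sp/reads -pairfiles -out Mix_10sp/results_mix10/res_mix10_<part>.txt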

Hi Robin,

Thank you so much for all your help and for releasing the new version that improves the merging process. It seems that the merging is now working fine. I also noticed that the merging issue was also occurring with result files that did not assign any taxa; in other words, it was not only happening with corrupted files but also with files that contained no results. I hope this makes sense.

Many thanks again!

Best wishes,

Jaime

You're welcome. I'm glad it works now.