Problem building DB with metacache-build-refseq
donovan-parks opened this issue · 6 comments
Hi,
I'm experiencing some issues building a RefSeq reference database for MetaCache v2.1.1.
The first issue is that the NCBI assembly_summary.txt files now have a few entries where the ftp_path field is set to na. This breaks the metacache-build-refseq script. It is easy enough to work around once you know the problem, but it takes a bit of exploring to figure out.
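One possible workaround for the na entries is to filter the summary file before handing it to the download step. The following is only a minimal sketch, not part of MetaCache: the function name usable_rows is hypothetical, and it assumes the file is tab-separated with a comment line starting with "#" that names the columns, including ftp_path.

```python
def usable_rows(lines, column="ftp_path"):
    """Yield rows (as dicts) whose ftp_path holds a real path, skipping 'na'.

    Hypothetical helper: assumes tab-separated input whose column names
    are given in a comment line that starts with '#'.
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # A comment line that contains the column name is the header.
            fields = line.lstrip("#").strip().split("\t")
            if column in fields:
                header = fields
            continue
        if header is None:
            raise ValueError(f"no header line naming '{column}' was found")
        row = dict(zip(header, line.split("\t")))
        path = row.get(column, "").strip()
        # Entries without a download location are marked 'na'.
        if path and path.lower() != "na":
            yield row


# Made-up sample data with only three columns, for illustration.
sample = [
    "#   See README for a description of the columns in this file.\n",
    "# assembly_accession\ttaxid\tftp_path\n",
    "GCF_000000001.1\t111\thttps://example.org/genomes/GCF_000000001.1\n",
    "GCF_000000002.1\t222\tna\n",
]
kept = list(usable_rows(sample))
```

With the sample above, only the first data row is kept; the second is dropped because its ftp_path is na.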
The more serious issue is that the metacache build command has been stuck at 98% for several hours now and all the metacache processes are in a sleep state. I am running this on a 16 CPU machine with 126 GB of RAM. Only 82 GB of RAM is in use at the moment, so this doesn't appear to be a memory issue. I can also verify I have plenty of disk space.
Have you experienced this problem? Can anyone verify that they have recently been able to build a MetaCache DB from complete RefSeq genomes?
Thanks,
Donovan
Hi,
thanks for pointing out the issue with the assembly_summary.txt files. We'll update our scripts.
Regarding the other problem - I think we have to try to reproduce it in order to diagnose what's wrong. Could you try to build again, but with the option -verbose? This will list the genomes currently being processed, which might indicate at what point or sequence it goes wrong. Other than that, we'll also (try to) build the latest RefSeq.
We also ran into the same problem when trying to build the latest RefSeq. I don't think it has anything to do with the input files. We'll investigate and let you know as soon as we find something.
Great - thank you for the quick response and for looking into the issue.
Turns out the problem is rather mundane: the latest RefSeq releases contain so many complete genomes that our default data type for storing reference sequence ids is not sufficient anymore. The current defaults only support up to 65536 reference sequences. This should have triggered an error message during the build, but the error handling for this case seems to be broken and the build just paused.
We'll change the default from 16bit to 32bit and fix the error handling.
In the meantime you can compile with:
make MACROS="-DMC_TARGET_ID_TYPE=uint32_t"
Note that this will increase the memory footprint of the databases slightly.
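To check in advance whether a set of genomes runs into the 16-bit limit described above, one can count the reference sequences and compare the count with the number of values the id type can hold. This is a rough sketch under stated assumptions, not MetaCache code: the function names are hypothetical, it assumes every FASTA record (every line starting with ">") needs its own id, and it ignores any id values that might be reserved internally.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def max_ids(bits):
    """Number of distinct values an unsigned integer of the given width holds."""
    return 2 ** bits


def check_id_capacity(num_sequences, bits):
    """Fail loudly when the id type is too narrow; return the unused ids."""
    limit = max_ids(bits)
    if num_sequences > limit:
        raise OverflowError(
            f"{num_sequences} reference sequences exceed the {limit} ids "
            f"available with a {bits}-bit id type"
        )
    return limit - num_sequences


# Made-up FASTA content with two records, for illustration.
fasta = [">seq1\n", "ACGT\n", ">seq2\n", "TTGA\n"]
n = count_fasta_records(fasta)
remaining = check_id_capacity(n, 16)
```

Here max_ids(16) is 65536 and max_ids(32) is 4294967296, so a collection of 70000 sequences would raise the error for a 16-bit type but pass for a 32-bit one.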
The latest release contains a fixed download script, the new data type defaults, and updated documentation.
https://github.com/muellan/metacache/releases/tag/v2.2.0
Excellent - thank you for the quick response and fix.