Rinoahu/SwiftOrtho

SwiftOrtho for 150 animal genomes

000generic opened this issue · 3 comments

I'm interested in using SwiftOrtho to cluster 150 animal genomes as part of a pygmy squid genome project. I have two possible machines to work with, but I have questions about run time, about available disk space relative to your publication (where you had 28 Tb of space), and about how realistic the idea of doing this actually is.

Machine 1: 48 CPUs, 1024 Gb RAM, and 15 Tb disk space (GenuineIntel Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz)

Machine 2: 64 CPUs, 512 Gb RAM, and 10 Tb disk space (GenuineIntel Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz)

Machine 2 I can take over indefinitely while Machine 1 is shared but not heavily used by others - so ideally I would use only 25-35 of the available 48 CPUs on Machine 1 - though for a period of around 2 weeks I could take over the machine.

For the 150 animal genomes I'm estimating a total of 4.5 million proteins (30,000 genes per species).

I was wondering: 1) what would you estimate as the amount of time to run SwiftOrtho on the 150 species on each of the machines, 2) will the limited amount of disk space be an issue for (what I think is) such a large data set, and 3) are there optimizations you would recommend for SwiftOrtho in this case?

Thank you!

Hi, we tested SwiftOrtho on >1700 bacteria with ~6 million proteins: 1) protein alignment took 1200 CPU hours in total, and 2) the alignment file was ~1.4 Tb. Because you are working on animal genomes, the time and space usage may be several times larger than for the bacteria. 3) To speed things up and reduce disk usage, you can use a longer k-mer. For example, you can set "-s" to "-s 1111111111", which means a k-mer size of 10.
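
As a rough way to read those benchmark numbers against the two machines, here is a back-of-the-envelope sketch in Python. It assumes perfect parallel scaling across CPUs and treats "several times larger" as a hypothetical multiplier (`scale`) that you would have to guess; neither assumption comes from SwiftOrtho itself, and the function names are made up for illustration.

```python
# Benchmark figures quoted above (>1700 bacteria, ~6 million proteins).
BENCH_CPU_HOURS = 1200   # CPU hours for protein alignment
BENCH_DISK_TB = 1.4      # size of the alignment file, in Tb


def wall_clock_hours(cpu_hours, n_cpus, scale=1.0):
    """Ideal wall-clock hours, ASSUMING perfect parallel scaling.

    `scale` is a hypothetical multiplier for "several times larger".
    """
    return cpu_hours * scale / n_cpus


def estimate(n_cpus, scale=1.0):
    """Return (wall-clock hours, disk in Tb) for a given CPU count and multiplier."""
    return wall_clock_hours(BENCH_CPU_HOURS, n_cpus, scale), BENCH_DISK_TB * scale


if __name__ == "__main__":
    for n_cpus in (30, 48, 64):      # partial Machine 1, full Machine 1, Machine 2
        for scale in (1, 3, 5):      # hypothetical multipliers, not measured values
            hours, disk_tb = estimate(n_cpus, scale)
            print(f"{n_cpus} CPUs, x{scale}: ~{hours:.1f} h, ~{disk_tb:.1f} Tb")
```

Real runs will rarely scale perfectly, so these figures are best read as lower bounds on wall-clock time.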

Those are helpful numbers! Maybe I will try initially with a high-quality subset of ~50 genomes and see how it goes. Thank you :)

We recently ran a SwiftOrtho job with 150 animal genomes and the "euk" seed pattern of -s 1011111,11111, but after a week it had used up 4-5 Tb of space and was still on the first 60 of 250 jobs.
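
To put that progress in perspective, here is a naive linear extrapolation in Python. It assumes that the 60 jobs count as done, that all 250 jobs cost about the same time and disk space, and that usage grows linearly; none of this is confirmed for SwiftOrtho, and `extrapolate` is just an illustrative helper.

```python
def extrapolate(done_jobs, total_jobs, elapsed_weeks, disk_tb_used):
    """Linearly extrapolate total run time and disk use from partial progress.

    ASSUMES every job costs the same; real jobs may differ a lot.
    """
    if done_jobs <= 0:
        raise ValueError("need at least one finished job to extrapolate")
    factor = total_jobs / done_jobs
    return elapsed_weeks * factor, disk_tb_used * factor


if __name__ == "__main__":
    for disk_tb in (4.0, 5.0):   # the observed 4-5 Tb range
        weeks, total_tb = extrapolate(60, 250, 1.0, disk_tb)
        print(f"{disk_tb} Tb so far: ~{weeks:.1f} weeks and ~{total_tb:.1f} Tb in total")
```

Under these assumptions the full run would take roughly four weeks and need more disk than the 10 or 15 Tb available on either machine.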

We could try the k-mer size of 10 with -s 1111111111 that you recommended above, but we are clustering across all animals, and I wonder whether ten 1s in a row is too strict for seeding at this evolutionary distance. Or should it be no problem and still cluster great?

We saw in the paper a default of 1110100010001011, 11010110111 with weight 8 for seeding. Would you recommend this over 10 1s, given our evolutionary distances?

Or would it make sense to increase the weight of the pattern to 10? If so, is there a particular logic or reasoning behind the 1110100010001011, 11010110111 pattern that we might follow? Arbitrarily, I have at the moment:

11101010001000101011, 110101011010111
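
For comparing the patterns mentioned in this thread, here is a small Python sketch that reports the weight and span of each seed. It assumes the usual spaced-seed convention that weight is the number of 1 positions and span is the total pattern length; `seed_stats` and `describe` are illustrative helpers, not part of SwiftOrtho.

```python
def seed_stats(pattern):
    """Return (weight, span) for one seed pattern made of '1' and '0' characters."""
    pattern = pattern.strip()
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError(f"not a valid seed pattern: {pattern!r}")
    return pattern.count("1"), len(pattern)


def describe(seed_arg):
    """Split a comma-separated seed argument and report (pattern, weight, span) for each."""
    return [(p.strip(),) + seed_stats(p) for p in seed_arg.split(",")]


if __name__ == "__main__":
    candidates = [
        "1011111,11111",                           # "euk" pattern used in the week-long run
        "1111111111",                              # ten consecutive 1s
        "1110100010001011, 11010110111",           # default given in the paper
        "11101010001000101011, 110101011010111",   # arbitrary patterns proposed above
    ]
    for seed_arg in candidates:
        for pattern, weight, span in describe(seed_arg):
            print(f"{pattern}: weight={weight}, span={span}")
```

By this count, both paper patterns have weight 8 and both proposed patterns have weight 10, the same weight as ten consecutive 1s but spread over a longer span. Whether that is sensitive enough at this evolutionary distance is a separate question that the count alone does not answer.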