lh3/cgranges

speed

Closed this issue · 8 comments

Hi, thanks for the nice library. I wrapped it for Nim thinking it would be faster than my naive library.
See timings here:
https://github.com/brentp/nim-cgranges

My library (here: https://github.com/brentp/nim-lapper) has horrible worst-case performance given some huge intervals, because it is just sorted by start, with knowledge of the longest interval. But it does seem to work well for most genomic data types, which often do not have huge intervals.

I wonder if you could check whether a simple data structure like the one in nim-lapper holds up in your timings. (Or if you see something wrong with my code that would slow down cgranges; I was surprised by the result.)
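For reference, the sorted-start-plus-max-length scheme described above can be sketched roughly like this (simplified Python, not the actual nim-lapper code; `Lapper` and `find` are illustrative names):

```python
import bisect

class Lapper:
    """Sketch of a lapper-style index: intervals sorted by start,
    plus the length of the longest interval recorded at build time."""

    def __init__(self, intervals):
        # intervals: (start, stop) half-open pairs
        self.intervals = sorted(intervals)
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def find(self, start, stop):
        # Any overlapping interval must begin at or after start - max_len,
        # so binary-search to that point and scan forward from there.
        i = bisect.bisect_left(self.intervals, (start - self.max_len,))
        hits = []
        for s, e in self.intervals[i:]:
            if s >= stop:   # sorted by start: later intervals cannot overlap
                break
            if e > start:   # half-open overlap test
                hits.append((s, e))
        return hits
```

One huge interval inflates `max_len` and forces `find` to scan from near the beginning of the array on every query, which is exactly the worst case described above.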

current timings here: https://github.com/brentp/nim-cgranges#speed

lh3 commented

I am not surprised. In JavaScript, the naive version (the one used in paftools.js) is also faster than the interval tree version. That is probably because binary search has a much tighter inner loop than an interval tree. There is a price to pay for improving the worst case. One solution is to inspect the data before indexing and choose the index data structure accordingly.

You should also note that cgranges can be up to twice as slow given an interval spanning the whole chromosome. NCList or a centered interval tree may be more robust to such an interval.

At the end of the day, though, I think the performance of overlap queries is good enough in practice. Other steps can be slower.

lh3 commented

BTW, here is an interesting thread on the performance of overlap queries: BioJulia/Bio.jl#340. As for testing, I would be interested in the speed on the test BEDs on the release page. I guess which tool performs best may vary a lot with the input.

Thanks for the thread. For completeness, also linking this: https://www.biorxiv.org/content/10.1101/593657v1
I have already found a use case where nim-cgranges is faster than nim-lapper due to large regions (your LCRs).
I had thought about implementing interval splitting for large intervals in lapper, with a pointer back to the original interval, but never got around to it.
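That splitting idea could look something like this (a hypothetical sketch, not anything implemented in lapper): chop long intervals into fixed-size chunks, keep the index of the parent interval on each chunk, and deduplicate parents at query time.

```python
def split_intervals(intervals, max_chunk=10_000):
    """Split any interval longer than max_chunk into pieces that each
    remember the index of their parent interval."""
    chunks = []
    for idx, (start, stop) in enumerate(intervals):
        pos = start
        while stop - pos > max_chunk:
            chunks.append((pos, pos + max_chunk, idx))
            pos += max_chunk
        chunks.append((pos, stop, idx))
    return chunks

def find_parents(chunks, intervals, start, stop):
    """Naive scan over chunks; in a real index the chunks would be fed to
    the lapper, whose max interval length is now bounded by max_chunk."""
    seen = {idx for s, e, idx in chunks if s < stop and e > start}
    return [intervals[i] for i in sorted(seen)]
```

The cost is a larger index and the extra deduplication step, in exchange for a bounded maximum interval length.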

lh3 commented

When discussing AIList in the README, I cited the preprint. The test/3rd-party directory provides the link to the original AIList repo.

I have also done more comparisons to AIList using the test data from the preprint. On all of their datasets, AIList is 20-50% faster than cgranges, and cgranges is faster than AITree and NCList. Nonetheless, AIList is not implemented as a library. It even hard-codes chromosome names. AIList would require a lot of code refactoring, or even reimplementation, to be used as a library. The transition to a library would also slow it down a little. All that said, AIList is a very good algorithm, possibly the fastest overall.

I understand that all benchmarks are incomplete (I'm sure mine are too, in a million ways I have not thought of), but your test does not use my (simple) Cython optimizations and also mixes reading and queries, while my NCLS is meant to be an in-memory data structure. I think the AIList benchmarks have the same downsides.

Note that I do not have any ego attached to NCLS; I did not invent it. I was just looking into different data structures to optimize my pyranges library, and I do not find the current benchmarks compelling enough to switch. But if someone in the future has the time to test these as in-memory data structures and finds that one is significantly faster than the others, I would be happy to write a Cython wrapper for it.

lh3 commented

What do you mean by “mixes reading and queries”? What Cython optimizations are you referring to? Both the AIList paper and this repo compare C/C++ implementations, which have nothing to do with Cython.

In general, I think AIList, NCList, CITree, and IAITree are all fast enough; user-facing operations will probably run at similar speed. You don’t need to change the underlying algorithm behind pyranges. If it were me, I would focus on reducing dependencies.

Yes, agreed. Sorry, I was being unclear, and this post was not necessary anyway.

"Mixing reading and queries": the benchmarks read the data and query at the same time. For GenomicRanges-like data structures this is not applicable, as all the data are in memory.

The optimization I alluded to: when doing many queries at once with the NCList API, you can malloc and free the IntervalIterator once instead of once per query. This will presumably save some time (https://github.com/lh3/cgranges/blob/master/test/3rd-party/NCList/intervaldb.c#L630).
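In language-agnostic terms, the idea is just to allocate the result container once and reuse it across a batch of queries (a Python sketch with hypothetical names; the real NCList API is C):

```python
def query_into(sorted_intervals, start, stop, out):
    """Append overlapping (start, stop) pairs into the reusable list `out`
    instead of allocating a fresh container (or iterator) per query."""
    out.clear()
    for s, e in sorted_intervals:
        if s >= stop:    # sorted by start: no later interval can overlap
            break
        if e > start:
            out.append((s, e))

buf = []  # allocated once, reused for every query in a batch
query_into([(1, 5), (10, 20)], 3, 12, buf)   # buf now holds both intervals
query_into([(1, 5), (10, 20)], 30, 40, buf)  # buf now empty
```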

But I agree, they are all so fast that it is not really important. And thanks to your blog post I did start reducing dependencies (moved all code requiring FTP/MySQL libraries into its own package, etc.). Looking forward to hearing you speak at Genome Informatics.

lh3 commented

Thanks for the clarification.

> "Mixing reading and queries": the benchmarks read the data and query at the same time. For GenomicRanges-like data structures this is not applicable, as all the data are in memory.

In a lot of cases, the list of queries is too large to be loaded into memory; think of a short-read BAM. Reading and querying at the same time is also a common pattern. With good APIs, streaming should be as fast as preloading all the intervals.
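The streaming pattern could be sketched like this (illustrative Python; a simple sorted-start scan stands in for a real index such as cgranges): the index is built once in memory, while queries are consumed one at a time from any iterator, e.g. a file reader.

```python
import bisect

def stream_overlaps(index_intervals, query_stream):
    """Build a small in-memory index once, then yield results for queries
    streamed one at a time (the queries never need to fit in memory)."""
    ivals = sorted(index_intervals)
    max_len = max((e - s for s, e in ivals), default=0)
    for qs, qe in query_stream:         # query_stream can read from a file
        i = bisect.bisect_left(ivals, (qs - max_len,))
        hits = []
        for s, e in ivals[i:]:
            if s >= qe:
                break
            if e > qs:
                hits.append((s, e))
        yield (qs, qe), hits
```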

> The optimization I alluded to: when doing many queries at once with the NCList API you can malloc and free the IntervalIterator once, instead of once per query. This will presumably save some time

Yes, it does save time, but the effect is probably minor. Repeatedly mallocing a small chunk in a single thread is fairly fast.