Speed up augur filter by replacing Pandas
victorlin opened this issue · 1 comments
Context
See parent issue for context on how Pandas is used in augur filter and why it is slow.
The alternative for working with large datasets is to keep the data on disk instead of loading all of it into memory. The options fall into two categories:
- A Pandas-like alternative such as Dask. It is unclear how portable the existing Pandas logic is to Dask, but ideally this would be closer to a library swap, with less code change than a full rewrite.
- A database file approach such as SQLite. This would require more of a rewrite and needs extensive testing. Note that at least some form of Pandas may still be necessary to continue supporting the `--query` option (which allows Pandas-based queries and is widely used).
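To make the database file option more concrete, here is a minimal sketch using only Python's standard-library `sqlite3` and `csv` modules. It is not how augur filter is implemented. The table name, column names, sample rows, and the helpers `quote_identifier`, `load_metadata`, and `filter_strains` are all hypothetical and exist only for illustration. It uses an in-memory database so that it runs anywhere. Passing a file path to `sqlite3.connect` would keep the data on disk instead.

```python
import csv
import io
import itertools
import sqlite3

# Hypothetical metadata; the column names and rows are illustrative only.
METADATA_TSV = (
    "strain\tcountry\tdate\n"
    "A/1\tUSA\t2020-01-05\n"
    "B/2\tCanada\t2020-02-10\n"
    "C/3\tUSA\t2021-03-15\n"
)


def quote_identifier(name):
    """Quote a table or column name for use in a SQLite statement."""
    return '"' + name.replace('"', '""') + '"'


def load_metadata(connection, handle, table="metadata", batch_size=1000):
    """Stream tab-delimited rows into a SQLite table in batches,
    so the full dataset is never held in memory at once."""
    reader = csv.reader(handle, delimiter="\t")
    columns = [quote_identifier(c) for c in next(reader)]
    table_name = quote_identifier(table)
    connection.execute(
        f"CREATE TABLE {table_name} ({', '.join(c + ' TEXT' for c in columns)})"
    )
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f"INSERT INTO {table_name} VALUES ({placeholders})"
    while True:
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        connection.executemany(insert_sql, batch)
    connection.commit()


def filter_strains(connection, where_clause, parameters=()):
    """Return strain names matching a SQL WHERE clause.
    Values are passed as bound parameters rather than interpolated."""
    sql = f"SELECT strain FROM metadata WHERE {where_clause} ORDER BY strain"
    return [row[0] for row in connection.execute(sql, parameters)]


# ":memory:" keeps the example self-contained; a file path would persist to disk.
connection = sqlite3.connect(":memory:")
load_metadata(connection, io.StringIO(METADATA_TSV))
usa_2020 = filter_strains(
    connection, "country = ? AND date < ?", ("USA", "2021-01-01")
)
```

Because the filtering is expressed in SQL rather than Pandas syntax, a sketch like this does not by itself cover `--query`. That is why some form of Pandas may still be needed alongside the database.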