nextstrain/augur

Speed up augur filter by replacing Pandas

victorlin opened this issue · 1 comment

Context

See parent issue for context on how Pandas is used in augur filter and why it is slow.

An alternative way of working with large datasets is to keep them on disk rather than loading them in full. The options fall into two categories:

  1. A Pandas-like alternative such as Dask. It is unclear how portable the existing Pandas logic is to Dask, but ideally this would be closer to a library swap than a full rewrite.
  2. A database file approach such as SQLite. This would require more of a rewrite and extensive testing. Note that some form of Pandas may still be necessary to keep supporting the --query option, which allows Pandas-based queries and is widely used.
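To make the second category more concrete, here is a minimal sketch of what a database-backed filter could look like using only Python's standard library. It is not how augur filter is implemented. The helper names (`load_metadata`, `filter_strains`), the table name, and the columns `strain`, `country`, and `date` are all hypothetical, and the sketch assumes every row has the same number of fields as the header.

```python
import csv
import io
import sqlite3


def load_metadata(conn, tsv_file, table="metadata"):
    """Stream a TSV file into an SQLite table without building a DataFrame.

    Hypothetical helper: all columns are stored as TEXT for simplicity.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    columns = next(reader)  # header row
    # Quote identifiers so unusual column names do not break the SQL.
    quoted = ['"' + c.replace('"', '""') + '"' for c in columns]
    column_defs = ", ".join(q + " TEXT" for q in quoted)
    conn.execute('CREATE TABLE "' + table + '" (' + column_defs + ")")
    placeholders = ", ".join("?" for _ in columns)
    # executemany consumes the reader lazily, one row at a time.
    conn.executemany(
        'INSERT INTO "' + table + '" VALUES (' + placeholders + ")", reader
    )
    conn.commit()


def filter_strains(conn, where, params=()):
    """Return strain names matching a parameterized SQL WHERE clause."""
    sql = "SELECT strain FROM metadata WHERE " + where + " ORDER BY strain"
    return [row[0] for row in conn.execute(sql, params)]


# Tiny made-up dataset standing in for a real metadata file.
tsv = io.StringIO(
    "strain\tcountry\tdate\n"
    "A\tUSA\t2020-01-01\n"
    "B\tCanada\t2020-02-01\n"
    "C\tUSA\t2021-03-01\n"
)

# ":memory:" keeps the example self-contained; passing a file path
# instead would keep the data on disk.
conn = sqlite3.connect(":memory:")
load_metadata(conn, tsv)
usa_2020 = filter_strains(conn, "country = ? AND date < ?", ("USA", "2021-01-01"))
```

Here the filter is plain SQL, which is why a Pandas-based --query expression could not be passed through unchanged under this approach.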

Progress

question
Are we exploring Pandas alternatives like Polars? I guess Polars would fall into the first category.