Use faster/multithread read functions

Question

Closed this issue 3 years ago · 2 comments

Reading in large matrices (such as distance matrices with many samples) could be quite a bit faster with a multithreaded reader such as fread from data.table, or vroom. It would add a dependency, but it should speed things up considerably.
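As a rough illustration of the suggestion, here is a minimal sketch comparing base R's `read.table` with `data.table::fread` on a distance-matrix-style file. It assumes a tab-separated file with sample IDs in the first column and an empty first header cell. The sample names, matrix size, and temp file are made up for the example, and this is not how qiime2R currently reads its data.

```r
# Sketch only: the data below is a hypothetical stand-in for a real distance matrix.
library(data.table)

set.seed(1)
n <- 200
ids <- paste0("sample", seq_len(n))  # hypothetical sample IDs

# Build a small symmetric distance matrix and write it as a TSV file
m <- as.matrix(dist(matrix(rnorm(n * 5), nrow = n)))
dimnames(m) <- list(ids, ids)
path <- tempfile(fileext = ".tsv")
write.table(m, path, sep = "\t", quote = FALSE, col.names = NA)

# Base R reader: first column becomes the row names
base_mat <- as.matrix(
  read.table(path, sep = "\t", header = TRUE, row.names = 1, check.names = FALSE)
)

# data.table reader: fread parses the file with multiple threads where available
dt <- fread(path, sep = "\t", header = TRUE)
fast_mat <- as.matrix(dt, rownames = 1)  # use the first column as row names

# Both readers should give the same numeric matrix
print(all.equal(base_mat, fast_mat, check.attributes = FALSE))
```

On a file this small the difference is negligible; wrapping each reader in `system.time()` on a matrix with several thousand samples would show whether the gain justifies the extra dependency.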

Answer 1 · 2020-08-20T17:28:18.000Z

This is true; however, my intuition is that most users would not have such large datasets! In the case of feature tables, they are already being read/stored as sparse matrices; however, there could be room for improvement on distance matrices. I would suspect that at the scale where the performance of fread would become appreciable, QIIME2/artifacts would no longer be the preferred tool and you would be better off going right to the source tools. Can you give me an idea of the sample numbers where you have been finding poor performance?

Answer 2 · 2020-08-25T01:24:55.000Z

You're probably correct that most people don't work with datasets this large! I've been doing some comparisons with Earth Microbiome Project data, using 5-6k samples. To be clear, qiime2R works for this data, and I've been pretty happy using QIIME2 to manage things. This is just an idea to make the process even smoother.