DataTable file readers should utilize memory mapping

Question


Closed this issue 10 months ago · 1 comments

Loading files into memory tends to speed up I/O considerably. A simple benchmark on the 1m.csv file (1 million lines) took ~2k ms with the current DataTable.csv_read(), compared with ~200 ms for Pandas. After experimenting with memory mapping, we brought the time down to just under 200 ms. I will keep working with this technique so that we can match Pandas' ability to infer native data types per column.
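To illustrate the idea, here is a minimal sketch of a memory-mapped CSV read using Python's standard mmap module. It is not the DataTable implementation: read_csv_mmap is a hypothetical helper, the sample file is made up, and the naive split does not handle quoted fields or embedded delimiters.

```python
import mmap
import os
import tempfile


def read_csv_mmap(path, delimiter=","):
    """Hypothetical helper: return the rows of a CSV file as lists of strings."""
    rows = []
    with open(path, "rb") as f:
        # An empty file cannot be memory-mapped, so handle it up front.
        if os.fstat(f.fileno()).st_size == 0:
            return rows
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mmap.readline() returns b"" once the end of the mapping is reached.
            for line in iter(mm.readline, b""):
                text = line.rstrip(b"\r\n").decode("utf-8")
                if text:  # skip blank lines
                    rows.append(text.split(delimiter))
    return rows


# Usage: write a tiny sample file, then read it back through the mapping.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,price\n1,2.5\n2,3.0\n")
sample_rows = read_csv_mmap(tmp.name)
os.remove(tmp.name)
```

The mapping lets the operating system page the file in on demand instead of copying it through intermediate read buffers, which is where the speedup would be expected to come from. Every field is still a string at this point.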

Pandas seems to use try/except-style logic, with conversion to int as the first attempt. At the moment we do something similar, along with regular-expression checks, to determine types. This is likely inadequate and should be overhauled with a more advanced algorithm for our "pattern matching" case.
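A try/except cascade of the kind described above could look roughly like the following sketch. It is an assumption about the general approach, not the actual Pandas or DataTable algorithm; infer_column_type and convert_columns are hypothetical names, and only int, float, and str are considered.

```python
def infer_column_type(values):
    """Return the first of int, float that parses every value, else str."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except ValueError:
            # At least one value failed to parse; try the next, wider type.
            continue
        return candidate
    return str


def convert_columns(rows):
    """Treat the first row as a header and convert each column to its inferred type."""
    header, *data = rows
    columns = list(zip(*data))  # transpose rows into columns
    typed = {}
    for name, column in zip(header, columns):
        kind = infer_column_type(column)
        typed[name] = [kind(value) for value in column]
    return typed
```

Trying int before float keeps whole-number columns as integers, and a single unparseable value demotes the whole column to str. A production version would also need rules for missing values, dates, and booleans, which this sketch leaves out.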

Answer 1 · 2023-11-20T21:40:54.000Z

In addition to the built-in DataTable class, I will look into supporting the C++ DataFrame project: https://github.com/hosseinmoein/DataFrame/tree/master