To run the benchmarks, install Pixi, clone this repository, and run the following from inside the repository directory:

    gzip -dk data.csv.gz
    pixi install
    pixi run bench

The results of processing the CSV file (not counting the time to initialize the Python interpreter or load libraries) are as follows:
Description | File / Function | Time (seconds) |
---|---|---|
Pure Python looping with csv module using int types | pure_python_int | 3.4547557830810547 |
Pure Python looping with csv module using float types | pure_python_float | 3.8738009929656982 |
pandas with C engine | pandas_c | 1.50089430809021 |
pandas with Python engine | pandas_python | 8.328583478927612 |
pandas with PyArrow engine and NumPy dtypes | pandas_pyarrow | 0.31276631355285645 |
pandas with PyArrow engine and PyArrow dtypes | pandas_pyarrow_arrow | 0.29172492027282715 |
Polars in lazy mode | polars_lazy | 0.10555672645568848 |
Polars in streaming mode | polars_streaming | 0.11504125595092773 |
Polars with SQL API | polars_sql | 0.09796714782714844 |
DuckDB with SQL API | duckdb_sql | 0.8167853355407715 |
DataFusion with SQL API | datafusion_sql | 0.20633697509765625 |
NumPy with loadtxt function | numpy_loadtxt | 1.8354885578155518 |
The exact version of each library is listed in the pixi.toml file. Note that new DuckDB releases seem to reach conda-forge later than other package managers, so the benchmarks use DuckDB 0.9 even though 0.10 seems to be available elsewhere.
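
As a rough illustration of what "not counting the time to initialize the Python interpreter or load libraries" can mean in practice, the sketch below starts the timer only after all imports have run. It is not the repository's actual benchmark code: the `time_call` and `sum_column` helpers, the column-sum workload, the in-memory sample data, and the use of `time.perf_counter` are all assumptions made for this example.

```python
import csv
import io
import time


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func.

    Imports have already run when this is called, so interpreter
    start-up and library loading are not part of the measurement.
    """
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def sum_column(lines, convert):
    """Hypothetical workload: sum the second column using `convert`."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    return sum(convert(row[1]) for row in reader)


# Tiny in-memory stand-in for data.csv (made-up values, not the real data).
sample = "id,value\n1,10\n2,20\n3,30\n"

total_int, seconds_int = time_call(sum_column, io.StringIO(sample), int)
total_float, seconds_float = time_call(sum_column, io.StringIO(sample), float)

print(f"int:   total={total_int}, seconds={seconds_int}")
print(f"float: total={total_float}, seconds={seconds_float}")
```

The same wrapper could be pointed at any of the functions named in the table, as long as the library import happens before `time_call` is invoked.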