bench_csv

CSV processing benchmarks for different open source technologies

Python benchmarks to process a CSV file

To run the benchmarks, install Pixi, clone this repository, and run the following from inside the repository directory:

gzip -dk data.csv.gz
pixi install
pixi run bench

Results

The table below shows the time taken to process the CSV file, excluding the time to initialize the Python interpreter and load libraries:

| Description | File / Function | Time (seconds) |
|---|---|---|
| Pure Python looping with csv module using int types | pure_python_int | 3.4547557830810547 |
| Pure Python looping with csv module using float types | pure_python_float | 3.8738009929656982 |
| pandas with C engine | pandas_c | 1.50089430809021 |
| pandas with Python engine | pandas_python | 8.328583478927612 |
| pandas with PyArrow engine and NumPy dtypes | pandas_pyarrow | 0.31276631355285645 |
| pandas with PyArrow engine and PyArrow dtypes | pandas_pyarrow_arrow | 0.29172492027282715 |
| Polars in lazy mode | polars_lazy | 0.10555672645568848 |
| Polars in streaming mode | polars_streaming | 0.11504125595092773 |
| Polars with SQL API | polars_sql | 0.09796714782714844 |
| DuckDB with SQL API | duckdb_sql | 0.8167853355407715 |
| DataFusion with SQL API | datafusion_sql | 0.20633697509765625 |
| NumPy with loadtxt function | numpy_loadtxt | 1.8354885578155518 |
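
The timings exclude interpreter startup and library loading. One way to get that kind of measurement is to start the clock only after all imports have finished and to wrap just the processing call. The sketch below is an illustration under that assumption and is not the repository's actual harness: `sum_column_int`, `time_call`, the `SAMPLE` data and its column names are all hypothetical, and the real benchmark workload may differ.

```python
import csv
import io
import time


def sum_column_int(csv_text, column):
    """Sum one column with the csv module, parsing each value as int.

    Hypothetical workload; the repository's real processing may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    for row in reader:
        total += int(row[column])
    return total


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func.

    The clock starts here, after imports, so interpreter startup and
    library loading are not part of the measurement.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical in-memory data standing in for data.csv.
SAMPLE = "id,value\n1,10\n2,20\n3,30\n"

result, elapsed = time_call(sum_column_int, SAMPLE, "value")
print(f"sum={result} time={elapsed:.6f}s")
```

A single call like this is sensitive to disk caching and background load, so repeating the measurement and reporting the minimum or median is a common refinement.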

The exact version of each library is listed in the pixi.toml file. Note that DuckDB releases seem to reach conda-forge later than other package managers, so the benchmarks use DuckDB 0.9 even though 0.10 seems to be available elsewhere.