Experiments to (try to) replicate/mirror Python's pandas library using our Glorious Haskell language and conduit's stream processing.
DISCLAIMER: THIS IS NOT A BINDING FOR PYTHON PANDAS!
Pandas is widely adopted in the data processing community thanks to its expansive library capabilities and the low barrier to entry that comes with Python's popularity. However, there are limitations:
- It is entirely dynamic, with no type safety, which makes scripts/applications a runtime exception minefield.
- Several operations (like joins) mutate record labels and create new ones with a magic suffix.
- Several functionalities require doing things the "pandas way" to exploit high performance. (Ex: map, filter)
- Datasets are loaded into memory. This is an expensive endeavor for "big data".
No idea. This repo intends to port common use cases that I come across to Haskell land.
Improvements over Pandas:
- Static typing can ensure data transformations are safe and errors are explicitly handled (e.g. missing values, parsing failures, etc.).
- Join results can be expressed more richly with tuples or `These` to know exactly what the results of the operation were.
- The core of the functionality and types can be vanilla Haskell and still play nicely with streaming/processing libraries.
- Streaming also means only a fraction of the dataset is ever stored in memory at a time.
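To make the point about richer join results concrete, here is a minimal sketch in vanilla Haskell. It is not this repo's actual API: `Person`, `Order`, `people`, `orders` and `outerJoin` are hypothetical names invented for illustration, and the `These` type is redefined locally (mirroring the one in the `these` package) so the example depends only on base. Missing values are modelled with `Maybe`, so they must be handled explicitly.

```haskell
-- Local stand-in for the These type from the `these` package,
-- defined here so the sketch needs nothing beyond base.
data These a b = This a | That b | These a b
  deriving (Eq, Show)

-- Hypothetical record types, invented for illustration only.
data Person = Person { personId :: Int, name :: String }
  deriving (Eq, Show)

-- `amount` is a Maybe: Nothing stands for a missing value.
data Order = Order { buyerId :: Int, amount :: Maybe Double }
  deriving (Eq, Show)

-- Full outer join on a key. Every row from both sides appears in the
-- result, and the constructor records which side(s) matched:
--   These a b : both sides matched
--   This a    : left row with no match on the right
--   That b    : right row with no match on the left
-- This is a naive quadratic list implementation, not a streaming one.
outerJoin :: Eq k => (a -> k) -> (b -> k) -> [a] -> [b] -> [These a b]
outerJoin keyA keyB as bs = concatMap fromLeft as ++ unmatchedRight
  where
    fromLeft a = case [b | b <- bs, keyB b == keyA a] of
      []      -> [This a]
      matches -> [These a b | b <- matches]
    unmatchedRight = [That b | b <- bs, all (\a -> keyA a /= keyB b) as]

-- Sample data, also invented for illustration.
people :: [Person]
people = [Person 1 "Ada", Person 2 "Grace"]

orders :: [Order]
orders = [Order 1 (Just 9.5), Order 3 Nothing]
```

Evaluating `outerJoin personId buyerId people orders` yields one `These` for the matched pair, one `This` for the person without an order, and one `That` for the order without a person. No record labels are renamed and no suffixes are introduced, and the compiler forces callers to deal with each case.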
- Get stack
- Compile
  $ stack build --pedantic --ghc-options '-O2'
- Run
  $ time stack exec conduit-pandas-exe
- Explore more use-cases and examples
- Add benchmarks with pandas and sqlite
stat | value |
---|---|
CPU | 2.3 GHz Intel Core i5 |
Memory | 16 GB 2133 MHz LPDDR3 |
- Inner join for 1000x1000 records runs in 28 seconds, roughly 36K rows/second (about 1,000,000 row combinations / 28 s).
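The throughput figure is consistent with a join that compares every left row with every right row. The sketch below shows what such a naive nested-loop inner join could look like, along with the arithmetic behind the number. Whether the benchmarked code actually works this way is an assumption, and `innerJoin`, `comparisons` and `pairsPerSecond` are hypothetical names, not functions from this repo.

```haskell
-- Naive nested-loop inner join: each left row is compared with each
-- right row, and only pairs with equal keys are kept, as tuples.
innerJoin :: Eq k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
innerJoin keyA keyB as bs =
  [(a, b) | a <- as, b <- bs, keyA a == keyB b]

-- Number of key comparisons the nested loop performs.
comparisons :: [a] -> [b] -> Int
comparisons as bs = length as * length bs

-- Row pairs processed per second for a given elapsed time in seconds.
pairsPerSecond :: Int -> Double -> Double
pairsPerSecond pairs seconds = fromIntegral pairs / seconds
```

For two inputs of 1000 rows each, `comparisons` gives 1,000,000, and `pairsPerSecond 1000000 28` is a little over 35,700, which rounds to the 36K quoted above.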