Description on Towards Data Science.
- Code for processing pandas.DataFrame that is compatible with sklearn.pipeline
- High code coverage by unit tests (via pytest)
- Continuous integration via GitHub Actions as I add more functionality
Make scikit-learn pipelines retain and remember metadata, e.g. column names, from pandas DataFrames. Facilitates model debugging and interpretation! For details, see my article on Towards Data Science.
To be fair, there is one way that scikit-learn utilizes metadata in DataFrames: ColumnTransformer can identify DataFrame columns by their string names and direct your desired transformers to each column. Here is an example by Allison Honold on Towards Data Science.
Unfortunately, ColumnTransformer produces numpy arrays or scipy sparse matrices. My code extends ColumnTransformer such that it produces pandas.DataFrame as well.
All files are located within directory src.
- ImputeByGroup.py
  - ImputeNumericalByGroup: groupby, calculate per-group statistics (e.g. median) of a numerical column with missing values, and then impute each group using said statistics.
  - ImputeCategoricalByGroup: like the above, but for imputing discrete, categorical columns. Fills up missing values using the most frequent unique value.
- PandasColumnTransformer.py
  - Wrapper around sklearn.compose.ColumnTransformer for automatic bookkeeping of column names, even when the number of columns changed after a transformation (e.g. one-hot encoding)
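The group-wise imputation idea can be sketched in plain pandas as follows. This is not the implementation in src; the helper names `impute_median_by_group` and `impute_mode_by_group` and the toy columns are hypothetical, chosen only to show the technique (group, compute a per-group statistic, fill missing values within each group).

```python
import numpy as np
import pandas as pd

def impute_median_by_group(df, group_col, target_col):
    """Hypothetical helper: fill NaNs in a numerical column with its group's median."""
    out = df.copy()
    # transform("median") returns one value per row, aligned with the original index.
    group_median = out.groupby(group_col)[target_col].transform("median")
    out[target_col] = out[target_col].fillna(group_median)
    return out

def impute_mode_by_group(df, group_col, target_col):
    """Hypothetical helper: fill missing categories with the group's most frequent value."""
    out = df.copy()

    def most_frequent(s):
        modes = s.mode()  # ignores missing values by default
        return modes.iloc[0] if not modes.empty else np.nan

    # One mode per group, then mapped back onto each row via its group key.
    group_mode = out.groupby(group_col)[target_col].agg(most_frequent)
    out[target_col] = out[target_col].fillna(out[group_col].map(group_mode))
    return out

# Toy data; the column names are illustrative only.
data = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "weight": [1.0, 3.0, np.nan, 10.0, np.nan, 20.0],
    "color": ["red", "red", None, "blue", None, "blue"],
})

filled = impute_median_by_group(data, "species", "weight")
filled = impute_mode_by_group(filled, "species", "color")
print(filled)
```

Note that this sketch computes the statistics on the same data it fills. A transformer meant for use inside a pipeline would typically learn the per-group statistics in `fit` and apply them in `transform`, so that test data is imputed with training statistics.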
Copy the source file(s) of interest from the src directory into your own project, and then import as necessary. Please see playground.ipynb for a usage demonstration.