Helper for ML feature preprocessing. Very tasty.
Machine learning is cool, but those models are picky beasts. Most statistical models handle nothing but numbers, hence techniques such as one-hot encoding for categorical variables (e.g. a country variable).
To create these numerical representations, and to create other interesting variables (aka features), we must pass data through a preprocessing step (aka feature engineering). If you work in an environment with several data science teams, it is likely that several of them have reimplemented the same feature engineering code in different projects. This is especially true in polyglot teams (e.g. some people prefer R, others Python, and a growing number Clojure).
bulgogi is a system to simplify this process, via centralization of concerns and coupling with a feature store [1] [2]. The final goal is to increase sharing of code and ensure point-in-time correctness of training data.
Presently, bulgogi is an idea and this repository is a playground for experimentation. Examples of applications and use-cases will be added as development evolves.
The original code started as a gist here.
Although this introduction and the examples in /example focus on using bulgogi within the context of production ML models, this library is generic enough for other use-cases where mapping arbitrary functions over data is useful.
I gave a talk at re:Clojure 2021 about it.
Ideas and discussion are welcome!
This is definitely not a replacement for ML pipelines (scikit-learn pipelines, tidymodels workflows) in all situations. If the cost of higher latency (no benchmarks yet on how much) outweighs the benefit of quicker collaboration, then by all means use a pipeline.
bulgogi is not in Clojars yet, but you can try it with deps.edn:
{:deps {io.github.jcpsantiago/bulgogi {:git/url "https://github.com/jcpsantiago/bulgogi/"
                                       :git/sha "278ce2738f26d4100b3470f133f682ad450662c4"}}}
You can see an example implementation in /example.
The main meat in bulgogi is the preprocessed function.
It takes in a request map with keys :input-data (another map) and :features (a vector of strings), e.g.
{:input-data {:current-amount 700
              :email "squadron42@starfleet.ufp"}
 :features ["n-digits-in-email-name"
            "contains-risky-item"]}
and a namespace, e.g. 'example.main', to look for functions with the same names as the vals in :features, then pmaps those fns over the :input-data.
Finally, it returns a map with the preprocessed data:
{:n-digits-in-email-name 2
 :contains-risky-item 1}
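To make the flow above concrete, here is a minimal sketch of how such a lookup-and-map step could work. It is not bulgogi's actual implementation: the namespace example.sketch, the function preprocessed-sketch, and the feature amount-over-500 are hypothetical names made up for illustration, and n-digits-in-email-name is only a guess at what a feature with that name might compute. The sketch assumes each feature is a plain function that takes the :input-data map.

```clojure
(ns example.sketch
  (:require [clojure.string :as str]))

;; Hypothetical feature fns: each takes the :input-data map and returns a value.
(defn n-digits-in-email-name
  "Counts the digits in the part of :email before the @."
  [{:keys [email]}]
  (->> (first (str/split email #"@"))
       (re-seq #"\d")
       count))

(defn amount-over-500
  "Returns 1 if :current-amount is above 500, else 0 (made-up rule)."
  [{:keys [current-amount]}]
  (if (> current-amount 500) 1 0))

(defn preprocessed-sketch
  "Resolves each feature name to a fn in the namespace `ns-sym`, then
  runs those fns in parallel over `input-data`.
  Throws if a feature name has no matching fn."
  [{:keys [input-data features]} ns-sym]
  (let [resolve-fn (fn [feature]
                     (or (ns-resolve ns-sym (symbol feature))
                         (throw (ex-info "Unknown feature" {:feature feature}))))
        fns        (mapv resolve-fn features)      ; eager, so lookup errors surface here
        values     (pmap #(% input-data) fns)]     ; one call per feature, in parallel
    (zipmap (map keyword features) values)))

(preprocessed-sketch
 {:input-data {:current-amount 700
               :email "squadron42@starfleet.ufp"}
  :features ["n-digits-in-email-name" "amount-over-500"]}
 'example.sketch)
;; => {:n-digits-in-email-name 2, :amount-over-500 1}
```

Resolving by name keeps the request payload language-agnostic: callers only send strings, and adding a feature means adding a function to the namespace. Note that pmap runs on Clojure's future thread pool, so a standalone script may need (shutdown-agents) to exit promptly.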