Lightweight Machine Learning model registry.
Designed to be a simple to use, portable ML model registry that embeds (and versions) training data, models and logs model execution. Currently implements sklearn
models, or more precicisely sklearn.pipeline.Pipeline
objects.
Portability is achieved through attaching to SQLite as a registry storage. So, just by sharing the mlops-lite.db file, it ensures completely reproducable results, as it contains both training data, along with all the models, their metadata.
Better persistence and ability to scale is achieved by attaching registry to Postgres storage (at the cost of portability).
In principle it can be used with any object that implements predict
, predict_proba
, feature_names_in_
, classes_
methods. Future development intends to generalize the interface, but at this stage it mimics sklearn.pipeline.Pipeline
.
Only one dataset and only one model (deployable) can be be active at a time. Binding either to the client will automatically persist it in the database. It will avoid any duplicated artifacts (through comparing md5 hash) and increment version if name is the same but md5 is different.
from mlopslite.client import MlopsLite
# Initializing the client connects to the registry
# By default it's SQLite
client = MlopsLite()
# Get an example Dataset
from mlopslite.utils import sklearn_dataset_to_dataframe
iris = sklearn_dataset_to_dataframe("iris")
# Bind Dataset to the registry
client.bind_dataset(data = iris, name = "iris")
# Metadata of the currently active dataset
client.dataset.info()
column_name | original_dtype | converted_dtype | null_count | unique_count | min_value_num | max_value_num | |
---|---|---|---|---|---|---|---|
0 | sepal length (cm) | float64 | float | 0 | 35 | 4.3 | 7.9 |
1 | sepal width (cm) | float64 | float | 0 | 23 | 2 | 4.4 |
2 | petal length (cm) | float64 | float | 0 | 43 | 1 | 6.9 |
3 | petal width (cm) | float64 | float | 0 | 22 | 0.1 | 2.5 |
4 | target | object | str | 0 | 3 | nan | nan |
# List all the datasets in the registry
client.list_datasets()
id | name | version | size_cols | size_rows | description | |
---|---|---|---|---|---|---|
0 | 1 | iris | 1 | 5 | 150 |
# the id of dataset from the output of list_datasets method
client.pull_dataset(1)
From a perspective of mlopslite
we're only interested in model
artifact, so in principle this could be any model that is trained with sklearn
and assembled as a Pipeline
.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
client.dataset.data.drop('target', axis = 1),
client.dataset.data['target'],
random_state = 1
)
column_def = client.dataset.info()
num_cols = column_def.loc[column_def['converted_dtype'] == 'float', 'column_name'].to_list()
num_transform = make_pipeline(
SimpleImputer(strategy="median"),
StandardScaler()
)
preproc = ColumnTransformer([
('num', num_transform, num_cols),
])
model = make_pipeline(preproc, LogisticRegression())
model.fit(X = X_train, y = y_train)
This will persist the model in the database. Similarly like dataset, it will check md5 of model artifact and will not write duplicates. And will increment version if artifact is diferent but name is the same.
client.bind_deployable(deployable = model, name = 'testmodel', target = 'target')
Similar to datasets, list all deployables in the registry:
client.list_deployables()
id | name | version | dataset_registry_id | estimator_class | estimator_type | description | |
---|---|---|---|---|---|---|---|
0 | 1 | testmodel | 1 | 1 | LogisticRegression | classifier |
Assuming this is a new python session.
from mlopslite.client import MlopsLite
client = MlopsLite()
client.pull_deployable(1)
test = [
{
'sepal length (cm)': 2,
'petal length (cm)': 1,
'petal width (cm)': 1,
'sepal width (cm)': 1
},
{
'reference_id': 'some_external_id',
'sepal length (cm)': 1,
'petal length (cm)': 2,
'petal width (cm)': 1,
'sepal width (cm)': 1
}
]
client.predict(test, log = True)
"""
output:
[{'reference_id': None,
'results': {'setosa': 0.7817632336603132,
'versicolor': 0.21727812223107698,
'virginica': 0.0009586441086099546}},
{'reference_id': 'some_external_id',
'results': {'setosa': 0.9250271226821645,
'versicolor': 0.07308630128882285,
'virginica': 0.001886576029012563}}]
"""
All inputs and outputs are logged in the registry, with a reference to Dataset
and Deployable
artifacts.
- Documentation
- Minor refactor (consistently naming things)
- Testing (most used models & transformation patterns; SQLite+Postgres+MySQL)
- Build pipelines (Github actions)
- Package & Publish on PyPi, version 0.0.1
- Example deployment setup suggestion
- Monitoring interface
- State consistency (e.g. unloading model when other dataset is loaded; switching appropriate dataset when other model is loaded etc.)
- Extract library dependencies and write to deployable metadata + check upon pulling model.
- Alternative Dataset file types (JSON, csv, parquet)
- Alternative Deployable file types (pickle, joblib, PMML)
- Alternative artifact save targets (file system e.g. AWS, with reference in registry)