A small framework for easily deployable, reproducible machine learning research that largely builds on top of the scikit-learn API. Its goal is to make common research procedures templated, optimized, and well recorded. To this end it features:
- Flexible wrappers to plug in your tools and features of choice.
- Sparse and multi-threaded pipeline through hashing and pooling - from data to feature space.
- Storage of all settings and fitted parts of the entire experiment, promoting reproducibility.
- Export of an easily deployable version of the final model for plug-and-play demos.
- Visual overview in a web environment for comparison and interpretation of experiments.
Read the documentation at readthedocs.
This repository is currently in alpha development, and is therefore not stable.
Omesa can be used in various scenarios:
- As an end-to-end template-to-results framework.
- As several fast classes to aid your machine learning workflow.
- As a storage and overview for your experiments.
As such, you can choose between a restricted set of options with little code, or many options (and thus more code) that require some restricted storage format. Either way, Omesa aids in structuring and comparing experiments.
The front-end provides visualization and comparison of model performance. Currently, only the 'Results' section works; a preview is shown below:
To test, do the following:
$ cd /dir/to/omesa/examples
$ python3 n_gram.py
$ cd ../front
$ python3 ./app.wsgi
Then follow the localhost link that is shown to access the web app. Please note that this part can be quite unstable. Bug reports are welcome.
Omesa currently relies heavily on numpy, scipy and sklearn. To use Frog as a Dutch back-end, we strongly recommend using LaMachine. For English, there is a spaCy wrapper available.
With the end-to-end Experiment pipeline and a configuration dictionary, several experiments or set-ups can be run and evaluated with a minimal piece of code. One of the test examples provided is that of n-gram classification of Wikipedia documents. In this experiment, we are provided with a toy set n_gram.csv that features 10 articles about Machine Learning, and 10 random other articles. To run the experiment, the following configuration is used:
from omesa.experiment import Experiment
from omesa.featurizer import Ngrams
# NOTE: the original snippet uses CSV without importing it; importing it from
# omesa.containers is an assumption.
from omesa.containers import CSV, Pipe
from omesa.components import Vectorizer, Evaluator
from sklearn.naive_bayes import MultinomialNB

Experiment(
    project="unit_tests",
    name="20_news_grams",
    data=[CSV("n_gram.csv", data="intro", label="label")],
    pipeline=[
        Vectorizer(
            features=[
                Ngrams(level='char', n_list=[3])
            ]),
        Pipe('clf', MultinomialNB()),
        Evaluator(scoring='f1', average='micro',
                  lime_docs=CSV("n_gram.csv", data="intro", label="label")),
    ],
    save=("model", "db")
)
This will cross-validate performance on the .csv, selecting the text and label columns and indicating that a header is present in the .csv document. We provide the Ngrams featurizer and its parameters to be used as features, and store the log.
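To make the feature setting concrete, here is a minimal sketch of what character trigram counts look like for a single document. It uses only the standard library. `char_ngrams` is a hypothetical helper for illustration and is not part of Omesa, whose Ngrams implementation may differ (for example in padding or feature naming).

```python
from collections import Counter


def char_ngrams(text, n_list):
    """Count character n-grams of each length in n_list (hypothetical helper)."""
    counts = Counter()
    for n in n_list:
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# n_list=[3] mirrors the setting used in the configuration above.
features = char_ngrams("banana", n_list=[3])
print(features)
```

For "banana" this yields three distinct trigrams, with "ana" occurring twice. Texts shorter than n simply produce no features.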
The log file will be printed during run time, as well as stored in the script's directory. A sample from the output of the current experiment is as follows:
---- Omesa ----
Config:
feature: char_ngram
n_list: [3]
name: gram_experiment
seed: 42
Sparse train shape: (20, 1301)
Performance on test set:
             precision    recall  f1-score   support
         DF       0.83      0.50      0.62        10
         ML       0.64      0.90      0.75        10
avg / total       0.74      0.70      0.69        20
Experiment took 0.2 seconds
----------
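As a reading aid for the table above, the following sketch recomputes the f1-score column from the reported precision and recall. It assumes the "avg / total" row is a support-weighted average, and it works from the rounded values in the sample output, so in general the last digit could differ from a report computed on raw counts.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the sample output.
report = {"DF": (0.83, 0.50, 10), "ML": (0.64, 0.90, 10)}

scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total = sum(support for _, _, support in report.values())
# Support-weighted average (assumed to match the "avg / total" row).
weighted = sum(scores[label] * report[label][2] for label in report) / total

for label, score in scores.items():
    print(label, round(score, 2))   # DF 0.62, then ML 0.75
print("avg", round(weighted, 2))    # avg 0.69
```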
Here's an example of a minimal word-frequency feature class:
from collections import Counter


class SomeFeaturizer(object):

    def __init__(self, some_params):
        """Set parameters for SomeFeaturizer."""
        self.name = 'hookname'
        self.some_params = some_params

    def transform(self, raw, parse):
        """Return a dictionary of feature values."""
        # Assumes raw is an iterable of tokens (words).
        return Counter(raw)
This returns a {word: frequency} dict per instance that can easily be transformed into a sparse matrix.
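How Omesa performs this conversion internally is not shown here. As a rough sketch of the idea, the snippet below turns a list of {word: frequency} dicts into the three arrays of a compressed sparse row (CSR) layout, using only the standard library. `to_csr` and the example sentences are hypothetical and chosen for illustration.

```python
from collections import Counter


def to_csr(rows):
    """Convert a list of {feature: value} dicts into CSR arrays plus a vocabulary."""
    vocab = {}  # feature name -> column index
    data, indices, indptr = [], [], [0]
    for row in rows:
        for feature, value in sorted(row.items()):
            # Assign the next free column to features not seen before.
            col = vocab.setdefault(feature, len(vocab))
            indices.append(col)
            data.append(value)
        indptr.append(len(indices))  # marks the end of this row's entries
    return data, indices, indptr, vocab


rows = [Counter("the cat sat".split()), Counter("the the dog".split())]
data, indices, indptr, vocab = to_csr(rows)
print(vocab)
print(data, indices, indptr)
```

Only non-zero counts are stored, which is what keeps the representation compact for large vocabularies. The resulting arrays have the form accepted by `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(rows), len(vocab)))`.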
Part of the work on Omesa was carried out in the context of the AMiCA (IWT SBO-project 120007) project, funded by the government agency for Innovation by Science and Technology (IWT).