Pre-trained Apache Spark ML Pipelines for NLP, Classification, etc.
- `models`: offline ML models (for download)
- `models/word2vec`: Word2Vec model
- `models/nlp`: part-of-speech models
- `demo`: demo project
English POS tagger model (UD_English-EWT)
Only the `en_ewt-ud-train.conllu` file was used to train the model.
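The following is a minimal sketch of how such a pre-trained pipeline could be loaded and applied with Spark ML. The local path (`models/nlp/pos_ud_ewt`), the input column name (`text`), and the assumption that the download is a directory saved via `PipelineModel.save` are illustrative only; adjust them to the actual artifact.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object TagSentences {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multivac-pos-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical path: wherever the downloaded POS pipeline was unpacked,
    // assuming it was saved with Spark ML's PipelineModel.save(...)
    val posPipeline = PipelineModel.load("models/nlp/pos_ud_ewt")

    // A tiny DataFrame with the text column the pipeline is assumed to expect
    val sentences = Seq("Apache Spark makes NLP pipelines scalable .").toDF("text")

    // Run the whole pipeline (tokenization, tagging, ...) in one transform
    posPipeline.transform(sentences).show(truncate = false)

    spark.stop()
  }
}
```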
Precision, Recall and F1-Score against the test dataset `en_ewt-ud-test.conllu`:

| Tokens | Precision | Recall | F1-Score |
|---|---|---|---|
| 25831 | 0.93 | 0.91 | 0.92 |
Precision, Recall and F1-Score against the training dataset `en_ewt-ud-train.conllu`:

| Tokens | Precision | Recall | F1-Score |
|---|---|---|---|
| 63785 | 0.98 | 0.98 | 0.98 |
Precision is "how useful the POS results are", and Recall is "how complete the results are". Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. https://en.wikipedia.org/wiki/Precision_and_recall
The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. https://en.wikipedia.org/wiki/F1_score
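For example, the test-set F1 above follows directly from its precision and recall: F1 = 2 × (0.93 × 0.91) / (0.93 + 0.91) ≈ 0.92.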
Read more on the evaluation of the models.
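As a rough illustration of how token-level numbers like those above could be reproduced, the sketch below computes weighted precision, recall and F1 with Spark MLlib's `MulticlassMetrics`. The way gold and predicted tags are obtained and paired here is purely hypothetical; in practice they would come from `en_ewt-ud-test.conllu` and the tagger's output.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.SparkSession

object EvaluatePosTags {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pos-eval").master("local[*]").getOrCreate()

    // Hypothetical token-level (gold tag, predicted tag) pairs
    val pairs: Seq[(String, String)] = Seq(
      ("DET", "DET"), ("NOUN", "NOUN"), ("VERB", "AUX"), ("NOUN", "NOUN")
    )

    // MulticlassMetrics works on numeric (prediction, label) pairs,
    // so map every tag to a stable Double index first
    val tagIndex: Map[String, Double] = pairs
      .flatMap { case (g, p) => Seq(g, p) }
      .distinct
      .zipWithIndex
      .map { case (tag, i) => tag -> i.toDouble }
      .toMap

    val predictionAndLabels = spark.sparkContext.parallelize(
      pairs.map { case (gold, pred) => (tagIndex(pred), tagIndex(gold)) }
    )

    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(f"Precision: ${metrics.weightedPrecision}%.2f")
    println(f"Recall:    ${metrics.weightedRecall}%.2f")
    println(f"F1-Score:  ${metrics.weightedFMeasure}%.2f")

    spark.stop()
  }
}
```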
Multivac ML data: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K
Multivac Open Data: https://dataverse.harvard.edu/dataverse/multivac
Panahi, Maziyar; Chavalarias, David, 2018, "Multivac Machine Learning Models", https://doi.org/10.7910/DVN/WSWU7K, Harvard Dataverse, V2
This, and all github.com/multivacplatform projects, are under the Multivac Platform Open Source Code of Conduct. Additionally, see the Typelevel Code of Conduct for specific examples of harassing behavior that are not tolerated.
Code and documentation copyright (c) 2018-2019 ISCPIF - CNRS. Code released under the MIT license.