multivac-ml

Pre-trained ML models for Apache Spark



Pre-trained Apache Spark ML pipelines for NLP, classification, and other tasks.


Facts and Figures

POS Tagger models

English POS tagger model (UD_English-EWT), trained only on the en_ewt-ud-train.conllu file:

Precision, Recall and F1-Score against the test dataset en_ewt-ud-test.conllu

| Tokens | Precision | Recall | F1-Score |
|--------|-----------|--------|----------|
| 25831  | 0.93      | 0.91   | 0.92     |

Precision, Recall and F1-Score against the training dataset en_ewt-ud-train.conllu

| Tokens | Precision | Recall | F1-Score |
|--------|-----------|--------|----------|
| 63785  | 0.98      | 0.98   | 0.98     |

Precision is "how useful the POS results are", and recall is "how complete the results are". Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. Source: https://en.wikipedia.org/wiki/Precision_and_recall

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0. Source: https://en.wikipedia.org/wiki/F1_score

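Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 × Precision × Recall / (Precision + Recall)

Here TP, FP, and FN are the counts of true positives, false positives, and false negatives.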
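The definitions of precision, recall, and F1 above can be sketched in plain Scala. This is a minimal illustration, not the project's evaluation code: the names `TagMetrics`, `Scores`, `perTag`, and `f1` are hypothetical, and the inputs are assumed to be gold and predicted tag sequences aligned token by token. The source does not say how per-tag figures were aggregated into the single numbers in the tables, so the sketch stops at per-tag scores.

```scala
// Minimal sketch: per-tag precision, recall and F1 for a POS tagger.
// All names here (TagMetrics, Scores, perTag, f1) are hypothetical.
object TagMetrics {

  final case class Scores(precision: Double, recall: Double, f1: Double)

  /** Harmonic mean of precision and recall; defined as 0 when both are 0. */
  def f1(precision: Double, recall: Double): Double =
    if (precision + recall == 0.0) 0.0
    else 2.0 * precision * recall / (precision + recall)

  /** Scores for each tag, given aligned gold and predicted tag sequences. */
  def perTag(gold: Seq[String], predicted: Seq[String]): Map[String, Scores] = {
    require(gold.length == predicted.length, "sequences must be aligned token by token")
    val pairs = gold.zip(predicted)
    val tags = (gold ++ predicted).distinct
    tags.map { tag =>
      // True positives, false positives and false negatives for this tag.
      val tp = pairs.count { case (g, p) => g == tag && p == tag }
      val fp = pairs.count { case (g, p) => g != tag && p == tag }
      val fn = pairs.count { case (g, p) => g == tag && p != tag }
      // A ratio with an empty denominator is reported as 0 by convention here.
      val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
      val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
      tag -> Scores(precision, recall, f1(precision, recall))
    }.toMap
  }
}
```

As a consistency check against the test-set table, `TagMetrics.f1(0.93, 0.91)` evaluates to roughly 0.92, which matches the reported F1-Score.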

Read more on evaluation of the models

Open Data

Multivac ML data: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K

Multivac Open Data: https://dataverse.harvard.edu/dataverse/multivac

Dataset Citation

Panahi, Maziyar; Chavalarias, David, 2018, "Multivac Machine Learning Models", https://doi.org/10.7910/DVN/WSWU7K, Harvard Dataverse, V2

Code of Conduct

This, and all github.com/multivacplatform projects, are under the Multivac Platform Open Source Code of Conduct. Additionally, see the Typelevel Code of Conduct for specific examples of harassing behavior that are not tolerated.

Copyright and License

Code and documentation copyright (c) 2018-2019 ISCPIF - CNRS. Code released under the MIT license.