Pre-trained Apache Spark ML Pipelines for NLP, Classification, etc.
- `models`: offline ML models (for download)
- `models/word2vec`: Word2Vec model
- `models/nlp`: part-of-speech models
- `demo`: demo project
English POS tagger model (UD_English-EWT)
Only the `en_ewt-ud-train.conllu` file was used to train the model.
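The following is a minimal sketch of how such a pre-trained pipeline could be loaded and applied with Spark ML. The local path (`models/nlp/pos_ud_ewt`), the input column name (`text`), and the assumption that the download is a directory saved via `PipelineModel.save` are illustrative only; adjust them to the actual artifact.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object TagSentences {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multivac-pos-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical path: wherever the downloaded POS pipeline was unpacked,
    // assuming it was saved with Spark ML's PipelineModel.save(...)
    val posPipeline = PipelineModel.load("models/nlp/pos_ud_ewt")

    // A tiny DataFrame with the text column the pipeline is assumed to expect
    val sentences = Seq("Apache Spark makes NLP pipelines scalable .").toDF("text")

    // Run the whole pipeline (tokenization, tagging, ...) in one transform
    posPipeline.transform(sentences).show(truncate = false)

    spark.stop()
  }
}
```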
Precision, Recall and F1-Score against the test dataset `en_ewt-ud-test.conllu`:

| Tokens | Precision | Recall | F1-Score |
|---|---|---|---|
| 25831 | 0.93 | 0.91 | 0.92 |
Precision, Recall and F1-Score against the training dataset `en_ewt-ud-train.conllu`:

| Tokens | Precision | Recall | F1-Score |
|---|---|---|---|
| 63785 | 0.98 | 0.98 | 0.98 |
Precision is "how useful the POS results are", and Recall is "how complete the results are". Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. https://en.wikipedia.org/wiki/Precision_and_recall
The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. https://en.wikipedia.org/wiki/F1_score
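For example, the test-set F1 above follows directly from its precision and recall: F1 = 2 × (0.93 × 0.91) / (0.93 + 0.91) ≈ 0.92.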
Read more on the evaluation of the models.
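As a rough illustration of how token-level numbers like those above could be reproduced, the sketch below computes weighted precision, recall and F1 with Spark MLlib's `MulticlassMetrics`. The way gold and predicted tags are obtained and paired here is purely hypothetical; in practice they would come from `en_ewt-ud-test.conllu` and the tagger's output.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.SparkSession

object EvaluatePosTags {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pos-eval").master("local[*]").getOrCreate()

    // Hypothetical token-level (gold tag, predicted tag) pairs
    val pairs: Seq[(String, String)] = Seq(
      ("DET", "DET"), ("NOUN", "NOUN"), ("VERB", "AUX"), ("NOUN", "NOUN")
    )

    // MulticlassMetrics works on numeric (prediction, label) pairs,
    // so map every tag to a stable Double index first
    val tagIndex: Map[String, Double] = pairs
      .flatMap { case (g, p) => Seq(g, p) }
      .distinct
      .zipWithIndex
      .map { case (tag, i) => tag -> i.toDouble }
      .toMap

    val predictionAndLabels = spark.sparkContext.parallelize(
      pairs.map { case (gold, pred) => (tagIndex(pred), tagIndex(gold)) }
    )

    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(f"Precision: ${metrics.weightedPrecision}%.2f")
    println(f"Recall:    ${metrics.weightedRecall}%.2f")
    println(f"F1-Score:  ${metrics.weightedFMeasure}%.2f")

    spark.stop()
  }
}
```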
Multivac ML data: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K
Multivac Open Data: https://dataverse.harvard.edu/dataverse/multivac
Panahi, Maziyar; Chavalarias, David, 2018, "Multivac Machine Learning Models", https://doi.org/10.7910/DVN/WSWU7K, Harvard Dataverse, V2
This, and all github.com/multivacplatform projects, are under the Multivac Platform Open Source Code of Conduct. Additionally, see the Typelevel Code of Conduct for specific examples of harassing behavior that are not tolerated.
Code and documentation copyright (c) 2018-2019 ISCPIF - CNRS. Code released under the MIT license.