language identification model trained with WiLI-2018
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free of charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs of 235 languages, totaling in 23500 paragraphs. WiLI is a classification dataset: Given an unknown paragraph written in one dominant language, it has to be decided which language it is.
Paper : https://arxiv.org/pdf/1801.07779.pdf
Download : https://zenodo.org/record/841984
tf-idf vectorized text and Naive Bayes classifier for multinomial models used
from lid import Lid
model = Lid('models/NBModel_full.jbl')
text = "La rue double en longueur et s'urbanise durant le XIXe siècle ; plusieurs usines et ateliers s'y installent."
model.predict(text,topk=5)
[('French', 74.38),
('Occitan', 6.33),
('Picard', 2.59),
('Narom', 2.36),
('Arpitan', 2.35)]
Overall results of trained model for test set of data
Metric | Score |
---|---|
Accuracy | 0.95 |
Recall | 0.93 |
F1 | 0.93 |
for starting server
uvicorn main:app --reload
for documentation
after started server