Detectors
Tools used for this purpose:
*: Supports the Guarani language.
Installation
Pre-requisites:
Install polyglot dependencies.
Install the requirements: pip install -r requirements.txt
Download the fastText library.
Download the crubadan corpus.
# Commented out due to the low precision of textcat; use gcld3 instead.
"""
import nltk
nltk.download('crubadan')
nltk.download('punkt')
"""
Command Line Interface
All commands must be run from the src directory.
Detect language of tweets
python run.py [data_dir] [file_name_of_tweets] [language_lexicon] --detect_language --guarani
data_dir: Path to the data directory, relative to the src directory. Required.
file_name_of_tweets: Name of the file containing the tweets in CSV format. Required.
language_lexicon: Name of the file containing the word lexicon of the language to identify. Optional. The lexicon can belong to any low-resource language, not only Guarani.
guarani: Flag indicating that the language to identify is Guarani (or another low-resource language). Optional. Must be set when language_lexicon is used.
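The argument layout described above can be illustrated with a small parser. This is only a sketch built on Python's standard argparse module: the function names build_parser and parse_cli are hypothetical, the actual run.py may use a different library or different validation, and the rule that a lexicon requires --guarani is an assumption read from the parameter descriptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command line (hypothetical)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("data_dir",
                        help="path to the data directory, relative to src")
    parser.add_argument("file_name_of_tweets",
                        help="CSV file containing the tweets")
    # Optional positional argument: nargs="?" yields None when it is omitted.
    parser.add_argument("language_lexicon", nargs="?", default=None,
                        help="word lexicon file of the language to identify")
    parser.add_argument("--detect_language", action="store_true",
                        help="run language detection on the tweets")
    parser.add_argument("--guarani", action="store_true",
                        help="the language to identify is Guarani "
                             "(or another low-resource language)")
    return parser


def parse_cli(argv):
    """Parse argv and apply the assumed lexicon/flag dependency."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a lexicon is only meaningful together with --guarani.
    if args.language_lexicon and not args.guarani:
        parser.error("language_lexicon requires --guarani")
    return args


if __name__ == "__main__":
    import sys
    print(parse_cli(sys.argv[1:]))
```

With this sketch, omitting language_lexicon leaves it as None, while passing a lexicon without --guarani exits with a usage error.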
Note: Partially forked from https://github.com/social-link-analytics-group-bsc/tw_coronavirus in v1.0.