Lingualytics is a Python library for working with code-mixed text.
It is built on PyTorch, Transformers, Texthero, NLTK and Scikit-learn.
Preprocessing
- Remove stopwords
- Remove punctuation, with an option to add punctuation marks from your own language
- Remove words shorter than a given character limit
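The three preprocessing steps listed here can be sketched in plain Python to show what each one does to a sentence. This is an illustration only, not the Lingualytics implementation: the helper names (`strip_punctuation`, `drop_short_words`, `drop_stopwords`, `clean`), the tiny stopword set, the Devanagari danda as the extra punctuation mark, and the choice to keep words whose length equals the limit are all assumptions made for this example.

```python
import string

# Hypothetical stopword set for illustration; real lists are much longer.
STOPWORDS = {"the", "is", "hai", "ye"}


def strip_punctuation(text, extra=""):
    """Delete ASCII punctuation plus any extra language-specific marks."""
    table = str.maketrans("", "", string.punctuation + extra)
    return text.translate(table)


def drop_short_words(text, length=3):
    """Keep only words with at least `length` characters (assumed inclusive)."""
    return " ".join(w for w in text.split() if len(w) >= length)


def drop_stopwords(text, stopwords):
    """Remove words found in the stopword set, ignoring case."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


def clean(text):
    """Apply the three steps in order: punctuation, short words, stopwords."""
    text = strip_punctuation(text, extra="।")  # "।" is the Devanagari danda
    text = drop_short_words(text, length=3)
    return drop_stopwords(text, STOPWORDS)


print(clean("Ye movie bahut achhi hai, the best!"))
```

Merging stopword sets from two languages, as is typical for code-mixed text, only requires passing the union of the sets to `drop_stopwords`.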
Representation
- Find n-grams from given text
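To make the idea of n-grams concrete, here is a minimal sketch using only the standard library. It is not how Lingualytics computes them; the function names `word_ngrams` and `count_ngrams` are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word-level n-grams of `text` as a list of tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(texts, n=2):
    """Count n-grams across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text, n))
    return counts


print(word_ngrams("main ghar ja raha hoon", n=2))
print(count_ngrams(["a b a b", "a b"], n=2).most_common(1))
```

A text with fewer than `n` words yields an empty list, so short inputs need no special handling.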
NLP
- Classification using PyTorch
  - Train a classifier on your data to perform tasks like sentiment analysis
  - Evaluate the classifier with metrics like accuracy, F1 score, precision and recall
- Use the trained tokenizer to tokenize text
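For readers unfamiliar with the evaluation metrics named in this list, the sketch below computes them from gold and predicted labels in plain Python. It does not show how Lingualytics computes them internally; the function names `accuracy` and `precision_recall_f1` are hypothetical, and treating one label as the positive class is an assumption of this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(accuracy(gold, pred))
print(precision_recall_f1(gold, pred, positive=1))
```

In practice, the `sklearn.metrics` module of scikit-learn provides the same quantities, including averaged variants for multi-class problems.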
Check out some code-mix-friendly models that we have trained using Lingualytics:
- bert-base-multilingual-codemixed-cased-sentiment
- bert-base-en-es-codemix-cased
- bert-base-en-hi-codemix-cased
Use the package manager pip to install lingualytics.
pip install lingualytics
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits
import pandas as pd

# Load the SAIL 2017 validation split: one tab-separated text/label pair per line
df = pd.read_csv(
    "https://github.com/lingualytics/py-lingualytics/raw/master/datasets/SAIL_2017/Processed_Data/Devanagari/validation.txt",
    header=None,
    sep='\t',
    names=['text', 'label'],
)
# pd.set_option('display.max_colwidth', None)

# Chain the cleaning steps; the output of each step is piped into the next
df['clean_text'] = df['text'].pipe(remove_digits) \
    .pipe(remove_punctuation) \
    .pipe(remove_lessthan, length=3) \
    .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))
print(df)
Currently available datasets are:
- CS-EN-ES-CORPUS: Vilares, D., et al.
- SAIL-2017: Das, Dipankar, et al.
- Sub-Word-LSTM: Joshi, Aditya, et al.
from lingualytics.learner import Learner

learner = Learner(model_type='bert',
                  model_name='bert-base-multilingual-cased',
                  dataset='SAIL_2017')
learner.fit()
To train on a custom dataset, the training data directory should contain three files:
- train.txt
- validation.txt
- test.txt
Each file should have one example per line, with the text and the label separated by a tab.
Then set data_dir to the path of your custom dataset.
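As a rough sketch of the layout described here, the snippet below writes the three tab-separated files into a directory and reads one back. The helper names (`write_split`, `read_split`, `make_dataset`), the directory name `my_dataset`, and the sample sentences and labels are all hypothetical; only the file names and the one-pair-per-line, tab-separated format come from the description.

```python
import tempfile
from pathlib import Path


def write_split(path, rows):
    """Write (text, label) pairs, one tab-separated pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in rows:
            f.write(f"{text}\t{label}\n")


def read_split(path):
    """Read a split back into a list of (text, label) string pairs."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 1)) for line in f if line.strip()]


def make_dataset(data_dir, splits):
    """Create train.txt, validation.txt and test.txt inside data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        write_split(data_dir / f"{name}.txt", rows)
    return data_dir


# Made-up examples, only to show the file format
splits = {
    "train": [("yeh movie bahut achhi thi", "positive")],
    "validation": [("service was bekaar", "negative")],
    "test": [("theek thaak experience", "neutral")],
}

with tempfile.TemporaryDirectory() as tmp:
    data_dir = make_dataset(Path(tmp) / "my_dataset", splits)
    print(sorted(p.name for p in data_dir.iterdir()))
    print(read_split(data_dir / "train.txt"))
```

The text itself must not contain tab characters, since the tab is the only separator between the text and its label.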
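from lingualytics.representation import get_ngrams
import pandas as pd

# Load a sample English dataset with a 'text' column
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Extract bigrams (n=2) from the text column and show the first ten
ngrams = get_ngrams(df['text'], n=2)
print(ngrams[:10])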
Documentation is a work in progress! Have a look at it here.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
- Khanuja, Simran, et al. "GLUECoS: An Evaluation Benchmark for Code-Switched NLP." arXiv preprint arXiv:2004.12376 (2020).