We build a pipeline to clean noisy code-switched text found online.
git clone --recursive https://github.com/sumeet-iitg/CS-TextNormalization.git
-- Don't miss the '--recursive' flag; it pulls in the required sub-modules.
DataManagement
: This folder contains the abstractions that make up the pipeline. When you add a new implementation of a tool for the pipeline, make sure it follows one of the abstractions in this folder. Feel free to add new abstractions to this folder. Some of the abstractions are as follows:
languageUtils.py
: Classes for language-specific identifiers, lexicons and spell checkers.
dataloader.py
: Classes for loading a corpus - mono-lingual/multi-lingual.
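To illustrate what following an abstraction could look like, here is a minimal sketch of a spell checker written against an abstract base class. The names `SpellChecker`, `DictionarySpellChecker` and `correct` are hypothetical and chosen for this example only; check languageUtils.py for the actual class names and method signatures before writing a real implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict


class SpellChecker(ABC):
    """Hypothetical abstraction: one spell checker per language."""

    def __init__(self, language: str) -> None:
        self.language = language

    @abstractmethod
    def correct(self, word: str) -> str:
        """Return the corrected form of `word`, or `word` itself if unchanged."""


class DictionarySpellChecker(SpellChecker):
    """Hypothetical implementation backed by a plain lookup table."""

    def __init__(self, language: str, corrections: Dict[str, str]) -> None:
        super().__init__(language)
        self.corrections = corrections

    def correct(self, word: str) -> str:
        # Look up the lowercased word; fall back to the input when unknown.
        return self.corrections.get(word.lower(), word)


checker = DictionarySpellChecker("english", {"teh": "the"})
print(checker.correct("teh"))
```

Because the rest of the code would depend only on the base class, a new tool can be swapped in without touching the other pipeline stages.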
You can use this pipeline end to end, or run the individual components within it.
python main.py "source_tanglish.txt" "english,telugu"
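The command above passes a corpus file followed by a comma-separated list of languages. The sketch below shows one way such arguments could be parsed. It is an assumption about the argument layout, not the actual contents of main.py, and `parse_cli` is a hypothetical helper.

```python
import argparse
from typing import List, Tuple


def parse_cli(argv: List[str]) -> Tuple[str, List[str]]:
    """Hypothetical helper: split the two positional arguments shown above."""
    parser = argparse.ArgumentParser(description="Normalize code-switched text.")
    parser.add_argument("corpus_path", help="path to the input text file")
    parser.add_argument("languages", help="comma-separated language names")
    args = parser.parse_args(argv)
    # Split on commas, trim whitespace, and drop empty entries.
    languages = [name.strip().lower() for name in args.languages.split(",") if name.strip()]
    return args.corpus_path, languages


path, languages = parse_cli(["source_tanglish.txt", "english,telugu"])
print(path, languages)
```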