The objective is to develop systems for cross-lingual topic ID without relying on machine translation systems and bilingual word embeddings. In other words, given parallel text between a source (eg. English) and target language, and some set of topic labeled documents in a source language, the goal is to predict topic labels for test documents from the target language.
This project is still in the development phase. NOT all codes can be made public at this stage.
- Python >= 3.7
- scipy >= 1.3
- numpy >= 1.16.4
- scikit-learn >= 0.21.2
Kinyarwanda (IL9)
Sinhalese (IL10)
Zulu
Hindi
LDC
Leidos
- Pre-Processing
python src/XML_parser_Universal.py --help
- Baseline classification on labeled English data
- Feature representation
python src/extract_features_for_test_data.py --help
- Feature transformation from source language space to the target language (English) space
- Training English classifier
python src/train_classifier_on_eng_sf_features.py --help
Training English classifier batch-wise
python src/train_clf_batchwise.py --help
- Predict labels/topics
- Evaluate average precision score
bash wrapper_preProcess.sh
Task performed:
- Generate ground truth annotations
- Preparing ground truth of the annotations
- Convert ASR outputs to a text file and document id file
- Analyze English labeled text data
- Eliminate out-of-domain docs
bash wrapper_baseline_best_vocabs.sh
Task performed:
- Generate the best vocabulary from each class of labeled English text data
- Finds the best alpha for each topic/label
- Generates bash file to generate features for parallel text
- Generates bash file to compute final average precision scores for the given test data.
- Display the average precision score for all N (vocab size for each label)
bash wrapper_feats_gen.sh
Performs feature extraction on parallel text data and labeled English text data
bash wrapper_learn_transformation.sh
Generates bash files to learn transformation and submit jobs as well to GPUs
bash wrapper_final_scores_generator_new_doc_level.sh
Task performed:
- Training English Classifier
- Combine all sentences of one document into separate sentences.
- Feature extraction of text data
- Apply transformation from IL space to English space
- Predict topics for each document
- Compute average precision scores