Tanmai Khanna 20161212 Vaishnavi Pamulapati 20161114
This code processes a Translation Memory (TM) and retrieves phrases from it based on their Edit Distance from the Candidate Sentence.
Create a Translation Memory from English to Hindi. When given an English sentence as input, return N best matches from the TM and their Hindi Translations.
The Matching has been done by two methods:
- Edit Distance
- Weighted N-Grams
A report with Experimental Results and Observations can be found inside the Experiments folder: ExperimentReport.md
- Install Dependencies
pip3 install -r requirements.txt
- Download the data from https://drive.google.com/file/d/1c58qrA0mrAIXuB_Ueg3J4q8xT--dU-lV/view?usp=sharing , create a folder Project, and store the data inside it in a folder called tm_data.
- Extract this folder into Project, so that Project contains both tm_data and this folder, i.e. TranslationMemory. Then go into this folder.
- For Edit Distance, open TMRetrieval-FULL-Optimised-Ranking.ipynb, run the whole notebook, and give an input sentence when prompted.
- For Weighted N-Grams, open TMRetrieval_weighted_Optimised.ipynb, run the whole notebook, and give an input sentence when prompted.
The original TM files are tm_src.txt and tm_tgt.txt, which are aligned parallel phrase files. They can be accessed at https://drive.google.com/file/d/1c58qrA0mrAIXuB_Ueg3J4q8xT--dU-lV/view?usp=sharing along with several other versions of these files.
tm_src_pp.txt is the preprocessed Source TM, which contains only the content words of each sentence in the TM.
Several other files of different sizes also exist. The size is given in the file name and is the number of sentences in the file.
For example, tm_src_10000.txt and tm_tgt_10000.txt both contain the first 10000 sentences of the original TM.
The smaller data files are available in this folder. The larger ones are available at the link provided above.
Here is a Comprehensive Description of the files available in this Project.
- Contains the Experiment Report (ExperimentReport.md in the Experiments folder), with Experimental Results and Observations.
- Also contains Python files for all Python Notebooks.
TMRetrieval_NoPreprocessing.ipynb & TMRetrieval_weighted.ipynb
- Calculates Edit Distance / Weighted N-Grams on all sentences in TM
- Returns best N matches
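The exhaustive approach above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: it assumes a word-level Levenshtein distance, lowercased whitespace tokenisation, and the TM held as two parallel Python lists. The names `edit_distance` and `best_matches` are hypothetical.

```python
import heapq


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists (assumed metric)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def best_matches(query, tm_src, tm_tgt, n=5):
    """Score every TM sentence against the query and return the n closest."""
    q = query.lower().split()
    scored = ((edit_distance(q, src.lower().split()), i)
              for i, src in enumerate(tm_src))
    return [(d, tm_src[i], tm_tgt[i]) for d, i in heapq.nsmallest(n, scored)]


# Toy data; the target strings are placeholders, not real translations.
tm_src = ["the cat sat", "a dog barked", "the cat sat down"]
tm_tgt = ["tgt_0", "tgt_1", "tgt_2"]
print(best_matches("the cat sat", tm_src, tm_tgt, n=2))
```

Every sentence in the TM is scored, so the cost grows linearly with TM size, which is what the optimised notebooks try to avoid.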
TMRetrieval-FULL-Optimised.ipynb & TMRetrieval_weighted_Optimised.ipynb
- Uses Preprocessed Data
- Prunes Search Using Content Words
- Runs Edit Distance / Weighted N-grams on This Subset of TM
- Returns best N matches
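A rough sketch of the pruning step, under the assumption that a TM sentence is kept as a candidate when it shares at least one content word with the query. The actual criterion used in the notebooks may differ, and `prune_candidates` is a hypothetical name.

```python
def prune_candidates(query_content, tm_src_pp):
    """Return indices of preprocessed TM lines that share a content word with the query.

    query_content: list of content words from the input sentence
    tm_src_pp:     list of lines as they would appear in tm_src_pp.txt
    """
    q = set(query_content)
    return [i for i, line in enumerate(tm_src_pp) if q & set(line.split())]


tm_src_pp = ["cat sat", "dog barked", "cat mat"]
print(prune_candidates(["cat"], tm_src_pp))  # [0, 2]
```

Only the returned indices would then be passed to the more expensive matching step.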
TMRetrieval-FULL-Optimised-MEMORY.ipynb
- Same as Optimised TM, but accesses TM from disk piece by piece.
- Scalable Solution
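Reading the TM piece by piece might look like the sketch below. It assumes the source and target files are line-aligned and that lower scores are better. The function name `stream_best`, the `chunk_size` parameter, and the caller-supplied `score` function are illustrative assumptions, not the notebook's interface.

```python
import heapq
from itertools import islice


def stream_best(src_path, tgt_path, score, n=5, chunk_size=1000):
    """Keep the n lowest-scoring TM entries while holding one chunk in memory."""
    best = []
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip(src, tgt)  # lazily walks both aligned files
        line_no = 0
        while True:
            chunk = list(islice(pairs, chunk_size))
            if not chunk:
                break
            scored = [(score(s.strip()), line_no + k, s.strip(), t.strip())
                      for k, (s, t) in enumerate(chunk)]
            line_no += len(chunk)
            # Merge this chunk with the running best and keep only n entries.
            best = heapq.nsmallest(n, best + scored)
    return best
```

Memory use depends on `chunk_size` and `n` rather than on the size of the TM, which is what makes this variant scale.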
TMRetrieval-FULL-Optimised-Indexed.ipynb
- While pruning the search, an index created from the preprocessed TM replaces the Naive Search, so Candidate Sentences are found faster. Edit Distance is run only on these Candidate Sentences.
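One common way to build such an index is an inverted index from each content word to the sentences that contain it. The sketch below assumes that structure; the layout of the index actually built by IndexedSentences.ipynb is not specified here, and `build_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict


def build_index(tm_src_pp):
    """Map each content word to the set of TM line numbers containing it."""
    index = defaultdict(set)
    for i, line in enumerate(tm_src_pp):
        for word in line.split():
            index[word].add(i)
    return index


def lookup(index, query_content):
    """Union of the posting sets for the query's content words."""
    hits = set()
    for word in query_content:
        hits |= index.get(word, set())
    return hits


index = build_index(["cat sat", "dog barked", "cat mat"])
print(sorted(lookup(index, ["cat", "dog"])))  # [0, 1, 2]
```

A lookup touches only the posting sets for the query's words, so it avoids scanning every line of the TM.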
TMRetrieval-FULL-Optimised-Ranking.ipynb
- After pruning the search, we rank the candidates based on how many content words match and we run Edit Distance only on the top 500 ranked candidate sentences. Fastest solution.
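The ranking step could be sketched as below, assuming the rank is simply the number of distinct content words a candidate shares with the query. The default cut-off of 500 comes from the description above; the function name and the tie-break by line number are assumptions.

```python
def top_candidates(query_content, tm_src_pp, candidate_ids, k=500):
    """Rank candidates by content-word overlap and keep the top k."""
    q = set(query_content)

    def overlap(i):
        return len(q & set(tm_src_pp[i].split()))

    # Highest overlap first; ties are broken by line number for determinism.
    return sorted(candidate_ids, key=lambda i: (-overlap(i), i))[:k]


tm_src_pp = ["cat sat mat", "dog barked", "cat mat", "cat"]
print(top_candidates(["cat", "mat", "sat"], tm_src_pp, [0, 2, 3], k=2))  # [0, 2]
```

Edit Distance would then be computed only for the returned indices, which bounds the expensive step at `k` comparisons per query.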
TMPreProcessing.ipynb
Converts to lowercase and creates a new TM Source with only Content Words.
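A minimal sketch of this kind of preprocessing, assuming content words are whatever remains after removing a stopword list. The tiny `STOPWORDS` set and the punctuation stripping are purely illustrative; the notebook may use a different list or a part-of-speech based filter.

```python
import string

# Illustrative stopword list only; not the list used by the notebook.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and"}


def content_words(sentence, stopwords=STOPWORDS):
    """Lowercase a sentence and keep only tokens that are not stopwords."""
    tokens = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in tokens if w and w not in stopwords]


print(" ".join(content_words("The Cat is on the mat.")))  # cat mat
```

Applying this to every line of tm_src.txt and writing the results out would give a file shaped like tm_src_pp.txt.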
IndexedSentences.ipynb
Creates an Index from the TM.
IDF_values.ipynb
Calculates IDF values of Document.
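For reference, IDF is commonly computed as log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below assumes each TM sentence counts as one document and uses the plain unsmoothed formula; the notebook may use a different variant.

```python
import math
from collections import Counter


def idf_values(sentences):
    """Return {word: log(N / df)} treating each sentence as one document."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(s.split()))  # count each word once per sentence
    return {w: math.log(n / c) for w, c in df.items()}


idf = idf_values(["cat sat", "cat mat", "dog"])
print(idf["cat"] < idf["dog"])  # True: rarer words get higher IDF
```

Words that appear in many sentences get values near zero, so a weighted match would favour rarer, more informative words.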