- Edited by Wei-Hung Weng (MIT CSAIL)
- Created: Jan 16, 2019
- Latest update: May 10, 2019
- Please contact the author with errors found.
ckbjimmy {AT} mit {DOT} edu
This repository contains the codes for the clinical language translation task using unsupervised SMT method presented in the paper Unsupervised Clinical Language Translation (KDD 2019).
- Python 3
- NumPy
- PyTorch
- Moses (for text tokenization, statistical machine translation and back-translation---we suggest to use binary version to prevent complicated compiling process, the version use in the current code is for Ubuntu 17.04+)
- fastText (for learning word embeddings)
- MUSE (for learning bilingual dictionary induction)
- Prepare your corpora in source (
pro
) and target (con
) languages - Edit the data path and parameters in the first section in
run.sh
- Replace the line 59-63 of
MUSE/src/evaluation/word_translation.py
withword1, word2 = line.rstrip().split()
to ensure the correct evaluation - Run
run.sh
The result will be saved in the folder res/RESULT_DIR
under the root directory. The translated sentence will be saved in the last forward translation folder and named as test.tgt.hyp.tok
.
DATA_PATH # corpus path (please name the source corpus named as `pro` and target corpus as `con`)
EMB_DIM # dimension of the word embeddings (default = 300)
SUBWORD # whether to use subword information when training monolingual word embeddings (default = false)
USE_NEWSCRAWL # whether to use third-party large-scale corpora for training consumer LM (default = false)
N_THREADS # number of threads in data preprocessing (default = 48)
IDENTICAL_CHAR # whether to use identical chars as anchors when training cross-lingual embeddings (default = true)
RESULT_DIR # where the experimental results will be stored
SENT_BT # monolingual data used in back-translation (default = 100000)
If you use this code, please kindly cite the paper for this GitHub project (see below for BibTex):
@inproceedings{weng2019unsupervised,
title={Unsupervised Clinical Language Translation},
author={Weng, Wei-Hung and Chung, Yu-An and Szolovits, Peter},
booktitle={KDD},
year={2019}
}