Source code repository for the corpus linguistics course taught at Portland State University.
The labs for this course are designed for non-programmers. One of my goals was to create programming versions of the assignments, so that students interested in NLP could choose that option instead. Lab 1 is complete and Lab 2 is started. Stay tuned!
"A Case Study in Improving Machine Translation"
Abstract:
One of the challenges in adopting machine translation has been how to implement post-editor
(translator) feedback. Machine translation engineers are given access to corpora to train
translation models, but have struggled to make improvements suggested by the translators that use
them. This study explores using NLP to identify types of grammatical features in corpora that are
poorly translated. We set up a simple scenario that creates conditions where a specific grammatical
feature, the passive voice, is poorly translated. We then show how we can identify this feature in
corpora and how augmenting our training corpus with passive voice phrases improves machine
translation quality of this feature.
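To give a rough idea of what identifying passive voice in a corpus can look like, here is a minimal Python sketch. It is not the method used in the paper or in the project code. It is a regex heuristic that assumes English text and looks for a form of "to be" followed by a participle. The names `is_passive` and `passive_ratio`, and the short list of irregular participles, are made up for illustration. A real pipeline would more likely rely on a dependency parser.

```python
import re

# Forms of "to be" that can serve as the passive auxiliary.
BE_FORMS = r"(?:am|is|are|was|were|be|been|being)"

# Assumption: a small, deliberately incomplete list of irregular participles.
IRREGULAR = (
    r"(?:written|taken|given|made|done|seen|known|found|"
    r"built|sent|told|held|kept|chosen|broken)"
)

# be-form, an optional -ly adverb, then a regular (-ed) or irregular participle.
PASSIVE_RE = re.compile(
    r"\b" + BE_FORMS + r"\s+(?:\w+ly\s+)?(?:\w{2,}ed|" + IRREGULAR + r")\b",
    re.IGNORECASE,
)


def is_passive(sentence):
    """Return True if the sentence matches the passive-voice heuristic."""
    return PASSIVE_RE.search(sentence) is not None


def passive_ratio(sentences):
    """Return the fraction of sentences flagged as passive (0.0 if empty)."""
    sentences = list(sentences)
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if is_passive(s))
    return flagged / len(sentences)


if __name__ == "__main__":
    sample = [
        "The report was written by the committee.",
        "The committee wrote the report.",
        "The door was quickly closed.",
        "The cat is on the mat.",
    ]
    for s in sample:
        print(is_passive(s), s)
    print("passive ratio:", passive_ratio(sample))
```

The heuristic over-matches adjectival uses such as "She is tired" and misses any irregular participle not in the list, so treat its output as a first-pass filter for finding candidate passive sentences rather than a reliable count.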
You can read the full paper here.
The source code for the project is located in the corpus_project folder. I will add a Jupyter Notebook and more information on requirements, etc., at a later date.