apertium-eu-en: An XSLT repository from jimregan

Basque and English

These are the linguistic data for the Apertium Basque--English machine translator.
It translates only in the direction Basque--English. 
You need apertium-3.0 and lttoolbox-3.0 to use this translator.


To compile the linguistical data simply do:

$ ./configure

to generate a Makefile file and then

$ make

inside of this directory.

CITING

Academic users of this package are requested to cite the following article:

@article{PLN4867,
	author = {O'Regan, Jim and Forcada, Mikel~L.},
	title = {Peeking through the language barrier: the development of a free/open-source gisting system for Basque to English based on apertium.org},
	journal = {Procesamiento del Lenguaje Natural},
	volume = {51},
	number = {0},
	year = {2013},
	issn = {1989-7553},	
	url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/4867}
}

To cite Apertium, please use the following:

@article{apertium,year={2011},
issn={0922-6567},
journal={Machine Translation},
volume={25},
number={2},
doi={10.1007/s10590-011-9090-0},
title={Apertium: a free/open-source platform for rule-based machine translation},
url={http://dx.doi.org/10.1007/s10590-011-9090-0},publisher={Springer Netherlands},
keywords={Free/open-source machine translation; Rule-based machine translation; 
Apertium; Shallow transfer; Finite-state transducers},
author={Forcada, Mikel~L. and Ginestí-Rosell, Mireia and Nordfalk, Jacob and O’Regan, Jim and Ortiz-Rojas, Sergio and Pérez-Ortiz, Juan~Antonio and Sánchez-Martínez, Felipe and Ramírez-Sánchez, Gema and Tyers, Francis~M.},
pages={127-144},
language={English}
}

Many of the resources used were adapted from the apertium-eu-es package; to
cite that, please use:

@article{eues,
  title={Development of a free {B}asque to {S}panish machine translation system},
  author={Ginest{\i}-Rosell, Mireia and Ram{\i}rez-S{\'a}nchez, Gema and Ortiz-Rojas, Sergio and Tyers, Francis M and Forcada, Mikel L},
  journal={Procesamiento del Lenguaje Natural},
  volume={43},
  pages={187--195},
  url={http://transducens.dlsi.ua.es/repositori/transducens/pubs/248/art21.pdf},
  year={2009}
}

That, in turn, was based on the Matxin Spanish to Basque system; to cite that,
use either of the following:

@inproceedings{matxinen,
  title={Matxin: moving towards language independence},
  author={Mayor, Aingeru and Tyers, Francis~M},
  booktitle={Proceedings of the first international workshop on free/open-source rule-based machine translation, {A}lacant},
  pages={11--17},
  url={http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1257340034/publikoak/FreeRBMT09.pdf},
  year={2009}
}

@article{matxin,
  title={Matxin, an open-source rule-based machine translation system for Basque
},
  author={Mayor, Aingeru and Alegria, I{\~n}aki and De Ilarraza, Arantza D{\'\i}az and Labaka, Gorka and Lersundi, Mikel and Sarasola, Kepa},
  journal={{M}achine {T}ranslation},
  volume={25},
  number={1},
  pages={53--82},
  year={2011},
  publisher={Springer}
}

EXISTING RESOURCES

The initial lexicon for this package was created by triangulating the lexica
of apertium-eu-es and apertium-en-es; extra data was added from Matxin's
prototype English to Basque translator. For more information, see section 2.3
of the 2013 SEPLN article (``Peeking through the language barrier''). For information specific to the Matxin data, please see the 2009 FreeRBMT paper 
(``Matxin: moving towards language independence'').

The transfer rules were converted from the rules in apertium-eu-es, and 
augemented for English. For more information, see section 2.5 of the 2013
SEPLN article, or section 2.5 of the 2009 SEPLN article (``Development of a 
free Basque to Spanish machine translation system'') 

FUNDING

This package was produced with funding from the European Association for 
Machine Translation.

TAGGER 

To use this language-pair package with apertium YOU DO NOT NEED TO
RETRAIN THE TAGGER. Probabilities and auxiliary data are provided for
both the eu-en translation direction which should be
acceptable for most applications, and should work even if you change
the dictionaries in a reasonably way.

If for some reason you need to retrain the tagger (for example, you
have made really extensive changes to the dictionaries such as
creating new lexical categories), you have three alternatives:

* To perform a supervised training:

  To this end tagged corpora is provided, but tagged corpora
  (eu-tagger-data/eu.tagged ) could be
  obsolete for some words. If this is the case, the tagger training 
  program  will show you where the problems are and you will need 
  to solve them by hand. Be sure to solve the problems by modifying 
  ONLY the .tagged file, NEVER the .untagged file that is 
  automatically generated.

  The supervised training is done by typing: 

  make -f eu-es-supervised.make (for the Basque part-of-speech tagger)
* To perform an unsupervised training:

  For this purpose you will need to assemble a large (hundreds of
  thousand of words) plain-text corpus for each language (for example,
  using a robot to harvest text from online newspapers) and put them in
  the proper place, for instance eu-tagger-data/eu.crp.txt. This type
of training does not need human
  intervention but, as expected, results will be less adequate than
  those obtained with the supervised training.

  The unsupervised training is done through the iterative Baum-Welch
  algorithm. By default the number of iterations is set to 8, but you
  can change this value by editing the Makefile and changing the
  value of TAGGER_UNSUPERVISED_ITERATIONS.

  The unsupervised training is done by typing:

  make -f eu-es-unsupervised.make (for the Basque part-of-speech tagger)

  This is the training method followed to train the basque tagger.


* To perform an unsupervised training by using target-language
  information and the rest of the modules of the Apertium MT engine:

  To do so you need large plain-text corpora on both languages. Please
  download the apertium-tagger-training-tools package and follow the
  instructions provided there.
jimregan/apertium-eu-en