Basque and English

These are the linguistic data for the Apertium Basque--English machine
translator. It translates only from Basque to English.

You need apertium-3.0 and lttoolbox-3.0 to use this translator.

To compile the linguistic data, simply run:

    $ ./configure

to generate a Makefile, and then

    $ make

inside this directory.

CITING

Academic users of this package are requested to cite the following article:

@article{PLN4867,
  author = {O'Regan, Jim and Forcada, Mikel~L.},
  title = {Peeking through the language barrier: the development of a free/open-source gisting system for Basque to English based on apertium.org},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {51},
  number = {0},
  year = {2013},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/4867}
}

To cite Apertium, please use the following:

@article{apertium,
  year={2011},
  issn={0922-6567},
  journal={Machine Translation},
  volume={25},
  number={2},
  doi={10.1007/s10590-011-9090-0},
  title={Apertium: a free/open-source platform for rule-based machine translation},
  url={http://dx.doi.org/10.1007/s10590-011-9090-0},
  publisher={Springer Netherlands},
  keywords={Free/open-source machine translation; Rule-based machine translation; Apertium; Shallow transfer; Finite-state transducers},
  author={Forcada, Mikel~L. and Ginestí-Rosell, Mireia and Nordfalk, Jacob and O'Regan, Jim and Ortiz-Rojas, Sergio and Pérez-Ortiz, Juan~Antonio and Sánchez-Martínez, Felipe and Ramírez-Sánchez, Gema and Tyers, Francis~M.},
  pages={127-144},
  language={English}
}

Many of the resources used were adapted from the apertium-eu-es package;
to cite that, please use:

@article{eues,
  title={Development of a free {B}asque to {S}panish machine translation system},
  author={Ginest{\'\i}-Rosell, Mireia and Ram{\'\i}rez-S{\'a}nchez, Gema and Ortiz-Rojas, Sergio and Tyers, Francis M and Forcada, Mikel L},
  journal={Procesamiento del Lenguaje Natural},
  volume={43},
  pages={187--195},
  url={http://transducens.dlsi.ua.es/repositori/transducens/pubs/248/art21.pdf},
  year={2009}
}

That, in turn, was based on the Matxin Spanish to Basque system; to cite
that, use either of the following:

@inproceedings{matxinen,
  title={Matxin: moving towards language independence},
  author={Mayor, Aingeru and Tyers, Francis~M},
  booktitle={Proceedings of the first international workshop on free/open-source rule-based machine translation, {A}lacant},
  pages={11--17},
  url={http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1257340034/publikoak/FreeRBMT09.pdf},
  year={2009}
}

@article{matxin,
  title={Matxin, an open-source rule-based machine translation system for Basque},
  author={Mayor, Aingeru and Alegria, I{\~n}aki and De Ilarraza, Arantza D{\'\i}az and Labaka, Gorka and Lersundi, Mikel and Sarasola, Kepa},
  journal={{M}achine {T}ranslation},
  volume={25},
  number={1},
  pages={53--82},
  year={2011},
  publisher={Springer}
}

EXISTING RESOURCES

The initial lexicon for this package was created by triangulating the
lexica of apertium-eu-es and apertium-en-es; extra data was added from
Matxin's prototype English to Basque translator. For more information, see
section 2.3 of the 2013 SEPLN article (``Peeking through the language
barrier'').
For information specific to the Matxin data, please see the 2009 FreeRBMT
paper (``Matxin: moving towards language independence'').

The transfer rules were converted from the rules in apertium-eu-es and
augmented for English. For more information, see section 2.5 of the 2013
SEPLN article, or section 2.5 of the 2009 SEPLN article (``Development of a
free Basque to Spanish machine translation system'').

FUNDING

This package was produced with funding from the European Association for
Machine Translation.

TAGGER

To use this language-pair package with apertium YOU DO NOT NEED TO RETRAIN
THE TAGGER. Probabilities and auxiliary data are provided for the eu-en
translation direction; they should be acceptable for most applications and
should work even if you change the dictionaries in a reasonable way.

If for some reason you need to retrain the tagger (for example, you have
made really extensive changes to the dictionaries, such as creating new
lexical categories), you have three alternatives:

* To perform a supervised training: A tagged corpus is provided for this
  purpose (eu-tagger-data/eu.tagged), but it could be obsolete for some
  words. If this is the case, the tagger training program will show you
  where the problems are and you will need to solve them by hand. Be sure
  to solve the problems by modifying ONLY the .tagged file, NEVER the
  .untagged file, which is automatically generated. The supervised
  training is done by typing:

      make -f eu-es-supervised.make   (for the Basque part-of-speech tagger)

* To perform an unsupervised training: For this purpose you will need to
  assemble a large (hundreds of thousands of words) plain-text Basque
  corpus (for example, using a robot to harvest text from online
  newspapers) and put it in the proper place, for instance
  eu-tagger-data/eu.crp.txt. This type of training does not need human
  intervention but, as expected, the results will be less adequate than
  those obtained with the supervised training. The unsupervised training
  uses the iterative Baum-Welch algorithm. By default the number of
  iterations is set to 8, but you can change this by editing the Makefile
  and changing the value of TAGGER_UNSUPERVISED_ITERATIONS. The
  unsupervised training is done by typing:

      make -f eu-es-unsupervised.make   (for the Basque part-of-speech tagger)

  This is the method that was used to train the Basque tagger.

* To perform an unsupervised training by using target-language information
  and the rest of the modules of the Apertium MT engine: To do so you need
  large plain-text corpora in both languages. Please download the
  apertium-tagger-training-tools package and follow the instructions
  provided there.
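
Before starting the unsupervised training described above, it can help to
check that the corpus is actually in place and reasonably large. The
following is only a minimal sketch and is not part of this package: the
function name corpus_is_large_enough and the variables CORPUS and MIN_WORDS
are hypothetical, and the threshold of 100000 words is just one reading of
"hundreds of thousands of words". Only the corpus path and the make command
are taken from the text above.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the unsupervised tagger training.

# Corpus location mentioned in this README; override via the environment.
CORPUS="${CORPUS:-eu-tagger-data/eu.crp.txt}"
# Assumed minimum size (not a value defined by Apertium).
MIN_WORDS="${MIN_WORDS:-100000}"

# corpus_is_large_enough FILE MIN
# Returns 0 if FILE has at least MIN whitespace-separated words,
# 1 if it has fewer, and 2 if FILE does not exist.
corpus_is_large_enough() {
    [ -f "$1" ] || return 2
    # Strip the padding that some wc implementations add to the count.
    words=$(wc -w < "$1" | tr -d ' ')
    [ "$words" -ge "$2" ]
}

if corpus_is_large_enough "$CORPUS" "$MIN_WORDS"; then
    echo "Corpus looks large enough; you can now run:"
    echo "  make -f eu-es-unsupervised.make"
else
    echo "Corpus missing or smaller than $MIN_WORDS words: $CORPUS" >&2
fi
```

Run the script from this directory; to check a corpus stored elsewhere or
to use a different threshold, set CORPUS or MIN_WORDS in the environment
before running it.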