# Traindata

Orthography-to-phonology networks require training data that takes special care with aspects of phonological and orthographic processing when specifying the representations used during learning. This repo provides functionality to that end.

## Installation

This project now supports dependency management through Poetry. To get started, first [install Poetry](https://python-poetry.org/docs/#installation) if you haven't already.

Clone the repository to your local machine:

```bash
git clone https://github.com/MCooperBorkenhagen/Traindata.git
```

If you're developing in VS Code, consider creating the virtual environment in the project directory, as VS Code natively detects it and uses its kernel:

```bash
poetry config virtualenvs.in-project true
```

Install dependencies and activate the virtual environment:

```bash
poetry install
poetry shell
```

If you're in VS Code and have created the *.venv* folder locally, you do not need to activate the Poetry shell.

## Usage

To include this repository as a dependency in another project, add the following to your project's pyproject.toml file:

```toml
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
traindata = { git = "https://github.com/MCooperBorkenhagen/Traindata.git", branch = "main" }
```

Then run `poetry install` to install the `traindata` package along with its dependencies.

Alternatively, you can install this project using pip directly from the repository:

```bash
pip install git+https://github.com/MCooperBorkenhagen/Traindata.git@main#egg=traindata
```

by adding that URL (along with the `git+` prefix) to a requirements.txt file and running:

```bash
pip install -r requirements.txt
```

or by cloning the repository and building it from the *setup.py* file:

```bash
python setup.py bdist_wheel
pip install dist/traindata-0.1.0-py3-none-any.whl
```

Phonology
---------

The phonological structure contained/assumed in Traindata is opinionated and is based on the ARPAbet. For more information on this representational scheme, see the documentation linked below. The specific version of this scheme is maintained locally in this repository, but is derivative of what is contained in the MCB repository linked below.

ARPAbet Documentation:

- https://en.wikipedia.org/wiki/ARPABET
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- https://github.com/MCooperBorkenhagen/ARPAbet

The phonological representations are extensions of previous connectionist implementations, namely from Harm & Seidenberg (1999). In essence, the classical distinctive features framework is used, originally formulated in Chomsky and Halle (1968). Currently, the framework used in this repository codes for binary features, but other codings are possible.

Orthography
-----------

The orthographic representations used here assume a one-hot coding over letters, with each letter represented fundamentally as a 26-unit vector where the nth element corresponds to that letter's position in the alphabet. At present, the letters specified are all lowercase, though that could change in the future.
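As a minimal sketch of this coding (the helper names below are illustrative only and are not part of the `traindata` API), a word can be encoded as a stack of 26-unit one-hot vectors:

```python
import numpy as np

# Illustrative sketch (not the traindata API): each lowercase letter maps to a
# 26-unit one-hot vector with a 1 at its position in the alphabet.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def one_hot_letter(letter: str) -> np.ndarray:
    """Return a 26-unit one-hot vector for a single lowercase letter."""
    vector = np.zeros(len(ALPHABET), dtype=int)
    vector[ALPHABET.index(letter)] = 1
    return vector

def encode_word(word: str) -> np.ndarray:
    """Stack one-hot letter vectors into a (letters, 26) array."""
    return np.stack([one_hot_letter(letter) for letter in word.lower()])

print(encode_word("cat").shape)  # (3, 26)
```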
Terminal Segments
-----------------

An important detail about the way that orthography and phonology are represented in the method contained here concerns terminal segments. When processing time-varying representations of language, including words, the boundaries of any given language form (sentence, word, etc.) often need to be explicitly identified. The featural representations contained here allow for this, with a feature representing the start of the word (labeled SOS) and a feature encoding the end of the word (labeled EOS). The labels "SOS" and "EOS" are adopted from the machine learning literature and implementations of sequential ANNs.
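As a rough illustration, the sketch below assumes that SOS and EOS are two additional binary units appended to each segment's feature vector, switched on for the first and last segments respectively; the helper shown is hypothetical and not part of the `traindata` package.

```python
import numpy as np

# Hypothetical sketch (not the traindata API): append two binary units to each
# segment's feature vector, with the SOS unit set on the first segment and the
# EOS unit set on the last.
def add_terminal_features(features: np.ndarray) -> np.ndarray:
    """Append SOS and EOS units to a (segments, features) array."""
    n_segments, _ = features.shape
    terminals = np.zeros((n_segments, 2), dtype=features.dtype)
    terminals[0, 0] = 1    # SOS: first segment
    terminals[-1, 1] = 1   # EOS: last segment
    return np.hstack([features, terminals])

word = np.eye(26, dtype=int)[[2, 0, 19]]   # one-hot letters for "c", "a", "t"
print(add_terminal_features(word).shape)   # (3, 28)
```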