# Traindata

Orthography-to-phonology networks require training data built with
special care: aspects of phonological and orthographic processing
must be considered when specifying the representations used during
learning. This repo provides functionality to that end.

## Installation

This project now supports dependency management through Poetry. To get started, first [install Poetry](https://python-poetry.org/docs/#installation) if you haven't already.

Clone the repository to your local machine:
```bash
git clone https://github.com/MCooperBorkenhagen/Traindata.git
```

If you're developing in VS Code, consider creating the virtual environment in the project directory,
since VS Code natively detects it and uses its kernel:

```bash
poetry config virtualenvs.in-project true
```

Install dependencies and activate the virtual environment:

```bash
poetry install
poetry shell
```

If you're in VS Code and have created the virtual environment locally in the *.venv* folder, then you do not need to activate the Poetry shell.

## Usage

To include this repository as a dependency in another project, add the following to your project's pyproject.toml file:

```toml
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
traindata = { git = "https://github.com/MCooperBorkenhagen/Traindata.git", branch = "main" }
```

Then, run `poetry install` to install the `traindata` package along with its dependencies.

Alternatively, you can install this project with pip:

```bash
pip install git+https://github.com/MCooperBorkenhagen/Traindata.git@main#egg=traindata
```

You can also add that URL (with the `git+` prefix) to a *requirements.txt* file and install from it:

```bash
pip install -r requirements.txt
```

or by cloning the repository and building it from the *setup.py* file:


```bash
python setup.py bdist_wheel
pip install dist/traindata-0.1.0-py3-none-any.whl
```

## Phonology

The phonological structure contained in (and assumed by) Traindata is
opinionated and is based on the ARPAbet. For more information
on this representational scheme, see the links below. The specific version
of the scheme is maintained locally in this repository, but it is
derived from the MCB repository linked below.

ARPAbet Documentation:
https://en.wikipedia.org/wiki/ARPABET
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
https://github.com/MCooperBorkenhagen/ARPAbet

The phonological representations extend previous
connectionist implementations, notably Harm & Seidenberg (1999).
In essence, they use the classical distinctive-features framework
originally formulated in Chomsky and Halle (1968). By default,
the framework used in this repository codes binary features,
but other codings are possible.
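
The repository maintains the actual feature inventory; as an illustration of the general idea only, a binary distinctive-feature coding of ARPAbet phonemes might look like the following sketch (the feature names and values here are hypothetical simplifications, not the repo's real scheme):

```python
# Illustrative binary distinctive-feature vectors in the spirit of
# Chomsky & Halle (1968). Features and values are hypothetical
# simplifications, not Traindata's actual inventory.
FEATURES = ["consonantal", "sonorant", "voice", "nasal", "continuant"]

# A few ARPAbet phonemes mapped to binary feature values (illustrative only)
PHONEMES = {
    "M": [1, 1, 1, 1, 0],   # voiced bilabial nasal
    "S": [1, 0, 0, 0, 1],   # voiceless alveolar fricative
    "AE": [0, 1, 1, 0, 1],  # low front vowel, as in "cat"
}

def phoneme_vector(phoneme):
    """Return the binary distinctive-feature vector for an ARPAbet phoneme."""
    return PHONEMES[phoneme]

print(phoneme_vector("S"))  # [1, 0, 0, 0, 1]
```

Each phoneme becomes a fixed-length binary vector, which is what makes the representation usable as an output layer target in a connectionist model.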


## Orthography

The orthographic representations used here assume a one-hot
coding over letters: each letter is represented as a 26-unit
vector whose nth element corresponds to that letter's position
in the alphabet. By default, all letters are lowercase, though
that could change in the future.
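
A minimal sketch of this one-hot coding (the function names here are illustrative, not part of the package's API):

```python
import string

def one_hot_letter(letter):
    """One-hot encode a lowercase letter as a 26-unit vector.

    The unit at the letter's position in the alphabet is set to 1.
    """
    vec = [0] * 26
    vec[string.ascii_lowercase.index(letter)] = 1
    return vec

def encode_word(word):
    """Encode a word as a sequence of 26-unit one-hot vectors, one per letter."""
    return [one_hot_letter(c) for c in word.lower()]

vectors = encode_word("cat")
print(vectors[0].index(1))  # 2, because 'c' is the 3rd letter (0-indexed: 2)
```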

## Terminal Segments
An important detail about how orthography and phonology
are represented here concerns terminal segments. When processing
time-varying representations of language, including words, the
boundaries of any given language form (sentence, word, etc.) often
need to be explicitly identified. The featural representations
contained here allow for this, with one feature representing the
start of the word (labeled SOS) and another encoding the end of the
word (labeled EOS). The labels "SOS" and "EOS" are adopted from the
machine learning literature and implementations of sequential ANNs.
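
One common way to realize this, sketched below under the assumption that each one-hot letter vector is extended with two extra units (one for SOS, one for EOS); the exact layout in Traindata may differ:

```python
# Sketch of terminal-segment marking for a sequence of 26-unit one-hot
# letter vectors. Assumes two marker units are appended to each vector;
# this layout is illustrative, not necessarily the repo's exact scheme.
N_LETTERS = 26

def with_terminals(vectors):
    """Wrap a sequence of one-hot vectors with SOS and EOS marker frames.

    Each letter vector is extended by two units (SOS, EOS), both zero;
    dedicated SOS and EOS frames are prepended and appended.
    """
    sos = [0] * N_LETTERS + [1, 0]   # start-of-sequence frame
    eos = [0] * N_LETTERS + [0, 1]   # end-of-sequence frame
    body = [v + [0, 0] for v in vectors]
    return [sos] + body + [eos]

# Example: a single letter vector for 'c' gains SOS/EOS boundary frames
letter_c = [0] * N_LETTERS
letter_c[2] = 1
sequence = with_terminals([letter_c])
print(len(sequence))  # 3: SOS frame, 'c' frame, EOS frame
```

Marking boundaries this way lets a sequential network learn when a word begins and ends, rather than relying on fixed-length input alone.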