Code repository associated with 'The cross-linguistic performance of word segmentation models over time', by Andrew Caines, Emma Altmann-Richer & Paula Buttery, University of Cambridge, U.K.
Submitted to the Journal of Child Language.
To use this code, please ensure you have an up-to-date installation of Python 3, preferably running in a virtual environment.
You'll need to install the following Python packages: NumPy, SciPy and NLTK. Once you've upgraded pip, you might do something like `pip3 install numpy scipy nltk`. Be sure to also download the data associated with NLTK (`python -m nltk.downloader all`).
Also ensure you have at least base R installed; you might also choose to install RStudio but that's up to you.
Finally, this experiment depends on the phonemizer and wordseg tools developed by Alex Cristia, Mathieu Bernard, and colleagues. Please see the installation instructions on their websites. Note that phonemizer requires festival and/or eSpeak as back-end text-to-speech systems, plus optionally segments for grapheme-to-phoneme mapping. We use espeak-ng with extended dictionaries for Cantonese, Mandarin and Russian, plus segments for Japanese etc.
Wordseg has various dependencies too, detailed in its installation instructions. Since it needs Python 3, it's cleaner to set up a virtual environment and install wordseg there.
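As a quick sanity check of the requirements above, something like the following sketch can report what is missing before you start. It is not part of the repository: the package and executable names are assumptions drawn from the text above (for example, that eSpeak is installed as `espeak-ng` and that R provides `Rscript`), so adjust them to your own setup.

```python
import importlib.util
import shutil

# Names assumed from the requirements described above; edit as needed.
PYTHON_PACKAGES = ["numpy", "scipy", "nltk", "phonemizer", "wordseg"]
EXECUTABLES = ["espeak-ng", "Rscript"]


def missing_packages(names):
    """Return the top-level packages in `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_executables(names):
    """Return the programs in `names` that are not on the PATH."""
    return [n for n in names if shutil.which(n) is None]


# Example use:
# print("Missing packages:", missing_packages(PYTHON_PACKAGES))
# print("Missing executables:", missing_executables(EXECUTABLES))
```

Run it inside the virtual environment where you installed wordseg, since packages are looked up in the active environment only.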
Note that these experiments were run on XML corpora downloaded from CHILDES. We unzip and store the files under the path `~/Corpora/CHILDES/xml/` on a Unix-like system (i.e. Mac, Linux). Our corpus selections depended on factors described in our paper. The dataset we used is available as a zip file upon request.
Here are the corpora included in our study:
Language | Corpus | Language | Corpus | Language | Corpus |
---|---|---|---|---|---|
Basque | Luque | German | Leo, Miller, Rigol, Szagun, Wagner | Mandarin | Tong, Zhou3 |
Cantonese | LeeWongLeung | Greek | Doukas | Norwegian | Ringstad |
Croatian | Kovacevic | Hungarian | Bodor, MacWhinney, Reger | PortugueseBR | Florianopolis |
Danish | Plunkett | Icelandic | Kari | PortuguesePT | Santos |
Dutch | Gillis, Groningen, VanKampen | Indonesian | Jakarta | Romanian | Avram |
EnglishNA | Bloom70, Braunwald, Brent, Brown, Cornell, Gelman, MacWhinney, NewmanRatner, Peters, Post, Rollins, Sachs, Soderstrom, Suppes, Tardif, Valian | Irish | Gaeltacht | Serbian | SCECL |
EnglishUK | Korman, Lara, MPI-EVA-Manchester, Manchester, Nuffield, Thomas | Italian | Tonelli | Spanish | Aguirre, JacksonThal, Nieva, OreaPine, Ornat, Vila |
Estonian | Vija, Zupping | Japanese | Ishii, MiiPro, Miyata | Swedish | Lund |
Farsi | Family | Korean | Jiwon, Ryu | Turkish | Aksu |
French | York |
Note that we removed the diary and 0notrans/0untranscribed/0extra directories in the Lara (Eng.UK), Braunwald, MacWhinney, Nelson (all Eng.NA) and LeeWongLeung (Cantonese) collections before further processing. We also removed specific child corpora from collections because their starting age was over 2 years: Lea from French York; PIT, IDO, PRI, LAR from Indonesian Jakarta; Yun from Korean Ryu.
In your `~/Corpora/CHILDES/` directory, you need to have at least the following subdirectories (in order of use): `xml`, `non_child_utterances`, `phonemized`, `wordseg`, plus a `~/tmp/` directory for temporary files created during the wordseg experiments.
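The directory layout described above could be created with a short helper like this one. It is a minimal sketch rather than part of the repository; the function name `make_layout` is hypothetical, and the subdirectory names are the ones listed above.

```python
from pathlib import Path

# Subdirectories listed above, in order of use.
SUBDIRS = ["xml", "non_child_utterances", "phonemized", "wordseg"]


def make_layout(childes_root, tmp_dir):
    """Create the CHILDES subdirectories and the temporary directory.

    Existing directories are left untouched. Returns the list of paths.
    """
    root = Path(childes_root).expanduser()
    paths = [root / name for name in SUBDIRS]
    paths.append(Path(tmp_dir).expanduser())
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Example use with the paths assumed by the scripts:
# make_layout("~/Corpora/CHILDES", "~/tmp")
```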
It's important that within `~/Corpora/CHILDES/xml/` you have subdirectories for each language: e.g. `~/Corpora/CHILDES/xml/Spanish/`, `~/Corpora/CHILDES/xml/French/` (title case is important too). The downloaded and unzipped CHILDES corpora go into the language-appropriate directory, maintaining the structure they come in, i.e. `collectionName` or `collectionName/childName` (e.g. `Lara/` or `Brown/Adam/`).
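To confirm that the language directories are in place before running the pipeline, a check along these lines may help. It is an illustrative sketch only: `check_language_dirs` is a hypothetical helper, and testing that a name starts with an upper-case letter is a rough stand-in for the title-case requirement (names such as `EnglishNA` would fail a strict title-case test).

```python
from pathlib import Path


def check_language_dirs(xml_root):
    """Count XML files per language directory and flag odd directory names.

    Returns (counts, suspicious): `counts` maps each language directory
    name to the number of .xml files found anywhere beneath it, and
    `suspicious` lists names that do not start with an upper-case letter.
    """
    root = Path(xml_root).expanduser()
    counts = {}
    suspicious = []
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[lang_dir.name] = sum(1 for _ in lang_dir.rglob("*.xml"))
        if not lang_dir.name[:1].isupper():
            suspicious.append(lang_dir.name)
    return counts, suspicious


# Example use:
# counts, suspicious = check_language_dirs("~/Corpora/CHILDES/xml")
```

A language with a count of zero usually means the corpus was unzipped into the wrong place.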
Experiment output files will save to your working directory (i.e. where you download the repository and/or run the experiment from) unless you update the code.
- Corpus preparation: takes XML transcriptions for all corpora in the data directory, filters out child utterances, and outputs plain text, one utterance per line, provided the corpus contains the requisite number of non-child utterances (default=10000; must be edited in file). Also counts corpora, non-child utterances and words, and outputs a statistics file in the directory above the XML. Run as:
python3 step1_prepare_childes_xml_for_phonemizer.py
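The filtering idea in this step can be illustrated with the standard library alone. The sketch below is not the code in `step1_prepare_childes_xml_for_phonemizer.py`: it assumes the TalkBank XML layout in which each utterance is a `<u who="...">` element containing `<w>` word elements, and it assumes the child is the speaker with ID `CHI`. The sample transcript is invented.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.talkbank.org/ns/talkbank}"  # assumed TalkBank namespace
MIN_UTTERANCES = 10000  # default threshold described above


def non_child_utterances(xml_text, child_id="CHI"):
    """Return one plain-text string per utterance not spoken by the child."""
    root = ET.fromstring(xml_text)
    lines = []
    for utterance in root.iter(NS + "u"):
        if utterance.get("who") == child_id:
            continue  # skip the child's own speech
        words = [w.text.strip() for w in utterance.iter(NS + "w")
                 if w.text and w.text.strip()]
        if words:
            lines.append(" ".join(words))
    return lines


def enough_data(lines, minimum=MIN_UTTERANCES):
    """True if the corpus has at least `minimum` non-child utterances."""
    return len(lines) >= minimum


# Invented sample transcript, for illustration only.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="MOT" uID="u0"><w>where</w><w>is</w><w>the</w><w>ball</w></u>
  <u who="CHI" uID="u1"><w>ball</w></u>
  <u who="FAT" uID="u2"><w>there</w><w>it</w><w>is</w></u>
</CHAT>"""

for line in non_child_utterances(SAMPLE):
    print(line)
```

Real CHILDES files carry extra annotation inside `<w>` elements, so the actual script may need more careful text extraction than this.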
- Phonemize the corpora: transforms plain text utterances into phonemic form with the `phonemizer` toolkit. Deals with a known set of languages, listed at the top of the script (to add a new language, add its name and eSpeak code to the dictionary in the script; the codes are available by querying `espeak --voices` from the command line). Outputs a limited number of utterances (default=10000; edit in file, or set to zero to indicate no limit). Note that we used espeak-ng (version 1.49.3) as the backend for all languages except Japanese, which requires `segments` to deal with romanized transcripts (we edited its error handling so that all invalid graphemes would be ignored rather than causing an exit; to do this, replace the raise error line in the strict function in `segments/src/segments/errors.py` with `return ''`). Note that we remove punctuation and tone markers from the Chinese files, and code-switching markers from all files. Run as:
python3 step2_run_phonemizer.py
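The two settings mentioned in this step, the language dictionary and the utterance limit, might look roughly like the following. This is a hedged sketch, not an excerpt from `step2_run_phonemizer.py`: the dictionary name, its three entries and both function names are illustrative assumptions, and the real script's dictionary covers the languages in the table above.

```python
from itertools import islice

# Illustrative excerpt only: language directory name -> eSpeak voice code.
ESPEAK_CODES = {"Spanish": "es", "German": "de", "Dutch": "nl"}


def espeak_code(language):
    """Look up the eSpeak code for a language, failing loudly if unknown."""
    if language not in ESPEAK_CODES:
        raise ValueError(f"No eSpeak code registered for {language!r}")
    return ESPEAK_CODES[language]


def limit_utterances(utterances, limit=10000):
    """Keep the first `limit` utterances; a limit of zero means no limit."""
    if limit < 0:
        raise ValueError("limit must be zero or positive")
    if limit == 0:
        return list(utterances)
    return list(islice(utterances, limit))
```

Adding a language then amounts to one new dictionary entry, using a code reported by `espeak --voices`.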
- Wordseg experiments: prepares each phonemized file for use by wordseg. Runs selected wordseg algorithms (currently: baselines, transitional probabilities, DiBS, PUDDLE) on every file and evaluates against the true word segmentations. Requires `lnre.R` (which installs the `zipfR` library if you don't already have it) and a `~/tmp/` directory. Outputs experiment files to `~/Corpora/CHILDES/wordseg/` and a results file to `~/Corpora/CHILDES/segmentation_experiment_stats.csv`.
Run within a virtual environment if that's where you've installed `wordseg`, e.g.
source ~/venvs/Py3/wordseg/bin/activate
And then run as:
python3 step3_wordsegmentation_experiments.py
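To make the "prepare, segment, evaluate" cycle in this step concrete, here is a pure-Python sketch of the preparation and token-level scoring. It does not call wordseg and is not the code in `step3_wordsegmentation_experiments.py`; wordseg ships its own preparation and evaluation tools. The separators `;eword` and `;esyll` are assumed to follow wordseg's default conventions, and the function names are hypothetical.

```python
WORD_SEP = ";eword"  # assumed word boundary marker
SYLL_SEP = ";esyll"  # assumed syllable boundary marker


def prepare(tagged):
    """Split a tagged utterance into (prepared, gold).

    `prepared` is the phone sequence with word boundaries removed;
    `gold` keeps the true words, each with its phones joined together.
    """
    words = [[p for p in chunk.split() if p != SYLL_SEP]
             for chunk in tagged.split(WORD_SEP) if chunk.strip()]
    prepared = " ".join(p for word in words for p in word)
    gold = " ".join("".join(word) for word in words)
    return prepared, gold


def word_spans(segmented):
    """Return the set of (start, end) character spans of each word."""
    spans, start = set(), 0
    for word in segmented.split():
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def token_scores(predicted, gold):
    """Token precision, recall and F-score over parallel utterance lists."""
    hits = n_pred = n_gold = 0
    for pred_utt, gold_utt in zip(predicted, gold):
        pred_spans, gold_spans = word_spans(pred_utt), word_spans(gold_utt)
        hits += len(pred_spans & gold_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore
```

A predicted word counts as correct only if both of its boundaries match the gold segmentation, which is why over-segmenting a single word lowers precision and recall together.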
- Statistical analysis with `step4_stats_analysis.R`: prints descriptive stats for corpora, evaluation scores and pairwise t-tests, fits regression models and makes plots. Intended to be run piece by piece in interactive R.
If you use this code please cite our paper:
@article{caines-et-al-2019,
author = {Andrew Caines and Emma Altmann-Richer and Paula Buttery},
year = {2019},
title = {The cross-linguistic performance of word segmentation models over time},
journal = {Journal of Child Language},
}
Andrew Caines, andrew.caines@cl.cam.ac.uk, September 2019