Children's Reading Assessment is the task of evaluating the correctness of a child's read aloud speech of a passage. The main goal is to provide meaningful feedback of the read speech so that it can be used to improve their reading fluency.
Disfluency Detection and Classification thus becomes an important step to achieve this goal. This is a very challenging problem because of two reasons:
- Fewer number of datasets that have children's speech
- Lack of task-oriented annotations in the existing datasets
This repository provides a simple script annotate_cmuk.py
that uses a novel labelling scheme for the CMU Kids Corpus
as well as the final labels cmuk_labels.csv
generated based on the proposed scheme. The main motivation is to encourage future work towards collecting more children’s speech datasets for building robust reading assessment models.
Prerequisites: Python
The CMU Kids Corpus has 24 male speakers and 52 female speakers in the age range of six to eleven, with a total of 5,180 utterances. The dataset consists of the following files:
Files | Contents |
---|---|
Sentence | All the sentences used for the read aloud task |
Point | The remarks describing the speech by a human annotator for each of speech files |
Transcrip | The transcripts of the child's read aloud speech |
Kids | The recorded audio of the child reading a passage |
The speech is labeled at the word level. These labels fall under the following categories:
Labels | Definition |
---|---|
Stutter | Some syllables in a word are repeated |
Repeat | A word is uttered more than once |
Incorrect | The uttered word does not match the actual word in the passage |
Skip | A word is omitted while reading the passage |
Correct | The rest of the uttered words that do not fall under the previous categories |
Based on the remarks provided in the dataset by the human annotator, we used the following pattern matching scheme to generate the labels:
Remarks from the dataset | Labels |
---|---|
False starts & phone additions in a word | stutter |
Repeating and restarting words | repeat |
Word substitutions, word additions & word replacements | incorrect |
Word deletions | skip |
No remarks | correct |
Apache 2.0, see LICENSE.