If you've ever tried to work with the CHILDES corpus before, you know that dealing with it can be...difficult. It's organized into a tremendous number of flat text files, one per session; the file format used to encode the session transcript (not to mention the metadata!) is, to put it mildly, highly irregular.
In short, it's a mess.
This repository is a central place for various CHILDES-related tools, scripts, and hacks. The first of these is a set of scripts to wrangle the .cha format (the 'highly irregular' thing above) into sanity-inducing XML, adding in POS tags and dependency parses along the way.
ruby tools/xmlize.rb Eng-USA/ && ruby tools/centralize_xml.rb Eng-USA/ xml/
ruby tools/extract_nouns.rb xml/ | ruby tools/filter_nouns.rb > childes.nouns
cat childes.nouns | ruby tools/cluster_nouns.rb > childes.wordnet_cluster
Takes an XML file, extracts all of the utterances, and prints them to standard output for later processing. Takes an optional key to ignore (e.g. CHI)
ruby tools/xml2document.rb path/to/childes.xml <ignore_key>
Reads a list of noun/count pairs from STDIN (like the output of extract_nouns or filter_nouns) and prints a yaml clustering based on their first-order WordNet synsets. Requires doches/rwordnet.
[cat noun.list] | ruby tools/cluster_wordnet_nouns.rb
Takes a path to a directory containing XML (as output by cha2xml), finds all of the utterances therein, and writes a target corpus of all of them to standard out. Basically a wrapper around xml2corpus, and probably the sort of thing you're only interested in if you're also using doches/corncob.
ruby tools/corpusize.rb path/to/xml/root
Replicates the directory structure of input in output, copying only xml files over into the new structure. Use this after xmlize, to build a version of CHILDES containing only XML. You don't have to do this (other tools will silently ignore .cha files, preferring XML), but it satisfies my housekeeping urges.
ruby tools/centralize_xml.rb path/to/CHILDES/input path/to/xml/output
Takes a directory containing CHILDES XML and prints a mapping (tab-delimited) of child age to filename.
ruby tools/build_agemap.rb path/to/childes
Takes an XML file, extracts all of the utterances, and prints a TargetCorpus (see doches/corncob) using nouns from the POStag list as target words. Like cha2xml, you probably want to call this automatically from some other script. Takes an optional key to ignore (e.g. CHI)
ruby tools/xml2corpus.rb path/to/childes.xml <ignore_key>
Filters a target corpus (from standard input) to include only lines involving target words from a list, printing the result to standard out. Used to clean up the output of xml2corpus according to the output of filter_nouns.
[cat file.target_corpus] | ruby tools/filter_corpus.rb path/to/nouns.filtered
Reads a CHILDES .cha file from STDIN and outputs an XML file containing cleaned dialog to STDOUT.
[cat thing.cha] | ruby cha2xml.rb <options>
xmlize calls cha2xml with all of these options on by default
- --braces Strip out experimenter annotations ("foo [this is a note] bar") from utterances.
- --clean Remove words containing nonsensical (i.e. non-word) characters.
- --minipar Run utterances through MINIPAR, including the result in the <parse> tag. Looks for ./vendor/pdemo/pdemo, with data files in ./vendor/data.
- --tag Run utterances through a pure Ruby implementation of the Brill tagger, including the result in the <tags> tag.
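To make the --braces and --clean options more concrete, here is a minimal sketch of what that kind of cleanup could look like. It is not the code in cha2xml.rb: the helper names strip_annotations and drop_nonwords are hypothetical, and the rule that a "word" consists only of letters, apostrophes, and hyphens is an assumption made for illustration.

```ruby
# Hypothetical helpers illustrating --braces and --clean style cleanup.
# These are NOT taken from cha2xml.rb.

# Remove square-bracketed annotations and collapse the leftover spaces.
def strip_annotations(utterance)
  utterance.gsub(/\[[^\]]*\]/, " ").squeeze(" ").strip
end

# Keep only tokens made entirely of letters, apostrophes, or hyphens.
# (Assumption: anything else counts as a "non-word" character.)
def drop_nonwords(utterance)
  utterance.split.select { |w| w.match?(/\A[[:alpha:]'-]+\z/) }.join(" ")
end

puts strip_annotations("foo [this is a note] bar")
puts drop_nonwords("more xxx@ cookie &uh")
```

Running the two steps in this order (annotations first, then non-word filtering) keeps the bracket characters from causing neighbouring words to be dropped.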
Reads an agemap from standard input and creates a set of nlda-friendly corpora in the output directory, binned into six-month periods.
cat [agemap] | ruby tools/agemap2nldacorpora.rb path/to/output
Compute reading levels for each document in a directory, and produce a data file ready for plotting with GnuPlot.
ruby tools/compute_reading_levels.rb path/to/dir > file.dat
Scans a directory for .cha files, converting any it finds into dialog XML.
ruby xmlize.rb <path/to/CHILDES/root>
Reads a list of noun/counts (as output by extract_nouns) from STDIN, filtering the list to include only nouns appearing in WordNet (as nouns). Requires doches/rwordnet.
[cat noun.txt] | ruby tools/filter_nouns.rb
Reads an agemap from standard input and creates a set of corpora in the output directory, one per six-month period.
cat [agemap] | ruby tools/agemap2corpora.rb path/to/output
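The six-month binning can be sketched roughly as follows. This is an illustration, not the logic of agemap2corpora.rb: it assumes each agemap line is "age<TAB>filename" and that the age is written in CHAT's years;months.days form (e.g. 1;03.10), which may not match what build_agemap actually emits. The helper names are hypothetical.

```ruby
# Hypothetical sketch of binning an agemap into six-month periods.
# Assumption: each line is "age<TAB>filename", with age as "years;months.days".

# Convert a CHAT-style age string to whole months (days are ignored).
def age_in_months(age)
  years, rest = age.split(";", 2)
  months = rest.to_s.split(".", 2).first.to_i
  years.to_i * 12 + months
end

# Group filenames by six-month bin: bin 0 is months 0-5, bin 1 is 6-11, etc.
def bin_by_half_year(lines)
  bins = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    age, filename = line.chomp.split("\t", 2)
    next if age.nil? || filename.nil? # skip malformed lines
    bins[age_in_months(age) / 6] << filename
  end
  bins
end

sample = ["1;03.10\ta.xml", "1;05.02\tb.xml", "2;00.00\tc.xml"]
bin_by_half_year(sample).sort.each do |bin, files|
  puts "#{bin * 6}-#{bin * 6 + 5} months: #{files.join(', ')}"
end
```

Integer division by six is what defines the bins here; a real script would then write one corpus per bin into the output directory.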
Compute reading levels (e.g. Coleman-Liau Index, Automated Readability Index, Flesch-Kincaid Readability Test) for a given target_corpus, one sentence per line.
ruby tools/reading_level.rb path/to/file.target_corpus <options>
Prints a tab-delimited list of metric names as a comment (e.g. "# coleman ari words"), followed by a tab-delimited list of computed metrics (e.g. "4.3 4.1 6.0 132.8")
- --ari Automated Readability Index
- --coleman Coleman-Liau Index
- --fkre Flesch-Kincaid Readability Test
- --words Average number of words per sentence
- --syllables Average number of syllables per sentence
- --characters Average number of characters per sentence
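For readers unfamiliar with these metrics, the sketch below computes two of them, the Automated Readability Index and the Coleman-Liau Index, from a list of sentences using the standard published formulas. It is not the implementation in reading_level.rb: the helper names are hypothetical, whitespace tokenization is an assumption, and the syllable-based metrics are left out because they need a syllable-counting heuristic.

```ruby
# Hypothetical sketch of two readability metrics; not reading_level.rb itself.
# Assumption: one sentence per array element, words separated by whitespace.

def text_counts(sentences)
  words = sentences.flat_map(&:split)
  {
    sentences: sentences.size,
    words: words.size,
    letters: words.sum { |w| w.count("A-Za-z") },      # letters only
    characters: words.sum { |w| w.count("A-Za-z0-9") } # letters and digits
  }
end

# Automated Readability Index:
#   4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
def ari(c)
  4.71 * c[:characters].fdiv(c[:words]) + 0.5 * c[:words].fdiv(c[:sentences]) - 21.43
end

# Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8, where L is letters
# per 100 words and S is sentences per 100 words.
def coleman_liau(c)
  l = c[:letters].fdiv(c[:words]) * 100
  s = c[:sentences].fdiv(c[:words]) * 100
  0.0588 * l - 0.296 * s - 15.8
end

counts = text_counts(["the dog ran", "a cat sat down"])
puts "# ari\tcoleman"
puts [ari(counts), coleman_liau(counts)].map { |v| v.round(1) }.join("\t")
```

Both indices are calibrated on adult prose, so very short child-directed utterances can produce negative scores; that is a property of the formulas rather than a bug.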
Looks recursively in a directory for xml, scanning each file found for nouns in the POS tag list and outputting a list of all nouns found (plus counts)
ruby tools/extract_nouns.rb path/to/xml/root
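The counting step can be pictured with a small sketch like the one below. It is only an illustration under stated assumptions: that tagged tokens are available as [word, tag] pairs, that noun tags follow the Penn Treebank convention of starting with "NN", and that the output is one "noun<TAB>count" pair per line. None of these details are confirmed by extract_nouns.rb, and count_nouns is a hypothetical name.

```ruby
# Hypothetical sketch of counting nouns from POS-tagged tokens.
# Assumptions: tokens are [word, tag] pairs and noun tags start with "NN".

def count_nouns(tagged_tokens)
  counts = Hash.new(0)
  tagged_tokens.each do |word, tag|
    counts[word.downcase] += 1 if tag.start_with?("NN")
  end
  counts
end

tagged = [["Dog", "NN"], ["runs", "VBZ"], ["dogs", "NNS"], ["dog", "NN"]]

# Print most frequent nouns first, as assumed "noun<TAB>count" lines.
count_nouns(tagged).sort_by { |noun, count| [-count, noun] }.each do |noun, count|
  puts "#{noun}\t#{count}"
end
```

Lowercasing before counting merges sentence-initial capitals with their lowercase forms; a real script might prefer to keep proper nouns distinct.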