Summarize EAF Files

Overview

The script summarize-eaf.py reads ELAN files in EUDICO Annotation Format and calculates the total times of the sets of (possibly overlapping) annotated segments therein. It uses the pympi-ling library to parse the input files, and writes its output to table in CSV format in a specified output file.

In particular, it is intended for use in analyzing speech recordings for the study of the language environments of infants and children.

Computation Method

First, summarize-eaf.py reads all of the base tier names from the EAF file, filtering out tiers specified as "ignored" on the command line (using the --ignore-tiers option), along with any tiers that have no sub-tiers, including a built-in list (code_num, on_off, context, and code). Then it extracts all of the annotated segments for each of the non-ignored base tiers, and builds a list of start and end timestamps from that.

Once that list of "events" is sorted in chronological order, it scans through them, keeping track of which segments are active at each point in time, and producing a list of the sum of the active time (in milliseconds) that each unique combination of tiers was active, as well as the total time each tier was active (regardless of overlap with other tiers).

Then it goes through a similar process, computing the total times annotated and child-directed speech (CDS), adult-directed speech (ADS), and speech segments directed at both. This is done by selecting the tiers named xds@<BASE>, where <BASE> is the name of a base tier.

Output Table

The output tables look like this:

File Tier(s) Exclusive Total CDS ADS BOTH
1234 CHI 143099 195471
1234 FA1 206302 249630 148523 47897 32494
1234 MA1 202093 240012 162588 59462 2474
1234 CHI+FA1 28308
1234 CHI+FA1+MA1 1166
1234 CHI+MA1 22898
1234 FA1+MA1 13855 559 2319 2291
1234 Totals 617721 685113 311670 109678 37259
5678 CHI 242516 311834
5678 FA1 477640 565899 549321 3528
5678 FA2 23787 45555 337 6852
5678 MA1 3708 5808 4481
5678 CHI+FA1 66504
5678 CHI+FA1+FA2 700
5678 CHI+FA2 1340
5678 CHI+MA1 773
5678 FA1+FA2 19728 436
5678 FA1+MA1 1327 1327
5678 Totals 838023 929096 555902 10380
* Grand Totals 1455744 1614209 867572 120058 37259

Each row is identified by the filename (without the .eaf extension) and the unique combination of simultaneously active tiers. For rows representing overlapping speech, the base tier names are concatenated with + signs.

The Exclusive column contains the sum total time (in milliseconds) during which that particular set of tiers was the only one active. If we combine all of these Exclusive values for a given file, we should get the amount for that file in the Totals row for the file. This equals the total time marked as annotated speech for the non-ignored tiers in the file.

The Total column contains the total time that an individual base tier was active, whether or not it was overlapping with any other tiers for some of that time. Therefore, the Total amount for a tier should be greater than or equal to the Exclusive amount for that tier. This amount is only reported for base tiers on their own, not for overlapping combinations of base tiers.

The last three columns contain similar totals, but for child-directed, adult-directed, and both-directed speech. All of these are likewise summed in the Totals row for each file.

Finally, if multiple EAF files are processed in a batch, a Grand Totals row contains the sum of the Totals rows for each file.

Options

There are several command-line options for controlling the behaviour of the summarize-eaf.py script. Notably:

  • Setting the output file name with --output
  • Setting the output delimiter character with --delimiter
  • Suppressing the CDS/ADS/BOTH computation with --no-xds
  • Suppressing the Totals and Grand Totals rows with --no-totals
  • Suppressing the output of overlapping tier combinations with --no-overlap
  • Ignoring specified tiers with --ignore-tiers
  • Using specified tiers as an input mask with --masking-tiers

Some of these options are self-explanatory, but a few require a bit more explanation.

Suppressing Totals

Since the Totals (and Grand Totals) row(s) are simple sums of rows above them, it could be convenient for some analyses to omit them by using the option --no-totals.

Suppressing overlap details

If invoked with --no-overlap, summarize-eaf.py won't write rows for overlapping tier combinations to the output file, but those amounts will still be included in the Totals (and Grand Totals) row(s). If you don't care about the details of which tiers overlapped and for how long, this can produce a substantially smaller output table.

Note: If you use both --no-overlap and --no-totals, you will not have access to enough information to compute the omitted Totals row(s) correctly.

Suppressing CDS, et al

The option --no-xds will cause summarize-eaf.py to omit the data for the final three columns. The columns (and their headers) will still be included in the output, but the data won't be compiled and the cells will all be empty. The script only runs very slightly faster with this option, so it's really only useful for removing unwanted noise from the output table.

Ignoring tiers

Tiers can be added to the "ignored tiers" list by specifying them after the --ignore-tiers option. This is useful if you want to segregate one tier from the others, and report totals from the other tiers as if the ignored tiers were not present. Most likely, this would be used to filter out electronic devices (i.e. the EE1 tier).

Limiting the counted segments to the "code" tier

The segments (or partial segments) included in the output can be limited to ones that overlap with the code tier (or any other specified tier) using the --limiting-tier option. If this option is used, only the portion of the EAF file that is marked with the specified tier (usually code) will be included in the output summary.

With the --limiting-tier-pattern option, it's possible to further restrict the included sections to ones that are annotated with text that matches a specified regular expression. Only matching segments of the limiting tier will be summarized in the output, unless the option --negate-limiting-tier-pattern is also used, in which case the segments that match will be filtered out, and the non-matching segments of the limiting tier will be summarized instead.

Using a tier as a mask

The --masking-tiers option turns the specified tier into an "input mask" for the tiers being processed. The EAF files are processed as if there were no active tiers wherever a masking tier is active. Put another way, any overlap of other tiers with a masking tier is not counted in the totals.

This allows an analysis where subjects (i.e. CHI) are assumed to not be listening whenever they are speaking.

Setup

Dependencies

To run the script, you'll need Python installed (tested on versions 2.7.18 and 3.8.5), and the pympi-ling package, which can be installed using pip:

$ pip install pympi-ling

Version 1.69 of pympi-ling works, but if your EAF files are version 3.0 or higher, you'll see error messages on standard output:

Parsing unknown version of ELAN spec... This could result in errors...

The master branch of pympi-ling on GitHub has an improved warning system, which allows us to suppress this warning (since our EAF v3.0 files are compatible), but both versions will work.

Running the script

First, clone the repository:

$ git clone https://github.com/aclew/eaf_speech_counter

Supposing you have a folder named data containing a set of EAF files, run the script like this:

$ eaf_speech_counter/summarize-eaf.py -o output.csv data/*.eaf

The output table will be written to output.csv, including a totals row for each file, grand totals for the set of files, and CDS/ADS data.

Ignoring tiers

Tiers can be added to the ignore list by using the --ignore-tiers option (or simply -i for short):

$ summarize-eaf.py -o output.csv -i EE1 -- data/*.eaf

Note that the -- separating option parameters from the name(s) of the target EAF file(s) is required; otherwise the filenames will be treated as the names of tiers to be ignored.

Multiple tiers can be ignored, as well:

$ summarize-eaf.py -o output.csv -i EE1 UC1 MA3 -- data/*.eaf

Omitting Totals

The totals rows and grand totals row will be omitted with the --no-totals option:

$ summarize-eaf.py -o output.csv --no-totals data/*.eaf

Omitting overlap details

The rows showing overlap details can be omitted with --no-overlap:

$ summarize-eaf.py -o output.csv --no-overlap data/*.eaf

Masking tiers

To use a tier as a mask, use the option --masking-tiers (or -m):

$ summarize-eaf.py -o output.csv -m CHI -- data/*.eaf

Debugging

To turn on debug output, use --verbose or -v. There are three levels, depending on how many time you use the option:

# INFO level
$ summarize-eaf.py -o output.csv -i EE1 -v data/*.eaf
# DEBUG level
$ summarize-eaf.py -o output.csv -i EE1 -vv data/*.eaf
# VERBOSE level
$ summarize-eaf.py -o output.csv -i EE1 -vvv data/*.eaf

The highest level of output is extremely verbose; it writes a line for every event in sequence.