The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2 with turn/utterance-level dialog-act tags. The tags summarize syntactic, semantic, and pragmatic information about the associated turn. The SwDA project was undertaken at the University of Colorado at Boulder in the late 1990s.
The SwDA is not inherently linked to the Penn Treebank 3 parses of Switchboard, and it is far from straightforward to align the two resources. In addition, the SwDA is not distributed with the Switchboard's tables of metadata about the conversations and their participants.
This project includes a version of the corpus (swda.zip) that pools all of this information to the best of my ability. In addition, it includes Python classes that should make it easy to work with this merged resource.
This project was originally part of my LSA Linguistic Institute 2011 course Computational Pragmatics. Additional resources from that course:
- Corpus overview
- Experiment: Question acts and interrogative clauses in the SwDA
- Analysis: Clustering words by tags in the SwDA
The code in this repository is compatible with Python 2 and Python 3. Its only other external dependency is NLTK, with the data installed so that WordNet is available.
- swda.py: the module for processing this corpus distribution
- swda.zip: the corpus; needs to be unzipped
- swda_functions.py: some simple examples aggregating information with CorpusReaders
- metadata_processor.py: auxiliary processing file used to create swda/swda-metadata.csv
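Since swda.zip has to be unzipped before the CSV files can be read, one way to do that from Python is the standard-library `zipfile` module. This is a minimal sketch: the helper name `unzip_corpus` and its default arguments are illustrative assumptions, not something shipped in this repository.

```python
import zipfile


def unzip_corpus(zip_path="swda.zip", dest_dir="."):
    """Extract the archive at zip_path into dest_dir.

    Hypothetical helper (not part of swda.py). Returns the list of
    member names stored in the archive.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

The paths used elsewhere in this README (for example swda/swda-metadata.csv) presume that the files end up under a directory named swda.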
The code's Transcript objects model the individual files in the corpus. A Transcript object is built from a transcript filename and the corpus metadata file:
from swda import Transcript
trans = Transcript('swda/sw00utt/sw_0001_4325.utt.csv', 'swda/swda-metadata.csv')
trans.topic_description
'CHILD CARE'
trans.prompt
'FIND OUT WHAT CRITERIA THE OTHER CALLER WOULD USE IN SELECTING CHILD \
CARE SERVICES FOR A PRESCHOOLER. IS IT EASY OR DIFFICULT TO FIND SUCH CARE?'
trans.talk_day
datetime.datetime(1992, 3, 23, 0, 0)
trans.talk_day.year
1992
trans.talk_day.month
3
trans.from_caller
1632
trans.from_caller_sex
'FEMALE'
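Because these metadata fields are ordinary attributes, they are easy to aggregate over several transcripts. The following is a small sketch rather than code from swda.py: the helper name `topics_by_year` is hypothetical, and it assumes only that each object exposes `talk_day` (a datetime) and `topic_description`, as in the session above.

```python
from collections import Counter


def topics_by_year(transcripts):
    """Count (year, topic) pairs over an iterable of Transcript-like objects.

    Hypothetical helper: relies only on the talk_day and
    topic_description attributes.
    """
    counts = Counter()
    for trans in transcripts:
        counts[(trans.talk_day.year, trans.topic_description)] += 1
    return counts
```

Any iterable of transcripts will do, for example a list of Transcript objects built as shown above.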
Transcript instances have many attributes:
for a in sorted([a for a in dir(trans) if not a.startswith('_')]):
    print(a)
conversation_no
from_caller
from_caller_birth_year
from_caller_dialect_area
from_caller_education
from_caller_sex
header
length
metadata
prompt
ptd_basename
swda_filename
talk_day
to_caller
to_caller_birth_year
to_caller_dialect_area
to_caller_education
to_caller_sex
topic_description
utterances
The utterances attribute holds the transcript's utterance objects. These have many attributes and methods of their own. Some examples:
utt = trans.utterances[19]
utt.caller
'B'
utt.act_tag
'sv'
utt.text
'[ I guess + --'
utt.pos
'[ I/PRP ] guess/VBP --/:'
utt.pos_words()
['I', 'guess', '--']
utt.pos_lemmas(wn_lemmatize=True)
[('I', 'prp'), ('guess', 'v'), ('--', ':')]
len(utt.trees)
1
utt.trees[0].pprint()
'(S
  (EDITED
    (RM (-DFL- \\[))
    (S (NP-SBJ (PRP I)) (VP-UNF (VBP guess)))
    (IP (-DFL- \\+)))
  (NP-SBJ (PRP I))
  (VP
    (VBP guess)
    (RS (-DFL- \\]))
    (SBAR
      (-NONE- 0)
      (S (NP-SBJ (PRP we)) (VP (MD can) (VP (VB start))))))
  (. .))'
Because the trees often properly contain the utterance, they cannot be used to gather word- or phrase-level statistics unless care is taken to restrict attention to the subtrees, or fragments thereof, that represent the utterance itself.
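One simple way to restrict attention in this manner is to look for the utterance's own words as a contiguous run within the tree's leaves. The sketch below works on plain lists of strings so that it does not depend on any tree library. `find_utterance_span` is a hypothetical helper, and it assumes that disfluency markers, empty elements, and punctuation have already been removed from both lists, which real SwDA data will need extra care to guarantee.

```python
def find_utterance_span(tree_leaves, utt_words):
    """Locate utt_words as a contiguous run inside tree_leaves.

    Hypothetical helper. Returns (start, end) such that
    tree_leaves[start:end] == utt_words, using the first match,
    or None if there is no contiguous match.
    """
    n = len(utt_words)
    if n == 0:
        return None
    for start in range(len(tree_leaves) - n + 1):
        if tree_leaves[start:start + n] == utt_words:
            return (start, start + n)
    return None
```

For the tree shown above, with cleaned leaves `['I', 'guess', 'I', 'guess', 'we', 'can', 'start']` and utterance words `['I', 'guess']`, the function returns `(0, 2)`. Repeated material, as in this restart, means the first match is not guaranteed to be the right one in general.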
Not all utterances have trees; only a subset of the Switchboard is fully parsed. Thus, of the 221,616 utterances in the SwDA, 118,218 (53%) have at least one tree.
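A figure like this can be recomputed by checking each utterance's `trees` list. The sketch below is an illustration rather than code from swda.py: `tree_coverage` is a hypothetical name, and it assumes only that every utterance object has a `trees` attribute that supports `len`, as in the example session.

```python
def tree_coverage(utterances):
    """Return (with_trees, total, fraction) for utterance-like objects.

    Hypothetical helper: an utterance counts as parsed if its
    trees attribute holds at least one tree.
    """
    total = 0
    with_trees = 0
    for utt in utterances:
        total += 1
        if len(utt.trees) > 0:
            with_trees += 1
    # float() keeps the division exact under Python 2 as well.
    fraction = with_trees / float(total) if total else 0.0
    return with_trees, total, fraction
```

For the counts quoted above, 118,218 out of 221,616 gives a fraction of roughly 0.53.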
The main interface provided by swda.py is the CorpusReader, which allows you to iterate through the entire corpus, gathering information as you go. CorpusReader objects are built from just the root of the directory containing your CSV files. (It assumes that swda-metadata.csv is located directly in that root directory.)
from swda import CorpusReader
corpus = CorpusReader('swda')
The two central methods for CorpusReader objects are iter_transcripts and iter_utterances. The method iter_utterances is basically an abbreviation of the following nested loop:
for trans in corpus.iter_transcripts():
    for utt in trans.utterances:
        yield utt
For some illustrations, see swda_functions.py.
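As a further illustration of this iteration pattern, the sketch below tallies dialog-act tags. It is not taken from swda_functions.py: `count_act_tags` is a hypothetical helper that assumes only the `act_tag` attribute shown earlier, and it accepts any iterable of utterances so that it can be tried without the corpus on disk.

```python
from collections import Counter


def count_act_tags(utterances):
    """Tally the act_tag of every utterance in an iterable.

    Hypothetical helper: works on any objects with an act_tag attribute.
    """
    return Counter(utt.act_tag for utt in utterances)
```

With the corpus unzipped, `count_act_tags(corpus.iter_utterances()).most_common(10)` should list the ten most frequent tags.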
There's a much fuller overview here: http://compprag.christopherpotts.net/swda.html