ERG Treebanks: summary of the data

I suggest that we expand the section about the datasets that constitute the ERG treebanks: https://github.com/delph-in/docs/wiki/RedwoodsTop

Currently, the wiki page refers the reader to Flickinger 2011, but that work is not easily available online (I don't think it is?). Furthermore, even if one has it, it is still not fully obvious how to map the datasets described there to the files in the ERG release (for some, the mapping is obvious; for others, it is not).

Here's the list of the files in the current release. I filled in the mapping where I could, but some entries I cannot easily map to anything described in Flickinger 2011:

| Profile | Source / notes |
|---|---|
| csli | Constructed |
| ccs | Constructed |
| control | Constructed |
| esd | Constructed |
| fracas | Constructed |
| handp12 | Handpicked? From where? |
| mrs | Constructed |
| pest | ??? |
| sh-spec | Sherlock Holmes |
| sh-spec-r | Sherlock Holmes |
| trec | Constructed |
| cb | The Cathedral and the Bazaar |
| ec* | E-commerce |
| hike | LOGON |
| jh* | LOGON |
| tg* | LOGON |
| ps* | LOGON |
| rondane | LOGON |
| rtc* | ??? |
| bcs | ??? |
| scm | SemCor?.. |
| vm* | Verbmobil |
| ws* | Wikipedia |
| wlb03 | ??? |
| wnb03 | ??? |
| peted | ??? |
| petet | ??? |
| ntucle | ??? |
| omw | ??? |
| wsj* | Wall Street Journal |

Can anyone help complete this table?

OMW means Open Multilingual Wordnet. This is a sample of 2,000 sentences from the English synset definitions in @fcbond 's OMW (http://compling.hss.ntu.edu.sg/omw/). As far as I know, this is mostly Princeton WordNet 3.0 with small fixes.

I agree that we do need this documentation in the wiki. BTW, it is not always clear that the WSJ data is also part of the OntoNotes and PropBank datasets (see propbank/propbank-release#14).

Is ws* all the https://github.com/delph-in/docs/wiki/WikiWoods? What does the star mean?

oepen commented

Is ws* all the https://github.com/delph-in/docs/wiki/WikiWoods? What does the star mean?

This just means all corpora whose names start with "ws".
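
For example, in an ERG checkout the matching gold profiles can be listed with an ordinary shell glob (a sketch; it assumes the gold profiles live under tsdb/gold/):

% ls -d tsdb/gold/ws*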

Thank you, @oepen!

Here's the same table updated with the info from index.lisp. For some items, I am still missing an adequate description though...

csli "CSLI testsuite" Constructed examples is there a citation?.. and what exactly does it mean?..
ccs "Collab Compuational Semantics" Constructed examples citation? and I don't know what this means either...
control "Control examples from literature" Constructed examples clear enough I guess though some provenance would be nice
esd "ERG Semantic Documentation Test Suite" Constructed https://github.com/delph-in/docs/wiki/ErgSemantics
fracas "FraCaS Semantics Test Suite" textual inference problem set? Cooper et al. 1996 https://gu-clasp.github.io/multifracas/D16.pdf
handp12 The Cambridge grammar of the English language, Ch12 ??? Huddleston and Pullum 2005 Not available online; What is the relationship of the chapter and the test suite?
mrs MRS test suite Constructed examples https://github.com/delph-in/docs/wiki/MatrixMrsTestSuite
pest ??? ??? ??? ???
sh-spec Sherlock Holmes late 19th century fiction Conan Doyle, 1892 https://www.gutenberg.org/files/1661/1661-h/1661-h.htm#chap08
sh-spec-r what's this second one?
trec "TREC QA Questions (Ninth conference" Constructed examples? Can't find this specific event
cb The Cathedral and the Bazaar technical essay Raymond, 1999 http://www.catb.org/~esr/writings/cathedral-bazaar/
ec* E-commerce email (YY) email (customer service etc)
hike LOGON travel brochures
jh* LOGON travel brochures
tg* LOGON travel brochures
ps* LOGON travel brochures
rondane LOGON travel brochures
rtc* ??? ??? ??? ???
bcs "Brown Corpus Sampler (SDP 2015 Task)" Oepen et al. 2015 https://aclanthology.org/S15-2153.pdf
scm "SemCor Melbourne Sampler (Disjoint from BCS)" same as above?..
vm* Verbmobil scheduling dialogues Is Wahlster 1993 the citation?.. http://verbmobil.dfki.de/ww.html
ws* Wikipedia Encyclopaedic texts about computational linguistics?.. anything more we know about them?
wlb03 ??? ??? ??? ???
wnb03 ??? ??? ??? ???
peted "Evaluation By Textual Entailment (Development)" what does it mean?
petet "Evaluation By Textual Entailment (Test)" what does it mean?
ntucle Something to do with NTU but what?
omw Open Multilingual Wordnet ? http://compling.hss.ntu.edu.sg/omw/
wsj* Wall Street Journal News articles https://catalog.ldc.upenn.edu/LDC93S6A

Many thanks, @oepen. Let me know if you think this table could go directly into the wiki, e.g. into RedwoodsTop.

What about adding some extra information about the size of each treebank? I am particularly interested in how many sentences we have with gold MRSs. Does anyone have this number? Is there any other gold-annotated ERG treebank besides the data inside the ERG repository under tsdb/gold?

In tsdb/gold we have 131,401 sentences:

% for f in $(find . -type f -name item.gz); do echo "$f, $(gzcat $f | wc -l)"; done | awk 'BEGIN {s=0} {s = s + $2} END {print s}'
131401
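
In case a per-profile breakdown is useful, the same loop can print one line per profile instead of only the total (a sketch in the same shell style, run from tsdb/gold):

% for f in $(find . -type f -name item.gz); do printf '%s %s\n' "$(dirname $f)" "$(gzcat $f | wc -l)"; done | sort -k2 -nr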

Two profiles are 'virtual': wescience and redwoods. But the redwoods profile lists member profiles that do not exist in the tsdb/gold folder (see the sketch after the list below for how to inspect the member list):

  1. Instead of "jh0", "jh1", "jh2", "jh3", "jh4" and "jh5" we have only the profiles "jh", "jhk" and "jku"
  2. Instead of "tg1" and "tg2" we have "tg", "tgk" and "tgu"
  3. Instead of "sc01", "sc02" and "sc03" we have only "scm"
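
As far as I remember the [incr tsdb()] conventions, a virtual profile is just a directory whose only content is a file named virtual that lists the member profile names, so the expected members can be inspected directly (a sketch, assuming the virtual profiles sit directly under tsdb/gold):

% cat tsdb/gold/redwoods/virtual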

Questions:

  1. Should we update the profile redwoods, that is, the list in the virtual file?
  2. The latest AMR release (AMR 3.0) contains 59,255 sentences. As far as I understand from https://amr.isi.edu/download.html and https://catalog.ldc.upenn.edu/LDC2020T02, AMR 3.0 subsumes AMR 2.0 and AMR 1.0, so the AMR data is ~45% of the size of the MRS data (59,255 / 131,401 ≈ 0.45). Am I right?

@oepen, is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets?

Finally, there are duplicated sentences across the profiles:

% for f in */item.gz; do gzcat $f | awk -F "@" '{print $7}' >> sentences; done
% sort sentences | uniq | wc -l
  105820
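
The number of distinct sentences that occur more than once can be read off the same file (a small follow-up using the sentences file built above); and assuming the two counts above cover the same set of profiles, 131,401 - 105,820 = 25,581 item lines are repeats of an earlier sentence:

% sort sentences | uniq -d | wc -l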

Some examples:

% sort sentences | uniq -c | sort -nr | head -20
3288 MIME-Version: 1.0
3288 Content-Type: text/plain; charset=iso-8859-1
3288 Content-Transfer-Encoding: 8bit
 303 Message-ID: <1043735849\smv.stanford.edu>
 301 Message-ID: <1043735850\smv.stanford.edu>
 300 Message-ID: <1043735851\smv.stanford.edu>
 295 Message-ID: <1043735854\smv.stanford.edu>
 295 Message-ID: <1043735852\smv.stanford.edu>
 294 Message-ID: <1043735855\smv.stanford.edu>
 292 Message-ID: <1043735853\smv.stanford.edu>
 290 Message-ID: <1043735857\smv.stanford.edu>
 289 Message-ID: <1043735858\smv.stanford.edu>
 289 Message-ID: <1043735856\smv.stanford.edu>
 275 Message-ID: <1043735848\smv.stanford.edu>
 268 okay.
 227 From: stefan\syy.com
 204 From: dan\syy.com
 202 From: monique\syy.com
 200 From: remy\syy.com
 200 From: brian\syy.com

What about adding some extra information about the size of each treebank? I am particularly interested in how many sentences we have with gold MRSs. Does anyone have this number?

Alex, the redwoods.xlsx file (which you can find in the release) has the sentence counts!

I found a link to the redwoods.xls file at https://github.com/delph-in/docs/wiki/RedwoodsTop, but that page points to http://svn.delph-in.net/erg/tags/1214/etc/redwoods.xls. In the etc folder of the ERG trunk, I found the newer version of this file.

If I am reading it right, we have 97,286 fully disambiguated sentences in the redwoods collection, right? That is still more than the 59,255 AMR sentences, but a less impressive number. Is this the actual number of sentences with gold MRSs that we have available? What is the status of the sentences in the profiles not included in redwoods?

I noticed that the sh-spec-r profile is not listed in the redwoods.xls spreadsheet. What is it?

oepen commented

is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets?

broadly speaking, i guess one could say that CCS (and a series of additional meetings in a similar spirit) was part of the build-up for the MRP shared tasks. but one could just as well say that the desire to compare different frameworks and specific analyses has been a motivating force for dan, emily, myself, and others for at least the past decade. sitting down to compare individual sentences in great depth (in the CCS spirit) is one technique we have used; the SDP and MRP shared tasks series was a different approach with some of the same underlying motivation.

regarding the EDS data in MRP 2019 and 2020, it comes from the 1214 ERG release, aka DeepBank 1.1.

oepen commented
  1. Instead of "jh0", "jh1", "jh2", "jh3", "jh4" and "jh5" we have only the profiles "jh", "jhk" and "jku"
  2. Instead of "tg1" and "tg2" we have "tg", "tgk" and "tgu"
  3. Instead of "sc01", "sc02" and "sc03" we have only "scm"

yes, with the transition from the original [incr tsdb()]-based treebanking environment to FFTB, profiles became a lot smaller, seeing as only the packed forest is recorded rather than a 500-best list of full derivations for each input. that meant that dan could undo some sub-divisions of collections that logically belonged together (JH, TG, and SC). post-1214, he concatenated these profiles back together.

So we also have DeepBank, in addition to the wesearch and redwoods "virtual" profiles? According to https://github.com/delph-in/docs/wiki/DeepBank, it corresponds to the wsj* profiles. These are released at http://metashare.dfki.de/repository/browse/deepbank/d550713c0bd211e38e2e003048d082a41c57b04b11e146f1887ceb7158e2038c/, summing to 43,541 sentences. But I suppose the wsj* profiles in the ERG repository were updated with the ERG 2020 release, so the META-SHARE data is outdated.
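
One way to check how outdated it is would be to recount the wsj* profiles in the current gold trees (a sketch in the same style as the counts above, run from an ERG checkout):

% gzcat tsdb/gold/wsj*/item.gz | wc -l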