
UFSAC is a resource containing WordNet sense annotated corpora in a unified format, and a Java library for manipulating them.


UFSAC: Unification of Sense Annotated Corpora and Tools

This repository contains the dataset of the article "UFSAC: Unification of Sense Annotated Corpora and Tools", written by Loïc Vial, Benjamin Lecouteux and Didier Schwab for the 11th edition of the Language Resources and Evaluation Conference (LREC), which took place in May 2018 in Miyazaki, Japan.

The full article is available at the following URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/250.html.

Content of the repository

This repository contains:

  • The sense annotated corpora in UFSAC, the format described in the paper, available through direct links (see below). Note that the files have been compressed with the xz tool and therefore need to be decompressed with unxz or a similar tool.
    The latest version (2.1) contains the following corpora annotated with WordNet 3.0 sense keys:

    • SemCor
      → file semcor.xml
    • DSO (code to convert the original data only)
    • WordNet Gloss Tagged (Princeton WordNet Gloss Corpus)
      → file wngt.xml
    • MASC
      → file masc.xml
    • OMSTI
      → file omsti.xml
    • Ontonotes (code to convert the original data only)
    • Train-O-Matic
      → file trainomatic.xml
    • SensEval 2 All-words task (both from original data and from Raganato et al. (2017) framework)
      → files senseval2.xml and raganato_senseval2.xml
    • SensEval 2 Lexical sample task (train and test)
      → files senseval2_lexical_sample_train.xml and senseval2_lexical_sample_test.xml
    • SensEval 3 task 1 (both from original data and from Raganato et al. (2017) framework)
      → files senseval3task1.xml and raganato_senseval3.xml
    • SensEval 3 task 6 Lexical sample task (train and test)
      → files senseval3task6_train.xml and senseval3task6_test.xml
    • SemEval 2007 task 7
      → file semeval2007task7.xml
    • SemEval 2007 task 17 (both from original data and from Raganato et al. (2017) framework)
      → files semeval2007task17.xml and raganato_semeval2007.xml
    • SemEval 2013 task 12 (both from original data and from Raganato et al. (2017) framework)
      → files semeval2013task12.xml and raganato_semeval2013.xml
    • SemEval 2015 task 13 (both from original data and from Raganato et al. (2017) framework)
      → files semeval2015task13.xml and raganato_semeval2015.xml
    • Concatenation of all SensEval and SemEval all-words tasks (from Raganato et al. (2017) framework)
      → file raganato_ALL.xml
  • The source code of the Java API and the scripts described in the paper, in the folder java.

  • Scripts for converting corpora from various formats (SemCor, DSO, OMSTI...) into UFSAC, converting UFSAC corpora into Raganato et al.'s format, computing MFS, etc., in the folder scripts.
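
As a rough illustration of what can be done with a decompressed corpus file without the bundled Java API, the sketch below counts sense annotations using only the JDK's StAX parser. It assumes (this is not stated above) that each token is a `word` element whose WordNet 3.0 sense keys are stored in a `wn30_key` attribute, with multiple keys separated by `;`. The class and method names are hypothetical and are not part of the UFSAC library.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the UFSAC Java API.
final class UfsacSketch {

    private UfsacSketch() {
    }

    /**
     * Counts how often each sense key occurs in a UFSAC-style XML string.
     * Assumption: tokens are "word" elements with an optional "wn30_key"
     * attribute holding one or more sense keys separated by ';'.
     */
    static Map<String, Integer> countSenseKeys(String xml) {
        Map<String, Integer> counts = new HashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event != XMLStreamConstants.START_ELEMENT
                        || !"word".equals(reader.getLocalName())) {
                    continue; // only opening <word> tags carry annotations
                }
                String keys = reader.getAttributeValue(null, "wn30_key");
                if (keys == null || keys.isEmpty()) {
                    continue; // unannotated token
                }
                for (String key : keys.split(";")) {
                    counts.merge(key, 1, Integer::sum);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
        return counts;
    }
}
```

Calling `UfsacSketch.countSenseKeys(xml)` returns a map from sense key to number of occurrences, which is the kind of count a most-frequent-sense (MFS) baseline starts from. For a real corpus file, the reader would more sensibly be created from a file stream than from a string, since the corpora can be large.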

Get Started

If you want to use the Java API or the scripts, the prerequisites are:

Once they are installed, you must compile the code:

  • Go into the java folder
  • Run mvn compile or ./compile.sh

If you want to use the library as a dependency in another Maven project:

  • Go into the java folder
  • Run mvn install or ./install.sh

Version history

Version 2.1 (October 2018)

Direct link to the data: https://drive.google.com/file/d/1kwBMIDBTf6heRno9bdLvF-DahSLHIZyV

  • Small fix in Semeval2007Task7, Semeval2015Task13 and Raganato et al. corpora where words in a multi-word expression were collapsed. They are now separated by an underscore symbol.
  • Version number is shorter: <major version>.<minor version>

Version 2.0.0 (July 2018)

Version 1.1.0 (June 2018)

Direct link to the data: https://drive.google.com/file/d/1XKOnRPnm0TSia1PKwe2xsGE4IDqvAAbb

  • Fix a problem where some POS tags did not follow the PTB convention
  • Merge the "omsti_part{0,1,2,3,4}.xml" files into a single "omsti.xml" file

Version 1.0.0 (May 2018)

Direct link to the data: https://drive.google.com/file/d/1-II0demgruLdSdI8SC6dmnIqDNrZvdpW

Original version which contains the following corpora:

Plus the code to produce the UFSAC version from the original version of the following corpora: