
Software and data links for Repl4NLP 2018 paper "Jointly embedding entities and text with distant supervision."


JET (Jointly-embedded Entities and Text)

This is an open-source implementation of a method for jointly learning distributional embeddings of entities, words, and terms from unlabeled text with distant supervision, described in the following paper: Newman-Griffis, Lai, and Fosler-Lussier, "Jointly Embedding Entities and Text with Distant Supervision" (Repl4NLP 2018).
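To give a rough sense of the idea, the sketch below trains toy skip-gram-style embeddings over sentences whose token streams mix ordinary words with entity pseudo-tokens injected by distant supervision, so words and entities share contexts and land in one vector space. This is a hypothetical illustration with made-up names (`train_joint_embeddings`, `ENT:` prefixes), not the paper's actual objective or the C implementation in src:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_joint_embeddings(annotated_corpus, dim=16, window=2, lr=0.05, epochs=50):
    """Toy skip-gram with one negative sample per pair, over sentences whose
    tokens may be words OR entity pseudo-tokens (hypothetical sketch)."""
    vocab = sorted({t for sent in annotated_corpus for t in sent})
    idx = {t: i for i, t in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0, 0.1, (V, dim))   # target vectors
    W_out = rng.normal(0, 0.1, (V, dim))  # context vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for sent in annotated_corpus:
            for i, tgt in enumerate(sent):
                t = idx[tgt]
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    c = idx[sent[j]]
                    v = W_in[t].copy()
                    # positive (observed) context pair
                    g = sigmoid(v @ W_out[c]) - 1.0
                    W_in[t] -= lr * g * W_out[c]
                    W_out[c] -= lr * g * v
                    # one uniformly drawn negative sample
                    n = rng.integers(V)
                    g = sigmoid(v @ W_out[n])
                    W_in[t] -= lr * g * W_out[n]
                    W_out[n] -= lr * g * v
    return {t: W_in[idx[t]] for t in vocab}

# "ENT:aspirin" is an entity pseudo-token inserted next to its text mention
corpus = [
    ["aspirin", "ENT:aspirin", "relieves", "pain"],
    ["ibuprofen", "ENT:ibuprofen", "relieves", "pain"],
]
emb = train_joint_embeddings(corpus)
```

Because entity pseudo-tokens and words appear in the same windows, both receive vectors in the same space, which is the property the joint embedding method exploits.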

This work was also presented as a poster at the AMIA Informatics Summit 2018, titled "Jointly embedding biomedical entities and text with distant supervision."

Looking for WikiSRS data? You can find it at https://slate.cse.ohio-state.edu/WikiSRS/

Overview

This repository contains three main components:

  • src is the C implementation of the JET method, with all associated libraries.
  • preprocessing is Python-based code for noisy annotation with a terminology.
  • experiments is Python code for replicating the experiments found in the paper; for more information, please see experiments/README.
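As a rough illustration of the kind of noisy, terminology-based annotation the preprocessing component performs, the sketch below greedily tags longest token-sequence matches against a term-to-entity dictionary. The function name, entity IDs (`E1`, `E2`), and matching strategy are all hypothetical, not the repository's actual code:

```python
def noisy_annotate(tokens, terminology):
    """Greedy longest-match tagging of a token list against a terminology
    mapping lowercased term strings to entity IDs (hypothetical sketch)."""
    max_len = max(len(term.split()) for term in terminology)
    annotations = []  # (start, end, entity_id) spans, end exclusive
    i = 0
    while i < len(tokens):
        matched = False
        # try the longest candidate span first, then shrink
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n]).lower()
            if cand in terminology:
                annotations.append((i, i + n, terminology[cand]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations

terminology = {"aspirin": "E1", "heart attack": "E2"}
tokens = "Aspirin may reduce the risk of heart attack".split()
spans = noisy_annotate(tokens, terminology)
print(spans)  # -> [(0, 1, 'E1'), (6, 8, 'E2')]
```

Annotation produced this way is "noisy" because string matching alone cannot resolve ambiguous mentions; distant supervision accepts that noise in exchange for not needing labeled data.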

The included demo.sh script will download a tiny test corpus and run the preprocessing and JET implementations on it.

Pre-trained embeddings, along with other associated data from the paper, can be downloaded at this link.

If you notice any issues with the code, please open up an issue in the tracker!

Dependencies

The C code is self-contained; its only bundled dependency is Matsumoto and Nishimura's excellent Mersenne Twister implementation (see their webpage for details), which is included in src and used for all random behavior.

The Python preprocessing code requires:

  • NLTK

The experimental implementations have additional requirements; please see experiments/README.

Reference

If you use this software or method in your own work, please cite the paper as follows:

@inproceedings{Newman-Griffis2018Repl4NLP,
  author = {Newman-Griffis, Denis and Lai, Albert M. and Fosler-Lussier, Eric},
  title = {Jointly Embedding Entities and Text with Distant Supervision},
  booktitle = {Proceedings of the 3rd Workshop on Representation Learning for NLP},
  year = {2018}
}