
Preprocessing and analysis for training SNOMED-CT concept embeddings from the CORD-19 corpus

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Comparative corpus linguistics with embeddings

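TextEssence is a tool for comparative corpus linguistics using embeddings. It is described in the NAACL-HLT 2021 demonstrations paper "TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora" (full citation at the end of this README).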


Setting up the web interface

(1) Set up the environment. Install Node.js, and set up a Python virtual environment with the necessary dependencies. The following (using Anaconda) should do the trick:

conda create -n textessence python=3
source activate textessence
pip install -r requirements.txt
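
Before moving on, it can help to confirm that the command-line tools used in the later steps are actually on your PATH. The following is a minimal sketch, not part of TextEssence itself; the helper name `missing_tools` is hypothetical, and the tool list is simply the set of commands this README invokes (node, npm, make, git).

```python
import shutil


def missing_tools(tools):
    """Return the command names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Commands used by the setup steps in this README.
    missing = missing_tools(["node", "npm", "make", "git"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

An empty result means every tool was found; anything listed needs to be installed (or added to PATH) before the backend and frontend steps will work.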

(2) Download pre-generated data from Zenodo. Go to our Zenodo release and download the files to the machine you'll be running TextEssence on.

Unzip the CORD-19_monthly_embeddings.zip file; this will extract all the pretrained concept embeddings from our case study on CORD-19.
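
If you prefer to script the extraction rather than unzip by hand, a minimal sketch using Python's standard `zipfile` module might look like the following. The function name `extract_archive` is hypothetical, and the paths in the usage comment assume you downloaded the data into /var/textessence as in the config example; adjust them to your setup.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir`.

    Returns the list of member names contained in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example usage (paths are assumptions, not fixed by TextEssence):
# extract_archive("/var/textessence/CORD-19_monthly_embeddings.zip",
#                 "/var/textessence")
```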

Now you'll need to modify config.ini to link to the files you've downloaded. You'll do this in two steps:

Point to the DB file: Change the DatabaseFile field of PairedNeighborhoodAnalysis to point to the downloaded SQLite DB file. For example, if you downloaded the CORD-19 data into /var/textessence, your config.ini would have:

DatabaseFile = /var/textessence/CORD-19_analysis__2020-03__2020-10.db

Point to the pretrained embeddings: Add a section to config.ini for each of the subcorpora from the CORD-19 analysis, like the following:

ReplicateTemplate = /var/textessence/2020-03-27/2020-03-27_SNOMEDCT_concepts_replicate-{REPL}.txt
EmbeddingFormat = txt
ReplicateTemplate = /var/textessence/2020-03-27/2020-03-27_SNOMEDCT_concepts_replicate-{REPL}.txt
EmbeddingFormat = txt

This allows the Compare interface to calculate cosine similarities between embeddings on demand.
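
To make the connection between these config entries and the similarity calculation concrete, here is a rough sketch of the two ideas involved: expanding the `{REPL}` placeholder into one path per replicate, and computing cosine similarity between two vectors. This is an illustration only, not TextEssence's actual implementation. The section name `2020-03`, the replicate IDs, and the helper names `replicate_paths` and `cosine_similarity` are all assumptions made for the example.

```python
import configparser
import math

# Hypothetical config fragment. The section name "2020-03" is an assumption;
# use whatever section names your own config.ini defines.
EXAMPLE_CONFIG = """
[2020-03]
ReplicateTemplate = /var/textessence/2020-03-27/2020-03-27_SNOMEDCT_concepts_replicate-{REPL}.txt
EmbeddingFormat = txt
"""


def replicate_paths(config_text, section, replicate_ids):
    """Expand the {REPL} placeholder of a section into one path per replicate."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    template = parser[section]["ReplicateTemplate"]
    return [template.replace("{REPL}", str(rid)) for rid in replicate_ids]


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Replicate IDs 1 and 2 are illustrative; use the IDs present in your data.
    for path in replicate_paths(EXAMPLE_CONFIG, "2020-03", [1, 2]):
        print(path)
    # Toy vectors standing in for two concept embeddings.
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

If a path produced this way does not exist on disk, the corresponding `ReplicateTemplate` entry is the first thing to check.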

(3) Start the Flask backend. The back end of the web interface is implemented in Flask. A make target has been created in makefile to simplify running the interface:

make run_dashboard

(4) Start the Svelte frontend. The front end of the web interface is implemented in Svelte.

First-time setup: Load the visualization module and install its packages, using

git submodule init
git submodule update
cd nearest_neighbors/dashboard/diachronic-concept-viewer
npm install

Main run command: Start the Node server for handling the visualization content, using

npm run autobuild

(5) Start using the system! TextEssence will now be running at http://localhost:5000.
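
If the page does not load, a quick way to tell whether anything is listening at that address is a small reachability check like the sketch below. It uses only the Python standard library; the function name `dashboard_is_up` is hypothetical, and the default URL is the one given above. Proxies are deliberately bypassed, on the assumption that the server is running on the local machine.

```python
import urllib.error
import urllib.request


def dashboard_is_up(url="http://localhost:5000", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    # Bypass any configured HTTP proxy, since the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded with an error status, so it is still running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing reachable.
        return False
```

Calling `dashboard_is_up()` with no arguments checks the default address; a `False` result usually means one of the two servers from steps (3) and (4) is not running.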

Contact and citation

If you have a question about TextEssence or an issue using the toolkit, please submit a GitHub issue.

If you use TextEssence in your work, please cite the following paper:

    title = "{T}ext{E}ssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora",
    author = "Newman-Griffis, Denis  and
      Sivaraman, Venkatesh  and
      Perer, Adam  and
      Fosler-Lussier, Eric  and
      Hochheiser, Harry",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-demos.13",
    pages = "106--115",

For any other questions, please contact Denis Newman-Griffis at dnewmangriffis@pitt.edu.