

variant2literature (v2l)

Extract and normalize variants from academic papers in xml, pdf, doc, docx, xlsx, csv formats.

Required Data and Packages


Download the following files and put them in variant2literature/models/:

  • FasterRCNN model
  • UCSC tables (hg19)
  • NCBI gene_info
  • ucsc.hg19.fasta: download it from UCSC and convert it to FASTA format, or download it from the GATK bundle and decompress it.
    Rename it to ucsc.hg19.fasta if the filename differs.
  • tmVar 2.0: download it, then
    copy tmVarJava/CRF/MentionExtractionUB.Model to variant2literature/models/
  • GNormPlus: download it, then
    copy GNormPlusJava/Dictionary/GNR.Model to variant2literature/models/ and
    copy GNormPlusJava/Dictionary/PT_CTDGene.txt to variant2literature/models/
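
Before building anything, it can help to confirm that the files named above are in place. The following is a minimal sketch, not part of the project: the helper name `missing_model_files` is hypothetical, and it only checks the four file names that this README states explicitly (the FasterRCNN model, UCSC tables and gene_info file names are not given here, so they are left out).

```python
from pathlib import Path

# File names taken from the list above. The FasterRCNN model, UCSC tables and
# NCBI gene_info files are omitted because their file names are not stated here.
EXPECTED_FILES = (
    "ucsc.hg19.fasta",
    "MentionExtractionUB.Model",
    "GNR.Model",
    "PT_CTDGene.txt",
)


def missing_model_files(models_dir="variant2literature/models"):
    """Return the expected file names that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All checked files are present.")
```

An empty list means all four checked files were found; the remaining downloads still need to be verified by hand.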


Run variant2literature in Docker


Requirements:

  • Linux OS
  • Docker 18.09.0 or higher
  • CUDA 8.0 or higher
  • nvidia-docker


Setup steps:

  • Build the Docker image: make build
  • Compile fasterRCNN: make compile
  • Start the Docker container: make run
  • Start the MySQL Docker container: make run-db
  • Load data into the database: make load-db (run this only once, unless MYSQL_VOLUME is changed)
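
The make targets above can also be driven from a small script. This is a sketch only, assuming the targets are run in the order listed; the function names `setup_commands` and `run_setup` and the `load_db` and `dry_run` parameters are hypothetical, not part of the project. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Make targets in the order they are listed above.
SETUP_TARGETS = ["build", "compile", "run", "run-db", "load-db"]


def setup_commands(load_db=True):
    """Return the make commands for the Docker setup, in order.

    Pass load_db=False to skip 'make load-db', which only needs to run once.
    """
    targets = [t for t in SETUP_TARGETS if load_db or t != "load-db"]
    return [["make", t] for t in targets]


def run_setup(load_db=True, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in setup_commands(load_db):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_setup(dry_run=True)
```

On later runs, `run_setup(load_db=False, dry_run=False)` would execute the same sequence without reloading the database.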

Index Papers

  • Put paper directories in input/
  • Run make index
  • Query with make query, or with make query OUTPUT_FILE=output.txt to set an output file

Delete Indexes

  • run make truncate

Stop and Remove Docker Container

  • run make rm
  • run make rm-db

Directly run in Linux


apt-get install -y software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get update
apt-get install -y \
        build-essential cmake \
        python3.6-dev python3-pip python3-tk \
        libpoppler-cpp-dev libmagic-dev libxrender-dev \
        libsm6 libxext6 libglib2.0-0 \
        libreoffice poppler-utils

Install Python and required packages

ln -s /usr/bin/python3.6 /usr/local/bin/python
python -m pip install -U pip==18.1
pip install torch==0.4.1
# If you have CUDA 9.2, please use the following command to install pytorch instead
# pip install http://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-linux_x86_64.whl

pip install -r requirements.txt \
    && python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"

Initialize CRF++

cp CRF++-0.58.tar.gz /opt/
cd /opt && tar zxvf CRF++-0.58.tar.gz
cd /opt/CRF++-0.58 \
    && ./configure && make && make install && cd python \
    && cp /opt/CRF++-0.58/crfpp.h . \
    && python setup.py build && ldconfig \
    && python setup.py install

Initialize table detector

cd table_detector/lib && bash make.sh

Install MySQL

apt-get install mariadb-server
service mysql start

Change the MySQL root password

# then enter default password `s8fjYJd92oP`

If you get an error like 1698, "Access denied for user 'root'@'localhost'", set the root user to use the mysql_native_password plugin:

mysql> USE mysql;
mysql> UPDATE user SET plugin='mysql_native_password' WHERE User='root';
mysql> exit;

Then restart MySQL:

service mysql restart

Load MySQL data

ln -s ./ /app

export MYSQL_HOST=
export MYSQL_PORT=3306

cd mysqldb
python models.py

Run table detector

export LOAD_BALANCER_HOST='localhost'

cd table_detector && python table_detector.py

Index papers

Put paper directories in input/, then execute

python main.py --n-process 1 --input input/

If your input files are plain text, or you are running on a device without a GPU, add --no-table-detect to disable the table detector.
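
To make the choice of flags explicit, the indexing command can be assembled in Python. This is a minimal sketch: the flags `--n-process`, `--input` and `--no-table-detect` are the ones shown above, while the helper name `index_command` and its parameters are hypothetical.

```python
import subprocess


def index_command(input_dir="input/", n_process=1, table_detect=True):
    """Build the argument list for indexing the papers in input_dir.

    Set table_detect=False for plain-text input or for machines without a GPU.
    """
    cmd = ["python", "main.py", "--n-process", str(n_process), "--input", input_dir]
    if not table_detect:
        cmd.append("--no-table-detect")
    return cmd


if __name__ == "__main__":
    cmd = index_command(table_detect=False)
    print(" ".join(cmd))
    # Uncomment to actually start indexing from the repository root:
    # subprocess.run(cmd, check=True)
```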
The results are saved in the MySQL database. Query them with query.py, or run SQL directly. For example:

mysql> USE gene;
mysql> SELECT * FROM var_pmid WHERE _id='<paper_directory_name>';

or

python query.py
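
The same lookup can be done from Python through any DB-API 2.0 connection. The sketch below is an illustration, not the project's query.py: the function name `variants_for_paper` is hypothetical, only the table `var_pmid` and the column `_id` come from the SQL example above, and the demo uses an in-memory SQLite table with a made-up `variant` column so that it runs without a MySQL server. With a MySQL driver, pass a connection to the `gene` database and keep the default `%s` placeholder.

```python
import sqlite3  # stand-in used only for the demo below


def variants_for_paper(conn, paper_id, placeholder="%s"):
    """Fetch all var_pmid rows for one paper directory name.

    conn is any DB-API 2.0 connection. placeholder is the driver's parameter
    marker: "%s" for common MySQL drivers, "?" for sqlite3.
    """
    sql = "SELECT * FROM var_pmid WHERE _id = " + placeholder
    cur = conn.cursor()
    try:
        # The paper id is passed as a bound parameter, not pasted into the SQL.
        cur.execute(sql, (paper_id,))
        return cur.fetchall()
    finally:
        cur.close()


if __name__ == "__main__":
    # Demo table; the 'variant' column and its value are hypothetical.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE var_pmid (_id TEXT, variant TEXT)")
    demo.execute("INSERT INTO var_pmid VALUES ('paper1', 'c.123A>G')")
    print(variants_for_paper(demo, "paper1", placeholder="?"))
```

Binding the directory name as a parameter avoids quoting problems when a name contains an apostrophe.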


This project is licensed under the GPLv3 License - see the LICENSE file for details.


The fasterRCNN implementation used here is written by Jianwei Yang and Jiasen Lu.