variant2literature

Extract and normalize variants from academic papers in XML, PDF, DOC, DOCX, XLSX, and CSV formats.

Prerequisites

Linux OS
Docker 18.09.0 or higher
CUDA 8.0 or higher
nvidia-docker
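
The Docker requirement above can be checked before building. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes `docker --version` prints a dotted three-part version number somewhere in its output. It does not check CUDA or nvidia-docker.

```python
import re
import subprocess

# Minimum Docker version from the prerequisites list above.
MIN_DOCKER = (18, 9, 0)


def parse_docker_version(text):
    """Extract (major, minor, patch) from `docker --version` style output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def docker_is_new_enough(text, minimum=MIN_DOCKER):
    """Return True if the version found in `text` is at least `minimum`."""
    return parse_docker_version(text) >= minimum


def read_docker_version():
    """Ask the local Docker CLI for its version string (requires Docker)."""
    result = subprocess.run(
        ["docker", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With Docker installed, `docker_is_new_enough(read_docker_version())` reports whether the local version meets the requirement.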

Required Data and Packages

CRF++:

Download CRF++.0.58.tar.gz.
Put CRF++.0.58.tar.gz in variant2literature/.

Download the following files and put them in `variant2literature/models/`:

FasterRCNN model:

https://www.dropbox.com/s/g980k8hpqj1q8cn/faster_rcnn.pth

UCSC tables (hg19):

NCBI gene_info

ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz

ucsc.hg19.fasta

Download the hg19 reference from UCSC and convert it to FASTA format, or download it from the GATK bundle and decompress it.
Rename the file to ucsc.hg19.fasta if it has a different name.

tmVar

Download tmVar 2.0.
Copy tmVarJava/CRF/MentionExtractionUB.Model to variant2literature/models/.

GNormPlus

Download GNormPlus.
Copy GNormPlusJava/Dictionary/GNR.Model to variant2literature/models/.
Copy GNormPlusJava/Dictionary/PT_CTDGene.txt to variant2literature/models/.
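
Because the files come from several sources, it may help to confirm that they are all in place before building. The sketch below is illustrative only: `missing_model_files` is a hypothetical helper, and the list covers only file names that appear in the instructions above. It assumes the two NCBI downloads are kept gzipped, which this README does not state, and it omits the UCSC tables (their file names are not given here) and the CRF++ archive (which lives outside `models/`).

```python
from pathlib import Path

# File names taken from the download instructions above.
# Assumption: the NCBI files stay compressed under their download names.
EXPECTED_MODEL_FILES = [
    "faster_rcnn.pth",
    "Homo_sapiens.gene_info.gz",
    "gene2refseq.gz",
    "ucsc.hg19.fasta",
    "MentionExtractionUB.Model",
    "GNR.Model",
    "PT_CTDGene.txt",
]


def missing_model_files(models_dir, expected=EXPECTED_MODEL_FILES):
    """Return the expected file names that are not present in models_dir."""
    models_dir = Path(models_dir)
    return [name for name in expected if not (models_dir / name).is_file()]
```

Calling `missing_model_files("variant2literature/models")` returns an empty list when every listed file is present.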

Usage

Setup

Build the Docker image with `make build`.
Compile Faster R-CNN with `make compile`.
Start the Docker container with `make run`.
Start the MySQL Docker container with `make run-db`.
Load data into the database with `make load-db` (run this only once, unless MYSQL_VOLUME is changed).
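
The setup steps can also be scripted so that they run in the listed order and stop at the first failure. This is a minimal sketch under assumptions: `run_setup` and its `skip_load_db` and `runner` parameters are hypothetical names, not part of the project, and the script must be run from the directory containing the Makefile.

```python
import subprocess

# Make targets in the order given in the Setup steps above.
SETUP_TARGETS = ["build", "compile", "run", "run-db", "load-db"]


def run_setup(skip_load_db=False, runner=subprocess.run):
    """Run each setup target in order and return the targets that ran.

    With the default runner, check=True raises CalledProcessError as soon
    as a target fails, so later targets are not attempted.
    """
    executed = []
    for target in SETUP_TARGETS:
        # load-db only needs to run once unless MYSQL_VOLUME is changed.
        if target == "load-db" and skip_load_db:
            continue
        runner(["make", target], check=True)
        executed.append(target)
    return executed
```

On later runs against the same database volume, `run_setup(skip_load_db=True)` skips the data load.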

Index Papers

Put paper directories in input/.
Run `make index`.
Query with `make query`, or with `make query OUTPUT_FILE=output.txt` to set the output file.
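
When the query step is driven from a script, the optional `OUTPUT_FILE` variable can be appended to the command only when needed. The helper below is a hypothetical sketch that just builds the argument list; it assumes nothing about the Makefile beyond the two invocations shown above.

```python
def query_command(output_file=None):
    """Build the argument list for `make query`, optionally with OUTPUT_FILE."""
    command = ["make", "query"]
    if output_file is not None:
        # Variables are passed to make as NAME=value arguments.
        command.append(f"OUTPUT_FILE={output_file}")
    return command
```

The result can be passed to `subprocess.run(query_command("output.txt"), check=True)` after `make index` has finished.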

Delete Indexes

Run `make truncate`.

Stop and Remove Docker Container

Run `make rm`.
Run `make rm-db`.

License

This project is licensed under the GPLv3 License - see the LICENSE file for details.

Acknowledgments

The Faster R-CNN implementation used here was written by Jianwei Yang and Jiasen Lu.