variant2literature

Extract and normalize variants from academic papers in XML, PDF, DOC, DOCX, XLSX, and CSV formats.

Prerequisites

  • Linux OS
  • Docker 18.09.0 or higher
  • CUDA 8.0 or higher
  • nvidia-docker

Required Data and Packages

CRF++:

Download the following files and put them in variant2literature/models/:

FasterRCNN model:
UCSC tables (hg19):
NCBI gene_info
  • ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
  • ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
ucsc.hg19.fasta
  • Download it from UCSC and convert it to FASTA format, or download it from the GATK bundle and decompress it.
    Rename the file to ucsc.hg19.fasta if it has a different name.
tmVar
  • Download tmVar 2.0 and
    copy tmVarJava/CRF/MentionExtractionUB.Model to variant2literature/models/
GNormPlus
  • Download GNormPlus, then
    copy GNormPlusJava/Dictionary/GNR.Model to variant2literature/models/ and
    copy GNormPlusJava/Dictionary/PT_CTDGene.txt to variant2literature/models/
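
Before building, it can help to confirm that the files named above are in place. The following is a minimal sketch and not part of the project: the helper name `missing_model_files` is hypothetical, and the file list covers only the names this README spells out. The FasterRCNN model and UCSC table filenames are not given here, and the README does not say whether the NCBI downloads should stay compressed, so those are left out.

```python
from pathlib import Path

# File names taken from this README. Treating them as the full set of
# required files is an assumption; the project itself may check differently.
EXPECTED_FILES = (
    "ucsc.hg19.fasta",
    "MentionExtractionUB.Model",
    "GNR.Model",
    "PT_CTDGene.txt",
)


def missing_model_files(models_dir="variant2literature/models"):
    """Return the expected file names that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files found.")
```

Run it from the directory that contains variant2literature/, or pass a different path to `missing_model_files`.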

Usage

Setup

  • Build the Docker image with make build
  • Compile fasterRCNN with make compile
  • Start the Docker container with make run
  • Start the MySQL Docker container with make run-db
  • Load data into the database with make load-db (this only needs to run once, unless MYSQL_VOLUME is changed)
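
The setup steps can also be scripted. This is a rough sketch that assumes `make` is on the PATH and that the script runs from the repository root; `SETUP_TARGETS` and `run_targets` are hypothetical names, while the target names and their order come from the list above.

```python
import subprocess

# Make targets in the order listed above.
SETUP_TARGETS = ["build", "compile", "run", "run-db", "load-db"]


def run_targets(targets, dry_run=False):
    """Run `make <target>` for each target, stopping at the first failure.

    Returns the list of commands that were (or would be) executed.
    """
    commands = [["make", target] for target in targets]
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default so nothing is built or started by accident.
    run_targets(SETUP_TARGETS, dry_run=True)
```

Call `run_targets(SETUP_TARGETS)` to execute the steps for real. Since load-db only needs to run once, later runs could pass `SETUP_TARGETS[:-1]` instead.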

Index Papers

  • Put paper directories in input/
  • Run make index
  • Query with make query or make query OUTPUT_FILE=output.txt
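
Before running make index, a quick check of what is in input/ may save a failed run. The sketch below is an illustration only: this README does not describe how the indexer selects files, so matching on file extension is an assumption, and `split_by_support` is a hypothetical helper. The extensions are the formats listed at the top of this README.

```python
from pathlib import Path

# Extensions taken from the formats listed at the top of this README.
SUPPORTED_EXTENSIONS = {".xml", ".pdf", ".doc", ".docx", ".xlsx", ".csv"}


def split_by_support(input_dir="input"):
    """Walk input_dir and split files into (supported, unsupported) lists."""
    supported, unsupported = [], []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        # Compare case-insensitively so "main.PDF" counts as a PDF.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


if __name__ == "__main__":
    ok, other = split_by_support()
    print(f"{len(ok)} supported file(s), {len(other)} other file(s)")
```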

Delete Indexes

  • Run make truncate

Stop and Remove Docker Container

  • Run make rm
  • Run make rm-db

License

This project is licensed under the GNU General Public License v3.0 (GPLv3); see the LICENSE file for details.

Acknowledgments

The fasterRCNN implementation used here was written by Jianwei Yang and Jiasen Lu.