PubMed Crawler

Python project for retrieval and processing of medical articles from the PubMed database.

Retrieving articles and storing locally

The script crawler.py uses the Entrez utilities to query the PubMed database, with the query keyword defined inside the script. For each article found, an Article object (defined in article.py) is allocated; it collects the abstract, title, authors, date, journal, and keyword list.
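
A container along these lines would hold those fields (a minimal sketch; the actual attribute names and types in article.py may differ):

from dataclasses import dataclass, field

@dataclass
class Article:
    # Metadata collected for each PubMed record; field names are assumptions.
    title: str
    abstract: str
    authors: list[str] = field(default_factory=list)
    date: str = ""
    journal: str = ""
    keywords: list[str] = field(default_factory=list)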

The results are then stored both in XML format (DATA/xml/) and in JSON format (elasticsearch/json/) for indexing with Elasticsearch. The task is parallelized across the available cores using the Python multiprocessing module.
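
For reference, a query through the Entrez E-utilities looks roughly like this (a hedged sketch using requests directly; crawler.py may structure the calls differently):

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# esearch returns the PubMed IDs matching the query keyword.
r = requests.get(f"{EUTILS}/esearch.fcgi",
                 params={"db": "pubmed", "term": "covid-19",
                         "retmax": 100, "retmode": "json"})
ids = r.json()["esearchresult"]["idlist"]

# efetch returns the full article records as XML for the IDs found.
xml = requests.get(f"{EUTILS}/efetch.fcgi",
                   params={"db": "pubmed", "id": ",".join(ids),
                           "retmode": "xml"}).text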

Requirements

requests
BeautifulSoup
multiprocessing

Preprocessing

preprocessor.py implements a Preprocessor object that can load data from the XML generated in the previous step, or simply preprocess raw text. It splits the text into paragraphs and sentences and then performs tokenization (with regex), stop words removal, and lemmatization (with NLTK). Again, multiprocessing is used to parallelize the task across the available cores.
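
The individual steps follow the standard NLTK recipe, roughly like this (a sketch, not the actual implementation; the stopwords and wordnet resources must be downloaded once with nltk.download):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

sentence = "The viruses were isolated from infected patients."
tokens = re.findall(r"[a-zA-Z]+", sentence.lower())   # tokenization with regex
tokens = [t for t in tokens if t not in stop]         # stop words removal
tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization
print(tokens)  # ['virus', 'isolated', 'infected', 'patient']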

Requirements

re
BeautifulSoup
multiprocessing
nltk

Usage

Load an XML file generated by crawler.py and write the result to out_file

p = Preprocessor(load='path-to-file.xml', out_file='path-to-out-file.txt')

or simply feed some raw text

p = Preprocessor(text=some_text, out_file='path-to-out-file.txt')

then perform preprocessing

p.preprocess(tokenize=True, remove_stop=True, lower=True, lemma=True)

Only sentence splitting is performed by default.
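
Equivalently, a bare call

p.preprocess()

splits the text into sentences and nothing more.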

Causal graph

Here I use SemRep to extract predications from the abstracts collected with crawler.py. The bash script predicates.sh takes as its argument a raw text file with one sentence per line (which can be generated with the Preprocessor, for example)

sh predicates.sh example.txt

and generates the predicates.xml file containing all the predications found by SemRep. The task is split into batches of 100 sentences and runs in parallel thanks to GNU Parallel.
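
The batching pattern amounts to something like the following (a hedged sketch; semrep.sh stands in for the actual SemRep command line, whose flags may differ):

# Split the input into chunks of 100 sentences: batch_aa, batch_ab, ...
split -l 100 example.txt batch_

# Run one SemRep instance per chunk, one job per core.
parallel 'semrep.sh {} > {}.xml' ::: batch_*

# Merge the per-chunk results into the final predications file.
cat batch_*.xml > predicates.xml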

causal_graph.py then parses the generated predications file and extracts all the subject-predicate-object tuples. From these it selects only the causal ones and builds the graph using the Graph object implemented in graph.py, based on the Python module graph_tool.
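
The graph-building core looks roughly like this (a sketch, not the code in graph.py; triples stands for the parsed subject-predicate-object tuples, and CAUSES is one of SemRep's causal predicate types):

from graph_tool.all import Graph

g = Graph(directed=True)
name = g.new_vertex_property("string")  # concept label attached to each vertex
vertices = {}

def vertex_for(label):
    # Reuse the vertex when a concept appears in several predications.
    if label not in vertices:
        v = g.add_vertex()
        name[v] = label
        vertices[label] = v
    return vertices[label]

triples = [("covid-19", "CAUSES", "pneumonia")]  # illustrative input
for subj, pred, obj in triples:
    if pred == "CAUSES":  # keep only the causal predications
        g.add_edge(vertex_for(subj), vertex_for(obj))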

Here is an example about COVID-19.

Requirements

SemRep
GNU Parallel
graph_tool
BeautifulSoup