Note: You will need Python 2.7 as your default Python version.
Before installing the needed libraries, make sure your system is ready. The commands below are for Ubuntu.
- Install compiler tools on your system:
sudo apt-get install build-essential gfortran
- Make sure your system has easy_install. If it doesn't, you will need setuptools:
sudo apt-get install python-setuptools
- Install pip:
sudo easy_install pip
- Install the Python shared libraries and headers:
sudo apt-get install python-dev
- Install necessary C libraries:
sudo apt-get install libz-dev libigraph0-dev libblas-dev liblapack-dev
The scripts depend on a number of Python libraries. All of them can be installed using pip:
pip install ipython lxml numpy pandas python-dateutil python-igraph requests requests-cache scipy suds cssselect
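As an optional sanity check, you can verify that the major libraries import cleanly. This one-liner is our suggestion, not part of the repository's scripts:
python -c "import igraph, lxml, numpy, pandas, requests, scipy, suds"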
ProTip: We recommend using the virtualenv tool so that these libraries are installed locally. Create and activate the environment with:
virtualenv venv
source venv/bin/activate
If you're having trouble installing lxml using pip, you can install it with Ubuntu's package manager: sudo apt-get install python-lxml
If you're not running Python in virtualenv, you will need to tell Python where to find the metrics/src directory. To do so:
export PYTHONPATH=/path/to/metrics/src
A top-down network has these layers:
- The drug itself.
- The FDA NDA (New Drug Application) and all clinical trials classified under the given drug.
- All articles referenced by the FDA NDA and clinical trials. Includes authors, institutions, and grant agencies connected to each article in this layer.
- All articles that are in the bibliographies of each article above. Includes authors, institutions, and grant agencies connected to each article in this layer.
The top-down script can automatically retrieve all articles referenced by clinical trials for a given drug. However, it cannot automatically create a bibliography list for the FDA NDA; that must be done manually. Thus the input file for the top-down script contains: a) the drug name, and b) the articles referenced by the FDA NDA. Input files that have already been generated are located in the input/ directory.
Here we'll use the Ivacaftor/Kalydeco drug as an example.
- Go to the FDA drug site and search for Kalydeco.
- Click on Approval History, Letters, Reviews, and Related Documents.
- Under the Approval item (at the end), click on Review. (http://www.accessdata.fda.gov/drugsatfda_docs/nda/2012/203188s000TOC.cfm)
- Click on Medical Review (PDF). (http://www.accessdata.fda.gov/drugsatfda_docs/nda/2012/203188Orig1s000MedR.pdf)
- Use the poppler-utils package's pdftotext command to extract the text from the PDF.
- Copy the Literature Review/References section of the text file to its own file: input/Ivacaftor-FDA-NDA-Medical.txt
- Put the name of the drug (Ivacaftor) as the first line in the text file above.
The references must follow the CSE citation format, for example:
3. Riordan JR, Rommens JM, Kerem B. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 1989, Sep 8; 245(4922):1066-73.
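To see how a reference like this breaks down into fields, here is a minimal, hypothetical parsing sketch. It is not the parser topdown.py actually uses (use testparse.py to debug real input files), and the regular expression is an assumption that handles only simple references:
import re

# Hypothetical regex; the real parsing lives in src/topdown.py.
CSE_REF = re.compile(
    r'^\s*(?P<num>\d+)\.\s+'    # reference number
    r'(?P<authors>.+?)\.\s+'    # author list, up to the first period
    r'(?P<title>.+?)\.\s+'      # article title
    r'(?P<journal>.+?)\s+'      # journal name
    r'(?P<year>\d{4})')         # publication year

line = ('3. Riordan JR, Rommens JM, Kerem B. Identification of the cystic '
        'fibrosis gene: cloning and characterization of complementary DNA. '
        'Science 1989, Sep 8; 245(4922):1066-73.')
match = CSE_REF.match(line)
if match:
    print(match.groupdict())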
To generate the network, run the top-down script:
python src/topdown.py --format cse --levels 2 input.txt output.pklz
The network file will be stored in output.pklz. If you want to open this network in Cytoscape, convert it to XGMML:
python src/xgmml.py output.pklz output.xgmml
Instead of creating a top-down network from an FDA NDA, you can create one from a list of PMIDs. This is useful for getting a network of peripheral articles.
Create a text file listing each PMID on a separate line.
python src/topdown.py --format pmid --dont-search-trials --levels 2 input.txt output.pklz
The score.py script takes a top-down network file and outputs the same network, but with each node containing a score attribute.
Article nodes can be scored by:
- individual: the article's score is its citation count
- propagate: the article's score is its citation count plus the score of any lower-level article that connects to it
Author, institution, and grant agency ("neighbor") nodes can be scored by:
- sum: sum the score of all articles that connect to a neighbor node
- indegree: the neighbor's score is the number of articles that connect to it
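For intuition, here is a minimal, hypothetical sketch of the two strategies using plain dicts. score.py itself operates on the pklz network; the data below is made up:
# Made-up data: citation counts, plus edges from lower-level articles
# up to the articles they connect to.
citation_count = {'A': 10, 'B': 3, 'C': 1}
connects_to = {'B': ['A'], 'C': ['A']}  # B and C are one level below A

# propagate: an article's score is its citation count plus the scores of
# the lower-level articles connecting to it. (In a deeper network the
# lowest level would have to be scored first.)
score = dict(citation_count)
for lower, uppers in connects_to.items():
    for upper in uppers:
        score[upper] += score[lower]
assert score['A'] == 14  # 10 + 3 + 1

# indegree: a neighbor's score is the number of articles connecting to it.
author_articles = {'Author-X': ['A', 'B'], 'Author-Y': ['A']}
neighbor_score = dict((a, len(arts)) for a, arts in author_articles.items())
assert neighbor_score == {'Author-X': 2, 'Author-Y': 1}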
Example of calling score.py:
python src/score.py --article-scoring propagate --neighbor-scoring indegree input.pklz output.pklz
You can create a histogram plot of the publication dates from articles in a network file using the articlestats.py script.
Create three separate CSV files: one containing all article publication dates, one for articles marked as clinical trials, and one for articles not marked as clinical trials.
python src/articlestats.py output.pklz article_years.csv pmid pubdays
python src/articlestats.py --filter clinical-only output.pklz clinical_article_years.csv pmid pubdays
python src/articlestats.py --filter non-clinical-only output.pklz non_clinical_article_years.csv pmid pubdays
Use the article_years_plot.R script:
R CMD BATCH src/article_years_plot.R
The R script expects the article_years.csv, clinical_article_years.csv, and non_clinical_article_years.csv files to be in the current directory. When it finishes, it will create the file article-years.pdf in the current directory.
A bottom-up network has these layers:
- An author
- All articles written by the author above. Included are co-authors, institutions, and grant agencies connected to each article in this layer.
- (Two-level networks only.) Another layer of articles that cite the articles above. Included are authors, institutions, and grant agencies connected to each article in this layer.
The bottom-up script needs an author name and an institution. Here's an example of how to run the bottom-up script on a single author:
python src/bottomup.py --levels 2 "Pico AR" "gladstone" output.pklz
The network file will be stored in output.pklz. If you want to open this network in Cytoscape, convert it to XGMML:
python src/xgmml.py output.pklz output.xgmml
Create an input file that follows this format:
Author-A
Institution-A
Output/Path/A.pklz
Author-B
Institution-B
Output/Path/B.pklz
...
Run the pipeline script:
sh src/bottomup-pipeline.sh input-scripted.txt
Important: The bottomup-pipeline.sh script takes Web of Science's throttling limitations into account. It sleeps for 60 seconds after each author to prevent bottomup.py from signing into Web of Science too frequently.
Note: Sometimes Web of Science will stop working, causing bottomup.py to die. To get around this problem, bottomup-pipeline.sh was designed to be run repeatedly: if an output file already exists for an author, that author is skipped.
Note: The pipeline script creates one-level networks for each author. If you want to change this, open bottomup-pipeline.sh and look at line 15; you can change the number of levels there.
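For intuition, the pipeline's logic amounts to something like the following sketch (written here in Python as an illustration; the real script is a shell script, and the file names are only examples):
import os
import subprocess
import time

# Read the author/institution/output-path triples from the input file.
with open('input-scripted.txt') as f:
    lines = [l.strip() for l in f if l.strip()]

for author, institution, out_path in zip(lines[0::3], lines[1::3], lines[2::3]):
    if os.path.exists(out_path):
        continue  # already done; this is what makes reruns safe
    subprocess.call(['python', 'src/bottomup.py', '--levels', '1',
                     author, institution, out_path])
    time.sleep(60)  # respect Web of Science throttling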
The authorssample.py script will:
- Collect all articles published under the given MeSH terms.
- Create a list of last authors from each article.
- Randomly sample from this list based on the given sample size.
- Output each sampled author's name and the five most common institutions affiliated with the author's articles under the given MeSH terms.
Here's an example of how to run it:
python src/authorssample.py --output random-authors-and-institutions.txt --num-samples 1 --sample-size 200 --mesh-terms anticoagulant thrombosis
Adding more than one MeSH term will do an AND operation across all terms (e.g. --mesh-terms anticoagulant thrombosis → anticoagulant AND thrombosis). If a single MeSH term consists of multiple words, enclose it in quotation marks (e.g. "cystic fibrosis").
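For intuition, the resulting PubMed query presumably looks something like the following (an assumption on our part; the exact query string is built inside authorssample.py):
anticoagulant[MeSH Terms] AND thrombosis[MeSH Terms]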
The sampling script filters out authors with low publication counts, so the effective sample size will be roughly half of what you request. If you want the effective sample size to be around 100, set the sample size to 200.
All samples will be put into a single output file. Each sample begins with the line # Sample i.
The output file can then be converted into the bottom-up pipeline input file (described above) using the pickno1.py script.
Before running pickno1.py, keep in mind that all samples are put into a single file. You will need to split each sample into its own file before running this script. You will also need to remove the # Sample i line from each file, even if you have only a single sample.
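Here is a minimal, hypothetical Python snippet for doing that split; the input and output file names are assumptions:
# Split the combined authorssample.py output into one file per sample
# and drop the "# Sample i" marker lines.
import re

out = None
with open('random-authors-and-institutions.txt') as f:
    for line in f:
        m = re.match(r'#\s*Sample\s+(\d+)', line)
        if m:
            if out:
                out.close()
            out = open('sample-%s.txt' % m.group(1), 'w')
        elif out is not None:
            out.write(line)
if out:
    out.close()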
Here's how to run pickno1.py:
cat random-authors-and-institutions.txt | python src/pickno1.py /output/prefix/path > random-authors-scripted.txt
The authormat.py script takes a list of bottom-up network files and outputs a matrix, with each row containing summary statistics for one of the given files.
Here's how to run it:
python src/authormat.py output.csv network-1.pklz network-2.pklz ...
Note that every third line of the bottom-up pipeline input file contains a network file path. You can extract every third line with bash like so:
while read l; do read l; read l; if [ -f "$l" ]; then echo -ne "\"$l\" "; fi; done < input-scripted.txt
(The above snippet also makes sure that the network file exists.)
Copy the output of the above snippet and paste it after typing python src/authormat.py output.csv.
Let's say you have two matrices, one containing core author networks and another containing peripheral author networks. You want to remove all authors in the peripheral matrix who are core authors. You can do this with the dupauthors.py script:
python src/dupauthors.py core-matrix.csv peripheral-matrix.csv
This will output all author names that appear in both matrices. You can then delete these duplicated authors from the peripheral matrix.
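Conceptually, this is just a set intersection on the author-name column. A hypothetical sketch, assuming the author name is the first column of each CSV:
# Report author names that appear in both matrices.
# Assumption: the name is in the first CSV column.
import csv

def author_names(path):
    with open(path) as f:
        return set(row[0] for row in csv.reader(f) if row)

dups = author_names('core-matrix.csv') & author_names('peripheral-matrix.csv')
for name in sorted(dups):
    print(name)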
- articlestats.py: Takes a top-down network file as input and creates a CSV file containing information about each article node.
- authormat.py: Creates a summary matrix for each given bottom-up network file. Outputs a CSV file that can be imported into R.
- authorssample.py: Creates a random sample of last authors who published under given MeSH terms. Outputs a file listing each sampled author and the top 5 institutions affiliated with the articles published by that author.
- bottomup.py: Takes an author and his or her institutional affiliation and creates a bottom-up network. The network is stored in the pklz format.
- dupauthors.py: Lists all duplicated authors across two author matrix files.
- meshmat.py: Takes a network file as input and outputs a CSV file of MeSH term frequency across all article nodes.
- pickno1.py: Processes the text output of authorssample.py and converts it into an input file suitable for bottomup-pipeline.sh.
- score.py: Takes a top-down network file in the pklz format, adds a score attribute to all article, author, institution, and grant agency nodes, and outputs a network file in the pklz format.
- testparse.py: Takes an input file of CSE-styled references and tries to parse them. Useful for debugging an input file for the top-down CSE workflow.
- topdown.py: Takes a list of CSE references or PMIDs and creates a top-down network. The network is stored in the pklz format.
- xgmml.py: Takes a pklz file and converts it into an xgmml file.
The command scripts above rely on infrastructure code. Here's an explanation of each file:
- clinicaltrials.py: Provides the Client class for the clinicaltrials.gov web service.
- litnet.py: Provides the LitNet class that makes it easy to generate networks that represent relationships between articles, authors, institutions, and grant agencies.
- pubmed.py: Provides the Client class for the PubMed web service.
- wos.py: Provides the Client class for the Thomson Reuters Web of Science web service.
- util.py: Utility functions for working with XML files.
All output data files can be found at: //gdsl.gladstone.internal/gdsl/GICD/Common Use/Samad Lotia/metrics
I've included a VirtualBox disk image that contains Ubuntu along with all the software packages needed to run the metrics scripts. You can find it here:
//gdsl.gladstone.internal/gdsl/GICD/Common Use/Samad Lotia/Metrics-Ubuntu-VirtualBox-disk.vdi
When you set up your VM, you can set its hard disk drive to Metrics-Ubuntu-VirtualBox-disk.vdi. The username is metrics and the password is metrics.