An automatic text mining pipeline to identify sentence-level mentions of autism-associated genes and phenotypes in literature through natural language processing methods. We aim to understand gene–phenotype associations in the autism-related literature to unravel the disease mechanisms and advance its diagnosis and treatment. We have generated a comprehensive database of gene-phenotype associations with the autism-related literature. The database can be easily updated as new literature becomes available with Autism_genepheno. To run Autism_genepheno pipeline, please follow the instructions below:
python3 Autism_genepheno/bin/Autism_genepheno_PMC_scraper.py --pmc_id_list source/pmc_result.txt --out_dir XML_Autism_datasets_5years
--pmc_id_list: "source/pmc_result.txt" is txt file to store all papers' PMCID, one PMCID per line". To get your customized txt file, you can download it from PMC website here.
--out_dir, default = ./XML_datasets. You can define your own folder, for example "XML_Autism_datasets_5years".
STEP 1. Run 'Autism_genepheno_step1.ipynb' to extract sentence-level gene-phenotype pairs, their occurrences in each paper and the summary of results.
1. Input are the path to gene list, phenotype list, and target papers folder 'XML_Autism_datasets_5years'
from STEP 0.
#==========================================================================================================================
ASDPTO_dir = 'Autism_genepheno/source/ASDPTO.csv' # The ASDPTO part phenotype list
UMLS_dir = 'Autism_genepheno/source/UMLS.txt' # The UMLS part phenotype list
allGene_dir = 'Autism_genepheno/source/export_latest.tsv' # The autism-associated gene list from VariCarta database
papers_dir = './XML_datasets_5year/' # Target papers in the last five years
out_dir = './Autism_genepheno_results/' # default = './Autism_genepheno_results/'
#==========================================================================================================================
You can download the gene list 'export_latest.tsv' here.
To skip STEP 0, you can also download target papers in the last 5 years 'XML_Autism_datasets_5year' here.
Autism_genepheno_results
|-Extracted_results
| |-PMCxxxxxxx.json
| ...
|-Sum_for_each_paper
| |-PMCxxxxxx.txt
| ...
|-Sum_all
|-n_g.txt
|-n_p.txt
|In_Summary.txt
Output 1. Extracted sentence-level gene-phenotype pairs in the folder 'Extracted_results'
. Here is an example of the extracted result in JSON format for a PMC paper (PMCxxxxxxx.json
).
{
"PMCid": "PMC6571119",
"Title": "Impaired neurodevelopmental pathways in autism spectrum disorder: a review of signaling mechanisms and crosstalk (Published on 6/15/2019)",
"Sentences": {
"Sentence0": {
"Content": "For instance, Neuroligins (NLGN), fragile X mental retardation 1 (FMR1), ubiquitin-protein ligase E3A (UBE3A), and DLX, which modulate BMP signaling, have been found to be associated with ASD [10–13].",
"Gene": [
"FMR1",
"UBE3A"
],
"Original phenotype": [
"mental retardation"
],
"Normolized phenotype": [
[
"C0025362",
"Mental retardation",
"OMIM, HPO, SNOMEDCT_US",
"HP:0001249"
]
],
"Upper level concepts (HPO only)": [
"Abnormality of the nervous system"
]
}
...
}
}
Output 2. Occurrence of genes and phenotypes for each paper in the folder 'Sum_for_each_paper'
. Here is an example of the results in JSON format.
{
"PMCid": "PMC6741850",
"Only abstract?": "N",
"Number of Sentences": 40,
"n_g": {
"TRPM8": 1,
"FMR1": 2,
"MB": 1,
"PTEN": 1
},
"n_p": {
"['C0009443', '(Acute nasopharyngitis or rhinitis) or (common cold)', 'SNOMEDCT_US', 'NULL']": 1,
"['C0456909', 'Blindness', 'MSH, OMIM, SNOMEDCT_US, HPO', 'HP:0000618']": 1,
"['C0233577', 'Mimicry', 'SNOMEDCT_US', 'NULL']": 1
}
}
Output 3. Summary of results in the folder 'Sum_all'
. Here is an example of the summary:
Number of paper processed: 15095
Number of the articles have only abstract: 5008
Number of paper get at least one sentence: 8512
Sentences extracted: 62183
N_tot = 2754875
Unique gene list from all papers: ['PTPRE', 'TSPO', ...]
Unique normalized phenotype list from all papers: ["['C1510472', 'Dependence syndrome', 'SNOMEDCT_US', 'NULL']", "['C0008372', 'Intrahepatic cholestasis', 'OMIM, HPO, SNOMEDCT_US', 'HP:0001406']", ...]
STEP 2. Run 'Autism_genepheno_step2.ipynb' to analyze the "Autism_genepheno_results"
from STEP 2. It calculates the NPMI of each gene-phenotype pair and outputs the gene-phenotype matrix.
1. Inputs are the path to the results from STEP 2. They are path to 'Extracted_results', 'n_p.txt', 'n_g.txt' and 'In_Summary.txt'
#================================================================================================================================
# input file dir
json_path = './Autism_genepheno_results/Extraced_results' # the output file of step1
np_dir = './Autism_genepheno_results/Sum_all/n_p.txt' # the output file of step1
ng_dir = './Autism_genepheno_results/Sum_all/n_g.txt' # the output file of step1
In_Summary_dir='./Autism_genepheno_results/Sum_all/In_Summary.txt' # the output file of step1
sfari_gene_dir='Autism_genepheno/source/SFARI-Gene_genes_12-11-2020release_12-19-2020export.xlsx' # the SFARI genes file dir
# output file dir
NPMI_result_dir='./Autism_genepheno_results/NPMI_file/' # folder of NPMI file
#================================================================================================================================
Autism_genepheno_results
|-NPMI_file
|-NPMI.json
|-NPMI.csv
|-NPMI_above_zero.csv
|-graph_matrix_01_NPMIabove0.csv
Output 1. The file ‘NPMI.json'
includes all the NPMI information of each gene-phenotype pair. Here is an example of the NPMI information of a gene-phenotype pair.
[
{
"gene": "SHANK3",
"phenotype": "['C1853490', '22q13 Deletion Syndrome', 'MSH', 'NULL']",
"NPMI": 0.607088439701336,
"gene_sfari_class": 1.0,
"n_g": 1964, # the number of sentences mentioning the gene
"n_p": 256, # the number of sentences mentioning the phenotype
"n_gp": 94.0 # the number of sentences where the gene and phenotype co-occurs
},
...
]
Output 2. The NPMI results are grouped by gene in the file ‘NPMI.csv’
file and ‘NPMI_above_zero.csv’
.
| gene | phenotype | gene_sfari_class | NPMI | n_g | n_p | n_gp |
|--------|----------------------------------------------------------|------------------|----------|------|-----|------|
| SHANK3 | ['C1853490', '22q13 Deletion Syndrome', 'MSH', 'NULL'] | 1 | 0.607088 | 1964 | 256 | 94 |
...
Output 3. The gene-phenotype matrix is in the ‘graph_matrix_01_NPMIabove0.csv’ file. The matrix shows the quantitative relationship between gene and phenotype. Each row refers to a gene and each column refers to a phenotype. If the NPMI value of a gene-phenotype pair is positive, the value in the gene-phenotype matrix is 1, else 0.
| | ['C1535926', 'Child Mental Disorders', 'MSH', 'NULL'] | ['C0038271', 'Repetitive movements', 'HPO','HP:0000733'] | ['C0019247', 'Genetic Diseases', 'MSH','NULL'] | | | |
|---------|-------------------------------------------------------|----------------------------------------------------------|------------------------------------------------|---|---|---|
| SCN2A | 1 | 0 | 1 | | | |
| CACNA1C | 1 | 1 | 1 | | | |
| AFF2 | 1 | 0 | 1 | | | |
| | | |
Final output files (gene-phenotype associations) 'NPMI.csv' and 'NPMI_above_zero.csv' can be downloaded from here.
Read the file here.
Text mining of gene-phenotype associations reveals new phenotypic profiles of autism-associated genes. S. Li, Z. Guo, J. B. Ioffe, Y. Hu, Y. Zhen, X. Zhou. Scientific Reports (2021) 11(1):15269.