tair-pubmed-connector

There is no function to fetch automatically the information on the reported pubmed articles links in the tair to be used for the language models, so i coded this function which will take the tair information, a gene or locus tag and will fetch the corresponding pubmed and then from the pubmed the corresponding abstracts, which can be directly used for the text summarization into langchain or the llm models.

python
readTairNCBI("/Users/gauravsablok/Desktop/CodeCheck/release/ATH_GO_GOSLIM.txt", \
                  "/Users/gauravsablok/Desktop/CodeCheck/release/ATH_GO_GOSLIM.out", \
                          "2200935")
# this is just a reference id for checking the functionality
[('https://pubmed.ncbi.nlm.nih.gov/30356219/',
  'Nitrogen is an essential macronutrient for plant growth and basic metabolic processes. 
  The application of nitrogen-containing fertilizer increases yield, which has been a 
  substantial factor in the green revolution1. Ecologically, however, excessive application 
  of fertilizer has disastrous effects such as eutrophication2. A better understanding 
  of how plants regulate nitrogen metabolism is critical to increase plant yield and reduce 
  fertilizer overuse. Here we present a transcriptional regulatory network and twenty-one 
  transcription factors that regulate the architecture of root and shoot systems in response 
  to changes in nitrogen availability. Genetic perturbation of a subset of these transcription 
  factors revealed coordinate transcriptional regulation of enzymes involved in nitrogen metabolism. 
  Transcriptional regulators in the network are transcriptionally modified by feedback via 
  genetic perturbation of nitrogen metabolism. The network, genes and gene-regulatory modules 
  identified here will prove critical to increasing agricultural productivity.'),
 ('https://pubmed.ncbi.nlm.nih.gov/34562334/',
  'Unraveling gene function is pivotal to understanding the signaling cascades that control plant 
  development and stress responses. As experimental profiling is costly and labor intensive, 
  there is a clear need for high-confidence computational annotation. In contrast to detailed 
  gene-specific functional information, transcriptomics data are widely available for both model and 
  crop species. Here, we describe a novel automated function prediction method, which leverages 
  complementary information from multiple expression datasets by analyzing study-specific gene 
  co-expression networks. First, we benchmarked the prediction performance on recently characterized 
  Arabidopsis thaliana genes, and showed that our method outperforms state-of-the-art expression-based 
  approaches. Next, we predicted biological process annotations for known (n = 15 790) and unknown 
  (n = 11 865) genes in A. thaliana and validated our predictions using experimental protein-DNA and 
  protein-protein interaction data (covering >220 000 interactions in total), obtaining a set of 
  high-confidence functional annotations. Our method assigned at least one validated annotation to 
  5054 (42.6%) unknown genes, and at least one novel validated function to 3408 (53.0%) genes with 
  computational annotations only. These omics-supported functional annotations shed light on a 
  variety of developmental processes and molecular responses, such as flower and root development, 
  defense responses to fungi and bacteria, and phytohormone signaling, and help fill the information 
  gap on biological process annotations in Arabidopsis. An in-depth analysis of two context-specific 
  networks, modeling seed development and response to water deprivation, shows how previously 
  uncharacterized genes function within the respective networks. Moreover, our automated function 
  prediction approach can be applied in future studies to facilitate gene discovery for crop improvement.'),
 ('https://pubmed.ncbi.nlm.nih.gov/11118137/',
  'The completion of the Arabidopsis thaliana genome sequence allows a comparative analysis of 
  transcriptional regulators across the three eukaryotic kingdoms. Arabidopsis dedicates over 
  5% of its genome to code for more than 1500 transcription factors, about 45% of which are from 
  families specific to plants. Arabidopsis transcription factors that belong to families common 
  to all eukaryotes do not share significant similarity with those of the other kingdoms beyond the 
  conserved DNA binding domains, many of which have been arranged in combinations specific to each 
  lineage. The genome-wide comparison reveals the evolutionary generation of diversity in the 
  regulation of transcription.')]

Gaurav Sablok
University of Potsdam
Potsdam,Germany

codecreatede/tair-pubmed-connect

tair-pubmed-connector