Welcome to the official repository of the Genomics Data Automatic Exploration Benchmark (GenoTEX), described in our paper (arXiv:2406.15341). GenoTEX supports the evaluation and development of Large Language Model (LLM)-based methods for automating gene expression data analysis, including dataset selection, preprocessing, and statistical analysis.
GenoTEX offers annotated code and results for solving a variety of gene identification questions, organized in a comprehensive analysis pipeline that meets computational genomics standards. These annotations are curated by human bioinformaticians to ensure accuracy and reliability. You can access the dataset and other resources from this repository to support your research and development in automatic gene data analysis.
Our work falls under the general topic of AI4Science: it shows both the potential and the limitations of LLM-based agents in scientific exploration.
- code/: Contains Jupyter notebooks for the preprocessing of gene expression datasets. Each trait has its own subdirectory with notebooks for specific datasets, named after cohort IDs. The statistics.py file provides statistical analysis tools for the preprocessed data.
- preprocessed/: Includes preprocessed data organized by trait. Each trait subdirectory contains:
  - cohort_info.json: Stores results of manual data filtering and metadata such as sample size.
  - gene_data/: Subdirectory for preprocessed gene data.
  - trait_data/: Subdirectory for preprocessed trait data.
- output/: Contains regression results for each trait. Each subdirectory holds results for gene identification problems involving the respective trait, with filenames based on trait-condition pairs.
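The layout above can be walked programmatically. The following is a minimal sketch, not part of the repository: the function names `summarize_preprocessed` and `_list_files` are hypothetical, the schema of cohort_info.json is not specified here (so it is loaded as-is), and no assumption is made about the file formats inside gene_data/ and trait_data/ beyond listing their file names.

```python
import json
from pathlib import Path


def _list_files(directory: Path) -> list:
    """Return sorted file names in `directory`, or an empty list if it is missing."""
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


def summarize_preprocessed(root: str) -> dict:
    """Map each trait subdirectory under `root` to its metadata and data files.

    Hypothetical helper: it only relies on the directory names described
    above (cohort_info.json, gene_data/, trait_data/).
    """
    summary = {}
    for trait_dir in sorted(Path(root).iterdir()):
        if not trait_dir.is_dir():
            continue
        info_path = trait_dir / "cohort_info.json"
        # The JSON schema is an unknown here, so the content is kept untouched.
        cohort_info = json.loads(info_path.read_text()) if info_path.exists() else {}
        summary[trait_dir.name] = {
            "cohort_info": cohort_info,
            "gene_files": _list_files(trait_dir / "gene_data"),
            "trait_files": _list_files(trait_dir / "trait_data"),
        }
    return summary
```

Calling `summarize_preprocessed("preprocessed")` from the repository root would return one entry per trait, which can help check which cohorts are available before running an analysis.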
- Clone the repository:
    git clone https://github.com/Liu-Hy/GenoTex.git
    cd GenoTex
- Install dependencies: Ensure you have the necessary Python packages installed. You can create a virtual environment and install dependencies using:
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
- Run code: Navigate to the code/ directory and execute the Jupyter notebooks corresponding to the trait and cohort of interest.
- Evaluate performance: Use eval.py to compare the performance of your automated method with the gold standard results provided.
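To give a sense of what comparing an automated method against gold standard results can involve, here is a minimal sketch of a set-overlap comparison between predicted and reference gene lists. It is an illustration only: the metrics and interface of eval.py are not described in this README, the function name `gene_set_scores` is hypothetical, and the gene symbols in the usage note are made-up inputs rather than benchmark results.

```python
def gene_set_scores(predicted, gold):
    """Precision, recall, and F1 of a predicted gene set against a gold set.

    Hypothetical helper for illustration; it is not the repository's eval.py.
    Duplicates are ignored because both inputs are treated as sets.
    """
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    # Guard against empty inputs to avoid division by zero.
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `gene_set_scores(["TP53", "BRCA1", "EGFR"], ["TP53", "EGFR", "MYC", "KRAS"])` finds two shared genes, giving a precision of 2/3, a recall of 1/2, and an F1 of 4/7. Refer to eval.py itself for the metrics actually used by the benchmark.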
We welcome contributions to enhance GenoTEX. Please fork the repository, create a new branch for your feature or bug fix, and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
If you use GenoTEX in your research, please cite our paper using the following BibTeX entry:
@article{liu2024genotex,
title={GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians},
author={Liu, Haoyang and Wang, Haohan},
journal={arXiv preprint arXiv:2406.15341},
year={2024}
}
This project is licensed under the Creative Commons (CC) license.