VarAnnotator is a bioinformatics pipeline designed to efficiently annotate genetic variants from VCF (Variant Call Format) files. The system fetches gene annotations, population frequencies, and dbSNP IDs, serving results through a RESTful API and an interactive web interface.
- VCF Processing: Parses VCF files and produces variant annotations. Currently accepts only input aligned to GRCh37.
- Population Frequency Data: Retrieves population frequency from databases such as gnomAD and 1000 Genomes.
- Gene Annotations: Annotates variants with gene and transcript data using Ensembl’s API (grch37.rest).
- Error Handling & Retries: Gracefully handles errors (e.g., HTTP 404, 500) and retries failed API requests with exponential backoff.
- Web Interface: Offers a Flask-based API for querying annotated variants.
- Dockerized: The project is containerized for easier deployment and execution in isolated environments.
- Python 3.11
- Docker (for containerized execution)
- Git
- Clone the Repository:
  git clone https://github.com/tmbogus/VarAnnotator.git
  cd VarAnnotator
- Install Dependencies:
  You can install dependencies using pip:
  pip install -r requirements.txt
  Alternatively, if you prefer Docker, you can build and run the container (see Docker instructions below).
To annotate variants from a VCF file, use the following command:
python scripts/annotate_variants_ensembl.py --input_vcf ./input/NIST.vcf --output_tsv ./output/NIST.annotated.tsv
This command will generate a TSV file containing the annotations.
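As a rough illustration of consuming that output, the sketch below reads a tab-separated file into one dictionary per variant using only the standard library. The column names `CHROM`, `POS`, and `Gene` are borrowed from the API's sortable columns and are assumed, not confirmed, to match the TSV header; `read_annotations` is a hypothetical helper, not part of VarAnnotator.

```python
import csv
import io


def read_annotations(handle):
    """Yield one dict per annotated variant from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row


# In-memory stand-in for open("./output/NIST.annotated.tsv", newline="").
# The header below is an assumption about the real file's columns.
sample = io.StringIO("CHROM\tPOS\tGene\n1\t12345\tGENE_A\n")
rows = list(read_annotations(sample))
print(rows[0]["Gene"])
```

All values come back as strings, so numeric columns such as `POS` would need explicit conversion before sorting or filtering.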
The annotate_variants_ensembl.py script supports several optional arguments to customize the annotation process:
- --batch_size
  Description: Number of variants per API request.
  Default: 25
  Example: --batch_size 50
- --max_workers
  Description: Number of worker threads for parallel processing.
  Default: 15
  Example: --max_workers 20
- --reqs_per_sec
  Description: Maximum number of API requests per second, used to stay within rate limits.
  Default: 15
  Example: --reqs_per_sec 10
- --target_populations
  Description: Target populations for frequency data.
  Default: ["gnomADe:NFE", "gnomADg:NFE", "1000GENOMES:phase_3:CEU"]
  Example: --target_populations gnomADe:AFR 1000GENOMES:phase_3:CHB
python scripts/annotate_variants_ensembl.py \
--input_vcf ./input/NIST.vcf \
--output_tsv ./output/NIST.annotated.tsv \
--batch_size 50 \
--max_workers 20 \
--reqs_per_sec 10 \
--target_populations gnomADe:AFR 1000GENOMES:phase_3:CHB
This example customizes the annotation process by adjusting the batch size, number of worker threads, request rate, and specifying different target populations for frequency data.
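For readers who want to see how such a command line could be parsed, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is not the script's actual parser: the argument types, the `required=True` settings, and the `build_parser` name are assumptions made for illustration.

```python
import argparse

# Defaults copied from the option list above.
DEFAULT_POPULATIONS = ["gnomADe:NFE", "gnomADg:NFE", "1000GENOMES:phase_3:CEU"]


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Annotate variants from a VCF file.")
    parser.add_argument("--input_vcf", required=True, help="Path to the input VCF")
    parser.add_argument("--output_tsv", required=True, help="Path to the output TSV")
    parser.add_argument("--batch_size", type=int, default=25,
                        help="Number of variants per API request")
    parser.add_argument("--max_workers", type=int, default=15,
                        help="Number of worker threads")
    parser.add_argument("--reqs_per_sec", type=int, default=15,
                        help="API requests per second")
    # nargs="+" accepts one or more space-separated population names.
    parser.add_argument("--target_populations", nargs="+", default=DEFAULT_POPULATIONS,
                        help="Populations to report frequencies for")
    return parser


args = build_parser().parse_args([
    "--input_vcf", "./input/NIST.vcf",
    "--output_tsv", "./output/NIST.annotated.tsv",
    "--batch_size", "50",
    "--target_populations", "gnomADe:AFR", "1000GENOMES:phase_3:CHB",
])
print(args.batch_size, args.max_workers, args.target_populations)
```

Options left off the command line fall back to the documented defaults, which is why `max_workers` is still 15 in this run.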
You can use the Flask web app to interact with the annotated variants:
python app.py
Once the server is running, you can access the web interface at http://localhost:5000 to query and explore the annotated variants.
To run VarAnnotator using Docker:
- Build the Docker Image:
  docker build -t varannotator .
- Run the Container:
  Map the input and output directories and expose the API port:
  docker run -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output -p 5000:5000 varannotator
  This command will:
  - Expose the Flask API on port 5000.
  - Map your input/ and output/ directories into the container for processing.
VarAnnotator provides a RESTful API to interact with the annotated variants.
- Method: GET
- Description: Retrieve a paginated list of annotated genetic variants.
Query Parameters:
- frequency: Filter variants by population frequency.
- depth: Filter variants by read depth.
- sort_column: Column to sort by (e.g., CHROM, POS, Gene).
- sort_order: Sort order (asc or desc).
- page: Page number for pagination (default: 1).
- per_page: Number of records per page (default: 20).
Example Request:
GET /variants?frequency=0.1&depth=10&sort_column=Gene&sort_order=asc
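The same request can be assembled programmatically. The sketch below builds the query string with the standard library, assuming the server runs at http://localhost:5000 as described earlier; `variants_url` is a hypothetical helper, and the response format is not shown here because it is documented in API.md.

```python
from urllib.parse import urlencode

# Assumed base URL of a locally running server.
BASE_URL = "http://localhost:5000"


def variants_url(**params):
    """Build a /variants URL, skipping parameters whose value is None."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{BASE_URL}/variants?{query}" if query else f"{BASE_URL}/variants"


url = variants_url(frequency=0.1, depth=10, sort_column="Gene",
                   sort_order="asc", page=1, per_page=20)
print(url)
# http://localhost:5000/variants?frequency=0.1&depth=10&sort_column=Gene&sort_order=asc&page=1&per_page=20
```

Passing the resulting URL to `urllib.request.urlopen` (or any HTTP client) would issue the request once the server is running.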
- Method: GET
- Description: Retrieve a specific variant by its ID.
For more details, see API.md.
VarAnnotator includes robust error handling for API calls:
- 404 Not Found: If the requested variant or endpoint does not exist, a 404 status with an appropriate error message is returned.
- 500 Internal Server Error: API calls that result in server errors (HTTP 500, 502, etc.) are automatically retried with exponential backoff.
- 429 Too Many Requests: The client respects rate limits and retries requests after a delay based on the Retry-After header.
The internal EnsemblRestClient is responsible for handling retries and rate limits:
- Retries: If an API call fails with a 500-level error, the system retries up to 5 times, with exponential backoff (2^n seconds).
- Rate Limiting: When encountering a 429 status, the Retry-After header is used to determine the delay before retrying.
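The retry policy described here can be sketched in a few lines. This is not the EnsemblRestClient code: `call_with_retries` and its `send` callable are hypothetical, and two details are assumptions, namely that n starts at 0 (giving delays of 1, 2, 4, 8, 16 seconds) and that 429 responses count toward the same limit of 5 retries.

```python
import time

MAX_RETRIES = 5  # documented limit: up to 5 retries after the first attempt


def call_with_retries(send, sleep=time.sleep, max_retries=MAX_RETRIES):
    """Call send() until it succeeds or retries are exhausted.

    send() must return (status, headers, body). 5xx responses wait 2**n
    seconds; 429 responses wait for the Retry-After value (1 s if absent).
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status == 429:
            delay = float(headers.get("Retry-After", 1))
        elif 500 <= status < 600:
            delay = 2 ** attempt  # assumption: n starts at 0
        else:
            return status, body  # success or a non-retryable status such as 404
        if attempt == max_retries:
            break
        sleep(delay)
    raise RuntimeError(f"giving up after {max_retries} retries (last status {status})")


# Simulated responses: a server error, a rate-limit reply, then success.
responses = iter([(500, {}, ""), (429, {"Retry-After": "3"}, ""), (200, {}, "ok")])
delays = []
result = call_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Injecting `sleep` as a parameter keeps the example fast and makes the computed delays easy to inspect, which is the same idea the test suite applies by mocking `time.sleep`.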
VarAnnotator comes with a comprehensive test suite. Tests cover key components of the pipeline, including error handling, VCF processing, and API endpoints.
Before running the test suite, ensure that PYTHONPATH is set so the tests can locate the VarAnnotator modules. While inside the project directory, run:
# Set PYTHONPATH with absolute paths and run the test suite
PYTHONPATH=$(pwd):$(pwd)/scripts python3 tests/test_suite.py
- API Interaction Tests: Ensure correct handling of API calls, particularly around error handling and retries.
- Mocking: Mocks are included for HTTPError, time.sleep, and other components to simulate error conditions and speed up testing.
- Unit and Functional Tests: Comprehensive tests cover VCF processing, population frequency retrieval, and Flask API endpoints.
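To show what this kind of mocking looks like in practice, the sketch below patches `time.sleep` and simulates failures with `HTTPError`. It does not reproduce the project's tests: `wait_and_retry` is a hypothetical stand-in for the code under test, and the use of `urllib.error.HTTPError` (rather than another library's exception of the same name) is an assumption.

```python
import time
import unittest
from unittest import mock
from urllib.error import HTTPError


def wait_and_retry(action, attempts=3):
    """Hypothetical helper: retry action(), sleeping 2**n seconds after each failure."""
    for n in range(attempts):
        try:
            return action()
        except HTTPError:
            if n == attempts - 1:
                raise
            time.sleep(2 ** n)


def server_error():
    """Build a simulated HTTP 500 error without touching the network."""
    return HTTPError("http://example.invalid", 500, "Server Error", None, None)


class RetryTest(unittest.TestCase):
    def test_sleep_is_mocked(self):
        # The action fails twice with HTTP 500, then succeeds.
        action = mock.Mock(side_effect=[server_error(), server_error(), "done"])
        with mock.patch("time.sleep") as fake_sleep:
            self.assertEqual(wait_and_retry(action), "done")
        # No real waiting happened, but the requested delays were recorded.
        self.assertEqual([call.args[0] for call in fake_sleep.call_args_list], [1, 2])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RetryTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because `time.sleep` is replaced for the duration of the `with` block, the test finishes immediately while still verifying the backoff schedule.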
.
├── app.py  # Flask app for the web interface
├── input  # Directory for input VCF files
│   └── NIST.vcf  # Example input VCF file
├── output  # Directory for annotated output
│   ├── NIST.annotated.tsv  # Annotated output file
│   ├── pipeline_status.json  # Pipeline status file
│   ├── annotated_variants.tsv  # Main output file
│   ├── snakemake.log  # Snakemake log file
│   └── logs
│       └── annotate_variants_NIST.log  # Log file for annotation
├── static  # Static assets for web interface
│   └── index.html
├── tests  # Test suite
│   ├── test_annotate_variants.py
│   ├── test_app.py
│   ├── test_ensembl_client.py
│   └── test_suite.py  # Main entry point for running all tests
├── scripts
│   ├── annotate_variants_ensembl.py  # Main script for annotating VCF files
│   ├── ensembl_client.py  # Ensembl API client
│   └── Snakefile  # Snakemake workflow for pipeline orchestration
├── Dockerfile  # Docker configuration file
├── entrypoint.sh  # Entrypoint script for Docker container
├── requirements.txt  # Python dependencies
├── LICENSE  # License information
├── README.md  # Project documentation (this file)
├── API.md  # API documentation
└── logs  # Additional logs directory
    ├── detailed_logs.log
    └── api_times_log.tsv
We welcome contributions to VarAnnotator! If you’d like to contribute, follow these steps:
- Fork the repository on GitHub.
- Create a feature branch for your new feature or bug fix.
- Submit a pull request with a detailed explanation of your changes.
Please ensure that your contributions adhere to the existing coding style and that all tests pass before submitting your pull request.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.