BLAST_Ripper-Meta is an integrated pipeline designed to perform parallel BLAST searches on multiple FASTA files, process the results by adding taxonomic information, and generate visualizations of taxonomic distributions.
This script is optimized for efficient analysis of large sequencing datasets and offers the following key features:
- Parallel BLAST Processing: Utilizes multiprocessing to process multiple FASTA files in parallel, enhancing speed and efficiency.
- Comprehensive Taxonomic Integration: Incorporates detailed taxonomic information by parsing local taxonomy databases.
- Advanced Result Parsing and Visualization: Parses BLAST results to generate various taxonomic distribution visualizations.
- Performance Monitoring and Logging: Monitors memory and CPU usage throughout the process and records detailed logs.
- Memory Usage Optimization: Implements memory-efficient algorithms for handling large-scale data processing.
- Parallel BLAST Processing: Splits input FASTA files into multiple chunks and performs BLAST searches in parallel.
- Taxonomic Information Augmentation: Parses local taxonomy databases (names.dmp, nodes.dmp) to enrich sequences with taxonomic lineage information.
- Result Parsing and Filtering: Parses BLAST results and filters matches based on user-defined coverage and identity thresholds.
- Visualization Generation: Creates various visualizations of taxonomic distributions (sunburst plots, heatmaps, etc.).
- Report Generation: Produces detailed HTML and text reports summarizing the analysis results.
- Performance Monitoring: Monitors memory and CPU usage during processing and logs performance metrics.
The following Python version and packages are required:
- Python 3.6 or higher
- Biopython
- Matplotlib
- NumPy
- Pandas
- Seaborn
- tqdm
- NetworkX
- psutil
Using conda:
conda create -n blast_ripper_env python=3.8
conda activate blast_ripper_env
conda install -c conda-forge biopython matplotlib numpy pandas seaborn tqdm networkx psutil
Using pip:
pip install biopython matplotlib numpy pandas seaborn tqdm networkx psutil
python blast_ripper_meta.py -in <input_folder> -out <output_folder> -db <BLAST_database_path> -taxdb <taxonomy_database_path> [options]
- -in, --input_folder: Path to the input folder containing FASTA files.
- -out, --output_folder: Path to the output folder where results will be stored.
- -db, --db: Path to the BLAST database (e.g., the nt database).
- -taxdb, --taxdb_dir: Path to the taxonomy database containing names.dmp and nodes.dmp files.
- -t, --threads: Number of threads per BLAST process (default: 4).
- -n, --num_chunks: Number of chunks to split each input file into (default: 8).
- -shd, --use-ramdisk: Use /dev/shm as temporary storage to reduce disk I/O.
- -taxids: List of taxids to filter results (optional).
- -qcov, --qcov_threshold: Query coverage threshold for considering a match (default: 80%).
- -identity, --identity_threshold: Identity threshold (default: 90%).
- -h, --help: Show help message and exit.
python blast_ripper_meta.py -in input_folder -out output_folder -db /path/to/nt -taxdb /path/to/taxonomy -t 8 -n 16 -qcov 85 -identity 95
The script scans the input folder for all files ending with .fasta or .fa and processes each one. Each file is split into the specified number of chunks and processed in parallel using multiprocessing.
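The chunking step can be illustrated with a minimal sketch. It is not the script's own code: the function names `read_fasta` and `split_into_chunks` are hypothetical, the round-robin distribution is an assumption (the script may split by position or by size instead), and only the standard library is used rather than Biopython.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    seq_parts: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def split_into_chunks(records: Iterable[Record], num_chunks: int) -> List[List[Record]]:
    """Distribute records round-robin; empty chunks are dropped."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(num_chunks)]
    for i, record in enumerate(records):
        chunks[i % num_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


example = """>seq1
ACGT
ACGT
>seq2
GGCC
>seq3
TTAA
""".splitlines()

chunks = split_into_chunks(read_fasta(example), num_chunks=2)
print([[header for header, _ in chunk] for chunk in chunks])
# prints [['seq1', 'seq3'], ['seq2']]
```

Each resulting chunk could then be written to its own temporary FASTA file and handed to a worker process.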
A BLAST search is performed on each chunk, and the results are saved in TSV format. The -taxids option allows you to filter results by specific taxids.
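As a rough sketch of what a per-chunk BLAST call might look like, the helper below builds a command line for BLAST+. Several things here are assumptions rather than documented behavior of the script: the use of `blastn` (the README does not name the BLAST program), the column list in `OUTFMT_COLUMNS`, the hypothetical function name `build_blastn_command`, and the choice to pass taxids to BLAST itself via its `-taxids` option (available in newer BLAST+ releases with version 5 databases) instead of filtering afterwards.

```python
from typing import List, Optional, Sequence

# Assumed tabular columns; the script's real output layout may differ.
OUTFMT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length",
    "qcovs", "evalue", "bitscore", "staxids",
]


def build_blastn_command(
    query: str,
    db: str,
    out: str,
    threads: int = 4,
    taxids: Optional[Sequence[int]] = None,
) -> List[str]:
    """Build an argument list for one chunk (hypothetical helper)."""
    cmd = [
        "blastn",
        "-query", query,
        "-db", db,
        "-out", out,
        # Format 6 is tab-separated output with the listed columns.
        "-outfmt", "6 " + " ".join(OUTFMT_COLUMNS),
        "-num_threads", str(threads),
    ]
    if taxids:
        # BLAST+ expects a comma-delimited list of taxonomy IDs.
        cmd += ["-taxids", ",".join(str(t) for t in taxids)]
    return cmd


cmd = build_blastn_command("chunk_0.fasta", "/path/to/nt", "chunk_0.tsv",
                           threads=8, taxids=[9606, 10090])
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list and passing it to `subprocess.run` without a shell avoids quoting problems with paths that contain spaces.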
BLAST results are parsed, and matches meeting the specified query coverage and identity thresholds are selected. The best match for each sequence is retained.
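The filtering and best-match selection described above could be sketched as follows. The column layout in `COLUMNS`, the hypothetical function name `best_hits`, the inclusive thresholds, and the use of the highest bit score as the definition of "best" are all assumptions; the script may rank matches differently.

```python
import csv
from typing import Dict, Iterable

# Assumed column order of the tabular BLAST output.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "qcovs", "evalue", "bitscore", "staxids"]


def best_hits(
    tsv_lines: Iterable[str],
    qcov_threshold: float = 80.0,
    identity_threshold: float = 90.0,
) -> Dict[str, Dict[str, str]]:
    """Keep, per query, the passing hit with the highest bit score."""
    best: Dict[str, Dict[str, str]] = {}
    reader = csv.DictReader(tsv_lines, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["qcovs"]) < qcov_threshold:
            continue
        if float(row["pident"]) < identity_threshold:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


rows = [
    "q1\ts1\t99.0\t150\t95\t1e-50\t250\t9606",
    "q1\ts2\t97.0\t150\t90\t1e-40\t200\t9598",
    "q2\ts3\t85.0\t150\t95\t1e-20\t120\t10090",  # fails the identity threshold
    "q3\ts4\t95.0\t150\t60\t1e-30\t180\t562",    # fails the coverage threshold
]
hits = best_hits(rows)
print({query: hit["sseqid"] for query, hit in hits.items()})
# prints {'q1': 's1'}
```

Queries with no passing hit (q2 and q3 here) are absent from the result, which is one way a "match status" column could be derived later.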
The script parses the local taxonomy database (names.dmp and nodes.dmp) to build detailed taxonomic information for each taxid. This includes species, genus, family, order, and other lineage information, which are added to the sequence matches.
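A minimal sketch of lineage construction from the NCBI taxonomy dump format is shown below. The function names (`parse_nodes`, `parse_names`, `lineage`) are hypothetical and the script's own data structures are not documented here. The sketch assumes the standard dump layout, in which fields are separated by a tab, a pipe, and a tab, and it keeps only names whose class is "scientific name". The sample lines are a toy hierarchy, not real taxonomy content.

```python
from typing import Dict, Iterable, List, Tuple


def _fields(line: str) -> List[str]:
    """Split one .dmp line on the tab-pipe-tab field separator."""
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")


def parse_nodes(lines: Iterable[str]) -> Tuple[Dict[int, int], Dict[int, str]]:
    """Return (taxid -> parent taxid, taxid -> rank) from nodes.dmp lines."""
    parents: Dict[int, int] = {}
    ranks: Dict[int, str] = {}
    for line in lines:
        fields = _fields(line)
        if len(fields) < 3:
            continue
        taxid = int(fields[0])
        parents[taxid] = int(fields[1])
        ranks[taxid] = fields[2]
    return parents, ranks


def parse_names(lines: Iterable[str]) -> Dict[int, str]:
    """Return taxid -> scientific name from names.dmp lines."""
    names: Dict[int, str] = {}
    for line in lines:
        fields = _fields(line)
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage(taxid: int, parents: Dict[int, int], ranks: Dict[int, str],
            names: Dict[int, str]) -> Dict[str, str]:
    """Walk up to the root, collecting a name for every ranked ancestor."""
    result: Dict[str, str] = {}
    seen = set()  # the root is its own parent, so guard against looping
    while taxid in parents and taxid not in seen:
        seen.add(taxid)
        rank = ranks.get(taxid, "no rank")
        if rank != "no rank":
            result[rank] = names.get(taxid, str(taxid))
        taxid = parents[taxid]
    return result


# Toy data in the dump layout (simplified hierarchy for illustration only).
nodes_lines = [
    "1\t|\t1\t|\tno rank\t|",
    "9605\t|\t1\t|\tgenus\t|",
    "9606\t|\t9605\t|\tspecies\t|",
]
names_lines = [
    "1\t|\troot\t|\t\t|\tscientific name\t|",
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|",
]
parents, ranks = parse_nodes(nodes_lines)
names = parse_names(names_lines)
print(lineage(9606, parents, ranks, names))
# prints {'species': 'Homo sapiens', 'genus': 'Homo'}
```

Loading both files into dictionaries once and reusing them for every taxid keeps lookups cheap, at the cost of holding the full taxonomy in memory.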
For each input file, the final results are saved in TSV format, including sequence ID, match status, and taxonomic information.
Based on the analyzed data, various visualizations are generated:
- Sunburst Plot: Visualizes the distribution according to taxonomic hierarchy.
- Heatmap: Compares diversity metrics across samples.
- Network Diagram: Illustrates taxonomic relationships.
- Diversity Metrics Bar Graph: Visualizes species richness, Shannon index, Simpson index, etc.
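The diversity metrics named in the list above have several conventions, so the sketch below states its choices explicitly. These are assumptions, not necessarily what the script computes: the Shannon index uses the natural logarithm, and the Simpson index is reported in its "1 - D" (Gini-Simpson) form. The function name `diversity_metrics` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def diversity_metrics(species_labels: Iterable[str]) -> Dict[str, float]:
    """Compute richness, Shannon (natural log) and Gini-Simpson indices."""
    counts = Counter(species_labels)
    total = sum(counts.values())
    if total == 0:
        return {"richness": 0, "shannon": 0.0, "simpson": 0.0}
    proportions = [count / total for count in counts.values()]
    shannon = sum(-p * math.log(p) for p in proportions)
    simpson = 1.0 - sum(p * p for p in proportions)
    return {"richness": len(counts), "shannon": shannon, "simpson": simpson}


# Two species with equal abundance: Shannon = ln(2), Simpson = 0.5.
metrics = diversity_metrics(["E. coli", "E. coli", "B. subtilis", "B. subtilis"])
print(metrics["richness"], round(metrics["shannon"], 4), metrics["simpson"])
# prints 2 0.6931 0.5
```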
Detailed HTML and text reports summarizing the analysis and providing comprehensive statistics are generated.
- <output_folder>/: Output directory
  - <input_file_name>/: Subdirectory for each input file
    - *_combined.tsv: Combined BLAST results
    - *_final_results.tsv: Final results file
  - visualizations/: Directory containing visualization images
    - sunburst_plot.png
    - diversity_heatmap.png
    - taxonomy_network.png
    - Other visualization images
  - diversity_metrics.json: File containing diversity metrics
  - summary_report.txt: Summary report
  - detailed_report.tsv: Detailed report
  - final_report.html: Final HTML report
  - analysis_report.html: Report index page
  - *.log: Log files
The script monitors memory and CPU usage throughout the process and records detailed logs. Log files are saved in the output directory with a timestamp.
- BLAST Database: Ensure that the latest BLAST database is installed locally before running the script.
- Taxonomy Database: A taxonomy database containing names.dmp and nodes.dmp is required.
- System Resources: Due to the processing of large datasets, it's recommended to run the script on a system with sufficient memory and CPU cores.
- Using RAM Disk: The -shd option uses /dev/shm as temporary storage to reduce disk I/O. Ensure that your system has enough free memory to accommodate this.
- Dependency Errors: Ensure all required Python packages are installed.
- Memory Issues: If you encounter memory errors during processing, consider reducing the number of chunks using the -n option or avoiding the RAM disk.
- BLAST Errors: If errors occur during BLAST execution, verify the BLAST database path and check for appropriate permissions.
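Regarding the RAM disk advice for the -shd option, one way to check beforehand whether /dev/shm has enough room is sketched below. This is an illustration only: the function name `pick_temp_dir` and the fallback to the system temporary directory are assumptions, and the script itself may not perform such a check.

```python
import shutil
import tempfile


def pick_temp_dir(required_bytes: int, ramdisk: str = "/dev/shm") -> str:
    """Return the RAM disk path if it has enough free space, else a disk path."""
    try:
        free = shutil.disk_usage(ramdisk).free
    except OSError:
        # The RAM disk does not exist or is not accessible on this system.
        return tempfile.gettempdir()
    return ramdisk if free >= required_bytes else tempfile.gettempdir()


# Example: require roughly 2 GiB for temporary chunk files.
print(pick_temp_dir(2 * 1024**3))
```

A reasonable estimate for `required_bytes` is the size of the input FASTA file plus the expected size of the intermediate BLAST output.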
For questions or suggestions regarding the script, please contact the maintainer.
This script is distributed under the MIT License.
Thank you for using BLAST_Ripper-Meta!