A simple tool to extract reads from specific taxonomic groups BAM files
We recommend having conda installed to manage the virtual environments
First, we create a conda virtual environment with:
wget https://raw.githubusercontent.com/genomewalker/get-reads-taxonomy/master/environment.yml
conda env create -f environment.yml
Then we proceed to install using pip:
pip install get-reads-taxonomy
mamba install -c conda-forge -c bioconda -c genomewalker get-reads-taxonomy
Using pip
pip install git+ssh://git@github.com/genomewalker/get-reads-taxonomy.git
By cloning in a dedicated conda environment
git clone git@github.com:genomewalker/get-reads-taxonomy.git
cd get-reads-taxonomy
conda env create -f environment.yml
conda activate get-reads-taxonomy
pip install -e .
getRTax will take a BAM file and a taxonomy TSV file and extract the reads that map to each of the selected taxa. One can select a list of taxa and ranks to extract the reads from.
For a complete list of options:
$ getRTax --help
usage: getRTax [-h] -b BAM [-p PREFIX] [--only-read-ids] [--combine] [--unique] -T
TAXONOMY_FILE -r RANK [-M SORT_MEMORY] [-t THREADS] [--chunk-size CHUNK_SIZE]
[--debug] [--version]
A simple tool to extract damaged reads from BAM files
optional arguments:
-h, --help show this help message and exit
required arguments:
-b BAM, --bam BAM The BAM file used to generate the metaDMG results (default: None)
optional arguments:
-p PREFIX, --prefix PREFIX
Prefix used for the output files (default: None)
--only-read-ids Only output read IDs instead of the full read (default: False)
--combine Combine reads from different taxonomic groups into one file (default:
False)
--unique Only output unique mapping reads in the case of reads mapping to
multiple references (default: False)
-T TAXONOMY_FILE, --taxonomy-file TAXONOMY_FILE
A file containing the taxonomy of the BAM references in the format
d__;p__;c__;o__;f__;g__;s__. (default: None)
-r RANK, --rank RANK Which taxonomic group and rank we want to get the reads extracted.
(default: None)
-M SORT_MEMORY, --sort-memory SORT_MEMORY
Set maximum memory per thread for sorting; suffix K/M/G recognized
(default: 1G)
-t THREADS, --threads THREADS
Number of threads (default: 1)
--chunk-size CHUNK_SIZE
Chunk size for parallel processing (default: None)
--debug Print debug messages (default: False)
--version Print program version
One would run getRTax
as:
getRTax --bam MED-2021-28-ver15-2LFQY-210811_S25.dedup.bam -T hires-organelles-viruses-smags.tax.tsv -r '{"domain":["d__Bacteria", "d__Archaea", "d__Viruses", "d__Eukaryota"]}' --threads 8 --unique
Note: The final number number of reads might not correspond to the number of reads in the BAM file. The reason is that if you are allowing multiple alignments for each read, the reads might be mapped to multiple references. In this case, the reads will be counted multiple times, for example, a read might map to a certain references, but also map to a reference that might be discarded. In this case, the read will be counted twice, once for the reference that is not discarded and once for the reference that is discarded. If you want to avoid this, you can use the
--unique
option, which will only count the read once.
To be able to extract reads from specific taxa and/or ranks, one needs to provide a taxonomy file. This file should be a TSV file with the following format:
ACCESSION\td__Bacteria;l__Bacteria;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Yersinia;s__Yersinia pestis
ACCESSION
is the reference accession in the BAM file. The taxonomy is separated by ;
and the taxonomic groups are separated by __
. The taxonomic groups recognized by getRTax
in --taxonomy-file
and --rank
are:
- domain:
d__
- lineage:
l__
- kingdom:
k__
- phylum:
p__
- class:
c__
- order:
o__
- family:
f__
- genus:
g__
- species:
s__
Note: The taxonomic groups are case sensitive and one can include as many as desired. For example, if one wants to extract the reads from the genus Yersinia and the class Bacilli, one would use
--rank '{"genus": "Yersinia", "class":"Bacilli"}
.