This tool identifies ancient reads, given a file of known ancient kmers. It does so in the following steps:
- Build an ancient_kmers.bloom filter from an ancient kmers text file (if such a Bloom filter does not yet exist).
- For a set of input reads:
  - Save those reads which have 2 consecutive kmer matches against ancient_kmers.bloom
  - Kmerize the saved reads to generate a new set of ancient kmers, called "anchor kmers"
- For the same set of input reads, identify matches against anchor kmers and classify each read with >50% matches as an ancient read.
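The steps above can be illustrated with a minimal sketch. It is not the code in akmerbroom.py: a plain Python set stands in for the Bloom filter, the kmer size `K = 5` and all function names are hypothetical, reverse complements are ignored, and treating a proportion of exactly 0.5 as passing is an assumption.

```python
from typing import Iterator, List, Set, Tuple

K = 5  # hypothetical kmer size, kept small so the toy example is readable


def kmerize(seq: str, k: int = K) -> Iterator[str]:
    """Yield every overlapping kmer of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def has_consecutive_matches(seq: str, known: Set[str], k: int = K, run: int = 2) -> bool:
    """Return True if `run` kmers at adjacent positions are all in `known`."""
    streak = 0
    for kmer in kmerize(seq, k):
        streak = streak + 1 if kmer in known else 0
        if streak >= run:
            return True
    return False


def anchor_proportion(seq: str, anchors: Set[str], k: int = K) -> float:
    """Fraction of the read's kmers that are anchor kmers."""
    kmers = list(kmerize(seq, k))
    if not kmers:
        return 0.0
    return sum(kmer in anchors for kmer in kmers) / len(kmers)


def classify(reads: List[str], ancient_kmers: Set[str],
             k: int = K, threshold: float = 0.5) -> List[Tuple[str, float]]:
    # Pass 1: keep reads with 2 consecutive matches against the known ancient kmers.
    seeds = [r for r in reads if has_consecutive_matches(r, ancient_kmers, k)]
    # Kmerize the saved reads to build the anchor kmer set.
    anchors = {kmer for r in seeds for kmer in kmerize(r, k)}
    # Pass 2: score every input read against the anchor kmers.
    results = []
    for read in reads:
        proportion = anchor_proportion(read, anchors, k)
        if proportion >= threshold:  # boundary handling is an assumption
            results.append((read, proportion))
    return results


ancient = {"ACGTA", "CGTAC"}
reads = ["ACGTACGG", "TTTTTTTT", "GTACGGAA"]
print(classify(reads, ancient))
```

In this toy input the third read has no kmer in the original ancient set, yet it shares kmers with the first read and is therefore picked up in the second pass. This is the effect the anchor kmers are meant to achieve.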
# Use the ancient kmers bloom filter provided
python akmerbroom.py --ancient_bloom
or
# Use an ancient kmers text file
python akmerbroom.py --ancient_kmers_set
The data/ folder should contain the following input files:
ancient_kmers.bloom : a bloom filter with ancient kmers
unknown_reads.fastq : a file with reads which we want to classify as ancient or not
[optional] ancient_kmers : a text file where each row is a known ancient kmer
The output/ folder should contain the following output files:
annotated_reads.fastq # intermediate output
annotated_reads_with_anchor_kmers.fastq # final output
The final output file has the following 4 fields in each record header:
SeqId, ReadLen, isConsecutiveMatchFound, AnchorProportion
By default, reads with AnchorProportion >= 0.5 (i.e. 50%) are chosen as ancient reads.
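To apply a different cut-off after a run, the header fields can be read back from the final output file. The sketch below assumes the four fields appear in the order listed and are separated by whitespace; the actual delimiter is not specified here, so check a real header first. The names `ReadAnnotation`, `parse_header` and `is_ancient` are hypothetical.

```python
from typing import NamedTuple


class ReadAnnotation(NamedTuple):
    seq_id: str
    read_len: str            # kept as text; the exact encoding is not assumed
    consecutive_match: str   # kept as text for the same reason
    anchor_proportion: float


def parse_header(header: str) -> ReadAnnotation:
    """Split a FASTQ header into the four annotated fields.

    Assumes whitespace-separated fields (an assumption, not confirmed).
    Splitting from the right keeps a SeqId that itself contains spaces intact.
    """
    if header.startswith("@"):
        header = header[1:]
    parts = header.strip().rsplit(None, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 header fields, got {len(parts)}: {header!r}")
    seq_id, read_len, consecutive_match, proportion = parts
    return ReadAnnotation(seq_id, read_len, consecutive_match, float(proportion))


def is_ancient(header: str, threshold: float = 0.5) -> bool:
    """Apply the AnchorProportion >= threshold rule to one header line."""
    return parse_header(header).anchor_proportion >= threshold


print(parse_header("@read1 75 True 0.62"))
print(is_ancient("@read1 75 True 0.62", threshold=0.7))
```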
pip install biopython
pip install cython
pip install pybloomfiltermmap3
The tests/ folder contains a test dataset consisting of aOral data @SRR13355797 mixed with non-aOral data @ERR671934.
To run a test, use the following steps:
First, link the test dataset in the input data/ folder:
cd data/
ln -sf ../tests/unknown_reads.fastq .
Next, download the Bloom filter into the data/ folder from the following Google Drive link. Note that it could take a few minutes (file size = 3 GB).
This can be done from the command line using the gdown utility.
cd data/ # if you are not already in the data/ directory
pip install gdown
gdown --id 16-7N6l_FwxCG5UDdR8cP7tvVhjG55mtf
Finally, run aKmerBroom:
cd ../ # if you are not already in the main directory
python akmerbroom.py --ancient_bloom
The ancient reads file will be written to output/annotated_reads_with_anchor_kmers.fastq.
The majority of output reads should be from the aOral sample @SRR13355797, with a few false positives from non-aOral @ERR671934.
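One way to check this expectation is to count output reads by source accession. The sketch below assumes that each read ID starts with its run accession and that the file uses standard 4-line FASTQ records; `count_by_accession` is a hypothetical helper, not part of aKmerBroom.

```python
from collections import Counter
from typing import Iterable

ACCESSIONS = ("SRR13355797", "ERR671934")


def count_by_accession(lines: Iterable[str]) -> Counter:
    """Count FASTQ records per accession prefix.

    Only every 4th line is treated as a header, so quality strings that
    happen to start with '@' are not miscounted.
    """
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 0:
            continue
        read_id = line[1:]  # drop the leading '@'
        label = next((a for a in ACCESSIONS if read_id.startswith(a)), "other")
        counts[label] += 1
    return counts


# Usage on the real output file:
# with open("output/annotated_reads_with_anchor_kmers.fastq") as fh:
#     print(count_by_accession(fh))
```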