Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing for fast random access to any subsequence in the indexed FASTA file, while loading a minimal amount of the file into memory.
Pyfaidx provides an interface for creating and using this index for fast random access of DNA subsequences from huge FASTA files in a "pythonic" manner. Indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster (benchmark). For example:
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta") # set strict_bounds=True for bounds checking
Acts like a dictionary.
>>> genes.keys()
['NR_104215.1', 'KF435150.1', 'NM_001282548.1', 'NM_001282549.1',
'XM_005249644.1', 'NM_001282543.1', 'NR_104216.1', 'XM_005265508.1',
'XR_241079.1', 'AB821309.1', 'XM_005249645.1', 'XR_241081.1',
'XM_005249643.1', 'XM_005249642.1', 'NM_001282545.1', 'NR_104212.1',
'XR_241080.1', 'XM_005265507.1', 'KF435149.1', 'NM_000465.3']
>>> genes['NM_001282543.1'][200:230]
>NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG
>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'
>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1:201-230'
>>> genes['NM_001282543.1'][200:230].start
201
>>> genes['NM_001282543.1'][200:230].end
230
>>> len(genes['NM_001282543.1'])
5466
Slices just like a string:
>>> genes['NM_001282543.1'][200:230][:10]
>NM_001282543.1:201-210
CTCGTTCCGC
>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
>>> genes['NM_001282543.1'][200:230][::3]
>NM_001282543.1:201-230
CGCCCCTACA
>>> genes['NM_001282543.1'][:]
>NM_001282543.1:1-5466
CCCCGCCCCT........
- Start and end coordinates are 0-based, just like Python.
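The record names printed in the examples above (for instance 201-230 for the slice [200:230]) are 1-based and inclusive, while the slice itself is 0-based and end-exclusive. A minimal sketch of that conversion follows; slice_to_region is a hypothetical helper written for illustration and is not part of pyfaidx.

```python
def slice_to_region(name, start, end):
    """Format a 0-based, end-exclusive slice as a 1-based, inclusive region.

    Hypothetical helper for illustration only; not a pyfaidx function.
    """
    # The Python slice [start:end) covers bases start+1 .. end when
    # positions are counted from 1.
    return "{0}:{1}-{2}".format(name, start + 1, end)


region = slice_to_region("NM_001282543.1", 200, 230)
print(region)  # NM_001282543.1:201-230
```

Only the start shifts by one; the exclusive end of a 0-based slice already equals the inclusive end in 1-based terms.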
Complements and reverse complements just like DNA
>>> genes['NM_001282543.1'][200:230].complement
>NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC
>>> genes['NM_001282543.1'][200:230].reverse
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG
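To make the three operations above concrete, here is a plain-Python sketch of complement, reverse, and reverse complement on an ordinary string. It assumes Python 3 and unambiguous bases only (no IUPAC codes); the function names are illustrative and this is not how pyfaidx implements them.

```python
# Translation table for unambiguous DNA bases (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(seq)[::-1]


seq = "CTCGTTCCGCGCCCGCCATGGAACCGGATG"
print(complement(seq))          # GAGCAAGGCGCGGGCGGTACCTTGGCCTAC
print(seq[::-1])                # GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
print(reverse_complement(seq))  # CATCCGGTTCCATGGCGGGCGCGGAACGAG
```

The three printed strings correspond to the .complement, .reverse, and unary minus results shown above.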
Custom key functions provide cleaner access:
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
>>> genes['NR_104212'][:10]
>NR_104212:1-10
CCCCGCCCCT
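A key function is just a callable that maps each sequence name to the key used for lookup. The sketch below applies the same lambda logic to a few of the names listed earlier, without touching pyfaidx; strip_version is an illustrative name, not a library function.

```python
def strip_version(name):
    """Drop the accession version suffix, e.g. 'NM_000465.3' -> 'NM_000465'."""
    return name.split('.')[0]


names = ['NR_104212.1', 'NM_000465.3', 'KF435149.1']
keys = [strip_version(n) for n in names]
print(keys)  # ['NR_104212', 'NM_000465', 'KF435149']

# Caveat: two versions of one accession collapse to the same key.
collides = strip_version('NM_000465.2') == strip_version('NM_000465.3')
print(collides)  # True
```

Whether colliding keys matter depends on the FASTA file; it is worth checking that the function keeps keys unique before relying on it.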
Or just get a Python string:
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
>>> genes
Fasta("tests/data/genes.fasta", as_raw=True)
>>> genes['NM_001282543.1'][200:230]
CTCGTTCCGCGCCCGCCATGGAACCGGATG
It also provides a command-line script:
$ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
>NM_001282543.1
CTCGTTCCGC
>NM_001282543.1
GTAATTGTGTAAGTGACTGCA
$ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1
GAGCAAGGCG
$ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1
CGCCTTGCTC
$ faidx tests/data/genes.fasta NM_001282543.1
>NM_001282543.1
CCCCGCCCCT........
$ faidx tests/data/genes.fasta --list regions.txt
...
The syntax is similar to that of samtools faidx.
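Regions on the command line, and in a list file with one region per line, take the form name:start-end or just name. As an illustration of that format only, the sketch below splits a region string into its parts; parse_region is a hypothetical helper and is not taken from the faidx script.

```python
def parse_region(region):
    """Split 'name:start-end' into (name, start, end).

    Hypothetical helper for illustration. start and end are 1-based
    integers, or None when the region is only a sequence name.
    """
    # Split on the LAST colon so names that contain ':' stay intact.
    name, sep, span = region.rpartition(":")
    if sep and "-" in span:
        start, _, end = span.partition("-")
        if start.isdigit() and end.isdigit():
            return name, int(start), int(end)
    # No coordinate suffix: treat the whole string as the name.
    return region, None, None


print(parse_region("NM_001282543.1:201-210"))  # ('NM_001282543.1', 201, 210)
print(parse_region("NM_001282543.1"))          # ('NM_001282543.1', None, None)
print(parse_region("EM_PHG:V01146:1-10"))      # ('EM_PHG:V01146', 1, 10)
```

Splitting from the right is a design choice made here so that names such as EM_PHG:V01146 are not mistaken for a name plus coordinates.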
A lower-level Faidx class is also available:
>>> from pyfaidx import Faidx
>>> fa = Faidx('T7.fa') # can return str with as_raw=True
>>> fa.build('T7.fa', 'T7.fa.fai')
>>> fa.index
{'EM_PHG:V01146': {'lenc': 60, 'lenb': 61, 'rlen': 39937, 'offset': 40571}, 'EM_PHG:GU071091': {'lenc': 60, 'lenb': 61, 'rlen': 39778, 'offset': 74}}
>>> fa.fetch('EM_PHG:V01146', 1, 10)
>EM_PHG:V01146
TCTCACAGTG
>>> fa.fetch('EM_PHG:V01146', 100, 120)
>EM_PHG:V01146
GGTTGGGGATGACCCTTGGGT
- If the FASTA file is not indexed, when Faidx is initialized the rebuild_index() method will automatically run, and the index will be written to "filename.fa.fai" with write_fai(), where "filename.fa" is the original FASTA file.
- Start and end coordinates are 1-based.
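The index entries shown above are what make random access cheap. Assuming the usual samtools .fai meaning of the fields (offset: byte position of the first base, lenc: bases per line, lenb: bytes per line including the newline, rlen: sequence length), the byte position of any base is simple arithmetic. The sketch below is a standalone illustration on a tiny temporary file, not the pyfaidx implementation; byte_position and fetch are hypothetical names.

```python
import os
import tempfile


def byte_position(entry, pos):
    """Byte offset in the file of 1-based base `pos`, given one index entry."""
    zero_based = pos - 1
    # How many complete lines precede the base, and its column in its line.
    full_lines, column = divmod(zero_based, entry["lenc"])
    return entry["offset"] + full_lines * entry["lenb"] + column


def fetch(path, entry, start, end):
    """Read bases start..end (1-based, inclusive) by seeking, not scanning."""
    first = byte_position(entry, start)
    last = byte_position(entry, end)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the newline bytes that fall inside the requested span.
    return chunk.replace(b"\n", b"").decode("ascii")


# Tiny demo file: one record, 4 bases per line, 5 bytes per line with "\n".
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">demo\nACGT\nTTGG\nCA\n")

# Index entry written by hand for the demo file (header ">demo\n" is 6 bytes).
entry = {"rlen": 10, "offset": 6, "lenc": 4, "lenb": 5}

result = fetch(tmp.name, entry, 3, 6)
print(result)  # GTTT
os.remove(tmp.name)
```

Only the bytes between the two computed positions are read, which is why the size of the FASTA file has little effect on retrieval time.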
This package is tested under Python 3.3, 3.2, 2.7, 2.6, and pypy.
pip install pyfaidx or python setup.py install
"samtools faidx" compatible FASTA indexing in pure python.
usage: faidx [-h] [-l LIST] [-n] [--complement] [--reverse]
             fasta [regions [regions ...]]

Fetch sequence from faidx-indexed FASTA

positional arguments:
  fasta                 FASTA file
  regions               space separated regions of sequence to fetch e.g.
                        chr1:1-1000

optional arguments:
  -h, --help            show this help message and exit
  -l LIST, --list LIST  list of regions, one per line
  -n, --name            print sequence names. default: True
  --complement          complement the sequence. default: False
  --reverse             reverse the sequence. default: False
New in version 0.2.5:
- Fasta and Faidx can take default_seq in addition to as_raw, key_function, and strict_bounds parameters.
- Fixed issue #20
- Faidx has attribute raw_index which is a list representing the fai file.
- Faidx has rebuild_index and write_fai functions for building and writing raw_index to file.
- Extra test cases, and test cases against Biopython SeqIO
New in version 0.2.4:
- Faidx index order is stable and non-random
New in version 0.2.3:
- Fixed a bug affecting Python 2.6
New in version 0.2.2:
- Fasta can receive the strict_bounds argument
New in version 0.2.1:
- FastaRecord str attribute returns a string
- Fasta is now an iterator
New in version 0.2.0:
- as_raw keyword arg for Faidx and Fasta allows a simple string return type
- __str__ method for FastaRecord returns entire contig sequence
New in version 0.1.9:
- line wrapping of faidx is set based on the wrapping of the indexed fasta file
- added --reverse and --complement arguments to faidx
New in version 0.1.8:
- key_function keyword argument to Fasta allows lookup based on function output
This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.