fasta_one_hot_encoder
Simple Python package to lazily one-hot encode FASTA files using multiple processes, either as single bases or as k-mers of arbitrary length.
How do I install this package?
As usual, just install it using pip:
pip install fasta_one_hot_encoder
Examples
Bases
One-hot encode to bases.
from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)
Obtained results should look like:
  | a | c | g | t |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
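To make the table above concrete, here is a minimal pure-Python sketch of what one-hot encoding single bases means. It is an illustration only, not the package's implementation: the function name `one_hot_bases` is hypothetical, and the choice to leave unknown symbols as all-zero rows is an assumption modelled on the idea behind `handle_unknown="ignore"`.

```python
def one_hot_bases(sequence, nucleotides="acgt", lower=True):
    """Return one row per base, with a single 1 in the matching column."""
    if lower:
        sequence = sequence.lower()
    # Map each known nucleotide to its column index.
    column = {base: i for i, base in enumerate(nucleotides)}
    rows = []
    for base in sequence:
        row = [0] * len(nucleotides)
        # Assumption: symbols outside the alphabet yield an all-zero row.
        if base in column:
            row[column[base]] = 1
        rows.append(row)
    return rows


# The sequence "GCC" reproduces the three rows shown in the table.
rows = one_hot_bases("GCC")
for index, row in enumerate(rows):
    print(index, row)
```

Each row has exactly one column set to 1, and the column order follows the order of the characters passed as `nucleotides`.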
Handling anonymous nucleotides
In many datasets you will encounter anonymous nucleotides, written as either "n" or "N". To encode them as well, just add an "n" to the nucleotides parameter:
from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgtn",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)
Obtained results should look like:
  | a | c | g | t | n |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 0 |
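The following sketch illustrates why adding "n" to the alphabet matters. It is a simplified stand-in, not the package's code: the helper `encode` is hypothetical, and it assumes that a symbol missing from the alphabet is encoded as an all-zero row.

```python
def encode(sequence, nucleotides):
    """One-hot encode a sequence against the given alphabet."""
    # Lowercase first so that "N" and "n" are treated alike.
    sequence = sequence.lower()
    # Assumption: a base not in the alphabet produces a row of zeros.
    return [[int(base == known) for known in nucleotides] for base in sequence]


without_n = encode("TNC", "acgt")   # "n" has no column of its own
with_n = encode("TNC", "acgtn")     # "n" gets a dedicated fifth column

print(without_n[1])
print(with_n[1])
```

Under these assumptions, leaving out "n" makes an anonymous nucleotide indistinguishable from any other unknown symbol, while including it gives the downstream model an explicit signal.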
Kmers
One-hot encode k-mers of a given length.
from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    kmers_length=2,
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)
Obtained results should look like:
  | aa | ac | ag | at | ca | cc | cg | ct | ga | gc | gg | gt | ta | tc | tg | tt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
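As a rough illustration of where the sixteen columns come from, the sketch below enumerates every k-mer over the alphabet and encodes each window of the sequence. It is not the package's implementation: the function `one_hot_kmers` is hypothetical, and the use of overlapping windows that advance one base at a time is an assumption made for the example, so the package itself may split sequences differently.

```python
from itertools import product


def one_hot_kmers(sequence, nucleotides="acgt", kmers_length=2):
    """Return the k-mer column names and one one-hot row per window."""
    sequence = sequence.lower()
    # Columns are all possible k-mers in alphabet order: aa, ac, ..., tt.
    kmers = ["".join(p) for p in product(nucleotides, repeat=kmers_length)]
    column = {kmer: i for i, kmer in enumerate(kmers)}
    rows = []
    # Assumption: overlapping windows that move one base at a time.
    for start in range(len(sequence) - kmers_length + 1):
        window = sequence[start:start + kmers_length]
        row = [0] * len(kmers)
        # Windows containing unknown symbols stay all zeros.
        if window in column:
            row[column[window]] = 1
        rows.append(row)
    return kmers, rows


kmers, rows = one_hot_kmers("GGC")
for row in rows:
    print(kmers[row.index(1)])
```

With an alphabet of 4 nucleotides and `kmers_length=2` there are 4 ** 2 = 16 columns, and the number of columns grows exponentially with the k-mer length.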