fasta_one_hot_encoder

Simple Python package to lazily one-hot encode FASTA files using multiple processes, considering either single bases or arbitrary k-mers.

How do I install this package?

As usual, just install it using pip:

pip install fasta_one_hot_encoder

Tests Coverage

Since different coverage tools sometimes report slightly different results, here are three of them:

Examples

Bases

One-hot encode to bases.

from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)

Obtained results should look like:

	c	g
0	0	1
1	1	0
2	1	0
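
To make the table above concrete, here is a minimal pure-Python sketch of one-hot encoding single bases. It is not the package's implementation: `one_hot_bases` is a hypothetical helper, and it assumes that `handle_unknown="ignore"` behaves as in scikit-learn's `OneHotEncoder`, where an unknown symbol becomes an all-zero row.

```python
def one_hot_bases(sequence, nucleotides="acgt"):
    """Return one row per base, with one column per known nucleotide.

    Hypothetical helper for illustration only. Symbols that are not in
    `nucleotides` produce an all-zero row (assumed "ignore" behaviour).
    """
    index = {base: i for i, base in enumerate(nucleotides)}
    rows = []
    for symbol in sequence.lower():  # mirrors lower=True
        row = [0] * len(nucleotides)
        if symbol in index:
            row[index[symbol]] = 1
        rows.append(row)
    return rows


# Columns are a, c, g, t in this order.
print(one_hot_bases("GCC"))
# → [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
```

For a hypothetical sequence `gcc`, the `c` and `g` columns of this output agree with the rows shown above.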

Handling anonymous nucleotides

In many datasets you will encounter anonymous nucleotides, written as either "n" or "N" depending on the dataset. To encode them, just add an "n" to the nucleotides:

from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgtn",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)

Obtained results should look like:

	c	g	n
0	0	1	0
1	0	0	1
2	1	0	0

Kmers

One-hot encode to kmers of given length.

from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    kmers_length=2,
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)

Obtained results should look like:

	aa	ac	ag	at	ca	cc	cg	ct	ga	gc	gg	gt	ta	tc	tg	tt
0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0
1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
