fasta_one_hot_encoder

A simple Python package to lazily one-hot encode FASTA files using multiple processes, encoding either single bases or k-mers of arbitrary length.

How do I install this package?

As usual, just install it using pip:

pip install fasta_one_hot_encoder

Tests Coverage

Since different coverage tools sometimes report slightly different results, here are three of them:

Coveralls Coverage, SonarCloud Coverage, Code Climate Coverage

Examples

Bases

One-hot encode to bases.

from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)

Obtained results should look like:

  a c g t
0 0 0 1 0
1 0 1 0 0
2 0 1 0 0
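
To make the table above concrete, here is a minimal pure-Python sketch of what one-hot encoding single bases means. The helper `one_hot_bases` is hypothetical and is not part of fasta_one_hot_encoder. It assumes that `lower = True` lowercases the sequence before encoding and that characters outside `nucleotides` become all-zero rows.

```python
# Hypothetical helper for illustration only; not the package's implementation.
def one_hot_bases(sequence, nucleotides="acgt"):
    """Return one row per base, with a 1 in the column of the matching nucleotide."""
    column_of = {base: i for i, base in enumerate(nucleotides)}
    rows = []
    for base in sequence.lower():  # assumption: mirrors lower=True
        row = [0] * len(nucleotides)
        column = column_of.get(base)
        if column is not None:  # assumption: unknown characters stay all-zero
            row[column] = 1
        rows.append(row)
    return rows


for row in one_hot_bases("GCC"):
    print(row)
```

For the sequence "GCC", the rows are `[0, 0, 1, 0]`, `[0, 1, 0, 0]` and `[0, 1, 0, 0]`, the same pattern as the table above.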

Handling anonymous nucleotides

In many datasets you will encounter anonymous nucleotides, written as either "n" or "N". To encode them as well, just add an "n" to the nucleotides parameter:

from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgtn",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)

Obtained results should look like:

  a c g t n
0 0 0 1 0 0
1 0 0 0 0 1
2 0 1 0 0 0
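
The effect of including "n" in the alphabet can be illustrated with a small standalone sketch. The function `encode_base` is hypothetical. Its `handle_unknown` argument follows the convention used by scikit-learn's `OneHotEncoder`, where "ignore" yields an all-zero row and "error" raises. Whether fasta_one_hot_encoder forwards the option in exactly this way is an assumption.

```python
# Hypothetical helper for illustration only; not the package's implementation.
def encode_base(base, nucleotides, handle_unknown="ignore"):
    """One-hot encode a single base against the given alphabet."""
    base = base.lower()
    if base not in nucleotides:
        if handle_unknown == "error":
            raise ValueError(f"Unknown nucleotide: {base!r}")
        # "ignore": the base gets no column, so its row is all zeros
        return [0] * len(nucleotides)
    return [int(base == known) for known in nucleotides]


sequence = "GNC"
without_n = [encode_base(base, "acgt") for base in sequence]
with_n = [encode_base(base, "acgtn") for base in sequence]
print(without_n)
print(with_n)
```

With the alphabet "acgt", the "N" row is all zeros and so carries no signal. With "acgtn", it gets its own column, as in the table above.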

Kmers

One-hot encode to kmers of given length.

from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    kmers_length=2,
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)

Obtained results should look like:

   aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
0   0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
1   0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0
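
As a rough sketch of how such a table can arise, the snippet below enumerates all k-mers over the alphabet and slides a window of length `kmers_length` over the sequence one base at a time. The helper `one_hot_kmers` is hypothetical, and the overlapping stride of one is an assumption inferred from the example output, not a confirmed detail of the package.

```python
from itertools import product


# Hypothetical helper for illustration only; not the package's implementation.
def one_hot_kmers(sequence, nucleotides="acgt", kmers_length=2):
    """Return (column names, rows) for overlapping k-mers of the sequence."""
    # All possible k-mers, ordered like the table header: aa, ac, ag, at, ca, ...
    kmers = ["".join(p) for p in product(nucleotides, repeat=kmers_length)]
    column_of = {kmer: i for i, kmer in enumerate(kmers)}
    sequence = sequence.lower()
    rows = []
    # Assumption: windows overlap and advance by one base
    for start in range(len(sequence) - kmers_length + 1):
        row = [0] * len(kmers)
        column = column_of.get(sequence[start:start + kmers_length])
        if column is not None:  # k-mers with unknown characters stay all-zero
            row[column] = 1
        rows.append(row)
    return kmers, rows


kmers, rows = one_hot_kmers("GCC")
for row in rows:
    print(kmers[row.index(1)], row)
```

For "GCC" this yields the two windows "gc" and "cc", which correspond to the columns set in the two rows above. Note that the number of columns grows as `len(nucleotides) ** kmers_length`, so large values of `kmers_length` produce very wide outputs.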