Edinburgh-Genome-Foundry/codon-usage-tables

Package fails to install via pip

Closed this issue · 14 comments

When installing via pip (pip install python_codon_tables), the following error is produced:

Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/_f/5kqpr8kx5zl9qjtkzzkb31xw0000gn/T/pip-install-a0_5fa8a/python-codon-tables/setup.py", line 16, in <module>
        with open(os.path.join('data', 'version.txt'), 'r') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'data/version.txt'

Seems that the issue is that the data directory is one up from the packaged directory.

Zulko commented

This may have been a problem with the MANIFEST file. I have now added this file to the manifest and pushed a new version to PyPI. Could you try again and let me know if it works?

Looks like we are hitting a different error now:

Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/_f/5kqpr8kx5zl9qjtkzzkb31xw0000gn/T/pip-install-m1m1kpun/python-codon-tables/setup.py", line 19, in <module>
        with open(os.path.join('python_codon_tables', 'README.rst'), 'r') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'python_codon_tables/README.rst'

Zulko commented

The MANIFEST, again... Sorry for these. They are trivial, but they are not caught by the test suite because it doesn't use pip. I'll fix it.
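
A minimal sketch of a check that could catch this class of error without going through pip. It inspects a built source distribution and reports which required files did not make it into the archive. The function name `missing_from_sdist` and the list of required files are assumptions for illustration, not part of this project's test suite.

```python
import tarfile


def missing_from_sdist(sdist_path, required_files):
    """Return the required files that are absent from a source distribution.

    Paths in `required_files` are relative to the project root, for example
    'data/version.txt'. An sdist archive wraps everything in a single
    top-level '<name>-<version>/' folder, which is stripped before comparing.
    """
    with tarfile.open(sdist_path, "r:gz") as archive:
        # Drop the leading '<name>-<version>/' component of every member.
        packaged = {
            name.split("/", 1)[1]
            for name in archive.getnames()
            if "/" in name
        }
    return sorted(set(required_files) - packaged)
```

Running it on the archive produced by `python setup.py sdist`, with the two files named in the tracebacks above as `required_files`, would flag either one if the MANIFEST left it out.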

Zulko commented

I fixed it on GitHub, but I'll only push the new version to PyPI tomorrow. In the meantime you can install directly from GitHub with pip:

pip install git+https://github.com/Edinburgh-Genome-Foundry/codon-usage-tables.git

Thanks!

Zulko commented

Done. Let me know if I can close this one.

Pip install works, but importing fails:

FileNotFoundError: [Errno 2] No such file or directory: '.../miniconda3/envs/codon_harmony/lib/python3.7/site-packages/python_codon_tables/../data/tables'

Zulko commented

Gee, that's all more complicated than it should be. I believe there is still something wrong with the MANIFEST, maybe recursive-include instead of include or something like that. I will do some tests on Monday. Thanks for the reports!
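
One reading of the traceback above, sketched in code: the path is built relative to the installed package but points one level up, into a `data` folder next to the package rather than inside it, and an install does not create that folder unless the files are shipped there. Both function names below are hypothetical, and this is not necessarily how the fix was made.

```python
import os


def data_dir_outside_package(module_file):
    """Resolve '<package>/../data/tables', as in the error message above."""
    package_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.normpath(os.path.join(package_dir, "..", "data", "tables"))


def data_dir_inside_package(module_file):
    """Resolve '<package>/data/tables', a folder that lives in the package."""
    package_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(package_dir, "data", "tables")
```

Called with a module's `__file__`, the first function ends up directly under `site-packages`, while the second stays inside the package directory, where files declared as package data are installed.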

Zulko commented

OK, I managed to reproduce your bug and to fix it, at least on my machine. Could you try a pip install --upgrade python_codon_tables and let me know if it now works for you? (I am confident it will.)

Beware that today I also changed the API in subtle ways, so make sure you use the methods highlighted in the example. On the bright side, there is a new feature, table = download_codons_table(taxid=XXX), which allows you to get the table for any taxID.

Great! It seems to be working now. I am going to try to integrate this into my project https://github.com/weitzner/codon_harmony Thanks!

Zulko commented

Hey it seems that your project could use DnaChisel, a generic DNA optimization library for Python which I am very proud of (I am surely biased!)

Here is how (some of) your project's specifications would be formulated using DnaChisel. Some specifications might not be exactly as you want them (in particular there has been some discussion around the codon harmonization) but the library is written so as to be easily extended by the user, so maybe it could work for you:

import dnachisel as dc

# GENERATE A RANDOM PROTEIN SEQUENCE FOR THE EXAMPLE
aa_sequence = dc.random_protein_sequence(1000)
dna_sequence = dc.reverse_translate(aa_sequence)

# SPECIFY THE CONSTRAINTS AND OBJECTIVES
problem = dc.DnaOptimizationProblem(
    sequence=dna_sequence,
    constraints=[
        dc.EnforceTranslation(translation=aa_sequence), # keep the protein sequence
        dc.EnforceGCContent(mini=0.3, maxi=0.7, window=70),
        dc.AvoidHairpins(stem_size=10),
        dc.AvoidPattern(dc.repeated_kmers(3, 3)),
        dc.AvoidPattern(dc.repeated_kmers(9, 2)),
        dc.AvoidPattern(enzyme='BsmBI'),
        *(dc.AvoidPattern(dc.homopolymer_pattern(c, 6)) for c in "ATGC")
    ],
    objectives=[
        dc.CodonOptimize(species='e_coli')
    ]
)

# SOLVE THE CONSTRAINTS, THEN OPTIMIZE
print("BEFORE:", problem.constraints_text_summary())
problem.resolve_constraints()
problem.optimize()
print("AFTER:", problem.constraints_text_summary())

Output:

BEFORE: ===> FAILURE: 5 constraints evaluations failed
✔PASS ┍ EnforceTranslation[0-3000(+)]
      │ All OK.
✔PASS ┍ EnforceGCContent[0-3000(+)](mini:0.30, maxi:0.70, window:70)
      │ Passed !
✔PASS ┍ AvoidHairpins[0-3000(+)](stem_size:10, hairpin_window:200)
      │ Score:         0. Locations: []
 FAIL ┍ AvoidPattern[0-3000(+)](([ATGC]{3})\1{2} (3-repeats 3-mers))
      │ Failed. Pattern found at positions [97-106(+), 98-107(+), 99-108(+),
      │ 100-109(+), 172-181(+), 1453-1462(+), 1454-1463(+), 2967-2976(+)]
 FAIL ┍ AvoidPattern[0-3000(+)](([ATGC]{9})\1{1} (2-repeats 9-mers))
      │ Failed. Pattern found at positions [1420-1438(+)]
 FAIL ┍ AvoidPattern[0-3000(+)](enzyme:BsmBI)
      │ Failed. Pattern found at positions [1594-1600(-), 337-343(-)]
 FAIL ┍ AvoidPattern[0-3000(+)](AAAAAA)
      │ Failed. Pattern found at positions [790-796(+), 1226-1232(+),
      │ 1963-1969(+), 1964-1970(+), 1965-1971(+), 2206-2212(+), 2207-2213(+),
      │ 2810-2816(+)]
 FAIL ┍ AvoidPattern[0-3000(+)](TTTTTT)
      │ Failed. Pattern found at positions [2810-2816(-), 2207-2213(-),
      │ 2206-2212(-), 1965-1971(-), 1964-1970(-), 1963-1969(-), 1226-1232(-),
      │ 790-796(-)]
✔PASS ┍ AvoidPattern[0-3000(+)](GGGGGG)
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](CCCCCC)
      │ Passed. Pattern not found !


AFTER: ===> SUCCESS - all constraints evaluations pass
✔PASS ┍ EnforceTranslation[0-3000(+)]
      │ All OK.
✔PASS ┍ EnforceGCContent[0-3000(+)](mini:0.30, maxi:0.70, window:70)
      │ Passed !
✔PASS ┍ AvoidHairpins[0-3000(+)](stem_size:10, hairpin_window:200)
      │ Score:         0. Locations: []
✔PASS ┍ AvoidPattern[0-3000(+)](([ATGC]{3})\1{2} (3-repeats 3-mers))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](([ATGC]{9})\1{1} (2-repeats 9-mers))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](enzyme:BsmBI)
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](AAAAAA)
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](TTTTTT)
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](GGGGGG)
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-3000(+)](CCCCCC)
      │ Passed. Pattern not found !

Wow, that seems to be very close to what I could use directly. Can a dc.DnaOptimizationProblem contain multiple dc.EnforceGCContent constraints? I have a particular way of defining "harmony" as well as a strategy for determining which codons to enrich and deplete. I'd be interested to see if we could merge the strategies somehow. Let me know if that would be of interest to you!

Zulko commented

Yes, you can combine several GC content constraints, for instance:

constraints = [
    EnforceGCContent(mini=0.4, maxi=0.6), # global GC
    EnforceGCContent(mini=0.3, maxi=0.7, window=100), # windowed GC
    EnforceGCContent(mini=0.2, maxi=0.9, window=30), # smaller-windowed GC
]
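
To make the difference between a global and a windowed GC bound concrete, here is a small pure-Python sketch. It is not DnaChisel's implementation; `gc_content` and `gc_violations` are hypothetical helpers that only illustrate what each kind of constraint is checking.

```python
def gc_content(sequence):
    """Fraction of G and C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_violations(sequence, mini, maxi, window=None):
    """Return start indices of windows whose GC content is out of bounds.

    With window=None the whole sequence is treated as a single window,
    which corresponds to a "global" GC constraint.
    """
    if not sequence:
        return []
    if window is None:
        window = len(sequence)
    return [
        start
        for start in range(len(sequence) - window + 1)
        if not mini <= gc_content(sequence[start:start + window]) <= maxi
    ]
```

A sequence can satisfy the global bound while still containing short stretches that break a windowed bound, which is why stacking several constraints with different window sizes is useful.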

Regarding the codon harmonization I would definitely be interested in whether/how your codon optimization can be ported into a Specification class. Most DnaChisel specs implement a strategy in which they "scan" the sequence, spot "underoptimal" regions, and optimize these locally, one after another, from left to right. But a new Specification class can also define its own resolution strategy and you are not obliged to follow this pattern. Could you describe briefly how your harmonization score is computed and how it is optimized?

First, a few tolerances are set: a usage frequency below which a codon is excluded from the set (currently defaults to 0.10, so a codon used less than 10% of the time is not considered), and a maximum allowed deviation from the host profile (1 + relax in the other package).

To compute the ideal codon usage, the codon usage tables are updated (rare codons are dropped, frequencies are recomputed), and then, using the AA sequence, the desired number of uses of each codon (as an integer) is calculated. After this, the current DNA sequence is scanned, recording each codon's position(s) and count, and the residual of the expected usage vs. the observed usage is computed. Then, basically, you just go through the list of codons that are over-represented and replace them with those that are under-represented.
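
The steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual codon_harmony code: all function names are hypothetical, the usage table is assumed to map each amino acid to a {codon: frequency} dict, integer targets are obtained with largest-remainder rounding (an assumption), and the maximum-deviation tolerance is left out.

```python
from collections import Counter, defaultdict


def drop_rare_codons(usage, min_frequency=0.10):
    """Remove codons used less often than `min_frequency`, then renormalize."""
    filtered = {}
    for amino_acid, codons in usage.items():
        kept = {c: f for c, f in codons.items() if f >= min_frequency}
        total = sum(kept.values())
        filtered[amino_acid] = {c: f / total for c, f in kept.items()}
    return filtered


def target_counts(frequencies, n):
    """Split `n` occurrences among codons in proportion to `frequencies`.

    Largest-remainder rounding keeps the integer counts summing to `n`.
    """
    exact = {c: f * n for c, f in frequencies.items()}
    counts = {c: int(x) for c, x in exact.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for codon in by_remainder[:leftover]:
        counts[codon] += 1
    return counts


def harmonize(dna, protein, usage, min_frequency=0.10):
    """Swap over-represented codons for under-represented synonymous ones."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    usage = drop_rare_codons(usage, min_frequency)
    positions = defaultdict(list)  # amino acid -> indices of its codons
    for index, amino_acid in enumerate(protein):
        positions[amino_acid].append(index)
    for amino_acid, indices in positions.items():
        wanted = target_counts(usage[amino_acid], len(indices))
        observed = Counter(codons[i] for i in indices)
        # One entry per missing occurrence of each under-represented codon.
        deficit = [
            c for c in wanted for _ in range(max(0, wanted[c] - observed[c]))
        ]
        for i in indices:
            if not deficit:
                break
            # Residual > 0: this codon is used more often than desired.
            if observed[codons[i]] > wanted.get(codons[i], 0):
                observed[codons[i]] -= 1
                codons[i] = deficit.pop()
    return "".join(codons)
```

Because every replacement is a synonymous codon for the same amino acid, the translation is preserved. Codons that were dropped as rare have a target of zero, so any occurrence of them counts as over-represented and gets replaced.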

After all that, the codon adaptation index (CAI) is computed, and the sequence with the highest CAI that matches the host profile and doesn't have the undesirable features is written to disk.
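
For reference, a minimal sketch of the codon adaptation index in its usual form: the geometric mean, over the codons of the sequence, of each codon's relative adaptiveness (its frequency divided by that of the most frequent synonymous codon). The function name and the {amino acid: {codon: frequency}} table layout are assumptions, and this may differ in detail from what the package computes. For instance, single-codon amino acids are not excluded here and simply contribute a weight of 1.

```python
import math


def codon_adaptation_index(dna, usage):
    """Geometric mean of relative adaptiveness over the codons of `dna`."""
    weights = {}
    for synonymous in usage.values():
        best = max(synonymous.values())
        for codon, frequency in synonymous.items():
            # Relative adaptiveness: 1.0 for the preferred codon, less otherwise.
            weights[codon] = frequency / best
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    # Averaging logs avoids underflow on long sequences.
    log_sum = sum(math.log(weights[codon]) for codon in codons)
    return math.exp(log_sum / len(codons))
```

A codon with zero frequency in the table would make the logarithm fail, so such codons would need to be filtered or given a small pseudo-frequency beforehand.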