The purpose of this exercise is to learn regular expressions in Python by working with sequences of nucleotide.
At this point, you should have (1) an account on Github and (2) been introduced to the very basics of Git.
-
Login to your Github account.
-
Fork this repository, by clicking the 'Fork' button on the upper right of the page.
After a few seconds, you should be looking at your copy of the repo in your own Github account.
-
Click the 'Clone or download' button, and copy the URL of the repo via the 'copy to clipboard' button.
-
In your terminal, navigate to where you want to keep this repo (you can always move it later, so just your home directory is fine). Then type:
$ git clone the-url-you-just-copied
and hit enter to clone the repository. Make sure you are cloning your fork of this repo.
-
Next,
cd
into the directory:$ cd the-name-of-directory-you-just-cloned
At this point, you should be in your own local copy of the repository.
-
Next, create a new 'homework' branch for working on this exercise:
$ git checkout -b homework origin/dev
This will get you to a point where we left off from the Python translation homework. Confirm that all the translation code is working:
$ python3 test_translate.py $ python3 translate.py
-
Create a remote 'homework' branch that you can 'push' to as you work on the exercise:
$ git push origin homework:homework
-
As you work on the exercise below, be sure to frequently
add
andcommit
your work andpush
changes to the remote 'homework' branch. Don't enter these commands now; this is just to jog your memory:$ # Do some work $ git add file-you-worked-on.py $ git commit $ git push origin homework
The goal of this exercise is for you to practice using regular expressions in Python to find open-reading frames (ORFs) in sequences of nucleotides.
Open the file find_orf.py
with your preferred text editor.
In this file you will find functions with detailed docstrings, and three
places in the code marked with comments like:
##########################################################################
############################ EDIT CODE BELOW #############################
# `codon_pattern_str` needs to be a regular expression that will match any
# codon (but only one codon).
# Currently, `codon_pattern_str` is only a string of literal characters.
# This is a valid regular expression, but it will only match 'AUG' exactly.
# Change `codon_pattern_str` so that it will match any valid codons, and
# only valid codons.
# Read the docstring above for additional clues.
codon_pattern_str = r'AUG'
##########################################################################
Here is where you will need to edit the code to get a regular expression that
accomplishes the task at hand. In the example above, codon_pattern_str
is the regular expression that you would need to edit.
Read the function's docstring and the comments carefully to help guide your
composing of the regular expression.
Once you have successfully composed the regular expressions in find_orf.py
,
all the tests in test_find_orf.py
will pass.
$ python3 test_find_orf.py
Read the error and failure messages reported by test_find_orf.py
carefully
to help you get the regular expressions in find_orf.py
working correctly.
You can run the tests for each function separately:
$ python3 test_find_orf.py TestVetNucleotideSequence
$ python3 test_find_orf.py TestVetCodon
$ python3 test_find_orf.py TestFindFirstOrf
These will run the tests for the functions vet_nucleotide_sequence
,
vet_codon
, and find_first_orf
, respectively.
It can be helpful to only run the tests for the function your working on.
Start with the vet_nucleotide_sequence
and vet_codon
functions
first.
These regular expressions are easier, and the find_first_orf
function depends
on them (i.e., it uses them), so working on find_first_orf
will be easier if
the tests for vet_nucleotide_sequence
and vet_codon
are passing.
NOTE: You do NOT need to know how the tests work! You only need to add code
to find_orf.py
.
The tests only care about what each function returns, not how the functions
work.
- Make sure you are using Git to commit and push your work often as you work on Part 1.
- Work on 'mini examples' in the python interpreter.
Once all the tests are passing, you can use find_orf.py
as a script
to find the first open reading frame in a nucleotide sequence.
Check out the help menu:
$ python3 find_orf.py -h
This help menu is created automatically by the argparse
package using the
information provided by the code in the main
function in find_orf.py
.
You can give the script a nucleotide sequence to search at the command line:
$ python3 find_orf.py AUGGUAUAA
The sequence can also be provided in a file if you specify the -p
flag:
$ python3 find_orf.py -p path-to-a-file-with-a-sequence.py
You can also specify start and stop codons with -s
and -x
arguments,
respectively:
$ python3 -s AUG -x UAA UAG find_orf.py AUGGUAUAA
All of this command-line interface is specified and handled in the main
function of find_orf.py
using the argparse
package.
This package can be very useful for making your scripts portable and user friendly. You will use it in Part 2!
Create a new script (file) called translate_orf.py
.
Now, add code to translate_orf.py
so that it can translate the first ORF
found in a nucleotide sequence.
This script will have a command-line interface very similarly to find_orf.py
.
The only difference is that it will translate the ORF that it finds and report
the amino acid sequence (to standard output).
Some tips that might be helpful:
- Write a
main
function very similar to the one infind_orf.py
to set up the command-line interface and process command-line arguments. import
and use thefind_first_orf
function infind_orf.py
to find the first ORF.import
and use thetranslate_sequence
function intranslate.py
to translate the ORF.import
and use theparse_sequence_from_path
function infind_orf.py
, just like it is currently used in themain
function offind_orf.py
.- Don't worry about accommodating genetic codes; just use (i.e., copy and
paste) the same genetic code that is in
translate.py
(just afterif __name__ == '__main__':
).
A main
function similar to the one in find_orf.py
should be all you need to
get this working, because you can do the ORF searching and translation within
the main
function using the functions you imported from find_orf.py
and
translate.py
and the genetic code you pasted in.
- Make sure you are using Git to commit and push your work often as you work on part 2.
After completing Part 2, you likely have code in the main
functions of
find_orf.py
and translate_orf.py
that is largely redundant.
Can you think of a way of reorganizing the code to avoid this redundancy?
Also, can you update translate_orf.py
so that it can accommodate any genetic
code?
This exercise was inspired by, and borrowed heavily from, the class notes and Python scripts written by Mark Holder, which can be found at https://github.com/mtholder/eebprogramming.
This work was made possible by funding provided to Jamie Oaks from the National Science Foundation (DEB 1656004).
This work is licensed under a Creative Commons Attribution 4.0 International License.