This Python script was produced as part of the UCL Graduate School course Introduction to Scientific Programming in Python.
More information on the course can be found on its home page.
The script combines a set of functions that read a FASTA file of DNA sequences and produce a FASTA file containing the most likely protein sequence for each DNA sequence.
The steps in the script are roughly the following:
- Reads in the FASTA file
- Stores the sequences in a dictionary
- Generates the six possible reading frames for each sequence (+1, +2, +3 and -1, -2, -3)
- Translates the DNA sequences into protein sequences
- Finds the longest protein sequence between an open and a close marker
- Stores the longest protein sequence for each DNA sequence in a dictionary
- Saves the protein sequences to a FASTA file or prints them on the terminal
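The steps above can be sketched roughly as follows. This is a minimal illustration and not the code in dna2proteins.py: all function names are hypothetical, the standard genetic code is assumed, and the open and close markers are assumed to be methionine (M) and a stop codon (written as *).

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(lines):
    """Store FASTA records in a dictionary of {header: sequence}."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line.upper()
    return sequences


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return frames +1, +2, +3 followed by -1, -2, -3."""
    rc = reverse_complement(dna)
    return [dna[i:] for i in range(3)] + [rc[i:] for i in range(3)]


def translate(dna):
    """Translate complete codons; unknown codons become 'X' (an assumption)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf(protein, start="M", stop="*"):
    """Longest stretch from a start marker up to (not including) a stop marker."""
    best = ""
    # Drop the last segment: it is not closed by a stop marker.
    for segment in protein.split(stop)[:-1]:
        idx = segment.find(start)
        if idx != -1 and len(segment) - idx > len(best):
            best = segment[idx:]
    return best


def dna_to_protein(dna):
    """Pick the longest candidate protein across all six frames."""
    return max((longest_orf(translate(f)) for f in six_frames(dna)), key=len)


records = parse_fasta([">seq1", "CCATGG", "CTTAAGG"])
proteins = {name: dna_to_protein(dna) for name, dna in records.items()}
```

Here the toy sequence only contains a complete open-to-close stretch in frame +3, so that frame supplies the protein stored for seq1.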
The script is quite simple. It accepts four options from the command line:
- -h prints a short help message
- -i (--ifile) must be followed by the path to the input FASTA file
- -o (--ofile) must be followed by the name of the file where the protein sequences will be stored
- -p prints the protein sequences on the terminal
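For illustration, the options listed above could be declared with the standard library's argparse module along the following lines. This is a sketch under assumptions, not necessarily how dna2proteins.py parses its arguments: the function name build_parser, the attribute name print_proteins, and the choice to make -i required are all hypothetical.

```python
import argparse


def build_parser():
    """Declare the -i, -o and -p options; argparse adds -h automatically."""
    parser = argparse.ArgumentParser(
        description="Translate DNA sequences in a FASTA file into protein sequences."
    )
    # Assumption: the input file is mandatory.
    parser.add_argument("-i", "--ifile", required=True,
                        help="FASTA file containing the DNA sequences")
    parser.add_argument("-o", "--ofile",
                        help="file where the protein sequences will be stored")
    # Only the short flag is listed above, so no long form is defined here.
    parser.add_argument("-p", dest="print_proteins", action="store_true",
                        help="print the protein sequences on the terminal")
    return parser


args = build_parser().parse_args(["-i", "sequences.fa", "-o", "proteins.fa", "-p"])
```

After parsing, args.ifile and args.ofile hold the two filenames and args.print_proteins is True when -p was given.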
To use the script, enter the following in the terminal:
$ python dna2proteins.py -i sequences.fa -o proteins.fa -p
Replace sequences.fa and proteins.fa with the appropriate filenames and paths.
The code for this script was developed jointly by:
- Erin Vehstedt
- Johanna Fischer
- Maragatham Kumar
- Andrés Calderón
- Marya Koleva
- Patricio R. Estévez Soto
- With the guidance and help of Fabian Zimmer
This project is not maintained. We make no assurances and offer no guarantees regarding its performance. It was developed as an effort to learn Python.