Most of the programs below use a sequence id format like this:
1.gene|2.kingdom|3.order|4.family|5.genus|6.species|7.accession_id|8.specimen_voucher
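As a rough illustration of how an id in this format can be taken apart, here is a minimal sketch. The helper name `parse_id` and the sample id are made up for the example; only the field order comes from the format line above.

```python
# Field names in the order given by the id format above.
FIELDS = ("gene", "kingdom", "order", "family", "genus", "species",
          "accession_id", "specimen_voucher")


def parse_id(seq_id, sep="|"):
    """Split a sequence id into a dict keyed by field name (hypothetical helper)."""
    values = seq_id.lstrip(">").split(sep)
    return dict(zip(FIELDS, values))


# Made-up id, for illustration only.
example = "rbcL|Plantae|Rosales|Rosaceae|Rosa|chinensis|AB000001|voucher01"
record = parse_id(example)
print(record["genus"], record["species"])
```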
Convert GenBank format to FASTA format with reformatted ids.
Gene names are normalized by these rules:
- rRNA gene names should start with "rrn".
- tRNA gene names should look like "trnXnnn", where "X" is the letter of the amino acid and "nnn" is the 3 letters of the codon. Note that it is tran-reversed.
- For other gene names, non-alphabetic characters are removed, and a suffix such as the number or letter of a subunit is capitalized.
- Names that cannot be recognized are left unchanged.
This is a Python 3 function. It needs re and Biopython to run.
To use it, follow this example:
from gene_rename import normalize
new_name, name_type = normalize(old_name)
Note that all inputs and outputs are strings. The name_type will be one of:
- bad_name
- suspicious_name
- tRNA
- rRNA
- normal
"bad_name" means the name cannot be recognized and is returned unchanged. "suspicious_name" means the name is too long (longer than 15 characters).
python3 gb2fasta.py gb_file-name
Find output files in "file_name_out".
python3 gb2fasta.py sequence.gb
You can use these separators:
|/:;~!?@#$%^&*+=
It is better to avoid " " (space) and "_" (underscore) in the id. "\" is prohibited.
Example: 1.gene|2.order|3.family|4.genus|5.species|6.accession_id|7.specimen_voucher
python3 fasta_rename.py fasta_file "id_format"
"id_format" is the new id format you want; if you omit it, the program will ask for it. To put a fixed number in the id, escape each digit with "\" so that it does not conflict with the field indexes of the sequence id. For instance, to add "2018" at the beginning of the sequence id, enter "\2\0\1\8".
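To make the escaping rule concrete, here is a minimal sketch of how such a format string could be interpreted. It is not the code of fasta_rename.py: the function name `apply_id_format` is hypothetical, and it assumes that a bare digit is a 1-based field index, that a backslash makes the next character literal, and that every other character is copied as a separator.

```python
def apply_id_format(seq_id, id_format, sep="|"):
    """Build a new id from an id_format string (illustrative sketch only)."""
    fields = seq_id.split(sep)
    out = []
    i = 0
    while i < len(id_format):
        ch = id_format[i]
        if ch == "\\" and i + 1 < len(id_format):
            # Escaped character: copy it literally, e.g. a fixed digit.
            out.append(id_format[i + 1])
            i += 2
            continue
        if ch.isdigit():
            # Bare digit: assumed to be a 1-based field index.
            index = int(ch)
            if index < 1 or index > len(fields):
                raise ValueError(f"field {index} does not exist in {seq_id!r}")
            out.append(fields[index - 1])
        else:
            # Anything else is treated as a separator and kept as-is.
            out.append(ch)
        i += 1
    return "".join(out)


# Made-up id, for illustration only.
old_id = "rbcL|Plantae|Rosales|Rosaceae|Rosa|chinensis|AB000001|voucher01"
new_id = apply_id_format(old_id, r"\2\0\1\8|5|6|1")
print(new_id)  # 2018|Rosa|chinensis|rbcL
```

Without the backslashes, "2018" would be read as fields 2, 0, 1 and 8, which is why the escape is needed.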
This program can be used with any name format.
- Put this program in the same folder as the fasta files whose ids you want to change.
- Double-click to run.
- Enter the numbers of the fields you want, in the order you want, and the separators you want to use. Note that each field can only be used once.
- Find the output files in the "renamed" folder.
Divide the sequences in the input fasta file into different files according to the field you choose as separator. Make sure all sequence ids have the same format.
If the field chosen as separator does not exist in a sequence id, the sequence is put into FAILED.fasta.
- Run
python3 group_by.py input_file
- Choose the field you want to use as separator.
- If you already know which field you want, you can run it like this:
python3 group_by.py input_file -c n
Here n is the number of the field. If you do not set "-c", the program will prompt you to choose which field of the sequence id to use as separator.
python3 group_by.py cgl.fasta -c 4
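The grouping step can be sketched roughly as follows. This is an illustration under assumptions, not the code of group_by.py: the helpers `read_fasta` and `group_by_field` are hypothetical, the separator is assumed to be "|", and the sample records are made up. Groups are kept in memory here instead of being written to files.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:], []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def group_by_field(records, field, sep="|"):
    """Group records by a 1-based field; ids lacking it go to 'FAILED'."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        parts = seq_id.split(sep)
        if 0 < field <= len(parts) and parts[field - 1]:
            key = parts[field - 1]
        else:
            key = "FAILED"
        groups[key].append((seq_id, seq))
    return dict(groups)


# Made-up records, for illustration only.
fasta = """>rbcL|Plantae|Rosales|Rosaceae|Rosa|chinensis|AB000001|v1
ACGT
>rbcL|Plantae|Fagales|Fagaceae|Quercus|robur|AB000002|v2
ACGTAC
>broken_id
ACG
""".splitlines()

groups = group_by_field(read_fasta(fasta), field=4)
for name, members in groups.items():
    print(name, len(members))
```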
Enter a query string, the same as in NCBI GenBank, to download large amounts of data.
python3 large_query.py
Then enter the query string.
Download gb records according to the accession id list you downloaded from NCBI. Note that this program does not check the output, so you have to verify the data yourself.
Usage:
python3 large_query_with_id_list.py id_list -redo accession_number
The id_list is the accession list file you downloaded before. If the download fails, the program quits, and you can use "-redo" to continue from the given accession number.
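The idea behind "-redo" can be sketched as below. The function `remaining_ids` is hypothetical, and it assumes that the download resumes at the given accession number itself (inclusive); the actual script may behave differently. The accession numbers are made up.

```python
def remaining_ids(id_list, redo=None):
    """Return the ids still to download, starting at `redo` if given."""
    if redo is None:
        return list(id_list)
    if redo not in id_list:
        raise ValueError(f"{redo} is not in the id list")
    # Assumption: the given accession number is downloaded again.
    return id_list[id_list.index(redo):]


# Made-up accession numbers, for illustration only.
ids = ["AB000001", "AB000002", "AB000003", "AB000004"]
print(remaining_ids(ids, redo="AB000003"))
```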
Keep only one record per species by removing the shorter sequences (taking 'N' into account).
python3 uniq-species.py input.fasta -c choice
See the log file for detailed output.
If you do not set "-c", the program will prompt you to choose which fields of the sequence id are used to divide the sequences.
python3 uniq.py whole.fasta -c "4 5"
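A minimal sketch of the selection logic, under stated assumptions: "consider 'N'" is read here as "do not count 'N' bases towards the length", which is only one possible interpretation, and the helper names are hypothetical. The sample ids use the 8-field format, so genus and species are fields 5 and 6.

```python
def effective_length(seq):
    """Sequence length without 'N' bases (assumed meaning of "consider 'N'")."""
    return len(seq) - seq.upper().count("N")


def keep_longest(records, fields, sep="|"):
    """Keep one (id, sequence) pair per key built from the 1-based fields."""
    best = {}
    for seq_id, seq in records:
        parts = seq_id.split(sep)
        key = tuple(parts[i - 1] for i in fields)
        if key not in best or effective_length(seq) > effective_length(best[key][1]):
            best[key] = (seq_id, seq)
    return list(best.values())


# Made-up records, for illustration only.
records = [
    ("rbcL|Plantae|Rosales|Rosaceae|Rosa|chinensis|AB000001|v1", "ACGTNNNNNN"),
    ("rbcL|Plantae|Rosales|Rosaceae|Rosa|chinensis|AB000002|v2", "ACGTACG"),
    ("rbcL|Plantae|Fagales|Fagaceae|Quercus|robur|AB000003|v3", "ACGT"),
]
for seq_id, seq in keep_longest(records, fields=(5, 6)):
    print(seq_id)
```

In this example the first record is longer in raw characters but has only four non-'N' bases, so the second record is kept for that species.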
Replace BOP with info given in "info.csv".
python3 replace.py
Make sure info.csv is in the same folder as replace.py and that the fasta files end with ".fasta".
It only handles ">BOP000000".
Add the filename to the head of each sequence id.
python3 add_filename.py fastafile
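The effect on the headers can be sketched like this. It is an illustration only: the function below is hypothetical, and both the "|" separator and the use of the filename without its extension are assumptions, not confirmed behavior of add_filename.py.

```python
from pathlib import Path


def add_filename(lines, filename, sep="|"):
    """Prefix every FASTA header line with the file's base name."""
    stem = Path(filename).stem  # assumption: extension is dropped
    for line in lines:
        if line.startswith(">"):
            yield f">{stem}{sep}{line[1:]}"
        else:
            yield line


# Made-up record, for illustration only.
fasta = [">rbcL|Plantae|Rosales", "ACGT"]
print(list(add_filename(fasta, "sample01.fasta")))
```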
Be sure to install python3 rather than python 2.7. In addition, subprocess.run requires python 3.5 or above.
Be sure to download the same version of python3.
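If you are unsure which interpreter is being used, a small check like the following can help. The function name `check_python` is made up for this sketch; `sys.version_info` is part of the standard library.

```python
import sys


def check_python(minimum=(3, 5)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


if not check_python():
    # subprocess.run was added in Python 3.5.
    sys.exit("Python 3.5 or above is required.")
print("Python version OK:", sys.version.split()[0])
```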