pip install -r requirements.txt
GenBank file used to look up the gene positions and extract the extended sequences.
gbk_filename = "/home/agustin/Tatroviride_IMI206040_0.gb" # <-Please change->
FASTA file of query sequences from which the gene IDs are taken, or a txt file with the gene IDs in the first column (columns separated by \t). The first line of the txt file (the header) is skipped. Give the file name followed by the file type, "fasta" or "txt":
input_file_name_type=["/home/agustin/IDcluster10.txt","txt"] # <-Please change->
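To make the two accepted input formats concrete, here is a minimal sketch of how the gene IDs could be collected from either file type. The function name `read_gene_ids` is hypothetical and is not taken from upDownStreamSeqsFromgbk.py; the sketch also assumes that a FASTA gene ID is the first whitespace-delimited token after `>`.

```python
import io


def read_gene_ids(handle, file_type):
    """Collect gene IDs from an open text handle.

    Hypothetical helper for illustration only; the real script may differ.
    """
    ids = []
    if file_type == "fasta":
        for line in handle:
            if line.startswith(">"):
                # Assumption: the ID is the first token of the header line.
                tokens = line[1:].split()
                if tokens:
                    ids.append(tokens[0])
    elif file_type == "txt":
        next(handle, None)  # skip the header line
        for line in handle:
            if line.strip():
                # The gene ID is the first tab-separated column.
                ids.append(line.rstrip("\n").split("\t")[0].strip())
    else:
        raise ValueError('file type must be "fasta" or "txt"')
    return ids


if __name__ == "__main__":
    demo = io.StringIO("gene_id\tdescription\nGENE_A\tfoo\nGENE_B\tbar\n")
    print(read_gene_ids(demo, "txt"))
```

With a real file, the handle would come from `open(input_file_name_type[0])` and the type from `input_file_name_type[1]`.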
The output is a FASTA file containing the original sequences together with the extended upstream and downstream sequences. Set the name of the output file:
fna_filename = "UpDownStream.fna" # <-Please change->
Set how many base pairs to extend upstream and downstream of each original gene's start and end:
upstream=1500 # <-Please change->
downstream=0 # <-Please change->
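As a rough illustration of what these two values mean in coordinates, the sketch below computes an extended window around a gene. It is not the script's actual code: the function name `extended_window` is hypothetical, coordinates are assumed to be 0-based and half-open, and the strand handling (upstream lies to the right of a minus-strand gene) is an assumption about the intended behavior.

```python
def extended_window(start, end, strand, upstream, downstream, seq_len):
    """Return (new_start, new_end) for a gene extended on both sides.

    Assumptions: 0-based half-open coordinates, strand is +1 or -1,
    and the window is clamped to the limits of the contig.
    """
    if strand >= 0:
        new_start = start - upstream
        new_end = end + downstream
    else:
        # On the minus strand, upstream is toward higher coordinates.
        new_start = start - downstream
        new_end = end + upstream
    return max(0, new_start), min(seq_len, new_end)


if __name__ == "__main__":
    # Gene at 2000..3000 on the plus strand, with the settings used above.
    print(extended_window(2000, 3000, 1, 1500, 0, 10000))
```

Clamping matters for genes near a contig edge: in that case the extended region is shorter than the requested number of base pairs.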
python3 upDownStreamSeqsFromgbk.py
bash expand_fasta.sh < UpDownStream.fna
bash original_fasta.sh < UpDownStream.fna
awk '{if (/^>/) print $0; else print(substr($1,1,1500)) }' extended.fasta
awk '{if (/^>/) print $0; else print(substr($1, length($1)-1499)) }' extended.fasta
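The two awk one-liners work line by line, so they expect each sequence on a single line. If a FASTA file has wrapped sequences, a small Python sketch along these lines could do the same trimming. The function name `trim_fasta` and its parameters are hypothetical and are not part of this repository.

```python
def trim_fasta(lines, n, side="start"):
    """Keep the first or last n bases of every record.

    Hypothetical helper: joins wrapped sequence lines before trimming.
    Returns a list of (header, trimmed_sequence) tuples.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    if side == "start":
        return [(h, s[:n]) for h, s in records]
    if side == "end":
        # Guard n == 0, because s[-0:] would return the whole sequence.
        return [(h, s[-n:] if n > 0 else "") for h, s in records]
    raise ValueError('side must be "start" or "end"')


if __name__ == "__main__":
    with open("extended.fasta") as handle:
        for header, seq in trim_fasta(handle, 1500, "start"):
            print(header)
            print(seq)
```

Passing "end" instead of "start" keeps the last n bases of each record.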
bash parserApp.sh