We had 12 genbank files (for gram-positive bacteria) to analyze.
Codes were written to extract information on location (start position
, end position
) and desirable qualifiers (Gene ID
, Gene name
, Locus tag
, strand orientation
) of the feature keys from the feature table of Genbank files for all the genes present in them. All the genes where Gene ID
or Gene name
were absent were marked as unavailable in our data. Gene sequences were also extracted along with upstream- and downstream- flanking regions of length 200 bp
and 203 bp
(to account for 3 positions, i.e., boundary cases, while looking for motifs in the next step where otherwise it’d give zero count for those positions).
The previously found upstream- and downstream- flanking regions of genes were searched for GAAG
and GAAA
motifs and the starting positions for these motifs relative to the gene boundary (In 5'-3'
direction, gene’s first base will have +1
position, so the upstream- flanking region will be from -200
to -1
and gene’s last base will have, say n
is the length of the gene, +n
position, so the downstream- flanking region will be from n+1
to n+200
) were stored. The counts of these motifs starting at each position in the flanking regions of 200 bp
length were recorded.
These counts were then used to create line plots showing frequency of the motifs starting at a position relative to the start site, i.e., +1
, of the gene. For this part, the data analysis and visualizations were done using Python (v 3.8.12)
and some of its libraries- pandas (v 1.3.3)
, BioPython (v 1.79)
, re module (v 2.2.1)
, and matplotlib (v 3.4.2)
, in Jupyter notebooks (v 6.4.3)
.