There are 4496 genomes in total. All of these genomes:
- Are bacterial
- Have only a single chromosome
- Are stored in a genome dir whose name contains the 3- or 4-character org code
- Have a fasta file in the genome dir named: org_code.fasta
- Have a mapping file in the genome dir, which contains gene names, start and end positions, strand info, etc.
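The layout described above can be checked with a short script. This is only a minimal sketch: the function name `check_genome_dir` and the `mapping_glob` pattern are hypothetical (the actual mapping file name is not stated here), and the org code is passed in explicitly rather than parsed from the dir name.

```python
from pathlib import Path


def check_genome_dir(genome_dir, org_code, mapping_glob="*mapping*"):
    """Return a list of problems found in one genome dir (empty if none).

    mapping_glob is a hypothetical pattern; adjust it to the real
    mapping file name used in the genome dirs.
    """
    genome_dir = Path(genome_dir)
    problems = []

    # The genome dir name should contain the org code.
    if org_code not in genome_dir.name:
        problems.append(
            f"dir name {genome_dir.name!r} does not contain {org_code!r}"
        )

    # The fasta file should be named org_code.fasta.
    if not (genome_dir / f"{org_code}.fasta").is_file():
        problems.append(f"missing {org_code}.fasta")

    # At least one file should match the (assumed) mapping file pattern.
    if not any(genome_dir.glob(mapping_glob)):
        problems.append(f"no file matching {mapping_glob!r}")

    return problems
```

Calling `check_genome_dir(path, org_code)` on each genome dir and printing any non-empty result gives a quick sanity check before running downstream analyses.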
As of June 2, 2023, these genomes are stored on the GPU machine at: /scratch/mbr5797/genomes_extracted_from_kegg
Note: Some gene info is not in the mapping files. The missing genes are spliced genes, which originate from two different locations in the genome. Out of 7964903 genes in total, only 5772 (about 0.07%) are missing for this reason.
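To put the two counts in proportion, the fraction of missing genes can be computed directly. The variable names below are illustrative only; the numbers are the ones stated in the note.

```python
# Counts taken from the note above.
total_genes = 7964903
missing_spliced_genes = 5772

# Fraction of all genes that are absent from the mapping files.
fraction_missing = missing_spliced_genes / total_genes

print(f"{fraction_missing:.4%} of genes are missing from the mapping files")
```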
conda create -y --name extract_kegg
conda install -y --name extract_kegg -c conda-forge -c bioconda --file requirements.txt
conda activate extract_kegg
python src/main.py
The genomes are listed in list_of_genomes.