gbFileParser

A forked and modified version of the GenParser script. Processes local GenBank files.

Primary language: Python

Welcome to gbFileParser 👋

A simple script that parses GenBank files (.gb), extracts the information requested in the config files, and saves it in xlsx format. This script is a fork of GenParser.

The script can also generate FASTA files, as well as log files listing the Accession IDs that failed to produce valid data in the xlsx file. However, this has not been tested yet.

🚀 Usage

Make sure you have Python 3.x and pip >= 9.0.1 installed.

First run the following command at the root of your project:

    pip3 install -r requirements.txt

This will ensure you have all the requirements installed on your system.

To run the script, use the command:

    python3 main.py [gb FILE]

Doing so produces an xlsx file called gbFileParsedData, which contains the parsed data based on the header list you provide. In the command above, no header list was provided, so the script used the generic header list found in the config file. You can modify that list directly if you plan to use the same headers repeatedly.

The arguments that can be passed to the script are:

    > Flags for Input Files:
    --header        [HEADER LIST FILE]
    --feature       [FEATURE LIST FILE]
    --recognition   [RECOGNITION LIST FILE]
    
    > Flags:
    -f              Produce a Fasta file for all sequences found
    -t              Convert the xlsx file produced into a TSV file
    -l              Produce a logs file containing all Accession IDs with no sequences
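
As a rough illustration of how the flags listed above fit together, here is a minimal `argparse` sketch. It is not taken from main.py: the function name `build_parser`, the attribute names (`gb_file`, `fasta`, `tsv`, `logs`), and the help strings are assumptions made for this example. Only the flag names themselves come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Parse a local GenBank file into an xlsx sheet.",
    )
    # Positional argument: the GenBank file to parse.
    parser.add_argument("gb_file", help="path to the GenBank (.gb) file")

    # Flags for input files. Each is optional; None means "use the default
    # list from the config file" in this sketch.
    parser.add_argument("--header", metavar="FILE", help="header list file")
    parser.add_argument("--feature", metavar="FILE", help="feature list file")
    parser.add_argument("--recognition", metavar="FILE", help="recognition list file")

    # Boolean output flags. The dest names are assumptions for readability.
    parser.add_argument("-f", dest="fasta", action="store_true",
                        help="produce a FASTA file for all sequences found")
    parser.add_argument("-t", dest="tsv", action="store_true",
                        help="convert the xlsx file produced into a TSV file")
    parser.add_argument("-l", dest="logs", action="store_true",
                        help="produce a log file of Accession IDs with no sequences")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch can be
# run anywhere; "sample.gb" and "features.txt" are placeholder names.
args = build_parser().parse_args(["sample.gb", "--feature", "features.txt", "-f"])
print(vars(args))
```

In this sketch, any input-file flag that is left out stays `None`, which is one simple way to signal that the default list should be used.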

The main purpose of this script is to extract sequences related to features in GenBank files. To do that, users currently have to identify the feature in the GenBank file where the sequence is located, and then add a "hook" (a key) to look for. An example is provided in the config files (features.txt, recognition.txt).

In the example, the features list contains the names of the features to look in (e.g. CDS, Gene). The script requires each feature to contain a "hook" to latch onto, and extracts the sequence if the hook is present. The recognition hook can be either a key (the text before the '=') or a value (the text after the '='). If no custom recognition list or feature list is provided, a default one is used, which can be modified in the config file. However, if you would like to use custom recognition and feature lists to identify specific sequences, you can run the script using the flags above:

    python3 main.py [gb FILE] --feature [FEATURE FILE] --recognition [RECOGNITION FILE]
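
The feature and hook matching described above can be sketched in plain Python. This is a simplified illustration, not the script's actual code: the `Feature` class, the function names, and the sample data are hypothetical, and the exact, case-insensitive comparison is an assumption about how a hook is matched against a qualifier key or value.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """Hypothetical in-memory form of one GenBank feature."""
    type: str                                   # e.g. "CDS" or "gene"
    qualifiers: Dict[str, str] = field(default_factory=dict)  # key=value pairs
    sequence: str = ""                          # sequence at the feature location


def matches_hook(qualifiers: Dict[str, str], hooks: List[str]) -> bool:
    """Return True if any hook equals a qualifier key or a qualifier value."""
    wanted = {hook.lower() for hook in hooks}
    return any(
        key.lower() in wanted or value.lower() in wanted
        for key, value in qualifiers.items()
    )


def extract_sequences(features: List[Feature],
                      feature_types: List[str],
                      hooks: List[str]) -> List[str]:
    """Collect sequences from features of a listed type that contain a hook."""
    wanted_types = {name.lower() for name in feature_types}
    return [
        feat.sequence
        for feat in features
        if feat.type.lower() in wanted_types and matches_hook(feat.qualifiers, hooks)
    ]


# Made-up sample data for illustration only.
features = [
    Feature("CDS", {"gene": "COX1", "product": "cytochrome c oxidase subunit I"}, "ATGTTC"),
    Feature("CDS", {"gene": "ND1"}, "ATGAAA"),
    Feature("tRNA", {"product": "tRNA-Phe"}, "GTTTAT"),
]

# Hook given as a value (text after the '='): only the COX1 feature matches.
print(extract_sequences(features, ["CDS", "gene"], ["COX1"]))
# Hook given as a key (text before the '='): every CDS with a gene qualifier matches.
print(extract_sequences(features, ["CDS", "gene"], ["gene"]))
```

Using a key as the hook is the broader choice, since it matches every listed feature that carries that qualifier, while a value narrows the match to specific entries.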

To produce a FASTA file, log file, or TSV file, simply add the corresponding flags at the end.

Author

👤 @yunomer

Show your support

Give a ⭐ if this project helped you!