Primary language: C. License: MIT.

gbmunge

Munge GenBank files into FASTA sequences and tab-separated metadata.

This little C program will extract the following information from a GenBank file:

  • name

  • accession

  • length

  • submission date

  • host

  • country

  • collection date

In addition to extracting this information, gbmunge reformats dates (e.g. 31-DEC-2001 becomes 2001-12-31), which makes them more digestible to downstream software such as BEAST. It also cleans country names and matches them to ISO 3166-1 alpha-3 (ISO3) codes.
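To make the date reformatting concrete, here is a minimal sketch in C. It is not taken from the gbmunge source: the function names `month_number` and `genbank_date_to_iso` are hypothetical, and the sketch assumes that partial dates arrive as `Mon-YYYY` or `YYYY` and should keep their original precision.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from gbmunge): map a three-letter month
   abbreviation, in any case, to 1..12. Returns 0 if unrecognised. */
static int month_number(const char *mon)
{
    static const char *names[] = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
    char up[4];

    if (strlen(mon) != 3)
        return 0;
    for (int i = 0; i < 3; i++)
        up[i] = (char)toupper((unsigned char)mon[i]);
    up[3] = '\0';
    for (int m = 0; m < 12; m++)
        if (strcmp(up, names[m]) == 0)
            return m + 1;
    return 0;
}

/* Convert "DD-MON-YYYY", "MON-YYYY" or "YYYY" to "YYYY-MM-DD", "YYYY-MM"
   or "YYYY". Returns 0 on success, -1 if the input is not recognised or
   the output buffer is too small. The trailing %c in each pattern only
   matches if unexpected characters follow the date. */
int genbank_date_to_iso(const char *in, char *out, size_t outlen)
{
    int day, year, m, n;
    char mon[4], extra;

    if (sscanf(in, "%2d-%3[A-Za-z]-%4d%c", &day, mon, &year, &extra) == 3) {
        m = month_number(mon);
        if (m == 0 || day < 1 || day > 31)
            return -1;
        n = snprintf(out, outlen, "%04d-%02d-%02d", year, m, day);
    } else if (sscanf(in, "%3[A-Za-z]-%4d%c", mon, &year, &extra) == 2) {
        m = month_number(mon);
        if (m == 0)
            return -1;
        n = snprintf(out, outlen, "%04d-%02d", year, m);
    } else if (sscanf(in, "%4d%c", &year, &extra) == 1) {
        n = snprintf(out, outlen, "%04d", year);
    } else {
        return -1;
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Under these assumptions, calling `genbank_date_to_iso("31-DEC-2001", buf, sizeof buf)` fills `buf` with `2001-12-31`, and `Apr-2012` becomes `2012-04`. The sketch does not check that the day is valid for the given month.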

Usage

gbmunge [-h] -i <Genbank_file> -f <sequence_output> -o <metadata_output> [-t] [-s]
  • Genbank_file: filename of GenBank-formatted sequence file (normally downloaded as sequence.gb)
  • sequence_output: filename of FASTA output
  • metadata_output: filename of tab-separated metadata
  • -t: flag to
    • only output sequences with collection dates (of any precision)
    • name sequences as {accession}_{collection_date}
  • -s: flag to include sequences in the tab-separated metadata file

Building

git clone https://github.com/sdwfrost/gbmunge
cd gbmunge
make

This will build gbmunge in the src/ directory. Add that directory to your PATH, or move the executable to a directory that is already on it.

Testing

A GenBank file of MERS Coronavirus sequences is provided in the test/ directory.

cd test
../src/gbmunge -i sequence.gb -f sequence.fas -o sequence.txt -t

Here are the first few lines of output in sequence.txt:

name                 accession  length  submission_date  host          country_original         country         countrycode  collection_date
JX869059_2012-06-13  JX869059   30119   2012-12-04       Homo sapiens  NA                       NA              NA           2012-06-13
KC164505_2012-09-11  KC164505   30111   2013-07-12       Homo sapiens  United Kingdom           United Kingdom  GBR          2012-09-11
KC667074_2012-09-19  KC667074   30112   2013-04-30       Homo sapiens  United Kingdom: England  United Kingdom  GBR          2012-09-19
KC776174_2012-04     KC776174   30030   2013-03-25       Homo sapiens  Jordan                   Jordan          JOR          2012-04
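
The country columns in this output show the cleaning step at work: "United Kingdom: England" is reduced to "United Kingdom" and matched to GBR. A minimal sketch of that idea in C follows. It is an assumption about the approach, not the gbmunge implementation: the functions `clean_country` and `country_to_iso3` are hypothetical, the lookup table holds only a few illustrative entries, and returning "NA" for unmatched names is inferred from the sample output.

```c
#include <string.h>

/* Tiny illustrative lookup table (hypothetical); a real table would
   cover every ISO 3166-1 country. */
static const struct {
    const char *name;
    const char *iso3;
} countries[] = {
    {"United Kingdom", "GBR"},
    {"Jordan", "JOR"},
    {"Saudi Arabia", "SAU"},
};

/* Copy the part of a country value before the first ':' into out,
   trimming surrounding spaces and truncating to fit outlen. */
void clean_country(const char *raw, char *out, size_t outlen)
{
    size_t len = strcspn(raw, ":"); /* length up to ':' or end of string */

    if (outlen == 0)
        return;
    while (len > 0 && raw[0] == ' ') { /* trim leading spaces */
        raw++;
        len--;
    }
    while (len > 0 && raw[len - 1] == ' ') /* trim trailing spaces */
        len--;
    if (len >= outlen)
        len = outlen - 1;
    memcpy(out, raw, len);
    out[len] = '\0';
}

/* Return the three-letter code for a cleaned country name, or "NA"
   when the name is not in the table. */
const char *country_to_iso3(const char *country)
{
    for (size_t i = 0; i < sizeof countries / sizeof countries[0]; i++)
        if (strcmp(country, countries[i].name) == 0)
            return countries[i].iso3;
    return "NA";
}
```

With this sketch, cleaning "United Kingdom: England" yields "United Kingdom", which looks up to GBR, while a record with no country stays NA in all three columns. Exact string matching is the simplest choice here; handling alternative spellings would need extra table entries or a normalisation step.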

Credits

This code uses a slightly modified version of the GBParsy parser, downloaded from the Google Code Archive. The modification was needed because I found that parsing of the LOCUS field wasn't working properly.