gbmunge

Munge GenBank files into FASTA sequences and tab-separated metadata.

This little C program will extract the following information from a GenBank file:

name
accession
length
submission date
host
country
collection date

In addition to extracting this information, gbmunge reformats dates (e.g. 31-DEC-2001 becomes 2001-12-31), which makes them easier for downstream software such as BEAST to read, and cleans country names and matches them to ISO3 (three-letter ISO 3166-1) codes.
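
The date conversion described above can be illustrated with a small C sketch. This is not the code gbmunge uses: the function name genbank_date_to_iso is hypothetical, and the sketch handles only full-precision DD-MON-YYYY dates, with a simple 1 to 31 range check on the day.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from the gbmunge source.
 * Converts a full-precision GenBank date ("31-DEC-2001") to ISO 8601
 * ("2001-12-31"). Returns 0 on success, -1 if the input cannot be parsed
 * as DD-MON-YYYY or `out` is smaller than 11 bytes. */
int genbank_date_to_iso(const char *in, char *out, size_t outlen)
{
    static const char *const months[12] = {
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    };
    char mon[4] = "";
    int day = 0, year = 0, consumed = 0, month = 0;

    assert(in != NULL && out != NULL);
    if (outlen < sizeof "YYYY-MM-DD")
        return -1;

    /* %n records how many characters were read, so trailing junk is rejected. */
    if (sscanf(in, "%2d-%3[A-Za-z]-%4d%n", &day, mon, &year, &consumed) != 3
        || in[consumed] != '\0')
        return -1;

    /* Month abbreviations are matched case-insensitively. */
    for (size_t i = 0; mon[i] != '\0'; i++)
        mon[i] = (char)toupper((unsigned char)mon[i]);
    for (int m = 0; m < 12; m++) {
        if (strcmp(mon, months[m]) == 0) {
            month = m + 1;
            break;
        }
    }
    if (month == 0 || day < 1 || day > 31 || year < 1)
        return -1;

    snprintf(out, outlen, "%04d-%02d-%02d", year, month, day);
    return 0;
}
```

Calling genbank_date_to_iso("31-DEC-2001", buf, sizeof buf) with a 16-byte buf fills it with 2001-12-31. Dates of lower precision, such as a month and year only, would need additional cases.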

Usage

gbmunge [-h] -i <Genbank_file> -f <sequence_output> -o <metadata_output> [-t] [-s]

Genbank_file: filename of GenBank-formatted sequence file (normally downloaded as sequence.gb)
sequence_output: filename of FASTA output
metadata_output: filename of tab-separated metadata
-t: flag to
- output only sequences with collection dates (of any precision)
- name sequences as {accession}_{collection_date}
-s: flag to include sequences in the tab-delimited metadata file
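
As a rough illustration of the -t naming rule, the following C sketch builds a name of the form {accession}_{collection_date} and signals when a record should be skipped. The function name make_dated_name is hypothetical, and treating an empty string or the literal NA as a missing date is an assumption rather than documented gbmunge behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper illustrating the -t naming rule; not gbmunge's code.
 * Writes "{accession}_{collection_date}" into `out`.
 * Returns 0 on success, or -1 if the record has no collection date
 * (so the caller should skip it) or the name does not fit in `out`.
 * ASSUMPTION: a missing date is an empty string or the literal "NA". */
int make_dated_name(const char *accession, const char *collection_date,
                    char *out, size_t outlen)
{
    int n;

    assert(accession != NULL && out != NULL);
    if (collection_date == NULL || collection_date[0] == '\0'
        || strcmp(collection_date, "NA") == 0)
        return -1;

    /* snprintf reports the length it wanted, which exposes truncation. */
    n = snprintf(out, outlen, "%s_%s", accession, collection_date);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

Because dates of any precision are kept, an accession of KC776174 with a collection date of 2012-04 yields the name KC776174_2012-04.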

Building

git clone https://github.com/sdwfrost/gbmunge
cd gbmunge
make

This will build gbmunge in the src/ directory. Add that directory to your PATH, or move the executable to a directory that is already on it.

Testing

A GenBank file of MERS coronavirus sequences is provided in the test/ directory.

cd test
../src/gbmunge -i sequence.gb -f sequence.fas -o sequence.txt -t

Here are the first few lines of output in sequence.txt:

name	accession	length	submission_date	host	country_original	country	countrycode	collection_date
JX869059_2012-06-13	JX869059	30119	2012-12-04	Homo sapiens	NA	NA	NA	2012-06-13
KC164505_2012-09-11	KC164505	30111	2013-07-12	Homo sapiens	United Kingdom	United Kingdom	GBR	2012-09-11
KC667074_2012-09-19	KC667074	30112	2013-04-30	Homo sapiens	United Kingdom: England	United Kingdom	GBR	2012-09-19
KC776174_2012-04	KC776174	30030	2013-03-25	Homo sapiens	Jordan	Jordan	JOR	2012-04
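
For downstream C code that reads this metadata, a line can be split on tabs as sketched below. The names split_tsv_line and METADATA_COLUMNS are hypothetical, and the count of nine columns is taken from the header shown above, so it assumes the file was written without -s.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Column count taken from the header line shown above (ASSUMPTION: no -s). */
enum { METADATA_COLUMNS = 9 };

/* Hypothetical helper, not part of gbmunge.
 * Splits `line` in place on tab characters, storing pointers to at most
 * `max_fields` fields in `fields`. A trailing newline is removed first.
 * Empty fields are preserved (unlike strtok, which merges delimiters).
 * Returns the number of fields stored. */
size_t split_tsv_line(char *line, char *fields[], size_t max_fields)
{
    size_t n = 0;
    char *p = line;

    assert(line != NULL && fields != NULL);
    line[strcspn(line, "\r\n")] = '\0';

    while (n < max_fields) {
        char *tab;

        fields[n++] = p;
        tab = strchr(p, '\t');
        if (tab == NULL)
            break;          /* last field on the line */
        *tab = '\0';        /* terminate this field */
        p = tab + 1;
    }
    return n;
}
```

With the column order above, fields[7] holds the country code (for example GBR) and fields[8] the collection date. Values such as "Homo sapiens" contain spaces, which is why the split is on tabs only.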

Credits

This code uses a slightly modified version of the GBParsy parser, downloaded from the Google Code Archive. The modification addresses the parsing of the LOCUS field, which I found wasn't working properly.
