This program reads GenBank flat files (.gbff) and stores the parsed data in a list. Users can view the genes associated with a particular author or publication, search the list for an author, and navigate between an author and a publication. A selected author or publication can also be written to a file together with all associated data. The program has been tested on both Unix and Windows machines.
The program uses Gradle, and the build can be found in ./build/libs/eindopdracht-1.0-SNAPSHOT.jar.
If a build isn't present, you can build the program yourself by cloning the repository and running ./gradlew build.
The repository can be cloned into an IDE of your choice, or downloaded as a .zip and opened as a project.
The tool always requires an input directory. It can handle both .gz and .gbff files, and every file in the directory must be one of these two types.
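As an illustration of how both file types could be read through one code path, here is a minimal Java sketch. The class name GenbankInput and its methods are hypothetical and not taken from the project's source. It assumes the .gz files are gzip-compressed flat files and that UTF-8 decoding is acceptable for their content.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not taken from the project's source.
class GenbankInput {

    // True when the file name ends in one of the two accepted extensions.
    static boolean isSupported(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".gbff") || name.endsWith(".gz");
    }

    // Opens a text reader, decompressing on the fly when the file is gzipped.
    static BufferedReader open(Path file) throws IOException {
        InputStream in = Files.newInputStream(file);
        try {
            if (file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".gz")) {
                in = new GZIPInputStream(in);
            }
        } catch (IOException e) {
            in.close(); // do not leak the raw stream if the gzip header is invalid
            throw e;
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

A caller could check isSupported for every file in the input directory up front and stop with a clear message on the first file that is neither type.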
The tool has the following optional arguments:
-a or --authors: Display all authors in listed files.
-p or --publications: Display all publications in listed files.
-ba or --by-author: Enter an author to display all publications by that author. Needs an exact match.
-bp or --by-publication: Enter a publication to display all authors of that publication. Can parse partial names.
-o or --output: The output file to write the results to. Creates a new file in the directory if the file is not found.
-h or --help: Display the help menu.
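The difference between the exact match needed for authors and the partial match accepted for publications can be sketched as follows. This is only an illustration: the class MatchSketch and its methods are hypothetical, and treating the partial match as a case-insensitive substring test is an assumption, not something the tool is confirmed to do.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration of the two matching modes; not the project's code.
class MatchSketch {

    // Exact match: the query must equal the stored author name character for character.
    static List<String> exactMatches(List<String> authors, String query) {
        return authors.stream()
                .filter(author -> author.equals(query))
                .collect(Collectors.toList());
    }

    // Partial match (assumed here to be a case-insensitive substring test).
    static List<String> partialMatches(List<String> titles, String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        return titles.stream()
                .filter(title -> title.toLowerCase(Locale.ROOT).contains(needle))
                .collect(Collectors.toList());
    }
}
```

With an exact match, a query such as Thayer,N. would not return an author stored as Thayer,N.A., which is why the author name has to be typed exactly as it appears in the files.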
The program is broken up into 4 classes:
GenbankExplorer: Acts as the main class and handles the command line arguments.
GenbankParser: Takes in the file and stores the information in two objects: GenbankEntry and GenbankReference.
GenbankEntry: Stores information such as locus, etc.
GenbankReference: Stores information such as author and journal.
The GenbankEntry objects are stored in a list for use with the commands.
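To make the relationship between the two data classes concrete, here is a rough sketch of how a list of entries could answer the author and publication queries. Only the class names GenbankEntry and GenbankReference come from the description above. The fields, constructors, and the ModelSketch helper are assumptions made for illustration and may differ from the real classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Field names below are assumed for illustration; the real classes may differ.
class GenbankReference {
    final List<String> authors;
    final String title;
    final String journal;

    GenbankReference(List<String> authors, String title, String journal) {
        this.authors = new ArrayList<>(authors);
        this.title = title;
        this.journal = journal;
    }
}

class GenbankEntry {
    final String locus;
    final List<GenbankReference> references = new ArrayList<>();

    GenbankEntry(String locus) {
        this.locus = locus;
    }
}

// Hypothetical query helpers working on the list of parsed entries.
class ModelSketch {

    // Collects every distinct author across all entries, sorted alphabetically.
    static Set<String> allAuthors(List<GenbankEntry> entries) {
        Set<String> authors = new TreeSet<>();
        for (GenbankEntry entry : entries) {
            for (GenbankReference reference : entry.references) {
                authors.addAll(reference.authors);
            }
        }
        return authors;
    }

    // Returns the titles of all references that list the given author (exact match).
    static List<String> titlesBy(List<GenbankEntry> entries, String author) {
        List<String> titles = new ArrayList<>();
        for (GenbankEntry entry : entries) {
            for (GenbankReference reference : entry.references) {
                if (reference.authors.contains(author)) {
                    titles.add(reference.title);
                }
            }
        }
        return titles;
    }
}
```

Keeping the references inside each entry means a single pass over the list is enough to move from an author to publications and back.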
Test files with which this program is known to work are stored under ./src/main/resources/genbank/*
These commands were run from within IntelliJ under 'gradle/run' and are known to work.
Listing authors
run --args="./src/main/resources/genbank/ -a"
Listing genomes related to author
run --args="./src/main/resources/genbank/ -ag Thayer,N."
etc.
The program is documented according to Javadoc standards and is self-documenting where needed through clear and concise identifier and variable names. Logic is explained inside the code where needed, and recurring logic is marked as such.
In the future, the program could be improved by serializing the parsed objects (storing the results) and deserializing them on later runs, which would reduce the processing needed when dealing with large .gbff files.
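One possible shape for such a cache, using Java's built-in object serialization, is sketched below. Nothing here exists in the project yet: CachedEntry and EntryCache are hypothetical names, and the fields are a simplified stand-in for whatever the parsed objects actually hold.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a parsed entry.
class CachedEntry implements Serializable {
    private static final long serialVersionUID = 1L;
    final String locus;
    final ArrayList<String> authors;

    CachedEntry(String locus, List<String> authors) {
        this.locus = locus;
        this.authors = new ArrayList<>(authors);
    }
}

// Hypothetical cache helper: writes the parsed list once, reads it back later.
class EntryCache {

    // Serializes the whole list to the given stream and closes it.
    static void save(List<CachedEntry> entries, OutputStream out) {
        try (ObjectOutputStream objectOut = new ObjectOutputStream(out)) {
            objectOut.writeObject(new ArrayList<>(entries));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a list previously written by save from the given stream.
    @SuppressWarnings("unchecked")
    static List<CachedEntry> load(InputStream in) {
        try (ObjectInputStream objectIn = new ObjectInputStream(in)) {
            return (List<CachedEntry>) objectIn.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Cached class not found", e);
        }
    }
}
```

In practice the streams would come from Files.newOutputStream and Files.newInputStream on a cache file next to the input directory. Java deserialization should only be used on files the program wrote itself, since loading untrusted serialized data is unsafe.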