Eindopdracht 2023 SamSokolov

This program reads GenBank flat files (.gbff) and stores the data in a list. Users can view the genes associated with a particular author or publication, search the list for an author, navigate between an author and a publication, and select an author or publication to write to a file with all associated data. The program has been tested on both Unix machines and Windows.

Installation

The program uses Gradle, and a prebuilt jar can be found at ./build/libs/eindopdracht-1.0-SNAPSHOT.jar. If the jar isn't present, you can build the program yourself by cloning the repository and running ./gradlew build.

The repository can be cloned into an IDE of your choice, or downloaded as a .zip and opened as a project.

Usage

The tool always needs an input directory. It can handle both .gz files and .gbff files, and every file in the directory must be one of these two types.
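
As an illustration of the input rule above, here is a minimal sketch of how a directory containing .gbff and .gz files could be read. The class InputDirectorySketch and its methods are hypothetical and not taken from the project. The sketch also assumes that .gz files hold gzip-compressed GenBank text and that unsupported file types are rejected with an exception, which may differ from what the program actually does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the project: one way to read the input directory.
class InputDirectorySketch {

    /** Opens a reader, decompressing on the fly when the file name ends in .gz. */
    static BufferedReader open(Path file) throws IOException {
        InputStream in = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Reads all lines of every regular file in the directory, in file-name order. */
    static List<String> readAllLines(Path directory) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(directory)) {
            files = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            // Assumption: any other extension is treated as an error.
            if (!name.endsWith(".gbff") && !name.endsWith(".gz")) {
                throw new IllegalArgumentException("Unsupported file type: " + file);
            }
            try (BufferedReader reader = open(file)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

Choosing the stream by file extension keeps the rest of the parsing code unaware of whether its input was compressed.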

The tool has the following optional arguments:

  • -a or -authors: Display all authors in the listed files.
  • -p or -publications: Display all publications in the listed files.
  • -ba or -by-author: Enter an author to display all publications by that author. Needs an exact match.
  • -bp or -by-publication: Enter a publication to display all authors of that publication. Accepts partial names.
  • -o or -output: The output file to write the results to. Creates a new file in the directory if the file is not found.
  • -h or -help: Display the help menu.

The program is broken up into 4 classes:

  • GenbankExplorer: Acts as the main class and handles the command line arguments.
  • GenbankParser: Takes in the file and stores the information in two objects: GenbankEntry and GenbankReference.
  • GenbankEntry: Stores information such as locus, etc.
  • GenbankReference: Stores information such as author and journal.

The GenbankEntry objects are stored in a list for use with the commands.
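
To make the relationship between the stored objects and the by-author and by-publication commands more concrete, here is a minimal sketch. EntrySketch and ReferenceSketch are simplified stand-ins for GenbankEntry and GenbankReference. Their fields, the LookupSketch class, and the choice of a case-sensitive substring match for partial publication names are all assumptions made for illustration, not the project's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for GenbankReference and GenbankEntry; field names are assumed.
record ReferenceSketch(List<String> authors, String title, String journal) {}

record EntrySketch(String locus, List<ReferenceSketch> references) {}

// Hypothetical lookups over the list of entries.
class LookupSketch {

    /** Titles of all publications that list exactly this author name. */
    static List<String> publicationsByAuthor(List<EntrySketch> entries, String author) {
        List<String> titles = new ArrayList<>();
        for (EntrySketch entry : entries) {
            for (ReferenceSketch reference : entry.references()) {
                // Exact match on the author name; duplicates are skipped.
                if (reference.authors().contains(author) && !titles.contains(reference.title())) {
                    titles.add(reference.title());
                }
            }
        }
        return titles;
    }

    /** Authors of every publication whose title contains the given text. */
    static List<String> authorsByPublication(List<EntrySketch> entries, String partialTitle) {
        List<String> authors = new ArrayList<>();
        for (EntrySketch entry : entries) {
            for (ReferenceSketch reference : entry.references()) {
                // Assumption: a partial name means a case-sensitive substring match.
                if (reference.title().contains(partialTitle)) {
                    for (String author : reference.authors()) {
                        if (!authors.contains(author)) {
                            authors.add(author);
                        }
                    }
                }
            }
        }
        return authors;
    }
}
```

Both lookups scan the whole list, which is simple and adequate for small inputs. An index keyed by author name would be an alternative for larger collections.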

Test files with which this program is known to work are stored under ./src/main/resources/genbank/*

These commands were run from within IntelliJ under 'gradle/run' and are known to work.

Listing authors: run --args="./src/main/resources/genbank/ -a"

Listing genomes related to an author: run --args="./src/main/resources/genbank/ -ag Thayer,N."

etc.

Extra Information

The program is documented according to Javadoc standards and is self-documenting where possible through clear and concise identifiers and variable names. Logic is explained inside the code where needed, and recurring logic is marked as such.

Future Improvements

In the future, the program could serialize and deserialize the parsed objects (that is, store the parse results) so that large .gbff files do not have to be parsed again on every run.
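
One possible shape for this improvement, sketched with standard Java object serialization. The CacheSketch class, the CachedEntry record, and its fields are hypothetical, and the project's real classes would need to implement java.io.Serializable for this approach to apply. Other formats, such as JSON, would work equally well.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical cache for parsed results; not part of the current program.
class CacheSketch {

    /** Simplified stand-in for a parsed entry; field names are assumed. */
    record CachedEntry(String locus, List<String> authors) implements Serializable {}

    /** Writes the parsed entries to a cache file. */
    static void save(List<CachedEntry> entries, Path cacheFile) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
            // Copy into an ArrayList so the written list type is known to be serializable.
            out.writeObject(new ArrayList<>(entries));
        }
    }

    /** Reads the entries back, skipping the parsing step entirely. */
    @SuppressWarnings("unchecked")
    static List<CachedEntry> load(Path cacheFile) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(cacheFile))) {
            return (List<CachedEntry>) in.readObject();
        }
    }
}
```

A cache like this should only be loaded from files the program wrote itself, since deserializing untrusted data is unsafe in Java.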