DNAnalyzer: A Java repository from ImpossibleReality

A highly efficient, powerful, and feature-rich algorithm for analyzing DNA sequences

DNAnalyzer identifies proteins, amino acids, start and stop codons, high coverage regions, regions susceptible to neurodevelopmental disorders, transcription factors, and regulatory elements. Researchers are working to extract valuable information from such software to better understand human health and disease. Currently, we have a command-line interface (CLI) and are working on a graphical user interface (GUI) that will let physicians interact with the software more quickly and easily, and help them identify genetic mutations that may cause disease.

Background

The human genome is composed of over 3 billion base pairs, making manual analysis nearly impossible. Consequently, powerful computational and statistical methods are necessary to decode the functional information hidden in DNA sequences. The genome is also extremely intricate and contains a vast amount of data, which needs to be organized and converted into an analyzable form. Current analytical tools and software make this arduous for both geneticists and physicians, restricting their access to crucial information for understanding human biology. [1]

Features

Start and stop codons
- Mark where translation of a protein-coding sequence begins and ends. The codons in between each specify one of 20 different amino acids. A protein consists of one or more chains of amino acids (called polypeptides) whose sequence is encoded in a gene. [2]
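
As an illustration of how start and stop codons delimit a coding region, here is a minimal sketch of an open reading frame scan. It is not DNAnalyzer's actual implementation: the class and method names (`OrfSketch`, `findOpenReadingFrames`) are hypothetical, and it assumes the standard start codon `ATG` and stop codons `TAA`, `TAG`, and `TGA` on the forward strand only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of DNAnalyzer's API. */
final class OrfSketch {
    private static final String START = "ATG";
    private static final List<String> STOPS = List.of("TAA", "TAG", "TGA");

    /**
     * Returns every forward-strand substring that begins at a start codon
     * and ends at the first in-frame stop codon (stop codon included).
     */
    static List<String> findOpenReadingFrames(String dna) {
        String seq = dna.toUpperCase(Locale.ROOT);
        List<String> orfs = new ArrayList<>();
        for (int start = seq.indexOf(START); start >= 0; start = seq.indexOf(START, start + 1)) {
            // Walk codon by codon from the start codon, staying in the same frame.
            for (int i = start + 3; i + 3 <= seq.length(); i += 3) {
                if (STOPS.contains(seq.substring(i, i + 3))) {
                    orfs.add(seq.substring(start, i + 3));
                    break;
                }
            }
        }
        return orfs;
    }

    public static void main(String[] args) {
        System.out.println(findOpenReadingFrames("ccATGGCCTAAgg"));
    }
}
```

A start codon with no in-frame stop codon after it yields no result, and nested start codons produce overlapping frames; a real analyzer would likely filter these by length.
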
High coverage regions
- Regions of a genome that code for a protein and have a relatively high proportion of guanine and cytosine among the four nucleotide bases (45-60% GC content). [3]
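
The GC-content criterion above can be checked in a few lines of Java. This is a minimal sketch under stated assumptions, not DNAnalyzer's actual code: the names `GcContentSketch`, `gcContent`, and `isHighCoverage` are hypothetical, and the 45-60% band is taken directly from the description above.

```java
/** Hypothetical helper for illustration; not part of DNAnalyzer's API. */
final class GcContentSketch {

    /** Fraction of G and C bases in the sequence, in [0, 1]; 0 for an empty sequence. */
    static double gcContent(String dna) {
        if (dna.isEmpty()) {
            return 0.0;
        }
        // Count upper- and lower-case G and C characters.
        long gc = dna.chars().filter(c -> "GCgc".indexOf(c) >= 0).count();
        return (double) gc / dna.length();
    }

    /** True when the GC content falls inside the 45-60% band quoted above. */
    static boolean isHighCoverage(String dna) {
        double gc = gcContent(dna);
        return gc >= 0.45 && gc <= 0.60;
    }

    public static void main(String[] args) {
        System.out.println(gcContent("ATGC"));      // half of the bases are G or C
        System.out.println(isHighCoverage("ATGC"));
    }
}
```

In practice the check would probably run over a sliding window of the genome rather than a whole sequence; the window size is left open here because the source does not specify one.
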
Longest genes
- Long genes are particularly susceptible to disease-related disruption and are especially linked to neurodevelopmental disorders (e.g., autism). [4]
Regulatory elements
- Binding sites for transcription factors, which are involved in gene regulation. [6]
FASTA files (.fa)
- Supports multi-line and single-line FASTA database files. Files can either be uploaded or linked to from the web. [7]
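
To show what handling both multi-line and single-line FASTA input involves, here is a small parsing sketch. It is an assumption-laden illustration rather than the project's reader: `FastaSketch` and `parse` are hypothetical names, and the method takes the file contents as a string instead of a path or URL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DNAnalyzer's API. */
final class FastaSketch {

    /**
     * Parses FASTA text into header -> sequence pairs, in file order.
     * Sequence lines belonging to one record are concatenated, so
     * multi-line and single-line records produce the same result.
     */
    static Map<String, String> parse(String fastaText) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String raw : fastaText.split("\\R")) {
            String line = raw.strip();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).strip();
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line);
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(parse(">seq1\nATG\nCCC\n>seq2\nGGTA\n"));
    }
}
```

A file on disk could be passed in with `Files.readString(Path.of(...))`; for very large genomes a streaming reader would be a better fit than loading the whole file into memory.
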
Command-line interface (Met CLI)
- The Methionine command-line interface (abbreviated as Met CLI) is a unified tool for running DNAnalyzer services from the command line. It is well suited to using DNAnalyzer services directly and to scripting a sequence of commands. You can currently access all the core features of DNAnalyzer without logging in, although account support will be implemented soon. For more information on Met CLI installation and currently supported commands, refer to the Met CLI GitHub repository.

Quick Introduction to DNA

DNA

In a nutshell, DNA is found in nearly every cell of your body and contains the instructions for building over 200 different types of cells. DNA is similar to a programming language, but for living organisms. Artificial intelligence and machine learning can help us read and interpret this code, which can yield key insights and potentially life-saving benefits.

Algorithm

The current algorithm, while tested thoroughly, is still a work in progress in terms of features, but it is getting better every day with your help.

Databases

A database of known DNA sequences is one of the most effective ways to interpret new DNA. When combined with machine learning, a model trained on such a database can make predictions about DNA it has never seen before. Many current DNA tests work this way.

Consider a video game's anti-cheat system: it monitors player behavior and compares it against a database of confirmed cheats for that game. This database contains hundreds of known cheats, usually the most common ones. When the system flags a player, it can assign a probability to how likely the detection is correct; more common cheats usually score higher on this scale.

Getting Started

System Requirements

To build and run DNAnalyzer, you need:

- JDK 17
- A JAVA_HOME environment variable pointing to your JDK, or the Java executable in your PATH
- Gradle

Build & Run

We use Gradle for building. The Gradle wrapper takes care of downloading dependencies, testing, compiling, linking, and packaging the code.

Windows:

.\gradlew build

Linux/Unix/macOS:

./gradlew build

Executable:

java -jar build/libs/DNAnalyzer.jar <arguments>

Arguments:

DNAnalyzer takes its input as CLI arguments rather than from stdin. For example:

<executable> assets/dna/random/dnalong.fa --amino=ser --min=0 --max=100 -r

Usage:

<executable> <arguments>

Example:

java -jar build/libs/DNAnalyzer.jar assets/dna/random/dnalong.fa --amino=ser --min=16450 --max=520218 -r

Help message:

Usage: DNAnalyzer [-hrV] --amino=<aminoAcid> [--find=<proteinFile>]
                  [--max=<maxCount>] [--min=<minCount>] DNA
A program to analyze DNA sequences.
      DNA                    The FASTA file to be analyzed.
      --amino=<aminoAcid>    The amino acid representing the start of a gene.
      --find=<proteinFile>   The DNA sequence to be found within the FASTA file.
  -h, --help                 Show this help message and exit.
      --max=<maxCount>       The maximum count of the reading frame.
      --min=<minCount>       The minimum count of the reading frame.
  -r, --reverse              Reverse the DNA sequence before processing.
  -V, --version              Print version information and exit.
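
To make the option syntax in the help message concrete, the following sketch splits an argument list into `--key=value` options, the `-r` flag, and the positional FASTA path. It is purely illustrative: `CliArgsSketch` and its `parse` method are hypothetical, and the real project may well rely on a command-line parsing library instead of hand-written code like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not DNAnalyzer's actual argument parser. */
final class CliArgsSketch {

    /**
     * Collects options into a map. Keys are the option names without the
     * leading dashes; "dna" holds the positional FASTA path and "reverse"
     * is "true" or "false".
     */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("reverse", "false");
        for (String arg : args) {
            if (arg.equals("-r") || arg.equals("--reverse")) {
                options.put("reverse", "true");
            } else if (arg.startsWith("--") && arg.contains("=")) {
                int eq = arg.indexOf('=');
                options.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                options.put("dna", arg); // positional argument: the FASTA file
            }
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {
            "assets/dna/random/dnalong.fa", "--amino=ser", "--min=0", "--max=100", "-r"
        }));
    }
}
```

The sketch does no validation; a real parser would also reject unknown options and check that `--min` and `--max` are integers.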

Future Support and Improvements

GUI

A cross-platform GUI application that will run the algorithms implemented in the software. Currently, the Met CLI serves as a stopgap for this feature. Once the GUI is implemented, the Met CLI will remain the main tool for power users.

Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm performs global sequence alignment: it finds the optimal alignment between two sequences under a given scoring scheme. Exact string-search algorithms such as Boyer-Moore are considerably faster, but they only find exact matches; Needleman-Wunsch tolerates substitutions and gaps, and remains one of the most accurate algorithms for comparing genomic sequences. [8]
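
Since this feature is still planned, the following is only a sketch of the textbook dynamic-programming recurrence, not code from DNAnalyzer. The class name `NeedlemanWunschSketch`, the method `alignmentScore`, and the scoring parameters are all hypothetical; it computes the optimal global alignment score of two sequences and omits the traceback that would recover the alignment itself.

```java
/** Hypothetical helper for illustration; not part of DNAnalyzer's API. */
final class NeedlemanWunschSketch {

    /**
     * Optimal global alignment score of a and b.
     * match and mismatch score a pair of aligned characters; gap scores
     * a character aligned against a gap (normally a negative number).
     */
    static int alignmentScore(String a, String b, int match, int mismatch, int gap) {
        int[][] score = new int[a.length() + 1][b.length() + 1];

        // Aligning a prefix against the empty string costs one gap per character.
        for (int i = 1; i <= a.length(); i++) {
            score[i][0] = i * gap;
        }
        for (int j = 1; j <= b.length(); j++) {
            score[0][j] = j * gap;
        }

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int pair = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = score[i - 1][j - 1] + pair; // align a[i-1] with b[j-1]
                int up = score[i - 1][j] + gap;            // a[i-1] against a gap
                int left = score[i][j - 1] + gap;          // b[j-1] against a gap
                score[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return score[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(alignmentScore("ACGT", "AGT", 1, -1, -1));
    }
}
```

The table takes time and memory proportional to the product of the two sequence lengths, which is why faster heuristics are usually preferred for whole-genome searches.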

Genotype Data for Analysis and Machine Learning Training

This will make it possible to use genotype data from external DNA testing services with DNAnalyzer's algorithm. In the future, all you will need to use this program is a simple $150 DNA test to experience all the features of DNAnalyzer.

DIAMOND Implementation, a BLAST-compatible aligner

This will pair DIAMOND's performance advantage with BLAST-style sequence similarity search.

Data sources:

Contributing

Contributors

DNAnalyzer was developed with the help of the following people:

Citations

[1] Genomic Data Science Fact Sheet. (n.d.). Genome.gov. https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science

[2] DNA and RNA codon tables. (2020, December 13). Wikipedia. https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables

[3] GC-content - an overview | ScienceDirect Topics. (n.d.). Www.sciencedirect.com. https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gc-content

[4] Length matters: Disease implications for long genes. (2013, October 22). Spectrum | Autism Research News. https://www.spectrumnews.org/opinion/viewpoint/length-matters-disease-implications-for-long-genes/

[5] Suter, D. M. (2020). Transcription Factors and DNA Play Hide and Seek. Trends in Cell Biology. https://pubmed.ncbi.nlm.nih.gov/32413318/

[6] What is non-coding DNA?: MedlinePlus Genetics. (n.d.). Medlineplus.gov. https://medlineplus.gov/genetics/understanding/basics/noncodingdna/

[7] BLAST TOPICS. (2019). Nih.gov. https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

[8] Wikipedia Contributors. (2021, March 24). Needleman–Wunsch algorithm. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

[9] Cytogenic Location. (2020, December 13). Wikipedia. https://en.wikipedia.org/wiki/Cytogenetics

Terms of Use

You are entirely responsible for the use of this application, including any and all activities that occur. While the DNAnalyzer Team strives to fix all major bugs that may be either reported by a user or discovered while debugging, they will not be held liable for any loss that the user may incur as a result of using this application, under any circumstances. For further inquiries, please contact the following email address: contact@dnanalyzer.live
