The goal of this project is to implement a machine learning algorithm for the taxonomic classification of DNA sequences based on their codon usage frequencies. To this end, the kNN (k-nearest neighbors) and Random Forest algorithms are compared in terms of accuracy and precision. These two criteria are used to assess how suitable each algorithm is for classifying DNA sequences into phylogenetic domains based on codon usage frequency. Biological background knowledge was used to solve the problem as well as possible.
Source: Codon Usage Bias Levels Predict Taxonomic Identity and Genetic Composition (Khomtchouk 2020)
The data used are available at the original source.
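
The codon usage frequency features and the two evaluation criteria named above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline. The function names (`codon_frequencies`, `accuracy`, `macro_precision`), the choice of reading frame 0, the macro-averaged form of precision, and the example labels are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to a vector of equal length.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(seq):
    """Return the relative frequency of each of the 64 codons (reading frame 0)."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Triplets with ambiguous bases (e.g. N) are skipped.
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision; a class never predicted scores 0."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        # True labels of all samples that were predicted as class c.
        truths = [t for t, p in zip(y_true, y_pred) if p == c]
        scores.append(truths.count(c) / len(truths) if truths else 0.0)
    return sum(scores) / len(scores)


# Hypothetical example: one short sequence and four made-up predictions.
freqs = codon_frequencies("ATGGCCATGTAA")
y_true = ["bacteria", "bacteria", "archaea", "eukaryota"]
y_pred = ["bacteria", "archaea", "archaea", "eukaryota"]
acc = accuracy(y_true, y_pred)
prec = macro_precision(y_true, y_pred)
```

Each sequence becomes a 64-dimensional feature vector that a classifier such as kNN or Random Forest can take as input. The predictions of both classifiers on the same held-out data can then be scored with the two metric functions to compare them.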