This project aims to provide an efficient parallel implementation of a substring matching algorithm and to apply it to searching for genes within the genome of an organism.
This repository contains the code for the serial and parallel (Open MPI) implementations, written in C.
Download Open MPI from the official Open MPI website and follow the official guide to install it. This project was developed using Open MPI 4.1.5, but any compatible version should work.
The datasets can be downloaded from the NCBI database. After downloading the genome and gene files you are interested in, copy the gene (pattern) you want to search for into a separate .txt file.
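Sequence files downloaded from NCBI are commonly in FASTA format, where a header line starting with `>` precedes the sequence and the sequence itself is wrapped over many lines. The sketch below shows one way to clean such text before matching. It assumes FASTA input, and `strip_fasta` is a hypothetical helper, not part of this repository; this README does not state whether the programs expect headers and line breaks to be removed beforehand.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the repository): compacts a
 * FASTA-style buffer in place, dropping header lines (those that
 * start with '>') and all line breaks.
 * Returns the length of the cleaned, NUL-terminated sequence. */
size_t strip_fasta(char *buf)
{
    size_t out = 0;          /* next write position */
    int at_line_start = 1;   /* true right after a line break */
    int in_header = 0;       /* true while inside a '>' header line */

    for (size_t i = 0; buf[i] != '\0'; i++) {
        char c = buf[i];
        if (c == '\n' || c == '\r') {
            at_line_start = 1;
            in_header = 0;
            continue;        /* line breaks are never copied */
        }
        if (at_line_start && c == '>')
            in_header = 1;
        at_line_start = 0;
        if (!in_header)
            buf[out++] = c;  /* keep sequence characters only */
    }
    buf[out] = '\0';
    return out;
}
```

Called on a buffer holding the whole file contents, this leaves only the sequence characters, so a pattern that was wrapped across lines in the original file can still be matched as one contiguous string.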
The author's experiments used the following datasets:
- Escherichia coli: a bacterium commonly found in the lower intestine of warm-blooded organisms, 4.6 MB
- Entamoeba invadens: an amoebozoan parasite of reptiles, 40.9 MB
- Formica exsecta: a species of ant found from Western Europe to Asia, 277.6 MB
To compile and run the serial version of the program, move into the /src/serial directory and execute the following commands:
$ gcc main.cpp
$ ./a.out [genomePath] [patternPath]
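For readers unfamiliar with substring matching, the following is a minimal sketch of a naive serial search that counts how many times a pattern occurs in a text. It is an illustration only: this README does not say which matching algorithm the repository implements, and `count_matches` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration (not the repository's algorithm):
 * counts occurrences of pattern (length m) in text (length n),
 * including overlapping ones, by comparing at every position. */
size_t count_matches(const char *text, size_t n,
                     const char *pattern, size_t m)
{
    size_t count = 0;

    if (m == 0 || m > n)
        return 0;            /* nothing can match */

    for (size_t i = 0; i + m <= n; i++) {
        if (memcmp(text + i, pattern, m) == 0)
            count++;
    }
    return count;
}
```

This approach takes time proportional to n times m in the worst case, which is why faster algorithms and parallel execution matter for genomes of hundreds of megabytes.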
To compile and run the parallel version of the program, move into the /src/parallel directory and execute the following commands, where [nodes] is the number of MPI processes to launch:
$ mpicc main.cpp
$ mpirun -n [nodes] ./a.out [genomePath] [patternPath]
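A common way to parallelize substring matching is to give each process a contiguous slice of the genome, extended by pattern length minus one characters so that matches straddling a slice boundary are not lost. The sketch below computes such a slice. It is an assumption about how the work could be divided, not a description of this repository's code; `chunk_t` and `chunk_for_rank` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical slice descriptor: offset into the genome and
 * number of characters the process should read. */
typedef struct {
    size_t start;
    size_t len;
} chunk_t;

/* Hypothetical helper: slice for process `rank` out of `nprocs`
 * when a text of length n is searched for a pattern of length m.
 * Each process owns the matches that START in its share of the
 * text and reads m - 1 extra characters past that share, so a
 * match crossing a boundary is seen exactly once. */
chunk_t chunk_for_rank(size_t n, size_t m, int rank, int nprocs)
{
    size_t p = (size_t)nprocs;
    size_t r = (size_t)rank;
    size_t base = n / p;     /* minimum share per process */
    size_t rem = n % p;      /* first `rem` ranks get one extra */

    size_t start = r * base + (r < rem ? r : rem);
    size_t own = base + (r < rem ? 1 : 0);
    size_t len = own + (m > 0 ? m - 1 : 0);

    if (start + len > n)     /* clamp the overlap at the end */
        len = n - start;

    chunk_t c = { start, len };
    return c;
}
```

In an MPI program, `rank` and `nprocs` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`, and the per-process match counts could be combined with `MPI_Reduce`.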