This is our implementation of a distributed D2 distance algorithm for k-mers, built on the Apache Hadoop framework.
Among the several alignment-free methods for calculating the similarity between two strings, we picked one (D2) that is based on word statistics, specifically the frequency of each word in a sequence.
Once all the k-mers occurring in the two sequences have been determined, the distance between the sequences is calculated with the D2 function.
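As a minimal sketch of this step, the Java snippet below counts overlapping k-mers in two short strings and computes a D2 score from the counts. It assumes D2 is the sum of squared differences between the k-mer counts of the two sequences; some literature defines D2 as the inner product of the count vectors instead, and this README does not state which variant the project uses. The class and method names (`D2Sketch`, `countKmers`, `d2`) are hypothetical and are not taken from the repository.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the repository's code.
final class D2Sketch {

    /** Counts every overlapping k-mer of length k in seq. */
    static Map<String, Long> countKmers(String seq, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1L, Long::sum);
        }
        return counts;
    }

    /**
     * ASSUMPTION: D2 is the sum over all k-mers of the squared
     * difference between the two occurrence counts.
     */
    static long d2(Map<String, Long> x, Map<String, Long> y) {
        // A k-mer missing from one sequence has count 0 there,
        // so iterate over the union of both key sets.
        Set<String> words = new HashSet<>(x.keySet());
        words.addAll(y.keySet());
        long sum = 0;
        for (String w : words) {
            long diff = x.getOrDefault(w, 0L) - y.getOrDefault(w, 0L);
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> a = countKmers("ACGTAC", 3);
        Map<String, Long> b = countKmers("ACGTTT", 3);
        System.out.println("D2 = " + d2(a, b));
    }
}
```

In the project itself the counts come from KMC rather than from an in-memory loop; the loop here only keeps the example self-contained.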
Both the sequential and the distributed implementations of the D2 algorithm take the output of the KMC tool as input. KMC counts the k-mers in one or more genomic sequences; in our case, it was used for every k from 3 to 13. For each sequence, a k-mer occurrence file was generated.
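To illustrate how such an occurrence file might be loaded, the sketch below parses lines of the form `k-mer<whitespace>count` into a map. This layout is an assumption: it matches a plain-text dump of k-mer counts, but the README does not specify the exact file format the project reads. The class name `KmerOccurrenceParser` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the real input format may differ.
final class KmerOccurrenceParser {

    /**
     * ASSUMPTION: each non-empty line holds a k-mer and its count,
     * separated by whitespace (for example a tab).
     */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            counts.put(fields[0], Long.parseLong(fields[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("AAA\t3", "ACG\t1", "");
        System.out.println(parse(lines));
    }
}
```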
The distributed D2 implementation consists of a first MapReduce phase, which reads the k-mer occurrences from the KMC output files and calculates partial D2 scores, and an optional second phase, which sums the partial scores when more than one task has been created.
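The two phases can be sketched in plain Java, without the Hadoop API, as follows. Each simulated task computes a partial score over its own split of the k-mer counts, and a final step sums the partial scores. The sketch assumes that each split holds, for every k-mer in it, the pair of counts from the two sequences, and that the score is the sum of squared differences; both are assumptions, and `TwoPhaseD2Sketch`, `partialScore`, and `combine` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two phases; no Hadoop classes are used.
final class TwoPhaseD2Sketch {

    /**
     * Phase 1 (one task): partial score over one split.
     * ASSUMPTION: each entry is {countInFirstSequence, countInSecondSequence}.
     */
    static long partialScore(List<long[]> split) {
        long partial = 0;
        for (long[] pair : split) {
            long diff = pair[0] - pair[1];
            partial += diff * diff;
        }
        return partial;
    }

    /** Phase 2: sum the partial scores produced by all tasks. */
    static long combine(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        List<long[]> split1 = Arrays.asList(new long[]{3, 1}, new long[]{0, 2});
        List<long[]> split2 = Arrays.asList(new long[]{5, 5}, new long[]{1, 0});
        List<Long> partials = Arrays.asList(partialScore(split1), partialScore(split2));
        System.out.println("total = " + combine(partials));
    }
}
```

Because the score is a plain sum over k-mers, the total does not depend on how the k-mers are divided among tasks, which is why the second phase is only needed when more than one task exists.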
Test cluster machines had the following configuration:
- CPU: Intel Xeon E3-12xx v2 (Ivy Bridge), 8 cores
- RAM: 32 GB
- OS: Ubuntu 16.04.4 LTS
Each Hadoop node had the following configuration:
yarn-site.xml
- yarn.nodemanager.resource.memory-mb: 30720
- yarn.nodemanager.resource.cpu-vcores: 8
hdfs-site.xml
- dfs.replication: 1
- dfs.blocksize: 64m
mapred-site.xml
- mapreduce.map.memory.mb: 4096
- mapreduce.reduce.memory.mb: 7168
- mapreduce.map.java.opts: -Xmx3276M
- mapreduce.reduce.java.opts: -Xmx5734M
- mapreduce.[map|reduce].cpu.vcores: 2
The bgp-d2 repository consists of three main folders:
Both the sequential and the distributed projects can be built by running the following command:
mvn clean compile javadoc:javadoc