
Market Basket Analysis with Hadoop


## Computing the frequency of microRNA cooperation using the MapReduce framework

microRNAs are small non-coding molecules produced by the genome to regulate its activity. These small ribonucleic acid (RNA) molecules are 22-24 nucleotides long and can recognize, by base-pairing (A binds U, C binds G), complementary sequences on messenger RNAs (mRNAs). When a microRNA (miR) binds its complementary target sequences on various mRNAs, it inhibits the translation of those mRNAs into proteins, such as cellular enzymes.

Importantly, mRNAs can be targeted by many microRNAs, and the cooperation of these microRNAs can have a major impact on the regulation of their targets.

Can we define groups of microRNAs that cooperate frequently?

The specific task of finding associations between items can be solved by association rule learning. The technique known as “Market Basket Analysis” (MBA) aims at computing how frequently items co-occur in transactions. Here the items are the microRNAs and the transactions are the mRNAs. I will use the microRNA-mRNA predictions from TargetScan.

Here is a sample:

NM_000014	A2M	AAAGAAU	10090	0	0	0	0	1	0	1	0	mmu-miR-186	-0.078	NULL
NM_000014	A2M	AAUCUCU	10090	0	0	0	0	1	0	1	0	mmu-miR-216b	-0.188	0.073

The second column is the mRNA gene’s name and the 13th column is the name of the microRNA predicted to interact with the mRNA. From this database I selected a sample of 2000 transactions out of 2 × 10^6 (sample-mmu-miR.txt).

The first step is to generate the transactions: mRNA miR1 … miRn. The map task is performed by generateTransactionMapper.java and the reduce task by generateTransactionsReducer.java.

Here is the call to Hadoop:

hadoop jar generateTransactions.jar fr.cnrs.igmm.mg.generateTransactionsDriver input/sample-mmu-miR.txt list-miR

This produces the following result (list-miR.txt):

A2M	mmu-miR-186	mmu-miR-216b	mmu-miR-291a-5p	mmu-miR-128	mmu-miR-326	mmu-miR-327	mmu-miR-494	mmu-miR-760-3p	mmu-miR-673-5p	mmu-miR-27a	
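
The grouping step above can be sketched in plain Java, without Hadoop, to show what the map and reduce phases compute. This is only an illustration: the class `TransactionSketch` and its methods are hypothetical names, not the code in generateTransactionMapper.java or generateTransactionsReducer.java, and dropping duplicate microRNAs for a gene is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the map and reduce steps; not the repository's code.
class TransactionSketch {

    // "Map" step: extract (gene name, microRNA name) from one TargetScan line.
    // Columns are tab-separated; the gene is column 2 and the microRNA column 13.
    static String[] mapLine(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 13) {
            return null; // skip lines that do not have enough columns
        }
        return new String[] { cols[1], cols[12] };
    }

    // "Reduce" step: collect every microRNA seen for the same gene.
    // Assumption: a microRNA listed twice for a gene is kept only once.
    static Map<String, List<String>> buildTransactions(List<String> lines) {
        Map<String, List<String>> transactions = new LinkedHashMap<>();
        for (String line : lines) {
            String[] pair = mapLine(line);
            if (pair == null) {
                continue;
            }
            List<String> miRs = transactions.computeIfAbsent(pair[0], k -> new ArrayList<>());
            if (!miRs.contains(pair[1])) {
                miRs.add(pair[1]);
            }
        }
        return transactions;
    }

    public static void main(String[] args) {
        // The two sample lines shown earlier in this document.
        List<String> lines = Arrays.asList(
            "NM_000014\tA2M\tAAAGAAU\t10090\t0\t0\t0\t0\t1\t0\t1\t0\tmmu-miR-186\t-0.078\tNULL",
            "NM_000014\tA2M\tAAUCUCU\t10090\t0\t0\t0\t0\t1\t0\t1\t0\tmmu-miR-216b\t-0.188\t0.073");
        buildTransactions(lines).forEach((gene, miRs) ->
            System.out.println(gene + "\t" + String.join("\t", miRs)));
    }
}
```

Run on the two sample lines, the sketch prints one transaction for A2M containing mmu-miR-186 and mmu-miR-216b, which matches the start of the list-miR.txt line above.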

Then I used this list of transactions to perform the Market Basket Analysis implemented by Mahmoud Parsian.

I used the following call to Hadoop:

hadoop jar MBA.jar org.dataalgorithms.chap07.mapreduce.MBADriver list-miR/part-r-00000 output 2

This computes the number of transactions in which each pair of items occurs.

mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | head
15/04/01 14:33:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[mmu-miR-101a, mmu-miR-101c]	1
[mmu-miR-101a, mmu-miR-1188]	1
[mmu-miR-101a, mmu-miR-1190]	1
[mmu-miR-101a, mmu-miR-1194]	1
[mmu-miR-101a, mmu-miR-1197]	1
[mmu-miR-101a, mmu-miR-1198-3p]	1
[mmu-miR-101a, mmu-miR-1224]	1
[mmu-miR-101a, mmu-miR-1249]	1
[mmu-miR-101a, mmu-miR-125a-3p]	1
[mmu-miR-101a, mmu-miR-125a-5p]	1
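
The pair counting behind this output can be illustrated with a small in-memory sketch. It is not the MBADriver implementation: `PairCountSketch`, the made-up transactions, and the choice to sort items before pairing are all assumptions, as is reading the final argument `2` of the call as the number of items per tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory version of pair counting; not the MBA.jar code.
class PairCountSketch {

    // Emit every unordered pair of distinct items in one transaction.
    // Sorting first makes (a, b) and (b, a) produce the same key.
    static List<String> pairsOf(List<String> items) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(items));
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                pairs.add("[" + sorted.get(i) + ", " + sorted.get(j) + "]");
            }
        }
        return pairs;
    }

    // Count, for each pair, the number of transactions that contain it.
    static Map<String, Integer> countPairs(List<List<String>> transactions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> transaction : transactions) {
            for (String pair : pairsOf(transaction)) {
                counts.merge(pair, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up transactions (gene names already removed), for illustration only.
        List<List<String>> transactions = Arrays.asList(
            Arrays.asList("miR-a", "miR-b", "miR-c"),
            Arrays.asList("miR-b", "miR-a"),
            Arrays.asList("miR-c", "miR-d"));
        countPairs(transactions).forEach((pair, n) -> System.out.println(pair + "\t" + n));
    }
}
```

With these three made-up transactions, the pair [miR-a, miR-b] is counted twice and the three other pairs once each.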

Then the frequency can be easily computed by:

mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "1" | wc -l
61194
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "2" | wc -l
16142
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "3" | wc -l
3376
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "4" | wc -l
504
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "5" | wc -l
69
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "6" | wc -l
7
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | cut -f 2 | grep "7" | wc -l
0
mgirardot@rfl-bioinfo:~/Bureau$ hadoop fs -cat output/part* | wc -l
81294

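| Transactions per pair | Total | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Number of pairs | 81294 | 61194 | 16142 | 3376 | 504 | 69 | 7 | 0 |
| Frequency | | 0.75 | 0.19 | 0.04 | 0.006 | 0.0008 | 8e-5 | 0 |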

This shows that the vast majority of microRNA pairs (75%) appear only once in this sample of 2000 predictions. However, a substantial proportion of miR pairs (25%) appear on 2 or more targets.
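
The same tally can be computed in one pass by parsing the count column as a number. One reason to do so: `grep "1"` matches any count containing the digit 1 (10, 11, 21, ...), which is harmless here only because no count exceeds 6. The sketch below is a hypothetical helper, not part of the repository; it assumes each input line has the form pair, tab, count, as in the output shown earlier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: histogram of exact pair counts.
class CountHistogramSketch {

    // For each exact count value, tally how many pairs have that count.
    static Map<Integer, Integer> histogram(List<String> lines) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a "pair<TAB>count" line
            }
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            hist.merge(count, 1, Integer::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        // The first two lines come from the output above; the third is made up
        // to show that a count of 12 is not confused with a count of 1.
        List<String> lines = Arrays.asList(
            "[mmu-miR-101a, mmu-miR-101c]\t1",
            "[mmu-miR-101a, mmu-miR-1188]\t1",
            "[miR-x, miR-y]\t12");
        int total = lines.size();
        // Columns printed: count, number of pairs, frequency.
        histogram(lines).forEach((count, pairs) ->
            System.out.printf(Locale.ROOT, "%d\t%d\t%.4f%n", count, pairs, (double) pairs / total));
    }
}
```

In practice the lines would come from the `hadoop fs -cat output/part*` stream rather than a hard-coded list.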

Conclusion:

The Market Basket Analysis performed with the Hadoop framework makes it possible to find groups of microRNAs that cooperate frequently. However, this analysis, performed here for pairs, becomes very inefficient for tuples of 3 or more microRNAs because the number of combinations to consider grows combinatorially with tuple size.
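
To give a sense of that growth, the number of k-item tuples that a single transaction of n microRNAs generates is the binomial coefficient C(n, k). The sketch below is illustrative only; `CombinationGrowthSketch` is a hypothetical name, and n = 10 is taken from the A2M transaction shown earlier, which lists 10 microRNAs.

```java
// Hypothetical helper: how many k-tuples one transaction of n items produces.
class CombinationGrowthSketch {

    // C(n, k) = n! / (k! (n - k)!), computed incrementally so that every
    // intermediate value is itself a binomial coefficient (and thus exact).
    static long choose(int n, int k) {
        if (k < 0 || k > n) {
            return 0;
        }
        k = Math.min(k, n - k);
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 10; // size of the A2M transaction shown earlier
        for (int k = 2; k <= 5; k++) {
            System.out.println("k=" + k + ": " + choose(n, k));
        }
    }
}
```

For n = 10 this gives 45 pairs, 120 triples, 210 quadruples and 252 quintuples per transaction, and each of these tuples must be emitted and counted across all transactions.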