alzheimers-diagnosis-mapreduce

A computational system that analyzes Alzheimer's Disease diagnoses using patients' gene expression profiles. It uses MapReduce for feature engineering and Spark for gene clustering and classification. Written by Ada Chen, William Wu, and me for our Big Data class.

Table of Contents

  Requirements
  HDFS Setup
  Spark Setup
  Running the Program
  Running with AWS
  File List

Requirements

Make sure that you have Homebrew installed.

HDFS Setup

  1. Install Hadoop with Homebrew by typing the following in a terminal window:

    brew install hadoop
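
    Homebrew may install a version newer than 2.8.0. You can list what was actually installed, since the version determines the paths used in the next steps:

    ls /usr/local/Cellar/hadoop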
  2. Configure your .bashrc file to include the following (adjust 2.8.0 if Homebrew installed a different version):

    export HADOOP_HOME=/usr/local/Cellar/hadoop/2.8.0/libexec
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  3. Source your .bashrc file:

    source ~/.bashrc
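
    To confirm that the Hadoop commands are now available, print the version:

    hadoop version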
  4. Change directory with cd /usr/local/Cellar/hadoop/2.8.0/libexec/etc/hadoop and configure the following files:

    hdfs-site.xml (sets the block replication factor to 1, which is appropriate for a single-node setup):
<configuration>
 <property>
     <name>dfs.replication</name>
     <value>1</value>
 </property>
</configuration>
    core-site.xml (sets the HDFS temp directory and the NameNode address; fs.default.name is the older spelling of fs.defaultFS, which Hadoop 2.x still accepts):
<configuration>
 <property> 
      <name>hadoop.tmp.dir</name>
      <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
      <description> A base for other temporary directories. </description>
 </property>
  <!-- ... -->
 <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
 </property>
</configuration>
  5. Next, format the HDFS NameNode by typing the following in a terminal window:

    hdfs namenode -format
  6. Create a directory to store your files on HDFS (a path without a leading slash is created under your HDFS home directory, /user/<username>):

    hdfs dfs -mkdir -p <directory name>
    

    e.g. hdfs dfs -mkdir inputfiles

    To check that the directory was created:

    hdfs dfs -ls hdfs://localhost:9000/user/…/<new directory name>
    

    e.g. hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles

  7. Place the input files in the new directory:

    hdfs dfs -put <path to file on your computer> <directory name>/<file name>
    

    e.g.

    hdfs dfs -put ~/Documents/BigData/Project2/ROSMAP_RNASeq_entrez.csv inputfiles/ROSMAP_RNASeq_entrez.csv
    hdfs dfs -put ~/Documents/BigData/Project2/gene_cluster.csv inputfiles/gene_cluster.csv
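
    To check that both files were copied:

    hdfs dfs -ls <directory name>

    e.g. hdfs dfs -ls inputfiles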
    
  8. Run the HDFS NameNode in a terminal window:

    hdfs namenode
    
  9. While the NameNode is still running, run the HDFS DataNode in a new terminal tab/window:

    hdfs datanode
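
    To confirm that both daemons are up, list the running Java processes; NameNode and DataNode should both appear:

    jps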
    
  10. Finally, in the map_clusters.py file, change the file paths to:

    file_rosmap = "hdfs://localhost:9000/user/…/<directory name>/ROSMAP_RNASeq_entrez.csv"
    file_gene_cluster = "hdfs://localhost:9000/user/…/<directory name>/gene_cluster.csv"
    

    e.g.

    file_rosmap = "hdfs://localhost:9000/user/Elizabeth/inputfiles/ROSMAP_RNASeq_entrez.csv"
    file_gene_cluster = "hdfs://localhost:9000/user/Elizabeth/inputfiles/gene_cluster.csv"
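
    Once HDFS is running, you can verify that these URLs resolve before launching the job (shown with the example paths above):

    hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles/ROSMAP_RNASeq_entrez.csv
    hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles/gene_cluster.csv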
    

Spark Setup

Running the Program

Running with AWS

File List