alzheimers-diagnosis-mapreduce

A computational system that analyzes Alzheimer's Disease diagnoses using patients' gene expression profiles. It uses MapReduce for feature engineering and Spark for gene clustering and classification. Written by Ada Chen, William Wu, and me for our Big Data class.

Table of Contents

  Requirements
  HDFS Setup
  Spark Setup
  Running the Program
  Running with AWS
  File List

Requirements

Make sure that you have Homebrew installed.

HDFS Setup

  1. Install Hadoop with Homebrew by typing the following in a terminal window:

    brew install hadoop
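
    Homebrew may install a version newer than 2.8.0. You can list what was actually installed, since the version determines the paths used in the next steps:

    ls /usr/local/Cellar/hadoop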
  2. Configure your .bashrc file to include the following (adjust 2.8.0 if Homebrew installed a different version):

    export HADOOP_HOME=/usr/local/Cellar/hadoop/2.8.0/libexec
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  3. Source your .bashrc file:

    source ~/.bashrc
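
    To confirm that the Hadoop commands are now available, print the version:

    hadoop version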
  4. Change directory with cd /usr/local/Cellar/hadoop/2.8.0/libexec/etc/hadoop and configure the following files:

    hdfs-site.xml (sets the block replication factor to 1, which is appropriate for a single-node setup):
<configuration>
 <property>
     <name>dfs.replication</name>
     <value>1</value>
 </property>
</configuration>
    core-site.xml (sets the HDFS temp directory and the NameNode address; fs.default.name is the older spelling of fs.defaultFS, which Hadoop 2.x still accepts):
<configuration>
 <property> 
      <name>hadoop.tmp.dir</name>
      <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
      <description> A base for other temporary directories. </description>
 </property>
  <!-- ... -->
 <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
 </property>
</configuration>
  5. Next, format the HDFS NameNode by typing the following in a terminal window:

    hdfs namenode -format
  6. Create a directory to store your files on HDFS (a path without a leading slash is created under your HDFS home directory, /user/<username>):

    hdfs dfs -mkdir -p <directory name>
    

    e.g. hdfs dfs -mkdir inputfiles

    To check that the directory was created:

    hdfs dfs -ls hdfs://localhost:9000/user/…/<new directory name>
    

    e.g. hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles

  7. Place the input files in the new directory:

    hdfs dfs -put <path to file on your computer> <directory name>/<file name>
    

    e.g.

    hdfs dfs -put ~/Documents/BigData/Project2/ROSMAP_RNASeq_entrez.csv inputfiles/ROSMAP_RNASeq_entrez.csv
    hdfs dfs -put ~/Documents/BigData/Project2/gene_cluster.csv inputfiles/gene_cluster.csv
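
    To check that both files were copied:

    hdfs dfs -ls <directory name>

    e.g. hdfs dfs -ls inputfiles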
    
  8. Run the HDFS NameNode in a terminal window:

    hdfs namenode
    
  9. While the NameNode is still running, run the HDFS DataNode in a new terminal tab/window:

    hdfs datanode
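
    To confirm that both daemons are up, list the running Java processes; NameNode and DataNode should both appear:

    jps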
    
  10. Finally, in the map_clusters.py file, change the file paths to:

    file_rosmap = "hdfs://localhost:9000/user/…/<directory name>/ROSMAP_RNASeq_entrez.csv"
    file_gene_cluster = "hdfs://localhost:9000/user/…/<directory name>/gene_cluster.csv"
    

    e.g.

    file_rosmap = "hdfs://localhost:9000/user/Elizabeth/inputfiles/ROSMAP_RNASeq_entrez.csv"
    file_gene_cluster = "hdfs://localhost:9000/user/Elizabeth/inputfiles/gene_cluster.csv"
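
    Once HDFS is running, you can verify that these URLs resolve before launching the job (shown with the example paths above):

    hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles/ROSMAP_RNASeq_entrez.csv
    hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles/gene_cluster.csv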
    

Spark Setup

Running the Program

Running with AWS

File List