A computational system that analyzes Alzheimer's Disease diagnoses using the gene expression profiles of patients. It uses MapReduce for feature engineering and Spark for gene clustering and classification. By Ada Chen, William Wu, and me for our Big Data class.
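The MapReduce feature-engineering idea can be sketched in plain Python: map each gene's expression value to its cluster label, then reduce by averaging within each cluster. This is a standalone toy illustration, not the project's actual pipeline; the gene names, cluster labels, and expression values below are made up stand-ins for `ROSMAP_RNASeq_entrez.csv` and `gene_cluster.csv`.

```python
# Toy sketch of MapReduce-style feature engineering: collapse one patient's
# per-gene expression values into per-cluster averages.
from collections import defaultdict

gene_to_cluster = {"APOE": 1, "APP": 1, "PSEN1": 2}      # hypothetical cluster labels
patient_expr = {"APOE": 2.0, "APP": 4.0, "PSEN1": 3.0}   # hypothetical expression values

# "Map" step: emit (cluster, expression) pairs
pairs = [(gene_to_cluster[g], v) for g, v in patient_expr.items()]

# "Reduce" step: average the expression values within each cluster
sums = defaultdict(lambda: [0.0, 0])
for cluster, value in pairs:
    sums[cluster][0] += value
    sums[cluster][1] += 1
cluster_features = {c: total / n for c, (total, n) in sums.items()}
print(cluster_features)  # {1: 3.0, 2: 3.0}
```

The real project runs this kind of aggregation on Hadoop so it scales across all patients and the full gene set.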
Make sure that you have Homebrew installed.
- Install Hadoop with Homebrew by typing the following in a terminal window.
brew install hadoop
- Configure your .bashrc file to include:
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.8.0/libexec
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
- Source your .bashrc file:
source ~/.bashrc
- Change into the Hadoop configuration directory:
cd /usr/local/Cellar/hadoop/2.8.0/libexec/etc/hadoop
and configure the following files:
- hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
<description> A base for other temporary directories. </description>
</property>
. . .
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Next, format the HDFS NameNode by typing the following in a terminal window.
hdfs namenode -format
- Create a directory to store your files on HDFS (use a relative path, which resolves to your HDFS home directory under /user):
hdfs dfs -mkdir -p <directory name>
e.g.
hdfs dfs -mkdir inputfiles
Check that the directory was created:
hdfs dfs -ls hdfs://localhost:9000/user/…/<new directory name>
e.g.
hdfs dfs -ls hdfs://localhost:9000/user/Elizabeth/inputfiles
- Place the input files in the new directory:
hdfs dfs -put ~/<file location path on computer> <directory name>/<file name>
e.g.
hdfs dfs -put ~/Documents/BigData/Project2/ROSMAP_RNASeq_entrez.csv inputfiles/ROSMAP_RNASeq_entrez.csv
hdfs dfs -put ~/Documents/BigData/Project2/gene_cluster.csv inputfiles/gene_cluster.csv
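Before uploading, it can save a round trip to sanity-check the CSVs locally, since a ragged row will surface later as a confusing parse error in the job. This helper is not part of the project; it is a minimal sketch using only the standard library, and the file path in the comment is illustrative.

```python
# Quick local sanity check before `hdfs dfs -put`: confirm every row of a CSV
# has the same number of columns as its header row.
import csv

def check_csv(path):
    """Return (number of header columns, list of 1-based line numbers with a
    different column count)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        bad = [i for i, row in enumerate(reader, start=2) if len(row) != len(header)]
    return len(header), bad

# e.g. cols, bad_rows = check_csv("ROSMAP_RNASeq_entrez.csv")
```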
- Run the HDFS master node (NameNode) in a terminal window:
hdfs namenode
- While the namenode is still running, run the HDFS data storage node (DataNode) in a new terminal tab/window:
hdfs datanode
- Finally, in the map_clusters.py file, change the file paths to:
file_rosmap = "hdfs://localhost:9000/user/…/<directory name>/ROSMAP_RNASeq_entrez.csv"
file_gene_cluster = "hdfs://localhost:9000/user/…/<directory name>/gene_cluster.csv"
e.g.
file_rosmap = "hdfs://localhost:9000/user/Elizabeth/inputfiles/ROSMAP_RNASeq_entrez.csv"
file_gene_cluster = "hdfs://localhost:9000/user/Elizabeth/inputfiles/gene_cluster.csv"
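Both paths follow the same pattern, hdfs://localhost:9000/user/&lt;username&gt;/&lt;directory name&gt;/&lt;file name&gt;, so one way to avoid typos is to build them from the pieces. The helper below is a sketch, not part of map_clusters.py; the function name and defaults are our own choices.

```python
# Build the HDFS input URIs from the username, directory, and file name,
# rather than editing two long string literals by hand. The host and port
# match the fs.default.name value set in core-site.xml above.
def hdfs_path(user, directory, filename, host="localhost", port=9000):
    return f"hdfs://{host}:{port}/user/{user}/{directory}/{filename}"

file_rosmap = hdfs_path("Elizabeth", "inputfiles", "ROSMAP_RNASeq_entrez.csv")
file_gene_cluster = hdfs_path("Elizabeth", "inputfiles", "gene_cluster.csv")
```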