dataplayground

workshop for Coderbunker community


Data science playground - 20190126 Coderbunker workshop

General info
💁 This workshop walks through the data science pipeline and has some fun with open source tools, covering data fetching, ETL, data analytics (data processing), and data visualization (data insight generation). At the end I will show how to use a simple machine learning library to build one model (predictive analysis).

"Spark and Zeppelin are a great combination"

Getting started:
  • loopback:v5.0.0 generator-loopback:v6.0.2 loopback-workspace:v4.5.0
  • node: v11.3.0
  • Python: v3.6.5
  • MySQL (of course, feel free to use PostgreSQL instead; we all love it!) 😃
  • PySpark
  • Zeppelin
  • Keras
  • Hadoop 2.9.2
  • Maven
  • libprotoc 2.5.0
  • OpenSSL 1.0.2o_2
  • aws-java-sdk-1.7.4
  • hadoop-aws-2.7.1
Setup
To start this workshop, install everything locally using npm and pip. (I originally planned to use Docker, but my Mac's low storage kept getting in the way.)
I wrote an article on Medium about how to set up LoopBack; please find it here. It might be helpful if you are just getting started with LoopBack.
$ npm install
$ npm install -g loopback-cli (https://loopback.io/doc/en/lb3/) 
$ npm install loopback-connector-mysql --save  (https://www.npmjs.com/package/loopback-connector-mysql) 
$ install Anaconda (https://repo.continuum.io/archive)
$ install Apache Spark 
$ install Java 
$ pip install Keras  
#install hadoop 2.9.2  
$ brew install hadoop (Homebrew installs Hadoop under /usr/local/Cellar/hadoop) 
or $ wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2-src.tar.gz (I recommend this one; otherwise there will be many bugs later when we build the env with Hadoop) 

#config the hadoop 
$ cd /usr/local/opt/hadoop
# edit hadoop-env.sh and add: 
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME="/Library/Java/JavaVirtualMachines/<ADD_JDK_VERSION_HERE>/Contents/Home" 

# core-site.xml 
# later, when we connect to an AWS S3 bucket, we will add the IAM credentials as properties here (see the PySpark sketch after this config block) 
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories</description>             
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
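
For the S3 step mentioned above, the credentials can either go into core-site.xml (as fs.s3a.access.key / fs.s3a.secret.key properties) or be passed to Spark directly. Below is only a minimal sketch; it assumes the hadoop-aws and aws-java-sdk jars listed in Getting started are on Spark's classpath, and the key values and bucket path are placeholders.

```python
# Sketch only: wiring IAM credentials into the s3a connector from PySpark.
# <YOUR_ACCESS_KEY>/<YOUR_SECRET_KEY> and the bucket path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-demo")
    # same effect as adding fs.s3a.access.key / fs.s3a.secret.key to core-site.xml
    .config("spark.hadoop.fs.s3a.access.key", "<YOUR_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<YOUR_SECRET_KEY>")
    .getOrCreate()
)

df = spark.read.csv("s3a://my-bucket/raw/data.csv", header=True, inferSchema=True)
df.printSchema()
```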
	
# mapred-site.xml 
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

# hdfs-site.xml 
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

$ hdfs namenode -format 
$ cd /usr/local/opt/hadoop/sbin (HDFS service scripts) 
$ ./start-dfs.sh 
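
Once start-dfs.sh has run, you can sanity-check the single-node HDFS from PySpark (Spark itself is installed in a later section). This is only a sketch; /user/demo/sample.txt is a placeholder path you would first upload with hdfs dfs -put.

```python
# Read a file back from the local HDFS configured above
# (fs.default.name = hdfs://localhost:8020). The path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-check").getOrCreate()

lines = spark.read.text("hdfs://localhost:8020/user/demo/sample.txt")
print(lines.count())        # number of lines in the file
lines.show(5, truncate=False)
```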
	
#install Maven 
# build the Hadoop env 
$ mvn package -Pdist,native -DskipTests -Dtar 
# from some bugs and configuration issues:  
![picture alt](https://github.com/Chloejay/dataplayground/blob/master/Screen%20Shot%202019-07-03%20at%2020.59.51.png?raw=true) 
# to:  
![picture alt](https://github.com/Chloejay/dataplayground/blob/master/Screen%20Shot%202019-07-03%20at%2021.41.46.png?raw=true) 
Install Spark on Mac
  • First download Java
  • note: please don't install a Java version newer than 8, which will cause bugs later such as java.lang.IllegalArgumentException: Unsupported class file major version 55; it's better to install from the Java SE Development Kit 8 site and choose the package for your OS.
  • Go to the Apache Spark website
  • Choose a Spark release and directly download
  • Go to your home directory with the command below
    $ cd ~
  • Unzip the folder in your home directory using the following command

  • $ tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
  • Use the following command to see that you have a .bash_profile

  • $ ls -a
  • Configure Spark by editing .bash_profile

  • $ vim .bash_profile

    export SPARK_PATH=~/spark-2.4.0-bin-hadoop2.7
    export PYSPARK_DRIVER_PYTHON="jupyter"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    # For Python 3, you have to add the line below or you will get an error
    export PYSPARK_PYTHON=python3
    alias jupyter_notebook='$SPARK_PATH/bin/pyspark --master local[2]'
    export JAVA_HOME=$(/usr/libexec/java_home)

    export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
    export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH

    $ source .bash_profile

  • then run the command below to check that PySpark is installed

  • $ jupyter_notebook

  • open a Jupyter notebook (with PySpark) from the command line; a quick sanity-check sketch follows below
  • $ cd spark-2.4.0-bin-hadoop2.7 && bin/pyspark
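
If everything is wired up, the following minimal snippet confirms the install; in the pyspark shell or notebook a `spark` session already exists, otherwise create one as shown.

```python
# Quick check that the Spark installation works end to end.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity-check").getOrCreate()
print(spark.version)   # e.g. 2.4.0

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.show()
```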

What is Spark? A data processing engine focused on the in-memory distributed computing use case. Its basic operations are transformations and actions (see the PySpark sketch after the list below):

  • Transformations are operations on RDDs that return a new RDD, e.g. map() and filter().
  • Actions are operations that return a result to the driver program or write it to storage, and they kick off the actual computation, e.g. count() and first().
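
A minimal sketch of the difference, using a toy RDD (run inside the pyspark shell or a notebook):

```python
# Transformations (map, filter) only build a lazy lineage;
# actions (count, first) trigger the actual computation.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: still lazy

print(evens.count())   # action -> 2 (triggers the job)
print(evens.first())   # action -> 4
```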

Zeppelin: basic operators (maps, joins, filters, etc.), and Spark as a tool for data exploration through notebooks and workflows. A small DataFrame example follows below.
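
To make the "maps, joins, filters" point concrete, here is a small exploration sketch over hypothetical orders/users data; the same code runs in a Zeppelin %pyspark paragraph or a Jupyter notebook.

```python
# Hypothetical data: join, filter, and aggregate with DataFrames.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("exploration-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "alice", 120.0), (2, "bob", 35.5), (3, "alice", 60.0)],
    ["order_id", "user", "amount"],
)
users = spark.createDataFrame(
    [("alice", "Shanghai"), ("bob", "Beijing")],
    ["user", "city"],
)

report = (
    orders.join(users, on="user", how="inner")   # join
          .filter(F.col("amount") > 50)          # filter
          .groupBy("city")
          .agg(F.sum("amount").alias("total_amount"))
)
report.show()   # Shanghai | 180.0
```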

Further reading material 📗
SQL
Nodejs
Spark
Machine learning