dataplayground

workshop for Coderbunker community


Data science playground - 20190126 Coderbunker workshop

General info
💁 This workshop walks through the data science pipeline and has some fun with open source tools, covering data fetching, ETL, data analytics (data processing), and data visualization (data insight generation). At the end I will show how to use a simple machine learning library to build one model (predictive analysis).

"Spark and Zeppelin are a great combination"

Getting started:
  • loopback:v5.0.0 generator-loopback:v6.0.2 loopback-workspace:v4.5.0
  • node: v11.3.0
  • Python: v3.6.5
  • MySQL (of course, feel free to use PostgreSQL instead; we all love it!) 😃
  • PySpark
  • Zeppelin
  • Keras
  • Hadoop 2.9.2
  • Maven
  • libprotoc 2.5.0
  • OpenSSL 1.0.2o_2
  • aws-java-sdk-1.7.4
  • hadoop-aws-2.7.1
Setup
To start this workshop, install everything locally using npm and pip. (I originally planned to use Docker, but my Mac's low storage kept getting in the way.)
I wrote an article on Medium about how to set up LoopBack; please find it here. It might be helpful if you are just getting started with LoopBack.
$ npm install
$ npm install -g loopback-cli (https://loopback.io/doc/en/lb3/) 
$ npm install loopback-connector-mysql --save  (https://www.npmjs.com/package/loopback-connector-mysql) 
$ install Anaconda (https://repo.continuum.io/archive)
$ install Apache Spark 
$ install Java 
$ pip install Keras  
#install hadoop 2.9.2  
$ brew install hadoop (Homebrew installs Hadoop under /usr/local/Cellar/hadoop) 
or $ wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2-src.tar.gz (I recommend this one; otherwise there will be many bugs later when we build the env with Hadoop) 

#config the hadoop 
$ cd /usr/local/opt/hadoop
# edit hadoop-env.sh and add: 
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME="/Library/Java/JavaVirtualMachines/<ADD_JDK_VERSION_HERE>/Contents/Home" 

# core-site.xml 
# later, when we connect to an AWS S3 bucket, we will add the IAM credentials as properties here (see the PySpark sketch after this config block) 
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories</description>             
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
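
For the S3 step mentioned above, the credentials can either go into core-site.xml (as fs.s3a.access.key / fs.s3a.secret.key properties) or be passed to Spark directly. Below is only a minimal sketch; it assumes the hadoop-aws and aws-java-sdk jars listed in Getting started are on Spark's classpath, and the key values and bucket path are placeholders.

```python
# Sketch only: wiring IAM credentials into the s3a connector from PySpark.
# <YOUR_ACCESS_KEY>/<YOUR_SECRET_KEY> and the bucket path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-demo")
    # same effect as adding fs.s3a.access.key / fs.s3a.secret.key to core-site.xml
    .config("spark.hadoop.fs.s3a.access.key", "<YOUR_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<YOUR_SECRET_KEY>")
    .getOrCreate()
)

df = spark.read.csv("s3a://my-bucket/raw/data.csv", header=True, inferSchema=True)
df.printSchema()
```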
	
# mapred-site.xml 
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

# hdfs-site.xml 
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

$ hdfs namenode -format 
$ cd /usr/local/opt/hadoop/sbin (HDFS service scripts) 
$ ./start-dfs.sh 
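
Once start-dfs.sh has run, you can sanity-check the single-node HDFS from PySpark (Spark itself is installed in a later section). This is only a sketch; /user/demo/sample.txt is a placeholder path you would first upload with hdfs dfs -put.

```python
# Read a file back from the local HDFS configured above
# (fs.default.name = hdfs://localhost:8020). The path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-check").getOrCreate()

lines = spark.read.text("hdfs://localhost:8020/user/demo/sample.txt")
print(lines.count())        # number of lines in the file
lines.show(5, truncate=False)
```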
	
#install Maven 
# build the Hadoop env 
$ mvn package -Pdist,native -DskipTests -Dtar 
# from some bugs and configuration issues:  
![picture alt](https://github.com/Chloejay/dataplayground/blob/master/Screen%20Shot%202019-07-03%20at%2020.59.51.png?raw=true) 
# to:  
![picture alt](https://github.com/Chloejay/dataplayground/blob/master/Screen%20Shot%202019-07-03%20at%2021.41.46.png?raw=true) 
Install Spark on Mac
  • First download Java
  • note: please don't install a Java version newer than 8, which will cause bugs later such as java.lang.IllegalArgumentException: Unsupported class file major version 55; it's better to install from the Java SE Development Kit 8 site and choose the package for your OS.
  • Go to the Apache Spark website
  • Choose a Spark release and directly download
  • Go to your home directory with the command below
    $ cd ~
  • Unzip the folder in your home directory using the following command

  • $ tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
  • Use the following command to see that you have a .bash_profile

  • $ ls -a
  • Configure Spark by editing .bash_profile

  • $ vim .bash_profile

    export SPARK_PATH=~/spark-2.4.0-bin-hadoop2.7
    export PYSPARK_DRIVER_PYTHON="jupyter"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    # For Python 3, you have to add the line below or you will get an error
    export PYSPARK_PYTHON=python3
    alias jupyter_notebook='$SPARK_PATH/bin/pyspark --master local[2]'
    export JAVA_HOME=$(/usr/libexec/java_home)

    export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
    export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH

    $ source .bash_profile

  • then run the command below to check that PySpark is installed

  • $ jupyter_notebook

  • open a Jupyter notebook (with PySpark) from the command line; a quick sanity-check sketch follows below
  • $ cd spark-2.4.0-bin-hadoop2.7 && bin/pyspark
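
If everything is wired up, the following minimal snippet confirms the install; in the pyspark shell or notebook a `spark` session already exists, otherwise create one as shown.

```python
# Quick check that the Spark installation works end to end.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity-check").getOrCreate()
print(spark.version)   # e.g. 2.4.0

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.show()
```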

What is Spark? A data processing engine focused on the in-memory distributed computing use case. Its basic operations are transformations and actions (see the PySpark sketch after the list below):

  • Transformations are operations on RDDs that return a new RDD, e.g. map() and filter().
  • Actions are operations that return a result to the driver program or write it to storage, and they kick off the actual computation, e.g. count() and first().
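
A minimal sketch of the difference, using a toy RDD (run inside the pyspark shell or a notebook):

```python
# Transformations (map, filter) only build a lazy lineage;
# actions (count, first) trigger the actual computation.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)            # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: still lazy

print(evens.count())   # action -> 2 (triggers the job)
print(evens.first())   # action -> 4
```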

Zeppelin: basic operators (maps, joins, filters, etc.), and Spark as a tool for data exploration through notebooks and workflows. A small DataFrame example follows below.
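
To make the "maps, joins, filters" point concrete, here is a small exploration sketch over hypothetical orders/users data; the same code runs in a Zeppelin %pyspark paragraph or a Jupyter notebook.

```python
# Hypothetical data: join, filter, and aggregate with DataFrames.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("exploration-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "alice", 120.0), (2, "bob", 35.5), (3, "alice", 60.0)],
    ["order_id", "user", "amount"],
)
users = spark.createDataFrame(
    [("alice", "Shanghai"), ("bob", "Beijing")],
    ["user", "city"],
)

report = (
    orders.join(users, on="user", how="inner")   # join
          .filter(F.col("amount") > 50)          # filter
          .groupBy("city")
          .agg(F.sum("amount").alias("total_amount"))
)
report.show()   # Shanghai | 180.0
```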

Further reading material 📗
SQL
Nodejs
Spark
Machine learning