"Spark with Zeppelin are great combination"
- loopback:v5.0.0 generator-loopback:v6.0.2 loopback-workspace:v4.5.0
- node: v11.3.0
- Python: v3.6.5
- MySQL (of course, feel free to use PostgreSQL instead; we love it!) 😃
- PySpark
- Zeppelin
- Keras
- Hadoop 2.9.2
- Maven
- libprotoc 2.5.0
- OpenSSL 1.0.2o_2
- aws-java-sdk-1.7.4
- hadoop-aws-2.7.1
The article I wrote earlier on Medium about how to set up Loopback (please find it here) might be helpful if you are just getting started with Loopback.
$ npm install
$ npm install -g loopback-cli (https://loopback.io/doc/en/lb3/)
$ npm install loopback-connector-mysql --save (https://www.npmjs.com/package/loopback-connector-mysql)
$ install Anaconda (https://repo.continuum.io/archive)
$ install Apache Spark
$ install Java
$ pip install Keras
# install Hadoop 2.9.2
$ brew install hadoop (Hadoop will be installed under /usr/local/Cellar/hadoop)
or $ wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2-src.tar.gz (I recommend this one, otherwise there will be many bugs later when we build the environment with Hadoop)
# configure Hadoop
$ cd /usr/local/opt/hadoop
# hadoop-env.sh
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME="/Library/Java/JavaVirtualMachines/<ADD_JDK_VERSION_HERE>/Contents/Home"
# core-site.xml
# later, when we connect to an AWS S3 bucket, we will need to add the IAM credentials as properties here (see the PySpark sketch after the config files below)
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
# mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
# hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
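As noted in the core-site.xml comment above, the S3 connection needs IAM credentials. Here is a minimal sketch of how they can also be supplied at runtime from PySpark, assuming the aws-java-sdk-1.7.4 and hadoop-aws-2.7.1 JARs from the stack list are on the Spark classpath; the bucket name and keys below are placeholders, not real values.

```python
# Minimal sketch: read from S3 over the s3a:// connector in PySpark.
# Assumes hadoop-aws and aws-java-sdk are on the classpath; the bucket
# name and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-demo").getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<YOUR_AWS_ACCESS_KEY>")
hadoop_conf.set("fs.s3a.secret.key", "<YOUR_AWS_SECRET_KEY>")

df = spark.read.csv("s3a://<your-bucket>/path/to/data.csv", header=True)
df.show(5)
```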
$ hdfs namenode -format
$ cd /usr/local/opt/hadoop/sbin (to start the HDFS service)
$ ./start-dfs.sh
# install Maven
# build the Hadoop environment
$ mvn package -Pdist,native -DskipTests -Dtar
# from these bugs in the configuration
![picture alt](https://github.com/Chloejay/dataplayground/blob/master/Screen%20Shot%202019-07-03%20at%2020.59.51.png?raw=true)
# to this, after fixing the configuration
![picture alt](https://github.com/Chloejay/dataplayground/blob/master/Screen%20Shot%202019-07-03%20at%2021.41.46.png?raw=true)
- First download Java. Note: please don't install a Java version newer than 8, which will cause bugs later such as java.lang.IllegalArgumentException: Unsupported class file major version 55; it is better to install from the Java SE Development Kit 8 site and choose the package for your operating system.
- Go to the Apache Spark website
- Choose a Spark release and directly download
- Go to your home directory (using the command below)
$ cd ~
- Unzip the folder in your home directory using the following command
- Use the following command to see that you have a .bash_profile
- Configure Spark by editing .bash_profile
- Then run the commands below to check that PySpark is installed
- Open a Jupyter notebook from the command line
$ tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
$ ls -a
$ vim .bash_profile
export SPARK_PATH=~/spark-2.4.0-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
# For Python 3 you have to add the line below, or you will get an error
export PYSPARK_PYTHON=python3
alias jupyter_notebook='$SPARK_PATH/bin/pyspark --master local[2]'
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
$ source .bash_profile
$ jupyter_notebook
$ cd spark-2.4.0-bin-hadoop2.7
$ bin/pyspark
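Once the PySpark shell (or the jupyter_notebook alias) is up, a quick sanity check that the SparkContext works; a minimal sketch using the sc and spark objects that bin/pyspark creates for you.

```python
# Quick sanity check inside the PySpark shell or notebook.
# `sc` (SparkContext) and `spark` (SparkSession) are created by bin/pyspark.
rdd = sc.parallelize(range(10))
print(rdd.count())              # 10
print(sc.version)               # should print 2.4.0 for this download
print(spark.range(5).collect())
```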
What is Spark? A data processing engine focused on in-memory distributed computing. Its basic operations fall into two groups: transformations and actions.
- Transformations are operations on RDDs that return a new RDD, e.g. map() and filter().
- Actions are operations that return a result to the driver program or write it to storage, and they kick off a computation, e.g. count() and first() (see the sketch after this list).
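To make the distinction concrete, here is a minimal sketch, assuming a running SparkContext sc such as the one from the PySpark shell above:

```python
# Transformations are lazy: they only describe a new RDD.
nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)            # transformation: map()
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: filter()

# Actions trigger the computation and return a result to the driver.
print(evens.count())   # action: count() -> 2
print(evens.first())   # action: first() -> 4
```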
Zeppelin: basic operators such as maps, joins, and filters; Spark as a tool for data exploration through notebooks and workflows (a sketch of a Zeppelin paragraph follows).
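For example, a %pyspark paragraph in Zeppelin for exploring data with a join and a filter could look like the sketch below; the DataFrames and column names are made up for illustration.

```python
# Run in a Zeppelin %pyspark paragraph: join two small DataFrames,
# filter the result, and render it with Zeppelin's table display.
from pyspark.sql.functions import col

users = spark.createDataFrame(
    [(1, "alice"), (2, "bob"), (3, "carol")], ["user_id", "name"])
orders = spark.createDataFrame(
    [(1, 120.0), (1, 35.5), (3, 60.0)], ["user_id", "amount"])

big_spenders = users.join(orders, "user_id").filter(col("amount") > 50)

# z.show() renders a DataFrame as an interactive table in Zeppelin
z.show(big_spenders)
```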
SQL
Nodejs
Spark
Machine learning