Spark Demo Course
Description
This README contains instructions on how to install PySpark on your local Windows machine and run jobs on it.
Prerequisites
Python
You need to have Python on your local machine. If you don't have it, go to https://www.python.org/downloads/windows/ and install the latest version. After the installation is complete, close the Command Prompt if it was already open, reopen it, and check that you can successfully run the
python --version
command. After this, install Jupyter Notebook, pandas, NumPy, PySpark, and findspark (you can easily find guides on the web) and anything else you need.
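All of these are available from PyPI, so a single pip command is one way to get them (the package names below are the standard PyPI ones; adjust the list to your needs):

pip install jupyter pandas numpy pyspark findspark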
Msvcr100.dll
If you have Visual Studio C++, skip this step. If you don't, follow this guide https://www.computer-setup.ru/msvcr100-dll-chto-eto-za-oshibka-kak-ispravit?ysclid=l84uyxs1qq310355183 to install Msvcr100.dll.
Kaggle
You also need to register on Kaggle (for example, using your Google account) to download datasets.
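As an alternative to downloading datasets through the browser, Kaggle also provides an official command-line tool. A minimal sketch, assuming you have generated an API token in your Kaggle account settings; the dataset name is a placeholder:

pip install kaggle
kaggle datasets download -d <owner>/<dataset-name>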
Microsoft SQL Server
You also need to have Microsoft SQL Server (you can find how to download it on the internet).
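Later you can read SQL Server tables from PySpark over JDBC. Below is a minimal sketch, not the exact setup of this course: the jar path, database name, table, and credentials are all placeholders for your own values, and it assumes you have downloaded the Microsoft JDBC driver (mssql-jdbc).

from pyspark.sql import SparkSession

# All paths and connection details here are placeholders; adjust to your setup.
spark = (SparkSession.builder
         .appName("mssql-demo")
         .config("spark.jars", "C:\\jars\\mssql-jdbc-9.4.1.jre8.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://localhost:1433;databaseName=demo_db")
      .option("dbtable", "dbo.some_table")
      .option("user", "demo_user")
      .option("password", "demo_password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())
df.show()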
Installation guide
If you want to have the same setup as me, you can download Spark_Demo_Course from this repository and go directly to the "Last step" section. Otherwise, you need to go through the whole installation guide.
Java
First of all, you need to install a Java JDK with 7 <= version <= 11. I really recommend installing Java JDK version 8 to avoid version problems. At the moment you can't download the Java JDK from the Oracle archive from Belarus; try using a VPN or download it from external sources.
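After installing, open a new Command Prompt and check which version is picked up:

java -version

For JDK 8 it should report a version starting with 1.8.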
Spark
Go to the page https://spark.apache.org/downloads.html. Select the latest stable release of Spark. Choose a package type: select a version that is pre-built for the latest version of Hadoop, such as "Pre-built for Hadoop 3.3". If you want to have the same versions as me, choose Spark 3.1.3 and Hadoop 2.7. After you choose the package type, you will see a "Download Spark" link under it; click on it and you will be redirected to the next page, where you need to click on a link like the one below.
Hadoop
Go to the page https://hadoop.apache.org/release/2.7.0.html. Select the version of Hadoop that you chose while installing Spark (if you remember, you chose a pre-built version of Hadoop). Choose the download like below.
Winutils.exe
Windows users also need to install winutils.exe to work with Spark. Go to this page https://github.com/steveloughran/winutils. You can find winutils.exe in hadoop-'version'\bin. If you don't find your version there, look for it in other sources.
Unpacking
Now you need to unpack Hadoop, Java, and Spark so that each one is just a plain folder. After you unpack the packages, you need to put winutils.exe into hadoop-2.7.0\bin and spark-3.1.3-bin-hadoop2.7\bin.
Last step
This is the last step. Press Win+R, type sysdm.cpl, and click OK. Open the "Advanced" tab and click "Environment Variables". You need to create 3 variables in the system variables (second pane):
- JAVA_HOME=...\Java\jdk-8;
- HADOOP_HOME=...\hadoop-2.7.0;
- SPARK_HOME=...\spark-3.1.3-bin-hadoop2.7.
Then, also in the system variables (second pane), you need to add 3 entries to the Path variable:
- %HADOOP_HOME%\bin;
- %SPARK_HOME%\bin;
- %JAVA_HOME%\bin.
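To check that everything was picked up, open a new Command Prompt and run, for example:

echo %JAVA_HOME%
where winutils.exe

The first command should print the path you set, and the second should locate winutils.exe through the Path entries above.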
Environment ready
Now you can open a new Command Prompt and type
pyspark
You should see something like below.
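If you'd rather verify from plain Python (for example, in a Jupyter notebook), a minimal smoke test could look like this; findspark locates Spark through the SPARK_HOME variable you set earlier:

import findspark
findspark.init()  # picks up SPARK_HOME from the environment

from pyspark.sql import SparkSession

# Start a local session and run a trivial job to confirm the setup works.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.range(10).count())  # should print 10
spark.stop()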