- Spark is a unified analytics engine for large-scale data processing.
- It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.
- It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
-
- A system running Windows 10
- A user account with administrator privileges (required to install software, modify file permissions, and modify system PATH)
- Command Prompt or PowerShell
- A tool to extract .tar files, such as 7-Zip
- Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will have you up and running.
- If you already have Java 8 and Python 3 installed, you can skip the first two steps.
- Open a browser and navigate to https://www.java.com/download/ie_manual.jsp.
- Download and install the latest stable version of Java.
- Add the bin directory of the Java installation to the system PATH environment variable.
- Open a browser and navigate to https://www.python.org/downloads/.
- Download and install the latest stable version of Python.
- Add the Python installation directory to the system PATH environment variable.
- Open a browser and navigate to https://spark.apache.org/downloads.html.
- Under the Download Apache Spark heading, there are two drop-down menus. In the release menu, select the current non-preview version.
-
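Spark downloads arrive as a compressed tar archive. If 7-Zip is not at hand, Python's standard tarfile module can unpack it instead. The following is a minimal sketch: the helper name extract_archive and the paths in the usage note are hypothetical placeholders, not names taken from this project.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar/.tgz archive into dest_dir and list the top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        # Newer Python versions offer a safer "data" extraction filter;
        # use it when available and fall back to the default otherwise.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **extra)
    return sorted(entry.name for entry in dest.iterdir())
```

For example, extract_archive(r"C:\Users\me\Downloads\spark.tgz", r"C:\Spark") would unpack a downloaded archive into C:\Spark, assuming those paths match your machine.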
Make sure Java, Python, and Spark are installed correctly by running the following commands.
python --version
java -version
pyspark --version
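The same check can be scripted, which also shows whether each tool is actually reachable through PATH. This is a rough sketch using only the standard library. The function name check_tools and the 120-second timeout are illustrative choices. Java is queried with the single-dash -version flag because Java 8 does not accept the double-dash form.

```python
import shutil
import subprocess


def check_tools(commands):
    """Map each tool name to the first line of its version output.

    A value of None means the tool was not found on PATH.
    """
    results = {}
    for name, args in commands.items():
        executable = shutil.which(name)  # resolves .exe/.cmd wrappers on Windows
        if executable is None:
            results[name] = None
            continue
        try:
            completed = subprocess.run(
                [executable, *args], capture_output=True, text=True, timeout=120
            )
        except subprocess.TimeoutExpired:
            results[name] = "timed out"
            continue
        # Some tools (java, for instance) print version info to stderr.
        output = (completed.stdout + completed.stderr).strip()
        results[name] = output.splitlines()[0] if output else ""
    return results


if __name__ == "__main__":
    tools = {
        "python": ["--version"],
        "java": ["-version"],
        "pyspark": ["--version"],
    }
    for name, version in check_tools(tools).items():
        print(f"{name}: {version if version is not None else 'NOT FOUND on PATH'}")
```

A tool reported as not found usually means its directory still has to be added to PATH, or that the terminal was opened before PATH was changed.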
-
Open the terminal/command prompt.
-
Clone the repository.
git clone https://github.com/jainam2385/Big-Data-Analytics-Using-Spark
-
Next, open Jupyter Notebook and run each cell.
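Before running the project's cells, a first notebook cell along these lines can confirm that PySpark starts from Jupyter. This is a minimal sketch and not part of the repository: the function name run_smoke_test, the app name, and the sample rows are made up for illustration, and it assumes the pyspark package is importable from the notebook's kernel.

```python
def run_smoke_test():
    """Start a local Spark session, build a tiny DataFrame, and return its row count."""
    # Imported inside the function so a missing PySpark install surfaces
    # as a clear ImportError at call time.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # run Spark locally, using all available cores
        .appName("smoke-test")     # hypothetical app name
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(
            [(1, "a"), (2, "b"), (3, "c")],
            ["id", "label"],
        )
        return df.count()
    finally:
        spark.stop()
```

Calling run_smoke_test() in a cell should return the number of sample rows. Because it stops the session when it finishes, run it before the project's own cells rather than between them.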
-
Apache Spark: An open-source, fast and general-purpose cluster computing system.
-
Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
-
Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
-
Python: A programming language that lets you work more quickly and integrate your systems more effectively.