/pyspark_exercises

Practice your Pyspark skills!

Primary language: Jupyter Notebook. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Pyspark Exercises

We created this repository to help data scientists learning Pyspark become familiar with the tools and functionality available in the API. It contains 11 lessons covering core concepts in data manipulation. It was forked from Guipsamora's Pandas Exercises project and repurposed to solve the same exercises using the Pyspark API instead of Pandas.

Tutorials are great resources, but to learn is to do: unless you practice, you won't learn. Pyspark is no exception!

There will be three different types of files:
      1. Exercise instructions
      2. Solutions without code
      3. Solutions with code and comments

My suggestion is that you learn a topic from a tutorial, video, or documentation and then do the first exercises. Learn one more topic and do more exercises. If you get stuck, don't go straight to the solutions with code. First check the solutions without code and try to reproduce the correct answer.

Suggestions and collaborations are more than welcome. 🙂 Please open an issue or make a PR indicating the exercise and your problem or solution.

Contributing

As a community project, we're seeking help converting this repo into a complete resource for mastering Pyspark.

We need assistance with the following:

Convert existing .ipynb files with Pandas solutions to Pyspark solutions.

Select an issue in the Issues tab corresponding to one of the tutorial directories. In your pull request, rewrite the notebooks in that directory using Pyspark instead of Pandas. So far, we've listed issues for every exercise in the repo.
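As a rough illustration of what such a conversion can look like, the sketch below expresses one small aggregation twice, once with Pandas and once with Pyspark. It is not taken from any notebook in this repo: the sample rows, the column names `item_name` and `quantity`, and both function names are assumptions made up for the example.

```python
# Illustrative sample data; the rows and column names are assumptions,
# not the contents of any dataset used in the lessons.
SAMPLE_ROWS = [("Chicken Bowl", 2), ("Chips", 1), ("Chicken Bowl", 1)]
COLUMNS = ["item_name", "quantity"]


def total_quantity_pandas(rows):
    """Pandas version: total quantity per item, largest first."""
    import pandas as pd  # imported lazily; requires pandas to be installed

    df = pd.DataFrame(rows, columns=COLUMNS)
    return df.groupby("item_name")["quantity"].sum().sort_values(ascending=False)


def total_quantity_pyspark(spark, rows):
    """Pyspark version of the same aggregation, given an active SparkSession."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    df = spark.createDataFrame(rows, schema=COLUMNS)
    return (
        df.groupBy("item_name")
        .agg(F.sum("quantity").alias("total_quantity"))
        .orderBy(F.desc("total_quantity"))
    )
```

The Pandas function returns a Series computed eagerly, while the Pyspark function returns a DataFrame that is only evaluated when an action such as `.show()` or `.collect()` is called on it.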

Create new issues

We have a lot of refactoring to do outside of the lessons. If you see something that needs to be changed, please raise an issue. To contribute, please either raise an issue in the Issues tab, or raise a pull request for an existing issue.

READMEs

Our README could use some work. For instance, we should list ways to run Pyspark on local machines (Windows, macOS, Linux).
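Until those instructions are written, here is a minimal sketch of one common way to start Pyspark in local mode from Python. It assumes Pyspark was installed from PyPI (`pip install pyspark`) and that a compatible Java runtime is available. The helper names `local_master_url` and `make_local_session` and the default app name are hypothetical, not part of this repo.

```python
def local_master_url(cores=None):
    """Build a Spark master URL for local mode.

    "local[*]" uses all available cores; "local[N]" uses N worker threads.
    """
    return "local[*]" if cores is None else f"local[{int(cores)}]"


def make_local_session(app_name="pyspark_exercises", cores=None):
    """Create (or reuse) a SparkSession that runs on the local machine."""
    # Imported lazily so the helper above works even before pyspark is installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))
        .appName(app_name)
        .getOrCreate()
    )
```

In a notebook, something like `spark = make_local_session()` followed by `spark.range(5).show()` should confirm that the session works, and `spark.stop()` shuts it down when you are done.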

Lessons

Getting and knowing | Merge | Time Series
Filtering and Sorting | Stats | Deleting
Grouping | Visualization | Indexing
Apply | Creating Series and DataFrames | Exporting

Getting and knowing:
Chipotle
Occupation
World Food Facts

Filtering and Sorting:
Chipotle
Euro12
Fictional Army

Grouping:
Alcohol Consumption
Occupation
Regiment

Apply:
Students Alcohol Consumption
US_Crime_Rates

Merge:
Auto_MPG
Fictitious Names
House Market

Stats:
US_Baby_Names
Wind_Stats

Visualization:
Chipotle
Titanic Disaster
Scores
Online Retail
Tips

Creating Series and DataFrames:
Pokemon

Time Series:
Apple_Stock
Getting_Financial_Data
Investor_Flow_of_Funds_US

Deleting:
Iris
Wine

Video Solutions

Video tutorials of data scientists working through the above exercises:

Data Talks - Pandas Learning By Doing