Spark usage on a csv file

Spark Spotify Data

By Jarret Jeter

Here I use Spark to do some data cleaning on a CSV file of Spotify artist data.

Technologies Used

  • Python
  • Spark

Description

Spark is overkill for a dataset this small, but this is just a small practice example run on my local computer.

Setup/Installation Requirements

  • Make sure you have a text editor such as Visual Studio Code installed.
  • Have Python 3.7 installed.
  • Clone this repository (https://github.com/jarretjeter/Spark-Spotify-Data.git) from GitHub onto your local computer.
  • In your terminal, create a virtual environment ('python3.7 -m venv venv'), activate it ('source venv/bin/activate'), and then install the requirements ('pip install -r requirements.txt').
  • In the root project directory, create a folder named "data". Then, from that root directory in your terminal, run the command 'gsutil -m cp gs://data.datastack.academy/spotify/spotify_artists.csv ./data' to retrieve the CSV data.
  • Once the data is downloaded, add a name (such as "row") to the unnamed index column in the CSV header to avoid errors.
  • With that done, you can run the cells in the 'main.ipynb' notebook to see some simple examples of Spark.
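
The header fix in the steps above can be done by hand in a text editor, or with a few lines of Python. The following is only a minimal sketch, not code from the notebook: the function name_index_column and the output file name are hypothetical, and it assumes the unnamed index column is the first column and that its header cell is empty.

```python
import csv


def name_index_column(src, dst, name="row"):
    """Copy the CSV at src to dst, filling in a blank first header cell.

    Assumption: the unnamed index column is the first column and its
    header cell is empty. If the first header cell already has a name,
    the file is copied without changes to the header.
    """
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)

        header = next(reader, None)
        if header is None:
            # Empty input file: nothing to copy.
            return

        if header[0].strip() == "":
            header[0] = name

        writer.writerow(header)
        # Data rows are passed through as-is.
        writer.writerows(reader)
```

For example, name_index_column("data/spotify_artists.csv", "data/spotify_artists_named.csv") would write a copy with the named column. Writing to a separate file keeps the original download intact in case the assumption about the header does not hold; the notebook would then need to point at whichever file you keep.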

Known Bugs

  • No known bugs at this time

License

If you have any questions, please email me at jarretjeter@gmail.com

MIT

Copyright (c) 6/26/2022 Jarret Jeter