This project is part of the requirements for the Udacity Data Scientist Nanodegree and follows an end-to-end data analysis process, including ETL, machine learning, and deployment.
- Sparkify.ipynb: notebook run on the medium dataset of 247.6 MB ('medium-sparkify-event-data.json')
- Sparkify_smallDataset.ipynb: test notebook run on a small dataset of 128 MB ('mini_sparkify_event_data.json')
The code is written in Python 3. To run it, install the necessary packages with pip or conda. You also need to run a Spark cluster on either Amazon Web Services (AWS) or IBM Cloud. If you choose AWS, you can use the full 12 GB dataset hosted on a public S3 bucket; expect to spend about $30 to run the cluster while building the project over a week. If you choose IBM Cloud, you will use a medium-sized 23 MB dataset provided for you to download. You will still deploy the application on a Spark cluster, but this will not cost you any money.
This project was deployed on IBM Cloud and run on the 23 MB medium dataset.
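Once PySpark is installed, loading one of the event logs is straightforward. The snippet below is a minimal sketch, assuming the JSON file sits in the working directory and a local Spark session is acceptable for testing; on AWS or IBM Cloud the cluster supplies the session configuration instead.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; `local[*]` is only for a quick local
# test -- on a cluster the master is supplied by the environment.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("Sparkify")
         .getOrCreate())

# Load the small event log listed above and run basic sanity checks
# before any ETL or modeling.
events = spark.read.json("mini_sparkify_event_data.json")
events.printSchema()
print("rows:", events.count())
```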
Create an IBM Cloud account to use the IBM Watson Studio service.
- Visit cloud.ibm.com and click on the "Create an IBM account" button.
- If you have an IBM Cloud account already, sign in.
- If you do not have an account, sign up.
- Once you finish signing up, wait a few minutes to receive your IBM Cloud account confirmation email. Then return to cloud.ibm.com and sign in.
- Register for IBM Watson Studio
- Click the menu icon on the top left and select "Watson." Scroll down and select "Try Watson Studio."
- Next, select "Create Project."
- Hover over "Data Science" and click "Create Project" Enter a name for your project and click "Create" on the bottom right.
- Click "Add to project" on the top and select "Notebook"
- Enter a name for your notebook and select "Default Spark Python 3.6" under runtime and click "Create Notebook" on the bottom right.
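The first cell of the new notebook can simply attach to the Spark session that the runtime provides. This is only a sketch, assuming a Spark-enabled environment; `SparkSession.builder.getOrCreate()` returns the existing session if one is already running and creates one otherwise.

```python
from pyspark.sql import SparkSession

# Attach to the Spark session provided by the Watson Studio Spark runtime
# (getOrCreate() falls back to creating a new session if none exists).
spark = SparkSession.builder.appName("Sparkify-churn").getOrCreate()
print("Spark version:", spark.version)
```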
- Credit to Udacity and the instructional staff for the project guidance and preparation
- Other online resources used to complete this project (a short Spark ML pipeline sketch in the spirit of these references follows the list):
- https://spark.apache.org/docs/latest/ml-classification-regression.html
- https://docs.databricks.com/applications/machine-learning/mllib/binary-classification-mllib-pipelines.html
- https://medium.com/@dhiraj.p.rai/logistic-regression-in-spark-ml-8a95b5f5434c
- https://databricks.com/session/apache-spark-mllib-2-0-preview-data-science-and-production
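The resources above cover binary classification with Spark ML pipelines. The sketch below illustrates that pattern under stated assumptions: the feature and label column names are purely illustrative stand-ins for the project's engineered churn features, not the actual ones used in the notebooks.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify-churn").getOrCreate()

# Toy stand-in for an engineered per-user DataFrame: numeric feature
# columns plus a 0/1 `label` column marking churn (illustrative names).
features_df = spark.createDataFrame(
    [(10.0, 3.0, 1), (55.0, 0.0, 0), (5.0, 7.0, 1), (80.0, 1.0, 0)],
    ["songs_per_day", "thumbs_down", "label"],
)

# Assemble and scale the features, then fit a logistic regression,
# all chained in a single pipeline.
assembler = VectorAssembler(inputCols=["songs_per_day", "thumbs_down"],
                            outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, scaler, lr]).fit(features_df)

# Area under ROC, as in the binary-classification examples linked above.
# (In practice the data would be split into train/test sets first.)
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("AUC:", evaluator.evaluate(model.transform(features_df)))
```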