Jupyter Notebook for Data Science

Notes, example code and datasets for the online course Jupyter Notebook for Data Science.

Course Prerequisites – a few tutorials we recommend to get ready to take the course (in case you haven't worked with Python before)
Course Code Examples – the source code developed during the course. We recommend you set it up on your own computer in order to try it out and make changes.
Course Notes – useful links and additional resources that we recommend you check out after you finish each section.
Next Steps – some tips on how to continue learning after you've finished the course
Credits
License

Course Prerequisites

To fully benefit from the coverage included in this course, you will need:

a basic understanding of the Python programming language (tutorial), plus some basics of the web (HTML/CSS) for the scraping section (tutorial)
familiarity with running commands on the command line (tutorial), and enough git to download this source code locally (tutorial)
a basic understanding of math and statistics will come in handy, but is not a strict requirement

We are combining a wide collection of skills in this course – programming, data collection and analysis, so some parts will likely be a bit unfamiliar to you, no matter your background. Don't be discouraged by this – a part of the data science profession is to learn new skills over time to stay up-to-date. If you get stuck in any particular area, take a small break, learn some more details on the subject and resume the course afterwards. There is a wealth of resources online – from the online documentation pages of the libraries we use in the course, to websites like https://stackoverflow.com/ where a lot of the beginner questions have already been answered.

Course Code Examples

Create a new directory.

mkdir jupyter-course
cd jupyter-course

Clone this repository. Note – you will first need to install the Git Large File Storage extension to clone all the large datasets in this repository.

git clone https://github.com/PacktPublishing/Jupyter-Notebook-for-Data-Science.git

Start Jupyter Notebook using the Docker stack. Adapt the path to your working directory, that is, the directory you created in the first step (I'm assuming ~/code/jupyter-course).

docker run -it --rm -p 8888:8888 -v ~/code/jupyter-course:/home/jovyan/work jupyter/datascience-notebook:de0cd8011b9e

You can always leave out the exact image tag (:de0cd8011b9e) to get the latest version of all the packages, but this is the version that was used in the course.

After everything is downloaded and started, you should get a link in your console to open Jupyter Notebook in your browser. The notebook should be connected to your local files including this git repository. You should now be ready to go through the example code or create your own notebooks to analyse the example datasets.

If you want to try out the new JupyterLab interface (as we do in the course in Section 5), you need to modify the command a bit.

docker run -it --rm -p 8888:8888 -v ~/code/jupyter-course:/home/jovyan/work jupyter/datascience-notebook:de0cd8011b9e start.sh jupyter lab

For Section 5, where we install additional packages such as Matplotlib Basemap and Plotly, build and run the custom Docker image from the Dockerfile in this git repo. Run these commands from the directory that contains the Dockerfile.

docker build --rm -t jupyter/custom-notebook .
docker run -it --rm -p 8888:8888 -v ~/code/jupyter-course:/home/jovyan jupyter/custom-notebook start.sh jupyter lab

Note – some of the notebooks connect to REST APIs that require API keys (DarkSky, Plotly & Mapbox). If you want to follow along, you will need to create accounts on these services and substitute your own API keys in the code. This is all explained in the course videos.
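One way to substitute your own keys without committing them to the repository is to read them from environment variables. The following is a minimal sketch and not necessarily the approach shown in the course videos; the helper get_api_key and the variable name WEATHER_API_KEY are hypothetical.

```python
import os


def get_api_key(name):
    """Return the API key stored in the environment variable `name`.

    Raises a RuntimeError with a readable message if the variable is
    missing or empty, so a notebook fails early instead of sending
    unauthenticated requests.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your API key")
    return key


# Hypothetical usage inside a notebook cell (the variable name is an assumption):
# api_key = get_api_key("WEATHER_API_KEY")
```

With the Docker setup above, the variable can be passed into the container through the -e option of docker run.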

Course Notes

In the course a number of useful online resources are mentioned – you can find the links to all of them here.

Section 1: Jupyter Notebook Introduction

1.1: Course Introduction

1.2: Setting up Jupyter Notebook

Jupyter Notebook data science stack

1.3: Using Jupyter Notebook

Life expectancy data from the World Bank

1.4: Publishing Notebooks

Section 2: Data Analysis Using Pandas

2.1: Parsing the Crime Dataset

2.2: Pandas Data Structures

2.3: Explore and Visualise the Data

Advanced Indexing – Pandas documentation section about hierarchical indexing

2.4: Create an Interactive Widget

Jupyter Widgets

Section 3: Scraping Data

3.1: Introduction to Data Scraping

Scrapy – a Python framework for scraping

3.2: Fetching Data from a REST API Using Requests

Update – Since creating this course, DarkSky has shut down its API to the public. There are alternative weather APIs available. It is a good exercise to try to fetch similar data from another source, as these are exactly the types of tasks one frequently runs into during day-to-day data science work.
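As a starting point for that exercise, the sketch below separates building the request URL from parsing the JSON response, so that only the small parsing function needs to change when you switch providers. It uses only the Python standard library rather than Requests as in the course. The endpoint https://api.example.com/v1/forecast, the query parameter names and the response layout are all hypothetical and must be replaced with those of the service you choose.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/forecast"


def build_url(latitude, longitude, api_key):
    """Build the request URL; the parameter names are assumptions."""
    query = urlencode({"lat": latitude, "lon": longitude, "key": api_key})
    return f"{BASE_URL}?{query}"


def parse_hourly(payload):
    """Extract time/temperature records from an assumed JSON layout:
    {"hourly": [{"time": ..., "temperature": ...}, ...]}
    """
    data = json.loads(payload)
    return [
        {"time": item["time"], "temperature": item["temperature"]}
        for item in data.get("hourly", [])
    ]


def fetch_hourly(latitude, longitude, api_key):
    """Download and parse the forecast (requires network access)."""
    url = build_url(latitude, longitude, api_key)
    with urlopen(url, timeout=10) as response:
        return parse_hourly(response.read().decode("utf-8"))
```

The list of dictionaries returned by parse_hourly can be passed directly to pandas.DataFrame to continue with the analysis.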

3.3: Importing API data into Pandas

3.4: Scraping Websites using BeautifulSoup

The Weather Underground website has inevitably changed since this course was created. One of the downsides of scraping websites is that the underlying HTML markup often changes (usually even more often than API protocols). Using CSS selectors similar to the ones we used in the video, it should be possible to adapt the code to work with an updated version of the website.

Section 4: Advanced Visualisation

4.1: Introduction to Information-Dense Visualisations

4.2: Visualising Data Correlation

4.3: Linear Regression

4.4: Correlation Matrix

Correlation matrix using Seaborn, another plotting package

Section 5: Analysing Geographic Data

5.1: Maps in Data Science

5.2: Plotting Crime Locations

JupyterLab

5.3: Interactive Maps Using Plotly

Plotly

5.4: Final Remarks

Next Steps

After you're done with the course, consider finding a practical problem to work on – that's the best way to learn. Here are some ideas on what to work on:

Awesome Python for Social Good – a curated list of topics where you can use your data science & programming skills to help society.
Code for All – a network of organizations advocating open data and helping developers and other interested people get involved with analysing public data.
Kaggle – an online community for working on real data science projects posted by companies and NGOs with occasional competitions & prizes. People also help each other out by commenting on uploaded solutions and starting various discussions about data science methods. Jupyter Notebooks are used to perform the work.

Some inspirational data science examples:

Jake VanderPlas – the blog of an astronomer and data scientist who is very active in the Python community
FiveThirtyEight – a website that publishes data-driven articles on (mostly US) sports, politics & economics. It's a great source of inspiration for the sorts of topics that can be explored.
Mike Bostock – data science blog with code examples from the author of D3.js.

Credits

Course and materials author – Dražen Lučanin. Hear about more of Dražen's courses by subscribing here!

Published by Packt.

License

The code is published under the MIT license.
