Reproducible Data Science with Python

This repository contains the open learning resource Reproducible Data Science with Python in the form of Python Jupyter notebooks.

Description

The open learning resource uses real-world social data sets related to the COVID-19 pandemic to provide an accessible introduction to open, reproducible, and ethical data analysis using hands-on Python coding, modern open-source computational tools, and data science techniques. Topics include reproducible workflows, data wrangling, exploratory data analysis, data visualisation, pattern discovery (e.g., clustering), prediction and machine learning, causal inference, and network analysis.

How to use the learning resource?

You can read the textbook on the dedicated website. In addition, you can view each individual notebook on GitHub by clicking on the respective button below.

To work with the code interactively, you can open the Jupyter notebooks in the free cloud services MyBinder and Colab. Both services let you modify and run the notebooks from your browser.

By clicking on a MyBinder button below, you will launch an interactive version of the Jupyter notebook with the following capabilities:

Reproducibility: Notebooks run in a reproducible computing environment containing the same Python packages and package versions used in the original notebooks.
Session time: Notebooks run for up to 6 hours and will be shut down automatically after more than 10 minutes of inactivity.
Notebook persistence: Non-persistent; changes will be lost after your MyBinder session times out unless you download the notebook.
Access: Free, public, and anonymous cloud service. No setup or login is required to view and execute the notebooks. Notebooks that use safeguarded data should not be launched on MyBinder.

By clicking on a Colab button below, you will open a Jupyter notebook in Colab with the following capabilities:

Reproducibility: The Colab environment comes with pre-installed Python packages and package versions, which may differ from those used in the original notebooks. To enable computational reproducibility, see the section "Installing dependencies" below.
Session time: Notebooks run for up to 12 hours and will disconnect when left idle for too long (time may vary).
Notebook persistence: Persistent; changes are saved automatically when you are logged in.
Access: Free cloud service that requires no setup. You can view the notebooks without a login, but executing or modifying a notebook requires a Google account and a login.

Textbook chapter	View on GitHub	Launch on MyBinder.org	Open in Colab
About the textbook
End-to-End Data Science Project
Python Data Science on the Cloud
Open Reproducible Data Science Workflow
Data Design and Data Wrangling
Data Exploration and Data Visualisation
Pattern Discovery using Unsupervised Learning
Prediction using Supervised Learning
What Causes What? Introduction to Causal inference
Network Analysis
Data Ethics

NOTE

The notebooks Prediction using Supervised Learning and What Causes What? Introduction to Causal inference require access to safeguarded data. Once obtained, the data must be stored securely on your Google Drive and loaded in your private Colab notebooks.
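
Loading a file from Google Drive in a private Colab notebook could look roughly like the sketch below. It uses Colab's `google.colab.drive.mount` helper and the usual `/content/drive/MyDrive` location. The folder name `safeguarded_data`, the helper functions, and any file name you pass in are hypothetical placeholders and are not part of the resource.

```python
from pathlib import PurePosixPath

# Colab's conventional mount point; "safeguarded_data" is a hypothetical folder name.
MOUNT_POINT = "/content/drive"
DATA_FOLDER = PurePosixPath(MOUNT_POINT) / "MyDrive" / "safeguarded_data"


def data_path(filename):
    """Build the path of a file inside the private Drive folder."""
    return str(DATA_FOLDER / filename)


def mount_drive():
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(MOUNT_POINT)  # asks you to authorise access in the browser
    return True


def load_safeguarded_csv(filename):
    """Read a CSV file from the private Drive folder into a DataFrame."""
    if not mount_drive():
        raise RuntimeError("Google Drive can only be mounted from inside Colab")
    import pandas as pd  # imported here so the helpers above work without pandas
    return pd.read_csv(data_path(filename))
```

In a Colab notebook you would call, for example, `load_safeguarded_csv("my_file.csv")`, replacing the placeholder with the name of the file you stored on your Drive. Keeping the data in your own Drive, rather than in the repository or on MyBinder, matches the access restrictions described above.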

Installing dependencies

To enable computational reproducibility and minimise errors caused by updates to Python libraries, you may need to install the resource's dependencies, listed in the requirements.txt file, in your Colab notebook (dependencies are preinstalled automatically in Binder). To install them, execute the following command in the top code cell of your active notebook:

!pip install -r https://raw.githubusercontent.com/valdanchev/reproducible-data-science-python/master/requirements.txt
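
After installing, you may want to confirm that the running environment matches the pinned versions. The sketch below is one possible check using only the standard library. It assumes the requirements are written as simple `name==version` lines and compares versions as plain strings. The functions `parse_pins` and `find_mismatches` and the example pins are hypothetical illustrations, not the actual contents of requirements.txt.

```python
from importlib import metadata


def parse_pins(requirement_lines):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for every pin not met."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical pins for illustration; not the contents of the real requirements.txt.
example_lines = ["# core libraries", "pandas==1.3.5", "", "numpy==1.21.6  # numerics"]
example_pins = parse_pins(example_lines)
for package, (pinned, installed) in find_mismatches(example_pins).items():
    print(f"{package}: pinned {pinned}, installed {installed}")
```

To check the real file, you could read its lines (for example with `open("requirements.txt").read().splitlines()` after downloading it) and pass them to `parse_pins`. Requirement lines that use ranges, extras, or environment markers would need a fuller parser than this sketch provides.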

Contributing to the resource

Contributions to the learning resource are welcome. You can contribute by creating an issue or a pull request.

To create an issue, contributors are encouraged to follow the GitHub quickstart guide on creating an issue.
To create a pull request, contributors are encouraged to follow the GitHub quickstart guide on creating a fork and submitting a pull request.

License

Reproducible Data Science with Python by Valentin Danchev is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
