The goals of this training are to:
- Get you excited about Data Science
- Give a quick introduction to some of the Python libraries available: Pandas (data wrangling), Scikit-learn (ML), Matplotlib (visualisation)
- Give a quick overview of an approach to tackling Data Science problems
It will not:
- Make you an expert Data Scientist
- Go into details (or do the maths) for the techniques / algorithms we will use
- Properly cover any deep learning / neural networks
This course is delivered using Jupyter Notebooks, so if you're not familiar with them you may find What is the Jupyter Notebook? and Notebook Basics helpful.
The notebooks contain Python code which you will run during the exercises; to run a cell, highlight it and then click Run in Jupyter. Bear in mind that the cells should be executed in order, and each cell should complete before you run the next one.
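To see why order matters, here is a minimal sketch in which each string stands in for one notebook cell and all cells share one namespace, much as they do in a running notebook. The names cells and run_cells are made up for this illustration and are not part of Jupyter.

```python
# Each string stands in for one notebook cell (illustrative only).
cells = [
    "values = [1, 2, 3]",   # cell 1 defines a variable
    "total = sum(values)",  # cell 2 depends on cell 1
]


def run_cells(order):
    """Run the cells in the given order in one shared namespace."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


# Running the cells top to bottom works as expected.
print(run_cells([0, 1])["total"])

# Running cell 2 before cell 1 fails because `values` does not exist yet.
try:
    run_cells([1, 0])
except NameError as error:
    print("Out of order:", error)
```

If you hit a NameError like this during the exercises, re-running the notebook from the first cell is usually the quickest fix.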
This training requires a number of libraries, which can be installed with, for example, pip3. These libraries are:
- Jupyter - An interactive programming environment that runs in the browser.
- scikit-learn - Powerful and easy-to-use machine learning algorithms.
- pandas - A powerful way of handling dataframes, which are two-dimensional tabular data structures with labeled axes.
- numpy - Scientific computing capability providing support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- scipy - Builds on NumPy and gives you access to key mathematical modules such as optimization, linear algebra, integration, and interpolation.
- matplotlib - A plotting library for the Python programming language and NumPy.
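Once the libraries are installed, a quick check along these lines can confirm that Python is able to find them. This is only a sketch: find_missing and REQUIRED_MODULES are names invented here, and Jupyter is left out of the list because you will know it works once the notebook is running. Note that scikit-learn is imported under the name sklearn.

```python
from importlib.util import find_spec

# Import names for the libraries listed above; scikit-learn imports as "sklearn".
REQUIRED_MODULES = ["sklearn", "pandas", "numpy", "scipy", "matplotlib"]


def find_missing(modules):
    """Return the names in `modules` that the import system cannot find."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All listed libraries are importable.")
```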
This training also uses Python 3 and a number of Python libraries, so before starting the training you will need to:
- Install Python 3 by following the Python Beginners Guide; if you use Mac OS X, you might find Installing Python 3 on Mac OS X useful.
- Install Virtualenv using the Installation documentation.
- In your project directory create a new virtual environment by running
virtualenv -p python3.6 env
- Enable your virtual environment by running
source env/bin/activate
- Install the dependencies by running
pip3 install -r requirements.txt
- Finally, to start your development environment, type
jupyter notebook
in your terminal. This should automatically open a tab in your browser; if it does not, visit localhost:8888. To shut down Jupyter, press Ctrl + C.
- When you are finished with the training, you can run
deactivate
to deactivate your virtualenv.
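If you are unsure whether the virtual environment is active, a check along these lines can be run from Python in the same terminal. It is a sketch rather than an official test: in_virtual_environment is a name made up here, and the check relies on sys.prefix differing from sys.base_prefix inside an environment, with sys.real_prefix as a fallback for older virtualenv releases.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # venv and recent virtualenv releases make sys.prefix differ from sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

When the environment created above is active, the interpreter path printed should point inside the env directory.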
If you are familiar with Docker, you can use the Jupyter datascience-notebook image to spin up everything you need for the course. As a starting point, the following command creates a passwordless instance of Jupyter at http://localhost:8888/, mapped to your current working directory:
docker run \
-d --rm -p 127.0.0.1:8888:8888 \
--name=datascience-notebook \
--mount type=bind,source="$(pwd)",target=/home/jovyan \
jupyter/datascience-notebook \
start-notebook.sh --NotebookApp.token=''
Alternatively, you can install Anaconda, which aims to simplify package management.
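Whichever route you take, Jupyter should end up listening on port 8888 of your machine. The sketch below checks whether anything accepts connections there; is_port_open is a helper written for this example, and a positive result only shows that something is listening, not that it is Jupyter.

```python
import socket


def is_port_open(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if is_port_open():
    print("Something is listening on localhost:8888")
else:
    print("Nothing is listening on localhost:8888 yet")
```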
The training is split into 4 courses:
This training is still a work in progress. Please send us any feedback to datalab @ bbc.co.uk to help us improve it!
And if you found this training easy and had fun doing it, why not join us? https://findouthow.datalab.rocks/