This repository serves as a base for my online course and in-person workshops. It contains Jupyter notebooks, Python scripts (automatically generated from notebooks), and environment configurations (conda and docker). Join for exclusive additional content (exercises and resources): reach out to me at louisdorard.com/contact for more information!
Below is a description of the contents of this repo, and a list of steps to prepare for the exercises and projects you'll do on your laptop ("set-up"):
During the course/workshop you'll be using the following Jupyter notebooks:
- Experiments.ipynb: Bash notebook to run notebook-based experiments, locally or on cloud infrastructure
- More to come! Click on "Watch" to get notified.
The repo also contains:
- setup/: files used to prepare the ML development environment
- scripts/: Python and Bash scripts automatically generated from notebooks by jupytext
- Create accounts
- Install development environment
- Set environment variables
- Download data
- Test environment
- Cloud platform
- Install IDE (optional)
All commands given below should be executed from the root of this repo and are meant for the bash shell (see Appendix: set-up shell if needed).
You'll need accounts on the following platforms:
- Kaggle — ML competitions platform. We'll use it to download datasets and send predictions for evaluation.
- Gradient — cloud ML development and deployment platform. We explain its benefits in the Cloud platform section below, and we provide information to get started.
Note that Gradient is a paid service which offers some free functionalities. No credit card is required for creating your account. If you get asked for one, feel free to ignore. We'll add you to our Gradient team so you'll be able to use all paid features during the workshop, without having to enter your credit card.
Our ML development environment is based on Python and Jupyter. We use conda to install it. Conda is a Python distribution, an environment manager, and a package manager (it resolves dependencies).
- Clone this repo and cd into it:
git clone https://github.com/louisdorard/full-stack-ml.git
cd full-stack-ml/
- Create an output/ directory, meant for storage of artifacts from notebook executions and experiments (note: it's included in .gitignore):
mkdir output/
- Install conda. The fastest way to do this is to install Miniconda (a mini version of Anaconda): see the official instructions for Windows (use the example command), macOS (use the example command under Installing in silent mode), or Linux. If you're asked whether to add (Ana)conda to your PATH, choose yes (or tick the appropriate box in the graphical installer). (Note: on macOS I used Homebrew: brew cask install miniconda.)
- Update conda:
conda update --all -y
- Create the full-stack-ml environment, which will contain the packages listed in environment.yml:
conda install anaconda-client
conda env create louisdorard/full-stack-ml
(Note that this downloads and uses the environment.yml file from Anaconda Cloud, which might be different from the local file, if you have made any changes.)
- Initialize conda for your shell (replace YOUR_SHELL_NAME with the name of your shell, e.g. bash):
conda init YOUR_SHELL_NAME
- Activate the full-stack-ml environment:
conda activate full-stack-ml
- Run script to finalize Jupyter installation and configuration:
bash setup/jupyter-install.bash
- Since Jupyter is a web-based environment, you might need to update your web browser to ensure proper functioning of the Jupyter (Lab) interface.
Remarks:
- If you run into any difficulties with installing this environment, you can try using a Docker-powered environment (based on the louisdorard/full-stack-ml image and on the docker-compose configuration in setup/docker/). The installation steps above match the instructions in the Dockerfile used to create that image, so you can jump straight to the next section.
- Alternatively, you can also try using a cloud platform.
Add environment files:
- .env (at the root of this repo), which will store the DATA_PATH variable: this should be the path to the directory where you store raw data files. This variable will be used by our data loading utils (mlxtend.utils.data).
- ~/auth.env (in your home folder this time), which will contain Kaggle and Gradient authentication variables.
As a starting point, you can copy the sample files found in setup/, which contain example key/value pairs. You'll need to change the values!
cp setup/sample.env .env
cp setup/sample-auth.env ~/auth.env
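The two files copied above are plain lists of key/value pairs. As a minimal sketch of how such a file could be read, assuming one KEY=VALUE pair per line (optionally prefixed with `export`), here is a small standard-library parser. `parse_env_file` is a hypothetical helper written for illustration; it is not the loader used by this repo's data utils, and the exact layout of the sample files is an assumption.

```python
import tempfile
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Assumption: one pair per line, optionally prefixed with 'export'.
    """
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value pair, ignore it
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    # Demo on a throwaway file, so nothing on your machine is touched.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / ".env"
        sample.write_text("# raw data location\nDATA_PATH=/path/to/raw/data\n")
        print(parse_env_file(sample)["DATA_PATH"])
```

Pointing `parse_env_file` at your own `.env` is a quick way to confirm that DATA_PATH holds the directory you expect before running any notebook.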
- For Kaggle:
  - Your username can be found in the top right corner of the Kaggle web interface, once you're logged in. Let's call it USERNAME (please replace it in the URL below).
  - Go to the API section on https://www.kaggle.com/USERNAME/account and click on Create New API Token.
- For Gradient:
  - You can create an API key from https://www.paperspace.com/console/account/api: enter a Name (this can be whatever you want, e.g. "workshop"), a Description (optional), and click on "Create API token".
  - Your project ID can be found at https://www.paperspace.com/console/projects.
Finally, add the following line at the end of your shell config file (e.g. ~/.profile, ~/.bash_profile, or ~/.bashrc for bash):
source ~/auth.env
This will allow you to use the kaggle CLI, for downloading datasets or uploading submissions.
Note: the .env file is kept specific to the current project; this lets you specify different data paths for different projects, in different repos. The same ~/auth.env file can be shared across several projects.
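Once the auth file is sourced, it can be worth confirming that the variables really reached your environment. The sketch below is one way to do that from Python. KAGGLE_USERNAME and KAGGLE_KEY are the names read by the kaggle CLI, and GRADIENT_API_KEY is the name used in this README; `missing_vars` is a hypothetical helper, and any additional variable in your own auth.env (a project ID, say) would need to be added to the tuple.

```python
import os

# Variable names mentioned in this README; extend as needed (assumption).
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY", "GRADIENT_API_KEY")


def missing_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All authentication variables are set.")
```

Run it in a fresh terminal: if anything is reported missing, the `source` line in your shell config file is probably not being picked up.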
We'll use 4 datasets (the first 3 are from Kaggle competitions):
- Avazu (~1 GB compressed - 7 GB uncompressed) - classification - categorical features with many possible values
- House Prices (~200 KB) - regression - numerical and categorical features
- Give Me Some Credit (~7 MB) - classification - numerical features
- MNIST (~55 MB) - classification - numerical features, no missing values (pixels)
They can be downloaded to your raw data directory with the following command:
bash setup/scripts/Download-Data.sh
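Before launching the download, a rough disk-space check can save a failed run. The sketch below adds up the approximate sizes listed above and compares the total with the free space in a target directory. It assumes that the Avazu archive and its extracted files sit on disk at the same time; `DATASET_SIZES_MB`, `free_space_mb`, and `has_room` are illustrative names that do not exist in this repo.

```python
import shutil

# Approximate sizes in MB, taken from the list above.
# Assumption: the compressed Avazu archive (~1 GB) is kept next to
# the uncompressed files (~7 GB).
DATASET_SIZES_MB = {
    "Avazu": 1000 + 7000,
    "House Prices": 0.2,
    "Give Me Some Credit": 7,
    "MNIST": 55,
}


def free_space_mb(path):
    """Free space, in MB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e6


def has_room(path, required_mb):
    """True if `path` has at least `required_mb` MB available."""
    return free_space_mb(path) >= required_mb


if __name__ == "__main__":
    required = sum(DATASET_SIZES_MB.values())
    # "." stands in for your raw data directory (the DATA_PATH value).
    print(f"Need about {required:.0f} MB, enough room: {has_room('.', required)}")
```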
Assuming that you've already activated the full-stack-ml environment in your current shell session...
- Test that the environment is functional by running 00-Version-Information.ipynb, which displays the versions of the core libraries used in this workshop:
papermill 00-Version-Information.ipynb output/00-Version-Information.ipynb
The output notebook should contain:
sklearn 0.22.1
- Test notebooks (this can take a couple of minutes):
jupytext --from ipynb --execute ??-*.ipynb
- Fire up Jupyter Lab to interact with the notebooks:
jupyter-lab
This should automatically open your browser at http://localhost:8888/
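The version check in the first step above can also be scripted. Here is a minimal sketch that compares installed package versions against expected ones. It assumes Python 3.8 or later (for `importlib.metadata`); the `EXPECTED` table only lists the scikit-learn version quoted above, and `installed_version` and `version_mismatches` are hypothetical helpers, not part of the version-information notebook.

```python
from importlib import metadata

# Expected version quoted in this README ("sklearn 0.22.1").
# The distribution name on PyPI/conda is "scikit-learn".
EXPECTED = {"scikit-learn": "0.22.1"}


def installed_version(dist_name):
    """Installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Return {name: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = version_mismatches(EXPECTED)
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found}")
    if not problems:
        print("All checked versions match.")
```

The `lookup` parameter is there so the comparison logic can be exercised without the packages being installed.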
We'll be using the Gradient cloud ML platform during this workshop (see link in Accounts section).
There are 2 main ways to use Gradient: for Running Jobs (paid feature) and for Running Notebooks. Once we have added you to our Gradient team, you'll be able to run Jobs without having to pay. In the meantime, you can run Notebooks for free with a "Free-CPU" cloud instance.
- It’s common practice to use powerful machines in the cloud for Machine and Deep Learning experiments, equipped with GPUs or high-performing CPUs with many cores. They make it faster to run jobs, and they can continue running while your laptop is closed.
- Another advantage of the cloud is that you can have access to a development environment without having to install anything. You can have access to this workshop's development environment via Notebooks, which will run a docker container based on this repo's docker image.
- ML platforms (as opposed to regular cloud services) like Gradient make it faster to set up cloud machines and more convenient to persist work done on these machines.
- You won’t have to use your own wifi for downloading heavy datasets (some of which weigh several GBs): downloads will happen via the platform's internet connection (FYI Gradient's download speed can reach ~ 80 MB/s).
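To put the download speed quoted above in perspective, here is a back-of-the-envelope calculation. The 80 MB/s figure and the ~1 GB size of the compressed Avazu archive come from this README; the 5 MB/s home connection is a made-up value for comparison, and `download_seconds` is an illustrative helper.

```python
def download_seconds(size_mb, speed_mb_per_s=80.0):
    """Time in seconds to transfer `size_mb` at a constant speed."""
    return size_mb / speed_mb_per_s


if __name__ == "__main__":
    avazu_mb = 1000  # ~1 GB compressed
    print(f"Cloud (80 MB/s): {download_seconds(avazu_mb):.1f} s")  # 12.5 s
    # Hypothetical home wifi at 5 MB/s, for comparison only.
    print(f"Home (5 MB/s): {download_seconds(avazu_mb, 5.0):.1f} s")  # 200.0 s
```

Real transfers rarely sustain peak speed, so treat these numbers as lower bounds.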
When we add you to our team project, you'll have access to paid features such as running Jobs. But if you don't want to wait until we add you to the team, you can add a credit card to your Gradient Private Workspace. A "workspace" on ML cloud platforms is a place where you can create projects, in which all your experiment files will be stored (code, assets, outputs, results).
- Install the Gradient CLI (Command Line Interface):
pip install -U gradient
- Add your API key (assuming that you've already added GRADIENT_API_KEY to auth.env and sourced it):
gradient apiKey $GRADIENT_API_KEY
The key gets stored in ~/.paperspace/config.json.
- Create and run your Jobs via the Experiments.ipynb Bash notebook.
Notebooks on Gradient provide an interesting way to get started faster with ML development, or when you don't want to have to install anything on your machine.
- When we add you to our team workspace, you'll have access to our data storage (mapped to /storage/data/), where the necessary data files have already been copied. But if you want to start using some of the notebooks here in the meantime, you'll need to use your Private Workspace and to download that data.
- You'll need to start by creating a Notebook and setting environment variables:
  - Click on Create Notebook at https://www.paperspace.com/console/notebooks
    - In "01. Choose Container":
      - "Enter Container Name" -> louisdorard/full-stack-ml
      - "Container user" -> root
    - In "02. Choose Machine", pick Free-CPU
    - Click on Create Notebook to confirm everything
  - Once the notebook is running, get the Notebook ID from the list
  - Go to https://NOTEBOOK_ID.gradient.paperspace.com/lab
  - Click on "New terminal"
  - Adapt the following commands by adding your Kaggle username and key, and execute:
git clone https://github.com/louisdorard/full-stack-ml.git
sudo echo "export KAGGLE_USERNAME=" >> /root/.bashrc
sudo echo "export KAGGLE_KEY=" >> /root/.bashrc
sudo bash full-stack-ml/scripts/Download-Data.sh
Using an IDE can be a useful complement to Jupyter Lab, e.g. for linting or refactoring or moving code around, or for creating Python modules. I recommend Visual Studio Code (VS Code):
- It's a free and popular IDE for many programming languages
- It has built-in support for Git
Go through Getting Started with Python in VS Code. Alternative ways to install it are to use Homebrew on macOS (brew install --cask visual-studio-code), or to download it from the Snap Store on Linux.
Some recommended extensions: Python, Docker, Rainbow CSV, Excel Viewer, Markdown All in One.
Installation of the development environment is done from the shell. The most popular option is Bash; however, I prefer to use the fish shell. Regarding terminals:
- macOS: I recommend iTerm.
- Windows:
- I recommend Cmder (which uses Bash by default). Download the "Full" version of Cmder, which includes Git.
- Other popular options for using the command line on Windows are Cygwin or the Ubuntu Linux distribution. There are 3 possible ways to use Ubuntu:
- Ubuntu app for Windows
- Dual-boot installation alongside Windows
- Bootable USB stick.
Once your shell and your terminal are set up, you'll be ready to execute all the commands given here! Note for Cmder users on Windows: after typing "~" in a command, press the Tab key, and it will replace "~" with the path to your home directory.
Louis Dorard | Follow me on Twitter @louisdorard