A tutorial for using Git with Jupyter notebooks

jupyter-git-tutorial

The goal of this tutorial is to show how to use Git with Jupyter notebooks. The primary audience is data scientists and data analysts who have some experience with Jupyter notebooks but little to no experience with Git or the command line. To that end, this tutorial uses JupyterLab and the JupyterLab Git extension, but also provides the equivalent git commands for the curious.

This tutorial is inspired by the Katacoda Git tutorial in that it follows the same basic flow of introducing Git commands but applied to Jupyter notebooks. The actual notebook used here is from this Google Colab notebook for teaching the pandas Python library.

Setting up JupyterLab

There are multiple ways to run Jupyter notebooks. This tutorial has been designed to work with two options:

  • Google Cloud Platform AI Platform Notebooks, which provides cloud instances of JupyterLab
  • Local installation of JupyterLab

Google Cloud AI Platform Notebooks

This section describes how to create a JupyterLab instance with GCP's AI Platform Notebooks, which automatically includes the Git extension. More detailed instructions can be found here.

  1. Open the AI Platform Notebooks console
  2. Click +NEW INSTANCE and select "Python 3"
    1. Instance name: git-tutorial-python-[YOUR NAME]
    2. Region: us-east1 (South Carolina)
    3. Zone: us-east1-b
    4. Instance properties: use all defaults
    5. Click CREATE
  3. Click OPEN JUPYTERLAB

Local Installation

This section describes how to install JupyterLab and the Git extension on your local machine. It assumes you already have Python installed. So far, it has only been tested on macOS.

  1. (Optional, but recommended) Create a virtual environment. There are many options. If you don't have a preference, an arbitrary recommendation is virtualenvwrapper.
  2. Install Node.js
  3. Install JupyterLab and related packages: pip install jupyterlab~=2.2.9 jupyterlab-git==0.24.0
  4. Install the Git extension: jupyter labextension install @jupyterlab/git
  5. Install the nbdime extension: nbdime extensions --enable
  6. Run JupyterLab: jupyter-lab
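
Before running the steps above, it can help to confirm that the prerequisites are on your PATH. The following is a minimal sketch, not part of the original tutorial: the helper name `have` is hypothetical, and the tool list (python3, pip, node, git) is an assumption based on the commands used in this tutorial.

```shell
# Report whether a command is available on PATH.
# "have" is a hypothetical helper name used only for this sketch.
have() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
    fi
}

# Tools the installation steps rely on (assumed list).
for tool in python3 pip node git; do
    have "$tool"
done
```

If any tool is reported as MISSING, install it before continuing with step 3.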

Forking this repo

To allow you to push and pull commits to and from this repo, you must create your own copy of it, known as creating a fork. The following instructions describe how to fork this repo. More detailed instructions can be found here.

  1. Create a GitHub account, if you don't have one
  2. Open https://github.com/hahns0lo/jupyter-git-tutorial in a browser
  3. In the upper right-hand corner, click Fork
  4. Select your user

Setup SSH key

To allow you to push and pull commits without being prompted for a password, you must set up your GitHub account with an SSH key. The following instructions describe how to do this in your JupyterLab instance. More detailed instructions can be found

  1. Open JupyterLab
  2. Open a terminal from the JupyterLab launcher
  3. Create an SSH key by following the Linux instructions here
  4. Add the SSH key to your GitHub account by following the Linux instructions here.
    1. For step 1, use the following command instead: cat ~/.ssh/id_ed25519.pub
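
As a rough illustration of what the key-creation step involves, the sketch below generates an ed25519 key pair and prints the public half. It rests on several assumptions: the email address is a placeholder, the key is written to a scratch directory so that no existing key is overwritten, and `-N ""` sets an empty passphrase only so the example runs without prompts. For a real key, write to ~/.ssh/id_ed25519 and choose a passphrase when asked.

```shell
# Scratch location so an existing ~/.ssh/id_ed25519 is never touched (assumption for this demo).
KEY_DIR="$(mktemp -d)"
KEY_FILE="$KEY_DIR/id_ed25519"

# -t: key type, -C: comment (placeholder email), -N "": empty passphrase (demo only), -f: output file.
ssh-keygen -q -t ed25519 -C "you@example.com" -N "" -f "$KEY_FILE"

# The .pub file is the part you paste into your GitHub account settings.
cat "$KEY_FILE.pub"

# Once a real key has been added to GitHub, the connection can be checked with:
#   ssh -T git@github.com
```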

Setup ReviewNB

ReviewNB is a GitHub Marketplace app that provides visual diffs for Jupyter notebooks on GitHub. The following instructions describe how to set up ReviewNB with your fork.

  1. Open https://github.com/marketplace/review-notebook-app
  2. Under Pricing and setup, select Free and click Install it for free
  3. Click Complete order and begin installation
  4. Select Only select repositories, select [username]/jupyter-git-tutorial, and click Install
  5. Click Authorize Review Notebook App and you will be redirected to https://app.reviewnb.com/

Scenario 1

  1. Clone your fork of this repo
    1. Using the Git extension
      1. Get the URI to your fork. In your browser, click Code, select "HTTPS", and copy the URI.
      2. On the left-hand side of JupyterLab, click the Git icon to open the Git extension.
      3. Click Clone a Repository
      4. Paste the URI to your fork, e.g. https://github.com/[username]/jupyter-git-tutorial
    2. Using the command line
      1. Get the URI to your fork. In your browser, click Code, select "SSH", and copy the URI.
      2. Open a terminal from the JupyterLab launcher
      3. git clone git@github.com:[username]/jupyter-git-tutorial.git
  2. Make a copy of the tutorial notebook
    1. Using JupyterLab
      1. Open jupyter-git-tutorial
      2. Create a new folder called tutorial
      3. Copy and paste intro_to_pandas.ipynb into tutorial
    2. Using the command line
      1. cd jupyter-git-tutorial
      2. mkdir tutorial
      3. cp intro_to_pandas.ipynb tutorial
  3. Stage the notebook
    1. Using the Git extension
      1. Under Untracked, select intro_to_pandas.ipynb and click +
    2. Using the command line
      1. git status
      2. git add tutorial/intro_to_pandas.ipynb
      3. git status
  4. Commit the notebook
    1. Using the Git extension
      1. Summary: Adding copy of notebook
      2. Click Commit
      3. Enter your name and email
    2. Using the command line
      1. Set your email address: git config --global user.email "you@example.com"
      2. Set your name: git config --global user.name "Your Name"
      3. git commit -m "Adding copy of notebook"
      4. git status
  5. Ignore Jupyter checkpoints
    1. Open intro_to_pandas.ipynb

    2. Create a new text file in the jupyter-git-tutorial folder called .gitignore and add the following:

      .ipynb_checkpoints
      
    3. Stage and commit .gitignore

      1. Using the Git extension
        1. Under Untracked, select .gitignore and click +
        2. Summary: Ignoring checkpoints
        3. Click Commit
      2. Using the command line
        1. git status
        2. git add .gitignore
        3. git commit -m "Ignoring checkpoints"
        4. git status
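
The command-line steps of Scenario 1 can be sketched end to end as follows. This is an illustration under stated assumptions, not the tutorial's exact flow: a scratch repository created with git init stands in for your cloned fork (so no network access is needed), the notebook is a placeholder JSON file rather than the real intro_to_pandas.ipynb, the checkpoint is simulated by hand, and the name and email are set for the scratch repository only instead of with --global.

```shell
# Scratch repository standing in for your cloned fork.
REPO="$(mktemp -d)/jupyter-git-tutorial"
mkdir -p "$REPO"
cd "$REPO"
git init -q
git config user.email "you@example.com"   # stored in this scratch repo only (no --global)
git config user.name "Your Name"

# Placeholder for the notebook that the fork already tracks (not the real content).
printf '%s\n' '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}' > intro_to_pandas.ipynb
git add intro_to_pandas.ipynb
git commit -q -m "Placeholder for the fork's existing history"

# Step 2: make a copy of the notebook in tutorial/.
mkdir tutorial
cp intro_to_pandas.ipynb tutorial

# Steps 3-4: stage and commit the copy.
git add tutorial/intro_to_pandas.ipynb
git commit -q -m "Adding copy of notebook"

# Step 5: simulate the checkpoint that Jupyter writes on save, then ignore it.
mkdir tutorial/.ipynb_checkpoints
cp tutorial/intro_to_pandas.ipynb tutorial/.ipynb_checkpoints/intro_to_pandas-checkpoint.ipynb
echo ".ipynb_checkpoints" > .gitignore
git add .gitignore
git commit -q -m "Ignoring checkpoints"

# The checkpoint folder no longer shows up as untracked.
git status --short

# Show which .gitignore rule matches the checkpoint file.
git check-ignore -v tutorial/.ipynb_checkpoints/intro_to_pandas-checkpoint.ipynb
```

Because the pattern in .gitignore contains no slash, it matches a .ipynb_checkpoints folder at any depth in the repository.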

Scenario 2

  1. Open intro_to_pandas.ipynb in the jupyter-git-tutorial/tutorial folder, run it, and save
  2. Check Git status
    1. Using the Git extension
      1. intro_to_pandas.ipynb should be listed under Changed
    2. Using the command line
      1. cd ~/jupyter-git-tutorial
      2. git status
      3. tutorial/intro_to_pandas.ipynb should be listed as modified under Changes not staged for commit
  3. Look at the changes
    1. Using the Git extension
      1. Under Changed, select intro_to_pandas.ipynb and click the icon with a + and -
      2. Only outputs should have changed
    2. Using the command line
      1. git diff
      2. Keep pressing space to scroll down or q to quit
      3. git difftool
      4. Use the up/down keys to scroll or the following sequence twice to quit
        1. :q
        2. Enter
  4. Stage the changes and view the changes again
    1. Using the Git extension
      1. Under Changed, select intro_to_pandas.ipynb and click +
      2. Under Staged, select intro_to_pandas.ipynb and click the icon with a + and -
    2. Using the command line
      1. git status
      2. git add tutorial/intro_to_pandas.ipynb
      3. git status
      4. git diff (nothing should be printed, because the changes are now staged)
      5. git diff --staged
      6. git difftool --staged
  5. Commit the changes
    1. Using the Git extension
      1. Summary: Ran notebook
      2. Click Commit
      3. Enter your name and email
    2. Using the command line
      1. git commit -m "Ran notebook"
      2. git status
  6. Look at the log
    1. Using the Git extension
      1. Click the History tab
    2. Using the command line
      1. git log
      2. git log --pretty=format:"%h %an %ar - %s"
  7. Look at the last commit
    1. Using the Git extension
      1. Click the History tab
      2. Click on the "Ran notebook" commit to expand
      3. Click on intro_to_pandas.ipynb
    2. Using the command line
      1. Copy the long string of numbers and text after commit. This is called the commit hash or commit SHA.
      2. git show [commit hash]

Scenario 3

  1. Open intro_to_pandas.ipynb in the jupyter-git-tutorial/tutorial folder
  2. Modify the notebook
    1. Find and replace the following
      1. Sacramento to Los Angeles
      2. 485199 to 3792621
      3. 97.92 to 468.97
    2. Run the notebook and save
  3. Look at the changes
    1. Using the Git extension
      1. Look at the output after the pd.Series(['San Francisco', 'San Jose', 'Los Angeles']) cell
      2. Hover over the red/green boxes under "Outputs changed" and click Show source
    2. Using the command line
      1. git diff
      2. git difftool
    3. Note that it may appear that all outputs have changed, even if you don't see any differences. This is because each cell's execution count (the number in brackets next to the cell) is saved in the notebook, so a different count registers as a change.
    4. Try Run->Restart Kernel and Run All Cells... to reset cell numbering and look at the diff again
  4. Stage and commit the changes
    1. Summary: "Replaced Sacramento with Los Angeles"
  5. Look at information about the remote repository
    1. The Git extension does not have this feature
    2. Using the command line
      1. cd ~/jupyter-git-tutorial
      2. git remote
      3. git remote show origin
  6. Open https://github.com/[username]/jupyter-git-tutorial in a browser
    1. Click on the N commits link next to the clock icon. It should not contain any of your commits.
  7. Push your commits
    1. Using the Git extension
      1. Click the cloud icon with an up arrow
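      2. Enter your GitHub username and password (GitHub no longer accepts account passwords for Git over HTTPS; enter a personal access token in the password field)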
    2. Using the command line
      1. git push
  8. Look at the log
  9. Open https://github.com/[username]/jupyter-git-tutorial in a browser
    1. Click on the N commits link again. The value of N should be larger and it should match the log history.
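
Two points from this scenario can be sketched in a scratch setup: a changed execution count alone is enough to produce a diff, and a push makes the remote history match the local log. The assumptions here are that a local bare repository stands in for your fork on GitHub (so no network or credentials are needed), the notebook is a simplified placeholder, and the re-run is simulated with sed.

```shell
# A bare repository stands in for the GitHub fork; "work" is the local clone (assumed names).
BASE="$(mktemp -d)"
git init -q --bare "$BASE/origin.git"
git init -q "$BASE/work"
cd "$BASE/work"
git remote add origin "$BASE/origin.git"
git config user.email "you@example.com"
git config user.name "Your Name"

# Simplified placeholder notebook whose single cell has been run once.
NB=intro_to_pandas.ipynb
printf '%s\n' '{"cells": [{"cell_type": "code", "execution_count": 1, "outputs": [], "source": ["city_names"]}]}' > "$NB"
git add "$NB"
git commit -q -m "Ran notebook"

# Simulate re-running the cell without restarting the kernel: only the count changes.
sed 's/"execution_count": 1/"execution_count": 2/' "$NB" > "$NB.tmp" && mv "$NB.tmp" "$NB"
CHANGED="$(git diff --name-only)"
echo "changed: $CHANGED"        # the notebook is listed although the source is identical
git commit -q -am "Re-ran notebook"

# Inspect the remote, push, and compare local and remote history.
git remote
git push -q -u origin HEAD
git remote show origin
git --no-pager log --oneline
git --no-pager log --oneline '@{u}'   # history of the upstream branch after the push
```

After the push, both log commands list the same commits, which corresponds to the N commits count on GitHub matching your local log.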

Bonus: ReviewNB

  1. Open https://app.reviewnb.com/
  2. Select [username]/jupyter-git-tutorial
  3. Select the Commits tab
  4. Select the "Replaced Sacramento with Los Angeles" commit
  5. Click SEE ON GITHUB