/nbstudy

Tools for studying notebooks.

Primary LanguagePythonBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

nbstudy - tools for studying notebooks

CI Testing PyPI - Version PyPI - Python Version


Table of Contents

Overview

nbstudy is a collection of tools for studying notebooks, especially those published on GitHub.

It generalizes the tooling used in Exploration and Explanation in Computational Notebooks by Adam Rule et al, with additions for Refactoring in Computational Notebooks by the author of this tool (Dylan Lukes, Eric Liu, et al).

The goal of this project is to codify the functionality used to support these studies and future studies in a way that they can be used reproducibly by anyone else who wants to study notebooks in the wild.

⚠️Warning: This tool is still in early development. The API is not stable, and the tool is not yet feature-complete. Many features are still missing (have not been cleaned up and copied over from existing code used priorly for publications), and the documentation is incomplete.

Functionality:

  • Workspace
  • GitHub

Requirements

nbstudy requires Python 3.12 or later, and a recent version of Git 2.43 with support for sparse-checkout and partial-clone.

Installation

To install globally from PyPI, from anywhere run:

pip install nbstudy
nbstudy -h

To develop, clone the repository and then from the root of the repository run:

hatch shell
nbstudy -h

Introduction

Workspaces

nbstudy works on the principle of a "workspace" in which notebooks are studied. A workspace is a Git repository which contains a local cache of notebooks (as sparsely checked-out submodules) as well a database of metadata about those notebooks used to maintain the cache, supported by settings in a configuration file and environment variables.

While some tools work in isolation without a workspace, by using one you get the benefits of being able to automate the process of studying collections of notebooks, collecting results interactively as you go. For example: if you wanted to do coding of every commit of each notebook.

A workspace looks like this:

my-nbstudy-workspace/
├── .gitignore
├── .gitmodules
├── nbstudy.config.json
├── nbstudy.config.env
├── nbstudy.db
└── nbcache/
    ├── localhost/
    │   └── my-repo/
    │
    ├── github.com/
    ┆   ├── user1/
        ┆   ├── repo1/
            ┆   ├── .../notebook1.ipynb
                └── .../notebook2.ipynb

Workspace Settings

The nbstudy.config.json file contains settings for the workspace. Settings may also be configured using environment variables prefixed with NBSTUDY_, or read from the nbstudy.config.env file.

⚠️ The nbstudy.config.json file is intended to be shared with others, and should not contain any sensitive information. The nbstudy.config.env file is intended to be private, and should contain sensitive information such as API keys. It is by default included in the .gitignore file.

Workpace Database

The nbstudy.db file is a SQLite database containing metadata about the notebooks in the workspace, and is managed by nbstudy. It is not intended to be edited manually, though it can be inspected.

Workspace Notebook Cache

The nbcache/ directory contains the notebooks themselves, organized by the hostname of the Git repository they came from, followed by the username and repository name of the repository.

There are two cases that are specially managed by nbstudy:

The localhost/ directory is used for notebooks with no provenance (e.g. notebooks that are created locally and not in a Git repository), and is managed by the nbstudy local subcommands.

The github.com/ directory is used for notebooks that are scraped from a Git repository hosted on GitHub and is managed by the nbstudy github subcommands.

⚠️ Notebook caches can grow very large, as repositories that are fully downloaded can include huge data files. nbstudy does make a best faith effort to minimize bloat by using the sparse-checkout and partial-clone features of Git to only download the minimum files that are needed (notebook files themselves) by default.

Usage

The nbstudy tool provides a number of subcommands for working with notebooks.

🛠️ TODO

Attribution

If you use nbstudy in your research, please cite it as follows:

@misc{nbstudy,
  author = {Lukes, Dylan},
  title = {nbstudy - tools for studying notebooks},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DylanLukes/nbstudy}},
  commit = {<commit hash here>}
}

License

nbstudy is distributed under the terms of the BSD-3-Clause license.