In this tutorial we will build a data pipeline flow as shown:
- Have Docker installed
- Have cloned this repository to your local machine, with a terminal up and running
- Check that your Docker is running with the following command:
docker run hello-world

Install Docker Desktop
- Follow the instructions for your operating system.
- To make sure we can run multiple containers, go to Docker > Preferences > Resources and under "Memory" make sure you have selected more than 4GB.

If you already have a preferred text editor, skip this step.
In this tutorial we will set up a data labeling web app to label data for the mushroom app. We will use Docker to run everything inside containers.
In order to complete this tutorial you will need your own GCP account set up and your own GitHub repo.
- Fork or download from here
The next step is to enable our container to access GCP Storage buckets.
It is important to note that we do not want any secure information in Git, so we will manage these files outside of the git folder. At the same level as the data-labeling folder, create a folder called secrets.
Your folder structure should look like this:
|-data-labeling
|-secrets
- Here are the steps to create a service account:
- To set up a service account, go to the GCP Console and search for "Service accounts" in the top search box, or go to "IAM & Admin" > "Service accounts" from the top-left menu, and create a new service account called "data-service-account". For "Service account permissions" select "Cloud Storage" > "Storage Admin" (type "cloud storage" in the filter and scroll down until you find it). Then click Continue and Done.
- This will create a service account.
- In the "Actions" column on the right, click the vertical ... and select "Manage keys". In the "Create private key for data-service-account" prompt, select "JSON" and click Create. This will download a private key JSON file to your computer. Copy this JSON file into the secrets folder and rename it to data-service-account.json.
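If you prefer the command line over the console, a roughly equivalent sequence of gcloud commands is sketched below. This is optional and assumes you have the gcloud CLI installed and authenticated; the project ID is a placeholder you would replace with your own.

```shell
# Optional sketch: create the service account, grant Storage Admin, and
# download a key with gcloud (replace ac215-project with your project ID)
gcloud iam service-accounts create data-service-account --project=ac215-project

gcloud projects add-iam-policy-binding ac215-project \
    --member="serviceAccount:data-service-account@ac215-project.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

gcloud iam service-accounts keys create secrets/data-service-account.json \
    --iam-account="data-service-account@ac215-project.iam.gserviceaccount.com"
```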
- To set up GCP credentials in a container we need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS inside the container to the path of the secrets file from the previous step.
- We do this by setting GOOGLE_APPLICATION_CREDENTIALS to /secrets/data-service-account.json in the docker compose file.
- Make sure GCP_PROJECT matches your GCP project.
docker-compose.yml
version: "3.8"
networks:
default:
name: data-labeling-network
external: true
services:
data-label-cli:
image: data-label-cli
container_name: data-label-cli
volumes:
- ../secrets:/secrets
- ../data-labeling:/app
environment:
GOOGLE_APPLICATION_CREDENTIALS: /secrets/data-service-account.json
GCP_PROJECT: "ac215-project" [REPLACE WITH YOUR GCP PROJECT]
GCP_ZONE: "us-central1-a"
depends_on:
- data-label-studio
data-label-studio:
image: heartexlabs/label-studio:latest
container_name: data-label-studio
ports:
- 8080:8080
volumes:
- ./docker-volumes/label-studio:/label-studio/data
- ../secrets:/secrets
environment:
LABEL_STUDIO_DISABLE_SIGNUP_WITHOUT_LINK: "true"
LABEL_STUDIO_USERNAME: "pavlos@seas.harvard.edu" [REPLACE WITH YOUR EMAIL]
LABEL_STUDIO_PASSWORD: "awesome" [CHANGE IF NECESSARY]
GOOGLE_APPLICATION_CREDENTIALS: /secrets/data-service-account.json
GCP_PROJECT: "ac215-project" [REPLACE WITH YOUR GCP PROJECT]
GCP_ZONE: "us-central1-a"
In this step we will assume we have already collected some data for the mushroom app. The images are of various mushrooms belonging to one of three types: amanita, crimini, or oyster. None of the images are labeled, and our task here is to use Label Studio to manage labeling of the images.
- Download the unlabeled data from here
- Extract the zip file
- Go to https://console.cloud.google.com/storage/browser
- Create a bucket mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
- Create a folder mushrooms_unlabeled inside the bucket
- Create a folder mushrooms_labeled inside the bucket
- Upload the images from your local folder into the mushrooms_unlabeled folder inside the bucket (or use gsutil as sketched below)
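If you prefer the command line to the console uploader, a sketch with gsutil is shown below; it assumes the extracted images sit in a local folder named mushrooms_unlabeled and that you replace the bucket name with your own.

```shell
# Copy the extracted, unlabeled images into the bucket (parallel upload)
gsutil -m cp -r ./mushrooms_unlabeled/* gs://mushroom-app-data-demo/mushrooms_unlabeled/
```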
Based on your OS, run the startup script to make building & running the container easy
- Make sure you are inside the data-labeling folder and open a terminal at this location
- Run sh docker-shell.sh (or docker-shell.bat on Windows)
This will run two containers: the Label Studio container and a CLI container that can call APIs against Label Studio. You can verify them by running docker container ls in another terminal prompt. You should see something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00d808ab0386 data-label-cli "pipenv shell" About a minute ago Up About a minute data-labeling-data-label-cli-run
4ab1ec940b4a heartexlabs/label-studio:latest "./deploy/docker-ent…" 2 days ago Up 2 days 0.0.0.0:8080->8080/tcp data-label-studio
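If Label Studio does not come up at port 8080, the container logs are the first place to look. Using the container name from the listing above:

```shell
# Follow the Label Studio container logs
docker logs -f data-label-studio
```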
Here we will set up the Label Studio App to use our mushroom images so we can annotate them.
- Run the Label Studio App by going to http://localhost:8080/
- Log in with pavlos@seas.harvard.edu / awesome (use the credentials from the docker compose file that you used)
- Click Create Project to create a new project
- Give it a project name
- Skip the Data Import tab and go to Labeling Setup
- Select Template: Computer Vision > Image Classification
- Remove the default label choices and add: amanita, crimini, oyster
- Save
Next we will configure Label Studio to read images from a GCS bucket and save annotations to a GCS bucket
- Go to the project created in the previous step
- Click on Settings and select Cloud Storage from the options on the left
- Click Add Source Storage
- Then in the popup for storage details:
  - Storage Type: Google Cloud Storage
  - Storage Title: Mushroom Images
  - Bucket Name: mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
  - Bucket Prefix: mushrooms_unlabeled
  - File Filter Regex: .*
  - Enable: Treat every bucket object as a source file
  - Enable: Use pre-signed URLs
  - Ignore: Google Application Credentials
  - Ignore: Google Project ID
- You can Check Connection to make sure your connection works
- Save your changes
- Click Sync Storage to start syncing from the bucket to Label Studio
- Click Add Target Storage
- Then in the popup for storage details:
  - Storage Type: Google Cloud Storage
  - Storage Title: Mushroom Images
  - Bucket Name: mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
  - Bucket Prefix: mushrooms_labeled
  - Ignore: Google Application Credentials
  - Ignore: Google Project ID
- You can Check Connection to make sure your connection works
- Save your changes
In order to view images in Label Studio directly from the GCS bucket, we need to enable CORS on the bucket.
- Go to the shell where we ran the docker containers
- Run python cli.py -c to set the CORS configuration (an illustrative sketch of such a configuration is shown after this list)
- To view the CORS settings, run python cli.py -m
- To view all the code, open the data-labeling folder in VSCode or any IDE of your choice
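For reference, a GCS CORS configuration that lets the Label Studio UI at localhost:8080 fetch images might look roughly like the sketch below. This is only an illustration of the idea; the exact rules applied by cli.py -c may differ, and gsutil is used here instead of the Python client.

```shell
# Illustrative CORS rules only; the rules set by cli.py -c may differ
cat > /tmp/cors.json <<'EOF'
[
  {
    "origin": ["http://localhost:8080"],
    "method": ["GET"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
EOF

# Apply and then inspect the CORS configuration on the bucket
gsutil cors set /tmp/cors.json gs://mushroom-app-data-demo
gsutil cors get gs://mushroom-app-data-demo
```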
Go into the newly created project and you should see the images automatically pulled in from the GCS Cloud Storage bucket.
- Click on an item in the grid to annotate using the UI
- Repeat for a few of the images
Here are some examples of mushrooms and their labels:
- Go to https://console.cloud.google.com/storage/browser
- Go into the mushroom-app-data-demo bucket (REPLACE WITH YOUR BUCKET NAME) and then into the folder mushrooms_labeled
- You should see some JSON files corresponding to the images in mushrooms_unlabeled that have been annotated
- Open a JSON file to see what the annotations look like (or inspect them from the terminal as sketched below)
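You can also inspect the annotation files from the terminal. A sketch using gsutil (replace the bucket name, and pick an actual file name from the listing):

```shell
# List the annotation JSON files written by Label Studio
gsutil ls gs://mushroom-app-data-demo/mushrooms_labeled/

# Print one of them (use a real file name from the listing above)
gsutil cat gs://mushroom-app-data-demo/mushrooms_labeled/<SOME_FILE>.json
```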
- Get the API key from Label Studio for programmatic access to data
  - Go to User Profile > Account & Settings
  - You can copy the Access Token from this screen
  - Use this token as the -k argument in the following command line calls
- Go to the shell where we ran the docker containers
- Run python cli.py -p -k followed by your Access Token. This will list out your projects
- Run python cli.py -t -k followed by your Access Token. This will list some tasks from the first project
You will see some JSON output of the annotations for each image stored in Label Studio:
Annotations: [{'id': 5, 'created_username': ' pavlos@seas.harvard.edu, 1', 'created_ago': '1\xa0hour, 53\xa0minutes', 'completed_by': 1, 'result': [{'value': {'choices': ['amanita']}, 'id': 'qHjUzqXO6W', 'from_name': 'choice', 'to_name': 'image', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2023-09-06T17:33:08.558474Z', 'updated_at': '2023-09-06T17:33:08.558492Z', 'draft_created_at': None, 'lead_time': 5.981, 'import_id': None, 'last_action': None, 'task': 1, 'project': 1, 'updated_by': 1, 'parent_prediction': None, 'parent_annotation': None, 'last_created_by': None}]
Annotations: [{'id': 1, 'created_username': ' pavlos@seas.harvard.edu, 1', 'created_ago': '1\xa0hour, 55\xa0minutes', 'completed_by': 1, 'result': [{'value': {'choices': ['amanita']}, 'id': 'Hp3wZORhBI', 'from_name': 'choice', 'to_name': 'image', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2023-09-06T17:31:04.307102Z', 'updated_at': '2023-09-06T17:31:04.307117Z', 'draft_created_at': None, 'lead_time': 11.197, 'import_id': None, 'last_action': None, 'task': 2, 'project': 1, 'updated_by': 1, 'parent_prediction': None, 'parent_annotation': None, 'last_created_by': None}]
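If you want to hit the Label Studio API directly instead of going through cli.py, a minimal sketch with curl is shown below. The endpoints come from the standard Label Studio REST API and the token value is a placeholder; check the Label Studio API docs if the paths differ in your version.

```shell
# List projects (replace YOUR_ACCESS_TOKEN with the token copied above)
curl -s -H "Authorization: Token YOUR_ACCESS_TOKEN" \
    http://localhost:8080/api/projects

# List tasks and their annotations for project 1
curl -s -H "Authorization: Token YOUR_ACCESS_TOKEN" \
    "http://localhost:8080/api/tasks?project=1"
```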
In this tutorial we will set up a data versioning step for the mushroom app pipeline. We will use Docker to run everything inside containers.
- Fork or download from here
Your folder structure should look like this:
|-data-labeling
|---docker-volumes
|-----label-studio
|-data-versioning
|-secrets
- To view all the code, open the data-versioning folder in VSCode or any IDE of your choice
- Go to https://console.cloud.google.com/storage/browser
- Go to the bucket mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
- Create a folder dvc_store inside the bucket
We will be using DVC as our data versioning tool. DVC (Data Version Control) is an open-source, Git-based data science tool. It applies version control to machine learning development, making your repo the backbone of your project.
In order for the DVC container to connect to our GCS bucket, open the file docker-shell.sh and edit some of the values to match your setup:
export GCS_BUCKET_NAME="mushroom-app-data-demo" # REPLACE WITH YOUR BUCKET NAME
export GCP_PROJECT="ac215-project" # REPLACE WITH YOUR GCP PROJECT
export GCP_ZONE="us-central1-a"
For Windows, open the file docker-shell.bat and make the same edits.
Based on your OS, run the startup script to make building & running the container easy
- Make sure you are inside the data-versioning folder and open a terminal at this location
- Run sh docker-shell.sh (or docker-shell.bat on Windows)
This will run a container that has DVC already installed. You can verify the running containers with docker container ls in another terminal prompt. You should see something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00d808ab0386 data-label-cli "pipenv shell" About a minute ago Up About a minute data-labeling-data-label-cli-run
4ab1ec940b4a heartexlabs/label-studio:latest "./deploy/docker-ent…" 2 days ago Up 2 days 0.0.0.0:8080->8080/tcp data-label-studio
e87e8c6f180f data-version-cli "pipenv shell" 5 seconds ago Up 5 seconds data-version-cli
In this step we will download all the labeled data from the GCS bucket and create the dataset_v1 version of our dataset.
- Go to the shell where we ran the docker container for data-versioning
- Run python cli.py -d
If you check inside the data-versioning folder you should see a mushroom_dataset folder with the labeled images in it.
.
|-mushroom_dataset
|---amanita
|---crimini
|---oyster
|-mushroom_dataset_prep
The dataset from the data labeling step will be downloaded to a local folder called mushroom_dataset.
Make sure your .gitignore ignores the dataset folders; we do not want the dataset files going into our git repo:
/mushroom_dataset_prep
/mushroom_dataset
In this step we will start tracking the dataset using DVC
In this step we create a data registry using DVC
dvc init
dvc remote add -d mushroom_dataset gs://mushroom-app-data-demo/dvc_store
dvc add mushroom_dataset
dvc push
You can go to the dvc_store folder in your GCS bucket to view the tracking files.
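A couple of DVC commands are useful for a sanity check at this point (a sketch; dvc remote list shows the configured remote and dvc status -c compares your local cache against it):

```shell
# Show the configured DVC remote (should point at gs://<your bucket>/dvc_store)
dvc remote list

# Compare the local cache against the remote; after dvc push this should
# report that the cache and remote are in sync
dvc status -c
```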
- First run git status
git status
- Add changes
git add .
- Commit changes
git commit -m 'dataset updates...'
- Add a dataset tag
git tag -a 'dataset_v1' -m 'tag dataset'
- Push changes
git push --atomic origin main dataset_v1
In this step we will use Colab to view various versions of the dataset.
- Open the Colab Notebook
- Follow the instructions in the Colab Notebook (at a high level, retrieving a dataset version looks roughly like the sketch below)
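At a high level, pulling a specific version of the dataset back down looks roughly like the sketch below; the exact steps the notebook uses may differ, and the repository URL and names are placeholders for your own repo.

```shell
# Sketch: retrieve the dataset as it existed at the dataset_v1 tag
# (replace the repo URL with your own; assumes GCP credentials are available to DVC)
git clone https://github.com/<YOUR_USERNAME>/<YOUR_REPO>.git
cd <YOUR_REPO>/data-versioning

git checkout dataset_v1     # switch to the tagged dataset version
pip install 'dvc[gs]'       # DVC with Google Cloud Storage support
dvc pull                    # download the data tracked for this tag
```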
- Go to the Label Studio App at http://localhost:8080/
- Click on an item in the grid to annotate using the UI
- Repeat for a few of the images
In this step we will download the labeled data from the GCS bucket and create the dataset_v2 version of our dataset.
- Go to the shell where we ran the docker container for data-versioning
- Run python cli.py -d
dvc add mushroom_dataset
dvc push
- First run git status
git status
- Add changes
git add .
- Commit changes
git commit -m 'dataset updates...'
- Add a dataset tag
git tag -a 'dataset_v2' -m 'tag dataset'
- Push changes
git push --atomic origin main dataset_v2
In this step we will use Colab to view the new version of the dataset.
- Open the Colab Notebook
- Follow the instructions in the Colab Notebook to view dataset_v2
By the end of this tutorial your folder structure should look like this:
|-data-labeling
|---docker-volumes
|-----label-studio
|-data-versioning
|-mushroom_dataset
|---amanita
|---crimini
|---oyster
|-mushroom_dataset_prep
|-secrets
To make sure we do not have any running containers and to clean up any unused images:
- Run docker container ls
- Stop any containers that are running (see the sketch after this list)
- Run docker system prune
- Run docker image ls
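Putting those steps together, the cleanup might look roughly like this; the container names are taken from earlier in the tutorial, so adjust them to whatever docker container ls shows on your machine.

```shell
# Stop the tutorial containers if they are still running
docker container stop data-label-studio data-label-cli data-version-cli

# Remove stopped containers, unused networks, and dangling images
docker system prune

# Confirm which images remain
docker image ls
```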