In this tutorial we will build a data pipeline flow as shown:
- Have Docker installed
- Have cloned this repository to your local machine, with a terminal up and running
- Check that your Docker is running with the following command:
docker run hello-world

Install Docker Desktop
- Follow the instructions for your operating system.
- To make sure we can run multiple containers, go to Docker > Preferences > Resources and under "Memory" make sure you have selected more than 4GB.

If you already have a preferred text editor, skip this step.
In this tutorial we will set up a data labeling web app to label data for the mushroom app. We will use Docker to run everything inside containers.
In order to complete this tutorial you will need your own GCP account set up and your own GitHub repo.
- Fork or download from here
The next step is to enable our container to access GCP Storage buckets.
It is important to note that we do not want any secure information in Git, so we will manage these files outside of the git folder. At the same level as the data-labeling folder, create a folder called secrets.
Your folder structure should look like this:
|-data-labeling
|-secrets
- Here are the steps to create a service account:
- To set up a service account, go to the GCP Console and search for "Service accounts" in the top search box, or go to "IAM & Admin" > "Service accounts" from the top-left menu, and create a new service account called "data-service-account". For "Service account permissions" select "Cloud Storage" > "Storage Admin" (type "cloud storage" in the filter and scroll down until you find it). Then click Continue and Done.
- This will create a service account.
- In the "Actions" column on the right, click the vertical ... and select "Manage keys". In the "Create private key for data-service-account" prompt, select "JSON" and click Create. This will download a private key JSON file to your computer. Copy this JSON file into the secrets folder and rename it to data-service-account.json.
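If you prefer the command line over the console, a roughly equivalent sequence of gcloud commands is sketched below. This is optional and assumes you have the gcloud CLI installed and authenticated; the project ID is a placeholder you would replace with your own.

```shell
# Optional sketch: create the service account, grant Storage Admin, and
# download a key with gcloud (replace ac215-project with your project ID)
gcloud iam service-accounts create data-service-account --project=ac215-project

gcloud projects add-iam-policy-binding ac215-project \
    --member="serviceAccount:data-service-account@ac215-project.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

gcloud iam service-accounts keys create secrets/data-service-account.json \
    --iam-account="data-service-account@ac215-project.iam.gserviceaccount.com"
```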
- To set up GCP credentials in a container we need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS inside the container to the path of the secrets file from the previous step.
- We do this by setting GOOGLE_APPLICATION_CREDENTIALS to /secrets/data-service-account.json in the docker compose file.
- Make sure GCP_PROJECT matches your GCP project.
docker-compose.yml
version: "3.8"
networks:
default:
name: data-labeling-network
external: true
services:
data-label-cli:
image: data-label-cli
container_name: data-label-cli
volumes:
- ../secrets:/secrets
- ../data-labeling:/app
environment:
GOOGLE_APPLICATION_CREDENTIALS: /secrets/data-service-account.json
GCP_PROJECT: "ac215-project" [REPLACE WITH YOUR GCP PROJECT]
GCP_ZONE: "us-central1-a"
depends_on:
- data-label-studio
data-label-studio:
image: heartexlabs/label-studio:latest
container_name: data-label-studio
ports:
- 8080:8080
volumes:
- ./docker-volumes/label-studio:/label-studio/data
- ../secrets:/secrets
environment:
LABEL_STUDIO_DISABLE_SIGNUP_WITHOUT_LINK: "true"
LABEL_STUDIO_USERNAME: "pavlos@seas.harvard.edu" [REPLACE WITH YOUR EMAIL]
LABEL_STUDIO_PASSWORD: "awesome" [CHANGE IF NECESSARY]
GOOGLE_APPLICATION_CREDENTIALS: /secrets/data-service-account.json
GCP_PROJECT: "ac215-project" [REPLACE WITH YOUR GCP PROJECT]
GCP_ZONE: "us-central1-a"
In this step we will assume we have already collected some data for the mushroom app. The images are of various mushrooms belonging to one of three types: amanita, crimini, or oyster. None of the images are labeled, and our task here is to use Label Studio to manage labeling of the images.
- Download the unlabeled data from here
- Extract the zip file
- Go to https://console.cloud.google.com/storage/browser
- Create a bucket mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
- Create a folder mushrooms_unlabeled inside the bucket
- Create a folder mushrooms_labeled inside the bucket
- Upload the images from your local folder into the mushrooms_unlabeled folder inside the bucket (or use gsutil as sketched below)
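If you prefer the command line to the console uploader, a sketch with gsutil is shown below; it assumes the extracted images sit in a local folder named mushrooms_unlabeled and that you replace the bucket name with your own.

```shell
# Copy the extracted, unlabeled images into the bucket (parallel upload)
gsutil -m cp -r ./mushrooms_unlabeled/* gs://mushroom-app-data-demo/mushrooms_unlabeled/
```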
Based on your OS, run the startup script to make building & running the container easy
- Make sure you are inside the data-labeling folder and open a terminal at this location
- Run sh docker-shell.sh (or docker-shell.bat on Windows)
This will run two containers: the Label Studio container and a CLI container that can call APIs against Label Studio. You can verify them by running docker container ls in another terminal prompt. You should see something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00d808ab0386 data-label-cli "pipenv shell" About a minute ago Up About a minute data-labeling-data-label-cli-run
4ab1ec940b4a heartexlabs/label-studio:latest "./deploy/docker-ent…" 2 days ago Up 2 days 0.0.0.0:8080->8080/tcp data-label-studio
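If Label Studio does not come up at port 8080, the container logs are the first place to look. Using the container name from the listing above:

```shell
# Follow the Label Studio container logs
docker logs -f data-label-studio
```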
Here we will set up the Label Studio App to use our mushroom images so we can annotate them.
- Run the Label Studio App by going to http://localhost:8080/
- Log in with pavlos@seas.harvard.edu / awesome (use the credentials from the docker compose file that you used)
- Click Create Project to create a new project
- Give it a project name
- Skip the Data Import tab and go to Labeling Setup
- Select Template: Computer Vision > Image Classification
- Remove the default label choices and add: amanita, crimini, oyster
- Save
Next we will configure Label Studio to read images from a GCS bucket and save annotations to a GCS bucket
- Go to the project created in the previous step
- Click on Settings and select Cloud Storage from the options on the left
- Click Add Source Storage
- Then in the popup for storage details:
  - Storage Type: Google Cloud Storage
  - Storage Title: Mushroom Images
  - Bucket Name: mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
  - Bucket Prefix: mushrooms_unlabeled
  - File Filter Regex: .*
  - Enable: Treat every bucket object as a source file
  - Enable: Use pre-signed URLs
  - Ignore: Google Application Credentials
  - Ignore: Google Project ID
- You can Check Connection to make sure your connection works
- Save your changes
- Click Sync Storage to start syncing from the bucket to Label Studio
- Click Add Target Storage
- Then in the popup for storage details:
  - Storage Type: Google Cloud Storage
  - Storage Title: Mushroom Images
  - Bucket Name: mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
  - Bucket Prefix: mushrooms_labeled
  - Ignore: Google Application Credentials
  - Ignore: Google Project ID
- You can Check Connection to make sure your connection works
- Save your changes
In order to view images in Label Studio directly from the GCS bucket, we need to enable CORS on the bucket.
- Go to the shell where we ran the docker containers
- Run python cli.py -c to set the CORS configuration (an illustrative sketch of such a configuration is shown after this list)
- To view the CORS settings, run python cli.py -m
- To view all the code, open the data-labeling folder in VSCode or any IDE of your choice
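For reference, a GCS CORS configuration that lets the Label Studio UI at localhost:8080 fetch images might look roughly like the sketch below. This is only an illustration of the idea; the exact rules applied by cli.py -c may differ, and gsutil is used here instead of the Python client.

```shell
# Illustrative CORS rules only; the rules set by cli.py -c may differ
cat > /tmp/cors.json <<'EOF'
[
  {
    "origin": ["http://localhost:8080"],
    "method": ["GET"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
EOF

# Apply and then inspect the CORS configuration on the bucket
gsutil cors set /tmp/cors.json gs://mushroom-app-data-demo
gsutil cors get gs://mushroom-app-data-demo
```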
Go into the newly created project and you should see the images automatically pulled in from the GCS Cloud Storage bucket.
- Click on an item in the grid to annotate using the UI
- Repeat for a few of the images
Here are some examples of mushrooms and their labels:
- Go to https://console.cloud.google.com/storage/browser
- Go into the mushroom-app-data-demo bucket (REPLACE WITH YOUR BUCKET NAME) and then into the folder mushrooms_labeled
- You should see some JSON files corresponding to the images in mushrooms_unlabeled that have been annotated
- Open a JSON file to see what the annotations look like (or inspect them from the terminal as sketched below)
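You can also inspect the annotation files from the terminal. A sketch using gsutil (replace the bucket name, and pick an actual file name from the listing):

```shell
# List the annotation JSON files written by Label Studio
gsutil ls gs://mushroom-app-data-demo/mushrooms_labeled/

# Print one of them (use a real file name from the listing above)
gsutil cat gs://mushroom-app-data-demo/mushrooms_labeled/<SOME_FILE>.json
```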
- Get the API key from Label Studio for programmatic access to data
  - Go to User Profile > Account & Settings
  - You can copy the Access Token from this screen
  - Use this token as the -k argument in the following command line calls
- Go to the shell where we ran the docker containers
- Run python cli.py -p -k followed by your Access Token. This will list out your projects
- Run python cli.py -t -k followed by your Access Token. This will list some tasks from the first project
You will see some JSON output of the annotations for each image stored in Label Studio:
Annotations: [{'id': 5, 'created_username': ' pavlos@seas.harvard.edu, 1', 'created_ago': '1\xa0hour, 53\xa0minutes', 'completed_by': 1, 'result': [{'value': {'choices': ['amanita']}, 'id': 'qHjUzqXO6W', 'from_name': 'choice', 'to_name': 'image', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2023-09-06T17:33:08.558474Z', 'updated_at': '2023-09-06T17:33:08.558492Z', 'draft_created_at': None, 'lead_time': 5.981, 'import_id': None, 'last_action': None, 'task': 1, 'project': 1, 'updated_by': 1, 'parent_prediction': None, 'parent_annotation': None, 'last_created_by': None}]
Annotations: [{'id': 1, 'created_username': ' pavlos@seas.harvard.edu, 1', 'created_ago': '1\xa0hour, 55\xa0minutes', 'completed_by': 1, 'result': [{'value': {'choices': ['amanita']}, 'id': 'Hp3wZORhBI', 'from_name': 'choice', 'to_name': 'image', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2023-09-06T17:31:04.307102Z', 'updated_at': '2023-09-06T17:31:04.307117Z', 'draft_created_at': None, 'lead_time': 11.197, 'import_id': None, 'last_action': None, 'task': 2, 'project': 1, 'updated_by': 1, 'parent_prediction': None, 'parent_annotation': None, 'last_created_by': None}]
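If you want to hit the Label Studio API directly instead of going through cli.py, a minimal sketch with curl is shown below. The endpoints come from the standard Label Studio REST API and the token value is a placeholder; check the Label Studio API docs if the paths differ in your version.

```shell
# List projects (replace YOUR_ACCESS_TOKEN with the token copied above)
curl -s -H "Authorization: Token YOUR_ACCESS_TOKEN" \
    http://localhost:8080/api/projects

# List tasks and their annotations for project 1
curl -s -H "Authorization: Token YOUR_ACCESS_TOKEN" \
    "http://localhost:8080/api/tasks?project=1"
```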
In this tutorial we will set up a data versioning step for the mushroom app pipeline. We will use Docker to run everything inside containers.
- Fork or download from here
Your folder structure should look like this:
|-data-labeling
|---docker-volumes
|-----label-studio
|-data-versioning
|-secrets
- To view all the code, open the data-versioning folder in VSCode or any IDE of your choice
- Go to https://console.cloud.google.com/storage/browser
- Go to the bucket mushroom-app-data-demo (REPLACE WITH YOUR BUCKET NAME)
- Create a folder dvc_store inside the bucket
We will be using DVC as our data versioning tool. DVC (Data Version Control) is an open-source, Git-based data science tool. It applies version control to machine learning development, making your repo the backbone of your project.
In order for the DVC container to connect to our GCS bucket, open the file docker-shell.sh and edit some of the values to match your setup:
export GCS_BUCKET_NAME="mushroom-app-data-demo" # REPLACE WITH YOUR BUCKET NAME
export GCP_PROJECT="ac215-project" # REPLACE WITH YOUR GCP PROJECT
export GCP_ZONE="us-central1-a"
For Windows, open the file docker-shell.bat and make the same edits.
Based on your OS, run the startup script to make building & running the container easy
- Make sure you are inside the data-versioning folder and open a terminal at this location
- Run sh docker-shell.sh (or docker-shell.bat on Windows)
This will run a container that has DVC already installed. You can verify the running containers with docker container ls in another terminal prompt. You should see something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00d808ab0386 data-label-cli "pipenv shell" About a minute ago Up About a minute data-labeling-data-label-cli-run
4ab1ec940b4a heartexlabs/label-studio:latest "./deploy/docker-ent…" 2 days ago Up 2 days 0.0.0.0:8080->8080/tcp data-label-studio
e87e8c6f180f data-version-cli "pipenv shell" 5 seconds ago Up 5 seconds data-version-cli
In this step we will download all the labeled data from the GCS bucket and create the dataset_v1 version of our dataset.
- Go to the shell where we ran the docker container for data-versioning
- Run python cli.py -d
If you check inside the data-versioning folder you should see a mushroom_dataset folder with the labeled images in it.
.
|-mushroom_dataset
|---amanita
|---crimini
|---oyster
|-mushroom_dataset_prep
The dataset from the data labeling step will be downloaded to a local folder called mushroom_dataset.
Make sure your .gitignore ignores the dataset folders; we do not want the dataset files going into our git repo:
/mushroom_dataset_prep
/mushroom_dataset
In this step we will start tracking the dataset using DVC
In this step we create a data registry using DVC
dvc init
dvc remote add -d mushroom_dataset gs://mushroom-app-data-demo/dvc_store
dvc add mushroom_dataset
dvc push
You can go to the dvc_store folder in your GCS bucket to view the tracking files.
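A couple of DVC commands are useful for a sanity check at this point (a sketch; dvc remote list shows the configured remote and dvc status -c compares your local cache against it):

```shell
# Show the configured DVC remote (should point at gs://<your bucket>/dvc_store)
dvc remote list

# Compare the local cache against the remote; after dvc push this should
# report that the cache and remote are in sync
dvc status -c
```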
- First run git status
git status
- Add changes
git add .
- Commit changes
git commit -m 'dataset updates...'
- Add a dataset tag
git tag -a 'dataset_v1' -m 'tag dataset'
- Push changes
git push --atomic origin main dataset_v1
In this step we will use Colab to view various versions of the dataset.
- Open the Colab Notebook
- Follow the instructions in the Colab Notebook (at a high level, retrieving a dataset version looks roughly like the sketch below)
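At a high level, pulling a specific version of the dataset back down looks roughly like the sketch below; the exact steps the notebook uses may differ, and the repository URL and names are placeholders for your own repo.

```shell
# Sketch: retrieve the dataset as it existed at the dataset_v1 tag
# (replace the repo URL with your own; assumes GCP credentials are available to DVC)
git clone https://github.com/<YOUR_USERNAME>/<YOUR_REPO>.git
cd <YOUR_REPO>/data-versioning

git checkout dataset_v1     # switch to the tagged dataset version
pip install 'dvc[gs]'       # DVC with Google Cloud Storage support
dvc pull                    # download the data tracked for this tag
```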
- Go to the Label Studio App at http://localhost:8080/
- Click on an item in the grid to annotate using the UI
- Repeat for a few of the images
In this step we will download the labeled data from the GCS bucket and create the dataset_v2 version of our dataset.
- Go to the shell where we ran the docker container for data-versioning
- Run python cli.py -d
dvc add mushroom_dataset
dvc push
- First run git status
git status
- Add changes
git add .
- Commit changes
git commit -m 'dataset updates...'
- Add a dataset tag
git tag -a 'dataset_v2' -m 'tag dataset'
- Push changes
git push --atomic origin main dataset_v2
In this step we will use Colab to view the new version of the dataset.
- Open the Colab Notebook
- Follow the instructions in the Colab Notebook to view dataset_v2
By the end of this tutorial your folder structure should look like this:
|-data-labeling
|---docker-volumes
|-----label-studio
|-data-versioning
|-mushroom_dataset
|---amanita
|---crimini
|---oyster
|-mushroom_dataset_prep
|-secrets
To make sure we do not have any running containers and to clean up any unused images:
- Run docker container ls
- Stop any containers that are running (see the sketch after this list)
- Run docker system prune
- Run docker image ls
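Putting those steps together, the cleanup might look roughly like this; the container names are taken from earlier in the tutorial, so adjust them to whatever docker container ls shows on your machine.

```shell
# Stop the tutorial containers if they are still running
docker container stop data-label-studio data-label-cli data-version-cli

# Remove stopped containers, unused networks, and dangling images
docker system prune

# Confirm which images remain
docker image ls
```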