0. Repo setup (already run for you)
git clone git@github.com:daavoo/dataset-pyday-bcn-2021.git
cd dataset-pyday-bcn-2021
pip install -r requirements.txt
dvc init
You should be able to follow all the steps below without leaving the browser.
Navigate to your fork and press `.`
or change the URL from "github.com" to "github.dev".
- Create a folder on Google Drive.
- Copy the URL.
Add remote to `.dvc/config`
[core]
    remote = myremote
['remote "myremote"']
    url = gdrive://{YOUR_URL}
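As a quick sanity check, the sketch below parses a config of the same shape with Python's standard `configparser` and looks up the URL of the default remote. It is only an illustration: DVC reads `.dvc/config` with its own loader, and `FOLDER_ID` and the helper `default_remote_url` are placeholders introduced here.

```python
import configparser

# Same shape as .dvc/config; FOLDER_ID is a placeholder, not a real folder.
CONFIG_TEXT = """
[core]
remote = myremote
['remote "myremote"']
url = gdrive://FOLDER_ID
"""


def default_remote_url(config_text):
    """Return the URL of the remote named in [core].remote."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # The section header keeps its quotes, so the raw name is: 'remote "myremote"'
    section = f"'remote \"{name}\"'"
    return parser[section]["url"]


print(default_remote_url(CONFIG_TEXT))
```

If the lookup raises `KeyError`, the `remote` value under `[core]` and the name in the remote section header most likely do not match.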
Create `params.yaml`
repo: daavoo/data-source-pyday-bcn-2021
labels:
- Comida
- Hobby
- Libro
state: open
since: 2021/1/1
until: 2021/11/23
data_folder: data
metrics_file: data.json
Create `dvc.yaml`
stages:
  get-data:
    cmd: python src/get_data.py
      --output_folder ${data_folder}
    deps:
      - src/get_data.py
    params:
      - repo
      - labels
      - state
      - since
      - until
    outs:
      - ${data_folder}
  data-metrics:
    cmd: python src/compute_metrics.py
      --input_folder ${data_folder}
      --output_metrics_file ${metrics_file}
    deps:
      - src/compute_metrics.py
      - ${data_folder}
    metrics:
      - ${metrics_file}:
          cache: false
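The `${data_folder}` and `${metrics_file}` placeholders are filled in from `params.yaml`. The sketch below mimics that substitution with Python's `string.Template`, which happens to use the same `${name}` syntax. It is a stand-in to show what the resolved commands look like, not how DVC implements templating; the helper `render` is hypothetical.

```python
from string import Template

# Values as they appear in params.yaml.
params = {"data_folder": "data", "metrics_file": "data.json"}


def render(cmd_template, values):
    """Fill ${name} placeholders; raises KeyError if a name is missing."""
    return Template(cmd_template).substitute(values)


get_data_cmd = render(
    "python src/get_data.py --output_folder ${data_folder}", params
)
metrics_cmd = render(
    "python src/compute_metrics.py --input_folder ${data_folder} "
    "--output_metrics_file ${metrics_file}",
    params,
)

print(get_data_cmd)
print(metrics_cmd)
```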
Create `src/get_data.py`
import os
from datetime import datetime
from pathlib import Path

import fire
import yaml
from github import Github
from loguru import logger


def get_data(output_folder):
    with open("params.yaml") as f:
        params = yaml.safe_load(f)
    output_folder = Path(output_folder)
    for label in params["labels"]:
        (output_folder / label).mkdir(parents=True, exist_ok=True)
    since = datetime(*map(int, params["since"].split("/")))
    until = datetime(*map(int, params["until"].split("/")))
    logger.info(f"Getting issue labels since {since} until {until}")
    logger.info("Initializing Github")
    if os.environ.get("GITHUB_TOKEN"):
        g = Github(os.environ["GITHUB_TOKEN"])
    else:
        g = Github()
    logger.info(f"Querying repo: {params['repo']}")
    repo = g.get_repo(params["repo"])
    for issue in repo.get_issues(state=params["state"], since=since):
        issue_labels = [
            x.name for x in issue.labels if x.name in params["labels"]
        ]
        if (
            issue.pull_request
            or issue.created_at > until
            or len(issue_labels) != 1
        ):
            logger.debug(f"Skipping issue: {issue.title}")
            logger.debug(f"Created at: {issue.created_at}")
            logger.debug(f"Labels: {issue.labels}")
            continue
        label = str(issue_labels[0])
        logger.info(f"TITLE:\n{issue.title}")
        logger.info(f"BODY:\n{issue.body}")
        logger.info(f"LABEL:\n{label}")
        output_file = output_folder / label / f"{issue.number}.txt"
        output_file.write_text(f"{issue.title}\n{issue.body}")


if __name__ == "__main__":
    fire.Fire(get_data)
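The skip rule in the loop keeps an issue only if it is not a pull request, was created on or before `until`, and carries exactly one of the configured labels. The sketch below replays that rule offline. `FakeIssue` and `pick_label` are hypothetical stand-ins made up for this example (labels are plain strings here, whereas PyGithub returns label objects with a `.name`), and the issues are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FakeIssue:
    """Hypothetical stand-in for a PyGithub issue, for illustration only."""
    number: int
    created_at: datetime
    labels: list = field(default_factory=list)
    is_pull_request: bool = False


def pick_label(issue, wanted_labels, until):
    """Return the single matching label, or None if the issue is skipped."""
    matches = [name for name in issue.labels if name in wanted_labels]
    if issue.is_pull_request or issue.created_at > until or len(matches) != 1:
        return None
    return matches[0]


wanted = ["Comida", "Hobby", "Libro"]
until = datetime(2021, 11, 23)

issues = [
    FakeIssue(1, datetime(2021, 5, 1), ["Comida"]),           # kept
    FakeIssue(2, datetime(2021, 5, 2), ["Comida", "Hobby"]),  # two matches: skipped
    FakeIssue(3, datetime(2021, 12, 1), ["Libro"]),           # after `until`: skipped
    FakeIssue(4, datetime(2021, 6, 1), ["Libro"], True),      # pull request: skipped
    FakeIssue(5, datetime(2021, 6, 2), ["bug", "Hobby"]),     # one match: kept
]

kept = {issue.number: pick_label(issue, wanted, until) for issue in issues}
print(kept)
```

Requiring exactly one matching label means every saved file lands in a single label folder, so the dataset stays single-label.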
Create `src/compute_metrics.py`
import json
from pathlib import Path

import fire
from loguru import logger


def compute_metrics(input_folder, output_metrics_file):
    data_path = Path(input_folder)
    metrics = {}
    for label_folder in data_path.iterdir():
        metrics[label_folder.name] = len(list(label_folder.iterdir()))
    for name, amount in metrics.items():
        logger.info(f"LABEL: {name}: {amount}")
    with open(output_metrics_file, "w") as f:
        json.dump(metrics, f, indent=4)


if __name__ == "__main__":
    fire.Fire(compute_metrics)
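To see what ends up in the metrics file, the sketch below builds a throwaway folder tree and counts the files per label in the same way. The helper `count_files_per_label`, the `is_dir()` guard, and the file names are assumptions added for this example; they are not part of the script above.

```python
import json
import tempfile
from pathlib import Path


def count_files_per_label(data_path):
    """Map each label subfolder name to the number of files inside it."""
    return {
        folder.name: sum(1 for _ in folder.iterdir())
        for folder in sorted(Path(data_path).iterdir())
        if folder.is_dir()  # assumption: ignore stray files next to the label folders
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up issue numbers, only to populate the folders.
    for label, numbers in {"Comida": [1, 2], "Hobby": [3], "Libro": []}.items():
        (root / label).mkdir()
        for n in numbers:
            (root / label / f"{n}.txt").write_text("title\nbody")
    metrics = count_files_per_label(root)

print(json.dumps(metrics, indent=4))
```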
Create `secrets.GDRIVE_CREDENTIALS_DATA`
- Get the credentials: https://colab.research.google.com/drive/1Xe96hFDCrzL-Vt4Zj-cVHOxUgu-fyuBW
- Add them as a new secret named `GDRIVE_CREDENTIALS_DATA` in the GitHub repo.
Create `.github/workflows/on_pr.yml`
name: DVC & CML Workflow on Pull Request
on:
  pull_request:
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    container: docker://ghcr.io/iterative/cml:latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: Setup
        run: |
          pip install -r requirements.txt
      - name: Run DVC pipeline
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          dvc repro --pull
      - name: Push changes
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          dvc push
      - name: CML PR
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          cml pr "data.*" "dvc.lock" "params.yaml"
      - name: CML Report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Metrics & Params" >> report.md
          dvc exp diff main --old --show-md >> report.md
          cml send-comment --pr --update report.md
- Edit `params.yaml` from the GitHub interface.
- Change `until` from `2021/11/24` to `2021/11/28`.
- Select `Create a new branch for this commit and start a pull request`.
Create `.github/workflows/weekly.yml`
name: DVC & CML Weekly Workflow
on:
  schedule:
    - cron: "0 0 * * 0"
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    container: docker://ghcr.io/iterative/cml:latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: Setup
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          pip install -r requirements.txt
          dvc pull
      - name: Run DVC pipeline
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          dvc exp run -S until=$(date +'%Y/%m/%d')
      - name: Push changes
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          dvc push
      - name: CML PR
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          cml pr "data.*" "dvc.lock" "params.yaml"
      - name: CML Report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Metrics & Params" >> report.md
          dvc exp diff main --old --show-md >> report.md
          cml send-comment --pr --update --commit-sha=HEAD report.md
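The weekly run overrides `until` with the output of `date +'%Y/%m/%d'`, which is zero-padded (for example `2021/12/05`), while `params.yaml` uses unpadded values such as `2021/1/1`. The sketch below suggests both forms go through the parsing expression from `src/get_data.py` unchanged. The helper names and the sample date are made up for this example.

```python
from datetime import datetime


def until_override(now):
    """Format a date the way `date +'%Y/%m/%d'` does (zero-padded)."""
    return now.strftime("%Y/%m/%d")


def parse_param_date(value):
    """Same parsing expression used in src/get_data.py."""
    return datetime(*map(int, value.split("/")))


value = until_override(datetime(2021, 12, 5))
parsed = parse_param_date(value)

print(value)
print(parsed)
```

Because each part is converted with `int`, the leading zeros are harmless and no change to the script is needed for the scheduled runs.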
- Access studio view: https://dvc.org/doc/studio/get-started
- Run the pipeline editing the params: https://dvc.org/doc/studio/user-guide/run-experiments