dvc_tutorial

Workflow of Data Version Control (DVC)

STEPS:

STEP 01: Create an empty remote repository

STEP 02: initialize a local git repository and connect it to the remote repository

  • Open the project folder in VS Code, then run the commands below:
echo "# dvc_tutorial" >> README.md

git init

git add README.md

git commit -m "first commit"

git branch -M main

git remote add origin https://github.com/USER_NAME/REPO_NAME.git

git push -u origin main
touch .gitignore

The contents of .gitignore can be copied from the reference repository.

STEP 03: create and activate conda environment

conda create -n dvc-ml python=3.9 -y

conda activate dvc-ml

STEP 04: create a setup file

  • To use the src folder as a package, create a setup.py file:
touch setup.py
  • Paste the content below into setup.py and change the author, email, and URL fields to match your own account:
from setuptools import setup

with open("README.md", "r", encoding="utf-8") as f:
    long_description = f.read()

setup(
    name="src",
    version="0.0.1",
    author="USER_NAME",
    description="A small package for dvc ml pipeline demo",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/rohit-chandra/dvc_tutorial",
    author_email="rohitv.chandra@gmail.com",
    packages=["src"],
    python_requires=">=3.9",
    install_requires=[
        'dvc',
        'pandas',
        'scikit-learn'
    ]
)
  • To verify that src works as a package, run the command below once the package has been installed (for example with pip install -e . after the src folder exists). The src package should appear in the list along with its version:
pip list
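
The same check can also be made from Python. The sketch below is a minimal illustration using the standard library's importlib.metadata. The helper name installed_version is hypothetical and is not part of the reference repository.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "src" is the name passed to setup() in setup.py.
    found = installed_version("src")
    if found:
        print(f"src is installed, version {found}")
    else:
        print("src is not installed in this environment")
```

If this reports that src is not installed, the package has most likely not been installed into the active conda environment yet.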

STEP 05: create requirement file and install dependencies

touch requirements.txt

pip install -r requirements.txt

content of requirements.txt - refer to the reference repository

STEP 06: initialize dvc

dvc init

STEP 07: create the basic directory structure

mkdir -p src/utils config

STEP 08: create the config file

touch config/config.yaml

content of config.yaml -

data_source: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

artifacts: 
  artifacts_dir: artifacts
  raw_local_dir: raw_local_dir
  raw_local_file: data.csv
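
To make the role of these keys concrete, the sketch below joins the three artifacts entries into the path where stage 01 is expected to write its output. This is an illustration under assumptions: the config is written as a Python dict mirroring the YAML above so that the example needs no third-party library (a real script would parse the file), and the function name raw_data_path is hypothetical.

```python
from pathlib import PurePosixPath

# Mirrors the config file above; a real script would parse the YAML file instead.
CONFIG = {
    "data_source": "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    "artifacts": {
        "artifacts_dir": "artifacts",
        "raw_local_dir": "raw_local_dir",
        "raw_local_file": "data.csv",
    },
}


def raw_data_path(config: dict) -> str:
    """Join the artifacts entries into one forward-slash path."""
    artifacts = config["artifacts"]
    return str(
        PurePosixPath(
            artifacts["artifacts_dir"],
            artifacts["raw_local_dir"],
            artifacts["raw_local_file"],
        )
    )


print(raw_data_path(CONFIG))  # prints artifacts/raw_local_dir/data.csv
```

The resulting path is the same one listed under outs in the dvc.yaml of STEP 10, which is how DVC knows which file the stage produces.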

STEP 09: create the stage 01 python file and all_utils file:

touch src/stage_01_load_save.py src/utils/all_utils.py

The content of both files can be taken from the reference repository.
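
For orientation, here is a rough sketch of what a load-and-save stage could look like. It is not the reference repository's code: it uses only the standard library, the function names fetch_text and save_as_comma_csv are hypothetical, and it assumes the job of the stage is to download the source file and store a comma-separated copy (the UCI wine-quality file is semicolon-delimited).

```python
import csv
import io
import os
import urllib.request


def fetch_text(source: str) -> str:
    """Return the text at `source`, which may be an http(s) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as response:
            return response.read().decode("utf-8")
    with open(source, "r", encoding="utf-8") as f:
        return f.read()


def save_as_comma_csv(text: str, target_path: str, delimiter: str = ";") -> int:
    """Rewrite delimited text as a comma-separated file and return the number of data rows."""
    # Skip blank lines, which csv.reader yields as empty lists.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    # Create artifacts/raw_local_dir (or any other parent folder) if it is missing.
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    with open(target_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    # The first row is the header, so it is not counted as data.
    return max(len(rows) - 1, 0)
```

A stage script built this way would read data_source and the artifacts paths from the config file, then call save_as_comma_csv(fetch_text(data_source), "artifacts/raw_local_dir/data.csv").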

STEP 10: create the dvc.yaml file and add the stage 01:

touch dvc.yaml

content of dvc.yaml file -

stages:
  load_data:
    cmd: python src/stage_01_load_save.py --config=config/config.yaml
    deps:
      - src/stage_01_load_save.py
      - src/utils/all_utils.py
      - config/config.yaml
    outs:
      - artifacts/raw_local_dir/data.csv
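
DVC uses the deps list to decide whether a stage needs to be rerun, so a mistyped path there (for example config.yml instead of config.yaml) will make the stage fail. The sketch below is a small sanity check to run before the pipeline. It is a hypothetical helper, not part of DVC, and it copies the stage definition into a Python dict rather than parsing dvc.yaml.

```python
import os
from typing import Dict, List

# Mirrors the load_data stage defined in dvc.yaml.
LOAD_DATA_STAGE: Dict[str, List[str]] = {
    "deps": [
        "src/stage_01_load_save.py",
        "src/utils/all_utils.py",
        "config/config.yaml",
    ],
    "outs": ["artifacts/raw_local_dir/data.csv"],
}


def missing_deps(stage: Dict[str, List[str]], project_root: str = ".") -> List[str]:
    """Return the dependency paths that do not exist under project_root."""
    return [
        dep
        for dep in stage["deps"]
        if not os.path.exists(os.path.join(project_root, dep))
    ]


if __name__ == "__main__":
    # Run from the project root; an empty list means every dependency was found.
    print(missing_deps(LOAD_DATA_STAGE))
```

Only deps are checked, because the files under outs are created by the stage itself and are not expected to exist before the first run.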

STEP 11: run the dvc repro command

dvc repro

STEP 12: push the changes to remote repository

git add .
git commit -m "stage 01 added"
git push origin main