Here we will run benchmarks for different data-versioning tools.
TODO
Create the XetHub and GitHub repositories (for DVC, Git LFS, and git-annex), each with a README. Here they are named:

- `xethub-py`
- `xethub-git`
- `versioning-dvc`
- `versioning-lfs`
- `versioning-lfs-github`
We clone them locally and set up the remotes. First, set your Git user name and Python environment:
```bash
GITUSER=$(git config --global user.name)  # or manually set your GitHub/XetHub user name

python -m venv .venv \
  && source .venv/bin/activate \
  && pip install -r requirements.txt

# Download data - takes time!
python src/download.py --dir=data --download=all --limit=2

# For quick testing
python src/generate.py --dir=mock --count=5 --rows=1000
```
Generate two repos:

- `xethub-git` - here we will use the git client
- `xethub-py` - here we will use the python client

```bash
# use your own repository
git xet clone https://xethub.com/xdssio/xethub-git.git xethub-git
```
Get a token and set it as environment variables:

```bash
export XET_USER_NAME=<user-name>
export XET_USER_TOKEN=<xethub-token>
```
```bash
pip install pyxet
```

(Optional) Install the CLI.
```bash
git clone https://github.com/$GITUSER/versioning-dvc dvc
```

Install the CLI:

```bash
pip install dvc dvc-s3
```

Set up the remote:

```bash
cd dvc
dvc init
dvc remote add -d versioning-article s3://<your-bucket-name>/dvc
dvc remote modify versioning-article region us-west-2
```
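Once the remote is configured, each benchmark step can version a data file with the standard DVC flow (the file name below is illustrative; it requires an initialized DVC repo and the remote above):

```bash
dvc add data/taxi.csv                      # hash the file and write the data/taxi.csv.dvc pointer
git add data/taxi.csv.dvc data/.gitignore  # Git versions the pointer, not the data
git commit -m "Add data"
dvc push                                   # upload the data itself to the S3 remote
```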
Warning: THIS WILL COST YOU MONEY!

Limitations:
- GitHub Free and GitHub Pro have a maximum file size limit of 2 GB
- GitHub Team has a maximum file size limit of 4 GB
- GitHub Enterprise Cloud has a maximum file size limit of 5 GB
- Bitbucket Cloud has a maximum file upload limit of 10 GB
Setup:

```bash
git clone https://github.com/xdssio/versioning-lfs-github.git lfs-github
```

Install the CLI, then:

```bash
cd lfs-github
git lfs install
git lfs track '*.parquet'
git lfs track '*.csv'
git lfs track '*.txt'
git add .gitattributes && git commit -m "Enable LFS" && git push
```
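`git lfs track` records these patterns in `.gitattributes`; after the commands above it contains lines like:

```
*.parquet filter=lfs diff=lfs merge=lfs -text
*.csv filter=lfs diff=lfs merge=lfs -text
*.txt filter=lfs diff=lfs merge=lfs -text
```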
```bash
git clone https://github.com/$GITUSER/versioning-lfs lfs-s3
```

Install the CLI, then:

```bash
cd lfs-s3
git lfs install
git lfs track '*.parquet'
git lfs track '*.csv'
git add .gitattributes && git commit -m "Enable LFS" && git push
cd ..
```
Next, set up the LFS server so it can serve as the S3-backed remote.
Generating a random key is easy:

```bash
openssl rand -hex 32
```

Keep this secret and save it in a password manager so you don't lose it. We will pass this to the server below.
Create a `lfs-server/.env` file with the following contents:

```bash
AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
AWS_DEFAULT_REGION=us-west-2
# LFS_ENCRYPTION_KEY is the result of the openssl command above
LFS_ENCRYPTION_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LFS_S3_BUCKET=my-bucket
LFS_MAX_CACHE_SIZE=10GB
```
Improve performance (optional):

```bash
# Increase the number of worker threads
git config --global lfs.concurrenttransfers 64
# Use a global LFS cache to make re-cloning faster
git config --global lfs.storage ~/.cache/lfs
```
Update the `lfs-s3/.lfsconfig` file:

```
[lfs]
url = "http://0.0.0.0:8081/api/my-org/my-project"
```

- `0.0.0.0` - the host name of your server
- `8081` - the port your server started with
- `api` - required to be "api"
- `my-org` - replace with your organization name
- `my-project` - replace with your project's name
Run locally:

```bash
docker-compose up
```
On Mac:

```bash
brew tap treeverse/lakefs && brew install lakefs
```

Run using Docker with local metadata and your AWS credentials:

```bash
mkdir -p ~/lakefs/metadata  # for persistence
docker run --user=root --pull always -p 8000:8000 \
  -e LAKEFS_BLOCKSTORE_TYPE='s3' \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e LAKEFS_DATABASE_LOCAL_PATH=/etc/lakefs/metadata \
  -v ~/lakefs/metadata:/etc/lakefs/metadata \
  treeverse/lakefs:0.110.0 run --local-settings
```
Copy the credentials and save them to `~/.lakefs.yaml`. Then create a repository and connect it to S3 in the UI.
```bash
# Terminal 1
(cd lfs-server && docker-compose up)

# Terminal 2
docker run --pull always -p 8000:8000 \
  -e LAKEFS_BLOCKSTORE_TYPE='s3' \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e LAKEFS_DATABASE_LOCAL_PATH=/etc/lakefs/metadata \
  -v ~/lakefs/metadata:/etc/lakefs/metadata \
  treeverse/lakefs:0.110.0 run --local-settings
```

For debugging, add:

```bash
XET_LOG_LEVEL=debug XET_LOG_PATH=`pwd`/xethub.log
```
Pull the latest data:

```bash
cd lfs-github && git pull && cd .. && \
cd lfs-s3 && git pull && cd .. && \
cd dvc && git pull && cd .. && \
cd xethub-git && git pull && cd ..
```
```bash
python main.py --help
```
This uploads a small file with every tool:

```bash
python main.py test
```

This is a quick view of the latest results from the terminal:

```bash
python main.py latest

# Show the last 10 rows, ignoring the experiment id ('track_id')
python main.py latest 10
```
We simulate a single step with a single technology. We generate a new file with a given seed (so that the first rows are always the same for the same seed); the number of rows is `start-rows + add-rows × step`.
Params:
```bash
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --tech                        [s3|pyxet|gitxet|lakefs|lfs-git|lfs-s3|dvc]  The tech to use [default: Tech.pyxet]                                                                                                      │
│ --step                        INTEGER RANGE [x>=0]                         The step to simulate [default: 0]                                                                                                          │
│ --start-rows                  INTEGER                                      How many rows to start with [default: 100000000]                                                                                           │
│ --add-rows                    INTEGER                                      How many rows to add [default: 10000000]                                                                                                   │
│ --suffix                      [csv|parquet|txt]                            What file type to save [default: Suffix.csv]                                                                                               │
│ --diverse     --no-diverse                                                 Whether to generate numeric data [default: no-diverse]                                                                                     │
│ --label                       TEXT                                         The experiment to run [default: default]                                                                                                   │
│ --seed                        INTEGER                                      The seed to use [default: 0]                                                                                                               │
│ --help                                                                     Show this message and exit.                                                                                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
Example:

```bash
# This creates a csv file with 120 rows - equivalent to starting with 100 rows and appending 10 rows at steps 1 and 2.
python main.py append --tech=s3 --tech=pyxet --tech=gitxet --tech=dvc --step=2 --start-rows=100 --add-rows=10
```
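The point of the seed is that regenerating at a later step reproduces the earlier rows exactly, so each step only appends data. A minimal numpy sketch of that property (the real generator's schema is an assumption):

```python
import numpy as np

def make_rows(start_rows: int, add_rows: int, step: int, seed: int = 0) -> np.ndarray:
    """Return start_rows + add_rows * step rows of random data; the same seed
    yields the same prefix, so step k+1 extends step k rather than replacing it."""
    n = start_rows + add_rows * step
    rng = np.random.default_rng(seed)
    return rng.random((n, 3))

step1 = make_rows(100, 10, 1)  # 110 rows
step2 = make_rows(100, 10, 2)  # 120 rows
assert np.array_equal(step1, step2[:110])  # the first 110 rows are identical
```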
We simulate a train/validation/test split. We generate a new file with a given seed (so that the first rows are always the same for the same seed); the number of rows is `start-rows + add-rows × step × 2`. The dataframe is then split into train, validation, and test such that the last two each hold the latest `add-rows` rows.
Params:
```bash
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --tech                        [s3|pyxet|gitxet|lakefs|lfs-git|lfs-s3|dvc]  The tech to use [default: Tech.pyxet]                                                                                                      │
│ --step                        INTEGER RANGE [x>=0]                         The step to simulate [default: 0]                                                                                                          │
│ --start-rows                  INTEGER                                      How many rows to start with [default: 100000000]                                                                                           │
│ --add-rows                    INTEGER                                      How many rows to add [default: 10000000]                                                                                                   │
│ --suffix                      [csv|parquet|txt]                            What file type to save [default: Suffix.csv]                                                                                               │
│ --diverse     --no-diverse                                                 Whether to generate numeric data [default: no-diverse]                                                                                     │
│ --label                       TEXT                                         The experiment to run [default: default]                                                                                                   │
│ --seed                        INTEGER                                      The seed to use [default: 0]                                                                                                               │
│ --help                                                                     Show this message and exit.                                                                                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
Example:

```bash
# This creates a parquet file with 120 rows, a validation file with 10 rows, and a test file with 10 rows.
python main.py split --tech=s3 --step=2 --start-rows=100 --add-rows=10
```
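A minimal pandas sketch of the split described above (`split_frame` is illustrative, not the repo's implementation):

```python
import numpy as np
import pandas as pd

def split_frame(df: pd.DataFrame, add_rows: int):
    """Train = everything except the newest 2*add_rows rows;
    validation and test each hold add_rows of the newest data."""
    train = df.iloc[:-2 * add_rows]
    validation = df.iloc[-2 * add_rows:-add_rows]
    test = df.iloc[-add_rows:]
    return train, validation, test

# start=100, add=10, step=2 -> 100 + 10 * 2 * 2 = 140 total rows
df = pd.DataFrame({"x": np.arange(140)})
train, validation, test = split_frame(df, add_rows=10)
assert (len(train), len(validation), len(test)) == (120, 10, 10)
```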
This simulates a feature-engineering step: it creates a basic file and, at each step, adds another random column.
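A minimal sketch of such a step (`engineer` and its column names are assumptions, not the repo's implementation):

```python
import numpy as np
import pandas as pd

def engineer(start_rows: int, step: int, seed: int = 0) -> pd.DataFrame:
    """Base frame plus one extra random column per step."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({"base": rng.random(start_rows)})
    for k in range(step):
        df[f"feature_{k}"] = rng.random(start_rows)
    return df

df = engineer(start_rows=100, step=3)
assert df.shape == (100, 4)  # base column + 3 engineered columns
```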
```bash
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --tech           [s3|pyxet|gitxet|lakefs|lfs-git|lfs-s3|dvc]  The tech to use [default: Tech.pyxet]                                                                                                          │
│ --step           INTEGER RANGE [x>=0]                         The step to simulate [default: 0]                                                                                                              │
│ --start-rows     INTEGER                                      How many rows to start with [default: 100000000]                                                                                               │
│ --suffix         [csv|parquet|txt]                            What file type to save [default: Suffix.parquet]                                                                                               │
│ --label          TEXT                                         The experiment to run [default: default]                                                                                                       │
│ --seed           INTEGER                                      The seed to use [default: 0]                                                                                                                   │
│ --help                                                        Show this message and exit.                                                                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
### Benchmark
Execute a benchmark for a given workflow, list of technologies and multiple steps.
```bash
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * workflow WORKFLOW:{append|split|taxi} The workflow to execute [default: None] [required] │
│ tech [TECH]:[s3|pyxet|gitxet|lakefs|lfs-git|lfs-s3|dvc]... The tech to use [default: None (all)] │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --steps INTEGER RANGE [x>=1] number of steps to run [default: 1] │
│ --start-rows INTEGER How many rows to start with [default: 100000000] │
│ --add-rows INTEGER How many rows to add [default: 10000000] │
│ --suffix [csv|parquet|txt] What file type to save [default: Suffix.csv] │
│ --diverse --no-diverse If True generate diverse data, default is numeric only [default: no-diverse] │
│ --label TEXT The experiment to run [default: default] │
│ --seed INTEGER The seed to use [default: 0] │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
- If only a single step is run, it is equivalent to the `random` workflow, as we only generate a single file and upload it.
- It is recommended to provide a `label` for each run so results are easier to compare. If not provided and `steps==1`, the label is random; otherwise it is `append-{steps}`.
Read the results with `xetrack`:

```python
from xetrack import Reader
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

OUTPUT_DB = 'output/stats.db'
reader = Reader(OUTPUT_DB)
result = reader.to_df()

# Keep only the rows belonging to the latest run
latest_track = result.tail(1)['track_id'].iloc[0]
result = result[result['track_id'] == latest_track]
# result.to_csv('output/latest.csv', index=False)
```