
Adding large amounts of metadata does not work

As a stress test, I have a repo with 542,247 images in it and wanted to add metadata to a data source. I ran the following code from a Jupyter notebook:

# Set up DagsHub
import os
os.environ["DAGSHUB_CLIENT_HOST"] = "https://test.dagshub.com"

from dagshub.data_engine.model import datasources

repo = "yonomitt/LAION-Aesthetics-V2-6.5plus"
image_root = "data"
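# Reuse the datasource if it already exists, otherwise create it from the repo
try:
    ds = datasources.get_datasource(repo=repo, name="images")
except Exception:
    ds = datasources.create_from_repo(repo=repo, name="images", path=image_root)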

# Imports
from tqdm import tqdm

# Add metadata
annotations_file = 'labels.tsv'

with ds.metadata_context() as ctx, open(annotations_file) as f:
    for row in tqdm(f.readlines()):
        image, caption, score = row.split('\t')[:3]
        ctx.update_metadata(image, {'caption': caption, 'score': score})

The first time I ran this, it never returned (I waited several hours). The second time, I got a 502:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 542247/542247 [00:01<00:00, 342020.81it/s]
TransportServerError: 502 Server Error: Bad Gateway for url: https://test.dagshub.com/api/v1/repos/yonomitt/LAION-Aesthetics-V2-6.5plus/data-engine/graphql

The labels.tsv file can be found here: https://dagshub.com/DagsHub-Datasets/LAION-Aesthetics-V2-6.5plus/src/main/data/labels.tsv

It has 542,247 rows.

The workaround was to batch the metadata uploads:

annotations_file = 'labels.tsv'

all_metadata = []
with open(annotations_file) as f:
    for row in tqdm(f.readlines()):
        image, caption, score = row.split('\t')[:3]
        all_metadata.append((image, {'caption': caption[:255], 'score': score}))

total = len(all_metadata)

batch = 1000
for start in tqdm(range(0, total, batch)):
    data = all_metadata[start:start+batch]

    with ds.metadata_context() as ctx:
        for image, metadata in data:
            ctx.update_metadata(image, metadata)
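
For anyone hitting the same problem, here is a minimal sketch of the same workaround as reusable functions. It streams the file instead of building one big list first. The helper names (`batched`, `parse_rows`, `upload_in_batches`) are my own and not part of the DagsHub client. The only calls it assumes on `ds` are `metadata_context()` and `ctx.update_metadata()`, as used above. The `caption[:255]` truncation and the string-valued `score` simply mirror the workaround.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# One metadata update: (datapoint path, metadata dict)
Row = Tuple[str, Dict[str, Any]]


def batched(items: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def parse_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Turn TSV lines into (image, metadata) pairs, using the first three columns."""
    for line in lines:
        image, caption, score = line.rstrip("\n").split("\t")[:3]
        # Truncate the caption to 255 characters, as in the workaround above
        yield image, {"caption": caption[:255], "score": score}


def upload_in_batches(ds: Any, rows: Iterable[Row], batch_size: int = 1000) -> int:
    """Open one metadata context per batch; return the number of rows sent."""
    uploaded = 0
    for chunk in batched(rows, batch_size):
        # Assumption: leaving the context is what triggers the upload
        with ds.metadata_context() as ctx:
            for image, metadata in chunk:
                ctx.update_metadata(image, metadata)
        uploaded += len(chunk)
    return uploaded
```

Usage would be something like `with open(annotations_file) as f: upload_in_batches(ds, parse_rows(f))`. I have not measured which batch size the server handles best; 1000 is just the value that worked for me.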

I've copied the batching into the metadata upload itself, so it now uploads in batches of 5k datapoints at a time.
Hopefully that's good enough and we don't need any backend changes.