The open-source data lake for GenAI, computer vision, and NLP

Start KatiML

Prerequisites:
1. docker-compose: https://docs.docker.com/compose/install/
2. Python: https://www.python.org/downloads/
3. pip: https://pip.pypa.io/en/stable/installation/

Clone this repo

git clone --recurse-submodules git@github.com:dioptra-ai/katiml.git

Start all services with the startup script

# Linux / macOS
./startup.sh

# Windows
./startup.cmd
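
If the UI is not reachable right after the script finishes, a quick port check can tell you whether the services are listening yet. The snippet below is a minimal sketch using only the Python standard library; `port_is_open` and `wait_for_port` are hypothetical helper names, not part of KatiML, and the retry counts are arbitrary defaults.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def wait_for_port(host, port, attempts=30, delay=2.0):
    """Retry port_is_open up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

For example, `wait_for_port("localhost", 4004)` waits for the web UI port used in this README, and `wait_for_port("localhost", 4006)` for the API port. An open port only shows that something is listening, not that the application has finished initializing.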

Visit http://localhost:4004/ and log in with the following default credentials:
- username: admin@dioptra.ai
- password: password

Click the "Load Demo Data" button (loading might take a minute or two).
When the data is loaded, run the embeddings analysis.

Querying the lake

Install the SDK
```
pip install dioptra --upgrade
```
Get your API key:
- Open your user profile (bottom left corner)
- Click "Create Api Key"

Query the data as a DataFrame

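import os

# Replace the placeholder with the API key created above.
os.environ['DIOPTRA_API_KEY'] = '__api_key_value__1686583552218__'
# Endpoints of the local deployment started by the startup script.
os.environ['DIOPTRA_APP_ENDPOINT'] = 'http://localhost:4004'
os.environ['DIOPTRA_API_ENDPOINT'] = 'http://localhost:4006/events'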

from dioptra.lake.utils import select_datapoints

# Select datapoints tagged data_source = sample_coco.
select_datapoints(
    filters=[
        {'left': 'tags.name', 'op': '=', 'right': 'data_source'},
        {'left': 'tags.value', 'op': '=', 'right': 'sample_coco'},
    ]
)
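
The filter list is plain Python data, so it can be built programmatically. The helper below is a hypothetical convenience (`tag_filters` is not part of the dioptra SDK) that produces the same two conditions as the query above for any tag name and value. It assumes only the `left`/`op`/`right` dictionary shape shown in this README.

```python
def tag_filters(name, value):
    """Build the two filter conditions that match a tag by name and value.

    Hypothetical helper, not part of the dioptra SDK: it only reproduces
    the {'left', 'op', 'right'} dictionaries used in the query above.
    """
    return [
        {'left': 'tags.name', 'op': '=', 'right': name},
        {'left': 'tags.value', 'op': '=', 'right': value},
    ]


filters = tag_filters('data_source', 'sample_coco')
print(filters)
```

The resulting list can be passed as `select_datapoints(filters=filters)`.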

Response

                                    id           organization_id                created_at request_id   type                                           metadata  text parent_datapoint
0    57c5f3ba-0a9b-405d-9a09-16ab41ea48ad  648738ff58e6931848b214ff  2023-06-12T15:27:32.790Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
1    e674dc9d-2bf6-4a5f-b7be-9888e68914c1  648738ff58e6931848b214ff  2023-06-12T15:28:19.115Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
..                                    ...                       ...                       ...        ...    ...                                                ...   ...              ...
998  565b7adc-f5d1-44b9-b2e7-685d16cdd2b6  648738ff58e6931848b214ff  2023-06-12T15:27:50.675Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
999  d37aac5c-f2e6-4a7b-b3b8-b556af1a96fd  648738ff58e6931848b214ff  2023-06-12T15:28:22.263Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None

[1000 rows x 8 columns]
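
In the output above, each row's `metadata` value is a dictionary that includes a `uri` key. The following sketch pulls those URIs out, assuming every metadata value is a dictionary and skipping entries without a `uri`. `extract_uris` is a hypothetical helper, and the sample rows are made-up stand-ins rather than real query results.

```python
def extract_uris(metadata_column):
    """Collect the 'uri' entry from an iterable of metadata dictionaries."""
    return [
        meta['uri']
        for meta in metadata_column
        if isinstance(meta, dict) and 'uri' in meta
    ]


# Made-up stand-in rows; with the SDK this would be the 'metadata' column.
sample_metadata = [
    {'uri': 'https://example.com/images/1.jpg'},
    {'uri': 'https://example.com/images/2.jpg'},
    {},  # no 'uri' key: skipped
]
print(extract_uris(sample_metadata))
```

With a DataFrame returned by `select_datapoints`, the call would be `extract_uris(df['metadata'])`, where `df` is whatever name the result was assigned to.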

Dataset version control

Create a dataset

from dioptra.lake.datasets import Dataset as KTMLDataset
my_dataset = KTMLDataset()
my_dataset.get_or_create('my dataset')

Add datapoints and commit

my_first_datapoints_df = select_datapoints(....)
my_dataset.add_datapoints(list(my_first_datapoints_df['id']))
my_dataset.commit('my first commit')

my_second_datapoints_df = select_datapoints(....)
my_dataset.add_datapoints(list(my_second_datapoints_df['id']))
my_dataset.commit('my second commit')
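
When two selections overlap, it can help to know which datapoints the second one actually contributes before committing. The sketch below compares two lists of ids in plain Python. `new_ids` is a hypothetical helper, and how `add_datapoints` itself treats ids that are already in the dataset is not covered here.

```python
def new_ids(previous_ids, candidate_ids):
    """Return ids from candidate_ids that are absent from previous_ids.

    Order is preserved and repeated candidates are reported once.
    """
    seen = set(previous_ids)
    result = []
    for datapoint_id in candidate_ids:
        if datapoint_id not in seen:
            seen.add(datapoint_id)
            result.append(datapoint_id)
    return result


# Stand-ins for list(my_first_datapoints_df['id']) and
# list(my_second_datapoints_df['id']).
first_ids = ['a', 'b', 'c']
second_ids = ['b', 'c', 'd', 'd', 'e']
print(new_ids(first_ids, second_ids))  # ['d', 'e']
```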

Check out, roll back and get diffs