The open-source data lake for Gen AI, computer vision, and NLP
-
Prerequisites:
- docker-compose: https://docs.docker.com/compose/install/
- Python: https://www.python.org/downloads/
- Pip: https://pip.pypa.io/en/stable/installation/
-
Clone this repo
git clone --recurse-submodules git@github.com:dioptra-ai/katiml.git
-
Start all services with the startup script
# Linux / MacOS
./startup.sh

# Windows
./startup.cmd
-
Visit http://localhost:4004/ and log in with the following default credentials:
- username: admin@dioptra.ai
- password: password
-
Click on the "Load Demo Data" button (this might take a minute or two).
-
When the data is loaded, run the embeddings analysis.
-
Install the SDK
pip install dioptra --upgrade
-
Get your API key
- Open your user profile (bottom left corner)
- Click "Create API Key"
-
Query the data as a DataFrame
import os

os.environ['DIOPTRA_API_KEY'] = '__api_key_value__1686583552218__'
os.environ['DIOPTRA_APP_ENDPOINT'] = 'http://localhost:4004'
os.environ['DIOPTRA_API_ENDPOINT'] = 'http://localhost:4006/events'

from dioptra.lake.utils import select_datapoints

select_datapoints(filters=[
    {'left': 'tags.name', 'op': '=', 'right': 'data_source'},
    {'left': 'tags.value', 'op': '=', 'right': 'sample_coco'}
])
Response
     id                                    organization_id           created_at                request_id  type   metadata                                            text  parent_datapoint
0    57c5f3ba-0a9b-405d-9a09-16ab41ea48ad  648738ff58e6931848b214ff  2023-06-12T15:27:32.790Z  None        IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None  None
1    e674dc9d-2bf6-4a5f-b7be-9888e68914c1  648738ff58e6931848b214ff  2023-06-12T15:28:19.115Z  None        IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None  None
..   ...                                   ...                       ...                       ...         ...    ...                                                 ...   ...
998  565b7adc-f5d1-44b9-b2e7-685d16cdd2b6  648738ff58e6931848b214ff  2023-06-12T15:27:50.675Z  None        IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None  None
999  d37aac5c-f2e6-4a7b-b3b8-b556af1a96fd  648738ff58e6931848b214ff  2023-06-12T15:28:22.263Z  None        IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None  None

[1000 rows x 8 columns]
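The result is a standard pandas DataFrame, so the usual pandas operations apply, for example pulling the image URIs out of the metadata column. A minimal sketch, using mocked rows in place of an actual lake response (column names and value shapes taken from the output above; the URIs are placeholders):

```python
import pandas as pd

# Mocked rows shaped like the select_datapoints() response above.
df = pd.DataFrame([
    {'id': '57c5f3ba-0a9b-405d-9a09-16ab41ea48ad', 'type': 'IMAGE',
     'metadata': {'uri': 'https://example.com/a.jpg'}},
    {'id': 'e674dc9d-2bf6-4a5f-b7be-9888e68914c1', 'type': 'IMAGE',
     'metadata': {'uri': 'https://example.com/b.jpg'}},
])

# The metadata column holds dicts; extract the image URI from each row.
uris = df['metadata'].map(lambda m: m['uri'])
print(len(df), uris.iloc[0])
```

The same pattern works on the real response: any column shown in the table above can be selected, mapped, or filtered before passing ids on to a dataset.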
-
Create a dataset
from dioptra.lake.datasets import Dataset as KTMLDataset

my_dataset = KTMLDataset()
my_dataset.get_or_create('my dataset')
-
Add datapoints and commit
my_first_datapoints_df = select_datapoints(....)
my_dataset.add_datapoints(list(my_first_datapoints_df['id']))
my_dataset.commit('my first commit')

my_second_datapoints_df = select_datapoints(....)
my_dataset.add_datapoints(list(my_second_datapoints_df['id']))
my_dataset.commit('my second commit')
-
Check out, roll back, and get diffs