/katiml

Primary language: Shell. License: GNU Affero General Public License v3.0 (AGPL-3.0)

The open-source data lake for generative AI, computer vision, and NLP

Start KatiML

  1. Prerequisites:

    1. docker-compose: https://docs.docker.com/compose/install/
    2. Python: https://www.python.org/downloads/
    3. Pip: https://pip.pypa.io/en/stable/installation/
  2. Clone this repo

    git clone --recurse-submodules git@github.com:dioptra-ai/katiml.git
  3. Start all services with the startup script

    # Linux / macOS
    ./startup.sh
    # Windows
    ./startup.cmd
  4. Visit http://localhost:4004/ and log in with the following default credentials

    • username: admin@dioptra.ai
    • password: password
  5. Click on the "Load Demo Data" button (this might take a minute or two).

  6. When the data is loaded, run the embeddings analysis.
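
The services can take a little while to come up after the startup script runs. As a minimal sketch (not part of KatiML or the dioptra SDK), the helper below polls the UI address from step 4 until it answers. The function name `wait_for_service` and its timeout defaults are assumptions made for illustration, and it uses only the Python standard library.

```python
import time
import urllib.error
import urllib.request

# Ignore any system-wide HTTP proxy settings: the target is a local service.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_service(url, timeout_s=120.0, interval_s=2.0):
    """Poll `url` until it answers over HTTP or `timeout_s` elapses.

    Returns True as soon as the server responds (with any status code),
    False if it never responds within the time budget.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with _opener.open(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is running.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_service('http://localhost:4004/')` after running the startup script should return `True` once the UI responds, at which point you can log in and load the demo data.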

Querying the lake

  1. Install the SDK

    pip install dioptra --upgrade
  2. Get your API key

    • Open your user profile (bottom-left corner)
    • Click "Create Api Key"
  3. Query the data as a DataFrame

    import os
    os.environ['DIOPTRA_API_KEY'] = '__api_key_value__1686583552218__'
    os.environ['DIOPTRA_APP_ENDPOINT'] = 'http://localhost:4004'
    os.environ['DIOPTRA_API_ENDPOINT'] = 'http://localhost:4006/events'
    
    from dioptra.lake.utils import select_datapoints
    
    select_datapoints(
        filters=[{
            'left': 'tags.name',
            'op': '=',
            'right': 'data_source'},{
            'left': 'tags.value',
            'op': '=',
            'right': 'sample_coco'}])

    Response

                                        id           organization_id                created_at request_id   type                                           metadata  text parent_datapoint
    0    57c5f3ba-0a9b-405d-9a09-16ab41ea48ad  648738ff58e6931848b214ff  2023-06-12T15:27:32.790Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
    1    e674dc9d-2bf6-4a5f-b7be-9888e68914c1  648738ff58e6931848b214ff  2023-06-12T15:28:19.115Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
    ..                                    ...                       ...                       ...        ...    ...                                                ...   ...              ...
    998  565b7adc-f5d1-44b9-b2e7-685d16cdd2b6  648738ff58e6931848b214ff  2023-06-12T15:27:50.675Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
    999  d37aac5c-f2e6-4a7b-b3b8-b556af1a96fd  648738ff58e6931848b214ff  2023-06-12T15:28:22.263Z       None  IMAGE  {'uri': 'https://dioptra-demo.s3.us-east-2.ama...  None             None
    
    [1000 rows x 8 columns]
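
The query above selects datapoints by tag using two filter clauses, one on `tags.name` and one on `tags.value`. If you filter by tags often, a small wrapper can build that list for you. This is a sketch only: `tag_filters` is a hypothetical helper, not part of the dioptra SDK, and it covers just the `=` operator and the two fields shown in the example.

```python
def tag_filters(name, value):
    """Build the filter list that matches datapoints tagged name=value.

    Mirrors the structure of the filters in the query example:
    one clause on 'tags.name' and one on 'tags.value'.
    """
    return [
        {'left': 'tags.name', 'op': '=', 'right': name},
        {'left': 'tags.value', 'op': '=', 'right': value},
    ]
```

With this helper, `select_datapoints(filters=tag_filters('data_source', 'sample_coco'))` passes the same filters as the query example.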
    

Dataset version control

  1. Create a dataset

    from dioptra.lake.datasets import Dataset as KTMLDataset
    my_dataset = KTMLDataset()
    my_dataset.get_or_create('my dataset')
  2. Add datapoints and commit

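    from dioptra.lake.utils import select_datapoints
    
    # Select a first batch of datapoints, add their ids to the dataset, and commit
    my_first_datapoints_df = select_datapoints(....)
    my_dataset.add_datapoints(list(my_first_datapoints_df['id']))
    my_dataset.commit('my first commit')
    
    # Select a second batch and record it as a separate commit
    my_second_datapoints_df = select_datapoints(....)
    my_dataset.add_datapoints(list(my_second_datapoints_df['id']))
    my_dataset.commit('my second commit')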
  3. Check out, roll back and get diffs