deltatorch

image image image image

Concept

deltatorch allows users to directly use DeltaLake tables as a data source for training using PyTorch. Using deltatorch, users can create a PyTorch DataLoader to load the training data. We support distributed training using PyTorch DDP as well.

Usage

Requirements

  • Python Version > 3.8
  • pip or conda

Installation

  • with pip:
pip install git+https://github.com/mshtelma/deltatorch

Create PyTorch DataLoader to read our DeltaLake table

To utilize deltatorch at first, we will need a DeltaLake table containing training data we would like to use for training your PyTorch deep learning model. There is a requirement: this table must have an autoincrement ID field. This field is used by deltatorch for sharding and parallelization of loading. After that, we can use the create_pytorch_dataloader function to create PyTorch DataLoader, which can be used directly during training. Below you can find an example of creating a DataLoader for the following table schema :

CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path' 

After the table is ready we can use the create_pytorch_dataloader function to create a PyTorch DataLoader :

from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec

def create_data_loader(path:str, batch_size:int):

    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec("image",
                      # Load image using Pillow
                      load_image_using_pil=True, 
                      # PyTorch Transform
                      transform=transform),
            FieldSpec("label"),
        ],
        # Number of readers 
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size        
        batch_size=batch_size,
    )