replicate/replicate-python

Train with data that is not publicly available

edufschmidt opened this issue · 2 comments

Hi, I was wondering if it would be possible to run a training job with data that is not publicly available, i.e. that requires authentication. If so, how can I pass the credentials to the Python client?

mattt commented

Hi @edufschmidt. Services like AWS S3, Cloudflare R2, Azure Blob Storage, and Google Cloud Storage have a way to generate pre-signed URLs, which provide time-limited access to download private resources. My recommendation would be to generate and pass a pre-signed URL as an input to your training. You could do that on-demand in Python:

import replicate
import boto3
from botocore.exceptions import NoCredentialsError

def generate_pre_signed_url(bucket_name, object_name, expiration=3600):
    """Return a pre-signed GET URL valid for `expiration` seconds, or None."""
    try:
        s3_client = boto3.client('s3')
        # The URL must still be valid when the training job downloads the data.
        response = s3_client.generate_presigned_url('get_object',
                                                    Params={'Bucket': bucket_name,
                                                            'Key': object_name},
                                                    ExpiresIn=expiration)
        return response
    except NoCredentialsError:
        print("No AWS credentials found.")
        return None

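# Example usage
bucket_name = 'my-bucket'
object_name = 'my-file.txt'
url = generate_pre_signed_url(bucket_name, object_name)
if url is None:
    # Don't start a training without a usable data URL.
    raise SystemExit("Could not generate a pre-signed URL.")

# The input name ("data_url" here) depends on the model being trained.
training = replicate.trainings.create("<version>", destination="<my/destination>", input={"data_url": url})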
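
Since pre-signed URLs are time-limited, it can help to check how much validity is left before starting a training. Below is a minimal sketch, assuming the URL uses one of the S3 query-string layouts: `X-Amz-Date` plus `X-Amz-Expires` (Signature Version 4) or `Expires` (the older Signature Version 2). The helper names `presigned_url_expiry` and `seconds_remaining` are made up for illustration and are not part of boto3 or the Replicate client. Other providers may encode expiry differently, in which case the helper returns `None`.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url):
    """Return the UTC expiry time encoded in a pre-signed S3 URL, or None.

    Hypothetical helper: only reads the query string, never contacts S3.
    """
    query = parse_qs(urlsplit(url).query)
    # Signature Version 4: signing time plus a lifetime in seconds.
    if "X-Amz-Date" in query and "X-Amz-Expires" in query:
        signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        signed_at = signed_at.replace(tzinfo=timezone.utc)
        return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))
    # Signature Version 2: absolute expiry as a Unix timestamp.
    if "Expires" in query:
        return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)
    return None


def seconds_remaining(url, now=None):
    """Seconds until the URL expires (negative if expired), or None if unknown."""
    expiry = presigned_url_expiry(url)
    if expiry is None:
        return None
    now = now or datetime.now(timezone.utc)
    return (expiry - now).total_seconds()


# Made-up URL shaped like a SigV4 pre-signed URL (signature fields omitted).
example = (
    "https://my-bucket.s3.amazonaws.com/my-file.txt"
    "?X-Amz-Date=20240101T000000Z&X-Amz-Expires=3600"
)
print(presigned_url_expiry(example))  # 2024-01-01 01:00:00+00:00
```

If `seconds_remaining(url)` comes back small, regenerating the URL with a larger `expiration` before calling `replicate.trainings.create` is probably the simplest fix.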

edufschmidt commented

Thank you @mattt! I'm going to give that a try. Closing, as I believe this solves my issue.