ploomber/projects

Example with ploomber.clients.S3Client

Closed this issue · 14 comments

Hi there,
I am struggling to understand how I can use the S3Client to download and upload product files in a pipeline.
I am assuming it is similar to the SQL example, but besides creating the client in clients.py, I am not sure how to tell the tasks to download the files.
Any help is welcome!

Hi!

Great catch, we're missing one example here. I'll briefly explain how this works.

As you mentioned, the S3 client is similar to the SQL client. Once you've configured it, it'll automatically upload the task's output files upon successful task execution, so there is no need to add any extra code. Furthermore, if you delete the outputs once they've been uploaded and your source code hasn't changed, you can execute ploomber build and Ploomber will download the outputs from S3 (instead of executing the tasks again).
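For example, a minimal clients.py would look something like this (a sketch; the exact S3Client arguments are from memory, so double-check the API docs):

from ploomber.clients import S3Client

def get():
    # 'my-bucket' and 'my-pipeline/output' are placeholders; replace them
    # with your bucket name and the folder where outputs should be stored
    return S3Client(bucket_name='my-bucket', parent='my-pipeline/output')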

Does this help? Feel free to add any other questions.

Okay, let me try it and I'll report back here.

Pictures are better than words:

[screenshots attached]

Forgot to mention the output:

(base) jovyan@1a3714ecaaf2:~/work/ploomber/s3client$ ploomber build
Error: Failed to determine task class for source 'S3Client': Invalid dotted path 'S3Client'. Value must be a dot separated string, with at least two parts: [module_name].[function_name]. 

Alright, your clients section just needs one change. The tasks don't need any references to S3; they should be regular tasks. Here's an example:

clients:
    # tell ploomber to use the client returned by clients.get to manage all output files
    File: clients.get

tasks:
  - source: tasks.some_function
    # ploomber will upload this upon execution 
    product: output/get.parquet

  - source: scripts/some-script.py
    # ploomber will upload this upon execution
    product:
        nb: output/nb.html
        model: output/model.pickle

If you only want to upload some of the files (and not all), you can pass the client to specific tasks:

tasks:
  # this task does not have a client, do not upload anything
  - source: tasks.some_function
    product: output/get.parquet

  - source: scripts/some-script.py
    # this task has a client, upload output files
    product:
        nb: output/nb.html
        model: output/model.pickle
    client: clients.get

What about some examples for downloading data? E.g., I want a step where I pull down objects from the bucket and process them.

Good point. We don't have examples like that yet since, initially, the client's purpose was to back up pipeline results for later review. So far, users who want to download files for later processing use the APIs from their services directly. In your case, that'd be calling boto3.

However, we're working on a cloud service for Ploomber (that will allow you to run Ploomber pipelines in the cloud without code changes) so data ingestion is a feature we'll develop soon.
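In the meantime, a task that pulls an object with boto3 directly would be a sketch along these lines (the bucket and key names are placeholders):

import boto3

def download_raw(product):
    # 'product' is the local path declared for this task in pipeline.yaml
    s3 = boto3.client('s3')
    s3.download_file('my-bucket', 'raw/data.csv', str(product))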

Hi there, yes that makes sense now.
Another question: is the SQL module read-only as well, e.g. SELECT only?
With this syntax:

clients:
    # tell ploomber to use the client returned by clients.get to manage all output files
    File: clients.get

Could I have multiple identifiers? For example:

clients:
    # tell ploomber to use the client returned by clients.get to manage all output files
    LocalFile: clients.get_file
    AwsFile: clients.get_s3
    GcpFile: clients.get_gcp

Do I still have to reference them via the function name syntax or can I use the identifier?

- source: scripts/some-script.py
  # this task has a client, upload output files
  product:
      nb: output/nb.html
      model: output/model.pickle
  client: clients.get_gcp

- source: scripts/some-script.py
  # this task has a client, upload output files
  product:
      nb: output/nb.html
      model: output/model.pickle
  client: LocalFile

Btw the example you provided works fine on my AWS setup.

@priamai can you list here what you're expecting the client to do besides downloads? Is it for reproducibility/experiment tracking?
This will help us better define the reqs for when we build it 😊

The SQL clients can also create tables and views. See this example.

It's not possible to have multiple identifiers, because those identifiers correspond to products/task classes in Ploomber (more in the docs).

Example:

clients:
    # key must be a valid task class or product class

    # File is a product class - this means "upload all files"
    File: clients.s3_client

    # SQLScript is a task class - this means "execute all SQL scripts using this client"
    SQLScript: clients.db_client

    # keys cannot be arbitrary identifiers
    some-identifier: clients.some_client

More on product vs task clients here.

I'm glad to hear that the example worked with your AWS setup, anything else we can help with?

Thanks for the pointer, this makes sense now. For my AWS experiment, I can just create a task that pulls the files and saves them into a local folder. Then the last task, after computing the statistics, can just delete them. I have another related idea, responding to the other author below.

@priamai can you list here what you're expecting the client to do besides downloads? Is it for reproducibility/experiment tracking? This will help us better define the reqs for when we build it 😊

Yes, coming from Kedro/Elyra, I think I got confused about the concepts of source and client in Ploomber.
What would be a very useful module is a source client that can detect when the input has changed.
For argument's sake, my pipeline runs periodically every day, and there might be new files or rows in my sources. If Ploomber can check via metadata whether there is new input available, then it can continue to execute the downstream tasks.
In addition to S3, it would also be nice to have common source connectors to databases such as Cassandra and Elasticsearch.
I know it's a lot of work, but a lot of people would appreciate such an auto-diffing check, so to speak.

Awesome, I'm glad Ploomber is working for you.

Keep an eye on Ploomber's changelog (and the Slack community), since we'll be working on data ingestion features pretty soon.

For checking the database's state: you can implement an initial step that checks whether there are new rows in the database and, if so, runs the Ploomber pipeline. However, we think we can do better here, and adding this kind of feature to an existing pipeline is on our roadmap :)
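A rough sketch of such a step (using sqlite3 as a stand-in for your database, with a made-up events table and a local file to remember the last processed row):

import sqlite3
import subprocess
from pathlib import Path

LAST_SEEN = Path('last_seen_id.txt')

def main():
    con = sqlite3.connect('source.db')
    (max_id,) = con.execute('SELECT MAX(id) FROM events').fetchone()
    con.close()

    last_seen = int(LAST_SEEN.read_text()) if LAST_SEEN.exists() else -1

    if max_id is not None and max_id > last_seen:
        # new rows arrived since the last run: build the pipeline
        subprocess.run(['ploomber', 'build'], check=True)
        LAST_SEEN.write_text(str(max_id))

if __name__ == '__main__':
    main()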

Ploomber works with any database that has a PEP 249 compliant driver. From a quick glance, it looks like Cassandra has one. So if you give it a try, please share your feedback!

I'm closing this now but feel free to open a new issue if you have more questions!