cartography-cncf/cartography

AWS: Make Fetches Concurrent

Closed this issue · 2 comments

Title: Make Fetches Concurrent

Description:
Cartography is currently entirely serial, but there are many opportunities for concurrency, particularly during the get_ phases of sync modules.

As a basic example, to get all the data needed for all AWS ECR Images, we:

  1. call a boto method to get a list of all repositories
  2. for each repository, call another boto method to list all images within that repository

The loop in the second step is what we can make concurrent. The calls are independent of each other, so running them concurrently overlaps their network wait time, which adds up when there are perhaps many thousands of repositories.

The boto clients are also thread-safe, meaning it's okay to run them in multiple concurrent threads.
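As a rough illustration of the fan-out in step 2, here is a minimal sketch using a thread pool and one shared client. FakeEcrClient, get_images_for_repo, and get_all_images are hypothetical names made up for this example so that it runs offline; a real boto client would make a paginated network call inside list_images.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


class FakeEcrClient:
    """Hypothetical stand-in for a boto ECR client, so the sketch runs offline."""

    def __init__(self, images_by_repo: Dict[str, List[str]]) -> None:
        self._images_by_repo = images_by_repo

    def list_images(self, repositoryName: str) -> Dict[str, Any]:
        # A real client would make a blocking (and paginated) network call here.
        tags = self._images_by_repo[repositoryName]
        return {"imageIds": [{"imageTag": tag} for tag in tags]}


def get_images_for_repo(client: Any, repo_name: str) -> List[Dict[str, str]]:
    """Step 2 for a single repository: list its images."""
    return client.list_images(repositoryName=repo_name)["imageIds"]


def get_all_images(
    client: Any,
    repo_names: List[str],
    max_workers: int = 8,
) -> Dict[str, List[Dict[str, str]]]:
    """Run step 2 for every repository on a thread pool, sharing one client."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as repo_names.
        results = pool.map(lambda name: get_images_for_repo(client, name), repo_names)
        return dict(zip(repo_names, results))
```

The max_workers value is an arbitrary assumption here; it caps how many requests are in flight at once.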

Ideally, we would use an async version of the client libraries, but boto does not provide one. There is a project, aioboto3, that seems to wrap boto to make its methods async, but personally I would rather not use it.

I have a pending PR, #1192, as another workaround. The idea there is to do minimal refactoring by

  1. converting our loops to async methods
  2. calling boto's get methods from separate threads

To be more concrete with the ECR Images example: you might think to just convert get_ecr_repository_images to an async def, but that alone does not yield any performance gain, because inside it is a long-running, blocking call to boto's list_images method. Because of the GIL, it is ultimately the list_images call that needs to run in a separate thread, but I chose to keep the refactor simple by putting its only caller in a separate thread.
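To make the shape of that idea concrete, here is a minimal sketch; it is not the code from #1192. It assumes Python 3.9+ for asyncio.to_thread, and the body of get_ecr_repository_images is a hypothetical stub in which time.sleep stands in for the blocking list_images call. Threads help here because the call spends its time waiting on network I/O, during which CPython releases the GIL.

```python
import asyncio
import time
from typing import Dict, List


def get_ecr_repository_images(repo_name: str) -> List[str]:
    """Hypothetical blocking getter; sleep stands in for boto's network call."""
    time.sleep(0.05)
    return [f"{repo_name}:latest"]


async def get_all_repository_images(repo_names: List[str]) -> Dict[str, List[str]]:
    # asyncio.to_thread runs the blocking function in a worker thread, so the
    # event loop stays free to start the calls for the other repositories.
    tasks = [asyncio.to_thread(get_ecr_repository_images, name) for name in repo_names]
    # gather() returns results in the same order as the tasks were passed in.
    results = await asyncio.gather(*tasks)
    return dict(zip(repo_names, results))
```

A caller at the top of the sync would run this with asyncio.run(get_all_repository_images(names)). The blocking getter itself stays an ordinary def, which keeps the refactor small.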

I've read a bit about asyncio and the GIL from the realpython guides, but I'm still new to this and would like to hear thoughts from the community.

Some things to consider:

How will we apply/refactor this change to all AWS modules?

Will it be easy for new module authors to copy this pattern in their own modules?

Can you describe the performance gain that you observed in #1192 with a bit more detail?

We tried to incorporate multi-threading. One of the challenges we faced was around rate limits: boto3 has retry policies in place, but calls still sometimes fail when a large number of concurrent requests go to AWS.
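One possible mitigation is to cap how many requests are in flight and back off when throttled. Below is a minimal sketch under stated assumptions: ThrottledError, call_with_backoff, and fetch_all are hypothetical names for illustration, and the concurrency limit and delays are arbitrary. This would sit alongside boto's own retry settings (botocore's Config accepts a retries option with max_attempts and mode), not replace them.

```python
import asyncio
import random
from typing import Any, Callable, List


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a getter."""


async def call_with_backoff(
    func: Callable[[], Any],
    limiter: asyncio.Semaphore,
    max_attempts: int = 5,
    base_delay: float = 0.01,
) -> Any:
    """Run a blocking func in a thread, retrying with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            # The semaphore caps how many calls run at the same time.
            async with limiter:
                return await asyncio.to_thread(func)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            # Sleep outside the semaphore so waiting does not hold a slot.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    raise ValueError("max_attempts must be at least 1")


async def fetch_all(funcs: List[Callable[[], Any]], max_concurrency: int = 4) -> List[Any]:
    """Run all funcs with at most max_concurrency in flight at once."""
    limiter = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(call_with_backoff(f, limiter) for f in funcs))
```

The random jitter added to each delay is there so that many throttled callers do not all retry at the same instant.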