thumbor-community/aws

[RFC] Add some randomness in the key names

Bladrak opened this issue · 10 comments

Following https://www.youtube.com/watch?v=_FHRzq7eHQc we should add some randomness in our S3 Key names, in order to allow for better scalability and performance in high load environments.

Here's what I suggest: edit the https://github.com/thumbor-community/aws/blob/master/tc_aws/aws/bucket.py#L125 function by appending the first seven chars of the sha1 of the path and its extension in the end.

For instance, a key such as image/in/my/folder/foo.jpg would become image/in/my/folder/foo.jpg/baaa1c1.jpg.

What do you think?

I'm just concerned about performance. All time that you made a request for an image, the system will take a sha1 from URL and search for this equivalent image on the bucket.

It will indeed take a bit more time within thumbor, but should allow for faster response time from AWS. In the overall, I think we should gain some performance, given the most time is spent within AWS fetching.

good point

problem is that randomness must be set at the prefix of the key. on suffix is it useless.

This make finding groups of images (like image from a defined site using http loader or so with define Id like type/id for example) difficult to find in our bucket.

I even wonder if using a prefix with TC_AWS_STORAGE_ROOT_PATH is not counterproductive.

juris commented

There used to be randomness in the key names long time ago: af53c75#diff-73180fe182db7c4424a4ec422d370b5b

def normalize_path(self, path):
    root_path = self.context.config.get('RESULT_STORAGE_AWS_STORAGE_ROOT_PATH', default='thumbor/result_storage/')
    path_segments = [path]
    if self.is_auto_webp:
        path_segments.append("webp")
    digest = hashlib.sha1(".".join(path_segments).encode('utf-8')).hexdigest()
    return os.path.join(root_path, digest)		
 

Indeed, missed that at the time, good catch!

If it's fine with everyone, I'd like to proceed with this feature, but by making it configurable, to avoid an impact for the people upgrading the solution (so they won't have to re-build the cache).

juris commented

Would be great! We are still using Thumbor with the old aws plugin and not upgrading to the current version because of the lack of randomness.

I've submitted a pre-PR, feel free to check whether that's a behavior that would seem suitable.

This has been released under the 6.1.0 version on Pypi