nelsonjchen/gargantuan-takeout-rocket

S3 Targets, e.g. Cloudflare R2 for Staging, AWS S3 for Deep Archive, Backblaze for lukewarm, Wasabi for lukewarm, etc.

nelsonjchen opened this issue · 12 comments

General Issues to tackle:

Targets:

  • Hot
    • R2
      • For staging local backups. No archive tier available, unfortunately. Waiting on lifecycle rules. Will personally be using it for local backup staging if/when they're available. $15/mo/TB
      • Real and Authentic Free Download/Upload
  • Lukewarm
    • Backblaze B2
    • Wasabi
  • Cold
    • S3
      • For Deep Archive Tier (Equivalent Pricing to Azure)
      • Probably more comfy for AWS-natives
      • $1/mo/TB

What is the reason for waiting for lifecycle rules? These destinations are quite cheap as they are, no?
Also, uploading to R2 seems like a small step since the CF proxy is used anyway, no?

The biggest concern there for me is that it'll cost $15/month to host 1 TB of data on R2. That blows my budget by quite a lot. I want to make sure Cloudflare has some safeguards that a guide can walk people through setting up (e.g. a lifecycle rule that auto-expires the staging bucket, sketched below) to prevent that in case someone forgets to delete their staging area.
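For illustration, here's a minimal sketch of that kind of safeguard as a one-off setup script, assuming the staging bucket's S3-compatible endpoint accepts the standard lifecycle call (the bucket name, endpoint, and 7-day window are placeholders):

```ts
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

// One-off setup script (not part of GTR): expire everything in the staging
// bucket a week after upload, so a forgotten staging area only costs about
// a week of storage instead of $15/mo/TB indefinitely.
const s3 = new S3Client({
  region: "auto",
  endpoint: "https://<account-id>.r2.cloudflarestorage.com", // placeholder
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

await s3.send(
  new PutBucketLifecycleConfigurationCommand({
    Bucket: "takeout-staging", // placeholder bucket name
    LifecycleConfiguration: {
      Rules: [
        {
          ID: "expire-staging",
          Status: "Enabled",
          Filter: { Prefix: "" },
          Expiration: { Days: 7 },
        },
      ],
    },
  }),
);
```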

I fleshed out the issue description a lot, @mderazon.

fwiw, this is what I was trying to do with Workers: https://community.cloudflare.com/t/backup-directly-from-google-drive-to-r2/440132/5

Hmm, that's such a weird usage of some APIs. You pass in a body that's just a ReadableStream, but then there's also a queue size and a part size. Doesn't that require some sort of seekable buffer or something? Maybe it blew up because those aren't compatible things you can do with a simple byte stream or a representation of one.
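For context, this is roughly the shape of the pattern being questioned, as I understand it from the linked thread, using `@aws-sdk/lib-storage` (the endpoint, bucket, credentials, and sizes are placeholders, not the actual worker code):

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

// The SDK is handed a plain ReadableStream plus queueSize/partSize, so it
// has to slice the stream into parts and buffer them itself -- work that
// counts against the worker's CPU time.
const source = await fetch("https://example.com/takeout-part.zip"); // placeholder

const upload = new Upload({
  client: new S3Client({
    region: "auto",
    endpoint: "https://<account-id>.r2.cloudflarestorage.com", // placeholder
    credentials: { accessKeyId: "...", secretAccessKey: "..." },
  }),
  params: {
    Bucket: "takeout-staging", // placeholder
    Key: "takeout-part.zip",
    Body: source.body,         // just a byte stream, not seekable
  },
  queueSize: 4,               // parts uploaded in parallel
  partSize: 10 * 1024 * 1024, // 10 MiB per buffered part
});

await upload.done();
```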

You're doing a lot more orchestration in the worker than I did in my approach as well. In the prototype GTR Azure transload from Cloudflare Workers, where the worker itself does the transloading, a lot of the orchestration happens in the extension, where it isn't bound by the silly 10ms CPU limit. The worker, or the many worker instances, really is just given two fetches, with the response body from one stuck into the other (sketched below); no fat libraries doing things like part sizing and queuing are used, and the worker stays very dumb.
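A minimal sketch of that "dumb worker" shape; the request contract and URLs here are illustrative placeholders, not GTR's actual interface:

```ts
// The extension does all the orchestration and hands the worker two URLs;
// the worker just streams one response body into the other request.
export default {
  async fetch(request: Request): Promise<Response> {
    const { sourceUrl, destinationUrl } = (await request.json()) as {
      sourceUrl: string;      // e.g. the Takeout download URL
      destinationUrl: string; // e.g. a pre-authorized destination URL
    };

    const source = await fetch(sourceUrl);
    if (!source.ok || !source.body) {
      return new Response("source fetch failed", { status: 502 });
    }

    // Pipe the body straight through: no buffering, no multipart
    // bookkeeping, and almost no CPU time spent in the worker.
    const put = await fetch(destinationUrl, {
      method: "PUT",
      body: source.body,
    });

    return new Response(null, { status: put.status });
  },
};
```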

On that note about fat libraries, if I do try to tackle this, I'll probably be using https://github.com/mhart/aws4fetch, and maybe just the raw stuff in there.
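As a rough sketch, the "raw stuff" would look something like this: `AwsClient` just SigV4-signs a plain `fetch`, so a streamed body can go straight through. The credentials, endpoint, and the unsigned-payload header are assumptions on my part:

```ts
import { AwsClient } from "aws4fetch";

const r2 = new AwsClient({
  accessKeyId: "<access-key-id>",         // placeholder
  secretAccessKey: "<secret-access-key>", // placeholder
  service: "s3",
  region: "auto",
});

const source = await fetch("https://example.com/takeout-part.zip"); // placeholder

// Sign and send a plain PUT; UNSIGNED-PAYLOAD avoids having to hash the
// streamed body up front (assumption about how the signer treats streams).
const resp = await r2.fetch(
  "https://<account-id>.r2.cloudflarestorage.com/takeout-staging/takeout-part.zip",
  {
    method: "PUT",
    body: source.body,
    headers: { "x-amz-content-sha256": "UNSIGNED-PAYLOAD" },
  },
);

console.log(resp.status);
```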

I don't think the size of the library makes any difference; it could be one line in the library that burns some CPU and that would be it.
In the case of the library I used, the culprit might be somewhere around these lines of code:
https://github.com/aws/aws-sdk-js-v3/blob/ce7cc58b15fd7ba0bd2b10c7a471b4c8ce95b7d9/lib/lib-storage/src/Upload.ts#L309-L355

There's also this:
https://community.cloudflare.com/t/streaming-large-remote-files/14501/3

I will try the lib you mentioned in my code to see if it makes a difference.

Just noting this down here: https://developers.cloudflare.com/workers/platform/limits/#simultaneous-open-connections

There is a limit of 6 simultaneous open connections. Theoretically, that caps me at about 3/10 of the speed of the current Azure transloading from one worker call.
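Roughly what staying inside that limit looks like, assuming the source is split into Range chunks and each chunk holds a source GET plus a destination PUT at once (the chunk size and the part-upload callback are placeholders):

```ts
const CHUNK = 100 * 1024 * 1024; // illustrative chunk size
// Each in-flight chunk holds two connections (source GET + destination PUT),
// so three chunks at a time stays under the limit of 6 open connections.
const MAX_IN_FLIGHT = 3;

async function transload(
  sourceUrl: string,
  totalSize: number,
  uploadPart: (index: number, body: ReadableStream<Uint8Array>) => Promise<void>,
): Promise<void> {
  const inFlight = new Set<Promise<void>>();

  for (let offset = 0; offset < totalSize; offset += CHUNK) {
    const end = Math.min(offset + CHUNK, totalSize) - 1;

    const task: Promise<void> = (async () => {
      // Pull one Range chunk from the source...
      const part = await fetch(sourceUrl, {
        headers: { Range: `bytes=${offset}-${end}` },
      });
      // ...and hand its stream to whatever the destination needs
      // (S3 UploadPart, Azure Put Block, etc. -- placeholder callback).
      await uploadPart(offset / CHUNK, part.body!);
    })().finally(() => inFlight.delete(task));

    inFlight.add(task);
    if (inFlight.size >= MAX_IN_FLIGHT) {
      await Promise.race(inFlight); // wait for a slot to free up
    }
  }

  await Promise.all(inFlight);
}
```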

I'm keeping an eye on this project and wanted to ask: now that lifecycle rules have been added, is the last missing piece for sending it to any S3-compatible storage the remote-fetch feature that Azure Storage has?

The last missing piece is acceptable performance. The 100 MB POST limit inside Workers was extremely annoying; is it still there? It cuts the top speed to about 3/10 of Azure's and causes the request count to spike to the point where it smashes into the free account's limit ceiling.

I haven't touched this issue in some time; I might resurrect it now that I've got a new 8 TB drive to back up to.