nelsonjchen/gargantuan-takeout-rocket

S3 Targets, e.g. Cloudflare R2 for Staging, AWS S3 for Deep Archive, Backblaze for lukewarm, Wasabi for lukewarm, etc.

nelsonjchen opened this issue · 12 comments

General Issues to tackle:

Targets:

  • Hot
    • R2
      • For staging local backups. No archive tier available, unfortunately. Waiting on lifecycle rules. Will personally be using it for local backup staging if/when they're available. $15/mo/TB
      • Real and Authentic Free Download/Upload
  • Lukewarm
    • Backblaze B2
    • Wasabi
  • Cold
    • S3
      • For Deep Archive Tier (Equivalent Pricing to Azure)
      • Probably more comfy for AWS-natives
      • $1/mo/TB

What is the reason for waiting for lifecycle rules? These destinations are quite cheap as they are, no?
Also, uploading to R2 seems like a small step since the CF proxy is used anyway, no?

The biggest concern there for me is that it'll cost $15/month to host 1 TB of data on R2. That blows my budget by quite a lot. I want to make sure Cloudflare has some safeguards that a guide can walk people through setting up (e.g. a lifecycle rule that auto-expires the staging bucket, sketched below) to prevent that in case someone forgets to delete their staging area.
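For illustration, here's a minimal sketch of that kind of safeguard as a one-off setup script, assuming the staging bucket's S3-compatible endpoint accepts the standard lifecycle call (the bucket name, endpoint, and 7-day window are placeholders):

```ts
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

// One-off setup script (not part of GTR): expire everything in the staging
// bucket a week after upload, so a forgotten staging area only costs about
// a week of storage instead of $15/mo/TB indefinitely.
const s3 = new S3Client({
  region: "auto",
  endpoint: "https://<account-id>.r2.cloudflarestorage.com", // placeholder
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

await s3.send(
  new PutBucketLifecycleConfigurationCommand({
    Bucket: "takeout-staging", // placeholder bucket name
    LifecycleConfiguration: {
      Rules: [
        {
          ID: "expire-staging",
          Status: "Enabled",
          Filter: { Prefix: "" },
          Expiration: { Days: 7 },
        },
      ],
    },
  }),
);
```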

I fleshed out the issue description a lot, @mderazon.

fwiw, this is what I was trying to do with Workers: https://community.cloudflare.com/t/backup-directly-from-google-drive-to-r2/440132/5

Hmm, that's such a weird usage of some APIs. You pass in a body that's just a ReadableStream, but then there's also a queue size and a part size. Doesn't that require some sort of seekable buffer or something? Maybe it blew up because those aren't compatible things you can do with a simple byte stream or a representation of one.
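For context, this is roughly the shape of the pattern being questioned, as I understand it from the linked thread, using `@aws-sdk/lib-storage` (the endpoint, bucket, credentials, and sizes are placeholders, not the actual worker code):

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

// The SDK is handed a plain ReadableStream plus queueSize/partSize, so it
// has to slice the stream into parts and buffer them itself -- work that
// counts against the worker's CPU time.
const source = await fetch("https://example.com/takeout-part.zip"); // placeholder

const upload = new Upload({
  client: new S3Client({
    region: "auto",
    endpoint: "https://<account-id>.r2.cloudflarestorage.com", // placeholder
    credentials: { accessKeyId: "...", secretAccessKey: "..." },
  }),
  params: {
    Bucket: "takeout-staging", // placeholder
    Key: "takeout-part.zip",
    Body: source.body,         // just a byte stream, not seekable
  },
  queueSize: 4,               // parts uploaded in parallel
  partSize: 10 * 1024 * 1024, // 10 MiB per buffered part
});

await upload.done();
```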

You're doing a lot more orchestration in the worker than I did in my approach as well. In the prototype GTR Azure transload from Cloudflare Workers, where the worker itself does the transloading, a lot of the orchestration happens in the extension, where it isn't bound by the silly 10ms CPU limit. The worker, or the many worker instances, really is just given two fetches, with the response body from one stuck into the other (sketched below); no fat libraries doing things like part sizing and queuing are used, and the worker stays very dumb.
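A minimal sketch of that "dumb worker" shape; the request contract and URLs here are illustrative placeholders, not GTR's actual interface:

```ts
// The extension does all the orchestration and hands the worker two URLs;
// the worker just streams one response body into the other request.
export default {
  async fetch(request: Request): Promise<Response> {
    const { sourceUrl, destinationUrl } = (await request.json()) as {
      sourceUrl: string;      // e.g. the Takeout download URL
      destinationUrl: string; // e.g. a pre-authorized destination URL
    };

    const source = await fetch(sourceUrl);
    if (!source.ok || !source.body) {
      return new Response("source fetch failed", { status: 502 });
    }

    // Pipe the body straight through: no buffering, no multipart
    // bookkeeping, and almost no CPU time spent in the worker.
    const put = await fetch(destinationUrl, {
      method: "PUT",
      body: source.body,
    });

    return new Response(null, { status: put.status });
  },
};
```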

On that note about fat libraries, if I do try to tackle this, I'll probably be using https://github.com/mhart/aws4fetch, and maybe just the raw stuff in there.
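As a rough sketch, the "raw stuff" would look something like this: `AwsClient` just SigV4-signs a plain `fetch`, so a streamed body can go straight through. The credentials, endpoint, and the unsigned-payload header are assumptions on my part:

```ts
import { AwsClient } from "aws4fetch";

const r2 = new AwsClient({
  accessKeyId: "<access-key-id>",         // placeholder
  secretAccessKey: "<secret-access-key>", // placeholder
  service: "s3",
  region: "auto",
});

const source = await fetch("https://example.com/takeout-part.zip"); // placeholder

// Sign and send a plain PUT; UNSIGNED-PAYLOAD avoids having to hash the
// streamed body up front (assumption about how the signer treats streams).
const resp = await r2.fetch(
  "https://<account-id>.r2.cloudflarestorage.com/takeout-staging/takeout-part.zip",
  {
    method: "PUT",
    body: source.body,
    headers: { "x-amz-content-sha256": "UNSIGNED-PAYLOAD" },
  },
);

console.log(resp.status);
```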

I don't think the size of the library makes any difference; it could be one line in the library that burns some CPU and that would be it.
In the case of the library I used, the culprit might be somewhere around these lines of code:
https://github.com/aws/aws-sdk-js-v3/blob/ce7cc58b15fd7ba0bd2b10c7a471b4c8ce95b7d9/lib/lib-storage/src/Upload.ts#L309-L355

There's also this:
https://community.cloudflare.com/t/streaming-large-remote-files/14501/3

I will try the lib you mentioned in my code to see if it makes a difference.

Just noting this down here: https://developers.cloudflare.com/workers/platform/limits/#simultaneous-open-connections

There is a limit of 6 simultaneous open connections. Theoretically, that caps me at about 3/10 of the speed of the current Azure transloading from one worker call.
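Roughly what staying inside that limit looks like, assuming the source is split into Range chunks and each chunk holds a source GET plus a destination PUT at once (the chunk size and the part-upload callback are placeholders):

```ts
const CHUNK = 100 * 1024 * 1024; // illustrative chunk size
// Each in-flight chunk holds two connections (source GET + destination PUT),
// so three chunks at a time stays under the limit of 6 open connections.
const MAX_IN_FLIGHT = 3;

async function transload(
  sourceUrl: string,
  totalSize: number,
  uploadPart: (index: number, body: ReadableStream<Uint8Array>) => Promise<void>,
): Promise<void> {
  const inFlight = new Set<Promise<void>>();

  for (let offset = 0; offset < totalSize; offset += CHUNK) {
    const end = Math.min(offset + CHUNK, totalSize) - 1;

    const task: Promise<void> = (async () => {
      // Pull one Range chunk from the source...
      const part = await fetch(sourceUrl, {
        headers: { Range: `bytes=${offset}-${end}` },
      });
      // ...and hand its stream to whatever the destination needs
      // (S3 UploadPart, Azure Put Block, etc. -- placeholder callback).
      await uploadPart(offset / CHUNK, part.body!);
    })().finally(() => inFlight.delete(task));

    inFlight.add(task);
    if (inFlight.size >= MAX_IN_FLIGHT) {
      await Promise.race(inFlight); // wait for a slot to free up
    }
  }

  await Promise.all(inFlight);
}
```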

I'm keeping an eye on this project and wanted to ask: now that lifecycle rules have been added, is the last missing piece for sending it to any S3-compatible storage the remote-fetch feature that Azure Storage has?

The last missing piece is acceptable performance. The 100 MB POST limit inside Workers was extremely annoying; is it still there? It cuts the top speed to about 3/10 of Azure's and causes the request count to spike to the point where it smashes into the free account's limit ceiling.

I haven't touched this issue in some time; I might resurrect it now that I've got a new 8 TB drive to back up to.