nelsonjchen/gargantuan-takeout-rocket

Downloads failed as "No content-length header"

udmada opened this issue · 28 comments

Hi Nelson,

I wanted to thank you for your amazing work on the [project name] extension. I'm also interested in using R2 and/or B2 as potential destinations.

I followed your tutorial using a self-hosted Cloudflare proxy and tried to download a 50 GB archive from Google Takeout. However, I received an error message from the extension: "takeout.zip - failed - No content-length header". Based on my initial assessment, I believe this issue might be related to the Cloudflare worker.

Could you please provide me with some insights on what could be causing this issue and what steps I can take to resolve it?

Thank you again for all your hard work!

Cheers!
Adam

I'm not sure to be honest. The setup is a bit jank. I would check the extension's console and see if it can give more insight. The content-length check is usually done by the extension running a HEAD request against the final, direct signed Takeout URL.

Also, what is the URL of the direct download link that the ZIP file resolves to? I was able to whitelist EU and US in the worker, but I am a bit uncertain about AUS. This message could come from the proxy blocking it. You can neuter the check yourself, or just give the URL to me after 30 minutes, either by email or by posting it here.

Right now it's whitelisting these.

https://github.com/nelsonjchen/gtr-proxy/blob/a4b9980304b5a6ad9dee664bf23a6b468da82b2a/src/handler.ts#L266-L272

Thanks for your response! I tailed the CF logs and I can see it is trying to HEAD a Google endpoint that starts with storage.googleapis.com/dataliberation/. Let me try whitelisting it myself :)

The HEAD is supposed to be from the extension itself. It doesn't need to go through the proxy.
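
For reference, a minimal sketch of how such a HEAD check could look (hypothetical helper, not the actual extension code; `fetchImpl` is injectable purely for illustration):

```javascript
// Hypothetical sketch: determine the archive size up front by issuing a
// HEAD request against the signed Takeout URL. If the server does not
// report a content-length, fail the same way the extension does.
async function getContentLength(signedUrl, fetchImpl = fetch) {
  const resp = await fetchImpl(signedUrl, { method: "HEAD" });
  const len = resp.headers.get("content-length");
  if (len === null) {
    throw new Error("No content-length header");
  }
  return Number(len);
}
```

Since the HEAD goes straight to the signed URL, a failure here points at Google or the network path, not necessarily at the proxy.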

Hold up, are you trying to transload from drive?

I can see the CF logs having a 403 response to a HEAD req that is sent to [my own proxy]/p/storage.googleapis.com/dataliberation/[blah blah blah]. The Chrome service worker console shows the same network activity.

In terms of where I am transloading from - I did a Takeout archive (of Google Photos). I understand that ultimately photos in GPhotos are stored in GDrive, but I didn't do it directly via Drive.

[screenshot]

This is the interface I download Takeouts from. Do you use the same? I press that blue download button.

I did, and the extension successfully intercepts the download.

Hmm, I can reproduce the issue. I'm not sure what's going on at the moment, but I'll see what I can find out and fix it.

Thanks!

Just a bit more information I guess:

I tried downloading another Takeout archive from another account - same 403 resp and same error message. However, I can see that the HEAD req was sent to a different endpoint: [blah]-apidata.googleusercontent.com/download/storage/v1/b/dataliberation/[blah], which is whitelisted in the code.

[screenshot]

The format seems to have changed slightly. There's now a %2B (+) here which wasn't in this demo URL before. My guess is that Azure is unhelpfully screwing this up too.

Note: these are both expired, of course.

Past: https://00f74ba44b071b761059aef3fd79738daea1be7829-apidata.googleusercontent.com/download/storage/v1/b/dataliberation/o/20211113T212502Z%2F-4311693717716012545%2F498d83a5-1ab3-4a79-815f-e5cfda855e7a%2F1%2F869777c3-49ff-4d4e-a932-230a6b0b2a78?jk=AFshE3XT7l4gO3olRD23ASyAuaK-Lbi1Z4oc4eMBje8eLdA1mHPk-VeNNMCDno2sDlRKTKD2Nqau1HdkE9nX5f462yylgcSu5kmIknW0lU-1Xx3Mb8OnO5L-DMq3W8xslAI6vlKnqrKaTztfOKSQOfn-5XWf4OuiuDCTdstSSCcsNDMu8b4NX6cnuRhGRdVonqtH3lf9TV7fIBJMchxy3l-i3W_tiGHO7NP9B2Rnvo2uJP7-pgbfxH_ki0DLerQhKK4hRx6KeHWfXL2XT80lLVYwfS2dk5XVAplFIIV7Lp9H7x3HERQzR7_1JshhluQyoG6Vqv7gRYyav8S7PrwkKXStCho5fc85ErZ0dQqJXmvNqCtdWCB8-KzIA5-UgjlLcDzk_mVYMUfcr-_i-R-5tA_Rnb0MmavB94aIj9EfEh0g0B6yCRnAHAIuob6EYFTeCVTs7XXBlqlMKF-P0A5L2d47f0pSQrosQUNshoZKKieSl71vD3kiFDZ4OIg5K-yPlkniodFuyRr-hf5LeBIZhMFNozA2nfGOU3cW3i_sJZgNJNf68UK_l1beTDJ5ZKEZ5ot0jgaQ7w_KlLEonaGJM4Lw7oVby-GbqmlFYe2SI9wwxcXURdW88AW4zipqCMOz_N7cBYC0zm1t4TRSW2-_uvsQWLQRA_9g8avGn8RIKr8i-ISa7sfMaUQEkY4eOtsV7l3JHNeKjmJtxSOJPwg487Cv0htwGt_3Kd6IbyFOb1l0l9wKtkIxkQqliTvAK7VXZUGr1Cdsbbhq1qy3AF1aMVPA1vghV2TOOr5rOzVkRUmTLQzU5WfsYOoNcKjJ7mPvuOirFkKvSHzBQDvZ8_B2RgwT7zMZ7LsjAhG1zS3eDTijUMi9QEM_FYkugRpZ36eg9SZWrEbHCp36y0kL7QK8gZHVP6ePvOqujXG1BCryrxp5UQ9AhZS3szhe54MDf1877LTEmCH5_utBvQqF31dlinmEWiL4YTwiSEwwUToJ38H7gmI-CWErYJsJylmuOSfUoJFpELSRi4Qw4fF-figbaB3w_BNhXvEBdUsMeSNkBkU5u4nwAfG8IJ6TxkyZZKgK4uIhG1R7mr7QaRJ_bizIRVUl&isca=1+

New: https://ff9ee6df18ab8755fc1e3e5aca35f89663f668c795c7a437012252f-apidata.googleusercontent.com/download/storage/v1/b/dataliberation/o/20230306T025503.608%2B0000%2F-4311693717716012545%2F612988c2-608e-43c7-bf61-fec758c0d677%2F1%2Ff06a42e4-3547-4e8a-90d0-6cdd68a1f779?jk=AahUMlsFazPeGdNNVtJI2Z4Zcub-6Oi0SZavX3fqpD8k18AjnquLEbJuUAxAkK-dtM8hHi7V6J4nOZgKpbtelLIZvc64ol6Tw8rJWbr68OWahPb4V27XGkdjIcU8rCFdpvMToT0Qy2qIJ106DoRzdwWLCjE9JMORUNtsttyrbnKbJDJXQQb-K0nRk-dQNSlAma3nfbJrXbjBfz1ZFHWjgRoqinSWgu2hJ6QB2u8FKLRmbSCnFaT9lu2P4Py7V-5EfLE5PPxNpklgrQElu9454kQ8K5Qg1zWKFZlLHCAOewLtInJocpLcaI_HbnCiKREUsPIAAQ8DjDXfUl0RjKWiDAjrNVv-yMVrE2LCzZJGMJmIVGvvrdTB5NEWkCEMbkU1QY1RCDKYYBb6UEF2kwBlQMSeaT_5JD9PbximCgFL6-DzkutxKTDlXcFX5HmpFYfWmhoekDYBLeFYLkhoX8riHos859a93mu9c4C2wQc7YPuQKTHUMuiM4m94d9zLJUCQUtFCb1rzSKBpCfuHaMKxnL0KwMQ8yUu7JwXDZfxwCk2DP-Q6eKV6vFapc6r6XWwQpqP7Nr_7kJk0daalHqBYmiVxW5jP5ZCMUfcXBlpYnb9brctVogsX4chGW2tiLfQiYBGUF_SxxCvImdM59zWvy58O99sblVJyooKorqtGmqTXu3p4K_m1BIpp3O4tQmwlU6b-F2ni17AByIIIVI1aGzLz0_FciA8xwRvVsLZHFy7Z9p8O1wLFgHUt-UzgcIKwrB0E-9LH-1eU5ZHEyJYFTJqINTsWlrrMHSlxdUJo6qOBQ5T7x8arZaXViqNLlTwyuxf_qKwCHZaiO-OeI4xGO6UBMD1AfCsOvIYKOBvKLRwd6UiU03tJw_cLKfUJT6tvg5efVI61UovUJkIEtTeCbxm31CBgn4L7J7cvPrknRpIjFrQiE8JUtzl9l-ts185CqwJvhj8BsSMcUjNE9ns-UBCbE1QUIf4GaUJHq73gpiCjU9IpjJZtWIKp0r_Ohbbb2bXXhnrZV4cUW3VlO9rpqUYF1YxDvPyYRy8v4ZEJDPbc7F3VRejDLufzl09gzcZ8A6zBNAWLZUiwDUVNOC_VeQKsraltuMe0uQDv-y0XcrfqFbpOplwG-Ha0zgwHFGSGJ8RrR2Z-UpcEbJwEqnS_U8yKW3sqI8Ed9UhdNKDoK_BinqNco1kdekSUuveHOE_ORLG6UZyCo4dwMpiG63Xffum0hTiyjMBUPr8PU5NkSQiuYv7OdWwH2vQXNgRNl_PEAZDKi-rE1Oj49ZqrR_UJj0S-RHUwou_5e1XegYfsYwhjOnxXvDRHmV6UGA&isca=1

Hmm, an initial hack with that assumption doesn't work. I'll have to test this out more over next weekend with https://github.com/nelsonjchen/put-block-from-url-esc-issue-demo-server and get down to how exactly Azure is mangling Google's URL.

It's definitely something to do with the %2B, or the +, that Google just added. The armoring probably did survive Azure, but now it might be something in JavaScript, Cloudflare, etc. causing this unintentional replacement.
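
For context, the "armoring" is double percent-encoding: escape the path so that one unwanted decode pass by an intermediary still yields the original escapes. A sketch of the idea (my reading of the concept, not the exact gtr-proxy code):

```javascript
// Double-encode every "%" so "%2B" becomes "%252B". An intermediary that
// decodes the path once will turn it back into "%2B" rather than "+".
function armorPath(path) {
  return path.replace(/%/g, "%25");
}
```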

Observe the RawPath values here.

Just a +: https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me+You.txt

Debug: Escaped Path Recieved
&url.URL{Scheme:"", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/Me+You.txt", RawPath:"", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}
"/Me+You.txt"

Percent encoded, note the server sees %2B correctly: https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me%2BYou.txt

Debug: Escaped Path Recieved
&url.URL{Scheme:"", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/Me+You.txt", RawPath:"/Me%2BYou.txt", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}
"/Me%2BYou.txt"

Proxy with no encoding: https://gtr-proxy.677472.xyz/p/put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me+You.txt

Debug: Escaped Path Recieved
&url.URL{Scheme:"", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/Me+You.txt", RawPath:"", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}
"/Me+You.txt"

Proxy with encoding: https://gtr-proxy.677472.xyz/p/put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me%2BYou.txt

Debug: Escaped Path Recieved
&url.URL{Scheme:"", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/Me+You.txt", RawPath:"", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}
"/Me+You.txt"

Proxy with "azure armoring": https://gtr-proxy.677472.xyz/p/put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me%252BYou.txt

Debug: Escaped Path Recieved
&url.URL{Scheme:"", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/Me+You.txt", RawPath:"", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}
"/Me+You.txt"

Hmm, this is a bit frustrating. I'll have to experiment a bit with CF and see what can be done to work around this CF bug.

[screenshot]

addEventListener("fetch", event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  // Round-trip a URL containing %2B through a URL object before fetching it.
  const url = "https://put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me%2BYou.txt"
  const originalResponse = await fetch((new URL(url)).toString())

  return originalResponse
}

https://cloudflareworkers.com/#a4c100303434d48b75d700b7adf8e794:http://about:blank/

Apparently the URL object will cause this: it preserves %2F but clobbers %2B.
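
One workaround is to never round-trip the upstream path through a URL object at all, and instead build the fetch target by plain string concatenation. A hypothetical sketch (not the actual gtr-proxy fix):

```javascript
// Build the upstream URL from the raw, still-encoded path-and-query string
// taken verbatim from the incoming request, so %2B never gets a chance to
// be re-serialized into "+".
function upstreamUrl(host, rawPathAndQuery) {
  return `https://${host}${rawPathAndQuery}`;
}
```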

Going to reopen this issue. I queued up a Takeout for myself but it's like "Start in 2 days". OK, whatever, I'll test it then and if it works, this will close for sure.

https://gtr-proxy.677472.xyz/p/put-block-from-url-esc-issue-demo-server-3vngqvvpoq-uc.a.run.app/Me%2BYou.txt

Debug: Escaped Path Recieved
&url.URL{Scheme:"", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/Me+You.txt", RawPath:"/Me%2BYou.txt", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}
"/Me%2BYou.txt"

I think the issue is now resolved 🎉

[screenshot]

You'll need to pull or update your GTR proxy.

I experienced this today
[screenshot]

Could it be the %3D at the end of the URL?

[screenshot]

I have my own Google Takeout in progress right now on schedule. I'm traveling but I'll check it out when I can.

gtr-proxy is returning a 403 in background.js

Alt-Svc: h3=":443"; ma=86400
Cf-Ray: 7d5b119c0acd4798-DFW
Content-Encoding: br
Content-Type: text/plain;charset=UTF-8
Date: Sun, 11 Jun 2023 16:00:45 GMT
Nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=bGlo90JDsy9IBJV3SD3fdtxdObN9Y4HLIMqcb2N%2BoUdciY01GxDf3KaQDveBC9djBBKHogg9pwU8cqHOQvS81Cj9YTm6BBjCwmY5btjwIQxAqvcwGVvHwsmYEGt%2B2s8bEeMmVh1j1A%3D%3D"}],"group":"cf-nel","max_age":604800}
Server: cloudflare
Vary: Accept-Encoding

> I have my own Google Takeout in progress right now on schedule. I'm traveling but I'll check it out when I can.

Appreciate it, let me know if you can reproduce. If it helps, this is a Google Workspace account, in this case for colorado.edu.

I think it has to do with this function in gtr-proxy/src/handler.ts

export function validGoogleTakeoutUrl(url: URL): boolean {
  return (
    url.hostname.endsWith('apidata.googleusercontent.com') &&
    (url.pathname.startsWith('/download/storage/v1/b/dataliberation/o/') ||
      url.pathname.startsWith('/download/storage/v1/b/takeout'))
  )
}

Perhaps you could instead use a regex, something like

export function validGoogleTakeoutUrl(url: URL): boolean {
	return /^.*\.(googleusercontent\.com|googleapis\.com|google\.com)$/.test(url.hostname)
}

or

export function validGoogleTakeoutUrl(url: URL): boolean {
	return /^.*\.(googleusercontent\.com|googleapis\.com|google\.com)$/.test(url.hostname) && /^((\/download\/storage\/v1\/b\/.*)|\/takeout-eu\/.*)/.test(url.pathname)
}

I did not run these expressions through regexr to validate them, but I hope you get the gist.

A broader approach could simply check that the certificate authority organization is Google Trust Services LLC, though I can think of a few reasons why you wouldn't want to proxy any Google URL.

I'm handy with regexes, but I'm also very aware of how their terseness can sometimes hide holes, so I reach for a larger, more obvious, self-documenting version.
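
That style might look something like this: explicit host suffixes and path prefixes in plain lists rather than one terse regex (a sketch of the approach, not the actual handler.ts change):

```javascript
// Self-documenting allowlist: each entry is visible and individually
// auditable, at the cost of being longer than a regex.
const ALLOWED_HOST_SUFFIXES = ["apidata.googleusercontent.com"];
const ALLOWED_PATH_PREFIXES = [
  "/download/storage/v1/b/dataliberation/o/",
  "/download/storage/v1/b/takeout",
];

function validGoogleTakeoutUrl(url) {
  return (
    ALLOWED_HOST_SUFFIXES.some((suffix) => url.hostname.endsWith(suffix)) &&
    ALLOWED_PATH_PREFIXES.some((prefix) => url.pathname.startsWith(prefix))
  );
}
```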

Yeah, the few reasons why you wouldn't want to proxy any Google URL is right and the realization that's just what one could think of.

Was this a takeout that worked in the past or is this the first time with a Workspace account?

Also, I thought, and your profile says, you're in Colorado. "takeout-eu"?

I just did a demo takeout. The final URL was (expired for demo):

https://ff0ca247828400af09f22471a5d3bc6000d2e3bc3a05f516a56a653-apidata.googleusercontent.com/download/storage/v1/b/dataliberation/o/20230611T154640.666Z%2F-4311693717716012545%2F2a18aa67-2d0b-4108-9e34-97e879e54714%2F1%2F9745587c-f94b-459c-aa67-3fd3b76bd168?jk=ARTXCFFfAVItYcq_kEDloAaxFt7LxluL3XTB7B0har6VijVOH2UqxNY_jd92xn3yjPAiRLm8eORTn2roHkV3l7KKEeGQ3s5otN7Qc-glzrJ77l1s_xjIxFZNctt7b6euqApR_pHkiS04lfGVv_ZpOWUvBHiHQxjlz6ikZ3s9dtItGMrBIzmYqq6sHZj-8HT-V17AeNdKlpvVI86nvjwjOuKqHiwnuJlYvqiG7SFb0CvsKJcs5ZowwEfQ_7_uWTo6edv-8SWaUWHKvevfhmZHMQvVbmKibalZhKYxOa_jUsfdr3CCAuz0MxPfwca33aim1x__CBpSYGarWQl0GtQqKqoaAo7KVL-wtRdLXvbNV_P2IajDcxQUrQIPnvXOxciWKI0eYaINZTeX4uQoovE_Rw2uH-YkqU5vR4x2ZJaOBrLO8N8MVJtFwAguE8q3Y7ywGFYItkZMvgvT7V0OF0jmQMKD9mHgtcrkrsGOQeKojuxdZEm8hLU4q-hr9kXYba6yat8ARSkmUoeGac70TU_znkQZ6rv_5O9f6whtIeNeFcUL3ro2-p0BZ3xSWRNZMG7AvWoOrD4GNwHTpcHsROZS2Y9DucPhf0r683ulkhD2fyFAMhIVAAQ4vHO3AcFzFx4d3QH0mbsEtX8Dewc2A591s74eRno6wnyxDJoi6_s9uN916fxGtIN77hS6pJQVb6fD0HncqqCQTLDfGyKSQHaGkEPsq-1UHnUX1rUZUcV5ddKhnaQaN_W7GdBLEeZRoX51F16CtZ_8LKJr8kdp4AKXqXs1uEoWgsBytrycFXA3htHLGC7OhNeekwXe0Z_m7q181z4hGq6e6n10Ckc9JC3A7NeB5scY8ElferFFsc07LfQci-QyYkASdbffj6m7Gih6dxc-BVyymF_q5qyAnc9g_XXXevC5CDBp7hQULZJRrYnU_JelEN2n_0WkNKIvSEEwfT-NIvjts_Yrn3U0JGS2ba-p3qvcpJUH5kKC4slPiSbTFcAWbnvRxuHQRax_i_QS8ZTQnL7LGHuLY33N8RzRyU-DB0BBi_MCH4sKmZ3DxLDK8Ua9jD3iaIyX7g2z1CD_cKTw7K7GkkVhYWfslR4UvlYsHpUEVstUVtW9x_3-0Jp9tTIR-SYqW0x59qndeiURTjz-S0-v8TvwpJbSWJvK0rCphKlASm4-EigY6Q7w351chNDM4X0vfNEW44bah3L8tRFyH740NpqfuA6N9Cefnq7JLouXNhW5-hdQDp2UHamvgUOd5aUXj0Bz30yc7XaB4YtyMq0XZi0orZ4hHX-pKlzUu-DHvZHJwldpmB1UOwJp7a286Pm7od9ETg1r1d4qYCKCaTLOIT1BNZQK-ppMAF1V&isca=1


What's the URL of a takeout archive in "Downloads"? Like the final destination?

> Also, I thought, and your profile says, you're in Colorado. "takeout-eu"?

Yeah, no clue why that is the case. Basically CU Boulder finally cracked down on the once-'unlimited' Google Drive storage, so here I am trying to download 2 TB over the past 3 weeks lol.

Yes, it worked in the past. I exceeded the retry attempts when using the extension. If you used something like Preact to inject into the DOM on any page when a download is intercepted (and/or fails), it would be fairly informative. I forgot I had installed this extension earlier and was very confused when all of my unrelated downloads weren't working.

Here you can see Takeout downloads working, except when Chrome uses 40% of my Ryzen 5 5600X to write files and crashes when I drag a tab out of a window.

[screenshot]

Sorry for the trouble. I would have just patched the Cloudflare worker myself, but I am kind of averse to using Cloudflare's dev tools as I used to work for a competitor 😅

Yeah, the archive downloads have a lifespan of 5 attempts. This ephemerality is one of the reasons I made GTR too. Out of danger ASAP, please and thanks.

OK, I'll see about adding those URLs as allowable to the proxy. It must be unique to Workspace accounts. I'll probably make it allow anything starting with takeout on that domain.

Can you check, with a small takeout, if those URLs have all that is needed to download the archives? Like, could you copy a URL and download an archive with it in an incognito window shortly after it is generated?

> Yes, it worked in the past. I exceeded the retry attempts when using the extension. If you used something like Preact to inject into the DOM on any page when a download is intercepted (and/or fails), it would be fairly informative. I forgot I had installed this extension earlier and was very confused when all of my unrelated downloads weren't working.

Yeah, I could do more, but I explicitly elected to take out that requirement since I didn't feel like fighting a heavily post-processed page and reverse-engineering it. Intercepting the downloads with the downloads API in Chromium felt much more stable.

@AskAlice I've updated the gtr-proxy I host to include those domains and path prefixes. Please give it a try and let me know if that works. I can't see the full URLs from your screenshot, but if they're signed and don't need cookies, then we should be good.

Fingers crossed! 🤞

After a few tries, and changing nothing, I got a download to start! But other files are giving me 502s:

takeout-20230609T011058Z-026.tgz - failed - Failed to stage block: 502 
<html> <head><title>502 Bad Gateway</title></head>
<body> <center><h1>502 Bad Gateway</h1></center>
<hr><center>cloudflare</center>
</body> </html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

and on others I see

takeout-20230609T011058Z-016.tgz - failed - Failed to stage block: 400 
<?xml version="1.0" encoding="utf-8"?><Error>
<Code>CannotVerifyCopySource</Code>
<Message>Bad Request RequestId:81365698-301e-0032-4752-9d31c6000000 Time:2023-06-12T17:22:31.2050574Z</Message></Error>

also (might be from retrying, not sure)

Failed to stage block: 500 invalid URL

and also

Failed to commit block list: 400

I hope I can just retry those and see if it works. I've got a good portion of these downloaded, though, which should supplement the ones I downloaded before Chrome died under the sheer load of writing files to my hard drive. So thank you for this awesome tool!

OK, probably as good as I can get it.