429 Client Error: Too Many Requests for url: https://archive.md/
This has never worked for me; I always get a 429 error.
Running from within Python:
>>> import archiveis
>>> archive_url = archiveis.capture("http://www.example.com/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/share/virtualenvs/username-nbgasrwQ/lib/python3.8/site-packages/archiveis/api.py", line 39, in capture
    response.raise_for_status()
  File "/home/username/.local/share/virtualenvs/username-nbgasrwQ/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://archive.md/
Running command-line:
$ archiveis https://google.com/
Traceback (most recent call last):
  File "./.local/share/virtualenvs/google-JUflU5ax/bin/archiveis", line 8, in <module>
    sys.exit(cli())
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/archiveis/api.py", line 106, in cli
    archive_url = capture(url, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/archiveis/api.py", line 39, in capture
    response.raise_for_status()
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://archive.md/
Confirmed. Same here.
Except the domain atm is archive.is.
Looking at the response that comes back it appears that it is getting caught by a CAPTCHA. Unsure what the proper way to deal with this is.
I was getting 429s also.
What you can do to fix this is try a different user-agent string. I switched mine to Firefox on Linux and it worked. The command-line program comes bundled with a -ua (or --user-agent) flag to change it:
archiveis -ua "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0" https://google.com
https://archive.md/wip/VBqdJ
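For anyone calling the library from Python instead of the CLI, here is a rough sketch of the same idea. The CLI traceback above shows the options being forwarded to capture() as keyword arguments, so this assumes capture() accepts a user_agent keyword; I haven't checked that against every release. The helper name capture_with_user_agents is made up for illustration, and the capture function is passed in so you can swap in archiveis.capture yourself.

```python
def capture_with_user_agents(url, user_agents, capture):
    """Try each User-Agent string in order and return the first archive URL.

    `capture` is any callable like archiveis.capture.
    Assumption: it accepts a `user_agent` keyword argument.
    """
    last_error = None
    for user_agent in user_agents:
        try:
            return capture(url, user_agent=user_agent)
        except Exception as exc:  # e.g. requests.exceptions.HTTPError on a 429
            last_error = exc
    raise RuntimeError(
        f"all {len(user_agents)} user agents failed for {url}"
    ) from last_error


# Possible usage (needs the archiveis package and network access):
#
#   import archiveis
#   firefox = ("Mozilla/5.0 (X11; Linux x86_64; rv:10.0) "
#              "Gecko/20100101 Firefox/10.0")
#   print(capture_with_user_agents("http://www.example.com/",
#                                  [firefox], archiveis.capture))
```

No guarantee this gets past the CAPTCHA; it only automates trying more than one browser-like string.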
> Looking at the response that comes back it appears that it is getting caught by a CAPTCHA. Unsure what the proper way to deal with this is.
I think once you hit that CAPTCHA you're pretty much just stuck. I've tried using browser automation libraries like Playwright to do obvious things like clicking the CAPTCHA checkbox, to no avail.
Like the comment above said, choosing a real user-agent probably helps.
Besides that, I think their rate limiting is largely IP-based, so distributing your requests across IPs may help.
I have also noticed that they throttle me way more aggressively when using Cloudflare 1.1.1.1 or Warp. This likely applies to other VPNs as well, but I haven't tested that personally.
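If you do have several outbound IPs available, spreading requests across them could look roughly like the sketch below. It assumes capture() accepts a proxies keyword in the usual requests format (a dict with "http" and "https" keys); treat that as an assumption to verify against your installed version. The helper name and the proxy addresses in the usage comment are hypothetical.

```python
from itertools import cycle


def capture_via_proxies(urls, proxy_urls, capture):
    """Archive each URL through the next proxy in round-robin order.

    Assumption: `capture` accepts a requests-style `proxies` dict.
    Returns a dict mapping each submitted URL to its archive URL.
    """
    rotation = cycle(proxy_urls)
    results = {}
    for url in urls:
        proxy = next(rotation)
        # requests expects one entry per scheme of the outgoing request.
        proxies = {"http": proxy, "https": proxy}
        results[url] = capture(url, proxies=proxies)
    return results


# Possible usage (proxy addresses are placeholders):
#
#   import archiveis
#   capture_via_proxies(
#       ["http://www.example.com/", "https://google.com/"],
#       ["http://proxy-a.example:8080", "http://proxy-b.example:8080"],
#       archiveis.capture,
#   )
```

Given the throttling I saw on 1.1.1.1 and Warp, well-known shared proxies may be treated just as harshly, so this is only worth trying with IPs that are not already flagged.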
It would be nice to wrap this project in a queuing system that treats submissions as jobs and retries them automatically, to make it more robust and act like a "service". I haven't seen anyone doing exactly that yet.
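To make the queue idea concrete, here is a minimal in-memory sketch, not something that exists in the project. Each URL is a job; a failed job is put back on the queue after an exponentially growing pause until it runs out of attempts. The function name, max_attempts, and base_delay are all made up for illustration, and the 60-second default is a guess rather than a known limit.

```python
import time
from collections import deque


def run_capture_queue(urls, capture, max_attempts=3, base_delay=60.0,
                      sleep=time.sleep):
    """Process URLs as jobs, retrying failures with exponential backoff.

    `capture` is any callable like archiveis.capture.
    Returns (archived, failed): url -> archive URL, and url -> last exception.
    """
    queue = deque((url, 1) for url in urls)
    archived, failed = {}, {}
    while queue:
        url, attempt = queue.popleft()
        try:
            archived[url] = capture(url)
        except Exception as exc:  # e.g. HTTPError for a 429
            if attempt >= max_attempts:
                failed[url] = exc  # give up on this job
                continue
            # Wait base_delay, then 2x, 4x, ... before the job is retried.
            sleep(base_delay * 2 ** (attempt - 1))
            queue.append((url, attempt + 1))
    return archived, failed
```

A real service would want persistent storage and per-job scheduling instead of sleeping in the loop, but this shows the shape. Collecting permanent failures separately also gives you a list of the pages that never archive.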
I have also noticed that, while infrequent, there are definitely some pages which seem to crash the archive.is archiver and never succeed in being archived. Not sure if there's a way to report that to them.