cocoflan/wayback-machine-downloader

Downloading an entire website results in missing pages and content.


I am trying to recover a website that was created for a non-profit organization with WordPress. It was hosted on a third-party site, but the organization has lost its admin access and somehow broke the site. I want to restore the site as it was in January 2024, when it was still working. I ran the CLI utility to recover the website from archive.org, but it didn't download all the pages I was expecting.

I ran: wayback_machine_downloader http://sorensonlegacyfoundation.org --to 20240101. It downloads 250 files, but lots of HTML pages are still missing. For example, the entry file index.html is there, but /what-we-fund, /how-to-apply, and about 10 other pages are not.

Looking over the raw text files in VS Code, I confirmed that these pages are missing and not just nested away somewhere, by searching for text unique to each page.

Is there something I'm missing, or should I just download each page individually from archive.org?
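
One way to rule out the archive itself is to check whether the missing pages actually have captures on or before the cutoff by querying the Wayback Machine CDX API directly. A rough sketch in Python (the site and the /what-we-fund and /how-to-apply paths come from the description above; the status filter and limit are just illustrative choices):

```python
# Rough check: do the "missing" pages have any captures on or before 20240101?
# Queries the public Wayback Machine CDX API (web.archive.org/cdx/search/cdx).
import json
import urllib.parse
import urllib.request

SITE = "sorensonlegacyfoundation.org"
PATHS = ["/", "/what-we-fund", "/how-to-apply"]  # paths mentioned above

for path in PATHS:
    params = urllib.parse.urlencode({
        "url": SITE + path,
        "to": "20240101",              # same cutoff passed to --to
        "output": "json",
        "filter": "statuscode:200",    # only successful captures
        "limit": "5",                  # a few captures are enough to confirm
    })
    with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}") as resp:
        body = resp.read().decode()
    rows = json.loads(body) if body.strip() else []
    captures = rows[1:]  # first row (if any) is the CDX header row
    print(f"{SITE}{path}: {len(captures)} capture(s) on or before 20240101")
    for urlkey, timestamp, original, *rest in captures:
        print(f"  {timestamp}  {original}")
```

If a page shows zero captures before the cutoff, no downloader setting will recover it; if captures do exist, the problem is on the download side.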

@Bryson14 Hi, I created a post under "Rate limiting?" that might help with the temporary blocking from IA that you're having, so you can finish your download.
#1

Adding the network limiter worked well! But I think this only worked because I'm on Linux; it wouldn't be a good solution for Mac or PC.

I have implemented some rate limiting (see Issue #1 & PR #5) and retry (see PR #6) functionality, which should help with this.
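
For context on what that looks like in general terms (this is not the code from those PRs, just a minimal sketch of the throttle-and-retry pattern, with made-up parameter names):

```python
# Sketch of the general pattern: pause between requests (rate limiting) and
# retry failed requests with exponential backoff.
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_retries=3, throttle=1.0, backoff=2.0):
    """Fetch url; sleep `throttle` seconds first, then retry transient failures."""
    time.sleep(throttle)  # rate limiting: pause before every request
    wait = throttle
    for attempt in range(1, max_retries + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except (urllib.error.HTTPError, urllib.error.URLError) as err:
            if attempt == max_retries:
                raise
            wait *= backoff  # back off a little longer before each retry
            print(f"attempt {attempt} failed ({err}); retrying in {wait:.1f}s")
            time.sleep(wait)
```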

That said, it looks like this utility can still try to download multiple different timestamped versions of an individual page (especially when using the --to/--from options), which may cause issues related to your use case. I'm not sure whether the Wayback Machine API returns the files in chronologically sorted order, especially in descending (newest to oldest) order. I'll look into this.
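
To illustrate the concern: independent of whatever order the API returns captures in, the downloader would need to keep only the newest capture per URL that still falls on or before the --to cutoff. A sketch with made-up snapshot data, not how the tool currently works:

```python
# Deduplicate captures: keep the newest capture per URL that is still on or
# before the --to cutoff, regardless of the order the API returned them in.
# The (timestamp, url) tuples below are made up for illustration.
snapshots = [
    ("20230415103000", "http://sorensonlegacyfoundation.org/what-we-fund"),
    ("20231220081500", "http://sorensonlegacyfoundation.org/what-we-fund"),
    ("20240305120000", "http://sorensonlegacyfoundation.org/what-we-fund"),
    ("20231101090000", "http://sorensonlegacyfoundation.org/how-to-apply"),
]

TO_CUTOFF = "20240101235959"  # --to date padded to end of day; 14-digit
                              # timestamps compare correctly as strings

latest_per_url = {}
for timestamp, url in snapshots:
    if timestamp > TO_CUTOFF:
        continue  # capture is newer than the requested --to date
    best = latest_per_url.get(url)
    if best is None or timestamp > best:
        latest_per_url[url] = timestamp

for url, timestamp in latest_per_url.items():
    print(f"{timestamp}  {url}")
```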