jjjake/internetarchive

Collection 'pub_consumer-reports' error - no matching files found

kennethb22 opened this issue · 3 comments

The collection pub_consumer-reports contains 239 items that show PDFs as available for download. I manually downloaded the PDFs for a few random items in the collection without any problem. When I run

ia download --search='put_consumer-reports' --no-directories --glob=\*pdf

ia gives the following output

pub_consumer-reports (1/632):
 skipping pub_consumer-reports, no matching files found.
pub_consumer-reports-news-digest (2/632):
 skipping pub_consumer-reports-news-digest, no matching files found.
sim_consumer-reports-news-digest_1976-10-01_1_1 (3/632):
 error downloading file sim_consumer-reports-news-digest_1976-10-01_1_1.pdf, exception raised: 403 Client Error: Forbidden for url: https://ia803401.us.archive.org/34/items/sim_consumer-reports-news-digest_1976-10-01_1_1/sim_consumer-reports-news-digest_1976-10-01_1_1.pdf
sim_consumer-reports-news-digest_1976-10-15_1_2 (4/632):
 error downloading file sim_consumer-reports-news-digest_1976-10-15_1_2.pdf, exception raised: 403 Client Error: Forbidden for url: https://ia801801.us.archive.org/13/items/sim_consumer-reports-news-digest_1976-10-15_1_2/sim_consumer-reports-news-digest_1976-10-15_1_2.pdf
sim_consumer-reports-news-digest_1976-11-01_1_3 (5/632):
 error downloading file sim_consumer-reports-news-digest_1976-11-01_1_3.pdf, exception raised: 403 Client Error: Forbidden for url: https://ia601801.us.archive.org/15/items/sim_consumer-reports-news-digest_1976-11-01_1_3/sim_consumer-reports-news-digest_1976-11-01_1_3.pdf
sim_consumer-reports-news-digest_1976-11-15_1_4 (6/632):
 error downloading file sim_consumer-reports-news-digest_1976-11-15_1_4.pdf, exception raised: 403 Client Error: Forbidden for url: https://ia803402.us.archive.org/20/items/sim_consumer-reports-news-digest_1976-11-15_1_4/sim_consumer-reports-news-digest_1976-11-15_1_4.pdf
sim_consumer-reports-news-digest_1976-12-01_1_5 (7/632):

...

I was able to use ia to successfully download PDFs from the collection boardwatchmagazine:

ia download --search='boardwatchmagazine' --no-directories --glob=\*pdf

macOS 11.6.5, Python 3.7.4

maxz commented

These files are not available for public download. The items need to be borrowed and can't be downloaded as PDF. Hence the 403 Forbidden.
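To see how much of a run is affected by each cause, the two kinds of messages in the output above can be tallied. This is only a sketch: `tally_outcomes` is a hypothetical helper, not part of ia, and it assumes the message wording stays exactly as shown in this issue.

```python
from collections import Counter

# Two entries copied verbatim from the output posted in this issue.
SAMPLE_LOG = """\
pub_consumer-reports (1/632):
 skipping pub_consumer-reports, no matching files found.
sim_consumer-reports-news-digest_1976-10-01_1_1 (3/632):
 error downloading file sim_consumer-reports-news-digest_1976-10-01_1_1.pdf, exception raised: 403 Client Error: Forbidden for url: https://ia803401.us.archive.org/34/items/sim_consumer-reports-news-digest_1976-10-01_1_1/sim_consumer-reports-news-digest_1976-10-01_1_1.pdf
"""


def tally_outcomes(log_text: str) -> Counter:
    """Count skipped entries and 403 errors in captured `ia download` output."""
    counts = Counter()
    for line in log_text.splitlines():
        text = line.strip()
        if text.startswith("skipping ") and text.endswith("no matching files found."):
            # Nothing in this entry matched the glob (e.g. the entry is a collection).
            counts["no matching files"] += 1
        elif text.startswith("error downloading file ") and "403 Client Error" in text:
            # The file exists but is not publicly downloadable.
            counts["403 forbidden"] += 1
    return counts


if __name__ == "__main__":
    for outcome, count in tally_outcomes(SAMPLE_LOG).items():
        print(f"{outcome}: {count}")
```

Entries counted as "403 forbidden" are the restricted ones described above; re-running the command will not change that.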

kennethb22 commented

As I noted in the original post, there are PDFs that can be downloaded in the collection

https://archive.org/details/pub_consumer-reports

and I was able to download some of them manually. ia appears to be skipping over pub_consumer-reports

pub_consumer-reports (1/632):
 skipping pub_consumer-reports, no matching files found.

and then jumping to pub_consumer-reports-news-digest, which is not the collection I specified when running ia.

Again, I did verify that pub_consumer-reports does contain downloadable PDFs, and was able to download some manually.

It's curious that pub_consumer-reports contains 239 items and pub_consumer-reports-news-digest contains 391 items, but ia appears to be reporting that both collections contain 632 items.

maxz commented

As I noted in the original post, there are pdfs that can be downloaded in the collection

Yes, because not all of them are subject to the same restrictions.

ia appears to be skipping over pub_consumer-reports

pub_consumer-reports is a collection. Collections do not contain files. They contain items, which in turn contain files.
Your command tries to download PDF files from the collection entry itself, which has none, so it is skipped.

and then jumping to pub_consumer-reports-news-digest, which is not the collection I specified when running ia.

I'm not sure how you came to the conclusion that --search='put_consumer-reports' would be the correct way to specify a collection. It is not. The correct syntax is even stated in the documentation: --search='collection:pub_consumer-reports'

It's curious that pub_consumer-reports contains 239 items and pub_consumer_reports-news-digest contains 391 items, but ia appears to be reporting both collections contain 632 items.

You are running a search for that term and everything related to it is returned. That number is not directly related to any collections.

To remain close to your initial command, you could use ia download --search='collection:pub_consumer-reports' --no-directories --glob=\*pdf.

This should get you your desired results if I understood correctly what you are trying to do.
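The same fielded query can also be used from the internetarchive Python library. The following is a rough sketch rather than a tested recipe: `collection_query`, `ia_download_command` and `download_collection_pdfs` are hypothetical helper names, and passing `ignore_errors=True` is an assumption made so that restricted items (the 403 errors above) do not abort the run.

```python
import shlex


def collection_query(collection_id: str) -> str:
    """Build a fielded query that matches only items in the given collection."""
    return f"collection:{collection_id}"


def ia_download_command(query: str, glob_pattern: str = "*pdf") -> list:
    """Return the argument list for the `ia download` call used in this thread."""
    return [
        "ia",
        "download",
        f"--search={query}",
        "--no-directories",
        f"--glob={glob_pattern}",
    ]


def download_collection_pdfs(collection_id: str, glob_pattern: str = "*pdf") -> None:
    """Download matching files item by item (needs network access)."""
    # Imported lazily so the helpers above work without the package installed.
    from internetarchive import download, search_items

    for result in search_items(collection_query(collection_id)):
        # Each search result is a dict that includes the item identifier.
        download(
            result["identifier"],
            glob_pattern=glob_pattern,
            no_directory=True,
            ignore_errors=True,  # assumption: keep going past restricted items
        )


if __name__ == "__main__":
    # Print the equivalent shell command instead of hitting the network.
    print(shlex.join(ia_download_command(collection_query("pub_consumer-reports"))))
```

Calling `download_collection_pdfs("pub_consumer-reports")` would then fetch only from items that belong to that collection, instead of from everything a free-text search for the term returns.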