miyagawa/web-scraper

Issues with HTTPS


Here's my code:

use URI;
use Web::Scraper;
use Encode;

my $torrents = scraper {
    process 'a.tab', "torrents[]" => scraper {
      process "a", title => 'TEXT';
    };
};

my $res = $torrents->scrape( URI->new("https://ilcorsaronero.info/categoria.php?active=0&category=1&order=data&by=DESC&page=0") );

for my $torrent (@{$res->{torrents}}) {
    print Encode::encode("utf8", "$torrent->{title}\n");
}

It is a really simple scraper, but when I run it I get this error:

$ perl scraper.pl
GET https://ilcorsaronero.info/categoria.php?active=0&category=1&order=data&by=DESC&page=0 failed: 501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed) at scraper.pl line 11.

Do you have any ideas about how to solve this?
I already tried installing LWP::Protocol::https.


@phonicmouse Are you sure LWP::Protocol::https is really installed? What do you get when you run:

perl -MLWP::Protocol::https -e1

I'm on Fedora and got an error running cpanm LWP::Protocol::https, which I had to fix in this hacky way:

ln -s /usr/include/locale.h /usr/include/xlocale.h

I don't know what provided xlocale.h in the past but I don't see it around anymore.

Then I still got an error running your scraper.pl:

GET https://ilcorsaronero.info/categoria.php?active=0&category=1&order=data&by=DESC&page=0 failed: 500 Can't connect to ilcorsaronero.info:443 (certificate verify failed) at S line 11.

But that's because the ilcorsaronero.info website doesn't have its SSL certificate chain set up correctly, as noted here:

https://www.ssllabs.com/ssltest/analyze.html?d=ilcorsaronero.info

Well, actually I get this:

Can't locate LWP/Protocol/https.pm in @INC (you may need to install the LWP::Protocol::https module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .).
BEGIN failed--compilation aborted.

But even after executing cpanm LWP::Protocol::https and then also sudo cpanm LWP::Protocol::https I still get the same error.

I know about the SSL error, but the website is not mine, so I can't do anything about it. Anyway, there should be a way to bypass the SSL verification. In the meantime I tried Scrapy on Python and got what I wanted with it, but since the project I'm working on is written entirely in Perl and Mojo, I would really prefer to use this package.
Anyway, thanks for your help and time @jonjensen. I'll try on a different distro in a VM, because I think my Ubuntu installation got really dirty from installing weird things such as Haskell 😄 and all the other development tools I installed over time.

P.S.: I run a mirror of that site on one of my servers, so this will solve the second problem. However, putting the HTTP user and password in the URL doesn't work, so there should be some kind of extra setting for this.
Here is the link to the mirror test.
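One possible workaround for the HTTP user and password problem, sketched under assumptions: fetch the page yourself with an explicit Basic Authorization header and hand the resulting HTTP::Response to scrape, which Web::Scraper accepts in place of a URI. The helper name build_request is hypothetical, and the URL, user, and password are expected as command-line arguments rather than taken from this thread.

```perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;
use Web::Scraper;

# Hypothetical helper: a GET request that carries Basic credentials itself,
# so nothing has to be embedded in the URL.
sub build_request {
    my ($url, $user, $pass) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->authorization_basic($user, $pass);
    return $req;
}

my $torrents = scraper {
    process 'a.tab', 'torrents[]' => scraper {
        process 'a', title => 'TEXT';
    };
};

# Usage: perl scraper.pl URL USER PASSWORD
if (@ARGV == 3) {
    my $response = LWP::UserAgent->new->request( build_request(@ARGV) );
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;

    # scrape also accepts an HTTP::Response instead of a URI object
    my $res = $torrents->scrape($response);
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_->{title}\n" for @{ $res->{torrents} || [] };
}
```

This only covers Basic authentication; if the mirror uses a different scheme, the request would need to be built differently.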

Anyway, trying on another clean Ubuntu 16.04 installation still fails, and as far as I know there is no way to skip SSL verification.
Using another domain with a valid SSL certificate works perfectly. 🛩

PERL_LWP_SSL_VERIFY_HOSTNAME=0

Can it solve the problem?

@thunderpick Good one, but sadly it doesn't work for me.

For Ubuntu 16.04 you can run something like apt install liblwp-protocol-https-perl if the module is still not installed.

The package exists, but I can't disable SSL verification.
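For anyone landing here with the same problem, a minimal sketch of one way to turn verification off in code rather than through the environment variable. It assumes Web::Scraper's user_agent accessor can be used to swap in a custom LWP::UserAgent, and that passing SSL_verify_mode through ssl_opts is acceptable for your installed LWP and IO::Socket::SSL versions. It has not been run against the site from this thread, and disabling verification is insecure, so treat it as a stopgap for a host whose broken chain you have decided to tolerate.

```perl
use strict;
use warnings;
use URI;
use LWP::UserAgent;
use Web::Scraper;

# Insecure on purpose: both hostname matching and certificate
# validation are switched off for every request made by this agent.
my $ua = LWP::UserAgent->new(
    ssl_opts => {
        verify_hostname => 0,
        SSL_verify_mode => 0,    # 0 is IO::Socket::SSL's SSL_VERIFY_NONE
    },
);

my $torrents = scraper {
    process 'a.tab', 'torrents[]' => scraper {
        process 'a', title => 'TEXT';
    };
};

# Assumption: the scraper fetches URI objects through this agent once set.
$torrents->user_agent($ua);

# Usage: perl scraper.pl 'https://host/path'
if (my $url = shift @ARGV) {
    my $res = $torrents->scrape( URI->new($url) );
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_->{title}\n" for @{ $res->{torrents} || [] };
}
```

Setting the options on the agent object keeps the change local to this one scraper, whereas the environment variable affects every LWP request made by the process.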