
Unable to scrape nytimes pages


Reported by @nchapman via IRC.

http urls:='[""]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 191
Content-Type: application/json; charset=utf-8
User-Agent: HTTPie/0.9.1

    "urls": [

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 709
Content-Type: application/json; charset=utf-8
Date: Tue, 30 Aug 2016 19:58:20 GMT
ETag: W/"2c5-sQuf60kT/Uv0NneQ7eyZNA"

    "request_error": "",
    "url_errors": {},
    "urls": {
        "": {
            "favicon_url": "",
            "images": [],
            "original_url": "",
            "title": "Log In - The New York Times",
            "url": ""

Note how we don't get any images[] and the title seems to be a paywall-esque "Log In - The New York Times".

Trying to curl that page directly (outside of the metadata proxy) gives us an HTTP/303 redirect (and a Location header to a login page):


HTTP/1.1 303 See Other
Accept-Ranges: bytes
Age: 0
Connection: close
Date: Tue, 30 Aug 2016 20:00:03 GMT
Server: Varnish
Set-Cookie: RMID=007f0101169257c5e5c3000d;Path=/;;Expires=Wed, 30 Aug 2017 20:00:03 UTC
X-API-Version: 5-0
X-Frame-Options: DENY
X-PageType: article

Oddly, scraping the mobile version of the site seemingly works as expected:


HTTP/1.1 200 OK
Age: 94
Cache-Control: private,max-age=300,s-maxage=300
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Tue, 30 Aug 2016 20:01:51 GMT
ETag: W/"1e7a4-lV5WY4MH9VNOuMh1nqTohQ"
Server: nginx/1.0.15
Set-Cookie: nyt-a=1d22bb9f4cdeb5ac2793e447ffb7b046;path=/;;expires=Wed, 30 Aug 2017 20:01:51 UTC
Set-Cookie: RMID=007f01017b1657c5e62f000c;path=/;;expires=Wed, 30 Aug 2017 20:01:51 UTC
Set-Cookie: NYT-Loc=d;path=/;;expires=Tue, 06 Sep 2016 20:01:51 UTC
Set-Cookie: NYT-S=0MC9KINZeOiaXDXrmvxADeHFpGjxeqeeDKdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI; expires=Thu, 29-Sep-2016 20:01:51 GMT; path=/;
Transfer-Encoding: chunked
Via: 1.1 varnish
X-Frame-Options: DENY
X-Powered-By: Express
X-Varnish: 1879619983 1879617502
nnCoection: close

<!DOCTYPE html><html lang="en" class=""><head><meta

@nchapman added the following comments:

nchapman> jkerim pdehaan: all the gray boxes are nytimes
nchapman> which is to say... sites that are not nytimes are looking great!
nchapman> hey computer world snuck in there at the bottom
nchapman> but that page doesn't have a pic so it's all good

jkerim> pdehaan: i haven’t looked at it but my suspicion is not that it DOESN’T follow redirects but rather that it DOES and nytimes is redirecting us to a paywall which the scraper is parsing so i’ll have to finagle with fetch a little
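If that suspicion is right, the scraper could at least flag the result instead of returning the login page's metadata. A rough sketch of such a check, assuming the scraper knows both the URL it asked for and the URL it ended up on after redirects. `looksLikeLoginWall` and its heuristics are hypothetical, not part of the existing service:

```javascript
// Hypothetical heuristic: did we get bounced to a login page?
// requestedUrl: the URL passed to the scraper
// finalUrl:     the URL of the page actually parsed (after redirects)
// title:        the <title> text the scraper extracted
function looksLikeLoginWall(requestedUrl, finalUrl, title) {
  const requested = new URL(requestedUrl);
  const final = new URL(finalUrl);

  // Only suspicious if we ended up somewhere other than where we asked to go.
  const movedAway =
    requested.hostname !== final.hostname ||
    requested.pathname !== final.pathname;

  // Titles such as "Log In - The New York Times" or "Sign in".
  const loginTitle = /^\s*(log|sign)\s?in\b/i.test(title || '');

  // Paths such as /login or /auth/login.
  const loginPath = /\b(login|signin|auth)\b/i.test(final.pathname);

  return movedAway && (loginTitle || loginPath);
}
```

This only detects the problem; it does not get past the redirect, and the regexes are guesses that would need tuning against real sites.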

More updates from The @nchapman!

nchapman> jkerim pdehaan: this works curl --verbose --location --cookie-jar testcookiejar
nchapman> it's likely a cookies issue
nchapman> could be something else as well but that seems to fix it -- i tried a few different variations of the curl request to narrow down the params that made a difference
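For reference, `--location` makes curl follow redirects and `--cookie-jar` turns on its cookie engine, so cookies set on one hop are sent on the next. Below is a minimal sketch of the same idea in JavaScript, assuming redirects are followed by hand. All names here (`storeCookies`, `cookieHeader`, `fetchWithCookies`) are hypothetical, and `fetchFn` is an injected function assumed to resolve to `{status, location, setCookies}`; mapping a real HTTP client's response onto that shape is left out:

```javascript
// Tiny in-memory cookie jar (name -> value). It ignores Path, Domain and
// Expires, which a real cookie jar would honour.
function storeCookies(jar, setCookieHeaders) {
  for (const header of setCookieHeaders) {
    const pair = header.split(';')[0]; // "RMID=abc;Path=/" -> "RMID=abc"
    const eq = pair.indexOf('=');
    if (eq > 0) {
      jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }
  }
  return jar;
}

// Serialize the jar into a Cookie request header value.
function cookieHeader(jar) {
  return Array.from(jar, ([name, value]) => `${name}=${value}`).join('; ');
}

// Follow redirects manually so cookies from earlier hops are replayed.
// fetchFn(url, {redirect, headers}) is an assumed interface, see above.
async function fetchWithCookies(fetchFn, url, maxRedirects = 10) {
  const jar = new Map();
  let current = url;
  for (let hop = 0; hop <= maxRedirects; hop++) {
    const headers = jar.size ? { cookie: cookieHeader(jar) } : {};
    const res = await fetchFn(current, { redirect: 'manual', headers });
    storeCookies(jar, res.setCookies || []);

    const isRedirect = res.status >= 300 && res.status < 400 && res.location;
    if (!isRedirect) {
      return { url: current, response: res, jar };
    }
    // Location may be relative; resolve it against the current URL.
    current = new URL(res.location, current).href;
  }
  throw new Error(`too many redirects for ${url}`);
}
```

Whether nytimes actually needs more than the cookies from the redirect chain is untested here; this only mirrors what the curl flags do.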

Nick's story checks out, I was able to get this working w/ HTTPie as well:

$ http --session=user2 --follow -v

Not sure how to do it w/ node-fetch API yet (docs are a bit sparse), but this may be promising: