mozilla/page-metadata-service

Unable to scrape nytimes pages

Closed this issue · 1 comment

Reported by @nchapman via IRC.

http https://page-metadata-service.stage.mozaws.net/v1/metadata urls:='["http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 191
Content-Type: application/json; charset=utf-8
Host: page-metadata-service.stage.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 709
Content-Type: application/json; charset=utf-8
Date: Tue, 30 Aug 2016 19:58:20 GMT
ETag: W/"2c5-sQuf60kT/Uv0NneQ7eyZNA"

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news": {
            "favicon_url": "http://www.nytimes.com/favicon.ico",
            "images": [],
            "original_url": "http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news",
            "title": "Log In - The New York Times",
            "url": "http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news"
        }
    }
}

Note how we don't get any images[] and the title seems to be a paywall-esque "Log In - The New York Times".

Requesting that page directly (outside of the metadata proxy) returns an HTTP 303 redirect, with a Location header pointing to a login page:

http http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html\?hp\&action\=click\&pgtype\=Homepage\&clickSource\=story-heading\&module\=photo-spot-region\&region\=top-news\&WT.nav\=top-news

HTTP/1.1 303 See Other
Accept-Ranges: bytes
Age: 0
Connection: close
Date: Tue, 30 Aug 2016 20:00:03 GMT
Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2016%2F08%2F30%2Fmovies%2Fgene-wilder-dead.html%3Fhp%26action%3Dclick%26pgtype%3DHomepage%26clickSource%3Dstory-heading%26module%3Dphoto-spot-region%26region%3Dtop-news%26WT.nav%3Dtop-news%26_r%3D0
Server: Varnish
Set-Cookie: RMID=007f0101169257c5e5c3000d;Path=/; Domain=.nytimes.com;Expires=Wed, 30 Aug 2017 20:00:03 UTC
X-API-Version: 5-0
X-Frame-Options: DENY
X-PageType: article

Oddly, scraping the mobile version of the site seemingly works as expected:

http http://mobile.nytimes.com/2016/08/30/movies/gene-wilder-dead.html\?hp\&action\=click\&pgtype\=Homepage\&clickSource\=story-heading\&module\=photo-spot-region\&region\=top-news\&WT.nav\=top-news

HTTP/1.1 200 OK
Age: 94
Cache-Control: private,max-age=300,s-maxage=300
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
Date: Tue, 30 Aug 2016 20:01:51 GMT
ETag: W/"1e7a4-lV5WY4MH9VNOuMh1nqTohQ"
NYT-disable-for-perf-key:
Server: nginx/1.0.15
Set-Cookie: nyt-a=1d22bb9f4cdeb5ac2793e447ffb7b046;path=/;domain=.nytimes.com;expires=Wed, 30 Aug 2017 20:01:51 UTC
Set-Cookie: RMID=007f01017b1657c5e62f000c;path=/;domain=.nytimes.com;expires=Wed, 30 Aug 2017 20:01:51 UTC
Set-Cookie: NYT-Loc=d;path=/;domain=.nytimes.com;expires=Tue, 06 Sep 2016 20:01:51 UTC
Set-Cookie: NYT-S=0MC9KINZeOiaXDXrmvxADeHFpGjxeqeeDKdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI; expires=Thu, 29-Sep-2016 20:01:51 GMT; path=/; domain=.nytimes.com
Transfer-Encoding: chunked
Via: 1.1 varnish
X-Frame-Options: DENY
X-Powered-By: Express
X-Varnish: 1879619983 1879617502
nnCoection: close

<!DOCTYPE html><html lang="en" class=""><head><meta
...

@nchapman added the following comments:

nchapman> jkerim pdehaan: all the gray boxes are nytimes https://cl.ly/1Q1i2V3q0d42
nchapman> which is to say... sites that are not nytimes are looking great!
nchapman> hey computer world snuck in there at the bottom
nchapman> but that page doesn't have a pic so it's all good

jkerim> pdehaan: i haven’t looked at it but my suspicion is not that it DOESN’T follow redirects but rather that it DOES and nytimes is redirecting us to a paywall which the scraper is parsing so i’ll have to finagle with fetch a little

More updates from The @nchapman!

nchapman> jkerim pdehaan: this works curl --verbose --location --cookie-jar testcookiejar http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html
nchapman> it's likely a cookies issue
nchapman> could be something else as well but that seems to fix it -- i tried a few different variations of the curl request to narrow down the params that made a difference


Nick's story checks out; I was able to get this working w/ HTTPie as well:

$ http --session=user2 http://www.nytimes.com/2016/08/30/movies/gene-wilder-dead.html --follow -v

Not sure how to do it w/ node-fetch API yet (docs are a bit sparse), but this may be promising: https://www.npmjs.com/package/fetch-cookie
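
To make the cookie hypothesis concrete, here is a rough sketch of what the curl `--location --cookie-jar` and HTTPie `--session --follow` runs above seem to be doing: follow each redirect by hand and replay any cookies received so far. This is not the service's code. `fetchWithCookies` and `readSetCookies` are hypothetical names, `fetchImpl` is assumed to be any fetch-compatible function (node-fetch, for example) that returns the real 3xx response when called with `redirect: 'manual'`, and the jar is deliberately naive: it ignores Domain, Path and Expires.

```javascript
// Minimal sketch: follow redirects manually and resend cookies on each hop.
// Assumptions (not from the report): fetchImpl is fetch-compatible, and the
// jar is a plain name -> value map with no Domain/Path/Expires handling.

function readSetCookies(headers) {
  // Recent WHATWG Headers implementations expose getSetCookie();
  // node-fetch v2 exposes raw()['set-cookie'] instead.
  if (typeof headers.getSetCookie === 'function') return headers.getSetCookie();
  if (typeof headers.raw === 'function') return headers.raw()['set-cookie'] || [];
  return [];
}

async function fetchWithCookies(url, fetchImpl, maxRedirects = 5) {
  const jar = new Map();
  let current = url;

  for (let hop = 0; hop <= maxRedirects; hop++) {
    // Serialize everything collected so far into a single Cookie header.
    const cookie = [...jar].map(([k, v]) => `${k}=${v}`).join('; ');
    const res = await fetchImpl(current, {
      redirect: 'manual', // we follow redirects ourselves
      headers: cookie ? { cookie } : {},
    });

    // Keep only the leading name=value pair of each Set-Cookie line.
    for (const line of readSetCookies(res.headers)) {
      const pair = line.split(';')[0];
      const eq = pair.indexOf('=');
      if (eq > 0) jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }

    const location = res.headers.get('location');
    if (res.status >= 300 && res.status < 400 && location) {
      // Location may be relative, so resolve it against the current URL.
      current = new URL(location, current).toString();
      continue;
    }
    return { response: res, finalUrl: current, cookies: jar };
  }
  throw new Error(`Too many redirects (> ${maxRedirects})`);
}
```

Something like `fetchWithCookies(articleUrl, require('node-fetch'))` would be the intended call. A package such as fetch-cookie is meant to take over this bookkeeping with a proper cookie jar, though whether it covers the nytimes redirect chain has not been verified here.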