Goose fails in extracting articles from The New York Times

Question

Closed this issue 9 years ago · 5 comments

The following code reproduces the problem:

import urllib2
import goose
url = "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html?_r=0"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
a = g.extract(raw_html=raw_html)
a.cleaned_text  # evaluates to u''

An empty string is returned.

Answer 1 · 2015-12-22T17:03:37.000Z

Your URL string looks malformed, but in any case I fixed a similar issue in #225 and submitted a pull request, though I don't think it was ever approved. My fork has this issue fixed, and I've been parsing NYTimes happily ever since.

Good luck!

Rob

Answer 2 · 2015-12-22T17:05:00.000Z

@robmcdan The PR has not been accepted because it breaks the test suite.

Answer 3 · 2015-12-22T17:56:40.000Z

@robmcdan @grangier I just bypassed goose and wrote a small snippet using BeautifulSoup and urllib2 that works.
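The snippet itself was not posted, so the following is only a rough sketch of that kind of workaround, not the commenter's code. It deliberately swaps in Python 3's standard-library `html.parser` and `urllib.request` in place of BeautifulSoup and urllib2 so that it runs without extra dependencies. It also assumes the article body lives in plain `<p>` elements, which may not hold for every page. The names `ParagraphExtractor`, `extract_paragraph_text`, and `fetch_html` are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element in a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraph strings
        self._in_p = False    # True while inside a <p> element
        self._chunks = []     # text pieces of the current paragraph

    def _flush(self):
        # Join the collected pieces and normalize runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()  # keep text from a trailing, unclosed <p>


def extract_paragraph_text(raw_html):
    """Return the text of all <p> elements, separated by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


def fetch_html(url):
    """Download a page with cookie handling, as in the original report."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_paragraph_text(fetch_html(url))` with the URL from the question would be the intended use. Whether the result contains only the article body depends on the page's markup, since navigation or footer text wrapped in `<p>` tags would be picked up as well.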

Answer 4 · 2015-12-23T01:08:46.000Z

I'm curious: was this issue closed because there is a workaround involving some external manipulation?

Answer 5 · 2015-12-23T10:40:38.000Z

@richardpetithory No, I closed this issue because @robmcdan already has an issue open for this exact problem (#225), so I don't think there is a need for two open requests.