Goose fails to extract articles from The New York Times
Closed this issue · 5 comments
Running the following code:
import urllib2
import goose
url = "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html?_r=0"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
a = g.extract(raw_html=raw_html)
a.cleaned_text
u''
An empty string is returned.
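One way to narrow this down is to check whether the downloaded HTML contains any paragraph text at all before it reaches Goose. The following is a rough sketch using only the standard library, not part of Goose; the names `ParagraphCounter` and `summarize_html` are made up for illustration, and counting `<p>` tags is only a crude proxy for "the article body is present".

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the snippet above


class ParagraphCounter(HTMLParser):
    """Hypothetical helper: counts <p> tags and the text characters inside them."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0        # how many <p> elements are currently open
        self.paragraphs = 0   # number of <p> start tags seen
        self.text_chars = 0   # stripped text length found inside <p> elements

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.text_chars += len(data.strip())


def summarize_html(raw_html):
    """Return a rough summary of how much paragraph text the HTML contains."""
    # On Python 3, response.read() yields bytes, so decode before parsing.
    if isinstance(raw_html, bytes) and not isinstance(raw_html, str):
        raw_html = raw_html.decode("utf-8", "replace")
    counter = ParagraphCounter()
    counter.feed(raw_html)
    counter.close()
    return {"paragraphs": counter.paragraphs, "text_chars": counter.text_chars}
```

Calling `summarize_html(raw_html)` on the response above would give a hint: counts near zero suggest the fetch did not return the article page, while a healthy amount of paragraph text suggests the problem lies in the extraction step.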
Your URL string looks malformed, but in any case I fixed a similar issue in #225 and submitted a pull request, though I don't think it was ever approved. My fork has the fix, and I've been parsing NYTimes happily ever since.
Good luck!
Rob
I'm curious: was this issue closed because there is a workaround involving some external manipulation?
@richardpetithory No, I closed this issue because @robmcdan already has an issue open for this exact problem (#225), so I don't think there is a need for two open requests.