Readability of MSN articles
rpdelaney opened this issue · 0 comments
I'm struggling to get this working with MSN news articles. Here's the approach I'm using:
import requests
from bs4 import BeautifulSoup as bs
from readability import Document

# is_url() is a small helper that is not shown here.


def fetch_url(url: str, timeout: int = 10) -> str:
    """Get the content from a page at URL, if it is a URL."""
    if not is_url(url):
        return url

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = bs(response.content, "html.parser")
    return soup.get_text()


def summarize(content: str) -> str:
    """Take content and use readability to return a document summary."""
    doc = Document(content)
    title: str = doc.short_title()
    summary: str = bs(doc.summary(), "lxml").text
    return f"{title}\n{summary}"
This works well on all the other news sites I've tried, but with MSN it's different.
Example: with this URL, I only get MSN for a title, and the summary is empty.
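One check that might help narrow this down: a title of just MSN with an empty summary would be consistent with the server returning a nearly empty HTML shell whose article body is filled in later by JavaScript. That is only a guess, not something confirmed here. The sketch below uses only the standard library to measure how much visible text the raw HTML contains. The names `visible_text` and `looks_like_js_shell` and the 200-character threshold are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes that sit outside script/style-like tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a reader would see without running any JavaScript."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_js_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic (assumed threshold): very little visible text suggests the
    content is rendered client-side and never reaches an HTML extractor."""
    return len(visible_text(html)) < min_chars
```

Running `looks_like_js_shell(response.text)` on the raw response, before any call to `get_text()`, would show whether there is any article text in the HTML to extract in the first place. If it returns `True`, the problem is probably upstream of the summarizer.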
Any suggestions?