buriy/python-readability

Readability of MSN articles

rpdelaney opened this issue · 0 comments

I'm struggling to get this working with MSN news articles. Here's the approach I'm using:

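import requests
from bs4 import BeautifulSoup as bs
from readability import Document

# is_url is a small helper defined elsewhere (not shown here).


def fetch_url(url: str, timeout: int = 10) -> str:
    """Get the content from a page at URL, if it is a URL."""
    if not is_url(url):
        # Not a URL: treat the argument as the content itself.
        return url

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = bs(response.content, "html.parser")

    # Strip all markup and return only the page's text.
    return soup.get_text()


def summarize(content: str) -> str:
    """Take content and use readability to return a document summary."""
    doc = Document(content)

    title: str = doc.short_title()
    # doc.summary() returns HTML, so reduce it to plain text.
    summary: str = bs(doc.summary(), "lxml").text

    return f"{title}\n{summary}"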

This works well on every other news site I've tried, but not on MSN.

Example: with this URL, I get only "MSN" as the title, and the summary is empty.
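
One possible cause, which is a guess and not something confirmed here, is that the article body is simply absent from the HTML that `requests` receives (for example, if the page fills it in with JavaScript after loading). A rough check using only the standard library is to measure how much text sits inside `<p>` elements of the fetched markup. `paragraph_text_length` is a hypothetical helper, and it assumes reasonably well-formed HTML with closed `<p>` tags.

```python
from html.parser import HTMLParser


class _ParagraphCounter(HTMLParser):
    """Accumulate the length of text found inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # how many <p> tags are currently open
        self.length = 0  # total characters of paragraph text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Count only text that appears while a <p> is open.
        if self._depth > 0:
            self.length += len(data.strip())


def paragraph_text_length(html: str) -> int:
    """Return the number of characters of text inside <p> tags."""
    parser = _ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.length
```

If this returns a very small number for the MSN page but a large one for pages that summarize correctly, the article text is probably not in the static HTML at all, and readability would have nothing to extract.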

Any suggestions?