swan

An implementation of the Goose HTML Content / Article Extractor algorithm in Go.

Swan extracts cleaned-up text and HTML content from almost any web page by stripping out the extraneous markup that surrounds the main content on so many pages these days.

See the Go documentation page (GoDoc) for full usage and examples.


Features

  • Main content extraction from almost any source
  • Extract HTML content with images
  • Get article metadata, publish dates, and a lot more
  • Recognize different content types and apply type-specific extraction (currently only comic sites and regular sites are recognized)

Planned

  • Inline videos into HTML content when found in an article
  • Recognize news sources and extract corresponding video / audio content
  • Recognize and extract more types of content
  • An interesting idea: see the comment on buriy/python-readability#57