/sitefeed

Extracts RSS feed from websites that don't support it.

Primary LanguagePython

SiteFeed

Extracts RSS1 feed from websites that don't support it.

Installation

pip install sitefeed@git+https://github.com/ivofrolov/sitefeed.git

Usage

Create a TOML configuration file of fhe following structure.

[feed.example]                         # <example> is a feed name
start_url = "https://www.example.com/" # web page containing links to articles
link_extractor = {
    allow = "/articles/",              # regexp that article links must match (not required)
    restrict_css = ".links",           # css selector of a region links should be extracted from (not required)
}
article_extractor = {
    min_text_length = 25,              # higher value leads to more precise detection of longer texts (not required)
    negative_keywords = ["footer"],    # patterns in classes and ids that decrease content candidates score (not required)
}
output = "example.xml"                 # path to store feed

[feed.another]
# ...

And start a crawling process.

sitefeed -c config.toml -o feeds

GitHub Action

You can also fork this repository and get your own feeds hosted by GitHub pages. The repo contains scheduled action that updates feeds every day.

Footnotes

  1. Atom actually