Extracts an RSS [1] feed from websites that don't provide one.
pip install sitefeed@git+https://github.com/ivofrolov/sitefeed.git
Create a TOML configuration file with the following structure.
[feed.example] # <example> is a feed name
start_url = "https://www.example.com/" # web page containing links to articles
link_extractor.allow = "/articles/" # regexp that article links must match (not required)
link_extractor.restrict_css = ".links" # CSS selector of the region links are extracted from (not required)
article_extractor.min_text_length = 25 # higher values lead to more precise detection of longer texts (not required)
article_extractor.negative_keywords = ["footer"] # patterns in classes and ids that lower a content candidate's score (not required)
output = "example.xml" # path to store feed
[feed.another]
# ...
Then start the crawling process.
sitefeed -c config.toml -o feeds
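Once the crawl finishes, a quick way to inspect the result is to list the entries of a generated feed. This is a rough sketch that assumes the output is a standard Atom document; `list_entries` and the `SAMPLE` feed are hypothetical and only stand in for a real generated file.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (RFC 4287).
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written stand-in for a generated feed; real output will contain more fields.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example</title>
  <entry>
    <title>First article</title>
    <link href="https://www.example.com/articles/1"/>
  </entry>
</feed>
"""


def list_entries(xml_text):
    """Return a list of (title, link) pairs for every entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = entry.find("atom:link", ATOM)
        href = link.get("href", "") if link is not None else ""
        entries.append((title, href))
    return entries


for title, href in list_entries(SAMPLE):
    print(f"{title}: {href}")
```

To inspect an actual feed, read the generated file (presumably something like `feeds/example.xml`, depending on the `-o` and `output` settings) and pass its text to `list_entries`.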
You can also fork this repository and have your own feeds hosted on GitHub Pages. The repo contains a scheduled action that updates the feeds every day.
Footnotes
[1] Atom, actually.