/gpipe43

A full-text RSS generator that can be hosted on Google App Engine

Primary language: Python. License: MIT.

gpipe43 is a full-text RSS generator that can be hosted on Google App Engine. It uses regular expressions to find and format the full text of an article, or any other content you want.
Inspired by Yahoo Pipes and Feed43.
Yahoo Pipes, RIP.

Features

  • Supports multi-page articles.
  • Displays all images of an article's gallery.
  • Can append an article's comments.

Prepare

Simple quickstart

Edit /main/user_agents.py

  • Add your User-Agent (UA) strings.

Edit config.py

  • prjname: Name of your project on App Engine
  • bucket_name: Name of the Cloud Storage bucket
  • subdir4bg: The crawler runs under: http://[prjname].appspot.com/[subdir4bg]/[rssname]
  • subdir4rss: This is your RSS site: http://[prjname].appspot.com/[subdir4rss]/[rssname]
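
To make the two URL patterns concrete, here is a minimal sketch of how the settings combine. All values (`my-gpipe`, `my-gpipe-bucket`, `bg`, `rss`, `example`) are placeholders, and the helper `build_urls` is hypothetical; it is not part of gpipe43.

```python
# Placeholder values; replace them with your own settings from config.py.
prjname = "my-gpipe"              # hypothetical App Engine project name
bucket_name = "my-gpipe-bucket"   # hypothetical bucket name
subdir4bg = "bg"                  # path segment for the crawler
subdir4rss = "rss"                # path segment for the generated feeds


def build_urls(rssname):
    """Return (crawler_url, feed_url) for one feed name (illustrative helper)."""
    base = "http://{0}.appspot.com".format(prjname)
    crawler_url = "{0}/{1}/{2}".format(base, subdir4bg, rssname)
    feed_url = "{0}/{1}/{2}".format(base, subdir4rss, rssname)
    return crawler_url, feed_url


crawler_url, feed_url = build_urls("example")
print(crawler_url)  # http://my-gpipe.appspot.com/bg/example
print(feed_url)     # http://my-gpipe.appspot.com/rss/example
```

Using short, distinct values for subdir4bg and subdir4rss keeps the crawler URL and the public feed URL easy to tell apart.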

Edit example.py, replacing 'example' with your own RSS feed's name

  • rssname: Name of the RSS feed.
  • siteurl: The website or RSS feed from which you want to generate a full-text RSS feed.
  • reg4site: Regex that finds the articles' URLs. Leave blank if siteurl is a feed.
  • reg4title: Regex for the title of an article. Leave blank if siteurl is a feed.
  • reg4pubdate: Regex for the publication date of an article. Leave blank if siteurl is a feed. The matched date must contain '%Y-%m-%d'; otherwise leave it blank.
  • reg4text: Regex for the main body of an article.
  • reg4comment: Regex for comments. Optional; can be left blank. You can also use this regex to find all the images of a gallery in the article.
  • reg4nextpage: Regex for the article's next page, if there is more than one page.
  • Anzahl: How many articles to generate. If there is more than one siteurl, the limit applies to EACH siteurl separately, not to the total of all article URLs from all siteurls. 0 = no limit.
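
As a rough illustration of what such patterns can look like, the sketch below applies three regexes to a made-up index page with Python's `re` module. The sample HTML and the patterns are invented, and the use of one capturing group per pattern is an assumption; check example.py for the form gpipe43 actually expects.

```python
import re
from datetime import datetime

# Made-up index page; real sites will need different patterns.
sample_html = """
<div class="post"><a href="/news/101.html">First post</a><span>2020-01-15</span></div>
<div class="post"><a href="/news/102.html">Second post</a><span>2020-01-16</span></div>
"""

# Hypothetical patterns, each with a single capturing group.
reg4site = r'<a href="(/news/\d+\.html)">'
reg4title = r'<a href="/news/\d+\.html">(.*?)</a>'
reg4pubdate = r'<span>(\d{4}-\d{2}-\d{2})</span>'

urls = re.findall(reg4site, sample_html)
titles = re.findall(reg4title, sample_html)
# The date text must be parseable with '%Y-%m-%d'.
dates = [datetime.strptime(d, "%Y-%m-%d")
         for d in re.findall(reg4pubdate, sample_html)]

Anzahl = 1  # keep only the first article; 0 would mean no limit
if Anzahl:
    urls, titles, dates = urls[:Anzahl], titles[:Anzahl], dates[:Anzahl]

for url, title, date in zip(urls, titles, dates):
    print("{0} | {1} | {2}".format(url, title, date.date()))
```

Testing patterns against a saved copy of the page in this way is usually quicker than redeploying the crawler after every change.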

  • *encoding: Optional. chardet usually detects the right encoding, but sometimes it fails (for example, it may recognize gb18030 as gb2312). Decoding therefore uses the 'replace' option to avoid illegal characters, which leaves replacement characters in the generated feed. To prevent this, you can specify the encoding of the website. It only affects the main text.
  • rssgen.ausfuehren('use_urllib/use_urlfetch', 'st/mt', siteurl, reg4site, reg4title, reg4pubdate, reg4text, reg4comment, reg4nextpage, Anzahl): Generate an RSS feed from a website.
  • feed_fulltext.ausfuehren('use_urllib/use_urlfetch', siteurl, reg4nextpage, reg4text, reg4comment, Anzahl, rssname): Use this to generate full text from an RSS feed.
    • use_urllib: Use urllib2, with UA
    • use_urlfetch: Use urlfetch, no UA
    • mt: Multi threading
    • st: Single threading
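
The encoding problem described above can be reproduced in isolation. The sketch below is not gpipe43's code; it only shows, with the standard `decode` method, how a wrong codec combined with the 'replace' option produces replacement characters, and how naming the correct encoding avoids them. The sample character is an arbitrary choice that exists in gb18030 but not in gb2312.

```python
# U+9555 can be encoded in gb18030 but is not part of gb2312.
text = u"\u9555"
raw = text.encode("gb18030")

# Simulates a wrong detection: decoding as gb2312 with 'replace'
# does not raise, but leaves U+FFFD in the result.
wrong = raw.decode("gb2312", "replace")

# Specifying the real encoding recovers the original text.
right = raw.decode("gb18030")

print(u"\ufffd" in wrong)  # True
print(right == text)       # True
```

If U+FFFD characters show up in a generated feed, setting the encoding option explicitly for that site is the first thing to try.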

Edit feed_list.py

  • Replace 'example' with your own RSS feed's name

app.yaml, cron.yaml

Optional

  • Edit ./main/Vorlage.xml and Vorlage_Error.xml to fill in the 'generator', 'webMaster' and 'copyright' elements.
  • If you only want to format an existing feed, see example_02.py, then add the url and script to app.yaml. You do not need to add it to feed_list.py or cron.yaml, because the feed is not saved in Cloud Storage.

Test

dev_appserver.py [PATH_TO_YOUR_APP]/app.yaml

Start the crawler: http://localhost:8080/[subdir4bg]/[rssname]
When it is done, check your RSS feed here: http://localhost:8080/[subdir4rss]/[rssname]

See official guide: Using the Local Development Server

Upload to App Engine

  • cd to the directory of your project

gcloud config set project PROJECT_NAME
gcloud app deploy app.yaml cron.yaml --version=VERSION_NUMBER

See official guide: Deploying a Python App

Examples