pythonhacker/harvestman-crawler

RSS Integration

GoogleCodeExporter opened this issue · 4 comments

Provide RSS integration feature to the crawler.

RSS integration will allow for:

 1. Triggering start/restart of website crawling/indexing based on RSS feed
    updates (sketched below).
 2. Implementing an RSS crawler/indexer that can fetch information from RSS
    feeds in a highly customized fashion.
 3. A first step toward Web 2.0 integration.

Original issue reported on code.google.com by abpil...@gmail.com on 25 Jun 2008 at 12:31
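As a concrete sketch of use case 1 (not part of the original issue): poll a feed and re-trigger a crawl whenever a new entry appears. This assumes the third-party feedparser library; start_crawl() is a hypothetical placeholder for the crawler's entry point, not actual HarvestMan API.

```python
# Hedged sketch: feedparser is a third-party library (Universal Feed Parser);
# start_crawl() is a hypothetical hook, not actual HarvestMan API.
import time
import feedparser

def start_crawl(url):
    print("would (re)start crawl of:", url)  # placeholder for the real crawler

def poll_feed(feed_url, interval=300):
    """Re-crawl the feed's site whenever the feed reports a new entry."""
    last_seen = None
    while True:
        feed = feedparser.parse(feed_url)
        if feed.entries:
            # Use the newest entry's id (falling back to its link) as a marker.
            newest = feed.entries[0].get("id") or feed.entries[0].get("link")
            if newest != last_seen:
                last_seen = newest
                start_crawl(feed.feed.get("link", feed_url))
        time.sleep(interval)
```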

So what advantage will an RSS feed have over a regular website crawl?
Will all feeds be saved as one file, and then updated consecutively as new
feed entries arrive?

2. What kind of customization would it be?


Original comment by szybal...@gmail.com on 3 Jul 2008 at 2:43

Original comment by abpil...@gmail.com on 7 Jul 2008 at 8:16

  • Changed state: Started
I'm not an expert on RSS, but I think:
1. RSS crawling should be able to crawl the links in a feed, and treat them
the same way we treat href=".." in HTML files for further processing (see
the sketch after this comment).
2. It should be able to read RSS 1.0, RSS 2.0, and Atom.
3. Do we save the RSS feeds as .rss files or as .xml?

Original comment by szybal...@gmail.com on 9 Jul 2008 at 2:50
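Regarding points 1 and 2 above, a small sketch (not from the thread): feedparser normalizes RSS 1.0, RSS 2.0, and Atom into one entry structure, so a single extractor can pull entry links out of a feed and hand them to the crawler the same way href="..." links are handled. enqueue_url() is a hypothetical stand-in for the crawler's URL queue.

```python
# Hedged sketch: feedparser parses RSS 1.0/2.0 and Atom alike;
# enqueue_url() is a hypothetical stand-in for the crawler's URL queue.
import feedparser

def enqueue_url(url):
    print("queued for crawling:", url)  # placeholder

def extract_feed_links(feed_url):
    """Pull entry links out of a feed, like href="..." out of HTML."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        # An Atom entry may carry several <link> elements; RSS usually one.
        for link in entry.get("links", []):
            if link.get("href"):
                enqueue_url(link["href"])
```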

Some thoughts:
1. RSS can enable incremental crawls. I think that requires you to keep
state, though? If you're crawling a blog, for instance, you can find all new
posts since the last crawl. However, RSS typically won't tell you what has
changed, only what the last "n" updates on the blog were. (A sketch of
keeping such state follows at the end of this thread.)

2. Maybe (1) could imply auto-generation of link-following rules, to allow
for incremental crawls only of those links which are "related" to the
changed links? For instance, only crawl new blog messages, and replies to
those messages?

Also, what is meant by "first step to Web 2.0 integration"? Is there some
grand plan?

thanks!

Original comment by vijay...@gmail.com on 9 Jul 2008 at 10:51
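On the state-keeping question in point 1 of the comment above, a sketch under stated assumptions (feedparser for parsing; the pickle file path and crawl_url() hook are hypothetical): persist the set of entry ids already seen, and on each poll crawl only entries that are new. Note the limitation from the comment still applies: a feed only shows the last "n" entries, so anything that scrolled out of the feed between polls is missed.

```python
# Hedged sketch: dedupe feed entries by GUID across runs to get
# incremental crawls. STATE_FILE and crawl_url() are hypothetical.
import os
import pickle
import feedparser

STATE_FILE = "seen_entries.pickle"

def crawl_url(url):
    print("crawling new entry:", url)  # placeholder

def incremental_crawl(feed_url):
    # Load the set of entry ids seen on previous runs, if any.
    seen = set()
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            seen = pickle.load(f)
    # Crawl only entries not seen before, keyed by GUID (fallback: link).
    for entry in feedparser.parse(feed_url).entries:
        guid = entry.get("id") or entry.get("link")
        if guid and guid not in seen:
            seen.add(guid)
            crawl_url(entry.get("link"))
    # Persist the updated state for the next run.
    with open(STATE_FILE, "wb") as f:
        pickle.dump(seen, f)
```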