RSS Integration
GoogleCodeExporter opened this issue · 4 comments
GoogleCodeExporter commented
Provide RSS integration feature to the crawler.
RSS integration will allow for:
1. A trigger to start/restart website crawling/indexing based on RSS feed updates.
2. An RSS crawler/indexer that can fetch information from RSS feeds in a highly customized fashion.
3. A first step towards Web 2.0 integration.
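For point (1), one minimal sketch of how a feed poller could trigger a recrawl, using only Python's standard library. All names here (`new_item_ids`, the inline feed snapshot) are illustrative, not part of any existing crawler API:

```python
# Sketch: poll an RSS 2.0 feed and trigger a recrawl when new items appear.
# new_item_ids and the FEED snapshot are hypothetical, for illustration only.
import xml.etree.ElementTree as ET

def new_item_ids(feed_xml, seen_ids):
    """Return IDs of items in feed_xml that are not in seen_ids."""
    root = ET.fromstring(feed_xml)
    ids = []
    for item in root.iter("item"):
        # Prefer <guid>; fall back to <link> if the feed omits GUIDs.
        guid = item.findtext("guid") or item.findtext("link")
        if guid and guid not in seen_ids:
            ids.append(guid)
    return ids

# Example feed snapshot; in practice this would be fetched over HTTP.
FEED = """<rss version="2.0"><channel><title>Blog</title>
<item><guid>post-1</guid><link>http://example.com/1</link></item>
<item><guid>post-2</guid><link>http://example.com/2</link></item>
</channel></rss>"""

seen = {"post-1"}                       # IDs crawled on a previous run
fresh = new_item_ids(FEED, seen)
if fresh:
    # A real crawler would (re)start indexing here instead of printing.
    print("trigger recrawl for:", fresh)
```

The poller would run on a timer; an empty `fresh` list means the feed is unchanged and no crawl is started.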
Original issue reported on code.google.com by abpil...@gmail.com
on 25 Jun 2008 at 12:31
GoogleCodeExporter commented
So what advantage will an RSS feed have over a regular website crawl?
1. Will all feeds be saved as one file, then updated consecutively as new feeds arrive?
2. What kind of customization would it be?
Original comment by szybal...@gmail.com
on 3 Jul 2008 at 2:43
GoogleCodeExporter commented
Original comment by abpil...@gmail.com
on 7 Jul 2008 at 8:16
- Changed state: Started
GoogleCodeExporter commented
I'm not the expert on RSS, but I think:
1. RSS crawling should be able to crawl the links in a feed, and treat them the same way we treat href=".." in HTML files for further processing.
2. It should be able to read RSS 1.0, RSS 2.0, and Atom.
3. Do we save the RSS feeds as .rss files or as .xml?
Original comment by szybal...@gmail.com
on 9 Jul 2008 at 2:50
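The link-extraction step described in point (1) could be sketched as follows, again with the standard library only. RSS 2.0 (and RSS 1.0) put the URL in the text of a `<link>` element, while Atom uses a `href` attribute; the sketch below strips namespaces rather than handling each format separately, and `feed_links` is a hypothetical helper, not an existing API:

```python
# Sketch: pull outgoing links from a feed so they can be queued for
# crawling the same way href=".." links from HTML are. feed_links is
# illustrative; namespace handling is deliberately simplified.
import xml.etree.ElementTree as ET

def feed_links(feed_xml):
    root = ET.fromstring(feed_xml)
    links = []
    for el in root.iter():
        tag = el.tag.split("}")[-1]        # strip any XML namespace prefix
        if tag != "link":
            continue
        # Atom: <link href="..."/>; RSS 1.0/2.0: <link>...</link>
        url = el.get("href") or (el.text or "").strip()
        if url.startswith("http"):
            links.append(url)
    return links

RSS2 = """<rss version="2.0"><channel>
<item><link>http://example.com/a</link></item>
</channel></rss>"""

ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
<entry><link href="http://example.com/b"/></entry>
</feed>"""

print(feed_links(RSS2) + feed_links(ATOM))
```

The returned URLs would then go into the same fetch queue as links scraped from HTML pages.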
GoogleCodeExporter commented
Some thoughts:
1. RSS can enable incremental crawls. I think that requires you to keep state, though? If you're crawling a blog, for instance, you can find all new posts since the last crawl - however, RSS typically won't tell you what has changed, only what the last "n" updates on the blog were.
2. Maybe (1) could imply auto-generation of link-following rules, to allow incremental crawls only of those links which are "related" to the changed links? For instance, only crawl new blog messages, and replies to those messages?
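The state-keeping mentioned in point (1) could amount to persisting the set of already-seen entry IDs between runs, so each run crawls only what is new. A minimal sketch, in which the state-file path and all helper names are hypothetical:

```python
# Sketch: persist seen feed-entry GUIDs so the next run crawls only new
# ones. STATE_FILE and the helpers are illustrative, not a real API.
import json
import os

STATE_FILE = "rss_seen.json"   # hypothetical location for crawl state

def load_seen(path=STATE_FILE):
    """Load the set of entry IDs crawled on previous runs."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_seen(seen, path=STATE_FILE):
    """Persist the updated set of crawled entry IDs."""
    with open(path, "w") as f:
        json.dump(sorted(seen), f)

def select_new(entry_ids, seen):
    """Keep only entries not crawled before. Since RSS exposes just the
    last 'n' items, anything that fell off the feed between runs is
    simply never seen - a real implementation would have to accept that."""
    return [e for e in entry_ids if e not in seen]

# Demo: pretend "post-1" was crawled on an earlier run.
seen = {"post-1"}
new = select_new(["post-1", "post-2"], seen)
save_seen(seen | set(new))
```

As noted above, this only narrows the crawl to new entries; detecting *changes* to old entries would need something else, e.g. comparing `pubDate` or content hashes.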
Also, what is meant by "first step to Web 2.0 integration"? Is there some grand
plan?
thanks!
Original comment by vijay...@gmail.com
on 9 Jul 2008 at 10:51