/hackernews-scraper

Scrape hacker news comments and posts

Primary LanguagePythonBSD 2-Clause "Simplified" LicenseBSD-2-Clause

hackernews-scraper

Scrape hacker news comments and posts using the Algolia API.

Usage

from hackernews_scraper import CommentScraper

CommentScraper.getComments(since=1394039447)

The above will return a generator that will yield one comment at a time. It will keep on going until there are no more comments to fetch, or until it reaches the 50 pages limit set by hacker news. In the latter case, a TooManyItemsException will be raised.

If the hacker news API response is missing any required fields, the scraper will raise KeyError.

Response format

Comments:

{
 'author': u'dhmholley',
 'comment_id': u'7531026',
 'comment_text': u'Are people still blowing this whistle?...',
 'created_at': u'2014-04-04T12:57:38.000Z',
 'parent_id': 7530853,
 'points': 1,
 'story_id': None,
 'story_title': None,
 'story_url': None,
 'timestamp': 1396616258,
 'title': None,
 'url': None
}

Stories:

{
 'author': u'sethco',
 'created_at': u'2014-04-04T12:56:23.000Z',
 'objectID': None,
 'points': 1,
 'story_text': 1,
 'timestamp': 1396616183,
 'title': u'Opower IPO today',
 'url': u'http://www.businesswire.com/news/home/20140403006541/en#.Uz4cbq1dVih'
}

Testing

You need to have httpretty and factory-boy installed.

Run nosetests in the root folder or the tests folder.