Dataset collected from the popular Russian collective blogs Habrahabr, Geektimes, and Megamozg, owned by TM. Each post is stored as a JSON record like this:
{
"_id": 115710,
"_last_update": "2015-04-08T00:00:00",
"title": "Собираем данные с помощью Scrapy",
"published": "2011-03-18T23:13:00",
"author": "bekbulatov",
"author_url": "http://habrahabr.ru/users/bekbulatov/",
"author_rating": 3.8,
"hubs": [["Python", "http://habrahabr.ru/hub/python/"]],
"favs_count": 315,
"pageviews": 21281,
"tags": ["scrapy", "парсинг", "python", "crawler"],
"comments_count": 49,
"content_html": "..."
}
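Records like the one above can be read with Python's standard json module. This is a minimal sketch, assuming the dataset is stored as one JSON object per line in a hypothetical file posts.jsonl; the actual file layout may differ.

import json

# Minimal sketch: iterate over post records, assuming one JSON object
# per line in a hypothetical posts.jsonl (the real layout may differ).
with open("posts.jsonl", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        # Each record carries the fields shown in the example above.
        print(post["_id"], post["title"], post["pageviews"])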
A summary table of all posts in the dataset is provided in CSV format, encoded in UTF-8 (a loading sketch follows the column list below).
Columns:
- post_id
- last_update
- published
- title
- author
- favs_count
- pageviews
- comments_count
- comments_parsed
- comments_banned
- first_comment_time
- last_comment_time
- author_comments
- tags
- content_length
- hubs_count
- hubs
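The summary table can be loaded with pandas, for instance. A minimal sketch, assuming the CSV is named habr_posts.csv (a hypothetical name; use the actual file name from the distribution):

import pandas as pd

# Minimal sketch: load the summary table (hypothetical file name
# habr_posts.csv) and parse the timestamp columns as dates.
df = pd.read_csv(
    "habr_posts.csv",
    encoding="utf-8",
    parse_dates=["last_update", "published",
                 "first_comment_time", "last_comment_time"],
)
print(df[["post_id", "title", "pageviews"]].head())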
Use the script download_all_habr.py
to fetch and parse all the pages currently available. Habrahabr posts are
indexed with consecutive integers from 1 to about 300000, and you should specify the range of indices to download.
If you want to distribute the download across several machines, just give each one a different piece of the whole range (see the example after the command below).
$ python download_all_habr.py --start-index 1 --finish-index 300000
The script will create a habr_pages directory
and download the post contents there.
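For example, to split the work between two machines, each can take one half of the index range (the split point below is arbitrary):

$ python download_all_habr.py --start-index 1 --finish-index 150000
$ python download_all_habr.py --start-index 150001 --finish-index 300000

Since the ranges do not overlap, the resulting habr_pages directories can simply be merged afterwards.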