get-all-hacker-news-submissions-comments

Simple Python scripts to download all Hacker News submissions and comments and store them in a PostgreSQL database.

UPDATE August 7th, 2017: All Hacker News submissions are now available on BigQuery, and the dataset is updated daily. If you are scraping Hacker News data at scale, it may be more efficient to use BigQuery instead.

An example query to get the top 2,000 Hacker News submissions:

#standardSQL
SELECT title, score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
ORDER BY score DESC
LIMIT 2000

The web interface can only download up to 10,000 titles; you'll need to use an API to get more.
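One way to get past the web interface limit is to run the same query through the BigQuery client library. The following is a minimal sketch, assuming the `google-cloud-bigquery` package is installed and Google Cloud credentials are already configured; the function names `build_top_stories_query` and `fetch_top_stories` are hypothetical and not part of this repository.

```python
def build_top_stories_query(limit=2000):
    """Return the example query above with a configurable LIMIT."""
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT title, score\n"
        "FROM `bigquery-public-data.hacker_news.full`\n"
        "WHERE type = 'story'\n"
        "ORDER BY score DESC\n"
        f"LIMIT {limit}"
    )


def fetch_top_stories(limit=2000):
    """Run the query and return a list of (title, score) tuples."""
    # Imported lazily so the query builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials and project
    rows = client.query(build_top_stories_query(limit)).result()
    return [(row.title, row.score) for row in rows]
```

Calling `fetch_top_stories(50000)` would request more rows than the web interface can export; query costs and quotas depend on your own Google Cloud project.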


This repository contains simple Python scripts to download all Hacker News submissions and comments and store them in a PostgreSQL database for use in ad-hoc data analysis. The scripts are optimized versions of those used to gather data for my October 2014 blog post, "The Quality, Popularity, and Negativity of 5.6 Million Hacker News Comments". Parameters for connecting to the appropriate PostgreSQL database are set at the beginning of each file.

The scripts use the older Algolia API for Hacker News (instead of the official HN API) because it supports bulk requests and returns comment scores for most comments. Downloading and processing all Hacker News submissions takes about 2 hours; downloading and processing all Hacker News comments takes about 11 hours.
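To illustrate what a bulk request against the Algolia API can look like, here is a rough sketch that walks backwards in time in batches of up to 1,000 items using the `search_by_date` endpoint. It is not the code from this repository: the helper names (`build_url`, `http_get_json`, `iter_items`) and the paging strategy are assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_url(tag, before_ts, hits_per_page=1000):
    """Build a request for items of one type created before a Unix timestamp."""
    params = {
        "tags": tag,  # "story" or "comment"
        "hitsPerPage": hits_per_page,
        "numericFilters": f"created_at_i<{before_ts}",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_items(tag, start_ts, fetch=http_get_json, max_batches=None):
    """Yield items newest-first, moving the timestamp cutoff back each batch.

    Caveat of this sketch: items sharing the oldest timestamp of a batch
    can be skipped if the batch boundary falls within that second.
    """
    before_ts = start_ts
    batches = 0
    while max_batches is None or batches < max_batches:
        hits = fetch(build_url(tag, before_ts)).get("hits", [])
        if not hits:
            break
        yield from hits
        before_ts = min(hit["created_at_i"] for hit in hits)
        batches += 1
```

The `fetch` parameter is injectable so the paging logic can be exercised without network access; a real run would also want rate limiting and retry handling.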

Example Queries

Average point score for HN submissions, by hour (EST) of submission:

SELECT EXTRACT(hour from created_at) AS hour, AVG(num_points) AS avg_points
FROM hn_submissions
WHERE num_points IS NOT NULL
GROUP BY hour
ORDER BY hour

hour avg_points
0 9.718
1 9.063
2 8.521
3 8.929
4 9.113
5 9.492
6 10.099
7 10.965
8 11.513
9 11.692
10 11.141
11 10.832
12 11.187
13 11.716
14 11.237
15 11.178
16 10.735
17 10.731
18 10.709
19 10.935
20 10.942
21 10.836
22 10.386
23 10.090

Number of users who have made at least n comments, and the average point score for the nth comment a user makes:

SELECT nth_comment, COUNT(num_points) AS users_who_made_num_comments, AVG(num_points) AS avg_points
FROM (
	SELECT num_points,
	ROW_NUMBER() OVER (PARTITION BY author ORDER BY created_at ASC) AS nth_comment
	FROM hn_comments
	WHERE num_points IS NOT NULL
) AS foo
WHERE nth_comment <= 25
GROUP BY nth_comment
ORDER BY nth_comment

nth_comment users_who_made_num_comments avg_points
1 159410 2.432
2 99599 2.474
3 79467 2.550
4 68525 2.620
5 60921 2.648
6 55477 2.681
7 51091 2.685
8 47522 2.764
9 44498 2.795
10 41998 2.827
11 39931 2.869
12 37992 2.862
13 36282 2.820
14 34770 2.886
15 33403 2.937
16 32195 2.916
17 31073 2.903
18 30070 2.978
19 29126 2.950
20 28217 2.968
21 27372 2.950
22 26619 2.975
23 25949 3.044
24 25295 3.017
25 24651 3.040

Create the Hacker News leaderboard of users with the most karma, the hard way (note that aggregated karma values will differ from true values due to vote obfuscation, among other things):

SELECT author, SUM(num_points) - COUNT(num_points) AS karma
FROM (
	SELECT author, num_points
	FROM hn_submissions
	UNION ALL
	SELECT author, num_points
	FROM hn_comments
) AS foo
WHERE num_points IS NOT NULL
GROUP BY author
ORDER BY karma DESC
LIMIT 25

author karma
tptacek 136777
pg 87380
ColinWright 76866
danso 57238
llambda 57105
fogus 55146
shawndumas 53092
patio11 51715
tokenadult 47853
ssclafani 46492
jgrahamc 45194
jacquesm 44717
cwan 44665
rayiner 41712
edw519 39716
DanielRibeiro 38530
luu 38035
ChuckMcM 37545
Libertatea 35177
evo_9 34585
lelf 34116
wglb 30763
aaronbrethorst 30220
raganwald 29993
anigbrowl 29875
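The arithmetic behind the karma column can be checked on toy data. The sketch below uses an in-memory SQLite database rather than PostgreSQL, with made-up authors and scores, purely for illustration; it orders by the `karma` alias. Subtracting `COUNT(num_points)` presumably discounts the single point every submission and comment starts with.

```python
import sqlite3

KARMA_SQL = """
SELECT author, SUM(num_points) - COUNT(num_points) AS karma
FROM (
    SELECT author, num_points FROM hn_submissions
    UNION ALL
    SELECT author, num_points FROM hn_comments
) AS foo
WHERE num_points IS NOT NULL
GROUP BY author
ORDER BY karma DESC
LIMIT 25
"""


def toy_leaderboard():
    """Run the karma query against a tiny in-memory dataset (fabricated rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hn_submissions (author TEXT, num_points INTEGER);
        CREATE TABLE hn_comments (author TEXT, num_points INTEGER);
    """)
    conn.executemany(
        "INSERT INTO hn_submissions VALUES (?, ?)",
        [("alice", 10), ("bob", 3)],
    )
    conn.executemany(
        "INSERT INTO hn_comments VALUES (?, ?)",
        [("alice", 1), ("bob", 5), ("bob", None)],  # NULL score is filtered out
    )
    rows = conn.execute(KARMA_SQL).fetchall()
    conn.close()
    return rows
```

Here alice has two scored items totalling 11 points, giving 11 - 2 = 9, and bob has two scored items totalling 8 points, giving 8 - 2 = 6.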

Known Data Fidelity Caveats

Unfortunately, there are a few issues with the source data, which the scripts attempt to mitigate:

  • Hacker News automatically converts certain punctuation in submissions and comments into stylistic Unicode (e.g. "smart quotes"), which cannot be stored in the database; the scripts convert the punctuation back to plain ASCII equivalents.
  • Comments contain style and link HTML; the scripts attempt to strip it.
  • On the server side, there are gaps of missing submission and comment data before 2010.
  • Comment scores are hidden server-side for comments after October 2014; coincidentally, this is the month my blog post was published and the official API was released.
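As a rough idea of what the punctuation and HTML cleanup can involve, here is a minimal sketch. It is not the repository's actual implementation: the function name `clean_comment`, the character mapping, and the regex-based tag stripping are all assumptions chosen for the example.

```python
import html
import re

# Hypothetical mapping from typographic punctuation to ASCII equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
})

PARAGRAPH_RE = re.compile(r"<p>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean_comment(raw):
    """Strip style/link HTML and normalize punctuation in a comment body."""
    text = PARAGRAPH_RE.sub("\n", raw)   # keep paragraph breaks as newlines
    text = TAG_RE.sub("", text)          # drop remaining tags such as <i>, <a>
    text = html.unescape(text)           # decode entities such as &amp;
    return text.translate(PUNCTUATION_MAP).strip()
```

Tags are removed before entities are decoded so that escaped text such as `&lt;b&gt;` survives as literal characters. A regex is only a best-effort approach to HTML; a proper parser would be more robust for malformed markup.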