A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the list of queries and how to use them below
Queries are written for Google BigQuery free public datasets (requires a Google account) and stored in .sql
files, organized by social media outlet (folder hackernews
and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content.
Each of the queries processes 0.5-10GB of data. Processing up to 1TB per month is free, and you have up to 2,000 queries to experiment with.
Some examples (see hackernews
directory for the full list):
- Most cited books (comments) - Using links to Amazon.com as citations. Don't include text references. Also see this thread. This and the three next queries can be extended to other items. Examples:
- Movies on Internet Movie Database:
imdb.com/title/tt[0-9]+/
- Books on iTunes:
itunes.apple.com/book/id[0-9]+
- Apps on Google Play:
play.google.com/store/apps/details?id=.+
- Movies on Internet Movie Database:
- Most cited books (submissions) - The same, but this counts submitted URLs.
- Popular iTunes Apps (comments) - Like "Most cited books", but this tracks links to Apple Store.
- Popular iTunes Apps (submissions) - Similarly.
- Social network (graph) - A weighted directional graph based on users commenting each other. Weights correspond to the number of comments one user left to another. See Social network analysis for more information.
- Top authors by median - List of authors based on the median score. A quick way to find founders and VCs submitting to HN.
- Top authors by mean - Based on the mean score. Usually implies many low-scored posts with major hits due to the skewed distribution.
- Top news sources - Where most popular news come from? Separated by day of week and hour.
- Popular Wikipedia articles - Counting links to Wikipedia articles.
For simple queries, use Hacker News' Algolia search:
All Hacker News queries can be applied to Reddit after minor edits. Examples:
- Top authors by median - Authors ranked by the median score with minor adjustments. Expect no poor content from them.
- Top sources of political news - Ranking sources submitted to r/politics.
Reddit comments on BigQuery are split into multiple tables. If you want to select from comments, use TABLE_QUERY
:
FROM (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'"))
.
Beware, this can quickly exhaust the free 1TB limit.
- Edits made from an IP address - Wikipedia records IP addresses of anonymous editors. With respect to privacy, some uses of this data:
- Edits by organization. Many organizations reserve static IPs. One famous example is US Congress' edits. This query is unlikely to return many edits done by a particular organization because the sample table contains only 300M edits. Too diluted to have a representative subset.
- Edits by region. The sample is sufficient for statistics by region and other broad characteristics.
You can export up to 16,000 rows or 10MB as a csv
file via web interface. Larger output can be exported through console. Add LIMIT <n>
to queries to control the number of rows in output.
- Locate a query in the repo's folder
- Login at https://bigquery.cloud.google.com/welcome
- Press "Compose query" in the top left corner
- Copy-paste the query and run it
See web UI quickstart by Google.
- Install Google Cloud SDK
- Initialize your account for command line tools
- Run
bq query `cat <path>`
, where<path>
leads to the.sql
file
- Get a Google Cloud account (free trial)
- Create a Jupyter notebook in Datalab
- Do
import gcp.bigquery as bq
- Run queries with
bq.Query()
function
See Felipe Hoffa's Hacker News notebook for example.
See BigQuery API Quickstart for examples in Java, Python, C#, PHP, Ruby. You'll need a credentials file to run it locally.
Pull requests are welcomed. Suggestions:
- Adding new data mining queries
- Rewriting
.sql
files related to Hacker News for Reddit and Wikipedia databases
The reference for BigQuery's SQL dialect.
- Discussions on Hacker News and Reddit
- Felipe Hoffa for publishing the datasets