Social Media Queries

A collection of SQL queries to social media datasets. The queries return answers like "Most mentioned books on Hacker News", "Top apps on Reddit", and others. See the list of queries and how to use them below

Queries

Queries are written for Google BigQuery free public datasets (requires a Google account) and stored in .sql files, organized by social media outlet (folder hackernews and so on). These datasets are snapshots taken on particular dates, so results do not include post-2015 content.

Each of the queries processes 0.5-10GB of data. Processing up to 1TB per month is free, and you have up to 2,000 queries to experiment with.

Hacker News

Some examples (see hackernews directory for the full list):

Most cited books (comments) - Using links to Amazon.com as citations. Don't include text references. Also see this thread. This and the three next queries can be extended to other items. Examples:
- Movies on Internet Movie Database: imdb.com/title/tt[0-9]+/
- Books on iTunes: itunes.apple.com/book/id[0-9]+
- Apps on Google Play: play.google.com/store/apps/details?id=.+
Most cited books (submissions) - The same, but this counts submitted URLs.
Popular iTunes Apps (comments) - Like "Most cited books", but this tracks links to Apple Store.
Popular iTunes Apps (submissions) - Similarly.
Social network (graph) - A weighted directional graph based on users commenting each other. Weights correspond to the number of comments one user left to another. See Social network analysis for more information.
Top authors by median - List of authors based on the median score. A quick way to find founders and VCs submitting to HN.
Top authors by mean - Based on the mean score. Usually implies many low-scored posts with major hits due to the skewed distribution.
Top news sources - Where most popular news come from? Separated by day of week and hour.
Popular Wikipedia articles - Counting links to Wikipedia articles.

For simple queries, use Hacker News' Algolia search:

All Hacker News queries can be applied to Reddit after minor edits. Examples:

Top authors by median - Authors ranked by the median score with minor adjustments. Expect no poor content from them.
Top sources of political news - Ranking sources submitted to r/politics.

Reddit comments on BigQuery are split into multiple tables. If you want to select from comments, use TABLE_QUERY:

FROM (TABLE_QUERY([fh-bigquery:reddit_comments], "table_id BETWEEN '2007' AND '2014' OR table_id CONTAINS '2015_' OR table_id CONTAINS '2016_'")).

Beware, this can quickly exhaust the free 1TB limit.

Wikipedia

Edits made from an IP address - Wikipedia records IP addresses of anonymous editors. With respect to privacy, some uses of this data:
- Edits by organization. Many organizations reserve static IPs. One famous example is US Congress' edits. This query is unlikely to return many edits done by a particular organization because the sample table contains only 300M edits. Too diluted to have a representative subset.
- Edits by region. The sample is sufficient for statistics by region and other broad characteristics.

Usage

Web Interface

You can export up to 16,000 rows or 10MB as a csv file via web interface. Larger output can be exported through console. Add LIMIT <n> to queries to control the number of rows in output.

Locate a query in the repo's folder
Login at https://bigquery.cloud.google.com/welcome
Press "Compose query" in the top left corner
Copy-paste the query and run it

See web UI quickstart by Google.

Command line: `bq`

Install Google Cloud SDK
Initialize your account for command line tools
Run bq query `cat <path>` , where <path> leads to the .sql file

Python in clouds: Jupyter, IPython notebooks

Get a Google Cloud account (free trial)
Create a Jupyter notebook in Datalab
Do import gcp.bigquery as bq
Run queries with bq.Query() function

See Felipe Hoffa's Hacker News notebook for example.

BigQuery API

See BigQuery API Quickstart for examples in Java, Python, C#, PHP, Ruby. You'll need a credentials file to run it locally.

Contributing

Pull requests are welcomed. Suggestions:

Adding new data mining queries
Rewriting .sql files related to Hacker News for Reddit and Wikipedia databases

The reference for BigQuery's SQL dialect.

Acknowledgements

Discussions on Hacker News and Reddit
Felipe Hoffa for publishing the datasets

Delejnr/smq