A place to store my SQL notes and practice problems.
- SELECT
- FROM
- WHERE
- GROUP BY
- HAVING
- ORDER BY
SELECT year, month, west
FROM tutorial.us_housing_units
SELECT year AS "year_built", month, west
FROM tutorial.us_housing_units
LIMIT 20
SELECT *
FROM tutorial.us_housing_units
WHERE month_name = 'February'
SELECT *
FROM tutorial.us_housing_units
WHERE month_name < 'o'
This finds all month names that start with N or an earlier letter in alphabet.
SELECT year, month, west, south,
(west + south) / 2 AS south_west_avg
FROM tutorial.us_housing_units
This find the average across the south and west columns, per entry.
- LIKE allows you to match similar values, instead of exact values.
- ILIKE is the same as LIKE but ignores case
- IN allows you to specify a list of values you'd like to include.
- BETWEEN allows you to select only rows within a certain range.
- IS NULL allows you to select rows that contain no data in a given column.
- AND allows you to select only rows that satisfy two conditions.
- OR allows you to select rows that satisfy either of two conditions.
- NOT allows you to select rows that do not match a certain condition.
- % = Any set of characters
- _ = A single character
IN checks if the entry's column value is included in a given set of values.
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE artist IN ('Taylor Swift', 'Usher', 'Ludacris')
BETWEEN matches values that fall within a specified range.
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank BETWEEN 5 AND 10
Sorts rows by a specific column(s). Sorts from smallest to largest by default (Ascending).
Use the DESC keyword after the column name to sort from largest to smallest instead.
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank <= 3
ORDER BY year DESC, year_rank
--This is a single line comment in SQL
/* Here's
a multi-line
comment in SQL */
Aggregation functions return a single row with one or many aggregated columns.
Place on the SELECT line. Can be used on non-numerical columns.
- SELECT COUNT(*) = count all results
- SELECT COUNT(price_high) = only count non-null values in that column
SELECT COUNT(date)
FROM tutorial.aapl_historical_stock_price
Only works on numerical (summable) columns. Nulls will be evaluated as 0 in the SUM.
SELECT SUM(volume)
FROM tutorial.aapl_historical_stock_price
Pulls the smallest/biggest value from the column.
SELECT MIN(volume) AS min_volume,
MAX(volume) AS max_volume
FROM tutorial.aapl_historical_stock_price
AVG finds an arithmetic mean average of a column.
Only accepts numerical columns and completely ignores null values. To include nulls as 0, pre-convert them to 0 before aggregating with AVG.
SELECT AVG(high)
FROM tutorial.aapl_historical_stock_price
WHERE high IS NOT NULL
Aggregation functions like COUNT, AVG, and SUM aggregate across entire table (result in 1 row)
GROUP BY can aggregate by parts of the table resulting in multiple rows (aggregated on sub-categories e.g. month of the year)
GROUP BY comes after FROM in the query.
SELECT year, COUNT(*) AS count
FROM tutorial.aapl_historical_stock_price
GROUP BY year
or GROUP BY multiple columns, which usually also needs an ORDER BY to order the results in a sensible way:
SELECT year,
month,
COUNT(*) AS count
FROM tutorial.aapl_historical_stock_price
GROUP BY year, month
ORDER BY month, year
HAVING is similar to the WHERE clause, but useful when you can't use the WHERE clause (when aggregating with GROUP BY).
SELECT year,
month,
MAX(high) AS month_high
FROM tutorial.aapl_historical_stock_price
GROUP BY year, month
HAVING MAX(high) > 400
ORDER BY year, month
CASE allows handling of if/then logic.
Must be followed by at least one WHEN/THEN statement.
- WHEN specifies criteria
- THEN specifies the value result of that criteria
- Finish with an ELSE case and then END AS [new_col_name]
This query creates the 'is_a_senior' column, and fills the data in dependent on if that entry has 'SR' as the value in the 'year' column.
SELECT player_name,
year,
CASE WHEN year = 'SR' THEN 'yes'
ELSE 'no' END AS is_a_senior
FROM benn.college_football_players
A more complex example:
SELECT player_name,
weight,
CASE WHEN weight > 250 THEN 'over 250'
WHEN weight > 200 AND weight <= 250 THEN '201-250'
WHEN weight > 175 AND weight <= 200 THEN '176-200'
ELSE '175 or under' END AS weight_group
FROM benn.college_football_players
CASE with aggregate functions:
SELECT CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END AS year_group,
COUNT(1) AS count
FROM benn.college_football_players
GROUP BY CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END
CASE inside aggregate functions:
SELECT COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) AS fr_count,
COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) AS so_count,
COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) AS jr_count,
COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END) AS sr_count
FROM benn.college_football_players
Returns only unique values.
In aggregations, DISTINCT goes inside the aggregation function.
DISTINCT can cause very slow queries in aggregations.
SELECT COUNT(DISTINCT month) AS unique_months
FROM tutorial.aapl_historical_stock_price
Combines data from multiple tables, by connecting them via some relationship.
Follows this pattern:
- FROM [table1]
- JOIN [table2]
- ON [column in table1] = [column in table2]
SELECT *
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
Equivalent of intersection between 2 sets (tables).
SELECT players.*, teams.*
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
Similar to INNER JOIN but OUTER JOIN can return unmatched rows in one or both tables.
Types of outer join:
- LEFT JOIN = Returns only unmatched rows from left table.
- RIGHT JOIN = Returns only unmatched rows from right table.
- FULL OUTER JOIN = Returns unmatched rows from both tables.
Returns unmatched and matched rows from the left table, and only matched rows from the right table.
SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
Returns unmatched and matched rows from the right table, and only matched rows from the left table.
SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_acquisitions acquisitions
RIGHT JOIN tutorial.crunchbase_companies companies
ON companies.permalink = acquisitions.company_permalink
Filtering with ON will apply only in one table (before joining):
SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
AND acquisitions.company_permalink != '/company/1000memories'
ORDER BY 1
Filtering with WHERE is applied after the data join, and so applies to data from both tables:
SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
WHERE acquisitions.company_permalink != '/company/1000memories'
OR acquisitions.company_permalink IS NULL
ORDER BY 1
UNION lets you write multiple different SELECT queries, then combine them into a single table of results.
UNION only selects distinct values by default, but UNION ALL will avoid this and include duplicates.
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION
SELECT *
FROM tutorial.crunchbase_investments_part2
Applies a fitlering step to a table in the JOIN process (so before data from other tables is involved).
SELECT companies.permalink,
companies.name,
companies.status,
COUNT(investments.investor_permalink) AS investors
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_investments_part1 investments
ON companies.permalink = investments.company_permalink
WHERE investments.funded_year > companies.founded_year + 5
GROUP BY 1,2, 3
JOINing on multiple fields is useful for:
- Accuracy (refines the results further)
- Performance (SQL Indexes can be used to create faster queries even if no further accuracy is achieved)
SELECT companies.permalink,
companies.name,
investments.company_name,
investments.company_permalink
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_investments_part1 investments
ON companies.permalink = investments.company_permalink
AND companies.name = investments.company_name
Sometimes useful to join a table with itself.
E.g. Finding companies that received investment from England after an investment from Japan:
SELECT DISTINCT japan_investments.company_name,
japan_investments.company_permalink
FROM tutorial.crunchbase_investments_part1 japan_investments
JOIN tutorial.crunchbase_investments_part1 gb_investments
ON japan_investments.company_name = gb_investments.company_name
AND gb_investments.investor_country_code = 'GBR'
AND gb_investments.funded_at > japan_investments.funded_at
WHERE japan_investments.investor_country_code = 'JPN'
ORDER BY 1
- String = VARACHAR(1024) = Any characters, max length 1024 chars
- Date/Time = TIMESTAMP = Date and time value
- Number = DOUBLE PRECISION = Up to 17 digits, with decimals
- Boolean = BOOLEAN = True/False
Can use CAST or CONVERT to change data types in queries.
SELECT CAST(funding_total_usd AS varchar) AS funding_total_usd_string,
founded_at_clean::varchar AS founded_at_string
FROM tutorial.crunchbase_companies_clean_date
- LEFT(s,n), RIGHT(s,n) = selects substring of n chars starting from left/right of string s.
- LENGTH(s) = Returns the length of string s.
- TRIM(both '()' FROM col) = Removes all brackets from start and end of the string in col.
- POSITION('c' in name) = Returns the numerical position of character 'c' in the name value.
- STRPOS(name, 'c') = Same thing as POSITION
- SUBSTR(name, 4, 2) = Returns substring in name starting 4 chars in, with a length of 2 chars.
- CONCAT(name, ',' surname) = Returns a joined string, separated by a comma in this case.
- || = Same as CONCAT, works like '+' between strings in Python.
- UPPER/LOWER = Converts string to upper/lower case.
- EXTRACT('year' FROM date) = Returns the year part of the date.
- COALESCE(name, 'no name') = Replaces null values in a column with a specified value.
Also known as inner queries or nested queries.
SELECT sub.*
FROM (
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE day_of_week = 'Friday'
) sub
WHERE sub.resolution = 'NONE'
Using subqueries to aggregate at multiple stages:
SELECT LEFT(sub.date, 2) AS cleaned_month,
sub.day_of_week,
AVG(sub.incidents) AS average_incidents
FROM (
SELECT day_of_week,
date,
COUNT(incidnt_num) AS incidents
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1,2
) sub
GROUP BY 1,2
ORDER BY 1,2
Joining subqueries:
SELECT incidents.*,
sub.incidents AS incidents_that_day
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN ( SELECT date,
COUNT(incidnt_num) AS incidents
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1
) sub
ON incidents.date = sub.date
ORDER BY sub.incidents DESC, time
From PostgreSQL docs:
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
Basic Example:
SELECT duration_seconds,
SUM (duration_seconds)
OVER (ORDER BY start_time)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
- OVER turns the normal query into a window function
- It SUMs the duration_seconds over the entire table output, and keeps a cumulative total
Split the rows over which the window function runs, by using PARTITION BY in the OVER arg:
SELECT start_terminal, duration_seconds,
SUM (duration_seconds)
OVER (PARTITION BY start_terminal ORDER BY start_time)
AS start_terminal_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'