SQL Notes

Select top 3 without TOP/LIMIT/ROWNUM

Top 3 Quantity From OrdersDetails: https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

SELECT Quantity 
FROM OrderDetails a
where (select count(*) from OrderDetails b where b.Quantity >= a.Quantity) <= 3
order by Quantity desc

or

SELECT Quantity, (select count(*) from OrderDetails b where b.Quantity >= a.Quantity) as rnk 
FROM OrderDetails a
where (select count(*) from OrderDetails b where b.Quantity >= a.Quantity) <= 3
order by Quantity desc

Note:

  • COUNT acts on WHERE b.Quantity >= a.Quantity which is also aggregation condition
  • Table self-join to be a condition
  • Use select to create a new column
  • The order of b.Quantity > a.Quantity
  • Pay attention to >= or >
  • No need to use ORDER BY when using b.Quantity >= a.Quantity
  • See below Second Largest for more explanation

Self Join

Rank Scores

Id Score
1 3.50
2 3.65
3 4.00
4 3.85
5 4.00
6 3.65

as

Score Rank
4.00 1
4.00 1
3.85 2
3.65 3
3.65 3
3.50 4
select a.Score, (select count(*) from (select distinct(Score) from Scores
order by Score desc) c where c.Score >= a.Score) as Rank
from (
select Score from Scores
order by Score desc) a

Rank Quantity

Select a.Quantity, (select count(*) from 
(Select distinct b.Quantity from OrderDetails b) sub 
where sub.Quantity >= a.Quantity) as rnk
from OrderDetails a
Order by rnk

or,

select c.Quantity, sub3.rnk as rank from OrderDetails c
join
(select sub1.Quantity, (select count(*) from 
(SELECT distinct(Quantity) FROM OrderDetails) sub2 
where sub2.Quantity >= sub1.Quantity) as rnk 
from 
(SELECT distinct(Quantity) FROM OrderDetails) sub1) sub3
on c.Quantity = sub3.Quantity
order by Rank asc

Note:

  • self join is very powerfull.
  • Use select count(*) from table a where c.Score >= a.Score to create a new column

Second Largest

Using LIMIT

SELECT number
FROM (
SELECT number FROM Table
ORDER BY number desc limit 2
) ORDER BY number limit 1

or Using TOP

SELECT TOP 1 salary   
FROM (  
SELECT TOP 2 salary  
FROM employee_table  
ORDER BY salary DESC) AS emp  
ORDER BY salary ASC;

or Using LIMIT and OFFSET

SELECT OrderID, Quantity FROM OrderDetails
Order By Quantity desc
Limit 1 Offset 1

Note:

  • LIMIT/TOP/OFFSET are not always available; they are not basis SQL commands.

Using max() and not in ()

select max(Quantity) from OrderDetails
where Quantity not in (select max(Quantity) from OrderDetails)

Notes:

  • column IN (1,2,3) / column NOT IN (select id from table)

Using self join

select Quantity from OrderDetails a
where (select count(*) from OrderDetails b where b.Quantity > a.Quantity) = 1

or

SELECT OrderID, Quantity FROM OrderDetails a
where (select count(*) from OrderDetails b where b.Quantity >= a.Quantity) = 2

Interpretation of this SQL Query:

  • Select a.Quantity in OrderDetails a and for each b.Quantity compare to this a.Quantity

  • Count the number where b.Quantity >= a.Quantity for each record aggregately

  • The first three largest a.Quantity: 120, 100, 90

  • select count(*) from OrderDetails b where b.Quantity >= 120

    return 1

  • select count(*) from OrderDetails b where b.Quantity >= 100

    return 2

  • select count(*) from OrderDetails b where b.Quantity >= 90

    return 3

Maximum in each group

Using rank() over()

RANK() OVER (PARTITION BY column_ID ORDER BY column_ID) AS rank_id

Other similar queries:

row_number() over(order by customer_atg_id) as record_id

Maximun Quantity in each of the OrderID:

SELECT OrderDetailID, OrderID, max(Quantity) FROM OrderDetails
group by OrderID

Note:

  • MAX acts on each of the group, similarly as MIN, SUM, etc.
  • In some database, this query will not work as OrderDetailID is not in group by.

Top 5 in each of the group

Select the top 5 CustomerID in each of the ShipperID

Select ShipperID, CustomerID from 
(SELECT ShipperID, CustomerID FROM [Orders]
order by ShipperID, CustomerID) a
where (select count(*) from (SELECT ShipperID, CustomerID FROM [Orders]
order by ShipperID, CustomerID) b where b.ShipperID == a.ShipperID and b.CustomerID <= a.CustomerID) <= 5

Note:

  • self join and two where condition count
SELECT rs.Field1,rs.Field2 
    FROM (
        SELECT Field1, Field2, 
               Rank() over (Partition BY Section
                ORDER BY RankCriteria DESC ) AS Rank
        FROM table
        ) rs WHERE Rank <= 10

Note:

  • Rank() over(Partition by ... Order by ...) as Rnk

Unique Combination Of Columns

给一个flight的table,有depature city和 arrival city,求unique的不论顺序的组合

depature arrival
A B
B A

结果只出现 A, B。

Unique combination of EmployeeID and ShipperID.

select distinct one, two from 
(SELECT (case when EmployeeID > ShipperID then ShipperID else EmployeeID end) as one,  (case when EmployeeID > ShipperID then EmployeeID else ShipperID end) as two FROM [Orders]) as sub

or,

select one, two from 
(SELECT (case when EmployeeID > ShipperID then ShipperID else EmployeeID end) as one,  (case when EmployeeID > ShipperID then EmployeeID else ShipperID end) as two FROM [Orders]) as sub
group by one, two

Note:

  • distinct(colA) or distinct colA, colB
  • Use group by to get unique combination
  • Use case when ... then ... end to order and do selection

Between Date

OrderID between '1996-09-02' and '1996-09-06': https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

SELECT OrderID, OrderDate FROM Orders
WHERE OrderDate BETWEEN '1996-09-02' AND '1996-09-06'

Note:

  • BETWEEN ... AND ... is inclusive.

or

SELECT OrderID, OrderDate FROM Orders
WHERE OrderDate >= '1996-09-02' AND OrderDate <= '1996-09-06'

Left Join

Left Join Orders on Shippers https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

SELECT a.OrderID, a.ShipperID, b.ShipperName FROM Orders a
Left join Shippers b
on a.ShipperID = b.ShipperID

Note:

  • left join fills all the records in “left table” with matches on “right table”

Left Join Shippers on Orders https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

SELECT a.ShipperID, ShipperName, OrderID, OrderDate
FROM Shippers a
LEFT JOIN Orders b
ON a.ShipperID = b.ShipperID

Note:

  • If there are multiply copies in the ‘right table’ matching the ‘left table’. The ‘left table’ matched item will duplicate to have multiply copies.
  • There is no difference between ‘left join’ and ‘join’ if there is no ‘missing’ on the join condition element(s) in the two tables. The following three queries return the same results (order may be different).
  • SELECT a.ShipperName, CustomerID, EmployeeID, OrderID FROM Shippers a
    join Orders b
    on a.ShipperID = b.ShipperID
  • SELECT a.ShipperName, CustomerID, EmployeeID, OrderID FROM Shippers a
    Left join Orders b
    on a.ShipperID = b.ShipperID
  • SELECT b.ShipperName, CustomerID, EmployeeID, OrderID FROM Orders a
    Left join Shippers b
    on a.ShipperID = b.ShipperID

The trick of Left Join and IS NULL

找出A table里的user_id且该user_id不在table B里

select A.id
from A 
left join B on A.id = B.id
where B.id IS NULL;

Note:

  • Use IS NULL instead of =NULL

Count NULL and NOT NULL

SELECT COUNT(ID) as NotNull, SUM(CASE WHEN ID IS NULL then 1 else 0 end) as NullCount  

Note:

  • count(columnname) does not count NULL

Union

you have a table where you have date, user_id, song_id and count. It shows at the end of each day how many times in the history a user has listened to a given song. So count is cumulative sum. You have to update this on a daily basis based on a second table that records in real time when a user listens to a given song. Basically, at the end of each day, you go to this second table and pull a count of each user/song combination and then add this count to the first table that has the lifetime count. If it is the first time a user has listened to a given song, you won't have this pair in the lifetime table, so you have to create the pair there and then add the count of the last day.

  • Table1: date, user_id, song_id, count. This is cumulative.
  • Table2: date, user_id, song_id, count. This is for each day
  • Using Table 2 to update Table1.
SELECT current_date, t.user_id, t.song_id, SUM(t.count)
FROM (
    SELECT *
    FROM cumu
    UNION ALL
    SELECT *
    FROM curr) t
GROUP BY t.user_id, t.song_id;

Note:

  • use UNION ALL if the two tables have the same columns.
  • With UNION, only distinct values are selected.
  • The UNION ALL command is equal to the UNION command, except that UNION ALL selects all values.

String Match

Write an SQL query to find names of employee start with 'A'?

SELECT * FROM Employees WHERE EmpName like 'A%'

or end with 'A'

SELECT * FROM Employees WHERE EmpName like '%A'

or 'A' in the second position

SELECT * FROM Employees WHERE EmpName like '_A%'

Note:

  • '%' is used as a wild card with 'like'
  • '_' is used as one letter with 'like'

Regular Expression

SQL regular expression very similar to python regular expression.

One example, find LastName begins with capital D:

SELECT * FROM Employees
where LastName regexp '^D'

This one does not work in w3schools

Where vs Having

Select the EmployeeID with the number of orders larger than 10 from Orders

SELECT EmployeeID, count(OrderID) FROM Orders
group by EmployeeID
having count(OrderID) > 10

Note:

  • “WHERE” vs “HAVING”: The WHERE clause cannot be used to restrict groups. The HAVING clause should be used.
  • “HAVING” need to be used after ‘GROUP BY’

找出EmployeeID和所有对应的OrderID,EmployeeID有多于25的OrderID

SELECT a.EmployeeID, a.OrderID, b.cnt 
FROM Orders a
left join (
select EmployeeID, count(OrderID) as cnt
from Orders 
group by EmployeeID
) b
on a.EmployeeID = b.EmployeeID
where b.cnt > 25
Order by a.EmployeeID

or,

Select a.EmployeeID, a.OrderID From Orders a
Where a.EmployeeID in (
  Select b.EmployeeID From Orders b
  Group by b.EmployeeID
  Having count(b.OrderID) > 25
)
Order by a.EmployeeID, a.OrderID

Most common elements in a column

The most frequent Quantity in OrderDetails

select Quantity, max(cnt) from (
SELECT Quantity, count(Quantity) as cnt FROM [OrderDetails]
group by Quantity) a

Second most frequent Quantity in OrderDetails

select Quantity, max(cnt) from (
SELECT Quantity, count(Quantity) as cnt FROM [OrderDetails]
group by Quantity) a
where cnt not in (
select max(cnt) from (
SELECT Quantity, count(Quantity) as cnt FROM [OrderDetails]
group by Quantity)
)

or,

select Quantity, cnt from (
SELECT Quantity, count(Quantity) as cnt FROM [OrderDetails]
group by Quantity) a
order by cnt desc
limit 1 offset 1

or,

select Quantity, cnt from (
SELECT Quantity, count(Quantity) as cnt FROM [OrderDetails]
group by Quantity) a
where (select count(cnt) from (
SELECT Quantity, count(Quantity) as cnt FROM [OrderDetails]
group by Quantity) b where b.cnt >= a.cnt) = 2

Note:

  • one column table to two columns table:
    • one column, itself
    • second column, count(itself) with group by itself

Accumulative Sum

Get the first OrderDetailID from which the accumulative number of orders larger than 80 from OrderDetails

select min(OrderDetailID), accsum from (
SELECT OrderDetailID, (select sum(Quantity) from OrderDetails b where b.OrderDetailID <= a.OrderDetailID) as accsum FROM OrderDetails a
)
where accsum > 80

or,

Select min(a.OrderDetailID) from OrderDetails a
where (select sum(Quantity) from OrderDetails b where b.OrderDetailID <= a.OrderDetailID) >= 80

Note:

  • Using self join
  • Using sum()
  • Pay attention to < or <= depending on question

Count in range

Get the number of OrderDetailID with Quantity in 1-5, 6-10, 11-15, ... from OrderDetails https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

Select (grp+1)*5 as Range, count(OrderDetailID), sum(Quantity) from
(SELECT OrderDetailID, Quantity, ((Quantity - 1) / 5) as grp FROM [OrderDetails])
group by grp

Note:

  • GROUP BY does not work on newly formed names, instead use the whole formula
  • Floor(Amount/50) or Ceil(Amount/50)

Join After Group By

Compute average order / clicks for each visitor?

click_data

visitor dt impression clicks test_variant
a 6/1 100 1 control
b 6/1 20 5 control
... ... ... ... ...
aaa 6/1 200 1 treatment
bbb 6/1 40 5 treatment

purchase_data

visitor dt orders $ test_variant
aa 6/2 1 100 control
bb 6/3 2 500 control
... ... ... ... ...
aaaa 6/5 3 1000 treatment
bbbb 6/10 1 200 treatment
select a.visitor, s_orders/sum(a.clicks) as each_avg
from click_data a
left join (
select visitor, sum(orders) as s_orders
from purchase_data b
group by visitor
) b
on a.visitor = b.visitor
group by a.visitor, s_orders

Note:

  • sum() / sum(), get one sum() in aggregation join sub query and then do another sum() in the main select

Case ... When ... Then ... Else ... End

select top 100 date, 
count(*) as count_all,
sum (
  case 
    when content not like '%some condition%' then 1
	when content like '%some condition%' then -1
    else 0
  end
) as count_condition
from sections
group by date
order by date

Note:

  • Use ‘Case When … Then … Else … End’ when need to count two different conditions from one table

cast( VAR as float)

给一个表有四个column:time, user_id, app_id, event ('imp' or 'click') 。大致意思:当一个user在玩一个app时,有一定几率会弹出一个窗口让user 添加信用卡(event "imp"),如果这个user按了 yes,那就是一个 click的event如何来判定这个弹窗口的效果?(click through rate),如何看哪个app click through rate最高,如何知道这个table的信息是否正确 (比如user 按了一次 yes,结果自动生成多次 click event)

select app_id, avg(rate) from (
select user_id, app_id, cast(count(case when event = 'click' then 1 else 0 end) as float) / cast(count(case when event = 'imp' then 1 else 0 end) as float) as rate
from table
group by user_id, app_id) sub
group by app_id

Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like.

select f.userid, l.pageid
from friends f
JOIN likes l ON l.userid = f.friendid
LEFT JOIN likes r ON (r.userid = f.userid AND r.pageid = l.pageid)
where r.pageid IS NULL;

Note:

  • The trick of using IS NULL
  • Whenever the question is something in table_a and not in table_b, use '=' and 'IS NULL'

Exists

Find employee last name and first name who has more than 35 Orders (count(OrderID) > 35). https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

Select EmployeeID, LastName, FirstName from Employees a
where exists (
SELECT EmployeeID, count(OrderID) FROM [Orders] b
where b.EmployeeID = a.EmployeeID
group by EmployeeID
having count(OrderID) > 35)
  • Result: 4, Peacock, Margaret

Interpretation of this SQL Query:

  • SELECT EmployeeID, count(OrderID) FROM [Orders]
    where EmployeeID = 4
    group by EmployeeID
    having count(OrderID) > 35

    Result: 4, 40. where exists (...) returns True.

  • SELECT EmployeeID, count(OrderID) FROM [Orders]
    where EmployeeID in (1,2,3,5,6,7,8,9)
    group by EmployeeID
    having count(OrderID) > 35

    No result. where exists (...) returns False.

Note:

  • There is a Where statement in the EXISTS subquery which connects the subquery with the main query.
  • The EXISTS operator is used to test for the existence of any record in a subquery.
  • The EXISTS operator returns True if the subquery returns one or more records.
  • EXISTS is used to return a boolean value, JOIN returns a whole other table
  • EXISTS is only used to test if a subquery returns results, and short circuits as soon as it does.
  • EXISTS is a semi-join; JOIN is a join
  • The columns in the Exists Subyquery can not be called in the main query.

Tricky Self Join and Use Not Exists

Table: company

Member_id Company Year_Start
1 Microsoft 2000
1 Google 2006
1 Facebook 2012
2 Microsoft 2001
2 Oracle 2004
2 Google 2007
... ... ...

How many members ever moved from Microsoft to Google? (both member #1 and member #2 count)

SELECT COUNT(DISTINCT c1.Member_id)
FROM company c1 
JOIN company c2
ON c1.Member_id = c2.Member_id 
WHERE c1.Company = 'Microsoft' 
AND c2.Company = 'Google'
AND c1.Year_Start < c2.Year_Start

How many members moved directly from Microsoft to Google? (Member 2 does not count since Microsoft -> Oracle -> Google)

SELECT COUNT (DISTINCT c1.Member_id)
FROM company c1 
JOIN company c2
ON c1.Member_id = c2.Member_id 
WHERE c1.Company = 'Microsoft' 
AND c2.Company = 'Google'
AND c1.Year_Start < c2.Year_Start
LEFT JOIN company c3
ON c3.Member_id = c1.Member_id
WHERE c3.Year_Start BETWEEN c1.Year_Start AND c2.Year_Start
AND c3.Member_id IS NULL

or,

SELECT COUNT (DISTINCT c1.Member_id)
FROM company c1 
JOIN company c2
ON c1.Member_id = c2.Member_id 
WHERE c1.Company = 'Microsoft' 
AND c2.Company = 'Google'
AND c1.Year_Start < c2.Year_Start
AND NOT EXISTS ( 
SELECT c3.Member_id
FROM company c3
WHERE c3.Member_id = c1.Member_id
AND c3.Year_Start BETWEEN c1.Year_Start AND c2.Year_Start
)

Trick about Join and Count

一个log 数据表格 五列, Date, User_Id, Content_ID, Content_Type, Target_Id. Content_Type有5个,photo, post, share , like, comment. 其中post, photo, share 是story, like和comment是Action.

题目是问comment distribution by post.

Select Target_ID, count(Content_ID) as cnt
from table
where Content_Type = 'comment'
   and Target_ID in (select distinct Content_ID from table where Content_Type = 'post')
group by Target_ID

or,

Select post_id, count(comment_id)
from
(Select a.content_id as post_id, b.content_id as comment_id
From table a
Left join table b
On a.content_id = b.target_id
Where a.content_type =‘Post’
And b.content_type=‘Comment’
) sub
group by post_id

More directly:

Select a.content_id, count(b.user_id)
From content_actions a
Join content_actions b
On b.target_id = a.content_id
Where b.content_type = 'Comment'
And a.content_type = 'Post'
Group By a.content_id

Note:

SELECT a.OrderID, a.ShipperID, a.EmployeeID, b.Quantity
from Orders a
left join OrderDetails b
on a.OrderID = b.OrderID
where a.ShipperID = 1
and b.Quantity < 50
SELECT ShipperName, Count(CustomerID) 
FROM Orders a
Join Shippers b
On a.ShipperID = b.ShipperID
Group By ShipperName

Same as,

Select ShipperName, Count(CustomerID)
From (
SELECT *, b.ShipperName 
FROM Orders a
Join Shippers b
On a.ShipperID = b.ShipperID
) sub
Group By ShipperName

Order By Two Columns Differently and Aggregate

给一个table,studentid, courseid, grade, sql语句输出student最高成绩的一门课, 如果分数相同,输出courseid小的

select b.studentid, min(b.courseid), max(b.grade) from table b
join (
select studentid, max(grade) as grade from table c
group by studentid
) a
on a.studentid = b.studentid 
and a.grade = b.grade
group by b.studentid

Note:

  • select studentid, min(courseid), max(grade) from table
    group by studentid
    This does not work. min(courseid) will return the min of all the courseids in the studentid group and max(grade) will return the max of all the grades in studentid. This min and max are independent of each other.
  • select studentid, courseid, max(grade) from table
    group by studentid
    This does not work. As courseid is not in group by.
  • group by ShipperID, then max() select the max CustomerID row ordered according to EmployeeID in group ShipperID
  • EmployeeID is not in group by or aggregation count() or sum()

In MySQL,

SELECT studentid, courseid, grade
   FROM
     (SELECT studentid, courseid, grade, 
                  @student_rank := IF(@current_student = studentid, @student_rank + 1, 1) AS student_rank,
                  @current_student := studentid
       FROM table
       ORDER BY studentid, grade DESC
     ) ranked
   WHERE student_rank <= 1

In Orders, StudentID -> ShipperID, courseID -> EmployeeID, grade -> CustomerID https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all

select b.ShipperID, min(b.EmployeeID), max(b.CustomerID) from Orders b
join (
select ShipperID, max(CustomerID) as CustomerID from Orders c
group by ShipperID
) a
on a.ShipperID = b.ShipperID 
and a.CustomerID = b.CustomerID
group by b.ShipperID
SELECT ShipperID, EmployeeID, CustomerID
   FROM
     (SELECT ShipperID, EmployeeID, CustomerID, 
                  @Shipper_rank := IF(@current_Shipper = ShipperID, @Shipper_rank + 1, 1) AS Shipper_rank,
                  @current_Shipper := ShipperID
       FROM Orders
       ORDER BY ShipperID, CustomerID DESC
     ) ranked
   WHERE Shipper_rank <= 1

Sampling With rand()

Randomly select 100 records

select atg_id, map_id
from propensity
order by rand() 
LIMIT 10

or

select atg_id, map_id
from propensity
distribute by rand()
sort by rand() 
LIMIT 10

With .. AS Clause

  • The SQL WITH clause allows you to give a sub-query block a name (a process also called sub-query refactoring), which can be referenced in several places within the main SQL query.
  • WITH ... AS ... creates a temporary Table that can be used in the complex query.
WITH query_name1 AS (
     SELECT ...
     )

   , query_name2 AS (
     SELECT ...
       FROM query_name1
        ...
     )

SELECT ... 
  FROM query_name2
    ...

Select Column Names

select *
from INFORMATION_SCHEMA.COLUMNS
where TABLE_NAME = 'FACT_EVENT_AVAILABILITY'
and TABLE_SCHEMA = 'DWH'

Select Quote String

SELECT 'warehouse' as source_type, 1 as priority
UNION
SELECT 'cart' as source_type, 2 as priority

This will create a table as follows.

source_type priority
warehouse 1
cart 2