In this section, you'll learn how to use aggregate functions in SQL.
You will be able to:
- Write queries with aggregate functions like
COUNT
,MAX
,MIN
, andSUM
- Create an alias for the return value of an aggregate function
- Use
GROUP BY
to sort the data sets returned by aggregate functions - Compare aggregates using the
HAVING
clause
import sqlite3
import pandas as pd
conn = sqlite3.Connection('data.sqlite')
cur = conn.cursor()
Lets start by looking at some groupby statements to aggregate our data.
#Here we join the offices and employees tables in order to count the number of employees per city.
cur.execute("""select city,
count(employeeNumber)
from offices
join employees
using(officeCode)
group by city;""")
pd.DataFrame(cur.fetchall())
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
0 | 1 | |
---|---|---|
0 | Boston | 2 |
1 | London | 2 |
2 | NYC | 2 |
3 | Paris | 5 |
4 | San Francisco | 6 |
5 | Sydney | 4 |
6 | Tokyo | 2 |
We can also alias our groupby by specifying the number of our selection order that we want to group by. Additionally, we can also order or limit our selection with the order by and limit clauses.
cur.execute("""select city,
count(employeeNumber)
from offices
join employees
using(officeCode)
group by 1
order by count(employeeNumber) desc
limit 5;""")
pd.DataFrame(cur.fetchall())
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
0 | 1 | |
---|---|---|
0 | San Francisco | 6 |
1 | Paris | 5 |
2 | Sydney | 4 |
3 | Boston | 2 |
4 | London | 2 |
Recall that we can also retrieve our column names when using sqlite3 (note that this will be the default behavior in other environments such as sql workbench)
cur.execute("""select city,
count(employeeNumber)
from offices
join employees
using(officeCode)
group by 1
order by count(employeeNumber) desc
limit 5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
city | count(employeeNumber) | |
---|---|---|
0 | San Francisco | 6 |
1 | Paris | 5 |
2 | Sydney | 4 |
3 | Boston | 2 |
4 | London | 2 |
Now that we can view our column names, we can also practice using alias's to name our aggregations.
cur.execute("""select city,
count(employeeNumber) as employeeCount
from offices
join employees
using(officeCode)
group by 1
order by count(employeeNumber) desc
limit 5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
city | employeeCount | |
---|---|---|
0 | San Francisco | 6 |
1 | Paris | 5 |
2 | Sydney | 4 |
3 | Boston | 2 |
4 | London | 2 |
Aside from count() some other useful aggregations include: * min() * max() * sum() * avg()
cur.execute("""select customerName,
count(*) as number_purchases,
min(amount) as min_purchase,
max(amount) as max_purchase,
avg(amount) as avg_purchase,
sum(amount) as total_spent
from customers
join payments
using(customerNumber)
group by 1
order by sum(amount) desc;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
98
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases | min_purchase | max_purchase | avg_purchase | total_spent | |
---|---|---|---|---|---|---|
0 | Euro+ Shopping Channel | 13 | 116208.40 | 65071.26 | 55056.844615 | 715738.98 |
1 | Mini Gifts Distributors Ltd. | 9 | 101244.59 | 85410.87 | 64909.804444 | 584188.24 |
2 | Australian Collectors, Co. | 4 | 44894.74 | 82261.22 | 45146.267500 | 180585.07 |
3 | Muscle Machine Inc | 4 | 20314.44 | 58841.35 | 44478.487500 | 177913.95 |
4 | Dragon Souveniers, Ltd. | 4 | 105743.00 | 44380.15 | 39062.757500 | 156251.03 |
df.tail()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases | min_purchase | max_purchase | avg_purchase | total_spent | |
---|---|---|---|---|---|---|
93 | Royale Belge | 4 | 1128.20 | 1627.56 | 7304.295000 | 29217.18 |
94 | Frau da Collezione | 2 | 17746.26 | 7612.06 | 12679.160000 | 25358.32 |
95 | Atelier graphique | 3 | 14571.44 | 6066.78 | 7438.120000 | 22314.36 |
96 | Auto-Moto Classics Inc. | 3 | 5858.56 | 9658.74 | 7184.753333 | 21554.26 |
97 | Boards & Toys Co. | 2 | 3452.75 | 4465.85 | 3959.300000 | 7918.60 |
Finally, we can also filter our aggregated views with the having clause. The having clause works like the where clause but is used to filter data selections on conditions post the group by. For example, if we wanted to filter based on a customer's last name, we would use the where clause. However, if we wanted to filter a list of city's with at least 5 customers, we would using the having clause; we would first groupby city and count the number of customers, and the having clause allows us to pass conditions on the result of this aggregation.
cur.execute("""select city,
count(customerNumber) as number_customers
from customers
group by 1
having count(customerNumber)>=5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
2
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
city | number_customers | |
---|---|---|
0 | Madrid | 5 |
1 | NYC | 5 |
We can also use the where and having clause in conjunction with each other for more complex rules. For example, let's say we want a list of customers who have made at least 3 purchases of over 50K each.
cur.execute("""select customerName,
count(amount) as number_purchases_over_50K
from customers
join payments
using(customerNumber)
where amount >= 50000
group by 1
having count(amount) >= 3
order by count(amount) desc;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
53
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases_over_50K | |
---|---|---|
0 | Euro+ Shopping Channel | 13 |
1 | Mini Gifts Distributors Ltd. | 9 |
2 | Anna's Decorations, Ltd | 4 |
3 | Australian Collectors, Co. | 4 |
4 | Baane Mini Imports | 4 |
df.tail()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases_over_50K | |
---|---|---|
48 | Stylish Desk Decors, Co. | 3 |
49 | Suominen Souveniers | 3 |
50 | Toys of Finland, Co. | 3 |
51 | Toys4GrownUps.com | 3 |
52 | Vitachrome Inc. | 3 |
After this section, you should have a good idea of how to use aggregate functions, aliases and the having clause to filter selections.