Mini Case Study is from Danny Ma's 8 Week SQL Challenge
The Objective of this case study is to debug SQL code and assist the General Manager of Health Analytics in answering questions for a board meeting.
Prior to answering the questions, SQL best practices encourage understanding the dataset and its fields.
This dataset includes 6 columns which are named as id, log_date, measure, measure_value, systolic, and diastolic. How this was determined by running this query (See Below).
SELECT * FROM health.user_logs;
Results:
id | log_date | measure | measure_value | systolic | diastolic |
---|
Due to the large amount of data, I was initally unsure of how many measures were present, running the query below displayed 3 measures blood_glucose, blood_pressure, weight.
SELECT
(DISTINCT meausure)
FROM
health.user_logs;
Results:
measure |
---|
blood_glucose |
blood_pressure |
weight |
Questions 1-9 will be displayed as screenshots of the the incorrect code followed by screenshots of the correct code.
Q.1
This will not run properly due to user_id not being a field to COUNT. The following code will produce the correct answer
SELECT
COUNT (DISTINCT id)
FROM
health.user_logs;
Results:
count |
---|
554 |
Q.2-8 I created a temporary table*
DROP TABLE IF EXISTS user_measure_count;
CREATE TEMP TABLE user_measure_count AS
SELECT
id,
COUNT(*) AS measure_count,
COUNT (DISTINCT measure) AS unique_measures
FROM health.user_logs
GROUP BY 1;
To ensure duplicate tables are not made DROP TABLE IF EXISTS is used followed by a CREATE TEMP TABLE
Q.2
The above query is incorrect because the mean is calculated used the AVG function. It must be aliased as the mean_value.
SELECT
ROUND (AVG(measure_count), 2) AS mean_value
FROM user_measure_count;
Results:
mean_value |
---|
79.23 |
Q.3
The question is asking median number of measurements Per User must order by measure_count
Median isnt a funciton in PostgresSQL Ordered Set Aggregate Functions is required to get the median value.
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY measure_count) AS median_value
FROM user_measure_count;
Results:
median_value |
---|
2 |
Q.4
This query will not work due to the HAVING clause which filters records from groups based on a specified condition. The correct clause to use is the WHERE clause because it filters data from a specific table.
SELECT COUNT(*)
FROM user_measure_count
WHERE measure_count >= 3;
Results:
count |
---|
209 |
Q.5
Summing id values doesn't make sense because id values are individual per user and will not produce the intended results. The question is asking how many users which needs the COUNT clause.
SELECT COUNT(*)
FROM user_measure_count
WHERE measure_count >= 1000;
Results:
count |
---|
5 |
Q.6
COUNT DISTINCT needs to be inside () 'blood_sugar' needs to be replaced with 'blood_glucose'
SELECT
COUNT(DISTINCT id)
FROM health.user_logs
WHERE measure = 'blood_glucose';
Results:
count |
---|
325 |
Q.7
COUNT(DISTINCT measures) does not exist because it was aliased using the AS clause to unique_measures.
SELECT
COUNT(*)
FROM user_measure_count
WHERE unique_measures >= 2;
Results:
count |
---|
204 |
Q.8
This question is asking how many users have all 3 measurements. User was not fully spelled in the FROM clause.
SELECT
COUNT(*)
FROM user_measure_count
WHERE unique_measures = 3;
Results:
count |
---|
50 |
Q.9
This code will not run properly because GROUP clause is missing after each WITHIN clause. blood_pressure is a string and needs to be in single quotes 'blood_pressure'.
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY systolic) AS median_systolic,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY diastolic) AS median_diastolic
FROM health.user_logs
WHERE measure = 'blood_pressure';
Results:
median_systolic | median_diastolic |
---|---|
126 | 79 |
Debugging Code
Summary statisitcs (mean & median)
Sorting & filtering Data
Dropping & Creating CTE's