Analytics Test - Online Retail

Overview

Technologies used to deploy and share the analysis

Understanding the Data

What are the dimensions of the dataset?
Are there any missing values?
What attributes and data types it contains?

Understanding the Context

How many unique values?
What time window are we looking at?
How many different customers have bought a product from the company?
Which day had the most transactions happening?
What is the average value of a transaction?
Which products are most popular?
Additional notes and observations
Closing Remarks

Dashboard Queries

Take me to the Dashboard!
Banner
Unique Customers
Transactions
Revenue
Daily Transactions and Cumulative Revenue
Most Popular Products
Top 20 Customers
Top 10 Products - Transactions
Top 10 Products - Quantity

Overview

This document describes my process for the Exploratory Data Analysis (EDA) of the online_retail_II dataset.

TLDR: The present document contains a more throughout description of the process, queries and reasoning i use as starting point for most analysis.

If you are a visual learner or not so much into the "tech-side" of data, there is a dashboard with filters for you to play around and come up with some of your own insights!

You can access the dashboard here: Online Retail - Dashboard

Technologies used to deploy and share the analysis

Database

AWS RDS Postgres instance

(for more information: https://aws.amazon.com/rds/postgresql/)

Dashboard:

Metabase on Heroku

(for more information: https://www.metabase.com/docs/latest/operations-guide/running-metabase-on-heroku.html)

Understanding the Data

What are the dimensions of the dataset?

SELECT
    'online_retail' AS table,
    (
        SELECT COUNT(*)
        FROM online_retail

    ) as rows,
    (
        SELECT COUNT(*)
        FROM information_schema.columns
        WHERE table_name = 'online_retail'

    ) as columns

table	rows	columns
online_retail	1,067,371	8

The dataset has 8 columns and around 1 million rows. Let's see what else we can find...

Are there any null/missing values?

SELECT 'invoice'     AS column_name, SUM(case when invoice     is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
SELECT 'stockcode'   AS column_name, SUM(case when stockcode   is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
select 'description' AS column_name, SUM(case when description is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
SELECT 'quantity'    AS column_name, SUM(case when quantity    is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
SELECT 'invoicedate' AS column_name, SUM(case when invoicedate is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
SELECT 'price'       AS column_name, SUM(case when price       is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
SELECT 'customer_id' AS column_name, SUM(case when customer_id is NULL then 1 else 0 end) AS null_values FROM online_retail
UNION
SELECT 'country'     AS column_name, SUM(case when country     is NULL then 1 else 0 end) AS null_values FROM online_retail
ORDER BY null_values DESC

column_name	null_values
customer_id	243007
description	4382
invoice	0
stockcode	0
invoicedate	0
price	0
quantity	0
country	0

About a 1/4 of the values are missing from our customer_id column.

Although that may be very relevant because it represents a large fraction of the data and it looks like the main way to identify an unique customer, we still have plenty to work with and to maybe make some assumptions about why is this data missing or what other ways we can identify a customer.

What are the attributes (columns) and which data types they hold?

SELECT
    column_name,
    data_type
FROM information_schema.columns
WHERE table_name = 'online_retail'

column_name	data_type
invoice	character varying
stockcode	character varying
description	character varying
quantity	bigint
invoicedate	timestamp without time zone
price	double precision
customer_id	double precision
country	character varying

From a quick look at this we can say that this data is about orders made by customers.

Products:
- stockcode: Categorical / Serial. Possibly the unique ID of a product.
- price: Quantitative. Assuming its the unitary price of a product.
- description: Categorical. Description of the product.
Customers:
- customer_id: Serial. Unique ID of a customer.
- country: Categorical. Assuming it's the country of origin for an invoice/customer.
Orders:
- invoice: Serial. Unique ID of each order. A single order can have multiple products.
- invoicedate: Timestamp. Assuming it's in UTC-0 because most of the customers are from the UK.
- quantity: Quantitative. Number of unities ordered for a product.

Understanding the Context

Now that we know more about what data and how much of it we have, it's time to ask some more questions to get a sense of context. We want to use our data to answer some basic who, when, where, what and how many/much questions.

How many unique values?

SELECT 'invoice'     AS column_name, COUNT(DISTINCT invoice)     AS unique_values FROM online_retail
UNION
SELECT 'stockcode'   AS column_name, COUNT(DISTINCT stockcode)   AS unique_values FROM online_retail
UNION
SELECT 'description' AS column_name, COUNT(DISTINCT description) AS unique_values FROM online_retail
UNION
SELECT 'quantity'    AS column_name, COUNT(DISTINCT quantity)    AS unique_values FROM online_retail
UNION
SELECT 'invoicedate' AS column_name, COUNT(DISTINCT invoicedate) AS unique_values FROM online_retail
UNION
SELECT 'price'       AS column_name, COUNT(DISTINCT price)       AS unique_values FROM online_retail
UNION
SELECT 'customer_id' AS column_name, COUNT(DISTINCT customer_id) AS unique_values FROM online_retail
UNION
SELECT 'country'     AS column_name, COUNT(DISTINCT country)     AS unique_values FROM online_retail
ORDER BY unique_values DESC

column_name	unique_values
invoice	53628
invoicedate	47635
customer_id	5942
description	5698
stockcode	5305
price	2807
quantity	1057
country	43

What time window are we looking at?

    SELECT
        TO_CHAR(DATE_TRUNC('mon', invoicedate),'Mon/YYYY') AS month,
        COUNT(distinct invoice)                            AS transactions,
        COUNT(distinct customer_id)                        AS customers,
        SUM(price * quantity)                              AS revenue,
        SUM(quantity)                                      AS quantity

    FROM online_retail
    WHERE price > 0 AND quantity > 0
    GROUP BY month
    ORDER BY month DESC

Now we know that this dataset contains data from 53,628 invoices, made by approximately 5942 different customers from 43 countries who bought more than 5 thousand unique products in a two year window from december 2009 to december 2011.

How many different customers have bought a product from the company?

SELECT
    'customer_id' AS column_name,
    COUNT(DISTINCT customer_id) AS n_unique
FROM online_retail

column_name	unique_values
customer_id	5942

(note that the real number of unique customer may actually be higher, since nearly 1/4 of all customer_id contain null values)

We can take this one step further and get the top 5 countries with most customers.

SELECT
    country,
    COUNT(DISTINCT customer_id) AS n_unique
FROM online_retail
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

country	unique_customers
United Kingdom	5410
Germany	107
France	95
Spain	41
Belgium	29

Most customers are from the United Kingdom, followed by a few neighbouring countries, this might suggest that the data is from a british online retailer that sells mostly within Europe.

Which day had the most transactions happening?

SELECT
    CAST(date_trunc('d',invoicedate) AS DATE) AS day,
    COUNT(distinct invoice)                   AS daily_transactions
FROM online_retail
GROUP BY day
ORDER BY daily_transactions DESC
LIMIT 1

day	transaction_count
2010-11-04	219

The day with most transaction was november the 4th, 2010 with 219 unique transactions.
Let's se the top 10, it's the same query so we just have to increase the limit of returning rows.

...
LIMIT 10

day	daily_transactions
2010-11-04	219
2011-10-06	218
2010-10-05	206
2009-12-22	203
2010-11-11	192
2010-11-24	188
2011-11-10	184
2010-12-09	183
2010-11-25	181
2010-05-11	180

Looks like the peak in transaction volume happens by the end of the year, likely due to holliday season. What about on a weekly basis?

SELECT
    CAST(date_trunc('week',invoicedate) AS DATE) AS week,
    COUNT(distinct invoice) AS weekly_transactions
FROM online_retail
GROUP BY week
ORDER BY weekly_transactions DESC
LIMIT 10

week	weekly_transactions
2011-11-14	894
2010-11-01	891
2010-11-22	854
2010-11-08	834
2011-11-28	824
2010-11-29	808
2010-11-15	795
2011-11-07	781
2011-11-21	754
2010-10-18	750

Seems like november has been the best month for sales on both years.

What is the average value of a transaction?

We will consider the transaction value as the sum of prices times the quantity ordered for any single invoice.
(obs: there are a few prices with negative value and a description of "adjusted bad debit", so we are going to disconsider all negative prices)

The average reduces a series of numbers into a single number, while useful, alone it can lead to misinterpretations. So as to avoid this common pitfall, let's see how are the transaction values distributed along the it's range.

with a as (

    SELECT
        invoice,
        sum(price * quantity)                          AS transaction_value
    FROM online_retail
    WHERE price > 0 and quantity > 0
    GROUP BY 1
    ORDER BY 2 asc

), b as (

    SELECT
        *,
        ntile(4) OVER (ORDER BY transaction_value ASC) AS quartile
    FROM a
)

SELECT
    'First Quartile'                                   AS measure,
    max(transaction_value)                             AS value
    from b
    where quartile = 1
UNION
SELECT
    'Mode'                                             AS measure,
    mode() WITHIN GROUP (ORDER BY transaction_value)   AS value
   FROM b
UNION
SELECT
    'Median'                                           AS measure,
    max(transaction_value)                             AS value
    FROM b
    WHERE quartile = 2
UNION
SELECT
    'Third Quartile'                                   AS measure,
    max(transaction_value)                             AS value
    FROM b
    WHERE quartile = 3
UNION
SELECT
    'Min'                                              AS measure,
    round(min(transaction_value)::numeric,2)           AS value
    FROM b
UNION
SELECT
    'Avg'                                              AS measure,
    round(avg(transaction_value)::numeric,2)           AS value
    FROM b
UNION
SELECT
    'StdDev'                                           AS measure,
    round(stddev_samp(transaction_value)::numeric,2)   AS value
    FROM b
UNION
SELECT
    'Max'                                              AS measure,
    round(max(transaction_value)::numeric,2)           AS value
    FROM b
ORDER BY value ASC

measure	value
Min	0.19
Mode	15.00
First Quartile	151.97
Median	304.32
Third Quartile	504.90
Avg	523.30
StdDev	1517.35
Max	168469.60

SQL is an amazing language and can do most of the work, but it can also get quite verbose for simple things. Using the python library pandas we can obtain the same results with much less typing!
import pandas as pd
sql = """
  SELECT
       distinct invoice,
       sum(price * quantity) as transaction_amount
   FROM online_retail
   where price > 0 AND quantity > 0
   GROUP BY 1
   ORDER BY 2 asc
"""

df = pd.read_sql(sql, engine)
r = df.describe()
r.sort_values('transaction_amount').round(2)

This tell us that while the average transaction is just over $500.00, half of all transactions are under $300.00 and the most common value for a transaction is as low as $15.00.

The standard deviation is roughly 3 times the average, and the maximum price more than $160k, which indicates a large deviation probably due to some large outliers.

In other words, when we ask about averages, we are often looking for a "center of balance" in those numbers. One way to get a feel for this center, is by looking at the skewness of the data, or how unbalanced it is.

With this in mind we know there is a higher concentration of transactions with lower values that decreases fast towards higher values.

let's make this more visual, and plot with the help of some python libraries:

import pandas as pd
from matplotlib import pyplot as plt

sql = """

    SELECT
        invoice,
        sum(price * quantity) AS transaction
    FROM online_retail
    WHERE price > 0 and quantity > 0
    GROUP BY 1
    ORDER BY 2 asc

"""
# the engine param is the actual database api connection.
# I've used the sqlalchemy.create_engine() method for the engine object
# read more at: https://docs.sqlalchemy.org/en/14/core/engines.html
df = pd.read_sql(sql, engine)

# getting the values for our three mains measures of central tendency
mean = df.mean()['transaction']
median = df.median()['transaction']
mode = df.mode()['transaction'].iloc[0]

# plotting the frequency histogram
plt.figure(figsize=(10,5))

# limit x and y axis to improve readability, high-end prices were cut off from the view.
plt.ylim(0,4000)
plt.xlim(-50,2000)

# add labels to each axis
plt.ylabel('Frequency')
plt.xlabel('Transaction Amount')

# the actual histogram method
plt.hist(x=df['transaction'], bins=5000)

# plot the lines with central tendencies
plt.axvline(mean, color='r', linestyle='dashed', linewidth=1)
plt.text(mean + 10, 3600, f'Mean\n{mean:.2f}')

plt.axvline(median, color='r', linestyle='dashed', linewidth=1)
plt.text(median + 10, 3600, f'Median\n{median:.2f}')

plt.axvline(mode, color='r', linestyle='dashed', linewidth=1)
plt.text(mode + 10, 3600, f'Mode\n{mode:.2f}')

With a quick glance at this histogram we can say that the average is 523, but most transactions fall between ~50 and ~300.

Which products are most popular?

What are we going to consider as popular?

By number of transactions:

SELECT
    stockcode,
    description,
    COUNT(distinct invoice) AS transactions
FROM online_retail
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10

stockcode	description	transactions
85123A	WHITE HANGING HEART T-LIGHT HOLDER	5495
22423	REGENCY CAKESTAND 3 TIER	4261
85099B	JUMBO BAG RED RETROSPOT	3320
84879	ASSORTED COLOUR BIRD ORNAMENT	2827
47566	PARTY BUNTING	2699
21232	STRAWBERRY CERAMIC TRINKET BOX	2488
20727	LUNCH BAG BLACK SKULL.	2396
21931	JUMBO STORAGE BAG SUKI	2364
22411	JUMBO SHOPPER VINTAGE RED PAISLEY	2215
22469	HEART OF WICKER SMALL	2174

By total quantity sold

SELECT
    stockcode,
    description,
    sum(quantity) AS total_quantity
FROM online_retail
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10

stockcode	description	total_quantity
84077	WORLD WAR 2 GLIDERS ASSTD DESIGNS	108545
85123A	WHITE HANGING HEART T-LIGHT HOLDER	92453
84879	ASSORTED COLOUR BIRD ORNAMENT	81306
85099B	JUMBO BAG RED RETROSPOT	77671
17003	BROCADE RING PURSE	70700
21977	PACK OF 60 PINK PAISLEY CAKE CASES	56575
84991	60 TEATIME FAIRY CAKE CASES	54366
22197	SMALL POPCORN HOLDER	49616
21212	PACK OF 72 RETROSPOT CAKE CASES	49344
21212	PACK OF 72 RETRO SPOT CAKE CASES	46106

By revenue generated

SELECT
    stockcode,
    description,
    round(sum(price * quantity)::numeric,2) AS revenue
FROM online_retail
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10

stockcode	description	revenue
22423	REGENCY CAKESTAND 3 TIER	327813.65
DOT	DOTCOM POSTAGE	322647.47
85123A	WHITE HANGING HEART T-LIGHT HOLDER	253541.51
47566	PARTY BUNTING	147948.50
85099B	JUMBO BAG RED RETROSPOT	46689.00
84879	ASSORTED COLOUR BIRD ORNAMENT	131413.85
22086	PAPER CHAIN KIT 50'S CHRISTMAS	121662.14
POST	POSTAGE	112341.00
79321	CHILLI LIGHTS	84854.16
84347	ROTATING SILVER ANGELS T-LIGHT HLDR	73814.72

Now, lets try and plot all this information combined to see the performances for the top 10 products, this time with Metabase, our BI tool of choice

Click Here to see this chart on Metabase.

Which products show in all three results?

WITH a as (

    SELECT
        stockcode,
        description,
        COUNT(distinct invoice)                 AS transactions
    FROM online_retail
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 10

), b as (

    SELECT
        stockcode,
        description,
        sum(quantity)                           AS quantity
    FROM online_retail
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 10

), c as (

    SELECT
        stockcode,
        description,
        round(sum(price * quantity)::numeric,2) AS revenue
    FROM online_retail
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 10

)

SELECT
    a.stockcode,
    a.description,
    a.transactions,
    b.quantity,
    c.revenue
FROM a
    INNER JOIN b ON a.stockcode = b.stockcode
    INNER JOIN c ON a.stockcode = c.stockcode
ORDER BY quantity DESC

stockcode	description	transactions	quantity	revenue
85123A	WHITE HANGING HEART T-LIGHT HOLDER	5,495	92,453	253,541.51
84879	ASSORTED COLOUR BIRD ORNAMENT	2,827	81,306	131,413.85
85099B	JUMBO BAG RED RETROSPOT	3,320	77,671	146,689.00

Who would guess heart hanging light holders would be so popular?!

Source

And even if we filter by unique customers these hanging lights still outperform the other products, being ordered by 1.4k unique customers in more than 20 countries!

SELECT
	stockcode,
	description,
	count(DISTINCT customer_id) AS unique_customers,
	count(DISTINCT country)     AS unique_countries
FROM online_retail
GROUP BY 1, 2
ORDER BY 3 desc
LIMIT 5

stockcode	description	unique_customers	unique_countries
85123A	WHITE HANGING HEART T-LIGHT HOLDER	1494	23
22423	REGENCY CAKESTAND 3 TIER	1316	31
22138	BAKING SET 9 PIECE RETROSPOT	1152	31
84879	ASSORTED COLOUR BIRD ORNAMENT	1012	19
22086	PAPER CHAIN KIT 50'S CHRISTMAS	896	12

Additional notes and observations

Negative quantities and prices

While there is just one distinct stockcode and 5 rows where the price is negative, there are more than 4,000 distinct stockcode values where the quantity for that row is negative.

SELECT
    DISTINCT stockcode,
    invoice,
    quantity,
    price,
    description
FROM online_retail
WHERE price < 0
ORDER BY 1

stockcode	invoice	quantity	price	description
B	A506401	1	-53594.36	Adjust bad debt
B	A516228	1	-44031.79	Adjust bad debt
B	A528059	1	-38925.87	Adjust bad debt
B	A563186	1	-11062.06	Adjust bad debt
B	A563187	1	-11062.06	Adjust bad debt

I've noticed most of the invoices starting with C hold negative quantities except for 1 row, but there are plenty other invoices with negative quantities, at first glance i couldn't see any specific identifier for these negative quantities. Perhaps it has to do with cancelled orders, or returning products.

The best way would be to understand where these stockcodes come from and how they are inputed to the system, but while that may not always be possible, a deeper exploration crossing prices, quantities and descriptions could expose some underlying relationships in the data.

Closing remarks

From here on we could continue our exploration, trying to understand all the meaningful relations between products, orders, customers and countries.

There is enough data here alone to learn many things and ask many questions, but like any creative endeavour, one must know when to stop and present what was found so far, get feedback from peers and keep check with relevant stakeholders that rely on your work before going further with the analysis.

Data can tell us a lot, but data alone is not enough, Data Science/Analytics/Engineering is about learning, sharing, questioning and communicating. It's about translation, meaning and memory, about being curious and open-minded.

Good data analytics should feel intuitive and simple, despite how technically challenging it can be to achieve such results.

Dashboard Queries

⇒ Take me to the Dashboard!

The syntax {% raw %}[[ {{ Filter Name }} ]]{% endraw %} is a template syntax used by metabase to allow for dynamic charts using SQL and custom filters.

All the queries in the dashboard contain dynamic filters for month/year and country. The images under each query here show the results for November 2010, but you can input any month from dec/2009 to dec/2011 to see the compiled data for that month.

Banner

SELECT concat('You are viewing data from : ', to_char(mon, 'Month YYYY'))
FROM (

    SELECT date_trunc('mon', invoicedate) AS mon
    FROM online_retail
    {% raw %}
    [[WHERE {{Month}}]]
    {% endraw %}
) AS a
LIMIT 1

Unique Customers

    SELECT COUNT(DISTINCT customer_id)
    FROM online_retail
    WHERE customer_id IS NOT NULL
    {% raw %}
    [[AND {{Month}}]]
    [[AND {{Country}}]]
    {% endraw %}

Transactions

SELECT
    COUNT(DISTINCT invoice) AS Transactions
FROM online_retail
{% raw %}
[[WHERE {{Month}}]]
[[AND {{Country}}]]
{% endraw %}

Revenue

SELECT
    SUM(price * quantity)
FROM online_retail
WHERE price > 0 AND quantity > 0
{% raw %}
[[WHERE {{Month}}]]
[[AND {{Country}}]]
{% endraw %}

Daily Transactions and Cumulative Revenue

SELECT
    day,
    daily_transactions,
    SUM(revenue) OVER (ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue
FROM (

    SELECT
        DATE_TRUNC('d', CAST(invoicedate AS timestamp)) AS day,
        COUNT(distinct invoice)                         AS daily_transactions,
        SUM(price * quantity)                           AS revenue
    FROM online_retail
    WHERE price > 0 AND quantity > 0
    {% raw %}
    [[AND {{Month}}]]
    [[AND {{Country}}]]
    {% endraw %}
    GROUP BY 1
    ORDER BY 1 DESC

) as a

Most Popular Products

SELECT
    stockcode,
    description,
    COUNT(distinct invoice) As transactions,
    SUM(quantity)           AS quantity,
    SUM(price * quantity)   AS revenue
FROM online_retail
{% raw %}
[[WHERE {{Month}}]]
[[AND {{Country}}]]
{% endraw %}
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10

Top 20 Customers

SELECT
    customer_id,
    SUM(price * quantity)   AS revenue,
    SUM(quantity)           AS quantity,
    COUNT(DISTINCT invoice) AS transactions
FROM online_retail
WHERE customer_id IS NOT NULL
  AND price > 0 AND quantity > 0
  {% raw %}
  [[AND {{Month}}]]
  [[AND {{Country}}]]
  {% endraw %}
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20

Top 10 Products - Transactions

SELECT
    stockcode,
    description,
    COUNT(distinct invoice) AS transactions,
    SUM(quantity)           AS quantity
FROM online_retail
{% raw %}
WHERE {{Month}}
[[AND {{Product}}]]
[[AND {{Country}}]]
{% endraw %}
GROUP BY 1, 2
ORDER BY 3 desc
LIMIT 10

Top 10 Products - Quantity

SELECT
    stockcode,
    description,
    COUNT(distinct invoice) as transactions,
    SUM(quantity) as quantity

FROM online_retail
{% raw %}
[[WHERE {{Month}}]]
[[AND {{Country}}]]
{% endraw %}
GROUP BY 1, 2
ORDER BY 4 desc
LIMIT 10

LeonAndrade/LeonAndrade.github.io