/dataengineering-assignment

Prescreening Tasks for Data Engineer

Primary LanguageJupyter NotebookApache License 2.0Apache-2.0

Prescreening Tasks for Data Engineer

This document provides the exercises that are part of the prescreening process for Data Engineering positions. As such, their goal is to provide insights into solutioning process and skills application for solving example tasks related to this position. Thus, please make sure to explain the important details and specify any assumption taken.

Exercise 1

Assume having a Sales table with the following fields:

  • Sales Order ID
  • Sales Order Item
  • Customer ID
  • Date
  • Transaction Value

And another table named Discounts with the following structure:

  • Sales Order ID
  • Customer ID
  • Discount Value

For a customer and a sales order there can exist a discount value in the Discounts table, however there is no information about the precise discount value per each item in the Sales table. Thus, for this exercise it is necessary to:

  • Define Create Table statements for both tables.
  • Define a strategy to allocate the discount values to the Sales table (e.g., in a new column) and define the necessary query/scripts to execute the logic.
  • Consider different scenarios/edge cases (e.g., depending on the data quality) and, optionally, propose possible solutions and/or scripts to handle such cases.

Exercise 2

Implement the solution above, preferably using Python (with Pandas), Java, C#, or another programming language. You can also use a pseudo-code instead. Please make sure to provide comments where needed to the solution provided.

Assumptions

I will assume that this is an e-commerce app, and due to the description of the problem this is a design problem of an OLTP system, the exercises are not asking about agregations, data ingestion, transformations and less how to move data between systems, in order to get insights from the preexisted data, there are no references neither of the velocity of the data nor about the tasks scheduled, all the situations mentioned before are features of the OLAP solutions, Hence, I will avoid based my decisions thinking in OLAP systems.

A sales order is a record that a customer can use to initiate or request a sale. In general, the sales order is filled out by the customer to order certain products from a business. The customer pays for these products when they submit the sales order. This document represents the promise that the requested goods will be ordered by the business and delivered to the customer later. When the goods are fulfilled and picked up by the customer, the sales order is used as a receipt to make sure the correct amount and types of products are given to the customer. In simple terms, a sales order is a request from a customer for specific items.

There are two relevant situations in the description:

  • For a customer and a sales order there can exist a discount value in the Discounts table
    • This is let me understand that only a general discount value could be applied to the whole value of the order
    • Are the discounts being generated for a specific users?
    • There are situations where the order can be applied coupons or promotional codes which assign a general discount to the entire purchase.
  • However there is no information about the precise discount value per each item in the Sales table.
    • Here , the description is taking into account assign different discount values per item, which could implied store that information in another table.
    • Is there a catalog with a different discount for each item or is the discount allocation established in a general way to the entire user's order?

Base case

The initial data model allows assigning a discount per user to each purchased item that appears in the sales table, but there are two problems with this approach:

  1. You cannot add different discounts to each item
  2. You cannot add more than one item of the same type to the order, this would create a problem with the composite primary key.

initial

let's focus in the relevant tables using PostgreSQL

CREATE TABLE IF NOT EXISTS Sales(
   sales_order_id int NOT NULL,
   sales_order_item int NOT NULL, 
   customer_id int null,
   date timestamp not null, 
   transaction_value float not null, 
   discounted_value float not null,
   PRIMARY KEY(sales_order_id, sales_order_item, customer_id)
);


CREATE TABLE IF NOT EXISTS Discounts(
   sales_order_id int not null,
   customer_id int not null,
   discount_value float not null
   PRIMARY KEY(sales_order_id, customer_id)
);

insert into sales values 
(1, 2, 150, NOW(), 200, 200),
(1, 3, 150, NOW(), 310, 310),
(1, 4, 150, NOW(), 80, 80)


insert into discounts values (1, 150, 0.3)
select * from sales

image

select * from discounts

image

WITH sales_updated_CTE(sales_order_id,sales_order_item, customer_id, date,transaction_value, discount)
AS
(
   SELECT 
       s.sales_order_id,
       s.sales_order_item,
       s.customer_id,
       s.date,
       s.transaction_value,
       (s.transaction_value - (s.transaction_value * d.discount_value)) AS discount
   FROM sales s
   JOIN discounts d
   ON s.sales_order_id = d.sales_order_id
   WHERE s.sales_order_id = %s AND s.customer_id = %s
)
UPDATE sales as s
SET discounted_value = c.discount
FROM sales_updated_CTE as c 
WHERE c.sales_order_id = s.sales_order_id 
AND c.customer_id = s.customer_id
AND c.sales_order_item = s.sales_order_item
select * from sales

image

Edge case

The second approach (edge case), considered add more than one unit of the same product to the order (quantity) and create different discounts to each item, and update the discount is less complex

second

CREATE TABLE IF NOT EXISTS Sales2(
   sales_order_id int NOT NULL,
   customer_id int null,
   date timestamp not null, 
   PRIMARY KEY(sales_order_id, customer_id)
);

CREATE TABLE IF NOT EXISTS Order_detail(
   sales_order_id int not null,
   sales_order_item int NOT NULL, 
   transaction_value float not null,
   quantity int not null, 
   discount_value float not null,
   discounted_value float not null,
   PRIMARY KEY(sales_order_id, sales_order_item)
);


insert into Sales2 values (1, 150, NOW())

insert into Order_detail values 
(1, 2, 150, 2, 0.3, 150),
(1, 3, 210, 1, 0.4, 210), 
(1, 4, 80, 3, 0.2, 80)

select * from sales2

image

select * from Order_detail

image

UPDATE Order_detail
SET discounted_value = (transaction_value - (transaction_value * discount_value)) * quantity
WHERE sales_order_id = %s
select * from Order_detail

image

Steps to execute the project

  • Install Docker Desktop on Windows, it will install docker compose as well, docker compose will alow you to run multiple containers applications, this project has two containers with Jupyter Notebook and PostgreSQL

  • Install git-bash for windows, once installed , open git bash and download this repository, this will download all the folders and the docker-compose.yml file, and other files needed.

ramse@DESKTOP-K6K6E5A MINGW64 /c
$ git clone https://github.com/Wittline/dataengineering-assignment.git
  • Once all the files needed were downloaded from the repository , Let's run everything we will use the git bash tool again, go to the folder ~/documents/github/dataengineering-assignment/docker and run the docker compose command
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/dataengineering-assignment/docker (main)
$ docker-compose up

let's wait until all the images and containers are created

image

  • Open a new git bash window again, and use the following command:
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/dataengineering-assignment/docker (main)
$ docker ps

It will show you all the containers and images contained in the docker-compose-yml file, this mean that all the images were created correctly

image

let's check the docker-compose.yml file

version: '3'

services:
    jupyter-notebook:
        image: jupyter/minimal-notebook
        volumes:
            - ./folder:/home/jovyan/work
        ports:
            - 8888:8888
        container_name: jupyter-notebook-container
        environment:
            - JUPYTER_TOKEN=jupyter
        command: jupyter notebook --NotebookApp.iopub_data_rate_limit=3e10
        depends_on:
            - "db"
    db:
        container_name: pg_container
        image: postgres
        restart: always
        environment:
            POSTGRES_USER: "db"
            POSTGRES_PASSWORD: "db"
            POSTGRES_DB: "db_db"
        ports:
            - "5432:5432"
        volumes:
            - pg_data:/var/lib/postgresql/data/

volumes:
    pg_data:

image

The jupyter notebook of this project is called dataengineering-assignment.ipynb over the folder work, there is a copy of this file in the folder called folder in this repository, please keep this file on its folder.

image

Now you can run the project step by step using this notebook, the files sql_queries.py and db_engine.py contains all the logic needed

image

Summary

  • The project was builded with Python version 3 with the help of Jupyter notebook, and the database selected was PostgreSQL, It is a Containerized Development with Docker, so, all the dependencies and the example is must be deplyed using docker, check the file docker-compose-yml in this project.
  • All the answers were given to all items of exercise 1, using two approaches, the base approach which the problem is asking for, and an extra approach called edge case was shown as well.
  • The answer to exercise two is the implementation of the entire application demonstrating the two approaches programmatically, and its execution depends on the correct execution of the containers.

Authors