This repository contains code examples; please clone it in order to use them.
In a first approach to the problem I came up with a simple design that answers these questions:
- Which category does this Merchant belong to?
- Which are the shipping policies for this Merchant?
- Which other Merchants belong to this Category?
- What is the current cash back rate for this Merchant?
- Is there any ongoing campaign for this Merchant?
- Is there a special cash back rate for this Campaign?
- Which other Merchants have ongoing campaigns at this moment?
As this design shows, Merchant cash back rates have been normalized into a separate table. This way every cash back rate change is registered, including rates that apply to Campaigns, and it also becomes possible to schedule cash back rate changes.
Further Merchant data disaggregation is possible if it is needed to keep track of other changes over time. For instance: Merchant logos, shipping details, or a cash back rate breakdown for specific Merchant categories (external categories).
Extra tables could be added to help answer additional questions when running reports, for example the number of users that visited and shopped at each store.
- I have used UUIDs as primary keys as this would allow easier key generation mechanisms in a database sharding scenario.
- For a real application I would further analyze the indexes created on searchable columns (e.g. BTREE vs. HASH).
- The usage of `enum` in the MerchantLogos table is an example; this should be changed if the possible values for the `type` column will change over time.
Please note the ER diagram relation notation for `MerchantShippingDetails`, `MerchantReturnPolicy`, `MerchantAdditionalInfo` and `MerchantVisits` is wrong; those should be many-to-one relationships.
A `db.sql` file is provided with the proposed database schema, along with some example data, which I'll use to run some queries.
- Clone this repository
- Go to the recently created repository root directory
- Initialize a MySQL container
docker run --name ga_mysql_test -e MYSQL_ROOT_PASSWORD=password -d mysql:5.7
- Load DB Schema
docker exec -i ga_mysql_test sh -c 'exec mysql -u root -p"password"' < db/db.sql
- Load example data
docker exec -i ga_mysql_test sh -c 'exec mysql -u root -p"password"' < db/example_data.sql
- Connect to MySQL DB and run example queries:
docker exec -it ga_mysql_test mysql -u root -p"password" giving_assistant
- Which category does this Merchant belong to?
SELECT c.name FROM Merchants m INNER JOIN Categories c ON c.id = m.category_id
WHERE m.id = '1805de8a-d9ba-11ea-80b6-0242ac110002';
- Which are the shipping policies for this Merchant?
SELECT ms.content FROM MerchantShippingDetails ms
WHERE ms.merchant_id = '1805de8a-d9ba-11ea-80b6-0242ac110002'
ORDER BY created_at DESC
LIMIT 1;
- Which other Merchants belong to this Category?
SELECT m.id, m.name FROM Merchants m
WHERE m.category_id = 'fd3e5c44-d9a2-11ea-80b6-0242ac110002'
AND m.id != '1805de8a-d9ba-11ea-80b6-0242ac110002'
ORDER BY m.name
LIMIT 5;
- What is the current cash back rate for this Merchant?
SELECT r.rate FROM CashBackRates r
WHERE r.merchant_id = '6649d2bf-d9b9-11ea-80b6-0242ac110002'
ORDER BY created_at DESC
LIMIT 1;
The same question could also be answered by looking at the `date_start` and `date_end` columns:
SELECT r.rate FROM CashBackRates r
WHERE r.merchant_id = '6649d2bf-d9b9-11ea-80b6-0242ac110002'
AND r.date_start <= '2020-08-07 10:35:07'
AND (r.date_end >= '2020-08-07 10:35:07' OR r.date_end IS NULL)
ORDER BY created_at DESC
LIMIT 1;
This could be used to ask for the rate on a certain date, either in the future or in the past.
- Is there any ongoing campaign for this Merchant?
SELECT c.id, c.name FROM Campaigns c
WHERE c.merchant_id = '6649d2bf-d9b9-11ea-80b6-0242ac110002'
AND c.date_start <= NOW()
AND (c.date_end >= NOW() OR c.date_end IS NULL);
- Is there a special cash back rate for this Campaign?
SELECT r.rate FROM CashBackRates r
WHERE r.campaign_id = 'd08a4db3-d9a4-11ea-80b6-0242ac110002'
ORDER BY r.created_at DESC
LIMIT 1;
- Which other Merchants have ongoing campaigns at this moment?
SELECT m.id, m.name FROM Merchants m
INNER JOIN Campaigns c ON m.id = c.merchant_id
WHERE c.date_start <= NOW()
AND (c.date_end >= NOW() OR c.date_end IS NULL);
There are different approaches for this issue; I could combine a couple of them and create some sort of Data Intake Engine that processes different data sources.
A first step could be to identify the criticality of the data provided and assign a priority to each data source [1]; this could be understood as a trustworthiness index. So, given a collection of data sources:
DS = {ds1, ds2, ..., dsn}
and a priority set with one priority per data source:
P = {p1, p2, ..., pn}
defined with regard to a certain attribute in the datasets (for instance cash_back_rate), retrieving the cash back rate from DS would be done as follows:
cash_back_rate = loadData('cash_back_rate', DS, P, maxPriority)
This retrieves the attribute value from the data source that has the highest priority.
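For illustration, a minimal PHP sketch of such a `loadData` helper could look like this (the data structures, the function signature and the hard-coded max-priority strategy are assumptions for this sketch, not code from the repository):

```php
<?php
// Minimal sketch: pick an attribute from the data source with the highest priority.
// $dataSources is assumed to be a list of associative arrays (one per source),
// $priorities a parallel list of numeric trustworthiness values.
function loadData(string $attribute, array $dataSources, array $priorities)
{
    $bestValue = null;
    $bestPriority = PHP_INT_MIN;

    foreach ($dataSources as $index => $source) {
        // Skip sources that do not provide the requested attribute.
        if (!array_key_exists($attribute, $source)) {
            continue;
        }
        if ($priorities[$index] > $bestPriority) {
            $bestPriority = $priorities[$index];
            $bestValue = $source[$attribute];
        }
    }

    return $bestValue;
}

// Example: three sources, only two of them report a cash back rate.
$ds = [
    ['cash_back_rate' => 2.0],  // ds1
    ['cash_back_rate' => 2.5],  // ds2
    ['name' => 'Some Store'],   // ds3
];
$p = [1, 3, 2];

echo loadData('cash_back_rate', $ds, $p); // 2.5, taken from ds2 (highest priority)
```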
The previous approach could be overkill in certain situations, and for those cases I could still use duplicate and similar text calculation algorithms.
Duplicate data increases redundancy, which might be a sign that we are dealing with correct data. So, by building an index of duplicate values coming from several data sources, we could arrive at the value that has the highest probability of being correct.
A similar text approach, calculating the Levenshtein distance between two strings, could also lead us to a comparable solution.
For the case of receiving 5 different store names for the same Merchant, I created a simple service that tries to sort out this issue.
Considering the 5 store names as the input dataset, I created the `SimpleDataResolver` service, which analyzes duplicate data and resolves the right value as the one that repeats the most in the input dataset.
For a scenario where there is no duplicate data, I use a similar text search approach to guess the correct value.
This algorithm uses the Levenshtein distance: by comparing the dataset values to each other and measuring the distance between them, I create a similarity map, which maps each value in the input dataset to the most similar value found.
At the end, I take the similarity map as the new input dataset and repeat the duplicate data processing as described above.
This service will not work when the input dataset contains totally unrelated data.
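To make the idea concrete, here is a minimal sketch of the resolution steps described above (hypothetical code, not the actual `SimpleDataResolver` implementation shipped in this repository):

```php
<?php
// Sketch of the duplicate-count + Levenshtein similarity-map resolution idea.
function resolveValue(array $values): string
{
    // 1. Duplicate data: if any value repeats, pick the one that repeats the most.
    $counts = array_count_values($values);
    arsort($counts);
    if (reset($counts) > 1) {
        return (string) key($counts);
    }

    // 2. No duplicates: map each value to its most similar neighbour
    //    using the Levenshtein distance.
    $similarityMap = [];
    foreach ($values as $i => $value) {
        $closest = $value;
        $minDistance = PHP_INT_MAX;
        foreach ($values as $j => $other) {
            if ($i === $j) {
                continue;
            }
            $distance = levenshtein($value, $other);
            if ($distance < $minDistance) {
                $minDistance = $distance;
                $closest = $other;
            }
        }
        $similarityMap[] = $closest;
    }

    // 3. Treat the similarity map as the new input dataset and count duplicates again.
    $counts = array_count_values($similarityMap);
    arsort($counts);

    return (string) key($counts);
}

// Example with 5 slightly different store names for the same Merchant.
echo resolveValue(['Foo Store', 'Foo Stor', 'FooStore', 'Foo  Store', 'Bar Shop']);
```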
A code example and its corresponding test were added in `src/DataIntake/SimpleDataResolver` and `tests/SimpleDataResolverTest`. Here is how to use it:
- Clone this repository
- Go to the recently created repository root directory
- Run `composer install`
docker run --rm -it --volume $PWD:/app composer install
- Run phpunit
docker run --rm -it --volume $PWD:/app composer ./vendor/bin/phpunit
- The service could be improved by calculating the similarity map in several passes.
- Further analysis is required to use this for critical data.
According to the data model shown above, if a full Merchant record is required, a SQL query like the following will be needed:
SELECT
m.id,
m.category_id,
m.name,
m.about,
(SELECT ml.url FROM MerchantLogos ml WHERE ml.type = '0' AND ml.merchant_id = {MERCHANT_ID}
ORDER BY ml.created_at DESC LIMIT 1) as header_logo,
(SELECT ml.url FROM MerchantLogos ml WHERE ml.type = '1' AND ml.merchant_id = {MERCHANT_ID}
ORDER BY ml.created_at DESC LIMIT 1) as mobile_logo,
(SELECT ms.content FROM MerchantShippingDetails ms WHERE ms.merchant_id = {MERCHANT_ID}
ORDER BY ms.created_at DESC LIMIT 1) as shipping_details,
(SELECT cb.rate FROM CashBackRates cb WHERE cb.merchant_id = {MERCHANT_ID}
AND cb.date_start <= NOW() AND (cb.date_end >= NOW() OR cb.date_end IS NULL)
ORDER BY cb.created_at DESC LIMIT 1) as cash_back_rate
FROM Merchants m
WHERE m.id = {MERCHANT_ID}
LIMIT 1;
This query could be expensive, and if we are looking to support high-concurrency traffic it would really help to have a cache layer.
For this purpose I would use Redis, so that whenever a Merchant record is required the application looks up the Redis cache first and tries to hydrate a Merchant object from cached data.
If no data is associated with the merchant key, the above query will be run against the database server. The resulting data will then be used to hydrate a Merchant object and will also be pushed to the Redis cache for future use.
With this design it will also be possible to write data to the cache at the same time we write data to the datastore, so consistency is kept.
For data that change over time it is possible to set an expiration time in Redis. For example, if a CashBackRate record has a `date_end` value set, a timeout value could be calculated and passed to Redis, so that Redis automatically invalidates the cached value when it gets old.
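A minimal sketch of this cache-aside flow could look like the following, assuming the phpredis extension and a hypothetical `loadMerchantFromDatabase()` function that runs the full Merchant query shown above:

```php
<?php
// Cache-aside sketch for Merchant records (assumes the phpredis extension;
// loadMerchantFromDatabase() is a hypothetical function running the SQL above).
function findMerchant(Redis $redis, string $merchantId): array
{
    $cacheKey = 'merchant:' . $merchantId;

    // 1. Look up the Redis cache first and hydrate from it when possible.
    $cached = $redis->get($cacheKey);
    if ($cached !== false) {
        return json_decode($cached, true);
    }

    // 2. Cache miss: run the query against the database server.
    $merchant = loadMerchantFromDatabase($merchantId);

    // 3. Push the data to Redis for future use. If the current cash back rate
    //    has a date_end (hypothetical field here), expire the key at that
    //    moment so Redis invalidates the cached value automatically.
    $ttl = 3600; // default TTL, an arbitrary value for this sketch
    if (!empty($merchant['cash_back_rate_date_end'])) {
        $ttl = max(1, strtotime($merchant['cash_back_rate_date_end']) - time());
    }
    $redis->setex($cacheKey, $ttl, json_encode($merchant));

    return $merchant;
}
```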
If we take a look at the proposed design, we can see that Merchant cash back rates are handled in a separate table.
Every time the cash back rate for a given Merchant changes, existing data is not overwritten; a new record containing the up-to-date rate is added instead, with a timestamp of the moment in which the change occurred. This way we can ask for the latest cash back rate at any time while also keeping the historical changes.
In addition, by having the `date_start` and `date_end` columns, cash back rate changes can be scheduled so that they apply automatically in the future, as shown in the sketch below.
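Scheduling a rate change that only becomes effective later can be as simple as inserting a row with a future `date_start`. Here is a sketch using PDO; the exact column list of `CashBackRates` (notably `id` and `created_at`) is assumed from the example queries above:

```php
<?php
// Sketch: schedule a future cash back rate change by inserting a row whose
// date_start lies in the future. Column names are assumed from the example queries.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=giving_assistant', 'root', 'password');

$stmt = $pdo->prepare(
    'INSERT INTO CashBackRates (id, merchant_id, rate, date_start, date_end, created_at)
     VALUES (UUID(), :merchant_id, :rate, :date_start, :date_end, NOW())'
);

// Until 2020-09-01 the queries filtering on date_start/date_end keep returning
// the currently active rate; after that date this record takes over.
$stmt->execute([
    ':merchant_id' => '6649d2bf-d9b9-11ea-80b6-0242ac110002',
    ':rate'        => 5.0,
    ':date_start'  => '2020-09-01 00:00:00',
    ':date_end'    => null,
]);
```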
The obvious design solution for this project was to have a normalized schema in a relational database. It has many advantages in terms of data integrity and consistency, but as the schema evolves the queries become more and more complex and need to join many tables together in order to satisfy them.
So the key is to decouple data persistence from the domain model to avoid the object-relational impedance mismatch [2].
Rather than saving mutable state in database and update it in place, we save the succession of events that brought us to the current state.
By doing this we can always derive current state by replaying all the events from the beginning of time.
When using this approach we are not overwriting data, but merely accumulating facts about what happened in the domain.
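As a tiny illustration of replaying events to derive state (the event shape is a made-up example, not tied to any particular tool yet):

```php
<?php
// Replay a stream of events to derive the current state of a Merchant.
$events = [
    ['type' => 'CashBackRateChanged', 'rate' => 2.0],
    ['type' => 'CampaignStarted',     'name' => 'Summer Sale'],
    ['type' => 'CashBackRateChanged', 'rate' => 3.5],
];

$state = ['cash_back_rate' => null, 'active_campaign' => null];
foreach ($events as $event) {
    switch ($event['type']) {
        case 'CashBackRateChanged':
            $state['cash_back_rate'] = $event['rate'];
            break;
        case 'CampaignStarted':
            $state['active_campaign'] = $event['name'];
            break;
    }
}

echo $state['cash_back_rate']; // 3.5 — the latest rate, derived purely from events
```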
This design could be implemented with Apache Kafka in combination with Kafka Streams.
Kafka is a distributed append-only log and can also be used as a pub-sub system or a queue. By using this tool we can turn our application's use cases into commands that need to be processed. These commands will generate events, which are published into Kafka topics.
The events are then consumed by different event handlers. One of those handlers could write the events back to a normalized relational database that we can query at any time.
Since multiple handlers can listen to the same event, other handlers could send a response to a pub-sub topic to which the user's browser could be connected through WebSockets.
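Publishing such events from PHP could look roughly like this (a sketch assuming the php-rdkafka extension and a broker reachable at `kafka:9092`; the topic and event names are made up for the example):

```php
<?php
// Sketch: publish a domain event to a Kafka topic using the php-rdkafka extension.
$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'kafka:9092'); // assumed broker address

$producer = new RdKafka\Producer($conf);
$topic = $producer->newTopic('merchant-events'); // assumed topic name

// Event generated by a "change cash back rate" command.
$event = [
    'type'        => 'CashBackRateChanged',
    'merchant_id' => '6649d2bf-d9b9-11ea-80b6-0242ac110002',
    'rate'        => 5.0,
    'occurred_at' => date(DATE_ATOM),
];

// Let Kafka choose the partition; handlers such as the relational-database
// projection consume this topic and update their own views.
$topic->produce(RD_KAFKA_PARTITION_UA, 0, json_encode($event));
$producer->flush(10000); // wait up to 10 seconds for delivery
```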
Kafka Streams is a library that allows us to do stream processing on top of Kafka. Streams are "data in flight": a continuously updating dataset which is ordered and replayable. Kafka Streams also offers the possibility to see streams as tables. A table in Kafka Streams is a collection of evolving facts and can be seen as a point-in-time view of aggregated data.
[1]. Ali, R.; Siddiqi, M. H.; Ahmed, M. I.; Ali, T.; Hussain, S.; Huh, E.; Kang, B.H.; Lee S. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare. Sensors 2015, 15, 15772-15798.
[2]. Ireland, J.C. Object-relational impedance mismatch : a framework based approach. Open University 2011.