This project was created to show how to use NoSQL databases such as Redis, MongoDB, Neo4j and Cassandra to implement a real life data model of a B2C application and try to check how is the query expressiveness for each database. The data model used for this experiment is the TPCH benchmark data model shown below:
To check the query expressiveness for each database, three SQL queries have been chosen from the same benchmark (TPCH) which are shown below:
This query is used to report the amount of billed, shipped and returned items. The SQL query is shown below:
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
This query is used to get the 10 unshipped orders with the highest value. In order to do that, the query joins three tables (customer, order and lineitem) as shown below:
select
l_orderkey,
sum(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = '[SEGMENT]'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '[DATE]'
and l_shipdate > date '[DATE]'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate;
This query is used to see how well the order priority system is working and gives indication of the customer satisfaction. The SQL query is shown below:
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '[DATE]'
and o_orderdate < date '[DATE]' + interval '3' month
and exists (
select
*
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
This project was created as separate projects for each database where each project can be run alone as a service and each query is built like an API that can be called to test the code. For example for the Cassandra database, you can find the project under "Cassandra/CassandraTPCHQueries/" and for MongoDB the project is under the path "MongoDB/TPCHQueries/" and so forth.
The data used as input for the projects of each database is a small set from the data generated using TPCH DBGEN tool. The data is stored in CSV files that correspond to each object in the TPCH benchmark schema shown above (customer, supplier, part, lineitem, order , etc ...). For more details about how to generate the data, please have a look at this blog post. You can change the used input data by changing the content of the input files that are located under the "data" folder. For example, the input files for Cassandra are located in this path.
To test the projects, you can use run the individual projects for each database and use swagger UI to call the APIs for each query. For example, you can run the (Cassandra/CassandraTPCHQueries/) project by running the following:
gradle run
Then you can find the swagger ui in the following path:
http://localhost:{port}/api/swagger
For each database example project, I have created a detailed explanation that can be found in the below link:
https://github.com/alronz/b2c-database-selection/
Beside the explanation for the projects above, you can find brief documentation for most of the features in each database.
For example, Cassandra project "Cassandra/CassandraTPCHQueries/" documentation is can be found follow :
https://github.com/alronz/b2c-database-selection/blob/master/Cassandra/Examples/tpch.md