In my recent project, I was tasked with optimizing a highly parallelizable computing system that could be distributed across multiple machines. To manage the coordination, I designated one machine as the coordinator. This coordinator's role was to dispatch computing jobs to other machines, referred to as workers, aggregate the results, and provide an efficient querying interface.
Specifically, the task involved computing pairwise distances for a sample set of size `n`, which amounts to `n * (n - 1) / 2` distance results.
Given the computationally intensive nature of the task, preserving the computed results was crucial. While minor data corruption was tolerable, the integrity of the majority of the results had to be ensured, so a robust database management system (DBMS) became necessary. After evaluating factors such as development speed, deployment cost, performance, data reliability, and the existing code base, I selected SQLite as the storage backend for its balance of these attributes, and Peewee as the ORM framework for fast development.
During development, I noticed that Peewee and SQLite out of the box yielded unsatisfying performance, but a few simple tricks boosted it considerably. The following benchmark was designed to quantify the performance gain.
I categorized the operations on SQLite into two types: insertion and querying. More specifically:

- Insertion: occurred at task preparation, which inserted many rows into the database to initialize the computing task. It could be further divided into two phases:
  - sample insertion: register the samples in the `sample` table
  - result allocation: pair the samples one-by-one and insert each pair with `val = -1` into the `result` table
- Query: occurred at result analysis, which queried the distance for a given list of sample pairs
The optimization tricks were:
- WAL: use the `WAL` journal mode instead of the default mode, `DELETE`
- TSC: wrap operations in a transaction
- BLK: insert/query rows in bulk, with a single SQL execution containing many placeholders
- JSON: insert/query rows in bulk, transferring the data as JSON instead of using many parameter placeholders (`?`)
- SPC: problem-specific tricks that leverage the problem structure and use raw SQL scripts executed natively in SQLite. SPC tricks are usually hard to adapt to other problems, but can be extremely efficient
One trick may be adopted in combination with other tricks for even higher performance. A specific combination is denoted as a method. The special method that includes no tricks is denoted as `1b1`, meaning that it inserts/queries multiple rows one-by-one. Though the benchmark could not cover all possible methods, it made its best effort to be representative of common usage.
To quantify the performance, rows per second (RPS) was selected as the metric. It is defined as the maximum number of rows read or written within one second. The higher the RPS a method achieves, the better its performance. RPS is calculated by timing the insertion/query duration at a series of operation scales. To prevent slow methods from congesting my computer, I set a timeout for each test.
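To make the measurement concrete, the timing loop might look like the following sketch (hypothetical harness code, not the actual benchmark; `run_method` and `TIMEOUT_S` are assumed names):

```python
import time

TIMEOUT_S = 3.0  # assumed per-test time limit

def measure_rps(run_method, row_count):
    """Run one insertion/query test and convert the duration to rows per second."""
    start = time.perf_counter()
    run_method(row_count)
    elapsed = time.perf_counter() - start
    if elapsed > TIMEOUT_S:
        return None  # reported as TIMEOUT in the tables below
    return row_count / elapsed
```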
The following tables list the best performance achieved by the different methods.
RPS for insertions:
Method | -- | WAL | TSC | TSC + WAL |
---|---|---|---|---|
1b1 | 153.9 (1x) | 497.4 (3x) | 13875.0 (90x) | 13960.7 (90x) |
BLK | 54101.3 (352x) | -- | -- | -- |
JSON | 289494.5 (1881x) | -- | -- | -- |
SPC | 612193.6 (3978x) | 601393.5 (3908x) | -- | -- |
- `--` in the column header indicates that no special tricks were used
- `--` in a cell means that the method was not tested
- The results are formatted as `absolute RPS (speed relative to 1b1)`. For example, `497.4 (3x)` indicates that the method achieves 497.4 rows per second and is 3 times faster than the `1b1` method
Based on the results, I found the following primary causes of the poor performance:
- Transaction overhead. Transaction overhead is the time consumed by creating, committing, or rolling back transactions. It was the primary cause of the poor performance. Simply switching the journal mode from `DELETE`, the SQLite default, to `WAL` made the RPS 3 times higher. The trick has no side effects and is recommended for everyone. It can be done as simply as:

  ```python
  from peewee import SqliteDatabase

  conn = SqliteDatabase(path, pragmas={'journal_mode': 'wal'})  # Enable WAL
  ```

  For even better performance, wrapping the operations in a single transaction, when possible, gives a huge acceleration: 90 times faster than the `1b1` method. Here is the code for the `1b1 + TSC` method:

  ```python
  # File: db.py
  class Sample(peewee.Model):
      id = peewee.AutoField()
      path = peewee.TextField(unique=True)
      # ...

  # File: insert_bench.py
  with db.conn.atomic():  # Wrap the writes into one transaction
      for i in range(sample_count):
          path = sample_path_fmt.format(i)
          db.Sample.create(path=path)
  ```
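  As a side note, Peewee's `atomic()` also works as a decorator; a minimal sketch of the same loop in that style (the function name is my own):

  ```python
  @db.conn.atomic()  # every call runs inside a single transaction
  def insert_samples(sample_count, sample_path_fmt):
      for i in range(sample_count):
          db.Sample.create(path=sample_path_fmt.format(i))
  ```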
- Binding overhead. Binding overhead is typically introduced by the higher-level library, i.e. Peewee in this benchmark. The bindings add overhead to each statement and slow down the overall performance. For example, in the code of `1b1 + TSC`, each call of `db.Sample.create(path=path)` invokes one SQL execution and adds the unwanted binding overhead.

  An intuitive optimization is to batch many operations into a single SQL execution, such as `BLK`:

  ```python
  samples = []
  for i in range(sample_count):
      path = sample_path_fmt.format(i)
      samples.append({"path": path})  # instead of executing SQL in the loop

  db.Sample.insert_many(samples).execute()  # batch them into one SQL execution
  ```
  It generates SQL like:

  ```sql
  INSERT INTO sample (path) VALUES (?), (?), (?), ..., (?);
  ```
  Compared with `1b1 + TSC`, the `BLK` method was about 4 times faster. But for an extremely large insertion, the number of parameters (the number of `?` placeholders in the SQL) may exceed the SQLite limit, so transferring the data in JSON format is a better option. Here is the code for the `JSON` method:

  ```python
  SQL = """
  INSERT INTO sample (path)
  SELECT j.value ->> '$.path'
  FROM json_each(?) AS j;
  """

  samples = []
  for i in range(sample_count):
      samples.append({"path": sample_path_fmt.format(i)})

  db.conn.execute_sql(SQL, (json.dumps(samples),))
  ```
  From the benchmark results, the `JSON` method was about 5 times faster than `BLK`. It was the fastest method that did not leverage the problem-specific structure.
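  If you would rather stay within the ORM, the placeholder limit can also be worked around by splitting the bulk insert into chunks. A minimal sketch using Peewee's `chunked` helper (the chunk size of 500 is an arbitrary assumption):

  ```python
  from peewee import chunked

  with db.conn.atomic():
      for batch in chunked(samples, 500):  # at most 500 rows, i.e. 500 placeholders, per statement
          db.Sample.insert_many(batch).execute()
  ```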
The methods above are universal and can be applied in other scenarios. But for this benchmark, the fastest method, `SPC`, makes use of the problem structure:

```python
SQL = """
INSERT INTO result
SELECT s1.id, s2.id, -1
FROM sample AS s1 JOIN sample AS s2
ON s1.id < s2.id;
"""
db.conn.execute_sql(SQL)
```

It also avoids the third type of overhead, data transfer: the overhead of moving data between Python and SQLite. `SPC` is 2 times faster than `JSON`. Though performant, the method is not as universal as the previous ones, as it requires that the inserted information can be computed directly from an existing table, which is not a common situation.
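As a quick sanity check (hypothetical verification code, not part of the benchmark), the allocation should produce exactly one row per unordered sample pair:

```python
(n,) = db.conn.execute_sql("SELECT COUNT(*) FROM sample;").fetchone()
(rows,) = db.conn.execute_sql("SELECT COUNT(*) FROM result;").fetchone()
assert rows == n * (n - 1) // 2  # every unordered pair allocated exactly once
```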
RPS for queries:
Method | -- | WAL |
---|---|---|
1b1 | 4217.2 (1x) | 5065.9 (1.2x) |
JSON | 99220.8 (24x) | 98848.6 (23x) |
SPC | 221999.2 (52x) | 217852.7 (52x) |
Query optimization is similar to insertion optimization, except that the transaction overhead has far less impact on performance. Querying rows one-by-one gave an RPS of 4217.2, much higher than its insertion counterpart.
To optimize the query performance, we should again minimize the binding overhead, but the `BLK` trick is not usable. Here is the code of the `BLK` method for queries:

```python
from functools import reduce

qry_pairs = __gen_qry_pair(opt)
conditions = [
    (db.Result.a == a) & (db.Result.b == b)
    for a, b in qry_pairs
]
conditions = reduce(lambda a, b: a | b, conditions)
qry = db.Result.select().where(conditions)
```
In the benchmark, it quickly ran into an error when querying only about 60 results. I was not sure whether the restriction was imposed by Peewee or SQLite, and quickly switched to the `JSON` method:
```python
SQL = """
SELECT r.a_id, r.b_id, r.val
FROM result AS r JOIN json_each(?) AS j
WHERE r.a_id = (j.value ->> '$.a') AND r.b_id = (j.value ->> '$.b');
"""

ary_list = []
for a, b in qry_pairs:
    ary_list.append({"a": a, "b": b})

rst = db.conn.execute_sql(SQL, (json.dumps(ary_list),))
```
The `JSON` method is 24 times faster than `1b1` and universal enough for most applications.
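Note that `execute_sql` returns a standard DB-API cursor, so consuming the rows is plain cursor usage. A minimal sketch (assuming the `rst` cursor from above):

```python
# Materialize the result into a {(a_id, b_id): val} lookup table
distances = {(a, b): val for a, b, val in rst.fetchall()}
```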
In the project mentioned in the introduction, I needed to query the results of all pairs between two sample sets, which inspired the method denoted as `SPC` in the table. The code is listed below:
```python
SQL = """
WITH pair AS (
    SELECT DISTINCT
        CASE
            WHEN s1.value < s2.value THEN s1.value
            ELSE s2.value
        END AS aid,
        CASE
            WHEN s1.value < s2.value THEN s2.value
            ELSE s1.value
        END AS bid
    FROM json_each(?) AS s1, json_each(?) AS s2
)
SELECT aid, bid, r.val
FROM pair
JOIN result AS r
ON r.a_id = pair.aid AND r.b_id = pair.bid;
"""
rst = db.conn.execute_sql(SQL, (json.dumps(idset1), json.dumps(idset2)))
```
It is both convenient and performant, ideal for my project, but may not be usable in other applications.
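For illustration, with two hypothetical id sets the query fetches the results of all cross pairs in one round trip:

```python
idset1 = [1, 2, 3]
idset2 = [4, 5]
rst = db.conn.execute_sql(SQL, (json.dumps(idset1), json.dumps(idset2)))
print(rst.fetchall())  # up to 6 rows of (aid, bid, val)
```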
In summary, the takeaways are:

- Enable `WAL` mode in any situation
- Wrap writes into a single transaction
- Combine multiple SQL executions into one. The relevant data can be transferred by either:
  - SQL parameters, more compatible with ORM frameworks
  - JSON functions, more performant but may require writing SQL manually
- Inspect the problem, and find optimization methods based on the problem structure
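As a recap, here is a minimal sketch combining the universal tricks (WAL + a single transaction + JSON bulk insert); the database file name and the shape of `rows` are assumptions for illustration:

```python
import json
from peewee import SqliteDatabase

conn = SqliteDatabase("bench.db", pragmas={"journal_mode": "wal"})  # WAL journal mode

SQL = """
INSERT INTO sample (path)
SELECT j.value ->> '$.path' FROM json_each(?) AS j;
"""

def bulk_insert(rows):
    """Insert many {"path": ...} dicts in one transaction and one SQL execution."""
    with conn.atomic():  # TSC: a single transaction
        conn.execute_sql(SQL, (json.dumps(rows),))  # JSON: one statement, one parameter
```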
The database comprised two tables, `sample` and `result`.
Example rows in `sample`:
id | path |
---|---|
1 | /dataset/xxxx.mat |
2 | /dataset/yyyy.mat |
... | ... |
- `id` was the auto-incrementing primary key assigned by SQLite
- `path` was the path pointing to the sample file
Example rows in `result`:
a_id | b_id | val |
---|---|---|
1 | 2 | -1 |
1 | 3 | 3 |
... | ... | ... |
- `a_id`, `b_id` were foreign keys referring to the `sample` table, and together they formed the primary key of `result`. To simplify coding, we stipulated that `a_id` must be smaller than `b_id`, and the constraint, `a_id < b_id`, was defined at table creation.
- `val` was the calculation result, with negative values used as special flags: -1 meant pending distribution, and -2 meant distributed and awaiting a result.
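For completeness, a sketch of how the two tables might be declared with Peewee (the field options are my assumptions; the composite key and `CHECK` constraint mirror the rules above):

```python
import peewee

conn = peewee.SqliteDatabase("bench.db", pragmas={"journal_mode": "wal"})

class Sample(peewee.Model):
    id = peewee.AutoField()               # auto-incrementing primary key
    path = peewee.TextField(unique=True)  # path to the sample file

    class Meta:
        database = conn
        table_name = "sample"

class Result(peewee.Model):
    a = peewee.ForeignKeyField(Sample, column_name="a_id")
    b = peewee.ForeignKeyField(Sample, column_name="b_id")
    val = peewee.FloatField(default=-1)   # -1: pending, -2: distributed, awaiting result

    class Meta:
        database = conn
        table_name = "result"
        primary_key = peewee.CompositeKey("a", "b")
        constraints = [peewee.Check("a_id < b_id")]
```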
The cells in the benchmark tables below are formatted as `RPS (elapsed time)`; TIMEOUT means the test exceeded its time limit.

Sample Insertion
Sample Count | 1b1 | 1b1 + TSC | 1b1 + WAL | 1b1 + WAL + TSC | BLK | JSON |
---|---|---|---|---|---|---|
100 | 153.9 (0.6 s) | 6328.8 (0.0 s) | 498.8 (0.2 s) | 9793.2 (0.0 s) | 12189.2 (0.0 s) | 40469.4 (0.0 s) |
1000 | TIMEOUT | 12589.3 (0.1 s) | 497.4 (2.0 s) | 13482.8 (0.1 s) | 54946.3 (0.0 s) | 227241.7 (0.0 s) |
10000 | TIMEOUT | 13875.0 (0.7 s) | TIMEOUT | 13960.7 (0.7 s) | 91620.1 (0.1 s) | 365874.2 (0.0 s) |
100000 | TIMEOUT | TIMEOUT | TIMEOUT | TIMEOUT | 86441.1 (1.2 s) | 332334.6 (0.3 s) |
Result Allocation (Slow Methods)
Result Count | 1b1 | 1b1 + WAL | 1b1 + TSC |
---|---|---|---|
45 | 138.7 (0.3 s) | 479.7 (0.1 s) | 3845.6 (0.0 s) |
435 | TIMEOUT | 494.7 (0.9 s) | 8805.7 (0.0 s) |
1225 | TIMEOUT | 494.8 (2.5 s) | 9919.7 (0.1 s) |
2415 | TIMEOUT | TIMEOUT | 9890.0 (0.2 s) |
4005 | TIMEOUT | TIMEOUT | 10282.4 (0.4 s) |
Result Allocation (Fast Methods)
Result Count | BLK | 1b1 + WAL + TSC | JSON |
---|---|---|---|
4950 | 53335.5 (0.1 s) | 10288.6 (0.5 s) | 224493.9 (0.0 s) |
11175 | 52545.4 (0.2 s) | 10347.7 (1.1 s) | 265938.5 (0.0 s) |
19900 | 54101.3 (0.4 s) | 10334.3 (1.9 s) | 280637.4 (0.1 s) |
31125 | 52797.5 (0.6 s) | TIMEOUT | 289494.5 (0.1 s) |
44850 | 51066.5 (0.9 s) | TIMEOUT | 283312.1 (0.2 s) |
61075 | 48488.1 (1.3 s) | TIMEOUT | 274972.9 (0.2 s) |
79800 | 50185.3 (1.6 s) | TIMEOUT | 266527.4 (0.3 s) |
101025 | TIMEOUT | TIMEOUT | 281844.1 (0.4 s) |
124750 | TIMEOUT | TIMEOUT | 278266.0 (0.4 s) |
150975 | TIMEOUT | TIMEOUT | 273094.7 (0.6 s) |
179700 | TIMEOUT | TIMEOUT | 274961.6 (0.7 s) |
210925 | TIMEOUT | TIMEOUT | 278077.0 (0.8 s) |
244650 | TIMEOUT | TIMEOUT | 270197.4 (0.9 s) |
280875 | TIMEOUT | TIMEOUT | 271831.2 (1.0 s) |
319600 | TIMEOUT | TIMEOUT | 272458.6 (1.2 s) |
360825 | TIMEOUT | TIMEOUT | 197347.6 (1.8 s) |
404550 | TIMEOUT | TIMEOUT | 165734.0 (2.4 s) |
450775 | TIMEOUT | TIMEOUT | TIMEOUT |
499500 | TIMEOUT | TIMEOUT | TIMEOUT |
Result Allocation (Problem Specific Methods)
Result Count | SPC | SPC + WAL |
---|---|---|
4950 | 354747.2 (0.0 s) | 408807.1 (0.0 s) |
19900 | 556475.2 (0.0 s) | 571218.6 (0.0 s) |
44850 | 555178.7 (0.1 s) | 601393.5 (0.1 s) |
79800 | 600041.5 (0.1 s) | 540840.9 (0.1 s) |
124750 | 612193.6 (0.2 s) | 439438.8 (0.3 s) |
179700 | 597780.5 (0.3 s) | 456027.0 (0.4 s) |
244650 | 610217.8 (0.4 s) | 451484.7 (0.5 s) |
319600 | 598742.0 (0.5 s) | 432308.7 (0.7 s) |
404550 | 261878.8 (1.5 s) | 248389.2 (1.6 s) |
499500 | 199371.1 (2.5 s) | 186666.5 (2.7 s) |
604450 | 169896.9 (3.6 s) | 159021.0 (3.8 s) |
719400 | 154525.9 (4.7 s) | 141470.0 (5.1 s) |
844350 | 141343.6 (6.0 s) | 126209.7 (6.7 s) |
Query

Query Count | 1b1 | JSON | SPC | 1b1 + WAL | JSON + WAL | SPC + WAL |
---|---|---|---|---|---|---|
100 | 3936.8 (0.0 s) | 49980.0 (0.0 s) | 99147.3 (0.0 s) | 4746.2 (0.0 s) | 55549.4 (0.0 s) | 71428.6 (0.0 s) |
1000 | 4077.2 (0.2 s) | 87725.5 (0.0 s) | 166666.7 (0.0 s) | 5065.9 (0.2 s) | 82861.0 (0.0 s) | 166800.1 (0.0 s) |
2000 | 4116.4 (0.5 s) | 92591.7 (0.0 s) | 188686.4 (0.0 s) | 5009.2 (0.4 s) | 94344.1 (0.0 s) | 201373.4 (0.0 s) |
4000 | 3973.5 (1.0 s) | 94552.8 (0.0 s) | 215070.0 (0.0 s) | 4946.8 (0.8 s) | 95553.0 (0.0 s) | 223314.0 (0.0 s) |
8000 | 4029.9 (2.0 s) | 96619.1 (0.1 s) | 208464.7 (0.0 s) | 4996.4 (1.6 s) | 97730.0 (0.1 s) | 216231.4 (0.0 s) |
10000 | 4084.7 (2.4 s) | 98878.3 (0.1 s) | 221999.2 (0.0 s) | 5017.7 (2.0 s) | 96509.4 (0.1 s) | 208531.4 (0.0 s) |
11000 | 4078.2 (2.7 s) | 99220.8 (0.1 s) | 220002.6 (0.0 s) | 5020.6 (2.2 s) | 97943.9 (0.1 s) | 210575.5 (0.1 s) |
12000 | 4141.0 (2.9 s) | 98586.4 (0.1 s) | 205541.4 (0.1 s) | 5039.6 (2.4 s) | 97616.7 (0.1 s) | 207652.7 (0.1 s) |
13000 | 4157.3 (3.1 s) | 98206.0 (0.1 s) | 210088.8 (0.1 s) | 5050.9 (2.6 s) | 98848.6 (0.1 s) | 208302.6 (0.1 s) |
14000 | 4217.2 (3.3 s) | 98754.8 (0.1 s) | 211585.2 (0.1 s) | 4890.7 (2.9 s) | 97996.7 (0.1 s) | 217852.7 (0.1 s) |