Batch insertion ideal implementation - Block or Execute method?
pkoosha opened this issue · 1 comment
Hi,
I am trying to insert large batches of data to support bulk insertion. After reading the example and searching through other issues, I am still confused about what the fastest and best implementation of inserting large batches of data into a table is. Could you help me implement this efficiently?
My application is multi-threaded and each thread will insert around 10k rows every 30 seconds.
Here is the `Execute` implementation:

```cpp
clickhouse::Client client(clickhouse::ClientOptions().SetHost("localhost"));

std::string query = "INSERT INTO port_data_counters (guid, port, xmit_data, rcv_data, xmit_pkts, rcv_pkts, "
                    "unicast_xmit_pkts, unicast_rcv_pkts, multicast_xmit_pkts, multicast_rcv_pkts) "
                    "VALUES ";
for (int i = 0; i < 100000; i++) {
    char value_set[256];
    snprintf(value_set, sizeof(value_set), "(%lu, %d, %lu, %lu, %lu, %lu, %lu, %lu, %lu, %lu),",
             (unsigned long)rand(), rand() % 1000, (unsigned long)rand(), (unsigned long)rand(),
             (unsigned long)rand(), (unsigned long)rand(), (unsigned long)rand(), (unsigned long)rand(),
             (unsigned long)rand(), (unsigned long)rand());
    query += value_set;
}
query.erase(query.length() - 1); // Remove the trailing comma

try {
    client.Execute(query);
} catch (const std::exception &e) {
    fprintf(stderr, "Exception: %s\n", e.what());
}
```
Here is the `Block` sample implementation:

```cpp
const int num_rows = 1000000;
const int batch_size = 1000;
const int num_batches = num_rows / batch_size;

for (int i = 0; i < num_batches; ++i) {
    clickhouse::Block block;
    auto guid = std::make_shared<clickhouse::ColumnUInt64>();
    auto port = std::make_shared<clickhouse::ColumnInt32>();
    auto xmit_data = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto rcv_data = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto xmit_pkts = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto rcv_pkts = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto unicast_xmit_pkts = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto unicast_rcv_pkts = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto multicast_xmit_pkts = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();
    auto multicast_rcv_pkts = std::make_shared<clickhouse::ColumnNullable<clickhouse::ColumnUInt64>>();

    for (int j = 0; j < batch_size; ++j) {
        const uint64_t row_id = i * batch_size + j;
        guid->Append(row_id);
        port->Append(123);
        xmit_data->AppendNullable(100);
        rcv_data->AppendNullable(200);
        xmit_pkts->AppendNullable(300);
        rcv_pkts->AppendNullable(400);
        unicast_xmit_pkts->AppendNullable(500);
        unicast_rcv_pkts->AppendNullable(600);
        multicast_xmit_pkts->AppendNullable(700);
        multicast_rcv_pkts->AppendNullable(800);
    }

    block.AppendColumn("guid", guid);
    block.AppendColumn("port", port);
    block.AppendColumn("xmit_data", xmit_data);
    block.AppendColumn("rcv_data", rcv_data);
    block.AppendColumn("xmit_pkts", xmit_pkts);
    block.AppendColumn("rcv_pkts", rcv_pkts);
    block.AppendColumn("unicast_xmit_pkts", unicast_xmit_pkts);
    block.AppendColumn("unicast_rcv_pkts", unicast_rcv_pkts);
    block.AppendColumn("multicast_xmit_pkts", multicast_xmit_pkts);
    block.AppendColumn("multicast_rcv_pkts", multicast_rcv_pkts);

    client.Insert("port_data_counters", block);
}
```
The only way to answer that is to make some measurements. In your synthetic example, composing a string may be the better option, especially if you pre-allocate the query buffer.

In some edge cases, using `Block` + `Client::Insert` would be much, much faster:

- if you have a `ColumnString` and put elements into it as `std::string_view` (so string lifetimes are managed externally and there is less allocation and copying), or
- if the column is `LowCardinality(String)` with many duplicates.

So you have to profile your case and figure it out on your own.
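The "pre-allocate the query buffer" suggestion above can be sketched in plain C++. This is a minimal sketch, not the library's API: it is reduced to two of the question's columns for brevity, and the bytes-per-row estimate passed to `reserve` is an illustrative assumption.

```cpp
#include <cstdio>
#include <string>

// Sketch: compose an INSERT ... VALUES string with a pre-allocated buffer,
// so that `query +=` inside the loop does not trigger repeated reallocations
// as the string grows to many megabytes. The ~16 bytes per row is a rough
// upper-bound estimate for this value format (illustrative assumption).
std::string BuildInsertQuery(int num_rows) {
    std::string query = "INSERT INTO port_data_counters (guid, port) VALUES ";
    query.reserve(query.size() + static_cast<std::size_t>(num_rows) * 16);

    for (int i = 0; i < num_rows; ++i) {
        char value_set[64];
        std::snprintf(value_set, sizeof(value_set), "(%d, %d),", i, i % 1000);
        query += value_set;
    }
    query.pop_back(); // drop the trailing comma
    return query;
}
```

The resulting string would then go to `client.Execute(query)` exactly as in the question; the only change is that the buffer grows once up front instead of reallocating on every append.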