batch insert - too slow
mikeTWC1984 opened this issue · 10 comments
I was testing an insert query like the one below (for a JDBC engine table).

```sql
insert into jdbc_test(a, b)
select number*3, number*5 from system.numbers n limit 5000
```
I was using the mysql and mssql drivers. In both cases the results were very slow - 200-300 rows per second (using a tiny table with 2 int columns). This is about the same speed as inserting rows one by one with auto commit. rewriteBatchedStatements=true was set. I did a few more tests using plain Java for mysql: if I turn off auto commit I can get 7-8K rows per second, even when inserting one by one. Using the batch API this goes up to 80K per second.
So it sounds like when the bridge is writing data it's using very small batches (or not using batches at all?).
Regarding auto commit - it is true by default. However, during batch writing I think it's more logical to start a transaction, to avoid partial inserts (on crash). It would also speed up inserts with smaller batches.
Here is a quick and dirty snippet for testing. It assumes there is a database "test" and a table "test" with (a int, b int) columns.
```java
import java.sql.DriverManager;
import java.sql.SQLException;

/* usage
// copy 2000 rows 1 by 1, with auto commit
java --source 11 -cp "./mysql-connector-java-8.0.26.jar" jt.java 2000

// copy 50K rows in 1000 row batches with auto commit (after each batch insertion)
java --source 11 -cp "./mysql-connector-java-8.0.26.jar" jt.java 50000 1000

// copy 50K rows in 1000 row batches, single transaction
java --source 11 -cp "./mysql-connector-java-8.0.26.jar" jt.java 50000 1000 false
*/
class Jt {
    public static void main(String[] args) {
        var rowsToCopy = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        var batchSize = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        boolean autoCommit = args.length > 2 ? Boolean.parseBoolean(args[2]) : true;
        try {
            long start = System.currentTimeMillis();
            var conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/test?user=root&password=root&rewriteBatchedStatements=true");
            conn.setAutoCommit(autoCommit);
            var stmt = conn.prepareStatement("insert into test.test values(?,?)");
            if (batchSize == 1) {
                System.out.println("No batching");
                for (int i = 1; i <= rowsToCopy; i++) {
                    stmt.setInt(1, i * 3);
                    stmt.setInt(2, i * 5);
                    stmt.executeUpdate();
                }
            } else {
                System.out.println("batch size: " + batchSize);
                for (int i = 1; i <= rowsToCopy; i++) {
                    stmt.setInt(1, i * 3);
                    stmt.setInt(2, i * 5);
                    stmt.addBatch();
                    if (i % batchSize == 0)
                        stmt.executeBatch();
                }
                stmt.executeBatch(); // flush any remaining rows
            }
            if (!autoCommit)
                conn.commit();
            long finish = System.currentTimeMillis();
            long dur = (finish - start);
            System.out.println(String.format("copied %d rows in %,d ms (%d rows per sec). Autocommit: %b", rowsToCopy,
                    dur, rowsToCopy * 1000 / dur, autoCommit));
            var s = conn.createStatement();
            var rs = s.executeQuery("select sum(1) from test.test");
            rs.next();
            System.out.println(String.format("Current row count: %d", rs.getInt(1)));
            conn.close();
        } catch (SQLException ex) {
            // handle any errors
            System.out.println("SQLException: " + ex.getMessage());
            System.out.println("SQLState: " + ex.getSQLState());
            System.out.println("VendorError: " + ex.getErrorCode());
        }
    }
}
```
Thanks for spending time on this. There's overhead but it should not be that slow.
Anyway, below is what I got (using quick start to spin up the test environment locally):
- DataSource - `rewriteBatchedStatements` is enabled or batch inserting is going to be extremely slow

  ```json
  {
    "mysql5": {
      "driverUrls": [
        "https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar"
      ],
      "driverClassName": "com.mysql.jdbc.Driver",
      "jdbcUrl": "jdbc:mysql://db-mysql5/test?useSSL=false&rewriteBatchedStatements=true&useCompression=false&useOldAliasMetadataBehavior=true&allowMultiQueries=true",
      "username": "root",
      "password": "root",
      "initializationFailTimeout": 0,
      "minimumIdle": 0,
      "maximumPoolSize": 10
    }
  }
  ```

- Queries on ClickHouse - `batch_size` is set to 2000 as I limited all containers' memory to 200MB

  ```sql
  select * from jdbc('mysql5?mutation', 'create table test.aaa(a Int, b Int)')
  select * from jdbc('mysql5?mutation', 'truncate table test.aaa')
  select * from jdbc('mysql5', 'select count(1) from test.aaa')
  create table j(a UInt32, b UInt32) engine=JDBC('mysql5?batch_size=2000', 'test', 'aaa')
  insert into j select number, number + 1 from numbers(1000000)
  ```

  0 rows in set. Elapsed: 5.779 sec. Processed 1.00 million rows, 8.00 MB (173.03 thousand rows/s., 1.38 MB/s.)
I think I can optimize batch insertion a bit by disabling auto commit and then committing batch by batch. However, in your case, perhaps it relates to the network between ClickHouse and the JDBC bridge, or between the JDBC bridge and MySQL?
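A rough sketch of what that optimization could look like (illustrative only, not the bridge's actual code; it reuses `conn`, `rowsToCopy` and `batchSize` from the snippet above):

```java
// Illustrative sketch - commit batch by batch instead of auto-committing each row.
conn.setAutoCommit(false);
try (var stmt = conn.prepareStatement("insert into test.aaa values(?,?)")) {
    for (int i = 1; i <= rowsToCopy; i++) {
        stmt.setInt(1, i);
        stmt.setInt(2, i + 1);
        stmt.addBatch();
        if (i % batchSize == 0) {
            stmt.executeBatch();
            conn.commit(); // one commit per batch
        }
    }
    stmt.executeBatch(); // flush the remainder
    conn.commit();
} catch (SQLException e) {
    conn.rollback(); // avoid leaving partial inserts behind
    throw e;
}
```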
On a side note, in case you need to transfer massive data across DCs/regions:
- jdbc bridge is stateless, so you can set up multiple instances behind a DNS name or proxy and configure load balancing accordingly
- compression should be enabled when network bandwidth could be an issue - this really depends on the database and its JDBC driver
- manual partitioning is probably needed to split one query into multiple (e.g. insert by specific date range) - see the sketch after this list
- avoid large timeouts (15+ minutes?)
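To illustrate the partitioning tip, here's a minimal sketch that splits one big copy into per-day inserts. The table names (`j`, `src`), the date column `d`, and the `clickhouseConn` connection are all made up for the example:

```java
// Hypothetical: issue one smaller INSERT...SELECT per day instead of a single huge query.
var day = java.time.LocalDate.parse("2021-01-01");
var end = java.time.LocalDate.parse("2021-02-01");
try (var stmt = clickhouseConn.createStatement()) {
    while (day.isBefore(end)) {
        stmt.execute("insert into j select a, b from src where d = '" + day + "'");
        day = day.plusDays(1);
    }
}
```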
OK, thanks for the update and tips! I was doing all this in Docker on my local machine too, so those numbers looked very suspicious. I tried your docker compose setup and it works as expected now. I'll try to figure out what was wrong before. I'm planning to use it with mssql/oracle and will continue experimenting.
Regarding auto commit - I would mainly be concerned about partial inserts (rather than performance). Say you run a long insert and kill the process in the middle - it will require manual clean-up afterwards.
I'm having some issues with Oracle:
- The default test query is "select 1", but Oracle needs "select 1 from dual". I see it can be configured, but it would probably make more sense to use the Connection.isValid() method to test connections by default (see the sketch after this list).
- The JDBC bridge quotes table and schema names when you create a JDBC table. It turns out Oracle treats tableName and "tableName" as different tables. Is there an option to avoid this quotation? I believe DB2/Informix also treat quotes the same way. I think quotation should be disabled by default, since the user can add it as needed when creating the table.
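For reference, a minimal sketch of the vendor-neutral check (`conn` is any JDBC connection; the 5-second timeout is an arbitrary example):

```java
// Standard JDBC API - no vendor-specific query like "select 1 from dual" needed
if (!conn.isValid(5)) { // timeout in seconds
    // treat the connection as dead: evict it from the pool / reconnect
}
```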
I also tested mssql and had no issues. For batch inserts it also needs an extra property (useBulkCopyForBatchInsert=true) to get reasonable performance.
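For example (a made-up connection string - only the `useBulkCopyForBatchInsert` property at the end is the point):

```java
// Hypothetical MSSQL URL; the property makes the driver use the bulk copy API for batch inserts
var mssqlConn = DriverManager.getConnection(
        "jdbc:sqlserver://localhost:1433;databaseName=test;user=sa;password=secret;useBulkCopyForBatchInsert=true");
```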
Thanks for sharing your findings. Is there a docker image for Oracle that I can use to reproduce the issue? I'll need to debug the code to understand what could be the cause.
driver: https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.3.0.0/ojdbc8-21.3.0.0.jar

docker:

```
docker run -d -p 1521:1521 -e ORACLE_PASSWORD=123456 gvenzl/oracle-xe
```

jdbc url: jdbc:oracle:thin:@//localhost:1521/XEPDB1
user: SYSTEM
pass: 123456

test DDL:

```sql
CREATE TABLE system.test (a int, b int)
INSERT INTO system.test SELECT 1, 2 FROM dual
CREATE TABLE system."test" (a int, b int)
INSERT INTO system."test" SELECT 3, 4 FROM dual
SELECT * FROM system.test
SELECT * FROM system."test"
```
Speaking of docker - I see the jdbc bridge image is using Java 8. Oracle/mssql release drivers for specific JDK versions, so newer ones are not compatible with JDK 8. Can you make a separate image version for Java 11, or is it not compatible?
OK, I realized quotation might not be an issue. If you don't quote a table name, Oracle just converts it to all uppercase, but if you use quotes it becomes case sensitive. Basically TEST and test end up as different tables (when quoted), not test and "test" - quoting an all-uppercase name changes nothing.
OK, I can confirm that as long as you keep column/table names in uppercase (for Oracle) there is no issue. Batch insert also works well without tuning any extra JDBC parameters, although I'd probably set a bigger default batch_size.
I'm going to close this issue then. There are 2 other items I mentioned here - autocommit and connection testing (replacing "select 1" with the Connection.isValid() method) - if you think those are somewhat reasonable I will open another issue.
Please be aware that increasing `batch_size` comes at a cost. The larger the `batch_size`, the higher the chance the JDBC bridge will run out of memory, because it has to hold the whole batch in memory before sending it over to the target database. I'd suggest setting a reasonable number by considering row size (column count and size of each column etc.), concurrency, SLA, and JDBC bridge memory configuration together. On a side note, `fetch_size` has a similar issue, but it's just for queries.
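For a rough back-of-the-envelope (assumed numbers, not measurements): at ~100 bytes per row, `batch_size=10000` holds ~1 MB per in-flight insert, so 10 concurrent inserts would pin ~10 MB for batches alone, before driver and JVM overhead - and wider rows scale that linearly.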
As to whether to use a validation query like "select 1" or the standard JDBC API `Connection.isValid()`, that's determined not by the JDBC bridge but by HikariCP, the connection pool implementation. However, I do see the headache of tuning configuration for different databases - we should have templates defined in advance so that datasource configuration is just about host, port, database, and credentials. I didn't mention timeouts here, but I hope we can have a better way to configure those as well.
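For context, a sketch of the HikariCP side (this uses HikariCP's `HikariConfig` API; the URL and credentials are just the Oracle example from earlier in the thread). With a JDBC4 driver, leaving `connectionTestQuery` unset makes HikariCP fall back to `Connection.isValid()`:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig cfg = new HikariConfig();
cfg.setJdbcUrl("jdbc:oracle:thin:@//localhost:1521/XEPDB1"); // example URL
cfg.setUsername("SYSTEM");
cfg.setPassword("123456");
// cfg.setConnectionTestQuery("select 1 from dual"); // only needed for pre-JDBC4 drivers
var ds = new HikariDataSource(cfg);
```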
Lastly, to recap the issues we discussed in this thread:
- transaction support is missing - begin -> insert batch by batch -> commit
- Java 9+ support - I'll create a multi-release jar file in the near future, and then build the native image accordingly
- quotation in Oracle - I'll need to investigate the issue over the weekend