elixir-ecto/myxql

Any way to use query_type: text on Repo.insert_all?

byronpc opened this issue · 15 comments

The reason for this is that when the number of elements in a batch insert is dynamic, one query gets prepared per distinct batch size. Suppose we have a feature called contact book syncing, where we don't know how many entries a user has in their contact book; it can be 1 to 1000 or even more. That would mean each connection has to prepare up to 1000 prepared statements for the batch inserts. With a connection pool of, say, 10, over time that adds up to 10,000 prepared statements on the server.
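
To make the scenario concrete, here is a rough sketch (Repo, ContactBook, and fetch_contact_book/1 are made-up names): every distinct entry count produces a differently shaped INSERT, and therefore a separate prepared statement on each connection that runs it.

entries = fetch_contact_book(user)    # 1 to 1000+ maps, length varies per user (hypothetical helper)
Repo.insert_all(ContactBook, entries)
# 3 entries -> INSERT INTO `contact_book` (...) VALUES (?,...),(?,...),(?,...)
# 4 entries -> INSERT INTO `contact_book` (...) VALUES (?,...),(?,...),(?,...),(?,...)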

For insert_all and friends, MyXQL prepares the statement and then closes it immediately after use, so you would have only 10 statements on the server. If insert_all does any caching, it is per schema, so you would have at most number of schemas * number of connections. Given you are using a single schema here, it would still be 10. :)

@josevalim hmm... if it closes it immediately after use as you suggested, then there might be another bug.

MySQL [performance_schema]> SELECT SQL_TEXT, COUNT(1) FROM prepared_statements_instances GROUP BY SQL_TEXT HAVING count(1) > 20\G
*************************** 1. row ***************************
SQL_TEXT: INSERT INTO `contact_book` (`contact_id`,`contact_name`,`contact_org`,`contact_org_id`,`contact_org_slug`,`contact_picture`,`contact_slug`,`contact_title`,`email`,`phone`,`reversed_search_phone`,`user_id`) VALUES (?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?,?,?),(?,?,
COUNT(1): 157

But assuming it is prepared, as I mentioned, you end up with one prepared statement with 1 value, one statement with 2 values, and so on, and that's on EVERY connection established to the server @josevalim

I'm pretty sure the prepared statement doesn't get dropped after the call. Take this simple example:

now = NaiveDateTime.utc_now() # assumed binding for `now` used in the calls below
Messenger.RepoWrite.insert_all(Messenger.Thread, [[creator_id: 2, created: now]])
Messenger.RepoWrite.insert_all(Messenger.Thread, [[creator_id: 2, created: now], [creator_id: 3, created: now]])
Messenger.RepoWrite.insert_all(Messenger.Thread, [[creator_id: 2, created: now], [creator_id: 3, created: now], [creator_id: 4, created: now]])

mysql> select SQL_TEXT from prepared_statements_instances\G
*************************** 1. row ***************************
SQL_TEXT: SELECT CAST(s0.`version` AS unsigned) FROM `schema_migrations` AS s0 FOR UPDATE
*************************** 2. row ***************************
SQL_TEXT: INSERT INTO `thread` (`created`,`creator_id`) VALUES (?,?)
*************************** 3. row ***************************
SQL_TEXT: INSERT INTO `thread` (`created`,`creator_id`) VALUES (?,?),(?,?)
*************************** 4. row ***************************
SQL_TEXT: INSERT INTO `thread` (`created`,`creator_id`) VALUES (?,?),(?,?),(?,?)
4 rows in set (0.00 sec)

If you do an insert_all with 2 values, then 3 values, then 4 values, all 3 prepared statements get prepared and none of them go away. And that is per connection. Now put that in a scenario like user contact book syncing, where the number of entries varies per user (it could be 1 to 1000 or even more items), spread across a pool of, say, 10 connections. That would mean 1000 x 10 = 10,000 prepared statements in total.

This also applies to queries using IN.

SELECT * FROM table WHERE id IN [variable length]
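
In Ecto terms, that is any query interpolating a variable-length list, roughly like this sketch (the table name and Repo are placeholders):

import Ecto.Query

ids = [1, 2, 3]    # length varies per call
Repo.all(from(c in "contact_book", where: c.id in ^ids, select: c.id))
# generates ... WHERE c0.`id` IN (?,?,?) -- a different placeholder count for every list length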

The reason I'm investigating all these prepared statement issues is that within a week of usage we usually hit the 16k prepared statement limit. Of course, raising the limit is one way to address the issue, but I'm not sure that is better than just cherry-picking batch queries (batch insert and batch select) and forcing them to query_type: text.
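
At the MyXQL level, forcing the text protocol would look roughly like the sketch below (connection options and values are made up). With query_type: :text the values have to be interpolated into the SQL and escaped by the caller, since the text protocol does not bind parameters, but no prepared statement is left on the server.

{:ok, conn} = MyXQL.start_link(hostname: "localhost", username: "root", database: "app")

sql = "INSERT INTO `thread` (`created`,`creator_id`) VALUES ('2019-01-01 00:00:00', 2), ('2019-01-01 00:00:00', 3)"
MyXQL.query!(conn, sql, [], query_type: :text)
# executed as a plain text-protocol query; nothing stays prepared on the server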

Just to be clear, you are using MyXQL master? But I actually think I know what the bug is: once the statement changes, we should close the previous query, but we do not. IN statements should never leak though, as they are never cached.

yes 0.3.4

Master or v0.3.4? There are bug fixes in master that did not make it to v0.3.4 yet. :)

oh i will test tomorrow! will update you

@josevalim just an update.

  1. Regarding the race condition issue with the Ecto cache I posted last week, it seems to have already been addressed in this build.
  2. BATCH INSERT: Regarding this issue of variable-length insert_all, while it no longer stores one copy per length, I notice that it keeps the LAST insert_all statement you sent. I guess at least we're limited to one batch insert statement in the cache. Is that the intended behavior? (When I insert 2 entries, that statement gets prepared; when I send a batch of 3 entries afterwards, the 2-entry statement gets unprepared, while the 3-entry statement gets prepared and cached.) See the sketch after this list for a quick server-side check.
  3. BATCH SELECT: I notice that for SELECT ... WHERE ... IN ?, the prepared statement doesn't stay in the cache, so there's a bit of inconsistency between batch insert and batch select.
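
A rough way to double-check items 2 and 3 on the server, using the same performance_schema table as in the earlier queries (repo name reused from the earlier sample):

# per item 2, only the most recent batch-insert shape should remain prepared
# for a given table on each connection
Ecto.Adapters.SQL.query!(
  Messenger.RepoWrite,
  "SELECT SQL_TEXT FROM performance_schema.prepared_statements_instances WHERE SQL_TEXT LIKE 'INSERT INTO `thread`%'"
)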

Btw, you were right, insert_all was leaking statements but that's now fixed on master.

@josevalim the test I did an hour ago was using your latest master, which you committed also an hour ago.

The difference I see here is that BATCH SELECT doesn't keep a prepared statement in the cache (probably expected behavior?) while BATCH INSERT does keep the last prepared statement in the cache.

Yes, you are correct. We could maintain the last batch select in the cache too but I am not sure if it is necessary.

0xAX commented

Hello @josevalim. It looks like the 7a5363a change may easily lead to performance degradation.

For example, it could be inefficient to have two (or more, though of course not an unbounded number of) queries like:

  • insert into table (field1, field2) values (value1, value2)
  • insert into table (field1, field2, field3) values (value1, value2, value3)

because if such queries are called one after the other, the previous prepared statement is closed and a new one is created every time. I have looked at postgrex and it does not seem to work this way (maybe I'm wrong, I didn't look very deep).
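
For illustration, the alternation would look something like this (table and field names taken from the example above; Repo is a placeholder). Assuming the per-table caching described earlier in the thread, each call closes the other shape's statement and prepares its own:

rows_two   = [%{field1: "value1", field2: "value2"}]
rows_three = [%{field1: "value1", field2: "value2", field3: "value3"}]

for _ <- 1..100 do
  Repo.insert_all("table", rows_two)      # prepares VALUES (?,?) and closes the (?,?,?) statement
  Repo.insert_all("table", rows_three)    # prepares VALUES (?,?,?) and closes the (?,?) statement
end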

Is there any way (or could this be considered a feature request?) to set how many different queries can be cached per cache statement name, or can we change this only for insert_all and other more dynamic cases than the usual insert queries?

It is the responsibility of whoever is using cache_statement to control how the cache keys are reused, or whether they are reused at all.

Currently there is no way on the client to get how many are shared, but I believe there are queries you can run on the server.
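
For illustration, cache_statement is an option on the driver query; the SQL and key name below are made up, and ecto_sql picks its own key (e.g. per table) in its adapter. This assumes a conn started with MyXQL.start_link/1:

# queries issued under the same cache_statement key reuse (or replace) a single
# cached prepared statement on that connection
MyXQL.query!(
  conn,
  "INSERT INTO `thread` (`created`,`creator_id`) VALUES (?,?)",
  [~N[2019-01-01 00:00:00], 2],
  cache_statement: "insert_thread"
)

On the server side, something like SHOW GLOBAL STATUS LIKE 'Prepared_stmt_count', or the performance_schema table used earlier in this thread, shows how many statements are currently prepared.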

0xAX commented

It is the responsibility of whoever is using cache_statement to control how the cache keys are reused, or whether they are reused at all.

Yes. But if I understand correctly, we just can't control it at all on the myxql side, at least for insert statements - https://github.com/elixir-ecto/ecto_sql/blob/master/lib/ecto/adapters/myxql.ex#L259.

Currently there is no way on the client to get how many are shared, but I believe there are queries you can run on the server.

I'm not really sure I understood the part about shared statements and queries on the server. Could you please clarify what you meant?

I don't use myxql directly but via ecto/ecto_sql, and I'd like to be able to execute queries like the ones I posted in the previous message in a way that each of them is prepared only once. I agree that the change provides safety for batch insert and cases like that, but maybe we can do it in a more efficient way.