Provide Optimized Reindexing
jvanderen1 opened this issue · 0 comments
I was wondering if it would be possible to optimize how PGSync updates Elasticsearch records. I put 2 different thoughts I had into these dropdown menu's. Please let me know what you think.
Only Update When Specified Columns Changed
I notice that when an unspecified column updates, it fires a re-index.
Example
Given the following schema for table my_table
# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
# id | integer | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
# foo | text |
# bar | text |
# ---------------------------------------------------------------------------------------------------------
And a schema.json
defined as:
[
{
"database": "default",
"index": "my_index",
"nodes": {
"table": "my_table",
"columns": [
"foo"
]
}
}
]
What is happening?
my_index
is getting updated when column bar
updates.
What do I expect to happen instead?
my_index
would not be updated when bar
updates because it is not specified in schema.json
.
Batch Updates Together
I notice that when a record updates multiple times, it gets re-indexed the same number of times it was updated.
Example
Given the following schema for table my_table
# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
# id | integer | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
# foo | text |
# ---------------------------------------------------------------------------------------------------------
And a schema.json
defined as:
[
{
"database": "default",
"index": "my_index",
"nodes": {
"table": "my_table",
"columns": [
"foo"
]
}
}
]
And I perform the following query:
UPDATE my_table
SET foo = "hello"
WHERE id = 1;
Followed by another query (before pgsync
runs again):
UPDATE my_table
SET foo = "world"
WHERE id = 1;
What is happening?
2 insert requests are sent to my_index
the next time pgsync
runs.
What do I expect to happen instead?
1 insert request is sent to my_index
the next time pgsync
runs. Because we are performing full document updates in the index, there is no reason to send the same document to Elasticsearch twice.