toluaina/pgsync

Provide Optimized Reindexing

jvanderen1 opened this issue · 0 comments

I was wondering if it would be possible to optimize how PGSync updates Elasticsearch records. I put 2 different thoughts I had into these dropdown menu's. Please let me know what you think.

Only Update When Specified Columns Changed
I notice that when an unspecified column updates, it fires a re-index.

Example

Given the following schema for table my_table

# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
#  id                          | integer                     | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
#  foo                         | text                        |
#  bar                         | text                        |
# ---------------------------------------------------------------------------------------------------------

And a schema.json defined as:

[
  {
    "database": "default",
    "index": "my_index",
    "nodes": {
      "table": "my_table",
      "columns": [
        "foo"
      ]
    }
  }
]

What is happening?

my_index is getting updated when column bar updates.

What do I expect to happen instead?

my_index would not be updated when bar updates because it is not specified in schema.json.

Batch Updates Together

I notice that when a record updates multiple times, it gets re-indexed the same number of times it was updated.

Example

Given the following schema for table my_table

# Table: my_table
# ---------------------------------------------------------------------------------------------------------
# Columns:
#  id                          | integer                     | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
#  foo                         | text                        |
# ---------------------------------------------------------------------------------------------------------

And a schema.json defined as:

[
  {
    "database": "default",
    "index": "my_index",
    "nodes": {
      "table": "my_table",
      "columns": [
        "foo"
      ]
    }
  }
]

And I perform the following query:

UPDATE my_table
SET foo = "hello"
WHERE id = 1;

Followed by another query (before pgsync runs again):

UPDATE my_table
SET foo = "world"
WHERE id = 1;

What is happening?

2 insert requests are sent to my_index the next time pgsync runs.

What do I expect to happen instead?

1 insert request is sent to my_index the next time pgsync runs. Because we are performing full document updates in the index, there is no reason to send the same document to Elasticsearch twice.