yunojuno/elasticsearch-django

Add support for partial updates

Closed this issue · 8 comments

We currently update a complete document on sync. It would be nice to be able to automatically do a partial document update (supported by ES) on the save signal handler if update_fields is passed, e.g.

# save the object, and push a complete document to ES
obj.save()

# save the object, and push a partial update to ES containing just `first_name`
obj.save(update_fields=['first_name'])
djm commented

For a bit of background, this is the high level of how elasticsearch-django works:

  • when the app is loaded, it checks your django settings to see which models you have registered (see SEARCH_SETTINGS)

  • in that setting, we define which indexes (stores) we want to create in elasticsearch, and we tell elasticsearch which models we have that should update those indexes.

  • elasticsearch-django then attaches itself via a post_save Django signal to each model.

  • when each instance of a model is saved, the elasticsearch-django signal handler is fired and it will update the correct ES index.

  • it does this by looking on the model in question for the .as_search_document method, which returns a giant dictionary/blob of data to store in the ES index.

  • that dictionary is serialized to JSON, and then sent to Elasticsearch.

The problem is that for each save, the entire dict is regenerated and sent to ES, even if only one field was updated.

So the task is to make the signal aware of what fields are being updated, so that it in turn can ask the as_search_document methods for the minimal amount of data required.

As an example of how this helps us: imagine a model whose as_search_document method returns 30 fields to send to ES. Often, to generate the data for those 30 fields, the Django ORM will end up doing many SQL queries behind the scenes. If we only want to update one of those 30 fields, this is wasted work and adds a performance penalty to every update.
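To make the task concrete, here is a minimal sketch of a receiver that forwards update_fields from the save signal to the model. Django does pass update_fields to post_save receivers (None for a full save, otherwise a frozenset of field names); everything else here is an assumption for illustration: the update_fields argument on as_search_document, the push_to_index placeholder, and the plain-Python Profile class standing in for a model.

```python
# Sketch only: the update_fields argument on as_search_document and the
# push_to_index function are hypothetical, not the library's current API.


class Profile:
    """Stand-in for a Django model; no ORM is involved here."""

    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def as_search_document(self, update_fields=None):
        # This toy version still builds every value before filtering;
        # a real method would skip the work for fields it does not need.
        full = {"first_name": self.first_name, "last_name": self.last_name}
        if update_fields is None:
            return full
        return {key: value for key, value in full.items() if key in update_fields}


def push_to_index(instance, document, partial):
    """Placeholder for the call that would send the document to ES."""
    return {"partial": partial, "doc": document}


def on_post_save(sender, instance, update_fields=None, **kwargs):
    """Receiver taking the keyword arguments Django sends with post_save."""
    document = instance.as_search_document(update_fields=update_fields)
    # A full save (update_fields is None) pushes the complete document.
    return push_to_index(instance, document, partial=update_fields is not None)
```

Calling on_post_save(Profile, obj, update_fields=frozenset({"first_name"})) would push only first_name, while a plain save would push the whole document.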

Thanks @djm :-)

djm commented

Thinking about it more now, the biggest problem with this will be the as_search_document methods, as it's the work that goes into building that dict that we're trying to avoid.

There are two ways I can think of handling this:

  1. We pass update_fields through from the signal to the methods, and leave it up to each as_search_document method to build a dictionary based on the requested fields. I think this would involve a little too much "if field_name in update_fields: add field to dict", perhaps? Every field would need it.

  2. We ensure all the values that get put in the dict are lazy-eval'd, that way the entire dictionary could be passed back and the signal itself can choose which fields/keys it wants to eval out of the dictionary.
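As a rough illustration of the second option, the sketch below wraps every value in a zero-argument callable so that nothing is computed until the caller asks for a key. The names as_lazy_search_document, evaluate and expensive are hypothetical, and the sketch assumes document keys match the model field names passed in update_fields.

```python
calls = []  # records which fields were actually computed


def expensive(name, value):
    """Pretend this runs SQL queries; it only logs that it was evaluated."""
    calls.append(name)
    return value


def as_lazy_search_document():
    # Each value is a zero-argument callable, so nothing is computed yet.
    return {
        "first_name": lambda: expensive("first_name", "Ada"),
        "skills": lambda: expensive("skills", ["maths", "engines"]),
        "bio": lambda: expensive("bio", "First programmer"),
    }


def evaluate(lazy_doc, update_fields=None):
    """Evaluate only the requested keys; every key when update_fields is None."""
    keys = lazy_doc.keys() if update_fields is None else update_fields
    # Names in update_fields that are not document keys are ignored.
    return {key: lazy_doc[key]() for key in keys if key in lazy_doc}
```

With this shape, evaluate(as_lazy_search_document(), ["first_name"]) runs a single callable and leaves the others untouched.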

Open to thoughts!

Indeed. Personally I'd pass update_fields into as_search_document and let the method work it out from there, as that's consistent - it's just a straight pass-through from the signal. The method is then responsible for creating the partial doc. I get your point about the if..else, but I think I'd approach it from the other way - i.e. iterate through update_fields and add each?

Also - this is something of a companion issue to #26 and #24 - having all three should give total control. #26 means disabling auto_sync for certain models, rather than the entire index. #24 would allow us to not sync given certain conditions - i.e. in the case of touch (which is what generates the volume of updates), we could say only push an update if the last_updated_field has moved forward at least a day.
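The "at least a day" condition mentioned for touch could be expressed as a small predicate. This is only a sketch of the idea, not what #24 specifies: should_sync and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def should_sync(previous, current, min_gap=timedelta(days=1)):
    """Return True when `current` is at least `min_gap` later than `previous`."""
    if previous is None:
        return True  # nothing recorded yet, so always push
    return current - previous >= min_gap


# Example timestamps: one a couple of hours apart, one more than a day apart.
last_pushed = datetime(2017, 1, 1, 9, 0)
same_day = should_sync(last_pushed, datetime(2017, 1, 1, 11, 0))
next_day = should_sync(last_pushed, datetime(2017, 1, 2, 9, 30))
```

Here same_day is False and next_day is True, so only the second touch would trigger an update.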

djm commented

i.e. iterate through update_fields and add each?

I don't see how that would work but I'm probably missing something. Each field is uniquely generated sadly so it's not as simple as looping through and using getattr on self or something like that.

Yup - thinking it through, what I said makes no (or at least not much) sense.

djm commented

I do think leaving it up to the model to choose what to do is probably better; as that's user-space code that a 3rd party would have control over changing as opposed to library code.

After that, it's just a case of how we handle it internally; we can definitely come up with a lazy solution from there.
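One way the model-side approach might look, given that each field is generated differently: the model keeps a mapping from document key to the method that builds it, and as_search_document calls only the builders it needs. The Candidate class, the search_builders mapping and the update_fields argument are all assumptions for illustration, and the sketch again assumes document keys match the names in update_fields.

```python
class Candidate:
    """Stand-in for a Django model; plain attributes replace ORM fields."""

    # Hypothetical mapping: document key -> name of the method that builds it.
    search_builders = {
        "first_name": "build_first_name",
        "skills": "build_skills",
    }

    def __init__(self, first_name, skills):
        self.first_name = first_name
        self.skills = skills
        self.built = []  # records which builders actually ran

    def build_first_name(self):
        self.built.append("first_name")
        return self.first_name

    def build_skills(self):
        # In a real model this might involve queries across related tables.
        self.built.append("skills")
        return sorted(self.skills)

    def as_search_document(self, update_fields=None):
        """Build the full document, or only the keys named in update_fields."""
        wanted = self.search_builders if update_fields is None else update_fields
        return {
            key: getattr(self, self.search_builders[key])()
            for key in wanted
            if key in self.search_builders
        }
```

Because the mapping lives on the model, it stays user-space code: a third party can add or change builders without touching the library.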