IBMStreams/streamsx.elasticsearch

Does this toolkit support upserts?

snowch opened this issue · 3 comments

It isn't clear if this toolkit only supports inserts or upserts. Can you please confirm?

I guess it doesn't support upserts:

// Add jsonDocuments to bulkRequest.
if (idToInsert != null) {
    bulkRequest.add(client.prepareIndex(indexToInsert, typeToInsert, idToInsert)
        .setSource(jsonDocuments));
} else {
    bulkRequest.add(client.prepareIndex(indexToInsert, typeToInsert)
        .setSource(jsonDocuments));
}

Reference: com.ibm.streamsx.elasticsearch.ElasticsearchIndex

If I have understood correctly, an upsert would require client.prepareUpdate() rather than client.prepareIndex(). See: https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.4/java-docs-update.html

As far as I can see, the implemented use case is writing a series of data points into Elasticsearch. As in the sample with the ECG data, there is no need to go back to a previous data point and revise its value.
Based on the API link you posted, I can see the possibility of introducing a new parameter that takes a list of 'default' key-value pairs to be used when an upsert has to insert a new document. Does this sound useful to you?
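
To make the proposed behaviour concrete, here is a minimal in-memory sketch of an upsert with defaults. It is not toolkit code and does not call the Elasticsearch client: UpsertSketch, upsert and get are hypothetical names, and a Map stands in for the index. It assumes the semantics of the Elasticsearch update API, where the partial document is merged into an existing document and the default document is inserted as-is when the id does not exist yet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of upsert semantics; not the toolkit's implementation.
final class UpsertSketch {
    // Stands in for an index: document id -> document fields.
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    // Merges partialDoc into an existing document, or inserts a copy of
    // defaults when no document with this id exists.
    // Returns true if a new document was created.
    boolean upsert(String id, Map<String, Object> partialDoc, Map<String, Object> defaults) {
        Map<String, Object> existing = store.get(id);
        if (existing == null) {
            store.put(id, new HashMap<>(defaults));
            return true;
        }
        existing.putAll(partialDoc);
        return false;
    }

    // Returns the stored document, or null if the id is unknown.
    Map<String, Object> get(String id) {
        return store.get(id);
    }

    public static void main(String[] args) {
        UpsertSketch index = new UpsertSketch();
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("count", 0);
        Map<String, Object> partial = new HashMap<>();
        partial.put("status", "seen");

        index.upsert("42", partial, defaults); // id is new: defaults are inserted
        index.upsert("42", partial, defaults); // id exists: partial doc is merged
        System.out.println(index.get("42"));
    }
}
```

In this model the first call stores only the defaults, and the second call adds the status field to them, which is the case the new parameter would have to cover.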

With the next version of the toolkit, the Java client (transport client) is dropped in favor of the JEST client, which uses the REST API. When the idNameAttribute parameter is used, the application can provide the _id field directly instead of letting Elasticsearch create an _id. With this setup, a document with the same _id is updated (replaced) whenever it is indexed again, so upserts work.
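
For reference, an index action with an explicit _id looks roughly like this at the level of the Elasticsearch bulk REST API. This is a sketch of the request body only, not JEST or toolkit code: BulkBodySketch, quote and indexAction are hypothetical names, and whether the toolkit sends its requests in exactly this form is not confirmed here. Older Elasticsearch versions additionally expect a type, which would have to be supplied in the URL or in the action line.

```java
// Hypothetical helper that builds one entry of a bulk request body (NDJSON).
final class BulkBodySketch {
    // Minimal JSON string quoting; only backslashes and double quotes are
    // escaped, so control characters in the input are not handled.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Returns the action line and the document line, each terminated by a
    // newline. jsonDocument must already be valid JSON on a single line.
    static String indexAction(String index, String id, String jsonDocument) {
        return "{\"index\":{\"_index\":" + quote(index) + ",\"_id\":" + quote(id) + "}}\n"
                + jsonDocument + "\n";
    }

    public static void main(String[] args) {
        // Indexing the same _id twice: the second document replaces the first.
        String body = indexAction("ecg", "p1", "{\"bpm\":72}")
                + indexAction("ecg", "p1", "{\"bpm\":75}");
        System.out.print(body);
    }
}
```

Because the index action replaces the whole stored document, any field that is missing from the second document is gone afterwards. That differs from a partial update, which merges fields.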

This may lead to problems when multiple applications update the same document concurrently. In that case the latest update wins. Elasticsearch offers 'Optimistic Concurrency Control' to detect and handle that situation. That feature is not supported by the toolkit so far. Supporting it would require making the _version field accessible to the application.
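
The difference between "latest update wins" and a version-checked write can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: VersionedStoreSketch and its methods are hypothetical names, nothing here talks to Elasticsearch, and the model assumes that a document's version starts at 1 and increases by one on every successful write.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of version-checked writes; not toolkit code.
final class VersionedStoreSketch {
    private static final class Entry {
        final long version;
        final String source;

        Entry(long version, String source) {
            this.version = version;
            this.source = source;
        }
    }

    private final Map<String, Entry> docs = new HashMap<>();

    // Unconditional write: the latest caller wins. Returns the new version.
    synchronized long index(String id, String source) {
        Entry old = docs.get(id);
        long next = (old == null) ? 1 : old.version + 1;
        docs.put(id, new Entry(next, source));
        return next;
    }

    // Conditional write: succeeds only if the stored version still equals
    // expectedVersion. Returns false on a conflict or an unknown id.
    synchronized boolean indexIfVersion(String id, String source, long expectedVersion) {
        Entry old = docs.get(id);
        if (old == null || old.version != expectedVersion) {
            return false;
        }
        docs.put(id, new Entry(expectedVersion + 1, source));
        return true;
    }

    // Returns the current version, or 0 if the id is unknown.
    synchronized long versionOf(String id) {
        Entry e = docs.get(id);
        return (e == null) ? 0 : e.version;
    }

    // Returns the current document source, or null if the id is unknown.
    synchronized String sourceOf(String id) {
        Entry e = docs.get(id);
        return (e == null) ? null : e.source;
    }

    public static void main(String[] args) {
        VersionedStoreSketch store = new VersionedStoreSketch();
        long seen = store.index("doc", "{\"v\":\"a\"}");
        // Two writers both read version 'seen'; only the first write succeeds.
        System.out.println(store.indexIfVersion("doc", "{\"v\":\"b\"}", seen));
        System.out.println(store.indexIfVersion("doc", "{\"v\":\"c\"}", seen));
    }
}
```

The second writer is rejected instead of silently overwriting the first, and it would have to re-read the document and retry. This is why the application needs access to the version it last read.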