searchisko/elasticsearch-river-remote

Creation Exception - related to issue #44 - how to configure river to create additional/incremental _id

Closed this issue · 13 comments

CreationException[Guice creation errors:

  1. Error injecting constructor, java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.util.Map
    at org.jboss.elasticsearch.river.remote.RemoteRiver.<init>(Unknown Source)
    while locating org.jboss.elasticsearch.river.remote.RemoteRiver
    while locating org.elasticsearch.river.River

1 error]; nested: ClassCastException[java.util.ArrayList cannot be cast to java.util.Map];

I was unable to re-open the previous issue.

This means the river configuration file has an invalid structure. Check it against the documentation and examples.
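The `ArrayList cannot be cast to Map` message indicates that a JSON array was found where a JSON object was expected. A rough pre-flight check can be sketched in Python. The list of sections assumed to be objects is taken from the configs in this thread and is only an assumption, not the river's real schema; the helper `find_non_object_sections` is hypothetical and not part of the river.

```python
import json

# Sections assumed (from the configs in this thread) to be JSON objects.
# This list is an assumption for illustration, not the river's real schema.
EXPECTED_OBJECTS = ["remote", "index", "index.fields", "activity_log"]


def find_non_object_sections(config_text):
    """Return dotted paths of sections that are present but not JSON objects."""
    config = json.loads(config_text)
    problems = []
    for path in EXPECTED_OBJECTS:
        node = config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None  # section absent (or parent is not an object)
                break
            node = node[key]
        if node is not None and not isinstance(node, dict):
            problems.append(path)
    return problems


if __name__ == "__main__":
    # "fields" is given as an array here, which is the kind of mistake
    # that would produce a list-to-map cast failure.
    bad = '{"type": "remote", "remote": {}, "index": {"fields": []}}'
    print(find_non_object_sections(bad))
```

Running such a check before registering the river only narrows down where the structure differs from the examples; the river's own documentation remains the reference for the expected layout.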

{
  "type": "remote",
  "remote": {
    "urlGetDocuments": "https://company.domain.com/controller/rest/applications/SEAL%20-%20CTE%20-%20Production%20-%201/metric-data?metric-path=Backends%7CDefault%20Web%20Site/ClaqServices%7CAverage%20Response%20Time%20%28ms%29&time-range-type=BEFORE_NOW&duration-in-mins=15&output=json",
    "timeout": "5s",
    "spacesIndexed": "MAIN",
    "username": "user@compay",
    "pwd": "pssw0rd",
    "spaceKeysExcluded": "",
    "indexUpdatePeriod": "1m",
    "indexFullUpdatePeriod": "0",
    "simpleGetDocuments": "true",
    "maxIndexingThreads": 2
  },
  "index" : {
    "index" : "analytics",
    "type" : "claq_test",
    "remote_field_document_id" : "generated_id",
    "fields" : {
      "frequency" : {"remote_field" : "frequency"},
      "metricPath" : {"remote_field" : "metricPath"},
      "count" : {"remote_field" : "metricValues.count"},
      "current" : {"remote_field" : "metricValues.current"},
      "max" : {"remote_field" : "metricValues.max"},
      "min" : {"remote_field" : "metricValues.min"},
      "occurences" : {"remote_field" : "metricValues.occurences"},
      "standardDeviation" : {"remote_field" : "metricValues.standardDeviation"},
      "sum" : {"remote_field" : "metricValues.sum"},
      "startTimeInMillis" : {"remote_field" : "metricValues.startTimeInMillis"},
      "value" : {"remote_field" : "metricValues.value"}
    },
    "preprocessors": [
      {
        "name": "Unique id generator",
        "class": "org.jboss.elasticsearch.tools.content.AddCurrentTimestampPreprocessor",
        "settings": {"field": "generated_id"}
      }
    ]
  },
  "activity_log": {
    "index" : "remote_river_activity",
    "type" : "remote_river_indexupdate"
  }
}

OK, I removed the _river index, triple-checked the config, and reapplied it.

Now I am seeing this:
"error_message": "Document ID not found in remote system response for Space MAIN within data: {metricPath=Business Transaction Performance|Business Transactions|ECommerce Server|Fetch catalog|Calls per Minute, frequency=ONE_MIN, metricValues=[{current=793, min=0, max=0, startTimeInMillis=1358880420000, value=786}]}"

It seems the preprocessor is not working correctly.

I checked the river code. The problem is that the river checks for the document ID before the preprocessors are called and throws an exception if the ID is empty. So the solution is not to store the generated ID in a new field, but to use a field that already exists in the original data (so it is populated before the preprocessors run) and have the preprocessor only replace its value. If you also need the original value, you can copy it into another field with another preprocessor. For an example, see https://github.com/searchisko/searchisko/blob/master/configuration/rivers/jbossorg_sbs_article.json
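The ordering problem can be illustrated with a simplified Python model. This is not the river's actual code: `index_document` and `add_timestamp` are hypothetical stand-ins that only mimic the behavior described here, where the ID field is checked before any preprocessor runs.

```python
import time


def add_timestamp(field):
    """Hypothetical stand-in for a preprocessor that writes a timestamp into `field`."""
    def preprocess(doc):
        doc[field] = str(int(time.time() * 1000))
        return doc
    return preprocess


def index_document(doc, id_field, preprocessors):
    """Simplified model: the ID check happens BEFORE preprocessors run."""
    if not doc.get(id_field):
        raise ValueError(f"Document ID not found in field '{id_field}'")
    for preprocess in preprocessors:
        doc = preprocess(doc)
    return doc[id_field], doc


if __name__ == "__main__":
    remote_doc = {"metricPath": "Calls per Minute", "frequency": "ONE_MIN"}

    # Fails: "generated_id" does not exist in the remote data yet.
    try:
        index_document(dict(remote_doc), "generated_id", [add_timestamp("generated_id")])
    except ValueError as err:
        print("rejected:", err)

    # Works: "metricPath" exists, so the check passes and its value is then replaced.
    doc_id, _ = index_document(dict(remote_doc), "metricPath", [add_timestamp("metricPath")])
    print("indexed with id", doc_id)
```

In this model the first call is rejected for the same reason as the "Document ID not found" error quoted earlier, while the second call passes the check and ends up with a timestamp as its ID.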

I could use some assistance; this is my first time using the preprocessors.
This currently isn't working and I need more coffee ;-)

The config below uses a working API, which should make it easier to test.

{
  "type": "remote",
  "remote": {
    "urlGetDocuments": "http://docs.appdynamics.com/download/attachments/20187207/REST_WildCardBT_metric-dataJSON.txt?version=1&modificationDate=1394226069000&api=v2",
    "timeout": "5s",
    "spacesIndexed": "MAIN",
    "spaceKeysExcluded": "",
    "indexUpdatePeriod": "1m",
    "indexFullUpdatePeriod": "1h",
    "simpleGetDocuments": "true",
    "maxIndexingThreads": 2
  },
  "index": {
    "index": "appdynamics",
    "type": "metrics_test",
    "remote_field_document_id": "generated_id",
    "fields": {
      "frequency": {
        "remote_field": "frequency"
      },
      "metricPath": {
        "remote_field": "metricPath"
      },
      "current": {
        "remote_field": "metricValues.current"
      },
      "max": {
        "remote_field": "metricValues.max"
      },
      "min": {
        "remote_field": "metricValues.min"
      },
      "startTimeInMillis": {
        "remote_field": "metricValues.startTimeInMillis"
      },
      "value": {
        "remote_field": "metricValues.value"
      }
    },
    "preprocessors": [
      {
        "name": "Unique id generator",
        "class": "org.jboss.elasticsearch.tools.content.AddCurrentTimestampPreprocessor",
        "settings": {
          "field": "generated_id",
          "source_field": "{generated_id}",
          "target_field": "_id"
        }
      },
      {
        "name": "Remote id copy",
        "class": "org.jboss.elasticsearch.tools.content.AddMultipleValuesPreprocessor",
        "settings": {
          "prep_id_remote": "{metricPath}"
        }
      }
    ]
  },
  "activity_log": {
    "index": "remote_river_activity",
    "type": "remote_river_indexupdate"
  }
}

I'm still having issues.
Are there any other configs or READMEs you can point me to?

I assume the metricPath field exists in the data from the remote system. In that case you have to use:
"remote_field_document_id": "metricPath", and then let the current-timestamp preprocessor fill this field:

{
  "name": "Unique id generator",
  "class": "org.jboss.elasticsearch.tools.content.AddCurrentTimestampPreprocessor",
  "settings": {"field": "metricPath"}
}

If you also want to store the original value of the "metricPath" field in the index, you have to copy it to another field with a copy preprocessor placed BEFORE AddCurrentTimestampPreprocessor.

This worked, but with one issue: it also overwrites the metricPath field with the timestamp.
There are now 3 fields containing the timestamp (_id, document_id, metricPath). I need a way to preserve the data from metricPath or create another field with this data.

As I said previously, if you need the original metricPath value obtained from the remote system in the index, you have to copy it with another preprocessor (placed before the generator) to a temporary field in the data, and then read it from that field when storing it into the search index via fields. So in the river config you have to do something like:

"remote_field_document_id": "metricPath",
"fields" : {
  ...
  "metricPath": {
    "remote_field": "metricPathTemp"
  },
  ...
},
"preprocessors": [
  {
    "name" : "Remote id copy",
    "class" : "org.jboss.elasticsearch.tools.content.AddMultipleValuesPreprocessor",
    "settings" : {
      "metricPathTemp" : "{metricPath}"
    }
  },
  {
    "name": "Unique id generator",
    "class": "org.jboss.elasticsearch.tools.content.AddCurrentTimestampPreprocessor",
    "settings": {"field": "metricPath"}
  }
]
...

To be clear about how the river works: it takes data from the remote system and parses it into an in-memory data structure, then runs all preprocessors against that data, so preprocessors can manipulate the in-memory structure (add or remove fields, change values of existing fields, etc.). The river then maps values from this modified in-memory structure into the search index based on the fields configuration. The final name of a field in the search index can differ from the one used in the in-memory data structure.
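The flow described above can be sketched in Python. This is a simplified model under stated assumptions, not the river's implementation: `copy_field`, `set_timestamp`, `map_fields`, and `run_pipeline` are hypothetical names, and dotted `remote_field` paths such as `metricValues.current` are not handled.

```python
import json
import time


def copy_field(source, target):
    """Hypothetical copy preprocessor: duplicates one field under a new name."""
    def preprocess(doc):
        doc[target] = doc[source]
        return doc
    return preprocess


def set_timestamp(field):
    """Hypothetical ID generator: overwrites `field` with the current time in ms."""
    def preprocess(doc):
        doc[field] = str(int(time.time() * 1000))
        return doc
    return preprocess


def map_fields(doc, fields):
    """Build the indexed document: index field name -> value of its remote_field."""
    return {name: doc.get(spec["remote_field"]) for name, spec in fields.items()}


def run_pipeline(raw_json, id_field, preprocessors, fields):
    doc = json.loads(raw_json)              # 1. parse remote data into memory
    for preprocess in preprocessors:        # 2. preprocessors modify it in order
        doc = preprocess(doc)
    return doc[id_field], map_fields(doc, fields)  # 3. map into index fields


if __name__ == "__main__":
    raw = '{"metricPath": "Fetch catalog|Calls per Minute", "frequency": "ONE_MIN"}'
    fields = {
        "metricPath": {"remote_field": "metricPathTemp"},
        "frequency": {"remote_field": "frequency"},
    }
    preprocessors = [
        copy_field("metricPath", "metricPathTemp"),  # must run before the generator
        set_timestamp("metricPath"),
    ]
    doc_id, indexed = run_pipeline(raw, "metricPath", preprocessors, fields)
    print(doc_id, indexed)
```

Because the copy runs first, the indexed `metricPath` keeps the original value while the document ID is the generated timestamp. Swapping the two preprocessors would copy the timestamp instead, which is why the order matters.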

I'm still working on a new config to change multiple fields. I will be revisiting this soon.

I need to revisit this.