Philippus/elastic4s

SearchHit.sourceAsString adds superfluous type metadata

tastyminerals opened this issue · 2 comments

I am struggling to figure out how to deserialize a valid JSON hit string into a SearchHit. We use the JSON string -> SearchHit pattern a lot in our tests, and the classic Elasticsearch library's SearchHit allowed us to do the following:

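    import org.elasticsearch.common.bytes.BytesArray
    import org.elasticsearch.search.SearchHit

    // Build a classic Elasticsearch SearchHit whose _source is the raw fixture JSON.
    def sourceFixtureAsSearchHit(fileName: String, docId: Int = 1): SearchHit = {
        val fixture = loadUTF8FixtureAsString(fileName) // test helper (not shown)
        val source = new BytesArray(fixture)
        val hit = new SearchHit(docId)
        hit.sourceRef(source) // attach the JSON bytes as the hit's source
        hit
    }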

The elastic4s SearchHit doesn't provide sourceRef. Hence, we parse the JSON string (via circe) into a Map[String, Json]:

    val sourceMap = parser.parse(fixture).getOrElse(Json.Null)
      .asObject.map(_.toMap)
      .getOrElse(Map.empty[String, Json])

and then store it in an elastic4s SearchHit(_source = sourceMap). This, however, produces a different JSON representation when serializing back via hit.sourceAsString. For example, the original JSON

    {"document_id" : "0b85846f-2c7b-4cc8-b265-6c3fdf1da815"}

becomes

    {
      "document_id": {
        "value": "0b85846f-2c7b-4cc8-b265-6c3fdf1da815",
        "array": false,
        "null": false,
        "boolean": false,
        "number": false,
        "string": true,
        "object": false
      }
    }

This drastically increases the size of the resulting string: from 214 lines to ~3k. So this doesn't look like the correct way to create a SearchHit from a string. How does one deserialize a string into a SearchHit?

Apparently, a workaround is possible: use .sourceAsMap instead of .sourceAsString and convert the map to a new JSON string on your end. So the question is: why does .sourceAsString add all that additional type metadata? If left unchecked, it will index significantly more data :(

This has something to do with Jackson, which elastic4s uses. Whenever a SearchHit is instantiated manually and _source is set, the downstream .sourceAsString generates JSON with type elements. For now we avoid it only by calling .sourceAsMap on the manually instantiated SearchHit, converting the resulting map into a circe Json object, and then back to a String.
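
A possible alternative, as a sketch under assumptions: the extra fields in the output (value, array, null, boolean, number, string, object) look like the properties of circe's own Json wrapper objects. That would fit Jackson serializing each circe Json value stored in _source as a bean. If that is the cause, unwrapping the circe values into plain JVM values before building the SearchHit might avoid the metadata entirely. PlainSource, toPlain and sourceMap are hypothetical names, not part of elastic4s or circe. That _source accepts a Map[String, AnyRef] and that the Jackson setup in elastic4s serializes Scala collections as JSON arrays/objects are assumptions to check against the elastic4s version in use.

```scala
import io.circe.Json
import io.circe.parser

// Hypothetical helper: converts circe Json into plain JVM values
// (String, java.lang.Long/Double/Boolean, Vector, Map, null) so that
// no circe wrapper objects end up inside the source map.
object PlainSource {

  // Recursively unwrap a single circe Json value.
  def toPlain(json: Json): AnyRef = json.fold[AnyRef](
    null,                                  // JSON null
    b => Boolean.box(b),                   // JSON boolean
    n => n.toLong match {                  // JSON number: prefer Long, else Double
      case Some(l) => Long.box(l)
      case None    => Double.box(n.toDouble)
    },
    s => s,                                // JSON string
    arr => arr.map(toPlain),               // JSON array -> Vector[AnyRef]
    obj => obj.toMap.map { case (k, v) => k -> toPlain(v) } // JSON object -> Map
  )

  // Parse a fixture string into a map of plain values.
  // Returns an empty map if the input is not a JSON object.
  def sourceMap(fixture: String): Map[String, AnyRef] =
    parser.parse(fixture).toOption
      .flatMap(_.asObject)
      .map(_.toMap.map { case (k, v) => k -> toPlain(v) })
      .getOrElse(Map.empty[String, AnyRef])
}
```

The result of PlainSource.sourceMap(fixture) could then be passed as _source in place of the circe-based map above. Large non-integral numbers lose precision in the Double fallback, so a BigDecimal branch may be needed for fixtures that contain them.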