cloudant-labs/clouseau

fields stored with {store: true} are not returned in order

Closed this issue · 9 comments

Hello. I am indexing my documents using { store: true }, and the fields returned by lucene are not in order. The behaviour is similar to this bug:
https://issues.apache.org/jira/browse/LUCENE-1727

Is this expected behaviour? I need the fields to come back in the same order as per the FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_order_of_fields_returned_by_Document.fields.28.29.3F

I have logged my design document and I am pushing the fields in the right order.

Any help would be most appreciated! Thanks!

It's not intentional behaviour, we'll see if we can improve on this.

@arturog I was not able to reproduce your issue. Can you please provide more details (an example of the documents you are indexing, your sample design doc, and the query you are running)?

In my case the fields came back in exactly the same order as they were added to the document.
My design document:
curl -X PUT http://adm:pass@127.0.0.1:5984/movies/_design/search1 -H 'Content-Type: application/json' --data-binary '{"_id":"_design/search1", "indexes": {"store1": { "index": "function(doc){\n if(doc.Movie_name){\n index(\"movie_name\", doc.Movie_name, {\"store\": \"yes\"});} \n if(doc.Person_name){\n index(\"person_name\", doc.Person_name, {\"store\": \"yes\"});} \n if (doc.Movie_year) {\n index(\"movie_year\", doc.Movie_year, {\"store\": \"yes\"});} \n \n }"}}}'

The query:
curl -X POST "http://adm:pass@127.0.0.1:5984/movies/_design/search1/_search/store1" -d '{"q": "movie_name:New", "limit":2}' | jq .

In the results, the fields are returned in the same order as they were indexed, ["movie_name", "person_name", "movie_year"]:

{
  "total_rows": 10,
  "bookmark": "g1AAAABoeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ5L8____szKY3Bz4jbgbgGJJDAyM_HANRmgakhJASurheqT4HIACidZZAN0lGI4",
  "rows": [
    {
      "id": "df8cecd9809662d08eb853989a4bfe7f",
      "order": [
        3.8994359970092773,
        271
      ],
      "fields": {
        "movie_name": "New Moon",
        "person_name": "Robert Pattinson",
        "movie_year": 2009
      }
    },
    {
      "id": "df8cecd9809662d08eb853989a0705d3",
      "order": [
        3.8877224922180176,
        59
      ],
      "fields": {
        "movie_name": "New Moon",
        "person_name": "Billy Burke",
        "movie_year": 2009
      }
    }
  ]
}

Even when I use the "include_fields":["movie_year", "movie_name", "person_name"] option with a different field order, I still get the fields in the order in which they were indexed.

I have also tried different variations of indexing documents. For example, I indexed two values under the same field name movie_name:
curl -X PUT http://adm:pass@127.0.0.1:5984/movies/_design/search22 -H 'Content-Type: application/json' --data-binary '{"_id":"_design/search1", "indexes": {"store1": { "index": "function(doc){\n if(doc.Movie_name){\n index(\"movie_name\", doc.Movie_name, {\"store\": \"yes\"});} \n if(doc.Movie_year){\n index(\"movie_year\", doc.Movie_year, {\"store\": \"yes\"});} \n if (doc.Person_name) {\n index(\"movie_name\", doc.Person_name, {\"store\": \"yes\"});} \n \n }"}}}'
And here I also get results in the expected order:

{
  "total_rows": 2995,
  "bookmark": "g1AAAABleJzLYWBgYMpgTmHgz8tPSTVyMDQy1zMAQsMckEQiQ5L8____szKY3Ow_MIBBIgNctSGa6iQFIJlkj6EhCwC6axhd",
  "rows": [
    {
      "id": "df8cecd9809662d08eb853989a20797a",
      "order": [
        1,
        0
      ],
      "fields": {
        "movie_name": [
          "Helen Mirren",
          "National Treasure: Book of Secrets"
        ],
        "movie_year": 2007
      }
    },
    {
      "id": "df8cecd9809662d08eb853989a18a4d6",
      "order": [
        1,
        0
      ],
      "fields": {
        "movie_name": [
          "Everett Sloane",
          "Lust for Life"
        ],
        "movie_year": 1956
      }
    }
  ]
}

If you add two dummy hard-coded fields at the end, it all breaks on my installation. Perhaps the fields starting with a "$"?

Try using this indexing function:

function(doc){
  if(doc.Movie_name){
    index("movie_name", doc.Movie_name, {"store": "yes"});
  } 
  if(doc.Person_name){
    index("person_name", doc.Person_name, {"store": "yes"});
  }
  if (doc.Movie_year) {
    index("movie_year", doc.Movie_year, {"store": "yes"});
  }
  index("$klass", "movie", { "store": "yes" });
  index("$type", "horror", { "store": "yes" });
 }

I get the following when querying http://localhost:5984/movies/_design/search1/_search/store1/?q=person_name:Tom*

{  
  "total_rows":2,
  "bookmark":"g1AAAABteJzLYWBgYMpgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUklMiTJ____PyuDyc3-AwMYJDLgUZ_HAlLyAEj9x9CWBQDZ3CAq",
  "rows":[  
    {  
      "id":"73061a00ff8f7f0b6441a0b5ba009fb8",
      "order":[  
        1.0,
        0
      ],
      "fields":{  
        "$type":"horror",
        "movie_name":"Forest Grump",
        "movie_year":2000.0,
        "$klass":"movie",
        "person_name":"Tom Chunks"
      }
    },
    {  
      "id":"73061a00ff8f7f0b6441a0b5ba00b00a",
      "order":[  
        1.0,
        0
      ],
      "fields":{  
        "$type":"horror",
        "movie_name":"A Few Bad Men",
        "movie_year":1980.0,
        "$klass":"movie",
        "person_name":"Tom Cruz"
      }
    }
  ]
}

I could believe it's a factor of how many fields you stored. Specifically, when we build the map on the scala side, perhaps @mayya-sharipova was just lucky with her three items and the map happened to return them in insert order. As it grows to five items, the order changes?
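A minimal sketch of that hypothesis, assuming the behaviour of Scala's standard immutable Map (an implementation detail, not a documented guarantee): maps of up to four entries use small fixed-size classes that happen to iterate in insertion order, while larger ones are hash-based. The object and helper names are hypothetical; the field names are taken from the example above.

```scala
object MapOrderSketch {
  // Stand-ins for the stored fields, in the order the index function emits them.
  val emitted: List[(String, Any)] = List(
    "movie_name" -> "Forest Grump",
    "person_name" -> "Tom Chunks",
    "movie_year" -> 2000.0,
    "$klass" -> "movie",
    "$type" -> "horror"
  )

  // Same shape as the fold in question: accumulate entries into an immutable Map.
  def keysAfterFold(fields: List[(String, Any)]): List[String] =
    fields.foldLeft(Map[String, Any]())((acc, kv) => acc + kv).keys.toList

  def main(args: Array[String]): Unit = {
    // Three entries: a small map, which iterates in insertion order.
    println(keysAfterFold(emitted.take(3)))
    // Five entries: a hash-based map, so the key order is unspecified.
    println(keysAfterFold(emitted))
  }
}
```

With five fields, every key is still present; only the iteration order stops being predictable.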

Given we return this as a JSON key/value object, with no implied order, I'm not sure what issue this is really causing you. Shouldn't you be looking items up in the "fields" item by key name anyway?

@rnewson: It seems you're right. I've done a quick test, and adding more fields changes the order of the keys in the returned JSON.

I know that relying on the order of keys in a JSON object is non-standard, but I was hoping to get the stored Lucene fields back in the same order, as that makes it easier for the UI to just show the fields (especially in autocomplete). I index a huge variety of documents and I don't know beforehand the names of the keys that will be returned, so in a way I rely on order.

I see getFields.foldLeft(Map[String, Any]()) and then a conversion of the results to further push them into the Hit(order, fields.toList). Is it possible to not use the Map and handle tuples from the very beginning? Not proficient in Scala (yet)... Does toList convert the map into tuples?
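On the toList question, a small sketch using only the Scala standard library (the object and value names are hypothetical, not clouseau code): toList on a Map yields one (key, value) tuple per entry, in whatever order the map iterates, so any reordering has already happened before the conversion. Accumulating tuples directly keeps emit order.

```scala
object TupleSketch {
  // Stand-ins for stored fields, in emit order.
  val emitted: List[(String, Any)] = List(
    "movie_name" -> "A Few Bad Men",
    "person_name" -> "Tom Cruz",
    "movie_year" -> 1980.0,
    "$klass" -> "movie",
    "$type" -> "horror"
  )

  // Map#toList: a list of (key, value) tuples in the map's iteration order.
  val viaMap: List[(String, Any)] = emitted.toMap.toList

  // Folding straight into a sequence of tuples preserves emit order.
  val direct: List[(String, Any)] =
    emitted.foldLeft(Vector.empty[(String, Any)])((acc, kv) => acc :+ kv).toList

  def main(args: Array[String]): Unit = {
    println(viaMap)
    println(direct)
  }
}
```

Note that the direct version does not merge repeated field names into a single multi-valued entry, which a map-based fold can do by key lookup.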

Now that we understand it, we could add a query parameter to return fields as an array.

Using ListMap instead of Map keeps them in order and fixes the issue for me:

--- IndexService.scala  2016-11-18 08:37:15.089423869 +0000
+++ IndexService.scala.patched  2016-11-18 08:37:10.492423732 +0000
@@ -35,6 +35,7 @@
 import org.apache.lucene.analysis.Analyzer
 import scalang._
 import collection.JavaConversions._
+import scala.collection.immutable.ListMap
 import com.yammer.metrics.scala._
 import com.cloudant.clouseau.Utils._
 import org.apache.commons.configuration.Configuration
@@ -607,7 +608,7 @@
         searcher.doc(scoreDoc.doc, includeFields)
     }

-    var fields = doc.getFields.foldLeft(Map[String, Any]())((acc, field) => {
+    var fields = doc.getFields.foldLeft(ListMap[String, Any]())((acc, field) => {
       val value = field.numericValue match {
         case null =>
           field.stringValue

Cool, say it in the form of a pull request (with a test) and it's in. While we don't have to promise an object ordering it seems harmless to define it as emit order.
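A rough sketch of the property such a test could check, assuming only the Scala standard library. The object and helper names are hypothetical and this is not the project's test harness: folding the stored fields into a ListMap should give back the keys in emit order, even with more than four fields.

```scala
import scala.collection.immutable.ListMap

object EmitOrderCheck {
  // Stand-ins for stored fields, in emit order (distinct names only).
  val emitted: List[(String, Any)] = List(
    "movie_name" -> "Forest Grump",
    "person_name" -> "Tom Chunks",
    "movie_year" -> 2000.0,
    "$klass" -> "movie",
    "$type" -> "horror"
  )

  // Same fold shape as the patch, with ListMap as the accumulator.
  def collect(fields: List[(String, Any)]): ListMap[String, Any] =
    fields.foldLeft(ListMap[String, Any]())((acc, kv) => acc + kv)

  def main(args: Array[String]): Unit = {
    val keys = collect(emitted).keys.toList
    assert(keys == emitted.map(_._1), "expected keys in emit order, got " + keys)
    println(keys)
  }
}
```

The sketch deliberately uses distinct field names; how a repeated name should be placed in the ordering is a separate decision.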

@arturog thanks for the change. Indeed, your change in the function docToHit will ensure that fields are returned in the expected order from clouseau to dreyfus, but as @rnewson said, we still return a dictionary as the result.

As for Lucene saving docs in the expected order, it was already doing so before this change. Indexing happens here:

https://github.com/arturog/clouseau/blob/516290cc21c10a7bd45bd03ea7445821d9c2cd87/src/main/scala/com/cloudant/clouseau/ClouseauTypeFactory.scala#L99-L100

There we iterate over the list, which makes sure that fields are indexed in the order provided.