lucidworks/spark-solr

Spark-Solr can't load non-stored multivalued fields with docValues=true and useDocValuesAsStored=true

uyilmaz opened this issue · 1 comments

Using Solr 8.4.0, Spark-Solr 3.6.1, Spark: 2.11

When a field is configured with:

stored="false" docValues="true" useDocValuesAsStored="true"

in Solr, you can retrieve it in query results even though it is not stored, because docValues are used instead. This works in spark-solr too, except for fields with multiValued=true.

SolrJ and the regular Solr API can return such fields, but when we read them with spark-solr:

val s1 = Map(
  "zkHost" -> "myZK",
  "collection" -> "myCollection",
  "query" -> "multivaluedField:[* TO *]",
  "fields" -> "multivaluedField",
  "max_rows" -> "100000",
  "flatten_multivalued" -> "false"
)

val data = spark.read.format("solr").options(s1).load

data.createOrReplaceTempView("myTable")

This results in:

data: org.apache.spark.sql.DataFrame = [id: string]

Notice that multivaluedField is not resolved; only id appears in the schema.
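
To make the missing column easier to spot on larger schemas, here is a minimal sketch that compares the "fields" option against the columns spark-solr actually resolved. `FieldCheck` and `missingFields` are hypothetical names for illustration, not part of spark-solr.

```scala
// Hypothetical helper, not part of spark-solr.
object FieldCheck {
  // Returns the requested field names (comma-separated, as in the "fields"
  // option) that do not appear among the resolved DataFrame columns.
  def missingFields(requested: String, resolved: Seq[String]): Seq[String] =
    requested
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)              // ignore empty entries such as "a,,b"
      .filterNot(f => resolved.contains(f))
      .toList
}

// In a spark-shell session with the DataFrame above, assuming s1 and data exist:
//   FieldCheck.missingFields(s1("fields"), data.columns.toSeq)
```

With the schema shown above, where only id is resolved, this would report multivaluedField as missing.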

This is a serious issue in my opinion, because it prevents you from using the streaming method when you need multivalued fields in an RDD.

In addition to the above, when you specify a streaming expression instead of a query, like:

val s1 = Map(
  "zkHost" -> "myZK",
  "collection" -> "myCollection",
  // Triple quotes so the inner double quotes do not need escaping
  "expr" -> """search(myCollection,q="multivaluedField:[* TO *]",qt="/export",fl="multivaluedField,id",sort="id asc")""",
  "max_rows" -> "100000",
  "flatten_multivalued" -> "false"
)

the "flatten_multivalued" parameter has no effect: multivalued fields always get flattened.
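
Since the nested quotes in the expression are easy to get wrong when written inline, here is a small sketch of how the same search() expression could be assembled. `ExportExpr.build` is a hypothetical helper for illustration, not a spark-solr or Solr API, and it assumes none of its arguments contain double quotes.

```scala
// Hypothetical helper, not part of spark-solr or Solr.
object ExportExpr {
  // Builds a search() streaming expression that reads from the /export handler.
  // Assumption: the arguments contain no double quotes (no escaping is done).
  def build(collection: String, query: String, fields: Seq[String], sort: String): String = {
    // Drop blank entries so the field list never contains ",,"
    val fl = fields.map(_.trim).filter(_.nonEmpty).mkString(",")
    s"""search($collection,q="$query",qt="/export",fl="$fl",sort="$sort")"""
  }
}

// Usage with the values from the example above:
//   "expr" -> ExportExpr.build("myCollection", "multivaluedField:[* TO *]",
//                              Seq("multivaluedField", "id"), "id asc")
```

This only addresses how the expression string is built; it does not change how spark-solr treats "flatten_multivalued".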