Stratio/Spark-MongoDB

Schema discovery created unpersistable schema for empty arrays

ssimeonov opened this issue · 4 comments

If all observed values of a document field are [], the generated schema maps the field to ArrayType[NullType], which cannot be persisted or used in any meaningful way.

In the absence of evidence about the type of the array elements, a more logical behavior would be to allow overriding the schema of a subset of fields, e.g., as a JSON string (schema.json in Spark) or, if a default behavior is needed, to map to ArrayType[StringType] rather than ArrayType[NullType]. The benefit is that this mapping can be persisted and can represent any Mongo array, including heterogeneous ones.
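For illustration, a minimal sketch of the JSON round trip such an override could use, assuming df is a DataFrame already loaded from the collection (only plain Spark APIs, StructType.json and DataType.fromJson, are involved; nothing here is connector-specific):

import org.apache.spark.sql.types.{DataType, StructType}

// df is a hypothetical DataFrame already loaded from the collection.
// Serialize the inferred schema so it can be inspected or edited out of band,
// e.g. replacing array<null> element types with string.
val inferredJson: String = df.schema.json
val editedJson: String = inferredJson  // placeholder for the hand- or tool-edited version

// Parse the edited JSON back into a StructType that can be supplied
// instead of relying on inference.
val overridden: StructType = DataType.fromJson(editedJson).asInstanceOf[StructType]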

This issue is somewhat like #113: you should not assume a type from a lack of evidence just because it is convenient.

With the current connector API there is no problem if you want to provide your own schema (a usage sketch follows the code excerpts below):

/**
 * A MongoDB baseRelation that can eliminate unneeded columns
 * and filter using selected predicates before producing
 * an RDD containing all matching tuples as Row objects.
 * @param config A Mongo configuration with needed properties for MongoDB
 * @param schemaProvided The optionally provided schema. If not provided,
 *                       it will be inferred from the whole field projection
 *                       of the specified table in Spark SQL statement using
 *                       a sample ratio (as JSON Data Source does).
 * @param sqlContext An existing Spark SQL context.
 */
class MongodbRelation(private val config: Config,
                       schemaProvided: Option[StructType] = None)(
                       @transient val sqlContext: SQLContext) extends BaseRelation

That class will only try to infer the schema if schemaProvided is None:

  /**
   * Default schema to be used in case no schema was provided before.
   * It scans the RDD generated by Spark SQL statement,
   * using specified sample ratio.
   */
  @transient private lazy val lazySchema =
    MongodbSchema(
      new MongodbRDD(sqlContext, config, rddPartitioner),
      config.get[Any](MongodbConfig.SamplingRatio).fold(MongodbConfig.DefaultSamplingRatio)(_.toString.toDouble)).schema()

  override val schema: StructType = schemaProvided.getOrElse(lazySchema)
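For example, a hedged usage sketch: the MongodbRelation constructor is used exactly as quoted above, while the MongodbConfigBuilder call, the option keys, and the package imports are recalled from the README and may differ between versions, so treat those parts as assumptions:

import org.apache.spark.sql.types._
// Package and builder names below may differ between connector releases.
import com.stratio.provider.mongodb._
import MongodbConfig._

// Explicit schema: the field that is [] in every sampled document (here the
// hypothetical "tags") is declared as array<string> up front.
val customSchema = StructType(Seq(
  StructField("_id", StringType),
  StructField("tags", ArrayType(StringType))
))

val config = MongodbConfigBuilder(Map(
  Host -> List("localhost:27017"),
  Database -> "mydb",           // placeholder
  Collection -> "mycollection"  // placeholder
)).build()

// Passing Some(customSchema) short-circuits inference, per the constructor above.
// sqlContext is assumed to be an existing SQLContext.
val relation = new MongodbRelation(config, Some(customSchema))(sqlContext)
val df = sqlContext.baseRelationToDataFrame(relation)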

On the other hand, there are ways of building a modified schema from the inferred one, which you can then pass to MongodbRelation.
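For instance, a small sketch that walks the inferred StructType and replaces array<null> element types with array<string> before handing the result to MongodbRelation (only top-level fields are handled; nested structs would need a recursive walk, and inferredSchema stands for whatever schema inference produced):

import org.apache.spark.sql.types._

// Rewrite fields inferred as ArrayType(NullType) to ArrayType(StringType),
// which can be persisted and can hold any (even heterogeneous) Mongo array.
def stringifyEmptyArrays(schema: StructType): StructType =
  StructType(schema.fields.map {
    case StructField(name, ArrayType(NullType, containsNull), nullable, metadata) =>
      StructField(name, ArrayType(StringType, containsNull), nullable, metadata)
    case other => other
  })

val fixedSchema = stringifyEmptyArrays(inferredSchema)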

That's good to know, but are you suggesting that the desired behavior of the library is to produce an unwritable schema in this case, as opposed to a potentially incorrect but writable one? If so, especially since the exception is rather cryptic, would it not be better to establish some best practices and utilities for checking the schema after discovery and, perhaps, to offer schema rewriting strategies, e.g., removing the field from the schema (because adding the field later works well with Spark's schema merging, whereas changing the type of a field does not)?
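For concreteness, such a utility could be a detection check plus a removal strategy along these lines (a sketch with made-up names; only top-level fields are inspected):

import org.apache.spark.sql.types._

// Names of fields whose inferred type cannot be persisted because it contains NullType.
def unpersistableFields(schema: StructType): Seq[String] =
  schema.fields.collect {
    case StructField(name, NullType, _, _)               => name
    case StructField(name, ArrayType(NullType, _), _, _) => name
  }

// Strategy: drop the offending fields so the written schema stays mergeable;
// the field can be added back later once real values reveal its element type.
def dropUnpersistableFields(schema: StructType): StructType = {
  val bad = unpersistableFields(schema).toSet
  StructType(schema.fields.filterNot(f => bad.contains(f.name)))
}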

@ssimeonov The development team behind Spark-MongoDB has been discussing this issue. Even though it is not formally correct, as I mentioned above, we've agreed to change the default to StringType: a sweet spot between correctness and convenience.

@ssimeonov PR #125 has been opened with the described change; it will be merged soon.

Btw, it must be quite a strange collection, or too low a sampling ratio, if not a single document has an array with at least one element.