saswata-dutta/spark-ingestion

Allow specification of exact schema in conf

saswata-dutta opened this issue · 2 comments

Maybe read a schema DDL string or schema JSON for formats like JSON and CSV, to avoid wrong inference.

DataType.fromJson(schema_json_str)
or
DataType.fromDDL(schema_DDL_str)

Then, spark.read.schema(schema)...
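A minimal sketch of that wiring, assuming the schema string arrives through the job conf (the `schemaDdl` value and input path below are placeholders, and `DataType.fromDDL` needs Spark 2.4+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().appName("ingest").getOrCreate()

// Placeholder conf value; in practice this would come from the job's config.
val schemaDdl = "id BIGINT, name STRING, created_at TIMESTAMP"

// Both parsers return a DataType, so downcast to StructType for the reader.
val schema = DataType.fromDDL(schemaDdl).asInstanceOf[StructType]
// or: DataType.fromJson(schemaJsonStr).asInstanceOf[StructType]

val df = spark.read
  .schema(schema)
  .json("s3://bucket/input/") // placeholder path; same pattern for .csv(...)
```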

NB: what happens to rows which don't conform to the schema? For CSV/JSON, consider columnNameOfCorruptRecord.
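A sketch of how that option could be used, assuming the default PERMISSIVE mode and a placeholder `events.csv` input; the corrupt-record column must be declared in the schema as a nullable string for malformed rows to land there:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("_corrupt_record", StringType) // receives the raw malformed row
))

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("events.csv")

// Since Spark 2.3, querying only the corrupt-record column of raw CSV/JSON
// is disallowed, so cache first before splitting out the bad rows.
df.cache()
val bad = df.filter(df("_corrupt_record").isNotNull)
```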

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L650

For Mongo-Spark no such option exists, so use an explicit case class and maybe clean-frames;
but how do we specify the class name of the schema from conf?

https://stackoverflow.com/questions/23785439/getting-typetag-from-a-classname-string

https://github.com/funkyminds/cleanframe
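Following the StackOverflow answer above, a TypeTag can be materialized at runtime from a fully-qualified class name string; the sketch below is illustrative only (the `com.example.MyRecord` name is hypothetical, and `ScalaReflection.schemaFor` is a catalyst-internal API that may change between Spark versions):

```scala
import scala.reflect.api
import scala.reflect.runtime.universe._

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Build a TypeTag from a fully-qualified class name (per the SO answer above).
def stringToTypeTag[A](name: String): TypeTag[A] = {
  val mirror = runtimeMirror(getClass.getClassLoader)
  val tpe = mirror.staticClass(name).selfType
  TypeTag(mirror, new api.TypeCreator {
    def apply[U <: api.Universe with Singleton](m: api.Mirror[U]): U#Type =
      tpe.asInstanceOf[U#Type]
  })
}

// Hypothetical FQCN taken from conf; derive the Spark schema from the type.
val tag = stringToTypeTag[Product]("com.example.MyRecord")
val schema = ScalaReflection.schemaFor(tag.tpe).dataType.asInstanceOf[StructType]
```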


Use the MongoSpark builder to specify the SparkSession and ReadConfig, and pass the schema's case class to the toDF terminator.
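A minimal sketch of that, assuming the 2.x mongo-spark connector; the `MyRecord` case class and the ReadConfig values are placeholders:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

case class MyRecord(id: Long, name: String) // hypothetical schema class

val spark = SparkSession.builder().appName("mongo-ingest").getOrCreate()

val readConfig = ReadConfig(Map(
  "uri"        -> "mongodb://localhost:27017", // placeholder
  "database"   -> "db",
  "collection" -> "coll"
))

val df = MongoSpark.builder()
  .sparkSession(spark)
  .readConfig(readConfig)
  .build()
  .toDF[MyRecord]()
```

Note that toDF[T] needs a TypeTag at the call site, which is exactly where the reflection sketch above would come in if the class name only arrives as a string in conf.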