GoogleCloudDataproc/spark-bigquery-connector

DATETIME parsing in PySpark

satybald opened this issue · 0 comments

Currently, the documentation states that Spark parses BigQuery's DATETIME as TimestampType[1]. However, in practice this type gets parsed as a plain string in version 0.32 of the connector[2].

Would it be possible to clarify what the intended behaviour is here? Would it be possible to cast DATETIME values to TimestampType?

Reproducible example:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .master('local[*]') \
...     .appName('Top Shakespeare words') \
...     .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2') \
...     .getOrCreate()

>>> spark.conf.set("viewsEnabled", "true")
>>> spark.conf.set("materializationDataset", "spark_temp_dataset")
>>> word_count = spark.read \
...     .format('bigquery') \
...     .load('SELECT DATETIME("2023-08-12")')
>>> word_count.printSchema()
root
 |-- f0_: string (nullable = true)
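
In the meantime, the string column can be cast explicitly on the Spark side. A minimal workaround sketch, assuming the connector renders DATETIME in BigQuery's canonical form (e.g. 2023-08-12T00:00:00); the column name f0_ is taken from the schema above:

>>> from pyspark.sql import functions as F
>>> # to_timestamp always yields a TimestampType column; values that do not
>>> # match the pattern come back as NULL, so widen the format (e.g.
>>> # "yyyy-MM-dd'T'HH:mm:ss.SSSSSS") if the connector emits fractional seconds.
>>> casted = word_count.withColumn('f0_', F.to_timestamp('f0_', "yyyy-MM-dd'T'HH:mm:ss"))
>>> casted.printSchema()
root
 |-- f0_: timestamp (nullable = true)

This only papers over the issue per query, though; it would still be good to know whether the connector itself is supposed to do this conversion.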

[1] https://github.com/GoogleCloudDataproc/spark-bigquery-connector/#data-types
[2] https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/spark-bigquery-connector-common/src/main/java/com/google/cloud/spark/bigquery/SchemaConverters.java#L387-L392