samelamin/spark-bigquery

Class cast exception occurs (Double cannot be cast to Float)

Closed this issue · 3 comments

Hi, I'm trying to analyze Firebase data using Spark with this spark-bigquery connector, but a ClassCastException occurs: Double cannot be cast to Float.
The Avro spec does define a double type, yet the module appears to cast such columns to float (https://avro.apache.org/docs/1.8.1/spec.html).

Would you mind telling me whether this is a bug?

https://support.google.com/firebase/answer/7029846
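For context, the cast fails at the JVM level rather than as a numeric conversion: a boxed java.lang.Double can never be cast to java.lang.Float, even though the primitive conversion double to float is legal. This is exactly what Spark's BoxesRunTime.unboxToFloat attempts when a double column is wrongly declared as FloatType in the schema. A minimal demonstration (plain Scala, no Spark needed):

```scala
// The primitive conversion double -> float is perfectly legal:
val f: Float = 1.5d.toFloat

// But the boxed cast is not: the connector hands Spark a boxed
// java.lang.Double, and a FloatType schema makes Spark unbox it as Float.
val boxed: Any = java.lang.Double.valueOf(1.5)
val threw =
  try { boxed.asInstanceOf[Float]; false }
  catch { case _: ClassCastException => true }

println(threw) // true: java.lang.Double cannot be cast to java.lang.Float
```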

Error Detail

  • command
val df = spark.sqlContext.read.format("com.samelamin.spark.bigquery")
  .option("tableReferenceSource","xxxx:yyy.app_events_intraday_20180417")
  .load()
df.printSchema
  • output
root
 |-- user_dim: struct (nullable = true)
 |    |-- user_id: string (nullable = true)
 |    |-- first_open_timestamp_micros: long (nullable = true)
 |    |-- user_properties: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- key: string (nullable = true)
 |    |    |    |-- value: struct (nullable = true)
 |    |    |    |    |-- value: struct (nullable = true)
 |    |    |    |    |    |-- string_value: string (nullable = true)
 |    |    |    |    |    |-- int_value: long (nullable = true)
 |    |    |    |    |    |-- float_value: float (nullable = true)
 |    |    |    |    |    |-- double_value: float (nullable = true)
 |    |    |    |    |-- set_timestamp_usec: long (nullable = true)
 |    |    |    |    |-- index: long (nullable = true)
 |    |-- device_info: struct (nullable = true)
 |    |    |-- device_category: string (nullable = true)
 |    |    |-- mobile_brand_name: string (nullable = true)
 |    |    |-- mobile_model_name: string (nullable = true)
 |    |    |-- mobile_marketing_name: string (nullable = true)
 |    |    |-- device_model: string (nullable = true)
 |    |    |-- platform_version: string (nullable = true)
 |    |    |-- device_id: string (nullable = true)
 |    |    |-- resettable_device_id: string (nullable = true)
 |    |    |-- user_default_language: string (nullable = true)
 |    |    |-- device_time_zone_offset_seconds: long (nullable = true)
 |    |    |-- limited_ad_tracking: boolean (nullable = true)
 |    |-- geo_info: struct (nullable = true)
 |    |    |-- continent: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- region: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |-- app_info: struct (nullable = true)
 |    |    |-- app_version: string (nullable = true)
 |    |    |-- app_instance_id: string (nullable = true)
 |    |    |-- app_store: string (nullable = true)
 |    |    |-- app_platform: string (nullable = true)
 |    |    |-- app_id: string (nullable = true)
 |    |-- traffic_source: struct (nullable = true)
 |    |    |-- user_acquired_campaign: string (nullable = true)
 |    |    |-- user_acquired_source: string (nullable = true)
 |    |    |-- user_acquired_medium: string (nullable = true)
 |    |-- bundle_info: struct (nullable = true)
 |    |    |-- bundle_sequence_id: long (nullable = true)
 |    |    |-- server_timestamp_offset_micros: long (nullable = true)
 |    |-- ltv_info: struct (nullable = true)
 |    |    |-- revenue: float (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |-- event_dim: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- params: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- key: string (nullable = true)
 |    |    |    |    |-- value: struct (nullable = true)
 |    |    |    |    |    |-- string_value: string (nullable = true)
 |    |    |    |    |    |-- int_value: long (nullable = true)
 |    |    |    |    |    |-- float_value: float (nullable = true)
 |    |    |    |    |    |-- double_value: float (nullable = true)
 |    |    |-- timestamp_micros: long (nullable = true)
 |    |    |-- previous_timestamp_micros: long (nullable = true)
 |    |    |-- value_in_usd: float (nullable = true)
  • command
import org.apache.spark.sql.functions._
df.show
  • output
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9, 10.228.249.82, executor 0): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float

Hi @smdmts, correct: we should be casting float to float, not double to float.

Good find!

Feel free to send a pr in
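The fix therefore lives in the schema conversion. The function below is only an illustrative sketch (the name avroPrimitiveToSpark and the match shape are assumptions, not the connector's actual code): when translating the BigQuery/Avro schema into a Spark schema, Avro double must map to DoubleType; mapping it to FloatType makes Spark try to unbox the incoming Double values as Floats and throw.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch of the converter fix, not the connector's real code:
// each Avro primitive maps to the matching Spark SQL type.
def avroPrimitiveToSpark(avroType: String): DataType = avroType match {
  case "boolean" => BooleanType
  case "int"     => IntegerType
  case "long"    => LongType
  case "float"   => FloatType
  case "double"  => DoubleType // the fix: this previously mapped to FloatType
  case "string"  => StringType
  case other     => throw new IllegalArgumentException(s"unhandled Avro type: $other")
}
```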

Hi,
I am using the connector from Scala code with

import com.samelamin.spark.bigquery._

I have a Hive table imported into BigQuery through an Avro file, and the table is created in BQ as follows

(screenshot: BigQuery table schema)

It is pretty simple. The code first loads this table:

// read data from BigQuery table
println("\nreading data from " + fullyQualifiedInputTableId)

val df = spark.sqlContext
  .read
  .format("com.samelamin.spark.bigquery")
  .option("tableReferenceSource", fullyQualifiedInputTableId)
  .load()

df.printSchema

// create a temporary view on the DataFrame
df.createOrReplaceTempView("tmp")
This is the output:

reading data from axial-glow-224522:accounts.ll_18201960
root
 |-- transactiondate: string (nullable = true)
 |-- transactiontype: string (nullable = true)
 |-- sortcode: string (nullable = true)
 |-- accountnumber: string (nullable = true)
 |-- transactiondescription: string (nullable = true)
 |-- debitamount: float (nullable = true)
 |-- creditamount: float (nullable = true)
 |-- balance: float (nullable = true)

The tmp view is created. However, when reading debitamount, which is defined as float, I get the following error:

spark.sql("select transactiondate,transactiontype, sortcode, accountnumber, transactiondescription, debitamount from tmp").collect.foreach(println)

18/12/27 19:41:59 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, rhes77-cluster-w-1.europe-west2-a.c.axial-glow-224522.internal, executor 1): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float
	at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:109)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getFloat(rows.scala:43)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getFloat(rows.scala:195)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Is there any workaround for this, please?

Thanks,

Mich

Hi,

I now have a work-around for this issue: using Spark DataFrame transformations, I cast the date column from String to Date and the numeric columns from String to Double where appropriate, then save the data to the BigQuery table.
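A minimal sketch of that kind of transformation, assuming a SparkSession, the DataFrame df from the snippet above, and that the affected columns arrive as strings (the column names are taken from the schema printed earlier; the exact read path is up to you):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, DoubleType}

// Cast explicitly on the DataFrame before saving, so no FloatType column
// ever exists and Spark never tries to unbox a Double as a Float.
val cleaned = df
  .withColumn("transactiondate", col("transactiondate").cast(DateType))
  .withColumn("debitamount",     col("debitamount").cast(DoubleType))
  .withColumn("creditamount",    col("creditamount").cast(DoubleType))
  .withColumn("balance",         col("balance").cast(DoubleType))

cleaned.printSchema // the amount columns now report as double
```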

Let me know your thoughts.

Thanks