Failed to query fields of type Decimal and data is 0
eye-gu opened this issue · 16 comments
spark sql: select number_col1 from clickhouse
.zhouwq
.user_test
;
log
spark-sql> select number_col1 from clickhouse
.zhouwq
.user_test
;
23/08/10 18:11:38 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NullPointerException
at java.math.BigDecimal.(BigDecimal.java:831)
at scala.math.BigDecimal$.apply(BigDecimal.scala:290)
at xenon.clickhouse.read.format.ClickHouseJsonReader.decodeValue(ClickHouseJsonReader.scala:74)
at xenon.clickhouse.read.format.ClickHouseJsonReader.decode(ClickHouseJsonReader.scala:48)
at xenon.clickhouse.read.format.ClickHouseJsonReader.decode(ClickHouseJsonReader.scala:33)
at xenon.clickhouse.read.ClickHouseReader.get(ClickHouseReader.scala:89)
at xenon.clickhouse.read.ClickHouseReader.get(ClickHouseReader.scala:29)
I think asText
does not help, if ClickHouse eliminates data of 0 in result, we should set it to default value explicitly.
scala> BigDecimal("")
java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:599)
at java.math.BigDecimal.<init>(BigDecimal.java:383)
at java.math.BigDecimal.<init>(BigDecimal.java:809)
at scala.math.BigDecimal$.exact(BigDecimal.scala:126)
at scala.math.BigDecimal$.apply(BigDecimal.scala:284)
... 47 elided
BTW, thanks for reporting this issue, PR is welcome~
I'm not sure what caused it. In my local environment, it works after modifying to 'asText'
Could you try this way?
case d: DecimalType if jsonNode.isBigDecimal =>
Decimal(jsonNode.decimalValue, d.precision, d.scale)
case d: DecimalType if jsonNode.isFloat | jsonNode.isDouble =>
Decimal(BigDecimal(jsonNode.doubleValue, new MathContext(d.precision)), d.precision, d.scale)
+ case d: DecimalType if jsonNode.inInt =>
+ Decimal(BigDecimal(jsonNode.intValue, new MathContext(d.precision)), d.precision, d.scale)
case d: DecimalType =>
Decimal(BigDecimal(jsonNode.textValue, new MathContext(d.precision)), d.precision, d.scale)
Seems there are other unhandled cases here, like isLong, isBigInteger ...
isInt
works
So that's the root cause, would u like to send PR to fix it, and other potential cases?
PTAL
I found that the binary
format does not have this issue, also #264 issue. Is it more recommended to use binary
?
issue is still present. any updates? @eye-gu why issue is closed? it's still present in 0.7.2 with 0.4.5 jdbc driver with spark 3.4.2
@paf91 I need to find some time to publish a new version containing this patch, maybe in a few days, you can try building the master branch before the publish is done.
@pan3793 could you please tell if you know what's the best practice to save resulting dataframe into clickhouse? I couldn't find it in docs https://housepower.github.io/spark-clickhouse-connector/quick_start/02_play_with_spark_sql/ and the best I could do it use clickhouse-jdbc like df.write.format("jdbc").options(options_map).mode("Overwrite").save
@paf91 Well, the "best practices" depend on a lot of stuff. The built-in JDBC data source is maintained by the Apache Spark community as a generic solution for interacting with RDBMS, just keep using it if it works well for your cases.
Instead of providing best practices, I'd like to list some points that I think are worth careful consideration.
Performance in the distributed system is a big topic, I wrote an article to explain how this connector improves the query performance.
Convenience. For example, the data engineer may want to use CREATE/REPLACE TABLE ... AS SELECT ...
or equivalent df.writeTo(table).create()
and df.writeTo(table).createOrReplace()
, to create a ClickHouse table automatically and save data from Spark to ClickHouse, it's impossible without SPARK-43390
Transactions. For a long time, transactions were not strict or even missing in the big data world. In a distributed system, the failure of a single node is normal. Without the guarantee of writing transactions, the resulting retries may lead to eventual data duplication. Have some thoughts about this topic previously #145