Support for Spark 3.4.0
psolomin opened this issue · 9 comments
I've tried this connector with Spark 3.4.0 on Java 17 and it seems to work for basic write and read operations. Is there a full test suite anywhere that must pass before Spark 3.4.0 support is added to the README?
P.S.: I needed to exclude Scala dependency:
implementation("com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA") {
exclude group: 'org.scala-lang'
}
Otherwise it was conflicting with Spark's Scala 2.12.17.
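For context, the basic write and read I tested looked roughly like this (a sketch; the server, database, table, and credentials are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mssql-smoke-test").getOrCreate()

url = "jdbc:sqlserver://myserver:1433;database=MyDb;encrypt=true;trustServerCertificate=true;"

# basic write into a test table
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", url)
    .option("dbtable", "dbo.SmokeTest")
    .option("user", "uname")
    .option("password", "****")
    .mode("overwrite")
    .save())

# basic read back
read_back = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", url)
    .option("dbtable", "dbo.SmokeTest")
    .option("user", "uname")
    .option("password", "****")
    .load())
read_back.show()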
@psolomin Can you elaborate on how to exclude the Scala dependency? What file should I add your snippet to? I am new to this and trying to set up PySpark 3.4.0 on my local machine.
Hello @hlngo
What file should I add your snippet to?
It's in my build.gradle - I use Gradle to create a set of jars for Spark to use. Here's a part of my build.gradle file:
configurations {
    userLibs.extendsFrom implementation
}

dependencies {
    // these are used for spark.driver.extraClassPath and spark.executor.extraClassPath
    userLibs("com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA") {
        exclude group: 'org.scala-lang'
    }
}

task copyUserDeps(type: Copy) {
    from configurations.userLibs
    into "${userDepsLocation}"
}
When I run ./gradlew copyUserDeps it fetches the stuff under dependencies + all transitive dependencies (except Scala) and puts them into userDepsLocation, which I configure.
Then, when I run my apps, I use:
spark-submit ... --conf spark.driver.extraClassPath="$userDepsLocation/*" --conf spark.executor.extraClassPath="$userDepsLocation/*"
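If you start Spark from a Python script instead of spark-submit, I believe the same settings can be passed when building the session (a sketch; the path is a placeholder):

from pyspark.sql import SparkSession

# same directory the copyUserDeps task copies the jars into
user_deps_location = "/path/to/user-deps"

spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.driver.extraClassPath", f"{user_deps_location}/*")
    .config("spark.executor.extraClassPath", f"{user_deps_location}/*")
    .getOrCreate()
)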
Hope it helps.
I'm trying the same thing, but I get problems on writes:
java.lang.NoSuchMethodError: 'org.apache.spark.sql.types.StructType org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(java.sql.ResultSet, org.apache.spark.sql.jdbc.JdbcDialect, boolean)'
Tried on JDK 8 and 11.
Hi @zhukovgreen
I did not see this when I tried inserting into a table of primitive types. What was the schema of the Spark DF you attempted to insert? Did that insertion into MS SQL work with previous versions of Spark?
Hi @psolomin.
With the previous version of Spark it was working. I had the following write config:
options = {
    'truncate': 'true',
    'batchsize': 1000000,
    'tableLock': 'true',
    'numPartitions': 25,
    'use_bulk_copy': True,
    'schemaCheckEnabled': 'true',
    'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
    'username': 'uname',
    'password': '****',
    'url': 'jdbc:sqlserver://mymssqlserver:1433;database=DB;encrypt=true;trustServerCertificate=true;',
    'dbtable': 'dbo.QueueItemTrainingRecommend',
    'batchSize': 100000,
    'checkConstraints': 'true',
    'reliabilityLevel': 'BEST_EFFORT',
}
I was writing using:
to_write.write.format("com.microsoft.sqlserver.jdbc.spark").options(**options).mode("overwrite").save()
The schema of the to_write df is:
root
|-- Comp: string (nullable = true)
|-- Empl: string (nullable = true)
|-- UniqueId: string (nullable = true)
|-- WorkData: string (nullable = true)
|-- CreatedAt: timestamp (nullable = true)
Reading the exception, I see that it has one boolean as the last argument, but in Spark 3.4 the signature has changed and it has two booleans.
Thanks for sharing the details, @zhukovgreen!
I think in all my use cases, timestamps were passed to MS SQL as strings. If I bump into the same issue, I will consider forking and patching this.
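By "passed as strings" I mean roughly this kind of cast before the write (a hypothetical sketch reusing your to_write DF):

from pyspark.sql import functions as F

# cast the timestamp column to string so the connector sends plain strings
to_write_str = to_write.withColumn("CreatedAt", F.col("CreatedAt").cast("string"))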
Any timeline for release of this connector for Spark 3.4 with Scala 2.13 and Java 17 support?
@zhukovgreen just a follow-up: I've tried inserting timestamp Spark types into a smalldatetime column, and it worked fine. I will run further checks there anyway, though.
Note: In my use cases, I don't let Spark create tables; they are created via plain SQL in a different tool, with indexes, etc. Spark only writes into them.
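For reference, my check was roughly like this (a sketch; the server, credentials, and the table dbo.TsCheck with a smalldatetime column are placeholders, and the table is assumed to exist already):

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one timestamp column, appended into a pre-created smalldatetime column
df = spark.createDataFrame(
    [(1, datetime.datetime(2023, 5, 1, 12, 30, 0))],
    ["Id", "CreatedAt"],
)
(
    df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", "jdbc:sqlserver://myserver:1433;database=MyDb;encrypt=true;trustServerCertificate=true;")
    .option("dbtable", "dbo.TsCheck")
    .option("user", "uname")
    .option("password", "****")
    .mode("append")  # the table already exists; Spark does not create it
    .save()
)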
Thanks @psolomin, then consider my comment resolved. I will double-check my case once Spark 3.4 support is ready to use. Makes sense?