Unable to use sedona.global.charset in ShapefileReader
adamaps opened this issue · 4 comments
Expected behavior

`ShapefileReader.readToGeometryRDD(sedona_context, shp_file)` should use the `sedona.global.charset` configuration property set in the Spark session when reading shapefiles containing non-ASCII characters. E.g. a shapefile containing an attribute value "Ariñiz/Aríñez" should appear in a dataframe as "Ariñiz/Aríñez".
Actual behavior

`ShapefileReader.readToGeometryRDD(sedona_context, shp_file)` is not using the charset configuration property set in the Spark context. E.g. a shapefile containing an attribute value "Ariñiz/Aríñez" appears in a dataframe as "Ariñiz/ArÃñez" instead.
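(Editorial aside, not part of the original report: the Ã-style garbling above is characteristic of UTF-8 bytes being decoded as ISO-8859-1/Latin-1, which suggests the reader is falling back to a single-byte charset rather than honoring the configured one. A minimal pure-Python illustration of the mechanism:)

```python
# Encode the expected string as UTF-8, then decode those bytes as ISO-8859-1,
# mimicking a reader that ignores the configured charset.
expected = "Ariñiz/Aríñez"
garbled = expected.encode("utf-8").decode("iso-8859-1")
print(garbled)  # each non-ASCII character becomes an "Ã"-prefixed pair
```

Decoding the garbled text back through the same pair of charsets recovers the original string, which is how you can confirm this failure mode.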
Steps to reproduce the problem
```python
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.spark import SedonaContext
from sedona.utils.adapter import Adapter

conf = SparkConf()
conf.set("sedona.global.charset", "utf8")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sedona = SedonaContext.create(spark)
sedona_context = sedona.sparkContext

shp_file = '[aws s3 path to shapefile]'
shp_rdd = ShapefileReader.readToGeometryRDD(sedona_context, shp_file)
shp_df = Adapter.toDf(shp_rdd, sedona)
```
I can confirm that `("sedona.global.charset", "utf8")` appears in the configuration settings by using:

```python
print(sedona_context.getConf().getAll())
```

I also tried setting the charset property after creating the Sedona context as follows (although this appears to be an older solution):

```python
sedona_context.setSystemProperty("sedona.global.charset", "utf8")
```
Please confirm how to set this configuration property correctly.
Settings
Sedona version = 1.5.1
Apache Spark version = 3.3.0
API type = Python
Python version = 3.10
Environment = AWS Glue 4.0 using sedona-spark-shaded-3.0_2.12-1.5.1.jar
and geotools-wrapper-1.5.1-28.2.jar
@adamaps If you are running Sedona in cluster mode, this needs to be set via `spark.executorEnv.[EnvironmentVariableName]`: https://spark.apache.org/docs/latest/configuration.html

In your case, you might want to try this:

```
spark.executorEnv.sedona.global.charset utf8
```

`spark.executorEnv` is a runtime config that can be set after your SparkSession or SedonaContext has been initiated:

```python
spark.conf.set("spark.executorEnv.sedona.global.charset", "utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset", "utf8")
```
Thank you for the quick response, @jiayuasu!

I tested the following in client mode, before creating the Sedona SparkSession/SparkContext (via a local Docker container):

```python
conf = SparkConf()
conf.set("sedona.global.charset", "utf8")  # I have other conf settings not shown here
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sedona = SedonaContext.create(spark)
```
And I tested both of the following in cluster mode, after creating the Sedona SparkSession/SparkContext (via AWS Glue):

```python
spark.conf.set("spark.executorEnv.sedona.global.charset", "utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset", "utf8")
```
Unfortunately I still see the same issue in both cases.
Are you able to replicate (or reject) the issue using the attached shapefile sample?
`sedona.global.charset` has to be set as a Java system property. You can try setting the following Spark configurations:

```
spark.driver.extraJavaOptions -Dsedona.global.charset=utf8
spark.executor.extraJavaOptions -Dsedona.global.charset=utf8
```
The dataframe loaded from the sample shapefile:
+--------------------+--------------------+--------------------+--------------------+
| geometry| ID| Name| Name_ASCII|
+--------------------+--------------------+--------------------+--------------------+
|MULTIPOLYGON (((-...|01015 |Ariñiz/Aríñez ...|Ariniz/Arinez ...|
+--------------------+--------------------+--------------------+--------------------+
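(Editorial aside for readers landing here: because `extraJavaOptions` injects `-D` flags onto the JVM command line, these properties must be present when the driver and executor JVMs launch; setting them via `spark.conf.set()` after the session exists has no effect. A sketch of a launch command, with the application name hypothetical:)

```
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsedona.global.charset=utf8" \
  --conf "spark.executor.extraJavaOptions=-Dsedona.global.charset=utf8" \
  my_app.py
```

Equivalently, the two `spark.*.extraJavaOptions` lines can be placed in `spark-defaults.conf` before the application starts.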
Thank you, @Kontinuation! 🎉
I can confirm that setting the following configuration parameter in PySpark worked for my local setup. And thanks @jiayuasu for updating the docs.
```python
conf.set("spark.driver.extraJavaOptions", "-Dsedona.global.charset=utf8")
```
Running on AWS/Glue is still causing issues, but this seems specific to our setup.
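(Editorial aside on the Glue case: Glue-managed jobs do not expose `spark-defaults.conf`, so the usual workaround is the special `--conf` job parameter. AWS documents `--conf` as reserved/internal, and its behavior varies across Glue versions, so treat this as an assumption to verify on Glue 4.0:)

```
Job parameter key:   --conf
Job parameter value: spark.driver.extraJavaOptions=-Dsedona.global.charset=utf8 --conf spark.executor.extraJavaOptions=-Dsedona.global.charset=utf8
```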