Spark can't read files that don't have the expected extension. For example, it can read CSV files compressed with GZIP if they end in `.gz`, but it fails if they end in `.GZ`.
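The data itself is fine: GZIP bytes decompress the same way regardless of the file name, so it's only Spark's extension-based codec lookup that misses. A small self-contained sketch (hypothetical names, plain `java.util.zip`, no Spark) illustrating that an upper-case `.GZ` file is still a perfectly valid GZIP file:

```scala
import java.io._
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzExtensionDemo {
  // Writes GZIP data to a file with an upper-case .GZ suffix and reads
  // it back. Decompression does not care about the name; only Spark's
  // codec lookup by extension does.
  def firstLineOfUppercaseGz(): String = {
    val file = File.createTempFile("example", ".GZ")
    val out = new GZIPOutputStream(new FileOutputStream(file))
    try out.write("a\t1\nb\t2\n".getBytes("UTF-8")) finally out.close()
    val in = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new FileInputStream(file)), "UTF-8"))
    try in.readLine() finally { in.close(); file.delete() }
  }

  def main(args: Array[String]): Unit =
    println(firstLineOfUppercaseGz()) // prints the first TSV line
}
```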
With this small library, you can fix it.
Start the Databricks cluster with the following Spark property:

```
spark.hadoop.io.compression.codecs com.galiglobal.databricks.GZCodec
```
See Spark configuration for more info.
Then, install the library on your cluster. You can download it from Releases, or clone the project and compile it with `sbt package`.
See Uploading libraries for more about how to upload libraries to Databricks.
Now you should be able to read `.GZ` files:
```scala
val df = spark.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("sep", "\t")
  .load("/FileStore/tables/antonmry/example.GZ")
```
If you want to use a different extension, just modify `src/main/scala/com/galiglobal/databricks/GZCodec.scala`, replacing `GZ` with your extension.
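The codec itself can be very small: Hadoop resolves a codec by matching the file's suffix against each registered codec's default extension, so a subclass of `GzipCodec` that reports `.GZ` is enough. A minimal sketch of what such a class could look like (the actual source may differ; it needs `hadoop-common` on the classpath):

```scala
package com.galiglobal.databricks

import org.apache.hadoop.io.compress.GzipCodec

// Reuses Hadoop's GZIP implementation unchanged, but reports the
// upper-case extension so files ending in .GZ match this codec.
class GZCodec extends GzipCodec {
  override def getDefaultExtension: String = ".GZ"
}
```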
To do the same with a plain Spark cluster, you can also use the following approach.