AbsaOSS/cobrix

Add support for S3 storage

awsazuser opened this issue · 4 comments

Does Cobrix support S3 file systems?
I am getting a "java.lang.IllegalArgumentException: Wrong FS" error when loading the copybook and data file from an AWS S3 bucket.

Code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Spark-Cobol").getOrCreate()
import spark.implicits._
import za.co.absa.cobrix.spark.cobol.source

// Both the copybook and the data file live in the S3 bucket
val df = spark.read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybooks", "s3://xxxx/tesfile.cbl")
  .load("s3://xxxx/sourcedata/DATAFILE0100")

df.printSchema()
df.show()

Error:

java.lang.IllegalArgumentException: Wrong FS: s3://xxxx/tesfile.cbl, expected: hdfs://ip-xxx-xx-xx-85.ec2.internal:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:653)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1430)
at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersValidator$.za$co$absa$cobrix$spark$cobol$source$parameters$CobolParametersValidator$$validatePath$1(CobolParametersValidator.scala:71)
at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersValidator$$anonfun$validateOrThrow$2.apply(CobolParametersValidator.scala:94)
at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersValidator$$anonfun$validateOrThrow$2.apply(CobolParametersValidator.scala:93)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersValidator$.validateOrThrow(CobolParametersValidator.scala:93)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:52)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
... 160 elided

Unfortunately, S3 is not supported right now. But we might add S3 support in the future.
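
For context, the stack trace above shows the path check going through DistributedFileSystem, which suggests the copybook path was validated against the cluster's default filesystem (HDFS) rather than the one named by the s3:// scheme. The sketch below illustrates that general Hadoop behaviour; it is not taken from the Cobrix sources, and the object and method names are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, for illustration only (not Cobrix code).
object WrongFsSketch {
  // Resolves the filesystem from the path's own scheme (s3://, gs://, hdfs://, file://).
  def fsForPath(pathStr: String, conf: Configuration): FileSystem =
    new Path(pathStr).getFileSystem(conf)

  // Checks existence against the filesystem that owns the path.
  def pathExists(pathStr: String, conf: Configuration): Boolean = {
    val path = new Path(pathStr)
    path.getFileSystem(conf).exists(path)
  }

  // By contrast, FileSystem.get(conf) returns the default filesystem (fs.defaultFS).
  // Calling exists() on it with a path of a different scheme fails in
  // FileSystem.checkPath with "Wrong FS", as in the trace above.
  def defaultFs(conf: Configuration): FileSystem = FileSystem.get(conf)
}
```

In a Spark job, the Configuration would typically be spark.sparkContext.hadoopConfiguration, so that any S3 connector settings are picked up.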

S3 storage should be supported in spark-cobol version 2.2.0.

Please let me know if it works for you.

Does Cobrix support the gs:// file system?
I'm getting the same error:
Caused by: java.lang.IllegalArgumentException: Wrong FS: gs://

From the filesystem support perspective, spark-cobol is the same as any other Spark data source. If you can use gs:// to read CSV or Parquet, then it should be possible to read mainframe files as well.
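
One way to apply that check is to read any small file from the same bucket with a built-in Spark source before trying spark-cobol. This is a minimal sketch; the object name, method name, and bucket path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical probe, for illustration only.
object FsProbe {
  // Returns true if Spark's built-in text source can read the given path.
  // A failure here points at the storage connector setup rather than at spark-cobol.
  def canReadText(spark: SparkSession, path: String): Boolean =
    Try(spark.read.text(path).take(1)).isSuccess
}
```

For example, FsProbe.canReadText(spark, "gs://your-bucket/any-small-file.txt") returning false would suggest the gs:// connector is not configured for the Spark session.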