This simple program auto-generates XML strings resembling the taxAccountReport and writes them to MinIO.
It's intended as a simple PoC to show how, using the MinIO Java API, we can write (with the same code) to MinIO, Google Cloud Storage, AWS S3, and any other S3-compatible object storage.
Although this PoC is written in Scala, the code is simple enough to be easily translatable to Java or Kotlin, and Zeppelin/Spark supports Kotlin, Java, Scala, Python, and R.
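As a minimal sketch of the write path, the snippet below uploads one generated XML string with the MinIO Java SDK from Scala. The endpoint, credentials, bucket, and object names are illustrative placeholders (the PoC reads the real values from application.conf, as described below):

```scala
import io.minio.{MinioClient, PutObjectArgs}
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

object MinIOWriteSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint/credentials; the PoC reads these from application.conf.
    val client = MinioClient.builder()
      .endpoint("http://172.17.0.3:9000")
      .credentials("roguedev1", "shellaccess")
      .build()

    val xml   = "<taxAccountReport><accountId>42</accountId></taxAccountReport>"
    val bytes = xml.getBytes(StandardCharsets.UTF_8)

    // The same putObject call works unchanged against MinIO, AWS S3,
    // or any other S3-compatible endpoint.
    client.putObject(
      PutObjectArgs.builder()
        .bucket("reports") // assumed bucket name
        .`object`("taxAccountReport-1.xml")
        .stream(new ByteArrayInputStream(bytes), bytes.length, -1)
        .contentType("application/xml")
        .build()
    )
  }
}
```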
```
mkdir -p ~/minio/data
docker run \
  -p 9000:9000 \
  -p 9001:9001 \
  --name minio1 \
  -v ~/minio/data:/data \
  minio/minio server /data \
  --console-address ":9001"
```
Note that the local directory ~/minio/data is passed as a volume to the container. See the MinIO Docker Quickstart Guide for details.
Also note that features such as versioning, object locking, and bucket replication require deploying MinIO in distributed mode with Erasure Coding. See the MinIO Quickstart Guide for details.
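The PoC needs a target bucket to exist before it can write. A hedged sketch of creating it once with the same MinIO Java SDK, runnable in e.g. a Zeppelin paragraph or the Scala REPL (endpoint, credentials, and the reports bucket name are assumptions):

```scala
import io.minio.{BucketExistsArgs, MakeBucketArgs, MinioClient}

// Placeholder endpoint, credentials, and bucket name.
val client = MinioClient.builder()
  .endpoint("http://172.17.0.3:9000")
  .credentials("roguedev1", "shellaccess")
  .build()

val bucket = "reports"
// Create the bucket only if it does not already exist.
if (!client.bucketExists(BucketExistsArgs.builder().bucket(bucket).build())) {
  client.makeBucket(MakeBucketArgs.builder().bucket(bucket).build())
}
```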
```
docker run -u $(id -u) -p 8080:8080 -d \
  -v $PWD/logs:/logs \
  -v $PWD/notebook:/notebook \
  -v /usr/local/spark:/opt/spark \
  -e SPARK_HOME=/opt/spark \
  -e ZEPPELIN_LOG_DIR='/logs' \
  -e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
  --name zeppelin apache/zeppelin:0.10.1
```
Note that logs and notebooks are persisted in local volumes. See using the official docker image for details.
Go to Interpreter, search for spark, and add the following properties:
```
spark.hadoop.fs.s3a.endpoint                 172.17.0.3:9000
spark.hadoop.fs.s3a.access.key               roguedev1
spark.hadoop.fs.s3a.secret.key               shellaccess
spark.hadoop.fs.s3a.path.style.access        true
spark.hadoop.fs.s3a.impl                     org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.connection.ssl.enabled   false
```
Add the following coordinates to spark.jars.packages:
```
org.apache.hadoop:hadoop-aws:3.2.2,com.databricks:spark-xml_2.12:0.14.0
```
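With the interpreter configured, a Zeppelin paragraph can read the generated files straight from MinIO over s3a. A minimal sketch, assuming the files live in a bucket named reports and use taxAccountReport as the row tag:

```scala
// %spark paragraph: read the generated XML files from MinIO via s3a.
// The bucket name "reports" and the rowTag value are illustrative assumptions.
val reports = spark.read
  .format("xml") // provided by com.databricks:spark-xml
  .option("rowTag", "taxAccountReport")
  .load("s3a://reports/*.xml")

reports.printSchema()
reports.show(5, truncate = false)
```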
To run and create 10 XML files (the default):
```
mvn scala:run -DmainClass=com.lv.MinIOWriter
```
You can pass the number of files to generate as an argument:
```
mvn scala:run -DmainClass=com.lv.MinIOWriter -DaddArgs=5
```
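For context, -DaddArgs values arrive as ordinary arguments to main. A hedged sketch of how the entry point might interpret the optional count (the parsing shown here is illustrative, not necessarily the PoC's exact code):

```scala
object MinIOWriter {
  def main(args: Array[String]): Unit = {
    // Default to 10 files; an optional first argument overrides the count.
    val count = args.headOption.map(_.toInt).getOrElse(10)
    (1 to count).foreach { i =>
      // Generate and upload taxAccountReport-<i>.xml (see the write sketch above).
      println(s"writing taxAccountReport-$i.xml")
    }
  }
}
```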
Endpoint, bucket name, and auth properties can be configured in src/main/resources/application.conf.
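A minimal sketch of reading such a file with Typesafe Config (the library conventionally behind application.conf); the key names below are hypothetical placeholders and should match whatever the file actually defines:

```scala
import com.typesafe.config.ConfigFactory

// Loads src/main/resources/application.conf from the classpath.
// All key names here are hypothetical placeholders.
val conf      = ConfigFactory.load()
val endpoint  = conf.getString("minio.endpoint")
val bucket    = conf.getString("minio.bucket")
val accessKey = conf.getString("minio.access-key")
val secretKey = conf.getString("minio.secret-key")
```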
The underlying concept for this PoC is that it is possible, relatively easy even, to build an Analytical Engine
on top of Object/Cloud Storage. While Object Storage solutions such as MinIO give us on-premise capabilities where needed,
Cloud Storage gives us close to infinite capacity. In short, the approach demonstrated here gives us analytic capabilities
over practically anything we can put on object/cloud storage.