/spark-data-generator

Generates dummy parquet, csv, json files for testing and validating MinIO compatibility

Primary LanguageScalaCreative Commons Attribution 4.0 InternationalCC-BY-4.0

spark-data-generator

Generates dummy parquet, csv, json files for testing and validating MinIO S3 API compatibility.

TODO

  • Support configurable columns
  • Support CSV, JSON

Spark

Example shows how to generate parquet files for 1 billion rows, this example assumes that you have configured spark-defaults.conf to talk to MinIO deployment.

~ sbt package
~ spark-submit --class "ParquetGenerator" --master spark://masternode:7077 \
    --packages org.apache.hadoop:hadoop-aws:3.1.2 --driver-memory 100G \
    --executor-memory 200G --total-executor-cores 256 \
    target/scala-2.12/parquet-data-generator_2.12-1.0.jar 1000000000 1000 500 s3a://benchmarks/1b-500/