Parquet file generator for humans
mvn -DskipTests -T 1C install
This should give you parquet-generator-1.0.jar
in your target
folder.
To build for with-dependencies, you can use:
mvn -DskipTests -T 1C clean compile assembly:single
./bin/spark-submit --master yarn \
--class com.ibm.crail.spark.tools ParquetGenerator \
parquet-generator-1.0.jar [OPTIONS]
Current options are:
usage: parquet-generator
-a,--affix affix random payload. Means that in each instance of worker,
the variable payload data will be generated once, and used
multiple times (default false)
-c,--case <arg> case class schema currently supported are:
ParquetExample (default),
IntWithPayload, and tpcds.
These classes are in ./schema/ in src.
-C,--compress <arg> <String> compression type, valid values are:
uncompressed, snappy, gzip,
lzo (default: uncompressed)
-f,--format <arg> <String> output format type (e.g., parquet (default), csv, etc.)
-h,--help show help
-o,--output <arg> <String> the output file name (default: /ParqGenOutput.parquet)
-O,--options <arg> <str,str> key,value strings that will be passed to the data source of spark in
writing. E.g., for parquet you may want to re-consider parquet.block.size. The
default is 128MB (the HDFS block size).
-p,--partitions <arg> <int> number of output partitions (default: 1)
-r,--rows <arg> <long> total number of rows (default: 10)
-R,--rangeInt <arg> <int> maximum int value, value for any Int column will be generated between
[0,rangeInt), (default: 2147483647)
-s,--size <arg> <int> any variable payload size, string or payload in IntPayload (default: 100)
-S,--show <arg> <int> show <int> number of rows (default: 0, zero means do not show)
-t,--tasks <arg> <int> number of tasks to generate this data (default: 1)
-tcbp,--clusterByPartition <arg> <int> true(1) or false(0, default), pass the int
-tdd,--doubleForDecimal <arg> <int> true(1) or false(0, default), pass the int
-tdsd,--dsdgenDir <arg> <String> location of the dsdgen tool
-tfon,--filterOutNullPartitionValues <arg> <int> true(1) or false (0, default), pass the int
-tow,--overWrite <arg> <int> true(1, default) or false(0), pass the int
-tpt,--partitionTable <arg> <int> true(1) or false(0, default), pass the int
-tsd,--stringForDate <arg> <int> true(1) or false(0, default), pass the int
-tsf,--scaleFactor <arg> <Int> scaling factor (default: 1)
-ttf,--tableFiler <arg> <String> ?
An example run would be :
./bin/spark-submit --master yarn \
--class com.ibm.crail.spark.tools.ParquetGenerator parquet-generator-1.0.jar \
-c IntWithPayload -C snappy -o /myfile.parquet -r 84 -s 42 -p 12
This will create 984 ( = 12 * 84)
rows for case class IntWithPayload
as
[Int, Array[Byte]]
with 42 bytes byte array, and save this as a parquet file
format in /myfile.parquet
in 12 different partitions.
This is an example command to generate the dataset with the scaling factor of 2, with 8 tasks but in 2 files (or partitions) when running spark locally. The output goes to crail.
./bin/spark-submit -v --num-executors 2 --executor-cores 1 --executor-memory 1G --driver-memory 1G --master local
--class com.ibm.crail.spark.tools.ParquetGenerator
~/parquet-generator/target/parquet-generator-1.0.jar
-c tpcds
-o crail://localhost:9060/tpcds
-t 8
-p 2
-tsf 2
-tdsd ~/tpcds-kit/tools/
Note: on a cluster the location of dsdgen directory should be accessible on each machine.
Acknowledgement: The data generation logic is derived from https://github.com/databricks/spark-sql-perf
As described here, the logic uses a slightly modified version of the original TPC-DS kit. It can be downloaded and build from https://github.com/databricks/tpcds-kit as
$ git clone https://github.com/databricks/tpcds-kit.git
$ cd ./tpcds-kit/tools/
$ make OS=LINUX
17/11/13 14:10:01 12276 main INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at scala.Option.getOrElse(Option.scala:121)
In a recent commit, I have introduced hive and ORC dependencies. In case you run into above problem then probably you have spark.driver.userClassPathFirst=true
set. See https://issues.apache.org/jira/browse/SPARK-16680
PRs are always welcome. Please fork, and make necessary modifications you propose, and let us know.
If you have questions or suggestions, feel free to post at:
https://groups.google.com/forum/#!forum/zrlio-users
or email: zrlio-users@googlegroups.com