The ReportReportStatsApp object computes stats for specific tables, or for all tables belonging to a database. The scripts/tables_stats.sh
script uses that functionality:
sh table_stats.sh bigdatadatalake_functional_dev
The output is a file containing stats for each column of each table in the database.
Simple Spark example of generating table stats for use in data quality checks
##SimpleDataGeneratorMain
This will generate a small test data set:
SimpleDataGeneratorMain {outputPath}
##ConfigurableDataGeneratorMain
This will generate a large test data set:
ConfigurableDataGeneratorMain {outputPath} {numberOfColumns} {numberOfRecords} {numberOfPartitions}
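
To make the three sizing arguments concrete, here is a minimal plain-Python sketch of what a configurable generator of this kind might do. It is not the project's implementation: the function name `generate_test_data`, the CSV layout, the `part-NNNNN` file names, and the random integer values are all assumptions made for illustration, since the actual output format of ConfigurableDataGeneratorMain is not described here.

```python
import csv
import os
import random


def generate_test_data(output_path, number_of_columns, number_of_records,
                       number_of_partitions, seed=0):
    """Write random integer rows split across one file per partition.

    Hypothetical stand-in for a configurable test data generator; the
    file naming and CSV format are illustrative assumptions.
    """
    os.makedirs(output_path, exist_ok=True)
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    paths = []
    for partition in range(number_of_partitions):
        # Spread records as evenly as possible; the first partitions
        # absorb any remainder.
        rows = number_of_records // number_of_partitions
        if partition < number_of_records % number_of_partitions:
            rows += 1
        path = os.path.join(output_path, f"part-{partition:05d}")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            for _ in range(rows):
                writer.writerow(
                    [rng.randint(0, 99) for _ in range(number_of_columns)]
                )
        paths.append(path)
    return paths
```

Calling `generate_test_data("./gen/output", 10, 10000, 4)` would mirror the argument order of the command above: ten columns, 10000 records, four partition files.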
##TableStatsSinglePathMain
This will output the following information for a given column in the table:
- null count
- empty count
- total count
- unique value count
- max
- min
- sum
- top N values with their cardinality
TableStatsSinglePathMain {inputPath}
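
The statistics listed above can be illustrated with a small plain-Python sketch for a single column. This is not the Spark job itself, only a single-machine illustration. The function name `column_stats` is hypothetical, and the sketch assumes that nulls and empty strings are excluded from the unique count and top N values, and that only numeric values contribute to max, min, and sum; the real job may treat these cases differently.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def column_stats(values: Iterable[Any], top_n: int = 3) -> Dict[str, Any]:
    """Compute per-column stats in one pass over the values.

    Assumptions (not taken from the project): None is a null, "" is an
    empty value, and both are skipped for unique/max/min/sum/top N.
    """
    null_count = 0
    empty_count = 0
    total_count = 0
    numeric_sum = 0
    minimum = None
    maximum = None
    counts: Counter = Counter()  # value -> number of occurrences

    for value in values:
        total_count += 1
        if value is None:
            null_count += 1
            continue
        if value == "":
            empty_count += 1
            continue
        counts[value] += 1
        # Only real numbers feed max, min, and sum (bool is excluded).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numeric_sum += value
            minimum = value if minimum is None else min(minimum, value)
            maximum = value if maximum is None else max(maximum, value)

    return {
        "null_count": null_count,
        "empty_count": empty_count,
        "total_count": total_count,
        "unique_value_count": len(counts),
        "max": maximum,
        "min": minimum,
        "sum": numeric_sum,
        # Each entry is (value, cardinality), most frequent first.
        "top_n": counts.most_common(top_n),
    }


if __name__ == "__main__":
    print(column_stats([3, 1, None, "", 3, 2, 3, 1], top_n=2))
```

A distributed version would compute the same quantities per partition and merge them, which works directly for the counts, max, min, and sum; the unique count and top N values need the per-value counters to be combined before ranking.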
##Examples of Execution
###Small data set
spark-submit --class com.cloudera.sa.examples.tablestats.ConfigurableDataGeneratorMain --master yarn --deploy-mode client --executor-memory 512M --num-executors 4 examples.tablestats-1.0-SNAPSHOT.jar ./gen/output 10 10000 4

spark-submit --class com.cloudera.sa.examples.tablestats.TableStatsSinglePathMain --master yarn --deploy-mode client --executor-memory 512M --num-executors 4 examples.tablestats-1.0-SNAPSHOT.jar ./gen/output
###Larger data set
spark-submit --class com.cloudera.sa.examples.tablestats.ConfigurableDataGeneratorMain --master yarn --deploy-mode client --executor-memory 512M --num-executors 8 examples.tablestats-1.0-SNAPSHOT.jar ./gen1/output 10 10000000 8

spark-submit --class com.cloudera.sa.examples.tablestats.TableStatsSinglePathMain --master yarn --deploy-mode client --executor-memory 1024M --num-executors 8 examples.tablestats-1.0-SNAPSHOT.jar ./gen1/output