IBM/spark-tpc-ds-performance-test

Cannot complete the test run

HarryLiUS opened this issue · 11 comments

Hello,

I followed the instructions to do a local test run. The first 3 steps completed successfully. At step 4, the table creation completed in a little over 10 minutes. That is longer than I expected, but it did complete. Here is the output:

==============================================
TPC-DS On Spark Menu

SETUP
(1) Create spark tables
RUN
(2) Run a subset of TPC-DS queries
(3) Run All (99) TPC-DS Queries
CLEANUP
(4) Cleanup
(Q) Quit

Please enter your choice followed by [ENTER]: 1

INFO: Creating tables. Will take a few minutes ...
INFO: Progress : [########################################] 100%
INFO: Spark tables created successfully..
Press any key to continue

After the tables were created successfully, I tried to run query 1, and here is what I got:

==============================================
TPC-DS On Spark Menu

SETUP
(1) Create spark tables
RUN
(2) Run a subset of TPC-DS queries
(3) Run All (99) TPC-DS Queries
CLEANUP
(4) Cleanup
(Q) Quit

Please enter your choice followed by [ENTER]: 2

Enter a comma separated list of queries to run (ex: 1, 2), followed by [ENTER]:
1
INFO: Checking pre-reqs for running TPC-DS queries. May take a few seconds..
ERROR: The rowcounts for TPC-DS tables are not correct. Please make sure option 1
is run before continuing with currently selected option
Press any key to continue

I repeated this and it did not help.

Checking rowcounts.rrn, it contains all 0s.

And here is the output from spark-shell in step 3:

scala> spark.conf
res0: org.apache.spark.sql.RuntimeConfig = org.apache.spark.sql.RuntimeConfig@505bc480
scala> spark.conf.get("spark.sql.catalogImplementation")
res1: String = hive

Thank you for the help,
Harry

@dilipbiswal can you give this a look when you have a minute ^

Hello @stevemart @dilipbiswal,
Do you have any update?

Also, I have questions about the tpcdsenv.sh variables. For the error above, I used the defaults except for pointing the root directory to my TPC-DS installation directory. Here is my tpcdsenv.sh:

harry.li@perf84:/usr/local/harry/tpcds/spark-tpc-ds-performance-test$ cat bin/tpcdsenv.sh
#!/bin/bash
#
# tpcdsenv.sh - UNIX Environment Setup
#

#######################################################################
# This is a mandatory parameter. Please provide the location of
# spark installation.
#######################################################################
export SPARK_HOME=/usr/local/harry/spark

#######################################################################
# Script environment parameters. When they are not set the script
# defaults to paths relative from the script directory.
#######################################################################

export TPCDS_ROOT_DIR=/usr/local/harry/tpcds/spark-tpc-ds-performance-test
export TPCDS_LOG_DIR=
export TPCDS_DBNAME=
export TPCDS_WORK_DIR=
harry.li@perf84:/usr/local/harry/tpcds/spark-tpc-ds-performance-test$

My questions are:

  1. Is this setting good for testing with ./bin/tpcdsspark.sh?
  2. If I need to move my database from local disk to HDFS, what changes are needed? I tried changing the settings as follows, and it does not work:

export TPCDS_ROOT_DIR=/usr/local/harry/tpcds/spark-tpc-ds-performance-test
export TPCDS_LOG_DIR=hdfs:///TPC-DS/logDir
export TPCDS_DBNAME=hdfs:///TPC-DS/dbDir
export TPCDS_WORK_DIR=hdfs:///TPC-DS/workDir

Please advise and thanks in advance.
Harry

@HarryLiUS Can you run step 4 (cleanup) to clean all data and start from scratch? I think you may have run dsdgen to generate data at a different scale factor.

@HarryLiUS Were you able to solve the problem? I am facing the same problem.

Has anyone resolved this?

I could not make it work with Spark 3.0.0, but after switching to Spark 2.4.5 the problem went away.

fbaig commented

It works without any modifications with Spark 2.4.5 and Spark 2.4.7, but it requires some changes to run with Spark 3.0.1. The fix does not even relate to Spark itself: there is a check that compares the row counts of the generated data against expected results by comparing file contents. Newer versions of Spark print new warnings that end up at the beginning of the generated file, so the comparison with the expected file fails.
Here are the steps to make it work with Spark 3.0.1 (a sketch of the patched check follows the list):

  • In bin/tpcdsspark.sh, in the function check_createtables()
  • Before the file comparison check, i.e. if cmp -s "$file1" "$file2"
  • If you are on Mac: sed -i '' '/^W/d' $file1
  • If you are on Linux: sed -i '/^W/d' $file1
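
For reference, here is a minimal sketch of what the patched section of check_createtables() might look like. The variable names $file1 and $file2 and the messages are assumptions based on this thread, not the exact contents of bin/tpcdsspark.sh:

# Strip the Spark 3.x warning lines (they start with "W") from the
# generated rowcounts file before comparing it to the expected file.
sed -i '/^W/d' "$file1"     # Linux; on Mac use: sed -i '' '/^W/d' "$file1"
if cmp -s "$file1" "$file2"; then
    echo "INFO: Rowcounts match; tables were created correctly."
else
    echo "ERROR: The rowcounts for TPC-DS tables are not correct."
fi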

This error occurs when the file rowcounts.rrn and the file rowcounts.expected are not exactly the same.
For me, it turned out that rowcounts.rrn is derived from the log rowcounts.out and thus contains some unexpected warning lines.
The rowcounts.rrn then ends up looking like:

WARNING: An illegal reflective access operation has occurred...
Setting default log level to "WARN".
6
11718
144067

And the rowcounts.expected is like:

6
11718
144067

This causes the error in check_createtables().

So here's my solution:
Open rowcounts.rrn under the work directory and write down any words that occur in it but should not be in rowcounts.expected. In my case, the words were 'WARNING' and 'Setting'.
Then edit tpcdsspark.sh at line 99 and add | grep -v "WARNING" and | grep -v "Setting". This filters the unexpected log lines out of rowcounts.out and derives a good rowcounts.rrn (see the sketch below).
I then ran "Cleanup" and then "Create spark tables" from tpcdsspark.sh, and everything worked fine for me.
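
As a rough sketch, the edited line that derives rowcounts.rrn from rowcounts.out could look something like the following. The file names come from this thread; how exactly the line is embedded in tpcdsspark.sh is an assumption:

# Filter Spark's startup noise out of the raw log before it becomes
# the generated rowcounts file.
grep -v "WARNING" rowcounts.out | grep -v "Setting" > rowcounts.rrn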

For Spark 3.3.0, more filters are needed. I made it work by adding the following filters:

| grep -v "WARNING" | grep -v "Setting" | grep -v "Spark"
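
For what it's worth, the three filters can also be collapsed into a single inverted extended-regex grep. This is an equivalent sketch, assuming the same rowcounts.out to rowcounts.rrn step as above:

# Drop every line containing any of the three noise markers in one pass.
grep -vE "WARNING|Setting|Spark" rowcounts.out > rowcounts.rrn

Since the expected rowcounts file contains only numbers, filtering on these words should not drop any legitimate rows.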

Hi @HarryLiUS, have you solved this problem? I checked my rowcounts.rrn and it is also all 0s.

The extra filters for Spark 3.3.0 suggested above are what fixed it for me.