Cannot complete the test run
HarryLiUS opened this issue · 11 comments
Hello,
I followed the instructions to do a local test run. The first three steps completed successfully. At step 4, the table creation completed in a little over 10 minutes, which was longer than I expected, but it finished. Here is the output:
==============================================
TPC-DS On Spark Menu
SETUP
(1) Create spark tables
RUN
(2) Run a subset of TPC-DS queries
(3) Run All (99) TPC-DS Queries
CLEANUP
(4) Cleanup
(Q) Quit
Please enter your choice followed by [ENTER]: 1
INFO: Creating tables. Will take a few minutes ...
INFO: Progress : [########################################] 100%
INFO: Spark tables created successfully..
Press any key to continue
After the tables were created successfully, I tried to run query 1, and here is what I got:
==============================================
TPC-DS On Spark Menu
SETUP
(1) Create spark tables
RUN
(2) Run a subset of TPC-DS queries
(3) Run All (99) TPC-DS Queries
CLEANUP
(4) Cleanup
(Q) Quit
Please enter your choice followed by [ENTER]: 2
Enter a comma separated list of queries to run (ex: 1, 2), followed by [ENTER]:
1
INFO: Checking pre-reqs for running TPC-DS queries. May take a few seconds..
ERROR: The rowcounts for TPC-DS tables are not correct. Please make sure option 1
is run before continuing with currently selected option
Press any key to continue
I repeated this and it did not help.
Checking rowcounts.rrn, every count in it is 0.
And, here is the output from spark-shell from step 3.
scala> spark.conf
res0: org.apache.spark.sql.RuntimeConfig = org.apache.spark.sql.RuntimeConfig@505bc480
scala> spark.conf.get("spark.sql.catalogImplementation")
res1: String = hive
Thank you for the help,
Harry
@dilipbiswal can you give this a look when you have a minute ^
Hello @stevemart @dilipbiswal,
Do you have any update?
Also, I have questions about the tpcdsenv.sh variables. For the error above, I used the defaults, except for pointing the root directory to my TPC-DS installation directory. Here is the tpcdsenv.sh:
harry.li@perf84:/usr/local/harry/tpcds/spark-tpc-ds-performance-test$ cat bin/tpcdsenv.sh
#!/bin/bash
#
# tpcdsenv.sh - UNIX Environment Setup
#
#######################################################################
# This is a mandatory parameter. Please provide the location of
# spark installation.
#######################################################################
export SPARK_HOME=/usr/local/harry/spark
#######################################################################
# Script environment parameters. When they are not set the script
# defaults to paths relative from the script directory.
#######################################################################
export TPCDS_ROOT_DIR=/usr/local/harry/tpcds/spark-tpc-ds-performance-test
export TPCDS_LOG_DIR=
export TPCDS_DBNAME=
export TPCDS_WORK_DIR=
harry.li@perf84:/usr/local/harry/tpcds/spark-tpc-ds-performance-test$
My questions are:
- Are these settings good for testing with ./bin/tpcdsspark.sh?
- If I need to move my database from local disk to HDFS, what changes are needed? I tried changing the settings as follows and it did not work.
export TPCDS_ROOT_DIR=/usr/local/harry/tpcds/spark-tpc-ds-performance-test
export TPCDS_LOG_DIR=hdfs:///TPC-DS/logDir
export TPCDS_DBNAME=hdfs:///TPC-DS/dbDir
export TPCDS_WORK_DIR=hdfs:///TPC-DS/workDir
Please advise and thanks in advance.
Harry
@HarryLiUS Can you run step 4 (cleanup) to clean all data and start from scratch? I think you may have run dsdgen to generate data at a different scale factor.
@HarryLiUS Could you solve the problem? I am facing the same problem.
has anyone resolved this?
I could not make it work with Spark 3.0.0. But after switching to Spark 2.4.5 the problem went away.
It works without any modifications on Spark 2.4.5 and Spark 2.4.7, but it requires some changes to run with Spark 3.0.1. The fix does not actually relate to Spark itself. There is a check that compares row counts from the generated data against the expected results, and it fails because it compares file contents byte for byte. Newer versions of Spark print new warnings at the beginning of the generated file, which makes the comparison with the expected result fail.
Here are the steps to make it work with Spark 3.0.1:
- In bin/tpcdsspark.sh, in the function check_createtables(), before the file comparison check, i.e. if cmp -s "$file1" "$file2", add:
  - If you are on Mac: sed -i '' '/^W/d' $file1
  - If you are on Linux: sed -i '/^W/d' $file1
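The patched comparison can be sketched like this; the file names follow the thread, but the surrounding shape of check_createtables() is assumed, and the sample warning line is illustrative:

```shell
# Sketch (assumed shape) of the patched check in check_createtables():
file1="rowcounts.rrn"        # row counts produced by the create-tables step
file2="rowcounts.expected"   # expected row counts shipped with the kit

# Simulate a rowcounts.rrn polluted by a Spark 3.x warning line:
printf 'WARNING: An illegal reflective access operation has occurred\n6\n11718\n144067\n' > "$file1"
printf '6\n11718\n144067\n' > "$file2"

# Strip lines starting with "W" before comparing (Linux form;
# on macOS use: sed -i '' '/^W/d' "$file1"):
sed -i '/^W/d' "$file1"

if cmp -s "$file1" "$file2"; then
  echo "rowcounts match"
else
  echo "rowcounts differ"
fi
```

With the warning line stripped, the two files are byte-identical and the check passes.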
This error occurs when the file rowcounts.rrn and the file rowcounts.expected are not exactly the same.
For me, it turns out that rowcounts.rrn is derived from the log rowcounts.out and thus contains some unexpected warning lines. The rowcounts.rrn then turns out to look like:
WARNING: An illegal reflective access operation has occurred...
Setting default log level to "WARN".
6
11718
144067
And rowcounts.expected looks like:
6
11718
144067
This causes the error in check_createtables.
So here's my solution: open the log rowcounts.rrn under the work directory and write down the words that occur in the file but should not be contained in rowcounts.expected. In my case, the words included 'WARNING' and 'Setting'. Then edit the file tpcdsspark.sh at line 99 and add | grep -v "WARNING" and | grep -v "Setting"; this filters the unexpected log lines out of 'rowcounts.out' and derives a good 'rowcounts.rrn'. I then used "Cleanup" followed by "Create spark tables" in 'tpcdsspark.sh', and everything works fine for me.
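A minimal sketch of that filtering step (the rowcounts.out contents are simulated here; in practice the grep filters are appended to the pipeline at line 99 of tpcdsspark.sh):

```shell
# Simulate a rowcounts.out polluted by Spark startup messages
# (the message text is illustrative):
printf 'WARNING: An illegal reflective access operation has occurred\nSetting default log level to "WARN".\n6\n11718\n144067\n' > rowcounts.out

# Drop the warning lines while deriving rowcounts.rrn from rowcounts.out:
grep -v "WARNING" rowcounts.out | grep -v "Setting" > rowcounts.rrn

cat rowcounts.rrn   # only the three row counts remain
```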
For Spark 3.3.0, it needs more filters. I made it work by adding the following filter:
| grep -v "WARNING" | grep -v "Setting" | grep -v "Spark"
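As a sketch, the extended filter chain applied to a simulated rowcounts.out (the banner text mentioning Spark is illustrative, not the actual Spark 3.3.0 output):

```shell
# Simulated rowcounts.out with an extra banner line mentioning Spark:
printf 'WARNING: An illegal reflective access operation has occurred\nSetting default log level to "WARN".\nRunning on Spark 3.3.0\n6\n11718\n144067\n' > rowcounts.out

# The extended filter chain for Spark 3.3.0:
grep -v "WARNING" rowcounts.out | grep -v "Setting" | grep -v "Spark" > rowcounts.rrn

cat rowcounts.rrn
```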
Hi @HarryLiUS, have you solved this problem? I checked my rowcounts.rrn and it is also all 0.
For Spark 3.3.0, it needs more filters. I made it work by adding the following filter:
| grep -v "WARNING" | grep -v "Setting" | grep -v "Spark"
This is what fixed it for me.