Looks like ccspark tried to access everything from local file. What's wrong with the settings?

Question

GenuineReader opened this issue 2 years ago · 1 comment

spark-3.3.2-bin-hadoop3/bin/spark-submit ./server_count. --num_output_partitions 1 --log_level WARN ./input/wat.gz servernames

23/02/18 09:20:39 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.41, 56238, None)
2023-02-18 09:20:52,155 INFO CountServers: Reading local file WARC/1.0
2023-02-18 09:20:52,156 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC/1.0: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC/1.0'
2023-02-18 09:20:52,157 INFO CountServers: Reading local file WARC-Type: warcinfo
2023-02-18 09:20:52,158 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC-Type: warcinfo: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC-Type: warcinfo'
2023-02-18 09:20:52,158 INFO CountServers: Reading local file WARC-Date: 2017-04-01T22:37:17Z
2023-02-18 09:20:52,159 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC-Date: 2017-04-01T22:37:17Z: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC-Date: 2017-04-01T22:37:17Z'
2023-02-18 09:20:52,160 INFO CountServers: Reading local file WARC-Filename: CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz
2023-02-18 09:20:52,161 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC-Filename: CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC-Filename: CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz'
2023-02-18 09:20:52,163 INFO CountServers: Reading local file WARC-Record-ID: urn:uuid:55d1a532-f91b-4461-b803-9bfc77efa410

Answer 1 · 2023-02-19T21:47:04.000Z

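The job expects as input a text file that lists WARC/WAT/WET files, one per line, each given as a path to a local file or as an S3 URL. According to the error messages, the job was instead pointed at a WAT file itself: it reads the WAT content line by line and tries, without success, to interpret every line (`WARC/1.0`, `WARC-Type: warcinfo`, ...) as a file name.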
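
To make the fix concrete, here is a minimal sketch of how such a listing file could be created. The helper name `write_input_listing` and the listing file name `wat_paths.txt` are made up for illustration and are not part of cc-pyspark. The relative path `input/wat.gz` is taken from the command in the question; judging by the log above, relative paths seem to be resolved against the directory the job is started from.

```python
import os
import tempfile


def write_input_listing(warc_paths, listing_path):
    """Write one WARC/WAT/WET path (or S3 URL) per line to listing_path.

    Hypothetical helper for illustration; not part of cc-pyspark.
    """
    with open(listing_path, "w", encoding="utf-8") as listing:
        for path in warc_paths:
            listing.write(path + "\n")
    return listing_path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing in the working tree is touched.
    demo_dir = tempfile.mkdtemp()
    listing = write_input_listing(
        ["input/wat.gz"],  # the WAT file from the question
        os.path.join(demo_dir, "wat_paths.txt"),  # assumed listing file name
    )
    with open(listing, encoding="utf-8") as f:
        print(f.read(), end="")
```

The listing file, not `./input/wat.gz` itself, would then be passed as the input argument of the `spark-submit` command shown in the question.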