deanwampler/spark-scala-tutorial

Unable to run WordCount3 on my Hadoop cluster


Hi Owner!!

My WordCount3 runs successfully locally; I can see the output folder with the output files in it. However, when I run it on the Hadoop cluster using hadoop.HWordCount3, it displays an error.

[info] running: spark-submit --class WordCount3 ./target/scala-2.11/spark-scala-tutorial_2.11-5.0.0.jar --out /user/root/output/kjv-wc3
[info]
[error] Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
[error]         at util.CommandLineOptions$$anonfun$3.apply(CommandLineOptions.scala:34)
[error]         at util.CommandLineOptions$$anonfun$3.apply(CommandLineOptions.scala:34)
[error]         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[error]         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[error]         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
[error]         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
[error]         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[error]         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[error]         at util.CommandLineOptions.apply(CommandLineOptions.scala:34)
[error]         at WordCount3$.main(WordCount3.scala:26)
[error]         at WordCount3.main(WordCount3.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error]         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
[error]         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
[error]         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
[error]         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
[error]         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[info]
[info] Contents of the output directories:
[error] ls: `/user/root/output': No such file or directory
[info]
[info]  **** To see the contents, open the following URL(s):
[info]
[info]
[success] Total time: 17 s, completed Dec 30, 2016 7:16:53 PM

Do I have to create a directory such as /user/root/output in my Hadoop cluster? When I look at the Scala code for the Hadoop version of WordCount3, it feels incomplete to me, but I am not sure. Please advise!!

The tutorial uses Scala 2.11, but the Spark build on your Hadoop cluster is probably using Scala 2.10; a NoSuchMethodError in scala.Predef like this one is the classic symptom of a Scala binary-version mismatch. (You can confirm the cluster's Scala version by launching spark-shell there; the startup banner prints the Scala version Spark was built with.) You'll have to recompile for 2.10.

This is easy to do in one of two ways (see also the one-shot shell variant after this list):

  • At the sbt prompt, change to 2.10.6 temporarily, then build the code:
> ++ 2.10.6
> package  
> ...
  • Change it permanently by editing line 9 of project/Build.scala:
  val ScalaVersion = "2.10.6"
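
If you prefer not to use sbt's interactive prompt, the same temporary switch can be done in one shot from the shell (a sketch; it assumes sbt is on your PATH and is run from the tutorial's root directory):

  sbt "++ 2.10.6" package

The quotes keep ++ 2.10.6 together as a single sbt command; without them the shell would split it into two.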

I followed what you explained, but it still shows the error ls: `/user/root/output': No such file or directory at the bottom. Thanks :)

[info] Loading project definition from /root/Shiva/spark-scala-tutorial/project
[info] Set current project to spark-scala-tutorial (in build file:/root/Shiva/spark-scala-tutorial/)
> ++ 2.10.6
[info] Setting version to 2.10.6
[info] Reapplying settings...
[info] Set current project to spark-scala-tutorial (in build file:/root/Shiva/spark-scala-tutorial/)
> package
[info] Updating {file:/root/Shiva/spark-scala-tutorial/}SparkWorkshop...
[info] Resolving com.sun.jersey.jersey-test-framework#jersey-test-framework-griz...
[info] Resolving com.fasterxml.jackson.module#jackson-module-scala_2.10;2.4.4 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.6.2/spark-core_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-core_2.10;1.6.2!spark-core_2.10.jar (615ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.6.2/spark-streaming_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-streaming_2.10;1.6.2!spark-streaming_2.10.jar (111ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.10/1.6.2/spark-sql_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-sql_2.10;1.6.2!spark-sql_2.10.jar (173ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.10/1.6.2/spark-hive_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-hive_2.10;1.6.2!spark-hive_2.10.jar (64ms)
[info] downloading https://repo1.maven.org/maven2/com/twitter/chill_2.10/0.5.0/chill_2.10-0.5.0.jar ...
[info]  [SUCCESSFUL ] com.twitter#chill_2.10;0.5.0!chill_2.10.jar (35ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-launcher_2.10/1.6.2/spark-launcher_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-launcher_2.10;1.6.2!spark-launcher_2.10.jar (28ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-network-common_2.10/1.6.2/spark-network-common_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-network-common_2.10;1.6.2!spark-network-common_2.10.jar (169ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-network-shuffle_2.10/1.6.2/spark-network-shuffle_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-network-shuffle_2.10;1.6.2!spark-network-shuffle_2.10.jar (25ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-unsafe_2.10/1.6.2/spark-unsafe_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-unsafe_2.10;1.6.2!spark-unsafe_2.10.jar (26ms)
[info] downloading https://repo1.maven.org/maven2/com/typesafe/akka/akka-remote_2.10/2.3.11/akka-remote_2.10-2.3.11.jar ...
[info]  [SUCCESSFUL ] com.typesafe.akka#akka-remote_2.10;2.3.11!akka-remote_2.10.jar (93ms)
[info] downloading https://repo1.maven.org/maven2/com/typesafe/akka/akka-slf4j_2.10/2.3.11/akka-slf4j_2.10-2.3.11.jar ...
[info]  [SUCCESSFUL ] com.typesafe.akka#akka-slf4j_2.10;2.3.11!akka-slf4j_2.10.jar (27ms)
[info] downloading https://repo1.maven.org/maven2/org/json4s/json4s-jackson_2.10/3.2.10/json4s-jackson_2.10-3.2.10.jar ...
[info]  [SUCCESSFUL ] org.json4s#json4s-jackson_2.10;3.2.10!json4s-jackson_2.10.jar (25ms)
[info] downloading https://repo1.maven.org/maven2/com/fasterxml/jackson/module/jackson-module-scala_2.10/2.4.4/jackson-module-scala_2.10-2.4.4.jar ...
[info]  [SUCCESSFUL ] com.fasterxml.jackson.module#jackson-module-scala_2.10;2.4.4!jackson-module-scala_2.10.jar(bundle) (53ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/avro/avro/1.7.7/avro-1.7.7.jar ...
[info]  [SUCCESSFUL ] org.apache.avro#avro;1.7.7!avro.jar (70ms)
[info] downloading https://repo1.maven.org/maven2/com/typesafe/akka/akka-actor_2.10/2.3.11/akka-actor_2.10-2.3.11.jar ...
[info]  [SUCCESSFUL ] com.typesafe.akka#akka-actor_2.10;2.3.11!akka-actor_2.10.jar (112ms)
[info] downloading https://repo1.maven.org/maven2/org/scala-lang/scalap/2.10.0/scalap-2.10.0.jar ...
[info]  [SUCCESSFUL ] org.scala-lang#scalap;2.10.0!scalap.jar (57ms)
[info] downloading https://repo1.maven.org/maven2/org/scala-lang/scala-compiler/2.10.0/scala-compiler-2.10.0.jar ...
[info]  [SUCCESSFUL ] org.scala-lang#scala-compiler;2.10.0!scala-compiler.jar (735ms)
[info] downloading https://repo1.maven.org/maven2/org/apache/spark/spark-catalyst_2.10/1.6.2/spark-catalyst_2.10-1.6.2.jar ...
[info]  [SUCCESSFUL ] org.apache.spark#spark-catalyst_2.10;1.6.2!spark-catalyst_2.10.jar (284ms)
[info] downloading https://repo1.maven.org/maven2/org/scalatest/scalatest_2.10/2.2.4/scalatest_2.10-2.2.4.jar ...
[info]  [SUCCESSFUL ] org.scalatest#scalatest_2.10;2.2.4!scalatest_2.10.jar(bundle) (365ms)
[info] downloading https://repo1.maven.org/maven2/org/scalacheck/scalacheck_2.10/1.12.2/scalacheck_2.10-1.12.2.jar ...
[info]  [SUCCESSFUL ] org.scalacheck#scalacheck_2.10;1.12.2!scalacheck_2.10.jar (55ms)
[info] Done updating.
[info] Compiling 38 Scala sources to /root/Shiva/spark-scala-tutorial/target/scala-2.10/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.10.6. Compiling...
[info]   Compilation completed in 8.816 s
[warn] Multiple main classes detected.  Run 'show discoveredMainClasses' to see the list
[info] Packaging /root/Shiva/spark-scala-tutorial/target/scala-2.10/spark-scala-tutorial_2.10-5.0.0.jar ...
[info] Done packaging.
[success] Total time: 27 s, completed Dec 31, 2016 12:33:35 AM
> run
[warn] Multiple main classes detected.  Run 'show discoveredMainClasses' to see the list

Multiple main classes detected, select one to run:

 [1] Crawl5a
 [2] Crawl5aLocal
 [3] InvertedIndex5b
 [4] InvertedIndex5bSortByWordAndCounts
 [5] Joins7
 [6] Joins7Ordered
 [7] Matrix4
 [8] Matrix4StdDev
 [9] NGrams6
 [10] SparkSQL8
 [11] SparkStreaming11
 [12] SparkStreaming11Main
 [13] SparkStreaming11MainSocket
 [14] SparkStreaming11SQL
 [15] WordCount2
 [16] WordCount2GroupBy
 [17] WordCount2SortByCount
 [18] WordCount2SortByWord
 [19] WordCount3
 [20] WordCount3SortByWordLength
 [21] hadoop.HCrawl5a
 [22] hadoop.HInvertedIndex5b
 [23] hadoop.HJoins7
 [24] hadoop.HMatrix4
 [25] hadoop.HNGrams6
 [26] hadoop.HSparkSQL8
 [27] hadoop.HSparkStreaming11
 [28] hadoop.HWordCount3
 [29] sparktutorial.solns.InvertedIndex5bTfIdf
 [30] util.streaming.DataDirectoryServer
 [31] util.streaming.DataSocketServer

Enter number: 28

[info] Running hadoop.HWordCount3
[info] running: spark-submit --class WordCount3 ./target/scala-2.10/spark-scala-tutorial_2.10-5.0.0.jar ./target/scala-2.11/spark-scala-tutorial_2.11-5.0.0.jar --out /user/root/output/kjv-wc3
[info]
[info] Unrecognized argument (or missing second argument): ./target/scala-2.11/spark-scala-tutorial_2.11-5.0.0.jar
[info]
[info] usage: java ... WordCount3$ [options]
[info] where the options are the following:
[info] -h | --help  Show this message and quit.
[info] -i | --in  | --inpath  path   The input root directory of files to crawl (default: data/kjvdat.txt)
[info] -o | --out | --outpath path   The output location (default: output/kjv-wc3)
[info]
[info] -m | --master M      The "master" argument passed to SparkContext, "M" is one of:
[info]                      "local", local[N]", "mesos://host:port", or "spark://host:port"
[info]                      (default: local).
[info] -q | --quiet         Suppress some informational output.
[info]
[info] Contents of the output directories:
[error] ls: `/user/root/output': No such file or directory
[info]
[info]  **** To see the contents, open the following URL(s):
[info]
[info]
[success] Total time: 11 s, completed Dec 31, 2016 12:33:48 AM

Two comments. First, when you run with Hadoop, you'll have to create the correct directories in HDFS, which is the default file system assumed in that context, not a local file system (which doesn't mean much in a cluster anyway: what's local?). You'll also want to use your actual user name on the cluster: /user/root is the HDFS home directory of the root user and probably not what you want. If you actually have permissions for that directory, you could create the output subdirectory there and it might work, but normally you would use /user/myname/output.
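
For example, something like the following from a cluster node (a sketch; myname is a placeholder for your actual cluster user, and your HDFS layout may differ):

  # Create your output directory in HDFS, including any missing parents.
  hdfs dfs -mkdir -p /user/myname/output
  # Confirm it exists and that your user owns it.
  hdfs dfs -ls /user/myname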

Second, those hadoop.H* hooks I created aren't well tested; I should remove them, as I'm no longer interested in maintaining them. I would try running the spark-submit shell script directly instead. However, I don't think that's the problem in this case.
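
For reference, a direct invocation would look something like this (a sketch based on the command the hook prints above; myname is a placeholder, and whether you need a --master flag depends on your cluster setup):

  # Flags before the jar go to spark-submit; arguments after it are
  # passed to WordCount3 itself. On YARN with Spark 1.6 you may also
  # need, e.g., --master yarn-client.
  spark-submit --class WordCount3 \
    ./target/scala-2.10/spark-scala-tutorial_2.10-5.0.0.jar \
    --out /user/myname/output/kjv-wc3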

Hi Dean :)

I can see a success message with no errors, but I cannot see any contents in the output directory. As you mentioned above, I created an output folder using hdfs dfs -mkdir /user/root/output (root is the user on my cluster).

It ran with no errors, but it didn't generate any files?


[info] Running hadoop.HWordCount3
[info] running: spark-submit --class WordCount3 ./target/scala-2.10/spark-scala-tutorial_2.10-5.0.0.jar ./target/scala-2.11/spark-scala-tutorial_2.11-5.0.0.jar --out /user/root/output/kjv-wc3
[info]
[info] Unrecognized argument (or missing second argument): ./target/scala-2.11/spark-scala-tutorial_2.11-5.0.0.jar
[info]
[info] usage: java ... WordCount3$ [options]
[info] where the options are the following:
[info] -h | --help  Show this message and quit.
[info] -i | --in  | --inpath  path   The input root directory of files to crawl (default: data/kjvdat.txt)
[info] -o | --out | --outpath path   The output location (default: output/kjv-wc3)
[info]
[info] -m | --master M      The "master" argument passed to SparkContext, "M" is one of:
[info]                      "local", local[N]", "mesos://host:port", or "spark://host:port"
[info]                      (default: local).
[info] -q | --quiet         Suppress some informational output.
[info]
[info] Contents of the output directories:
[info]
[info]  **** To see the contents, open the following URL(s):
[info]
[info]
[success] Total time: 13 s, completed Jan 2, 2017 9:02:40 PM

I'm going to delete the Hadoop support. Sorry.