whym/wikihadoop

Compatibility with Elastic MapReduce?

GabrielF00 opened this issue · 1 comment

Hi,

I've gotten WikiHadoop to work in a VM with the CDH 4 distribution of Hadoop, but I'm having trouble getting it to work with Amazon Elastic MapReduce. The error I'm getting is that SplittableCompressionCodec is not found. My guess is that Elastic MapReduce uses Hadoop 1.0.3, which is not compatible with WikiHadoop. Has anyone gotten this working with Elastic MapReduce?

Thanks for your help,
Gabriel

Here is the full error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/io/compress/SplittableCompressionCodec
at org.wikimedia.wikihadoop.StreamWikiDumpInputFormat.isSplitable(StreamWikiDumpInputFormat.java:138)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:232)
at org.wikimedia.wikihadoop.StreamWikiDumpInputFormat.getSplits(StreamWikiDumpInputFormat.java:148)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1044)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1036)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:952)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:879)
at org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:1013)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:123)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.SplittableCompressionCodec
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 23 more

whym commented

I can't really recommend WikiHadoop on Elastic MapReduce. There might be an ad-hoc workaround, but if your reason for using Elastic MapReduce is ease of operation, I'm fairly sure that benefit would be outweighed by the tedium of the workaround. In principle, to run WikiHadoop on Hadoop 0.20 or 1.0.x you would need to remove all code that depends on SplittableCompressionCodec and the like, and you would then have to uncompress the input files before running it.
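
For illustration only, here is a rough sketch of the kind of change that would involve, written against the old org.apache.hadoop.mapred API that Hadoop 1.0.x ships. The class name and the choice to extend TextInputFormat are assumptions made just to keep the sketch self-contained; this is not the actual StreamWikiDumpInputFormat code. The only point is that the SplittableCompressionCodec reference disappears, so only uncompressed input remains splittable:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical stand-in for an isSplitable() override with the
// SplittableCompressionCodec check removed, so the class can load
// on Hadoop 0.20 / 1.0.x.
public class StreamWikiDumpInputFormatCompat extends TextInputFormat {

  private CompressionCodecFactory codecs;

  @Override
  public void configure(JobConf conf) {
    super.configure(conf);
    codecs = new CompressionCodecFactory(conf);
  }

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Without SplittableCompressionCodec, only uncompressed files can be
    // split, so dumps have to be decompressed before the job is submitted.
    return codecs.getCodec(file) == null;
  }
}
```

With that check gone, compressed dumps can no longer be split at all, which is why the inputs would have to be uncompressed up front, as noted above.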

Did you try (manually installing and) using a newer Hadoop distribution that contains SplittableCompressionCodec? I haven't tried it myself, but the procedure outlined at http://rodrigodsousa.blogspot.jp/2012/03/hadoop-amazon-ec2-updated-tutorial.html should work.