Yelp/mrjob

import pandas will raise exception: mrjob returned non-zero exit status 256

Alxe1 opened this issue · 1 comments

Alxe1 commented

I have an mrjob job:

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONProtocol, RawValueProtocol, PickleProtocol, JSONValueProtocol
import pandas

class WordCount(MRJob):

    INPUT_PROTOCOL = RawValueProtocol
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, line):
        words = line.split(" ")
        for word in words:
            yield word, 1
        # for i, word in enumerate(words):
        #     d = {'word': word, "cnt": 1}
        #     yield i, d

    def reducer(self, key, values):
        count = 0
        for v in values:
            count += 1
        yield None, {"word": key, "counts": count}
        # for d in values:
        #     yield key, d

    # def steps(self):
    #     return [MRStep(mapper=self.mapper, reducer=self.reducer)]


if __name__ == "__main__":
    # start = time.time()
    WordCount.run()
    # print("Elapsed time: {}".format(time.time() - start))

When I run it in a Hadoop environment, it raises:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)


Step 1 of 1 failed: Command '['/software/hadoop-2.7.3/bin/hadoop', 'jar', '/software/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/files/wd/mapreduce.py#mapreduce.py,hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/files/wd/setup-wrapper.sh#setup-wrapper.sh', '-input', 'hdfs:///data/test.txt', '-output', 'hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/output', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 mapreduce.py --step-num=0 --mapper', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 mapreduce.py --step-num=0 --reducer']' returned non-zero exit status 256.

When I remove import pandas, it runs successfully. I think it's a bug.
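One possible cause, not confirmed in this thread, is that pandas is not installed for the python3 interpreter on the Hadoop task nodes, so the mapper subprocess exits at import time. A minimal sketch for checking this, assuming the script is run by hand with the same python3 on each node (the helper name missing_modules is hypothetical, not part of mrjob):

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Report which interpreter is being checked and what it is missing.
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["pandas"]))
```

If pandas shows up as missing on any node, the job would fail there regardless of whether it runs locally.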

Alxe1 commented

And when I add time in the main block, it fails as well.