import pandas will raise exception: mrjob returned non-zero exit status 256
Alxe1 opened this issue · 1 comments
Alxe1 commented
I have an mrjob job:
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONProtocol, RawValueProtocol, PickleProtocol, JSONValueProtocol
import pandas


class WordCount(MRJob):
    INPUT_PROTOCOL = RawValueProtocol
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, line):
        words = line.split(" ")
        for word in words:
            yield word, 1
        # for i, word in enumerate(words):
        #     d = {'word': word, "cnt": 1}
        #     yield i, d

    def reducer(self, key, values):
        count = 0
        for v in values:
            count += 1
        yield None, {"word": key, "counts": count}
        # for d in values:
        #     yield key, d

    # def steps(self):
    #     return [MRStep(mapper=self.mapper, reducer=self.reducer)]


if __name__ == "__main__":
    # start = time.time()
    WordCount.run()
    # print("Elapsed time: {}".format(time.time() - start))
When I run it in a Hadoop environment, it raises:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Step 1 of 1 failed: Command '['/software/hadoop-2.7.3/bin/hadoop', 'jar', '/software/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar', '-files', 'hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/files/wd/mapreduce.py#mapreduce.py,hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/files/wd/setup-wrapper.sh#setup-wrapper.sh', '-input', 'hdfs:///data/test.txt', '-output', 'hdfs:///user/root/tmp/mrjob/mapreduce.root.20191222.084544.242130/output', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 mapreduce.py --step-num=0 --mapper', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 mapreduce.py --step-num=0 --reducer']' returned non-zero exit status 256.
When I remove import pandas, it runs successfully. I think it's a bug.
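A possible first check, assuming (the log above does not confirm this) that the subprocess exits with code 1 because the python3 used by the Hadoop tasks cannot import pandas. The sketch below only reports which top-level modules the current interpreter can find; the helper name missing_modules is hypothetical and not part of mrjob. It would need to be run with the same python3 on a task node, not only on the machine that submits the job.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate.

    Hypothetical helper for diagnosis only; it is not part of mrjob.
    Only top-level names (no dots) should be passed in.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Report to stderr, which Hadoop streaming keeps in the task logs.
    missing = missing_modules(["pandas"])
    sys.stderr.write(
        "interpreter: {}\nmissing modules: {}\n".format(sys.executable, missing)
    )
```

If pandas shows up as missing on a node, the traceback in that task's stderr log should show the matching import error, which would point to the node's Python environment and not to mrjob itself.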
Alxe1 commented
And when I add the time calls in the main block, it fails as well.