
Caused by: java.net.UnknownHostException: host-xxx


A pyspark (2.4.0) instance created via Kubernetes (the YARN cluster runs on physical machines, not on k8s; only the pyspark client runs in a k8s pod) fails during execution with the following error:

Py4JJavaError: An error occurred while calling o67.collectToPython.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:932)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:930)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:930)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2126)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2039)
	at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:121)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3258)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3255)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3255)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)

Checking the resource manager shows the root cause is an UnknownHostException: the hostname of this Kubernetes-created spark client cannot be resolved by the machines in the YARN cluster.

Caused by: java.net.UnknownHostException: spark-client-xxxx
	at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at java.net.InetAddress.getByName(InetAddress.java:1076)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
	at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
	at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
	at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
	at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more

Currently the pyspark client's IP is routable from the cluster, but its hostname is not registered in DNS, so other machines have no way to resolve the hostname to an IP.
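
This is easy to verify from any YARN node: the hostname lookup fails while the pod IP itself is reachable (a quick sanity check; spark-client-xxxx and <pod-ip> below are placeholders):

getent hosts spark-client-xxxx   # fails: the hostname is in neither DNS nor /etc/hosts
ping -c 1 <pod-ip>               # succeeds: the IP itself is routable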

For pyspark in LOCAL mode, this can be worked around at launch time with --conf spark.driver.host=<IP> or export SPARK_LOCAL_IP=<IP>, but in YARN mode the driver still has to be addressed by IP.
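
For example, a local-mode session can be pinned to the IP like this (illustrative only; <IP> is a placeholder for the pod's address):

export SPARK_LOCAL_IP=<IP>
pyspark --master local[*] --conf spark.driver.host=<IP>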

After some digging, SPARK-21642 mentions that since 2.3.0 the DRIVER_HOST_ADDRESS value is set from the FQDN instead of the IP. As a temporary fix, do the following before starting pyspark:

# Grab the first non-loopback IPv4 address (assumes old net-tools "inet addr:" output; head -1 guards against multiple NICs).
ip=`ifconfig | grep "inet addr" | grep -v "127.0.0.1" | awk '{print $2}' | cut -d: -f2 | head -1`
# Advertise this IP as the Spark "hostname" in every future shell.
/bin/echo "export SPARK_LOCAL_HOSTNAME=$ip" >> /etc/bash.bashrc
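
With SPARK_LOCAL_HOSTNAME set to the IP, the driver advertises the IP instead of the unresolvable pod hostname, so the YARN nodes can connect back. A per-session variant looks like this (a sketch, assuming hostname -i inside the pod returns the pod IP, which k8s writes into /etc/hosts):

export SPARK_LOCAL_HOSTNAME=$(hostname -i | awk '{print $1}')
pyspark --master yarn --deploy-mode client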