Spark cluster full of heartbeat timeouts, executors exiting on their own
My Apache Spark cluster is running an application that produces a lot of executor timeouts:
    10:23:30,761 ERROR ~ Lost executor 5 on slave2.cluster: Executor heartbeat timed out after 177005 ms
    10:23:30,806 ERROR ~ Lost executor 1 on slave4.cluster: Executor heartbeat timed out after 176991 ms
    10:23:30,812 ERROR ~ Lost executor 4 on slave6.cluster: Executor heartbeat timed out after 176981 ms
    10:23:30,816 ERROR ~ Lost executor 6 on slave3.cluster: Executor heartbeat timed out after 176984 ms
    10:23:30,820 ERROR ~ Lost executor 0 on slave5.cluster: Executor heartbeat timed out after 177004 ms
    10:23:30,835 ERROR ~ Lost executor 3 on slave7.cluster: Executor heartbeat timed out after 176982 ms
However, in my configuration I can confirm that I successfully increased the executor heartbeat interval:
When I visit the ones marked as
    16/05/16 10:11:26 ERROR TransportChannelHandler: Connection to /10.0.0.4:35328 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
    16/05/16 10:11:26 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: spark://[email protected]:35328
How can I turn off heartbeats and/or prevent the executors from timing out?
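Missing heartbeats and executors being killed by YARN are almost always caused by OOM. You should check the logs on the individual executors (look for the text "running beyond physical memory"). If you have many executors and find it cumbersome to inspect all of the logs by hand, I recommend monitoring the job in the Spark UI while it runs. As soon as a task fails, the UI reports the cause, so it is easy to spot. Note that some tasks will report failure only because their executor had already been killed, so make sure you look at the cause of each individual failed task.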
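If checking every executor log by hand is tedious, the search can be scripted. The following is a minimal sketch, assuming the executor logs have already been collected into a local directory; the function name `find_oom_logs` and the directory layout are hypothetical, and only the marker text comes from the answer above.

```python
from pathlib import Path

# Marker text quoted in the answer; YARN prints it when it kills a container
# for exceeding its physical memory allowance.
OOM_MARKER = "running beyond physical memory"


def find_oom_logs(log_dir, marker=OOM_MARKER):
    """Return {log file path: [matching lines]} for every file under log_dir
    that contains the marker text."""
    hits = {}
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going over oddly encoded files.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            matches = [line.rstrip("\n") for line in handle if marker in line]
        if matches:
            hits[str(path)] = matches
    return hits
```

Calling `find_oom_logs("/path/to/collected/logs")` (a placeholder path) returns only the files that mention the marker, which narrows down which executors to inspect first.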
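Also note that most OOM problems can be fixed quickly by simply repartitioning your data at appropriate places in your code (again, check the Spark UI for hints about where repartitioning might be needed).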
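As an illustration of the repartitioning advice, here is a rough sketch of how a partition count could be chosen. The 128 MiB target per partition is an assumed rule of thumb, not something stated in the answer, and both helper names are hypothetical. The only Spark call used is `DataFrame.repartition(numPartitions)`.

```python
import math

# Assumption: aim for roughly 128 MiB per partition. Tune this to the memory
# actually available to each executor core.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def suggest_num_partitions(total_bytes, target_bytes=DEFAULT_TARGET_BYTES):
    """Return a partition count so each partition holds about target_bytes."""
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_bytes must be > 0")
    return max(1, math.ceil(total_bytes / target_bytes))


def repartition_for_size(df, total_bytes, target_bytes=DEFAULT_TARGET_BYTES):
    """Repartition a PySpark DataFrame before a memory-heavy stage.

    total_bytes is the caller's estimate of the data size; this sketch does
    not measure it.
    """
    return df.repartition(suggest_num_partitions(total_bytes, target_bytes))
```

For a dataset estimated at 10 GiB, `suggest_num_partitions(10 * 1024**3)` gives 80 partitions under the assumed target. Smaller partitions lower the peak memory each task needs, at the cost of more scheduling overhead.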
The answer was rather simple. In my
Using
    $SPARK_HOME/bin/spark-submit \
      --conf spark.network.timeout=10000000 \
      --class myclass.neuralnet.TrainNetSpark \
      --master spark://master.cluster:7077 \
      --driver-memory 30G \
      --executor-memory 14G \
      --num-executors 7 \
      --executor-cores 8 \
      --conf spark.driver.maxResultSize=4g \
      --conf spark.executor.heartbeatInterval=10000000 \
      path/to/my.jar
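If you are using PySpark, changing the configuration of the Spark context will fix this problem. You can set it as follows. Note that spark.executor.heartbeatInterval (default 10s) should be well below spark.network.timeout (default 120s). The values carry explicit unit suffixes because a bare number is read as milliseconds for the heartbeat interval but as seconds for the network timeout.

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Heartbeat interval (200 s) stays below the network timeout (300 s).
    conf = SparkConf().setAppName("application") \
        .set("spark.executor.heartbeatInterval", "200s") \
        .set("spark.network.timeout", "300s")
    sc = SparkContext.getOrCreate(conf)
    sqlcontext = SQLContext(sc)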
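The rule that the heartbeat interval must stay below the network timeout can be checked before a job is submitted. The sketch below is a plain-Python check, not part of Spark; `to_ms` and `check_heartbeat_settings` are hypothetical helpers, and the default units (milliseconds for spark.executor.heartbeatInterval, seconds for spark.network.timeout when no suffix is given) reflect my understanding of Spark's configuration definitions, so verify them against the documentation for your version.

```python
import re

# Multipliers to milliseconds for the unit suffixes this sketch understands.
_UNIT_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "min": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
}


def to_ms(value, default_unit):
    """Convert a time string such as "200s" or "300000" to milliseconds.

    default_unit is applied when the value has no suffix.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", str(value).lower())
    if not match:
        raise ValueError(f"cannot parse time value: {value!r}")
    number, unit = match.groups()
    unit = unit or default_unit
    if unit not in _UNIT_MS:
        raise ValueError(f"unknown time unit in {value!r}")
    return int(number) * _UNIT_MS[unit]


def check_heartbeat_settings(heartbeat_interval="10s", network_timeout="120s"):
    """Return (heartbeat_ms, timeout_ms), or raise if the heartbeat interval
    is not strictly below the network timeout."""
    # Assumed default units: ms for the heartbeat, s for the network timeout.
    heartbeat_ms = to_ms(heartbeat_interval, default_unit="ms")
    timeout_ms = to_ms(network_timeout, default_unit="s")
    if heartbeat_ms >= timeout_ms:
        raise ValueError(
            f"heartbeat interval ({heartbeat_ms} ms) must be less than "
            f"network timeout ({timeout_ms} ms)"
        )
    return heartbeat_ms, timeout_ms
```

Running `check_heartbeat_settings("200s", "300s")` before building the `SparkConf` catches a swapped or mistyped pair early, rather than after executors start dropping out.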
Hope this solves your problem. If you run into other errors, see the documentation here