Process stuck within SQS calls
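I have a Python script that checks SQS for messages in a loop and then exits. A cron job relaunches the script every few minutes if it finds that the script is not already running.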
    # start
    def main():
        For i from 1 to 100:
            Check SQS for new message [establish connections to SQS]  # long polling not used, Receive message wait time set to 0
            If new job found:
                ProcessIt()
    # end
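I've found that after the script has been running on an EC2 instance for a few days, it goes stale: it stops checking SQS for new messages.
When I run lsof on the process's PID and grep for only the SQS connections, I see that every connection to SQS is in the CLOSE_WAIT state. The only way I can fix it is to kill the script's process manually and restart it. So it seems cron cannot restart the script either, because the process is still running, stuck in a call to SQS: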
ip-10-x-y-z:~ # lsof -p 9018 | grep "72.21"
ld-linux. 9018 root 7u IPv4 474699439 0t0 TCP ip-10-x-y-z.ec2.internal:58211->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 10u IPv4 474699560 0t0 TCP ip-10-x-y-z.ec2.internal:53428->72.21.194.47:https (CLOSE_WAIT)
ld-linux. 9018 root 12u IPv4 474701017 0t0 TCP ip-10-x-y-z.ec2.internal:52166->72.21.214.70:https (CLOSE_WAIT)
ld-linux. 9018 root 18u IPv4 474694555 0t0 TCP ip-10-x-y-z.ec2.internal:57267->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 22u IPv4 474694573 0t0 TCP ip-10-x-y-z.ec2.internal:57271->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 39u IPv4 474701031 0t0 TCP ip-10-x-y-z.ec2.internal:52170->72.21.214.70:https (CLOSE_WAIT)
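For context on the CLOSE_WAIT entries above: a TCP socket enters CLOSE_WAIT when the remote end has closed the connection but the local process has not yet closed its own side. The following is a minimal sketch in Python 3 (not the Python 2.6 used by the script) that reproduces the state on the loopback interface. The function name `peer_closed_demo` is made up for illustration, and the state lookup through `TCP_INFO` is a Linux-specific assumption; on other platforms the sketch simply returns `None` for the state.

```python
import socket
import struct

TCP_CLOSE_WAIT = 8  # numeric value of CLOSE_WAIT in the Linux TCP state enum


def peer_closed_demo():
    """Return (data, state) for a client socket whose peer has hung up."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.create_connection(listener.getsockname(), timeout=5)
    server_side, _ = listener.accept()
    server_side.close()              # the remote end closes first (sends FIN)

    data = client.recv(1024)         # b"" tells us the peer has closed
    state = None
    if hasattr(socket, "TCP_INFO"):  # assumption: Linux-style TCP_INFO layout
        raw = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 1)
        state = struct.unpack("B", raw)[0]  # first byte is the TCP state

    client.close()                   # only now does the socket leave CLOSE_WAIT
    listener.close()
    return data, state
```

On Linux the returned state should equal `TCP_CLOSE_WAIT` until `client.close()` runs. A process that is blocked elsewhere never reaches that close, which would be consistent with the sockets piling up in the lsof listing.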
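I know I should be using long polling, but I would still like to understand why the process gets stuck and never recovers on its own. I am using boto 2.23.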
Any input would be helpful.
Debugging with gdb produced the following trace of my stuck process:
(gdb) pystack
~/mypackage/lib/python2.6/ssl.py (293): do_handshake
~/mypackage/lib/python2.6/ssl.py (120): __init__
~/mypackage/lib/python2.6/ssl.py (350): wrap_socket
~/mypackage/lib/python2.6/site-packages/boto/https_connection.py (118): connect
~/mypackage/lib/python2.6/httplib.py (725): send
~/mypackage/lib/python2.6/httplib.py (764): _send_output
~/mypackage/lib/python2.6/httplib.py (892): endheaders
~/mypackage/lib/python2.6/httplib.py (937): _send_request
~/mypackage/lib/python2.6/httplib.py (899): request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (902): _mexe
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1063): make_request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1138): get_object
~/mypackage/lib/python2.6/site-packages/boto/sqs/connection.py (355): get_queue
~/mypackage/lib/python2.6/site-packages/sqs/SQSHelper.py (96): __init__
~/mypackage/sqs/SQSWrapper.py (1229): main
~/mypackage/sqs/SQSWrapper.py (1367): <module>
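As we can see, my script is stuck in the SQS get_queue() API call; the innermost frame is the SSL handshake (do_handshake) in ssl.py.
The problem appears to lie in Python 2.6's SSL handshake function. It was fixed in Python 2.7, although some people have reported the same problem on Python 2.7 as well [see the links below]. I will move to Python 2.7 and, in my SQS wrapper code, put a timeout of a few minutes on the SQS API calls to work around the whole problem. The following links helped me narrow down the root cause and the workaround: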
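To illustrate how a handshake can block like this, here is a minimal sketch in Python 3 (not the Python 2.6 from the trace). It connects to a local listener that accepts the TCP connection but never speaks TLS, so the client waits for a reply that never arrives. The function name `handshake_against_silent_server` is hypothetical, and the silent listener only stands in for an unresponsive endpoint; it is not a claim about how SQS actually behaved. With a timeout on the socket the wait is bounded; without one, the same call could block indefinitely.

```python
import socket
import ssl
import time


def handshake_against_silent_server(timeout_seconds):
    """Try a TLS handshake with a peer that never answers."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)  # the kernel completes the TCP connect; nobody replies

    context = ssl.create_default_context()
    raw = socket.create_connection(listener.getsockname(),
                                   timeout=timeout_seconds)
    started = time.monotonic()
    try:
        # wrap_socket() runs do_handshake() on an already connected socket
        context.wrap_socket(raw, server_hostname="localhost")
        outcome = "completed"
    except socket.timeout:
        outcome = "timed out"  # the socket timeout bounded the handshake
    finally:
        raw.close()
        listener.close()
    return outcome, time.monotonic() - started
```

Calling `handshake_against_silent_server(0.5)` should return the outcome `"timed out"` after roughly half a second instead of hanging.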
http://bugs.python.org/issue5103
http://hg.python.org/cpython/rev/ce4916ca06dd/
Web application hangs for several hours in ssl.py
Timeout function (if it takes too long to finish)
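A timeout around a blocking call, as described above, can be sketched with the standard `signal` module. This is only an illustration under assumptions: the names `call_with_timeout` and `CallTimedOut` are made up here, the approach works only on Unix and only in the main thread, and it is written for Python 3 rather than the 2.6/2.7 discussed in the post.

```python
import signal
import time


class CallTimedOut(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def call_with_timeout(func, seconds, *args, **kwargs):
    """Run func(*args, **kwargs); raise CallTimedOut after `seconds`."""
    def on_alarm(signum, frame):
        raise CallTimedOut("call exceeded %s seconds" % seconds)

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer
    try:
        return func(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # cancel a pending alarm
        signal.signal(signal.SIGALRM, previous)    # restore the old handler
```

A wrapper could then make its SQS call through `call_with_timeout`, catch `CallTimedOut`, and either retry or exit so that cron can start a fresh process. For example, `call_with_timeout(time.sleep, 0.2, 5)` raises `CallTimedOut` after about 0.2 seconds rather than sleeping for 5.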