Python: AWS Lambda, Scrapy, and catching exceptions

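I'm running Scrapy as an AWS Lambda function. Inside my function I need a timer to check whether it has been running for more than 1 minute, and if so, I need to run some logic. Here is my code: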

def handler():
    x = 60
    watchdog = Watchdog(x)
    try:
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
    except Watchdog:
        print('Timeout error: process takes longer than %s seconds.' % x)
        # some other logic here
    watchdog.stop()

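I took the Watchdog timer class from this answer. The problem is that the code never reaches the except Watchdog block; instead, the exception is raised outside of it: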

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
    raise self
functions.python.my_scrapy.index.Watchdog: 1

I need to catch the exception inside the function. How can I do that? P.S. I'm very new to Python.


Well, this question drove me a little crazy. Here is why it doesn't work:

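What the Watchdog object does is create another thread, and the exception is raised in that thread, where nothing handles it. The try/except in your handler only catches exceptions raised in the main thread. Fortunately, Twisted has some neat features.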
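To illustrate the point about threads, here is a minimal sketch that uses only the standard library. `Timeout` and both function names are hypothetical stand-ins, not the Watchdog class from the question. The timer's callback raises in its own thread, so the `except` clause in the calling thread is never entered, and Python prints the traceback to stderr instead.

```python
import threading


class Timeout(Exception):
    """Hypothetical stand-in for the Watchdog exception."""


def _raise_timeout():
    # Runs in the timer's own thread, not in the thread that started it.
    raise Timeout("timer fired")


def main_thread_catches_timer_exception(delay=0.05):
    """Return True only if the except clause below is ever reached."""
    timer = threading.Timer(delay, _raise_timeout)
    timer.start()
    try:
        # Wait for the timer thread to finish. The exception raised there
        # does not propagate through join(); it stays in the timer thread.
        timer.join()
    except Timeout:
        return True
    return False


if __name__ == "__main__":
    print(main_thread_catches_timer_exception())
```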

You can run the reactor in another thread:

import time
from threading import Thread

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

# MySpider1 and MySpider2 are the spider classes from the question.
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
# Run the reactor in a different thread so it doesn't block the script here.
# False disables signal handlers, which can only be installed in the main thread.
Thread(target=reactor.run, args=(False,)).start()

time.sleep(60)  # Block the script here

# Now check if it's still scraping
if reactor.running:
    pass  # do something
else:
    pass  # do something else

I'm using Python 3.7.0.
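One drawback of sleeping for a fixed 60 seconds is that the script waits the full minute even when the work finishes sooner. As a possible alternative, sketched here with the standard library only, `Thread.join` accepts a timeout and returns as soon as the thread ends or the timeout expires, whichever comes first. `run_with_deadline` is a hypothetical helper name, not part of Scrapy or Twisted.

```python
import threading
import time


def run_with_deadline(work, timeout):
    """Run work() in a thread; return True if it finished within `timeout` seconds."""
    # daemon=True is a choice made for this sketch so that a stuck worker
    # does not keep the interpreter alive at exit.
    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    worker.join(timeout)  # returns early if the worker finishes first
    return not worker.is_alive()


if __name__ == "__main__":
    print(run_with_deadline(lambda: time.sleep(0.1), timeout=1.0))
    print(run_with_deadline(lambda: time.sleep(1.0), timeout=0.1))
```

When the deadline passes, the worker thread is still running, so any cleanup, such as stopping the reactor, is still up to the caller.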


Twisted has scheduling primitives. For example, this program runs for about 60 seconds:

from twisted.internet import reactor
reactor.callLater(60, reactor.stop)
reactor.run()
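Applied to the question, the same primitive could act as the timeout itself. The following is only a sketch under stated assumptions: `run_with_timeout` and `on_timeout` are hypothetical names, `reactor` is expected to be Twisted's reactor, and `deferred` is expected to be something like the result of `runner.join()`. It relies on `reactor.callLater` returning a delayed call that offers `active()` and `cancel()`.

```python
def run_with_timeout(reactor, deferred, seconds, on_timeout):
    """Run the reactor until `deferred` fires or `seconds` elapse, whichever is first."""

    def timed_out():
        on_timeout()      # the "some other logic here" from the question
        reactor.stop()

    # Schedule the timeout inside the reactor, so no extra thread is needed.
    delayed = reactor.callLater(seconds, timed_out)

    def finished(result):
        # If the timeout has not fired yet, cancel it and stop normally.
        if delayed.active():
            delayed.cancel()
            reactor.stop()
        return result

    deferred.addBoth(finished)
    reactor.run()


# Hypothetical usage with the names from the question:
#   from twisted.internet import reactor
#   runner = CrawlerRunner()
#   runner.crawl(MySpider1)
#   runner.crawl(MySpider2)
#   run_with_timeout(reactor, runner.join(), 60,
#                    lambda: print('Timeout error'))
```

Because the timeout callback runs in the reactor's own thread, the timeout logic is an ordinary function call and there is no cross-thread exception to catch.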