AWS Lambda, Scrapy and catching exceptions
I am running Scrapy as an AWS Lambda function. Inside my function I need a timer that checks whether the crawl has been running for more than one minute, and if so, runs some other logic. Here is my code:
    def handler():
        x = 60
        watchdog = Watchdog(x)
        try:
            runner = CrawlerRunner()
            runner.crawl(MySpider1)
            runner.crawl(MySpider2)
            d = runner.join()
            d.addBoth(lambda _: reactor.stop())
            reactor.run()
        except Watchdog:
            print('Timeout error: process takes longer than %s seconds.' % x)
            # some other logic here
        watchdog.stop()
I took the Watchdog timer class from this answer. The problem is that the code never hits the except clause; instead I get this error:
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.6/threading.py", line 1182, in run
        self.function(*self.args, **self.kwargs)
      File "./functions/python/my_scrapy/index.py", line 174, in defaultHandler
        raise self
    functions.python.my_scrapy.index.Watchdog: 1
I need to catch the exception inside my function. How can I do that? PS: I am very new to Python.
OK, this question had me going a little crazy. Here is why it does not work: the Watchdog raises its exception inside its own timer thread (Thread-1 in the traceback above), so the try/except wrapped around reactor.run() in the main thread never sees it.
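For context, the Watchdog class from the linked answer is presumably built on threading.Timer, roughly like this (a sketch reconstructed from the traceback above, not necessarily the exact code):

    from threading import Timer

    class Watchdog(Exception):
        def __init__(self, timeout):  # timeout in seconds
            self.timeout = timeout
            self.timer = Timer(self.timeout, self.defaultHandler)
            self.timer.start()

        def stop(self):
            self.timer.cancel()

        def defaultHandler(self):
            raise self  # raised in the Timer's own thread, not the caller's

Timer calls defaultHandler in a separate thread, and an exception raised there never propagates into the thread that created the Watchdog.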
You can run the reactor in another thread instead:
    import time
    from threading import Thread

    from scrapy.crawler import CrawlerRunner
    from twisted.internet import reactor

    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    # The reactor runs in a different thread so it doesn't lock the script here;
    # installSignalHandlers=False is required because it is no longer the main thread.
    Thread(target=reactor.run, args=(False,)).start()

    time.sleep(60)  # lock the script here for one minute

    # Now check if it's still scraping
    if reactor.running:
        pass  # do something
    else:
        pass  # do something else
I am using Python 3.7.0.
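As a rough sketch of how this could fit the Lambda handler from the question (the handler signature and the timeout action are assumptions, and MySpider1/MySpider2 come from the question's project):

    import time
    from threading import Thread

    from scrapy.crawler import CrawlerRunner
    from twisted.internet import reactor


    def handler(event=None, context=None):
        runner = CrawlerRunner()
        runner.crawl(MySpider1)  # spiders defined elsewhere in the project
        runner.crawl(MySpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())

        # Run the reactor off the main thread so the handler can keep a timer.
        Thread(target=reactor.run, args=(False,)).start()

        time.sleep(60)  # wait up to one minute

        if reactor.running:
            # Still scraping after one minute: run the timeout logic here,
            # then stop the reactor from this thread in a thread-safe way.
            print('Timeout error: process takes longer than 60 seconds.')
            reactor.callFromThread(reactor.stop)

Note that this always sleeps for the full minute even if the crawl finishes earlier, just like the snippet above.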
Twisted has scheduling primitives. For example, this program runs for roughly 60 seconds:
    from twisted.internet import reactor
    reactor.callLater(60, reactor.stop)
    reactor.run()
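A hedged sketch of applying the same idea to the question's crawl, with no extra threads: schedule a timeout callback with callLater and cancel it if the crawl finishes first (the handler signature and the body of on_timeout are assumptions; CrawlerRunner.stop() is Scrapy's call for stopping all running crawls):

    from scrapy.crawler import CrawlerRunner
    from twisted.internet import reactor


    def handler(event=None, context=None):
        runner = CrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        d = runner.join()

        def on_timeout():
            print('Timeout error: process takes longer than 60 seconds.')
            # some other logic here, then ask Scrapy to stop the crawls;
            # runner.join()'s deferred fires once they have shut down.
            runner.stop()

        delayed_call = reactor.callLater(60, on_timeout)

        def finished(result):
            # Crawl is done (normally or after the timeout): clean up and exit.
            if delayed_call.active():
                delayed_call.cancel()
            reactor.stop()
            return result

        d.addBoth(finished)
        reactor.run()

Everything stays inside the reactor, so there is no cross-thread exception to catch in the first place.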