Fetching the contents of a list of URLs
I have several lists of URLs whose HTML contents I want to fetch. The URLs were collected from Twitter, and I don't know what the links point to: they could lead to web pages as well as to music or videos. This is how I read the HTML contents of the links in a URL list:
import pickle
import requests
from multiprocessing.dummy import Pool as ThreadPool

N_THREADS = 4  # number of worker threads

def fetch_url(url):
    output = None
    print("processing url {}".format(url))
    try:
        # sending the request
        req = requests.get(url, stream=True)
        # checking if it is an html page
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            # reading the contents
            html = req.content
            req.close()
            output = html
        else:
            print("\t{} is not an HTML file".format(url))
            req.close()
    except Exception as e:
        print("\t HTTP request was not accepted for {}; {}".format(url, e))
    return output

with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)

# The url_list has such a structure:
# url_list = [u'http://t.co/qmIPqQVBmW', u'http://t.co/mE8krkEejV', ...]

pool = ThreadPool(N_THREADS)
# fetch the URLs in their own threads and collect the results
results = pool.map(fetch_url, url_list)
# close the pool and wait for the work to finish
pool.close()
pool.join()
For most of the lists the code runs without any problem, but for some of them it gets stuck and never finishes the task. I think some of the URLs are not returning a response. How can I remedy this? For example, wait X seconds for a request and, if it does not respond, forget it and move on to the next URL. Why does this happen?
Of course: you can set a timeout (in seconds) on your request, and it is very simple!
req = requests.get(url, stream=True, timeout=1)
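As a minimal sketch (the 5-second value and the reduced worker below are my own illustration, not the exact code from the question), the fetch_url worker can catch the timeout exception, skip the unresponsive URL, and let the pool move on:

import requests

TIMEOUT_SECONDS = 5  # arbitrary choice; tune to your needs

def fetch_url(url):
    try:
        # raises requests.exceptions.Timeout if no bytes arrive for TIMEOUT_SECONDS
        req = requests.get(url, stream=True, timeout=TIMEOUT_SECONDS)
    except requests.exceptions.Timeout:
        print("\t{} timed out, skipping".format(url))
        return None
    except requests.exceptions.RequestException as e:
        print("\t request failed for {}; {}".format(url, e))
        return None
    try:
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            return req.content
        print("\t{} is not an HTML file".format(url))
        return None
    finally:
        req.close()

With this version, pool.map(fetch_url, url_list) simply records None for URLs that timed out instead of hanging on them.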
Quoting from the python-requests documentation:
timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).
More information: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
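Note that, per the quote above, timeout does not cap the total download time: a server that keeps trickling bytes can still hold a worker for a long time. If you also want a hard limit on the whole download, one option (a rough sketch, with a hypothetical MAX_SECONDS value) is to stream the body and check the elapsed wall-clock time while reading it:

import time
import requests

MAX_SECONDS = 30  # hypothetical hard cap on the whole download

def fetch_with_limit(url):
    start = time.time()
    req = requests.get(url, stream=True, timeout=5)
    chunks = []
    try:
        # read the body in chunks and abort once the overall limit is exceeded
        for chunk in req.iter_content(chunk_size=8192):
            chunks.append(chunk)
            if time.time() - start > MAX_SECONDS:
                raise requests.exceptions.Timeout(
                    "download of {} exceeded {} seconds".format(url, MAX_SECONDS))
    finally:
        req.close()
    return b''.join(chunks)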