Fetching the contents of a list of URLs

I have several lists of URLs whose HTML contents I want to fetch. The URLs were collected from Twitter, and I don't know what they link to: they could point to web pages as well as to music or videos. Here is how I read the HTML contents of the links in a URL list:

import pickle

import requests
from multiprocessing.dummy import Pool as ThreadPool

N_THREADS = 4  # number of worker threads in the pool

def fetch_url(url):

    output = None

    print "processing url {}".format(url)

    try:
        # sending the request
        req = requests.get(url, stream=True)

        # checking if it is an html page (a missing header counts as not HTML)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:

            # reading the contents
            html = req.content
            req.close()

            output = html

        else:
            print "\t{} is not an HTML file".format(url)
            req.close()

    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output


with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)

"""
The url_list has such structure:
url_list = [u'http://t.co/qmIPqQVBmW',
            u'http://t.co/mE8krkEejV',
            ...]
"""


pool = ThreadPool(N_THREADS)

# fetch each URL in its own worker thread and collect the results
func = fetch_url
results = pool.map(func, url_list)

# close the pool and wait for the work to finish
pool.close()
pool.join()

The code works fine for most of the lists, but for some of them it gets stuck and never finishes. I suspect some of the URLs never return a response. How can I remedy this? For example, wait X seconds for a request and, if there is no response, forget it and move on to the next URL? And why does this happen?


Sure, you can set a timeout (in seconds) on your request, and it's very easy!

req = requests.get(url, stream=True, timeout=1)
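
As a minimal sketch of how this could be folded into the fetch_url function above (not part of the original answer; the 10-second value is arbitrary), note that requests raises requests.exceptions.Timeout when the limit is exceeded, so it can be caught and the URL skipped:

import requests

def fetch_url(url):
    output = None
    try:
        # give up if no bytes arrive on the socket for 10 seconds
        req = requests.get(url, stream=True, timeout=10)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            output = req.content
        req.close()
    except requests.exceptions.Timeout:
        print "\t{} timed out, moving on".format(url)
    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)
    return output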

Quoting from the python-requests documentation:

timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).

More information: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
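
Because of this, the per-socket timeout alone won't stop a server that keeps trickling bytes forever. If a hard cap on the total download time is needed, one option (a sketch, not part of the original answer; the 30-second deadline and chunk size are arbitrary) is to stream the body in chunks and track the elapsed time yourself:

import time
import requests

def fetch_with_deadline(url, deadline=30):
    # abandon the whole download once `deadline` seconds have passed
    start = time.time()
    req = requests.get(url, stream=True, timeout=10)
    chunks = []
    for chunk in req.iter_content(chunk_size=8192):
        chunks.append(chunk)
        if time.time() - start > deadline:
            req.close()
            print "\t{} exceeded the {}s deadline, giving up".format(url, deadline)
            return None
    req.close()
    return ''.join(chunks)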