Fetching the contents of a list of URLs

I have several lists of URLs whose HTML contents I want to fetch. The URLs were collected from Twitter, and I don't know what they link to: they could point to web pages as well as to music or videos. Here is how I read the HTML contents of the links in a URL list:

import pickle

import requests
from multiprocessing.dummy import Pool as ThreadPool

N_THREADS = 4  # number of worker threads in the pool

def fetch_url(url):

    output = None

    print "processing url {}".format(url)

    try:
        # sending the request
        req = requests.get(url, stream=True)

        # checking if it is an html page (a missing header counts as not HTML)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:

            # reading the contents
            html = req.content
            req.close()

            output = html

        else:
            print "\t{} is not an HTML file".format(url)
            req.close()

    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output


with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)

"""
The url_list has such structure:
url_list = [u'http://t.co/qmIPqQVBmW',
            u'http://t.co/mE8krkEejV',
            ...]
"""


pool = ThreadPool(N_THREADS)

# fetch each URL in its own worker thread and collect the results
func = fetch_url
results = pool.map(func, url_list)

# close the pool and wait for the work to finish
pool.close()
pool.join()

The code works fine for most of the lists, but for some of them it gets stuck and never finishes. I suspect some of the URLs never return a response. How can I remedy this? For example, wait X seconds for a request and, if there is no response, forget it and move on to the next URL? And why does this happen?


Sure, you can set a timeout (in seconds) on your request, and it's very easy!

req = requests.get(url, stream=True, timeout=1)
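
As a minimal sketch of how this could be folded into the fetch_url function above (not part of the original answer; the 10-second value is arbitrary), note that requests raises requests.exceptions.Timeout when the limit is exceeded, so it can be caught and the URL skipped:

import requests

def fetch_url(url):
    output = None
    try:
        # give up if no bytes arrive on the socket for 10 seconds
        req = requests.get(url, stream=True, timeout=10)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            output = req.content
        req.close()
    except requests.exceptions.Timeout:
        print "\t{} timed out, moving on".format(url)
    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)
    return output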

Quoting from the python-requests documentation:

timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).

More information: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
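
Because of this, the per-socket timeout alone won't stop a server that keeps trickling bytes forever. If a hard cap on the total download time is needed, one option (a sketch, not part of the original answer; the 30-second deadline and chunk size are arbitrary) is to stream the body in chunks and track the elapsed time yourself:

import time
import requests

def fetch_with_deadline(url, deadline=30):
    # abandon the whole download once `deadline` seconds have passed
    start = time.time()
    req = requests.get(url, stream=True, timeout=10)
    chunks = []
    for chunk in req.iter_content(chunk_size=8192):
        chunks.append(chunk)
        if time.time() - start > deadline:
            req.close()
            print "\t{} exceeded the {}s deadline, giving up".format(url, deadline)
            return None
    req.close()
    return ''.join(chunks)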