Web scraping: Python's requests library times out, but the browser gets a response

I'm trying to create a web scraper for NBA data. When I run the following code:

import requests

response = requests.get('https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=10%2F20%2F2017&DateTo=10%2F20%2F2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=')

The request times out with the following error:

File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 70, in get
  return request('get', url, params=params, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 56, in request
  return session.request(method=method, url=url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 488, in request
  resp = self.send(prep, **send_kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 609, in send
  r = adapter.send(request, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 473, in send
  raise ConnectionError(err, request=request)

ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))

However, when I hit the same URL in a browser, I get a response.


It looks like the site you mentioned checks the "User-Agent" header of the request. You can spoof the "User-Agent" in your request so that it appears to come from an actual browser, and you will then receive a response.

For example:

>>> import requests
>>> url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=10%2F20%2F2017&DateTo=10%2F20%2F2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

>>> response = requests.get(url, headers=headers)
>>> response.status_code
200

>>> response.text  # will return the website content
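As an aside, you don't have to hand-encode that long query string: requests can build it from a `params` dict and will URL-encode the values for you. A sketch of that approach (only a subset of the original query parameters is shown; `prepare()` is used here just to inspect the final URL without sending anything over the network):

```python
import requests

# Query parameters as a dict; requests URL-encodes them automatically.
params = {
    'DateFrom': '10/20/2017',
    'DateTo': '10/20/2017',
    'LeagueID': '00',
    'MeasureType': 'Base',
    'PerMode': 'Totals',
    'Season': '2017-18',
    'SeasonType': 'Regular Season',
}

# Same browser-like User-Agent as in the answer above.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/61.0.3163.100 Safari/537.36'}

# Build and prepare the request without sending it, to inspect the URL.
req = requests.Request(
    'GET',
    'https://stats.nba.com/stats/leaguedashplayerstats',
    params=params,
    headers=headers,
).prepare()

print(req.url)
```

To actually send it, use `requests.get(url, params=params, headers=headers, timeout=10)`; passing an explicit `timeout` makes the request fail fast instead of hanging when the server silently drops unrecognized clients.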