Web scraping: How can I download multiple PDF files with Python?

I am trying to download the publications on every page of https://occ.ca/our-publications.

My end goal is to parse the text in the PDF files and search for certain keywords.

So far, I have been able to scrape the links to the PDF files on all of the pages, and I have saved those links to a list. Now I want to go through the list and download all of the PDF files with Python. Once the files are downloaded, I want to parse through them.
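
For the keyword step, this is roughly what I have in mind once each PDF's text has been extracted (just a sketch; contains_keywords and the sample inputs are placeholders):

import re

def contains_keywords(text, keywords):
    # Case-insensitive whole-word match for any of the keywords.
    return any(re.search(r'\b%s\b' % re.escape(kw), text, re.IGNORECASE)
               for kw in keywords)

print(contains_keywords('Sample report text about trade policy.', ['trade']))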

Here is the code I have used so far:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called
# "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    publications.append(links)

Next, I want to go through the list and download the PDF files.

import urllib.request
for x in publications:
    urllib.request.urlretrieve(x, 'Publication_{}'.format(range(213)))

This is the error I get when I run the code:

Traceback (most recent call last):
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\m.py", line 23, in <module>
    urllib.request.urlretrieve(x, 'Publication_{}.pdf'.format(range(213)))
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden


Please try the code below. The 403 Forbidden most likely means the server rejects requests that do not carry a browser-like User-Agent header; urllib.request.urlretrieve does not send one by default, whereas your scraping code already passes headers={'User-Agent': 'Mozilla'} via requests:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called
# "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        # extend (not append) so publications stays a flat list of URLs
        publications.extend(links)

for cntr, link in enumerate(publications):
    print("try to get link", link)
    rslt = requests.get(link, headers={'User-Agent': 'Mozilla'})
    print("Got", rslt)
    fname = "temporarypdf_%d.pdf" % cntr
    with open(fname, "wb") as fout:
        # rslt.content holds the response body as bytes
        # (rslt.raw is only populated when stream=True is used)
        fout.write(rslt.content)
    print("saved pdf data into", fname)
    # Call here the code that reads and parses the pdf.
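
For the parsing step mentioned in the last comment, here is a minimal sketch, assuming the PyPDF2 package (pdf_contains_keywords and the example keywords are placeholders; text extraction quality varies per PDF, and pdfminer.six is a common alternative):

import PyPDF2

def pdf_contains_keywords(path, keywords):
    # Return True if any keyword occurs in the PDF's extracted text.
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        for page_num in range(reader.getNumPages()):
            text = reader.getPage(page_num).extractText() or ''
            if any(kw.lower() in text.lower() for kw in keywords):
                return True
    return False

print(pdf_contains_keywords('temporarypdf_0.pdf', ['trade', 'policy']))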


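As a side note, if you would rather keep urllib.request.urlretrieve, you can install a global opener that sends a browser-like User-Agent; this is a sketch of the same idea the requests version above relies on:

import urllib.request

# urlretrieve uses the globally installed opener, so this replaces the
# default "Python-urllib/3.x" User-Agent that the server rejects with 403.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla')]
urllib.request.install_opener(opener)

for cntr, link in enumerate(publications):
    urllib.request.urlretrieve(link, 'Publication_%d.pdf' % cntr)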
Could you tell us the line number where the error occurs?