Downloading All XKCD Comics with Python

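1. What the program does:

Load the XKCD home page
Save the comic image on that page
Follow the link to the previous comic
Repeat until it reaches the first comic

This means the code needs to:

Download pages with the requests module
Find the URL of the comic image on the page with Beautiful Soup
Download the comic image with iter_content() and save it to the hard drive
Find the URL of the link to the previous comic, then repeat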

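Step 1: Design the program

Open the browser's developer tools and inspect the elements on the page. You will find the following:
The URL of the comic image file is given by the src attribute of an <img> element
The <img> element is inside a <div id="comic"> element
The Prev button has a rel HTML attribute with the value prev
The Prev button of the first comic links to the URL http://xkcd.com/#, which indicates that there is no previous page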
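
The last finding is what ends the loop. As a minimal sketch, assuming the Prev button's href attribute holds a site-relative path such as /2999/ and just # on the first comic, the check could look like the following. The names BASE_URL and next_page_url are hypothetical and used only for illustration.

```python
BASE_URL = 'https://xkcd.com'  # assumed site root, without a trailing slash


def next_page_url(prev_href):
    """Return the absolute URL of the previous comic, or None at the first comic."""
    if prev_href == '#':
        # The first comic's Prev button points nowhere, so there is nothing left to fetch.
        return None
    return BASE_URL + prev_href


print(next_page_url('/2999/'))  # https://xkcd.com/2999/
print(next_page_url('#'))       # None
```

The program in this post takes a slightly different route: it always builds the next URL and stops once that URL ends with #.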

import requests, os, bs4

url = 'https://xkcd.com/'  # starting url
os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd

Step 2: Download the page

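print('Downloading page %s...' % url)
res = requests.get(url)  # download the page
res.raise_for_status()  # raise an exception and stop the program if the download failed
soup = bs4.BeautifulSoup(res.text, features='html.parser')  # parse the page for the next step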
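
To see what raise_for_status() does without touching the network, the sketch below builds a requests.Response object by hand and sets its status code. This is only an illustration: describe_status is a hypothetical helper, and in the real program the Response object comes from requests.get().

```python
import requests


def describe_status(status_code):
    """Return 'ok' if raise_for_status() passes, else the name of the exception raised."""
    resp = requests.Response()  # built by hand so that no network access is needed
    resp.status_code = status_code
    try:
        resp.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    except requests.exceptions.HTTPError as exc:
        return type(exc).__name__
    return 'ok'


print(describe_status(200))  # ok
print(describe_status(404))  # HTTPError
```

Because the downloader does not catch this exception, a failed download stops the whole script, which is the behavior described above.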

Step 3: Find and download the comic image

# Find the URL of the comic image.
# soup.select() returns an empty list if nothing matches; otherwise it returns
# a list containing one <img> element. Read that element's src attribute and
# pass it to requests.get() to download the comic image file.
comicElem = soup.select('#comic img')
if comicElem == []:
    print('Could not find comic image.')
else:
    comicUrl = comicElem[0].get('src')
    print('Downloading image %s...' % (comicUrl))
    res = requests.get('http:' + comicUrl)
    res.raise_for_status()

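The comic image's <img> element is inside a <div> element whose id attribute is set to comic, so the selector '#comic img' picks the correct <img> element out of the BeautifulSoup object.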
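
The selectors can be tried offline on a small HTML string. The markup below is a hypothetical, heavily simplified stand-in for the real page, written only to match the structure described above; the image path and the /1/ link are made up for the example.

```python
import bs4

# Hypothetical sample markup: an <img> inside <div id="comic"> and a Prev link.
SAMPLE_HTML = '''
<div id="comic">
  <img src="//imgs.xkcd.com/comics/example.png" alt="Example">
</div>
<ul>
  <li><a rel="prev" href="/1/">Prev</a></li>
</ul>
'''

soup = bs4.BeautifulSoup(SAMPLE_HTML, features='html.parser')

comic_elems = soup.select('#comic img')    # every <img> inside the element with id="comic"
prev_links = soup.select('a[rel="prev"]')  # every <a> whose rel attribute is prev

image_src = comic_elems[0].get('src')
prev_href = prev_links[0].get('href')
print(image_src)  # //imgs.xkcd.com/comics/example.png
print(prev_href)  # /1/

# A selector that matches nothing gives an empty list, which is what the
# "Could not find comic image." branch relies on.
print(soup.select('#missing img') == [])  # True
```

Note that the src value here starts with //, which is why the program prepends a scheme before passing it to requests.get().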

Step 4: Save the image and find the previous comic

# Save the image to ./xkcd.
imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()

# Get the Prev button's URL.
prevLink = soup.select('a[rel="prev"]')[0]
url = 'https://xkcd.com' + prevLink.get('href')

At this point, the comic's image file is stored in the variable res. You need to write that image data to a file on the hard drive.
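
The write step can also be expressed with a with block, which closes the file even if a write fails. The sketch below is a variation, not the code used in this post: save_chunks is a hypothetical helper, and a plain list of bytes objects stands in for res.iter_content(100000) so that it runs without a network connection.

```python
import os
import tempfile


def save_chunks(chunks, folder, image_url):
    """Write an iterable of byte chunks to folder/<file name taken from image_url>."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, os.path.basename(image_url))
    with open(path, 'wb') as image_file:  # closed automatically when the block ends
        for chunk in chunks:
            image_file.write(chunk)
    return path


# Stand-in for res.iter_content(100000): any iterable of bytes objects works.
fake_chunks = [b'\x89PNG', b'fake image data']
saved = save_chunks(fake_chunks, tempfile.mkdtemp(), '//imgs.xkcd.com/comics/example.png')
print(os.path.basename(saved))  # example.png
```

In the real program, the first argument would be res.iter_content(100000) and the folder would be 'xkcd'.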

The complete code for the project is as follows:

import requests, os, bs4

url = 'https://xkcd.com/'  # starting url
os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, features='html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get('http:' + comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd.
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's URL.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')

print('Done')