
BeautifulSoup select all href in some element with specific class

I am trying to scrape the images from this website. I have tried Scrapy, but Scrapy doesn't seem to work on Windows 10 Home, so I am now trying Selenium/BeautifulSoup. I am using Python 3.6 with Spyder.

The href elements I need are anchor tags with the class "emblem" and a relative href, something like:

<a class="emblem" href="/detail/emblem/av1615001">...</a>

The main problems I am trying to solve are:
- How should I select the href with BeautifulSoup? In the code below you can see what I have tried (without success).
- As you can see, the href is only a partial path of the URL… how should I deal with that?

My code so far:

from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import urllib
import requests
from os.path  import basename


def start_requests(self):
        self.driver = webdriver.Firefox("C:/Anaconda3/envs/scrapy/selenium/webdriver")
        #programPause = input("Press the <ENTER> key to continue...")
        self.driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        html = self.driver.page_source

        #html = requests.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        soup = BeautifulSoup(html, "html.parser")
        emblemshref = soup.select("a", {"class": "emblem", "href": True})

        for href in emblemshref:
            link = href["href"]
            with open(basename(link), "wb") as f:
                f.write(requests.get(link).content)

        # click on "next>>"
        while True:
            try:
                next_page = self.driver.find_element_by_xpath("//a[@id='next']")
                sleep(3)
                self.logger.info('Sleeping for 3 seconds')
                next_page.click()

                #here again the same emblemshref loop

            except NoSuchElementException:
                #execute next on the last page
                self.logger.info('No more pages to load')
                self.driver.quit()
                break


You can get the hrefs by class name like this:

Example:

for link in soup.findAll('a', {'class': 'emblem'}):
    try:
        print(link['href'])
    except KeyError:
        pass
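
The same selection can also be written with a CSS selector; a minimal sketch, assuming (as above) that the target anchors carry the class "emblem" and an href attribute:

# CSS selector: <a> elements with class "emblem" that actually have an href
for link in soup.select('a.emblem[href]'):
    print(link['href'])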


Not sure whether the answer above solves the problem. Here is one that worked for me.

url ="SOME-URL-YOU-WANT-TO-SCRAPE"
response = requests.get(url=url)
urls = BeautifulSoup(response.content, 'lxml').find_all('a', attrs={"class": ["YOUR-CLASS-NAME"]}, href=True)
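
The hrefs matched here are still relative paths (e.g. /detail/emblem/av1615001), so they need to be joined with the page URL before they can be requested. A small follow-up sketch using urllib.parse.urljoin from the standard library on the urls list collected above:

from urllib.parse import urljoin

# turn the relative hrefs into absolute URLs against the page that was scraped
full_urls = [urljoin(url, a['href']) for a in urls]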


Try this. It will give you all the URLs, traversing all the pages of that site. I have used an Explicit Wait to make it faster and more dynamic.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url ="http://emblematica.grainger.illinois.edu/"
wait = WebDriverWait(driver, 10)
driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".emblem")))

while True:
    soup = BeautifulSoup(driver.page_source,"lxml")
    for item in soup.select('.emblem'):
        links = url + item['href']
        print(links)

    try:
        link = driver.find_element_by_id("next")
        link.click()
        wait.until(EC.staleness_of(link))
    except Exception:
        break
driver.quit()

Partial output:

http://emblematica.grainger.illinois.edu/detail/emblem/av1615001
http://emblematica.grainger.illinois.edu/detail/emblem/av1615002
http://emblematica.grainger.illinois.edu/detail/emblem/av1615003
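
If, as in the question, the goal is to save whatever each collected link points to, the download loop from the question can be reused on these absolute URLs. A rough sketch (the ".html" filename suffix is only illustrative; note that these links appear to lead to emblem detail pages rather than directly to image files, so what gets saved is HTML that would still need to be parsed for the actual images):

import requests
from os.path import basename

def save_pages(links):
    # links: absolute URLs such as the ones printed above
    for link in links:
        response = requests.get(link)
        # basename("/detail/emblem/av1615001") -> "av1615001"; ".html" suffix is illustrative
        with open(basename(link) + ".html", "wb") as f:
            f.write(response.content)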