BeautifulSoup: select all hrefs in elements with a specific class
I am trying to scrape the images from this website. I tried it with Scrapy, but Scrapy does not seem to work on my Windows 10 Home machine, so I am now trying Selenium/BeautifulSoup. I am using Python 3.6 with Spyder.
The href elements I need look like this:
```html
<a class="emblem" href="/detail/emblem/av1615001">...</a>
```
The main questions I am trying to solve are:

- How should I select those hrefs with BeautifulSoup? In the code below you can see what I tried (without success).
- As can be seen above, the href is only a partial path of the URL. How should I handle this?
My code so far:
```python
from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import urllib
import requests
from os.path import basename

def start_requests(self):
    self.driver = webdriver.Firefox("C:/Anaconda3/envs/scrapy/selenium/webdriver")
    #programPause = input("Press the <ENTER> key to continue...")
    self.driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
    html = self.driver.page_source
    #html = requests.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
    soup = BeautifulSoup(html, "html.parser")
    emblemshref = soup.select("a", {"class": "emblem", "href": True})
    for href in emblemshref:
        link = href["href"]
        with open(basename(link), "wb") as f:
            f.write(requests.get(link).content)

    # click on "next >>"
    while True:
        try:
            next_page = self.driver.find_element_by_xpath("//a[@id='next']")
            sleep(3)
            self.logger.info('Sleeping for 3 seconds')
            next_page.click()
            # here again the same emblemshref loop
        except NoSuchElementException:
            # executed on the last page, when there is no next link
            self.logger.info('No more pages to load')
            self.driver.quit()
            break
```
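For reference, a minimal sketch of the two fixes being asked about, assuming `html` comes from `driver.page_source` as above: `select()` expects a single CSS selector string (its second positional argument is a namespace mapping, not an attribute filter, so the dict above is not doing any filtering), and the partial hrefs can be resolved against the site root with `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "http://emblematica.grainger.illinois.edu"

soup = BeautifulSoup(html, "html.parser")
# one CSS selector string: <a> tags with class "emblem" that have an href
for a in soup.select("a.emblem[href]"):
    full_url = urljoin(BASE, a["href"])  # resolve the partial path
    print(full_url)
```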
You can get the href by class name, for example:
```python
for link in soup.findAll('a', {'class': 'emblem'}):
    try:
        print(link['href'])
    except KeyError:
        pass
```
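A small variation (not part of the original answer) avoids the `KeyError` handling entirely by filtering on the attribute:

```python
# href=True returns only anchors that actually carry an href attribute
for link in soup.find_all('a', class_='emblem', href=True):
    print(link['href'])
```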
Not sure whether the answer above solves the problem. Here is one that worked for me:
```python
url = "SOME-URL-YOU-WANT-TO-SCRAPE"
response = requests.get(url=url)
urls = BeautifulSoup(response.content, 'lxml').find_all('a', attrs={"class": ["YOUR-CLASS-NAME"]}, href=True)
```
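Each element of `urls` is still a Tag object, so extracting the links takes one more step; a hypothetical continuation:

```python
# href=True above guarantees every returned tag carries the attribute
hrefs = [a['href'] for a in urls]
```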
Try this. It will give you all the URLs, traversing every page on that site. I've used explicit waits instead of hardcoded delays to make it reliable.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "http://emblematica.grainger.illinois.edu"  # no trailing slash: the hrefs start with "/"
wait = WebDriverWait(driver, 10)
driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".emblem")))

while True:
    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select('.emblem'):
        links = url + item['href']
        print(links)
    try:
        link = driver.find_element_by_id("next")
        link.click()
        wait.until(EC.staleness_of(link))
    except Exception:
        break

driver.quit()
```
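The `wait.until(EC.staleness_of(link))` line blocks until the clicked "next" element has been detached from the DOM, which is a reliable signal that the next page has started replacing the current one before the loop parses `driver.page_source` again.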
Partial output:
```
http://emblematica.grainger.illinois.edu/detail/emblem/av1615001
http://emblematica.grainger.illinois.edu/detail/emblem/av1615002
http://emblematica.grainger.illinois.edu/detail/emblem/av1615003
```
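Since the original goal was to download images, the collected absolute URLs can then be fetched and saved. A minimal sketch reusing `basename` from the question (note that these detail URLs point at HTML pages, so locating the actual image URL on each detail page is a further step):

```python
import requests
from os.path import basename

def download(absolute_url):
    # sketch: save the response body under the last path segment of the URL
    with open(basename(absolute_url), "wb") as f:
        f.write(requests.get(absolute_url).content)

download("http://emblematica.grainger.illinois.edu/detail/emblem/av1615001")
```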