
BeautifulSoup select all href in some element with specific class

I am trying to scrape the images from this website. I have tried Scrapy, but Scrapy doesn't seem to work on Windows 10 Home, so I am now trying Selenium/BeautifulSoup. I am using Python 3.6 with Spyder.

The href elements I need are anchor tags with the class "emblem" and a relative href, something like:

<a class="emblem" href="/detail/emblem/av1615001">...</a>

The main problems I am trying to solve are:
- How should I select the href with BeautifulSoup? In the code below you can see what I have tried (without success).
- As you can see, the href is only a partial path of the URL… how should I deal with that?

My code so far:

from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import urllib
import requests
from os.path  import basename


def start_requests(self):
        self.driver = webdriver.Firefox("C:/Anaconda3/envs/scrapy/selenium/webdriver")
        #programPause = input("Press the <ENTER> key to continue...")
        self.driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        html = self.driver.page_source

        #html = requests.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        soup = BeautifulSoup(html, "html.parser")
        emblemshref = soup.select("a", {"class": "emblem", "href": True})

        for href in emblemshref:
            link = href["href"]
            with open(basename(link), "wb") as f:
                f.write(requests.get(link).content)

        # click on "next>>"
        while True:
            try:
                next_page = self.driver.find_element_by_xpath("//a[@id='next']")
                sleep(3)
                self.logger.info('Sleeping for 3 seconds')
                next_page.click()

                #here again the same emblemshref loop

            except NoSuchElementException:
                #execute next on the last page
                self.logger.info('No more pages to load')
                self.driver.quit()
                break


You can get the hrefs by class name like this:

Example:

for link in soup.findAll('a', {'class': 'emblem'}):
    try:
        print(link['href'])
    except KeyError:
        pass
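
The same selection can also be written with a CSS selector; a minimal sketch, assuming (as above) that the target anchors carry the class "emblem" and an href attribute:

# CSS selector: <a> elements with class "emblem" that actually have an href
for link in soup.select('a.emblem[href]'):
    print(link['href'])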


Not sure whether the answer above solves the problem. Here is one that worked for me.

url ="SOME-URL-YOU-WANT-TO-SCRAPE"
response = requests.get(url=url)
urls = BeautifulSoup(response.content, 'lxml').find_all('a', attrs={"class": ["YOUR-CLASS-NAME"]}, href=True)
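
The hrefs matched here are still relative paths (e.g. /detail/emblem/av1615001), so they need to be joined with the page URL before they can be requested. A small follow-up sketch using urllib.parse.urljoin from the standard library on the urls list collected above:

from urllib.parse import urljoin

# turn the relative hrefs into absolute URLs against the page that was scraped
full_urls = [urljoin(url, a['href']) for a in urls]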


Try this. It will give you all the URLs, traversing all the pages of that site. I have used an Explicit Wait to make it faster and more dynamic.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url ="http://emblematica.grainger.illinois.edu/"
wait = WebDriverWait(driver, 10)
driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".emblem")))

while True:
    soup = BeautifulSoup(driver.page_source,"lxml")
    for item in soup.select('.emblem'):
        links = url + item['href']
        print(links)

    try:
        link = driver.find_element_by_id("next")
        link.click()
        wait.until(EC.staleness_of(link))
    except Exception:
        break
driver.quit()

Partial output:

http://emblematica.grainger.illinois.edu/detail/emblem/av1615001
http://emblematica.grainger.illinois.edu/detail/emblem/av1615002
http://emblematica.grainger.illinois.edu/detail/emblem/av1615003
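
If, as in the question, the goal is to save whatever each collected link points to, the download loop from the question can be reused on these absolute URLs. A rough sketch (the ".html" filename suffix is only illustrative; note that these links appear to lead to emblem detail pages rather than directly to image files, so what gets saved is HTML that would still need to be parsed for the actual images):

import requests
from os.path import basename

def save_pages(links):
    # links: absolute URLs such as the ones printed above
    for link in links:
        response = requests.get(link)
        # basename("/detail/emblem/av1615001") -> "av1615001"; ".html" suffix is illustrative
        with open(basename(link) + ".html", "wb") as f:
            f.write(response.content)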