Python: Use an already open webpage (with Selenium) with BeautifulSoup?

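I have a web page open and have logged in using WebDriver code. I am using WebDriver for this because the page requires a login and various other actions before it is ready to scrape.

The goal is to scrape data from this open page. I need to find links and open them, so there will be a lot of back and forth between Selenium WebDriver and BeautifulSoup.

I looked at the bs4 documentation; BeautifulSoup(open("ccc.html")) throws an error: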

soup = bs4.BeautifulSoup(open("https://m/search.mp?ss=Pr+Dn+Ts"))

OSError: [Errno 22] Invalid argument: 'https://m/search.mp?ss=Pr+Dn+Ts'

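I assume this is because it is not a .html file?

Related discussion

You are trying to open a page by its URL. open() only opens local files, so it cannot do that; use urlopen() instead: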

import bs4
from urllib.request import urlopen  # Python 3
# from urllib2 import urlopen  # Python 2

url = "your target url here"
soup = bs4.BeautifulSoup(urlopen(url), "html.parser")

Alternatively, use the requests library ("HTTP for Humans"):

import bs4
import requests

response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")

Also note that explicitly specifying a parser is strongly recommended. I have used html.parser here; other parsers are available.
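
As a rough illustration of choosing a parser explicitly, the sketch below prefers lxml when it is installed and falls back to the built-in html.parser otherwise. The helper name make_soup is hypothetical, and the preference for lxml is an assumption rather than something stated above.

```python
import bs4


def make_soup(markup):
    """Parse markup with lxml if available, else with the built-in parser."""
    try:
        return bs4.BeautifulSoup(markup, "lxml")
    except bs4.FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself
        return bs4.BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.get_text())
```

Naming the parser keeps the result consistent across machines, since different parsers can build slightly different trees from the same broken HTML.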

I want to use the exact same page (same instance)

A common approach is to get driver.page_source and pass it to BeautifulSoup for further parsing:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)

# wait for the page to load...

source = driver.page_source
driver.quit()  # remove this line to leave the browser open

soup = BeautifulSoup(source, "html.parser")
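
Since the question is about finding links and opening them in the same logged-in browser, here is a minimal sketch of that loop. It assumes `driver` is an already logged-in Selenium WebDriver that is left open (no driver.quit()); the helper names extract_links and crawl_logged_in_links are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(page_source, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl_logged_in_links(driver):
    """Visit each link on the driver's current page and parse the result.

    `driver` is assumed to be a Selenium WebDriver that is already logged in.
    """
    links = extract_links(driver.page_source, driver.current_url)
    pages = {}
    for link in links:
        driver.get(link)  # same browser instance, so the login session is kept
        pages[link] = BeautifulSoup(driver.page_source, "html.parser")
    return pages
```

Calling crawl_logged_in_links(driver) right after the login steps returns a dict mapping each link to its parsed soup. In practice you would probably wait for each page to finish loading before reading page_source.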