Scraping -- Text element missing for <dt> tag from JS generated page using PyQt4
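I am trying to scrape this page using PyQt4, but for some reason the <dt> tags are missing their text when I search with BeautifulSoup.
I am still new to PyQt4, so I am not sure what is going wrong here. I get the text of every <dd> tag, but none for the <dt> tags. Has the page not finished loading, or is something else wrong? Any help is appreciated.
Here is the code I have been using so far: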
    import sys

    import bs4 as bs
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage


    class Client(QWebPage):

        def __init__(self, url):
            print('\n\nLoading: \n', url)
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self.on_page_load)
            self.mainFrame().load((QUrl(url)))
            self.app.exec()
            self.app.quit()

        def on_page_load(self):
            self.app.quit()


    url = 'http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en'
    client_response = Client(url)
    source = client_response.mainFrame().toHtml()
    soup = bs.BeautifulSoup(source, 'lxml')
    table = soup.find('div', {'class' : 'left_list_leve quote'})
    price = soup.find('span' , {'class' : 'col_last'})
    name = soup.find('p' , {'class' : 'col_name'})
    all_dls = table.findAll('dl')
This is what I get after running the script:
    Loading:
     http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en
    [<dl>
    <dd class="ico_name label_prevcls">PREV. CLOSE*</dd>
    <dt class="ico_data col_prevcls"></dt>
    </dl>, <dl>
    <dd class="ico_name label_open">OPEN**</dd>
    <dt class="ico_data col_open"></dt>
    </dl>, <dl>
    <dd class="ico_name label_turnover">TURNOVER</dd>
    <dt class="ico_data col_turnover"></dt>
    </dl>, <dl>
    <dd class="ico_name label_volume">VOLUME</dd>
    <dt class="ico_data col_volume"></dt>
    </dl>, <dl>
    <dd class="ico_name label_mktcap">MKT CAP</dd>
    <dt class="ico_data col_mktcap"></dt>
    </dl>, <dl>
    <dd class="ico_name label_lotsize">LOT SIZE</dd>
    <dt class="ico_data col_lotsize"></dt>
    </dl>, <dl>
    <dd class="ico_name label_bid">BID</dd>
    <dt class="ico_data col_bid"></dt>
    </dl>, <dl>
    <dd class="ico_name label_ask">ASK</dd>
    <dt class="ico_data col_ask"></dt>
    </dl>, <dl>
    <dd class="ico_name label_eps">EPS</dd>
    <dt class="ico_data col_eps"></dt>
    </dl>, <dl>
    <dd class="ico_name label_pe">P/E</dd>
    <dt class="ico_data col_pe"></dt>
    </dl>, <dl>
    <dd class="ico_name label_divyield">DIV YIELD</dd>
    <dt class="ico_data col_divyield"></dt>
    </dl>]
    <span class="col_last"></span>
    <p class="col_name"></p>
You are missing the _loadFinished() method.
    # -*- coding: utf-8 -*-
    from PyQt4 import QtCore, QtGui, QtWebKit
    from PyQt4.QtGui import *
    import bs4 as bs
    import sys


    class Client(QtWebKit.QWebPage):

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QtWebKit.QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QtCore.QUrl(url))
            # Block here until _loadFinished() quits the event loop
            self.app.exec_()

        def _loadFinished(self, result):
            # Keep a reference to the loaded frame, then stop the event loop
            self.frame = self.mainFrame()
            self.app.quit()


    url = 'http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en'
    client_response = Client(url)
    source = client_response.frame.toHtml()
    # Python 2 only: convert the QString to UTF-8 encoded bytes
    u = (unicode(source).encode("utf-8", errors="replace"))
    soup = bs.BeautifulSoup(u, 'lxml')
    table = soup.find('div', {'class': 'left_list_leve quote'})
    price = soup.find('span', {'class': 'col_last'})
    name = soup.find('p', {'class': 'col_name'})
    all_dls = table.findAll('dl')
    for dl in all_dls:
        print (dl)
Output:
    <dl>
    <dd class="ico_name label_prevcls">PREV. CLOSE*</dd>
    <dt class="ico_data col_prevcls">HK$460.000</dt>
    </dl>
    <dl>
    <dd class="ico_name label_open">OPEN**</dd>
    <dt class="ico_data col_open">HK$459.000</dt>
    </dl>
    <dl>
    <dd class="ico_name label_turnover">TURNOVER</dd>
    <dt class="ico_data col_turnover">HK$11.08B</dt>
    </dl>
    <dl>
    <dd class="ico_name label_volume">VOLUME</dd>
    <dt class="ico_data col_volume">24.33M</dt>
    </dl>
    <dl>
    <dd class="ico_name label_mktcap">MKT CAP</dd>
    <dt class="ico_data col_mktcap">HK$4,297.37B</dt>
    </dl>
    <dl>
    <dd class="ico_name label_lotsize">LOT SIZE</dd>
    <dt class="ico_data col_lotsize">100</dt>
    </dl>
    <dl>
    <dd class="ico_name label_bid">BID</dd>
    <dt class="ico_data col_bid">HK$452.400</dt>
    </dl>
    <dl>
    <dd class="ico_name label_ask">ASK</dd>
    <dt class="ico_data col_ask">HK$452.600</dt>
    </dl>
    <dl>
    <dd class="ico_name label_eps">EPS</dd>
    <dt class="ico_data col_eps">RMB4.383</dt>
    </dl>
    <dl>
    <dd class="ico_name label_pe">P/E</dd>
    <dt class="ico_data col_pe">91.55x</dt>
    </dl>
    <dl>
    <dd class="ico_name label_divyield">DIV YIELD</dd>
    <dt class="ico_data col_divyield">0.13%</dt>
    </dl>
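If the goal is to use the values rather than print the markup, each <dd> label can be paired with the <dt> value that follows it. The following is only a minimal sketch: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, the names QuoteParser and parse_quote are hypothetical, and the sample markup is a two-entry excerpt of the output above.

```python
from html.parser import HTMLParser


class QuoteParser(HTMLParser):
    """Collect <dd> label / <dt> value pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}      # label -> value
        self._tag = None     # 'dd' or 'dt' while inside one of them
        self._label = None   # text of the most recent <dd>
        self._buf = []       # text chunks of the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("dd", "dt"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "dd":
            self._label = text
        elif self._label is not None:
            # An empty string here means the value was not rendered yet
            self.pairs[self._label] = text
            self._label = None
        self._tag = None


def parse_quote(html):
    """Return a dict mapping each label to its value."""
    parser = QuoteParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = """
<dl>
<dd class="ico_name label_prevcls">PREV. CLOSE*</dd>
<dt class="ico_data col_prevcls">HK$460.000</dt>
</dl>
<dl>
<dd class="ico_name label_lotsize">LOT SIZE</dd>
<dt class="ico_data col_lotsize">100</dt>
</dl>
"""
quote = parse_quote(sample)
print(quote)
```

An empty string as a value is a quick sign that the HTML was captured before the page's JavaScript filled in the <dt> elements, which is the symptom described in the question.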
Try using Selenium:
    from selenium import webdriver
    import time

    driver = webdriver.PhantomJS()
    driver.get('http://www.hkex.com.hk/Market-Data/Securities-Prices/Equities/Equities-Quote?sym=700&sc_lang=en')
    content = driver.find_element_by_xpath('//*[@class="left_list_item list_item_op"]')
    print(content.text)
Sample output:
    SIVABALANs-MBP:Desktop siva$ python test_phantomjs.py
    /Users/siva/anaconda3/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
      warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
    PREV. CLOSE*
    HK$460.000
    OPEN**
    HK$459.000
    TURNOVER
    HK$11.08B
    VOLUME
    24.33M
    MKT CAP
    HK$4,297.37B
    LOT SIZE
    100
    SIVABALANs-MBP:Desktop siva$
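The warning in that output recommends a headless Chrome or Firefox instead of PhantomJS. A rough sketch of the same approach with headless Chrome might look like the following. It assumes Selenium 4 or later and a local Chrome install, and it assumes the page still uses the class names seen in this thread and that the element's text alternates label and value lines. The function names pair_lines and fetch_quote are hypothetical.

```python
def pair_lines(text):
    """Turn alternating label/value lines into a dict."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return dict(zip(lines[0::2], lines[1::2]))


def fetch_quote(url, timeout=15):
    """Load the page in headless Chrome and return label -> value pairs.

    Assumption: Selenium 4+ with Chrome available. The imports sit inside
    the function so that pair_lines() can be used without Selenium.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has filled in the first <dt> value
        # (class name taken from the output shown in the question).
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, "dt.col_prevcls").text.strip()
        )
        block = driver.find_element(
            By.XPATH, '//*[@class="left_list_item list_item_op"]'
        )
        return pair_lines(block.text)
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

Calling fetch_quote() with the URL from the question should then return a dictionary of labels and values, provided the page layout has not changed. The explicit wait is the main design choice here: it waits for a value to appear instead of assuming the data is present as soon as the page load event fires.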