Python: BeautifulSoup findAll with class attribute - UnicodeEncodeError

I am using Beautiful Soup to extract news stories (just the titles) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url)
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html = []

    for td in soup.findAll("td", {"class": "title"}):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

However, when I run the code, it gives this error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How can I get this to work?

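This happens because BeautifulSoup works with Unicode strings internally. Printing a unicode string to the console makes Python try to convert it to Python's default encoding (usually ascii), which generally fails for sites with non-ASCII content. Searching for "python + unicode" will turn up the basics of how Python handles Unicode. In the meantime, convert your unicode strings to UTF-8 before printing:

print some_unicode_string.encode('utf-8')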
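
To illustrate the point above, here is a minimal sketch of the failure and the workaround. It does not fetch the page: `titles` is a made-up list standing in for the link texts, and `to_utf8_lines` is a hypothetical helper name, not part of BeautifulSoup. The snippet is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: `titles` is a made-up stand-in for the link texts from the page.
titles = [u"Plain ASCII title", u"Title with \xe2 in it"]


def to_utf8_lines(unicode_titles):
    """Encode each unicode title to UTF-8 bytes (hypothetical helper)."""
    return [t.encode("utf-8") for t in unicode_titles]


# Encoding the non-ASCII title with the ascii codec raises the same
# UnicodeEncodeError that appears in the traceback.
try:
    titles[1].encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds for every title.
encoded = to_utf8_lines(titles)
for line in encoded:
    print(line)
```

In the script from the question, this would probably mean looping over the list returned by `get_stories` and printing each link's text (for example its `.string`) encoded this way, rather than printing the whole list at once.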