Reading lines with .encode('utf8')
I am reading lines from a file, for example:
The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)
Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)
I read and encode them with the following code:

    title = line.encode('utf8')

But the output is:
b'Die virtuelle Katastrophe: So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)'
b'The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)'
Why is a "b" always added? How do I read the file correctly so that the umlauts are preserved?

Here is the complete relevant code snippet:

    # Parse the clippings.txt file
    lines = [line.strip() for line in codecs.open(config['CLIPPINGS_FILE'], 'r', 'utf-8-sig')]
    for line in lines:
        line_count = line_count + 1
        if (line_count == 1 or is_title == 1):
            # ASSERT: this is a title line
            #title = line.encode('ascii', 'ignore')
            title = line.encode('utf8')
            prev_title = 1
            is_title = 0
            note_type_result = note_type = l = l_result = location = ""
            continue

Thanks
Solution
str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.
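So you are getting exactly what you should expect: line.encode('utf8') returns a bytes object, and the leading b is simply how Python displays bytes. The umlauts are not lost; they are shown as their UTF-8 byte sequences (\xc3\xbc for ü).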
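To make the difference between str and bytes concrete, here is a minimal sketch. The sample string is shortened from the question's data, and the variable names are only illustrative.

```python
# A str holds Unicode text; encode() turns it into a bytes object.
title = "So führen Sie Teams über Distanz"
encoded = title.encode("utf8")

print(type(title).__name__)    # str
print(type(encoded).__name__)  # bytes

# The b'...' prefix is part of how bytes are displayed, not part of the data.
print(encoded)  # b'So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz'

# Decoding the bytes gives back the original text, umlauts included.
print(encoded.decode("utf8") == title)  # True
```

In other words, if the goal is to keep working with text, the encode call can simply be dropped.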
On most machines, all you need is:

    with open(filename, encoding='utf8') as f:
        line = f.readline()
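Applied to the question's use case, this could look roughly like the sketch below. It assumes the clippings file is UTF-8 with an optional byte order mark, which is why it keeps the 'utf-8-sig' codec from the question. The helper name read_titles and the temporary sample file are hypothetical and exist only to keep the example self-contained.

```python
import os
import tempfile


def read_titles(path):
    """Return the stripped, non-empty lines of a UTF-8 file as str objects."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present.
    with open(path, encoding="utf-8-sig") as f:
        return [line.strip() for line in f if line.strip()]


# Build a small sample file (a stand-in for the real clippings file).
sample = "Die virtuelle Katastrophe: So führen Sie Teams über Distanz\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

titles = read_titles(path)
os.remove(path)

# The lines are already text (str), so no encode() call is needed.
print(type(titles[0]).__name__)  # str
print("ü" in titles[0])          # True
```

Because open() already decodes the file, each line arrives as a str with the umlauts intact, and no b prefix appears when it is printed.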