Python: read line with .encode with utf8


I read lines from a file, for example:

The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)

Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur
Spitzenleistung (German Edition) (Thomas, Gary)

I read and encode them with the following code:

title = line.encode('utf8')

But the output is:

b'Die virtuelle Katastrophe: So f\xc3\xbchren Sie Teams \xc3\xbcber
Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)'

b'The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben)
(German Edition) (Peters, Tom)'

Why is a "b" always added? How do I read the file correctly so that the umlauts are preserved?

Here is the complete relevant code snippet:

# Parse the clippings.txt file
lines = [line.strip() for line in codecs.open(config['CLIPPINGS_FILE'], 'r', 'utf-8-sig')]
for line in lines:
    line_count = line_count + 1
    if (line_count == 1 or is_title == 1):
        # ASSERT: this is a title line
        #title = line.encode('ascii', 'ignore')
        title = line.encode('utf8')
        prev_title = 1
        is_title = 0
        note_type_result = note_type = l = l_result = location =""
        continue

Thanks


The method str.encode converts a Unicode string into a bytes object:

str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

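So you are getting exactly what is expected: the b prefix simply marks the printed value as a bytes object, and each umlaut appears as its UTF-8 byte sequence (ü becomes \xc3\xbc).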
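A minimal sketch of that behaviour, using a shortened title from the question as sample data (the variable names are only illustrative):

```python
# A str holds Unicode text; encode() turns it into raw UTF-8 bytes.
title = "Die virtuelle Katastrophe: So führen Sie Teams über Distanz"

encoded = title.encode("utf8")
print(type(encoded))  # the bytes type, not str
print(encoded)        # repr carries the b prefix and shows ü as \xc3\xbc

# decode() reverses the step and gives back the original str.
decoded = encoded.decode("utf8")
print(decoded == title)
```

In other words, the text read from the file was already correct before the call to encode; the call only changed its type.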

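On most machines you can simply open the file and read it; in Python 3 the lines come back as str, so no call to .encode is needed. If the file's encoding is not the system default, you can pass it as a keyword argument: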

with open(filename, encoding='utf8') as f:
    line = f.readline()
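Applied to the loop in the question, this could look roughly like the sketch below. It assumes the clippings file is UTF-8 with an optional byte order mark, as the 'utf-8-sig' in the question suggests. The helper name read_titles and the temporary sample file are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile


def read_titles(path):
    """Return the stripped lines of a UTF-8 text file as str objects."""
    # utf-8-sig drops a leading byte order mark if one is present and
    # otherwise behaves like plain utf-8.
    with open(path, encoding="utf-8-sig") as f:
        return [line.strip() for line in f]


# Build a small sample file (a hypothetical stand-in for the clippings file).
sample = "Die virtuelle Katastrophe: So führen Sie Teams über Distanz\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

titles = read_titles(sample_path)
os.remove(sample_path)

title = titles[0]  # already a str; no .encode() call is needed
print(title)
```

Keeping the title as str throughout and encoding only at the point where bytes are actually required (for example when writing to a binary stream) avoids the b'...' output.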