Python: convert bytes to a string | 码农家园

I use this code to capture the standard output of an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'

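Does anybody know how to convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to be OK with Python 3.

Related discussion

You need to decode the bytes object to produce a string: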

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8")
'abcde'
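As a minimal sketch of the advice above, decoding can be wrapped so that a wrong encoding guess does not crash the program. The helper name decode_output and the sample bytes are made up for illustration; they are not part of any library.

```python
def decode_output(data, encoding="utf-8"):
    """Decode bytes to str, falling back to replacement characters.

    Hypothetical helper: the strict attempt is tried first, and the
    fallback marks undecodable bytes with U+FFFD instead of raising.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace")


# Made-up sample resembling the output in the question.
sample = b"total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n"
text = decode_output(sample)
print(text, end="")
```

Whether a silent fallback is acceptable depends on the data; for anything that must round-trip exactly, letting the UnicodeDecodeError propagate is probably the safer choice.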

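Related discussion

Yes, but given that this is the output of a Windows command, shouldn't it use ".decode('windows-1252')" instead?
Using "windows-1252" is not reliable either (for example, on other language versions of Windows). Wouldn't it be best to use sys.stdout.encoding?
Maybe this will help somebody further: sometimes you use a byte array for, e.g., TCP communication. If you want to convert the byte array to a string and cut off the trailing '\x00' characters, the answer above is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') instead.
I've filed a bug about documenting this at bugs.python.org/issue17860. Feel free to propose a patch. If it is hard to contribute, comments on how to improve that are welcome.
What other decoding options does a binary object have?
Python 2.7.6 doesn't handle b"\x80\x02\x03".decode("utf-8") -> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte.
If the content is random binary values, the utf-8 conversion is likely to fail. Instead, see @techtonik's answer (below): stackoverflow.com/a/27527728/198536
@mathtick: docs.python.org/devguide/documenting.html
@AaronMaenpaa: This won't work on an array the way it did in Python 2.
It's a bit hidden. For a reference to the documentation, see the answer below. It is also in the bytes docstring (help(command_stdout)).
@Profpatsch: docs.python.org/3.5/library/stdtypes.html#bytes.decode
A small update on using sys.stdout.encoding: it is allowed to be None, which causes encode() to fail.
I have some code for a networking program. Its output is "received quote: b' x00&c: Users .pycharm2016.3\config x00&c: Users pych??arm\systemx00x03-??-'". How would I change my code to fix this? Writing print(f"receivedquote: {data}".decode('utf-8') didn't do the trick.
See @borislav-sabev's answer below. A better solution.
While this is generally the way to go, you need to be certain you've got the encoding right, or your code may end up throwing up. To make things worse, data from the outside world can contain unexpected encodings. The chardet library at pypi.org/project/chardet can help you with this, but always program defensively: sometimes even chardet gets it wrong, so wrap your junk in some appropriate exception handling.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 168: invalid start byte
Does anyone know how to do the same operation in a TensorFlow graph?
Why doesn't str(text_bytes) work? This seems bizarre to me.
Is this expected? I get AttributeError: 'str' object has no attribute 'decode', but the string has a b at the start: b'(Answer 1 Ack)\n' Huh?!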
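Two of the points raised in the comments above can be sketched quickly: why str(text_bytes) does not do what people expect, and how to drop NUL padding after decoding. The payload below is made up for illustration.

```python
raw = b"example\x00\x00"   # made-up payload, e.g. a NUL-padded network field

# str() without an encoding returns the repr of the bytes object,
# including the leading b and the quotes. That result is already a str,
# which is why calling .decode() on it raises AttributeError.
as_repr = str(b"abc")

# decode() (or str() with an encoding argument) produces the actual text.
as_text = b"abc".decode("utf-8")

# NUL padding survives decoding and has to be stripped separately.
cleaned = raw.decode("utf-8").rstrip("\x00")

print(as_repr)   # b'abc'
print(as_text)   # abc
print(cleaned)   # example
```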

I think this way is easy:

bytes = [112, 52, 52]
"".join(map(chr, bytes))
>> p44

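Related discussion

Thank you, your method worked for me when nothing else did. I had a non-encoded byte array that I needed turned into a string. I was trying to find a way to re-encode it so I could decode it into a string. This method works perfectly!
@leetNightshade: yet it is terribly inefficient. If you have a byte array, you only need to decode.
@Martijn Pieters I just did a simple benchmark with these other answers, running multiple batches of 10,000 runs (stackoverflow.com/a/3646405/353094), and the above solution was actually much faster every single time. For 10,000 runs in Python 2.7.7 it takes 8ms, versus 12ms and 18ms for the others. Granted, there could be some variation depending on input, Python version, etc. It doesn't seem too slow to me.
The OP here is using Python 3.
Fair point. In Python 3.4.1 x86 this method takes 17.01ms for the byte-array-to-string conversion, compared with 24.02ms and 11.51ms for the others. So it's not the fastest in that case.
@leetNightshade: you still seem to be talking about integers and bytearrays, not a bytes value (as returned by Popen.communicate()).
@Martijn Pieters Yes. So this isn't the best answer for the body of the question. And the title is misleading, isn't it? He/she wants to convert a byte string to a regular string, not a byte array to a string. This answer works for the title of the question.
@leetNightshade: the title is indeed misleading; I'll edit it.
It can convert bytes read from a file opened with "rb" to a string, and it is very handy when you don't know the encoding.
@Sasszem: this method is a perverted way to express a.decode('latin-1') where a = bytearray([112, 52, 52]) ("There Ain't No Such Thing as Plain Text"). If you've managed to convert bytes into a text string, then you used some encoding: latin-1 in this case.
For Python 3, this should be equivalent to bytes([112, 52, 52]). By the way, bytes is a bad name for a local variable, precisely because it is a Python 3 builtin.
@leetNightshade: for completeness' sake: on Python 3.6, bytes(list_of_integers).decode('ascii') is about 1/3 faster than ''.join(map(chr, list_of_integers)).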
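The equivalence mentioned in the comments can be checked directly. This is only a small sketch for Python 3; the variable name list_of_integers is borrowed from the last comment.

```python
list_of_integers = [112, 52, 52]   # the sample values from the answer

# chr() maps each integer to the code point with the same number ...
joined = "".join(map(chr, list_of_integers))

# ... which is what decoding the same bytes as latin-1 does.
decoded = bytes(list_of_integers).decode("latin-1")

print(joined, decoded)
```

Both expressions give the same string for any values in range(256), which is the sense in which the chr/join approach is an implicit latin-1 decode.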

You need to decode the byte string and turn it into a character (unicode) string.

b'hello'.decode(encoding)

or, on Python 3:

str(b'hello', encoding)

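Related discussion

If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS cp437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

# stream is any iterable of byte lines, e.g. a file opened with 'rb'
lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))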

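Because the encoding is unknown, expect non-English symbols to be translated into characters of cp437 (English characters are not translated, because they match in most single-byte encodings and in UTF-8).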

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

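The same is said of latin-1, which was popular (the default?) in Python 2. See the missing points in Codepage Layout; that is where Python is reported to choke with the infamous "ordinal not in range" error. (Python's own latin-1 codec maps every byte value from 0 to 255 to a code point, so decoding with it does not raise; with latin-1 that error appears only when encoding characters above U+00FF.)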

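UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.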
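The [binary] -> [str] -> [binary] round trip mentioned above can be tried with a few lines. This is a minimal sketch for Python 3 that reuses the byte string from the traceback example; it says nothing about performance.

```python
raw = b"\x00\x01\xffsd"          # not valid UTF-8 because of the 0xff byte

# Undecodable bytes are smuggled into the str as lone surrogates
# (0xff becomes U+DCFF) instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(restored == raw)
```

Note that the resulting string contains lone surrogates, so encoding it with the default strict handler (for example when printing it) would fail.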

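UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility of slash-escaping all unknown bytes with the backslashreplace error handler. That works only on Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

# stream is any iterable of byte lines, e.g. a file opened with 'rb'
lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))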

See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.

UPDATE 20170119: I decided to implement a slash-escaping decode that works on both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance.
    Return a tuple with a replacement for the undecodable part of the
    input and the position where decoding should continue."""
    # print(err, dir(err), err.start, err.end, err.object[:err.start])
    thebytes = err.object[err.start:err.end]
    # The failing range may span several bytes (e.g. a truncated sequence),
    # so escape each one; bytearray yields integers on Python 2 and 3.
    repl = u''.join(u'\\x%02x' % b for b in bytearray(thebytes))
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
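To see what a handler of this kind produces, the sketch below registers a similar one under a separate, made-up name (slashescape_demo) and compares its output with the built-in backslashreplace handler, which supports decoding from Python 3.5 on. It is an illustration for Python 3, not the answer's exact code.

```python
import codecs


def slashescape_demo(err):
    """Error handler sketch: replace each undecodable byte with \\xNN."""
    bad = err.object[err.start:err.end]
    repl = "".join("\\x%02x" % b for b in bytearray(bad))
    return (repl, err.end)


# Registered under a hypothetical name so it does not clash with a
# handler the reader may already have registered.
codecs.register_error("slashescape_demo", slashescape_demo)

line = b"\x80abc"
custom = line.decode("utf-8", "slashescape_demo")
builtin = line.decode("utf-8", "backslashreplace")  # decoding: Python 3.5+

print(custom)
print(custom == builtin)
```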

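Related discussion

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2 the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.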

I think what you actually want is this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

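Aaron's answer was correct, except that you need to know which encoding to use. I believe that Windows uses 'windows-1252'. It only matters if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason Python uses two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).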

Related discussion

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
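A portable way to try this out is sketched below. It assumes Python 3.5+ for subprocess.run and uses sys.executable in place of ls so that it runs on any platform; the printed names are made up. On Python 3.7+ the keyword text=True is an alias for universal_newlines=True.

```python
import subprocess
import sys

# sys.executable stands in for ls so the sketch is platform independent.
cmd = [sys.executable, "-c", "print('file1'); print('file2')"]

result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
command_stdout = result.stdout   # already a str, decoded by subprocess

print(type(command_stdout))
print(command_stdout.splitlines())
```

The decoding here uses the locale's preferred encoding, so output in a different encoding would still need an explicit encoding argument or a manual decode.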

Related discussion

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has a standard argument:

codecs.decode(obj, encoding='utf-8', errors='strict')

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

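The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.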

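It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.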
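The silent failure can be reproduced, and in this particular case reversed, with a few lines. This is a sketch under the assumption that you know both encodings involved; in general the wrong decoding may lose information and cannot be undone.

```python
original = "\u2014"                       # EM DASH, written as an escape

# UTF-8 bytes e2 80 94 misread as cp1252 give three unrelated characters.
mojibake = original.encode("utf-8").decode("cp1252")

# Reversing the wrong step recovers the original text.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(ascii(mojibake))
print(repaired == original)
```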

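In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.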

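ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes back, you can use os.fsencode().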
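The round trip through os.fsdecode() and os.fsencode() can be sketched without running ls at all. The file name below is made up, and the sketch assumes a POSIX system, where the surrogateescape handler is used; on Windows the same bytes may fail to decode.

```python
import os

raw_name = b"report-\xff.txt"     # made-up file name with an undecodable byte

name = os.fsdecode(raw_name)      # str, possibly containing lone surrogates
restored = os.fsencode(name)      # back to the exact original bytes

print(ascii(name))
print(restored == raw_name)
```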

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode the bytes; for example, it can be cp1252 on Windows.

To decode a byte stream on the fly, io.TextIOWrapper() can be used.
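A minimal sketch of on-the-fly decoding is shown below. io.BytesIO stands in for a real pipe such as Popen(...).stdout, and the sample text is made up.

```python
import io

# io.BytesIO stands in for a binary pipe such as Popen(...).stdout.
byte_stream = io.BytesIO("caf\u00e9\nna\u00efve\n".encode("utf-8"))

# The wrapper decodes incrementally as lines are read.
text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(lines)
```

Because the wrapper decodes incrementally, multi-byte characters that straddle a read boundary are handled for you, which is hard to get right when decoding chunks manually.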

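Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from os.listdir() (which uses the Windows Unicode API); for example, '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".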

Since this question is actually asking about subprocess output, you have a more direct approach available, because Popen accepts an encoding keyword (in Python 3.6+):

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
<class 'str'>
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

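The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() is used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'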

Related discussion

If you get the following error when calling decode():

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'

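When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not get a chance to do that. Thus,

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.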
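The doubling can be reproduced on any platform by asking open() for Windows-style newline translation explicitly. The sketch below uses temporary files with made-up names; passing newline="" is shown as an alternative to the replace() call, not as the answer's own method.

```python
import os
import tempfile

data = b"line1\r\nline2\r\n"          # bytes as received from a Windows system
text = data.decode("utf-8")            # still contains "\r\n"

with tempfile.TemporaryDirectory() as tmp:
    doubled_path = os.path.join(tmp, "doubled.txt")
    exact_path = os.path.join(tmp, "exact.txt")

    # newline="\r\n" mimics the Windows default: every "\n" becomes "\r\n".
    with open(doubled_path, "w", encoding="utf-8", newline="\r\n") as f:
        f.write(text)

    # newline="" turns translation off, so the text is written unchanged.
    with open(exact_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    with open(doubled_path, "rb") as f:
        doubled = f.read()
    with open(exact_path, "rb") as f:
        exact = f.read()

print(doubled)
print(exact == data)
```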

Related discussion

I made a function to clean a list:

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista

Related discussion

For Python 3, this is a much safer and more Pythonic approach to converting bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):  # check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2

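Related discussion

def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no decode() (AttributeError); undecodable bytes raise
        # UnicodeDecodeError, a ValueError subclass. Return the input as-is.
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))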

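Related discussion

From http://docs.python.org/3/library/sys.html,

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').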
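The same pattern works for any text stream that wraps a binary buffer. In the sketch below, write_bytes is a hypothetical helper and an in-memory stream stands in for sys.stdout so that the result can be inspected.

```python
import io


def write_bytes(stream, data):
    """Write bytes to a text stream through its underlying binary buffer.

    Hypothetical helper; stream must expose a .buffer attribute, as
    sys.stdout normally does.
    """
    stream.flush()                 # keep already written text in order
    stream.buffer.write(data)
    stream.buffer.flush()


# In-memory stand-in for sys.stdout.
backing = io.BytesIO()
fake_stdout = io.TextIOWrapper(backing, encoding="utf-8")

fake_stdout.write("text first, ")
write_bytes(fake_stdout, b"abc")
written = backing.getvalue()

print(written)
# write_bytes(sys.stdout, b"abc\n") does the same on the real standard
# output, provided sys.stdout has not been replaced by an object
# without a .buffer attribute.
```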

Related discussion