Python: How do I check if a string is Unicode or ASCII?

In Python, what do I need to do to determine which encoding a string has?

In Python 3, all strings are sequences of Unicode characters. There is a separate bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code like this:

# Python 2 only: the unicode type does not exist in Python 3
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

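This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A unicode string may consist purely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-textual data.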
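
To make the difference between type and content concrete, here is a minimal Python 3 sketch that reports the two separately. The helper name `classify` and its return format are made up for illustration; they are not part of any library.

```python
def classify(value):
    """Return (kind, ascii_only) for a value; a hypothetical helper."""
    if isinstance(value, str):
        # Text: check every code point against the ASCII range.
        return ("text", all(ord(ch) < 128 for ch in value))
    if isinstance(value, (bytes, bytearray)):
        # Bytes: iterating over bytes yields integers in Python 3.
        return ("bytes", all(b < 128 for b in value))
    return ("other", None)


print(classify("abc"))          # ('text', True)
print(classify("\u00dc"))       # ('text', False)
print(classify(b"\xc3\x9c"))    # ('bytes', False)
```

Note that the second element only says whether the content stays inside the ASCII range; for bytes it says nothing about which encoding produced them.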

How to tell whether an object is a Unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc') # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc') # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes; Python does not know what its encoding is. The unicode type is the safer way to store text. If you want to learn more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc') # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc') # Python 3 byte string literal
<class 'bytes'>

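In Python 3, str is like Python 2's unicode and is used to store text. What was called str in Python 2 is called bytes in Python 3.

How to tell whether a byte string is valid UTF-8 or ASCII

You can call decode. If it raises a UnicodeDecodeError exception, the byte string is not valid in that encoding.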

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
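
The decode check can be wrapped in a small function. This is only a sketch; the name `is_valid` is an assumption chosen for this example, not an existing API.

```python
def is_valid(data, encoding):
    """Return True if the byte string decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"
print(is_valid(u_umlaut, "utf-8"))  # True
print(is_valid(u_umlaut, "ascii"))  # False
```

A True result only means the bytes are well-formed in that encoding, not that they were written in it. For example, latin-1 accepts every possible byte sequence.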

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which means a Unicode string by default) is sufficient.

isinstance(x, str)

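For Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check for a "string-like" object with a single statement, you can do the following: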

isinstance(x, basestring)

Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

... then Unicode is "text-ness";

it is the abstract form of text

Read McMillan's talk Unicode In Python, Completely Demystified from PyCon 2008; it explains things much better than most of the related answers on Stack Overflow.

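If your code needs to be compatible with both Python 2 and Python 3, you can't directly use isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in try/except or a Python version test, because unicode is undefined in Python 3 and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is merely an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. Here is an example: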

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Both of these are unpythonic, and most of the time there is probably a better way.
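
One possible alternative is the try/except wrapping mentioned above: resolve the text type once, then use ordinary isinstance checks. This is a sketch under the assumption that Python 2.6 or later is in use (so that the name bytes exists); `text_type` and `to_text` are names invented for this example.

```python
try:
    text_type = unicode  # Python 2: the name exists
except NameError:
    text_type = str      # Python 3: str is the text type


def to_text(value, encoding="ascii"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got %s" % type(value).__name__)


print(to_text(b"abc"))
print(to_text("abc"))
```

The version-dependent part is confined to the first four lines, so the rest of the code never has to mention a Python version.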

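Use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string

Inside the six library, the related definition looks like this (shown for string_types; text_type is str on Python 3 and unicode on Python 2):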

if PY3:
    string_types = str,
else:
    string_types = basestring,

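Note that on Python 3, it is not really fair to say any of the following:

strs are UTFx for any x (e.g. UTF8)
strs are Unicode
strs are ordered collections of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, answering this question is not as simple as you might imagine.

An obvious way to test for ASCII-compatible strings is to attempt an encode: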

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the two cases.
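
The encode attempt can be turned into a yes/no check. A minimal Python 3 sketch; the function name `is_ascii` is an assumption made for this example.

```python
def is_ascii(text):
    """Return True if every character of the str fits in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("Hello there!"))            # True
print(is_ascii("Hello there... \u2603!"))  # False
```

On Python 3.7 and later, the built-in str.isascii() method answers the same question without the try/except.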

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to distinguish them.
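
As a sketch of the same check for UTF-8, the snippet below also shows one way such a string can come into existence in practice: decoding bytes with the standard "surrogateescape" error handler. The function name `is_utf8_encodable` is hypothetical, and the surrogateescape origin is offered as one likely source of lone surrogates, not as the only one.

```python
def is_utf8_encodable(text):
    """Return True if the str can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The byte 0xC3 alone is not valid UTF-8; surrogateescape smuggles it
# into the str as the lone surrogate U+DCC3 instead of raising an error.
smuggled = b"\xc3".decode("utf-8", "surrogateescape")

print(repr(smuggled))                                # '\udcc3'
print(is_utf8_encodable(smuggled))                   # False
print(smuggled.encode("utf-8", "surrogateescape"))   # b'\xc3'
```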

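This may help someone else. I started out testing the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic, without a version test and without importing six. Please comment with improvements to the sample code below to help other people.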

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s  # assume it was already utf-8
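
If only Python 3 needs to be supported, the same idea can be written more directly. This is a simplified sketch, not a drop-in replacement for the function above; `to_utf8_bytes` is a name made up for this example, and it assumes that any non-string object should be converted through str() first.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (Python 3 only sketch)."""
    if isinstance(value, bytes):
        return value  # assume it is already UTF-8; this is not verified
    if isinstance(value, str):
        return value.encode("utf-8")
    # Numbers and other objects: use their str() form.
    return str(value).encode("utf-8")


print(to_utf8_bytes("\u00dc"))  # b'\xc3\x9c'
print(to_utf8_bytes(3.5))       # b'3.5'
```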

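You could use the Universal Encoding Detector, but be aware that it only gives you its best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc", for example. You will need to get the encoding information from elsewhere; the HTTP protocol, for example, uses the Content-Type header for that.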
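
As an illustration of getting the encoding from elsewhere, the sketch below reads the charset parameter out of a Content-Type header value using only the standard library. The function name `charset_from_content_type` and the sample header values are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

When the header carries no charset, the caller still has to fall back on a default or on a guess from a detector.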

For Python 2/3 compatibility, simply use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string