String comparison technique used by Python

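I am wondering how Python does string comparison, and more specifically how it determines the outcome when the less-than (<) or greater-than (>) operator is used.

For instance, if I run print('abc' < 'bac') I get True. I understand that it compares corresponding characters in the strings, but it is unclear to me why more "weight", for lack of a better term, is placed on the fact that a is less than b in the first position than on the fact that, in the second position, the a of the second string is less than the b of the first.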

Related discussion

From the docs:

The comparison uses lexicographical
ordering: first the first two items
are compared, and if they differ this
determines the outcome of the
comparison; if they are equal, the
next two items are compared, and so
on, until either sequence is
exhausted.

Also:

Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

Or on Python 2:

Lexicographical ordering for strings uses the ASCII ordering for individual characters.

As an example:

>>> 'abc' > 'bac'
False
>>> ord('a'), ord('b')
(97, 98)

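As soon as a is found to be less than b, the result False is returned. The remaining items are not compared (the second items, for example, would give 'b' > 'a', which is True).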
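To make the "first difference decides" rule concrete, here is a small sketch. The helper name first_difference is made up for illustration and is not part of Python; it only locates the position that settles the comparison.

```python
def first_difference(s1, s2):
    """Return the index of the first position where s1 and s2 differ,
    or None if one string is a prefix of the other (or they are equal)."""
    for i, (c1, c2) in enumerate(zip(s1, s2)):
        if c1 != c2:
            return i
    return None

i = first_difference('abc', 'bac')
print(i)                               # 0
print(ord('abc'[i]), ord('bac'[i]))    # 97 98
# Everything after index 0 is ignored, so the result matches
# a comparison of just the first characters.
print(('abc' < 'bac') == ('a' < 'b'))  # True
```

When the helper returns None, no position differs and the lengths decide instead.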

Also pay attention to uppercase and lowercase:

>>> abc = 'abcdefghijklmnopqrstuvwxyz'
>>> [(x, ord(x)) for x in abc]
[('a', 97), ('b', 98), ('c', 99), ('d', 100), ('e', 101), ('f', 102), ('g', 103), ('h', 104), ('i', 105), ('j', 106), ('k', 107), ('l', 108), ('m', 109), ('n', 110), ('o', 111), ('p', 112), ('q', 113), ('r', 114), ('s', 115), ('t', 116), ('u', 117), ('v', 118), ('w', 119), ('x', 120), ('y', 121), ('z', 122)]
>>> [(x, ord(x)) for x in abc.upper()]
[('A', 65), ('B', 66), ('C', 67), ('D', 68), ('E', 69), ('F', 70), ('G', 71), ('H', 72), ('I', 73), ('J', 74), ('K', 75), ('L', 76), ('M', 77), ('N', 78), ('O', 79), ('P', 80), ('Q', 81), ('R', 82), ('S', 83), ('T', 84), ('U', 85), ('V', 86), ('W', 87), ('X', 88), ('Y', 89), ('Z', 90)]
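A practical consequence of these numbers, shown as a short sketch with made-up sample words: every uppercase ASCII letter sorts before every lowercase one, so a plain sort can look surprising. Passing str.casefold as the sort key is one common way to get a case-insensitive order.

```python
words = ['banana', 'Zebra', 'apple']

# Every uppercase ASCII letter has a smaller code point than every
# lowercase one, so 'Zebra' sorts first.
print(sorted(words))                    # ['Zebra', 'apple', 'banana']

# Comparing case-folded copies gives a case-insensitive order.
print(sorted(words, key=str.casefold))  # ['apple', 'banana', 'Zebra']

print('Z' < 'a')                        # True, because 90 < 97
```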

Related discussion

Python string comparison is lexicographic:

From the Python docs: http://docs.python.org/reference/expressions.html

Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.

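So in your example, 'abc' < 'bac', 'a' comes before (is less than) 'b' numerically (in both ASCII and Unicode representations), so the comparison ends right there.

Related discussion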

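Python and just about every other computer language use the same principles as (I hope) you would use when looking up a word in a printed dictionary:

(1) Depending on the human language involved, you have a notion of character ordering: "A" < "B" < "C" etc.

(2) The first character has more weight than the second character: "az" < "za" (whether the language is written left-to-right, right-to-left, or boustrophedon is quite irrelevant)

(3) If you run out of characters to test, the shorter string is less than the longer string: "foo" < "food"

Typically, in a computer language the "notion of character ordering" is rather primitive: each character has a human-language-independent number ord(character), and characters are compared and sorted using that number. Often that ordering is not appropriate for the human language of the user, and then you need to get into "collating", a fun topic.

See also "How do I sort unicode strings alphabetically in Python?", where the discussion is about the sorting rules given by the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/).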
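To illustrate why raw code-point order is often not what a reader expects, here is a rough sketch. The key function strip_accents_key is a hypothetical helper written for this example; it is a crude approximation and not an implementation of the Unicode Collation Algorithm. For real collation, one option is the standard library's locale.strxfrm, whose results depend on the locales installed on the machine.

```python
import unicodedata

def strip_accents_key(s):
    """Crude sort key: drop combining marks, then case-fold.
    This is NOT the Unicode Collation Algorithm, only an approximation."""
    decomposed = unicodedata.normalize('NFD', s)
    base = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

words = ['zebra', 'école', 'apple']

# ord('é') is 233, which is larger than ord('z') == 122,
# so the plain sort puts 'école' after 'zebra'.
print(sorted(words))                         # ['apple', 'zebra', 'école']
print(sorted(words, key=strip_accents_key))  # ['apple', 'école', 'zebra']
```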

Replying to a comment:

What? How else can ordering be defined other than left-to-right?

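To S.Lott: there is a famous counterexample when sorting French. It involves accents: indeed, one could say that in French, letters are sorted left-to-right and accents right-to-left. Here is the counterexample: we have e < é and o < ô, so you would expect the words cote, coté, côte, côté to be sorted as cote < coté < côte < côté. That is not what happens: traditional French ordering gives cote < côte < coté < côté, because the accent nearest the end of the word is considered first.

A final remark: you should not talk about left-to-right and right-to-left sorting, but rather about forward and backward sorting.

Indeed, some languages are written from right to left, and if you think Arabic and Hebrew are sorted right-to-left, you may be right from a graphical point of view, but you are wrong on the logical level!

Indeed, Unicode considers character strings to be encoded in logical order, and writing direction is a phenomenon that occurs at the glyph level. In other words, even if in the word שלום the letter shin appears to the right of the lamed, logically it occurs before it. To sort this word, one first considers the shin, then the lamed, then the vav, then the mem, and this is forward ordering (although Hebrew is written right-to-left), while French accents are sorted backward (although French is written left-to-right).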
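The backward accent rule can be sketched in a few lines. This is a toy key written for this example only (the name backward_accent_key is hypothetical), not a full French collation: it compares base letters forward and uses the accents, read from the end of the word, only as a tie-breaker.

```python
import unicodedata

def backward_accent_key(word):
    """Toy key: compare base letters forward, then accents backward."""
    bases, marks = [], []
    for ch in unicodedata.normalize('NFC', word):
        decomposed = unicodedata.normalize('NFD', ch)
        bases.append(decomposed[0])   # the letter without its accent
        marks.append(decomposed[1:])  # '' when the letter has no accent
    return (bases, marks[::-1])       # accents are compared from the end

words = ['côté', 'coté', 'côte', 'cote']

# Plain code-point order is purely forward.
print(sorted(words))                           # ['cote', 'coté', 'côte', 'côté']
# With the toy key, the last accent is looked at first.
print(sorted(words, key=backward_accent_key))  # ['cote', 'côte', 'coté', 'côté']
```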

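This is lexicographical ordering. It just puts things in dictionary order.

Related discussion

A pure Python equivalent of string comparison would be: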

def less(string1, string2):
    # Compare character by character
    for idx in range(min(len(string1), len(string2))):
        # Get the "value" of the character
        ordinal1, ordinal2 = ord(string1[idx]), ord(string2[idx])
        # If the "value" is identical check the next characters
        if ordinal1 == ordinal2:
            continue
        # If it's smaller we're finished and can return True
        elif ordinal1 < ordinal2:
            return True
        # If it's bigger we're finished and return False
        else:
            return False
    # We're out of characters and all were equal, so the result depends on the length
    # of the strings.
    return len(string1) < len(string2)

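This function does the same as the real method (Python 3.6 and Python 2.7), just a lot slower. Also note that the implementation isn't exactly "pythonic" and only works for < comparisons. It is just meant to illustrate how it works. I haven't checked whether it behaves like Python's comparison for combining Unicode characters.

A more general variant would be: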
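A compact way to check the same idea in Python 3: lists also compare lexicographically, so comparing two strings should agree with comparing their lists of code points. The helper code_points below is introduced just for this sketch. The last lines address combining characters: Python 3 compares code points without any normalization, so a precomposed character and its combining-sequence form are different strings.

```python
import random

def code_points(s):
    """Return the list of Unicode code points of s."""
    return [ord(ch) for ch in s]

random.seed(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    s1 = ''.join(random.choices('abAB', k=random.randint(0, 4)))
    s2 = ''.join(random.choices('abAB', k=random.randint(0, 4)))
    # Lists compare lexicographically too, so both views should agree.
    assert (s1 < s2) == (code_points(s1) < code_points(s2))
    assert (s1 == s2) == (code_points(s1) == code_points(s2))

# No normalization happens: precomposed and combining forms differ.
print('\u00e9' == 'e\u0301')  # False
print('\u00e9' > 'e\u0301')   # True, because 233 > 101
```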

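from operator import lt, gt

def compare(string1, string2, less=True):
    # Pick the operator once: lt for "<", gt for ">"
    op = lt if less else gt
    for char1, char2 in zip(string1, string2):
        ordinal1, ordinal2 = ord(char1), ord(char2)
        # Equal characters decide nothing, move on to the next pair
        if ordinal1 == ordinal2:
            continue
        elif op(ordinal1, ordinal2):
            return True
        else:
            return False
    # All shared positions were equal, so the lengths decide
    return op(len(string1), len(string2))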

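Related discussion

Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.

Here is some sample code that compares two strings lexicographically.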

a = str(input())
b = str(input())
if 1<=len(a)<=100 and 1<=len(b)<=100:
    a = a.lower()
    b = b.lower()
    if a > b:
        print('1')
    elif a < b:
        print('-1')
    elif a == b:
        print('0')

For different inputs, the outputs are:

1- abcdefg
abcdeff
1

2- abc
Abc
0

3- abs
AbZ
-1
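The same three-way result can be wrapped in a function so it is easy to reuse and test without typing input. This is only a sketch: the name compare_ignoring_case is made up here, and the length check from the script above is left out.

```python
def compare_ignoring_case(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if equal, ignoring case."""
    a, b = a.lower(), b.lower()
    # True and False behave as 1 and 0, so this yields 1, -1 or 0.
    return (a > b) - (a < b)

print(compare_ignoring_case('abcdefg', 'abcdeff'))  # 1
print(compare_ignoring_case('abc', 'Abc'))          # 0
print(compare_ignoring_case('abs', 'AbZ'))          # -1
```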