Valid characters in a Python class name
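I'm creating Python classes dynamically, and I know that not all characters are valid in this context.
Is there a method somewhere in the class library that I can use to sanitize an arbitrary text string so that I can use it as a class name? Either that or a list of the allowed characters would be a great help.
Addition regarding clashes with identifier names: as @Ignacio pointed out in the answer below, any character that is valid in an identifier is valid in a class name. You can even use a reserved word as a class name without any trouble. But there's a catch. If you do use a reserved word, you won't be able to make the class accessible the way other (non-dynamically created) classes are (e.g., by doing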
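To illustrate the catch with reserved words, here is a minimal sketch. It assumes the class is built with the three-argument form of the built-in `type()`; the names `Dynamic` and `namespace` are made up for this example and do not come from the question.

```python
import keyword

# Build a class whose __name__ is the reserved word "class".
# type() does not check the name against the identifier grammar.
Dynamic = type("class", (object,), {})

print(Dynamic.__name__)                     # the name is stored as given
print(keyword.iskeyword(Dynamic.__name__))  # it is a reserved word

# The class can be stored under that name in a mapping...
namespace = {Dynamic.__name__: Dynamic}
instance = namespace["class"]()

# ...but source code cannot refer to it by that bare name,
# because the parser treats `class` as a keyword.
try:
    compile("class()", "<example>", "eval")
    parses = True
except SyntaxError:
    parses = False
print(parses)
```

So the class itself is perfectly usable; it just has to be reached through a mapping or a differently named variable.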
Python Language Reference, §2.3, "Identifiers and keywords"
Identifiers (also referred to as names) are described by the following lexical definitions:
    identifier ::= (letter|"_") (letter | digit | "_")*
    letter     ::= lowercase | uppercase
    lowercase  ::= "a"..."z"
    uppercase  ::= "A"..."Z"
    digit      ::= "0"..."9"

Identifiers are unlimited in length. Case is significant.
As per the Python Language Reference, §2.3, "Identifiers and keywords", a valid Python identifier is defined as:

    (letter|"_") (letter | digit | "_")*

Or, as a regex:

    [a-zA-Z_][a-zA-Z0-9_]*
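As a rough sketch, the regex above can be applied with the standard `re` module. The helper name `is_ascii_identifier` is an assumption made for illustration, and the pattern only covers the ASCII grammar quoted above.

```python
import re

# The pattern from the grammar: a letter or underscore first,
# then any number of letters, digits, or underscores.
_IDENTIFIER_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_ascii_identifier(text):
    """Return True if the whole string matches the ASCII identifier grammar."""
    return _IDENTIFIER_RE.fullmatch(text) is not None

print(is_ascii_identifier("MyClass1"))
print(is_ascii_identifier("1MyClass"))
```

Using `fullmatch` rather than `match` matters here: `match` would accept a string such as `"abc-def"` because its prefix is a valid identifier.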
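The interesting thing is that the first character of an identifier is special. After the first character, the digits "0" through "9" are valid in an identifier, but they cannot be the first character.
Here is a function that returns a valid identifier given an arbitrary string of characters. It works like this:
First, we use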
    def gen_valid_identifier(seq):
        # get an iterator
        itr = iter(seq)
        # pull characters until we get a legal one for first in identifier
        for ch in itr:
            if ch == '_' or ch.isalpha():
                yield ch
                break
        # pull remaining characters and yield legal ones for identifier
        for ch in itr:
            if ch == '_' or ch.isalpha() or ch.isdigit():
                yield ch

    def sanitize_identifier(name):
        return ''.join(gen_valid_identifier(name))
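This is a clean and Pythonic way to handle a sequence in two different ways. For a problem this simple, we could instead use a Boolean variable that records whether we have seen the first character yet: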
    def gen_valid_identifier(seq):
        saw_first_char = False
        for ch in seq:
            if not saw_first_char and (ch == '_' or ch.isalpha()):
                saw_first_char = True
                yield ch
            elif saw_first_char and (ch == '_' or ch.isalpha() or ch.isdigit()):
                yield ch
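I don't like this version nearly as much as the first one. The special handling for one character is now tangled up in the whole flow of control, and it will be slower than the first version because it has to keep checking
Looping over an explicit iterator is just as fast as letting Python implicitly get an iterator for you, and the explicit iterator lets us split the work into separate loops that handle the different rules for the different parts of the identifier. So the explicit iterator gives us cleaner code that also runs faster. Win/win.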
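The two-loop version works because both loops consume the same iterator, so the second loop resumes exactly where the first one stopped. A small sketch of just that mechanism follows; the sample string and the variable names are made up for illustration.

```python
# One explicit iterator shared by two loops.
itr = iter("12ab3")

# First loop: skip characters until a letter turns up, then stop.
first = None
for ch in itr:
    if ch.isalpha():
        first = ch
        break

# Second pass: the iterator remembers its position, so only the
# characters after the first letter are left.
rest = list(itr)

print(first, rest)
```

The characters pulled by the first loop ("1", "2", and "a") are gone by the time the second pass runs, which is what lets each loop apply its own rule.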
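This is an old question by now, but I'd like to add an answer on how to do this in Python 3.
The allowed characters are documented here: https://docs.python.org/3/reference/lexical_analysis.html#identifiers. They include quite a lot of special characters, including punctuation, the underscore, and a whole range of non-ASCII characters. Luckily,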
    import unicodedata

    def is_valid_name(name):
        if not name:
            return False  # An empty string is not a valid identifier.
        if not _is_id_start(name[0]):
            return False
        for character in name[1:]:
            if not _is_id_continue(character):
                return False
        return True  # All characters are allowed.

    _allowed_id_continue_categories = {"Ll","Lm","Lo","Lt","Lu","Mc","Mn","Nd","Nl","Pc"}
    _allowed_id_continue_characters = {"_","\u00B7","\u0387","\u1369","\u136A","\u136B","\u136C","\u136D","\u136E","\u136F","\u1370","\u1371","\u19DA","\u2118","\u212E","\u309B","\u309C"}
    _allowed_id_start_categories = {"Ll","Lm","Lo","Lt","Lu","Nl"}
    _allowed_id_start_characters = {"_","\u2118","\u212E","\u309B","\u309C"}

    def _is_id_start(character):
        # Check the character itself, then its NFKC-normalized form.
        normalized = unicodedata.normalize("NFKC", character)
        return (unicodedata.category(character) in _allowed_id_start_categories
                or character in _allowed_id_start_characters
                or unicodedata.category(normalized) in _allowed_id_start_categories
                or normalized in _allowed_id_start_characters)

    def _is_id_continue(character):
        # Same checks as above, using the wider "continue" sets.
        normalized = unicodedata.normalize("NFKC", character)
        return (unicodedata.category(character) in _allowed_id_continue_categories
                or character in _allowed_id_continue_characters
                or unicodedata.category(normalized) in _allowed_id_continue_categories
                or normalized in _allowed_id_continue_characters)
This code is adapted from the following, which is under CC0: https://github.com/ghostkeeper/luna/blob/d69624cd0dd568aec2139054fae4d45b634da7e/plugins/data/enumerated/enumerated_type.py#L91. It has been well tested.
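For what it's worth, Python 3 also ships `str.isidentifier()` and `keyword.iskeyword()`, which can serve as a quick cross-check. The sketch below is not part of the adapted code; `is_usable_class_name` is a hypothetical helper name, and rejecting keywords is a design choice made here so that the class can later be referenced by name.

```python
import keyword

def is_usable_class_name(name):
    """Return True if name is a valid identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_usable_class_name("Foo"))
print(is_usable_class_name("2foo"))
print(is_usable_class_name("class"))
```

`str.isidentifier()` on its own accepts reserved words such as `class`, which is why the keyword check is added on top.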