Python - Basic validation of international names?
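Given a name string, I want to validate a few basic conditions:
- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and are not emoji.
- The string does not contain digits and is shorter than 40 characters.
I know the latter can be done with a regex, but is there a Unicode way to do the first? Are there text-processing libraries I can leverage?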
You should be able to check this using the Unicode character classes in a regex.
```
[\p{P}\s\w]{40,}
```
The most important part here is the \w character class used in Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of invisible character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])
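To see what those classes select on, here is a minimal sketch using only the standard library's unicodedata module. The helper name general_category and the sample characters are illustrative assumptions, not part of the answer above. The first letter of each two-letter general category (L, N, P, Z, S) is what \p{L}, \p{N}, \p{P}, \p{Z} and \p{S} refer to; emoji typically fall under S (symbols), which is why a class built from \p{P}, \s and \w leaves them out.

```python
import unicodedata


def general_category(ch):
    """Return the two-letter Unicode general category of a single character."""
    return unicodedata.category(ch)


# Illustrative samples: Latin, accented Latin, Chinese, Arabic, a digit,
# an underscore, punctuation, a space, and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u674e", "\u0645", "7", "_", "!", " ", "\U0001F600"]

for ch in samples:
    # Letters report L*, digits N*, punctuation P*, separators Z*, symbols S*.
    print("U+{:04X} {}".format(ord(ch), general_category(ch)))
```

Letters from every script share the major class L, so a check on that first letter does not need a per-script list.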
You may want to add more classes similar to these.
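But to take advantage of this, you need to use the regex module (imported as "import regex as re" in the code below), which supports the \p{...} syntax; the standard re module does not.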
```python
# coding=utf8
# The tag above declares the encoding of this file; it is only needed for Python 2.x.
import regex as re  # third-party module (pip install regex), not the standard re

pattern = r"[\p{P}\s\w]{40,}"

# Sample text; the "????" runs are placeholders where the original post had emoji.
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! ???? "
            "Wow cool song! ????Wow cool song! ????Wow cool song! ???? ")

matches = re.finditer(pattern, test_str, re.UNICODE | re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at {start}-{end}: {match}".format(
        match_num=match_num, start=match.start(), end=match.end(), match=match.group()))

    # The pattern has no capture groups, so this loop only runs if you add some.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num} found at {start}-{end}: {group}".format(
            group_num=group_num, start=match.start(group_num),
            end=match.end(group_num), group=match.group(group_num)))
```
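If installing a third-party module is not an option, the conditions from the question can also be approximated with the standard library alone. The following is only a sketch under stated assumptions, not the approach from the answer above: it swaps the \p{...} classes for unicodedata.category, and the names is_valid_name, MAX_LEN and ALLOWED_PUNCT, as well as the choice of which punctuation to accept, are hypothetical.

```python
import unicodedata

MAX_LEN = 40  # the question asks for a length below 40
# Assumption: punctuation commonly seen in names; adjust to your own data.
ALLOWED_PUNCT = {"'", "-", ".", "\u2019"}


def is_valid_name(name):
    """Accept letters (any script), combining marks, spaces and a little punctuation."""
    # len() counts code points, not user-perceived characters.
    if not name or len(name) >= MAX_LEN:
        return False
    has_letter = False
    for ch in name:
        major = unicodedata.category(ch)[0]  # e.g. "L" for letters, "N" for numbers
        if major == "L":
            has_letter = True
        elif major == "M" or ch == " " or ch in ALLOWED_PUNCT:
            continue  # combining marks, plain spaces, whitelisted punctuation
        else:
            return False  # digits, emoji and other symbols, control characters, ...
    return has_letter  # reject strings made only of spaces or punctuation


for candidate in ["Jos\u00e9 Garc\u00eda", "\u674e\u5c0f\u9f99", "John3", "Wow \U0001F600"]:
    print(repr(candidate), is_valid_name(candidate))
```

Combining marks (category M) are allowed because many scripts write vowels and accents as separate code points. Format characters (category Cf) are rejected here, which may be too strict for scripts that rely on them, so treat the whitelist as a starting point.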
PS: .NET regex gives you more options, such as \p{IsGreek}.