Python: Replace non-ASCII characters with a single space

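I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters: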

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

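And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's UTF-8 encoding (i.e. the – character is replaced with 3 spaces):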

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

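How can I replace all non-ASCII characters with a single space?

Of the myriad similar questions, none addresses character replacement (as opposed to stripping), and none addresses all non-ASCII characters rather than one specific character.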

Related discussion

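Your ''.join() expression is filtering, removing anything that is not ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and still uses one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a single space:

re.sub(r'[^\x00-\x7F]+', ' ', text)

Note the + there.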
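The difference between the two variants is easiest to see side by side. A minimal sketch, where the function names `replace_each` and `replace_runs` are made up for this illustration:

```python
import re

def replace_each(text):
    """One space per non-ASCII character."""
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    """One space per run of consecutive non-ASCII characters."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC\u9a6c\u514bdef'  # 'ABC' + two CJK characters + 'def'
print(repr(replace_each(sample)))  # 'ABC  def' (two spaces)
print(repr(replace_runs(sample)))  # 'ABC def' (one space)
```

Which one is wanted depends on whether the output has to keep the same length as the input.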


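To get the closest ASCII representation of your original string, I recommend the unidecode module:

from unidecode import unidecode

def remove_non_ascii(text):
    # Python 3: str is already Unicode; decode only if given UTF-8 bytes.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return unidecode(text)

Then you can use it on a string:

remove_non_ascii("Ceñía")
Cenia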
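Where installing a third-party package is not an option, a rough standard-library approximation is possible for accented Latin letters. This is a different technique from unidecode, not a replacement for it: the helper name `strip_accents_ascii` is made up here, and characters that have no ASCII base letter are simply dropped rather than transliterated.

```python
import unicodedata

def strip_accents_ascii(text):
    """Approximate ASCII transliteration using only the standard library.

    Hypothetical helper, not part of unidecode: it decomposes characters
    (NFKD) and drops whatever still cannot be encoded as ASCII.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents_ascii('Ce\u00f1\u00eda'))  # Cenia
```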


For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note that you will still have a problem if your string contains decomposed Unicode characters (a base character followed by a separate combining accent mark, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n   # 'n' followed by U+0303 COMBINING TILDE
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
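One possible way around this is to compose the string (NFC) before replacing, so that a base letter and its combining mark are treated as one code point wherever a precomposed form exists. A minimal sketch; the function name `replace_non_ascii_nfc` is made up for illustration:

```python
import re
import unicodedata

def replace_non_ascii_nfc(text):
    # Compose base letter + combining mark into one code point where possible.
    composed = unicodedata.normalize('NFC', text)
    # Then replace each remaining non-ASCII code point with one space.
    return re.sub(r'[^\x00-\x7f]', ' ', composed)

decomposed = 'man\u0303ana'  # 'n' followed by U+0303 COMBINING TILDE
print(repr(replace_non_ascii_nfc(decomposed)))  # 'ma ana'
```

Sequences that have no precomposed form are not helped by this: the base letter stays and only the combining mark is replaced.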


If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""

import re
from timeit import timeit

# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
# Any non-ASCII character works here; 'ø' is a stand-in.
text = 'ø' * 10_000

print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
""",
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
""",
    number=1000,
    globals=globals(),
))

Results (seconds, in the order printed: list comprehension first, then encode/decode):

0.7208260721400134
0.009975979187503592
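If a space is required after all, the same encode/decode route can still be used by registering a custom error handler with `codecs.register_error`. The following is only a sketch: the handler name 'space_replace' and the function names are made up here, and its performance has not been measured against the numbers above.

```python
import codecs

def _space_handler(exc):
    """Error handler: one space for each character that cannot be encoded."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.start:exc.end is the slice of the input that failed to encode.
        return (' ' * (exc.end - exc.start), exc.end)
    raise exc

# 'space_replace' is a hypothetical name chosen for this example.
codecs.register_error('space_replace', _space_handler)

def replace_non_ascii(text):
    return text.encode('ascii', errors='space_replace').decode('ascii')

print(repr(replace_non_ascii('ABC\u9a6c\u514bdef')))  # 'ABC  def'
```

Unlike replacing with '?' and then substituting, this leaves question marks that were already in the text untouched.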


What about this one?

def replace_trash(unicode_string):
    chars = list(unicode_string)
    for i, ch in enumerate(chars):
        try:
            ch.encode("ascii")
        except UnicodeEncodeError:
            # Not ASCII: replace it with a single space
            chars[i] = " "
    return "".join(chars)


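As a native and efficient approach, you don't need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ASCII characters:

new_string = old_string.encode('ascii', errors='ignore')

Now, if you want to make up for the deleted characters with spaces, do the following:

final_string = new_string + b' ' * (len(old_string) - len(new_string))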
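Note that the padding above is appended to the end of the byte string: the spaces do not land where the removed characters were, and the result is bytes rather than str. A small sketch that wraps the two lines in a function (the name `strip_and_pad` is made up) makes this visible:

```python
def strip_and_pad(old_string):
    # Drop every character that cannot be encoded as ASCII.
    new_string = old_string.encode('ascii', errors='ignore')
    # Pad at the END with one space per dropped character.
    return new_string + b' ' * (len(old_string) - len(new_string))

print(strip_and_pad('a\u9a6cb'))  # b'ab '
```

So this keeps the length of the input but not the positions of the replaced characters.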


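Potentially for a different question, but I'm providing my version of @alvero's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. remove whitespace characters from the beginning and end of the string, and then replace any other whitespace characters with a "regular" space, i.e. go from (each ? marks a non-ASCII blank character)

"Ceñía?mañana????"

to

"Ceñía mañana"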

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

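First we replace every character for which unidecode returns an empty string with a regular space (and join the result back together),

''.join((c if unidecode(c) else ' ') for c in s)

Then we split that again, with Python's normal split, and strip each "bit",

(bit.strip() for bit in s.split())

And lastly we join those back together, but only the bits that pass the if test (i.e. are not empty),

' '.join(stripped for stripped in s if stripped)

With that, safely_stripped('????Ceñía?mañana????') correctly returns 'Ceñía mañana' (each ? marks a non-ASCII blank character).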
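The three steps can also be written out one after another. The sketch below uses only the standard library: instead of unidecode it takes a hypothetical `is_blank` predicate (defaulting to `str.isspace`), so it is not equivalent to the function above for characters that unidecode maps to an empty string but that Python does not regard as whitespace.

```python
def collapse_blanks(s, is_blank=str.isspace):
    # Step 1: turn every "blank" character into a regular space.
    spaced = ''.join(' ' if is_blank(c) else c for c in s)
    # Step 2: split on whitespace; split() with no arguments already
    # strips each piece and drops empty ones.
    bits = spaced.split()
    # Step 3: join the pieces with single regular spaces.
    return ' '.join(bits)

sample = '\u2003\u2003Ce\u00f1\u00eda\u2003ma\u00f1ana\u2003\u2003'  # EM SPACEs
print(collapse_blanks(sample))  # Ceñía mañana
```

A custom predicate can be passed for characters that should count as blank, for example `collapse_blanks(text, is_blank=lambda c: c == '\u3164' or c.isspace())`.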