Speed up millions of regex replacements in Python 3
I'm using Python 3.5.2.

I have two lists:

- a list of about 750,000 "sentences" (long strings)
- a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So I have to loop through 750,000 sentences and perform about 20,000 replacements, but only if my words are actually "words" and not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the `\b` word-boundary metacharacter:

```python
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
```
Then I loop through my "sentences":

```python
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
```
This nested loop processes about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

Is there a way to use the `str.replace` method (which I believe is faster), but still require that replacements only happen at word boundaries? Alternatively, is there a way to speed up the `re.sub` method? I have already improved the speed marginally by skipping `re.sub` if the length of my word is greater than the length of my sentence, but it's not much of an improvement.

Thank you for any suggestions.
One thing you can try is to compile one single pattern, like `"\b(word1|word2|word3)\b"`.

Because `re` relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are not regexes, Eric's answer is faster.
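For illustration, here is a minimal sketch of that idea; `my20000words` and `sentences` are tiny stand-ins for the OP's data, and `re.escape` is used in case a word contains regex metacharacters:

```python
import re

my20000words = ["hello", "world"]                      # stand-in for the 20,000 banned words
sentences = ["hello there world", "helloworld stays"]  # stand-in for the 750,000 sentences

# One compiled pattern of the form \b(word1|word2|...)\b
union_pattern = re.compile(r'\b(' + '|'.join(map(re.escape, my20000words)) + r')\b')

cleaned = [union_pattern.sub("", sentence) for sentence in sentences]
print(cleaned)  # [' there ', 'helloworld stays'] -- "helloworld" is untouched thanks to \b
```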
TLDR
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check whether another word is included in that set.

Pack that logic into a function, give this function as an argument to `re.sub`, and you're done!
```python
import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word


sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
```
The converted sentences are:

```
' . !
 .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe
```
Notes:

- The search is case-insensitive (thanks to `lower()`).
- Replacing a word with `""` might leave two spaces (as in your code).
- With python3, `\w+` also matches accented characters (e.g. `"ångström"`).
- Any non-word character (tab, space, newline, marks, ...) will stay untouched.
Performance
With a million sentences and `banned_words` containing almost 100,000 words, the script runs in less than 7 seconds.

In comparison, Liteye's answer needed 160 seconds for 10 thousand sentences.

With `n` being the total number of words and `m` the number of banned words, the OP's and Liteye's code are `O(n*m)`.

In comparison, my code should run in `O(n+m)`. Considering that there are many more sentences than banned words, this becomes `O(n)`.
Regex union test

What's the complexity of a regex search with a `\b(word1|word2|...|wordN)\b` pattern? Is it `O(N)` or `O(1)`?

It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

This code extracts `10**i` random English words into a list. It creates the corresponding regex union and tests it with different words:
- one is clearly not a word (it begins with `#`)
- one is the first word in the list
- one is the last word in the list
- one looks like a word but isn't
```python
import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun


for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))
```
It outputs:
```
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms
```
So it looks like a search for a single word with a `\b(word1|word2|...|wordN)\b` pattern has:

- `O(1)` best case
- `O(n/2)` average case, which is still `O(n)`
- `O(n)` worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union, though, is to create the regex pattern from a trie.
TLDR
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
Optimizing the regex with a trie

The simple regex-union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and matching.
Example

```python
['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
```

This list is converted to a trie:

```python
{'f': {'o': {'o': {'x': {'a': {'r': {'': 1}}}, 'b': {'a': {'r': {'': 1}, 'h': {'': 1}}}, 'z': {'a': {'': 1, 'p': {'': 1}}}}}}}
```

And then to this regex pattern:

```python
r"\bfoo(?:ba[hr]|xar|zap?)\b"
```
The huge advantage is that, to test whether `zoo` matches, the regex engine only needs to compare the first character (it doesn't match) instead of trying all 5 words. It's preprocessing overkill for 5 words, but it shows promising results for many thousand words.

Note that `(?:)` non-capturing groups are used because:

- `foobar|baz` would match `foobar` or `baz`, but not `foobaz`.
- `foo(bar|baz)` would save unneeded information to a capturing group.
Code

Here's a slightly modified gist, which we can use as a `trie.py` library:
```python
import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to
    a regex pattern. The corresponding regex should match much faster than a simple regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
```
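As a quick sanity check (a sketch, assuming the class above is saved as `trie.py`), it reproduces the pattern from the example in the previous section:

```python
import re
from trie import Trie  # the module defined above

trie = Trie()
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
    trie.add(word)

print(trie.pattern())
# foo(?:ba[hr]|xar|zap?)

pattern = re.compile(r"\b" + trie.pattern() + r"\b")
print(bool(pattern.search("a foozap here")))  # True
print(bool(pattern.search("zoo")))            # False
```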
Test

Here's a small test (the same as the one above):
```python
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]


def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)


def find(word):
    def fun():
        return union.match(word)
    return fun


for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
```
It outputs:
```
TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms
```
For info, the regex begins like this:
(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...
It's really unreadable, but for a list of 100,000 banned words this trie regex is 1000 times faster than a simple regex union!

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:
One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically, turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence you just have to step through each of the words and check whether it's a match.

Currently, the regex search has to go over the whole string again each time, looking for word boundaries and then "discarding" the result of this work before the next pass.
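A minimal sketch of that idea, with hypothetical stand-in data: split each sentence once on word boundaries, check every token against a set, and glue the kept tokens back together:

```python
import re

banned_words = {"hello", "world"}                      # stand-in for the 20,000 words
sentences = ["hello there world", "helloworld stays"]  # stand-in for the 750,000 sentences

splitter = re.compile(r'(\W+)')  # the capturing group keeps the separators

cleaned = []
for sentence in sentences:
    tokens = splitter.split(sentence)
    cleaned.append(''.join(t for t in tokens if t.lower() not in banned_words))

print(cleaned)  # [' there ', 'helloworld stays']
```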
Well, here's a quick and easy solution, with a test set.

Winning strategy:

`re.sub(r"\w+", repl, sentence)` searches for the words.

`repl` can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search for and replace.

This is the simplest and fastest solution (see the function `replace4` in the example code below).

Second best:

The idea is to split the sentences into words using `re.split`, while conserving the separators so that the sentences can be reconstructed later. Then the replacements are done with a simple dict lookup.

(See the function `replace3` in the example code below.)
Timings for the example functions:
```
replace1:     0.62 sentences/s
replace2:     7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)
```
...and the code:
```python
#! /bin/env python3
# -*- coding: utf-8

import time, random, re


def replace1(sentences):
    for n, sentence in enumerate(sentences):
        for search, repl in patterns:
            sentence = re.sub("\\b" + search + "\\b", repl, sentence)


def replace2(sentences):
    for n, sentence in enumerate(sentences):
        for search, repl in patterns_comp:
            sentence = re.sub(search, repl, sentence)


def replace3(sentences):
    pd = patterns_dict.get
    for n, sentence in enumerate(sentences):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join(pd(w, w) for w in words)

        #~ print( n, sentence )


def replace4(sentences):
    pd = patterns_dict.get

    def repl(m):
        w = m.group()
        return pd(w, w)

    for n, sentence in enumerate(sentences):
        sentence = re.sub(r"\w+", repl, sentence)


# Build test set
test_words = [("word%d" % _) for _ in range(50000)]
test_sentences = [" ".join(random.sample(test_words, 10)) for _ in range(1000)]

# Create search and replace patterns
patterns = [(("word%d" % _), ("repl%d" % _)) for _ in range(20000)]
patterns_dict = dict(patterns)
patterns_comp = [(re.compile("\\b" + search + "\\b"), repl) for search, repl in patterns]


def test(func, num):
    t = time.time()
    func(test_sentences[:num])
    print("%30s: %.02f sentences/s" % (func.__name__, num / (time.time() - t)))


print("Sentences", len(test_sentences))
print("Words    ", len(test_words))

test(replace1, 1)
test(replace2, 10)
test(replace3, 1000)
test(replace4, 1000)
```
Maybe Python is not the right tool here. Here is one with the Unix toolchain:
```
sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
```
This assumes your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence to one word per line, mass-delete the blacklisted words from the file, and merge the lines back.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line):
```
sed 's/.*/\\b&\\b/' words > blacklist
```
How about this?
```python
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io


def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter)  # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary]  # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)
```
These solutions split on word boundaries and look up each word in a set. They should be faster than `re.sub` with word alternations (Liteye's solution) because they are `O(n)`, where n is the size of the input, thanks to the amortized `O(1)` set lookup, whereas using regex alternations makes the regex engine check for word matches at every character rather than only at word boundaries.

I tested on corpus.txt, a concatenation of multiple eBooks downloaded from Project Gutenberg, with banned_words.txt containing 20,000 words randomly picked from Ubuntu's word list (/usr/share/dict/american-english). It takes around 30 seconds to process 862,462 sentences (and half of that on PyPy). I have defined sentences as anything separated by ". ".
```
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
```
PyPy particularly benefits from the second approach, while CPython fares better with the first. The code above should work on both Python 2 and 3.
Practical approach

The solution described below uses a lot of memory to store all the text in the same string, lowering the complexity level. If RAM is an issue, think twice before using it.

With `join`/`split` tricks you can avoid the explicit loops altogether, which should speed up the algorithm.
Concatenate the sentences with a special delimiter that does not appear in any of the sentences:

```python
merged_sentences = ' * '.join(sentences)
```

Compile a single regex for all the words you need to remove from the sentences, using the `|` "or" alternation:

```python
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)  # re.I is a case insensitive flag
```

Substitute the words with the compiled regex and split the text back into separate sentences on the special delimiter:

```python
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
```
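Putting the three steps together into a runnable sketch, with hypothetical sample data (note that `' * '` must not occur inside any sentence, and `re.escape` is added here in case a word contains regex metacharacters):

```python
import re

words = ["hello", "world"]                             # stand-in for the banned words
sentences = ["Hello there world", "helloworld stays"]  # stand-in for the sentences

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, words))), re.I)
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

print(clean_sentences)  # [' there ', 'helloworld stays']
```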
Performance
The complexity of `join` is O(n). This is pretty intuitive, but here is a shortened quotation from the CPython source anyway:

```c
for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);
```
Therefore, with `join`/`split` you have O(words) + 2*O(sentences), which is still linear complexity, versus the 2*O(N²) of the initial approach.

By the way, don't use multithreading. The GIL will block each operation: since your task is strictly CPU-bound, the GIL has no chance to be released, and each thread will just tick concurrently, causing extra work and even driving the operation toward infinity.
Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all of your "bad" words. Traverse the file, replacing each bad word, updating the offsets of the found words that follow, etc.
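A sketch of that approach, assuming the third-party `pyahocorasick` package (imported as `ahocorasick`); since Aho-Corasick matches substrings, a manual word-boundary check is added so that words inside larger strings are left alone. All data names here are hypothetical placeholders:

```python
import ahocorasick

banned_words = ["hello", "world"]                      # stand-in for the 20,000 words
sentences = ["hello there world", "helloworld stays"]  # stand-in for the 750,000 sentences
document = "\n".join(sentences)                        # one big document

automaton = ahocorasick.Automaton()
for word in banned_words:
    automaton.add_word(word, word)
automaton.make_automaton()

# Collect the spans of matches that sit on word boundaries.
spans = []
for end, word in automaton.iter(document):
    start = end - len(word) + 1
    before_ok = start == 0 or not document[start - 1].isalnum()
    after_ok = end == len(document) - 1 or not document[end + 1].isalnum()
    if before_ok and after_ok:
        spans.append((start, end + 1))

# Rebuild the document without the matched spans.
pieces, last = [], 0
for start, stop in spans:
    pieces.append(document[last:start])
    last = stop
pieces.append(document[last:])
print("".join(pieces).split("\n"))  # [' there ', 'helloworld stays']
```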