Speed up millions of regex replacements in Python 3
I'm using Python 3.5.2.

I have two lists:

- a list of about 750,000 "sentences" (long strings)
- a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So I have to loop through 750,000 sentences and perform about 20,000 replacements, but only if my words are actually "words" and not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the `\b` word-boundary metacharacter:

```python
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
```
Then I loop through my "sentences":

```python
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
```
This nested loop processes about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

Is there a way to use the `str.replace` method (which I believe is faster), but still require that replacements only happen at word boundaries? Alternatively, is there a way to speed up the `re.sub` method? I have already improved the speed marginally by skipping `re.sub` if the length of my word is greater than the length of my sentence, but it's not much of an improvement.

Thank you for any suggestions.
One thing you can try is to compile one single pattern, like `"\b(word1|word2|word3)\b"`.

Because `re` relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are not regexes, Eric's answer is faster.
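For illustration, here is a minimal sketch of that idea; `my20000words` and `sentences` are tiny stand-ins for the OP's data, and `re.escape` is used in case a word contains regex metacharacters:

```python
import re

my20000words = ["hello", "world"]                      # stand-in for the 20,000 banned words
sentences = ["hello there world", "helloworld stays"]  # stand-in for the 750,000 sentences

# One compiled pattern of the form \b(word1|word2|...)\b
union_pattern = re.compile(r'\b(' + '|'.join(map(re.escape, my20000words)) + r')\b')

cleaned = [union_pattern.sub("", sentence) for sentence in sentences]
print(cleaned)  # [' there ', 'helloworld stays'] -- "helloworld" is untouched thanks to \b
```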
TLDR
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check whether another word is included in that set.

Pack that logic into a function, give this function as an argument to `re.sub`, and you're done!
```python
import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word


sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
```
The converted sentences are:

```
' . !
 .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe
```
Notes:

- The search is case-insensitive (thanks to `lower()`).
- Replacing a word with `""` might leave two spaces (as in your code).
- With python3, `\w+` also matches accented characters (e.g. `"ångström"`).
- Any non-word character (tab, space, newline, marks, ...) will stay untouched.
Performance
With a million sentences and `banned_words` containing almost 100,000 words, the script runs in less than 7 seconds.

In comparison, Liteye's answer needed 160 seconds for 10 thousand sentences.

With `n` being the total number of words and `m` the number of banned words, the OP's and Liteye's code are `O(n*m)`.

In comparison, my code should run in `O(n+m)`. Considering that there are many more sentences than banned words, this becomes `O(n)`.
Regex union test

What's the complexity of a regex search with a `\b(word1|word2|...|wordN)\b` pattern? Is it `O(N)` or `O(1)`?

It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

This code extracts `10**i` random English words into a list. It creates the corresponding regex union and tests it with different words:
- one is clearly not a word (it begins with `#`)
- one is the first word in the list
- one is the last word in the list
- one looks like a word but isn't
```python
import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun


for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))
```
It outputs:
```
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms
```
So it looks like a search for a single word with a `\b(word1|word2|...|wordN)\b` pattern has:

- `O(1)` best case
- `O(n/2)` average case, which is still `O(n)`
- `O(n)` worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union, though, is to create the regex pattern from a trie.
TLDR
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
Optimizing the regex with a trie

The simple regex-union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and matching.
Example

```python
['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
```

This list is converted to a trie:

```python
{'f': {'o': {'o': {'x': {'a': {'r': {'': 1}}}, 'b': {'a': {'r': {'': 1}, 'h': {'': 1}}}, 'z': {'a': {'': 1, 'p': {'': 1}}}}}}}
```

And then to this regex pattern:

```python
r"\bfoo(?:ba[hr]|xar|zap?)\b"
```
The huge advantage is that, to test whether `zoo` matches, the regex engine only needs to compare the first character (it doesn't match) instead of trying all 5 words. It's preprocessing overkill for 5 words, but it shows promising results for many thousand words.

Note that `(?:)` non-capturing groups are used because:

- `foobar|baz` would match `foobar` or `baz`, but not `foobaz`.
- `foo(bar|baz)` would save unneeded information to a capturing group.
Code

Here's a slightly modified gist, which we can use as a `trie.py` library:
```python
import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to
    a regex pattern. The corresponding regex should match much faster than a simple regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
```
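As a quick sanity check (a sketch, assuming the class above is saved as `trie.py`), it reproduces the pattern from the example in the previous section:

```python
import re
from trie import Trie  # the module defined above

trie = Trie()
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
    trie.add(word)

print(trie.pattern())
# foo(?:ba[hr]|xar|zap?)

pattern = re.compile(r"\b" + trie.pattern() + r"\b")
print(bool(pattern.search("a foozap here")))  # True
print(bool(pattern.search("zoo")))            # False
```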
Test

Here's a small test (the same as the one above):
```python
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]


def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)


def find(word):
    def fun():
        return union.match(word)
    return fun


for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
```
It outputs:
```
TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms
```
For info, the regex begins like this:
(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...
It's really unreadable, but for a list of 100,000 banned words this trie regex is 1000 times faster than a simple regex union!

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:
One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically, turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence you just have to step through each of the words and check whether it's a match.

Currently, the regex search has to go over the whole string again each time, looking for word boundaries and then "discarding" the result of this work before the next pass.
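A minimal sketch of that idea, with hypothetical stand-in data: split each sentence once on word boundaries, check every token against a set, and glue the kept tokens back together:

```python
import re

banned_words = {"hello", "world"}                      # stand-in for the 20,000 words
sentences = ["hello there world", "helloworld stays"]  # stand-in for the 750,000 sentences

splitter = re.compile(r'(\W+)')  # the capturing group keeps the separators

cleaned = []
for sentence in sentences:
    tokens = splitter.split(sentence)
    cleaned.append(''.join(t for t in tokens if t.lower() not in banned_words))

print(cleaned)  # [' there ', 'helloworld stays']
```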
Well, here's a quick and easy solution, with a test set.

Winning strategy:

`re.sub(r"\w+", repl, sentence)` searches for the words.

`repl` can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search for and replace.

This is the simplest and fastest solution (see the function `replace4` in the example code below).

Second best:

The idea is to split the sentences into words using `re.split`, while conserving the separators so that the sentences can be reconstructed later. Then the replacements are done with a simple dict lookup.

(See the function `replace3` in the example code below.)
Timings for the example functions:
```
replace1:     0.62 sentences/s
replace2:     7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)
```
...and the code:
```python
#! /bin/env python3
# -*- coding: utf-8

import time, random, re


def replace1(sentences):
    for n, sentence in enumerate(sentences):
        for search, repl in patterns:
            sentence = re.sub("\\b" + search + "\\b", repl, sentence)


def replace2(sentences):
    for n, sentence in enumerate(sentences):
        for search, repl in patterns_comp:
            sentence = re.sub(search, repl, sentence)


def replace3(sentences):
    pd = patterns_dict.get
    for n, sentence in enumerate(sentences):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join(pd(w, w) for w in words)

        #~ print( n, sentence )


def replace4(sentences):
    pd = patterns_dict.get

    def repl(m):
        w = m.group()
        return pd(w, w)

    for n, sentence in enumerate(sentences):
        sentence = re.sub(r"\w+", repl, sentence)


# Build test set
test_words = [("word%d" % _) for _ in range(50000)]
test_sentences = [" ".join(random.sample(test_words, 10)) for _ in range(1000)]

# Create search and replace patterns
patterns = [(("word%d" % _), ("repl%d" % _)) for _ in range(20000)]
patterns_dict = dict(patterns)
patterns_comp = [(re.compile("\\b" + search + "\\b"), repl) for search, repl in patterns]


def test(func, num):
    t = time.time()
    func(test_sentences[:num])
    print("%30s: %.02f sentences/s" % (func.__name__, num / (time.time() - t)))


print("Sentences", len(test_sentences))
print("Words    ", len(test_words))

test(replace1, 1)
test(replace2, 10)
test(replace3, 1000)
test(replace4, 1000)
```
Maybe Python is not the right tool here. Here is one with the Unix toolchain:
```
sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
```
This assumes your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence to one word per line, mass-delete the blacklisted words from the file, and merge the lines back.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line):
```
sed 's/.*/\\b&\\b/' words > blacklist
```
How about this?
```python
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io


def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter)  # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary]  # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)
```
These solutions split on word boundaries and look up each word in a set. They should be faster than `re.sub` with word alternations (Liteye's solution) because they are `O(n)`, where n is the size of the input, thanks to the amortized `O(1)` set lookup, whereas using regex alternations makes the regex engine check for word matches at every character rather than only at word boundaries.

I tested on corpus.txt, a concatenation of multiple eBooks downloaded from Project Gutenberg, with banned_words.txt containing 20,000 words randomly picked from Ubuntu's word list (/usr/share/dict/american-english). It takes around 30 seconds to process 862,462 sentences (and half of that on PyPy). I have defined sentences as anything separated by ". ".
```
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
```
PyPy particularly benefits from the second approach, while CPython fares better with the first. The code above should work on both Python 2 and 3.
Practical approach

The solution described below uses a lot of memory to store all the text in the same string, lowering the complexity level. If RAM is an issue, think twice before using it.

With `join`/`split` tricks you can avoid the explicit loops altogether, which should speed up the algorithm.
Concatenate the sentences with a special delimiter that does not appear in any of the sentences:

```python
merged_sentences = ' * '.join(sentences)
```

Compile a single regex for all the words you need to remove from the sentences, using the `|` "or" alternation:

```python
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)  # re.I is a case insensitive flag
```

Substitute the words with the compiled regex and split the text back into separate sentences on the special delimiter:

```python
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
```
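Putting the three steps together into a runnable sketch, with hypothetical sample data (note that `' * '` must not occur inside any sentence, and `re.escape` is added here in case a word contains regex metacharacters):

```python
import re

words = ["hello", "world"]                             # stand-in for the banned words
sentences = ["Hello there world", "helloworld stays"]  # stand-in for the sentences

merged_sentences = ' * '.join(sentences)
regex = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, words))), re.I)
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

print(clean_sentences)  # [' there ', 'helloworld stays']
```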
Performance
The complexity of `join` is O(n). This is pretty intuitive, but here is a shortened quotation from the CPython source anyway:

```c
for (i = 0; i < seqlen; i++) {
    [...]
    sz += PyUnicode_GET_LENGTH(item);
```
Therefore, with `join`/`split` you have O(words) + 2*O(sentences), which is still linear complexity, versus the 2*O(N²) of the initial approach.

By the way, don't use multithreading. The GIL will block each operation: since your task is strictly CPU-bound, the GIL has no chance to be released, and each thread will just tick concurrently, causing extra work and even driving the operation toward infinity.
Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all of your "bad" words. Traverse the file, replacing each bad word, updating the offsets of the found words that follow, etc.
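A sketch of that approach, assuming the third-party `pyahocorasick` package (imported as `ahocorasick`); since Aho-Corasick matches substrings, a manual word-boundary check is added so that words inside larger strings are left alone. All data names here are hypothetical placeholders:

```python
import ahocorasick

banned_words = ["hello", "world"]                      # stand-in for the 20,000 words
sentences = ["hello there world", "helloworld stays"]  # stand-in for the 750,000 sentences
document = "\n".join(sentences)                        # one big document

automaton = ahocorasick.Automaton()
for word in banned_words:
    automaton.add_word(word, word)
automaton.make_automaton()

# Collect the spans of matches that sit on word boundaries.
spans = []
for end, word in automaton.iter(document):
    start = end - len(word) + 1
    before_ok = start == 0 or not document[start - 1].isalnum()
    after_ok = end == len(document) - 1 or not document[end + 1].isalnum()
    if before_ok and after_ok:
        spans.append((start, end + 1))

# Rebuild the document without the matched spans.
pieces, last = [], 0
for start, stop in spans:
    pieces.append(document[last:start])
    last = stop
pieces.append(document[last:])
print("".join(pieces).split("\n"))  # [' there ', 'helloworld stays']
```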