Checking for a fuzzy/approximate substring in a longer string, in Python?

Using algorithms like Levenshtein (the python-Levenshtein package, or difflib), it is easy to find approximate matches:
```python
>>> import difflib
>>> difflib.SequenceMatcher(None, "amazing", "amaging").ratio()
0.8571428571428571
```
A fuzzy match can then be detected by choosing a threshold on the ratio as needed.

The current requirement: find a fuzzy substring, up to a given threshold, inside a larger string.

For example:
```python
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan", "manhattin" and their indexes in large_string
```
One brute-force solution is to generate all substrings of length n-1 to n+1 (or some other range of matching lengths), where n is the length of the query string, run Levenshtein on each one, and check it against the threshold.
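A minimal sketch of that brute-force idea, using the standard-library difflib ratio as the similarity measure (the function name and the 0.8 threshold are illustrative choices, not from any particular library):

```python
import difflib

def brute_force_fuzzy_substrings(large, query, threshold=0.8):
    """Score every substring of length n-1..n+1 against the query
    and keep those whose similarity ratio meets the threshold."""
    n = len(query)
    results = []
    for length in range(max(1, n - 1), n + 2):
        for start in range(len(large) - length + 1):
            candidate = large[start:start + length]
            ratio = difflib.SequenceMatcher(None, candidate, query).ratio()
            if ratio >= threshold:
                results.append((candidate, start, ratio))
    return results

large_string = "thelargemanhatanproject is a great project in themanhattincity"
for candidate, start, ratio in brute_force_fuzzy_substrings(large_string, "manhattan"):
    print(candidate, start, round(ratio, 3))
```

Note that this runs a SequenceMatcher for every window position, and overlapping windows that share a good match all pass the threshold, which is exactly why something smarter is worth looking for.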
Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available module?
UPDATE: The Python regex module works quite well, although for the fuzzy-substring case it is a bit slower than the built-in re module.
```python
>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b', input, flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
```
How about using difflib.SequenceMatcher.get_matching_blocks:
```python
>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i, j, n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888
>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i, j, n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666
```
UPDATE:
```python
import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

print list(matches(large_string, query_string, 0.8))
```
The above code prints:

```
['manhatan', 'manhattn']
```
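The question also asks for the indexes of the matches inside large_string. As a sketch (the helper name matches_with_index and the offset bookkeeping are mine, not from the answer above), the same difflib idea can be extended to report where each fuzzy match starts:

```python
import difflib

def matches_with_index(large_string, query_string, threshold):
    # Track each word's offset so matches can be located in large_string.
    offset = 0
    for word in large_string.split():
        start = large_string.index(word, offset)
        offset = start + len(word)
        s = difflib.SequenceMatcher(None, word, query_string)
        blocks = [(i, j, n) for i, j, n in s.get_matching_blocks() if n]
        match = ''.join(word[i:i + n] for i, j, n in blocks)
        if blocks and len(match) / float(len(query_string)) >= threshold:
            # Report the position of the first matching block.
            yield match, start + blocks[0][0]

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(list(matches_with_index(large_string, "manhattan", 0.8)))
```

This yields the same fuzzy words as above, paired with their character offsets (8 and 49 in this example).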
I use fuzzywuzzy to fuzzy-match against a threshold and fuzzysearch to fuzzy-extract the matching words.
```python
from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)
```
To test:
```python
print('query: {} string: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 70):
    print('match: {} index: {}'.format(match, index))

query_string = "citi"
print('query: {} string: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {} index: {}'.format(match, index))

query_string = "greet"
print('query: {} string: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {} index: {}'.format(match, index))
```
Output:

```
query: manhattan string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan index: 8
match: manhattin index: 49
query: citi string: thelargemanhatanproject is a great project in themanhattincity
match: city index: 58
query: greet string: thelargemanhatanproject is a great project in themanhattincity
match: great index: 29
```
The new regex library, which is intended to eventually replace re, includes fuzzy matching.
https://pypi.python.org/pypi/regex/
The fuzzy matching syntax looks fairly expressive, but this will give you a match with one or fewer insertions/additions/deletions:
```python
import regex
regex.match('(amazing){e<=1}', 'amaging')
```
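Applied to the original problem, a sketch might look like the following (the error budget {e<=2} and the BESTMATCH flag are my choices for this example, not prescribed by the answer above):

```python
import regex

large_string = "thelargemanhatanproject is a great project in themanhattincity"

# {e<=2} allows up to two edits (insertions, deletions, substitutions);
# regex.BESTMATCH asks the engine for the fewest-error match rather
# than the first acceptable one it encounters.
for m in regex.finditer(r'(manhattan){e<=2}', large_string, flags=regex.BESTMATCH):
    print(m.group(), m.span(), m.fuzzy_counts)
```

Both "manhatan" and "manhattin" are within one edit of the query, so each is found with a single error.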
Recently I wrote an alignment library for Python: https://github.com/eseraygun/python-alignment
Using it, you can perform both global and local alignments with arbitrary scoring strategies on arbitrary pairs of sequences. In fact, in your case you want semi-local alignment, since you care that all of query_string is matched but not where it falls within large_string; the code below simulates semi-local behavior using a local aligner plus some filtering.

Here is the example code from the README, modified for your case:
```python
from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
```
```
m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
```
If you remove the minScore argument, you will get only the best-scoring matches:
```
m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
```
Note that all algorithms in the library have O(n * m) time complexity, n and m being the lengths of the two sequences.
The approaches above are good, but I needed to find a small needle in a lot of hay, and ended up approaching it like this:
```python
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle = "this is the string we want to find"
hay = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length = len(needle.split())
max_sim_val = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio()
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string
```
Yields:
```
0.72972972973 this string is the one we wanted to find
```