Find the similarity metric between two strings
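How can I get the probability that one string is similar to another string in Python?
I want a decimal value such as 0.9 (meaning 90%), preferably using standard Python and its standard library.
For example:

    similar("Apple", "Appel")  # would have a high probability
    similar("Apple", "Mango")  # would have a lower probability
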
There is a built-in:

    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

Using it:

    >>> similar("Apple", "Appel")
    0.8
    >>> similar("Apple", "Mango")
    0.0

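I think you may be looking for an algorithm that describes the distance between strings. Here are some options you can refer to:
Solution 1: Python built-in
Use SequenceMatcher from difflib.
Pros: part of the standard Python library, so no extra package is needed. Cons: limited; there are many other good string similarity algorithms that it does not provide.
Example:

    >>> from difflib import SequenceMatcher
    >>> s = SequenceMatcher(None, "abcd", "bcde")
    >>> s.ratio()
    0.75
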
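To make the 0.75 above less of a black box: as documented for difflib, ratio() is 2.0 * M / T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that value from the matching blocks; ratio_by_hand is a hypothetical helper name used only for illustration.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T (illustrative helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final dummy block has size 0)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings
    total = len(a) + len(b)
    # difflib reports 1.0 when both sequences are empty
    return 2.0 * matched / total if total else 1.0

print(ratio_by_hand("abcd", "bcde"))  # "bcd" matches: 2 * 3 / 8
```

For "abcd" and "bcde" the only matching block is "bcd", so M = 3 and T = 8.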
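Solution 2: jellyfish library
It is a very good library with good coverage and few issues. It supports:
- Levenshtein distance
- Damerau-Levenshtein distance
- Jaro distance
- Jaro-Winkler distance
- Match Rating Approach comparison
- Hamming distance
Pros: easy to use, wide range of supported algorithms, tested. Cons: not part of the standard library.
Example:

    >>> import jellyfish
    >>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
    2
    >>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')  # named jaro_similarity in newer jellyfish releases
    0.89629629629629637
    >>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
    1
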
    >>> from fuzzywuzzy import fuzz  # assumed import: the snippet does not show where fuzz comes from
    >>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
    >>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

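The second call above scores 100 because the words are the same and only their order differs. As a rough standard-library approximation of that token-sort idea (not the library's own implementation), one can sort the words before comparing. token_sort_similarity is a hypothetical name, and the lowercasing and whitespace splitting are assumptions of this sketch.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Compare two strings after sorting their words (illustrative helper)."""
    def normalise(text):
        # Lowercase, split on whitespace, sort the words, and rejoin them
        return " ".join(sorted(text.lower().split()))
    score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    # Scale to 0..100 to mirror the integer-style scores shown above
    return round(100 * score)

print(token_sort_similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```

Sorting makes word order irrelevant, which helps for names and titles but hides differences that are purely about ordering.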
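You can create a function like this:

    def similar(w1, w2):
        # Pad the shorter word with spaces so both have the same length
        w1 = w1 + ' ' * (len(w2) - len(w1))
        w2 = w2 + ' ' * (len(w1) - len(w2))
        # Fraction of positions at which both words have the same character
        return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))
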
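A position-by-position comparison like this is sensitive to shifts: dropping one leading character misaligns everything after it. The variant below is a minimal sketch of the same idea that also guards against two empty strings, which would otherwise divide by zero. positional_similarity is a hypothetical name, and unlike the space-padded version, padding positions here never count as a match.

```python
from itertools import zip_longest

def positional_similarity(w1, w2):
    """Fraction of positions holding the same character (illustrative helper)."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        # Assumption of this sketch: two empty strings are treated as identical
        return 1.0
    # zip_longest pads the shorter word with None, which never equals a character
    matches = sum(1 for a, b in zip_longest(w1, w2) if a == b)
    return matches / longest

print(positional_similarity("Apple", "Appel"))  # 3 of 5 positions agree
print(positional_similarity("Apple", "pple"))   # one dropped letter shifts the rest
```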
The distance package includes Levenshtein distance:

    import distance

    distance.levenshtein("lenvestein", "levenshtein")
    # 3

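A Levenshtein distance is an edit count, not the 0 to 1 score the question asks for. The sketch below is a plain-Python dynamic-programming version that needs no extra package, plus one common way to normalise it: divide by the longer length and subtract from 1. Both function names are hypothetical, and the normalisation is a choice made for this sketch, not something either library is confirmed to do.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Map the edit distance onto a 0..1 score (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("lenvestein", "levenshtein"))  # 3
print(levenshtein_similarity("Apple", "Appel"))
```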
As an alternative to the built-in, a similar ratio can be computed with diff_match_patch:

    from diff_match_patch import diff_match_patch

    def compute_similarity_and_diff(text1, text2):
        dmp = diff_match_patch()
        dmp.Diff_Timeout = 0.0  # 0 disables the time limit on the diff
        diff = dmp.diff_main(text1, text2, False)

        # similarity: characters in unchanged segments (op == 0)
        # divided by the length of the longer text
        common_text = sum(len(txt) for op, txt in diff if op == 0)
        text_length = max(len(text1), len(text2))
        sim = common_text / text_length

        return sim, diff
