Get difference between two lists
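I have two lists in Python, like these:

    temp1 = ['One', 'Two', 'Three', 'Four']
    temp2 = ['One', 'Two']

I need to create a third list with the items from the first list that are not present in the second one. From the example I have to get:

    temp3 = ['Three', 'Four']

Are there any fast ways to do this without loops and checks?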
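
    In [5]: list(set(temp1) - set(temp2))
    Out[5]: ['Four', 'Three']

Beware that

    In [5]: set([1, 2]) - set([2, 3])
    Out[5]: set([1])

where you might expect or want it to equal `set([1, 3])`. Set subtraction is asymmetric: it keeps only the elements of the left operand that are missing from the right one. If you want the elements that appear in exactly one of the two sets, use `set([1, 2]).symmetric_difference(set([2, 3]))` instead.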
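
The asymmetry of set subtraction can be seen side by side in a minimal sketch. The names `a` and `b` are illustrative and not part of the answer:

```python
# Illustrative values; `a` and `b` are assumed names, not from the answer.
a = {1, 2}
b = {2, 3}

one_way = a - b    # elements of a that are not in b
other_way = b - a  # elements of b that are not in a
both_ways = a ^ b  # elements found in exactly one of the two sets

print(one_way)              # {1}
print(other_way)            # {3}
print(both_ways == {1, 3})  # True
```

Which operator is appropriate depends on whether "difference" means "what the first list has that the second lacks" or "everything the two lists do not share".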
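The existing solutions each offer one of the following:

- Better than O(n*m) performance.
- Preservation of the input list's order.

But so far no solution offers both. If you need both, try this:

    s = set(temp2)
    temp3 = [x for x in temp1 if x not in s]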
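
As a sketch, the same idea can be wrapped in a small function. The name `ordered_difference` is hypothetical, and the sketch assumes the items of the second list are hashable:

```python
def ordered_difference(first, second):
    """Return items of `first` that are not in `second`, keeping order and duplicates.

    Hypothetical helper name; assumes the items of `second` are hashable.
    """
    excluded = set(second)  # built once; membership tests are O(1) on average
    return [item for item in first if item not in excluded]


temp1 = ['One', 'Two', 'Three', 'Four', 'Three']
temp2 = ['One', 'Two']
print(ordered_difference(temp1, temp2))  # ['Three', 'Four', 'Three']
```

Unlike `set(temp1) - set(temp2)`, this keeps repeated items from the first list, which may or may not be what you want.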
Performance test

    import timeit
    init = 'temp1 = list(range(100)); temp2 = [i * 2 for i in range(50)]'
    print(timeit.timeit('list(set(temp1) - set(temp2))', init, number=100000))
    print(timeit.timeit('s = set(temp2);[x for x in temp1 if x not in s]', init, number=100000))
    print(timeit.timeit('[item for item in temp1 if item not in temp2]', init, number=100000))

Results (total seconds for 100,000 runs):

    4.34620224079 # ars' answer
    4.2770634955 # This answer
    30.7715615392 # matt b's answer
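The method I presented, besides preserving order, is also (slightly) faster than set subtraction because it does not construct an unnecessary set. The performance difference becomes more noticeable when the first list is much longer than the second and hashing is expensive. Here is a second test demonstrating this:

    init = '''
    temp1 = [str(i) for i in range(100000)]
    temp2 = [str(i * 2) for i in range(50)]
    '''

Results:

    11.3836875916 # ars' answer
    3.63890368748 # this answer (3 times faster!)
    37.7445402279 # matt b's answer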
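
To repeat this kind of comparison on a current interpreter, a sketch along the following lines should work. The list sizes, the labels, and the repeat count are arbitrary assumptions chosen to keep the run short, and the absolute timings will differ from the figures quoted above:

```python
import timeit

# Setup loosely mirrors the second test; sizes and repeat count are arbitrary.
setup = """
temp1 = [str(i) for i in range(10000)]
temp2 = [str(i * 2) for i in range(50)]
"""

candidates = {
    "set subtraction": "list(set(temp1) - set(temp2))",
    "set lookup + comprehension": "s = set(temp2); [x for x in temp1 if x not in s]",
    "list lookup + comprehension": "[x for x in temp1 if x not in temp2]",
}

timings = {}
for label, stmt in candidates.items():
    # timeit.timeit returns the total time in seconds for `number` runs
    timings[label] = timeit.timeit(stmt, setup, number=10)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```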

    temp3 = [item for item in temp1 if item not in temp2]

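The difference between two lists (say list1 and list2) can be found using the following simple function.

    def diff(list1, list2):
        c = set(list1).union(set(list2))  # or c = set(list1) | set(list2)
        d = set(list1).intersection(set(list2))  # or d = set(list1) & set(list2)
        return list(c - d)

or

    def diff(list1, list2):
        return list(set(list1).symmetric_difference(set(list2)))  # or return list(set(list1) ^ set(list2))

With the function above, the two lists can be passed to `diff` in either order: the symmetric difference does not depend on which list comes first.

Python doc reference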
If you want the difference computed recursively, I have written a package for Python: https://github.com/seperman/deepdiff

Installation

Install from PyPI:

    pip install deepdiff

Example usage

Importing

    >>> from deepdiff import DeepDiff
    >>> from pprint import pprint
    >>> from __future__ import print_function # In case running on Python 2
Same object returns empty

    >>> t1 = {1:1, 2:2, 3:3}
    >>> t2 = t1
    >>> print(DeepDiff(t1, t2))
    {}

Type of an item has changed

    >>> t1 = {1:1, 2:2, 3:3}
    >>> t2 = {1:1, 2:"2", 3:3}
    >>> pprint(DeepDiff(t1, t2), indent=2)
    { 'type_changes': { 'root[2]': { 'newtype': <class 'str'>,
                                     'newvalue': '2',
                                     'oldtype': <class 'int'>,
                                     'oldvalue': 2}}}

Value of an item has changed

    >>> t1 = {1:1, 2:2, 3:3}
    >>> t2 = {1:1, 2:4, 3:3}
    >>> pprint(DeepDiff(t1, t2), indent=2)
    {'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}

Item added and/or removed

    >>> t1 = {1:1, 2:2, 3:3, 4:4}
    >>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
    >>> ddiff = DeepDiff(t1, t2)
    >>> pprint(ddiff)
    {'dic_item_added': ['root[5]', 'root[6]'],
     'dic_item_removed': ['root[4]'],
     'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}
String difference

    >>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello","b":"world

If you are really looking into performance, then use numpy!

Here is the full notebook as a gist on GitHub with a comparison between list, numpy, and pandas.

https://gist.github.com/denfromufa/2821ff59b02e9482be15d27f2bbd4451

Comments:

- This is the best answer! Well done! Saved me hundreds of hours!
- With more than 1 million elements, pandas can be faster!
- The outputs of the two methods differ. Ideally they should return the same output set: the NumPy method returns a length of 28571, while the list comprehension method returns a length of 9524.
- Good catch! I fixed my answer.
- I updated the notebook in the link and the screenshot. Surprisingly, pandas is slower than numpy even after switching to a hash table internally. This may partly be due to upcasting to int64.

Since none of the present solutions yields a tuple, I would consider:

    temp3 = tuple(set(temp1) - set(temp2))
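Alternatively:

    # edited using @Mark Byers idea. If you accept this one as answer, just accept his instead.
    s = set(temp2)  # build the lookup set once instead of once per element
    temp3 = tuple(x for x in temp1 if x not in s)

Like the other non-tuple-yielding answers in this direction, it preserves order.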
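This can be done using Python's XOR operator.

- This removes the duplicates in each list.
- This shows the items of temp1 that are missing from temp2 as well as the items of temp2 that are missing from temp1.

    set(temp1) ^ set(temp2)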
The simplest way:

use `set().difference(set())`

    list_a = [1, 2, 3]
    list_b = [2, 3]
    print(set(list_a).difference(set(list_b)))

The answer is `set([1])` (shown as `{1}` on Python 3).

It can be printed as a list:

    print(list(set(list_a).difference(set(list_b))))
This could be even faster than Mark's list comprehension:

    import itertools
    list(itertools.filterfalse(set(temp2).__contains__, temp1))
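
As a sketch of what that one-liner does: `itertools.filterfalse` keeps the items for which the predicate returns false, so passing the set's bound `__contains__` method drops everything found in `temp2` while keeping the order and duplicates of `temp1`. The sample lists and the name `in_temp2` are illustrative:

```python
from itertools import filterfalse

temp1 = ['One', 'Two', 'Three', 'Four', 'Three']
temp2 = ['One', 'Two']

in_temp2 = set(temp2).__contains__  # bound method used as the predicate
temp3 = list(filterfalse(in_temp2, temp1))

print(temp3)  # ['Three', 'Four', 'Three']
```

Whether this actually beats a list comprehension depends on the interpreter and the data, so it is worth timing on your own inputs.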
Try this:

    temp3 = set(temp1) - set(temp2)
I wanted something that takes two lists and does what `diff` does. Since this question comes up first when you search for "python diff two list" and is not very specific, I will post what I came up with.

Using `SequenceMatcher` from `difflib`:

    a = 'A quick fox jumps the lazy dog'.split()
    b = 'A quick brown mouse jumps over the dog'.split()

    from difflib import SequenceMatcher

    for tag, i, j, k, l in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal': print('both have', a[i:j])
        if tag in ('delete', 'replace'): print(' 1st has', a[i:j])
        if tag in ('insert', 'replace'): print(' 2nd has', b[k:l])

This outputs:

    both have ['A', 'quick']
     1st has ['fox']
     2nd has ['brown', 'mouse']
    both have ['jumps']
     2nd has ['over']
    both have ['the']
     1st has ['lazy']
    both have ['dog']
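Of course, if your application makes the same assumptions that the other answers make, you will benefit from them the most. But if you are looking for true `diff` functionality, this is the way to go.

For example, none of the other answers can handle:

    a = [1,2,3,4,5]
    b = [5,4,3,2,1]

But this one does:

     2nd has [5, 4, 3, 2]
    both have [1]
     1st has [2, 3, 4, 5]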
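
If you would rather collect the changes than print them, the same opcodes can be folded into two lists. This is a minimal sketch; the helper name `split_changes` is hypothetical:

```python
from difflib import SequenceMatcher


def split_changes(a, b):
    """Return (removed, added) based on SequenceMatcher opcodes.

    Hypothetical helper: `removed` holds the runs present only in `a`,
    `added` holds the runs present only in `b`, both in sequence order.
    """
    removed, added = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag in ('delete', 'replace'):
            removed.extend(a[i1:i2])
        if tag in ('insert', 'replace'):
            added.extend(b[j1:j2])
    return removed, added


a = 'A quick fox jumps the lazy dog'.split()
b = 'A quick brown mouse jumps over the dog'.split()
removed, added = split_changes(a, b)
print(removed)  # ['fox', 'lazy']
print(added)    # ['brown', 'mouse', 'over']
```

Because the comparison is positional, an item that merely moved shows up in both lists, which is exactly what distinguishes this from the set-based answers.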
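If the elements of both lists are sorted and unique, and the second list is a prefix of the first, you can use a naive method:

    list1 = [1, 2, 3, 4, 5]
    list2 = [1, 2, 3]

    print(list1[len(list2):])

or use native set methods:

    subset = set(list1).difference(list2)

    print(subset)

    import timeit
    init = 'temp1 = list(range(100)); temp2 = [i * 2 for i in range(50)]'
    print("Naive solution:", timeit.timeit('temp1[len(temp2):]', init, number=100000))
    print("Native set solution:", timeit.timeit('set(temp1).difference(temp2)', init, number=100000))

Naive solution: 0.0787101593292
Native set solution: 0.998837615564

Note that in this benchmark temp2 is not a prefix of temp1, so the two statements do not compute the same result; the timings only compare the raw cost of each operation.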
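For the simplest case, here is a `Counter` answer.

This is shorter than the answer that does two-way diffs, because it does only what the question asks: generate a list of what is in the first list but not in the second.

    from collections import Counter

    lst1 = ['One', 'Two', 'Three', 'Four']
    lst2 = ['One', 'Two']

    c1 = Counter(lst1)
    c2 = Counter(lst2)
    diff = list((c1 - c2).elements())

Alternatively, depending on your readability preferences, it makes for a decent one-liner:

    diff = list((Counter(lst1) - Counter(lst2)).elements())

Output:

    ['Three', 'Four']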
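Note that you can drop the `list(...)` call if you are just iterating over the result.

Because this solution uses counters, it handles quantities properly, unlike the many set-based answers. For example, on this input:

    lst1 = ['One', 'Two', 'Two', 'Two', 'Three', 'Three', 'Four']
    lst2 = ['One', 'Two']

The output is:

    ['Two', 'Two', 'Three', 'Three', 'Four']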
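
`Counter.elements()` groups equal items together, so the original positions of repeated items are not kept. If that matters, one possible variant walks the first list and consumes the counts explicitly. This is a sketch under that assumption; the helper name `multiset_difference` is hypothetical:

```python
from collections import Counter


def multiset_difference(first, second):
    """Drop one occurrence from `first` per occurrence in `second`, keeping order.

    Hypothetical helper name; assumes hashable items.
    """
    remaining = Counter(second)  # copies of each item still to be removed
    result = []
    for item in first:
        if remaining[item] > 0:
            remaining[item] -= 1  # consume one pending removal
        else:
            result.append(item)
    return result


lst1 = ['One', 'Two', 'Two', 'Two', 'Three', 'Three', 'Four']
lst2 = ['One', 'Two']
print(multiset_difference(lst1, lst2))
# ['Two', 'Two', 'Three', 'Three', 'Four']
print(multiset_difference(['b', 'a', 'b'], ['b']))
# ['a', 'b']
```

The earliest occurrences are the ones removed here; that choice is arbitrary and could be reversed by iterating from the end.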
I am a little late to the game, but you can compare the performance of some of the code mentioned above with this. Two of the fastest contenders are:

    list(set(x).symmetric_difference(set(y)))
    list(set(x) ^ set(y))

I apologize for the elementary level of coding.

    import time
    import random
    from itertools import filterfalse

    # 1 - performance (time taken)
    # 2 - correctness (answer - 1,4,5,6)
    # set performance
    performance = 1
    numberoftests = 7

    def answer(x, y, z):
        # time.perf_counter() replaces time.clock(), which was removed in Python 3.8
        if z == 0:
            start = time.perf_counter()
            # both directions: items only in x, then items only in y
            lists = str(list(set(x) - set(y)) + list(set(y) - set(x)))
            times = "1 = " + str(time.perf_counter() - start)
            return (lists, times)

        elif z == 1:
            start = time.perf_counter()
            lists = str(list(set(x).symmetric_difference(set(y))))
            times = "2 = " + str(time.perf_counter() - start)
            return (lists, times)

        elif z == 2:
            start = time.perf_counter()
            lists = str(list(set(x) ^ set(y)))
            times = "3 = " + str(time.perf_counter() - start)
            return (lists, times)

        elif z == 3:
            start = time.perf_counter()
            # list() forces the lazy filterfalse iterator so the work is actually timed
            lists = list(filterfalse(set(y).__contains__, x))
            times = "4 = " + str(time.perf_counter() - start)
            return (lists, times)

        elif z == 4:
            start = time.perf_counter()
            lists = tuple(set(x) - set(y))
            times = "5 = " + str(time.perf_counter() - start)
            return (lists, times)

        elif z == 5:
            start = time.perf_counter()
            lists = [tt for tt in x if tt not in y]
            times = "6 = " + str(time.perf_counter() - start)
            return (lists, times)

        else:
            start = time.perf_counter()
            Xarray = [iDa for iDa in x if iDa not in y]
            Yarray = [iDb for iDb in y if iDb not in x]
            lists = str(Xarray + Yarray)
            times = "7 = " + str(time.perf_counter() - start)
            return (lists, times)

    n = numberoftests

    if performance == 2:
        a = [1, 2, 3, 4, 5]
        b = [3, 2, 6]
        for c in range(0, n):
            d = answer(a, b, c)
            print(d[0])

    elif performance == 1:
        for tests in range(0, 10):
            print("Test Number" + str(tests + 1))
            a = random.sample(range(1, 900000), 9999)
            b = random.sample(range(1, 900000), 9999)
            for c in range(0, n):
                # if c not in (1, 4, 5, 6):
                d = answer(a, b, c)
                print(d[1])
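If you run into `TypeError: unhashable type: 'list'`, you need to turn the lists or sets into tuples, e.g.

    set(map(tuple, list_of_lists1)).symmetric_difference(set(map(tuple, list_of_lists2)))

See also "How to compare a list of lists/sets in python?"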
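
A small worked example of that conversion, with made-up sample data. Sorting the result and converting the tuples back to lists are optional extras added here for readability:

```python
# Made-up sample data; the variable names follow the one-liner above.
list_of_lists1 = [[1, 2], [3, 4], [5, 6]]
list_of_lists2 = [[3, 4], [7, 8]]

# Lists are unhashable, so convert each inner list to a tuple first.
as_tuples1 = set(map(tuple, list_of_lists1))
as_tuples2 = set(map(tuple, list_of_lists2))

# sorted() gives a deterministic order; convert back to lists for the result.
diff = [list(t) for t in sorted(as_tuples1 ^ as_tuples2)]
print(diff)  # [[1, 2], [5, 6], [7, 8]]
```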
If you want something more like a changeset, you could use `Counter`:

    from collections import Counter

    def diff(a, b):
        """ more verbose than needs to be, for clarity """
        ca, cb = Counter(a), Counter(b)
        to_add = cb - ca
        to_remove = ca - cb
        changes = Counter(to_add)
        changes.subtract(to_remove)
        return changes

    lista = ['one', 'three', 'four', 'four', 'one']
    listb = ['one', 'two', 'three']

    In [127]: diff(lista, listb)
    Out[127]: Counter({'two': 1, 'one': -1, 'four': -2})
    # in order to go from lista to listb, you need to add a "two", remove a "one", and remove two "four"s

    In [128]: diff(listb, lista)
    Out[128]: Counter({'four': 2, 'one': 1, 'two': -1})
    # in order to go from listb to lista, you must add two "four"s, add a "one", and remove a "two"
Here is another solution:

    def diff(a, b):
        xa = [i for i in set(a) if i not in b]
        xb = [i for i in set(b) if i not in a]
        return xa + xb
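Single-line version of arulmr's solution

    def diff(listA, listB):
        # second operand fixed to set(listB) - set(listA); the original repeated the first difference
        return set(listA) - set(listB) | set(listB) - set(listA)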
We can calculate the union of the lists minus their intersection:

    temp1 = ['One', 'Two', 'Three', 'Four']
    temp2 = ['One', 'Two', 'Five']

    set(temp1+temp2)-(set(temp1)&set(temp2))

    Out: set(['Four', 'Five', 'Three'])
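Here are a few simple, order-preserving ways of diffing two lists of strings.

Code

Using `pathlib`:

    import pathlib

    temp1 = ["One", "Two", "Three", "Four"]
    temp2 = ["One", "Two"]

    p = pathlib.Path(*temp1)
    # pass a single Path: extra positional arguments to relative_to() are deprecated since Python 3.12
    r = p.relative_to(pathlib.Path(*temp2))
    list(r.parts)
    # ['Three', 'Four']

This assumes both lists contain strings with equivalent beginnings. See the docs for more details. Note that it is not particularly fast compared to set operations.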
Using `itertools.zip_longest`:

    import itertools as it

    [x for x, y in it.zip_longest(temp1, temp2) if x != y]
    # ['Three', 'Four']
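
One caveat worth illustrating: the `zip_longest` approach compares items position by position, so it only matches the set-based answers when the second list lines up with the start of the first. A minimal sketch with illustrative variable names:

```python
from itertools import zip_longest

temp1 = ['One', 'Two', 'Three', 'Four']
same_prefix = ['One', 'Two']
reordered = ['Two', 'One']

# Items are compared pairwise by index; missing partners are filled with None.
aligned = [x for x, y in zip_longest(temp1, same_prefix) if x != y]
shuffled = [x for x, y in zip_longest(temp1, reordered) if x != y]

print(aligned)   # ['Three', 'Four']
print(shuffled)  # ['One', 'Two', 'Three', 'Four']
```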
This can be solved with one line. The question gives two lists (temp1 and temp2) and asks for their difference in a third list (temp3).

    temp3 = list(set(temp1).difference(set(temp2)))
Let's say we have two lists

    list1 = [1, 3, 5, 7, 9]
    list2 = [1, 2, 3, 4, 5]

We can see from the two lists above that items 1, 3, 5 exist in list2 and items 7, 9 do not. On the other hand, items 1, 3, 5 exist in list1 and items 2, 4 do not.

What is the best solution to return a new list containing items 7, 9 and 2, 4?

All the answers above find the solution, but which one is the most efficient?

    def difference(list1, list2):
        new_list = []
        for i in list1:
            if i not in list2:
                new_list.append(i)

        for j in list2:
            if j not in list1:
                new_list.append(j)
        return new_list

versus

    def sym_diff(list1, list2):
        return list(set(list1).symmetric_difference(set(list2)))
Using timeit we can see the results:

    import timeit

    t1 = timeit.Timer("difference(list1, list2)", "from __main__ import difference, list1, list2")
    t2 = timeit.Timer("sym_diff(list1, list2)", "from __main__ import sym_diff, list1, list2")

    # Timer.timeit() reports seconds, whatever the label printed below says
    print('Using two for loops', t1.timeit(number=100000), 'Milliseconds')
    print('Using symmetric_difference', t2.timeit(number=100000), 'Milliseconds')

which returns

    [7, 9, 2, 4]
    Using two for loops 0.11572412995155901 Milliseconds
    Using symmetric_difference 0.11285737506113946 Milliseconds
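
The two approaches can also be combined: keep the loop structure so the order of appearance is preserved, but test membership against sets so each lookup is O(1) on average. This is a sketch, not one of the benchmarked functions; the name `ordered_sym_diff` is hypothetical and the items are assumed to be hashable:

```python
def ordered_sym_diff(list1, list2):
    """Items found in exactly one of the lists, in order of appearance.

    Hypothetical helper; assumes hashable items.
    """
    set1, set2 = set(list1), set(list2)
    only_in_1 = [x for x in list1 if x not in set2]
    only_in_2 = [x for x in list2 if x not in set1]
    return only_in_1 + only_in_2


print(ordered_sym_diff([1, 3, 5, 7, 9], [1, 2, 3, 4, 5]))  # [7, 9, 2, 4]
```

On five-element lists the difference is negligible, as the timings above suggest; the set lookups start to matter as the lists grow.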

    tweets = ['manoj', 'shekhar', 'manoj', 'rahul', 'mohit', 'jyohit', 'sankar', 'pappu']
    netweets = ['manoj', 'pappu', 'shekhar', 'mohit', 'gourav']

    netweet = []
    for i in tweets:
        if i not in netweets:
            netweet.append(i)

    print(netweet)

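Here is a simple way to diff two lists, whatever their contents. You get a result like the one shown below (the `sets` module exists only in Python 2; on Python 3 the built-in `set` provides the same `symmetric_difference` method):

    >>> from sets import Set
    >>>
    >>> l1 = ['xvda', False, 'xvdbb', 12, 'xvdbc']
    >>> l2 = ['xvda', 'xvdbb', 'xvdbc', 'xvdbd', None]
    >>>
    >>> Set(l1).symmetric_difference(Set(l2))
    Set([False, 'xvdbd', None, 12])

Hope this helps.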

    list(set(a) - set(b)) + list(set(b) - set(a))