How do I find the duplicates in a list and create another list with them?
How do I find the duplicates in a Python list and create another list of the duplicates? The list contains only integers.
To remove duplicates, use `set(a)`. To print the duplicates, something like:

```
a = [1,2,3,2,1,5,6,5,5,5]

import collections
print [item for item, count in collections.Counter(a).items() if count > 1]

## [1, 2, 5]
```
Note that `Counter` is probably overkill here; a plain `set` performs well. This code computes a list of unique elements in source order:

```
seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)
```
or, more concisely:

```
seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]
```
I don't recommend the latter style, because it is not obvious what `not seen.add(x)` is doing (the set `add()` method always returns `None`, hence the need for `not`).
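For the curious, the obscurity can be shown directly (a small sketch in Python 3 syntax): `set.add` always returns `None`, so `not seen.add(x)` is always true and exists purely for its side effect.

```python
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

# set.add returns None, so `not seen.add(x)` is always True;
# its only purpose is the side effect of recording x in `seen`.
seen = set()
result = seen.add(1)
assert result is None

# The one-liner therefore keeps x exactly when it was not already seen:
seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]
print(uniq)  # [1, 2, 3, 5, 6]
```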
To compute the list of duplicated elements without libraries:

```
seen = {}
dupes = []

for x in a:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
```
If the list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic-time solution (comparing each element with every other one). For example:

```
a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]
```
```
>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])
```
You don't need the count, just whether or not the item was seen before. Adapted that answer to this problem:

```
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]
```
Just in case speed matters, here are some timings:

```
# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set(x for x in l if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100
```
Here are the results (well done @JohnLaRooy!):

```
$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop
```
Interestingly, besides the timings themselves, the ranking also changes slightly when PyPy is used. Most interestingly, the Counter-based approach benefits hugely from PyPy's optimizations, whereas the method-caching approach I suggested seems to have almost no effect.

```
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop
```
Apparently this effect is related to the "duplicatedness" of the input data. Switching to a list with far fewer duplicates, I got these results:

```
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop
```
I came across this question while looking into something related, and wondered why no one offered a generator-based solution. Solving this problem would be:

```
>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]
```
I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as lists get larger (note: it would have been better to use timeit, but this is illustrative).

I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is even faster for mostly sorted lists. Now also includes the pandas approach from @firelynx: slow, but not horribly so, and simple. Note: the sort/tee/zip approach is consistently fastest on my machine for large, mostly ordered lists; @moooeeeep is fastest for shuffled lists, but your mileage may vary.
Advantages

- Very quick and simple to test for "any" duplicates using the same code

Assumptions

- Duplicates should be reported once only
- Duplicate order does not need to be preserved
- Duplicates might be anywhere in the list

Fastest solution, 1m entries:

```
def getDupes(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r: yield k
        r = k
```
Approaches tested

```
import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s, r = set(), set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1], c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1], c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r: yield k
        r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k, g in itertools.izip(x[:-1], x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn, c):
    try:
        if fn(c).next(): return True # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn, c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'

for location in (50, 250000, 500000, 750000, 999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - ' % (test.__doc__ or test.__name__, location),
        deltas = []
        for FIRST in (True, False):
            for i in xrange(0, 5):
                c = range(0, 1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time() - start)
            print ' -- %0.3f ' % (sum(deltas) / len(deltas)),
        print
```
"所有重复"测试的结果是一致的,在此数组中查找"第一个"重复项,然后查找"所有"重复项:
1 2 3 4 5 6 7 8 9 10 11 | Finding FIRST then ALL duplicates, single dupe of"nth" placed element in 1m element array Test set len change : 500000 - . . . . . -- 0.264 . . . . . -- 0.402 Test in dict : 500000 - . . . . . -- 0.163 . . . . . -- 0.250 Test in set : 500000 - . . . . . -- 0.163 . . . . . -- 0.249 Test sort/adjacent : 500000 - . . . . . -- 0.159 . . . . . -- 0.229 Test sort/groupby : 500000 - . . . . . -- 0.860 . . . . . -- 1.286 Test sort/izip : 500000 - . . . . . -- 0.165 . . . . . -- 0.229 Test sort/tee/izip : 500000 - . . . . . -- 0.145 . . . . . -- 0.206 * Test moooeeeep : 500000 - . . . . . -- 0.149 . . . . . -- 0.232 Test iter*/sorted : 500000 - . . . . . -- 0.160 . . . . . -- 0.221 Test pandas : 500000 - . . . . . -- 0.493 . . . . . -- 0.499 |
When the lists are shuffled first, the price of the sort becomes apparent: efficiency drops noticeably and the @moooeeeep approach dominates, with the set and dict approaches being similar but lesser performers:

```
Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321  . . . . .  -- 0.473
Test in dict        :    500000 -  . . . . .  -- 0.285  . . . . .  -- 0.360
Test in set         :    500000 -  . . . . .  -- 0.309  . . . . .  -- 0.365
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756  . . . . .  -- 0.823
Test sort/groupby   :    500000 -  . . . . .  -- 1.459  . . . . .  -- 1.896
Test sort/izip      :    500000 -  . . . . .  -- 0.786  . . . . .  -- 0.845
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743  . . . . .  -- 0.804
Test moooeeeep      :    500000 -  . . . . .  -- 0.234  . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776  . . . . .  -- 0.840
Test pandas         :    500000 -  . . . . .  -- 0.539  . . . . .  -- 0.540
```
You can use `iteration_utilities.duplicates`:

```
>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]
```
or, if you only want one of each duplicate, it can be combined with `iteration_utilities.unique_everseen`:

```
>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]
```
It can also handle unhashable elements (at the cost of performance):

```
>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]
```
That's something only a few of the other approaches here can handle.
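To illustrate the point (an illustrative sketch, not part of the original answer): hashing a list raises `TypeError`, so any set- or dict-based approach fails outright, and the quadratic comparison shown earlier in this thread is the usual fallback.

```python
a = [[1], [2], [1], [3], [1]]

# Lists are unhashable, so they cannot go into a set or serve as dict keys:
try:
    set().add(a[0])
    hashable = True
except TypeError:
    hashable = False
print(hashable)  # False

# The O(n^2) fallback compares each element against all earlier ones:
dupes = [x for n, x in enumerate(a) if x in a[:n]]
print(dupes)  # [[1], [1]]
```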
Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here.

The first benchmark covered only a small range of list lengths, because some approaches have O(n**2) behavior.

In the graphs the y-axis represents the time, so a lower value means better. It is also plotted log-log so the wide range of values can be visualized better.
Removing the O(n**2) approaches, I did another benchmark up to half a million elements per list.

As you can see, the `iteration_utilities.duplicates` approach is faster than any of the other approaches, and even chaining `unique_everseen(duplicates(...))` was as fast or faster than the other approaches.

One additional interesting thing to note here: the pandas approaches are very slow for small lists but can easily compete for longer lists.

However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn't matter much which one is used (except for the three that had O(n**2) runtime). To reproduce the benchmarks, the functions:
```
from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]

def georg_set3(it):
    seen = {}
    dupes = []
    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r: yield k
        r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)
    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()
    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)
    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s = set(l)
    d = []
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))
```
Benchmark 1

```
from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep,
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx, HenryDev, yota,
    IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')
b.plot()
```
Benchmark 2

```
funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep,
    F1Rumors, Edward, wordsmith, firelynx, yota,
    IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()
```
Disclaimer

This is from a third-party library I have written: `iteration_utilities`.
`collections.Counter` is new in Python 2.7:

```
Python 2.5.4 (r254:67916, May 31 2010, 15:03:39)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = [1,2,3,2,1,5,6,5,5,5]
>>> import collections
>>> print [x for x, y in collections.Counter(a).items() if y > 1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'Counter'
>>>
```
In an earlier version you can use a conventional dict instead:

```
a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]
```
Using pandas:

```
>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])
```
Here's a neat and concise solution:

```
for x in set(li):
    li.remove(x)

li = list(set(li))
```
Simply iterate through each element in the list, checking its number of occurrences, and add the repeated ones to a set, which will then give the duplicate elements. Hope this helps someone out.

```
myList = [2, 4, 6, 8, 4, 6, 12]
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4, 6]
```
Without converting to a list, probably the simplest way is something like below. This may be useful during an interview when they ask not to use sets.

```
a = [1,2,3,3,3]
dup = []
for each in a:
    if each not in dup:
        dup.append(each)
print(dup)
```
Or, to get two separate lists of unique values and duplicate values:

```
a = [1,2,3,3,3]
uniques = []
dups = []

for each in a:
    if each not in uniques:
        uniques.append(each)
    else:
        dups.append(each)

print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)
```
I would do this with pandas, because I use pandas a lot:

```
import pandas as pd

a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()
```

Gives

```
[3, 6]
```
Probably isn't very efficient, but it's certainly less code than a lot of the other answers, so I thought I would contribute.
The third example of the accepted answer gives an erroneous answer and does not attempt to give duplicates. Here is the correct version:

```
number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set
```
A bit late, but maybe helpful for some. For a largish list, I found this worked for me.

```
l = [1,2,3,5,4,1,3,1]
s = set(l)
d = []
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]
```
Shows just and all duplicates, preserving order.
A very simple and quick way of finding dupes with one iteration in Python is:

```
testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
    try:
        testListDict[item] += 1
    except:
        testListDict[item] = 1

print testListDict
```
Output will be as follows:

```
>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}
```
I covered this in my blog http://www.howtoprogramwithpython.com
We can use `itertools.groupby` in order to find all the items that have duplicates:

```
from itertools import groupby

myList = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
    # list(y) returns all the occurrences of item x
    if len(list(y)) > 1:
        print x
```

The output will be:

```
4
6
```
One-line solution:

```
set([i for i in list if sum([1 for a in list if a == i]) > 1])
```
```
list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()

[(lset.add(item), list2.append(item)) for item in list2 if item not in lset]

print list(lset)
```
Here is a fast generator that uses a dict to store each element as a key with a boolean value for checking whether the duplicate item has already been yielded.

For lists with all elements that are hashable types:

```
def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]
```
For lists that might contain lists:

```
def gen_dupes(array):
    unique = {}
    for value in array:
        # use a hashable tuple as a stand-in for list values
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]
```
```
def removeduplicates(a):
    seen = set()

    for i in a:
        if i not in seen:
            seen.add(i)
    return seen

print(removeduplicates([1,1,2,2]))
```
Some other tests. Of course, doing...

```
set([x for x in l if l.count(x) > 1])
```

...is too costly. It's about 500 times faster (the longer the array, the better the results) to use the next final method:

```
def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()
```
Only 2 loops, no very costly `l.count()` operations.
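The difference is asymptotic: `l.count(x)` rescans the whole list for every element, giving O(n**2) total work, while the dict build is a single O(n) pass. For reference, a Python 3 equivalent of both approaches side by side (a sketch; the answer's own code above uses the Python 2 `iteritems`):

```python
def dups_count_dict3(l):
    # One O(n) pass to build counts, one pass to filter: O(n) total.
    d = {}
    for item in l:
        d[item] = d.get(item, 0) + 1
    return [key for key, val in d.items() if val > 1]

def dups_count3(l):
    # l.count(x) rescans the whole list for every x: O(n^2) total.
    return set(x for x in l if l.count(x) > 1)

l = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
print(sorted(dups_count_dict3(l)))  # [1, 2, 5]
print(sorted(dups_count3(l)))       # [1, 2, 5]
```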
Here is a code to compare the methods, for example. The code is below; here is the output:

```
dups_count: 13.368s          # this is a function which uses l.count()
dups_count_dict: 0.014s      # this is the final best function (of the 3 functions)
dups_count_counter: 0.024s   # collections.Counter
```
Testing code:

```
import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()


def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()
```
Method 1:

```
list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))
```
Explanation: `[val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]` is a list comprehension that returns an element if the same element is present anywhere after its current position in the list.

Example: input_list = [42,31,42,31,3,31,31,5,6,6,6,6,7,42]

Starting with the first element of the list, 42, at index 0, it checks whether the element 42 is present in input_list[1:] (i.e., from index 1 until the end of the list). Because 42 is present in input_list[1:], it returns 42.

Then it goes to the next element, 31, at index 1, and checks whether the element 31 is present in input_list[2:] (i.e., from index 2 until the end of the list). Because 31 is present in input_list[2:], it returns 31.

Similarly, it goes through all the elements in the list and returns only the repeated/duplicate elements into a list.

Then, because we have duplicates in that list, we need to pick one of each duplicate, i.e. remove the duplicates among the duplicates; to do so, we call the Python built-in set(), which removes the duplicates.

We are then left with a set, not a list, so to convert from a set to a list we use the typecast list(), which converts the set of elements into a list.
Method 2:

```
def dupes(ilist):
    temp_list = []  # initially, empty temporary list
    dupe_list = []  # initially, empty duplicate list
    for each in ilist:
        if each in temp_list:  # Found a Duplicate element
            if not each in dupe_list:  # Avoid duplicate elements in dupe_list
                dupe_list.append(each)  # Add duplicate element to dupe_list
        else:
            temp_list.append(each)  # Add a new (non-duplicate) to temp_list
    return dupe_list
```
Explanation: Here we create two empty lists to start with, then keep traversing through all the elements of the list to see whether each exists in temp_list (initially empty). If it is not in temp_list, we add it to temp_list using the append method.

If it already exists in temp_list, it means the current element of the list is a duplicate, and so we need to add it to dupe_list, again using the append method.
There are a lot of answers up here, but I think this is a relatively readable and easy-to-understand approach:

```
def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)
```
Notes:

- If you wish to preserve the duplication count, get rid of the cast to "set" at the bottom to get the full list
- If you prefer to use generators, replace duplicates.append(x) with yield x, and drop the return statement at the bottom (you can cast to set later)
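The generator variant mentioned in the second note could look like this (a sketch under that note's assumptions; `iter_duplicates` is a hypothetical name):

```python
def iter_duplicates(sorted_list):
    # Generator form of the sorted-adjacent approach: yields a value each
    # time an element repeats its predecessor; wrap the result in set()
    # to deduplicate the output, as the note suggests.
    if not sorted_list:
        return
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            yield x
        last = x

print(set(iter_duplicates(sorted([1, 2, 3, 2, 1, 5, 6, 5, 5, 5]))))  # {1, 2, 5}
```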
I had to do it this way, because I challenged myself not to use other methods:

```
def dupList(oldlist):
    if type(oldlist) == type((2, 2)):
        oldlist = [x for x in oldlist]
    newList = []
    newList = newList + oldlist
    oldlist = oldlist
    forbidden = []
    checkPoint = 0
    for i in range(len(oldlist)):
        #print 'start i', i
        if i in forbidden:
            continue
        else:
            for j in range(len(oldlist)):
                #print 'start j', j
                if j in forbidden:
                    continue
                else:
                    #print 'after Else'
                    if i != j:
                        #print 'i,j', i, j
                        #print oldlist
                        #print newList
                        if oldlist[j] == oldlist[i]:
                            #print 'oldlist[i],oldlist[j]', oldlist[i], oldlist[j]
                            forbidden.append(j)
                            #print 'forbidden', forbidden
                            del newList[j - checkPoint]
                            #print newList
                            checkPoint = checkPoint + 1
    return newList
```
So your sample works as:

```
>>> a = [1,2,3,3,3,4,5,6,6,7]
>>> dupList(a)
[1, 2, 3, 4, 5, 6, 7]
```
```
raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]
```

You basically remove the duplicates by converting to a set (clean_list), then iterate the raw_list while removing each item from the clean list as it occurs in raw_list. If the item is not found, the raised ValueError exception is caught and the item is added to the duplicated_items list.

If the index of the duplicated items is needed, just enumerate the list and play around with the index.
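For instance, one way to "play around with the index" (an illustrative sketch, not part of the original answer):

```python
raw_list = [1, 2, 3, 3, 4, 5, 6, 6, 7, 2]

# Record the index of every occurrence after the first one.
seen = set()
dupe_indices = []
for idx, item in enumerate(raw_list):
    if item in seen:
        dupe_indices.append(idx)
    else:
        seen.add(item)

print(dupe_indices)  # [3, 7, 9]
```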
With toolz:

```
from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]

>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2,4]
```
I am entering this discussion much, much later. Even so, I would like to handle this problem with one-liners, because that's the charm of Python. If we just want to get the duplicates into a separate list (or any collection), I would suggest doing as below. Say we have a duplicated list, which we can call "target":

```
target = [1,2,3,4,4,4,3,5,6,8,4,3]
```

Now, if we want to get the duplicates, we can use the one-liner below:

```
duplicates = dict(set((x, target.count(x)) for x in filter(lambda rec: target.count(rec) > 1, target)))
```

This code will put the duplicated records as keys and their counts as values into the dictionary "duplicates". The "duplicates" dictionary will look like below:

```
{3: 3, 4: 4} # it is saying 3 is repeated 3 times and 4 is repeated 4 times
```
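For comparison, the same count dictionary can be built more directly with `collections.Counter` from the standard library (an alternative sketch, not the author's one-liner, and it avoids rescanning the list with count() for every element):

```python
from collections import Counter

target = [1, 2, 3, 4, 4, 4, 3, 5, 6, 8, 4, 3]

# Counter tallies every element in one pass; keep only counts above 1.
duplicates = {item: count for item, count in Counter(target).items() if count > 1}
print(duplicates)  # {3: 3, 4: 4}
```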
If you just want all the records with duplicates alone in a list, the code is again much shorter:

```
duplicates = filter(lambda rec: target.count(rec) > 1, target)
```

Output will be:

```
[3, 4, 4, 4, 3, 4, 3]
```
This works perfectly in Python 2.7.x+ versions.
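A caveat worth noting: in Python 3, `filter` returns a lazy iterator rather than a list, so the last one-liner needs a `list()` call to materialize the result. A Python 3 sketch:

```python
target = [1, 2, 3, 4, 4, 4, 3, 5, 6, 8, 4, 3]

# In Python 3, filter() is lazy; wrap it in list() to get the list back.
duplicates = list(filter(lambda rec: target.count(rec) > 1, target))
print(duplicates)  # [3, 4, 4, 4, 3, 4, 3]
```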