Python: Removing duplicates in lists

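I need to write a program that checks whether a list contains any duplicates and, if it does, removes them and returns a new list containing the items that were not duplicated/removed. This is what I have, but to be honest I do not know what to do.

def remove_duplicates():
    t = ['a', 'b', 'c', 'd']
    t2 = ['a', 'c', 'd']
    for t in t2:
        t.append(t.remove())
    return t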

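The common approach to getting a collection of unique items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do: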

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

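As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When a set is converted back to a list, an arbitrary order is created.

If order is important to you, you will have to use a different mechanism. A very common solution is to rely on OrderedDict, which keeps the order of keys during insertion: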

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]

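Note that this has the overhead of creating a dictionary first and then creating a list from it. So if you do not actually need to preserve the order, you are better off using a set. See the related question on preserving order when removing duplicates for more details and alternative approaches.

Finally, note that both the set and the OrderedDict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), you will have to use a slow approach in which you basically compare every item with every other item in a nested loop.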
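The slow path for unhashable items mentioned above can be sketched as follows. This is only a minimal illustration and not part of the original answer; the function name unique_unhashable and the sample data are made up for the example.

```python
def unique_unhashable(items):
    """Order-preserving de-duplication that relies only on ==, not on hashing."""
    result = []
    for item in items:
        # `in` on a list compares item with every element kept so far,
        # which is the nested loop described above: O(n^2) comparisons overall.
        if item not in result:
            result.append(item)
    return result


rows = [[1, 2], [3], [1, 2], {"a": 1}, [3], {"a": 1}]
deduped = unique_unhashable(rows)
print(deduped)  # [[1, 2], [3], {'a': 1}]
```

Because nothing is hashed, this works for lists and dicts, at the cost of quadratic running time on long inputs.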

In Python 2.7, the new way of removing duplicates from an iterable while keeping the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

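In Python 3.5, OrderedDict has a C implementation. My timings show that this is now both the fastest and the shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This feature holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fast way of deduplicating while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']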

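It's a one-liner: list(set(source_list)) will do the trick.

A set is something that can't possibly have duplicates.

Update: an order-preserving approach takes two lines:

from collections import OrderedDict
list(OrderedDict((x, True) for x in source_list).keys())

Here we use the fact that OrderedDict remembers the insertion order of keys and does not change it when the value for a particular key is updated. We insert True as the value, but we could insert anything; the values are simply not used. (A set works a lot like a dict whose values are ignored, too.)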

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
...     if i not in s:
...         s.append(i)
...
>>> s
[1, 2, 3, 5, 6, 7, 8]

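If you don't care about the order, just do this:

def remove_duplicates(l):
    return list(set(l))

A set is guaranteed not to contain duplicates.

To make a new list that keeps the order of the first occurrence of each duplicated element in L:

newlist = [ii for n, ii in enumerate(L) if ii not in L[:n]]

For example, if L = [1, 2, 2, 3, 4, 2, 4, 3, 5], then newlist will be [1, 2, 3, 4, 5].

This checks that each new element has not already appeared earlier in the list before adding it. It also needs no imports.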

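Today a colleague sent me the accepted answer as part of his code for a code review. While I certainly admire the elegance of the answer in question, I am not happy with its performance. I tried this solution instead (I use a set to reduce lookup time):

def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if val not in added:
            out_list.append(val)
            added.add(val)
    return out_list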

To compare efficiency, I used a random sample of 100 integers, 62 of which were unique:

from random import randint
x = [randint(0, 100) for _ in range(100)]

In [131]: len(set(x))
Out[131]: 62

Here are the results of the measurements:

In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop

In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop

So what happens if the set is removed from the solution?

def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if val not in out_list:
            out_list.append(val)
    return out_list

The result is not as bad as with the OrderedDict, but it still takes more than 3 times as long as the original solution.

In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop
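To repeat this kind of comparison outside IPython, a minimal sketch with the standard timeit module might look like the following. The helper names, the fixed seed and the repeat count are assumptions made for illustration, and the absolute numbers will differ from the ones quoted above.

```python
import random
import timeit
from collections import OrderedDict


def with_ordered_dict(values):
    return list(OrderedDict.fromkeys(values))


def with_set_lookup(values):
    out, added = [], set()
    for val in values:
        if val not in added:
            out.append(val)
            added.add(val)
    return out


def with_list_lookup(values):
    out = []
    for val in values:
        if val not in out:
            out.append(val)
    return out


random.seed(0)  # fixed seed so every run uses the same sample
x = [random.randint(0, 100) for _ in range(100)]

candidates = [with_ordered_dict, with_set_lookup, with_list_lookup]
results = [func(x) for func in candidates]  # all three should agree

runs = 2000
for func in candidates:
    seconds = timeit.timeit(lambda: func(x), number=runs)
    print("{:<18} {:.2f} us per call".format(func.__name__, seconds / runs * 1e6))
```

Checking that all candidates return the same list before timing them guards against comparing functions that do different work.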

Another way of doing it:

>>> seq = [1, 2, 3, 'a', 'a', 1, 2]
>>> list(dict.fromkeys(seq))  # Python 3.7+ keeps insertion order
[1, 2, 3, 'a']

There are also solutions using Pandas and NumPy. Both return a NumPy array, so you have to call .tolist() if you want a list.

t = ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
t2 = ['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a']

Pandas solution

Using the Pandas function unique():

import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']

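NumPy solution

Using the NumPy function unique():

import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']

Note that numpy.unique() also sorts the values, so the list t2 is returned sorted. If you want to preserve the order, use the indices of the first occurrences:

_, idx = np.unique(t2, return_index=True)
# t2 is a plain list, so convert it to an array before indexing with idx
np.array(t2)[np.sort(idx)].tolist()
>>>['c','b','a']

This solution is not as elegant as the others. However, unlike pandas.unique(), numpy.unique() also lets you check whether nested arrays are unique along one selected axis.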
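The axis behaviour mentioned above can be sketched like this, assuming a NumPy version that supports the axis argument of numpy.unique (1.13 or later). The array contents are made up for the example.

```python
import numpy as np

rows = np.array([[3, 4], [1, 2], [3, 4], [5, 6], [1, 2]])

# axis=0 treats every row as one item; the result comes back sorted
unique_rows, first_idx = np.unique(rows, axis=0, return_index=True)
print(unique_rows.tolist())  # [[1, 2], [3, 4], [5, 6]]

# Sorting the first-occurrence indices restores the original row order
ordered_rows = rows[np.sort(first_idx)]
print(ordered_rows.tolist())  # [[3, 4], [1, 2], [5, 6]]
```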

Simple and easy:

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]

Output:

>>> cleanlist
[1, 2, 3, 5, 6, 7, 8]

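I had a dict in my list, so I could not use the approaches above. I got the error:

TypeError: unhashable type:

So if you care about order and/or some items are unhashable, you might find this useful:

def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list

Some may consider a list comprehension with a side effect not to be a good solution. Here is an alternative:

def make_unique(original_list):
    unique_list = []
    # map() is lazy on Python 3, so wrap it in list() to force the appends to run
    list(map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list))
    return unique_list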

Try using sets:

# The sets module was removed in Python 3; the built-in set type replaces sets.Set
t = set(['a', 'b', 'c', 'd'])
t1 = set(['a', 'b', 'c'])

print(t | t1)
print(t - t1)

All the order-preserving approaches I have seen here so far either use naive comparison (with O(n^2) time complexity at best) or heavyweight OrderedDict/set+list combinations that are limited to hashable inputs. Here is a hash-independent O(n log n) solution:

Update: added the key argument, documentation and Python 3 compatibility.

from functools import reduce  # built in on Python 2, must be imported on Python 3

def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order.
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded
    if another item (B), such that A == B, has already been encountered and taken.
    If you provide a key, this condition changes to key(A) == key(B); the callable
    must return orderable objects.
    """
    # Enumerate the list to restore order later; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    if not srt_enum:  # an empty iterable has no first element to seed the reduce with
        return []
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))]

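You could also do this:

>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]

The reason this works is that the index method returns only the first index of an element. Duplicate elements have higher indices. From the documentation: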

list.index(x[, start[, end]])
Return zero-based index in the list of
the first item whose value is x. Raises a ValueError if there is no
such item.

The best approach to removing duplicates from a list is to use the set() function available in Python and then convert that set back into a list.

In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']

Without using set:

data = [1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data = []
for dat in data:
    if dat not in uni_data:
        uni_data.append(dat)

print(uni_data)

An order-preserving reduce variant:

Assume that we have the following list:

l = [5, 6, 6, 1, 1, 2, 2, 3, 4]

Reduce variant (inefficient):

>>> from functools import reduce  # needed on Python 3
>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]

5 times faster, but more sophisticated:

>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

Explanation:

default = (list(), set())
# use the list to keep order
# use the set to make lookups faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]

You can use the following function:

def rem_dupes(dup_list):
    yooneeks = []
    for elem in dup_list:
        if elem not in yooneeks:
            yooneeks.append(elem)
    return yooneeks

Example:

my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']

Usage:

rem_dupes(my_list)

['this', 'is', 'a', 'list', 'with', 'dupicates', 'in', 'the']

This one cares about the order without too much hassle (OrderedDict and others). It is probably neither the most Pythonic nor the shortest way, but it does the trick:

def remove_duplicates(items):
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in items:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list

Another, better approach could be:

import pandas as pd

myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)

#> [1, 2, 3, 5, 6, 7, 8]

The order is still preserved.

The code below is a simple way to remove duplicates from a list:

def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print(remove_duplicates([1, 2, 2, 3, 3, 4]))

It returns [1, 2, 3, 4].

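Many other answers suggest different ways to do this, but they are all batch operations, and some of them throw away the original order. That might be fine depending on what you need, but if you want to iterate over the values in the order of the first occurrence of each value, and you want to remove the duplicates on the fly rather than all at once, you can use this generator:

def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

This returns a generator/iterator, so you can use it anywhere you can use an iterator.

for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')

print()

Output:

1 2 3 4 5 6 7 8

If you do want a list, you can do this:

unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))

print(unique_list)

Output:

[1, 2, 3, 4, 5, 6, 7, 8]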
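Because duplicates are dropped on the fly, the generator can also be applied to a stream that never ends, which a batch approach cannot do. The sketch below repeats uniqify so that it runs on its own; the stream expression and the use of itertools.islice are illustrative choices, not part of the original answer.

```python
from itertools import count, islice


def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item


# count() never terminates; n * n % 10 keeps repeating the same few last digits
stream = (n * n % 10 for n in count())

# islice stops pulling from the generator once five unique values have arrived.
# Asking for more unique values than the stream contains would loop forever.
first_five = list(islice(uniqify(stream), 5))
print(first_five)  # [0, 1, 4, 9, 6]
```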

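If you want to preserve the order without using any external modules, here is an easy way to do it:

>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]

Note: this method preserves the order of appearance, so, as seen above, 9 comes after 1 because that is where it first appeared. This is, however, the same result as you would get from

from collections import OrderedDict
ulist = list(OrderedDict.fromkeys(l))

but it is much shorter and runs faster.

This works because every time fromkeys tries to create a new key that already exists, it simply overwrites the existing entry. That does not affect the dictionary at all, however, because fromkeys creates a dictionary in which every key has the value None, so all duplicates are effectively eliminated.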
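To make the explanation above concrete, here is a small sketch of what the intermediate dictionary looks like, together with a hand-written loop that behaves roughly like fromkeys. It assumes Python 3.7 or later, where dict preserves insertion order; the variable names are made up for the example.

```python
t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]

d = dict.fromkeys(t)
print(d)
# {1: None, 9: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None}

# Roughly what fromkeys does, written out by hand
manual = {}
for item in t:
    manual[item] = None  # re-assigning an existing key keeps its original position

unique = list(d)  # iterating over a dict yields its keys
print(unique)  # [1, 9, 2, 3, 4, 5, 6, 7, 8]
```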

Using set:

a = [0, 1, 2, 3, 4, 3, 3, 4]
a = list(set(a))
print(a)

Using unique:

import numpy as np
a = [0, 1, 2, 3, 4, 3, 3, 4]
a = np.unique(a).tolist()
print(a)

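Here is the fastest Pythonic solution compared with the others listed in the replies.

Relying on the details of short-circuit evaluation makes it possible to use a list comprehension, which is fast enough. visited.add(item) always returns None, which evaluates as False, so the right-hand side of the or is always the result of the expression.

Time it yourself:

def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out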
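The short-circuit behaviour that the comprehension relies on can be checked in isolation. The sketch below also spells out the same logic as an explicit loop for comparison; deduplicate_plain and the sample data are illustrative names, not part of the original answer.

```python
visited = set()

result_of_add = visited.add("a")   # set.add mutates the set and returns None
picked = visited.add("b") or "b"   # None is falsy, so `or` evaluates to "b"


def deduplicate(sequence):
    visited = set()
    adder = visited.add
    return [adder(item) or item for item in sequence if item not in visited]


def deduplicate_plain(sequence):
    # The same logic as an explicit loop, without relying on the `or` trick
    visited = set()
    out = []
    for item in sequence:
        if item not in visited:
            visited.add(item)
            out.append(item)
    return out


sample = [3, 1, 3, 0, 1, 2, 0]
print(deduplicate(sample))  # [3, 1, 0, 2]
```

Falsy items such as 0 still come through correctly, because the left-hand side of the or is always None and the item itself is always the right-hand side.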

A very simple way in Python 3:

>>> n = [1, 2, 3, 4, 1, 1]
>>> n
[1, 2, 3, 4, 1, 1]
>>> m = sorted(list(set(n)))
>>> m
[1, 2, 3, 4]

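Check this if you want to remove duplicates (editing in place rather than returning a new list) without using the built-in set, dict.keys, uniqify or Counter:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> for i in t[:]:  # iterate over a copy; removing from the list being looped over skips items
...     if i in t[t.index(i)+1:]:
...         t.remove(i)
...
>>> t
[3, 1, 2, 5, 6, 7, 8]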

You can use set to remove duplicates:

mylist = list(set(mylist))

But note that the result will be unordered. If that is a problem:

mylist.sort()
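Note that sort() orders the items by value; it does not restore the order in which they originally appeared. If the original order is what matters, one possible sketch is to sort the set by first-occurrence index instead. The sample list is made up, and list.index makes this quadratic, so it only suits short lists.

```python
mylist = [3, 1, 3, 2, 1]

by_value = sorted(set(mylist))                               # ordered by value
by_first_occurrence = sorted(set(mylist), key=mylist.index)  # ordered as first seen

print(by_value)             # [1, 2, 3]
print(by_first_occurrence)  # [3, 1, 2]
```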

Here is an example that returns a list without repetitions while preserving order. It does not need any external imports.

def GetListWithoutRepetitions(loInput):
    # Return a list consisting of the elements of list/tuple loInput, without repetitions.
    # Note: only consecutive repetitions are collapsed, so equal items must be adjacent.
    # Example: GetListWithoutRepetitions([None,None,1,1,2,2,3,3,3])
    # Returns: [None, 1, 2, 3]

    if loInput == []:
        return []

    loOutput = []

    if loInput[0] is None:
        oGroupElement = 1
    else:  # loInput[0] != None
        oGroupElement = None

    for oElement in loInput:
        if oElement != oGroupElement:
            loOutput.append(oElement)
            oGroupElement = oElement
    return loOutput

I think converting to a set is the easiest way to remove duplicates:

list1 = [1, 2, 1]
list1 = list(set(list1))
print(list1)

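You can do this simply by using sets.

Step 1: Get the elements that differ between the lists.
Step 2: Get the elements common to the lists.
Step 3: Combine them.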
In [1]: a = ["apples","bananas","cucumbers"]

In [2]: b = ["pears","apples","watermelons"]

In [3]: set(a).symmetric_difference(b).union(set(a).intersection(b))
Out[3]: {'apples', 'bananas', 'cucumbers', 'pears', 'watermelons'}

To remove the duplicates, turn the list into a set, then turn it back into a list and print/use it. A set is guaranteed to have unique elements. For example:

a = [1, 2, 3, 4, 5, 9, 11, 15]
b = [4, 5, 6, 7, 8]
c = a + b
print(c)
print(list(set(c)))  # one line for getting unique elements of c

The output is as follows (checked in Python 2.7):

[1, 2, 3, 4, 5, 9, 11, 15, 4, 5, 6, 7, 8]  # simple list addition with duplicates
[1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 15]  # duplicates removed!!

def remove_duplicates(A):
    # Pop from the end so the indices still to be visited do not shift
    [A.pop(i) for i in reversed(range(len(A))) if A.count(A[i]) != 1]
    return A

A list comprehension for removing duplicates.

If you don't care about order and want something different from the Pythonic approaches suggested above (that is, something that can be used in interviews), then:

def remove_dup(arr):
    size = len(arr)
    j = 0  # To store index of next unique element
    for i in range(0, size-1):
        # If current element is not equal
        # to next element then store that
        # current element
        if(arr[i] != arr[i+1]):
            arr[j] = arr[i]
            j += 1

    arr[j] = arr[size-1]  # Store the last element: unique or repeated, it has not been stored yet

    return arr[0:j+1]

if __name__ == '__main__':
    arr = [10, 10, 1, 1, 1, 3, 3, 4, 5, 6, 7, 8, 8, 9]
    print(remove_dup(sorted(arr)))

Time complexity: O(N) for the scan itself (the sorted() call in the example adds O(N log N))

Auxiliary space: O(N)

Reference: http://www.geeksforgeks.org/remove-duplicates-sorted-array/

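Many answers here use a set(..) (which is fast, given that the elements are hashable) or a list (which has the downside of resulting in an O(n^2) algorithm).

The function I propose is a hybrid one: we use a set(..) for items that are hashable and a list(..) for those that are not. Furthermore, it is implemented as a generator, so that we can, for instance, limit the number of items or do some additional filtering.

Finally, we can also use a key argument to specify in what way the elements should be unique. For instance, we can use this to filter a list of strings so that every string in the output has a different length.

def uniq(iterable, key=lambda x: x):
    seens = set()
    seenl = []
    for item in iterable:
        k = key(item)
        try:
            seen = k in seens
        except TypeError:
            seen = k in seenl
        if not seen:
            yield item
            try:
                seens.add(k)
            except TypeError:
                seenl.append(k)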

We can now, for instance, use it like this:

>>> list(uniq(["apple","pear","banana","lemon"], len))
['apple', 'pear', 'banana']
>>> list(uniq(["apple","pear","lemon","banana"], len))
['apple', 'pear', 'banana']
>>> list(uniq(["apple","pear", {},"lemon", [],"banana"], len))
['apple', 'pear', {}, 'banana']
>>> list(uniq(["apple","pear", {},"lemon", [],"banana"]))
['apple', 'pear', {}, 'lemon', [], 'banana']
>>> list(uniq(["apple","pear", {},"lemon", {},"banana"]))
['apple', 'pear', {}, 'lemon', 'banana']

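It is thus a uniqueness filter that works on any iterable and filters out duplicates, regardless of whether the items are hashable or not.

It makes one assumption: if one object is hashable and another one is not, the two objects are never equal. Strictly speaking, this can happen, although it is very uncommon.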
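One concrete case where that assumption does not hold can be sketched with built-in types: a bytes object is hashable, a bytearray is not, yet the two compare equal when their contents match. The uniq function is repeated here so the snippet runs on its own; the sample values are made up.

```python
def uniq(iterable, key=lambda x: x):
    seens = set()
    seenl = []
    for item in iterable:
        k = key(item)
        try:
            seen = k in seens
        except TypeError:
            seen = k in seenl
        if not seen:
            yield item
            try:
                seens.add(k)
            except TypeError:
                seenl.append(k)


items = [b"ab", bytearray(b"ab"), b"ab"]
result = list(uniq(items))

# The second bytes object is recognised as a duplicate, but the bytearray is
# only looked up in the list of unhashable keys, which is empty at that point,
# so it is kept even though it equals the bytes object.
print(result)                  # [b'ab', bytearray(b'ab')]
print(result[0] == result[1])  # True
```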

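Another solution might be the following: create a dictionary from the list with the items as keys and their indices as values, and then print the dictionary keys.

>>> lst = [1, 3, 4, 2, 1, 21, 1, 32, 21, 1, 6, 5, 7, 8, 2]
>>>
>>> dict_enum = {item: index for index, item in enumerate(lst)}
>>> print(list(dict_enum))  # Python 3.7+ keeps the order of first insertion
[1, 3, 4, 2, 21, 32, 6, 5, 7, 8]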

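Unfortunately, most answers here either do not preserve the order or are too long. Here is a simple, order-preserving answer.

s = [1, 2, 3, 4, 5, 2, 5, 6, 7, 1, 3, 9, 3, 5]
x = []

[x.append(i) for i in s if i not in x]
print(x)

This gives you x with the duplicates removed and the order preserved.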

For completeness, and since this is a very popular question, the toolz library offers a unique function:

>>> from toolz import unique
>>> tuple(unique((1, 2, 3)))
(1, 2, 3)
>>> tuple(unique((1, 2, 1, 3)))
(1, 2, 3)

def remove_duplicates(input_list):
    if input_list == []:
        return []
    # sort list from smallest to largest
    input_list = sorted(input_list)
    # initialize output list with first element of the sorted input list
    output_list = [input_list[0]]
    for item in input_list:
        if item > output_list[-1]:
            output_list.append(item)
    return output_list

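This is just a readable function that is easy to understand. I used the dict data structure and some built-in functions, and it has a better complexity of O(n).

def undup(dup_list):
    b = {}
    for i in dup_list:
        b.update({i: 1})
    return list(b.keys())

a = ["a", 'b', 'a']
print(undup(a))

Disclaimer: you may get an indentation error if you copy and paste; indent the code above properly before using it.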

Python has many built-in functions, and you can use set() to remove duplicates from a list. As per your example, there are two lists, t and t2, below:

t = ['a', 'b', 'c', 'd']
t2 = ['a', 'c', 'd']
result = list(set(t) - set(t2))
result

Answer: ['b']

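Sometimes you need to remove the duplicate items in place, without creating a new list. For example, the list is big, or you want to keep it as a shadow copy.

from collections import Counter
# t is the existing list to de-duplicate in place
cntDict = Counter(t)
for item, cnt in cntDict.items():
    for _ in range(cnt-1):
        t.remove(item)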
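Another way to keep the same list object is slice assignment, sketched below under the assumptions that the items are hashable and that dict preserves insertion order (Python 3.7 or later). Unlike the Counter loop, which removes the earliest occurrences, this keeps the first occurrence of each item. It still builds a temporary dict, so it does not save memory; it only avoids rebinding the name to a new list.

```python
t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
alias = t  # a second reference to the same list object

# Assigning to the full slice replaces the contents but keeps the list object,
# so every existing reference sees the de-duplicated data.
t[:] = dict.fromkeys(t)

print(t)           # [1, 2, 3, 5, 6, 7, 8]
print(alias is t)  # True
```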

It requires installing a third-party module, but the package iteration_utilities contains a unique_everseen function that can remove all duplicates while preserving the order:

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(['a', 'b', 'c', 'd'] + ['a', 'c', 'd']))
['a', 'b', 'c', 'd']

If you want to avoid the overhead of the list addition, you can use itertools.chain instead:

>>> from itertools import chain
>>> list(unique_everseen(chain(['a', 'b', 'c', 'd'], ['a', 'c', 'd'])))
['a', 'b', 'c', 'd']

unique_everseen also works if the lists contain unhashable items (for example lists):

>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen([['a'], ['b'], 'c', 'd'] + ['a', 'c', 'd']))
[['a'], ['b'], 'c', 'd', 'a']

However, that will be (much) slower than when the items are hashable.

Disclosure: I'm the author of the iteration_utilities library.

list_with_unique_items = list(set(list_with_duplicates))