Several ways to find the mode of a list in Python, with a timing comparison

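The element that appears most often in a list is called the mode. Python's list type has no method that returns it directly, so the mode has to be computed indirectly. The approaches compared here are listed below (the standard library's statistics.mode is not part of this benchmark).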

Brute force


# Brute force
import time

login_list = [5, 8, 8, 5, 10, 9, 14, 16, 17, 7, 9, 8, 9, 12, 16, 20, 9, 10, 6, 9, 18, 17, 8, 6, 9, 16, 18, 18]
new_label = []
time_start = time.time()
n = 50000
for _ in range(n):
    # Count how many times each element occurs
    count_dict = {}
    for item in login_list:
        if item in count_dict:
            count_dict[item] += 1
        else:
            count_dict[item] = 1
    # Sort the (element, count) pairs by count, largest first
    dictSortList = sorted(count_dict.items(), key=lambda x: x[1], reverse=True)
    # The first pair holds the mode
    new_label.append(dictSortList[0][0])
time_end = time.time()
print("Total time:", time_end - time_start)
print("Average time per run:", (time_end - time_start) / n)

Output:

Total time: 0.43876028060913086
Average time per run: 8.775205612182618e-06

To make the timing comparison fairer, every approach is run 50,000 times in a loop.
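The same repeat-and-average idea can also be sketched with the standard library's timeit module. This is only an illustration, not the measurement used in this post: the function name brute_force_mode is made up for the example, and the numbers it prints will differ from machine to machine.

```python
import timeit

login_list = [5, 8, 8, 5, 10, 9, 14, 16, 17, 7, 9, 8, 9, 12, 16, 20, 9, 10, 6, 9, 18, 17, 8, 6, 9, 16, 18, 18]


def brute_force_mode(values):
    """Hypothetical helper: count every element, sort by count, take the first."""
    count_dict = {}
    for item in values:
        count_dict[item] = count_dict.get(item, 0) + 1
    pairs = sorted(count_dict.items(), key=lambda x: x[1], reverse=True)
    return pairs[0][0]


n = 50000
# timeit.timeit calls the function n times and returns the total seconds.
total = timeit.timeit(lambda: brute_force_mode(login_list), number=n)
print("Total time:", total)
print("Average time per run:", total / n)
```

Wrapping the approach in a function keeps the timed region identical for every candidate, so only the function body changes between runs.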

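The brute-force approach is simple: walk through every element in the list, count how often each one occurs, sort by count, and take the element with the highest count as the mode. It takes 0.44 s for 50,000 runs. You may be tempted to look down on this approach at this point, but don't be too quick: brute force does not mean worst, and sometimes the simplest way is the most effective ^_^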
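As a side note, the counting step can be written more compactly. The sketch below is not benchmarked in this post; it assumes the standard library's collections.Counter and, as a second variant, replaces the full sort with a single max() scan over the counts.

```python
from collections import Counter

login_list = [5, 8, 8, 5, 10, 9, 14, 16, 17, 7, 9, 8, 9, 12, 16, 20, 9, 10, 6, 9, 18, 17, 8, 6, 9, 16, 18, 18]

# Counter does the per-element counting in one pass over the list.
counts = Counter(login_list)

# most_common(1) returns a one-item list holding the (element, count) pair
# with the highest count.
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 9 6

# Plain-dict variant: max() looks at each count once instead of sorting them all.
count_dict = {}
for item in login_list:
    count_dict[item] = count_dict.get(item, 0) + 1
mode_from_dict = max(count_dict, key=count_dict.get)
print(mode_from_dict)  # 9
```

Both variants return a single element even when several elements share the highest count, which matches the behaviour of the brute-force code above.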

pandas


# pandas approach
import time
import pandas as pd

login_list = [5, 8, 8, 5, 10, 9, 14, 16, 17, 7, 9, 8, 9, 12, 16, 20, 9, 10, 6, 9, 18, 17, 8, 6, 9, 16, 18, 18]
new_label = []
time_start = time.time()
n = 50000
for _ in range(n):
    # Build a one-column DataFrame from the list on every run
    tmp = pd.DataFrame({"A": login_list})
    # mode() returns a Series of modes; take the first one
    new_label.append(tmp["A"].mode()[0])
time_end = time.time()
print("Total time:", time_end - time_start)
print("Average time per run:", (time_end - time_start) / n)

Output:

Total time: 32.47795557975769
Average time per run: 0.0006495591115951538

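Surprised by that number? The pandas approach builds a Series or DataFrame and then calls its mode() function. Compared with brute force, pandas is painfully slow, more than 70 times slower on this list. This is the approach I started with, and on 600,000 records it ran for a very long time without ever finishing.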

SciPy


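# SciPy approach
import time
from scipy import stats

login_list = [5, 8, 8, 5, 10, 9, 14, 16, 17, 7, 9, 8, 9, 12, 16, 20, 9, 10, 6, 9, 18, 17, 8, 6, 9, 16, 18, 18]
new_label = []
n = 50000
time_start = time.time()
for _ in range(n):
    # Older SciPy returns arrays from mode(), hence the [0][0].
    # Newer releases (where keepdims defaults to False) return scalars;
    # there, use stats.mode(login_list)[0] instead.
    new_label.append(stats.mode(login_list)[0][0])
time_end = time.time()
print("Total time:", time_end - time_start)
print("Average time per run:", (time_end - time_start) / n)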

Output:

Total time: 7.2447569370269775
Average time per run: 0.00014489513874053954

Surprised again? The SciPy approach is similar to pandas in that it also relies on a mode() function. It is faster than pandas here, but still about 16 times slower than brute force.

NumPy


# NumPy approach
import time
import numpy as np

login_list = [5, 8, 8, 5, 10, 9, 14, 16, 17, 7, 9, 8, 9, 12, 16, 20, 9, 10, 6, 9, 18, 17, 8, 6, 9, 16, 18, 18]
new_label = []
n = 50000
time_start = time.time()
for _ in range(n):
    # counts[k] is the number of times the integer k occurs in the list
    counts = np.bincount(login_list)
    # The index of the largest count is the mode
    new_label.append(np.argmax(counts))
time_end = time.time()
print("Total time:", time_end - time_start)
print("Average time per run:", (time_end - time_start) / n)

Output:

Total time: 0.345947265625
Average time per run: 6.9189453125e-06

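So NumPy turns out to be the reliable one. The NumPy approach uses np.bincount() to spread the list out into an array with one counter per integer value (np.bincount only accepts non-negative integers).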

For example, take a = [1, 2, 3, 1].

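np.bincount(a) returns b = [0, 2, 1, 1]. b[0] is the number of times 0 occurs in a; since a contains no 0, b[0] = 0. b[1] = 2 because 1 occurs twice in a, b[2] = 1 because 2 occurs once, and so on. A sharp reader will notice that if a contains one very large element, the expanded array becomes very long and the run time goes up. That is exactly what happens.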
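The counting idea described above can be illustrated in plain Python. This is a sketch of the concept only, not NumPy's actual implementation; the function name bincount_sketch is made up for the example, and it assumes a non-empty list of non-negative integers.

```python
def bincount_sketch(values):
    """Hypothetical pure-Python illustration of the bincount idea."""
    # One counter per integer from 0 up to the largest value in the list.
    counts = [0] * (max(values) + 1)
    for v in values:
        counts[v] += 1
    return counts


a = [1, 2, 3, 1]
b = bincount_sketch(a)
print(b)  # [0, 2, 1, 1]

# The index of the largest count is the mode (the role np.argmax plays).
mode = b.index(max(b))
print(mode)  # 1

# A single large element forces a very long counter array.
large = bincount_sketch([5, 8, 8, 5, 10000])
print(len(large))  # 10001
```

The length of the counter array depends on the largest value rather than on the number of elements, which is why one outlier makes this approach slower.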

Suppose login_list = [5, 8, 8, 5, 10000]. With the NumPy approach the run time nearly triples and ends up worse than the brute-force result above:

Total time: 0.9656193256378174
Average time per run: 1.9312386512756348e-05

Summary

Approach	Brute force	pandas	SciPy	NumPy
Total time (50,000 runs)	0.44 s	32.48 s	7.24 s	0.35 s

If the list contains very large values, brute force is the better choice; otherwise NumPy is recommended.

References

https://zhuanlan.zhihu.com/p/46661241