Is “==” in sorted array not faster than unsorted array?
Note: I think the alleged duplicate question is mostly about the "<" and ">" comparisons, not "==", so it does not answer my question about the performance of the "==" operator.

For a long time I have believed that "processing" a sorted array should be faster than an unsorted one. At first I thought that using "==" on a sorted array should be faster than on an unsorted array because of (I guessed) how branch prediction works:
UNSORTEDARRAY:

```
5   == 100 F
43  == 100 F
100 == 100 T
250 == 100 F
6   == 100 F
(other elements to check)
```

SORTEDARRAY:

```
5   == 100 F
6   == 100 F
43  == 100 F
100 == 100 T
(no need to check other elements, so all are F)
```
So my guess was that SORTEDARRAY should be faster than UNSORTEDARRAY, but today I generated the two arrays in a header and tested them, and branch prediction did not seem to work the way I expected.

I generated an unsorted array and a sorted array to test:
```cpp
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <string>
using namespace std;

int main() {
    srand(time(NULL));
    int UNSORTEDARRAY[524288];
    int SORTEDARRAY[sizeof(UNSORTEDARRAY)/sizeof(int)];
    // fill both arrays with the same random values, then sort one copy
    for (int i = 0; i < sizeof(SORTEDARRAY)/sizeof(int); i++) {
        SORTEDARRAY[i] = UNSORTEDARRAY[i] = rand();
    }
    sort(SORTEDARRAY, SORTEDARRAY + sizeof(SORTEDARRAY)/sizeof(int));
    // emit both arrays as const int definitions into number.h
    string u = "const int UNSORTEDARRAY[]={";
    string s = "const int SORTEDARRAY[]={";
    for (int i = 0; i < sizeof(UNSORTEDARRAY)/sizeof(int); i++) {
        u += to_string(UNSORTEDARRAY[i]) + ",";
        s += to_string(SORTEDARRAY[i]) + ",";
    }
    u.erase(u.end() - 1);
    s.erase(s.end() - 1);
    u += "};\n";
    s += "};\n";
    ofstream out("number.h");
    string code = u + s;
    out << code;
    out.close();
    return 0;
}
```
So to test, I just count how many values are == RAND_MAX/2, as shown below:
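The original test snippet is not reproduced here; a minimal sketch of a loop that counts matches against RAND_MAX/2 and times it with clock() would look roughly like this (details assumed, not the asker's exact code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include "number.h"   // the generated UNSORTEDARRAY / SORTEDARRAY

int main() {
    int count = 0;
    clock_t start = clock();
    for (int i = 0; i < sizeof(SORTEDARRAY)/sizeof(int); i++) {
        // swap in UNSORTEDARRAY, or > instead of ==, for the other test cases
        if (SORTEDARRAY[i] == RAND_MAX / 2) {
            count++;
        }
    }
    printf("%f %d\n", double(clock() - start) / CLOCKS_PER_SEC, count);
    return 0;
}
```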
Run 3 times:

UNSORTEDARRAY:

```
0.005376
0.005239
0.005220
```

SORTEDARRAY:

```
0.005334
0.005120
0.005223
```
That seems like a negligible performance difference, so I didn't believe it and tried changing "SORTEDARRAY[i]==RAND_MAX/2" to "SORTEDARRAY[i]>RAND_MAX/2" to see whether it makes a difference:
UNSORTEDARRAY:

```
0.008407
0.008363
0.008606
```

SORTEDARRAY:

```
0.005306
0.005227
0.005146
```
This time the difference is huge.

So is "==" on a sorted array not faster than on an unsorted array? And if so, why is ">" on a sorted array faster than on an unsorted array, but "==" is not?
One thing that immediately comes to mind is the CPU's branch prediction algorithm.

With the > comparison, the branch in the sorted array behaves very regularly: the condition is false for every element below RAND_MAX/2 and then true for every element above it, one long run of each. Even the simplest branch predictor handles that pattern almost perfectly.

In the unsorted array, the outcome of > is essentially random from one element to the next, which defeats any branch predictor.

This is what makes the sorted version faster.

With the == comparison, the condition is false for practically every element (only a value exactly equal to RAND_MAX/2 makes it true), so the branch is trivially predictable whether the array is sorted or not, and the timings come out essentially the same.
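To make that concrete, here is a small self-contained simulation (an illustrative sketch, not code from the question; the 2-bit saturating counter is a textbook stand-in, not whatever predictor a real CPU uses). It counts how often such a predictor would mispredict the == and > branches on sorted and unsorted copies of the same random data: == mispredicts almost never in either case, while > mispredicts roughly half the time on unsorted data and almost never on sorted data.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Fraction of mispredicted branches for a stream of taken/not-taken outcomes,
// using a single 2-bit saturating counter (0..3, >=2 predicts "taken").
static double misprediction_rate(const std::vector<bool>& taken) {
    int counter = 1;
    long misses = 0;
    for (bool t : taken) {
        bool predict = counter >= 2;
        if (predict != t) ++misses;
        counter = t ? std::min(counter + 1, 3) : std::max(counter - 1, 0);
    }
    return double(misses) / taken.size();
}

int main() {
    const int N = 524288;
    std::vector<int> unsorted(N);
    for (int& x : unsorted) x = rand();
    std::vector<int> sorted = unsorted;
    std::sort(sorted.begin(), sorted.end());

    // Build the branch-outcome stream for either condition over an array.
    auto outcomes = [](const std::vector<int>& v, bool greater) {
        std::vector<bool> r;
        r.reserve(v.size());
        for (int x : v) r.push_back(greater ? x > RAND_MAX / 2 : x == RAND_MAX / 2);
        return r;
    };

    printf("==  unsorted: %.4f%% mispredicted\n", 100 * misprediction_rate(outcomes(unsorted, false)));
    printf("==  sorted  : %.4f%% mispredicted\n", 100 * misprediction_rate(outcomes(sorted, false)));
    printf(">   unsorted: %.4f%% mispredicted\n", 100 * misprediction_rate(outcomes(unsorted, true)));
    printf(">   sorted  : %.4f%% mispredicted\n", 100 * misprediction_rate(outcomes(sorted, true)));
    return 0;
}
```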
Note: I'm posting this as an answer because it is too long for a comment.

The effect here is exactly what is already explained in detail in the many answers to the question this one was marked a duplicate of: in that case, processing a sorted array was faster because of branch prediction.

Here, the culprit is once again branch prediction.

The moral:
> I believe "processing" a sorted array should be faster than [an] unsorted array.
You need to know why. This isn't some magical rule, and it isn't always true.

The == comparison depends far less on the data's ordering than > does: sorted or not, a test against RAND_MAX/2 with == is false for virtually every element, so the branch is almost perfectly predictable in both cases. With >, sorting the data turns a random mix of outcomes into one long run of false followed by one long run of true, which is exactly the pattern a predictor handles best.

You can see this directly with perf stat; compare the branch-misses counts for the four runs below (proc-eq does the == test, proc-gt the > test, and the sorted runs pipe the input through sort -n):
```
jason@io /tmp $ lz4 -d ints | perf stat ./proc-eq >/dev/null
Successfully decoded 104824717 bytes

 Performance counter stats for './proc-eq':

       5226.932577      task-clock (msec)         #    0.953 CPUs utilized
                31      context-switches          #    0.006 K/sec
                24      cpu-migrations            #    0.005 K/sec
             3,479      page-faults               #    0.666 K/sec
    15,763,486,767      cycles                    #    3.016 GHz
     4,238,973,549      stalled-cycles-frontend   #   26.89% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,522,072,416      instructions              #    2.00  insns per cycle
                                                  #    0.13  stalled cycles per insn
     8,515,545,178      branches                  # 1629.167 M/sec
        10,261,743      branch-misses             #    0.12% of all branches

       5.483071045 seconds time elapsed

jason@io /tmp $ lz4 -d ints | sort -n | perf stat ./proc-eq >/dev/null
Successfully decoded 104824717 bytes

 Performance counter stats for './proc-eq':

       5536.031410      task-clock (msec)         #    0.348 CPUs utilized
               198      context-switches          #    0.036 K/sec
                21      cpu-migrations            #    0.004 K/sec
             3,604      page-faults               #    0.651 K/sec
    16,870,541,124      cycles                    #    3.047 GHz
     5,300,218,855      stalled-cycles-frontend   #   31.42% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,526,006,118      instructions              #    1.87  insns per cycle
                                                  #    0.17  stalled cycles per insn
     8,516,336,829      branches                  # 1538.347 M/sec
        10,980,571      branch-misses             #    0.13% of all branches

jason@io /tmp $ lz4 -d ints | perf stat ./proc-gt >/dev/null
Successfully decoded 104824717 bytes

 Performance counter stats for './proc-gt':

       5293.065703      task-clock (msec)         #    0.957 CPUs utilized
                38      context-switches          #    0.007 K/sec
                50      cpu-migrations            #    0.009 K/sec
             3,466      page-faults               #    0.655 K/sec
    15,972,451,322      cycles                    #    3.018 GHz
     4,350,726,606      stalled-cycles-frontend   #   27.24% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,537,365,299      instructions              #    1.97  insns per cycle
                                                  #    0.14  stalled cycles per insn
     8,515,606,640      branches                  # 1608.823 M/sec
        15,241,198      branch-misses             #    0.18% of all branches

       5.532285374 seconds time elapsed

jason@io /tmp $ lz4 -d ints | sort -n | perf stat ./proc-gt >/dev/null

      15.930144154 seconds time elapsed

 Performance counter stats for './proc-gt':

       5203.873321      task-clock (msec)         #    0.339 CPUs utilized
                 7      context-switches          #    0.001 K/sec
                22      cpu-migrations            #    0.004 K/sec
             3,459      page-faults               #    0.665 K/sec
    15,830,273,846      cycles                    #    3.042 GHz
     4,456,369,958      stalled-cycles-frontend   #   28.15% frontend cycles idle
   <not supported>      stalled-cycles-backend
    31,540,409,224      instructions              #    1.99  insns per cycle
                                                  #    0.14  stalled cycles per insn
     8,516,186,042      branches                  # 1636.509 M/sec
        10,205,058      branch-misses             #    0.12% of all branches

      15.365528326 seconds time elapsed
```
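The proc-eq and proc-gt binaries themselves aren't shown above; presumably each is just a loop that reads integers from stdin and counts how many satisfy the condition. A minimal sketch of what such a program could look like (an assumption for illustration, not the actual code behind the numbers above):

```cpp
// Hypothetical stand-in for a proc-eq style program: read integers from
// stdin and count how many equal RAND_MAX/2.  A proc-gt variant would use
// `value > RAND_MAX / 2` instead.
#include <cstdio>
#include <cstdlib>

int main() {
    long count = 0;
    int value;
    while (scanf("%d", &value) == 1) {
        if (value == RAND_MAX / 2) {
            count++;
        }
    }
    printf("%ld\n", count);
    return 0;
}
```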