Java: > vs. >= in bubble sort causes significant performance difference

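I just stumbled upon something. At first I thought it might be a case of branch misprediction, like it is in this case, but I cannot explain why branch misprediction would cause this behaviour.

I implemented two versions of bubble sort in Java and did some performance testing: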

import java.util.Random;

public class BubbleSortAnnomaly {

    public static void main(String... args) {
        final int ARRAY_SIZE = Integer.parseInt(args[0]);
        final int LIMIT = Integer.parseInt(args[1]);
        final int RUNS = Integer.parseInt(args[2]);

        int[] a = new int[ARRAY_SIZE];
        int[] b = new int[ARRAY_SIZE];
        Random r = new Random();
        for (int run = 0; RUNS > run; ++run) {
            // Fill both arrays with the same random values in [0, LIMIT).
            for (int i = 0; i < ARRAY_SIZE; i++) {
                a[i] = r.nextInt(LIMIT);
                b[i] = a[i];
            }

            System.out.print("Sorting with sortA: ");
            long start = System.nanoTime();
            int swaps = bubbleSortA(a);

            System.out.println((System.nanoTime() - start) + " ns. "
                    + "It used " + swaps + " swaps.");

            System.out.print("Sorting with sortB: ");
            start = System.nanoTime();
            swaps = bubbleSortB(b);

            System.out.println((System.nanoTime() - start) + " ns. "
                    + "It used " + swaps + " swaps.");
        }
    }

    // Swaps only when the left element is strictly greater.
    public static int bubbleSortA(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] > a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    // Also swaps equal elements; this is the only difference from bubbleSortA.
    public static int bubbleSortB(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] >= a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    private static void swap(int[] a, int j, int i) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}

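As you can see, the only difference between the two sorting methods is > vs. >=. When running the program with java BubbleSortAnnomaly 50000 10 10, you would obviously expect sortB to be slower than sortA, because it has to execute more swap(...)s. But I get the following (or similar) output on three different machines: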

Sorting with sortA: 4.214 seconds. It used 564960211 swaps.
Sorting with sortB: 2.278 seconds. It used 1249750569 swaps.
Sorting with sortA: 4.199 seconds. It used 563355818 swaps.
Sorting with sortB: 2.254 seconds. It used 1249750348 swaps.
Sorting with sortA: 4.189 seconds. It used 560825110 swaps.
Sorting with sortB: 2.264 seconds. It used 1249749572 swaps.
Sorting with sortA: 4.17 seconds. It used 561924561 swaps.
Sorting with sortB: 2.256 seconds. It used 1249749766 swaps.
Sorting with sortA: 4.198 seconds. It used 562613693 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749880 swaps.
Sorting with sortA: 4.19 seconds. It used 561658723 swaps.
Sorting with sortB: 2.281 seconds. It used 1249751070 swaps.
Sorting with sortA: 4.193 seconds. It used 564986461 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749681 swaps.
Sorting with sortA: 4.203 seconds. It used 562526980 swaps.
Sorting with sortB: 2.27 seconds. It used 1249749609 swaps.
Sorting with sortA: 4.176 seconds. It used 561070571 swaps.
Sorting with sortB: 2.241 seconds. It used 1249749831 swaps.
Sorting with sortA: 4.191 seconds. It used 559883210 swaps.
Sorting with sortB: 2.257 seconds. It used 1249749371 swaps.

When I set the LIMIT parameter to, e.g., 50000 (java BubbleSortAnnomaly 50000 50000 10), I get the expected results:

Sorting with sortA: 3.983 seconds. It used 625941897 swaps.
Sorting with sortB: 4.658 seconds. It used 789391382 swaps.
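As a sanity check on the sortA swap counts above (a sketch, not part of the original post): bubble sort with a strict > comparison performs exactly one swap per inversion, and for values drawn uniformly from 0 to LIMIT - 1 a random pair is an inversion with probability (LIMIT - 1) / (2 * LIMIT). The class and method names below are made up for illustration.

```java
public class ExpectedSwaps {

    /** Number of inner-loop comparisons bubble sort makes on n elements: n(n-1)/2. */
    public static long comparisons(long n) {
        return n * (n - 1) / 2;
    }

    /** Expected swaps of the strict (>) variant for uniform values in [0, limit). */
    public static long expectedStrictSwaps(long n, long limit) {
        // Each of the n(n-1)/2 pairs is an inversion with probability (limit-1)/(2*limit).
        return comparisons(n) * (limit - 1) / (2 * limit);
    }

    public static void main(String[] args) {
        long n = 50_000;
        System.out.println("comparisons:           " + comparisons(n));
        System.out.println("expected, LIMIT=10:    " + expectedStrictSwaps(n, 10));
        System.out.println("expected, LIMIT=50000: " + expectedStrictSwaps(n, 50_000));
    }
}
```

For 50000 elements this gives 1,249,975,000 comparisons, roughly 562.5 million expected swaps for LIMIT = 10 (the runs above report 560 to 565 million) and roughly 625 million for LIMIT = 50000 (625.9 million reported). With LIMIT = 10, sortB swaps on almost every single comparison.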

I ported the program to C++ to determine whether this problem is Java-specific. Here is the C++ code.

#include <cstdlib>
#include <iostream>

#include <omp.h>  // omp_get_wtime(); needs OpenMP enabled, e.g. g++ -fopenmp

#ifndef ARRAY_SIZE
#define ARRAY_SIZE 50000
#endif

#ifndef LIMIT
#define LIMIT 10
#endif

#ifndef RUNS
#define RUNS 10
#endif

void swap(int * a, int i, int j)
{
    int h = a[i];
    a[i] = a[j];
    a[j] = h;
}

// Swaps only when the left element is strictly greater.
int bubbleSortA(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] > a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

// Also swaps equal elements; this is the only difference from bubbleSortA.
int bubbleSortB(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] >= a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

int main()
{
    int * a = (int *) malloc(ARRAY_SIZE * sizeof(int));
    int * b = (int *) malloc(ARRAY_SIZE * sizeof(int));

    for (int run = 0; RUNS > run; ++run)
    {
        // Fill both arrays with the same pseudo-random values in [0, LIMIT).
        for (int idx = 0; ARRAY_SIZE > idx; ++idx)
        {
            a[idx] = std::rand() % LIMIT;
            b[idx] = a[idx];
        }

        std::cout << "Sorting with sortA: ";
        double start = omp_get_wtime();
        int swaps = bubbleSortA(a);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;

        std::cout << "Sorting with sortB: ";
        start = omp_get_wtime();
        swaps = bubbleSortB(b);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;
    }

    free(a);
    free(b);

    return (0);
}

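This program shows the same behaviour. Can someone explain what exactly is going on here?

Executing sortB first and then sortA does not change the results.

Related discussion

I think this can indeed be explained by branch misprediction.

Consider, for example, LIMIT = 11 and sortB. On the first iteration of the outer loop it will very quickly stumble upon an element equal to 10. So it will have a[j] = 10, and therefore a[j] will definitely be >= a[next], as there are no elements greater than 10. It will therefore perform a swap, then advance j by one step, only to find that a[j] = 10 once again (the same value that was just swapped). So once again a[j] >= a[next] holds, and so on. Every comparison except a few at the very beginning will be true. The following iterations of the outer loop behave the same way.

Not so for sortA. It will start in roughly the same way, stumble upon a[j] = 10 and do some swaps in a similar manner, but only up to the point where it finds a[next] = 10 as well. Then the condition is false and no swap is done. And so on: every time it stumbles on a[next] = 10, the condition is false and no swap is done. Therefore the condition is true in 10 cases out of 11 (values of a[next] from 0 to 9) and false in 1 case out of 11. It is not surprising that branch prediction fails.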
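The argument above can be made visible by instrumenting the inner loop. The following is only a sketch under my own assumptions: the class name, the small array size and the fixed seed are arbitrary, and counting how often the if-condition "flips" relative to the previous comparison is a crude stand-in for how hard the branch is to predict, not a model of a real branch predictor.

```java
import java.util.Random;

public class BranchOutcomeDemo {

    /**
     * Bubble-sorts a in place and returns {comparisons, swaps, flips}, where
     * "flips" counts how often the if-condition differs from the previous one.
     */
    public static long[] countOutcomes(int[] a, boolean swapOnEqual) {
        long comparisons = 0, swaps = 0, flips = 0;
        boolean previous = false;
        for (int i = a.length - 1; i > 0; --i) {
            for (int j = 0; j < i; ++j) {
                // swapOnEqual selects the sortB condition (>=), otherwise sortA (>).
                boolean outcome = swapOnEqual ? a[j] >= a[j + 1] : a[j] > a[j + 1];
                if (comparisons > 0 && outcome != previous) {
                    ++flips;
                }
                previous = outcome;
                ++comparisons;
                if (outcome) {
                    int h = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = h;
                    ++swaps;
                }
            }
        }
        return new long[] {comparisons, swaps, flips};
    }

    public static void main(String[] args) {
        final int size = 2_000;     // small, so the demo finishes quickly
        final int limit = 10;
        Random r = new Random(42);  // fixed seed, arbitrary choice
        int[] a = new int[size];
        for (int i = 0; i < size; i++) {
            a[i] = r.nextInt(limit);
        }
        int[] b = a.clone();

        long[] resA = countOutcomes(a, false);
        long[] resB = countOutcomes(b, true);
        System.out.printf("A (>):  %d comparisons, %d swaps, %d flips%n", resA[0], resA[1], resA[2]);
        System.out.printf("B (>=): %d comparisons, %d swaps, %d flips%n", resB[0], resB[1], resB[2]);
    }
}
```

If the reasoning above is right, the >= variant should report far fewer flips than the > variant for a small limit, because its condition is true on almost every comparison.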

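Using the C++ code provided (with the time measurement removed) together with the perf stat command, I got results that confirm the branch-miss theory.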

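With Limit = 10, BubbleSortB benefits greatly from branch prediction (0.01% misses), but with Limit = 50000 branch prediction fails even more often (15.65% misses) than in BubbleSortA (12.69% and 12.76% misses respectively).

BubbleSortA, Limit = 10: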

Performance counter stats for './bubbleA.out':

46670.947364 task-clock # 0.998 CPUs utilized
73 context-switches # 0.000 M/sec
28 CPU-migrations # 0.000 M/sec
379 page-faults # 0.000 M/sec
117,298,787,242 cycles # 2.513 GHz
117,471,719,598 instructions # 1.00 insns per cycle
25,104,504,912 branches # 537.904 M/sec
3,185,376,029 branch-misses # 12.69% of all branches

46.779031563 seconds time elapsed

BubbleSortA, Limit = 50000:

Performance counter stats for './bubbleA.out':

46023.785539 task-clock # 0.998 CPUs utilized
59 context-switches # 0.000 M/sec
8 CPU-migrations # 0.000 M/sec
379 page-faults # 0.000 M/sec
118,261,821,200 cycles # 2.570 GHz
119,230,362,230 instructions # 1.01 insns per cycle
25,089,204,844 branches # 545.136 M/sec
3,200,514,556 branch-misses # 12.76% of all branches

46.126274884 seconds time elapsed

BubbleSortB, Limit = 10:

Performance counter stats for './bubbleB.out':

26091.323705 task-clock # 0.998 CPUs utilized
28 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
379 page-faults # 0.000 M/sec
64,822,368,062 cycles # 2.484 GHz
137,780,774,165 instructions # 2.13 insns per cycle
25,052,329,633 branches # 960.179 M/sec
3,019,138 branch-misses # 0.01% of all branches

26.149447493 seconds time elapsed

BubbleSortB, Limit = 50000:

Performance counter stats for './bubbleB.out':

51644.210268 task-clock # 0.983 CPUs utilized
2,138 context-switches # 0.000 M/sec
69 CPU-migrations # 0.000 M/sec
378 page-faults # 0.000 M/sec
144,600,738,759 cycles # 2.800 GHz
124,273,104,207 instructions # 0.86 insns per cycle
25,104,320,436 branches # 486.101 M/sec
3,929,572,460 branch-misses # 15.65% of all branches

52.511233236 seconds time elapsed
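The percentages quoted in this answer can be recomputed from the raw counters in the four reports above. A small sketch (the class and method names are hypothetical; the numbers are copied from the perf output):

```java
public class PerfRatios {

    /** Branch misses as a percentage of all branches. */
    public static double missRatePercent(long branchMisses, long branches) {
        return 100.0 * branchMisses / branches;
    }

    /** Instructions executed per cycle. */
    public static double instructionsPerCycle(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // Each row: {branch-misses, branches, instructions, cycles}
        long[][] runs = {
            {3_185_376_029L, 25_104_504_912L, 117_471_719_598L, 117_298_787_242L}, // A, Limit = 10
            {3_200_514_556L, 25_089_204_844L, 119_230_362_230L, 118_261_821_200L}, // A, Limit = 50000
            {3_019_138L, 25_052_329_633L, 137_780_774_165L, 64_822_368_062L},      // B, Limit = 10
            {3_929_572_460L, 25_104_320_436L, 124_273_104_207L, 144_600_738_759L}, // B, Limit = 50000
        };
        for (long[] r : runs) {
            System.out.printf("miss rate %.2f%%, %.2f instructions per cycle%n",
                    missRatePercent(r[0], r[1]), instructionsPerCycle(r[2], r[3]));
        }
    }
}
```

Note that BubbleSortB with Limit = 10 executes more instructions than either BubbleSortA run (it performs more swaps), yet needs only about half the cycles, which is what one would expect if mispredicted branches dominate the cost.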

Edit 2: This answer is probably wrong in most cases. The statement further down that everything above it is fact still holds, but the lower portion is not true for most processor architectures; see the comments. It is still theoretically possible that some JVM on some OS/architecture does this, but such a JVM would probably be poorly implemented, or the architecture an unusual one. It is also only possible in the sense that most conceivable things are theoretically possible, so take the last portion with a grain of salt.

First of all, I'm not sure about C++, but I can talk a little about Java.

Here is some code,

public class Example {

    public static boolean less(final int a, final int b) {
        return a < b;
    }

    public static boolean lessOrEqual(final int a, final int b) {
        return a <= b;
    }
}

Running javap -c on it, I get the following bytecode

public class Example {
  public Example();
    Code:
       0: aload_0
       1: invokespecial #8    // Method java/lang/Object."<init>":()V
       4: return

  public static boolean less(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: if_icmpge     7
       5: iconst_1
       6: ireturn
       7: iconst_0
       8: ireturn

  public static boolean lessOrEqual(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: if_icmpgt     7
       5: iconst_1
       6: ireturn
       7: iconst_0
       8: ireturn
}

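You'll notice that the only difference is if_icmpge (if compare greater/equal) vs. if_icmpgt (if compare greater).

Everything above is fact; the rest is my best guess as to how if_icmpge and if_icmpgt are handled, based on a college course I took on assembly language. To get a better answer, you should look up how your JVM handles these. My guess is that C++ also compiles down to a similar operation.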

Edit: Documentation on if_icmp<cond> is here

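The way computers compare numbers is by subtracting one from the other and checking whether the result is 0 or not. So when doing a < b, it subtracts b from a and sees whether the result is less than 0 by checking the sign of the value (a - b < 0). To do a <= b, though, it has to do an additional step and subtract 1 (a - b - 1 < 0).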
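The subtraction picture can be tried out directly in Java. This is a minimal sketch with a hypothetical class and method, written only for illustration: checking nothing but the sign of a - b agrees with a < b for small values, but gives the wrong answer when the subtraction overflows. That is one reason hardware does not literally work this way. On x86, for instance, a compare sets several condition flags, and "less" and "less or equal" are each a single conditional jump on those flags, with no extra subtraction for the second one.

```java
public class SubtractCompareDemo {

    /** Naive "a < b" via the sign of the difference; wrong when a - b overflows. */
    public static boolean lessViaSubtraction(int a, int b) {
        return a - b < 0;
    }

    public static void main(String[] args) {
        int a = Integer.MIN_VALUE;
        int b = 1;
        System.out.println(a < b);                     // true
        System.out.println(lessViaSubtraction(a, b));  // false: a - b wraps around to Integer.MAX_VALUE
        System.out.println(lessViaSubtraction(3, 5));  // true: no overflow, so the sign check works
    }
}
```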

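Normally this is a very minuscule difference, but this isn't just any code, this is freaking bubble sort! O(n^2) is the average number of times we do this particular comparison, because it is in the innermost loop.

Yes, it may have something to do with branch prediction. I'm not sure, I'm not an expert on that, but I think this may also play a non-negligible role.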