> vs. >= in bubble sort causes significant performance difference
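I just stumbled upon something. At first I thought it might be a case of branch misprediction, as in this case, but I cannot explain why branch misprediction would cause this behaviour.
I implemented two versions of bubble sort in Java and did some performance testing: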
```java
import java.util.Random;

public class BubbleSortAnnomaly {

    public static void main(String... args) {
        final int ARRAY_SIZE = Integer.parseInt(args[0]);
        final int LIMIT = Integer.parseInt(args[1]);
        final int RUNS = Integer.parseInt(args[2]);

        int[] a = new int[ARRAY_SIZE];
        int[] b = new int[ARRAY_SIZE];
        Random r = new Random();
        for (int run = 0; RUNS > run; ++run) {
            // Fill both arrays with the same random values in [0, LIMIT).
            for (int i = 0; i < ARRAY_SIZE; i++) {
                a[i] = r.nextInt(LIMIT);
                b[i] = a[i];
            }

            System.out.print("Sorting with sortA: ");
            long start = System.nanoTime();
            int swaps = bubbleSortA(a);

            System.out.println((System.nanoTime() - start) + " ns. "
                    + "It used " + swaps + " swaps.");

            System.out.print("Sorting with sortB: ");
            start = System.nanoTime();
            swaps = bubbleSortB(b);

            System.out.println((System.nanoTime() - start) + " ns. "
                    + "It used " + swaps + " swaps.");
        }
    }

    // Swaps only when the left element is strictly greater.
    public static int bubbleSortA(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] > a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    // Swaps when the left element is greater or equal.
    public static int bubbleSortB(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] >= a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    private static void swap(int[] a, int j, int i) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}
```
As you can see, the only difference between the two sorting methods is the `>` vs. `>=`. […]
```
Sorting with sortA: 4.214 seconds. It used 564960211 swaps.
Sorting with sortB: 2.278 seconds. It used 1249750569 swaps.
Sorting with sortA: 4.199 seconds. It used 563355818 swaps.
Sorting with sortB: 2.254 seconds. It used 1249750348 swaps.
Sorting with sortA: 4.189 seconds. It used 560825110 swaps.
Sorting with sortB: 2.264 seconds. It used 1249749572 swaps.
Sorting with sortA: 4.17 seconds. It used 561924561 swaps.
Sorting with sortB: 2.256 seconds. It used 1249749766 swaps.
Sorting with sortA: 4.198 seconds. It used 562613693 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749880 swaps.
Sorting with sortA: 4.19 seconds. It used 561658723 swaps.
Sorting with sortB: 2.281 seconds. It used 1249751070 swaps.
Sorting with sortA: 4.193 seconds. It used 564986461 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749681 swaps.
Sorting with sortA: 4.203 seconds. It used 562526980 swaps.
Sorting with sortB: 2.27 seconds. It used 1249749609 swaps.
Sorting with sortA: 4.176 seconds. It used 561070571 swaps.
Sorting with sortB: 2.241 seconds. It used 1249749831 swaps.
Sorting with sortA: 4.191 seconds. It used 559883210 swaps.
Sorting with sortB: 2.257 seconds. It used 1249749371 swaps.
```
Setting […]:
```
Sorting with sortA: 3.983 seconds. It used 625941897 swaps.
Sorting with sortB: 4.658 seconds. It used 789391382 swaps.
```
I ported the program to C++ to determine whether this problem is Java-specific. Here is the C++ code.
```cpp
#include <cstdlib>
#include <iostream>

#include <omp.h>  // omp_get_wtime(); build with OpenMP enabled (e.g. g++ -fopenmp)

#ifndef ARRAY_SIZE
#define ARRAY_SIZE 50000
#endif

#ifndef LIMIT
#define LIMIT 10
#endif

#ifndef RUNS
#define RUNS 10
#endif

void swap(int * a, int i, int j)
{
    int h = a[i];
    a[i] = a[j];
    a[j] = h;
}

// Swaps only when the left element is strictly greater.
int bubbleSortA(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] > a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

// Swaps when the left element is greater or equal.
int bubbleSortB(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] >= a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

int main()
{
    int * a = (int *) malloc(ARRAY_SIZE * sizeof(int));
    int * b = (int *) malloc(ARRAY_SIZE * sizeof(int));

    for (int run = 0; RUNS > run; ++run)
    {
        // Fill both arrays with the same random values in [0, LIMIT).
        for (int idx = 0; ARRAY_SIZE > idx; ++idx)
        {
            a[idx] = std::rand() % LIMIT;
            b[idx] = a[idx];
        }

        std::cout << "Sorting with sortA: ";
        double start = omp_get_wtime();
        int swaps = bubbleSortA(a);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;

        std::cout << "Sorting with sortB: ";
        start = omp_get_wtime();
        swaps = bubbleSortB(b);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;
    }

    free(a);
    free(b);

    return (0);
}
```
This program shows the same behaviour. Can someone explain what exactly is happening here?
Executing […] first […]
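I think this may indeed be due to branch prediction. If you compare the number of swaps with the number of inner-loop iterations, you find:
Limit = 10
- A = 560M swaps / 1250M loops
- B = 1250M swaps / 1250M loops (0.02% fewer swaps than loops)
Limit = 50000
- A = 627M swaps / 1250M loops
- B = 850M swaps / 1250M loops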
So, in […]
I think this can indeed be explained by branch misprediction.
For example, consider limit = 11 and […]
With […]
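The swap-to-iteration ratios listed above can be checked with a small experiment. The following is only a sketch, not part of the original program: the class `SwapRatio`, its methods, the fixed seed and the reduced array size of 5000 are all assumptions made for illustration. It counts how often the swap branch is taken for `>` and for `>=` and divides by the n(n-1)/2 inner-loop iterations that bubble sort always performs.

```java
import java.util.Random;

// Hypothetical helper, not part of the original program: it reports how often
// the swap branch is taken for the strict (>) and the non-strict (>=) comparison.
class SwapRatio {

    /** Inner-loop iterations bubble sort performs on n elements: n(n-1)/2. */
    static long innerIterations(int n) {
        return (long) n * (n - 1) / 2;
    }

    /** Bubble sort that counts swaps; orEqual selects >= instead of >. */
    static long countSwaps(int[] a, boolean orEqual) {
        long swaps = 0;
        for (int i = a.length - 1; i > 0; --i) {
            for (int j = 0; j < i; ++j) {
                boolean swap = orEqual ? a[j] >= a[j + 1] : a[j] > a[j + 1];
                if (swap) {
                    int h = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = h;
                    ++swaps;
                }
            }
        }
        return swaps;
    }

    /** Fraction of inner-loop iterations in which the swap branch is taken. */
    static double takenRatio(int size, int limit, boolean orEqual, long seed) {
        Random r = new Random(seed);
        int[] a = new int[size];
        for (int i = 0; i < size; i++) {
            a[i] = r.nextInt(limit);
        }
        return (double) countSwaps(a, orEqual) / innerIterations(size);
    }

    public static void main(String[] args) {
        int size = 5000; // assumption: smaller than the question's 50000 so it finishes quickly
        for (int limit : new int[] {10, size}) {
            System.out.printf("limit=%d  >: %.4f  >=: %.4f%n",
                    limit,
                    takenRatio(size, limit, false, 42L),
                    takenRatio(size, limit, true, 42L));
        }
    }
}
```

A ratio close to 0 or 1 means the branch almost always goes the same way, which is easy for a predictor; a ratio near the middle with random data is the hard case. With a small limit and `>=`, a pass can skip the swap only when it meets a new strict maximum, which happens at most limit - 1 times per pass, so the ratio has to be close to 1.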
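Using the provided C++ code (with the time counting removed) with the `perf stat` command, I got results that confirm the branch-miss theory.
For […]
BubbleSortA, Limit = 10: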
```
 Performance counter stats for './bubbleA.out':

      46670.947364 task-clock                #    0.998 CPUs utilized
                73 context-switches          #    0.000 M/sec
                28 CPU-migrations            #    0.000 M/sec
               379 page-faults               #    0.000 M/sec
   117,298,787,242 cycles                    #    2.513 GHz
   117,471,719,598 instructions              #    1.00  insns per cycle
    25,104,504,912 branches                  #  537.904 M/sec
     3,185,376,029 branch-misses             #   12.69% of all branches

      46.779031563 seconds time elapsed
```
BubbleSortA, Limit = 50000:
```
 Performance counter stats for './bubbleA.out':

      46023.785539 task-clock                #    0.998 CPUs utilized
                59 context-switches          #    0.000 M/sec
                 8 CPU-migrations            #    0.000 M/sec
               379 page-faults               #    0.000 M/sec
   118,261,821,200 cycles                    #    2.570 GHz
   119,230,362,230 instructions              #    1.01  insns per cycle
    25,089,204,844 branches                  #  545.136 M/sec
     3,200,514,556 branch-misses             #   12.76% of all branches

      46.126274884 seconds time elapsed
```
BubbleSortB, Limit = 10:
```
 Performance counter stats for './bubbleB.out':

      26091.323705 task-clock                #    0.998 CPUs utilized
                28 context-switches          #    0.000 M/sec
                 2 CPU-migrations            #    0.000 M/sec
               379 page-faults               #    0.000 M/sec
    64,822,368,062 cycles                    #    2.484 GHz
   137,780,774,165 instructions              #    2.13  insns per cycle
    25,052,329,633 branches                  #  960.179 M/sec
         3,019,138 branch-misses             #    0.01% of all branches

      26.149447493 seconds time elapsed
```
BubbleSortB, Limit = 50000:
```
 Performance counter stats for './bubbleB.out':

      51644.210268 task-clock                #    0.983 CPUs utilized
             2,138 context-switches          #    0.000 M/sec
                69 CPU-migrations            #    0.000 M/sec
               378 page-faults               #    0.000 M/sec
   144,600,738,759 cycles                    #    2.800 GHz
   124,273,104,207 instructions              #    0.86  insns per cycle
    25,104,320,436 branches                  #  486.101 M/sec
     3,929,572,460 branch-misses             #   15.65% of all branches

      52.511233236 seconds time elapsed
```
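The percentages in these counters can be recomputed, and the two Limit = 10 runs can be compared directly. The sketch below is illustrative only: the class `PerfArithmetic` and its method names are made up here, and the constants are copied from the BubbleSortA and BubbleSortB outputs for Limit = 10 above. The last figure, extra cycles divided by extra branch misses, is a rough estimate at best, because BubbleSortB also retires more instructions (it performs more swaps).

```java
// Illustrative only: the class and method names are hypothetical, and the
// constants are copied from the perf stat output for Limit = 10.
class PerfArithmetic {

    /** Percentage of executed branches that were mispredicted. */
    static double missRatePercent(long branchMisses, long branches) {
        return 100.0 * branchMisses / branches;
    }

    /** Instructions retired per cycle. */
    static double instructionsPerCycle(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    /** Extra cycles of the slower run divided by its extra branch misses. */
    static double cyclesPerExtraMiss(long slowCycles, long fastCycles,
                                     long slowMisses, long fastMisses) {
        return (double) (slowCycles - fastCycles) / (slowMisses - fastMisses);
    }

    public static void main(String[] args) {
        // BubbleSortA, Limit = 10
        long aCycles = 117_298_787_242L;
        long aInstructions = 117_471_719_598L;
        long aBranches = 25_104_504_912L;
        long aMisses = 3_185_376_029L;

        // BubbleSortB, Limit = 10
        long bCycles = 64_822_368_062L;
        long bInstructions = 137_780_774_165L;
        long bBranches = 25_052_329_633L;
        long bMisses = 3_019_138L;

        System.out.printf("A: %.2f%% branch misses, %.2f insns per cycle%n",
                missRatePercent(aMisses, aBranches),
                instructionsPerCycle(aInstructions, aCycles));
        System.out.printf("B: %.2f%% branch misses, %.2f insns per cycle%n",
                missRatePercent(bMisses, bBranches),
                instructionsPerCycle(bInstructions, bCycles));
        System.out.printf("Extra cycles per extra miss (A vs. B): %.1f%n",
                cyclesPerExtraMiss(aCycles, bCycles, aMisses, bMisses));
    }
}
```

Both runs execute almost the same number of branches, so the gap in cycles goes along with the gap in mispredicted branches rather than with the amount of branching.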
Edit 2: This answer is probably wrong in most cases. Everything up to the point further down where I say that the above is fact still holds, but the portion after that point is not true for most processor architectures; see the comments. It is still theoretically possible that some JVM on some OS or architecture does this, but that JVM is probably poorly implemented, or the architecture is an unusual one. This is also only theoretically possible in the sense that most conceivable things are theoretically possible, so I'd take the last portion with a grain of salt.
First, I'm not sure about C++, but I can talk a little about Java.
Here is some code:
```java
public class Example {

    public static boolean less(final int a, final int b) {
        return a < b;
    }

    public static boolean lessOrEqual(final int a, final int b) {
        return a <= b;
    }
}
```
Running `javap -c` on the above gives:
```
public class Example {
  public Example();
    Code:
       0: aload_0
       1: invokespecial #8                  // Method java/lang/Object."<init>":()V
       4: return

  public static boolean less(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: if_icmpge     7
       5: iconst_1
       6: ireturn
       7: iconst_0
       8: ireturn

  public static boolean lessOrEqual(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: if_icmpgt     7
       5: iconst_1
       6: ireturn
       7: iconst_0
       8: ireturn
}
```
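You will notice that the only difference is `if_icmpge` (if compare greater/equal) versus `if_icmpgt` (if compare greater than).
Everything above is fact; the rest is my guess as to how these two instructions are handled.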
Edit: The `if_icmp<cond>` instructions are documented in the Java Virtual Machine Specification.
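A computer compares numbers by subtracting one from the other and checking whether the result is 0. So, when executing […]
Usually this is a very minuscule difference, but this isn't just any code, this is freaking bubble sort! O(n^2) is the average number of times we perform this particular comparison, because it sits in the innermost loop.
Yes, it may have something to do with branch prediction. I'm not sure, as I'm not an expert on that, but I think this may also play a non-negligible role.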