Very fast sorting of fixed length arrays using comparator networks

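I have some performance-critical code that involves sorting a very short fixed-length array in C++, with between around 3 and 10 elements (the size is a compile-time parameter that varies).

It seems to me that a static sorting network specialised to each possible input size would perhaps be a very efficient way to do this: we do all the comparisons necessary to figure out which case we are in, then do the optimal number of swaps to sort the array.

To apply this, we use a bit of template magic to deduce the array length and apply the correct network: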

#include <iostream>
using namespace std;

// Primary template: used for any array size without a dedicated network.
template< int K >
void static_sort(const double(&array)[K])
{
    cout << "General static sort\n" << endl;
}

// Specialization selected when the deduced array length is 3.
template<>
void static_sort<3>(const double(&array)[3])
{
    cout << "Static sort for K=3" << endl;
}

int main()
{
    double array[3];

    // performance critical code.
    // ...
    static_sort(array);
    // ...
}
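
To make the idea concrete, here is a minimal sketch of what a size-3 case could contain. The names `compare_exchange` and `network_sort` are hypothetical and not from the question; the sketch also takes a non-const reference, since sorting in place has to modify the array, and uses overloading rather than explicit specialization.

```cpp
#include <algorithm>
#include <cstddef>

// Compare-exchange: afterwards a <= b. Written with min/max so the compiler
// has a chance (not a guarantee) to emit branch-free code for arithmetic types.
template <class T>
inline void compare_exchange(T& a, T& b)
{
    const T lo = std::min(a, b);
    const T hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Generic fallback (hypothetical name) for sizes without a dedicated network.
template <class T, std::size_t K>
void network_sort(T (&array)[K])
{
    std::sort(array, array + K);
}

// Overload chosen for arrays of length 3: three comparators are enough.
template <class T>
void network_sort(T (&array)[3])
{
    compare_exchange(array[0], array[1]);
    compare_exchange(array[1], array[2]);
    compare_exchange(array[0], array[1]);
}
```

Calling `network_sort(array)` on a `double[3]` picks the three-comparator overload through partial ordering; any other length falls back to `std::sort`.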

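Obviously it would be quite a hassle to code all of this up, so:

Does anyone have any opinions on whether or not this is worth the effort?
Does anyone know if this optimisation exists in any standard implementations of, for example, std::sort?
Is there an easy place to get hold of code implementing this kind of sorting network?
Perhaps it would be possible to generate a sorting network like this statically using template magic.

For now I just use insertion sort with static template parameters (as above), in the hope that it will encourage unrolling and other compile-time optimisations.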
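
As a rough sketch of that approach (the name `static_insertion_sort` is hypothetical, and this is not the code from the question's repository), an insertion sort whose length is deduced from the array type could look like this:

```cpp
#include <cstddef>

// Insertion sort whose trip count K is a compile-time constant, which gives
// the optimizer the option (not a guarantee) to unroll both loops.
template <class T, std::size_t K>
void static_insertion_sort(T (&array)[K])
{
    for (std::size_t i = 1; i < K; ++i) {
        T key = array[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right.
        while (j > 0 && key < array[j - 1]) {
            array[j] = array[j - 1];
            --j;
        }
        array[j] = key;
    }
}
```

Whether the loops are actually unrolled depends on the compiler and its flags, so it is worth checking the generated assembly before relying on it.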

Your thoughts are welcome.

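Update: I wrote some testing code to compare a "static" insertion sort and std::sort. (When I say static, I mean that the array size is fixed and deduced at compile time, presumably allowing loop unrolling etc.) I get at least a 20% net improvement (note that generating the input is included in the timing). Platform: clang, OS X 10.9.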
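
For readers who want to reproduce this kind of comparison, a small timing harness might look like the sketch below. It is an assumption-laden illustration, not the code from the repository: `make_inputs` and `time_sort_ms` are hypothetical names, the inputs are angles in [0, 2π) as mentioned in the comments, K is assumed to be at least 1, and unlike the measurement above it keeps input generation out of the timed region.

```cpp
#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Fill `count` arrays of K doubles with pseudo-random angles in [0, 2*pi).
template <std::size_t K>
std::vector<std::array<double, K>> make_inputs(std::size_t count, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> angle(0.0, 6.283185307179586);
    std::vector<std::array<double, K>> data(count);
    for (auto& a : data)
        for (auto& x : a) x = angle(gen);
    return data;
}

// Time one sorting callable over a private copy of the inputs.
// Returns the elapsed wall-clock time in milliseconds.
template <class Inputs, class Sorter>
double time_sort_ms(Inputs data, Sorter sorter)
{
    const auto start = std::chrono::steady_clock::now();
    for (auto& a : data) sorter(a);
    const auto stop = std::chrono::steady_clock::now();
    // Read one result through a volatile so the work is not discarded entirely.
    volatile double sink = data.empty() ? 0.0 : data.front()[0];
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Passing the same `make_inputs<K>(n)` result to `time_sort_ms` once with a lambda wrapping `std::sort` and once with the candidate sorter gives two comparable numbers; repeat the runs several times, since single measurements of such short sorts are noisy.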

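The code is here, in case you would like to compare it against your own standard library implementation: https://github.com/rosshemsley/static_sorting

I have still not found a nice set of implementations of comparator network sorters.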

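Related discussion

What are the values you are sorting? Are they in a fixed range?
My values happen to be angles in [0, 2pi]. But I guess my idea was to focus on comparison networks, so the value type shouldn't matter too much.
@RossHemsley, have you actually checked whether the sort takes a considerable share of your program's execution time?
Did you try profiling the code? Unrolling and templates won't do you any good if the sorting algorithm is inefficient.
Is the array size known at compile time?
@Shahbaz Nope: the performance of this step is critical to the performance of the code, but the performance of the code is not critical to me... (it is for a Monte Carlo simulation that I only run once). So I didn't do any advanced testing. It just seemed like an interesting idea in general that others may be interested in.
I remember there being a C sorting network generator on some web page (unfortunately I can't remember or find it now). People used it as a benchmark some two years ago and found that above 5 elements, the difference to std::sort was fairly small or almost non-existent. Anyway, for testing speed, code generation is probably the easiest solution (if you don't want to do the template work for academic purposes). Here is a similar site that only outputs a set of pairs to compare/swap, but I think you can wrap it in C code in 5 minutes.
@Andrey Yes, see the code example above.
Interesting idea. I would hope that for small arrays a simple recursive sort could be folded by the compiler into exactly what is necessary to swap things, but I don't think anyone has seriously done it, benchmarked it and profiled it. I remember having used a non-recursive variant of the typical introsort that was faster than my std::sort, so I would try that, and just pass as many template arguments as possible to help the compiler optimize it.
@RossHemsley, then you are probably wasting your time. My suggestion is to just use std::sort, then run the program under a profiler and see how much time it spends in this function. You know, perhaps std::sort is smart too ;) Of course, the question is still interesting.
I'm most interested in arrays of length between 3 and 5. I will be sorting on the order of trillions of each for my simulations, so it really is worth getting right. I would be interested in getting hold of some sorting network code and doing a comparison. For now I think that comparing static insertion sort to std::sort is already interesting: I will post any results I come up with.
See stackoverflow.com/questions/3903086/…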

I recently wrote a small class that uses the Bose-Nelson algorithm to generate a sorting network at compile time.

/**
 * A Functor class to create a sort for fixed sized arrays/containers with a
 * compile time generated Bose-Nelson sorting network.
 * \tparam NumElements  The number of elements in the array or container to sort.
 * \tparam Compare      A comparator functor class that returns true if lhs < rhs.
 *                      With the default (void), operator< is used.
 */
template <unsigned NumElements, class Compare = void> class StaticSort
{
    // Compare-exchange using the user-supplied comparator.
    template <class A, class C> struct Swap
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            T t = Compare()(v0, v1) ? v0 : v1; // Min
            v1 = Compare()(v0, v1) ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    // Compare-exchange using operator< (no comparator given).
    template <class A> struct Swap <A, void>
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            // Explicitly code out the Min and Max to nudge the compiler
            // to generate branchless code.
            T t = v0 < v1 ? v0 : v1; // Min
            v1 = v0 < v1 ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    // Bose-Nelson merge step: merges the run of X elements starting at
    // (1-based) index I with the run of Y elements starting at index J.
    template <class A, class C, int I, int J, int X, int Y> struct PB
    {
        inline PB(A &a)
        {
            enum { L = X >> 1, M = (X & 1 ? Y : Y + 1) >> 1, IAddL = I + L, XSubL = X - L };
            PB<A, C, I, J, L, M> p0(a);
            PB<A, C, IAddL, J + M, XSubL, Y - M> p1(a);
            PB<A, C, IAddL, J, XSubL, M> p2(a);
        }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 1>
    {
        inline PB(A &a) { Swap<A, C> s(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 2>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J); Swap<A, C> s1(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 2, 1>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J - 1); Swap<A, C> s1(a, I, J - 1); }
    };

    // Bose-Nelson split step: sorts the M elements starting at (1-based)
    // index I by sorting both halves and then merging them.
    template <class A, class C, int I, int M, bool Stop = false> struct PS
    {
        inline PS(A &a)
        {
            enum { L = M >> 1, IAddL = I + L, MSubL = M - L};
            PS<A, C, I, L, (L <= 1)> ps0(a);
            PS<A, C, IAddL, MSubL, (MSubL <= 1)> ps1(a);
            PB<A, C, I, IAddL, L, MSubL> pb(a);
        }
    };

    // Recursion terminator: a run of at most one element is already sorted.
    template <class A, class C, int I, int M> struct PS <A, C, I, M, true>
    {
        inline PS(A &a) {}
    };

public:
    /**
     * Sorts the array/container arr.
     * \param arr  The array/container to be sorted.
     */
    template <class Container> inline void operator() (Container &arr) const
    {
        PS<Container, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    }

    /**
     * Sorts the array arr.
     * \param arr  The array to be sorted.
     */
    template <class T> inline void operator() (T *arr) const
    {
        PS<T*, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    }
};

#include <cstdlib>   // rand
#include <iostream>
#include <vector>

int main(int argc, const char * argv[])
{
    enum { NumValues = 32 };

    // Arrays
    {
        int rands[NumValues];
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    std::cout << "\n";

    // STL Vector
    {
        std::vector<int> rands(NumValues);
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    return 0;
}

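Benchmarks

The following benchmarks were compiled with clang -O3 and run on my mid-2012 MacBook Air.

Time (in milliseconds) to sort 1 million arrays: for arrays of size 2, 4 and 8 it takes 1.943, 8.655 and 20.246 ms respectively.

Here are the average clocks per sort for small arrays of 6 elements. The benchmark code and examples can be found in this question: Fastest sort of fixed length 6 int array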

Direct call to qsort library function : 342.26
Naive implementation (insertion sort) : 136.76
Insertion Sort (Daniel Stutzbach) : 101.37
Insertion Sort Unrolled : 110.27
Rank Order : 90.88
Rank Order with registers : 90.29
Sorting Networks (Daniel Stutzbach) : 93.66
Sorting Networks (Paul R) : 31.54
Sorting Networks 12 with Fast Swap : 32.06
Sorting Networks 12 reordered Swap : 29.74
Reordered Sorting Network w/ fast swap : 25.28
Templated Sorting Network (this class) : 25.01

For 6 elements it performs as fast as the fastest example in that question.

The code used for the benchmarks can be found here.

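The other answers are interesting and fairly good, but I believe I can provide some additional elements of answer, point by point:

Is it worth the effort? Well, if you need to sort small collections of integers, and the sorting networks are tuned to take advantage of certain instructions as much as possible, it might be worth it. The following graph presents the results of sorting a million arrays of int of size 0-14 with different sorting algorithms. As you can see, sorting networks can provide a significant speedup if you really need it.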

using namespace cppsort;

// Sorters are function objects that can be
// adapted with sorter adapters from the
// library
using sorter = small_array_adapter<
std_sorter,
sorting_network_sorter
>;

// Now you can use it as a function
sorter sort;

// Instead of a size-agnostic sorting algorithm,
// sort will use an optimal sorting network for
// 5 inputs since the bound of the array can be
// deduced at compile time
int arr[] = { 2, 4, 7, 9, 3 };
sort(arr);

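As mentioned above, the library provides efficient sorting networks for built-in integers, but they may not help if you need to sort small arrays of something else (for example, my latest benchmarks show that even for long long int they are no better than a straight insertion sort).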

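You could probably use template metaprogramming to generate sorting networks of any size, but there is no known algorithm that generates the best sorting networks, so you might as well write the best ones by hand. I don't think the networks generated by simple algorithms can actually be usable and efficient (Batcher's odd-even mergesort and pairwise sorting networks might be the only usable ones). [Another answer seems to show that generated networks can actually work.]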
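
To illustrate what "generated by a simple algorithm" means here, the sketch below builds the comparator list of Batcher's odd-even mergesort at run time, using the commonly cited iterative formulation. It is not taken from the library discussed above; `Comparator`, `batcher_network` and `apply_network` are hypothetical names, and the result is generally not a minimum-size network.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Comparator = std::pair<std::size_t, std::size_t>;  // (low index, high index)

// Comparator list of Batcher's odd-even mergesort for n inputs.
// For n that is not a power of two, this is the next larger network with
// every comparator that would touch a missing wire left out.
std::vector<Comparator> batcher_network(std::size_t n)
{
    std::vector<Comparator> net;
    for (std::size_t p = 1; p < n; p *= 2)                 // size of sorted runs
        for (std::size_t k = p; k >= 1; k /= 2)            // comparator distance
            for (std::size_t j = k % p; j + k < n; j += 2 * k)
                for (std::size_t i = 0; i < k && i + j + k < n; ++i)
                    // Only compare wires that belong to the same merge block.
                    if ((i + j) / (2 * p) == (i + j + k) / (2 * p))
                        net.emplace_back(i + j, i + j + k);
    return net;
}

// Apply a comparator list to any indexable container, in place.
template <class Container>
void apply_network(const std::vector<Comparator>& net, Container& c)
{
    for (const Comparator& ce : net)
        if (c[ce.second] < c[ce.first]) std::swap(c[ce.first], c[ce.second]);
}
```

For 4 inputs this produces the familiar 5 comparators (0,1), (2,3), (0,2), (1,3), (1,2). The same index arithmetic could be moved into templates or a code generator to obtain fully unrolled code.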

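There are known optimal, or at least best-known-length, comparator networks for N<16, so there is at least a fairly good starting point. Only fairly good, because the optimal networks are not necessarily designed for the maximum level of parallelism achievable with, for example, SSE or other vector arithmetic.

Another point is that some optimal networks for a given N are already degenerate versions of a slightly larger optimal network for N+1.

From Wikipedia: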

The optimal depths for up to 10 inputs are known and they are
respectively 0, 1, 3, 3, 5, 5, 6, 6, 7, 7.

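That said, I would concentrate on implementing networks for N = 4, 6, 8 and 10, since the depth constraint cannot be simulated by extra parallelism (I think). I also think that the ability to work in SSE registers (also using some min/max instructions), or even in a relatively large register set on a RISC architecture, will provide a noticeable performance advantage over "well known" sorting methods such as quicksort, due to the absence of pointer arithmetic and other overhead.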
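
As a scalar illustration of depth and min/max-based comparators, here is a sketch for N = 4 using the well-known 5-comparator, depth-3 network. The name `sort4` is hypothetical, and no SSE intrinsics are used; the point is only that comparators within one layer touch disjoint elements, which is what a vectorized version would exploit.

```cpp
#include <algorithm>

// Sort four values with a 5-comparator network of depth 3.
// Comparators in the same layer are independent of each other.
template <class T>
void sort4(T (&v)[4])
{
    // Compare-exchange expressed with min/max instead of a branch.
    auto cex = [](T& a, T& b) {
        const T lo = std::min(a, b);
        const T hi = std::max(a, b);
        a = lo;
        b = hi;
    };
    cex(v[0], v[1]); cex(v[2], v[3]);   // layer 1
    cex(v[0], v[2]); cex(v[1], v[3]);   // layer 2
    cex(v[1], v[2]);                    // layer 3
}
```

Whether the compiler turns the min/max pairs into actual min/max instructions depends on the element type and the target, so the generated code should be inspected rather than assumed.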

Also, I would try to implement the parallel network using the infamous loop unrolling trick, Duff's device.

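EDIT: When the input values are known to be positive IEEE-754 floats or doubles, it is also worth mentioning that the comparison can be performed on them as integers. (float and int must have the same endianness.)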
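
A minimal sketch of that trick for 32-bit floats follows. The function name `less_via_bits` is hypothetical, and the code assumes `float` is an IEEE-754 binary32 value and that both inputs have a clear sign bit and are not NaN; under those assumptions the bit patterns order the same way as the values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Compare two floats through their bit patterns.
// Precondition (assumed): sign bit clear for both inputs, and neither is NaN.
// Negative values and -0.0f would order incorrectly, hence the assert.
inline bool less_via_bits(float a, float b)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    assert(!std::signbit(a) && !std::signbit(b));
    std::uint32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);   // well-defined way to read the bits
    std::memcpy(&ib, &b, sizeof ib);
    return ia < ib;
}
```

In a sorting network this would let the compare-exchange run on integer min/max instructions, provided the data really satisfies the precondition.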