Is std::vector so much slower than plain arrays?

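I've always thought it was general wisdom that std::vector is "implemented as an array". Today I went and tested it, and that does not seem to be the case:

Here are some test results: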

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

That's about 3 to 4 times slower! It doesn't really justify the "vector may be slower by a few nanoseconds" comments.

The code I used:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}

~TestTimer()
{
using namespace std;
using namespace boost;

posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;

cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}

private:
std::string name;
boost::posix_time::ptime start;
};

struct Pixel
{
Pixel()
{
}

Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}

unsigned char r, g, b;
};

void UseVector()
{
TestTimer t("UseVector");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);

for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}

void UseVectorPushBack()
{
TestTimer t("UseVectorPushBack");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

std::vector<Pixel> pixels;
pixels.reserve(dimension * dimension);

for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}

void UseArray()
{
TestTimer t("UseArray");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}

free(pixels);
}
}

int main()
{
TestTimer t1("The whole thing");

UseArray();
UseVector();
UseVectorPushBack();

return 0;
}

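Am I doing something wrong? Or have I just busted this performance myth?

I'm using Release mode in Visual Studio 2005.

In Visual C++, EDCOX1(2) reduces EDOCX1×3 by half (bringing it down to 4 seconds). This is really huge, IMO.

Related discussion

What compiler flags are you using? The STL generally relies on the compiler's optimizer for performance.
Some versions of vector add extra instructions in debug mode to check that you don't access past the end of the array and the like. To get real timings you must build in release mode with optimizations turned on.
Are you using debug mode or release mode? I agree that std::vector's performance in debug mode can be a serious concern if it is used heavily.
@rwong: Using release.
It's good that you have measured instead of believing claims you heard on the Internet.
With VC++, try compiling from the command line with /O2. I'm not sure what you need to set in the project properties to get the same effect, but using /O2 instead of /Os with VC++ 2010 made my UseVectorPushBack 60% slower than UseArray, rather than twice as slow.
Vector is implemented as an array. That's not "conventional wisdom", it's a fact. You've discovered that vector is a general-purpose resizable array. Congratulations. As with all general-purpose tools, it's possible to hit special cases where it is sub-optimal. That is why the conventional wisdom is to start with vector and consider alternatives if necessary.
lol, what's the speed difference between "throwing dirty dishes in the sink" and "throwing dirty dishes in the sink and checking whether they broke"?
On VC2010 at least, the major difference seems to be that malloc() is faster than resize(). Remove the memory allocation from the timing, compile with _ITERATOR_DEBUG_LEVEL == 0, and the results are the same.
You can answer any question like this by single-stepping in the disassembly window. I'm surprised how many people would rather make wild guesses than just "open the box" and see what is actually going on.
@Mike: Obviously we youngsters aren't as adept at assembly as the wizards. Would you care to write an answer on how to analyze this problem the real man's way (assembly)?
@Mike: Please re-read my comment above; let me phrase it better so my meaning isn't misunderstood: "I would really appreciate a demonstration of how to approach this problem from the disassembly. I know how to run a disassembler, but not how to analyze this problem from start to finish. In particular, if I enable optimizations I get a lot of hairy assembly; if I disable them, one could argue that the STL was designed to be used with optimizations on."
@kizzx2: You don't need to be a wizard to follow assembly language. First, turn off optimization, because it only scrambles the code. Then step at the instruction level (with the source displayed as you go). It's simple: move something into a register, multiply it, use it as an address to fetch something, compare things, jump conditionally, call a function, and so on. If you show the source lines as you go, pretty soon you'll see what the compiler was thinking and know exactly what is happening. If it's too long, skip over parts.
Reading assembly is very simple. Download the manuals from Intel or AMD and look up every new instruction you come across. Writing assembly is the tricky part, but as @Mike says, asm is a very simple "language" and it is easy to read (it can be verbose and time consuming, but it is not difficult).
@Mike: Note that turning off optimization will mess up the timings. Some implementations of vector use range-checked versions when unoptimized and faster versions when optimized. Single-stepping as you suggest would show it using the slower but safer version, while timing tests under high optimization would show the opposite.
@David: You're right that it affects the code at the micro level, but you can still see what it's doing. Then, if you need to turn on optimization, you can see what that does too. When I had to step through Fortran in assembler, I found it nearly impossible to follow if it was optimized. Still, I think it's a useful skill.
Nearly a duplicate of: Using arrays or std::vectors in C++, what's the performance gap?
@Roger Pate: That's unfortunate. I don't know what made you think this is "nearly a duplicate", but if you read the code and the posts you'd find they are actually about different things. What's the purpose of the drive-by? (The accepted answer to that post focuses on indexing and dereferencing; the answer to this thread ultimately lies in the constructors.)
Drive-by? I just pointed out a closely related question.
I wonder what the results would be if you changed how UseArray() calls malloc().
It would be interesting to see the same test with std::array<>! I haven't read all the answers yet, so it may already have been tested ;)
Allocate the vector with std::vector<Pixel> pixels(dimension * dimension); that eliminates the extra resize overhead compared with the array.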

Using the following:

g++ -O3 Time.cpp -I
./a.out
UseArray completed in 2.196 seconds
UseVector completed in 4.412 seconds
UseVectorPushBack completed in 8.017 seconds
The whole thing completed in 14.626 seconds

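So the array is twice as quick as the vector.

But after looking at the code in more detail this is expected, because you run across the vector twice and the array only once. Note: when you resize() the vector you are not only allocating the memory but also running through the vector and calling the constructor on each member.

Rearranging the code slightly so that the vector initializes each object only once:

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

Now doing the same timing again: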

g++ -O3 Time.cpp -I
./a.out
UseVector completed in 2.216 seconds

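The vector now performs only slightly worse than the array. IMO this difference is insignificant and could be caused by a whole bunch of things not associated with the test.

I would also take into account that you are not correctly initializing/destroying the Pixel objects in the UseArray() method, as neither the constructor nor the destructor is called. This may not be an issue for this simple class, but anything slightly more complex (for example, with pointers or with members that hold pointers) will cause problems.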
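To make the "two passes versus one pass" point concrete, here is a minimal sketch. It assumes the Pixel struct from the question; operator== and the two helper functions (make_red_two_pass, make_red_one_pass) are hypothetical names added only for illustration and are not part of the original benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same layout as the Pixel in the question.
struct Pixel {
    Pixel() {}
    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b) {}
    unsigned char r, g, b;
};

// Added for illustration so two buffers can be compared.
inline bool operator==(const Pixel& a, const Pixel& b) {
    return a.r == b.r && a.g == b.g && a.b == b.b;
}

// Two passes: resize() creates every element, then the loop overwrites it.
std::vector<Pixel> make_red_two_pass(std::size_t n) {
    std::vector<Pixel> pixels;
    pixels.resize(n);
    assert(pixels.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        pixels[i].r = 255;
        pixels[i].g = 0;
        pixels[i].b = 0;
    }
    return pixels;
}

// One pass: the fill constructor copies the given value into each slot.
std::vector<Pixel> make_red_one_pass(std::size_t n) {
    return std::vector<Pixel>(n, Pixel(255, 0, 0));
}
```

Both functions produce the same contents; only the amount of work done on the way there differs, which is what the timings above are measuring.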

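Related discussion

I realize that I wasted the initialization with resize(); that's why I added the push_back version (which doesn't do the "wasted" initialization, but turned out slower). I don't know how to simply get a vector of uninitialized memory, the way malloc gives it.
@kizzx2: You need to use reserve(), not resize(). This allocates space for the objects (that is, it changes the capacity of the vector) but does not create the objects (that is, the size of the vector is left unchanged).
Supplying your own allocator as a template parameter might also help, letting the vector just grab its memory in big uninitialized heap blocks.
@James: That's what I did in UseVectorPushBack. It turned out slower.
@Crashworks: I tried looking into that. It seems like complete overkill for such a simple scenario. Is there a built-in null_allocator or something?
So I guess for long-lived objects, vector is "almost" as fast as an array. For short-lived objects, use arrays to eliminate the initialization cost (unless you want to write your own STL allocator).
To be honest, kizzx2, in the real-time graphics world we don't use the STL much in performance-sensitive code. I've never been able to get it to generate code exactly identical to a simple array (such that advancing and dereferencing an iterator costs just one add and one load).
You are doing 1 000 000 000 array accesses. The time difference is 0.333 seconds, or a difference of 0.000000000333 seconds per array access. Assuming a 2.33 GHz processor like mine, that's 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access.
Giving g++ the -S command line option will let you see exactly what assembly it emits.
Calling EDOCX1[1] in a loop isn't exactly fair. Change the malloc + loop to new Pixel[dimension * dimension], and the array loop is back to its earlier fast speed.
The final results look more in line with popular opinion. But I wouldn't say "vector done right is faster" over 0.025 seconds :p, and it comes at the cost of either wasting the contained object's default constructor for the sake of the container, or writing an STL allocator.
@Martin: "You are doing 1 000 000 000 array accesses. The time difference is 0.333 seconds, or a difference of 0.000000000333 seconds per array access. Assuming a 2.33 GHz processor like mine, that's 0.7 instruction pipeline stages per array access. So the vector looks like it is using one extra instruction per access." That sounds like you're trying to inflate the numbers, but I'd guess that looping over a 1000*1000 pixel map is a very common operation in any graphics program.
Yes, but we never do it in such a linear component-wise way. It's almost always done in SIMD, either as a shader program or as an unrolled inner loop on the CPU that handles several pixels at once. At this level of performance, even the pipeline bubbles introduced by the loop branch matter a great deal. (In fact, I'm working on fixing some branch mispredictions in a renderer right now.)
@James McNellis: You can't just replace resize() with reserve(), because that doesn't adjust the vector's internal idea of its own size, so subsequent writes to its elements are technically "writing past the end" and produce UB. Even though in practice every STL implementation will "behave itself" in that respect, how do you resync the vector's size? If you call resize() after populating the vector, it will quite possibly overwrite all those elements with default-constructed values!
@kizzx2: It's 1 000 000 000 array accesses: you have an array of 1000*1000 elements, and then another loop (inside the function) that repeats the operation 1000 times.
@j_random_hacker: Isn't that what I said? I thought I was quite clear that reserve only changes the capacity of a vector, not its size.
@James: My point is that using reserve() instead of resize(), as you suggest, is of no use because the former does not change the size, which is an important piece of information about a vector! Going that route means the program cannot trust the value of pixels.size() afterwards.
What's really interesting is that VC++ is slower with assign() than with reserve() plus push_back(). Now that's just WTF.
...ditto for using the assign-like constructor. Another WTF. I'm poking around in the disassembly now, but I think this is already worth filing a bug. I'll forward this to Stephan tomorrow.
OK, come to think of it: there is a lot of exception-related stuff in the vector methods. Adding /EHsc to the compilation switches cleared that up, and assign() now actually beats the array. Yay.
@Pavel: Interesting, I hadn't considered the effect of exception handling. I compile all my VC++ with /EHsc.
Also important: the cost of exceptions when none is actually thrown is practically undetectable, and when they are thrown, performance is no longer your primary concern.
Compiling with exceptions enabled in MSVC incurs a measurable performance overhead compared with compiling them out, whether or not you ever throw. Try it.
@Crashworks: Very true for 32-bit Windows, but much less so for 64-bit.
@Crashworks: It's not that I don't believe you, but maybe you should ask a question about why that is (just as this question asked why vector was slower and the premise turned out to be mistaken). I don't know whether you are right or wrong since I haven't tested it, but everything I've read indicates that the difference from using exceptions is not measurable.
I can't find it right now, but somewhere there is a question about the overhead of exception handling when no exceptions are thrown. I thought there had to be non-zero overhead, since the disassembly showed MSVC++ adding some unless it can prove that no exception can occur. But it turned out that for the examples I tried, g++ really does have zero overhead, which I consider quite an engineering feat! (Zero execution-time overhead, that is.)
I don't think for(...) { pixels[i] = Pixel(); } and pixels.resize(...) are exactly the same. The former makes n default constructor calls and n assignment operator calls; the latter makes 1 default constructor call and n copy constructor calls. Given that both versions initialize the pixels to uninitialized garbage, this doesn't matter functionally. But it does point to a difference from new Pixel[...], which calls the default constructor n times and never uses the copy constructor or assignment operator to copy uninitialized garbage around.
@bk1e: Interesting, I hadn't realized that resize() takes a (defaulted) second parameter! I'd expect them to compile to nearly identical machine code with optimization on because of the "as-if" rule, but they are technically different operations.
The array case is faster because gcc vectorizes the accesses but fails to do so when working with the vector. See my answer below.
@jons34yp: We came to the conclusion that array access is not faster than vector access. They have exactly the same characteristics.
I ran this example in Visual Studio 2013 and got similar results in a release build: the vector was at most 10% slower. Then I ran the exe outside Visual Studio and UseArray() got twice as fast, while the vector sped up only about 30%. So in the end, UseArray was still more than twice as fast.
@Rozina: Then you are a very unique person (or have some unique setup). In the many times I have run this kind of test on many systems over the years, the runtime overhead of vector over a straight array has been zero.
You should also test it with -DNDEBUG.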
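The reserve() versus resize() disagreement in the comments above comes down to size() versus capacity(). The following is a small sketch of that difference; the struct and function names (SizeAndCapacity, after_reserve, after_resize, fill_after_reserve) are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SizeAndCapacity {
    std::size_t size;
    std::size_t capacity;
};

// reserve(): storage is allocated, but the vector still holds no elements.
SizeAndCapacity after_reserve(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    return SizeAndCapacity{v.size(), v.capacity()};
}

// resize(): the vector now really contains n elements.
SizeAndCapacity after_resize(std::size_t n) {
    std::vector<int> v;
    v.resize(n);
    return SizeAndCapacity{v.size(), v.capacity()};
}

// The well-defined way to fill reserved storage: append instead of indexing.
std::vector<int> fill_after_reserve(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));  // v[i] = ... would be out of range here
    }
    assert(v.size() == n);
    return v;
}
```

After reserve() the size is still zero, which is why writing through operator[] at that point is out of bounds even though the memory exists.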

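Great question. I came in here expecting to find some simple fix that would speed the vector tests right up. It didn't work out quite the way I expected!

Optimization helps, but it's not enough. With optimization on I'm still seeing a 2x performance difference between UseArray and UseVector. Interestingly, without optimization UseVector was significantly slower than UseVectorPushBack.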

# g++ -Wall -Wextra -pedantic -o vector vector.cpp
# ./vector
UseArray completed in 20.68 seconds
UseVector completed in 120.509 seconds
UseVectorPushBack completed in 37.654 seconds
The whole thing completed in 178.845 seconds
# g++ -Wall -Wextra -pedantic -O3 -o vector vector.cpp
# ./vector
UseArray completed in 3.09 seconds
UseVector completed in 6.09 seconds
UseVectorPushBack completed in 9.847 seconds
The whole thing completed in 19.028 seconds

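Idea #1 - Use new[] instead of malloc

I tried changing malloc() to new[] in UseArray so that the objects would get constructed, and changing from individual field assignment to assigning a Pixel instance. I also renamed the inner loop variable to j.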

void UseArray()
{
TestTimer t("UseArray");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

// Same speed as malloc().
Pixel * pixels = new Pixel[dimension * dimension];

for(int j = 0 ; j < dimension * dimension; ++j)
pixels[j] = Pixel(255, 0, 0);

delete[] pixels;
}
}

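Surprisingly (to me), none of those changes made any difference whatsoever. Not even the change to new[], which default-constructs all of the Pixels. It seems that gcc can optimize out the default constructor calls when using new[], but not when using vector.

Idea #2 - Remove repeated operator[] calls

I also attempted to get rid of the triple operator[] lookup and cache the reference to pixels[j]. That actually slowed UseVector down! Oops.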

for(int j = 0; j < dimension * dimension; ++j)
{
// Slower than accessing pixels[j] three times.
Pixel &pixel = pixels[j];
pixel.r = 255;
pixel.g = 0;
pixel.b = 0;
}

# ./vector
UseArray completed in 3.226 seconds
UseVector completed in 7.54 seconds
UseVectorPushBack completed in 9.859 seconds
The whole thing completed in 20.626 seconds

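Idea #3 - Remove constructors

What about removing the constructors entirely? Then perhaps gcc can optimize out the construction of all the objects when the vectors are created. What happens if we change Pixel to: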

struct Pixel
{
unsigned char r, g, b;
};

Result: about 10% faster. Still slower than an array. Hm.

# ./vector
UseArray completed in 3.239 seconds
UseVector completed in 5.567 seconds

Idea #4 - Use an iterator instead of a loop index

How about using a vector<Pixel>::iterator instead of a loop index?

for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j)
{
j->r = 255;
j->g = 0;
j->b = 0;
}

Result:

# ./vector
UseArray completed in 3.264 seconds
UseVector completed in 5.443 seconds

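Nope, no different. At least it's not slower. I thought this would have performance similar to #2, where I used a Pixel& reference.

Conclusion

Even if some smart cookie figures out how to make the vector loop as fast as the array one, this does not speak well of the default behavior of std::vector. So much for the compiler being smart enough to optimize out all the C++-ness and make STL containers as fast as raw arrays.

The bottom line is that the compiler is unable to optimize away the no-op default constructor calls when using std::vector. If you use plain new[] it optimizes them away just fine. But not with std::vector. Even if you can rewrite your code to eliminate the constructor calls, that flies in the face of the mantra around here: "The compiler is smarter than you. The STL is just as fast as plain C. Don't worry about it."

Related discussion

Thanks again for actually running the code. It's sometimes easy to get bashed for no reason when someone attempts to challenge popular opinion.
"So much for the compiler being smart enough to optimize out all the C++-ness and make STL containers as fast as raw arrays." I have a theory that this "the compiler is smart" idea is just a myth: C++ parsing is extremely hard, and the compiler is just a machine.
You can see Martin's answer. I think my test was a bit synthetic and emphasized the initialization part. For long-lived arrays that aren't in performance-critical operations, vector can probably justify itself with the extra benefits it brings.
I dunno. Sure, he was able to slow down the array test, but he didn't speed up the vector one. I edited above where I removed the constructors from Pixel and made it a simple struct, and it was still slow. That's bad news for anyone who uses simple types in a vector.
I wish I could upvote your answer twice. Smart ideas to try out (even though none of them really worked) that I couldn't even think of!
As for idea #2, that's not surprising. I've done some hardcore optimization before, and for primitive types like unsigned char, the CPU and the optimizer handle them faster in registers than in memory. If a reference is used, it sometimes forces the value to live at a memory address.
I don't think you really removed the constructor in your test. Doesn't the compiler-generated default constructor default-construct each member? You'd have to define a default constructor that explicitly does nothing to make it a valid test.
@Mark: kizzx2's original code had a default constructor that did nothing. When I removed it, things sped up by 10%. The compiler-generated default constructor will default-construct each member, yes, but default-constructing an unsigned char leaves it uninitialized.
@John Kugelman: For idea #2, did you try a const reference? That might give the compiler the hint it needs to avoid unnecessary memory fetches.
Just wanted to point out that the complexity of parsing C++ (which is insanely complex, yes) has nothing to do with the quality of optimization. The latter usually happens at stages where the parse result has already been transformed several times into a much lower-level representation.
@Evan: A const reference wouldn't work, because we need to assign to the object's members.
@j_random_hacker: Ah, you're right, I didn't read the sample code carefully.
Use for (std::vector<Pixel>::iterator j = pixels.begin(), k = pixels.end(); j < k; ++j) instead of for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j). I've observed big improvements on some architectures from using < instead of !=.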

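This is an old but popular question.

At this point, many programmers will be working in C++11. And in C++11 the OP's code as written runs equally fast for UseArray and UseVector.

UseVector completed in 3.74482 seconds
UseArray completed in 3.70414 seconds

The fundamental problem was that, while your Pixel structure was left uninitialized, std::vector<T>::resize( size_t, T const&=T() ) takes a default-constructed Pixel and copies it. The compiler did not notice it was being asked to copy uninitialized data, so it actually performed the copy.

In C++11, std::vector<T>::resize has two overloads. The first is std::vector<T>::resize(size_t), the other is std::vector<T>::resize(size_t, T const&). This means that when you invoke resize without a second argument, it simply default-constructs, and the compiler is smart enough to realize that default construction does nothing, so it skips the pass over the buffer.

(The two overloads were added to handle movable, constructible and non-copyable types; the performance improvement when working on uninitialized data is a bonus.)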
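The difference between the two resize overloads can be observed by counting constructor calls. This is only a sketch: Counted, Counts and the two functions are hypothetical names invented here, and the numbers assume a C++11 or later standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: counts how its instances come into existence.
struct Counted {
    static int defaults;
    static int copies;
    Counted() { ++defaults; }
    Counted(const Counted&) { ++copies; }
};
int Counted::defaults = 0;
int Counted::copies = 0;

struct Counts {
    int defaults;
    int copies;
};

// resize(n): each new element is default-constructed in place.
Counts count_resize(std::size_t n) {
    Counted::defaults = Counted::copies = 0;
    std::vector<Counted> v;
    v.resize(n);
    assert(v.size() == n);
    return Counts{Counted::defaults, Counted::copies};
}

// resize(n, value): the caller builds one prototype, which is then copied.
Counts count_resize_with_value(std::size_t n) {
    Counted::defaults = Counted::copies = 0;
    std::vector<Counted> v;
    v.resize(n, Counted());
    assert(v.size() == n);
    return Counts{Counted::defaults, Counted::copies};
}
```

With the single-argument overload there are n default constructions and no copies; with the two-argument overload there is one default construction (the prototype) and at least n copies.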

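The push_back solution also does fencepost checking, which slows it down, so it remains slower than the malloc version.

Live example (I also replaced the timer with chrono::high_resolution_clock).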
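Since the live example itself is not reproduced here, the following is only a guess at what a Boost-free timer could look like. ChronoTimer and elapsed_seconds are hypothetical names, and std::chrono::steady_clock is used in place of the high_resolution_clock mentioned above because it is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <iostream>
#include <string>

// RAII timer in the spirit of TestTimer, using only the standard library.
class ChronoTimer {
public:
    explicit ChronoTimer(const std::string& name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}

    // Seconds elapsed since construction.
    double elapsed_seconds() const {
        const std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        return d.count();
    }

    // Prints the elapsed time when the timer goes out of scope.
    ~ChronoTimer() {
        std::cout << name_ << " completed in " << elapsed_seconds()
                  << " seconds" << std::endl;
    }

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

It is used the same way as TestTimer in the question: declare one at the top of the scope to be measured.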

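Note that if you have a structure that usually requires initialization, but you want to handle it after growing your buffer, you can do this with a custom std::vector allocator. If you then want to move it into a more normal std::vector, I believe careful use of allocator_traits and overriding of == might pull that off, but I am unsure.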
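One commonly seen shape for such an allocator is an adaptor that default-initializes elements instead of value-initializing them. The sketch below is an assumption about what the author had in mind, not their code; default_init_allocator and make_uninitialized_buffer are names chosen for this illustration.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor: construct(p) with no arguments performs default
// initialization, so trivial types such as int are left uninitialized.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;

public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    using A::A;  // reuse the underlying allocator's constructors

    // No arguments: default-initialize (no zeroing for trivial types).
    template <typename U>
    void construct(U* ptr) noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }

    // With arguments: defer to the underlying allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        traits::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};

// Hypothetical usage: n ints whose values are indeterminate until written.
std::vector<int, default_init_allocator<int>> make_uninitialized_buffer(std::size_t n) {
    return std::vector<int, default_init_allocator<int>>(n);
}
```

The elements of such a buffer must be written before they are read. Note also that the result is a different type from std::vector<int>, which is the interoperability concern raised above.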

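Related discussion

To be fair, you cannot compare a C++ implementation to a C implementation, which is what I would call your malloc version. malloc does not create objects; it only allocates raw memory. Treating that memory as objects without calling the constructor is poor C++ (possibly invalid; I'll leave that to the language lawyers).

That said, simply changing the malloc to new Pixel[dimension*dimension] and the free to delete [] pixels does not make much difference with the simple implementation of Pixel that you have. Here are the results on my box (E6600, 64-bit):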

UseArray completed in 0.269 seconds
UseVector completed in 1.665 seconds
UseVectorPushBack completed in 7.309 seconds
The whole thing completed in 9.244 seconds

But with a slight change, the tables turn:

Pixel.h

struct Pixel
{
Pixel();
Pixel(unsigned char r, unsigned char g, unsigned char b);

unsigned char r, g, b;
};

Pixel.cc

#include "Pixel.h"

Pixel::Pixel() {}
Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b)
: r(r), g(g), b(b) {}

main.cc

#include "Pixel.h"
[rest of test harness without class Pixel]
[UseArray now uses new/delete not malloc/free]

Compiled this way:

$ g++ -O3 -c -o Pixel.o Pixel.cc
$ g++ -O3 -c -o main.o main.cc
$ g++ -o main main.o Pixel.o

We get very different results:

UseArray completed in 2.78 seconds
UseVector completed in 1.651 seconds
UseVectorPushBack completed in 7.826 seconds
The whole thing completed in 12.258 seconds

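With a non-inlined constructor for Pixel, std::vector now beats a raw array.

It would appear that the complexity of allocation through std::vector and std::allocator is too much to be optimized as effectively as a simple new Pixel[n]. However, we can see that the problem is simply with the allocation, not with the vector access, by tweaking a couple of the test functions to create the vector/array once, moving it outside the loop: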

void UseVector()
{
TestTimer t("UseVector");

int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);

for(int i = 0; i < 1000; ++i)
{
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}

and

void UseArray()
{
TestTimer t("UseArray");

int dimension = 999;
Pixel * pixels = new Pixel[dimension * dimension];

for(int i = 0; i < 1000; ++i)
{
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
delete [] pixels;
}

We now get these results:

UseArray completed in 0.254 seconds
UseVector completed in 0.249 seconds
UseVectorPushBack completed in 7.298 seconds
The whole thing completed in 7.802 seconds

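What we can learn from this is that std::vector is comparable to a raw array for access, but if you need to create and delete the vector/array many times, creating a complex object will take more time than creating a simple array when the element's constructor is not inlined. I don't think that this is very surprising.

Related discussion

Try this: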

void UseVectorCtor()
{
TestTimer t("UseConstructor");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
}
}

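I get almost exactly the same performance as with the array.

The thing about vector is that it's a much more general tool than an array. That means you have to consider how you use it. It can be used in many different ways, providing functionality that an array doesn't even have. If you use it "wrong" for your purpose you incur a lot of overhead, but if you use it correctly it is usually basically a zero-overhead data structure. In this case, the problem is that you separately initialized the vector (causing all elements to have their default ctor called) and then overwrote each element individually with the correct value. That is much harder for the compiler to optimize away than when you do the same thing with an array. This is why the vector provides a constructor that lets you do exactly that: initialize N elements with value X.

And when you use that, the vector is just as fast as an array.

So no, you haven't busted the performance myth. But you have shown that it's only true if you use the vector optimally, which is a pretty good point too. :)

On the bright side, it's really the simplest usage that turns out to be the fastest. If you contrast my code snippet (a single line) with John Kugelman's answer, which contains heaps of tweaks and optimizations that still don't quite eliminate the performance difference, it's pretty clear that vector is rather cleverly designed after all. You don't have to jump through hoops to get speed equal to an array. On the contrary, you have to use the simplest possible solution.

Related discussion

I still question whether this is a fair comparison. If you're getting rid of the inner loop, then the array equivalent would be to construct a single Pixel object and then blit it across the entire array.
Using new[] performs the same default constructions that vector.resize() does, yet it is much faster. new[] + inner loop should be the same speed as vector.resize() + inner loop, but it's not; it's nearly twice as fast.
@John: It is a fair comparison. In the original code, the array is allocated with malloc, which doesn't initialize or construct anything, so it is effectively a single-pass algorithm just like my vector sample. As for new[], the answer is obviously that both require two passes, but in the new[] case the compiler is able to optimize that additional overhead away, which it doesn't do in the vector case. But I don't see why what happens in suboptimal cases is interesting. If you care about performance, you don't write code like that.
@John: Interesting comment. If I wanted to blit across the whole array, I guess the array is again the optimal solution, since I can't tell vector::resize() to give me a contiguous chunk of memory without it wasting time calling useless constructors.
@kizzx2: Yes and no. An array is normally initialized as well in C++. In C you'd use malloc, which doesn't perform initialization, but that won't work in C++ with non-POD types. So in the general case, a C++ array would be just as bad. Perhaps the question is: if you're going to perform this blitting often, wouldn't you reuse the same array/vector? And if you do that, you only pay the cost of the "useless constructors" once, at the very start. The actual blitting is just as fast after all.
But it's a good example. As I said in my post, the standard library tries to be general-purpose. If you only need to store POD types, need to perform operations such as blitting, for some reason have to recreate the array every time, and the array is of fixed size, then a raw array allocated with malloc may be the better solution. (But avoiding the reallocation altogether would be even better, and would largely negate vector's performance penalty.)
jalf: What do you mean by "won't work in C++ with non-POD types"? Do you mean structs that have pointers? Still, I don't see why vector can't give me a choice to turn off the constructors if I don't want them. The decision was made to store things contiguously, and operator[] works just like an array's. I wonder why they left out that last bit :/

This was hardly a fair comparison when I first looked at your code; I definitely thought you weren't comparing apples with apples. So I thought, let's get the constructors and destructors called in all tests, and then compare.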

const size_t dimension = 1000;

void UseArray() {
TestTimer t("UseArray");
for(size_t j = 0; j < dimension; ++j) {
Pixel* pixels = new Pixel[dimension * dimension];
for(size_t i = 0 ; i < dimension * dimension; ++i) {
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = (unsigned char) (i % 255);
}
delete[] pixels;
}
}

void UseVector() {
TestTimer t("UseVector");
for(size_t j = 0; j < dimension; ++j) {
std::vector<Pixel> pixels(dimension * dimension);
for(size_t i = 0; i < dimension * dimension; ++i) {
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = (unsigned char) (i % 255);
}
}
}

int main() {
TestTimer t1("The whole thing");

UseArray();
UseVector();

return 0;
}

My thought was that with this setup, they should be exactly the same. It turns out I was wrong.

UseArray completed in 3.06 seconds
UseVector completed in 4.087 seconds
The whole thing completed in 10.14 seconds

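So why did this 30% performance loss even occur? The STL has everything in headers, so it should have been possible for the compiler to understand everything that was required.

My thought was that it lies in how the loop initializes all values with the default constructor. So I performed a test: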

#include <cstdio>
#include <vector>

// Counts default constructions and copy constructions separately.
class Tester {
public:
    static int count;
    static int count2;
    Tester() { count++; }
    Tester(const Tester&) { count2++; }
};
int Tester::count = 0;
int Tester::count2 = 0;

int main() {
    std::vector<Tester> myvec(300);
    printf("Default Constructed: %i\nCopy Constructed: %i\n", Tester::count, Tester::count2);

    return 0;
}

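As I suspected, the results were:

Default Constructed: 1
Copy Constructed: 300

This is clearly the source of the slowdown: the vector uses the copy constructor to initialize the elements from a default-constructed object.

This means that the following pseudo-operation order is happening during construction of the vector:

Pixel pixel;
for (auto i = 0; i < N; ++i) vector[i] = pixel;

Which, due to the implicit copy constructor made by the compiler, is expanded to the following: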

Pixel pixel;
for (auto i = 0; i < N; ++i) {
vector[i].r = pixel.r;
vector[i].g = pixel.g;
vector[i].b = pixel.b;
}

So the default Pixel remains uninitialized, while the rest are initialized with the default Pixel's uninitialized values.

Compared to the alternative situation with new[]/delete[]:

int main() {
Tester* myvec = new Tester[300];

printf("Default Constructed: %i\nCopy Constructed: %i\n", Tester::count, Tester::count2);

delete[] myvec;

return 0;
}

Default Constructed: 300
Copy Constructed: 0

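They are all left with their uninitialized values, and without the double iteration over the sequence.

Armed with this information, how can we test it? Let's try overriding the implicit copy constructor.

Pixel(const Pixel&) {}

The results?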

UseArray completed in 2.617 seconds
UseVector completed in 2.682 seconds
The whole thing completed in 5.301 seconds

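So in summary, if you're making hundreds of vectors very often: rethink your algorithm.

In any case, the STL implementation isn't slower for some unknown reason; it just does exactly what you ask of it, hoping you know better.

Related discussion

Try disabling checked iterators and building in release mode. You shouldn't see much of a performance difference.

Related discussion

GNU's STL (and others), given vector<T>(n), default-constructs a prototypal object T(), which the compiler will optimize away as an empty constructor. But then a copy of whatever garbage happened to be in the memory now reserved for that object is taken by the STL's __uninitialized_fill_n_aux, which loops populating copies of that object as the default values in the vector. So "my" STL is not looping and constructing, but constructing once and then looping/copying. It's counter-intuitive, but I should have remembered, since I commented on a recent Stack Overflow question about this very point: the construct/copy approach can be more efficient for reference-counted objects and the like.

So:

vector<T> x(n);

or

vector<T> x;
x.resize(n);

is, on many STL implementations, something like: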

T temp;
for (int i = 0; i < n; ++i)
x[i] = temp;

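The issue is that the current generation of compiler optimizers doesn't seem to work from the insight that temp is uninitialized garbage, and fails to optimize out the loop and the default copy constructor invocations. You could credibly argue that compilers absolutely shouldn't optimize this away, since a programmer writing the above has a reasonable expectation that all the objects will be identical after the loop, even if they are garbage (the usual caveats about "identical", operator== versus memcmp, operator= and so on apply). The compiler can't be expected to have any extra insight into the larger context of std::vector<> or the later usage of the data that would suggest this optimization is safe.

This can be contrasted with the more obvious, direct implementation:

for (int i = 0; i < n; ++i)
    x[i] = T();

Which we can expect a compiler to optimize out.

To be a bit more explicit about the justification for this aspect of vector's behaviour, consider:

std::vector<big_reference_counted_object> x(10000);

Clearly it's a major difference whether we make 10000 independent objects or 10000 objects referencing the same data. There's a reasonable argument that the advantage of protecting casual C++ users from accidentally doing something so expensive outweighs the very small real-world cost of hard-to-optimize copy construction.

ORIGINAL ANSWER (for reference / making sense of the comments): No chance. vector is as fast as an array, at least if you reserve space sensibly. ...

Related discussion

Martin York's answer bothers me because it seems like an attempt to brush the initialization problem under the carpet. But he is right to identify redundant default construction as the source of the performance problems.

[EDIT: Martin's answer no longer suggests changing the default constructor.]

For the immediate problem at hand, you could certainly call the 2-parameter version of the vector<Pixel> ctor instead:

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

That works if you want to initialize with a constant value, which is a common case. But the more general problem is: how can you efficiently initialize with something more complicated than a constant value?

For this you can use a back_insert_iterator, which is an iterator adaptor. Here's an example with a vector of ints, although the general idea works just as well for Pixels: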

#include <algorithm>
#include <iterator>
#include <vector>

// Simple functor that returns successive squares: 1, 4, 9, 16...
struct squares {
    squares() : i(0) {}
    int operator()() { ++i; return i * i; }  // not const: it advances i

private:
    int i;
};

...

std::vector<int> v;
v.reserve(someSize); // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());

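Alternatively you could use copy() or transform() instead of generate_n().

The downside is that the logic to construct the initial values needs to be moved into a separate class, which is less convenient than having it in place (although lambdas in C++1x make this much nicer). Also, I expect this will still not be as fast as a malloc()-based non-STL version, but I expect it will be close, since it only does one construction for each element.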
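For readers on C++11 or later, the functor class can be replaced by a lambda, as hinted at above. This is a minimal sketch; first_squares is a hypothetical function name used only to make the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Builds {1, 4, 9, ...} with one construction per element.
std::vector<int> first_squares(std::size_t count) {
    std::vector<int> v;
    v.reserve(count);  // one allocation up front
    int i = 0;
    // The lambda captures i by reference and plays the role of the functor.
    std::generate_n(std::back_inserter(v), count, [&i] {
        ++i;
        return i * i;
    });
    assert(v.size() == count);
    return v;
}
```

The generating logic now sits next to the call that uses it, which addresses the inconvenience mentioned above.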

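The vector versions additionally call the Pixel constructors.

Each one causes almost a million ctor runs that you're timing.

Edit: then there's the outer 1...1000 loop, so make that a billion ctor calls!

Edit 2: it would be interesting to see the disassembly for the UseArray case. An optimizer could optimize the whole thing away, since it has no effect other than burning CPU.

Related discussion

My laptop is a Lenovo G770 (4 GB RAM).

The OS is Windows 7 64-bit (the one that came with the laptop).

The compiler is MinGW 4.6.1.

The IDE is Code::Blocks.

I tested the source code from the first post.

The results

O2 optimization

UseArray completed in 2.841 seconds

UseVector completed in 2.548 seconds

UseVectorPushBack completed in 11.95 seconds

The whole thing completed in 17.342 seconds

O3 optimization

UseArray completed in 1.452 seconds

UseVector completed in 2.514 seconds

UseVectorPushBack completed in 12.967 seconds

The whole thing completed in 16.937 seconds

It looks like the vector's performance is worse under O3 optimization.

If you change the loop to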

pixels[i].r = i;
pixels[i].g = i;
pixels[i].b = i;

then the speed of the array and the vector is almost the same under both O2 and O3.

Related discussion

Some profiler data (Pixel is aligned to 32 bits):

g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds

Blah

andrey@nv:~$ opannotate --source libcchem/src/a.out | grep "Total samples for file" -A3
Overflow stats not available
* Total samples for file :"/usr/include/c++/4.4/ext/new_allocator.h"
*
* 141008 52.5367
*/
--
* Total samples for file :"/home/andrey/libcchem/src/test.cpp"
*
* 61556 22.9345
*/
--
* Total samples for file :"/usr/include/c++/4.4/bits/stl_vector.h"
*
* 41956 15.6320
*/
--
* Total samples for file :"/usr/include/c++/4.4/bits/stl_uninitialized.h"
*
* 20956 7.8078
*/
--
* Total samples for file :"/usr/include/c++/4.4/bits/stl_construct.h"
*
* 2923 1.0891
*/

In allocator:

: // _GLIBCXX_RESOLVE_LIB_DEFECTS
: // 402. wrong new expression in [some_] allocator::construct
: void
: construct(pointer __p, const _Tp& __val)
141008 52.5367 : { ::new((void *)__p) _Tp(__val); }

vector:

:void UseVector()
:{ /* UseVector() total: 60121 22.3999 */
...
:
:
10790 4.0201 : for (int i = 0; i < dimension * dimension; ++i) {
:
495 0.1844 : pixels[i].r = 255;
:
12618 4.7012 : pixels[i].g = 0;
:
2253 0.8394 : pixels[i].b = 0;
:
: }

array

:void UseArray()
:{ /* UseArray() total: 35191 13.1114 */
:
...
:
136 0.0507 : for (int i = 0; i < dimension * dimension; ++i) {
:
9897 3.6874 : pixels[i].r = 255;
:
3511 1.3081 : pixels[i].g = 0;
:
21647 8.0652 : pixels[i].b = 0;

Most of the overhead is in the copy constructor. For example,

std::vector < Pixel > pixels;//(dimension * dimension, Pixel());

pixels.reserve(dimension * dimension);

for (int i = 0; i < dimension * dimension; ++i) {

pixels[i].r = 255;

pixels[i].g = 0;

pixels[i].b = 0;
}

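It has the same performance as the array. (Note that after reserve() alone the vector's size() is still 0, so writing through operator[] as above is technically undefined behaviour, as one of the earlier comments points out.)

Related discussion

A better benchmark (I think...): because of optimizations, the compiler can change the code when the allocated vectors/arrays are never used anywhere. Results: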

$ g++ test.cpp -o test -O3 -march=native
$ ./test
UseArray inner completed in 0.652 seconds
UseArray completed in 0.773 seconds
UseVector inner completed in 0.638 seconds
UseVector completed in 0.757 seconds
UseVectorPushBack inner completed in 6.732 seconds
UseVectorPush completed in 6.856 seconds
The whole thing completed in 8.387 seconds

Compiler:

gcc version 6.2.0 20161019 (Debian 6.2.0-9)

CPU:

model name : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz

Code:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}

~TestTimer()
{
using namespace std;
using namespace boost;

posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;

cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}

private:
std::string name;
boost::posix_time::ptime start;
};

struct Pixel
{
Pixel()
{
}

Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}

unsigned char r, g, b;
};

void UseVector(std::vector<std::vector<Pixel> >& results)
{
TestTimer t("UseVector inner");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

std::vector<Pixel>& pixels = results.at(i);
pixels.resize(dimension * dimension);

for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}

void UseVectorPushBack(std::vector<std::vector<Pixel> >& results)
{
TestTimer t("UseVectorPushBack inner");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

std::vector<Pixel>& pixels = results.at(i);
pixels.reserve(dimension * dimension);

for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}

void UseArray(Pixel** results)
{
TestTimer t("UseArray inner");

for(int i = 0; i < 1000; ++i)
{
int dimension = 999;

Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

results[i] = pixels;

for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}

// free(pixels);
}
}

void UseArray()
{
TestTimer t("UseArray");
Pixel** array = (Pixel**)malloc(sizeof(Pixel*)* 1000);
UseArray(array);
for(int i=0;i<1000;++i)
free(array[i]);
free(array);
}

void UseVector()
{
TestTimer t("UseVector");
{
std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
UseVector(vector);
}
}

void UseVectorPushBack()
{
TestTimer t("UseVectorPush");
{
std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
UseVectorPushBack(vector);
}
}

int main()
{
TestTimer t1("The whole thing");

UseArray();
UseVector();
UseVectorPushBack();

return 0;
}

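Here's how the push_back method in vector works:

The vector allocates X amount of space when it is initialized.

As stated below, it checks whether there is room in the current underlying array for the item.

It makes a copy of the item in the push_back call.

After calling push_back X times:

The vector reallocates kX amount of space into a second array.

It copies the entries of the first array into the second.

It discards the first array.

It now uses the second array as storage until it reaches kX entries.

Repeat. If you're not reserving space, it is definitely going to be slower. More than that, if the item is expensive to copy, then push_back like that is going to eat you alive.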
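The growth sequence described above can be watched from the outside by tracking capacity(). This is a minimal sketch; count_reallocations is a hypothetical helper, and the growth factor k is implementation-defined, so only the presence or absence of reallocations is checked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often capacity() changes while n ints are appended.
// Each change means a new block was allocated and the old elements moved over.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        v.reserve(n);  // a single allocation before the loop
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}
```

With reserve_first set, the count is zero; without it, the vector has to reallocate and copy at least once, and typically a number of times that grows logarithmically with n.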

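As for the vector versus array thing, I'm going to have to agree with the other people. Run the benchmark in release, turn optimizations on, and put in a few more flags so that the friendly people at Microsoft don't #@%$^ it up for ya.

One more thing: if you don't need to resize, use Boost.Array.

Related discussion

By the way, the slowdown you see in classes that use vector also occurs with standard types like int. Here's a multithreaded code: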

#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <typeinfo>
#include <vector>
#include <pthread.h>
#include <sstream>
#include <fstream>
using namespace std;

//pthread_mutex_t map_mutex=PTHREAD_MUTEX_INITIALIZER;

long long num=500000000;
int procs=1;

struct iterate
{
int id;
int num;
void * member;
iterate(int a, int b, void *c) : id(a), num(b), member(c) {}
};

//fill out viterate and piterate
void * viterate(void * input)
{
printf("am in viterate\n");
iterate * info=static_cast<iterate *> (input);
// reproduce member type
vector<int> test= *static_cast<vector<int>*> (info->member);
for (int i=info->id; i<test.size(); i+=info->num)
{
//printf("am in viterate loop\n");
test[i];
}
pthread_exit(NULL);
}

void * piterate(void * input)
{
printf("am in piterate\n");
iterate * info=static_cast<iterate *> (input);;
int * test=static_cast<int *> (info->member);
for (int i=info->id; i<num; i+=info->num) {
//printf("am in piterate loop\n");
test[i];
}
pthread_exit(NULL);
}

int main()
{
cout<<"producing vector of size"<<num<<endl;
vector<int> vtest(num);
cout<<"produced a vector of size"<<vtest.size()<<endl;
pthread_t thread[procs];

iterate** it=new iterate*[procs];
int ans;
void *status;

cout<<"begining to thread through the vector\n";
for (int i=0; i<procs; i++) {
it[i]=new iterate(i, procs, (void *) &vtest);
// The thread must actually be created before it is joined below.
ans=pthread_create(&thread[i],NULL,viterate, (void *) it[i]);
}
for (int i=0; i<procs; i++) {
pthread_join(thread[i], &status);
}
cout<<"end of threading through the vector";
//reuse the iterate structures

cout<<"producing a pointer with size"<<num<<endl;
int * pint=new int[num];
cout<<"produced a pointer with size"<<num<<endl;

cout<<"begining to thread through the pointer\n";
for (int i=0; i<procs; i++) {
// piterate casts member back to int*, so pass the array itself, not its address.
it[i]->member=pint;
ans=pthread_create(&thread[i], NULL, piterate, (void*) it[i]);
}
for (int i=0; i<procs; i++) {
pthread_join(thread[i], &status);
}
cout<<"end of threading through the pointer\n";

//delete structure array for iterate
for (int i=0; i<procs; i++) {
delete it[i];
}
delete [] it;

//delete pointer
delete [] pint;

cout<<"end of the program"<<endl;
return 0;
}

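The behavior of the code shows that instantiating the vector is the longest part of the code. Once you get through that bottleneck, the rest of the code runs extremely fast. This is true no matter how many threads you run.

By the way, ignore the absolutely insane number of includes. I have been using this code to test things for a project, so the number of includes keeps growing.

With the right options, vectors and arrays can generate identical asm. In those cases they are of course the same speed, because you get the same executable either way.

Related discussion

I did some extensive tests that I had been wanting to do for a while. I might as well share them.

This is my dual-boot machine: i7-3770, 16 GB RAM, x86_64, running Windows 8.1 and Ubuntu 16.04. More information, conclusions and remarks follow below. Tested with both MSVS 2017 and g++ (on both Windows and Linux).

Test program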

#include <iostream>
#include <chrono>
//#include
#include <array>
#include <locale>
#include <vector>
#include <queue>
#include <deque>

// Note: total size of array must not exceed 0x7fffffff B = 2,147,483,647B
// which means that largest int array size is 536,870,911
// Also image size cannot be larger than 80,000,000B
constexpr int long g_size = 100000;
int g_A[g_size];

int main()
{
std::locale loc("");
std::cout.imbue(loc);
constexpr int long size = 100000; // largest array stack size

// stack allocated c array
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
int A[size];
for (int i = 0; i < size; i++)
A[i] = i;

auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "c-style stack array duration=" << duration / 1000.0 << "ms\n";
std::cout << "c-style stack array size=" << sizeof(A) << "B\n\n";

// global stack c array
start = std::chrono::steady_clock::now();
for (int i = 0; i < g_size; i++)
g_A[i] = i;

duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "global c-style stack array duration=" << duration / 1000.0 << "ms\n";
std::cout << "global c-style stack array size=" << sizeof(g_A) << "B\n\n";

// raw c array heap array
start = std::chrono::steady_clock::now();
int* AA = new int[size]; // bad_alloc() if it goes higher than 1,000,000,000
for (int i = 0; i < size; i++)
AA[i] = i;

duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "c-style heap array duration=" << duration / 1000.0 << "ms\n";
std::cout << "c-style heap array size=" << sizeof(AA) << "B\n\n";
delete[] AA;

// std::array<>
start = std::chrono::steady_clock::now();
std::array<int, size> AAA;
for (int i = 0; i < size; i++)
AAA[i] = i;
//std::sort(AAA.begin(), AAA.end());

duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::array duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::array size=" << sizeof(AAA) << "B\n\n";

// std::vector<>
start = std::chrono::steady_clock::now();
std::vector<int> v;
for (int i = 0; i < size; i++)
v.push_back(i);
//std::sort(v.begin(), v.end());

duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::vector duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::vector size=" << v.size() * sizeof(v.back()) << "B\n\n";

// std::deque<>
start = std::chrono::steady_clock::now();
std::deque<int> dq;
for (int i = 0; i < size; i++)
dq.push_back(i);
//std::sort(dq.begin(), dq.end());

duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::deque duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::deque size=" << dq.size() * sizeof(dq.back()) << "B\n\n";

// std::queue<>
start = std::chrono::steady_clock::now();
std::queue<int> q;
for (int i = 0; i < size; i++)
q.push(i);

duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::queue duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::queue size=" << q.size() * sizeof(q.front()) << "B\n\n";
}

Results

//////////////////////////////////////////////////////////////////////////////////////////
// with MSVS 2017:
// >> cl /std:c++14 /Wall -O2 array_bench.cpp
//
// c-style stack array duration=0.15ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.130ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.90ms
// c-style heap array size=4B
//
// std::array duration=0.20ms
// std::array size=400,000B
//
// std::vector duration=0.544ms
// std::vector size=400,000B
//
// std::deque duration=1.375ms
// std::deque size=400,000B
//
// std::queue duration=1.491ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
//
// with g++ version:
// - (tdm64-1) 5.1.0 on Windows
// - (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 on Ubuntu 16.04
// >> g++ -std=c++14 -Wall -march=native -O2 array_bench.cpp -o array_bench
//
// c-style stack array duration=0ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.124ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.648ms
// c-style heap array size=8B
//
// std::array duration=1ms
// std::array size=400,000B
//
// std::vector duration=0.402ms
// std::vector size=400,000B
//
// std::deque duration=0.234ms
// std::deque size=400,000B
//
// std::queue duration=0.304ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
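
A side note on the "c-style heap array size" lines above: they report 4B and 8B because sizeof(AA) is applied to a pointer, so it measures the pointer and not the allocation behind it. The sketch below illustrates the difference. The names `kCount`, `Sizes` and `measure` are hypothetical, and the element count is an assumption chosen to match the 400,000B figures with a 4-byte int.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element count (assumption, not taken from the benchmark source).
constexpr std::size_t kCount = 100000;

struct Sizes {
    std::size_t pointer;   // sizeof applied to an int* that owns a heap array
    std::size_t array;     // sizeof applied to a real array object
    std::size_t computed;  // element count times element size
};

Sizes measure() {
    int* heap = new int[kCount];
    static int arr[kCount];

    Sizes s{sizeof(heap), sizeof(arr), kCount * sizeof(int)};
    delete[] heap;

    // sizeof only sees the whole array when the array type itself is visible.
    assert(s.array == s.computed);
    // For the heap array it reports the pointer size (commonly 4 or 8 bytes).
    assert(s.pointer == sizeof(int*));
    return s;
}
```

For heap storage, the byte count therefore has to be computed from the element count, which is what the vector, deque and queue lines of the benchmark already do.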

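Notes

Results are averaged over 10 runs.
I initially ran the tests with std::sort() as well (you can see it commented out), but removed it later because it made no significant relative difference.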
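
The averaging step is not shown in the listing, so here is a minimal sketch of how it could be done with std::chrono::steady_clock. The helper names `average_ms` and `fill_vector` are hypothetical and not part of the benchmark above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` `runs` times and returns the mean wall-clock time in milliseconds.
template <typename F>
double average_ms(F work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        auto start = clock::now();
        work();
        std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
        total += elapsed.count();
    }
    return total / runs;
}

// Example workload: fill a vector with push_back, as the benchmark does.
std::size_t fill_vector(int size) {
    std::vector<int> v;
    for (int i = 0; i < size; ++i)
        v.push_back(i);
    return v.size();
}
```

A call such as `average_ms([] { fill_vector(100000); }, 10)` would give the mean over ten runs. An optimizing compiler may remove work whose result is never used, so real measurements should consume the result somewhere.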

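My conclusions and remarks

Notice that the global C-style array takes almost as much time as the heap C-style array.
Across all tests, std::array showed remarkably stable timings between consecutive runs, while the others, especially the std:: data structures, varied wildly in comparison.
-O3 optimization did not show any noteworthy time differences.
Removing optimization on Windows cl (no -O2) and on g++ (Win/Linux, no -O2, no -march=native) increases times SIGNIFICANTLY, particularly for the std:: data structures. Overall times are higher with MSVS than with g++, but std::array and C-style arrays are faster on Windows without optimization.
g++ produces faster code than Microsoft's compiler (apparently it runs faster even on Windows).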

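Verdict

Of course, this is code for an optimized build. And since the question was about std::vector: yes, it is much slower than plain arrays (optimized or unoptimized). But when you are benchmarking, you naturally want to produce optimized code.

The star of the show for me, though, has been std::array.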

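I just want to mention that vector (and smart pointers) is only a thin layer on top of raw arrays (and raw pointers). In fact, access to a vector in contiguous memory can be faster than access to an array. The following code shows the results of initializing and accessing a vector and an array.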


#include <boost/date_time/posix_time/posix_time.hpp>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

#define SIZE 20000

using namespace std;

int main() {
    srand(time(NULL));

    // Vector version: SIZE rows, each a zero-initialized vector<int> of SIZE elements.
    vector<vector<int>> vector2d;
    vector2d.reserve(SIZE);
    int index(0);
    boost::posix_time::ptime start_total = boost::posix_time::microsec_clock::local_time();
    // timer start - build + access
    for (int i = 0; i < SIZE; i++) {
        vector2d.push_back(vector<int>(SIZE));
    }
    boost::posix_time::ptime start_access = boost::posix_time::microsec_clock::local_time();
    // timer start - access
    for (int i = 0; i < SIZE; i++) {
        index = rand() % SIZE;
        for (int j = 0; j < SIZE; j++) {
            vector2d[index][index]++;
        }
    }
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration msdiff = end - start_total;
    cout << "Vector total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
    msdiff = end - start_access;
    cout << "Vector access time: " << msdiff.total_milliseconds() << "milliseconds.\n";

    // Raw-array version. Note: new int[SIZE] leaves the elements uninitialized,
    // unlike vector<int>(SIZE), so the two build phases do different amounts of work.
    int** raw2d = nullptr;
    raw2d = new int*[SIZE];
    start_total = boost::posix_time::microsec_clock::local_time();
    // timer start - build + access
    for (int i = 0; i < SIZE; i++) {
        raw2d[i] = new int[SIZE];
    }
    start_access = boost::posix_time::microsec_clock::local_time();
    // timer start - access
    for (int i = 0; i < SIZE; i++) {
        index = rand() % SIZE;
        for (int j = 0; j < SIZE; j++) {
            raw2d[index][index]++;
        }
    }
    end = boost::posix_time::microsec_clock::local_time();
    msdiff = end - start_total;
    cout << "Array total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
    msdiff = end - start_access;
    cout << "Array access time: " << msdiff.total_milliseconds() << "milliseconds.\n";

    // Release every row, then the array of row pointers.
    for (int i = 0; i < SIZE; i++) {
        delete[] raw2d[i];
    }
    delete[] raw2d;
    return 0;
}

The output is:


Vector total time: 925milliseconds.
Vector access time: 4milliseconds.
Array total time: 30milliseconds.
Array access time: 21milliseconds.

So the speed is almost the same if you use it properly (as others have mentioned, by using reserve() or resize()).
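
To make the reserve() point concrete, the sketch below counts how often the buffer grows while appending with push_back, with and without an up-front reserve(). The function name `count_reallocations` is hypothetical, and capacity changes are used as a stand-in for reallocations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often the vector's capacity changes while appending `n` ints.
// `reserve_first` chooses whether reserve(n) is called before the loop.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first)
        v.reserve(n);

    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {  // the buffer was reallocated
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}
```

With reserve(n) the count is zero, because the standard guarantees no reallocation until the size exceeds the reserved capacity. Without it, the exact count depends on the library's growth factor.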

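Well, because vector::resize() does much more processing than a plain memory allocation (malloc).

Try putting a breakpoint in your copy constructor (define one so that you can set the breakpoint!) and you will see where the additional processing time goes.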
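
Instead of a breakpoint, a counter in the copy constructor gives the same information. The sketch below is an illustration under assumptions: `CountedPixel`, `copies_from_resize` and `copies_from_fill` are hypothetical names, and the behavior described is for C++11 or later. In C++03, resize() took a prototype value and copied it into every new element, whereas since C++11 resize(n) default-constructs the new elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A pixel type that counts how many times its copy constructor runs.
struct CountedPixel {
    static std::size_t copies;
    unsigned char r, g, b;

    CountedPixel() : r(0), g(0), b(0) {}
    CountedPixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b) {}
    CountedPixel(const CountedPixel& other) : r(other.r), g(other.g), b(other.b) { ++copies; }
    CountedPixel& operator=(const CountedPixel&) = default;
};
std::size_t CountedPixel::copies = 0;

// Copies made when an empty vector is resized to n elements.
std::size_t copies_from_resize(std::size_t n) {
    CountedPixel::copies = 0;
    std::vector<CountedPixel> pixels;
    pixels.resize(n);  // C++11 and later: default-constructs n elements
    assert(pixels.size() == n);
    return CountedPixel::copies;
}

// Copies made by the fill constructor, which copies the given value n times.
std::size_t copies_from_fill(std::size_t n) {
    CountedPixel::copies = 0;
    std::vector<CountedPixel> pixels(n, CountedPixel(255, 0, 0));
    assert(pixels.size() == n);
    return CountedPixel::copies;
}
```

Under these assumptions the fill constructor makes one copy per element, while resize() on an empty vector makes none but still runs the default constructor for every element, which a plain malloc does not.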

I have to say I'm not an expert in C++. But to add some experimental results:

Compile: gcc-6.2.0/bin/g++ -O3 -std=c++14 vector.cpp

Machine:

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

OS:

2.6.32-642.13.1.el6.x86_64

Output:


UseArray completed in 0.167821 seconds
UseVector completed in 0.134402 seconds
UseConstructor completed in 0.134806 seconds
UseFillConstructor completed in 1.00279 seconds
UseVectorPushBack completed in 6.6887 seconds
The whole thing completed in 8.12888 seconds

The only thing I find strange here is the performance of "UseFillConstructor" compared with "UseConstructor".

The code:


void UseConstructor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension);
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseFillConstructor()
{
    TestTimer t("UseFillConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension, Pixel(255,0,0));
    }
}

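So the additional "value" provided slows performance down quite a lot, which I think is due to the many calls to the copy constructor. But...

Compile:

gcc-6.2.0/bin/g++ -std=c++14 -O vector.cpp

Output: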


UseArray completed in 1.02464 seconds
UseVector completed in 1.31056 seconds
UseConstructor completed in 1.47413 seconds
UseFillConstructor completed in 1.01555 seconds
UseVectorPushBack completed in 6.9597 seconds
The whole thing completed in 11.7851 seconds

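So in this case, gcc optimization matters a great deal, but it does not help much when a value is supplied as the default. This actually goes against my intuition. Hopefully it helps new programmers choose which vector initialization form to use.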