Performance: fast text file reading in C++

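I am currently writing a program in C++ which includes reading lots of large text files. Each has about 400,000 lines, with in extreme cases 4000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Now I am wondering: is there a straightforward way to improve the reading speed?

Edit: The code I am using is more or less this: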

string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
    while(txtFile.good())
    {
        m_numLines++;
        getline(txtFile, tmpString);
    }
    txtFile.close();
}

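Edit 2: The file I read is only 82 MB big. I mainly mentioned that lines could reach 4000 characters because I thought it might be necessary to know in order to do buffering.

Edit 3: Thank you all for your answers, but it seems there is not much room for improvement given my problem. I have to use readline, since I want to count the number of lines. Instantiating the ifstream as binary did not make reading any faster either. I will try to parallelize it as much as I can; that should work at least.

Edit 4: So apparently there are some things I can do. Thank you very much for putting so much time into this, I appreciate it a lot! =)

Related discussion

Update: be sure to check the (surprising) updates below the initial answer.

Memory-mapped files have served me well [1]: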

#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <algorithm> // for std::find
#include <iostream>  // for std::cout
#include <cstdint>   // for uintmax_t
#include <cstring>   // for memchr

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    // count newlines by jumping from one '\n' to the next with memchr
    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

This should be rather quick.

Update

In case it helps you test this approach, here is a version that uses mmap directly instead of Boost (see it live on Coliru):

#include <algorithm>
#include <iostream>
#include <cstdint>  // uintmax_t
#include <cstdio>   // perror
#include <cstdlib>  // exit
#include <cstring>  // memchr

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    // count newlines by jumping from one '\n' to the next with memchr
    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg);
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");

    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}

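Update

I squeezed the last bit of performance out of this by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken by the memory-mapped file version above: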

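// Needs <fcntl.h> (open, posix_fadvise) and <unistd.h> (read, close);
// handle_error is the helper from the mmap example above.
static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. Use the named constant:
       its numeric value is platform-specific. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    // read() returns 0 at end of file, which ends the loop
    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");

        // count every '\n' in the block that was just read
        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    close(fd);
    return lines;
}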

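[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

Related discussion

4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you are likely getting about 100 MB/s sequential read. That is 16 seconds in I/O alone.

Since you don't elaborate on the specific code you are using or on how you need to parse these files (do you need to read them line by line, does the system have a lot of RAM, could you read the whole file into a large RAM buffer and then parse it?), there is little you can do to speed up the process.

Memory-mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for newlines rather than using "getline" would offer an improvement.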
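The "large chunks" idea above can be sketched as follows. This is only an illustration, not the answerer's code: the function name `count_newlines_chunked` and the 1 MiB default block size are assumptions, and no claim is made about how it compares with the other approaches in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: counts '\n' bytes by reading the file in large blocks
// instead of extracting one line at a time.
std::uintmax_t count_newlines_chunked(const std::string& path,
                                      std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) return 0;  // sketch: an unreadable file is treated as empty

    std::vector<char> buf(chunk_size);
    std::uintmax_t lines = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // short on the final block
        lines += static_cast<std::uintmax_t>(
            std::count(buf.data(), buf.data() + got, '\n'));
    }
    return lines;
}
```

Like the memchr loops elsewhere in this thread, it counts newline characters, so a final line without a trailing '\n' is not counted.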

EDIT after doing some learning (thanks @sehe): here is the memory-mapped solution I would likely use.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main() {
    const char* fName = "big.txt";
    //
    struct stat sb;
    long cntr = 0;
    int fd, lineLen;
    char *data;
    char *line;
    // map the file
    fd = open(fName, O_RDONLY);
    if (fd == -1 || fstat(fd, &sb) == -1) {
        perror(fName);
        return 1;
    }
    //// int pageSize;
    //// pageSize = getpagesize();
    //// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
    data = (char*) mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    line = data;
    // get lines
    while(cntr < sb.st_size) {
        lineLen = 0;
        line = data;
        // find the end of the line (check the bound before dereferencing)
        while(cntr < sb.st_size && *data != '\n') {
            data++;
            cntr++;
            lineLen++;
        }
        /***** PROCESS LINE *****/
        // ... processLine(line, lineLen);

        // step over the '\n' so the next iteration starts on the next line
        if (cntr < sb.st_size) {
            data++;
            cntr++;
        }
    }
    return 0;
}

Related discussion

+1 for the beer-coaster calculation. An SSD could reach ~500 MB/s though. Memory mapping could be more efficient depending on the usage scenario.
I need to read it line by line, because the files don't contain a header which tells me how long they are. I could put them into a RAM buffer because I can discard each one after reading it, but then again, I thought that was what ifstream did. Is there a way to tell the program to just put it into RAM?
@Sehe - I was always under the impression that memory mapping files was more of a convenience abstraction for overlapping I/O than a performance boost, especially for a sequential read task. My guess is the OP is using "getline", which is reading 1 byte at a time looking for a newline and causing a lot of unnecessarily small file reads. Using a large read buffer in a sequential ifstream would offer the exact same performance as a mapped file (but I am very open to being proven wrong).
@Arneecknagel - If you have enough RAM to handle it, you can get the file size, allocate a buffer large enough, and do one read operation into the buffer. This will of course have the hefty delay I mentioned. A better way would be to allocate a ~16 MB buffer, read into it, parse the lines you can, move the last (possibly unparsable at this time) line to the beginning of the buffer, and continue your read loop into the rest of it.
@Arneecknagel-The underlying recording and abstraction of a mapped file would make the task I described in my last comment a bit easier，but probably not any faster.
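Memory mapping is a convenience, but it also: avoids paying for all the pages you don't access, works in binary mode by implication, and only requires virtual address space (which can possibly be shared with a local buffer). Some filesystem drivers may even have zero-copy paths, especially on read-only maps.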
@Sehe - Thanks sehe, zero copy gave me something to look into. It seems that for sequential reads mmap offers an order of magnitude improvement. My earlier reservations came from past work on large files, where concurrency between readers and writers was a problem.
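The buffer scheme described in the comments above (read a block, hand out the complete lines, move the unfinished tail to the front, keep reading) might look roughly like this. It is a sketch under assumptions: `for_each_line` and `LineHandler` are names made up for illustration, and read errors are not distinguished from end of file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback type: receives a pointer to the line and its length
// (the line is not null-terminated and excludes the '\n').
using LineHandler = std::function<void(const char*, std::size_t)>;

// Returns false if the file cannot be opened or a single line does not fit
// into the buffer.
bool for_each_line(const std::string& path, std::size_t buf_size,
                   const LineHandler& handle)
{
    assert(buf_size > 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    std::vector<char> buf(buf_size);
    std::size_t kept = 0;  // bytes of an unfinished line at the buffer front

    while (true) {
        in.read(buf.data() + kept,
                static_cast<std::streamsize>(buf.size() - kept));
        const std::size_t filled = kept + static_cast<std::size_t>(in.gcount());
        const bool at_end = !in;  // short read: end of file (or an error)

        // hand out every complete line currently in the buffer
        const char* p = buf.data();
        const char* end = buf.data() + filled;
        while (const void* nl =
                   std::memchr(p, '\n', static_cast<std::size_t>(end - p))) {
            const char* q = static_cast<const char*>(nl);
            handle(p, static_cast<std::size_t>(q - p));
            p = q + 1;
        }

        kept = static_cast<std::size_t>(end - p);
        if (at_end) {
            if (kept > 0) handle(p, kept);  // last line without trailing '\n'
            return true;
        }
        if (kept == buf.size()) return false;  // one line is larger than buf
        std::memmove(buf.data(), p, kept);     // move the partial line forward
    }
}
```

A call such as `for_each_line("big.txt", 16u << 20, handler)` would correspond to the ~16 MB buffer mentioned in the comment; the buffer has to be at least as long as the longest line.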

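Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I ran a performance test comparing stringstream with ifstream: when reading a text file line by line, the performance is exactly the same.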

std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}

This takes 1426 ms on a 106 MB file.

std::ifstream stream;
std::string line;
while(stream.good()) {
    std::getline(stream, line);
}

This takes 1433 ms on the same file.

The following code is faster:

const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH];
while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}

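This takes 884 ms on the same file. It is a little tricky, because you have to set the maximum size of your buffer (i.e. the maximum length of each line in the input file).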
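To illustrate why the buffer size matters: when a line does not fit, `istream::getline` sets failbit and the loop simply stops, which is easy to miss. The sketch below counts lines with a fixed buffer and reports that case. The name `count_lines_fixed` and the `too_long` out-parameter are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: counts lines using a fixed-size buffer and sets
// too_long when reading stopped because a line did not fit into max_len - 1
// characters.
std::size_t count_lines_fixed(std::istream& in, std::size_t max_len,
                              bool& too_long)
{
    assert(max_len > 0);
    std::vector<char> line(max_len);
    std::size_t n = 0;
    while (in.getline(line.data(), static_cast<std::streamsize>(line.size())))
        ++n;
    // At a normal end of input eofbit is set as well; failbit alone means
    // an over-long line stopped the loop.
    too_long = in.fail() && !in.eof();
    return n;
}
```

For example, feeding it a `std::istringstream` holding "ab\ncdefgh\n" with `max_len` 4 stops after the first line and sets `too_long`, because the second line needs more than three characters.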

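Do you have to read all the files at the same time (at the start of your application, for example)?

If you do, consider parallelizing the operation.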
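One possible shape for that parallelization is one task per file via `std::async`, sketched below. The helper names `count_lines` and `count_lines_parallel` are assumptions for illustration. Whether this helps depends on the storage device, since several sequential reads compete for the same disk. On some toolchains it needs to be built with `-pthread`.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: counts '\n' characters in one file
// (an unreadable file yields 0).
std::uintmax_t count_lines(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return static_cast<std::uintmax_t>(
        std::count(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>(), '\n'));
}

// Launches one asynchronous task per file and collects the results in the
// same order as the input paths.
std::vector<std::uintmax_t>
count_lines_parallel(const std::vector<std::string>& paths)
{
    std::vector<std::future<std::uintmax_t>> jobs;
    jobs.reserve(paths.size());
    for (const auto& p : paths)
        jobs.push_back(std::async(std::launch::async, count_lines, p));

    std::vector<std::uintmax_t> counts;
    counts.reserve(jobs.size());
    for (auto& j : jobs)
        counts.push_back(j.get());  // blocks until that file is done
    return counts;
}
```

With many files, one thread per file may be too much; a fixed number of worker threads pulling paths from a queue would be the usual refinement.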

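Either way, consider using binary streams, or unbuffered reads of blocks of data.

Related discussion

As someone with a little background in competitive programming, I can tell you: at least for simple things like integer parsing, the main cost in C is locking the file streams (which is done by default for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use EDCOX1,8, but I don't know if it is as fast as EDCOX1,5.

Here is my standard integer parsing code, for reference. It is a lot faster than scanf, as I said mainly because it does not lock the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I had used previously, without the insane maintenance debt.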

int readint(void)
{
    int n, c;
    n = getchar_unlocked() - '0';
    while ((c = getchar_unlocked()) > ' ')
        n = 10*n + c-'0';
    return n;
}

(Note: this only works if there is exactly one non-digit character between any two integers.)
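If the input is less regular than that, a slightly more tolerant reader can be built on the same unlocked calls. The sketch below is an assumption-laden variant, not the code from the answer: `read_int` is a made-up name, it relies on the POSIX function `getc_unlocked`, and it does no overflow checking.

```cpp
#include <stdio.h>

// Hypothetical variant: reads the next decimal integer from `in`, skipping
// any run of separator characters and accepting a leading '-'.
// Returns false once the end of input is reached.
bool read_int(FILE* in, int& out)
{
    int c = getc_unlocked(in);
    // skip everything that cannot start a number
    while (c != EOF && c != '-' && (c < '0' || c > '9'))
        c = getc_unlocked(in);
    if (c == EOF)
        return false;

    bool negative = false;
    if (c == '-') {
        negative = true;
        c = getc_unlocked(in);
    }

    int n = 0;
    while (c >= '0' && c <= '9') {
        n = 10 * n + (c - '0');
        c = getc_unlocked(in);
    }
    out = negative ? -n : n;  // a lone '-' is reported as 0 in this sketch
    return true;
}
```

Called in a loop as `while (read_int(stdin, value)) { ... }`, it consumes one integer per iteration regardless of how many spaces or newlines separate them.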

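Of course, avoid memory allocation if possible...

Use random file access or use binary mode. For sequential reads this makes a big difference, but it still depends on what you are reading.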