How to parse a tar file in C++

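What I want to do is download a .tar file that contains multiple directories, each with two files. The problem is that I can't find a way to read the tar file without actually extracting it (using tar).

The perfect solution would be something like: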

#include <easytar>

Tarfile tar("somefile.tar");
std::string currentFile, currentFileName;
for(int i=0; i<tar.size(); i++){
    currentFile = tar.getFileText(i);
    currentFileName = tar.getFileName(i);
    // do stuff with it
}

I'll probably end up having to write this myself, but any ideas would be appreciated.

Related discussion

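After a bit of work, I figured it out myself. The tar file specification actually tells you everything you need to know.

First, every file starts with a 512-byte header, so you can represent it with a char[512] or with a char* pointing somewhere into a larger char array (if, for example, you have loaded the entire file into one array).

The header looks like this: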

Offset  Size (bytes)  Field
0       100           File name
100     8             File mode
108     8             Owner's numeric user ID
116     8             Group's numeric user ID
124     12            File size in bytes
136     12            Last modification time in numeric Unix time format
148     8             Checksum for header block
156     1             Link indicator (file type)
157     100           Name of linked file
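The layout in the table can also be written down as a struct, which saves remembering the offsets. This is only a sketch: TarHeader, read_header, and the trailing padding member are names I made up for illustration, and the static_assert lines simply check the struct against the offsets listed in the table.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical struct mirroring the header table; every member is a char
// array, so the compiler inserts no padding between fields.
struct TarHeader {
    char name[100];      // offset 0:   file name, null-padded
    char mode[8];        // offset 100: file mode
    char uid[8];         // offset 108: owner's numeric user ID
    char gid[8];         // offset 116: group's numeric user ID
    char size[12];       // offset 124: file size in bytes (octal text)
    char mtime[12];      // offset 136: last modification time
    char checksum[8];    // offset 148: checksum for header block
    char typeflag;       // offset 156: link indicator (file type)
    char linkname[100];  // offset 157: name of linked file
    char padding[255];   // offset 257: rest of the 512-byte block (not covered by the table)
};

static_assert(sizeof(TarHeader) == 512, "header must fill one 512-byte block");
static_assert(offsetof(TarHeader, size) == 124, "size field offset");
static_assert(offsetof(TarHeader, typeflag) == 156, "link indicator offset");
static_assert(offsetof(TarHeader, linkname) == 157, "linked name offset");

// Copy one 512-byte block out of a larger buffer into a TarHeader.
// memcpy is used instead of a pointer cast to stay clear of aliasing issues.
TarHeader read_header(const char* block) {
    TarHeader header;
    std::memcpy(&header, block, sizeof header);
    return header;
}
```

With this, read_header(&buffer[location]).typeflag reads the same byte as buffer[location + 156].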

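So if you want the file name, you can grab it right here with std::string filename(&buffer[0], 100); (note the &: the constructor needs a pointer to the first byte, not the byte itself). The file name is null-padded, so this string keeps the trailing null bytes. If you check that the field contains at least one null, you can leave off the length and let the string stop at the first null, which saves space.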
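As a small sketch of that check, the helper below stops at the first null but never reads past the field width, so it also copes with a name that fills all 100 bytes. The name field_to_string is my own; it is not part of any tar library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copy a fixed-width, null-padded tar field into a
// std::string, stopping at the first null byte or after `width` bytes,
// whichever comes first.
std::string field_to_string(const char* field, std::size_t width) {
    std::size_t len = 0;
    while (len < width && field[len] != '\0') {
        ++len;
    }
    return std::string(field, len);
}
```

For the file name this would be called as field_to_string(&buffer[0], 100), and the same helper works for the linked-file name at offset 157.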

Now we want to know whether it's a file or a folder. The "link indicator" field has this information, so:

// Note that we're comparing to ascii numbers, not ints
switch(buffer[156]){
    case '0': // intentionally dropping through
    case '\0':
        // normal file
        break;
    case '1':
        // hard link
        break;
    case '2':
        // symbolic link
        break;
    case '3':
        // character device (character special file)
        break;
    case '4':
        // block device
        break;
    case '5':
        // directory
        break;
    case '6':
        // named pipe
        break;
}

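At this point we already have all the information we need about directories, but we need one more thing from normal files: their actual contents.

The length of the file can be stored in two different ways: either as a null-terminated octal string padded with zeroes or spaces, or as "a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field".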

Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU-tar and BSD-tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.

Here is how you read the octal format; I haven't written code for the base-256 version:

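// in one function
unsigned long long size_of_file = octal_string_to_int(&buffer[124], 11);

// elsewhere
// Skips leading spaces (old archives pad with spaces), then reads octal
// digits until `size` characters are used up or a non-octal character
// (such as the terminating NUL or space) is reached.
// The result is unsigned long long because 11 octal digits do not fit in an int.
unsigned long long octal_string_to_int(const char *current_char, unsigned int size){
    unsigned long long output = 0;
    while(size > 0 && *current_char == ' '){
        current_char++;
        size--;
    }
    while(size > 0 && *current_char >= '0' && *current_char <= '7'){
        output = output * 8 + (*current_char - '0');
        current_char++;
        size--;
    }
    return output;
}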
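For the base-256 variant, here is an untested-in-the-wild sketch based only on the description quoted above: the high bit of the first byte marks the encoding, and the remaining bits are read as one big-endian binary number. The function names are mine, and I am assuming the value is non-negative and fits in 64 bits (a 12-byte field can hold more than that; anything above 64 bits is silently dropped here).

```cpp
#include <cstddef>
#include <cstdint>

// True if the numeric field uses the base-256 encoding, which is signalled
// by the high-order bit of the leftmost byte.
bool is_base256(const char* field) {
    return (static_cast<unsigned char>(field[0]) & 0x80) != 0;
}

// Hypothetical decoder: treats the field as a big-endian binary number
// after clearing the marker bit. Assumes a non-negative value that fits
// in 64 bits; higher-order bits are shifted out and lost.
std::uint64_t base256_to_uint(const char* field, std::size_t width) {
    std::uint64_t value = static_cast<unsigned char>(field[0]) & 0x7F;
    for (std::size_t i = 1; i < width; ++i) {
        value = (value << 8) | static_cast<unsigned char>(field[i]);
    }
    return value;
}
```

A caller would check is_base256(&buffer[124]) first, use base256_to_uint(&buffer[124], 12) if it returns true, and fall back to the octal parser otherwise.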

OK, now we have everything except the actual file contents. All we have to do is grab the next size bytes of data from the tar file, and we'll have our file contents:

// Needs <cmath> for ceil and <cstring> for memcpy
// Get to the next block after the header ends
location += 512;
file_contents = new char[size];
memcpy(file_contents, &buffer[location], size);
// Go to the next block by rounding the size up to a multiple of 512
// This isn't necessarily the most efficient way to do this,
// but it's the most obvious.
location += (int)ceil(size / 512.0) * 512;
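Putting the steps together, a loop over a whole archive held in memory might look roughly like this. Treat it as a sketch under a few assumptions of mine: TarEntry, parse_octal, and list_tar are made-up names, a header whose file name starts with a null byte is taken as the end of the archive, only the octal size encoding is handled, and the rounding is done with integer arithmetic instead of ceil.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one archive member.
struct TarEntry {
    std::string name;      // file name from offset 0
    char type;             // raw link indicator byte from offset 156
    std::string contents;  // filled in for normal files only
};

// Read an octal field: skip leading spaces, stop at the first non-octal byte.
std::size_t parse_octal(const char* field, std::size_t width) {
    std::size_t i = 0;
    std::size_t value = 0;
    while (i < width && field[i] == ' ') ++i;
    while (i < width && field[i] >= '0' && field[i] <= '7') {
        value = value * 8 + static_cast<std::size_t>(field[i] - '0');
        ++i;
    }
    return value;
}

// Walk the headers in `buffer` and collect every entry.
std::vector<TarEntry> list_tar(const std::vector<char>& buffer) {
    std::vector<TarEntry> entries;
    std::size_t location = 0;
    while (location + 512 <= buffer.size()) {
        const char* header = buffer.data() + location;
        if (header[0] == '\0') break;  // assumed end-of-archive marker

        std::size_t name_len = 0;
        while (name_len < 100 && header[name_len] != '\0') ++name_len;

        TarEntry entry;
        entry.name.assign(header, name_len);
        entry.type = header[156];
        const std::size_t size = parse_octal(header + 124, 12);

        location += 512;  // move past the header block
        if (size > buffer.size() - location) break;  // truncated archive

        if (entry.type == '0' || entry.type == '\0') {
            entry.contents.assign(buffer.data() + location, size);
        }
        location += (size + 511) / 512 * 512;  // round up to a full block
        entries.push_back(std::move(entry));
    }
    return entries;
}
```

Each returned entry then plays the role of tar.getFileName(i) and tar.getFileText(i) from the question, without ever extracting anything to disk.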