How to parse a tar file in C++
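What I want to do is download a .tar file that contains multiple directories, each with two files. The problem is that, without extracting the files (using …
The perfect solution would be something like: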
    // <easytar> is a wished-for library, not a real one
    #include <easytar>

    Tarfile tar("somefile.tar");
    std::string currentFile, currentFileName;
    for(int i=0; i<tar.size(); i++){
        currentFile = tar.getFileText(i);
        currentFileName = tar.getFileName(i);
        // do stuff with it
    }
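I'm probably going to have to write this myself, but any ideas would be appreciated.
I figured it out myself after a bit of work. The tar file spec actually tells you everything you need to know.
First off, every file starts with a 512-byte header, so you can represent it with a char[512] or with a char* pointing somewhere into a larger char array (for example, if you have the entire file loaded into one array).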
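Here is a minimal sketch of the "load the entire file into one array" approach. The helper names load_archive and header_at are my own, not from any library, and the sketch assumes the archive is small enough to fit in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole tar archive into memory.
std::vector<char> load_archive(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<char> data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    return data;
}

// Hypothetical helper: returns a pointer to the 512-byte header that starts
// at `offset`, or nullptr if the archive is too short to hold a full header
// there. Headers always start on a 512-byte boundary.
const char* header_at(const std::vector<char>& archive, std::size_t offset) {
    assert(offset % 512 == 0);
    if (offset > archive.size() || archive.size() - offset < 512) {
        return nullptr;
    }
    return archive.data() + offset;
}
```

With this, header_at(archive, 0) plays the role of the buffer pointer used in the rest of the answer.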
The header looks like this:
    location  size  field
    0         100   File name
    100       8     File mode
    108       8     Owner's numeric user ID
    116       8     Group's numeric user ID
    124       12    File size in bytes
    136       12    Last modification time in numeric Unix time format
    148       8     Checksum for header block
    156       1     Link indicator (file type)
    157       100   Name of linked file
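One way to make the header layout usable from code is to mirror it in a struct. This is only a sketch: the struct name, the field names, and read_header are my own, and the trailing `rest` member simply covers the remainder of the 512-byte block (later formats store more fields there, which are ignored here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Mirrors the header layout from the table. Every member is a char or a
// char array, so the compiler inserts no padding between fields.
struct TarHeader {
    char name[100];      // offset 0
    char mode[8];        // offset 100
    char uid[8];         // offset 108
    char gid[8];         // offset 116
    char size[12];       // offset 124
    char mtime[12];      // offset 136
    char checksum[8];    // offset 148
    char typeflag;       // offset 156
    char linkname[100];  // offset 157
    char rest[255];      // offset 257: remainder of the 512-byte block
};

static_assert(sizeof(TarHeader) == 512, "header must fill one block");
static_assert(offsetof(TarHeader, size) == 124, "size field offset");
static_assert(offsetof(TarHeader, typeflag) == 156, "type flag offset");
static_assert(offsetof(TarHeader, linkname) == 157, "link name offset");

// Copies a raw 512-byte block into a TarHeader.
TarHeader read_header(const char* block) {
    assert(block != nullptr);
    TarHeader h;
    std::memcpy(&h, block, sizeof h);
    return h;
}
```

The static_assert lines are there so that a typo in a field width fails at compile time instead of silently shifting every later field.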
So if you want the file name, you can grab it right here with …
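As an illustration of pulling the name out of the header, here is a small sketch. The function name header_file_name is hypothetical. It relies on the name field being NUL-padded, with the caveat that a name using all 100 bytes has no terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the file name stored in a 512-byte header.
// The name is NUL-padded, but a name that is exactly 100 bytes long has no
// terminator, so the scan is capped at the field width.
std::string header_file_name(const char* header) {
    assert(header != nullptr);
    const std::size_t field_width = 100;
    std::size_t length = 0;
    while (length < field_width && header[length] != '\0') {
        ++length;
    }
    return std::string(header, length);
}
```

Capping the scan matters because the byte right after the name field belongs to the file mode, so an unbounded strlen-style read could run into the next field.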
Now we want to know whether it's a file or a folder. The "link indicator" field has this information, so:
    // Note that we're comparing to ASCII characters, not ints
    switch(buffer[156]){
        case '0': // intentionally dropping through
        case '\0':
            // normal file
            break;
        case '1':
            // hard link
            break;
        case '2':
            // symbolic link
            break;
        case '3':
            // character device (special file)
            break;
        case '4':
            // block device
            break;
        case '5':
            // directory
            break;
        case '6':
            // named pipe
            break;
    }
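At this point we have all the information we need about directories, but we need one more thing from normal files: the actual file contents.
The length of the file can be stored in two different ways: either as a null-terminated octal string padded with zeroes or spaces, or as "a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field".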
Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU-tar and BSD-tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.
Here's how you read the octal format; I haven't written code for the base-256 version:
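    // in one function
    // (11 octal digits need 33 bits, so a 32-bit int is too small)
    unsigned long long size_of_file = octal_string_to_int(&buffer[124], 11);

    // elsewhere
    unsigned long long octal_string_to_int(const char *current_char, unsigned int size){
        unsigned long long output = 0;
        // skip the leading spaces written by pre-POSIX versions of tar
        while(size > 0 && *current_char == ' '){
            current_char++;
            size--;
        }
        // stop at the first byte that is not an octal digit (NUL or space)
        while(size > 0 && *current_char >= '0' && *current_char <= '7'){
            output = output * 8 + (*current_char - '0');
            current_char++;
            size--;
        }
        return output;
    }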
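For completeness, here is a rough sketch of how the base-256 form could be decoded. The function name base256_to_uint64 is hypothetical. The sketch assumes the value is stored big-endian after the flag bit, and that the second-highest bit of the first byte acts as a sign bit, which is my understanding of the GNU tar convention; treat that detail as an assumption and check it against the tar implementation you care about. Only non-negative values that fit in 64 bits are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes a numeric header field stored in base-256.
// Returns false if the field is not base-256, is negative, or does not fit
// in 64 bits. Only non-negative values are handled in this sketch.
bool base256_to_uint64(const char* field, std::size_t size, std::uint64_t& out) {
    assert(field != nullptr && size > 0);
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(field);
    if ((bytes[0] & 0x80) == 0) {
        return false;  // high bit clear: this is an octal string instead
    }
    if ((bytes[0] & 0x40) != 0) {
        return false;  // assumed sign bit set: negative value, not handled
    }
    std::uint64_t value = bytes[0] & 0x3F;  // remaining bits of first byte
    for (std::size_t i = 1; i < size; ++i) {
        if ((value >> 56) != 0) {
            return false;  // another shift would overflow 64 bits
        }
        value = (value << 8) | bytes[i];
    }
    out = value;
    return true;
}
```

A caller would check the high bit of buffer[124] first, call this for base-256 fields, and fall back to the octal parser otherwise.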
OK, so now we have everything except the actual file contents. All we have to do is grab the next … from the tar file:
    // Get to the next block after the header ends
    location += 512;
    file_contents = new char[size];
    memcpy(file_contents, &buffer[location], size);
    // Go to the next block by rounding up to a multiple of 512 bytes
    // This isn't necessarily the most efficient way to do this,
    // but it's the most obvious.
    location += (int)ceil(size / 512.0) * 512;
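Putting the pieces together, the whole walk over an in-memory archive might look like the sketch below. TarEntry, parse_octal, and list_entries are names I made up for illustration. The sketch reads only octal sizes (no base-256), treats an empty name as the zero-filled end-of-archive block, and assumes every entry's size field describes data that follows its header.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One archive member found while walking the buffer (names are my own).
struct TarEntry {
    std::string name;
    char type;                // link indicator byte at offset 156
    std::size_t data_offset;  // where the contents start in the buffer
    std::size_t size;         // content length in bytes
};

// Parses an octal field, skipping leading spaces and stopping at the first
// non-octal byte. Base-256 sizes are not handled in this sketch.
static std::size_t parse_octal(const char* field, std::size_t width) {
    std::size_t value = 0;
    std::size_t i = 0;
    while (i < width && field[i] == ' ') ++i;
    for (; i < width && field[i] >= '0' && field[i] <= '7'; ++i) {
        value = value * 8 + static_cast<std::size_t>(field[i] - '0');
    }
    return value;
}

// Walks every header in an in-memory archive and records where each
// member's contents live. Stops at an empty name (treated as the
// zero-filled end-of-archive block) or when the buffer is exhausted.
std::vector<TarEntry> list_entries(const std::vector<char>& buffer) {
    const std::size_t block = 512;
    std::vector<TarEntry> entries;
    std::size_t location = 0;
    while (location + block <= buffer.size()) {
        assert(location % block == 0);  // headers sit on block boundaries
        const char* header = &buffer[location];
        if (header[0] == '\0') break;

        std::size_t name_len = 0;
        while (name_len < 100 && header[name_len] != '\0') ++name_len;

        TarEntry entry;
        entry.name = std::string(header, name_len);
        entry.type = header[156];
        entry.size = parse_octal(header + 124, 12);
        entry.data_offset = location + block;
        if (entry.size > buffer.size() - entry.data_offset) break;  // truncated
        entries.push_back(entry);

        // Skip the header, then the contents rounded up to whole blocks.
        std::size_t data_blocks = (entry.size + block - 1) / block;
        location = entry.data_offset + data_blocks * block;
    }
    return entries;
}
```

The contents of an entry e can then be read without copying as std::string(&buffer[e.data_offset], e.size). Rounding up with integer arithmetic avoids the floating-point ceil used earlier, but the result is the same.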
Have you looked at libtar?
From the fink package info:
libtar-1.2-1: Tar file manipulation API
libtar is a C library for manipulating POSIX tar files. It handles
adding and extracting files to/from a tar archive.
libtar offers the following features:
* Flexible API - you can manipulate individual files or just
extract a whole archive at once.
* Allows user-specified read() and write() functions, such as
zlib's gzread() and gzwrite().
* Supports both POSIX 1003.1-1990 and GNU tar file formats.
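It's not C++ per se, but you can link to C pretty easily…
libarchive is another open-source library that can parse tarballs. It can read each file from an archive without extracting it, and it can also write data to build a new archive.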