String matching performance: gcc versus CPython
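While investigating the performance trade-offs between Python and C++, I devised a small example that mostly focuses on dumb substring matching.
Here is the relevant C++: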

    using std::string;
    std::vector<string> matches;
    std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
        [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

The above is built with -O3.
And here is the Python:

    def getMatchingPatterns(patterns, text):
        return filter(text.__contains__, patterns)

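Both take a large-ish set of patterns and an input file, and use a dumb substring search to filter the list of patterns down to those found in the file.
The versions are:
- gcc 4.8.2 (Ubuntu) and 4.9.2 (Cygwin)
- Python 2.7.6 (Ubuntu) and 2.7.8 (Cygwin)
What surprised me was the performance. I ran both on a low-spec Ubuntu machine, and Python was about 20% faster. On a mid-spec PC under Cygwin, Python was twice as fast. The profiler shows that more than 99% of cycles are spent in string matching (string copying and list comprehensions are insignificant).
Obviously, the Python implementation is native C, and I expected it to be roughly on par with the C++, but I did not expect it to be faster.
Any insight into the relevant CPython optimizations, in comparison to gcc, would be most welcome.
For reference, here are the full examples. The input is just a set of 50K HTML files (all read from disk in each test, with no special caching):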
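A measurement like the one described above can be approximated without a profiler by timing only the matching step. The following is a minimal sketch, assuming Python 3 and small synthetic in-memory data instead of the 50K HTML files; `get_matching_patterns`, `time_matching`, and the sample strings are hypothetical names and values introduced here for illustration.

```python
import time


def get_matching_patterns(patterns, text):
    # Same idea as getMatchingPatterns in the question. list() forces the
    # work to happen here, because filter() is lazy on Python 3.
    return list(filter(text.__contains__, patterns))


def time_matching(patterns, text, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        get_matching_patterns(patterns, text)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-in data; the real test reads files from disk.
text = "lorem ipsum dolor sit amet " * 2000
patterns = ["ipsum", "dolor sit", "missing", "amet lorem"]

matches = get_matching_patterns(patterns, text)
elapsed = time_matching(patterns, text)
print("matched %d of %d patterns, best run %.6f s"
      % (len(matches), len(patterns), elapsed))
```

Taking the best of several runs reduces noise from the rest of the system; it says nothing about disk I/O, which this sketch deliberately leaves out.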
Python:

    import sys

    def getMatchingPatterns(patterns, text):
        return filter(text.__contains__, patterns)

    def serialScan(filenames, patterns):
        return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])

    if __name__ == "__main__":
        with open(sys.argv[1]) as filenamesListFile:
            filenames = filenamesListFile.read().split()
        with open(sys.argv[2]) as patternsFile:
            patterns = patternsFile.read().split()

        resultTuple = serialScan(filenames, patterns)
        for filename, patterns in resultTuple:
            print ': '.join([filename, ','.join(patterns)])

C++:

    #include <iostream>
    #include <iterator>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <unordered_map>
    #include <algorithm>  // std::copy, std::copy_if
    #include <utility>    // std::move, std::make_pair

    using namespace std;
    using MatchResult = unordered_map<string, vector<string>>;
    static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;

    // For each file, collect the patterns that occur somewhere in its contents.
    MatchResult serialMatch(const vector<string> &filenames, const vector<string> &patterns)
    {
        MatchResult res;
        for (auto &filename : filenames)
        {
            // Read the whole file into one string.
            ifstream file(filename);
            const string fileContents((istreambuf_iterator<char>(file)),
                                      istreambuf_iterator<char>());
            vector<string> matches;
            std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
                [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

            res.insert(make_pair(filename, std::move(matches)));
        }
        return res;
    }

    int main(int argc, char **argv)
    {
        // argv[1]: file listing the input filenames; argv[2]: file listing the patterns.
        vector<string> filenames;
        ifstream filenamesListFile(argv[1]);
        std::copy(istream_iterator<string>(filenamesListFile), istream_iterator<string>(),
                  back_inserter(filenames));

        vector<string> patterns;
        patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
        ifstream patternsFile(argv[2]);
        std::copy(istream_iterator<string>(patternsFile), istream_iterator<string>(),
                  back_inserter(patterns));

        auto matchResult = serialMatch(filenames, patterns);

        for (const auto &matchItem : matchResult)
        {
            cout << matchItem.first << ":";
            for (const auto &matchString : matchItem.second)
                cout << matchString << ",";
            cout << endl;
        }
    }

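In the Python 3.4 source code, the substring test ends up in the `fastsearch` routine, which its own source comment describes as follows: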
fast search/count implementation, based on a mix between boyer-
moore and horspool, with a few more bells and whistles on the top.
for some more background, see: http://effbot.org/zone/stringlib.htm
There are some modifications too, as the comments note:
note: fastsearch may access s[n], which isn't a problem when using
Python's ordinary string types, but may cause problems if you're
using this code in other contexts. also, the count mode returns -1
if there cannot possible be a match in the target string, and 0 if
it has actually checked for matches, but didn't find any. callers
beware!
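To make the idea behind the quoted comments concrete, here is a minimal sketch of a plain Horspool search in Python. It is not CPython's `fastsearch` code, which is written in C and adds the "bells and whistles" mentioned above; `horspool_find` is a hypothetical name, and the sketch only illustrates how a skip table lets the search jump ahead by more than one position after a mismatch.

```python
def horspool_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Skip table: for every character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i

    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Look at the text character aligned with the pattern's last
        # position; characters that never occur in the pattern allow a
        # jump of the full pattern length.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_find("hello world", "world"))
print(horspool_find("abababc", "abc"))
```

For long patterns over a large alphabet, most mismatches skip close to `len(pattern)` characters, which is where the advantage over a position-by-position scan comes from.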
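The GNU C++ standard library's `std::string::find` implementation is as generic (and dumb) as possible: it simply tries to match the pattern at each consecutive character position until it finds a match.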
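The position-by-position strategy described above can be sketched in a few lines of Python. This is an illustration of the technique only, not a transcription of the libstdc++ source; `naive_find` is a hypothetical name.

```python
def naive_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Tries every starting position in order and never skips ahead by more
    than one character after a mismatch.
    """
    n, m = len(text), len(pattern)
    for pos in range(n - m + 1):
        j = 0
        # Compare character by character from this starting position.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:
            return pos
    return -1


print(naive_find("hello world", "world"))
```

In the worst case this performs on the order of `len(text) * len(pattern)` character comparisons, because a mismatch discovered late in the pattern still advances the start position by only one.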
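TL;DR: The C++ standard library is so slow here compared to Python because it tries to implement a generic algorithm on top of `std::basic_string`, but fails to do so efficiently for the more interesting cases; in Python, the programmer gets the most efficient algorithm on a case-by-case basis for free.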