Why is code using intermediate variables faster than code without?
I've run into this strange behavior and haven't been able to explain it. These are the benchmarks:
```
py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 97.7 usec per loop
py -3 -m timeit "a = tuple(range(2000)); b = tuple(range(2000)); a==b"
10000 loops, best of 3: 70.7 usec per loop
```
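Why is the comparison that uses variable assignment more than 27% faster than the one-liner that uses temporaries?
According to the Python docs, garbage collection is disabled during timeit, so it can't be that. Is it some sort of optimization?
The results can also be reproduced in Python 2.x, though to a lesser extent.
Running Windows 7, CPython 3.5.1, Intel i7 3.40 GHz, 64-bit for both OS and Python. A different machine I tried, an Intel i7 3.60 GHz running Python 3.5.0, does not seem to reproduce the results.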
Running using the same Python process [...]
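For anyone who wants to repeat the comparison inside a single Python process, here is a minimal sketch. The helper name `best_usec` and the loop counts are assumptions chosen for illustration, not what was used for the numbers above.

```python
import timeit

ONE_LINER = "tuple(range(2000)) == tuple(range(2000))"
WITH_VARS = "a = tuple(range(2000)); b = tuple(range(2000)); a == b"


def best_usec(stmt, number=1000, repeat=3):
    """Return the best per-loop time in microseconds over `repeat` runs.

    `best_usec` is a hypothetical helper; it mirrors the "best of N"
    figure that the timeit command line prints.
    """
    totals = timeit.repeat(stmt, number=number, repeat=repeat)
    return min(totals) / number * 1e6


if __name__ == "__main__":
    for label, stmt in (("one-liner", ONE_LINER), ("with variables", WITH_VARS)):
        print(f"{label:>15}: {best_usec(stmt):.1f} usec per loop")
```

Whether a gap shows up at all depends on the platform and the Python build, so the two numbers may well come out equal.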
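My results were similar to yours: the code using intermediate variables was pretty consistently at least 10-20% faster in Python 3.4. However, when I used IPython on the very same Python 3.4 interpreter, I got these results: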
```
In [1]: %timeit -n10000 -r20 tuple(range(2000)) == tuple(range(2000))
10000 loops, best of 20: 74.2 μs per loop

In [2]: %timeit -n10000 -r20 a = tuple(range(2000)); b = tuple(range(2000)); a==b
10000 loops, best of 20: 75.7 μs per loop
```
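Notably, I never got anywhere near the 74.2 μs of the former when I ran -m timeit from the command line.
So this Heisenbug turned out to be quite interesting. I decided to run the command under strace: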
```
% strace -o withoutvars python3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 134 usec per loop
% strace -o withvars python3 -mtimeit "a = tuple(range(2000)); b = tuple(range(2000)); a==b"
10000 loops, best of 3: 75.8 usec per loop
% grep mmap withvars|wc -l
46
% grep mmap withoutvars|wc -l
41149
```
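Now that is a good reason for the difference. The code that does not use variables makes the mmap system call far more often than the code that uses intermediate variables: 41149 matching lines in the trace against 46.
The withoutvars trace is full of mmap/munmap pairs for a 256K region; the same lines repeat over and over: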
```
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144) = 0
```
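To go one step beyond `grep | wc -l` and see which region sizes dominate, the trace can be tallied by call and length. This is a rough sketch: `count_mappings` is a hypothetical helper, and the regular expression assumes the plain `strace -o` line format shown above (no PID or timestamp prefix).

```python
import re
from collections import Counter

# Matches the start of lines such as
#   mmap(NULL, 262144, PROT_READ|PROT_WRITE, ...) = 0x7f32e56de000
#   munmap(0x7f32e56de000, 262144) = 0
# and captures the call name and the length argument.
CALL = re.compile(r"^(mmap|munmap)\([^,]+,\s*(\d+)")


def count_mappings(lines):
    """Count (syscall, length) pairs in an iterable of strace output lines."""
    counts = Counter()
    for line in lines:
        match = CALL.match(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two of the repeated pairs from the trace above.
sample = [
    "mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000",
    "munmap(0x7f32e56de000, 262144) = 0",
    "mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000",
    "munmap(0x7f32e56de000, 262144) = 0",
]

for (name, length), n in sorted(count_mappings(sample).items()):
    print(f"{name:>6} {length:>8} bytes ({length // 1024} KiB): {n} calls")
```

On a real trace, something like `count_mappings(open("withoutvars"))` would do the same over the whole file.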
A comment in CPython's Objects/obmalloc.c says:

> Prior to Python 2.5, arenas were never free()'ed. Starting with Python 2.5, we do try to free() arenas, and use some mild heuristic strategies to increase the likelihood that arenas eventually can be freed.
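Thus these heuristics, together with the fact that the Python object allocator releases free arenas as soon as they are emptied, lead to the pathological behaviour seen in the trace: one 256 KiB memory area is allocated and released over and over, each time through the comparatively costly mmap and munmap system calls.
The behaviour is absent from the code that uses intermediate variables, because it uses slightly more memory and no arena can be freed while some objects are still allocated in it. That is because timeit turns the statement into a loop not unlike: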
```python
for n in range(10000):
    a = tuple(range(2000))
    b = tuple(range(2000))
    a == b
```
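Now the behaviour is that a and b stay bound until they are reassigned, so from the second iteration on at least two of these tuples are alive at any time and the arena is never emptied.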
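The lifetime difference between the two loop bodies can be observed directly. The sketch below assumes CPython's reference counting; `Tracked`, `temporaries` and `intermediates` are hypothetical names introduced for illustration, and a small instrumented class stands in for the tuples.

```python
live = 0  # number of Tracked instances currently alive


class Tracked:
    """Stand-in for the tuples: bumps a counter on creation and on release."""

    def __init__(self):
        global live
        live += 1

    def __del__(self):
        global live
        live -= 1


def temporaries(iterations):
    """Loop body without variables; record the live count after each pass."""
    counts = []
    for _ in range(iterations):
        Tracked() == Tracked()  # both temporaries are released right here
        counts.append(live)
    return counts


def intermediates(iterations):
    """Loop body with variables; a and b stay bound between passes."""
    counts = []
    for _ in range(iterations):
        a = Tracked()
        b = Tracked()
        a == b
        counts.append(live)
    return counts


print("temporaries:  ", temporaries(3))    # [0, 0, 0] on CPython
print("intermediates:", intermediates(3))  # [2, 2, 2] on CPython
```

With temporaries the live count drops back to zero after every pass, which is the condition under which an arena can become empty and be released; with `a` and `b` it never falls below two.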
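Most notably, it cannot be guaranteed that the code using intermediate variables is always faster. Indeed, in some setups using intermediate variables might result in extra mmap calls, while the code that compares the return values directly might be fine.
Someone asked why this happens when timeit disables garbage collection. It is indeed true that timeit does so: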
> Note
>
> By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. This disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string.
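However, Python's garbage collector is only there to reclaim cyclic garbage, i.e. collections of objects whose references form cycles. That is not the case here; instead, these objects are freed immediately when their reference count drops to zero.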
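The distinction is easy to check with the collector switched off. This is a minimal sketch assuming CPython 3.4 or later; `Probe` and `demo` are hypothetical names used only for this illustration.

```python
import gc


class Probe:
    """Counts how many instances have been finalized."""

    freed = 0

    def __del__(self):
        Probe.freed += 1


def demo():
    """Return the freed count after three steps, with the cyclic GC disabled."""
    was_enabled = gc.isenabled()
    gc.disable()  # same state timeit puts the interpreter in
    try:
        Probe.freed = 0

        plain = Probe()
        del plain  # reference count hits zero: freed at once, no GC needed
        after_del = Probe.freed

        cyclic = Probe()
        cyclic.me = cyclic  # reference cycle keeps the count above zero
        del cyclic
        after_cycle_del = Probe.freed

        gc.collect()  # an explicit collection reclaims the cycle
        after_collect = Probe.freed
    finally:
        if was_enabled:
            gc.enable()
    return after_del, after_cycle_del, after_collect


print(demo())  # (1, 1, 2) on CPython
```

The tuples in the benchmark behave like `plain`: disabling the collector does not delay their release, so the arenas are still emptied on every iteration.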
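The first question here has to be: is it reproducible? For some of us at least it definitely is, though other people say they aren't seeing the effect. This is on Fedora, with the equality test changed to is and the range raised to 200,000: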
```
$ python3 -m timeit "a = tuple(range(200000)); b = tuple(range(200000)); a is b"
100 loops, best of 3: 7.03 msec per loop
$ python3 -m timeit "a = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "a = b = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 9.99 msec per loop
$ python3 -m timeit "a = b = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.1 msec per loop
$ python3 -m timeit "a = tuple(range(200000)); b = tuple(range(200000)); a is b"
100 loops, best of 3: 7 msec per loop
$ python3 -m timeit "a = tuple(range(200000)); b = tuple(range(200000)); a is b"
100 loops, best of 3: 7.02 msec per loop
```
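I note that variation between runs, and the order in which the expressions are run, make very little difference to the result.
Adding assignments to a and b into the slow version doesn't speed it up. The only thing that does is splitting the expression in two, and the only difference that should make is a lower maximum stack depth while Python evaluates the expression.
This gives us a clue that the effect is related to stack depth: perhaps the extra level pushes the stack across into another memory page. If so, making other changes that affect the stack should change the effect (most likely kill it), and in fact that is what we see: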
```
$ python3 -m timeit -s "def foo(): tuple(range(200000)) is tuple(range(200000))" "foo()"
100 loops, best of 3: 10 msec per loop
$ python3 -m timeit -s "def foo(): tuple(range(200000)) is tuple(range(200000))" "foo()"
100 loops, best of 3: 10 msec per loop
$ python3 -m timeit -s "def foo(): a = tuple(range(200000)); b = tuple(range(200000)); a is b" "foo()"
100 loops, best of 3: 9.97 msec per loop
$ python3 -m timeit -s "def foo(): a = tuple(range(200000)); b = tuple(range(200000)); a is b" "foo()"
100 loops, best of 3: 10 msec per loop
```
So I think the effect is entirely due to how much Python stack is consumed during the timing process. It is still weird, though.
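One way to look at the stack requirement of each form is the `co_stacksize` attribute of the compiled code object. This is only a sketch: `stack_size` is a hypothetical helper, the statements are compiled as module-level code rather than inside the function that timeit generates, and the absolute values differ between CPython versions.

```python
def stack_size(source):
    """Compile `source` as module code and return the stack space it needs."""
    return compile(source, "<stmt>", "exec").co_stacksize


ONE_EXPRESSION = "tuple(range(200000)) is tuple(range(200000))"
SPLIT_IN_TWO = "a = tuple(range(200000)); b = tuple(range(200000)); a is b"

# The single expression has to keep the first tuple on the evaluation
# stack while the second one is being built; the split form stores it
# in a variable first.
print("one expression:", stack_size(ONE_EXPRESSION))
print("split in two:  ", stack_size(SPLIT_IN_TWO))
```

This shows only what the bytecode compiler reserves; it says nothing about memory pages, so it illustrates the stack-depth hypothesis rather than confirming it.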