How to find the L1 cache line size with IO timing measurements?
    for (i = 0; i < steps; i++) {
        arr[(i * 4) & lengthMod]++;
    }
    // repeatedly access/modify data, varying the STRIDE
    for (int s = 4; s <= MAX_STRIDE/sizeof(int); s*=2) {
        start = wall_clock_time();
        for (unsigned int k = 0; k < REPS; k++) {
            data[(k * s) & lengthMod]++;
        }
        end = wall_clock_time();
        timeTaken = ((float)(end - start))/1000000000;
        // s * sizeof(int) is a size_t, so print it with %zu rather than %d
        printf("%zu, %1.2f\n", s * sizeof(int), timeTaken);
    }
The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends
on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop
that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent
memory accesses) and the size of the memory area, we force varying cache miss rates.In principle, the occurance of cache misses is determined by the array size. Array sizes that fit into
the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously,
arrays that exceed the L1 cache size but still fit into L2, will cause L1 misses but no L2 misses. Finally,
arrays larger than L2 cause both L1 and L2 misses.

The frequency of cache misses depends on the access stride and the cache line size. With strides
equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides
smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is
the ratio cache line size/stride.

Thus, we can calculate the latency for a cache miss by comparing the execution time without
misses to the execution time with exactly one miss per iteration. This approach only works if
memory accesses are executed purely sequentially, i.e., we have to ensure that neither two or more load
instructions nor memory access and pure CPU work can overlap. We use a simple pointer chasing
mechanism to achieve this: the memory area we access is initialized such that each load returns the
address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from
their ability to hide memory access latency by speculative execution.

To measure the cache characteristics, we run our experiment several times, varying the stride and
the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal
expected cache line size, and that the array size varies from half the minimal expected cache size to
at least ten times the maximal expected cache size.
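I had to comment out #include "math.h" to get it to compile; after that it found my laptop's cache values. I also couldn't view the generated PostScript files.

You can use the CPUID function in assembler; although it is not portable, it will give you what you want.

For Intel Microprocessors, the Cache Line Size can be calculated by multiplying bh by 8 after calling cpuid function 0x1.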
For AMD Microprocessors, the data Cache Line Size is in cl and the instruction Cache Line Size is in dl after calling cpuid function 0x80000005.
The cache line size is variable on a few ARM Cortex families and can change during execution without any notification to the current program.