32-byte aligned routine does not fit the uops cache
I'm investigating the behavior of the uops cache and have run into something I don't understand.
As the Intel Optimization Manual states:
The Decoded ICache consists of 32 sets. Each set contains eight Ways.
Each Way can hold up to six micro-ops.
-
All micro-ops in a Way represent instructions which are statically
contiguous in the code and have their EIPs within the same aligned
32-byte region.
-
Up to three Ways may be dedicated to the same 32-byte aligned chunk,
allowing a total of 18 micro-ops to be cached per 32-byte region of
the original IA program.
-
A non-conditional branch is the last micro-op in a Way.
CASE 1:
Consider the following routine:

    void inhibit_uops_cache(size_t);

    align 32
    inhibit_uops_cache:
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        jmp decrement_jmp_tgt
    decrement_jmp_tgt:
        dec rdi
        ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
        ret

To make sure the routine's code is actually 32-byte aligned, here is the disassembly:

    0x555555554820 <inhibit_uops_cache>      mov    edx,esi
    0x555555554822 <inhibit_uops_cache+2>    mov    edx,esi
    0x555555554824 <inhibit_uops_cache+4>    mov    edx,esi
    0x555555554826 <inhibit_uops_cache+6>    mov    edx,esi
    0x555555554828 <inhibit_uops_cache+8>    mov    edx,esi
    0x55555555482a <inhibit_uops_cache+10>   mov    edx,esi
    0x55555555482c <inhibit_uops_cache+12>   jmp    0x55555555482e <decrement_jmp_tgt>
    0x55555555482e <decrement_jmp_tgt>       dec    rdi
    0x555555554831 <decrement_jmp_tgt+3>     ja     0x555555554820 <inhibit_uops_cache>
    0x555555554833 <decrement_jmp_tgt+5>     ret
    0x555555554834 <decrement_jmp_tgt+6>     nop
    0x555555554835 <decrement_jmp_tgt+7>     nop
    0x555555554836 <decrement_jmp_tgt+8>     nop
    0x555555554837 <decrement_jmp_tgt+9>     nop
    0x555555554838 <decrement_jmp_tgt+10>    nop
    0x555555554839 <decrement_jmp_tgt+11>    nop
    0x55555555483a <decrement_jmp_tgt+12>    nop
    0x55555555483b <decrement_jmp_tgt+13>    nop
    0x55555555483c <decrement_jmp_tgt+14>    nop
    0x55555555483d <decrement_jmp_tgt+15>    nop
    0x55555555483e <decrement_jmp_tgt+16>    nop
    0x55555555483f <decrement_jmp_tgt+17>    nop

Running it as

    int main(void){
        inhibit_uops_cache(4096 * 4096 * 128L);
    }

I got the counters

     Performance counter stats for './bin':

         6 431 201 748      idq.dsb_cycles                      (56,91%)
        19 175 741 518      idq.dsb_uops                        (57,13%)
             7 866 687      idq.mite_uops                       (57,36%)
             3 954 421      idq.ms_uops                         (57,46%)
               560 459      dsb2mite_switches.penalty_cycles    (57,28%)
               884 486      frontend_retired.dsb_miss           (57,05%)
         6 782 598 787      cycles                              (56,82%)

           1,749000366 seconds time elapsed

           1,748985000 seconds user
           0,000000000 seconds sys

which is exactly what I expected to get.
The vast majority of the uops came from the uops cache. The uop counts also match my expectation exactly:

    mov edx, esi - 1 uop;
    jmp imm      - 1 uop; near
    dec rdi      - 1 uop;
    ja           - 1 uop; near
CASE 2:
Consider

    align 32
    inhibit_uops_cache:
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        ; mov edx, esi
        jmp decrement_jmp_tgt
    decrement_jmp_tgt:
        dec rdi
        ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
        ret

disas:

    0x555555554820 <inhibit_uops_cache>      mov    edx,esi
    0x555555554822 <inhibit_uops_cache+2>    mov    edx,esi
    0x555555554824 <inhibit_uops_cache+4>    mov    edx,esi
    0x555555554826 <inhibit_uops_cache+6>    mov    edx,esi
    0x555555554828 <inhibit_uops_cache+8>    mov    edx,esi
    0x55555555482a <inhibit_uops_cache+10>   jmp    0x55555555482c <decrement_jmp_tgt>
    0x55555555482c <decrement_jmp_tgt>       dec    rdi
    0x55555555482f <decrement_jmp_tgt+3>     ja     0x555555554820 <inhibit_uops_cache>
    0x555555554831 <decrement_jmp_tgt+5>     ret
    0x555555554832 <decrement_jmp_tgt+6>     nop
    0x555555554833 <decrement_jmp_tgt+7>     nop
    0x555555554834 <decrement_jmp_tgt+8>     nop
    0x555555554835 <decrement_jmp_tgt+9>     nop
    0x555555554836 <decrement_jmp_tgt+10>    nop
    0x555555554837 <decrement_jmp_tgt+11>    nop
    0x555555554838 <decrement_jmp_tgt+12>    nop
    0x555555554839 <decrement_jmp_tgt+13>    nop
    0x55555555483a <decrement_jmp_tgt+14>    nop
    0x55555555483b <decrement_jmp_tgt+15>    nop
    0x55555555483c <decrement_jmp_tgt+16>    nop
    0x55555555483d <decrement_jmp_tgt+17>    nop
    0x55555555483e <decrement_jmp_tgt+18>    nop
    0x55555555483f <decrement_jmp_tgt+19>    nop

Running it as

    int main(void){
        inhibit_uops_cache(4096 * 4096 * 128L);
    }

I got the counters

     Performance counter stats for './bin':

         2 464 970 970      idq.dsb_cycles                      (56,93%)
         6 197 024 207      idq.dsb_uops                        (57,01%)
        10 845 763 859      idq.mite_uops                       (57,19%)
             3 022 089      idq.ms_uops                         (57,38%)
               321 614      dsb2mite_switches.penalty_cycles    (57,35%)
         1 733 465 236      frontend_retired.dsb_miss           (57,16%)
         8 405 643 642      cycles                              (56,97%)

           2,117538141 seconds time elapsed

           2,117511000 seconds user
           0,000000000 seconds sys

The counters are completely unexpected.
I expected all the uops to come from the dsb as before, since the routine meets the requirements of the uops cache.
In contrast, almost two thirds of the uops (about 64%) came from the legacy decode pipeline.
QUESTION: What's wrong with CASE 2? Which counters should I look at to understand what's going on?
UPD: Following @PeterCordes' idea, I checked the unconditional branch target.
CASE 3:
Aligning the unconditional branch target as

    align 32
    inhibit_uops_cache:
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        mov edx, esi
        ; mov edx, esi
        jmp decrement_jmp_tgt
    align 32 ; align 16 does not change anything
    decrement_jmp_tgt:
        dec rdi
        ja inhibit_uops_cache
        ret

disas:

    0x555555554820 <inhibit_uops_cache>      mov    edx,esi
    0x555555554822 <inhibit_uops_cache+2>    mov    edx,esi
    0x555555554824 <inhibit_uops_cache+4>    mov    edx,esi
    0x555555554826 <inhibit_uops_cache+6>    mov    edx,esi
    0x555555554828 <inhibit_uops_cache+8>    mov    edx,esi
    0x55555555482a <inhibit_uops_cache+10>   jmp    0x555555554840 <decrement_jmp_tgt>
    #nops to meet the alignment
    0x555555554840 <decrement_jmp_tgt>       dec    rdi
    0x555555554843 <decrement_jmp_tgt+3>     ja     0x555555554820 <inhibit_uops_cache>
    0x555555554845 <decrement_jmp_tgt+5>     ret

and running it as

    int main(void){
        inhibit_uops_cache(4096 * 4096 * 128L);
    }

I got the following counters

     Performance counter stats for './bin':

         4 296 298 295      idq.dsb_cycles                      (57,19%)
        17 145 751 147      idq.dsb_uops                        (57,32%)
            45 834 799      idq.mite_uops                       (57,32%)
             1 896 769      idq.ms_uops                         (57,32%)
               136 865      dsb2mite_switches.penalty_cycles    (57,04%)
               161 314      frontend_retired.dsb_miss           (56,90%)
         4 319 137 397      cycles                              (56,91%)

           1,096792233 seconds time elapsed

           1,096759000 seconds user
           0,000000000 seconds sys

The result is exactly as expected. More than 99% of the uops came from the dsb.
The average dsb uops delivery rate = idq.dsb_uops / idq.dsb_cycles = 17 145 751 147 / 4 296 298 295 ≈ 3.99 uops per cycle, which is close to the peak bandwidth.
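This is not the answer to the OP's problem, but it is one to watch out for.
See "Code alignment dramatically affects performance" for compiler options to work around this performance pothole that Intel introduced into Skylake-derived CPUs as part of this workaround.
Other observations: 6 …
(Posting this for the benefit of future readers who may have the same symptoms but a different cause. I realized as I finished writing it up that …)
A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures (KBL142 on your Kaby Lake specifically).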
Microcode Update (MCU) to Mitigate JCC Erratum
This erratum can be prevented by a microcode update (MCU). The MCU prevents
jump instructions from being cached in the Decoded ICache when the jump
instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In
this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct
unconditional jump, indirect jump, direct/indirect call, and return.
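Intel's whitepaper also includes a diagram of the cases that trigger this non-uop-cacheable effect. (PDF screenshot taken from a Phoronix article with before/after benchmarks, including results after rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall.)
The last byte of the ja in your code is …
If this were a 32-byte boundary, not just a 16-byte one, then we'd have a problem here:

    0x55555555482a <inhibit_uops_cache+10>   jmp    # fine
    0x55555555482c <decrement_jmp_tgt>       dec    rdi
    0x55555555482f <decrement_jmp_tgt+3>     ja     # spans 16B boundary (not 32)
    0x555555554831 <decrement_jmp_tgt+5>     ret    # fine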
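This section is not fully updated and still talks about spanning a 32B boundary.
JA itself spans a boundary.
In …
Using …
You can …
ASLR can change which virtual page the code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in the disassembly will happen every time.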
OBSERVATION 1: A branch whose target lies within the same 32-byte region behaves, from the uops cache standpoint, much like an unconditional branch (i.e. it should be the last uop in the line).

Consider

    inhibit_uops_cache:
        xor eax, eax
        jmp t1 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
    t1:
        jmp t2 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
    t2:
        jmp t3 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
    t3:
        dec rdi
        ja inhibit_uops_cache
        ret

The code was tested with every branch listed in the comments. The differences turned out to be insignificant, so I show only 2 of them:

jmp:

         4 748 772 552      idq.dsb_cycles                      (57,13%)
         7 499 524 594      idq.dsb_uops                        (57,18%)
         5 397 128 360      idq.mite_uops                       (57,18%)
             8 696 719      idq.ms_uops                         (57,18%)
         6 247 749 210      dsb2mite_switches.penalty_cycles    (57,14%)
         3 841 902 993      frontend_retired.dsb_miss           (57,10%)
        21 508 686 982      cycles                              (57,10%)

           5,464493212 seconds time elapsed

           5,464369000 seconds user
           0,000000000 seconds sys

jge:

         4 745 825 810      idq.dsb_cycles                      (57,13%)
         7 494 052 019      idq.dsb_uops                        (57,13%)
         5 399 327 121      idq.mite_uops                       (57,13%)
             9 308 081      idq.ms_uops                         (57,13%)
         6 243 915 955      dsb2mite_switches.penalty_cycles    (57,16%)
         3 842 842 590      frontend_retired.dsb_miss           (57,16%)
        21 507 525 469      cycles                              (57,16%)

           5,486589670 seconds time elapsed

           5,486481000 seconds user
           0,000000000 seconds sys

IDK why the number of dsb uops is …

Replacing any of the jmp with a branch that is predicted not to be taken gives a noticeably different result. For example:

    inhibit_uops_cache:
        xor eax, eax
        jnz t1 ; perfectly predicted to not be taken
    t1:
        jae t2
    t2:
        jae t3
    t3:
        dec rdi
        ja inhibit_uops_cache
        ret

produces the following counters:

         5 420 107 670      idq.dsb_cycles                      (56,96%)
        10 551 728 155      idq.dsb_uops                        (57,02%)
         2 326 542 570      idq.mite_uops                       (57,16%)
             6 209 728      idq.ms_uops                         (57,29%)
           787 866 654      dsb2mite_switches.penalty_cycles    (57,33%)
         1 031 630 646      frontend_retired.dsb_miss           (57,19%)
        11 381 874 966      cycles                              (57,05%)

           2,927769205 seconds time elapsed

           2,927683000 seconds user
           0,000000000 seconds sys

Consider another example, similar to CASE 1:

    inhibit_uops_cache:
        nop
        nop
        nop
        nop
        nop
        xor eax, eax
        jmp t1
    t1:
        dec rdi
        ja inhibit_uops_cache
        ret

which results in

         6 331 388 209      idq.dsb_cycles                      (57,05%)
        19 052 030 183      idq.dsb_uops                        (57,05%)
           343 629 667      idq.mite_uops                       (57,05%)
             2 804 560      idq.ms_uops                         (57,13%)
               367 020      dsb2mite_switches.penalty_cycles    (57,27%)
            55 220 850      frontend_retired.dsb_miss           (57,27%)
         7 063 498 379      cycles                              (57,19%)

           1,788124756 seconds time elapsed

           1,788101000 seconds user
           0,000000000 seconds sys

jz:

         6 347 433 290      idq.dsb_cycles                      (57,07%)
        18 959 366 600      idq.dsb_uops                        (57,07%)
           389 514 665      idq.mite_uops                       (57,07%)
             3 202 379      idq.ms_uops                         (57,12%)
               423 720      dsb2mite_switches.penalty_cycles    (57,24%)
            69 486 934      frontend_retired.dsb_miss           (57,24%)
         7 063 060 791      cycles                              (57,19%)

           1,789012978 seconds time elapsed

           1,788985000 seconds user
           0,000000000 seconds sys

jno:

         6 417 056 199      idq.dsb_cycles                      (57,02%)
        19 113 550 928      idq.dsb_uops                        (57,02%)
           329 353 039      idq.mite_uops                       (57,02%)
             4 383 952      idq.ms_uops                         (57,13%)
               414 037      dsb2mite_switches.penalty_cycles    (57,30%)
            79 592 371      frontend_retired.dsb_miss           (57,30%)
         7 044 945 047      cycles                              (57,20%)

           1,787111485 seconds time elapsed

           1,787049000 seconds user
           0,000000000 seconds sys

All these experiments made me think that the observation corresponds to the real behavior of the uops cache. I also ran another experiment, using the counters …

Consider the following

    inhibit_uops_cache:
    t0:
        ;nops 0-9
        jmp t1
    t1:
        ;nop 0-6
        dec rdi
        ja t0
        ret

Collecting …

The X-axis of the plot represents …

    inhibit_uops_cache:
    t0:
        nop
        nop
        nop
        nop
        jmp t1
    t1:
        nop
        nop
        dec rdi
        ja t0
        ret

Judging by the plots, I came to

OBSERVATION 2: If there are 2 branches within a 32-byte region that are predicted to be taken, then …

Increasing …

OBSERVATION 3: dsb misses that occur for some (unclear?) reason cause IDQ read bubbles and therefore RAT underflow.

Conclusion: Taking all the measurements into account, there are definitely some differences between the behavior defined in …