Context switches much slower in new Linux kernels
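We are looking to upgrade the OS on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, it seems that the latency to run a thread that has become runnable has increased significantly from the 2.6 kernel to the 3.2 kernel. In fact, the latency numbers we are getting are hard to believe.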
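Let me be more specific about the test. We have a program that runs two threads. The first thread gets the current time (in ticks, using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks, using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this, the second thread waits on the condition variable once more. It is signaled again by the first thread after about a second has passed.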
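To make the tick arithmetic in this description concrete, here is a minimal sketch of reading the counter and converting a tick difference to microseconds. It is not the code from the pastebin. The helper names `read_ticks` and `ticks_to_us` are made up for illustration, and the conversion assumes the TSC ticks at a constant rate equal to the nominal clock (3.6 GHz on the hosts described here), which may not hold on every machine.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
// Read the CPU time-stamp counter (x86 only).
inline std::uint64_t read_ticks() { return __rdtsc(); }
#else
#include <chrono>
// Fallback for non-x86 builds: nanoseconds from a monotonic clock,
// i.e. a "1 GHz" tick source (pass tsc_ghz = 1.0 below).
inline std::uint64_t read_ticks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
#endif

// Convert a tick difference to microseconds.
// Assumption: the counter advances at a constant tsc_ghz * 1e9 ticks per second.
inline double ticks_to_us(std::uint64_t delta_ticks, double tsc_ghz) {
    return static_cast<double>(delta_ticks) / (tsc_ghz * 1000.0);
}
```

Under that assumption, a difference of 10080 ticks at 3.6 GHz corresponds to 2.8 us, which is the low end of the latencies reported for the 2.6.32 kernel.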
So, in a nutshell, we get one measurement per second of thread-to-thread communication latency via a condition variable.
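The measurement loop described above can be sketched roughly as follows. This is an illustration, not the original test program: it uses `std::chrono::steady_clock` in place of RDTSC so that it stays portable, it uses C++11 threads instead of raw pthreads, and the function name `measure_wake_latency_ns` is hypothetical.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Signal a waiting thread `rounds` times, pausing `gap` before each signal,
// and return the signal-to-wake latency of every round in nanoseconds.
std::vector<std::int64_t> measure_wake_latency_ns(int rounds, std::chrono::milliseconds gap) {
    using Clock = std::chrono::steady_clock;
    std::mutex m;
    std::condition_variable cv;
    Clock::time_point sent;           // time of the latest signal (guarded by m)
    int posted = 0;                   // number of signals sent so far (guarded by m)
    std::vector<std::int64_t> out;    // written only by the waiter, read after join

    std::thread waiter([&] {
        for (int seen = 0; seen < rounds; ++seen) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return posted > seen; });   // tolerates spurious wakeups
            const auto woke = Clock::now();
            out.push_back(
                std::chrono::duration_cast<std::chrono::nanoseconds>(woke - sent).count());
        }
    });

    for (int i = 0; i < rounds; ++i) {
        std::this_thread::sleep_for(gap);
        {
            std::lock_guard<std::mutex> lk(m);
            sent = Clock::now();      // timestamp taken just before signalling
            ++posted;
        }
        cv.notify_one();
    }
    waiter.join();
    return out;
}
```

Calling `measure_wake_latency_ns(10, std::chrono::milliseconds(1000))` mirrors the once-per-second cadence of the test. The long gap matters: it gives the waiter's core time to go idle, which is the situation the question is about. Build with something like `g++ -std=c++11 -O3 -pthread`.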
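In kernel 2.6.32, this latency is on the order of 2.8-3.5 us, which is reasonable. In kernel 3.2.0, it has increased to somewhere on the order of 40-100 us. I have excluded any hardware differences between the two hosts. They run on identical hardware (dual-socket X5687 {Westmere-EP} processors running at 3.6 GHz, with hyperthreading, SpeedStep and all C-states turned off). The test app changes the affinity of the threads to run them on independent physical cores of the same socket (i.e., the first thread runs on Core 0 and the second thread runs on Core 1), so there is no bouncing of threads between cores and no bouncing or communication between sockets.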
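For readers who want to reproduce the affinity step, a thread can pin itself to one CPU on Linux with `pthread_setaffinity_np`. The sketch below is an assumption about how such pinning could look, not the code from the test app, and `pin_current_thread_to_cpu` is a made-up helper name. The kernel's logical CPU numbers do not necessarily map to physical cores or sockets in order, so check the topology of your machine before choosing which numbers to pass.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for pthread_setaffinity_np and CPU_SET on glibc
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single logical CPU.
// Returns 0 on success, otherwise an errno-style error number.
int pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

In a two-thread test like the one described, each thread would call this once at start-up with its own CPU number, for example 0 in the signalling thread and 1 in the waiting thread.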
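The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context switch box). All BIOS settings and hardware are identical.
Have there been any changes in the kernel that could account for this ridiculous slowdown in how long it takes for a thread to be scheduled to run?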
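Update:
If you would like to run the test on your own hosts and Linux builds, I have posted the code to pastebin for your perusal. Compile with:
    g++ -O3 -o test_latency test_latency.cpp -lpthread
Run with (assuming you have at least a dual-core box):
    ./test_latency 0 1 # Thread 1 on Core 0 and Thread 2 on Core 1
Update 2:
After much searching on kernel parameters, posts on kernel changes, and personal research, I have figured out what the problem is and have posted the solution as an answer to this question.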
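The solution to the bad thread wake-up performance problem in recent kernels has to do with the switch from
To see which cpuidle driver is currently active in your setup, just run:
    cat /sys/devices/system/cpu/cpuidle/current_driver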
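If you want to do the same check from inside a program, for example to log the driver name next to the latency numbers, the sysfs file can be read like any other text file. This is a small sketch: the helper names `read_first_line` and `current_cpuidle_driver` are hypothetical, and the "unknown" fallback is simply a choice made here for kernels that do not expose the file.

```cpp
#include <fstream>
#include <string>

// Return the first line of a text file, or `fallback` if it cannot be read.
std::string read_first_line(const std::string& path, const std::string& fallback = "") {
    std::ifstream in(path);
    std::string line;
    if (in && std::getline(in, line)) return line;
    return fallback;
}

// Name of the active cpuidle driver, or "unknown" when the sysfs entry is absent.
std::string current_cpuidle_driver() {
    return read_first_line("/sys/devices/system/cpu/cpuidle/current_driver", "unknown");
}
```

Printing `current_cpuidle_driver()` at start-up makes it easier to tell later which idle driver a given set of measurements was taken under.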
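If you want your modern Linux OS to have the lowest context switch latency possible, add the following kernel boot parameters to disable all of these power-saving features:
On Ubuntu 12.04, you can do this by adding them to the
    intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll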
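After rebooting, it is worth confirming that the parameters actually reached the kernel. The running kernel's command line is available in `/proc/cmdline`, and a check could look like the sketch below. The function name `missing_boot_params` is hypothetical, and the comparison is an exact token match, so `idle=poll` is not treated as satisfied by, say, `idle=halt`.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Return every entry of `wanted` that does not appear as a whole
// whitespace-separated token in the kernel command line `cmdline`.
std::vector<std::string> missing_boot_params(const std::string& cmdline,
                                             const std::vector<std::string>& wanted) {
    std::istringstream in(cmdline);
    std::vector<std::string> present;
    for (std::string tok; in >> tok;) present.push_back(tok);

    std::vector<std::string> missing;
    for (const auto& w : wanted) {
        if (std::find(present.begin(), present.end(), w) == present.end())
            missing.push_back(w);
    }
    return missing;
}
```

To use it, read the single line in `/proc/cmdline` into a string and pass it together with the three parameters above. An empty result means all of them are active.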
Here are the details about the three boot options:
Setting
Finally, the last of the three parameters,
Update:
After testing with various
    intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait
Using
Update 2:
and
OK.
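Perhaps what got slower is futex, the building block for condition variables. This will shed some light:
    strace -r ./test_latency 0 1 &> test_latency_strace & sleep 8 && killall test_latency
then
    for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
which will show the microseconds taken for the interesting system calls, sorted by time.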
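The grep-and-sort step above can also be done in a few lines of C++ if you want to post-process the trace further. The sketch below assumes each trace line starts with the relative timestamp that `strace -r` prints, followed by the call. The names `StraceEntry` and `slowest_calls` are invented for this example. Keep in mind that `strace -r` reports the time between the starts of successive system calls, so the number attached to a line is only a rough indication of cost.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

struct StraceEntry {
    double rel_seconds;   // relative timestamp printed by `strace -r`
    std::string call;     // the rest of the line, e.g. "futex(...) = 1"
};

// Keep the lines whose call text contains `name` and sort them by
// relative timestamp, largest first (the equivalent of `sort -rn`).
std::vector<StraceEntry> slowest_calls(const std::vector<std::string>& lines,
                                       const std::string& name) {
    std::vector<StraceEntry> out;
    for (const auto& line : lines) {
        std::istringstream in(line);
        StraceEntry e;
        if (!(in >> e.rel_seconds)) continue;      // skip lines with no leading number
        std::getline(in >> std::ws, e.call);
        if (e.call.find(name) != std::string::npos) out.push_back(e);
    }
    std::sort(out.begin(), out.end(), [](const StraceEntry& a, const StraceEntry& b) {
        return a.rel_seconds > b.rel_seconds;
    });
    return out;
}
```

Feeding it the lines of `test_latency_strace` with `name` set to `"futex"` yields the same ordering as the shell loop produces for that keyword.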
On kernel 2.6.32
    $ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
    futex
    1.000140 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000129 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000124 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000119 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000106 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000103 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000102 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    0.000125 futex(0x7f98ce4c0b88, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    0.000042 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000038 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000037 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000030 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000029 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 0
    0.000028 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000027 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000018 futex(0x7fff82f0ec3c, FUTEX_WAKE_PRIVATE, 1) = 0
    nanosleep
    0.000027 nanosleep({1, 0}, {1, 0}) = 0
    0.000019 nanosleep({1, 0}, {1, 0}) = 0
    0.000019 nanosleep({1, 0}, {1, 0}) = 0
    0.000018 nanosleep({1, 0}, {1, 0}) = 0
    0.000018 nanosleep({1, 0}, {1, 0}) = 0
    0.000018 nanosleep({1, 0}, {1, 0}) = 0
    0.000018 nanosleep({1, 0}, 0x7fff82f0eb40) = ? ERESTART_RESTARTBLOCK (To be restarted)
    0.000017 nanosleep({1, 0}, {1, 0}) = 0
    rt_sig
    0.000045 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000040 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000038 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000034 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000033 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000032 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000032 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000028 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000028 rt_sigaction(SIGRT_1, {0x37f8c052b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
    0.000027 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000027 rt_sigaction(SIGRTMIN, {0x37f8c05370, [], SA_RESTORER|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
    0.000027 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000023 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000022 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
    0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000019 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
On kernel 3.1.9
    $ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
    futex
    1.000129 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000126 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000122 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000115 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000114 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000112 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    1.000109 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    0.000139 futex(0x3f8b8f2fb0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    0.000043 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000041 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000037 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000036 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
    0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
    nanosleep
    0.000025 nanosleep({1, 0}, 0x7fff70091d00) = 0
    0.000022 nanosleep({1, 0}, {0, 3925413}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
    0.000021 nanosleep({1, 0}, 0x7fff70091d00) = 0
    0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
    0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
    0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
    0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
    0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
    rt_sig
    0.000045 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000044 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000043 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000040 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000038 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000037 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000035 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000034 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000027 rt_sigaction(SIGRT_1, {0x3f892067b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0
    0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000024 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000023 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
    0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
    0.000019 rt_sigaction(SIGRTMIN, {0x3f89206720, [], SA_RESTORER|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0
I found this 5-year-old bug report, which contains a comparative "ping pong" performance test.
I had to add
    #include <stdint.h>
in order to compile, which I did with this command:
    g++ -O3 -o condvar-perf condvar-perf.cpp -lpthread -lrt
On kernel 2.6.32
    $ ./condvar-perf 1000000
    NPTL
    mutex elapsed: 29085 us; per iteration: 29 ns / 9.4e-05 context switches.
    c.v. ping-pong test elapsed: 4771993 us; per iteration: 4771 ns / 4.03 context switches.
    signal ping-pong test elapsed: 8685423 us; per iteration: 8685 ns / 4.05 context switches.
On kernel 3.1.9
    $ ./condvar-perf 1000000
    NPTL
    mutex elapsed: 26811 us; per iteration: 26 ns / 8e-06 context switches.
    c.v. ping-pong test elapsed: 10930794 us; per iteration: 10930 ns / 4.01 context switches.
    signal ping-pong test elapsed: 10949670 us; per iteration: 10949 ns / 4.01 context switches.
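I conclude that between kernels 2.6.32 and 3.1.9 context switching has indeed slowed down, though not as much as you observe in kernel 3.2. I realize that this doesn't yet answer your question; I'll keep digging.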
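To put a number on that conclusion, the per-iteration figures from the condition-variable ping-pong runs above can be turned into a rough cost per context switch and a slowdown factor. The two helpers below are hypothetical names for that arithmetic. Dividing the iteration time by the context-switch count attributes all of the time to switching, which overstates the true cost somewhat, so treat the results as ballpark figures.

```cpp
// Approximate cost of one context switch: time per iteration divided by
// the number of context switches the test reports per iteration.
inline double ns_per_switch(double ns_per_iteration, double switches_per_iteration) {
    return ns_per_iteration / switches_per_iteration;
}

// How many times slower the newer kernel is for the same test.
inline double slowdown(double old_ns_per_iteration, double new_ns_per_iteration) {
    return new_ns_per_iteration / old_ns_per_iteration;
}
```

With the figures reported above, `ns_per_switch(4771, 4.03)` is roughly 1184 ns on 2.6.32, `ns_per_switch(10930, 4.01)` is roughly 2726 ns on 3.1.9, and `slowdown(4771, 10930)` is about 2.3.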
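Edit: I've found that changing the real-time priority of the process (both threads) improves the performance on 3.1.9 to match 2.6.32. However, setting the same priority on 2.6.32 makes it slow down... go figure. I'll look into it more.
Here are my results now: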
On kernel 2.6.32
    $ ./condvar-perf 1000000
    NPTL
    mutex elapsed: 29629 us; per iteration: 29 ns / 0.000418 context switches.
    c.v. ping-pong test elapsed: 6225637 us; per iteration: 6225 ns / 4.1 context switches.
    signal ping-pong test elapsed: 5602248 us; per iteration: 5602 ns / 4.09 context switches.
    $ chrt -f 1 ./condvar-perf 1000000
    NPTL
    mutex elapsed: 29049 us; per iteration: 29 ns / 0.000407 context switches.
    c.v. ping-pong test elapsed: 16131360 us; per iteration: 16131 ns / 4.29 context switches.
    signal ping-pong test elapsed: 11817819 us; per iteration: 11817 ns / 4.16 context switches.
    $
On kernel 3.1.9
    $ ./condvar-perf 1000000
    NPTL
    mutex elapsed: 26830 us; per iteration: 26 ns / 5.7e-05 context switches.
    c.v. ping-pong test elapsed: 12812788 us; per iteration: 12812 ns / 4.01 context switches.
    signal ping-pong test elapsed: 13126865 us; per iteration: 13126 ns / 4.01 context switches.
    $ chrt -f 1 ./condvar-perf 1000000
    NPTL
    mutex elapsed: 27025 us; per iteration: 27 ns / 3.7e-05 context switches.
    c.v. ping-pong test elapsed: 5099885 us; per iteration: 5099 ns / 4 context switches.
    signal ping-pong test elapsed: 5508227 us; per iteration: 5508 ns / 4 context switches.
    $
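The `chrt -f 1` runs above launch the whole program under the SCHED_FIFO policy at priority 1. A program can request the same policy for itself with `sched_setscheduler`. The sketch below is one assumed way to do that, with `set_fifo_priority` as an illustrative name. On Linux the call applies to the calling thread, and threads created afterwards normally inherit the policy, so it would need to run before the worker threads are started. It usually requires root or the CAP_SYS_NICE capability.

```cpp
#include <cerrno>
#include <sched.h>

// Switch the caller to the SCHED_FIFO real-time policy at `priority`.
// Returns 0 on success, otherwise the errno value (often EPERM when the
// process lacks the privilege to use real-time scheduling).
int set_fifo_priority(int priority) {
    sched_param sp{};
    sp.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == 0) return 0;
    return errno;
}
```

Calling `set_fifo_priority(1)` at the top of `main` should have roughly the same effect as starting the binary through `chrt -f 1`, although this has not been checked against the numbers above.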
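You might also see processors clocking back on more recent processors and Linux kernels because of the pstate driver, which is separate from C-states. So, in addition, to disable it you need the following kernel parameter: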