Windows: Performance problem with Euler problem and recursion on Int64 types

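I am currently learning Haskell, using Project Euler problems as my playground. I was astonished by how slow my Haskell programs turned out to be compared with similar programs written in other languages. I am wondering whether I have overlooked something, or whether this is the kind of performance penalty one has to expect when using Haskell.

The following program is inspired by Problem 331, but I changed it before posting so that I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation, and I made sure that the update of the accumulation variable keeping track of the arc length is strict. Yet it takes almost one and a half minutes to complete (compiled with GHC's -O flag).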

import Data.Int

arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)

Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.

public class ArcLength {
    public static void main(String args[]) {
        long n = 1 << 30;
        long x = 0;
        long y = n-1;
        long acc = 0;
        long norm2 = 0;
        long time = System.currentTimeMillis();

        while(x <= y) {
            if (norm2 < 0) {
                norm2 += 2*x + 1;
                x++;
            } else if (norm2 > 2*(n-1)) {
                norm2 += 2 - 2*(x+y);
                x--;
                y--;
            } else {
                norm2 += 2*x + 1;
                x++;
                acc++;
            }
        }

        time = System.currentTimeMillis() - time;
        System.err.println(acc);
        System.err.println(time);
    }
}
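
The Java program times its own loop with System.currentTimeMillis, whereas the Haskell program has to be timed from the outside. A rough in-process equivalent for the Haskell side might look like the sketch below. The helper name timedMillis is made up for illustration, getCPUTime reports CPU time rather than wall-clock time, and n is kept small so the run finishes instantly; substitute 2^30 to reproduce the real measurement.

```haskell
import Control.Exception (evaluate)
import Data.Int (Int64)
import System.CPUTime (getCPUTime)
import System.IO (hPrint, stderr)

-- Same loop as the Haskell program in the question.
arcLength :: Int64 -> Int64
arcLength n = go 0 (n - 1) 0 0
  where
    go x y norm2 acc
      | x > y             = acc
      | norm2 < 0         = go (x + 1) y (norm2 + 2*x + 1) acc
      | norm2 > 2*(n - 1) = go (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
      | otherwise         = go (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

-- Hypothetical helper: run an action and return its result together with
-- the CPU time it consumed, in milliseconds.
timedMillis :: IO a -> IO (a, Integer)
timedMillis action = do
  start  <- getCPUTime                       -- picoseconds
  result <- action
  end    <- getCPUTime
  return (result, (end - start) `div` 1000000000)

main :: IO ()
main = do
  -- evaluate forces the result inside the timed region; small n (assumption)
  (len, ms) <- timedMillis (evaluate (arcLength (2 ^ (12 :: Int))))
  hPrint stderr len
  hPrint stderr ms
```

Because of laziness, the call to evaluate matters: without it the loop could run after the second clock reading, and the reported time would be close to zero.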

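EDIT: After the discussion in the comments, I made some modifications to the Haskell code and ran some performance tests. First, I changed n to 2^29 to avoid overflows. Then I tried six different versions: with Int64 or with Int, and with no bangs, a bang before acc only, or bangs before both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All were compiled with

ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs

Here are the results: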

(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)

(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)

(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)

(Int64 norm2 acc)
arctest.exe: out of memory

(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)

(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
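
The exact source of the six variants is not shown. As a sketch, assuming that only the type and the bang patterns differ from the original program, the "(Int64 !norm2 !acc)" row would correspond to something like the following. n is reduced here so that the example finishes immediately.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Int (Int64)

arcLength :: Int64 -> Int64
arcLength n = arcLength' 0 (n - 1) 0 0
  where
    -- The bangs force norm2 and acc to be evaluated on every call,
    -- so no chain of unevaluated thunks can build up in either argument.
    arcLength' x y !norm2 !acc
      | x > y             = acc
      | norm2 < 0         = arcLength' (x + 1) y (norm2 + 2*x + 1) acc
      | norm2 > 2*(n - 1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
      | otherwise         = arcLength' (x + 1) y (norm2 + 2*x + 1) (acc + 1)

main :: IO ()
main = print (arcLength (2 ^ (12 :: Int)))   -- small n, an assumption for a quick run
```

Replacing Int64 with Int in the import-free signature, or dropping one of the bangs, would give the other rows of the table under the same assumption.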

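I am using GHC 7.0.2 under 64-bit Windows 7 (the Haskell Platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.

Related discussion

Hm, I installed a fresh Haskell Platform with 7.0.3 and get roughly the following Core for your program (-ddump-simpl):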

Main.$warcLength' =
  \ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
    (ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
    case {__pkg_ccall ghc-prim hs_gtInt64 [...]
           ww_s1my ww1_s1mC GHC.Prim.realWorld#
    [...]

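So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. In the assembler output (-ddump-asm) we see things like: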

pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp

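So it looks very much as if every operation on Int64 gets turned into a full-blown C call in the backend. That is slow, obviously.

The source code of GHC.IntWord64 seems to confirm this: in a 32-bit build (like the one currently shipped with the platform), you only get emulation via the FFI interface.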
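
A quick way to tell which situation a given GHC installation is in is to ask for the width of the native Int. The following is a minimal sketch; it uses finiteBitSize from Data.Bits, which exists in recent versions of base (the GHC 7.0 releases discussed here would need the older bitSize instead), and the name nativeIntBits is made up for illustration. Whether Int64 is actually emulated on a 32-bit build depends on the GHC version, so the printed message is only a hint.

```haskell
import Data.Bits (finiteBitSize)
import Data.Int (Int64)

-- Width in bits of the native Int in the GHC build that compiled this program.
nativeIntBits :: Int
nativeIntBits = finiteBitSize (0 :: Int)

main :: IO ()
main = do
  putStrLn ("Int is " ++ show nativeIntBits ++ " bits wide")
  putStrLn ("Int64 is " ++ show (finiteBitSize (0 :: Int64)) ++ " bits wide")
  if nativeIntBits < 64
    then putStrLn "32-bit build: Int64 is wider than a machine word here"
    else putStrLn "64-bit build: Int64 fits in a machine word"
```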

Related discussion

Hmm, this is interesting. I compiled both of your programs and tried them out:

% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
843298604
6630

So the Java solution takes about 6.6 seconds. Next up is GHC with some optimization:

% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
843298604
./arc 12.68s user 0.04s system 99% cpu 12.718 total

Just under 13 seconds for ghc -O.

Trying some further optimization:

% ghc --make -O3
% time ./arc [13:16]
843298604
./arc 5.75s user 0.00s system 99% cpu 5.754 total

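With further optimization flags, the Haskell solution took under 6 seconds.

It would be interesting to know which version of the compiler you are using.

Related discussion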

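Same here. It got even faster when I changed Int64 to Int and added an explicit type signature to the inner worker. I'm on x64, GHC 7.0.3.
I am using GHC under Windows. ghc --version: The Glorious Glasgow Haskell Compilation System, version 7.0.2
@dbergh: x86 or x64? What CPU speed?
@fuzzxl Shouldn't the compiler automatically infer that x, y, etc. have the same type as n? Using Int would cause an overflow, at least if Int is 32 bits.
I'm on x64, so Int is a 64-bit type. About the type signature: I had a look at the Core representation and was a bit shocked to see some Integers still lurking around. But now I think they came from somewhere else. If Int overflows, why not try Integer? GHC's built-in Integer type is blazingly fast, and you don't have to worry about bounds.
Intel(R) Core(TM)2 Duo CPU [email protected]. I also tried ghc -O3, version 6.12.1, on a slower Linux system (Intel Atom). That configuration takes 5 minutes 20 seconds.
Non-x64 processors also have 64-bit data registers. In this case Int64 is much faster than Integer (with Integer I killed the process after 4 minutes). In any case, this does not explain the huge difference in computation time. @Monk Did you really copy the code without modifying it?
I see the same slowness with Int64 when compiling with a 32-bit build of GHC 7.0.2, and times under 5 seconds when compiling with an x64 build (albeit 7.0.3). By the way, the -fllvm option seems to speed up the 32-bit build (I'm not sure how it behaves on x64, since x64 GHC on Mac OS ignores -fllvm).
I copied the Haskell code that @dbergh posted, and it finishes in [real 0m4.322s, user 0m4.320s, sys 0m0.000s]. Edit: also compiled with -O3.
Hmm, when compiling without -O3 I killed it after about a minute, so the optimization done by -O3 here is huge.
I did indeed just copy the code without modifying it. My initial benchmark ran an x64 binary on an x64 Linux machine. I also run x64 Windows on my desktop, so I just tried your code there. The Haskell code does run much slower. Could this be because you cannot generate x64 binaries with GHC for Windows? I will update my answer later.
@Monk: IIRC no, and that may well be the problem.
@edka: On my platform -fllvm slows things down from 6 seconds to 7 seconds.
But not anymore. After adding some bangs here and there, LLVM makes it run in about 5.5 seconds, which is quite impressive. For fun, I translated the Java code to C. Compiled with -O3 on x64 Linux, it runs in 2.8 seconds.
Hmm, I find it a bit disappointing that sprinkling exclamation marks over the code at random makes such a big difference. But the real problem, as I see it, is the poor implementation of 64-bit arithmetic... According to my timings, using 64-bit integers increases the execution time almost tenfold.
Don't sprinkle them at random. Pay attention to evaluation (strict/lazy) and state the behavior you want.
@Don Stewart. OK, what would you consider good in this case? I found the variable acc to be the obvious one here. After some thought, norm2 seems like a good candidate too, but I would expect the compiler to find that one by itself. The variables x, y and n I would certainly not have annotated. In fact, I tried !n, which slowed things down (though I don't understand why); the others had no effect.
Yes, you're right. acc is not tested in the loop, so it may be inferred to be lazy. Depending on how you write the loop (see my answer), norm2 may or may not be. ghc -O2 can see the strictness of the other things. This is "demand analysis", and you seem to have a good intuition for it.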
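
To illustrate the point about accumulators that are never tested in the loop, here is a small sketch that is separate from the arc-length program. The functions countLazy and countStrict are made-up examples: both count the multiples of 3 below n, but only the second states in the source that the accumulator must be evaluated at every step. Whether GHC infers the same strictness for the first version by itself depends on the optimization level and on how the loop is written.

```haskell
{-# LANGUAGE BangPatterns #-}

-- acc is only passed along and returned, never inspected by a guard,
-- so nothing in the source forces it before the loop ends.
countLazy :: Int -> Int
countLazy n = go 0 0
  where
    go i acc
      | i >= n         = acc
      | i `mod` 3 == 0 = go (i + 1) (acc + 1)
      | otherwise      = go (i + 1) acc

-- The bang on acc states the intended behavior explicitly:
-- evaluate the accumulator on every iteration.
countStrict :: Int -> Int
countStrict n = go 0 0
  where
    go i !acc
      | i >= n         = acc
      | i `mod` 3 == 0 = go (i + 1) (acc + 1)
      | otherwise      = go (i + 1) acc

main :: IO ()
main = do
  print (countLazy 10)
  print (countStrict 10)
```

The loop counter i needs no annotation in either version, because the guard i >= n already inspects it on every call.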

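There are a couple of interesting things in your question.

You should primarily be using -O2. It will simply do a better job (in this case, identifying and eliminating laziness that was still present in the -O version).

Secondly, your Haskell isn't quite the same as the Java (it performs different tests and branches). As for others, running your code on my Linux box results in a runtime of around 6 seconds. That seems fine.

Make sure it is the same as the Java

One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.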

import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y = case () of { _
                | norm2 < 0 -> go (x+1) y acc (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
                | otherwise -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
                }
            | otherwise = acc

main = print $ loop (1 `shiftL` 30)

Peeking at the Core

We'll take a quick look at the Core using ghc-core. It shows a very nice loop over unboxed types:

main_$s$wgo
  :: Int#
     -> Int#
     -> Int#
     -> Int#
     -> Int#

main_$s$wgo =
  \ (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                  (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)

That is, everything is unboxed. That loop looks great!

And it performs just fine (Linux/x86-64/GHC 7.0.3):

./A 5.95s user 0.01s system 99% cpu 5.980 total

Checking the asm

We get reasonable assembly too, as a nice loop:

Main_mainzuzdszdwgo_info:
        cmpq %rdi, %r8
        jg .L8
.L3:
        testq %r14, %r14
        movq %r14, %rdx
        js .L4
        cmpq $2147483646, %r14
        jle .L9
.L5:
        leaq (%rdi,%r8), %r10
        addq $2, %rdx
        leaq -1(%rdi), %rdi
        addq %r10, %r10
        movq %rdx, %r14
        leaq -1(%r8), %r8
        subq %r10, %r14
        jmp Main_mainzuzdszdwgo_info
.L9:
        leaq 1(%r14,%r8,2), %r14
        addq $1, %rsi
        leaq 1(%r8), %r8
        jmp Main_mainzuzdszdwgo_info
.L8:
        movq %rsi, %rbx
        jmp *0(%rbp)
.L4:
        leaq 1(%r14,%r8,2), %r14
        leaq 1(%r8), %r8
        jmp Main_mainzuzdszdwgo_info

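Using the -fvia-C backend.

So this looks fine!

My suspicion, as mentioned in the comment above, is that it has something to do with the version of libgmp you have on 32-bit Windows generating poor code for 64-bit ints. First try upgrading to GHC 7.0.3, then try some of the other code generator backends, and if you still have a problem with Int64, file a bug report on the GHC Trac.

As broad confirmation that the cost really does lie in making these C calls in the 32-bit emulation of 64-bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine. Indeed, the runtime then goes from 3 seconds to well over a minute.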
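
To try the swap between Int, Int64 and Integer without editing the loop each time, one option (not something the original answer does) is to write the loop once for any Integral type and choose the type at the call site. The name arcLengthAt is made up here, and the polymorphic version may itself run slower unless GHC specializes it, so timings taken from this sketch should be treated with care. n is kept small so that all three runs finish immediately.

```haskell
import Data.Int (Int64)

-- The loop from the question, generalized over the integer type.
arcLengthAt :: Integral a => a -> a
arcLengthAt n = go 0 (n - 1) 0 0
  where
    go x y norm2 acc
      | x > y             = acc
      | norm2 < 0         = go (x + 1) y (norm2 + 2*x + 1) acc
      | norm2 > 2*(n - 1) = go (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
      | otherwise         = go (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main :: IO ()
main = do
  print (arcLengthAt (1024 :: Int))      -- machine-word arithmetic
  print (arcLengthAt (1024 :: Int64))    -- emulated on the 32-bit builds discussed above
  print (arcLengthAt (1024 :: Integer))  -- arbitrary precision
```

All three calls should print the same number, since nothing overflows at this size; only the cost per operation differs.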

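Lesson: use hardware 64-bit arithmetic if at all possible.

Related discussion

The normal optimization flag for performance-sensitive code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2; it even used to include experimental "optimizations" that often made programs noticeably slower.

With -O2 I get performance competitive with Java: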

tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3

tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604

real 0m4.948s
user 0m4.896s
sys 0m0.000s

And Java is about 1 second faster (20%):

tommd@Mavlo:Test$ time java ArcLength
843298604
3880

real 0m3.961s
user 0m3.936s
sys 0m0.024s

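But an interesting thing about GHC is that it has many different backends. By default it uses the native code generator (NCG), which is what we timed above. There is also an LLVM backend that often does better... but not here: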

tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604

real 0m5.973s
user 0m5.968s
sys 0m0.000s

But, as fuzzxl mentioned in the comments, LLVM does much better once you add a few strictness annotations:

$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604

real 0m4.099s
user 0m4.088s
sys 0m0.000s

There is also an old "via-C" generator that uses C as an intermediate language. It does well in this case:

tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )

on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
ttommd@Mavlo:Test$ ti
tommd@Mavlo:Test$ time ./so
843298604

real 0m3.982s
user 0m3.972s
sys 0m0.000s

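Hopefully the NCG will be improved to match via-C on this example before that backend is removed.

Related discussion

dbergh, I feel all of this got off to a bad start because of the unfortunate -O flag. Just to emphasize a point made by others: for run-of-the-mill compilation and testing, do as I do and paste this into your .bashrc or whatever:

alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"

I have played with the code a little, and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):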

{-# LANGUAGE BangPatterns #-}

arcLength :: Int->Int
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' :: Int -> Int -> Int -> Int -> Int
    arcLength' !x !y !norm2 !acc
        | x > y = acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
        | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
        | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)

main = print $ arcLength (2^30)

$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...

$ time ./tmp1
843298604

real 0m3.553s
user 0m3.539s
sys 0m0.006s

Related discussion