Tools for analyzing performance of a Haskell program

While solving some Project Euler problems to learn Haskell (so I am currently a complete beginner), I came across Problem 12. I wrote this (naive) solution:

--Get Number of Divisors of n
numDivs :: Integer -> Integer
numDivs n = toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2

--Generate a List of Triangular Values
triaList :: [Integer]
triaList = [foldr (+) 0 [1..n] | n <- [1..]]

--The same recursive
triaList2 = go 0 1
  where go cs n = (cs+n):go (cs+n) (n+1)

--Finds the first triangular Value with more than n Divisors
sol :: Integer -> Integer
sol n = head $ filter (\x -> numDivs(x)>n) triaList2

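For n=500 (sol 500) this solution is extremely slow (it has been running for more than 2 hours now), so I wondered how to find out why it is so slow. Are there any commands that tell me where most of the computation time is spent, so I know which part of my Haskell program is slow? Something like a simple profiler.

To be clear, I am not asking for a faster solution, but for a way to find this solution. How would you start if you had no Haskell knowledge?

I tried writing two triaList functions but found no way to test which one is faster, so this is where my problems start.

Thanks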

how to find out why this solution is so slow. Are there any commands that tell me where most of the computation-time is spend so I know which part of my haskell-program is slow?

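Precisely! GHC provides many excellent tools, including:

runtime statistics
time profiling
heap profiling
thread analysis
core analysis
comparative benchmarking
GC tuning

A tutorial on using time and space profiling is part of Real World Haskell.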

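GC Statistics

Firstly, ensure you're compiling with ghc -O2. And you might make sure it is a modern GHC (e.g. GHC 6.12.x).

The first thing we can do is check that garbage collection isn't the problem. Run your program with +RTS -s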

$ time ./A +RTS -s
./A +RTS -s
749700
9,961,432,992 bytes allocated in the heap
2,463,072 bytes copied during GC
29,200 bytes maximum residency (1 sample(s))
187,336 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 19002 collections, 0 parallel, 0.11s, 0.15s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.00s ( 0.00s elapsed)
MUT time 13.15s ( 13.32s elapsed)
GC time 0.11s ( 0.15s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.26s ( 13.47s elapsed)

%GC time 0.8% (1.1% elapsed)

Alloc rate 757,764,753 bytes per MUT second

Productivity 99.2% of total user, 97.6% of total elapsed

./A +RTS -s 13.26s user 0.05s system 98% cpu 13.479 total

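Which already gives us a lot of information: you only have a 2 MB heap, and GC takes up 0.8% of the time. So there is no need to worry that allocation is the problem.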
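If you would rather read the same kind of numbers from inside the program, the RTS exposes them through GHC.Stats in base. The following is only a sketch and not part of the run shown above: gcShare and the stand-in workload are names made up for illustration, and it assumes a reasonably recent GHC with statistics collection switched on via +RTS -T (which may in turn require linking with -rtsopts).

```haskell
import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Fraction of CPU time spent in the garbage collector (0.0 to 1.0),
-- given GC and mutator CPU time in nanoseconds. Hypothetical helper.
gcShare :: Int64 -> Int64 -> Double
gcShare gcNs mutNs
  | total == 0 = 0
  | otherwise  = fromIntegral gcNs / fromIntegral total
  where
    total = gcNs + mutNs

main :: IO ()
main = do
  -- Stand-in workload; replace with the computation you care about.
  print (sum [1 .. 1000000 :: Integer])
  enabled <- getRTSStatsEnabled
  if enabled
    then do
      s <- getRTSStats
      putStrLn ("bytes allocated: " ++ show (allocated_bytes s))
      putStrLn ("max live bytes:  " ++ show (max_live_bytes s))
      putStrLn ("GC share of CPU: " ++ show (gcShare (gc_cpu_ns s) (mutator_cpu_ns s)))
    else putStrLn "RTS statistics are off; rerun with +RTS -T"
```

A GC share close to zero, as in the +RTS -s output above, is the sign that the time is going into the computation itself rather than into collection.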

Time Profiles

Getting a time profile for your program is straightforward: compile with -prof -auto-all

$ ghc -O2 --make A.hs -prof -auto-all
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...

And, for N=150:

$ time ./A +RTS -p
749700
./A +RTS -p 13.23s user 0.06s system 98% cpu 13.547 total

which creates a file, A.prof, containing:

Sun Jul 18 10:08 2010 Time and Allocation Profiling Report (Final)

A +RTS -p -RTS

total time = 13.18 secs (659 ticks @ 20 ms)
total alloc = 4,904,116,696 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc

numDivs Main 100.0 100.0

This indicates that all your time is spent in numDivs, and that it is also the source of all your allocations.
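When automatic cost centres are too coarse, you can also place your own with SCC pragmas, so that the report splits a function into named pieces. Here is a minimal sketch using the code from the question; the cost-centre names are arbitrary labels chosen for illustration, and the pragmas are simply ignored when you compile without -prof. Note that newer GHC releases spell the automatic option -fprof-auto instead of -auto-all.

```haskell
-- Count the divisors of n, with the expensive list tagged as its own cost centre.
numDivs :: Integer -> Integer
numDivs n = toInteger (length divisors) + 2
  where
    divisors = {-# SCC "numDivs_divisors" #-}
      [x | x <- [2 .. (n `quot` 2) + 1], n `rem` x == 0]

-- Triangular numbers via a running sum, tagged separately.
triaList2 :: [Integer]
triaList2 = {-# SCC "triaList2" #-} go 0 1
  where
    go cs n = (cs + n) : go (cs + n) (n + 1)

-- First triangular number with more than n divisors.
sol :: Integer -> Integer
sol n = head $ filter (\x -> numDivs x > n) triaList2

main :: IO ()
main = print (sol 100)
```

Compiled with -prof and run with +RTS -p, the .prof file should then list the two named cost centres as separate rows.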

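Heap Profiles

You can also get a breakdown of those allocations by running with +RTS -p -hy, which creates A.hp. You can view it by converting it to a PostScript file (hp2ps -c A.hp), generating:

[Image: heap profile graph]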

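Which tells us there is nothing wrong with your memory use: it is allocating in constant space.

So your problem is the algorithmic complexity of numDivs: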

toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2

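Fix that, which is 100% of your running time, and everything else is easy.

Optimizations

This expression is a good candidate for the stream fusion optimization, so I'll rewrite it to use Data.Vector, like so: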

import qualified Data.Vector.Unboxed as U   -- from the vector package

numDivs n = fromIntegral $
    2 + (U.length $
        U.filter (\x -> fromIntegral n `rem` x == 0) $
        (U.enumFromN 2 ((fromIntegral n `div` 2) + 1) :: U.Vector Int))

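Which should fuse into a single loop with no unnecessary heap allocations. That is, it will have better complexity (by constant factors) than the list version. You can use the ghc-core tool (for advanced users) to inspect the intermediate code after optimization.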
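To make "a single loop" concrete, this is roughly the shape of loop you would write by hand for the same computation. It is an illustration only, not GHC's actual output; countDivisorsLoop is a hypothetical name, and it uses the bounds of the original list version.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Counts divisors the same way as the list version in the question:
-- test every x in [2 .. n `quot` 2 + 1], then add 2 for 1 and n itself.
-- No intermediate list or vector is built; only two counters are kept.
countDivisorsLoop :: Int -> Int
countDivisorsLoop n = go 2 0
  where
    upper = n `quot` 2 + 1
    go !x !acc
      | x > upper      = acc + 2
      | n `rem` x == 0 = go (x + 1) (acc + 1)
      | otherwise      = go (x + 1) acc
```

For example, countDivisorsLoop 28 evaluates to 6. The amount of work is unchanged (about n/2 remainder tests), which is why fusion only buys a constant factor.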

Testing this, ghc -O2 --make Z.hs

$ time ./Z
749700
./Z 3.73s user 0.01s system 99% cpu 3.753 total

So it reduced the running time for N=150 by 3.5x, without changing the algorithm itself.

Conclusion

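Your problem is numDivs. It is 100% of your running time, and has terrible complexity. Think about numDivs, and how, for example, for each N you are generating [2 .. n `div` 2 + 1] N times. Try memoizing that, since the values don't change.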
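One way to act on the complexity point, offered as a different technique from the memoization mentioned above rather than as that suggestion itself: every divisor d below the square root of n pairs with n `quot` d above it, so you only need to test candidates up to the square root. A minimal sketch, with numDivsSqrt as a hypothetical name and Int assumed to be wide enough for the values involved:

```haskell
-- Count divisors by pairing each divisor d <= sqrt n with n `quot` d.
numDivsSqrt :: Int -> Int
numDivsSqrt n
  | n < 1     = 0
  | otherwise = go 1 0
  where
    go d acc
      | d * d > n      = acc                    -- passed the square root: done
      | d * d == n     = acc + 1                -- perfect square: d pairs with itself
      | n `rem` d == 0 = go (d + 1) (acc + 2)   -- counts both d and n `quot` d
      | otherwise      = go (d + 1) acc
```

For example, numDivsSqrt 28 evaluates to 6. This does about sqrt n remainder tests per number instead of about n/2, which is a change in complexity rather than in constant factors.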

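To measure which of your functions is faster, consider using criterion, which will provide statistically robust information about sub-microsecond improvements in running time.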
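As a rough sketch of what such a benchmark could look like for your two ways of generating triangular numbers, assuming the criterion package is installed: triasFold and triasScan are hypothetical stand-ins for triaList and triaList2. They take the number of elements as an argument so that the list is rebuilt on every iteration instead of being computed once and shared.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)

-- Recomputes each sum from scratch, like triaList in the question.
triasFold :: Int -> [Integer]
triasFold k = [foldr (+) 0 [1 .. n] | n <- [1 .. toInteger k]]

-- Keeps a running sum, like triaList2 in the question.
triasScan :: Int -> [Integer]
triasScan k = scanl1 (+) [1 .. toInteger k]

main :: IO ()
main = defaultMain
  [ bgroup "first 1000 triangular numbers"
      [ bench "foldr per element" $ nf triasFold 1000
      , bench "running sum"       $ nf triasScan 1000
      ]
  ]
```

nf forces the whole result list on each run, so the report compares the full cost of producing the 1000 values rather than just the first cons cell.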

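Addenda

Since numDivs is 100% of your running time, touching other parts of the program won't make much difference. However, for pedagogical purposes, we can also rewrite those using stream fusion.

We can also rewrite triaList, and rely on fusion to turn it into the loop you wrote by hand in triaList2, which is a "prefix scan" function (aka scanl):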

triaList :: U.Vector Int
triaList = U.scanl (+) 0 (U.enumFromTo 1 top)   -- enumFromTo takes both bounds
    where
       top = 10^6

Similarly for sol:

sol :: Int -> Int
sol n = U.head $ U.filter (\x -> numDivs x > n) triaList

The overall running time is the same, but the code is a bit cleaner.