thread performance in Julia
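My attempt at parallel Julia code does not get any faster as I increase the number of threads.
The following code takes about the same time to run whether I set JULIA_NUM_THREADS to 2 or to 32.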
    using Random
    using Base.Threads

    rmax = 10
    dr = 1
    Ngal = 100000000

    function bin(id, Njobs, x, y, z, w)
        bin_array = zeros(10)
        for i in (id-1)*Njobs + 1:id*Njobs
            r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
            i_bin = floor(Int, r/dr) + 1
            if i_bin < 10
                bin_array[i_bin] += w[i]
            end
        end
        bin_array
    end

    Nthreads = nthreads()
    x = rand(Ngal)*5
    y = rand(Ngal)*5
    z = rand(Ngal)*5
    w = ones(Ngal)

    V = let
        VV = [zeros(10) for _ in 1:Nthreads]
        jobs_per_thread = fill(div(Ngal, Nthreads),Nthreads)
        for i in 1:Ngal-sum(jobs_per_thread)
            jobs_per_thread[i] += 1
        end
        @threads for i = 1:Nthreads
            tid = threadid()
            VV[tid] = bin(tid, jobs_per_thread[tid], x, y, z, w)
        end
        reduce(+, VV)
    end
Am I doing something wrong?
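The time spent in the threaded loop is negligible compared to the other operations. You also allocate arrays whose size depends on the number of threads, so with more threads you spend (slightly) more time on memory allocation.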
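As a rough check of how much of the work is setup, the allocation total can be estimated by hand. This is only a back-of-the-envelope sketch: it assumes Float64 arrays (the default for rand and ones) and counts two arrays for each rand(Ngal)*5, the one returned by rand and the scaled copy. The names n_arrays and total_gib are mine, not from the question.

```julia
Ngal = 100_000_000

# Size in bytes of one Float64 array of length Ngal
bytes_per_array = Ngal * sizeof(Float64)

# x, y and z each cost two arrays (rand(Ngal), then the copy made by *5);
# w = ones(Ngal) costs one more. This count is an assumption about the setup code.
n_arrays = 3 * 2 + 1

total_gib = n_arrays * bytes_per_array / 2^30
println(round(total_gib, digits = 3), " GiB")
```

The estimate comes out at about 5.2 GiB, which is in line with the 5.215 GiB that @time reports in the timings of this answer. All of that memory is allocated before the threaded loop starts, whatever the number of threads.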
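If you care about performance, have a look at https://docs.julialang.org/en/v1/manual/performance-tips/. In particular, avoid global variables at all costs (they kill performance) and put everything inside functions, which are also easier to test and debug. For example, I rewrote your code as: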
    using Random
    using Base.Threads

    function bin(id, Njobs, x, y, z, w)
        dr = 1
        bin_array = zeros(10)
        for i in (id-1)*Njobs + 1:id*Njobs
            r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
            i_bin = floor(Int, r/dr) + 1
            if i_bin < 10
                bin_array[i_bin] += w[i]
            end
        end
        bin_array
    end

    function test()
        Ngal = 100000000
        x = rand(Ngal)*5
        y = rand(Ngal)*5
        z = rand(Ngal)*5
        w = ones(Ngal)
        Nthreads = nthreads()
        VV = [zeros(10) for _ in 1:Nthreads]
        jobs_per_thread = fill(div(Ngal, Nthreads),Nthreads)
        for i in 1:Ngal-sum(jobs_per_thread)
            jobs_per_thread[i] += 1
        end
        @threads for i = 1:Nthreads
            tid = threadid()
            VV[tid] = bin(tid, jobs_per_thread[tid], x, y, z, w)
        end
        reduce(+, VV)
    end

    test()
Single-thread performance:

    julia> @time test();
      3.054144 seconds (33 allocations: 5.215 GiB, 11.03% gc time)

4-thread performance:

    julia> @time test();
      2.602698 seconds (65 allocations: 5.215 GiB, 9.92% gc time)
If I

    julia> @time test();
      2.444296 seconds (21 allocations: 5.215 GiB, 10.54% gc time)
4 threads:

    julia> @time test();
      2.481054 seconds (27 allocations: 5.215 GiB, 12.08% gc time)
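Since the timings above include generating the data, one way to see what the threads actually contribute is to create the arrays once and time only the binning. The following is a minimal sketch, not the code from the question: bin_range and bin_threaded are hypothetical helper names, the work is split into contiguous index ranges instead of being indexed by threadid(), and Threads.@spawn assumes Julia 1.3 or later. The array length is reduced so the example runs quickly.

```julia
using Base.Threads

# Bin the weights of the points whose index lies in `range`.
# Uses the same cutoff (i_bin < nbins) as the code in the question.
function bin_range(range, x, y, z, w; dr = 1.0, nbins = 10)
    bin_array = zeros(nbins)
    for i in range
        r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
        i_bin = floor(Int, r / dr) + 1
        if i_bin < nbins
            bin_array[i_bin] += w[i]
        end
    end
    return bin_array
end

# Split 1:n into `ntasks` contiguous ranges of nearly equal length,
# bin each range in its own task, then sum the partial histograms.
function bin_threaded(x, y, z, w; ntasks = nthreads())
    n = length(x)
    chunks = [(div((k - 1) * n, ntasks) + 1):div(k * n, ntasks) for k in 1:ntasks]
    tasks = [Threads.@spawn bin_range(c, x, y, z, w) for c in chunks]
    return reduce(+, fetch.(tasks))
end

# Data is generated once, outside the timed call.
n = 1_000_000   # smaller than Ngal in the question, to keep the sketch fast
x = rand(n) .* 5
y = rand(n) .* 5
z = rand(n) .* 5
w = ones(n)

bin_threaded(x, y, z, w)         # first call includes compilation
@time bin_threaded(x, y, z, w)   # times only the threaded binning
```

Running this with different values of JULIA_NUM_THREADS shows how the binning alone scales, separately from the cost of allocating and filling the input arrays. Splitting by index range also keeps the result correct when the number of points is not a multiple of the number of tasks.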