MPI_Waitall error for asynchronous MPI_Irecv
I use two MPI_Irecv calls, followed by two MPI_Send calls, followed by an MPI_Waitall on the MPI_Irecv requests, as shown below. After some computation I repeat the same block of code. But the MPI processes seem to fail already in the first block.
My communication works like this: the matrix is split horizontally according to the number of MPI processes, and communication happens only at the block boundaries. The block below sends its 'start'/first row to the block above it, and the block above sends its 'end'/last row to the block below it.
```c
MPI_Request request[2];
MPI_Status status[2];

double grid[size];
double grida[size];
...
<Calculation for grid2[][]>
...
MPI_Barrier(MPI_COMM_WORLD);

if (world_rank != 0){
    MPI_Irecv(&grid, size, MPI_DOUBLE, world_rank-1, 0, MPI_COMM_WORLD, &request[1]);
    printf("1 MPI_Irecv");
}
if (world_rank != world_size-1){
    MPI_Irecv(&grida, size, MPI_DOUBLE, world_rank+1, 1, MPI_COMM_WORLD, &request[0]);
    printf("2 MPI_Irecv");
}
if (world_rank != world_size-1){
    MPI_Send(grid2[end], size, MPI_DOUBLE, world_rank+1, 0, MPI_COMM_WORLD);
    printf("1 MPI_Send");
}
if (world_rank != 0){
    MPI_Send(grid2[start], size, MPI_DOUBLE, world_rank-1, 1, MPI_COMM_WORLD);
    printf("2 MPI_Send");
}

MPI_Waitall(2, request, status);
MPI_Barrier(MPI_COMM_WORLD);
...
<Again the above code but without the initialization of MPI_Request and MPI_Status>
```
But with this I get the following error:
```
*** Process received signal ***
Signal: Bus error: 10 (10)
Signal code: Non-existant physical address (2)
Failing at address: 0x108bc91e3
[ 0] 0   libsystem_platform.dylib   0x00007fff50b65f5a _sigtramp + 26
[ 1] 0   ???                        0x000000010c61523d 0x0 + 4502671933
[ 2] 0   libmpi.20.dylib            0x0000000108bc8e4a MPI_Waitall + 154
[ 3] 0   dist-jacobi                0x0000000104b55770 Work + 1488
[ 4] 0   dist-jacobi                0x0000000104b54f01 main + 561
[ 5] 0   libdyld.dylib              0x00007fff508e5145 start + 1
[ 6] 0   ???                        0x0000000000000003 0x0 + 3
*** End of error message ***
*** An error occurred in MPI_Waitall
*** reported by process [1969881089,3]
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_REQUEST: invalid request
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node dhcp-10 exited on signal 10 (Bus error: 10).
--------------------------------------------------------------------------
```
Why does MPI_Waitall throw this error, and why do the printf statements not print anything?
The code works with MPI_Wait() and MPI_Isend(), as shown below:
```c
// insert barrier
MPI_Barrier(MPI_COMM_WORLD);

if (world_rank != 0){
    MPI_Irecv(&grid, size*2, MPI_DOUBLE, world_rank-1, 0, MPI_COMM_WORLD, &request[0]);
    printf("1 MPI_Irecv");
}
if (world_rank != world_size-1){
    MPI_Irecv(&grida, size*2, MPI_DOUBLE, world_rank+1, 1, MPI_COMM_WORLD, &request[1]);
    printf("2 MPI_Irecv");
}
if (world_rank != world_size-1){
    MPI_Isend(grid2[end], size*2, MPI_DOUBLE, world_rank+1, 0, MPI_COMM_WORLD, &request[0]);
    printf("1 MPI_Send");
}
if (world_rank != 0){
    MPI_Isend(grid2[start], size*2, MPI_DOUBLE, world_rank-1, 1, MPI_COMM_WORLD, &request[1]);
    printf("2 MPI_Send");
}

//MPI_Waitall(2, request, status);
MPI_Wait(&request[0], &status[0]);
MPI_Wait(&request[1], &status[1]);
```
MPI_Waitall() fails because ranks 0 and world_size-1 each post only one MPI_Irecv(), so one element of request[] is never initialized; MPI_Waitall(2, ...) then waits on a garbage handle, which is exactly the MPI_ERR_REQUEST: invalid request (and the bus error) you see. The printf output is likely lost because the format strings have no trailing '\n', so stdout is never flushed before the processes abort. A possible workaround is to statically initialize the request array:
```c
MPI_Request request[2] = {MPI_REQUEST_NULL, MPI_REQUEST_NULL};
```
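For completeness, here is a minimal, self-contained sketch of the exchange with that workaround applied. The buffer names `grid` and `grida` mirror the question; `first_row`, `last_row`, and the row length `N` are hypothetical stand-ins for `grid2[start]`, `grid2[end]`, and `size`:

```c
#include <mpi.h>
#include <stdio.h>

#define N 8   /* columns per row; stands in for `size` */

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    double grid[N], grida[N];           /* halo receive buffers */
    double first_row[N], last_row[N];   /* stand-ins for grid2[start] / grid2[end] */
    for (int i = 0; i < N; i++)
        first_row[i] = last_row[i] = (double)world_rank;

    /* The workaround: every slot starts out as a completed (null) request,
     * so the slot an edge rank never fills is harmless to wait on. */
    MPI_Request request[2] = {MPI_REQUEST_NULL, MPI_REQUEST_NULL};
    MPI_Status  status[2];

    if (world_rank != 0)
        MPI_Irecv(grid,  N, MPI_DOUBLE, world_rank - 1, 0, MPI_COMM_WORLD, &request[1]);
    if (world_rank != world_size - 1)
        MPI_Irecv(grida, N, MPI_DOUBLE, world_rank + 1, 1, MPI_COMM_WORLD, &request[0]);

    if (world_rank != world_size - 1)
        MPI_Send(last_row,  N, MPI_DOUBLE, world_rank + 1, 0, MPI_COMM_WORLD);
    if (world_rank != 0)
        MPI_Send(first_row, N, MPI_DOUBLE, world_rank - 1, 1, MPI_COMM_WORLD);

    /* Safe on every rank: MPI_REQUEST_NULL entries complete immediately. */
    MPI_Waitall(2, request, status);

    printf("rank %d: exchange done\n", world_rank);  /* '\n' flushes line-buffered stdout */
    MPI_Finalize();
    return 0;
}
```

MPI_Waitall() treats MPI_REQUEST_NULL entries as already completed, so the edge ranks no longer hand it an uninitialized handle.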
As a side note, you might want to consider replacing each matched MPI_Irecv()/MPI_Send() pair with a single MPI_Sendrecv() call.
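If you go that route, MPI_PROC_NULL removes the rank-boundary branches entirely, since sends to and receives from MPI_PROC_NULL complete immediately as no-ops. A sketch reusing the variables from the question (`grid`, `grida`, `grid2`, `size`, `start`, `end`):

```c
/* Neighbors become MPI_PROC_NULL on the edge ranks,
 * turning the corresponding transfer into a no-op. */
int up   = (world_rank == 0)              ? MPI_PROC_NULL : world_rank - 1;
int down = (world_rank == world_size - 1) ? MPI_PROC_NULL : world_rank + 1;
MPI_Status st;

/* Send my last row down; receive the halo row coming from above. */
MPI_Sendrecv(grid2[end],   size, MPI_DOUBLE, down, 0,
             grid,         size, MPI_DOUBLE, up,   0,
             MPI_COMM_WORLD, &st);

/* Send my first row up; receive the halo row coming from below. */
MPI_Sendrecv(grid2[start], size, MPI_DOUBLE, up,   1,
             grida,        size, MPI_DOUBLE, down, 1,
             MPI_COMM_WORLD, &st);
```

This keeps the same tag matching as the original code (tag 0 flows downward, tag 1 flows upward) but needs no explicit request management at all.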