linux - Open MPI Virtual Timer Expired
I'm using Open MPI 1.8 on Gentoo 3.13 to manage the data transfer from one program to another via a server/client concept. Both the server and the clients are launched via mpiexec as separate processes. After some days (this is quite a heavy computation...), I receive the error
mpiexec noticed that process rank 0 with PID 17213 on node xxx exited on signal 26 (Virtual timer expired).
Unfortunately, the error is not reproducible in a reliable way, i.e., it does not always appear, and when it does, it is not at the same point in the program flow. I have experienced the error on other machines as well. I tracked the issue down to ITIMER_VIRTUAL which, upon expiration, delivers SIGVTALRM (see, e.g., http://man7.org/linux/man-pages/man2/setitimer.2.html). In the BUGS section of the man page, it says:
Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
I wonder if something similar might hold for ITIMER_VIRTUAL? Has anyone experienced similar problems and can confirm this error?
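For reference, the mechanism itself can be reproduced in isolation. The following is only a minimal sketch, not code from the application or from Open MPI: the helper names (on_vtalrm, install_vtalrm_handler, arm_virtual_timer_usec, spin_until_vtalrm) are made up for illustration. It arms a one-shot ITIMER_VIRTUAL, burns user-mode CPU time, and counts the resulting SIGVTALRM in a handler.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Number of SIGVTALRM deliveries seen so far (hypothetical demo counter). */
static volatile sig_atomic_t vtalrm_count = 0;

/* Handler: only touches a sig_atomic_t, which is async-signal-safe. */
static void on_vtalrm(int signo)
{
    (void)signo;
    vtalrm_count++;
}

/* Install the handler so that SIGVTALRM no longer terminates the process.
 * Returns 0 on success, -1 on error. */
static int install_vtalrm_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_vtalrm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGVTALRM, &sa, NULL);
}

/* Arm a one-shot ITIMER_VIRTUAL that expires after `usec` microseconds of
 * user-mode CPU time consumed by this process. it_interval stays zero, so
 * the timer does not re-arm itself. Returns 0 on success, -1 on error. */
static int arm_virtual_timer_usec(long usec)
{
    struct itimerval tv;
    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_sec = usec / 1000000;
    tv.it_value.tv_usec = usec % 1000000;
    return setitimer(ITIMER_VIRTUAL, &tv, NULL);
}

/* Burn user CPU time until the handler has run at least once.
 * Returns the number of deliveries observed. */
static int spin_until_vtalrm(void)
{
    volatile unsigned long sink = 0;
    while (vtalrm_count == 0)
        sink++;
    return (int)vtalrm_count;
}
```

Calling install_vtalrm_handler(), arm_virtual_timer_usec(10000) and spin_until_vtalrm() in that order should return once the signal has been delivered. If the handler is left out, the default action for SIGVTALRM terminates the process, which is consistent with the "exited on signal 26" report from mpiexec.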
The only workaround I can think of is to invoke setitimer(...) and try to manipulate the timer myself. However, I hope there is another way, since I can't modify the clients' source code. Any suggestions?
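To make the setitimer(...) idea concrete, here is a rough sketch of what manipulating the timer could look like. It is an illustration under assumptions, not a verified fix: the helper names are hypothetical, the code has to run inside the affected process, and it would not help if some other component re-arms the timer or installs its own SIGVTALRM handler afterwards.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Returns 1 if ITIMER_VIRTUAL is currently armed, 0 if it is not,
 * and -1 if getitimer() fails. */
static int virtual_timer_is_armed(void)
{
    struct itimerval cur;
    if (getitimer(ITIMER_VIRTUAL, &cur) != 0)
        return -1;
    return (cur.it_value.tv_sec != 0 || cur.it_value.tv_usec != 0) ? 1 : 0;
}

/* Disarm ITIMER_VIRTUAL by setting both it_value and it_interval to zero.
 * Returns 0 on success, -1 on error. */
static int disarm_virtual_timer(void)
{
    struct itimerval off;
    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_VIRTUAL, &off, NULL);
}

/* Alternative: leave the timer alone but ignore the signal, so that an
 * expiration no longer terminates the process. Returns 0 on success. */
static int ignore_sigvtalrm(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGVTALRM, &sa, NULL);
}

/* Demo-only helper (hypothetical): arm a one-shot virtual timer that
 * expires after `sec` seconds of user-mode CPU time. */
static int arm_virtual_timer_sec(long sec)
{
    struct itimerval tv;
    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_sec = sec;
    return setitimer(ITIMER_VIRTUAL, &tv, NULL);
}
```

Checking virtual_timer_is_armed() at a few points in a test program is also a cheap way to find out whether anything in the process arms the timer at all, before deciding whether to disarm it or ignore the signal.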
Since this question has not been answered officially, I'll do it on behalf of Hristo (@HristoIliev: I hope this is OK with you). As he pointed out in his first comment on the question, there is not a single hint in the Open MPI source code of anything that could have caused the virtual timer expiration. Indeed, the timer problem was related to a third-party library, which made the code crash after an unpredictable time (depending on the current load of the machine).