This is probably data-parallel training with collective all-reduce (likely Horovod, since they're using MPI). Ring membership in Horovod is static - you can't recover from a failed worker. You would need something like a consistent hashing ring (as in a DHT) so that workers could detect failing peers (via heartbeats), agree on who failed, and evict them. None of those goodies exist in Horovod yet.
The workaround is to have a chief node do periodic checkpointing of the model weights and the epoch/iteration, so that you can restart from the checkpoint if a worker fails.
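A minimal sketch of that chief-node checkpointing pattern, using plain pickle files. The `rank` variable here is a stand-in for whatever the framework provides (e.g. `hvd.rank()` in Horovod); the filenames and the dict-shaped "weights" are illustrative, not any real API:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, weights, epoch):
    # Write atomically: dump to a temp file, then rename, so a crash
    # mid-write never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"weights": weights, "epoch": epoch}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Returns (weights, epoch_to_resume_from); (None, 0) if no checkpoint yet.
    if not os.path.exists(path):
        return None, 0
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["weights"], state["epoch"]

# Only the chief (rank 0) writes; on restart, every worker reads the
# checkpoint and resumes from the saved epoch.
rank = 0                      # assumption: stand-in for hvd.rank() / MPI rank
ckpt = "model.ckpt"           # hypothetical checkpoint path
weights, start_epoch = load_checkpoint(ckpt)
for epoch in range(start_epoch, 3):
    # ... train one epoch, all-reduce gradients across workers ...
    weights = {"w": epoch}    # placeholder for real model weights
    if rank == 0:
        save_checkpoint(ckpt, weights, epoch + 1)
```

If a worker dies mid-epoch, you kill the job, restart all workers, and they pick up from the last completed epoch rather than from scratch. The atomic rename matters: without it, a crash during the write can corrupt the only copy of your checkpoint.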