Slurm should be the answer but it isn't. In our ML environment, it required ML researchers to understand what is going on (more systems knowledge) and no one liked it. The situation devolved to sshing into machines and running jobs. You are right that slurm is a good fit for HPC ... I just don't think DL workloads are exactly that.
P.S. I also think the K8s scheduler isn't great.