Jobs crash with slurm scheduling

Hi,

I have recently been learning to use cisTEM. It runs very well when used interactively, but I am having trouble running it on our cluster with the Slurm scheduling system.
We set it up with our system administrator as explained under the simple configuration in the FAQ. Jobs start normally and the initial batches of Slurm jobs complete successfully, but the run then usually crashes, giving only a “Master socket disconnected” error message (there is usually no red warning message inside the GUI). The crash occurs at a random point, and only when running 3D refinements with my dataset or ab-initio 3D reconstructions with the apoferritin test dataset. Movie alignment, CTF estimation and 2D classification all worked with the test dataset under Slurm. I have already tried different partitions, numbers of cores, memory requirements (between 4G and 20G) and delay times (between 1s and 5s) without success. Any advice is appreciated.

Best, Domen

Hi Domen,

It seems lots of people are having problems with Slurm clusters. I don't currently have a solution to this (or even access to a Slurm cluster for testing). Hopefully this issue can be resolved in the future, but right now my only suggestion is to triple-check that all the files are accessible from all the nodes on the cluster.
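For anyone who wants to run that file check systematically, here is a rough Python sketch. It is not part of cisTEM and is untested on a real cluster. It assumes that `sinfo` and `srun` are on your PATH and that you are allowed to target individual nodes. The function names and the example paths are made up for illustration.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def list_nodes() -> List[str]:
    """Return the unique node names reported by `sinfo` (one line per node)."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N"],
        capture_output=True, text=True, check=True,
    ).stdout
    # A node in several partitions is listed once per partition, so deduplicate.
    return sorted(set(out.split()))


def readable_on_node(node: str, path: str) -> bool:
    """Run `test -r PATH` on one node via srun.

    Assumption: a non-zero exit means the path is not readable there.
    It can also mean srun itself failed (node down, no permission), so
    treat a False result as "needs a closer look", not as proof.
    """
    result = subprocess.run(
        ["srun", f"--nodelist={node}", "--nodes=1", "--ntasks=1",
         "test", "-r", path],
        capture_output=True,
    )
    return result.returncode == 0


def find_inaccessible(
    paths: Iterable[str],
    nodes: Iterable[str],
    check: Callable[[str, str], bool] = readable_on_node,
) -> Dict[str, List[str]]:
    """Map each node to the list of paths that failed the check on it."""
    paths = list(paths)
    missing: Dict[str, List[str]] = {}
    for node in nodes:
        bad = [p for p in paths if not check(node, p)]
        if bad:
            missing[node] = bad
    return missing


# Example use (hypothetical paths, replace with your own project files):
#   problems = find_inaccessible(
#       ["/data/project/project.db", "/data/project/Assets"],
#       list_nodes(),
#   )
#   print(problems or "all paths readable on all nodes")
```

Note that each check is a separate `srun` call, so on a busy cluster it may sit in the queue for a while before returning.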

Cheers,

Tim

Hi Tim,

I can give you access to our Slurm cluster if you would like to test some things out. FYI, I think I have figured out what is happening with the jobs that randomly hang. Occasionally, when I run jobs locally, I get a message that a slave disconnected prematurely. When that happens, the whole job hangs and won't proceed, and I have to terminate it and run it again.

Scott