I am trying to troubleshoot a problem on our cluster where I get the message "Error: A slave has disconnected before all jobs are finished" in the GUI. Usually, this happens at the end of particle picking, or after the reconstruction phases of a multi-class refinement. There is one such message per running process.
I don't believe it's a cisTEM bug, but I was hoping you could shed some light on the situations in which slaves disconnect or exit early, so I can figure out how to fix the cluster configuration (e.g., is there a timeout on I/O? Can I increase the output verbosity?). We have many compute nodes connecting to the same RAID/NFS server via 10G Ethernet, but the interactive node that runs the GUI only has a 1G network connection.