I am trying to troubleshoot a problem on our cluster where I get the message "Error: A slave has disconnected before all jobs are finished" in the GUI. Usually, this happens at the end of particle picking, or after the reconstruction phases of a multi-class refinement. There is one such message per running process.
I don't believe it's a cisTEM bug, but I was hoping you could shed some light on situations in which the slaves disconnect or exit early so I can figure out how to fix the cluster configuration (e.g. is there a timeout on IO? Can I increase output verbosity?). We have many compute nodes connecting to the same RAID/NFS server via 10G ethernet, but the interactive node that runs the GUI only has 1G network.
We have also had the slave disconnection issue a few times. The "disconnect job" issue always happened at the end of 2D classification or the ab initio 3D model process. When the error message pops up, the calculation pauses indefinitely. We ran cisTEM on a local GPU workstation (16 Xeon cores, 128 GB DDR4 RAM, 25 TB local HDD, 4x 1080 Ti cards).
Do you use the default run configuration (17 processes in your case)? Also, how are your hard drives configured (how many drives, RAID level, filesystem type)? What did top/htop show while the job was hung?
So far I only get this error when I use a certain network filesystem, so I strongly suspect a connection to IO availability.
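One rough way to test the I/O suspicion is to time small synchronous writes in the project directory from a compute node and compare the numbers against a local disk. Below is a minimal sketch in Python using only the standard library. The function name `probe_write_latency`, the sample count, and the payload size are illustrative assumptions; this is not something cisTEM itself provides.

```python
import os
import tempfile
import time


def probe_write_latency(directory, samples=20, payload_size=1 << 20):
    """Time `samples` create+write+fsync+delete cycles in `directory`.

    Returns a list of latencies in seconds, one per cycle.
    """
    payload = os.urandom(payload_size)
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        # Create the temp file inside the target directory so the write
        # goes through the filesystem under test (e.g. the NFS mount).
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data out of the page cache
        latencies.append(time.perf_counter() - start)
    return latencies
```

Calling `probe_write_latency` on the network-mounted project directory and again on a local scratch directory, then comparing the maximum values, would show whether the network filesystem occasionally stalls. It would not by itself prove that a stall is what makes a slave disconnect.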
We routinely ran 33 processes (16 cores, 32 threads). The HDDs are configured as RAID 0 (10 TB x2, 4 TB x4, Hitachi enterprise SAS HDDs), and the filesystem is ZFS.
When the slave disconnection error appeared, one CPU in htop was idle (usually only one slave disconnected). The weird thing is that when we reran the same process without changing any settings, it often finished normally (sometimes it needed two reruns). I am sure there was no other disk I/O besides cisTEM during the calculation.
Dear Tim and cisTEM users,
I ran into similar issues with my (first) 2D classification job.
It goes through to round 5.
If I run it from the cluster, I get a series of "Error: A slave has disconnected before all jobs are finished" messages. It then remains idle and the job never progresses past round 5.
If I run it interactively from a multicore workstation I get no error at all, but still the job gets stuck at round 5 and never goes forward.
In both cases, if I look at the results I notice that for some rounds the classes are nonsensical: either totally grey, or weird pixelated boxes. Still, rounds 1, 2 and 4 show what is expected.
My first guess was that there must be some corrupted particles that when picked up in the randomly selected subset cause trouble, but I manually visualised all of them and couldn't see anything obvious. I got no warning or anything while extracting them.
Not quite sure where to start from to troubleshoot, any help would be much appreciated.
Does it reproducibly stop at round 5? If so, would you be able to share the stack with me, so that I can look into it?
It does reproducibly get stuck at round 5.
This is a groel test dataset that was collected on our new microscope for a test. I was handed the pre-motioncorrected sums and pre-picked .box files.
I imported DW and noDW, ran ctffind through cisTEM (on noDW), then converted .box to .txt with a script I found on the forum and imported that too.
Then I followed the tutorial to extract particles (from DW), and ran 2D classification.
(FYI, the same .box and .mrc files imported into RELION ran just fine through to the end.)
What is the best way to share the stack with you?
Thanks a lot!!!
If you export your refinement package, you will get a stack file and a .par file. How big are they?
Do you have a server I can connect to, to download it?
I've come across the same problem. The refinement package works for auto-refine with 1 class. When I created another refinement package for 3 classes, I came across this error during auto-refinement. Could you let me know how to debug this problem? Thanks!
You are getting "Error: A slave has disconnected before all jobs are finished" printed in the GUI during an auto-refine with 3 classes. Does it always happen at the same stage? If so, at what stage?
I just want to report that I was also getting "Error: A slave has disconnected before all jobs are finished" errors and the jobs would hang indefinitely during Movie Alignment. This would happen both when submitting jobs to SLURM and when running locally on a computing node ("Default local" profile). This was strange because normally cisTEM runs fine on our cluster. It turns out that for this particular dataset I was using the option "Include All Frames in Sum?" to select only a subset of the aligned frames. When disabling this option (i.e. including all frames, which is the default), the job ran normally to completion.
I'm not sure how using this option would cause such an error but this could be a useful hint for debugging.
Thanks for the info - I'm sure it will be helpful!
I ran into exactly the same issue as rdrighetto mentioned. When I select only a subset of frames for movie alignment, I get multiple "Error: A slave has disconnected before all jobs are finished" messages. Everything is OK when I include all the frames.
I am having the same issue as rdrighetto, but the issue isn't solved by switching back to including all frames. Was this ever solved more thoroughly?
Thanks in advance,
I'm sorry, but I commented too quickly. I was able to correct the error by raising my memory limit with "--mem 3500" in my srun command, so as to accommodate the job's increased memory needs.
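For anyone else hitting this under SLURM, the memory request goes on the `srun` line used to launch the jobs. Here is a small sketch that assembles such a command string. `--mem` is the standard SLURM option (megabytes per node by default); the helper name `build_srun_command` and the `$command` placeholder standing in for the job to be launched are assumptions for illustration, so check them against your own run profile.

```python
import shlex


def build_srun_command(mem_mb, job_placeholder="$command", extra_args=()):
    """Return an srun command line that requests `mem_mb` megabytes of memory."""
    if mem_mb <= 0:
        raise ValueError("mem_mb must be positive")
    parts = ["srun", "--mem", str(mem_mb)]
    # Any additional srun options are shell-quoted and placed before the job.
    parts.extend(shlex.quote(arg) for arg in extra_args)
    # The placeholder is left unquoted so the launcher can substitute it.
    parts.append(job_placeholder)
    return " ".join(parts)


print(build_srun_command(3500))  # srun --mem 3500 $command
```

The value 3500 is simply the one that worked in the post above; the right number depends on the box size and the job type.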
Thanks for a great sys(cis)tem,
I get the message "Error: A slave has disconnected before all jobs are finished" in the GUI. It happens at the end of particle picking, and the GUI shows the process as finished, but there is no output.
I terminated and reran it a few times and also adjusted the number of processes, but I have been unable to sort out this issue. Any pointers would be useful.
I am using a 32-CPU system.
Hmm, perhaps there is something strange with one of your images that cisTEM doesn't like. You could try to narrow down which image is causing it to crash by using groups.
Make a group with the first half of the images and run on that to see if it crashes. If it doesn't, make a group with the second half, and keep narrowing it down that way. If you determine that a specific image was causing the crash, it would be great if you could send it to us to investigate further.
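The halving procedure described above is a bisection search. As a rough sketch of the logic, assuming exactly one image is responsible: the function `find_bad_image` and the `crashes` callback are hypothetical names, and in practice `crashes` stands for "make a group in the GUI, run the job on it, and note whether the error appears".

```python
def find_bad_image(images, crashes):
    """Narrow down a single image that makes `crashes(group)` return True.

    `images` is a list of image names. `crashes` is a caller-supplied
    function that runs the job on a group and reports whether it failed.
    Returns the offending image, or None if no single image can be isolated.
    """
    if not images or not crashes(images):
        return None
    candidates = list(images)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        if crashes(first):
            candidates = first
        elif crashes(second):
            candidates = second
        else:
            # Neither half fails alone, so the failure is not tied to one image.
            return None
    return candidates[0]


images = [f"image_{i:04d}.mrc" for i in range(8)]
bad = "image_0005.mrc"
print(find_bad_image(images, lambda group: bad in group))  # image_0005.mrc
```

If neither half fails on its own, the sketch returns None, which would point away from a single bad image and towards something that depends on the total size of the run.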
I tried with up to 50 images and it keeps complaining about the same thing.
I believe the issue might be due to the large dataset (~3000 images): it somehow managed to do the CTF and alignment jobs, but after that it may not be able to handle this volume.
Is there any way to split the dataset into three or four parts after CTF correction, do the particle picking on each part, and later rejoin the particles?
If you split into groups as detailed above, you can pick in groups, and all the results will be combined in the particle position assets (all positions).
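To decide which images go into which group, an even contiguous split is enough. Here is a minimal sketch; the function name `split_into_groups` and the use of plain integers as stand-ins for image assets are assumptions for illustration, and the groups themselves still have to be created in the GUI.

```python
def split_into_groups(items, n_groups):
    """Split `items` into `n_groups` contiguous groups whose sizes differ by at most one."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    base, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for index in range(n_groups):
        # The first `extra` groups each take one additional item.
        size = base + (1 if index < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


image_ids = list(range(1, 3001))  # stand-in for ~3000 image assets
groups = split_into_groups(image_ids, 4)
print([len(group) for group in groups])  # [750, 750, 750, 750]
```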