I am trying to run cisTEM on our cluster with a job scheduler. Our IT manager set up the run profile, but I always get some jobs hanging in the E state forever while the master job simply waits for them. Manually deleting these jobs does not work either. Any suggestions would be appreciated.
Manager command: /apps/cistem/1.0.0-beta/$command
Command: qsub -N cisTEM -j oe -- /apps/cistem/1.0.0-beta/$command
qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
295979.kylin cisTEM jianshi 00:00:04 E daily
295989.kylin cisTEM jianshi 00:00:04 E daily
296001.kylin cisTEM jianshi 00:00:04 E daily
296004.kylin cisTEM jianshi 00:00:04 E daily
296025.kylin cisTEM jianshi 00:00:04 E daily
296028.kylin cisTEM jianshi 00:00:04 E daily
296029.kylin cisTEM jianshi 00:00:04 E daily
296031.kylin cisTEM jianshi 00:00:04 E daily
296071.kylin cisTEM jianshi 00:00:00 E daily
296135.kylin cisTEM jianshi 00:00:00 E daily
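For reference, this is roughly how I tried to delete the stuck jobs manually (using one of the job IDs above as an example); it has no effect:
qdel 295979.kylin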
A few questions:
1. How many jobs are you submitting at once? Do some of the jobs connect?
2. Do jobs ever run successfully?
3. Do you see red error messages in the gui?
4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?
E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter though.
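If it helps, and assuming your scheduler is Torque or PBS Pro, you (or your IT manager) could inspect one of the stuck jobs and try to force-remove it; the exact flags depend on which scheduler you have:
qstat -f 295979.kylin          # full job record, including exit status and comment fields
tracejob 295979                # scheduler logs for the job (run on the PBS server, usually needs admin)
qdel -p 295979.kylin           # Torque: purge a job stuck in E state (admin only)
qdel -W force 295979.kylin     # PBS Pro equivalent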
Cheers,
Tim
I submit 112 jobs. All jobs connect, and a few rounds of iterations complete successfully. When the error comes out is rather random.
3. Do you see red error messages in the gui?
Sometimes there is a red error message as follows, sometimes it just waits quietly forever.
All slaves have re-connected to the master.
Error: Input parameter file /scratch/emlab/jianshi/betaG/120kxBin2/Assets/Parameters/startup_input_par_15_1.par not found
4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?
Yes, it's accessible to all nodes.
E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter though.
It's most likely due to our cluster and job scheduler settings, but I have no clue where to start debugging.
Thanks,
Jian
This looks like some of the node jobs can't find the file. Do you know what your filesystem is - is it NFS?
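One quick check, assuming you have ssh access to the compute nodes (the hostnames below are placeholders), is to confirm that every node can actually see the directory from the error message:
for node in node01 node02 node03; do
  ssh $node "ls -ld /scratch/emlab/jianshi/betaG/120kxBin2/Assets/Parameters"
done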
Tim
It's GPFS. But it doesn't make sense to me, since many iterations have already completed successfully. The cluster is a simple one, a head node and eight compute nodes. I have logged in to each node individually and confirmed that the directories are accessible.
To me, it seems some jobs hang without writing the output par files, but still signal the job controller to move on to the next step.
Indeed, I found out today that one compute node had actually rebooted itself and had not remounted NFS, so it couldn't access the shared files.
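For the record, a quick way to spot this kind of thing (node hostnames are placeholders, and I'm assuming /scratch is the mount point; with GPFS admin access, mmlsmount can also list which nodes have the filesystem mounted):
for node in node01 node02 node03; do
  ssh $node "df -h /scratch | tail -1"
done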
Cheers,
Jian
Hi Jian,
So everything is working now?
Cheers,
Tim
I have not tested PBS queue submission yet because the server is occupied.
Thanks and Happy Holidays,
Jian
I'm having a similar problem using the SLURM scheduler. I am running cisTEM from the head node; my Manager command is $command, and my Command is srun -p backfill -n 1 -o /dev/null $command. It completes with no problems when I run 2D classification, but when I do Ab-Initio, it hangs after a random number of iterations. The problem does not happen if I do local processing.
How many processors are you running? Do you get any error messages?
Tim
24 processors. No error messages. When I looked at the queue, there were jobs that were running but weren't proceeding.
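In case it's useful, this is roughly how I checked the queue (the job ID is a placeholder):
squeue -u $USER              # list my jobs and their states
scontrol show job <jobid>    # detailed record for one of the stuck jobs
scancel <jobid>              # kill it if it never proceeds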
Hmm, that is weird that it happens with so few processors.
Does it stop during the connection generally? i.e. when it freezes does the number connected in the bottom left say 24/24, or does it freeze before it reaches 24?
Tim
Also, does it freeze on the first job "preparing stack" or later?
Thanks,
Tim
Hi Tim. The GUI doesn't seem to be aware anything is wrong; the refinement just stops proceeding. When I then manually checked the SLURM queue, there were jobs that were just going and going. It's like those particular jobs got stuck in a loop or something. I've just been working on local processors lately because we were having other problems with cisTEM on SLURM too. I'm going to start a new SLURM thread.
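In case it helps anyone else, a couple of ways to look at what a stuck job is actually doing (the job ID and node name are placeholders):
sstat -j <jobid> --format=JobID,AveCPU,MaxRSS    # CPU time and memory of the running step
ssh <nodename> top -b -n 1 | head -20            # check whether the cisTEM process is spinning or idle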