PBS job hanging

jamonshi

PBS job hanging

I am trying to run cisTEM on our cluster with a job scheduler. Our IT manager set up the profile; however, I always get some jobs hanging in the E state forever, and the master job simply waits for them. Manually deleting these jobs does not work either. Any suggestions would be appreciated.

/apps/cistem/1.0.0-beta/$command

qsub -N cisTEM -j oe -- /apps/cistem/1.0.0-beta/$command

qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
295979.kylin      cisTEM           jianshi           00:00:04 E daily           
295989.kylin      cisTEM           jianshi           00:00:04 E daily           
296001.kylin      cisTEM           jianshi           00:00:04 E daily           
296004.kylin      cisTEM           jianshi           00:00:04 E daily           
296025.kylin      cisTEM           jianshi           00:00:04 E daily           
296028.kylin      cisTEM           jianshi           00:00:04 E daily           
296029.kylin      cisTEM           jianshi           00:00:04 E daily           
296031.kylin      cisTEM           jianshi           00:00:04 E daily           
296071.kylin      cisTEM           jianshi           00:00:00 E daily           
296135.kylin      cisTEM           jianshi           00:00:00 E daily
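
(For reference, a few commands that are typically useful for inspecting or force-removing a job stuck in E; which qdel option applies depends on whether the scheduler is PBS Professional or Torque, and the job ID below is just the first one from the listing above.)

qstat -f 295979.kylin        # full details for the stuck job
tracejob 295979.kylin        # log entries for this job (needs access to the PBS logs)
qdel -W force 295979.kylin   # PBS Pro: force-delete a job stuck in E even if its node is unreachable
qdel -p 295979.kylin         # Torque: purge the job record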

Sun, 12/17/2017 - 03:03

timgrant

A few questions :-

1. How many jobs are you submitting at once? Do some of the jobs connect? 

2. Do jobs ever run successfully?

3. Do you see red error messages in the gui?

4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?

E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter, though.
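
(For reference: with PBS Professional, -- tells qsub to treat the rest of the command line as an executable and its arguments rather than as a job script, so it should be harmless here. A minimal illustration of the two submission forms, with placeholder commands:)

qsub -N test -j oe -- /bin/hostname    # '--' form: submit an executable directly (PBS Pro)
qsub -N test -j oe myscript.sh         # traditional form: submit a job script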

Cheers,

Tim

 

I submit 112 jobs. All jobs connect, and a few rounds of iterations complete successfully. The point at which the error appears is rather random.

3. Do you see red error messages in the gui?

Sometimes there is a red error message, as follows; sometimes it just quietly waits forever.

All slaves have re-connected to the master.
Error: Input parameter file /scratch/emlab/jianshi/betaG/120kxBin2/Assets/Parameters/startup_input_par_15_1.par not found

4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?

Yes, it's accessible to all nodes.

E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter, though.

It's pretty much down to our cluster and job scheduler settings, but I have no clue where to start debugging.

 

Thanks,

Jian

It's GPFS. But it doesn't make sense to me, since many successful iterations have already completed. The cluster is a simple one: a head node and eight compute nodes. I have logged in to each node individually and confirmed that the directory is accessible.

To me, it seems that some jobs hang without writing out the output par files, yet still signal the job controller to move on to the next step.
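
(A quick way to rule out per-node filesystem problems is to check that the project directory is visible from every compute node; the node names below are illustrative, and the path is the project directory from the error message above.)

for node in node01 node02 node03 node04 node05 node06 node07 node08; do   # replace with the real node names
    echo "== $node =="
    ssh "$node" "ls -d /scratch/emlab/jianshi/betaG/120kxBin2 || echo 'path not visible'"
done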

Thu, 12/21/2017 - 06:09

jamonshi

Indeed, I found out today that one compute node had actually rebooted itself and did not remount NFS, so it couldn't access the shared files.

Cheers,

Jian

Thu, 12/21/2017 - 15:58

timgrant

Hi Jian,

So everything is working now?

Cheers,

Tim

Tue, 01/09/2018 - 21:08

sstagg

I'm having a similar problem using the SLURM scheduler. I am running cisTEM from the headnode, my Manager command is $command, and my Command command is srun -p backfill -n 1 -o /dev/null $command. It completes with no problems when I run 2D classification, but when I do Ab-Initio, it hangs after a random number of iterations. The problem does not happen if I do local processing.
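
(When a refinement stalls like this, it can help to check what the scheduler thinks the remaining steps are doing; the job ID below is a placeholder.)

squeue -u "$USER" -o "%.12i %.9P %.12j %.2t %.10M %R"   # job ID, partition, name, state, elapsed time, reason
scontrol show job 123456                                # full details for one stuck job (placeholder ID)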

Wed, 01/10/2018 - 01:17

sstagg

24 processors. No error messages. When I looked at the queue, there were jobs that were running but weren't proceeding.

Thu, 01/11/2018 - 16:58

timgrant

Hmm, that is weird that it happens with so few processors.

Does it stop during the connection generally?  i.e. when it freezes does the number connected in the bottom left say 24/24, or does it freeze before it reaches 24?

Tim

Sun, 01/14/2018 - 16:14

sstagg

Hi Tim. The GUI doesn't seem to be aware anything is wrong; the refinement just stops proceeding. When I then manually checked the SLURM queue, there were jobs that were just going and going. It's like those particular jobs got stuck in a loop or something. I've just been working on local processors lately because we were having other problems with cisTEM on SLURM too. I'm going to start a new SLURM thread.
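
(If it happens again, one way to see where a stuck task is spinning is to grab a stack trace of the worker process on its compute node; the PID is a placeholder and gdb has to be available on that node.)

ps -u "$USER" -o pid,etime,cmd                   # find the PID of the stuck cisTEM worker on that node
gdb -p 12345 -batch -ex "thread apply all bt"    # dump all thread stacks (12345 is a placeholder PID)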