PBS job hanging

jamonshi
PBS job hanging

I am trying to run cisTEM on our cluster through the job scheduler. Our IT manager set up the run profile, but I always end up with some jobs hanging in the E state forever while the master job simply waits for them. Manually deleting these jobs does not work either. Any suggestions would be appreciated.

Manager command: /apps/cistem/1.0.0-beta/$command

Command: qsub -N cisTEM -j oe -- /apps/cistem/1.0.0-beta/$command
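
For reference, with -j oe the joined stdout/stderr of each job should end up in a file named after the job, typically cisTEM.o<jobid> in the directory qsub was run from (the exact location can depend on the PBS flavour and site configuration). A minimal sketch of looking at one of the stuck jobs, using a job id from the listing below:

cat cisTEM.o295979                            # joined stdout/stderr of the stuck job, if it was written back
qstat -f 295979 | grep -i -e exit -e comment  # exit status and scheduler comment, if recorded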

qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
295979.kylin      cisTEM           jianshi           00:00:04 E daily           
295989.kylin      cisTEM           jianshi           00:00:04 E daily           
296001.kylin      cisTEM           jianshi           00:00:04 E daily           
296004.kylin      cisTEM           jianshi           00:00:04 E daily           
296025.kylin      cisTEM           jianshi           00:00:04 E daily           
296028.kylin      cisTEM           jianshi           00:00:04 E daily           
296029.kylin      cisTEM           jianshi           00:00:04 E daily           
296031.kylin      cisTEM           jianshi           00:00:04 E daily           
296071.kylin      cisTEM           jianshi           00:00:00 E daily           
296135.kylin      cisTEM           jianshi           00:00:00 E daily

timgrant

A few questions :-

1. How many jobs are you submitting at once? Do some of the jobs connect? 

2. Do jobs ever run successfully?

3. Do you see red error messages in the gui?

4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?

E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter though.
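
If it helps, a few things worth checking on the PBS side (a sketch; the job id is one from your listing, and the force-delete option differs between PBS Pro and Torque and usually needs admin rights):

qstat -f 295979        # full job attributes - exec_host, exit status and the scheduler comment
tracejob 295979        # server/scheduler log trail for the job (usually has to be run on the PBS server host)
qdel -Wforce 295979    # PBS Pro: force removal of a stuck job; on Torque the equivalent is qdel -p 295979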

Cheers,

Tim


jamonshi
Answers to questions

I submit 112 jobs. All jobs connect, and a few rounds of iterations complete successfully; when the error appears is rather random.

3. Do you see red error messages in the gui?

Sometimes there is a red error message like the one below; sometimes it just quietly waits forever.

All slaves have re-connected to the master.
Error: Input parameter file /scratch/emlab/jianshi/betaG/120kxBin2/Assets/Parameters/startup_input_par_15_1.par not found

4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?

Yes, it's accessible from all nodes with the same path.

E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter though.

It's pretty much down to our cluster and job scheduler settings, but I have no clue where to start debugging.


Thanks,

Jian

timgrant

It looks like some of the node jobs can't find the file. Do you know what your filesystem is - is it NFS?
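
As a quick check (a sketch - the node names are placeholders, and the path is the one from your error message), something like this run from the head node would show whether every compute node sees the project directory:

# node01..node08 are hypothetical names; substitute your real hostnames
for n in node01 node02 node03 node04 node05 node06 node07 node08; do
  ssh "$n" "ls /scratch/emlab/jianshi/betaG/120kxBin2/Assets/Parameters >/dev/null 2>&1" \
    && echo "$n: path visible" || echo "$n: path NOT visible"
done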

Tim

jamonshi
GPFS

It's GPFS. But it doesn't make sense to me, since many iterations have already completed successfully. The cluster is a simple one: a head node and eight compute nodes. I have logged in to each node individually and confirmed that the directory is accessible.

To me, it looks like some jobs hang without writing their output .par files, but still give the wrong signal to the job controller to move on to the next step.

jamonshi
Can't access the file system

Indeed, I found out today that one compute node had actually rebooted itself and its NFS mount was not restored, so it couldn't access the shared files.

Cheers,

Jian

timgrant
Hi Jian,

So everything is working now?

Cheers,

Tim

jamonshi
It's working smoothly with the SSH method now.

I have not tested PBS queue submission yet because the server is occupied.

Thanks and Happy Holidays,

Jian

sstagg

I'm having a similar problem using the SLURM scheduler. I am running cisTEM from the head node, my Manager command is $command, and my Command is srun -p backfill -n 1 -o /dev/null $command. It completes with no problems when I run 2D classification, but when I do Ab-Initio it hangs after a random number of iterations. The problem does not happen if I do local processing.
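
(For debugging, the same Command can also write each job's output to a file instead of /dev/null - a sketch, with a placeholder log directory that has to exist beforehand; %j expands to the SLURM job id:)

srun -p backfill -n 1 -o /path/to/cistem_logs/cistem_%j.out $command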

timgrant

How many processors are you running?  Do you get any error messages?  

Tim

sstagg

24 processors. No error messages. When I looked at the queue, there were jobs that were running but weren't proceeding.

timgrant

Hmm, that is weird that it happens with so few processors.

Does it stop during the connection generally?  i.e. when it freezes does the number connected in the bottom left say 24/24, or does it freeze before it reaches 24?

Tim

timgrant

Also, does it freeze on the first job "preparing stack" or later?

Thanks,

Tim

sstagg

Hi Tim. The GUI doesn't seem to be aware anything is wrong; the refinement just stops proceeding. When I then manually checked the SLURM queue, there were jobs that were just going and going. It's like those particular jobs got stuck in a loop or something. I've just been working on local processors lately because we were having other problems with cisTEM on SLURM too. I'm going to start a new SLURM thread. 
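
For anyone who hits the same thing, a rough sketch of how to see what a stuck job is actually doing (the job id and node name are placeholders):

squeue -u $USER                                  # list your running jobs and the nodes they are on
scontrol show job <jobid> | grep -i nodelist     # <jobid> is a placeholder; confirms which node the job is on
ssh <node> 'ps -u $USER -o pid,stat,etime,cmd'   # <node> is a placeholder; find the cisTEM process
# on that node, strace -p <pid> or cat /proc/<pid>/wchan shows whether the process is spinning,
# blocked on I/O, or waiting on a socket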
