PBS job hanging

jamonshi

PBS job hanging

I am trying to run cisTEM on our cluster with a job scheduler. Our IT manager set up the profile; however, I always get some jobs hanging in the E state forever, and the master job simply waits for them. Manually deleting these jobs does not work either. Any suggestions would be appreciated.

/apps/cistem/1.0.0-beta/$command

qsub -N cisTEM -j oe -- /apps/cistem/1.0.0-beta/$command

qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
295979.kylin      cisTEM           jianshi           00:00:04 E daily           
295989.kylin      cisTEM           jianshi           00:00:04 E daily           
296001.kylin      cisTEM           jianshi           00:00:04 E daily           
296004.kylin      cisTEM           jianshi           00:00:04 E daily           
296025.kylin      cisTEM           jianshi           00:00:04 E daily           
296028.kylin      cisTEM           jianshi           00:00:04 E daily           
296029.kylin      cisTEM           jianshi           00:00:04 E daily           
296031.kylin      cisTEM           jianshi           00:00:04 E daily           
296071.kylin      cisTEM           jianshi           00:00:00 E daily           
296135.kylin      cisTEM           jianshi           00:00:00 E daily
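
(For reference, a few commands that are typically useful for inspecting or force-removing a job stuck in E; which qdel option applies depends on whether the scheduler is PBS Professional or Torque, and the job ID below is just the first one from the listing above.)

qstat -f 295979.kylin        # full details for the stuck job
tracejob 295979.kylin        # log entries for this job (needs access to the PBS logs)
qdel -W force 295979.kylin   # PBS Pro: force-delete a job stuck in E even if its node is unreachable
qdel -p 295979.kylin         # Torque: purge the job record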

Sun, 12/17/2017 - 03:03

timgrant

A few questions :-

1. How many jobs are you submitting at once? Do some of the jobs connect? 

2. Do jobs ever run successfully?

3. Do you see red error messages in the gui?

4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?

E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter, though.
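
(For reference: with PBS Professional, -- tells qsub to treat the rest of the command line as an executable and its arguments rather than as a job script, so it should be harmless here. A minimal illustration of the two submission forms, with placeholder commands:)

qsub -N test -j oe -- /bin/hostname    # '--' form: submit an executable directly (PBS Pro)
qsub -N test -j oe myscript.sh         # traditional form: submit a job script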

Cheers,

Tim

 

I submit 112 jobs. All jobs connect, and a few rounds of iterations complete successfully. The point at which the error appears is rather random.

3. Do you see red error messages in the gui?

Sometimes there is a red error message, as follows; sometimes it just quietly waits forever.

All slaves have re-connected to the master.
Error: Input parameter file /scratch/emlab/jianshi/betaG/120kxBin2/Assets/Parameters/startup_input_par_15_1.par not found

4. Is the path of the project accessible in both your gui and the cluster nodes with exactly the same path?

Yes, it's accessible to all nodes.

E is supposed to mean that the job has finished and is exiting; I am not sure why it would get stuck there. Also, in your qsub command you have a --; does that do anything? It may well not matter, though.

It's pretty much down to our cluster and job scheduler settings, but I have no clue where to start debugging.

 

Thanks,

Jian

It's GPFS. But it doesn't make sense to me, since many successful iterations have already completed. The cluster is a simple one: a head node and eight compute nodes. I have logged in to each node individually and confirmed that the directory is accessible.

To me, it seems that some jobs hang without writing out the output par files, yet still signal the job controller to move on to the next step.
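
(A quick way to rule out per-node filesystem problems is to check that the project directory is visible from every compute node; the node names below are illustrative, and the path is the project directory from the error message above.)

for node in node01 node02 node03 node04 node05 node06 node07 node08; do   # replace with the real node names
    echo "== $node =="
    ssh "$node" "ls -d /scratch/emlab/jianshi/betaG/120kxBin2 || echo 'path not visible'"
done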

Thu, 12/21/2017 - 06:09

jamonshi

Indeed, I found out today that one compute node had actually rebooted itself and did not remount NFS, so it couldn't access the shared files.

Cheers,

Jian

Thu, 12/21/2017 - 15:58

timgrant

Hi Jian,

So everything is working now?

Cheers,

Tim

Tue, 01/09/2018 - 21:08

sstagg

I'm having a similar problem using the SLURM scheduler. I am running cisTEM from the headnode, my Manager command is $command, and my Command command is srun -p backfill -n 1 -o /dev/null $command. It completes with no problems when I run 2D classification, but when I do Ab-Initio, it hangs after a random number of iterations. The problem does not happen if I do local processing.
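
(When a refinement stalls like this, it can help to check what the scheduler thinks the remaining steps are doing; the job ID below is a placeholder.)

squeue -u "$USER" -o "%.12i %.9P %.12j %.2t %.10M %R"   # job ID, partition, name, state, elapsed time, reason
scontrol show job 123456                                # full details for one stuck job (placeholder ID)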

Wed, 01/10/2018 - 01:17

sstagg

24 processors. No error messages. When I looked at the queue, there were jobs that were running but weren't proceeding.

Thu, 01/11/2018 - 16:58

timgrant

Hmm, that is weird that it happens with so few processors.

Does it stop during the connection generally?  i.e. when it freezes does the number connected in the bottom left say 24/24, or does it freeze before it reaches 24?

Tim

Sun, 01/14/2018 - 16:14

sstagg

Hi Tim. The GUI doesn't seem to be aware anything is wrong; the refinement just stops proceeding. When I then manually checked the SLURM queue, there were jobs that were just going and going. It's like those particular jobs got stuck in a loop or something. I've just been working on local processors lately because we were having other problems with cisTEM on SLURM too. I'm going to start a new SLURM thread.
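
(If it happens again, one way to see where a stuck task is spinning is to grab a stack trace of the worker process on its compute node; the PID is a placeholder and gdb has to be available on that node.)

ps -u "$USER" -o pid,etime,cmd                   # find the PID of the stuck cisTEM worker on that node
gdb -p 12345 -batch -ex "thread apply all bt"    # dump all thread stacks (12345 is a placeholder PID)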