cistem on cluster (slurm, sbatch)
Dear all, I am trying to run cistem on our HPC. These are my settings in the run profile:
Manager Command:
ssh -f login01.hpc.rockefeller.internal 'nohup /rugpfs/fs0/ruit/scratch/sbgrid/programs/x86_64-linux/cistem/1.0.0-beta/bin/$command'
Gui Address & Controller Address: Automatic
Command:
sbatch --export=c='$command' /store01/home/jbarandun/cistem/slurm.sh
No. Copies: 2, Delay 100
slurm.sh:
#!/bin/bash
##
## specify queue
##SBATCH -p normal
## run time
#SBATCH -t 202:00:00
## number of nodes
#SBATCH -N 3
## number of cores
#SBATCH -n 72
## error and output files
#SBATCH -o cistem.out
#SBATCH -e cistem.err
## Job Name
#SBATCH -J cistem
$c
The jobs are submitted to the scheduler, run for 30 s or so, then crash without an error message. This is the output from cistem:
Res. limit for class #0 = 20.00
Res. limit for class #1 = 20.00
Res. limit for class #2 = 20.00
Launching Job...
(ssh -f login01.hpc.rockefeller.internal 'nohup /rugpfs/fs0/ruit/scratch/sbgrid/programs/x86_64-linux/cistem/1.0.0-beta/bin/cisTEM_job_control xx.xx.xx.xx,xx.xx.xx.xx,xx.xx.xx.xx 3004 6433352023334462')
Job Control : Executing 'sbatch --export=c='refine3d xx.xx.xx.xx,xx.xx.xx.xx,xx.xx.xx.xx 3005 6433352023334462' /store01/home/jbarandun/cistem/slurm.sh&' 2 times.
cistem.out is empty, cistem.err just contains one line:
Usage: refine3d [controller_address] [controller_port] [job_code]
I have already tried different numbers of nodes and different values for No. Copies.
Any idea what the problem could be?
Thanks a lot for the help,
Best
Jonas
No, I blanked out the IP addresses at the request of our IT department. Sorry for not mentioning this.
Hmm, the error suggests that the program is not being run with all 3 arguments.
I am not very familiar with running slurm jobs. Have you seen the following pages:
https://cistem.org/frequently-asked-questions#tab-1-3
and
https://cistem.org/documentation#tab-1-15
Tim
In the GUI, does it print xx.xx.xx for the addresses? That would imply that it cannot find the IP address of the machine.
Tim
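One possible explanation for the usage message, offered as a guess rather than a confirmed diagnosis: sbatch's --export option takes a comma-separated list of variables, and the controller address list passed to refine3d also contains commas. If sbatch splits the value at the first comma, $c would hold only "refine3d" plus the first address. The sketch below does not call Slurm; it only simulates that splitting in plain bash using the command from the log above. The job_script function is a hypothetical stand-in for slurm.sh, and the alternative it illustrates (passing the command as script arguments and ending slurm.sh with "$@") is an assumption that has not been tested with cisTEM.

```shell
#!/bin/bash
# Simulation only (no Slurm needed): split the --export value on commas,
# the separator sbatch documents for listing several variables.
export_value="c=refine3d xx.xx.xx.xx,xx.xx.xx.xx,xx.xx.xx.xx 3005 6433352023334462"
IFS=',' read -r -a fields <<< "$export_value"
c="${fields[0]#c=}"            # what the job script would then see as $c
echo "\$c = '$c'"

# Count the words that would reach refine3d (program name excluded).
read -r -a words <<< "$c"
n_args=$(( ${#words[@]} - 1 ))
echo "arguments passed to refine3d: $n_args (3 expected)"

# Hypothetical alternative: pass the command as script arguments instead,
#   sbatch /store01/home/jbarandun/cistem/slurm.sh $command
# and end slurm.sh with "$@" in place of $c.
job_script() { echo "received $# words: $*"; }   # stand-in for slurm.sh
job_script refine3d xx.xx.xx.xx,xx.xx.xx.xx,xx.xx.xx.xx 3005 6433352023334462
```

If this is what happens on the cluster, adding a line such as echo "$c" to slurm.sh just before $c should show the truncated command in cistem.out.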