Hi Tim et al,
I'm running into an issue with a heavily used SGE queue. Say I have cisTEM launch 100 copies with a hard runtime limit of 1 hour. If the queue is empty, they all run, and 1 hour is more than enough time for the iteration to complete. On the other hand, if the queue is nearly full, my first few jobs will run, but the rest may not start until more than 1 hour has passed. In that case, my early jobs hit the runtime limit and die. AFAICT, the current behavior is that cisTEM then idles forever with no error output, and all the jobs are removed from the queue.
In other words, the first queued job has to be able to run for at least as long as it takes the last job to get through the queue and start (and maybe for it to finish). The workaround is to request spuriously high run time limits for the jobs, but doing so also decreases their queue priority.
It would be great if there were a way for cisTEM to ignore the loss of a worker job (and whatever incomplete work chunk it had) and wait for others to become available, so we could "queue and forget" with shorter run time limits. Or maybe I'm missing something?
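
To make the request concrete, here is a rough sketch of the behavior I have in mind. It is plain Python and has nothing to do with cisTEM's actual code: the function name, the event names ("start", "done", "died") and the chunk/worker ids are all made up for illustration. The idea is only that a chunk held by a dead worker goes back on the pending list and is handed to whichever worker becomes available next.

```python
from collections import deque


def run_with_requeue(chunks, events):
    """Toy controller that tolerates the loss of worker jobs.

    chunks: list of work-chunk ids (hypothetical).
    events: (worker_id, event) tuples in time order, where event is
            "start" (job left the queue), "done" (finished its chunk)
            or "died" (e.g. killed at the runtime limit).
    Returns (completed chunks in completion order, chunks still pending).
    """
    pending = deque(chunks)
    in_flight = {}   # worker_id -> chunk it is currently processing
    idle = []        # live workers waiting for a chunk
    completed = []

    def dispatch():
        # Give pending chunks to idle workers, oldest worker first.
        while pending and idle:
            in_flight[idle.pop(0)] = pending.popleft()

    for worker, event in events:
        if event == "start":
            idle.append(worker)
        elif event == "done":
            chunk = in_flight.pop(worker, None)
            if chunk is not None:
                completed.append(chunk)
                idle.append(worker)  # worker is free for another chunk
        elif event == "died":
            chunk = in_flight.pop(worker, None)
            if chunk is not None:
                # Put the unfinished chunk back at the front of the list.
                pending.appendleft(chunk)
            elif worker in idle:
                idle.remove(worker)
        dispatch()

    return completed, list(pending)


if __name__ == "__main__":
    events = [
        ("w1", "start"),  # w1 takes chunk "a"
        ("w2", "start"),  # w2 takes chunk "b"
        ("w1", "died"),   # w1 hits its limit; "a" is requeued
        ("w2", "done"),   # w2 finishes "b", then picks up "a"
        ("w3", "start"),  # a late job finally starts and takes "c"
        ("w2", "done"),
        ("w3", "done"),
    ]
    print(run_with_requeue(["a", "b", "c"], events))
```

In this toy run the early worker dies, but nothing is lost: its chunk is simply finished by another worker, and the late-starting job still gets work when it finally leaves the queue.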