SGE: can't even add processors #122

Open
dpo opened this issue Sep 1, 2019 · 6 comments

dpo commented Sep 1, 2019
julia> VERSION
v"1.1.0"

julia> using ClusterManagers#master

julia> ClusterManagers.addprocs_qrsh(5, queue="hs22")
Error launching workers
MethodError(iterate, (Process(`qrsh -q hs22 -V -N julia-26131 -now n cd /home/dorban '&&' /apps/local-fci/tools/julia-1.1.0/bin/julia --worker=zwJ987ih2wft8egg`, ProcessRunning),), 0x00000000000063e1)
0-element Array{Int64,1}

My cluster doesn't support qrsh. Submitting jobs from the command line works fine. Any ideas on how to get this package to work?
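
The MethodError(iterate, (Process(...),)) above indicates that something tried to iterate a spawned process object directly, which Base.Process does not support. A minimal sketch illustrating the difference (this only mirrors the failure mode; it is not the ClusterManagers source):

    # A Base.Process is not iterable; its output is read with eachline.
    # Iterating it directly raises the same MethodError seen in the report above.
    proc = open(`echo hello`)      # returns a Base.Process
    # collect(proc)                # MethodError: no method matching iterate(::Base.Process)
    for line in eachline(proc)     # supported way to stream the process output
        println(line)
    end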

dpo changed the title from "SGE: can't even add processor" to "SGE: can't even add processors" on Sep 1, 2019
dpo commented Sep 2, 2019

For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!

vchuravy commented Sep 2, 2019

I am not sure I understand this issue. You are running on an SGE cluster and you are trying to add processes with addprocs_qrsh? That won't work.

For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!

That sounds like a bug. Can you post your entire script? I haven't run on SGE in a while, but on SLURM there is salloc which you can use to allocate resources before you use srun. Maybe there is something similar for SGE?
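
For reference, the SLURM flow described above looks roughly like this (a sketch only; the partition name and time limit are placeholders, and the rough SGE equivalent of an interactive allocation would be qrsh/qlogin, which this cluster does not support):

    # Request the allocation first (e.g. `salloc -n 4 -p debug` in a shell),
    # then start workers inside it. "debug" and the time limit are placeholders.
    using Distributed, ClusterManagers
    addprocs_slurm(4, partition="debug", t="00:05:00")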

dpo commented Sep 2, 2019

Apologies, that was a cut and paste of the wrong piece of code (I tried qrsh following some comments found on this issue tracker). I use addprocs_sge, but it doesn't succeed consistently. My script is easy enough:

using ClusterManagers
ClusterManagers.addprocs_sge(28, queue="hs22")

There are indeed 28 compute nodes available, but addprocs_sge never returns. I'll look for something similar to salloc, thanks.
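
One way to separate a slow queue from a manager bug (an assumption about how to debug this, not ClusterManagers functionality) is to submit a trivial job to the same queue and watch how long it waits:

    # Submit a trivial binary job to the hs22 queue and inspect its state.
    # If it sits in the "qw" state for a long time, the hang is scheduling
    # latency on the cluster side, not the Julia side.
    jobid = readchomp(`qsub -terse -q hs22 -b y -j y echo hello`)
    println("submitted job ", jobid)
    run(`qstat`)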

vchuravy commented Sep 2, 2019

Yeah, so either you get stuck allocating forever or the time-out doesn't trigger. It looks like qsub doesn't have a time-out. You may want to instrument this code:

for i=1:np
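
A rough way to instrument that loop (a sketch; everything except the for i=1:np header is paraphrased, not copied from the package source):

    # Add timing and logging around each iteration so a hang can be pinned
    # to a specific worker launch or to the polling for its connection info.
    for i = 1:np
        t0 = time()
        @info "launching SGE worker $i of $np"
        # ... existing qsub submission and output-file polling for worker i ...
        @info "worker $i connected after $(round(time() - t0, digits=1)) s"
    end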

aquaresima commented Mar 19, 2020

Hi,
I'm encountering the same issue here.
Julia 1.3.1
SGE 8.1.8

Running:
addprocs_sge(4, queue=$ProvideByAdmin)

I get:
Error launching workers MethodError(iterate, (Base.ProcessChain(Base.Process[Process(echo 'cd /home/alequa/Documents/Research/phd_project/simulations/tripod && /home/alequa/Documents/Research/julia-1.3.1/bin/julia --worker=sFj9Hl4l94yYKPoz', ProcessExited(0)), Process(qsub -N julia-27738 -terse -j y -R y -t 1-1 -V -q single.q, ProcessRunning)], Base.DevNull(), Base.PipeEndpoint(RawFD(0x00000019) open, 0 bytes waiting), Base.DevNull()),), 0x0000000000006890) 0-element Array{Int64,1}

BUT the jobs start and I can see them in qstat:

1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 single.q@gridnode006.mpi.nl 1 1
1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 single.q@gridnode004.mpi.nl 1 2

Can you help?
Thanks,
Alessio
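
Since the array job does get scheduled even though addprocs errors out, the stray workers keep occupying slots; they can be checked and removed by hand (a sketch; the job id and user name come from the qstat output above):

    # The workers were launched but Julia never connected to them, so clean up
    # the leftover SGE array job manually.
    run(`qstat -u alequa`)   # confirm the julia-27738 array job is still listed
    run(`qdel 1552178`)      # delete it so it stops consuming queue slots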

juliohm commented Oct 6, 2020

These sporadic issues, where it sometimes works and sometimes doesn't, are quite annoying. I am experiencing something similar with the LSF manager. I will try to debug it in the coming days and will share here if I learn something that could be ported to the SGE manager.
