SGE: can't even add processors #122

Open
dpo opened this issue Sep 1, 2019 · 6 comments

dpo commented Sep 1, 2019
julia> VERSION
v"1.1.0"

julia> using ClusterManagers#master

julia> ClusterManagers.addprocs_qrsh(5, queue="hs22")
Error launching workers
MethodError(iterate, (Process(`qrsh -q hs22 -V -N julia-26131 -now n cd /home/dorban '&&' /apps/local-fci/tools/julia-1.1.0/bin/julia --worker=zwJ987ih2wft8egg`, ProcessRunning),), 0x00000000000063e1)
0-element Array{Int64,1}

My cluster doesn't support qrsh. Submitting jobs from the command line works fine. Any ideas on how to get this package to work?
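
The MethodError(iterate, (Process(...),)) above indicates that something tried to iterate a spawned process object directly, which Base.Process does not support. A minimal sketch illustrating the difference (this only mirrors the failure mode; it is not the ClusterManagers source):

    # A Base.Process is not iterable; its output is read with eachline.
    # Iterating it directly raises the same MethodError seen in the report above.
    proc = open(`echo hello`)      # returns a Base.Process
    # collect(proc)                # MethodError: no method matching iterate(::Base.Process)
    for line in eachline(proc)     # supported way to stream the process output
        println(line)
    end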

dpo changed the title from "SGE: can't even add processor" to "SGE: can't even add processors" on Sep 1, 2019
dpo commented Sep 2, 2019

For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!

vchuravy commented Sep 2, 2019

I am not sure I understand this issue. You are running on an SGE cluster and you are trying to add processes with addprocs_qrsh? That won't work.

For some reason, ClusterManagers.addprocs_sge sometimes succeeds. I tried to add 28 workers, and it's been printing dots for over an hour now. Is this expected?!

That sounds like a bug. Can you post your entire script? I haven't run on SGE in a while, but on SLURM there is salloc which you can use to allocate resources before you use srun. Maybe there is something similar for SGE?
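
For reference, the SLURM flow described above looks roughly like this (a sketch only; the partition name and time limit are placeholders, and the rough SGE equivalent of an interactive allocation would be qrsh/qlogin, which this cluster does not support):

    # Request the allocation first (e.g. `salloc -n 4 -p debug` in a shell),
    # then start workers inside it. "debug" and the time limit are placeholders.
    using Distributed, ClusterManagers
    addprocs_slurm(4, partition="debug", t="00:05:00")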

dpo commented Sep 2, 2019

Apologies, that was a cut and paste of the wrong piece of code (I tried qrsh following some comments found on this issue tracker). I use addprocs_sge, but it doesn't succeed consistently. My script is easy enough:

using ClusterManagers
ClusterManagers.addprocs_sge(28, queue="hs22")

There are indeed 28 compute nodes available, but addprocs_sge never returns. I'll look for something similar to salloc, thanks.
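
One way to separate a slow queue from a manager bug (an assumption about how to debug this, not ClusterManagers functionality) is to submit a trivial job to the same queue and watch how long it waits:

    # Submit a trivial binary job to the hs22 queue and inspect its state.
    # If it sits in the "qw" state for a long time, the hang is scheduling
    # latency on the cluster side, not the Julia side.
    jobid = readchomp(`qsub -terse -q hs22 -b y -j y echo hello`)
    println("submitted job ", jobid)
    run(`qstat`)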

vchuravy commented Sep 2, 2019

Yeah, so either you get stuck allocating forever or the time-out doesn't trigger. It looks like qsub doesn't have a time-out. You may want to instrument this code:

for i=1:np
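
A rough way to instrument that loop (a sketch; everything except the for i=1:np header is paraphrased, not copied from the package source):

    # Add timing and logging around each iteration so a hang can be pinned
    # to a specific worker launch or to the polling for its connection info.
    for i = 1:np
        t0 = time()
        @info "launching SGE worker $i of $np"
        # ... existing qsub submission and output-file polling for worker i ...
        @info "worker $i connected after $(round(time() - t0, digits=1)) s"
    end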

aquaresima commented Mar 19, 2020

Hi,
I'm encountering the same issue here.
Julia 1.3.1
SGE 8.1.8

Running:
addprocs_sge(4, queue=$ProvideByAdmin)

I get:
Error launching workers MethodError(iterate, (Base.ProcessChain(Base.Process[Process(echo 'cd /home/alequa/Documents/Research/phd_project/simulations/tripod && /home/alequa/Documents/Research/julia-1.3.1/bin/julia --worker=sFj9Hl4l94yYKPoz', ProcessExited(0)), Process(qsub -N julia-27738 -terse -j y -R y -t 1-1 -V -q single.q, ProcessRunning)], Base.DevNull(), Base.PipeEndpoint(RawFD(0x00000019) open, 0 bytes waiting), Base.DevNull()),), 0x0000000000006890) 0-element Array{Int64,1}

BUT the jobs start and I can see them in qstat:

1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 single.q@gridnode006.mpi.nl 1 1
1552178 0.55500 julia-2773 alequa r 03/19/2020 18:56:59 single.q@gridnode004.mpi.nl 1 2

Can you help?
Thanks,
Alessio
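
Since the array job does get scheduled even though addprocs errors out, the stray workers keep occupying slots; they can be checked and removed by hand (a sketch; the job id and user name come from the qstat output above):

    # The workers were launched but Julia never connected to them, so clean up
    # the leftover SGE array job manually.
    run(`qstat -u alequa`)   # confirm the julia-27738 array job is still listed
    run(`qdel 1552178`)      # delete it so it stops consuming queue slots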

juliohm commented Oct 6, 2020

These sporadic issues, where it sometimes works and sometimes doesn't, are quite annoying. I am experiencing something similar with the LSF manager. I will try to debug it in the coming days and will share here if I learn something that could be ported to the SGE manager.
