train_search on multi-gpus #37
That's because the … However, DataParallel will not give you any speed up in this case. It will in fact be very slow.
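For readers unfamiliar with the setup being discussed, here is a minimal sketch of what wrapping the search model in DataParallel looks like. The `Network` signature and argument values follow the original model_search.py / train_search.py, but treat them as assumptions, and `input_valid` is a dummy batch created only for illustration:

```python
import torch
import torch.nn as nn
from model_search import Network  # search network from the original repo (assumed import path)

# Illustrative setup: C=16, num_classes=10, layers=8, as in the original train_search.py.
criterion = nn.CrossEntropyLoss().cuda()
model = Network(16, 10, 8, criterion).cuda()
model = nn.DataParallel(model)  # replicate the search model over all visible GPUs

# Dummy CIFAR-sized batch, standing in for the real validation batch.
input_valid = torch.randn(64, 3, 32, 32).cuda()
logits = model(input_valid)  # the call the thread refers to, e.g. logits = self.model(input_valid)
```

As the comment above notes, for a small search model the replicate/scatter/gather overhead of DataParallel can dominate, so this wrapping may actually run slower than a single GPU.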
@arunmallya Thanks for your reply. I'll give it a try as you suggested.
@arunmallya I agree with your points. Several people asked about this, but I haven't had the chance to try it myself. @JaminFong An alternative approach is to further reduce the batch size or number of channels during search, though this might lead to some additional discrepancies between search and evaluation.
@quark0 Yes, I tried reducing the number of layers or the image size to fit my experiment on one GPU. But I think if we want to extend DARTS to larger-scale tasks, data parallelism may be necessary.
@JaminFong Hi, have you tried implementing it with multiple GPUs? I am also going to search on a large task, but one GPU is not enough.
@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.
@JaminFong Hi, thanks for your work. But when I run with multiple GPUs, it fails with the error below; have you seen it before? logits = self.model(input_valid)
@QiuPaul Hi, are you using PyTorch 0.3?
@JaminFong Yeah, thanks very much for your advice; it runs with PyTorch 1.0. Have you also seen the problem below? Thanks.
@QiuPaul In the original code, the architecture parameters (Line 123 in f276dd3) are not registered as module parameters, so they are not returned by model.parameters(). Therefore, there is no need to filter the parameters in the original code. Please refer to how PyTorch registers and collects parameters.
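For context, a minimal sketch of the pattern the original model_search.py uses (paraphrased; shapes and names here are assumptions): the alphas are plain requires_grad tensors rather than nn.Parameter, so they never appear in model.parameters(), and the weight optimizer and architecture optimizer see disjoint variable sets without any filtering.

```python
import torch

# Illustrative sizes: 14 mixed edges per cell, 8 candidate operations.
k, num_ops = 14, 8

# Plain leaf tensors with requires_grad, NOT nn.Parameter, so assigning them to
# the model does not register them and model.parameters() excludes them.
alphas_normal = (1e-3 * torch.randn(k, num_ops)).requires_grad_()
alphas_reduce = (1e-3 * torch.randn(k, num_ops)).requires_grad_()
arch_parameters = [alphas_normal, alphas_reduce]

# Hypothetical optimizer setup mirroring the original training loop:
# weight_optimizer = torch.optim.SGD(model.parameters(), lr=0.025)  # network weights only
# arch_optimizer   = torch.optim.Adam(arch_parameters, lr=3e-4)     # alphas only
```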
Thanks so much for your work. I have run your code, but the result I get (figure below) suggests a problem. I ran the code on Titan GPUs with a batch size of 64.
The problem is that multiple GPUs run even slower than a single GPU.
@marsggbo When using multi-GPU running, …
@JaminFong I've looked at your implementation and only found instructions for running the second-order version of the algorithm. Could you specify the instructions for running just the first-order version?
--unrolled False?
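(For reference: if the fork follows the original train_search.py, `--unrolled` is an argparse `store_true` flag, so first-order search is what you get by simply omitting the flag, e.g. `python train_search.py`, while adding `--unrolled` enables the second-order, unrolled update; a fork may instead parse an explicit boolean as suggested above.)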
@JaminFong I ran the multi-GPU code on Titan GPUs with a batch size of 64. It fails with the error below; have you encountered it before? File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
When you load a model saved from a multi-GPU run (or data_parallel ...), the params may come as …
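A minimal sketch of the usual fix (not from this thread; the file name is hypothetical and `model` stands for the unwrapped network): checkpoints saved from an nn.DataParallel-wrapped model have state_dict keys prefixed with "module.", which makes load_state_dict fail on a plain model, so the prefix is stripped before loading.

```python
import torch

# Load a checkpoint that was saved from an nn.DataParallel-wrapped model.
state_dict = torch.load("weights.pt", map_location="cpu")

# Strip the "module." prefix that DataParallel adds to every parameter key.
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}

model.load_state_dict(state_dict)  # `model` is the plain (unwrapped) network
```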
Can train.py run on multiple GPUs? Is drop_path not supported?
The first-order approximation leads to worse performance compared with the second-order approximation.
I have implemented a distributed PC-DARTS for the first-order version: https://github.com/bitluozhuang/Distributed-PC-Darts. Welcome to try it.
Hello, quark!
Thanks for your great work. When I tried to run your train_search job on multiple GPUs, the Variables alphas_normal and alphas_reduce caused errors.
The errors are shown as follows:
File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 111, in forward s0, s1 = s1, cell(s0, s1, weights) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in forward s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in <genexpr> s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in forward return sum(w * op(x) for w, op in zip(weights, self._ops)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in <genexpr> return sum(w * op(x) for w, op in zip(weights, self._ops)) RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:314
To debug the code, I tried removing 'w' (which comes from alphas_normal or alphas_reduce) in
return sum(w * op(x) for w, op in zip(weights, self._ops))
I have tried both PyTorch 0.3 and 0.4, but the problem remains.
Could you please tell me how to make multi-GPU training work? Have you ever encountered a similar problem?
Best and waiting for your reply!
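For readers who hit the same RuntimeError: under nn.DataParallel the inputs are scattered to each replica's GPU, but alphas_normal and alphas_reduce are plain Variables that stay on a single device, so `w * op(x)` in MixedOp.forward mixes tensors on different GPUs. A heavily hedged sketch of one possible workaround (not the repository's official fix, and untested here) is to move the weights onto the input's device before the weighted sum:

```python
import torch.nn as nn

class MixedOp(nn.Module):
    # self._ops is assumed to be the nn.ModuleList of candidate operations
    # built in __init__ exactly as in the original model_search.py (omitted here).
    def forward(self, x, weights):
        # Under DataParallel, x lives on this replica's GPU while the softmaxed
        # alphas may still live on the default device, which is what triggers
        # "arguments are located on different GPUs". Aligning the weights with
        # x's device avoids the mismatch (requires the PyTorch >= 0.4 .to()/.device API).
        weights = weights.to(x.device)
        return sum(w * op(x) for w, op in zip(weights, self._ops))
```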