
train_search on multi-gpus #37

Open

JaminFong opened this issue Aug 27, 2018 · 19 comments

@JaminFong

Hello, quark!
Thanks for your great work. When I tried to run your train_search job on multiple GPUs, the Variables alphas_normal and alphas_reduce caused errors.
The errors are as follows:

File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 111, in forward s0, s1 = s1, cell(s0, s1, weights) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in forward s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in <genexpr> s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in forward return sum(w * op(x) for w, op in zip(weights, self._ops)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in <genexpr> return sum(w * op(x) for w, op in zip(weights, self._ops)) RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:314
To debug the code, I tried removing 'w' (which comes from alphas_normal or alphas_reduce) in
return sum(w * op(x) for w, op in zip(weights, self._ops))
I have tried both PyTorch 0.3 and 0.4, but the problem remained.
Could you please tell me how to deal with multi-GPU training? And have you ever met a similar problem?
Best, and waiting for your reply!

@arunmallya

That's because the arch_parameters are not being copied onto every GPU. DataParallel only copies the parameters and buffers of a module to all GPUs. In the above code, the arch_parameters are Variables, so they do not get copied, hence the error. You can try making them parameters, but then you will have to override the parameters() function so that only weight parameters are returned, and not the arch parameters.

However, DataParallel will not give you any speedup in this case. It will in fact be very slow, because copying over the modules before every forward takes a lot of time. There are around 5,000 nested modules in the search network, whereas a large network like ResNet-101 has fewer than 400. This overhead will wipe out any possible benefit of data parallelization.
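A minimal sketch of that suggestion. SearchNetwork here is an illustrative stub, not code from the repo, and instead of overriding parameters() itself it adds a separate weight_parameters() helper to hand to the weight optimizer:

import torch
import torch.nn as nn

class SearchNetwork(nn.Module):
    def __init__(self, k, num_ops):
        super().__init__()
        # As nn.Parameters, the architecture weights are replicated onto
        # every GPU by DataParallel along with the ordinary parameters.
        self.alphas_normal = nn.Parameter(1e-3 * torch.randn(k, num_ops))
        self.alphas_reduce = nn.Parameter(1e-3 * torch.randn(k, num_ops))

    def arch_parameters(self):
        return [self.alphas_normal, self.alphas_reduce]

    def weight_parameters(self):
        # Return only the weight parameters, excluding the architecture
        # parameters, e.g. for torch.optim.SGD(model.weight_parameters(), ...)
        arch_ids = {id(p) for p in self.arch_parameters()}
        return [p for p in self.parameters() if id(p) not in arch_ids]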

@JaminFong
Author

@arunmallya Thanks for your reply. I'll give your suggestion a try.
As for the speed of DataParallel, it may not help when the network is tiny, but I want to apply DARTS to larger networks on larger datasets, and in that case a single GPU may not be able to handle the work.

@quark0
Owner
quark0 commented Sep 3, 2018

@arunmallya I agree with your points. Several people have asked about this, but I haven't had the chance to try it myself.

@JaminFong An alternative approach is to further reduce the batch size/number of channels during search, though this might lead to some additional discrepancies between search & evaluation.

@quark0 quark0 closed this as completed Sep 3, 2018
@quark0 quark0 reopened this Sep 3, 2018
@JaminFong
Author

@quark0 Yes, I tried reducing the number of layers or the image size to fit my experiment on one GPU. But I think that if we want to extend DARTS to larger-scale tasks, data parallelism may be necessary.
Best!

@VectorYoung

@JaminFong Hi, have you tried implementing it with multiple GPUs? I am also going to search on a large task, but one GPU is limiting.
Thanks.

@JaminFong
Author
JaminFong commented Feb 12, 2019

@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.

@QiuPaul
QiuPaul commented Feb 26, 2019

@JaminFong Hi, thanks for your work. But when I run on multiple GPUs, it fails with the error below. Have you met it before?

logits = self.model(input_valid)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 69, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 80, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 38, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 31, in scatter
    return scatter_map(inputs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 16, in scatter_map
    assert not torch.is_tensor(obj), "Tensors not supported in scatter."
AssertionError: Tensors not supported in scatter.

@JaminFong
Author

@QiuPaul Hi, are you using PyTorch 0.3?
If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

@QiuPaul
QiuPaul commented Mar 1, 2019

@QiuPaul Hi, are you using PyTorch 0.3?
If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

@JaminFong Yeah, thanks very much for your advice; it runs with PyTorch 1.0.
More importantly, I found that you made a modification in your code:

optimizer = torch.optim.SGD(
    weight_params,  # model.parameters(),
    args.learning_rate,
    momentum=args.momentum,
    weight_decay=args.weight_decay)

Did you also find the problem below? Thanks...
#75

In the paper:

while not converged do

  1. Update weights w by descending ∇_w L_train(w, α)
  2. Update architecture α by descending ∇_α L_val(w − ξ∇_w L_train(w, α), α)

This means that when the weights are updated, α is fixed. However, in the original code below, when momentum is used to update the weights, all parameters in model.parameters(), including the arch_parameters, would be updated. Waiting to be confirmed, thanks...

optimizer = torch.optim.SGD(
    model.parameters(),
    args.learning_rate,
    momentum=args.momentum,
    weight_decay=args.weight_decay)

@JaminFong
Author

@QiuPaul In the original code, the architecture parameters (alphas_normal and alphas_reduce) are not in model.parameters():

self.alphas_normal = Variable(1e-3*torch.randn(k, num_ops).cuda(), requires_grad=True)

Therefore, there is no need to filter the parameters in the original code. Please refer to how PyTorch registers module parameters.
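A minimal illustration of that registration rule (a hypothetical toy module, not code from the repo): only nn.Parameter attributes and registered buffers end up in model.parameters(); a plain tensor or Variable attribute does not.

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        # A raw tensor attribute with requires_grad=True is NOT registered
        # as a module parameter, so an optimizer built from
        # model.parameters() never updates it.
        self.alphas = torch.randn(4, 5, requires_grad=True)

m = Toy()
print([name for name, _ in m.named_parameters()])
# -> ['conv.weight', 'conv.bias']  (no 'alphas')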

@marsggbo

@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.

Thanks so much for your work. I have run your code, but judging from the result (figure below), there seems to be a problem.

I ran the code on a Titan GPU with a batch size of 64.

  • Fw means the forward time for each step
  • Bw means the backward time for each step
  • Up means the update time for each step, i.e. optimizer.step()
  • Arch means the time spent training the architecture for each step

[figure: per-step timing results]

The problem is that multiple GPUs run even slower than a single GPU.

The GPU utilization info is as follows:
[figures: GPU utilization screenshots]

@JaminFong
Author

@marsggbo When running on multiple GPUs, DataParallel in PyTorch takes considerable extra time to copy the model onto all of the target devices before every forward pass, especially since the number of modules in the DARTS search network is much larger than in normal networks. So when your batch size is small, putting the model on multiple GPUs may not speed up the run.
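A rough way to see this overhead; a sketch under assumptions (build_search_network is a hypothetical constructor, CIFAR-sized inputs):

import time
import torch
import torch.nn as nn

def time_forward(model, x, iters=10):
    # Synchronize so asynchronous CUDA kernels are finished before and
    # after timing, then average the wall-clock time per forward pass.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

model = build_search_network().cuda()  # hypothetical constructor
x = torch.randn(64, 3, 32, 32).cuda()

t_single = time_forward(model, x)
# nn.DataParallel re-replicates every submodule on each forward call,
# so for a module-heavy search network the copy time can dominate.
t_parallel = time_forward(nn.DataParallel(model), x)
print(t_single, t_parallel)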

@killawhale2

@JaminFong I've looked at your implementation and only found instructions for running the second-order version of the algorithm. Could you give the instructions for running just the first-order version?

@Margrate

@JaminFong I've looked at your implementation and only found instructions for running the second-order version of the algorithm. Could you give the instructions for running just the first-order version?

--unrolled False ?

@xjtuzll
xjtuzll commented Jul 11, 2019

@JaminFong I ran the multi-GPU code on a Titan GPU with a batch size of 64. It fails with the error below; have you met it before?

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Network:
Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

@JaminFong
Author

@JaminFong I ran the multi-GPU code on a Titan GPU with a batch size of 64. It fails with the error below; have you met it before?

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Network:
Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

When you load a model saved from a multi-GPU (DataParallel) run, the parameter keys may come prefixed with module. (e.g. module.stem.0.weight). You need to check the key names of the params dict.
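A common fix is to strip that prefix before loading. A minimal sketch, assuming a hypothetical checkpoint path and an already-constructed single-GPU model:

import torch

state_dict = torch.load('checkpoint.pt')  # hypothetical path
# nn.DataParallel prefixes every key with 'module.'; strip it so the
# keys match the plain single-GPU Network.
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)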

@Margrate

Can train.py run on multiple GPUs? Is drop_path not supported?

@giangtranml

The first-order approximation leads to worse performance compared with the second-order approximation.

@bitluozhuang
bitluozhuang commented Dec 26, 2019

I have implemented a distributed PC-DARTS for the first-order version: https://github.com/bitluozhuang/Distributed-PC-Darts. Welcome to try it.
