
train_search on multi-gpus #37

Open

JaminFong opened this issue Aug 27, 2018 · 19 comments

@JaminFong

Hello, quark!
Thanks for your great work. When I tried to run your train_search job on multiple GPUs, the Variables alphas_normal and alphas_reduce caused errors.
The errors are as follows:

File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 111, in forward s0, s1 = s1, cell(s0, s1, weights) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in forward s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in <genexpr> s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in forward return sum(w * op(x) for w, op in zip(weights, self._ops)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in <genexpr> return sum(w * op(x) for w, op in zip(weights, self._ops)) RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:314
To debug the code, I tried removing 'w' (which comes from alphas_normal or alphas_reduce) in
return sum(w * op(x) for w, op in zip(weights, self._ops))
I have tried both PyTorch 0.3 and 0.4, but the problem remained.
Could you please tell me how to deal with multi-GPU training? And have you ever met a similar problem?
Best, and waiting for your reply!

@arunmallya

That's because the arch_parameters are not being copied onto every GPU. DataParallel only copies the parameters and buffers of a module to all GPUs. In the above code, the arch_parameters are Variables, so they do not get copied, hence the error. You can try making them parameters, but then you will have to override the parameters() function so that only weight parameters are returned, and not the arch parameters.

However, DataParallel will not give you any speedup in this case. It will in fact be very slow, because copying over the modules before every forward takes a lot of time. There are around 5,000 nested modules in the search network, whereas a large network like ResNet-101 has fewer than 400. This overhead will wipe out any possible benefit of data parallelization.
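A minimal sketch of that suggestion. SearchNetwork here is an illustrative stub, not code from the repo, and instead of overriding parameters() itself it adds a separate weight_parameters() helper to hand to the weight optimizer:

import torch
import torch.nn as nn

class SearchNetwork(nn.Module):
    def __init__(self, k, num_ops):
        super().__init__()
        # As nn.Parameters, the architecture weights are replicated onto
        # every GPU by DataParallel along with the ordinary parameters.
        self.alphas_normal = nn.Parameter(1e-3 * torch.randn(k, num_ops))
        self.alphas_reduce = nn.Parameter(1e-3 * torch.randn(k, num_ops))

    def arch_parameters(self):
        return [self.alphas_normal, self.alphas_reduce]

    def weight_parameters(self):
        # Return only the weight parameters, excluding the architecture
        # parameters, e.g. for torch.optim.SGD(model.weight_parameters(), ...)
        arch_ids = {id(p) for p in self.arch_parameters()}
        return [p for p in self.parameters() if id(p) not in arch_ids]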

@JaminFong
Author

@arunmallya Thanks for your reply. I'll give your suggestion a try.
As for the speed of DataParallel, it may not help when the network is tiny, but I want to apply DARTS to larger networks on larger datasets, and in that case a single GPU may not be able to handle the work.

@quark0
Owner
quark0 commented Sep 3, 2018

@arunmallya I agree with your points. Several people have asked about this, but I haven't had the chance to try it myself.

@JaminFong An alternative approach is to further reduce the batch size/number of channels during search, though this might lead to some additional discrepancies between search & evaluation.

@quark0 quark0 closed this as completed Sep 3, 2018
@quark0 quark0 reopened this Sep 3, 2018
@JaminFong
Author

@quark0 Yes, I tried reducing the number of layers or the image size to fit my experiment on one GPU. But I think that if we want to extend DARTS to larger-scale tasks, data parallelism may be necessary.
Best!

@VectorYoung

@JaminFong Hi, have you tried implementing it with multiple GPUs? I am also going to search on a large task, but one GPU is limiting.
Thanks.

@JaminFong
Author
JaminFong commented Feb 12, 2019

@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.

@QiuPaul
QiuPaul commented Feb 26, 2019

@JaminFong Hi, thanks for your work. But when I run on multiple GPUs, it fails with the error below. Have you met it before?

logits = self.model(input_valid)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 69, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 80, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 38, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 31, in scatter
    return scatter_map(inputs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 16, in scatter_map
    assert not torch.is_tensor(obj), "Tensors not supported in scatter."
AssertionError: Tensors not supported in scatter.

@JaminFong
Author

@QiuPaul Hi, are you using PyTorch 0.3?
If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

@QiuPaul
QiuPaul commented Mar 1, 2019

@QiuPaul Hi, are you using PyTorch 0.3?
If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

@JaminFong Yeah, thanks very much for your advice; it runs with PyTorch 1.0.
More importantly, I found that you made a modification in your code:

optimizer = torch.optim.SGD(
    weight_params,  # model.parameters(),
    args.learning_rate,
    momentum=args.momentum,
    weight_decay=args.weight_decay)

Did you also find the problem below? Thanks...
#75

In the paper:

while not converged do

  1. Update weights w by descending ∇_w L_train(w, α)
  2. Update architecture α by descending ∇_α L_val(w − ξ∇_w L_train(w, α), α)

This means that when the weights are updated, α is fixed. However, in the original code below, when momentum is used to update the weights, all parameters in model.parameters(), including the arch_parameters, would be updated. Waiting to be confirmed, thanks...

optimizer = torch.optim.SGD(
    model.parameters(),
    args.learning_rate,
    momentum=args.momentum,
    weight_decay=args.weight_decay)

@JaminFong
Author

@QiuPaul In the original code, the architecture parameters (alphas_normal and alphas_reduce) are not in model.parameters():

self.alphas_normal = Variable(1e-3*torch.randn(k, num_ops).cuda(), requires_grad=True)

Therefore, there is no need to filter the parameters in the original code. Please refer to how PyTorch registers module parameters.
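A minimal illustration of that registration rule (a hypothetical toy module, not code from the repo): only nn.Parameter attributes and registered buffers end up in model.parameters(); a plain tensor or Variable attribute does not.

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        # A raw tensor attribute with requires_grad=True is NOT registered
        # as a module parameter, so an optimizer built from
        # model.parameters() never updates it.
        self.alphas = torch.randn(4, 5, requires_grad=True)

m = Toy()
print([name for name, _ in m.named_parameters()])
# -> ['conv.weight', 'conv.bias']  (no 'alphas')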

@marsggbo

@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.

Thanks so much for your work. I have run your code, but judging from the result (figure below), there seems to be a problem.

I ran the code on a Titan GPU with a batch size of 64.

  • Fw means the forward time for each step
  • Bw means the backward time for each step
  • Up means the update time for each step, i.e. optimizer.step()
  • Arch means the time spent training the architecture for each step

[figure: per-step timing results]

The problem is that multiple GPUs run even slower than a single GPU.

The GPU utilization info is as follows:
[figures: GPU utilization screenshots]

@JaminFong
Author

@marsggbo When running on multiple GPUs, DataParallel in PyTorch takes considerable extra time to copy the model onto all of the target devices before every forward pass, especially since the number of modules in the DARTS search network is much larger than in normal networks. So when your batch size is small, putting the model on multiple GPUs may not speed up the run.
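A rough way to see this overhead; a sketch under assumptions (build_search_network is a hypothetical constructor, CIFAR-sized inputs):

import time
import torch
import torch.nn as nn

def time_forward(model, x, iters=10):
    # Synchronize so asynchronous CUDA kernels are finished before and
    # after timing, then average the wall-clock time per forward pass.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

model = build_search_network().cuda()  # hypothetical constructor
x = torch.randn(64, 3, 32, 32).cuda()

t_single = time_forward(model, x)
# nn.DataParallel re-replicates every submodule on each forward call,
# so for a module-heavy search network the copy time can dominate.
t_parallel = time_forward(nn.DataParallel(model), x)
print(t_single, t_parallel)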

@killawhale2

@JaminFong I've looked at your implementation and only found instructions for running the second-order version of the algorithm. Could you give the instructions for running just the first-order version?

@Margrate

@JaminFong I've looked at your implementation and only found instructions for running the second-order version of the algorithm. Could you give the instructions for running just the first-order version?

--unrolled False ?

@xjtuzll
xjtuzll commented Jul 11, 2019

@JaminFong I ran the multi-GPU code on a Titan GPU with a batch size of 64. It fails with the error below; have you met it before?

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Network:
Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

@JaminFong
Author

@JaminFong I ran the multi-GPU code on a Titan GPU with a batch size of 64. It fails with the error below; have you met it before?

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Network:
Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

When you load a model saved from a multi-GPU (DataParallel) run, the parameter keys may come prefixed with module. (e.g. module.stem.0.weight). You need to check the key names of the params dict.
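A common fix is to strip that prefix before loading. A minimal sketch, assuming a hypothetical checkpoint path and an already-constructed single-GPU model:

import torch

state_dict = torch.load('checkpoint.pt')  # hypothetical path
# nn.DataParallel prefixes every key with 'module.'; strip it so the
# keys match the plain single-GPU Network.
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)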

@Margrate

Can train.py run on multiple GPUs? Is drop_path not supported?

@giangtranml

The first-order approximation leads to worse performance compared with the second-order approximation.

@bitluozhuang
bitluozhuang commented Dec 26, 2019

I have implemented a distributed PC-DARTS for the first-order version: https://github.com/bitluozhuang/Distributed-PC-Darts. Welcome to try it.
