
resnet 50 #95

Open · olimastro wants to merge 4 commits into master
Conversation

@olimastro (Contributor) commented Jun 7, 2017

Regarding issue #94.
For now I have only run a trial on synthetic data. I am reporting the time for one epoch of training, which is 1000 data points:

1GPU batch size 80 (all that could fit on the DGX): 19.715s
2GPU batch size 40: 13.452s
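For reference, the 2-GPU speed-up implied by these times works out as follows:

# Implied 2-GPU speed-up from the epoch times above (ideal would be ~2x).
print(19.715 / 13.452)   # ~1.47x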

@nouiz (Contributor) left a comment

The speed-up seems very bad. We will need to investigate this more. I'm not sure you are timing only the training part; I think your current timing also includes the data transfer time. So I think we should add timing just for the computation part (and keep the current one, because if this is due to the data transfer, we need to rework that part).
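A minimal sketch of the kind of breakdown being asked for; f_train (the compiled Theano training function), get_minibatch() and n_batches are placeholders standing in for the worker script's own objects, not code from this PR:

import time

# Time the Theano call separately from data preparation; the difference
# between the epoch time and the accumulated compute time approximates the
# data loading / transfer overhead.
compute_time = 0.0
t_epoch = time.time()
for _ in range(n_batches):
    x, y = get_minibatch()              # host-side data preparation
    t0 = time.time()
    cost = f_train(x, y)                # compute (plus the implicit host->GPU copy)
    compute_time += time.time() - t0
epoch_time = time.time() - t_epoch
print("compute %.3fs / epoch %.3fs / transfer+overhead %.3fs"
      % (compute_time, epoch_time, epoch_time - compute_time))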

def resnet_control(saveFreq=1110, saveto=None):
    parser = Controller.default_parser()
    parser.add_argument('--seed', default=1234, type=int,
                        required=False, help='Maximum mini-batches to train upon in total.')
Contributor:

Update the help; this string doesn't describe the --seed argument.
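A possible fix; the help wording here is only a suggestion, not taken from the PR:

parser.add_argument('--seed', default=1234, type=int, required=False,
                    help='Seed for the random number generator.')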

from platoon.channel import Controller


class ResNetController(Controller):
Contributor:

Why can't you reuse the default controller? I don't like the idea of asking every user to write their own controller. If this example needs a new controller, do we need to update the Controller class to meet the current need?

@olimastro (author) commented Jun 7, 2017

I mostly copied everything from the lstm example and worked from there. You are right, though: if I diff lstm_controller and resnet_controller, there are no differences except the names (the time request is just for the timing and won't end up in the final release). There could be a single general controller used by both examples. I do have to implement the handle_control method and subclass the general Controller class.
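As a rough illustration of what such a shared controller might look like, here is a sketch that assumes only what is mentioned in this thread (Platoon's Controller class, its default_parser(), and an overridable handle_control method); the request/response strings and the bookkeeping are invented for the example:

from platoon.channel import Controller


class GeneralController(Controller):
    """Hypothetical controller shared by the lstm and resnet examples."""

    def __init__(self, max_mb, **kwargs):
        super(GeneralController, self).__init__(**kwargs)
        self.max_mb = max_mb       # maximum number of mini-batches to train on
        self.mb_done = 0

    def handle_control(self, req, worker_id, req_info):
        # A worker sends a control request; the strings used here
        # ('next', 'train', 'stop', 'done') are illustrative only.
        if req == 'next':
            return 'train' if self.mb_done < self.max_mb else 'stop'
        if req == 'done':
            self.mb_done += req_info.get('num_minibatches', 1)
            return 'ack'
        return None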

Contributor:

For now, just reuse the LSTMController directly. We can see later what we do about a general controller.

    parser.add_argument('--seed', default=1234, type=int,
                        required=False, help='Maximum mini-batches to train upon in total.')
    parser.add_argument('--patience', default=10, type=int, required=False,
                        help='Maximum patience when failing to get better validation results.')
Contributor:

I suppose it is in mini-batch units. Can you specify that?

Contributor (author):

In the original lstm example it is actually in validation-epoch units; since validation is not done after every training epoch, this number should not be too large. I could indeed make the help more useful.

@olimastro (author) commented:

When you say timing after one computation pass, do you want me to take into account the overhead of calling ASGD()? If yes, I can make one epoch contain the same amount of data as the batch size and reuse the same code; if not, I can just time the difference between a single call with batch size 80 and a single call with batch size 40.

@olimastro (author) commented:

By the way, I should mention that I used THEANO_FLAGS=device=cuda,floatX=float32,dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95 and I updated Theano on the 5th of June.

@nouiz (Contributor) commented Jun 7, 2017

Remove the dnn.conv.algo_* flags from the timing runs. As these do their own algorithm timing on the first call, they could bias the measurement. We don't care if we don't have the fastest run, but we should not bias the timing.

For the timing, we need to know the overhead time vs. the compute time vs. the ASGD time.

We had in mind to use a synchronous update: mostly, it just splits the minibatch across the GPUs and syncs the gradients. So you should also change that.
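As a purely mathematical illustration of this synchronous update (numpy only, not Platoon code): each GPU computes a gradient on its shard of the minibatch, the gradients are summed and averaged across workers, and every worker applies the same step, so all copies of the parameters stay identical.

import numpy as np

def synchronous_step(params, per_worker_grads, lr=0.01):
    # What an all-reduce sum followed by a divide gives every worker.
    avg_grad = sum(per_worker_grads) / len(per_worker_grads)
    return params - lr * avg_grad

params = np.zeros(4)
grads = [np.array([1., 2., 3., 4.]), np.array([3., 2., 1., 0.])]
print(synchronous_step(params, grads))   # identical result on every worker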

@lamblin (Contributor) commented Jun 7, 2017

The overall metric we are competing for is images / sec. While it is useful to time the different parts to see where we can improve things, this is the final metric we should report.

Regarding cuDNN options, again, if we want to correctly identify bottlenecks, I think we should use time_once. Just do one dummy call before starting training (and profiling) so that the right algorithm gets selected.

When we want to check for correctness, it is OK to split a batch between GPUs, and check if the synchronous update is consistent with what happens on only 1 GPU. However, for the final result, we want to have the full batch size on all GPUs, and report that.
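For example, using the epoch times reported earlier in this thread (1000 synthetic data points per epoch), the images/sec numbers work out to roughly:

# Images/sec from the epoch times reported above.
print(1000 / 19.715)   # 1 GPU,  batch size 80 -> ~50.7 images/s
print(1000 / 13.452)   # 2 GPUs, batch size 40 -> ~74.3 images/s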


print("Using all_reduce worker's interface!")
asgd = AverageSGD(worker)
asgd.make_rule(params)

Can you please add other update algorithms, like EASGD and Downpour, around here, so we can see the differences?
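For reference, a numpy sketch of the elastic-averaging (EASGD) rule such a variant would implement; wiring it into Platoon's worker interface is not shown here, since that API is not part of this thread:

import numpy as np

def easgd_sync(x_worker, x_center, alpha=0.5):
    # Worker parameters and central parameters are pulled toward each other
    # by an elasticity term alpha (the "elastic averaging" step).
    diff = x_worker - x_center
    return x_worker - alpha * diff, x_center + alpha * diff

xw, xc = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(easgd_sync(xw, xc))   # worker moves toward the center, center toward the worker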

@borisfom commented:

Can you please post the exact command lines you have used? I keep getting OOM errors on startup if I preallocate anywhere from 0.95 down to 0.5. If I do not preallocate, the errors are:
WARNING! Failed to register in a local GPU comm world.
Reason: 'utf8' codec can't decode byte 0xa5 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
Traceback (most recent call last):
  File "resnet_worker.py", line 566, in <module>
    train_resnet()
  File "resnet_worker.py", line 475, in train_resnet
    asgd.make_rule(params)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/training/global_dynamics.py", line 161, in make_rule
    gup = AllReduceSum(update, inplace=True)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 155, in AllReduceSum
    return AllReduce(theano.scalar.add, inplace, worker)(src, dest)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 61, in __init__
    self._f16_ok = not self.worker._multinode
AttributeError: 'Worker' object has no attribute '_multinode'

@cshanbo commented Jun 27, 2017

Hi @borisfom
It seems to be an installation error. You could try the solution under this issue to see if it works.

@borisfom commented:

@cshanbo: I have recompiled and reinstalled; still the same issue. Could it be an NCCL 2.0 incompatibility?
How exactly did you run the benchmark?

@cshanbo commented Jun 28, 2017

Hi @borisfom,
I don't think it's caused by an NCCL incompatibility. What I did was follow the issue I mentioned above.

You could attach your running script, installation information, etc., so that I might be able to help.
Just to make sure, did you set your environment variables, such as $PATH and $LD_LIBRARY_PATH?

@olimastro (author) commented:

The command I use is THEANO_FLAGS=device=cpu python resnet_controller.py --single resnet /PATH/TO/OUT/FILES

@borisfom commented:

@cshanbo: What exactly should be added to PATH/LD_LIBRARY_PATH?
What happens is that this call returns garbage:
self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)
And then the utf-8 decode fails on it:
response = self.send_req("platoon-get_platoon_info",
                         info={'device': self.device,
                               'local_id': self._local_id.comm_id.decode('utf-8')})

@cshanbo commented Jun 28, 2017

Hi @borisfom,
I think you should add the corresponding lib path to LD_LIBRARY_PATH.
For example, export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH
You could try something like this.

@borisfom commented:

NCCL is in the standard system path (installed via .deb); I tried adding it to LD_LIBRARY_PATH with no effect. GpuCommCliqueId would have failed if it was not found, right? Instead it returns garbage. Is there some initialization that could have been missed? Are you running Python 2 or 3?

@cshanbo commented Jun 29, 2017

I'm running Python 2.7 with anaconda.

Can you make sure your nccl and pygpu installations are correct?

  1. for nccl
  2. for pygpu
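A small sanity check built only from the calls that appear in the snippets above (a pygpu context plus GpuCommCliqueId); it assumes pygpu is importable and that a device named "cuda0" is visible:

import pygpu
from pygpu import collectives as gpucoll

ctx = pygpu.init("cuda0")
clique_id = gpucoll.GpuCommCliqueId(context=ctx)
# If NCCL and pygpu are set up correctly, this should decode cleanly instead
# of raising the "'utf8' codec can't decode byte" error reported above.
print(clique_id.comm_id.decode('utf-8'))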

@nouiz (Contributor) commented Jun 29, 2017 via email

    for c in block_size:
        if c == 'a':
            sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
                                                              1, 1, True, 4, ix='2%s' % c)
@astooke commented Aug 24, 2017

I've copied the code for building the network (so the line numbers are different), and when I run it, it breaks like so:

----> 1 net = build_resnet()

/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_resnet()
    151         if c == 'a':
    152             sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
--> 153                                                               1, 1, True, 4, ix='2%s' % c)
    154         else:
    155             sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],

/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_residual_block(incoming_layer, ratio_n_filter, ratio_size, has_left_branch, upscale_factor, ix)
    103         incoming_layer, map(lambda s: s % (ix, 2, 'a'), simple_block_name_pattern),
    104         int(lasagne.layers.get_output_shape(incoming_layer)[1]*ratio_n_filter),
--> 105         1, int(1.0/ratio_size), 0)
    106     net.update(net_tmp)
    107 

/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_simple_block(incoming_layer, names, num_filters, filter_size, stride, pad, use_bias, nonlin)
     49     net = []
     50     net.append((
---> 51             names[0],
     52             ConvLayer(incoming_layer, num_filters, filter_size, stride, pad,
     53                       flip_filters=False, nonlinearity=None) if use_bias

TypeError: 'map' object is not subscriptable

Contributor:

Do you use Python 3? This code probably isn't Python 3 compatible. See
https://stackoverflow.com/questions/6800481/python-map-object-is-not-subscriptable
for a way to fix this, or try with Python 2.
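A minimal, self-contained illustration of the failure and of the fix suggested above (the name pattern and values below are made up for the example, not taken from the PR):

ix = '2a'
name_patterns = ['res%s_branch%i%s', 'bn%s_branch%i%s']
names = map(lambda s: s % (ix, 2, 'a'), name_patterns)
# names[0]  -> under Python 3: TypeError: 'map' object is not subscriptable
names = list(map(lambda s: s % (ix, 2, 'a'), name_patterns))
print(names[0])   # 'res2a_branch2a' -- works under both Python 2 and 3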


OK, thanks. I wrapped each of the four map() calls like so: list(map(...)), and now it seems to work.

@ReyhaneAskari (Member) commented Sep 6, 2017

Here are the results of my benchmarking. The first number in each results cell is the time for epoch 0 and the second number is the cumulative time until the end of epoch 1. I calculated the improvement by dividing the epoch-0 time on one GPU by the epoch-0 time on two GPUs.

I think the base number is the one we want to report. If that's the case, I will rerun it 3 times to take an average.

                                        One GPU (Quadro K6000)    Two GPUs (Quadro K6000s)   Improvement
Base                                    91.63506, 209.2868        49.1683, 110.79238         1.86
With Dnn flags                          85.8325, 190.6249         47.4801, 103.6666          1.80
With pre-allocate flag                  96.2377, 216.74893        51.97668, 115.8446         1.85
Base with profiling                     130.5099, 291.01650       112.8854, 247.0402         1.56
With Dnn and profiling flags            125.09018, 272.49762      114.2143, 248.42933        1.09
With pre-allocate and profiling flags   134.58385, 297.5514       111.07611, 242.2090        1.21

pre allocate flag = gpuarray.preallocate=0.95
profiling flag = profile=True,profile_optimizer=True,profile_memory=True
Dnn flags = dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once
