
resnet 50 #95

Open · olimastro wants to merge 4 commits into master
Conversation

@olimastro (Contributor) commented Jun 7, 2017

Regarding issue #94.
For now I have only run a trial on synthetic data. I am reporting the time for one epoch of training, which is 1000 data points:

1GPU batch size 80 (all that could fit on the DGX): 19.715s
2GPU batch size 40: 13.452s
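For reference, the 2-GPU speed-up implied by these times works out as follows:

# Implied 2-GPU speed-up from the epoch times above (ideal would be ~2x).
print(19.715 / 13.452)   # ~1.47x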

@nouiz (Contributor) left a comment

The speed-up seems very bad. We will need to investigate this more. I'm not sure you are timing only the training part; I think your current timing also includes the data transfer time. So I think we should add timing just for the computation part (and keep the current one, because if this is due to the data transfer, we need to rework that part).
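A minimal sketch of the kind of breakdown being asked for; f_train (the compiled Theano training function), get_minibatch() and n_batches are placeholders standing in for the worker script's own objects, not code from this PR:

import time

# Time the Theano call separately from data preparation; the difference
# between the epoch time and the accumulated compute time approximates the
# data loading / transfer overhead.
compute_time = 0.0
t_epoch = time.time()
for _ in range(n_batches):
    x, y = get_minibatch()              # host-side data preparation
    t0 = time.time()
    cost = f_train(x, y)                # compute (plus the implicit host->GPU copy)
    compute_time += time.time() - t0
epoch_time = time.time() - t_epoch
print("compute %.3fs / epoch %.3fs / transfer+overhead %.3fs"
      % (compute_time, epoch_time, epoch_time - compute_time))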

def resnet_control(saveFreq=1110, saveto=None):
    parser = Controller.default_parser()
    parser.add_argument('--seed', default=1234, type=int,
                        required=False, help='Maximum mini-batches to train upon in total.')
Contributor:

Update the help; this string doesn't describe the --seed argument.
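A possible fix; the help wording here is only a suggestion, not taken from the PR:

parser.add_argument('--seed', default=1234, type=int, required=False,
                    help='Seed for the random number generator.')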

from platoon.channel import Controller


class ResNetController(Controller):
Contributor:

Why can't you reuse the default controller? I don't like the idea of asking every user to write their own controller. If this example needs a new controller, do we need to update the Controller class to meet the current need?

@olimastro (author) commented Jun 7, 2017

I mostly copied everything from the lstm example and worked from there. You are right, though: if I diff lstm_controller and resnet_controller, there are no differences except the names (the time request is just for the timing and won't end up in the final release). There could be a single general controller used by both examples. I do have to implement the handle_control method and subclass the general Controller class.
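As a rough illustration of what such a shared controller might look like, here is a sketch that assumes only what is mentioned in this thread (Platoon's Controller class, its default_parser(), and an overridable handle_control method); the request/response strings and the bookkeeping are invented for the example:

from platoon.channel import Controller


class GeneralController(Controller):
    """Hypothetical controller shared by the lstm and resnet examples."""

    def __init__(self, max_mb, **kwargs):
        super(GeneralController, self).__init__(**kwargs)
        self.max_mb = max_mb       # maximum number of mini-batches to train on
        self.mb_done = 0

    def handle_control(self, req, worker_id, req_info):
        # A worker sends a control request; the strings used here
        # ('next', 'train', 'stop', 'done') are illustrative only.
        if req == 'next':
            return 'train' if self.mb_done < self.max_mb else 'stop'
        if req == 'done':
            self.mb_done += req_info.get('num_minibatches', 1)
            return 'ack'
        return None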

Contributor:

For now, just reuse the LSTMController directly. We can see later what we do about a general controller.

    parser.add_argument('--seed', default=1234, type=int,
                        required=False, help='Maximum mini-batches to train upon in total.')
    parser.add_argument('--patience', default=10, type=int, required=False,
                        help='Maximum patience when failing to get better validation results.')
Contributor:

I suppose it is in mini-batch units. Can you specify that?

Contributor (author):

In the original lstm example it is actually in validation-epoch units; since validation is not done after every training epoch, this number should not be too large. I could indeed make the help more useful.

@olimastro (author) commented:

When you say timing after one computation pass, do you want me to take into account the overhead of calling ASGD()? If yes, I can make one epoch contain the same amount of data as the batch size and reuse the same code; if not, I can just time the difference between a single call with batch size 80 and a single call with batch size 40.

@olimastro (author) commented:

By the way, I should mention that I used THEANO_FLAGS=device=cuda,floatX=float32,dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95 and I updated Theano on the 5th of June.

@nouiz (Contributor) commented Jun 7, 2017

Remove the dnn.conv.algo_* flags from the timing runs. As these do their own algorithm timing on the first call, they could bias the measurement. We don't care if we don't have the fastest run, but we should not bias the timing.

For the timing, we need to know the overhead time vs. the compute time vs. the ASGD time.

We had in mind to use a synchronous update: mostly, it just splits the minibatch across the GPUs and syncs the gradients. So you should also change that.
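As a purely mathematical illustration of this synchronous update (numpy only, not Platoon code): each GPU computes a gradient on its shard of the minibatch, the gradients are summed and averaged across workers, and every worker applies the same step, so all copies of the parameters stay identical.

import numpy as np

def synchronous_step(params, per_worker_grads, lr=0.01):
    # What an all-reduce sum followed by a divide gives every worker.
    avg_grad = sum(per_worker_grads) / len(per_worker_grads)
    return params - lr * avg_grad

params = np.zeros(4)
grads = [np.array([1., 2., 3., 4.]), np.array([3., 2., 1., 0.])]
print(synchronous_step(params, grads))   # identical result on every worker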

@lamblin (Contributor) commented Jun 7, 2017

The overall metric we are competing for is images / sec. While it is useful to time the different parts to see where we can improve things, this is the final metric we should report.

Regarding cuDNN options, again, if we want to correctly identify bottlenecks, I think we should use time_once. Just do one dummy call before starting training (and profiling) so that the right algorithm gets selected.

When we want to check for correctness, it is OK to split a batch between GPUs, and check if the synchronous update is consistent with what happens on only 1 GPU. However, for the final result, we want to have the full batch size on all GPUs, and report that.
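For example, using the epoch times reported earlier in this thread (1000 synthetic data points per epoch), the images/sec numbers work out to roughly:

# Images/sec from the epoch times reported above.
print(1000 / 19.715)   # 1 GPU,  batch size 80 -> ~50.7 images/s
print(1000 / 13.452)   # 2 GPUs, batch size 40 -> ~74.3 images/s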


print("Using all_reduce worker's interface!")
asgd = AverageSGD(worker)
asgd.make_rule(params)

Can you please add other update algorithms, like EASGD and Downpour, around here, so we can see the differences?
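For reference, a numpy sketch of the elastic-averaging (EASGD) rule such a variant would implement; wiring it into Platoon's worker interface is not shown here, since that API is not part of this thread:

import numpy as np

def easgd_sync(x_worker, x_center, alpha=0.5):
    # Worker parameters and central parameters are pulled toward each other
    # by an elasticity term alpha (the "elastic averaging" step).
    diff = x_worker - x_center
    return x_worker - alpha * diff, x_center + alpha * diff

xw, xc = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(easgd_sync(xw, xc))   # worker moves toward the center, center toward the worker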

@borisfom commented:

Can you please post the exact command lines you have used? I keep getting OOM errors on startup if I preallocate anywhere from 0.95 down to 0.5. If I do not preallocate, the errors are:
WARNING! Failed to register in a local GPU comm world.
Reason: 'utf8' codec can't decode byte 0xa5 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
Traceback (most recent call last):
  File "resnet_worker.py", line 566, in <module>
    train_resnet()
  File "resnet_worker.py", line 475, in train_resnet
    asgd.make_rule(params)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/training/global_dynamics.py", line 161, in make_rule
    gup = AllReduceSum(update, inplace=True)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 155, in AllReduceSum
    return AllReduce(theano.scalar.add, inplace, worker)(src, dest)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 61, in __init__
    self._f16_ok = not self.worker._multinode
AttributeError: 'Worker' object has no attribute '_multinode'

@cshanbo commented Jun 27, 2017

Hi @borisfom
It seems to be an installation error. You could try the solution under this issue to see if it works.

@borisfom commented:

@cshanbo: I have recompiled and reinstalled; still the same issue. Could it be an NCCL 2.0 incompatibility?
How exactly did you run the benchmark?

@cshanbo commented Jun 28, 2017

Hi @borisfom,
I don't think it's caused by an NCCL incompatibility. What I did was follow the issue I mentioned above.

You could attach your running script, installation information, etc., so that I might be able to help.
Just to make sure, did you set your environment variables, such as $PATH and $LD_LIBRARY_PATH?

@olimastro (author) commented:

The command I use is THEANO_FLAGS=device=cpu python resnet_controller.py --single resnet /PATH/TO/OUT/FILES

@borisfom commented:

@cshanbo: What exactly should be added to PATH/LD_LIBRARY_PATH?
What happens is that this call returns garbage:
self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)
And then the utf-8 decode fails on it:
response = self.send_req("platoon-get_platoon_info",
                         info={'device': self.device,
                               'local_id': self._local_id.comm_id.decode('utf-8')})

@cshanbo commented Jun 28, 2017

Hi @borisfom,
I think you should add the corresponding lib path to LD_LIBRARY_PATH.
For example, export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH
You could try something like this.

@borisfom commented:

NCCL is in the standard system path (installed via .deb); I tried adding it to LD_LIBRARY_PATH with no effect. GpuCommCliqueId would have failed if it was not found, right? Instead it returns garbage. Is there some initialization that could have been missed? Are you running Python 2 or 3?

@cshanbo commented Jun 29, 2017

I'm running Python 2.7 with anaconda.

Can you make sure your nccl and pygpu installations are correct?

  1. for nccl
  2. for pygpu
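A small sanity check built only from the calls that appear in the snippets above (a pygpu context plus GpuCommCliqueId); it assumes pygpu is importable and that a device named "cuda0" is visible:

import pygpu
from pygpu import collectives as gpucoll

ctx = pygpu.init("cuda0")
clique_id = gpucoll.GpuCommCliqueId(context=ctx)
# If NCCL and pygpu are set up correctly, this should decode cleanly instead
# of raising the "'utf8' codec can't decode byte" error reported above.
print(clique_id.comm_id.decode('utf-8'))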

@nouiz (Contributor) commented Jun 29, 2017 via email

    for c in block_size:
        if c == 'a':
            sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
                                                              1, 1, True, 4, ix='2%s' % c)
@astooke commented Aug 24, 2017

I've copied the code for building the network (so the line numbers are different), and when I run it, it breaks like so:

----> 1 net = build_resnet()

/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_resnet()
    151         if c == 'a':
    152             sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
--> 153                                                               1, 1, True, 4, ix='2%s' % c)
    154         else:
    155             sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],

/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_residual_block(incoming_layer, ratio_n_filter, ratio_size, has_left_branch, upscale_factor, ix)
    103         incoming_layer, map(lambda s: s % (ix, 2, 'a'), simple_block_name_pattern),
    104         int(lasagne.layers.get_output_shape(incoming_layer)[1]*ratio_n_filter),
--> 105         1, int(1.0/ratio_size), 0)
    106     net.update(net_tmp)
    107 

/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_simple_block(incoming_layer, names, num_filters, filter_size, stride, pad, use_bias, nonlin)
     49     net = []
     50     net.append((
---> 51             names[0],
     52             ConvLayer(incoming_layer, num_filters, filter_size, stride, pad,
     53                       flip_filters=False, nonlinearity=None) if use_bias

TypeError: 'map' object is not subscriptable

Contributor:

Do you use Python 3? This code probably isn't Python 3 compatible. See
https://stackoverflow.com/questions/6800481/python-map-object-is-not-subscriptable
for a way to fix this, or try with Python 2.
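A minimal, self-contained illustration of the failure and of the fix suggested above (the name pattern and values below are made up for the example, not taken from the PR):

ix = '2a'
name_patterns = ['res%s_branch%i%s', 'bn%s_branch%i%s']
names = map(lambda s: s % (ix, 2, 'a'), name_patterns)
# names[0]  -> under Python 3: TypeError: 'map' object is not subscriptable
names = list(map(lambda s: s % (ix, 2, 'a'), name_patterns))
print(names[0])   # 'res2a_branch2a' -- works under both Python 2 and 3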


OK, thanks. I wrapped each of the four map() calls like so: list(map(...)), and now it seems to work.

@ReyhaneAskari (Member) commented Sep 6, 2017

Here are the results of my benchmarking. The first number in each results cell is the time for epoch 0 and the second number is the cumulative time until the end of epoch 1. I calculated the improvement by dividing the epoch-0 time on one GPU by the epoch-0 time on two GPUs.

I think the base number is the one we want to report. If that's the case, I will rerun it 3 times to take an average.

                                        One GPU (Quadro K6000)    Two GPUs (Quadro K6000s)   Improvement
Base                                    91.63506, 209.2868        49.1683, 110.79238         1.86
With Dnn flags                          85.8325, 190.6249         47.4801, 103.6666          1.80
With pre-allocate flag                  96.2377, 216.74893        51.97668, 115.8446         1.85
Base with profiling                     130.5099, 291.01650       112.8854, 247.0402         1.56
With Dnn and profiling flags            125.09018, 272.49762      114.2143, 248.42933        1.09
With pre-allocate and profiling flags   134.58385, 297.5514       111.07611, 242.2090        1.21

pre allocate flag = gpuarray.preallocate=0.95
profiling flag = profile=True,profile_optimizer=True,profile_memory=True
Dnn flags = dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once
