resnet 50 #95
base: master
Conversation
The speed-up seems very bad. We will need to investigate this more. I'm not sure you are timing only the training part; I think your current timing also includes data transfer time. So I think we should add timing just for the computation part (and keep the current timing, because if the slowdown is due to data transfer, we need to rework that part).
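A minimal sketch of the split timing suggested here; load_minibatch and train_fn are illustrative placeholders, not names from the example code:

import time

# Illustrative placeholders for the example's real data loader and compiled
# Theano training function.
def load_minibatch(i):
    return [float(i)], [float(i)]

def train_fn(x, y):
    return sum(x) + sum(y)

n_batches = 10
transfer_time = 0.0
compute_time = 0.0
for i in range(n_batches):
    t0 = time.time()
    x, y = load_minibatch(i)   # data preparation / host-to-GPU transfer
    t1 = time.time()
    cost = train_fn(x, y)      # the training computation itself
    t2 = time.time()
    transfer_time += t1 - t0
    compute_time += t2 - t1

print("transfer: %.3fs  compute: %.3fs" % (transfer_time, compute_time))

If the real Theano function returns GPU-resident values, forcing the result to a host value (e.g. float(cost)) before reading the second timer keeps the compute measurement honest.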
def resnet_control(saveFreq=1110, saveto=None):
    parser = Controller.default_parser()
    parser.add_argument('--seed', default=1234, type=int,
                        required=False, help='Maximum mini-batches to train upon in total.')
update the help
from platoon.channel import Controller

class ResNetController(Controller):
Why can't you reuse the default controller? I don't like the idea of asking every user to write their own controller. If this example needs a new controller, should we update the Controller class to meet the current need?
I mostly copied everything from the lstm example and worked from there. You are right, though: if I do a diff between the lstm_controller and the resnet_controller, there are no differences except the names (the time request is just for the timing and won't end up in the final release). There could just be a general controller used by both examples. I do still have to implement the handle_control method and subclass the general Controller class.
For now, just reuse the LSTMController directly. We can decide later what to do about a general controller.
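For reference, a rough sketch of the structure being discussed: one generic controller subclass, with handle_control() as the only example-specific piece. The 'next' → 'train'/'stop' message protocol is only illustrative (not the actual lstm example protocol), and the Controller constructor and handle_control signatures should be checked against the installed platoon version.

from platoon.channel import Controller

class ExampleController(Controller):
    def __init__(self, max_mb, **kwargs):
        super(ExampleController, self).__init__(**kwargs)
        self.max_mb = max_mb   # total number of mini-batches to train on
        self.seen_mb = 0

    def handle_control(self, req, worker_id, req_info):
        # Called by the controller loop for every control request a worker sends.
        if req == 'next':
            self.seen_mb += 1
            return 'train' if self.seen_mb <= self.max_mb else 'stop'
        return None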
parser.add_argument('--seed', default=1234, type=int,
                    required=False, help='Maximum mini-batches to train upon in total.')
parser.add_argument('--patience', default=10, type=int, required=False,
                    help='Maximum patience when failing to get better validation results.')
I suppose it is in mini-batch units. Can you specify that?
Following the original lstm example, it is actually counted in validation passes; since validation is not done after every training epoch, this number should not be too large. I could indeed make the help more useful.
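For instance (just a suggested wording):

parser.add_argument('--patience', default=10, type=int, required=False,
                    help='Number of validation passes without improvement '
                         'to wait before stopping training early.')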
When you say timing after one computation pass, do you want me to take into account the overhead of calling ASGD()? If yes, I can make one epoch contain only as much data as the batch size and reuse the same code; if not, I can just time the difference between a single run with a batch size of 80 and a single run with a batch size of 40.
By the way, I should mention that I used
Remove the dnn.conv.algo_* flags from the timed runs. As these do their timing in the first call, they could bias the measurement. We don't care if we don't have the fastest run, but we should not bias the timing. For the timing, we need to know the overhead time vs. the compute time vs. the ASGD time. We had in mind to use a synchronous update: mostly, it just splits the minibatch across the GPUs and syncs the gradients. So you should also change that.
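A sketch of what removing the timing-based algorithm selection could look like, assuming the run currently uses the time_* values of Theano's dnn.conv.algo_* flags; a heuristic setting such as guess_once avoids the first-call benchmarking that would otherwise leak into the measured run:

import os
# Choose cuDNN convolution algorithms by heuristic rather than by benchmarking
# on the first call, so that benchmarking time does not bias the timed run.
os.environ['THEANO_FLAGS'] = ','.join([
    'dnn.conv.algo_fwd=guess_once',
    'dnn.conv.algo_bwd_filter=guess_once',
    'dnn.conv.algo_bwd_data=guess_once',
])
import theano  # the flags must be set before Theano is first imported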
The overall metric we are competing for is images/sec. While it is useful to time the different parts to see where we can improve things, this is the final metric we should report. Regarding the cuDNN options, again, if we want to correctly identify bottlenecks, I think we should use
When we want to check for correctness, it is OK to split a batch between GPUs and check that the synchronous update is consistent with what happens on only 1 GPU. However, for the final result, we want to have the full batch size on all GPUs, and report that.
print("Using all_reduce worker's interface!") | ||
asgd = AverageSGD(worker) | ||
asgd.make_rule(params) |
Can you please add other update algorithms, like EASGD and Downpour, around here, so we can see the differences?
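A hedged sketch of how the alternatives could be wired in next to the AverageSGD call. The import path and the EASGD/Downpour constructor and make_rule signatures below are assumptions and should be checked against the platoon version in use (EASGD normally also needs central parameter copies and an alpha):

# Assumed import path; AverageSGD is what this example already uses.
from platoon.training.global_dynamics import AverageSGD, EASGD, Downpour

def make_update_rule(name, worker, params):
    """Select a global update rule so the algorithms can be compared."""
    if name == 'asgd':
        rule = AverageSGD(worker)
        rule.make_rule(params)
    elif name == 'easgd':
        rule = EASGD(worker)
        rule.make_rule(params)   # real signature likely also takes central params / alpha
    elif name == 'downpour':
        rule = Downpour(worker)
        rule.make_rule(params)   # signature assumed
    else:
        raise ValueError("unknown update rule: %s" % name)
    return rule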
Can you please post the exact command lines you have used? I keep getting OOM errors on startup if I preallocate 0.95 .. 0.5. If I do not preallocate, the errors are:
Hi @borisfom
@cshanbo: I have recompiled and reinstalled; still the same issue. Can it be an NCCL 2.0 incompatibility?
Hi @borisfom, you could attach your running script, installation information, etc., so that I might be able to help.
The command I use is
@cshanbo: What exactly should be added to PATH/LD_LIBRARY_PATH?
Hi @borisfom,
NCCL is in the standard system path (installed via .deb); I tried adding it to LD_LIBRARY_PATH with no effect. GpuCommCliqueId would have failed if it was not found, right? Instead it returns garbage. Is there some initialization that could have been missed? Are you running Python 2 or 3?
We didn't try NCCL 2 with platoon. This could be the cause of the problem if its interface changed.
On Thu, Jun 29, 2017, 04:38, Justin Chan <[email protected]> wrote:
… I'm running Python 2.7 with anaconda <https://www.continuum.io/downloads>. Can you make sure your nccl and pygpu installations are correct?
1. for nccl <https://github.com/NVIDIA/nccl>
2. for pygpu <http://deeplearning.net/software/libgpuarray/installation.html#running-tests>
for c in block_size:
    if c == 'a':
        sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
                                                          1, 1, True, 4, ix='2%s' % c)
I've copied the code for building the network (so the line numbers are different), and when I run it, it breaks like so:
----> 1 net = build_resnet()
/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_resnet()
151 if c == 'a':
152 sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
--> 153 1, 1, True, 4, ix='2%s' % c)
154 else:
155 sub_net, parent_layer_name = build_residual_block(net[parent_layer_name],
/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_residual_block(incoming_layer, ratio_n_filter, ratio_size, has_left_branch, upscale_factor, ix)
103 incoming_layer, map(lambda s: s % (ix, 2, 'a'), simple_block_name_pattern),
104 int(lasagne.layers.get_output_shape(incoming_layer)[1]*ratio_n_filter),
--> 105 1, int(1.0/ratio_size), 0)
106 net.update(net_tmp)
107
/home/adam/GitRepos/synkhronos/demos/resnet/build_resnet.py in build_simple_block(incoming_layer, names, num_filters, filter_size, stride, pad, use_bias, nonlin)
49 net = []
50 net.append((
---> 51 names[0],
52 ConvLayer(incoming_layer, num_filters, filter_size, stride, pad,
53 flip_filters=False, nonlinearity=None) if use_bias
TypeError: 'map' object is not subscriptable
Do you use Python 3? This code probably isn't Python 3 compatible. See https://stackoverflow.com/questions/6800481/python-map-object-is-not-subscriptable for a way to fix it, or try with Python 2.
OK, thanks. I wrapped each of the four map() calls like so: list(map()), and now it seems to work.
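For reference, a minimal standalone illustration of the Python 2 / Python 3 difference behind the error (the pattern values below are made up):

simple_block_name_pattern = ['res%s_branch%i%s', 'res%s_branch%i%s_bn']  # illustrative
ix = '2a'

# Python 2: map() returns a list, so indexing works directly.
# Python 3: map() returns a lazy iterator, which is not subscriptable,
# so the result has to be materialized with list() before indexing.
names = list(map(lambda s: s % (ix, 2, 'a'), simple_block_name_pattern))
print(names[0])  # works under both Python 2 and Python 3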
Here are the results of my benchmarking. The first number in each results cell is the time for epoch 0 and the second number is the cumulative time until the end of epoch 1. I calculated the improvement by dividing the epoch 0 time on one GPU by the epoch 0 time on two GPUs. I think the base number is the one we want to report; if that's the case, I will rerun it 3 times to take an average.
Preallocation flag: gpuarray.preallocate=0.95 (set via THEANO_FLAGS).
Regarding issue #94:
For now I only did a trial on synthetic data. I am reporting the time for one epoch of training, which is 1000 data points:
1 GPU, batch size 80 (all that could fit on the DGX): 19.715s
2 GPUs, batch size 40: 13.452s
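Expressed in the images/sec metric discussed above (assuming each of the 1000 synthetic data points is one image), those times work out roughly to:

# Reported epoch times over 1000 synthetic data points.
print(1000 / 19.715)   # 1 GPU,  batch size 80: ~50.7 images/sec
print(1000 / 13.452)   # 2 GPUs, batch size 40: ~74.3 images/sec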