recurrent_local_online_profile.log
Function profiling
==================
Message: recurrent_local_online.py:320
Time in 4 calls to Function.__call__: 1.065561e+02s
Time in Function.fn.__call__: 1.065547e+02s (99.999%)
Time in thunks: 1.064697e+02s (99.919%)
Total compile time: 2.745719e+01s
Number of Apply nodes: 394
Theano Optimizer time: 2.191432e+01s
Theano validate time: 1.021887e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 5.450504e+00s
Import time 3.744481e-01s
Time in all call to theano.grad() 1.292913e+00s
Time since theano import 155.017s
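
A report in this format is produced by Theano's built-in profiler. A minimal sketch of how to enable it on a toy graph (an illustration only, not the recurrent_local_online.py script that produced the numbers above; flag and keyword names are those of the Theano 0.9 line used here):

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    y = (x ** 2).sum()
    # profile=True attaches a ProfileStats object to the compiled function;
    # the report is printed when the process exits, or on demand via
    # f.profile.summary().
    f = theano.function([x], y, profile=True)
    f(np.ones((3, 3), dtype=theano.config.floatX))

The profiler can also be switched on globally, e.g. THEANO_FLAGS='profile=True,profile_memory=True'; profile_memory is what enables the per-node "Memory Profile" section further down (an assumption about how this log was generated, since the exact invocation is not recorded in it).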
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
99.5% 99.5% 105.950s 1.32e+01s Py 8 2 theano.scan_module.scan_op.Scan
0.3% 99.8% 0.280s 5.38e-03s C 52 13 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.1% 99.9% 0.127s 1.32e-03s C 96 24 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 100.0% 0.089s 3.37e-04s C 264 66 theano.sandbox.cuda.basic_ops.GpuElemwise
0.0% 100.0% 0.009s 3.13e-04s C 28 7 theano.sandbox.cuda.basic_ops.GpuFromHost
0.0% 100.0% 0.008s 2.69e-04s C 28 7 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 100.0% 0.004s 6.84e-06s C 548 137 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.001s 5.96e-06s C 176 44 theano.compile.ops.Shape_i
0.0% 100.0% 0.001s 6.98e-06s C 112 28 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.001s 7.04e-06s C 108 27 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.000s 4.09e-05s C 12 3 theano.sandbox.cuda.basic_ops.HostFromGpu
0.0% 100.0% 0.000s 1.62e-05s Py 24 6 theano.compile.ops.Rebroadcast
0.0% 100.0% 0.000s 5.16e-06s C 68 17 theano.tensor.basic.ScalarFromTensor
0.0% 100.0% 0.000s 1.15e-05s C 24 6 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.000s 5.03e-06s C 12 3 theano.tensor.elemwise.DimShuffle
0.0% 100.0% 0.000s 7.12e-06s C 8 2 theano.tensor.opt.MakeVector
0.0% 100.0% 0.000s 1.07e-05s C 4 1 theano.tensor.subtensor.IncSubtensor
0.0% 100.0% 0.000s 9.00e-06s C 4 1 theano.tensor.basic.AllocEmpty
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
97.8% 97.8% 104.174s 2.60e+01s Py 4 1 forall_inplace,gpu,scan_fn}
1.7% 99.5% 1.776s 4.44e-01s Py 4 1 forall_inplace,gpu,grad_of_scan_fn}
0.3% 99.8% 0.279s 9.97e-03s C 28 7 GpuCAReduce{pre=sqr,red=add}{1,1}
0.1% 99.9% 0.127s 1.32e-03s C 96 24 GpuAlloc{memset_0=True}
0.0% 99.9% 0.032s 6.56e-04s C 48 12 GpuElemwise{Composite{(i0 - ((i1 * i2) / sqrt((i3 + i4 + i5))))}}[(0, 0)]
0.0% 99.9% 0.016s 3.31e-04s C 48 12 GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)]
0.0% 100.0% 0.015s 2.82e-04s C 52 13 GpuElemwise{Add}[(0, 0)]
0.0% 100.0% 0.013s 2.72e-04s C 48 12 GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}
0.0% 100.0% 0.013s 2.71e-04s C 48 12 GpuElemwise{Mul}[(0, 1)]
0.0% 100.0% 0.009s 3.13e-04s C 28 7 GpuFromHost
0.0% 100.0% 0.007s 3.07e-04s C 24 6 GpuIncSubtensor{InplaceSet;:int64:}
0.0% 100.0% 0.000s 4.09e-05s C 12 3 HostFromGpu
0.0% 100.0% 0.000s 6.04e-06s C 76 19 GpuSubtensor{int64}
0.0% 100.0% 0.000s 5.94e-06s C 76 19 Shape_i{0}
0.0% 100.0% 0.000s 2.82e-05s C 16 4 GpuCAReduce{pre=sqr,red=add}{1}
0.0% 100.0% 0.000s 1.62e-05s Py 24 6 Rebroadcast{0}
0.0% 100.0% 0.000s 5.93e-06s C 60 15 Shape_i{1}
0.0% 100.0% 0.000s 5.16e-06s C 68 17 ScalarFromTensor
0.0% 100.0% 0.000s 6.05e-06s C 52 13 Elemwise{add,no_inplace}
0.0% 100.0% 0.000s 1.15e-05s C 24 6 GpuAllocEmpty
... (remaining 94 Ops account for 0.01%(0.01s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
97.8% 97.8% 104.174s 2.60e+01s 4 233 forall_inplace,gpu,scan_fn}(Shape_i{1}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, Shape_i{1}.0, conv1_filters, Wz, Uz, Wg, Wr, Ur, Ug, W_fc2, GpuDimShuffle{x,0}.0, GpuDi
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(20, 32, 1, 50, 50), strides=(2500, 50000, 0, 50, 1)
input 2: dtype=float32, shape=(21, 32, 4), strides=(128, 4, 1)
input 3: dtype=float32, shape=(21, 32, 200), strides=(6400, 200, 1)
input 4: dtype=int32, shape=(1,), strides=c
input 5: dtype=float32, shape=(1, 32), strides=(0, 1)
input 6: dtype=float32, shape=(1, 32, 20, 50, 50), strides=(0, 50000, 2500, 50, 1)
input 7: dtype=float32, shape=(1, 32, 20, 32, 50, 50), strides=(0, 1600000, 80000, 2500, 50, 1)
input 8: dtype=float32, shape=(1, 32, 32, 9, 9), strides=(0, 2592, 81, 9, 1)
input 9: dtype=int64, shape=(), strides=c
input 10: dtype=float32, shape=(32, 1, 9, 9), strides=c
input 11: dtype=float32, shape=(80004, 200), strides=c
input 12: dtype=float32, shape=(200, 200), strides=c
input 13: dtype=float32, shape=(80004, 200), strides=c
input 14: dtype=float32, shape=(80004, 200), strides=c
input 15: dtype=float32, shape=(200, 200), strides=c
input 16: dtype=float32, shape=(200, 200), strides=c
input 17: dtype=float32, shape=(200, 4), strides=c
input 18: dtype=float32, shape=(1, 4), strides=(0, 1)
input 19: dtype=float32, shape=(1, 200), strides=(0, 1)
input 20: dtype=float32, shape=(1, 200), strides=(0, 1)
input 21: dtype=float32, shape=(1, 200), strides=(0, 1)
input 22: dtype=int64, shape=(4,), strides=c
input 23: dtype=int64, shape=(), strides=c
input 24: dtype=int64, shape=(), strides=c
input 25: dtype=int64, shape=(), strides=c
input 26: dtype=int64, shape=(), strides=c
input 27: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(21, 32, 4), strides=(128, 4, 1)
output 1: dtype=float32, shape=(21, 32, 200), strides=(6400, 200, 1)
output 2: dtype=int32, shape=(1,), strides=c
output 3: dtype=float32, shape=(1, 32), strides=(0, 1)
output 4: dtype=float32, shape=(1, 32, 20, 50, 50), strides=(0, 50000, 2500, 50, 1)
output 5: dtype=float32, shape=(1, 32, 20, 32, 50, 50), strides=(0, 1600000, 80000, 2500, 50, 1)
output 6: dtype=float32, shape=(1, 32, 32, 9, 9), strides=(0, 2592, 81, 9, 1)
output 7: dtype=float32, shape=(20, 32, 32, 50, 50), strides=(2560000, 80000, 2500, 50, 1)
1.7% 99.5% 1.776s 4.44e-01s 4 306 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Maximum}[(0, 0)].0, GpuDimShuffle{0,2,1}.0, GpuDimShuffle{0,2,1}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{(i0 - sqr(i1))},no_inplace}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, Gp
input 0: dtype=int64, shape=(), strides=c
input 1: dtype=float32, shape=(20, 200, 32), strides=(-6400, 1, 200)
input 2: dtype=float32, shape=(20, 200, 32), strides=(-6400, 1, 200)
input 3: dtype=float32, shape=(20, 32, 32, 9, 9), strides=(82944, 2592, 81, 9, 1)
input 4: dtype=float32, shape=(20, 32, 20, 32, 50, 50), strides=(51200000, 1600000, 80000, 2500, 50, 1)
input 5: dtype=float32, shape=(20, 32, 20, 50, 50), strides=(1600000, 50000, 2500, 50, 1)
input 6: dtype=float32, shape=(20, 32), strides=(32, 1)
input 7: dtype=float32, shape=(20, 32, 4), strides=(128, 4, 1)
input 8: dtype=float32, shape=(20,), strides=(1,)
input 9: dtype=float32, shape=(20, 32, 1, 50, 50), strides=(-2500, 50000, 0, 50, 1)
input 10: dtype=float32, shape=(20, 32, 4), strides=(-128, 4, 1)
input 11: dtype=float32, shape=(20, 32, 200), strides=(-6400, 200, 1)
input 12: dtype=float32, shape=(21, 32, 4), strides=(-128, 4, 1)
input 13: dtype=float32, shape=(21, 32, 200), strides=(6400, 200, 1)
input 14: dtype=float32, shape=(21,), strides=(1,)
input 15: dtype=float32, shape=(21, 32), strides=(32, 1)
input 16: dtype=float32, shape=(21, 32, 20, 50, 50), strides=(1600000, 50000, 2500, 50, 1)
input 17: dtype=float32, shape=(21, 32, 20, 32, 50, 50), strides=(51200000, 1600000, 80000, 2500, 50, 1)
input 18: dtype=float32, shape=(21, 32, 32, 9, 9), strides=(82944, 2592, 81, 9, 1)
input 19: dtype=float32, shape=(1, 32, 1, 9, 9), strides=(0, 81, 0, 9, 1)
input 20: dtype=float32, shape=(1, 80004, 200), strides=(0, 200, 1)
input 21: dtype=float32, shape=(1, 200, 200), strides=(0, 200, 1)
input 22: dtype=float32, shape=(1, 200), strides=(0, 1)
input 23: dtype=float32, shape=(1, 80004, 200), strides=(0, 200, 1)
input 24: dtype=float32, shape=(1, 80004, 200), strides=(0, 200, 1)
input 25: dtype=float32, shape=(1, 200, 200), strides=(0, 200, 1)
input 26: dtype=float32, shape=(1, 200), strides=(0, 1)
input 27: dtype=float32, shape=(1, 200, 200), strides=(0, 200, 1)
input 28: dtype=float32, shape=(1, 200), strides=(0, 1)
input 29: dtype=float32, shape=(1, 200, 4), strides=(0, 4, 1)
input 30: dtype=float32, shape=(1, 4), strides=(0, 1)
input 31: dtype=float32, shape=(32, 1, 9, 9), strides=c
input 32: dtype=float32, shape=(80004, 200), strides=c
input 33: dtype=float32, shape=(200, 200), strides=c
input 34: dtype=float32, shape=(80004, 200), strides=c
input 35: dtype=float32, shape=(80004, 200), strides=c
input 36: dtype=float32, shape=(200, 200), strides=c
input 37: dtype=float32, shape=(200, 200), strides=c
input 38: dtype=float32, shape=(200, 80004), strides=(1, 200)
input 39: dtype=float32, shape=(1, 200), strides=(0, 1)
input 40: dtype=float32, shape=(200, 200), strides=(1, 200)
input 41: dtype=float32, shape=(200, 200), strides=(1, 200)
input 42: dtype=float32, shape=(1, 200), strides=(0, 1)
input 43: dtype=float32, shape=(200, 80004), strides=(1, 200)
input 44: dtype=float32, shape=(1, 200), strides=(0, 1)
input 45: dtype=float32, shape=(200, 200), strides=(1, 200)
input 46: dtype=float32, shape=(200, 80004), strides=(1, 200)
input 47: dtype=float32, shape=(4, 200), strides=(1, 4)
input 48: dtype=int64, shape=(4,), strides=c
input 49: dtype=int64, shape=(), strides=c
input 50: dtype=int64, shape=(), strides=c
input 51: dtype=int64, shape=(), strides=c
input 52: dtype=int64, shape=(), strides=c
input 53: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(21, 32, 4), strides=c
output 1: dtype=float32, shape=(21, 32, 200), strides=c
output 2: dtype=float32, shape=(21,), strides=c
output 3: dtype=float32, shape=(21, 32), strides=c
output 4: dtype=float32, shape=(21, 32, 20, 50, 50), strides=c
output 5: dtype=float32, shape=(21, 32, 20, 32, 50, 50), strides=c
output 6: dtype=float32, shape=(21, 32, 32, 9, 9), strides=c
output 7: dtype=float32, shape=(1, 32, 1, 9, 9), strides=c
output 8: dtype=float32, shape=(1, 80004, 200), strides=c
output 9: dtype=float32, shape=(1, 200, 200), strides=c
output 10: dtype=float32, shape=(1, 200), strides=c
output 11: dtype=float32, shape=(1, 80004, 200), strides=c
output 12: dtype=float32, shape=(1, 80004, 200), strides=c
output 13: dtype=float32, shape=(1, 200, 200), strides=c
output 14: dtype=float32, shape=(1, 200), strides=c
output 15: dtype=float32, shape=(1, 200, 200), strides=c
output 16: dtype=float32, shape=(1, 200), strides=c
output 17: dtype=float32, shape=(1, 200, 4), strides=c
output 18: dtype=float32, shape=(1, 4), strides=c
0.1% 99.6% 0.093s 2.32e-02s 4 323 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuSubtensor{int64}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.1% 99.7% 0.093s 2.32e-02s 4 326 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuSubtensor{int64}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.1% 99.8% 0.093s 2.32e-02s 4 329 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuSubtensor{int64}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(), strides=c
0.1% 99.8% 0.060s 1.50e-02s 4 101 GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{add,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0, Shape_i{4}.0)
input 0: dtype=float32, shape=(), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=int64, shape=(), strides=c
input 6: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(21, 32, 20, 32, 50, 50), strides=(51200000, 1600000, 80000, 2500, 50, 1)
0.1% 99.9% 0.057s 1.43e-02s 4 124 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[[[ 0.]]]]]]}, Elemwise{Composite{(Switch(LT((i0 - i1), i2), Switch(LT(((i0 - i1) + i3), i2), i2, ((i0 - i1) + i3)), Switch(LT((i0 - i1), i3), (i0 - i1), i3)) - i2)}}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0, Shape_i{4}.0)
input 0: dtype=float32, shape=(1, 1, 1, 1, 1, 1), strides=c
input 1: dtype=int64, shape=(), strides=c
input 2: dtype=int64, shape=(), strides=c
input 3: dtype=int64, shape=(), strides=c
input 4: dtype=int64, shape=(), strides=c
input 5: dtype=int64, shape=(), strides=c
input 6: dtype=int64, shape=(), strides=c
output 0: dtype=float32, shape=(20, 32, 20, 32, 50, 50), strides=(51200000, 1600000, 80000, 2500, 50, 1)
0.0% 99.9% 0.010s 2.54e-03s 4 376 GpuElemwise{Composite{(i0 - ((i1 * i2) / sqrt((i3 + i4 + i5))))}}[(0, 0)](Wg, CudaNdarrayConstant{[[ 0.001]]}, GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)].0, CudaNdarrayConstant{[[ 9.99999997e-07]]}, GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(80004, 200), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
input 4: dtype=float32, shape=(80004, 200), strides=c
input 5: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.010s 2.53e-03s 4 378 GpuElemwise{Composite{(i0 - ((i1 * i2) / sqrt((i3 + i4 + i5))))}}[(0, 0)](Wz, CudaNdarrayConstant{[[ 0.001]]}, GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)].0, CudaNdarrayConstant{[[ 9.99999997e-07]]}, GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(80004, 200), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
input 4: dtype=float32, shape=(80004, 200), strides=c
input 5: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.010s 2.53e-03s 4 380 GpuElemwise{Composite{(i0 - ((i1 * i2) / sqrt((i3 + i4 + i5))))}}[(0, 0)](Wr, CudaNdarrayConstant{[[ 0.001]]}, GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)].0, CudaNdarrayConstant{[[ 9.99999997e-07]]}, GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
input 1: dtype=float32, shape=(1, 1), strides=c
input 2: dtype=float32, shape=(80004, 200), strides=c
input 3: dtype=float32, shape=(1, 1), strides=c
input 4: dtype=float32, shape=(80004, 200), strides=c
input 5: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.008s 2.04e-03s 4 34 GpuFromHost(<TensorType(float32, 5D)>)
input 0: dtype=float32, shape=(32, 20, 1, 50, 50), strides=c
output 0: dtype=float32, shape=(32, 20, 1, 50, 50), strides=(50000, 2500, 0, 50, 1)
0.0% 99.9% 0.006s 1.58e-03s 4 227 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
input 0: dtype=float32, shape=(1, 32, 20, 32, 50, 50), strides=(0, 1600000, 80000, 2500, 50, 1)
input 1: dtype=float32, shape=(1, 32, 20, 32, 50, 50), strides=(0, 1600000, 80000, 2500, 50, 1)
input 2: dtype=int64, shape=8, strides=c
output 0: dtype=float32, shape=(1, 32, 20, 32, 50, 50), strides=(0, 1600000, 80000, 2500, 50, 1)
0.0% 99.9% 0.005s 1.24e-03s 4 352 GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)](GpuFromHost.0, GpuSubtensor{int64}.0, GpuDimShuffle{x,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.005s 1.24e-03s 4 354 GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)](GpuFromHost.0, GpuSubtensor{int64}.0, GpuDimShuffle{x,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.005s 1.24e-03s 4 356 GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)](GpuFromHost.0, GpuSubtensor{int64}.0, GpuDimShuffle{x,x}.0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
input 2: dtype=float32, shape=(1, 1), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.005s 1.14e-03s 4 390 GpuElemwise{Add}[(0, 0)](GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 99.9% 0.005s 1.14e-03s 4 392 GpuElemwise{Add}[(0, 0)](GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 100.0% 0.005s 1.14e-03s 4 388 GpuElemwise{Add}[(0, 0)](GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
input 0: dtype=float32, shape=(80004, 200), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 100.0% 0.004s 1.01e-03s 4 364 GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}(CudaNdarrayConstant{[[ 0.1]]}, GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)].0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
0.0% 100.0% 0.004s 1.00e-03s 4 366 GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}(CudaNdarrayConstant{[[ 0.1]]}, GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)].0)
input 0: dtype=float32, shape=(1, 1), strides=c
input 1: dtype=float32, shape=(80004, 200), strides=c
output 0: dtype=float32, shape=(80004, 200), strides=c
... (remaining 374 Apply instances account for 0.04%(0.04s) of the runtime)
Memory Profile
(Sparse variables are ignored)
(For values in brackets, it's for linker = c|py)
---
Max peak memory with current setting
CPU: 10KB (10KB)
GPU: 9071438KB (9071438KB)
CPU + GPU: 9071448KB (9071448KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 10KB (10KB)
GPU: 13598024KB (13786007KB)
CPU + GPU: 13598034KB (13786018KB)
Max peak memory if allow_gc=False (linker doesn't make a difference)
CPU: 11KB
GPU: 9259442KB
CPU + GPU: 9259454KB
---
<Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>
4635224004B [(21, 32, 4), (21, 32, 200), (21,), (21, 32), (21, 32, 20, 50, 50), (21, 32, 20, 32, 50, 50), (21, 32, 32, 9, 9), (1, 32, 1, 9, 9), (1, 80004, 200), (1, 200, 200), (1, 200), (1, 80004, 200), (1, 80004, 200), (1, 200, 200), (1, 200), (1, 200, 200), (1, 200), (1, 200, 4), (1, 4)] i i i i i i i i i i i i i i i i i i i forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Maximum}[(0, 0)].0, GpuDimShuffle{0,2,1}.0, GpuDimShuffle{0,2,1}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{(i0 - sqr(i1))},no_inplace}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, conv1_filters, Wz, Uz, Wg, Wr, Ur, Ug, GpuDimShuffle{1,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, MakeVector{dtype='int64'}.0, Shape_i{0}.0, Shape_i{3}.0, Shape_i{2}.0, Elemwise{Composite{(i0 * (i1 // i0))}}.0, Elemwise{Composite{(i0 * (i1 // i0))}}.0)
4300800000B [(21, 32, 20, 32, 50, 50)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{add,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0, Shape_i{4}.0)
4096000000B [(20, 32, 20, 32, 50, 50)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[[[ 0.]]]]]]}, Elemwise{Composite{(Switch(LT((i0 - i1), i2), Switch(LT(((i0 - i1) + i3), i2), i2, ((i0 - i1) + i3)), Switch(LT((i0 - i1), i3), (i0 - i1), i3)) - i2)}}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0, Shape_i{4}.0)
416880260B [(21, 32, 4), (21, 32, 200), (1,), (1, 32), (1, 32, 20, 50, 50), (1, 32, 20, 32, 50, 50), (1, 32, 32, 9, 9), (20, 32, 32, 50, 50)] i i i i i i i c forall_inplace,gpu,scan_fn}(Shape_i{1}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, Shape_i{1}.0, conv1_filters, Wz, Uz, Wg, Wr, Ur, Ug, W_fc2, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, MakeVector{dtype='int64'}.0, Shape_i{0}.0, Shape_i{3}.0, Shape_i{2}.0, Elemwise{Composite{(i0 * (i1 // i0))}}.0, Elemwise{Composite{(i0 * (i1 // i0))}}.0)
204800000B [(1, 32, 20, 32, 50, 50)] v Rebroadcast{0}(GpuDimShuffle{x,0,1,2,3,4}.0)
204800000B [(1, 32, 20, 32, 50, 50)] c GpuAllocEmpty(Elemwise{Composite{(Switch(LT(maximum(i0, i1), i2), (maximum(i0, i1) + i1), (maximum(i0, i1) - i1)) + i1)}}[(0, 0)].0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0, Shape_i{4}.0)
204800000B [(1, 32, 20, 32, 50, 50)] i GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, Rebroadcast{0}.0, Constant{1})
204800000B [(32, 20, 32, 50, 50)] v GpuSubtensor{int64}(forall_inplace,gpu,scan_fn}.5, ScalarFromTensor.0)
204800000B [(1, 32, 20, 32, 50, 50)] v GpuDimShuffle{x,0,1,2,3,4}(featmaps)
134400000B [(21, 32, 20, 50, 50)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{add,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
128000000B [(20, 32, 20, 50, 50)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[[[ 0.]]]]]}, Elemwise{Composite{(Switch(LT((i0 - i1), i2), Switch(LT(((i0 - i1) + i3), i2), i2, ((i0 - i1) + i3)), Switch(LT((i0 - i1), i3), (i0 - i1), i3)) - i2)}}.0, Shape_i{0}.0, Shape_i{1}.0, Shape_i{2}.0, Shape_i{3}.0)
64003200B [(80004, 200)] i GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[[ 0.89999998]]}, <CudaNdarrayType(float32, matrix)>)
64003200B [(80004, 200)] i GpuElemwise{Add}[(0, 0)](GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
64003200B [(80004, 200)] i GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)](GpuFromHost.0, GpuSubtensor{int64}.0, GpuDimShuffle{x,x}.0)
64003200B [(80004, 200)] i GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[[ 0.89999998]]}, <CudaNdarrayType(float32, matrix)>)
64003200B [(80004, 200)] v GpuSubtensor{int64}(forall_inplace,gpu,grad_of_scan_fn}.11, ScalarFromTensor.0)
64003200B [(80004, 200)] v GpuSubtensor{int64}(forall_inplace,gpu,grad_of_scan_fn}.12, ScalarFromTensor.0)
64003200B [(1, 80004, 200)] c GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Elemwise{Composite{(Switch(LT(Composite{maximum(maximum(i0, i1), i1)}(i0, i1), i2), Switch(LT((Composite{maximum(maximum(i0, i1), i1)}(i0, i1) + i1 + i3), i2), i2, (Composite{maximum(maximum(i0, i1), i1)}(i0, i1) + i1 + i3)), Switch(LT(Composite{maximum(maximum(i0, i1), i1)}(i0, i1), i4), Composite{maximum(maximum(i0, i1), i1)}(i0, i1), i4)) - i2)}}[(0, 0)].0, Shape_i{0}.0, Shape_i{1}.0)
64003200B [(80004, 200)] i GpuElemwise{Composite{(i0 - ((i1 * i2) / sqrt((i3 + i4 + i5))))}}[(0, 0)](Wz, CudaNdarrayConstant{[[ 0.001]]}, GpuElemwise{Composite{Switch(i0, (i1 / i2), i1)}}[(0, 1)].0, CudaNdarrayConstant{[[ 9.99999997e-07]]}, GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
64003200B [(80004, 200)] i GpuElemwise{Add}[(0, 0)](GpuElemwise{Mul}[(0, 1)].0, GpuElemwise{Composite{(i0 * sqr(i1))},no_inplace}.0)
... (remaining 374 Apply account for 1040744597B/16352077661B ((6.36%)) of the Apply with dense outputs sizes)
<created/inplace/view> is taken from the Op's declaration.
Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
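
The section that follows is the separate profile the Scan op keeps for its inner function (scan_fn), i.e. the per-step graph behind the 97.8% forall_inplace,gpu,scan_fn entry above. A minimal sketch of a toy recurrence with inner-graph profiling turned on (an illustration of the mechanism only; step, x0 and W are made-up names, not the model's):

    import theano
    import theano.tensor as T

    x0 = T.vector('x0')
    W = T.matrix('W')

    def step(x_prev, W):
        # one step of a toy linear recurrence
        return T.tanh(T.dot(x_prev, W))

    # profile=True on theano.scan asks the Scan op to keep a profile of its
    # inner function, which surfaces as a "Scan Op profiling" block like the
    # one below when the outer function is profiled as well.
    outputs, updates = theano.scan(step,
                                   outputs_info=[x0],
                                   non_sequences=[W],
                                   n_steps=20,
                                   profile=True)
    f = theano.function([x0, W], outputs[-1], profile=True)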
Scan Op profiling ( scan_fn )
==================
Message: None
Time in 4 calls of the op (for a total of 80 steps) 1.041707e+02s
Total time spent in calling the VM 1.040035e+02s (99.840%)
Total overhead (computing slices..) 1.671259e-01s (0.160%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
24.0% 24.0% 24.673s 6.17e-02s Py 400 5 theano.tensor.subtensor.AdvancedIncSubtensor
20.8% 44.8% 21.443s 2.68e-02s C 800 10 theano.sandbox.cuda.basic_ops.HostFromGpu
20.5% 65.3% 21.132s 5.28e-02s C 400 5 theano.sandbox.cuda.dnn.GpuDnnConvGradW
16.6% 81.9% 17.052s 3.55e-02s C 480 6 theano.sandbox.cuda.dnn.GpuDnnConv
14.0% 95.8% 14.380s 1.63e-02s C 880 11 theano.sandbox.cuda.basic_ops.GpuFromHost
2.4% 98.2% 2.449s 1.02e-02s C 240 3 theano.tensor.basic.Alloc
0.8% 99.0% 0.856s 4.87e-04s C 1760 22 theano.sandbox.cuda.basic_ops.GpuReshape
0.3% 99.3% 0.303s 7.57e-04s Py 400 5 theano.tensor.subtensor.AdvancedSubtensor
0.2% 99.5% 0.208s 8.97e-05s C 2320 29 theano.sandbox.cuda.basic_ops.GpuElemwise
0.2% 99.7% 0.183s 5.73e-04s C 320 4 theano.sandbox.cuda.blas.GpuDot22
0.2% 99.9% 0.168s 1.05e-03s C 160 2 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.0% 99.9% 0.035s 8.86e-05s C 400 5 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.0% 99.9% 0.034s 4.30e-04s C 80 1 theano.sandbox.cuda.basic_ops.GpuJoin
0.0% 99.9% 0.011s 4.21e-06s C 2560 32 theano.compile.ops.Shape_i
0.0% 100.0% 0.010s 4.14e-05s C 240 3 theano.sandbox.cuda.blas.GpuGemm
0.0% 100.0% 0.009s 1.08e-05s C 880 11 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.0% 100.0% 0.009s 4.54e-06s C 2000 25 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.008s 7.33e-06s C 1040 13 theano.sandbox.cuda.basic_ops.GpuContiguous
0.0% 100.0% 0.006s 5.01e-06s C 1200 15 theano.tensor.opt.MakeVector
0.0% 100.0% 0.005s 4.72e-06s C 1120 14 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 6 Classes account for 0.01%(0.01s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
24.0% 24.0% 24.673s 6.17e-02s Py 400 5 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}
20.8% 44.8% 21.443s 2.68e-02s C 800 10 HostFromGpu
20.5% 65.3% 21.132s 5.28e-02s C 400 5 GpuDnnConvGradW{algo='none', inplace=True}
16.6% 81.9% 17.052s 3.55e-02s C 480 6 GpuDnnConv{algo='small', inplace=True}
14.0% 95.8% 14.380s 1.63e-02s C 880 11 GpuFromHost
2.4% 98.2% 2.449s 1.02e-02s C 240 3 Alloc
0.8% 99.0% 0.850s 9.66e-04s C 880 11 GpuReshape{4}
0.3% 99.3% 0.303s 7.57e-04s Py 400 5 AdvancedSubtensor
0.2% 99.5% 0.183s 5.73e-04s C 320 4 GpuDot22
0.2% 99.7% 0.168s 1.05e-03s C 160 2 GpuIncSubtensor{Set;::, int32}
0.1% 99.8% 0.105s 2.62e-04s C 400 5 GpuElemwise{Composite{(((i0 * i1 * (i2 - Composite{scalar_sigmoid((i0 + i1))}(i3, i4))) / i5) - ((i0 * i6 * Composite{scalar_sigmoid((i0 + i1))}(i3, i4)) / i5))}}[(0, 3)]
0.0% 99.8% 0.035s 8.86e-05s C 400 5 GpuCAReduce{add}{0,1,1,1}
0.0% 99.8% 0.034s 4.30e-04s C 80 1 GpuJoin
0.0% 99.9% 0.025s 3.09e-04s C 80 1 GpuElemwise{Mul}[(0, 0)]
0.0% 99.9% 0.015s 1.86e-04s C 80 1 GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))}}[(0, 1)]
0.0% 99.9% 0.013s 4.00e-05s C 320 4 GpuElemwise{Sub}[(0, 0)]
0.0% 99.9% 0.010s 4.14e-05s C 240 3 GpuGemm{inplace}
0.0% 99.9% 0.009s 1.08e-05s C 880 11 GpuAllocEmpty
0.0% 99.9% 0.009s 2.71e-05s C 320 4 GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
0.0% 99.9% 0.008s 1.02e-04s C 80 1 GpuElemwise{sub,no_inplace}
... (remaining 46 Ops account for 0.09%(0.09s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
4.8% 4.8% 4.979s 6.22e-02s 80 149 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Reshape{1}.0, Reshape{1}.0, Reshape{1}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
input 1: dtype=float32, shape=(640, 50, 50), strides=c
input 2: dtype=int64, shape=(640,), strides=c
input 3: dtype=int64, shape=(640,), strides=c
input 4: dtype=int64, shape=(640,), strides=c
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.8% 9.6% 4.926s 6.16e-02s 80 172 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Reshape{1}.0, Reshape{1}.0, Reshape{1}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
input 1: dtype=float32, shape=(640, 50, 50), strides=c
input 2: dtype=int64, shape=(640,), strides=c
input 3: dtype=int64, shape=(640,), strides=c
input 4: dtype=int64, shape=(640,), strides=c
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.8% 14.4% 4.924s 6.16e-02s 80 195 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Reshape{1}.0, Reshape{1}.0, Reshape{1}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
input 1: dtype=float32, shape=(640, 50, 50), strides=c
input 2: dtype=int64, shape=(640,), strides=c
input 3: dtype=int64, shape=(640,), strides=c
input 4: dtype=int64, shape=(640,), strides=c
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.8% 19.2% 4.922s 6.15e-02s 80 241 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Reshape{1}.0, Reshape{1}.0, Reshape{1}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
input 1: dtype=float32, shape=(640, 50, 50), strides=c
input 2: dtype=int64, shape=(640,), strides=c
input 3: dtype=int64, shape=(640,), strides=c
input 4: dtype=int64, shape=(640,), strides=c
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.8% 24.0% 4.922s 6.15e-02s 80 218 AdvancedIncSubtensor{inplace=False, set_instead_of_inc=False}(Alloc.0, HostFromGpu.0, Reshape{1}.0, Reshape{1}.0, Reshape{1}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
input 1: dtype=float32, shape=(640, 50, 50), strides=c
input 2: dtype=int64, shape=(640,), strides=c
input 3: dtype=int64, shape=(640,), strides=c
input 4: dtype=int64, shape=(640,), strides=c
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.1% 28.1% 4.251s 5.31e-02s 80 136 HostFromGpu(GpuReshape{5}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=(1600000, 80000, 2500, 50, 1)
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.1% 32.2% 4.228s 5.28e-02s 80 154 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.1}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 2: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
4.1% 36.3% 4.227s 5.28e-02s 80 245 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.1}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 2: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
4.1% 40.4% 4.226s 5.28e-02s 80 223 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.1}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 2: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
4.1% 44.5% 4.226s 5.28e-02s 80 177 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.1}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 2: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
4.1% 48.6% 4.225s 5.28e-02s 80 200 GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.1}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 2: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
4.1% 52.7% 4.197s 5.25e-02s 80 159 HostFromGpu(GpuReshape{5}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=(1600000, 80000, 2500, 50, 1)
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.1% 56.8% 4.195s 5.24e-02s 80 228 HostFromGpu(GpuReshape{5}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=(1600000, 80000, 2500, 50, 1)
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.1% 60.8% 4.195s 5.24e-02s 80 182 HostFromGpu(GpuReshape{5}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=(1600000, 80000, 2500, 50, 1)
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
4.1% 64.9% 4.192s 5.24e-02s 80 205 HostFromGpu(GpuReshape{5}.0)
input 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=(1600000, 80000, 2500, 50, 1)
output 0: dtype=float32, shape=(32, 20, 32, 50, 50), strides=c
3.3% 68.2% 3.407s 4.26e-02s 80 157 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 2: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
3.3% 71.5% 3.406s 4.26e-02s 80 203 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 2: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
3.3% 74.8% 3.406s 4.26e-02s 80 180 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 2: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
3.3% 78.1% 3.406s 4.26e-02s 80 226 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 2: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
3.3% 81.4% 3.406s 4.26e-02s 80 134 GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
input 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 1: dtype=float32, shape=(32, 32, 9, 9), strides=(2592, 81, 9, 1)
input 2: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
input 3: dtype=no dtype, shape=input no shape, strides=input no strides
input 4: dtype=float32, shape=4, strides=c
input 5: dtype=float32, shape=4, strides=c
output 0: dtype=float32, shape=(640, 32, 50, 50), strides=(80000, 2500, 50, 1)
... (remaining 227 Apply instances account for 18.57%(19.12s) of the runtime)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 60, in _atexit_print_fn
n_apply_to_print=config.profiling.n_apply)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 1240, in summary
self.summary_memory(file, n_apply_to_print)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 1099, in summary_memory
ord, fgraph, nodes_mem, ignore_dmap=ignore_dmap)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 841, in count_running_memory
viewed_by[origin].remove(ins)
ValueError: list.remove(x): x not in list
Error in sys.exitfunc:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 60, in _atexit_print_fn
n_apply_to_print=config.profiling.n_apply)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 1240, in summary
self.summary_memory(file, n_apply_to_print)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 1099, in summary_memory
ord, fgraph, nodes_mem, ignore_dmap=ignore_dmap)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.9.0dev0-py2.7.egg/theano/compile/profiling.py", line 841, in count_running_memory
viewed_by[origin].remove(ins)
ValueError: list.remove(x): x not in list
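
The two tracebacks above come from Theano's atexit profile-printing hook (summary_memory in theano/compile/profiling.py), not from the profiled recurrent_local_online.py run itself; the timing and memory tables above had already been printed when the hook failed. A hedged workaround sketch, assuming a function f compiled with profile=True as in the earlier snippets: print the summary explicitly before exit and guard against this particular failure, so the report is still emitted even if the memory accounting raises.

    import sys

    try:
        # ProfileStats.summary() is the same entry point the atexit hook uses
        # (see the traceback above); calling it before exit gives a chance to
        # catch the failure instead of losing the report.
        f.profile.summary()
    except ValueError as e:
        # matches the failure mode shown above:
        # "list.remove(x): x not in list" inside count_running_memory()
        sys.stderr.write("profiler memory summary failed: %s\n" % e)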