what is option2 #19

ethem-kinginthenorth · 2020-08-10T19:00:11Z

In this example, I am seeing "option 1". What is option2?

Is there a clear example that shows how to use nvtx plugins:

is it op trace, start, and end?
is it nvtx hook?
both?

I am trying to get the nvtx plugin working, but I keep getting "The application terminated before the collection started. No report was generated". I am definitely doing something wrong, but where?

DEKHTIARJonathan · 2020-08-10T19:33:26Z

Hi,

FYI: @ahmadki

Right on, we probably did not comment our examples as well as we could have ...

Option 1: nvtx.plugins.tf.ops.trace decorator
Option 2: nvtx.plugins.tf.ops.start and nvtx.plugins.tf.ops.end
Option 3 (Keras Only): nvtx.plugins.tf.keras.layers.NVTXStart and nvtx.plugins.tf.keras.layers.NVTXEnd

I am trying to get the nvtx plugin working, but I keep getting "The application terminated before the collection started. No report was generated". I am definitely doing something wrong, but where?

This message means that the delay to start profiling is longer than the time it takes for your program to finish.

If you look here: https://github.com/NVIDIA/nvtx-plugins/blob/master/examples/run_tf_session.sh

nsys profile \
  -d 60 \
  -w true \
  --force-overwrite=true \
  --sample=cpu \
  -t 'nvtx,cuda' \
  --stop-on-exit=true \
  --kill=sigkill \
  -o examples/tf_session_example \
  python examples/tf_session_example.py

Adjust these settings and you'll be fine ;)

$ nsys profile --help

	-y, --delay=
	   Collection start delay in seconds.
	   Default is 0.

	-d, --duration=
	   Collection duration in seconds.
	   Default is 0 seconds.

	...

ethem-kinginthenorth · 2020-08-10T19:42:05Z

I am looking more at this one:

# Option 1: use decorators
@nvtx_tf.ops.trace(message='Dense Block', grad_message='Dense Block grad',
                   domain_name='Forward', grad_domain_name='Gradient',
                   enabled=ENABLE_NVTX, trainable=True)

then I am seeing

x = inputs
x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1',
    grad_message='Dense 1 grad', domain_name='Forward',
    grad_domain_name='Gradient', trainable=True, enabled=ENABLE_NVTX)
x = tf.compat.v1.layers.dense(x, 1024, activation=tf.nn.relu, name='dense_1')
x = nvtx_tf.ops.end(x, nvtx_context)

then I also see:

nvtx_callback = NVTXHook(skip_n_steps=1, name='Train')
with tf.compat.v1.train.MonitoredSession(hooks=[nvtx_callback]) as sess:

I tried 1) and 2) individually, as well as in combination, but I still cannot get it to run.

The command I used to invoke the profiler is below:

nsys profile -w true -t "cudnn,cuda,osrt,nvtx" -c cudaProfilerApi --stop-on-range-end true --stop-on-exit=true \
--kill=sigkill \
--export=sqlite -o ./test python main.py --arch resnet50 \
--mode train --data_dir /raid/ethem/tfr_small \
--export_dir /raid/ethem/results \
--batch_size 128 --num_iter 1 \
--iter_unit epoch --results_dir /raid/ethem/results \
--display_every 10 --lr_init 0.01 --seed 12345

I am interested in profiling certain places rather than a certain period of time. thanks

DEKHTIARJonathan · 2020-08-10T20:06:24Z

Try the following:

nsys profile \
  -d 60 \
  -w true \
  --force-overwrite=true \
  --sample=cpu \
  -t 'nvtx,cuda' \
  --stop-on-exit=true \
  --kill=sigkill \
  -o examples/tf_session_example \
  python main.py \
    --arch resnet50 \
    --mode train \
    --data_dir /raid/ethem/tfr_small \
    --export_dir /raid/ethem/results \
    --batch_size 128 \
    --num_iter 1 \
    --iter_unit epoch \
    --results_dir /raid/ethem/results \
    --display_every 10 \
    --lr_init 0.01 \
    --seed 12345

You don't need to combine Option 1 & 2 & 3. They are completely independent.

I am interested in profiling certain places rather than a certain period of time.

You don't want to profile the whole training run: it doesn't make sense and it will hurt your performance. Profiling is meant to be done on a short script that is representative of the normal script. You can use some delay to account for warmup and library loading, but a profiling script doesn't need more than 50 good iterations to be useful.

ethem-kinginthenorth · 2020-08-10T20:14:52Z

@DEKHTIARJonathan thanks for the suggestion. I will try.

You touched on a great point. I use pyprof with PyTorch, where I can control how many iterations to profile. I usually profile 1 or 2 iterations (e.g., the 10th iteration), which gives me what I want. That is why I tried the nvtx start and end plugins: to do the same thing with TensorFlow. Is there a way to control this, or does it only use time-related parameters?

ethem-kinginthenorth · 2020-08-11T14:43:55Z

Here is the verdict: when I add "-c cudaProfilerApi", I get the message "The application terminated before the collection started. No report was generated." When I remove it, I get a report regardless of the "-y" and "-d" parameters. The documentation says: "profiling will start only when cudaProfilerStart API is invoked", so I am a bit puzzled here. I checked nvidia-smi to make sure the GPU is being used, and I can see the usage. I appreciate any guidance here. Thanks.

ethem-kinginthenorth · 2020-08-13T14:04:47Z

So it seems like -c cudaProfilerApi is not working. As far as I understand, using start and end to limit the part of the code being profiled depends on this parameter (along with --stop-on-range-end true), so that does NOT work either. Please correct me if I am wrong.

rrforte · 2020-08-13T21:17:13Z

Using -c cudaProfilerApi would limit capturing to the scope in between cudaProfilerStart()/cudaProfilerStop() calls. Unless you call these functions in your code (maybe using ctypes?), this means nothing will be captured since nvtx-plugins does not call those functions, and AFAIK tensorflow doesn't either (though I am not certain).
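To make the ctypes idea concrete, here is a rough, untested-on-GPU sketch. It assumes the CUDA runtime can be found under the name "cudart" or "libcudart.so" and that it is the same runtime library your framework has loaded; the helper names load_cudart and cuda_profiler_range are made up for this example and are not part of nvtx-plugins.

```python
import contextlib
import ctypes
import ctypes.util


def load_cudart():
    """Return a ctypes handle to the CUDA runtime, or None if it is not found."""
    # Assumption: the runtime is discoverable as "cudart" or "libcudart.so".
    candidates = [ctypes.util.find_library("cudart"), "libcudart.so"]
    for name in candidates:
        if not name:
            continue
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


@contextlib.contextmanager
def cuda_profiler_range(cudart):
    """Call cudaProfilerStart() on entry and cudaProfilerStop() on exit."""
    if cudart is None:
        # No CUDA runtime available: run the body without touching the profiler.
        yield False
        return
    # Both functions take no arguments and return a cudaError_t (0 is cudaSuccess).
    started = cudart.cudaProfilerStart() == 0
    try:
        yield started
    finally:
        cudart.cudaProfilerStop()
```

The intent would be to wrap only the iteration you care about (for example, the 10th call to sess.run) in "with cuda_profiler_range(load_cudart()):" and keep launching nsys with -c cudaProfilerApi, so that the capture range has something to trigger on.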

Instead, you can try using -c nvtx -e NSYS_NVTX_PROFILER_REGISTER_ONLY=0 then specify your outer range message and domain using -p message@domain (or just -p message if you are using the default domain).
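As a sketch of what that command line could look like, the small helper below assembles the arguments. It only uses flags that already appear in this thread; the function name nsys_nvtx_capture_cmd is hypothetical, and the 'Dense 1' / 'Forward' values are borrowed from the snippet posted above purely as an illustration.

```python
import shlex


def nsys_nvtx_capture_cmd(app_argv, message, domain=None, output="report"):
    """Build an nsys command line that captures only the given NVTX range."""
    # "message@domain" selects a range in a named domain; a bare "message"
    # selects it in the default domain.
    nvtx_range = f"{message}@{domain}" if domain else message
    return [
        "nsys", "profile",
        "-t", "nvtx,cuda",
        "-c", "nvtx",
        "-p", nvtx_range,
        "-e", "NSYS_NVTX_PROFILER_REGISTER_ONLY=0",
        "--stop-on-range-end", "true",
        "-o", output,
        *app_argv,
    ]


cmd = nsys_nvtx_capture_cmd(
    ["python", "main.py", "--mode", "train"],
    message="Dense 1", domain="Forward", output="./test",
)
# shlex.join quotes the range so the space in the message survives the shell.
print(shlex.join(cmd))
```

Building the list in Python rather than by hand mainly avoids quoting mistakes when the range message contains spaces.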

Of course, you can disable the capture range limit entirely by removing -c. You already mentioned that you do get some results when doing so. Are the results OK in this case? Is there a particular reason for using -c cudaProfilerApi?

ethem-kinginthenorth · 2020-08-18T23:15:18Z

@rrforte I am trying to profile only 1 iteration. In PyTorch, I use pyprof and start and stop profiling for a certain iteration. With start and stop I use -c cudaProfilerApi --stop-on-range-end true. I am trying to do the same on the TensorFlow side. Whatever I do, I keep getting a profile of more than 1 iteration. I appreciate any help on that.
