This repository has been archived by the owner on Jun 9, 2023. It is now read-only.

what is option2 #19

Open
ethem-kinginthenorth opened this issue Aug 10, 2020 · 8 comments

Comments

@ethem-kinginthenorth

In this example, I am seeing "option 1". What is option2?

Is there a clear example that shows how to use nvtx plugins:

  1. is it op trace, start, and end?
  2. is it nvtx hook?
  3. both?

I am trying to get the nvtx plugin working, but I keep getting "The application terminated before the collection started. No report was generated". I am definitely doing something wrong, but where?

@DEKHTIARJonathan
Collaborator

DEKHTIARJonathan commented Aug 10, 2020

Hi,

FYI: @ahmadki

You're right, our examples are probably not commented well enough ...

  • Option 1: nvtx.plugins.tf.ops.trace decorator
  • Option 2: nvtx.plugins.tf.ops.start and nvtx.plugins.tf.ops.end
  • Option 3 (Keras Only): nvtx.plugins.tf.keras.layers.NVTXStart and nvtx.plugins.tf.keras.layers.NVTXEnd

I am trying to get the nvtx plugin working, but I keep getting "The application terminated before the collection started. No report was generated". I am definitely doing something wrong, but where?

This message means that the delay to start profiling is longer than the time it takes for your program to finish.

If you look here: https://github.com/NVIDIA/nvtx-plugins/blob/master/examples/run_tf_session.sh

nsys profile \
  -d 60 \
  -w true \
  --force-overwrite=true \
  --sample=cpu \
  -t 'nvtx,cuda' \
  --stop-on-exit=true \
  --kill=sigkill \
  -o examples/tf_session_example \
  python examples/tf_session_example.py

Adjust these settings and you'll be fine ;)

$ nsys profile --help

	-y, --delay=
	   Collection start delay in seconds.
	   Default is 0.

	-d, --duration=
	   Collection duration in seconds.
	   Default is 0 seconds.

	...
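
To make the interaction between the two options concrete, here is a minimal Python sketch that assembles an nsys command line with an explicit delay and duration. The helper name `build_nsys_command` and the values 10 and 30 are made up for illustration; only flags quoted in this thread are used, and the script path is the one from the example above.

```python
import shlex


def build_nsys_command(script_args, output, delay_s=0, duration_s=60):
    """Assemble an `nsys profile` argv list (hypothetical helper).

    delay_s maps to -y and duration_s maps to -d. If the program exits
    before delay_s seconds have elapsed, nsys has nothing to collect.
    """
    if delay_s < 0 or duration_s < 0:
        raise ValueError("delay and duration must be non-negative")
    return [
        "nsys", "profile",
        "-y", str(delay_s),
        "-d", str(duration_s),
        "-w", "true",
        "--force-overwrite=true",
        "-t", "nvtx,cuda",
        "--stop-on-exit=true",
        "-o", output,
        *script_args,
    ]


# Illustrative values: skip 10 s of warmup, then collect for 30 s.
cmd = build_nsys_command(
    ["python", "examples/tf_session_example.py"],
    output="examples/tf_session_example",
    delay_s=10,
    duration_s=30,
)
print(shlex.join(cmd))
```

The delay should stay well below the total runtime of the script, otherwise the "terminated before the collection started" message is the likely result.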

@ethem-kinginthenorth
Author

ethem-kinginthenorth commented Aug 10, 2020

I am looking more at this one:

# Option 1: use decorators
@nvtx_tf.ops.trace(message='Dense Block', grad_message='Dense Block grad',
                   domain_name='Forward', grad_domain_name='Gradient',
                   enabled=ENABLE_NVTX, trainable=True)

  1. then I am seeing

x = inputs
x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1',
    grad_message='Dense 1 grad', domain_name='Forward',
    grad_domain_name='Gradient', trainable=True, enabled=ENABLE_NVTX)
x = tf.compat.v1.layers.dense(x, 1024, activation=tf.nn.relu, name='dense_1')
x = nvtx_tf.ops.end(x, nvtx_context)

  2. then I also see:

nvtx_callback = NVTXHook(skip_n_steps=1, name='Train')
with tf.compat.v1.train.MonitoredSession(hooks=[nvtx_callback]) as sess:

I tried 1) and 2) individually, as well as combined, but I still cannot get it to run.

The command I use to invoke the profiler is below:

nsys profile -w true -t "cudnn,cuda,osrt,nvtx" -c cudaProfilerApi --stop-on-range-end true --stop-on-exit=true \
--kill=sigkill \
--export=sqlite -o ./test python main.py --arch resnet50 \
--mode train --data_dir /raid/ethem/tfr_small \
--export_dir /raid/ethem/results \
--batch_size 128 --num_iter 1 \
--iter_unit epoch --results_dir /raid/ethem/results \
--display_every 10 --lr_init 0.01 --seed 12345

I am interested in profiling certain places rather than a certain period of time. Thanks.

@DEKHTIARJonathan
Collaborator

Try the following:

nsys profile \
  -d 60 \
  -w true \
  --force-overwrite=true \
  --sample=cpu \
  -t 'nvtx,cuda' \
  --stop-on-exit=true \
  --kill=sigkill \
  -o examples/tf_session_example \
  python main.py \
    --arch resnet50 \
    --mode train \
    --data_dir /raid/ethem/tfr_small \
    --export_dir /raid/ethem/results \
    --batch_size 128 \
    --num_iter 1 \
    --iter_unit epoch \
    --results_dir /raid/ethem/results \
    --display_every 10 \
    --lr_init 0.01 \
    --seed 12345

You don't need to combine Option 1 & 2 & 3. They are completely independent.


I am interested in profiling certain places rather than a certain period of time.

You don't want to profile the whole training run: it doesn't make sense and it will hurt your performance. Profiling is meant to be done on a short script that is representative of the normal script. You can use some delay to account for warmup and library loading, but a profiling script doesn't need more than 50 good iterations to be useful.

@ethem-kinginthenorth
Author

@DEKHTIARJonathan thanks for the suggestion. I will try.

You touched on a great point. I use pyprof with PyTorch, where I can control how many iterations to profile. I usually profile 1 or 2 iterations (e.g., the 10th iteration), which gives me what I want. That is why I tried the start and end nvtx plugins: to do the same thing with TensorFlow. Is there a way to control this, or can it only be done with time-related parameters?

@ethem-kinginthenorth
Author

Here is the verdict: when I add "-c cudaProfilerApi" I get the message "The application terminated before the collection started. No report was generated." When I remove it, I do get results, regardless of the "-y" and "-d" parameters. The documentation says: "profiling will start only when cudaProfilerStart API is invoked", so I am a bit puzzled. I checked nvidia-smi to make sure the GPU is being used, and I can see the usage. I appreciate any guidance here. Thanks.

@ethem-kinginthenorth
Author

So it seems like -c cudaProfilerApi is not working. As far as I understand, using start and end to limit the part of the code that is profiled depends on this parameter (along with --stop-on-range-end true). Therefore, that does NOT work either. Please correct me if I am wrong.

@rrforte

rrforte commented Aug 13, 2020

Using -c cudaProfilerApi would limit capturing to the scope in between cudaProfilerStart()/cudaProfilerStop() calls. Unless you call these functions in your code (maybe using ctypes?), this means nothing will be captured since nvtx-plugins does not call those functions, and AFAIK tensorflow doesn't either (though I am not certain).
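
A rough, untested sketch of the ctypes idea follows. It assumes the CUDA runtime library can be found by the system loader under the name `cudart`; `ProfilerWindow`, `before_step` and `after_step` are hypothetical names introduced only for this example. `cudaProfilerStart` and `cudaProfilerStop` are the CUDA runtime functions mentioned above.

```python
import ctypes
import ctypes.util


def load_cudart():
    """Try to load the CUDA runtime library; return None if unavailable."""
    # Assumption: the runtime is discoverable by the system loader.
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


class ProfilerWindow:
    """Open the capture range for steps start_step .. stop_step - 1.

    Hypothetical helper: `cudart` is any object exposing
    cudaProfilerStart() and cudaProfilerStop().
    """

    def __init__(self, cudart, start_step, stop_step):
        if stop_step <= start_step:
            raise ValueError("stop_step must be greater than start_step")
        self.cudart = cudart
        self.start_step = start_step
        self.stop_step = stop_step
        self.active = False

    def before_step(self, step):
        # Start capturing right before the first step of interest.
        if self.cudart is not None and not self.active and step == self.start_step:
            self.cudart.cudaProfilerStart()
            self.active = True

    def after_step(self, step):
        # Stop capturing right after the last step of interest.
        if self.active and step == self.stop_step - 1:
            self.cudart.cudaProfilerStop()
            self.active = False
```

The intended use would be something like `window = ProfilerWindow(load_cudart(), 10, 11)`, with `window.before_step(i)` called before and `window.after_step(i)` called after each training step, while running under nsys with `-c cudaProfilerApi`. Whether nsys picks up calls made through a runtime library loaded this way should be verified on your own setup.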

Instead, you can try using -c nvtx -e NSYS_NVTX_PROFILER_REGISTER_ONLY=0 then specify your outer range message and domain using -p message@domain (or just -p message if you are using the default domain).
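
As a small illustration of how those flags fit together, the Python sketch below builds the extra nsys arguments for an NVTX capture range. `nvtx_capture_args` is a hypothetical helper name; 'Dense Block' and 'Forward' are the message and domain from the decorator example earlier in this thread.

```python
def nvtx_capture_args(message, domain=None):
    """Return nsys flags that limit capture to one NVTX range.

    Hypothetical helper: formats the -p value as message@domain, or just
    message when the default domain is used.
    """
    if not message:
        raise ValueError("message must be non-empty")
    target = message if domain is None else f"{message}@{domain}"
    return [
        "-c", "nvtx",
        "-p", target,
        "-e", "NSYS_NVTX_PROFILER_REGISTER_ONLY=0",
    ]


args = nvtx_capture_args("Dense Block", domain="Forward")
print(args)
```

These arguments would go between `nsys profile` and the `python ...` command being profiled.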

Of course, you can disable the capture range limit entirely by removing -c. You already mentioned that you do get some results when doing so. Are the results OK in this case? Is there a particular reason for using -c cudaProfilerApi?

@ethem-kinginthenorth
Author

ethem-kinginthenorth commented Aug 18, 2020

@rrforte I am trying to profile only 1 iteration. In PyTorch, I use pyprof and start and stop profiling for a specific iteration. With start and stop I use -c cudaProfilerApi --stop-on-range-end true. I am trying to do the same on the TensorFlow side. Whatever I do, I keep getting a profile of more than 1 iteration. I appreciate any help with that.
