This file contains information about using GPTL. For information on building
and installing GPTL, see the file INSTALL.
GPTL is the "General Purpose Timing Library". It can be used to manually
instrument application codes with an arbitrary set of "regions" (or "timers")
over which statistics such as wallclock time and CPU time are gathered and
subsequently printed. If the target application is built with the GNU
compilers (gcc or gfortran), Pathscale (pathcc or pathf95), or PGI compilers,
GPTL can also be used to automatically instrument regions which are defined
by function entry and exit points. This is an easy way to generate a dynamic
call tree. See "Auto-instrumentation" below for a description of how to use
this feature.
Similar to compiler-generated auto-instrumentation, GPTL can intercept and
auto-profile MPI calls made by the application if the target MPI library
supports the PMPI profiling layer. In this case an estimate of bytes
transferred by each MPI call is presented in the printed output.
If the PAPI library is installed (http://icl.cs.utk.edu/papi), GPTL
also provides a convenient mechanism to access all available PAPI events. In
addition to PAPI preset and native events, GPTL defines derived events which
are based on PAPI counters. See gptl.h for a list of available derived events.
Of course these events can only be enabled if the PAPI counters they require
are available on the target architecture.

Using GPTL
----------

C codes making GPTL library calls should #include <gptl.h>. Fortran codes can
either "use gptl" or include the file gptl.inc (via #include or the Fortran
include statement). The C and Fortran interfaces are identical, except that
the C interface uses mixed case. All user-accessible functions return either
0 (success) or -1 (failure). Example codes that use the library can be found
in subdirectories ctests/ and ftests/.
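Since every call returns 0 or -1, checking each one is cheap. The sketch below
shows one possible pattern and is not part of GPTL: check(), setup_timing(),
time_region() and do_work() are hypothetical helper names, and HAVE_GPTL is a
hypothetical build macro. When that macro is not defined, stand-in stubs are
used so the file compiles without the library. The stubs assume that timer
calls fail until the library has been initialized.

```c
#include <assert.h>
#include <stdio.h>

#ifdef HAVE_GPTL   /* hypothetical macro: define it when building against GPTL */
#include <gptl.h>
#else
/* Stand-in stubs, NOT the real library. Assumption: starting or stopping a
   timer returns -1 until GPTLinitialize() has been called. */
static int stub_initialized = 0;
static int GPTLinitialize(void) { stub_initialized = 1; return 0; }
static int GPTLstart(const char *name) { (void) name; return stub_initialized ? 0 : -1; }
static int GPTLstop(const char *name) { (void) name; return stub_initialized ? 0 : -1; }
#endif

/* Report a failed call on stderr and pass the return code through. */
int check(int ret, const char *what)
{
  if (ret != 0)
    fprintf(stderr, "GPTL call failed: %s returned %d\n", what, ret);
  return ret;
}

/* Initialize the library once, before any timers are started. */
int setup_timing(void)
{
  return check(GPTLinitialize(), "GPTLinitialize");
}

/* Placeholder for the code being timed. */
void do_work(void)
{
  volatile double x = 0.0;
  int i;
  for (i = 0; i < 1000; ++i)
    x += i;
}

/* Time one call to work() as region "name". Returns 0 on success, -1 if
   either the start or the stop call failed. */
int time_region(const char *name, void (*work)(void))
{
  assert(name != NULL && work != NULL);
  if (check(GPTLstart(name), "GPTLstart") != 0)
    return -1;
  work();
  if (check(GPTLstop(name), "GPTLstop") != 0)
    return -1;
  return 0;
}
```

With the stubs, calling time_region("do_work", do_work) before setup_timing()
reports the failure and returns -1. After setup_timing() it returns 0.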
Code instrumentation to utilize GPTL involves zero or more calls to
GPTLsetoption(), then a single call to GPTLinitialize(), then an arbitrary
sequence of calls to GPTLstart() and GPTLstop(), and finally a call to
GPTLpr() or GPTLpr_file(). See "Example" below for a sample calling
sequence. Calls to GPTLstart() and GPTLstop() are thread-safe, with per-thread
statistics printed by GPTLpr() or GPTLpr_file().
The purpose of GPTLsetoption() is to enable or disable various library
options. For example, to enable the PAPI counter for total cycles, do this:
  ret = GPTLsetoption (PAPI_TOT_CYC, 1);
The "1" says "enable". Use "0" for "disable". See the man pages for complete
documentation on function usage and arguments. The list of available GPTL
options is contained in gptl.h, and a complete list of available PAPI-based
events can be found by running "ctests/avail".
GPTLinitialize() initializes the GPTL library.
There can be an arbitrary number of start/stop pairs before GPTLpr() or
GPTLpr_file() is called to print the results. Regions can also be nested to
an arbitrary depth. The printed results are indented to indicate the level
of nesting for each region.
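To illustrate how nesting can translate into indented output, here is a toy
sketch. It is not GPTL's implementation and records no timings: toy_start(),
toy_stop() and toy_print() are hypothetical names, and the sketch assumes that
regions are closed in the reverse order in which they were opened.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_MAX_REGIONS 64

typedef struct {
  const char *name;   /* region name passed to toy_start() */
  int depth;          /* nesting level when the region was started */
} toy_region;

static toy_region toy_log[TOY_MAX_REGIONS];
static int toy_nlog = 0;    /* number of regions recorded so far */
static int toy_depth = 0;   /* number of regions currently open */

/* Open a region: remember its name and the current nesting level. */
int toy_start(const char *name)
{
  if (name == NULL || toy_nlog >= TOY_MAX_REGIONS)
    return -1;
  toy_log[toy_nlog].name = name;
  toy_log[toy_nlog].depth = toy_depth;
  ++toy_nlog;
  ++toy_depth;
  return 0;
}

/* Close the most recently opened region. */
int toy_stop(void)
{
  if (toy_depth == 0)
    return -1;
  --toy_depth;
  return 0;
}

/* Print one line per region, indented two spaces per nesting level. */
void toy_print(FILE *fp)
{
  int i;
  assert(fp != NULL);
  for (i = 0; i < toy_nlog; ++i)
    fprintf(fp, "%*s%s\n", 2 * toy_log[i].depth, "", toy_log[i].name);
}
```

Starting "total", then "do_work", stopping both and calling toy_print(stdout)
prints "total" at the left margin and "do_work" indented by two spaces beneath
it.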
GPTLpr() prints the results to a file named timing.<number>, where the single
argument <number> is an integer. For MPI jobs, it is most convenient to use
the MPI rank of the invoking task for <number>. Alternatively, function
GPTLpr_file() can be called. Its input argument is a character string
naming the output file to be written. It is up to the user to ensure that
these print functions write to uniquely named files, so that one task does
not overwrite the output of another.
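One way to keep file names unique when calling GPTLpr_file() is to build the
name from a run label and the MPI rank. The helper below is a sketch only:
make_timing_name() and the "<prefix>.<rank>" naming scheme are hypothetical
and not something GPTL provides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "<prefix>.<rank>" into buf (capacity len bytes).
   Returns 0 on success, -1 if the arguments are unusable or the name would
   be truncated. */
int make_timing_name(char *buf, size_t len, const char *prefix, int rank)
{
  int n;
  assert(buf != NULL && prefix != NULL);
  if (strlen(prefix) == 0 || rank < 0)
    return -1;
  n = snprintf(buf, len, "%s.%d", prefix, rank);
  if (n < 0 || (size_t) n >= len)
    return -1;   /* formatting error or buffer too small */
  return 0;
}
```

For example, prefix "timing.run1" and rank 3 give "timing.run1.3", which can
then be passed to GPTLpr_file().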
GPTLfinalize() can be called to clean up the GPTL environment. All space
malloc'ed by the GPTL library will be freed by this call.

Example
-------

From "man GPTLstart", a simple example calling sequence to time a couple of
code regions and print the results is:
  (void) GPTLsetoption (GPTLcpu, 1);       /* enable cpu timings */
  (void) GPTLsetoption (GPTLwall, 0);      /* disable wallclock timings */
  (void) GPTLsetoption (PAPI_TOT_CYC, 1);  /* enable counting of total cycles */
  ...
  (void) GPTLinitialize();                 /* initialize the GPTL library */
  (void) GPTLstart ("total");              /* start a timer */
  ...
  (void) GPTLstart ("do_work");            /* start another timer */
  do_work();                               /* do some work */
  (void) GPTLstop ("do_work");             /* stop a timer */
  (void) GPTLstop ("total");               /* stop a timer */
  ...
  (void) GPTLpr (mympitaskid);             /* print the results to timing.<mympitaskid> */

Auto-instrumentation
--------------------

If the regions to be timed are defined by function entry and exit points, and
the application to be profiled is built with the GNU, Pathscale, or PGI
compilers, you might find it convenient to use the auto-instrumentation
feature of GPTL. Here's how:
1) Add the flag -finstrument-functions (-Minstrument:functions under PGI)
   when compiling the routines you'd like to profile.
2) Add calls to GPTLsetoption() (if desired), and GPTLinitialize() to the main
   program before any other routines are invoked.
3) Add a call to GPTLpr() or GPTLpr_file() wherever appropriate prior to where
   the code terminates.
4) Link with -lgptl (and -lpapi if PAPI is enabled).
5) Run the code.
6) Run "hex2name.pl <a.out> <timing.0> | less", where <a.out> is the name of
   the executable, and <timing.0> is the name of the timing file to be
   converted.
The result should be a dynamic call tree with timings and (if enabled) PAPI
counts and derived event statistics for each region, where regions are defined
by function entry and exit points.
Here's what's happening under the covers:
The -finstrument-functions flag tells the compiler to insert calls to
__cyg_profile_func_enter (void *this_fn, void *call_site) at function start,
and __cyg_profile_func_exit (void *this_fn, void *call_site) at function
exit. GPTL defines these functions as calls to (effectively) GPTLstart() and
GPTLstop(), where the address of the function is used as the input sentinel to
these routines.
Running hex2name.pl converts the function addresses back to human-readable
function names. It uses the UNIX "nm" utility to do this.
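To make the mechanism concrete, here is a minimal stand-alone sketch of such
hooks. It is not GPTL's implementation: instead of starting and stopping
timers it only logs each function address to stderr, indented by call depth.
The names hook_depth and hook_calls are hypothetical. The GCC attribute
no_instrument_function keeps the hooks themselves from being instrumented.

```c
#include <assert.h>
#include <stdio.h>

/* The compiler inserts calls to these two functions when the rest of the
   program is built with -finstrument-functions. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
  __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
  __attribute__((no_instrument_function));

static int hook_depth = 0;            /* current call depth */
static unsigned long hook_calls = 0;  /* total number of function entries */

/* Called at function entry: log the function address, then go one deeper. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
  (void) call_site;
  ++hook_calls;
  fprintf(stderr, "%*senter %p\n", 2 * hook_depth, "", this_fn);
  ++hook_depth;
}

/* Called at function exit: come back up one level, then log the address. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
  (void) call_site;
  assert(hook_depth > 0);   /* every exit should follow a matching enter */
  --hook_depth;
  fprintf(stderr, "%*sexit  %p\n", 2 * hook_depth, "", this_fn);
}
```

Linked into a program whose other files are compiled with
-finstrument-functions, this should log one enter/exit pair per instrumented
call. The raw addresses it prints are the kind of value that hex2name.pl maps
back to function names.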
When using MPI auto-profiling, steps 2) and 3) above can be omitted. In this
case GPTL auto-generates calls to GPTLinitialize() and GPTLpr() from MPI_Init()
and MPI_Finalize(), respectively.

Multi-processor instrumented codes
----------------------------------

With rev. 4.3 of GPTL, function GPTLpr_summary(mpi_communicator) was
rewritten from scratch for scalability and the presentation of additional
statistical information. Max, min, mean, and standard deviation of region
timings, along with the process and thread index responsible for max and min,
are presented in a single output file named timing.summary. As a result,
GPTLpr_summary() is now the preferred method (over parsegptlout.pl) for
gathering summary statistics across threads and tasks. See example3 in the
web documentation for further information.
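The statistics named above can be sketched for a single region as follows.
This is an illustration only, not GPTLpr_summary() itself, which gathers the
timings across an MPI communicator. The names region_stats, summarize() and
toy_sqrt() are hypothetical, and the use of the population standard deviation
(dividing by n) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  double max, min, mean, stddev;
  size_t imax, imin;   /* index of the entry holding the max and the min */
} region_stats;

/* Square root by Newton's method, used here only so the sketch needs no
   -lm at link time. sqrt() from <math.h> is the usual choice. */
static double toy_sqrt(double x)
{
  double r = x > 1.0 ? x : 1.0;
  int i;
  assert(x >= 0.0);
  if (x == 0.0)
    return 0.0;
  for (i = 0; i < 60; ++i)
    r = 0.5 * (r + x / r);
  return r;
}

/* Summarize n timings of one region, one entry per task or thread.
   Returns 0 on success, -1 on invalid input. */
int summarize(const double *t, size_t n, region_stats *out)
{
  size_t i;
  double sum = 0.0, sumsq = 0.0;
  if (t == NULL || out == NULL || n == 0)
    return -1;
  out->max = out->min = t[0];
  out->imax = out->imin = 0;
  for (i = 0; i < n; ++i) {
    sum += t[i];
    if (t[i] > out->max) { out->max = t[i]; out->imax = i; }
    if (t[i] < out->min) { out->min = t[i]; out->imin = i; }
  }
  out->mean = sum / (double) n;
  for (i = 0; i < n; ++i) {
    double d = t[i] - out->mean;
    sumsq += d * d;
  }
  out->stddev = toy_sqrt(sumsq / (double) n);   /* population std. dev. */
  return 0;
}
```

For the timings {1.0, 4.0, 2.0, 3.0} this gives max 4.0 at index 1, min 1.0 at
index 0, and mean 2.5.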