- How can I see the code generated by Devito?
- How can I see the compilation command with which Devito compiles the generated code
- Where does the generated code go and how do I look at it
- Can I change the directory where Devito stashes the generated code
- I create an Operator, look at the generated code, and the equations appear in a different order than I expected.
- Do Devito Operators release the GIL when executing C code?
- What performance optimizations does Devito apply
- Does Devito optimize complex expressions
- How are abstractions used in the seismic examples
- What environment variables control how Devito works
- What are the accepted combinations of PLATFORM, ARCH, and LANGUAGE
- How do you run the unit tests from the command line
- What is the difference between f() and f[] notation
- What is the indexed notation
- Is there a flow chart
- What's up with object.data
- How do I create an N-dimensional grid
- What are the keys to fast code
- As time increases in the finite difference evolution, are wavefield arrays "swapped" as you might see in c/c++ code
- What units are typically used in Devito examples
- Can I subclass basic types such as TimeFunction
- How can I change the compilation flags (for example, I want to change the optimization level from -O3 to -O0)
- Is the jitted code IEEE-compliant
- Can I control the MPI domain decomposition
- How should I use MPI on multi-socket machines
- How do I make sure my code is "MPI safe"
- My code is generating different results when running with MPI
- Why does my Operator kernel die suddenly
- Can I manually modify the C code generated by Devito and test these modifications
- How do I find the source of my bug quickly and get support
- Is there a way to get the performance of an Operator
- How does devito compute the performance of an Operator
- How do I fairly compare the performance to that of my in-house code
- Is there is list of refereed papers related to the Devito project
- How do I cite Devito
- Where did the name Devito come from?
After you build an op=Operator(...)
implementing one or more equations, you can use print(op)
to see the generated low level code. The example below builds an operator that takes a 1/2 cell forward shifted derivative of the Function
f and puts the result in the Function
g.
import numpy as np
import devito
from devito import Grid, Function, Eq, Operator
grid = Grid(shape=(21,), extent=(1.0,), origin=(0.0,), dtype=np.float32)
x = grid.dimensions[0]
f = Function(name='f', grid=grid, space_order=8)
g = Function(name='g', grid=grid, space_order=8)
eq_dfdx = Eq(g, getattr(f, 'dx')(x+0.5*x.spacing))
op = Operator(eq_dfdx)
print(op)
And the output:
#define _POSIX_C_SOURCE 200809L
#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "xmmintrin.h"
#include "pmmintrin.h"
struct dataobj
{
void *restrict data;
int * size;
int * npsize;
int * dsize;
int * hsize;
int * hofs;
int * oofs;
} ;
struct profiler
{
double section0;
} ;
int Kernel(struct dataobj *restrict f_vec, struct dataobj *restrict g_vec, const float h_x, struct profiler * timers, const int x_M, const int x_m)
{
float (*restrict f) __attribute__ ((aligned (64))) = (float (*)) f_vec->data;
float (*restrict g) __attribute__ ((aligned (64))) = (float (*)) g_vec->data;
/* Flush denormal numbers to zero in hardware */
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
START(section0)
for (int x = x_m; x <= x_M; x += 1)
{
g[x + 8] = (3.57142857e-3F*(f[x + 4] - f[x + 12]) + 3.80952381e-2F*(-f[x + 5] + f[x + 11]) + 2.0e-1F*(f[x + 6] - f[x + 10]) + 8.0e-1F*(-f[x + 7] + f[x + 9]))/h_x;
}
STOP(section0,timers)
return 0;
}
Set the environment variable DEVITO_LOGGING=DEBUG
. When an Operator gets compiled, the used compilation command will be emitted to stdout.
If nothing seems to change, it is possible that no compilation is happening under-the-hood as all kernels have already been compiled in a previous run. You will then have to clear up the Devito kernel cache. From the Devito root directory, run:
python scripts/clear_devito_cache.py
Devito stores the generated code as well as the jit-compiled libraries in a temporary directory. By setting the environment variable DEVITO_LOGGING=DEBUG
, Devito will show, amongst other things, the absolute path to the generated code.
Yes, just set the environment variable TMPDIR
to your favorite location.
I create an Operator, look at the generated code, and the equations appear in a different order than I expected.
The Devito compiler computes a topological ordering of the input equations based on data dependency analysis. Heuristically, some equations might be moved around to improve performance (e.g., data locality). Therefore, the order of the equations in the generated code might be different than that used as input to the Operator.
Yes. Devito uses ctypes.CDLL to call the JIT C code which releases the GIL.
Take a look here and in particular at this notebook.
Devito applies several performance optimizations to improve the number of operations ("operation count") in complex expressions. These optimizations are designed to do a really good job but also be reasonably fast. One such pass attempts to factorize as many common terms as possible in expressions in order to reduce the operation count. We will construct a demonstrative example below that has a common term that is not factored out by the Devito optimization. The difference in floating-point operations per output point for the factoring of that term is about 10 percent, and the generated C is different, but numerical outputs of running the two different operators are indistinguishable to machine precision. In terms of actual performance, the (few) missed factorization opportunities may not necessarily be a relevant issue: as long as the code is not heavily compute-bound, the runtimes may only be slightly higher than in the optimally-factorized version.
ux_update = t.spacing**2 * b * \
((c33 * u_x.dx(x0=x+x.spacing/2)).dx(x0=x-x.spacing/2) +
(c55 * u_x.dz(x0=z+z.spacing/2)).dz(x0=z-z.spacing/2) +
(c13 * u_z.dz(x0=z+z.spacing/2)).dx(x0=x-x.spacing/2) +
(c55 * u_z.dx(x0=x+x.spacing/2)).dz(x0=z-z.spacing/2)) + \
(2 - t.spacing * wOverQ) * u_x + (t.spacing * wOverQ - 1) * u_x.backward
stencil_x = Eq(u_x.forward, ux_update)
print("\n", stencil_x)
op = Operator([stencil_x])
ux_update = \
t.spacing**2 * b * (c33 * u_x.dx(x0=x+x.spacing/2)).dx(x0=x-x.spacing/2) + \
t.spacing**2 * b * (c55 * u_x.dz(x0=z+z.spacing/2)).dz(x0=z-z.spacing/2) + \
t.spacing**2 * b * (c13 * u_z.dz(x0=z+z.spacing/2)).dx(x0=x-x.spacing/2) + \
t.spacing**2 * b * (c55 * u_z.dx(x0=x+x.spacing/2)).dz(x0=z-z.spacing/2) + \
(2 - t.spacing * wOverQ) * u_x + (t.spacing * wOverQ - 1) * u_x.backward
stencil_x = Eq(u_x.forward, ux_update)
print("\n", stencil_x)
op = Operator([stencil_x])
Eq(u_x(t + dt, x, z), dt**2*(Derivative(c13(x, z)*Derivative(u_z(t, x, z), z), x) + Derivative(c33(x, z)*Derivative(u_x(t, x, z), x), x) + Derivative(c55(x, z)*Derivative(u_x(t, x, z), z), z) + Derivative(c55(x, z)*Derivative(u_z(t, x, z), x), z))*b(x, z) + (-dt*wOverQ(x, z) + 2)*u_x(t, x, z) + (dt*wOverQ(x, z) - 1)*u_x(t - dt, x, z))
Operator `Kernel` generated in 1.26 s
* lowering.Expressions: 0.61 s (48.7 %)
* lowering.Clusters: 0.52 s (41.5 %)
* specializing.Clusters: 0.31 s (24.8 %)
Flops reduction after symbolic optimization: [1160 --> 136]
Eq(u_x(t + dt, x, z), dt**2*b(x, z)*Derivative(c13(x, z)*Derivative(u_z(t, x, z), z), x) + dt**2*b(x, z)*Derivative(c33(x, z)*Derivative(u_x(t, x, z), x), x) + dt**2*b(x, z)*Derivative(c55(x, z)*Derivative(u_x(t, x, z), z), z) + dt**2*b(x, z)*Derivative(c55(x, z)*Derivative(u_z(t, x, z), x), z) + (-dt*wOverQ(x, z) + 2)*u_x(t, x, z) + (dt*wOverQ(x, z) - 1)*u_x(t - dt, x, z))
Operator `Kernel` generated in 1.12 s
* lowering.Expressions: 0.59 s (53.0 %)
* lowering.Clusters: 0.40 s (35.9 %)
* specializing.Clusters: 0.31 s (27.8 %)
Flops reduction after symbolic optimization: [1169 --> 149]
Other optimizations include common sub-expressions elimination, hoisting of loop-invariant code, and detection of cross-iteration redundancies (e.g., due to high order derivatives). Below we show the cumulative impact of all these optimizations in a number of seismic operators.
For more info, take a look at this notebook.
Many Devito examples are provided that demonstrate application for specific problems, including e.g. fluid mechanics and seismic modeling. We focus in this question on seismic modeling examples that provide convenience wrappers to build differential equations and create Devito Operators for various types of modeling physics including isotropic and anisotropic, acoustic and elastic.
These examples (link) use abstractions to remove details from the methods that actually build the operators. The idea is that at the time you build a Devito operator, you don't need specific material parameter arrays (e.g. velocity or density or Thomsen parameter), and you don't need specific locations of sources and receiver instruments. All you need to build the operator is a placeholder that can provide the dimensionality and (for example) the spatial order of finite difference approximation you wish to employ. In this way you can build and return functioning highly optimized operators to which you can provide the specific implementation details at runtime via command line arguments.
An example of this abstraction (or placeholder design pattern) in operation is the call to the isotropic acoustic AcousticWaveSolver.forward
method that returns a Devito operator via the ForwardOperator
method defined in operators.py.
You will note that this method uses placeholders for the material parameter arrays and the source and receiver locations, and then at runtime uses arguments provided to the returned Operator
to provide state to the placeholders. You can see this happen on lines 112-113 in wavesolver.py.
You can get the list of environment variables and all their possible values with the following python code:
from devito import print_defaults
print_defaults()
These environment variables can either be set from the shell or programmatically. Note that when setting these variables programmatically, you need to use lower case, and the leading DEVITO
is omitted. Values are case sensitive, meaning openmp
is accepted and OPENMP
will throw an error. Below are examples of setting these variables in the shell (before running python) and from python (before executing devito code).
Method | Example |
---|---|
bourne shell | DEVITO_LANGUAGE="openmp" |
c shell | setenv DEVITO_LANGUAGE "openmp" |
programmatically | configuration['language'] = 'openmp' |
Used to select a specific "backend compiler". The backend compiler takes as input the code generated by Devito and produces a shared object. Supported backend compilers are gcc
, icc
, pgcc
, clang
. For each of these compilers, Devito uses some preset compilation flags (e.g., -O3
, -march=native
, -fast-math
). If this environment variable is left unset, Devito will attempt auto-detection of the most suitable backend compiler available on the underlying system.
This environment variable is mostly needed when running on GPUs, to ask Devito to generate code for a particular device (see for example this tutorial). Can be also used to specify CPU architectures such as Intel's -- Haswell, Broadwell, SKL and KNL -- ARM, AMD, and Power. Often one can ignore this variable because Devito typically does a decent job at auto-detecting the underlying platform.
Specify the generated code language. The default is C
, which means sequential C. Use openmp
to emit C+OpenMP or openacc
for C+OpenACC.
Choose the performance profiling level. This is also automatically increased with DEVITO_LOGGING=PERF
or DEVITO_LOGGING=DEBUG
, in which case this environment variable can be ignored.
Mostly useful for developers when chasing segfaults.
Choose the performance optimization level. By default set to the maximum level, advanced
.
Controls MPI in Devito. Use 1
to enable MPI. The most powerful MPI mode is called "full", and is activated setting DEVITO_MPI=full
. The "full" mode implements a number of optimizations including computation/communication overlap.
Search across a set of block shapes to maximize the effectiveness of loop tiling (aka cache blocking). You can choose between off
(default), basic
, aggressive
, max
. A more aggressive autotuning should eventually result in better runtime performance, though the search phase will take longer.
Run with DEVITO_LOGGING=DEBUG
to find out the specific performance optimizations applied by an Operator, how auto-tuning is getting along, to emit the command used to compile the generated code, to emit more performance metrics, and much more.
Use DEVITO_FIRST_TOUCH=1
in combination with DEVITO_LANGUAGE=openmp
to use an OpenMP parallel Operator for initialization of Function data. This should ensure NUMA locality for data access when running "flatten" OpenMP across multiple sockets on the same node (as opposed to using MPI -- one MPI process per socket -- plus OpenMP, which is the recommended way).
You can set DEVITO_JIT_BACKDOOR=1
to test custom modifications to the generated code. For more info, take a look at this FAQ.
Set DEVITO_IGNORE_UNKNOWN_PARAMS=1
to avoid Devito raising an exception if one attempts to pass an unknown argument to op.apply()
.
These two languages can be used with virtually any PLATFORM and ARCH.
With a device PLATFORM (e.g., nvidiaX
, amdgpuX
, or intelgpuX
), the compiler will generate OpenMP code for device offloading.
When using OpenMP offloading, it is recommended to stick to the corresponding vendor compiler, so ARCH=amdclang
for AMD, ARCH={icc,icx,intel}
for Intel, and ARCH=nvc
for NVidia.
Requires: PLATFORM=nvidiaX
and ARCH=nvc
.
The legacy PGI compiler is also supported via ARCH=pgcc
.
DevitoPRO only.
Requires: PLATFORM=nvidiaX
and ARCH=cuda
.
DevitoPRO only.
Requires: PLATFORM=amdgpuX
and ARCH=hip
.
DevitoPRO only.
Requires: PLATFORM=intelgpuX
or PLATFORM=intel64
, and ARCH=sycl
.
In addition to the tutorials, the unit tests provide an excellent way to see how the Devito API works with small self-contained examples. You can exercise individual unit tests with the following python code:
pytest <test.py>
pytest -vs <test.py> [more detailed log]
Devito offers a functional language to express finite difference operators. This is introduced here and systematically used throughout our examples and tutorials. The language relies on what in jargon we call the "f() notation".
>>> from devito import Grid, Function
>>> grid = Grid(shape=(5, 6))
>>> f = Function(name='f', grid=grid, space_order=2)
>>> f.dx
Derivative(f(x, y), x)
>>> f.dx.evaluate
-f(x, y)/h_x + f(x + h_x, y)/h_x
Sometimes, one wishes to escape the constraints of the language. Instead of taking derivatives, other special operations are required. Or perhaps, a specific grid point needs to be accessed. In such a case, one could use the "f[] notation" or "indexed notation". Following on from the example above:
>>> x, y = grid.dimensions
>>> f[x + 1000, y]
f[x + 1000, y]
The indexed object can be used at will to construct Eq
s, and they can be mixed up with objects stemming from the "f() notation".
>>> f.dx + f[x + 1000, y]
Derivative(f(x, y), x) + f[x + 1000, y]
However, while the "f() notation" is substantially safe -- the language is designed such that only legal stencil expressions are built -- the "f[] notation" is not, and one can easily end up creating operators performing out-of-bounds accesses. So use it judiciously!
The indexed notation, or "f[] notation", is discussed here
needs links
- Equation Lowering
- Indexification
- Substitutions
- Domain alignment
- Eq -> LoweredEq
- Local Analysis
- Clustering
- Symbolic Optimization via Devito Symbolic Engine (DSE)
- IET (Iteration/Expression Tree) Construction
- IET Analysis
- IET Optimization (DLE/YLE)
- Loop optimization via Devito Loop Engine (DLE)
- Loop Tiling / Cache Blocking
- SIMD
- OpenMP
- MPI
- Synthetic
- JIT (Just In Time) Compilation
The .data
property which is associated with objects such as Constant
, Function
and SparseFunction
(along with their derivatives) represents the 'numerical' value of the 'data' associated with that particular object. For example, a Constant
will have a single numerical value associated with it as shown in the following snippet
from devito import Constant
c = Constant(name='c')
c.data = 2.7
print(c.data)
2.7
Then, a Function
defined on a Grid
will have a data value associated with each of the grid points (as shown in the snippet below) and so forth.
import numpy as np
from devito import Grid, Function
grid = Grid(shape=(4, 4), extent=(1, 1))
f = Function(name='f', grid=grid)
f.data[:] = np.arange(16).reshape(grid.shape)
print(f.data)
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]]
Grids are often created via, e.g.,
grid = Grid(shape=(5, 5))
where printing the grid
object then returns:
Grid[extent=(1.0, 1.0), shape=(5, 5), dimensions=(x, y)]
Here we see the grid
has been created with the 'default' dimensions x
and y
. If a grid is created and passed a shape of (5, 5, 5)
we'll see that in addition it has a z
dimension. However, what if we want to create a grid with, say, a shape of (5, 5, 5, 5)
? For this case, we've now run out of the dimensions defined by default and hence need to create our own dimensions to achieve this. This can be done via, e.g.,
a = SpaceDimension('a')
b = SpaceDimension('b')
c = SpaceDimension('c')
d = SpaceDimension('d')
grid = Grid(shape=(5, 5, 5, 5), dimensions=(a, b, c, d))
where now, printng grid
we get
Grid[extent=(1.0, 1.0, 1.0, 1.0), shape=(5, 5, 5, 5), dimensions=(a, b, c, d)]
and grid.shape
returns
(5, 5, 5, 5)
The code generated by devito is designed to run fast on CPU, GPU and clusters thereof. Broadly outlined, some of the mechanics for generating fast code are:
- Loop tiling (or "cache blocking")
- Loop fusion (to maximize data locality) Loop fission (to maximize parallelism)
- Flop-reducing transformations (CSE, cross-stencil redundancies, factorization, hoisting)
- OpenMP threading
- OpenMP-based SIMD
- Alignment to promote SIMD vectorization
- Longer pipelines, less travel to host (do more work on the GPU before communicating data between host and GPU)
- Computation/communication overlap with prodding of the asynchronous progress engine
- Avoidance of unnecessary halo exchanges
- Reshuffling of halo exchanges
- Threaded data packing/unpacking
As time increases in the finite difference evolution, are wavefield arrays "swapped" as you might see in c/c++ code
In c/c++ code using two wavefield arrays for second order acoustics, you might see code like the following to “swap” the wavefield arrays at each time step:
float *p_tmp = p_old;
p_old = p_cur;
p_cur = p_tmp;
Instead of swapping arrays, devito uses the modulus of a time index to map increasing temporal indices [0, 1, 2, 3, 4, 5, ...] into cyclic indices [0, 1, 2, 0, 1, 2, ...].
- Sampling rates: msec
- Frequency: KHz
- Velocity: km/sec
Yes, just like we did for our seismic examples, for example, the PointSource class. A few caveats are necessary, though.
First, classes such as Function
or SparseTimeFunction
are inherently complex. In fact, SparseTimeFunction
itself is a subclass of Function
. The whole class hierarchy is modular and well-structured, but at the same time, it's depth and offers several hooks to allow specialization by subclasses -- for example, all methods starting with __
such as __init_finalize__
or __shape_setup__
. It will take some time to digest it. Luckily, you will only need to learn a little of this, at least for simple subclasses.
Second, you must know that these objects are subjected to so-called reconstruction during compilation. Objects are immutable inside Devito; therefore, even a straightforward symbolic transformation such as f[x] -> f[y]
boils down to performing a reconstruction, that is, creating a whole new object. Since f
carries around several attributes (e.g., shape, grid, dimensions), each time Devito performs a reconstruction, we only want to specify which attributes are changing -- not all of them, as it would make the code ugly and incredibly complicated. The solution to this problem is that all the base symbolic types inherit from a common base class called Reconstructable
; a Reconstructable
object has two special class attributes, called __rargs__
and __rkwargs__
. If a subclass adds a new positional or keyword argument to its __init_finalize__
, it must also be added to __rargs__
or __rkwargs__
, respectively. This will provide Devito with enough information to perform a reconstruction when it's needed during compilation. The following example should clarify:
class Foo(Reconstructable):
__rargs__ = ('a', 'b')
__rkwargs__ = ('c',)
def __init__(self, a, b, c=4):
self.a = a
self.b = b
self.c = c
def __repr__(self):
return "x(%d, %d, %d)" % (self.a, self.b, self.c)
class Bar(Foo):
__rkwargs__ = Foo.__rkwargs__ + ('d',)
def __init__(self, *args, d=6, **kwargs)
super().__init__(*args, **kwargs)
self.d = d
You are unlikely to care about how reconstruction works in practice, but here are a few examples for a = Foo(3, 5)
to give you more context.
a._rebuild() -> "x(3, 5, 4)" (i.e., copy of `a`).
a._rebuild(4) -> "x(4, 5, 4)"
a._rebuild(4, 7) -> "x(4, 7, 4)"
a._rebuild(c=5) -> "x(3, 5, 5)"
a._rebuild(1, c=7) -> "x(1, 5, 7)"
How can I change the compilation flags (for example, I want to change the optimization level from -O3 to -O0)
There is currently no API to achieve this straightforwardly. However, there are three workarounds:
- hacky way: change the flags explicitly in the Devito source code. In Devito v4.0, you can do that here
- via env vars: use a CustomCompiler -- just leave the
DEVITO_ARCH
environment variable unset or set it to'custom'
. Then,export CFLAGS="..."
to tell Devito to use the exported flags in place of the default ones. - programmatically: subclass one of the compiler classes and set
self.cflags
to whatever you need. Do not forget to add the subclass to the compiler registry. For example, you could do
from devito import configuration, compiler_registry
from devito.compiler import GNUCompiler
class MyOwnCompiler(GNUCompiler):
def __init__(self, *args, **kwargs):
super(MyOwnCompiler, self).__init__(*args, **kwargs)
# <manipulate self.cflags here >
### Make sure Devito is aware of this new Compiler class
compiler_registry['mycompiler'] = MyOwnCompiler
configuration.add("compiler", "custom", list(compiler_registry), callback=lambda i: compiler_registry[i]())
### Then, what remains to be done is asking Devito to use MyOwnCompiler
configuration['compiler'] = 'mycompiler'
By default, Devito compiles the generated code using flags that maximize the runtime performance. Some of these flags, such as GCC's -ffast-math
, break IEEE compliance by, for example, reordering operations based on associativity. This may alter the numerical output, including even making some codes produce incorrect results. If you think your code is affected by -ffast-math
or similar flags that break IEEE compliance, you can disable them by setting the environment variable DEVITO_SAFE_MATH=1
. You can achieve the same effect in Python by setting configuration['safe-math'] = 1
.
<This FAQ was co-authored by @FabioLuporini and @ofmla>
By default, domain decomposition starts along the slowest axis. This means that for an n-dimensional grid, the outermost dimension is decomposed first. For example, in a three-dimensional grid (x, y, z), the x dimension is split into chunks first, followed by the y dimension, and finally the z dimension. Then the process restarts from the outermost dimension again, ensuring an n-dimensional decomposition, until as many partitions as MPI ranks are created.
The Grid
class accepts an optional topology
argument, allowing users to specify custom domain decompositions as an n-tuple, where n
is the number of distributed dimensions. For instance, for a two-dimensional grid, the topology (4, 1)
will decompose the slowest axis into four partitions (one per MPI rank), while the fastest axis will be replicated across all MPI ranks. So, for example, given a Grid(shape=(16, 16), topology=(4, 1))
, the x dimension would be split into 4 chunks of 4, resulting in partitions of shape (4, 16)
for each MPI rank.
Consider a domain with three distributed dimensions: x, y, and z, and an MPI communicator with N
processes. Here are some examples of specifying a custom topology
:
-
With N known (e.g., N=4):
(1, 1, 4)
: Decomposes the z dimension into 4 chunks.(2, 1, 2)
: Decomposes the x dimension into 2 chunks and the z dimension into 2 chunks.
-
With N unknown:
(1, '*', 1)
: The wildcard'*'
indicates that the runtime should decompose the y dimension into N chunks.('*', '*', 1)
: The runtime decomposes both x and y dimensions into factors of N, prioritizing the outermost dimension.
Assuming the number of ranks
N
cannot be evenly decomposed into the requested stars, decomposition is as even as possible, prioritizing the outermost dimension:-
For N=3:
('*', '*', 1)
results in (3, 1, 1).('*', 1, '*')
results in (3, 1, 1).(1, '*', '*')
results in (1, 3, 1).
-
For N=6:
('*', '*', 1)
results in (3, 2, 1).('*', 1, '*')
results in (3, 1, 2).(1, '*', '*')
results in (1, 3, 2).
-
For N=8:
('*', '*', '*')
results in (2, 2, 2).('*', '*', 1)
results in (4, 2, 1).('*', 1, '*')
results in (4, 1, 2).(1, '*', '*')
results in (1, 4, 2).
As of Devito v4.8.11, the domain decomposition topology can also be specified globally using the environment variable DEVITO_TOPOLOGY
. Accepted values are:
x
: Corresponds to the topology('*', 1, 1)
, decomposing the x dimension.y
: Corresponds to the topology(1, '*', 1)
, decomposing the y dimension.z
: Corresponds to the topology(1, 1, '*')
, decomposing the z dimension.xy
: Corresponds to the topology('*', '*', 1)
, decomposing both x and y dimensions.
In general you should use one MPI rank per NUMA node on a multi-socket machine. You can find the number of numa nodes with the lscpu
command. For example, here is the relevant part of the output from the lscpu
command on an AMD 7502 2 socket machine with 2 NUMA nodes:
Architecture: x86_64
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7502 32-Core Processor
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
There are a few things you may want to check
- To refer to the actual ("global") shape of the domain, you should always use
grid.shape
(or analogously through aFunction
,f.grid.shape
). And unless you know well what you're doing, you should never use the function shape, namelyf.shape
orf.data.shape
, as that will return the "local" domain shape, that is the data shape after domain decomposition, which might differ across the various MPI ranks.
There are a few things you may want to check
-
Users are expected to execute the MPI launch command, specifying the desired number of ranks and other possible arguments: mpirun, mpiexec, srun, e.g.:
mpirun -np <num processes> [options] <executable> [arguments]
, and get exactly the same results. However some pre- and post-processing may be rank-specific (e.g., plotting on a given MPI rank only). -
In case you used hardcoded boundary conditions to model your problem, consider switching to
SubDomain
s. Devito's support for distributed NumPy arrays enables domain decomposition without requiring changes to user code. The data is physically distributed, but from the user’s perspective, it remains a logically centralized entity. Users can interact with data using familiar indexing schemes (e.g., slicing) without concern about the underlying layout. You can find related tutorials here:.
For example, instead of
t = grid.stepping_dim
x, y = grid.dimensions
op = Operator([Eq(u.forward, u + 1),
Eq(u[t+1, x, 0], 0),
Eq(u[t+1, x, 2], 0),
Eq(u[t+1, 0, y], 0),
Eq(u[t+1, 2, y], 0)])
one should use a semantically equivalent computation can be expressed exploiting SubDomains.
u.data[:] = 0
op = Operator(Eq(u.forward, u + 1, subdomain=grid.interior))
This is likely due to an out-of-bounds (OOB) array access while running the generated code. This shouldn't really ever happen! The compiler and runtime should catch any DSL misuses, ultimately leading to such OOB accesses, well before jumping to C-land via op.apply(...)
to run the generated code. However, there are currently two open issues when this might unfortunately happen:
- When misusing
SubDimensions
orSubDomains
. See this issue - When misusing
ConditionalDimensions
. See this issue
Yes, as of Devito v3.5 it is possible to modify the generated C code and run it inside Devito. First you need to get the C file generated for a given Operator
. Run your code in DEBUG
mode:
DEVITO_LOGGING=DEBUG python your_code.py
The generated code path will be shown as in the excerpt below:
CustomCompiler: compiled `/tmp/devito-jitcache-uid1000/ed41e9373af1bc129471b7ae45e1c3740b60a856.c` [0.29 s]
You can now open the C file, do the modifications you like, and save them. Finally, rerun the same program but this time with the Devito JIT backdoor enabled:
DEVITO_JIT_BACKDOOR=1 python your_code.py
This will force Devito to recompile and link the modified C code.
If you have a large codebase with many Operator
s, here's a trick to speed up your hacking with the JIT backdoor.
Assuming this is actually an issue in Devito, we distinguish between three types of bugs:
- a Python-level bug, e.g. some exceptions raised during compilation
- a C-level bug causing abrupt termination, e.g. a segmentation fault (ouch)
- a C-level bug, e.g. the numerical output "looks wrong"
Typically such a bug occurs in a moderately big code, so how should we proceed?
If you are in the cases 1 or 2 above, the first thing to do, regardless of who fixes it (either you directly if feeling brave, or most likely someone from the Devito team), is to create an MFE -- a Minimal Failing Example. An interesting read about MFEs in general is available here. The typical workflow in Devito is as follows:
- If the failure is from a Jupyter Notebook, first of all, convert it into a Python file. It's trivial, you can do it directly from within the Jupyter Notebook itself, under the "File" tab there's an option to convert the notebook into a Python file.
- Then it's time to trim down the code. The idea is to incrementally remove parts of the code until the bug disappears, i.e., Devito compilation and execution reach completion without failures. In doing so, you don't care about things such as breaking the physics, making the method unstable, and so on. So ...
- If your Operator has many Eqs, start with removing some of them. Is the bug still there? Great, keeping on removing the unnecessary. Some Functions may now become unnecessary because they were only appearing in the removed Eqs. Then, drop them too from the Python file.
- Here's another trick. You have many SubDomains -- try with removing some of them, both their definition and where they're used.
- The MFE must have a super quick turnaround time. Avoid
solve
at all costs. Run for only a few timesteps, possibly just one; this will also help with avoiding NaN or Inf since your MFE is most likely unstable. - If you have just one Eq in your Operator, but it's a big one, perhaps with lots of operations over many Functions, then you should trim it down. Try to mitigate the arithmetic complexity, for example by dropping derivatives or entire sub-expressions.
- There are other tricks for the creation of an MFE. If your code by default uses MPI, or OpenMP, or OpenACC, or combinations thereof, but the bug appears even when running sequentially, then explicitly disable parallelism. Also try with disabling the performance optimizations applied by an Operator --
Operator(..., opt='noop')
. - There are a few other tricks and things you may wanna play with to create an MFE...
At the end of the day, an MFE should really be small, ideally not more than 10 lines of code, and definitely not more than 20~30. If you managed to create an actual MFE, then we're in business. File an issue on GitHub or talk to us on Slack, attach the MFE, and someone will quickly look into it.
Now, stepping back: you may unfortunately be in the case 3 above -- the numerical output "looks wrong", despite the code really looks sane. This is in our experience an extremely rare circumstance. You may try to look at the generated C, to see if you spot anything odd; or you may just file an issue on GitHub with instructions to reproduce the problem, or talk to us directly or Slack.
Yes, any logging level below or equal to PERF
will do the trick. For example, if you run:
DEVITO_LOGGING=PERF python your_code.py
you will see that Devito emits lots of useful information concerning the performance of an Operator. The following is reported:
- the code generation, compilation, and execution times;
- for each section in the generated code, its execution time, operational intensity (OI), GFlops/s and GPts/s performance;
- global GFlops/s and GPts/s performance of the Operator (i.e., cumulative across all sections);
- in the case of an MPI run, per-rank GFlops/s and GPts/s performance.
About the GPts/s metric, that is number of gigapoints per seconds. The "points" we refer to here are the actual grid points -- so if the grid is an N**3
cube, the number of timesteps is T
, and the Operator runs for S
secs, then we have N**3*T/S GPts/s
. This is the typical metric used for finite difference codes.
An excerpt of the performance profile emitted by Devito upon running an Operator is provided below. In this case, the Operator has two sections, section0
and section1
, and section1
consists of two consecutive 6D iteration spaces whose size is given between angle brackets.
Global performance: [OI=0.16, 8.00 GFlops/s, 0.04 GPts/s]
Local performance:
* section0<136,136,136> run in 0.10 s [OI=0.16, 0.14 GFlops/s]
* section1<<341,16,16,14,14,136>,<341,16,16,8,8,130>> run in 35.91 s [OI=5.36, 8.01 GFlops/s, 0.05 GPts/s]
Note that section0
doesn't show the GPts/s. This is because no TimeFunction is written in this section.
The section execution time is computed directly in the generated code through cheap timers. The cumulative Operator execution time is computed through Python-level timers and includes the overhead inherent in the processing of the arguments supplied to op.apply(...)
.
The floating-point operations are counted once all of the symbolic flop-reducing transformations have been performed during compilation. Devito uses an in-house estimate of cost, rather than SymPy's estimate, to take care of some low-level intricacies. For example, Devito's estimate ignores the cost of integer arithmetic used for offset indexing into multi-dimensional arrays. Examples of how the Devito estimator works are available here.
To calculate the GFlops/s performance, Devito multiplies the floating-point operations calculated at compile time by the size of the iteration space, and it does that at the granularity of individual expressions. For example, consider the following snippet:
<section0 start>
for x = x_m to x_M
for y = y_m to y_M
u[x, y] = 2*v[x, y] + 1
for z = z_m to z_M
w[x, y, z] += 4*h[x, y, z] + 5 - f[x, y, z]
<section 0 end>
At compile time, Devito produces the following estimate: 2*(x_M-x_m+1)*(y_M-y_m+1) + 4*(x_M-x_m+1)*(y_M-y_m+1)*(z_M-z_m+1) = 2*(x_M-x_m+1)*(y_M-y_m+1)*(1 + 2*(z_M-z_m+1))
. Upon op.apply(...)
, the values of x_M, x_m, ...
become known and are instantied. This gives the total number of operations performed in section0
, which is eventually divided by the section execution time to give the GFlops/s performance.
The produced GFlops/s has been checked against that reported by Intel Advisor in a number of problems and the results were extremely close, which gives confidence about the soundness of the Devito estimate. Clearly, when it gets to articles or presentations, we encourage the use of profilers such as Intel Advisor to back these numbers up. The metrics emitted by Devito are only intended to give an initial yet fairly realistic indication of the Operator performance.
Compiler ninjas may wonder about the counting of loop-invariant sub-expressions, which might produce an over-estimate of the actual performance. Thanks to aggressive code motion, the amount of innermost-loop-invariant sub-expressions in a Devito Operator is typically negligible, so the Devito estimate doesn't basically suffer from this issue, or at least not in a tangible way to the best of our knowledge.
It's challenging! Here's a potentially non-exhaustive list of things to check:
- Check the physics — e.g., there are many TTI variations out there, are the two codes using the same one? You wanna ensure the PDEs are the same.
- Check the discretization — do the two codes use the same spatial order? most importantly, is it the same spatial order in all problem dimensions? carefully optimized codes often use lower order stencils in some dimensions. This significantly impacts performance
- Did you try tuning the performance of the Devito
Operator
? E.g., on GPUs it is worth giving thepar-tile
option a go. - Snapshotting: compression, serialization, asynchronous streaming if running on device... Many legacy codes have these features enabled out-of-the-box. Are you using them?
- Time sub-sampling for snapshotting -- yes or no?
- Expanding box -- yes or no.
- Mixed precision for model parameters -- yes or no?
- Other minor potential devito optimizations: use of Buffer(2) to reduce working set size; Trigonometric identities to avoid some temporaries (e.g., cos2 = 1 - sin2), etc...
Please see https://www.devitoproject.org/publications
Please see https://www.devitoproject.org/citing
The precursor project that led to Devito was named by @ggorman using a backronym generator. He put in some keywords like "open source performant seismic codes", and chose the (weird) name "Opesci". No one knew how to pronounce this, so a common conversation was "How do you pronounce this?" "Opesci, like Joe Pesci". So for the next version we chose to go with a famous Joe Pesci character - Tommy Devito from Goodfellas. The name came up in a discussion between @navjotk and @mlange05 (mostly the latter) and we stuck with it.