Imagine you want to test out a new machine learning model for your data. This usually means coming up with some loss function to capture how well your model fits the data and optimizing that loss with respect to the model parameters. If there are many model parameters (neural nets can have millions) then you need gradients. You then have two options: derive and code them up yourself, or implement your model using the syntactic and semantic constraints of a system like Theano.
We want to provide a third way: just write down the loss function using a standard numerical library like numpy, and autograd will give you its gradient.
Autograd's `grad` function takes in a function and gives you a function that computes its derivative.
Your function must have a scalar-valued output (i.e. a float).
This covers the common case when you want to use gradients to optimize something.
Autograd works on ordinary Python and Numpy code containing all the usual control structures, including `while` loops, `if` statements, and closures. Here's a simple example of using an open-ended loop to compute the sine function:
```python
import autograd.numpy as np  # Thinly-wrapped version of Numpy
from autograd import grad

def taylor_sine(x):  # Taylor approximation to the sine function
    ans = currterm = x
    i = 0
    while np.abs(currterm) > 0.001:
        currterm = -currterm * x**2 / ((2 * i + 3) * (2 * i + 2))
        ans = ans + currterm
        i += 1
    return ans

grad_sine = grad(taylor_sine)
print("Gradient of sin(pi) is", grad_sine(np.pi))
```
A common use case for automatic differentiation is to train a probabilistic model. Here we present a very simple (but complete) example of specifying and training a logistic regression model for binary classification:
```python
import autograd.numpy as np
from autograd import grad

def sigmoid(x):
    return 0.5 * (np.tanh(x) + 1)

def logistic_predictions(weights, inputs):
    # Outputs probability of a label being true according to logistic model.
    return sigmoid(np.dot(inputs, weights))

def training_loss(weights):
    # Training loss is the negative log-likelihood of the training labels.
    preds = logistic_predictions(weights, inputs)
    label_probabilities = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probabilities))

# Build a toy dataset.
inputs = np.array([[0.52, 1.12, 0.77],
                   [0.88, -1.08, 0.15],
                   [0.52, 0.06, -1.30],
                   [0.74, -2.49, 1.39]])
targets = np.array([True, True, False, True])

# Define a function that returns gradients of training loss using autograd.
training_gradient_fun = grad(training_loss)

# Optimize weights using gradient descent.
weights = np.array([0.0, 0.0, 0.0])
print("Initial loss:", training_loss(weights))
for i in range(100):
    weights -= training_gradient_fun(weights) * 0.01
print("Trained loss:", training_loss(weights))
```
Python syntax is pretty good for specifying probabilistic models. The biggest win is that it becomes a lot easier to modify a model and rapidly iterate.
For more complex examples, see our examples directory, which includes:
- a simple neural net
- a convolutional neural net
- a recurrent neural net
- a long short-term memory (LSTM)
- backpropagating through a fluid simulation
To compute the gradient, autograd first has to record every transformation that was applied to the input as it was turned into the output of your function.
To do this, autograd wraps functions (using the class `primitive`) so that when they're called, they add themselves to a list of operations performed. The `primitive` class also allows us to specify the gradient of these primitive functions.
To flag the variables we're taking the gradient with respect to, we wrap them using the `Node` class. You should never have to think about the `Node` class, but you might notice it when printing out debugging info.
After the function is evaluated, autograd has a list of all operations that were performed and which nodes they depended on. This is the computational graph of the function evaluation. To compute the derivative, we simply apply the rules of differentiation to each node in the graph.
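To make the recording idea concrete, here is a toy sketch with made-up names (`tape`, `make_traced`); it illustrates the general tracing mechanism, not autograd's actual `primitive`/`Node` machinery:

```python
import math

tape = []  # each entry records (name, inputs, output) for one primitive call

def make_traced(name, fn):
    def traced(*args):
        out = fn(*args)
        tape.append((name, args, out))  # the wrapped call adds itself to the tape
        return out
    return traced

t_mul = make_traced('mul', lambda a, b: a * b)
t_sin = make_traced('sin', math.sin)

y = t_sin(t_mul(2.0, 3.0))  # running the function records its computation graph
print(tape)  # [('mul', (2.0, 3.0), 6.0), ('sin', (6.0,), -0.279...)]
```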
Given a function made up of several nested function calls, there are several ways to compute its derivative.
For example, given L(x) = F(G(H(x))), the chain rule says that its gradient is dL/dx = dF/dG * dG/dH * dH/dx. If we evaluate this product from right to left, (dF/dG * (dG/dH * dH/dx)), in the same order as the computations themselves were performed, this is called forward-mode differentiation. If we evaluate this product from left to right, ((dF/dG * dG/dH) * dH/dx), in the reverse of the order in which the computations themselves were performed, this is called reverse-mode differentiation.
Compared to finite differences or forward-mode, reverse-mode differentiation is by far the more practical method for differentiating functions that take in a large vector and output a single number. In the machine learning community, reverse-mode differentiation is known as 'backpropagation', since the gradients propagate backwards through the function. It's particularly nice since you don't need to instantiate Jacobians, and Jacobian-vector products can often be computed efficiently as well. Because autograd supports higher derivatives as well, Hessian-vector products (a form of second derivative) are also available and efficient to compute.
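For example, a Hessian-vector product can be obtained by composing `grad` with itself (a minimal sketch; the function `f` and the vector `v` below are made up for illustration, and autograd also provides convenience wrappers for this):

```python
import autograd.numpy as np
from autograd import grad

def f(x):                       # an arbitrary scalar-valued example function
    return np.sum(np.sin(x) * x**2)

def hvp(f, v):
    # d/dx [ grad(f)(x) . v ] equals H(x) v, without ever forming the Hessian H
    return grad(lambda x: np.dot(grad(f)(x), v))

x = np.array([0.1, 0.2, 0.3])
v = np.array([1.0, 0.0, -1.0])
print(hvp(f, v)(x))
```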
Some autodiff packages (such as Theano or Kayak) work by having you specify a graph of the computation that your function performs, including all the control flow (such as if and for loops), and then turn that graph into another one that computes gradients.
This has some benefits (such as allowing compile-time optimizations), but it requires you to express control flow in a limited mini-language that those packages know how to handle (for example, the `scan()` operation in Theano).
In contrast, autograd doesn't have to know about any ifs, branches, loops or recursion that were used to decide which operations were called. To compute the gradient of a particular input, one only needs to know which continuous transforms were applied to that particular input, not which other transforms might have been applied. Since autograd keeps track of the relevant operations on each function call separately, it's not a problem that all the Python control flow operations are invisible to autograd. In fact, it greatly simplifies the implementation.
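For instance, a function with ordinary Python branching differentiates without any special treatment (a small illustrative sketch):

```python
import autograd.numpy as np
from autograd import grad

def piecewise(x):
    if x > 0.0:          # plain Python control flow; autograd never sees the `if`
        return np.sin(x)
    else:
        return x**3

print(grad(piecewise)(1.0))    # cos(1.0), from the branch that actually ran
print(grad(piecewise)(-1.0))   # 3 * (-1.0)**2 = 3.0
```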
The main constraint is that any function that operates on a node must be marked as `primitive` and have its gradient implemented. This is taken care of for most functions in the Numpy library, and it's easy to write your own gradients.
The input can be a scalar, complex number, vector, tuple, a tuple of vectors, a tuple of tuples, etc.
When using the `grad` function, the output must be a scalar, but the functions `elementwise_grad` and `jacobian` allow gradients of vectors.
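As a quick sketch of those two helpers (using a simple elementwise function for illustration):

```python
import autograd.numpy as np
from autograd import elementwise_grad, jacobian

def f(x):
    return np.tanh(x)               # vector in, vector out

x = np.array([0.0, 1.0, 2.0])
print(elementwise_grad(f)(x))       # d tanh/dx at each element: 1 - tanh(x)**2
print(jacobian(f)(x))               # the full 3x3 (here diagonal) Jacobian
```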
Numpy has a lot of features. We've done our best to support most of them. So far, we've implemented gradients for:
- most of the mathematical operations
- most of the array and matrix manipulation routines
- some linear algebra functions
- most of the fast fourier transform routines
- full support for complex numbers
- N-dimensional convolutions
- Some scipy routines, including `scipy.stats.norm`
Some things remain to be implemented. For example, we support indexing (`x = A[i, j, :]`) but not assignment (`A[i, j] = x`) in arrays that are being differentiated with respect to.
Assignment is hard to support because it requires keeping copies of the overwritten data, but we plan to support this in the future.
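In the meantime, one workaround (sketched below) is to build a fresh array functionally, for example with `np.concatenate`, rather than assigning into an existing one:

```python
import autograd.numpy as np
from autograd import grad

def loss(x):
    # Instead of A[0] = x (unsupported), build A from pieces autograd can track.
    A = np.concatenate([np.array([1.0]) * x, np.array([2.0, 3.0])])
    return np.sum(A**2)

print(grad(loss)(1.5))   # d/dx (x**2 + 2**2 + 3**2) = 2 * 1.5 = 3.0
```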
Similarly, we don't support the syntax `A.dot(B)`; use the equivalent `np.dot(A, B)` instead. The reason we don't support the first way is that subclassing `ndarray` raises a host of issues. As another consequence of not subclassing `ndarray`, some subclass checks can break; for example, `isinstance(x, np.ndarray)` can return `False`.
In-place modification of arrays not being differentiated with respect to (for example, `A[i] = x` or `A += B`) won't raise an error, but be careful. It's easy to accidentally change something without autograd knowing about it. This can be a problem because autograd keeps references to variables used in the forward pass if they will be needed on the reverse pass. Making copies would be too slow.
Lists and dicts can be used freely; like control flow, autograd doesn't even need to know about them. The exception is passing in a list to a primitive function, such as `autograd.numpy.sum`. This requires special care, since the list contents need to be examined for nodes. So far, we do support passing lists to `autograd.numpy.array` and `autograd.numpy.concatenate`, but in other cases, make sure you explicitly cast lists to arrays using `autograd.numpy.array`.
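For example, parameters can live in a dict, and `grad` returns gradients with the same structure (a small sketch):

```python
import autograd.numpy as np
from autograd import grad

def loss(params):
    return np.sum(params['w']**2) + params['b']**2

params = {'w': np.array([1.0, 2.0]), 'b': 3.0}
grads = grad(loss)(params)
print(grads['w'], grads['b'])   # [2. 4.] and 6.0
```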
In short, do use:
- Most of numpy's functions
- Most `numpy.ndarray` methods
- Some scipy functions
- Indexing and slicing of arrays, e.g. `x = A[3, :, 2:4]`
- Explicit array creation from lists, e.g. `A = np.array([x, y])`

Don't use:
- Assignment to arrays, e.g. `A[0, 0] = x`
- Implicit casting of lists to arrays, e.g. `A = np.sum([x, y])`; use `A = np.sum(np.array([x, y]))` instead
- `A.dot(B)` notation; use `np.dot(A, B)` instead
- In-place operations (such as `a += b`); use `a = a + b` instead
- Some isinstance checks, like `isinstance(x, np.ndarray)` or `isinstance(x, tuple)`
Luckily, it's easy to check gradients numerically if you're worried that something's wrong.
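Here's one way to do that, a hand-rolled central-difference check (just a sketch; autograd also ships its own gradient-checking utilities, whose import path has varied across versions):

```python
import autograd.numpy as np
from autograd import grad

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

f = lambda x: np.sum(np.sin(x) * x)
x = np.array([0.1, 0.2, 0.3])
print(np.allclose(grad(f)(x), numeric_grad(f, x), atol=1e-4))  # True
```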
What if autograd doesn't support a function you need to take the gradient of? This can happen if your code depends on external library calls or C code. It can sometimes even be a good idea to provide the gradient of a pure Python function for speed or numerical stability.
For example, let's add the gradient of a numerically stable version of `log(sum(exp(x)))`. This function is included in `scipy.misc` and already supported, but let's make our own version.
First, we define our function using standard Python, using `@primitive` as a decorator:
```python
import autograd.numpy as np
from autograd.core import primitive

@primitive
def logsumexp(x):
    """Numerically stable log(sum(exp(x)))"""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))
```
`@primitive` tells autograd not to look inside the function, but instead to treat it as a black box whose gradient can be specified later. Functions with this decorator can contain anything that Python knows how to execute, including calls to other languages.
Next, we write a function that specifies the gradient of the primitive `logsumexp`:
```python
def logsumexp_vjp(g, ans, vs, gvs, x):
    return np.full(x.shape, g) * np.exp(x - np.full(x.shape, ans))
```
`logsumexp_vjp` computes a vector-Jacobian product (VJP): it right-multiplies `g` by the Jacobian of `logsumexp`. Here, `g` will be the gradient of the final objective with respect to `ans` (the output of `logsumexp`). The calculation can depend on both the input (`x`) and the output (`ans`) of the original function.
If you want to be able to take higher-order derivatives, then the
code inside the VJP function must be itself differentiable by autograd.
The final step is to tell autograd about `logsumexp`'s vector-Jacobian product function:
```python
logsumexp.defvjp(logsumexp_vjp)
```
Now we can use `logsumexp()` anywhere, including inside of a larger function that we want to differentiate:
```python
from autograd import grad

def example_func(y):
    z = y**2
    lse = logsumexp(z)
    return np.sum(lse)

grad_of_example = grad(example_func)
print("Gradient: ", grad_of_example(np.array([1.5, 6.7, 1e-10])))
```
This example can be found as a Python script here.
Autograd supports complex arrays and scalars using a convention described as follows.
Consider a complex-to-complex function, `f`, expressed in terms of real-to-real components, `u` and `v`:
```python
def f(z):
    x, y = real(z), imag(z)
    return u(x, y) + v(x, y) * 1j
```
We define `grad` of `f` as
```python
def grad_f(z):
    x, y = real(z), imag(z)
    return grad(u, 0)(x, y) - 1j * grad(u, 1)(x, y)
```
(The second argument of `grad` specifies which argument we're differentiating with respect to.)
So we throw out `v`, the imaginary part of `f`, entirely.
Our convention covers three important cases:
- If `f` is holomorphic, we get the usual complex derivative (since `grad(u, 0) == grad(v, 1)` and `grad(u, 1) == -grad(v, 0)`).
- If `f` is a real-valued loss function of a complex parameter, `x`, we get a result that we can use in a gradient-based optimizer, by taking steps in the direction of the complex conjugate of `grad(f)(x)` (see the sketch after this list).
- If `f` is a real-to-real function that happens to use complex primitives internally, some of which must necessarily be non-holomorphic (maybe you use FFTs to implement convolutions, for example), then we get the same result that a purely real implementation would have given.
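As a small sketch of the second case, a real-valued loss of a complex parameter, with a hand-picked step size:

```python
import autograd.numpy as np
from autograd import grad

def loss(z):
    return np.real(z * np.conj(z))   # |z|**2: real-valued, non-holomorphic in z

z = 1.0 + 2.0j
g = grad(loss)(z)
z = z - 0.1 * np.conj(g)             # step along the conjugate of the gradient
print(g, z)                          # the loss decreases at the new z
```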
Our convention doesn't handle the case where `f` is a non-holomorphic function
and you're interested in all of du/dx, du/dy, dv/dx and dv/dy.
But then the answer would have to contain four real values
and there would be no way to express it as a single complex number.
We define primitive vector-Jacobian products of complex functions like this:

```python
def f_vjp(g, z):
    x, y = real(z), imag(z)
    g_x, g_y = real(g), imag(g)
    return (       g_x * grad(u, 0)(x, y)
            - 1j * g_x * grad(u, 1)(x, y)
            -      g_y * grad(v, 0)(x, y)
            + 1j * g_y * grad(v, 1)(x, y))
```
For holomorphic primitives, this is just the regular complex derivative multiplied by `g`, so most simple math primitives don't need to be changed from their real implementations.
For non-holomorphic primitives, it preserves all four real partial derivatives as if we
were treating complex numbers as real 2-tuples
(though it throws a couple of negative signs in there).
Chapter 4 of Dougal's PhD thesis
goes into a bit more detail about how we define the primitive vector-Jacobian products.
Autograd is still under active development. We plan to support:
- GPU operations
- In-place array operations and assignment to arrays
Autograd was written by Dougal Maclaurin, David Duvenaud, and Matthew Johnson and we're actively developing it. Please feel free to submit any bugs or feature requests. We'd also love to hear about your experiences with autograd in general. Drop us an email!