4. Backpropagation

💡 Includes a nice detail about why we don't 'frontpropagate'. ;-)

Backpropagation

We learned that we can find a mapping \(f \in \mathcal{F}\) that maps features X to outcome Y consistent with the data by minimizing a cost function with a gradient descent optimizer that repeatedly evaluates the gradient.

We compute the gradient with backpropagation. This algorithm allows us to efficiently compute gradients of functions that 1.) are scalar-valued, 2.) have many input parameters, and 3.) can be decomposed into simpler intermediate functions that are differentiable. The mathematical core of backpropagation follows from the third property: it computes the gradient via recursive applications of the chain rule from calculus. The order of the computation follows from the first two properties: we accumulate the resulting derivative starting from the intermediate functions on the cost function side rather than on the parameter side, which is why we 'backpropagate' instead of 'frontpropagate'.

Toy example

Recall that we would like to compute the gradient of the cost function \(C(\theta)\), which takes not only the parameters \(\theta\) but also multiple examples \((x_i, y_i)\) as input. Specifically, our first-order optimizer requires the gradient of the cost function with respect to the model parameters, \(\nabla_{\theta}C(\theta)\), in order to update the parameters \(\theta\). Let us consider the following example. Let \(C(\theta_1, \theta_2)=\theta_1 \theta_2 + \tanh (\theta_1)\) be the cost function of a neural network with parameters \(\theta_1\) and \(\theta_2\). Figure 1 shows that we can view this equation as a computational graph with the cost function \(C\) as root and the parameters as leaf nodes. We introduce intermediate variables to write \(C\) as a sequence of the intermediate functions \(z_3=\tanh (\theta_1)\), \(z_4=\theta_1 \theta_2\), \(z_5=z_3+z_4\), and \(C(\theta)=z_5\). For ease of notation, we will later also write \(\theta_1,\theta_2\) and \(C(\theta)\) as \(z_1, z_2\) and \(z_6\). We can write the gradient of the cost function with respect to each parameter as a combination of the gradients of its parent nodes, using the chain rule from calculus and the fact that multiple occurrences of a term add up in its derivative. This is shown in Equations 1 and 2:

\[\begin{align} \frac{\partial C}{\partial \theta_1}=\frac{\partial C}{\partial z_5} \bigg( \frac{\partial z_5}{\partial z_3} \frac{\partial z_3}{\partial \theta_1} + \frac{\partial z_5}{\partial z_4} \frac{\partial z_4}{\partial \theta_1} \bigg) = 1 - \tanh ( \theta_1)^2 + \theta_2, \label{eq:dtheta_1}\end{align}\] \[\begin{align} \label{eq:dtheta_2} \quad \frac{\partial C}{\partial \theta_2}=\frac{\partial C}{\partial z_5}\frac{\partial z_5}{\partial z_4}\frac{\partial z_4}{\partial \theta_2} =\theta_1.\end{align}\]
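
As a sanity check, here is a small NumPy sketch (the parameter values \(\theta_1 = 0.5\) and \(\theta_2 = -1.2\) are arbitrary) that compares the hand-derived gradients from Equations 1 and 2 with central finite differences:

```python
import numpy as np

def C(theta1, theta2):
    # toy cost function C(theta_1, theta_2) = theta_1 * theta_2 + tanh(theta_1)
    return theta1 * theta2 + np.tanh(theta1)

def analytic_grad(theta1, theta2):
    # Equations 1 and 2: dC/dtheta_1 = 1 - tanh(theta_1)^2 + theta_2, dC/dtheta_2 = theta_1
    return np.array([1.0 - np.tanh(theta1) ** 2 + theta2, theta1])

def finite_diff_grad(theta1, theta2, eps=1e-6):
    # central finite differences as an independent check
    d1 = (C(theta1 + eps, theta2) - C(theta1 - eps, theta2)) / (2 * eps)
    d2 = (C(theta1, theta2 + eps) - C(theta1, theta2 - eps)) / (2 * eps)
    return np.array([d1, d2])

theta1, theta2 = 0.5, -1.2                # arbitrary example values
print(analytic_grad(theta1, theta2))      # the two results should agree
print(finite_diff_grad(theta1, theta2))   # up to roughly 1e-8
```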

Figure 1. Computational graph for a toy example of a neural network’s forward pass without vector-valued intermediate functions, data, regularization and loss. The function given by the graph is C(θ1,θ2) = θ1θ2 + tanh (θ1). Intermediate functions that are relevant for deriving the gradient with the chain rule are denoted by zi with i ∈ {1, …, 6}. The edges along which the intermediate evaluations move forward during the forward pass and along which the gradients move back during the backward pass are denoted by ei, j with i, j ∈ {1, …, 6}. We can view two edges leaving one node as a shortcut depiction for applying the duplicate or fork operation along both edges.

Vector-valued intermediate functions

The toy example deviates from a realistic neural network application in a few aspects. One of these aspects is that we usually consider vectors of large parameter groups. Another aspect is that the intermediate functions that transform the parameter vector \(\theta\) step-by-step are usually vector-valued (unlike the final cost function). Apart from confluence operators like "+" and "*" that usually combine different parameter groups, the final gradients can be written as a sequence of "local" gradient computations. Due to vector-valued functions and the large number of parameters, these computations are written as matrix multiplications. Let us look at this extension more formally. To this end, let \(z_0\) be the input vector, which we transform through a series of functions \(z_i = f_i(z_{i-1})\) with \(i=1,...,k\), where only the final output \(z_k\) is scalar. Assuming that the functions \(f_i\) are once differentiable, we can compute the Jacobian matrix \(\frac{\partial z_i}{\partial z_{i-1}}\) for every intermediate function. It contains the first derivative of every output dimension of \(z_{i}\) with respect to every single input dimension of \(z_{i-1}\). From the multivariable chain rule, we obtain the result that the gradient of our final function with respect to the input vector equals the product of all intermediate Jacobians, ordered from the output side to the input side: \(\frac{\partial z_k}{\partial z_{0}} = \frac{\partial z_k}{\partial z_{k-1}} \cdot \frac{\partial z_{k-1}}{\partial z_{k-2}} \cdots \frac{\partial z_1}{\partial z_{0}}\).
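
To make the Jacobian chain concrete, here is a small NumPy sketch with arbitrarily chosen layer widths. It composes two vector-valued functions with a scalar-valued one, builds the intermediate Jacobians by hand, and checks their product against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
# arbitrary shapes: z0 in R^4, z1 in R^3, z2 in R^2, z3 scalar
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
z0 = rng.normal(size=4)

# forward pass through the chain of intermediate functions
z1 = np.tanh(W1 @ z0)          # f1: R^4 -> R^3
z2 = W2 @ z1                   # f2: R^3 -> R^2
z3 = np.sum(z2 ** 2)           # f3: R^2 -> R (scalar output)

# intermediate Jacobians dz_i/dz_{i-1}
J1 = np.diag(1 - z1 ** 2) @ W1     # shape (3, 4)
J2 = W2                            # shape (2, 3)
J3 = (2 * z2)[None, :]             # shape (1, 2)

# multivariable chain rule: dz3/dz0 is the product of the Jacobians,
# ordered from the output side to the input side
grad = (J3 @ J2 @ J1).ravel()      # shape (4,)

# independent check via central finite differences
def f(z):
    return np.sum((W2 @ np.tanh(W1 @ z)) ** 2)

eps = 1e-6
fd = np.array([(f(z0 + eps * e) - f(z0 - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(grad, fd, atol=1e-5))   # expect True
```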

Reverse accumulation

We can compute \(\frac{\partial C}{\partial \theta}\) in two different ways, and in this section we learn that one of them is much more efficient. The reason is the structure of our problem: we minimize a scalar-valued function with a large number of parameters. Writing the chain of Jacobians as \(\frac{\partial C}{\partial \theta} = \frac{\partial z_k}{\partial z_{k-1}} \cdot ... \cdot \frac{\partial z_2}{\partial z_1}\) with \(z_1 = \theta\) and \(z_k = C\), one approach is to evaluate this product from right to left, i.e. from the parameters towards the output: \(\frac{\partial z_k}{\partial z_{k-1}} \cdot \big( ... \cdot \big( \frac{\partial z_3}{\partial z_2} \cdot \frac{\partial z_2}{\partial z_1} \big)\big)\). This is called forward accumulation. The other approach, reverse accumulation, evaluates the product from left to right, i.e. from the output towards the parameters: \(\big(\big( \frac{\partial z_k}{\partial z_{k-1}} \cdot \frac{\partial z_{k-1}}{\partial z_{k-2}} \big) \cdot ... \big) \cdot \frac{\partial z_2}{\partial z_1}\). The first approach is less efficient but more intuitive. It is more intuitive because the derivatives can be computed in sync with the evaluation steps. This makes it easier to think about how confluence operations like "\(+\)" and "\(*\)", the fork operation duplicate, or filter operations like max or average transform the gradients. We will later see that it is crucial to use these operations carefully in the architecture design of neural networks because they can have large effects on the model performance. In reverse accumulation, we fix the dependent variable to be differentiated and compute the derivative with respect to each intermediate function recursively. Table 1 shows how we can calculate the two gradients of our toy example step-by-step with this method. There are two important things to note. First, computing the backward operations corresponding to the forward operations is prone to error. This is one reason why this should be done automatically by a graph-based computer program. Second, we can re-use our intermediate computations \(\bar{z}_3\) to \(\bar{z}_5\) for both gradients based on only one evaluation of the cost function \(C(\theta) = z_6\). In realistic neural network applications, we are able to re-use a large number of intermediate results based on only one forward evaluation for an even larger number of input parameters. In contrast, forward accumulation has to carry derivatives with respect to the whole parameter vector through every intermediate function (or repeat the computation once per parameter). Although re-using the intermediate gradients requires more storage, the reverse accumulation strategy is much more efficient for functions like neural networks where the number of output values is much smaller than the number of input values.
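
The efficiency argument can be illustrated with the Jacobian shapes alone. In the sketch below (the layer widths are arbitrary), reverse accumulation only ever multiplies a row vector with a matrix because the output is scalar, whereas forward accumulation has to carry a full Jacobian with as many columns as there are input parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
# Jacobians of a chain R^1000 -> R^500 -> R^100 -> R (widths are arbitrary)
Js = [rng.normal(size=(500, 1000)), rng.normal(size=(100, 500)), rng.normal(size=(1, 100))]

# reverse accumulation: start at the scalar output and multiply towards the parameters;
# every step is a (1 x n) row vector times an (n x m) matrix
g = Js[-1]
for J in reversed(Js[:-1]):
    g = g @ J              # shape stays (1, m)
print(g.shape)             # (1, 1000)

# forward accumulation: start at the parameters and multiply towards the output;
# every step multiplies a full (m x 1000) Jacobian, which is far more expensive
G = Js[0]
for J in Js[1:]:
    G = J @ G              # shape is (m, 1000)
print(G.shape)             # (1, 1000)

print(np.allclose(g, G))   # both orders give the same gradient, expect True
```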

Table 1. Gradient computations in reverse accumulation mode for the toy example C(θ1,θ2) = θ1θ2 + tanh θ1. The left column shows the evaluation steps and the right column depicts the derivation steps. \(\bar{z}_i\) denotes the derivative of the cost function with respect to an intermediate expression, i.e. \(\frac{\partial C}{\partial z_i}\). In the backward pass, we first evaluate the scalar cost function and use its derivative with respect to z5 at the top of the graph. Here, \(\bar{z}_5\) equals 1 because \(C = z_5\). In the next step, we combine the chained gradients of the intermediate expressions according to the respective backward functions of the duplicate and the confluence operations “+” and “*” from top to bottom. Addition distributes the upstream gradient down to all of its inputs. Multiplication passes the upstream gradient multiplied with the other input back. I.e., \(\frac{\partial (z_3 + z_4)}{\partial z_{4}} = 1, \frac{\partial (z_1 * z_2)}{\partial z_{2}} = z_1\), and the backward function of \((z_{1,3}, z_{1,4}) = \text{duplicate}(z_1)\) sums the gradients arriving along both edges, \(\bar{z}_1 = \bar{z}_3 \frac{\partial z_3}{\partial z_{1}} + \bar{z}_4 \frac{\partial z_4}{\partial z_{1}}\), where \(z_{1,3}\) and \(z_{1,4}\) denote the abbreviated nodes from the implicit duplication operation in the third and fourth evaluation step.
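
The evaluation and derivation columns of Table 1 translate directly into code. The following sketch (the variables zbar1 to zbar5 stand for \(\bar{z}_1\) to \(\bar{z}_5\); the parameter values are arbitrary) runs one forward pass and then accumulates the gradients from top to bottom exactly as described in the caption:

```python
import numpy as np

theta1, theta2 = 0.5, -1.2   # arbitrary example values

# forward pass (evaluation column of Table 1)
z1, z2 = theta1, theta2
z3 = np.tanh(z1)             # z3 = tanh(theta1)
z4 = z1 * z2                 # z4 = theta1 * theta2
z5 = z3 + z4                 # z5 = z3 + z4
C = z5                       # cost

# backward pass (derivation column of Table 1), zbar_i = dC/dz_i
zbar5 = 1.0                                 # dC/dz5 = 1 because C = z5
zbar3 = zbar5 * 1.0                         # "+" distributes the upstream gradient
zbar4 = zbar5 * 1.0
zbar1_from_z3 = zbar3 * (1 - z3 ** 2)       # local derivative of tanh
zbar1_from_z4 = zbar4 * z2                  # "*" passes the other input back
zbar2 = zbar4 * z1
zbar1 = zbar1_from_z3 + zbar1_from_z4       # backward of duplicate: gradients add up

print(zbar1, zbar2)   # matches Equations 1 and 2
```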

Implementation

In our toy example, we have already discovered that it is easier to think of the evaluation and differentiation of a neural network in terms of a directed graph of operations instead of a linear sequence of function applications. In this graph, the nodes stand for differentiable operations that take a number of vectors from their incoming edges, transform and potentially interrelate them, and send the result to the next nodes along their outgoing edges. The graph abstraction translates to most implementations of the backpropagation algorithm. A Graph object keeps the connections (e.g. duplicate or \(+\)) between the nodes and the collection of operations (e.g. \(\tanh\)). Both Node and Graph objects implement a forward() and a backward() method. The Graph’s forward() calls the Nodes’ forward() methods in their topological order. With that, every Node computes its operation on its input, and the Graph sends the result to the next Node. The Graph’s backward() iterates over the nodes in reverse topological order and calls their backward() methods. In the backward pass, each Node is given the gradient of the cost function with respect to its output and returns the "chained" gradients with respect to all of its inputs. "Chained" means that the Node multiplies the Jacobian that it received from its parent nodes with its own local Jacobian. The Graph sends the resulting product to the Node’s children and the process repeats until the recursion ends at the last computation, which includes the data and the current parameter values. Finally, the optimizer updates the parameter vector based on the final gradient (Step 3 in Algorithm 1). Note that the implemented compute graph for Figure 1 would in practice be larger and more fine-grained because we can decompose many operations, like the division or subtraction inside tanh, into more basic operations.
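
A minimal Python sketch of this design could look as follows. It assumes the nodes are already given in topological order and only implements the operations needed for the toy example; the class and method names mirror the description above but are otherwise my own:

```python
import numpy as np

class Node:
    """A differentiable operation with a forward() and a backward() method."""
    def __init__(self, inputs=()):
        self.inputs = list(inputs)   # nodes whose outputs feed into this node
        self.value = None            # cached result of the forward pass
        self.grad = 0.0              # accumulated dC/d(output of this node)

    def forward(self):
        raise NotImplementedError

    def backward(self):
        """Given self.grad = dC/d(output), push the chained gradients to the inputs."""
        raise NotImplementedError

class Leaf(Node):
    """A parameter (or data) node; it has no operation of its own."""
    def __init__(self, value):
        super().__init__()
        self.value = value
    def forward(self): pass
    def backward(self): pass

class Tanh(Node):
    def forward(self):
        (a,) = self.inputs
        self.value = np.tanh(a.value)
    def backward(self):
        (a,) = self.inputs
        a.grad += self.grad * (1.0 - self.value ** 2)   # local derivative of tanh

class Mul(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value * b.value
    def backward(self):
        a, b = self.inputs
        a.grad += self.grad * b.value   # "*" passes the other input back
        b.grad += self.grad * a.value

class Add(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value + b.value
    def backward(self):
        a, b = self.inputs
        a.grad += self.grad             # "+" distributes the upstream gradient
        b.grad += self.grad

class Graph:
    def __init__(self, nodes):
        self.nodes = nodes              # assumed to be given in topological order

    def forward(self):
        for node in self.nodes:
            node.forward()
        return self.nodes[-1].value

    def backward(self):
        self.nodes[-1].grad = 1.0       # dC/dC = 1 at the root
        for node in reversed(self.nodes):
            node.backward()

# toy example from Figure 1: C(theta1, theta2) = theta1 * theta2 + tanh(theta1)
theta1, theta2 = Leaf(0.5), Leaf(-1.2)   # arbitrary example values
z3, z4 = Tanh([theta1]), Mul([theta1, theta2])
z5 = Add([z3, z4])

graph = Graph([theta1, theta2, z3, z4, z5])
graph.forward()
graph.backward()
print(theta1.grad, theta2.grad)   # matches Equations 1 and 2
```

Note that the duplicate operation from Figure 1 needs no explicit node in this sketch: because every backward() accumulates into grad with +=, the two gradient contributions arriving at θ1 add up automatically.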

Citation

In case you like this series, cite it with:

@misc{stenzel2023deeplearning,
  title   = "Deep Learning Series",
  author  = "Stenzel, Tobias",
  year    = "2023",
  url     = "https://www.tobiasstenzel.com/blog/2023/dl-overview/"
}