# Deriving the Learning Correction
For gradient descent, we need to derive the update to the matrix \({\bf A}\) based on training with a single pair from our data set, \(({\bf x}^k, {\bf y}^k)\).
Let’s start with our cost function:

\[f(A_{ij}) = \| g({\bf A}{\bf x}^k) - {\bf y}^k \|^2 = \sum_{i=1}^{N_\mathrm{out}} \left [ g\left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right ) - y^k_i \right ]^2\]

where we’ll refer to the product \({\boldsymbol \alpha} \equiv {\bf Ax}\) to help simplify the notation.
We can compute the derivative with respect to a single matrix element, \(A_{pq}\), by applying the chain rule:

\[\frac{\partial f}{\partial A_{pq}} = \sum_{i=1}^{N_\mathrm{out}} 2 \left [ g(\alpha_i) - y^k_i \right ] \frac{\partial g(\alpha_i)}{\partial \alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}\]

with

\[\frac{\partial \alpha_i}{\partial A_{pq}} = \frac{\partial}{\partial A_{pq}} \sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j = \sum_{j=1}^{N_\mathrm{in}} \delta_{ip} \delta_{jq}\, x^k_j = \delta_{ip}\, x^k_q\]
and for \(g(\xi)\), we will assume the sigmoid function, \(g(\xi) = 1/(1 + e^{-\xi})\), so

\[\frac{\partial g(\xi)}{\partial \xi} = g(\xi) \left [ 1 - g(\xi) \right ]\]
which gives us:

\[\frac{\partial f}{\partial A_{pq}} = 2 \left ( z_p - y^k_p \right ) z_p \left ( 1 - z_p \right ) x^k_q\]

where \(z_p \equiv g(\alpha_p)\) is the output of the network, and we used the fact that the \(\delta_{ip}\) means that only a single term contributes to the sum over \(i\).
Note that:

* \(e_p^k \equiv (z_p - y_p^k)\) is the error on the output layer, and the correction is proportional to the error (as we would expect).

* The \(k\) superscripts here remind us that this is the result of only a single pair of data from the training set.
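Before moving to the matrix form, it’s worth sanity-checking this element-wise result numerically. Below is a minimal NumPy sketch (the array sizes, the random test data, and the choice of element \((p, q)\) are just illustrative) that compares the analytic derivative against a centered finite-difference estimate of the cost function.

```python
import numpy as np

rng = np.random.default_rng(12345)

def g(xi):
    """sigmoid activation"""
    return 1.0 / (1.0 + np.exp(-xi))

def f(A, x, y):
    """cost for a single training pair (x, y)"""
    return np.sum((g(A @ x) - y)**2)

# illustrative sizes and random test data
N_in, N_out = 5, 3
A = rng.normal(size=(N_out, N_in))
x = rng.normal(size=N_in)
y = rng.normal(size=N_out)

# analytic derivative for a single element A_{pq}
p, q = 1, 2
z = g(A @ x)
dfdA_analytic = 2.0 * (z[p] - y[p]) * z[p] * (1.0 - z[p]) * x[q]

# centered finite-difference estimate
eps = 1.e-6
Ap = A.copy(); Ap[p, q] += eps
Am = A.copy(); Am[p, q] -= eps
dfdA_fd = (f(Ap, x, y) - f(Am, x, y)) / (2.0 * eps)

print(dfdA_analytic, dfdA_fd)   # the two estimates should agree closely
```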
Now \({\bf z}\) and \({\bf y}^k\) are both vectors of size \(N_\mathrm{out} \times 1\) and \({\bf x}^k\) is a vector of size \(N_\mathrm{in} \times 1\), so we can write this expression for the matrix as a whole as:

\[\frac{\partial f}{\partial {\bf A}} = 2\, {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal\]

where the operator \(\circ\) represents element-by-element multiplication (the Hadamard product).
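In code, the Hadamard products become element-wise multiplication of NumPy arrays, and the trailing product with \(({\bf x}^k)^\intercal\) becomes an outer product, so the entire gradient is a single expression. Here’s a sketch, assuming \({\bf x}^k\) and \({\bf y}^k\) are stored as 1-d arrays (the function name is just illustrative):

```python
import numpy as np

def grad_f(A, xk, yk):
    """gradient of the single-pair cost ||g(A x^k) - y^k||^2
    with respect to the matrix A"""
    z = 1.0 / (1.0 + np.exp(-A @ xk))     # z = g(A x^k), the sigmoid output
    e = z - yk                            # e^k, the error on the output layer
    # element-wise (Hadamard) products, then an outer product with x^k
    return 2.0 * np.outer(e * z * (1.0 - z), xk)
```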
## Performing the update
We could do the update like we just saw with our gradient descent example: take a single data point, \(({\bf x}^k, {\bf y}^k)\), and do the full minimization, continually estimating the correction, \(\partial f/\partial {\bf A}\), and updating \({\bf A}\) until we reach a minimum. The problem with this is that \(({\bf x}^k, {\bf y}^k)\) is only one point in our training data, and there is no guarantee that if we minimize completely with point \(k\) we will also be at a minimum for point \(k+1\).

Instead, we take multiple passes through the training data (called epochs) and, for each training pair, apply only a single push in the direction that gradient descent suggests, scaled by a learning rate \(\eta\).
The overall minimization appears as:
* Loop over the training data, \(\{ ({\bf x}^0, {\bf y}^0), ({\bf x}^1, {\bf y}^1), \ldots \}\). We’ll refer to the current training pair as \(({\bf x}^k, {\bf y}^k)\).

  * Propagate \({\bf x}^k\) through the network, getting the output \({\bf z} = g({\bf A x}^k)\).

  * Compute the error on the output layer, \({\bf e}^k = {\bf z} - {\bf y}^k\).

  * Update the matrix \({\bf A}\) according to:

    \[{\bf A} \leftarrow {\bf A} - 2 \,\eta\, {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal\]
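Putting these steps together, here’s a minimal sketch of the training loop, assuming the training pairs are stored as lists of 1-d NumPy arrays; the function name, the random initialization of \({\bf A}\), and the default values of \(\eta\) and the number of epochs are illustrative choices, not part of the derivation.

```python
import numpy as np

def g(xi):
    """sigmoid activation"""
    return 1.0 / (1.0 + np.exp(-xi))

def train(x_train, y_train, N_in, N_out, eta=0.1, n_epochs=100):
    """loop over the training pairs for a number of epochs, applying a
    single gradient-descent push to A for each pair"""
    rng = np.random.default_rng()
    A = rng.normal(size=(N_out, N_in))

    for _ in range(n_epochs):
        for xk, yk in zip(x_train, y_train):
            z = g(A @ xk)                # propagate x^k through the network
            e = z - yk                   # error on the output layer
            # A <- A - 2 eta e^k o z o (1 - z) . (x^k)^T
            A -= 2.0 * eta * np.outer(e * z * (1.0 - z), xk)

    return A
```

In practice, the training pairs are usually shuffled at the start of each epoch so the updates aren’t applied in the same order every pass, but that detail doesn’t change the update itself.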