# Deriving the Learning Correction
For gradient descent, we need to derive the update to the matrix ${\bf A}$ that results from training on a single pair of data, $({\bf x}^k, {\bf y}^k)$, from our training set.
**Important**

The derivation we do here is specific to our choice of loss function, $f$, and activation function, $g(\xi)$ (here, the sigmoid).
Let's start with our cost function for a single training pair, $({\bf x}^k, {\bf y}^k)$:

$$f(A_{ij}) = \| {\bf z} - {\bf y}^k \|^2 = \sum_{i=1}^{N_\mathrm{out}} \left ( z_i - y^k_i \right )^2$$

where we'll refer to the product ${\bf A}{\bf x}^k$ as

$$\tilde{\bf z} = {\bf A}{\bf x}^k$$

so the output of the network is ${\bf z} = g(\tilde{\bf z})$, with the activation function applied element by element.
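As a concrete check of these definitions, here is a minimal NumPy sketch of the forward pass and the cost for a single training pair. The layer sizes, the matrix `A`, and the data `x` and `y` are made-up placeholders, and a sigmoid activation is assumed:

```python
import numpy as np

def g(xi):
    """sigmoid activation, applied element by element"""
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(12345)

n_in, n_out = 4, 3                    # placeholder layer sizes
A = rng.normal(size=(n_out, n_in))    # weight matrix A
x = rng.normal(size=n_in)             # a single input vector x^k
y = rng.normal(size=n_out)            # the corresponding target y^k

z_tilde = A @ x       # z~ = A x^k
z = g(z_tilde)        # network output z = g(z~)

f = np.sum((z - y)**2)   # cost for this single pair
print(f)
```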
We can compute the derivative with respect to a single matrix element, $A_{pq}$, using the chain rule:

$$\frac{\partial f}{\partial A_{pq}} = \sum_{i=1}^{N_\mathrm{out}} 2 \left ( z_i - y^k_i \right ) \frac{\partial z_i}{\partial A_{pq}}$$

with

$$\frac{\partial z_i}{\partial A_{pq}} = \frac{\partial g(\tilde{z}_i)}{\partial \tilde{z}_i} \frac{\partial \tilde{z}_i}{\partial A_{pq}} = \frac{\partial g(\tilde{z}_i)}{\partial \tilde{z}_i} \frac{\partial}{\partial A_{pq}} \sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j = \frac{\partial g(\tilde{z}_i)}{\partial \tilde{z}_i}\, x^k_q\, \delta_{ip}$$
and for our sigmoid activation function,

$$\frac{\partial g(\xi)}{\partial \xi} = g(\xi) \left [ 1 - g(\xi) \right ]$$
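A quick numerical sanity check of this identity, as a self-contained sketch comparing a centered finite difference against $g(\xi)[1 - g(\xi)]$:

```python
import numpy as np

def g(xi):
    """sigmoid activation"""
    return 1.0 / (1.0 + np.exp(-xi))

xi = np.linspace(-5, 5, 11)
eps = 1.e-6

# centered finite-difference approximation to dg/dxi
dg_fd = (g(xi + eps) - g(xi - eps)) / (2 * eps)

# the analytic form g (1 - g)
dg_exact = g(xi) * (1.0 - g(xi))

print(np.max(np.abs(dg_fd - dg_exact)))   # should be tiny (roundoff level)
```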
which gives us:

$$\frac{\partial f}{\partial A_{pq}} = \sum_{i=1}^{N_\mathrm{out}} 2 \left ( z_i - y^k_i \right ) z_i \left ( 1 - z_i \right ) x^k_q\, \delta_{ip} = 2 \left ( z_p - y^k_p \right ) z_p \left ( 1 - z_p \right ) x^k_q$$

where we used the fact that the $\delta_{ip}$ picks out only the $i = p$ term in the sum, and that $g(\tilde{z}_i) = z_i$.
Note that

$${\bf e}^k \equiv {\bf z} - {\bf y}^k$$

is the error on the output layer, and the correction is proportional to the error (as we would expect). The $k$ superscripts here remind us that this is the result of only a single pair of data from the training set.
Now $\partial f / \partial A_{pq}$ has the same shape as the matrix ${\bf A}$ itself, so we can write the full gradient as

$$\frac{\partial f}{\partial {\bf A}} = 2\, {\bf e}^k \circ {\bf z} \circ \left ( 1 - {\bf z} \right ) \cdot ({\bf x}^k)^\intercal$$

where the operator $\circ$ represents element-by-element (Hadamard) multiplication, and the product with the row vector $({\bf x}^k)^\intercal$ is an outer product that gives a matrix with the same shape as ${\bf A}$.
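As a sketch (repeating the made-up `A`, `x`, `y`, and sigmoid `g` from above), the full gradient is a single outer product, and we can spot-check one element of it against a finite-difference derivative of the cost:

```python
import numpy as np

def g(xi):
    """sigmoid activation"""
    return 1.0 / (1.0 + np.exp(-xi))

def cost(A, x, y):
    z = g(A @ x)
    return np.sum((z - y)**2)

rng = np.random.default_rng(12345)
n_in, n_out = 4, 3
A = rng.normal(size=(n_out, n_in))
x = rng.normal(size=n_in)
y = rng.normal(size=n_out)

z = g(A @ x)
e = z - y

# df/dA = 2 e o z o (1 - z) . x^T  -- an outer product with the shape of A
dfdA = np.outer(2 * e * z * (1 - z), x)

# finite-difference check of a single element, df/dA_pq
p, q = 1, 2
eps = 1.e-6
Ap = A.copy(); Ap[p, q] += eps
Am = A.copy(); Am[p, q] -= eps
dfdA_fd = (cost(Ap, x, y) - cost(Am, x, y)) / (2 * eps)

print(dfdA[p, q], dfdA_fd)   # these should agree closely
```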
## Performing the update
We could do the update like we just saw with our gradient descent example: take a single data point, $({\bf x}^k, {\bf y}^k)$, and minimize $f$ completely with respect to it. But that would tune the network to that one pair alone. Instead, we take multiple passes through the training data (called *epochs*) and, for each pair, apply only a single push in the direction that gradient descent suggests, scaled by a learning rate $\eta$.
The overall minimization appears as:
- Loop over the training data, $({\bf x}^k, {\bf y}^k)$. We'll refer to the current training pair as $({\bf x}^0, {\bf y}^0)$.

- Propagate ${\bf x}^0$ through the network, getting the output ${\bf z} = g({\bf A}{\bf x}^0)$.

- Compute the error on the output layer, ${\bf e}^0 = {\bf z} - {\bf y}^0$.

- Update the matrix ${\bf A}$ according to:

  $${\bf A} \leftarrow {\bf A} - 2\, \eta\, {\bf e}^0 \circ {\bf z} \circ \left ( 1 - {\bf z} \right ) \cdot ({\bf x}^0)^\intercal$$
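Putting the pieces together, here is a minimal sketch of this training loop. The training set (`xs`, `ys`), the layer sizes, the learning rate `eta`, and the number of epochs are all made-up placeholders:

```python
import numpy as np

def g(xi):
    """sigmoid activation"""
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(12345)

n_in, n_out = 4, 3
n_data = 50

# placeholder training set: n_data pairs (x^k, y^k)
xs = rng.normal(size=(n_data, n_in))
ys = g(rng.normal(size=(n_data, n_out)))   # targets in (0, 1) to match the sigmoid output

A = rng.normal(size=(n_out, n_in))

eta = 0.1          # learning rate
n_epochs = 100

for epoch in range(n_epochs):
    for x0, y0 in zip(xs, ys):
        z = g(A @ x0)                        # propagate x^0 through the network
        e = z - y0                           # error on the output layer
        dfdA = np.outer(2 * e * z * (1 - z), x0)
        A -= eta * dfdA                      # single gradient-descent push

    # monitor the total cost over the training set
    total = sum(np.sum((g(A @ xk) - yk)**2) for xk, yk in zip(xs, ys))

print("final cost:", total)
```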