Deriving the Learning Correction

For gradient descent, we need to derive the update to the matrix $A$ based on training with a single pair from our data, $(x^k, y^k)$.

Important

The derivation we do here is specific to our choice of loss function, $\mathcal{L}(A_{ij})$, and activation function, $g(\xi)$.

Let’s start with our cost function:

$$\mathcal{L}(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}} \left[ g\biggl(\underbrace{\sum_{j=1}^{N_\mathrm{in}} A_{ij} x_j^k}_{\equiv\, \alpha_i}\biggr) - y_i^k \right]^2$$

where we'll refer to the product $\alpha \equiv A x^k$ to help simplify notation.
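As a concrete illustration, here is a minimal sketch (assuming NumPy, a sigmoid activation, and arbitrary sizes $N_\mathrm{in} = 3$, $N_\mathrm{out} = 2$) that evaluates this loss for a single training pair:

```python
import numpy as np

def g(xi):
    """sigmoid activation function"""
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(12345)

N_in, N_out = 3, 2

A = rng.normal(size=(N_out, N_in))    # weight matrix
xk = rng.normal(size=N_in)            # a single input x^k
yk = rng.uniform(size=N_out)          # the corresponding desired output y^k

alpha = A @ xk                        # alpha = A x^k
z = g(alpha)                          # network output z
L = np.sum((z - yk)**2)
print(L)
```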

We can compute the derivative with respect to a single matrix element, $A_{pq}$, by applying the chain rule:

$$\frac{\partial \mathcal{L}}{\partial A_{pq}} = 2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k) \left. \frac{\partial g}{\partial \xi} \right|_{\xi = \alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}$$

with

$$\frac{\partial \alpha_i}{\partial A_{pq}} = \sum_{j=1}^{N_\mathrm{in}} \frac{\partial A_{ij}}{\partial A_{pq}} x_j^k = \sum_{j=1}^{N_\mathrm{in}} \delta_{ip} \delta_{jq} x_j^k = \delta_{ip} x_q^k$$

and for $g(\xi)$, we will assume the sigmoid function, so

$$\frac{\partial g}{\partial \xi} = \frac{\partial}{\partial \xi} \left( \frac{1}{1 + e^{-\xi}} \right) = -(1 + e^{-\xi})^{-2} (-e^{-\xi}) = g(\xi)\, \frac{e^{-\xi}}{1 + e^{-\xi}} = g(\xi) (1 - g(\xi))$$
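We can spot-check this identity numerically with a centered difference (a quick sketch; the test points and tolerance here are arbitrary choices):

```python
import numpy as np

def g(xi):
    """sigmoid activation function"""
    return 1.0 / (1.0 + np.exp(-xi))

eps = 1.e-6
for xi in (-2.0, 0.0, 0.5, 3.0):
    dg_fd = (g(xi + eps) - g(xi - eps)) / (2.0 * eps)   # centered difference
    dg_analytic = g(xi) * (1.0 - g(xi))
    assert abs(dg_fd - dg_analytic) < 1.e-8
```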

which gives us (using $z_i = g(\alpha_i)$):

$$\frac{\partial \mathcal{L}}{\partial A_{pq}} = 2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)\, z_i (1 - z_i)\, \delta_{ip} x_q^k = 2 (z_p - y_p^k)\, z_p (1 - z_p)\, x_q^k$$

where we used the fact that the $\delta_{ip}$ means that only a single term contributes to the sum.

Note that:

  • $e_p^k \equiv (z_p - y_p^k)$ is the error on the output layer, and the correction is proportional to the error (as we would expect).

  • The $k$ superscripts here remind us that this is the result of only a single pair of data from the training set.

Now $z$ and $y^k$ are both vectors of size $N_\mathrm{out} \times 1$ and $x^k$ is a vector of size $N_\mathrm{in} \times 1$, so we can write this expression for the matrix as a whole as:

$$\frac{\partial \mathcal{L}}{\partial A} = 2 (z - y^k) \circ z \circ (1 - z) \cdot (x^k)^\intercal$$

where the operator $\circ$ represents element-by-element multiplication (the Hadamard product), and the product with $(x^k)^\intercal$ is an outer product, giving an $N_\mathrm{out} \times N_\mathrm{in}$ matrix.
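In code, this is just an element-by-element product of vectors followed by an outer product with $x^k$. Here is a sketch (with the same arbitrary sizes and sigmoid as above) that also verifies one element of the result against a centered difference of the loss:

```python
import numpy as np

def g(xi):
    """sigmoid activation function"""
    return 1.0 / (1.0 + np.exp(-xi))

def loss(A, xk, yk):
    """loss for a single training pair (x^k, y^k)"""
    z = g(A @ xk)
    return np.sum((z - yk)**2)

rng = np.random.default_rng(12345)
N_in, N_out = 3, 2

A = rng.normal(size=(N_out, N_in))
xk = rng.normal(size=N_in)
yk = rng.uniform(size=N_out)

z = g(A @ xk)

# dL/dA = 2 (z - y^k) o z o (1 - z) . (x^k)^T  -- an N_out x N_in matrix
grad = np.outer(2.0 * (z - yk) * z * (1.0 - z), xk)

# verify one element, (p, q), against a centered difference of the loss
p, q = 1, 2
eps = 1.e-6
Ap = A.copy(); Ap[p, q] += eps
Am = A.copy(); Am[p, q] -= eps
print(grad[p, q], (loss(Ap, xk, yk) - loss(Am, xk, yk)) / (2.0 * eps))
```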

Performing the update

We could do the update like we just saw with our gradient descent example: take a single data point, $(x^k, y^k)$, and do the full minimization, continually estimating the correction, $\partial \mathcal{L}/\partial A$, and updating $A$ until we reach a minimum. The problem with this is that $(x^k, y^k)$ is only one point in our training data, and there is no guarantee that if we minimize completely with point $k$ we will also be at a minimum with point $k+1$.

Instead we take multiple passes through the training data (called epochs) and apply only a single push in the direction that gradient descent suggests, scaled by a learning rate, $\eta$.

The overall minimization appears as:

* Loop over epochs
  • Loop over the training data, $\{(x^0, y^0), (x^1, y^1), \ldots\}$. We’ll refer to the current training pair as $(x^k, y^k)$

    • Propagate $x^k$ through the network, getting the output $z = g(A x^k)$

    • Compute the error on the output layer, $e^k = z - y^k$

    • Update the matrix $A$ according to:

      $$A \leftarrow A - 2\, \eta\, e^k \circ z \circ (1 - z) \cdot (x^k)^\intercal$$
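
A minimal sketch of this procedure for our single-layer network might look like the following (assuming NumPy; the training data, sizes, number of epochs, and learning rate here are all placeholders):

```python
import numpy as np

def g(xi):
    """sigmoid activation function"""
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(12345)

N_in, N_out = 3, 2
n_epochs = 100
eta = 0.1                             # learning rate

# placeholder training set: 50 random (x^k, y^k) pairs
training_data = [(rng.normal(size=N_in), rng.uniform(size=N_out))
                 for _ in range(50)]

A = rng.normal(size=(N_out, N_in))    # initial guess for the weights

for epoch in range(n_epochs):
    for xk, yk in training_data:
        z = g(A @ xk)                 # propagate x^k through the network
        ek = z - yk                   # error on the output layer
        # A <- A - 2 eta e^k o z o (1 - z) . (x^k)^T
        A -= 2.0 * eta * np.outer(ek * z * (1.0 - z), xk)
```

Since $A$ is nudged after every training pair, the result can depend on the order in which the pairs are visited; shuffling the training data each epoch is a common refinement.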