diff --git a/2014/12/29/deep-learning-in-a-nutshell/index.xhtml b/2014/12/29/deep-learning-in-a-nutshell/index.xhtml
index ebfafdd..354127d 100644
--- a/2014/12/29/deep-learning-in-a-nutshell/index.xhtml
+++ b/2014/12/29/deep-learning-in-a-nutshell/index.xhtml
@@ -290,19 +290,7 @@

Now at this point you might be thinking, wait up... Why do we need to bother ourselves with this error function nonsense when we have a bunch of variables (weights) and we have a set of equations (one for each training example)? Couldn't we just solve this problem by setting up a system of linear equations? That would automatically give us an error of zero, assuming that we have a consistent set of training examples, right?

That's a smart observation, but the insight unfortunately doesn't generalize well. Remember that although we're using a linear neuron here, linear neurons aren't used very much in practice because they're constrained in what they can learn. And the moment you start using nonlinear neurons like the sigmoidal neurons we talked about, you can no longer set up a system of linear equations!
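For the linear case, the direct solve proposed in the question does work. The sketch below is illustrative only: the two training examples and the helper `solve_two_weights` are made up for this example, and it assumes a two-input linear neuron with exactly two consistent, linearly independent examples.

```python
# Two hypothetical training examples for a two-input linear neuron: (inputs, target).
examples = [((2.0, 1.0), 8.0), ((1.0, 3.0), 9.0)]

def solve_two_weights(examples):
    """Solve x1*w1 + x2*w2 = t for two examples using Cramer's rule."""
    (a, b), t1 = examples[0]
    (c, d), t2 = examples[1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("examples are not linearly independent")
    w1 = (t1 * d - b * t2) / det
    w2 = (a * t2 - t1 * c) / det
    return w1, w2

w1, w2 = solve_two_weights(examples)

# With these weights the neuron reproduces both targets, so the error is zero.
outputs = [w1 * x1 + w2 * x2 for (x1, x2), _ in examples]
```

Once the output passes through a nonlinearity such as a sigmoid, the equations are no longer linear in the weights and this shortcut stops applying.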

-

- So maybe we can use an iterative approach instead that generalizes to nonlinear examples. Let's try to visualize how we might minimize the squared error over all of the training examples by simplifying the problem. Let's say we're dealing with a linear neuron with only two inputs (and thus only two weights, - - and - ). - Then we can imagine a 3-dimensional space where the horizontal dimensions correspond to the weights - - and - , - and there is one vertical dimension that corresponds to the value of the error function - . - So in this space, points in the horizontal plane correspond to different settings of the weights, and the height at those points corresponds to the error that we're incurring, summed over all training cases. If we consider the errors we make over all possible weights, we get a surface in this 3-dimensional space, in particular a quadratic bowl: -

+

So maybe we can use an iterative approach instead that generalizes to nonlinear examples. Let's try to visualize how we might minimize the squared error over all of the training examples by simplifying the problem. Let's say we're dealing with a linear neuron with only two inputs (and thus only two weights, w1 and w2). Then we can imagine a 3-dimensional space where the horizontal dimensions correspond to the weights w1 and w2, and there is one vertical dimension that corresponds to the value of the error function E. So in this space, points in the horizontal plane correspond to different settings of the weights, and the height at those points corresponds to the error that we're incurring, summed over all training cases. If we consider the errors we make over all possible weights, we get a surface in this 3-dimensional space, in particular a quadratic bowl:

Quadratic Error Surface
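To make the bowl concrete, here is a minimal sketch that evaluates the error at a grid of weight settings. It assumes the error is half the sum of squared differences between target and output (the factor of one half is a common convention, assumed here), and the training examples are hypothetical.

```python
def squared_error(w1, w2, examples):
    """Error summed over all training cases for a two-input linear neuron."""
    total = 0.0
    for (x1, x2), t in examples:
        y = w1 * x1 + w2 * x2          # output of the linear neuron
        total += 0.5 * (t - y) ** 2    # assumed convention: half the squared error
    return total

# Hypothetical training examples: (inputs, target).
examples = [((2.0, 1.0), 8.0), ((1.0, 3.0), 9.0)]

# Height of the surface above each point (w1, w2) of a small grid in the weight plane.
grid = [[squared_error(w1, w2, examples) for w1 in range(0, 6)] for w2 in range(0, 5)]
lowest = min(min(row) for row in grid)
```

Each entry of `grid` is the height of the surface above one point of the horizontal plane. For these examples the lowest entry sits at `w1 = 3`, `w2 = 2`, where both targets are matched exactly.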
@@ -321,23 +309,94 @@

Learning Rates and the Delta Rule

-

In practice at each step of moving perpendicular to the contour, we need to determine how far we want to walk before recalculating our new direction. This distance needs to depend on the steepness of the surface. Why? The closer we are to the minimum, the shorter we want to step forward. We know we are close to the minimum, because the surface is a lot flatter, so we can use the steepness as an indicator of how close we are to the minimum. We multiply this measure of steepness with a pre-determined constant factor , the learning rate. Picking the learning rate is a hard problem. If we pick a learning rate that's too small, we risk taking too long during the training process. If we pick a learning rate that's too big, we'll mostly likely start diverging away from the minimum (this pretty easy to visualize). Modern training algorithms adapt the learning rate to overcome this difficult challenge.

-

For those who are interested, putting all the pieces results in what is called the delta rule for training the linear neuron. The delta rule states that given a learning rate , we ought to change the weight at each iteration of training by . Deriving this formula is left as an exercise for the experienced reader. For a hint, study our derivation for a sigmoidal neuron in the next section.

+

In practice, at each step of moving perpendicular to the contour, we need to determine how far we want to walk before recalculating our new direction. This distance needs to depend on the steepness of the surface. Why? The closer we are to the minimum, the shorter we want to step forward. We know we are close to the minimum because the surface is a lot flatter, so we can use the steepness as an indicator of how close we are to the minimum. We multiply this measure of steepness by a pre-determined constant factor ϵ, the learning rate. Picking the learning rate is a hard problem. If we pick a learning rate that's too small, we risk taking too long during the training process. If we pick a learning rate that's too big, we'll most likely start diverging away from the minimum (this is pretty easy to visualize). Modern training algorithms adapt the learning rate to overcome this difficult challenge.
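The effect of the learning rate is easy to see on a toy problem. The sketch below assumes a one-dimensional bowl E(w) = w², chosen only for illustration (it is not the neuron's actual error surface), and the function name `descend` is hypothetical.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on the toy bowl E(w) = w**2, whose slope is 2*w."""
    w = start
    for _ in range(steps):
        steepness = 2.0 * w              # derivative of the error at the current w
        w -= learning_rate * steepness   # step length scales with the steepness
    return w

too_small = descend(0.01)   # creeps toward the minimum at w = 0
reasonable = descend(0.4)   # ends very close to the minimum
too_big = descend(1.1)      # overshoots by more on every step and diverges
```

After the same 20 steps, the small rate has barely moved, the middle rate has essentially reached the minimum, and the large rate has ended up further away than where it started.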

+

For those who are interested, putting all the pieces together results in what is called the delta rule for training the linear neuron. The delta rule states that given a learning rate ϵ, we ought to change the weight wk at each iteration of training by:

+ $$\Delta w_k = \sum_i \epsilon \, x_k^{(i)} \left( t^{(i)} - y^{(i)} \right)$$

Deriving this formula is left as an exercise for the experienced reader. For a hint, study our derivation for a sigmoidal neuron in the next section.
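As a rough sketch, the delta rule for a linear neuron can be written in a few lines of Python. The function name `delta_rule_step`, the training examples, and the learning rate of 0.05 are all assumptions made for this illustration.

```python
def delta_rule_step(weights, examples, learning_rate):
    """Apply one delta-rule update, summing the change over all training examples."""
    deltas = [0.0] * len(weights)
    for inputs, target in examples:
        output = sum(w * x for w, x in zip(weights, inputs))   # linear neuron output y
        for k, x_k in enumerate(inputs):
            deltas[k] += learning_rate * x_k * (target - output)
    return [w + d for w, d in zip(weights, deltas)]

# Hypothetical training examples: (inputs, target).
examples = [((2.0, 1.0), 8.0), ((1.0, 3.0), 9.0)]

weights = [0.0, 0.0]
for _ in range(500):
    weights = delta_rule_step(weights, examples, learning_rate=0.05)
```

For these examples the weights settle near `[3.0, 2.0]`, the setting that reproduces both targets. A noticeably larger learning rate would make the same loop diverge.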

Unfortunately, just taking the path of steepest descent doesn't always do the trick when we have nonlinear neurons. The error surface can get complicated and there could be multiple local minima, so this procedure could potentially get us to a bad local minimum that isn't the global minimum. As a result, in practice, training neural nets involves a modification of gradient descent called stochastic gradient descent, which tries to use randomization and noise to find the global minimum with high probability on a complex error surface.
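One common form of stochastic gradient descent updates the weights from a single randomly chosen training example at a time instead of summing over all of them. The sketch below shows only that mechanic, on the linear neuron, where the surface has no bad local minima. The helper `sgd_step`, the examples, the fixed seed, and the learning rate are assumptions for illustration.

```python
import random

def sgd_step(weights, example, learning_rate):
    """Update the weights using the error on one training example only."""
    inputs, target = example
    output = sum(w * x for w, x in zip(weights, inputs))
    return [w + learning_rate * x * (target - output) for w, x in zip(weights, inputs)]

rng = random.Random(0)   # fixed seed so the run is repeatable
examples = [((2.0, 1.0), 8.0), ((1.0, 3.0), 9.0)]   # hypothetical (inputs, target) pairs

weights = [0.0, 0.0]
for _ in range(2000):
    weights = sgd_step(weights, rng.choice(examples), learning_rate=0.05)
```

Each step follows a noisy estimate of the full gradient, and that noise is what the randomized approach relies on when the surface is more complicated.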

Moving on to the Sigmoidal Neuron *

This section and the next will get a little heavy with the math, so just be forewarned. If you're not comfortable with multivariate calculus, feel free to skip them and move on to the remaining sections. Otherwise, let's just dive right into it!

Let's recall the mechanism by which logistic neurons compute their output value from their inputs:

+ $$z = \sum_k w_k x_k$$
+ $$y = \frac{1}{1 + e^{-z}}$$
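The two equations above translate directly into code. This is a minimal sketch; the function name `logistic_neuron` and the sample weights and inputs are hypothetical.

```python
import math

def logistic_neuron(weights, inputs):
    """Return the logit z and the output y of a logistic (sigmoidal) neuron."""
    z = sum(w * x for w, x in zip(weights, inputs))   # weighted sum of the inputs
    y = 1.0 / (1.0 + math.exp(-z))                    # squashes z into the range (0, 1)
    return z, y

z_zero, y_zero = logistic_neuron([0.0, 0.0], [1.0, 2.0])   # logit 0 gives output 0.5
z_pos, y_pos = logistic_neuron([2.0, 0.0], [1.0, 5.0])     # logit 2 gives output above 0.5
```

A logit of zero gives an output of exactly 0.5. Large positive logits push the output toward 1 and large negative logits push it toward 0.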

The neuron computes the weighted sum of its inputs, the logit, z. It then feeds z into the logistic function to compute y, its final output. These functions have very nice derivatives, which makes learning easy! For learning, we want to compute the gradient of the error function with respect to the weights. To do so, we start by taking the derivative of the logit, z, with respect to the inputs and the weights. By linearity of the logit: