Gradient Descent

Updated at 2019-03-06 02:01

Like in machine learning, you should avoid overfitting in real life. Travel and see new things to increase data points to attain generalization.

Gradient descent is the most common approach to optimizing a function. Finding the global minimum of a function is a hard problem, especially as neural networks can have billions of variables. There are many variations of gradient descent, but they all share the same principle.

Take a point p1.
Go in direction v1 for a distance of Δv1.
Go in direction v2 for a distance of Δv2.
∂C / ∂v is the partial derivative of C with respect to v.

Change in C ≈ (∂C / ∂v1) * Δv1 + (∂C / ∂v2) * Δv2
And we are looking for a negative change in C.

Gradient vector of C: ∇C = (∂C / ∂v1, ∂C / ∂v2), so change in C ≈ ∇C · Δv
Choose Δv = -n * ∇C, which makes the change in C ≈ -n * |∇C|², always negative.
Where n is a small positive parameter called the learning rate.

1. Choose parameters (position in the valley)
2. Calculate gradient vector of cost function (C).
3. Move in the opposite direction of the gradient, with the step scaled by the learning rate (n).
4. Repeat until at the bottom of the valley.

* n needs to be small enough that the linear approximation above holds.
* n shouldn't be too small or gradient descent becomes very slow.
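The four steps above can be sketched in plain Python. The toy cost C(v1, v2) = v1² + v2², its analytic gradient (2*v1, 2*v2), and the starting point are made up for illustration:

```python
def gradient(v1, v2):
    # Analytic gradient of the toy cost C(v1, v2) = v1**2 + v2**2.
    return 2 * v1, 2 * v2

def gradient_descent(v1, v2, n=0.1, steps=100):
    for _ in range(steps):
        g1, g2 = gradient(v1, v2)           # step 2: gradient vector
        v1, v2 = v1 - n * g1, v2 - n * g2   # step 3: move the opposite way
    return v1, v2

v1, v2 = gradient_descent(3.0, -4.0)        # step 1: choose a starting point
print(v1, v2)  # both end up very close to 0, the bottom of the valley
```

With n = 0.1 each step multiplies the coordinates by 0.8, so they shrink toward the minimum; with a much bigger n the same loop would diverge.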

Gradient is the value used to adjust the network's internal weights. The bigger the gradient, the bigger the adjustment. Mathematically, the gradient is a multi-variable generalization of the derivative of the predictive model; the main difference is that the gradient is a vector while the derivative is a scalar. It tells how each of the weights should be changed to improve the predictions.

  • TODO: explain parameter vs weight.
  • TODO: explain update.
  • Hyperparameters: epoch count, mini-batch size, learning rate
  • Parameters: weight, bias

Optimizers, in a machine learning context, are algorithms that decide how to descend or estimate the gradient to find the optimal parameter set.

Convergence means settling on a minimum; the goal is the global minimum, but optimizers can still converge to a local minimum.

Here are the most common optimizers:

Traditional Gradient Descent: For a single update, you need to calculate the gradient over the whole dataset. This is often infeasible in practice: it takes too long, and large datasets rarely fit in memory.

Stochastic Gradient Descent (SGD): For a single update, you look at one sample. Easy to compute, but not as effective if e.g. the data variance is too high, and it will easily converge to a local minimum. Also called online learning.

Mini-batch Stochastic Gradient Descent: For a single update, you look at a subset of the whole dataset. The best of both worlds, as you can customize the batch size. Also confusingly just called stochastic gradient descent.
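A minimal sketch of how one epoch can be split into mini-batches. The shuffle-then-slice approach mirrors what Nielsen's network.py does, but the function name here is made up:

```python
import random

def mini_batches(training_data, batch_size):
    # Shuffle the samples, then slice them into consecutive mini-batches.
    data = list(training_data)
    random.shuffle(data)
    return [data[k:k + batch_size] for k in range(0, len(data), batch_size)]

batches = mini_batches(range(100), 10)
print(len(batches))  # 10 mini-batches of 10 samples each
```

One gradient update is then computed per mini-batch instead of per sample or per epoch.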

Momentum Gradient Descent: When making jumps, amplify them by a decaying sum of the previous jumps (the momentum).
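The momentum idea as an update rule, sketched in Python. The coefficient gamma (commonly around 0.9) and the toy one-parameter cost are illustrative assumptions:

```python
def momentum_step(w, velocity, grad, n=0.1, gamma=0.9):
    # The new jump is the decayed previous jump plus the usual SGD step.
    velocity = gamma * velocity - n * grad
    return w + velocity, velocity

w, velocity = 5.0, 0.0
for _ in range(3):
    w, velocity = momentum_step(w, velocity, grad=2 * w)  # gradient of C(w) = w**2
print(w)  # moves toward 0 faster than plain SGD's 2.56 after 3 steps
```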

Nesterov-accelerated Gradient Descent (NAG): First make the jump based on the previous momentum, then calculate the gradient at that lookahead point, and finally add a correction. NAG tries to avoid rushing past a local minimum that is in fact the global minimum.

Adaptive Gradient Descent (AdaGrad): Big jumps for infrequent parameters, small jumps for frequent parameters; it uses a different learning rate for each parameter. The downside is that the learning rate is always decreasing.
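A sketch of the AdaGrad per-parameter rule for a single parameter. The eps term (to avoid division by zero) is standard; the names and the toy cost are illustrative:

```python
import math

def adagrad_step(w, cache, grad, n=0.5, eps=1e-8):
    # Accumulate every squared gradient seen so far; a parameter with a
    # large accumulated history gets a smaller effective learning rate.
    cache = cache + grad ** 2
    w = w - n * grad / (math.sqrt(cache) + eps)
    return w, cache

w, cache = 5.0, 0.0
for _ in range(10):
    w, cache = adagrad_step(w, cache, grad=2 * w)  # gradient of C(w) = w**2
```

Because cache only ever grows, the effective learning rate n / sqrt(cache) only ever shrinks, which is the downside mentioned above.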

Adaptive Gradient Descent with Delta (AdaDelta): Only look at a fixed-size window of previous gradients, not all of them like in AdaGrad, so the learning rate does not shrink forever.

Adaptive Moment Estimation (Adam): Combines both ideas: per-parameter adaptive learning rates like AdaDelta, plus a momentum-like running average of gradients.
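Adam in a few lines for a single parameter, using the standard defaults (b1=0.9, b2=0.999) from the Adam paper; this is a sketch, not Nielsen's code:

```python
import math

def adam_step(w, m, v, grad, t, n=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad         # momentum-like first moment
    v = b2 * v + (1 - b2) * grad ** 2    # AdaGrad-like second moment
    m_hat = m / (1 - b1 ** t)            # correct the bias caused by the
    v_hat = v / (1 - b2 ** t)            # zero initialization of m and v
    w = w - n * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)  # gradient of C(w) = w**2
```

The first moment plays the role of momentum, the second moment adapts the step size per parameter.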


What should I use?

Always start with Adam. It usually converges the fastest.

Use 32 - 1024 samples per mini-batch. This range usually contains the sweet spot, even if you have millions of samples.


Backpropagation focuses on discovering how changing one weight changes the output of the whole network. It is a fast algorithm for calculating the gradient of a cost function. This gradient vector of C is what step 2 of gradient descent, discussed before, requires.

The main workhorse of backpropagation is the Hadamard product ⊙, the element-wise product of two vectors of the same shape.

[[1]    [[3]     [[1 * 3]      [[3]
 [2]] ⊙  [4]]  =  [2 * 4]]  =   [8]]
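The same product in plain Python (with numpy arrays, `a * b` does this element-wise when the shapes match):

```python
def hadamard(a, b):
    # Element-wise product of two equal-shape vectors.
    assert len(a) == len(b)
    return [x * y for x, y in zip(a, b)]

print(hadamard([1, 2], [3, 4]))  # [3, 8]
```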

Code Example

net = Network([2, 3, 1])

w_jk = weight of the connection from the
       kth neuron in the second layer
       to the jth neuron in the third layer

Vector of activations of the third layer of neurons is:
a' = sig(w * a + b).
1. a is the vector of activations of the second layer of neurons (inputs).
2. Multiply a by the weight matrix w.
3. Add the vector b of biases.
4. Apply the sigmoid function to every element in the resulting vector.
    * This is called "vectorizing" the activation function.
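The vectorized activation a' = sig(w * a + b), sketched without numpy so every step is explicit (in the book's network.py this is essentially one line, sigmoid(np.dot(w, a) + b)); the weights, biases, and inputs below are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_activations(w, a, b):
    # a' = sig(w * a + b): one sigmoid output per row of the weight matrix.
    return [sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j)
            for row, b_j in zip(w, b)]

# A made-up layer: 2 inputs feeding 3 neurons.
w = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b = [0.0, 0.0, 0.0]
a = [1.0, 1.0]
out = layer_activations(w, a, b)
print(out)  # three activations, each squashed into (0, 1)
```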

cd ~/Projects
git clone

# my virtual env aliases
vmk2 neural-networks-and-deep-learning
v neural-networks-and-deep-learning

pip install -r requirements.txt
cd src
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

import network
net = network.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
# Epoch 0: 9022 / 10000
# Epoch 1: 9184 / 10000
# ...
# Epoch 29: 9479 / 10000

net = network.Network([784, 100, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
# Epoch 0: 6629 / 10000
# Epoch 1: 6815 / 10000
# ...
# Epoch 29: 8736 / 10000

net.SGD(training_data, 30, 10, 0.001, test_data=test_data)
# Epoch 0: 8736 / 10000
# Epoch 1: 8736 / 10000
# ...
# Epoch 29: 8732 / 10000

net.SGD(training_data, 30, 10, 0.1, test_data=test_data)
# Epoch 0: 8734 / 10000
# Epoch 1: 8736 / 10000
# ...
# Epoch 29: 8742 / 10000