Neural Networks
Activation Functions

Updated at 2019-02-04 07:14

The activation function of a node defines the output of that node given a set of inputs. This output is then used as input for the next node and so on until a desired solution to the original problem is found.

Useful activation functions are always non-linear. Stacking purely linear layers gains nothing, because a composition of linear functions is itself linear and can always be represented as a single linear layer.
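As a rough illustration of why stacked linear layers collapse, the sketch below builds two small linear layers and the single layer they are equivalent to. The weights, biases, and helper names (matvec, matmul, two_layers, one_layer) are made up for this example.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Two hypothetical linear layers with no activation in between:
#   y = W2 (W1 x + b1) + b2
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

def two_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same mapping as one layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [3.0, -2.0]
print(two_layers(x), one_layer(x))  # both lines of computation agree
```

Putting any non-linear function between the two layers breaks this equivalence, which is what gives depth its extra expressive power.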

TL;DR: what should I use?

  1. Try normal ReLUs.
  2. If a lot of neurons keep dying: try reducing the learning rate.
  3. If a lot of neurons still die: try Leaky ReLU.

Lowering the learning rate can reduce the number of dying neurons. A large weight update can push a neuron into a "dead zone" where it outputs zero for every input. Because the gradient there is also zero, it is hard for the neuron to get back.

This is a common behaviour in normal ReLUs.
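The effect can be reproduced on a toy problem. The sketch below is only an illustration under made-up assumptions: a single ReLU neuron y_hat = max(0, w*x + b), a three-point dataset, squared error, and the hypothetical helpers sgd_step and active_fraction. One gradient step with a small learning rate keeps the neuron active, while one step with a large learning rate kills it for good.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of max(0, z); taken as 0 at z == 0 by convention here.
    return 1.0 if z > 0 else 0.0

def sgd_step(w, b, data, lr):
    """One full-batch gradient step on mean squared error."""
    gw = gb = 0.0
    for x, y in data:
        z = w * x + b
        err = relu(z) - y
        gw += 2 * err * relu_grad(z) * x
        gb += 2 * err * relu_grad(z)
    n = len(data)
    return w - lr * gw / n, b - lr * gb / n

def active_fraction(w, b, data):
    """Fraction of inputs for which the neuron outputs something non-zero."""
    return sum(1 for x, _ in data if w * x + b > 0) / len(data)

# Toy data (assumed for illustration): the target is simply y = x.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

for lr in (0.01, 1.0):
    w, b = sgd_step(2.0, 0.0, data, lr)
    print(lr, active_fraction(w, b, data))
    # Once the neuron is inactive on every input, all gradients are zero,
    # so further steps leave w and b unchanged.
    print(sgd_step(w, b, data, lr) == (w, b))
```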

Some artificial neurons have difficulty learning when they are very wrong, far more difficulty than when they are just a little wrong.

This is a result of the general shape of the Sigmoid function: it is very flat where its output is close to 0 or 1.

Keep an eye out for weights that are learning slowly. A weight will learn slowly if either the input neuron has low activation or the output neuron has a saturated activation, either high or low. You can use different activation functions to avoid neurons getting saturated like this.
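A small sketch of both slow-learning cases, assuming a sigmoid output neuron. The gradient of a weight is taken here as (upstream error) * sig'(z) * (input activation), where sig'(z) = sig(z) * (1 - sig(z)). The function names and the numbers passed in are illustrative only.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def weight_gradient(input_activation, z, upstream_error=1.0):
    """Gradient of the loss with respect to one incoming weight."""
    return upstream_error * sigmoid_grad(z) * input_activation

print(weight_gradient(1.0, 0.0))    # healthy: active input, unsaturated output
print(weight_gradient(0.01, 0.0))   # slow: low-activation input
print(weight_gradient(1.0, 10.0))   # slow: output saturated high
print(weight_gradient(1.0, -10.0))  # slow: output saturated low
```

The slope of the sigmoid peaks at 0.25 for z = 0 and shrinks quickly as |z| grows, so a saturated neuron passes back almost no learning signal.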

Artificial neurons are usually named after their activation function:


Perceptron:

  • takes several binary inputs, x1, x2, ...
  • produces a single binary output: 1 if the weighted sum w * x exceeds a threshold, 0 otherwise
  • never use Perceptrons
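For reference, the threshold rule of a Perceptron fits in a few lines. This is a minimal sketch; the AND-gate weights and threshold are chosen by hand for illustration, not learned.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Hand-picked parameters that make the perceptron behave like a logical AND.
and_weights, and_threshold = [1.0, 1.0], 1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], and_weights, and_threshold))
```

Because the output is a hard step, a small change in a weight either does nothing or flips the output completely, which gives gradient-based training nothing to work with.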


Sigmoid:

  • takes several inputs, x1, x2, ..., preferably values between 0 and 1
  • produces output using a sigmoid function, sig(w * x + b)
  • generates an "S"-shaped curve
  • for very large inputs the sigmoid function approaches 1, and for very negative inputs it approaches 0
  • saturates in both directions, which causes gradients to vanish; vanishing gradients are hard to optimize with because the weight updates become tiny when the output is near 0 or 1
  • never use Sigmoid

Tanh (Hyperbolic Tangent Function):

  • takes several inputs, x1, x2, ..., preferably values between -1 and 1
  • produces output using Tanh function, tanh(w * x + b), from -1 to +1
  • generates an "S"-shaped curve
  • for very large inputs the Tanh function approaches 1, and for very negative inputs it approaches -1
  • saturates in both directions like Sigmoid, which causes gradients to vanish
  • avoid using Tanh; improbable that they'll work better than ReLUs
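Tanh is a rescaled and shifted Sigmoid, tanh(z) = 2 * sig(2z) - 1, which is why it shares the same saturation behaviour. The sketch below checks that identity numerically and prints the slope 1 - tanh(z)^2 at a few points. The helper names are made up for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh_via_sigmoid(z):
    """tanh expressed through the sigmoid: 2 * sig(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, math.tanh(z), tanh_via_sigmoid(z), tanh_grad(z))
```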

ReLU (Rectified Linear Unit):

  • produces output using rectifier function, max(0, w * x + b), from 0 to infinity
  • works better on large and complex datasets
  • doesn't saturate for positive inputs, so it avoids the vanishing gradients of Sigmoid and Tanh
  • can suffer from dead neurons (nothing excites the neuron, so it always outputs 0)
  • use ReLU by default
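A minimal sketch of a ReLU neuron with several inputs, together with the slope of the rectifier. The function names and sample values are illustrative assumptions.

```python
def relu_neuron(inputs, weights, bias):
    """Output of one ReLU neuron: max(0, w * x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, z)

def relu_grad(z):
    # Slope of max(0, z); taken as 0 at z == 0 by convention here.
    return 1.0 if z > 0 else 0.0

print(relu_neuron([1.0, 2.0], [0.5, 1.0], 0.25))   # positive pre-activation passes through
print(relu_neuron([1.0, 2.0], [0.5, -1.0], 0.25))  # negative pre-activation is clipped to 0
print(relu_grad(1000.0), relu_grad(-3.0))
```

The slope stays at 1 no matter how large the positive input gets, while for negative inputs it is exactly 0, which is the source of the dead-neuron problem.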


SoftPlus:

  • Approximation of ReLU, creating a smooth curve around 0.
  • ln(1 + e^x).
  • more expensive to compute than ReLU
  • avoid using SoftPlus; too expensive for small benefit
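Computing ln(1 + e^x) directly overflows for large x. One common workaround, sketched here as an assumption rather than as any particular library's implementation, rewrites it as max(x, 0) + ln(1 + e^(-|x|)).

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    return max(0.0, x)

# SoftPlus tracks ReLU away from 0 and smooths the corner around 0.
for x in (-100.0, -2.0, 0.0, 2.0, 100.0, 1000.0):
    print(x, softplus(x), relu(x))
```

Even in this form each evaluation needs an exponential and a logarithm, compared with a single comparison for ReLU.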

Leaky ReLU:

  • ReLU with a small non-zero gradient for inputs less than zero
  • x if x > 0, 0.01*x otherwise
  • if the leakage multiplier is a learnable parameter, this becomes Parametric ReLU, where the leakage is learned during training
  • if the leakage multiplier is random, this becomes Randomized ReLU, where the leakage changes during training but is fixed in testing
  • use Leaky ReLU if a lot of normal ReLUs seem to die during training
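A minimal sketch of Leaky ReLU with the leakage multiplier exposed as an argument (called alpha here, a name chosen for this example). The last function shows the extra gradient a Parametric ReLU would use to learn alpha, assuming alpha is treated as an ordinary trainable scalar.

```python
def leaky_relu(x, alpha=0.01):
    """x if x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope with respect to the input: never exactly zero when alpha != 0."""
    return 1.0 if x > 0 else alpha

def leaky_relu_grad_alpha(x):
    """Slope with respect to alpha, which Parametric ReLU would train on."""
    return 0.0 if x > 0 else x

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x), leaky_relu_grad_alpha(x))
```

Since the slope for negative inputs is alpha rather than 0, a neuron pushed into the negative region still receives a small gradient and can recover.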

Noisy ReLU:

  • ReLU with Gaussian noise
  • max(0, x + Y), where Y ~ N(0, sig(x))
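The formula above leaves some room for interpretation. The sketch below assumes that sig(x) is the Sigmoid function and that it gives the variance of the noise, so the standard deviation passed to the random generator is its square root. Both choices are assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def noisy_relu(x, rng):
    """max(0, x + Y) with Y ~ N(0, sig(x)); sig(x) is assumed to be the variance."""
    std = math.sqrt(sigmoid(x))
    noise = rng.gauss(0.0, std)  # gauss() takes the standard deviation
    return max(0.0, x + noise)

rng = random.Random(0)  # seeded so the run is repeatable
print([round(noisy_relu(1.0, rng), 3) for _ in range(5)])
```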

RefLU (Reflected Linear Unit):

  • More exotic variation of ReLU that avoids dying.
  • Unusual in that its output can become negative for both low and high stimulation inputs.


Maxout:

  • Generalization of ReLU and Leaky ReLU but more expensive to compute.
  • max(w1 * x + b1, w2 * x + b2)
  • This doubles the number of parameters for each neuron.
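To see why this generalizes the other two, the sketch below writes the two-piece max for a scalar input and recovers ReLU and Leaky ReLU by fixing the parameters by hand. In a real unit all four values would be learned; the function names are made up for this example.

```python
def maxout(x, w1, b1, w2, b2):
    """max(w1 * x + b1, w2 * x + b2) for a scalar input."""
    return max(w1 * x + b1, w2 * x + b2)

def relu_as_maxout(x):
    # Second piece fixed to the constant 0 gives max(x, 0).
    return maxout(x, 1.0, 0.0, 0.0, 0.0)

def leaky_relu_as_maxout(x):
    # Second piece fixed to 0.01 * x gives max(x, 0.01 * x).
    return maxout(x, 1.0, 0.0, 0.01, 0.0)

for x in (-2.0, 0.0, 3.0):
    print(x, relu_as_maxout(x), leaky_relu_as_maxout(x))
```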

The output layer should use:

  • Softmax for classification
  • An applicable linear function for regression
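For the classification case, a minimal Softmax sketch is shown below. Subtracting the largest score before exponentiating is a common trick to avoid overflow and does not change the result. The example scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(softmax([1000.0, 1000.0]))  # no overflow thanks to the shift
```

The class with the highest raw score always receives the highest probability, so the predicted class is simply the index of the largest output.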