# Neural Networks - *Activation Functions*


**The activation function of a node defines the output of that node given a set of inputs.** This output is then used as input for the next node, and so on through the network until it produces its final output.

**Useful activation functions are always non-linear.** Stacking linear functions gains nothing, because any stack of linear layers can always be represented as a single linear layer.
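As a minimal sketch of why stacking buys nothing, the pure-Python example below composes two linear layers and then builds the single layer that produces the same output. The weights, biases, and helper names are made up for illustration.

```python
# Two stacked linear layers: y = W2 * (W1 * x + b1) + b2
# collapse into one layer:   y = (W2 * W1) * x + (W2 * b1 + b2)

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical parameters, chosen only for the demonstration.
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [1.0, 0.0]
x = [3.0, -2.0]

# Output of the two-layer stack (no activation in between).
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Equivalent single layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
single = add(matvec(W, x), b)

print(stacked, single)
```

Putting any non-linear function between the two layers breaks this equivalence, which is what makes depth useful.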

**TL;DR: what should I use?**

- Try normal ReLUs.
- If a lot of neurons keep dying: try reducing the learning rate.
- If a lot of neurons still die: try Leaky ReLU.

**Lowering the learning rate can reduce the number of dying neurons.** A big update can push a neuron's weights into the middle of a "dead zone" that is hard to get back out of.

```
This is a common behaviour in normal ReLUs.
```
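The effect can be sketched with a single ReLU neuron on scalar inputs. Everything here is an assumption made for illustration: the toy loss (the mean output, which pushes the neuron's output down), the inputs, the starting weights, and the two learning rates.

```python
def step(w, b, xs, lr):
    """One gradient-descent step on the toy loss L = mean(max(0, w*x + b))."""
    # Only inputs with a positive pre-activation pass any gradient back.
    active = [x for x in xs if w * x + b > 0]
    grad_w = sum(active) / len(xs)
    grad_b = len(active) / len(xs)
    return w - lr * grad_w, b - lr * grad_b

def active_count(w, b, xs):
    """Number of inputs for which the neuron still fires."""
    return sum(1 for x in xs if w * x + b > 0)

xs = [0.5, 1.0, 1.5, 2.0]

# Same starting point, two different learning rates.
w_small, b_small = step(1.0, 0.0, xs, lr=0.1)
w_big, b_big = step(1.0, 0.0, xs, lr=2.0)

print(active_count(w_small, b_small, xs))  # neuron still fires for every input
print(active_count(w_big, b_big, xs))      # neuron fires for no input: it is dead

# A dead neuron receives zero gradient, so further steps cannot revive it.
print(step(w_big, b_big, xs, lr=2.0) == (w_big, b_big))
```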

**Some artificial neurons have difficulty learning when they are very wrong.** Far more difficulty than when they are just a little wrong.

```
This is the result of the general Sigmoid function shape: it is very flat where the output is close to 0 or 1.
```

**Keep an eye out for weights that are learning slowly.** A weight learns slowly if either its input neuron has low activation or its output neuron has saturated, at either the high or the low end. You can use different activation functions to keep neurons from saturating like this.
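Both causes show up in the gradient of a sigmoid neuron's output with respect to one of its weights, which is the input activation times the slope of the sigmoid. The sketch below assumes a single sigmoid neuron; the function names and sample values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_gradient(a_in, z):
    """d(output)/d(weight) for a sigmoid neuron: a_in * sig'(z)."""
    s = sigmoid(z)
    return a_in * s * (1.0 - s)

healthy = weight_gradient(a_in=1.0, z=0.0)          # steepest part of the curve
low_input = weight_gradient(a_in=0.01, z=0.0)       # weak input activation
saturated_high = weight_gradient(a_in=1.0, z=10.0)  # output stuck near 1
saturated_low = weight_gradient(a_in=1.0, z=-10.0)  # output stuck near 0

print(healthy, low_input, saturated_high, saturated_low)
```

The last three gradients are orders of magnitude smaller than the first, so those weights barely move per update.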

**Artificial neurons are usually named after their activation function:**

*Perceptron:*

- takes several binary inputs, `x1, x2, ...`
- produces a single binary output based on weights and a threshold, `(w * x) > threshold`
- never use Perceptrons
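For reference, the threshold rule above can be sketched in a few lines. The weights and threshold below are hand-picked (not learned) so that the unit behaves like a logical AND; they are illustrative assumptions.

```python
def perceptron(xs, ws, threshold):
    """Return 1 if the weighted sum of binary inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(ws, xs))
    return 1 if weighted_sum > threshold else 0

# Hand-picked parameters that implement AND on two binary inputs.
AND_WEIGHTS = [1.0, 1.0]
AND_THRESHOLD = 1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], AND_WEIGHTS, AND_THRESHOLD))
```

The output is a hard step, so a small change in a weight either does nothing or flips the output entirely, which is one reason smoother activations replaced it.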

*Sigmoid:*

- takes several inputs, `x1, x2, ...`, preferably values between 0 and 1
- produces output using a sigmoid function, `sig(w * x + b)`
- generates an "S"-shaped curve
- approaches 1 for very large inputs and 0 for very negative inputs
- causes gradients to vanish: the curve is nearly flat where the output is close to 0 or 1, so the updates there are tiny and the network is hard to optimize
- never use Sigmoid

*Tanh (Hyperbolic Tangent Function):*

- takes several inputs, `x1, x2, ...`, preferably values between -1 and 1
- produces output using the Tanh function, `tanh(w * x + b)`, from -1 to +1
- generates an "S"-shaped curve
- approaches 1 for very large inputs and -1 for very negative inputs
- causes gradients to vanish where the curve flattens out near -1 and +1
- avoid using Tanh; it is unlikely to work better than ReLU
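A small sketch of a Tanh neuron and its slope, assuming the `tanh(w * x + b)` form above. The helper names are hypothetical. It also checks the standard identity `tanh(z) = 2 * sig(2z) - 1`, which shows that Tanh is a rescaled sigmoid and therefore saturates in the same way.

```python
import math

def tanh_neuron(xs, ws, b):
    """Output of a single Tanh neuron: tanh(w . x + b)."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return math.tanh(z)

def tanh_grad(z):
    """Slope of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The slope is largest at 0 and shrinks quickly as |z| grows.
for z in (0.0, 2.0, 5.0):
    print(z, tanh_grad(z))

# Tanh is a shifted and rescaled sigmoid.
z = 0.7
print(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12)
```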

*ReLU (Rectified Linear Unit):*

- produces output using the rectifier function, `max(0, w * x + b)`, from 0 to infinity
- works better on large and complex datasets
- does not saturate for positive inputs, so it avoids the vanishing gradients of Sigmoid and Tanh
- can still suffer from dead neurons (nothing excites the neuron)
- use ReLU by default
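The rectifier and its slope can be sketched as follows. The function names and sample values are illustrative assumptions; the slope at exactly 0 is taken as 0 here, which is a common convention rather than something these notes specify.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0

def relu_neuron(xs, ws, b):
    """Output of a single ReLU neuron: max(0, w . x + b)."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return relu(z)

# The slope stays at 1 however large the input gets, so there is no saturation
# on the positive side. On the negative side the slope is exactly 0, which is
# where dead neurons come from.
grads = [relu_grad(z) for z in (-3.0, 0.5, 100.0)]
print(grads)
```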

*SoftPlus:*

- Approximation of ReLU, creating a smooth curve around 0, `ln(1 + e^x)`
- more expensive to compute than ReLU
- avoid using SoftPlus; too expensive for small benefit
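Evaluating `ln(1 + e^x)` literally overflows for large `x`. A minimal sketch of a numerically safer form follows; it uses the identity `ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|))`, and the function name is just a label chosen here.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    # exp(-|x|) is always in (0, 1], so it can never overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Close to ReLU away from 0, smooth around 0.
for x in (-50.0, -1.0, 0.0, 1.0, 50.0):
    print(x, softplus(x), max(0.0, x))
```

The extra `exp` and `log1p` calls per activation are where the additional cost over ReLU comes from.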

*Leaky ReLU:*

- ReLU with a small non-zero gradient for inputs below zero, `x if x > 0, 0.01*x otherwise`
- if the leakage multiplier is given as a parameter, this becomes *Parametric ReLU*, where the leakage is learned during training
- if the leakage multiplier is random, this becomes *Randomized ReLU*, where the leakage changes during training but is fixed in testing
- use Leaky ReLU if a lot of normal ReLUs seem to die during training
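The three variants can be sketched with one helper. The names, the uniform sampling range, and the choice of the range midpoint as the fixed test-time leakage are assumptions made for this illustration, not details given in these notes.

```python
import random

def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope with respect to x: never exactly zero, so the neuron cannot fully die."""
    return 1.0 if x > 0 else alpha

def prelu_alpha_grad(x):
    """Slope with respect to alpha, which Parametric ReLU would use to learn the leakage."""
    return 0.0 if x > 0 else x

def randomized_relu(x, lower, upper, training, rng=random):
    """Random leakage while training, fixed leakage (midpoint, assumed) when testing."""
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return leaky_relu(x, alpha)

print(leaky_relu(3.0), leaky_relu(-2.0))
print(leaky_relu_grad(-5.0))
print(randomized_relu(-1.0, 0.1, 0.3, training=False))
```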

*Noisy ReLU:*

- ReLU with Gaussian noise, `max(0, x + Y)`, where `Y ~ N(0, sig(x))`
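A rough sketch of the formula above. It assumes that `sig` is the same sigmoid used earlier in these notes and that `sig(x)` is the variance (not the standard deviation) of the noise; both readings are assumptions, as are the function names and the seeded generator.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def noisy_relu(x, rng=random):
    """max(0, x + Y) with Y ~ N(0, sig(x)), reading sig(x) as the noise variance."""
    std = math.sqrt(sigmoid(x))
    noise = rng.gauss(0.0, std)
    return max(0.0, x + noise)

rng = random.Random(0)  # seeded only to make the demonstration repeatable
samples = [noisy_relu(3.0, rng) for _ in range(2000)]

# The output is never negative, and for a clearly positive input it
# averages out close to the input itself.
print(min(samples) >= 0.0, sum(samples) / len(samples))
```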

*RefLU (Reflected Linear Unit):*

- More exotic variation of ReLU that avoids dying.
- Unusual in that its output can become negative for both low and high stimulation inputs.

*Maxout:*

- Generalization of ReLU and Leaky ReLU, but more expensive to compute, `max(w1 * x + b1, w2 * x + b2)`.
- This doubles the number of parameters for each neuron.
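To see why Maxout generalizes the other two, the sketch below uses scalar inputs and fixes the two linear pieces by hand. In a real Maxout unit both pieces would be learned; the helper name and the constants are illustrative assumptions.

```python
def maxout(x, pieces):
    """Maximum over several linear pieces w * x + b."""
    return max(w * x + b for w, b in pieces)

# ReLU is Maxout with the second piece fixed to the constant 0.
RELU_PIECES = [(1.0, 0.0), (0.0, 0.0)]

# Leaky ReLU is Maxout with the second piece fixed to 0.01 * x.
LEAKY_PIECES = [(1.0, 0.0), (0.01, 0.0)]

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, maxout(x, RELU_PIECES), maxout(x, LEAKY_PIECES))
```

Each piece carries its own weight and bias, which is where the doubled parameter count comes from.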

**Output layer should use:**

- *Softmax* for classification
- *An applicable linear function* for regression
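A minimal Softmax sketch for the classification case follows. Subtracting the largest logit before exponentiating is a standard trick to avoid overflow and does not change the result; the function name and example logits are made up for illustration.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Large logits would overflow a naive exp(), but not the shifted version.
print(softmax([1000.0, 1000.0]))
```

For regression the output unit can simply pass its weighted sum through unchanged, so no extra function is needed.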