Activation functions are a crucial component of artificial neural networks. They introduce non-linearity into the models, allowing them to capture complex relationships in data. In this article, we'll explore various non-linear activation functions commonly used in neural networks, their properties, and when to use them.
The sigmoid activation function is defined as:
σ(x) = 1 / (1 + e^(-x))
- Range: (0, 1)
- Common Use: Often used in the output layer for binary classification, where it maps a raw score (logit) to a probability.
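As a quick illustration, here is a minimal NumPy sketch of the sigmoid (the function name and sample values are only for demonstration):

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)), applied elementwise
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.119, 0.5, 0.881]
```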
The hyperbolic tangent (tanh) activation function is defined as:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Range: (-1, 1)
- Common Use: Suitable for hidden layers; because its output is zero-centered it often trains better than sigmoid, but it still suffers from the vanishing gradient problem for large |x|.
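A minimal NumPy sketch, using the built-in np.tanh and noting its relationship to the sigmoid:

```python
import numpy as np

def tanh(x):
    # Equivalent to 2 * sigmoid(2x) - 1, i.e. a rescaled, zero-centered sigmoid
    return np.tanh(x)

print(tanh(np.array([-2.0, 0.0, 2.0])))  # ~[-0.964, 0.0, 0.964]
```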
The ReLU (rectified linear unit) activation function is defined as:
f(x) = max(0, x)
- Range: [0, ∞)
- Common Use: Highly popular for hidden layers due to its simplicity and efficiency. However, it can suffer from the "dying ReLU" problem, where neurons that only receive negative inputs output zero and stop getting gradient updates.
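A minimal NumPy sketch of ReLU (names are illustrative):

```python
import numpy as np

def relu(x):
    # max(0, x) applied elementwise; negative inputs are clamped to zero
    return np.maximum(0.0, x)

print(relu(np.array([-1.5, 0.0, 3.0])))  # [0. 0. 3.]
```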
The leaky ReLU activation function is defined as:
f(x) = x if x > 0, otherwise αx (with a small α, e.g. 0.01)
- Range: (-∞, ∞)
- Common Use: Helps solve the dying ReLU problem by introducing a small slope α (the "leak") for negative inputs, so the gradient never becomes exactly zero.
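A small NumPy sketch, assuming the common default α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x for negative ones
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 2.0])))  # [-0.02  0.    2.  ]
```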
The parameterized ReLU (PReLU) activation function is similar to leaky ReLU, but the slope α is learned during training rather than fixed.
- Range: (-∞, ∞)
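A sketch of the forward pass; here α is shown as a plain variable, whereas in practice it would be a trainable parameter updated by the optimizer:

```python
import numpy as np

def prelu(x, alpha):
    # Same form as leaky ReLU, but alpha is a learned parameter (shared or per channel)
    return np.where(x > 0, x, alpha * x)

alpha = 0.25  # typical initial value; adjusted by gradient descent during training
print(prelu(np.array([-2.0, 2.0]), alpha))  # [-0.5  2. ]
```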
The ELU (exponential linear unit) activation function is defined as:
f(x) = x if x > 0, otherwise α(e^x - 1)
- Range: (-α, ∞), typically (-1, ∞) with α = 1
- Advantage: Smoother than ReLU for negative inputs, and its negative outputs push mean activations closer to zero.
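A minimal NumPy sketch with the usual α = 1; np.minimum is used only to keep the exponential numerically safe, since np.where evaluates both branches:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(elu(np.array([-2.0, 0.0, 2.0])))  # ~[-0.865, 0.0, 2.0]
```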
The SELU (scaled ELU) activation function is similar to ELU but multiplies it by a fixed scaling factor λ, using the constants λ ≈ 1.0507 and α ≈ 1.6733.
- Range: (-λα, ∞), approximately (-1.76, ∞)
- Advantage: Enables self-normalization (activations tend toward zero mean and unit variance), which helps with vanishing and exploding gradients, but it requires specific conditions such as LeCun-normal initialization and a plain feed-forward architecture.
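A sketch using the published constants (rounded here for readability):

```python
import numpy as np

# Constants from the SELU paper (Klambauer et al., 2017), rounded
SELU_LAMBDA = 1.0507
SELU_ALPHA = 1.6733

def selu(x):
    # lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise
    return SELU_LAMBDA * np.where(x > 0, x,
                                  SELU_ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

print(selu(np.array([-2.0, 0.0, 2.0])))  # ~[-1.52, 0.0, 2.10]
```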
The GELU (Gaussian error linear unit) activation function weights the input by the Gaussian cumulative distribution function, producing a smooth, non-monotonic curve. It is defined as:
GELU(x) = x · Φ(x), commonly approximated as 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
- Range: Approximately (-0.17, ∞)
- Advantage: Suitable for deep learning models, particularly transformers like BERT and GPT-2.
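A minimal NumPy sketch of the tanh-based approximation, which is the form commonly used in transformer implementations:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(np.array([-1.0, 0.0, 1.0])))  # ~[-0.159, 0.0, 0.841]
```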
The softmax activation function is used in the output layer for multi-class classification:
softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
- Range: (0, 1), with the outputs summing to 1
- Common Use: Converts raw scores (logits) into a probability distribution over multiple classes.
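A minimal NumPy sketch; subtracting the maximum logit first is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(logits):
    # Shift by the max logit to avoid overflow in exp; mathematically equivalent
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099], sums to 1
```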
Swish is a smooth non-linear activation function defined as:
f(x) = x · σ(x) (more generally x · σ(βx), with β = 1 in the common case)
- Range: Approximately (-0.28, ∞)
- Advantage: Swish behaves like ReLU for large positive inputs but is smooth and non-monotonic, which often yields slightly better results in deep networks and can help alleviate some vanishing gradient problems.
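A small NumPy sketch with β = 1 (this special case is also known as SiLU):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta = 1 gives the SiLU variant
    return x / (1.0 + np.exp(-beta * x))

print(swish(np.array([-2.0, 0.0, 2.0])))  # ~[-0.238, 0.0, 1.762]
```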
Softplus is defined as:
f(x) = ln(1 + e^x)
- Range: (0, ∞)
- Advantage: A smooth approximation of ReLU. Useful when you need a non-linear activation whose output stays strictly positive.
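A minimal NumPy sketch, written in the numerically stable form max(x, 0) + ln(1 + e^(-|x|)) to avoid overflow for large inputs:

```python
import numpy as np

def softplus(x):
    # Equivalent to ln(1 + e^x), but safe for large positive or negative x
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

print(softplus(np.array([-2.0, 0.0, 2.0])))  # ~[0.127, 0.693, 2.127]
```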