GELU() Vs Mish(): Activation Functions For Transformers
GELU() & Mish() mitigate the Vanishing Gradient Problem & the Dying ReLU Problem. They're computationally more expensive than traditional activation functions like ReLU, but they're effective alternatives to them. They're available in PyTorch & used in Transformer models like BERT & ChatGPT.
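As a quick orientation, here's a minimal sketch (my own example, not part of the original post) showing how the two activations are created and applied in PyTorch with torch.nn.GELU() and torch.nn.Mish():

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

gelu = nn.GELU()  # y = x * Φ(x), Φ = standard normal CDF
mish = nn.Mish()  # y = x * tanh(softplus(x))

# Unlike ReLU, negative inputs are squashed toward 0 but not clamped to exactly 0,
# which keeps some gradient alive and helps avoid the Dying ReLU Problem.
print(gelu(x))  # e.g. GELU(1.0) ≈ 0.8413 (= Φ(1)), GELU(0.0) = 0.0
print(mish(x))  # e.g. Mish(1.0) ≈ 0.8651, Mish(0.0) = 0.0
```

Both modules can be dropped in wherever nn.ReLU() would go, e.g. inside an nn.Sequential feed-forward block.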
*Memos:
- My post explains GELU() and Mish().
- My post explains SiLU() and Softplus().
- My post explains Step function, Identity and ReLU.
- My post explains Leaky ReLU, PReLU and FReLU.
- My post explains ELU, SELU and CELU.
- My post explains Tanh, Softsign, Sigmoid and Softmax.
- My post explains Vanishing Gradient Problem, Exploding Gradient Problem and Dying ReLU Problem.
- My post explains layers in PyTorch.
- My post explains loss functions in PyTorch.
- My post explains optimizers in PyTorch.

(1) GELU (Gaussian Error Linear Unit): can convert an input value (x) to an output value (y) by the formula y = x * Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution.
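As a sanity check on that formula, here's a small sketch (my own, assuming the exact erf-based definition of GELU) that computes GELU and Mish by hand and compares them against torch.nn.functional.gelu and torch.nn.functional.mish:

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# GELU from the definition: y = x * Φ(x), with Φ the standard normal CDF
phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
gelu_manual = x * phi

# Mish from the definition: y = x * tanh(softplus(x))
mish_manual = x * torch.tanh(F.softplus(x))

# Both match PyTorch's built-ins (F.gelu defaults to the exact erf form)
print(torch.allclose(gelu_manual, F.gelu(x)))  # True
print(torch.allclose(mish_manual, F.mish(x)))  # True
```

PyTorch also offers a faster tanh approximation of GELU via nn.GELU(approximate='tanh').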