Which activation function is most likely to cause the "vanishing gradient" problem in deep neural networks?
Correct: B (Sigmoid)
Sigmoid squashes inputs into (0, 1), and its derivative σ'(x) = σ(x)(1 − σ(x)) peaks at only 0.25. Back-propagation multiplies these per-layer derivatives together, so repeated multiplication by factors below 1 makes the gradients in early layers exponentially small, slowing or halting their learning.
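As a quick illustration, here is a minimal sketch in plain NumPy (not tied to any framework) that assumes the best case, where every pre-activation is 0 so each layer contributes the maximum derivative of 0.25; even then the gradient factor collapses after a handful of layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Best case: every pre-activation is 0, so each layer multiplies
# the upstream gradient by exactly 0.25.
grad = 1.0
for layer in range(1, 21):
    grad *= sigmoid_derivative(0.0)
    if layer in (1, 5, 10, 20):
        print(f"after {layer:2d} layers: gradient factor = {grad:.2e}")

# after  1 layers: gradient factor = 2.50e-01
# after  5 layers: gradient factor = 9.77e-04
# after 10 layers: gradient factor = 9.54e-07
# after 20 layers: gradient factor = 9.09e-13
```

In a real network the weights also enter the product, but unless they are large enough to offset the at-most-0.25 derivative, the same exponential decay occurs, which is why ReLU-style activations (derivative 1 on the active region) are preferred in deep networks.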