When training a binary classifier, cross-entropy (CE) loss is usually used because squared error loss cannot distinguish bad predictions from extremely bad predictions. The CE loss is defined as follows:

$$L_{CE}(y, t) = -t \log(y) - (1 - t) \log(1 - y)$$

where $y$ is the predicted probability of the sample falling in the positive class and $t \in \{0, 1\}$ is the true label. The probability is computed from the logit $z$ as $y = \sigma(z)$, where

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

is a sigmoid function.
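For concreteness, here is a minimal NumPy sketch of these two definitions (the names `sigmoid` and `ce_loss` and the sample values are only illustrative):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(y, t):
    # L_CE(y, t) = -t*log(y) - (1 - t)*log(1 - y)
    return -t * np.log(y) - (1.0 - t) * np.log(1.0 - y)

z = np.array([2.0, -1.5])   # logits
t = np.array([1.0, 0.0])    # true labels (1 = positive class)
y = sigmoid(z)              # predicted probabilities
print(ce_loss(y, t))        # per-sample cross-entropy
```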
When implementing CE loss, we could calculate $y = \sigma(z)$ first and then plug $y$ into the definition of CE loss. However, there is a problem with this in practice. At the beginning of training, a positive example ($t = 1$) might be confidently classified as a negative example ($z \ll 0$), implying $y = \sigma(z) \approx 0$. If $y$ is small enough, it could be smaller than the smallest representable floating point value, i.e. numerically zero. Then we get $-\infty$ if we take the log of 0 when computing the cross-entropy. To tackle this potential numerical stability issue, the logistic function and the cross-entropy are usually combined into one operation in packages such as TensorFlow and PyTorch (e.g. tf.nn.sigmoid_cross_entropy_with_logits and torch.nn.BCEWithLogitsLoss), which work directly on the logit $z$:

$$L_{LCE}(z, t) = t \log(1 + e^{-z}) + (1 - t) \log(1 + e^{z})$$
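As a quick illustration of the failure mode and the fused alternative, here is a small PyTorch sketch (the extreme logit of -200 is chosen only to force underflow in float32):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-200.0])  # a positive example confidently misclassified as negative
t = torch.tensor([1.0])

# Naive path: sigmoid first, then cross-entropy.
# sigmoid(-200) underflows to exactly 0 in float32, so log(0) = -inf and the loss is inf.
y = torch.sigmoid(z)
naive = -t * torch.log(y) - (1 - t) * torch.log(1 - y)
print(naive)   # tensor([inf])

# Fused sigmoid + cross-entropy computed from the logits stays finite (about 200 here).
fused = F.binary_cross_entropy_with_logits(z, t)
print(fused)   # tensor(200.)
```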
Still, the numerical stability issue is not completely under control, since $e^{-z}$ could blow up if $z$ is a large negative number. To tackle this potential problem, the “log-sum-exp” trick is used to shift the center of the exponential sum. The log-sum-exp trick is described as follows:

$$\log \sum_{i=1}^{n} e^{x_i} = m + \log \sum_{i=1}^{n} e^{x_i - m}, \qquad m = \max_i x_i$$

Using this formula, we can force the greatest exponent to be zero, so the largest term is exactly $e^0 = 1$ even if the other terms underflow. So $\log(1 + e^{-z})$ can be computed as

$$\log(1 + e^{-z}) = \max(-z, 0) + \log\!\left(e^{-\max(-z, 0)} + e^{-z - \max(-z, 0)}\right)$$

in practice.
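A small NumPy sketch of this shifted computation (the helper names `stable_log1p_exp_neg` and `stable_bce_with_logits` are only illustrative, not a library API):

```python
import numpy as np

def stable_log1p_exp_neg(z):
    # log(1 + exp(-z)) via the log-sum-exp shift m = max(-z, 0):
    # the largest exponent becomes 0, so nothing overflows.
    m = np.maximum(-z, 0.0)
    return m + np.log(np.exp(-m) + np.exp(-z - m))

def stable_bce_with_logits(z, t):
    # L_LCE(z, t) = t*log(1 + e^{-z}) + (1 - t)*log(1 + e^{z})
    return t * stable_log1p_exp_neg(z) + (1 - t) * stable_log1p_exp_neg(-z)

z = np.array([-1000.0, 1000.0])
t = np.array([1.0, 0.0])
print(stable_bce_with_logits(z, t))   # [1000. 1000.] -- finite, no overflow
# A direct np.log(1 + np.exp(1000)) would instead overflow to inf.
```

NumPy's built-in np.logaddexp(0.0, -z) computes the same log(1 + e^{-z}) with this kind of shift, so in practice you would reach for it, or simply for the framework's fused loss, rather than hand-rolling the helper.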
Leo:
But the log(0) issue still exists when you calculate x = logit(y) = log(y / (1 - y)) as y → 0.