When training a binary classifier, cross-entropy (CE) loss is usually used, as squared error loss cannot distinguish bad predictions from extremely bad predictions. The CE loss is defined as follows:
$$\mathrm{CE} = -\left[\, y \log \hat{y} + (1-y)\log(1-\hat{y}) \,\right]$$

where $\hat{y}$ is the predicted probability of the sample falling in the positive class $y=1$. Here $\hat{y} = \sigma(z)$, where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function and $z$ is the raw logit.
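As a minimal sketch, this definition translates directly into NumPy (the function names here are my own):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a raw logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def ce_naive(y, z):
    """Direct translation of the CE definition: sigmoid first, then log."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```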
When implementing CE loss, we could calculate $\hat{y} = \sigma(z)$ first and then plug it into the definition of CE loss. However, there is a problem with this in practice. At the beginning of training, a positive example ($y = 1$) might be confidently classified as a negative example ($\hat{y} \approx 0$), implying $z \ll 0$. If $\hat{y}$ is small enough, it could be smaller than the smallest positive floating-point number, i.e. it underflows to numerically zero. Then we get $-\infty$ when we take the log of 0 while computing the cross-entropy. To tackle this numerical stability issue, the logistic function and the cross-entropy are usually combined into one op, e.g. tf.nn.sigmoid_cross_entropy_with_logits in TensorFlow and torch.nn.BCEWithLogitsLoss in PyTorch. Substituting $\hat{y} = \sigma(z)$ into the definition and simplifying yields a form that works directly on logits:

$$\mathrm{CE} = (1-y)\,z + \log\left(1 + e^{-z}\right)$$
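To see the failure concretely with the naive sketch above (the value -800 is just an arbitrarily confident wrong logit):

```python
# sigmoid(-800) underflows to exactly 0.0 in float64, so log(0) -> -inf
# and the loss becomes inf instead of a large-but-finite number.
print(ce_naive(y=1.0, z=-800.0))  # inf, with overflow/divide warnings

# The fused library ops take raw logits and never form sigmoid(z) explicitly:
#   TensorFlow: tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z)
#   PyTorch:    torch.nn.functional.binary_cross_entropy_with_logits(z, y)
```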
Still, the numerical stability issue is not completely under control, since $e^{-z}$ could blow up (overflow) if $z$ is a large negative number. To tackle this potential problem, the "log-sum-exp" trick is used to shift the center of the exponential sum. The log-sum-exp trick is described as follows:

$$\log \sum_{i} e^{x_i} = m + \log \sum_{i} e^{x_i - m}, \quad \text{where } m = \max_i x_i$$
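A minimal sketch of the trick (scipy.special.logsumexp implements the same idea):

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))), with the max shifted out so no exp can overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 999.0, 998.0])
# Naive np.log(np.sum(np.exp(x))) returns inf because exp(1000) overflows;
# the shifted version is finite and accurate.
print(logsumexp(x))  # ~1000.4076
```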
Using this formula, we can force the greatest exponent to be zero, so no term can overflow even if other values underflow. Applying the trick to $\log(1+e^{-z}) = \log(e^{0} + e^{-z})$ with $m = \max(0, -z)$, the loss can be computed in practice as

$$\mathrm{CE} = \max(z, 0) - z\,y + \log\left(1 + e^{-|z|}\right)$$
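Putting it together, a sketch of the stable formulation (mirroring what the fused library ops compute):

```python
import numpy as np

def bce_with_logits(y, z):
    """Stable binary CE on raw logits: max(z, 0) - z*y + log1p(exp(-|z|))."""
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

print(bce_with_logits(1.0, -800.0))  # 800.0 -- large but finite, no inf/nan
print(bce_with_logits(1.0, 5.0))     # ~0.0067, a confident correct prediction
```

Note the use of np.log1p, which stays accurate when $e^{-|z|}$ is tiny.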
However, the log(0) issue still exists if you ever need to recover the logit from a probability: $z = \operatorname{logit}(\hat{y}) = \log\left(\frac{\hat{y}}{1-\hat{y}}\right)$ diverges to $-\infty$ as $\hat{y} \to 0$.
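A quick sketch of that failure mode (again using -800 as an arbitrary saturated logit):

```python
import numpy as np

y_hat = 1.0 / (1.0 + np.exp(800.0))  # sigmoid(-800) underflows to exactly 0.0
z = np.log(y_hat / (1.0 - y_hat))    # logit(0) = log(0) = -inf
print(y_hat, z)                      # 0.0 -inf
```

So the stable formulations only help if you keep the raw logits around; once a probability has saturated to 0 or 1, the logit cannot be recovered.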