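When training a binary classifier, cross-entropy (CE) loss is usually preferred over squared error loss, because squared error on a probability is bounded by 1 and therefore barely distinguishes bad predictions from extremely bad ones. For a single sample with label $y \in \{0, 1\}$, the CE loss is defined as follows:

$$\mathcal{L}_{CE} = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

where $p$ is the predicted probability of the sample falling in the positive class $y = 1$. Here $p = \sigma(z) = \frac{1}{1 + e^{-z}}$, where $\sigma$ is the sigmoid function and $z$ is the logit, i.e. the raw output of the model.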
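To make the contrast between the two losses concrete, here is a minimal sketch for a positive example (label 1) that is predicted badly and then extremely badly. The function names `squared_error` and `cross_entropy` and the probabilities 0.1 and 0.001 are illustrative choices, not part of any library.

```python
import math

def squared_error(p, y):
    # Squared error between the predicted probability p and the label y.
    return (y - p) ** 2

def cross_entropy(p, y):
    # Binary cross-entropy for a single sample; assumes 0 < p < 1.
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

y = 1  # positive example
for p in (0.1, 0.001):  # a bad and an extremely bad prediction
    print(p, round(squared_error(p, y), 3), round(cross_entropy(p, y), 3))
# 0.1   -> squared error 0.81,  cross-entropy 2.303
# 0.001 -> squared error 0.998, cross-entropy 6.908
```

The squared error moves from 0.81 to about 1.0 and can never exceed 1, while the cross-entropy triples and keeps growing as the prediction gets worse.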
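When implementing CE loss, we could compute $p = \sigma(z)$ from the logit $z$ first and then plug it into the definition of CE loss. However, there is a problem with this in practice. At the beginning of training, a positive example ($y = 1$) might be confidently classified as a negative example ($z \ll 0$), implying $p \approx 0$. If $p$ is small enough, it could be smaller than the smallest representable floating-point value, i.e. numerically zero. Then we get $-\infty$ when we take the log of 0 while computing the cross-entropy. To tackle this numerical stability issue, the logistic function and the cross-entropy are usually combined into a single operation in packages such as TensorFlow (tf.nn.sigmoid_cross_entropy_with_logits) and PyTorch (torch.nn.BCEWithLogitsLoss). Since $-\log \sigma(z) = \log(1 + e^{-z})$ and $-\log(1 - \sigma(z)) = \log(1 + e^{z})$, the combined loss is

$$\mathcal{L}_{CE} = y \log\left(1 + e^{-z}\right) + (1 - y) \log\left(1 + e^{z}\right)$$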
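The failure described above can be reproduced with a few lines of standard-library Python. This is only a sketch: `naive_ce` and `ce_with_logits` are hypothetical names, and the logit value 40 is an arbitrary choice. The example uses the mirror-image case (a negative example given a large positive logit), because in 64-bit floats that case already breaks at a modest logit. Python's `math.log(0.0)` raises `ValueError`, whereas array libraries such as NumPy typically return `-inf` with a warning.

```python
import math

def naive_ce(z, y):
    # Two-step version: sigmoid first, then the CE definition.
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def ce_with_logits(z, y):
    # Sigmoid and CE combined: y*log(1+e^{-z}) + (1-y)*log(1+e^{z}).
    # Still evaluates e^{-z} and e^{z} directly, so it is only meant
    # for moderate |z| here.
    return y * math.log1p(math.exp(-z)) + (1 - y) * math.log1p(math.exp(z))

# Moderate logit: both versions agree.
print(naive_ce(2.0, 1), ce_with_logits(2.0, 1))

# Confidently wrong prediction: p rounds to exactly 1.0, so 1 - p is 0.0.
try:
    naive_ce(40.0, 0)
except ValueError as err:
    print("naive CE failed:", err)

# The combined form never takes the log of a rounded probability.
print(ce_with_logits(40.0, 0))  # approximately 40
```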
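Still, the numerical stability issue is not completely under control, since the term $e^{-z}$ inside $\log(1 + e^{-z})$ could overflow if $z$ is a large negative number. To tackle this potential problem, the “log-sum-exp” trick is used to shift the center of the exponential sum. The log-sum-exp trick is described as follows:

$$\log \sum_{i=1}^{n} e^{x_i} = a + \log \sum_{i=1}^{n} e^{x_i - a}, \qquad a = \max_i x_i$$

Using this formula, we force the greatest exponent to be zero, so the largest term in the sum is exactly $e^{0} = 1$ and the sum cannot overflow, even if other terms underflow. So $\log(1 + e^{-z}) = \log(e^{0} + e^{-z})$ can be computed as

$$\log\left(1 + e^{-z}\right) = \max(0, -z) + \log\left(1 + e^{-|z|}\right)$$

in practice.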
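The trick above can be sketched in standard-library Python as follows. The names `logsumexp`, `softplus` and `stable_ce_with_logits` are illustrative; this is not the actual TensorFlow or PyTorch implementation, only a scalar sketch of the same idea, assuming the label `y` is 0 or 1.

```python
import math

def logsumexp(xs):
    # log(sum(exp(x_i))) with the maximum shifted out, so the largest
    # exponent is exactly 0 and exp() cannot overflow.
    a = max(xs)
    return a + math.log(sum(math.exp(x - a) for x in xs))

def softplus(x):
    # log(1 + e^x) = logsumexp([0, x]) with shift a = max(0, x).
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))

def stable_ce_with_logits(z, y):
    # y*log(1+e^{-z}) + (1-y)*log(1+e^{z}), each term via softplus.
    return y * softplus(-z) + (1 - y) * softplus(z)

# Extreme logits no longer overflow or produce log(0).
print(stable_ce_with_logits(-800.0, 1))  # → 800.0
print(stable_ce_with_logits(800.0, 1))   # → 0.0
```

For a very negative logit on a positive example the loss grows linearly in $|z|$ instead of becoming infinite, which keeps the gradient finite as well.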