LogSumExp is often used in machine learning. It has the following form:

\[\begin{equation} LSE(x_1, x_2, \ldots, x_n) = \log\sum_{i=1}^{N}\exp(x_i) \end{equation}\]

LSE as an upper bound for max()

LSE is an upper bound for \(max(x_1, x_2, \ldots, x_n)\) (The equality establishes only when \(n=1\)). We have the following inequality:

\[\begin{equation} max(x_1, x_2, \ldots, x_n) <= \log\sum_{i=1}^{N}\exp(x_i) \end{equation}\]

To verify this, I have drawn a graph comparing LSE vs max for 2D data in the range of [-1, 1]. The graph is shown in the title image. It clear that LSE is indeed an uppber bound for max.

click here to check the code used to generate the title image

import matplotlib.pyplot as plt

import numpy as np

def main():
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')

    N = 200
    x1 = np.linspace(-1, 1, N)
    x2 = np.linspace(-1, 1, N)

    X1, X2 = np.meshgrid(x1, x2)
    Y = np.log(np.exp(X1) + np.exp(X2))
    surf = ax.plot_surface(X1, X2, Y, color='red')

    Y_up = np.max(np.stack([X1, X2], axis=0), axis=0)
    surf2 = ax.plot_surface(X1, X2, Y_up, color='blue')

    # y = np.log(np.exp(x1) + np.exp(x2))
    # ax.plot_trisurf(x1, x2, y, color='red')

    # y_up = np.max(np.stack([x1, x2], axis=1), axis=1)
    # ax.plot_trisurf(x1, x2, y_up, color='green')


    # change the 3D plot angel and dist, https://stackoverflow.com/q/12904912/6064933
    ax.view_init(elev=11, azim=-46)
    ax.dist = 10

    # plt.show()

    plt.savefig("log_sum_exp_vs_max.pdf", bbox_inches='tight')

if __name__ == "__main__":

How is this information useful? Well, we can use it to transform the optimization target. For example, you may want to optimize \(max(x_1, x_2)\), which is not differentiable. Then we can optimize \(LSE(x_1, x_2)\) instead.

In Lifted structure loss, they used this trick to transform equation 3 to equation 4. Without this knowledge, you will find it difficult to understand how they arrive at equation 4.

LSE for numerical stability

When we use softmax function to normalize a vector and the vector contains large or small values, we will encounter numerical issues (overflow or underflow). We need to use LSE to alleviate this issue.

This post explains how it works in detail.