The softmax function is commonly used in classification tasks. Suppose we have an input vector \([x_1, x_2, \ldots, x_N]\). After softmax, each element becomes:

\[p_i = \frac{\exp(x_i)}{\sum_{j=1}^{N}\exp(x_j)}\]

The denominator normalizes the elements so that they sum to 1. The original vector is thus transformed into a probability distribution, and the index with the highest probability is the predicted class.

In practice, we often see softmax with temperature, which is a slight modification of softmax:

\[p_i = \frac{\exp(x_i/\tau)}{\sum_{j=1}^{N}\exp(x_j/\tau)}\]

The parameter \(\tau\) is called the temperature parameter (see footnote 1), and it controls the softness of the probability distribution. As \(\tau\) gets smaller, the largest value in \(x\) receives more of the probability; as \(\tau\) gets larger, the probability is spread more evenly across the elements. In the extreme cases, when \(\tau\) approaches zero, the probability of the largest element approaches 1, and when \(\tau\) approaches infinity, every element gets the same probability.
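
A short derivation, added here as a sketch and assuming finite inputs and \(\tau > 0\), makes the two limits explicit. Dividing the numerator and denominator by \(\exp(x_i/\tau)\) gives

\[p_i = \frac{1}{\sum_{j=1}^{N}\exp\big((x_j - x_i)/\tau\big)}\]

As \(\tau \to \infty\), every exponent tends to 0, so each term in the sum tends to 1 and \(p_i \to 1/N\). As \(\tau \to 0^+\), terms with \(x_j < x_i\) vanish and terms with \(x_j > x_i\) grow without bound, so \(p_i \to 1\) when \(x_i\) is the unique maximum and \(p_i \to 0\) otherwise. If several elements tie for the maximum, they share the probability equally.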

import math


def softmax(vec, temperature):
    """Turn vec into a normalized probability distribution."""
    # Divide every element by the temperature before exponentiating.
    sum_exp = sum(math.exp(x / temperature) for x in vec)
    return [math.exp(x / temperature) / sum_exp for x in vec]


def main():
    vec = [1, 5, 7, 10]
    ts = [0.1, 1, 10, 100, 10000]

    # Print the distribution for each temperature.
    for t in ts:
        print(t, softmax(vec, t))


if __name__ == "__main__":
    main()

With different values of t, the output probabilities are (see also the title image):

0.1 [8.194012623989748e-40, 1.928749847963737e-22, 9.357622968839298e-14, 0.9999999999999064]
1 [0.00011679362893736733, 0.006376716075637758, 0.0471179128098403, 0.9463885774855847]
10 [0.14763314666550595, 0.2202427743860977, 0.26900513210002774, 0.3631189468483686]
100 [0.23827555570657363, 0.24799976560608047, 0.25300969319764466, 0.2607149854897012]
10000 [0.2498812648459304, 0.2499812373450356, 0.2500312385924627, 0.2501062592165714]
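
One practical caveat with very small temperatures: `math.exp` raises `OverflowError` once its argument exceeds roughly 709, so the `softmax` function above fails for, say, `vec = [1, 5, 7, 10]` with a temperature of 0.01. A common workaround, sketched below under the hypothetical name `stable_softmax` (it is not part of the original script), subtracts the maximum before exponentiating. The result is mathematically unchanged because the common factor cancels in the ratio.

```python
import math


def stable_softmax(vec, temperature):
    """Softmax with temperature that avoids overflow for small temperatures."""
    # Scale first, then shift so that the largest exponent is exactly 0.
    scaled = [x / temperature for x in vec]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    sum_exp = sum(exps)
    return [e / sum_exp for e in exps]


if __name__ == "__main__":
    vec = [1, 5, 7, 10]
    # math.exp(10 / 0.01) would overflow; the shifted version does not.
    print(stable_softmax(vec, 0.01))
    print(stable_softmax(vec, 1))
```

Very negative exponents simply underflow to 0.0, which is harmless here since those elements carry negligible probability anyway.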

According to this post, the name softmax is somewhat misleading: the function is better described as softargmax, which is especially apparent when \(\tau\) is very small.

For example, for vec = [1, 5, 7, 10], the argmax is index 3 (counting from 0). Expressed as a one-hot encoding, this is [0, 0, 0, 1], which is very close to the softmax output for \(\tau = 0.1\).
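
This comparison can be checked numerically. The sketch below uses two illustrative helpers, `softmax` and `one_hot_argmax` (names chosen here, not taken from any library), and prints the largest element-wise gap between the softmax output and the one-hot argmax vector for a few temperatures.

```python
import math


def softmax(vec, temperature):
    """Softmax with temperature (max-shifted for numerical safety)."""
    scaled = [x / temperature for x in vec]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(vec):
    """One-hot encoding of the first index that holds the maximum value."""
    best = max(range(len(vec)), key=lambda i: vec[i])
    return [1.0 if i == best else 0.0 for i in range(len(vec))]


if __name__ == "__main__":
    vec = [1, 5, 7, 10]
    target = one_hot_argmax(vec)
    for t in [10, 1, 0.1]:
        # Largest absolute difference between softmax and one-hot argmax.
        gap = max(abs(p - o) for p, o in zip(softmax(vec, t), target))
        print(t, gap)
```

The gap shrinks as the temperature decreases, which is the sense in which softmax acts as a smooth approximation of argmax.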


In Distilling the Knowledge in a Neural Network, the authors also use a temperature parameter in softmax:

"Using a higher value for T produces a softer probability distribution over classes."
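
To make this concrete, here is a minimal sketch of how a temperature softens a teacher model's outputs into soft targets, and how a student can be scored against them with a cross-entropy at the same temperature. The logits and the function names (`softmax`, `distillation_loss`) are made up for illustration; this follows the general idea of distillation, not any particular implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature (max-shifted for numerical safety)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between teacher and student, both softened by the same temperature."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


if __name__ == "__main__":
    teacher_logits = [1.0, 5.0, 7.0, 10.0]  # made-up values
    student_logits = [2.0, 4.0, 6.0, 9.0]   # made-up values
    for T in [1.0, 4.0, 20.0]:
        print(T, softmax(teacher_logits, T),
              distillation_loss(student_logits, teacher_logits, T))
```

At a higher temperature the teacher's distribution keeps more information about the non-maximal classes, which is what the student is meant to learn from. The paper additionally scales this term by the square of the temperature when it is combined with a hard-label loss; that factor is left out of the sketch.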

Supervised contrastive learning

In the MoCo paper, softmax loss with temperature is used (it is a slightly modified version of InfoNCE loss):

\[\text{Loss} = -\log\frac{\exp(q\cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q\cdot k_i/\tau)}\]

In that paper, \(\tau\) is set to a small value, 0.07. To see why, suppose we drop the temperature (i.e., \(\tau = 1\)), the dot product of every negative pair is -1, the dot product of the positive pair is 1, and there are 1024 keys in total (one positive and 1023 negatives). The model has separated the positive and negative pairs perfectly, yet the softmax loss is still large:

\[-\log\frac{e}{e + 1023e^{-1}} \approx 4.94\]

With \(\tau = 0.07\), however, the loss drops to about \(4 \times 10^{-10}\), which is effectively zero. So a small \(\tau\) helps concentrate the probability distribution on the positive pair and reduces the loss.
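
The two numbers can be reproduced with a few lines of Python. This is a sketch under the same idealized assumptions as the example (one positive with dot product 1, 1023 negatives with dot product -1); `info_nce_loss` is a hypothetical helper written for this check, not code from MoCo.

```python
import math


def info_nce_loss(pos_logit, neg_logits, temperature=1.0):
    """Negative log of the softmax probability assigned to the positive key."""
    logits = [pos_logit / temperature] + [n / temperature for n in neg_logits]
    m = max(logits)
    # log-sum-exp with the maximum factored out for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    negatives = [-1.0] * 1023  # 1023 perfectly separated negative keys
    print(info_nce_loss(1.0, negatives, temperature=1.0))
    print(info_nce_loss(1.0, negatives, temperature=0.07))
```

Dividing by 0.07 stretches the gap between the positive and negative logits from 2 to about 28.6, which is why the negatives contribute almost nothing to the denominator.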

MoCo borrows this value from Unsupervised Feature Learning via Non-Parametric Instance Discrimination, in which the authors say:

"τ is important for supervised feature learning [43], and also necessary for tuning the concentration of v on our unit sphere."

Ref. 43 is the paper NormFace: L2 Hypersphere Embedding for Face Verification. In Sec. 3.3 of NormFace, the authors show theoretically why a scaling factor (see footnote 2) is necessary in the softmax loss. Basically, without a scaling factor, the lower bound of the loss is high, and the network cannot learn a good representation of image features.
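
The intuition can be illustrated with a simplified calculation. When features are normalized, similarities are confined to [-1, 1], so the logits cannot grow on their own and the loss has a floor that only a scaling factor can lower. The sketch below is not the exact bound derived in NormFace; it reuses the idealized best case from the MoCo example (positive similarity 1, all negative similarities -1), and `loss_floor`, the scale values, and the negative counts are arbitrary choices for illustration.

```python
import math


def loss_floor(scale, num_negatives, pos_sim=1.0, neg_sim=-1.0):
    """Softmax loss when all similarities sit at the assumed best-case values."""
    # log(1 + n * exp(-scale * (pos_sim - neg_sim))), via log1p for accuracy
    return math.log1p(num_negatives * math.exp(-scale * (pos_sim - neg_sim)))


if __name__ == "__main__":
    for num_negatives in [9, 1023, 65535]:
        for scale in [1, 5, 1 / 0.07, 30]:
            print(num_negatives, round(scale, 2), loss_floor(scale, num_negatives))
```

With a scale of 1 the floor grows with the number of negatives, while a larger scale (equivalently, a smaller \(\tau\)) pushes it toward zero.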


  1. The name temperature may come from the Boltzmann distribution, which has a similar form and a temperature parameter.

  2. NormFace uses \(s=1/\tau\) as the scaling factor and multiplies by it, instead of dividing by \(\tau\) directly.