The softmax function is commonly used in classification tasks. Suppose we have an input vector $$[x_1, x_2, \ldots, x_N]$$. After softmax, each element becomes:

$p_i = \frac{\exp(x_i)}{\sum_{j=1}^{N}\exp(x_j)}$

The denominator normalizes the elements so that they sum to 1. The original vector is thus transformed into a probability distribution, and the index with the highest probability is the predicted class.

In practice, we often see softmax with temperature, which is a slight modification of softmax:

$p_i = \frac{\exp(x_i/\tau)}{\sum_{j=1}^{N}\exp(x_j/\tau)}$

The parameter $$\tau$$ is called the temperature (see footnote 1), and it controls the softness of the probability distribution. As $$\tau$$ gets smaller, the largest value in $$x$$ receives more of the probability mass; as $$\tau$$ gets larger, the probability is spread more evenly across the elements. In the extreme cases, as $$\tau$$ approaches zero the probability of the largest element approaches 1, and as $$\tau$$ approaches infinity every element gets the same probability, $$1/N$$.
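To see why these limits hold, divide the numerator and denominator of the formula above by $$\exp(x_i/\tau)$$:

$p_i = \frac{1}{\sum_{j=1}^{N}\exp((x_j - x_i)/\tau)}$

As $$\tau \to \infty$$, every exponent goes to zero, so each term in the sum approaches 1 and $$p_i \to 1/N$$. As $$\tau \to 0^+$$, any term with $$x_j > x_i$$ blows up and pushes $$p_i$$ to 0. For the largest element (assuming it is unique), all terms with $$j \neq i$$ vanish and only the $$j = i$$ term, which equals 1, remains, so $$p_i \to 1$$.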

    import math


    def softmax(vec, temperature):
        """
        turn vec into normalized probability
        """
        sum_exp = sum(math.exp(x / temperature) for x in vec)
        return [math.exp(x / temperature) / sum_exp for x in vec]


    def main():
        vec = [1, 5, 7, 10]
        ts = [0.1, 1, 10, 100, 10000]

        for t in ts:
            print(t, softmax(vec, t))


    if __name__ == "__main__":
        main()

With different values of t, the output probabilities are (see also the title image):

    0.1 [8.194012623989748e-40, 1.928749847963737e-22, 9.357622968839298e-14, 0.9999999999999064]
    1 [0.00011679362893736733, 0.006376716075637758, 0.0471179128098403, 0.9463885774855847]
    10 [0.14763314666550595, 0.2202427743860977, 0.26900513210002774, 0.3631189468483686]
    100 [0.23827555570657363, 0.24799976560608047, 0.25300969319764466, 0.2607149854897012]
    10000 [0.2498812648459304, 0.2499812373450356, 0.2500312385924627, 0.2501062592165714]
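One practical caveat with small temperatures: dividing by a small $$\tau$$ inflates the exponents, and `math.exp` raises `OverflowError` once its argument exceeds roughly 709. A common remedy, sketched below, is to subtract the maximum scaled value before exponentiating, which leaves the result mathematically unchanged. The function name `stable_softmax` and the test vector are my own choices for illustration.

```python
import math


def stable_softmax(vec, temperature):
    """Softmax with temperature, shifted by the max to avoid overflow."""
    scaled = [x / temperature for x in vec]
    shift = max(scaled)
    # After the shift, every exponent is <= 0, so math.exp cannot overflow.
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # A direct math.exp(1000 / 0.1) would overflow; the shifted version does not.
    print(stable_softmax([1, 5, 7, 1000], 0.1))
    print(stable_softmax([1, 5, 7, 10], 1))
```

The shift cancels between numerator and denominator, so the output only differs from the direct formula by floating-point rounding.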

According to this post, the name softmax is somewhat misleading: the function is better described as softargmax, especially when $$\tau$$ is very small.

For example, for vec = [1, 5, 7, 10], the argmax is 3 (the zero-based index of the largest element). Expressed as a one-hot encoding, that is [0, 0, 0, 1], which is very close to the softmax output when $$\tau = 0.1$$.
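The comparison with argmax can be checked numerically. The sketch below measures the largest element-wise gap between the softmax output and the one-hot argmax vector at a few temperatures; the helper `one_hot_argmax` is a hypothetical name introduced here.

```python
import math


def softmax(vec, temperature):
    sum_exp = sum(math.exp(x / temperature) for x in vec)
    return [math.exp(x / temperature) / sum_exp for x in vec]


def one_hot_argmax(vec):
    """Return the one-hot encoding of the (zero-based) argmax of vec."""
    best = max(range(len(vec)), key=lambda i: vec[i])
    return [1.0 if i == best else 0.0 for i in range(len(vec))]


if __name__ == "__main__":
    vec = [1, 5, 7, 10]
    target = one_hot_argmax(vec)
    for t in [10, 1, 0.1]:
        probs = softmax(vec, t)
        # Largest element-wise gap between softmax output and the one-hot vector.
        gap = max(abs(p - h) for p, h in zip(probs, target))
        print(t, gap)
```

The gap shrinks as the temperature decreases, which is the sense in which softmax behaves like a soft version of argmax.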

# Applications

The paper Distilling the Knowledge in a Neural Network also uses a temperature parameter in softmax:

Using a higher value for T produces a softer probability distribution over classes.
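As a rough illustration of that statement, the sketch below applies softmax at several temperatures to a made-up logit vector and reports the Shannon entropy of the result as a measure of softness. The logits, the temperature values, and the use of entropy are my own assumptions and are not taken from the paper.

```python
import math


def softmax(logits, temperature):
    sum_exp = sum(math.exp(z / temperature) for z in logits)
    return [math.exp(z / temperature) / sum_exp for z in logits]


def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    teacher_logits = [1, 5, 7, 10]  # hypothetical logits, not from the paper
    for T in [1, 2, 5]:
        soft_targets = softmax(teacher_logits, T)
        print(T, [round(p, 4) for p in soft_targets], round(entropy(soft_targets), 4))
```

With a higher T, the non-maximal classes keep a larger share of the probability, so the entropy rises toward its maximum of $$\log N$$.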

## Contrastive learning

In the MoCo paper, softmax loss with temperature is used (it is a slightly modified version of InfoNCE loss):

$\text{Loss} = -\log\frac{\exp(q\cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q\cdot k_i/\tau)}$

In that paper, $$\tau$$ is set to a very small value, 0.07. To see why, suppose we do not use a temperature (i.e., $$\tau = 1$$), the dot product of every negative pair is -1, the dot product of the positive pair is 1, and there are 1024 keys in total (one positive and 1023 negatives). The model has then separated the positive and negative pairs perfectly, yet the softmax loss is still large:

$-\log\frac{e}{e + 1023e^{-1}} \approx 4.94$

With $$\tau = 0.07$$, however, the loss drops to about $$4 \times 10^{-10}$$, which is effectively zero. So a small $$\tau$$ helps concentrate the probability distribution on the positive pair and reduces the loss.
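The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch that assumes all negatives share the same dot product, as in the example; the function name `contrastive_loss` and its parameters are hypothetical and do not come from the MoCo code.

```python
import math


def contrastive_loss(pos_logit, neg_logit, num_negatives, tau=1.0):
    """-log softmax probability of the positive key when all negatives share one logit."""
    # log(1 + n * exp((neg - pos) / tau)) is an equivalent, numerically safer form of
    # -log(exp(pos / tau) / (exp(pos / tau) + n * exp(neg / tau))).
    return math.log1p(num_negatives * math.exp((neg_logit - pos_logit) / tau))


if __name__ == "__main__":
    print(contrastive_loss(1.0, -1.0, 1023))            # no temperature (tau = 1)
    print(contrastive_loss(1.0, -1.0, 1023, tau=0.07))  # the temperature used in MoCo
```

The first call gives a loss of roughly 4.94, and the second gives a value many orders of magnitude smaller, because dividing by 0.07 stretches the gap between the positive and negative logits from 2 to about 28.6.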

MoCo borrows this value from Unsupervised Feature Learning via Non-Parametric Instance Discrimination, in which the authors say:

τ is important for supervised feature learning [43], and also necessary for tuning the concentration of v on our unit sphere.

Ref 43 of that paper refers to NormFace: L2 Hypersphere Embedding for Face Verification. In Sec. 3.3 of NormFace, the authors show theoretically why a scaling factor (see footnote 2) is necessary in the softmax loss. Basically, without a scaling factor, the lower bound of the loss is high, and the model cannot learn a good representation of image features.

1. The name temperature may come from the Boltzmann distribution, which has a similar form and a temperature parameter.

2. NormFace uses $$s = 1/\tau$$ as the scaling factor and multiplies by it, instead of dividing by $$\tau$$ directly.