The softmax activation function is commonly used as the output layer in a neural network.

## The Math

Mathematically, the softmax function is represented as follows.

\[ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} \]

*for i = 1,…,K, where K is the number of distinct classes to be predicted and z denotes the input vector of activations.*

For example: \[ \begin{bmatrix} 2.0 \\ 4.3 \\ 1.2 \\ -3.1 \end{bmatrix} \Rightarrow \begin{bmatrix} e^{2.0}/(e^{2.0}+e^{4.3}+e^{1.2}+e^{-3.1}) \\ e^{4.3}/(e^{2.0}+e^{4.3}+e^{1.2}+e^{-3.1}) \\ e^{1.2}/(e^{2.0}+e^{4.3}+e^{1.2}+e^{-3.1}) \\ e^{-3.1}/(e^{2.0}+e^{4.3}+e^{1.2}+e^{-3.1})\end{bmatrix} = \begin{bmatrix} 0.08749 \\ 0.87266 \\ 0.03931 \\ 0.00053 \end{bmatrix} \]

## Why

The softmax activation function is often the final layer in multi-class classification neural networks. This function possesses the following important properties:

### Ensure activations are positive numbers

It can be hard to interpret activations with negative values; taking the exponent ensures all values are positive.

### Amplify small differences between activations

Taking the exponent also amplifies the differences between activations, which helps in picking one class over another.

### Generate a valid probability distribution

Once we have positive values, normalizing across them produces activations between 0 and 1 that sum to 1 and can be interpreted as probabilities.

#### Examples

Let’s validate the properties mentioned above with a concrete example. Consider the following activations.

\[ \begin{bmatrix} -0.1849 \\ 3.1026 \\ 1.7967 \end{bmatrix} \] It isn’t clear which class the input most likely belongs to. Let’s try to fix this by normalizing.

\[ \begin{bmatrix} -0.1849 \\ 3.1026 \\ 1.7967 \end{bmatrix} \Rightarrow \begin{bmatrix} -0.1849/(-0.1849+3.1026+1.7967) \\ 3.1026/(-0.1849+3.1026+1.7967) \\ 1.7967/(-0.1849+3.1026+1.7967)\end{bmatrix} = \begin{bmatrix} -0.0392 \\ 0.6581 \\ 0.3811 \end{bmatrix} \] Our activations now sum to 1; however, the first activation is still negative (-0.0392), so treating these values as probabilities would be incorrect.

Finally, after applying the softmax function we end up with \[ \begin{bmatrix} 0.0285 \\ 0.7643 \\ 0.2070 \end{bmatrix} \]
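If you want to verify these numbers yourself, here is a minimal sketch in plain Go that compares naive normalization with softmax on the activations above. The helper functions `normalize` and `softmax` here are only for illustration and are not part of the network code that follows.

```go
package main

import (
	"fmt"
	"math"
)

// normalize divides each value by the plain sum; negative inputs stay negative.
func normalize(xs []float64) []float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	out := make([]float64, len(xs))
	for i, x := range xs {
		out[i] = x / sum
	}
	return out
}

// softmax exponentiates each value before normalizing, so every output is
// positive and the outputs sum to 1.
func softmax(xs []float64) []float64 {
	var sum float64
	for _, x := range xs {
		sum += math.Exp(x)
	}
	out := make([]float64, len(xs))
	for i, x := range xs {
		out[i] = math.Exp(x) / sum
	}
	return out
}

func main() {
	acts := []float64{-0.1849, 3.1026, 1.7967}
	fmt.Println(normalize(acts)) // ≈ [-0.0392 0.6581 0.3811]: sums to 1, but still has a negative entry
	fmt.Println(softmax(acts))   // ≈ [0.0285 0.7643 0.2071]: a valid probability distribution
}
```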

## The code

**Note: These code examples are for educational purposes only and are in no way intended to be used in production.**

We’ll be extending the concepts explained in my previous post, “Gradient descent with code”, to matrix operations, considering an oversimplified neural network with a single linear layer. This should give you an intuition for what the activations look like with and without a softmax layer.

*Note: We’ll be using the gonum library to represent our matrices and perform operations on them.*

### Putting it all together

Here `SimpleNN()` is an oversimplified representation of a neural network with a single linear layer, and `SoftmaxNN()` is the same network with a softmax layer added.


```go
import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

func main() {
	SimpleNN()
	SoftmaxNN()
}

// SimpleNN builds a single linear layer and prints its raw activations.
func SimpleNN() {
	nn := linear()
	fmt.Println("Activation after linear layer")
	fmt.Println(mat.Formatted(nn))
}

// SoftmaxNN builds a linear layer and passes its activations
// through the softmax layer before printing them.
func SoftmaxNN() {
	nn := linear()
	nnsmax := softmax(nn)
	fmt.Println("Activation after softmax layer")
	fmt.Println(mat.Formatted(nnsmax))
}
```

### Output

You should see output similar to the one below. Note that `SimpleNN()` and `SoftmaxNN()` each call `linear()` with their own randomly initialized values, so the softmax activations shown here are not derived from the linear activations above them.

**Activations after linear layer**
\[
\begin{bmatrix} -0.971289832017899 \\ -0.31566700629829314 \\ 0.20609785706914163 \end{bmatrix}
\]

**Activations after softmax layer**
\[
\begin{bmatrix} 0.9849108426709876 \\ 0.013460649878849347 \\ 0.0016285074501629477 \end{bmatrix}
\]

### Linear Layer

Let’s code out the simple linear layer. In keeping with the conventions from the previous post, we’ll use `m`, `x` and `c` to represent our weights, inputs and biases.

To keep things simple, we’ll assume we have only 3 classes to be predicted, hence our output matrix should have a dimension of `3x1`. Accordingly, we’ll assume `m` has dimensions `3x3` and `c` has dimensions `3x1`.

E.g. `mx + c`:

\[ \begin{bmatrix} 5 & 3 & 5 \\ 3 & 7 & 0 \\ 1 & 2 & 5 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 17 \\ 18 \\ 11 \end{bmatrix} \]
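As a quick sanity check, the same computation can be reproduced with gonum using fixed values. This is only a sketch to illustrate the matrix operations; the actual layer below initializes `m`, `x` and `c` randomly instead.

```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

func main() {
	// The fixed example matrices from above.
	m := mat.NewDense(3, 3, []float64{
		5, 3, 5,
		3, 7, 0,
		1, 2, 5,
	})
	x := mat.NewDense(3, 1, []float64{1, 2, 1})
	c := mat.NewDense(3, 1, []float64{1, 1, 1})

	out := mat.NewDense(3, 1, nil)
	out.Mul(m, x)   // mx
	out.Add(out, c) // mx + c

	// Prints a 3x1 matrix containing 17, 18 and 11.
	fmt.Println(mat.Formatted(out))
}
```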


import "gonum.org/v1/gonum/mat"
func initRandomMatrix(row int, col int) *mat.Dense {
size := row * col
arr := make([]float64, size)
for i := range arr {
arr[i] = rand.NormFloat64()
}
return mat.NewDense(row, col, arr)
}
func linear() *mat.Dense {
m := initRandomMatrix(3, 3)
x := initRandomMatrix(3, 1)
c := initRandomMatrix(3, 1)
ll := mat.NewDense(3, 1, nil)
ll.Mul(m, x)
ll.Add(ll, c)
return ll
}

`initRandomMatrix` is a helper function that initializes a matrix with values drawn at random from the standard normal distribution.

### Softmax Function


import "gonum.org/v1/gonum/mat"
func softmax(matrix *mat.Dense) *mat.Dense {
var sum float64
// Calculate the sum
for _, v := range matrix.RawMatrix().Data {
sum += math.Exp(v)
}
resultMatrix := mat.NewDense(matrix.RawMatrix().Rows, matrix.RawMatrix().Cols, nil)
// Calculate softmax value for each element
resultMatrix.Apply(func(i int, j int, v float64) float64 {
return math.Exp(v) / sum
}, matrix)
return resultMatrix
}

As you can see from the code above, the `softmax` function is pretty straightforward: it exponentiates every element and normalizes by the sum of all the exponentials.

## Source Code

All the source code can be found on GitHub.