Gradient descent is one of the most fundamental algorithms in machine learning. In this post I’ll attempt to explain how the algorithm works, using Go code.

If you are completely new to the gradient descent algorithm, I’d suggest watching the Linear Regression with One Variable series of videos by Andrew Ng; he explains the theory behind the algorithm in great depth.

Motivation

If you’re a programmer like me, mathematical equations may not be intuitive and may even feel intimidating. I find that converting those equations to code helps me grasp them better and makes them less intimidating.

Gradient descent is one of the most fundamental algorithms used to find optimal parameters of machine learning models. In practice, you may rarely need to implement gradient descent on your own; however, understanding how it works will help you better train complex ML models.

Note: These code examples are for educational purposes only and are in no way intended to be used in production.

Getting started

To get started, I’ve randomly generated points that approximately fall along a straight line. Hence, for our model, we’ll use the straight-line equation to predict points in the future.
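
The generation step might look like the following sketch (this is illustrative, not the post’s actual code; the slope, intercept, and noise range are assumptions):

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateData returns n points that lie roughly on the line
// y = trueM*x + trueC, with a little uniform noise added.
func generateData(n int, trueM, trueC float64) (xs, ys []float64) {
	for i := 0; i < n; i++ {
		x := float64(i)
		noise := rand.Float64() - 0.5 // noise in [-0.5, 0.5)
		xs = append(xs, x)
		ys = append(ys, trueM*x+trueC+noise)
	}
	return xs, ys
}

func main() {
	xs, ys := generateData(20, 2.0, 1.0)
	fmt.Println(len(xs), len(ys)) // 20 20
}
```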

$y = mx + c$

Think of it this way: our model needs to predict the value y given an input feature x. In order to do this we need to find the best values for m and c to fit the given sample values of x and y.

This is where the gradient descent algorithm comes in: we’ll use it to find the best possible values for m and c.

m and c are the parameters of our model.

The model

Let’s start by defining our model with a function.

$f(x) = mx + c$
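
In Go, this model is just a one-line function (a minimal sketch; the post’s actual code may use different names):

```go
package main

import "fmt"

// predict implements the model f(x) = m*x + c.
func predict(x, m, c float64) float64 {
	return m*x + c
}

func main() {
	fmt.Println(predict(2, 3, 1)) // 3*2 + 1 = 7
}
```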

The loss function

We need a way to measure how good or bad values of m and c are. To do this we need to define a loss function J.

We’ll use the mean squared error (MSE) loss function, which is the mean of the squared differences between the predicted and actual values.

$J = \frac{1}{n}\displaystyle\sum_{i=0}^{n-1} \big(f(x[i]) - y[i] \big)^2$ $\text{ where n is the size of the input}$
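
Translated directly into Go, the loss function could be sketched like this (again an illustration, not the post’s exact source):

```go
package main

import "fmt"

// mse computes the mean squared error J between the model's
// predictions m*x[i] + c and the observed values y[i].
func mse(xs, ys []float64, m, c float64) float64 {
	var sum float64
	for i := range xs {
		diff := m*xs[i] + c - ys[i] // f(x[i]) - y[i]
		sum += diff * diff
	}
	return sum / float64(len(xs))
}

func main() {
	xs := []float64{0, 1, 2}
	ys := []float64{1, 3, 5} // these lie exactly on y = 2x + 1
	fmt.Println(mse(xs, ys, 2, 1)) // perfect fit, so the loss is 0
}
```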

Our objective with gradient descent is to minimize the loss (or cost) function.

(Image: the graph of a loss function with a single parameter w.)

The image above represents the graph of a loss (or cost) function with a single parameter w. In order to know which direction to move in to get to the bottom of the curve, we calculate the gradient (or derivative, or slope) at that point.

For our loss function J defined above, since we have two parameters, we have to compute the partial derivative with respect to m and the partial derivative with respect to c. The final values come out to be:

$\frac{\partial J}{\partial m} = \frac{2}{n}\displaystyle\sum_{i=0}^{n-1} \big(f(x[i]) - y[i] \big)x[i]$

$\frac{\partial J}{\partial c} = \frac{2}{n}\displaystyle\sum_{i=0}^{n-1} \big(f(x[i]) - y[i] \big)$
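
Both partial derivatives can be computed in a single pass over the data. A sketch of how this might look in Go (the function name and signature are my own):

```go
package main

import "fmt"

// gradients returns the partial derivatives ∂J/∂m and ∂J/∂c
// of the MSE loss, matching the two formulas above.
func gradients(xs, ys []float64, m, c float64) (dm, dc float64) {
	n := float64(len(xs))
	for i := range xs {
		diff := m*xs[i] + c - ys[i] // f(x[i]) - y[i]
		dm += diff * xs[i]
		dc += diff
	}
	return 2 * dm / n, 2 * dc / n
}

func main() {
	xs := []float64{1, 2}
	ys := []float64{2, 4}
	fmt.Println(gradients(xs, ys, 0, 0)) // -10 -6
}
```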

Updating the parameters

Once we know in which direction to move along the curve, we need to decide how big a step to take. Taking too big or too small a step will cause problems; to learn more, please refer to Andrew Ng’s videos linked above, which give an in-depth explanation.

The amount we step is controlled by the learning rate (LR).

$m_{new} = m_{old} - LR * \text{ Gradient of m}$

$c_{new} = c_{old} - LR * \text{ Gradient of c}$
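
The update rule itself is trivial in code (a sketch; `lr` is the learning rate, `dm`/`dc` the gradients from the previous step):

```go
package main

import "fmt"

// step applies one gradient-descent update:
// m_new = m_old - lr*dm, c_new = c_old - lr*dc.
func step(m, c, dm, dc, lr float64) (float64, float64) {
	return m - lr*dm, c - lr*dc
}

func main() {
	m, c := step(1.0, 1.0, 10.0, 5.0, 0.1)
	fmt.Println(m, c) // 0 0.5
}
```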

Putting it all together

In machine learning, an epoch is defined as one iteration of training through the dataset. Here we’ll set epochs = 15 and a learning rate of 1e-3. We then initialize the parameters with random values and start training.

Note: rand.Seed(7.0) is used to get reproducible results.
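
Putting the pieces together, the whole training loop might be sketched like this (the toy data points and helper names are mine, not the post’s actual source):

```go
package main

import (
	"fmt"
	"math/rand"
)

// train runs plain batch gradient descent on y ≈ m*x + c
// and returns the learned parameters.
func train(xs, ys []float64, epochs int, lr, m, c float64) (float64, float64) {
	n := float64(len(xs))
	for e := 0; e < epochs; e++ {
		// Partial derivatives of the MSE loss.
		var dm, dc float64
		for i := range xs {
			diff := m*xs[i] + c - ys[i]
			dm += 2 * diff * xs[i] / n
			dc += 2 * diff / n
		}
		// Update the parameters.
		m -= lr * dm
		c -= lr * dc
	}
	return m, c
}

func main() {
	rand.Seed(7.0) // as in the post, for reproducible results

	// Toy points on y = 2x + 1 (illustrative, not the post's data).
	xs := []float64{0, 1, 2, 3, 4}
	ys := []float64{1, 3, 5, 7, 9}

	m, c := train(xs, ys, 15, 1e-3, rand.Float64(), rand.Float64())
	fmt.Printf("m=%.4f c=%.4f\n", m, c)
}
```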

Running this gives me the following results:

Plugging in the final values of m and c and plotting the resulting line on a graph along with the actual values gives us this result.

Source Code

All the source code can be found on GitHub.