I saw an interesting question that how to get square root of x without using builtin function from your language. There are different ways to approach this problem.

Babylonian method

There is a method called Babylonian method to calculate the square root. See this Wikipedia page for the details.

The code is as follows:

def sqrt(x):

    if x == 0:
        return 0

    res = x/2.0
    eps = 1e-7

    while res - x/res > eps:
        res = (res + x/res) * 0.5

    return res

This method is very fast, it can calculate \(sqrt(5)\) is less than 5 iterations.

Newton method

We can also use newton’s method to find the square root of \(num\). Basically, we have the following \(f(x)\):

\[f(x) = x^2 - num\]

We can get the square root of \(num\) iteratively via:

\[x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\]

The code is like this:

def newton_sqrt(num):
    cur = num / 2.0
    eps = 1e-7

    while abs(get_next(cur, num) - cur) > eps:
        cur = get_next(cur, num)

    return cur

def get_next(x, num):
    return x - f_val(x, num)/df(x)

def f_val(x, num):
    return x*x - num

def df(x):
    return 2*x

def main():
    num = 5
    res = newton_sqrt(num)

    print(f"res: {res}")

if __name__ == "__main__":

Newton’s method also works pretty quickly to find the root square of a num.

Gradient descent

Apart from the Babylonian method, I thought we can also solve this problem with gradient descent easily.

l1 loss

Suppose that square root of \(x\) is \(w\), then our loss function will be the difference between \(x\) and \(w^2\). Our objective is minimize the loss.

Here is my sample code to solve this problem:

import random

import matplotlib.pyplot as plt
import numpy as np

def main():
    x = 5
    w = random.uniform(1, x)
    print(f"initial w: {w}")

    loss_type = "l2"

    n_iter = 200
    lr = 0.01
    eps = 0.00001
    losses = []

    for i in range(n_iter):
        lr = exp_decay_lr_scheduler(lr, i, n_iter)
        # lr = step_lr_scheduler(lr, i, n_iter)

        if loss_type == 'l1':
            dl_dw = get_dl_dw(w, x)
            dl_dw = get_dl_dw2(w, x)
        w -= (lr * dl_dw)

        if loss_type == 'l1':
            loss = abs(x - w**2)
            loss =(x - w**2)**2
        print(f"step: {i}, lr: {lr}, w: {w}, loss: {loss}")

        if loss < eps:

    print(f"final w: {w}")

    ax = plt.subplot()
    x = range(len(losses))
    ax.plot(x, losses)


def step_lr_scheduler(lr, i, n_iter):
    lr_factor = 5
    step = 20

    if i != 0 and i % step == 0:
        lr = lr / lr_factor

    return lr

def exp_decay_lr_scheduler(lr, i, n_iter):
    alpha = 1.0

    return lr * np.exp(-alpha * i / n_iter)

def get_dl_dw(w, x):
    if w*w >= x:
        return 2*w
        return -2*w

def get_dl_dw2(w, x):
    return 4*w**3 - 4*w*x

if __name__ == "__main__":

There are two points worth noting.

First, it is how we calculate the loss. Initially I made the mistake of defining the loss as \(x - w^2\). Since we want to minimize the difference between \(x\) and \(w^2\), the loss should be defined as \(\Vert x - w^2 \Vert\). Otherwise, we can never learn a proper value for \(w\). The derivative of loss w.r.t \(w\) is:

\[\frac{\partial l}{\partial w} = \begin{cases} -2w & w^2 < x\\ 2w & w^2 >= x \end{cases}\]

Another point is the learning rate schedule. When \(w\) is close to its real value, we should use a small learning rate. Otherwise, we will see oscillation of the training curve: the loss will bump up and down. I first tried with step policy for learning rate, i.e., reducing the learning rate after the specified steps. With this policy, I found that the loss curve will oscillates and is not smooth, and we can not get a loss significantly small than 0.001.

I tried with the exponentially decaying policy, where we decay the learning rate smoothly with the following factor:

\[\exp(-\alpha * \frac{i}{N})\]

\(N\) is the total number of iterations, and \(i\) is the current iteration. With this policy, we can approximate \(\sqrt{x}\) with precision as high as \(1*10^{-5}\) with fewer iterations.

Using l2 loss

Aside from using l1 loss, we can also use l2 loss: \((x - w^2)^2\). The derivative of loss w.r.t \(w\) is:

\[\frac{\partial l}{\partial w} = 4w^3 - 4xw\]

In the above code, we can change loss_type to l2 to try the l2 loss. Using l2 loss often leads to faster convergence than l1 loss.


Both the Babylonian method and Newton’s method converge much faster than gradient descent. However, using gradient descent to solve this problem is fun and provides a new perspective.