# How to Calculate Square Root without Using sqrt()?

I saw an interesting question: how do you get the square root of \(x\) without using the built-in function of your language? There are several ways to approach this problem.

# Babylonian method

There is a method called the Babylonian method for calculating square roots; see the Wikipedia page on methods of computing square roots for the details.
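
In short, starting from an initial guess \(x_0 > 0\) for \(\sqrt{S}\) (where \(S\) is the value whose root we want), the method repeatedly averages the current guess with \(S / x_n\):

\[x_{n+1} = \frac{1}{2}\left(x_n + \frac{S}{x_n}\right)\]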

The code is as follows:

```
def sqrt(x):
    if x == 0:
        return 0
    res = x / 2.0   # initial guess
    eps = 1e-7      # convergence tolerance
    # Repeatedly average the guess with x / guess until it stabilizes.
    # (abs() keeps the loop correct even when the initial guess is below sqrt(x).)
    while abs(res - x / res) > eps:
        res = (res + x / res) * 0.5
    return res
```

This method is very fast: it can compute \(\sqrt{5}\) in fewer than 5 iterations.
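
As a quick sanity check (a small sketch separate from the function above), we can trace the iterates for \(x = 5\):

```
# Trace the Babylonian iterates for x = 5 (sqrt(5) ~ 2.2360679...).
x = 5.0
res = x / 2.0  # initial guess
for i in range(5):
    res = (res + x / res) * 0.5
    print(i, res)
# The iterates are roughly 2.25, 2.23611..., 2.2360679..., and then they stop
# changing at double precision -- about 4 iterations for sqrt(5).
```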

# Newton method

We can also use Newton's method to find the square root of \(num\). Basically, we apply it to the following \(f(x)\):

\[f(x) = x^2 - num\]

We can get the square root of \(num\) iteratively via:

\[x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\]
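
Plugging in \(f(x) = x^2 - num\) and \(f'(x) = 2x\), the update becomes:

\[x_{n+1} = x_n - \frac{x_n^2 - num}{2x_n} = \frac{1}{2}\left(x_n + \frac{num}{x_n}\right)\]

which is exactly the Babylonian update from the previous section.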

The code is like this:

```
def newton_sqrt(num):
    cur = num / 2.0   # initial guess
    eps = 1e-7        # convergence tolerance
    # Iterate until the update no longer changes the estimate significantly.
    while abs(get_next(cur, num) - cur) > eps:
        cur = get_next(cur, num)
    return cur


def get_next(x, num):
    # One Newton step: x_{n+1} = x_n - f(x_n) / f'(x_n)
    return x - f_val(x, num) / df(x)


def f_val(x, num):
    return x * x - num


def df(x):
    return 2 * x


def main():
    num = 5
    res = newton_sqrt(num)
    print(f"res: {res}")


if __name__ == "__main__":
    main()
```

Newton's method also converges quickly to the square root of \(num\).

# Gradient descent

Apart from the methods above, I thought we could also solve this problem easily with gradient descent.

## l1 loss

Suppose the square root of \(x\) is \(w\); then our loss function will be the difference between \(x\) and \(w^2\). Our objective is to minimize this loss.
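
Whatever form the loss \(l\) takes, gradient descent updates \(w\) using the derivative of the loss and a learning rate \(\eta\):

\[w \leftarrow w - \eta \frac{\partial l}{\partial w}\]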

Here is my sample code to solve this problem:

```
import random

import matplotlib.pyplot as plt
import numpy as np


def main():
    x = 5
    w = random.uniform(1, x)   # random initial guess for sqrt(x)
    print(f"initial w: {w}")
    loss_type = "l1"           # switch to "l2" to use the l2 loss

    n_iter = 200
    lr = 0.01
    eps = 0.00001
    losses = []

    for i in range(n_iter):
        lr = exp_decay_lr_scheduler(lr, i, n_iter)
        # lr = step_lr_scheduler(lr, i, n_iter)
        if loss_type == 'l1':
            dl_dw = get_dl_dw(w, x)
        else:
            dl_dw = get_dl_dw2(w, x)
        w -= (lr * dl_dw)

        if loss_type == 'l1':
            loss = abs(x - w**2)
        else:
            loss = (x - w**2)**2
        print(f"step: {i}, lr: {lr}, w: {w}, loss: {loss}")
        losses.append(loss)
        if loss < eps:
            break

    print(f"final w: {w}")

    # Plot the loss curve.
    ax = plt.subplot()
    steps = range(len(losses))
    ax.plot(steps, losses)
    ax.set_xlabel("iteration")
    ax.set_ylabel("loss")
    plt.show()


def step_lr_scheduler(lr, i, n_iter):
    # Divide the learning rate by lr_factor every `step` iterations.
    lr_factor = 5
    step = 20
    if i != 0 and i % step == 0:
        lr = lr / lr_factor
    return lr


def exp_decay_lr_scheduler(lr, i, n_iter):
    # Decay the learning rate smoothly by the factor exp(-alpha * i / n_iter).
    alpha = 1.0
    return lr * np.exp(-alpha * i / n_iter)


def get_dl_dw(w, x):
    # Derivative of the l1 loss |x - w^2| w.r.t. w.
    if w*w >= x:
        return 2*w
    else:
        return -2*w


def get_dl_dw2(w, x):
    # Derivative of the l2 loss (x - w^2)^2 w.r.t. w.
    return 4*w**3 - 4*w*x


if __name__ == "__main__":
    main()
```

There are two points worth noting.

The first is how we calculate the loss. Initially, I made the mistake of defining the loss as \(x - w^2\). Since we want to minimize the difference between \(x\) and \(w^2\), the loss should be defined as \(\vert x - w^2 \vert\); otherwise, we can never learn a proper value for \(w\). Because the loss equals \(w^2 - x\) when \(w^2 \geq x\) and \(x - w^2\) otherwise, the derivative of the loss w.r.t. \(w\) is:

\[\frac{\partial l}{\partial w} = \begin{cases} -2w & w^2 < x \\ 2w & w^2 \geq x \end{cases}\]

The second point is the learning rate schedule. When \(w\) is close to its true value, we should use a small learning rate; otherwise, we will see oscillation in the training curve: the loss bumps up and down. I first tried a step policy for the learning rate, i.e., reducing the learning rate every fixed number of steps. With this policy, I found that the loss curve oscillates and is not smooth, and we cannot get a loss significantly smaller than 0.001.

I then tried an exponential decay policy, where we decay the learning rate smoothly by the following factor:

\[\exp\left(-\alpha \frac{i}{N}\right)\]

Here \(N\) is the total number of iterations, and \(i\) is the current iteration. With this policy, we can approximate \(\sqrt{x}\) to a precision as high as \(1 \times 10^{-5}\) in fewer iterations.

## Using l2 loss

Aside from the l1 loss, we can also use the l2 loss: \((x - w^2)^2\). The derivative of the loss w.r.t. \(w\) is:

\[\frac{\partial l}{\partial w} = 4w^3 - 4xw\]
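
This follows from the chain rule applied to \(l = (x - w^2)^2\):

\[\frac{\partial l}{\partial w} = 2(x - w^2) \cdot (-2w) = 4w^3 - 4xw\]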

In the above code, we can change `loss_type` to `l2` to try the l2 loss. Using the l2 loss often leads to faster convergence than the l1 loss.

# Conclusion

Both the Babylonian method and Newton’s method converge much faster than gradient descent. However, using gradient descent to solve this problem is fun and provides a new perspective.