Automatic Differentiation

7 minute read

Published:

Automatic Differentiation (AD) is a vital process in Deep Learning. Many of deep learning’s techniques like backpropagation relies heavily on AD. There are multiple ways to implement AD, one of which is utilizing Dual Numbers.

Dual Numbers

According to Wikipedia, the dual numbers extend the real numbers by adjoining one new element $\epsilon$ (epsilon) with the property $\epsilon^2 = 0$ ($\epsilon$ is nilpotent). It is similar to the idea of Complex Numbers, whereby it introduces a new element $i$ with the propery $i^2 = -1$. We’ll see how this similarity applies to the rules of promotion/extending of real numbers.

Julia Approach

Julia Language allows its users to easily implement a new type system, which in this case would be Dual Numbers. Moreover, we can introduce method dispatching, allowing for different method behaviours depending on its parameters’ types. This notebook is based on the lecture by Alan Edelman from MIT.

AD vs. Symbolic & Numerical Differentiation

Normally, one could approach differentiation by two popular methods: symbolic and numerical differentiation. Numerical differentiation is basically using the limit-definition of a derivative, as follows:

\[f'(x) = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}\]

By setting the value of $h$ close enough to zero, one can estimate the derivative of a function. However, numerical differentiation may cause round-off errors in our result.

On the other hand, symbolic differentation allows for a more analytical derivative, like that of Python’s Scipy. Similarly, it has drawbacks such as being inefficient at times, especially when dealing with complicated functions.

AD doesn’t face those kinds of problems, and we’ll see shortly why that is. In a nutshell, we lay down the rules of differentation embedded in the Dual Number system.

Code

Dummy Square-root Function

We first create an iterative function to take square roots using the Babylonian Method. This function will serve as a test function to evaluate our AD implementation later on.

function Babylonian(x; N = 10)
    t = (1+x)/2
    for i = 2:N
        t = (t+x/t)/2
    end
    t
end
Babylonian (generic function with 1 method)

Let’s check whether the function works, by comparing it to the default Julia square-root, denoted by .

Babylonian(pi), √pi
(1.7724538509055159, 1.7724538509055159)

It outputs the same result as a default square root operator would do.

Dual Numbers

Dual Numbers is similar to Complex Numbers, but instead of having $i^2 = -1$, we have $\epsilon^2 = 0$ but $\epsilon \neq 0$. Dual Numbers therefore comes in the form of $a + b\epsilon$ with $a, b \in \mathbb{R}$.

We now begin to construct the Dual Numbers struct, which simply consists of a Tuple of Floats. The first element being the function, and the second element being its derivative.

struct D <: Number  # D is a function-derivative pair
    f::Tuple{Float64,Float64}
end

Overloading Base Operations

Then, we need to overload basic arithmetic functions to work with Dual Numbers. The rules set are simply the usual Calculus rules of differentiation. Here are those rules:

\[\frac{d}{dx}[f(x) \pm g(x)] = f'(x) \pm g'(x)\] \[\frac{d}{dx}[f(x) \cdot g(x)] = f(x)\cdot g'(x) + f'(x) \cdot g(x)\] \[\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x) \cdot g(x) - f(x)\cdot g'(x)}{[g(x)]^2}\]

Firstly, import the operations we would like to overload. In Julia, they are under the Base package.

import Base: +, -, *, /

Then apply the rules! The code below looks daunting at first, but most of them are syntactic sugar. Operations .+ and .- are shorthand for element-wise plus and minus, respectively. In the case of * and /, the first parameter that they return are the left-hand side of the equations above. Its second parameters are the right-hand side of the equations, with the first index of f being the original function, whereas the second corresponding to its derivative.

+(x::D, y::D) = D(x.f .+ y.f)
-(x::D, y::D) = D(x.f .- y.f)
*(x::D, y::D) = D((x.f[1]*y.f[1], x.f[2]*y.f[1] + x.f[1]*y.f[2]))
/(x::D, y::D) = D((x.f[1]/y.f[1], (y.f[1]*x.f[2] - x.f[1]*y.f[2])/y.f[1]^2))
/ (generic function with 119 methods)

Converting Real Numbers

In order for Dual Numbers to interact with Real Numbers, we also require several more things. Namely, if we have a Real Number $\alpha$, we need to convert it to a Dual Number of the form

\[\alpha+0\epsilon\]

for two reasons. Firstly, like Complex Numbers, we can extend any real number $\alpha$ into a complex number $\alpha + 0i$. Secondly, this is parallel to the derivative rule which says that for any real number $\alpha$, its derivative is always zero.

To do so in Julia, we can import the convert function from Base. Then we specify that when converting a real number into a Dual, let its derivative be zero.

import Base: convert
convert(::Type{D}, x::Real) = D((x, zero(x)))
convert (generic function with 184 methods)

Lastly, we need to tell Julia what happens when there are other numerical data types like Float or Int. Here, wee specify that we want them to be converted to Dual Number, via overloading the promote_rule function.

import Base: promote_rule
promote_rule(::Type{D}, ::Type{<:Number}) = D
promote_rule (generic function with 123 methods)

Testing AD

That’s it! We’ve successfully implemented Automatic Differentiation. Let’s test whether the Babylonian Square Root works with Dual Numbers. Note that the derivative of $\sqrt x$ with respect to $x$ is

\[\frac{1}{2 \sqrt{x}}\]
x = π

(√x,.5/√x)
(1.7724538509055159, 0.28209479177387814)

Now, instead of saying that $x = \pi$, we pass it as a Dual Number. Here, we set $x = (x, 1)$, where 1 is the derivative of $x$. Remember that in this case, $x$ is the function’s independent variable. In Calculus,

\[\frac{d}{dx}[x] = 1\]
x = D((x, 1))

Babylonian(x)
D((1.7724538509055159, 0.28209479177387814))

As we can see, the results of automatic differentiation is the same as if we would use the analytic differentiation of the squareroot function.

Adding Other Elementary Functions

To add more differentiation rules, we can repeat the same process, just with their respective rules. For instance,

\[\frac{d}{dx}\left[sin(u(x))\right] = cos(u)\cdot u'(x)\] \[\frac{d}{dx}[cos(u(x))] = -sin(u)\cdot u'(x)\] \[\frac{d}{dx}[e^{u(x)}] = e^u \cdot u'(x)\] \[\frac{d}{dx}\left[\ln(u(x))\right] = \frac{u'(x)}{x}\]
import Base: sin, cos, exp, log

sin(x::D) = D((sin(x.f[1]), cos(x.f[1])*x.f[2]))
cos(x::D) = D((cos(x.f[1]), -sin(x.f[1])*x.f[2]))
exp(x::D) = D((exp(x.f[1]), exp(x.f[1])*x.f[2]))
log(x::D) = D((log(x.f[1]), x.f[2]/x.f[1]))
log (generic function with 21 methods)

Testing AD with Elementary Functions

Test and compare with the symbolic approach using a dummy function \(\frac{d}{dx}\left[e^{2x}\right] = 2e^{2x}\)

foo(x) = exp(2x)
foo (generic function with 1 method)
x = 3

foo(D((x, 1))), (exp(2x), 2*exp(2x))
(D((403.4287934927351, 806.8575869854702)), (403.4287934927351, 806.8575869854702))

And with another one

\[\frac{d}{dx}\left[\ln\left(x^2\right)\right] = \frac{2x}{x^2} = \frac{2}{x}\]
baz(x) = log(x^2)
baz (generic function with 1 method)

ℯ = 2.7182818284590...
x = 

baz(D((x, 1))), (log(x^2), 2/x)
(D((2.0, 0.7357588823428847)), (2.0, 0.7357588823428847))

Closing Remarks

We’ve successfully implemented AD. In the real case though, AD is not implemented with Dual Numbers. Instead, most of them use forward and/or reverse accumulation. Either way, it’s still fun to see how easy it is to implement a new type system and method dispatching in Julia. Hope you’ve learned something!