Differentiation (mini)

Pablo Winant

Main approaches

Often our models involve complex sets of equations that we have to differentiate.

There are several approaches:

  1. Manual
  2. Finite Differences
  3. Symbolic Differentiation
  4. Automatic Differentiation

Manual Differentiation

  • Trick:
    • never use \(\frac{d}{dx} \frac{u(x)}{v(x)} = \frac{u'(x)v(x)-u(x)v'(x)}{v(x)^2}\)
      • too error prone
    • use instead \[\frac{d}{dx} {u(x)v(x)} = {u'(x)v(x)+u(x)v'(x)}\] and \[\frac{d}{dx} \frac{1}{u(x)} = -\frac{u^{\prime}}{u(x)^2}\]
  • You can get easier calculations (in some cases) by using log-deviation rules
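For instance, the quotient rule is recovered on the fly by combining the two rules above:

\[\frac{d}{dx} \frac{u(x)}{v(x)} = \frac{d}{dx}\left(u(x) \cdot \frac{1}{v(x)}\right) = \frac{u'(x)}{v(x)} + u(x)\left(-\frac{v'(x)}{v(x)^2}\right) = \frac{u'(x)v(x)-u(x)v'(x)}{v(x)^2}\]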

Finite Differences

  • Choose small \(\epsilon>0\), typically \(\sqrt{ \textit{machine eps}}\)
    • check eps()
  • Forward Difference scheme:
    • \(f'(x) \approx \frac{f(x+\epsilon) - f(x)}{\epsilon}\)
    • precision: \(O(\epsilon)\)
    • variant: one can use \(f'(x) \approx \frac{f(x)-f(x-\epsilon)}{\epsilon}\) instead (Backward Difference scheme)
  • Central Difference scheme:
    • \(f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}\)
    • average of forward and backward
    • precision: \(O(\epsilon^2)\)
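A minimal sketch comparing the two schemes (the function and evaluation point are illustrative choices):

```julia
# Compare forward and central differences on f(x) = sin(x) at x0 = 1.0
# (exact derivative is cos(1.0))
f(x) = sin(x)
x0 = 1.0
ϵ = sqrt(eps())      # typical step size

forward = (f(x0 + ϵ) - f(x0)) / ϵ
central = (f(x0 + ϵ) - f(x0 - ϵ)) / (2ϵ)

err_forward = abs(forward - cos(x0))
err_central = abs(central - cos(x0))
```

Both errors are tiny here; the advantage of the central scheme shows up as \(\epsilon\) varies, since its truncation error shrinks like \(\epsilon^2\) rather than \(\epsilon\).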

Two packages: FiniteDiff.jl and FiniteDifferences.jl.

Example:

using FiniteDiff

g(x) = [x[1]^2, x[2]^3]
FiniteDiff.finite_difference_jacobian(g, [0.1, 0.2])

Finite Differences: Higher order

  • Central formula:

\[ \begin{aligned} f''(x) & \approx & \frac{f'(x)-f'(x-\epsilon)}{\epsilon} \approx \frac{(f(x+\epsilon)-f(x))-(f(x)-f(x-\epsilon))}{\epsilon^2} \\ & = & \frac{f(x+\epsilon)-2f(x)+f(x-\epsilon)}{\epsilon^2} \end{aligned} \]

  • precision: \(O(\epsilon^2)\)
  • generalizes to higher orders but becomes more and more inaccurate
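A quick numerical check of the second-order formula (a sketch; the step size is taken larger than \(\sqrt{\textit{machine eps}}\) because second differences amplify rounding noise):

```julia
# Central second-difference for f(x) = sin(x); exact f''(x) = -sin(x)
f(x) = sin(x)
x0 = 1.0
ϵ = cbrt(eps())   # larger step: dividing by ϵ^2 magnifies rounding errors

f2 = (f(x0 + ϵ) - 2f(x0) + f(x0 - ϵ)) / ϵ^2
err = abs(f2 - (-sin(x0)))
```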

Symbolic Differentiation

  • manipulate the tree of algebraic expressions
    • implements various simplification rules
  • requires mathematical expression
  • can produce mathematical insights
  • sometimes inefficient (expression swell):
    • cf: \(\left(\frac{1+u(x)}{1+v(x)}\right)^{100}\), whose symbolic derivative is a huge expression

Two main libraries:

  • SymEngine.jl
    • fast symbolic calculation
    • mature C++ engine
  • Symbolics.jl:
    • pure julia
    • symbolic calculation

Example using Symbolics:

using Symbolics
@variables a x b y
eq = a*sin(x) + b*cos(y)
Symbolics.derivative(eq, x)

Automatic Differentiation

Automatic Differentiation

  • does not provide mathematical insights but solves the other problems

  • can differentiate any piece of code

  • two flavours

    • forward accumulation
    • reverse accumulation

Automatic source code rewrite

Consider this simple function

function f(x::Number)

    a = x + 1
    b = x^2
    c = sin(a) 
    d = c + b

    return d

end

We can rewrite the code as follows:

function f(x::Number)

    # x is an argument
    x_dx = 1.0

    a = x + 1
    a_dx = x_dx

    b = x^2
    b_dx = 2*x*x_dx

    c = sin(a)
    c_dx = cos(a)*a_dx

    d = c + b
    d_dx = c_dx + b_dx

    return (d, d_dx)
end

Dual numbers: operator overloading

Instead of rewriting source code, we can use dual numbers to perform exactly the same calculations.

import Base: +, -, *, /

struct DN<:Number
    x::Float64
    dx::Float64
end

+(a::DN,b::DN) = DN(a.x+b.x, a.dx+b.dx)
-(a::DN,b::DN) = DN(a.x-b.x, a.dx-b.dx)
*(a::DN,b::DN) = DN(a.x*b.x, a.x*b.dx+a.dx*b.x)
/(a::DN,b::DN) = DN(a.x/b.x, (a.dx*b.x-a.x*b.dx)/b.x^2)

...

Try it on function f. What do we miss?
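One answer: mixed operations with ordinary numbers (the literal 1 in x + 1) and derivative rules for elementary functions such as sin. A minimal self-contained sketch of these missing pieces (a reduced DN is redefined here for illustration):

```julia
import Base: +, *, sin, convert, promote_rule

struct DN <: Number
    x::Float64
    dx::Float64
end

+(a::DN, b::DN) = DN(a.x + b.x, a.dx + b.dx)
*(a::DN, b::DN) = DN(a.x * b.x, a.x * b.dx + a.dx * b.x)
sin(a::DN) = DN(sin(a.x), cos(a.x) * a.dx)   # chain rule for sin

# the missing pieces: constants like `1` in `x + 1` must be promoted to DN
convert(::Type{DN}, v::Real) = DN(Float64(v), 0.0)
promote_rule(::Type{DN}, ::Type{<:Real}) = DN

f(x) = sin(x + 1) + x^2   # x^2 lowers to x*x, so `*` suffices

v = f(DN(1.0, 1.0))       # v.x is f(1.0), v.dx is f'(1.0)
```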

Dual numbers

This approach (and automatic differentiation in general) is compatible with control flow operations (if, while, …)

Let’s see it with the dual numbers defined by ForwardDiff library:

import ForwardDiff: Dual

x = Dual(1.0, 1.0)
a = 0.5*x
b = sum([(x)^i/i*(-1)^(i+1) for i=1:5000])
# compare with log(1+x)

This generalizes nicely to gradient computations:

x = Dual(1.0, 1.0, 0.0)
y = Dual(1.0, 0.0, 1.0)
exp(x) + log(y)

Automatic differentiation

There are many flavours of automatic differentiation (check JuliaDiff.org)


Forward Accumulation Mode

  • isomorphic to dual number calculation
  • compute values and derivatives at the same time
  • efficient for \(f: R^n\rightarrow R^m\), with \(n<<m\)
    • (keeps lots of empty gradients when \(n>>m\))

Reverse Accumulation Mode

  • Reverse Accumulation / Back Propagation
    • efficient for \(f: R^n\rightarrow R^m\), with \(n>>m\)
    • requires data storage (to keep intermediate values)
    • graph / example
  • Very good for machine learning:
    • \(\nabla_{\theta} F(x;\theta)\) where \(F\) is a scalar objective
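A reverse-mode sketch using Zygote (assuming the package is available; the objective function is an illustrative choice):

```julia
using Zygote

# scalar objective F(θ) = ‖θ‖²: reverse mode returns the full gradient
# in a single backward pass, however long θ is
F(θ) = sum(abs2, θ)

g = Zygote.gradient(F, [1.0, 2.0, 3.0])[1]   # gradient is 2θ
```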

Libraries for AutoDiff

  • See JuliaDiff: http://www.juliadiff.org/
    • ForwardDiff.jl
    • ReverseDiff.jl
  • Zygote.jl
  • Deep learning framework:
    • higher order diff w.r.t. any vector -> tensor operations
    • Flux.jl, MXNet.jl, Tensorflow.jl

Example with ForwardDiff

using ForwardDiff
ForwardDiff.jacobian(u->[u[1]^2, u[2]+u[1]], [0.1,0.2])

Solution using Linear Time iteration

Our problem

General formulation of a linearized model: \[ \begin{eqnarray} A s_t + B x_t + C s_{t+1} + D x_{t+1} & = & 0_{n_x} \\ s_{t+1} & = & E s_t + F x_t \end{eqnarray}\] where:

  • \(s_t \in \mathbb{R}^{n_s}\) is a vector of states
  • \(x_t \in \mathbb{R}^{n_x}\) is a vector of controls

Remark:

  • first equation is forward looking
  • second equation is backward looking

In the neoclassical model: \[\begin{eqnarray} s_t & = & (\Delta z_t, \Delta k_t) \\ x_t & = & (\Delta i_t, \Delta c_t) \end{eqnarray}\]

The linearized system is: \[\begin{eqnarray} A & = & ...\\ B & = & ...\\ C & = & ...\\ D & = & ...\\ E & = & ...\\ F & = & ... \end{eqnarray}\]

Solution

What is the solution of our problem?

  • At date \(t\) controls must be chosen as a function of (predetermined) states
  • Mathematically speaking, the solution is a function \(\varphi\) such that: \[\forall t, x_t = \varphi(s_t)\]
  • Since the model is linear, we look for an unknown matrix \(X \in \mathbb{R}^{n_x \times n_s}\) such that: \[\Delta x_t = X \Delta s_t\]

In the neoclassical model

  • The states are \(k_t\) and \(z_t\)

  • The controls \(i_t\) and \(c_t\) must be a function of the states

    • there is a decision rule \(i()\), \(c()\) such that \[i_t = i(z_t, k_t)\] \[c_t = c(z_t, k_t)\]
  • In the linearized model: \[\Delta i_t =i_z \Delta z_t + i_k \Delta k_t\] \[\Delta c_t =c_z \Delta z_t + c_k \Delta k_t\]

Optimality condition:

Replacing in the system:

\[\Delta x_t = X \Delta s_t\] \[\Delta s_{t+1} = E \Delta s_t + F X \Delta s_t\] \[\Delta x_{t+1} = X \Delta s_{t+1}\] \[A \Delta s_t + B \Delta x_t + C \Delta s_{t+1} + D \Delta x_{t+1} = 0\]

If we make the full substitution:

\[( (A + B X) + ( D X + C) ( E + F X ) ) \Delta s_t = 0\]

This must be true for all \(\Delta s_t\). We get the special Riccati equation:

\[(A + B {\color{red}{X}} ) + ( D {\color{red}{X}} + C) ( E + F {\color{red}X} ) = 0\]

This is a quadratic matrix equation (in the neoclassical model \(X\) is 2 by 2):

  • requires special solution method

  • there are multiple solutions: which should we choose?

  • today: linear time iteration selects only one solution

    • alternative: eigenvalues analysis

Linear Time Iteration

  • Let’s be more subtle: define
    • \(X\): decision rule today and
    • \(\tilde{X}\): is decision rule tomorrow. \[\begin{eqnarray} \Delta x_t & =& X \Delta s_t \\ \Delta s_{t+1} & = & E \Delta s_t + F X \Delta s_t \\ \Delta x_{t+1} & = & \tilde{X} \Delta s_{t+1} \\ A \Delta s_t + B \Delta x_t + C \Delta s_{t+1} + D \Delta x_{t+1} & = & 0 \end{eqnarray}\]
  • We get, \(\forall \Delta s_t\): \[\left( (A + B X) + (C + D \tilde{X}) ( E + F X ) \right) \Delta s_t = 0 \]
  • Again, this must be zero in all states \(\Delta s_t\).

Linear Time Iteration (2)

  • We get the equation: \[\begin{eqnarray} F(X, \tilde{X}) & = & (A + B X) + ( C+ D \tilde{X}) ( E + F X ) \\&=& 0 \end{eqnarray}\]
  • We can solve it with a simple Bernoulli scheme
  • When the model is well-specified it is guaranteed to converge to the right solution.
    • cf linear time iteration by Pontus Rendahl (link)
  • There are simple criteria to check that the solution is right, and that the model is well specified

Algorithm:

  • choose stopping criteria: \(\epsilon_0\) and \(\eta_0\)
  • choose random \(X_0\)
  • given \(X_n\):
    • compute \(X_{n+1}\) such that \(F(X_{n+1}, X_{n}) = 0\) \[(B + (C+D X_{n})F)X_{n+1} + A + (C+D X_n )E=0\]\[X_{n+1} = - (B + (C + D X_n) F)^{-1} (A + (C+DX_n)E)\]\[X_{n+1} = T(X_n)\]
    • compute:
      • \(\eta_n = |X_{n+1} - X_n|\)
      • \(\epsilon_n = |F(X_{n+1}, X_{n+1})|\)
    • if \(\eta_n<\eta_0\) and \(\epsilon_n<\epsilon_0\)
      • stop and return \(X_{n+1}\)
      • otherwise iterate with \(X_{n+1}\)
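The update step above can be sketched as follows (the function name and the scalar test model are my own illustrations, with all matrices stored as 1×1 arrays):

```julia
using LinearAlgebra

# Linear time iteration: Xₙ₊₁ = -(B + (C + D Xₙ) F) \ (A + (C + D Xₙ) E)
function linear_time_iteration(A, B, C, D, E, F; η0=1e-10, maxit=1000)
    X = zero(A)                    # X has the same shape as A (n_x × n_s)
    for _ in 1:maxit
        X1 = -(B + (C + D*X)*F) \ (A + (C + D*X)*E)
        η = maximum(abs, X1 - X)   # ηₙ = |Xₙ₊₁ - Xₙ|
        X = X1
        η < η0 && break
    end
    return X
end

# illustrative scalar model: a·sₜ - xₜ + β·xₜ₊₁ = 0, sₜ₊₁ = ρ·sₜ,
# whose exact solution is X = a / (1 - βρ)
a, β, ρ = 1.0, 0.95, 0.9
A = fill(a, 1, 1); B = fill(-1.0, 1, 1); C = fill(0.0, 1, 1)
D = fill(β, 1, 1); E = fill(ρ, 1, 1);   F = fill(0.0, 1, 1)
X = linear_time_iteration(A, B, C, D, E, F)
```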

Simulating the model

  • Suppose we have found the solution \(\Delta x_t = X \Delta s_t\)
  • Recall the transition equation: \(\Delta s_{t+1} = E \Delta s_t + F \Delta x_t\)
  • We can now compute the model evolution following initial deviation in the state: \[\Delta s_t = \underbrace{(E + F X)}_{P} \Delta s_{t-1}\]
    • \(P\) is the simulation operator
    • it is a backward operator
  • The system is stable if the biggest eigenvalue of \(P\) is smaller than one in modulus…
  • … or if its spectral radius is smaller than 1: \[\rho(P)<1\]
  • This condition is called backward stability
    • it rules out explosive solutions
    • if \(\rho(P)>1\) one can always find \(s_0\) such that the model simulation diverges
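A minimal simulation sketch (the matrix \(P\) and the initial deviation are illustrative choices):

```julia
# Simulate Δsₜ = P Δsₜ₋₁ from an initial deviation Δs₀
function simulate(P, s0, T)
    path = [s0]
    for t in 1:T
        push!(path, P * path[end])
    end
    return path
end

P = [0.9 0.1; 0.0 0.5]          # illustrative stable P: ρ(P) = 0.9 < 1
path = simulate(P, [1.0, 1.0], 50)
# the deviations die out because ρ(P) < 1
```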

Spectral radius

  • How do you compute the spectral radius of matrix P?
    • naive approach: compute all eigenvalues, check the modulus of the biggest one…

using LinearAlgebra
M = rand(2,2)
maximum(abs, eigvals(M))

    • becomes impractical when the size of the matrix grows
    • another approach: power iteration method
  • Power iteration method:
    • works for matrices and linear operators
    • take a linear operator \(L\) over a Banach Space \(\mathcal{B}\) (vector space with a norm)
    • use the fact that for most \(u_0\in \mathcal{B}\), \(\frac{|L^{n+1} u_0|}{|L^n u_0|}\rightarrow \rho(L)\)

Algorithm:

  • choose tolerance criterion: \(\eta>0\)
  • choose random initial \(x_0\) and define \(u_0 = \frac{x_0}{|x_0|}\)
    • by construction: \(|u_0|=1\)
  • given \(u_n\), compute
    • \(x_{n+1} = L u_n\)
    • \(u_{n+1} = \frac{x_{n+1}}{|x_{n+1}|}\)
    • compute \(\eta_{n+1} = |u_{n+1} - u_n|\)
    • if \(\eta_{n+1}<\eta\):
      • stop and return \(|x_{n+1}|\)
      • else iterate with \(u_{n+1}\)
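A sketch of this algorithm (the function name is mine; if the dominant eigenvalue is negative or complex the iterates \(u_n\) may oscillate, in which case one tracks the norm ratios directly):

```julia
using LinearAlgebra

function power_iteration(L; η=1e-10, maxit=10_000)
    u = normalize(ones(size(L, 2)))   # deterministic start; a random u₀ also works
    ρ = 0.0
    for _ in 1:maxit
        x = L * u
        ρ = norm(x)                   # |L uₙ| → ρ(L) since |uₙ| = 1
        u1 = x / ρ
        maximum(abs, u1 - u) < η && return ρ
        u = u1
    end
    return ρ
end

M = [2.0 0.3; 0.0 0.5]               # illustrative matrix, eigenvalues 2.0 and 0.5
ρM = power_iteration(M)
```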

The Time Iteration Operator

The time iteration operator associated to the linearized model associates to any function \(\tilde{\varphi}\) the function \(\varphi\) such that:

\[D \tilde{\varphi}( E s + F \varphi(s) ) + C ( E s + F \varphi(s)) + B \varphi(s) + A s = 0\]

To find the solution we have iterated this operator in the space of linear functions (with \(\varphi(\overline{s})=\overline{x}\)), but time-iteration can be defined in a more general space.

In particular, we can study it in the space of affine functions \(\varphi(s)=x_0 + x_1 (s - \overline{s})\).

In this space of functions one can show that:

  • if \(P\) is stable, then the spectral radius of the time iteration operator is the spectral radius of \(((B + (C + D X) F)^{-1})D\)
  • in that case the stable solution is unique if and only if the spectral radius of the time iteration operator is smaller than 1 (forward stability)

Recap

  1. We compute the derivatives of the model
  2. Time iteration algorithm: starting from an initial guess \(X_0\), we repeat until convergence: \[X_{n+1} = - (B + (C + D X_n) F)^{-1} (A + (C+DX_n)E)\]
  3. We compute the spectral radius of two operators to ensure the model is well defined and that the solution is the right one.
  4. backward stability: derivative of simulation operator \[\boxed{\rho(E + F \overline{X} )}\]
  5. forward stability: derivative of time iteration operator \[\boxed{\rho \left( u\mapsto ((B + (C + D X) F)^{-1})D\right)}\]

\(\implies\) forward and backward stability are equivalent to Blanchard-Kahn conditions

  • there is a unique stable solution