[[toc]]

Backpropagation

I have written this description of the inner workings of backpropagation to give a quick and clear insight into the idea and the mathematics of backpropagation, at a level that a first-year CS master's student should understand. I use it myself as a quick reminder, so its best purpose is probably a recap. At least the mathematics should be generally correct.

Some references

  • https://www.coursera.org/learn/machine-learning/supplement/pjdBA/backpropagation-algorithm
  • https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Why it is important to know this

  • https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

Algorithm

Definitions

  • K = number of output units
  • L = number of layers in the network
  • $s_l$ = number of units in layer l
  • m = number of training examples
  • E - another symbol for the cost function J
  • $\lambda$ - regularization parameter
  • $\Delta^{(l)}_{ij}$ - used for accumulating the gradients calculated on every training example, for every theta
  • $D^{(l)}_{ij}$ - final gradient after one iteration of backpropagation.

Steps

Assumptions about Dimensions

  • We currently do backpropagation on every training example separately

FORWARD PROPAGATION DIMENSION SPECIFICATION

If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will have $s_{j+1}$ rows and $s_j + 1$ columns. (The +1 for the columns comes from the bias nodes.)

This results in $z^{(j+1)} = \Theta^{(j)} a^{(j)}$ (and also $a^{(j+1)} = g(z^{(j+1)})$) being a vector with dimensions $s_{j+1} \times 1$.
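
As a minimal illustration of these dimensions in Octave, assuming a sigmoid activation and made-up layer sizes ($s_j = 3$, $s_{j+1} = 2$):

% hypothetical layer sizes: s_j = 3 units in layer j, s_(j+1) = 2 units in layer j+1
a_j     = [0.5; 0.1; 0.9];          % activations of layer j, s_j x 1
Theta_j = rand(2, 3 + 1);           % Theta^(j): s_(j+1) x (s_j + 1)
z_next  = Theta_j * [1; a_j];       % prepend bias unit; z^(j+1) is s_(j+1) x 1
a_next  = 1 ./ (1 + exp(-z_next));  % a^(j+1) = g(z^(j+1)), also s_(j+1) x 1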

BACKPROPAGATION DIMENSIONS SPECIFICATION

Continuing with the last example, $\delta^{(j+1)}$ also has $s_{j+1} \times 1$ dimensions. Hint for the backpropagation part later on: $\delta^{(j)} = (\Theta^{(j)})^T \delta^{(j+1)} .* g'(z^{(j)})$ (with the bias row dropped from the result).

Algorithm

Initialize $\Delta^{(l)}_{ij} = 0$ (for all $l$, $i$, $j$)

REPEAT FOR $i = 1 \ldots m$ (every training example)

  1. Perform Forward Propagation. This can be done using the whole training matrix as well; the result is then going to be of size $m \times K$.

Example of forward propagation
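
A minimal Octave sketch of forward propagation for a single training example, assuming one hidden layer, sigmoid activations and hypothetical values for x, Theta1 and Theta2:

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation

x      = [0.05; 0.10];         % one training example, the input layer a^(1)
Theta1 = rand(3, 2 + 1);       % maps layer 1 (2 units) to layer 2 (3 units)
Theta2 = rand(1, 3 + 1);       % maps layer 2 (3 units) to the output (K = 1)

a1 = [1; x];                   % add the bias unit
z2 = Theta1 * a1;
a2 = [1; g(z2)];               % hidden layer activations plus bias
z3 = Theta2 * a2;
a3 = g(z3);                    % h_Theta(x), the K x 1 output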

  2. Calculate the cost $J(\Theta)$
    • Definitions
      • $\Theta$ - Hypothesis function parameters
      • $h_\Theta(x)$ - Hypothesis function
      • $x^{(i)}, y^{(i)}$ - i-th training example
  3. Calculate the derivatives of the cost function with respect to every $z^{(l)}_j$, i.e. the error terms $\delta^{(l)}_j = \frac{\partial E}{\partial z^{(l)}_j}$
  4. Calculate the derivatives $\frac{\partial E}{\partial \Theta^{(l)}_{ij}}$ for every theta
  5. Add cumulatively to $\Delta^{(l)}_{ij}$: $\Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j \delta^{(l+1)}_i$,
    or in vectorial form $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

END LOOP

$D^{(l)}_{ij}$ - final form of the derivative after this training step

  • Update $D^{(l)}_{ij} := \frac{1}{m}\left(\Delta^{(l)}_{ij} + \lambda \Theta^{(l)}_{ij}\right)$ if $j \neq 0$ <- add the derivative of the regularization term
  • Update $D^{(l)}_{ij} := \frac{1}{m}\Delta^{(l)}_{ij}$ if $j = 0$ <- no regularization for the bias term
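
Putting the pieces together, here is a rough Octave sketch of one pass of the algorithm for a network with one hidden layer, assuming sigmoid activations and hypothetical variables X (m x n inputs), Y (m x K labels), Theta1 and Theta2; it only illustrates how $\Delta$ is accumulated and how $D$ is formed, not a complete training routine.

g = @(z) 1 ./ (1 + exp(-z));
lambda = 1;                              % regularization parameter
m = size(X, 1);
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for i = 1:m,
  % forward propagation for training example i
  a1 = [1; X(i, :)'];
  z2 = Theta1 * a1;  a2 = [1; g(z2)];
  z3 = Theta2 * a2;  a3 = g(z3);

  % backpropagation: delta of the output layer, then of the hidden layer
  d3 = a3 - Y(i, :)';
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));
  d2 = d2(2:end);                        % drop the bias unit's delta

  % accumulate the gradients
  Delta2 += d3 * a2';
  Delta1 += d2 * a1';
end;

% final gradients; no regularization for the bias column (j = 0)
D1 = Delta1 / m;  D1(:, 2:end) += (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) += (lambda / m) * Theta2(:, 2:end);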

Elaboration about Derivatives

3. Step - Calculating $\delta$

Calculating $\delta^{(L)}$

It is important to note that $\delta^{(l)}_j$ is pretty much $\frac{\partial E}{\partial z^{(l)}_j}$ for all $z$.
Calculating $\delta^{(L)}$ is different from calculating $\delta^{(l)}$ for the hidden layers.

  1. You have to calculate $\frac{\partial E}{\partial a^{(L)}}$ and $\frac{\partial a^{(L)}}{\partial z^{(L)}}$, and multiply them together (chain rule)

Here is the derivation for $\delta^{(L)}$
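
A sketch of the derivation, under the assumption of a sigmoid output unit $a^{(L)} = g(z^{(L)})$ and the cross-entropy cost $E = -y \log a^{(L)} - (1 - y) \log(1 - a^{(L)})$ for a single training example and output unit:

$$
\begin{aligned}
\delta^{(L)} &= \frac{\partial E}{\partial z^{(L)}} = \frac{\partial E}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \\
\frac{\partial E}{\partial a^{(L)}} &= -\frac{y}{a^{(L)}} + \frac{1 - y}{1 - a^{(L)}} \\
\frac{\partial a^{(L)}}{\partial z^{(L)}} &= g'(z^{(L)}) = a^{(L)} \left(1 - a^{(L)}\right) \\
\delta^{(L)} &= \left( -\frac{y}{a^{(L)}} + \frac{1 - y}{1 - a^{(L)}} \right) a^{(L)} \left(1 - a^{(L)}\right) = a^{(L)} - y
\end{aligned}
$$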

Calculating $\delta^{(l)}$ GENERAL CASE

Reminder: In each layer $l$ the derivative can have multiple components $\delta^{(l)}_j$, just like we have multiple $a^{(l)}_j$

The calculation is done somewhat recursively. For every lower layer $l$ we need to use $\delta^{(l+1)}$ in our calculation of the derivative, because of the chain rule: $\frac{\partial E}{\partial z^{(l)}} = \frac{\partial E}{\partial z^{(l+1)}} \cdot \frac{\partial z^{(l+1)}}{\partial z^{(l)}}$. If, inside the formula, we go further from the output in layer $L$ towards the input in layer 1, we always need to take the functions in between into consideration.

In other words

$$
\frac{\partial E}{\partial z^{(l)}_j} = \sum_i \frac{\partial E}{\partial z^{(l+1)}_i} \cdot \boldsymbol{\frac{\partial z^{(l+1)}_i}{\partial a^{(l)}_j} \cdot \frac{\partial a^{(l)}_j}{\partial z^{(l)}_j}}
$$

Here the part in bold stands for the part that is being added at this level.

This translates to (see the Octave sketch after the list below)

$$
\delta^{(l)} = \left( \frac{\partial z^{(l+1)}}{\partial a^{(l)}} \right)^T \delta^{(l+1)} .* g'(z^{(l)}) = (\Theta^{(l)})^T \delta^{(l+1)} .* g'(z^{(l)})
$$

Where each part stands for

  • In $\frac{\partial z^{(l+1)}}{\partial a^{(l)}}$ each row consists of derivatives of 1 single i-th $z^{(l+1)}_i$ with respect to all $a^{(l)}_j$. For each $z^{(l+1)}_i$ we have multiple $a^{(l)}_j$, and column j in $\frac{\partial z^{(l+1)}}{\partial a^{(l)}}$ consists of the derivatives with respect to $a^{(l)}_j$

  • In its i-th row and j-th column we have $\frac{\partial z^{(l+1)}_i}{\partial a^{(l)}_j} = \Theta^{(l)}_{ij}$. The multiplication $\left( \frac{\partial z^{(l+1)}}{\partial a^{(l)}} \right)^T \delta^{(l+1)}$ is going to have $s_l \times 1$ dimensions.

    • This rule is somewhat more interesting, because it applies a somewhat complicated concept where 1 variable in a function affects the end result through multiple other functions, e.g. $E = f(g_1(a), g_2(a), \ldots)$. Such a function, when taking the derivative, is handled by just summing up the derivatives along all the paths: $\frac{\partial E}{\partial a} = \sum_i \frac{\partial E}{\partial g_i} \cdot \frac{\partial g_i}{\partial a}$. This is somewhat intuitive, but I won't go too deep into it
    • Explanation in Estonian of this rule
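
A minimal Octave sketch of the vectorized formula above, assuming sigmoid activations and hypothetical variables Theta_l ($\Theta^{(l)}$, including the bias column), delta_next ($\delta^{(l+1)}$) and z_l ($z^{(l)}$):

g      = @(z) 1 ./ (1 + exp(-z));
gprime = @(z) g(z) .* (1 - g(z));   % derivative of the sigmoid

% hypothetical sizes: s_l = 3 units in layer l, s_(l+1) = 2 units in layer l+1
Theta_l    = rand(2, 3 + 1);        % s_(l+1) x (s_l + 1)
delta_next = rand(2, 1);            % delta^(l+1), s_(l+1) x 1
z_l        = rand(3, 1);            % z^(l), s_l x 1

% (Theta^(l))' * delta^(l+1) sums over all units i of layer l+1,
% which is exactly the "sum over multiple paths" chain rule from above
tmp     = Theta_l' * delta_next;    % (s_l + 1) x 1, first entry belongs to the bias
delta_l = tmp(2:end) .* gprime(z_l);  % delta^(l), s_l x 1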

4. Step - Calculating $\frac{\partial E}{\partial \Theta^{(l)}_{ij}}$

Reminder

  • 1 scalar $\Theta^{(l)}_{ij}$ has an effect on only 1 $z^{(l+1)}_i$

This in regular derivative form translates to $\frac{\partial E}{\partial \Theta^{(l)}_{ij}} = \frac{\partial E}{\partial z^{(l+1)}_i} \cdot \frac{\partial z^{(l+1)}_i}{\partial \Theta^{(l)}_{ij}} = \delta^{(l+1)}_i \, a^{(l)}_j$

This translates in vectorial form to $\frac{\partial E}{\partial \Theta^{(l)}} = \delta^{(l+1)} (a^{(l)})^T$

$\delta^{(l+1)} (a^{(l)})^T$ results in an $i \times j$ matrix, like the one below.
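
For example, with hypothetical sizes $s_{l+1} = 2$ and $s_l + 1 = 3$ (bias included), the outer product is

$$
\delta^{(l+1)} (a^{(l)})^T =
\begin{bmatrix} \delta^{(l+1)}_1 \\ \delta^{(l+1)}_2 \end{bmatrix}
\begin{bmatrix} a^{(l)}_0 & a^{(l)}_1 & a^{(l)}_2 \end{bmatrix} =
\begin{bmatrix}
\delta^{(l+1)}_1 a^{(l)}_0 & \delta^{(l+1)}_1 a^{(l)}_1 & \delta^{(l+1)}_1 a^{(l)}_2 \\
\delta^{(l+1)}_2 a^{(l)}_0 & \delta^{(l+1)}_2 a^{(l)}_1 & \delta^{(l+1)}_2 a^{(l)}_2
\end{bmatrix}
$$

where the entry in row $i$ and column $j$ is exactly $\frac{\partial E}{\partial \Theta^{(l)}_{ij}}$.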

Gradient Checking

https://www.coursera.org/learn/machine-learning/supplement/fqeMw/gradient-checking

  • Do it for all the $\Theta$-s and their gradient matrices $D$
  • For 1 matrix this looks like this:
epsilon = 1e-4;
n = numel(theta);                 % number of parameters to check
gradApprox = zeros(size(theta));  % numerical estimate of the gradient
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  % two-sided difference approximation of dJ/dtheta(i)
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end;
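
After running this, a rough sanity check (assuming D holds the gradient computed by backpropagation for the same theta) could be:

% the two columns should agree to several decimal places; once they do,
% turn gradient checking off, because it is very slow
disp([gradApprox(:), D(:)]);
relDiff = norm(gradApprox(:) - D(:)) / norm(gradApprox(:) + D(:));
disp(relDiff);   % should be very small for a correct implementation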