The goal of this blog post is to demonstrate so far the the neural network is a relatively simple algorithm to understand and show using mathematics and machine learning that it is a universal function approximator.

According to the Universal Approximation Theorem, a feedforward network with at least one hidden layer, utilizing non-polynomial activation functions (like sigmoid or ReLU), can approximate any continuous function to any desired degree of accuracy, provided it has enough neurons.

We are going to test this on a simple MNIST dataset gotten from sklearn datasets.

First we start by importing our needed functions, we will need just
numpy --> for linear algebra
pandas --> for nice tables
scikit learn --> for the datset
matplotlib --> for plotting

from sklearn.datasets import load_digits
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

INTRODUCTION

We might have heard of the term Artificial Neural Network before as a Machine Learning algorithm built as a model of the human brain, but in reality this is far from it, from my findings an artificial neural network is by many degrees of magnitude far less complicated than the human brain and just a bunch of interconnected nodes called artificial neurons that take in input, apply a mathematical formula and spit out an output.

Just Like the basic logistic regression unit, an ANN as we would be calling it, applies the straight line fitting equation $ y = mx + c $ on data but with a big twist, which is that it is applied at a much larger scale, many neurons as we call them receive the input $x$ perform such an operation, and each of the neurons in the layer(input layer) which receives the input perform a a mathematical operation closely related to that which we outlined above, and return a single value as output, but we don't use each of this individual outputs for anything but instead we concatenate them into a vector and pass them as input to the next layer after performing some non-linear operation to it using what is known as an activation function(to add non-linearities) to the data, it can actually be shown that these activation functions are what gives the neural network its power, these neurons which are stacked in layers can only learn linear relationship no matter how many times we stack them, find the mathematical proof below, and would just be a redundant straight line fitting linear/logistic regression model, but by applying activations, that is turning some neurons on or off or performing some form of normalization, we are able to learn complex patterns and not only straight lines. The next layer then does the same thing as the first, but its input is the concatenated output of the first and we go on until we reach the final layer which can contain one neuron for regression tasks or n neurons for classification, where n corresponds to the number of classes which we then use as output.

Dataset

We would be using sklearn digit dataset, which is like the MNIST dataset with lots of handwritten digits. The goal is too see with what degree of accuracy the network would predict these digits. Each digit is a 8 x 8 grid of pixels(in black and white), where each pixel value corresponds to the brightness of that pixel.

Now we've loaded our digits dataset which is a 8x8 grid Matrix. We would like to visualize it using pandas quickly.

digits = load_digits()
dir(digits)

df = pd.DataFrame(digits.data, columns=digits.feature_names)
df['target'] = pd.Series(digits.target)
df # so we can see each pixel.

| | **

pixel_0_0** | pixel_0_1 | pixel_0_2 | pixel_0_3 | pixel_0_4 | pixel_0_5 | pixel_0_6 | pixel_0_7 | pixel_1_0 | pixel_1_1 | ... | pixel_6_7 | pixel_7_0 | pixel_7_1 | pixel_7_2 | pixel_7_3 | pixel_7_4 | pixel_7_5 | pixel_7_6 | pixel_7_7 | target | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 | | 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 | | 2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 | | 3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 | 3 | | 4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 | 4 | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | | 1792 | 0.0 | 0.0 | 4.0 | 10.0 | 13.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 14.0 | 15.0 | 9.0 | 0.0 | 0.0 | 9 | | 1793 | 0.0 | 0.0 | 6.0 | 16.0 | 13.0 | 11.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 16.0 | 14.0 | 6.0 | 0.0 | 0.0 | 0 | | 1794 | 0.0 | 0.0 | 1.0 | 11.0 | 15.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | 13.0 | 6.0 | 0.0 | 0.0 | 8 | | 1795 | 0.0 | 0.0 | 2.0 | 10.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 12.0 | 16.0 | 12.0 | 0.0 | 0.0 | 9 | | 1796 | 0.0 | 0.0 | 10.0 | 14.0 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 8.0 | 12.0 | 14.0 | 12.0 | 1.0 | 0.0 | 8 |

Truncated dataframe giving us an idea of what is happening.

We can use matplotlib now that we have seen the data in a pandas dataframe to visualize what we are seeing here.

the above image is a 6 and a 2.

From earlier we saw that the features and targets are concatenated together, we can store features as X and targets as y below. Note we saw we had a 1797 x 64 feature that above, we would transpose it to get each training example on the columns because we want to multiply the weights by the training data and not the other way around for preference, for example if our weights represent 100 neurons in the input layer they would need to have 64 values per each of those neurons, so that a neuron can process a single training example correctly without shape mismatch errors which I personally battled with when I did this for the first time.

To avoid using the train_test_split in sklearn, we can still split our dataset by taking random indexes in from our dataset, we would be using 80% for training and 20% for testing.

X = np.array(digits.data)
y = np.array(digits.target)
n = X.shape[0]
m = y.shape[0]
n_train = int(n*0.8) # 80% should be used for training
n_test = int(n*0.2) # 20% should be used to test the network
m_train = int(m*0.8)
m_test = int(m*0.2)

X_train = X[:n_train, :].T
X_test = X[n_train:, :]
y_train = y[:m_train]
y_test = y[m_train:]
X_train.shape, X_test.shape, y_train.shape, y_test.shape

(64, 1797), (64, )

We can now implement the softmax function, which is an extension of the sigmoid for higher dimensional spaces where the target values in a classification task is not binary but rather multiclass.

The reason we need this function is because this task as you might have seen from the target values, is a multiclass classification task, and we have handwritten digit from 0-9(10 digits), so we would need our last layer to spit out 10 values meaning the number of neurons in that layer must be 10 since each individual neuron spits out an individual value. And we want each value to be a probability from 0 to 1, the softmax achieves this by squishing each input within this range taking very keen account of the value of other inputs. Its implemented below in numpy

$$sigmoid(z) = \frac{1}{1 + e^{-z}}$$

softmax with n classes is given as:

$$ softmax(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} $$

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

def ReLU(z):
    return np.maximum(0, z)

The above is the very important ReLU function which takes the positive activations and discard negative ones as 0 allowing those neurons in the next layer stay inactive, This is crucial for the early layers of the network to learn complex patterns in the dataset given that using a linear function across layers is redundant logistic regression, other functions like tanh could also be used, but though a ReLU is prone to Dying ReLU problem, we should be able to escape this though our neural network is very small and basic, it's a 3 layer neural network, in larger networks as we keep taking gradients as we backpropagate through all the layers we could reach a point where by the values of our weights start to become negative and start getting discarded by the network relatively causing some neurons not to fire our activate, hence knon as the dying ReLU problem, or if our weights are initialized to high or low, a common fix is the leaky ReLU but we won't go into that now.

We would use the ReLU as the activation function that introduces non-linearity between layers, the only layer which the output wont be passed through the ReLU is the final layer which outputs logits we would use to predict.

We can prove mathematically that a neural network without such a function is a fancy logistic regression unit. proof.

$$ReLU(z) = max(0, z)$$

Here, we decided were going with minibatches, This is a common techniques used when we have many examples to speed up training where we feed only a batch usually 32 0r 64 examples(mostly absolutely random examples) into the network at a time during the process of forward propagation, we complete an epoch when we feed the entire dataset into the network, and as usual we update our parameters a total of (sample_size/batch_size) times. The mini-batching process is implemented below.

Note: This mini-batching causes training to be unstable as the network keeps seeing different sets of examples at each interation, causing the loss to increase and correct itself, we would see this later.

def get_minibatches(X, Y, batch_size=64, shuffle=True):
    m = X.shape[1] # get the total number of examples
    indices = np.arange(m) # create a list of a sequence from 1 to m
    if shuffle:  # if we decide to use minibatching then shffle will be true
        np.random.shuffle(indices) 
    num_batches = math.ceil(m / batch_size) # number of batches must be an integer
    for i in range(num_batches): # for each batch
        start = i * batch_size # start at 0, 64, 128 etc
        end = min(start + batch_size, m) # end +batch_size away
        batch_idx = indices[start:end] # select batch_size random indices
        yield X[:, batch_idx], Y[batch_idx] # using all rows(features), select unique 64 random examples

The loss function for a multiclass classification task is given as:

$$ \text{Loss} = -\frac{1}{N} \sum_{n=1}^N \sum_{i=1}^{K} y_i \log(\hat{y}_i) $$ sum loss for each prediction and then sum for each example, then average out with number of example

$$\hat{y}i = \frac{e^{z_i}}{\sum{j=1}^{K} e^{z_j}}$$

$$\text{Loss} = - \sum_{i=1}^{K} y_i \log \left( \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \right)$$

Where $ y_i $ is 1 in our one hot encoded matrix for every true value and zero for false values so the loss is zero when y_i is not the right digit and

$$\log \left( \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \right) $$ i s the probability from 0 to 1 of it being that actual digit. We basically for each target create a one-hot encoded matrix, e.g if target is 9 we create something like this to replace the target [0,0,0,0,0,0,0,0,1] to replace it as the true value, remember the neural network outputs will be turned into predictions for each value, so from the formula above you can see that we multiply by 0 almost all the time so the formula can simplify to: $$ \text{Loss} = - \sum_{n=1}^{N} y \log(\hat{y}) $$ where $ \hat{y} $ is the prediction for the correct example(the only place where the value of target y is not zero)$$

e.g when actual digit is 5, at index 4, $ y_i $ is 1, and zero at every other index, so if the computer in its own one hot encoded matrix predicts 0.8 at that position after computing the log equation, we get 0.096 x 1, which is a small loss, unlike if it predicts 0.1, which means it says that digit is unlikely we get a huge loss, of 1.

def compute_loss(y, y_true):
    # y (ndarray(n,m)) : matrix of probabilities with row n represent probability of vector i
    # y_true (ndarray(n,)) :  column vector each column i represent value for true class 
    m, n = y.shape
    loss = 0
    
    for i in range(m):
            loss += - np.sum(np.log2(y[i, y_true[i]]))
    return loss

def init_params():
    np.random.seed(432)
    W1 = np.random.randn(64, 64) * 0.01  # small random init helps too
    W2 = np.random.randn(16, 64) * 0.01
    W3 = np.random.randn(10, 16) * 0.01
    b1 = np.random.randn(64, 1)
    b2 = np.random.randn(16, 1)
    b3 = np.random.randn(10, 1)
    return W1, b1, W2, b2, W3, b3

We Just Initialized random weights, I found that multiply all weights by 0.01 usually works fine, because even though randn return values from a standard normal distribution, they still seem to explode or die quickly.

FORWARD PROPAGATION

we would be using a vectorized implementation here, instead of a traditional for-loop because it is faster and easier to compute because numpy using optimized code in faster languages under the hood.

layer 1

layer 1 takes in a single example at each time as a vector then each neuron in layer one performs a mathematical operation where each feature has a weight in each neuron which multiplies the feature, then we sum it all e.g w1feture_1 + w2+feature_2 + w3feature_3 + ... w_n*feature_n + b(the neurons bias) = something, each neuron performs this operation on thesame features(for each example) with a different set of biases ans weights and returns a value, we then combine all the values into a coherent vector e.g [neuron_1-->5, neuron_2 -->4, neuron_3 --->3, ..., neuron_last --->-4.5], then we apply a relu which turns all negative inputs to zero and keeps positive ones and pass the vector to the next layer and repeat.

To vectorize the implementation, for a mini-batch we can create a weight matrix of number of rows equal to the number of neurons and columns to the number of weights in each neuron, e.g our layer 1 has 64 neurons each with 64 weights i.e a (64x64) matrix, then we multiply by our matrix X which contains 64 rows which correspond to the number of features and coincidentally 64 columns where each column is a unique example so instead of 64x1 we pass 64x64, (m x r) * (r x n) = (m x n) matrix so we get a 64 x 64 matrix, where each column corresponds to the concatenated vector that the layer as outputed for that example, we would have 64 of such columns. W then add bias which is a vector containing one value that is added to the output of each neuron for all examples through broadcasting. $$ Z_1 = W_1 * X + b_1 $$

layer 2

$$Z_2 = W_2 * ReLU(Z_1) + b_2$$

layer 3

$$Z_3 = W_2 * ReLU(Z_3) + b_3 $$$$

$$predictions = softmax(Z_3)$$

def forward_propagation(X, W1, b1, W2, b2, W3, b3):

    Z1 = W1 @ X + b1
    A1 = ReLU(Z1)
    Z2 = W2 @ A1 + b2
    A2 = ReLU(Z2)
    Z3 = W3 @ A2 + b3
    A3 = softmax(Z3)
    
    return Z1, A1, Z2, A2, Z3, A3

computing back prop over many epochs and updating the params We transpose our input dataset from each sample being a 8x8 matrix to a 64 d vector bfore passing through the network.

def one_hot(Y, num_classes=10):
    Y = np.array(Y).flatten().astype(int)
    oh = np.zeros((Y.size, num_classes))
    oh[np.arange(Y.size), Y] = 1
    return oh.T  # (n_classes, m)

BACKPROPAGATION

This is where the power of the neural network comes to the front, at every iteration of the gradient descent alogrithm which we would be using to train the network, the network at first performs forward propagation as detailed above and produces rubbish as predictions, It then compares it's predictions with the actual target values, which are ofcourse one hot encoded, because for each example we get a prediction looking like this [0.4, 0.6, 0.2, 0.1, 0.2, ..., 0.4], and lets say the handwritten digit is 6, we make our target look like this [0,0,0,0,0,1,0,0,0,0], so we want the prediction for 6 to be very high and the rest low, so that the model is confident of its output, but how can we do this?

This is where backpropagation comes in, by taking the error, that is the difference between the actual and predicted value, we can in a higher dimensional space nudge the weights and biases in a direction that when they multiply the input, they are more likely to spit out the correct solution to the problem, this is why its a learning algorithm, we learn the correct weights and biases, starting from something random.

This is a simple image to get an idea of what we are talking about it's like having y = mx + c and trying to learn the values of m and c that best fit the equation given any input x, but on a much higher dimensional scale, here this is a vectorized implementation so we are dealing with matrices, now we might know from calculus that to minimize the cost of a function with respect to any variable, we must find the point where the derivative of the function is zero, because at that point the cost is minimal. we know this from the way a derivative behaves.

Now we have weights W1, W2, W3, b1, b2, b3 in our 3 layer neural netowrk and cost function $$ \text{Loss} = - \sum_{n=1}^{N} y \log(\hat{y}) $$ sow we need to know how the loss changes with respect to these variables and update them in such a way that the loss gets smaller, in other words we must calculate $ dL/dW_1, dL,dW_2, dL,dW_3, dL/db_1, dL/db_2, dL,db_3 and update the parameters as such:

Backpropagation Equations (3-Layer Network)

Let’s denote: $$ Z_1 = W_1 X + b_1,
A_1 = g_1(Z_1), Z_2 = W_2 A_1 + b_2,
A_2 = g_2(Z_2),
Z_3 = W_3 A_2 + b_3,
A_3 = \hat{y} $$

where $A_3$ is the output after applying softmax, and the loss is: $$ \text{Loss} = - \sum_{n=1}^{N} y \log(\hat{y}) $$

Step 1: Output Layer Gradient

$$ dZ_3 = A_3 - Y_{\text{one-hot}} $$

This result looks almost too clean, but it comes from combining:

softmax activation cross-entropy loss

If you actually differentiate the loss w.r.t $Z_3$, you go through a messy chain rule involving exponentials and sums (because softmax is a quotient). After applying the quotient rule + chain rule, everything cancels beautifully, leaving: $$ dL/dZ_3 = A_3 - Y_{onehot} $$

This is one of the reasons this combination is so popular — it simplifies what would otherwise be a nightmare.

This result looks almost too clean, but it comes from combining:

softmax activation
cross-entropy loss

If you actually differentiate the loss with respect to $$ Z_3 $$, you go through a messy chain rule involving exponentials and sums (because softmax is a quotient). After applying the quotient rule and chain rule, everything cancels beautifully, leaving:

$$\frac{\partial L}{\partial Z_3} = A_3 - Y$$

This is one of the reasons this combination is so popular — it simplifies what would otherwise be a nightmare.

Step 2: Gradients for $$ W_3 $$ and $$ b_3 $$

$$dW_3 = \frac{1}{m} dZ_3 \cdot A_2^T$$

$$db_3 = \frac{1}{m} \sum dZ_3$$

Why the transpose?

Look at the shapes:

$$ dZ_3 $$: $$ (n_{\text{output}}, m) $$
$$ A_2 $$: $$ (n_{\text{hidden}}, m) $$

We want $$ dW_3 $$ to have shape:

$$(n_{\text{output}}, n_{\text{hidden}})$$

So we multiply:

$$dZ_3 \cdot A_2^T$$

where:

$$A_2^T : (m, n_{\text{hidden}})$$

Now the matrix multiplication works:

$$(n_{\text{output}}, m) \cdot (m, n_{\text{hidden}}) = (n_{\text{output}}, n_{\text{hidden}})$$

The transpose is not arbitrary — it ensures dimension alignment and effectively computes how each neuron in layer 2 contributes to the error in layer 3 across all training examples.

following through them math we arrive at:

Step 3: Backpropagate to Layer 2

$$dZ_2 = (W_3^T \cdot dZ_3) * g_2'(Z_2)$$

$$dW_2 = \frac{1}{m} dZ_2 \cdot A_1^T$$

$$db_2 = \frac{1}{m} \sum dZ_2$$

Step 4: Backpropagate to Layer 1

$$dZ_1 = (W_2^T \cdot dZ_2) * g_1'(Z_1)$$

$$dW_1 = \frac{1}{m} dZ_1 \cdot X^T$$

$$db_1 = \frac{1}{m} \sum dZ_1$$

Parameter Update

$$W_i := W_i - \alpha dW_i$$

$$b_i := b_i - \alpha db_i$$

def back_propagation(Z1, A1, Z2, A2, Z3, A3, W1, W2, W3, X, Y): #Y is target train
    m = A3.shape[1]
    Y_oh = one_hot(Y, num_classes=10)
    dZ3 = A3 - Y_oh          # (10, m)
    dW3 = (1/m) * dZ3 @ A2.T    # (10,16)
    db3 = (1/m) * np.sum(dZ3, axis=1, keepdims=True)  # (10,1)
    
    dA2 = W3.T @ dZ3            # (16, m)
    dZ2 = dA2 * (Z2 > 0)        # (16, m)  ReLU derivative
    dW2 = (1/m) * dZ2 @ A1.T    # (16,64)
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # (16,1)
    
    dA1 = W2.T @ dZ2            # (64, m)
    dZ1 = dA1 * (Z1 > 0)        # (64, m)
    dW1 = (1/m) * dZ1 @ X.T     # (64,1437)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # (64,1)
    
    return dW3, db3, dW2, db2, dW1, db1

Then we update our parameters with a small learning rate to enusre it doesnt overshoot the minima its looking for with a very massive gradient.

def update_params(W1, b1, W2, b2, W3, b3, dW3, db3, dW2, db2, dW1, db1, alpha):
    W3 = W3 - alpha* dW3
    b3 = b3 -alpha*db3
    W2 = W2 - alpha*dW2
    b2 = b2 - alpha*db2
    W1 = W1 - alpha* dW1
    b1 = b1 -alpha*db1
    
    return W1, b1, W2, b2, W3,b3

def get_predictions(A3): ## get predictions from output layer
    return  np.argmax(A3,axis=0)
    
def get_accuracy(predictions, y):
    return np.sum(predictions == y) / y.size

loss_s = []

Now we put it all together to perform gradient descent, a process where we perform forward propagation, predict, calculate the error and propagate that error backward through the network and nudge the weights and biases in a direction that reduces the errors.

def train(X, Y, epochs=100, batch_size=64, alpha=0.01): # perform gradient descent for a number of epochs
    W1, b1, W2, b2, W3, b3 = init_params() # initialize weights and biases
    for epoch in range(epochs):
        for Xb, Yb in get_minibatches(X, Y, batch_size):
            Z1,A1,Z2,A2,Z3,A3 = forward_propagation(Xb, W1, b1, W2, b2, W3, b3) # forward pass
            dW3, db3, dW2, db2, dW1, db1 = back_propagation(Z1, A1, Z2, A2, Z3, A3, W1, W2, W3, Xb, Yb) # backward pass
            W1, b1, W2, b2, W3,b3 = update_params(W1, b1, W2, b2, W3, b3, dW3, db3, dW2, db2, dW1, db1, alpha) # update parameters using gradient descent
            loss_s.append(compute_loss(A3.T, Yb.flatten())) # compute loss for the current batch and append to list
        print(f"Epoch {epoch+1} |  Accuracy: {get_accuracy(get_predictions(A3), Yb)*100:.2f}%, loss: {compute_loss(A3.T, Yb.flatten())}") # print accuracy and loss for each epoch

    return W1, b1, W2, b2, W3, b3

W1_final, b1_final, W2_final, b2_final, W3_final, b3_final = train(X_train, y_train, alpha=0.01)

Epoch 1 |  Accuracy: 0.00%, loss: 120.10748063870817
Epoch 2 |  Accuracy: 17.24%, loss: 111.60073049838431
Epoch 3 |  Accuracy: 13.79%, loss: 94.45606195479442
Epoch 4 |  Accuracy: 6.90%, loss: 94.48629998533681
Epoch 5 |  Accuracy: 3.45%, loss: 94.08104421121865
Epoch 6 |  Accuracy: 24.14%, loss: 90.47152887680893
Epoch 7 |  Accuracy: 13.79%, loss: 87.20910899622615
Epoch 8 |  Accuracy: 27.59%, loss: 84.56191366768557
Epoch 9 |  Accuracy: 20.69%, loss: 89.73735213360764
Epoch 10 |  Accuracy: 31.03%, loss: 81.04931083164604
Epoch 11 |  Accuracy: 20.69%, loss: 79.53268751578887
Epoch 12 |  Accuracy: 17.24%, loss: 81.91170354671324

...
Epoch 97 |  Accuracy: 100.00%, loss: 1.9158704458028466
Epoch 98 |  Accuracy: 96.55%, loss: 3.013109498701423
Epoch 99 |  Accuracy: 96.55%, loss: 3.70180754169769
Epoch 100 |  Accuracy: 100.00%, loss: 2.1896857628603104

The loss reduces as the model predicts digits more accurately.

def test(X_test, y_test):
    idx = np.random.randint(1,360, size=64)
    X = X_test[idx]
    y = y_test[idx]
    Z1,A1,Z2,A2,Z3,A3 = forward_propagation(X.T, W1_final, b1_final, W2_final, b2_final, W3_final, b3_final)
    print(f"Accuracy: {get_accuracy(get_predictions(A3), y):.3f}")
    return X, y, A3

%matplotlib inline
X, y, A3 = test(X_test, y_test)

rd1 = np.random.randint(0,64)
rd2 = np.random.randint(0,64)
print(X[rd1])
plt.imshow(X[rd1].reshape(8,8), cmap='grey')
true_value = y[rd1]
prediction = np.argmax(A3[:,rd1])
print(f"from the image above the handwritten digit is '{true_value}' and our neural network predicted its '{prediction}'")
# when i do the matrix multiplcation dont forget A3 is a 10 x 64 matrix

Accuracy: 0.953
[ 0.  0.  0. 13. 12.  0.  0.  0.  0.  0.  6. 16.  4.  0.  0.  0.  0.  2.
 16. 10.  0.  0.  0.  0.  0.  5. 16. 10.  0.  0.  0.  0.  0.  8. 15. 15.
  6.  0.  0.  0.  0.  3. 16. 14. 13. 10.  2.  0.  0.  0. 12. 16. 13. 16.
 12.  0.  0.  0.  1. 10. 16. 14.  4.  0.]
from the image above the handwritten digit is '6' and our neural network predicted its '6'

So we can see from our random testing data that we got a 95% accuracy, which is quite alright for a 3 layer Neural Network using the ReLU and softmax activation built with our derived gradients which we used the chain rule to derive that patterns can arise out of chaos, therefore we have sufficiently proved by virtue that our input function which is a grid of pixels which each value follows a non-linear relationship with each other and the target itself that an Artificial Neural Network is a Universal Function approximator.

We want to see a plot of how our loss changes with each iteration for training and testing data

%matplotlib inline

iterations = [i for i in range(len(loss_s))]
plt.Figure(figsize=(50,50))
plt.plot(iterations, loss_s, color='red')
plt.legend()
plt.xlabel('iterations')
plt.ylabel('Mini-batch gradient descent loss')
plt.show()

We can see the amazingly smooth and steady decline when we train all our batches at once per epoch, but note this required more iterations due to the great decrease in the number of samples been optimized per each epoch, But generally this proves for relatively small datasets, Batch gradient descent is the best option

note: we also increases alpha due to the smoother nature of convergence so we are less likely to overshoot with such in place.

iterations_batch = [i for i in range(len(loss_ss))]  
plt.Figure(figsize=(50,50))
plt.plot(iterations_batch, loss_ss, color='red')
plt.xlabel('iterations')
plt.ylabel('Batch gradient descent loss')
plt.show()
len(iterations_batch)
len(loss_ss)
#When learning rate is a bit higher for faster convergence model overshoots and corrects multiple times but overalldescent is smoother

Gradient Descent with Momentum

The velocity term is defined as an exponential moving average of past gradients:

$$V_t = \beta V_{t-1} + (1 - \beta) \nabla W$$

Similarly for the biases:

$$V_t^{(b)} = \beta V_{t-1}^{(b)} + (1 - \beta) \nabla b$$

Where: $$ \beta $$

is the momentum (smoothing factor), typically close to 1 (e.g. 0.9) $$ \nabla W $$ and $$ \nabla b $$ are the current gradients

We then update the parameters using the velocity instead of the raw gradients:

$$ W := W - \alpha V_t
$$ $$ b := b - \alpha V_t^{(b)} $$

Intuition

Momentum works by accumulating past gradients, so updates build up speed in directions where gradients consistently point the same way.

It reduces oscillations in noisy directions
It accelerates convergence along relevant directions
It produces a smoother, more stable descent path

Instead of reacting only to the current gradient, the model now moves with inertia, like a ball rolling down a valley rather than jittering step by step.

def momentum(W1, b1, W2, b2, W3, b3, dW3, db3, dW2, db2, dW1, db1, alpha, VdW3=0, VdW2=0, VdW1=0):
    beta = 0.9
    VdW3 = beta*VdW3 + (1-beta)*dW3
    W3 = W3 - alpha* VdW3
    b3 = b3 -alpha*db3
    VdW2 = beta*VdW2 + (1-beta)*dW2
    W2 = W2 - alpha*VdW2
    b2 = b2 - alpha*db2
    VdW1 = beta*VdW1 + (1-beta)*dW1
    W1 = W1 - alpha* VdW1
    b1 = b1 -alpha*db1
    
    return W1, b1, W2, b2, W3, b3 #mismatch

loss_momentum =[]
def train(X, Y, epochs=100, batch_size=64, alpha=0.01):
    W1, b1, W2, b2, W3, b3 = init_params()
    for epoch in range(epochs):
        Z1,A1,Z2,A2,Z3,A3 = forward_propagation(X, W1, b1, W2, b2, W3, b3)
        
        dW3, db3, dW2, db2, dW1, db1 = back_propagation(Z1, A1, Z2, A2, Z3, A3, W1, W2, W3, X, Y)
        W1, b1, W2, b2, W3,b3 = momentum(W1, b1, W2, b2, W3, b3, dW3, db3, dW2, db2, dW1, db1, alpha)
        loss_momentum.append(compute_loss(A3.T, Y.flatten()))
        print(f"Epoch {epoch+1} |  Accuracy: {get_accuracy(get_predictions(A3), Y)*100:.2f}%, loss: {compute_loss(A3.T, Y.flatten())}")

    return W1, b1, W2, b2, W3, b3

notice how momentum causes the gradient to be more stable and reduce training time due to the exponentially weighted average.

Adaptive Moment Estimator (ADAM)

We want to smoothen out our loss function for faster and smoother convergence. This is very important in large neural networks, where we have many local minima, and training using just momentum might still face potential issues and take days to train.

So we combine momentum with an adaptive learning rate, which updates the constant learning rate (\alpha) based on the magnitude of the gradient, an adaptation of Root Mean Squared Propagation which is given by:

$$S_{dw} = \beta S_{dw} + (1-\beta) (dw)^2$$

$$S_{db} = \beta S_{db} + (1-\beta) (db)^2$$

$$W = W - \frac{\alpha}{\sqrt{S_{dw}} + \epsilon} , dw$$

$$b = b - \frac{\alpha}{\sqrt{S_{db}} + \epsilon} , db$$

Combining gradient descent with momentum and RMSProp produces a brilliant adaptive momentum estimator — a momentum estimator with corrected learning rate based on gradients.

Given our weights are vectorized:

$$V_{dw} = 0, \quad S_{dw} = 0, \quad V_{db} = 0, \quad S_{db} = 0$$

Momentum part:

$$V_{dw} = \beta_1 V_{dw} + (1 - \beta_1) dw, \quad V_{db} = \beta_1 V_{db} + (1 - \beta_1) db$$

RMSProp part:

$$S_{dw} = \beta_2 S_{dw} + (1 - \beta_2) (dw)^2, \quad S_{db} = \beta_2 S_{db} + (1 - \beta_2) (db)^2$$

Bias-corrected estimates:

$$V_{dw}^{\text{corrected}} = \frac{V_{dw}}{1 - \beta_1^t}, \quad V_{db}^{\text{corrected}} = \frac{V_{db}}{1 - \beta_1^t}$$

$$S_{dw}^{\text{corrected}} = \frac{S_{dw}}{1 - \beta_2^t}, \quad S_{db}^{\text{corrected}} = \frac{S_{db}}{1 - \beta_2^t}$$

Parameter updates:

$$W := W - \frac{\alpha}{\sqrt{S_{dw}^{\text{corrected}}} + \epsilon} , V_{dw}^{\text{corrected}}$$

$$ b := b - \frac{\alpha}{\sqrt{S_{db}^{\text{corrected}}} + \epsilon} , V_{db}^{\text{corrected}} $$

def Adam(W1, b1, W2, b2, W3, b3, dW3, db3, dW2, db2, dW1, db1, alpha, t, VdW3=0, VdW2=0, VdW1=0, SdW3=0, SdW2=0, SdW1=0, Vdb3=0, Vdb2=0, Vdb1=0, Sdb3=0, Sdb2=0, Sdb1=0, eps=1e-8, B1=0.9, B2=0.999):
    
    VdW3 = B1*VdW3 + (1 -B1)*dW3
    Vdb3 = B1*Vdb3 + (1-B1)*db3
    VdW3_corr = VdW3 / (1 - (B1**t)) #t=1 at the start of the first iteration
    Vdb3_corr = Vdb3 / (1 - (B1**t))
    SdW3 = B2*SdW3 + (1- B2)*(np.square(dW3))
    Sdb3 = B2*Sdb3 + (1- B2)*(np.square(db3))
    SdW3_corr = SdW3 / (1 - B2**t)
    Sdb3_corr = Sdb3 / (1 - B2**t)
    
    VdW2 = B1*VdW2 + (1 -B1)*dW2
    Vdb2 = B1*Vdb2 + (1-B1)*db2
    VdW2_corr = VdW2 / (1 - (B1**t)) #t=1 at the start of the first iteration
    Vdb2_corr = Vdb2 / (1 - (B1**t))
    SdW2 = B2*SdW2 + (1- B2)*(np.square(dW2))
    Sdb2 = B2*Sdb2 + (1- B2)*(np.square(db2))
    SdW2_corr = SdW2 / (1 - B2**t)
    Sdb2_corr = Sdb2 / (1 - B2**t)
    
    VdW1 = B1*VdW1 + (1 -B1)*dW1
    Vdb1 = B1*Vdb1 + (1-B1)*db1
    VdW1_corr = VdW1 / (1 - (B1**t)) #t=1 at the start of the first iteration
    Vdb1_corr = Vdb1 / (1 - (B1**t))
    SdW1 = B2*SdW1 + (1- B2)*(np.square(dW1))
    Sdb1 = B2*Sdb1 + (1- B2)*(np.square(db1))
    SdW1_corr = SdW1 / (1 - B2**t)
    Sdb1_corr = Sdb1 / (1 - B2**t)
    
    
    
    W3 = W3 - alpha * (VdW3_corr / (np.sqrt(SdW3_corr) + eps))
    b3 =  b3 - alpha * (Vdb3_corr / (np.sqrt(Sdb3_corr) + eps))
    
    W2 = W2 - alpha * (VdW2_corr / (np.sqrt(SdW2_corr) + eps))
    b2 =  b2 - alpha * (Vdb2_corr / (np.sqrt(Sdb2_corr) + eps))
    
    W1 = W1 - alpha * (VdW1_corr / (np.sqrt(SdW1_corr) + eps))
    b1 =  b1 - alpha * (Vdb1_corr / (np.sqrt(Sdb1_corr) + eps))
    
    return W1, b1, W2, b2, W3,b3

loss_Adam =[]
def train(X, Y, epochs=100, batch_size=64, alpha=0.01):
    W1, b1, W2, b2, W3, b3 = init_params()
    t = 1
    for epoch in range(epochs):
        Z1,A1,Z2,A2,Z3,A3 = forward_propagation(X, W1, b1, W2, b2, W3, b3)
        
        dW3, db3, dW2, db2, dW1, db1 = back_propagation(Z1, A1, Z2, A2, Z3, A3, W1, W2, W3, X, Y)
        W1, b1, W2, b2, W3,b3 = Adam(W1, b1, W2, b2, W3, b3, dW3, db3, dW2, db2, dW1, db1, alpha, t)
        loss_Adam.append(compute_loss(A3.T, Y.flatten()))
        print(f"Epoch {epoch+1} |  Accuracy: {get_accuracy(get_predictions(A3), Y)*100:.2f}%, loss: {compute_loss(A3.T, Y.flatten())}")
        t += 1

    return W1, b1, W2, b2, W3, b3

ADAM combines momentum with an adaptive learning rate that changes based on how much the gradient changes if the gradient is changing too fast, adam reduces the learning rate to prevent overshooting, if its too slow it increases the learning rate and so on.

perfect Accuracy dropped a bit but the model still performs very well and converges better. Now we Just built a simple Neural Network that can recognize digit with a high degree of accuracy using only numpy for mathematical operations, weve effectively torn open the black box and seen how it does it's magic. If you understand this section properly, you’re no longer just “using” neural networks — you actually know what they’re doing under the hood and are prepared to not only use but tweak and understand them, and dive into higher architectures like RNN/GRU or a CNN.

Artificial NEURAL NETWORK FROM SCRATCH FOR HANDWRITTEN DIGIT RECOGNITION

INTRODUCTION

Dataset

FORWARD PROPAGATION

layer 1

layer 2

layer 3

BACKPROPAGATION

Step 2: Gradients for $$ W_3 $$ and $$ b_3 $$

Why the transpose?

Step 3: Backpropagate to Layer 2

Step 4: Backpropagate to Layer 1

Parameter Update

note: we also increases alpha due to the smoother nature of convergence so we are less likely to overshoot with such in place.

Gradient Descent with Momentum

Intuition

Adaptive Moment Estimator (ADAM)

We want to smoothen out our loss function for faster and smoother convergence. This is very important in large neural networks, where we have many local minima, and training using just momentum might still face potential issues and take days to train.

Comments (1)

More from this blog

THE REASONABLE INEFFECTIVENESS OF VANILLA RECURRENT NEURAL NETWORKS

VITERBI ALGORITHM FOR PART OF SPEECH TAGGIN

Logistic Regression for sentiment analysis

Naïve Bayes Classifier for sentiment Analysis using IDMB movie reviews

Command Palette

INTRODUCTION

Dataset

FORWARD PROPAGATION

layer 1

layer 2

layer 3

BACKPROPAGATION

Step 2: Gradients for $$ W_3 $$ and $$ b_3 $$

Why the transpose?

Step 3: Backpropagate to Layer 2

Step 4: Backpropagate to Layer 1

Parameter Update

note: we also increases alpha due to the smoother nature of convergence so we are less likely to overshoot with such in place.

Gradient Descent with Momentum

Intuition

Adaptive Moment Estimator (ADAM)

We want to smoothen out our loss function for faster and smoother convergence. This is very important in large neural networks, where we have many local minima, and training using just momentum might still face potential issues and take days to train.

Comments (1)

More from this blog