Outline

Background

History

Artificial neural networks (ANNs), or neural networks for short, are some of the most powerful methods in machine learning (ML). Neural networks were first hypothesized in the 1940s by Warren McCulloch and Walter Pitts, but it wasn’t until the late 1950s that Frank Rosenblatt created the first ANN, the perceptron.

A single perceptron unit.

From the 1940s through the 1990s, the field of deep learning (DL) witnessed several landmark achievements:

  • 1943: Warren McCulloch and Walter Pitts hypothesize ANNs
  • 1958: Frank Rosenblatt develops the perceptron
  • 1980: Kunihiko Fukushima creates the neocognitron, the predecessor to the convolutional neural network
  • 1974, 1986: Paul Werbos, and later David Rumelhart and Geoffrey Hinton, independently develop the backpropagation algorithm to train multilayer perceptrons
  • 1990: Yann LeCun develops and applies the convolutional neural network to handwritten digit recognition

Despite these achievements, the field largely stagnated due to the algorithmic, computational and data limitations of the time. It wasn’t until 2012, when Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton published the famous “AlexNet” paper, that the field revived. Since then, DL has surged, with ANNs steadily becoming the de facto method for solving large-scale problems in ML.

Shortcomings

Despite their immense flexibility and seemingly “silver bullet” nature, neural networks are highly uninterpretable and are often called “black box” models. A “black box” ML model is one into which data is fed and from which results are output, but the mechanics by which this happens are obscure and unintuitive. There is little to no understanding of what the model is learning or what decisions are being made. As a result, it is unclear how to explain why a neural network makes a certain decision.

Society is growing increasingly dependent on AI models, and as they continue to permeate everyday life, the question of what is happening “under the hood” in neural networks is becoming a serious concern.

Objectives

In this article, we try to bring some much-needed transparency to the learning process of neural networks by visualizing the way ANNs learn. We will consider a simple multilayer perceptron tasked with classifying datapoints of different classes and examine how its decision boundaries and weights change over the course of training. By uncovering what happens during this learning process, we hope to turn AI models into explainable AI models, allowing data scientists and end users to comprehend and trust the results of their models.

A typical multilayer perceptron architecture consisting of an input layer (left), an output layer (right) and any number of hidden layers in between.

Decision Boundaries

One of the best ways to understand how an ANN is able to classify datapoints is by understanding how it draws decision boundaries. A decision boundary is a surface that partitions the feature space into regions that optimally separate the classes the network is trying to predict. Future points that the network has not seen will be classified according to the region they fall into.

A linear decision boundary.
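To make this concrete, below is a minimal sketch of how a decision boundary can be visualized for a small network, assuming scikit-learn and matplotlib are available; the dataset, architecture and grid resolution are arbitrary choices for illustration, not the ones used later in this article. The classifier is evaluated on a dense grid of points, and the border between the predicted regions is the decision boundary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier

# Two well-separated clusters and a small multilayer perceptron.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)

# Evaluate the trained network on a dense grid covering the feature space.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# The border between the two shaded regions is the decision boundary.
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.show()
```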

Mathematical Foundations

Gradient Based Learning

ML predominantly revolves around a concept called gradient-based learning. In gradient-based learning, we reduce a complicated learning task (e.g., classification) that a human can perform down to a mathematical function called an objective function. This objective function compares the prediction made by the neural network against a reference/target value and returns a single number that quantifies how incorrect that prediction was. This value is referred to as the loss. In ML, objective functions are designed to be minimized, which makes sense as we want to minimize the loss. The loss accumulated over all training datapoints is called the cost.

For example, consider the Mean Squared Error cost function commonly used to train linear regression models. The larger the difference between the reference value(s) \(\mathbf{y}\) and the predicted values \(\mathbf{X}\boldsymbol{\beta}\), the larger the cost.

\[\underset{\boldsymbol{\beta}}{\text{arg min }} \mathcal{C}: \mathcal{C}(\beta_{0}, \beta_{1}) = \frac{1}{n} \left \lVert \mathbf{y} - \mathbf{X} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} \right \rVert_{2}^{2}\]

A key point to note is that the predictions a cost function uses depend on the parameters of the model. The cost is therefore a composition of functions of these parameters, so by the Chain Rule of calculus we can compute its gradient with respect to the parameters and use that gradient to find the parameter values that minimize the cost.
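As a concrete illustration, here is a minimal sketch of gradient descent on the MSE cost above using only NumPy; the toy data, learning rate and iteration count are arbitrary choices made for this example.

```python
import numpy as np

# Toy data: y = 2 + 3x plus noise; X carries a column of ones for the intercept beta_0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 + 3 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)  # [beta_0, beta_1]
lr = 0.1            # learning rate (arbitrary choice)

for _ in range(500):
    residual = y - X @ beta              # y - X beta
    cost = np.mean(residual ** 2)        # the MSE cost being minimized (shown for clarity)
    grad = -2 / len(y) * X.T @ residual  # gradient of the cost with respect to beta
    beta -= lr * grad                    # step in the direction that lowers the cost

print(beta)  # approximately [2, 3]
```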

The Universal Approximation Theorem

The Universal Approximation Theorem is closely intertwined with gradient-based learning in the context of classification tasks. When it comes to classification, neural networks aim to learn a decision boundary that effectively separates the different classes in the feature space. The theorem asserts that, given a continuous function \(f(x)\) on a compact domain, there exists a feedforward neural network with a single hidden layer containing sufficiently many neurons and a suitable nonlinear activation function that can approximate this function to any desired degree of accuracy.

Gradient-based learning forms the backbone of training neural networks for classification tasks. This approach iteratively adjusts the parameters of the network by computing gradients of a cost function with respect to these parameters and updating them in the direction that minimizes the cost. The Universal Approximation Theorem supports this process by providing a theoretical guarantee that, with sufficiently large and appropriately configured networks, these gradient updates can guide the network towards approximating the true decision boundary between classes, enabling effective classification. This theorem fundamentally underpins the power and versatility of neural networks in approximating complex functions.
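As a quick illustration of the theorem (not a proof), the sketch below fits a single-hidden-layer network to \(\sin(x)\) on a compact interval using scikit-learn's MLPRegressor; the hidden-layer width and activation are arbitrary choices, and widening the layer generally tightens the fit.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Target: a continuous function on a compact interval.
x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
y = np.sin(x).ravel()

# A single hidden layer with enough neurons can approximate the target closely.
net = MLPRegressor(hidden_layer_sizes=(64,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(x, y)

print(np.max(np.abs(net.predict(x) - y)))  # maximum approximation error on the grid
```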

Example Datasets

To visualize how a neural network learns to separate two classes, we have selected three datasets: biclusters, circles and moons. These synthetic datasets were generated using the scikit-learn Python package. The two clusters of the biclusters dataset can be optimally separated with a linear decision boundary, while the clusters in the other two datasets require nonlinear boundaries. A sketch of how such datasets can be generated follows the figures below.

Biclusters

Circles

Moons
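The exact generator calls used for the figures are not reproduced here, but a minimal sketch along these lines, assuming make_blobs for the linearly separable pair of clusters and make_circles / make_moons for the nonlinear ones, produces comparable data:

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

# Two linearly separable clusters (stand-in for the "biclusters" dataset).
X_blobs, y_blobs = make_blobs(n_samples=500, centers=2, random_state=42)

# Concentric circles: the inner class cannot be separated by a straight line.
X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)

# Two interleaving half-moons, another nonlinearly separable case.
X_moons, y_moons = make_moons(n_samples=500, noise=0.1, random_state=42)
```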


Weights and Biases

MNIST

Details of the exploratory analysis we performed on the dataset, including a couple of dimensionality reduction techniques used to show the distribution of output classes, can be found here:

Click on a tab to view weights for the specified hidden layer:

Input to Hidden Layer 1

Hidden Layer 1 into Hidden Layer 2
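The interactive visualizations above come from the article's own training pipeline, which is not reproduced here. As a rough sketch, assuming a scikit-learn MLPClassifier with two hidden layers trained on MNIST, the input-to-hidden-layer-1 weight matrix can be rendered as one 28×28 image per neuron like this:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

# MNIST: 70,000 handwritten digits, each a 28x28 image flattened to 784 features.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

# A small two-hidden-layer MLP, trained only briefly for illustration.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=10, random_state=0)
clf.fit(X, y)

# coefs_[0] holds the input-to-hidden-layer-1 weights: one 784-vector per neuron.
fig, axes = plt.subplots(4, 8, figsize=(10, 5))
for ax, w in zip(axes.ravel(), clf.coefs_[0].T):
    ax.imshow(w.reshape(28, 28), cmap="coolwarm")
    ax.axis("off")
plt.show()
```

The hidden-layer-1-to-hidden-layer-2 weights live in clf.coefs_[1] and can be displayed the same way, although they no longer map onto a 28×28 pixel grid.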