Neural Network Series: Multilayer perceptron, can you really do it all? (Part V)
The previous article was the last of the simple perceptron introductory series. Before moving on, let’s summarize the basic concepts that should be understood:
- McCulloch and Pitts introduced the model behind the simple perceptron: an artificial neuron suited to binary classification problems (provided we can find the right weights).
- The simple perceptron can find a solution as long as the problem is linearly separable, at least approximately so.
- Rosenblatt developed an algorithm that iteratively improves the weights. This is a great tool because it requires no a priori information about the ideal weights.
- Changing the activation function changes the kind of problem the perceptron can solve.
The simple perceptron quickly reaches its limitations…
Now, let’s imagine we have a non-linear relationship that we would like to approximate, such as the one in Figure 1, on the left. With our current set of tools, there is no way a single activation function could do it. Something we could try, for example, is to split the function into two pieces and approximate each piece with its own neuron.
The green and red neurons exemplify this behavior. In this case, the activation is the logistic function, which belongs to the family of sigmoid functions, all of them non-linear. Something interesting about them is that their parameters can be configured to mimic a linear relationship or a step activation function, even though their natural shape is neither. You can verify this behavior in this short video, where the beta parameter is modified.
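If you prefer to poke at this in code rather than in the video, here is a minimal NumPy sketch (the beta values are illustrative choices of mine, not taken from the article): a large beta makes the logistic curve behave like a step function, while a small beta keeps it almost linear around the origin.

```python
import numpy as np

def logistic(x, beta=1.0):
    """Logistic sigmoid with a steepness parameter beta."""
    return 1.0 / (1.0 + np.exp(-beta * x))

x = np.linspace(-3, 3, 7)

# Large beta: the curve becomes sharp and behaves like a step function.
print(np.round(logistic(x, beta=50), 3))   # ~[0, 0, 0, 0.5, 1, 1, 1]

# Small beta: around the origin the curve is almost linear, with slope ~beta/4.
print(np.round(logistic(x, beta=0.1), 3))  # values close to 0.5, rising gently
```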
Could the input to a neuron be… another neuron?
With each perceptron representing a piece of the original relationship, let’s feed both outputs into a new neuron to finish building the function, as shown in Figure 2.
The blue perceptron is an ADALINE (linear) perceptron that takes as inputs the outputs of the green neuron (its input x1) and the red neuron (its input x2). If we analyze what happens from minus infinity to 1, the result inside the blue neuron is zero, just like our original approximation. From 1 to 2, it is approximately one (remember these are sigmoid functions, so their actual value never quite reaches 1). From 2 to infinity, the result inside the blue perceptron is equivalent to zero again.
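To make Figure 2 concrete, here is a hedged sketch of that three-neuron system with hand-picked values (the ±1 output weights, the offsets at 1 and 2 and the steepness are my own illustrative choices): the green and red neurons switch on around 1 and 2, and the blue linear neuron subtracts one from the other.

```python
import numpy as np

def logistic(x, beta=20.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def green(x):
    # Hidden neuron that switches on around x = 1
    return logistic(x - 1.0)

def red(x):
    # Hidden neuron that switches on around x = 2
    return logistic(x - 2.0)

def blue(x):
    # Linear (ADALINE-like) output neuron: weight +1 on green, -1 on red
    return 1.0 * green(x) - 1.0 * red(x)

for x in [0.0, 1.5, 3.0]:
    print(x, round(blue(x), 3))
# ~0 before 1, ~1 between 1 and 2, ~0 after 2
```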
The little system that we’ve just built is called a multi-layer perceptron (really?) or neural network. A multi-layer perceptron is a combination of neurons that has one or more hidden layers. Hidden layers are those between the input and the output. In this case, the green and red neurons form a hidden layer of 2 neurons and the blue neuron forms the output layer. There can be as many hidden layers as we want, with as many neurons in each layer as we decide to configure (which immediately raises the problem of how to choose this configuration).
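If you’d rather experiment with such a configuration without wiring neurons by hand, a possible sketch using scikit-learn’s MLPRegressor (my choice of tool, not something the article relies on) looks like this; hidden_layer_sizes is exactly where the “how many layers, how many neurons” decision lives.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# hidden_layer_sizes controls both the number of hidden layers and the
# neurons per layer: (2,) is one hidden layer of 2 neurons, (16, 8) would
# be two hidden layers of 16 and 8 neurons.
X = np.linspace(0, 3, 200).reshape(-1, 1)
y = ((X[:, 0] > 1) & (X[:, 0] < 2)).astype(float)  # the piecewise "bump"

mlp = MLPRegressor(hidden_layer_sizes=(2,), activation='logistic',
                   solver='lbfgs', max_iter=5000, random_state=0)
mlp.fit(X, y)
print(mlp.predict([[0.5], [1.5], [2.5]]))  # ideally near 0, 1, 0 (depends on the run)
```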
The Universal Approximation Theorem, by Cybenko and Hornik
Up until now, the representational power of a simple perceptron was restricted by the activation function it used. What happens when we start combining more and more neurons? There is no obvious way to tell what the outputs will look like.
What Cybenko and Hornik proved is that, given enough neurons in a multi-layer perceptron with a single hidden layer, and provided that these neurons use a sigmoid activation function, the network can approximate any continuous function as closely as we want. The intuition behind this idea is similar to the one we used in the example above.
If we have any function, we can think of it as a set of rectangles of different heights and widths that follow the function’s shape. Each neuron can represent one of these portions, and combining those neurons gives us the function’s approximate value.
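As a rough sketch of that intuition (all the widths, heights and the steepness below are arbitrary choices of mine), we can hand-build such rectangles as differences of steep sigmoids and add them up to approximate, say, a sine wave:

```python
import numpy as np

def logistic(x, beta=50.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def bump(x, left, right):
    """One 'rectangle': difference of two steep sigmoids, ~1 on [left, right]."""
    return logistic(x - left) - logistic(x - right)

target = np.sin                          # any function we want to approximate
edges = np.linspace(0, 2 * np.pi, 41)    # 40 rectangles over [0, 2*pi]
centers = (edges[:-1] + edges[1:]) / 2

def approx(x):
    # Sum of rectangles, each scaled by the target's height at its center.
    return sum(target(c) * bump(x, a, b)
               for a, b, c in zip(edges[:-1], edges[1:], centers))

xs = np.linspace(0, 2 * np.pi, 200)
print(np.max(np.abs(approx(xs) - target(xs))))  # shrinks as we add more, narrower rectangles
```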
In reality, this is not exactly what happens. There is no clear understanding of how multi-layer perceptrons model different portions of the function, or of how many neurons are used to build the relationships between input and output. This is an open problem called “interpretability”, with active research and investment aimed at better understanding how these networks model data and make decisions. In this article I’ve shown tiny networks, but the problem becomes very hard once we scale to models with billions of parameters.
The promise of this theorem is huge. Too good to be true, right? Let’s take a closer look at a quote from the original paper, “Approximation by Superpositions of a Sigmoidal Function” [1]:
While the approximating properties we have described are quite powerful, we have focused only on existence. The important questions that remain to be answered deal with feasibility, namely how many terms in the summation (or equivalently, how many neural nodes) are required to yield an approximation of a given quality? What properties of the function being approximated play a role in determining the number of terms? At this point, we can only say that we suspect quite strongly that the overwhelming majority of approximation problems will require astronomical numbers of terms. This feeling is based on the curse of dimensionality that plagues multidimensional approximation theory and statistics.
These are all open questions and decisions one must make when building a neural network to tackle any problem. Luckily for us, research on architectures has skyrocketed and there are options on which we can base our decisions. For example, when dealing with images, it is very likely that we will favour an approach using Convolutional Neural Networks, given their success in the area. On the other hand, when dealing with language problems, one might consider Transformers. Unless we are at the edge of research, it is possible to obtain good practical guidance a priori.
Multi-Layer Perceptron Architecture
The generic architecture of a neural network can be summarized as follows: O represents the neurons of the output layer and V the neurons of the hidden layers. M is the number of layers in the network, n the number of input parameters, and p the number of samples in the dataset. We can simplify the naming of the layers by relabelling them V⁰, V¹, … all the way up to Vᴹ.
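Putting that notation together (this arrangement is mine: I take V⁰ to be the input, wᵐ the weights feeding layer m, and g each layer’s activation), the layers chain like this:

```latex
V^{0} = x \in \mathbb{R}^{n}, \qquad
V^{m}_{i} = g\!\left( \sum_{j} w^{m}_{ij} \, V^{m-1}_{j} \right) \ \text{for } m = 1, \dots, M, \qquad
O = V^{M}
```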
There are two important operations: feed-forward and back-propagation. Feed-forward is the process through which we obtain the network’s output. In the end, the neural network behaves as a function: it takes an input from an n-dimensional space and produces an output in an m-dimensional space (the output is no longer restricted to a single value).
To calculate it, we have to think of each neuron in the network as a simple perceptron with its corresponding activation function (they don’t all have to be the same). If we look at any unit, we already know how to calculate its output: the activation function applied to the dot product of its inputs with its weights. Starting from the bottom and working all the way to the top, we can compute the intermediate results that lead to the final output.
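A minimal NumPy sketch of that feed-forward pass, assuming a made-up 2–3–1 network with logistic hidden units and a linear output (bias terms omitted for brevity):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, activations):
    """Propagate an input through the layers: each layer is a dot product
    followed by that layer's activation function."""
    v = x
    for W, g in zip(weights, activations):
        v = g(W @ v)          # V^m = g(W^m V^{m-1})
    return v                  # output layer O = V^M

# Hypothetical network: 2 inputs, 3 logistic hidden neurons, 1 linear output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
activations = [logistic, lambda z: z]   # linear output, as in ADALINE

print(feed_forward(np.array([0.5, -1.0]), weights, activations))
```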
On the other hand, we still need to define how the neural network gets trained, something analogous to Rosenblatt’s perceptron algorithm. Before, we compared the expected value against the neuron’s output and, when needed, used that information to update the weights.
But how does one adjust the weights of neurons in the hidden layers? What’s the expected output for those? Does it even make sense to talk about an expected value there? In this context, Rumelhart, Hinton and Williams proposed an algorithm based on gradient descent and the chain rule of differentiation [2]. The process of computing how to update the weights during training is called backpropagation, and it is a core concept in modern neural networks.
What is the intuitive idea behind this algorithm? It calculates how much each weight needs to change to influence the output in the direction that brings us closer to what we want.
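In symbols, every weight ends up receiving the classic gradient-descent update below, where η is the learning rate and E the error we want to shrink; what backpropagation contributes, through the chain rule, is an efficient way to compute that derivative even for weights buried in the hidden layers.

```latex
w^{m}_{ij} \;\leftarrow\; w^{m}_{ij} - \eta \, \frac{\partial E}{\partial w^{m}_{ij}}
```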
In the following article, we’ll go through it step by step. A strong recommendation is to watch the following YouTube series from 3Blue1Brown. It is important to read and search across multiple sources, as this is not an easy concept to grasp. See you soon!
Note: this article was crafted without the use of artificial intelligence technologies. A big part of writing these articles is an effort to work on my communication skills.
References
[1]: George Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function. https://web.njit.edu/~usman/courses/cs675_fall18/10.1.1.441.7873.pdf
[2]: David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1986). Learning representations by back-propagating errors. https://www.nature.com/articles/323533a0