
Simple Guide to Neural Networks and Deep Learning in Python

Introduction

Deep learning is one of the most powerful learning algorithms of the digital era. It has found a place in various industrial applications such as fraud detection in credit approval, automated bank loan approval, and stock price prediction. Some of the more recent uses of neural networks are image recognition and speech recognition.

In fact, you'd be amazed to know that Google incorporates neural networks into its image search and voice applications. Furthermore, the first successful deep learning model for speech recognition, made by Microsoft, is now used in Cortana.

This article provides an introduction to deep learning methods in detail.

The first section of the article presents a detailed introduction to the perceptron model and a Python implementation of the algorithm. Although the perceptron model is a linear classifier with limited applications, it forms the building block of multi-layered neural networks, so it is important to understand it well.

In the second section, a more complex extension of the perceptron model, the deep learning network (also known as a multilayered neural network), is introduced along with its mathematical and algorithmic development. Additionally, an intelligent decision-making task is implemented using deep learning.

Note: This article is best suited for people having a concrete understanding of mathematical concepts such as differentiation, matrix multiplication etc.

What is the Perceptron Algorithm?

The human brain is astonishingly smart and powerful. It is capable of recognizing underlying patterns in the world's massive and noisy information, memorizing those patterns, and generalizing that knowledge to make informed decisions.

The perceptron model is an artificial neural network inspired by biological neural networks and is used to approximate functions that are generally unknown. It is inspired by the behavior of neurons, which convey electrical signals between input (such as from the eyes or nerve endings in the hand), processing, and output from the brain (such as reacting to light, touch, or heat). A biological neuron generates an electrical signal if the strength of the input signal is greater than a threshold value. A biological neural network is a layer of interconnected neurons which receives an external stimulus (such as a sensation of heat); the information propagates all the way to the brain, which in turn generates a response signal that again travels through the neural network.

The perceptron model was Frank Rosenblatt's 1957 attempt to replicate human learning inside computers. Here, the neural computational units are called perceptrons. The model is composed of two layers: an input layer and an output layer. The output layer is composed of perceptron units, and each output unit is connected to all the inputs through a weight. A perceptron computes the weighted summation of the input signals and passes this sum through a function called the "activation function". For a given activation function f(x), if the weighted summation of the input is z, then the output produced by the perceptron will be f(z).

The activation function of a perceptron is the Heaviside step function \(H(z)\):

\(H(z)= \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}\)

\(y_i = H(z_i)\), where \(z_i = \sum_j{W_{ij} x_j}\)

Here, \(z_i\) is the weighted summation of the input vector, \(W_{ij}\) is the weight connecting the ith perceptron to the jth input node, and \(y_i\) is the output of the ith perceptron.

The figure above depicts an input vector with components 1, x1, and x2 connected to a single perceptron unit through weights w0, w1, and w2, respectively. The physical interpretation of a weight is the strength of the connection: a weight with a very high value will prejudice the perceptron's response toward the signal from its corresponding input node, and vice versa. Here, the input node with value 1 is called a bias node. The role and significance of the bias node will be explained later in this section; for now, let's consider bias as a regular input node.

Let's do an analysis with the figure above by setting x2 = 0 and w0, w1, and w2 equal to 0.5, 1, and 3, respectively. The weighted summation then reduces to z = 0.5 + x1, so the response of the perceptron depends entirely on x1; in fact, the perceptron generates a positive signal even for some negative values of x1 (any x1 ≥ -0.5).
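A quick numeric check of this example, as a minimal sketch using the weight values assumed above:

# perceptron output H(w0*1 + w1*x1 + w2*x2) for the example weights
w = [0.5, 1, 3]              # w0 (bias weight), w1, w2

def perceptron(x1, x2):
    z = w[0]*1 + w[1]*x1 + w[2]*x2
    return 1 if z >= 0 else 0

print(perceptron(-0.4, 0))   # 1: fires even though x1 is negative
print(perceptron(-0.6, 0))   # 0: below the -0.5 threshold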

Let's consider the task of character recognition. Identifying a character from its image is a complicated, intelligent task; it's almost impossible to explicitly code a bot to identify different letters. Every time we write a letter, say "a", it differs from all previous times in scale, position, color, and orientation. A more sophisticated and intelligent model is one which itself infers the underlying significance of the character from the training data.

Let's say you are given a bunch of images of hand-written characters "a" and "b" along with their labels: a label "1" corresponds to an image of character "a" and "0" to an image of character "b". Now, using this data, we need to train the perceptron network to classify these images into two different sets, one of character "a" and another of character "b". Basically, any digital image is a matrix of pixel values which determine the color at specific points on the screen. This matrix of pixel values can be converted into an array of pixels, and this array can be used as the input to the network.

We now want to train the model such that whenever an image of character "a" is given it returns 1, and it returns 0 if an image of character "b" is given. For a given single input vector and its output there exist many solutions for the weights, but the weights have to be updated to values which give the desired output for all the examples in the training set. In doing so, the common underlying features embedded in the examples are captured and assigned high weight values, while noisy, distinctive features which don't appear in every example are weighted negligibly. So, after training, when a new image is fed to the model, it will look for the heavily weighted features in the signal and judge its class.

Further, a methodology has to be devised to tune the values of the weights for appropriate decision making. Since the output of a perceptron can be 1 or 0, the error E of the ith perceptron for the mth training example can be +1, -1, or 0.

Since \(E^i_m = t^i_m - y^i_m\)

where \(E^i_m\), \(t^i_m\), and \(y^i_m\) are the error, target value (the desired output), and output value of the ith perceptron for the mth training example. If \(E^i_m=0\), then the weights associated with that perceptron are appropriate and don't need to be changed. But if \(E^i_m=1\), the output produced is 0 when 1 was desired, so the weights associated with that perceptron need to be increased. Each weight is increased by a fraction of the corresponding input node, and similarly, the weights are reduced if \(E^i_m=-1\). The equation below performs the required updates:

\(W_{ij} \leftarrow W_{ij} + \alpha\, E^i_m\, x^m_j \qquad (1)\)

where \(x^m_j\) is the jth component of the mth input vector.

Here, \(\alpha\) determines the learning rate. For higher values of alpha, the model tends to learn faster but has a lower chance of reaching an optimum solution, whereas for lower alpha, the model learns slowly but has a better chance of reaching an optimum solution. Consequently, a moderate learning rate is chosen, typically ranging from 0.1 to 0.8. However, the best strategy to determine alpha is trial and error.

The above strategy to update the weights fails if the values of all the input nodes are 0, because then the second term on the right-hand side of equation (1) is zero. To avoid this complication, biasing is introduced.

Biasing: an extra input node whose value is ±1 is introduced; this node is called the bias node. The bias node ensures that the weights get updated when they need to be, even when all the other input nodes of a vector are 0. Suppose the input (0,0) has target value 0 and the output given by the model is 1; the weights need to be updated, but using (1) it can easily be seen that none of the weights will get updated if there is no biasing.
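A small sketch of why the bias node matters, based on update rule (1); the bias value of -1 matches the implementation below, and the numbers are only illustrative:

alpha = 0.5
x_no_bias = [0, 0]            # input vector (0, 0) without a bias node
x_bias = [-1, 0, 0]           # the same input with a bias node of value -1
error = 0 - 1                 # target 0, model output 1, so E = -1

# update term alpha * E * x_j for each weight (equation (1))
print([alpha*error*xj for xj in x_no_bias])  # both update terms are zero -> no learning
print([alpha*error*xj for xj in x_bias])     # the bias weight still gets updated (by 0.5)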

Algorithmic Development:

  • Initialization
    • set all weights Wij to a random initial value between 0 and 1.
  • Training
    • for T iterations
      • for all m training examples
        • compute the activation of each neuron i using the activation function H(z):

          \(y_i = H\left(\sum_j{W_{ij} x_j}\right)\)

        • update the weights using:

          \(W_{ij} \leftarrow W_{ij} + \alpha\,(t_i - y_i)\,x_j\)

Implementation

Perceptron neural networks can be used for the classification of a logical function. A logical function takes two input parameters, In 1 and In 2. The inputs of the logical function are digital, i.e., the input values can be either 0 or 1. There are different types of logical functions; the following is a classification of the binary OR logical function.

Binary OR: In 1 and In 2 are the components of the input vectors, and In 1 OR In 2 is their corresponding output.

In 1    In 2    In 1 OR In 2
 0       0          0
 0       1          1
 1       0          1
 1       1          1

There are two different types of output of the binary OR function: 1, marked with a dot, and 0, marked with a plus sign in the figure above.

Using the perceptron model, we will try to determine the boundary that separates these two types of output.

Step 1: Import libraries necessary for the project.

import numpy
import random

Step 2: Initialize weights to random values.

W = numpy.array([random.random() for i in range(3)])

Step 3: Define input feature vectors and corresponding targets in arrays X and t, and set the value of bias node to -1.

X = numpy.array([[-1,1,1],[-1,1,0],[-1,0,1],[-1,0,0]])
t = numpy.array([1,1,1,0])

Step 4: Train the model. Here, z is an array containing the weighted summation for each input vector and y is an array containing the perceptron's activation output for each input. The weighted summation can be calculated using numpy's dot method, which performs matrix multiplication.

for i in range(10):
    # weighted summation of each input vector
    z = numpy.dot(X, numpy.transpose(W))

    # activation: Heaviside step implemented via the sign function
    y = 0.5*(numpy.sign(z)+1)

    # weight update with learning rate 0.5
    W += 0.5*numpy.dot(numpy.transpose(X), (t-y))

Output

A linear classification boundary separating two different types of examples will be obtained.
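To verify the result, the trained weights can be run over the inputs once more; a minimal sketch using the variables from the code above:

# predictions of the trained perceptron
z = numpy.dot(X, numpy.transpose(W))
y = 0.5*(numpy.sign(z) + 1)
print("targets:    ", t)
print("predictions:", y)

# the decision boundary is the line where the weighted sum is zero:
# -W[0] + W[1]*In1 + W[2]*In2 = 0
print("boundary: %.2f*In1 + %.2f*In2 = %.2f" % (W[1], W[2], W[0]))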

Since all the operations in the model are linear, the model is limited to linear classification. To extend the model to non-linear classification, further modifications have been introduced. The resulting method is known as the multi-layer perceptron model, or deep learning neural network.

What are Deep Learning Neural Networks?

In contrast to the perceptron network, the activation function of deep learning neural networks is non-linear, enabling them to learn complex and non-linear features of the system. In addition to the input and output layers, a deep learning architecture has a stack of hidden layers between them. Deep learning neural networks are capable of extracting deep features out of the data; hence the name deep learning.

There are several different types of neural networks.

1. Feed-forward neural network: This was the first and arguably the simplest type of artificial neural network devised. In this network, the information moves in only one direction: forward. From the input nodes, data goes through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.

2. Radial basis function (RBF) networks: These are used to perform interpolation in multidimensional space. An RBF is a function which has built into it a distance criterion with respect to a center. RBF networks have two layers of processing: in the first, the input is mapped onto each RBF in the hidden layer. The activation function chosen in an RBF network is a Gaussian function, in contrast to feed-forward networks where the activation function is sigmoidal.

3. Recurrent neural networks (RNNs): These are networks with bi-directional data flow, i.e., information in the network also flows from later processing stages back to earlier stages. RNNs can be used as general sequence processors. The Hopfield network (like similar attractor-based networks) is of historic interest although it is not a general RNN, as it is not designed to process sequences of patterns; instead, it requires stationary inputs. It is an RNN in which all connections are symmetric, and it was invented by John Hopfield in 1982.

This section presents a basic yet powerful form of deep learning neural network: the feed-forward neural network.

The key idea of the model is that each feature of an input vector is connected to all the neurons of the subsequent hidden layer and all the neurons of this layer are further connected to the next hidden layer and so on. The last hidden layer is connected to the output neurons; each connection has a weight associated with it. Output neurons generate an output signal depicting information about the corresponding input signal.

As in the perceptron model, each single processing unit (neuron) computes the weighted summation of the signals arriving from the neurons of the previous layer. The point of demarcation from the perceptron model is that the activation function of a neuron is the sigmoid g(z) instead of the Heaviside step function. The advantage of choosing a sigmoid over the step function is that the sigmoid closely approximates the step function while also being differentiable, which turns out to be an essential requirement for parameter tuning in neural networks.

\(y_i = g(z_i) = \dfrac{1}{1+e^{-z_i}}\), where \(z_i = \sum_j{w_{ij} x_j}\), \(y_i\) is the output of the ith neuron, and \(w_{ij}\) is the weight connecting the ith neuron to the jth input node.
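Since back propagation (described later) needs the derivative of the activation, it is worth noting that the sigmoid has a particularly convenient one: \(g'(z) = g(z)(1-g(z))\). A minimal NumPy sketch of both functions:

import numpy

def sigmoid(z):
    # logistic activation g(z) = 1/(1 + e^(-z))
    return 1.0/(1.0 + numpy.exp(-z))

def sigmoid_derivative(z):
    # g'(z) = g(z)*(1 - g(z)); this form reappears in the back propagation equations
    g = sigmoid(z)
    return g*(1.0 - g)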

Another difference from the perceptron model is that here multiple hidden layers are stacked between the input and output layers, and each of these layers acts as the input for the subsequent layer. As the signal progresses through each layer, more detailed and finer structures are recovered; the addition of hidden layers has enabled the network to penetrate deeply embedded features in the data. This has made learning more comprehensive.

Further, the sigmoid activation adds non-linearity to the model, which enables it to learn non-linear discrepancies between different classes, making it a better and more complex brain for the machines.

Learning in neural networks progresses in a similar fashion as in the perceptron model. First, training examples are fed to the network and the input signal propagates through each layer, producing an output. This mechanism is known as "feed forward." Then, the weights are tuned so that the output produced by the network matches the target values of the corresponding training example. This tuning is done by an algorithm called "back propagation."

Let's consider a network with n hidden layers, and let \(X^m_i\) denote the ith component of the mth training example. In the context of this framework, feed forward and back propagation are illustrated below.

Feed Forward:

\(Y^{L+1}_i = g\left(z^{L+1}_i\right)\), where \(z^{L+1}_i = \sum_j{W^L_{ij} Y^L_j}\) and \(Y^0_j = X^m_j\)

where \(Y^L_i\) is the output of the ith neuron in the Lth layer (layer 0 being the input layer) and \(W^L_{ij}\) is the weight connecting the jth node of layer L to the ith neuron of layer L+1.

\(y^m_i = Y^{n+1}_i = g\left(\sum_j{W^n_{ij} Y^n_j}\right)\)

where \(y^m_i\) is the output of the ith output neuron for the mth training example and, since there are n hidden layers in the network, \(Y^n_j\) represents the output of the jth neuron in the last hidden layer.
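A minimal NumPy sketch of the feed-forward pass under these conventions; the layer sizes and random weights are illustrative assumptions, not part of the article:

import numpy

def sigmoid(z):
    return 1.0/(1.0 + numpy.exp(-z))

def feed_forward(x, weights):
    # weights is a list [W0, W1, ..., Wn]; weights[L] has shape
    # (neurons in layer L+1, neurons in layer L), so that
    # z(L+1)_i = sum_j W(L)_ij * Y(L)_j
    activations = [x]                      # Y0 is the input vector itself
    for W in weights:
        z = numpy.dot(W, activations[-1])
        activations.append(sigmoid(z))
    return activations                     # [Y0, Y1, ..., Y(n+1)]

# illustrative 4-5-3 network with random weights
numpy.random.seed(0)
weights = [numpy.random.rand(5, 4), numpy.random.rand(3, 5)]
outputs = feed_forward(numpy.random.rand(4), weights)
print(outputs[-1])                         # network output y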

Back Propagation: The backward propagation of errors, or back propagation, is a common method of training artificial neural networks, used in conjunction with gradient descent optimization. The optimization objective of back propagation is the cost function J.

Gradient Descent is an iterative procedure to optimize an objective. For the optimization problem above, each weight is moved in the direction of steepest descent of the cost:

\(W^L_{ij} \leftarrow W^L_{ij} - \alpha \displaystyle\frac{\partial J}{\partial W^L_{ij}}\)

where \(\alpha\) determines the learning rate and functions as in the perceptron model.

There are two variants of incorporating gradient descent in back propagation, depending on the choice of the optimization objective J:

  • Batch Gradient Descent (BGD):  \(J= \frac{1}{2} \sum_m{(t^m_i - y^m_i)}^2\)
  • Stochastic Gradient Descent (SGD):  \(J= \frac{1}{2} {(t^m_i - y^m_i)}^2\)

Here, \(t^m_i\) is the target value and \(y^m_i\) is the output of the ith output neuron for the mth training example.

Since the cost function in BGD is the sum of squared errors over all training examples, the solution converges to a local minimum, whereas in SGD, since the weights are updated independently for each training example, the solution doesn't converge to a local minimum; instead it oscillates near the optimum regime. However, SGD is preferred over BGD because of its lower time complexity.

To tune the weights to produce desired output, cost J has to be optimized.

\(\displaystyle\frac{\partial J}{\partial W^L_{ij}} = \sum_m{\frac{\partial J}{\partial z^{L+1}_m} \frac{\partial z^{L+1}_m}{\partial W^L_{ij}}}\)          (chain Rule)  (1)

where \(z^{L+1}_i= \sum_j{W^L_{ij}Y^L_j}\); therefore \(Y^{L+1}_i= g(z^{L+1}_i)\)

\(\displaystyle\frac{\partial z^{L+1}_m}{\partial W^L_{ij}}= \begin{cases} Y^L_j & \text{if } m=i \\ 0 & \text{otherwise} \end{cases}\)                        (2)

\(\displaystyle\frac{\partial J}{\partial W^L_{ij}} =  \frac{\partial J}{\partial z^{L+1}_i}Y^L_j \)                           (3)

\(\displaystyle\frac{\partial J}{\partial z^{L+1}_i} = \sum_m{\frac{\partial J}{\partial z^{L+2}_m} \frac{\partial z^{L+2}_m}{\partial z^{L+1}_i}}\)                      (4)

from equation (4)

\(\displaystyle\frac{\partial J}{\partial z^{L+1}_i} = Y^{L+1}_i (1-Y^{L+1}_i) \sum_m{\frac{\partial J}{\partial z^{L+2}_m} W^{L+1}_{mi}}\)            (5)

\(\displaystyle\frac{\partial J}{\partial W^{n}_{ij}} =  \frac{\partial J}{\partial z^{n+1}_i} Y^n_j\)                      (6)

\(\displaystyle\frac{\partial J}{\partial z^{n+1}_i} = \delta Y^{n+1}_i(1-Y^{n+1}_i)\)               (7)

where \(\delta = y^m_i - t^m_i\)

In back propagation, gradient descent starts from the output layer and propagates all the way back to the input layer. The quantity \(\frac{\partial J}{\partial z^{L+1}_i}\) computed at one layer is then used in computing \(\frac{\partial J}{\partial z^{L}_i}\) for the layer before it.

Algorithm:

  • n= number of hidden layers.
  • for T iterations:
    • obtain \(\frac{\partial J}{\partial z^{n+1}_{i}}\) and \(\frac{\partial J}{\partial W^{n}_{ij}}\) using (6) and (7).
    • for k=1 to n
      • using \(\frac{\partial J}{\partial z^{n-k+2}_m}\) calculated in the previous iteration, obtain \(\frac{\partial J}{\partial z^{n-k+1}_i}\) using equation (5): \(\displaystyle\frac{\partial J}{\partial z^{n-k+1}_i} = Y^{n-k+1}_i (1- Y^{n-k+1}_i) \sum_m{\frac{\partial J}{\partial z^{n-k+2}_m} W^{n-k+1}_{mi}}\)
      • from the above result, calculate \(\displaystyle\frac{\partial J}{\partial W^{n-k}_{ij}} = \frac{\partial J}{\partial z^{n-k+1}_i}Y^{n-k}_j\)
    • end for
    • for all \(L \in \{0,1,\dots,n\}\)
      • update \(W^L_{ij} \leftarrow W^L_{ij} - \alpha \frac{\partial J}{\partial W^L_{ij}}\)
  • end for

Iterating the back propagation step over several epochs converges the weights toward an optimum solution.
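To make the algorithm concrete, here is a minimal NumPy sketch of SGD back propagation for a network with a single hidden layer, following equations (3), (5), (6), and (7); the function name, layer size, and initialization are illustrative assumptions:

import numpy

def sigmoid(z):
    return 1.0/(1.0 + numpy.exp(-z))

def train_sgd(X, T, hidden=5, alpha=0.5, epochs=500, seed=0):
    # X: (examples, features) inputs, T: (examples, outputs) one-hot targets
    # W0 connects input -> hidden, W1 connects hidden -> output
    numpy.random.seed(seed)
    W0 = numpy.random.uniform(-0.5, 0.5, (hidden, X.shape[1]))
    W1 = numpy.random.uniform(-0.5, 0.5, (T.shape[1], hidden))

    for _ in range(epochs):
        for x, t in zip(X, T):
            # feed forward
            Y1 = sigmoid(numpy.dot(W0, x))                 # hidden layer output
            y = sigmoid(numpy.dot(W1, Y1))                 # network output

            # back propagation
            dJ_dz2 = (y - t)*y*(1 - y)                     # equation (7)
            dJ_dW1 = numpy.outer(dJ_dz2, Y1)               # equation (6)
            dJ_dz1 = Y1*(1 - Y1)*numpy.dot(W1.T, dJ_dz2)   # equation (5)
            dJ_dW0 = numpy.outer(dJ_dz1, x)                # equation (3)

            # gradient descent update
            W1 -= alpha*dJ_dW1
            W0 -= alpha*dJ_dW0
    return W0, W1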

Implementation:

The following is an implementation of non-linear multiclass classification on the Iris data set, written in Python using the Keras library. The task is to identify the class of an iris flower from its physical dimensions.

Attributes: petal length, petal width, sepal length, sepal width

Classes: Iris is categorized into three differed classes, “Iris-setosa,” “Iris-versicolor,” and “Iris-virginica.”

Data-Source: https://archive.ics.uci.edu/ml/datasets/Iris

Neural Network Architecture: one hidden layer consisting of five neurons.

Step 1: Randomly segregate the data into two parts: a training set with 120 examples and a testing set with 30 examples.
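The split itself is not shown in the article; one simple way to do it, assuming the raw UCI file has been saved as "iris.data" (a hypothetical file name), is sketched below:

import random

# assumption: "iris.data" is the raw file downloaded from the UCI repository
lines = open("iris.data").read().strip().splitlines()
random.shuffle(lines)

# 120 examples for training, 30 for testing
open("Iris_Data.txt", "w").write("\n".join(lines[:120]) + "\n")
open("test.txt", "w").write("\n".join(lines[120:]) + "\n")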

Step 2: Import required libraries.

import numpy
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

Step 3: Load data from the training set.

X= numpy.genfromtxt("Iris_Data.txt",delimiter= ",", usecols=(0,1,2,3))
t= numpy.genfromtxt("Iris_Data.txt",delimiter= ",", usecols= (4), dtype= None)

Here, X is an array of input feature vectors and t is an array containing their corresponding target values. dtype=None tells numpy to infer the data type of each column individually instead of using the default (float).

Step 4: Since the target values t are strings, they have to be assigned numerical labels, and this can be done using scikit-learn's LabelEncoder class.

encode= LabelEncoder()
encode.fit(t)
encoded_t= encode.transform(t)

The transform method assigns a distinct numerical label to each distinct value in t:

Iris-setosa = 0,  Iris-versicolor = 1,  Iris-virginica = 2

Step 5: When modeling multi-class classification problems using neural networks, it is good practice to reshape the output attribute into binary Euclidean basis vectors (one-hot encoding). This can easily be done using the to_categorical method from the np_utils module of the Keras library.

“Iris-setosa”    “Iris-versicolor”    “Iris-virginica”
     1                 0                    0
     0                 1                    0
     0                 0                    1

vec_t= np_utils.to_categorical(encoded_t)

vec_t is an array of the target vectors in binary Euclidean basis (one-hot) format.

Step 6: Define the model of the network using the Sequential class. Since our model is a fully connected dense network, layers are added to the model with the Dense class.

model= Sequential()
model.add(Dense(5, input_dim=4, init= "uniform", activation= "sigmoid"))
model.add(Dense(3, init= "uniform", activation= "sigmoid"))

A dense network consisting of a hidden layer with five nodes and an output layer with three nodes with sigmoid activation and uniform random weights initialization is defined above.

Step 7: Compile and train the model using the compile and fit methods. Set the loss (cost function) to mean squared error (mse) and the optimizer to sgd, and train the model for 500 epochs.

model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
model.fit(X,vec_t,nb_epoch= 500, batch_size= 1)

A training accuracy of about 98% can be achieved.

Step 8: Predict the test data using the predict method.

X_test= numpy.genfromtxt("test.txt",delimiter= ",",usecols= (0,1,2,3))
t_test= numpy.genfromtxt("test.txt", delimiter= ",", usecols= (4), dtype= None)
predictions= model.predict(X_test)

Each prediction quantifies the certainty of the input belonging to a particular class; choose the class with maximum certainty. A prediction accuracy of around 90% is achieved on the test set.
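A minimal sketch of turning the predicted certainties into class labels and measuring accuracy (this step is not shown in the original code; it reuses the encoder fitted in Step 4):

# pick the class with the highest predicted certainty for each example
predicted_labels = numpy.argmax(predictions, axis=1)

# encode the string targets of the test set with the same LabelEncoder
true_labels = encode.transform(t_test)

accuracy = numpy.mean(predicted_labels == true_labels)
print("Test accuracy: %.2f" % accuracy)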

Step 9: The best architecture for prediction can be determined by trial and error.

The above implementation demonstrates the remarkable performance of neural networks. Although neural networks were largely overlooked after their inception, the multilayer extension of the model and recent developments in computational technology have made them one of the best learning algorithms. Neural networks have outperformed other machine learning algorithms on image recognition tasks, with accuracy levels above 95%. Inspired by the collective behavior of different types of biological neurons, many insightful architectures and variations of this model have been developed.

Summary

  1. The perceptron model is a neural network capable of linear classification.
  2. Deep learning neural networks are capable of learning non-linear decision boundaries.
  3. The activation function typically used in a feed-forward neural network is the sigmoid function.
  4. The cost function (the mean squared difference between the desired and predicted values) is optimized to determine appropriate weights.
  5. A gradient-based optimization algorithm called back propagation is used to optimize the cost function.
