Counting Number of Parameters in Feed-Forward Deep Neural Network | Keras

Sanpreet Singh
9 min read · Jan 26, 2020

Introduction

This post shows, in practical terms, how to calculate the number of parameters in a feed-forward deep neural network built with the Keras API. I will walk through creating a simple Keras model for binary classification: first the dataset, then how to build the model, and finally how to count both the trainable and non-trainable parameters. Along the way there is some discussion of the number of nodes in the hidden layers, the activation functions used in the code, and why those particular activation functions were chosen. I will also give readers a taste of setting weights and of early stopping. Weights and biases, the core of deep learning models, are discussed in their own section. In a nutshell, I will try to keep things simple and understandable. Before proceeding further, I encourage readers to look at the posts below:

Simple way to save and load model in pytorch

Loading saved keras model and continue training from last epoch

Understanding NumPy arrays in Simple Way

What is a webhook

There will also be some discussion of optimizers and the loss function, which leads naturally to gradient descent and backpropagation.

Understanding Dataset

The dataset has been taken from Dataweekend.

Let us have a look at the different features of the dataset along with the labels.

Input Features or X or Predictor Variables: The input features are variance, skewness, curtosis and entropy.

Dataset banknotes.csv

Target Variable / Output or y: It contains two values, 0 and 1.

This is a supervised classification problem: the dataset has a target, and that target is binary in nature. Now let us build the code using Keras with the TensorFlow backend.

Writing code using Keras with TensorFlow as backend

Let us go through the code step by step.

Importing Libraries

Let us understand the above code using the numbers marked in the circles in the image.

  1. This line imports pandas. Pandas is used to read the dataset, which is a .csv file, and to separate it into the predictor variables (the input data, or features) and the target variable (the labels, or output).
  2. NumPy is a very important library because a deep learning model needs tensors as input, and NumPy is used to build those tensors from the input data. A simple rule of thumb: a list of scalars is a vector, a list of vectors is a matrix, and a list of matrices is a tensor. NumPy builds all of these. The library can do much more, but I will stick to the operations used in this tutorial.
  3. This block contains several imports, such as keras.backend as K, which is used to clear the session when you want to run the model iteratively. Matplotlib is imported because graphs of accuracy and loss against epochs will be drawn during training. From keras.models the Sequential model is imported, and from keras.layers the Activation and Dense layers are imported. In this code, the RMSprop optimizer is used to minimize the cost function (loss).
  4. sklearn is the library used to normalize and preprocess the data as well as to split it into training and testing parts. In this particular case it is used to split the input features and the target into a training set and a testing set. A sketch of these imports is shown below.
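Since the original listing appears as an image, here is a minimal sketch of the imports described in points 1 to 4 (module paths follow the standalone Keras API of that time; the exact aliases are my assumptions):

    import pandas as pd                                    # 1: read the .csv dataset
    import numpy as np                                     # 2: build arrays/tensors from the data

    from keras import backend as K                         # 3: clear session between runs
    from keras.models import Sequential                    # 3: sequential model container
    from keras.layers import Dense, Activation             # 3: fully connected layers
    from keras.optimizers import RMSprop                   # 3: optimizer used to minimize the loss
    import matplotlib.pyplot as plt                        # 3: accuracy/loss curves

    from sklearn.model_selection import train_test_split   # 4: train/test split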
Reading Dataset from csv file

5. As mentioned earlier, pandas is used to read the data from the csv file. This is done with the read_csv function of pandas.

6. X holds the input features variance, skewness, curtosis and entropy. These are stored in X using the loc accessor of the dataframe, which slices the data based on column names. In the same way, 7 indicates slicing the target out of the dataset using the same accessor.

Note: variance, skewness, curtosis, entropy and class are the names of the dataframe columns; the first four are the features and the last is the target. Since the dataset has targets, this is a supervised problem.
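A minimal sketch of steps 5 to 7, assuming the file is named banknotes.csv and the columns are named as in the note above (check df.columns if your copy of the file spells them differently):

    df = pd.read_csv('banknotes.csv')                        # step 5: read the dataset

    # step 6: slice the four feature columns into X
    X = df.loc[:, ['variance', 'skewness', 'curtosis', 'entropy']]

    # step 7: slice the target column into y
    y = df.loc[:, 'class']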

Train Test Split using Sklearn

8. Up to 7 we have successfully separated the input from the target. But if you look carefully, the type of the input and target is not a NumPy array. To train a model we need the data as arrays, so both X and y are converted using the .values attribute. The input data and target are then split into training and testing parts using train_test_split from sklearn. I have set the testing size to 30 % of the dataset and the training size to 70 %. random_state simply sets a seed for the random generator, so that your train-test splits are always deterministic; if you don't set a seed, the split is different each time. A sketch of this step follows below.
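A sketch of step 8, assuming the pieces from the previous snippet; the seed value is my own choice, not taken from the original code:

    # convert dataframes to NumPy arrays and split 70 % / 30 %
    X = X.values
    y = y.values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)   # any fixed seed makes the split reproducible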

Keras Sequential Model Architecture

9. This line clears the session when the model is run for several iterations. Why use it? Suppose one needs to test which batch size, number of epochs, learning rate or optimizer fits the model best. That means running the model many times with different values of these parameters, and to get reliable results one needs to make sure no state from previous runs is still cached. That is what this step achieves.

10. Keras provides two kinds of model: Sequential and Functional. In this case I have used the Sequential model. Keras can use either TensorFlow or Theano as its backend; in this code, and with the Keras version I am using, TensorFlow is the backend.

11. Since this dataset has 4 features in the predictor variables, i.e. there are 4 columns in the input data (excluding the target column), the input dimension is 4. I have used 5 neurons/nodes in the first hidden layer; the parameter calculation follows in the next section. One can see that the activation function used is ReLU. The question could arise: why not softmax, sigmoid or tanh? To answer it, look at the definition of ReLU (Rectified Linear Unit): it outputs zero for negative values and passes positive values through unchanged, up to infinity. So when the outputs of these 5 neurons in the first hidden layer are passed through this activation, every positive value is retained, whereas sigmoid and tanh squash their inputs into fixed ranges, 0 to 1 for sigmoid and -1 to 1 for tanh. Hence ReLU is the better choice here. Softmax is related to probability and is used where one needs the probability of each class for a given input. Dense means a fully connected network: each node in this first hidden layer is connected to every node in the previous layer, which is the input layer. Each connection carries a weight and each neuron has a bias, and these weights and biases are the parameters passed along from layer to layer. Each neuron in the hidden layer multiplies its inputs by the weights, adds the bias, and sends the result to the activation function, which zeroes out any negative results (ReLU).

12. Now comes the second hidden layer. It has two neurons, and all 5 neurons in the previous hidden layer are connected to both of them. Since this is also a Dense layer, it is fully connected. One can say the previous layer acts as the input to this layer, which produces its output by multiplying that input by the weights and adding the bias. The outputs of these two nodes are passed through the ReLU activation, so again we only get positive values from 0 to infinity.

13. Since this is a binary classification problem, one output neuron is enough, because the output is either 0 or 1. A sigmoid on this single neuron returns the probability of the positive class, and the predicted class is the one with the higher probability; softmax plays the same role when there are multiple output neurons, one per class. This is also a Dense layer, so it receives input from both nodes in the second hidden layer.

This is all about model architecture.
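Putting steps 9 to 13 together, here is a minimal sketch of the architecture described above; the sigmoid on the output layer is my assumption, it being the usual pairing with a single output neuron and binary cross-entropy:

    K.clear_session()                                         # step 9: drop state from previous runs

    model = Sequential()                                      # step 10: sequential model
    model.add(Dense(5, input_dim=4, activation='relu'))       # step 11: 4 inputs -> 5 nodes
    model.add(Dense(2, activation='relu'))                    # step 12: 5 nodes -> 2 nodes
    model.add(Dense(1, activation='sigmoid'))                 # step 13: 2 nodes -> 1 output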

Keras Model compile

14. model.compile compiles the model with binary cross-entropy as the loss, RMSprop (lr=0.01) as the optimizer and accuracy as the metric.
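A sketch of the compile step, matching the settings named above:

    model.compile(loss='binary_crossentropy',
                  optimizer=RMSprop(lr=0.01),   # the argument is called learning_rate in newer Keras versions
                  metrics=['accuracy'])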

Fitting Keras Model

15. The model is fitted with the training data and labels. Two epochs are used to train the model and the validation split is 0.3. verbose=1 shows the loss and accuracy per epoch for both the training data and the validation data.

One can plot graphs of accuracy and loss against epochs for both the training and validation sets. All these values are stored in the h variable, the History object returned by fit.
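A sketch of the fit call and of plotting the stored history; note that the history keys are 'acc'/'val_acc' in older Keras releases, the 'accuracy' spelling is assumed here:

    h = model.fit(X_train, y_train,
                  epochs=2, validation_split=0.3, verbose=1)

    plt.plot(h.history['accuracy'], label='train accuracy')
    plt.plot(h.history['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()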

Evaluating deep learning keras model

16. Using model.evaluate, one gets the accuracy, the metric we care about, on the testing data. The testing data is a held-out part of the overall dataset, and measuring performance on it gives us confidence about how the model will behave on real data. One thing to note here is that the gap between training accuracy and testing (or validation) accuracy should be small: a large gap reveals that the model is overfit, that instead of learning the patterns during training it has memorized the training data, which is not good. If this is the case for you, try to balance the fit; regularization can help.
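A sketch of the evaluation step; evaluate returns the loss followed by the metrics listed at compile time:

    test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
    print('Test accuracy:', test_accuracy)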

I hope readers have gained some useful insights from this section. Now let us move on to the next one: counting the number of trainable parameters of the deep learning Keras model.

Counting Number of Parameters in Feed Forward Deep Neural Network | Keras


Mathematical Calculation for first dense layer parameters

Input layer has 4 features and hence for each feature 1 node is needed.

First hidden layer has 5 nodes

The weight matrix between the input layer and the first hidden layer has rows = number of input features and columns = number of nodes in the first hidden layer

Weight Matrix for input layer and first hidden layer has dimension = 4 x 5

Number of trainable weights for the first layer = 20

Number of biases for first hidden layer = Number of nodes in first hidden layer = 5

Note: Weights and biases are the trainable parameters

Total Trainable parameters for first dense layer = 20+5 = 25

Mathematical Calculation for second dense layer parameters

Number of input nodes for second hidden layer = Number of nodes in first hidden layer = 5

Number of nodes in second hidden layer = 2

Weight matrix dimension = Number of nodes in first hidden layer x Number of nodes in second hidden layer = 5 x 2, so number of weights = 10

Number of biases = Number of nodes in the second hidden layer = 2

Total Trainable parameters for second dense layer = 10+2 = 12

Mathematical Calculation for third dense layer parameters

Number of input nodes for this layer = Number of nodes in the second hidden layer = 2

Number of weights for this layer = 2

Number of biases for this layer = 1

Total trainable parameters for this layer = 3
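These counts can be verified directly from Keras: model.summary() prints the per-layer parameter counts, which should match the arithmetic above (layer names will differ from run to run):

    model.summary()
    # Expected parameter counts per Dense layer:
    #   Dense(5): 4 * 5 + 5 = 25
    #   Dense(2): 5 * 2 + 2 = 12
    #   Dense(1): 2 * 1 + 1 = 3
    # Total params: 40 (all trainable, none non-trainable)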

Conclusion

In a nutshell, this short but effective tutorial has shown how to count the number of trainable parameters in a feed-forward deep learning neural network built with the Keras API. We have also seen where to download the dataset and walked through the code line by line. At the end, the mathematical calculation of the weights and biases for each dense layer was shown, with proper calculations for every step. A humble request to readers: follow me on Quora, where I answer a lot of questions; my Quora link is below. You may also find some of my images on Pinterest. And please feel free to ask your questions in the comment section.

Sanpreet Quora

Sanpreet Pinterest

Originally published at http://ersanpreet.wordpress.com on January 26, 2020.
