PyTorch allows the user to build neural networks and evaluate their performance using different loss functions such as MAE, MSE, and KL divergence. The Kullback-Leibler (KL) divergence loss measures how far the probability distribution predicted by the model is from the target distribution. The torch library offers the KLDivLoss() function to compute this loss for the predictions of a neural network.
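
For intuition, the KL divergence between two discrete distributions p and q is the sum of p_i * log(p_i / q_i). The following is a minimal sketch in plain Python; the function name and the two example distributions are made up for illustration:

```python
import math

def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

target = [0.7, 0.2, 0.1]      # hypothetical target distribution
predicted = [0.5, 0.3, 0.2]   # hypothetical model output

print(kl_divergence(target, predicted))
print(kl_divergence(predicted, target))  # differs: KL is not symmetric
print(kl_divergence(target, target))     # 0.0 when the distributions match
```

The value is zero only when both distributions are identical, which is why a smaller KL loss indicates predictions that are closer to the targets.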

How to Calculate KL Divergence Loss of Neural Networks in PyTorch

Calculating the KL divergence loss requires a deep learning model, such as a Keras Sequential model or a PyTorch MLP. Train the model on the training data and test it on unseen data, so that the evaluation is not biased toward samples the model has already seen. Once the model has made its predictions, check its performance using the KL divergence loss function offered by the library in use.

Start with the process of building the sequential model and tracking its loss value during training:

Model 1: Using Sequential Neural Network

The Sequential model in Keras is a linear stack of layers, where the output of each layer is passed as the input to the next one. It suits architectures in which every layer has exactly one input and one output. To learn how to build the sequential model and calculate its performance, go through the listed steps:

Step 1: Access Python Notebook

First, open a Python development environment such as Jupyter or Google Colab Notebook. This guide uses a Colab notebook, which can be created from the official Google Colaboratory page.

Step 2: Import Libraries

This step is to import the required libraries to build and train the sequential model and plot the results at the end:

import torch
from keras.layers import Dense
from sklearn.datasets import make_circles
from keras.models import Sequential
from matplotlib import pyplot
from numpy import where
  • The torch library is used to build deep learning models; it is needed for the PyTorch model later in this guide.
  • Next, Keras is a Google-built API that provides the Sequential model and the Dense layers used to build the network.
  • Now, scikit-learn or sklearn is a Python machine learning library; here its make_circles() function generates the dataset.
  • The Matplotlib library allows the user to build graphical representations, such as the plot of the dataset and the loss curves.
  • Finally, NumPy is a well-known Python library for working with arrays; here its where() function selects the samples of each class.

Step 3: Building the Dataset

Now, build a dataset with two classes for the model to learn, and then plot it on the screen using the pyplot library:

# generate 1000 two-dimensional samples arranged in two noisy circles
a, b = make_circles(n_samples=1000, noise=0.1, random_state=1)
# plot the samples of each class (0 and 1) in a separate color
for i in range(2):
    samples_ix = where(b == i)
    pyplot.scatter(a[samples_ix, 0], a[samples_ix, 1])
pyplot.show()
  • Call the make_circles() method to generate 1000 samples with 0.1 noise. The coordinates are stored in a and the class labels in b, and random_state makes the data reproducible.
  • Now, apply the where() method inside the for loop to select the indices of the samples that belong to each class.
  • Then, call the scatter() method of the pyplot library to draw each class in its own color.
  • To display the circles in the graph, use the show() method of the pyplot library.

The color scheme shows that the data points belong to two classes. The variable a stores the coordinates of the points and b stores their class labels.
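
To confirm what the two variables hold, their shapes and the class counts can be printed. This is a small sketch that regenerates the same dataset; the use of Counter is only for illustration:

```python
from collections import Counter
from sklearn.datasets import make_circles

a, b = make_circles(n_samples=1000, noise=0.1, random_state=1)

print(a.shape)   # one row per sample, two coordinates per row
print(b.shape)   # one label per sample
print(Counter(b.tolist()))  # number of samples in each class
```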

Step 4: Splitting the Dataset into Testing and Training Samples

Here, split the data stored in both variables (a and b) into training and testing data:

n_test = 500
traina = a[:n_test]
testa = a[n_test:]
trainb = b[:n_test]
testb = b[n_test:]
  • Start splitting the data by creating a variable containing the split index 500 for the 1000 data samples.
  • Both "a" and "b" are sliced at this index, so the first 500 samples form the training data and the remaining 500 form the testing data.
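
Slicing by index works here because make_circles() shuffles the samples by default. As an alternative sketch that is not used in the rest of this guide, the train_test_split() function from scikit-learn can shuffle and split in one call; the random_state and stratify values below are assumptions chosen for the example:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

a, b = make_circles(n_samples=1000, noise=0.1, random_state=1)

# Shuffle and split 50/50; stratify=b keeps both classes equally represented.
traina, testa, trainb, testb = train_test_split(
    a, b, test_size=0.5, random_state=1, stratify=b
)

print(traina.shape, testa.shape)
print(int(trainb.sum()), int(testb.sum()))  # class-1 samples in each split
```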

Note: Normalization of Data

After building and splitting the dataset, there is usually a pre-processing phase to normalize it. Normalization rescales the features to a common range so that no feature dominates the training because of its scale. Pre-processing can also bring all the data points into a consistent format or structure. This guide does not perform this step, as the dataset is self-generated and already matches the requirements of the model.
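
For a dataset that does need it, normalization could look like the following sketch. It assumes the StandardScaler class from scikit-learn, which this guide does not otherwise use, and applies it to the same circles data purely as an illustration:

```python
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

a, b = make_circles(n_samples=1000, noise=0.1, random_state=1)
traina, testa = a[:500], a[500:]

scaler = StandardScaler()
traina_scaled = scaler.fit_transform(traina)  # learn mean/std from training data
testa_scaled = scaler.transform(testa)        # reuse the training statistics

print(traina_scaled.mean(axis=0))  # close to 0 for each feature
print(traina_scaled.std(axis=0))   # close to 1 for each feature
```

Fitting the scaler on the training data only keeps information from the testing data out of the training process.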

Step 5: Building Sequential Model

Now, initialize the model variable with the Sequential() method and then add a hidden Dense layer with 100 units. The ReLU activation function is used here to introduce nonlinearity into the model. After that, add an output layer with a single unit and the sigmoid activation function, which produces an S-shaped curve between 0 and 1 to predict the class of the final output:

model = Sequential()
model.add(Dense(100, input_shape=(2,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Configure the loss, optimizer, and metrics for training the sequential model using the compile() method:

model.compile(loss='kl_divergence', optimizer='adam', metrics=['accuracy'])
  • Call the compile() method with the loss, optimizer, and metrics arguments.
  • Pass kl_divergence as the loss, adam as the optimizer, and accuracy as the metric.
  • The Adam optimizer adapts the step size for each parameter during training, unlike plain gradient descent, which applies one fixed learning rate.
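
To see what the kl_divergence loss measures for this model, the following NumPy sketch reproduces the formula given in the Keras documentation, sum(y_true * log(y_true / y_pred)). It is an illustration rather than the Keras implementation; the clipping constant eps and the sample values are assumptions:

```python
import numpy as np

def kl_divergence(y_true, y_pred, eps=1e-7):
    """NumPy sketch of sum(y_true * log(y_true / y_pred)) per sample."""
    y_true = np.clip(y_true, eps, 1.0)  # avoid log(0) and division by zero
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.sum(y_true * np.log(y_true / y_pred), axis=-1)

# One sigmoid output per sample, as in the model above (values are made up).
y_true = np.array([[1.0], [1.0], [0.0]])
y_pred = np.array([[0.9], [0.5], [0.9]])

print(kl_divergence(y_true, y_pred))
```

In this sketch, a sample labeled 0 contributes almost nothing to the loss whatever the prediction is, so it is worth reading the accuracy metric alongside the loss value.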

Step 6: Model Fitting

Define the result variable to store the training history returned by the fit() method, which trains the model for 30 epochs and validates it on the testing data after each epoch:

result = model.fit(traina, trainb, validation_data=(testa, testb), epochs=30, verbose=0)
  • Fitting the model means training it on the training samples (traina and trainb) for 30 epochs.
  • The testing samples are passed as validation data to measure the performance of the model on unseen data after each epoch.
  • The verbose argument is set to 0 so that the training runs silently without printing a progress log for each epoch.

Step 7: Model Evaluation

Evaluate the model on the training and testing data and store the accuracy values in the train_acc and test_acc variables:

_, train_acc = model.evaluate(traina, trainb, verbose=1)
_, test_acc = model.evaluate(testa, testb, verbose=1)

With verbose=1, the evaluate() method prints the loss and accuracy for the training and testing data. Here the two loss values are almost the same and close to zero. Similar loss values suggest that the model generalizes to unseen data instead of overfitting the training data, and a KL loss close to zero means that the predictions are close to the targets. Compare the accuracy values as well before judging the model.

Step 8: Displaying Results

Finally, plot the loss of the model over the epochs of the training process. The matplotlib library gives useful insight into the performance using the following code:

pyplot.plot(result.history['loss'], label='train')
pyplot.plot(result.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
  • Use the plot() method to draw the loss and val_loss values stored in the history of the result variable.
  • The label argument of the plot() method names the training and testing lines for better understandability.
  • Use the legend() method to display these labels and the show() method to display the graph.
  • Optionally, call subplot(211) and title('Loss') before the plot() calls to place the graph in the upper half of the figure and give it a title.

The KL loss value approaches zero as the training progresses, which means the predictions are moving closer to the targets. Now, move on to the MultiLayer Perceptron or MLP model built with PyTorch:

Model 2: Using MLP Neural Network

The MultiLayer Perceptron (MLP) model uses the basic architecture of the neural network, in which all the neurons of adjacent layers are fully connected. It uses the feedforward approach, so the input keeps moving forward through multiple layers of neurons. Each layer takes the output of the previous layer as its input, and the last layer, called the output layer, produces the final output.

To learn the process of building and evaluating the MLP model using the KL divergence loss function, simply go through the following steps:

Step 1: Import Libraries

Get started with the implementation of the model by importing the required dependencies from the torch and torchvision libraries mentioned below:

import torch #importing torch to get methods for calculating the KL divergence loss
from torch import nn #importing nn dependency to build neural networks
from torchvision.datasets import FakeData #importing FakeData class to generate the dataset
from torch.utils.data import DataLoader #importing DataLoader class to load the dataset
from torchvision import transforms #importing transforms module to convert the images to tensors
  • The torch library offers multiple modules like nn to create the neural network models.
  • The torchvision library contains datasets, such as FakeData, and image transforms for preparing the input of the model.
  • Get the DataLoader class from torch.utils.data to load the dataset in mini-batches during training.

Step 2: Configuring MLP Model

Define the MLP class that inherits from nn.Module, the base class for neural networks in torch:

class MLP(nn.Module): # definition of MLP class that inherits from nn.Module

  def __init__(self): # using the constructor of the class to set the structure of the model
    super().__init__() # initializing nn.Module before registering the layers
    self.layers = nn.Sequential(
      nn.Flatten(), #convert multi-dimensional data into 1D
      nn.Linear(28 * 28 * 3, 64), #first layer with its dimensions
      nn.ReLU(), #activation function for the first layer
      nn.Linear(64, 32), #second layer with its dimensions
      nn.ReLU(), #activation function for the second layer
      nn.Linear(32, 1), #third layer with its dimensions
      nn.Sigmoid() #activation function for the last layer
    )

  def forward(self, x): #feedforward approach to pass the input through all the layers
    return self.layers(x)
  • Create a constructor of the MLP class to configure the architecture of the neural network model.
  • Use the Flatten() layer to convert each multi-dimensional image into a one-dimensional vector.
  • Apply the Linear(input, output) layer to set the input and output dimensions of each layer, with ReLU() as the activation function.
  • The first layer takes 28 * 28 * 3 input values (one per pixel and color channel) and produces 64 outputs.
  • Add the hidden layer with its dimension and then the output layer with the Sigmoid() activation function.
  • Define the forward() method to apply the feed-forward approach. It is invoked automatically when the model is called on an input.
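
A quick way to check the dimensions described above is to pass a random batch through the same stack of layers. In this sketch the layers are written inline instead of inside the class, and the batch of 4 random images is a stand-in for real data:

```python
import torch
from torch import nn

# Same layer stack as the MLP class above, written inline for brevity.
layers = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28 * 3, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

batch = torch.rand(4, 3, 28, 28)   # 4 random stand-in "images"
with torch.no_grad():              # no gradients needed for a shape check
    flat = layers[0](batch)
    out = layers(batch)

print(flat.shape)  # torch.Size([4, 2352])
print(out.shape)   # torch.Size([4, 1])
```

Each image is flattened into 28 * 28 * 3 = 2352 values, and the sigmoid output is a single value between 0 and 1 per image.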

Step 3: Building the Dataset

Generate the data using the FakeData() class from the torchvision library with its arguments and dimensions:

if __name__ == '__main__':
    torch.manual_seed(42)
    # generate the dataset using the FakeData class from torchvision
    dataset = FakeData(size=15000, image_size=(3, 28, 28), num_classes=2, transform=transforms.ToTensor())
    # load the dataset in shuffled mini-batches using the DataLoader class from torch
    trainloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)
  • Firstly, use the manual_seed() method before creating the dataset to make random operations, such as weight initialization and shuffling, reproducible.
  • Store the data in the dataset variable using the FakeData() class, which generates random images with random labels.
  • The data contains 15k samples with (3, 28, 28) dimensions and 2 classes.
  • Now, the ToTensor() transform converts each image into a tensor with values between 0 and 1.
  • After that, load the data in shuffled mini-batches of 64 samples in the trainloader variable using the DataLoader() class.

Step 4: Using the KLDivLoss() Function

Create the model, the loss function, and the optimizer, and store them in their respective variables to be used for training the model:

mlp = MLP()
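kl = nn.KLDivLoss(reduction='batchmean') #'batchmean' averages the loss over the batch, as the PyTorch documentation recommends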
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)
  • Start by creating the model with MLP() in the mlp variable and the loss with the KLDivLoss() function offered by the PyTorch library.
  • KLDivLoss() expects the predictions as log-probabilities and the targets as probabilities.
  • Now, Adam() is used as the optimizer to apply gradient descent, with the learning rate controlling the size of the step taken in each iteration.
  • The gradient descent technique fine-tunes the parameters of the model after each iteration using the gradients computed by backpropagation.
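
The following sketch shows one way to call KLDivLoss() on a batch of distributions and compares the result with the value computed directly from the definition. The random logits and targets are stand-ins, and the three-class setup is an assumption for the example:

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 3)                        # stand-in model outputs
target = torch.softmax(torch.randn(4, 3), dim=1)  # stand-in target distributions

log_probs = F.log_softmax(logits, dim=1)          # input must be in log-space
kl = nn.KLDivLoss(reduction='batchmean')
loss = kl(log_probs, target)

# The same value computed directly from the definition, averaged over the batch.
manual = (target * (target.log() - log_probs)).sum(dim=1).mean()

print(loss.item(), manual.item())
```

Both numbers should agree up to floating-point error, since reduction='batchmean' sums the loss over the classes and divides it by the batch size.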

Step 5: Model Training

Use the following code, in which a for loop runs three epochs and a nested loop iterates over the mini-batches of each epoch to train the model:

# training the model on the training data so the model can learn the features from the data
for epoch in range(0, 3):
    # printing the epoch number at the start of each epoch and resetting the running loss
    print(f'Starting epoch {epoch+1}')
    current_loss = 0.0
    for i, data in enumerate(trainloader, 0):      # getting the data and labels from the dataset
        inputs, targets = data
        targets = targets \
                    .type(torch.FloatTensor) \
                    .reshape((targets.shape[0], 1))

        optimizer.zero_grad() #clearing the gradients of the previous mini-batch
        outputs = mlp(inputs) #getting the predictions using the mlp variable
        loss = kl(torch.log(outputs), targets) #KLDivLoss expects log-probabilities as the input
        loss.backward() #backpropagation to compute the gradients
        optimizer.step() #updating the parameters of the model
        current_loss += loss.item()
        if i % 10 == 9:          # reporting the average loss after every 10 mini-batches
            print('Loss after mini-batch %5d: %.3f' %
                  (i + 1, current_loss / 10))
            current_loss = 0.0

print('\n Training process has finished')
  • At first, use the nested loops: the outer loop runs over the epochs and the inner loop iterates over the mini-batches in each epoch.
  • The inner loop trains the model by providing the input values and computing the loss value for each mini-batch.
  • The inner loop also uses all the components (optimizer, loss, model) to update the model with each iteration.
  • The if statement prints the loss after every 10 mini-batches to give an overview of the improvement in the model.

Evaluate the performance of the model by looking at the loss value printed for the mini-batches. A loss value near 0 indicates that the predictions of the model are close to the targets.
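
As a variation on the loop above, the same training steps can be written for a model with one output per class, so that both the predictions and the targets are proper distributions. This is a minimal sketch under stated assumptions: the linear model, the random stand-in data, and the learning rate are hypothetical and not part of the MLP example:

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
inputs = torch.randn(64, 10)             # stand-in features
labels = (inputs[:, 0] > 0).long()       # stand-in labels derived from one feature
targets = F.one_hot(labels, num_classes=2).float()  # labels as distributions

model = nn.Linear(10, 2)                 # hypothetical two-output model
kl = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for step in range(50):
    optimizer.zero_grad()                        # clear old gradients
    log_probs = F.log_softmax(model(inputs), dim=1)
    loss = kl(log_probs, targets)                # KL(targets || predictions)
    loss.backward()                              # backpropagation
    optimizer.step()                             # parameter update
    losses.append(loss.item())

print(f'first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}')
```

With one-hot targets, the KL divergence reduces to the cross-entropy between the labels and the predictions, so the loss falls as the predicted probability of the correct class rises.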


To calculate the KL divergence loss of a deep learning model, use the KLDivLoss() function in PyTorch or call the compile() method of a Keras model with the loss argument set to kl_divergence. The user needs to build a Sequential or MLP deep learning model to get the predictions. The loss function then evaluates how close these predictions are to the targets.