PyTorch is a framework for building deep learning models, alongside other frameworks such as TensorFlow. These frameworks are used to design accurate deep learning models in Computer Vision, NLP, and many other domains. In the NLP domain, Transformers have had a huge impact, offering better accuracy and performance than earlier models. They also improve on earlier seq2seq models by handling longer sequences when understanding and generating text.

What are Transformers

The Transformer is an architecture proposed in the research article “Attention Is All You Need”. It uses the self-attention mechanism, which assigns a weight to each feature (word) of a sentence relative to every other word in that sentence. These weights determine how much each word contributes to the representation of the others, so the model can focus on the most relevant parts of the input. Unlike an LSTM, which reads the text one token at a time, the Transformer processes the whole sequence at once, which helps it grasp the complexities of natural language and generate its answer accordingly.
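
As a minimal sketch of the idea, the snippet below computes attention weights for a toy sentence of three words. The tensor values, sizes, and variable names are illustrative assumptions, and the input is reused directly as queries, keys, and values instead of learning separate projections as a real model would:

```python
import torch

torch.manual_seed(0)

# Toy input: 3 "words", each represented by a 4-dimensional vector (assumed sizes).
words = torch.randn(3, 4)

# For simplicity, queries, keys, and values are the input itself.
queries, keys, values = words, words, words

# Score every word against every other word, scaled by sqrt(vector size).
scores = queries @ keys.T / (keys.shape[-1] ** 0.5)

# Softmax turns each row of scores into weights that sum to 1.
weights = torch.softmax(scores, dim=-1)

# Each output vector is a weighted mix of all value vectors.
output = weights @ values

print(weights.shape)  # torch.Size([3, 3])
print(output.shape)   # torch.Size([3, 4])
```

Each row of weights tells how strongly one word attends to every word in the sentence, including itself.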

How to Create a Transformer in PyTorch

The PyTorch framework provides the torch library to build neural network architectures for deep learning models such as transformers. To do so, import the required library, design the self-attention mechanism, and then build the encoder and decoder on top of it. After that, integrate all the components in a single model and test it on sample input. To learn the process in detail, go through the listed steps:

Step 1: Importing Libraries

First, import the torch library with its neural network (nn) module. The nn module contains the layers and building blocks used to build deep learning models in Python:

import torch
import torch.nn as nn

Step 2: Designing Self-Attention Mechanism

Now, design the self-attention mechanism, which lets every position of the input attend to every other position:

class SelfAttention(nn.Module):
    def __init__(self, embedding, heads):
        super(SelfAttention, self).__init__()
        self.embedding = embedding #size of each token's embedding vector
        self.heads = heads #number of attention heads
        self.head_dim = embedding // heads #size of the slice each head works on

        assert (
            self.head_dim * heads == embedding
        ), "embedding must be divisible by heads"

        self.values = nn.Linear(embedding, embedding)
        self.keys = nn.Linear(embedding, embedding)
        self.queries = nn.Linear(embedding, embedding)
        self.fc_out = nn.Linear(embedding, embedding)

    def forward(self, values, keys, query, mask):
        #batch size (number of sequences processed together)
        N = query.shape[0]
        #sequence lengths of the three inputs
        value_len = values.shape[1]
        key_len = keys.shape[1]
        query_len = query.shape[1]

        values = self.values(values) #(N, value_len, embedding)
        keys = self.keys(keys) #(N, key_len, embedding)
        queries = self.queries(query) #(N, query_len, embedding)

        #split each embedding into `heads` pieces of size head_dim
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)
        #dot product of every query with every key: (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        #masked positions get a large negative score, so softmax gives them ~0 weight
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        #scale the scores and normalize them over the key dimension
        #(the paper scales by the square root of head_dim; this code uses embedding)
        attention = torch.softmax(energy / (self.embedding ** (1 / 2)), dim=3)
        #weighted sum of the values, then concatenate the heads again
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )
        #final fully-connected layer
        out = self.fc_out(out)
        return out
  • Create the SelfAttention class as a subclass of nn.Module.
  • Define the constructor to store the embedding size and the number of heads, and compute the dimension of each head from them.
  • Check with assert that the embedding size is divisible by the number of heads, so the embedding can be split evenly.
  • Create the Linear() layers that project the input into values, keys, and queries, plus a final output layer.
  • In forward(), use the einsum() method from the torch library to compute the attention scores between queries and keys, and then the weighted sum of the values.
  • Return the out variable after passing it through the fully-connected output layer.
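
The einsum() call can be hard to read at first. The following sketch, using assumed toy sizes and random tensors, suggests that the "nqhd,nkhd->nhqk" pattern computes the same scores as moving the heads axis forward and applying an ordinary batched matrix multiplication:

```python
import torch

torch.manual_seed(0)

# Assumed toy sizes: batch 2, 5 queries, 7 keys, 4 heads, head dimension 8.
N, q_len, k_len, heads, head_dim = 2, 5, 7, 4, 8
queries = torch.randn(N, q_len, heads, head_dim)
keys = torch.randn(N, k_len, heads, head_dim)

# einsum form: sum over d for every (batch, head, query, key) combination.
energy_einsum = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# Equivalent form: move heads next to the batch axis, then matrix-multiply.
q = queries.permute(0, 2, 1, 3)     # (N, heads, q_len, head_dim)
k = keys.permute(0, 2, 3, 1)        # (N, heads, head_dim, k_len)
energy_matmul = torch.matmul(q, k)  # (N, heads, q_len, k_len)

print(energy_einsum.shape)  # torch.Size([2, 4, 5, 7])
print(torch.allclose(energy_einsum, energy_matmul, atol=1e-4))
```

The einsum form avoids the explicit permute steps, which is why the listing above uses it.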

Step 3: Building the Transformer

Here, build the transformer block, which combines the self-attention mechanism with a feed-forward network. Each of the two sub-layers is followed by a residual (skip) connection, layer normalization, and dropout:

class TransformerBlock(nn.Module):
    def __init__(self, embedding, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embedding, heads) #multi-head self-attention
        self.norm1 = nn.LayerNorm(embedding) #layer normalization after attention
        self.norm2 = nn.LayerNorm(embedding) #layer normalization after feed-forward
        #position-wise feed-forward network: expand, apply ReLU, project back
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding, forward_expansion * embedding),
            nn.ReLU(),
            nn.Linear(forward_expansion * embedding, embedding),
        )

        self.dropout = nn.Dropout(dropout) #randomly zeroes activations during training

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        #residual connection around attention, then normalization and dropout
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        #residual connection around the feed-forward network
        out = self.dropout(self.norm2(forward + x))
        return out
  • In the TransformerBlock class, define the constructor with the multi-head attention layer, two LayerNorm layers, the feed-forward network, and dropout.
  • Call the self-attention mechanism and add its output back to the query through a residual connection.
  • After that, pass the result through the feed-forward network built from two Linear layers with a ReLU activation function between them.
  • Use the Dropout() method to randomly zero some activations during training. It is a regularization technique that reduces overfitting.
  • Finally, the forward() method applies these steps in order and returns an output with the same shape as the query.
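
The residual connection and layer normalization can be tried in isolation. This is a small sketch with assumed toy sizes and random input, not the block itself; it leaves out attention and dropout to show only how the feed-forward network keeps the shape and how LayerNorm standardizes each position's vector:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding, forward_expansion = 8, 4  # assumed toy sizes
x = torch.randn(2, 5, embedding)     # (batch, sequence length, embedding)

norm = nn.LayerNorm(embedding)
feed_forward = nn.Sequential(
    nn.Linear(embedding, forward_expansion * embedding),  # expand 8 -> 32
    nn.ReLU(),
    nn.Linear(forward_expansion * embedding, embedding),  # project 32 -> 8
)

# Residual connection: add the sub-layer output to its input, then normalize.
out = norm(feed_forward(x) + x)

print(out.shape)  # torch.Size([2, 5, 8])
# After LayerNorm, the mean of each position's vector is close to 0.
print(out.mean(dim=-1).abs().max())
```

Because the output has the same shape as the input, several such blocks can be stacked one after another.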

Step 4: Setting the Encoders

Now, design the Encoder, which accepts the input tokens and processes them. The word embedding converts each token index into a vector, and the position embedding adds information about where the token sits in the sequence. The resulting vectors then pass through a stack of transformer blocks:

class Encoder(nn.Module):
    def __init__(
        self, #instance of the Encoder class
        src_vocab_size, #vocabulary size of the source input
        embedding, #size of each embedding vector
        num_layers, #number of transformer blocks in the encoder
        heads, #number of attention heads
        device, #device running the transformer like CPU or GPU
        forward_expansion, #expansion factor of the feed-forward layer
        dropout, #dropout probability
        max_length, #maximum sequence length the model supports
    ):
        super(Encoder, self).__init__()
        self.embedding = embedding
        self.device = device
        self.word_embedding = nn.Embedding(src_vocab_size, embedding)
        self.position_embedding = nn.Embedding(max_length, embedding)
        #stack num_layers transformer blocks
        self.layers = nn.ModuleList(
            [
                TransformerBlock(
                    embedding,
                    heads,
                    dropout=dropout,
                    forward_expansion=forward_expansion,
                )
                for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        #position indices 0..seq_length-1, repeated for every sequence in the batch
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        out = self.dropout(
            (self.word_embedding(x) + self.position_embedding(positions))
        )
        #in the encoder, value, key, and query are all the same tensor
        for layer in self.layers:
            out = layer(out, out, out, mask)

        return out
  • Design the Encoder class and initialize its components in the constructor.
  • Define the arguments that tell the encoder the size of the vocabulary, the embedding size, and the number of layers.
  • Also, pass the number of attention heads and the device on which the model is running.
  • The forward_expansion and dropout arguments are also required to build the transformer blocks inside the encoder.
  • Here, word_embedding maps each token index to a vector, and position_embedding maps each position in the sequence to a vector.
  • Also, define the forward() method in the Encoder class to add both embeddings and pass the result through every layer using the for loop.
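
The way the two embeddings are combined can be sketched on its own. The sizes and token values below are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn

vocab_size, max_length, embedding = 10, 20, 16  # assumed toy sizes
word_embedding = nn.Embedding(vocab_size, embedding)
position_embedding = nn.Embedding(max_length, embedding)

tokens = torch.tensor([[1, 6, 8, 3], [1, 5, 3, 6]])  # (batch=2, seq_length=4)
N, seq_length = tokens.shape

# The same position indices 0..3 are used for every sequence in the batch.
positions = torch.arange(0, seq_length).expand(N, seq_length)

# Each token vector is the sum of "what the token is" and "where it is".
combined = word_embedding(tokens) + position_embedding(positions)

print(positions.tolist())  # [[0, 1, 2, 3], [0, 1, 2, 3]]
print(combined.shape)      # torch.Size([2, 4, 16])
```

Both sequences start with token 1 at position 0, so combined[0, 0] and combined[1, 0] are identical; the same token at a different position would get a different vector.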

Step 5: Setting the Decoders

After setting the encoder, design the decoder, which produces the output sequence while attending to the encoder's output. Start with the decoder block:

class DecoderBlock(nn.Module):
    def __init__(self, embedding, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()
        self.norm = nn.LayerNorm(embedding) #layer normalization after masked attention
        self.attention = SelfAttention(embedding, heads=heads) #masked self-attention over the target
        self.transformer_block = TransformerBlock(
            embedding, heads, dropout, forward_expansion
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        #the target attends only to itself and earlier positions (trg_mask)
        attention = self.attention(x, x, x, trg_mask)
        query = self.dropout(self.norm(attention + x))
        #queries come from the decoder; values and keys come from the encoder
        out = self.transformer_block(value, key, query, src_mask)
        return out
  • Create the DecoderBlock class, which first applies masked self-attention to the target sequence.
  • It then uses the result as the query, with the encoder's output as the values and keys, inside a TransformerBlock.
  • Again use the Dropout() method for regularization after the normalization layer.
  • The forward() method returns one vector for each target position, informed by both the earlier target tokens and the source sequence.
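
The second attention step, where the decoder looks at the encoder's output, can be illustrated with PyTorch's built-in nn.MultiheadAttention layer. This is a sketch that swaps in the built-in layer for the SelfAttention class above; the sizes and random tensors are assumptions, and note that the built-in layer takes its arguments in the order query, key, value:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding, heads = 16, 4  # assumed toy sizes
# Built-in attention layer, used here instead of the article's SelfAttention class.
cross_attention = nn.MultiheadAttention(embedding, heads, batch_first=True)

enc_out = torch.randn(2, 9, embedding)  # encoder output: 9 source positions
query = torch.randn(2, 7, embedding)    # decoder state: 7 target positions

# Queries come from the decoder; keys and values come from the encoder.
out, weights = cross_attention(query, enc_out, enc_out)

print(out.shape)      # torch.Size([2, 7, 16])
print(weights.shape)  # torch.Size([2, 7, 9]), averaged over the heads
```

The output has one vector per target position, while each row of weights spreads over the 9 source positions.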

After that, set the structure of the Decoder using its class and initialize its components using the constructor:

class Decoder(nn.Module): #stack of decoder blocks with embeddings and an output layer
    def __init__(
        self, #instance of the Decoder class
        trg_vocab_size, #vocabulary size of the target input
        embedding, #size of each embedding vector
        num_layers, #number of decoder blocks
        heads, #number of attention heads
        forward_expansion, #expansion factor of the feed-forward layer
        dropout, #dropout probability
        device, #device running the transformer like CPU or GPU
        max_length, #maximum sequence length the model supports
    ):
        super(Decoder, self).__init__()
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embedding)
        self.position_embedding = nn.Embedding(max_length, embedding)
        #stack num_layers decoder blocks
        self.layers = nn.ModuleList(
            [
                DecoderBlock(embedding, heads, forward_expansion, dropout, device)
                for _ in range(num_layers)
            ]
        )
        self.fc_out = nn.Linear(embedding, trg_vocab_size) #maps each vector to vocabulary scores
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout((self.word_embedding(x) + self.position_embedding(positions)))
        #every block attends to the encoder output (values and keys)
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)
        #one score per target-vocabulary entry for each position
        out = self.fc_out(x)
        return out
  • The Decoder takes the same kind of arguments as the Encoder, with the target vocabulary size in place of the source one.
  • It adds the word and position embeddings of the target tokens, just as the encoder does for the source tokens.
  • After that, every DecoderBlock applies masked self-attention to the target and then attends to the encoder's output.
  • Finally, the Linear() layer converts each output vector into a score for every word in the target vocabulary.
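
To see what the final Linear() layer produces, consider the sketch below. The sizes are assumed, and a random tensor stands in for the output of the decoder blocks; the softmax and argmax steps are not part of the Decoder class and are shown only as one common way to read the scores:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding, trg_vocab_size = 16, 10  # assumed toy sizes
fc_out = nn.Linear(embedding, trg_vocab_size)

# Random stand-in for the output of the decoder blocks: (batch, positions, embedding).
decoder_states = torch.randn(2, 7, embedding)

logits = fc_out(decoder_states)        # (2, 7, 10): one score per vocabulary entry
probs = torch.softmax(logits, dim=-1)  # scores turned into probabilities
predicted = logits.argmax(dim=-1)      # (2, 7): highest-scoring token index per position

print(logits.shape)     # torch.Size([2, 7, 10])
print(predicted.shape)  # torch.Size([2, 7])
```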

Step 6: Integrating Components in the Transformer

Now, integrate all the components designed earlier, namely the encoder and the decoder (both built on the self-attention mechanism), into a single model. The Transformer class also builds the masks that hide padding tokens and future target positions:

class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size, #vocabulary size of the source data
        trg_vocab_size, #vocabulary size of the target data
        src_pad_idx, #index of the padding token in the source data
        trg_pad_idx, #index of the padding token in the target data
        embedding=512, #size of each embedding vector
        num_layers=6, #number of encoder and decoder layers
        forward_expansion=4, #assumed default (missing from the original listing)
        heads=8, #assumed default
        dropout=0, #assumed default
        device="cpu", #assumed default
        max_length=100, #assumed default
    ):
        super(Transformer, self).__init__()

        self.encoder = Encoder(
            src_vocab_size, #vocabulary size of the source input
            embedding, #size of each embedding vector
            num_layers, #number of layers used in the encoder
            heads, #number of attention heads
            device, #device running the transformer like CPU or GPU
            forward_expansion, #expansion factor of the feed-forward layer
            dropout, #dropout probability
            max_length, #maximum sequence length the model supports
        )
        self.decoder = Decoder(
            trg_vocab_size, #vocabulary size of the target input
            embedding,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            device,
            max_length,
        )
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        #True for real tokens, False for padding; shape (N, 1, 1, src_len)
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        #lower-triangular mask so each position sees only itself and earlier ones
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src) #padding mask for the source
        trg_mask = self.make_trg_mask(trg) #causal mask for the target
        enc_src = self.encoder(src, src_mask) #encoder output
        out = self.decoder(trg, enc_src, src_mask, trg_mask) #scores per target position
        return out
  • Create the Transformer class with its arguments initialized in the constructor, like the vocabulary sizes of the source and target data.
  • The source and target padding indexes are also stored; the source one is used to build the mask that hides padding tokens from the attention mechanism.
  • After that, set the values of all the variables used to design the self-attention, encoder, and decoder.
  • Define the methods for creating the masks: the source mask hides padding tokens, and the target mask hides future positions from each target position.
  • Finally, define the forward() method to build both masks, run the encoder, and pass its output to the decoder.
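
The two masks can be inspected outside the model. The sketch below uses an assumed padding index of 0 and made-up token values, and it applies the target mask to all-zero scores to show the effect on the attention weights:

```python
import torch

pad_idx = 0  # assumed padding index
src = torch.tensor([[1, 6, 8, 0, 0], [1, 5, 3, 6, 7]])

# Padding mask: True where the token is real, False where it is padding.
src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)  # (N, 1, 1, src_len)

# Causal mask: position i may look at positions 0..i only.
trg_len = 4
trg_mask = torch.tril(torch.ones((trg_len, trg_len)))

# Effect on attention: masked scores receive zero weight after softmax.
scores = torch.zeros(trg_len, trg_len)
weights = torch.softmax(scores.masked_fill(trg_mask == 0, float("-1e20")), dim=-1)

print(src_mask.shape)  # torch.Size([2, 1, 1, 5])
print(trg_mask)
print(weights)
```

In weights, the first row puts all of its weight on position 0, the second row splits it evenly between positions 0 and 1, and so on, so no position can look ahead.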

Step 7: Testing the Transformer

Finally, test the transformer by giving the input datasets using the PyTorch tensors, as mentioned in the following code:

if __name__ == "__main__":
    #source and target batches of token indexes; 0 is the padding index
    x = torch.tensor([[1, 6, 8, 3, 7, 5, 3, 8, 0], [1, 5, 3, 6, 7, 6, 2, 3, 2]])
    target = torch.tensor([[1, 7, 3, 7, 9, 6, 8, 0], [1, 9, 4, 5, 7, 8, 2, 2]])
    #source vocabulary size, target vocabulary size, source padding index, target padding index
    model = Transformer(10, 10, 0, 0)
    out = model(x, target[:, :-1]) #the decoder input omits the last target token
    print(out.shape) #dimensions of the output
  • Create the x variable to store the source token indexes and the target variable to store the expected output token indexes.
  • Initialize the model variable with the Transformer() class and the values of its arguments.
  • Apply the model to the data, store the result in the out variable, and print its shape on the screen.

The output data has the dimensions (2, 7, 10). It means that the batch has two (2) sequences with seven (7) positions each, and every position holds ten (10) scores, one for each entry of the target vocabulary.

Each source sequence has 9 token indexes, and each target sequence has 8. The target[:, :-1] slice removes the last target token, so the decoder receives 7 tokens per sequence and produces one set of scores for each of them. The highest score at each position indicates the token the model predicts next; because the model is untrained here, these predictions are not yet meaningful.
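
The slicing of the target can be checked on its own. In the sketch below, the name next_tokens is a hypothetical addition: it holds the tokens that a training loop would typically compare the predictions against, although this guide does not train the model:

```python
import torch

target = torch.tensor([[1, 7, 3, 7, 9, 6, 8, 0], [1, 9, 4, 5, 7, 8, 2, 2]])

decoder_input = target[:, :-1]  # every token except the last one
next_tokens = target[:, 1:]     # every token except the first one (hypothetical labels)

print(decoder_input.shape)        # torch.Size([2, 7])
print(decoder_input[0].tolist())  # [1, 7, 3, 7, 9, 6, 8]
print(next_tokens[0].tolist())    # [7, 3, 7, 9, 6, 8, 0]
```

At every position, the decoder input and the shifted sequence line up so that the token to predict is always the one that follows.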

That’s all about how to build the transformers using the neural network architecture with the PyTorch framework.


To create the transformer in PyTorch, get the neural network dependencies from the torch library. Using these dependencies, build the components of the transformer like self-attention, encoder, and decoder. After that, integrate all the components in the Transformer class to set the flow of the model’s architecture. Finally, give the source and target data to the model and check the shape of the scores it returns for each target position.