
TRANSFORMER PROJECT

This repository contains the basics and an implementation of the Transformer architecture in machine learning.

EXPLANATION

A Transformer is a kind of deep learning model/architecture that transforms one type of input into another type of output. It does not read text from start to end; it soaks it all in at once, in parallel. It learns context, and thus generates meaning, by tracking the relationships in sequential data, just like the words in a sentence.

Transformers rely on a special operation called attention, or self-attention. The attention block is often called the heart of the Transformer. The second key component is a feed-forward neural network, which basically allows the model to store more information.

LLM - An LLM, or Large Language Model, is a sophisticated mathematical model that predicts what will come next. It is trained using the backpropagation algorithm. A huge number of parameters and a large amount of training data are involved, and the main steps are pretraining and Reinforcement Learning with Human Feedback (RLHF). All of this is made possible by special computer chips called GPUs.

During training, the data repeatedly flows through the model over many iterations.

The input text is broken into little pieces, or small chunks, known as tokens. Each token is associated with a vector (numbers that encode that piece). These vectors are then passed through an attention block, in which information is passed back and forth between them to update their values. The attention block figures out which words in the context are relevant to updating the meanings of other words, and how those updates should be made (in vector form). The vectors are then passed through an operation known as a multi-layer perceptron, or feed-forward layer. The network is a repetition of attention and multi-layer perceptron blocks, and all of these operations run in parallel, as in the sketch below.
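Here is a minimal sketch of that data flow. All names and sizes are made up for illustration, and `attention_block` / `mlp_block` are empty placeholders, not a real API:

```python
import numpy as np

np.random.seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2}       # toy tokenizer: word -> token id
d_model = 8                                  # embedding dimension
E = np.random.randn(len(vocab), d_model)     # embedding matrix: token id -> vector

def attention_block(x):
    return x    # placeholder: would pass information between token vectors

def mlp_block(x):
    return x    # placeholder: would transform each vector independently

tokens = [vocab[w] for w in ["the", "cat", "sat"]]   # text -> tokens
x = E[tokens]                                        # tokens -> vectors, shape (3, 8)

for _ in range(4):          # repeated alternation of attention and MLP blocks
    x = attention_block(x)
    x = mlp_block(x)
```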

In deep learning, tunable parameters known as weights are set up, and the only way these parameters interact with the data is through weighted sums. The weights are arranged in matrices, so a deep learning model is essentially a pile of matrices, tuned by backpropagation.
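As a small sketch of that idea (toy numbers chosen for illustration): each output value is a weighted sum of the inputs, and stacking several weighted sums is exactly a matrix-vector product.

```python
import numpy as np

W = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.0, -1.0]])   # weight matrix: 2 outputs, 3 inputs
x = np.array([1.0, 2.0, 3.0])      # input data vector

# out[i] = sum_j W[i, j] * x[j] -- each output is a weighted sum of the inputs
out = W @ x
print(out)                         # [ 7.  -0.5]
```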

  • GPT, or Generative Pre-trained Transformer, uses a specific kind of neural network and converts one type of data into another, like text to voice, text to image, etc. It learns from a massive amount of data (i.e. it is pre-trained), and these bots generate new text by a process of repeated prediction and sampling (see the sketch after this list). Inside, matrix-vector multiplication goes on. The input is broken into small chunks/tokens and turned into vectors via an embedding matrix covering all words. Dot products (multiplying and summing) then turn the final vector into the desired output: a probability distribution over the next token (unembedding).
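A hedged sketch of that prediction-and-sampling loop. The `model` function here is a stand-in returning random scores; in a real GPT those scores would come from the Transformer plus the unembedding step:

```python
import numpy as np

np.random.seed(0)
vocab_size = 5                     # toy vocabulary

def model(tokens):
    # stand-in for the real network: one score (logit) per vocabulary word
    return np.random.randn(vocab_size)

def softmax(z):
    z = z - z.max()                # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

tokens = [0]                                           # some starting prompt
for _ in range(10):
    probs = softmax(model(tokens))                     # scores -> probability distribution
    next_token = np.random.choice(vocab_size, p=probs) # sample the next token
    tokens.append(int(next_token))
print(tokens)
```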

Attention Mechanism

It is the first step of the Transformer. Each token is associated with a high-dimensional vector, called an embedding (words -> vectors). The aim of the Transformer is to adjust the embeddings so that they don't merely encode an individual word but instead carry much richer contextual meaning.

The vectors flow through the network, including the different attention blocks; the computation performed to produce a prediction of the next token is entirely a function of the last vector in the sequence. Attention blocks allow the mechanism to move information from one embedding to another, making the information richer and more meaningful. Each update moves a vector in a more specific direction, one that more specifically encodes the word in its context. Many different heads run in parallel. The initial embedding of each word is a high-dimensional vector that encodes the meaning of that particular word with no context, and it also encodes the word's position.

Parallelizable means that we can run a huge number of computations in a short time using GPUs.

Query vector -> q; a second matrix, the key matrix, produces a key vector k for each token.
To measure how well each key matches each query, we compute a dot product between each possible key-query pair.

We compute a softmax along each of the columns to normalize the values, and the grid is then filled with these softmax values. Each value weights how relevant the word on the left is to the corresponding word at the top; this grid is called the attention pattern.

Attention(Q, K, V) = softmax(QK^T / √dk) V

where K = the keys, V = the values, and Q = the queries (Q and K each represent the full array of query and key vectors); QK^T -> a way to represent the grid of all possible dot products between pairs of keys and queries; √dk -> the square root of the dimension of the key-query space.

We divide by √dk to attain numerical stability.
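A minimal numpy sketch of this exact formula, with arbitrary sizes and no causal masking. (The description above phrases the softmax column-wise in a transposed layout; here each row of QK^T corresponds to one query, so the softmax runs along rows.)

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # grid of all key-query dot products, scaled
    pattern = softmax(scores)         # the attention pattern: each row sums to 1
    return pattern @ V                # weighted sum of the value vectors

np.random.seed(0)
n_tokens, d_k = 4, 8
Q = np.random.randn(n_tokens, d_k)    # queries, one row per token
K = np.random.randn(n_tokens, d_k)    # keys
V = np.random.randn(n_tokens, d_k)    # values
print(attention(Q, K, V).shape)       # (4, 8): one updated vector per token
```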

#Value parameters = #Query parameters + #Key parameters

The size of the attention pattern is equal to the square of the context size; for example, a context of 1,000 tokens gives an attention pattern with 1,000,000 entries.

MLP - Multi-Layer Perceptron

Each individual vector in the sequence goes through a short series of operations, and at the end we get another vector of the same dimension. This output vector is added to the original vector that flowed in, and the sum is the result flowing out.

It is applied to every vector in the sequence, i.e. to the vector associated with every token in the input, and it happens in parallel.

An MLP has three layers:

  1. Input Layer - Where the raw data enters the network.
  2. Hidden Layer - These layers process the data & learn patterns.
  3. Output Layer - Final layer that gives the result.

W·E + B = Intermediate Vector

where W = weight matrix, E = embedding vector, and B = bias vector.

The large intermediate vector is passed through a non-linear function called the Rectified Linear Unit, or ReLU: ReLU(x) = max(0, x), whose graph is zero for negative inputs and the identity for positive inputs.

Its clean on/off behaviour is somewhat like that of an AND gate.
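Putting the MLP pieces above together as a hedged sketch, with made-up sizes; `W_up` and `W_down` are illustrative names for the two weight matrices:

```python
import numpy as np

np.random.seed(0)
d_model, d_hidden = 8, 32                    # toy sizes
W_up = np.random.randn(d_hidden, d_model)    # first weight matrix (W in W·E + B)
B_up = np.random.randn(d_hidden)             # bias vector B
W_down = np.random.randn(d_model, d_hidden)  # maps back to the original dimension
B_down = np.random.randn(d_model)

def relu(x):
    return np.maximum(0.0, x)     # ReLU: zero for negatives, identity for positives

def mlp(E):
    intermediate = W_up @ E + B_up      # W·E + B -> large intermediate vector
    hidden = relu(intermediate)         # non-linear step
    out = W_down @ hidden + B_down      # back down to the original dimension
    return E + out                      # add to the original vector flowing in

E = np.random.randn(d_model)            # one token's vector
print(mlp(E).shape)                     # (8,): same dimension flows out
```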
