Transformer
- Architecture

- Transformers have encoders and decoders
- encoders understand the input
- decoders generate the output using that understanding
- Encoder
- Reads the entire input sequence at once
- builds context-aware representations
- outputs a set of vectors (one per input token)
- Example:
- input:
The capital of India is New Delhi
- output: [ v("capital"), v("India"), v("New Delhi") ] (see the sketch after this list)
- The vectors encode:
- meaning
- relationships
- grammar
- facts present in the input
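- A minimal sketch (not from the notes above) of how an encoder-only model returns one context-aware vector per input token; bert-base-uncased is an illustrative choice, and the transformers and torch packages are installed later in these notes:
from transformers import AutoTokenizer, AutoModel
import torch

# bert-base-uncased is an illustrative encoder-only model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of India is New Delhi", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one context-aware vector per input token: (batch, num_tokens, hidden_size)
print(outputs.last_hidden_state.shape)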
- Decoder:
- Generates tokens one by one, using previously generated tokens and the encoder's output (if present)
- Example
- input (encoder): I love cats
- output (decoder): Amo a los gatos (see the translation sketch below)
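- A hedged sketch of the same idea with the Hugging Face pipeline API; Helsinki-NLP/opus-mt-en-es is one illustrative encoder-decoder translation model, and the exact output wording may differ:
from transformers import pipeline

# encoder-decoder (sequence-to-sequence) translation model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(translator("I love cats"))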
- Decoder-only models
- GPT
- Meta Llama
- Mistral
- Encoder-only models
- BERT
- NLP task => architecture choice (see the sketch after this list)
- classification/embeddings => Encoder-only
- translation/summarization => Encoder-decoder
- chat/code/agents => Decoder-only
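- A minimal sketch of this mapping with Hugging Face pipelines; the model names are illustrative choices, not the only options:
from transformers import pipeline

embedder = pipeline("feature-extraction", model="bert-base-uncased")      # encoder-only
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")   # encoder-decoder
generator = pipeline("text-generation", model="gpt2")                     # decoder-only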
Decoder
- Goal: A decoder-only transformer does one core job: given the previous tokens, compute probabilities for the next token
Phases:
- Tokenization: Tokenizer splits the text into tokens
- Example: tokens (conceptual)
["I", "love", " cats"] -
Then convert them into token ID =>
[101, 3456, 987] -
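- A minimal sketch with a real tokenizer; gpt2 is an illustrative choice, and the exact tokens and IDs differ per model, so the IDs above are only conceptual:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("I love cats")       # subword tokens (a marker may denote the leading space)
ids = tokenizer.convert_tokens_to_ids(tokens)    # token IDs into the model's vocabulary
print(tokens, ids)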
- Embedding lookup: Each token ID is mapped to a vector
- We get vectors
E(I), E(love), E( cats)
- Stacked, we have a matrix
X = [E1; E2; E3]
- Add positional information: Transformers need positional embeddings
Xpos = X + P (see the embedding sketch below)
- Pass through Transformer layers (repeated N times)
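- A minimal PyTorch sketch of the embedding lookup and positional addition; vocab_size, max_len and d_model are illustrative sizes:
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50000, 1024, 768
token_emb = nn.Embedding(vocab_size, d_model)    # E: token ID -> vector
pos_emb = nn.Embedding(max_len, d_model)         # P: position -> vector

ids = torch.tensor([[101, 3456, 987]])           # the conceptual token IDs from above
positions = torch.arange(ids.size(1)).unsqueeze(0)
x_pos = token_emb(ids) + pos_emb(positions)      # Xpos = X + P, shape (1, 3, 768)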
- One Transformer Layer
- LayerNorm (stabilization): This makes training stable and keeps the values in a good range
H = LayerNorm(Xpos)
- Create Q, K, V for attention
- For each token vector, the model computes
- Q (Query) => What this position is looking for
- K (Key) => What this position offers
- V (Value) => The information content to pass along
- Masked self-attention (no looking ahead); see the sketch after this list
- Attention scores (relevance)
- weighted sum of values
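- A minimal sketch of masked self-attention for a single head; the 3-token sequence and d_model=768 are carried over from the illustrative example above:
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
W_q, W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

h = torch.randn(1, 3, d_model)                     # stand-in for H = LayerNorm(Xpos)
Q, K, V = W_q(h), W_k(h), W_v(h)                   # queries, keys, values

scores = Q @ K.transpose(-2, -1) / d_model ** 0.5  # attention scores (relevance)
mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # no looking ahead
attn = F.softmax(scores, dim=-1)
out = attn @ V                                     # weighted sum of values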
- Multi-head attention: Instead of computing attention once, the layer runs several heads in parallel (see the sketch after this list)
- Each head learns a different relation, for example:
- syntax head
- semantic head
- coreference head
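- A minimal sketch using PyTorch's built-in multi-head attention; splitting d_model=768 into 12 heads is an illustrative choice:
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 768, 12, 3
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

h = torch.randn(1, seq_len, d_model)
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, weights = mha(h, h, h, attn_mask=causal)      # each head attends with its own Q/K/V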
- Residual connection: Add back the original input
- LayerNorm again
- Feed-forward network: This transforms each token independently
- Residual
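- Putting the steps above together, a minimal sketch of one pre-norm decoder layer; the sizes are illustrative and real models differ in details:
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)                                     # LayerNorm (stabilization)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)    # masked multi-head self-attention
        x = x + attn_out                                    # residual connection
        x = x + self.ffn(self.ln2(x))                       # LayerNorm again, feed-forward, residual
        return x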
The result after the final layer is projected to vocabulary logits and passed through a softmax to give next-token probabilities
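- An end-to-end sketch with a real decoder-only model (gpt2, chosen for illustration): given the previous tokens, it computes next-token probabilities:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I love", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (batch, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)        # probabilities for the next token
next_id = probs.argmax().item()
print(tokenizer.decode([next_id]))                  # the single most likely next token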
Hugging Face
- Refer Here
- Hugging Face is a platform designed to make using ML models as easy as possible
Set up the environment
- Git
- Visual Studio Code
- Python
- uv
A simple Hugging Face transformer
- Create a new directory hello-transformer and cd into it
mkdir hello-transformer
cd hello-transformer
- Initialize the project with uv and add the required packages
uv init
uv add transformers datasets accelerate evaluate tokenizers sentencepiece sacremoses torch
code .
- Let's get started with an example of a classifier that determines the sentiment of a text
- In main.py, copy the following code
from transformers import pipeline

# Load a sentiment-analysis pipeline backed by a DistilBERT model fine-tuned on SST-2
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# response = classifier("I like Large Language models very much")
# print(response)
response = classifier("I got irritated standing in line at the airport")
print(response)
- Run the code using
uv run main.py
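- The classifier returns a list with one dict per input text; for this model each dict has a label (POSITIVE or NEGATIVE) and a confidence score, roughly of the form [{'label': 'NEGATIVE', 'score': ...}] (the exact score depends on the model version).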

