Transformer
- Architecture

- Transformers have encoders and decoders
- encoders understand the input
- decoders generate the output using that understanding
- Encoder
- Reads the entire input sequence at once
- builds context-aware representations
- outputs a set of vectors (one per input token)
- Example:
- input:
The capital of India is New Delhi
- output: [ v("capital"), v("India"), v("New Delhi") ] (see the sketch after this list)
- The vectors encode:
- meaning
- relationships
- grammar
- facts present in the input
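- A minimal sketch (not from the notes above) of how an encoder-only model returns one context-aware vector per input token; bert-base-uncased is an illustrative choice, and the transformers and torch packages are installed later in these notes:
from transformers import AutoTokenizer, AutoModel
import torch

# bert-base-uncased is an illustrative encoder-only model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of India is New Delhi", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one context-aware vector per input token: (batch, num_tokens, hidden_size)
print(outputs.last_hidden_state.shape)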
- Decoder:
- Generates tokens one by one, using previously generated tokens and the encoder's output (if present)
- Example
- input (encoder): I love cats
- output (decoder): Amo a los gatos (see the translation sketch below)
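- A hedged sketch of the same idea with the Hugging Face pipeline API; Helsinki-NLP/opus-mt-en-es is one illustrative encoder-decoder translation model, and the exact output wording may differ:
from transformers import pipeline

# encoder-decoder (sequence-to-sequence) translation model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(translator("I love cats"))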
- Decoder-only models
- GPT
- Meta Llama
- Mistral
- Encoder-only models
- BERT
- NLP task => architecture choice (see the sketch after this list)
- classification/embeddings => Encoder-only
- translation/summarization => Encoder-decoder
- chat/code/agents => Decoder-only
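- A minimal sketch of this mapping with Hugging Face pipelines; the model names are illustrative choices, not the only options:
from transformers import pipeline

embedder = pipeline("feature-extraction", model="bert-base-uncased")      # encoder-only
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")   # encoder-decoder
generator = pipeline("text-generation", model="gpt2")                     # decoder-only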
Decoder
- Goal: A decoder-only transformer does one core job: given the previous tokens, compute probabilities for the next token
Phases:
- Tokenization: Tokenizer splits the text into tokens
- Example: tokens (conceptual)
["I", "love", " cats"] -
Then convert them into token ID =>
[101, 3456, 987] -
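- A minimal sketch with a real tokenizer; gpt2 is an illustrative choice, and the exact tokens and IDs differ per model, so the IDs above are only conceptual:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("I love cats")       # subword tokens (a marker may denote the leading space)
ids = tokenizer.convert_tokens_to_ids(tokens)    # token IDs into the model's vocabulary
print(tokens, ids)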
- Embedding lookup: Each token ID is mapped to a vector
- We get vectors
E(I), E(love), E( cats)
- Stacked, we have a matrix
X = [E1; E2; E3]
- Add positional information: Transformers need positional embeddings
Xpos = X + P (see the embedding sketch below)
- Pass through Transformer layers (repeated N times)
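- A minimal PyTorch sketch of the embedding lookup and positional addition; vocab_size, max_len and d_model are illustrative sizes:
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50000, 1024, 768
token_emb = nn.Embedding(vocab_size, d_model)    # E: token ID -> vector
pos_emb = nn.Embedding(max_len, d_model)         # P: position -> vector

ids = torch.tensor([[101, 3456, 987]])           # the conceptual token IDs from above
positions = torch.arange(ids.size(1)).unsqueeze(0)
x_pos = token_emb(ids) + pos_emb(positions)      # Xpos = X + P, shape (1, 3, 768)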
- One Transformer Layer
- LayerNorm (stabilization): This makes training stable and keeps the values in a good range
H = LayerNorm(Xpos)
- Create Q, K, V for attention
- For each token vector, the model computes
- Q (Query) => What this position is looking for
- K (Key) => What this position offers
- V (Value) => The information content to pass along
- Masked self-attention (no looking ahead); see the sketch after this list
- Attention scores (relevance)
- weighted sum of values
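- A minimal sketch of masked self-attention for a single head; the 3-token sequence and d_model=768 are carried over from the illustrative example above:
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
W_q, W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

h = torch.randn(1, 3, d_model)                     # stand-in for H = LayerNorm(Xpos)
Q, K, V = W_q(h), W_k(h), W_v(h)                   # queries, keys, values

scores = Q @ K.transpose(-2, -1) / d_model ** 0.5  # attention scores (relevance)
mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # no looking ahead
attn = F.softmax(scores, dim=-1)
out = attn @ V                                     # weighted sum of values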
- Multi-head attention: Instead of computing attention once, the layer runs several heads in parallel (see the sketch after this list)
- Each head learns a different relation, for example:
- syntax head
- semantic head
- coreference head
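- A minimal sketch using PyTorch's built-in multi-head attention; splitting d_model=768 into 12 heads is an illustrative choice:
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 768, 12, 3
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

h = torch.randn(1, seq_len, d_model)
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, weights = mha(h, h, h, attn_mask=causal)      # each head attends with its own Q/K/V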
- Residual connection: Add back the original input
- LayerNorm again
- Feed-forward network: This transforms each token independently
- Residual
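- Putting the steps above together, a minimal sketch of one pre-norm decoder layer; the sizes are illustrative and real models differ in details:
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)                                     # LayerNorm (stabilization)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)    # masked multi-head self-attention
        x = x + attn_out                                    # residual connection
        x = x + self.ffn(self.ln2(x))                       # LayerNorm again, feed-forward, residual
        return x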
The result after the final layer is projected to vocabulary logits and passed through a softmax to give next-token probabilities
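- An end-to-end sketch with a real decoder-only model (gpt2, chosen for illustration): given the previous tokens, it computes next-token probabilities:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I love", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (batch, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)        # probabilities for the next token
next_id = probs.argmax().item()
print(tokenizer.decode([next_id]))                  # the single most likely next token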
Hugging Face
- Refer Here
- Hugging Face is a platform designed to make using ML models as easy as possible
Set up the environment
- Git
- Visual Studio Code
- Python
- uv
A simple Hugging Face transformer
- Create a new directory hello-transformer and cd into it
mkdir hello-transformer
cd hello-transformer
- Initialize the project with uv and add the required packages
uv init
uv add transformers datasets accelerate evaluate tokenizers sentencepiece sacremoses torch
code .
- Let's get started with an example of a classifier that determines the sentiment of a text
- In main.py, copy the following code
from transformers import pipeline

# Load a sentiment-analysis pipeline backed by a DistilBERT model fine-tuned on SST-2
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# response = classifier("I like Large Language models very much")
# print(response)
response = classifier("I got irritated standing in line at the airport")
print(response)
- Run the code using
uv run main.py
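- The classifier returns a list with one dict per input text; for this model each dict has a label (POSITIVE or NEGATIVE) and a confidence score, roughly of the form [{'label': 'NEGATIVE', 'score': ...}] (the exact score depends on the model version).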

