Gen-AI Developer Classroom notes 04/Jan/2026

Transformer

  • Architecture
  • Transformers have encoders and decoders
    • encoders understand the input
    • decoders generate the output using that understanding
  • Encoder
    • Reads the entire input sequence at once
    • builds context-aware representations
    • outputs a set of vectors (one per input token)
    • Example:
      • input: The capital of India is New Delhi
      • output: context vectors such as [v(“capital”), v(“India”), v(“New Delhi”), …]
    • These vectors encode
      • meaning
      • relationships
      • grammar
      • facts present in the input
  • Decoder:
    • Generates tokens one by one, using the previously generated tokens and the encoder's output (if present)
    • Example
      • Input (encoder): I love cats
      • Output (decoder): Amo a los gatos
  • Decoder-only models
    • GPT
    • Meta Llama
    • Mistral
  • Encoder only models
    • BERT
  • NLP
    • classification/embeddings => Encoder-only
    • Translation/summarization => Encoder-Decoder
    • Chat/Code/agents => Decoder-only (a small code sketch follows this list)
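  • A minimal sketch of this task-to-architecture mapping, assuming the Hugging Face transformers pipeline API introduced later in these notes; the checkpoint names bert-base-uncased and gpt2 are just common public models chosen for illustration
from transformers import pipeline

# Encoder-only (BERT): turn text into context-aware vectors (embeddings)
embedder = pipeline("feature-extraction", model="bert-base-uncased")
vectors = embedder("The capital of India is New Delhi")
print(len(vectors[0]))  # one vector per input token (including special tokens)

# Decoder-only (GPT-2): generate the next tokens one by one
generator = pipeline("text-generation", model="gpt2")
print(generator("I love", max_new_tokens=5))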

Decoder

  • Goal: A decoder-only transformer does one core job: given the previous tokens, compute probabilities for the next token
  • Phases:

    1. Tokenization: the tokenizer splits the text into tokens
       • Example tokens (conceptual): ["I", "love", " cats"]
       • The tokens are then converted into token IDs => [101, 3456, 987]
    2. Embedding lookup: each token ID is mapped to a vector
       • We get vectors E(I), E(love), E( cats)
       • Stacked together they form a matrix X = [E1; E2; E3]
    3. Add positional information: transformers need positional embeddings because attention by itself has no notion of token order
       • Xpos = X + P
    4. Pass through the Transformer layer (repeated N times); phases 1-3 are sketched in code right after this list
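  • A minimal code sketch of phases 1-3, assuming GPT-2 as a concrete decoder-only model; the attribute names transformer.wte and transformer.wpe are specific to the GPT-2 implementation in the transformers library
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Phase 1: tokenization, text -> tokens -> token IDs
ids = tokenizer("I love cats", return_tensors="pt").input_ids   # shape [1, seq_len]
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))

# Phase 2: embedding lookup, token IDs -> matrix X of vectors
X = model.transformer.wte(ids)                # token embeddings

# Phase 3: add positional information, Xpos = X + P
positions = torch.arange(ids.shape[1])
P = model.transformer.wpe(positions)          # positional embeddings
Xpos = X + P
print(Xpos.shape)                             # [1, seq_len, hidden_size]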

    One Transformer Layer

    1. LayerNorm (stabilization): this keeps the values in a good range and makes training stable, H = LayerNorm(Xpos)
    2. Create Q, K, V for attention: for each token vector, the model computes
       • Q (Query) => what this position is looking for
       • K (Key) => what this position offers
       • V (Value) => the information content to pass along
    3. Masked self-attention (no looking ahead)
       • attention scores (relevance)
       • weighted sum of values
    4. Multi-head attention: instead of doing one attention, the layer runs many heads, and each head learns a different relation
       • syntax head
       • semantic head
       • coreference head
    5. Residual connection: add back the original input
    6. LayerNorm again
    7. Feed-forward network: this transforms each token independently
    8. Residual connection again

The result is the final layer's output, which is passed through a softmax to produce next-token probabilities.
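A minimal, self-contained sketch of one decoder layer in PyTorch; the dimensions and module names below are illustrative assumptions (real models add dropout, use trained weights, and stack this block N times):
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads, vocab_size, seq_len = 64, 4, 1000, 3

ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
lm_head = nn.Linear(d_model, vocab_size)

Xpos = torch.randn(1, seq_len, d_model)       # output of the earlier phases

# LayerNorm for stability
H = ln1(Xpos)

# Masked multi-head self-attention: Q, K, V all come from the same tokens,
# and the causal mask stops every position from attending to future tokens
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn_out, _ = attn(H, H, H, attn_mask=causal_mask)

# Residual connection: add back the original input
x = Xpos + attn_out

# LayerNorm again, feed-forward network (applied per token), residual
x = x + ffn(ln2(x))

# Final step: project the last token's vector to vocabulary logits,
# then softmax turns the logits into next-token probabilities
probs = F.softmax(lm_head(x[:, -1, :]), dim=-1)
print(probs.shape)                            # [1, vocab_size]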

Hugging Face

  • Refer Here
  • Hugging Face is a platform designed to make using ML models as easy as possible

Setup the environment

  • Git
  • Visual studio code
  • Python
  • uv

A simple Hugging Face transformer

  • Create a new directory hello-transformer and cd into it
mkdir hello-transformer
cd hello-transformer
  • initialize the project using uv and add the dependencies
uv init
uv add transformers datasets accelerate evaluate tokenizers sentencepiece sacremoses torch
code .
  • Let's get started with an example of a classifier that determines the sentiment of text
  • In main.py, copy the following code
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# response = classifier("I like Large Language models very much")
# print(response)

response = classifier("I got irritated standing in line at the airport")
print(response)
  • run the code using uv run main.py
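  • If the model downloads successfully, the pipeline should print a list with one dictionary per input text, each containing a label (POSITIVE or NEGATIVE for this model) and a confidence score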
