Gen-AI Developer Classroom notes 22/Feb/2026

How are LLMs Trained?

  • LLMs are trained in three major phases
    • Pretraining
    • Supervised Fine-Tuning (SFT)
    • Alignment (Reinforcement Learning from Human Feedback (RLHF) or Preference Tuning)

Super Simple Version

  • Think of training an LLM like teaching a child to complete sentences.
  • Summary:
    • LLMs are trained by showing them massive amounts of text and repeatedly asking them to guess the next word. Every mistake slightly adjusts the weights until the model becomes very good at predicting language.

Step 1: Show it Massive Amounts of Text

  • The model reads
    • Books
    • Websites
    • Articles
    • Code
    • Conversations
  • Millions to billions of sentences
  • It does not read like a human; it sees text as patterns

Step 2: Play a Guessing Game

  • Guess the next word
  • Example
    • input: The sky is
    • correct answer: blue
  • If the model guesses
    • car -> Wrong
    • blue -> Correct
  • If it guesses wrong, we use backpropagation to adjust its internal numbers slightly
  • This happens trillions of times

Step 3: Adjust the Numbers

  • Inside a model are billions of numbers (weights and biases)
  • When it guesses wrong:
    • Numbers are changed slightly
  • Over time:
    • The model becomes very good at predicting what word usually comes next
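
The guess-and-adjust loop from Steps 2 and 3 can be sketched in a few lines of Python. This is a toy illustration, not how a real LLM is built: the four-word vocabulary, the `scores` list (standing in for the billions of weights), and the learning rate of 0.5 are all made up for the example.

```python
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
vocab = ["blue", "car", "green", "loud"]
# One adjustable number per word, standing in for the model's weights.
scores = [0.0, 0.0, 0.0, 0.0]

def probabilities(scores):
    """Turn raw scores into probabilities that sum to 1 (softmax)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(scores, correct_word, learning_rate=0.5):
    """Nudge the scores so the correct word becomes a bit more likely."""
    probs = probabilities(scores)
    target = vocab.index(correct_word)
    for i in range(len(scores)):
        # Gradient of the cross-entropy loss with respect to each score.
        gradient = probs[i] - (1.0 if i == target else 0.0)
        scores[i] -= learning_rate * gradient

before = probabilities(scores)[vocab.index("blue")]
for _ in range(20):  # "The sky is" -> "blue", shown again and again
    train_step(scores, "blue")
after = probabilities(scores)[vocab.index("blue")]
print(f"P(blue) before: {before:.2f}, after: {after:.2f}")
```

Every wrong-leaning guess moves the numbers only a little, which is why the real process needs so many repetitions.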

Step 4: Teach it to Follow Instructions

  • After basic training, it knows language patterns, but it doesn’t know how to behave like an assistant (ChatGPT, Claude)
  • So we train it with examples like
    • User: Explain gravity simply
    • Assistant: Gravity is a force that pulls objects together.

Step 5: Teach it to be Safe

  • Humans then
    • Compare answers
    • Rank good vs. bad answers
  • The model learns:
    • To be polite
    • To avoid harmful responses
    • To say “I don’t know” when unsure

Technical version

Pre-training (Foundation phase)

  • This is where the model acquires most of its knowledge and capabilities
  • Objective: Learn to predict the next token
Step 1: Collect Massive Data
  • Data sources:
    • Web pages
    • Books
    • Wikipedia
    • Code
    • Academic articles
  • Scale: Trillions of tokens
Step 2: Tokenization
  • Text -> Tokens
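
As a rough illustration of the Text -> Tokens step, here is a word-level tokenizer. It is a simplification: production LLMs typically use subword schemes such as byte-pair encoding, and the tiny corpus, the `<unk>` token, and the function names below are assumptions made for this sketch.

```python
# Hypothetical tiny corpus used only to build a vocabulary.
corpus = "the sky is blue . the sea is blue ."

def build_vocab(text):
    """Assign an integer ID to each distinct whitespace-separated word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for unknown words
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs; unseen words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

def decode(ids, vocab):
    """List of token IDs -> text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab(corpus)
ids = encode("the sky is green", vocab)
print(ids)
print(decode(ids, vocab))
```

The word "green" never appeared in the corpus, so it falls back to the unknown token. Subword tokenizers avoid most of this by splitting rare words into smaller known pieces.
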
Step 3: Forward Pass Through Transformers
  • Tokens -> embeddings
  • Pass through the Transformer layers
  • Get logits (unnormalized scores) for the next token
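
The data flow of the forward pass can be sketched in plain Python. This is heavily simplified and rests on several assumptions: a single causal self-attention layer with no learned query/key/value projections stands in for the full Transformer stack, the embedding table is random rather than learned, and the output projection reuses the embedding table (weight tying). The sizes and names are hypothetical.

```python
import math
import random

random.seed(0)
VOCAB_SIZE, DIM = 6, 4  # hypothetical toy sizes

# Embedding table: one DIM-sized vector per token ID (random here, learned in practice).
embedding = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(vectors):
    """Each position mixes in information from itself and earlier positions only."""
    out = []
    for t, query in enumerate(vectors):
        visible = vectors[: t + 1]  # causal mask: no looking ahead
        weights = softmax([dot(query, key) / math.sqrt(DIM) for key in visible])
        mixed = [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(DIM)]
        # Residual connection: add the attention output to the input vector.
        out.append([x + y for x, y in zip(query, mixed)])
    return out

def forward(token_ids):
    """Token IDs -> embeddings -> one attention layer -> logits for the next token."""
    hidden = [embedding[i] for i in token_ids]
    hidden = causal_self_attention(hidden)
    last = hidden[-1]  # the final position predicts the next token
    # One logit per vocabulary entry (tied output weights, an assumption here).
    return [dot(last, row) for row in embedding]

logits = forward([1, 2, 3])
print(len(logits))
```

The result is one score per vocabulary entry. A softmax over these logits gives the predicted probability of each candidate next token.
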
Step 4: Compute Loss
  • We compare the predicted probabilities with the actual next token
  • Use cross-entropy loss
    • If the correct token has low probability -> high loss
    • If the correct token has high probability -> low loss
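
A minimal sketch of the loss computation for a single position: the loss is the negative log of the probability the model assigned to the correct token. The three-token vocabulary and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_id])

# Hypothetical logits over a 3-token vocabulary; the correct token is ID 0.
confident_right = cross_entropy([5.0, 0.0, 0.0], target_id=0)
confident_wrong = cross_entropy([0.0, 5.0, 0.0], target_id=0)
print(f"correct token likely:   loss = {confident_right:.3f}")
print(f"correct token unlikely: loss = {confident_wrong:.3f}")
```

In training, this per-token loss is averaged over every position in a batch.
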
Step 5: Backpropagation
  • Gradients of the loss are computed by backpropagation, then an optimizer (typically AdamW) uses them to update the weights
  • This repeats for a very large number of training steps, each over a batch of many tokens
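
To make the optimizer step concrete, here is a sketch of the AdamW update rule applied to a single scalar parameter. The toy loss `(w - 3)^2`, the learning rate of 0.1 (far larger than typical LLM settings), and the `state` dictionary are assumptions for the example. Real training applies the same rule to every weight tensor, with gradients supplied by backpropagation.

```python
import math

def adamw_step(param, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single scalar parameter."""
    state["t"] += 1
    # Running averages of the gradient and of the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Adam step plus decoupled weight decay (the "W" in AdamW).
    return param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)

# Hypothetical loss: (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(300):
    grad = 2 * (w - 3)
    w = adamw_step(w, grad, state)
print(f"w after training: {w:.3f}")
```

The parameter settles close to 3, slightly below it because weight decay keeps pulling it toward zero.
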

Conclusion

  • During Pretraining:
    • All weights are updated
    • Billions of parameters adjust gradually
    • Model learns statistical patterns in language

Supervised Fine-Tuning (Instruction Tuning)

  • After pretraining
    • The model knows language
    • But not how to behave as an assistant
  • During supervised fine-tuning, the model learns to
    • Follow instructions
    • Format answers properly
    • Be conversational
Step 1: Create Instruction Dataset
  • Humans write examples like
  • Example 1:
    • User: Explain photosynthesis simply.
    • Assistant: ....
  • Example 2:
    • User: Write a Python function to reverse a string
    • Assistant: ...
Step 2: Train Like Pretraining, but with Pairs
  • The model is shown an input instruction + the expected output
  • It is trained with the same next-token objective, so it learns to produce the expected output for each instruction
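
One way to picture an instruction/response pair as training data is sketched below. It assumes a common (but not universal) practice that these notes do not spell out: the loss is computed only on the response tokens, so the model is not trained to reproduce the prompt. The `User:` / `Assistant:` markers, the `<eos>` token, whitespace tokenization, and the function names are all illustrative choices.

```python
def build_sft_example(instruction, response):
    """Join a prompt and a response, marking which tokens contribute to the loss."""
    prompt_tokens = ["User:"] + instruction.split() + ["Assistant:"]
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignore in the loss (prompt), 1 = train on this token (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_average(per_token_losses, loss_mask):
    """Average the per-token losses over response tokens only."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

tokens, loss_mask = build_sft_example(
    "Explain gravity simply",
    "Gravity is a force that pulls objects together.",
)
for token, keep in zip(tokens, loss_mask):
    print(f"{token:12} {'train' if keep else 'skip'}")
```

Apart from the mask, the objective is the same next-token cross-entropy used in pretraining.
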

Alignment (RLHF)

  • Goal: Make model helpful and safe
  • After
    • Pretraining -> model knows language
    • SFT -> model follows instructions
  • Still, problems remain
    • It may give unsafe answers
    • It may be too verbose
    • It may be rude
  • RLHF:
    • Use human feedback to train the model to produce preferred responses
  • Instead of telling it the exact correct answer, we tell it which answer is better
Step 1: Generate Multiple Responses
  • For one prompt, the model generates multiple responses (Response A, Response B, Response C)
Step 2: Humans Rank Responses
  • Humans rank them from best to worst
  • Example: A > C > B
Step 3: Train a Reward Model
  • Now we train a separate model called a reward model
  • Its job
    • Given prompt + response -> output a score
  • It learns, for example, that humans prefer A over B
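
A sketch of how a ranking can become a training signal for the reward model. It assumes a pairwise (Bradley-Terry style) loss, which is a common choice in RLHF work but is not specified in these notes. The scores are made-up numbers standing in for a reward model's current outputs.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranking, 2))

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(preferred - rejected))."""
    diff = score_preferred - score_rejected
    return math.log(1 + math.exp(-diff))

pairs = ranking_to_pairs(["A", "C", "B"])  # humans ranked A > C > B
# Hypothetical scores a reward model might currently assign.
scores = {"A": 2.0, "B": 1.5, "C": -1.0}
for preferred, rejected in pairs:
    loss = pairwise_loss(scores[preferred], scores[rejected])
    print(f"{preferred} > {rejected}: loss = {loss:.3f}")
```

The pair C > B gets the largest loss here because the hypothetical scores put B above C, which contradicts the human ranking. Training pushes the scores to agree with the ranking.
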
Step 4: Reinforcement Learning
  • Now we use Reinforcement Learning (usually PPO)
    • The main LLM generates an answer
    • The reward model scores it
    • If the score is high -> reinforce the behavior
    • If the score is low -> penalize the behavior
  • Gradually the model shifts towards higher-reward responses
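
The reinforce/penalize loop can be illustrated with a toy policy that picks among three canned responses. Note that this sketch uses a plain policy-gradient (REINFORCE) update with a baseline, not PPO; PPO adds clipping of the update and is usually combined with a penalty for drifting too far from the original model. The reward values, learning rate, and step count are hypothetical.

```python
import math
import random

random.seed(0)
responses = ["A", "B", "C"]
# Hypothetical fixed reward-model scores for each canned response.
reward = {"A": 1.0, "B": -1.0, "C": 0.2}
logits = [0.0, 0.0, 0.0]  # the "policy": preferences over the responses

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, learning_rate=0.1):
    """Sample a response, score it, and shift the policy accordingly."""
    probs = softmax(logits)
    # The policy samples a response; the reward model scores it.
    choice = random.choices(range(len(responses)), weights=probs)[0]
    # Subtract the expected reward as a baseline to reduce variance.
    baseline = sum(p * reward[r] for p, r in zip(probs, responses))
    advantage = reward[responses[choice]] - baseline
    for i in range(len(logits)):
        # Gradient of log pi(choice) with respect to logit i.
        grad_log_prob = (1.0 if i == choice else 0.0) - probs[i]
        logits[i] += learning_rate * advantage * grad_log_prob

for _ in range(2000):
    reinforce_step(logits)
final_probs = dict(zip(responses, softmax(logits)))
print(final_probs)
```

Responses that score above the baseline become more likely and those below it become less likely, so probability mass drifts toward the highest-reward response.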

What is Actually Learned?

  • The model learns
    • Grammar
    • Syntax
    • Semantic Relationships
    • Reasoning patterns
    • Coding patterns
    • Statistical world knowledge
  • But it does not store
    • A database of facts
    • Symbolic rules

Let's look at the datasets used by LLaMA 1 during training

  • Datasets:
    • CommonCrawl
    • C4 (Colossal Clean Crawled Corpus)
    • GitHub
    • Wikipedia
    • Books
    • ArXiv
    • StackExchange
  • Meta also reported the approximate share of each source in the training mix:
    • ~82% web data (CommonCrawl ~67% + C4 ~15%)
    • ~4.5% books
    • ~4.5% code (GitHub)
    • ~4.5% Wikipedia
    • ~2.5% ArXiv
    • ~2% StackExchange
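
A data mix like this can be thought of as sampling weights: each training example is drawn from a source with probability proportional to that source's share. The sketch below uses the per-dataset sampling proportions reported in the LLaMA 1 paper (CommonCrawl 67%, C4 15%, GitHub 4.5%, Wikipedia 4.5%, Books 4.5%, ArXiv 2.5%, StackExchange 2%). The `sample_sources` function is a hypothetical illustration, not Meta's actual data pipeline.

```python
import random
from collections import Counter

# Sampling proportions (percent) as reported for LLaMA 1 pretraining data.
mixture = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "GitHub": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "StackExchange": 2.0,
}

def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

counts = Counter(sample_sources(mixture, n=10_000))
for name, count in counts.most_common():
    print(f"{name:14} {count / 100:.1f}%")
```

With enough draws, the observed shares land close to the configured percentages.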

By continuous learner
