Gen-AI Developer Classroom notes 22/Feb/2026

How are LLMs Trained?

LLMs are trained in three major phases:
- Pretraining
- Supervised Fine-Tuning (SFT)
- Alignment (Reinforcement Learning from Human Feedback (RLHF) or preference tuning)

Super Simple Version

Think of training an LLM like teaching a child to complete sentences.
Summary:
- LLMs are trained by showing them massive amounts of text and repeatedly asking them to guess the next word. Every mistake slightly adjusts the weights until the model becomes very good at predicting language.

Step 1: Show it Massive Amounts of Text

The model reads
- Books
- Websites
- Articles
- Code
- Conversations
In total, this adds up to billions of sentences.
It does not read like a human; it sees text as statistical patterns.

Step 2: Play a Guessing game

Guess the next word
Example
- input: The sky is
- correct answer: blue
If the model guesses
- car -> wrong
- blue -> correct
If it guesses wrong, we do backpropagation (adjusting the internal numbers slightly).
This happens trillions of times.

Step 3: Adjust the Numbers

Inside a model are billions of numbers (weights and biases)
When it guesses wrong:
- Numbers are changed slightly
Over time:
- The model becomes very good at predicting what word usually comes next
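As a toy illustration of Steps 2 and 3, the sketch below keeps one adjustable number per candidate word and nudges those numbers a little each round until "blue" wins. The three-word vocabulary, the learning rate, and the helper names are assumptions made for this example; a real model has billions of weights and computes these scores from the input text.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: candidate next words for "The sky is"
words = ["car", "blue", "green"]
scores = [0.0, 0.0, 0.0]        # the "internal numbers", all equal at the start
correct = words.index("blue")
learning_rate = 0.5             # how large each small adjustment is

history = []
for step in range(20):
    probs = softmax(scores)
    history.append(probs[correct])
    # Gradient of the cross-entropy loss w.r.t. each score:
    # (prob - 1) for the correct word, prob for every other word.
    for i in range(len(scores)):
        target = 1.0 if i == correct else 0.0
        scores[i] -= learning_rate * (probs[i] - target)

final_probs = softmax(scores)
print("P(blue) before training:", round(history[0], 3))
print("P(blue) after training: ", round(final_probs[correct], 3))
```

Each pass changes the numbers only slightly, but the probability given to "blue" keeps rising from its starting value of one third.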

Step 4: Teach it to follow instructions

After basic training, it knows language patterns, but it doesn’t know how to behave like an assistant (ChatGPT, Claude).
So we train it with examples like:
User: Explain gravity simply
Assistant: Gravity is a force that pulls objects together.

Step 5: Teach it to be Safe

Humans then
- Compare answers
- Rank good vs bad answers
The model learns:
- To be polite
- To avoid harmful responses
- To say “I don't know” when unsure

Technical version

Pre-training (Foundation phase)

This is where the model acquires most of its knowledge and capabilities
Objective: Learn to predict the next token

Step 1: Collect Massive Data

Data sources:
- Web pages
- Books
- Wikipedia
- Code
- Academic articles
Scale: Trillions of tokens

Step 2: Tokenization

Text -> Tokens
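To make the Text -> Tokens step concrete, here is a minimal word-level tokenizer sketch. It is a simplification: production LLMs generally use subword schemes such as byte-pair encoding, and the function names and the `<unk>` token below are assumptions for illustration only.

```python
import re

def split_into_tokens(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus):
    """Assign an integer ID to every distinct token seen in the corpus."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for tokens never seen before
    for text in corpus:
        for token in split_into_tokens(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs (unknown tokens map to <unk>)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in split_into_tokens(text)]

def decode(ids, vocab):
    """List of token IDs -> space-separated tokens."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

corpus = ["The sky is blue.", "The grass is green."]
vocab = build_vocab(corpus)
ids = encode("The sky is green.", vocab)
print(ids)                  # [1, 2, 3, 7, 5]
print(decode(ids, vocab))   # the sky is green .
```

The model never sees the raw text, only the list of integer IDs.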

Step 3: Forward Pass Through Transformers

Tokens -> embeddings
Pass through Transformer layers
Get logits for the next token
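The three lines above can be sketched in plain Python. This is a heavily reduced stand-in, not a real Transformer: it uses a single causal self-attention head with a residual connection and leaves out positional information, multiple heads, feed-forward blocks, and normalization. The sizes, the random weights, and all names are assumptions for illustration.

```python
import math
import random

random.seed(0)

VOCAB_SIZE = 8   # hypothetical tiny vocabulary
D_MODEL = 4      # hypothetical embedding width

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Randomly initialised parameters (a trained model would have learned these).
embedding = random_matrix(VOCAB_SIZE, D_MODEL)
w_q = random_matrix(D_MODEL, D_MODEL)
w_k = random_matrix(D_MODEL, D_MODEL)
w_v = random_matrix(D_MODEL, D_MODEL)
w_out = random_matrix(D_MODEL, VOCAB_SIZE)

def forward(token_ids):
    # 1. Tokens -> embeddings
    x = [embedding[t] for t in token_ids]
    # 2. One causal self-attention layer with a residual connection
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    hidden = []
    for i in range(len(x)):
        # Position i may only attend to positions 0..i (no peeking ahead)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(D_MODEL)
                  for j in range(i + 1)]
        weights = softmax(scores)
        attended = [sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(D_MODEL)]
        hidden.append([xi + ai for xi, ai in zip(x[i], attended)])
    # 3. Hidden states -> logits over the vocabulary, one row per position
    return matmul(hidden, w_out)

logits = forward([1, 2, 3])
next_token_probs = softmax(logits[-1])   # distribution for the token after position 2
print(len(logits), len(logits[0]))       # 3 8
```

The last row of logits is the one used to predict the next token; applying softmax to it gives a probability for every entry in the vocabulary.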

Step 4: Compute Loss

We compare the predicted probabilities with the actual next token
Use cross-entropy loss
- If the correct word has low probability -> high loss
- If the correct word has high probability -> low loss
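A small worked example of the two cases above. For a single prediction, cross-entropy is the negative log of the probability assigned to the correct token; the probability values below are made up for illustration.

```python
import math

def cross_entropy(probs, correct_index):
    """Loss is the negative log of the probability given to the correct token."""
    return -math.log(probs[correct_index])

# Hypothetical predicted distributions over ["car", "blue", "green"]
confident_and_right = [0.05, 0.90, 0.05]
unsure = [0.40, 0.20, 0.40]
correct = 1  # index of "blue"

low_loss = cross_entropy(confident_and_right, correct)
high_loss = cross_entropy(unsure, correct)
print(round(low_loss, 3))   # 0.105
print(round(high_loss, 3))  # 1.609
```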

Step 5: Backpropagation

Gradients are calculated by backpropagation, and then the weights are updated using an optimizer (typically AdamW)
This repeats batch after batch until the model has processed trillions of tokens
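To show what "update weights using an optimizer" means, here is a minimal sketch of one AdamW update written by hand. The hyperparameter values and the toy objective (w - 3)^2 are assumptions chosen for the example; in practice the update is applied by a library optimizer to billions of parameters at once.

```python
import math

def adamw_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update in place to a flat list of parameters."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        # Decoupled weight decay: shrink the weight directly, not via the gradient
        params[i] -= lr * weight_decay * params[i]
        # Running averages of the gradient and the squared gradient
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialised averages
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy problem (an assumption for this example): minimise (w - 3)^2
params = [0.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for _ in range(200):
    grads = [2 * (params[0] - 3.0)]   # derivative of (w - 3)^2
    adamw_step(params, grads, state)
print(round(params[0], 2))
```

The weight ends up close to 3, slightly pulled toward zero by the weight decay term.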

Conclusion

During pretraining:
- All weights are updated
- Billions of parameters adjust gradually
- The model learns statistical patterns in language

Supervised Fine-Tuning (Instruction Tuning)

After pretraining
- Models know language
- But not how to behave as an assistant
During supervised fine-tuning, the model learns to
- Follow instructions
- Format answers properly
- Be conversational

Step 1: Create Instruction Dataset

Humans write examples like
Example 1:
- User: Explain photosynthesis simply.
- Assistant: ....
Example 2:
- User: Write a Python function to reverse a string
- Assistant: ...

Step 2: Train like pretraining – But with Pairs

The model is shown an input instruction together with the expected output
It is trained with the same next-token objective, so it learns to produce the expected output for that instruction
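One common way to set this up (an assumption here, the notes do not specify it) is to join the instruction and the expected output into a single token sequence and compute the loss only on the response tokens. The sketch below builds such an example; the `<user>`, `<assistant>`, and `<eos>` markers and the whitespace tokenization are hypothetical placeholders.

```python
IGNORE = -100  # label value meaning "do not compute loss here" (a common convention)

def build_sft_example(instruction, response, vocab):
    """Join prompt and response into one sequence and mark which tokens are trained on."""
    prompt_tokens = ["<user>"] + instruction.lower().split() + ["<assistant>"]
    response_tokens = response.lower().split() + ["<eos>"]
    for tok in prompt_tokens + response_tokens:
        vocab.setdefault(tok, len(vocab))
    input_ids = [vocab[t] for t in prompt_tokens + response_tokens]
    # Next-token targets: position i is trained to predict token i + 1.
    labels = input_ids[1:] + [IGNORE]
    # Mask every target that is still part of the prompt, so the loss only
    # covers the assistant's response.
    for i in range(len(prompt_tokens) - 1):
        labels[i] = IGNORE
    return input_ids, labels

vocab = {}
input_ids, labels = build_sft_example(
    "Explain gravity simply",
    "Gravity is a force that pulls objects together.",
    vocab,
)
trained_positions = [i for i, label in enumerate(labels) if label != IGNORE]
print(input_ids)
print(labels)
print(len(trained_positions), "positions contribute to the loss")
```

The training step itself is the same cross-entropy and weight update as in pretraining; only the data and the masked positions differ.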

Alignment (RLHF)

Goal: Make the model helpful and safe
After
- Pretraining -> model knows language
- SFT -> model follows instructions
Problems still remain
- It may give unsafe answers
- It may be too verbose
- It may be rude
RLHF:
- Use human feedback to train the model to produce preferred responses
Instead of telling it the exact correct answer, we tell it which answer is better

Step 1: Generate Multiple Responses

For one prompt, the model generates multiple responses (Response A, Response B, Response C)

Step 2: Humans Rank Responses

Humans rank the responses from best to worst
Example: A > C > B
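A ranking like this is often broken down into pairwise comparisons before training. A minimal sketch, assuming the ranking is given as a best-to-worst list (the function name is made up for this example):

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first item of each pair
    # is always ranked higher than the second.
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["A", "C", "B"])   # the ranking A > C > B
print(pairs)  # [('A', 'C'), ('A', 'B'), ('C', 'B')]
```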

Step 3: Train a Reward model

Now we train a separate, usually smaller, model called a reward model
Its job
- Given prompt + response -> output a score
It learns, for example, that humans prefer A to B
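A widely used training objective for this is a pairwise loss, -log sigmoid(score of preferred - score of rejected). The sketch below trains a tiny linear scorer with it. The hand-made features (polite, harmful) are purely hypothetical stand-ins; a real reward model is a neural network that reads the prompt and response text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """A linear stand-in for the reward model: score = weights . features."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """Small when the preferred response scores higher than the rejected one."""
    return -math.log(sigmoid(reward(weights, preferred) - reward(weights, rejected)))

# Hypothetical hand-made features per response: [polite, harmful]
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 0.0]}
pairs = [("A", "C"), ("A", "B"), ("C", "B")]   # from the ranking A > C > B

weights = [0.0, 0.0]
learning_rate = 0.5
loss_before = sum(pairwise_loss(weights, features[b], features[w]) for b, w in pairs)
for _ in range(100):
    for better, worse in pairs:
        diff = [b - w for b, w in zip(features[better], features[worse])]
        margin = reward(weights, features[better]) - reward(weights, features[worse])
        # Gradient of the loss w.r.t. the weights is -(1 - sigmoid(margin)) * diff,
        # so gradient descent adds (1 - sigmoid(margin)) * diff.
        scale = 1.0 - sigmoid(margin)
        weights = [w + learning_rate * scale * d for w, d in zip(weights, diff)]
loss_after = sum(pairwise_loss(weights, features[b], features[w]) for b, w in pairs)

scores = {name: reward(weights, f) for name, f in features.items()}
print(scores)
```

After training, the scores follow the human ranking: A scores highest, C sits in the middle, and B scores lowest.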

Step 4: Reinforcement Learning

Now we use Reinforcement Learning (usually PPO)
- The main LLM generates an answer
- The reward model scores it
- If the score is high -> reinforce the behavior
- If the score is low -> penalize the behavior
Gradually the model shifts towards higher-reward responses
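The reinforce/penalize loop can be sketched with a plain policy-gradient update. Note that this is a simpler stand-in and not PPO itself, which adds a clipped objective and further machinery. The "policy" here only chooses among three canned responses, and the reward values are hypothetical reward-model scores.

```python
import math

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

responses = ["A", "B", "C"]
rewards = {"A": 1.0, "B": -1.0, "C": 0.0}   # hypothetical reward-model scores
logits = [0.0, 0.0, 0.0]                    # the policy starts with no preference
learning_rate = 0.5

for _ in range(50):
    probs = softmax(logits)
    expected = sum(p * rewards[r] for p, r in zip(probs, responses))
    # Exact policy gradient of the expected reward w.r.t. each logit:
    # p_i * (reward_i - expected reward). Above-average responses are
    # reinforced, below-average ones are penalized.
    for i, r in enumerate(responses):
        logits[i] += learning_rate * probs[i] * (rewards[r] - expected)

final = dict(zip(responses, softmax(logits)))
print({name: round(p, 3) for name, p in final.items()})
```

The probability of producing response A grows step by step, while B becomes increasingly unlikely.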

What is Actually Learned?

The model learns
- Grammar
- Syntax
- Semantic Relationships
- Reasoning patterns
- Coding patterns
- Statistical world knowledge
But it does not store
- A database of facts
- Symbolic rules

Let's look at the datasets used by LLaMA 1 during training

Datasets:
- CommonCrawl
- C4 (Colossal Clean Crawled Corpus)
- GitHub
- Wikipedia
- Books
- ArXiv
- StackExchange
Meta also published the approximate share of each source in the training mix:
- ~67% CommonCrawl
- ~15% C4
- ~4.5% code (GitHub)
- ~4.5% Wikipedia
- ~4.5% books
- ~2.5% ArXiv
- ~2% StackExchange

By continuous learner
