Gen-AI Developer Classroom notes 10/Mar/2026

Tuning hyperparameters

  • per_device_train_batch_size = 2: batch size per GPU
  • gradient_accumulation_steps = 4
  • warmup_steps = 5
  • max_steps = 60
  • learning_rate = 2e-4
  • fp16 = not torch.cuda.is_bf16_supported()
  • bf16 = torch.cuda.is_bf16_supported()
  • logging_steps = 1
  • optim = "adamw_8bit"
  • weight_decay = 0.01
  • lr_scheduler_type = "linear"

Training loss: a rough guide

| Training Loss | Meaning | Model Quality |
| --- | --- | --- |
| > 3.0 | Model predictions are mostly wrong | Very poor |
| 2.0 – 3.0 | Model is starting to learn patterns | Weak |
| 1.5 – 2.0 | Model learning reasonably | Moderate |
| 1.0 – 1.5 | Good training progress | Good |
| 0.7 – 1.0 | Strong predictions | Very good |
| 0.4 – 0.7 | Very confident predictions | Excellent |
| < 0.3 | Model almost memorizing data | Possible overfitting |
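
To make the table easier to apply while watching a training log, here is a minimal sketch that maps a loss value to the table's label and also converts it to perplexity (the exponential of the mean cross-entropy loss). The function names and the handling of boundary values are assumptions made for illustration. The table leaves values between 0.3 and 0.4 unlabelled, so the sketch reports them as not covered.

```python
import math

# Bands copied from the table above: (lower bound, upper bound, label).
# Which band a boundary value such as exactly 2.0 falls into is an assumption.
LOSS_BANDS = [
    (3.0, math.inf, "Very poor"),
    (2.0, 3.0, "Weak"),
    (1.5, 2.0, "Moderate"),
    (1.0, 1.5, "Good"),
    (0.7, 1.0, "Very good"),
    (0.4, 0.7, "Excellent"),
    (0.0, 0.3, "Possible overfitting"),
]


def describe_loss(loss: float) -> str:
    """Return the table's quality label for a training loss value."""
    for low, high, label in LOSS_BANDS:
        if low <= loss < high:
            return label
    return "Not covered by the table"  # for example the 0.3 to 0.4 gap


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


for value in (3.2, 1.8, 0.9, 0.2):
    print(value, describe_loss(value), round(perplexity(value), 2))
```

A loss of 3.0 corresponds to a perplexity of about 20, meaning the model is roughly as uncertain as if it were choosing among 20 equally likely tokens. The bands themselves are a rule of thumb; the loss a healthy run reaches depends on the dataset and tokenizer.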



Unsloth QLoRA Fine-Tuning — Simple Notes

These notes apply well for:

  • Qwen3 4B
  • Google Colab
  • Support / Customer-Care datasets

Unsloth Recommended Approach

  • Start with LoRA / QLoRA
  • Avoid full fine-tuning initially
  • Keep configuration simple
  • Increase complexity only if necessary

Default Testing Configuration

max_seq_length = 2048
load_in_4bit = True

Key Hyperparameters (Simple Explanation)


Model

model_name

Use an instruct model

Why?

  • Instruct models already understand conversation format

Avoid

  • Base models for chat tasks

Context Length

max_seq_length = 2048

Controls:

  • Maximum tokens in one training example

Increase only if:

  • Training long conversations
  • Training long documents

Problem if too large:

  • GPU memory error

Precision

dtype = None

With dtype = None, Unsloth picks the precision automatically (bf16 on GPUs that support it, otherwise fp16).

Advanced users may use:

  • fp16
  • bf16

Quantization

load_in_4bit = True

Used for:

  • QLoRA training

Benefits:

  • About 4× less memory for the model weights than 16-bit loading

Disable only when doing:

  • 16-bit LoRA
  • Full fine-tuning
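
The 4× figure follows from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope estimate only: it assumes a model of about 4 billion parameters and counts the weights alone, ignoring quantization metadata, activations, optimizer state, and the LoRA adapters. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 4e9  # assumption: roughly 4 billion parameters

sixteen_bit = weight_memory_gb(params, 16)
four_bit = weight_memory_gb(params, 4)

print(f"16-bit weights: {sixteen_bit:.1f} GB")
print(f" 4-bit weights: {four_bit:.1f} GB")
print(f"ratio: {sixteen_bit / four_bit:.0f}x")
```

Real usage on Colab will be higher than the weights-only number, which is why batch size and context length still matter even with 4-bit loading.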

LoRA Hyperparameters


LoRA Rank

r = 16

Controls:

  • How much the model can change

Guidelines:

| Rank | Meaning |
| --- | --- |
| 8 | Small change |
| 16 | Default |
| 32 | Larger adaptation |

Too high:

  • Overfitting
  • High VRAM usage
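
One way to see why a higher rank costs more memory is to count the parameters LoRA adds. In the standard LoRA formulation, a linear layer with input size d_in and output size d_out gets two small matrices, A (r × d_in) and B (d_out × r). The sketch below assumes a hypothetical hidden size of 2560 purely for illustration; real layer shapes vary by model and by module.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


hidden = 2560  # hypothetical layer size, not taken from any specific model
full_layer = hidden * hidden  # parameters in the frozen square weight matrix

for r in (8, 16, 32):
    added = lora_params(hidden, hidden, r)
    print(f"r={r:2d}: {added:7d} trainable params "
          f"({100 * added / full_layer:.2f}% of the frozen layer)")
```

The count grows linearly with r, so doubling the rank doubles the trainable parameters and the optimizer state that goes with them.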

Target Modules

Standard modules:

q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj

These are the attention + MLP layers where LoRA is applied.

Usually do not change.


LoRA Alpha

lora_alpha = 16

Rule:

alpha ≈ r

Purpose:

  • Controls strength of LoRA updates

Too large:

  • Training instability
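
In the standard LoRA formulation, the adapter output is multiplied by lora_alpha / r before it is added to the frozen layer's output, which is why alpha ≈ r gives a scale close to 1. The following is a toy pure-Python sketch with tiny hand-made matrices; the helper names are hypothetical and this is not how Unsloth implements it internally.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scale applied to the LoRA update in the standard formulation."""
    return lora_alpha / r


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute y = W x + (lora_alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scaling(lora_alpha, r)
    return [b + scale * u for b, u in zip(base, update)]


# Toy example with rank 1: W is 2x2, A is 1x2, B is 2x1.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B = [[1], [0]]
x = [1, 2]

print(lora_forward(W, A, B, x, lora_alpha=1, r=1))  # scale 1.0
print(lora_forward(W, A, B, x, lora_alpha=2, r=1))  # scale 2.0, stronger update
```

Raising alpha while keeping r fixed makes the same learned update push the output further, which is the mechanism behind the instability warning above.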

LoRA Dropout

lora_dropout = 0

Unsloth commonly uses:

0

Use dropout only when:

  • Dataset is very small

Example:

0.05

Bias

bias = "none"

Reason:

  • Keeps training parameter-efficient

Rarely changed


Gradient Checkpointing

use_gradient_checkpointing = "unsloth"

Purpose:

  • Reduces GPU memory usage

Trade-off:

  • Slightly slower training

Training Hyperparameters


Learning Rate

learning_rate = 2e-4

Typical QLoRA values:

| Value | Meaning |
| --- | --- |
| 1e-4 | Safer |
| 2e-4 | Common default |

Too high causes:

  • Loss exploding

Batch Size

per_device_train_batch_size = 2

Small batch sizes work well on Colab.

Increase only if:

  • GPU allows

Gradient Accumulation

gradient_accumulation_steps = 4

Purpose:

  • Simulate larger batch size

Example:

batch_size = 2
accumulation = 4

effective batch size = 8
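
To illustrate why accumulation "simulates" a larger batch, the sketch below uses a toy one-parameter model and checks that averaging the gradients of four micro-batches of 2 matches the gradient of one batch of 8. The data, the model, and the function names are made up for the demonstration, and the equality relies on all micro-batches having the same size.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Examples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def grad_mse(w: float, batch) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy data: 8 (x, y) pairs, and a single weight w.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.5

# One large batch of 8.
full_batch_grad = grad_mse(w, data)

# Four micro-batches of 2, gradients averaged before the update.
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated_grad = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

print(effective_batch_size(2, 4))
print(full_batch_grad, accumulated_grad)
```

The update is the same, but only 2 examples need to fit in GPU memory at a time. The cost is that each optimizer step takes four forward and backward passes.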

Epochs

num_train_epochs = 2  # typical range: 2 to 3

Avoid too many epochs.

Too many epochs cause:

  • Model memorization
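
Epochs and steps describe the same thing in different units, so it helps to convert between them. The sketch below assumes an effective batch size of 8 (batch size 2 with 4 accumulation steps) and a hypothetical dataset of 1,000 examples. The rounding is an approximation; a given trainer may count the last partial batch differently.

```python
import math


def steps_for_epochs(num_examples: int, epochs: int, effective_batch: int = 8) -> int:
    """Approximate optimizer steps needed to pass over the data `epochs` times."""
    return math.ceil(num_examples / effective_batch) * epochs


def epochs_for_steps(num_examples: int, max_steps: int, effective_batch: int = 8) -> float:
    """Approximate fraction of the dataset covered by `max_steps` optimizer steps."""
    return max_steps * effective_batch / num_examples


dataset_size = 1000  # hypothetical dataset size

print(steps_for_epochs(dataset_size, epochs=2))
print(epochs_for_steps(dataset_size, max_steps=60))
```

Under these assumptions a short run of 60 steps sees less than half of the dataset once, so it works as a smoke test rather than as full training.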

Warmup Steps

warmup_steps = 5

Purpose:

  • Gradually increase learning rate

Helps stabilize early training


Learning Rate Scheduler

lr_scheduler_type = "linear"

Good default choice

Other options:

  • cosine
  • constant

Use them after baseline experiments
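
The shape of a linear schedule with warmup can be sketched in a few lines: the learning rate climbs from 0 to its peak during warmup, then falls linearly to 0 at the last step. This is an illustrative re-implementation using the values from these notes (peak 2e-4, 5 warmup steps, 60 total steps); the exact step indexing inside a particular trainer may differ by one.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp up from 0 towards base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 1, 3, 5, 30, 59, 60):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

Printing the schedule before training is a cheap way to confirm that warmup is short relative to the run; with very few total steps, a long warmup would leave little time at a useful learning rate.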


Optimizer

optim = "adamw_8bit"

Benefits:

  • Less GPU memory
  • Good performance

Weight Decay

weight_decay = 0.01

Purpose:

  • Helps reduce overfitting by shrinking the weights slightly at every step
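
In AdamW, weight decay is applied separately from the gradient update: each step multiplies a weight by (1 - learning_rate × weight_decay). The sketch below isolates that decay term with the values from these notes and ignores the gradient part of the update entirely; the function name is hypothetical.

```python
def decay_step(weight: float, lr: float = 2e-4, weight_decay: float = 0.01) -> float:
    """Apply only the decoupled weight-decay part of one AdamW step."""
    return weight * (1 - lr * weight_decay)


weight = 1.0
for _ in range(60):  # 60 steps, matching max_steps in these notes
    weight = decay_step(weight)

print(weight)
```

With these settings the shrinkage over a short run is very small, so weight decay acts as a gentle regularizer here and is unlikely to be the first knob to turn when a model overfits.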

Logging

logging_steps = 1  # typical range: 1 to 10

Lower values are good for:

  • Teaching
  • Debugging

Response-Only Training (Important)

Use when dataset format is:

User → Assistant conversation

Train only on:

  • Assistant responses

Reason:

  • Improves response quality

⚠ Danger

If masking is wrong:

  • All labels become -100
  • Loss becomes 0 (or NaN, depending on how the loss is averaged)
  • Training fails
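
The masking idea and the failure mode can be shown without any training library. In the sketch below, tokens that do not belong to the assistant get the label -100, the value that PyTorch-style cross-entropy loss ignores by default. The token IDs, the per-token role list, and the helper names are all made up for illustration; real pipelines derive the mask from the chat template.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_non_assistant(token_ids, roles):
    """Copy token IDs into labels, masking every token not written by the assistant."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]


def check_labels(labels):
    """Return the number of trainable tokens, or raise if everything is masked."""
    trainable = sum(1 for label in labels if label != IGNORE_INDEX)
    if trainable == 0:
        raise ValueError("All labels are -100: nothing to train on, check the masking")
    return trainable


# Toy example: three user tokens followed by two assistant tokens.
token_ids = [11, 12, 13, 21, 22]
roles = ["user", "user", "user", "assistant", "assistant"]

labels = mask_non_assistant(token_ids, roles)
print(labels)
print(check_labels(labels))
```

Running a check like this on a handful of examples before training is a quick way to catch a template mismatch that would otherwise mask every token.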

Recommended Starter Configuration

For your airline support tone dataset

max_seq_length = 2048
load_in_4bit = True

r = 16
lora_alpha = 16
lora_dropout = 0
bias = "none"

target_modules = [
 "q_proj","k_proj","v_proj","o_proj",
 "gate_proj","up_proj","down_proj"
]

learning_rate = 2e-4
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 2

warmup_steps = 5
lr_scheduler_type = "linear"

optim = "adamw_8bit"
weight_decay = 0.01

Important Rules for Students

Rule 1

Always start with:

QLoRA
Not full fine-tuning.


Rule 2

Use an instruct model for chat datasets.


Rule 3

First experiment should be simple.

r = 16
alpha = 16
dropout = 0
bias = "none"
learning_rate = 2e-4
lr_scheduler_type = "linear"
warmup_steps = 5  # keep warmup small

Rule 4

Watch training behavior carefully.

Possible Problems

Loss not decreasing

  • Dataset formatting issue
  • Learning rate too small

Loss exploding

  • Learning rate too high

Loss suddenly becomes 0

  • Label masking error
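
The three symptoms above can be turned into a rough automatic check on the logged loss values. This is a heuristic sketch only: the function name and both thresholds (a 2% change counted as "flat", a doubling counted as "exploding") are arbitrary assumptions and should be tuned to the run being watched.

```python
def diagnose(losses, flat_tolerance: float = 0.02, explode_factor: float = 2.0) -> str:
    """Map a list of logged training losses to one of the problems listed above."""
    first, last = losses[0], losses[-1]
    if last == 0.0:
        return "loss is 0: check label masking"
    if last > explode_factor * first:
        return "loss exploding: reduce the learning rate"
    if abs(first - last) <= flat_tolerance * first:
        return "loss not decreasing: check dataset formatting or learning rate"
    return "looks healthy"


print(diagnose([2.5, 2.0, 1.4]))
print(diagnose([2.5, 2.5, 2.49]))
print(diagnose([2.5, 4.0, 9.0]))
print(diagnose([2.5, 0.0, 0.0]))
```

A check like this only flags symptoms. Inspecting a few formatted training examples by hand remains the most reliable way to find the cause.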

Troubleshooting flowchart (Mermaid)

flowchart TD
A[Training Started] --> B{Is Loss Decreasing?}

B -- No --> C{Loss Constant Around Same Value}
C --> C1[Check Dataset Formatting]
C1 --> C2[Check Tokenization]
C2 --> C3[Increase Learning Rate Slightly]

B -- Yes --> D{Is Loss Exploding?}
D -- Yes --> D1[Learning Rate Too High]
D1 --> D2[Reduce Learning Rate]
D2 --> D3[Check Warmup Steps]

D -- No --> E{Is Loss Near Zero Very Early?}

E -- Yes --> E1[Label Masking Issue]
E1 --> E2[Check Response-Only Training]
E2 --> E3[Ensure Labels Not All -100]

E -- No --> F{Model Output Poor?}

F -- Yes --> G{Problem Type}

G --> G1[Model Too Weak]
G1 --> G2[Increase LoRA Rank r]

G --> G3[Model Overfitting]
G3 --> G4[Reduce Epochs]
G4 --> G5[Add LoRA Dropout]

G --> G6[Training Too Slow]
G6 --> G7[Increase Batch Size]
G7 --> G8[Reduce Gradient Accumulation]

F -- No --> H[Training Healthy]

H --> I[Evaluate Model With Test Prompts]

By continuous learner

enthusiastic technology learner
