Tuning Hyperparameters
- per_device_train_batch_size = 2 (batch size per GPU)
- gradient_accumulation_steps = 4
- warmup_steps = 5
- max_steps = 60
- learning_rate = 2e-4
- fp16 = not torch.cuda.is_bf16_supported()
- bf16 = torch.cuda.is_bf16_supported()
- logging_steps = 1
- optim = "adamw_8bit"
- weight_decay = 0.01
- lr_scheduler_type = "linear"
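As a quick sanity check on what these settings imply, the sketch below works out the effective batch size and how many training examples a 60-step run consumes. It assumes a single GPU; `num_gpus` is my assumption and is not part of the settings above.

```python
# Values copied from the list above; num_gpus is an assumption (one Colab GPU).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_gpus = 1

# Number of examples that contribute to one optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Number of examples consumed by the whole run (with repeats if the dataset is smaller).
examples_seen = effective_batch_size * max_steps

print(effective_batch_size, examples_seen)
```

With these numbers the run touches 480 examples, so it works as a quick smoke test rather than a full training run.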
Losses
| Training Loss | Meaning | Model Quality |
| --- | --- | --- |
| > 3.0 | Model predictions are mostly wrong | Very poor |
| 2.0 – 3.0 | Model is starting to learn patterns | Weak |
| 1.5 – 2.0 | Model learning reasonably | Moderate |
| 1.0 – 1.5 | Good training progress | Good |
| 0.7 – 1.0 | Strong predictions | Very good |
| 0.4 – 0.7 | Very confident predictions | Excellent |
| < 0.4 | Model almost memorizing data | Possible overfitting |
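These bands are rough guides, since the exact numbers depend on the tokenizer and the dataset. One way to make a loss value more intuitive is to convert it to perplexity. The sketch below assumes the logged loss is the mean per-token cross-entropy in nats, which is the usual case for causal language model training.

```python
import math

def perplexity(loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

for loss in (3.0, 1.5, 0.7):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")
```

A loss of 3.0 corresponds to a perplexity of about 20, which means the model is roughly as uncertain as if it were picking among 20 equally likely tokens at each position.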
Unsloth QLoRA Fine-Tuning — Simple Notes
These notes apply well to:
- Qwen3 4B
- Google Colab
- Support / Customer-Care datasets
Unsloth Recommended Approach
- Start with LoRA / QLoRA
- Avoid full fine-tuning initially
- Keep configuration simple
- Increase complexity only if necessary
Default Testing Configuration
max_seq_length = 2048
load_in_4bit = True
Key Hyperparameters (Simple Explanation)
Model
model_name
Use an instruct model
Why?
- Instruct models already understand conversation format
Avoid
- Base models for chat tasks
Context Length
max_seq_length = 2048
Controls:
- Maximum tokens in one training example
Increase only if:
- Training long conversations
- Training long documents
Problem if too large:
- GPU memory error
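Before raising max_seq_length, it helps to check how many examples are actually longer than the current limit. The sketch below assumes you have already counted tokens per example with the model's tokenizer; the `token_lengths` values and the helper name `truncation_report` are made up for illustration.

```python
def truncation_report(token_lengths, max_seq_length=2048):
    """Return (longest example, fraction of examples longer than max_seq_length)."""
    if not token_lengths:
        raise ValueError("token_lengths is empty")
    too_long = sum(1 for n in token_lengths if n > max_seq_length)
    return max(token_lengths), too_long / len(token_lengths)

# Hypothetical token counts for five training examples.
token_lengths = [310, 1200, 2048, 2500, 4100]
longest, fraction = truncation_report(token_lengths)
print(longest, fraction)
```

If the fraction is close to zero, 2048 is probably enough. If many examples exceed the limit, a larger value may be worth the extra GPU memory.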
Precision
dtype = None
Let the system choose automatically.
Advanced users may use:
fp16
bf16
Quantization
load_in_4bit = True
Used for:
- QLoRA training
Benefits:
- Roughly 4× less memory for the model weights than 16-bit loading
Disable only when doing:
- 16-bit LoRA
- Full fine-tuning
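The memory saving can be estimated with simple arithmetic. The sketch below counts weight storage only and assumes "4B" means exactly 4 billion parameters. Real usage is higher, because quantization constants, activations, optimizer state, and the LoRA adapters also take memory.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 4e9  # assumption: "4B" taken as exactly 4 billion parameters

print(weight_memory_gb(num_params, 16))  # 16-bit weights
print(weight_memory_gb(num_params, 4))   # 4-bit weights
```

Under these assumptions the weights shrink from about 8 GB to about 2 GB, which is what makes a 4B model comfortable on a free Colab GPU.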
LoRA Hyperparameters
LoRA Rank
r = 16
Controls:
- How much the model can change
Guidelines:
| Rank | Meaning |
| --- | --- |
| 8 | Small change |
| 16 | Default |
| 32 | Larger adaptation |
Too high:
- Overfitting
- High VRAM usage
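To see why a higher rank costs more VRAM: LoRA adds two small matrices to each adapted linear layer, one of shape (r, d_in) and one of shape (d_out, r), so the trainable parameter count grows linearly with r. The layer size below is a hypothetical example, not the real Qwen3 shape.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer with d_in inputs and d_out outputs."""
    return r * (d_in + d_out)

# Hypothetical square projection layer; not the actual Qwen3 dimensions.
d_in = d_out = 4096
full_layer = d_in * d_out

for r in (8, 16, 32):
    added = lora_params(d_in, d_out, r)
    print(r, added, f"{100 * added / full_layer:.2f}% of the full layer")
```

Doubling r doubles the adapter size, but even r = 32 stays a small fraction of the full weight matrix in this example.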
Target Modules
Standard modules:
q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj
These are the attention + MLP layers where LoRA is applied.
Usually do not change.
LoRA Alpha
lora_alpha = 16
Rule:
alpha ≈ r
Purpose:
- Controls strength of LoRA updates
Too large:
- Training instability
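In standard LoRA the low-rank update is multiplied by lora_alpha / r before it is added to the frozen layer's output, which is why alpha = r gives a scale of 1. The sketch below shows this on a tiny made-up layer using plain Python lists. Some LoRA variants use a different scaling, so treat this as the basic form only.

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r

def lora_forward(x, W, A, B, lora_alpha, r):
    """Compute y = W x + (alpha / r) * B (A x) with lists as matrices."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)                 # frozen layer output
    update = matvec(B, matvec(A, x))    # low-rank LoRA update
    s = lora_scale(lora_alpha, r)
    return [b + s * u for b, u in zip(base, update)]

# Tiny made-up example: 2-dimensional layer, rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # shape (r, d_in)
B = [[0.5], [0.5]]    # shape (d_out, r)
x = [1.0, 2.0]

print(lora_forward(x, W, A, B, lora_alpha=1, r=1))
print(lora_forward(x, W, A, B, lora_alpha=2, r=1))
```

Doubling alpha at a fixed rank doubles the size of the LoRA contribution, which is why a very large alpha can destabilize training.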
LoRA Dropout
lora_dropout = 0
Unsloth commonly uses:
0
Use dropout only when:
- Dataset is very small
Example:
0.05
Bias
bias = "none"
Reason:
- Keeps training parameter efficient
Rarely changed
Gradient Checkpointing
use_gradient_checkpointing = "unsloth"
Purpose:
- Reduces GPU memory usage
Trade-off:
- Slightly slower training
Training Hyperparameters
Learning Rate
learning_rate = 2e-4
Typical QLoRA values:
| Value | Meaning |
| --- | --- |
| 1e-4 | Safer |
| 2e-4 | Common default |
Too high causes:
- Loss exploding
Batch Size
per_device_train_batch_size = 2
Small batch sizes work well on Colab.
Increase only if:
- GPU memory allows
Gradient Accumulation
gradient_accumulation_steps = 4
Purpose:
- Simulate larger batch size
Example:
batch_size = 2
accumulation = 4
effective batch size = 8
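The toy example below illustrates why accumulation "simulates" a larger batch: averaging the gradients of 4 micro-batches of size 2 gives the same gradient as one batch of 8. It uses a one-parameter model and made-up data, so it is only a sketch of the idea. In real LLM training, variable sequence lengths can make the match approximate rather than exact.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Made-up (x, y) pairs: 8 examples = 4 micro-batches of size 2.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5
batch_size, accumulation = 2, 4

# Accumulate: average the micro-batch gradients before one optimizer step.
accumulated = 0.0
for i in range(accumulation):
    micro = data[i * batch_size:(i + 1) * batch_size]
    accumulated += grad_mse(w, micro) / accumulation

full_batch = grad_mse(w, data)
print(abs(accumulated - full_batch) < 1e-9)
```

The memory cost stays at that of a batch of 2, because only one micro-batch is on the GPU at a time.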
Epochs
num_train_epochs = 2  # typical range: 2 to 3
Avoid too many epochs.
Too many epochs cause:
- Model memorization
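To relate epochs to optimizer steps, divide the dataset size by the effective batch size. The sketch below uses a hypothetical dataset of 1000 examples. The exact rounding depends on the trainer, so treat the result as an estimate.

```python
import math

def total_optimizer_steps(num_examples, batch_size, accumulation, epochs):
    """Approximate optimizer steps; exact rounding depends on the trainer."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * accumulation))
    return steps_per_epoch * epochs

# num_examples is a hypothetical dataset size.
print(total_optimizer_steps(num_examples=1000, batch_size=2, accumulation=4, epochs=2))
```

Each epoch shows every example to the model once more, so a small dataset trained for many epochs is seen many times over, which is where memorization comes from.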
Warmup Steps
warmup_steps = 5
Purpose:
- Gradually increase learning rate
Helps stabilize early training
Learning Rate Scheduler
lr_scheduler_type = "linear"
Good default choice
Other options:
cosine
constant
Use them after baseline experiments
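The sketch below shows the usual shape of a linear schedule with warmup: the learning rate climbs from 0 to its peak over the warmup steps, then falls linearly to 0 by the last step. The total of 60 steps is an assumption for illustration, and the trainer's own implementation may differ in small details.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

The peak of 2e-4 is reached at step 5. Before that, the small learning rate protects the model from large, noisy updates at the very start.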
Optimizer
optim = "adamw_8bit"
Benefits:
- Less GPU memory
- Good performance
Weight Decay
weight_decay = 0.01
Purpose:
- Helps reduce overfitting
Logging
logging_steps = 1  # typical range: 1 to 10
Lower values are good for:
- Teaching
- Debugging
Response-Only Training (Important)
Use when dataset format is:
User → Assistant conversation
Train only on:
- Assistant responses
Reason:
- Improves response quality
⚠ Danger
If masking is wrong:
- All labels become -100
- Loss becomes 0
- Training fails
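A minimal sketch of what response-only masking does, and a check that catches the failure described above. It assumes you already know which token positions belong to the assistant. The helper names and token ids are made up, and this is not the code Unsloth itself uses.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_assistant(token_ids, is_assistant):
    """Keep labels for assistant tokens; replace everything else with IGNORE_INDEX."""
    if len(token_ids) != len(is_assistant):
        raise ValueError("token_ids and is_assistant must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(token_ids, is_assistant)]

def check_labels(labels):
    """Return the fraction of trainable tokens; raise if there are none."""
    trainable = sum(1 for label in labels if label != IGNORE_INDEX)
    if trainable == 0:
        raise ValueError("all labels are -100: the response mask matched nothing")
    return trainable / len(labels)

# Made-up token ids: 4 user tokens followed by 3 assistant tokens.
token_ids = [101, 7, 8, 9, 201, 42, 43]
is_assistant = [False, False, False, False, True, True, True]

labels = mask_non_assistant(token_ids, is_assistant)
print(labels, check_labels(labels))
```

Running a check like this on a few examples before training is cheap and catches a wrong chat template or a wrong response marker early.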
Recommended Starter Configuration
For your airline support tone dataset
max_seq_length = 2048
load_in_4bit = True
r = 16
lora_alpha = 16
lora_dropout = 0
bias = "none"
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
learning_rate = 2e-4
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 2
warmup_steps = 5
lr_scheduler_type = "linear"
optim = "adamw_8bit"
weight_decay = 0.01
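One way to keep these settings organized is to group them by the stage that uses them. In typical Unsloth notebooks the first group goes to model loading, the second to the LoRA setup, and the third to the trainer arguments. That mapping and the dict names are my assumptions, so check them against the notebook you start from.

```python
# Values are from the starter configuration above. The dict names and the
# three-way grouping are illustrative assumptions, not an Unsloth requirement.
load_kwargs = {
    "max_seq_length": 2048,
    "load_in_4bit": True,
}

lora_kwargs = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "warmup_steps": 5,
    "lr_scheduler_type": "linear",
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
}

# Quick consistency checks against the rules in these notes.
assert lora_kwargs["lora_alpha"] == lora_kwargs["r"]   # alpha is set equal to r
assert len(lora_kwargs["target_modules"]) == 7         # attention + MLP projections
```

Keeping the groups separate makes it easier to change one thing at a time between experiments.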
Important Rules for Students
Rule 1
Always start with:
QLoRA
Not full fine-tuning.
Rule 2
Use an instruct model for chat datasets.
Rule 3
First experiment should be simple.
r = 16
alpha = 16
dropout = 0
bias = "none"
learning_rate = 2e-4
lr_scheduler_type = "linear"
warmup_steps = 5 (keep it small)
Rule 4
Watch training behavior carefully.
Possible Problems
Loss not decreasing
- Dataset formatting issue
- Learning rate too small
Loss exploding
- Learning rate too high
Loss suddenly becomes 0
- Label masking error
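The three symptoms above can be turned into a rough automatic check on the logged losses. The thresholds and the function name `diagnose` are hypothetical choices for illustration, so tune them for your own runs.

```python
import math

def diagnose(losses, flat_tolerance=0.02, zero_threshold=1e-3, explode_factor=2.0):
    """Classify a list of logged training losses using rough, hypothetical thresholds."""
    if not losses:
        raise ValueError("no losses logged")
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "loss exploding: lower the learning rate"
    first, last = losses[0], losses[-1]
    if last <= zero_threshold:
        return "loss is 0: check label masking"
    if last > explode_factor * first:
        return "loss exploding: lower the learning rate"
    if last > first * (1 - flat_tolerance):
        return "loss not decreasing: check dataset formatting, then learning rate"
    return "loss decreasing: looks healthy"

print(diagnose([2.8, 2.1, 1.6, 1.3]))
print(diagnose([2.8, 2.8, 2.79, 2.81]))
print(diagnose([2.8, 3.9, 7.5, 41.0]))
print(diagnose([2.8, 0.0, 0.0, 0.0]))
```

A check like this only flags the obvious cases. Reading a few generated answers is still the real test of model quality.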
Troubleshooting Flowchart (Mermaid)
flowchart TD
A[Training Started] --> B{Is Loss Decreasing?}
B -- No --> C[Loss Stuck Around Same Value]
C --> C1[Check Dataset Formatting]
C1 --> C2[Check Tokenization]
C2 --> C3[Increase Learning Rate Slightly]
B -- Yes --> D{Is Loss Exploding?}
D -- Yes --> D1[Learning Rate Too High]
D1 --> D2[Reduce Learning Rate]
D2 --> D3[Check Warmup Steps]
D -- No --> E{Is Loss Near Zero Very Early?}
E -- Yes --> E1[Label Masking Issue]
E1 --> E2[Check Response-Only Training]
E2 --> E3[Ensure Labels Not All -100]
E -- No --> F{Model Output Poor?}
F -- Yes --> G{Problem Type}
G --> G1[Model Too Weak]
G1 --> G2[Increase LoRA Rank r]
G --> G3[Model Overfitting]
G3 --> G4[Reduce Epochs]
G4 --> G5[Add LoRA Dropout]
G --> G6[Training Too Slow]
G6 --> G7[Increase Batch Size]
G7 --> G8[Reduce Gradient Accumulation]
F -- No --> H[Training Healthy]
H --> I[Evaluate Model With Test Prompts]
