Tuning hyperparameters

per_device_train_batch_size = 2 (batch size per GPU)
gradient_accumulation_steps = 4
warmup_steps = 5
max_steps = 60
learning_rate = 2e-4
fp16 = not torch.cuda.is_bf16_supported()
bf16 = torch.cuda.is_bf16_supported()
logging_steps = 1
optim = "adamw_8bit"
weight_decay = 0.01
lr_scheduler_type = "linear"
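
The fp16 and bf16 lines above are meant to be mutually exclusive: exactly one of them ends up True. A minimal sketch of that logic in plain Python follows. The helper name `precision_flags` is hypothetical, and the boolean argument stands in for the result of `torch.cuda.is_bf16_supported()` so the example runs without a GPU.

```python
def precision_flags(bf16_supported: bool) -> dict:
    """Return mutually exclusive fp16/bf16 flags.

    `bf16_supported` stands in for torch.cuda.is_bf16_supported();
    it is passed in here so the sketch runs without PyTorch.
    """
    return {"fp16": not bf16_supported, "bf16": bf16_supported}


# GPU with bf16 support: bf16 is selected.
print(precision_flags(True))   # {'fp16': False, 'bf16': True}
# GPU without bf16 support: fall back to fp16.
print(precision_flags(False))  # {'fp16': True, 'bf16': False}
```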

losses

Unsloth QLoRA Fine-Tuning: Simple Notes

These notes apply well for:

Qwen3 4B
Google Colab
Support / Customer-Care datasets

Unsloth Recommended Approach

Start with LoRA / QLoRA
Avoid full fine-tuning initially
Keep configuration simple
Increase complexity only if necessary

Default Testing Configuration

max_seq_length = 2048
load_in_4bit = True

Key Hyperparameters (Simple Explanation)

Model

`model_name`

Use an instruct model

Why?

Instruct models already understand conversation format

Avoid

Base models for chat tasks

Context Length

`max_seq_length = 2048`

Controls:

Maximum tokens in one training example

Increase only if:

Training long conversations
Training long documents

Problem if too large:

GPU memory error
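
Before raising max_seq_length, it can help to check how many training examples actually exceed the current limit. The sketch below is an illustration only: `count_over_limit` is a hypothetical helper, and it uses whitespace splitting as a stand-in tokenizer. Real token counts from the model's tokenizer will usually be higher, so pass the real tokenizer's function as `tokenize` for an accurate number.

```python
def count_over_limit(examples, max_seq_length=2048, tokenize=str.split):
    """Return (number of examples over the limit, longest length seen).

    `tokenize` is any function mapping a string to a list of tokens.
    The default whitespace split is only a rough stand-in.
    """
    lengths = [len(tokenize(text)) for text in examples]
    too_long = sum(1 for n in lengths if n > max_seq_length)
    return too_long, max(lengths, default=0)


examples = ["hello there", "word " * 3000]
print(count_over_limit(examples))  # (1, 3000)
```

If only a handful of examples are over the limit, truncating or dropping them is often cheaper than increasing the context length.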

Precision

`dtype = None`

Let the system choose automatically.

Advanced users may use:

fp16
bf16

Quantization

`load_in_4bit = True`

Used for:

QLoRA training

Benefits:

4× less memory usage

Disable only when doing:

16-bit LoRA
Full fine-tuning
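
The 4× figure follows directly from the bits stored per weight: 16 bits versus 4 bits. A back-of-the-envelope sketch, assuming a round 4 billion parameters and counting weights only (it ignores quantization metadata, activations, LoRA adapters, and optimizer state, so real usage is higher):

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


params = 4e9  # assumed round figure for a 4B model
mem_16bit = weight_memory_gb(params, 16)
mem_4bit = weight_memory_gb(params, 4)

print(mem_16bit)             # 8.0
print(mem_4bit)              # 2.0
print(mem_16bit / mem_4bit)  # 4.0
```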

LoRA Hyperparameters

LoRA Rank

r = 16

Controls:

How much the model can change

Guidelines:

| Rank | Meaning |
| ---- | ----------------- |
| 8 | Small change |
| 16 | Default |
| 32 | Larger adaptation |

Too high:

Overfitting
High VRAM usage
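
To see why rank drives both capacity and VRAM, count the trainable parameters. In standard LoRA, a frozen weight matrix of shape d_out × d_in gets two small trainable matrices, A (r × d_in) and B (d_out × r), so the adapter adds r × (d_in + d_out) parameters. The sketch below uses a hypothetical 2048 × 2048 layer; the real layer sizes depend on the model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


d_in = d_out = 2048  # hypothetical layer size, for illustration only
full = d_in * d_out  # parameters in the frozen matrix

for r in (8, 16, 32):
    added = lora_params(d_in, d_out, r)
    print(r, added, f"{added / full:.2%}")
# 8 32768 0.78%
# 16 65536 1.56%
# 32 131072 3.12%
```

Doubling the rank doubles the number of trainable parameters, which is why very high ranks cost more memory and can overfit small datasets.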

Target Modules

Standard modules:

q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj

These are the attention + MLP layers where LoRA is applied.

Usually do not change.

LoRA Alpha

lora_alpha = 16

Rule:

alpha ≈ r

Purpose:

Controls strength of LoRA updates

Too large:

Training instability
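
In standard LoRA the adapter update is multiplied by lora_alpha / r, which is why alpha is usually chosen relative to the rank. A small sketch of that scaling factor (the function name `lora_scaling` is hypothetical; variants such as rank-stabilized LoRA use a different formula):

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling factor applied to the LoRA update in standard LoRA."""
    return lora_alpha / r


print(lora_scaling(16, 16))  # 1.0  (alpha = r, the setting used in these notes)
print(lora_scaling(32, 16))  # 2.0  (stronger updates)
print(lora_scaling(16, 32))  # 0.5  (weaker updates)
```

Keeping alpha equal to r holds the factor at 1.0, so changing the rank does not silently change the strength of the updates.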

LoRA Dropout

Unsloth commonly uses:

lora_dropout = 0

Use dropout only when:

Dataset is very small

Example:

0.05

Bias

bias = "none"

Reason:

Keeps training parameter efficient

Rarely changed

Gradient Checkpointing

use_gradient_checkpointing = "unsloth"

Purpose:

Reduces GPU memory usage

Trade-off:

Slightly slower training

Training Hyperparameters

Learning Rate

learning_rate = 2e-4

Typical QLoRA values:

| Value | Meaning |
| ----- | -------------- |
| 1e-4 | Safer |
| 2e-4 | Common default |

Too high causes:

Loss exploding

Batch Size

per_device_train_batch_size = 2

Small batch sizes work well on Colab.

Increase only if:

GPU allows

Gradient Accumulation

gradient_accumulation_steps = 4

Purpose:

Simulate larger batch size

Example:

batch_size = 2
accumulation = 4

effective batch size = 8
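
The same arithmetic also tells you roughly how many optimizer steps one epoch takes, and how many examples a fixed max_steps run will see. A minimal sketch; the dataset size of 1000 is a made-up figure, and the step count is approximate because trainers differ in how they handle the last partial batch.

```python
import math


def effective_batch_size(per_device: int, accumulation: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * accumulation * num_devices


def steps_per_epoch(num_examples: int, effective_batch: int) -> int:
    """Approximate optimizer steps needed to see every example once."""
    return math.ceil(num_examples / effective_batch)


eff = effective_batch_size(per_device=2, accumulation=4)
print(eff)                         # 8
print(steps_per_epoch(1000, eff))  # 125 (1000 is a hypothetical dataset size)
print(60 * eff)                    # 480 examples seen when max_steps = 60
```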

Epochs

num_train_epochs = 2-3

Avoid too many epochs.

Too many epochs cause:

Model memorization

Warmup Steps

warmup_steps = 5

Purpose:

Gradually increase learning rate

Helps stabilize early training

Learning Rate Scheduler

lr_scheduler_type = "linear"

Good default choice

Other options:

cosine
constant

Use them after baseline experiments
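
To make warmup and the linear scheduler concrete, here is a sketch of the learning rate at each step: it rises linearly from 0 to the base rate during warmup, then decays linearly to 0 by the final step. This is an illustration of the shape in plain Python, not the library's own code, and `linear_lr` is a hypothetical name.

```python
def linear_lr(step: int, base_lr: float = 2e-4, warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_lr(step):.2e}")
```

With warmup_steps = 5 and max_steps = 60, the peak rate of 2e-4 is reached at step 5 and the rate returns to 0 at step 60.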

Optimizer

optim = "adamw_8bit"

Benefits:

Less GPU memory
Good performance

Weight Decay

weight_decay = 0.01

Purpose:

Helps reduce overfitting by gently shrinking weights toward zero
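
A rough way to see how mild weight_decay = 0.01 is: in standard AdamW the decay is decoupled from the gradient, and each step multiplies a weight by (1 - learning_rate × weight_decay). The sketch below isolates only that shrink factor and ignores the gradient update entirely; it assumes a constant learning rate, which the real schedule does not have.

```python
def decay_only(weight: float, lr: float = 2e-4, weight_decay: float = 0.01, steps: int = 1) -> float:
    """Apply only the decoupled weight-decay shrink for `steps` steps."""
    for _ in range(steps):
        weight *= 1 - lr * weight_decay
    return weight


# After 60 steps a weight of 1.0 has shrunk by roughly 0.012 percent.
print(decay_only(1.0, steps=60))
```

At these settings the effect over a short run is tiny, which is one reason this value is a safe default.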

Logging

logging_steps = 1-10

Lower values are good for:

Teaching
Debugging

Response-Only Training (Important)

Use when dataset format is:

User → Assistant conversation

Train only on:

Assistant responses

Reason:

Improves response quality

⚠ Danger

If masking is wrong:

All labels become -100
Loss becomes 0
Training fails
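
A quick sanity check before training can catch this. The sketch below works on plain Python lists of label ids and flags any example where every label is -100, meaning nothing in it would be trained on. The helper names are hypothetical; with a real dataset you would pass in the `labels` field of a few tokenized examples.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def trainable_fraction(labels) -> float:
    """Fraction of tokens in one example that contribute to the loss."""
    if not labels:
        return 0.0
    kept = sum(1 for token in labels if token != IGNORE_INDEX)
    return kept / len(labels)


def check_masking(batch_labels) -> None:
    """Raise if any example has every label masked out."""
    for i, labels in enumerate(batch_labels):
        if trainable_fraction(labels) == 0.0:
            raise ValueError(f"example {i}: all labels are {IGNORE_INDEX}")


good = [-100, -100, 42, 43]  # prompt masked, response kept
print(trainable_fraction(good))  # 0.5
check_masking([good])            # passes silently
```

If the check raises, the usual suspects are instruction and response markers that do not match the chat template actually used in the dataset.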

Recommended Starter Configuration

For an airline support tone dataset:

max_seq_length = 2048
load_in_4bit = True

r = 16
lora_alpha = 16
lora_dropout = 0
bias = "none"

target_modules = [
 "q_proj","k_proj","v_proj","o_proj",
 "gate_proj","up_proj","down_proj"
]

learning_rate = 2e-4
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 2

warmup_steps = 5
lr_scheduler_type = "linear"

optim = "adamw_8bit"
weight_decay = 0.01

Important Rules for Students

Rule 1

Always start with:

QLoRA
Not full fine-tuning.

Rule 2

Use an instruct model for chat datasets.

Rule 3

First experiment should be simple.

r = 16
alpha = 16
dropout = 0
bias = "none"
learning_rate = 2e-4
lr_scheduler_type = "linear"
warmup_steps = small

Rule 4

Watch training behavior carefully.

Possible Problems

Loss not decreasing

Dataset formatting issue
Learning rate too small

Loss exploding

Learning rate too high

Loss suddenly becomes 0

Label masking error
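
These three symptoms can be turned into a small check over the logged loss values. The sketch below is only a heuristic: the function name, the labels it returns, and all thresholds are assumptions chosen for illustration, not values from Unsloth.

```python
import math


def diagnose_loss(losses, flat_tolerance=0.02, explode_factor=2.0, zero_eps=1e-4):
    """Classify a list of logged losses as exploding, zero, flat, or healthy."""
    if not losses:
        return "no data"
    first, last = losses[0], losses[-1]
    if math.isnan(last) or math.isinf(last):
        return "exploding"  # learning rate likely too high
    if last < zero_eps:
        return "zero"       # check label masking
    if last > explode_factor * first:
        return "exploding"  # learning rate likely too high
    if first - last <= flat_tolerance * first:
        return "flat"       # check formatting, then learning rate
    return "healthy"


print(diagnose_loss([2.0, 1.5, 1.1]))   # healthy
print(diagnose_loss([2.0, 2.0, 1.99]))  # flat
print(diagnose_loss([2.0, 3.0, 9.0]))   # exploding
print(diagnose_loss([2.0, 0.0]))        # zero
```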

Troubleshooting flowchart (Mermaid)

flowchart TD
A[Training Started] --> B{Is Loss Decreasing?}

B -- No --> C{Loss Constant Around Same Value}
C --> C1[Check Dataset Formatting]
C1 --> C2[Check Tokenization]
C2 --> C3[Increase Learning Rate Slightly]

B -- Yes --> D{Is Loss Exploding?}
D -- Yes --> D1[Learning Rate Too High]
D1 --> D2[Reduce Learning Rate]
D2 --> D3[Check Warmup Steps]

D -- No --> E{Is Loss Near Zero Very Early?}

E -- Yes --> E1[Label Masking Issue]
E1 --> E2[Check Response-Only Training]
E2 --> E3[Ensure Labels Not All -100]

E -- No --> F{Model Output Poor?}

F -- Yes --> G{Problem Type}

G --> G1[Model Too Weak]
G1 --> G2[Increase LoRA Rank r]

G --> G3[Model Overfitting]
G3 --> G4[Reduce Epochs]
G4 --> G5[Add LoRA Dropout]

G --> G6[Training Too Slow]
G6 --> G7[Increase Batch Size]
G7 --> G8[Reduce Gradient Accumulation]

F -- No --> H[Training Healthy]

H --> I[Evaluate Model With Test Prompts]

By continuous learner
