Size Mismatch for lm_head When Fine-tuning Qwen2.5
Introduction
In this article, we discuss a common issue encountered when fine-tuning the Qwen2.5 model with the PEFT library: loading a trained adapter through AutoPeftModelForCausalLM fails with a size mismatch for the LM head. We explore the likely cause and outline a possible fix.
System Information
The system information is as follows:
- Transformers version: 4.49.0
- Platform: Linux-6.6.0-72.0.0.64.oe2403.x86_64-x86_64-with-glibc2.38
- Python version: 3.10.16
- Huggingface_hub version: 0.29.1
- Safetensors version: 0.5.3
- Accelerate version: 1.4.0
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who Can Help?
The following individuals can provide assistance with this issue:
- @benjaminbossan
- @sayakpaul
Information
The following information is relevant to this issue:
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
The following tasks are relevant to this issue:
- [ ] An officially supported task in the examples folder
- [x] My own task or dataset (give details below)
Reproduction
To reproduce this issue, load the trained adapter for the Qwen2.5 model with the following code:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "/home/chenjq/pythonWork/nlp/Qwen2.5-0.5B-SFT-Capybara/checkpoint-31"
# peft_model_id = args.output_dir
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)
This code will result in the following error:
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Traceback (most recent call last):
File "/home/chenjq/.pycharm_helpers/pydev/pydevd.py", line 1500, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/chenjq/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/chenjq/pythonWork/nlp/test14.py", line 11, in <module>
model = AutoPeftModelForCausalLM.from_pretrained(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/auto.py", line 130, in from_pretrained
return cls._target_peft_class.from_pretrained(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/peft_model.py", line 581, in from_pretrained
load_result = model.load_adapter(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/peft_model.py", line 1239, in load_adapter
load_result = set_peft_model_state_dict(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 451, in set_peft_model_state_dict
load_result = model.load_state_dict(peft_model_state_dict, strict=False)
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([151936, 896]) from checkpoint, the shape in current model is torch.Size([151665, 896]).
However, if you use the following code to load the model, everything works fine:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_name = '/home/models/qwen/Qwen2.5-0.5B'
adapter_model_name = "/home/chenjq/pythonWork/nlp/Qwen2.5-0.5B-SFT-Capybara/checkpoint-31"
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
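Once the adapter is attached this way, a quick generation call confirms that the model behaves normally. The snippet below is a minimal smoke test under the setup above; the prompt and generation settings are arbitrary examples:
import torch

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # A small generation budget is enough to verify that the adapter loaded correctly.
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))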
Possible Cause
The length of the Qwen2.5 tokenizer (151665) does not match the model's embedding size (151936): the embedding matrix is padded beyond the tokenizer vocabulary. _BaseAutoPeftModel.from_pretrained resizes the base model embeddings to match the tokenizer, so the lm_head weights saved in the checkpoint (151936 x 896) no longer fit the resized model (151665 x 896), and loading the saved weights fails.
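A quick way to confirm the mismatch is to compare the tokenizer length with the number of rows in the base model's embedding matrix. This is a small diagnostic sketch; the model identifier below is the Hub ID used during training, so swap in a local path if needed:
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "Qwen/Qwen2.5-0.5B"  # or a local copy such as /home/models/qwen/Qwen2.5-0.5B

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

print(len(tokenizer))  # 151665: tokens the tokenizer actually defines
print(model.get_input_embeddings().weight.shape[0])  # 151936: rows in the (padded) embedding matrix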
Solution
A possible fix is for _BaseAutoPeftModel.from_pretrained to resize the base model embeddings only when the size of the tokenizer saved with the adapter differs from the size of the base model's tokenizer, i.e. when new tokens were added during training. If no tokens were added, the embeddings should be left untouched so that the saved modules_to_save weights still match.
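The sketch below applies the same check outside of PEFT rather than patching the library itself: load the base model, compare the adapter's tokenizer with the base tokenizer, resize only if they differ, and then attach the adapter. The paths are the ones from the reproduction above:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "/home/models/qwen/Qwen2.5-0.5B"
adapter_path = "/home/chenjq/pythonWork/nlp/Qwen2.5-0.5B-SFT-Capybara/checkpoint-31"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
adapter_tokenizer = AutoTokenizer.from_pretrained(adapter_path)

# Resize only if the adapter's tokenizer differs in size from the base tokenizer,
# e.g. because new tokens were added during fine-tuning. For Qwen2.5 the two are
# identical, so no resize happens and the saved lm_head (151936 x 896) still fits.
if len(adapter_tokenizer) != len(base_tokenizer):
    model.resize_token_embeddings(len(adapter_tokenizer))

model = PeftModel.from_pretrained(model, adapter_path)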
Adapter Training
The adapter was trained using the following code:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
dataset = load_dataset("trl-lib/Capybara", split="train")
dataset = dataset.select(range(500))
MODEL_ID = 'Qwen/Qwen2.5-0.5B'
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    modules_to_save=["lm_head", "embed_token"],
    task_type="CAUSAL_LM",
)
args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-Capybara",  # directory to save and repository id
    num_train_epochs=1,  # number of training epochs
    per_device_train_batch_size=4,  # batch size per device during training
    gradient_accumulation_steps=4,  # number of steps before performing a backward/update pass
    gradient_checkpointing=True,  # use gradient checkpointing to save memory
    optim="adamw_torch_fused",  # use fused adamw optimizer
    logging_steps=10,  # log every 10 steps
    save_strategy="epoch",  # save checkpoint every epoch
    bf16=True,  # use bfloat16 precision
    tf32=True,  # use tf32 precision
    learning_rate=2e-4,  # learning rate, based on QLoRA paper
    max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",  # use constant learning rate scheduler
    push_to_hub=False,  # push model to hub
    # report_to="tensorboard",  # report metrics to tensorboard
)
trainer = SFTTrainer(
    MODEL_ID,
    train_dataset=dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
print('end')
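Because modules_to_save includes lm_head, the checkpoint stores a full copy of the LM head with the padded vocabulary size, which is exactly what later clashes with the resized model. You can verify this by listing the tensors in the saved adapter file; the file name below assumes the default safetensors layout written by PEFT:
from safetensors.torch import load_file

adapter_file = "Qwen2.5-0.5B-SFT-Capybara/checkpoint-31/adapter_model.safetensors"  # assumed default file name
state_dict = load_file(adapter_file)

for name, tensor in state_dict.items():
    if "lm_head" in name:
        print(name, tuple(tensor.shape))  # expect a (151936, 896) lm_head weight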
Expected Behavior
Q: What is the size mismatch for the LM head when fine-tuning Qwen2.5?
A: The Qwen2.5 tokenizer has 151665 tokens, while the model's embedding matrix and lm_head have 151936 rows. When the saved lm_head weights (151936 x 896) are loaded into a model whose embeddings were resized to the tokenizer length (151665), the shapes no longer match and loading fails.
Q: What is the cause of this issue?
A: _BaseAutoPeftModel.from_pretrained resizes the base model embeddings to match the tokenizer before loading the adapter, which shrinks lm_head and makes it incompatible with the full-size weights saved via modules_to_save.
Q: How can I resolve this issue?
A: Either load the base model with AutoModelForCausalLM and attach the adapter with PeftModel.from_pretrained, as shown above, or modify _BaseAutoPeftModel.from_pretrained so that it resizes the embeddings only when the tokenizer size differs from the base model's tokenizer.
Q: What is the expected behavior when fine-tuning Qwen2.5?
A: The adapter loads without errors and the model predicts normally.
Q: What changes with the proposed fix to _BaseAutoPeftModel.from_pretrained?
A: The modified loading path checks whether the tokenizer size differs from the base tokenizer size before resizing the embeddings. The adapter training code is unchanged, and with the check in place the checkpoint loads and the model predicts normally.