Size Mismatch for lm_head When Fine-tuning Qwen2.5
Introduction
In this article, we discuss a common issue encountered when fine-tuning the Qwen2.5 model with the PEFT library: loading a trained adapter through AutoPeftModelForCausalLM fails with a size mismatch for the LM head. We explore the likely cause and outline a possible fix.
System Information
The system information is as follows:
- Transformers version: 4.49.0
- Platform: Linux-6.6.0-72.0.0.64.oe2403.x86_64-x86_64-with-glibc2.38
- Python version: 3.10.16
- Huggingface_hub version: 0.29.1
- Safetensors version: 0.5.3
- Accelerate version: 1.4.0
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who Can Help?
The following individuals can provide assistance with this issue:
- @benjaminbossan
- @sayakpaul
Information
The following information is relevant to this issue:
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
The following tasks are relevant to this issue:
- [ ] An officially supported task in the examples folder
- [x] My own task or dataset (give details below)
Reproduction
To reproduce this issue, load the trained adapter for the Qwen2.5 model with the following code:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "/home/chenjq/pythonWork/nlp/Qwen2.5-0.5B-SFT-Capybara/checkpoint-31"
# peft_model_id = args.output_dir
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)
This code will result in the following error:
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Traceback (most recent call last):
File "/home/chenjq/.pycharm_helpers/pydev/pydevd.py", line 1500, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/chenjq/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/chenjq/pythonWork/nlp/test14.py", line 11, in <module>
model = AutoPeftModelForCausalLM.from_pretrained(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/auto.py", line 130, in from_pretrained
return cls._target_peft_class.from_pretrained(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/peft_model.py", line 581, in from_pretrained
load_result = model.load_adapter(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/peft_model.py", line 1239, in load_adapter
load_result = set_peft_model_state_dict(
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 451, in set_peft_model_state_dict
load_result = model.load_state_dict(peft_model_state_dict, strict=False)
File "/home/chenjq/miniconda3/envs/nlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([151936, 896]) from checkpoint, the shape in current model is torch.Size([151665, 896]).
However, if you use the following code to load the model, everything works fine:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_name = '/home/models/qwen/Qwen2.5-0.5B'
adapter_model_name = "/home/chenjq/pythonWork/nlp/Qwen2.5-0.5B-SFT-Capybara/checkpoint-31"
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
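Once the adapter is attached this way, a quick generation call confirms that the model behaves normally. The snippet below is a minimal smoke test under the setup above; the prompt and generation settings are arbitrary examples:
import torch

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # A small generation budget is enough to verify that the adapter loaded correctly.
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))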
Possible Cause
The length of the Qwen2.5 tokenizer (151665) does not match the model's embedding size (151936): the embedding matrix is padded beyond the tokenizer vocabulary. _BaseAutoPeftModel.from_pretrained resizes the base model embeddings to match the tokenizer, so the lm_head weights saved in the checkpoint (151936 x 896) no longer fit the resized model (151665 x 896), and loading the saved weights fails.
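A quick way to confirm the mismatch is to compare the tokenizer length with the number of rows in the base model's embedding matrix. This is a small diagnostic sketch; the model identifier below is the Hub ID used during training, so swap in a local path if needed:
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "Qwen/Qwen2.5-0.5B"  # or a local copy such as /home/models/qwen/Qwen2.5-0.5B

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

print(len(tokenizer))  # 151665: tokens the tokenizer actually defines
print(model.get_input_embeddings().weight.shape[0])  # 151936: rows in the (padded) embedding matrix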
Solution
A possible fix is for _BaseAutoPeftModel.from_pretrained to resize the base model embeddings only when the size of the tokenizer saved with the adapter differs from the size of the base model's tokenizer, i.e. when new tokens were added during training. If no tokens were added, the embeddings should be left untouched so that the saved modules_to_save weights still match.
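The sketch below applies the same check outside of PEFT rather than patching the library itself: load the base model, compare the adapter's tokenizer with the base tokenizer, resize only if they differ, and then attach the adapter. The paths are the ones from the reproduction above:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "/home/models/qwen/Qwen2.5-0.5B"
adapter_path = "/home/chenjq/pythonWork/nlp/Qwen2.5-0.5B-SFT-Capybara/checkpoint-31"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
adapter_tokenizer = AutoTokenizer.from_pretrained(adapter_path)

# Resize only if the adapter's tokenizer differs in size from the base tokenizer,
# e.g. because new tokens were added during fine-tuning. For Qwen2.5 the two are
# identical, so no resize happens and the saved lm_head (151936 x 896) still fits.
if len(adapter_tokenizer) != len(base_tokenizer):
    model.resize_token_embeddings(len(adapter_tokenizer))

model = PeftModel.from_pretrained(model, adapter_path)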
Adapter Training
The adapter was trained using the following code:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
dataset = load_dataset("trl-lib/Capybara", split="train")
dataset = dataset.select(range(500))
MODEL_ID = 'Qwen/Qwen2.5-0.5B'
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    modules_to_save=["lm_head", "embed_token"],
    task_type="CAUSAL_LM",
)
args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-Capybara",  # directory to save and repository id
    num_train_epochs=1,  # number of training epochs
    per_device_train_batch_size=4,  # batch size per device during training
    gradient_accumulation_steps=4,  # number of steps before performing a backward/update pass
    gradient_checkpointing=True,  # use gradient checkpointing to save memory
    optim="adamw_torch_fused",  # use fused adamw optimizer
    logging_steps=10,  # log every 10 steps
    save_strategy="epoch",  # save checkpoint every epoch
    bf16=True,  # use bfloat16 precision
    tf32=True,  # use tf32 precision
    learning_rate=2e-4,  # learning rate, based on QLoRA paper
    max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",  # use constant learning rate scheduler
    push_to_hub=False,  # push model to hub
    # report_to="tensorboard",  # report metrics to tensorboard
)
trainer = SFTTrainer(
    MODEL_ID,
    train_dataset=dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
print('end')
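Because modules_to_save includes lm_head, the checkpoint stores a full copy of the LM head with the padded vocabulary size, which is exactly what later clashes with the resized model. You can verify this by listing the tensors in the saved adapter file; the file name below assumes the default safetensors layout written by PEFT:
from safetensors.torch import load_file

adapter_file = "Qwen2.5-0.5B-SFT-Capybara/checkpoint-31/adapter_model.safetensors"  # assumed default file name
state_dict = load_file(adapter_file)

for name, tensor in state_dict.items():
    if "lm_head" in name:
        print(name, tuple(tensor.shape))  # expect a (151936, 896) lm_head weight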
Expected Behavior
Q: What is the size mismatch for the LM head when fine-tuning Qwen2.5?
A: The Qwen2.5 tokenizer has 151665 tokens, while the model's embedding matrix and lm_head have 151936 rows. When the saved lm_head weights (151936 x 896) are loaded into a model whose embeddings were resized to the tokenizer length (151665), the shapes no longer match and loading fails.
Q: What is the cause of this issue?
A: _BaseAutoPeftModel.from_pretrained resizes the base model embeddings to match the tokenizer before loading the adapter, which shrinks lm_head and makes it incompatible with the full-size weights saved via modules_to_save.
Q: How can I resolve this issue?
A: Either load the base model with AutoModelForCausalLM and attach the adapter with PeftModel.from_pretrained, as shown above, or modify _BaseAutoPeftModel.from_pretrained so that it resizes the embeddings only when the tokenizer size differs from the base model's tokenizer.
Q: What is the expected behavior when fine-tuning Qwen2.5?
A: The adapter loads without errors and the model predicts normally.
Q: What changes with the proposed fix to _BaseAutoPeftModel.from_pretrained?
A: The modified loading path checks whether the tokenizer size differs from the base tokenizer size before resizing the embeddings. The adapter training code is unchanged, and with the check in place the checkpoint loads and the model predicts normally.