Issue with Training on Custom Dataset: KeyError (v_num)
Training a model on a custom dataset can be challenging, especially with a complex architecture like YOLOv9. In this article, we explore a KeyError (v_num) raised when training a YOLOv9 model on a custom dataset using the Lightning framework.
The error message is as follows:
Error executing job with overrides: ['dataset=custom', 'task=train', 'task.data.batch_size=8', 'model=v9-s', 'weight=False', 'use_wandb=False']
Traceback (most recent call last):
File "/mnt/c/Users/is231191daus/algorithms/YOLO/yolo/lazy.py", line 45, in <module>
main()
File "/root/miniconda3/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/root/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/c/Users/is231191daus/algorithms/YOLO/yolo/lazy.py", line 35, in main
trainer.fit(model)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
self.fit_loop.run()
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
self.advance()
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
self.epoch_loop.run(self._data_fetcher)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
self.advance(data_fetcher)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 339, in advance
call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 222, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/root/miniconda3/lib/python3.12/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/yolo/utils/logging_utils.py", line 110, in on_train_batch_end
metrics.pop("v_num")
KeyError: 'v_num'
The config file is as follows:
# config.yaml
dataset: custom
task: train
task.data.batch_size: 8
model: v9-s
weight: False
use_wandb: False
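For reference, these values reach the program through Hydra: the traceback shows lazy.py wrapped by hydra.main, and task.data.batch_size=8 uses Hydra's dot notation to override a nested key. A minimal sketch of that mechanism follows; the config_path, config_name, and config layout here are assumptions, not the project's actual files:
# hydra_sketch.py -- assumption-laden sketch of Hydra override handling
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # the command-line override 'task.data.batch_size=8' lands here:
    print(cfg.task.data.batch_size)  # -> 8

if __name__ == "__main__":
    main()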
The system info is as follows:
- Ubuntu 22.04
- Python 3.12
- PyTorch 2.6.0+cu124
- Lightning 2.5.0
- CUDA 11.5
- YOLOv9-s
To troubleshoot this issue, we need to identify the source of the KeyError (v_num). The traceback shows that the metrics.pop("v_num") call in the on_train_batch_end hook of yolo/utils/logging_utils.py raises the error.
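Viewed in isolation, the failure is plain Python dict semantics: pop(key) with a single argument raises KeyError when the key is absent. A minimal reproduction (the metric names here are hypothetical):
# repro.py -- minimal reproduction of the failure mode
metrics = {"box_loss": 0.42, "lr": 1e-3}  # no "v_num" entry
metrics.pop("v_num")  # raises KeyError: 'v_num'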
Possible Causes
- Missing key in the metrics dictionary: the v_num key does not exist in the metrics dictionary when pop is called. In Lightning, v_num is the experiment version that the progress bar adds to its metrics only when a logger is attached, so running with use_wandb=False (no logger configured) can plausibly leave it out; see the sketch after this list.
- Typo in the key name: the metrics dictionary may hold the value under a key spelled differently from "v_num".
Note that the type of the value stored under v_num cannot trigger this error: dict.pop raises KeyError only when the key itself is absent.
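The no-logger explanation matches this run (use_wandb=False). A hedged paraphrase of the relevant Lightning behaviour is sketched below; the function name and the exact version lookup are simplifications, not the library's actual source:
# Paraphrased sketch (assumption) of how progress-bar metrics get "v_num"
def get_progress_bar_metrics(trainer):
    metrics = dict(trainer.progress_bar_metrics)  # logged losses, lr, etc.
    if trainer.loggers:  # empty list when no logger is configured
        metrics["v_num"] = trainer.loggers[0].version  # experiment version
    return metrics  # without a logger, "v_num" is never added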
Solution
To solve this issue, we need to make the pop safe when the v_num key is missing from the metrics dictionary. One option is to check for the key before popping it:
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    if "v_num" in metrics:  # guard: only pop when the key is present
        metrics.pop("v_num")
Alternatively, and more idiomatically, we can pass a default value to pop(): dict.pop returns the default instead of raising when the key is absent.
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    metrics.pop("v_num", None)  # default makes the pop a no-op when absent
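Both patterns can be checked on a plain dict (again with hypothetical metric names):
# fix_demo.py -- both safe patterns in action
metrics = {"box_loss": 0.42}
metrics.pop("v_num", None)  # returns the default None, no exception
if "v_num" in metrics:  # guard evaluates to False, pop is skipped
    metrics.pop("v_num")
print(metrics)  # {'box_loss': 0.42}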
By making either change, the callback no longer assumes that the v_num key is present in the metrics dictionary, and the KeyError is avoided.
In conclusion, the KeyError (v_num) raised when training a YOLOv9 model on a custom dataset with the Lightning framework comes down to a missing key in the metrics dictionary, most plausibly because no logger is attached when use_wandb=False. By checking for the key before popping it, or by passing a default to pop(), we can resolve the issue and train the model successfully.
Q&A: Issue with Training on Custom Dataset: KeyError (v_num)
Q: What is the issue?
A: The issue is a KeyError (v_num) that occurs when training a YOLOv9 model on a custom dataset using the Lightning framework.
Q: What is the error message?
A: The job fails with KeyError: 'v_num', raised from the metrics.pop("v_num") call at line 110 of yolo/utils/logging_utils.py inside the on_train_batch_end hook; the full traceback is reproduced earlier in this article.
Q: What are the possible causes of the KeyError (v_num)?
A: The possible causes of the KeyError (v_num) are:
- Missing key in the metrics dictionary: the v_num key does not exist in the metrics dictionary; with use_wandb=False no logger is attached, so Lightning's progress bar plausibly never adds the v_num entry.
- Typo in the key name: the metrics dictionary may hold the value under a key spelled differently from "v_num".
(The type of the value stored under v_num is irrelevant here: dict.pop raises KeyError only when the key itself is absent.)
Q: How can the KeyError (v_num) be solved?
A: To solve the KeyError (v_num) issue, you can:
- Check whether the key exists: before popping the v_num key, test for its presence in the metrics dictionary with "v_num" in metrics.
- Pass a default to pop(): calling metrics.pop("v_num", None) returns the default instead of raising when the key is absent.
Q: What does the solution code look like?
A: The solution code is as follows:
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    if "v_num" in metrics:  # guard: only pop when the key is present
        metrics.pop("v_num")
Alternatively, you can pass a default to pop():
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    metrics.pop("v_num", None)  # no-op when the key is absent
By making these changes, you can solve the KeyError (v_num) issue and ensure that your model trains successfully.