Issue with Training on Custom Dataset: KeyError (v_num)
Training a model on a custom dataset can be challenging, especially with a complex architecture like YOLOv9. In this article, we explore a KeyError (v_num) raised when training a YOLOv9 model on a custom dataset using the Lightning framework.
The error message is as follows:
Error executing job with overrides: ['dataset=custom', 'task=train', 'task.data.batch_size=8', 'model=v9-s', 'weight=False', 'use_wandb=False']
Traceback (most recent call last):
File "/mnt/c/Users/is231191daus/algorithms/YOLO/yolo/lazy.py", line 45, in <module>
main()
File "/root/miniconda3/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/root/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/c/Users/is231191daus/algorithms/YOLO/yolo/lazy.py", line 35, in main
trainer.fit(model)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
self.fit_loop.run()
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
self.advance()
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
self.epoch_loop.run(self._data_fetcher)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
self.advance(data_fetcher)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 339, in advance
call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
File "/root/miniconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 222, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/root/miniconda3/lib/python3.12/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/yolo/utils/logging_utils.py", line 110, in on_train_batch_end
metrics.pop("v_num")
KeyError: 'v_num'
The config file is as follows:
# config.yaml
dataset: custom
task: train
task.data.batch_size: 8
model: v9-s
weight: False
use_wandb: False
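For reference, these values reach the program through Hydra: the traceback shows lazy.py wrapped by hydra.main, and task.data.batch_size=8 uses Hydra's dot notation to override a nested key. A minimal sketch of that mechanism follows; the config_path, config_name, and config layout here are assumptions, not the project's actual files:
# hydra_sketch.py -- assumption-laden sketch of Hydra override handling
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # the command-line override 'task.data.batch_size=8' lands here:
    print(cfg.task.data.batch_size)  # -> 8

if __name__ == "__main__":
    main()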
The system info is as follows:
- Ubuntu 22.04
- Python 3.12
- PyTorch 2.6.0+cu124
- Lightning 2.5.0
- CUDA 11.5
- YOLOv9-s
To troubleshoot this issue, we need to identify the source of the KeyError (v_num). The traceback shows that the metrics.pop("v_num") call in the on_train_batch_end hook of yolo/utils/logging_utils.py raises the error.
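Viewed in isolation, the failure is plain Python dict semantics: pop(key) with a single argument raises KeyError when the key is absent. A minimal reproduction (the metric names here are hypothetical):
# repro.py -- minimal reproduction of the failure mode
metrics = {"box_loss": 0.42, "lr": 1e-3}  # no "v_num" entry
metrics.pop("v_num")  # raises KeyError: 'v_num'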
Possible Causes
- Missing key in the metrics dictionary: the v_num key does not exist in the metrics dictionary when pop is called. In Lightning, v_num is the experiment version that the progress bar adds to its metrics only when a logger is attached, so running with use_wandb=False (no logger configured) can plausibly leave it out; see the sketch after this list.
- Typo in the key name: the metrics dictionary may hold the value under a key spelled differently from "v_num".
Note that the type of the value stored under v_num cannot trigger this error: dict.pop raises KeyError only when the key itself is absent.
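The no-logger explanation matches this run (use_wandb=False). A hedged paraphrase of the relevant Lightning behaviour is sketched below; the function name and the exact version lookup are simplifications, not the library's actual source:
# Paraphrased sketch (assumption) of how progress-bar metrics get "v_num"
def get_progress_bar_metrics(trainer):
    metrics = dict(trainer.progress_bar_metrics)  # logged losses, lr, etc.
    if trainer.loggers:  # empty list when no logger is configured
        metrics["v_num"] = trainer.loggers[0].version  # experiment version
    return metrics  # without a logger, "v_num" is never added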
Solution
To solve this issue, we need to make the pop safe when the v_num key is missing from the metrics dictionary. One option is to check for the key before popping it:
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    if "v_num" in metrics:  # guard: only pop when the key is present
        metrics.pop("v_num")
Alternatively, and more idiomatically, we can pass a default value to pop(): dict.pop returns the default instead of raising when the key is absent.
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    metrics.pop("v_num", None)  # default makes the pop a no-op when absent
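Both patterns can be checked on a plain dict (again with hypothetical metric names):
# fix_demo.py -- both safe patterns in action
metrics = {"box_loss": 0.42}
metrics.pop("v_num", None)  # returns the default None, no exception
if "v_num" in metrics:  # guard evaluates to False, pop is skipped
    metrics.pop("v_num")
print(metrics)  # {'box_loss': 0.42}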
By making either change, the callback no longer assumes that the v_num key is present in the metrics dictionary, and the KeyError is avoided.
In conclusion, the KeyError (v_num) raised when training a YOLOv9 model on a custom dataset with the Lightning framework comes down to a missing key in the metrics dictionary, most plausibly because no logger is attached when use_wandb=False. By checking for the key before popping it, or by passing a default to pop(), we can resolve the issue and train the model successfully.
Q&A: Issue with Training on Custom Dataset: KeyError (v_num)
Q: What is the issue?
A: The issue is a KeyError (v_num) that occurs when training a YOLOv9 model on a custom dataset using the Lightning framework.
Q: What is the error message?
A: The job fails with KeyError: 'v_num', raised from the metrics.pop("v_num") call at line 110 of yolo/utils/logging_utils.py inside the on_train_batch_end hook; the full traceback is reproduced earlier in this article.
Q: What are the possible causes of the KeyError (v_num)?
A: The possible causes of the KeyError (v_num) are:
- Missing key in the metrics dictionary: the v_num key does not exist in the metrics dictionary; with use_wandb=False no logger is attached, so Lightning's progress bar plausibly never adds the v_num entry.
- Typo in the key name: the metrics dictionary may hold the value under a key spelled differently from "v_num".
(The type of the value stored under v_num is irrelevant here: dict.pop raises KeyError only when the key itself is absent.)
Q: How can the KeyError (v_num) be solved?
A: To solve the KeyError (v_num) issue, you can:
- Check whether the key exists: before popping the v_num key, test for its presence in the metrics dictionary with "v_num" in metrics.
- Pass a default to pop(): calling metrics.pop("v_num", None) returns the default instead of raising when the key is absent.
Q: What does the solution code look like?
A: The solution code is as follows:
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    if "v_num" in metrics:  # guard: only pop when the key is present
        metrics.pop("v_num")
Alternatively, you can pass a default to pop():
# yolo/utils/logging_utils.py
def on_train_batch_end(trainer, lightning_module, batch_output, batch, batch_idx):
    metrics = batch_output["metrics"]
    metrics.pop("v_num", None)  # no-op when the key is absent
By making these changes, you can solve the KeyError (v_num) issue and ensure that your model trains successfully.