[python-package] Cannot Construct A Dataset From A CSV Using weight_column

Introduction

LightGBM is a popular open-source gradient boosting library. Creating a Dataset from a CSV file while passing both label_column and weight_column currently fails with a check failure raised from the C++ side. The failure was observed through the Python package, and it is unclear whether it is specific to Python.

Description

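The problem arises when a Dataset is created from a CSV file with both label_column and weight_column specified. The failing check compares the dataset's total feature count (num_total_features_) with the number of feature names, so the two are evidently out of sync once a weight column is involved. One hypothesis is that the weight column is excluded from one of them but not the other. The check fails inside the C++ library (dataset_loader.cpp), so the problem may not be limited to the Python package, although that has not been confirmed.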

Reproducible Example

To reproduce the issue, we can use the following code:

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

X, y = make_regression(n_features=5)
X_df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

X_df["y"] = y
X_df["w"] = np.full_like(y, fill_value=0.456)
X_df.to_csv("data.csv")

# create a Dataset
dtrain = lgb.Dataset(
    data = "data.csv",
    params = {
        "header": True,
        "label_column": "name:y",
        "weight_column": "name:w",
    },
)
dtrain.construct()

The script writes the features, label, and weights to data.csv, then builds a Dataset from that file with both label_column and weight_column set. Calling construct() fails with a fatal check in the C++ library:

[LightGBM] [Info] Using column y as label
[LightGBM] [Info] Using column w as weight
[LightGBM] [Fatal] Check failed: (dataset->num_total_features_) == (static_cast<int>(feature_names_.size())) at /Users/jlamb/repos/LightGBM/src/io/dataset_loader.cpp, line 1100 .

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/basic.py", line 2590, in construct
    self._lazy_init(
  File "/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/basic.py", line 2174, in _lazy_init
    _safe_call(
  File "/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/basic.py", line 313, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: Check failed: (dataset->num_total_features_) == (static_cast<int>(feature_names_.size())) at /Users/jlamb/repos/LightGBM/src/io/dataset_loader.cpp, line 1100 .
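
As a diagnostic aid, the following sketch lists which header columns are left once the label and weight columns are set aside. It uses only the standard library. The function name remaining_feature_columns and the small sample file are illustrative and not part of LightGBM. One thing it makes visible: pandas' to_csv() writes the index as an unnamed first column by default, so the reproducer's file has one more column than its five features. Whether that contributes to the failure is not established here.

```python
import csv
import os
import tempfile


def remaining_feature_columns(csv_path, label="y", weight="w"):
    """Return the header names left after removing the label and weight columns."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return [name for name in header if name not in (label, weight)]


# Small stand-in file with the same header layout that pandas' to_csv()
# produces by default: the unnamed index column comes first.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["", "feat_0", "feat_1", "y", "w"])
        writer.writerow([0, 1.5, -0.3, 2.0, 0.456])
    print(remaining_feature_columns(sample_path))  # prints ['', 'feat_0', 'feat_1']
```

Running remaining_feature_columns("data.csv") against the reproducer's file shows the column names a loader would have to account for after the label and weight are removed.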

Environment Info

The issue was observed on macOS (arm64) with LightGBM 4.6.0. Python 3.12 is reported, although the paths in the traceback above point to a python3.11 environment. LightGBM was built from the following commit:

https://github.com/microsoft/LightGBM/commit/6437645c4a0c17046be59e4f57d09952e2e0185f

LightGBM was built and installed from source with:

cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh install --precompile

Additional Comments

It's unclear what is happening here. One possibility is that the loader needs to treat weight_column and group_column the same way it already treats label_column, by dropping them from the list of feature names. The existing code that drops label_column is here:

https://github.com/microsoft/LightGBM/blob/6437645c4a0c17046be59e4f57d09952e2e0185f/src/io/dataset_loader.cpp#L95-L101

However, this might not be sufficient:

https://github.com/microsoft/LightGBM/blob/6437645c4a0c17046be59e4f57d09952e2e0185f/src/io/dataset_loader.cpp#L145
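
To make the hypothesis concrete, here is a toy Python model of the failing equality. It is not LightGBM's code: the function check_feature_bookkeeping and its drop_weight_from_names flag are hypothetical, and the model simply assumes that the feature count excludes both the label and the weight while the name list drops only the label.

```python
def check_feature_bookkeeping(header, label, weight=None, drop_weight_from_names=False):
    """Toy model of the failing equality check; not LightGBM's actual code."""
    # Assumption: columns reserved for metadata are not counted as features.
    reserved = {label}
    if weight is not None:
        reserved.add(weight)
    num_total_features = sum(1 for name in header if name not in reserved)

    # Assumption: the name list always drops the label. Dropping the weight
    # as well is the hypothetical fix being illustrated.
    dropped = {label}
    if drop_weight_from_names and weight is not None:
        dropped.add(weight)
    feature_names = [name for name in header if name not in dropped]

    return num_total_features == len(feature_names)


header = ["feat_0", "feat_1", "feat_2", "y", "w"]
print(check_feature_bookkeeping(header, label="y"))  # True
print(check_feature_bookkeeping(header, label="y", weight="w"))  # False
print(check_feature_bookkeeping(header, label="y", weight="w", drop_weight_from_names=True))  # True
```

Under these assumptions the check only holds when the weight column is removed from both sides, which matches the shape of the reported failure but does not prove it is the actual cause.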

Conclusion

In short, creating a Dataset from a CSV file with both label_column and weight_column set fails a consistency check on the C++ side. The failure was observed through the Python package, but it is unclear whether it is specific to Python, and the root cause has not yet been confirmed.

Possible Solutions

  1. Drop weight_column and group_column from the list of feature names, the same way label_column is already dropped.
  2. Make sure weight_column is subtracted from the feature count consistently, so the count matches the list of feature names.
  3. Investigate the C++ loader (dataset_loader.cpp) to confirm why the check fails.

Future Work

Resolving this issue will likely involve:

  1. Investigating the C++ code to confirm why the check fails.
  2. Modifying the code so that the feature count and the list of feature names both account for weight_column.
  3. Testing the modified code to ensure that the issue is resolved.

Q&A

Q: What is the issue with creating a Dataset from a CSV file and passing both label_column and weight_column?

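A: The root cause is not confirmed. The failing check compares the total feature count with the number of feature names, and one hypothesis is that weight_column is not removed from the list of feature names the way label_column is. The result is a check failure raised from the C++ side.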

Q: Is this issue specific to the Python package?

A: It's unclear. The failure was observed through the Python package, but the check fails inside the C++ library, so other interfaces may be affected as well.

Q: What is the reproducible example for this issue?

A: It is the script in the Reproducible Example section above: it writes five features, a label column y, and a weight column w to data.csv, then builds a Dataset from that file with label_column set to "name:y" and weight_column set to "name:w", and calls construct().

Q: What is the error message for this issue?

A: Construction fails with a fatal check in the C++ library, surfaced in Python as lightgbm.basic.LightGBMError. The full log and traceback are in the Reproducible Example section above. The key line is:

Check failed: (dataset->num_total_features_) == (static_cast<int>(feature_names_.size())) at /Users/jlamb/repos/LightGBM/src/io/dataset_loader.cpp, line 1100 .

Q: What is the environment info for this issue?

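A: macOS on arm64, Python 3.12, and LightGBM 4.6.0 built from commit 6437645c4a0c17046be59e4f57d09952e2e0185f. LightGBM was built and installed from source with: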

cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh install --precompile

Q: What are the possible solutions for this issue?

A: The possible solutions are:

  1. Drop weight_column and group_column from the list of feature names, the same way label_column is already dropped.
  2. Make sure weight_column is subtracted from the feature count consistently, so the count matches the list of feature names.
  3. Investigate the C++ loader (dataset_loader.cpp) to confirm why the check fails.

Q: What is the future work for this issue?

A: The future work is to:

  1. Investigate the C++ code to understand why the assertion error is encountered.
  2. Modify the code to correctly subtract the weight_column from the count of the number of features.
  3. Test the modified code to ensure that the issue is resolved.
