Quantile Regression with LightGBM

I tried out quantile regression with LightGBM.

from google.colab import files
files.upload()  #probspace_convini.zip

!unzip probspace_convini.zip
Saving probspace_convini.zip to probspace_convini.zip
Archive:  probspace_convini.zip
  inflating: convini_submission.csv  
  inflating: convini_test_data.csv   
  inflating: convini_train_data.csv  
import warnings
warnings.simplefilter('ignore')

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
from lightgbm import LGBMRegressor
train = pd.read_csv('convini_train_data.csv', index_col=0)
test = pd.read_csv('convini_test_data.csv', index_col=0)
Using ice1 as the example:
x_train, x_valid, y_train, y_valid = train_test_split(
    train[['highest', 'lowest', 'rain']], train['ice1'], test_size=0.3, random_state=0)
lgb = LGBMRegressor(
    objective='quantile',   # quantile regression
    alpha=0.1,  # corresponds to q in the tutorial
    n_estimators=10000,
    max_depth=5,
    colsample_bytree=0.5,
    random_state=0)
lgb.fit(x_train, y_train, eval_set=(x_valid, y_valid), early_stopping_rounds=100, verbose=50)
Training until validation scores don't improve for 100 rounds.
[50]	valid_0's quantile: 2.29425
[100]	valid_0's quantile: 1.92121
[150]	valid_0's quantile: 2.08282
Early stopping, best iteration is:
[77]	valid_0's quantile: 1.89044
LGBMRegressor(alpha=0.1, colsample_bytree=0.5, max_depth=5, n_estimators=10000,
              objective='quantile', random_state=0)
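
Note: the early_stopping_rounds and verbose keyword arguments passed to fit() above were removed in newer LightGBM releases (v4 and later). On a recent version the equivalent call uses callbacks; a minimal sketch, not the call that produced the output above:

import lightgbm

lgb.fit(
    x_train, y_train,
    eval_set=[(x_valid, y_valid)],
    callbacks=[
        lightgbm.early_stopping(stopping_rounds=100),  # stop if no improvement for 100 rounds
        lightgbm.log_evaluation(period=50),            # log the eval metric every 50 iterations
    ])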

Run the model with q set to each of the values specified in the evaluation metric.

quantiles = [0.01, 0.1, 0.5, 0.9, 0.99]
lgb_scores = []
oof = np.zeros((len(x_valid), len(quantiles)))

for i, q in enumerate(quantiles):
    lgb = LGBMRegressor(
        objective='quantile',
        alpha=q,
        n_estimators=10000,
        max_depth=5,
        colsample_bytree=0.5,
        random_state=0)
    lgb.fit(x_train, y_train, eval_set=(x_valid, y_valid), early_stopping_rounds=100, verbose=False)
    lgb_scores.append(lgb.best_score_['valid_0']['quantile'])  # best validation score
    oof[:,i] = lgb.predict(x_valid)

print(lgb_scores)
[0.23401825926240563, 1.890435292153683, 6.512968622995786, 6.5805607036590565, 5.17318563153584]
oof_df = pd.DataFrame(y_valid).reset_index(drop=True)
for i, q in enumerate(quantiles):
    oof_df['oof_'+str(q)] = oof[:,i]
oof_df.sample(n=5)
    ice1   oof_0.01    oof_0.1    oof_0.5    oof_0.9   oof_0.99
6     22  19.612281  19.784028  21.397135  24.267949  27.741964
21    18  13.733374  14.060220  15.360082  13.212091  27.741964
51    19  17.307507  18.008099  20.886427  24.908968  40.119255
78    17  12.964339  14.259036  15.861713  17.307589  40.119255
56    24  24.486111  24.151860  26.276319  40.019864  30.791935

LightGBM's best_score_ agrees almost exactly with scikit-learn's mean_pinball_loss.

pinball_sklearn = [mean_pinball_loss(oof_df['ice1'], oof_df['oof_'+str(q)], alpha=q) for q in quantiles]
print(pinball_sklearn)
[0.2340182592624057, 1.8904352921536827, 6.5129686229957855, 6.580560703659057, 5.17318563153584]
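
For reference, the pinball (quantile) loss that both libraries compute averages q * (y - yhat) when the model under-predicts and (1 - q) * (yhat - y) when it over-predicts. A minimal NumPy sketch of the same calculation on the validation predictions above (pinball_loss is just an illustrative helper name):

def pinball_loss(y_true, y_pred, q):
    diff = y_true - y_pred  # positive when the model under-predicts
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

print([pinball_loss(oof_df['ice1'].values, oof_df['oof_'+str(q)].values, q) for q in quantiles])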

Predicted vs. actual values for each value of q

(Note) The x-axis shows the actual values; the red line is y = x.

fig, ax = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(16, 9))
for i, q in enumerate(quantiles):
    axy, axx = divmod(i, 3)
    sns.scatterplot(y=f"oof_{q}", x='ice1', data=oof_df, ax=ax[axy, axx])
    ax[axy, axx].set_title(f"oof_{q}")
    ax[axy, axx].set_ylabel('')
    ax[axy, axx].plot([0, 250], [0, 250], color='red')
plt.delaxes(ax=ax[1, 2])

The larger q is, the higher the proportion of rows where the prediction exceeds the actual value.
It is a bit puzzling that, for each value of q, the predictions seem to hit some upper bound.
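
As a quick sanity check (a sketch, not part of the original notebook): a well-calibrated q-quantile model should have roughly a fraction q of the actual values at or below its predictions, which matches the tendency noted above.

for q in quantiles:
    coverage = (oof_df['ice1'] <= oof_df['oof_'+str(q)]).mean()  # share of actuals at or below the q-quantile prediction
    print(f"q={q}: coverage = {coverage:.2f}")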

Attached data

  • convini_LGB_sample.ipynb