Quantile Regression with LightGBM

I tried out quantile regression with LightGBM.

from google.colab import files
files.upload()  #probspace_convini.zip

!unzip probspace_convini.zip
Saving probspace_convini.zip to probspace_convini.zip
Archive:  probspace_convini.zip
  inflating: convini_submission.csv  
  inflating: convini_test_data.csv   
  inflating: convini_train_data.csv  
import warnings
warnings.simplefilter('ignore')

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
from lightgbm import LGBMRegressor
train = pd.read_csv('convini_train_data.csv', index_col=0)
test = pd.read_csv('convini_test_data.csv', index_col=0)
Using ice1 as the example:
x_train, x_valid, y_train, y_valid = train_test_split(
    train[['highest', 'lowest', 'rain']], train['ice1'], test_size=0.3, random_state=0)
lgb = LGBMRegressor(
    objective='quantile',   # quantile regression
    alpha=0.1,  # corresponds to q in the tutorial
    n_estimators=10000,
    max_depth=5,
    colsample_bytree=0.5,
    random_state=0)
lgb.fit(x_train, y_train, eval_set=(x_valid, y_valid), early_stopping_rounds=100, verbose=50)
Training until validation scores don't improve for 100 rounds.
[50]	valid_0's quantile: 2.29425
[100]	valid_0's quantile: 1.92121
[150]	valid_0's quantile: 2.08282
Early stopping, best iteration is:
[77]	valid_0's quantile: 1.89044
LGBMRegressor(alpha=0.1, colsample_bytree=0.5, max_depth=5, n_estimators=10000,
              objective='quantile', random_state=0)
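
Note: the early_stopping_rounds and verbose keyword arguments passed to fit() above were removed in newer LightGBM releases (v4 and later). On a recent version the equivalent call uses callbacks; a minimal sketch, not the call that produced the output above:

import lightgbm

lgb.fit(
    x_train, y_train,
    eval_set=[(x_valid, y_valid)],
    callbacks=[
        lightgbm.early_stopping(stopping_rounds=100),  # stop if no improvement for 100 rounds
        lightgbm.log_evaluation(period=50),            # log the eval metric every 50 iterations
    ])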

Run the model with q set to each of the values specified in the evaluation metric.

quantiles = [0.01, 0.1, 0.5, 0.9, 0.99]
lgb_scores = []
oof = np.zeros((len(x_valid), len(quantiles)))

for i, q in enumerate(quantiles):
    lgb = LGBMRegressor(
        objective='quantile',
        alpha=q,
        n_estimators=10000,
        max_depth=5,
        colsample_bytree=0.5,
        random_state=0)
    lgb.fit(x_train, y_train, eval_set=(x_valid, y_valid), early_stopping_rounds=100, verbose=False)
    lgb_scores.append(lgb.best_score_['valid_0']['quantile'])  # best validation score
    oof[:,i] = lgb.predict(x_valid)

print(lgb_scores)
[0.23401825926240563, 1.890435292153683, 6.512968622995786, 6.5805607036590565, 5.17318563153584]
oof_df = pd.DataFrame(y_valid).reset_index(drop=True)
for i, q in enumerate(quantiles):
    oof_df['oof_'+str(q)] = oof[:,i]
oof_df.sample(n=5)
    ice1   oof_0.01    oof_0.1    oof_0.5    oof_0.9   oof_0.99
6     22  19.612281  19.784028  21.397135  24.267949  27.741964
21    18  13.733374  14.060220  15.360082  13.212091  27.741964
51    19  17.307507  18.008099  20.886427  24.908968  40.119255
78    17  12.964339  14.259036  15.861713  17.307589  40.119255
56    24  24.486111  24.151860  26.276319  40.019864  30.791935

LightGBM's best_score_ agrees almost exactly with scikit-learn's mean_pinball_loss.

pinball_sklearn = [mean_pinball_loss(oof_df['ice1'], oof_df['oof_'+str(q)], alpha=q) for q in quantiles]
print(pinball_sklearn)
[0.2340182592624057, 1.8904352921536827, 6.5129686229957855, 6.580560703659057, 5.17318563153584]
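
For reference, the pinball (quantile) loss that both libraries compute averages q * (y - yhat) when the model under-predicts and (1 - q) * (yhat - y) when it over-predicts. A minimal NumPy sketch of the same calculation on the validation predictions above (pinball_loss is just an illustrative helper name):

def pinball_loss(y_true, y_pred, q):
    diff = y_true - y_pred  # positive when the model under-predicts
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

print([pinball_loss(oof_df['ice1'].values, oof_df['oof_'+str(q)].values, q) for q in quantiles])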

Predicted vs. actual values for each value of q

(Note) The x-axis shows the actual values; the red line is y = x.

fig, ax = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(16, 9))
for i, q in enumerate(quantiles):
    axy, axx = divmod(i, 3)
    sns.scatterplot(y=f"oof_{q}", x='ice1', data=oof_df, ax=ax[axy, axx])
    ax[axy, axx].set_title(f"oof_{q}")
    ax[axy, axx].set_ylabel('')
    ax[axy, axx].plot([0, 250], [0, 250], color='red')
plt.delaxes(ax=ax[1, 2])

The larger q is, the higher the proportion of rows where the prediction exceeds the actual value.
It is a bit puzzling that, for each value of q, the predictions seem to hit some upper bound.
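
As a quick sanity check (a sketch, not part of the original notebook): a well-calibrated q-quantile model should have roughly a fraction q of the actual values at or below its predictions, which matches the tendency noted above.

for q in quantiles:
    coverage = (oof_df['ice1'] <= oof_df['oof_'+str(q)]).mean()  # share of actuals at or below the q-quantile prediction
    print(f"q={q}: coverage = {coverage:.2f}")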

Attached data

  • convini_LGB_sample.ipynb