Akahachi
I tried quantile regression with LightGBM.
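For reference, objective='quantile' makes LightGBM minimize the pinball loss for the quantile level alpha,

$$
L_\alpha(y, \hat{y}) =
\begin{cases}
\alpha\,(y - \hat{y}) & \text{if } y \ge \hat{y} \\
(1 - \alpha)\,(\hat{y} - y) & \text{if } y < \hat{y},
\end{cases}
$$

so a larger alpha penalizes under-prediction more heavily and pushes the predicted quantile upward.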
from google.colab import files
files.upload() #probspace_convini.zip
!unzip probspace_convini.zip
Saving probspace_convini.zip to probspace_convini.zip
Archive:  probspace_convini.zip
  inflating: convini_submission.csv
  inflating: convini_test_data.csv
  inflating: convini_train_data.csv
import warnings
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
from lightgbm import LGBMRegressor
train = pd.read_csv('convini_train_data.csv', index_col=0)
test = pd.read_csv('convini_test_data.csv', index_col=0)
x_train, x_valid, y_train, y_valid = train_test_split(
    train[['highest', 'lowest', 'rain']], train['ice1'], test_size=0.3, random_state=0)
lgb = LGBMRegressor(
    objective='quantile',  # quantile regression
    alpha=0.1,             # corresponds to q in the tutorial
    n_estimators=1000,
    max_depth=5,
    colsample_bytree=0.5,
    random_state=0)
lgb.fit(x_train, y_train, eval_set=(x_valid, y_valid), early_stopping_rounds=100, verbose=50)
Training until validation scores don't improve for 100 rounds.
[50] valid_0's quantile: 2.29425
[100] valid_0's quantile: 1.92121
[150] valid_0's quantile: 2.08282
Early stopping, best iteration is:
[77] valid_0's quantile: 1.89044
LGBMRegressor(alpha=0.1, colsample_bytree=0.5, max_depth=5, n_estimators=10000, objective='quantile', random_state=0)
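As a quick sanity check (a small sketch of my own, reusing the lgb, x_valid, and y_valid defined above): a model fit with alpha=0.1 targets the 10th percentile, so roughly 10% of the validation actuals should fall below its predictions (with a validation set this small, the ratio will be noisy).

# fraction of validation rows whose actual value falls below the predicted 10th percentile
# (ideally close to 0.1 for a well-calibrated quantile model)
coverage = (y_valid.values < lgb.predict(x_valid)).mean()
print(coverage)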
Next, I ran the same model with q set to each of the values used in the evaluation metric.
quantiles = [0.01, 0.1, 0.5, 0.9, 0.99]
lgb_scores = []
oof = np.zeros(len(x_valid)*len(quantiles)).reshape(len(x_valid), len(quantiles))
for i, q in enumerate(quantiles):
    lgb = LGBMRegressor(
        objective='quantile',
        alpha=q,
        n_estimators=10000,
        max_depth=5,
        colsample_bytree=0.5,
        random_state=0)
    lgb.fit(x_train, y_train, eval_set=(x_valid, y_valid), early_stopping_rounds=100, verbose=False)
    lgb_scores.append(lgb.best_score_['valid_0']['quantile'])  # best score on the validation set
    oof[:, i] = lgb.predict(x_valid)
print(lgb_scores)
[0.23401825926240563, 1.890435292153683, 6.512968622995786, 6.5805607036590565, 5.17318563153584]
oof_df = pd.DataFrame(y_valid).reset_index(drop=True)
for i, q in enumerate(quantiles):
    oof_df['oof_' + str(q)] = oof[:, i]
oof_df.sample(n=5)
| | ice1 | oof_0.01 | oof_0.1 | oof_0.5 | oof_0.9 | oof_0.99 |
|---|---|---|---|---|---|---|
| 6 | 22 | 19.612281 | 19.784028 | 21.397135 | 24.267949 | 27.741964 |
| 21 | 18 | 13.733374 | 14.060220 | 15.360082 | 13.212091 | 27.741964 |
| 51 | 19 | 17.307507 | 18.008099 | 20.886427 | 24.908968 | 40.119255 |
| 78 | 17 | 12.964339 | 14.259036 | 15.861713 | 17.307589 | 40.119255 |
| 56 | 24 | 24.486111 | 24.151860 | 26.276319 | 40.019864 | 30.791935 |
LightGBM's best_score_ agreed almost exactly with scikit-learn's mean_pinball_loss.
pinball_sklearn = [mean_pinball_loss(oof_df['ice1'], oof_df['oof_'+str(q)], alpha=q) for q in quantiles]
print(pinball_sklearn)
[0.2340182592624057, 1.8904352921536827, 6.5129686229957855, 6.580560703659057, 5.17318563153584]
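Both sets of numbers are the same pinball loss, so a hand-rolled version (a minimal sketch of my own, using the oof_df and quantiles built above) should reproduce the list as well:

def pinball(y_true, y_pred, q):
    # pinball loss: under-prediction weighted by q, over-prediction by (1 - q)
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

pinball_manual = [pinball(oof_df['ice1'].values, oof_df['oof_' + str(q)].values, q) for q in quantiles]
print(pinball_manual)  # should match pinball_sklearn above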
Predicted vs. actual values for each value of q
(Note) The x-axis shows the actual values; the red line is y = x.
fig, ax = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(16, 9))
for i, q in enumerate(quantiles):
    axy, axx = divmod(i, 3)
    sns.scatterplot(y=f"oof_{q}", x='ice1', data=oof_df, ax=ax[axy, axx])
    ax[axy, axx].set_title(f"oof_{q}")
    ax[axy, axx].set_ylabel('')
    ax[axy, axx].plot([0, 250], [0, 250], color='red')  # y = x reference line
plt.delaxes(ax=ax[1, 2])  # only 5 quantiles, so drop the unused 6th panel
The larger q is, the higher the proportion of rows where the prediction exceeds the actual value.
It is a bit of a mystery that the predictions seem to have an upper bound for each value of q.
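As a quick numeric check of the first observation (my own addition, again using oof_df and quantiles from above), the fraction of validation rows where the prediction exceeds the actual value should rise with q:

for q in quantiles:
    over = (oof_df['oof_' + str(q)] > oof_df['ice1']).mean()
    print(f"q={q}: prediction > actual in {over:.0%} of validation rows")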