Let's predict how many units of each product will sell from weather simulation data!
Akahachi
I tried quantile regression with LightGBM.
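For reference, quantile regression minimizes the pinball loss instead of the squared error: under-prediction is penalized with weight q and over-prediction with weight 1 - q, so larger q pushes predictions upward. A minimal numpy sketch of the loss (the function name pinball_loss is my own, not from LightGBM):

import numpy as np

def pinball_loss(y_true, y_pred, q):
    # max(q * diff, (q - 1) * diff) picks q * diff when under-predicting
    # (diff > 0) and (1 - q) * |diff| when over-predicting (diff < 0).
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))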
from google.colab import files
files.upload()  # probspace_convini.zip
!unzip probspace_convini.zip
Saving probspace_convini.zip to probspace_convini.zip
Archive:  probspace_convini.zip
  inflating: convini_submission.csv
  inflating: convini_test_data.csv
  inflating: convini_train_data.csv
import warnings
warnings.simplefilter('ignore')

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss
from lightgbm import LGBMRegressor
train = pd.read_csv('convini_train_data.csv', index_col=0)
test = pd.read_csv('convini_test_data.csv', index_col=0)
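A quick sanity check of the load (the columns listed here are just the ones used in the modeling below):

print(train.shape, test.shape)
print(train[['highest', 'lowest', 'rain', 'ice1']].describe())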
x_train, x_valid, y_train, y_valid = train_test_split(
    train[['highest', 'lowest', 'rain']], train['ice1'],
    test_size=0.3, random_state=0)
lgb = LGBMRegressor(
    objective='quantile',  # quantile regression
    alpha=0.1,             # corresponds to q in the tutorial
    n_estimators=10000,
    max_depth=5,
    colsample_bytree=0.5,
    random_state=0)
lgb.fit(x_train, y_train,
        eval_set=(x_valid, y_valid),
        early_stopping_rounds=100,
        verbose=50)
Training until validation scores don't improve for 100 rounds.
[50]  valid_0's quantile: 2.29425
[100] valid_0's quantile: 1.92121
[150] valid_0's quantile: 2.08282
Early stopping, best iteration is:
[77]  valid_0's quantile: 1.89044
LGBMRegressor(alpha=0.1, colsample_bytree=0.5, max_depth=5, n_estimators=10000, objective='quantile', random_state=0)
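As an aside, once early stopping has fired, predict() in the scikit-learn API uses best_iteration_ by default, so test predictions for this q = 0.1 model can be made directly (a sketch, not part of the original run):

# Predict the 0.1 quantile on the test set; best_iteration_ is applied automatically.
pred_q01 = lgb.predict(test[['highest', 'lowest', 'rain']])
print(pred_q01[:5])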
Next, run with q set to each of the values given in the evaluation metric.
quantiles = [0.01, 0.1, 0.5, 0.9, 0.99]
lgb_scores = []
oof = np.zeros(len(x_valid)*len(quantiles)).reshape(len(x_valid), len(quantiles))

for i, q in enumerate(quantiles):
    lgb = LGBMRegressor(
        objective='quantile',
        alpha=q,
        n_estimators=10000,
        max_depth=5,
        colsample_bytree=0.5,
        random_state=0)
    lgb.fit(x_train, y_train,
            eval_set=(x_valid, y_valid),
            early_stopping_rounds=100,
            verbose=False)
    lgb_scores.append(lgb.best_score_['valid_0']['quantile'])  # best validation score
    oof[:, i] = lgb.predict(x_valid)

print(lgb_scores)
[0.23401825926240563, 1.890435292153683, 6.512968622995786, 6.5805607036590565, 5.17318563153584]
oof_df = pd.DataFrame(y_valid).reset_index(drop=True)
for i, q in enumerate(quantiles):
    oof_df['oof_' + str(q)] = oof[:, i]
oof_df.sample(n=5)
LightGBM's best_score_ almost exactly matched scikit-learn's mean_pinball_loss.
pinball_sklearn = [mean_pinball_loss(oof_df['ice1'], oof_df['oof_' + str(q)], alpha=q)
                   for q in quantiles]
print(pinball_sklearn)
[0.2340182592624057, 1.8904352921536827, 6.5129686229957855, 6.580560703659057, 5.17318563153584]
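The match is expected: LightGBM's 'quantile' eval metric is simply the mean pinball loss. Recomputing it by hand for q = 0.5 (a sketch using the columns above) gives the same number:

q = 0.5
diff = oof_df['ice1'] - oof_df['oof_0.5']
print(np.mean(np.maximum(q * diff, (q - 1) * diff)))  # should match lgb_scores[2]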
Predictions vs. actual values for each value of q
(Note) The x-axis shows the actual values; the red line is y = x.
fig, ax = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(16, 9))
for i, q in enumerate(quantiles):
    axy, axx = divmod(i, 3)
    sns.scatterplot(y=f"oof_{q}", x='ice1', data=oof_df, ax=ax[axy, axx])
    ax[axy, axx].set_title(f"oof_{q}")
    ax[axy, axx].set_ylabel('')
    ax[axy, axx].plot([0, 250], [0, 250], color='red')
plt.delaxes(ax=ax[1, 2])
The larger q is, the higher the proportion of points where the prediction exceeds the actual value. It is a bit of a mystery that the predictions seem to have an upper bound for each value of q.
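One plausible explanation for the ceiling: a GBDT prediction is a sum of learned leaf values, so it cannot extrapolate beyond the target range represented in the training leaves. A quick diagnostic (a sketch) is to compare each quantile's largest out-of-fold prediction with the largest actual:

# Largest prediction per quantile vs. the largest observed value of ice1.
for q in quantiles:
    print(q, oof_df[f'oof_{q}'].max(), oof_df['ice1'].max())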