What are the key points to master as a YouTuber?
chizuchizu
First, we reshape the data for training.
import pandas as pd
import numpy as np
path = "../data/01_raw/"  # change this to your own path
train = pd.read_csv(path + "train_data.csv")
test = pd.read_csv(path + "test_data.csv")
print(train.shape, test.shape)
train.head()
(19720, 17) (29582, 16)
Concatenating train and test makes preprocessing easier (the same operations don't have to be repeated for each).
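The concat-then-split pattern described here can be condensed into a toy sketch (made-up numbers, not the competition data):

```python
import pandas as pd

# Preprocess train and test together, then slice them apart again
# using the length of the training set.
train = pd.DataFrame({"x": [1, 2, 3]})
test = pd.DataFrame({"x": [4, 5]})

data = pd.concat([train, test])
data["x2"] = data["x"] * 2          # any shared preprocessing, done once

train2 = data.iloc[:len(train)]
test2 = data.iloc[len(train):]
print(train2.shape, test2.shape)    # (3, 2) (2, 2)
```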
target = train["y"]
del train["y"]
data = pd.concat([train, test])
print(data.shape)  # 19720 + 29582 = 49302
data.head()
(49302, 16)
Next, we parse the date columns.
data["publishedAt"] = pd.to_datetime(data["publishedAt"])
data["year"] = data["publishedAt"].dt.year
data["month"] = data["publishedAt"].dt.month
data["day"] = data["publishedAt"].dt.day
data["hour"] = data["publishedAt"].dt.hour
data["minute"] = data["publishedAt"].dt.minute
data["collection_date"] = "20" + data["collection_date"]
data["collection_date"] = pd.to_datetime(data["collection_date"], format="%Y.%d.%m")
data["c_year"] = data["collection_date"].dt.year
data["c_month"] = data["collection_date"].dt.month
data["c_day"] = data["collection_date"].dt.day
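The format string `"%Y.%d.%m"` implies that collection_date uses an unusual "yy.dd.mm" layout. A toy check (the dates below are made up):

```python
import pandas as pd

# "20.15.04" with "20" prepended becomes "2020.15.04",
# i.e. 15 April 2020 under the "%Y.%d.%m" format.
s = pd.Series(["20.15.04", "20.01.12"])
parsed = pd.to_datetime("20" + s, format="%Y.%d.%m")
print(parsed.dt.day.tolist())    # [15, 1]
print(parsed.dt.month.tolist())  # [4, 12]
```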
We add the number of tags as a feature.
data["length_tags"] = data["tags"].astype(str).apply(lambda x: len(x.split("|")))
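A small check of this feature on made-up tags: tags are "|"-separated, so the count is the number of split pieces. Note that a missing value becomes the string "None" after astype(str) and is therefore counted as 1 tag, not 0.

```python
import pandas as pd

tags = pd.Series(["music|live|2020", "vlog", None])
length_tags = tags.astype(str).apply(lambda x: len(x.split("|")))
print(length_tags.tolist())  # [3, 1, 1]
```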
data = data.drop(["channelId", "video_id", "publishedAt", "thumbnail_link",
                  "channelTitle", "collection_date", "id", "tags",
                  "description", "title"], axis=1)
data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49302 entries, 0 to 29581
Data columns (total 15 columns):
categoryId           49302 non-null int64
likes                49302 non-null int64
dislikes             49302 non-null int64
comment_count        49302 non-null int64
comments_disabled    49302 non-null bool
ratings_disabled     49302 non-null bool
year                 49302 non-null int64
month                49302 non-null int64
day                  49302 non-null int64
hour                 49302 non-null int64
minute               49302 non-null int64
c_year               49302 non-null int64
c_month              49302 non-null int64
c_day                49302 non-null int64
length_tags          49302 non-null int64
dtypes: bool(2), int64(13)
memory usage: 5.4 MB
We train LightGBM with RMSE as the loss function.
RMSLE can be computed with scikit-learn.
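Concretely (toy numbers, not competition data): `mean_squared_log_error` returns the *mean squared* log error, so take the square root to get RMSLE. Internally it uses log1p, which the manual computation below reproduces.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0, 1000.0, 10000.0])
y_pred = np.array([120.0, 900.0, 11000.0])

rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
assert np.isclose(rmsle, manual)
```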
import gc
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error
import lightgbm as lgb
# Split back into train and test
train = data.iloc[:len(target), :]
test = data.iloc[len(target):, :]
train.shape, test.shape
((19720, 15), (29582, 15))
# K-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=2020)
# Accumulator for the CV score
score = 0
# Accumulator for the test-set predictions
pred = np.zeros(test.shape[0])
Taking the log of the target gives a more stable model (RMSLE improves from about 2.1 to about 0.8).
It depends on the shape of the distribution, but this is effective when the data is heavily skewed.
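A toy illustration of why this helps (simulated data, not the competition's): a heavy-tailed target such as view counts has mean far above its median, and a log transform makes it roughly symmetric, which suits a squared-error objective.

```python
import numpy as np

rng = np.random.default_rng(2020)
views = np.exp(rng.normal(8.0, 2.0, size=100_000))  # log-normal "view counts"
print(f"raw: mean={views.mean():.0f}, median={np.median(views):.0f}")
print(f"log: mean={np.log(views).mean():.2f}, median={np.median(np.log(views)):.2f}")
```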
target = np.log(target)
params = {
    'boosting_type': 'gbdt',
    'metric': 'rmse',
    'objective': 'regression',
    'seed': 20,
    'learning_rate': 0.01,
    "n_jobs": -1,
    "verbose": -1
}
for tr_idx, val_idx in cv.split(train):
    x_train, x_val = train.iloc[tr_idx], train.iloc[val_idx]
    y_train, y_val = target[tr_idx], target[val_idx]

    # Wrap in Dataset objects for training
    train_set = lgb.Dataset(x_train, y_train)
    val_set = lgb.Dataset(x_val, y_val, reference=train_set)

    # Training
    model = lgb.train(params,
                      train_set,
                      num_boost_round=8000,
                      early_stopping_rounds=100,
                      valid_sets=[train_set, val_set],
                      verbose_eval=500)

    # Undo the log transform with exp after predicting
    test_pred = np.exp(model.predict(test))
    # Clip values below 0, which would make RMSLE fail
    # (exp output is already positive, so this is only a safeguard)
    test_pred = np.where(test_pred < 0, 0, test_pred)
    pred += test_pred / 5  # averaging over the 5 folds

    oof = np.exp(model.predict(x_val))
    oof = np.where(oof < 0, 0, oof)
    rmsle = np.sqrt(mean_squared_log_error(np.exp(y_val), oof))
    print(f"RMSLE : {rmsle}")
    score += model.best_score["valid_1"]["rmse"] / 5
Training until validation scores don't improve for 100 rounds
[500]	training's rmse: 0.790217	valid_1's rmse: 0.878511
[1000]	training's rmse: 0.723568	valid_1's rmse: 0.867093
Early stopping, best iteration is:
[1214]	training's rmse: 0.701332	valid_1's rmse: 0.864694
RMSLE : 0.8636586877575745
Training until validation scores don't improve for 100 rounds
[500]	training's rmse: 0.793219	valid_1's rmse: 0.868182
[1000]	training's rmse: 0.728989	valid_1's rmse: 0.855429
[1500]	training's rmse: 0.682203	valid_1's rmse: 0.85194
Early stopping, best iteration is:
[1458]	training's rmse: 0.685965	valid_1's rmse: 0.851479
RMSLE : 0.8511941452093897
Training until validation scores don't improve for 100 rounds
[500]	training's rmse: 0.79201	valid_1's rmse: 0.872163
[1000]	training's rmse: 0.726111	valid_1's rmse: 0.859045
[1500]	training's rmse: 0.68015	valid_1's rmse: 0.854041
[2000]	training's rmse: 0.641535	valid_1's rmse: 0.851314
[2500]	training's rmse: 0.61155	valid_1's rmse: 0.849352
Early stopping, best iteration is:
[2732]	training's rmse: 0.596627	valid_1's rmse: 0.848531
RMSLE : 0.8475811660357269
Training until validation scores don't improve for 100 rounds
[500]	training's rmse: 0.795452	valid_1's rmse: 0.872153
[1000]	training's rmse: 0.726643	valid_1's rmse: 0.860974
[1500]	training's rmse: 0.682811	valid_1's rmse: 0.857198
[2000]	training's rmse: 0.645627	valid_1's rmse: 0.853884
Early stopping, best iteration is:
[2147]	training's rmse: 0.635177	valid_1's rmse: 0.85334
RMSLE : 0.8527310826494601
Training until validation scores don't improve for 100 rounds
[500]	training's rmse: 0.795426	valid_1's rmse: 0.855615
[1000]	training's rmse: 0.729845	valid_1's rmse: 0.843
[1500]	training's rmse: 0.685542	valid_1's rmse: 0.838488
Early stopping, best iteration is:
[1763]	training's rmse: 0.66674	valid_1's rmse: 0.836856
RMSLE : 0.8365263539622312
The like and dislike features are highly important. That makes sense: ratings can't accumulate unless views do.
lgb.plot_importance(model, importance_type="gain", max_num_features=20)
<matplotlib.axes._subplots.AxesSubplot at 0x7fcb8f2b1210>
print(f"Mean RMSLE SCORE :{score}")
Mean RMSLE SCORE :0.8509800227942899
submit_df = pd.DataFrame({"y": pred})
submit_df.index.name = "id"
submit_df.to_csv("submit.csv")
With

score += model.best_score["valid_1"]["rmse"] / 5

the accumulated score here is RMSE, but it should have been the OOF RMSLE score. Using

score += rmsle / 5

instead derives the RMSLE CV score.
For the target transform before feeding LightGBM, using np.log1p() and np.expm1() instead of np.log() and np.exp() makes training equivalent to minimizing RMSLE on the untransformed target. The difference is small, but for this competition it seems like the better choice.
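The equivalence being suggested can be checked with a small sketch (toy numbers): with log1p/expm1, RMSE on the transformed scale is exactly RMSLE on the original scale, and log1p is safe even when the target is 0.

```python
import numpy as np

y = np.array([0.0, 5.0, 100.0, 10000.0])
t = np.log1p(y)
assert np.allclose(np.expm1(t), y)  # exact round trip

pred_t = t + 0.1                    # pretend model output on the log scale
rmse_log = np.sqrt(np.mean((pred_t - t) ** 2))
rmsle = np.sqrt(np.mean((np.log1p(np.expm1(pred_t)) - np.log1p(y)) ** 2))
assert np.isclose(rmse_log, rmsle)  # both equal 0.1 here
```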
Also, model.best_score["valid_1"]["rmse"] then directly becomes the OOF RMSLE, so there is less need to recompute it. (When I actually tried it, the value differs very slightly from the recomputed one, possibly due to floating-point error somewhere... but I think the interpretation is correct.)
Thanks for the comment. I went and read the source code, and as you say, it is np.log1p(). I'll fix it.