chizuchizu
Reshape the data so it can be used for training.
import pandas as pd
import numpy as np
path = "../data/01_raw/"  # change this to your own path
train = pd.read_csv(path + "train_data.csv")
test = pd.read_csv(path + "test_data.csv")
print(train.shape, test.shape)
train.head()
(19720, 17) (29582, 16)
| id | video_id | title | publishedAt | channelId | channelTitle | categoryId | collection_date | tags | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | GDtyztIThRQ | [12] BGM Inazuma Eleven 3 - ~ライオコツト ダンジョン~ | 2011-01-09T05:50:33.000Z | UCQaNYC3dNvH8FqrEyK7hTJw | DjangoShiny | 20 | 20.01.02 | Inazuma|Eleven|Super|Once|bgm|ost|イナズマイレブン|Kyo... | 114 | 0 | 7 | https://i.ytimg.com/vi/GDtyztIThRQ/default.jpg | False | False | ~ライオコツト ダンジョン~Inazuma Eleven 3 BGM Complete (R... | 29229 |
1 | 2 | m4H9s3GtTlQ | ねごと - メルシールー [Official Music Video] | 2012-07-23T03:00:09.000Z | UChMWDi-HBm5aS3jyRSaAWUA | ねごと Official Channel | 10 | 20.08.02 | ねごと|ネゴト|メルシールー|Re:myend|リマインド|Lightdentity|ライデ... | 2885 | 50 | 111 | https://i.ytimg.com/vi/m4H9s3GtTlQ/default.jpg | False | False | http://www.negoto.com/全員平成生まれ、蒼山幸子(Vo&Key)、沙田瑞... | 730280 |
2 | 3 | z19zYZuLuEU | VF3tb 闇よだれvsちび太 (SEGA) | 2007-07-26T13:54:09.000Z | UCBdcyoZSt5HBLd_n6we-xIg | siropai | 24 | 20.14.01 | VF3|VF4|VF5|ちび太|闇よだれ|chibita|virtuafighter|seg... | 133 | 17 | 14 | https://i.ytimg.com/vi/z19zYZuLuEU/default.jpg | False | False | Beat-tribe cup finalhttp://ameblo.jp/siropai/ | 80667 |
3 | 4 | pmcIOsL7s98 | free frosty weekend! | 2005-05-15T02:38:43.000Z | UC7K5am1UAQEsCRhzXpi9i1g | Jones4Carrie | 22 | 19.22.12 | frosty | 287 | 51 | 173 | https://i.ytimg.com/vi/pmcIOsL7s98/default.jpg | False | False | I look so bad but look at me! | 34826 |
4 | 5 | ZuQgsTcuM-4 | トップ・オブ・ザ・ワールド | 2007-09-09T09:52:47.000Z | UCTW1um4R-QWa8iIfITGvlZQ | Tatsuya Maruyama | 10 | 20.08.01 | ギター|guitar|南澤大介|トップオブザワールド|トップ|オブ|ワールド|カーペンターズ... | 178 | 6 | 17 | https://i.ytimg.com/vi/ZuQgsTcuM-4/default.jpg | False | False | ソロギターのしらべより「トップオブザワールド」です。クラシックギターで弾いてます。Offic... | 172727 |
Concatenating train and test makes preprocessing easier, because each step only has to be written once instead of being repeated for both frames.
target = train["y"]
del train["y"]
data = pd.concat([train, test])
print(data.shape) # 19720 + 29582 = 49302
data.head()
(49302, 16)
| id | video_id | title | publishedAt | channelId | channelTitle | categoryId | collection_date | tags | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | GDtyztIThRQ | [12] BGM Inazuma Eleven 3 - ~ライオコツト ダンジョン~ | 2011-01-09T05:50:33.000Z | UCQaNYC3dNvH8FqrEyK7hTJw | DjangoShiny | 20 | 20.01.02 | Inazuma|Eleven|Super|Once|bgm|ost|イナズマイレブン|Kyo... | 114 | 0 | 7 | https://i.ytimg.com/vi/GDtyztIThRQ/default.jpg | False | False | ~ライオコツト ダンジョン~Inazuma Eleven 3 BGM Complete (R... |
1 | 2 | m4H9s3GtTlQ | ねごと - メルシールー [Official Music Video] | 2012-07-23T03:00:09.000Z | UChMWDi-HBm5aS3jyRSaAWUA | ねごと Official Channel | 10 | 20.08.02 | ねごと|ネゴト|メルシールー|Re:myend|リマインド|Lightdentity|ライデ... | 2885 | 50 | 111 | https://i.ytimg.com/vi/m4H9s3GtTlQ/default.jpg | False | False | http://www.negoto.com/全員平成生まれ、蒼山幸子(Vo&Key)、沙田瑞... |
2 | 3 | z19zYZuLuEU | VF3tb 闇よだれvsちび太 (SEGA) | 2007-07-26T13:54:09.000Z | UCBdcyoZSt5HBLd_n6we-xIg | siropai | 24 | 20.14.01 | VF3|VF4|VF5|ちび太|闇よだれ|chibita|virtuafighter|seg... | 133 | 17 | 14 | https://i.ytimg.com/vi/z19zYZuLuEU/default.jpg | False | False | Beat-tribe cup finalhttp://ameblo.jp/siropai/ |
3 | 4 | pmcIOsL7s98 | free frosty weekend! | 2005-05-15T02:38:43.000Z | UC7K5am1UAQEsCRhzXpi9i1g | Jones4Carrie | 22 | 19.22.12 | frosty | 287 | 51 | 173 | https://i.ytimg.com/vi/pmcIOsL7s98/default.jpg | False | False | I look so bad but look at me! |
4 | 5 | ZuQgsTcuM-4 | トップ・オブ・ザ・ワールド | 2007-09-09T09:52:47.000Z | UCTW1um4R-QWa8iIfITGvlZQ | Tatsuya Maruyama | 10 | 20.08.01 | ギター|guitar|南澤大介|トップオブザワールド|トップ|オブ|ワールド|カーペンターズ... | 178 | 6 | 17 | https://i.ytimg.com/vi/ZuQgsTcuM-4/default.jpg | False | False | ソロギターのしらべより「トップオブザワールド」です。クラシックギターで弾いてます。Offic... |
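As a small illustration of the concatenate, preprocess, split pattern, here is a sketch on toy data. The frames and the `likes_x2` column are made up for the example; only the pattern mirrors the notebook. Note that `pd.concat` keeps each frame's original index, so the combined index contains duplicates and the later split has to be positional (`iloc`).

```python
import pandas as pd

# Toy stand-ins for the competition files (hypothetical values).
train = pd.DataFrame({"likes": [10, 20, 30], "y": [100, 200, 300]})
test = pd.DataFrame({"likes": [40, 50]})

target = train.pop("y")           # separate the label from the features
data = pd.concat([train, test])   # row order: all train rows, then all test rows

# One preprocessing step, applied to train and test at the same time.
data["likes_x2"] = data["likes"] * 2

# The index is 0,1,2,0,1 here, so split by position rather than by label.
train_fe = data.iloc[:len(target)]
test_fe = data.iloc[len(target):]
print(train_fe.shape, test_fe.shape)
```

Because the split relies on row order, any step that reorders or drops rows of `data` would have to be handled before splitting.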
Format the date columns.
data["publishedAt"] = pd.to_datetime(data["publishedAt"])
data["year"] = data["publishedAt"].dt.year
data["month"] = data["publishedAt"].dt.month
data["day"] = data["publishedAt"].dt.day
data["hour"] = data["publishedAt"].dt.hour
data["minute"] = data["publishedAt"].dt.minute
data["collection_date"] = "20" + data["collection_date"]
data["collection_date"] = pd.to_datetime(data["collection_date"], format="%Y.%d.%m")
data["c_year"] = data["collection_date"].dt.year
data["c_month"] = data["collection_date"].dt.month
data["c_day"] = data["collection_date"].dt.day
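The `collection_date` strings are in two-digit-year, day, month order (for example `20.01.02` is 1 February 2020). As an alternative sketch, the `%y` directive parses the two-digit year directly, so the `"20"` prefix is not strictly needed. The three sample strings below are taken from the table shown earlier.

```python
import pandas as pd

# yy.dd.mm strings copied from the head() output above.
raw = pd.Series(["20.01.02", "20.14.01", "19.22.12"])

# %y = two-digit year, %d = day, %m = month.
parsed = pd.to_datetime(raw, format="%y.%d.%m")

years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
days = parsed.dt.day.tolist()
print(years, months, days)
```

The resulting year, month, and day values agree with the `c_year`, `c_month`, and `c_day` columns produced by the notebook's version.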
Add the number of tags as a feature.
data["length_tags"] = data["tags"].astype(str).apply(lambda x: len(x.split("|")))
data = data.drop(["channelId",
"video_id",
"publishedAt",
"thumbnail_link",
"channelTitle",
"collection_date",
"id",
"tags",
"description",
"title"], axis=1)
data.head()
| categoryId | likes | dislikes | comment_count | comments_disabled | ratings_disabled | year | month | day | hour | minute | c_year | c_month | c_day | length_tags |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20 | 114 | 0 | 7 | False | False | 2011 | 1 | 9 | 5 | 50 | 2020 | 2 | 1 | 48 |
1 | 10 | 2885 | 50 | 111 | False | False | 2012 | 7 | 23 | 3 | 0 | 2020 | 2 | 8 | 19 |
2 | 24 | 133 | 17 | 14 | False | False | 2007 | 7 | 26 | 13 | 54 | 2020 | 1 | 14 | 9 |
3 | 22 | 287 | 51 | 173 | False | False | 2005 | 5 | 15 | 2 | 38 | 2019 | 12 | 22 | 1 |
4 | 10 | 178 | 6 | 17 | False | False | 2007 | 9 | 9 | 9 | 52 | 2020 | 1 | 8 | 12 |
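One detail of the `length_tags` feature above: stringifying a missing value gives the text `"nan"`, which then counts as one tag. Whether the competition data actually contains missing `tags` is not shown here, so the following is only a sketch of a variant that counts a missing field as zero tags. The sample series and the helper name `count_tags` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two tag strings and one missing value.
tags = pd.Series(["a|b|c", "frosty", np.nan], dtype=object)

# What happens to a missing value once it has been turned into a string:
nan_as_text_count = len(str(float("nan")).split("|"))  # "nan" -> one "tag"

def count_tags(value):
    """Count '|'-separated tags; treat non-string (missing) values as 0 tags."""
    return len(value.split("|")) if isinstance(value, str) else 0

length_tags = tags.apply(count_tags)
print(nan_as_text_count, length_tags.tolist())
```

For a tree model the difference is usually minor, since a missing row only moves between the values 0 and 1.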
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49302 entries, 0 to 29581
Data columns (total 15 columns):
categoryId           49302 non-null int64
likes                49302 non-null int64
dislikes             49302 non-null int64
comment_count        49302 non-null int64
comments_disabled    49302 non-null bool
ratings_disabled     49302 non-null bool
year                 49302 non-null int64
month                49302 non-null int64
day                  49302 non-null int64
hour                 49302 non-null int64
minute               49302 non-null int64
c_year               49302 non-null int64
c_month              49302 non-null int64
c_day                49302 non-null int64
length_tags          49302 non-null int64
dtypes: bool(2), int64(13)
memory usage: 5.4 MB
We train LightGBM with RMSE as the loss function.
RMSLE, the evaluation metric, can be computed with scikit-learn.
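For reference, the definition that scikit-learn's `mean_squared_log_error` follows (before taking the square root) can be written as below, where $y_i$ is the true value, $\hat{y}_i$ the prediction, and $n$ the number of samples. The symbols are notation chosen for this note, not taken from the notebook.

```latex
\mathrm{RMSLE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( \log(1 + \hat{y}_i) - \log(1 + y_i) \bigr)^2 }
```

In other words, RMSLE is the RMSE between $\log(1 + \hat{y})$ and $\log(1 + y)$, which is why training with an RMSE loss on a log-transformed target is a natural fit for this metric.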
import gc
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error
import lightgbm as lgb
# split back into train and test
train = data.iloc[:len(target), :]
test = data.iloc[len(target):, :]
train.shape, test.shape
((19720, 15), (29582, 15))
# K-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=2020)
# accumulates the cross-validation score
score = 0
# accumulates the predictions for the test data
pred = np.zeros(test.shape[0])
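Training on the log of the target gives a more stable model (RMSLE improves from about 2.1 to about 0.8).
How much it helps depends on the shape of the target distribution, but it is effective when the data is heavily skewed.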
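To make "heavily skewed" concrete, here is a minimal sketch on synthetic data. The log-normal sample stands in for view counts and is an assumption of this example, not the competition target; `skewness` is a small helper written for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, heavy right tail: most values are small, a few are huge.
views = rng.lognormal(mean=10.0, sigma=2.0, size=10_000)

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

raw_skew = skewness(views)          # large positive value
log_skew = skewness(np.log(views))  # close to 0, i.e. roughly symmetric
print(raw_skew, log_skew)
```

After the log transform the few extreme values no longer dominate the squared error, which is one plausible reason the model trains more stably.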
target = np.log(target)
params = {
'boosting_type': 'gbdt',
'metric': 'rmse',
'objective': 'regression',
'seed': 20,
'learning_rate': 0.01,
"n_jobs": -1,
"verbose": -1
}
for tr_idx, val_idx in cv.split(train):
    x_train, x_val = train.iloc[tr_idx], train.iloc[val_idx]
    y_train, y_val = target[tr_idx], target[val_idx]
    # wrap each fold in a Dataset for training
    train_set = lgb.Dataset(x_train, y_train)
    val_set = lgb.Dataset(x_val, y_val, reference=train_set)
    # Training
    model = lgb.train(params, train_set, num_boost_round=8000, early_stopping_rounds=100,
                      valid_sets=[train_set, val_set], verbose_eval=500)
    # predict, then undo the log with exp
    test_pred = np.exp(model.predict(test))
    # mean_squared_log_error raises on negative values, so clip them to 0
    test_pred = np.where(test_pred < 0, 0, test_pred)
    pred += test_pred / 5  # average over the 5 folds
    oof = np.exp(model.predict(x_val))
    oof = np.where(oof < 0, 0, oof)
    rmsle = np.sqrt(mean_squared_log_error(np.exp(y_val), oof))
    print(f"RMSLE : {rmsle}")
    score += model.best_score["valid_1"]["rmse"] / 5
Training until validation scores don't improve for 100 rounds
[500] training's rmse: 0.790217 valid_1's rmse: 0.878511
[1000] training's rmse: 0.723568 valid_1's rmse: 0.867093
Early stopping, best iteration is:
[1214] training's rmse: 0.701332 valid_1's rmse: 0.864694
RMSLE : 0.8636586877575745
Training until validation scores don't improve for 100 rounds
[500] training's rmse: 0.793219 valid_1's rmse: 0.868182
[1000] training's rmse: 0.728989 valid_1's rmse: 0.855429
[1500] training's rmse: 0.682203 valid_1's rmse: 0.85194
Early stopping, best iteration is:
[1458] training's rmse: 0.685965 valid_1's rmse: 0.851479
RMSLE : 0.8511941452093897
Training until validation scores don't improve for 100 rounds
[500] training's rmse: 0.79201 valid_1's rmse: 0.872163
[1000] training's rmse: 0.726111 valid_1's rmse: 0.859045
[1500] training's rmse: 0.68015 valid_1's rmse: 0.854041
[2000] training's rmse: 0.641535 valid_1's rmse: 0.851314
[2500] training's rmse: 0.61155 valid_1's rmse: 0.849352
Early stopping, best iteration is:
[2732] training's rmse: 0.596627 valid_1's rmse: 0.848531
RMSLE : 0.8475811660357269
Training until validation scores don't improve for 100 rounds
[500] training's rmse: 0.795452 valid_1's rmse: 0.872153
[1000] training's rmse: 0.726643 valid_1's rmse: 0.860974
[1500] training's rmse: 0.682811 valid_1's rmse: 0.857198
[2000] training's rmse: 0.645627 valid_1's rmse: 0.853884
Early stopping, best iteration is:
[2147] training's rmse: 0.635177 valid_1's rmse: 0.85334
RMSLE : 0.8527310826494601
Training until validation scores don't improve for 100 rounds
[500] training's rmse: 0.795426 valid_1's rmse: 0.855615
[1000] training's rmse: 0.729845 valid_1's rmse: 0.843
[1500] training's rmse: 0.685542 valid_1's rmse: 0.838488
Early stopping, best iteration is:
[1763] training's rmse: 0.66674 valid_1's rmse: 0.836856
RMSLE : 0.8365263539622312
The like and dislike counts are the most useful features. That makes sense: the number of ratings cannot grow unless the view count grows too.
lgb.plot_importance(model, importance_type="gain", max_num_features=20)
<matplotlib.axes._subplots.AxesSubplot at 0x7fcb8f2b1210>
print(f"Mean RMSLE SCORE :{score}")
Mean RMSLE SCORE :0.8509800227942899
submit_df = pd.DataFrame({"y": pred})
submit_df.index.name = "id"
submit_df.to_csv("submit.csv")
chizuchizu
score += model.best_score["valid_1"]["rmse"] / 5
Here score accumulates the RMSE, but it should have accumulated the RMSLE of the out-of-fold (OOF) predictions.
score += rmsle / 5
Writing it this way gives the RMSLE cross-validation score.
wkwkhautbois
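When transforming the target before passing it to LightGBM, using np.log1p() and np.expm1() instead of np.log() and np.exp() is equivalent to minimizing RMSLE on the untransformed target. The difference is small, but I think it suits this competition better.
In addition, model.best_score["valid_1"]["rmse"] then equals the OOF RMSLE directly, so there is less need to recompute it. (When I actually tried it, the value differed very slightly from the recomputed one, perhaps because of numerical error somewhere, but I believe the interpretation is correct.)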
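The equivalence described in this comment can be checked numerically. The sketch below uses made-up true values and predictions; it compares the RMSE measured in `log1p` space with scikit-learn's RMSLE on the original scale, and confirms that `expm1` undoes `log1p`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical view counts and predictions, including a zero.
y_true = np.array([0.0, 10.0, 1_000.0, 50_000.0])
y_pred = np.array([1.0, 12.0, 800.0, 70_000.0])

# RMSE measured in log1p space, i.e. the quantity an RMSE metric sees
# when the model is trained on log1p(y).
rmse_log1p = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# RMSLE on the original scale, via scikit-learn.
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# expm1 is the inverse of log1p, so predictions map back to the original scale.
round_trip = np.expm1(np.log1p(y_true))
print(rmse_log1p, rmsle)
```

A side benefit of `log1p` is that a target of 0 maps to 0, whereas `np.log(0)` is negative infinity.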