Oregin
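Apologies for the delay; I am publishing my 1st Place Solution (by Oregin).
To the organizers and everyone who took part alongside me: I would appreciate your review.
I have also posted diagrams of the overall approach on the blog below, so please refer to it as needed.
[1st Place Solution] Looking back on the ProbSpace competition "プロ野球データ分析チャレンジ" (Pro Baseball Data Analysis Challenge). https://oregin-ai.hatenablog.com/entry/2021/06/23/223054
Overall, the final predictions are produced by building the following two kinds of models and blending their predictions.
Directory structure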
# Name attached to the final prediction results
Notebookname = 'BestScore'
# Move to the directory where this file is saved (edit to match your environment).
cd /content/drive/MyDrive/XXXXXXXXXX/notebook
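First comes the baseline I published as a topic, which performs multiclass classification over all target classes.
Baseline posted as a topic: https://prob.space/competitions/npb/discussions/Oregin-Postbd2b4e8a9808ec850876
The final predictions are obtained by blending this baseline's predictions with those of a model, described later, that predicts classes 4 to 7.
# Install xfeat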
!pip install git+https://github.com/pfnet-research/xfeat.git
# ------------------------------------------------------------------------------
# Import libraries
# ------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import json
import os
import random
import string
import re
from pathlib import Path
from tqdm import tqdm
import lightgbm as lgb
from sklearn.model_selection import KFold,GroupKFold
from sklearn.metrics import f1_score,precision_score
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB # ComplementNB
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
# Load the data
#####################################
###### train ########################
#####################################
train = pd.read_csv('../data/train_data.csv')
target = train['y']
train = train.drop(['id','y'],axis=1)
#####################################
#### test ###########################
#####################################
test = pd.read_csv('../data/test_data.csv')
test = test.drop('id',axis=1)
#####################################
#### game info ######################
#####################################
game = pd.read_csv('../data/game_info.csv')
game = game.drop('Unnamed: 0',axis=1)
train.shape,test.shape,target.shape,game.shape
((20400, 22), (33808, 13), (20400,), (726, 8))
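# Use the pitch speed in the training data as a provisional target so that it can be predicted from the other features.
# Missing values are filled with the immediately preceding value.
target_speed = train['speed'].str.extract(r'(\d+)').ffill()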
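The extraction and forward fill above can be tried on a tiny series. This is a minimal sketch with made-up sample values; it assumes, as the later preprocessing in this post suggests, that a missing speed appears as `-`. Note that the extracted values stay strings, which matches the `dtype=object` array of `target_speed` printed further down.

```python
import pandas as pd

# Toy speed column in the "<number>km/h" format; '-' stands in for a
# missing measurement (hypothetical sample values).
speed = pd.Series(['149km/h', '-', '137km/h', '-', '120km/h'])

# str.extract returns a one-column DataFrame (column label 0);
# rows that contain no digits become NaN.
digits = speed.str.extract(r'(\d+)')

# Forward fill copies the previous valid value into each gap.
filled = digits.ffill()

print(filled[0].tolist())
```

A gap on the very first row would stay NaN, because there is no earlier value to copy.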
##############################
# Replace team names with numbers
##############################
# Build the list of all team names
TeamList = game['topTeam'].unique()
# Initialize the team-name dictionary
TeamDic = {}
# Assign a number to each team name
for i in range(len(TeamList)):
    TeamDic[TeamList[i]] = i
game['bottomTeam']=game['bottomTeam'].replace(TeamDic)
game['topTeam']=game['topTeam'].replace(TeamDic)
game.tail()
| | startTime | bottomTeam | bgBottom | topTeam | place | startDayTime | bgTop | gameID |
---|---|---|---|---|---|---|---|---|
721 | 13:00 | 7 | 12 | 3 | PayPayドーム | 2020-11-15 13:00:00 | 9 | 20203323 |
722 | 18:10 | 8 | 1 | 7 | 京セラD大阪 | 2020-11-21 18:10:00 | 12 | 20203326 |
723 | 18:10 | 8 | 1 | 7 | 京セラD大阪 | 2020-11-22 18:10:00 | 12 | 20203327 |
724 | 18:30 | 7 | 12 | 8 | PayPayドーム | 2020-11-24 18:30:00 | 1 | 20203328 |
725 | 18:30 | 7 | 12 | 8 | PayPayドーム | 2020-11-25 18:30:00 | 1 | 20203329 |
# Add year, month, day, day of week, hour, minute and second
game['startDayTime'] = pd.to_datetime(game['startDayTime']) # convert the dtype
game['year']=game["startDayTime"].dt.year
game['month']=game["startDayTime"].dt.month
game['day']=game["startDayTime"].dt.day
game['hour']=game["startDayTime"].dt.hour
game['dayofweek']=game["startDayTime"].dt.dayofweek
game['minute']=game["startDayTime"].dt.minute
game['second']=game["startDayTime"].dt.second
# Drop 'startDayTime'
game = game.drop(['startDayTime'],axis=1)
# Build the list of columns that exist only in the training data (not in the test data)
delcollist = []
for col in train.columns:
    if col not in test.columns:
        delcollist.append(col)
# Drop the columns that exist only in the training data
train = train.drop(delcollist,axis=1)
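# Extract the inning number from inning (digits only; the result is still a string)
train['inning_num'] = train['inning'].apply(lambda x: re.sub("\\D", "", x))
test['inning_num'] = test['inning'].apply(lambda x: re.sub("\\D", "", x))
# Function that tells the top half (表) from the bottom half (裏) of an inning
def omote_ura(x):
    if '表' in x:
        return 0
    else:
        return 1
# Add the top/bottom column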
train['inning_ForB'] = train['inning'].apply(lambda x: omote_ura(x))
test['inning_ForB'] = test['inning'].apply(lambda x: omote_ura(x))
# Merge game_info
train = pd.merge(train, game, how='left')
test = pd.merge(test, game, how='left')
# Drop inning
train = train.drop('inning',axis=1)
test = test.drop('inning',axis=1)
# Add the sum of balls, strikes and outs
train['total_stat'] = train['B']+train['S']+train['O']
test['total_stat'] = test['B']+test['S']+test['O']
train['B_S'] = train['B']+train['S']
test['B_S'] = test['B']+test['S']
# Add the number of runners on base
train['total_base'] = train['b1'].astype('int')+train['b2'].astype('int')+train['b3'].astype('int')
test['total_base'] = test['b1'].astype('int')+test['b2'].astype('int')+test['b3'].astype('int')
# Add the batter's team
# Note: with this condition, top-half rows (inning_ForB == 0) receive bottomTeam.
# The second model below attaches batters to topTeam in the top half, so the
# batterTeam / pitcherTeam labels here appear to be swapped relative to it.
train['batterTeam'] = train['topTeam']
train['batterTeam'] = train['batterTeam'].where(train['inning_ForB']==1, train['bottomTeam'])
test['batterTeam'] = test['topTeam']
test['batterTeam'] = test['batterTeam'].where(test['inning_ForB']==1, test['bottomTeam'])
# Add the pitcher's team
train['pitcherTeam'] = train['topTeam']
train['pitcherTeam'] = train['pitcherTeam'].where(train['inning_ForB']==0, train['bottomTeam'])
test['pitcherTeam'] = test['topTeam']
test['pitcherTeam'] = test['pitcherTeam'].where(test['inning_ForB']==0, test['bottomTeam'])
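To see which team each row ends up with, it helps to recall that `Series.where(cond, other)` keeps the original value where `cond` is True and takes `other` elsewhere. The following is a small sketch on a hypothetical two-row frame (team codes 3 and 7 are made up) that reuses the same two expressions.

```python
import pandas as pd

# Two pitches from the same hypothetical game: one in the top half
# (inning_ForB == 0) and one in the bottom half (inning_ForB == 1).
df = pd.DataFrame({
    'topTeam': [3, 3],
    'bottomTeam': [7, 7],
    'inning_ForB': [0, 1],
})

# where() keeps topTeam where the condition holds, otherwise uses bottomTeam.
batter_team = df['topTeam'].where(df['inning_ForB'] == 1, df['bottomTeam'])
pitcher_team = df['topTeam'].where(df['inning_ForB'] == 0, df['bottomTeam'])

print(batter_team.tolist(), pitcher_team.tolist())
```

In this toy case the top-half row gets `bottomTeam` as `batterTeam` and `topTeam` as `pitcherTeam`, which is worth keeping in mind when interpreting the two features.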
# Extract the categorical columns
categorical_columns = [x for x in train.columns if train[x].dtypes == 'object']
# Count-encode the categorical variables
from xfeat import CountEncoder
encoder = CountEncoder(input_cols=categorical_columns)
train = encoder.fit_transform(train)
test = encoder.transform(test)
# Add the target column to the training data
train['target'] = target
# Target-encode the categorical variables
from sklearn.model_selection import KFold
from xfeat import TargetEncoder
fold = KFold(n_splits=5, shuffle=True, random_state=42)
encoder = TargetEncoder(input_cols=categorical_columns,
                        target_col='target',
                        fold=fold)
train = encoder.fit_transform(train)
test = encoder.transform(test)
# Drop the columns as they were before encoding
train = train.drop(categorical_columns,axis=1)
test = test.drop(categorical_columns,axis=1)
# Drop the target column
train = train.drop('target',axis=1)
train.shape,test.shape,target.shape,target_speed.shape
((20400, 39), (33808, 39), (20400,), (20400, 1))
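The target encoding step is easier to follow with the out-of-fold idea written out by hand. The sketch below is not xfeat's implementation; `oof_target_encode` is a hypothetical helper that uses only pandas and NumPy, and the toy data is made up. Each row is encoded with the mean target of its category computed from the other folds, so a row never sees its own label.

```python
import numpy as np
import pandas as pd


def oof_target_encode(df, col, target_col, n_splits=5, seed=42):
    """Hypothetical helper: out-of-fold mean-target encoding of one column."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of n_splits folds at random
    fold_id = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        in_fold = fold_id == k
        # Category means come from the other folds only, never from the row itself
        means = df.loc[~in_fold].groupby(col)[target_col].mean()
        encoded.loc[in_fold] = df.loc[in_fold, col].map(means).to_numpy()
    return encoded


toy = pd.DataFrame({
    'batter': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'target': [0, 1, 1, 1, 0, 1, 0, 0],
})
toy['batter_te'] = oof_target_encode(toy, 'batter', 'target', n_splits=4)
print(toy)
```

A category that never appears outside the row's own fold receives NaN in this sketch; how a real encoder handles that case is a design choice.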
# Function that adds features built from pivot tables
def get_game_id_vecs_features(input_df):
    _input_df = input_df
    # pivot table
    stat_df = pd.pivot_table(_input_df, index="gameID", columns="batter_te", values="total_stat").add_prefix("total_stat=")
    base_df = pd.pivot_table(_input_df, index="gameID", columns="batter_te", values="total_base").add_prefix("total_base=")
    inning_df = pd.pivot_table(_input_df, index="gameID", columns="batter_te", values="inning_num_ce").add_prefix("inning=")
    all_df = pd.concat([stat_df, base_df, inning_df], axis=1)
    # PCA all
    sc_all_df = StandardScaler().fit_transform(all_df.fillna(0))
    pca = PCA(n_components=59, random_state=2021)
    pca_all_df = pd.DataFrame(pca.fit_transform(sc_all_df), index=all_df.index).rename(columns=lambda x: f"gameID_all_PCA={x:03}")
    # PCA stat
    sc_stat_df = StandardScaler().fit_transform(stat_df.fillna(0))
    pca = PCA(n_components=16, random_state=2021)
    pca_stat_df = pd.DataFrame(pca.fit_transform(sc_stat_df), index=all_df.index).rename(columns=lambda x: f"gameID_stat_PCA={x:03}")
    # PCA base
    sc_base_df = StandardScaler().fit_transform(base_df.fillna(0))
    pca = PCA(n_components=16, random_state=2021)
    pca_base_df = pd.DataFrame(pca.fit_transform(sc_base_df), index=all_df.index).rename(columns=lambda x: f"gameID_base_PCA={x:03}")
    # PCA inning
    sc_inning_df = StandardScaler().fit_transform(inning_df.fillna(0))
    pca = PCA(n_components=16, random_state=2021)
    pca_inning_df = pd.DataFrame(pca.fit_transform(sc_inning_df), index=all_df.index).rename(columns=lambda x: f"gameID_inning_PCA={x:03}")
    df = pd.concat([all_df, pca_all_df, pca_stat_df, pca_base_df, pca_inning_df], axis=1)
    output_df = pd.merge(_input_df[["gameID"]], df, left_on="gameID", right_index=True, how="left")
    return output_df
# Concatenate the training and test data
input_df = pd.concat([train, test]).reset_index(drop=True) # use concat data
# Build the pivot data
output_df = get_game_id_vecs_features(input_df)
# Split the pivot data back into training and test parts
train_x = output_df.iloc[:len(train)]
test_x = output_df.iloc[len(train):].reset_index(drop=True)
train_x.shape,test_x.shape,train.shape,test.shape,target.shape,target_speed.shape
((20400, 2847), (33808, 2847), (20400, 39), (33808, 39), (20400,), (20400, 1))
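The per-game vectors can be pictured on a toy table: pivot to one row per game, standardize, compress with PCA, then broadcast each game's vector back to every pitch of that game. This is a minimal sketch with made-up data; the column `batter_key` and the choice of two components are assumptions for illustration, not the settings used above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical pitch-level data for three games
toy = pd.DataFrame({
    'gameID': [1, 1, 1, 2, 2, 3, 3, 3],
    'batter_key': ['a', 'b', 'a', 'a', 'c', 'b', 'c', 'c'],
    'total_stat': [0, 2, 3, 1, 4, 2, 5, 1],
})

# One row per game, one column per batter key; each cell is the mean total_stat
wide = pd.pivot_table(toy, index='gameID', columns='batter_key',
                      values='total_stat').add_prefix('total_stat=')

# Missing combinations become 0 before scaling, as in the function above
scaled = StandardScaler().fit_transform(wide.fillna(0))
pca = PCA(n_components=2, random_state=2021)
game_vecs = pd.DataFrame(pca.fit_transform(scaled), index=wide.index)
game_vecs = game_vecs.rename(columns=lambda x: f'gameID_PCA={x:03}')

# Broadcast the per-game vector back to every pitch of that game
features = pd.merge(toy[['gameID']], game_vecs, left_on='gameID',
                    right_index=True, how='left')
print(features)
```

Every pitch of the same game shares the same vector, so these columns describe the game as a whole rather than the individual pitch.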
# Concatenate the original data with the pivot data
input_all_df = pd.concat([input_df,output_df],axis=1)
input_all_df.shape
(54208, 2886)
# Find the columns that contain nulls
nul_sum = input_all_df.isnull().sum()
null_cols = list(nul_sum[nul_sum > 0].index)
# Drop the columns that contain nulls
input_all_df = input_all_df.drop(null_cols,axis=1)
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
# Look for columns with zero variance (all values identical)
sel = VarianceThreshold(threshold=0)
sel.fit(input_all_df)
# get_support returns True for columns with non-zero variance and False for zero-variance columns
print(sum(sel.get_support()))
# Drop the zero-variance columns
input_all_df =input_all_df.loc[:, sel.get_support()]
print(input_all_df.shape)
145
(54208, 145)
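# Swap index and columns
input_all_df_T = input_all_df.T
print(input_all_df_T.duplicated().sum())
# Get the names of the duplicated features
duplicated_features = input_all_df_T[input_all_df_T.duplicated()].index.values
# Drop one of each pair of features with identical values
# Note: drop() removes every column carrying that name, which would explain why the
# column count falls by two (145 to 143) for a single duplicate, for example if
# 'gameID' is present in both input_df and output_df.
input_all_df = input_all_df.drop(duplicated_features,axis=1)
print(input_all_df.shape)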
1
(54208, 143)
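The two clean-up steps, dropping constant columns and dropping columns whose values duplicate another column, can be checked on a toy frame. This is a small pandas-only sketch with made-up columns; it uses `nunique()` in place of `VarianceThreshold` for the constant-column test, and distinct column names so that only the later copy is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3],
    'a_copy': [1, 2, 3],   # same values as 'a'
    'const': [5, 5, 5],    # zero variance
    'b': [3, 1, 2],
})

# Zero-variance columns hold a single distinct value
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy = toy.drop(columns=constant_cols)

# Transpose so that columns become rows; duplicated() then marks the later copies
dup_cols = toy.T[toy.T.duplicated()].index.tolist()
toy = toy.drop(columns=dup_cols)

print(list(toy.columns))
```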
# Split back into training and test data
X_train = input_all_df.iloc[:len(train)]
X_test = input_all_df.iloc[len(train):].reset_index(drop=True)
X_train.shape,X_test.shape
((20400, 143), (33808, 143))
# Save the engineered feature data
X_train.to_csv('../features/preprocessed_train.csv',index=False)
X_test.to_csv('../features/preprocessed_test.csv',index=False)
target.to_csv('../features/preprocessed_target.csv',index=False)
target_speed.to_csv('../features/preprocessed_speed.csv',index=False)
Next, build a model that predicts "Speed", which exists only in the training data, and add its predictions to the test data as a feature.
SEED = 42
NFOLDS = 5
# Flatten the speed data to one dimension
target_speed = target_speed.to_numpy().reshape(-1,)
# Function that builds the neural network
def create_model_NN(activation, n_layers, n_neurons, solver):
    hidden_layer_sizes=[]
    # Build the layers from the given parameters
    for i in range(n_layers):
        hidden_layer_sizes.append(n_neurons[i])
    # Create the neural network model
    model = MLPRegressor(activation = activation,
                         hidden_layer_sizes=hidden_layer_sizes,
                         solver = solver,
                         random_state=42
                         )
    # Pipeline of standardization followed by the neural network
    pipe = make_pipeline(StandardScaler(),model)
    return pipe
# Function that predicts "Speed" for the test data
def pred_speed_of_test_data(train_x,test,target_speed,param):
    ###################################
    ### Parameter settings
    ##################################
    activation = param['activation']
    n_layers = param['n_layers']
    n_neurons=[]
    for i in range(n_layers):
        n_neurons.append(param['neuron' + str(i).zfill(2)])
    solver = param['solver']
    ###################################
    ### CV settings
    ##################################
    kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
    scores = []
    mlp_pred = 0
    for i, (tdx, vdx) in enumerate(kf.split(X=train_x)):
        X_train, X_valid, y_train, y_valid = train_x.iloc[tdx], train_x.iloc[vdx], target_speed[tdx], target_speed[vdx]
        # Build the model
        mlp = create_model_NN(activation, n_layers, n_neurons, solver)
        # Train
        mlp.fit(X_train,y_train)
        # Predict; average the test predictions over the folds
        mlp_pred += mlp.predict(test) / NFOLDS
    print('#######################################################')
    print('### Speed was predicted #######')
    print('#######################################################')
    return mlp_pred
# Hyperparameters for the Speed prediction
param = {
"activation": 'tanh',
"n_layers": 9,
"neuron00": 45,
"neuron01": 52,
"neuron02": 57,
"neuron03": 79,
"neuron04": 21,
"neuron05": 102,
"neuron06": 118,
"neuron07": 31,
"neuron08": 66,
"solver": 'sgd',
}
# Predict "Speed" for the test data
speed_pred = pred_speed_of_test_data(X_train,X_test,target_speed,param)
#######################################################
### Speed was predicted #######
#######################################################
target.shape
(20400,)
target_speed
array(['149', '149', '137', ..., '120', '131', '143'], dtype=object)
speed_pred
array([137.30386032, 137.30378782, 137.30386424, ..., 137.30380607, 137.30379655, 137.30378883])
# Function that predicts "y" for the test data
######################################################
### Define the function that trains and predicts with LGB
########################################################
def pred_y_of_test_data(train,test,target,lgb_param,mlp_pred,select_col_list):
    # --------------------------------------
    # Parameter definition
    # --------------------------------------
    lgb_params = {
        'objective': 'multiclass',
        'boosting_type': 'gbdt',
        'n_estimators': 50000,
        'colsample_bytree': 0.5,
        'subsample': 0.5,
        'subsample_freq': 3,
        'reg_alpha': 8,
        'reg_lambda': 2,
        'random_state': SEED,
        'bagging_fraction': lgb_param['bagging_fraction'],
        'bagging_freq': lgb_param['bagging_freq'],
        'feature_fraction': lgb_param['feature_fraction'],
        "learning_rate":lgb_param['learning_rate'],
        'min_child_samples': lgb_param['min_child_samples'],
        'num_leaves': lgb_param['num_leaves'],
    }
    # --------------------------------------
    # Training and prediction
    # --------------------------------------
    kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
    lgb_oof = np.zeros(train.shape[0])
    lgb_pred = pd.DataFrame()
    train_x = train.loc[:][select_col_list]
    test_x = test.loc[:][select_col_list]
    # Training rows use the observed speed (global target_speed); test rows use the predicted speed
    train_x['speed'] = target_speed.astype('float')
    test_x['speed'] = mlp_pred
    target_y = target
    for fold, (trn_idx, val_idx) in enumerate(kf.split(X=train_x)):
        X_train, y_train = train_x.iloc[trn_idx], target_y[trn_idx]
        X_valid, y_valid = train_x.iloc[val_idx], target_y[val_idx]
        X_test = test_x
        # LightGBM
        model = lgb.LGBMClassifier(**lgb_params)
        model.fit(X_train, y_train,
                  eval_set=(X_valid, y_valid),
                  eval_metric='logloss',
                  verbose=False,
                  early_stopping_rounds=500
                  )
        lgb_oof[val_idx] = model.predict(X_valid)
        lgb_pred[f'fold_{fold}'] = model.predict(X_test)
        f1_macro = f1_score(y_valid, lgb_oof[val_idx], average='macro')
        print(f"fold {fold} lgb score: {f1_macro}")
    # Take the mode of the per-fold predictions (corrected after this was pointed out)
    sub_pred = lgb_pred.mode(axis=1)[0]
    print("+-" * 40)
    # Note: f1_macro holds the score of the last fold only, so this line prints
    # that fold's score rather than an overall CV score.
    print(f"score: {f1_macro}")
    return sub_pred
# Hyperparameters of the model that predicts "y"
lgb_param = {
"bagging_fraction": 0.7537281209924886,
"bagging_freq": 5,
"feature_fraction": 0.7548131884427044,
"learning_rate": 0.00854494687558397,
"min_child_samples": 78,
"num_leaves": 209,
}
# Select the features used for prediction
select_col_list =['B', 'O', 'b1', 'b3', 'bottomTeam', 'topTeam', 'bgTop',
'month', 'dayofweek', 'total_stat', 'pitcherTeam',
'pitcherHand_ce', 'batter_ce', 'inning_num_ce',
'startTime_ce', 'pitcherHand_te', 'batter_te',
'inning_num_te', 'startTime_te', 'place_te',
'gameID_all_PCA=000', 'gameID_all_PCA=002',
'gameID_all_PCA=004', 'gameID_all_PCA=005',
'gameID_all_PCA=009', 'gameID_all_PCA=012',
'gameID_all_PCA=015', 'gameID_all_PCA=016',
'gameID_all_PCA=017', 'gameID_all_PCA=019',
'gameID_all_PCA=023', 'gameID_all_PCA=024',
'gameID_all_PCA=029', 'gameID_all_PCA=031',
'gameID_all_PCA=035', 'gameID_all_PCA=039',
'gameID_all_PCA=040', 'gameID_all_PCA=042',
'gameID_all_PCA=045', 'gameID_all_PCA=046',
'gameID_all_PCA=047', 'gameID_all_PCA=048',
'gameID_all_PCA=049', 'gameID_all_PCA=051',
'gameID_all_PCA=053', 'gameID_all_PCA=054',
'gameID_all_PCA=057', 'gameID_stat_PCA=000',
'gameID_stat_PCA=001', 'gameID_stat_PCA=003',
'gameID_stat_PCA=004', 'gameID_stat_PCA=005',
'gameID_stat_PCA=006', 'gameID_stat_PCA=008',
'gameID_stat_PCA=010', 'gameID_stat_PCA=012',
'gameID_stat_PCA=014', 'gameID_stat_PCA=015',
'gameID_base_PCA=001', 'gameID_base_PCA=005',
'gameID_base_PCA=007', 'gameID_base_PCA=008',
'gameID_base_PCA=009', 'gameID_base_PCA=011',
'gameID_base_PCA=012', 'gameID_base_PCA=013',
'gameID_base_PCA=014', 'gameID_base_PCA=015',
'gameID_inning_PCA=001', 'gameID_inning_PCA=002',
'gameID_inning_PCA=003', 'gameID_inning_PCA=004',
'gameID_inning_PCA=006', 'gameID_inning_PCA=008',
'gameID_inning_PCA=009', 'gameID_inning_PCA=010',
'gameID_inning_PCA=012', 'gameID_inning_PCA=013',
'gameID_inning_PCA=014']
# Run training and prediction
sub_pred = pred_y_of_test_data(X_train,X_test,target,lgb_param,speed_pred,select_col_list)
fold 0 lgb score: 0.14689857610654244
fold 1 lgb score: 0.14367666587151642
fold 2 lgb score: 0.15786179891131963
fold 3 lgb score: 0.14182548394951047
fold 4 lgb score: 0.1483487213241339
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
score: 0.1483487213241339
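The final label for each test row is the most frequent class across the fold models. The sketch below shows that vote on made-up predictions from three hypothetical folds, using the same `mode(axis=1)[0]` expression as the function above.

```python
import pandas as pd

# Hypothetical class predictions from three folds for four test rows
fold_preds = pd.DataFrame({
    'fold_0': [0, 1, 2, 0],
    'fold_1': [0, 1, 1, 3],
    'fold_2': [1, 1, 2, 5],
})

# mode(axis=1) lists the most frequent value(s) of each row in ascending order,
# so column 0 holds the winner, or the smallest class when there is a tie
voted = fold_preds.mode(axis=1)[0].astype(int)

print(voted.tolist())
```

In the last row all three folds disagree, so the vote falls back to the smallest class label. With very imbalanced classes that tie-breaking rule quietly favors low-numbered classes, which may or may not be desirable.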
# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------
# Write out the test predictions
submit_df = pd.DataFrame({'y': sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv('../submission/submission.csv')
For this feature engineering, I made use of the features from the baseline that DT-SN kindly published.
DT-SN's baseline: https://prob.space/competitions/npb/discussions/DT-SN-Post2126e8f25865e24a1cc4
import pandas as pd
import numpy as np
import random
import os
from tqdm.notebook import tqdm
import lightgbm as lgb
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold
# Reduce memory usage
def reduce_mem_usage(df, verbose=False):
    start_mem = df.memory_usage().sum() / 1024**2
    cols = df.columns.to_list()
    df_1 = df.select_dtypes(exclude=['integer', 'float'])
    df_2 = df.select_dtypes(include=['integer']).apply(pd.to_numeric, downcast='integer')
    df_3 = df.select_dtypes(include=['float']).apply(pd.to_numeric, downcast='float')
    df = df_1.join([df_2, df_3]).loc[:, cols]
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('{:.2f}Mb->{:.2f}Mb({:.1f}% reduction)'.format(
            start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
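The memory saving comes from `pd.to_numeric(..., downcast=...)`, which picks the smallest numeric dtype that can hold the values. Here is a minimal sketch on a made-up frame (the column names are illustrative only).

```python
import pandas as pd

df = pd.DataFrame({
    'count': [1, 2, 3],          # int64 by default
    'dist': [0.5, 38.3, 12.0],   # float64 by default
})

# Downcast each column to the smallest dtype that can represent its values
small_int = pd.to_numeric(df['count'], downcast='integer')
small_float = pd.to_numeric(df['dist'], downcast='float')

print(df.dtypes.tolist(), small_int.dtype, small_float.dtype)
```

The float downcast stops at float32, so some decimal values are no longer represented exactly; the `dist` value shown as 38.299999 in the training table below is consistent with that.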
# Initialize the random seeds
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
# Settings
INPUT_PATH = os.path.join('..', 'data')
N_CLASS = 8
SEED = 42
N_SAMPLE = 5
N_FOLDS = 5
N_LOOPS = 3
# Read game_info.csv
game_df = reduce_mem_usage(pd.read_csv(os.path.join(INPUT_PATH, 'game_info.csv'), index_col=0))
display(game_df)
game_df.info()
| | startTime | bottomTeam | bgBottom | topTeam | place | startDayTime | bgTop | gameID |
---|---|---|---|---|---|---|---|---|
0 | 18:00 | DeNA | 3 | 広島 | 横浜 | 2020-06-19 18:00:00 | 6 | 20202173 |
1 | 18:00 | ヤクルト | 2 | 中日 | 神宮 | 2020-06-19 18:00:00 | 4 | 20202174 |
2 | 18:00 | 巨人 | 1 | 阪神 | 東京ドーム | 2020-06-19 18:00:00 | 5 | 20202175 |
3 | 18:00 | ソフトバンク | 12 | ロッテ | PayPayドーム | 2020-06-19 18:00:00 | 9 | 20202170 |
4 | 18:00 | オリックス | 11 | 楽天 | 京セラD大阪 | 2020-06-19 18:00:00 | 10 | 20202171 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
721 | 13:00 | ソフトバンク | 12 | ロッテ | PayPayドーム | 2020-11-15 13:00:00 | 9 | 20203323 |
722 | 18:10 | 巨人 | 1 | ソフトバンク | 京セラD大阪 | 2020-11-21 18:10:00 | 12 | 20203326 |
723 | 18:10 | 巨人 | 1 | ソフトバンク | 京セラD大阪 | 2020-11-22 18:10:00 | 12 | 20203327 |
724 | 18:30 | ソフトバンク | 12 | 巨人 | PayPayドーム | 2020-11-24 18:30:00 | 1 | 20203328 |
725 | 18:30 | ソフトバンク | 12 | 巨人 | PayPayドーム | 2020-11-25 18:30:00 | 1 | 20203329 |
726 rows × 8 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 726 entries, 0 to 725
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   startTime     726 non-null    object
 1   bottomTeam    726 non-null    object
 2   bgBottom      726 non-null    int8
 3   topTeam       726 non-null    object
 4   place         726 non-null    object
 5   startDayTime  726 non-null    object
 6   bgTop         726 non-null    int8
 7   gameID        726 non-null    int32
dtypes: int32(1), int8(2), object(5)
memory usage: 58.3+ KB
# Read train_data.csv
tr_df = reduce_mem_usage(pd.read_csv(os.path.join(INPUT_PATH, 'train_data.csv')))
# Remove duplicated rows
print('duplicated lines:', tr_df.drop('id', axis=1).duplicated().sum())
tr_df = tr_df[~tr_df.drop('id', axis=1).duplicated()]
# Merge game_info
tr_df = pd.merge(tr_df, game_df.drop(['bgTop', 'bgBottom'], axis=1), on='gameID', how='left')
# Disambiguate players who share a name by appending the team
f = tr_df['inning'].str.contains('表')
tr_df.loc[ f, 'batter'] = tr_df.loc[ f, 'batter'] + '@' + tr_df.loc[ f, 'topTeam'].astype(str)
tr_df.loc[~f, 'batter'] = tr_df.loc[~f, 'batter'] + '@' + tr_df.loc[~f, 'bottomTeam'].astype(str)
tr_df.loc[ f, 'pitcher'] = tr_df.loc[ f, 'pitcher'] + '@' + tr_df.loc[ f, 'bottomTeam'].astype(str)
tr_df.loc[~f, 'pitcher'] = tr_df.loc[~f, 'pitcher'] + '@' + tr_df.loc[~f, 'topTeam'].astype(str)
duplicated lines: 3264
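The name disambiguation appends the player's team, chosen by the half of the inning: in the top half (表) the batter belongs to `topTeam` and the pitcher to `bottomTeam`, and the reverse in the bottom half. A compact sketch of the same rule on made-up rows (player names `X`, `P`, `Q` are placeholders), written with `where` instead of four `loc` assignments:

```python
import pandas as pd

df = pd.DataFrame({
    'inning': ['1回表', '1回裏'],
    'batter': ['X', 'X'],        # two different players sharing one name
    'pitcher': ['P', 'Q'],
    'topTeam': ['広島', '広島'],
    'bottomTeam': ['DeNA', 'DeNA'],
})

is_top = df['inning'].str.contains('表')

# Top half: topTeam bats and bottomTeam pitches; bottom half: the reverse
df['batter'] = df['batter'] + '@' + df['topTeam'].where(is_top, df['bottomTeam'])
df['pitcher'] = df['pitcher'] + '@' + df['bottomTeam'].where(is_top, df['topTeam'])

print(df[['batter', 'pitcher']])
```

After this step the two batters named `X` become distinct keys, so any later encoding or aggregation by batter no longer mixes them.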
display(tr_df)
tr_df.info()
| | id | totalPitchingCount | B | S | O | b1 | b2 | b3 | pitcher | pitcherHand | batter | batterHand | gameID | inning | pitchType | speed | ballPositionLabel | ballX | ballY | dir | dist | battingType | isOuts | y | startTime | bottomTeam | topTeam | place | startDayTime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | ストレート | 149km/h | 内角低め | 17 | J | NaN | NaN | NaN | NaN | 0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
1 | 1 | 2 | 1 | 0 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | ストレート | 149km/h | 内角低め | 14 | I | NaN | NaN | NaN | NaN | 1 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
2 | 2 | 3 | 1 | 1 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | チェンジアップ | 137km/h | 外角高め | 8 | D | NaN | NaN | NaN | NaN | 0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
3 | 3 | 4 | 2 | 1 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | スライダー | 138km/h | 内角中心 | 21 | G | NaN | NaN | NaN | NaN | 2 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
4 | 4 | 5 | 2 | 2 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | チェンジアップ | 136km/h | 外角中心 | 7 | F | S | 38.299999 | G | False | 4 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17131 | 17131 | 2 | 1 | 0 | 2 | False | False | False | 森 唯斗@ソフトバンク | R | 大田 泰示@日本ハム | R | 20202118 | 9回裏 | カットファストボール | 143km/h | 外角中心 | 7 | F | NaN | NaN | NaN | NaN | 2 | 18:00 | 日本ハム | ソフトバンク | 札幌ドーム | 2020-06-30 18:00:00 |
17132 | 17132 | 3 | 1 | 1 | 2 | False | False | False | 森 唯斗@ソフトバンク | R | 大田 泰示@日本ハム | R | 20202118 | 9回裏 | カーブ | 120km/h | 真ん中低め | 12 | K | NaN | NaN | NaN | NaN | 0 | 18:00 | 日本ハム | ソフトバンク | 札幌ドーム | 2020-06-30 18:00:00 |
17133 | 17133 | 4 | 2 | 1 | 2 | False | False | False | 森 唯斗@ソフトバンク | R | 大田 泰示@日本ハム | R | 20202118 | 9回裏 | カーブ | 120km/h | 真ん中低め | 10 | H | NaN | NaN | NaN | NaN | 1 | 18:00 | 日本ハム | ソフトバンク | 札幌ドーム | 2020-06-30 18:00:00 |
17134 | 17134 | 5 | 2 | 2 | 2 | False | False | False | 森 唯斗@ソフトバンク | R | 大田 泰示@日本ハム | R | 20202118 | 9回裏 | フォーク | 131km/h | 真ん中低め | 12 | K | NaN | NaN | NaN | NaN | 0 | 18:00 | 日本ハム | ソフトバンク | 札幌ドーム | 2020-06-30 18:00:00 |
17135 | 17135 | 6 | 3 | 2 | 2 | False | False | False | 森 唯斗@ソフトバンク | R | 大田 泰示@日本ハム | R | 20202118 | 9回裏 | カットファストボール | 143km/h | 外角中心 | 6 | E | NaN | 0.000000 | NaN | True | 1 | 18:00 | 日本ハム | ソフトバンク | 札幌ドーム | 2020-06-30 18:00:00 |
17136 rows × 29 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17136 entries, 0 to 17135
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  17136 non-null  int16
 1   totalPitchingCount  17136 non-null  int8
 2   B                   17136 non-null  int8
 3   S                   17136 non-null  int8
 4   O                   17136 non-null  int8
 5   b1                  17136 non-null  bool
 6   b2                  17136 non-null  bool
 7   b3                  17136 non-null  bool
 8   pitcher             17136 non-null  object
 9   pitcherHand         17105 non-null  object
 10  batter              17136 non-null  object
 11  batterHand          17105 non-null  object
 12  gameID              17136 non-null  int32
 13  inning              17136 non-null  object
 14  pitchType           17136 non-null  object
 15  speed               17136 non-null  object
 16  ballPositionLabel   17136 non-null  object
 17  ballX               17136 non-null  int8
 18  ballY               17136 non-null  object
 19  dir                 3094 non-null   object
 20  dist                4356 non-null   float32
 21  battingType         3094 non-null   object
 22  isOuts              4356 non-null   object
 23  y                   17136 non-null  int8
 24  startTime           17136 non-null  object
 25  bottomTeam          17136 non-null  object
 26  topTeam             17136 non-null  object
 27  place               17136 non-null  object
 28  startDayTime        17136 non-null  object
dtypes: bool(3), float32(1), int16(1), int32(1), int8(6), object(17)
memory usage: 2.7+ MB
# Read test_data.csv
ts_df = reduce_mem_usage(pd.read_csv(os.path.join(INPUT_PATH, 'test_data.csv')))
# Merge game_info
ts_df = pd.merge(ts_df, game_df.drop(['bgTop', 'bgBottom'], axis=1), on='gameID', how='left')
# Disambiguate players who share a name by appending the team
f = ts_df['inning'].str.contains('表')
ts_df.loc[ f, 'batter'] = ts_df.loc[ f, 'batter'] + '@' + ts_df.loc[ f, 'topTeam'].astype(str)
ts_df.loc[~f, 'batter'] = ts_df.loc[~f, 'batter'] + '@' + ts_df.loc[~f, 'bottomTeam'].astype(str)
ts_df.loc[ f, 'pitcher'] = ts_df.loc[ f, 'pitcher'] + '@' + ts_df.loc[ f, 'bottomTeam'].astype(str)
ts_df.loc[~f, 'pitcher'] = ts_df.loc[~f, 'pitcher'] + '@' + ts_df.loc[~f, 'topTeam'].astype(str)
display(ts_df)
ts_df.info()
| | id | totalPitchingCount | B | S | O | b1 | b2 | b3 | pitcher | pitcherHand | batter | batterHand | gameID | inning | startTime | bottomTeam | topTeam | place | startDayTime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 1 | 0 | 0 | False | False | False | 遠藤 淳志@広島 | R | 乙坂 智@DeNA | L | 20202564 | 2回表 | 13:30 | 広島 | DeNA | マツダスタジアム | 2020-09-06 13:30:00 |
1 | 1 | 1 | 0 | 0 | 0 | False | False | False | バンデンハーク@ソフトバンク | R | 西川 遥輝@日本ハム | L | 20202106 | 3回裏 | 18:00 | 日本ハム | ソフトバンク | 札幌ドーム | 2020-07-02 18:00:00 |
2 | 2 | 7 | 3 | 2 | 2 | True | False | False | スアレス@阪神 | R | 堂林 翔太@広島 | R | 20203305 | 9回裏 | 14:00 | 広島 | 阪神 | マツダスタジアム | 2020-11-07 14:00:00 |
3 | 3 | 1 | 0 | 0 | 2 | True | False | False | クック@ヤクルト | R | 井領 雅貴@中日 | L | 20202650 | 3回裏 | 18:00 | 中日 | ヤクルト | ナゴヤドーム | 2020-09-23 18:00:00 |
4 | 4 | 2 | 0 | 0 | 2 | False | False | False | 則本 昂大@楽天 | R | 安達 了一@オリックス | R | 20202339 | 2回表 | 18:00 | 楽天 | オリックス | 楽天生命パーク | 2020-07-24 18:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33803 | 33803 | 2 | 0 | 1 | 0 | False | False | False | 床田 寛樹@広島 | L | 坂口 智隆@ヤクルト | L | 20202023 | 5回表 | 18:00 | 広島 | ヤクルト | マツダスタジアム | 2020-07-18 18:00:00 |
33804 | 33804 | 1 | 0 | 0 | 0 | False | False | False | 堀岡 隼人@巨人 | R | メヒア@広島 | R | 20202640 | 9回表 | 18:00 | 巨人 | 広島 | 東京ドーム | 2020-09-21 18:00:00 |
33805 | 33805 | 1 | 0 | 0 | 0 | True | False | False | ディプラン@巨人 | R | 鈴木 誠也@広島 | R | 20202864 | 7回裏 | 18:00 | 広島 | 巨人 | マツダスタジアム | 2020-11-04 18:00:00 |
33806 | 33806 | 5 | 3 | 1 | 1 | False | True | False | 田村 伊知郎@西武 | R | 周東 佑京@ソフトバンク | L | 20202806 | 8回裏 | 18:00 | ソフトバンク | 西武 | PayPayドーム | 2020-10-23 18:00:00 |
33807 | 33807 | 3 | 0 | 2 | 1 | False | False | False | 山本 由伸@オリックス | R | 源田 壮亮@西武 | L | 20202572 | 6回裏 | 18:00 | 西武 | オリックス | メットライフ | 2020-09-08 18:00:00 |
33808 rows × 19 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33808 entries, 0 to 33807
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  33808 non-null  int32
 1   totalPitchingCount  33808 non-null  int8
 2   B                   33808 non-null  int8
 3   S                   33808 non-null  int8
 4   O                   33808 non-null  int8
 5   b1                  33808 non-null  bool
 6   b2                  33808 non-null  bool
 7   b3                  33808 non-null  bool
 8   pitcher             33807 non-null  object
 9   pitcherHand         33737 non-null  object
 10  batter              33803 non-null  object
 11  batterHand          33737 non-null  object
 12  gameID              33808 non-null  int32
 13  inning              33808 non-null  object
 14  startTime           33808 non-null  object
 15  bottomTeam          33808 non-null  object
 16  topTeam             33808 non-null  object
 17  place               33808 non-null  object
 18  startDayTime        33808 non-null  object
dtypes: bool(3), int32(2), int8(4), object(10)
memory usage: 3.3+ MB
# Get the pitchers common to train and test
tr_pitcher = set(tr_df['pitcher'].unique())
ts_pitcher = set(ts_df['pitcher'].unique())
print(tr_df['pitcher'].isin(tr_pitcher & ts_pitcher).sum())
print(ts_df['pitcher'].isin(tr_pitcher & ts_pitcher).sum())
# Get the batters common to train and test
tr_batter = set(tr_df['batter'].unique())
ts_batter = set(ts_df['batter'].unique())
print(tr_df['batter'].isin(tr_batter & ts_batter).sum())
print(ts_df['batter'].isin(tr_batter & ts_batter).sum())
16949
24459
17115
28264
# Concatenate train_data and test_data
input_df = pd.concat([tr_df, ts_df], axis=0).reset_index(drop=True)
# pitcherHand and batterHand
input_df['pitcherHand'] = input_df['pitcherHand'].fillna('R')
input_df['batterHand'] = input_df['batterHand'].fillna('R')
# Pitch type
input_df['pitchType'] = input_df['pitchType'].fillna('-')
# Pitch speed
input_df['speed'] = input_df['speed'].str.replace('km/h', '').replace('-', '135').astype(float)
input_df['speed'] = input_df['speed'].fillna(0)
# Pitch location label
input_df['ballPositionLabel'] = input_df['ballPositionLabel'].fillna('中心')
# X coordinate of the pitch (1-21)
input_df['ballX'] = input_df['ballX'].fillna(0).astype(int)
# Convert the Y coordinate of the pitch (A-K) to integers
input_df['ballY'] = input_df['ballY'].map({chr(ord('A')+i):i+1 for i in range(11)})
input_df['ballY'] = input_df['ballY'].fillna(0).astype(int)
# Batted-ball direction (A-Z)
# Note: this line maps 'ballY', which was converted to integers just above, so no key
# matches and 'dir' becomes 0 in every row; mapping input_df['dir'] was probably intended.
input_df['dir'] = input_df['ballY'].map({chr(ord('A')+i):i+1 for i in range(26)})
input_df['dir'] = input_df['dir'].fillna(0).astype(int)
# Batted-ball distance
input_df['dist'] = input_df['dist'].fillna(0)
# Batted-ball type
input_df['battingType'] = input_df['battingType'].fillna('G')
# Whether the pitch resulted in an out
input_df['isOuts'] = input_df['isOuts'].fillna('-1').astype(int)
display(input_df)
input_df.info()
del tr_df, ts_df, game_df
| | id | totalPitchingCount | B | S | O | b1 | b2 | b3 | pitcher | pitcherHand | batter | batterHand | gameID | inning | pitchType | speed | ballPositionLabel | ballX | ballY | dir | dist | battingType | isOuts | y | startTime | bottomTeam | topTeam | place | startDayTime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | ストレート | 149.0 | 内角低め | 17 | 10 | 0 | 0.000000 | G | -1 | 0.0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
1 | 1 | 2 | 1 | 0 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | ストレート | 149.0 | 内角低め | 14 | 9 | 0 | 0.000000 | G | -1 | 1.0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
2 | 2 | 3 | 1 | 1 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | チェンジアップ | 137.0 | 外角高め | 8 | 4 | 0 | 0.000000 | G | -1 | 0.0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
3 | 3 | 4 | 2 | 1 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | スライダー | 138.0 | 内角中心 | 21 | 7 | 0 | 0.000000 | G | -1 | 2.0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
4 | 4 | 5 | 2 | 2 | 0 | False | False | False | 今永 昇太@DeNA | L | ピレラ@広島 | R | 20202173 | 1回表 | チェンジアップ | 136.0 | 外角中心 | 7 | 6 | 0 | 38.299999 | G | 0 | 4.0 | 18:00 | DeNA | 広島 | 横浜 | 2020-06-19 18:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50939 | 33803 | 2 | 0 | 1 | 0 | False | False | False | 床田 寛樹@広島 | L | 坂口 智隆@ヤクルト | L | 20202023 | 5回表 | - | 0.0 | 中心 | 0 | 0 | 0 | 0.000000 | G | -1 | NaN | 18:00 | 広島 | ヤクルト | マツダスタジアム | 2020-07-18 18:00:00 |
50940 | 33804 | 1 | 0 | 0 | 0 | False | False | False | 堀岡 隼人@巨人 | R | メヒア@広島 | R | 20202640 | 9回表 | - | 0.0 | 中心 | 0 | 0 | 0 | 0.000000 | G | -1 | NaN | 18:00 | 巨人 | 広島 | 東京ドーム | 2020-09-21 18:00:00 |
50941 | 33805 | 1 | 0 | 0 | 0 | True | False | False | ディプラン@巨人 | R | 鈴木 誠也@広島 | R | 20202864 | 7回裏 | - | 0.0 | 中心 | 0 | 0 | 0 | 0.000000 | G | -1 | NaN | 18:00 | 広島 | 巨人 | マツダスタジアム | 2020-11-04 18:00:00 |
50942 | 33806 | 5 | 3 | 1 | 1 | False | True | False | 田村 伊知郎@西武 | R | 周東 佑京@ソフトバンク | L | 20202806 | 8回裏 | - | 0.0 | 中心 | 0 | 0 | 0 | 0.000000 | G | -1 | NaN | 18:00 | ソフトバンク | 西武 | PayPayドーム | 2020-10-23 18:00:00 |
50943 | 33807 | 3 | 0 | 2 | 1 | False | False | False | 山本 由伸@オリックス | R | 源田 壮亮@西武 | L | 20202572 | 6回裏 | - | 0.0 | 中心 | 0 | 0 | 0 | 0.000000 | G | -1 | NaN | 18:00 | 西武 | オリックス | メットライフ | 2020-09-08 18:00:00 |
50944 rows × 29 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50944 entries, 0 to 50943
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  50944 non-null  int32
 1   totalPitchingCount  50944 non-null  int8
 2   B                   50944 non-null  int8
 3   S                   50944 non-null  int8
 4   O                   50944 non-null  int8
 5   b1                  50944 non-null  bool
 6   b2                  50944 non-null  bool
 7   b3                  50944 non-null  bool
 8   pitcher             50943 non-null  object
 9   pitcherHand         50944 non-null  object
 10  batter              50939 non-null  object
 11  batterHand          50944 non-null  object
 12  gameID              50944 non-null  int32
 13  inning              50944 non-null  object
 14  pitchType           50944 non-null  object
 15  speed               50944 non-null  float64
 16  ballPositionLabel   50944 non-null  object
 17  ballX               50944 non-null  int64
 18  ballY               50944 non-null  int64
 19  dir                 50944 non-null  int64
 20  dist                50944 non-null  float32
 21  battingType         50944 non-null  object
 22  isOuts              50944 non-null  int64
 23  y                   17136 non-null  float64
 24  startTime           50944 non-null  object
 25  bottomTeam          50944 non-null  object
 26  topTeam             50944 non-null  object
 27  place               50944 non-null  object
 28  startDayTime        50944 non-null  object
dtypes: bool(3), float32(1), float64(2), int32(2), int64(4), int8(4), object(13)
memory usage: 8.3+ MB
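The letter coordinates are turned into integers with a dictionary built from `chr`/`ord`. The sketch below replays that mapping on made-up values and also shows what happens when the mapping is applied a second time to values that are already integers. That is one possible explanation for `dir` being 0 in every row of the table above, though this is an inference from the displayed output rather than something the post states.

```python
import pandas as pd

# 'A' -> 1, 'B' -> 2, ..., 'K' -> 11
letter_to_int = {chr(ord('A') + i): i + 1 for i in range(11)}

ball_y = pd.Series(['J', 'D', None, 'K'])

# Unknown or missing letters become NaN in map(), then 0 after fillna
encoded = ball_y.map(letter_to_int).fillna(0).astype(int)

# Mapping the already-encoded integers finds no matching key, so all become 0
remapped = encoded.map(letter_to_int).fillna(0).astype(int)

print(encoded.tolist(), remapped.tolist())
```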
from sklearn.preprocessing import LabelEncoder
def get_base_features(input_df):
    seed_everything(seed=SEED)
    output_df = input_df.copy()
    output_df['inning'] = 2 * (output_df['inning'].str[0].astype(int) - 1) + output_df['inning'].str.contains('裏')
    output_df['pitcherCommon'] = output_df['pitcher']
    output_df['batterCommon'] = output_df['batter']
    output_df.loc[~(output_df['pitcherCommon'].isin(tr_pitcher & ts_pitcher)), 'pitcherCommon'] = np.nan
    output_df.loc[~(output_df['batterCommon'].isin(tr_batter & ts_batter)), 'batterCommon'] = np.nan
    # label encoding
    cat_cols = output_df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        f = output_df[col].notnull()
        output_df.loc[f, col] = LabelEncoder().fit_transform(output_df.loc[f, col].values)
        output_df.loc[~f, col] = -1
        output_df[col] = output_df[col].astype(int)
    output_df['inningHalf'] = output_df['inning'] % 2
    output_df['inningNumber'] = output_df['inning'] // 2
    output_df['outCount'] = output_df['inning'] * 3 + output_df['O']
    output_df['B_S_O'] = output_df['B'] + 4 * (output_df['S'] + 3 * output_df['O'])
    output_df['b1_b2_b3'] = output_df['b1'] * 1 + output_df['b2'] * 2 + output_df['b3'] * 4
    return reduce_mem_usage(output_df)
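As a worked example of the count encodings in `get_base_features`, the sketch below recomputes the derived columns for a single pitch in plain Python. The helper name `encode_pitch` is hypothetical and not part of the original notebook; it only mirrors the arithmetic of the function above.

```python
def encode_pitch(inning_label, balls, strikes, outs, b1, b2, b3):
    """Recompute the derived columns for one pitch (illustrative helper only)."""
    # As in the notebook, only the first character of the label is read as the inning number.
    inning = 2 * (int(inning_label[0]) - 1) + int('裏' in inning_label)
    return {
        'inning': inning,                       # 0 = top of the 1st, 1 = bottom of the 1st, ...
        'inningHalf': inning % 2,               # 0 = top, 1 = bottom
        'inningNumber': inning // 2,            # 0-based inning number
        'outCount': inning * 3 + outs,          # running out index within the game
        'B_S_O': balls + 4 * (strikes + 3 * outs),
        'b1_b2_b3': b1 * 1 + b2 * 2 + b3 * 4,   # runners on base as a 3-bit mask
    }

# Row 50942 of the tables shown here: 8回裏, B=3, S=1, O=1, runner on second base only.
features = encode_pitch('8回裏', 3, 1, 1, False, True, False)
print(features)
```

The values agree with that row in the `base_df` output (inning 15, outCount 46, B_S_O 19, b1_b2_b3 2).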
def random_sampling(input_df, n_sample=10):
    dfs = []
    tr_df = input_df[input_df['y'].notnull()].copy()
    ts_df = input_df[input_df['y'].isnull()].copy()
    for i in tqdm(range(n_sample)):
        df = tr_df.groupby(['gameID', 'outCount']).apply(lambda x: x.sample(n=1, random_state=i)).reset_index(drop=True)
        df['subGameID'] = df['gameID'] * n_sample + i
        dfs.append(df)
    ts_df['subGameID'] = ts_df['gameID'] * n_sample
    return pd.concat(dfs + [ts_df], axis=0)
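The idea behind `random_sampling` (keep one labelled pitch per `(gameID, outCount)` group, repeated `n_sample` times, each round tagged with its own `subGameID`) can be sketched without pandas as follows. This is a minimal illustration under assumptions: the helper `sample_one_per_group` and the toy rows are hypothetical, and the random draws will not reproduce the ones pandas makes with `random_state=i`.

```python
import random
from collections import defaultdict

def sample_one_per_group(rows, n_sample):
    """Pick one row per (gameID, outCount) group, n_sample times over."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row['gameID'], row['outCount'])].append(row)
    sampled = []
    for i in range(n_sample):
        rng = random.Random(i)  # one seed per sampling round
        for group_rows in groups.values():
            picked = dict(rng.choice(group_rows))
            # Each round becomes its own pseudo-game, as in the notebook.
            picked['subGameID'] = picked['gameID'] * n_sample + i
            sampled.append(picked)
    return sampled

toy_rows = [
    {'gameID': 1, 'outCount': 0, 'pitch': 'a'},
    {'gameID': 1, 'outCount': 0, 'pitch': 'b'},
    {'gameID': 1, 'outCount': 1, 'pitch': 'c'},
    {'gameID': 2, 'outCount': 0, 'pitch': 'd'},
]
sampled_rows = sample_one_per_group(toy_rows, n_sample=5)
print(len(sampled_rows))  # 3 groups x 5 rounds
```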
# Aggregation helper
def aggregation(input_df, group_keys, group_values, agg_methods):
    new_df = []
    for agg_method in agg_methods:
        for col in group_values:
            if callable(agg_method):
                agg_method_name = agg_method.__name__
            else:
                agg_method_name = agg_method
            new_col = f'agg_{agg_method_name}_{col}_grpby_' + '_'.join(group_keys)
            agg_df = input_df[[col]+group_keys].groupby(group_keys)[[col]].agg(agg_method)
            agg_df.columns = [new_col]
            new_df.append(agg_df)
    new_df = pd.concat(new_df, axis=1).reset_index()
    output_df = pd.merge(input_df, new_df, on=group_keys, how='left')
    return output_df, list(new_df.columns)
def get_agg_gameID_inningHalf_features(input_df):
    group_keys = ['subGameID', 'inningHalf']
    group_values = ['S', 'B', 'b1', 'b2', 'b3']
    agg_methods = ['mean', 'std']
    output_df, cols = aggregation(
        input_df, group_keys=group_keys, group_values=group_values, agg_methods=agg_methods)
    return reduce_mem_usage(output_df)
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler
# Features built from a pivot table
def get_pivot_NMF9_features(input_df, n, value_col):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=np.median)
    sc0 = MinMaxScaler().fit_transform(np.median(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,0::2,:], axis=-1))
    sc1 = MinMaxScaler().fit_transform(np.median(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,1::2,:], axis=-1))
    nmf = NMF(n_components=n, random_state=2021)
    nmf_df0 = pd.DataFrame(nmf.fit_transform(sc0), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF9T={x:02}')
    nmf_df1 = pd.DataFrame(nmf.fit_transform(sc1), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF9B={x:02}')
    nmf_df = pd.concat([nmf_df0, nmf_df1], axis=1)
    nmf_df = pd.merge(
        input_df, nmf_df, left_on='subGameID', right_index=True, how='left')
    return reduce_mem_usage(nmf_df)
# Features built from a pivot table
def get_pivot_NMF27_features(input_df, n, value_col):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=np.median)
    sc0 = MinMaxScaler().fit_transform(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,0::2].reshape(-1,27))
    sc1 = MinMaxScaler().fit_transform(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,1::2].reshape(-1,27))
    nmf = NMF(n_components=n, random_state=2021)
    nmf_df0 = pd.DataFrame(nmf.fit_transform(sc0), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF27T={x:02}')
    nmf_df1 = pd.DataFrame(nmf.fit_transform(sc1), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF27B={x:02}')
    nmf_df = pd.concat([nmf_df0, nmf_df1], axis=1)
    nmf_df = pd.merge(
        input_df, nmf_df, left_on='subGameID', right_index=True, how='left')
    return reduce_mem_usage(nmf_df)
# Features built from a pivot table
def get_pivot_NMF54_features(input_df, n, value_col):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=np.median)
    sc = MinMaxScaler().fit_transform(pivot_df.fillna(0).values)
    nmf = NMF(n_components=n, random_state=2021)
    nmf_df = pd.DataFrame(nmf.fit_transform(sc), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF54={x:02}')
    nmf_df = pd.merge(
        input_df, nmf_df, left_on='subGameID', right_index=True, how='left')
    return reduce_mem_usage(nmf_df)
def get_diff_feature(input_df, value_col, periods, in_inning=True, aggfunc=np.median):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=aggfunc)
    if in_inning:
        dfs = []
        for inning in range(9):
            df0 = pivot_df.loc[:, [out+inning*6 for out in range(0,3)]].diff(periods, axis=1)
            df1 = pivot_df.loc[:, [out+inning*6 for out in range(3,6)]].diff(periods, axis=1)
            dfs += [df0, df1]
        pivot_df = pd.concat(dfs, axis=1).stack()
    else:
        df0 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(0,3)]].diff(periods, axis=1)
        df1 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(3,6)]].diff(periods, axis=1)
        pivot_df = pd.concat([df0, df1], axis=1).stack()
    return pivot_df
def get_shift_feature(input_df, value_col, periods, in_inning=True, aggfunc=np.median):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=aggfunc)
    if in_inning:
        dfs = []
        for inning in range(9):
            df0 = pivot_df.loc[:, [out+inning*6 for out in range(0,3)]].shift(periods, axis=1)
            df1 = pivot_df.loc[:, [out+inning*6 for out in range(3,6)]].shift(periods, axis=1)
            dfs += [df0, df1]
        pivot_df = pd.concat(dfs, axis=1).stack()
    else:
        df0 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(0,3)]].shift(periods, axis=1)
        df1 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(3,6)]].shift(periods, axis=1)
        pivot_df = pd.concat([df0, df1], axis=1).stack()
    return pivot_df
def get_next_data(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_shift_feature(input_df, value_col, periods=-1, in_inning=in_inning)
    pivot_df.name = 'next_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df
def get_prev_data(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_shift_feature(input_df, value_col, periods=1, in_inning=in_inning)
    pivot_df.name = 'prev_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df
def get_next_diff(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_diff_feature(input_df, value_col, periods=-1, in_inning=in_inning)
    pivot_df.name = 'next_diff_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df
def get_prev_diff(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_diff_feature(input_df, value_col, periods=1, in_inning=in_inning)
    pivot_df.name = 'prev_diff_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df
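To make the `in_inning=True` behaviour of the next/previous features concrete, here is a small sketch of shifting values inside blocks of three consecutive outs, so that nothing leaks across a half-inning boundary. The helper `shift_within_half_inning` is hypothetical and works on a plain list rather than the pivot table used in the notebook; the fill value 8 follows the `nan_value=8` used in `preprocess`, which lies outside the 0 to 7 range of `b1_b2_b3`.

```python
def shift_within_half_inning(values, periods):
    """Shift a per-out sequence by `periods`, restarting every 3 outs.

    periods=1 gives the previous value, periods=-1 the next one;
    positions without a neighbour in the same half-inning become None.
    """
    shifted = [None] * len(values)
    for start in range(0, len(values), 3):
        block = values[start:start + 3]
        for pos in range(len(block)):
            src = pos - periods
            if 0 <= src < len(block):
                shifted[start + pos] = block[src]
    return shifted

# Two half-innings (3 outs each) of the b1_b2_b3 runner mask.
runner_mask = [0, 1, 3, 0, 0, 2]
next_values = shift_within_half_inning(runner_mask, periods=-1)
prev_values = shift_within_half_inning(runner_mask, periods=1)
next_filled = [8 if v is None else v for v in next_values]
print(next_values, prev_values, next_filled)
```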
def get_tfidf(input_df, term_col, document_col):
    output_df = input_df.copy()
    output_df['dummy'] = 0
    tf1 = output_df[[document_col, term_col, 'dummy']].groupby([document_col, term_col])['dummy'].count()
    tf1.name = 'tf1'
    tf2 = output_df[[document_col, term_col, 'dummy']].groupby([document_col])['dummy'].count()
    tf2.name = 'tf2'
    idf1 = output_df[document_col].nunique()
    idf2 = output_df[[document_col, term_col, 'dummy']].groupby([term_col])[document_col].nunique()
    idf2.name = 'idf2'
    output_df = pd.merge(output_df, tf1, left_on=[document_col, term_col], right_index=True, how='left')
    output_df = pd.merge(output_df, tf2, left_on=[document_col], right_index=True, how='left')
    output_df['idf1'] = idf1
    output_df = pd.merge(output_df, idf2, left_on=[term_col], right_index=True, how='left')
    col_name = 'tfidf_' + term_col + '_in_' + document_col
    tf = np.log(1 + (1 + output_df['tf1']) / (1 + output_df['tf2']))
    idf = 1 + np.log((1 + output_df['idf1']) / (1 + output_df['idf2']))
    output_df[col_name] = tf * idf
    return output_df.drop(['tf1', 'tf2', 'idf1', 'idf2', 'dummy'], axis=1)
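The smoothed TF-IDF computed by `get_tfidf` can be checked by hand on a tiny example. The sketch below uses the same two formulas, `tf = log(1 + (1 + tf1) / (1 + tf2))` and `idf = 1 + log((1 + N) / (1 + df))`, on plain `(document, term)` pairs; the function name `tfidf` and the toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf(pairs):
    """pairs: one (document, term) tuple per row. Returns {(document, term): score}."""
    tf1 = Counter(pairs)                                     # rows per (document, term)
    tf2 = Counter(doc for doc, _ in pairs)                   # rows per document
    n_docs = len(tf2)                                        # number of distinct documents
    docs_per_term = Counter(term for _, term in set(pairs))  # documents containing each term
    scores = {}
    for (doc, term), count in tf1.items():
        tf = math.log(1 + (1 + count) / (1 + tf2[doc]))
        idf = 1 + math.log((1 + n_docs) / (1 + docs_per_term[term]))
        scores[(doc, term)] = tf * idf
    return scores

# Batter 'A' appears twice in game g1 and once in g2; batter 'B' only in g1.
toy_pairs = [('g1', 'A'), ('g1', 'A'), ('g1', 'B'), ('g2', 'A')]
toy_scores = tfidf(toy_pairs)
print(toy_scores)
```

In the notebook the "term" is `batter` and the "document" is `subGameID`, so a batter who appears in fewer sampled games gets a larger idf factor.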
def get_skip(input_df):
    output_df = input_df.copy()
    next_skip_map = {}
    prev_skip_map = {}
    for key, group in output_df.groupby(['subGameID', 'inningHalf']):
        n = len(group)
        dist_map = {}
        batter = group.sort_values('outCount')['batter']
        for i in range(n - 1):
            b1 = batter.iloc[i]
            for d in range(1, 5):
                if i + d >= n:
                    break
                b2 = batter.iloc[i + d]
                if (b1, b2) in dist_map.keys():
                    if dist_map[(b1, b2)] < d:
                        dist_map[(b1, b2)] = d
                else:
                    dist_map[(b1, b2)] = d
        for i in range(len(batter) - 1):
            next_skip_map[batter.index[i]] = dist_map[(batter.iloc[i], batter.iloc[i+1])]
        for i in range(1, len(batter)):
            prev_skip_map[batter.index[i]] = dist_map[(batter.iloc[i-1], batter.iloc[i])]
    output_df['next_skip'] = output_df.index.map(next_skip_map).fillna(0).astype(np.int8)
    output_df['prev_skip'] = output_df.index.map(prev_skip_map).fillna(0).astype(np.int8)
    return output_df
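A compact way to read `get_skip`: within one `(subGameID, inningHalf)` group, it records for every ordered pair of batters the largest gap (1 to 4 rows) at which the pair occurs, then looks that value up for each pair of adjacent rows. The sketch below reproduces this on a plain list; `skip_distances` is a hypothetical helper, not code from the notebook.

```python
def skip_distances(batters, max_gap=4):
    """batters: batter ids of one group, ordered by outCount (at least one entry)."""
    n = len(batters)
    dist_map = {}
    for i in range(n - 1):
        for d in range(1, max_gap + 1):
            if i + d >= n:
                break
            pair = (batters[i], batters[i + d])
            # Keep the largest gap seen for this ordered pair.
            dist_map[pair] = max(dist_map.get(pair, 0), d)
    # The last row has no next neighbour and the first row has no previous one: use 0.
    next_skip = [dist_map[(batters[i], batters[i + 1])] for i in range(n - 1)] + [0]
    prev_skip = [0] + [dist_map[(batters[i - 1], batters[i])] for i in range(1, n)]
    return next_skip, prev_skip

next_skip, prev_skip = skip_distances([10, 11, 10, 11])
print(next_skip, prev_skip)
```

Here the pair `(10, 11)` occurs at gaps 1 and 3, so adjacent rows `10 -> 11` get the value 3, while `(11, 10)` only occurs at gap 1.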
# Function that runs the feature-engineering functions above
def preprocess(input_df):
    seed_everything(seed=SEED)
    output_df = input_df.copy()
    # aggregation
    output_df = get_agg_gameID_inningHalf_features(output_df)
    # pivot
    output_df = get_pivot_NMF9_features(output_df, n=2, value_col='b1_b2_b3')
    output_df = get_pivot_NMF27_features(output_df, n=2, value_col='b1_b2_b3')
    output_df = get_pivot_NMF54_features(output_df, n=2, value_col='b1_b2_b3')
    # next/previous
    output_df = get_next_data(output_df, value_col='b1_b2_b3', nan_value=8)
    output_df = get_next_diff(output_df, value_col='b1_b2_b3', nan_value=8)
    output_df = get_prev_data(output_df, value_col='b1_b2_b3', nan_value=8)
    output_df = get_prev_diff(output_df, value_col='b1_b2_b3', nan_value=8)
    # TF-IDF
    output_df = get_tfidf(output_df, term_col='batter', document_col='subGameID')
    # skip
    output_df = get_skip(output_df)
    return output_df
base_df = get_base_features(input_df)
display(base_df)
base_df.info()
| | id | totalPitchingCount | B | S | O | b1 | b2 | b3 | pitcher | pitcherHand | batter | batterHand | gameID | inning | pitchType | speed | ballPositionLabel | ballX | ballY | dir | dist | battingType | isOuts | y | startTime | bottomTeam | topTeam | place | startDayTime | pitcherCommon | batterCommon | inningHalf | inningNumber | outCount | B_S_O | b1_b2_b3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | False | False | False | 71 | 0 | 24 | 1 | 20202173 | 0 | 5 | 149.0 | 4 | 17 | 10 | 0 | 0.000000 | 2 | -1 | 0.0 | 5 | 0 | 7 | 10 | 0 | 47 | 16 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 2 | 1 | 0 | 0 | False | False | False | 71 | 0 | 24 | 1 | 20202173 | 0 | 5 | 149.0 | 4 | 14 | 9 | 0 | 0.000000 | 2 | -1 | 1.0 | 5 | 0 | 7 | 10 | 0 | 47 | 16 | 0 | 0 | 0 | 1 | 0 |
2 | 2 | 3 | 1 | 1 | 0 | False | False | False | 71 | 0 | 24 | 1 | 20202173 | 0 | 7 | 137.0 | 8 | 8 | 4 | 0 | 0.000000 | 2 | -1 | 0.0 | 5 | 0 | 7 | 10 | 0 | 47 | 16 | 0 | 0 | 0 | 5 | 0 |
3 | 3 | 4 | 2 | 1 | 0 | False | False | False | 71 | 0 | 24 | 1 | 20202173 | 0 | 6 | 138.0 | 3 | 21 | 7 | 0 | 0.000000 | 2 | -1 | 2.0 | 5 | 0 | 7 | 10 | 0 | 47 | 16 | 0 | 0 | 0 | 6 | 0 |
4 | 4 | 5 | 2 | 2 | 0 | False | False | False | 71 | 0 | 24 | 1 | 20202173 | 0 | 7 | 136.0 | 6 | 7 | 6 | 0 | 38.299999 | 2 | 0 | 4.0 | 5 | 0 | 7 | 10 | 0 | 47 | 16 | 0 | 0 | 0 | 10 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50939 | 33803 | 2 | 0 | 1 | 0 | False | False | False | 171 | 0 | 113 | 0 | 20202023 | 8 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.000000 | 2 | -1 | NaN | 5 | 7 | 3 | 4 | 38 | 98 | 72 | 0 | 4 | 24 | 4 | 0 |
50940 | 33804 | 1 | 0 | 0 | 0 | False | False | False | 109 | 1 | 30 | 1 | 20202640 | 16 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.000000 | 2 | -1 | NaN | 5 | 6 | 7 | 8 | 143 | -1 | 22 | 0 | 8 | 48 | 0 | 0 |
50941 | 33805 | 1 | 0 | 0 | 0 | True | False | False | 19 | 1 | 375 | 1 | 20202864 | 13 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.000000 | 2 | -1 | NaN | 5 | 7 | 6 | 4 | 206 | -1 | 213 | 1 | 6 | 39 | 0 | 1 |
50942 | 33806 | 5 | 3 | 1 | 1 | False | True | False | 250 | 1 | 108 | 0 | 20202806 | 15 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.000000 | 2 | -1 | NaN | 5 | 2 | 10 | 0 | 190 | 139 | 69 | 1 | 7 | 46 | 19 | 2 |
50943 | 33807 | 3 | 0 | 2 | 1 | False | False | False | 153 | 1 | 274 | 0 | 20202572 | 11 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.000000 | 2 | -1 | NaN | 5 | 10 | 1 | 5 | 118 | 85 | 163 | 1 | 5 | 34 | 20 | 0 |
50944 rows × 36 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50944 entries, 0 to 50943
Data columns (total 36 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  50944 non-null  int32
 1   totalPitchingCount  50944 non-null  int8
 2   B                   50944 non-null  int8
 3   S                   50944 non-null  int8
 4   O                   50944 non-null  int8
 5   b1                  50944 non-null  bool
 6   b2                  50944 non-null  bool
 7   b3                  50944 non-null  bool
 8   pitcher             50944 non-null  int16
 9   pitcherHand         50944 non-null  int8
 10  batter              50944 non-null  int16
 11  batterHand          50944 non-null  int8
 12  gameID              50944 non-null  int32
 13  inning              50944 non-null  int8
 14  pitchType           50944 non-null  int8
 15  speed               50944 non-null  float32
 16  ballPositionLabel   50944 non-null  int8
 17  ballX               50944 non-null  int8
 18  ballY               50944 non-null  int8
 19  dir                 50944 non-null  int8
 20  dist                50944 non-null  float32
 21  battingType         50944 non-null  int8
 22  isOuts              50944 non-null  int8
 23  y                   17136 non-null  float32
 24  startTime           50944 non-null  int8
 25  bottomTeam          50944 non-null  int8
 26  topTeam             50944 non-null  int8
 27  place               50944 non-null  int8
 28  startDayTime        50944 non-null  int16
 29  pitcherCommon       50944 non-null  int16
 30  batterCommon        50944 non-null  int16
 31  inningHalf          50944 non-null  int8
 32  inningNumber        50944 non-null  int8
 33  outCount            50944 non-null  int8
 34  B_S_O               50944 non-null  int8
 35  b1_b2_b3            50944 non-null  int8
dtypes: bool(3), float32(3), int16(5), int32(2), int8(23)
memory usage: 2.7 MB
sampling_df = random_sampling(base_df, n_sample=N_SAMPLE)
display(sampling_df)
sampling_df.info()
| | id | totalPitchingCount | B | S | O | b1 | b2 | b3 | pitcher | pitcherHand | batter | batterHand | gameID | inning | pitchType | speed | ballPositionLabel | ballX | ballY | dir | dist | battingType | isOuts | y | startTime | bottomTeam | topTeam | place | startDayTime | pitcherCommon | batterCommon | inningHalf | inningNumber | outCount | B_S_O | b1_b2_b3 | subGameID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16276 | 3 | 2 | 0 | 0 | False | False | False | 325 | 1 | 55 | 1 | 20202116 | 0 | 5 | 148.0 | 6 | 8 | 7 | 0 | 0.0 | 2 | -1 | 1.0 | 5 | 10 | 1 | 5 | 13 | 180 | 37 | 0 | 0 | 0 | 2 | 0 | 101010580 |
1 | 16280 | 3 | 1 | 1 | 1 | False | False | False | 325 | 1 | 142 | 1 | 20202116 | 0 | 6 | 136.0 | 6 | 8 | 5 | 0 | 25.5 | 2 | 1 | 3.0 | 5 | 10 | 1 | 5 | 13 | 180 | 92 | 0 | 0 | 1 | 17 | 0 | 101010580 |
2 | 16288 | 3 | 1 | 1 | 2 | True | False | False | 325 | 1 | 15 | 1 | 20202116 | 0 | 5 | 148.0 | 11 | 15 | 3 | 0 | 0.0 | 2 | -1 | 1.0 | 5 | 10 | 1 | 5 | 13 | 180 | 9 | 0 | 0 | 2 | 29 | 1 | 101010580 |
3 | 16292 | 3 | 0 | 2 | 0 | False | False | False | 0 | 0 | 374 | 0 | 20202116 | 1 | 6 | 126.0 | 6 | 15 | 6 | 0 | 0.0 | 2 | -1 | 2.0 | 5 | 10 | 1 | 5 | 13 | 0 | 212 | 1 | 0 | 3 | 8 | 0 | 101010580 |
4 | 16300 | 7 | 2 | 2 | 1 | False | False | False | 0 | 0 | 274 | 0 | 20202116 | 1 | 3 | 135.0 | 9 | 10 | 11 | 0 | 0.0 | 2 | -1 | 0.0 | 5 | 10 | 1 | 5 | 13 | 0 | 163 | 1 | 0 | 4 | 22 | 0 | 101010580 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50939 | 33803 | 2 | 0 | 1 | 0 | False | False | False | 171 | 0 | 113 | 0 | 20202023 | 8 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.0 | 2 | -1 | NaN | 5 | 7 | 3 | 4 | 38 | 98 | 72 | 0 | 4 | 24 | 4 | 0 | 101010115 |
50940 | 33804 | 1 | 0 | 0 | 0 | False | False | False | 109 | 1 | 30 | 1 | 20202640 | 16 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.0 | 2 | -1 | NaN | 5 | 6 | 7 | 8 | 143 | -1 | 22 | 0 | 8 | 48 | 0 | 0 | 101013200 |
50941 | 33805 | 1 | 0 | 0 | 0 | True | False | False | 19 | 1 | 375 | 1 | 20202864 | 13 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.0 | 2 | -1 | NaN | 5 | 7 | 6 | 4 | 206 | -1 | 213 | 1 | 6 | 39 | 0 | 1 | 101014320 |
50942 | 33806 | 5 | 3 | 1 | 1 | False | True | False | 250 | 1 | 108 | 0 | 20202806 | 15 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.0 | 2 | -1 | NaN | 5 | 2 | 10 | 0 | 190 | 139 | 69 | 1 | 7 | 46 | 19 | 2 | 101014030 |
50943 | 33807 | 3 | 0 | 2 | 1 | False | False | False | 153 | 1 | 274 | 0 | 20202572 | 11 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0.0 | 2 | -1 | NaN | 5 | 10 | 1 | 5 | 118 | 85 | 163 | 1 | 5 | 34 | 20 | 0 | 101012860 |
48768 rows × 37 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48768 entries, 0 to 50943
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  48768 non-null  int32
 1   totalPitchingCount  48768 non-null  int8
 2   B                   48768 non-null  int8
 3   S                   48768 non-null  int8
 4   O                   48768 non-null  int8
 5   b1                  48768 non-null  bool
 6   b2                  48768 non-null  bool
 7   b3                  48768 non-null  bool
 8   pitcher             48768 non-null  int16
 9   pitcherHand         48768 non-null  int8
 10  batter              48768 non-null  int16
 11  batterHand          48768 non-null  int8
 12  gameID              48768 non-null  int32
 13  inning              48768 non-null  int8
 14  pitchType           48768 non-null  int8
 15  speed               48768 non-null  float32
 16  ballPositionLabel   48768 non-null  int8
 17  ballX               48768 non-null  int8
 18  ballY               48768 non-null  int8
 19  dir                 48768 non-null  int8
 20  dist                48768 non-null  float32
 21  battingType         48768 non-null  int8
 22  isOuts              48768 non-null  int8
 23  y                   14960 non-null  float32
 24  startTime           48768 non-null  int8
 25  bottomTeam          48768 non-null  int8
 26  topTeam             48768 non-null  int8
 27  place               48768 non-null  int8
 28  startDayTime        48768 non-null  int16
 29  pitcherCommon       48768 non-null  int16
 30  batterCommon        48768 non-null  int16
 31  inningHalf          48768 non-null  int8
 32  inningNumber        48768 non-null  int8
 33  outCount            48768 non-null  int8
 34  B_S_O               48768 non-null  int8
 35  b1_b2_b3            48768 non-null  int8
 36  subGameID           48768 non-null  int64
dtypes: bool(3), float32(3), int16(5), int32(2), int64(1), int8(23)
memory usage: 3.3 MB
prep_df = preprocess(sampling_df)
prep_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48768 entries, 0 to 48767
Data columns (total 64 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   id                                      48768 non-null  int32
 1   totalPitchingCount                      48768 non-null  int8
 2   B                                       48768 non-null  int8
 3   S                                       48768 non-null  int8
 4   O                                       48768 non-null  int8
 5   b1                                      48768 non-null  bool
 6   b2                                      48768 non-null  bool
 7   b3                                      48768 non-null  bool
 8   pitcher                                 48768 non-null  int16
 9   pitcherHand                             48768 non-null  int8
 10  batter                                  48768 non-null  int16
 11  batterHand                              48768 non-null  int8
 12  gameID                                  48768 non-null  int32
 13  inning                                  48768 non-null  int8
 14  pitchType                               48768 non-null  int8
 15  speed                                   48768 non-null  float32
 16  ballPositionLabel                       48768 non-null  int8
 17  ballX                                   48768 non-null  int8
 18  ballY                                   48768 non-null  int8
 19  dir                                     48768 non-null  int8
 20  dist                                    48768 non-null  float32
 21  battingType                             48768 non-null  int8
 22  isOuts                                  48768 non-null  int8
 23  y                                       14960 non-null  float32
 24  startTime                               48768 non-null  int8
 25  bottomTeam                              48768 non-null  int8
 26  topTeam                                 48768 non-null  int8
 27  place                                   48768 non-null  int8
 28  startDayTime                            48768 non-null  int16
 29  pitcherCommon                           48768 non-null  int16
 30  batterCommon                            48768 non-null  int16
 31  inningHalf                              48768 non-null  int8
 32  inningNumber                            48768 non-null  int8
 33  outCount                                48768 non-null  int8
 34  B_S_O                                   48768 non-null  int8
 35  b1_b2_b3                                48768 non-null  int8
 36  subGameID                               48768 non-null  int32
 37  agg_mean_S_grpby_subGameID_inningHalf   48768 non-null  float32
 38  agg_mean_B_grpby_subGameID_inningHalf   48768 non-null  float32
 39  agg_mean_b1_grpby_subGameID_inningHalf  48768 non-null  float32
 40  agg_mean_b2_grpby_subGameID_inningHalf  48768 non-null  float32
 41  agg_mean_b3_grpby_subGameID_inningHalf  48768 non-null  float32
 42  agg_std_S_grpby_subGameID_inningHalf    48768 non-null  float32
 43  agg_std_B_grpby_subGameID_inningHalf    48768 non-null  float32
 44  agg_std_b1_grpby_subGameID_inningHalf   48768 non-null  float32
 45  agg_std_b2_grpby_subGameID_inningHalf   48768 non-null  float32
 46  agg_std_b3_grpby_subGameID_inningHalf   48768 non-null  float32
 47  pivot_b1_b2_b3_NMF9T=00                 48768 non-null  float32
 48  pivot_b1_b2_b3_NMF9T=01                 48768 non-null  float32
 49  pivot_b1_b2_b3_NMF9B=00                 48768 non-null  float32
 50  pivot_b1_b2_b3_NMF9B=01                 48768 non-null  float32
 51  pivot_b1_b2_b3_NMF27T=00                48768 non-null  float32
 52  pivot_b1_b2_b3_NMF27T=01                48768 non-null  float32
 53  pivot_b1_b2_b3_NMF27B=00                48768 non-null  float32
 54  pivot_b1_b2_b3_NMF27B=01                48768 non-null  float32
 55  pivot_b1_b2_b3_NMF54=00                 48768 non-null  float32
 56  pivot_b1_b2_b3_NMF54=01                 48768 non-null  float32
 57  next_b1_b2_b3                           48768 non-null  float64
 58  next_diff_b1_b2_b3                      48768 non-null  float64
 59  prev_b1_b2_b3                           48768 non-null  float64
 60  prev_diff_b1_b2_b3                      48768 non-null  float64
 61  tfidf_batter_in_subGameID               48768 non-null  float64
 62  next_skip                               48768 non-null  int64
 63  prev_skip                               48768 non-null  int64
dtypes: bool(3), float32(23), float64(5), int16(5), int32(3), int64(2), int8(23)
memory usage: 10.7 MB
drop_cols = [
'id',
'gameID',
'subGameID',
'pitchType',
'speed',
'ballPositionLabel',
'ballX',
'ballY',
'dir',
'dist',
'battingType',
'isOuts',
'startDayTime',
'startTime',
'pitcher',
'batter',
]
target_col = 'y'
group_col = 'gameID'
F010_train = prep_df[prep_df[target_col].notnull()]
F010_test = prep_df[prep_df[target_col].isnull()]
F010_target = F010_train[target_col]
F010_train = F010_train.drop([target_col] + drop_cols, axis=1)
F010_test = F010_test.drop([target_col] + drop_cols, axis=1)
F010_train.shape,F010_test.shape,F010_target.shape
((14960, 47), (33808, 47), (14960,))
# Save the generated feature data
F010_train.to_csv('../features/F010_train.csv',index=False)
F010_test.to_csv('../features/F010_test.csv',index=False)
F010_target.to_csv('../features/F010_target.csv',index=False)
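Compared with strike, ball, foul, and out (classes 0 to 3), the classes single, double, triple, and home run (classes 4 to 7) occur far less often, so I built a separate binary classification model for each of these four classes.
For the binary classification models, I reused know-how built up in past competitions held on ProbSpace.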
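The per-class binary setup described above can be sketched as follows: turn the multi-class label into a 0/1 target for one rare class, then balance the two classes by randomly dropping majority rows before training. This is a minimal illustration under assumptions. The helpers `binary_target` and `undersample_indices` are hypothetical stand-ins written in plain Python; the notebook itself builds the target with `pd.get_dummies` and balances it with `RandomUnderSampler` from imbalanced-learn.

```python
import random

def binary_target(labels, positive_class):
    """1 where the label equals the class of interest, else 0."""
    return [1 if y == positive_class else 0 for y in labels]

def undersample_indices(binary_labels, seed):
    """Keep every minority row and an equally sized random subset of the majority rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    return sorted(kept)

# Toy labels: class 4 (single) is rare compared with classes 0 to 3.
toy_labels = [0, 1, 0, 2, 4, 3, 1, 0, 4, 2, 1, 0]
toy_binary = binary_target(toy_labels, positive_class=4)
kept_rows = undersample_indices(toy_binary, seed=5000)
print(toy_binary, kept_rows)
```

Changing the seed yields a different majority subset, which is how the notebook draws different training samples per fold in the `MultinomialNB` variant.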
###### Function that rescales the features
def scale_train_test(train,valid,test,flg):
    # Create the scaler
    if flg == 0:
        scaler = preprocessing.StandardScaler()
    else:
        scaler = preprocessing.MinMaxScaler()
    # Transform the features
    std_train = pd.DataFrame(scaler.fit_transform(train))
    std_valid = pd.DataFrame(scaler.transform(valid))
    std_test = pd.DataFrame(scaler.transform(test))
    std_train.columns = train.columns
    std_valid.columns = valid.columns
    std_test.columns = test.columns
    return std_train,std_valid,std_test
###### Function that renames the columns
def make_colname(df):
    collist = []
    colnamelist = []
    for j in range(len(df.columns)):
        collist.append(f'col-{j}')
        colnamelist.append(df.columns[j])
        #print(df.columns[j])# = f'col-{j}'
    df.columns = collist
    return collist,colnamelist
###### Function that narrows the validation and test data down to a single team
def select_by_team(train,test,target,team_num,flg):
    if len(train[train['bottomTeam']==team_num]) > 0:
        select_valid = train[train['bottomTeam']==team_num]
        select_valid_target = target[train['bottomTeam']==team_num]
    else:
        select_valid = train.copy()
        select_valid_target = target.copy()
    select_test = test[test['bottomTeam']==team_num]
    test_index = test[test['bottomTeam']==team_num].index
    std_train,std_valid,std_test = scale_train_test(train,select_valid,select_test,flg)
    collist,colnamelist = make_colname(std_train)
    collist,colnamelist = make_colname(std_valid)
    collist,colnamelist = make_colname(std_test)
    return std_train,std_valid,std_test,select_valid_target,test_index
########################################################
### Function that trains and predicts with LightGBM
########################################################
def objective(train,test,target,valid,valid_target,all_param):
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    # Predict the target variable "y"
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    # --------------------------------------
    # Parameter definitions
    # --------------------------------------
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        # Binary classification problem
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        "seed":42,
        "learning_rate":all_param['learning_rate'],
        'lambda_l1': all_param['lambda_l1'],
        'lambda_l2': all_param['lambda_l2'],
        'num_leaves': all_param['num_leaves'],
        'feature_fraction': all_param['feature_fraction'],
        'bagging_fraction': all_param['bagging_fraction'],
        'bagging_freq': all_param['bagging_freq'],
        'min_child_samples': all_param['min_child_samples'],
    }
    Threshold_y = all_param['Threshold_y']
    # --------------------------------------
    # Training and prediction (final prediction)
    # --------------------------------------
    col_list = test.columns
    train_x = pd.DataFrame()
    test_x = pd.DataFrame()
    select_col_list = all_param['select_col_list']
    train_x = train.loc[:][select_col_list]
    test_x = test.loc[:][select_col_list]
    X_train, y_train = train_x, target['y']
    X_test = test_x
    randomTrainState = all_param['randomTrainState']
    # Build the training data with undersampling
    sampler = RandomUnderSampler(random_state=randomTrainState)
    X_resampled, y_resampled = sampler.fit_resample(X_train,y_train)
    # Build the validation data
    #X_val, y_val = valid,valid_target
    X_val = valid.loc[:][select_col_list]
    y_val = valid_target
    # LightGBM
    num_round = all_param['num_round']
    lgb_train = lgb.Dataset(X_resampled, y_resampled)
    model = lgb.train(params,                     # parameters defined above
                      lgb_train,                  # dataset to train on
                      num_boost_round=num_round,  # number of boosting rounds
                      verbose_eval=None)          # no training log (argument accepted by LightGBM < 4.0)
    lgb_val_pred = model.predict(X_val)
    val_pred = np.zeros(lgb_val_pred.shape)
    val_pred[lgb_val_pred > Threshold_y] = 1
    f1_macro = f1_score(y_val, val_pred)
    #print("+-" * 40)
    #print(f"score: {f1_macro}")
    lgb_pred = model.predict(X_test)
    sub_pred = np.zeros(lgb_pred.shape)
    sub_pred[lgb_pred > Threshold_y] = 1
    return f1_macro,sub_pred,lgb_pred
########################################################
### Function that trains and predicts with MultinomialNB
########################################################
def objective2(train,test,target,valid,valid_target,all_param):
    # --------------------------------------
    # Parameter definitions
    # --------------------------------------
    NFOLDS = 11
    Threshold_y = all_param['Threshold_y']
    # --------------------------------------
    # Training and prediction (final prediction)
    # --------------------------------------
    lgb_oof = pd.DataFrame()
    lgb_preds = pd.DataFrame()
    col_list = test.columns
    train_x = pd.DataFrame()
    test_x = pd.DataFrame()
    select_col_list = all_param['select_col_list']
    train_x = train.loc[:][select_col_list]
    test_x = test.loc[:][select_col_list]
    X_train, y_train = train_x, target['y']
    X_test = test_x
    randomTrainState = all_param['randomTrainState']
    val_scores = []
    for fold in range(NFOLDS):
        # Build the training data with undersampling
        sampler = RandomUnderSampler(random_state=randomTrainState+fold)
        X_resampled, y_resampled = sampler.fit_resample(X_train,y_train)
        # print('X resample:'+str(len(X_resampled)))
        # Build the validation data (drawn with a different random_state)
        sampler = RandomUnderSampler(random_state=randomTrainState+fold+5)
        X_val, y_val = sampler.fit_resample(X_train,y_train)
        # Training -----------------------------------------------------------------------
        model = MultinomialNB()
        model.fit(X_resampled, y_resampled)
        lgb_preds[f'pred{fold:03d}'] = model.predict(X_test)
        tmp_pred = model.predict(X_val)
        val_pred = np.zeros(tmp_pred.shape)
        val_pred[tmp_pred > Threshold_y] = 1
        valscore = f1_score(y_val, val_pred)
        val_scores.append(valscore)
    f1_macro = np.mean(val_scores)
    #print("+-" * 40)
    #print(f"score: {f1_macro}")
    lgb_pred = lgb_preds.mean(axis='columns')
    sub_pred = np.zeros(lgb_pred.shape)
    sub_pred[lgb_pred > Threshold_y] = 1
    return f1_macro,sub_pred,lgb_pred
def pred_by_team(train,test,target,team_num,all_sub_pred,all_param00,flg):
    # Build validation data and test data restricted to the given team.
    train00,valid00,test00,valid_target00,testindex00 = select_by_team(train,test,target,team_num,flg)
    # Run training and prediction
    if flg == 0:
        f1_macro00,sub_pred00,lgb_pred00 = objective(train00,test00,target,valid00,valid_target00,all_param00)
    else:
        f1_macro00,sub_pred00,lgb_pred00 = objective2(train00,test00,target,valid00,valid_target00,all_param00)
    # Store the predictions
    all_sub_pred[testindex00] = sub_pred00
    return f1_macro00,sub_pred00,lgb_pred00,all_sub_pred
def pred_all_teams(header_num,prednum,param_list,flg=0):
    # Load the data
    ###### train ########################
    train = pd.read_csv(f'../features/{header_num}_train.csv')
    target = pd.read_csv(f'../features/{header_num}_target.csv')
    target['y'] = target['y'].astype(int)
    #### test ###########################
    test = pd.read_csv(f'../features/{header_num}_test.csv')
    ##### target to one hot vector ######
    target_df = pd.get_dummies(target, columns=['y'])
    # select pred target
    # select only target pred num
    target['y'] = target_df[f'y_{prednum}']
    # Create the array that collects the predictions
    all_sub_pred = np.zeros((len(test),))
    for i in range(len(param_list)):
        f1_macro00,sub_pred00,lgb_pred00,all_sub_pred = pred_by_team(train,test,target,i,all_sub_pred,param_list[i],flg)
    return all_sub_pred
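The control flow of `pred_all_teams` (one model and one parameter set per `bottomTeam` value, with each team's predictions written back into a shared array at that team's test rows) can be sketched in plain Python as below. The names `scatter_team_predictions` and `predict_for_team` are hypothetical; in the notebook the per-team work is done by `pred_by_team`, and the team codes are the label-encoded `bottomTeam` values.

```python
def scatter_team_predictions(team_of_row, predict_for_team, n_teams):
    """Fill one prediction per test row by calling a separate predictor for each team.

    team_of_row: team code of every test row.
    predict_for_team: callable (team, row_indices) -> list of predictions (hypothetical).
    """
    all_pred = [0.0] * len(team_of_row)  # rows of teams without a predictor stay 0
    for team in range(n_teams):
        idx = [i for i, t in enumerate(team_of_row) if t == team]
        preds = predict_for_team(team, idx)
        for i, p in zip(idx, preds):
            all_pred[i] = p
    return all_pred

# Dummy predictor: every row of team t gets the value t + 1.
toy_teams = [0, 1, 0, 2, 3]
toy_pred = scatter_team_predictions(
    toy_teams, lambda team, idx: [team + 1] * len(idx), n_teams=3)
print(toy_pred)
```

As in the notebook, the number of teams handled is determined by how many parameter sets are supplied, so any team code beyond that keeps the initial value 0.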
header_num = 'F010'
prednum = 4
sub_num = f'ind0{prednum}-t00-all'
param_list = []
##############################################
# Set the prediction parameters for team 00
##############################################
all_param00 ={
'num_round': 22,
'Threshold_y': 0.5457880758611685,
'bagging_fraction': 0.8011059140187697,
'bagging_freq': 6,
'feature_fraction': 0.596451862314391,
'lambda_l1': 0.03797737964975018,
'lambda_l2': 1.5247646095285393e-05,
'learning_rate': 0.011373097841736944,
'min_child_samples': 54,
'num_leaves': 108,
'randomTrainState': 5000,
'select_col_list': ['col-1', 'col-2', 'col-3', 'col-4', 'col-5', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-15', 'col-16', 'col-17', 'col-18', 'col-21', 'col-23', 'col-25', 'col-28', 'col-30', 'col-33', 'col-38', 'col-39', 'col-41', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param00)
##############################################
# Set the prediction parameters for team 01
##############################################
all_param01 ={
'num_round': 23,
'Threshold_y': 0.5889101186701063,
'bagging_fraction': 0.6374123048258021,
'bagging_freq': 1,
'feature_fraction': 0.7134861073611422,
'lambda_l1': 1.9170047296779988e-06,
'lambda_l2': 5.208948566113227e-05,
'learning_rate': 0.024343076451874744,
'min_child_samples': 91,
'num_leaves': 256,
'randomTrainState': 7547,
'select_col_list': ['col-0', 'col-2', 'col-3', 'col-5', 'col-6', 'col-7', 'col-8', 'col-11', 'col-12', 'col-13', 'col-14', 'col-18', 'col-19', 'col-20', 'col-22', 'col-23', 'col-29', 'col-30', 'col-33', 'col-34', 'col-35', 'col-38', 'col-40', 'col-43', 'col-44'],
}
param_list.append(all_param01)
##############################################
# Set the prediction parameters for team 02
##############################################
all_param02 ={
'num_round': 25,
'Threshold_y': 0.5893221185816869,
'bagging_fraction': 0.8754828088009734,
'bagging_freq': 7,
'feature_fraction': 0.5629420510817766,
'lambda_l1': 8.096083569290082e-06,
'lambda_l2': 2.5099283088293556e-08,
'learning_rate': 0.02012230648399335,
'min_child_samples': 24,
'num_leaves': 249,
'randomTrainState': 9991,
'select_col_list': ['col-0', 'col-2', 'col-3', 'col-5', 'col-7', 'col-11', 'col-12', 'col-14', 'col-15', 'col-16', 'col-20', 'col-21', 'col-22', 'col-24', 'col-28', 'col-30', 'col-37', 'col-38', 'col-39', 'col-41', 'col-43', 'col-44', 'col-45'],
}
param_list.append(all_param02)
##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 ={
'num_round': 81,
'Threshold_y': 0.625101164467119,
'bagging_fraction': 0.7373604916657892,
'bagging_freq': 4,
'feature_fraction': 0.7071738392109185,
'lambda_l1': 0.00011291713906883523,
'lambda_l2': 3.481715177476982e-05,
'learning_rate': 0.009426473025184819,
'min_child_samples': 14,
'num_leaves': 193,
'randomTrainState': 9993,
'select_col_list': ['col-0', 'col-3', 'col-5', 'col-6', 'col-7', 'col-9', 'col-11', 'col-12', 'col-13', 'col-17', 'col-18', 'col-19', 'col-21', 'col-24', 'col-30', 'col-31', 'col-32', 'col-35', 'col-36', 'col-37', 'col-40', 'col-41', 'col-43', 'col-45'],
}
param_list.append(all_param03)
##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 ={
'num_round': 29,
'Threshold_y': 0.5259362725435317,
'bagging_fraction': 0.7259878847691205,
'bagging_freq': 7,
'feature_fraction': 0.4024323069123868,
'lambda_l1': 1.4698558567372508e-05,
'lambda_l2': 0.0001484293366246452,
'learning_rate': 0.005120538199748757,
'min_child_samples': 40,
'num_leaves': 194,
'randomTrainState': 6164,
'select_col_list': ['col-1', 'col-2', 'col-8', 'col-10', 'col-14', 'col-16', 'col-18', 'col-19', 'col-20', 'col-21', 'col-26', 'col-27', 'col-28', 'col-34', 'col-35', 'col-36', 'col-39', 'col-40', 'col-41', 'col-42', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param04)
##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 ={
'num_round': 82,
'Threshold_y': 0.6977108739620748,
'bagging_fraction': 0.7282751551795792,
'bagging_freq': 6,
'feature_fraction': 0.45809154061872265,
'lambda_l1': 0.1322170148676693,
'lambda_l2': 1.1545917924898332e-06,
'learning_rate': 0.02257794841585556,
'min_child_samples': 16,
'num_leaves': 139,
'randomTrainState': 4312,
'select_col_list': ['col-3', 'col-7', 'col-8', 'col-9', 'col-14', 'col-16', 'col-17', 'col-18', 'col-19', 'col-21', 'col-24', 'col-26', 'col-30', 'col-31', 'col-35', 'col-37', 'col-40', 'col-41', 'col-42', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param05)
##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 = {
'num_round': 76,
'Threshold_y': 0.677259,
'bagging_fraction': 0.622379131760522,
'bagging_freq': 1,
'feature_fraction': 0.43523697616265694,
'lambda_l1': 0.8211241372618634,
'lambda_l2': 2.111340578441308e-05,
'learning_rate': 0.021960522978137942,
'min_child_samples': 12,
'num_leaves': 27,
'randomTrainState': 886,
'select_col_list': ['col-1', 'col-2', 'col-3', 'col-4', 'col-5', 'col-7', 'col-8', 'col-12', 'col-13', 'col-14', 'col-18', 'col-19', 'col-20', 'col-21', 'col-22', 'col-23', 'col-25', 'col-30', 'col-34', 'col-37', 'col-38', 'col-39', 'col-41', 'col-42', 'col-43', 'col-45'],
}
param_list.append(all_param06)
##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 ={
'num_round': 11,
'Threshold_y': 0.5153190259276867,
'bagging_fraction': 0.7713406672167327,
'bagging_freq': 5,
'feature_fraction': 0.9198934709118338,
'lambda_l1': 0.10237235943070253,
'lambda_l2': 9.078045159422409e-06,
'learning_rate': 0.0053148818152905924,
'min_child_samples': 53,
'num_leaves': 74,
'randomTrainState': 5415,
'select_col_list': ['col-0', 'col-1', 'col-2', 'col-4', 'col-5', 'col-7', 'col-11', 'col-12', 'col-14', 'col-16', 'col-17', 'col-20', 'col-21', 'col-26', 'col-28', 'col-29', 'col-30', 'col-32', 'col-34', 'col-35', 'col-39', 'col-40', 'col-42', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param07)
##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 ={
'num_round': 10,
'Threshold_y': 0.5380212227317296,
'bagging_fraction': 0.9187000291018446,
'bagging_freq': 2,
'feature_fraction': 0.6481201635454304,
'lambda_l1': 7.715161904743353e-07,
'lambda_l2': 0.002920082833647645,
'learning_rate': 0.02575749622485481,
'min_child_samples': 42,
'num_leaves': 157,
'randomTrainState': 5678,
'select_col_list': ['col-0', 'col-2', 'col-4', 'col-6', 'col-8', 'col-9', 'col-10', 'col-12', 'col-15', 'col-18', 'col-20', 'col-21', 'col-23', 'col-26', 'col-29', 'col-30', 'col-32', 'col-33', 'col-34', 'col-37', 'col-38', 'col-40', 'col-41', 'col-46'],
}
param_list.append(all_param08)
##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 ={
'num_round': 61,
'Threshold_y': 0.560865361282143,
'bagging_fraction': 0.7236062249349682,
'bagging_freq': 2,
'feature_fraction': 0.40196600417593165,
'lambda_l1': 3.272803221486633e-07,
'lambda_l2': 0.0059614935640386335,
'learning_rate': 0.008946676196277018,
'min_child_samples': 77,
'num_leaves': 35,
'randomTrainState': 9582,
'select_col_list': ['col-0', 'col-2', 'col-4', 'col-5', 'col-8', 'col-9', 'col-10', 'col-12', 'col-14', 'col-17', 'col-20', 'col-25', 'col-27', 'col-28', 'col-34', 'col-36', 'col-37', 'col-38', 'col-39', 'col-40', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param09)
##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 ={
'num_round': 3,
'Threshold_y': 0.5136703347722501,
'bagging_fraction': 0.6576434003296778,
'bagging_freq': 2,
'feature_fraction': 0.9238185170320545,
'lambda_l1': 9.453058045248753e-07,
'lambda_l2': 6.173857715519222e-07,
'learning_rate': 0.015417272505010775,
'min_child_samples': 39,
'num_leaves': 178,
'randomTrainState': 1971,
'select_col_list': ['col-1', 'col-2', 'col-5', 'col-6', 'col-12', 'col-15', 'col-19', 'col-20', 'col-22', 'col-24', 'col-26', 'col-27', 'col-29', 'col-30', 'col-32', 'col-34', 'col-35', 'col-36', 'col-38', 'col-39', 'col-40', 'col-43', 'col-45'],
}
param_list.append(all_param10)
##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 ={
'num_round': 114,
'Threshold_y': 0.6276608676627834,
'bagging_fraction': 0.6079086267466319,
'bagging_freq': 3,
'feature_fraction': 0.8841048573926545,
'lambda_l1': 4.6216686227184415e-06,
'lambda_l2': 0.0016302782388678146,
'learning_rate': 0.005927956010265586,
'min_child_samples': 14,
'num_leaves': 243,
'randomTrainState': 7314,
'select_col_list': ['col-0', 'col-2', 'col-4', 'col-5', 'col-6', 'col-10', 'col-12', 'col-14', 'col-15', 'col-16', 'col-19', 'col-20', 'col-21', 'col-23', 'col-24', 'col-27', 'col-28', 'col-29', 'col-31', 'col-35', 'col-37', 'col-38', 'col-39', 'col-40', 'col-41', 'col-44', 'col-45'],
}
param_list.append(all_param11)
all_sub_pred = pred_all_teams(header_num,prednum,param_list,flg=0)
# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------
# Output the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')
header_num = 'F010'
prednum = 5
versionnum = 2
sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
param_list = []
##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 ={
'num_round': 51,
'Threshold_y': 0.5988650028111873,
'bagging_fraction': 0.6341989032831783,
'bagging_freq': 3,
'feature_fraction': 0.9292090322755205,
'lambda_l1': 1.773192620560166e-06,
'lambda_l2': 1.1227618933934534e-06,
'learning_rate': 0.009081946102827547,
'min_child_samples': 16,
'num_leaves': 124,
'randomTrainState': 7454,
'select_col_list': ['col-3', 'col-5', 'col-6', 'col-7', 'col-8', 'col-10', 'col-11', 'col-13', 'col-15', 'col-18', 'col-19', 'col-24', 'col-25', 'col-32', 'col-33', 'col-34', 'col-39', 'col-40', 'col-42', 'col-43', 'col-45'],
}
param_list.append(all_param00)
##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 ={
'num_round': 28,
'Threshold_y': 0.6071184768452184,
'bagging_fraction': 0.7923625382057141,
'bagging_freq': 6,
'feature_fraction': 0.8688674184611311,
'lambda_l1': 0.00029783203896991247,
'lambda_l2': 0.001102626205095361,
'learning_rate': 0.00998140615023234,
'min_child_samples': 21,
'num_leaves': 131,
'randomTrainState': 2263,
'select_col_list': ['col-4', 'col-8', 'col-11', 'col-12', 'col-13', 'col-14', 'col-15', 'col-17', 'col-18', 'col-19', 'col-21', 'col-22', 'col-25', 'col-28', 'col-29', 'col-31', 'col-34', 'col-35', 'col-36', 'col-40', 'col-41', 'col-42', 'col-43', 'col-46'],
}
param_list.append(all_param01)
##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 ={
'num_round': 79,
'Threshold_y': 0.8006167049819931,
'bagging_fraction': 0.7076316063654214,
'bagging_freq': 2,
'feature_fraction': 0.8284870339000393,
'lambda_l1': 0.2968327757147524,
'lambda_l2': 4.858948316137708e-07,
'learning_rate': 0.014441578144409953,
'min_child_samples': 7,
'num_leaves': 96,
'randomTrainState': 1652,
'select_col_list': ['col-1', 'col-2', 'col-4', 'col-6', 'col-8', 'col-9', 'col-11', 'col-12', 'col-13', 'col-15', 'col-19', 'col-20', 'col-21', 'col-22', 'col-23', 'col-24', 'col-25', 'col-26', 'col-27', 'col-30', 'col-33', 'col-36', 'col-38', 'col-40', 'col-41', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param02)
##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 ={
'num_round': 20,
'Threshold_y': 0.5822015809423722,
'bagging_fraction': 0.8485251397259249,
'bagging_freq': 7,
'feature_fraction': 0.6987577577331818,
'lambda_l1': 4.570181490781628e-08,
'lambda_l2': 3.103035915457753e-05,
'learning_rate': 0.01161090478870929,
'min_child_samples': 13,
'num_leaves': 239,
'randomTrainState': 6063,
'select_col_list': ['col-2', 'col-3', 'col-4', 'col-6', 'col-14', 'col-17', 'col-19', 'col-20', 'col-22', 'col-23', 'col-24', 'col-25', 'col-31', 'col-33', 'col-37', 'col-40', 'col-41', 'col-42', 'col-45'],
}
param_list.append(all_param03)
##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 ={
'num_round': 13,
'Threshold_y': 0.5840054766305995,
'bagging_fraction': 0.9633826802938512,
'bagging_freq': 3,
'feature_fraction': 0.5037259239028026,
'lambda_l1': 0.1715113716988102,
'lambda_l2': 0.00037841395658834735,
'learning_rate': 0.02581013349605285,
'min_child_samples': 16,
'num_leaves': 188,
'randomTrainState': 4892,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-10', 'col-12', 'col-16', 'col-19', 'col-20', 'col-21', 'col-26', 'col-29', 'col-31', 'col-32', 'col-34', 'col-35', 'col-37', 'col-41', 'col-45', 'col-46'],
}
param_list.append(all_param04)
##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 ={
'num_round': 27,
'Threshold_y': 0.5676781314173396,
'bagging_fraction': 0.5465120737723101,
'bagging_freq': 3,
'feature_fraction': 0.5239167666136786,
'lambda_l1': 0.08853036547863448,
'lambda_l2': 0.0009080573846847937,
'learning_rate': 0.011709255698557373,
'min_child_samples': 30,
'num_leaves': 75,
'randomTrainState': 2689,
'select_col_list': ['col-0', 'col-1', 'col-3', 'col-4', 'col-5', 'col-6', 'col-8', 'col-10', 'col-15', 'col-17', 'col-19', 'col-20', 'col-24', 'col-25', 'col-26', 'col-27', 'col-28', 'col-29', 'col-30', 'col-32', 'col-35', 'col-36', 'col-37', 'col-41', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param05)
##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 ={
'num_round': 39,
'Threshold_y': 0.6321024724116466,
'bagging_fraction': 0.8291697371770824,
'bagging_freq': 4,
'feature_fraction': 0.714953727496321,
'lambda_l1': 1.0168924231577864e-08,
'lambda_l2': 3.9010744136314086,
'learning_rate': 0.014855555702265079,
'min_child_samples': 11,
'num_leaves': 189,
'randomTrainState': 8210,
'select_col_list': ['col-1', 'col-3', 'col-6', 'col-8', 'col-9', 'col-10', 'col-11', 'col-12', 'col-13', 'col-14', 'col-17', 'col-18', 'col-19', 'col-20', 'col-21', 'col-22', 'col-23', 'col-25', 'col-27', 'col-37', 'col-39', 'col-40', 'col-41', 'col-44', 'col-45'],
}
param_list.append(all_param06)
##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 ={
'num_round': 27,
'Threshold_y': 0.6411434456489781,
'bagging_fraction': 0.8322805023962896,
'bagging_freq': 4,
'feature_fraction': 0.7242422251249383,
'lambda_l1': 1.7872305643364622e-06,
'lambda_l2': 0.012347332855752268,
'learning_rate': 0.012792533160345695,
'min_child_samples': 5,
'num_leaves': 172,
'randomTrainState': 3373,
'select_col_list': ['col-1', 'col-5', 'col-11', 'col-13', 'col-19', 'col-20', 'col-22', 'col-23', 'col-24', 'col-27', 'col-29', 'col-33', 'col-34', 'col-40', 'col-41', 'col-43'],
}
param_list.append(all_param07)
##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 ={
'num_round': 9,
'Threshold_y': 0.5242129210057586,
'bagging_fraction': 0.862789007067898,
'bagging_freq': 4,
'feature_fraction': 0.41212081743841283,
'lambda_l1': 0.00015207051582398425,
'lambda_l2': 1.916552147625866e-06,
'learning_rate': 0.008489738430627213,
'min_child_samples': 16,
'num_leaves': 132,
'randomTrainState': 5037,
'select_col_list': ['col-1', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-13', 'col-20', 'col-21', 'col-26', 'col-27', 'col-28', 'col-29', 'col-31', 'col-32', 'col-33', 'col-37', 'col-40'],
}
param_list.append(all_param08)
##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 ={
'num_round': 32,
'Threshold_y': 0.6494177400635616,
'bagging_fraction': 0.769538933451904,
'bagging_freq': 5,
'feature_fraction': 0.6071303730337405,
'lambda_l1': 0.09787230103279305,
'lambda_l2': 0.002781501819288105,
'learning_rate': 0.01844675577717961,
'min_child_samples': 17,
'num_leaves': 245,
'randomTrainState': 5510,
'select_col_list': ['col-0', 'col-1', 'col-2', 'col-3', 'col-4', 'col-6', 'col-7', 'col-10', 'col-12', 'col-14', 'col-17', 'col-18', 'col-21', 'col-22', 'col-23', 'col-25', 'col-26', 'col-29', 'col-31', 'col-32', 'col-33', 'col-34', 'col-36', 'col-41', 'col-42', 'col-44', 'col-45'],
}
param_list.append(all_param09)
##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 ={
'num_round': 36,
'Threshold_y': 0.5909216962211854,
'bagging_fraction': 0.6488973849546392,
'bagging_freq': 5,
'feature_fraction': 0.42504440898708207,
'lambda_l1': 0.017937047426524792,
'lambda_l2': 2.671891261744761e-08,
'learning_rate': 0.009163978056130472,
'min_child_samples': 7,
'num_leaves': 52,
'randomTrainState': 507,
'select_col_list': ['col-0', 'col-1', 'col-3', 'col-4', 'col-5', 'col-7', 'col-8', 'col-9', 'col-11', 'col-12', 'col-15', 'col-18', 'col-21', 'col-23', 'col-27', 'col-28', 'col-30', 'col-31', 'col-32', 'col-33', 'col-36', 'col-38', 'col-39', 'col-40', 'col-42', 'col-45'],
}
param_list.append(all_param10)
##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 ={
'num_round': 32,
'Threshold_y': 0.5910917436123575,
'bagging_fraction': 0.6937932067153532,
'bagging_freq': 4,
'feature_fraction': 0.49245322978952205,
'lambda_l1': 0.041304191123197054,
'lambda_l2': 0.02073817982403152,
'learning_rate': 0.020908997050675497,
'min_child_samples': 31,
'num_leaves': 84,
'randomTrainState': 3191,
'select_col_list': ['col-1', 'col-2', 'col-3', 'col-4', 'col-6', 'col-7', 'col-8', 'col-10', 'col-11', 'col-15', 'col-18', 'col-20', 'col-21', 'col-22', 'col-23', 'col-26', 'col-29', 'col-31', 'col-33', 'col-34', 'col-37', 'col-38', 'col-40', 'col-43', 'col-44', 'col-45'],
}
param_list.append(all_param11)
all_sub_pred = pred_all_teams(header_num,prednum,param_list,flg=0)
# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------
# Output the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')
header_num = 'F010'
prednum = 6
versionnum = 2
sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
param_list = []
##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 4032,
'select_col_list': ['col-1', 'col-2', 'col-4', 'col-6', 'col-7', 'col-9', 'col-12', 'col-16', 'col-20', 'col-22', 'col-23', 'col-25', 'col-27', 'col-30', 'col-31', 'col-32', 'col-33', 'col-34', 'col-38', 'col-39', 'col-41', 'col-43'],
}
param_list.append(all_param00)
##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 6180,
'select_col_list': ['col-4', 'col-6', 'col-7', 'col-9', 'col-10', 'col-16', 'col-19', 'col-20', 'col-22', 'col-23', 'col-24', 'col-26', 'col-28', 'col-30', 'col-33', 'col-34', 'col-38', 'col-43'],
}
param_list.append(all_param01)
##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 8530,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-6', 'col-10', 'col-13', 'col-19', 'col-22', 'col-24', 'col-25', 'col-30', 'col-32', 'col-34', 'col-36', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param02)
##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 8340,
'select_col_list': ['col-0', 'col-4', 'col-7', 'col-10', 'col-11', 'col-12', 'col-19', 'col-22', 'col-23', 'col-24', 'col-26', 'col-30', 'col-31', 'col-33', 'col-37', 'col-38', 'col-41', 'col-43', 'col-44', 'col-46'],
}
param_list.append(all_param03)
##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 1838,
'select_col_list': ['col-0', 'col-1', 'col-6', 'col-8', 'col-11', 'col-13', 'col-15', 'col-19', 'col-21', 'col-22', 'col-26', 'col-27', 'col-36', 'col-37', 'col-38', 'col-39', 'col-44'],
}
param_list.append(all_param04)
##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 7404,
'select_col_list': ['col-0', 'col-4', 'col-6', 'col-11', 'col-18', 'col-19', 'col-22', 'col-24', 'col-27', 'col-28', 'col-29', 'col-30', 'col-31', 'col-33', 'col-35', 'col-37', 'col-38', 'col-40', 'col-41', 'col-45'],
}
param_list.append(all_param05)
##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 9847,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-6', 'col-7', 'col-10', 'col-15', 'col-25', 'col-26', 'col-27', 'col-29', 'col-30', 'col-37', 'col-38', 'col-39', 'col-41', 'col-43'],
}
param_list.append(all_param06)
##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 5801,
'select_col_list': ['col-1', 'col-4', 'col-6', 'col-8', 'col-10', 'col-14', 'col-19', 'col-22', 'col-23', 'col-25', 'col-28', 'col-33', 'col-34', 'col-39', 'col-43'],
}
param_list.append(all_param07)
##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 2587,
'select_col_list': ['col-2', 'col-10', 'col-17', 'col-18', 'col-22', 'col-23', 'col-25', 'col-26', 'col-27', 'col-30', 'col-31', 'col-33', 'col-34', 'col-35', 'col-37', 'col-38', 'col-39', 'col-40', 'col-41', 'col-43', 'col-44'],
}
param_list.append(all_param08)
##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 7251,
'select_col_list': ['col-1', 'col-2', 'col-4', 'col-6', 'col-9', 'col-10', 'col-11', 'col-12', 'col-17', 'col-19', 'col-22', 'col-24', 'col-26', 'col-27', 'col-31', 'col-33', 'col-35', 'col-36', 'col-44'],
}
param_list.append(all_param09)
##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 1427,
'select_col_list': ['col-0', 'col-6', 'col-7', 'col-10', 'col-13', 'col-17', 'col-19', 'col-20', 'col-22', 'col-25', 'col-27', 'col-28', 'col-29', 'col-31', 'col-33', 'col-37', 'col-39', 'col-41', 'col-43', 'col-44', 'col-46'],
}
param_list.append(all_param10)
##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 8532,
'select_col_list': ['col-4', 'col-7', 'col-10', 'col-12', 'col-16', 'col-17', 'col-18', 'col-22', 'col-24', 'col-29', 'col-30', 'col-33', 'col-34', 'col-35', 'col-37', 'col-40', 'col-41', 'col-43', 'col-46'],
}
param_list.append(all_param11)
all_sub_pred = pred_all_teams(header_num,prednum,param_list,flg=1)
# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------
# Output the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')
header_num = 'F010'
prednum = 7
versionnum = 2
sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
param_list = []
##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 ={
'num_round': 17,
'Threshold_y': 0.5301191795552576,
'bagging_fraction': 0.6947579218693694,
'bagging_freq': 4,
'feature_fraction': 0.6427395221155658,
'lambda_l1': 3.868220205359664e-06,
'lambda_l2': 0.00010945101702546976,
'learning_rate': 0.012303614918122196,
'min_child_samples': 29,
'num_leaves': 68,
'randomTrainState': 7925,
'select_col_list': ['col-2', 'col-4', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-13', 'col-14', 'col-17', 'col-19', 'col-20', 'col-22', 'col-24', 'col-25', 'col-28', 'col-29', 'col-31', 'col-32', 'col-35', 'col-36', 'col-40', 'col-41', 'col-43', 'col-45'],
}
param_list.append(all_param00)
##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 ={
'num_round': 12,
'Threshold_y': 0.5441403422091486,
'bagging_fraction': 0.9012232602552761,
'bagging_freq': 4,
'feature_fraction': 0.7253945401233162,
'lambda_l1': 0.2914740452257848,
'lambda_l2': 0.04572992822765882,
'learning_rate': 0.016231192303684548,
'min_child_samples': 23,
'num_leaves': 91,
'randomTrainState': 1380,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-14', 'col-16', 'col-18', 'col-19', 'col-21', 'col-22', 'col-25', 'col-27', 'col-31', 'col-32', 'col-37', 'col-40', 'col-43', 'col-44', 'col-45'],
}
param_list.append(all_param01)
##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 ={
'num_round': 18,
'Threshold_y': 0.5141469693930233,
'bagging_fraction': 0.4454919386587845,
'bagging_freq': 4,
'feature_fraction': 0.4910223795324583,
'lambda_l1': 8.25489775949208e-07,
'lambda_l2': 0.00028137109089829035,
'learning_rate': 0.009886683851392216,
'min_child_samples': 33,
'num_leaves': 164,
'randomTrainState': 4258,
'select_col_list': ['col-0', 'col-5', 'col-6', 'col-7', 'col-8', 'col-10', 'col-11', 'col-12', 'col-14', 'col-15', 'col-16', 'col-17', 'col-18', 'col-20', 'col-22', 'col-25', 'col-27', 'col-29', 'col-30', 'col-35', 'col-36', 'col-37', 'col-41', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param02)
##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 ={
'num_round': 30,
'Threshold_y': 0.6157333939466578,
'bagging_fraction': 0.7416403427407998,
'bagging_freq': 4,
'feature_fraction': 0.6390231509876305,
'lambda_l1': 0.0013841308254424842,
'lambda_l2': 2.947622365708902e-08,
'learning_rate': 0.017965732967915805,
'min_child_samples': 10,
'num_leaves': 58,
'randomTrainState': 8544,
'select_col_list': ['col-1', 'col-2', 'col-5', 'col-9', 'col-11', 'col-17', 'col-19', 'col-23', 'col-25', 'col-27', 'col-28', 'col-30', 'col-33', 'col-34', 'col-38', 'col-40', 'col-41', 'col-42', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param03)
##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 ={
'num_round': 10,
'Threshold_y': 0.5382883421585195,
'bagging_fraction': 0.9260827614635292,
'bagging_freq': 1,
'feature_fraction': 0.623208621531255,
'lambda_l1': 1.1685440624114588e-05,
'lambda_l2': 2.0705413009667016e-06,
'learning_rate': 0.01337137234883777,
'min_child_samples': 17,
'num_leaves': 247,
'randomTrainState': 1034,
'select_col_list': ['col-1', 'col-2', 'col-7', 'col-9', 'col-13', 'col-17', 'col-22', 'col-23', 'col-24', 'col-26', 'col-27', 'col-30', 'col-32', 'col-33', 'col-35', 'col-37', 'col-40', 'col-44'],
}
param_list.append(all_param04)
##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 ={
'num_round': 9,
'Threshold_y': 0.5099648119079944,
'bagging_fraction': 0.5492567465444324,
'bagging_freq': 1,
'feature_fraction': 0.4897531927776741,
'lambda_l1': 2.4809120148651396e-05,
'lambda_l2': 1.5095440320754698,
'learning_rate': 0.00512450905231661,
'min_child_samples': 13,
'num_leaves': 160,
'randomTrainState': 6233,
'select_col_list': ['col-0', 'col-4', 'col-8', 'col-9', 'col-10', 'col-11', 'col-12', 'col-13', 'col-14', 'col-17', 'col-18', 'col-19', 'col-22', 'col-23', 'col-25', 'col-26', 'col-32', 'col-33', 'col-34', 'col-35', 'col-38', 'col-39', 'col-40', 'col-42', 'col-44', 'col-45'],
}
param_list.append(all_param05)
##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 ={
'num_round': 4,
'Threshold_y': 0.5106891139428876,
'bagging_fraction': 0.8856916241331956,
'bagging_freq': 6,
'feature_fraction': 0.6045574049863165,
'lambda_l1': 4.77724838181608e-08,
'lambda_l2': 0.001905462262038446,
'learning_rate': 0.005373278707637609,
'min_child_samples': 7,
'num_leaves': 113,
'randomTrainState': 6210,
'select_col_list': ['col-0', 'col-3', 'col-4', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-14', 'col-15', 'col-16', 'col-17', 'col-18', 'col-20', 'col-21', 'col-22', 'col-24', 'col-25', 'col-26', 'col-28', 'col-29', 'col-34', 'col-38', 'col-40', 'col-41', 'col-43', 'col-45'],
}
param_list.append(all_param06)
##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 ={
'num_round': 4,
'Threshold_y': 0.5322936094849703,
'bagging_fraction': 0.5475129500981026,
'bagging_freq': 5,
'feature_fraction': 0.4681456989889187,
'lambda_l1': 1.0545528690242021e-08,
'lambda_l2': 2.7055951582708258e-05,
'learning_rate': 0.02718788620631782,
'min_child_samples': 13,
'num_leaves': 55,
'randomTrainState': 6971,
'select_col_list': ['col-0', 'col-5', 'col-10', 'col-11', 'col-13', 'col-14', 'col-16', 'col-18', 'col-19', 'col-20', 'col-21', 'col-23', 'col-24', 'col-27', 'col-28', 'col-29', 'col-30', 'col-32', 'col-36', 'col-37', 'col-38', 'col-39', 'col-40', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param07)
##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 ={
'num_round': 30,
'Threshold_y': 0.551452774858463,
'bagging_fraction': 0.7458101524869044,
'bagging_freq': 6,
'feature_fraction': 0.9613253477230401,
'lambda_l1': 7.788897620269775e-05,
'lambda_l2': 0.00011503845220559952,
'learning_rate': 0.012429959408447424,
'min_child_samples': 38,
'num_leaves': 42,
'randomTrainState': 6235,
'select_col_list': ['col-0', 'col-1', 'col-5', 'col-6', 'col-7', 'col-8', 'col-9', 'col-11', 'col-13', 'col-14', 'col-17', 'col-19', 'col-21', 'col-23', 'col-28', 'col-29', 'col-30', 'col-32', 'col-34', 'col-36', 'col-38', 'col-40', 'col-45', 'col-46'],
}
param_list.append(all_param08)
##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 ={
'num_round': 44,
'Threshold_y': 0.6830165308159903,
'bagging_fraction': 0.6369598156229328,
'bagging_freq': 2,
'feature_fraction': 0.628698124193263,
'lambda_l1': 0.34808902669160463,
'lambda_l2': 0.00454043954405315,
'learning_rate': 0.015746213466134217,
'min_child_samples': 5,
'num_leaves': 196,
'randomTrainState': 6937,
'select_col_list': ['col-0', 'col-4', 'col-7', 'col-9', 'col-10', 'col-11', 'col-13', 'col-16', 'col-21', 'col-25', 'col-28', 'col-29', 'col-31', 'col-32', 'col-34', 'col-39', 'col-40', 'col-42', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param09)
##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 ={
'num_round': 8,
'Threshold_y': 0.5120004095438825,
'bagging_fraction': 0.8744266038117634,
'bagging_freq': 5,
'feature_fraction': 0.7002886281573345,
'lambda_l1': 0.0003268581788463471,
'lambda_l2': 1.1108540310880616e-07,
'learning_rate': 0.005442442065203486,
'min_child_samples': 23,
'num_leaves': 76,
'randomTrainState': 3586,
'select_col_list': ['col-1', 'col-3', 'col-5', 'col-6', 'col-10', 'col-11', 'col-13', 'col-14', 'col-15', 'col-16', 'col-17', 'col-19', 'col-21', 'col-22', 'col-26', 'col-27', 'col-28', 'col-29', 'col-31', 'col-32', 'col-33', 'col-35', 'col-39', 'col-40', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param10)
##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 ={
'num_round': 32,
'Threshold_y': 0.6387455194897039,
'bagging_fraction': 0.9556231800736257,
'bagging_freq': 4,
'feature_fraction': 0.8217034788203843,
'lambda_l1': 2.5066872014083758e-05,
'lambda_l2': 0.40979135808078626,
'learning_rate': 0.014743719382639122,
'min_child_samples': 10,
'num_leaves': 200,
'randomTrainState': 7136,
'select_col_list': ['col-2', 'col-3', 'col-4', 'col-5', 'col-10', 'col-11', 'col-12', 'col-13', 'col-19', 'col-23', 'col-24', 'col-32', 'col-33', 'col-36', 'col-38', 'col-39', 'col-40', 'col-42', 'col-45', 'col-46'],
}
param_list.append(all_param11)
all_sub_pred = pred_all_teams(header_num,prednum,param_list,flg=0)
# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------
# Output the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')
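The per-class files written above hold 0/1 flags, and every parameter set carries a `Threshold_y` value. The body of `pred_all_teams` is not shown in this part of the post, so the following is only a minimal sketch of how such a cut-off could turn predicted probabilities into flags. It assumes that `Threshold_y` is compared against the predicted probability with `>=`; the helper name `to_binary_flags` and the sample probabilities are hypothetical.

```python
def to_binary_flags(probabilities, threshold):
    """Return 1 where the predicted probability reaches the threshold, else 0.

    Assumption: Threshold_y acts as a probability cut-off. The comparison
    operator (>=) is a guess, not taken from the original notebook.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities for four test rows.
probs = [0.12, 0.61, 0.598, 0.95]

# Threshold_y of Team 00 for class 5, copied from the parameters above.
flags_class5 = to_binary_flags(probs, 0.5988650028111873)

# Threshold_y used for every team of class 6 (flg=1), copied from above.
flags_class6 = to_binary_flags(probs, 0.9090909090909091)

print(flags_class5)
print(flags_class6)
```

Under this reading, a higher threshold makes the per-class model claim fewer rows, so it overwrites fewer base predictions in the later blending step.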
# Load the base prediction
submit_df = pd.read_csv('../submission/submission.csv',index_col='id')
submit_df.head()
| id | y |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |
| 4 | 0 |
# Function that adds the prediction made for one target class as a new column
def add_submission(prednum,filename,submit_df):
    sub066_df = pd.read_csv(filename,index_col='id')
    # print(sub066_df.sum()*prednum)
    # Replace the binary flag 1 with the class label itself
    sub066_df.loc[sub066_df['y']==1, 'y'] = prednum
    # print(sub066_df.sum())
    submit_df[f'ypred{prednum}'] = sub066_df['y']
# For classes 4 to 7, add the predictions made separately for each class
add_submission(4,'../submission/ind04-t00-all_submission.csv',submit_df)
add_submission(5,'../submission/ind05-t01-all_submission.csv',submit_df)
add_submission(6,'../submission/ind06-t01-all_submission.csv',submit_df)
add_submission(7,'../submission/ind07-t01-all_submission.csv',submit_df)
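The file names read here follow the `sub_num` pattern set before each per-class run. As a small check, the sketch below rebuilds the paths from `prednum` and `versionnum`; the helper name `submission_path` is hypothetical and not part of the original notebook.

```python
def submission_path(prednum, versionnum):
    """Rebuild a per-class file path from the sub_num pattern used above."""
    sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
    return f'../submission/{sub_num}_submission.csv'


# Classes 5 to 7 were run with versionnum = 2, which yields the "t01" files.
paths = [submission_path(prednum, 2) for prednum in (5, 6, 7)]
for path in paths:
    print(path)

# The class 4 file is named "t00"; under the same pattern this corresponds
# to versionnum = 1 (an inference from the file name only).
path_class4 = submission_path(4, 1)
print(path_class4)
```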
# Function that overwrites the base prediction with the per-class predictions, in the given order
def overwrite_submission(paramlist,submit_df):
    calc_df = submit_df.copy()
    overwrite_name = 'y'
    for i in paramlist:
        next_overwrite_name = f'{overwrite_name}-{i}'
        # Take class i where the per-class model flagged it; otherwise keep the current value
        calc_df[next_overwrite_name] = calc_df[f'ypred{i}'].where(calc_df[f'ypred{i}']==i,calc_df[overwrite_name])
        overwrite_name = next_overwrite_name
    calc_df['y'] = calc_df[overwrite_name]
    return calc_df
# Merge the predictions
submit_df = overwrite_submission([ 4, 5, 7, 6],submit_df)
# Output the blended result
submit_brend = pd.DataFrame({'y': submit_df['y']})
submit_brend.index.name = 'id'
submit_brend.to_csv(f'../submission/{Notebookname}_submission.csv')
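Because `overwrite_submission` applies the classes one after another, a class that comes later in the list wins whenever two per-class models flag the same row. With the order `[4, 5, 7, 6]`, class 6 therefore has the highest priority. The following dependency-free sketch reproduces that rule on plain lists. The function name `overwrite_in_order` and the toy data are hypothetical; it mirrors the logic only and is not the author's code.

```python
def overwrite_in_order(base, per_class, order):
    """Overwrite base labels with per-class flags, applying classes in `order`.

    per_class[c] holds c where the class-c model flagged the row, else 0,
    which is the same encoding that add_submission produces.
    """
    merged = list(base)
    for cls in order:
        flags = per_class[cls]
        merged = [cls if flag == cls else current
                  for flag, current in zip(flags, merged)]
    return merged


# Toy base prediction (same first five values as the table above).
base = [1, 1, 3, 1, 0]

# Hypothetical per-class flags: row 0 is claimed by classes 4, 6 and 7.
per_class = {
    4: [4, 0, 0, 0, 0],
    5: [0, 0, 0, 0, 0],
    6: [6, 0, 6, 0, 0],
    7: [7, 7, 0, 0, 0],
}

merged = overwrite_in_order(base, per_class, [4, 5, 7, 6])
print(merged)  # [6, 7, 6, 1, 0]

# Reversing the order changes who wins the conflict on row 0.
merged_reversed = overwrite_in_order(base, per_class, [6, 7, 5, 4])
print(merged_reversed)  # [4, 7, 6, 1, 0]
```

Rows that no per-class model flags keep their base prediction, so the order only matters for rows claimed by more than one class.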
hashimoto
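Thank you for publishing such an excellent solution. I take my hat off to the idea of reducing the low-frequency classes to binary classification problems.
When I re-ran the code, LightGBM's predict raised an error, so I am sharing a workaround.
In the following function, I believe X_val also needs to be restricted to the columns in select_col_list.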
def objective(train,test,target,valid,valid_target,all_param):
(omitted)
(omitted)
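The body of `objective` is omitted above, so the following is not the actual fix. It is only a dependency-free sketch of the underlying point: the validation features must be restricted with the same `select_col_list` as the training features, otherwise the model sees a different set of columns at predict time. The helpers `filter_columns` and `check_same_features` and the toy tables are hypothetical.

```python
def filter_columns(table, columns):
    """Keep only `columns`, in that order (table maps column name -> values)."""
    return {name: table[name] for name in columns}


def check_same_features(train_table, valid_table):
    """Raise ValueError when the two tables do not share identical columns."""
    train_cols, valid_cols = list(train_table), list(valid_table)
    if train_cols != valid_cols:
        raise ValueError(
            f'feature mismatch: train={train_cols}, valid={valid_cols}')


select_col_list = ['col-1', 'col-3']
X_train = {'col-1': [0.1, 0.2], 'col-2': [1, 0], 'col-3': [5, 6]}
X_val = {'col-1': [0.3], 'col-2': [1], 'col-3': [7]}

X_train_sel = filter_columns(X_train, select_col_list)

# Filtering only the training data leaves X_val with an extra column.
try:
    check_same_features(X_train_sel, X_val)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Applying the same list to X_val restores agreement.
X_val_sel = filter_columns(X_val, select_col_list)
check_same_features(X_train_sel, X_val_sel)
print(mismatch_detected, list(X_val_sel))
```

With pandas DataFrames the equivalent selection would be `X_val[select_col_list]`, applied wherever the training frame is restricted in the same way.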
lingmu3
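Congratulations on first place! And thank you for publishing such an excellent solution. Did running the binary classification per team give noticeably different results from running it on all teams at once? (I ask because I am curious how much the per-team classification helped.) Also, the parameters and columns differ in fine detail from team to team; how did you go about choosing them?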
Quvotha-nndropout100
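Thank you for publishing your solution. I apologize for asking so late, but I would appreciate an explanation if you do not mind.
random_sampling() computes a feature called subGameId, which is then used as a key in many places (pivot aggregations and so on). What is the idea behind this feature? I could not work it out even from the formula.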