Predict the Outcome of the Next Pitch! Pro Baseball Data Analysis Challenge

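Apologies for the delay; here is the 1st Place Solution (by Oregin).
To the organizers and to everyone who took part alongside me: I would appreciate your review.
The blog post below also includes diagrams of the overall pipeline, so please refer to it as needed.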

[1st place solution] A look back at the ProbSpace "Pro Baseball Data Analysis Challenge": https://oregin-ai.hatenablog.com/entry/2021/06/23/223054

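Overall, the final prediction is produced by building the following two kinds of models and combining their predictions.

A model that performs multiclass classification over all target classes (the baseline).
Models that each predict one of the infrequent classes: single, double, triple, and home run.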

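Directory structure

./notebook : directory that holds this file (make it the current directory before running)
./features : directory where the preprocessed features are saved
./data : directory that holds test_data.csv, train_data.csv, and game_info.csv
./submission : directory where the intermediate predictions and the final prediction (BestScore_submission.csv) are saved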

# Name given to the final prediction file
Notebookname = 'BestScore'

# Move to the directory where this file is saved (edit the path to match your environment)

cd /content/drive/MyDrive/XXXXXXXXXX/notebook

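A model that performs multiclass classification over all target classes (the baseline)

First comes the baseline I published in the competition topics, which performs multiclass classification over all target classes.

Baseline posted in the topics: https://prob.space/competitions/npb/discussions/Oregin-Postbd2b4e8a9808ec850876

The final prediction is obtained by mixing this baseline's predictions with the predictions of the models for classes 4 to 7 described later.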
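The exact mixing rule is not shown in this part of the write-up. As a rough illustration only, one simple way to combine a multiclass baseline with per-class models for rare classes is to overwrite the baseline label wherever a rare-class model is confident. The function name, the probability inputs, and the thresholds below are all hypothetical and are not taken from the solution.

```python
import numpy as np

def overlay_rare_classes(base_pred, rare_proba, thresholds):
    """Overwrite baseline labels where a rare-class model is confident.

    base_pred  : (n,) integer labels from the multiclass baseline
    rare_proba : dict {class_label: (n,) probabilities} (hypothetical input)
    thresholds : dict {class_label: float} (hypothetical cut-offs)
    """
    out = np.asarray(base_pred).copy()
    # Classes are applied in ascending order, so a higher label wins a conflict.
    for label in sorted(rare_proba):
        mask = np.asarray(rare_proba[label]) >= thresholds[label]
        out[mask] = label
    return out

base = np.array([0, 1, 2, 0])
proba = {4: np.array([0.9, 0.1, 0.2, 0.1]), 7: np.array([0.0, 0.0, 0.8, 0.1])}
blended = overlay_rare_classes(base, proba, {4: 0.5, 7: 0.5})
print(blended)
```

In practice the thresholds would need to be tuned on out-of-fold predictions against the competition metric.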

Installing and importing libraries

# Install xfeat
!pip install git+https://github.com/pfnet-research/xfeat.git

# ------------------------------------------------------------------------------
# Import libraries
# ------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import json
import os
import random
import string
import re

from pathlib import Path
from tqdm import tqdm

import lightgbm as lgb
from sklearn.model_selection import KFold,GroupKFold
from sklearn.metrics import f1_score,precision_score
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB  # ComplementNB
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

Loading the data

# Load the data
#####################################
###### train ########################
#####################################

train = pd.read_csv('../data/train_data.csv')
target = train['y']
train = train.drop(['id','y'],axis=1)

#####################################
#### test ###########################
#####################################

test = pd.read_csv('../data/test_data.csv')
test = test.drop('id',axis=1)

#####################################
#### game info ######################
#####################################

game = pd.read_csv('../data/game_info.csv')
game = game.drop('Unnamed: 0',axis=1)

train.shape,test.shape,target.shape,game.shape

((20400, 22), (33808, 13), (20400,), (726, 8))

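# Use the pitch speed in the training data as an auxiliary target so that it can be predicted from the other features.
# Missing values are filled with the preceding value.
target_speed = train['speed'].str.extract(r'(\d+)').ffill()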
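As a small illustration of the extraction step above, the sketch below runs the same regular expression and forward fill on a made-up series. The sample values are invented; `expand=False` is used here only to get a Series back instead of a one-column DataFrame.

```python
import pandas as pd

# Toy speed strings; '-' and None stand in for missing readings
speed = pd.Series(['149km/h', '-', '137km/h', None])

# Keep only the digits; rows without digits become NaN
digits = speed.str.extract(r'(\d+)', expand=False)

# Forward fill: each missing row takes the last observed value
filled = digits.ffill()
print(filled.astype(float).tolist())
```

Note that a missing value in the very first row would stay missing, because there is nothing before it to copy.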

Preprocessing

##############################
# Replace team names with numbers
##############################
# Build the list of all team names
TeamList = game['topTeam'].unique()
# Initialize the team-name dictionary
TeamDic = {}
# Assign a number to each team name
for i in range(len(TeamList)):
  TeamDic[TeamList[i]] = i

game['bottomTeam']=game['bottomTeam'].replace(TeamDic)
game['topTeam']=game['topTeam'].replace(TeamDic)

game.tail()

	startTime	bottomTeam	bgBottom	topTeam	place	startDayTime	bgTop	gameID
721	13:00	7	12	3	PayPayドーム	2020-11-15 13:00:00	9	20203323
722	18:10	8	1	7	京セラD大阪	2020-11-21 18:10:00	12	20203326
723	18:10	8	1	7	京セラD大阪	2020-11-22 18:10:00	12	20203327
724	18:30	7	12	8	PayPayドーム	2020-11-24 18:30:00	1	20203328
725	18:30	7	12	8	PayPayドーム	2020-11-25 18:30:00	1	20203329

# Add year, month, day, day of week, hour, minute, and second
game['startDayTime'] = pd.to_datetime(game['startDayTime']) # convert the dtype
game['year']=game["startDayTime"].dt.year
game['month']=game["startDayTime"].dt.month
game['day']=game["startDayTime"].dt.day
game['hour']=game["startDayTime"].dt.hour
game['dayofweek']=game["startDayTime"].dt.dayofweek
game['minute']=game["startDayTime"].dt.minute
game['second']=game["startDayTime"].dt.second
# Drop 'startDayTime'
game = game.drop(['startDayTime'],axis=1)

# Build the list of columns that exist only in the training data (absent from the test data)
delcollist = []
for col in train.columns:
  if col not in test.columns:
    delcollist.append(col)
# Drop the training-only columns
train = train.drop(delcollist,axis=1)

# Extract the inning number from inning (it stays a string, so it is handled as a categorical column below)
train['inning_num'] =  train['inning'].apply(lambda x: re.sub("\\D", "", x))
test['inning_num'] =  test['inning'].apply(lambda x: re.sub("\\D", "", x))

# Function that tells the top ('表') and the bottom ('裏') of an inning apart
def omote_ura(x):
  if '表' in x:
    return 0
  else:
    return 1

# Add the top/bottom column (0 = top, 1 = bottom)
train['inning_ForB'] =  train['inning'].apply(omote_ura)
test['inning_ForB'] =  test['inning'].apply(omote_ura)

# Add game_info (gameID is the only column shared with game)
train = pd.merge(train, game, on='gameID', how='left')
test = pd.merge(test, game, on='gameID', how='left')

# Drop inning
train = train.drop('inning',axis=1)
test = test.drop('inning',axis=1)

# Add the sums of the ball, strike, and out counts
train['total_stat'] = train['B']+train['S']+train['O']
test['total_stat'] = test['B']+test['S']+test['O']
train['B_S'] = train['B']+train['S']
test['B_S'] = test['B']+test['S']

# Add the number of runners on base
train['total_base'] = train['b1'].astype('int')+train['b2'].astype('int')+train['b3'].astype('int')
test['total_base'] = test['b1'].astype('int')+test['b2'].astype('int')+test['b3'].astype('int')

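# Add the batter's team
# Note: where() keeps topTeam on rows where inning_ForB == 1 (bottom of the inning) and takes
# bottomTeam elsewhere, so the resulting column holds the fielding side rather than the batting side.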
train['batterTeam'] = train['topTeam']
train['batterTeam'] = train['batterTeam'].where(train['inning_ForB']==1, train['bottomTeam'])
test['batterTeam'] = test['topTeam']
test['batterTeam'] = test['batterTeam'].where(test['inning_ForB']==1, test['bottomTeam'])

# Add the pitcher's team (with this where() condition the column ends up holding the batting side)
train['pitcherTeam'] = train['topTeam']
train['pitcherTeam'] = train['pitcherTeam'].where(train['inning_ForB']==0, train['bottomTeam'])
test['pitcherTeam'] = test['topTeam']
test['pitcherTeam'] = test['pitcherTeam'].where(test['inning_ForB']==0, test['bottomTeam'])
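Because the direction of `Series.where` is easy to misread, here is a minimal sketch of the same pattern on a two-row toy frame. `where(cond, other)` keeps the original value where `cond` is True and substitutes `other` where it is False. The team codes 3 and 7 are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'topTeam': [3, 3],
    'bottomTeam': [7, 7],
    'inning_ForB': [0, 1],   # 0 = top of the inning, 1 = bottom
})

# Same pattern as the batterTeam lines: keep topTeam where inning_ForB == 1,
# otherwise take bottomTeam.
col = df['topTeam'].where(df['inning_ForB'] == 1, df['bottomTeam'])
print(col.tolist())
```

Row 0 is the top of the inning, where the top team bats, yet the result picks the bottom team for that row.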

# Extract the categorical (object dtype) columns
categorical_columns = [x for x in train.columns if train[x].dtypes == 'object']

# Count-encode the categorical columns
from xfeat import CountEncoder

encoder = CountEncoder(input_cols=categorical_columns)
train = encoder.fit_transform(train)
test = encoder.transform(test)
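For readers unfamiliar with count encoding, the idea behind the `_ce` columns used later can be sketched in plain pandas as follows. This is a conceptual illustration on invented data, not xfeat's implementation, whose details may differ.

```python
import pandas as pd

hand = pd.Series(['R', 'L', 'R', 'R'], name='pitcherHand')

# Count encoding: replace each category with the number of rows that carry it
hand_ce = hand.map(hand.value_counts())
print(hand_ce.tolist())
```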

# Add the target column to the training data
train['target'] = target

# Target-encode the categorical columns
from sklearn.model_selection import KFold
from xfeat import TargetEncoder

fold = KFold(n_splits=5, shuffle=True, random_state=42)
encoder = TargetEncoder(input_cols=categorical_columns,
                        target_col='target',
                        fold=fold)
train = encoder.fit_transform(train)
test = encoder.transform(test)
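The reason a `fold` object is passed to the target encoder is to avoid leaking each row's own label into its encoded value. A minimal out-of-fold sketch in plain pandas is shown below. The helper `oof_target_encode` and the toy data are hypothetical, and xfeat's internals may differ from this.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target_col, n_splits=5, seed=42):
    """Out-of-fold mean target encoding for one column (illustrative helper)."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Category means are computed on the fitting part only ...
        means = df.iloc[fit_idx].groupby(col)[target_col].mean()
        # ... and applied to the held-out part, so a row never sees its own target.
        encoded.iloc[enc_idx] = df.iloc[enc_idx][col].map(means).to_numpy()
    return encoded

toy = pd.DataFrame({'place': list('AABBAABB'),
                    'target': [1, 0, 1, 1, 0, 0, 1, 0]})
toy['place_te'] = oof_target_encode(toy, 'place', 'target', n_splits=4)
print(toy)
```

A category that never appears in the fitting part of a fold would come out as NaN and would need a fallback such as the global mean.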

# Drop the original (pre-encoding) columns
train = train.drop(categorical_columns,axis=1)
test = test.drop(categorical_columns,axis=1)

# Drop the target column
train = train.drop('target',axis=1)

train.shape,test.shape,target.shape,target_speed.shape

((20400, 39), (33808, 39), (20400,), (20400, 1))

# Function that adds features built from pivot tables
def get_game_id_vecs_features(input_df):
    _input_df = input_df
    # pivot table
    stat_df = pd.pivot_table(_input_df, index="gameID", columns="batter_te", values="total_stat").add_prefix("total_stat=")
    base_df = pd.pivot_table(_input_df, index="gameID", columns="batter_te", values="total_base").add_prefix("total_base=")
    inning_df = pd.pivot_table(_input_df, index="gameID", columns="batter_te", values="inning_num_ce").add_prefix("inning=")
    all_df = pd.concat([stat_df, base_df, inning_df], axis=1)
    
    # PCA all 
    sc_all_df = StandardScaler().fit_transform(all_df.fillna(0))
    pca = PCA(n_components=59, random_state=2021)
    pca_all_df = pd.DataFrame(pca.fit_transform(sc_all_df), index=all_df.index).rename(columns=lambda x: f"gameID_all_PCA={x:03}")
    # PCA Stat
    sc_stat_df = StandardScaler().fit_transform(stat_df.fillna(0))
    pca = PCA(n_components=16, random_state=2021)
    pca_stat_df = pd.DataFrame(pca.fit_transform(sc_stat_df), index=all_df.index).rename(columns=lambda x: f"gameID_stat_PCA={x:03}")
    # PCA base
    sc_base_df = StandardScaler().fit_transform(base_df.fillna(0))
    pca = PCA(n_components=16, random_state=2021)
    pca_base_df = pd.DataFrame(pca.fit_transform(sc_base_df), index=all_df.index).rename(columns=lambda x: f"gameID_base_PCA={x:03}")
    # PCA inning
    sc_inning_df = StandardScaler().fit_transform(inning_df.fillna(0))
    pca = PCA(n_components=16, random_state=2021)
    pca_inning_df = pd.DataFrame(pca.fit_transform(sc_inning_df), index=all_df.index).rename(columns=lambda x: f"gameID_inning_PCA={x:03}")
    
    df = pd.concat([all_df, pca_all_df, pca_stat_df, pca_base_df, pca_inning_df], axis=1)
    output_df = pd.merge(_input_df[["gameID"]], df, left_on="gameID", right_index=True, how="left")
    return output_df
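The function above follows a pivot, scale, PCA, merge-back pattern. The sketch below reproduces that pattern on a small synthetic frame so the shapes are easy to follow. The column names `key` and `value` and the component count are placeholders, not the ones used in the solution.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'gameID': np.repeat(np.arange(6), 10),   # 6 games, 10 rows each
    'key': np.tile(np.arange(4), 15),        # 4 distinct pivot keys
    'value': rng.normal(size=60),
})

# One row per game, one column per key, each cell is the mean of value
wide = pd.pivot_table(toy, index='gameID', columns='key', values='value').add_prefix('value=')

# Standardize, then compress each game's row into a short vector
scaled = StandardScaler().fit_transform(wide.fillna(0))
comp = PCA(n_components=2, random_state=0).fit_transform(scaled)
game_vecs = pd.DataFrame(comp, index=wide.index).rename(columns=lambda i: f'gameID_PCA={i:03}')

# Broadcast the per-game vector back to every row of that game
per_row = toy[['gameID']].merge(game_vecs, left_on='gameID', right_index=True, how='left')
print(per_row.head())
```

Every row of the same game receives the same vector, so these features describe the game as a whole rather than the individual pitch.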

# Concatenate the training and test data
input_df = pd.concat([train, test]).reset_index(drop=True)  # use concat data

# Build the pivot features
output_df = get_game_id_vecs_features(input_df)

# Split the pivot features back into training and test parts
train_x = output_df.iloc[:len(train)]
test_x = output_df.iloc[len(train):].reset_index(drop=True)

train_x.shape,test_x.shape,train.shape,test.shape,target.shape,target_speed.shape

((20400, 2847), (33808, 2847), (20400, 39), (33808, 39), (20400,), (20400, 1))

# Concatenate the original features and the pivot features
input_all_df = pd.concat([input_df,output_df],axis=1)
input_all_df.shape

(54208, 2886)

# Find the columns that contain nulls
nul_sum = input_all_df.isnull().sum()
null_cols = list(nul_sum[nul_sum > 0].index)

# Drop the columns that contain nulls
input_all_df = input_all_df.drop(null_cols,axis=1)

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

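# Look for columns with zero variance (all values identical)
sel = VarianceThreshold(threshold=0)
sel.fit(input_all_df)

# get_support() returns True for columns with non-zero variance and False for zero-variance columns
print(sum(sel.get_support()))

# Drop the zero-variance columns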
input_all_df =input_all_df.loc[:, sel.get_support()]
print(input_all_df.shape)

145
(54208, 145)
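A tiny example of the zero-variance filter, using an invented three-column frame in which one column is constant:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({'a': [1, 2, 3], 'const': [5, 5, 5], 'b': [0, 1, 0]})

# threshold=0 removes only the columns whose values are all identical
sel = VarianceThreshold(threshold=0).fit(toy)
kept = toy.loc[:, sel.get_support()]
print(list(kept.columns))
```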

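# Swap index and columns
input_all_df_T = input_all_df.T

print(input_all_df_T.duplicated().sum())

# Get the names of the features that duplicate another feature
duplicated_features = input_all_df_T[input_all_df_T.duplicated()].index.values

# Drop one of each pair of identical features
# Note: drop() removes every column that carries the given name, so if the duplicate shares its
# name with the original column, both copies are removed (the shape below goes from 145 to 143 columns).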
input_all_df = input_all_df.drop(duplicated_features,axis=1)

print(input_all_df.shape)

1
(54208, 143)
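The transpose trick works because `DataFrame.duplicated` compares rows, so transposing first makes it compare columns. The sketch below shows it on invented data and also illustrates what happens when two columns share the same name, which can occur after a column-wise `concat`.

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3], 'x_copy': [1, 2, 3], 'y': [3, 2, 1]})

# Rows of the transposed frame are the original columns
dup_cols = toy.T[toy.T.duplicated()].index.tolist()
deduped = toy.drop(columns=dup_cols)
print(dup_cols, list(deduped.columns))

# When both copies carry the same name, dropping by name removes both of them
both = pd.concat([toy[['x']], toy[['x']]], axis=1)
dropped = both.drop(columns=['x'])
print(dropped.shape)
```

If the intent is to keep one copy of a same-named column, selecting with a boolean mask such as `df.loc[:, ~df.T.duplicated()]` avoids the name lookup.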

# Split back into training and test data
X_train = input_all_df.iloc[:len(train)]
X_test = input_all_df.iloc[len(train):].reset_index(drop=True)

X_train.shape,X_test.shape

((20400, 143), (33808, 143))

# Save the engineered features
X_train.to_csv('../features/preprocessed_train.csv',index=False)
X_test.to_csv('../features/preprocessed_test.csv',index=False)
target.to_csv('../features/preprocessed_target.csv',index=False)
target_speed.to_csv('../features/preprocessed_speed.csv',index=False)

Training and predicting "Speed"

Build a model that predicts "speed", which exists only in the training data, and add its prediction to the test data as a feature.

SEED = 42
NFOLDS = 5

# Flatten the speed data to one dimension
target_speed = target_speed.to_numpy().reshape(-1,)

# Function that builds the neural network
def create_model_NN(activation, n_layers, n_neurons, solver):
    # Use the first n_layers entries of n_neurons as the hidden layer sizes
    hidden_layer_sizes = list(n_neurons[:n_layers])
    
    # Build the neural network model
    model = MLPRegressor(activation = activation,
                         hidden_layer_sizes=hidden_layer_sizes,
                         solver = solver,
                         random_state=42
                        )
    # Build a pipeline of standardization followed by the neural network
    pipe = make_pipeline(StandardScaler(),model)
    return pipe

# Function that predicts "speed" for the test data
def pred_speed_of_test_data(train_x,test,target_speed,param):
    ###################################
    ### Parameter settings
    ##################################
    activation = param['activation']
    n_layers = param['n_layers']
    n_neurons=[]
    for i in range(n_layers):
        n_neurons.append(param['neuron' + str(i).zfill(2)])
    solver = param['solver']
    
    ###################################
    ### CV settings
    ##################################
    
    kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

    mlp_pred = 0

    for i, (tdx, vdx) in enumerate(kf.split(X=train_x)):
        X_train, y_train = train_x.iloc[tdx], target_speed[tdx]
        # Build the model
        mlp  = create_model_NN(activation, n_layers, n_neurons, solver)
        # Train on the training part of this fold
        mlp.fit(X_train,y_train)
        # Predict the test data and average over the folds
        mlp_pred += mlp.predict(test) / NFOLDS

    print('#######################################################')
    print('### Speed was predicted #######')
    print('#######################################################')
    return mlp_pred
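The general pattern here is to fit one model per fold on a column that exists only in the training data and to average the fold models' predictions on the test rows. The sketch below shows that pattern on synthetic data. It swaps in `Ridge` for the MLP purely to keep the example fast, and all data and names are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 3))
# Stand-in for a column that is only observed in the training data
aux = X_tr @ np.array([1.0, -2.0, 0.5]) + 140.0
X_te = rng.normal(size=(40, 3))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
aux_pred = np.zeros(len(X_te))
for fit_idx, _ in kf.split(X_tr):
    # Fit on the training part of the fold, then predict every test row
    model = Ridge(alpha=1.0).fit(X_tr[fit_idx], aux[fit_idx])
    aux_pred += model.predict(X_te) / kf.get_n_splits()

print(aux_pred[:5])
```

It is worth checking the spread of the averaged prediction afterwards, since a nearly constant output adds little information as a feature.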

# Hyperparameters for the speed prediction
param = {
"activation": 'tanh',
"n_layers": 9,
"neuron00": 45,
"neuron01": 52,
"neuron02": 57,
"neuron03": 79,
"neuron04": 21,
"neuron05": 102,
"neuron06": 118,
"neuron07": 31,
"neuron08": 66,
"solver": 'sgd',
}

# Predict "speed" for the test data
speed_pred = pred_speed_of_test_data(X_train,X_test,target_speed,param)

#######################################################
### Speed was predicted #######
#######################################################

target.shape

(20400,)

target_speed

array(['149', '149', '137', ..., '120', '131', '143'], dtype=object)

speed_pred

array([137.30386032, 137.30378782, 137.30386424, ..., 137.30380607,
       137.30379655, 137.30378883])

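Training and predicting "y"

# Function that predicts "y" for the test data
######################################################
### Define the function that trains and predicts with LightGBM
########################################################
def pred_y_of_test_data(train,test,target,lgb_param,mlp_pred,select_col_list):
    # --------------------------------------
    # Parameter definition
    # Note: subsample/bagging_fraction, subsample_freq/bagging_freq, and
    # colsample_bytree/feature_fraction are alias pairs in LightGBM, so each
    # pair below sets the same parameter twice.
    # --------------------------------------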
    lgb_params = {
                    'objective': 'multiclass',
                    'boosting_type': 'gbdt',
                    'n_estimators': 50000,
                    'colsample_bytree': 0.5,
                    'subsample': 0.5,
                    'subsample_freq': 3,
                    'reg_alpha': 8,
                    'reg_lambda': 2,
                    'random_state': SEED,
                    'bagging_fraction': lgb_param['bagging_fraction'],
                    'bagging_freq': lgb_param['bagging_freq'],        
                    'feature_fraction': lgb_param['feature_fraction'],
                    "learning_rate":lgb_param['learning_rate'],
                    'min_child_samples': lgb_param['min_child_samples'],
                    'num_leaves': lgb_param['num_leaves'],
        
                  }

    # --------------------------------------
    # Training and prediction
    # --------------------------------------
    kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
    lgb_oof = np.zeros(train.shape[0])
    lgb_pred = pd.DataFrame()

    # Copy the selected columns so that adding 'speed' does not write into a view
    train_x = train[select_col_list].copy()
    test_x = test[select_col_list].copy()

    # Training rows use the observed speed (the global target_speed); test rows use the MLP prediction
    train_x['speed'] = target_speed.astype('float')
    test_x['speed'] = mlp_pred
    
    target_y = target

    for fold, (trn_idx, val_idx) in enumerate(kf.split(X=train_x)):
        X_train, y_train = train_x.iloc[trn_idx], target_y[trn_idx]
        X_valid, y_valid = train_x.iloc[val_idx], target_y[val_idx]
        X_test = test_x

        # LightGBM
        # Note: verbose and early_stopping_rounds are fit() arguments of the LightGBM version
        # used at the time; LightGBM 4.x removed them from fit() in favor of callbacks.
        model = lgb.LGBMClassifier(**lgb_params)
        model.fit(X_train, y_train,
                  eval_set=(X_valid, y_valid),
                  eval_metric='logloss',
                  verbose=False,
                  early_stopping_rounds=500
                  )

        lgb_oof[val_idx] = model.predict(X_valid)
        lgb_pred[f'fold_{fold}'] = model.predict(X_test)
        f1_macro = f1_score(y_valid, lgb_oof[val_idx], average='macro')
        print(f"fold {fold} lgb score: {f1_macro}")

    # Take the most frequent prediction across the folds (corrected after feedback)
    sub_pred = lgb_pred.mode(axis=1)[0]
    print("+-" * 40)
    # Note: f1_macro here is the score of the last fold only, not an overall out-of-fold score
    print(f"score: {f1_macro}")
    
    return sub_pred
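The final label is a majority vote over the per-fold predictions. The sketch below shows how `DataFrame.mode(axis=1)[0]` behaves on a few invented rows, including a row where all folds disagree.

```python
import pandas as pd

fold_preds = pd.DataFrame({
    'fold_0': [0, 1, 4, 1],
    'fold_1': [0, 2, 4, 2],
    'fold_2': [1, 2, 5, 3],
})

# mode(axis=1) returns the modes of each row in ascending order;
# column 0 is therefore the smallest of the most frequent labels.
vote = fold_preds.mode(axis=1)[0].astype(int)
print(vote.tolist())
```

On a tie the smallest label wins, which is a design choice to keep in mind when the smallest labels are also the most common classes.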

# Set the hyperparameters of the model that predicts "y"
lgb_param = {
"bagging_fraction": 0.7537281209924886,
"bagging_freq": 5,
"feature_fraction": 0.7548131884427044,
"learning_rate": 0.00854494687558397,
"min_child_samples": 78,
"num_leaves": 209,
}

# Select the features used for prediction
select_col_list =['B', 'O', 'b1', 'b3', 'bottomTeam', 'topTeam', 'bgTop',
                  'month', 'dayofweek', 'total_stat', 'pitcherTeam',
                  'pitcherHand_ce', 'batter_ce', 'inning_num_ce',
                  'startTime_ce', 'pitcherHand_te', 'batter_te',
                  'inning_num_te', 'startTime_te', 'place_te',
                  'gameID_all_PCA=000', 'gameID_all_PCA=002',
                  'gameID_all_PCA=004', 'gameID_all_PCA=005',
                  'gameID_all_PCA=009', 'gameID_all_PCA=012',
                  'gameID_all_PCA=015', 'gameID_all_PCA=016',
                  'gameID_all_PCA=017', 'gameID_all_PCA=019',
                  'gameID_all_PCA=023', 'gameID_all_PCA=024',
                  'gameID_all_PCA=029', 'gameID_all_PCA=031',
                  'gameID_all_PCA=035', 'gameID_all_PCA=039',
                  'gameID_all_PCA=040', 'gameID_all_PCA=042',
                  'gameID_all_PCA=045', 'gameID_all_PCA=046',
                  'gameID_all_PCA=047', 'gameID_all_PCA=048',
                  'gameID_all_PCA=049', 'gameID_all_PCA=051',
                  'gameID_all_PCA=053', 'gameID_all_PCA=054',
                  'gameID_all_PCA=057', 'gameID_stat_PCA=000',
                  'gameID_stat_PCA=001', 'gameID_stat_PCA=003',
                  'gameID_stat_PCA=004', 'gameID_stat_PCA=005',
                  'gameID_stat_PCA=006', 'gameID_stat_PCA=008',
                  'gameID_stat_PCA=010', 'gameID_stat_PCA=012',
                  'gameID_stat_PCA=014', 'gameID_stat_PCA=015',
                  'gameID_base_PCA=001', 'gameID_base_PCA=005',
                  'gameID_base_PCA=007', 'gameID_base_PCA=008',
                  'gameID_base_PCA=009', 'gameID_base_PCA=011',
                  'gameID_base_PCA=012', 'gameID_base_PCA=013',
                  'gameID_base_PCA=014', 'gameID_base_PCA=015',
                  'gameID_inning_PCA=001', 'gameID_inning_PCA=002',
                  'gameID_inning_PCA=003', 'gameID_inning_PCA=004',
                  'gameID_inning_PCA=006', 'gameID_inning_PCA=008',
                  'gameID_inning_PCA=009', 'gameID_inning_PCA=010',
                  'gameID_inning_PCA=012', 'gameID_inning_PCA=013',
                  'gameID_inning_PCA=014']

# Run training and prediction
sub_pred = pred_y_of_test_data(X_train,X_test,target,lgb_param,speed_pred,select_col_list)

fold 0 lgb score: 0.14689857610654244
fold 1 lgb score: 0.14367666587151642
fold 2 lgb score: 0.15786179891131963
fold 3 lgb score: 0.14182548394951047
fold 4 lgb score: 0.1483487213241339
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
score: 0.1483487213241339

# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------

# Write out the test predictions
submit_df = pd.DataFrame({'y': sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv('../submission/submission.csv')

Building features for predicting single, double, triple, and home run (classes 4 to 7)

These features build on the baseline features that DT-SN kindly published.

DT-SN's baseline: https://prob.space/competitions/npb/discussions/DT-SN-Post2126e8f25865e24a1cc4

Basic setup

import pandas as pd
import numpy as np
import random
import os

from tqdm.notebook import tqdm
import lightgbm as lgb
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold

# Reduce memory usage
def reduce_mem_usage(df, verbose=False):
    start_mem = df.memory_usage().sum() / 1024**2
    cols = df.columns.to_list()
    df_1 = df.select_dtypes(exclude=['integer', 'float'])
    df_2 = df.select_dtypes(include=['integer']).apply(pd.to_numeric, downcast='integer')
    df_3 = df.select_dtypes(include=['float']).apply(pd.to_numeric, downcast='float')
    df = df_1.join([df_2, df_3]).loc[:, cols]
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('{:.2f}Mb->{:.2f}Mb({:.1f}% reduction)'.format(
            start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
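The memory saving comes from `pd.to_numeric(..., downcast=...)`, which picks the smallest dtype that can hold a column's values. A minimal sketch on invented columns:

```python
import pandas as pd

toy = pd.DataFrame({
    'small': [0, 1, 2],
    'big': [0, 1000, 70000],
    'ratio': [0.5, 1.5, 2.5],
})

slim = toy.copy()
# Integers shrink to the narrowest signed type that fits the value range
slim['small'] = pd.to_numeric(slim['small'], downcast='integer')
slim['big'] = pd.to_numeric(slim['big'], downcast='integer')
# Floats shrink to float32 at most
slim['ratio'] = pd.to_numeric(slim['ratio'], downcast='float')

print(slim.dtypes)
```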

# Initialize the random seeds
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

# Settings
INPUT_PATH = os.path.join('..', 'data')
N_CLASS = 8
SEED = 42
N_SAMPLE = 5
N_FOLDS = 5
N_LOOPS = 3

Processing game_info.csv

# Read game_info.csv
game_df = reduce_mem_usage(pd.read_csv(os.path.join(INPUT_PATH, 'game_info.csv'), index_col=0))

display(game_df)
game_df.info()

	startTime	bottomTeam	bgBottom	topTeam	place	startDayTime	bgTop	gameID
0	18:00	DeNA	3	広島	横浜	2020-06-19 18:00:00	6	20202173
1	18:00	ヤクルト	2	中日	神宮	2020-06-19 18:00:00	4	20202174
2	18:00	巨人	1	阪神	東京ドーム	2020-06-19 18:00:00	5	20202175
3	18:00	ソフトバンク	12	ロッテ	PayPayドーム	2020-06-19 18:00:00	9	20202170
4	18:00	オリックス	11	楽天	京セラD大阪	2020-06-19 18:00:00	10	20202171
...	...	...	...	...	...	...	...	...
721	13:00	ソフトバンク	12	ロッテ	PayPayドーム	2020-11-15 13:00:00	9	20203323
722	18:10	巨人	1	ソフトバンク	京セラD大阪	2020-11-21 18:10:00	12	20203326
723	18:10	巨人	1	ソフトバンク	京セラD大阪	2020-11-22 18:10:00	12	20203327
724	18:30	ソフトバンク	12	巨人	PayPayドーム	2020-11-24 18:30:00	1	20203328
725	18:30	ソフトバンク	12	巨人	PayPayドーム	2020-11-25 18:30:00	1	20203329

726 rows × 8 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 726 entries, 0 to 725
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   startTime     726 non-null    object
 1   bottomTeam    726 non-null    object
 2   bgBottom      726 non-null    int8  
 3   topTeam       726 non-null    object
 4   place         726 non-null    object
 5   startDayTime  726 non-null    object
 6   bgTop         726 non-null    int8  
 7   gameID        726 non-null    int32 
dtypes: int32(1), int8(2), object(5)
memory usage: 58.3+ KB

Processing train_data.csv

# Read train_data.csv
tr_df = reduce_mem_usage(pd.read_csv(os.path.join(INPUT_PATH, 'train_data.csv')))

# Remove duplicate rows (ignoring id)
print('duplicated lines:', tr_df.drop('id', axis=1).duplicated().sum())
tr_df = tr_df[~tr_df.drop('id', axis=1).duplicated()]

# Merge game_info
tr_df = pd.merge(tr_df, game_df.drop(['bgTop', 'bgBottom'], axis=1), on='gameID', how='left')

# Disambiguate players who share a name by appending the team name
f = tr_df['inning'].str.contains('表')
tr_df.loc[ f, 'batter'] = tr_df.loc[ f, 'batter'] + '@' + tr_df.loc[ f, 'topTeam'].astype(str)
tr_df.loc[~f, 'batter'] = tr_df.loc[~f, 'batter'] + '@' + tr_df.loc[~f, 'bottomTeam'].astype(str)
tr_df.loc[ f, 'pitcher'] = tr_df.loc[ f, 'pitcher'] + '@' + tr_df.loc[ f, 'bottomTeam'].astype(str)
tr_df.loc[~f, 'pitcher'] = tr_df.loc[~f, 'pitcher'] + '@' + tr_df.loc[~f, 'topTeam'].astype(str)

duplicated lines: 3264
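The disambiguation step appends the batting team to the batter and the fielding team to the pitcher. The sketch below shows an equivalent, more compact formulation on a two-row toy frame. The player and team labels are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'inning': ['1回表', '1回裏'],
    'batter': ['A', 'A'],
    'pitcher': ['P', 'Q'],
    'topTeam': ['T', 'T'],
    'bottomTeam': ['B', 'B'],
})

# In the top ('表') of an inning the top team bats and the bottom team pitches
is_top = toy['inning'].str.contains('表')
batter_team = toy['topTeam'].where(is_top, toy['bottomTeam'])
pitcher_team = toy['bottomTeam'].where(is_top, toy['topTeam'])

toy['batter'] = toy['batter'] + '@' + batter_team
toy['pitcher'] = toy['pitcher'] + '@' + pitcher_team
print(toy[['batter', 'pitcher']])
```

The two batters named `A` end up as different keys because they bat for different teams.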

display(tr_df)
tr_df.info()

	id	totalPitchingCount	B	S	O	b1	b2	b3	pitcher	pitcherHand	batter	batterHand	gameID	inning	pitchType	speed	ballPositionLabel	ballX	ballY	dir	dist	battingType	isOuts	y	startTime	bottomTeam	topTeam	place	startDayTime
0	0	1	0	0	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	ストレート	149km/h	内角低め	17	J	NaN	NaN	NaN	NaN	0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
1	1	2	1	0	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	ストレート	149km/h	内角低め	14	I	NaN	NaN	NaN	NaN	1	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
2	2	3	1	1	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	チェンジアップ	137km/h	外角高め	8	D	NaN	NaN	NaN	NaN	0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
3	3	4	2	1	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	スライダー	138km/h	内角中心	21	G	NaN	NaN	NaN	NaN	2	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
4	4	5	2	2	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	チェンジアップ	136km/h	外角中心	7	F	S	38.299999	G	False	4	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
17131	17131	2	1	0	2	False	False	False	森唯斗@ソフトバンク	R	大田泰示@日本ハム	R	20202118	9回裏	カットファストボール	143km/h	外角中心	7	F	NaN	NaN	NaN	NaN	2	18:00	日本ハム	ソフトバンク	札幌ドーム	2020-06-30 18:00:00
17132	17132	3	1	1	2	False	False	False	森唯斗@ソフトバンク	R	大田泰示@日本ハム	R	20202118	9回裏	カーブ	120km/h	真ん中低め	12	K	NaN	NaN	NaN	NaN	0	18:00	日本ハム	ソフトバンク	札幌ドーム	2020-06-30 18:00:00
17133	17133	4	2	1	2	False	False	False	森唯斗@ソフトバンク	R	大田泰示@日本ハム	R	20202118	9回裏	カーブ	120km/h	真ん中低め	10	H	NaN	NaN	NaN	NaN	1	18:00	日本ハム	ソフトバンク	札幌ドーム	2020-06-30 18:00:00
17134	17134	5	2	2	2	False	False	False	森唯斗@ソフトバンク	R	大田泰示@日本ハム	R	20202118	9回裏	フォーク	131km/h	真ん中低め	12	K	NaN	NaN	NaN	NaN	0	18:00	日本ハム	ソフトバンク	札幌ドーム	2020-06-30 18:00:00
17135	17135	6	3	2	2	False	False	False	森唯斗@ソフトバンク	R	大田泰示@日本ハム	R	20202118	9回裏	カットファストボール	143km/h	外角中心	6	E	NaN	0.000000	NaN	True	1	18:00	日本ハム	ソフトバンク	札幌ドーム	2020-06-30 18:00:00

17136 rows × 29 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17136 entries, 0 to 17135
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  17136 non-null  int16  
 1   totalPitchingCount  17136 non-null  int8   
 2   B                   17136 non-null  int8   
 3   S                   17136 non-null  int8   
 4   O                   17136 non-null  int8   
 5   b1                  17136 non-null  bool   
 6   b2                  17136 non-null  bool   
 7   b3                  17136 non-null  bool   
 8   pitcher             17136 non-null  object 
 9   pitcherHand         17105 non-null  object 
 10  batter              17136 non-null  object 
 11  batterHand          17105 non-null  object 
 12  gameID              17136 non-null  int32  
 13  inning              17136 non-null  object 
 14  pitchType           17136 non-null  object 
 15  speed               17136 non-null  object 
 16  ballPositionLabel   17136 non-null  object 
 17  ballX               17136 non-null  int8   
 18  ballY               17136 non-null  object 
 19  dir                 3094 non-null   object 
 20  dist                4356 non-null   float32
 21  battingType         3094 non-null   object 
 22  isOuts              4356 non-null   object 
 23  y                   17136 non-null  int8   
 24  startTime           17136 non-null  object 
 25  bottomTeam          17136 non-null  object 
 26  topTeam             17136 non-null  object 
 27  place               17136 non-null  object 
 28  startDayTime        17136 non-null  object 
dtypes: bool(3), float32(1), int16(1), int32(1), int8(6), object(17)
memory usage: 2.7+ MB

Processing test_data.csv

# Read test_data.csv
ts_df = reduce_mem_usage(pd.read_csv(os.path.join(INPUT_PATH, 'test_data.csv')))

# Merge game_info
ts_df = pd.merge(ts_df, game_df.drop(['bgTop', 'bgBottom'], axis=1), on='gameID', how='left')

# Disambiguate players who share a name by appending the team name
f = ts_df['inning'].str.contains('表')
ts_df.loc[ f, 'batter'] = ts_df.loc[ f, 'batter'] + '@' + ts_df.loc[ f, 'topTeam'].astype(str)
ts_df.loc[~f, 'batter'] = ts_df.loc[~f, 'batter'] + '@' + ts_df.loc[~f, 'bottomTeam'].astype(str)
ts_df.loc[ f, 'pitcher'] = ts_df.loc[ f, 'pitcher'] + '@' + ts_df.loc[ f, 'bottomTeam'].astype(str)
ts_df.loc[~f, 'pitcher'] = ts_df.loc[~f, 'pitcher'] + '@' + ts_df.loc[~f, 'topTeam'].astype(str)

display(ts_df)
ts_df.info()

	id	totalPitchingCount	B	S	O	b1	b2	b3	pitcher	pitcherHand	batter	batterHand	gameID	inning	startTime	bottomTeam	topTeam	place	startDayTime
0	0	2	1	0	0	False	False	False	遠藤淳志@広島	R	乙坂智@DeNA	L	20202564	2回表	13:30	広島	DeNA	マツダスタジアム	2020-09-06 13:30:00
1	1	1	0	0	0	False	False	False	バンデンハーク@ソフトバンク	R	西川遥輝@日本ハム	L	20202106	3回裏	18:00	日本ハム	ソフトバンク	札幌ドーム	2020-07-02 18:00:00
2	2	7	3	2	2	True	False	False	スアレス@阪神	R	堂林翔太@広島	R	20203305	9回裏	14:00	広島	阪神	マツダスタジアム	2020-11-07 14:00:00
3	3	1	0	0	2	True	False	False	クック@ヤクルト	R	井領雅貴@中日	L	20202650	3回裏	18:00	中日	ヤクルト	ナゴヤドーム	2020-09-23 18:00:00
4	4	2	0	0	2	False	False	False	則本昂大@楽天	R	安達了一@オリックス	R	20202339	2回表	18:00	楽天	オリックス	楽天生命パーク	2020-07-24 18:00:00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
33803	33803	2	0	1	0	False	False	False	床田寛樹@広島	L	坂口智隆@ヤクルト	L	20202023	5回表	18:00	広島	ヤクルト	マツダスタジアム	2020-07-18 18:00:00
33804	33804	1	0	0	0	False	False	False	堀岡隼人@巨人	R	メヒア@広島	R	20202640	9回表	18:00	巨人	広島	東京ドーム	2020-09-21 18:00:00
33805	33805	1	0	0	0	True	False	False	ディプラン@巨人	R	鈴木誠也@広島	R	20202864	7回裏	18:00	広島	巨人	マツダスタジアム	2020-11-04 18:00:00
33806	33806	5	3	1	1	False	True	False	田村伊知郎@西武	R	周東佑京@ソフトバンク	L	20202806	8回裏	18:00	ソフトバンク	西武	PayPayドーム	2020-10-23 18:00:00
33807	33807	3	0	2	1	False	False	False	山本由伸@オリックス	R	源田壮亮@西武	L	20202572	6回裏	18:00	西武	オリックス	メットライフ	2020-09-08 18:00:00

33808 rows × 19 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33808 entries, 0 to 33807
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  33808 non-null  int32 
 1   totalPitchingCount  33808 non-null  int8  
 2   B                   33808 non-null  int8  
 3   S                   33808 non-null  int8  
 4   O                   33808 non-null  int8  
 5   b1                  33808 non-null  bool  
 6   b2                  33808 non-null  bool  
 7   b3                  33808 non-null  bool  
 8   pitcher             33807 non-null  object
 9   pitcherHand         33737 non-null  object
 10  batter              33803 non-null  object
 11  batterHand          33737 non-null  object
 12  gameID              33808 non-null  int32 
 13  inning              33808 non-null  object
 14  startTime           33808 non-null  object
 15  bottomTeam          33808 non-null  object
 16  topTeam             33808 non-null  object
 17  place               33808 non-null  object
 18  startDayTime        33808 non-null  object
dtypes: bool(3), int32(2), int8(4), object(10)
memory usage: 3.3+ MB

Information shared between train and test

# Count the rows whose pitcher appears in both train and test
tr_pitcher = set(tr_df['pitcher'].unique())
ts_pitcher = set(ts_df['pitcher'].unique())
print(tr_df['pitcher'].isin(tr_pitcher & ts_pitcher).sum())
print(ts_df['pitcher'].isin(tr_pitcher & ts_pitcher).sum())

# Count the rows whose batter appears in both train and test
tr_batter = set(tr_df['batter'].unique())
ts_batter = set(ts_df['batter'].unique())
print(tr_df['batter'].isin(tr_batter & ts_batter).sum())
print(ts_df['batter'].isin(tr_batter & ts_batter).sum())
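The counts printed above tell how many rows refer to a player seen in both sets. A small sketch of the same check on invented labels, expressed as a coverage ratio, which is often easier to read than a raw count:

```python
import pandas as pd

tr = pd.Series(['a', 'a', 'b', 'c'])
ts = pd.Series(['b', 'd', 'b'])

# Players that occur in both sets
common = set(tr.unique()) & set(ts.unique())

# Share of rows in each set that refer to a common player
tr_cov = tr.isin(common).mean()
ts_cov = ts.isin(common).mean()
print(common, tr_cov, ts_cov)
```

A low test coverage would suggest that per-player features learned on train will often be missing at prediction time.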

Concatenating train and test

# Concatenate train_data and test_data
input_df = pd.concat([tr_df, ts_df], axis=0).reset_index(drop=True)

# pitcherHand and batterHand
input_df['pitcherHand'] = input_df['pitcherHand'].fillna('R')
input_df['batterHand'] = input_df['batterHand'].fillna('R')

# Pitch type
input_df['pitchType'] = input_df['pitchType'].fillna('-')

# Pitch speed
input_df['speed'] = input_df['speed'].str.replace('km/h', '').replace('-', '135').astype(float)
input_df['speed'] = input_df['speed'].fillna(0)

# Pitch location
input_df['ballPositionLabel'] = input_df['ballPositionLabel'].fillna('中心')

# X coordinate of the pitch (1-21)
input_df['ballX'] = input_df['ballX'].fillna(0).astype(int)

# Convert the Y coordinate of the pitch (A-K)
input_df['ballY'] = input_df['ballY'].map({chr(ord('A')+i):i+1 for i in range(11)})
input_df['ballY'] = input_df['ballY'].fillna(0).astype(int)

# Direction of the batted ball (A-Z)
# Note: this maps the already-converted integer ballY column rather than dir, so nothing
# matches the letter keys and dir becomes 0 on every row (visible in the output below).
input_df['dir'] = input_df['ballY'].map({chr(ord('A')+i):i+1 for i in range(26)})
input_df['dir'] = input_df['dir'].fillna(0).astype(int)

# Distance of the batted ball
input_df['dist'] = input_df['dist'].fillna(0)

# Type of batted ball
input_df['battingType'] = input_df['battingType'].fillna('G')

# Whether the pitch resulted in an out
input_df['isOuts'] = input_df['isOuts'].fillna('-1').astype(int)

display(input_df)
input_df.info()

del tr_df, ts_df, game_df
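The letter-to-number conversion used for `ballY` and `dir` can be sketched on a toy series as follows. The second step shows what happens when an already-converted numeric column is passed through the same letter dictionary: none of the keys match, so every value falls back to 0. The sample values are invented.

```python
import pandas as pd

letter_to_num = {chr(ord('A') + i): i + 1 for i in range(26)}

# Letters become 1-based positions; missing values become 0
ball_y = pd.Series(['J', 'D', None]).map(letter_to_num).fillna(0).astype(int)
print(ball_y.tolist())

# Mapping the numeric result through the letter dictionary matches nothing
remapped = ball_y.map(letter_to_num).fillna(0).astype(int)
print(remapped.tolist())
```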

	id	totalPitchingCount	B	S	O	b1	b2	b3	pitcher	pitcherHand	batter	batterHand	gameID	inning	pitchType	speed	ballPositionLabel	ballX	ballY	dir	dist	battingType	isOuts	y	startTime	bottomTeam	topTeam	place	startDayTime
0	0	1	0	0	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	ストレート	149.0	内角低め	17	10	0	0.000000	G	-1	0.0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
1	1	2	1	0	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	ストレート	149.0	内角低め	14	9	0	0.000000	G	-1	1.0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
2	2	3	1	1	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	チェンジアップ	137.0	外角高め	8	4	0	0.000000	G	-1	0.0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
3	3	4	2	1	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	スライダー	138.0	内角中心	21	7	0	0.000000	G	-1	2.0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
4	4	5	2	2	0	False	False	False	今永昇太@DeNA	L	ピレラ@広島	R	20202173	1回表	チェンジアップ	136.0	外角中心	7	6	0	38.299999	G	0	4.0	18:00	DeNA	広島	横浜	2020-06-19 18:00:00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
50939	33803	2	0	1	0	False	False	False	床田寛樹@広島	L	坂口智隆@ヤクルト	L	20202023	5回表	-	0.0	中心	0	0	0	0.000000	G	-1	NaN	18:00	広島	ヤクルト	マツダスタジアム	2020-07-18 18:00:00
50940	33804	1	0	0	0	False	False	False	堀岡隼人@巨人	R	メヒア@広島	R	20202640	9回表	-	0.0	中心	0	0	0	0.000000	G	-1	NaN	18:00	巨人	広島	東京ドーム	2020-09-21 18:00:00
50941	33805	1	0	0	0	True	False	False	ディプラン@巨人	R	鈴木誠也@広島	R	20202864	7回裏	-	0.0	中心	0	0	0	0.000000	G	-1	NaN	18:00	広島	巨人	マツダスタジアム	2020-11-04 18:00:00
50942	33806	5	3	1	1	False	True	False	田村伊知郎@西武	R	周東佑京@ソフトバンク	L	20202806	8回裏	-	0.0	中心	0	0	0	0.000000	G	-1	NaN	18:00	ソフトバンク	西武	PayPayドーム	2020-10-23 18:00:00
50943	33807	3	0	2	1	False	False	False	山本由伸@オリックス	R	源田壮亮@西武	L	20202572	6回裏	-	0.0	中心	0	0	0	0.000000	G	-1	NaN	18:00	西武	オリックス	メットライフ	2020-09-08 18:00:00

50944 rows × 29 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50944 entries, 0 to 50943
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  50944 non-null  int32  
 1   totalPitchingCount  50944 non-null  int8   
 2   B                   50944 non-null  int8   
 3   S                   50944 non-null  int8   
 4   O                   50944 non-null  int8   
 5   b1                  50944 non-null  bool   
 6   b2                  50944 non-null  bool   
 7   b3                  50944 non-null  bool   
 8   pitcher             50943 non-null  object 
 9   pitcherHand         50944 non-null  object 
 10  batter              50939 non-null  object 
 11  batterHand          50944 non-null  object 
 12  gameID              50944 non-null  int32  
 13  inning              50944 non-null  object 
 14  pitchType           50944 non-null  object 
 15  speed               50944 non-null  float64
 16  ballPositionLabel   50944 non-null  object 
 17  ballX               50944 non-null  int64  
 18  ballY               50944 non-null  int64  
 19  dir                 50944 non-null  int64  
 20  dist                50944 non-null  float32
 21  battingType         50944 non-null  object 
 22  isOuts              50944 non-null  int64  
 23  y                   17136 non-null  float64
 24  startTime           50944 non-null  object 
 25  bottomTeam          50944 non-null  object 
 26  topTeam             50944 non-null  object 
 27  place               50944 non-null  object 
 28  startDayTime        50944 non-null  object 
dtypes: bool(3), float32(1), float64(2), int32(2), int64(4), int8(4), object(13)
memory usage: 8.3+ MB
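
The letter-to-number conversion used for ballY above can be checked on a toy column. This is a minimal sketch; the sample values are made up for illustration, and 0 is the same missing-value sentinel as in the cell above.

```python
import pandas as pd

# Toy stand-in for the ballY column (rows A-K); None marks a missing pitch location.
ball_y = pd.Series(['A', 'K', None, 'F'])

# 'A' -> 1, 'B' -> 2, ..., 'K' -> 11
letter_to_row = {chr(ord('A') + i): i + 1 for i in range(11)}

# Values without a matching key (including missing ones) become NaN,
# which is then replaced by the sentinel 0 before casting to int.
ball_y_num = ball_y.map(letter_to_row).fillna(0).astype(int)
print(ball_y_num.tolist())
```

Because 0 never occurs as a real coordinate, a tree model can still separate "location unknown" from actual positions.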

Basic features

from sklearn.preprocessing import LabelEncoder
def get_base_features(input_df):
    seed_everything(seed=SEED)
    output_df = input_df.copy()

    # Encode the inning as a half-inning index: 0 = top of the 1st, 1 = bottom of the 1st, ...
    output_df['inning'] = 2 * (output_df['inning'].str[0].astype(int) - 1) + output_df['inning'].str.contains('裏')

    # Keep pitcher/batter names only when they appear in both train and test; all others become NaN
    output_df['pitcherCommon'] = output_df['pitcher']
    output_df['batterCommon'] = output_df['batter']
    output_df.loc[~(output_df['pitcherCommon'].isin(tr_pitcher & ts_pitcher)), 'pitcherCommon'] = np.nan
    output_df.loc[~(output_df['batterCommon'].isin(tr_batter & ts_batter)), 'batterCommon'] = np.nan

    # label encoding
    cat_cols = output_df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        f = output_df[col].notnull()
        output_df.loc[f, col] = LabelEncoder().fit_transform(output_df.loc[f, col].values)
        output_df.loc[~f, col] = -1
        output_df[col] = output_df[col].astype(int)
    
    output_df['inningHalf'] = output_df['inning'] % 2
    output_df['inningNumber'] = output_df['inning'] // 2
    output_df['outCount'] = output_df['inning'] * 3 + output_df['O']
    output_df['B_S_O'] = output_df['B'] + 4 * (output_df['S'] + 3 * output_df['O'])
    output_df['b1_b2_b3'] = output_df['b1'] * 1 + output_df['b2'] * 2 + output_df['b3'] * 4
    
    return reduce_mem_usage(output_df)
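
To make the encodings in get_base_features concrete, here is a small sketch on three made-up rows. The toy DataFrame is an assumption for illustration; the formulas are the ones used in the function above.

```python
import pandas as pd

toy = pd.DataFrame({
    'inning': ['1回表', '1回裏', '9回裏'],
    'B': [0, 3, 1], 'S': [0, 2, 1], 'O': [0, 2, 1],
    'b1': [False, True, True], 'b2': [False, False, True], 'b3': [False, True, True],
})

# Half-inning index: 0 = top of the 1st, 1 = bottom of the 1st, ..., 17 = bottom of the 9th.
half_inning = 2 * (toy['inning'].str[0].astype(int) - 1) + toy['inning'].str.contains('裏')

# Position on a 0-53 axis of out states (3 per half-inning).
toy['outCount'] = half_inning * 3 + toy['O']
# Mixed-radix code of the count: B in 0-3, S in 0-2, O in 0-2.
toy['B_S_O'] = toy['B'] + 4 * (toy['S'] + 3 * toy['O'])
# 3-bit mask of occupied bases: 1 = first, 2 = second, 4 = third.
toy['b1_b2_b3'] = toy['b1'] * 1 + toy['b2'] * 2 + toy['b3'] * 4

print(toy[['outCount', 'B_S_O', 'b1_b2_b3']])
```

Each code is a lossless packing of its inputs, so a single integer column carries the full count or base state.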

Random sampling

def random_sampling(input_df, n_sample=10):
    dfs = []
    tr_df = input_df[input_df['y'].notnull()].copy()
    ts_df = input_df[input_df['y'].isnull()].copy()
    for i in tqdm(range(n_sample)):
        df = tr_df.groupby(['gameID', 'outCount']).apply(lambda x: x.sample(n=1, random_state=i)).reset_index(drop=True)
        df['subGameID'] = df['gameID'] * n_sample + i
        dfs.append(df)
    ts_df['subGameID'] = ts_df['gameID'] * n_sample
    return pd.concat(dfs + [ts_df], axis=0)
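
The idea behind random_sampling is to keep one randomly chosen pitch per (gameID, outCount) and to repeat this with several seeds, giving each replicate its own subGameID. The sketch below shows this on toy data. It uses pandas' GroupBy.sample rather than the apply/lambda form above, and the column 'pitch' is a made-up placeholder.

```python
import pandas as pd

toy = pd.DataFrame({
    'gameID':   [1, 1, 1, 1, 2, 2],
    'outCount': [0, 0, 1, 1, 0, 0],
    'pitch':    ['a', 'b', 'c', 'd', 'e', 'f'],
})

n_sample = 3
samples = []
for i in range(n_sample):
    # One randomly chosen pitch per (gameID, outCount) group; the seed changes per replicate.
    df = toy.groupby(['gameID', 'outCount']).sample(n=1, random_state=i).reset_index(drop=True)
    # Each replicate gets its own ID, so per-game features treat it as a separate game.
    df['subGameID'] = df['gameID'] * n_sample + i
    samples.append(df)

sampled = pd.concat(samples, ignore_index=True)
print(sampled)
```

After sampling, every (subGameID, outCount) pair occurs at most once, which is what the pivot-based features later rely on.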

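Aggregation features

# Aggregation helper: group by group_keys, aggregate group_values, and merge the result back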
def aggregation(input_df, group_keys, group_values, agg_methods):
    new_df = []
    for agg_method in agg_methods:
        for col in group_values:
            if callable(agg_method):
                agg_method_name = agg_method.__name__
            else:
                agg_method_name = agg_method
            new_col = f'agg_{agg_method_name}_{col}_grpby_' + '_'.join(group_keys)
            agg_df = input_df[[col]+group_keys].groupby(group_keys)[[col]].agg(agg_method)
            agg_df.columns = [new_col]
            new_df.append(agg_df)
    new_df = pd.concat(new_df, axis=1).reset_index()

    output_df = pd.merge(input_df, new_df, on=group_keys, how='left')
    return output_df, list(new_df.columns)

def get_agg_gameID_inningHalf_features(input_df):
    group_keys = ['subGameID', 'inningHalf']
    group_values = ['S', 'B', 'b1', 'b2', 'b3']
    agg_methods = ['mean', 'std']
    output_df, cols = aggregation(
        input_df, group_keys=group_keys, group_values=group_values, agg_methods=agg_methods)
    return reduce_mem_usage(output_df)
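
The aggregation helper boils down to a groupby, a rename, and a left merge. A minimal sketch on toy data, with made-up values and only the 'S' column, looks like this:

```python
import pandas as pd

toy = pd.DataFrame({
    'subGameID':  [10, 10, 10, 11],
    'inningHalf': [0, 0, 1, 0],
    'S':          [0, 2, 1, 1],
})

keys = ['subGameID', 'inningHalf']

# Per-group statistics, named with the same pattern as in aggregation().
stats = toy.groupby(keys)['S'].agg(['mean', 'std'])
stats.columns = [f'agg_{m}_S_grpby_' + '_'.join(keys) for m in stats.columns]

# Broadcast the group statistics back onto every row of the group.
out = toy.merge(stats.reset_index(), on=keys, how='left')
print(out)
```

Note that pandas' std uses the sample standard deviation, so a group with a single row gets NaN for its std column.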

Pivot-table features

from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

# Pivot-table features: median over the 3 outs of each half-inning (9 columns each for top and bottom), then NMF
def get_pivot_NMF9_features(input_df, n, value_col):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=np.median)
    sc0 = MinMaxScaler().fit_transform(np.median(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,0::2,:], axis=-1))
    sc1 = MinMaxScaler().fit_transform(np.median(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,1::2,:], axis=-1))
    nmf = NMF(n_components=n, random_state=2021)
    nmf_df0 = pd.DataFrame(nmf.fit_transform(sc0), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF9T={x:02}')
    nmf_df1 = pd.DataFrame(nmf.fit_transform(sc1), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF9B={x:02}')
    nmf_df = pd.concat([nmf_df0, nmf_df1], axis=1)
    nmf_df = pd.merge(
        input_df, nmf_df, left_on='subGameID', right_index=True, how='left')
    return reduce_mem_usage(nmf_df)

# Pivot-table features: the 27 out states of the top halves and of the bottom halves, each compressed with NMF
def get_pivot_NMF27_features(input_df, n, value_col):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=np.median)
    sc0 = MinMaxScaler().fit_transform(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,0::2].reshape(-1,27))
    sc1 = MinMaxScaler().fit_transform(pivot_df.fillna(0).values.reshape(-1,54//3,3)[:,1::2].reshape(-1,27))
    nmf = NMF(n_components=n, random_state=2021)
    nmf_df0 = pd.DataFrame(nmf.fit_transform(sc0), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF27T={x:02}')
    nmf_df1 = pd.DataFrame(nmf.fit_transform(sc1), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF27B={x:02}')
    nmf_df = pd.concat([nmf_df0, nmf_df1], axis=1)
    nmf_df = pd.merge(
        input_df, nmf_df, left_on='subGameID', right_index=True, how='left')
    return reduce_mem_usage(nmf_df)

# Pivot-table features: all 54 out states of a game compressed with NMF
def get_pivot_NMF54_features(input_df, n, value_col):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=np.median)
    sc = MinMaxScaler().fit_transform(pivot_df.fillna(0).values)
    nmf = NMF(n_components=n, random_state=2021)
    nmf_df = pd.DataFrame(nmf.fit_transform(sc), index=pivot_df.index).rename(
        columns=lambda x: f'pivot_{value_col}_NMF54={x:02}')
    nmf_df = pd.merge(
        input_df, nmf_df, left_on='subGameID', right_index=True, how='left')
    return reduce_mem_usage(nmf_df)
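
All three functions above share the same pipeline: pivot to one row per subGameID, scale to [0, 1], compress with NMF, and merge the components back. The sketch below runs that pipeline on random toy data with only 4 out states instead of 54. The data, the column suffix, and max_iter=1000 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'subGameID': np.repeat(np.arange(6), 4),
    'outCount': np.tile(np.arange(4), 6),
    'b1_b2_b3': rng.integers(0, 8, size=24),
})

# One row per game, one column per out state.
pivot = pd.pivot_table(toy, index='subGameID', columns='outCount',
                       values='b1_b2_b3', aggfunc='median')

# NMF needs non-negative input; MinMaxScaler maps every column to [0, 1].
scaled = MinMaxScaler().fit_transform(pivot.fillna(0).values)

# Two non-negative components per game.
emb = NMF(n_components=2, random_state=2021, max_iter=1000).fit_transform(scaled)
emb_df = pd.DataFrame(emb, index=pivot.index).rename(
    columns=lambda x: f'pivot_b1_b2_b3_NMF={x:02}')

# Every pitch of a game receives that game's components.
out = toy.merge(emb_df, left_on='subGameID', right_index=True, how='left')
print(out.head())
```

The result is a game-level summary of how the base situation evolved, attached to each row of that game.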

Previous/next features

def get_diff_feature(input_df, value_col, periods, in_inning=True, aggfunc=np.median):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=aggfunc)
    if in_inning:
        dfs = []
        for inning in range(9):
            df0 = pivot_df.loc[:, [out+inning*6 for out in range(0,3)]].diff(periods, axis=1)
            df1 = pivot_df.loc[:, [out+inning*6 for out in range(3,6)]].diff(periods, axis=1)
            dfs += [df0, df1]
        pivot_df = pd.concat(dfs, axis=1).stack()
    else:
        df0 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(0,3)]].diff(periods, axis=1)
        df1 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(3,6)]].diff(periods, axis=1)
        pivot_df = pd.concat([df0, df1], axis=1).stack()
    return pivot_df

def get_shift_feature(input_df, value_col, periods, in_inning=True, aggfunc=np.median):
    pivot_df = pd.pivot_table(input_df, index='subGameID', columns='outCount', values=value_col, aggfunc=aggfunc)
    if in_inning:
        dfs = []
        for inning in range(9):
            df0 = pivot_df.loc[:, [out+inning*6 for out in range(0,3)]].shift(periods, axis=1)
            df1 = pivot_df.loc[:, [out+inning*6 for out in range(3,6)]].shift(periods, axis=1)
            dfs += [df0, df1]
        pivot_df = pd.concat(dfs, axis=1).stack()
    else:
        df0 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(0,3)]].shift(periods, axis=1)
        df1 = pivot_df.loc[:, [out+inning*6 for inning in range(9) for out in range(3,6)]].shift(periods, axis=1)
        pivot_df = pd.concat([df0, df1], axis=1).stack()
    return pivot_df

def get_next_data(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_shift_feature(input_df, value_col, periods=-1, in_inning=in_inning)
    pivot_df.name = 'next_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df

def get_prev_data(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_shift_feature(input_df, value_col, periods=1, in_inning=in_inning)
    pivot_df.name = 'prev_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df
    
def get_next_diff(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_diff_feature(input_df, value_col, periods=-1, in_inning=in_inning)
    pivot_df.name = 'next_diff_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df

def get_prev_diff(input_df, value_col, in_inning=True, nan_value=None):
    pivot_df = get_diff_feature(input_df, value_col, periods=1, in_inning=in_inning)
    pivot_df.name = 'prev_diff_' + value_col
    output_df = pd.merge(
        input_df, pivot_df, left_on=['subGameID', 'outCount'], right_index=True, how='left')
    if nan_value is not None:
        output_df[pivot_df.name] = output_df[pivot_df.name].fillna(nan_value)
    return output_df
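
The shift and diff helpers work on a wide table with one column per out state, so "previous" and "next" are simply neighbouring columns. A small sketch for a single half-inning (outs 0 to 2), using made-up values:

```python
import pandas as pd

# One row per sub-game, one column per out state of a single half-inning.
pivot = pd.DataFrame(
    [[0, 1, 3], [2, 2, 0]],
    index=pd.Index([10, 11], name='subGameID'),
    columns=pd.Index([0, 1, 2], name='outCount'),
)

prev_val = pivot.shift(1, axis=1)    # value at the previous out state
next_val = pivot.shift(-1, axis=1)   # value at the next out state
prev_diff = pivot.diff(1, axis=1)    # current minus previous
next_diff = pivot.diff(-1, axis=1)   # current minus next

# Edges of the half-inning have no neighbour and come out as NaN;
# they are filled with a sentinel, here 8 as in preprocess().
print(prev_val.fillna(8))
print(next_diff.fillna(8))
```

With in_inning=True the shift is applied separately to each block of three columns, so values never leak across half-innings. The sentinel 8 lies outside both the 0 to 7 range of b1_b2_b3 and the -7 to 7 range of its differences, so it cannot be confused with a real value.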

TF-IDF

def get_tfidf(input_df, term_col, document_col):
    output_df = input_df.copy()
    output_df['dummy'] = 0
    tf1 = output_df[[document_col, term_col, 'dummy']].groupby([document_col, term_col])['dummy'].count()
    tf1.name = 'tf1'
    tf2 = output_df[[document_col, term_col, 'dummy']].groupby([document_col])['dummy'].count()
    tf2.name = 'tf2'
    idf1 = output_df[document_col].nunique()
    idf2 = output_df[[document_col, term_col, 'dummy']].groupby([term_col])[document_col].nunique()
    idf2.name = 'idf2'
    output_df = pd.merge(output_df, tf1, left_on=[document_col, term_col], right_index=True, how='left')
    output_df = pd.merge(output_df, tf2, left_on=[document_col], right_index=True, how='left')
    output_df['idf1'] = idf1
    output_df = pd.merge(output_df, idf2, left_on=[term_col], right_index=True, how='left')
    col_name = 'tfidf_' + term_col + '_in_' + document_col
    tf = np.log(1 + (1 + output_df['tf1']) / (1 + output_df['tf2']))
    idf = 1 + np.log((1 + output_df['idf1']) / (1 + output_df['idf2']))
    output_df[col_name] = tf * idf
    return output_df.drop(['tf1', 'tf2', 'idf1', 'idf2', 'dummy'], axis=1)
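
get_tfidf treats each game as a document and each batter as a term, with a smoothed weighting that differs slightly from the textbook definition. The sketch below recomputes the same formula on four made-up rows using groupby transforms instead of merges. The toy data is an assumption, and the 'dummy' column is only a counting helper.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'subGameID': [1, 1, 1, 2],
    'batter':    [7, 7, 9, 7],
})
toy['dummy'] = 0  # helper column used only for counting rows

tf1 = toy.groupby(['subGameID', 'batter'])['dummy'].transform('count')  # rows of this batter in this game
tf2 = toy.groupby('subGameID')['dummy'].transform('count')              # rows in this game
n_docs = toy['subGameID'].nunique()                                      # number of games
df_t = toy.groupby('batter')['subGameID'].transform('nunique')          # games in which the batter appears

# Same smoothed weighting as in get_tfidf.
tf = np.log(1 + (1 + tf1) / (1 + tf2))
idf = 1 + np.log((1 + n_docs) / (1 + df_t))
toy['tfidf_batter_in_subGameID'] = tf * idf
print(toy.drop(columns='dummy'))
```

A batter who appears in every game gets idf = 1, so the score reduces to the term-frequency part alone.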

Plate-appearance skip counts

def get_skip(input_df):
    output_df = input_df.copy()

    next_skip_map = {}
    prev_skip_map = {}
    for key, group in output_df.groupby(['subGameID', 'inningHalf']):
        n = len(group)
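        # dist_map[(a, b)]: largest row distance (1-4) at which batter b follows batter a in this half-inning
        dist_map = {}
        batter = group.sort_values('outCount')['batter']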
        for i in range(n - 1):
            b1 = batter.iloc[i]
            for d in range(1, 5):
                if i + d >= n:
                    break
                b2 = batter.iloc[i + d]

                if (b1, b2) in dist_map.keys():
                    if dist_map[(b1, b2)] < d:
                        dist_map[(b1, b2)] = d
                else:
                    dist_map[(b1, b2)] = d
            
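        # next_skip / prev_skip: the recorded distance for the pair formed with the following / preceding row
        for i in range(len(batter) - 1):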
            next_skip_map[batter.index[i]] = dist_map[(batter.iloc[i], batter.iloc[i+1])]
        for i in range(1, len(batter)):
            prev_skip_map[batter.index[i]] = dist_map[(batter.iloc[i-1], batter.iloc[i])]

    output_df['next_skip'] = output_df.index.map(next_skip_map).fillna(0).astype(np.int8)
    output_df['prev_skip'] = output_df.index.map(prev_skip_map).fillna(0).astype(np.int8)
    return output_df

Feature computation

# Runs the feature-engineering functions in sequence
def preprocess(input_df):
    seed_everything(seed=SEED)
    output_df = input_df.copy()

    # aggregation
    output_df = get_agg_gameID_inningHalf_features(output_df)    

    # pivot
    output_df = get_pivot_NMF9_features(output_df, n=2, value_col='b1_b2_b3')
    output_df = get_pivot_NMF27_features(output_df, n=2, value_col='b1_b2_b3')
    output_df = get_pivot_NMF54_features(output_df, n=2, value_col='b1_b2_b3')

    # next/previous
    output_df = get_next_data(output_df, value_col='b1_b2_b3', nan_value=8)
    output_df = get_next_diff(output_df, value_col='b1_b2_b3', nan_value=8)
    output_df = get_prev_data(output_df, value_col='b1_b2_b3', nan_value=8)
    output_df = get_prev_diff(output_df, value_col='b1_b2_b3', nan_value=8)

    # TF-IDF
    output_df = get_tfidf(output_df, term_col='batter', document_col='subGameID')

    # skip
    output_df = get_skip(output_df)

    return output_df

base_df = get_base_features(input_df)
display(base_df)
base_df.info()

	id	totalPitchingCount	B	S	O	b1	b2	b3	pitcher	pitcherHand	batter	batterHand	gameID	inning	pitchType	speed	ballPositionLabel	ballX	ballY	dir	dist	battingType	isOuts	y	startTime	bottomTeam	topTeam	place	startDayTime	pitcherCommon	batterCommon	inningHalf	inningNumber	outCount	B_S_O	b1_b2_b3
0	0	1	0	0	0	False	False	False	71	0	24	1	20202173	0	5	149.0	4	17	10	0	0.000000	2	-1	0.0	5	0	7	10	0	47	16	0	0	0	0	0
1	1	2	1	0	0	False	False	False	71	0	24	1	20202173	0	5	149.0	4	14	9	0	0.000000	2	-1	1.0	5	0	7	10	0	47	16	0	0	0	1	0
2	2	3	1	1	0	False	False	False	71	0	24	1	20202173	0	7	137.0	8	8	4	0	0.000000	2	-1	0.0	5	0	7	10	0	47	16	0	0	0	5	0
3	3	4	2	1	0	False	False	False	71	0	24	1	20202173	0	6	138.0	3	21	7	0	0.000000	2	-1	2.0	5	0	7	10	0	47	16	0	0	0	6	0
4	4	5	2	2	0	False	False	False	71	0	24	1	20202173	0	7	136.0	6	7	6	0	38.299999	2	0	4.0	5	0	7	10	0	47	16	0	0	0	10	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
50939	33803	2	0	1	0	False	False	False	171	0	113	0	20202023	8	0	0.0	1	0	0	0	0.000000	2	-1	NaN	5	7	3	4	38	98	72	0	4	24	4	0
50940	33804	1	0	0	0	False	False	False	109	1	30	1	20202640	16	0	0.0	1	0	0	0	0.000000	2	-1	NaN	5	6	7	8	143	-1	22	0	8	48	0	0
50941	33805	1	0	0	0	True	False	False	19	1	375	1	20202864	13	0	0.0	1	0	0	0	0.000000	2	-1	NaN	5	7	6	4	206	-1	213	1	6	39	0	1
50942	33806	5	3	1	1	False	True	False	250	1	108	0	20202806	15	0	0.0	1	0	0	0	0.000000	2	-1	NaN	5	2	10	0	190	139	69	1	7	46	19	2
50943	33807	3	0	2	1	False	False	False	153	1	274	0	20202572	11	0	0.0	1	0	0	0	0.000000	2	-1	NaN	5	10	1	5	118	85	163	1	5	34	20	0

50944 rows × 36 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50944 entries, 0 to 50943
Data columns (total 36 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  50944 non-null  int32  
 1   totalPitchingCount  50944 non-null  int8   
 2   B                   50944 non-null  int8   
 3   S                   50944 non-null  int8   
 4   O                   50944 non-null  int8   
 5   b1                  50944 non-null  bool   
 6   b2                  50944 non-null  bool   
 7   b3                  50944 non-null  bool   
 8   pitcher             50944 non-null  int16  
 9   pitcherHand         50944 non-null  int8   
 10  batter              50944 non-null  int16  
 11  batterHand          50944 non-null  int8   
 12  gameID              50944 non-null  int32  
 13  inning              50944 non-null  int8   
 14  pitchType           50944 non-null  int8   
 15  speed               50944 non-null  float32
 16  ballPositionLabel   50944 non-null  int8   
 17  ballX               50944 non-null  int8   
 18  ballY               50944 non-null  int8   
 19  dir                 50944 non-null  int8   
 20  dist                50944 non-null  float32
 21  battingType         50944 non-null  int8   
 22  isOuts              50944 non-null  int8   
 23  y                   17136 non-null  float32
 24  startTime           50944 non-null  int8   
 25  bottomTeam          50944 non-null  int8   
 26  topTeam             50944 non-null  int8   
 27  place               50944 non-null  int8   
 28  startDayTime        50944 non-null  int16  
 29  pitcherCommon       50944 non-null  int16  
 30  batterCommon        50944 non-null  int16  
 31  inningHalf          50944 non-null  int8   
 32  inningNumber        50944 non-null  int8   
 33  outCount            50944 non-null  int8   
 34  B_S_O               50944 non-null  int8   
 35  b1_b2_b3            50944 non-null  int8   
dtypes: bool(3), float32(3), int16(5), int32(2), int8(23)
memory usage: 2.7 MB

sampling_df = random_sampling(base_df, n_sample=N_SAMPLE)
display(sampling_df)
sampling_df.info()

	id	totalPitchingCount	B	S	O	b1	b2	b3	pitcher	pitcherHand	batter	batterHand	gameID	inning	pitchType	speed	ballPositionLabel	ballX	ballY	dir	dist	battingType	isOuts	y	startTime	bottomTeam	topTeam	place	startDayTime	pitcherCommon	batterCommon	inningHalf	inningNumber	outCount	B_S_O	b1_b2_b3	subGameID
0	16276	3	2	0	0	False	False	False	325	1	55	1	20202116	0	5	148.0	6	8	7	0	0.0	2	-1	1.0	5	10	1	5	13	180	37	0	0	0	2	0	101010580
1	16280	3	1	1	1	False	False	False	325	1	142	1	20202116	0	6	136.0	6	8	5	0	25.5	2	1	3.0	5	10	1	5	13	180	92	0	0	1	17	0	101010580
2	16288	3	1	1	2	True	False	False	325	1	15	1	20202116	0	5	148.0	11	15	3	0	0.0	2	-1	1.0	5	10	1	5	13	180	9	0	0	2	29	1	101010580
3	16292	3	0	2	0	False	False	False	0	0	374	0	20202116	1	6	126.0	6	15	6	0	0.0	2	-1	2.0	5	10	1	5	13	0	212	1	0	3	8	0	101010580
4	16300	7	2	2	1	False	False	False	0	0	274	0	20202116	1	3	135.0	9	10	11	0	0.0	2	-1	0.0	5	10	1	5	13	0	163	1	0	4	22	0	101010580
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
50939	33803	2	0	1	0	False	False	False	171	0	113	0	20202023	8	0	0.0	1	0	0	0	0.0	2	-1	NaN	5	7	3	4	38	98	72	0	4	24	4	0	101010115
50940	33804	1	0	0	0	False	False	False	109	1	30	1	20202640	16	0	0.0	1	0	0	0	0.0	2	-1	NaN	5	6	7	8	143	-1	22	0	8	48	0	0	101013200
50941	33805	1	0	0	0	True	False	False	19	1	375	1	20202864	13	0	0.0	1	0	0	0	0.0	2	-1	NaN	5	7	6	4	206	-1	213	1	6	39	0	1	101014320
50942	33806	5	3	1	1	False	True	False	250	1	108	0	20202806	15	0	0.0	1	0	0	0	0.0	2	-1	NaN	5	2	10	0	190	139	69	1	7	46	19	2	101014030
50943	33807	3	0	2	1	False	False	False	153	1	274	0	20202572	11	0	0.0	1	0	0	0	0.0	2	-1	NaN	5	10	1	5	118	85	163	1	5	34	20	0	101012860

48768 rows × 37 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48768 entries, 0 to 50943
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  48768 non-null  int32  
 1   totalPitchingCount  48768 non-null  int8   
 2   B                   48768 non-null  int8   
 3   S                   48768 non-null  int8   
 4   O                   48768 non-null  int8   
 5   b1                  48768 non-null  bool   
 6   b2                  48768 non-null  bool   
 7   b3                  48768 non-null  bool   
 8   pitcher             48768 non-null  int16  
 9   pitcherHand         48768 non-null  int8   
 10  batter              48768 non-null  int16  
 11  batterHand          48768 non-null  int8   
 12  gameID              48768 non-null  int32  
 13  inning              48768 non-null  int8   
 14  pitchType           48768 non-null  int8   
 15  speed               48768 non-null  float32
 16  ballPositionLabel   48768 non-null  int8   
 17  ballX               48768 non-null  int8   
 18  ballY               48768 non-null  int8   
 19  dir                 48768 non-null  int8   
 20  dist                48768 non-null  float32
 21  battingType         48768 non-null  int8   
 22  isOuts              48768 non-null  int8   
 23  y                   14960 non-null  float32
 24  startTime           48768 non-null  int8   
 25  bottomTeam          48768 non-null  int8   
 26  topTeam             48768 non-null  int8   
 27  place               48768 non-null  int8   
 28  startDayTime        48768 non-null  int16  
 29  pitcherCommon       48768 non-null  int16  
 30  batterCommon        48768 non-null  int16  
 31  inningHalf          48768 non-null  int8   
 32  inningNumber        48768 non-null  int8   
 33  outCount            48768 non-null  int8   
 34  B_S_O               48768 non-null  int8   
 35  b1_b2_b3            48768 non-null  int8   
 36  subGameID           48768 non-null  int64  
dtypes: bool(3), float32(3), int16(5), int32(2), int64(1), int8(23)
memory usage: 3.3 MB

prep_df = preprocess(sampling_df)
prep_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48768 entries, 0 to 48767
Data columns (total 64 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   id                                      48768 non-null  int32  
 1   totalPitchingCount                      48768 non-null  int8   
 2   B                                       48768 non-null  int8   
 3   S                                       48768 non-null  int8   
 4   O                                       48768 non-null  int8   
 5   b1                                      48768 non-null  bool   
 6   b2                                      48768 non-null  bool   
 7   b3                                      48768 non-null  bool   
 8   pitcher                                 48768 non-null  int16  
 9   pitcherHand                             48768 non-null  int8   
 10  batter                                  48768 non-null  int16  
 11  batterHand                              48768 non-null  int8   
 12  gameID                                  48768 non-null  int32  
 13  inning                                  48768 non-null  int8   
 14  pitchType                               48768 non-null  int8   
 15  speed                                   48768 non-null  float32
 16  ballPositionLabel                       48768 non-null  int8   
 17  ballX                                   48768 non-null  int8   
 18  ballY                                   48768 non-null  int8   
 19  dir                                     48768 non-null  int8   
 20  dist                                    48768 non-null  float32
 21  battingType                             48768 non-null  int8   
 22  isOuts                                  48768 non-null  int8   
 23  y                                       14960 non-null  float32
 24  startTime                               48768 non-null  int8   
 25  bottomTeam                              48768 non-null  int8   
 26  topTeam                                 48768 non-null  int8   
 27  place                                   48768 non-null  int8   
 28  startDayTime                            48768 non-null  int16  
 29  pitcherCommon                           48768 non-null  int16  
 30  batterCommon                            48768 non-null  int16  
 31  inningHalf                              48768 non-null  int8   
 32  inningNumber                            48768 non-null  int8   
 33  outCount                                48768 non-null  int8   
 34  B_S_O                                   48768 non-null  int8   
 35  b1_b2_b3                                48768 non-null  int8   
 36  subGameID                               48768 non-null  int32  
 37  agg_mean_S_grpby_subGameID_inningHalf   48768 non-null  float32
 38  agg_mean_B_grpby_subGameID_inningHalf   48768 non-null  float32
 39  agg_mean_b1_grpby_subGameID_inningHalf  48768 non-null  float32
 40  agg_mean_b2_grpby_subGameID_inningHalf  48768 non-null  float32
 41  agg_mean_b3_grpby_subGameID_inningHalf  48768 non-null  float32
 42  agg_std_S_grpby_subGameID_inningHalf    48768 non-null  float32
 43  agg_std_B_grpby_subGameID_inningHalf    48768 non-null  float32
 44  agg_std_b1_grpby_subGameID_inningHalf   48768 non-null  float32
 45  agg_std_b2_grpby_subGameID_inningHalf   48768 non-null  float32
 46  agg_std_b3_grpby_subGameID_inningHalf   48768 non-null  float32
 47  pivot_b1_b2_b3_NMF9T=00                 48768 non-null  float32
 48  pivot_b1_b2_b3_NMF9T=01                 48768 non-null  float32
 49  pivot_b1_b2_b3_NMF9B=00                 48768 non-null  float32
 50  pivot_b1_b2_b3_NMF9B=01                 48768 non-null  float32
 51  pivot_b1_b2_b3_NMF27T=00                48768 non-null  float32
 52  pivot_b1_b2_b3_NMF27T=01                48768 non-null  float32
 53  pivot_b1_b2_b3_NMF27B=00                48768 non-null  float32
 54  pivot_b1_b2_b3_NMF27B=01                48768 non-null  float32
 55  pivot_b1_b2_b3_NMF54=00                 48768 non-null  float32
 56  pivot_b1_b2_b3_NMF54=01                 48768 non-null  float32
 57  next_b1_b2_b3                           48768 non-null  float64
 58  next_diff_b1_b2_b3                      48768 non-null  float64
 59  prev_b1_b2_b3                           48768 non-null  float64
 60  prev_diff_b1_b2_b3                      48768 non-null  float64
 61  tfidf_batter_in_subGameID               48768 non-null  float64
 62  next_skip                               48768 non-null  int64  
 63  prev_skip                               48768 non-null  int64  
dtypes: bool(3), float32(23), float64(5), int16(5), int32(3), int64(2), int8(23)
memory usage: 10.7 MB

Dropping unneeded columns

drop_cols = [
    'id',
    'gameID',
    'subGameID',

    'pitchType',
    'speed',
    'ballPositionLabel',
    'ballX',
    'ballY',
    'dir',
    'dist',
    'battingType',
    'isOuts',

    'startDayTime',
    'startTime',
    'pitcher',
    'batter',
]
target_col = 'y'
group_col = 'gameID'

F010_train = prep_df[prep_df[target_col].notnull()]
F010_test = prep_df[prep_df[target_col].isnull()]

F010_target = F010_train[target_col]

F010_train = F010_train.drop([target_col] + drop_cols, axis=1)
F010_test = F010_test.drop([target_col] + drop_cols, axis=1)

F010_train.shape,F010_test.shape,F010_target.shape

((14960, 47), (33808, 47), (14960,))

Saving the generated features

# Save the generated feature data for later use
F010_train.to_csv('../features/F010_train.csv',index=False)
F010_test.to_csv('../features/F010_test.csv',index=False)
F010_target.to_csv('../features/F010_target.csv',index=False)

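Models and predictions for singles, doubles, triples, and home runs (classes 4-7)

Singles, doubles, triples, and home runs (classes 4-7) occur far less often than strikes, balls, fouls, and outs (classes 0-3), so I built a separate binary classifier for each of these four classes.

For the binary classifiers, I reused know-how gained in past competitions hosted on ProbSpace.

Spam email classification competition: https://prob.space/competitions/spam_mail
Competitive game data analysis Koshien (対戦ゲームデータ分析甲子園): https://prob.space/competitions/game_winner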
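
Building one binary classifier per rare class means turning the multi-class label into one 0/1 target per class. A minimal sketch with made-up labels (the dictionary name is hypothetical):

```python
import pandas as pd

# Toy multi-class labels: 0-3 are the frequent classes, 4-7 the rare ones.
y = pd.Series([0, 1, 4, 3, 5, 0, 7, 4], name='y')

# One binary target per rare class: 1 where the row belongs to that class, 0 otherwise.
binary_targets = {cls: (y == cls).astype(int) for cls in [4, 5, 6, 7]}

for cls, target in binary_targets.items():
    print(cls, target.tolist())
```

Each target is heavily imbalanced, which is why the models below are trained on undersampled data.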

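Common functions

###### Function that rescales the features
def scale_train_test(train,valid,test,flg):

  # Create the scaler (flg == 0: StandardScaler, otherwise MinMaxScaler)
  if flg == 0:
    scaler = preprocessing.StandardScaler()
  else:
    scaler = preprocessing.MinMaxScaler()

  # Fit on the training data, then apply the same transform to all three sets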
  std_train = pd.DataFrame(scaler.fit_transform(train))
  std_valid = pd.DataFrame(scaler.transform(valid))
  std_test = pd.DataFrame(scaler.transform(test))

  std_train.columns = train.columns
  std_valid.columns = valid.columns
  std_test.columns = test.columns 
  return std_train,std_valid,std_test
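
The key point of scale_train_test is that the scaler is fitted on the training rows only and then reused for validation and test. A small sketch with one made-up feature column 'f':

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'f': [0.0, 2.0, 4.0]})
valid = pd.DataFrame({'f': [2.0]})
test = pd.DataFrame({'f': [6.0]})

# Mean and standard deviation come from the training rows only.
scaler = StandardScaler().fit(train)

# The same statistics are applied to the other sets.
std_valid = pd.DataFrame(scaler.transform(valid), columns=valid.columns)
std_test = pd.DataFrame(scaler.transform(test), columns=test.columns)
print(std_valid, std_test, sep='\n')
```

Wrapping the transformed array in a new DataFrame resets the row index to 0..n-1, which is presumably why select_by_team returns the original test index separately.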

###### Function that renames the columns to col-0, col-1, ... and returns the new and original names
def make_colname(df):
    collist = []
    colnamelist = []
    for j in range(len(df.columns)):
        collist.append(f'col-{j}')
        colnamelist.append(df.columns[j])
        #print(df.columns[j])# = f'col-{j}'
    df.columns = collist
    return collist,colnamelist

###### Function that restricts the validation and test data to a single team
def select_by_team(train,test,target,team_num,flg):
  if len(train[train['bottomTeam']==team_num]) > 0:
    select_valid = train[train['bottomTeam']==team_num]
    select_valid_target = target[train['bottomTeam']==team_num]
  else:
    select_valid = train.copy()
    select_valid_target = target.copy()

  select_test = test[test['bottomTeam']==team_num]
  test_index = test[test['bottomTeam']==team_num].index

  std_train,std_valid,std_test = scale_train_test(train,select_valid,select_test,flg)
  collist,colnamelist = make_colname(std_train)
  collist,colnamelist = make_colname(std_valid)
  collist,colnamelist = make_colname(std_test)
  return std_train,std_valid,std_test,select_valid_target,test_index

########################################################
### Function that trains a LightGBM model and predicts
########################################################
def objective(train,test,target,valid,valid_target,all_param):
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    # Predict the target variable y
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    # --------------------------------------
    # Parameter definitions
    # --------------------------------------
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        # binary classification
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        "seed":42,
        "learning_rate":all_param['learning_rate'],
        'lambda_l1': all_param['lambda_l1'],
        'lambda_l2': all_param['lambda_l2'],
        'num_leaves': all_param['num_leaves'],
        'feature_fraction': all_param['feature_fraction'],
        'bagging_fraction': all_param['bagging_fraction'],
        'bagging_freq': all_param['bagging_freq'],
        'min_child_samples': all_param['min_child_samples'],
    }

    Threshold_y = all_param['Threshold_y']

    # --------------------------------------
    # Training and prediction (final prediction)
    # --------------------------------------

    col_list = test.columns

    train_x = pd.DataFrame()
    test_x = pd.DataFrame()
    
    select_col_list = all_param['select_col_list']

    
    train_x = train.loc[:][select_col_list]
    test_x = test.loc[:][select_col_list]

    X_train, y_train = train_x, target['y']
    X_test = test_x

    randomTrainState = all_param['randomTrainState']

    # Build the training data with random undersampling
    sampler = RandomUnderSampler(random_state=randomTrainState)
    X_resampled, y_resampled = sampler.fit_resample(X_train,y_train)
    
    # Build the validation data
    #X_val, y_val = valid,valid_target
    X_val = valid.loc[:][select_col_list]
    y_val = valid_target

    # LightGBM
    num_round = all_param['num_round']
    lgb_train = lgb.Dataset(X_resampled, y_resampled)
    model = lgb.train(params,                                   # parameters defined above
                  lgb_train,                                # training dataset
                  num_boost_round=num_round,                     # number of boosting rounds
                  verbose_eval=None)                        # do not print training progress

    lgb_val_pred  = model.predict(X_val)
    val_pred = np.zeros(lgb_val_pred.shape)
    val_pred[lgb_val_pred > Threshold_y] = 1 
    f1_macro = f1_score(y_val, val_pred)
    #print("+-" * 40)
    #print(f"score: {f1_macro}")

    lgb_pred  = model.predict(X_test)
    sub_pred = np.zeros(lgb_pred.shape)
    sub_pred[lgb_pred > Threshold_y] = 1 

    return f1_macro,sub_pred,lgb_pred

########################################################
### Define the function that trains and predicts with MultinomialNB
########################################################
def objective2(train,test,target,valid,valid_target,all_param):
    # valid and valid_target are not used here; they are kept so that the
    # signature matches objective().
    # --------------------------------------
    # Parameter definition
    # --------------------------------------
    NFOLDS = 11
    Threshold_y = all_param['Threshold_y']

    # --------------------------------------
    # Training and prediction (final prediction)
    # --------------------------------------
    # Per-fold test predictions (the lgb_ prefix is kept from objective())
    lgb_preds = pd.DataFrame()

    # Keep only the feature columns selected for this team
    select_col_list = all_param['select_col_list']
    train_x = train[select_col_list]
    test_x = test[select_col_list]

    X_train, y_train = train_x, target['y']
    X_test = test_x

    randomTrainState = all_param['randomTrainState']

    val_scores = []
    for fold in range(NFOLDS):
        # Build the training data by undersampling, with a different seed per fold
        sampler = RandomUnderSampler(random_state=randomTrainState+fold)
        X_resampled, y_resampled = sampler.fit_resample(X_train,y_train)

        # print('X resample:'+str(len(X_resampled)))
        # Build the validation data (drawn from the same training rows with a
        # different random_state, so it overlaps the training sample)
        sampler = RandomUnderSampler(random_state=randomTrainState+fold+5)
        X_val, y_val = sampler.fit_resample(X_train,y_train)

        # Training -------------------------------------------------------------
        model = MultinomialNB()
        model.fit(X_resampled, y_resampled)

        # predict() returns hard class labels, not probabilities
        lgb_preds[f'pred{fold:03d}'] = model.predict(X_test)

        tmp_pred = model.predict(X_val)
        val_pred = np.zeros(tmp_pred.shape)
        val_pred[tmp_pred > Threshold_y] = 1
        valscore = f1_score(y_val, val_pred)
        val_scores.append(valscore)

    # Mean of the per-fold F1 scores (binary F1, despite the variable name)
    f1_macro = np.mean(val_scores)
    #print("+-" * 40)
    #print(f"score: {f1_macro}")

    # Average the per-fold labels, then threshold the resulting vote share
    lgb_pred  = lgb_preds.mean(axis='columns')
    sub_pred = np.zeros(lgb_pred.shape)
    sub_pred[lgb_pred > Threshold_y] = 1

    return f1_macro,sub_pred,lgb_pred

def pred_by_team(train,test,target,team_num,all_sub_pred,all_param00,flg):
  # Build the validation data and test data restricted to this team.
  train00,valid00,test00,valid_target00,testindex00 = select_by_team(train,test,target,team_num,flg)
  # Run training and prediction (flg == 0: LightGBM, otherwise: MultinomialNB)
  if flg == 0:
    f1_macro00,sub_pred00,lgb_pred00 = objective(train00,test00,target,valid00,valid_target00,all_param00)
  else:
    f1_macro00,sub_pred00,lgb_pred00 = objective2(train00,test00,target,valid00,valid_target00,all_param00)
  # Store this team's predictions at its positions in the shared array
  all_sub_pred[testindex00] = sub_pred00
  return f1_macro00,sub_pred00,lgb_pred00,all_sub_pred

def pred_all_teams(header_num,prednum,param_list,flg=0):
  # Load the preprocessed features
  ###### train ########################
  train = pd.read_csv(f'../features/{header_num}_train.csv')
  target = pd.read_csv(f'../features/{header_num}_target.csv')
  target['y'] = target['y'].astype(int)
  #### test ###########################
  test = pd.read_csv(f'../features/{header_num}_test.csv')


  ##### target to one hot vector ######
  target_df = pd.get_dummies(target, columns=['y'])

  # Keep only the one-hot column of the class to predict (one-vs-rest target).
  # astype(int) keeps the label numeric on pandas versions where
  # get_dummies returns boolean columns.
  target['y'] = target_df[f'y_{prednum}'].astype(int)
  
  # Create the array that holds the predictions for all test rows
  all_sub_pred = np.zeros((len(test),))
  
  # param_list[i] holds the parameters for team i
  for i in range(len(param_list)):
    f1_macro00,sub_pred00,lgb_pred00,all_sub_pred = pred_by_team(train,test,target,i,all_sub_pred,param_list[i],flg)

  return all_sub_pred
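
The functions above turn the multi-class label into a one-vs-rest binary target and balance it by random undersampling before training. The following is a minimal NumPy-only sketch of those two steps on made-up labels. `one_vs_rest` and `undersample_indices` are hypothetical helpers written for illustration; the latter only imitates what `RandomUnderSampler` does with its default settings (keep the minority class, sample the majority class down to the same count without replacement) and is not the library's implementation.

```python
import numpy as np

def one_vs_rest(labels, positive_class):
    """Return 1 where labels == positive_class, else 0."""
    return (np.asarray(labels) == positive_class).astype(int)

def undersample_indices(y, seed):
    """Pick row indices so that both classes have as many rows as the rarer one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([
        rng.choice(pos, size=n, replace=False),
        rng.choice(neg, size=n, replace=False),
    ])
    return np.sort(keep)

# Made-up labels; class 4 (hit) is the rare positive class here
labels = np.array([0, 1, 0, 4, 2, 0, 1, 4, 0, 3, 1, 0])
y = one_vs_rest(labels, positive_class=4)
idx = undersample_indices(y, seed=5000)
y_balanced = y[idx]

print(int(y.sum()), len(y))                     # positives vs. rows before
print(int(y_balanced.sum()), len(y_balanced))   # positives vs. rows after
```

Changing `seed` changes which majority rows survive, which is why `randomTrainState` is part of each team's parameter set.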

Predicting hits (class 4)

setting

header_num = 'F010'
prednum = 4
sub_num = f'ind0{prednum}-t00-all'
param_list = []

Team 00

##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 ={
'num_round': 22,
'Threshold_y': 0.5457880758611685,
'bagging_fraction': 0.8011059140187697,
'bagging_freq': 6,
'feature_fraction': 0.596451862314391,
'lambda_l1': 0.03797737964975018,
'lambda_l2': 1.5247646095285393e-05,
'learning_rate': 0.011373097841736944,
'min_child_samples': 54,
'num_leaves': 108,
'randomTrainState': 5000,
'select_col_list': ['col-1', 'col-2', 'col-3', 'col-4', 'col-5', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-15', 'col-16', 'col-17', 'col-18', 'col-21', 'col-23', 'col-25', 'col-28', 'col-30', 'col-33', 'col-38', 'col-39', 'col-41', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param00)

Team 01

##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 ={
'num_round': 23,
'Threshold_y': 0.5889101186701063,
'bagging_fraction': 0.6374123048258021,
'bagging_freq': 1,
'feature_fraction': 0.7134861073611422,
'lambda_l1': 1.9170047296779988e-06,
'lambda_l2': 5.208948566113227e-05,
'learning_rate': 0.024343076451874744,
'min_child_samples': 91,
'num_leaves': 256,
'randomTrainState': 7547,
'select_col_list': ['col-0', 'col-2', 'col-3', 'col-5', 'col-6', 'col-7', 'col-8', 'col-11', 'col-12', 'col-13', 'col-14', 'col-18', 'col-19', 'col-20', 'col-22', 'col-23', 'col-29', 'col-30', 'col-33', 'col-34', 'col-35', 'col-38', 'col-40', 'col-43', 'col-44'],
}
param_list.append(all_param01)

Team 02

##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 ={
'num_round': 25,
'Threshold_y': 0.5893221185816869,
'bagging_fraction': 0.8754828088009734,
'bagging_freq': 7,
'feature_fraction': 0.5629420510817766,
'lambda_l1': 8.096083569290082e-06,
'lambda_l2': 2.5099283088293556e-08,
'learning_rate': 0.02012230648399335,
'min_child_samples': 24,
'num_leaves': 249,
'randomTrainState': 9991,
'select_col_list': ['col-0', 'col-2', 'col-3', 'col-5', 'col-7', 'col-11', 'col-12', 'col-14', 'col-15', 'col-16', 'col-20', 'col-21', 'col-22', 'col-24', 'col-28', 'col-30', 'col-37', 'col-38', 'col-39', 'col-41', 'col-43', 'col-44', 'col-45'],
}
param_list.append(all_param02)

Team 03

##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 ={
'num_round': 81,
'Threshold_y': 0.625101164467119,
'bagging_fraction': 0.7373604916657892,
'bagging_freq': 4,
'feature_fraction': 0.7071738392109185,
'lambda_l1': 0.00011291713906883523,
'lambda_l2': 3.481715177476982e-05,
'learning_rate': 0.009426473025184819,
'min_child_samples': 14,
'num_leaves': 193,
'randomTrainState': 9993,
'select_col_list': ['col-0', 'col-3', 'col-5', 'col-6', 'col-7', 'col-9', 'col-11', 'col-12', 'col-13', 'col-17', 'col-18', 'col-19', 'col-21', 'col-24', 'col-30', 'col-31', 'col-32', 'col-35', 'col-36', 'col-37', 'col-40', 'col-41', 'col-43', 'col-45'],
}
param_list.append(all_param03)

Team 04

##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 ={
'num_round': 29,
'Threshold_y': 0.5259362725435317,
'bagging_fraction': 0.7259878847691205,
'bagging_freq': 7,
'feature_fraction': 0.4024323069123868,
'lambda_l1': 1.4698558567372508e-05,
'lambda_l2': 0.0001484293366246452,
'learning_rate': 0.005120538199748757,
'min_child_samples': 40,
'num_leaves': 194,
'randomTrainState': 6164,
'select_col_list': ['col-1', 'col-2', 'col-8', 'col-10', 'col-14', 'col-16', 'col-18', 'col-19', 'col-20', 'col-21', 'col-26', 'col-27', 'col-28', 'col-34', 'col-35', 'col-36', 'col-39', 'col-40', 'col-41', 'col-42', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param04)

Team 05

##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 ={
'num_round': 82,
'Threshold_y': 0.6977108739620748,
'bagging_fraction': 0.7282751551795792,
'bagging_freq': 6,
'feature_fraction': 0.45809154061872265,
'lambda_l1': 0.1322170148676693,
'lambda_l2': 1.1545917924898332e-06,
'learning_rate': 0.02257794841585556,
'min_child_samples': 16,
'num_leaves': 139,
'randomTrainState': 4312,
'select_col_list': ['col-3', 'col-7', 'col-8', 'col-9', 'col-14', 'col-16', 'col-17', 'col-18', 'col-19', 'col-21', 'col-24', 'col-26', 'col-30', 'col-31', 'col-35', 'col-37', 'col-40', 'col-41', 'col-42', 'col-43', 'col-45', 'col-46'],
}

param_list.append(all_param05)

Team 06

##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 = {
'num_round': 76,
'Threshold_y': 0.677259,
'bagging_fraction': 0.622379131760522,
'bagging_freq': 1,
'feature_fraction': 0.43523697616265694,
'lambda_l1': 0.8211241372618634,
'lambda_l2': 2.111340578441308e-05,
'learning_rate': 0.021960522978137942,
'min_child_samples': 12,
'num_leaves': 27,
'randomTrainState': 886,
'select_col_list': ['col-1', 'col-2', 'col-3', 'col-4', 'col-5', 'col-7', 'col-8', 'col-12', 'col-13', 'col-14', 'col-18', 'col-19', 'col-20', 'col-21', 'col-22', 'col-23', 'col-25', 'col-30', 'col-34', 'col-37', 'col-38', 'col-39', 'col-41', 'col-42', 'col-43', 'col-45'],
}
param_list.append(all_param06)

Team 07

##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 ={
'num_round': 11,
'Threshold_y': 0.5153190259276867,
'bagging_fraction': 0.7713406672167327,
'bagging_freq': 5,
'feature_fraction': 0.9198934709118338,
'lambda_l1': 0.10237235943070253,
'lambda_l2': 9.078045159422409e-06,
'learning_rate': 0.0053148818152905924,
'min_child_samples': 53,
'num_leaves': 74,
'randomTrainState': 5415,
'select_col_list': ['col-0', 'col-1', 'col-2', 'col-4', 'col-5', 'col-7', 'col-11', 'col-12', 'col-14', 'col-16', 'col-17', 'col-20', 'col-21', 'col-26', 'col-28', 'col-29', 'col-30', 'col-32', 'col-34', 'col-35', 'col-39', 'col-40', 'col-42', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param07)

Team 08

##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 ={
'num_round': 10,
'Threshold_y': 0.5380212227317296,
'bagging_fraction': 0.9187000291018446,
'bagging_freq': 2,
'feature_fraction': 0.6481201635454304,
'lambda_l1': 7.715161904743353e-07,
'lambda_l2': 0.002920082833647645,
'learning_rate': 0.02575749622485481,
'min_child_samples': 42,
'num_leaves': 157,
'randomTrainState': 5678,
'select_col_list': ['col-0', 'col-2', 'col-4', 'col-6', 'col-8', 'col-9', 'col-10', 'col-12', 'col-15', 'col-18', 'col-20', 'col-21', 'col-23', 'col-26', 'col-29', 'col-30', 'col-32', 'col-33', 'col-34', 'col-37', 'col-38', 'col-40', 'col-41', 'col-46'],
}
param_list.append(all_param08)

Team 09

##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 ={
'num_round': 61,
'Threshold_y': 0.560865361282143,
'bagging_fraction': 0.7236062249349682,
'bagging_freq': 2,
'feature_fraction': 0.40196600417593165,
'lambda_l1': 3.272803221486633e-07,
'lambda_l2': 0.0059614935640386335,
'learning_rate': 0.008946676196277018,
'min_child_samples': 77,
'num_leaves': 35,
'randomTrainState': 9582,
'select_col_list': ['col-0', 'col-2', 'col-4', 'col-5', 'col-8', 'col-9', 'col-10', 'col-12', 'col-14', 'col-17', 'col-20', 'col-25', 'col-27', 'col-28', 'col-34', 'col-36', 'col-37', 'col-38', 'col-39', 'col-40', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param09)

Team 10

##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 ={
'num_round': 3,
'Threshold_y': 0.5136703347722501,
'bagging_fraction': 0.6576434003296778,
'bagging_freq': 2,
'feature_fraction': 0.9238185170320545,
'lambda_l1': 9.453058045248753e-07,
'lambda_l2': 6.173857715519222e-07,
'learning_rate': 0.015417272505010775,
'min_child_samples': 39,
'num_leaves': 178,
'randomTrainState': 1971,
'select_col_list': ['col-1', 'col-2', 'col-5', 'col-6', 'col-12', 'col-15', 'col-19', 'col-20', 'col-22', 'col-24', 'col-26', 'col-27', 'col-29', 'col-30', 'col-32', 'col-34', 'col-35', 'col-36', 'col-38', 'col-39', 'col-40', 'col-43', 'col-45'],
}
param_list.append(all_param10)

Team 11

##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 ={
'num_round': 114,
'Threshold_y': 0.6276608676627834,
'bagging_fraction': 0.6079086267466319,
'bagging_freq': 3,
'feature_fraction': 0.8841048573926545,
'lambda_l1': 4.6216686227184415e-06,
'lambda_l2': 0.0016302782388678146,
'learning_rate': 0.005927956010265586,
'min_child_samples': 14,
'num_leaves': 243,
'randomTrainState': 7314,
'select_col_list': ['col-0', 'col-2', 'col-4', 'col-5', 'col-6', 'col-10', 'col-12', 'col-14', 'col-15', 'col-16', 'col-19', 'col-20', 'col-21', 'col-23', 'col-24', 'col-27', 'col-28', 'col-29', 'col-31', 'col-35', 'col-37', 'col-38', 'col-39', 'col-40', 'col-41', 'col-44', 'col-45'],
}
param_list.append(all_param11)
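
Every entry of `param_list` has to carry each key that `objective` reads from `all_param`, otherwise training stops with a `KeyError` partway through the teams. A small check like the one below could catch that before training starts. `REQUIRED_KEYS` and `missing_keys` are hypothetical helpers, not part of the original notebook; the key names are the ones `objective` reads.

```python
REQUIRED_KEYS = {
    'num_round', 'Threshold_y', 'bagging_fraction', 'bagging_freq',
    'feature_fraction', 'lambda_l1', 'lambda_l2', 'learning_rate',
    'min_child_samples', 'num_leaves', 'randomTrainState', 'select_col_list',
}

def missing_keys(param_list, required=REQUIRED_KEYS):
    """Map team index -> sorted list of required keys absent from its dict."""
    report = {}
    for team_num, params in enumerate(param_list):
        missing = sorted(required - set(params))
        if missing:
            report[team_num] = missing
    return report

# Toy example: the second dict lacks 'num_round'
complete = dict.fromkeys(REQUIRED_KEYS, 0)
incomplete = {k: v for k, v in complete.items() if k != 'num_round'}
example_report = missing_keys([complete, incomplete])
print(example_report)
```

For the MultinomialNB path (`flg=1`), only `Threshold_y`, `randomTrainState` and `select_col_list` are read, so a smaller `required` set would be passed there.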

Training and prediction

all_sub_pred =  pred_all_teams(header_num,prednum,param_list,flg=0)
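
`pred_all_teams` fills one shared array: each team's probabilities are thresholded with that team's own `Threshold_y`, and the resulting 0/1 labels are written at that team's test-row positions. The sketch below reproduces this scatter step with made-up numbers. The two toy teams, their probabilities and their index arrays are assumptions for illustration; in the notebook the positions come from `select_by_team`.

```python
import numpy as np

def apply_threshold(prob, threshold):
    """Label a row positive (1) when its probability exceeds the threshold."""
    out = np.zeros(prob.shape)
    out[prob > threshold] = 1
    return out

n_test = 6
all_sub_pred = np.zeros((n_test,))

# Hypothetical per-team data: (test row positions, probabilities, threshold)
teams = [
    (np.array([0, 2, 5]), np.array([0.60, 0.40, 0.55]), 0.5457880758611685),
    (np.array([1, 3, 4]), np.array([0.58, 0.70, 0.10]), 0.5889101186701063),
]

for test_index, prob, threshold in teams:
    # Each team writes only to its own rows of the shared array
    all_sub_pred[test_index] = apply_threshold(prob, threshold)

print(all_sub_pred.astype(int))
```

Note that the same probability can be labeled differently depending on the team: 0.58 is above the first threshold but below the second.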

Creating the prediction file

# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------

# Write out the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')
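
The file written here is an intermediate result: one 0/1 flag per test row for the class being predicted, indexed by `id`. A minimal sketch of the resulting layout, using made-up predictions and an in-memory buffer in place of the file under `../submission`:

```python
import io
import numpy as np
import pandas as pd

all_sub_pred = np.array([0.0, 1.0, 0.0])   # made-up 0/1 predictions

submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'

buffer = io.StringIO()          # stands in for the CSV file on disk
submit_df.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The header row is `id,y`, followed by one row per test sample.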

Predicting doubles (class 5)

setting

header_num = 'F010'
prednum = 5
versionnum = 2
sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
param_list = []

Team 00

##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 ={
'num_round': 51,
'Threshold_y': 0.5988650028111873,
'bagging_fraction': 0.6341989032831783,
'bagging_freq': 3,
'feature_fraction': 0.9292090322755205,
'lambda_l1': 1.773192620560166e-06,
'lambda_l2': 1.1227618933934534e-06,
'learning_rate': 0.009081946102827547,
'min_child_samples': 16,
'num_leaves': 124,
'randomTrainState': 7454,
'select_col_list': ['col-3', 'col-5', 'col-6', 'col-7', 'col-8', 'col-10', 'col-11', 'col-13', 'col-15', 'col-18', 'col-19', 'col-24', 'col-25', 'col-32', 'col-33', 'col-34', 'col-39', 'col-40', 'col-42', 'col-43', 'col-45'],
}
param_list.append(all_param00)

Team 01

##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 ={
'num_round': 28,
'Threshold_y': 0.6071184768452184,
'bagging_fraction': 0.7923625382057141,
'bagging_freq': 6,
'feature_fraction': 0.8688674184611311,
'lambda_l1': 0.00029783203896991247,
'lambda_l2': 0.001102626205095361,
'learning_rate': 0.00998140615023234,
'min_child_samples': 21,
'num_leaves': 131,
'randomTrainState': 2263,
'select_col_list': ['col-4', 'col-8', 'col-11', 'col-12', 'col-13', 'col-14', 'col-15', 'col-17', 'col-18', 'col-19', 'col-21', 'col-22', 'col-25', 'col-28', 'col-29', 'col-31', 'col-34', 'col-35', 'col-36', 'col-40', 'col-41', 'col-42', 'col-43', 'col-46'],
}
param_list.append(all_param01)

Team 02

##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 ={
'num_round': 79,
'Threshold_y': 0.8006167049819931,
'bagging_fraction': 0.7076316063654214,
'bagging_freq': 2,
'feature_fraction': 0.8284870339000393,
'lambda_l1': 0.2968327757147524,
'lambda_l2': 4.858948316137708e-07,
'learning_rate': 0.014441578144409953,
'min_child_samples': 7,
'num_leaves': 96,
'randomTrainState': 1652,
'select_col_list': ['col-1', 'col-2', 'col-4', 'col-6', 'col-8', 'col-9', 'col-11', 'col-12', 'col-13', 'col-15', 'col-19', 'col-20', 'col-21', 'col-22', 'col-23', 'col-24', 'col-25', 'col-26', 'col-27', 'col-30', 'col-33', 'col-36', 'col-38', 'col-40', 'col-41', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param02)

Team 03

##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 ={
'num_round': 20,
'Threshold_y': 0.5822015809423722,
'bagging_fraction': 0.8485251397259249,
'bagging_freq': 7,
'feature_fraction': 0.6987577577331818,
'lambda_l1': 4.570181490781628e-08,
'lambda_l2': 3.103035915457753e-05,
'learning_rate': 0.01161090478870929,
'min_child_samples': 13,
'num_leaves': 239,
'randomTrainState': 6063,
'select_col_list': ['col-2', 'col-3', 'col-4', 'col-6', 'col-14', 'col-17', 'col-19', 'col-20', 'col-22', 'col-23', 'col-24', 'col-25', 'col-31', 'col-33', 'col-37', 'col-40', 'col-41', 'col-42', 'col-45'],
}
param_list.append(all_param03)

Team 04

##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 ={
'num_round': 13,
'Threshold_y': 0.5840054766305995,
'bagging_fraction': 0.9633826802938512,
'bagging_freq': 3,
'feature_fraction': 0.5037259239028026,
'lambda_l1': 0.1715113716988102,
'lambda_l2': 0.00037841395658834735,
'learning_rate': 0.02581013349605285,
'min_child_samples': 16,
'num_leaves': 188,
'randomTrainState': 4892,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-10', 'col-12', 'col-16', 'col-19', 'col-20', 'col-21', 'col-26', 'col-29', 'col-31', 'col-32', 'col-34', 'col-35', 'col-37', 'col-41', 'col-45', 'col-46'],
}
param_list.append(all_param04)

Team 05

##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 ={
'num_round': 27,
'Threshold_y': 0.5676781314173396,
'bagging_fraction': 0.5465120737723101,
'bagging_freq': 3,
'feature_fraction': 0.5239167666136786,
'lambda_l1': 0.08853036547863448,
'lambda_l2': 0.0009080573846847937,
'learning_rate': 0.011709255698557373,
'min_child_samples': 30,
'num_leaves': 75,
'randomTrainState': 2689,
'select_col_list': ['col-0', 'col-1', 'col-3', 'col-4', 'col-5', 'col-6', 'col-8', 'col-10', 'col-15', 'col-17', 'col-19', 'col-20', 'col-24', 'col-25', 'col-26', 'col-27', 'col-28', 'col-29', 'col-30', 'col-32', 'col-35', 'col-36', 'col-37', 'col-41', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param05)

Team 06

##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 ={
'num_round': 39,
'Threshold_y': 0.6321024724116466,
'bagging_fraction': 0.8291697371770824,
'bagging_freq': 4,
'feature_fraction': 0.714953727496321,
'lambda_l1': 1.0168924231577864e-08,
'lambda_l2': 3.9010744136314086,
'learning_rate': 0.014855555702265079,
'min_child_samples': 11,
'num_leaves': 189,
'randomTrainState': 8210,
'select_col_list': ['col-1', 'col-3', 'col-6', 'col-8', 'col-9', 'col-10', 'col-11', 'col-12', 'col-13', 'col-14', 'col-17', 'col-18', 'col-19', 'col-20', 'col-21', 'col-22', 'col-23', 'col-25', 'col-27', 'col-37', 'col-39', 'col-40', 'col-41', 'col-44', 'col-45'],
}
param_list.append(all_param06)

Team 07

##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 ={
'num_round': 27,
'Threshold_y': 0.6411434456489781,
'bagging_fraction': 0.8322805023962896,
'bagging_freq': 4,
'feature_fraction': 0.7242422251249383,
'lambda_l1': 1.7872305643364622e-06,
'lambda_l2': 0.012347332855752268,
'learning_rate': 0.012792533160345695,
'min_child_samples': 5,
'num_leaves': 172,
'randomTrainState': 3373,
'select_col_list': ['col-1', 'col-5', 'col-11', 'col-13', 'col-19', 'col-20', 'col-22', 'col-23', 'col-24', 'col-27', 'col-29', 'col-33', 'col-34', 'col-40', 'col-41', 'col-43'],
}
param_list.append(all_param07)

Team 08

##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 ={
'num_round': 9,
'Threshold_y': 0.5242129210057586,
'bagging_fraction': 0.862789007067898,
'bagging_freq': 4,
'feature_fraction': 0.41212081743841283,
'lambda_l1': 0.00015207051582398425,
'lambda_l2': 1.916552147625866e-06,
'learning_rate': 0.008489738430627213,
'min_child_samples': 16,
'num_leaves': 132,
'randomTrainState': 5037,
'select_col_list': ['col-1', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-13', 'col-20', 'col-21', 'col-26', 'col-27', 'col-28', 'col-29', 'col-31', 'col-32', 'col-33', 'col-37', 'col-40'],
}
param_list.append(all_param08)

Team 09

##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 ={
'num_round': 32,
'Threshold_y': 0.6494177400635616,
'bagging_fraction': 0.769538933451904,
'bagging_freq': 5,
'feature_fraction': 0.6071303730337405,
'lambda_l1': 0.09787230103279305,
'lambda_l2': 0.002781501819288105,
'learning_rate': 0.01844675577717961,
'min_child_samples': 17,
'num_leaves': 245,
'randomTrainState': 5510,
'select_col_list': ['col-0', 'col-1', 'col-2', 'col-3', 'col-4', 'col-6', 'col-7', 'col-10', 'col-12', 'col-14', 'col-17', 'col-18', 'col-21', 'col-22', 'col-23', 'col-25', 'col-26', 'col-29', 'col-31', 'col-32', 'col-33', 'col-34', 'col-36', 'col-41', 'col-42', 'col-44', 'col-45'],
}
param_list.append(all_param09)

Team 10

##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 ={
'num_round': 36,
'Threshold_y': 0.5909216962211854,
'bagging_fraction': 0.6488973849546392,
'bagging_freq': 5,
'feature_fraction': 0.42504440898708207,
'lambda_l1': 0.017937047426524792,
'lambda_l2': 2.671891261744761e-08,
'learning_rate': 0.009163978056130472,
'min_child_samples': 7,
'num_leaves': 52,
'randomTrainState': 507,
'select_col_list': ['col-0', 'col-1', 'col-3', 'col-4', 'col-5', 'col-7', 'col-8', 'col-9', 'col-11', 'col-12', 'col-15', 'col-18', 'col-21', 'col-23', 'col-27', 'col-28', 'col-30', 'col-31', 'col-32', 'col-33', 'col-36', 'col-38', 'col-39', 'col-40', 'col-42', 'col-45'],
}
param_list.append(all_param10)

Team 11

##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 ={
'num_round': 32,
'Threshold_y': 0.5910917436123575,
'bagging_fraction': 0.6937932067153532,
'bagging_freq': 4,
'feature_fraction': 0.49245322978952205,
'lambda_l1': 0.041304191123197054,
'lambda_l2': 0.02073817982403152,
'learning_rate': 0.020908997050675497,
'min_child_samples': 31,
'num_leaves': 84,
'randomTrainState': 3191,
'select_col_list': ['col-1', 'col-2', 'col-3', 'col-4', 'col-6', 'col-7', 'col-8', 'col-10', 'col-11', 'col-15', 'col-18', 'col-20', 'col-21', 'col-22', 'col-23', 'col-26', 'col-29', 'col-31', 'col-33', 'col-34', 'col-37', 'col-38', 'col-40', 'col-43', 'col-44', 'col-45'],
}
param_list.append(all_param11)

Training and prediction

all_sub_pred =  pred_all_teams(header_num,prednum,param_list,flg=0)

Creating the prediction file

# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------

# Write out the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')

Predicting triples (class 6)

setting

header_num = 'F010'
prednum = 6
versionnum = 2
sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
param_list = []

Team 00

##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 4032,
'select_col_list': ['col-1', 'col-2', 'col-4', 'col-6', 'col-7', 'col-9', 'col-12', 'col-16', 'col-20', 'col-22', 'col-23', 'col-25', 'col-27', 'col-30', 'col-31', 'col-32', 'col-33', 'col-34', 'col-38', 'col-39', 'col-41', 'col-43'],
}
param_list.append(all_param00)

Team 01

##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 6180,
'select_col_list': ['col-4', 'col-6', 'col-7', 'col-9', 'col-10', 'col-16', 'col-19', 'col-20', 'col-22', 'col-23', 'col-24', 'col-26', 'col-28', 'col-30', 'col-33', 'col-34', 'col-38', 'col-43'],
}
param_list.append(all_param01)

Team 02

##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 8530,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-6', 'col-10', 'col-13', 'col-19', 'col-22', 'col-24', 'col-25', 'col-30', 'col-32', 'col-34', 'col-36', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param02)

Team 03

##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 8340,
'select_col_list': ['col-0', 'col-4', 'col-7', 'col-10', 'col-11', 'col-12', 'col-19', 'col-22', 'col-23', 'col-24', 'col-26', 'col-30', 'col-31', 'col-33', 'col-37', 'col-38', 'col-41', 'col-43', 'col-44', 'col-46'],
}
param_list.append(all_param03)

Team 04

##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 1838,
'select_col_list': ['col-0', 'col-1', 'col-6', 'col-8', 'col-11', 'col-13', 'col-15', 'col-19', 'col-21', 'col-22', 'col-26', 'col-27', 'col-36', 'col-37', 'col-38', 'col-39', 'col-44'],
}
param_list.append(all_param04)

Team 05

##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 7404,
'select_col_list': ['col-0', 'col-4', 'col-6', 'col-11', 'col-18', 'col-19', 'col-22', 'col-24', 'col-27', 'col-28', 'col-29', 'col-30', 'col-31', 'col-33', 'col-35', 'col-37', 'col-38', 'col-40', 'col-41', 'col-45'],
}
param_list.append(all_param05)

Team 06

##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 9847,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-6', 'col-7', 'col-10', 'col-15', 'col-25', 'col-26', 'col-27', 'col-29', 'col-30', 'col-37', 'col-38', 'col-39', 'col-41', 'col-43'],
}
param_list.append(all_param06)

Team 07

##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 5801,
'select_col_list': ['col-1', 'col-4', 'col-6', 'col-8', 'col-10', 'col-14', 'col-19', 'col-22', 'col-23', 'col-25', 'col-28', 'col-33', 'col-34', 'col-39', 'col-43'],
}
param_list.append(all_param07)

Team 08

##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 2587,
'select_col_list': ['col-2', 'col-10', 'col-17', 'col-18', 'col-22', 'col-23', 'col-25', 'col-26', 'col-27', 'col-30', 'col-31', 'col-33', 'col-34', 'col-35', 'col-37', 'col-38', 'col-39', 'col-40', 'col-41', 'col-43', 'col-44'],
}

param_list.append(all_param08)

Team 09

##############################################
# Team 09の予測のパラメータをセット
##############################################
all_param09 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 7251,
'select_col_list': ['col-1', 'col-2', 'col-4', 'col-6', 'col-9', 'col-10', 'col-11', 'col-12', 'col-17', 'col-19', 'col-22', 'col-24', 'col-26', 'col-27', 'col-31', 'col-33', 'col-35', 'col-36', 'col-44'],
}
param_list.append(all_param09)

Team 10

##############################################
# Team 10の予測のパラメータをセット
##############################################
all_param10 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 1427,
'select_col_list': ['col-0', 'col-6', 'col-7', 'col-10', 'col-13', 'col-17', 'col-19', 'col-20', 'col-22', 'col-25', 'col-27', 'col-28', 'col-29', 'col-31', 'col-33', 'col-37', 'col-39', 'col-41', 'col-43', 'col-44', 'col-46'],
}
param_list.append(all_param10)

Team 11

##############################################
# Team 11の予測のパラメータをセット
##############################################
all_param11 = {
'Threshold_y': 0.9090909090909091,
'randomTrainState': 8532,
'select_col_list': ['col-4', 'col-7', 'col-10', 'col-12', 'col-16', 'col-17', 'col-18', 'col-22', 'col-24', 'col-29', 'col-30', 'col-33', 'col-34', 'col-35', 'col-37', 'col-40', 'col-41', 'col-43', 'col-46'],
}
param_list.append(all_param11)

Training and prediction

all_sub_pred = pred_all_teams(header_num,prednum,param_list,flg=1)
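
With `flg=1` the triples are predicted by `objective2`, which averages the hard labels of 11 MultinomialNB models trained on different undersamples. As far as the code shows, the shared `Threshold_y` of 0.9090909090909091 is 10/11, and because the comparison is a strict `>`, a test row is marked positive only when all 11 models agree. A minimal sketch of that voting rule on made-up labels (`vote` is a hypothetical helper):

```python
import numpy as np

NFOLDS = 11
THRESHOLD = 0.9090909090909091   # equals 10/11 in floating point

def vote(fold_labels, threshold=THRESHOLD):
    """Average hard 0/1 labels across folds and apply the threshold."""
    mean_vote = np.asarray(fold_labels).mean(axis=1)
    return (mean_vote > threshold).astype(int)

# Rows are test samples, columns are the 11 per-fold hard predictions
unanimous = np.ones(NFOLDS, dtype=int)
one_dissent = np.ones(NFOLDS, dtype=int)
one_dissent[0] = 0

result = vote(np.vstack([unanimous, one_dissent]))
print(result)
```

A single dissenting model is enough to reject the row, which makes this a deliberately conservative rule for a very rare class.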

Creating the prediction file

# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------

# Write out the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')

Predicting home runs (class 7)

setting

header_num = 'F010'
prednum = 7
versionnum = 2
sub_num = f'ind0{prednum}-t0{versionnum-1}-all'
param_list = []

Team 00

##############################################
# Set the prediction parameters for Team 00
##############################################
all_param00 ={
'num_round': 17,
'Threshold_y': 0.5301191795552576,
'bagging_fraction': 0.6947579218693694,
'bagging_freq': 4,
'feature_fraction': 0.6427395221155658,
'lambda_l1': 3.868220205359664e-06,
'lambda_l2': 0.00010945101702546976,
'learning_rate': 0.012303614918122196,
'min_child_samples': 29,
'num_leaves': 68,
'randomTrainState': 7925,
'select_col_list': ['col-2', 'col-4', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-13', 'col-14', 'col-17', 'col-19', 'col-20', 'col-22', 'col-24', 'col-25', 'col-28', 'col-29', 'col-31', 'col-32', 'col-35', 'col-36', 'col-40', 'col-41', 'col-43', 'col-45'],
}
param_list.append(all_param00)

Team 01

##############################################
# Set the prediction parameters for Team 01
##############################################
all_param01 ={
'num_round': 12,
'Threshold_y': 0.5441403422091486,
'bagging_fraction': 0.9012232602552761,
'bagging_freq': 4,
'feature_fraction': 0.7253945401233162,
'lambda_l1': 0.2914740452257848,
'lambda_l2': 0.04572992822765882,
'learning_rate': 0.016231192303684548,
'min_child_samples': 23,
'num_leaves': 91,
'randomTrainState': 1380,
'select_col_list': ['col-0', 'col-1', 'col-4', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-14', 'col-16', 'col-18', 'col-19', 'col-21', 'col-22', 'col-25', 'col-27', 'col-31', 'col-32', 'col-37', 'col-40', 'col-43', 'col-44', 'col-45'],
}
param_list.append(all_param01)

Team 02

##############################################
# Set the prediction parameters for Team 02
##############################################
all_param02 ={
'num_round': 18,
'Threshold_y': 0.5141469693930233,
'bagging_fraction': 0.4454919386587845,
'bagging_freq': 4,
'feature_fraction': 0.4910223795324583,
'lambda_l1': 8.25489775949208e-07,
'lambda_l2': 0.00028137109089829035,
'learning_rate': 0.009886683851392216,
'min_child_samples': 33,
'num_leaves': 164,
'randomTrainState': 4258,
'select_col_list': ['col-0', 'col-5', 'col-6', 'col-7', 'col-8', 'col-10', 'col-11', 'col-12', 'col-14', 'col-15', 'col-16', 'col-17', 'col-18', 'col-20', 'col-22', 'col-25', 'col-27', 'col-29', 'col-30', 'col-35', 'col-36', 'col-37', 'col-41', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param02)

Team 03

##############################################
# Set the prediction parameters for Team 03
##############################################
all_param03 ={
'num_round': 30,
'Threshold_y': 0.6157333939466578,
'bagging_fraction': 0.7416403427407998,
'bagging_freq': 4,
'feature_fraction': 0.6390231509876305,
'lambda_l1': 0.0013841308254424842,
'lambda_l2': 2.947622365708902e-08,
'learning_rate': 0.017965732967915805,
'min_child_samples': 10,
'num_leaves': 58,
'randomTrainState': 8544,
'select_col_list': ['col-1', 'col-2', 'col-5', 'col-9', 'col-11', 'col-17', 'col-19', 'col-23', 'col-25', 'col-27', 'col-28', 'col-30', 'col-33', 'col-34', 'col-38', 'col-40', 'col-41', 'col-42', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param03)

Team 04

##############################################
# Set the prediction parameters for Team 04
##############################################
all_param04 ={
'num_round': 10,
'Threshold_y': 0.5382883421585195,
'bagging_fraction': 0.9260827614635292,
'bagging_freq': 1,
'feature_fraction': 0.623208621531255,
'lambda_l1': 1.1685440624114588e-05,
'lambda_l2': 2.0705413009667016e-06,
'learning_rate': 0.01337137234883777,
'min_child_samples': 17,
'num_leaves': 247,
'randomTrainState': 1034,
'select_col_list': ['col-1', 'col-2', 'col-7', 'col-9', 'col-13', 'col-17', 'col-22', 'col-23', 'col-24', 'col-26', 'col-27', 'col-30', 'col-32', 'col-33', 'col-35', 'col-37', 'col-40', 'col-44'],
}
param_list.append(all_param04)

Team 05

##############################################
# Set the prediction parameters for Team 05
##############################################
all_param05 ={
'num_round': 9,
'Threshold_y': 0.5099648119079944,
'bagging_fraction': 0.5492567465444324,
'bagging_freq': 1,
'feature_fraction': 0.4897531927776741,
'lambda_l1': 2.4809120148651396e-05,
'lambda_l2': 1.5095440320754698,
'learning_rate': 0.00512450905231661,
'min_child_samples': 13,
'num_leaves': 160,
'randomTrainState': 6233,
'select_col_list': ['col-0', 'col-4', 'col-8', 'col-9', 'col-10', 'col-11', 'col-12', 'col-13', 'col-14', 'col-17', 'col-18', 'col-19', 'col-22', 'col-23', 'col-25', 'col-26', 'col-32', 'col-33', 'col-34', 'col-35', 'col-38', 'col-39', 'col-40', 'col-42', 'col-44', 'col-45'],
}
param_list.append(all_param05)

Team 06

##############################################
# Set the prediction parameters for Team 06
##############################################
all_param06 ={
'num_round': 4,
'Threshold_y': 0.5106891139428876,
'bagging_fraction': 0.8856916241331956,
'bagging_freq': 6,
'feature_fraction': 0.6045574049863165,
'lambda_l1': 4.77724838181608e-08,
'lambda_l2': 0.001905462262038446,
'learning_rate': 0.005373278707637609,
'min_child_samples': 7,
'num_leaves': 113,
'randomTrainState': 6210,
'select_col_list': ['col-0', 'col-3', 'col-4', 'col-6', 'col-7', 'col-10', 'col-11', 'col-12', 'col-14', 'col-15', 'col-16', 'col-17', 'col-18', 'col-20', 'col-21', 'col-22', 'col-24', 'col-25', 'col-26', 'col-28', 'col-29', 'col-34', 'col-38', 'col-40', 'col-41', 'col-43', 'col-45'],
}
param_list.append(all_param06)

Team 07

##############################################
# Set the prediction parameters for Team 07
##############################################
all_param07 ={
'num_round': 4,
'Threshold_y': 0.5322936094849703,
'bagging_fraction': 0.5475129500981026,
'bagging_freq': 5,
'feature_fraction': 0.4681456989889187,
'lambda_l1': 1.0545528690242021e-08,
'lambda_l2': 2.7055951582708258e-05,
'learning_rate': 0.02718788620631782,
'min_child_samples': 13,
'num_leaves': 55,
'randomTrainState': 6971,
'select_col_list': ['col-0', 'col-5', 'col-10', 'col-11', 'col-13', 'col-14', 'col-16', 'col-18', 'col-19', 'col-20', 'col-21', 'col-23', 'col-24', 'col-27', 'col-28', 'col-29', 'col-30', 'col-32', 'col-36', 'col-37', 'col-38', 'col-39', 'col-40', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param07)

Team 08

##############################################
# Set the prediction parameters for Team 08
##############################################
all_param08 ={
'num_round': 30,
'Threshold_y': 0.551452774858463,
'bagging_fraction': 0.7458101524869044,
'bagging_freq': 6,
'feature_fraction': 0.9613253477230401,
'lambda_l1': 7.788897620269775e-05,
'lambda_l2': 0.00011503845220559952,
'learning_rate': 0.012429959408447424,
'min_child_samples': 38,
'num_leaves': 42,
'randomTrainState': 6235,
'select_col_list': ['col-0', 'col-1', 'col-5', 'col-6', 'col-7', 'col-8', 'col-9', 'col-11', 'col-13', 'col-14', 'col-17', 'col-19', 'col-21', 'col-23', 'col-28', 'col-29', 'col-30', 'col-32', 'col-34', 'col-36', 'col-38', 'col-40', 'col-45', 'col-46'],
}
param_list.append(all_param08)

Team 09

##############################################
# Set the prediction parameters for Team 09
##############################################
all_param09 ={
'num_round': 44,
'Threshold_y': 0.6830165308159903,
'bagging_fraction': 0.6369598156229328,
'bagging_freq': 2,
'feature_fraction': 0.628698124193263,
'lambda_l1': 0.34808902669160463,
'lambda_l2': 0.00454043954405315,
'learning_rate': 0.015746213466134217,
'min_child_samples': 5,
'num_leaves': 196,
'randomTrainState': 6937,
'select_col_list': ['col-0', 'col-4', 'col-7', 'col-9', 'col-10', 'col-11', 'col-13', 'col-16', 'col-21', 'col-25', 'col-28', 'col-29', 'col-31', 'col-32', 'col-34', 'col-39', 'col-40', 'col-42', 'col-43', 'col-45', 'col-46'],
}
param_list.append(all_param09)

Team 10

##############################################
# Set the prediction parameters for Team 10
##############################################
all_param10 ={
'num_round': 8,
'Threshold_y': 0.5120004095438825,
'bagging_fraction': 0.8744266038117634,
'bagging_freq': 5,
'feature_fraction': 0.7002886281573345,
'lambda_l1': 0.0003268581788463471,
'lambda_l2': 1.1108540310880616e-07,
'learning_rate': 0.005442442065203486,
'min_child_samples': 23,
'num_leaves': 76,
'randomTrainState': 3586,
'select_col_list': ['col-1', 'col-3', 'col-5', 'col-6', 'col-10', 'col-11', 'col-13', 'col-14', 'col-15', 'col-16', 'col-17', 'col-19', 'col-21', 'col-22', 'col-26', 'col-27', 'col-28', 'col-29', 'col-31', 'col-32', 'col-33', 'col-35', 'col-39', 'col-40', 'col-43', 'col-44', 'col-45', 'col-46'],
}
param_list.append(all_param10)

Team 11

##############################################
# Set the prediction parameters for Team 11
##############################################
all_param11 ={
'num_round': 32,
'Threshold_y': 0.6387455194897039,
'bagging_fraction': 0.9556231800736257,
'bagging_freq': 4,
'feature_fraction': 0.8217034788203843,
'lambda_l1': 2.5066872014083758e-05,
'lambda_l2': 0.40979135808078626,
'learning_rate': 0.014743719382639122,
'min_child_samples': 10,
'num_leaves': 200,
'randomTrainState': 7136,
'select_col_list': ['col-2', 'col-3', 'col-4', 'col-5', 'col-10', 'col-11', 'col-12', 'col-13', 'col-19', 'col-23', 'col-24', 'col-32', 'col-33', 'col-36', 'col-38', 'col-39', 'col-40', 'col-42', 'col-45', 'col-46'],
}
param_list.append(all_param11)
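
Each `all_paramNN` dictionary mixes LightGBM hyperparameters with keys that look like settings for the surrounding pipeline (`num_round`, `Threshold_y`, `randomTrainState`, `select_col_list`). How `pred_all_teams` actually consumes them is defined earlier in the notebook. The following is only a minimal sketch of one way to separate the two groups; the helper name `split_team_params` and the key grouping are assumptions based on the key names, not the notebook's own code.

```python
# Hypothetical helper: separate LightGBM hyperparameters from pipeline-level settings.
# The grouping in PIPELINE_KEYS is an assumption inferred from the key names.
PIPELINE_KEYS = ('num_round', 'Threshold_y', 'randomTrainState', 'select_col_list')

def split_team_params(all_param):
    """Return (lgb_params, pipeline_settings) without modifying the input dict."""
    pipeline_settings = {k: all_param[k] for k in PIPELINE_KEYS if k in all_param}
    lgb_params = {k: v for k, v in all_param.items() if k not in PIPELINE_KEYS}
    return lgb_params, pipeline_settings

# Shortened example in the same shape as all_param01 (values truncated for readability)
example_param = {
    'num_round': 12,
    'Threshold_y': 0.544,
    'bagging_fraction': 0.901,
    'bagging_freq': 4,
    'learning_rate': 0.0162,
    'num_leaves': 91,
    'randomTrainState': 1380,
    'select_col_list': ['col-0', 'col-1', 'col-4'],
}

lgb_params, pipeline_settings = split_team_params(example_param)
print(sorted(lgb_params))
print(pipeline_settings['num_round'], pipeline_settings['select_col_list'])
```

Keeping the two groups apart makes it easier to see which values are passed to the model and which control feature selection, sampling, and thresholding around it.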

Training and prediction

all_sub_pred =  pred_all_teams(header_num,prednum,param_list,flg=0)
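
The `Threshold_y` value in each parameter set suggests that a predicted probability is turned into a 0/1 label with a per-team cutoff. Assuming that is the case (the real logic lives in `pred_all_teams`, defined earlier in the notebook), the conversion could look roughly like this. The function `apply_threshold`, the choice of `>=` rather than `>`, and the sample probabilities are all hypothetical.

```python
import numpy as np

def apply_threshold(proba, threshold):
    """Label a row 1 when its predicted probability reaches the threshold (assumed rule)."""
    return (np.asarray(proba) >= threshold).astype(int)

sample_proba = [0.20, 0.55, 0.61, 0.70]  # made-up probabilities
labels = apply_threshold(sample_proba, 0.6157333939466578)  # Threshold_y of Team 03
print(labels)
```

With a cutoff above 0.5, as in every parameter set here, fewer rows are labeled positive than with the default 0.5 cutoff.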

Creating the prediction file

# ------------------------------------------------------------------------------
# Create the prediction file
# ------------------------------------------------------------------------------

# Write out the test predictions
submit_df = pd.DataFrame({'y': all_sub_pred.astype(int)})
submit_df.index.name = 'id'
submit_df.to_csv(f'../submission/{sub_num}_submission.csv')

Blending the outputs

# Load the base prediction
submit_df = pd.read_csv('../submission/submission.csv',index_col='id')
submit_df.head()

	y
id
0	1
1	1
2	3
3	1
4	0

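# Function that adds one per-class prediction to submit_df as a new column
def add_submission(prednum,filename,submit_df):
  # Each per-class file holds a binary prediction (1 = predicted to be class prednum)
  sub066_df =  pd.read_csv(filename,index_col='id')
  # print(sub066_df.sum()*prednum)
  # Replace the binary flag 1 with the class number itself
  sub066_df.loc[sub066_df['y']==1,'y'] = prednum
  # print(sub066_df.sum())
  submit_df[f'ypred{prednum}'] = sub066_df['y']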

# For classes 4 to 7, add the per-class predictions
add_submission(4,'../submission/ind04-t00-all_submission.csv',submit_df)
add_submission(5,'../submission/ind05-t01-all_submission.csv',submit_df)
add_submission(6,'../submission/ind06-t01-all_submission.csv',submit_df)
add_submission(7,'../submission/ind07-t01-all_submission.csv',submit_df)
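
To see what `add_submission` does for a single file, here is a small in-memory sketch. The CSV contents are made up and read from strings instead of the `../submission/` files, so the data and the names `base_csv`, `hit_csv`, and `sub_df` are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up stand-ins for the base submission and one binary per-class submission
base_csv = "id,y\n0,1\n1,1\n2,3\n3,0\n"
hit_csv = "id,y\n0,0\n1,1\n2,0\n3,1\n"

submit_df = pd.read_csv(io.StringIO(base_csv), index_col='id')
sub_df = pd.read_csv(io.StringIO(hit_csv), index_col='id')

prednum = 4
# Rows flagged 1 by the binary model are relabeled with the class number
sub_df.loc[sub_df['y'] == 1, 'y'] = prednum
# The relabeled column is attached next to the base prediction, aligned on id
submit_df[f'ypred{prednum}'] = sub_df['y']
print(submit_df)
```

After this step the base column `y` is unchanged; the new column `ypred4` holds 4 where the binary model fired and 0 elsewhere.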

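# Function that overlays the per-class predictions onto the base prediction in the given order
# (classes later in the list overwrite those applied earlier)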
def overwrite_submission(paramlist,submit_df):
  calc_df = submit_df.copy()
  overwrite_name = 'y'
  for i in paramlist:
    next_overwrite_name = f'{overwrite_name}-{i}'
    calc_df[next_overwrite_name] = calc_df[f'ypred{i}'].where(calc_df[f'ypred{i}']==i,calc_df[overwrite_name])
    overwrite_name = next_overwrite_name
  calc_df['y'] = calc_df[overwrite_name]
  return calc_df

# Blending
submit_df = overwrite_submission([ 4, 5, 7, 6],submit_df)
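
Because each pass of `overwrite_submission` writes on top of the previous result, the order of the list decides which class wins when several per-class models fire on the same row. The toy example below, with made-up data, repeats the function so it runs on its own and compares two orders.

```python
import pandas as pd

def overwrite_submission(paramlist, submit_df):
    # Same logic as the function above, repeated so this sketch is self-contained
    calc_df = submit_df.copy()
    overwrite_name = 'y'
    for i in paramlist:
        next_overwrite_name = f'{overwrite_name}-{i}'
        # Take class i where its model fired, otherwise keep the running result
        calc_df[next_overwrite_name] = calc_df[f'ypred{i}'].where(
            calc_df[f'ypred{i}'] == i, calc_df[overwrite_name])
        overwrite_name = next_overwrite_name
    calc_df['y'] = calc_df[overwrite_name]
    return calc_df

# Made-up data: row 0 is claimed by classes 4 and 7, row 3 by classes 4, 5 and 6
toy_df = pd.DataFrame({
    'y':      [1, 1, 3, 0],
    'ypred4': [4, 0, 0, 4],
    'ypred5': [0, 0, 0, 5],
    'ypred6': [0, 0, 0, 6],
    'ypred7': [7, 0, 0, 0],
})
toy_df.index.name = 'id'

blended = overwrite_submission([4, 5, 7, 6], toy_df)
reversed_order = overwrite_submission([6, 7, 5, 4], toy_df)
print(blended['y'].tolist())
print(reversed_order['y'].tolist())
```

With the order `[4, 5, 7, 6]` used in this notebook, class 6 has the highest priority, followed by 7, 5, and 4; rows that no per-class model claims keep the base prediction.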

# Output the blended result
submit_brend = pd.DataFrame({'y': submit_df['y']})
submit_brend.index.name = 'id'
submit_brend.to_csv(f'../submission/{Notebookname}_submission.csv')
