LightGBM Baseline (CV: 0.78974 / LB: 0.89032) by Oregin

Oregin

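Predicting accommodation prices for a minpaku (private lodging) service

This is sample code for predicting accommodation prices for a minpaku (private lodging) service. Feel free to use it as a reference.

Note: it can be run on Google Colab.

The scores were CV = 0.78974 and LB = 0.89032. The CV score is the RMSE on the log1p-transformed target, measured on a single 10% hold-out split.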

Directory layout

./notebook : directory for this notebook
./result : directory for the output files
./data : directory holding test_data.csv, train_data.csv, and submission.csv

# Change the current directory to the one that contains the notebook, result, and data directories
%cd /xxxx/xxxx

Install a feature-engineering tool (used for target encoding)

!pip install git+https://github.com/pfnet-research/xfeat.git

Import the libraries

# Check the environment
import pandas as pd
import numpy as np
import matplotlib
!python3 --version
print(pd.__version__)
print(np.__version__)
print(matplotlib.__version__)

import random
import warnings
import matplotlib.pylab as plt
import scipy.stats as stats
import lightgbm as lgb
from sklearn.model_selection import GroupKFold, KFold, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error
warnings.filterwarnings('ignore')
# LabelEncoder here is xfeat's (not scikit-learn's)
from xfeat import SelectCategorical, LabelEncoder, Pipeline, ConcatCombination, SelectNumerical, \
    ArithmeticCombinations, TargetEncoder, aggregation, GBDTFeatureSelector, GBDTFeatureExplorer

Python 3.7.12
1.3.5
1.21.5
3.2.2

Loading and transforming the data

# Load the data
train_df = pd.read_csv("./data/train_data.csv")
test_df = pd.read_csv("./data/test_data.csv")
submit_df = pd.read_csv("./data/submission.csv")
print(train_df.shape)
print(test_df.shape)

(9990, 13)
(4996, 12)

# Inspect the data
train_df.head()

	id	name	host_id	neighbourhood	latitude	longitude	room_type	minimum_nights	number_of_reviews	last_review	reviews_per_month	availability_365	y
0	1	KiyosumiShirakawa 3min\|★SkyTree★\|WIFI\|Max4\|Tre...	242899459	Koto Ku	35.68185	139.80310	Entire home/apt	1	55	2020-04-25	2.21	173	12008
1	2	Downtown Tokyo Iriya next to Ueno	308879948	Taito Ku	35.72063	139.78536	Entire home/apt	6	72	2020-03-25	2.11	9	6667
2	3	Japan Style,Private,Affordable,4min to Sta.	300877823	Katsushika Ku	35.74723	139.82349	Entire home/apt	1	18	2020-03-23	3.46	288	9923
3	4	4 min to Shinjuku Sta. by train / 2 ppl / Wi-fi	236935461	Shibuya Ku	35.68456	139.68077	Entire home/apt	1	2	2020-04-02	1.76	87	8109
4	5	LICENSED SHINJUKU HOUSE: Heart of the action!	243408889	Shinjuku Ku	35.69840	139.70467	Entire home/apt	1	86	2020-01-30	2.00	156	100390

# Check the column names
train_df.columns

Index(['id', 'name', 'host_id', 'neighbourhood', 'latitude', 'longitude',
       'room_type', 'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'availability_365', 'y'],
      dtype='object')

# Count the distinct values in each column and collect the columns with fewer than 1000 (excluding the target)
collist = []
for colname in train_df.columns:
  lencol = len(train_df[colname].unique())
  print(colname,lencol)
  if lencol < 1000 and colname != 'y':
    collist.append(colname)

print(collist)

id 9990
name 9114
host_id 2325
neighbourhood 23
latitude 6239
longitude 6867
room_type 4
minimum_nights 30
number_of_reviews 261
last_review 547
reviews_per_month 595
availability_365 366
y 7520
['neighbourhood', 'room_type', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'availability_365']
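
The loop above counts distinct values per column and keeps the low-cardinality ones. As a rough sketch, the same check can be written with pandas' `nunique`. The toy frame and the names `toy_df`, `threshold`, and `low_card_cols` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame (illustrative values only, not the competition data)
toy_df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", None],
    "minimum_nights": [1, 6, 1, 2],
    "y": [12008, 6667, 9923, 8109],
})

# dropna=False counts a missing value as its own category,
# which matches len(Series.unique()) used in the loop above
cardinality = toy_df.nunique(dropna=False)

# Keep columns with fewer than `threshold` distinct values, excluding the target
threshold = 1000
low_card_cols = [c for c in toy_df.columns if cardinality[c] < threshold and c != "y"]

print(cardinality.to_dict())
print(low_card_cols)
```

Note that with the default `dropna=True`, `nunique` would ignore missing values, so the counts could be one lower than those of the loop for columns that contain NaN.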

# Columns to encode
train_columns = ['neighbourhood', 'room_type', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'availability_365']
# Target column
target = 'y'

X = train_df[train_columns]
y = train_df[target]

# Build the test features
test_X = test_df[train_columns]

train_target = pd.concat([X,y],axis=1)

# Target encoding
fold = KFold(n_splits=5, shuffle=False)
train = X.copy()
test = test_X.copy()
for col in train_columns:
  encoder = TargetEncoder(
    input_cols=[col], 
    target_col=target,
    fold=fold,
    output_suffix="_re"
    )

  encoded_df = encoder.fit_transform(train_target)
  train = pd.concat([train,encoded_df[f'{col}_re']],axis=1)
  encoded_df = encoder.transform(test)
  test = pd.concat([test,encoded_df[f'{col}_re']],axis=1)

train.drop(train_columns,axis=1,inplace=True)
test.drop(train_columns,axis=1,inplace=True)

# Confirm there are no missing values
train.isnull().sum(),test.isnull().sum()

(neighbourhood_re        0
 room_type_re            0
 minimum_nights_re       0
 number_of_reviews_re    0
 last_review_re          0
 reviews_per_month_re    0
 availability_365_re     0
 dtype: int64, neighbourhood_re        0
 room_type_re            0
 minimum_nights_re       0
 number_of_reviews_re    0
 last_review_re          0
 reviews_per_month_re    0
 availability_365_re     0
 dtype: int64)
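
To make the idea behind fold-based target encoding concrete, here is a minimal sketch written with pandas and scikit-learn only. It is not xfeat's implementation: the helper `oof_target_encode`, the global-mean fallback for unseen categories, and the way test rows are encoded from the full training data are all assumptions chosen for illustration, and xfeat's `TargetEncoder` may differ in these details.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train_df, test_df, col, target_col, n_splits=5):
    """Hypothetical helper: mean-target encoding of `col`, out-of-fold on train."""
    kf = KFold(n_splits=n_splits, shuffle=False)
    global_mean = train_df[target_col].mean()
    train_enc = pd.Series(np.nan, index=train_df.index, dtype=float)

    for fit_idx, enc_idx in kf.split(train_df):
        # Category means are computed on the other folds only,
        # so a row never sees its own target value
        means = train_df.iloc[fit_idx].groupby(col)[target_col].mean()
        train_enc.iloc[enc_idx] = train_df[col].iloc[enc_idx].map(means).to_numpy()

    # Assumption: categories unseen in the fitting folds get the global mean
    train_enc = train_enc.fillna(global_mean)

    # Assumption: test rows use category means from the full training data
    full_means = train_df.groupby(col)[target_col].mean()
    test_enc = test_df[col].map(full_means).fillna(global_mean)
    return train_enc, test_enc


toy_train = pd.DataFrame({
    "room_type": ["a", "a", "b", "b", "a", "b"],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
toy_test = pd.DataFrame({"room_type": ["a", "b", "c"]})

train_enc, test_enc = oof_target_encode(toy_train, toy_test, "room_type", "y", n_splits=3)
print(train_enc.tolist())
print(test_enc.tolist())
```

Computing each row's encoding from the other folds is what keeps the target from leaking directly into its own feature, which is the reason a `fold` object is passed to the encoder above.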

Adding features

# Function that adds features in place: products, ratios, powers, square roots, and logs of the target-encoded columns
def make_feat(df):
  df["minimum_nights_re*neighbourhood_re"] = df["minimum_nights_re"]*df["neighbourhood_re"]
  df["availability_365_re*neighbourhood_re"] = df["availability_365_re"]*df["neighbourhood_re"]
  df["neighbourhood_re**2*reviews_per_month_re"] = df["neighbourhood_re"]**2*df["reviews_per_month_re"]
  df["availability_365_re**2*neighbourhood_re**2"] = df["availability_365_re"]**2*df["neighbourhood_re"]**2
  df["availability_365_re/room_type_re"] = df["availability_365_re"]/df["room_type_re"]
  df["neighbourhood_re**2/room_type_re"] = df["neighbourhood_re"]**2/df["room_type_re"]
  df["availability_365_re**2/room_type_re"] = df["availability_365_re"]**2/df["room_type_re"]
  df["neighbourhood_re*room_type_re**2"] = df["neighbourhood_re"]*df["room_type_re"]**2
  df["availability_365_re**2*minimum_nights_re**2"] = df["availability_365_re"]**2*df["minimum_nights_re"]**2
  df["availability_365_re**2/reviews_per_month_re"] = df["availability_365_re"]**2/df["reviews_per_month_re"]
  df["availability_365_re**2*number_of_reviews_re**2"] = df["availability_365_re"]**2*df["number_of_reviews_re"]**2
  df["sqrt(availability_365_re)*minimum_nights_re"] = np.sqrt(df["availability_365_re"])*df["minimum_nights_re"]
  df["log(last_review_re)*log(reviews_per_month_re)"] = np.log(df["last_review_re"])*np.log(df["reviews_per_month_re"])

# Add the features
make_feat(train)
make_feat(test)
train.shape,test.shape

((9990, 20), (4996, 20))
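
Several of the added features divide by an encoded column or take its logarithm, so an encoded value of zero (or a negative one) would produce `inf` or `NaN`. The sketch below shows one way to check for that after feature creation. The helper `count_non_finite` and the toy columns `a_re` and `b_re` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd


def count_non_finite(df):
    """Hypothetical helper: count NaN and +/-inf entries per numeric column."""
    numeric = df.select_dtypes(include=[np.number])
    mask = ~np.isfinite(numeric.to_numpy(dtype=float))
    return pd.Series(mask.sum(axis=0), index=numeric.columns)


# Toy encoded columns; the zero in a_re is deliberate
toy = pd.DataFrame({"a_re": [1.0, 0.0, 4.0], "b_re": [2.0, 2.0, 2.0]})
with np.errstate(divide="ignore"):
    toy["b_re/a_re"] = toy["b_re"] / toy["a_re"]   # 2 / 0 -> inf
    toy["log(a_re)"] = np.log(toy["a_re"])         # log(0) -> -inf

bad = count_non_finite(toy)
print(bad[bad > 0])
```

Running the same check on `train` and `test` after `make_feat` would show whether any ratio or log feature needs clipping or a small offset. Whether that is needed depends on the data.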

Preparing for training

# Log-transform the target
target = np.log1p(y)

all_param = {
"colsample_bytree": 0.32714832683589756,
"learning_rate": 0.006840905564844016,
"max_bin": 166,
"min_child_samples": 5,
"n_estimators": 573,
"num_leaves": 114,
"subsample": 0.956238192643021,
"subsample_freq": 3,
}

# Function that trains a LightGBM model and predicts on the test data
def objective(all_param):
    x_train = train.copy()
    y_train = target.copy()
    x_test = test.copy()

    # --------------------------------------
    # Parameter set
    # --------------------------------------
    lgb_params = {
      'objective': 'regression',
      'importance_type': 'gain',
      'metric': 'rmse',
      'seed': 42,
      'n_jobs': -1,
      'verbose': 1,

      'n_estimators': all_param['n_estimators'],
      'learning_rate': all_param['learning_rate'],
      'boosting_type': 'gbdt',
      'subsample': all_param['subsample'],
      'subsample_freq': all_param['subsample_freq'],
      'colsample_bytree': all_param['colsample_bytree'],
      'num_leaves': all_param['num_leaves'],
      'min_child_samples': all_param['min_child_samples'],
      'max_bin': all_param['max_bin'],
    }

    # --------------------------------------
    # Training
    # --------------------------------------

    # Single hold-out split: 90% for training, 10% for validation and early stopping
    x_tr_fold, x_vl_fold, y_tr_fold, y_vl_fold = train_test_split(x_train, y_train, test_size=0.1, random_state=42)
    
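    model = lgb.LGBMRegressor(**lgb_params)
    # Note: `verbose` and `early_stopping_rounds` are fit() arguments in LightGBM 3.x;
    # LightGBM 4.0 removed them in favor of callbacks such as lgb.early_stopping(100)
    model.fit(
            x_tr_fold, y_tr_fold,
            eval_set=(x_vl_fold, y_vl_fold),
            eval_metric='rmse',
            verbose=False,
            early_stopping_rounds=100,
    )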

    y_oof = model.predict(x_vl_fold)

    score = np.sqrt(mean_squared_error(y_vl_fold,y_oof))

    
    print(
        'oof score:',
        score
    )
    # --------------------------------------
    # Prediction
    # --------------------------------------
    pred_data = model.predict(x_test)


    return score,pred_data
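
The function above scores the model on a single 10% hold-out split. For comparison, the following is a rough sketch of K-fold cross-validation that yields an out-of-fold prediction for every training row. To keep it runnable without LightGBM it swaps in scikit-learn's `GradientBoostingRegressor` and uses synthetic data, so the model, the data, and names such as `X_toy`, `y_toy`, and `oof_pred` are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (stand-in for the encoded features and log target)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof_pred = np.zeros(len(X_toy))
fold_scores = []

for fold_id, (tr_idx, vl_idx) in enumerate(kf.split(X_toy)):
    # Stand-in model; lgb.LGBMRegressor(**lgb_params) would take this place
    model = GradientBoostingRegressor(n_estimators=50, random_state=42)
    model.fit(X_toy[tr_idx], y_toy[tr_idx])

    # Each row is predicted by the model that did not train on it
    oof_pred[vl_idx] = model.predict(X_toy[vl_idx])
    fold_rmse = np.sqrt(mean_squared_error(y_toy[vl_idx], oof_pred[vl_idx]))
    fold_scores.append(fold_rmse)
    print(f"fold {fold_id}: rmse={fold_rmse:.4f}")

cv_rmse = np.sqrt(mean_squared_error(y_toy, oof_pred))
print(f"overall OOF rmse={cv_rmse:.4f}")
```

A score averaged over several folds tends to vary less than a single hold-out score, at the cost of training one model per fold.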

Running the training

score,pred_data = objective(all_param)

oof score: 0.7897490783498501

# Convert the log-scale predictions back to the original price scale.
pred_data = np.expm1(pred_data)
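
Because the model is trained on `np.log1p(y)`, the RMSE measured in log space equals the RMSLE of the back-transformed predictions on the original price scale, and `np.expm1` exactly undoes `np.log1p`. The short check below illustrates both points. The predicted values are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

y_true = np.array([12008.0, 6667.0, 9923.0, 8109.0])  # prices from the head() output above
y_pred = np.array([11000.0, 7000.0, 9000.0, 9500.0])  # made-up predictions

# RMSE computed on log1p-transformed values
rmse_on_log = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

# RMSLE computed directly on the original scale
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# expm1 is the inverse of log1p
roundtrip = np.expm1(np.log1p(y_true))

print(rmse_on_log, rmsle)
print(roundtrip)
```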

Creating the submission file

submit_df = pd.read_csv("./data/submission.csv")
submit_df['y'] = pd.Series(pred_data.reshape(-1))
submit_df.to_csv('./result/submission.csv', index=False)

Done

Attached data

LGBM-baseline-notebook.ipynb
