Price periodicity

Vegetable shipment volumes and prices are seasonal (periodic), so seasonality is likely to be an important factor in this competition. This note examines it.

from google.colab import files # Upload a zip archive containing the required files
files.upload()

! unzip ps_yasai.zip

Saving ps_yasai.zip to ps_yasai.zip
Archive:  ps_yasai.zip
  inflating: ps_yasai/submission.csv  
  inflating: ps_yasai/train_data.csv  
  inflating: ps_yasai/weather.csv

import warnings
warnings.simplefilter('ignore')

from IPython.display import display, clear_output

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
!pip install -q japanize-matplotlib
import japanize_matplotlib
import seaborn as sns
sns.set(font="IPAexGothic")

from sklearn.metrics import mean_squared_error

clear_output()

train = pd.read_csv('ps_yasai/train_data.csv').rename(columns={'id':'date'}).set_index('date', drop=True)
# Use id (the date) as the index
sub_df = pd.read_csv('ps_yasai/submission.csv')

The train data has no missing values.

train.isnull().sum().max()

The train data has 340 columns (one per item-region pair) covering 42 items.

items = list(dict.fromkeys([c.split('_')[0] for c in train.columns]))
train.shape[1], len(items)

(340, 42)
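
The split on `_` above relies on the column names following an `item_region` pattern. Since 340 is not a multiple of 42, the items cannot all cover the same number of regions. Below is a minimal sketch of counting regions per item, using a few sample column names rather than the full train columns, and assuming item names themselves contain no underscore:

```python
import pandas as pd

# Small sample of column names in the "item_region" format (not the full data)
cols = ['えのきだけ_中国', 'えのきだけ_九州', 'かぶ_北海道', 'かぶ_関東', 'かぶ_近畿']

# Split each name once on '_' into (item, region)
pairs = pd.DataFrame([c.split('_', 1) for c in cols], columns=['item', 'region'])

# Number of regions recorded for each item, in order of first appearance
regions_per_item = pairs.groupby('item', sort=False)['region'].count()
print(regions_per_item)
```

On the notebook's data, replacing `cols` with `train.columns` should give the per-item region counts.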

Time series by item

def plot_prices(item_list):
    # One subplot per item, one line per region
    nrow = len(item_list)
    # squeeze=False keeps ax an array, so a single-item list also works
    fig, ax = plt.subplots(nrow, sharex=True, figsize=(8, nrow*3), squeeze=False)
    ax = ax.ravel()
    for i, item in enumerate(item_list):
        area_item = [c for c in train.columns if c.split('_')[0]==item]
        train[area_item].plot(ax=ax[i])
        # Legend to the right of the axes; y-axis from 0 to 110% of the item's maximum price
        ax[i].legend(loc='center left', bbox_to_anchor=(1., .5))
        ax[i].set_ylim((0, 1.1*train[area_item].max().max()))

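Prices do not move independently for each item-region pair; for a given item, the regions appear to move together to some degree.
Some items also appear to have periodic prices.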

plot_prices(items[:4])
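
The linkage between regions is read off the plots here, but it can also be put into a number. One possible sketch is the average pairwise correlation between the regional series of an item. The data below is synthetic and the column names are hypothetical; `mean_pairwise_corr` is a helper made up for this illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 47  # number of months, hypothetical
common = rng.normal(0, 1, n)  # movement shared by all regions of one item

# Three regional series = shared movement + small regional noise
linked = pd.DataFrame({f'item_region{k}': common + rng.normal(0, 0.3, n) for k in range(3)})
# Three unrelated series for comparison
unlinked = pd.DataFrame(rng.normal(0, 1, (n, 3)),
                        columns=[f'other_region{k}' for k in range(3)])

def mean_pairwise_corr(df):
    """Average of the off-diagonal entries of the correlation matrix."""
    corr = df.corr().to_numpy()
    off_diag = corr[~np.eye(len(corr), dtype=bool)]
    return off_diag.mean()

print(mean_pairwise_corr(linked), mean_pairwise_corr(unlinked))
```

On the notebook's data the same helper could be applied to `train[area_item]` for each item.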

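That said, the strength of the linkage and periodicity appears to vary by item.

Cucumber (きゅうり) and lotus root (れんこん) are examples where both look strong; other leafy vegetables (その他の菜類) and green onion (ねぎ) are examples where both look weak.
(Note) There are quite a few items, so only a subset is shown.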

item_list = ['きゅうり','れんこん','その他の菜類', 'ねぎ']
plot_prices(item_list)
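
One way to turn the "strength" of periodicity into a number is the autocorrelation at a lag of 12 months. The sketch below uses synthetic series (not the competition data) and assumes one row per month, as in the train data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(48)  # 48 hypothetical months

# Synthetic log-price series: one with a yearly cycle, one pure noise
seasonal = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.2, t.size))
noisy = pd.Series(rng.normal(0, 1, t.size))

# Correlation of each series with itself shifted by 12 months
ac_seasonal = seasonal.autocorr(lag=12)
ac_noisy = noisy.autocorr(lag=12)
print(f"seasonal: {ac_seasonal:.2f}, noisy: {ac_noisy:.2f}")
```

For a column `c` of the notebook's data, something like `np.log(train[c]).autocorr(lag=12)` would be the corresponding call. With fewer than four full years of data the estimate is rough, so it is best read as a ranking aid rather than a precise measure.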

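Prediction accuracy using periodicity alone

We examine the case where the prediction is the geometric mean of the prices for each calendar month.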
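
The geometric mean is the exponential of the arithmetic mean of the logs, which is why the cells below average log prices. It is also the constant prediction that minimizes the squared error in log space. A small sketch with made-up prices; `mean_sq_log_error` is a helper defined only for this example:

```python
import numpy as np

prices = np.array([100.0, 200.0, 400.0])  # made-up prices, not competition data

geo_mean = np.exp(np.log(prices).mean())  # exp of the mean log price
arith_mean = prices.mean()

def mean_sq_log_error(pred):
    """Mean squared error in log space for a constant prediction."""
    return np.mean((np.log(prices) - np.log(pred)) ** 2)

print(geo_mean, arith_mean)
print(mean_sq_log_error(geo_mean) < mean_sq_log_error(arith_mean))
```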

# The evaluation metric is RMSLE, so log-transform the raw data
df0 = np.log(train)
date = pd.to_datetime(train.index)
df0['year'] = date.year
df0['month'] = date.month

display(df0.head())

	えのきだけ_中国	えのきだけ_九州	えのきだけ_北海道	えのきだけ_北陸	えのきだけ_四国	えのきだけ_東北	えのきだけ_東海	えのきだけ_近畿	えのきだけ_関東	かぶ_北海道	...	生しいたけ_九州	生しいたけ_北海道	生しいたけ_北陸	生しいたけ_四国	生しいたけ_東北	生しいたけ_東海	生しいたけ_近畿	生しいたけ_関東	year	month
date
2016-01-01	5.749393	5.707110	5.717028	5.613128	5.587249	5.361292	5.627621	5.652489	5.631212	4.820282	...	6.954639	6.602588	7.115582	6.928538	6.895683	7.020191	7.014814	7.019297	2016	1
2016-02-01	5.723585	5.755742	5.733341	5.662960	5.680173	5.627621	5.655992	5.686975	5.673323	5.056246	...	6.862758	6.620073	7.123673	6.944087	6.871091	7.037906	7.027315	7.017506	2016	2
2016-03-01	5.501258	5.379897	5.758902	5.389072	5.424950	5.099866	5.347108	5.361292	5.318120	5.220356	...	6.605298	6.590301	7.091742	6.786717	6.846943	6.896694	6.842683	6.950815	2016	3
2016-04-01	5.424950	5.204007	5.717028	5.273000	5.308268	5.030438	5.323010	5.313206	5.241747	5.241747	...	6.618739	6.598509	6.921658	6.717805	6.787845	6.828712	6.782192	6.889591	2016	4
2016-05-01	5.446737	5.308268	5.723585	5.327876	5.370638	5.135798	5.398163	5.393628	5.327876	5.010635	...	6.710523	6.565265	6.961296	6.793466	6.765039	6.848005	6.852243	6.897705	2016	5

5 rows × 342 columns

Prediction accuracy on the train data

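# Leave one out: predict each month from the same calendar month in the other years
# (float dtype so the log means are stored as-is even if the prices are integers)
loo = pd.DataFrame(np.zeros_like(train, dtype=float), index=train.index, columns=train.columns)

for i in loo.index:
    year = df0.at[i, 'year']
    month = df0.at[i, 'month']
    # e.g. the prediction for "えのきだけ_中国" in January 2016 is the geometric mean of
    # "えのきだけ_中国" over the Januaries of all years except 2016 (kept in log space here)
    for c in loo.columns:
        loo.at[i, c] = df0.loc[(df0['year']!=year) & (df0['month']==month), c].mean()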

# Compute the RMSLE for each "item_region" column and average over the columns
tot_rmsle = 0

for c in loo.columns:
    tot_rmsle += np.sqrt(mean_squared_error(df0[c], loo[c])) / train.shape[1]

print(f"RMSLE train data: {tot_rmsle:.5f}")

RMSLE train data: 0.22453
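
The leave-one-out loop above can also be written without Python loops. The sketch below runs on synthetic log prices with hypothetical column names and assumes one row per month, so that "same month, other years" is the month group minus the row itself. It checks the vectorized version against an explicit loop with the same logic as the cell above:

```python
import numpy as np
import pandas as pd

# Synthetic monthly log prices (not competition data)
idx = pd.date_range('2016-01-01', '2019-11-01', freq='MS')
rng = np.random.default_rng(0)
logp = pd.DataFrame(rng.normal(5.5, 0.3, size=(len(idx), 2)),
                    index=idx, columns=['item1_regionA', 'item2_regionB'])

# Leave-one-out mean per calendar month: (group sum - own value) / (group size - 1)
g = logp.groupby(logp.index.month)
loo_fast = (g.transform('sum') - logp) / (g.transform('count') - 1)

# Reference: explicit loop over rows
loo_loop = pd.DataFrame(index=logp.index, columns=logp.columns, dtype=float)
for ts in logp.index:
    mask = (logp.index.month == ts.month) & (logp.index.year != ts.year)
    loo_loop.loc[ts] = logp[mask].mean()

# RMSE on log prices per column, then averaged over columns
rmsle = np.sqrt(((logp - loo_fast) ** 2).mean()).mean()
print(np.allclose(loo_fast, loo_loop), round(rmsle, 5))
```

Note that both this sketch and the notebook use plain `log`. If the competition metric uses `log(1 + y)`, as RMSLE commonly does, `np.log1p` would match it exactly, though the difference should be small for prices in the hundreds.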

Predictions for December 2019

# Use the December geometric mean of each "item_region" column as the prediction
monthly_mean_dic = np.exp(df0[df0['month']==12][train.columns].mean()).to_dict()
sub_df['y'] = sub_df['id'].map(monthly_mean_dic)

sub_df.head()

	id	y
0	えのきだけ_中国	331.571958
1	えのきだけ_九州	313.940243
2	えのきだけ_北海道	288.437246
3	えのきだけ_北陸	302.597708
4	えのきだけ_四国	315.591973

# LB: 0.27182
sub_df.to_csv('sub5.csv', index=False)
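
Because `map` silently returns NaN for any id missing from the dictionary, a quick check before saving may be worthwhile. A minimal sketch of the December step on made-up data (the column names and the `sub` frame are hypothetical stand-ins for `train.columns` and `sub_df`):

```python
import numpy as np
import pandas as pd

# Made-up log-price history for two series, including two Decembers
idx = pd.to_datetime(['2017-11-01', '2017-12-01', '2018-11-01', '2018-12-01'])
logp = pd.DataFrame({'item1_regionA': np.log([100., 200., 110., 800.]),
                     'item2_regionB': np.log([50., 60., 55., 60.])}, index=idx)

# Geometric mean of each column over the December rows only
dec_geo_mean = np.exp(logp[logp.index.month == 12].mean())

# Stand-in for the submission frame: one row per "item_region" id
sub = pd.DataFrame({'id': ['item1_regionA', 'item2_regionB']})
sub['y'] = sub['id'].map(dec_geo_mean.to_dict())

# Every id should have received a prediction
assert sub['y'].notna().all()
print(sub)
```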

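The monthly geometric mean of prices alone gives reasonable accuracy (RMSLE 0.22453 on the train data with leave-one-out, 0.27182 on the LB).
This suggests that periodicity (the month-of-year effect) carries meaningful signal in this competition.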