Ridgelinezとの共同開催第二弾!花粉予報にチャレンジ!
kotrying
# Library import os import pandas as pd import numpy as np import matplotlib.pyplot as plt !pip install japanize_matplotlib import japanize_matplotlib %matplotlib inline import seaborn as sns from tqdm.auto import tqdm import warnings warnings.simplefilter('ignore') # mount from google.colab import drive if not os.path.isdir('/content/drive'): drive.mount('/content/drive')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Collecting japanize_matplotlib Downloading japanize-matplotlib-1.1.3.tar.gz (4.1 MB) K |████████████████████████████████| 4.1 MB 5.1 MB/s ent already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from japanize_matplotlib) (3.2.2) Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.7/dist-packages (from matplotlib->japanize_matplotlib) (1.21.6) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->japanize_matplotlib) (3.0.9) Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->japanize_matplotlib) (2.8.2) Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->japanize_matplotlib) (0.11.0) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->japanize_matplotlib) (1.4.4) Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib->japanize_matplotlib) (4.1.1) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.1->matplotlib->japanize_matplotlib) (1.15.0) Building wheels for collected packages: japanize-matplotlib Building wheel for japanize-matplotlib (setup.py) ... ?25latplotlib: filename=japanize_matplotlib-1.1.3-py3-none-any.whl size=4120275 sha256=9b1d2daa9fdeae4c56750925777085fa2172df4f0066449e60ea2c71cffcbccd Stored in directory: /root/.cache/pip/wheels/83/97/6b/e9e0cde099cc40f972b8dd23367308f7705ae06cd6d4714658 Successfully built japanize-matplotlib Installing collected packages: japanize-matplotlib Successfully installed japanize-matplotlib-1.1.3 Mounted at /content/drive
構成
MyDrive ├<pollen_counts> ├<notebook> │ └eda.ipynb ├<input> │ ├train.csv │ ├submission.csv │ └test.csv └<output>
# Config DRIVE_PATH = "/content/drive/MyDrive/ML/PROBSPACE/pollen_counts" INPUT = os.path.join(DRIVE_PATH, "input") OUTPUT = os.path.join(DRIVE_PATH, "output") TRAIN_FILE = os.path.join(INPUT, "train.csv") TEST_FILE = os.path.join(INPUT, "test.csv") SUB_FILE = os.path.join(INPUT, "submission.csv") seed =42 # plot style pd.set_option('display.max_columns', 200) plt.rcParams['axes.facecolor'] = 'EEFFFE'
# Data train = pd.read_csv(TRAIN_FILE) test = pd.read_csv(TEST_FILE) sub = pd.read_csv(SUB_FILE)
display(train.head(3)) display(test.head(3)) display(sub.head(3))
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12168 entries, 0 to 12167 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datetime 12168 non-null int64 1 precipitation_utsunomiya 12168 non-null float64 2 precipitation_chiba 12168 non-null float64 3 precipitation_tokyo 12168 non-null object 4 temperature_utsunomiya 12168 non-null float64 5 temperature_chiba 12168 non-null object 6 temperature_tokyo 12168 non-null object 7 winddirection_utsunomiya 12168 non-null int64 8 winddirection_chiba 12168 non-null object 9 winddirection_tokyo 12168 non-null object 10 windspeed_utsunomiya 12168 non-null float64 11 windspeed_chiba 12168 non-null object 12 windspeed_tokyo 12168 non-null object 13 pollen_utsunomiya 12168 non-null float64 14 pollen_chiba 12168 non-null float64 15 pollen_tokyo 12168 non-null float64 dtypes: float64(7), int64(2), object(7) memory usage: 1.5+ MB
test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 336 entries, 0 to 335 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datetime 336 non-null int64 1 precipitation_utsunomiya 336 non-null float64 2 precipitation_chiba 336 non-null float64 3 precipitation_tokyo 336 non-null float64 4 temperature_utsunomiya 336 non-null float64 5 temperature_chiba 336 non-null float64 6 temperature_tokyo 336 non-null float64 7 winddirection_utsunomiya 336 non-null int64 8 winddirection_chiba 336 non-null int64 9 winddirection_tokyo 336 non-null int64 10 windspeed_utsunomiya 336 non-null float64 11 windspeed_chiba 336 non-null float64 12 windspeed_tokyo 336 non-null float64 13 pollen_utsunomiya 336 non-null int64 14 pollen_chiba 336 non-null int64 15 pollen_tokyo 336 non-null int64 dtypes: float64(9), int64(7) memory usage: 42.1 KB
欠損はなしtrainに複数のobjectを含む
train.select_dtypes(object)
12168 rows × 7 columns
train.precipitation_tokyo.value_counts()
0.0 9952 0 1328 0.5 340 1.0 159 1.5 92 2.0 59 2.5 44 3.0 26 4.5 24 3.5 23 4.0 22 1 18 5.5 13 5.0 11 6.0 7 2 5 6.5 4 4 4 8.0 4 7.0 4 10.0 4 7.5 3 11.0 3 8.5 2 欠測 2 14.5 2 10.5 2 14.0 2 21.5 2 9.0 2 12.5 1 17.5 1 23.5 1 3 1 18.0 1 Name: precipitation_tokyo, dtype: int64
for col in train.select_dtypes(object).columns: print(train[train[col].isin(['欠測'])].datetime)
11795 2020031612 11796 2020031613 Name: datetime, dtype: int64 11146 2020021811 11147 2020021812 11148 2020021813 Name: datetime, dtype: int64 11794 2020031611 11795 2020031612 11796 2020031613 Name: datetime, dtype: int64 11146 2020021811 11147 2020021812 11148 2020021813 Name: datetime, dtype: int64 11794 2020031611 11795 2020031612 11796 2020031613 Name: datetime, dtype: int64 11146 2020021811 11147 2020021812 11148 2020021813 Name: datetime, dtype: int64 11794 2020031611 11795 2020031612 11796 2020031613 Name: datetime, dtype: int64
trainには数値だがobject型のデータが存在 -> 計6個の「欠測」ラベル (2020/02/18 11-13時、2020/03/16 11-13時)precipitation_tokyoを見ると0と0.0のように区別されている場合がある
# object(欠測) -> float import lightgbm as lgb from lightgbm import LGBMRegressor from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer train_df = train.replace('欠測', np.nan) lgb_imp = IterativeImputer( estimator=LGBMRegressor(num_boost_round=100, random_state=seed), max_iter=10, initial_strategy='mean', imputation_order='ascending', verbose=1, random_state=seed) train_df = pd.DataFrame(lgb_imp.fit_transform(train_df), columns=train_df.columns) train_df[['winddirection_chiba', 'winddirection_tokyo']] = train_df[['winddirection_chiba', 'winddirection_tokyo']].round().astype(int) train_df[['precipitation_tokyo', 'temperature_chiba', 'temperature_tokyo', 'windspeed_chiba', 'windspeed_tokyo']] = train_df[['precipitation_tokyo', 'temperature_chiba', 'temperature_tokyo', 'windspeed_chiba', 'windspeed_tokyo']].round(1) train[train.select_dtypes(object).columns] = train_df[train.select_dtypes(object).columns] train
[IterativeImputer] Completing matrix with shape (12168, 16) [IterativeImputer] Change: 8.54139755011368, scaled tolerance: 2020033.124 [IterativeImputer] Early stopping criterion reached.
12168 rows × 16 columns
df = pd.concat([train, test]).reset_index(drop=True) df['time'] = pd.to_datetime(df.datetime.astype(str).str[:-2]) df['year'] = df['time'].dt.year df['month'] = df['time'].dt.month df['day'] = df['time'].dt.day df['hour'] = df.datetime.astype(str).str[-2:].astype(int) df['weekday'] = df['time'].dt.weekday df.head(3)
plot_col = ['pollen_utsunomiya', 'pollen_chiba', 'pollen_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, col in enumerate(plot_col): plt.subplot(1, ncols, i+1) plt.plot(df.time, df[col], alpha=1, color=color[i], label=col) plt.xlabel(col) plt.legend() plt.grid() plt.show()
df[df.month.isin([7,8,9,10,11,12,1])]
target_col = ['pollen_utsunomiya', 'pollen_chiba', 'pollen_tokyo'] for col in target_col: print('='*10+col+'='*10) print(df[col].describe()) print('<75%より多く花粉を含む(月)>') df[df[col] > df[col].quantile(0.75)].month.value_counts().plot(kind='barh', figsize=(5,3)) plt.show() print('<99%より多く花粉を含む(月)>') df[df[col] > df[col].quantile(0.99)].month.value_counts().plot(kind='barh', figsize=(5,3)) plt.show() print()
==========pollen_utsunomiya========== count 12504.000000 mean 84.292306 std 338.311803 min 0.000000 25% 0.000000 50% 16.000000 75% 57.000000 max 12193.000000 Name: pollen_utsunomiya, dtype: float64 <75%より多く花粉を含む(月)>
<99%より多く花粉を含む(月)>
==========pollen_chiba========== count 12504.000000 mean 28.822297 std 99.154223 min 0.000000 25% 0.000000 50% 8.000000 75% 24.000000 max 4141.000000 Name: pollen_chiba, dtype: float64 <75%より多く花粉を含む(月)>
==========pollen_tokyo========== count 12504.000000 mean 25.973289 std 73.793359 min 0.000000 25% 0.000000 50% 4.000000 75% 20.000000 max 2209.000000 Name: pollen_tokyo, dtype: float64 <75%より多く花粉を含む(月)>
各観測拠点でスケールが違う(宇都宮>千葉>東京)7-1月のデータは存在しない3月の花粉量がとても多い
plot_col = ['pollen_utsunomiya', 'pollen_chiba', 'pollen_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, col in enumerate(plot_col): plt.subplot(1, ncols, i+1) plt.hist(df[col], bins=200, alpha=1, color=color[i], label=col) plt.xlabel(col) plt.legend() plt.grid() plt.xlim(-1,500) plt.show()
df[df.month.isin([2,3])].groupby('year').mean()[target_col].style.background_gradient()
テストデータを含む2020年の花粉量は過去3年より少ない
df[df.datetime <= 2020033124].groupby('month').mean()[target_col].style.background_gradient()
df[df.datetime <= 2020033124].groupby('hour').mean()[target_col].style.background_gradient()
宇都宮は2-5時、千葉は8-10時頃、東京は7-8時頃に高い花粉量を観測
df[df.datetime <= 2020033124].groupby('weekday').mean()[target_col].style.background_gradient()
宇都宮は土曜日、千葉は水曜日や土曜日に花粉量が多く、東京は金土曜日の花粉量が少ない
fig, ax = plt.subplots(figsize=(12, 10)) sns.heatmap(df[df.datetime <= 2020033124].drop(['datetime', 'time'], axis=1).corr(), vmax=1, vmin=-1, center=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0ed9ac4390>
降水量
plot_col = ['precipitation_utsunomiya', 'precipitation_chiba', 'precipitation_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, col in enumerate(plot_col): plt.subplot(1, ncols, i+1) for y in df.year.unique(): plt.plot(df[df.year==y].time, df[df.year==y][col], alpha=1, color=color[i]) plt.xlabel(col) plt.grid() plt.show()
plot_col = ['precipitation_utsunomiya', 'precipitation_chiba', 'precipitation_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, (tcol, col) in enumerate(zip(target_col, plot_col)): plt.subplot(1, ncols, i+1) plt.scatter(df[col], df[tcol], alpha=1, color=color[i]) plt.xlabel(col) plt.ylabel(tcol) plt.grid() plt.xlim(-1,11) plt.show()
降水量が多い時は花粉量が少ない
気温
plot_col = ['temperature_utsunomiya', 'temperature_chiba', 'temperature_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, col in enumerate(plot_col): plt.subplot(1, ncols, i+1) for j, y in enumerate(df.year.unique()): plt.plot(df[df.year==y].index, df[df.year==y][col], alpha=0.25*(j+1), color=color[i]) plt.plot(df[df.year==y].index, df[df.year==y][col].rolling(100).mean(), alpha=0.25*(j+1), color='black') plt.xlabel(col) plt.grid() plt.show()
plot_col = ['temperature_utsunomiya', 'temperature_chiba', 'temperature_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, (tcol, col) in enumerate(zip(target_col, plot_col)): plt.subplot(1, ncols, i+1) plt.scatter(df[col], df[tcol], alpha=1, color=color[i]) plt.xlabel(col) plt.ylabel(tcol) plt.grid() plt.show()
風向
# 指定地域の該当日の風向 winddirection = { 0:'静穏', 1:'北北東', 2:'北東', 3:'東北東', 4:'東', 5:'東南東', 6:'南東', 7:'南南東', 8:'南', 9:'南南西', 10:'南西', 11:'西南西', 12:'西', 13:'西北西', 14:'北西', 15:'北北西', 16:'北', } df_wd = df.copy() for col in ['winddirection_utsunomiya', 'winddirection_chiba', 'winddirection_tokyo']: df_wd[col] = df_wd[col].map(winddirection) df_wd.head(3)
plot_col = ['winddirection_utsunomiya', 'winddirection_chiba', 'winddirection_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, (tcol, col) in enumerate(zip(target_col, plot_col)): plt.subplot(1, ncols, i+1) df_wd[col].value_counts().plot(kind='barh') plt.xlabel(tcol) plt.ylabel(col) plt.grid() plt.show()
plot_col = ['winddirection_utsunomiya', 'winddirection_chiba', 'winddirection_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, (tcol, col) in enumerate(zip(target_col, plot_col)): plt.subplot(1, ncols, i+1) df_wd[df_wd.datetime <= 2020033124].groupby(col).mean()[tcol].plot(kind='barh') plt.xlabel(tcol) plt.ylabel(col) plt.grid() plt.show()
宇都宮は北風のとき花粉量が多く、千葉は南風のとき花粉量が少なく、東京は西風のとき花粉量が多い静穏は東京のみ多く存在し、花粉量は多い
風速
plot_col = ['windspeed_utsunomiya', 'windspeed_chiba', 'windspeed_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, col in enumerate(plot_col): plt.subplot(1, ncols, i+1) for j, y in enumerate(df.year.unique()): plt.plot(df[df.year==y].index, df[df.year==y][col], alpha=0.25*(j+1), color=color[i]) plt.plot(df[df.year==y].index, df[df.year==y][col].rolling(100).mean(), alpha=0.25*(j+1), color='black') plt.xlabel(col) plt.grid() plt.show()
plot_col = ['windspeed_utsunomiya', 'windspeed_chiba', 'windspeed_tokyo'] color = ['red','green','blue'] ncols = len(plot_col) plt.subplots(1, ncols, sharey=True, sharex=True, figsize=(30, 5)) for i, (tcol, col) in enumerate(zip(target_col, plot_col)): plt.subplot(1, ncols, i+1) plt.scatter(df[col], df[tcol], alpha=1, color=color[i]) plt.xlabel(col) plt.ylabel(tcol) plt.grid() plt.show()
!pip install folium
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: folium in /usr/local/lib/python3.7/dist-packages (0.12.1.post1) Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from folium) (2.23.0) Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from folium) (1.21.6) Requirement already satisfied: jinja2>=2.9 in /usr/local/lib/python3.7/dist-packages (from folium) (2.11.3) Requirement already satisfied: branca>=0.3.0 in /usr/local/lib/python3.7/dist-packages (from folium) (0.5.0) Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2>=2.9->folium) (2.0.1) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->folium) (2022.6.15) Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->folium) (3.0.4) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->folium) (1.24.3) Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->folium) (2.10)
観測拠点は以下の3つ
import folium cities = pd.DataFrame({ 'train': ['宇都宮(宇都宮市中央生涯学習センター)', '千葉(千葉県環境研究センター)', '府中(東京都多摩小平保健所)'], 'latitude': [36.559444, 35.525312, 35.730062], 'longtude': [139.882689, 140.068465, 139.516689], }) map = folium.Map(width='75%', height='75%',location=[36,140], zoom_start=8) for i, r in cities.iterrows(): folium.Marker(location=[r['latitude'], r['longtude']], popup=r['train']).add_to(map) map
地理情報は提供されていないが、位置的な関係を考慮して風向と風量から、時間差での影響をモデルに組み込む必要があるかもしれない例えばそれぞれの観測拠点で多く見られる北北東の風や南、南西の風の風が吹いた場合、ある程度の時間を置いて別の拠点へと風が到達するかもしれない