Let's Rescue Akaike-kun in HR with Salary Estimation
upura
As a feature engineering technique for categorical features, one hot encoding is introduced in the official tutorial.
In this notebook, the topic is another feature engineering technique for categorical features called label encoding. It is especially useful when you use a Gradient Boosting Decision Tree library such as LightGBM.
You can see the reference here.
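Before applying it to the dataset, the contrast between the two techniques can be sketched on a toy column (hypothetical data, not this notebook's dataset): label encoding maps each category to a single integer, while one hot encoding expands each category into its own 0/1 column.

```python
import pandas as pd
from sklearn import preprocessing

# Toy categorical column for illustration only
df = pd.DataFrame({'area': ['Tokyo', 'Osaka', 'Tokyo', 'Hokkaido']})

# Label encoding: one integer per category
# (LabelEncoder assigns codes in sorted order: Hokkaido=0, Osaka=1, Tokyo=2)
le = preprocessing.LabelEncoder()
print(le.fit_transform(df['area']))  # -> [2 1 2 0]

# One hot encoding: one 0/1 column per category
print(pd.get_dummies(df, columns=['area']))
```

Tree-based models such as LightGBM can split on the integer codes directly, which is why label encoding is often sufficient for them despite the codes having no real ordinal meaning.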
import pandas as pd
import japanize_matplotlib
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))

train = pd.read_csv('../datasets/data/train_data.csv')
test = pd.read_csv('../datasets/data/test_data.csv')
train.head()
test.head()
Let's convert area from prefecture names to integer labels.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
train['area'] = le.fit_transform(train['area'])
test['area'] = le.transform(test['area'])
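One caveat worth noting: `LabelEncoder.transform` raises a `ValueError` for any category it did not see during `fit`. That is safe here because both splits cover all 47 prefectures, but when the test split may contain unseen values, one common workaround is to fit on the concatenation of both columns. A minimal sketch with toy data (the frames and values below are hypothetical, not this notebook's dataset):

```python
import pandas as pd
from sklearn import preprocessing

train_df = pd.DataFrame({'area': ['Tokyo', 'Osaka']})
test_df = pd.DataFrame({'area': ['Tokyo', 'Okinawa']})  # 'Okinawa' absent from train_df

# Fit on the union of both columns so every category gets a code
le = preprocessing.LabelEncoder()
le.fit(pd.concat([train_df['area'], test_df['area']]))

train_df['area'] = le.transform(train_df['area'])
test_df['area'] = le.transform(test_df['area'])
print(test_df['area'].tolist())  # codes exist even for the unseen category
```

Fitting on combined data is convenient for competition-style work; in a production setting you would instead reserve an explicit code for unknown categories.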
area contains 47 distinct prefecture names, so one hot encoding with drop_first=True would produce 46 columns.
train['area'].value_counts().plot.bar(figsize=(20, 5))
pd.get_dummies(train, columns=['area'], drop_first=True).head()
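The column count claim above can be checked on a toy frame (hypothetical data): with drop_first=True, k categories yield k-1 indicator columns, since the dropped first level is implied when all other indicators are zero.

```python
import pandas as pd

# 3 categories -> expect 2 indicator columns with drop_first=True
df = pd.DataFrame({'area': ['Tokyo', 'Osaka', 'Hokkaido']})
dummies = pd.get_dummies(df, columns=['area'], drop_first=True)
print(dummies.shape[1])  # -> 2
```

By the same arithmetic, the 47 prefectures in this dataset become 46 columns, which is the blow-up in width that label encoding avoids.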