upura
One hot encoding is introduced in the official tutorial as a feature engineering technique for categorical features.
This notebook covers another technique for categorical features, called label encoding, which maps each category to an integer.
Label encoding is used especially with Gradient Boosting Decision Tree (GBDT) libraries such as LightGBM.
You can see the reference here.
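To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The `toy` DataFrame and its `color` values are hypothetical and are not part of this notebook's dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (hypothetical, not the notebook's dataset)
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: a single integer column.
# LabelEncoder sorts the classes, so blue=0, green=1, red=2.
toy_le = LabelEncoder()
label_encoded = toy_le.fit_transform(toy["color"])
print(label_encoded.tolist())    # [2, 1, 0, 1]

# One hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy, columns=["color"])
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Label encoding keeps the feature in one column, while one hot encoding adds a column for every category.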
import pandas as pd
import japanize_matplotlib
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
train = pd.read_csv('../datasets/data/train_data.csv')
test = pd.read_csv('../datasets/data/test_data.csv')
train.head()
| | id | position | age | area | sex | partner | num_child | education | service_length | study_time | commute | overtime | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 44 | 愛知県 | 2 | 1 | 2 | 1 | 24 | 2.0 | 1.6 | 9.2 | 428.074887 |
1 | 1 | 2 | 31 | 奈良県 | 1 | 0 | 0 | 0 | 13 | 9.0 | 0.7 | 12.4 | 317.930517 |
2 | 2 | 2 | 36 | 山口県 | 1 | 0 | 0 | 2 | 14 | 4.0 | 0.4 | 16.9 | 357.350316 |
3 | 3 | 0 | 22 | 東京都 | 2 | 0 | 0 | 0 | 4 | 3.0 | 0.4 | 6.1 | 201.310911 |
4 | 4 | 0 | 25 | 鹿児島県 | 2 | 0 | 0 | 1 | 5 | 3.0 | 0.2 | 4.9 | 178.067475 |
test.head()
| | id | position | age | area | sex | partner | num_child | education | service_length | study_time | commute | overtime |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 39 | 鹿児島県 | 2 | 1 | 5 | 1 | 19 | 1.0 | 1.8 | 14.2 |
1 | 1 | 1 | 31 | 宮城県 | 1 | 0 | 0 | 4 | 0 | 0.0 | 0.5 | 18.6 |
2 | 2 | 0 | 20 | 愛知県 | 2 | 1 | 2 | 0 | 2 | 2.0 | 1.2 | 2.3 |
3 | 3 | 0 | 28 | 三重県 | 2 | 0 | 0 | 0 | 10 | 3.0 | 0.3 | 0.0 |
4 | 4 | 1 | 41 | 愛媛県 | 2 | 0 | 0 | 0 | 23 | 3.0 | 0.5 | 10.1 |
Let's convert `area` from prefecture names to integer labels.
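from sklearn import preprocessing
# Fit the encoder on the training data, then reuse the same mapping for the test data
le = preprocessing.LabelEncoder()
train['area'] = le.fit_transform(train['area'])
# transform raises ValueError if test contains an area that was not seen in train
test['area'] = le.transform(test['area'])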
train.head()
| | id | position | age | area | sex | partner | num_child | education | service_length | study_time | commute | overtime | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 44 | 24 | 2 | 1 | 2 | 1 | 24 | 2.0 | 1.6 | 9.2 | 428.074887 |
1 | 1 | 2 | 31 | 10 | 1 | 0 | 0 | 0 | 13 | 9.0 | 0.7 | 12.4 | 317.930517 |
2 | 2 | 2 | 36 | 14 | 1 | 0 | 0 | 2 | 14 | 4.0 | 0.4 | 16.9 | 357.350316 |
3 | 3 | 0 | 22 | 26 | 2 | 0 | 0 | 0 | 4 | 3.0 | 0.4 | 6.1 | 201.310911 |
4 | 4 | 0 | 25 | 46 | 2 | 0 | 0 | 1 | 5 | 3.0 | 0.2 | 4.9 | 178.067475 |
test.head()
| | id | position | age | area | sex | partner | num_child | education | service_length | study_time | commute | overtime |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 39 | 46 | 2 | 1 | 5 | 1 | 19 | 1.0 | 1.8 | 14.2 |
1 | 1 | 1 | 31 | 11 | 1 | 0 | 0 | 4 | 0 | 0.0 | 0.5 | 18.6 |
2 | 2 | 0 | 20 | 24 | 2 | 1 | 2 | 0 | 2 | 2.0 | 1.2 | 2.3 |
3 | 3 | 0 | 28 | 0 | 2 | 0 | 0 | 0 | 10 | 3.0 | 0.3 | 0.0 |
4 | 4 | 1 | 41 | 23 | 2 | 0 | 0 | 0 | 23 | 3.0 | 0.5 | 10.1 |
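A fitted `LabelEncoder` keeps its mapping in `classes_`, and `transform` fails on categories it did not see during `fit`. The sketch below shows both on toy data; `toy_train`, `toy_test`, and `toy_le` are hypothetical names, and the values are not taken from this notebook's files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (hypothetical, not the notebook's dataset)
toy_train = pd.Series(["東京都", "愛知県", "奈良県", "東京都"])
toy_test = pd.Series(["奈良県", "沖縄県"])  # 沖縄県 does not appear in toy_train

toy_le = LabelEncoder()
toy_le.fit(toy_train)

# classes_ holds the sorted categories; the position of each one is its label
mapping = {name: code for code, name in enumerate(toy_le.classes_)}
print(mapping)

# inverse_transform converts labels back to the original names
decoded = toy_le.inverse_transform([0, 1, 2])
print(list(decoded))

# Check for categories that appear only in the test data before transforming
unseen = sorted(set(toy_test) - set(toy_le.classes_))
print(unseen)

try:
    toy_le.transform(toy_test)
    raised = False
except ValueError:
    # LabelEncoder refuses labels that were not present during fit
    raised = True
print(raised)
```

One common workaround, when it suits the task, is to fit the encoder on the combined train and test values so that every category receives a label.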
one hot encoding
`area` contains 47 distinct prefecture names, so one hot encoding it with `drop_first=True` yields 46 new columns (47 if the first category is not dropped).
train['area'].value_counts().plot.bar(figsize=(20, 5))
[Bar chart: number of training rows for each `area` label]
pd.get_dummies(train, columns=['area'], drop_first=True).head()
| | id | position | age | sex | partner | num_child | education | service_length | study_time | commute | overtime | salary | area_1 | area_2 | area_3 | area_4 | area_5 | area_6 | area_7 | area_8 | area_9 | area_10 | area_11 | area_12 | area_13 | area_14 | area_15 | area_16 | area_17 | area_18 | area_19 | area_20 | area_21 | area_22 | area_23 | area_24 | area_25 | area_26 | area_27 | area_28 | area_29 | area_30 | area_31 | area_32 | area_33 | area_34 | area_35 | area_36 | area_37 | area_38 | area_39 | area_40 | area_41 | area_42 | area_43 | area_44 | area_45 | area_46 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 44 | 2 | 1 | 2 | 1 | 24 | 2.0 | 1.6 | 9.2 | 428.074887 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 2 | 31 | 1 | 0 | 0 | 0 | 13 | 9.0 | 0.7 | 12.4 | 317.930517 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 2 | 36 | 1 | 0 | 0 | 2 | 14 | 4.0 | 0.4 | 16.9 | 357.350316 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 3 | 0 | 22 | 2 | 0 | 0 | 0 | 4 | 3.0 | 0.4 | 6.1 | 201.310911 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4 | 0 | 25 | 2 | 0 | 0 | 1 | 5 | 3.0 | 0.2 | 4.9 | 178.067475 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
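The column counts above can be checked with a short sketch. It uses a hypothetical stand-in frame, `toy`, with 47 label values (0 to 46) in place of the real `train` data.

```python
import pandas as pd

# Hypothetical stand-in for the label-encoded data: 47 area labels, 0 to 46
toy = pd.DataFrame({"area": list(range(47)), "age": [30] * 47})

# One indicator column per category
full = pd.get_dummies(toy, columns=["area"])
# drop_first=True removes the indicator for the first category (area_0)
dropped = pd.get_dummies(toy, columns=["area"], drop_first=True)

n_full = sum(col.startswith("area_") for col in full.columns)
n_dropped = sum(col.startswith("area_") for col in dropped.columns)
print(n_full, n_dropped)  # 47 46
```

With label encoding the same information stays in the single `area` column.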