[Salary Prediction] Introduction to Label Encoding

The official tutorial introduces one-hot encoding as a feature engineering technique for categorical features.

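This notebook covers another feature engineering technique for categorical features: label encoding.
It is used especially with gradient boosting decision tree (GBDT) libraries such as LightGBM, because tree-based models split on thresholds rather than treating the integer codes as linear quantities.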

You can see the reference here.
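As a minimal sketch of what label encoding does, the example below encodes a small, made-up column of area names. The DataFrame `toy` and its romanized values are hypothetical and are not taken from the competition data; only scikit-learn's `LabelEncoder` is assumed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not the competition data
toy = pd.DataFrame({"area": ["Tokyo", "Aichi", "Nara", "Tokyo", "Aichi"]})

le = LabelEncoder()
toy["area_label"] = le.fit_transform(toy["area"])

# LabelEncoder sorts the unique values and numbers them from 0,
# so the learned mapping can be read back from le.classes_
mapping = {name: code for code, name in enumerate(le.classes_)}
print(mapping)
print(toy)
```

The categorical column stays a single column of integers, no matter how many distinct categories it has.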

import pandas as pd
import japanize_matplotlib  # enables Japanese text in matplotlib plots
from IPython.display import display, HTML

# Widen the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))


train = pd.read_csv('../datasets/data/train_data.csv')
test = pd.read_csv('../datasets/data/test_data.csv')
train.head()
id position age area sex partner num_child education service_length study_time commute overtime salary
0 0 1 44 愛知県 2 1 2 1 24 2.0 1.6 9.2 428.074887
1 1 2 31 奈良県 1 0 0 0 13 9.0 0.7 12.4 317.930517
2 2 2 36 山口県 1 0 0 2 14 4.0 0.4 16.9 357.350316
3 3 0 22 東京都 2 0 0 0 4 3.0 0.4 6.1 201.310911
4 4 0 25 鹿児島県 2 0 0 1 5 3.0 0.2 4.9 178.067475
test.head()
id position age area sex partner num_child education service_length study_time commute overtime
0 0 3 39 鹿児島県 2 1 5 1 19 1.0 1.8 14.2
1 1 1 31 宮城県 1 0 0 4 0 0.0 0.5 18.6
2 2 0 20 愛知県 2 1 2 0 2 2.0 1.2 2.3
3 3 0 28 三重県 2 0 0 0 10 3.0 0.3 0.0
4 4 1 41 愛媛県 2 0 0 0 23 3.0 0.5 10.1

Let's convert area from prefecture names to integer labels.

from sklearn import preprocessing


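le = preprocessing.LabelEncoder()
# Fit on the training data only, then reuse the same mapping for the test data
train['area'] = le.fit_transform(train['area'])
# transform() raises ValueError if test contains a prefecture not seen in train
test['area'] = le.transform(test['area'])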
train.head()
id position age area sex partner num_child education service_length study_time commute overtime salary
0 0 1 44 24 2 1 2 1 24 2.0 1.6 9.2 428.074887
1 1 2 31 10 1 0 0 0 13 9.0 0.7 12.4 317.930517
2 2 2 36 14 1 0 0 2 14 4.0 0.4 16.9 357.350316
3 3 0 22 26 2 0 0 0 4 3.0 0.4 6.1 201.310911
4 4 0 25 46 2 0 0 1 5 3.0 0.2 4.9 178.067475
test.head()
id position age area sex partner num_child education service_length study_time commute overtime
0 0 3 39 46 2 1 5 1 19 1.0 1.8 14.2
1 1 1 31 11 1 0 0 4 0 0.0 0.5 18.6
2 2 0 20 24 2 1 2 0 2 2.0 1.2 2.3
3 3 0 28 0 2 0 0 0 10 3.0 0.3 0.0
4 4 1 41 23 2 0 0 0 23 3.0 0.5 10.1
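Because the encoder is fitted on the training data only, it can help to check for categories that appear only in the test data before calling `transform`, and to know how to map labels back to names. The sketch below uses hypothetical toy series (`toy_train`, `toy_test`) rather than the real files; whether the real test data has unseen prefectures is not something this example claims.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: "Osaka" appears only in the test series
toy_train = pd.Series(["Tokyo", "Aichi", "Nara"], name="area")
toy_test = pd.Series(["Aichi", "Osaka"], name="area")

le = LabelEncoder().fit(toy_train)

# Categories the encoder never saw; transform() raises ValueError on these
unseen = sorted(set(toy_test) - set(le.classes_))
print(unseen)

# Encode only the rows whose category is known
known = toy_test[toy_test.isin(le.classes_)]
codes = le.transform(known)

# inverse_transform maps integer labels back to the original names
restored = le.inverse_transform(codes)
print(codes, restored)
```

How to treat unseen categories (dropping them, mapping them to a dedicated label, or fitting on train and test combined) is a design choice this sketch leaves open.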

What happens with one-hot encoding

area contains 47 distinct prefecture names, so one-hot encoding produces 47 indicator columns, or 46 when the first category is dropped with drop_first=True as below.

train['area'].value_counts().plot.bar(figsize=(20, 5))
pd.get_dummies(train, columns=['area'], drop_first=True).head()
id position age sex partner num_child education service_length study_time commute overtime salary area_1 area_2 area_3 area_4 area_5 area_6 area_7 area_8 area_9 area_10 area_11 area_12 area_13 area_14 area_15 area_16 area_17 area_18 area_19 area_20 area_21 area_22 area_23 area_24 area_25 area_26 area_27 area_28 area_29 area_30 area_31 area_32 area_33 area_34 area_35 area_36 area_37 area_38 area_39 area_40 area_41 area_42 area_43 area_44 area_45 area_46
0 0 1 44 2 1 2 1 24 2.0 1.6 9.2 428.074887 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 2 31 1 0 0 0 13 9.0 0.7 12.4 317.930517 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 2 36 1 0 0 2 14 4.0 0.4 16.9 357.350316 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 0 22 2 0 0 0 4 3.0 0.4 6.1 201.310911 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 0 25 2 0 0 1 5 3.0 0.2 4.9 178.067475 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
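The column counts can be checked on a small example. The sketch below uses a hypothetical toy DataFrame with three distinct areas and compares `pd.get_dummies` with and without `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy data with 3 distinct areas
toy = pd.DataFrame({
    "age": [44, 31, 36, 22],
    "area": ["Tokyo", "Aichi", "Nara", "Tokyo"],
})

full = pd.get_dummies(toy, columns=["area"])
reduced = pd.get_dummies(toy, columns=["area"], drop_first=True)

n_categories = toy["area"].nunique()
# Subtract 1 for the untouched "age" column
n_full = full.shape[1] - 1
n_reduced = reduced.shape[1] - 1
print(n_categories, n_full, n_reduced)
```

With k categories, one-hot encoding adds k columns (k - 1 with `drop_first=True`), whereas label encoding keeps area as a single column.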
