upura

[Salary Prediction] Introduction of Label Encoding

The official tutorial introduces one-hot encoding as a feature engineering technique for categorical features.

This notebook covers another technique for categorical features: label encoding, which maps each category to an integer.
It is especially useful with gradient boosting decision tree (GBDT) libraries such as LightGBM.

You can see the reference here.
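One way to see why integer labels work with tree-based models: a tree can isolate any single label with threshold splits, so the arbitrary numeric order of the labels matters little. The sketch below is only an illustration under assumptions. It uses scikit-learn's `DecisionTreeRegressor` as a stand-in for the trees inside a GBDT (LightGBM itself is not used), and the data is made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: one categorical feature already label-encoded as 0, 1, 2.
# Only the middle label (1) has a high target, so the numeric order of the
# labels carries no meaning for the target.
X = np.array([[0], [0], [1], [1], [2], [2]])
y = np.array([100.0, 100.0, 300.0, 300.0, 100.0, 100.0])

# A single unrestricted tree stands in for the tree learners inside a GBDT.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Two threshold splits (between 0 and 1, and between 1 and 2) are enough
# to isolate label 1 from its neighbours.
pred = tree.predict(np.array([[0], [1], [2]]))
print(pred)
```

A linear model given the same single column could not fit this pattern, which is why one-hot encoding is the usual choice there.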

import pandas as pd
import japanize_matplotlib  # lets matplotlib render Japanese text
from IPython.display import display, HTML

# Widen the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))


train = pd.read_csv('../datasets/data/train_data.csv')
test = pd.read_csv('../datasets/data/test_data.csv')

train.head()

	id	position	age	area	sex	partner	num_child	education	service_length	study_time	commute	overtime	salary
0	0	1	44	愛知県	2	1	2	1	24	2.0	1.6	9.2	428.074887
1	1	2	31	奈良県	1	0	0	0	13	9.0	0.7	12.4	317.930517
2	2	2	36	山口県	1	0	0	2	14	4.0	0.4	16.9	357.350316
3	3	0	22	東京都	2	0	0	0	4	3.0	0.4	6.1	201.310911
4	4	0	25	鹿児島県	2	0	0	1	5	3.0	0.2	4.9	178.067475

test.head()

	id	position	age	area	sex	partner	num_child	education	service_length	study_time	commute	overtime
0	0	3	39	鹿児島県	2	1	5	1	19	1.0	1.8	14.2
1	1	1	31	宮城県	1	0	0	4	0	0.0	0.5	18.6
2	2	0	20	愛知県	2	1	2	0	2	2.0	1.2	2.3
3	3	0	28	三重県	2	0	0	0	10	3.0	0.3	0.0
4	4	1	41	愛媛県	2	0	0	0	23	3.0	0.5	10.1

Let's convert `area` from prefecture names to integer labels.

from sklearn import preprocessing


le = preprocessing.LabelEncoder()
# Learn the name-to-label mapping on train, then reuse the same mapping for test
train['area'] = le.fit_transform(train['area'])
test['area'] = le.transform(test['area'])

train.head()

	id	position	age	area	sex	partner	num_child	education	service_length	study_time	commute	overtime	salary
0	0	1	44	24	2	1	2	1	24	2.0	1.6	9.2	428.074887
1	1	2	31	10	1	0	0	0	13	9.0	0.7	12.4	317.930517
2	2	2	36	14	1	0	0	2	14	4.0	0.4	16.9	357.350316
3	3	0	22	26	2	0	0	0	4	3.0	0.4	6.1	201.310911
4	4	0	25	46	2	0	0	1	5	3.0	0.2	4.9	178.067475

test.head()

	id	position	age	area	sex	partner	num_child	education	service_length	study_time	commute	overtime
0	0	3	39	46	2	1	5	1	19	1.0	1.8	14.2
1	1	1	31	11	1	0	0	4	0	0.0	0.5	18.6
2	2	0	20	24	2	1	2	0	2	2.0	1.2	2.3
3	3	0	28	0	2	0	0	0	10	3.0	0.3	0.0
4	4	1	41	23	2	0	0	0	23	3.0	0.5	10.1
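To make the mapping itself visible, here is a minimal sketch on made-up values (romanized names, not the competition data). It shows that `LabelEncoder` numbers the categories by their sorted order, that `inverse_transform` recovers the original names, and that `transform` raises an error for a value that was absent when the encoder was fitted.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values; romanized names are used only to keep the example short.
train_area = ["Tokyo", "Aichi", "Nara", "Tokyo"]
test_area = ["Nara", "Osaka"]  # "Osaka" never appears in train_area

le = LabelEncoder()
train_codes = le.fit_transform(train_area)

# classes_ holds the sorted unique values; a value's label is its index there.
classes = list(le.classes_)
codes = train_codes.tolist()

# inverse_transform maps labels back to the original values.
restored = le.inverse_transform(train_codes).tolist()

# transform() raises ValueError for values that were not seen during fit.
try:
    le.transform(test_area)
    unseen_error = False
except ValueError:
    unseen_error = True

print(classes, codes, restored, unseen_error)
```

In this notebook `le.transform(test['area'])` runs without error, which indicates that every prefecture in the test set also appears in the training set.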

What happens in `one hot encoding`

`area` contains 47 distinct prefecture names, so one-hot encoding with `drop_first=True` produces 46 dummy columns (47 without it).

train['area'].value_counts().plot.bar(figsize=(20, 5))

[Figure: bar chart of the number of training rows per `area` label]

pd.get_dummies(train, columns=['area'], drop_first=True).head()

	id	position	age	sex	partner	num_child	education	service_length	study_time	commute	overtime	salary	area_10	area_14	area_24	area_26	area_46
0	0	1	44	2	1	2	1	24	2.0	1.6	9.2	428.074887	0	0	1	0	0
1	1	2	31	1	0	0	0	13	9.0	0.7	12.4	317.930517	1	0	0	0	0
2	2	2	36	1	0	0	2	14	4.0	0.4	16.9	357.350316	0	1	0	0	0
3	3	0	22	2	0	0	0	4	3.0	0.4	6.1	201.310911	0	0	0	1	0
4	4	0	25	2	0	0	1	5	3.0	0.2	4.9	178.067475	0	0	0	0	1
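The column count can be checked on a small made-up frame (not the competition data): with k distinct labels, `pd.get_dummies` adds k columns, or k - 1 with `drop_first=True`, whereas label encoding keeps a single column.

```python
import pandas as pd

# Made-up frame with one label-encoded categorical column (4 distinct labels).
df = pd.DataFrame({"age": [44, 31, 36, 22, 25],
                   "area": [24, 10, 14, 26, 24]})

full = pd.get_dummies(df, columns=["area"])
dropped = pd.get_dummies(df, columns=["area"], drop_first=True)

n_labels = df["area"].nunique()
full_cols = [c for c in full.columns if c.startswith("area_")]
dropped_cols = [c for c in dropped.columns if c.startswith("area_")]

print(n_labels, full_cols, dropped_cols)
```

With `drop_first=True` the smallest label (`area_10` here) is the one dropped, so a row of all zeros in the remaining dummy columns stands for that label.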

Attached data

salary-prediction-label-encoding.ipynb
