1st place solution

Big thanks to Probspace for holding this competition, and congrats to all the top teams. I joined this competition 5 days before the deadline and soon discovered that the data quality is very good and that my CV score always tracks the public LB (and the private LB as well). It was truly a great experience for me.

In this write-up, I'll briefly talk about my solution and share some thoughts about this competition.

About me

Currently, I am a Master's student at Tokyo Tech. For further information, please refer to my Kaggle Profile.

Summary

My final score (Public: 0.25848 / Private: 0.25854 / CV: 0.261616) is based on a single LGBM (5-fold bagging) trained on the top 700 features selected by LightGBM feature importance. I only use LightGBM here since it is good at dealing with tabular data and relatively fast compared to XGBoost and CatBoost, which allows you to test more ideas in limited time. For the validation scheme, I simply use 5-fold cross-validation, and it works very well: the CV score always aligns with the LB score.
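The 5-fold scheme described above can be sketched roughly as follows. This is only an illustration under loud assumptions: `lm()` is a stand-in for the LightGBM model, the data is synthetic, and RMSE is used just to have a number to look at. None of the variable names come from the actual solution.

```r
# Sketch only: lm() stands in for LightGBM, and the data is synthetic.
set.seed(42)
n <- 200
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- 1 + 2 * train$x1 - train$x2 + rnorm(n, sd = 0.1)
test <- data.frame(x1 = rnorm(50), x2 = rnorm(50))

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random, balanced folds

oof_pred  <- rep(NA_real_, n)     # out-of-fold predictions, used for the CV score
test_pred <- rep(0, nrow(test))   # test predictions averaged over the k fold models

for (f in seq_len(k)) {
  is_valid <- fold_id == f
  fit <- lm(y ~ x1 + x2, data = train[!is_valid, ])
  oof_pred[is_valid] <- predict(fit, newdata = train[is_valid, ])
  test_pred <- test_pred + predict(fit, newdata = test) / k
}

# CV score computed on rows that each model never saw during training
cv_rmse <- sqrt(mean((train$y - oof_pred)^2))
```

Each row gets exactly one out-of-fold prediction, which is what makes the CV score comparable to a leaderboard score, while the submission is the average of the 5 fold models.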

Feature engineering

The data is a little bit dirty, but compared to the data from Signate student cup 2019, it was not a problem at all for me. I just spent some time converting the text from full-width (全角) to half-width (半角) characters, then splitting the combined fields into individual features so that we can do some feature engineering on them.
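A minimal base R sketch of the full-width to half-width step, assuming only the full-width ASCII block (U+FF01 to U+FF5E) and the ideographic space (U+3000) need converting. The helper name `to_halfwidth` is hypothetical, and the actual solution may well use a library function instead.

```r
# Hypothetical helper: shift full-width ASCII variants to their half-width forms.
to_halfwidth <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                         # string -> Unicode code points
    is_fw <- cp >= 0xFF01L & cp <= 0xFF5EL     # full-width "!" .. "~"
    cp[is_fw] <- cp[is_fw] - 0xFEE0L           # fixed offset to the ASCII range
    cp[cp == 0x3000L] <- 0x20L                 # ideographic space -> plain space
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Full-width "１２３㎡" style digits become plain digits; other characters are untouched.
to_halfwidth(c("\uff11\uff12\uff13", "abc", NA))
```

Half-width katakana and other compatibility forms are deliberately left alone here, since the post does not say whether they were converted.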

My whole FE is composed of 6 parts :

  • Group method (numeric2cate) : Apply statistics to numeric features within groups defined by different categorical features. For example, applying "mean" to "面積(㎡)" (area in m²) grouped by "市区町村コード" (municipality code). The statistics functions I used :

    • Mean, max, min, std, sum, skewness, kurtosis
    • Bayes mean
    • IQR : q75 - q25
    • IQR_ratio : q75 / q25
    • Median absolute deviation (MAD) : median( abs(x - median(x)) )
    • Mean variance : std(x) / mean(x) (i.e. the coefficient of variation)
    • hl_ratio : The ratio of the number of samples higher than the mean to the number lower than the mean (Ref, Table 2).
    • Beyond1std : The fraction of samples more than 1 std away from the mean
    • Range : max - min
    • Range_ratio : max / min
    • Shapiro-Wilk Statistic
    • diff and ratio : "x - mean(x)" or "x / mean(x)"
    • Z-score : ( x-mean(x) ) / std(x)
  • Group method (cate2cate) : Apply statistics to categorical features within groups defined by different categorical features. For example, applying "entropy" to the frequency table of "最寄駅:名称" (nearest station: name) grouped by "市区町村コード" (municipality code). The statistics functions I used :

    • n_distinct : number of unique categories
    • Entropy : entropy of the frequency table
    • freq1name : the count of the most frequent category
    • freq1ratio : the count of the most frequent category / group size
  • Target encoding : Reference and code

  • Count encoding : This works very well on some categorical features like "取引の事情等" (transaction circumstances, etc.)

  • Features from land_price.csv : Built with the 2 "Group method" variants mentioned above. I apply the statistics to features grouped by "所在地コード" (location code), then merge the result into the train+test data.

  • Feature pseudo-labeling : Build an LGBM model to predict each of the important features (I used "sq_meter", "land__mean__ON__h31_price", "nobeyuka_m2", "Age of house", "time_to_nearest_aki"), and then use the out-of-fold (oof) predictions as new features.
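To make the two "Group method" variants and count encoding more concrete, here is a small base R sketch on toy data. The column names (`city_code`, `station`, `area_m2`) are hypothetical English stand-ins for the real columns, the helper names are made up for illustration, and only a handful of the statistics listed above are shown.

```r
# Toy data; column names are hypothetical stand-ins for the real ones.
df <- data.frame(
  city_code = c("A", "A", "A", "B", "B", "B", "B"),
  station   = c("s1", "s1", "s2", "s3", "s4", "s5", "s5"),
  area_m2   = c(50, 60, 100, 20, 30, 40, 200),
  stringsAsFactors = FALSE
)

# numeric2cate: statistics of a numeric column within each group
num_stats <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  c(mean = mean(x),
    sd = sd(x),
    iqr = q[2] - q[1],
    mad = median(abs(x - median(x))),   # unscaled, unlike stats::mad()
    range = max(x) - min(x),
    beyond1std = mean(abs(x - mean(x)) > sd(x)))
}

# cate2cate: statistics of a categorical column's frequency table
cat_stats <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  c(n_distinct = length(p),
    entropy = -sum(p * log(p)),
    freq1ratio = max(p))
}

# Apply `fun` per group and return one row of statistics per group
by_group <- function(values, groups, fun) {
  m <- do.call(rbind, lapply(split(values, groups), fun))
  data.frame(city_code = rownames(m), m, row.names = NULL,
             stringsAsFactors = FALSE)
}

num_agg <- by_group(df$area_m2, df$city_code, num_stats)
cat_agg <- by_group(df$station, df$city_code, cat_stats)

# Merge the group-level statistics back onto every row
feats <- merge(merge(df, num_agg, by = "city_code"), cat_agg, by = "city_code")

# Row-level "diff and ratio" and z-score features relative to the group
feats$area_diff   <- feats$area_m2 - feats$mean
feats$area_ratio  <- feats$area_m2 / feats$mean
feats$area_zscore <- (feats$area_m2 - feats$mean) / feats$sd

# Count encoding: replace a category by how often it appears
feats$station_count <- as.integer(table(feats$station)[feats$station])
```

In a real pipeline the same functions would be looped over many (numeric, categorical) column pairs, which is presumably how a feature count in the thousands comes about.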

Hyper-parameter

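Surprisingly, tuning "alpha" in the Huber loss gave me a really big boost (~0.001). In LightGBM's Huber loss, alpha is the threshold between the two regimes: residuals smaller than alpha are penalized quadratically, while larger ones are penalized only linearly, as in absolute loss. So if we lower the alpha value, the model becomes less sensitive to those "outlier cases".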
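The effect of alpha is easiest to see on the gradient. The sketch below reflects my understanding of how the Huber objective behaves in LightGBM (the gradient is the residual, clipped to the range from -alpha to alpha); `huber_grad` is a hypothetical helper written for illustration, not a LightGBM function.

```r
# Gradient of the Huber objective with respect to the prediction.
# Inside the threshold it equals the residual (as for squared error);
# outside it is clipped to +/- alpha (as for a scaled absolute error).
huber_grad <- function(pred, label, alpha) {
  diff <- pred - label
  ifelse(abs(diff) <= alpha, diff, sign(diff) * alpha)
}

residuals <- c(0.1, 0.5, 3.0)   # small, medium, and outlier-sized errors

huber_grad(residuals, 0, alpha = 0.9)   # LightGBM's default alpha
huber_grad(residuals, 0, alpha = 0.3)   # the value used in this solution
```

With alpha = 0.9 the three gradients are 0.1, 0.5 and 0.9, and with alpha = 0.3 they are 0.1, 0.3 and 0.3, so a lower alpha caps the pull that large errors have on the model.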

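lgb_param <- list(boosting_type = 'gbdt',
                  objective = "huber",              # robust regression loss
                  boost_from_average = 'false',     # do not start from the label average
                  metric = "none",                  # no built-in metric is computed
                  learning_rate = 0.008,
                  num_leaves = 128,
                  #  min_gain_to_split = 0.01,
                  feature_fraction = 0.05,          # each tree sees 5% of the features
                  #  feature_fraction_seed = 666666,
                  bagging_freq = 1,
                  bagging_fraction = 1,             # 1 = use all rows (no row subsampling)
                  min_sum_hessian_in_leaf = 5,
                  #  min_data_in_leaf = 100,
                  lambda_l1 = 0,                    # no L1 regularization
                  lambda_l2 = 0,                    # no L2 regularization
                  alpha = 0.3                       # Huber threshold (LightGBM default: 0.9)
                  )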

 

CODE

https://github.com/Anguschang582/Probspace---Re_estate---1st-place-solution

If you have any questions, please feel free to ask me!
Questions in Japanese are fine too :))

katsu1110

Congratulations! I learned a lot from your work. Thanks for sharing.

A couple of questions from me if you don't mind:

1) How did you end up using Huber loss? Because of some outliers?

2) I'm astonished by the diversity of your engineered features. In particular, it was very wise of you to come up with so many aggregation features by "市区町村コード" (municipality code). Were there any particularly strong features?

3) Related to 2), how effective were features generated from 'Feature pseudo-labeling'? I've never tried that strategy and am very interested in its effectiveness.

4) What was your imputation strategy in this competition?

5) There was a so-called 'leak' in this competition (https://prob.space/competitions/re_real_estate_2020/discussions/masato8823-Post9982d5b9dcd6a33111e0). Did you take advantage of it in any way?

Presumably some of my questions could be answered from your code, but my R proficiency is very limited, so I'm sorry to ask but would appreciate your kind reply. Thanks in advance!

Angus_Chang

Hi katsu-san, thank you for your kind words :))

About the questions :

  1. I started my modeling with the typical RMSE as my loss function, and then I tried Fair loss and Huber loss. The reason I used Huber loss in my final solution is that it gave me a better CV score. The default setting in LightGBM is alpha=0.9, so I thought that lowering the alpha value would probably make my model more robust, and in hindsight, it really did.

  2. I didn't test those features one by one against the CV score, so I probably can't tell you which features are effective (i.e. boost CV a lot). But some features like "x_zscore__ON__nobeyuka_m2__BY__trans_date_yr" or "bayes_mean__ON__sq_meter__BY__nearest_aki_name" have really high "GAIN" in the LGBM importance.

  3. Those feature pseudo-labeling features gave me a ~0.001 boost on CV.

  4. Nothing; LightGBM handles missing values very well.

  5. No, I didn't know about this. It looks very interesting.

Feel free to ask me if you have any other questions :))

pao

Congratulations!! And thank you for sharing!!

I have two questions if you don't mind.

  1. Am I correct in understanding that you did not use feature selection? Instead of feature selection, did you lower LightGBM's feature fraction? If so, is that better than feature selection?

  2. Is there a reason why both lambda L1 and L2 in LightGBM are zero?

Angus_Chang

Hi pao-san :)

  1. I did use feature selection in this competition, and I forgot to mention it in the post; really sorry for that. For my final submission, I first ran a model on the full dataset (~1600 features), then selected the top 700 features and reran it. I'm surprised that it only gave me a ~0.0003 improvement. About the feature fraction: whether I use the full dataset or the top-700 one, a value of 0.05 always gave me the best CV score.

  2. Nope, just because those values gave me a better CV score.

amedama

Congratulations! I am so grateful for your detailed write-up. It is very educational for me. Just reading this topic, I am fully satisfied with this competition.

Angus_Chang

Thank you amedama-san !!
