XGBoost Hyperparameter Tuning

As I mentioned in my last post, I revisited my earlier GitHub projects (https://github.com/marin-stoytchev/data-science-projects ) looking to apply some of the things I learned during the last four months. One of the projects I put significant work into uses XGBoost, and I would like to share some insights gained in the process.

Why XGBoost

XGBoost was first released in March 2014 and soon after became the go-to ML algorithm for many data science problems, winning numerous Kaggle competitions along the way. It has since become one of the most popular algorithms for regression and classification problems that deal with tabular data (data not comprised of images and/or text).

XGBoost improves on the regular Gradient Boosting method by: 1) improving how the model error is minimized; 2) adding L1 and L2 regularization for better model generalization; 3) adding parallelization.

In addition, what makes XGBoost such a powerful tool is the many tuning knobs (hyperparameters) one has at their disposal for optimizing a model and achieving better predictions. However, in a way this is also a curse, because there are no hard-and-fast rules regarding which hyperparameters should be used for optimization and what ranges of values should be explored. And this is natural to expect, because hyperparameter tuning is very problem dependent.

As a simple example, consider data with a large number of highly correlated features, or features which have no relation to the target. When using XGBoost with such data, one would expect the best results from models with strong L1 regularization (a large alpha value), which gets rid of the meaningless features. The opposite case is data with uncorrelated features, all of which strongly affect the target values. In that case, it is natural to expect that the L1 coefficient alpha would be either 0 or very small for the best performing model (with some involvement of L2 regularization, perhaps). Thus, the number of hyperparameters and the ranges to be explored in the process of model optimization can vary dramatically depending on the data at hand. That’s why there are no clear-cut instructions on the specifics of hyperparameter tuning, and it is considered a sort of “black magic” among users of ML algorithms. There are certain general optimization rules available, but beyond that, achieving the best results is a question of understanding the data we are dealing with and long hours of experimentation.
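To make the two scenarios concrete, the sketch below simply shows where these knobs live on the scikit-learn wrapper: reg_alpha is the L1 coefficient (alpha) and reg_lambda is the L2 coefficient (lambda). The values are purely hypothetical.

from xgboost import XGBRegressor

# Hypothetical settings for data with many redundant or meaningless features:
# strong L1 to prune their contributions, plus a modest amount of L2 shrinkage.
model_many_redundant = XGBRegressor(objective = 'reg:squarederror',
                                    reg_alpha = 10.0,   # L1 (alpha)
                                    reg_lambda = 1.0)   # L2 (lambda)

# Hypothetical settings for data where every feature is informative:
# little or no L1, keeping only the default L2 term.
model_all_informative = XGBRegressor(objective = 'reg:squarederror',
                                     reg_alpha = 0.0,
                                     reg_lambda = 1.0)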

Project Description

The project in which I used XGBoost deals with predicting the diameter of asteroids from data posted on Kaggle by Victor Basu – https://www.kaggle.com/basu369victor/prediction-of-asteroid-diameter.

I am not going to present here the entire Python code used in the project. It can be found on my GitHub site in the asteroid project directory – https://github.com/marin-stoytchev/data-science-projects/tree/master/asteroid_xgb_project. In this post, I will focus on some results as they relate to the insights gained regarding XGBoost hyperparameter tuning.

After some data processing and exploration, the original data set was used to generate two data subsets:

  • data_1, consisting of 14 features plus the known diameter (the target), with a total of 137681 entries;
  • data_2, which has the same features but no diameter (the diameter is unknown), with a total of 702055 entries.

The goal is to train an XGBRegressor model with data_1 and then predict the diameter of the asteroids in data_2.

Initial Model

Initially, an XGBRegressor model was used with default parameters and objective set to ‘reg:squarederror’.

from xgboost import XGBRegressor

model_ini = XGBRegressor(objective = 'reg:squarederror')

The data with known diameter was split into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size = 0.2, random_state = 0)

Please note that the line breaks and alignment in the code are not always rendered correctly here due to limitations of the blog editor. For the correct code, please refer to the link mentioned earlier: https://github.com/marin-stoytchev/data-science-projects/tree/master/asteroid_xgb_project

The Initial model, model_ini, was trained on the training data and predictions were made on the test data.

model_ini.fit(X_train, y_train)

y_pred_1_ini = model_ini.predict(X_test)

Comparison between predictions and test values

After getting the predictions, the predicted values were compared to the true test values in a scatter plot as shown in the figure below.

As one can see from the plot, the predicted values are grouped around the perfect fit line, with the exception of two data points which one could categorize as “outliers”. For these two data points the predicted diameter values are approximately 350 and 550 km, while the true values are close to 940 and 900 km, respectively. This reveals probably the only weakness of XGBoost known to me: its predictions are bound by the minimum and maximum target values found in the training set. As it happened, the maximum diameter value in the training set is about 600 km, which is why the model could not predict values much greater than that.
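A quick way to see this bound in practice is to compare the largest prediction with the largest diameter seen during training; a minimal sketch, reusing the variables defined above:

# Tree ensembles cannot extrapolate beyond the target range seen in training,
# so the largest prediction should not exceed the largest training diameter.
print("Max diameter in training set:", y_train.max())
print("Max diameter in test set    :", y_test.max())
print("Max predicted diameter      :", y_pred_1_ini.max())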

Statistical properties of residuals

For a more quantitative evaluation, examining the statistics of the residuals (a residual is the difference between the true and the predicted value) is probably the best way to gauge model performance in regression problems.

The mean and the standard deviation of the residuals were obtained and are shown below.

residuals_1_ini = y_test - y_pred_1_ini

print("Residuals_ini Mean:", round(residuals_1_ini.mean(), 4))
print("Residuals_ini Sigma:", round(residuals_1_ini.std(), 4))

Residuals_ini Mean: 0.0187
Residuals_ini Sigma: 4.9317

The mean is close to zero and the sigma is small compared to the diameter values in the test data which indicates good model performance.

On a side note, we would like to point out that when the mean of the residuals is close to zero (as it is here), their standard deviation is essentially the prediction RMSE, which is often used as a metric for model accuracy in regression problems. Thus, the residuals sigma is, in effect, the accuracy of our model.
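This relationship is easy to check directly, since the squared RMSE equals the squared mean of the residuals plus their (population) variance. A minimal sketch, reusing the variables defined above:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the test predictions
rmse = np.sqrt(mean_squared_error(y_test, y_pred_1_ini))

# Population standard deviation of the residuals (ddof = 0)
sigma = np.std(residuals_1_ini)

print("RMSE :", round(rmse, 4))
print("Sigma:", round(sigma, 4))   # nearly identical to the RMSE, since the mean is ~0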

In addition, plotting the histogram of the residuals is a good way to evaluate the quality of the predictions. A histogram centered around zero and close to normal distribution with small sigma indicates good model performance.

The histogram in our case is shown below.

The residuals histogram is narrow (small sigma), centered around zero (the mean) and close to a normal distribution. It is slightly skewed towards positive values, which means that the model slightly underestimates the diameters. However, overall the results indicate good model performance.

Please note that we set limits on the x-axis of the histogram plot so that the histogram can be seen in better detail. If we included the maximum residual values, we would not be able to distinguish any meaningful histogram features (perhaps not see the histogram at all). For the same reason, appropriate axis limits were set for all histogram plots presented below.
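For reference, a minimal matplotlib sketch of how such a histogram can be produced; the bin count and axis limits below are illustrative, not the exact values used for the figures:

import matplotlib.pyplot as plt

plt.figure(figsize = (8, 5))
plt.hist(residuals_1_ini, bins = 100)
plt.xlim(-25, 25)   # illustrative limits, chosen so the shape of the distribution is visible
plt.xlabel('Residual (km)')
plt.ylabel('Count')
plt.title('Residuals, Initial model')
plt.show()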

Predictions for data with unknown diameter

After validating the model performance, predictions were made with the Initial model, model_ini, using the data with unknown diameter, data_2.

y_pred_2_ini = model_ini.predict(X_2)

Since there are no known true diameter values, the only way to examine whether the predictions make sense is to compare the predicted values to those from the data with known diameter – more specifically, their distributions. For this purpose, the histogram of the known diameter values from data_1 and the histogram of the predicted values for data_2 were compared.

As evident from the plot, the histogram of the predicted diameter values is much narrower, encompassing smaller values. This raises the legitimate question of whether the smaller predicted values are a result of the model not being optimized (i.e. they are an artifact of the model) or whether they are inherently smaller as determined by the features in data_2.
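A minimal sketch of how such a comparison can be plotted; the bin edges are illustrative, and density = True puts the two differently sized data sets on the same scale:

import numpy as np
import matplotlib.pyplot as plt

bins = np.linspace(0, 50, 101)   # illustrative bin edges in km

plt.figure(figsize = (8, 5))
plt.hist(y_1, bins = bins, alpha = 0.5, density = True, label = 'known diameters (data_1)')
plt.hist(y_pred_2_ini, bins = bins, alpha = 0.5, density = True, label = 'predicted diameters (data_2)')
plt.xlabel('Diameter (km)')
plt.ylabel('Normalized count')
plt.legend()
plt.show()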

To answer this question two different optimizations of the model were performed.

Model optimization using RandomizedSearch

RandomizedSearch is not the best approach for model optimization, particularly for the XGBoost algorithm, which has a large number of hyperparameters with wide ranges of values. However, we decided to include this approach in order to compare it both to the Initial model, which is used as a benchmark, and to a more sophisticated optimization approach later.

Any grid-based search takes an increasingly large amount of time as the number of hyperparameters and their ranges grow, to the point where the approach becomes impractical. Because of this, the search grid here was intentionally limited to the following parameters and ranges:

import numpy as np

grid_random = {'max_depth': [3, 6, 10, 20],
               'min_child_weight': np.arange(1, 10, 1),
               'gamma': np.arange(0, 10, 1),
               'n_estimators': [50, 100, 150],
               'learning_rate': [0.1, 0.2, 0.3],
               'subsample': np.arange(0.5, 1.0, 0.1)}

from sklearn.model_selection import RandomizedSearchCV

model = XGBRegressor(objective = 'reg:squarederror')

model_random = RandomizedSearchCV(estimator = model,
                                  param_distributions = grid_random,
                                  n_iter = 100,
                                  cv = 5,
                                  verbose = 2,
                                  random_state = 42,
                                  n_jobs = -1)

The model was then trained with the training data.

model_random.fit(X_train, y_train)

Training completed in 38 minutes and produced the best score with the following parameters.

print("Best score: %f with %s" % (model_random.best_score_, model_random.best_params_))

Best score: 0.867163 with {'subsample': 0.8999999999999999, 'n_estimators': 100, 'min_child_weight': 4, 'max_depth': 6, 'learning_rate': 0.1, 'gamma': 0}

The best estimator was selected and predictions were made using the test data.

model_rand = model_random.best_estimator_

y_pred_1_rand = model_rand.predict(X_test)

After getting the predictions, we followed the same procedure of evaluating the model performance as in the case of the Initial model.

The plot comparing the predicted with the true test values is shown below.

Although the results appear similar to those from the Initial model, the smaller values predicted for the two “outliers” appear to indicate that the RandomizedSearch optimization doesn’t provide better performance.

To examine this more closely, the statistics of the residuals are once again examined.

residuals_1_rand = y_test - y_pred_1_rand

print("Residuals_ini Mean:", round(residuals_1_ini.mean(), 4))
print("Residuals_ini Sigma:", round(residuals_1_ini.std(), 4))
print('\n')
print("Residuals_rand Mean:", round(residuals_1_rand.mean(), 4))
print("Residuals_rand Sigma:", round(residuals_1_rand.std(), 4))

Residuals_ini Mean: 0.0187
Residuals_ini Sigma: 4.9317

Residuals_rand Mean: 0.0379
Residuals_rand Sigma: 5.5739

Both the mean and the sigma of the residuals from the RandomizedSearch model are larger, which indicates no improvement in the model predictions.

However, we need to compare the histograms of the residuals to make a more informed decision. The figure below shows this comparison.

The histogram from the RandomizedSearch model appears slightly more symmetrical (closer to a normal distribution) than that of the Initial model. However, based on all of the above results, the conclusion is that the RandomizedSearch optimization does not provide meaningful performance improvements, if any, over the Initial model. That’s why, in an attempt to achieve better performance, Bayesian optimization using hyperopt was performed.

Bayesian optimization using hyperopt

I have to admit that this was the first time I used hyperopt. A good guide on XGBoost optimization with hyperopt for me personally was the Kaggle post by Prashant Banerjee, https://www.kaggle.com/prashant111/bayesian-optimization-using-hyperopt, and the links inside. I also found the post by Ray Bell, https://sites.google.com/view/raybellwaves/blog/using-xgboost-and-hyperopt-in-a-kaggle-comp, helpful for the practical implementation of hyperopt with an XGBRegressor model.

Using Bayesian optimization with hyperopt allows for exploring a large number of hyperparameters with wide ranges of values. So, following (mostly) the guidelines from the above two posts, I implemented this approach.

First optimization trial using hyperopt

For a fair comparison with the previous two models, the training set used earlier was split into two separate new sets – training and validation. The original test set was kept as a true test set for comparison with the predictions of the Initial and RandomizedSearch models.

X_hp_train, X_hp_valid, y_hp_train, y_hp_valid = train_test_split(X_train, y_train, test_size = 0.2, random_state = 0)

The code used to perform the optimization is provided below.

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials, space_eval
from sklearn import metrics

# Hyperparameter space to be explored
space = {'max_depth': hp.choice('max_depth', np.arange(3, 15, 1, dtype = int)),
         'n_estimators': hp.choice('n_estimators', np.arange(50, 300, 10, dtype = int)),
         'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1.0, 0.1),
         'min_child_weight': hp.choice('min_child_weight', np.arange(0, 10, 1, dtype = int)),
         'subsample': hp.quniform('subsample', 0.5, 1.0, 0.1),
         'learning_rate': hp.quniform('learning_rate', 0.1, 0.3, 0.1),
         'gamma': hp.choice('gamma', np.arange(0, 10, 0.1, dtype = float)),
         'reg_alpha': hp.choice('reg_alpha', np.arange(0, 10, 0.1, dtype = float)),
         'reg_lambda': hp.choice('reg_lambda', np.arange(0, 20, 0.1, dtype = float)),
         'objective': 'reg:squarederror',
         'eval_metric': 'rmse'}

# Objective minimized by hyperopt: validation RMSE of a model built with the given parameters
def score(params):
    model = XGBRegressor(**params)
    model.fit(X_hp_train, y_hp_train,
              eval_set = [(X_hp_train, y_hp_train), (X_hp_valid, y_hp_valid)],
              verbose = False,
              early_stopping_rounds = 10)
    y_pred = model.predict(X_hp_valid)
    score = np.sqrt(metrics.mean_squared_error(y_hp_valid, y_pred))
    print(score)
    return {'loss': score, 'status': STATUS_OK}

def optimize(trials, space):
    best = fmin(score, space, algo = tpe.suggest, max_evals = 200)
    return best

trials = Trials()
best_params = optimize(trials, space)

The optimization took approximately 49 minutes, compared to 38 minutes for the RandomizedSearch optimization, and yielded the following best parameters.

space_eval(space, best_params)

{'colsample_bytree': 0.9,
 'eval_metric': 'rmse',
 'gamma': 9.200000000000001,
 'learning_rate': 0.2,
 'max_depth': 5,
 'min_child_weight': 0,
 'n_estimators': 240,
 'objective': 'reg:squarederror',
 'reg_alpha': 9.200000000000001,
 'reg_lambda': 1.7000000000000002,
 'subsample': 0.8}

Using these parameters a new optimized model, model_opt, was created and trained using the new training and validation sets.

# Create model with best parameters
model_opt = XGBRegressor(max_depth = 5,
                         n_estimators = 240,
                         learning_rate = 0.2,
                         min_child_weight = 0,
                         subsample = 0.8,
                         colsample_bytree = 0.9,
                         gamma = 9.2,
                         reg_alpha = 9.2,
                         reg_lambda = 1.7,
                         objective = 'reg:squarederror')

# Fit with hp training data
model_opt.fit(X_hp_train, y_hp_train,
              eval_set = [(X_hp_train, y_hp_train), (X_hp_valid, y_hp_valid)],
              eval_metric = 'rmse',
              verbose = True,
              early_stopping_rounds = 10)

After that, predictions were made using the original test set.

y_pred_1_opt = model_opt.predict(X_test)

As before, the predictions were compared to the true test values using the familiar scatter plot shown below.

From the plot it is difficult to conclude whether the optimized model performs better than the Initial model, which we decided to use as a benchmark. That’s why we turn our attention to the statistics of the residuals.

residuals_1_opt = y_test - y_pred_1_opt

print("Residuals_ini Mean:", round(residuals_1_ini.mean(), 4))
print("Residuals_ini Sigma:", round(residuals_1_ini.std(), 4))
print('\n')
print("Residuals_opt Mean:", round(residuals_1_opt.mean(), 4))
print("Residuals_opt Sigma:", round(residuals_1_opt.std(), 4))

Residuals_ini Mean: 0.0187
Residuals_ini Sigma: 4.9317

Residuals_opt Mean: 0.041
Residuals_opt Sigma: 5.0795

The residuals mean and sigma for the Optimized model, although close, are actually worse than those obtained with the Initial model.

The comparison of the residuals histograms is shown in the figure below.

Despite the worse mean and sigma, the histogram from the Optimized model is clearly more symmetrical and closer to a normal distribution, which indicates better predictions. However, the improvement was not dramatic and I was not satisfied with the results from the hyperopt optimization. In an attempt to get a better result, I ran the optimization with the same hyperparameter space three more times. The best hyperparameter values came out slightly different each time, but ultimately the predictions were not much different. That led me to change the hyperparameter space and run hyperopt again.

Second optimization trial using hyperopt

For the second optimization trial, the main change in the hyperparameter space was extending the range of values for gamma from 0–10 to 0–20 (the reg_alpha range was widened in the same way, and a coarser step of 0.5 was used for gamma, reg_alpha and reg_lambda). Everything else was kept the same, and training and predictions were made as in the first trial.

Without going into great detail, I would like to make a quick note on the meaning of gamma. Quoting directly from the XGBoost documentation (https://xgboost.readthedocs.io/en/latest/parameter.html): gamma sets the “minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be”. In other words, if gamma is large, the trees resulting from the algorithm will be shallower, which in turn serves to prevent over-fitting (a deep tree is prone to over-fitting). Thus, one can consider gamma a regularization term as well, but different in nature from L1 and L2 regularization, which focus on the feature selection/importance.
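As an illustrative check of this effect (a sketch, assuming the training variables defined above are available), one could train two small models that differ only in gamma and compare the total number of split nodes in their trees by parsing the boosters’ text dumps; with the larger gamma the count should drop noticeably:

# Count the split nodes in every tree of a fitted model.
def count_splits(model):
    dump = model.get_booster().get_dump()   # one text dump string per tree
    # Split nodes contain a condition such as "[f3<2.5]"; leaf nodes do not.
    return sum(line.count('[') for tree in dump for line in tree.splitlines())

for g in (0, 18.5):                         # small vs. large gamma
    m = XGBRegressor(objective = 'reg:squarederror', n_estimators = 100, gamma = g)
    m.fit(X_hp_train, y_hp_train)
    print('gamma =', g, '-> total split nodes:', count_splits(m))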

The updated hyperparameter space was:

space = {'max_depth': hp.choice('max_depth', np.arange(3, 15, 1, dtype = int)),
         'n_estimators': hp.choice('n_estimators', np.arange(50, 300, 10, dtype = int)),
         'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1.0, 0.1),
         'min_child_weight': hp.choice('min_child_weight', np.arange(0, 10, 1, dtype = int)),
         'subsample': hp.quniform('subsample', 0.5, 1.0, 0.1),
         'learning_rate': hp.quniform('learning_rate', 0.1, 0.3, 0.1),
         'gamma': hp.choice('gamma', np.arange(0, 20, 0.5, dtype = float)),
         'reg_alpha': hp.choice('reg_alpha', np.arange(0, 20, 0.5, dtype = float)),
         'reg_lambda': hp.choice('reg_lambda', np.arange(0, 20, 0.5, dtype = float)),
         'objective': 'reg:squarederror',
         'eval_metric': 'rmse'}

The resulting best parameters were as follows.

space_eval(space, best_params)

{'colsample_bytree': 0.60000000000000001,
 'eval_metric': 'rmse',
 'gamma': 18.5,
 'learning_rate': 0.30000000000000004,
 'max_depth': 4,
 'min_child_weight': 1,
 'n_estimators': 100,
 'objective': 'reg:squarederror',
 'reg_alpha': 3.5,
 'reg_lambda': 0.0,
 'subsample': 1.0}

The resulting best parameters are different from those of the first optimization trial, but to me the most significant difference was that, by allowing gamma to go up to 20, its value grew from 0 in the RandomizedSearch model, to 9.2 in the first hyperopt trial, to 18.5 in the second trial with the new hyperparameter space.

I am not sure whether the change in gamma is the determining factor or whether it is the combined change in all parameters that ultimately affects the model predictions. However, the predictions made using the new best parameters for the Optimized model showed a definite improvement over the Initial model predictions.

# Create model with best parameters
model_opt = XGBRegressor(max_depth = 4,
                         n_estimators = 100,
                         learning_rate = 0.3,
                         min_child_weight = 1,
                         subsample = 1.0,
                         colsample_bytree = 0.6,
                         gamma = 18.5,
                         reg_alpha = 3.5,
                         reg_lambda = 0.0,
                         objective = 'reg:squarederror')

# Fit with hp datasets
model_opt.fit(X_hp_train, y_hp_train,
              eval_set = [(X_hp_train, y_hp_train), (X_hp_valid, y_hp_valid)],
              eval_metric = 'rmse',
              verbose = True,
              early_stopping_rounds = 10)

y_pred_1_opt = model_opt.predict(X_test)

The plot of the new predicted values vs. the true test values is shown below.

Although similar to the plot for the Initial model, the predicted values here appear closer to the perfect fit line.

However, it is the statistics of the residuals that show clear and unambiguous improvement. Repeating the residuals code from before, shown below, gave the following mean and sigma.

residuals_1_opt = y_test - y_pred_1_opt

print("Residuals_ini Mean:", round(residuals_1_ini.mean(), 4))
print("Residuals_ini Sigma:", round(residuals_1_ini.std(), 4))
print('\n')
print("Residuals_opt Mean:", round(residuals_1_opt.mean(), 4))
print("Residuals_opt Sigma:", round(residuals_1_opt.std(), 4))

Residuals_ini Mean: 0.0187
Residuals_ini Sigma: 4.9317

Residuals_opt Mean: 0.0321
Residuals_opt Sigma: 4.2539

Although the mean for the Optimized model is slightly larger, the sigma is improved by approximately 14 % (from 4.93 to 4.25). Recalling that the residuals sigma is essentially the prediction RMSE, this translates to a roughly 14 % improvement in model accuracy.

It is, however, the significant improvement in the residuals histogram, shown below, which to me distinguishes the two models.

The residuals histogram from the Optimized model is narrower due to the smaller sigma. It is also much closer to a normal distribution than the histogram from the Initial model. These are hallmarks of better model performance.

Final predictions for data with unknown diameter

Using the final Optimized model, predictions were made for the data with unknown diameter. Before making the predictions, the model was re-trained on the entire original training set, with the original test set used for evaluation.

model_opt.fit(X_train, y_train,
              eval_set = [(X_train, y_train), (X_test, y_test)],
              eval_metric = 'rmse',
              verbose = True,
              early_stopping_rounds = 10)

y_pred_2_opt = model_opt.predict(X_2)

Here, as before, there are no true values to compare to, but that was not our goal. The goal is to compare the predicted values from the Initial model with those from the Optimized model, and more specifically their distributions. This comparison is shown in the figure below.

Despite some differences, the two distributions are very similar: they cover the same range of values and have the same shape. This answers the question posed earlier, namely whether the difference between the distributions of the predicted unknown diameters and the known diameters might be due to the Initial model not being optimized. The plot above clearly shows that the initially predicted values are not an artifact of an unoptimized model. Thus, the conclusion we can draw from our analysis is that the predicted values are inherently determined by the feature values in the data with unknown diameter and should be accepted as reasonably close to the true asteroid diameters.

Final thoughts

After reading through the entire post, I realize that it is quite long and, perhaps, presents too many results plots. This in turn might make it difficult to follow. However, hyperparameter tuning is a complex subject and to me this seemed the best way to demonstrate the following points:

  • Performing model optimization does not by itself guarantee that the resulting model will achieve better results. I have seen many posts on the web which simply show the code for optimization and stop there. That is definitely not enough. Very few such posts show results demonstrating the performance improvement of the model after optimization. Even the ones that compare model performance before and after optimization limit their analysis to the accuracy of the model, which is not always the best metric, as demonstrated here by examining the statistical distributions (histograms) of the residuals.
  • Model optimization is a process of exploring a variety of hyperparameters and different ranges for the selected hyperparameters. There are no clear-cut rules for a specific algorithm that define the correct combination of hyperparameters and their ranges. As pointed out at the very beginning, these are very much problem dependent. And sometimes, all it takes to achieve significantly better results is to expand the range of a single hyperparameter.

As always, I hope that this post will be helpful to some of you in your Data Science and Machine Learning journey. Good luck!
