18.3 8th place solution with GitHub
The eighth-placed team was a team of eight; they presented their solution in a Kaggle discussion post.
The team consisted of:
- Eight people
- Formed as a private group
- Organised via the internet
18.3.1 Overall architecture
A variety of models were combined:
- LightGBM (gbm)
- xgboost (xgb)
- Random Forest (rf)
- Neural Networks (these were not picked up at level 2, so they were removed)
18.3.2 Input data sets
The team created several different data sets and used them with different models.
Level 1 data sets:
- Data set 1 (0.477 gbm): order, raw numeric, date, categorical
- Data set 2 (0.482 gbm, 0.477 xgb, 0.473 rf): order, path, raw numeric, date
- Data set 3 (0.479 gbm, 0.473 xgb): order, path, numeric, date, refined categorical
- Data set 4 (0.469 xgb, 0.442 rf): features sorted by numeric values + date features + path, unsupervised nearest neighbors (L1 = Manhattan / L2 = Euclidean distances) per label (see the sketch after this list)
- Data set 5 (0.43 xgb): path, unsupervised nearest neighbors
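The write-up does not spell out how the unsupervised nearest-neighbor features of data sets 4 and 5 were computed. A minimal sketch of the idea, assuming scikit-learn and hypothetical feature matrices X_train, y_train and X (the function name and the number of neighbors are assumptions, not the team's code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_distance_features(X_train, y_train, X, n_neighbors=5):
    # For each row of X, compute the mean Manhattan (L1) and Euclidean (L2)
    # distance to its nearest training neighbors, separately per label.
    feats = []
    for label in np.unique(y_train):
        X_label = X_train[y_train == label]
        for metric in ("manhattan", "euclidean"):
            nn = NearestNeighbors(n_neighbors=n_neighbors, metric=metric)
            nn.fit(X_label)
            dist, _ = nn.kneighbors(X)
            feats.append(dist.mean(axis=1))  # one feature per (label, metric) pair
    return np.column_stack(feats)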
The model was two-staged; the data set for the second stage is given below.
Level 2 data set:
- Level 1 predictions (12 predictions in total from level 1)
- Data set 5
- Duplicate feature (count and position)
18.3.3 Ensembling
Better performance can often be achieved by ensembling several models together; it is good practice to use dissimilar models, because the variance between them helps to improve the overall performance. The final level 2 blend (see the sketch after this list) was:
- 30% weighted xgboost gbtree (~0.488 CV)
- 70% weighted Random Forest (~0.485 CV)
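In code, such a fixed-weight blend boils down to a weighted average of the two prediction vectors. A minimal sketch with assumed variable names (not the team's code):

import numpy as np

def weighted_blend(pred_xgb, pred_rf, w_xgb=0.3, w_rf=0.7):
    # Blend two prediction vectors with the fixed 30% / 70% weights from the write-up
    return w_xgb * np.asarray(pred_xgb) + w_rf * np.asarray(pred_rf)

print(weighted_blend([0.2, 0.9], [0.4, 0.7]))  # -> [0.34 0.76]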
18.3.4 Features
18.3.4.1 Features used
Features were created using several methods (a short sketch follows the list):
- Maximum
- Minimum
- Kurtosis
- Lead
- Lag
- One-hot encoding
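A minimal pandas sketch of these transformations on a toy data frame (column names and values are made up for illustration; the team's actual feature code is on GitHub):

import pandas as pd

df = pd.DataFrame({
    "f0": [0.1, 0.4, 0.2, 0.8],
    "f1": [0.3, 0.1, 0.5, 0.2],
    "f2": [0.0, 0.6, 0.7, 0.1],
    "f3": [0.9, 0.2, 0.3, 0.4],
    "station": ["a", "b", "a", "c"],  # hypothetical categorical column
})

num_cols = ["f0", "f1", "f2", "f3"]
df["row_max"] = df[num_cols].max(axis=1)            # Maximum per row
df["row_min"] = df[num_cols].min(axis=1)            # Minimum per row
df["row_kurtosis"] = df[num_cols].kurtosis(axis=1)  # Kurtosis per row
df["f0_lead"] = df["f0"].shift(-1)                  # Lead: value from the next row
df["f0_lag"] = df["f0"].shift(1)                    # Lag: value from the previous row
df = pd.get_dummies(df, columns=["station"])        # One-hot encoding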
18.3.6 Software
The team used a variety of programming languages and tools:
- Programming languages
  - R
  - Python
- Tools
  - LightGBM (through the Laurae package)
  - xgboost
  - scikit-learn Random Forest
  - H2O Random Forest
  - Keras Neural Networks
  - Markdown
  - Rmarkdown
  - RStudio for R
  - Spyder for Python
18.3.7 Code on GitHub
A detailed explanation of the code is given on GitHub. The repository contains scripts for:
- Pre-processing
- Feature engineering
- Modeling
- Hyperparameter optimization using HyperOpt
18.3.7.1 Level 1 model scripts
Let's look into some of the model scripts.
18.3.7.1.1 GBM Model
temp_model <- lgbm.cv(y_train = label,
                      x_train = train,
                      x_test = test,
                      data_has_label = TRUE,
                      NA_value = "nan",
                      lgbm_path = my_lgbm_is_at,
                      workingdir = my_script_is_using,
                      files_exist = TRUE,
                      save_binary = FALSE,
                      validation = TRUE,
                      folds = folds,
                      predictions = TRUE,
                      importance = TRUE,
                      full_quiet = FALSE,
                      verbose = FALSE,
                      num_threads = threads, # The number of threads to run for LightGBM
                      application = "binary",
                      learning_rate = eta, # The shrinkage rate applied to each iteration
                      num_iterations = 5000, # The number of boosting iterations
                      early_stopping_rounds = 700, # The number of iterations without improvement over the best validation metric before LightGBM stops automatically
                      num_leaves = leaves, # The number of leaves in one tree
                      min_data_in_leaf = min_sample, # Minimum number of data points in one leaf
                      min_sum_hessian_in_leaf = min_hess, # Minimum sum of hessians in one leaf to allow a split
                      max_bin = 255, # The maximum number of bins created per feature
                      feature_fraction = colsample, # Column subsampling percentage; for instance, 0.5 means selecting 50% of features randomly for each iteration
                      bagging_fraction = subsample, # Row subsampling percentage; for instance, 0.5 means selecting 50% of rows randomly for each iteration
                      bagging_freq = sampling_freq, # The frequency of row subsampling
                      is_unbalance = FALSE, # For binary classification, setting this to TRUE might be useful when the training data is unbalanced
                      metric = "auc",
                      is_training_metric = TRUE, # Whether to report the training metric in addition to the validation metric
                      is_sparse = FALSE) # Whether sparse optimization is enabled
18.3.7.1.2 XGBoost model
temp_model <- xgb.train(data = dtrain,
                        nthread = 12,
                        nrounds = floor(best_iter * 1.1), # Maximum number of boosting iterations
                        eta = 0.05, # Controls the learning rate: scales the contribution of each tree by a factor of 0 < eta < 1 when it is added to the current approximation
                        depth = 7, # Maximum depth of a tree
                        #gamma = 20, # Minimum loss reduction required to make a further partition on a leaf node of the tree
                        subsample = 0.9, # Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees
                        colsample_bytree = 0.7, # Subsample ratio of columns when constructing each tree
                        min_child_weight = 50, # Minimum sum of instance weight (hessian) needed in a child
                        booster = "gbtree", # Which booster to use: can be gbtree or gblinear
                        #feval = mcc_eval_nofail,
                        eval_metric = "auc",
                        maximize = TRUE,
                        objective = "binary:logistic",
                        verbose = TRUE,
                        prediction = TRUE,
                        watchlist = list(test = dtrain))
18.3.7.2 Level 2 model scripts
18.3.7.2.1 70% weighted Random Forest (~0.485 CV)
First, read in the results of the level 1 models, which are now the features for the level 2 model:
train <- read_feather("Shubin/retrain_material/train.feather")
test <- read_feather("Shubin/retrain_material/test.feather")
train[, "xgb_jay_joost_v2"] <- fread("Laurae/20161110_xgb_jayjoost_fix2/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "xgb_jay_joost_v2"] <- fread("Laurae/20161110_xgb_jayjoost_fix2/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_jay_joost_v2"] <- fread("Laurae/20161111_lgbm_jayjoost/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_jay_joost_v2"] <- fread("Laurae/20161111_lgbm_jayjoost/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_jay"] <- fread("Laurae/20161111_lgbm_jay/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_jay"] <- fread("Laurae/20161111_lgbm_jay/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_mike"] <- fread("Laurae/20161110_lgbm_mike/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_mike"] <- fread("Laurae/20161110_lgbm_mike/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "xgb_mike"] <- fread("Laurae/20161110_xgb_mike/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "xgb_mike"] <- fread("Laurae/20161110_xgb_mike/aaa_stacker_preds_test_headerY_scale.csv")$x
Then train the level 2 model:
temp_model <- h2o.randomForest(x = 1:12,
                               y = "Response",
                               training_frame = my_train[[i]],
                               ntrees = 200, # Number of trees
                               max_depth = 12, # Maximum tree depth
                               min_rows = 20, # Fewest allowed (weighted) observations in a leaf
                               seed = 11111)
18.3.7.2.2 Hyperparameter optimization using HyperOpt
The models were implemented in R, while the hyperparameter optimization was implemented in Python.
First, define the parameters to be optimized:
from hyperopt import hp, fmin, tpe, Trials  # imports used by the snippets below

# Random Forest Params
params = {'n_estimators': 100}
params['random_state'] = 100
params['max_features'] = hp.choice('max_features', range(10, 199))
params['max_depth'] = hp.choice('max_depth', range(7, 30))
params['verbose'] = 10
params['n_jobs'] = -1
Then run the optimizer from the Hyperopt library (a sketch of the score_rf objective follows the call):
# Hyperopt
trials = Trials()
counter = 0
best = fmin(score_rf,
            params,
            algo=tpe.suggest, # search algorithm
            max_evals=200,
            trials=trials)
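The objective function score_rf is defined elsewhere in the team's repository and is not reproduced here. A minimal sketch of what such an objective could look like, assuming a scikit-learn Random Forest scored by cross-validated AUC on hypothetical X_train and y_train arrays (the function body and the scoring choice are assumptions, not the team's code):

from hyperopt import STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def score_rf(params):
    # Train a Random Forest with the sampled params and return the loss for fmin to minimize
    clf = RandomForestClassifier(**params)
    auc = cross_val_score(clf, X_train, y_train, scoring="roc_auc", cv=5).mean()
    return {'loss': -auc, 'status': STATUS_OK}  # fmin minimizes, so negate AUC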
Choosing the trials option gives back an object with the following attributes (a short usage example follows the list):
- trials.trials - a list of dictionaries representing everything about the search
- trials.results - a list of dictionaries returned by ‘objective’ during the search
- trials.losses() - a list of losses (float for each ‘ok’ trial)
- trials.statuses() - a list of status strings
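Note that for hp.choice parameters, the best dictionary returned by fmin contains indices into the choice lists rather than the chosen values; hyperopt's space_eval maps them back. A short usage example, assuming the params and trials objects defined above:

from hyperopt import space_eval

print(space_eval(params, best))             # best hyperparameters as actual values
print(trials.best_trial['result']['loss'])  # lowest loss found during the search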