18.3 8th place solution with GitHub

The eighth-placed team was a team of eight; they presented their solution in the Kaggle discussion forum.

The team consisted of

  • Eight people
  • A private group
  • Organised via the internet

18.3.1 Overall architecture

A variety of models were combined:

  • LightGBM (gbm)
  • xgboost (xgb)
  • Random Forest (rf)
  • Neural Networks (their predictions did not get picked up at level 2, so they were removed)

18.3.2 Input data sets

The team created several data sets and used them with different models.

Level 1 data sets:

  • Data set 1 (0.477 gbm): order, raw numeric, date, categorical
  • Data set 2 (0.482 gbm, 0.477 xgb, 0.473 rf): order, path, raw numeric, date
  • Data set 3 (0.479 gbm, 0.473 xgb): order, path, numeric, date, refined categorical
  • Data set 4 (0.469 xgb, 0.442 rf): features sorted by numeric values, date features, path, and unsupervised nearest neighbors (L1 = Manhattan / L2 = Euclidean distances) per label (see the sketch after this list)
  • Data set 5 (0.43 xgb): path, unsupervised nearest neighbors
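
The nearest-neighbor features of data sets 4 and 5 can be illustrated with a small sketch. The snippet below is only a minimal, unoptimised illustration with hypothetical object names (train_num, test_num, train_label); it computes, for every row, the distance to its closest training row of a given label under the L1 (Manhattan) and L2 (Euclidean) norms.

# Minimal sketch (base R, hypothetical object names): for each row of x,
# the distance to the nearest training row carrying the label lbl.
nn_dist_features <- function(x, train_num, labels, lbl) {
  ref <- as.matrix(train_num[labels == lbl, , drop = FALSE])
  t(apply(as.matrix(x), 1, function(row) {
    diff <- sweep(ref, 2, row)               # difference to every reference row
    c(l1 = min(rowSums(abs(diff))),          # closest neighbor, Manhattan
      l2 = min(sqrt(rowSums(diff ^ 2))))     # closest neighbor, Euclidean
  }))
}

# e.g. distances to the closest positive and the closest negative example
nn_pos <- nn_dist_features(test_num, train_num, train_label, 1)
nn_neg <- nn_dist_features(test_num, train_num, train_label, 0)

On the real competition data the abundant missing values and the sheer data size would require a more careful (and much faster) implementation than this.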

The model had two stages; the second stage used the data set given below.

Level 2 data set:

  • Level 1 predictions (12 prediction columns from level 1)
  • Data set 5
  • Duplicate feature (count and position); one possible construction is sketched below
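
The duplicate feature is not spelled out in detail in the write-up; one possible reading, sketched below with data.table and a hypothetical feature table train_features, is to count how many rows share exactly the same values on a set of key columns and to record each row's position within its duplicate group.

# One possible (hypothetical) construction of the duplicate feature:
# group rows by their value pattern, then record the group size and the
# position of each row inside its group.
library(data.table)
dt <- as.data.table(train_features)                      # assumed feature table
key_cols <- setdiff(names(dt), c("Id", "Response"))
dt[, dup_count := .N,          by = key_cols]            # how many identical rows
dt[, dup_pos   := seq_len(.N), by = key_cols]            # position within the group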

18.3.3 Ensembling

Better performance can often be achieved by ensembling several models. It is good practice to combine dissimilar models, because their diversity reduces the variance of the combined prediction and improves overall performance. The final blend was a weighted average of the two level 2 models, as sketched after the list.

  • 30% weighted xgboost gbtree (~0.488 CV)
  • 70% weighted Random Forest (~0.485 CV)
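
In code the blend is nothing more than a weighted sum of the two level 2 prediction vectors (hypothetical names pred_xgb and pred_rf, aligned by Id):

# Final blend: 30% XGBoost gbtree, 70% Random Forest
final_pred <- 0.3 * pred_xgb + 0.7 * pred_rf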

18.3.4 Features

18.3.4.1 Features used

Features were created using several methods; a small illustration follows the list.

  • Maximum
  • Minimum
  • Kurtosis
  • Lead
  • Lag
  • One-hot encoding
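
A minimal sketch of these feature types, using data.table and hypothetical column names (start_time, cat_col); the kurtosis function comes from the e1071 package:

# Minimal sketch of the listed feature types (data.table, hypothetical
# column names; the team's real scripts are in the GitHub repository).
library(data.table)
dt <- as.data.table(train_numeric)                    # assumed numeric feature table
num_cols <- setdiff(names(dt), c("Id", "Response"))

dt[, row_max  := do.call(pmax, c(.SD, na.rm = TRUE)), .SDcols = num_cols]           # Maximum
dt[, row_min  := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = num_cols]           # Minimum
dt[, row_kurt := apply(.SD, 1, e1071::kurtosis, na.rm = TRUE), .SDcols = num_cols]  # Kurtosis

# Lead / lag of a hypothetical start_time column between consecutive Ids
setorder(dt, Id)
dt[, time_lag  := shift(start_time, type = "lag")]    # Lag
dt[, time_lead := shift(start_time, type = "lead")]   # Lead

# One-hot encoding of a hypothetical categorical column
# (missing values would need handling on the real data)
onehot <- model.matrix(~ 0 + factor(cat_col), data = train_categorical)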

18.3.5 Validation method

The validation method used was 5-fold cross-validation.
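
A minimal sketch of constructing such folds (hypothetical label vector); reusing one fold assignment across the level 1 models keeps their out-of-fold predictions aligned for the level 2 stage:

# Minimal sketch: assign every row to one of 5 folds at random; the
# resulting index list is the kind of object passed to the folds
# argument of lgbm.cv in the model scripts below.
set.seed(11111)
fold_id <- sample(rep_len(1:5, length(label)))   # random fold id per row
folds <- split(seq_along(label), fold_id)        # list of 5 index vectors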

18.3.6 Software

The team used a variety of programming languages and tools; the code shown below is mostly R (LightGBM, XGBoost, H2O), with Python used for the hyperparameter optimization.

18.3.7 Code on GitHub

A detailed explanation of the code is given on GitHub.

The repository contains scripts for:

  • Pre-processing
  • Feature engineering
  • Modeling
  • Hyperparameter optimization using HyperOpt

18.3.7.1 Level 1 model scripts

Let's look at some of the model scripts. The lgbm.cv function used below is not the official LightGBM interface but a wrapper around the LightGBM command line binary (hence the lgbm_path and workingdir arguments).

18.3.7.1.1 GBM Model

temp_model <- lgbm.cv(y_train = label,
                      x_train = train,
                      x_test = test,
                      data_has_label = TRUE,
                      NA_value = "nan",
                      lgbm_path = my_lgbm_is_at,
                      workingdir = my_script_is_using,
                      files_exist = TRUE,
                      save_binary = FALSE,
                      validation = TRUE,
                      folds = folds,
                      predictions = TRUE,
                      importance = TRUE,
                      full_quiet = FALSE,
                      verbose = FALSE,
                      num_threads = threads, # The number of threads to run for LightGBM.
                      application = "binary",
                      learning_rate = eta, # The shrinkage rate applied to each iteration
                      num_iterations = 5000, # The number of boosting iterations 
                      early_stopping_rounds = 700, # Stop when the validation metric has not improved for this many consecutive boosting iterations
                      num_leaves = leaves, # The number of leaves in one tree
                      min_data_in_leaf = min_sample, # Minimum number of data in one leaf
                      min_sum_hessian_in_leaf = min_hess, # Minimum sum of hessians in one leaf to allow a split
                      max_bin = 255, # The maximum number of bins created per feature
                      feature_fraction = colsample, # Column subsampling percentage. For instance, 0.5 means selecting 50% of features randomly for each iteration
                      bagging_fraction = subsample, # Row subsampling percentage. For instance, 0.5 means selecting 50% of rows randomly for each iteration.
                      bagging_freq = sampling_freq, # The frequency of row subsampling 
                      is_unbalance = FALSE, #  For binary classification, setting this to TRUE might be useful when the training data is unbalanced
                      metric = "auc",
                      is_training_metric = TRUE, #  Whether to report the training metric in addition to the validation metric
                      is_sparse = FALSE) # Whether sparse optimization is enabled
18.3.7.1.2 XGBoost model

temp_model <- xgb.train(data = dtrain,
                        nthread = 12,
                        nrounds = floor(best_iter * 1.1), # max number of boosting iterations.
                        eta = 0.05, # control the learning rate: scale the contribution of each tree by a factor of 0 < eta < 1 when it is added to the current approximation
                        max_depth = 7, # maximum depth of a tree
                        #gamma = 20, #  minimum loss reduction required to make a further partition on a leaf node of the tree.
                        subsample = 0.9, # Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees 
                        colsample_bytree = 0.7, # subsample ratio of columns when constructing each tree
                        min_child_weight = 50, # minimum sum of instance weight (hessian) needed in a child
                        booster = "gbtree", # which booster to use, can be gbtree or gblinear
                        #feval = mcc_eval_nofail,
                        eval_metric = "auc",
                        maximize = TRUE,
                        objective = "binary:logistic",
                        verbose = TRUE,
                        prediction = TRUE,
                        watchlist = list(test = dtrain))
                        

18.3.7.2 Level 2 model scripts

18.3.7.2.1 70% weighted Random Forest (~0.485 CV)

First, read in the predictions of the level 1 models, which now serve as the features for the level 2 model:

train <- read_feather("Shubin/retrain_material/train.feather")
test <- read_feather("Shubin/retrain_material/test.feather")
train[, "xgb_jay_joost_v2"] <- fread("Laurae/20161110_xgb_jayjoost_fix2/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "xgb_jay_joost_v2"] <- fread("Laurae/20161110_xgb_jayjoost_fix2/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_jay_joost_v2"] <- fread("Laurae/20161111_lgbm_jayjoost/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_jay_joost_v2"] <- fread("Laurae/20161111_lgbm_jayjoost/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_jay"] <- fread("Laurae/20161111_lgbm_jay/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_jay"] <- fread("Laurae/20161111_lgbm_jay/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_mike"] <- fread("Laurae/20161110_lgbm_mike/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_mike"] <- fread("Laurae/20161110_lgbm_mike/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "xgb_mike"] <- fread("Laurae/20161110_xgb_mike/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "xgb_mike"] <- fread("Laurae/20161110_xgb_mike/aaa_stacker_preds_test_headerY_scale.csv")$x

Then train the level 2 model:

  temp_model <- h2o.randomForest(x = 1:12,
                                 y = "Response",
                                 training_frame = my_train[[i]],
                                 ntrees = 200, # Number of trees
                                 max_depth = 12, # Maximum tree depth
                                 min_rows = 20, # Fewest allowed (weighted) observations in a leaf
                                 seed = 11111)
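
To produce the numbers that enter the final blend, the fitted forest still has to score the level 2 test frame. A minimal sketch, assuming Response has been converted to a factor (so H2O treats this as binary classification) and a hypothetical H2OFrame my_test with the same 12 prediction columns:

# Positive-class probability from the level 2 random forest
pred_rf <- as.data.frame(h2o.predict(temp_model, my_test))$p1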
18.3.7.2.2 Hyperparameter optimization using HyperOpt

The models were implemented in R, while the hyperparameter optimization was implemented in Python.

Define the parameters to be optimized:

from hyperopt import hp

# Random Forest Params
params = {'n_estimators': 100}
params['random_state'] = 100
params['max_features'] = hp.choice('max_features', range(10, 199))
params['max_depth'] = hp.choice('max_depth', range(7, 30))
params['verbose'] = 10
params['n_jobs'] = -1

Run the optimizer from the hyperopt library:


from hyperopt import fmin, tpe, Trials

# Hyperopt
trials = Trials()
counter = 0
best = fmin(score_rf,          # objective function to minimize
            params,            # search space defined above
            algo=tpe.suggest,  # search algorithm (Tree-structured Parzen Estimator)
            max_evals=200,     # number of evaluations
            trials=trials)     # keep the full search history

Passing a Trials object gives access to the full search history:

  • trials.trials - a list of dictionaries representing everything about the search
  • trials.results - a list of dictionaries returned by ‘objective’ during the search
  • trials.losses() - a list of losses (float for each ‘ok’ trial)
  • trials.statuses() - a list of status strings