18.3 8th place solution with GitHub

The eighth-placed team was a team of eight; they presented their solution in the Kaggle discussion forum.

The team consisted of

  • Eight people
  • A private group
  • Organised via the internet

18.3.1 Overall architecture

A variety of models were combined:

  • LightGBM (gbm)
  • xgboost (xgb)
  • Random Forest (rf)
  • Neural Networks (their predictions did not get picked up at level 2, so they were removed)

18.3.2 Input data sets

The team created several data sets and used them with different models.

Level 1 data sets:

  • Data set 1 (0.477 gbm): order, raw numeric, date, categorical
  • Data set 2 (0.482 gbm, 0.477 xgb, 0.473 rf): order, path, raw numeric, date
  • Data set 3 (0.479 gbm, 0.473 xgb): order, path, numeric, date, refined categorical
  • Data set 4 (0.469 xgb, 0.442 rf): features sorted by numeric values, date features, path, and unsupervised nearest neighbors (L1 = Manhattan / L2 = Euclidean distances) per label (see the sketch after this list)
  • Data set 5 (0.43 xgb): path, unsupervised nearest neighbors
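
The nearest-neighbor features of data sets 4 and 5 can be illustrated with a small sketch. The snippet below is only a minimal, unoptimised illustration with hypothetical object names (train_num, test_num, train_label); it computes, for every row, the distance to its closest training row of a given label under the L1 (Manhattan) and L2 (Euclidean) norms.

# Minimal sketch (base R, hypothetical object names): for each row of x,
# the distance to the nearest training row carrying the label lbl.
nn_dist_features <- function(x, train_num, labels, lbl) {
  ref <- as.matrix(train_num[labels == lbl, , drop = FALSE])
  t(apply(as.matrix(x), 1, function(row) {
    diff <- sweep(ref, 2, row)               # difference to every reference row
    c(l1 = min(rowSums(abs(diff))),          # closest neighbor, Manhattan
      l2 = min(sqrt(rowSums(diff ^ 2))))     # closest neighbor, Euclidean
  }))
}

# e.g. distances to the closest positive and the closest negative example
nn_pos <- nn_dist_features(test_num, train_num, train_label, 1)
nn_neg <- nn_dist_features(test_num, train_num, train_label, 0)

On the real competition data the abundant missing values and the sheer data size would require a more careful (and much faster) implementation than this.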

The model had two stages; the second stage used the data set given below.

Level 2 data set:

  • Level 1 predictions (12 prediction columns from level 1)
  • Data set 5
  • Duplicate feature (count and position); one possible construction is sketched below
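
The duplicate feature is not spelled out in detail in the write-up; one possible reading, sketched below with data.table and a hypothetical feature table train_features, is to count how many rows share exactly the same values on a set of key columns and to record each row's position within its duplicate group.

# One possible (hypothetical) construction of the duplicate feature:
# group rows by their value pattern, then record the group size and the
# position of each row inside its group.
library(data.table)
dt <- as.data.table(train_features)                      # assumed feature table
key_cols <- setdiff(names(dt), c("Id", "Response"))
dt[, dup_count := .N,          by = key_cols]            # how many identical rows
dt[, dup_pos   := seq_len(.N), by = key_cols]            # position within the group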

18.3.3 Ensembling

Better performance can often be achieved by ensembling several models. It is good practice to combine dissimilar models, because their diversity reduces the variance of the combined prediction and improves overall performance. The final blend was a weighted average of the two level 2 models, as sketched after the list.

  • 30% weighted xgboost gbtree (~0.488 CV)
  • 70% weighted Random Forest (~0.485 CV)
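
In code the blend is nothing more than a weighted sum of the two level 2 prediction vectors (hypothetical names pred_xgb and pred_rf, aligned by Id):

# Final blend: 30% XGBoost gbtree, 70% Random Forest
final_pred <- 0.3 * pred_xgb + 0.7 * pred_rf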

18.3.4 Features

18.3.4.1 Features used

Features were created using several methods; a small illustration follows the list.

  • Maximum
  • Minimum
  • Kurtosis
  • Lead
  • Lag
  • One-hot encoding
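
A minimal sketch of these feature types, using data.table and hypothetical column names (start_time, cat_col); the kurtosis function comes from the e1071 package:

# Minimal sketch of the listed feature types (data.table, hypothetical
# column names; the team's real scripts are in the GitHub repository).
library(data.table)
dt <- as.data.table(train_numeric)                    # assumed numeric feature table
num_cols <- setdiff(names(dt), c("Id", "Response"))

dt[, row_max  := do.call(pmax, c(.SD, na.rm = TRUE)), .SDcols = num_cols]           # Maximum
dt[, row_min  := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = num_cols]           # Minimum
dt[, row_kurt := apply(.SD, 1, e1071::kurtosis, na.rm = TRUE), .SDcols = num_cols]  # Kurtosis

# Lead / lag of a hypothetical start_time column between consecutive Ids
setorder(dt, Id)
dt[, time_lag  := shift(start_time, type = "lag")]    # Lag
dt[, time_lead := shift(start_time, type = "lead")]   # Lead

# One-hot encoding of a hypothetical categorical column
# (missing values would need handling on the real data)
onehot <- model.matrix(~ 0 + factor(cat_col), data = train_categorical)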

18.3.5 Validation method

The validation method used was 5-fold cross-validation.
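
A minimal sketch of constructing such folds (hypothetical label vector); reusing one fold assignment across the level 1 models keeps their out-of-fold predictions aligned for the level 2 stage:

# Minimal sketch: assign every row to one of 5 folds at random; the
# resulting index list is the kind of object passed to the folds
# argument of lgbm.cv in the model scripts below.
set.seed(11111)
fold_id <- sample(rep_len(1:5, length(label)))   # random fold id per row
folds <- split(seq_along(label), fold_id)        # list of 5 index vectors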

18.3.6 Software

The team used a variety of programming languages and tools; the code shown below is mostly R (LightGBM, XGBoost, H2O), with Python used for the hyperparameter optimization.

18.3.7 Code on GitHub

A detailed explanation of the code is given on GitHub.

The repository contains scripts for:

  • Pre-processing
  • Feature engineering
  • Modeling
  • Hyperparameter optimization using HyperOpt

18.3.7.1 Level 1 model scripts

Let's look at some of the model scripts. The lgbm.cv function used below is not the official LightGBM interface but a wrapper around the LightGBM command line binary (hence the lgbm_path and workingdir arguments).

18.3.7.1.1 GBM Model

temp_model <- lgbm.cv(y_train = label,
                      x_train = train,
                      x_test = test,
                      data_has_label = TRUE,
                      NA_value = "nan",
                      lgbm_path = my_lgbm_is_at,
                      workingdir = my_script_is_using,
                      files_exist = TRUE,
                      save_binary = FALSE,
                      validation = TRUE,
                      folds = folds,
                      predictions = TRUE,
                      importance = TRUE,
                      full_quiet = FALSE,
                      verbose = FALSE,
                      num_threads = threads, # The number of threads to run for LightGBM.
                      application = "binary",
                      learning_rate = eta, # The shrinkage rate applied to each iteration
                      num_iterations = 5000, # The number of boosting iterations 
                      early_stopping_rounds = 700, # Stop when the validation metric has not improved for this many consecutive boosting iterations
                      num_leaves = leaves, # The number of leaves in one tree
                      min_data_in_leaf = min_sample, # Minimum number of data in one leaf
                      min_sum_hessian_in_leaf = min_hess, # Minimum sum of hessians in one leaf to allow a split
                      max_bin = 255, # The maximum number of bins created per feature
                      feature_fraction = colsample, # Column subsampling percentage. For instance, 0.5 means selecting 50% of features randomly for each iteration
                      bagging_fraction = subsample, # Row subsampling percentage. For instance, 0.5 means selecting 50% of rows randomly for each iteration.
                      bagging_freq = sampling_freq, # The frequency of row subsampling 
                      is_unbalance = FALSE, #  For binary classification, setting this to TRUE might be useful when the training data is unbalanced
                      metric = "auc",
                      is_training_metric = TRUE, #  Whether to report the training metric in addition to the validation metric
                      is_sparse = FALSE) # Whether sparse optimization is enabled
18.3.7.1.2 XGBoost model

temp_model <- xgb.train(data = dtrain,
                        nthread = 12,
                        nrounds = floor(best_iter * 1.1), # max number of boosting iterations.
                        eta = 0.05, # control the learning rate: scale the contribution of each tree by a factor of 0 < eta < 1 when it is added to the current approximation
                        max_depth = 7, # maximum depth of a tree
                        #gamma = 20, #  minimum loss reduction required to make a further partition on a leaf node of the tree.
                        subsample = 0.9, # Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees 
                        colsample_bytree = 0.7, # subsample ratio of columns when constructing each tree
                        min_child_weight = 50, # minimum sum of instance weight (hessian) needed in a child
                        booster = "gbtree", # which booster to use, can be gbtree or gblinear
                        #feval = mcc_eval_nofail,
                        eval_metric = "auc",
                        maximize = TRUE,
                        objective = "binary:logistic",
                        verbose = TRUE,
                        prediction = TRUE,
                        watchlist = list(test = dtrain))
                        

18.3.7.2 Level 2 model scripts

18.3.7.2.1 70% weighted Random Forest (~0.485 CV)

First, read in the predictions of the level 1 models, which now serve as the features for the level 2 model:

train <- read_feather("Shubin/retrain_material/train.feather")
test <- read_feather("Shubin/retrain_material/test.feather")
train[, "xgb_jay_joost_v2"] <- fread("Laurae/20161110_xgb_jayjoost_fix2/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "xgb_jay_joost_v2"] <- fread("Laurae/20161110_xgb_jayjoost_fix2/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_jay_joost_v2"] <- fread("Laurae/20161111_lgbm_jayjoost/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_jay_joost_v2"] <- fread("Laurae/20161111_lgbm_jayjoost/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_jay"] <- fread("Laurae/20161111_lgbm_jay/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_jay"] <- fread("Laurae/20161111_lgbm_jay/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "gbm_mike"] <- fread("Laurae/20161110_lgbm_mike/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "gbm_mike"] <- fread("Laurae/20161110_lgbm_mike/aaa_stacker_preds_test_headerY_scale.csv")$x
train[, "xgb_mike"] <- fread("Laurae/20161110_xgb_mike/aaa_stacker_preds_train_headerY_scale.csv")$x
test[, "xgb_mike"] <- fread("Laurae/20161110_xgb_mike/aaa_stacker_preds_test_headerY_scale.csv")$x

Then train the level 2 model:

  temp_model <- h2o.randomForest(x = 1:12,
                                 y = "Response",
                                 training_frame = my_train[[i]],
                                 ntrees = 200, # Number of trees
                                 max_depth = 12, # Maximum tree depth
                                 min_rows = 20, # Fewest allowed (weighted) observations in a leaf
                                 seed = 11111)
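
To produce the numbers that enter the final blend, the fitted forest still has to score the level 2 test frame. A minimal sketch, assuming Response has been converted to a factor (so H2O treats this as binary classification) and a hypothetical H2OFrame my_test with the same 12 prediction columns:

# Positive-class probability from the level 2 random forest
pred_rf <- as.data.frame(h2o.predict(temp_model, my_test))$p1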
18.3.7.2.2 Hyperparameter optimization using HyperOpt

The models were implemented in R, while the hyperparameter optimization was implemented in Python.

Define the parameters to be optimized:

from hyperopt import hp

# Random Forest Params
params = {'n_estimators': 100}
params['random_state'] = 100
params['max_features'] = hp.choice('max_features', range(10, 199))
params['max_depth'] = hp.choice('max_depth', range(7, 30))
params['verbose'] = 10
params['n_jobs'] = -1

Run the optimizer from the hyperopt library:


from hyperopt import fmin, tpe, Trials

# Hyperopt
trials = Trials()
counter = 0
best = fmin(score_rf,          # objective function to minimize
            params,            # search space defined above
            algo=tpe.suggest,  # search algorithm (Tree-structured Parzen Estimator)
            max_evals=200,     # number of evaluations
            trials=trials)     # keep the full search history

Passing a Trials object gives access to the full search history:

  • trials.trials - a list of dictionaries representing everything about the search
  • trials.results - a list of dictionaries returned by ‘objective’ during the search
  • trials.losses() - a list of losses (float for each ‘ok’ trial)
  • trials.statuses() - a list of status strings