## 7.4 Feature engineering

Sometimes even the best models deliver disappointing predictive performance. This can happen because key relationships between the predictors and the response are not directly available to the model.

Key relationships that are not directly available:

- Transformation of a predictor
- Interaction of two or more predictors such as a product or ratio
- Functional relationship among predictors
- Equivalent re-representation of a predictor

Feature engineering can be defined as:

> Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering. (Kuhn and Johnson 2018)

Since different disciplines work on machine learning (statisticians, computer scientists, neuroscientists), there are different terms for the same concepts. The variables that go into a model are called:

- Predictors
- Features
- Independent variables

The quantity being modeled is called:

- Prediction
- Outcome
- Response
- Dependent variable

So machine learning is nothing more spectacular than finding a function that maps the inputs (features) to the output (response) as closely as possible.

\[outcome = f(features) = f(X_1, X_2, \dots, X_p) = f(X)\]

\[\hat{Y} = \hat{f}(X)\]

### 7.4.1 Encoding Categorical Predictors

Categorical features are not quantitative in nature but **qualitative data**, i.e. discrete values taken from a limited set of categories, usually without a numeric scale.
Numerical features can be fed straight into a model, but how can categorical data be handled?

Categorical features:

- Gender
- Breeds of dogs
- Postal code

Ordered categorical features have a clear progression of values

Ordered categorical features:

- Bad
- Good
- Better

Ordered and unordered features might be handled differently to include the embedded information into a model.

Not all models need special treatment for categorical data

Models that can deal with categorical features:

- Tree based models
- Naive Bayes models

One-hot encoding vs. dummy variables

A categorical feature with \(C\) levels can be expanded into \(C\) binary indicator columns (**one-hot encoding**) or into \(C-1\) columns (**dummy variables**), where the dropped level serves as the reference. Dropping one column avoids perfect collinearity, which matters for linear models; tree based models usually work fine with the full one-hot set. For high cardinality, text-like data, learned embeddings such as word2vec or BERT map the categories into dense numeric vectors instead. A minimal sketch of both indicator encodings is given below.
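
The sketch below assumes pandas and scikit-learn are available and uses a made-up `breed` column purely for illustration.

```
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature
df = pd.DataFrame({"breed": ["beagle", "collie", "beagle", "poodle"]})

# Dummy variables: C - 1 indicator columns, the dropped level acts as the reference
dummies = pd.get_dummies(df["breed"], drop_first=True)
print(dummies)

# One-hot encoding: C indicator columns, one per level
enc = OneHotEncoder()
one_hot = enc.fit_transform(df[["breed"]]).toarray()
print(enc.categories_)
print(one_hot)
```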

These are ways to modify qualitative features. For continuous, real number values there exist other tools for converting them into features that can be better utilized by a model.

### 7.4.2 Engineering numeric features

Features with continuous, real number values might need some pre-processing before being useful for a model. The pre-processing can be model dependent.

Model dependent pre-processing:

- Skewed distribution
  - No problem for:
    - Trees (based on rank rather than values)
  - Problem for:
    - K-nearest neighbors
    - SVM
    - Neural networks
- Correlated features
  - No problem for:
    - Partial least squares
  - Problem for:
    - Neural networks
    - Multiple linear regression

Commonly occurring issues for numerical features and the corresponding countermeasures are treated in this chapter.

Commonly occurring issues:

- Vastly different scales
- Skewed distribution
- Small number of extreme values
- Values clipped at the low and/or high end of the range
- Complex relationship with response
- Redundant information

These problems are addressed by the remedies described in the following sections.

#### 7.4.2.1 1:1 Transformations

There are plenty of methods to transform a single feature; the first focus is on reshaping its distribution.

Box-Cox transformation

By applying the Box-Cox transformation, the **skewed distribution below** can be reshaped to be almost normally distributed, as shown below.

The Box-Cox transformation maps the original data by the following equation.

Box-Cox transformation equation:

\[x^{*}=\left\{\begin{array}{ll}{\frac{x^{\lambda}-1}{\lambda \tilde{x}^{\lambda-1}},} & {\lambda \neq 0} \\ {\tilde{x} \log x,} & {\lambda=0}\end{array}\right.\]

where

\(\tilde{x}\) is the geometric mean: \(\tilde{x}=\sqrt[n]{\prod_{i=1}^{n} x_{i}}=\sqrt[n]{x_{1} \cdot x_{2} \cdots x_{n}}\)

\(\lambda\) is estimated from the data; there are a few special cases.

Special cases for \(\lambda\):

- \(\lambda\) = 1 \(\implies\) no transformation
- \(\lambda\) = 0.5 \(\implies\) square root transformation
- \(\lambda\) = -1 \(\implies\) inverse transformation
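
A minimal sketch, assuming SciPy is available and the data are strictly positive (a requirement of Box-Cox); note that `scipy.stats.boxcox` uses the unscaled form \((x^{\lambda}-1)/\lambda\) without the geometric-mean factor shown above.

```
import numpy as np
from scipy import stats

# Strictly positive, right-skewed toy data (log-normal)
rng = np.random.RandomState(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# lambda is estimated from the data by maximum likelihood
x_transformed, lam = stats.boxcox(x)
print("estimated lambda:", round(lam, 2))
```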

#### 7.4.2.2 1:Many Transformations

A single numeric feature can be expanded to many features to improve model performance.

The cubic expansion can be written as:

\[f(x)=\sum_{i=1}^{3} \beta_{i} f_{i}(x)=\beta_{1} x+\beta_{2} x^{2}+\beta_{3} x^{3}\]

The \(\beta\) values can be calculated using linear regression. If the true trend were linear, the second and third regression parameters would be close to zero.
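
A minimal sketch of this expansion, assuming scikit-learn and a made-up cubic toy signal:

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.2, size=200)  # cubic trend plus noise

# Expand the single feature into x, x^2, x^3 and estimate the betas by linear regression
cubic = PolynomialFeatures(degree=3, include_bias=False)
X_cubic = cubic.fit_transform(x)
model = LinearRegression().fit(X_cubic, y)
print("beta_1, beta_2, beta_3:", np.round(model.coef_, 2))
```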

For example, the **Ames sale price vs. lot area** relationship, as given below, is clearly **not linear**.

Consider a spline with 6 regions: the basis functions of x are shown in figure *“(a)”* below, and figure *“(b)”* shows the resulting spline fit.

One feature was replaced by 6 hopefully more meaningful features for the model; a sketch of such a basis expansion is given below.
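
A minimal sketch, assuming scikit-learn ≥ 1.0 (which provides `SplineTransformer`); the made-up `lot_area` column only stands in for the Ames data.

```
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.RandomState(0)
lot_area = rng.uniform(1000, 30000, size=(100, 1))  # stand-in for the Ames lot area

# Cubic B-spline basis; n_knots and degree control how many columns are produced
spline = SplineTransformer(n_knots=5, degree=3)
basis = spline.fit_transform(lot_area)
print(basis.shape)  # (100, n_knots + degree - 1) = (100, 7)
```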

#### 7.4.2.3 Many:Many Transformations

Many to many transformations can solve a variety of issues

Many to many transformations help to:

- Deal with
  - outliers
  - collinearity (high correlation)
- Reduce dimensionality

Contrary to intuition, more features are not always beneficial

Considering irrelevant features leads to:

- Increased computational effort for model training
- Decreased predictive performance
- Complicated predictor importance calculations

There are several dimension reduction algorithms

Dimension reduction algorithms:

- Unsupervised
  - principal component analysis (PCA)
  - t-SNE
  - independent component analysis (ICA)
  - non-negative matrix factorization (NNMF)
  - autoencoders
- Supervised
  - partial least squares (PLS)
    - reduced predictor space is optimally associated with the response

The **embedding projector of TensorFlow** at http://projector.tensorflow.org gives great insight into PCA and t-SNE.
Below, a PCA projects the 200-dimensional Word2Vec space onto three dimensions; the word *“neural”* and nearby points are marked.

A few more details on some of the above-mentioned algorithms follow.

Principal component analysis (PCA)

Principal component analysis (PCA):

- Linear orthogonal transformation
- Maps possibly correlated variables
- Into a set of uncorrelated variables
  - the principal components

PCA maps a higher dimensional space to a lower dimensional space by linear orthogonal transformations. The orthogonality results from the condition that the principal components are uncorrelated. Each principal component explains as much of the remaining variance as possible. The principal components are weighted combinations of the independent variables. More on PCA can be found at Wikipedia: https://en.wikipedia.org/wiki/Principal_component_analysis

A PCA of the iris data set is given at https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py; it creates the graph below.

A minimal PCA Python example is given below, source https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

```
>>> import numpy as np
>>> from sklearn.decomposition import IncrementalPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> ipca = IncrementalPCA(n_components=2, batch_size=3)
>>> ipca.fit(X)
IncrementalPCA(batch_size=3, n_components=2)
>>> ipca.transform(X)
```

t-distributed stochastic neighbor embedding (t-SNE)

The next algorithm is t-distributed stochastic neighbor embedding (**t-SNE**), which focuses more on the distances between samples than on variance. The main concept is that samples which are nearby in the high dimensional space should also be nearby in the lower dimensional space.

t-SNE:

- t-distributed stochastic neighbor embedding
- Nonlinear dimensionality reduction
- Similar object modeled by nearby points

The t-SNE algorithm has important hyperparameters which need to be set with some insight. More on these hyperparameters and their settings can be found at https://distill.pub/2016/misread-tsne/ and in Laurens van der Maaten’s original paper on t-SNE (Maaten and Hinton 2008).
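
A minimal usage sketch, assuming scikit-learn and using the iris data as a stand-in; `perplexity` is one of the hyperparameters worth experimenting with.

```
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# Embed the 4-dimensional iris features into 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)  # (150, 2)
```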

Independent component analysis (ICA)

ICA is used in signal processing to separate different signal sources, e.g. to suppress the background noise in an audio signal, which can be very helpful when using a phone in a loud environment.

Independent component analysis (ICA):

- Origin in signal processing
- Separating a multivariate signal into additive subcomponents

The ability to separate signal sources can be seen in the following example. Whereas PCA picks up the directions of maximum variance, ICA assigns the signals to their sources.

The code is from https://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_vs_pca.html#sphx-glr-auto-examples-decomposition-plot-ica-vs-pca-py

```
"""
==========================
FastICA on 2D point clouds
==========================
This example illustrates visually in the feature space a comparison by
results using two different component analysis techniques.
:ref:`ICA` vs :ref:`PCA`.
Representing ICA in the feature space gives the view of 'geometric ICA':
ICA is an algorithm that finds directions in the feature space
corresponding to projections with high non-Gaussianity. These directions
need not be orthogonal in the original feature space, but they are
orthogonal in the whitened feature space, in which all directions
correspond to the same variance.
PCA, on the other hand, finds orthogonal directions in the raw feature
space that correspond to directions accounting for maximum variance.
Here we simulate independent sources using a highly non-Gaussian
process, 2 student T with a low number of degrees of freedom (top left
figure). We mix them to create observations (top right figure).
In this raw observation space, directions identified by PCA are
represented by orange vectors. We represent the signal in the PCA space,
after whitening by the variance corresponding to the PCA vectors (lower
left). Running ICA corresponds to finding a rotation in this space to
identify the directions of largest non-Gaussianity (lower right).
"""
print(__doc__)
# Authors: Alexandre Gramfort, Gael Varoquaux
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA
# #############################################################################
# Generate sample data
rng = np.random.RandomState(42)
S = rng.standard_t(1.5, size=(20000, 2))
S[:, 0] *= 2.

# Mix data
A = np.array([[1, 1], [0, 2]])  # Mixing matrix

X = np.dot(S, A.T)  # Generate observations

pca = PCA()
S_pca_ = pca.fit(X).transform(X)

ica = FastICA(random_state=rng)
S_ica_ = ica.fit(X).transform(X)  # Estimate the sources

S_ica_ /= S_ica_.std(axis=0)


# #############################################################################
# Plot results

def plot_samples(S, axis_list=None):
    plt.scatter(S[:, 0], S[:, 1], s=2, marker='o', zorder=10,
                color='steelblue', alpha=0.5)
    if axis_list is not None:
        colors = ['orange', 'red']
        for color, axis in zip(colors, axis_list):
            axis /= axis.std()
            x_axis, y_axis = axis
            # Trick to get legend to work
            plt.plot(0.1 * x_axis, 0.1 * y_axis, linewidth=2, color=color)
            plt.quiver(0, 0, x_axis, y_axis, zorder=11, width=0.01, scale=6,
                       color=color)

    plt.hlines(0, -3, 3)
    plt.vlines(0, -3, 3)
    plt.xlim(-3, 3)
    plt.ylim(-3, 3)
    plt.xlabel('x')
    plt.ylabel('y')


plt.figure()
plt.subplot(2, 2, 1)
plot_samples(S / S.std())
plt.title('True Independent Sources')

axis_list = [pca.components_.T, ica.mixing_]
plt.subplot(2, 2, 2)
plot_samples(X / np.std(X), axis_list=axis_list)
legend = plt.legend(['PCA', 'ICA'], loc='upper right')
legend.set_zorder(100)

plt.title('Observations')

plt.subplot(2, 2, 3)
plot_samples(S_pca_ / np.std(S_pca_, axis=0))
plt.title('PCA recovered signals')

plt.subplot(2, 2, 4)
plot_samples(S_ica_ / np.std(S_ica_))
plt.title('ICA recovered signals')

plt.subplots_adjust(0.09, 0.04, 0.94, 0.94, 0.26, 0.36)
plt.show()
```

Partial least squares

PCA does not take into consideration any of the response information. PLS is a supervised version of PCA whose reduced dimensions are optimally related to the response (Stone and Brooks 1990).

Partial least squares (PLS) facts:

- Supervised technique
- Objective
  - find linear functions (latent factors)
  - with optimal covariance with the response
  - each new dimension is uncorrelated with the previous ones
- Fewer PLS than PCA dimensions are needed for the same performance

The following example is from https://scikit-learn.org/stable/auto_examples/cross_decomposition/plot_compare_cross_decomposition.html#sphx-glr-auto-examples-cross-decomposition-plot-compare-cross-decomposition-py and shows the **PLS components of two multivariate covarying two-dimensional datasets, X, and Y.**

The scatterplot shows:

- components 1 in dataset X and dataset Y are maximally correlated
- components 2 in dataset X and dataset Y are maximally correlated
- correlation across datasets for different components is weak

The code of the example is given below.

```
"""
===================================
Compare cross decomposition methods
===================================
Simple usage of various cross decomposition algorithms:
- PLSCanonical
- PLSRegression, with multivariate response, a.k.a. PLS2
- PLSRegression, with univariate response, a.k.a. PLS1
- CCA
Given 2 multivariate covarying two-dimensional datasets, X, and Y,
PLS extracts the 'directions of covariance', i.e. the components of each
datasets that explain the most shared variance between both datasets.
This is apparent on the **scatterplot matrix** display: components 1 in
dataset X and dataset Y are maximally correlated (points lie around the
first diagonal). This is also true for components 2 in both dataset,
however, the correlation across datasets for different components is
weak: the point cloud is very spherical.
"""
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSCanonical, PLSRegression, CCA
# #############################################################################
# Dataset based latent variables model
n = 500
# 2 latents vars:
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)

latents = np.array([l1, l1, l2, l2]).T
X = latents + np.random.normal(size=4 * n).reshape((n, 4))
Y = latents + np.random.normal(size=4 * n).reshape((n, 4))

X_train = X[:n // 2]
Y_train = Y[:n // 2]
X_test = X[n // 2:]
Y_test = Y[n // 2:]

print("Corr(X)")
print(np.round(np.corrcoef(X.T), 2))
print("Corr(Y)")
print(np.round(np.corrcoef(Y.T), 2))

# #############################################################################
# Canonical (symmetric) PLS

# Transform data
# ~~~~~~~~~~~~~~
plsca = PLSCanonical(n_components=2)
plsca.fit(X_train, Y_train)
X_train_r, Y_train_r = plsca.transform(X_train, Y_train)
X_test_r, Y_test_r = plsca.transform(X_test, Y_test)

# Scatter plot of scores
# ~~~~~~~~~~~~~~~~~~~~~~
# 1) On diagonal plot X vs Y scores on each components
plt.figure(figsize=(12, 8))
plt.subplot(221)
plt.scatter(X_train_r[:, 0], Y_train_r[:, 0], label="train",
            marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 0], Y_test_r[:, 0], label="test",
            marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 1: X vs Y (test corr = %.2f)' %
          np.corrcoef(X_test_r[:, 0], Y_test_r[:, 0])[0, 1])
plt.xticks(())
plt.yticks(())
plt.legend(loc="best")

plt.subplot(224)
plt.scatter(X_train_r[:, 1], Y_train_r[:, 1], label="train",
            marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 1], Y_test_r[:, 1], label="test",
            marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 2: X vs Y (test corr = %.2f)' %
          np.corrcoef(X_test_r[:, 1], Y_test_r[:, 1])[0, 1])
plt.xticks(())
plt.yticks(())
plt.legend(loc="best")

# 2) Off diagonal plot components 1 vs 2 for X and Y
plt.subplot(222)
plt.scatter(X_train_r[:, 0], X_train_r[:, 1], label="train",
            marker="*", c="b", s=50)
plt.scatter(X_test_r[:, 0], X_test_r[:, 1], label="test",
            marker="*", c="r", s=50)
plt.xlabel("X comp. 1")
plt.ylabel("X comp. 2")
plt.title('X comp. 1 vs X comp. 2 (test corr = %.2f)'
          % np.corrcoef(X_test_r[:, 0], X_test_r[:, 1])[0, 1])
plt.legend(loc="best")
plt.xticks(())
plt.yticks(())

plt.subplot(223)
plt.scatter(Y_train_r[:, 0], Y_train_r[:, 1], label="train",
            marker="*", c="b", s=50)
plt.scatter(Y_test_r[:, 0], Y_test_r[:, 1], label="test",
            marker="*", c="r", s=50)
plt.xlabel("Y comp. 1")
plt.ylabel("Y comp. 2")
plt.title('Y comp. 1 vs Y comp. 2 , (test corr = %.2f)'
          % np.corrcoef(Y_test_r[:, 0], Y_test_r[:, 1])[0, 1])
plt.legend(loc="best")
plt.xticks(())
plt.yticks(())
plt.show()

# #############################################################################
# PLS regression, with multivariate response, a.k.a. PLS2

n = 1000
q = 3
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
B = np.array([[1, 2] + [0] * (p - 2)] * q).T
# each Yj = 1*X1 + 2*X2 + noize
Y = np.dot(X, B) + np.random.normal(size=n * q).reshape((n, q)) + 5

pls2 = PLSRegression(n_components=3)
pls2.fit(X, Y)
print("True B (such that: Y = XB + Err)")
print(B)
# compare pls2.coef_ with B
print("Estimated B")
print(np.round(pls2.coef_, 1))
pls2.predict(X)

# PLS regression, with univariate response, a.k.a. PLS1
n = 1000
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
y = X[:, 0] + 2 * X[:, 1] + np.random.normal(size=n * 1) + 5
pls1 = PLSRegression(n_components=3)
pls1.fit(X, y)
# note that the number of components exceeds 1 (the dimension of y)
print("Estimated betas")
print(np.round(pls1.coef_, 1))

# #############################################################################
# CCA (PLS mode B with symmetric deflation)
cca = CCA(n_components=2)
cca.fit(X_train, Y_train)
X_train_r, Y_train_r = cca.transform(X_train, Y_train)
X_test_r, Y_test_r = cca.transform(X_test, Y_test)
```

Autoencoders

Autoencoders learn a latent representation of data by feeding them through a neural network with a bottleneck, i.e. a layer with fewer units than the input layer. Training an autoencoder is unsupervised, since the training task is to produce an output that is as similar to the input as possible. A minimal sketch follows the list below.

Further explanation is given in an example implemented in Keras at https://towardsdatascience.com/autoencoders-made-simple-6f59e2ab37ef

Autoencoders:

- Type of neural network
- Feeds information through bottleneck
- Input and output shall be as similar as possible
- Representation at bottleneck is dimensionality reduced
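
A minimal sketch of this idea, assuming TensorFlow/Keras is installed; the layer sizes and the 2-unit bottleneck are arbitrary illustrations, not a recommendation.

```
import numpy as np
import tensorflow as tf

# Toy data: 1000 samples with 10 features
X = np.random.rand(1000, 10).astype("float32")

# Encoder compresses 10 -> 2, decoder reconstructs 2 -> 10
inputs = tf.keras.Input(shape=(10,))
bottleneck = tf.keras.layers.Dense(2, activation="relu", name="bottleneck")(inputs)
outputs = tf.keras.layers.Dense(10)(bottleneck)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # unsupervised: input == target

# The dimensionality reduced representation is the bottleneck activation
encoder = tf.keras.Model(inputs, bottleneck)
print(encoder.predict(X, verbose=0).shape)  # (1000, 2)
```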

### 7.4.3 Feature importance

After creating features, the next step is to find out which features are helpful for the model. First, two methods are compared.

Feature importance analysis methods:

- Permutation Importance
  - permute one feature
  - calculate the change in prediction performance
  - the bigger the drop in prediction performance \(\implies\) the more important the feature
- Random Forest Feature Importance
  - during training, compute the decrease of impurity at each split
  - average the decrease over all trees for each feature
  - ranking according to the decrease \(\implies\) ranking of feature importance

An example shows the strengths and weaknesses of the algorithms. The example *“Permutation Importance vs Random Forest Feature Importance”* can be found at the Scikit-Learn (Pedregosa et al. 2011) site https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html

Comparing test and training accuracy gives an indication that the **model might have overfitted**

Model might be overfitted:

RF train accuracy: 1.000

RF test accuracy: 0.817

The most important feature is **random_num**, a totally random numeric feature which the model has used to memorize the training data set but which is useless for predicting anything. The misleading importance measure results from two causes

Causes of wrong feature importance:

- impurity-based importances are biased towards high cardinality features
- impurity-based importances
  - are computed on training set statistics
  - do not reflect the ability of a feature to be useful for predictions that generalize to the test set

Permutation importance

Computing the permutation importance on held out test data shows that the low cardinality categorical feature **sex** has the highest importance.

Permutation importance:

- Most important feature: sex
- Random features low importance

Computing the permutation importance on the training data shows that the random features get significantly higher importance ranking than when computed on the test set.
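
A minimal sketch of both measures, assuming scikit-learn ≥ 0.22 (which provides `permutation_importance`) and a synthetic dataset instead of the Titanic data used in the linked example:

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from training set statistics
print("impurity-based:", rf.feature_importances_.round(2))

# Permutation importance: drop in test accuracy when one feature is shuffled
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("permutation:", result.importances_mean.round(2))
```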

Permutation Importance with Multicollinear or Correlated Features

Another example investigates the problem of multicollinear or correlated features. The example *“Permutation Importance with Multicollinear or Correlated Features”* can be found at the Scikit-Learn (Pedregosa et al. 2011) site https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html

Facts of example:

- Wisconsin breast cancer dataset
- RandomForestClassifier 97% accuracy on a test data set
- Contains multicollinear features

The permutation importance plot shows that permuting a feature drops the accuracy by at most 0.012, which suggests that none of the features are important.

Permutation importance:

- highest drop = 0.012 \(\implies\) no feature is important
- when features are collinear
  - permuting one feature will have little effect
- how to handle collinear features
  - perform hierarchical clustering on the Spearman rank-order correlations^{19}
  - pick a threshold
    - in graph below = 2.0
  - keep a single feature from each cluster
    - in graph below 8 clusters \(\implies\) 8 features

In the image below, a dendrogram of the hierarchical clustering on the left-hand side is derived from the correlation matrix on the right-hand side.

If a threshold of 1.0 is chosen, then the accuracy stays at 0.97 with 14 features instead of 30. A sketch of the clustering step is given below.
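
A minimal sketch of the clustering step, assuming SciPy and scikit-learn; the correlations are turned into distances here, which differs slightly from the linked example's code, so the exact cluster count may not match the numbers quoted above.

```
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Spearman rank-order correlations between all 30 features
corr = spearmanr(X).correlation

# Convert correlations to distances and cluster hierarchically (Ward linkage)
distance = 1 - np.abs(corr)
linkage = hierarchy.ward(squareform(distance, checks=False))

# Cut the dendrogram at a threshold and keep a single feature per cluster
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")
selected = [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
print(len(selected), "features kept out of", X.shape[1])
```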

### References

Pedregosa, Fabian, et al. 2011. “Scikit-Learn: Machine Learning in Python.” *Journal of Machine Learning Research* 12: 2825–30.

Stone, M., and R. J. Brooks. 1990. “Continuum Regression: Cross-Validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Components Regression.” *Journal of the Royal Statistical Society: Series B (Methodological)* 52 (2): 237–58.

19. The Spearman rank-order correlation coefficient (Spearman’s correlation, for short) is a nonparametric measure of the strength and direction of association that exists between two variables measured on at least an ordinal scale. See https://statistics.laerd.com/spss-tutorials/spearmans-rank-order-correlation-using-spss-statistics.php