Dimensionality Reduction
Dimensionality reduction means deriving a new, smaller set of artificial features from the original features while retaining most of the variance in the original data.
Each added feature multiplies the amount of training data needed to cover the feature space. This is known as the curse of dimensionality.
- My samples have features f1, f2, f3, ..., fn.
- How many features should I be using?
- Which of the features should I be using?
Dimensionality reduction is essential for keeping training times feasible. The act of preprocessing data to reduce feature dimensions is a form of feature extraction.
- Principal Component Analysis (PCA): finds a configurable number of orthogonal directions/vectors that explain the maximum amount of variance in the feature space.
- Incremental Principal Component Analysis (IPCA): plain PCA requires all the data to fit in memory, while IPCA achieves almost as good results while allowing the data to be fitted in multiple batches.
- Kernel Principal Component Analysis (KPCA): extends PCA with the use of kernels.
- Sparse Principal Component Analysis (SPCA): tunes PCA so it extracts the set of sparse components that best reconstruct the data.
- Non-Negative Matrix Factorization (NMF): similar to PCA but with the additional constraint that all values must be non-negative.
- Independent Component Analysis (ICA): separates a multivariate signal into additive subcomponents that are maximally independent. Used to separate superimposed signals, not to reduce dimensionality per se. Good at separating stacked audio signals from each other (see the sketch after this list). If N sources are present, at least N observations (e.g. microphones) are needed to recover the original signals. Strongly assumes that the source signals are independent of each other.
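As an illustration of the ICA point above, here is a minimal blind-source-separation sketch using scikit-learn's FastICA. The two toy source signals and the mixing matrix are invented for the example, not taken from the sources below.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent toy source signals (made up for illustration).
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                   # sinusoidal source
s2 = np.sign(np.sin(3 * t))          # square-wave source
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((2000, 2))

# Mix the sources: at least as many observations ("microphones") as
# sources are needed, so a 2x2 mixing matrix gives two observed signals.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# Recover maximally independent components from the mixture.
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)   # recovered sources, up to scale and order
```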
Principal Component Analysis
How PCA works:
- From the total feature space, find the direction/vector of maximum variance, called "component 1"; this direction contains most of the differentiating information.
- Find "component 2", the direction orthogonal to "component 1" that carries the most remaining information. These are called the "principal components". There are as many principal components as there are features, but you choose how many to extract.
- Subtract the mean from the data in each direction, which centers the feature space around zero.
- Rotate the data so that "component 1" becomes the x-axis and "component 2" becomes the y-axis.
- We can choose to keep only some of the principal components.
- We undo the rotation and add the mean back to the data. (A NumPy sketch of these steps follows below.)
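The steps above can be written out directly with NumPy. This is a rough sketch using the SVD of the centered data (one standard way to compute the components), not code taken from the sources.

```python
import numpy as np

def pca_reduce_and_restore(X, n_components):
    """Project X onto its first principal components, then map back."""
    # Subtract the mean so the feature space is centered around zero.
    mean = X.mean(axis=0)
    X_centered = X - mean

    # The right singular vectors of the centered data are the principal
    # component directions, ordered by how much variance they explain.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]              # (n_components, n_features)

    # "Rotate" the data into the component coordinate system and keep
    # only the chosen number of components.
    X_reduced = X_centered @ components.T

    # Undo the rotation and add the mean back: a lossy reconstruction
    # in the original feature space.
    X_restored = X_reduced @ components + mean
    return X_reduced, X_restored

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))               # toy data with 5 features
X_reduced, X_restored = pca_reduce_and_restore(X, n_components=2)
```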
PCA is frequently used to visualize high-dimensional datasets. Generating pairwise scatter plots for 30 features would already produce 435 charts. With PCA, you can take two principal components and plot them, although it might not be easy to interpret how those axes relate to the original features.
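For example, the breast cancer dataset bundled with scikit-learn has exactly 30 features and can be squeezed into a single two-component scatter plot; the scaling and plotting choices here are mine, not prescribed by the sources.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 30 numeric features per sample.
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# Project the 30 features onto the first two principal components.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target, s=15)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()
```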
PCA is great at reducing image dimensionality. Using the first 100 principal components of an image is a lot easier than using the RGB values of every pixel.
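A sketch of the image case, using scikit-learn's Olivetti faces dataset (grayscale 64x64 images rather than RGB, but the same idea). The dataset choice is my own for illustration, and it is downloaded on first use.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# 400 face images of 64x64 pixels, flattened to 4096-feature vectors.
faces = fetch_olivetti_faces()
X = faces.data                         # shape (400, 4096)

# Keep only the first 100 principal components per image.
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X)       # shape (400, 100)

# Reconstruct approximate images from those 100 components.
images = pca.inverse_transform(X_reduced).reshape(-1, 64, 64)
```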
PCA whitening rescales the principal components to the same scale: the data is projected onto the singular space while each component is scaled to unit variance, which is helpful if the next user of the new features has strong assumptions about the isotropy of the data.
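A quick way to see what whitening does, on made-up correlated data whose features have very different variances:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: two correlated features with very different variances.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2)) @ np.array([[3.0, 0.0],
                                               [1.0, 0.2]])

# With whiten=True each projected component is rescaled to unit variance.
X_white = PCA(n_components=2, whiten=True).fit_transform(X)
print(X_white.std(axis=0))             # approximately [1., 1.]
```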
Non-Negative Matrix Factorization
NMF is mainly used to find interesting patterns and features in data. NMF is not as good at reconstructing or encoding data as PCA.
NMF requires all input feature values to be non-negative.
Component directions always start from the origin (0, 0).
All components in NMF are of equal importance.
NMF works best for additive data, such as audio, gene expression, and text.
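A minimal sketch of the text case: factor a small TF-IDF matrix (non-negative by construction) into two components with scikit-learn's NMF. The toy corpus is invented for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus; TF-IDF values are non-negative, as NMF requires.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
]
X = TfidfVectorizer().fit_transform(docs)

# Factor X into W @ H with two non-negative components ("topics").
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)       # document-to-component weights
H = nmf.components_            # component-to-word weights
```

Each row of H can be read as a "pattern" (e.g. a topic of co-occurring words), which is the kind of interpretable structure NMF is typically used for.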
Sources
- sklearn - Decomposing signals in components
- Introduction to Machine Learning with Python, Andreas C. Müller, Sarah Guido
- Wikipedia