Feature Engineering
Features are how data is presented to a model; they convey information that the model may not be able to infer by itself.
There are several ways to derive features (a small sketch follows the list):
- Derivatives: Infer new information from existing information
  - Example: Determining what day of the week a given date falls on
- Enrichment: Add new external information
  - Example: Checking if a particular day is a public holiday
- Encoding: Present the relevant information differently
  - Example: Converting day from specific day to weekday/weekend classification
- Combination: Link features together
  - Example: Weighting backlog size based on individual item complexity
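As an illustration, here is a minimal pandas sketch of all four kinds on a made-up table; the column names, dates and holiday list are assumptions, not from the text:

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-12-24", "2024-12-28"]),
    "backlog_size": [10, 4],
    "avg_item_complexity": [2.5, 1.0],
})

holidays = pd.to_datetime(["2024-12-25", "2024-12-26"])                   # external holiday calendar

df["day_of_week"] = df["date"].dt.dayofweek                               # derivative
df["is_holiday"] = df["date"].isin(holidays).astype(int)                  # enrichment
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)                   # encoding
df["weighted_backlog"] = df["backlog_size"] * df["avg_item_complexity"]   # combination
print(df)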
Feature engineering is transforming raw data to present the underlying information as effectively as possible, so the learning algorithm spends minimal effort wading through noise.
"Applied machine learning" is basically feature engineering.
— Andrew Ng
Feature engineering is designing what to extract from your samples: using domain knowledge of the data and basic math to create features that make machine learning algorithms work. It focuses on compressing each raw sample so that no relevant information is lost.
Turning a sample into quantitative traits is called feature extraction. We want to represent each sample as an array (vector) of real numbers (features).
Sample = [Feature1, Feature2, Feature3, ..., FeatureN]
Feature engineering can be automatic or manual. Manual feature engineering is slow, but there is no good automatic feature engineering solution, only shortcuts.
We have color images that are 1280x720 pixels.
Original = 1280x720x3 = 2,764,800 features, far too many
Greyscale = 1280x720x1 = 921,600 features
Downsampled 5x = 256x144x1 = 36,864 features, possible to use now
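A minimal sketch of that reduction, assuming scikit-image is installed; the image is a random placeholder rather than a real photo:

import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

image = np.random.rand(720, 1280, 3)      # stands in for a loaded 1280x720 color image
gray = rgb2gray(image)                    # drop color, keep brightness: 720x1280
small = resize(gray, (144, 256))          # downsample 5x in each dimension
features = small.ravel()                  # flatten into a feature vector
print(features.shape)                     # (36864,)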
Categorical Features
Use one-hot encoding. Avoid turning categorical features into scalar values: it is tempting, but it adds an artificial ordering to your data.
Clothing is Red, Green or Blue.
clothing_color: 0 (red), 1 (green), 2 (blue)
but now green is more "blue" than red,
and the "average" clothing color might come out around
1 (green) even if there is not a single green sample.
better:
clothing_red: 0 or 1
clothing_green: 0 or 1
clothing_blue: 0 or 1
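A minimal one-hot encoding sketch with pandas; the column name and values mirror the example above:

import pandas as pd

df = pd.DataFrame({"clothing_color": ["red", "green", "blue", "red"]})
one_hot = pd.get_dummies(df, columns=["clothing_color"])   # one indicator column per color
print(one_hot)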
Scalar Features
If your features are pure numbers, you can usually use them as is.
Bucketing or Binning is representing a numerical attribute as a categorical one. It reduces noise and overfitting. One-hot encoding is usually also applied (see the sketch after the examples below).
Age:
123
vs
age-young: 0 or 1 (1-10)
age-teen: 0 or 1 (11-18)
age-young-adult: 0 or 1 (19-25)
etc.
Location:
Longitude-Latitude
vs.
London, Paris
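A minimal bucketing sketch with pandas; the bucket edges mirror the age ranges above, and the upper bound of 120 is an assumption:

import pandas as pd

ages = pd.Series([4, 15, 22, 37])
buckets = pd.cut(ages, bins=[0, 10, 18, 25, 120],
                 labels=["young", "teen", "young-adult", "adult"])
one_hot = pd.get_dummies(buckets, prefix="age")   # one indicator column per bucket
print(one_hot)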
Text Features
Bag of Words / Word Count: Count the frequency of each word or pair of consecutive words in each document.
Term Frequency-Inverse Document Frequency (TF-IDF): Weight the word counts by a measure of how rarely they appear across the documents. This avoids putting too much weight on common words.
Text is a natural sequence. Break into a sequence of characters or words and feed it into one of various recurrent statistical models.
Tokenization: Splitting text into sentences, words or smaller pieces, which are then turned into numbers with some mapping. The numbers are the "tokens".
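A minimal bag-of-words and TF-IDF sketch with scikit-learn; the two documents are made up:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

counts = CountVectorizer().fit_transform(docs)   # raw word counts per document
tfidf = TfidfVectorizer().fit_transform(docs)    # counts reweighted so common words count less
print(counts.shape, tfidf.shape)                 # (2 documents, vocabulary size)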
Timeseries Features
Record time if you believe there is a relationship between time and the other attributes.
Full timestamps are frequently unnecessary. Using epoch time or every component (year, month, day, hour, minute, second) would just add noise.
Predicting traffic levels in a city most likely only needs:
hour, is-weekday, is-holiday, is-next-day-holiday, is-previous-day-holiday
Prefer local time without timezone. Otherwise, timezones might cause trouble in the future.
Timeseries data is a natural sequence. Split it up and feed it into a recurrent statistical model.
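A minimal sketch of extracting such features with pandas; the timestamps are made up, and the holiday flags would need an external calendar (enrichment), so they are left as a comment:

import pandas as pd

timestamps = pd.to_datetime(["2024-03-01 08:30", "2024-03-02 17:45"])
features = pd.DataFrame({
    "hour": timestamps.hour,
    "is_weekday": (timestamps.dayofweek < 5).astype(int),
    # is-holiday / is-next-day-holiday would come from an external holiday calendar
})
print(features)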
Image Samples
Rescale: Resize the image to a fixed size. Downsampling means reducing the image resolution.
Greyscale: Remove color information but maintain brightness information.
Vectorization: Take all raw pixel values, with or without luminosity normalization.
Signal Transform: Use per-pixel gradients, wavelet transforms, etc.
Similarity Calculation: Compute Euclidean, Manhattan or cosine similarity to a reference prototype image. The prototype image can be preset or extracted using an unsupervised algorithm.
Local Feature Extraction: Split the image into smaller regions and perform feature extraction in each area.
Histogram of Gradients (HOG), with a code sketch after the steps:
- Optionally pre-normalize images. This leads to features that resist dependence on variations in illumination.
- Convolve the image with two filters that are sensitive to horizontal and vertical brightness gradients. These capture edge, contour, and texture information.
- Subdivide the image into cells of a predetermined size, and compute a histogram of the gradient orientations within each cell.
- Normalize the histograms in each cell by comparing to the block of neighboring cells. This further suppresses the effect of illumination across the image.
- Construct a one-dimensional feature vector from the information in each cell.
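A minimal HOG sketch with scikit-image; the image is a random placeholder and the parameter values are common defaults, not from the text:

import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)          # placeholder greyscale image
features = hog(image,
               orientations=9,           # gradient orientation bins per cell
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))   # block normalization against neighboring cells
print(features.shape)                    # a one-dimensional feature vector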
Sound Samples
Same approaches work as with images, but there is one dimension instead of two. More advanced feature extraction methods might split the sound into multiple channels.
Audio is a natural sequence. Chop audio spectrogram into chunks and feed that into one of various recurrent statistical models.
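A minimal spectrogram sketch with scipy; the waveform is a synthetic tone rather than real audio:

import numpy as np
from scipy.signal import spectrogram

fs = 16000                                  # sample rate in Hz
t = np.linspace(0, 1, fs, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)      # one second of a 440 Hz tone
freqs, times, spec = spectrogram(waveform, fs=fs)
print(spec.shape)                           # (frequency bins, time frames) to chop into chunks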
Video Samples
The same approaches work as with images, since a video is just a sequence of images. But you end up with a lot of features per sample, as one video clip is usually treated as a single sample.
You might sample frames so that only a subset is analyzed. Analyzing every frame of a 60 FPS video is rarely worth it.
Images in a video are time-dependent. This might require you to use recurrent neural networks, e.g., a long short-term memory network.
Sound may or may not be included; whether it carries additional information depends on your use case.
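A minimal frame-sampling sketch with OpenCV; the file name is a hypothetical placeholder and the sampling interval is an arbitrary choice:

import cv2

cap = cv2.VideoCapture("video.mp4")          # hypothetical input file
frames = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % 30 == 0:                      # keep only every 30th frame
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, (256, 144)))
    index += 1
cap.release()
print(len(frames), "sampled frames")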
Missing Value Treatment
What do you do when a sample is missing values?
Exclusion:
- List-wise Exclusion: Exclude samples that are missing any of the variables.
- Pair-wise Exclusion: Exclude samples that are missing the variable of interest.
Imputation:
- Generalized Imputation: Calculate mean/median of the variable from other samples and use that.
- Similar Case Imputation: Use mean/median of similar samples e.g., height of "Male" or "Female".
- KNN Imputation: Use k-nearest neighbor to fill in the missing value, but it can be time-consuming with large datasets.
- Predictive Imputation: Train a predictive model to fill in the missing values. This requires that there is some relationship between the missing values and the rest of the values. Use simple approaches, e.g., regression (see the sketch after this list).
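A minimal imputation sketch with scikit-learn; the tiny matrix is made up:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

median_filled = SimpleImputer(strategy="median").fit_transform(X)   # generalized imputation
knn_filled = KNNImputer(n_neighbors=1).fit_transform(X)             # KNN imputation
print(median_filled)
print(knn_filled)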
Outlier Handling
Outliers are samples that are vastly different from others and add unnecessary noise to machine learning.
There are two types of outliers:
- Univariate Outlier: One value is abnormal in the context of a single variable.
- Multi-variate Outlier: The abnormality happens in n dimensions, so finding it requires looking at multidimensional distributions.
Sources of outliers:
- Data Entry Error: Human errors when collecting data.
- Measurement Error: Measure instrument is faulty.
- Experimental Error: Unrelated event caused an abnormality to the measurement.
- Intentional Outlier: The value was purposely reported wrong e.g., lying in tests.
- Data Processing Error: Data mining, transform or merging code has a bug.
- Sampling Error: Some samples are in the wrong category.
- Natural Outlier: All data is accurate, the sample just is an outlier.
Options for dealing with outliers (a small sketch follows the list):
- Delete the sample.
- Limit the features with transforms or bucketing.
- Impute the feature with mean, median or mode.
- Treat outliers separately, but this only works if there is a significant number of outliers.
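A minimal univariate outlier sketch using the interquartile range rule with numpy; the 1.5 x IQR cutoff is a common rule of thumb, not something from the text:

import numpy as np

values = np.array([12.0, 14.0, 13.0, 15.0, 94.0])      # 94 is the obvious outlier
q1, q3 = np.percentile(values, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

kept = values[(values >= lower) & (values <= upper)]    # delete the outlying sample
capped = np.clip(values, lower, upper)                  # or limit the feature instead
print(kept, capped)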
Feature Crosses
Feature crosses combine two or more categorical features into a single feature by taking the cross product of all their possible values.
A is A1 or A2.
B is B1 or B2.
Possible pairs: (A1, B1) (A1, B2) (A2, B1) (A2, B2)
And now you can give these combinations "names" and use one-hot encoding.
Like (male, Canadian) = 0 or 1
You can also apply feature crosses to scalar values; first bucket the scalar values into categories.
Feature crosses are most beneficial when you have a lot of data.
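A minimal feature-cross sketch with pandas; the column names and values are illustrative:

import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "nationality": ["Canadian", "Canadian", "French"],
})
df["gender_x_nationality"] = df["gender"] + "_" + df["nationality"]   # cross the two categories
crossed = pd.get_dummies(df["gender_x_nationality"])                  # one-hot encode each pair
print(crossed)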
Feature Importance
A feature importance score tells which features provide the most information. Aim to remove the less important features; unused features create technical debt. Decision trees are good at producing feature importance scores.
Filter Methods: Assign a statistical importance score to each feature and remove the least important ones, e.g., chi-squared test, information gain and correlation coefficient score.
Wrapper Methods: Treat the feature set as a search problem and try different combinations to score features, e.g., the recursive feature elimination algorithm.
Embedded Methods: Learn which features contribute most to accuracy while training a model, e.g., with Lasso, Ridge and ElasticNet regularization.
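A minimal sketch of tree-based importance scores with scikit-learn, in the spirit of the decision-tree scores mentioned above; the dataset is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.feature_importances_)    # higher score means the feature carries more information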
Feature selection checklist:
- Do you have domain knowledge? If yes, construct a better set of features.
- Are features commensurate? If no, maybe normalize them.
- Do you suspect interdependence of features? If yes, maybe expand your feature set by constructing conjunctive features or product of features.
- Do you need to prune the input variables? If no, construct disjunctive features or weighted sums of features.
- Do you need to assess features individually? If yes, use a variable ranking method to get a baseline.
- Do you need a prediction? If no, stop building the model now.
- Do you suspect your data is dirty? If yes, visualize your data and check for outliers.
- Do you know what to try first? If no, use a linear predictor.
- Do you have resources to spend time on this problem? If yes, compare a couple of feature selection methods with linear and non-linear predictors.
- Do you need a stable solution? If yes, subsample your data and redo your analysis a couple of times.
Tips and Tricks
Transform non-linear relationships into linear relationships. Log transformation is one of the most commonly used.
Some algorithms require that all features are in comparable ranges. For example, age and income have quite different scales.
Some algorithms require standardized data. You get zero mean and unit variance using z-score normalization, a.k.a. standardization.
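A minimal scaling and log-transform sketch with scikit-learn and numpy; the age and income values are made up:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 30000.0],
              [40, 52000.0],
              [58, 41000.0]])                       # age, income on very different scales

income_log = np.log1p(X[:, 1])                      # log transform to tame the skewed income
X_std = StandardScaler().fit_transform(X)           # z-score: zero mean, unit variance per column
print(income_log)
print(X_std.mean(axis=0), X_std.std(axis=0))        # approximately [0, 0] and [1, 1]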
Try out various ratios and proportions.
You are predicting apple sales.
- Don't use just "monthly apples sold" as the only feature.
- Use "daily apples sold / sales person"
or "apples sold / marketing spend".
Reframing numbers can expose relevant structure.
Item weight is in grams = 6289
Kilograms might work better = 6.289
And even additional rounding = 6
Or splitting the feature = 6 and 289 (remainder grams)
Try transforming running totals into per-period rates. This works especially well when predictions vary between seasons.
number_of_purchases
vs number_of_purchases_last_year
vs number_of_purchases_last_summer, _fall, _winter, _spring
Domain knowledge allows creating magic domain numbers.
We know that packages over 4 kg have higher tax rate.
So we add a binary feature "item_weight_over_high_tax_rate_threshold", 0 or 1.
Look for errors in predictions to create new features. For example, if your model predicts a lot of the longer documents wrong, add word count as a feature.
Feature Learning: Automatically finding and using features in raw data, e.g., with autoencoders and restricted Boltzmann machines.