These are my notes that I condensed from here:
Feature engineering is the technique of extracting more information from existing data. You are not adding any new data, but you are making the data you already have more useful to a machine learning model.
From https://en.wikipedia.org/wiki/Feature_engineering :
“Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.”
The feature construction/enrichment process is crucial to the success of most machine learning projects (deep neural networks are a partial exception, since they can learn useful representations from raw data). Feature engineering enables you to get the most out of your data when building predictive models.
Feature engineering is performed once you have completed the first data exploration steps:
- Variable Identification
- Univariate and Bivariate Analysis
- Missing Values
- Imputation
- Outliers Treatment
Feature engineering can be divided into two steps:
- Variable Transformation
- Variable/Feature creation
Variable Transformation:
- Scaling and Centering (often termed standardisation): essential for scale-dependent learning models such as regression and neural networks. This technique rescales numerical features to have a mean of 0 and a standard deviation of 1.
- Normalisation: rescaling values to a fixed min-max range, typically [0, 1] (see the sketch after this list)
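A minimal sketch of both transformations, assuming a pandas DataFrame with made-up numerical columns Age and Fare, using scikit-learn's StandardScaler and MinMaxScaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical example data; 'Age' and 'Fare' are made-up numerical columns.
df = pd.DataFrame({"Age": [22, 38, 26, 35], "Fare": [7.25, 71.28, 7.93, 53.10]})

# Standardisation: mean 0, standard deviation 1.
df[["Age_std", "Fare_std"]] = StandardScaler().fit_transform(df[["Age", "Fare"]])

# Normalisation: rescale each feature to the [0, 1] range.
df[["Age_norm", "Fare_norm"]] = MinMaxScaler().fit_transform(df[["Age", "Fare"]])

print(df)
```

In practice you would fit the scalers on the training data only and reuse them to transform validation/test data, to avoid leaking information.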
Variable/Feature creation:
- Filling in missing data using domain knowledge, gained from the data itself or from a domain expert (overlaps with Missing Values/Imputation above)
- Feature Selection (keeping the most important features), e.g. by investigating feature correlations
- Creating new features (hyper-features) by combining existing ones, such as summary statistics (min, max, count) and discretised data; a sketch follows this list
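As an illustration, a minimal sketch of creating summary hyper-features, assuming a hypothetical per-order table with CustomerId and OrderValue columns:

```python
import pandas as pd

# Hypothetical transactional data: one row per order.
orders = pd.DataFrame({
    "CustomerId": [1, 1, 2, 2, 2, 3],
    "OrderValue": [20.0, 35.0, 12.5, 40.0, 7.5, 99.0],
})

# New "hyper-features": per-customer summary statistics of an existing column.
customer_features = orders.groupby("CustomerId")["OrderValue"].agg(
    OrderCount="count",
    MinOrderValue="min",
    MaxOrderValue="max",
    MeanOrderValue="mean",
).reset_index()

# A simple feature-correlation check to support feature selection.
print(customer_features.drop(columns="CustomerId").corr())
```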
Feature Engineering Concepts
Categorical Feature Decomposition
Imagine you have a categorical feature “Cabin” that can take the values:
{A, B, C or Unknown}
The Unknown value is probably special, representing missing data, but to a model it looks like just another categorical attribute.
To encode this extra information you could create a new binary feature called “HasCabin”, taking the value 1 when an observation has a known cabin and 0 when the cabin is unknown.
Additionally, you could create a new binary feature for each of the values that ‘Cabin’ can take, i.e. four binary features: Is_CabinA, Is_CabinB, Is_CabinC and Is_CabinUnknown. This is often referred to as ‘one-hot encoding’ (Python’s pandas library has a built-in function for this, get_dummies()).
These additional features could be used instead of the HasCabin feature (if you are using a simple linear model) or in addition to it (if you are using a decision tree based model).
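A minimal sketch of both ideas with pandas, assuming a toy Cabin column (the HasCabin and Is_Cabin* names are just illustrative):

```python
import pandas as pd

# Hypothetical data with an 'Unknown' category standing in for missing cabins.
df = pd.DataFrame({"Cabin": ["A", "C", "Unknown", "B", "Unknown"]})

# Binary flag: 1 when a cabin is known, 0 when it is unknown.
df["HasCabin"] = (df["Cabin"] != "Unknown").astype(int)

# One-hot encoding: one binary column per category (Is_CabinA, Is_CabinB, ...).
dummies = pd.get_dummies(df["Cabin"], prefix="Is_Cabin", prefix_sep="", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df)
```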
DateTime Feature Decomposition
Date-times are often stored as a single numerical value (e.g. a timestamp) that packs together information a model can find difficult to exploit in raw form.
There may be cyclical/seasonal relationships present between a date time and other attributes, such as time of day, day of week, month of year, quarter of year, etc.
For example, you could create a new categorical feature called DayOfWeek taking on seven values, which might be useful for a decision-tree-based model. Seasonality indicators, such as QuarterOfYear, might also be useful features.
There are often relationships between date-times and other attributes; to expose these you can decompose a date-time into constituent features that may allow models to learn these relationships. For example, if you suspect that there is a relationship between the hour of day and other attributes (such as NumberOfSales), you could create a new numerical feature called HourOfDay for the observation hour that might help a regression model.
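A short sketch of date-time decomposition with pandas, assuming a hypothetical Timestamp column (the derived feature names are illustrative):

```python
import pandas as pd

# Hypothetical sales data with a raw timestamp column.
sales = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-01-04 09:15", "2021-04-17 18:40", "2021-07-30 23:05",
    ]),
    "NumberOfSales": [12, 30, 7],
})

# Decompose the date-time into features a model can use directly.
sales["HourOfDay"] = sales["Timestamp"].dt.hour
sales["DayOfWeek"] = sales["Timestamp"].dt.day_name()   # categorical, 7 values
sales["MonthOfYear"] = sales["Timestamp"].dt.month
sales["QuarterOfYear"] = sales["Timestamp"].dt.quarter

print(sales)
```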
Numerical Feature Transformation
Continuous numerical quantities might benefit from being transformed to expose relevant information. This includes standardisation (essential for scale-dependent learning models), converting to a different unit of measure, or decomposing a rate into its separate time-period and quantity components.
For example, you may have a ShippingWeight quantity recorded in grams as an integer value, e.g. 9260. You could create a new feature with this quantity transformed into rounded kilograms, if the extra precision is not important.
There may be domain knowledge that items with a weight above certain thresholds incur higher rates. That domain-specific threshold could be used to create a new binary categorical feature ItemWeightAboveXkg.
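A brief sketch, assuming a hypothetical ShippingWeight column in grams and an assumed 10 kg rate threshold:

```python
import pandas as pd

# Hypothetical shipping data with weight recorded in grams.
df = pd.DataFrame({"ShippingWeight": [9260, 450, 15800, 2300]})

# Transform grams into rounded kilograms when the extra precision is not needed.
df["ShippingWeightKg"] = (df["ShippingWeight"] / 1000).round().astype(int)

# Assumed domain rule: weights above 10 kg incur a higher shipping rate.
df["ItemWeightAbove10kg"] = (df["ShippingWeight"] > 10_000).astype(int)

print(df)
```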
Discretisation
Binning, also known as quantisation, transforms a continuous numerical feature into a discrete (categorical) one by grouping its values into bins.
It can be useful when the data is skewed or contains extreme outliers.
In fixed-width binning, each bin has a specific fixed width, usually pre-defined by analysing the data and applying domain knowledge. Binning based on rounding is one example.
The drawback of fixed-width bins is that some bins may be densely populated while others are sparse or empty. In adaptive binning, the data distribution itself is used to allocate the bin ranges.
Quantile-based binning is a good strategy to use for adaptive binning.
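A small sketch of both strategies using pandas' cut() (fixed-width) and qcut() (quantile-based), on a made-up skewed income series with assumed bin edges:

```python
import pandas as pd

# Hypothetical skewed feature (e.g. income) to be discretised.
income = pd.Series([12_000, 18_500, 22_000, 31_000, 45_000, 52_000, 250_000])

# Fixed-width binning: bin edges chosen up front (here from assumed domain knowledge).
fixed_bins = pd.cut(
    income,
    bins=[0, 20_000, 40_000, 60_000, float("inf")],
    labels=["low", "medium", "high", "very_high"],
)

# Adaptive (quantile-based) binning: edges follow the data distribution,
# so each bin holds roughly the same number of observations.
quantile_bins = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"income": income, "fixed": fixed_bins, "quantile": quantile_bins}))
```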