Feature Engineering Basics
Feature engineering is the bridge between raw data and usable model input.
The goal is not to add complexity but to make the data representation reliable and informative.
1) Missing Values
Common strategies:
- Drop rows or columns (when few values are missing and little information is lost)
- Numeric imputation: mean/median
- Categorical imputation: most frequent or explicit “unknown”
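Compute imputation statistics (mean, median, most frequent value) on the training data only, then reuse those same values when filling validation and test data.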
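As a minimal sketch of median imputation with training statistics, the example below uses only the Python standard library. The helper names (fit_median, impute) and the sample ages are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import median

def fit_median(train_values):
    """Return the median of the non-missing training values."""
    observed = [v for v in train_values if v is not None]
    return median(observed)

def impute(values, fill_value):
    """Replace missing entries (None) with the given fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical data: None marks a missing age.
train_age = [22, None, 35, 41, None, 29]
test_age = [None, 50]

fill = fit_median(train_age)            # statistic learned from train only
train_filled = impute(train_age, fill)
test_filled = impute(test_age, fill)    # the same value is reused on test
```

The test column is never inspected when the fill value is computed, so nothing about held-out data reaches the preprocessing step.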
2) Categorical Variables
Common strategies:
- One-hot encoding for nominal categories
- Ordinal encoding only when ordering is meaningful
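Watch out for high-cardinality categories, which make one-hot encodings very wide, and for rare labels, which may appear only a few times in the training data or not at all.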
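One possible way to handle rare and unseen labels is to group them into a shared bucket before one-hot encoding. The sketch below assumes a hypothetical frequency threshold (min_count) and a hypothetical bucket name ("other"); both are illustrative choices, not fixed rules.

```python
from collections import Counter

def fit_categories(train_labels, min_count=2):
    """Keep labels seen at least min_count times in training.

    All remaining labels share a final 'other' column.
    """
    counts = Counter(train_labels)
    kept = sorted(label for label, n in counts.items() if n >= min_count)
    return kept + ["other"]

def one_hot(label, categories):
    """Encode a label as a 0/1 list; rare or unseen labels map to 'other'."""
    target = label if label in categories[:-1] else "other"
    return [1 if c == target else 0 for c in categories]

# Hypothetical training column.
train_city = ["paris", "rome", "paris", "oslo", "rome", "lima"]
categories = fit_categories(train_city)

frequent = one_hot("rome", categories)   # label kept as its own column
rare = one_hot("oslo", categories)       # seen once in train, goes to 'other'
unseen = one_hot("tokyo", categories)    # never seen in train, goes to 'other'
```

The category list is learned from the training column only, so the encoded width stays fixed no matter which labels show up later.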
3) Feature Scaling
Some models are sensitive to scale (linear models, SVMs, k-NN, neural nets).
Common scalers:
- Standardization (zero mean, unit variance)
- Min-max scaling (fixed range)
Fit the scaler on the training set only, then apply that fitted scaler to the validation and test sets.
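The two scalers above can be sketched in plain Python as follows. The function names and sample values are hypothetical, and the fallback for a constant feature (dividing by 1.0) is an assumption made to avoid division by zero.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training data."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # assumption: leave constant features unscaled
    return mu, sigma

def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

def fit_min_max(train_values):
    """Learn the minimum and range from training data."""
    lo, hi = min(train_values), max(train_values)
    span = hi - lo
    if span == 0:
        span = 1.0  # assumption: same guard for constant features
    return lo, span

def min_max_scale(values, lo, span):
    return [(v - lo) / span for v in values]

# Hypothetical data.
train = [10.0, 20.0, 30.0]
test = [40.0]

mu, sigma = fit_standardizer(train)
train_std = standardize(train, mu, sigma)
test_std = standardize(test, mu, sigma)      # uses train mean and std

lo, span = fit_min_max(train)
train_mm = min_max_scale(train, lo, span)
test_mm = min_max_scale(test, lo, span)      # may fall outside [0, 1]
```

Because the parameters come from the training set, a test value can land outside the training range. That is expected and is not a reason to refit on the test data.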
4) Simple Derived Features
Examples:
- Ratios (price per unit)
- Time deltas (days since last event)
- Domain-specific transformations
Only keep derived features that improve validation performance or interpretability.
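A small sketch of the first two examples, a ratio and a time delta, is shown below. The field names (total_price, units, last_purchase) and the snapshot date are hypothetical, and returning None for a zero or missing unit count is one assumed convention among several.

```python
from datetime import date

def price_per_unit(total_price, units):
    """Ratio feature; returns None when units is zero or missing."""
    if not units:
        return None
    return total_price / units

def days_since(last_event, reference):
    """Time-delta feature: whole days between an event and a reference date."""
    return (reference - last_event).days

# Hypothetical record and reference date.
order = {"total_price": 120.0, "units": 8, "last_purchase": date(2024, 1, 10)}
snapshot = date(2024, 3, 1)

features = {
    "price_per_unit": price_per_unit(order["total_price"], order["units"]),
    "days_since_last_purchase": days_since(order["last_purchase"], snapshot),
}
```

Using a fixed reference date rather than the current date keeps the feature reproducible across runs.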
Minimal Pipeline Mindset
A reliable pipeline order:
- Split data
- Fit preprocessing on the training set
- Transform train/validation/test with same fitted preprocessors
- Train model
- Evaluate and iterate
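The order above can be sketched end to end with a toy example. Everything here is illustrative: the data, the helper names, and the nearest-centroid classifier are assumptions chosen to keep the sketch short, and the validation split is omitted for brevity (it would be transformed with the same fitted scaler).

```python
import random
from statistics import mean, pstdev

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def fit_scaler(train_rows):
    """Learn mean and standard deviation from the training feature only."""
    xs = [x for x, _ in train_rows]
    sigma = pstdev(xs)
    if sigma == 0:
        sigma = 1.0
    return mean(xs), sigma

def transform(rows, scaler):
    mu, sigma = scaler
    return [((x - mu) / sigma, y) for x, y in rows]

def fit_centroids(train_rows):
    """Toy model: one centroid (mean scaled feature) per class."""
    by_class = {}
    for x, y in train_rows:
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}

def predict(x, centroids):
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(rows, centroids):
    return mean(1 if predict(x, centroids) == y else 0 for x, y in rows)

# Hypothetical one-feature dataset: (value, label).
rows = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (4.0, "low"),
        (10.0, "high"), (11.0, "high"), (12.0, "high"), (13.0, "high")]

train, test = split(rows)                    # 1. split data
scaler = fit_scaler(train)                   # 2. fit preprocessing on train
train_s = transform(train, scaler)           # 3. transform with the same
test_s = transform(test, scaler)             #    fitted preprocessor
model = fit_centroids(train_s)               # 4. train model
test_accuracy = accuracy(test_s, model)      # 5. evaluate
```

The key point is that fit_scaler sees only the training rows, while transform is applied to every split with the parameters it returned.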
Quick Self-Check
- Why should you not fit a scaler on the full dataset?
- When is one-hot encoding preferred over ordinal encoding?
- Name one reasonable way to handle numeric missing values.
For answer guidance, see Checkpoints.