Evaluation Basics
Good models are built by iteration, and good iteration requires reliable evaluation.
A practical loop is:
- Split data
- Train on training data
- Evaluate on validation/test data
- Iterate with justified changes
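The loop above can be sketched end to end on a toy regression task. This is a minimal illustration using only the standard library; the mean-predictor "model" and all variable names are assumptions for the example, not part of any particular library.

```python
import random

# 1. Split data (shuffle first so the split is random).
random.seed(0)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

# 2. "Train": here the simplest possible model, predicting the mean target.
mean_y = sum(y for _, y in train) / len(train)

# 3. Evaluate on held-out data only.
mae = sum(abs(y - mean_y) for _, y in test) / len(test)

# 4. Iterate: swap the baseline for a real model and re-run steps 2-3.
```

The point of the sketch is the shape of the loop, not the model: the test data is touched only in step 3.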
Data Splits
- Training set: fit model parameters.
- Validation set: compare model or hyperparameter choices.
- Test set: final unbiased performance estimate.
If data is limited, use k-fold cross-validation for more stable performance estimates.
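A minimal k-fold split can be written in a few lines. This sketch assumes the sample count is divisible by k (real implementations distribute the remainder); the function name `kfold_indices` is illustrative.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k folds over n samples.

    Assumes n is divisible by k for simplicity; library versions
    (e.g. scikit-learn's KFold) handle uneven folds.
    """
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

folds = list(kfold_indices(10, 5))
# Each sample appears in exactly one validation fold, so every point
# contributes to the performance estimate once.
```

Averaging a metric across the k validation folds gives a more stable estimate than a single split.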
Common Metrics
Regression
- MAE: mean absolute error; the average magnitude of the errors.
- RMSE: root mean squared error; penalizes larger errors more strongly.
- R^2: fraction of target variance explained by the model.
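The three regression metrics can be computed by hand on a tiny example. The numbers below are made up for illustration; only the standard library is used.

```python
import math

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.0]
n = len(y_true)

# MAE: mean of absolute errors.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# RMSE: square root of the mean of squared errors.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# R^2: 1 minus (residual sum of squares / total sum of squares).
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
```

Note how the single error of 1.0 contributes more to RMSE (squared) than to MAE (linear), which is exactly why RMSE punishes outliers harder.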
Classification
- Accuracy: fraction of all predictions that are correct.
- Precision: among predicted positives, the fraction that are actually positive.
- Recall: among actual positives, the fraction the model found.
- F1: harmonic mean of precision and recall.
Pick metrics based on task costs, not habit.
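All four classification metrics fall out of the confusion-matrix counts. The labels below are a made-up binary example; only the standard library is used.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Confusion-matrix counts for the positive class (label 1).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)               # of predicted positives, how many real
recall = tp / (tp + fn)                  # of real positives, how many found
f1 = 2 * precision * recall / (precision + recall)
```

On a heavily imbalanced task (e.g. fraud), accuracy can look high while recall is terrible, which is the "task costs, not habit" point above.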
Data Leakage (High Priority)
Leakage happens when training uses information that would not be available at prediction time.
Common sources:
- Scaling/encoding with full dataset before splitting
- Features derived from target or future information
- Duplicate rows crossing train/test boundaries
Rule: split first, then fit preprocessing on train only.
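The rule can be shown concretely with standardization. The correct order is: compute mean and standard deviation on the train split only, then apply those same statistics to the test split. The data and function name below are illustrative; only the standard library is used.

```python
import statistics

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
test_x = [6.0, 7.0]

# Fit preprocessing on the TRAIN split only.
mu = statistics.mean(train_x)
sigma = statistics.pstdev(train_x)

def scale(xs):
    """Apply the train-split statistics; never refit on test data."""
    return [(x - mu) / sigma for x in xs]

train_scaled = scale(train_x)
test_scaled = scale(test_x)   # reuses mu/sigma; no peeking at test values
```

The leaky version would compute `mu` and `sigma` over `train_x + test_x`, letting test-set values shift the scaler and quietly inflating evaluation scores.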
Error Analysis Checklist
When performance is weak, check:
- Data quality (missing values, label noise, class imbalance)
- Feature quality (signal and redundancy)
- Underfitting/overfitting signs
- Metric-task mismatch
- Baseline comparison (is your model actually better?)
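The baseline check in the last item is cheap to run. A majority-class predictor is a common floor; any model should clearly beat it before being trusted. The labels below are illustrative.

```python
from collections import Counter

y_train = [0, 0, 0, 0, 1, 1]
y_test = [0, 0, 1, 1]

# Predict the most common training label for every test sample.
majority = Counter(y_train).most_common(1)[0][0]
baseline_acc = sum(1 for y in y_test if y == majority) / len(y_test)
# A model scoring only slightly above baseline_acc has learned almost nothing.
```

For regression, the analogous baseline is predicting the training mean; for imbalanced classification, the majority baseline often scores deceptively high, which ties back to the metric-task mismatch item above.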
Quick Self-Check
- Why is test data used only at the end?
- In a fraud problem, when is recall more important than accuracy?
- Give one leakage example from tabular data workflows.
For answer guidance, see Checkpoints.