Evaluation Basics
Good models are built by iteration, and good iteration requires reliable evaluation.
A practical loop is:
- Split data
- Train on training data
- Evaluate on validation/test data
- Iterate with justified changes
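The loop above can be sketched end to end on a toy regression task. This is a minimal illustration using only the standard library; the mean-predictor "model" and all variable names are assumptions for the example, not part of any particular library.

```python
import random

# 1. Split data (shuffle first so the split is random).
random.seed(0)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

# 2. "Train": here the simplest possible model, predicting the mean target.
mean_y = sum(y for _, y in train) / len(train)

# 3. Evaluate on held-out data only.
mae = sum(abs(y - mean_y) for _, y in test) / len(test)

# 4. Iterate: swap the baseline for a real model and re-run steps 2-3.
```

The point of the sketch is the shape of the loop, not the model: the test data is touched only in step 3.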
Data Splits
- Training set: fit model parameters.
- Validation set: compare model or hyperparameter choices.
- Test set: final unbiased performance estimate.
If data is limited, use k-fold cross-validation for more stable performance estimates.
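A minimal k-fold split can be written in a few lines. This sketch assumes the sample count is divisible by k (real implementations distribute the remainder); the function name `kfold_indices` is illustrative.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k folds over n samples.

    Assumes n is divisible by k for simplicity; library versions
    (e.g. scikit-learn's KFold) handle uneven folds.
    """
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

folds = list(kfold_indices(10, 5))
# Each sample appears in exactly one validation fold, so every point
# contributes to the performance estimate once.
```

Averaging a metric across the k validation folds gives a more stable estimate than a single split.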
Common Metrics
Regression
- MAE: mean absolute error; the average magnitude of the errors.
- RMSE: root mean squared error; penalizes larger errors more strongly.
- R^2: fraction of target variance explained by the model.
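The three regression metrics can be computed by hand on a tiny example. The numbers below are made up for illustration; only the standard library is used.

```python
import math

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.0]
n = len(y_true)

# MAE: mean of absolute errors.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# RMSE: square root of the mean of squared errors.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# R^2: 1 minus (residual sum of squares / total sum of squares).
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
```

Note how the single error of 1.0 contributes more to RMSE (squared) than to MAE (linear), which is exactly why RMSE punishes outliers harder.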
Classification
- Accuracy: fraction of all predictions that are correct.
- Precision: among predicted positives, the fraction that are actually positive.
- Recall: among actual positives, the fraction the model found.
- F1: harmonic mean of precision and recall.
Pick metrics based on task costs, not habit.
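All four classification metrics fall out of the confusion-matrix counts. The labels below are a made-up binary example; only the standard library is used.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Confusion-matrix counts for the positive class (label 1).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)               # of predicted positives, how many real
recall = tp / (tp + fn)                  # of real positives, how many found
f1 = 2 * precision * recall / (precision + recall)
```

On a heavily imbalanced task (e.g. fraud), accuracy can look high while recall is terrible, which is the "task costs, not habit" point above.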
Data Leakage (High Priority)
Leakage happens when training uses information that would not be available at prediction time.
Common sources:
- Scaling/encoding with full dataset before splitting
- Features derived from target or future information
- Duplicate rows crossing train/test boundaries
Rule: split first, then fit preprocessing on train only.
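The rule can be shown concretely with standardization. The correct order is: compute mean and standard deviation on the train split only, then apply those same statistics to the test split. The data and function name below are illustrative; only the standard library is used.

```python
import statistics

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
test_x = [6.0, 7.0]

# Fit preprocessing on the TRAIN split only.
mu = statistics.mean(train_x)
sigma = statistics.pstdev(train_x)

def scale(xs):
    """Apply the train-split statistics; never refit on test data."""
    return [(x - mu) / sigma for x in xs]

train_scaled = scale(train_x)
test_scaled = scale(test_x)   # reuses mu/sigma; no peeking at test values
```

The leaky version would compute `mu` and `sigma` over `train_x + test_x`, letting test-set values shift the scaler and quietly inflating evaluation scores.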
Error Analysis Checklist
When performance is weak, check:
- Data quality (missing values, label noise, class imbalance)
- Feature quality (signal and redundancy)
- Underfitting/overfitting signs
- Metric-task mismatch
- Baseline comparison (is your model actually better?)
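The baseline check in the last item is cheap to run. A majority-class predictor is a common floor; any model should clearly beat it before being trusted. The labels below are illustrative.

```python
from collections import Counter

y_train = [0, 0, 0, 0, 1, 1]
y_test = [0, 0, 1, 1]

# Predict the most common training label for every test sample.
majority = Counter(y_train).most_common(1)[0][0]
baseline_acc = sum(1 for y in y_test if y == majority) / len(y_test)
# A model scoring only slightly above baseline_acc has learned almost nothing.
```

For regression, the analogous baseline is predicting the training mean; for imbalanced classification, the majority baseline often scores deceptively high, which ties back to the metric-task mismatch item above.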
Quick Self-Check
- Why is test data used only at the end?
- In a fraud problem, when is recall more important than accuracy?
- Give one leakage example from tabular data workflows.
For answer guidance, see Checkpoints.