Optimization Playground: Gradient Descent Intuition
If you haven’t learned the optimization basics yet, start with the Gradient Descent Theory lesson.
Objective
Build intuition for optimization by training:
- A linear regression model with simple gradient descent (square trick idea)
- The same linear model with full-batch gradient descent and a visual loss landscape
Prerequisites
- Basic Python and NumPy
- Intro to machine learning concepts (features, labels, loss)
Setup
Everything below is designed to run directly in Google Colab.
Tasks
- Complete Part 1 and Part 2 with default parameters.
- Re-run with at least two different learning rates and compare outcomes.
- Write down one failure case and one successful configuration.
Part 1: Simple Gradient Descent (Linear Regression)
We will reuse the same idea from the theory lesson: start with random parameters, make predictions, compute error, and nudge parameters in the direction that reduces error.
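That loop has the same shape no matter what model is being trained. The sketch below is a generic outline (the helper name `gradient_descent_sketch` and the toy objective are illustrative, not part of this lesson's code); Part 1 and Part 2 fill the loop in for a line model.

```python
# Generic shape of the loop: start somewhere, then repeatedly step against the gradient.
def gradient_descent_sketch(grad_fn, params, learning_rate=0.01, steps=100):
    """grad_fn(params) must return the gradient of the loss at `params`."""
    for _ in range(steps):
        params = params - learning_rate * grad_fn(params)  # nudge downhill
    return params

# Tiny example: minimize f(x) = (x - 3)**2, whose gradient is 2 * (x - 3).
x_min = gradient_descent_sketch(lambda x: 2 * (x - 3), params=0.0, learning_rate=0.1, steps=200)
print(x_min)  # approaches 3.0
```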
Setup + Data
import numpy as np
import matplotlib.pyplot as plt
import random
# Campaign fundraising toy dataset
features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)
plt.figure(figsize=(5, 4))
plt.scatter(features, labels)
plt.title("Campaign Fundraising")
plt.xlabel("Number of Donations")
plt.ylabel("Funds Raised")
plt.show()
Code Breakdown
- `import numpy as np`: loads NumPy for numeric arrays and math.
- `import matplotlib.pyplot as plt`: loads plotting utilities.
- `import random`: used for stochastic point selection in Part 1.
- `features`: input variable (x) = number of donations.
- `labels`: target variable (y) = funds raised.
- `plt.scatter(features, labels)`: plots the raw data so you can visually check whether a line fit makes sense.
- If the points roughly align linearly, gradient descent on a line model is a good first approach.
Gradient Descent Utilities
def rmse(y_true, y_pred):
return np.sqrt(np.mean((y_true - y_pred) ** 2))
# Square trick style update (single point)
def square_trick(base_funds, funds_per_dono, num_donos, funds, learning_rate):
pred = base_funds + funds_per_dono * num_donos
funds_per_dono += learning_rate * num_donos * (funds - pred)
base_funds += learning_rate * (funds - pred)
return base_funds, funds_per_dono
def train_linear(features, labels, learning_rate=0.01, epochs=2000, seed=7):
random.seed(seed)
base_funds = random.random()
funds_per_dono = random.random()
errors = []
for _ in range(epochs):
preds = base_funds + funds_per_dono * features
errors.append(rmse(labels, preds))
i = random.randint(0, len(features) - 1)
base_funds, funds_per_dono = square_trick(
base_funds,
funds_per_dono,
features[i],
labels[i],
learning_rate,
)
return base_funds, funds_per_dono, errors
Code Breakdown
- `rmse(...)`: computes the average prediction error magnitude (lower is better).
- In `square_trick(...)`:
  - `pred = base_funds + funds_per_dono * num_donos` computes the current line prediction for one point.
  - `(funds - pred)` is the signed error for that point.
  - `funds_per_dono += ...` updates the slope (m) using the error scaled by `x` and the learning rate.
  - `base_funds += ...` updates the intercept (b) using the error and the learning rate.
- In `train_linear(...)`:
  - `random.seed(seed)` makes runs reproducible.
  - `base_funds` and `funds_per_dono` start random (like random initialization in training).
  - Each epoch:
    - `preds = ...` predicts all points using the current line.
    - `errors.append(rmse(...))` stores the error history for plotting.
    - Random index `i` selects one training point (stochastic gradient descent behavior).
    - `square_trick(...)` applies one update step.
- Return values:
  - `base_funds`: final intercept.
  - `funds_per_dono`: final slope.
  - `errors`: learning curve over epochs.
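To make the update rule concrete, the optional check below (not one of the original cells) applies `square_trick` once, starting from zero parameters, to the single point `(x=2, y=197)`; the printed values can be verified by hand.

```python
# One hand-checkable step with learning_rate = 0.01 and the point (x=2, y=197):
#   pred = 0 + 0 * 2 = 0, error = 197 - 0 = 197
#   slope update:     0 + 0.01 * 2 * 197 = 3.94
#   intercept update: 0 + 0.01 * 197     = 1.97
b1, m1 = square_trick(0.0, 0.0, num_donos=2.0, funds=197.0, learning_rate=0.01)
print(b1, m1)  # expected: 1.97 3.94
```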
Run and Interact
# Try changing these and re-running this cell
learning_rate = 0.01
epochs = 2000
seed = 7
base_funds, funds_per_dono, errors = train_linear(
features, labels, learning_rate=learning_rate, epochs=epochs, seed=seed
)
preds = base_funds + funds_per_dono * features
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
# Fitted line
ax[0].scatter(features, labels, label='Data')
ax[0].plot(features, preds, color='red', label='Fitted Line')
ax[0].set_title('Linear Regression Fit')
ax[0].set_xlabel('Number of Donations')
ax[0].set_ylabel('Funds Raised')
ax[0].legend()
# Error curve
ax[1].plot(errors)
ax[1].set_title('RMSE During Training')
ax[1].set_xlabel('Epoch')
ax[1].set_ylabel('RMSE')
plt.show()
print(f"funds_per_dono (slope): {funds_per_dono:.4f}")
print(f"base_funds (intercept): {base_funds:.4f}")
print(f"final RMSE: {errors[-1]:.4f}")
Code Breakdown
- Hyperparameters:
  - `learning_rate`: step size for updates.
  - `epochs`: number of update steps.
  - `seed`: repeatable randomness.
- `train_linear(...)` returns the fitted parameters and the error history.
- `preds = ...` computes the final fitted line values for plotting.
- Left plot:
  - Scatter = original data.
  - Red line = fitted model.
- Right plot:
  - RMSE vs epoch = training behavior over time.
- Printed values:
  - The slope and intercept define the learned line.
  - The final RMSE summarizes final fit quality.
How to Read Results
- Fast drop then plateau in RMSE is expected.
- Very noisy/oscillating RMSE often means learning rate is too high.
- Very slow monotonic decrease often means learning rate is too low.
What to try (see the sketch below):
- Increase `learning_rate` to `0.1` and observe instability.
- Decrease it to `0.001` and observe slower convergence.
- Change `epochs` and compare final RMSE.
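If you would rather compare several settings in one pass than edit the cell repeatedly, the optional sweep below reuses `train_linear` from above and prints the final RMSE for each rate (expect the `0.1` run to be unstable or to overflow).

```python
# Optional: sweep a few learning rates with the same seed and compare final RMSE.
for lr in [0.001, 0.01, 0.1]:
    _, _, errs = train_linear(features, labels, learning_rate=lr, epochs=2000, seed=7)
    print(f"lr={lr:<6} final RMSE={errs[-1]:.4f}")
```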
Part 2: Visualizing the Loss Landscape
Instead of moving to neural networks, let’s stay with linear regression and build stronger intuition.
In this part, we optimize the same model: \(\hat{y} = wx + b\)
But now we:
- Use full-batch gradient descent (all points each step)
- Track `(w, b)` over time
- Draw the loss surface contours and show how gradient descent moves downhill
Batch Gradient Descent Setup
def mse_loss(x, y, w, b):
y_hat = w * x + b
return np.mean((y_hat - y) ** 2)
def gradients(x, y, w, b):
n = len(x)
y_hat = w * x + b
dw = (2 / n) * np.sum((y_hat - y) * x)
db = (2 / n) * np.sum(y_hat - y)
return dw, db
def train_batch_gd(x, y, w0=0.0, b0=0.0, learning_rate=0.01, epochs=100):
w, b = w0, b0
history = [(w, b, mse_loss(x, y, w, b))]
for _ in range(epochs):
dw, db = gradients(x, y, w, b)
w -= learning_rate * dw
b -= learning_rate * db
history.append((w, b, mse_loss(x, y, w, b)))
return w, b, history
Code Breakdown
- `mse_loss(...)`: computes mean squared error for the full dataset.
- `gradients(...)` computes exact full-batch gradients:
  - `dw`: how the loss changes with the slope `w`.
  - `db`: how the loss changes with the intercept `b`.
- Formula intuition:
  - If predictions are too high on average, the gradients push the parameters down.
  - If predictions are too low on average, the gradients push the parameters up.
- `train_batch_gd(...)`:
  - starts from `w0`, `b0`.
  - stores `(w, b, loss)` at each step in `history`.
  - applies the update rule `w -= learning_rate * dw` and `b -= learning_rate * db`.
- This is standard gradient descent on a 2-parameter linear model.
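For reference, the two quantities computed in `gradients(...)` come directly from differentiating the mean squared error of the line model \(\hat{y} = wx + b\):

\[
L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\big(w x_i + b - y_i\big)^2,
\qquad
\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\big(w x_i + b - y_i\big)\, x_i,
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\big(w x_i + b - y_i\big).
\]

In the code, `y_hat - y` is the per-point term \(w x_i + b - y_i\), so `dw` and `db` are exactly these sums.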
Run with Different Learning Rates
# Try different learning rates
rates = [0.001, 0.01]
epochs = 120
all_histories = {}
for lr in rates:
w, b, hist = train_batch_gd(features, labels, w0=0.0, b0=0.0, learning_rate=lr, epochs=epochs)
all_histories[lr] = (w, b, hist)
print(f"lr={lr:<6} final w={w:8.3f}, final b={b:8.3f}, final MSE={hist[-1][2]:10.3f}")
Code Breakdown
- `rates` lets you compare multiple learning rates in one run.
- `all_histories` stores all trajectories for later plotting.
- For each `lr`:
  - train the model from the same start point (`w0=0.0`, `b0=0.0`).
  - collect the full trajectory `hist`.
  - print final parameters and final MSE.
What to Observe
- Which learning rate reaches lower MSE within fixed epochs?
- Which learning rate is unstable or too slow?
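If you want to see outright instability rather than just slowness, the optional snippet below adds a rate above the two defaults and flags runs whose loss ends up worse than where it started. The exact stability threshold depends on the data, so treat `0.05` as a guess to experiment with.

```python
# Optional: include a learning rate that is likely too large for this data.
for lr in [0.001, 0.01, 0.05]:
    _, _, hist = train_batch_gd(features, labels, learning_rate=lr, epochs=epochs)
    final_mse = hist[-1][2]
    # Diverged if the loss blew up past its starting value (or overflowed entirely).
    diverged = (not np.isfinite(final_mse)) or final_mse > hist[0][2]
    print(f"lr={lr:<6} final MSE={final_mse:12.3f}  [{'diverged' if diverged else 'ok'}]")
```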
Plot Loss vs Epoch
plt.figure(figsize=(7, 4))
for lr in rates:
hist = all_histories[lr][2]
losses = [h[2] for h in hist]
plt.plot(losses, label=f"lr={lr}")
plt.title("MSE During Training (Batch Gradient Descent)")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.legend()
plt.show()
Code Breakdown
- Extract the loss sequence from each run: `[h[2] for h in hist]`.
- Plot one curve per learning rate on the same axes.
- This gives a direct convergence-speed comparison.
What to Observe
- Steeper early decline indicates faster initial learning.
- Flatter curves indicate slower convergence.
- Curves that spike or diverge indicate unstable updates.
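When the learning rates differ by an order of magnitude, a logarithmic loss axis often makes the comparison easier to read; the optional replot below uses the same histories.

```python
# Optional: same curves, log-scaled y-axis to compare convergence rates.
plt.figure(figsize=(7, 4))
for lr in rates:
    losses = [h[2] for h in all_histories[lr][2]]
    plt.plot(losses, label=f"lr={lr}")
plt.yscale('log')
plt.title("MSE During Training (log scale)")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.legend()
plt.show()
```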
Visualize the Loss Contours + Descent Paths
# Build a grid of (w, b) values
w_vals = np.linspace(20, 70, 180)
b_vals = np.linspace(50, 180, 180)
W, B = np.meshgrid(w_vals, b_vals)
Z = np.zeros_like(W)
for i in range(W.shape[0]):
for j in range(W.shape[1]):
Z[i, j] = mse_loss(features, labels, W[i, j], B[i, j])
plt.figure(figsize=(8, 6))
contours = plt.contour(W, B, Z, levels=30, cmap='viridis')
plt.clabel(contours, inline=True, fontsize=8)
# Overlay paths for each learning rate
for lr in rates:
hist = all_histories[lr][2]
ws = [h[0] for h in hist]
bs = [h[1] for h in hist]
plt.plot(ws, bs, marker='o', markersize=2, linewidth=1.5, label=f"lr={lr}")
plt.title("Loss Landscape (MSE) and Gradient Descent Paths")
plt.xlabel("w (slope)")
plt.ylabel("b (intercept)")
plt.legend()
plt.show()
Code Breakdown
- `w_vals`, `b_vals`: parameter grid for the contour plot.
- `W, B = np.meshgrid(...)`: creates every `(w, b)` pair on that grid.
- `Z[i, j] = mse_loss(...)`: computes the loss at each parameter pair.
- `plt.contour(...)`: draws equal-loss contour lines (a topographic map of the loss).
- Overlay section:
  - `ws = [h[0] for h in hist]` and `bs = [h[1] for h in hist]` extract each run's path.
  - `plt.plot(ws, bs, ...)` shows how gradient descent moves through parameter space.
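The double loop over the grid is easy to read but makes 180 × 180 separate calls; if you are comfortable with NumPy broadcasting, an equivalent vectorized version is sketched below (it should reproduce the same `Z`).

```python
# Optional: vectorized version of the grid loop using broadcasting.
# W[..., None] has shape (180, 180, 1); multiplying by `features` (shape (7,))
# broadcasts to per-point predictions of shape (180, 180, 7).
preds_grid = W[..., None] * features + B[..., None]
Z_fast = np.mean((preds_grid - labels) ** 2, axis=-1)
print(np.allclose(Z, Z_fast))  # expected: True
```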
How to Interpret the Contour Plot
- Inner contour regions (toward the center of the nested rings) represent lower error.
- A good learning rate traces a smooth path toward low-loss contours.
- Too-large learning rates jump around or overshoot the valley.
- Too-small learning rates move correctly but very slowly.
What to try:
- Add a larger learning rate like `0.02` and look for overshooting/divergence.
- Change the start point `w0`, `b0` in `train_batch_gd(...)` and compare paths (see the sketch below).
- Increase `epochs` and see how quickly each learning rate reaches the minimum region.
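For the second suggestion above, the optional snippet below runs batch gradient descent from a different starting point and prints where it lands; its path can be overlaid on the contour plot exactly like the others.

```python
# Optional: same learning rate, different starting point.
w_alt, b_alt, hist_alt = train_batch_gd(
    features, labels, w0=60.0, b0=20.0, learning_rate=0.01, epochs=epochs
)
print(f"start (60.0, 20.0) -> final w={w_alt:.3f}, b={b_alt:.3f}, MSE={hist_alt[-1][2]:.3f}")

# To overlay the new path on the contour plot, extract it the same way as before:
# ws_alt = [h[0] for h in hist_alt]; bs_alt = [h[1] for h in hist_alt]
# plt.plot(ws_alt, bs_alt, marker='o', markersize=2, label="alt start")
```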
Wrap-Up
You just used the same optimization idea twice:
- In linear regression, gradient descent adjusted a line.
- In the loss-landscape view, you watched `(w, b)` move downhill on the error surface.
That shared loop is the core of most modern machine learning training.
Validation
- You can explain why different learning rates produce different trajectories.
- Your plots show both convergence and at least one unstable/slow case.
- You can report final RMSE/MSE values and compare settings.
Extensions
- Add noise to the toy dataset and observe loss landscape changes.
- Try normalizing inputs and compare convergence speed.
- Plot the parameters `w` and `b` versus epoch for one run (see the sketch below).
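For the last extension, a minimal sketch (reusing `all_histories` from Part 2) that plots how `w` and `b` evolve over the epochs of one run:

```python
# Optional: parameter values per epoch for a single run (here lr=0.01).
hist = all_histories[0.01][2]
ws = [h[0] for h in hist]
bs = [h[1] for h in hist]
plt.figure(figsize=(7, 4))
plt.plot(ws, label='w (slope)')
plt.plot(bs, label='b (intercept)')
plt.title("Parameter Values During Training (lr=0.01)")
plt.xlabel("Epoch")
plt.ylabel("Parameter value")
plt.legend()
plt.show()
```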
Deliverable
- A Colab notebook that runs top-to-bottom.
- Two figure outputs: fit/loss plot and contour/path plot.
- A short markdown summary of: best setting, failed setting, and why.