Optimization Playground: Gradient Descent Intuition

If you haven’t learned the optimization basics yet, start with the Gradient Descent Theory lesson.

Objective

Build intuition for optimization by training:

A linear regression model with simple gradient descent (square trick idea)
The same linear model with full-batch gradient descent and a visual loss landscape

Prerequisites

Basic Python and NumPy
Intro to machine learning concepts (features, labels, loss)

Setup

Everything below is designed to run directly in Google Colab.

Tasks

Complete Part 1 and Part 2 with default parameters.
Re-run with at least two different learning rates and compare outcomes.
Write down one failure case and one successful configuration.

Part 1: Simple Gradient Descent (Linear Regression)

We will reuse the same idea from the theory lesson: start with random parameters, make predictions, compute error, and nudge parameters in the direction that reduces error.

Setup + Data

import numpy as np
import matplotlib.pyplot as plt
import random

# Campaign fundraising toy dataset
features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)

plt.figure(figsize=(5, 4))
plt.scatter(features, labels)
plt.title("Campaign Fundraising")
plt.xlabel("Number of Donations")
plt.ylabel("Funds Raised")
plt.show()

Code Breakdown

import numpy as np: loads NumPy for numeric arrays and math.
import matplotlib.pyplot as plt: loads plotting utilities.
import random: used for stochastic point selection in Part 1.
features: input variable (x) = number of donations.
labels: target variable (y) = funds raised.
plt.scatter(features, labels): plots raw data so you can visually check if a line fit makes sense.
If the points roughly align linearly, gradient descent for a line model is a good first approach.
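
One way to back up the visual check with a number is the Pearson correlation coefficient. The sketch below is an optional addition, not part of the original lesson code; it assumes the same toy dataset and uses np.corrcoef.

```python
import numpy as np

# Same toy dataset as in the setup cell
features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r.
# Values close to +1 or -1 suggest a strong linear trend.
r = np.corrcoef(features, labels)[0, 1]
print(f"Pearson r = {r:.4f}")
```

A value of r near 1 supports starting with a straight-line model. It does not prove the relationship is linear, so the scatter plot is still worth a look.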

Gradient Descent Utilities

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Square trick style update (single point)
def square_trick(base_funds, funds_per_dono, num_donos, funds, learning_rate):
    pred = base_funds + funds_per_dono * num_donos
    funds_per_dono += learning_rate * num_donos * (funds - pred)
    base_funds += learning_rate * (funds - pred)
    return base_funds, funds_per_dono


def train_linear(features, labels, learning_rate=0.01, epochs=2000, seed=7):
    random.seed(seed)
    base_funds = random.random()
    funds_per_dono = random.random()
    errors = []

    for _ in range(epochs):
        preds = base_funds + funds_per_dono * features
        errors.append(rmse(labels, preds))

        i = random.randint(0, len(features) - 1)
        base_funds, funds_per_dono = square_trick(
            base_funds,
            funds_per_dono,
            features[i],
            labels[i],
            learning_rate,
        )

    return base_funds, funds_per_dono, errors

Code Breakdown

rmse(...): computes the root mean squared error, a typical prediction error size in the same units as the labels (lower is better).
In square_trick(...):
pred = base_funds + funds_per_dono * num_donos computes current line prediction for one point.
(funds - pred) is the signed error for that point.
funds_per_dono += ... updates slope (m) using error scaled by x and learning rate.
base_funds += ... updates intercept (b) using error and learning rate.
In train_linear(...):
random.seed(seed) makes runs reproducible.
base_funds and funds_per_dono start random (like random initialization in training).
Each epoch (here, one update step on a single point, not a full pass over the data):
preds = ... predicts all points using current line.
errors.append(rmse(...)) stores error history for plotting.
Random index i selects one training point (stochastic gradient descent behavior).
square_trick(...) applies one update step.
Return values:
base_funds: final intercept.
funds_per_dono: final slope.
errors: learning curve over epochs.
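
To make the update concrete, here is a minimal sketch of a single square_trick step. The starting values b0 and m0 are hypothetical, chosen only for illustration; the point (5, 356) comes from the dataset. Because both parameters move using the same signed error, the error on the chosen point is multiplied by 1 - learning_rate * (x**2 + 1) after one step.

```python
def square_trick(base_funds, funds_per_dono, num_donos, funds, learning_rate):
    pred = base_funds + funds_per_dono * num_donos
    funds_per_dono += learning_rate * num_donos * (funds - pred)
    base_funds += learning_rate * (funds - pred)
    return base_funds, funds_per_dono

# Hypothetical starting line (not the seeded values used by train_linear)
b0, m0 = 0.5, 0.5
x, y = 5.0, 356.0   # one point from the toy dataset
lr = 0.01

err_before = y - (b0 + m0 * x)
b1, m1 = square_trick(b0, m0, x, y, lr)
err_after = y - (b1 + m1 * x)

# The error on this point shrinks by the factor 1 - lr * (x**2 + 1)
shrink = 1 - lr * (x ** 2 + 1)
print(f"error before: {err_before:.2f}, after: {err_after:.2f}, factor: {shrink:.2f}")
```

The same factor hints at why large learning rates misbehave: for x = 8 and learning_rate = 0.1 it equals 1 - 0.1 * 65 = -5.5, so a step on that point overshoots and leaves the error larger than before.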

Run and Interact

# Try changing these and re-running this cell
learning_rate = 0.01
epochs = 2000
seed = 7

base_funds, funds_per_dono, errors = train_linear(
    features, labels, learning_rate=learning_rate, epochs=epochs, seed=seed
)

preds = base_funds + funds_per_dono * features

fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Fitted line
ax[0].scatter(features, labels, label='Data')
ax[0].plot(features, preds, color='red', label='Fitted Line')
ax[0].set_title('Linear Regression Fit')
ax[0].set_xlabel('Number of Donations')
ax[0].set_ylabel('Funds Raised')
ax[0].legend()

# Error curve
ax[1].plot(errors)
ax[1].set_title('RMSE During Training')
ax[1].set_xlabel('Epoch')
ax[1].set_ylabel('RMSE')

plt.show()

print(f"funds_per_dono (slope): {funds_per_dono:.4f}")
print(f"base_funds (intercept): {base_funds:.4f}")
print(f"final RMSE: {errors[-1]:.4f}")

Code Breakdown

Hyperparameters:
learning_rate: step size for updates.
epochs: number of update steps.
seed: repeatable randomness.
train_linear(...) returns fitted parameters and error history.
preds = ... computes final fitted line values for plotting.
Left plot:
scatter = original data.
red line = fitted model.
Right plot:
RMSE vs epoch = training behavior over time.
Printed values:
slope/intercept define the learned line.
final RMSE summarizes final fit quality.

How to Read Results

A fast drop followed by a plateau in RMSE is expected.
Very noisy or oscillating RMSE often means the learning rate is too high.
A very slow, steady decrease often means the learning rate is too low.

What to try:

Increase learning_rate to 0.1 and observe instability.
Decrease it to 0.001 and observe slower convergence.
Change epochs and compare final RMSE.
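
The comparisons above can also be run in a single cell. The following is a sketch, not the lesson's reference solution: it restates the Part 1 training loop in compact form (same seed and same update rule, with the square trick inlined) and sweeps three learning rates. The results dictionary and the use of np.errstate to silence overflow warnings are additions made here for convenience.

```python
import random
import numpy as np

features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def train_linear(features, labels, learning_rate, epochs=2000, seed=7):
    # Compact restatement of the Part 1 loop: one random point per step
    random.seed(seed)
    b, m = random.random(), random.random()
    errors = []
    for _ in range(epochs):
        errors.append(rmse(labels, b + m * features))
        i = random.randint(0, len(features) - 1)
        err = labels[i] - (b + m * features[i])
        m += learning_rate * features[i] * err
        b += learning_rate * err
    return b, m, errors

results = {}
# If a run diverges, values can overflow to inf/nan; suppress the warnings
with np.errstate(over="ignore", invalid="ignore"):
    for lr in [0.001, 0.01, 0.1]:
        _, _, errors = train_linear(features, labels, learning_rate=lr)
        results[lr] = (errors[0], errors[-1])
        print(f"lr={lr:<6} first RMSE={errors[0]:.2f}  final RMSE={errors[-1]:.2f}")
```

A final RMSE of inf or nan is a sign that the run diverged rather than a bug in the plotting code.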

Part 2: Visualizing the Loss Landscape

Instead of moving to neural networks, let’s stay with linear regression and build stronger intuition.

In this part, we optimize the same model: \(\hat{y} = wx + b\)

But now we:

Use full-batch gradient descent (all points each step)
Track (w, b) over time
Draw the loss surface contours and show how gradient descent moves downhill
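
For reference, the loss and gradients used in this part can be written out as follows. The symbol \(\eta\) is notation introduced here for the learning rate (learning_rate in the code), and \(n\) is the number of data points.

```latex
\begin{aligned}
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2 \\
\frac{\partial L}{\partial w} &= \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) x_i \\
\frac{\partial L}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) \\
w &\leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{aligned}
```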

Batch Gradient Descent Setup

def mse_loss(x, y, w, b):
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)


def gradients(x, y, w, b):
    n = len(x)
    y_hat = w * x + b
    dw = (2 / n) * np.sum((y_hat - y) * x)
    db = (2 / n) * np.sum(y_hat - y)
    return dw, db


def train_batch_gd(x, y, w0=0.0, b0=0.0, learning_rate=0.01, epochs=100):
    w, b = w0, b0
    history = [(w, b, mse_loss(x, y, w, b))]

    for _ in range(epochs):
        dw, db = gradients(x, y, w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
        history.append((w, b, mse_loss(x, y, w, b)))

    return w, b, history

Code Breakdown

mse_loss(...): computes mean squared error for the full dataset.
gradients(...) computes exact full-batch gradients:
dw: how loss changes with slope w.
db: how loss changes with intercept b.
Formula intuition:
if predictions are too high on average, gradients push parameters down.
if predictions are too low on average, gradients push parameters up.
train_batch_gd(...):
starts from w0, b0.
stores (w, b, loss) at each step in history.
applies update rule:
w -= learning_rate * dw
b -= learning_rate * db
This is standard gradient descent on a 2-parameter linear model.
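
A common sanity check for hand-written gradients is to compare them against central finite differences. The sketch below assumes the mse_loss and gradients functions defined above (repeated so the cell runs on its own); the test point (w, b) and the step eps are hypothetical values chosen for illustration.

```python
import numpy as np

features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)

def mse_loss(x, y, w, b):
    return np.mean((w * x + b - y) ** 2)

def gradients(x, y, w, b):
    n = len(x)
    y_hat = w * x + b
    return (2 / n) * np.sum((y_hat - y) * x), (2 / n) * np.sum(y_hat - y)

# Hypothetical test point and step size
w, b, eps = 10.0, 20.0, 1e-5

dw, db = gradients(features, labels, w, b)

# Central differences: (L(p + eps) - L(p - eps)) / (2 * eps)
dw_num = (mse_loss(features, labels, w + eps, b)
          - mse_loss(features, labels, w - eps, b)) / (2 * eps)
db_num = (mse_loss(features, labels, w, b + eps)
          - mse_loss(features, labels, w, b - eps)) / (2 * eps)

print(f"dw analytic={dw:.4f}  numeric={dw_num:.4f}")
print(f"db analytic={db:.4f}  numeric={db_num:.4f}")
```

If the analytic and numeric values disagree by more than rounding error, the gradient formula (or its sign) is the first place to look.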

Run with Different Learning Rates

# Try different learning rates
rates = [0.001, 0.01]
epochs = 120

all_histories = {}
for lr in rates:
    w, b, hist = train_batch_gd(features, labels, w0=0.0, b0=0.0, learning_rate=lr, epochs=epochs)
    all_histories[lr] = (w, b, hist)
    print(f"lr={lr:<6} final w={w:8.3f}, final b={b:8.3f}, final MSE={hist[-1][2]:10.3f}")

Code Breakdown

rates lets you compare multiple learning rates in one run.
all_histories stores all trajectories for later plotting.
For each lr:
train model from the same start point (w0=0.0, b0=0.0).
collect full trajectory hist.
print final parameters and final MSE.

What to Observe

Which learning rate reaches a lower MSE within the fixed number of epochs?
Which learning rate is unstable or too slow?
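
To judge how close each run gets, it helps to know where the minimum actually is. For a straight-line fit the least-squares solution is available in closed form; the sketch below uses np.polyfit as one convenient way to obtain it. This reference fit is an addition for comparison and is not part of the lesson's training code.

```python
import numpy as np

features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)

# For deg=1, np.polyfit returns [slope, intercept] (highest power first)
w_star, b_star = np.polyfit(features, labels, deg=1)
mse_star = np.mean((w_star * features + b_star - labels) ** 2)

print(f"least-squares w={w_star:.3f}, b={b_star:.3f}, MSE={mse_star:.3f}")
```

Comparing the printed final w, b, and MSE of each gradient descent run against these values shows how much of the distance to the minimum remains after the fixed number of epochs.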

Plot Loss vs Epoch

plt.figure(figsize=(7, 4))
for lr in rates:
    hist = all_histories[lr][2]
    losses = [h[2] for h in hist]
    plt.plot(losses, label=f"lr={lr}")

plt.title("MSE During Training (Batch Gradient Descent)")
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.legend()
plt.show()

Code Breakdown

Extract loss sequence from each run: [h[2] for h in hist].
Plot one curve per learning rate on same axes.
This gives a direct convergence-speed comparison.

What to Observe

Steeper early decline indicates faster initial learning.
Flatter curves indicate slower convergence.
Curves that spike or diverge indicate unstable updates.

Visualize the Loss Contours + Descent Paths

# Build a grid of (w, b) values
w_vals = np.linspace(20, 70, 180)
b_vals = np.linspace(50, 180, 180)
W, B = np.meshgrid(w_vals, b_vals)

Z = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Z[i, j] = mse_loss(features, labels, W[i, j], B[i, j])

plt.figure(figsize=(8, 6))
contours = plt.contour(W, B, Z, levels=30, cmap='viridis')
plt.clabel(contours, inline=True, fontsize=8)

# Overlay paths for each learning rate
for lr in rates:
    hist = all_histories[lr][2]
    ws = [h[0] for h in hist]
    bs = [h[1] for h in hist]
    plt.plot(ws, bs, marker='o', markersize=2, linewidth=1.5, label=f"lr={lr}")

plt.title("Loss Landscape (MSE) and Gradient Descent Paths")
plt.xlabel("w (slope)")
plt.ylabel("b (intercept)")
plt.legend()
plt.show()

Code Breakdown

w_vals, b_vals: parameter grid for contour plot.
W, B = np.meshgrid(...): creates every (w, b) pair on that grid.
Z[i, j] = mse_loss(...): computes loss at each parameter pair.
plt.contour(...): draws equal-loss contour lines (a topographic map of loss).
Overlay section:
ws = [h[0] for h in hist] and bs = [h[1] for h in hist] extract each run’s path.
plt.plot(ws, bs, ...) shows how gradient descent moves through parameter space.
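
The nested loop that fills Z is easy to read but slow on fine grids. As an optional alternative, the same surface can be computed with NumPy broadcasting. This is a sketch under the same grid and dataset; Z_fast is a name introduced here.

```python
import numpy as np

features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)
labels = np.array([155, 197, 244, 356, 409, 448, 500], dtype=np.float64)

w_vals = np.linspace(20, 70, 180)
b_vals = np.linspace(50, 180, 180)
W, B = np.meshgrid(w_vals, b_vals)

# Add a trailing axis so every grid cell broadcasts against all data points:
# W[..., None] has shape (180, 180, 1) and features has shape (7,),
# so residuals has shape (180, 180, 7).
residuals = W[..., None] * features + B[..., None] - labels

# Average the squared residuals over the data axis to get MSE per grid cell
Z_fast = np.mean(residuals ** 2, axis=-1)
print(Z_fast.shape)
```

Z_fast can be passed to plt.contour in place of Z; the loop version remains a fine choice when clarity matters more than speed.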

How to Interpret the Contour Plot

Center/lower contour regions represent lower error.
A good learning rate traces a smooth path toward low-loss contours.
Too-large learning rates jump around or overshoot the valley.
Too-small learning rates move correctly but very slowly.
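
For a quadratic loss such as MSE, "too large" can be made precise: fixed-step full-batch gradient descent converges only when the learning rate is below 2 divided by the largest eigenvalue of the loss Hessian. The sketch below computes that bound for this dataset. The Hessian view is an addition to the lesson, and the names X, H, and lr_max are introduced here.

```python
import numpy as np

features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)

# Design matrix with a column of ones for the intercept b
X = np.column_stack([features, np.ones_like(features)])

# Hessian of the MSE loss with respect to (w, b): (2 / n) * X^T X
H = (2 / len(features)) * X.T @ X

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order
eigvals = np.linalg.eigvalsh(H)
lr_max = 2 / eigvals[-1]

print(f"eigenvalues: {eigvals}")
print(f"largest stable learning rate (approx.): {lr_max:.4f}")
```

The gap between the two eigenvalues also explains the shape of the contours: the loss is steep along one direction and nearly flat along the other, so progress along the flat direction is slow even at a stable learning rate.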

What to try:

Add a larger learning rate like 0.02 and look for overshooting, then try 0.04 and watch the loss diverge.
Change the start point in train_batch_gd (w0, b0) and compare paths.
Increase epochs and see how quickly each learning rate reaches the minimum region.

Wrap-Up

You just used the same optimization idea twice:

In Part 1, single-point (square trick) updates adjusted a line toward the data.
In Part 2, the loss-landscape view showed (w, b) moving downhill on the error surface under full-batch gradient descent.

That shared loop is the core of most modern machine learning training.

Validation

You can explain why different learning rates produce different trajectories.
Your plots show both convergence and at least one unstable/slow case.
You can report final RMSE/MSE values and compare settings.

Extensions

Add noise to the toy dataset and observe loss landscape changes.
Try normalizing inputs and compare convergence speed.
Plot parameter updates (w, b) versus epoch for one run.
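
As a starting point for the normalization extension, one way to anticipate its effect is to compare the condition number of the MSE Hessian before and after standardizing the inputs. This is a sketch of one possible analysis, not the required solution; the hessian helper and the variable names are introduced here.

```python
import numpy as np

features = np.array([1, 2, 3, 5, 6, 7, 8], dtype=np.float64)

def hessian(x):
    # Hessian of MSE for the model w * x + b: (2 / n) * X^T X with X = [x, 1]
    X = np.column_stack([x, np.ones_like(x)])
    return (2 / len(x)) * X.T @ X

# Standardize: zero mean, unit (population) standard deviation
x_std = (features - features.mean()) / features.std()

cond_raw = np.linalg.cond(hessian(features))
cond_std = np.linalg.cond(hessian(x_std))

print(f"condition number, raw inputs:          {cond_raw:.1f}")
print(f"condition number, standardized inputs: {cond_std:.1f}")
```

A condition number near 1 means the contours are close to circular, so a single learning rate suits both parameters. Note that the learning rate usually needs retuning after standardizing, and the learned w and b then refer to the rescaled inputs.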

Deliverable

A Colab notebook that runs top-to-bottom.
Two figure outputs: fit/loss plot and contour/path plot.
A short markdown summary of: best setting, failed setting, and why.