A full PyTorch model cycle — data, model, training, evaluation and saving — illustrated step by step on a regression problem.
Goal of the lesson
By the end of this 3-hour session you should be able to:
- recognize the five steps every PyTorch project follows,
- prepare data and split it into train and test,
- build a model by subclassing nn.Module,
- write a training loop with a loss, an optimizer and gradient descent,
- track and plot loss curves,
- save and reload a trained model,
- fit a noisy non-linear curve as a capstone.
This is the most important lesson of the series. Every chapter that follows reuses the same five-step skeleton — only the data and the model change.
Suggested timing
| Block | Topic |
|---|---|
| 20 min | What a workflow is, the 5 steps |
| 25 min | Generate and split the data |
| 30 min | Build a linear-regression model |
| 45 min | Train, track loss curves, evaluate |
| 20 min | Save and reload |
| 40 min | Capstone — fit a noisy sine wave |
The 5-step workflow
In machine learning, the model is a tiny part of the project. Most of your time will be spent on data and on training and diagnostics. The shape of the workflow stays remarkably constant: the same five steps for a 50-line linear regression and for a 500-million-parameter language model.
In this lesson we work on the smallest interesting problem — a linear regression that learns the line y = 0.7 x + 0.3 — so we can focus entirely on the workflow.
Setup
```bash
uv init --python 3.12 workflow
cd workflow
uv add torch matplotlib
```

Imports we will reuse:

```python
import matplotlib.pyplot as plt
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```

1. Prepare data
Real ML starts with real data. Here we generate it ourselves so we know the answer in advance and can verify whether the model finds it.
```python
WEIGHT = 0.7
BIAS = 0.3

x = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # shape [50, 1]
y = WEIGHT * x + BIAS                          # shape [50, 1]

print(x[:5])
print(y[:5])
print(x.shape, y.shape)
```

Why unsqueeze(dim=1)? PyTorch layers expect data in the shape [batch, features]. A 1-D tensor of shape [50] is ambiguous; turning it into [50, 1] says “50 samples, each with 1 feature”.
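A quick shape check makes the difference concrete (a throwaway snippet, not part of the lesson code):

```python
t = torch.arange(0, 1, 0.02)
print(t.shape)                   # torch.Size([50])    — ambiguous 1-D tensor
print(t.unsqueeze(dim=1).shape)  # torch.Size([50, 1]) — 50 samples, 1 feature each
```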
Train / test split
The first lesson of every machine-learning course: never evaluate on data the model has seen. We hide 20% of the points until evaluation time.
```python
split = int(0.8 * len(x))
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

print(len(x_train), len(x_test))  # 40 10
```

With real data you should also shuffle before splitting — otherwise you might end up with all “easy” cases in train and all “hard” cases in test. Our points are uniformly spaced, so we can skip that step.
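If you did need to shuffle first, a minimal sketch with torch.randperm would look like this (variable names are illustrative; the capstone at the end of this lesson uses the same idea):

```python
perm = torch.randperm(len(x))        # random ordering of the 50 indices
x_shuf, y_shuf = x[perm], y[perm]    # apply the same permutation to x and y
split = int(0.8 * len(x))
x_train, y_train = x_shuf[:split], y_shuf[:split]
x_test, y_test = x_shuf[split:], y_shuf[split:]
```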
Visualize
The first thing you do with a new dataset is plot it.
```python
def plot_predictions(predictions=None, title=""):
    plt.figure(figsize=(8, 5))
    plt.scatter(x_train, y_train, c="b", s=8, label="train")
    plt.scatter(x_test, y_test, c="g", s=8, label="test")
    if predictions is not None:
        plt.scatter(x_test, predictions, c="r", s=8, label="prediction")
    plt.title(title)
    plt.legend()
    plt.show()

plot_predictions(title="dataset")
```

You should see a single straight line of dots — blue training points on the left, green test points on the right.
Try it — split sizes
What happens if you change the split to 50%? to 90%? Predict the result, then confirm.
50% means the model learns from less data — it may underfit, especially with noise. 90% leaves so few test points that the test score becomes noisy and unreliable. The 80/20 default is a good compromise; for very small datasets you would also use cross-validation.
2. Build a model
Every model in PyTorch is a class that subclasses nn.Module. You define:
- the parameters the model will learn (weights, biases),
- a forward(x) method describing the math.
```python
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1, requires_grad=True))
        self.bias = nn.Parameter(torch.randn(1, requires_grad=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x + self.bias
```

Two things to notice:

- nn.Parameter registers weight and bias as trainable. They start as random numbers; the optimizer updates them during training.
- forward(x) is the math the model performs. Calling model(x) invokes it — never call .forward() directly, because that bypasses some PyTorch hooks.
```python
torch.manual_seed(42)
model = LinearRegression()

print(list(model.parameters()))
print(model.state_dict())
```

state_dict() returns the dictionary {name: tensor} PyTorch uses to save and load weights.
Predictions before training
```python
with torch.inference_mode():
    y_pred = model(x_test)

plot_predictions(y_pred, title="predictions before training")
```

torch.inference_mode() is a context manager that turns off gradient bookkeeping — faster and uses less memory. The red dots will be far from the green ones because the parameters are random.
Sanity check — what shape comes out?
If the input has shape (N, 1), what shape will the output have?
(N, 1) — broadcast multiplication of (1,) weight and (N, 1) input gives (N, 1), then adding the (1,) bias keeps the same shape.
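You can confirm the broadcasting with a throwaway example (names here are illustrative):

```python
w = torch.randn(1)          # shape (1,)
b = torch.randn(1)          # shape (1,)
x_demo = torch.randn(5, 1)  # shape (N, 1) with N = 5
print((w * x_demo + b).shape)  # torch.Size([5, 1])
```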
3. Train
Training is a loop that repeats four operations on every batch:
| Step | What it does |
|---|---|
| Forward | Run inputs through the model to get predictions |
| Loss | Compare predictions against ground truth |
| Backward | Compute gradients of the loss w.r.t. each parameter |
| Step | Move each parameter a tiny bit in the direction that reduces the loss |
Loss and optimizer
The loss is a number that measures how wrong the model is. For regression, the simplest choice is mean absolute error:
```python
loss_fn = nn.L1Loss()
```

The optimizer decides how to update the parameters from the gradients. SGD (stochastic gradient descent) is the simplest:

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

lr is the learning rate — how big a step to take. Too small and training crawls; too big and it overshoots and diverges.
Training loop
```python
EPOCHS = 200
train_losses, test_losses, epochs_logged = [], [], []

for epoch in range(EPOCHS):
    # --- training ---
    model.train()                    # training mode
    y_pred = model(x_train)          # forward
    loss = loss_fn(y_pred, y_train)  # loss

    optimizer.zero_grad()            # reset gradients
    loss.backward()                  # backward
    optimizer.step()                 # step

    # --- evaluation ---
    model.eval()
    with torch.inference_mode():
        test_pred = model(x_test)
        test_loss = loss_fn(test_pred, y_test)

    if epoch % 20 == 0:
        train_losses.append(loss.item())
        test_losses.append(test_loss.item())
        epochs_logged.append(epoch)
        print(f"epoch {epoch:3d} loss={loss.item():.4f} test_loss={test_loss.item():.4f}")

print("learned parameters:", model.state_dict())
```

After 200 epochs you should see something like:

```
learned parameters: OrderedDict([('weight', tensor([0.6987])), ('bias', tensor([0.3013]))])
```

Compare with WEIGHT = 0.7 and BIAS = 0.3. The model rediscovered them.
Why zero_grad?
PyTorch accumulates gradients across backward() calls. That’s the right behavior for some advanced cases, but for a simple loop it would mean every iteration adds to the previous gradient, and the model drifts instead of converging. optimizer.zero_grad() clears the slate.
If you forget it, your training will look unstable — the loss bounces around or even grows. It’s the most common bug in beginner code.
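A tiny standalone demonstration of the accumulation behavior (illustrative only):

```python
w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
print(w.grad)    # tensor([2.])

(w * 2).sum().backward()
print(w.grad)    # tensor([4.]) — the second gradient was added, not assigned

w.grad.zero_()   # what optimizer.zero_grad() does for every parameter
print(w.grad)    # tensor([0.])
```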
train() vs eval()
train() and eval() toggle behaviors that differ between training and evaluation — Dropout, BatchNorm, etc. Linear regression has neither, but get into the habit now: every realistic model will have at least one layer that cares.
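A small Dropout example shows what the toggle changes (illustrative, not part of the lesson model):

```python
drop = nn.Dropout(p=0.5)
t = torch.ones(8)

drop.train()
print(drop(t))  # roughly half the entries zeroed, survivors scaled by 1 / (1 - p) = 2

drop.eval()
print(drop(t))  # identity — dropout is disabled at evaluation time
```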
Plot the loss curves
```python
plt.figure(figsize=(8, 5))
plt.plot(epochs_logged, train_losses, label="train loss")
plt.plot(epochs_logged, test_losses, label="test loss")
plt.xlabel("epoch")
plt.ylabel("L1 loss")
plt.legend()
plt.show()
```

Both curves should drop quickly, then plateau. That is the shape of healthy training.
Hyperparameters worth playing with
| Hyperparameter | Effect |
|---|---|
| EPOCHS | More epochs = more practice. Diminishing returns once loss plateaus. |
| lr (learning rate) | Bigger = faster but can overshoot. Smaller = slower but more stable. |
| Optimizer | SGD, Adam, AdamW, RMSprop — different update rules. |
| Loss function | L1Loss (MAE), MSELoss, HuberLoss. Different sensitivities to outliers. |
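A quick illustration of the outlier sensitivity mentioned in the last row (made-up numbers):

```python
pred = torch.zeros(4)
true = torch.tensor([0.1, 0.1, 0.1, 3.0])  # one outlier

print(nn.L1Loss()(pred, true))   # tensor(0.8250) — the outlier counts linearly
print(nn.MSELoss()(pred, true))  # tensor(2.2575) — the outlier counts quadratically
```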
Try it — break training
Set lr = 100. Run again. What happens, and why?
The loss blows up to huge values (with other models and losses you may even see NaN). With such a large learning rate, every step massively overshoots the minimum, so the parameters swing further and further away instead of converging — training diverges. The fix is a smaller learning rate or gradient clipping.
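Gradient clipping caps the size of the gradients before the optimizer uses them. A minimal sketch of where it would go in the training loop (one extra line between backward() and step()):

```python
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the total gradient norm
optimizer.step()
```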
4. Evaluate
```python
model.eval()
with torch.inference_mode():
    y_pred = model(x_test)

plot_predictions(y_pred, title="predictions after training")
```

The red dots should now sit on top of the green dots.
It’s also useful to inspect the residuals (true minus predicted):
```python
residuals = (y_test - y_pred).squeeze()
print("mean residual :", residuals.mean().item())
print("max abs error :", residuals.abs().max().item())
```

For a perfect line fit on noise-free data, the residuals should be very close to zero.
5. Save and reload
PyTorch saves the state_dict (the dictionary of parameter tensors). It does not save the class definition. You will recreate the class manually when you load.
```python
from pathlib import Path

MODEL_DIR = Path("models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)
MODEL_FILE = MODEL_DIR / "linear_regression.pt"

torch.save(model.state_dict(), MODEL_FILE)
```

Reload by instantiating the same class and copying the weights:

```python
loaded = LinearRegression()
loaded.load_state_dict(torch.load(MODEL_FILE))
loaded.eval()

with torch.inference_mode():
    print(loaded(x_test[:5]))
```

Why save the dict and not the whole model object? The class definition is more stable across PyTorch versions than a pickled object. Saving the full model with torch.save(model, ...) works but breaks more easily.
Using a GPU
To run on the GPU when one is available, send both data and model to the same device:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = LinearRegression().to(device)x_train, y_train = x_train.to(device), y_train.to(device)x_test, y_test = x_test.to(device), y_test.to(device)If model and data are on different devices PyTorch will raise a runtime error pointing at the mismatch.
Exercises
Warm-up
1. Change WEIGHT and BIAS, retrain, and verify the model recovers them.
2. Replace the manual nn.Parameter definitions with a single self.layer = nn.Linear(in_features=1, out_features=1). Confirm training still works and the parameter count is the same (a sketch follows this list).
3. Replace nn.L1Loss with nn.MSELoss. Does it converge faster, slower, or the same? Why might MSE be more sensitive to a single outlier?
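For exercise 2, a minimal sketch of what the nn.Linear version could look like (the class name LinearRegressionV2 is just illustrative):

```python
class LinearRegressionV2(nn.Module):
    def __init__(self):
        super().__init__()
        # one weight and one bias — same parameter count as the manual version
        self.layer = nn.Linear(in_features=1, out_features=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer(x)
```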
Hyperparameter exploration
4. For lr in [0.001, 0.01, 0.1, 1.0], retrain from scratch and plot the four loss curves on the same figure. Which lr is best?
5. Reduce EPOCHS to 10, then increase it to 1000, with lr=0.01. What does the curve look like in each case?
Adding noise
6. Add Gaussian noise to y (y = 0.7 * x + 0.3 + 0.05 * torch.randn_like(x)) and retrain. Do the recovered parameters still match the true ones? How precisely?
7. Increase the noise to 0.2. Now what happens?
Multi-feature regression
8. Generate x of shape (100, 2) and y = 0.7 * x[:, 0] + 0.3 * x[:, 1] + 0.1. Replace nn.Linear(1, 1) with nn.Linear(2, 1) and recover the three parameters.
Diagnostic plotting
9. Plot the parameter values (weight and bias) over the epochs as well as the loss. Watch them converge to the true values.
For exercise 9:
```python
weights, biases = [], []
for epoch in range(EPOCHS):
    model.train()
    loss = loss_fn(model(x_train), y_train)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    weights.append(model.weight.item())
    biases.append(model.bias.item())

plt.plot(weights, label="weight")
plt.plot(biases, label="bias")
plt.axhline(WEIGHT, color="grey", linestyle="--")
plt.axhline(BIAS, color="grey", linestyle=":")
plt.legend(); plt.show()
```

Capstone — fit a noisy sine wave
A line is too easy. Let’s fit something the model can’t trivially solve with two parameters: a noisy sine wave. We’ll use a small neural network and watch the same workflow handle it.
Build the data
```python
import torch
from torch import nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

split = int(0.8 * len(x))
indices = torch.randperm(len(x))  # shuffle before splitting
train_idx, test_idx = indices[:split], indices[split:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

plt.scatter(x_train, y_train, s=8, label="train")
plt.scatter(x_test, y_test, s=8, label="test", color="g")
plt.legend(); plt.show()
```

Build the model
A linear layer can only draw straight lines. To fit a sine we need non-linearities between linear layers. The smallest such network is input → hidden → activation → output; here we stack two hidden layers for a bit more capacity.
```python
class SineNet(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

model = SineNet()
```

Tanh is a smooth non-linearity that maps any real number to (-1, 1). Without it, stacking linear layers collapses to a single linear layer.
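You can see the collapse directly: linear layers stacked with no activation still produce a straight line (a small illustrative check):

```python
stack = nn.Sequential(nn.Linear(1, 32), nn.Linear(32, 1))  # no activation in between

xs = torch.linspace(-1, 1, 5).unsqueeze(1)  # equally spaced inputs
with torch.inference_mode():
    ys = stack(xs).squeeze()
print(ys[1:] - ys[:-1])  # equal differences — the output is still a straight line
```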
Train
```python
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

train_losses, test_losses = [], []
EPOCHS = 1000

for epoch in range(EPOCHS):
    model.train()
    loss = loss_fn(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.inference_mode():
        test_loss = loss_fn(model(x_test), y_test)

    train_losses.append(loss.item())
    test_losses.append(test_loss.item())

    if epoch % 100 == 0:
        print(f"epoch {epoch:4d} loss={loss.item():.4f} test_loss={test_loss.item():.4f}")
```

Note we switched from SGD to Adam — it adapts the learning rate per parameter and converges much faster on neural networks.
Plot loss curves
```python
plt.plot(train_losses, label="train")
plt.plot(test_losses, label="test")
plt.yscale("log")
plt.xlabel("epoch"); plt.ylabel("MSE"); plt.legend()
plt.show()
```

The y-axis is log-scaled so you can see the late-stage improvements.
Plot the fit
```python
model.eval()
with torch.inference_mode():
    grid = torch.linspace(-3.14, 3.14, 500).unsqueeze(1)
    pred = model(grid)

plt.scatter(x_train, y_train, s=4, alpha=0.5, label="train")
plt.scatter(x_test, y_test, s=4, alpha=0.5, color="g", label="test")
plt.plot(grid, pred, color="r", label="model")
plt.plot(grid, torch.sin(grid), color="k", linestyle="--", label="true sin(x)")
plt.legend(); plt.show()
```

The red curve (model) should track the dashed line (true sine) closely.
Save the model
```python
from pathlib import Path

Path("models").mkdir(exist_ok=True)
torch.save(model.state_dict(), "models/sinenet.pt")
```

Stretch goals
If you finish early, try:
- Predict outside the training range. Plot the model on [-6, 6]. The model has no idea what to do beyond [-pi, pi] — this is extrapolation, and neural networks are bad at it.
- Smaller hidden size. Drop hidden to 4. How well can the model fit then? At what point does it become unable to capture the curve?
- More noise. Increase the noise coefficient from 0.1 to 0.5. The training loss won't go to zero — the model can't fit randomness — but the fitted curve should still roughly recover the underlying sine.
Recap
The 5-step workflow:
- Data — load, split, visualize.
- Model — subclass nn.Module, define parameters and forward.
- Loss + optimizer — pick a loss for the task, an optimizer to update the parameters.
- Train loop — forward, loss, zero_grad, backward, step. Evaluate every few epochs.
- Save — persist state_dict for reuse.
Every other chapter in this series reuses this skeleton. Only the data and the model change.