PyTorch - Workflow

A full PyTorch model cycle — data, model, training, evaluation and saving — illustrated step by step on a regression problem.

Goal of the lesson

By the end of this 3-hour session you should be able to:

  • recognize the five steps every PyTorch project follows,
  • prepare data and split it into train and test,
  • build a model by subclassing nn.Module,
  • write a training loop with a loss, an optimizer and gradient descent,
  • track and plot loss curves,
  • save and reload a trained model,
  • fit a noisy non-linear curve as a capstone.

This is the most important lesson of the series. Every chapter that follows reuses the same five-step skeleton — only the data and the model change.

Suggested timing

Block     Topic
20 min    What a workflow is, the 5 steps
25 min    Generate and split the data
30 min    Build a linear-regression model
45 min    Train, track loss curves, evaluate
20 min    Save and reload
40 min    Capstone — fit a noisy sine wave

The 5-step workflow

In machine learning, the model is a tiny part of the project. Most of your time goes into data and into training and diagnostics. The shape of the workflow stays remarkably constant: the same five steps apply to a 50-line linear regression and to a 500-million-parameter language model.

In this lesson we work on the smallest interesting problem — a linear regression that learns the line y = 0.7 x + 0.3 — so we can focus entirely on the workflow.

Setup

PowerShell
uv init --python 3.12 workflow
cd workflow
uv add torch matplotlib

Imports we will reuse:

main.py
import matplotlib.pyplot as plt
import torch
from torch import nn
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)

1. Prepare data

Real ML starts with real data. Here we generate it ourselves so we know the answer in advance and can verify whether the model finds it.

main.py
WEIGHT = 0.7
BIAS = 0.3
x = torch.arange(0, 1, 0.02).unsqueeze(dim=1) # shape [50, 1]
y = WEIGHT * x + BIAS # shape [50, 1]
print(x[:5])
print(y[:5])
print(x.shape, y.shape)

Why unsqueeze(dim=1)? PyTorch layers expect data in the shape [batch, features]. A 1-D tensor [50] is ambiguous; turning it into [50, 1] says “50 samples, each with 1 feature”.
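You can check this in a quick REPL session (separate from main.py):

```python
import torch

a = torch.arange(0, 1, 0.02)   # 1-D: 50 values
b = a.unsqueeze(dim=1)         # 2-D: 50 samples, each with 1 feature
print(a.shape)                 # torch.Size([50])
print(b.shape)                 # torch.Size([50, 1])
```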

Train / test split

The first lesson of every machine-learning course: never evaluate on data the model has seen. We hide 20% of the points until evaluation time.

main.py
split = int(0.8 * len(x))
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]
print(len(x_train), len(x_test)) # 40 10

Note

With real data you should also shuffle before splitting — otherwise you might end up with all “easy” cases in train and all “hard” cases in test. Our points are uniformly spaced, so we can skip that step.
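For reference, shuffling before splitting is one extra line with torch.randperm. A sketch using the same x and y as above:

```python
import torch

torch.manual_seed(42)
x = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
y = 0.7 * x + 0.3

# shuffle the indices, then split 80/20
perm = torch.randperm(len(x))
split = int(0.8 * len(x))
x_train, y_train = x[perm[:split]], y[perm[:split]]
x_test, y_test = x[perm[split:]], y[perm[split:]]
print(len(x_train), len(x_test))  # 40 10
```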

Visualize

The first thing you do with a new dataset is plot it.

main.py
def plot_predictions(predictions=None, title=""):
    plt.figure(figsize=(8, 5))
    plt.scatter(x_train, y_train, c="b", s=8, label="train")
    plt.scatter(x_test, y_test, c="g", s=8, label="test")
    if predictions is not None:
        plt.scatter(x_test, predictions, c="r", s=8, label="prediction")
    plt.title(title)
    plt.legend()
    plt.show()

plot_predictions(title="dataset")

You should see a single straight run of dots: blue on the left for training, green on the right for testing.

Try it — split sizes

What happens if you change the split to 50%? to 90%? Predict the result, then confirm.

2. Build a model

Every model in PyTorch is a class that subclasses nn.Module. You define:

  • the parameters the model will learn (weights, biases),
  • a forward(x) method describing the math.
main.py
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1, requires_grad=True))
        self.bias = nn.Parameter(torch.randn(1, requires_grad=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x + self.bias

Two things to notice:

  • nn.Parameter registers weight and bias as trainable. They start as random numbers; the optimizer updates them during training.
  • forward(x) is the math the model performs. Calling model(x) invokes it — never call .forward() directly, because that bypasses some PyTorch hooks.
main.py
torch.manual_seed(42)
model = LinearRegression()
print(list(model.parameters()))
print(model.state_dict())

state_dict() returns the dictionary {name: tensor} PyTorch uses to save and load weights.

Predictions before training

main.py
with torch.inference_mode():
    y_pred = model(x_test)

plot_predictions(y_pred, title="predictions before training")

torch.inference_mode() is a context manager that turns off gradient bookkeeping — faster and uses less memory. The red dots will be far from the green ones because the parameters are random.
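You can see the effect directly: tensors computed inside the context do not track gradients (a standalone snippet, not part of main.py):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

out = w * 3                # normal mode: the operation is recorded for autograd
print(out.requires_grad)   # True

with torch.inference_mode():
    out = w * 3            # inference mode: no graph, no gradient bookkeeping
print(out.requires_grad)   # False
```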

Sanity check — what shape comes out?

If the input has shape (N, 1), what shape will the output have?

3. Train

Training is a loop that repeats four operations on every batch:

Step      What it does
Forward   Run inputs through the model to get predictions
Loss      Compare predictions against ground truth
Backward  Compute gradients of the loss w.r.t. each parameter
Step      Move each parameter a tiny bit in the direction that reduces the loss

Loss and optimizer

The loss is a number that measures how wrong the model is. For regression, the simplest choice is mean absolute error:

main.py
loss_fn = nn.L1Loss()

The optimizer decides how to update the parameters from the gradients. SGD (stochastic gradient descent) is the simplest:

main.py
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

lr is the learning rate — how big a step to take. Too small and training crawls; too big and it overshoots and diverges.
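To make the step concrete, here is one SGD update written by hand on a toy one-parameter loss. This is a sketch of the rule optimizer.step() applies, not code for main.py:

```python
import torch

lr = 0.01
w = torch.tensor([0.0], requires_grad=True)

loss = (w - 1.0).pow(2).sum()  # toy loss, minimized at w = 1
loss.backward()                # w.grad = d(loss)/dw = 2*(w - 1) = -2

with torch.no_grad():
    w -= lr * w.grad           # the SGD rule: param = param - lr * grad
print(w)                       # tensor([0.0200], requires_grad=True)
```

The gradient points uphill, so stepping against it (the minus sign) reduces the loss; lr scales how far each step goes.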

Training loop

main.py
EPOCHS = 200
train_losses, test_losses, epochs_logged = [], [], []

for epoch in range(EPOCHS):
    # --- training ---
    model.train()                    # training mode
    y_pred = model(x_train)          # forward
    loss = loss_fn(y_pred, y_train)  # loss
    optimizer.zero_grad()            # reset gradients
    loss.backward()                  # backward
    optimizer.step()                 # step

    # --- evaluation ---
    model.eval()
    with torch.inference_mode():
        test_pred = model(x_test)
        test_loss = loss_fn(test_pred, y_test)

    if epoch % 20 == 0:
        train_losses.append(loss.item())
        test_losses.append(test_loss.item())
        epochs_logged.append(epoch)
        print(f"epoch {epoch:3d} loss={loss.item():.4f} test_loss={test_loss.item():.4f}")

print("learned parameters:", model.state_dict())

After 200 epochs you should see something like:

Terminal window
learned parameters: OrderedDict([('weight', tensor([0.6987])), ('bias', tensor([0.3013]))])

Compare with WEIGHT = 0.7 and BIAS = 0.3. The model rediscovered them.

Why zero_grad?

PyTorch accumulates gradients across backward() calls. That’s the right behavior for some advanced cases, but for a simple loop it would mean every iteration adds to the previous gradient, and the model drifts instead of converging. optimizer.zero_grad() clears the slate.

If you forget it, your training will look unstable — the loss bounces around or even grows. It’s the most common bug in beginner code.
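A standalone snippet makes the accumulation visible:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

for i in range(3):
    loss = (2 * w).sum()  # d(loss)/dw = 2, every iteration
    loss.backward()
    print(w.grad)         # grads add up: tensor([2.]), tensor([4.]), tensor([6.])

w.grad.zero_()            # what optimizer.zero_grad() does for each parameter
print(w.grad)             # tensor([0.])
```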

train() vs eval()

train() and eval() toggle behaviors that differ between training and evaluation — Dropout, BatchNorm, etc. Linear regression has neither, but get into the habit now: every realistic model will have at least one layer that cares.
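A quick demonstration with Dropout, the most common layer that cares (standalone snippet):

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: zeroes roughly half the values, scales the rest by 2
print(drop(x))

drop.eval()     # eval mode: identity, values pass through unchanged
print(drop(x))
```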

Plot the loss curves

main.py
plt.figure(figsize=(8, 5))
plt.plot(epochs_logged, train_losses, label="train loss")
plt.plot(epochs_logged, test_losses, label="test loss")
plt.xlabel("epoch")
plt.ylabel("L1 loss")
plt.legend()
plt.show()

Both curves should drop quickly, then plateau. That is the shape of healthy training.

Hyperparameters worth playing with

Hyperparameter       Effect
EPOCHS               More epochs = more practice. Diminishing returns once loss plateaus.
lr (learning rate)   Bigger = faster but can overshoot. Smaller = slower but more stable.
Optimizer            SGD, Adam, AdamW, RMSprop — different update rules.
Loss function        L1Loss (MAE), MSELoss, HuberLoss. Different sensitivities to outliers.

Try it — break training

Set lr = 100. Run again. What happens, and why?

4. Evaluate

main.py
model.eval()
with torch.inference_mode():
    y_pred = model(x_test)

plot_predictions(y_pred, title="predictions after training")

The red dots should now sit on top of the green dots.

It’s also useful to inspect the residuals (true minus predicted):

main.py
residuals = (y_test - y_pred).squeeze()
print("mean residual :", residuals.mean().item())
print("max abs error :", residuals.abs().max().item())

For a perfect line fit on noise-free data, the residuals should be very close to zero.

5. Save and reload

PyTorch saves the state_dict (the dictionary of parameter tensors). It does not save the class definition. You will recreate the class manually when you load.

main.py
from pathlib import Path
MODEL_DIR = Path("models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)
MODEL_FILE = MODEL_DIR / "linear_regression.pt"
torch.save(model.state_dict(), MODEL_FILE)

Reload by instantiating the same class and copying the weights:

main.py
loaded = LinearRegression()
loaded.load_state_dict(torch.load(MODEL_FILE))
loaded.eval()

with torch.inference_mode():
    print(loaded(x_test[:5]))

Why save the dict and not the whole model object? The class definition is more stable across PyTorch versions than a pickled object. Saving the full model with torch.save(model, ...) works but breaks more easily.
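Recent PyTorch versions also accept weights_only=True in torch.load, which restricts unpickling to plain tensors and containers; it is a safer choice for checkpoints you did not create yourself. A self-contained sketch, using a toy nn.Linear in place of our class and an in-memory buffer in place of a file:

```python
import io
import torch
from torch import nn

model = nn.Linear(1, 1)          # toy stand-in for a real model class
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# weights_only=True refuses to unpickle arbitrary Python objects,
# so a tampered checkpoint cannot execute code on load
state = torch.load(buffer, weights_only=True)
model.load_state_dict(state)
print(sorted(state.keys()))      # ['bias', 'weight']
```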

Using a GPU

To run on the GPU when one is available, send both data and model to the same device:

main.py
device = "cuda" if torch.cuda.is_available() else "cpu"
model = LinearRegression().to(device)
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)

If model and data are on different devices PyTorch will raise a runtime error pointing at the mismatch.

Exercises

Warm-up

  1. Change WEIGHT and BIAS, retrain, and verify the model recovers them.
  2. Replace the manual nn.Parameter definitions with a single self.layer = nn.Linear(in_features=1, out_features=1). Confirm training still works and the parameter count is the same.
  3. Replace nn.L1Loss with nn.MSELoss. Does it converge faster, slower, or the same? Why might MSE be more sensitive to a single outlier?

Hyperparameter exploration

  1. For lr in [0.001, 0.01, 0.1, 1.0], retrain from scratch and plot the four loss curves on the same figure. Which lr is best?
  2. Reduce EPOCHS to 10 and increase to 1000 with lr=0.01. What does the curve look like in each case?

Adding noise

  1. Add Gaussian noise to y (y = 0.7 * x + 0.3 + 0.05 * torch.randn_like(x)) and retrain. Do the recovered parameters still match the true ones? How precisely?
  2. Increase the noise to 0.2. Now what happens?

Multi-feature regression

  1. Generate x of shape (100, 2) and y = 0.7 * x[:, 0] + 0.3 * x[:, 1] + 0.1. Replace nn.Linear(1, 1) with nn.Linear(2, 1) and recover the three parameters.

Diagnostic plotting

  1. Plot the parameter values (weight and bias) over the epochs as well as the loss. Watch them converge to the true values.

Capstone — fit a noisy sine wave

A line is too easy. Let’s fit something the model can’t trivially solve with two parameters: a noisy sine wave. We’ll use a small neural network and watch the same workflow handle it.

Build the data

capstone.py
import torch
from torch import nn
import matplotlib.pyplot as plt
torch.manual_seed(0)
x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)
split = int(0.8 * len(x))
indices = torch.randperm(len(x)) # shuffle before splitting
train_idx, test_idx = indices[:split], indices[split:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]
plt.scatter(x_train, y_train, s=8, label="train")
plt.scatter(x_test, y_test, s=8, label="test", color="g")
plt.legend(); plt.show()

Build the model

A linear layer can only draw straight lines. To fit a sine we need a non-linearity between linear layers. The model below is a small multilayer perceptron: input → hidden → activation → hidden → activation → output.

capstone.py
class SineNet(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

model = SineNet()

Tanh is a smooth non-linearity that maps any real number to (-1, 1). Without it, stacking linear layers collapses to a single linear layer.
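You can verify the collapse numerically: two stacked Linear layers with no activation in between equal a single linear map whose weight and bias can be computed in closed form (standalone check):

```python
import torch
from torch import nn

torch.manual_seed(0)
f = nn.Linear(1, 32)
g = nn.Linear(32, 1)

# g(f(x)) = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2): still a line
W = g.weight @ f.weight                                   # combined slope, shape [1, 1]
b = g.weight @ f.bias.unsqueeze(1) + g.bias.unsqueeze(1)  # combined intercept

x = torch.randn(5, 1)
stacked = g(f(x))
collapsed = x @ W.T + b.T
print(torch.allclose(stacked, collapsed, atol=1e-6))      # True
```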

Train

capstone.py
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
train_losses, test_losses = [], []
EPOCHS = 1000
for epoch in range(EPOCHS):
    model.train()
    loss = loss_fn(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.inference_mode():
        test_loss = loss_fn(model(x_test), y_test)

    train_losses.append(loss.item())
    test_losses.append(test_loss.item())
    if epoch % 100 == 0:
        print(f"epoch {epoch:4d} loss={loss.item():.4f} test_loss={test_loss.item():.4f}")

Note we switched from SGD to Adam — it adapts the learning rate per parameter and converges much faster on neural networks.

Plot loss curves

capstone.py
plt.plot(train_losses, label="train")
plt.plot(test_losses, label="test")
plt.yscale("log")
plt.xlabel("epoch"); plt.ylabel("MSE"); plt.legend()
plt.show()

The y-axis is log-scaled so you can see the late-stage improvements.

Plot the fit

capstone.py
model.eval()
with torch.inference_mode():
    grid = torch.linspace(-3.14, 3.14, 500).unsqueeze(1)
    pred = model(grid)
plt.scatter(x_train, y_train, s=4, alpha=0.5, label="train")
plt.scatter(x_test, y_test, s=4, alpha=0.5, color="g", label="test")
plt.plot(grid, pred, color="r", label="model")
plt.plot(grid, torch.sin(grid), color="k", linestyle="--", label="true sin(x)")
plt.legend(); plt.show()

The red curve (model) should track the dashed line (true sine) closely.

Save the model

capstone.py
from pathlib import Path
Path("models").mkdir(exist_ok=True)
torch.save(model.state_dict(), "models/sinenet.pt")

Stretch goals

If you finish early, try:

  • Predict outside the training range. Plot the model on [-6, 6]. The model has no idea what to do beyond [-pi, pi] — this is extrapolation and neural networks are bad at it.
  • Smaller hidden size. Drop hidden to 4. How well can the model fit then? At what point does it become unable to capture the curve?
  • More noise. Multiply the noise term by 0.5. The training loss won’t go to zero — the model can’t fit randomness — but the test fit should still recover the underlying sine.

Recap

The 5-step workflow:

  1. Data — load, split, visualize.
  2. Model — subclass nn.Module, define parameters and forward.
  3. Loss + optimizer — pick a loss for the task, an optimizer to update parameters.
  4. Train loop — forward, loss, zero_grad, backward, step. Evaluate every few epochs.
  5. Save — persist state_dict for reuse.

Every other chapter in this series reuses this skeleton. Only the data and the model change.
