PyTorch - Classification

Train neural networks to assign discrete labels — binary and multi-class classification with PyTorch.

Goal of the lesson

By the end of this 3-hour session you should be able to:

  • explain the difference between regression and classification,
  • generate synthetic 2-D datasets and visualize their decision regions,
  • build a feed-forward neural network with non-linear activations,
  • choose the right loss for binary and multi-class problems,
  • track loss and accuracy during training,
  • recognize underfitting and overfitting visually,
  • handle real-world tabular data with mixed numerical and categorical features,
  • solve the moons dataset as a capstone.

Suggested timing

Block    Topic
15 min   What classification is, logits vs. probabilities
25 min   Generate the blobs dataset, build the model
25 min   Training loop with accuracy, decision boundary
15 min   Binary classification with BCEWithLogitsLoss
55 min   Real-world example — heart-disease prediction
45 min   Capstone — moons dataset and overfitting

Regression vs. classification

Task                        Output              Loss               Final layer
Regression                  A real number       MSELoss, L1Loss    Linear (no activation)
Binary classification       One of two classes  BCEWithLogitsLoss  Linear with 1 output (logit)
Multi-class classification  One of K classes    CrossEntropyLoss   Linear with K outputs (logits)

The five-step workflow doesn’t change. We swap the dataset, the model’s output size, and the loss.
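
In code, the swap amounts to changing the final layer and the loss. A schematic sketch (backbone_out and K are placeholder sizes, not from the lesson):

Python
from torch import nn

backbone_out = 16  # placeholder hidden width
K = 4              # placeholder number of classes
regression_head = nn.Linear(backbone_out, 1)   # real number  -> nn.MSELoss
binary_head = nn.Linear(backbone_out, 1)       # one logit    -> nn.BCEWithLogitsLoss
multiclass_head = nn.Linear(backbone_out, K)   # K logits     -> nn.CrossEntropyLoss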

Setup

PowerShell
uv init --python 3.12 classification
cd classification
uv add torch matplotlib scikit-learn numpy
main.py
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from torch import nn
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)

Multi-class — the blobs dataset

sklearn.datasets.make_blobs generates clusters of points in 2-D — perfect for visualizing what a classifier is doing.

main.py
NUM_CLASSES = 4
NUM_FEATURES = 2
x_np, y_np = make_blobs(
    n_samples=1000,
    n_features=NUM_FEATURES,
    centers=NUM_CLASSES,
    cluster_std=1.5,
    random_state=42,
)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).long()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(x_train.shape, y_train.shape, y_train[:10])

A few details that matter:

  • Features are float32. Targets for CrossEntropyLoss must be int64 (the dtype .long() produces).
  • Targets are class indices (0, 1, 2, 3), not one-hot vectors. PyTorch’s loss does the one-hot conversion internally.
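
A minimal check of both rules (the tensors are arbitrary, just for illustration):

main.py
# float32 logits of shape (batch, classes), int64 class indices of shape (batch,)
demo_logits = torch.randn(3, NUM_CLASSES)
demo_targets = torch.tensor([0, 3, 1])  # class indices, not one-hot rows
print(nn.CrossEntropyLoss()(demo_logits, demo_targets))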

Visualize:

main.py
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=8)
plt.title("blobs")
plt.show()

You should see four colored blobs.

Try it — change the dataset

What does cluster_std=0.5 look like? cluster_std=4.0?

Build a model

A linear model can only draw straight separators. Real data is rarely separable that way, so we add a non-linear activation between linear layers.

main.py
class BlobModel(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
model = BlobModel(in_features=NUM_FEATURES, out_features=NUM_CLASSES).to(device)
print(model)

The output layer has one unit per class. We deliberately leave it without an activation — those raw outputs are called logits. nn.CrossEntropyLoss applies LogSoftmax internally and is numerically more stable than computing the softmax ourselves.
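
You can verify the LogSoftmax claim directly (a quick sketch with made-up logits):

main.py
# CrossEntropyLoss(logits, y) equals NLLLoss applied to log-softmax of the logits
demo_logits = torch.tensor([[2.0, -1.0, 0.5, 0.0]])
demo_target = torch.tensor([0])
print(nn.CrossEntropyLoss()(demo_logits, demo_target).item())
print(nn.NLLLoss()(torch.log_softmax(demo_logits, dim=1), demo_target).item())
# both lines print the same number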

From logits to predictions

Three closely-related tensors to keep straight:

Tensor         Meaning                                     How to obtain it
logits         Raw network output, one number per class    model(x)
probabilities  Softmax of logits, one per class, sum to 1  torch.softmax(logits, dim=1)
predictions    Index of the largest logit                  logits.argmax(dim=1)

argmax of probabilities and argmax of logits agree, so you don’t actually need softmax to predict — only to report a confidence.

main.py
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)
with torch.inference_mode():
    logits = model(x_test[:5])
    probs = torch.softmax(logits, dim=1)
    preds = logits.argmax(dim=1)
print("logits:\n", logits)
print("probs (rows sum to 1):\n", probs)
print("preds:", preds)
print("truth:", y_test[:5])

Before training, the predictions are essentially random.

Loss, optimizer, accuracy

main.py
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
def accuracy(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    correct = (y_true == y_pred).sum().item()
    return correct / len(y_pred)

Loss tells the optimizer how to improve. Accuracy tells us how well the model is doing in human terms. The two can move independently: cross-entropy keeps falling as correct predictions grow more confident, and a single overconfident wrong answer can raise the loss without changing accuracy at all.
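
A quick illustration (made-up logits): both predictions below pick class 0 correctly, so accuracy is identical, but the hesitant one carries a much higher loss.

main.py
demo_target = torch.tensor([0])
confident = torch.tensor([[4.0, 0.0, 0.0, 0.0]])  # softmax prob ~0.95 for class 0
hesitant = torch.tensor([[0.5, 0.0, 0.0, 0.0]])   # softmax prob ~0.36 for class 0
ce = nn.CrossEntropyLoss()
print(ce(confident, demo_target).item())  # ~0.05
print(ce(hesitant, demo_target).item())   # ~1.04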

Train

main.py
EPOCHS = 100
history = []
for epoch in range(EPOCHS):
    model.train()
    logits = model(x_train)
    loss = loss_fn(logits, y_train)
    train_acc = accuracy(y_train, logits.argmax(dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.inference_mode():
        test_logits = model(x_test)
        test_loss = loss_fn(test_logits, y_test)
        test_acc = accuracy(y_test, test_logits.argmax(dim=1))
    history.append((loss.item(), test_loss.item(), train_acc, test_acc))
    if epoch % 10 == 0:
        print(
            f"epoch {epoch:3d} loss={loss.item():.4f} acc={train_acc:.2%} "
            f"| test_loss={test_loss.item():.4f} test_acc={test_acc:.2%}"
        )

After 100 epochs you should see test accuracy around 99% — four well-separated blobs are an easy problem.

Plot loss and accuracy

main.py
losses = np.array(history)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(losses[:, 0], label="train")
axes[0].plot(losses[:, 1], label="test")
axes[0].set_title("loss"); axes[0].legend()
axes[1].plot(losses[:, 2], label="train")
axes[1].plot(losses[:, 3], label="test")
axes[1].set_title("accuracy"); axes[1].legend()
plt.show()

Healthy training: both losses decrease, both accuracies increase, and the train and test curves stay close to each other.

Decision boundary

A picture is worth a thousand metrics. Sample a grid of points across the input space, ask the model for a prediction at each, and color the result.

main.py
def plot_decision_boundary(model, x, y, title=""):
    model.eval()
    x = x.to("cpu"); y = y.to("cpu")
    x_min, x_max = x[:, 0].min() - 0.1, x[:, 0].max() + 0.1
    y_min, y_max = x[:, 1].min() - 0.1, x[:, 1].max() + 0.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
    grid = torch.from_numpy(np.column_stack((xx.ravel(), yy.ravel()))).float().to(device)
    with torch.inference_mode():
        preds = model(grid).argmax(dim=1).cpu().numpy().reshape(xx.shape)
    plt.contourf(xx, yy, preds, cmap=plt.cm.RdYlBu, alpha=0.6)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=8, edgecolors="k", linewidths=0.2)
    plt.title(title); plt.show()
plot_decision_boundary(model, x_test, y_test, title="trained model")

Try it — kill the activations

Comment out the nn.ReLU() lines, retrain from scratch, and plot the decision boundary again. What changes?

Binary classification

For two classes you have two equivalent options:

Approach    Output size  Loss                  Targets
One logit   1            nn.BCEWithLogitsLoss  float 0.0 / 1.0
Two logits  2            nn.CrossEntropyLoss   int 0 / 1

BCEWithLogitsLoss combines a sigmoid and binary cross-entropy in one numerically-stable step.
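
The fusion matters for extreme logits. A quick sketch (values are arbitrary): for a moderate logit the two forms agree; for a large negative logit, float32 sigmoid underflows to exactly 0 and the naive loss saturates at BCELoss's internal clamp of 100, while the fused version stays exact.

main.py
demo_logit = torch.tensor([[2.0]])
demo_target = torch.tensor([[1.0]])
print(nn.BCEWithLogitsLoss()(demo_logit, demo_target).item())       # ~0.1269
print(nn.BCELoss()(torch.sigmoid(demo_logit), demo_target).item())  # ~0.1269, same

extreme = torch.tensor([[-110.0]])
print(nn.BCEWithLogitsLoss()(extreme, demo_target).item())          # 110.0, exact
print(nn.BCELoss()(torch.sigmoid(extreme), demo_target).item())     # 100.0, clamped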

main.py
from sklearn.datasets import make_circles
x_np, y_np = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).float().unsqueeze(1) # shape (N, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
class CircleModel(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit
        )

    def forward(self, x):
        return self.net(x)
model = CircleModel().to(device)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)
for epoch in range(500):
    model.train()
    logits = model(x_train)
    loss = loss_fn(logits, y_train)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if epoch % 50 == 0:
        with torch.inference_mode():
            preds = (torch.sigmoid(model(x_test)) > 0.5).float()
            acc = (preds == y_test).float().mean().item()
        print(f"epoch {epoch:3d} loss={loss.item():.4f} test_acc={acc:.2%}")

Notice:

  • targets are float for BCEWithLogitsLoss, not int,
  • prediction is sigmoid(logit) > 0.5, equivalent to logit > 0,
  • the model output has a trailing dim of 1 to match the target shape (N, 1).
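
The equivalence in the second bullet follows from sigmoid being monotonic with sigmoid(0) = 0.5, and is easy to check:

main.py
with torch.inference_mode():
    logits = model(x_test)
    assert ((torch.sigmoid(logits) > 0.5) == (logits > 0)).all()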

Real-world example — heart-disease prediction

Synthetic blobs and circles are good for understanding what the model does. Now we move to real tabular data: predict whether a patient has heart disease from a small set of clinical features.

The dataset is provided by the Cleveland Clinic Foundation: 303 rows, 13 features, one binary target.

Feature   Type         Meaning
age       numerical    Age in years
sex       categorical  0 = female, 1 = male
cp        categorical  Chest-pain type (1–4)
trestbps  numerical    Resting blood pressure
chol      numerical    Serum cholesterol
fbs       categorical  Fasting blood sugar > 120 mg/dl
restecg   categorical  Resting ECG results
thalach   numerical    Maximum heart rate achieved
exang     categorical  Exercise-induced angina
oldpeak   numerical    ST depression induced by exercise
slope     numerical    Slope of the peak exercise ST segment
ca        categorical  Number of major vessels (0–3)
thal      categorical  normal, fixed, or reversible
target    binary       1 = heart disease, 0 = no heart disease

This is a much more realistic setup than 2-D toy data. You will learn:

  • how to load tabular data with pandas,
  • how to preprocess mixed numerical + categorical features,
  • how to wrap tensors in a TensorDataset and iterate them with a DataLoader,
  • how to run inference on a single new patient through the same pipeline.

Setup

PowerShell
uv add pandas
heart.py
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)

Loading the data

heart.py
url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(url)
print(df.shape) # (303, 14)
print(df.head())
print(df["target"].value_counts())

The first thing to do with any new dataset is to look at it. df.head() shows the first rows; value_counts() checks for class imbalance. The Cleveland set is roughly balanced (165 vs. 138).

Preparing the data

Mixed-type tabular data needs two preprocessing steps:

  • Numerical features get standardized (zero mean, unit variance) so the network’s gradients don’t get distorted by columns with very different scales.
  • Categorical features get integer-encoded so they’re representable as numbers.

The cardinal rule: fit the preprocessors on the training set only, then transform both train and test. Otherwise statistics from the test set leak into training and your reported accuracy is optimistic.

heart.py
categorical = ["sex", "cp", "fbs", "restecg", "exang", "ca", "thal"]
numerical = ["age", "trestbps", "chol", "thalach", "oldpeak", "slope"]
features = numerical + categorical
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1337)
scaler = StandardScaler()
train_df.loc[:, numerical] = scaler.fit_transform(train_df[numerical])
test_df.loc[:, numerical] = scaler.transform(test_df[numerical])
encoders = {}
for column in categorical:
    le = LabelEncoder()
    train_df.loc[:, column] = le.fit_transform(train_df[column])
    test_df.loc[:, column] = le.transform(test_df[column])
    encoders[column] = le

Convert the dataframes to PyTorch tensors and wrap them in a DataLoader so we can iterate them in batches:

heart.py
# to_numpy(dtype=...) also coerces columns that stayed object-typed after encoding
x_train = torch.tensor(train_df[features].to_numpy(dtype="float32"))
y_train = torch.tensor(train_df["target"].to_numpy(dtype="float32")).unsqueeze(1)
x_test = torch.tensor(test_df[features].to_numpy(dtype="float32"))
y_test = torch.tensor(test_df["target"].to_numpy(dtype="float32")).unsqueeze(1)
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(x_test, y_test), batch_size=32)
print(x_train.shape, y_train.shape)
# torch.Size([242, 13]) torch.Size([242, 1])

A batch is a subset of samples used in a single training iteration. Batched training is faster and gives smoother gradients. shuffle=True on the training loader prevents the model from memorizing sample order.
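
To make that concrete, pull one batch out of the loader and inspect its shapes:

heart.py
inputs, labels = next(iter(train_loader))
print(inputs.shape, labels.shape)
# torch.Size([32, 13]) torch.Size([32, 1])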

Defining the model

A small two-layer multilayer perceptron (MLP) with dropout. The output is a single logit per sample — BCEWithLogitsLoss will turn it into a probability internally.

Layer              Purpose
nn.Linear(13, 32)  Linear transform from 13 features to 32 hidden units
nn.ReLU            Non-linearity
nn.Dropout(0.5)    Regularization — drops 50% of activations during training
nn.Linear(32, 1)   Linear transform to a single output (logit)
heart.py
class HeartModel(nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 32),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)
model = HeartModel(input_size=len(features)).to(device)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Training

heart.py
EPOCHS = 50
for epoch in range(EPOCHS):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.inference_mode():
        train_logits = model(x_train.to(device))
        test_logits = model(x_test.to(device))
        train_acc = ((train_logits > 0).float() == y_train.to(device)).float().mean().item()
        test_acc = ((test_logits > 0).float() == y_test.to(device)).float().mean().item()
        test_loss = loss_fn(test_logits, y_test.to(device)).item()
    if (epoch + 1) % 5 == 0:
        print(
            f"epoch {epoch + 1:2d} loss={loss.item():.4f} "
            f"train_acc={train_acc:.2%} test_loss={test_loss:.4f} test_acc={test_acc:.2%}"
        )

You should reach roughly 80–85% test accuracy. A few percent of variance between runs is normal — there are only ~60 test patients.

In each iteration:

  • model(inputs) runs a forward pass.
  • optimizer.zero_grad() clears gradients left over from the previous iteration.
  • loss.backward() computes gradients via backpropagation.
  • optimizer.step() updates the parameters.

model.train() enables training-mode behavior (Dropout active); model.eval() switches it off so evaluation is deterministic.
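
You can observe the difference directly (a small check; any input row works):

heart.py
probe = x_test[:1].to(device)
model.train()
print(model(probe).item(), model(probe).item())  # dropout active: values usually differ
model.eval()
print(model(probe).item(), model(probe).item())  # dropout off: values identical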

Note

With only ~240 training examples, dropout makes a real difference. Try setting nn.Dropout(0.0) and re-run — train accuracy will reach ~100% while test accuracy stays put. That’s textbook overfitting.

Inference on a new patient

To predict on a fresh sample, run it through the same preprocessing pipeline and pass the tensor through the model.

heart.py
sample = {
    "age": 80, "sex": 0, "cp": 1, "trestbps": 124, "chol": 300,
    "fbs": 1, "restecg": 2, "thalach": 150, "exang": 0,
    "oldpeak": 2.3, "slope": 3, "ca": 0, "thal": "fixed",
}
sample_df = pd.DataFrame([sample])
sample_df.loc[:, numerical] = scaler.transform(sample_df[numerical])
for column in categorical:
    sample_df.loc[:, column] = encoders[column].transform(sample_df[column])
sample_tensor = torch.tensor(sample_df[features].to_numpy(dtype="float32")).to(device)

model.eval()
with torch.inference_mode():
    logit = model(sample_tensor)
    proba = torch.sigmoid(logit).item()
print(f"heart-disease probability: {proba:.1%}")

A few details that matter in production:

  • The scaler and encoders objects must be saved alongside the model. Predicting later without them is a bug. Use pickle or joblib.
  • LabelEncoder.transform raises if it sees a value it didn’t see during training. Real systems handle this — for example by mapping unseen categories to a special “unknown” index (see the sketch after this list).
  • The output is a probability, not a diagnosis. Threshold at 0.5 for a default decision; choose a different threshold to trade off false positives vs. false negatives depending on cost.
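
Here is one way to make the encoding step forgiving. This is a minimal sketch, not part of the lesson's pipeline; safe_transform and the unknown index -1 are illustrative choices, and a model trained without ever seeing that index has no meaningful weights for it:

heart.py
import numpy as np

def safe_transform(le, values, unknown_index=-1):
    """Map categories unseen during fit to unknown_index instead of raising."""
    known = set(le.classes_)
    return np.array([le.transform([v])[0] if v in known else unknown_index
                     for v in values])

# usage: sample_df.loc[:, column] = safe_transform(encoders[column], sample_df[column])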

Try it — deliberate leakage

Move the scaler.fit_transform call to be applied to the whole dataframe before train_test_split. Re-run training. What happens to the test accuracy, and why is the result misleading?

Exercises

Warm-up

  1. Reduce cluster_std to 0.5, retrain. How does test accuracy change?
  2. Increase cluster_std to 4.0 and add a third hidden layer. Does accuracy improve?
  3. Replace nn.ReLU with nn.Tanh. Plot both decision boundaries and compare.

Generalization

  1. Reduce the training set to n_samples=50 (with cluster_std=2.0). Train and evaluate. Now repeat with n_samples=2000. What changes about the decision boundary?
  2. Train a model with no hidden layers — nn.Linear(2, 4) directly. What’s the maximum accuracy you can reach? On what kinds of datasets is it enough?

Confusion matrix

  1. After training, build a confusion matrix:
    import numpy as np
    with torch.inference_mode():
        preds = model(x_test).argmax(dim=1).cpu().numpy()
    truth = y_test.cpu().numpy()
    cm = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=int)
    for t, p in zip(truth, preds):
        cm[t, p] += 1
    print(cm)
    Which class is the easiest? The hardest?

Binary classification

  1. Generate make_moons(n_samples=1000, noise=0.2) and train both a one-logit model and a two-logit model. Verify they reach essentially the same accuracy.
  2. Plot the decision boundary as a smooth probability heatmap (use torch.sigmoid(model(grid)) instead of argmax).

Heart disease

  1. Build a confusion matrix on the heart-disease test set: how many false positives and false negatives does the model produce?
  2. Try thresholds other than 0.5. Plot precision and recall as functions of the threshold (probs > t for t between 0.1 and 0.9). Which threshold would you pick if a false negative is twice as costly as a false positive?
  3. Add a second hidden layer to HeartModel. Does it help? Why might it not?
  4. Pickle the trained model and the scaler + encoders to a single file with joblib.dump({"model": model.state_dict(), "scaler": scaler, "encoders": encoders}, "heart.pkl"). Load them in a fresh script and predict on the same sample.

Capstone — overfitting on the moons dataset

The moons dataset has two interleaving half-moons that no straight line can separate. We will:

  1. train a tiny model and watch it underfit,
  2. train a too-big model and watch it overfit,
  3. apply two regularization tools — weight decay and dropout — and find a sweet spot.

Build the data

capstone.py
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from torch import nn
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
x_np, y_np = make_moons(n_samples=300, noise=0.30, random_state=0)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).long()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=10)
plt.title("moons (noisy)")
plt.show()

The data is deliberately noisy — some red points are inside the blue moon and vice versa. A perfect classifier on this data does not exist; the best we can do is recover the underlying shape.

A reusable training function

capstone.py
def train(model, epochs=2000, lr=0.01, weight_decay=0.0):
    model = model.to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    history = []
    for epoch in range(epochs):
        model.train()
        loss = loss_fn(model(x_train), y_train)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        model.eval()
        with torch.inference_mode():
            train_acc = (model(x_train).argmax(1) == y_train).float().mean().item()
            test_acc = (model(x_test).argmax(1) == y_test).float().mean().item()
        history.append((loss.item(), train_acc, test_acc))
    return model, history

Underfitting — too small

capstone.py
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 2), nn.Tanh(), nn.Linear(2, 2))

    def forward(self, x): return self.net(x)
tiny, history_tiny = train(TinyModel())
print("tiny test acc:", history_tiny[-1][2])

This model can only draw a slight curve. Test accuracy will plateau around 80% — the model underfits.

Overfitting — too big

capstone.py
class BigModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, x): return self.net(x)
big, history_big = train(BigModel())
print("big train acc:", history_big[-1][1])
print("big test acc :", history_big[-1][2])

Train accuracy will reach 100%; test accuracy will be lower than the tiny model because the network started memorizing the training noise. This is overfitting.

Visualize both

Use the plot_decision_boundary function from earlier. The big model will draw bizarre wiggles around individual training points; the tiny model will draw an almost-straight line.
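
For example (assuming you have copied plot_decision_boundary from main.py into capstone.py):

capstone.py
plot_decision_boundary(tiny, x_test, y_test, title="tiny: underfit")
plot_decision_boundary(big, x_test, y_test, title="big: overfit")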

Sweet spot — moderate model + regularization

capstone.py
class GoodModel(nn.Module):
    def __init__(self, hidden=16, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 2),
        )

    def forward(self, x): return self.net(x)
good, history_good = train(GoodModel(), weight_decay=1e-4)
print("good train acc:", history_good[-1][1])
print("good test acc :", history_good[-1][2])

Two regularizers at work:

  • Dropout randomly zeros some activations during training, forcing the network to spread what it learns across many units instead of relying on any single one.
  • Weight decay (weight_decay in the optimizer) penalizes large weights — the model prefers the simplest function that fits.

Test accuracy should match or beat the big model with much less wiggling.

Compare the three

capstone.py
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, (m, name) in zip(axes, [(tiny, "underfit"), (big, "overfit"), (good, "good")]):
    plt.sca(ax)
    # ... call plot_decision_boundary here (drop its plt.show() first) or copy the contourf code
    ax.set_title(name)
plt.show()

Stretch goals

  • Sweep hidden from 2 to 256 and plot final test accuracy. Where is the sweet spot?
  • Sweep weight_decay from 0 to 1e-2 for the big model. Does it close the gap?
  • Add early stopping: keep the parameters that gave the best test accuracy, not the final ones.

Recap

  • Classification = predict a discrete label. Output logits, one per class.
  • Loss: CrossEntropyLoss for multi-class, BCEWithLogitsLoss for binary.
  • Targets are class indices (long) for cross-entropy, floats for BCE.
  • Always plot the decision boundary — metrics hide what the model actually learned.
  • Underfitting: the model is too small for the data.
  • Overfitting: the model memorizes noise. Counter with smaller capacity, dropout, weight decay, more data.

The next chapter, Vision, applies the same ideas to images with convolutional networks.
