Train neural networks to assign discrete labels — binary and multi-class classification with PyTorch.
On this page
- Goal of the lesson
- Suggested timing
- Regression vs. classification
- Setup
- Multi-class — the blobs dataset
- Build a model
- From logits to predictions
- Loss, optimizer, accuracy
- Train
- Decision boundary
- Binary classification
- Real-world example — heart-disease prediction
- Exercises
- Capstone — overfitting on the moons dataset
- Recap
- References
Goal of the lesson
By the end of this 3-hour session you should be able to:
- explain the difference between regression and classification,
- generate synthetic 2-D datasets and visualize their decision regions,
- build a feed-forward neural network with non-linear activations,
- choose the right loss for binary and multi-class problems,
- track loss and accuracy during training,
- recognize underfitting and overfitting visually,
- handle real-world tabular data with mixed numerical and categorical features,
- solve the moons dataset as a capstone.
Suggested timing
| Block | Topic |
|---|---|
| 15 min | What classification is, logits vs. probabilities |
| 25 min | Generate the blobs dataset, build the model |
| 25 min | Training loop with accuracy, decision boundary |
| 15 min | Binary classification with BCEWithLogitsLoss |
| 55 min | Real-world example — heart-disease prediction |
| 45 min | Capstone — moons dataset and overfitting |
Regression vs. classification
| Task | Output | Loss | Final layer |
|---|---|---|---|
| Regression | A real number | MSELoss, L1Loss | Linear (no activation) |
| Binary classification | One of two classes | BCEWithLogitsLoss | Linear with 1 output (logit) |
| Multi-class classification | One of K classes | CrossEntropyLoss | Linear with K outputs (logits) |
The five-step workflow doesn’t change. We swap the dataset, the model’s output size, and the loss.
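The pairings above can be sanity-checked with throwaway tensors. A self-contained sketch (not part of the lesson's pipeline) showing the output shape and target dtype each loss expects:

```python
import torch
from torch import nn

# Multi-class: K logits per sample, int64 class-index targets
logits_k = torch.randn(4, 3)                 # batch of 4, K = 3 classes
targets_k = torch.tensor([0, 2, 1, 2])       # dtype int64
ce = nn.CrossEntropyLoss()(logits_k, targets_k)

# Binary: one logit per sample, float 0.0 / 1.0 targets
logit_b = torch.randn(4, 1)
target_b = torch.tensor([[0.0], [1.0], [1.0], [0.0]])
bce = nn.BCEWithLogitsLoss()(logit_b, target_b)

# Regression: a real number per sample, float targets
pred_r = torch.randn(4, 1)
target_r = torch.randn(4, 1)
mse = nn.MSELoss()(pred_r, target_r)

print(ce.item(), bce.item(), mse.item())  # three finite scalars
```

Feeding the wrong dtype (for example float targets to `CrossEntropyLoss`) raises immediately, so this table is worth memorizing.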
Setup
```shell
uv init --python 3.12 classification
cd classification
uv add torch matplotlib scikit-learn numpy
```

```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```

Multi-class — the blobs dataset
sklearn.datasets.make_blobs generates clusters of points in 2-D — perfect for visualizing what a classifier is doing.
```python
NUM_CLASSES = 4
NUM_FEATURES = 2

x_np, y_np = make_blobs(
    n_samples=1000,
    n_features=NUM_FEATURES,
    centers=NUM_CLASSES,
    cluster_std=1.5,
    random_state=42,
)

x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).long()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(x_train.shape, y_train.shape, y_train[:10])
```

A few details that matter:
- Features are `float32`. Targets for `CrossEntropyLoss` must be `int64` (the dtype `.long()` produces).
- Targets are class indices (`0`, `1`, `2`, `3`), not one-hot vectors. PyTorch's loss does the one-hot conversion internally.
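The "one-hot conversion internally" is easy to make concrete: for a single sample, `CrossEntropyLoss` is just the negative log-softmax at the true class index. A quick check:

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])  # one sample, four classes
target = torch.tensor([0])                       # a class index, not a one-hot vector

loss = nn.CrossEntropyLoss()(logits, target)

# Same value computed by hand: -log_softmax of the true class
manual = -torch.log_softmax(logits, dim=1)[0, target[0]]
print(loss.item(), manual.item())  # identical
```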
Visualize:
```python
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=8)
plt.title("blobs")
plt.show()
```

You should see four colored blobs.
Try it — change the dataset
What does `cluster_std=0.5` look like? `cluster_std=4.0`?
With cluster_std=0.5 the blobs are tight and trivially separable. With 4.0 they overlap heavily and even a perfect classifier can’t get 100% accuracy because the labels themselves disagree in the overlap region.
Build a model
A linear model can only draw straight separators. Real data is rarely separable that way, so we add a non-linear activation between linear layers.
```python
class BlobModel(nn.Module):
    def __init__(self, in_features: int, out_features: int, hidden: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = BlobModel(in_features=NUM_FEATURES, out_features=NUM_CLASSES).to(device)
print(model)
```

The output layer has one unit per class. We deliberately leave it without an activation — those raw outputs are called logits. `nn.CrossEntropyLoss` applies `LogSoftmax` internally and is numerically more stable than computing the softmax ourselves.
From logits to predictions
Three closely-related tensors to keep straight:
| Tensor | Meaning | How to obtain it |
|---|---|---|
| logits | Raw network output, one number per class | model(x) |
| probabilities | Softmax of logits, one per class, sum to 1 | torch.softmax(logits, dim=1) |
| predictions | Index of the largest logit | logits.argmax(dim=1) |
argmax of probabilities and argmax of logits agree, so you don’t actually need softmax to predict — only to report a confidence.
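A one-line check of that claim, with random logits:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(100, 4)
probs = torch.softmax(logits, dim=1)

# softmax is strictly increasing, so it never reorders the values within a row
print(torch.equal(logits.argmax(dim=1), probs.argmax(dim=1)))  # True
```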
```python
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)

with torch.inference_mode():
    logits = model(x_test[:5])
    probs = torch.softmax(logits, dim=1)
    preds = logits.argmax(dim=1)

print("logits:\n", logits)
print("probs (rows sum to 1):\n", probs)
print("preds:", preds)
print("truth:", y_test[:5])
```

Before training, the predictions are essentially random.
Loss, optimizer, accuracy
```python
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def accuracy(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    correct = (y_true == y_pred).sum().item()
    return correct / len(y_pred)
```

Loss tells the optimizer how to improve. Accuracy tells us how well the model is doing in human terms. They almost always disagree slightly, because cross-entropy measures confidence as well as correctness: an overconfident wrong answer costs far more than a barely-wrong one, while accuracy only counts right vs. wrong.
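To see the disagreement concretely: accuracy scores the first two cases below identically (both correct), while the loss separates them by confidence and punishes the confident mistake hardest. A throwaway comparison:

```python
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([0])  # true class is 0

confident_right = torch.tensor([[5.0, 0.0, 0.0]])
barely_right    = torch.tensor([[0.1, 0.0, 0.0]])
confident_wrong = torch.tensor([[0.0, 5.0, 0.0]])

for name, logits in [("confident right", confident_right),
                     ("barely right", barely_right),
                     ("confident wrong", confident_wrong)]:
    print(f"{name}: loss={loss_fn(logits, target).item():.4f}")
```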
Train
```python
EPOCHS = 100
history = []

for epoch in range(EPOCHS):
    model.train()
    logits = model(x_train)
    loss = loss_fn(logits, y_train)
    train_acc = accuracy(y_train, logits.argmax(dim=1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.inference_mode():
        test_logits = model(x_test)
        test_loss = loss_fn(test_logits, y_test)
        test_acc = accuracy(y_test, test_logits.argmax(dim=1))

    history.append((loss.item(), test_loss.item(), train_acc, test_acc))

    if epoch % 10 == 0:
        print(
            f"epoch {epoch:3d} loss={loss.item():.4f} acc={train_acc:.2%} "
            f"| test_loss={test_loss.item():.4f} test_acc={test_acc:.2%}"
        )
```

After 100 epochs you should see test accuracy around 99% — four well-separated blobs are an easy problem.
Plot loss and accuracy
```python
losses = np.array(history)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(losses[:, 0], label="train")
axes[0].plot(losses[:, 1], label="test")
axes[0].set_title("loss"); axes[0].legend()
axes[1].plot(losses[:, 2], label="train")
axes[1].plot(losses[:, 3], label="test")
axes[1].set_title("accuracy"); axes[1].legend()
plt.show()
```

Healthy training: both losses decrease, both accuracies increase, and the train and test curves stay close to each other.
Decision boundary
A picture is worth a thousand metrics. Sample a grid of points across the input space, ask the model for a prediction at each, and color the result.
```python
def plot_decision_boundary(model, x, y, title=""):
    model.eval()
    x = x.to("cpu"); y = y.to("cpu")

    x_min, x_max = x[:, 0].min() - 0.1, x[:, 0].max() + 0.1
    y_min, y_max = x[:, 1].min() - 0.1, x[:, 1].max() + 0.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

    grid = torch.from_numpy(np.column_stack((xx.ravel(), yy.ravel()))).float().to(device)
    with torch.inference_mode():
        preds = model(grid).argmax(dim=1).cpu().numpy().reshape(xx.shape)

    plt.contourf(xx, yy, preds, cmap=plt.cm.RdYlBu, alpha=0.6)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=8, edgecolors="k", linewidths=0.2)
    plt.title(title); plt.show()

plot_decision_boundary(model, x_test, y_test, title="trained model")
```

Try it — kill the activations
Comment out the `nn.ReLU()` lines, retrain from scratch, and plot the decision boundary again. What changes?
Without non-linearities the network collapses to a single linear transformation, no matter how many layers it has. The decision boundaries become straight lines and accuracy drops on data that needs curved separators. Activations are what make a “deep” network actually deep.
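That collapse can be shown directly: two stacked `nn.Linear` layers with no activation between them fold into a single linear map with composed weights. A sketch:

```python
import torch
from torch import nn

torch.manual_seed(0)
a = nn.Linear(2, 8)
b = nn.Linear(8, 4)

# Fold the two layers into one: b(a(x)) = x @ (Wb Wa).T + (Wb ba + bb)
w = b.weight @ a.weight            # composed weight, shape (4, 2)
bias = b.weight @ a.bias + b.bias  # composed bias, shape (4,)

x = torch.randn(5, 2)
stacked = b(a(x))
single = x @ w.T + bias
print(torch.allclose(stacked, single, atol=1e-5))  # True
```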
Binary classification
For two classes you have two equivalent options:
| Approach | Output size | Loss | Targets |
|---|---|---|---|
| One logit | 1 | nn.BCEWithLogitsLoss | float 0.0 / 1.0 |
| Two logits | 2 | nn.CrossEntropyLoss | int 0 / 1 |
BCEWithLogitsLoss combines a sigmoid and binary cross-entropy in one numerically-stable step.
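A quick check of that equivalence — a separate `sigmoid` plus `nn.BCELoss` gives the same number for moderate logits; the fused version is the one that stays stable when logits are extreme:

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()

fused = nn.BCEWithLogitsLoss()(logits, targets)
manual = nn.BCELoss()(torch.sigmoid(logits), targets)
print(fused.item(), manual.item())  # same value
```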
```python
from sklearn.datasets import make_circles

x_np, y_np = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).float().unsqueeze(1)  # shape (N, 1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

class CircleModel(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit
        )

    def forward(self, x):
        return self.net(x)

model = CircleModel().to(device)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)

for epoch in range(500):
    model.train()
    logits = model(x_train)
    loss = loss_fn(logits, y_train)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    if epoch % 50 == 0:
        with torch.inference_mode():
            preds = (torch.sigmoid(model(x_test)) > 0.5).float()
            acc = (preds == y_test).float().mean().item()
        print(f"epoch {epoch:3d} loss={loss.item():.4f} test_acc={acc:.2%}")
```

Notice:
- targets are float for `BCEWithLogitsLoss`, not int,
- prediction is `sigmoid(logit) > 0.5`, equivalent to `logit > 0`,
- the model output has a trailing dim of 1 to match the target shape `(N, 1)`.
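The second point — `sigmoid(logit) > 0.5` is the same rule as `logit > 0` — follows from `sigmoid(0) = 0.5` and can be checked in two lines:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(1000, 1)

# sigmoid is monotonic and crosses 0.5 exactly at logit 0
print(torch.equal(torch.sigmoid(logits) > 0.5, logits > 0))  # True
```

This is why the heart-disease training loop later in the lesson can skip the sigmoid entirely and compare logits against 0.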
Real-world example — heart-disease prediction
Synthetic blobs and circles are good for understanding what the model does. Now we move to real tabular data: predict whether a patient has heart disease from a small set of clinical features.
The dataset is provided by the Cleveland Clinic Foundation: 303 rows, 13 features, one binary target.
| Feature | Type | Meaning |
|---|---|---|
| `age` | numerical | Age in years |
| `sex` | categorical | 0 = female, 1 = male |
| `cp` | categorical | Chest-pain type (1–4) |
| `trestbps` | numerical | Resting blood pressure |
| `chol` | numerical | Serum cholesterol |
| `fbs` | categorical | Fasting blood sugar > 120 mg/dl |
| `restecg` | categorical | Resting ECG results |
| `thalach` | numerical | Maximum heart rate achieved |
| `exang` | categorical | Exercise-induced angina |
| `oldpeak` | numerical | ST depression induced by exercise |
| `slope` | numerical | Slope of the peak exercise ST segment |
| `ca` | categorical | Number of major vessels (0–3) |
| `thal` | categorical | normal, fixed, or reversible |
| `target` | binary | 1 = heart disease, 0 = no heart disease |
This is a much more realistic setup than 2-D toy data. You will learn:
- how to load tabular data with pandas,
- how to preprocess mixed numerical + categorical features,
- how to wrap tensors in a `TensorDataset` and iterate them with a `DataLoader`,
- how to run inference on a single new patient through the same pipeline.
Setup
```shell
uv add pandas
```

```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```

Loading the data
```python
url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(url)

print(df.shape)  # (303, 14)
print(df.head())
print(df["target"].value_counts())
```

The first thing to do with any new dataset is to look at it. `df.head()` shows the first rows; `value_counts()` checks for class imbalance. The Cleveland set is roughly balanced (165 vs. 138).
Preparing the data
Mixed-type tabular data needs two preprocessing steps:
- Numerical features get standardized (zero mean, unit variance) so the network’s gradients don’t get distorted by columns with very different scales.
- Categorical features get integer-encoded so they’re representable as numbers.
The cardinal rule: fit the preprocessors on the training set only, then transform both train and test. Otherwise statistics from the test set leak into training and your reported accuracy is optimistic.
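The effect of the rule is visible on synthetic numbers: after fitting on the training split only, the training data is standardized exactly, while the test data is only approximately standardized, because it was transformed with the training statistics. A sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=(100, 1))
train, test = train_test_split(data, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(train)   # statistics come from train only
train_s = scaler.transform(train)
test_s = scaler.transform(test)

print(round(float(train_s.mean()), 6))  # exactly 0 (up to float error)
print(round(float(test_s.mean()), 3))   # close to 0, but not exactly
```

At prediction time a new sample is in exactly the test split's position: it must be scaled with statistics it did not contribute to.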
```python
categorical = ["sex", "cp", "fbs", "restecg", "exang", "ca", "thal"]
numerical = ["age", "trestbps", "chol", "thalach", "oldpeak", "slope"]
features = numerical + categorical

train_df, test_df = train_test_split(df, test_size=0.2, random_state=1337)

scaler = StandardScaler()
train_df.loc[:, numerical] = scaler.fit_transform(train_df[numerical])
test_df.loc[:, numerical] = scaler.transform(test_df[numerical])

encoders = {}
for column in categorical:
    le = LabelEncoder()
    train_df.loc[:, column] = le.fit_transform(train_df[column])
    test_df.loc[:, column] = le.transform(test_df[column])
    encoders[column] = le
```

Convert the dataframes to PyTorch tensors and wrap them in a `DataLoader` so we can iterate them in batches:
```python
x_train = torch.tensor(train_df[features].values, dtype=torch.float32)
y_train = torch.tensor(train_df["target"].values, dtype=torch.float32).unsqueeze(1)
x_test = torch.tensor(test_df[features].values, dtype=torch.float32)
y_test = torch.tensor(test_df["target"].values, dtype=torch.float32).unsqueeze(1)

train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(x_test, y_test), batch_size=32)

print(x_train.shape, y_train.shape)
# torch.Size([242, 13]) torch.Size([242, 1])
```

A batch is a subset of samples used in a single training iteration. Batched training is faster and gives smoother gradients. `shuffle=True` on the training loader prevents the model from memorizing sample order.
Defining the model
A small two-layer multilayer perceptron (MLP) with dropout. The output is a single logit per sample — BCEWithLogitsLoss will turn it into a probability internally.
| Layer | Purpose |
|---|---|
| `nn.Linear(13, 32)` | Linear transform from 13 features to 32 hidden units |
| `nn.ReLU` | Non-linearity |
| `nn.Dropout(0.5)` | Regularization — drops 50% of activations during training |
| `nn.Linear(32, 1)` | Linear transform to a single output (logit) |
```python
class HeartModel(nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 32),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

model = HeartModel(input_size=len(features)).to(device)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Training
```python
EPOCHS = 50

for epoch in range(EPOCHS):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)
        loss = loss_fn(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.inference_mode():
        train_logits = model(x_train.to(device))
        test_logits = model(x_test.to(device))
        train_acc = ((train_logits > 0).float() == y_train.to(device)).float().mean().item()
        test_acc = ((test_logits > 0).float() == y_test.to(device)).float().mean().item()
        test_loss = loss_fn(test_logits, y_test.to(device)).item()

    if (epoch + 1) % 5 == 0:
        print(
            f"epoch {epoch + 1:2d} loss={loss.item():.4f} "
            f"train_acc={train_acc:.2%} test_loss={test_loss:.4f} test_acc={test_acc:.2%}"
        )
```

You should reach roughly 80–85% test accuracy. A few percent of variance between runs is normal — there are only ~60 test patients.
In each iteration:

- `optimizer.zero_grad()` clears gradients accumulated on parameters.
- `model(inputs)` runs a forward pass.
- `loss.backward()` computes gradients via backpropagation.
- `optimizer.step()` updates the parameters.
`model.train()` enables training-mode behavior (`Dropout` active); `model.eval()` switches it off so evaluation is deterministic.

With only ~240 training examples, dropout makes a real difference. Try setting `nn.Dropout(0.0)` and re-run — train accuracy will reach ~100% while test accuracy stays put. That's textbook overfitting.
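The Dropout half of that train/eval switch is easy to observe in isolation:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(10)

layer.train()
print(layer(x))  # about half the entries zeroed, survivors scaled to 2.0

layer.eval()
print(layer(x))  # identity — dropout is a no-op in eval mode
```

The surviving activations are scaled by 1/(1-p) during training so that the expected activation stays the same in both modes.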
Inference on a new patient
To predict on a fresh sample, run it through the same preprocessing pipeline and pass the tensor through the model.
```python
sample = {
    "age": 80,
    "sex": 0,
    "cp": 1,
    "trestbps": 124,
    "chol": 300,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

sample_df = pd.DataFrame([sample])
sample_df.loc[:, numerical] = scaler.transform(sample_df[numerical])
for column in categorical:
    sample_df.loc[:, column] = encoders[column].transform(sample_df[column])

sample_tensor = torch.tensor(sample_df[features].values, dtype=torch.float32).to(device)

model.eval()
with torch.inference_mode():
    logit = model(sample_tensor)
    proba = torch.sigmoid(logit).item()

print(f"heart-disease probability: {proba:.1%}")
```

A few details that matter in production:
- The `scaler` and `encoders` objects must be saved alongside the model. Predicting later without them is a bug. Use `pickle` or `joblib`.
- `LabelEncoder.transform` raises if it sees a value it didn't see during training. Real systems handle this — for example by mapping unseen categories to a special "unknown" index.
- The output is a probability, not a diagnosis. Threshold at 0.5 for a default decision; choose a different threshold to trade off false positives vs. false negatives depending on cost.
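One possible shape for that unknown-category handling — `safe_transform` below is a hypothetical helper, not a scikit-learn API:

```python
from sklearn.preprocessing import LabelEncoder

def safe_transform(le: LabelEncoder, values, unknown_index: int = -1):
    """Encode values, mapping anything unseen at fit time to unknown_index
    instead of raising (hypothetical helper, not part of scikit-learn)."""
    known = set(le.classes_)
    return [int(le.transform([v])[0]) if v in known else unknown_index
            for v in values]

le = LabelEncoder().fit(["normal", "fixed", "reversible"])
print(safe_transform(le, ["normal", "somethingnew"]))  # [1, -1]
```

Note that a model only handles the unknown index gracefully if it saw that index (or an equivalent placeholder category) during training, so this is a design sketch rather than a drop-in fix.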
Try it — deliberate leakage
Move the `scaler.fit_transform` call to be applied to the whole dataframe before `train_test_split`. Re-run training. What happens to the test accuracy, and why is the result misleading?
Test accuracy goes up slightly because the scaler now “knows” the distribution of the test set. In a real deployment you don’t have the test set yet — only training data. The leaked statistics make the offline number look better than what you’d actually see in production. Always fit preprocessors on training data only.
Exercises
Warm-up
1. Reduce `cluster_std` to `0.5`, retrain. How does test accuracy change?
2. Increase `cluster_std` to `4.0` and add a third hidden layer. Does accuracy improve?
3. Replace `nn.ReLU` with `nn.Tanh`. Plot both decision boundaries and compare.
Generalization
4. Reduce the training set to `n_samples=50` (with `cluster_std=2.0`). Train and evaluate. Now repeat with `n_samples=2000`. What changes about the decision boundary?
5. Train a model with no hidden layers — `nn.Linear(2, 4)` directly. What's the maximum accuracy you can reach? On what kinds of datasets is it enough?
Confusion matrix
6. After training, build a confusion matrix:

   ```python
   import numpy as np

   with torch.inference_mode():
       preds = model(x_test).argmax(dim=1).cpu().numpy()
   truth = y_test.cpu().numpy()

   cm = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=int)
   for t, p in zip(truth, preds):
       cm[t, p] += 1
   print(cm)
   ```

   Which class is the easiest? The hardest?
Binary classification
7. Generate `make_moons(n_samples=1000, noise=0.2)` and train both a one-logit model and a two-logit model. Verify they give the same accuracy.
8. Plot the decision boundary as a smooth probability heatmap (use `torch.sigmoid(model(grid))` instead of `argmax`).
Heart disease
9. Build a confusion matrix on the heart-disease test set: how many false positives and false negatives does the model produce?
10. Try thresholds other than 0.5. Plot precision and recall as functions of the threshold (`probs > t` for `t` between 0.1 and 0.9). Which threshold would you pick if a false negative is twice as costly as a false positive?
11. Add a second hidden layer to `HeartModel`. Does it help? Why might it not?
12. Pickle the trained model and the `scaler` + `encoders` to a single file with `joblib.dump({"model": model.state_dict(), "scaler": scaler, "encoders": encoders}, "heart.pkl")`. Load them in a fresh script and predict on the same `sample`.
For exercise 8:
```python
def plot_proba(model, x, y):
    model.eval()
    x_cpu = x.cpu(); y_cpu = y.cpu()
    x_min, x_max = x_cpu[:, 0].min() - 0.2, x_cpu[:, 0].max() + 0.2
    y_min, y_max = x_cpu[:, 1].min() - 0.2, x_cpu[:, 1].max() + 0.2
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
    grid = torch.from_numpy(np.column_stack((xx.ravel(), yy.ravel()))).float().to(device)
    with torch.inference_mode():
        proba = torch.sigmoid(model(grid)).cpu().numpy().reshape(xx.shape)
    plt.contourf(xx, yy, proba, levels=20, cmap=plt.cm.RdBu)
    plt.scatter(x_cpu[:, 0], x_cpu[:, 1], c=y_cpu.squeeze(), cmap=plt.cm.RdBu, s=10, edgecolors="k")
    plt.show()
```

Capstone — overfitting on the moons dataset
The moons dataset has two interleaving half-moons that no straight line can separate. We will:
- train a tiny model and watch it underfit,
- train a too-big model and watch it overfit,
- apply two regularization tools — weight decay and dropout — and find a sweet spot.
Build the data
```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

x_np, y_np = make_moons(n_samples=300, noise=0.30, random_state=0)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).long()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)

plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=10)
plt.title("moons (noisy)")
plt.show()
```

The data is deliberately noisy — some red points are inside the blue moon and vice versa. A perfect classifier on this data does not exist; the best we can do is recover the underlying shape.
A reusable training function
```python
def train(model, epochs=2000, lr=0.01, weight_decay=0.0):
    model = model.to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    history = []
    for epoch in range(epochs):
        model.train()
        loss = loss_fn(model(x_train), y_train)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

        model.eval()
        with torch.inference_mode():
            train_acc = (model(x_train).argmax(1) == y_train).float().mean().item()
            test_acc = (model(x_test).argmax(1) == y_test).float().mean().item()
        history.append((loss.item(), train_acc, test_acc))
    return model, history
```

Underfitting — too small
```python
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 2), nn.Tanh(), nn.Linear(2, 2))

    def forward(self, x):
        return self.net(x)

tiny, history_tiny = train(TinyModel())
print("tiny test acc:", history_tiny[-1][2])
```

This model can only draw a slight curve. Test accuracy will plateau around 80% — the model underfits.
Overfitting — too big
```python
class BigModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, x):
        return self.net(x)

big, history_big = train(BigModel())
print("big train acc:", history_big[-1][1])
print("big test acc :", history_big[-1][2])
```

Train accuracy will reach 100%; test accuracy will be lower than the tiny model's because the network started memorizing the training noise. This is overfitting.
Visualize both
Use the plot_decision_boundary function from earlier. The big model will draw bizarre wiggles around individual training points; the tiny model will draw an almost-straight line.
Sweet spot — moderate model + regularization
```python
class GoodModel(nn.Module):
    def __init__(self, hidden=16, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 2),
        )

    def forward(self, x):
        return self.net(x)

good, history_good = train(GoodModel(), weight_decay=1e-4)
print("good train acc:", history_good[-1][1])
print("good test acc :", history_good[-1][2])
```

Two regularizers at work:
- Dropout randomly zeros some activations during training, forcing the network to learn redundant representations instead of relying on any single unit.
- Weight decay (`weight_decay` in the optimizer) penalizes large weights — the model prefers the simplest function that fits.
Test accuracy should match or beat the big model's, with far less wiggling in the decision boundary.
Compare the three
```python
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, (m, name) in zip(axes, [(tiny, "underfit"), (big, "overfit"), (good, "good")]):
    plt.sca(ax)
    # ... call your plot_decision_boundary or copy the contourf code here
    ax.set_title(name)
plt.show()
```

Stretch goals
- Sweep `hidden` from 2 to 256 and plot final test accuracy. Where is the sweet spot?
- Sweep `weight_decay` from `0` to `1e-2` for the big model. Does it close the gap?
- Add early stopping: keep the parameters that gave the best test accuracy, not the final ones.
Recap
- Classification = predict a discrete label. Output logits, one per class.
- Loss: `CrossEntropyLoss` for multi-class, `BCEWithLogitsLoss` for binary.
- Targets are class indices (`long`) for cross-entropy, floats for BCE.
- Always plot the decision boundary — metrics hide what the model actually learned.
- Underfitting: the model is too small for the data.
- Overfitting: the model memorizes noise. Counter with smaller capacity, dropout, weight decay, more data.
The next chapter, Vision, applies the same ideas to images with convolutional networks.