STAT 337 · Unit 2

Problem 1 — Scatterplots: 6 Things to Look For

Choose a dataset and variable pair, explore the scatterplot, and use the sliders and toggles to describe what you see. Try several pairs — discuss what you notice with your table.

REAL DATA

Dataset

Explanatory (x)

Response (y)

Species

Select a dataset and variable pair above

BELOW: SIMULATED DATA ILLUSTRATING EACH FEATURE

1 · DIRECTION positive

negative positive

2 · STRENGTH strong

weak strong

3 · FORM linear

concave ↓ concave ↑

4 · VARIABILITY constant

decreasing increasing

5 · OUTLIERS

6 · DISTINCT GROUPS

6 Things to Look For · Reference

1 · Direction

Is the overall association positive (y tends to increase as x increases) or negative (y tends to decrease as x increases), or is there no association?

2 · Strength

How tightly do the points cluster around the overall pattern? Strong: points stay close to the pattern. Moderate: the pattern is clear but there is noticeable scatter. Weak: the pattern is faint.

3 · Form / Linearity

Does the pattern follow a straight line, or is it curved (non-linear)? Pearson's r measures only linear association, so a curved relationship may be strong visually even when |r| is small.

4 · Outliers

An outlier is a point that does not follow the overall pattern of the other observations. A point that is far away but still follows the same pattern is not necessarily an outlier.

5 · Changing Variability

Does the vertical spread of the points stay about constant, or does it increase or decrease across x?

6 · Distinct Groups

Do the points form two or more separate clusters? Look for clearly separated bands or clouds. Distinct groups can sometimes be mistaken for one wide, noisy pattern, so check for visible gaps or subgroup structure.

Problem 2 — Pearson's r: Guess the Correlation

Select a dataset and variable pair, then drag the slider to your best estimate of Pearson's r and submit. r ranges from −1 (perfect negative) to +1 (perfect positive).

TOTAL PTS

0

GUESSES

0

AVG ERROR

—

BULLSEYES

0

SELECT DATA — choose a variable pair then guess r

Dataset

Explanatory (x)

Response (y)

Species

Select a dataset above to begin

Interpreting r · reference

Pearson's r measures the strength and direction of a linear association. It always falls between −1 and +1.

Key warnings:

r only measures the strength and direction of a linear association between two quantitative variables.
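The warning about linear association can be checked with a few lines of Python. This is a minimal sketch, not the page's own code: the function name `pearson_r` and the two toy datasets are assumptions made for illustration. It computes r for an exact straight line and for an exact parabola on the same x-values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: same x-values, two different relationships.
xs = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * x + 1 for x in xs]   # exact straight line
curve = [x ** 2 for x in xs]     # exact parabola, symmetric about x = 0

r_line = pearson_r(xs, line)     # perfect linear association
r_curve = pearson_r(xs, curve)   # perfect relationship, but not linear

print(f"r for the line:  {r_line:.3f}")
print(f"r for the curve: {r_curve:.3f}")
```

Both datasets have a perfectly predictable y, yet only the line produces |r| = 1. The symmetric parabola gives r = 0, which is why r should be read alongside the scatterplot rather than instead of it.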

Scoring per plot

5 pts	Error ≤ 0.05 (bullseye)
4 pts	Error ≤ 0.15
3 pts	Error ≤ 0.22
2 pts	Error ≤ 0.32
1 pt	Error ≤ 0.50
0 pts	Error > 0.50

Max: 75 pts (15 × 5)

Problem 3 — Line of Best Fit: Minimise the Squared Residuals

Drag the line handles to get your SSE as close to the true minimum SSE as possible. The OLS line is the only line that achieves that minimum — your goal is to find it.

TOTAL PTS

0

ATTEMPTS

0

AVG SSE RATIO

—

BULLSEYES

0

SELECT DATA — drag the line to minimise SSE

Dataset

Explanatory (x)

Response (y)

Species

Select a dataset above to begin

🎯 Goal: drag the line until the orange squares are as small as possible

Your Goal

True minimum SSE — match this number

—

The OLS line is the only line that achieves this. No other line can do better.

The Idea · section 6.3

For any line ŷ = b₀ + b₁x, each point has a residual — the vertical gap between the actual y and what the line predicts:

eᵢ = yᵢ − ŷᵢ

We square each residual so negatives don't cancel positives, then sum them all:

SSE = Σ eᵢ²

The line of best fit (OLS line) is the unique line that makes SSE as small as possible. No other line through that data will have a smaller SSE.

The orange boxes on the plot are the squared residuals — shrink the total area of those boxes and you are minimising the SSE.
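The claim that no other line beats the OLS line can be checked numerically. The sketch below is illustrative only: the five data points and the helper names `ols_line` and `sse` are assumptions, and the slope and intercept come from the standard closed-form least-squares formulas rather than from anything specific to this applet.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept b0 and slope b1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def sse(xs, ys, b0, b1):
    """Sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_line(xs, ys)
min_sse = sse(xs, ys, b0, b1)

# Nudge the intercept and slope away from the OLS values, as if dragging the handles.
nudged = [
    sse(xs, ys, b0 + db0, b1 + db1)
    for db0 in (-0.5, 0.0, 0.5)
    for db1 in (-0.2, 0.0, 0.2)
    if (db0, db1) != (0.0, 0.0)
]

print(f"OLS line: y-hat = {b0:.3f} + {b1:.3f} x, SSE = {min_sse:.4f}")
print(f"smallest SSE among nudged lines: {min(nudged):.4f}")
```

Every nudged line has a larger SSE than the OLS line, which is the behaviour the orange boxes show when the handles are moved off the best fit.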

Drag the ● handles on the line to adjust slope and intercept.

Scoring per plot

Scored on how close your SSE is to the true minimum SSE:

5 pts	Within 2× true SSE
4 pts	Within 3×
3 pts	Within 5×
2 pts	Within 10×
1 pt	Within 20×
0 pts	More than 20×

Max: 100 pts (20 × 5)

Problem 4 — Proportion of Variability Explained: R²

Drag the line to see how R² = 1 − SSE/SST changes. Red boxes = SST (fixed). Blue boxes = SSE (shrink them to maximise R²).

TOTAL PTS

0

ATTEMPTS

0

AVG R² GAP

—

BULLSEYES

0

SELECT DATA — drag the line to maximise R²

Dataset

Explanatory (x)

Response (y)

Species

Select a dataset above to begin

🎯 Goal: drag the line to shrink the blue boxes (SSE) relative to the red boxes (SST) and maximise R²

Your Goal

True maximum R² — match this number

—

The OLS line is the only line that achieves this. Drag to get your R² as close as you can.

The Idea · section 6.4

Without x, your best guess for any y is ȳ. The red boxes show the squared deviations from ȳ — this is the total variability in y:

SST = Σ(yᵢ − ȳ)²

The blue boxes show the squared residuals from your line — the variability your model does not explain:

SSE = Σ(yᵢ − ŷᵢ)²

R² is the fraction of the total variability in y that your line explains:

R² = 1 − SSE/SST

The dashed line is ȳ. For a fixed dataset, SST stays the same — as the fit improves, SSE gets smaller and R² gets larger.
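The formulas above translate directly into code. The following is a sketch under assumed toy data, with hypothetical helper names (`ols_line`, `r_squared`). It evaluates R² = 1 − SSE/SST for three lines: the OLS line, the horizontal line at ȳ, and a deliberately poor line.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept b0 and slope b1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def r_squared(xs, ys, b0, b1):
    """R^2 = 1 - SSE/SST for an arbitrary line y-hat = b0 + b1 * x."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)                       # red boxes
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # blue boxes
    return 1 - sse / sst

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
y_bar = sum(ys) / len(ys)

b0, b1 = ols_line(xs, ys)
r2_ols = r_squared(xs, ys, b0, b1)       # best achievable value
r2_flat = r_squared(xs, ys, y_bar, 0.0)  # horizontal line at y-bar: SSE equals SST
r2_bad = r_squared(xs, ys, 10.0, -1.0)   # a line sloping the wrong way

print(f"OLS line:         R^2 = {r2_ols:.4f}")
print(f"flat line at y-bar: R^2 = {r2_flat:.4f}")
print(f"poor line:        R^2 = {r2_bad:.4f}")
```

The flat line at ȳ gives R² = 0 because its SSE equals SST. A hand-placed line that fits worse than that flat line has SSE larger than SST, so the formula returns a negative value. Only for the OLS line does R² read as the proportion of variability explained.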

Scoring per plot

Scored on how close your R² is to the true maximum (OLS):

5 pts	Within 0.04 of true R²
4 pts	Within 0.08
3 pts	Within 0.15
2 pts	Within 0.25
1 pt	Within 0.40
0 pts	More than 0.40 off

Max: 75 pts (15 × 5)

Problem 5 — Permutation Distribution: How is the null built?

The permutation test asks: could the observed slope happen by chance if x and y were unrelated? Hit Permute to shuffle y values and build the null distribution one slope at a time.

SELECT DATA — then work through the permutation phases below

Dataset

Explanatory (x)

Response (y)

Species

Original data

Null distribution of b₁*

PHASE 1 — watch the mechanism

0 / 10 clicks

PHASE 2 — shape starts forming

locked — finish phase 1 first

PHASE 3 — full null distribution

locked — finish phase 2 first

The Mechanism · step by step

① Start with the real data. Compute r and the observed slope b₁.

② Ask: if there were no real linear association between x and y, could we have seen a slope this large just by chance?

③ Simulate the null: keep the x-values fixed, and randomly reassign the y-values to different x positions. This breaks any real pairing between x and y.

④ Refit the line on the scrambled data and record the new slope b₁*.

⑤ Repeat many times. The collection of b₁* values forms the null distribution — the slopes we would expect just by chance if the true slope were 0.

⑥ p-value = fraction of permuted slopes that are at least as extreme as the observed b₁. If the observed slope falls far into the tail, that is evidence against the null and suggests a real linear association.
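Steps ① to ⑥ can be sketched in plain Python. The data, the function names, the number of permutations, and the seed are all assumptions chosen for illustration. "At least as extreme" is taken here to mean two-sided, comparing |b₁*| with |b₁|, which is one common reading of step ⑥.

```python
import random

def slope(xs, ys):
    """Least-squares slope b1 of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def permutation_test(xs, ys, n_perm=2000, seed=0):
    """Return the observed slope, the permuted slopes, and the two-sided p-value."""
    rng = random.Random(seed)
    observed = slope(xs, ys)          # step 1: slope of the real data
    shuffled = list(ys)
    null_slopes = []
    for _ in range(n_perm):           # step 5: repeat many times
        rng.shuffle(shuffled)         # step 3: x stays fixed, y is reassigned
        null_slopes.append(slope(xs, shuffled))  # step 4: refit and record b1*
    extreme = sum(abs(b) >= abs(observed) for b in null_slopes)
    return observed, null_slopes, extreme / n_perm  # step 6

# Hypothetical toy data with a clear upward trend.
xs = list(range(1, 13))
ys = [2.3, 2.9, 4.1, 3.8, 5.6, 6.1, 6.0, 7.9, 8.2, 9.5, 9.1, 11.0]

observed, null_slopes, p_value = permutation_test(xs, ys)
print(f"observed slope b1 = {observed:.3f}")
print(f"permutation p-value = {p_value:.4f}")
```

Because shuffling destroys the pairing, the permuted slopes cluster around 0. With a trend this strong, almost no shuffle reproduces a slope as large as the observed one, so the p-value is very small.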

Current simulation · live

Press "Permute Once" to run the first simulation.

Problem 6 — The Normal Tunnel: distribution of y at each x

Drag the slider to slice through the true regression line at any x. The cross-section shows how individual y-values are distributed at that x.

Simulation — true population model μy = 50 + 0.60(x − 50) · σ = 12

Red = true line μy(x). Drag the orange slider to pick a carat value and see the price distribution at that slice.

Slice at 0.8 ct · μy at this x = — · σ = 12

TRUE μy

—

TRUE σ

12

SD(y near x)

—

n near x

—

ŷ at this x

—

Cross-section — distribution of y at x = 50

What you are seeing · the model

The red line is the true population line — fitted to 53,940 ggplot2 diamonds: price = −$2,254 + $7,753·carat, σ = $1,548.

At every carat value, prices scatter around μy with standard deviation σ. This scatter is the irreducible error in the model.

🎲 Simulated: points drawn randomly from the true model — perfect for understanding the concept.

💎 Real Diamonds: reveal actual diamonds from our 500-diamond sample one batch at a time — see the same tunnel emerge from real data.

The slice · key idea

At any x, individual y-values follow a normal distribution centred at μy(x) with spread σ.

Drag the slider left and right — the curve moves because μy changes, but it stays exactly the same width. σ does not depend on x.

This is the equal-variance (homoscedasticity) assumption of SLR: σ is constant across all values of x.
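The slice idea can be imitated by simulation. The sketch below draws points from the simulation model stated for this problem, μy = 50 + 0.60(x − 50) with σ = 12. The range of x (20 to 80), the sample size, the slice width, and the seed are assumptions made for illustration. It then summarises the y-values that fall near three different x-values.

```python
import random
import statistics

# Simulation model from the page; the x-range and sample size below are assumptions.
BETA0, BETA1, CENTER, SIGMA = 50.0, 0.60, 50.0, 12.0

def true_mean(x):
    """Population mean of y at a given x."""
    return BETA0 + BETA1 * (x - CENTER)

rng = random.Random(1)
points = []
for _ in range(20000):
    x = rng.uniform(20, 80)
    points.append((x, true_mean(x) + rng.gauss(0, SIGMA)))

def slice_summary(points, x0, half_width=2.5):
    """Count, mean and SD of the y-values whose x lies within half_width of x0."""
    ys = [y for x, y in points if abs(x - x0) <= half_width]
    return len(ys), statistics.mean(ys), statistics.stdev(ys)

summaries = {x0: slice_summary(points, x0) for x0 in (30, 50, 70)}
for x0, (n_near, mean_y, sd_y) in summaries.items():
    print(f"x near {x0}: n = {n_near}, mean y = {mean_y:.1f} "
          f"(true {true_mean(x0):.1f}), SD = {sd_y:.1f} (true {SIGMA})")
```

The slice means track the true line while the slice SDs all stay close to 12, which is the equal-variance picture the slider is meant to convey. The slice SDs sit slightly above σ because each slice has nonzero width, so the mean drifts a little within it.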

Fixed model · this visualisation

True model fitted to 53,940 ggplot2 diamonds: price = −$2,254 + $7,753·carat · σ = $1,548 · Each click adds 20 real diamonds from our 500-diamond sample.

True line + sample OLS lines

Slice at 0.8 ct · μy (price) = — · theoretical SE(ŷ) = —

TRUE μy (price)

—

theoretical SE(ŷ)

—

empirical SD(ŷ)

—

samples

0

Sampling distribution of ŷ at 0.8 ct

Bootstrap distribution of b₁ (slope)

Bootstrap distribution of b₀ (intercept)

What you are seeing · the tunnel

Each bootstrap resample (n = 500, drawn with replacement from our 500 diamonds) produces a slightly different OLS line. Together, the blue lines form the bootstrap confidence band around the true red line.

The tunnel is narrowest near the mean carat (~0.8 ct) and widens toward the extremes. Every resampled line passes close to (x̄, ȳ), so uncertainty in the slope matters more the farther x is from x̄.

Drag the slider to slice through the tunnel at any x and see the spread of ŷ values from all samples.
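A rough sketch of the resampling idea follows. The 500-diamond sample is not available here, so the code uses a stand-in sample drawn from the simulation model of Problem 6 (μy = 50 + 0.60(x − 50), σ = 12). The sample size, x-range, number of resamples, and helper names are all assumptions. It refits the OLS line on each bootstrap resample and measures the spread of ŷ at the mean of x and at both ends.

```python
import random
import statistics

def ols_line(xs, ys):
    """Closed-form least-squares intercept b0 and slope b1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Stand-in sample (assumption): 200 points from the simulated model, x in [20, 80].
rng = random.Random(7)
n = 200
xs = [rng.uniform(20, 80) for _ in range(n)]
ys = [50 + 0.60 * (x - 50) + rng.gauss(0, 12) for x in xs]
x_bar = statistics.mean(xs)

def bootstrap_lines(xs, ys, n_boot, rng):
    """Refit the OLS line on n_boot resamples drawn with replacement."""
    lines = []
    indices = range(len(xs))
    for _ in range(n_boot):
        picks = rng.choices(indices, k=len(xs))
        lines.append(ols_line([xs[i] for i in picks], [ys[i] for i in picks]))
    return lines

lines = bootstrap_lines(xs, ys, 500, rng)

def band_sd(x0):
    """Spread of the fitted values y-hat at x0 across all bootstrap lines."""
    return statistics.stdev([b0 + b1 * x0 for b0, b1 in lines])

sd_mid, sd_low, sd_high = band_sd(x_bar), band_sd(20), band_sd(80)
print(f"SD of y-hat at x-bar: {sd_mid:.2f}")
print(f"SD of y-hat at x = 20: {sd_low:.2f}, at x = 80: {sd_high:.2f}")
```

The spread of ŷ is smallest at x̄ and larger at both ends of the x-range, which is the narrowing and widening of the tunnel seen in the plot.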

Fixed model · this visualisation

True model fitted to all 53,940 ggplot2 diamonds: price = −$2,254 + $7,753·carat · σ = $1,548 · Bootstrap resamples from our 500-diamond sample with replacement

Draw samples to begin.

Scatter plot — drag any point

Residual vs fitted plot

Residual strip plot

ASSUMPTION SCENARIOS

OUTLIER SCENARIOS

LIVE MODEL

Generate a dataset to begin.

What is a residual · definition

For each point, the residual is: e = y − ŷ

It is the vertical distance from the point to the OLS line — positive if above, negative if below.

When you drag a point, the OLS line is refitted, so ŷ changes at every x. The residual therefore changes for every point, not just the one you moved.
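The effect of dragging a single point can be reproduced in a few lines. This is an illustrative sketch with made-up data and a hypothetical helper `residuals`. It moves the last point up by 4 units, refits the line, and compares the residuals before and after.

```python
def residuals(xs, ys):
    """Residuals e_i = y_i - y-hat_i from the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

before = residuals(xs, ys)

dragged = list(ys)
dragged[5] += 4.0                 # "drag" only the last point upward
after = residuals(xs, dragged)

changed = [abs(a - b) > 1e-9 for a, b in zip(before, after)]
for x, e_before, e_after in zip(xs, before, after):
    print(f"x = {x}: residual {e_before:+.3f} -> {e_after:+.3f}")
```

Only one y-value was edited, yet every residual changes because the refitted line moves everywhere. In both fits the residuals still sum to zero, a property of any least-squares line with an intercept.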

Outlier concepts · key ideas

High leverage — point far from x̄. Has the potential to pull the line but may not if it follows the trend.

High influence — actually changes the line substantially. Measured by Cook's D: high leverage combined with a large residual produces high influence.

Outlier (low leverage) — large residual near x̄. Doesn't move the line much but inflates RMSE.
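The three cases can be compared numerically using the standard simple-regression formulas for leverage, hᵢ = 1/n + (xᵢ − x̄)²/Sxx, and Cook's distance, Dᵢ = eᵢ² hᵢ / (p s² (1 − hᵢ)²) with p = 2 and s² = SSE/(n − p). The sketch below is not the applet's code: the base data, the three added points, and the helper names are assumptions chosen to mimic the scenarios described above.

```python
def diagnostics(xs, ys):
    """Return (leverage, residual, Cook's D) for each point of a simple linear fit."""
    n = len(xs)
    p = 2  # fitted coefficients: intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    out = []
    for x, e in zip(xs, resid):
        h = 1 / n + (x - x_bar) ** 2 / sxx
        d = (e * e / (p * s2)) * h / (1 - h) ** 2
        out.append((h, e, d))
    return out

# Hypothetical base data lying close to a straight line.
base_x = [1, 2, 3, 4, 5, 6, 7, 8]
base_y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 15.9]

def added_point(x_new, y_new):
    """Diagnostics for one extra point appended to the base data."""
    return diagnostics(base_x + [x_new], base_y + [y_new])[-1]

h_on, e_on, d_on = added_point(20, 39.7)      # far from x-bar, follows the trend
h_off, e_off, d_off = added_point(20, 20.0)   # far from x-bar, breaks the trend
h_out, e_out, d_out = added_point(4.5, 20.0)  # at x-bar, far above the line

for label, h, e, d in [("on-trend, far x", h_on, e_on, d_on),
                       ("off-trend, far x", h_off, e_off, d_off),
                       ("outlier near x-bar", h_out, e_out, d_out)]:
    print(f"{label:>18}: leverage = {h:.3f}, residual = {e:+.2f}, Cook's D = {d:.3f}")
```

Both points at x = 20 have the same high leverage, but only the one that breaks the trend has a large Cook's D. The point near x̄ has the largest residual of the three, yet its low leverage keeps its influence on the line modest.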

Problem 6B — Bootstrap Confidence Band: sampling variability of ŷ, b₀ and b₁

Residuals — drag a point and feel what changes

Reading the QQ Plot — Conceptual Shape Explorer

SST (total, fixed) — red boxes	—
SS_reg (explained)	—
SSE (residual) — blue boxes	—