STAT 337 · Unit 2

Problem 1 — Scatterplots: 6 Things to Look For

For each of the 10 plots: identify the direction, strength, form, outliers, variability, and grouping. Submit your answers to see your score and feedback.

[Interactive panel: real-data selector (dataset, explanatory x, response y, species), followed by simulated examples of each feature: direction (negative, positive), strength (weak, strong), form (linear, concave down, concave up), variability (constant, decreasing, increasing), outliers, and distinct groups.]
6 Things to Look For · Reference
  • 1
    Direction
    Is the overall association positive (y tends to increase as x increases), negative (y tends to decrease as x increases), or no association?
  • 2
    Strength
    How tightly do the points cluster around the overall pattern? Strong — points stay close to the pattern. Moderate — the pattern is clear but there is noticeable scatter. Weak — the pattern is faint.
  • 3
    Form / Linearity
    Does the pattern follow a straight line or is it curved / non-linear? Pearson's r measures only linear association, so a curved relationship may be strong visually even when |r| is small.
  • 4
    Outliers
    An outlier is a point that does not follow the overall pattern of the other observations. A point that is far away but still follows the same pattern is not necessarily an outlier.
  • 5
    Changing Variability
    Does the vertical spread of the points stay about constant, or does it increase or decrease across x?
  • 6
    Distinct Groups
    Do the points form two or more separate clusters? Look for clearly separated bands or clouds. Distinct groups can sometimes be mistaken for one wide, noisy pattern, so check for visible gaps or subgroup structure.
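The warning under Form / Linearity can be checked numerically. The sketch below is only an illustration: the data are made up (an exact U-shape and an exact straight line), and `pearson_r` is a hypothetical helper written directly from the definition of r.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: x is symmetric about 0 and y is an exact function of x.
xs = [i / 10 for i in range(-20, 21)]
curved = [x ** 2 for x in xs]        # perfect U-shape, no scatter at all
straight = [2 * x + 1 for x in xs]   # perfect straight line

r_curved = pearson_r(xs, curved)
r_straight = pearson_r(xs, straight)
print(f"curved:   r = {r_curved:.3f}")
print(f"straight: r = {r_straight:.3f}")
```

Both patterns are perfectly predictable from x, yet only the straight one has |r| near 1. For the U-shape the positive and negative halves cancel, so r is essentially 0.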

Problem 2 — Pearson's r: Guess the Correlation

Select a dataset and variable pair, then drag the slider to your best estimate of Pearson's r and submit. r ranges from −1 (perfect negative) to +1 (perfect positive).

[Interactive panel: scoreboard (total points, guesses, average error, bullseyes) and data selector (dataset, explanatory x, response y, species). Choose a variable pair, then guess r.]
Interpreting r · reference

Pearson's r measures the strength and direction of a linear association. It always falls between −1 and +1.

+1.00  Perfect positive
+0.70  Strong positive
+0.30  Weak positive
 0.00  No association
−0.30  Weak negative
−0.70  Strong negative
−1.00  Perfect negative

Key warnings:

r only measures the strength and direction of a linear association between two quantitative variables.

Scoring per plot
5 pts  Error ≤ 0.05 (bullseye)
4 pts  Error ≤ 0.15
3 pts  Error ≤ 0.22
2 pts  Error ≤ 0.32
1 pt   Error ≤ 0.50
0 pts  Error > 0.50

Max: 75 pts (15 × 5)
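The scoring rule above can be read as a lookup on the absolute error |guess − r|. A minimal sketch follows; the thresholds are copied from the table, while the function name `score_guess` is hypothetical and this is not necessarily how the page itself computes scores.

```python
# (maximum error, points) pairs copied from the scoring table.
THRESHOLDS = [(0.05, 5), (0.15, 4), (0.22, 3), (0.32, 2), (0.50, 1)]

def score_guess(guess, true_r):
    """Points for one guess, based on the absolute error |guess - true_r|."""
    error = abs(guess - true_r)
    for max_error, points in THRESHOLDS:
        if error <= max_error:
            return points
    return 0  # error greater than 0.50

# Example guesses against a hypothetical true r of 0.50.
for guess in (0.52, 0.60, 0.70, 0.80, 0.90, -0.50):
    print(f"guess {guess:+.2f} -> {score_guess(guess, 0.50)} pts")
```

Guesses that land exactly on a threshold may fall on either side of it because of floating-point rounding, so the examples here stay clear of the boundaries.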

Problem 3 — Line of Best Fit: Minimise the Squared Residuals

Drag the line handles to get your SSE as close to the true minimum SSE as possible. The OLS line is the only line that achieves that minimum — your goal is to find it.

[Interactive panel: scoreboard (total points, attempts, average SSE ratio, bullseyes) and data selector (dataset, explanatory x, response y, species). Choose a variable pair, then drag the line to minimise SSE.]
🎯 Goal: drag the line until the orange squares are as small as possible
Your Goal
True minimum SSE — match this number
The OLS line is the only line that achieves this. No other line can do better.
The Idea · section 6.3

For any line ŷ = b₀ + b₁x, each point has a residual — the vertical gap between the actual y and what the line predicts:

eᵢ = yᵢ − ŷᵢ

We square each residual so negatives don't cancel positives, then sum them all:

SSE = Σ eᵢ²

The line of best fit (OLS line) is the unique line that makes SSE as small as possible. No other line through that data will have a smaller SSE.

The orange boxes on the plot are the squared residuals — shrink the total area of those boxes and you are minimising the SSE.

Drag the handles on the line to adjust slope and intercept.
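What the drag exercise approximates by hand can also be computed directly. The sketch below uses made-up data and two hypothetical helpers, `sse` and `ols_line`. It applies the standard closed-form least-squares formulas and then checks that nudging the slope or intercept in any direction raises the SSE.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_line(xs, ys)
min_sse = sse(xs, ys, b0, b1)

# Try eight nearby lines: every one of them should have a larger SSE.
nudged = [sse(xs, ys, b0 + db0, b1 + db1)
          for db0 in (-0.5, 0.0, 0.5)
          for db1 in (-0.2, 0.0, 0.2)
          if (db0, db1) != (0.0, 0.0)]

print(f"OLS line: y-hat = {b0:.3f} + {b1:.3f} x, SSE = {min_sse:.4f}")
print(f"smallest SSE among nudged lines = {min(nudged):.4f}")
```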

Scoring per plot

Scored on how close your SSE is to the true minimum SSE:

5 pts  Within 2× true SSE
4 pts  Within 3×
3 pts  Within 5×
2 pts  Within 10×
1 pt   Within 20×
0 pts  More than 20×

Max: 100 pts (20 × 5)

Problem 4 — Proportion of Variability Explained:

Drag the line to see how R² = 1 − SSE/SST changes. Red boxes = SST (fixed). Blue boxes = SSE (shrink them to maximise R²).

[Interactive panel: scoreboard (total points, attempts, average R² gap, bullseyes) and data selector (dataset, explanatory x, response y, species). Choose a variable pair, then drag the line to maximise R².]
🎯 Goal: drag the line to shrink the blue boxes (SSE) relative to the red boxes (SST) — maximise R²
Your Goal
True maximum R² — match this number
The OLS line is the only line that achieves this. Drag to get your R² as close as you can.
The Idea · section 6.4

Without x, your best guess for any y is ȳ. The red boxes show the squared deviations from ȳ — this is the total variability in y:

SST = Σ(yᵢ − ȳ)²

The blue boxes show the squared residuals from your line — the variability your model does not explain:

SSE = Σ(yᵢ − ŷᵢ)²

R² is the fraction of the total variability in y that your line explains:

R² = 1 − SSE/SST

The dashed line is ȳ. For a fixed dataset, SST stays the same — as the fit improves, SSE gets smaller and R² gets larger.
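To see the formula at work, the sketch below evaluates R² = 1 − SSE/SST for three lines through the same made-up data: the horizontal line at ȳ, the OLS line, and a deliberately poor line. The helper names are hypothetical.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def r_squared(xs, ys, b0, b1):
    """R^2 = 1 - SSE/SST for the line y-hat = b0 + b1 * x."""
    my = sum(ys) / len(ys)
    sst = sum((y - my) ** 2 for y in ys)                          # red boxes
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # blue boxes
    return 1 - sse / sst

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
y_bar = sum(ys) / len(ys)
b0, b1 = ols_line(xs, ys)

r2_flat = r_squared(xs, ys, y_bar, 0.0)   # dashed line at y-bar: SSE equals SST
r2_ols = r_squared(xs, ys, b0, b1)        # best possible line
r2_bad = r_squared(xs, ys, 12.0, -2.0)    # a line sloping the wrong way

print(f"flat line at y-bar: R^2 = {r2_flat:.3f}")
print(f"OLS line:           R^2 = {r2_ols:.3f}")
print(f"poor line:          R^2 = {r2_bad:.3f}")
```

Because SST is fixed, only SSE moves. A line that fits worse than ȳ has SSE larger than SST, so the formula returns a negative value for it.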

Scoring per plot

Scored on how close your R² is to the true maximum (OLS):

5 pts  Within 0.04 of true R²
4 pts  Within 0.08
3 pts  Within 0.15
2 pts  Within 0.25
1 pt   Within 0.40
0 pts  More than 0.40 off

Max: 75 pts (15 × 5)

Problem 5 — Permutation Distribution: How is the null built?

The permutation test asks: could the observed slope happen by chance if x and y were unrelated? Hit Permute to shuffle y values and build the null distribution one slope at a time.

[Interactive panel: data selector (dataset, explanatory x, response y, species), with plots of the original data and the null distribution of b₁*. Phase 1: watch the mechanism (10 clicks). Phase 2: shape starts forming (unlocks after phase 1). Phase 3: full null distribution (unlocks after phase 2).]
The Mechanism · step by step

① Start with the real data. Compute r and the observed slope b₁.

② Ask: if there were no real linear association between x and y, could we have seen a slope this large just by chance?

③ Simulate the null: keep the x-values fixed, and randomly reassign the y-values to different x positions. This breaks any real pairing between x and y.

④ Refit the line on the scrambled data and record the new slope b₁*.

⑤ Repeat many times. The collection of b₁* values forms the null distribution — the slopes we would expect just by chance if the true slope were 0.

⑥ p-value = fraction of permuted slopes that are at least as extreme as the observed b₁. If the observed slope falls far into the tail, that is evidence against the null and suggests a real linear association.
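The six steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the page's own code: the data are simulated with a built-in slope of 0.5, "at least as extreme" is read as two-sided (compared by absolute value), and the names `slope` and `permutation_test` are hypothetical.

```python
import random

def slope(xs, ys):
    """OLS slope b1 for the line fitted to (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def permutation_test(xs, ys, n_perm=2000, seed=0):
    """Return the observed slope, the permuted slopes, and the p-value."""
    rng = random.Random(seed)
    observed = slope(xs, ys)                      # step 1
    shuffled = list(ys)
    null_slopes = []
    for _ in range(n_perm):                       # step 5: repeat many times
        rng.shuffle(shuffled)                     # step 3: x fixed, y reassigned
        null_slopes.append(slope(xs, shuffled))   # step 4: refit, record b1*
    # Step 6: fraction of permuted slopes at least as extreme (two-sided).
    extreme = sum(abs(b) >= abs(observed) for b in null_slopes)
    return observed, null_slopes, extreme / n_perm

# Simulated data with a real linear association (assumed for illustration).
data_rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 2.0) for x in xs]

observed, null_slopes, p_value = permutation_test(xs, ys)
null_mean = sum(null_slopes) / len(null_slopes)
print(f"observed slope = {observed:.3f}")
print(f"mean of permuted slopes = {null_mean:.3f}")
print(f"p-value = {p_value:.4f}")
```

The permuted slopes centre on 0 because shuffling destroys the pairing, while the observed slope sits far out in the tail.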

Current simulation · live

Press "Permute Once" to run the first simulation.

Problem 6 — The Normal Tunnel: distribution of y at each x

Drag the slider to slice through the true regression line at any x. The cross-section shows how individual y-values are distributed at that x.

Simulation — true population model μy = 50 + 0.60(x − 50) · σ = 12
[Interactive panel: readouts for true μy, true σ (12), SD(y near x), n near x, and ŷ at this x, with a cross-section plot of the distribution of y at the chosen x (initially x = 50).]
What you are seeing · the model

The red line is the true population line — fitted to 53,940 ggplot2 diamonds: price = −$2,254 + $7,753·carat, σ = $1,548.

At every carat value, prices scatter around μy with standard deviation σ. This scatter is the irreducible error in the model.

🎲 Simulated: points drawn randomly from the true model — perfect for understanding the concept.

💎 Real Diamonds: reveal actual diamonds from our 500-diamond sample one batch at a time — see the same tunnel emerge from real data.

The slice · key idea

At any x, individual y-values follow a normal distribution centred at μy(x) with spread σ.

Drag the slider left and right — the curve moves because μy changes, but it stays exactly the same width. σ does not depend on x.

This is the equal-variance (homoscedasticity) assumption of SLR: σ is constant across all values of x.
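A quick simulation makes the constant-width claim concrete. The sketch below draws y-values at three x positions from the simulation's stated model, μy = 50 + 0.60(x − 50) with σ = 12. The slice positions, sample size, and helper names are assumptions chosen for illustration.

```python
import random
import statistics

SIGMA = 12.0  # constant spread, taken from the stated simulation model

def mu_y(x):
    """True population mean of y at x: 50 + 0.60 * (x - 50)."""
    return 50 + 0.60 * (x - 50)

rng = random.Random(0)

def slice_at(x, n=5000):
    """Draw n y-values at one fixed x; return their mean and SD."""
    ys = [rng.gauss(mu_y(x), SIGMA) for _ in range(n)]
    return statistics.fmean(ys), statistics.stdev(ys)

# Slice the tunnel at three x positions (chosen arbitrarily).
slices = {x: slice_at(x) for x in (20, 50, 80)}
for x, (mean_y, sd_y) in slices.items():
    print(f"x = {x}: true mean = {mu_y(x):.1f}, "
          f"simulated mean = {mean_y:.1f}, simulated SD = {sd_y:.1f}")
```

The simulated mean tracks μy as x changes, while the simulated SD stays close to 12 at every slice.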

Fixed model · this visualisation

True model fitted to 53,940 ggplot2 diamonds: price = −$2,254 + $7,753·carat  ·  σ = $1,548  ·  Each click adds 20 real diamonds from our 500-diamond sample.

Problem 6B — Bootstrap Confidence Band: sampling variability of ŷ, b₀ and b₁

Each bootstrap resample (n=500 with replacement) produces a slightly different OLS line. Watch the confidence band form, then see the sampling distributions of the slope and intercept.

[Interactive panel: true line with sample OLS lines; readouts for true μy (price), theoretical SE(ŷ), empirical SD(ŷ), and number of samples; plots of the sampling distribution of ŷ at 0.8 ct and the bootstrap distributions of b₁ (slope) and b₀ (intercept).]
What you are seeing · the tunnel

Each bootstrap resample (n=500 with replacement from our 500 diamonds) produces a slightly different OLS line. The blue lines scatter around the true red line — forming the bootstrap confidence band.

The tunnel is narrowest near the mean carat (~0.8 ct) and wider at the extremes. Every OLS line passes through (x̄, ȳ), so uncertainty in the slope shifts ŷ more the farther x is from x̄.

Drag the slider to slice through the tunnel at any x and see the spread of ŷ values from all samples.
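The same behaviour can be reproduced offline. The sketch below does not use the real diamonds: it simulates a stand-in sample of 500 points from the stated model (price = −2,254 + 7,753·carat, σ = 1,548) with carat drawn uniformly between 0.2 and 1.4, which is an assumption made so that the mean carat is about 0.8. It then bootstraps the OLS line and measures the spread of ŷ at three carat values.

```python
import random
import statistics

def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

rng = random.Random(0)

# Simulated stand-in for the 500-diamond sample (NOT the real data).
n = 500
carat = [rng.uniform(0.2, 1.4) for _ in range(n)]
price = [-2254 + 7753 * c + rng.gauss(0, 1548) for c in carat]

def bootstrap_lines(xs, ys, n_boot=300):
    """Refit the OLS line on n_boot resamples drawn with replacement."""
    lines = []
    for _ in range(n_boot):
        picks = rng.choices(range(len(xs)), k=len(xs))  # resample row indices
        lines.append(ols_line([xs[i] for i in picks], [ys[i] for i in picks]))
    return lines

lines = bootstrap_lines(carat, price)

def sd_of_prediction(x0):
    """Spread of y-hat at x0 across all bootstrap lines."""
    return statistics.stdev([b0 + b1 * x0 for b0, b1 in lines])

sd_at = {x0: sd_of_prediction(x0) for x0 in (0.2, 0.8, 1.4)}
slope_sd = statistics.stdev([b1 for _, b1 in lines])

for x0, sd in sd_at.items():
    print(f"SD of y-hat at {x0} ct = {sd:.0f}")
print(f"SD of bootstrap slopes = {slope_sd:.0f}")
```

The spread of ŷ is smallest at 0.8 ct, near the mean of the simulated carats, and grows toward both ends, which is the tunnel shape described above.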

Fixed model · this visualisation

True model fitted to 53,940 ggplot2 diamonds: price = −$2,254 + $7,753·carat  ·  σ = $1,548  ·  Bootstrap: each resample draws n = 500 diamonds with replacement from our 500-diamond sample

Draw samples to begin.

Residuals — drag a point and feel what changes

Drag any point. The OLS line recomputes instantly. Watch the residual plot update live.

[Interactive panel: draggable scatter plot and residual vs fitted plot, with assumption scenarios, outlier scenarios, and a live model readout. Generate a dataset to begin.]

What is a residual · definition

For each point, the residual is: e = y − ŷ

It is the vertical distance from the point to the OLS line — positive if above, negative if below.

When you drag a point, the OLS line moves, so ŷ changes for every point. Every residual changes, not just the one for the point you moved.
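A small numeric check of that last idea, using made-up data and hypothetical helper names: move one point upward, refit the line, and compare every residual before and after.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def residuals(xs, ys):
    """Residuals e = y - y-hat from the OLS line fitted to (xs, ys)."""
    b0, b1 = ols_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

before = residuals(xs, ys)
dragged_ys = ys[:-1] + [ys[-1] + 4.0]   # drag the last point up by 4 units
after = residuals(xs, dragged_ys)
shifts = [a - b for a, b in zip(after, before)]

for x, b, a in zip(xs, before, after):
    print(f"x = {x}: residual before = {b:+.3f}, after = {a:+.3f}")
```

Only one y-value was edited, yet all six residuals shift. The residuals still sum to zero after the move, as they do for any OLS fit with an intercept.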

Outlier concepts · key ideas

High leverage — point far from x̄. Has the potential to pull the line but may not if it follows the trend.

High influence — actually changes the line substantially. Measured by Cook's D: high leverage combined with a large residual produces high influence.

Outlier (low leverage) — large residual near x̄. Doesn't move the line much but inflates RMSE.
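The three cases above can be compared numerically. The sketch below is an illustration with made-up data: it uses the standard simple-regression formulas for leverage, h = 1/n + (x − x̄)²/Sxx, and Cook's D, D = e²/(p·MSE) · h/(1 − h)² with p = 2. RMSE is taken here to mean √(SSE/(n − 2)), and all helper names are hypothetical.

```python
import math

def fit_diagnostics(xs, ys):
    """OLS fit plus leverage and Cook's D for every point."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    p = 2                                     # fitted parameters: b0 and b1
    mse = sum(e * e for e in resid) / (n - p)
    leverage = [1 / n + (x - mx) ** 2 / sxx for x in xs]
    cooks_d = [(e * e / (p * mse)) * h / (1 - h) ** 2
               for e, h in zip(resid, leverage)]
    return {"b0": b0, "b1": b1, "rmse": math.sqrt(mse),
            "leverage": leverage, "cooks_d": cooks_d}

# Made-up base data lying close to a straight line.
base_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
base_y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 19.9]
base = fit_diagnostics(base_x, base_y)

def with_extra_point(x_new, y_new):
    """Refit after appending one extra point to the base data."""
    return fit_diagnostics(base_x + [x_new], base_y + [y_new])

# Far from x-bar and exactly on the base line: high leverage, no influence.
on_trend = with_extra_point(25, base["b0"] + base["b1"] * 25)
# Far from x-bar and well off the trend: high leverage and high influence.
off_trend = with_extra_point(25, 20)
# At x-bar with a large residual: low leverage outlier.
near_mean = with_extra_point(5.5, 25)

for name, fit in [("on trend", on_trend), ("off trend", off_trend),
                  ("near mean", near_mean)]:
    print(f"{name:>9}: slope = {fit['b1']:.3f}, RMSE = {fit['rmse']:.2f}, "
          f"leverage = {fit['leverage'][-1]:.3f}, "
          f"Cook's D = {fit['cooks_d'][-1]:.3f}")
```

The on-trend point has high leverage but leaves the line where it was. The off-trend point at the same x pulls the slope down sharply and has by far the largest Cook's D. The point near x̄ leaves the slope unchanged but inflates RMSE.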

STAT 337 · Unit 2 · Conceptual Tool

Reading the QQ Plot — Conceptual Shape Explorer

Note: This is a conceptual tool — the data shown is mathematically constructed to illustrate shape patterns. Use it to build intuition for what skew and heavy/light tails look like on a QQ plot. Real residuals will be messier.

[Interactive panel: skew slider (longer negative tail to longer positive tail) and tails slider (fewer extremes to more extremes), both starting at normal; plots of the distribution of residuals and the normal QQ plot (ranked residuals: observed vs theoretical normal quantile).]
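For readers who want to see how the points on a QQ plot are produced, here is a minimal sketch. It pairs ranked, standardised values with theoretical normal quantiles. The plotting positions (i − 0.5)/n are one common convention and an assumption here, the two samples are simulated rather than real residuals, and `qq_points` is a hypothetical helper.

```python
import random
from statistics import NormalDist, fmean, stdev

def qq_points(values):
    """Return (theoretical normal quantile, standardised observed value) pairs."""
    n = len(values)
    m, s = fmean(values), stdev(values)
    observed = sorted((v - m) / s for v in values)        # ranked residuals
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

rng = random.Random(0)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]  # longer positive tail

normal_qq = qq_points(normal_sample)
skewed_qq = qq_points(skewed_sample)

for name, pts in [("normal", normal_qq), ("right-skewed", skewed_qq)]:
    (t_lo, o_lo), (t_hi, o_hi) = pts[0], pts[-1]
    print(f"{name}: lowest point  theoretical {t_lo:+.2f}, observed {o_lo:+.2f}")
    print(f"{name}: highest point theoretical {t_hi:+.2f}, observed {o_hi:+.2f}")
```

For the normal sample the observed values stay close to the theoretical ones, so the points hug the diagonal. For the right-skewed sample the lowest observed values are much less negative than the normal quantiles predict, which bends the left end of the plot above the line.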