Lesson 6: Simple Linear Regression

Purpose of this lesson

This lesson introduces simple linear regression using a workflow consistent with the OpenIntro Biostatistics approach: start with a research question, visualize the relationship, fit a model, interpret results in context, and communicate findings responsibly using a reproducible R + Quarto workflow.

You are not expected to memorize formulas. You are expected to interpret results and report them clearly.

Learning goals

By the end of this lesson, you will be able to:

  • Identify the response and predictor variables in a regression question
  • Fit a simple linear regression model using lm()
  • Interpret the slope as an association (not causation)
  • Visualize a regression relationship with a fitted line
  • Report regression results using inline R code
  • Perform basic visual checks of regression assumptions

Key Terms

You can refer to this table throughout the lesson (and beyond).

As a quick reminder: in this course, definitions emphasize interpretation and application, not mathematical formulas.

Key terms for Lesson 6

| Term | Plain-language explanation |
| --- | --- |
| Simple linear regression | A statistical method used to describe the average association between one numeric predictor and one numeric outcome. |
| Outcome (response variable) | The numeric variable we are trying to explain or understand (e.g., BMI). |
| Predictor (explanatory variable) | The numeric variable used to help explain changes in the outcome (e.g., age). |
| Observational unit | The entity represented by each row of the dataset (e.g., one survey participant). |
| Regression line | The line that represents the model’s estimated average relationship between the predictor and outcome. |
| Slope | The estimated average change in the outcome for a one-unit increase in the predictor. |
| Intercept | The predicted value of the outcome when the predictor equals zero; often not meaningful in public health contexts. |
| Residual | The difference between an observed outcome value and the value predicted by the regression model. |
| Linearity | The assumption that the relationship between the predictor and outcome can be reasonably approximated by a straight line. |
| Constant variance (homoscedasticity) | The assumption that the variability of residuals is roughly the same across all values of the predictor. |
| Normality of residuals | The assumption that residuals are approximately normally distributed; important for confidence intervals and p-values. |
| Confidence interval | A range of plausible values for the true population-level association, reflecting uncertainty in the estimate. |
| P-value | A measure of how consistent the observed association is with the assumption of no association in the population. |
| Association | A statistical relationship between variables that does not, by itself, imply causation. |
| Causation | A relationship where changes in one variable directly produce changes in another; requires stronger study designs than simple regression. |

Setup

library(tidyverse)
library(broom)

# Install once if needed:
# install.packages("NHANES")
library(NHANES)

Exploring the Data

We will use the NHANES dataset from the NHANES R package, which contains data from the U.S. National Health and Nutrition Examination Survey. Each row represents one survey participant.

data("NHANES")
glimpse(NHANES)
Rows: 10,000
Columns: 76
$ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 5164…
$ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,…
$ Gender           <fct> male, male, male, male, female, male, male, female, f…
$ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, …
$ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  40…
$ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795,…
$ Race1            <fct> White, White, White, Other, White, White, White, Whit…
$ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Education        <fct> High School, High School, High School, NA, Some Colle…
$ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, M…
$ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, 3…
$ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 8750…
$ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00,…
$ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3,…
$ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, O…
$ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, N…
$ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7,…
$ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.…
$ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.2…
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plus…
$ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, 8…
$ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, …
$ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, 6…
$ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124, …
$ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, 6…
$ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108, …
$ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, 6…
$ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114, …
$ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, 7…
$ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, 2…
$ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, 5…
$ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113, …
$ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1.…
$ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
$ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vgo…
$ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0,…
$ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA,…
$ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, No…
$ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None,…
$ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, N…
$ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, N…
$ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, Y…
$ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes, …
$ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2, …
$ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, N…
$ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, N…
$ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
$ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA, …
$ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, N…
$ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA, …
$ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No, …
$ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smoke…
$ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, N…
$ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Ye…
$ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, N…
$ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes,…
$ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, N…
$ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes, …
$ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
$ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, N…
$ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA, …
$ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1,…
$ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, N…
$ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Heteros…
$ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Tip

Do you know what all of these variables mean? Make sure to understand your dataset before you start to analyze.

Selecting Variables

We need to pick one response variable and one predictor variable.

What is another way we can phrase this?

Following our OpenIntro textbook, a simple linear regression setup involves:

  • numeric response
  • numeric predictor
  • linear association
  • interpretation before complexity
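In R, the choice of response and predictor is written as a model formula, with the response to the left of ~ and the predictor to the right. The following is a minimal sketch using base R only; it borrows BMI and age from the key terms table as an example, and the names f, response, and predictor are illustrative.

```r
# A model formula: response on the left of ~, predictor on the right
f <- BMI ~ Age

# all.vars() lists the variable names in the order they appear
vars <- all.vars(f)
response  <- vars[1]
predictor <- vars[2]

cat("Response: ", response, "\n")
cat("Predictor:", predictor, "\n")
```

Reading a formula aloud as "BMI explained by Age" is one way to check that the two roles have not been swapped.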

Cleaning the Data

What did you first notice about this dataset?

If you said “there’s a lot of missing data”, you’re right. This is a real-world dataset with real-world problems, and missing data is one of them.

I have decided to use Age and BMI as my predictor and response variables, respectively. I noticed that these variables have missing data.

How do we handle this? Well, we can do the following:

df <- NHANES %>%
  select(Age, BMI) %>%
  drop_na()

What happened? And is this the best way to handle this?

Short answer: drop_na() removed every row with a missing value in Age or BMI, so df contains only complete observations. For the purposes of this lesson, this is the simplest mechanical way to handle missing data. However, it is not the ideal method; we will return to this in a later lesson.
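Before dropping rows, it can help to count how many values are missing and how many rows are lost. The sketch below assumes the same packages as the Setup section; the names missing_counts, n_before, and n_after are illustrative. The regression output later in this lesson reports 9632 residual degrees of freedom, which corresponds to 9,634 complete rows.

```r
library(tidyverse)
library(NHANES)

data("NHANES")

# Count missing values in each variable before dropping anything
missing_counts <- NHANES %>%
  summarise(
    missing_age = sum(is.na(Age)),
    missing_bmi = sum(is.na(BMI))
  )
missing_counts

# Keep only rows that are complete for the two variables of interest
df <- NHANES %>%
  select(Age, BMI) %>%
  drop_na()

# Compare row counts before and after
n_before <- nrow(NHANES)
n_after  <- nrow(df)
n_before - n_after  # number of rows removed
```

Reporting the number of rows removed alongside your results lets readers judge how much data the analysis is based on.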

Let’s Visualize

By now, you’re experts at creating graphs using ggplot.

ggplot(df, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.3) +
  labs(
    title = "BMI vs Age (NHANES)",
    x = "Age (years)",
    y = "Body Mass Index"
  )

Try to interpret the graph using public health terms and using plain English.

Fit Simple Linear Regression

model <- lm(BMI ~ Age, data = df)
summary(model)

Call:
lm(formula = BMI ~ Age, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.794  -4.803  -1.236   3.466  55.697 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.451688   0.137193  156.36   <2e-16 ***
Age          0.138033   0.003148   43.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.735 on 9632 degrees of freedom
Multiple R-squared:  0.1664,    Adjusted R-squared:  0.1663 
F-statistic:  1922 on 1 and 9632 DF,  p-value: < 2.2e-16

How would you interpret this table?

You are not expected to interpret everything in the regression output.

Instead, focus on one row only: the row corresponding to Age.

From that row, identify:

  • Estimate → the slope (average change in BMI per year of age)
  • p-value → evidence of an association
  • Direction → positive or negative relationship

Ignore the following for now:

  • R-squared
  • F-statistic
  • Interpretation of the intercept
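One way to pull out just the Age row is tidy() from the broom package, which is loaded in the Setup section. This is a sketch rather than a required step; the name age_row is illustrative, and the block rebuilds df and model so that it runs on its own.

```r
library(tidyverse)
library(broom)
library(NHANES)

data("NHANES")
df <- NHANES %>% select(Age, BMI) %>% drop_na()
model <- lm(BMI ~ Age, data = df)

# tidy() returns the coefficient table as a data frame, one row per term;
# conf.int = TRUE adds 95% confidence limits (conf.low, conf.high)
age_row <- tidy(model, conf.int = TRUE) %>%
  filter(term == "Age")

age_row
```

The estimate column is the slope, the sign of the estimate gives the direction, and p.value holds the p-value.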

Slope and Confidence Interval

slope <- coef(model)["Age"]
ci <- confint(model)["Age", ]

The slope represents the average change in BMI for a one-year increase in age.
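To make the slope concrete, the short calculation below uses the coefficients copied by hand from the summary(model) output shown earlier. The ages 10 and 40 are arbitrary illustrative choices, and the names intercept_est and slope_est are not part of the lesson code.

```r
# Coefficients copied from the summary(model) output
intercept_est <- 21.451688
slope_est     <- 0.138033

# Average difference in BMI between people 10 years apart in age
diff_10_years <- slope_est * 10

# Model's predicted average BMI at age 40
pred_age_40 <- intercept_est + slope_est * 40

round(diff_10_years, 2)  # 1.38
round(pred_age_40, 2)    # 26.97
```

Both numbers describe averages across participants, not how any one person's BMI changes as they age.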

Inline reporting (required)

Use inline R code to report results so that your document updates automatically if the model changes.

For each additional year of age, BMI changes by an average of 0.138 units (95% CI: 0.132 to 0.144).

How to interpret this sentence

This sentence communicates:

  • Direction of the association
  • Magnitude of the change
  • Uncertainty in the estimate

This describes an association, not causation.
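The rendered sentence hides the code that produced it. The sketch below shows one possible way to generate the same sentence from the model objects, so that no number is typed by hand. The sprintf() approach and the name report are illustrative; the comments at the end show how the same values might be written inline in a Quarto source file.

```r
library(tidyverse)
library(NHANES)

data("NHANES")
df <- NHANES %>% select(Age, BMI) %>% drop_na()
model <- lm(BMI ~ Age, data = df)

slope <- coef(model)["Age"]
ci <- confint(model)["Age", ]

# Build the reporting sentence directly from the model objects
report <- sprintf(
  "For each additional year of age, BMI changes by an average of %.3f units (95%% CI: %.3f to %.3f).",
  slope, ci[1], ci[2]
)
report

# In the Quarto source, the same values can be placed inline, for example:
#   ... by an average of `r round(slope, 3)` units
#   (95% CI: `r round(ci[1], 3)` to `r round(ci[2], 3)`).
```

If the data or model change, rerendering the document updates the sentence automatically.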

Adding a Regression Line

ggplot(df, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Simple Linear Regression of BMI on Age",
    x = "Age (years)",
    y = "Body Mass Index"
  )
`geom_smooth()` using formula = 'y ~ x'

This plot combines:

  • the raw data (points), and
  • the model’s predicted average relationship (line).

The shaded band around the line represents the 95% confidence interval for the mean predicted BMI at each age.

You should focus on:

Direction of the line

  • Does the line slope upward or downward?
  • This should match the sign of the slope from the regression output.

How well the line summarizes the data

  • Do the points generally cluster around the line?
  • Or is there substantial scatter?

Uncertainty around the line

  • Is the confidence band narrow or wide?
  • Wider bands indicate more uncertainty in the estimated relationship.
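The band can also be inspected numerically. As a sketch, predict() with interval = "confidence" returns the fitted mean BMI and its 95% confidence limits at chosen ages. The ages 20, 40, and 60 and the names new_ages and preds are illustrative choices.

```r
library(tidyverse)
library(NHANES)

data("NHANES")
df <- NHANES %>% select(Age, BMI) %>% drop_na()
model <- lm(BMI ~ Age, data = df)

# Ages at which to evaluate the fitted line (illustrative choices)
new_ages <- data.frame(Age = c(20, 40, 60))

# interval = "confidence" gives the 95% CI for the mean BMI at each age
# (columns: fit = fitted mean, lwr = lower limit, upr = upper limit)
preds <- predict(model, newdata = new_ages, interval = "confidence")
cbind(new_ages, preds)
```

The distance between lwr and upr at each age is the height of the shaded band at that point on the plot.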

Running Diagnostics

plot(model, which = 1)  # Residuals vs Fitted

plot(model, which = 2)  # Normal Q-Q

These plots help assess whether the assumptions of linear regression are reasonably met.

You are expected to recognize obvious red flags.

Plot 1: Residuals vs. Fitted

This plot checks:

  • Linearity
  • Constant variability (homoscedasticity)

What YOU should look for:

  • Residuals scattered randomly around zero
  • No clear curve or pattern
  • Roughly equal spread across fitted values

Example of an acceptable interpretation (when the plot shows no clear pattern):

The residuals appear randomly scattered around zero with no strong curvature, suggesting the linearity assumption is reasonable. The spread of residuals is fairly consistent across fitted values, indicating approximately constant variance.

What would indicate a concern:

  • Curved pattern → possible nonlinearity
  • Funnel shape → non-constant variance

You should identify, not fix, these issues (at least not yet).
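To see where the points in this plot come from, the sketch below computes the residuals explicitly and redraws the plot with ggplot. It assumes broom's augment(), which adds .fitted and .resid columns to the model data; the names aug and max_gap are illustrative.

```r
library(tidyverse)
library(broom)
library(NHANES)

data("NHANES")
df <- NHANES %>% select(Age, BMI) %>% drop_na()
model <- lm(BMI ~ Age, data = df)

# augment() adds fitted values (.fitted) and residuals (.resid) to the data
aug <- augment(model)

# A residual is the observed value minus the fitted value;
# the largest discrepancy should be essentially zero
max_gap <- max(abs(aug$.resid - (aug$BMI - aug$.fitted)))
max_gap

# Residuals vs fitted values, with a dashed reference line at zero
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted BMI", y = "Residual")
```

Points above the dashed line are participants whose BMI is higher than the model predicts for their age; points below are lower.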

Plot 2: Normal Q-Q Plot

What this plot checks:

  • Approximate normality of residuals

This matters for:

  • confidence intervals
  • p-values

What YOU should look for:

  • Points falling roughly along the diagonal line
  • Small deviations at the ends are acceptable

Normality is about the residuals, not the outcome variable itself.
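Because the check concerns the residuals, a Q-Q plot can be drawn from the residual column directly. The following ggplot version is a sketch of an alternative to plot(model, which = 2), not a replacement for it; it uses broom's augment(), and the names aug and qq_plot are illustrative.

```r
library(tidyverse)
library(broom)
library(NHANES)

data("NHANES")
df <- NHANES %>% select(Age, BMI) %>% drop_na()
model <- lm(BMI ~ Age, data = df)

# Residuals from the fitted model (not the raw BMI values)
aug <- augment(model)

# stat_qq() plots sample quantiles against normal quantiles;
# stat_qq_line() adds the reference line the points should follow
qq_plot <- ggplot(aug, aes(sample = .resid)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line() +
  labs(x = "Theoretical quantiles", y = "Residual quantiles")

qq_plot
```

Note that base R's version plots standardized residuals, so the vertical scale differs, but the shape to look for is the same.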

In this course, you are expected to describe diagnostic plots in words and identify major issues if present, not to diagnose or correct subtle statistical problems.