Lesson 7: Logistic Regression for Public Health

Modeling Binary Health Outcomes

Learning Objectives

By the end of this lesson, you will be able to:

Explain when logistic regression is appropriate
Interpret odds and probabilities in health research
Fit a logistic regression model in R
Interpret odds ratios in public health language
Generate predicted probabilities
Communicate findings responsibly

Assigned Readings

OpenIntro Biostatistics, Chapter 8
Statistical Inference via Data Science, Chapter 6

1. When Do We Use Logistic Regression?

Linear regression is used when:

Outcome is continuous (BMI, blood pressure, cholesterol)

Logistic regression is used when:

Outcome is binary
- Hypertension: Yes / No
- Diabetes: Yes / No
- Smoker: Yes / No
- Fair/Poor health: Yes / No

Logistic regression models:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X \]

Where:

\(p\) = probability of the outcome
\(p/(1-p)\) = odds
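
To see how the log-odds scale connects back to probability, the model equation can be solved for the probability of the outcome. A short derivation, starting from the model equation and using only the symbols defined here:

```latex
\begin{aligned}
\log\left(\frac{p}{1-p}\right) &= \beta_0 + \beta_1 X \\
\frac{p}{1-p} &= e^{\beta_0 + \beta_1 X} \\
p &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
\end{aligned}
```

Whatever value the right-hand side takes, the resulting probability stays strictly between 0 and 1.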

2. Load NHANES Data

Loading Libraries

library(tidyverse)

library(NHANES)

Loading the Dataset & Creating a Subset

data(NHANES)

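df <- NHANES %>%
  # Keep only rows with non-missing BMI, Gender, and Age
  filter(!is.na(BMI), !is.na(Gender), !is.na(Age)) %>%
  mutate(
    # Binary outcome: 1 = obese (BMI >= 30), 0 = not obese
    high_bmi = ifelse(BMI >= 30, 1, 0)
  )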

3. Fit Logistic Regression Model

Our example research question: Is obesity associated with age?

model <- glm(high_bmi ~ Age, 
             data = df, 
             family = binomial)

summary(model)


Call:
glm(formula = high_bmi ~ Age, family = binomial, data = df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.865791   0.051335  -36.34   <2e-16 ***
Age          0.023872   0.001084   22.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11552  on 9633  degrees of freedom
Residual deviance: 11040  on 9632  degrees of freedom
AIC: 11044

Number of Fisher Scoring iterations: 4

What Is Happening Here?

glm() fits a generalized linear model.
family = binomial specifies a binary outcome.
The model estimates log-odds, not probabilities.

The coefficient for Age is on the log-odds scale.

It must be transformed before interpretation.

4. Convert Coefficients to Odds Ratios

exp(coef(model))

(Intercept)         Age 
  0.1547737   1.0241594

Interpretation:

Each additional year of age is associated with about a 2.4% increase in the odds of obesity (OR = 1.024).

Interpretation Rules

OR = 1 → No association
OR > 1 → Increased odds
OR < 1 → Decreased odds

Examples:

OR = 1.25 → 25% higher odds
OR = 0.80 → 20% lower odds
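
The percent-change rule above can be sketched in a few lines of R. This is a minimal illustration, assuming the Age coefficient printed by summary(model) earlier; pct_change() is a hypothetical helper written for this example, not part of any package. It also shows how to rescale the odds ratio to a 10-year difference, which is often easier to communicate than a 1-year difference.

```r
# Age coefficient (log-odds scale), copied from the summary(model) output
b_age <- 0.023872

# Odds ratio for a 1-year difference in age
or_1yr <- exp(b_age)

# Hypothetical helper: percent change in odds implied by an odds ratio
pct_change <- function(or) (or - 1) * 100

# Odds ratio for a 10-year difference: multiply the coefficient by 10
# before exponentiating (equivalent to raising the 1-year OR to the 10th power)
or_10yr <- exp(10 * b_age)

round(c(or_1yr = or_1yr, or_10yr = or_10yr), 3)
round(pct_change(c(or_1yr, or_10yr, 1.25, 0.80)), 1)
```

Note that odds ratios multiply across units: the 10-year odds ratio is the 1-year odds ratio raised to the 10th power, not 10 times the 1-year percent change.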

5. Confidence Intervals for Odds Ratios

exp(confint(model))

Waiting for profiling to be done...

                2.5 %    97.5 %
(Intercept) 0.1398656 0.1710458
Age         1.0219920 1.0263459

Interpretation framework:

Each one-year increase in age was associated with a ___-fold increase in odds of obesity (95% CI: lower to upper).

How to Interpret Confidence Intervals

If the 95% CI includes 1 → not statistically significant at the 0.05 level
If the 95% CI does not include 1 → statistically significant at the 0.05 level
CIs communicate magnitude and uncertainty

In public health research, effect size and uncertainty matter more than p-values alone.
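
As a rough check on where these intervals come from, a Wald interval can be computed by hand from the estimate and standard error in the summary(model) output. This is a sketch under the assumption that the normal approximation is adequate; confint() on a glm uses profile likelihood instead, so the limits can differ slightly. The variable names below are introduced only for this example.

```r
# Estimate and standard error for Age, copied from the summary(model) output
b_age  <- 0.023872
se_age <- 0.001084

# Wald 95% interval on the log-odds scale, then exponentiate both limits
z <- qnorm(0.975)
ci_log_odds <- b_age + c(-1, 1) * z * se_age
or_age <- exp(b_age)
ci_or  <- exp(ci_log_odds)

# Fill in the reporting template with the computed values
sentence <- sprintf(
  "Each one-year increase in age was associated with a %.3f-fold increase in odds of obesity (95%% CI: %.3f to %.3f).",
  or_age, ci_or[1], ci_or[2]
)
sentence
```

With a sample this large, the Wald limits agree with the profile-likelihood limits above to about four decimal places.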

6. Predicted Probabilities

Odds ratios can be abstract.

Public health audiences often understand probabilities better.

new_ages <- tibble(Age = c(20, 40, 60))

predicted <- predict(model, new_ages, type = "response")

cbind(new_ages, predicted_probability = predicted)

  Age predicted_probability
1  20             0.1996715
2  40             0.2868146
3  60             0.3932998

type = "response" converts log-odds into probabilities.
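
To make that conversion concrete, the same probabilities can be reproduced by hand from the coefficients. A minimal sketch, assuming the rounded estimates printed by summary(model); plogis() is the base R inverse-logit function.

```r
# Coefficients copied from the summary(model) output
b0    <- -1.865791
b_age <-  0.023872

ages <- c(20, 40, 60)

# Step 1: linear predictor on the log-odds scale
log_odds <- b0 + b_age * ages

# Step 2: inverse logit; plogis(x) computes 1 / (1 + exp(-x))
prob <- plogis(log_odds)

data.frame(Age = ages,
           log_odds = round(log_odds, 3),
           probability = round(prob, 4))
```

The probabilities should match the predict() output to several decimal places; any small difference comes from using rounded coefficients.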

Interpretation Example

Predicted probability at age 20 = 0.20 → about a 20% probability of obesity
Predicted probability at age 60 = 0.39 → about a 39% probability of obesity

Notice

The increase is nonlinear.
Predicted values stay between 0 and 1.
Logistic regression produces an S-shaped curve.
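
The nonlinearity can be seen numerically: equal steps in age multiply the odds by the same factor but add different amounts to the probability. A small sketch, again assuming the rounded coefficients from summary(model):

```r
# Coefficients copied from the summary(model) output
b0    <- -1.865791
b_age <-  0.023872

ages <- c(20, 40, 60)
prob <- plogis(b0 + b_age * ages)
odds <- prob / (1 - prob)

# Each 20-year step multiplies the odds by the same factor ...
odds_ratio_per_step <- odds[-1] / odds[-length(odds)]

# ... but adds a different amount to the probability
prob_change_per_step <- diff(prob)

round(odds_ratio_per_step, 3)
round(prob_change_per_step, 3)
```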

7. Visualizing the Logistic Model

age_grid <- tibble(Age = seq(min(df$Age), max(df$Age), length.out = 200))

age_grid$prob <- predict(model, age_grid, type = "response")

ggplot(df, 
       aes(
         x = Age, 
         y = high_bmi
         )) +
  geom_point(alpha = 0.2) +
  geom_line(data = age_grid, 
            aes(
              Age, 
              prob), 
            linewidth = 1.2) +
  labs(y = "Probability of Obesity",
       title = "Predicted Probability of Obesity by Age")

What to observe:

Points are binary (0 or 1).
The curve is smooth.
Probability increases gradually.
Predictions remain within logical bounds.

Linear regression could predict probabilities below 0 or above 1. Logistic regression avoids that issue.
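
The contrast is easiest to see on data with a strong predictor. The sketch below uses simulated data, not NHANES: x is a hypothetical predictor and y a binary outcome generated from it. Both a linear model and a logistic model are fit to the same data and their fitted values are compared.

```r
# Simulated data (not NHANES): x is a hypothetical predictor with a strong effect
set.seed(1)
x <- seq(-10, 10, length.out = 500)
y <- rbinom(length(x), size = 1, prob = plogis(x))
sim <- data.frame(x = x, y = y)

# Linear regression versus logistic regression on the same binary outcome
fit_lm  <- lm(y ~ x, data = sim)
fit_glm <- glm(y ~ x, data = sim, family = binomial)

# Range of fitted "probabilities" from each model
range_lm  <- range(fitted(fit_lm))
range_glm <- range(fitted(fit_glm))

range_lm   # extends below 0 and above 1
range_glm  # stays inside [0, 1]
```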

8. Odds vs. Probability (Concept Clarification)

People often confuse odds and probability.

Probability is the chance that the event occurs. The range is 0 to 1.

Odds are the probability of something divided by 1 minus that probability.

So, if the probability of something happening is 0.20, then the odds are 0.20/(1 - 0.20) = 0.20/0.80 = 0.25.

Odds can exceed 1, probability cannot.
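
The conversion in both directions can be sketched with two small helper functions. prob_to_odds() and odds_to_prob() are hypothetical names written for this example, not functions from a package.

```r
# Hypothetical helpers: convert between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

# Odds pass 1 once the probability passes 0.50
p <- c(0.10, 0.20, 0.50, 0.80, 0.90)
data.frame(probability = p, odds = prob_to_odds(p))

# The two functions undo each other
odds_to_prob(prob_to_odds(0.20))
```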

Why Logistic Regression Uses Odds

Odds range from 0 to infinity.
Log(odds) ranges from negative infinity to positive infinity.
Because log-odds are unbounded, the linear predictor can take any value and still map back to a valid probability between 0 and 1.

Important:

Odds ratio ≠ risk ratio.
An OR of 2 does not mean “twice as likely” unless the outcome is rare.
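
The gap between the two measures can be illustrated with numbers from this lesson. A minimal sketch, assuming the predicted probabilities at ages 20 and 60 reported by predict() in Section 6:

```r
# Predicted probabilities of obesity at ages 20 and 60,
# copied from the predict() output in Section 6
p20 <- 0.1996715
p60 <- 0.3932998

# Risk ratio: ratio of probabilities
risk_ratio <- p60 / p20

# Odds ratio: ratio of odds
odds_ratio <- (p60 / (1 - p60)) / (p20 / (1 - p20))

round(c(risk_ratio = risk_ratio, odds_ratio = odds_ratio), 2)
```

Because obesity is common in this sample, the odds ratio comparing age 60 with age 20 is noticeably larger than the corresponding risk ratio. Reporting the odds ratio as "times as likely" would overstate the difference.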

9. Public Health Interpretation

Avoid:

“Age causes obesity.”
“Risk doubles.”

Use:

“Age was associated with higher odds of obesity.”
“After adjustment, age remained independently associated with obesity.” (only when the model actually includes other covariates)

Limitations to Consider

Cross-sectional data cannot establish causality.
Measurement error may bias results.
Residual confounding may remain.
Survey design limitations may affect inference.

Logistic Regression Assumptions (Introductory)

Independent observations
Correct link function
No severe multicollinearity
Adequate sample size