Lesson 6a: Biostats 1 Review
Purpose of this Lesson
In this lesson, we will review what you should have learned in Biostats 1. The topics will include:
- Introduction to Data
- Probability
- Distribution of Random Variables
- Inference
This lesson is not meant to teach these topics from scratch, but simply to refresh your memory. We will also learn how to work with each of them in R/RStudio.
Chapter 1 refresher: Data and variables
Types of variables
You should already be familiar with the distinction between:
- Categorical variables (e.g., sex, race/ethnicity, region)
- Numeric variables (e.g., age, BMI, blood pressure)
In R, this distinction affects:
- which summaries are appropriate
- how plots are created
- how variables enter regression models
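As a minimal sketch of how variable type shows up in R, the example below builds a small, hypothetical data frame (the `patients` name, its columns, and its values are made up for illustration) and applies a type-appropriate summary to each column.

```r
# Hypothetical toy data; column names and values are made up for illustration.
patients <- data.frame(
  region = factor(c("North", "South", "South", "West", "North")),
  age    = c(34, 51, 47, 29, 62)
)

# class() reports how R stores each column.
sapply(patients, class)

# Categorical variable: counts per level are the natural summary.
region_counts <- table(patients$region)
region_counts

# Numeric variable: mean and standard deviation are meaningful.
age_mean <- mean(patients$age)
age_sd   <- sd(patients$age)
```

Storing a categorical variable as a factor is one common choice; it tells R to treat the values as group labels rather than as numbers or free text.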
Observations and units of analysis
Each row in a dataset represents one observational unit.
Examples:
- one person (NHANES)
- one country-year (Gapminder)
- one clinic visit
Before analyzing any dataset, you should always be able to answer:
What does one row represent?
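One quick way to check what a row represents is to compare the number of rows with the number of distinct units. The sketch below uses a hypothetical visit-level data frame (`visits`, `patient_id`, and `visit_date` are invented names) in which each person can appear more than once.

```r
# Hypothetical visit-level data: one row per clinic visit, not per person.
visits <- data.frame(
  patient_id = c(1, 1, 2, 3, 3, 3),
  visit_date = as.Date(c("2024-01-05", "2024-02-10", "2024-01-20",
                         "2024-01-08", "2024-03-01", "2024-04-15"))
)

n_rows     <- nrow(visits)                       # number of rows (visits)
n_patients <- length(unique(visits$patient_id))  # number of distinct people

# If these differ, a row is not "one person".
n_rows == n_patients
```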
Study design and causality
Chapter 1 distinguishes between:
- Observational studies
- Experimental studies
You should be able to explain:
- whether a study allows causal conclusions
- why most public health datasets support associational, not causal, claims
Problem sets
Complete problems from the OpenIntro Book 1.1, 1.3, 1.9, 1.12, 1.16, 1.26, 1.28, 1.33, 1.35, 1.36.
Chapter 2 refresher: Probability
What probability represents
Probability describes the long-run behavior of a random process.
A probability:
- quantifies uncertainty
- does not predict individual outcomes
In public health contexts, probabilities are often expressed as risks or proportions.
Conditional probability and independence
You should be able to reason about:
- probabilities given additional information
- whether events are independent or related
These ideas support later discussions of risk factors and outcomes.
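To make conditional probability concrete, the sketch below uses a hypothetical 2x2 table of exposure by outcome (the counts are invented for illustration) and compares the overall probability of the outcome with the probability among the exposed.

```r
# Hypothetical counts, made up for illustration.
counts <- matrix(
  c(30, 70,    # exposed:   30 with outcome, 70 without
    20, 180),  # unexposed: 20 with outcome, 180 without
  nrow = 2, byrow = TRUE,
  dimnames = list(exposure = c("exposed", "unexposed"),
                  outcome  = c("yes", "no"))
)

# Marginal probability of the outcome: P(outcome = yes)
p_outcome <- sum(counts[, "yes"]) / sum(counts)

# Conditional probability: P(outcome = yes | exposed)
p_outcome_given_exposed <- counts["exposed", "yes"] / sum(counts["exposed", ])

# Row-wise conditional probabilities for both exposure groups at once
prop.table(counts, margin = 1)
```

If exposure and outcome were independent, the conditional probability would equal the marginal probability. Here the two differ, so in these made-up data the events are related.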
Common misconceptions
A key goal of Chapter 2 is recognizing incorrect reasoning about chance.
You should be able to explain why:
- past outcomes do not “force” future outcomes
- randomness does not imply lack of structure
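The first point can be checked by simulation. The sketch below (the seed and number of flips are arbitrary choices) simulates fair coin flips and looks at the flip that immediately follows every run of three heads.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulate 100,000 independent fair coin flips (1 = heads, 0 = tails).
flips <- rbinom(100000, size = 1, prob = 0.5)
n <- length(flips)

# Positions of the flip that follows three heads in a row.
after_streak <- which(flips[1:(n - 3)] == 1 &
                      flips[2:(n - 2)] == 1 &
                      flips[3:(n - 1)] == 1) + 3

# Proportion of heads immediately after a streak of three heads.
p_after_streak <- mean(flips[after_streak])
p_after_streak
```

The proportion should land close to 0.5: a streak of heads does not make tails "due" on the next flip.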
Numerical summaries
Key numerical summaries you should already recognize:
- mean
- median
- standard deviation
- range
- interquartile range
In this course, you will:
- compute these in R
- report them using inline code
- interpret them in writing
Example reminder (conceptual):
The mean describes the center of the data, while the standard deviation describes how far values typically fall from the mean.
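As a short sketch, each of the summaries listed above has a base R function. The blood pressure values below are hypothetical and chosen so that one reading is unusually high.

```r
# Hypothetical systolic blood pressure readings (mmHg), made up for illustration.
sbp <- c(112, 118, 121, 125, 130, 134, 160)

mean(sbp)     # arithmetic average
median(sbp)   # middle value; less sensitive to the outlying 160
sd(sbp)       # sample standard deviation
range(sbp)    # minimum and maximum
IQR(sbp)      # 75th percentile minus 25th percentile
```

Because of the single high reading, the mean is pulled above the median, which is one reason to report the median and IQR for skewed variables.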
Shape, center, and spread
When describing a numeric variable, you should think about:
- center (mean or median)
- spread (SD or IQR)
- shape (symmetric, skewed, multimodal)
These ideas will come up again when we assess model assumptions.
Problem sets
Complete problems from the OpenIntro Book 2.1, 2.5, 2.9, 2.11, 2.16, and 2.17.
Chapter 3 refresher: Distributions of random variables
Random variables
A random variable assigns a numerical value to the outcome of a random process.
You should be able to:
- define the random variable in context
- state whether it is discrete or continuous
This skill is essential for understanding model outcomes later in the course.
Probability distributions
A probability distribution describes:
- which values are more or less likely
- how variability is structured
The book introduces common distributions conceptually (e.g., binomial, normal), without emphasizing computation.
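Although the book treats these distributions conceptually, R can evaluate them directly. The sketch below uses base R's `dbinom()`, `pbinom()`, and `pnorm()`; the number of trials and the success probability are arbitrary values picked for illustration.

```r
# Binomial: number of "successes" in 10 independent trials, each with
# probability 0.2 (both values are arbitrary, chosen only for illustration).
p_exactly_2 <- dbinom(2, size = 10, prob = 0.2)   # P(X = 2)
p_at_most_2 <- pbinom(2, size = 10, prob = 0.2)   # P(X <= 2)

# Normal: proportion of a standard normal distribution that lies
# within 1.96 standard deviations of the mean.
p_within_196 <- pnorm(1.96) - pnorm(-1.96)

p_exactly_2
p_at_most_2
p_within_196
```

The last value is very close to 0.95, which is where the multiplier 1.96 in many 95% confidence intervals comes from.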
Shape, center, and spread
You should be able to describe a distribution in terms of:
- typical values
- variability
- skewness or symmetry
These ideas will reappear when assessing model assumptions.
Purpose of visualization
Plots help us:
- identify patterns
- detect outliers
- compare groups
- assess relationships
In this course, visualization is not optional — every plot must be interpreted in writing.
Matching plots to variable types
You should already know:
- categorical → bar charts
- numeric → histograms, boxplots
- two numeric variables → scatterplots
- numeric vs categorical → boxplots
We will implement these using ggplot2, but the statistical reasoning comes first.
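The pairings above might be sketched in ggplot2 as follows. This assumes the ggplot2 package is installed and uses the built-in `mtcars` data purely as a convenient stand-in for a course dataset.

```r
library(ggplot2)

# mtcars ships with R; cyl is stored as a number, so convert it to a factor
# when it should be treated as categorical.
mt <- transform(mtcars, cyl = factor(cyl))

# Categorical -> bar chart
p_bar <- ggplot(mt, aes(x = cyl)) + geom_bar()

# Numeric -> histogram
p_hist <- ggplot(mt, aes(x = mpg)) + geom_histogram(bins = 10)

# Two numeric variables -> scatterplot
p_scatter <- ggplot(mt, aes(x = wt, y = mpg)) + geom_point()

# Numeric vs categorical -> side-by-side boxplots
p_box <- ggplot(mt, aes(x = cyl, y = mpg)) + geom_boxplot()

p_box  # printing a plot object draws it
```

Note that the choice of geom follows from the variable types: the same `mpg` column is drawn as a histogram on its own and as boxplots once it is split by a categorical variable.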
Problem sets
Complete problems from the OpenIntro Book 3.1, 3.4, 3.8, 3.18, 3.20, 3.21, 3.43, 3.57.
Chapter 4 refresher: Foundations of inference
Why inference is needed
Chapter 4 explains why we cannot rely on a single sample to know a population value exactly.
Key ideas include:
- sampling variability
- uncertainty in estimates
Inference provides tools to quantify this uncertainty.
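Sampling variability is easy to see by simulation. The sketch below builds a hypothetical population (the mean of 120, SD of 15, sample size of 50, and seed are all arbitrary choices) and draws many samples from it.

```r
set.seed(42)  # arbitrary seed for reproducibility

# A hypothetical "population" of 100,000 values (mean 120, SD 15).
population <- rnorm(100000, mean = 120, sd = 15)

# Draw 1,000 samples of size 50 and record each sample mean.
sample_means <- replicate(1000, mean(sample(population, size = 50)))

# The sample means vary around the population mean...
mean(population)
range(sample_means)

# ...and their spread is close to SD / sqrt(n), the standard error.
sd(sample_means)
sd(population) / sqrt(50)
```

No single sample mean equals the population mean exactly, but the amount by which sample means typically miss is predictable, and that is what inference quantifies.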
Confidence intervals (conceptual)
A confidence interval gives a range of plausible values for a population parameter.
You should be able to:
- interpret a confidence interval in words
- explain what it does not mean
This skill is critical for reporting results in manuscripts.
Hypothesis testing (logic only)
You should recognize:
- the role of a null hypothesis
- what a small p-value suggests conceptually
You are not expected to memorize testing procedures at this stage.
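The logic of a p-value can be illustrated without any formulas. In the hypothetical scenario below (the counts, null value, and seed are invented for illustration), we simulate what the data would look like if the null hypothesis were true and ask how often a result at least as extreme as the observed one appears.

```r
# Hypothetical result: 16 of 20 patients improve; the null hypothesis says
# the chance of improvement is 0.5.
observed <- 16
n_trials <- 20

set.seed(7)  # arbitrary seed for reproducibility

# Simulate the count of improvements if the null hypothesis were true.
null_counts <- rbinom(10000, size = n_trials, prob = 0.5)

# Two-sided p-value by simulation: how often is a null result at least as
# far from the expected count (10) as the observed one?
p_sim <- mean(abs(null_counts - n_trials * 0.5) >= abs(observed - n_trials * 0.5))

# Exact version of the same idea, using base R's binom.test().
p_exact <- binom.test(observed, n_trials, p = 0.5)$p.value

p_sim
p_exact
```

A small p-value here says that a result like 16 out of 20 would be unusual if the null hypothesis were true; it does not give the probability that the null hypothesis is true.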
Why probability matters in public health
Probability underlies:
- sampling variability
- uncertainty in estimates
- confidence intervals
- statistical inference
Even when we do not calculate probabilities directly, we rely on probabilistic reasoning.
Sampling vs population
Key distinction:
- Population: the group we want to understand
- Sample: the group we observe
Most public health data are samples, even large ones.
This distinction motivates:
- uncertainty estimates
- cautious interpretation
Problem sets
Complete problems from the OpenIntro Book 4.1, 4.4, 4.7, 4.12, 4.16, 4.20, 4.24.
Chapter 5 refresher: Inference for numerical data
Estimating and comparing means
Chapter 5 applies inference ideas to numerical outcomes.
You should be comfortable:
- identifying the parameter being estimated
- explaining comparisons between groups conceptually
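As a sketch of a two-group comparison, the example below uses base R's `t.test()` with a formula. The data frame `bp`, its group labels, and its values are hypothetical.

```r
# Hypothetical data: systolic blood pressure (mmHg) in two made-up groups.
bp <- data.frame(
  group = rep(c("control", "treatment"), each = 6),
  sbp   = c(128, 135, 131, 140, 126, 133,
            121, 129, 124, 130, 119, 127)
)

# Parameter of interest: difference in population means (control minus treatment).
group_means   <- tapply(bp$sbp, bp$group, mean)
observed_diff <- group_means[["control"]] - group_means[["treatment"]]

# Welch two-sample t-test (R's default; does not assume equal variances).
result <- t.test(sbp ~ group, data = bp)

result$estimate   # the two sample means
result$conf.int   # 95% confidence interval for the difference in means
result$p.value
```

Before reading any output, it helps to state the parameter in words: here it is the difference between the mean blood pressure of the two populations the groups were drawn from.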
Statistical vs practical significance
A recurring theme is that:
- statistical significance does not equal importance
- context and magnitude matter
This distinction is central to public health interpretation.
Scope of inference
You should always consider:
- who was studied
- who the results apply to
This affects how findings are communicated responsibly.
Variability and uncertainty
Two important ideas:
- Different samples give different results
- We quantify this variability using statistical methods
This is why:
- confidence intervals exist
- p-values are used (with caution)
Confidence intervals (conceptual)
A confidence interval provides a range of plausible values for a population parameter.
In this course, you will:
- compute confidence intervals in R
- report them using inline code
- interpret them in plain language
Example interpretation:
We are reasonably confident the true population value lies within this range.
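A minimal sketch of computing a confidence interval for a mean in R is shown below, once with `t.test()` and once by hand. The `bmi` values are hypothetical.

```r
# Hypothetical BMI measurements, made up for illustration.
bmi <- c(22.1, 27.4, 24.8, 30.2, 26.5, 23.9, 28.7, 25.3, 29.1, 24.0)

# t.test() returns a 95% confidence interval for the mean by default.
ci <- t.test(bmi)$conf.int

# The same interval by hand: mean +/- t* x standard error.
n          <- length(bmi)
se         <- sd(bmi) / sqrt(n)
t_star     <- qt(0.975, df = n - 1)
ci_by_hand <- mean(bmi) + c(-1, 1) * t_star * se

ci
ci_by_hand
```

The interval is a statement about the population mean. It does not say that 95% of individuals have a BMI inside this range.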
Statistical significance vs public health relevance
You should already know that:
- statistical significance does not imply importance
- large samples can detect small differences
- context and magnitude matter
This distinction is critical in public health writing.
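The second point can be demonstrated with a simulation. In the sketch below (sample size, means, SD, and seed are arbitrary choices), two groups differ by only half a unit on a scale with a standard deviation of 15, yet the sample is large enough for the test to flag the difference.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# Two simulated groups whose true means differ by only 0.5 units (SD = 15).
n <- 200000
group_a <- rnorm(n, mean = 120.0, sd = 15)
group_b <- rnorm(n, mean = 120.5, sd = 15)

result <- t.test(group_b, group_a)

# The estimated difference is tiny relative to the spread of the data...
estimated_diff <- mean(group_b) - mean(group_a)
estimated_diff

# ...but the p-value is very small because n is so large.
result$p.value
```

Whether a difference of about half a unit matters is a subject-matter question that the p-value cannot answer, which is why magnitude and context belong in the write-up.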
Problem sets
Complete problems from the OpenIntro Book 5.1, 5.2, 5.4, 5.8, 5.17, 5.19, 5.24, 5.36, 5.41, 5.44.
Connecting OpenIntro concepts to R and Quarto
Here is how these concepts show up technically in this course:
| Concept | How it appears in this course |
|---|---|
| Variables | Column types in datasets |
| Observations | Rows in data frames |
| Summaries | mean(), sd(), summary() |
| Visualization | ggplot2 |
| Inference | regression estimates and confidence intervals |
| Interpretation | inline code + written text |
Statistics guide what we do; R helps us do it reproducibly.
What you are expected to remember
You are not expected to memorize formulas.
You are expected to:
- recognize variable types
- choose appropriate summaries
- interpret plots and model output
- distinguish association from causation
- communicate uncertainty clearly
These skills will be assessed repeatedly.
Self-check
Before moving on, ask yourself:
- Can I identify the outcome and predictor in a question?
- Do I know which summaries are appropriate for different variables?
- Can I interpret a confidence interval in words?
- Can I explain why uncertainty exists?
If any of these feel shaky, briefly revisit OpenIntro Chapters 1–5 alongside this lesson.