Lesson 6a: Biostats 1 Review

Purpose of this Lesson

In this lesson, we will review what you should have learned in Biostats 1. The topics will include:

Introduction to Data
Probability
Distribution of Random Variables
Inference

This lesson is not meant to teach these topics from scratch, but to refresh your memory. We will also learn how to carry out each of them in R/RStudio.

Chapter 1 refresher: Data and variables

Types of variables

You should already be familiar with the distinction between:

Categorical variables (e.g., sex, race/ethnicity, region)
Numeric variables (e.g., age, BMI, blood pressure)

In R, this distinction affects:

which summaries are appropriate
how plots are created
how variables enter regression models
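As a quick illustration of how variable type drives the choice of summary, here is a minimal sketch. The `patients` data frame and all of its values are made up for this example; they do not come from any course dataset.

```r
# Hypothetical mini-dataset; values are made up for illustration only.
patients <- data.frame(
  region = factor(c("North", "South", "South", "West", "North")),
  age    = c(34, 51, 47, 29, 62),
  bmi    = c(22.1, 30.4, 27.8, 24.5, 31.2)
)

# Check how R has stored each column (factor vs numeric).
str(patients)

# Categorical variable: counts and proportions are the natural summaries.
region_counts <- table(patients$region)
region_props  <- prop.table(region_counts)

# Numeric variable: mean, median, and quartiles make sense.
age_summary <- summary(patients$age)

print(region_counts)
print(round(region_props, 2))
print(age_summary)
```

Calling `mean()` on `region` would return `NA` with a warning, which is R's way of signalling that the summary does not fit the variable type.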

Observations and units of analysis

Each row in a dataset represents one observational unit.

Examples:

one person (NHANES)
one country-year (Gapminder)
one clinic visit

Before analyzing any dataset, you should always be able to answer:

What does one row represent?

Study design and causality

Chapter 1 distinguishes between:

Observational studies
Experimental studies

You should be able to explain:

whether a study allows causal conclusions
why most public health datasets support associational, not causal, claims

Problem sets

Complete problems from the OpenIntro Book 1.1, 1.3, 1.9, 1.12, 1.16, 1.26, 1.28, 1.33, 1.35, 1.36.

Chapter 2 refresher: Probability

What probability represents

Probability describes the long-run behavior of a random process.

A probability:

quantifies uncertainty
does not predict individual outcomes

In public health contexts, probabilities are often expressed as risks or proportions.
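The "long-run" idea can be sketched with a small simulation. The risk of 0.2, the number of trials, and the seed are arbitrary assumptions chosen for illustration.

```r
set.seed(2024)  # arbitrary seed so the simulation is reproducible

true_risk <- 0.2      # assumed probability of the event on each trial
n_trials  <- 10000

# Each trial is 1 (event occurs) or 0 (event does not occur).
outcomes <- rbinom(n_trials, size = 1, prob = true_risk)

# Running proportion of events after 1, 2, ..., n_trials trials.
running_prop <- cumsum(outcomes) / seq_along(outcomes)

# Early proportions bounce around; later ones settle near true_risk.
running_prop[c(10, 100, 1000, 10000)]
```

No single outcome is predictable, but the proportion over many trials stabilizes, which is what the probability describes.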

Conditional probability and independence

You should be able to reason about:

probabilities given additional information
whether events are independent or related

These ideas support later discussions of risk factors and outcomes.
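One way to practice this reasoning is with a two-by-two table of counts. The counts below are hypothetical and chosen only to keep the arithmetic simple.

```r
# Hypothetical counts, made up for illustration (not real data).
counts <- matrix(
  c(30,  70,    # smokers:     30 with disease,  70 without
    20, 180),   # non-smokers: 20 with disease, 180 without
  nrow = 2, byrow = TRUE,
  dimnames = list(smoking = c("smoker", "non_smoker"),
                  disease = c("yes", "no"))
)

n_total <- sum(counts)

# Marginal probability of disease, ignoring smoking status.
p_disease <- sum(counts[, "yes"]) / n_total

# Conditional probabilities of disease given smoking status.
p_disease_given_smoker     <- counts["smoker", "yes"] / sum(counts["smoker", ])
p_disease_given_non_smoker <- counts["non_smoker", "yes"] / sum(counts["non_smoker", ])

# Under independence, P(disease | smoker) would equal P(disease).
independent <- isTRUE(all.equal(p_disease_given_smoker, p_disease))

c(marginal = p_disease,
  given_smoker = p_disease_given_smoker,
  given_non_smoker = p_disease_given_non_smoker)
```

In these made-up counts the conditional probabilities differ from the marginal one, so the two events are related rather than independent.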

Common misconceptions

A key goal of Chapter 2 is recognizing incorrect reasoning about chance.

You should be able to explain why:

past outcomes do not “force” future outcomes
randomness does not imply lack of structure

Numerical summaries

Key numerical summaries you should already recognize:

mean
median
standard deviation
range
interquartile range

In this course, you will:

compute these in R
report them using inline code
interpret them in writing

Example reminder (conceptual):

The mean describes the center of the data, while the standard deviation describes how far values typically fall from the mean.
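The summaries listed above map directly onto base R functions. The sketch below uses a hypothetical vector of systolic blood pressure readings; the values are invented for illustration.

```r
# Hypothetical systolic blood pressure readings (mmHg), made up for illustration.
sbp <- c(118, 122, 125, 130, 135, 141, 160)

sbp_mean   <- mean(sbp)     # center: arithmetic average
sbp_median <- median(sbp)   # center: middle value, less sensitive to outliers
sbp_sd     <- sd(sbp)       # spread: typical distance from the mean
sbp_range  <- range(sbp)    # smallest and largest values
sbp_iqr    <- IQR(sbp)      # spread: width of the middle 50% of values

# In a Quarto document, the same objects can be reported inline, e.g.
#   The mean SBP was `r round(sbp_mean, 1)` mmHg (SD `r round(sbp_sd, 1)`).
report <- sprintf("Mean SBP: %.1f mmHg (SD %.1f); median %.1f (IQR %.1f)",
                  sbp_mean, sbp_sd, sbp_median, sbp_iqr)
print(report)
```

Storing each summary in a named object first is a design choice: it keeps the written sentence tied to the computed value, so the text updates if the data change.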

Shape, center, and spread

When describing a numeric variable, you should think about:

center (mean or median)
spread (SD or IQR)
shape (symmetric, skewed, multimodal)

These ideas will come up again when we assess model assumptions.

Problem sets

Complete problems from the OpenIntro Book 2.1, 2.5, 2.9, 2.11, 2.16, and 2.17.

Chapter 3 refresher: Distributions of random variables

Random variables

A random variable assigns a numerical value to the outcome of a random process.

You should be able to:

define the random variable in context
state whether it is discrete or continuous

This skill is essential for understanding model outcomes later in the course.

Probability distributions

A probability distribution describes:

which values are more or less likely
how variability is structured

The book introduces common distributions conceptually (e.g., binomial, normal), without emphasizing computation.
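Although the book stays conceptual, R's built-in distribution functions make these ideas concrete. The parameters below (10 trials with probability 0.5; mean 100 and SD 15) are assumptions chosen for illustration.

```r
# Binomial: number of "successes" in a fixed number of independent trials.
# Assumed example: 10 people, each with a 0.5 chance of the outcome.
p_exactly_2 <- dbinom(2, size = 10, prob = 0.5)   # P(X = 2)
p_at_most_2 <- pbinom(2, size = 10, prob = 0.5)   # P(X <= 2)

# Normal: continuous and symmetric, described by a mean and a standard deviation.
# Assumed example: a measurement with mean 100 and SD 15.
p_below_115  <- pnorm(115, mean = 100, sd = 15)   # P(X < 115), one SD above the mean
p_within_1sd <- pnorm(115, mean = 100, sd = 15) -
                pnorm(85,  mean = 100, sd = 15)   # P(85 < X < 115)
cutoff_95    <- qnorm(0.95, mean = 100, sd = 15)  # value with 95% of the distribution below it

round(c(p_exactly_2, p_at_most_2, p_below_115, p_within_1sd, cutoff_95), 4)
```

The prefixes follow a consistent pattern in R: `d` for the probability (or density) at a value, `p` for cumulative probability, and `q` for quantiles.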

Shape, center, and spread

You should be able to describe a distribution in terms of:

typical values
variability
skewness or symmetry

These ideas will reappear when assessing model assumptions.

Purpose of visualization

Plots help us:

identify patterns
detect outliers
compare groups
assess relationships

In this course, visualization is not optional — every plot must be interpreted in writing.

Matching plots to variable types

You should already know:

categorical → bar charts
numeric → histograms, boxplots
two numeric variables → scatterplots
numeric vs categorical → boxplots

We will implement these using ggplot2, but the statistical reasoning comes first.
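The pairings above might look like this in ggplot2. This is a sketch using the built-in `mtcars` dataset as a stand-in, not one of the course datasets; the object names are arbitrary.

```r
library(ggplot2)

# mtcars is a built-in R dataset, used here only as a stand-in.
cars_df <- mtcars
cars_df$cyl <- factor(cars_df$cyl)  # treat number of cylinders as categorical

# Categorical variable -> bar chart of counts.
p_bar <- ggplot(cars_df, aes(x = cyl)) + geom_bar()

# Numeric variable -> histogram.
p_hist <- ggplot(cars_df, aes(x = mpg)) + geom_histogram(bins = 10)

# Two numeric variables -> scatterplot.
p_scatter <- ggplot(cars_df, aes(x = wt, y = mpg)) + geom_point()

# Numeric vs categorical -> side-by-side boxplots.
p_box <- ggplot(cars_df, aes(x = cyl, y = mpg)) + geom_boxplot()

# Printing a plot object draws it.
print(p_box)
```

Converting `cyl` to a factor first matters: left as a number, ggplot2 would treat it as continuous and the boxplot would not split by group.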

Problem sets

Complete problems from the OpenIntro Book 3.1, 3.4, 3.8, 3.18, 3.20, 3.21, 3.43, 3.57.

Chapter 4 refresher: Foundations of inference

Why inference is needed

Chapter 4 explains why we cannot rely on a single sample to know a population value exactly.

Key ideas include:

sampling variability
uncertainty in estimates

Inference provides tools to quantify this uncertainty.
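Sampling variability is easy to see by simulation. In this sketch the "population" is simulated, and its mean, SD, the sample size, and the seed are all arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Assumed "population": 100,000 values with mean about 120 and SD about 15.
population <- rnorm(100000, mean = 120, sd = 15)

# Draw 1,000 independent samples of size 50 and record each sample mean.
sample_size  <- 50
sample_means <- replicate(1000, mean(sample(population, sample_size)))

# The sample means vary from sample to sample...
range(sample_means)

# ...and their spread is close to the theoretical standard error, sigma / sqrt(n).
sd(sample_means)
sd(population) / sqrt(sample_size)
```

In practice we only ever see one of these samples, which is why a single estimate needs an accompanying measure of uncertainty.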

Confidence intervals (conceptual)

A confidence interval gives a range of plausible values for a population parameter.

You should be able to:

interpret a confidence interval in words
explain what it does not mean

This skill is critical for reporting results in manuscripts.
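A 95% confidence interval for a mean can be computed by hand or with `t.test()`. The BMI values below are hypothetical, and the t-based interval shown is one common approach rather than the only one.

```r
# Hypothetical BMI measurements, made up for illustration.
bmi <- c(22.1, 30.4, 27.8, 24.5, 31.2, 26.3, 23.9, 28.7, 25.0, 29.1)

n      <- length(bmi)
xbar   <- mean(bmi)
se     <- sd(bmi) / sqrt(n)        # standard error of the mean
t_star <- qt(0.975, df = n - 1)    # critical value for 95% confidence

# Interval: estimate plus or minus critical value times standard error.
ci_manual <- c(xbar - t_star * se, xbar + t_star * se)

# The built-in t.test() reports the same interval.
ci_ttest <- t.test(bmi, conf.level = 0.95)$conf.int

round(ci_manual, 2)
round(as.numeric(ci_ttest), 2)
```

A reasonable reading is "we are 95% confident that the population mean BMI lies in this interval", where 95% describes how often the procedure captures the true mean across repeated samples. It does not mean that 95% of individuals have a BMI inside the interval.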

Hypothesis testing (logic only)

You should recognize:

the role of a null hypothesis
what a small p-value suggests conceptually

You are not expected to memorize testing procedures at this stage.

Why probability matters in public health

Probability underlies:

sampling variability
uncertainty in estimates
confidence intervals
statistical inference

Even when we do not calculate probabilities directly, we rely on probabilistic reasoning.

Sampling vs population

Key distinction:

Population: the group we want to understand
Sample: the group we observe

Most public health data are samples, even large ones.

This distinction motivates:

uncertainty estimates
cautious interpretation

Problem sets

Complete problems from the OpenIntro Book 4.1, 4.4, 4.7, 4.12, 4.16, 4.20, 4.24.

Chapter 5 refresher: Inference for numerical data

Estimating and comparing means

Chapter 5 applies inference ideas to numerical outcomes.

You should be comfortable:

identifying the parameter being estimated
explaining comparisons between groups conceptually
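For a comparison of two groups, the parameter is the difference between two population means. The sketch below uses the built-in `ToothGrowth` dataset purely as a stand-in and a Welch two-sample t-test as one possible method.

```r
# ToothGrowth is a built-in R dataset (tooth length in guinea pigs by
# vitamin C supplement type); it is used here only as a stand-in.
data(ToothGrowth)

# Group means: the sample estimates of the two population means.
group_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)

# Welch two-sample t-test comparing the two population means.
fit <- t.test(len ~ supp, data = ToothGrowth)

diff_estimate <- unname(group_means["OJ"] - group_means["VC"])  # point estimate
diff_ci       <- fit$conf.int   # 95% CI for mean(OJ) - mean(VC)
p_value       <- fit$p.value

group_means
diff_estimate
diff_ci
p_value
```

Reporting the estimated difference and its interval together, rather than the p-value alone, keeps the focus on the size of the comparison.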

Statistical vs practical significance

A recurring theme is that:

statistical significance does not equal importance
context and magnitude matter

This distinction is central to public health interpretation.

Scope of inference

You should always consider:

who was studied
who the results apply to

This affects how findings are communicated responsibly.

Variability and uncertainty

Two important ideas:

Different samples give different results
We quantify this variability using statistical methods

This is why:

confidence intervals exist
p-values are used (with caution)

Confidence intervals (conceptual)

A confidence interval provides a range of plausible values for a population parameter.

In this course, you will:

compute confidence intervals in R
report them using inline code
interpret them in plain language

Example interpretation:

For a 95% confidence interval: we are 95% confident that the true population value lies within this range.

Statistical significance vs public health relevance

You should already know that:

statistical significance does not imply importance
large samples can detect small differences
context and magnitude matter

This distinction is critical in public health writing.
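The point about large samples can be sketched by simulation. The group means, SD, sample size, and seed below are assumptions chosen so that the true difference is very small.

```r
set.seed(1)  # arbitrary seed for reproducibility

n_per_group <- 100000

# Assumed populations: true means differ by only 0.5 units, with SD 15.
group_a <- rnorm(n_per_group, mean = 120.0, sd = 15)
group_b <- rnorm(n_per_group, mean = 120.5, sd = 15)

fit <- t.test(group_b, group_a)

observed_diff <- mean(group_b) - mean(group_a)
p_value       <- fit$p.value

observed_diff   # small in magnitude
p_value         # far below 0.05 because n is very large
```

Whether a difference of about half a unit matters is a subject-matter question that the p-value cannot answer.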

Problem sets

Complete problems from the OpenIntro Book 5.1, 5.2, 5.4, 5.8, 5.17, 5.19, 5.24, 5.36, 5.41, 5.44.

Connecting OpenIntro concepts to R and Quarto

Here is how these concepts show up technically in this course:

Concept	How it appears in this course
Variables	Column types in datasets
Observations	Rows in data frames
Summaries	`mean()`, `sd()`, `summary()`
Visualization	`ggplot2`
Inference	regression estimates and confidence intervals
Interpretation	inline code + written text

Statistics guide what we do; R helps us do it reproducibly.
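To tie the rows of the table together, here is a minimal end-to-end sketch. It uses the built-in `mtcars` dataset as a stand-in, and the simple linear model is an illustrative choice rather than a model from the course.

```r
# mtcars is a built-in R dataset, used here only as a stand-in.
str(mtcars[, c("mpg", "wt")])   # variables: column types
n_obs <- nrow(mtcars)           # observations: one row per car

summary(mtcars$mpg)             # numerical summaries

# Inference: linear regression of fuel efficiency on weight.
model <- lm(mpg ~ wt, data = mtcars)

estimates <- coef(model)                    # point estimates
intervals <- confint(model, level = 0.95)   # 95% confidence intervals

slope    <- estimates[["wt"]]
slope_ci <- intervals["wt", ]

round(slope, 2)
round(slope_ci, 2)
```

The objects `slope` and `slope_ci` are the kind of values that would then be reported with inline code and interpreted in a written sentence.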

What you are expected to remember

You are not expected to memorize formulas.

You are expected to:

recognize variable types
choose appropriate summaries
interpret plots and model output
distinguish association from causation
communicate uncertainty clearly

These skills will be assessed repeatedly.

Self-check

Before moving on, ask yourself:

Can I identify the outcome and predictor in a question?
Do I know which summaries are appropriate for different variables?
Can I interpret a confidence interval in words?
Can I explain why uncertainty exists?

If any of these feel shaky, briefly revisit OpenIntro Chapters 1–5 alongside this lesson.