Project 2: Cleaning & Preparing Health Data for Analysis (NHANES)

Purpose of Project 2

This project moves you from a global health context into real biomedical surveillance data. You will work with data from NHANES (the National Health and Nutrition Examination Survey), which collects nationally representative health, exam, and behavioral measures from U.S. adults and children.

The goal is to practice:

working with real-world health data
cleaning and restructuring data for analysis
managing missing values
creating new variables
documenting analytic decisions
writing clear and reproducible Quarto reports

This project prepares you for regression modeling and inference in Project 3 and your final manuscript.

Dataset

We will use the curated NHANES dataset included in the NHANES R package.

This dataset includes variables such as:

demographics (age, gender, race/ethnicity, education, income)
health status
smoking behavior
BMI
blood pressure
activity levels
depression screening
medical conditions

These are real data, complete with missing values, limitations, and measurement constraints.

Learning Objectives

By completing this project, you will be able to:

clean biomedical data using tidyverse tools
identify and manage missing data
construct analytic variables
describe population patterns in health outcomes
explain cleaning choices transparently
submit a fully reproducible Quarto report

These are core skills for public health data science.

Project Focus

You will choose one health outcome, such as:

Hypertension
BMI (as continuous or categorized)
Depression symptoms
Diabetes status
Self-rated health
Smoking status

and examine how it varies across predictors such as:

age
sex
race/ethnicity
income-to-poverty ratio
education
health behaviors

You may propose an alternative outcome with approval.

Project Tasks

Your Quarto report must include the following sections:

1. Introduction (≈ 2–3 paragraphs)

Briefly explain:

What NHANES is and why it matters in public health
Your chosen outcome and why it is important
Key predictors you will examine
Your broad guiding question

Use clear, MPH-appropriate language.

2. Data Preparation

Using tidyverse:

load the NHANES dataset
select your outcome and predictors
recode variables as needed (e.g., factors)
create at least one new variable, such as:
- age category
- BMI category
- income group
- binary hypertension status
handle missing data (for example, by removing incomplete cases) and document what you did
provide a data dictionary table explaining variables

All choices must be transparent and explained in writing.
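
The steps above can be sketched as follows. This is a minimal illustration, not a required approach. It uses a small made-up data frame (`nhanes_toy`, a hypothetical name) that borrows column names from the NHANES package (`Age`, `Gender`, `BPSysAve`, `BPDiaAve`) so that it runs on its own. The 140/90 mmHg hypertension cutoff and the age bands are assumptions chosen for the example; justify whatever thresholds you use in your own report.

```r
library(dplyr)
library(tibble)

# Toy stand-in with NHANES-style column names (values are made up).
# In the project, load the real data instead, e.g.:
#   data(NHANES, package = "NHANES")
nhanes_toy <- tibble(
  ID       = 1:6,
  Age      = c(25, 47, 63, 34, 71, 52),
  Gender   = c("female", "male", "male", "female", "female", "male"),
  BPSysAve = c(112, 145, 138, NA, 151, 128),
  BPDiaAve = c(70, 88, 92, NA, 79, 81)
)

bp_clean <- nhanes_toy %>%
  # Keep only the outcome inputs and predictors needed for the analysis
  select(ID, Age, Gender, BPSysAve, BPDiaAve) %>%
  mutate(
    # Recode a character variable as a factor with explicit levels
    Gender = factor(Gender, levels = c("female", "male")),
    # New variable 1: age category (bands are an assumption for this sketch)
    age_cat = cut(
      Age,
      breaks = c(18, 40, 60, Inf),
      right  = FALSE,
      labels = c("18-39", "40-59", "60+")
    ),
    # New variable 2: binary hypertension status (140/90 cutoff is an assumption);
    # stays NA when either blood pressure reading is missing
    hypertension = case_when(
      is.na(BPSysAve) | is.na(BPDiaAve) ~ NA,
      BPSysAve >= 140 | BPDiaAve >= 90  ~ TRUE,
      TRUE                              ~ FALSE
    )
  )

# A simple data dictionary, built by hand so every variable is explained
data_dictionary <- tribble(
  ~variable,      ~description,                               ~type,
  "Age",          "Age in years",                             "numeric",
  "Gender",       "Recorded gender (female, male)",           "factor",
  "BPSysAve",     "Average systolic blood pressure (mmHg)",   "numeric",
  "BPDiaAve",     "Average diastolic blood pressure (mmHg)",  "numeric",
  "age_cat",      "Derived age category",                     "factor",
  "hypertension", "Derived: systolic >= 140 or diastolic >= 90", "logical"
)
```

Keeping `hypertension` as `NA` when a reading is missing, rather than coding it `FALSE`, keeps the missingness visible for the Missing Data Review section.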

3. Descriptive Statistics

Produce:

summary statistics (means, proportions, counts)
stratified summaries where appropriate
at least one publication-style table

Use inline R code to report values in your text whenever possible.
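
As one possible way to produce a stratified summary and a formatted table, the sketch below groups a made-up data frame (`bp_toy`, a hypothetical name with illustrative values) by age category and passes the result to `knitr::kable()`. The column names and labels are assumptions for the example, and other table packages would work equally well.

```r
library(dplyr)
library(tibble)

# Made-up analytic data; in the project this would be your cleaned NHANES subset
bp_toy <- tibble(
  age_cat      = c("18-39", "18-39", "40-59", "40-59", "60+", "60+"),
  BMI          = c(22.5, 27.5, 30.0, 26.0, 31.0, NA),
  hypertension = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Stratified summary: count, mean BMI, and percent with hypertension per age group
summary_by_age <- bp_toy %>%
  group_by(age_cat) %>%
  summarise(
    n        = n(),
    mean_bmi = mean(BMI, na.rm = TRUE),
    pct_htn  = 100 * mean(hypertension, na.rm = TRUE),
    .groups  = "drop"
  )

# Publication-style table with readable column headers and a caption
summary_table <- knitr::kable(
  summary_by_age,
  digits    = 1,
  col.names = c("Age group", "n", "Mean BMI", "Hypertension (%)"),
  caption   = "Mean BMI and hypertension prevalence by age group (toy data)"
)
summary_table
```

Note that `na.rm = TRUE` silently drops missing values from each mean, so the `n` column can overstate how many observations contributed; report the non-missing count as well if the two differ.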

4. Visualizations

Create at least two figures, such as:

distributions of your outcome
outcome by age group
outcome by race/ethnicity
outcome by income category
outcome by smoking status

Each figure must include:

clear title
labeled axes
readable formatting
a short interpretation paragraph
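
A figure meeting the requirements above might be built along these lines. This is only a sketch with made-up values in a hypothetical data frame (`fig_toy`); the choice of a boxplot, the labels, and the theme are assumptions, and your own figures should fit your outcome.

```r
library(ggplot2)

# Made-up data; in the project this would be your cleaned NHANES subset
fig_toy <- data.frame(
  age_cat  = factor(
    c("18-39", "18-39", "18-39", "40-59", "40-59", "40-59", "60+", "60+", "60+"),
    levels = c("18-39", "40-59", "60+")
  ),
  BPSysAve = c(108, 115, 121, 119, 128, 137, 131, 142, 155)
)

# Outcome by age group, with a clear title and labeled axes
p <- ggplot(fig_toy, aes(x = age_cat, y = BPSysAve)) +
  geom_boxplot() +
  labs(
    title = "Systolic blood pressure by age group (toy data)",
    x     = "Age group (years)",
    y     = "Average systolic blood pressure (mmHg)"
  ) +
  theme_minimal(base_size = 12)

p
```

Setting the factor levels explicitly controls the order of the groups on the x-axis, which matters for ordered categories such as age or income bands.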

5. Missing Data Review

Briefly describe:

where missingness appears
how many cases were removed
why that matters for interpretation
potential bias concerns

This section introduces ethical and methodological thinking.
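
The counts asked for above can be computed rather than tallied by hand. The sketch below is one way to do so, using a made-up data frame (`miss_toy`, a hypothetical name) with NHANES-style column names; it assumes complete-case removal, which is only one of several ways to handle missing data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up data with some missing values
miss_toy <- tibble(
  Age      = c(22, 35, 41, 50, 58, 63, 70, 29),
  BMI      = c(21.4, 26.0, NA, 29.3, 31.1, 27.8, 24.9, 23.5),
  Poverty  = c(1.2, 3.5, NA, 2.1, 4.8, NA, 0.9, 2.7),
  BPSysAve = c(110, 118, 125, 132, 140, 138, NA, 114)
)

# Where missingness appears: number and percent missing per variable
missing_by_var <- miss_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(pct_missing = 100 * n_missing / nrow(miss_toy))

# How many cases are removed under complete-case analysis
complete_cases <- miss_toy %>% drop_na()
n_removed <- nrow(miss_toy) - nrow(complete_cases)

# Potential bias: compare retained and removed cases on an observed variable
retained_vs_removed <- miss_toy %>%
  mutate(complete_case = complete.cases(miss_toy)) %>%
  group_by(complete_case) %>%
  summarise(n = n(), mean_age = mean(Age), .groups = "drop")
```

If the removed cases differ noticeably from the retained ones on variables such as age or income, that is worth stating as a potential source of bias.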

6. Reflection & Interpretation (≈ 3–4 paragraphs)

Discuss:

Public-health meaning of the patterns you observe
Possible explanations for disparities
Structural or contextual factors
What you would explore next with modeling

This section should begin to read like the Results and Discussion of an early manuscript draft.

Formatting Requirements

Your submission must:

be written entirely in Quarto
hide code chunks in the rendered report
render (knit) without errors
include inline code for reported values
use tidy, readable R code
apply organized section headings
include citations where relevant (minimum of 8)
use inclusive, professional language

Target length: 1,200–1,500 words (excluding code)
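
For the hidden-code and inline-code requirements listed above, one common pattern is a setup chunk near the top of the .qmd file that sets knitr chunk defaults, followed by named values that the prose can reference. The sketch below assumes the knitr engine and uses made-up BMI values; the object names are hypothetical.

```r
# Setup chunk: hide code, messages, and warnings in the rendered report
knitr::opts_chunk$set(
  echo    = FALSE,
  message = FALSE,
  warning = FALSE
)

# Compute values once, then report them inline instead of typing numbers by hand
bmi      <- c(22.5, 27.5, 30.0, 26.0, 31.0)  # made-up values for illustration
n_obs    <- length(bmi)
mean_bmi <- round(mean(bmi), 1)

# In the prose of the .qmd file, the values can then be reported inline, e.g.:
#   The mean BMI among `r n_obs` participants was `r mean_bmi` kg/m^2.
```

The same effect can be achieved with `execute: echo: false` in the Quarto YAML header. Either way, reporting values inline means the text updates automatically if the cleaning steps change.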

Submission Requirements

Submit:

your rendered report (Word document)
your .qmd source file
your 3MT slide