Project 2: Cleaning & Preparing Health Data for Analysis (NHANES)

Purpose of Project 2

This project moves you from global health context into real biomedical surveillance data. You will work with the NHANES dataset, which contains nationally representative health, exam, and behavioral measures from U.S. adults and children.

The goal is to practice:

  • working with real-world health data
  • cleaning and restructuring data for analysis
  • managing missing values
  • creating new variables
  • documenting analytic decisions
  • writing clear and reproducible Quarto reports

This project prepares you for regression modeling and inference in Project 3 and your final manuscript.

Dataset

We will use the curated NHANES dataset included in the NHANES R package.

This dataset includes variables such as:

  • demographics (age, gender, race/ethnicity, education, income)
  • health status
  • smoking behavior
  • BMI
  • blood pressure
  • activity levels
  • depression screening
  • medical conditions

This is real data, with the missingness, limitations, and measurement constraints that come with it.
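As a starting point, the dataset can be loaded and inspected as sketched below. This assumes the NHANES and dplyr packages are already installed; the three variables checked at the end are only a sample of the columns and are not a required set.

```r
# install.packages("NHANES")  # run once if the package is not yet installed
library(NHANES)
library(dplyr)

# Load the curated data frame that ships with the package
data(NHANES)

# Number of rows and columns
dim(NHANES)

# Compact overview of every variable and its type
glimpse(NHANES)

# Confirm that a few commonly used variables are present
c("Age", "Gender", "BMI") %in% names(NHANES)
```

Running `?NHANES` in the console opens the package's own description of each variable, which is a useful source when you write your data dictionary.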

Learning Objectives

By completing this project, you will be able to:

  • clean biomedical data using tidyverse tools
  • identify and manage missing data
  • construct analytic variables
  • describe population patterns in health outcomes
  • explain cleaning choices transparently
  • submit a fully reproducible Quarto report

These are core skills for public health data science.

Project Focus

You will choose one health outcome, such as:

  • Hypertension
  • BMI (as continuous or categorized)
  • Depression symptoms
  • Diabetes status
  • Self-rated health
  • Smoking status

and examine how it varies across predictors such as:

  • age
  • sex
  • race/ethnicity
  • income-to-poverty ratio
  • education
  • health behaviors

You may propose an alternative outcome with approval.

Project Tasks

Your Quarto report must include the following sections:

1. Introduction (≈ 2–3 paragraphs)

Briefly explain:

  • What NHANES is and why it matters in public health
  • Your chosen outcome and why it is important
  • Key predictors you will examine
  • Your broad guiding question

Use clear, MPH-appropriate language.

2. Data Preparation

Using tidyverse:

  • load the NHANES dataset

  • select your outcome and predictors

  • recode variables as needed (e.g., factors)

  • create at least one new variable, such as:

    • age category
    • BMI category
    • income group
    • binary hypertension status

  • remove or document missing data

  • provide a data dictionary table explaining variables

All choices must be transparent and explained in writing.
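The steps above might look like the following sketch. Everything specific here is an illustrative assumption rather than a required choice: the restriction to adults aged 20 and older, the selected variables, the age cut points, and the 140/90 mmHg cutoff used to flag hypertension (which relies on measured blood pressure only and ignores diagnosis or medication use). The object names `nhanes_adults` and `data_dictionary` are hypothetical.

```r
library(tidyverse)
library(NHANES)

nhanes_adults <- NHANES |>
  # Assumption: restrict the analysis to adults aged 20+
  filter(Age >= 20) |>
  # Keep only the outcome and predictors of interest
  select(ID, Age, Gender, Race1, Education, Poverty,
         BMI, BPSysAve, BPDiaAve) |>
  mutate(
    # New variable 1: age category (cut points are an assumption)
    age_group = cut(Age,
                    breaks = c(20, 40, 60, Inf),
                    right = FALSE,
                    labels = c("20-39", "40-59", "60+")),
    # New variable 2: binary hypertension status from measured BP
    # (assumed cutoff of 140/90 mmHg; stays NA when it cannot be determined)
    hypertension = if_else(BPSysAve >= 140 | BPDiaAve >= 90, "Yes", "No"),
    hypertension = factor(hypertension, levels = c("No", "Yes"))
  )

# A small data dictionary, written by hand and rendered as a table
data_dictionary <- tribble(
  ~variable,      ~description,                                  ~type,
  "Age",          "Age in years at screening",                   "numeric",
  "Gender",       "Sex of participant as recorded in NHANES",    "factor",
  "Race1",        "Reported race/ethnicity",                     "factor",
  "Education",    "Highest education level",                     "factor",
  "Poverty",      "Ratio of family income to poverty guideline", "numeric",
  "BMI",          "Body mass index (kg/m^2)",                    "numeric",
  "age_group",    "Derived: age in three categories",            "factor",
  "hypertension", "Derived: BPSysAve >= 140 or BPDiaAve >= 90",  "factor"
)

knitr::kable(data_dictionary, caption = "Data dictionary")
```

Whatever cutoffs you choose, state them in your written text and cite a source for them, because readers cannot evaluate a derived variable whose definition is hidden in code.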

3. Descriptive Statistics

Produce:

  • summary statistics (means, proportions, counts)
  • stratified summaries where appropriate
  • at least one publication-style table

Whenever possible, report values in your text with inline R code rather than typing numbers by hand.
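One possible way to produce these summaries is sketched below, using BMI as a continuous outcome and diabetes status as a categorical one. The outcome choice, the adult restriction, and the object names (`adults`, `bmi_by_gender`, `diabetes_props`) are assumptions made for illustration.

```r
library(tidyverse)
library(NHANES)

# Assumption: adults aged 20+ only
adults <- NHANES |>
  filter(Age >= 20)

# Stratified summary of a continuous outcome
bmi_by_gender <- adults |>
  group_by(Gender) |>
  summarise(
    n             = n(),
    n_missing_bmi = sum(is.na(BMI)),
    mean_bmi      = mean(BMI, na.rm = TRUE),
    sd_bmi        = sd(BMI, na.rm = TRUE),
    .groups = "drop"
  )

# Counts and proportions for a categorical outcome (non-missing only)
diabetes_props <- adults |>
  filter(!is.na(Diabetes)) |>
  count(Diabetes) |>
  mutate(prop = n / sum(n))

# Format one summary as a report-ready table
knitr::kable(
  bmi_by_gender,
  digits = 1,
  col.names = c("Gender", "N", "Missing BMI", "Mean BMI", "SD"),
  caption = "BMI among adults aged 20 and older, by gender"
)
```

In the body of the report, a value can then be pulled in with inline code, for example `r round(mean(adults$BMI, na.rm = TRUE), 1)`, so the number updates automatically if the cleaning steps change.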

4. Visualizations

Create at least two figures, such as:

  • distributions of your outcome
  • outcome by age group
  • outcome by race/ethnicity
  • outcome by income category
  • outcome by smoking status

Each figure must include:

  • a clear title
  • labeled axes
  • readable formatting
  • a short interpretation paragraph
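A minimal sketch of two such figures with ggplot2 follows. BMI is used as an example outcome, and the titles, bin width, and theme are assumptions you should adapt to your own variables; the interpretation paragraph still has to be written by you beneath each figure.

```r
library(tidyverse)
library(NHANES)

# Assumption: adults aged 20+ with a recorded BMI
adults <- NHANES |>
  filter(Age >= 20, !is.na(BMI))

# Figure 1: distribution of the outcome
p_dist <- ggplot(adults, aes(x = BMI)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(
    title = "Distribution of BMI among adults aged 20 and older",
    x = "BMI (kg/m^2)",
    y = "Number of participants"
  ) +
  theme_minimal(base_size = 12)

# Figure 2: outcome by race/ethnicity
p_race <- ggplot(adults, aes(x = Race1, y = BMI)) +
  geom_boxplot() +
  labs(
    title = "BMI by race/ethnicity",
    x = "Race/ethnicity",
    y = "BMI (kg/m^2)"
  ) +
  theme_minimal(base_size = 12)

p_dist
p_race
```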

5. Missing Data Review

Briefly describe:

  • where missingness appears
  • how many cases were removed
  • why that matters for interpretation
  • potential bias concerns

This introduces ethical & methodological thinking.
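The counts needed for this review can be generated along the following lines. The variable set and the use of complete-case removal via `drop_na()` are assumptions for illustration; if you handle missing data differently, report that approach instead.

```r
library(tidyverse)
library(NHANES)

# Assumption: adults aged 20+ and an example set of analysis variables
analysis_vars <- NHANES |>
  filter(Age >= 20) |>
  select(BMI, Age, Gender, Race1, Education, Poverty)

# Where missingness appears: count and percent missing per variable
missing_summary <- analysis_vars |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "n_missing") |>
  mutate(pct_missing = 100 * n_missing / nrow(analysis_vars)) |>
  arrange(desc(n_missing))

# How many cases are removed by a complete-case analysis
complete_cases <- analysis_vars |>
  drop_na()
n_removed <- nrow(analysis_vars) - nrow(complete_cases)

# Does missingness differ across groups? (one example: income ratio by race/ethnicity)
missing_by_race <- analysis_vars |>
  group_by(Race1) |>
  summarise(pct_missing_poverty = 100 * mean(is.na(Poverty)),
            .groups = "drop")

missing_summary
n_removed
missing_by_race
```

If missingness is concentrated in particular groups, dropping incomplete cases changes who is represented in the analysis, which is the bias concern to discuss in this section.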

6. Reflection & Interpretation (≈ 3–4 paragraphs)

Discuss:

  • Public-health meaning of the patterns you observe
  • Possible explanations for disparities
  • Structural or contextual factors
  • What you would explore next with modeling

This section should begin to read like the Results and Discussion of an early manuscript.

Formatting Requirements

Your submission must:

  • be written entirely in Quarto
  • hide code chunks in the rendered report
  • knit without errors
  • include inline code for reported values
  • use tidy, readable R code
  • apply organized section headings
  • include citations where relevant (min. 8)
  • use inclusive, professional language
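One way to hide code throughout the report is to set chunk defaults once, in the first code chunk of the .qmd file, as in this sketch. It assumes the document is rendered with the knitr engine; setting `echo: false` under `execute:` in the Quarto YAML header is an alternative that achieves the same thing.

```r
library(knitr)

# Set defaults for every chunk that follows in the document
opts_chunk$set(
  echo    = FALSE,  # do not show R code in the rendered report
  message = FALSE,  # suppress package start-up messages
  warning = FALSE   # suppress warnings in the output
)
```

The code still runs and its tables and figures still appear; only the source is hidden from the reader, and it remains available in the submitted .qmd file.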

Target length: 1,200–1,500 words (excluding code)

Submission Requirements

Submit:

  • your knitted report (Word document)
  • your .qmd source file
  • your 3MT (Three Minute Thesis) slide