Lesson 4: Data Wrangling with dplyr

Learning goals

By the end of this lesson, you will be able to:

  • Explain why data wrangling is essential in public health research
  • Use core dplyr verbs to manipulate datasets
  • Filter rows and select variables of interest
  • Create new variables using mutate()
  • Summarize data using group_by() and summarise()
  • Write short, clear interpretations of cleaned data
  • Maintain a reproducible workflow in Quarto

Why data wrangling matters in public health

Real-world public health data are rarely ready for analysis. Variables may be poorly named, values may need recoding, and datasets often include far more information than is needed for a specific research question.

Data wrangling allows you to:

  • focus on relevant variables
  • prepare data for visualization and modeling
  • document decisions transparently
  • reduce errors before analysis

Clean data are the foundation of valid public health conclusions.


Setup

Load the packages we will use in this lesson.

library(tidyverse)
library(gapminder)

The Gapminder dataset (review)

Gapminder contains country-level data over time, including life expectancy, population, and GDP per capita.

gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Each row represents a country-year observation. Each column is a variable.


Selecting variables with select()

Often, you only need a subset of variables.

gap_select <- gapminder %>%
  select(country, year, lifeExp, gdpPercap)

head(gap_select)
# A tibble: 6 × 4
  country      year lifeExp gdpPercap
  <fct>       <int>   <dbl>     <dbl>
1 Afghanistan  1952    28.8      779.
2 Afghanistan  1957    30.3      821.
3 Afghanistan  1962    32.0      853.
4 Afghanistan  1967    34.0      836.
5 Afghanistan  1972    36.1      740.
6 Afghanistan  1977    38.4      786.

Selecting variables makes datasets easier to read and work with.


Filtering rows with filter()

Use filter() to keep only rows that meet a condition.

Example: data from a single year.

gap_2007 <- gapminder %>%
  filter(year == 2007)

head(gap_2007)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       2007    43.8 31889923      975.
2 Albania     Europe     2007    76.4  3600523     5937.
3 Algeria     Africa     2007    72.3 33333216     6223.
4 Angola      Africa     2007    42.7 12420476     4797.
5 Argentina   Americas   2007    75.3 40301927    12779.
6 Australia   Oceania    2007    81.2 20434176    34435.

Example: data for a single country.

gap_us <- gapminder %>%
  filter(country == "United States")

head(gap_us)
# A tibble: 6 × 6
  country       continent  year lifeExp       pop gdpPercap
  <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
1 United States Americas   1952    68.4 157553000    13990.
2 United States Americas   1957    69.5 171984000    14847.
3 United States Americas   1962    70.2 186538000    16173.
4 United States Americas   1967    70.8 198712000    19530.
5 United States Americas   1972    71.3 209896000    21806.
6 United States Americas   1977    73.4 220239000    24073.

Combining verbs with the pipe

The pipe (%>%) passes the result of one step into the next. This makes code easier to read.

gap_us_2007 <- gapminder %>%
  filter(country == "United States", year == 2007) %>%
  select(country, year, lifeExp, gdpPercap, pop)

gap_us_2007
# A tibble: 1 × 5
  country        year lifeExp gdpPercap       pop
  <fct>         <int>   <dbl>     <dbl>     <int>
1 United States  2007    78.2    42952. 301139947

Read this as: “Take Gapminder, then filter, then select.”


Creating new variables with mutate()

Use mutate() to create new variables.

Example: log-transform GDP per capita.

gap_2007 <- gap_2007 %>%
  mutate(log_gdpPercap = log(gdpPercap))

head(gap_2007)
# A tibble: 6 × 7
  country     continent  year lifeExp      pop gdpPercap log_gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
1 Afghanistan Asia       2007    43.8 31889923      975.          6.88
2 Albania     Europe     2007    76.4  3600523     5937.          8.69
3 Algeria     Africa     2007    72.3 33333216     6223.          8.74
4 Angola      Africa     2007    42.7 12420476     4797.          8.48
5 Argentina   Americas   2007    75.3 40301927    12779.          9.46
6 Australia   Oceania    2007    81.2 20434176    34435.         10.4 

Creating new variables is common in public health analyses.


Sorting data with arrange()

Use arrange() to order rows.

gap_2007 %>%
  arrange(desc(lifeExp)) %>%
  select(country, lifeExp) %>%
  head()
# A tibble: 6 × 2
  country          lifeExp
  <fct>              <dbl>
1 Japan               82.6
2 Hong Kong, China    82.2
3 Iceland             81.8
4 Switzerland         81.7
5 Australia           81.2
6 Spain               80.9

This helps identify countries with the highest or lowest values.


Grouped summaries with group_by() and summarise()

Grouped summaries allow you to compute statistics by group.

Example: mean life expectancy by continent in 2007.

cont_summary <- gap_2007 %>%
  group_by(continent) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    median_lifeExp = median(lifeExp),
    n_countries = n(),
    .groups = "drop"
  )

cont_summary
# A tibble: 5 × 4
  continent mean_lifeExp median_lifeExp n_countries
  <fct>            <dbl>          <dbl>       <int>
1 Africa            54.8           52.9          52
2 Americas          73.6           72.9          25
3 Asia              70.7           72.4          33
4 Europe            77.6           78.6          30
5 Oceania           80.7           80.7           2

This is a common workflow in descriptive public health analyses.


Writing interpretations with summaries

Summaries should always be interpreted in writing.

Example sentence using inline code:

In 2007, the mean life expectancy in Africa was 54.8 years, which was lower than that observed in other continents.


Chaining multiple steps together

You will often combine many verbs into one pipeline.

gap_summary <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    mean_gdpPercap = mean(gdpPercap),
    .groups = "drop"
  )

gap_summary
# A tibble: 5 × 3
  continent mean_lifeExp mean_gdpPercap
  <fct>            <dbl>          <dbl>
1 Africa            54.8          3089.
2 Americas          73.6         11003.
3 Asia              70.7         12473.
4 Europe            77.6         25054.
5 Oceania           80.7         29810.

Pipelines make your workflow transparent and reproducible.


Common data wrangling mistakes

Common mistakes include:

  • overwriting your original dataset unintentionally
  • forgetting to drop grouping after summarise()
  • filtering incorrectly due to misspelled values
  • creating variables without documenting why

Use clear object names and comment your code when decisions matter.


Reproducibility reminder

In this course:

  • all data cleaning must be done in code
  • transformations should be documented
  • datasets used for analysis should be reproducible
  • you should never manually edit data files

Data preparation is part of your scientific record.


Weekly assignment reminder

This week you will complete HW 04: Data Wrangling with dplyr.


Key takeaways

Data wrangling prepares data for analysis. dplyr verbs make transformations readable and transparent. Grouped summaries are central to descriptive public health research. Clean data support valid conclusions.


Looking ahead

Next week, you will learn about tidy data and reshaping datasets, which will help prepare data for visualization, modeling, and tables.