library(tidyverse)
library(gapminder)Lesson 4: Data Wrangling with dplyr
Learning goals
By the end of this lesson, you will be able to:
- Explain why data wrangling is essential in public health research
- Use core
dplyrverbs to manipulate datasets
- Filter rows and select variables of interest
- Create new variables using
mutate()
- Summarize data using
group_by()andsummarise()
- Write short, clear interpretations of cleaned data
- Maintain a reproducible workflow in Quarto
Why data wrangling matters in public health
Real-world public health data are rarely ready for analysis. Variables may be poorly named, values may need recoding, and datasets often include far more information than is needed for a specific research question.
Data wrangling allows you to:
- focus on relevant variables
- prepare data for visualization and modeling
- document decisions transparently
- reduce errors before analysis
Clean data are the foundation of valid public health conclusions.
Setup
Load the packages we will use in this lesson.
The Gapminder dataset (review)
Gapminder contains country-level data over time, including life expectancy, population, and GDP per capita.
gapminder# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Each row represents a country-year observation. Each column is a variable.
Selecting variables with select()
Often, you only need a subset of variables.
gap_select <- gapminder %>%
select(country, year, lifeExp, gdpPercap)
head(gap_select)# A tibble: 6 × 4
country year lifeExp gdpPercap
<fct> <int> <dbl> <dbl>
1 Afghanistan 1952 28.8 779.
2 Afghanistan 1957 30.3 821.
3 Afghanistan 1962 32.0 853.
4 Afghanistan 1967 34.0 836.
5 Afghanistan 1972 36.1 740.
6 Afghanistan 1977 38.4 786.
Selecting variables makes datasets easier to read and work with.
Filtering rows with filter()
Use filter() to keep only rows that meet a condition.
Example: data from a single year.
gap_2007 <- gapminder %>%
filter(year == 2007)
head(gap_2007)# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2007 43.8 31889923 975.
2 Albania Europe 2007 76.4 3600523 5937.
3 Algeria Africa 2007 72.3 33333216 6223.
4 Angola Africa 2007 42.7 12420476 4797.
5 Argentina Americas 2007 75.3 40301927 12779.
6 Australia Oceania 2007 81.2 20434176 34435.
Example: data for a single country.
gap_us <- gapminder %>%
filter(country == "United States")
head(gap_us)# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 United States Americas 1952 68.4 157553000 13990.
2 United States Americas 1957 69.5 171984000 14847.
3 United States Americas 1962 70.2 186538000 16173.
4 United States Americas 1967 70.8 198712000 19530.
5 United States Americas 1972 71.3 209896000 21806.
6 United States Americas 1977 73.4 220239000 24073.
Combining verbs with the pipe
The pipe (%>%) passes the result of one step into the next. This makes code easier to read.
gap_us_2007 <- gapminder %>%
filter(country == "United States", year == 2007) %>%
select(country, year, lifeExp, gdpPercap, pop)
gap_us_2007# A tibble: 1 × 5
country year lifeExp gdpPercap pop
<fct> <int> <dbl> <dbl> <int>
1 United States 2007 78.2 42952. 301139947
Read this as: “Take Gapminder, then filter, then select.”
Creating new variables with mutate()
Use mutate() to create new variables.
Example: log-transform GDP per capita.
gap_2007 <- gap_2007 %>%
mutate(log_gdpPercap = log(gdpPercap))
head(gap_2007)# A tibble: 6 × 7
country continent year lifeExp pop gdpPercap log_gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 2007 43.8 31889923 975. 6.88
2 Albania Europe 2007 76.4 3600523 5937. 8.69
3 Algeria Africa 2007 72.3 33333216 6223. 8.74
4 Angola Africa 2007 42.7 12420476 4797. 8.48
5 Argentina Americas 2007 75.3 40301927 12779. 9.46
6 Australia Oceania 2007 81.2 20434176 34435. 10.4
Creating new variables is common in public health analyses.
Sorting data with arrange()
Use arrange() to order rows.
gap_2007 %>%
arrange(desc(lifeExp)) %>%
select(country, lifeExp) %>%
head()# A tibble: 6 × 2
country lifeExp
<fct> <dbl>
1 Japan 82.6
2 Hong Kong, China 82.2
3 Iceland 81.8
4 Switzerland 81.7
5 Australia 81.2
6 Spain 80.9
This helps identify countries with the highest or lowest values.
Grouped summaries with group_by() and summarise()
Grouped summaries allow you to compute statistics by group.
Example: mean life expectancy by continent in 2007.
cont_summary <- gap_2007 %>%
group_by(continent) %>%
summarise(
mean_lifeExp = mean(lifeExp),
median_lifeExp = median(lifeExp),
n_countries = n(),
.groups = "drop"
)
cont_summary# A tibble: 5 × 4
continent mean_lifeExp median_lifeExp n_countries
<fct> <dbl> <dbl> <int>
1 Africa 54.8 52.9 52
2 Americas 73.6 72.9 25
3 Asia 70.7 72.4 33
4 Europe 77.6 78.6 30
5 Oceania 80.7 80.7 2
This is a common workflow in descriptive public health analyses.
Writing interpretations with summaries
Summaries should always be interpreted in writing.
Example sentence using inline code:
In 2007, the mean life expectancy in Africa was 54.8 years, which was lower than that observed in other continents.
Chaining multiple steps together
You will often combine many verbs into one pipeline.
gap_summary <- gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(
mean_lifeExp = mean(lifeExp),
mean_gdpPercap = mean(gdpPercap),
.groups = "drop"
)
gap_summary# A tibble: 5 × 3
continent mean_lifeExp mean_gdpPercap
<fct> <dbl> <dbl>
1 Africa 54.8 3089.
2 Americas 73.6 11003.
3 Asia 70.7 12473.
4 Europe 77.6 25054.
5 Oceania 80.7 29810.
Pipelines make your workflow transparent and reproducible.
Common data wrangling mistakes
Common mistakes include:
- overwriting your original dataset unintentionally
- forgetting to drop grouping after
summarise()
- filtering incorrectly due to misspelled values
- creating variables without documenting why
Use clear object names and comment your code when decisions matter.
Reproducibility reminder
In this course:
- all data cleaning must be done in code
- transformations should be documented
- datasets used for analysis should be reproducible
- you should never manually edit data files
Data preparation is part of your scientific record.
Weekly assignment reminder
This week you will complete HW 04: Data Wrangling with dplyr.
Key takeaways
Data wrangling prepares data for analysis. dplyr verbs make transformations readable and transparent. Grouped summaries are central to descriptive public health research. Clean data support valid conclusions.
Looking ahead
Next week, you will learn about tidy data and reshaping datasets, which will help prepare data for visualization, modeling, and tables.