Tidyverse and ggplot2 Analysis with Gapminder Dataset

Data Analysis with R

2025-07-07

Tidyverse and ggplot2 with Gapminder

Module Overview

  • Module: DA-R1 — Tidyverse Foundations
  • Course: Data Analysis with R
  • Audience: Early Career Researchers (PhD, Postdoc)
  • Duration: ~60 minutes directed teaching (+14 exercises)

💡 Learning Outcomes

  • Load, explore, and manipulate datasets with the tidyverse suite
  • Apply dplyr verbs for selecting, filtering, mutating, arranging, and summarising data
  • Perform joins and tidy messy data with tidyr
  • Conduct string operations with stringr
  • Produce high-quality visualisations using ggplot2
  • Export results to delimited and Excel formats

❓ Questions

  1. How can I explore and manipulate tabular data with tidyverse?
  2. What are the core dplyr functions for filtering, selecting, and summarising?
  3. How do joins work when combining multiple datasets?
  4. What are the main ggplot2 visualisation types for data exploration?

Structure & Agenda

  1. Introduction to tidyverse & Gapminder dataset (~30 min)
  2. Data manipulation with dplyr (~30 min)
  3. File I/O, tidying, and string handling (~30 min)
  4. Data visualisation with ggplot2 (~40 min)
  5. Exercise solutions and recap (~20 min)

Introduction

This R Markdown document demonstrates data manipulation and visualization using the tidyverse suite of packages, particularly dplyr, tidyr, readr, stringr, and ggplot2. We will use the gapminder dataset to perform various data operations and create visualizations. The document also includes solutions to the provided exercises.

Installing and Loading Packages

First, we load the necessary packages.
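A minimal sketch of the loading step, assuming the dataset comes from the gapminder CRAN package and that both packages are already installed. The installation line is left commented out because it only needs to run once.

```r
# One-time installation (uncomment if the packages are missing):
# install.packages(c("tidyverse", "gapminder"))

library(tidyverse)  # attaches dplyr, tidyr, readr, stringr, ggplot2, and others
library(gapminder)  # provides the gapminder data frame (assumed source of the data)
```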

Exploring the Gapminder Dataset

Let’s take a look at the first few rows of the gapminder dataset.
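One possible way to inspect the data, assuming the gapminder package is the source of the dataset:

```r
library(dplyr)
library(gapminder)

head(gapminder)      # first six rows
glimpse(gapminder)   # column names, types, and a preview of the values
dim(gapminder)       # number of rows and columns
summary(gapminder)   # per-column summary statistics
```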

Data Manipulation with dplyr

The dplyr package provides functions for data manipulation, such as select(), filter(), mutate(), arrange(), summarise(), and group_by().

Selecting Columns

Select specific columns using select().

Individual columns

Range of columns

Exclude columns

Exclude multiple columns

Exclude a vector of columns

Exclude a complement of columns

Rename columns with select

Rename columns with rename()
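The selection forms listed in this section can be sketched as follows. This is an illustrative reconstruction, not the original chunks; the new column name `nation` and the vector name `drop_cols` are hypothetical.

```r
library(dplyr)
library(gapminder)

# Individual columns
sel_cols <- select(gapminder, country, year, lifeExp)

# A contiguous range of columns, by name
sel_range <- select(gapminder, country:year)

# Exclude one column, then several
no_pop <- select(gapminder, -pop)
no_pop_gdp <- select(gapminder, -c(pop, gdpPercap))

# Exclude columns named in a character vector
drop_cols <- c("pop", "gdpPercap")
no_vec <- select(gapminder, -all_of(drop_cols))

# `!` takes the complement of a selection (everything except pop and gdpPercap)
complement <- select(gapminder, !c(pop, gdpPercap))

# Rename while selecting: only the listed columns are kept
renamed_sel <- select(gapminder, nation = country, year)

# Rename with rename(): all columns are kept
renamed_all <- rename(gapminder, nation = country)
```

The practical difference between the last two calls is that select() drops every column it does not mention, while rename() leaves the rest of the table untouched.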

Additional Selection Methods

Select columns using indices, ranges, or specific patterns.

Select by index

Select by index range

Exclude by labels

Select columns beginning with ‘co’

Select columns ending with ‘p’

Select columns containing ‘pop’
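A short sketch of these selection methods, assuming the column order of the gapminder package (country, continent, year, lifeExp, pop, gdpPercap):

```r
library(dplyr)
library(gapminder)

by_index <- select(gapminder, 1, 3)            # columns 1 and 3
by_index_range <- select(gapminder, 1:3)       # columns 1 to 3
not_labels <- select(gapminder, -c(country, continent))  # exclude by name

starts_co <- select(gapminder, starts_with("co"))  # names beginning with "co"
ends_p <- select(gapminder, ends_with("p"))        # names ending with "p"
has_pop <- select(gapminder, contains("pop"))      # names containing "pop"
```

The helpers starts_with(), ends_with(), and contains() ignore case by default, which is why `ends_with("p")` would also match a name ending in "P".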

Filtering Rows

Use filter() to subset rows based on conditions.

Filter for Belgium

Filter with multiple conditions

Filter using %in% operator
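The three filtering patterns above might look like this. The second country and the year are illustrative choices, not taken from the original chunks.

```r
library(dplyr)
library(gapminder)

# Rows for a single country
belgium <- filter(gapminder, country == "Belgium")

# Multiple conditions separated by commas are combined with AND
belgium_2007 <- filter(gapminder, country == "Belgium", year == 2007)

# %in% matches any value in a vector
two_countries <- filter(gapminder, country %in% c("Belgium", "France"))
```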

Combining Select and Filter

Combine select() and filter() to extract specific data.
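As a sketch, the two verbs chain naturally with the pipe. The continent and year used here are arbitrary examples.

```r
library(dplyr)
library(gapminder)

# Filter the rows first, then keep only the columns of interest
europe_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

europe_2007
```

Filtering before selecting matters here: once `continent` and `year` have been dropped by select(), they can no longer be used in a filter condition.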

Mutating Data

Use mutate() to create new columns and transmute() to keep only new columns.

Using mutate() to add new columns

Using transmute() to keep only new columns

Using mutate() to classify GDP per capita
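A hedged sketch of the three cases. The derived column `gdp`, the column `income_group`, and the thresholds of 1,000 and 10,000 are hypothetical choices made for illustration only.

```r
library(dplyr)
library(gapminder)

# mutate() adds a column and keeps all existing ones
with_gdp <- gapminder %>%
  mutate(gdp = gdpPercap * pop)

# transmute() keeps only the columns named in the call
only_gdp <- gapminder %>%
  transmute(country, year, gdp = gdpPercap * pop)

# Classify GDP per capita into bands (thresholds are illustrative)
classified <- gapminder %>%
  mutate(income_group = case_when(
    gdpPercap < 1000  ~ "low",
    gdpPercap < 10000 ~ "middle",
    TRUE              ~ "high"
  ))
```

case_when() evaluates its conditions in order, so a row is assigned to the first band whose condition is true.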

Arranging Data

Sort data using arrange().

Sort in ascending order

Sort in descending order
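Both sort orders can be sketched on life expectancy, chosen here only as an example column:

```r
library(dplyr)
library(gapminder)

# Ascending order is the default
asc <- arrange(gapminder, lifeExp)

# Wrap the column in desc() for descending order
dsc <- arrange(gapminder, desc(lifeExp))

head(asc, 3)
head(dsc, 3)
```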

Summarizing Data

Compute summary statistics with summarise().
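A minimal example, assuming life expectancy is the variable of interest. The output column names are arbitrary.

```r
library(dplyr)
library(gapminder)

# summarise() collapses the whole table to a single row
overall <- gapminder %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    min_lifeExp  = min(lifeExp),
    max_lifeExp  = max(lifeExp),
    n_rows       = n()
  )

overall
```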

Grouping Data

Group data with group_by() for per-group summaries.
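One way this could look, grouping by continent as an illustrative choice:

```r
library(dplyr)
library(gapminder)

# One output row per continent
by_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    n_countries  = n_distinct(country)
  )

by_continent
```

Without the group_by() step the same summarise() call would return a single row for the whole dataset.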

Joining Data

Perform joins using inner_join(), left_join(), right_join(), and full_join().

Example data for joining


Inner join

Left join

Right join

Full join
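The four joins can be compared on two small tables. The tables `left_tbl` and `right_tbl` and their values are made up for this sketch and are not the original example data.

```r
library(dplyr)

# Two small illustrative tables sharing an `id` key (hypothetical values)
left_tbl  <- tibble(id = c(1, 2, 3), name  = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), score = c(10, 20, 30))

inner <- inner_join(left_tbl, right_tbl, by = "id")  # ids present in both: 2, 3
left  <- left_join(left_tbl, right_tbl, by = "id")   # all ids from left_tbl: 1, 2, 3
right <- right_join(left_tbl, right_tbl, by = "id")  # all ids from right_tbl: 2, 3, 4
full  <- full_join(left_tbl, right_tbl, by = "id")   # every id from either: 1, 2, 3, 4
```

Rows without a match on the other side are kept by the left, right, and full joins, with NA filling the columns that have no counterpart.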

Reading and Writing Files

Use readr to read and write delimited files, and openxlsx for Excel files.

Write to a file

Write to Excel

Read from Excel
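A round-trip sketch, assuming readr and openxlsx are installed. Temporary file paths are used here so that nothing is written to the working directory; the original chunks may have used fixed file names.

```r
library(readr)
library(openxlsx)
library(gapminder)

# Write to and read back from a comma-delimited file
csv_path <- tempfile(fileext = ".csv")
write_csv(gapminder, csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# Write to and read back from an Excel workbook
xlsx_path <- tempfile(fileext = ".xlsx")
write.xlsx(gapminder, xlsx_path)
from_xlsx <- read.xlsx(xlsx_path)
```

Factor columns such as `country` come back as character vectors after either round trip, so they may need converting again with factor() if the levels matter.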

Tidying Data with tidyr

Handle missing values using tidyr.

Sample data with missing values

Remove rows with any NA

Remove rows with NA in specific column

Replace NA values
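The three operations above can be sketched on a small table. The table `df` and the replacement values are hypothetical.

```r
library(tidyr)
library(tibble)

# Sample data with missing values (made up for illustration)
df <- tibble(x = c(1, NA, 3), y = c("a", "b", NA))

no_na_any <- drop_na(df)      # drop rows with an NA in any column
no_na_x   <- drop_na(df, x)   # drop rows with an NA in column x only

# Replace NA with a per-column default
filled <- replace_na(df, list(x = 0, y = "unknown"))
```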

String Manipulation with stringr

Perform string operations using stringr.

Split string with stringr

Here we split the sentence into a list of words.

Access specific word with stringr

We can print specific words from the list.

Extract substring with stringr

```{r stringr3}
#| echo: true
# Extract substring
sub_text <- str_sub(text, 7, 11)
print(sub_text)
```

We can extract a substring by character position.

Replace substring with stringr

We can also replace a substring directly.
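The four string operations in this section can be sketched end to end. The sentence stored in `text` is a hypothetical stand-in, since the original value is not shown.

```r
library(stringr)

text <- "Hello world from R"   # hypothetical example sentence

# str_split() returns a list with one character vector per input string
words <- str_split(text, " ")

second_word <- words[[1]][2]                      # "world"
sub_text    <- str_sub(text, 7, 11)               # characters 7 to 11: "world"
replaced    <- str_replace(text, "world", "there")  # "Hello there from R"
```

Note the double brackets: `words[[1]]` extracts the character vector from the list before a single word is indexed.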

Visualizations with ggplot2

Create visualizations using ggplot2.

Scatter plot with ggplot2

Bar plot with ggplot2

Histogram plot with ggplot2

Box plot with ggplot2

Violin plot with ggplot2
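The five basic plot types above might be built as follows. Restricting the data to 2007 and the choice of variables are assumptions made to keep the sketch small.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)

# Scatter plot: two continuous variables
p_scatter <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point()

# Bar plot: counts per category
p_bar <- ggplot(gap_2007, aes(x = continent)) + geom_bar()

# Histogram: distribution of one continuous variable
p_hist <- ggplot(gap_2007, aes(x = lifeExp)) + geom_histogram(bins = 20)

# Box and violin plots: a continuous variable split by category
p_box    <- ggplot(gap_2007, aes(x = continent, y = lifeExp)) + geom_boxplot()
p_violin <- ggplot(gap_2007, aes(x = continent, y = lifeExp)) + geom_violin()

p_scatter   # printing a plot object draws it
```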

Scatter plot with color aesthetic with ggplot2

Scatter plot with custom color

Scatter plot with smooth line

Faceted plot in ggplot2
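A sketch of the colour, smoothing, and faceting variants. The fixed colour "steelblue" and the linear smoother are illustrative choices.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)
base <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp))

# Colour mapped to a variable goes inside aes()
p_colour <- base + geom_point(aes(colour = continent))

# A fixed colour goes outside aes()
p_fixed <- base + geom_point(colour = "steelblue")

# Points with a fitted line on top
p_smooth <- base + geom_point() + geom_smooth(method = "lm", se = FALSE)

# One panel per continent
p_facet <- p_colour + facet_wrap(~ continent)
```

The distinction between mapping colour inside aes() and setting it outside is a common source of confusion: inside, ggplot2 builds a legend from the data; outside, every point gets the same literal colour.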

Customized axes

Box plot with customized y-axis

Box plot with customized x-axis

Color gradient scale

Discrete color scale

Plot with color palette
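Axis and colour scales can be customised along these lines. The axis limits, labels, colours, and the "Set2" palette are all illustrative assumptions.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)
scatter <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp))
box <- ggplot(gap_2007, aes(x = continent, y = lifeExp)) + geom_boxplot()

# Log-scaled x-axis with custom axis titles
p_axes <- scatter + geom_point() + scale_x_log10() +
  labs(x = "GDP per capita (log scale)", y = "Life expectancy (years)")

# Customised y-axis limits and breaks on a box plot
p_box_y <- box + scale_y_continuous(limits = c(30, 90), breaks = seq(30, 90, by = 10))

# Customised x-axis labels on a box plot
p_box_x <- box + scale_x_discrete(labels = c("Africa" = "AF", "Americas" = "AM",
                                             "Asia" = "AS", "Europe" = "EU",
                                             "Oceania" = "OC"))

# Continuous colour gradient
p_gradient <- scatter + geom_point(aes(colour = lifeExp)) +
  scale_colour_gradient(low = "blue", high = "red")

# Manually chosen discrete colours
p_discrete <- scatter + geom_point(aes(colour = continent)) +
  scale_colour_manual(values = c("Africa" = "red", "Americas" = "blue",
                                 "Asia" = "darkgreen", "Europe" = "orange",
                                 "Oceania" = "purple"))

# Predefined ColorBrewer palette
p_palette <- scatter + geom_point(aes(colour = continent)) +
  scale_colour_brewer(palette = "Set2")
```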

Box plot with theme_classic

Box plot with jitter

Combined violin and box plot
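These three box plot variants could be layered as shown. The jitter width, transparency, and box width are arbitrary values.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)
base <- ggplot(gap_2007, aes(x = continent, y = lifeExp))

# Box plot with the classic theme (no grid lines, axis lines only)
p_classic <- base + geom_boxplot() + theme_classic()

# Box plot with the raw points jittered on top; outliers are hidden in the
# box layer so they are not drawn twice
p_jitter <- base + geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.5)

# Narrow box plot drawn inside a violin
p_violin_box <- base + geom_violin() + geom_boxplot(width = 0.1)
```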

Multiple plots with gridExtra

Marginal histogram
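A sketch of combining plots with gridExtra. The marginal histogram uses ggMarginal() from the ggExtra package, which is an assumption: the text does not name the package used, so that step is guarded and skipped if ggExtra is not installed.

```r
library(ggplot2)
library(dplyr)
library(gridExtra)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)

p1 <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point()
p2 <- ggplot(gap_2007, aes(x = continent, y = lifeExp)) + geom_boxplot()

# Draw two plots side by side
combined <- grid.arrange(p1, p2, ncol = 2)

# Scatter plot with histograms in the margins (assumes ggExtra is available)
if (requireNamespace("ggExtra", quietly = TRUE)) {
  p_marginal <- ggExtra::ggMarginal(p1, type = "histogram")
}
```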

Saving Plots

Save a plot to a file using ggsave.
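For example, assuming a PNG output is wanted. The file name, dimensions, and resolution are illustrative, and a temporary directory is used to avoid writing into the project folder.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# The file extension determines the output format
out_file <- file.path(tempdir(), "scatter.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

Passing `plot = p` explicitly is safer than relying on the default, which saves whichever plot was displayed last.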

Exercise Solutions

Below are the solutions to the provided exercises.

Exercise 1: Select year, lifeExp, and country

Exercise 2: Select all columns except year, lifeExp, and country

Exercise 3: Filter for year 2002

Exercise 4: Store and display data for Asian countries

Exercise 5: Compare Ireland and Brazil in 1977
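Possible solutions to Exercises 1 to 5, written as a sketch against the gapminder package data. The object names are hypothetical.

```r
library(dplyr)
library(gapminder)

# Exercise 1: select year, lifeExp, and country
ex1 <- select(gapminder, year, lifeExp, country)

# Exercise 2: select all columns except year, lifeExp, and country
ex2 <- select(gapminder, -c(year, lifeExp, country))

# Exercise 3: filter for year 2002
ex3 <- filter(gapminder, year == 2002)

# Exercise 4: store and display data for Asian countries
asia <- filter(gapminder, continent == "Asia")
asia

# Exercise 5: compare Ireland and Brazil in 1977
ex5 <- filter(gapminder, country %in% c("Ireland", "Brazil"), year == 1977)
```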

Exercise 6: African countries in 2012 with specific criteria

Exercise 7: Lowest life expectancy

Exercise 8: Highest life expectancy

Exercise 9: Shortest life expectancy in the Americas in 1962

Exercise 10: Summary statistics for life expectancy

Exercise 11: Life expectancy by year
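Possible solutions to Exercises 7 to 11, again as a sketch against the gapminder package data. Mean life expectancy is assumed to be the quantity wanted in Exercise 11.

```r
library(dplyr)
library(gapminder)

# Exercise 7: lowest life expectancy
ex7 <- slice_min(gapminder, lifeExp, n = 1)

# Exercise 8: highest life expectancy
ex8 <- slice_max(gapminder, lifeExp, n = 1)

# Exercise 9: shortest life expectancy in the Americas in 1962
ex9 <- gapminder %>%
  filter(continent == "Americas", year == 1962) %>%
  slice_min(lifeExp, n = 1)

# Exercise 10: summary statistics for life expectancy
ex10 <- gapminder %>%
  summarise(mean = mean(lifeExp), median = median(lifeExp),
            min = min(lifeExp), max = max(lifeExp), sd = sd(lifeExp))

# Exercise 11: life expectancy by year
ex11 <- gapminder %>%
  group_by(year) %>%
  summarise(mean_lifeExp = mean(lifeExp))
```

An equivalent approach for Exercises 7 and 8 is to sort with arrange() and take the first row with head().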

Exercise 12: Explore join statements

Already demonstrated in the “Joining Data” section above.

Exercise 13: Read a file

This requires a file like gapminder.xlsx. Assuming it’s available:

Exercise 14: Get and split working directory path
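One way to approach this exercise, assuming the path should be split on the forward slash that getwd() uses as its separator:

```r
library(stringr)

wd <- getwd()                       # current working directory as one string
parts <- str_split(wd, "/")[[1]]    # character vector of path components

parts
parts[length(parts)]                # name of the innermost folder
```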

Conclusion

This document covers key tidyverse functionalities and ggplot2 visualizations using the gapminder dataset. It includes data manipulation, file handling, string manipulation, and various plotting techniques, along with solutions to the exercises.

To run this analysis yourself, save the R Markdown source with a .Rmd extension and open it in RStudio or another R Markdown-compatible environment.