Load, explore, and manipulate datasets with the tidyverse suite
Apply dplyr verbs for selecting, filtering, mutating, arranging, and summarising data
Perform joins and tidy messy data with tidyr
Conduct string operations with stringr
Produce high-quality visualisations using ggplot2
Export results to delimited and Excel formats
❓ Questions
How can I explore and manipulate tabular data with tidyverse?
What are the core dplyr functions for filtering, selecting, and summarising?
How do joins work when combining multiple datasets?
What are the main ggplot2 visualisation types for data exploration?
Structure & Agenda
Introduction to tidyverse & Gapminder dataset (~30 min)
Data manipulation with dplyr (~30 min)
File I/O, tidying, and string handling (~30 min)
Data visualisation with ggplot2 (~40 min)
Exercise solutions and recap (~20 min)
Introduction
This R Markdown document demonstrates data manipulation and visualization using the tidyverse suite of packages, particularly dplyr, tidyr, readr, stringr, and ggplot2. We will use the gapminder dataset to perform various data operations and create visualizations. The document also includes solutions to the provided exercises.
Installing and Loading Packages
First, we load the necessary packages.
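A minimal sketch of the loading step. It assumes the gapminder data frame comes from the CRAN `gapminder` package; the source does not say where the data is loaded from, so treat that as an assumption.

```r
# Install once if needed (left commented out so the document does not reinstall on every run):
# install.packages(c("tidyverse", "gapminder", "openxlsx"))

library(tidyverse)  # attaches dplyr, tidyr, readr, stringr, and ggplot2, among others
library(gapminder)  # assumed source of the gapminder data frame
library(openxlsx)   # Excel reading and writing, used in the file I/O section
```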
Exploring the Gapminder Dataset
Let’s take a look at the first few rows of the gapminder dataset.
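One way to get a first look at the data, assuming `gapminder` is the data frame shipped with the `gapminder` package:

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

head(gapminder)     # first six rows
glimpse(gapminder)  # column names, types, and a preview of the values
dim(gapminder)      # number of rows and columns
```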
Data Manipulation with dplyr
The dplyr package provides functions for data manipulation, such as select(), filter(), mutate(), arrange(), summarise(), and group_by().
Selecting Columns
Select specific columns using select().
Individual columns
Range of columns
Exclude columns
Exclude multiple columns
Exclude a vector of columns
Exclude a complement of columns
Rename columns with select()
Rename columns with rename()
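The selection variants listed above could look like the following sketch. The column choices and result names are illustrative, not taken from the original document, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

# Individual columns
sel_cols <- select(gapminder, country, year, lifeExp)

# Range of columns: everything from country through year
sel_range <- select(gapminder, country:year)

# Exclude one column
sel_drop_one <- select(gapminder, -continent)

# Exclude multiple columns
sel_drop_many <- select(gapminder, -continent, -pop)

# Exclude columns named in a character vector
drop_cols <- c("continent", "pop")
sel_drop_vec <- select(gapminder, -all_of(drop_cols))

# Complement: keep every column except the ones listed
sel_complement <- select(gapminder, !c(country, year))

# Rename while selecting (only the listed columns are kept)
sel_renamed <- select(gapminder, nation = country, year)

# Rename with rename() (all columns are kept)
all_renamed <- rename(gapminder, nation = country)
```

Note the difference between the last two calls: `select()` drops every column that is not listed, while `rename()` keeps the full table.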
Additional Selection Methods
Select columns using indices, ranges, or specific patterns.
Select by index
Select by index range
Exclude by labels
Select columns beginning with ‘co’
Select columns ending with ‘p’
Select columns containing ‘pop’
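A sketch of these additional selection methods, using tidyselect helpers that dplyr re-exports. The specific indices and labels are illustrative choices, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

by_index       <- select(gapminder, 1, 3)          # 1st and 3rd columns
by_index_range <- select(gapminder, 1:3)           # columns 1 through 3
without_pop    <- select(gapminder, -pop)          # exclude one column by label
without_range  <- select(gapminder, -(year:pop))   # exclude a labelled range

starts_co <- select(gapminder, starts_with("co"))  # names beginning with "co"
ends_p    <- select(gapminder, ends_with("p"))     # names ending with "p"
has_pop   <- select(gapminder, contains("pop"))    # names containing "pop"
```

The helpers match column names case-insensitively by default, which matters for mixed-case names such as `lifeExp` and `gdpPercap`.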
Filtering Rows
Use filter() to subset rows based on conditions.
Filter for Belgium
Filter with multiple conditions
Filter using %in% operator
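The three filtering patterns might be written as below. Apart from Belgium, the conditions and country names are illustrative assumptions; the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

# Filter for Belgium
belgium <- filter(gapminder, country == "Belgium")

# Multiple conditions: comma-separated conditions are combined with AND
europe_2007 <- filter(gapminder, continent == "Europe", year == 2007)

# %in% matches any value in a vector
low_countries <- filter(gapminder, country %in% c("Belgium", "Netherlands"))
```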
Combining Select and Filter
Combine select() and filter() to extract specific data.
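A short sketch of chaining the two verbs with the pipe. The continent, year, and columns are illustrative choices, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

# Filter rows first, then keep only the columns of interest
oceania_2007 <- gapminder %>%
  filter(continent == "Oceania", year == 2007) %>%
  select(country, lifeExp, gdpPercap)

oceania_2007
```

Filtering before selecting matters here: `continent` and `year` are needed for the condition but are not kept in the result.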
Mutating Data
Use mutate() to create new columns and transmute() to keep only new columns.
Using mutate to add new columns
Using mutate to keep only new columns
Using mutate to classify GDP per capita
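A sketch of the three cases. The derived column `gdp`, the thresholds, and the class labels are hypothetical and chosen only for illustration; the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

# Add a new column: total GDP from GDP per capita and population
with_gdp <- mutate(gapminder, gdp = gdpPercap * pop)

# Keep only the new column; mutate(..., .keep = "none") is an alternative
only_gdp <- transmute(gapminder, gdp = gdpPercap * pop)

# Classify GDP per capita (thresholds are illustrative assumptions)
classified <- mutate(
  gapminder,
  income_level = case_when(
    gdpPercap < 1000  ~ "low",
    gdpPercap < 10000 ~ "middle",
    TRUE              ~ "high"
  )
)
```

`case_when()` evaluates its conditions in order, so the second branch only sees rows with `gdpPercap` of at least 1000.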
Arranging Data
Sort data using arrange().
Sort in ascending order
Sort in descending order
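Both sort orders in a short sketch. Sorting on `lifeExp` is an illustrative choice, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

# Ascending order is the default
by_lifeexp_asc <- arrange(gapminder, lifeExp)

# Wrap the column in desc() for descending order
by_lifeexp_desc <- arrange(gapminder, desc(lifeExp))
```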
Summarizing Data
Compute summary statistics with summarise().
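For instance, a single call can collapse the whole table into one row of statistics. The chosen statistics and column names are illustrative, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

overall <- summarise(
  gapminder,
  mean_lifeExp = mean(lifeExp),
  min_lifeExp  = min(lifeExp),
  max_lifeExp  = max(lifeExp),
  n_rows       = n()
)

overall
```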
Grouping Data
Group data with group_by() for per-group summaries.
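A sketch of a per-group summary. Grouping by continent is an illustrative choice, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

by_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    n_countries  = n_distinct(country),
    .groups = "drop"  # return an ungrouped tibble
  )

by_continent
```

The result has one row per group, so `summarise()` after `group_by()` replaces the single-row output of an ungrouped summary.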
Joining Data
Perform joins using inner_join(), left_join(), right_join(), and full_join().
Example data for joining
Inner join
Left join
Right join
Full join
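The four join types can be compared on two small tables. The tables `capitals` and `currencies` below are hypothetical stand-ins, since the original example data is not shown.

```r
library(dplyr)

# Hypothetical example data: the two tables share only Belgium and Ireland
capitals <- tibble(
  country = c("Belgium", "France", "Ireland"),
  capital = c("Brussels", "Paris", "Dublin")
)
currencies <- tibble(
  country  = c("Belgium", "Ireland", "Brazil"),
  currency = c("EUR", "EUR", "BRL")
)

inner <- inner_join(capitals, currencies, by = "country")  # rows present in both
left  <- left_join(capitals, currencies, by = "country")   # all rows of capitals
right <- right_join(capitals, currencies, by = "country")  # all rows of currencies
full  <- full_join(capitals, currencies, by = "country")   # all rows of either table
```

Rows without a match on the other side are kept by the left, right, and full joins, with `NA` filled in for the missing columns.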
Reading and Writing Files
Use readr to read and write delimited files, and openxlsx for Excel files.
Write to a file
Write to Excel
Read from Excel
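A sketch of a write and read round trip. The file names are hypothetical and the files go to a temporary directory so nothing in the working directory is overwritten; the data is assumed to come from the `gapminder` package.

```r
library(readr)
library(openxlsx)
library(gapminder)  # assumed source of the gapminder data frame

# Hypothetical output paths inside the session's temporary directory
csv_path  <- file.path(tempdir(), "gapminder.csv")
xlsx_path <- file.path(tempdir(), "gapminder.xlsx")

# Write to and read from a comma-delimited file
write_csv(gapminder, csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# Write to and read from Excel
write.xlsx(gapminder, xlsx_path)
from_xlsx <- read.xlsx(xlsx_path, sheet = 1)
```

Factor columns such as `country` come back as character vectors after the round trip, so convert them again if the factor levels matter.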
Tidying Data with tidyr
Handle missing values using tidyr.
Sample data with missing values
Remove rows with any NA
Remove rows with NA in specific column
Replace NA values
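The missing-value helpers can be tried on a small table. The `scores` table and the replacement values are hypothetical, since the original sample data is not shown.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data with missing values in two columns
scores <- tibble(
  name  = c("a", "b", "c", "d"),
  value = c(1, NA, 3, NA),
  label = c("u", "v", NA, NA)
)

# Remove rows with any NA
complete_rows <- drop_na(scores)

# Remove rows with NA in a specific column
has_value <- drop_na(scores, value)

# Replace NA values, column by column
filled <- replace_na(scores, list(value = 0, label = "unknown"))
```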
String Manipulation with stringr
Perform string operations using stringr.
Split string with stringr
Here we split the sentence into a list of words.
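A minimal sketch of the split. The sentence is a hypothetical example, since the original string is not shown.

```r
library(stringr)

sentence <- "The quick brown fox"  # hypothetical example sentence

# str_split() returns a list with one character vector per input string
words <- str_split(sentence, " ")

words[[1]]  # the words of the first (and only) sentence
```

Because the result is a list, `[[1]]` is needed to get at the character vector of words.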
Below are the solutions to the provided exercises.
Exercise 1: Select year, lifeExp, and country
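One possible solution, assuming the data comes from the `gapminder` package:

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

ex1 <- select(gapminder, year, lifeExp, country)
ex1
```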
Exercise 2: Select all columns except year, lifeExp, and country
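A possible solution that negates a vector of column names (data assumed to come from the `gapminder` package):

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

ex2 <- select(gapminder, -c(year, lifeExp, country))
ex2
```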
Exercise 3: Filter for year 2002
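A possible solution with a single condition (data assumed to come from the `gapminder` package):

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

ex3 <- filter(gapminder, year == 2002)
ex3
```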
Exercise 4: Store and display data for Asian countries
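A possible solution that stores the subset in a variable and then prints it. The name `asia` is an illustrative choice, and the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

asia <- filter(gapminder, continent == "Asia")  # store
asia                                            # display
```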
Exercise 5: Compare Ireland and Brazil in 1977
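A possible solution combining `%in%` with a year condition, so both countries appear side by side (data assumed to come from the `gapminder` package):

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

ex5 <- filter(gapminder, country %in% c("Ireland", "Brazil"), year == 1977)
ex5
```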
Exercise 6: African countries in 2012 with specific criteria
Exercise 7: Lowest life expectancy
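A possible solution using `slice_min()`; sorting with `arrange(lifeExp)` and taking the first row is an equivalent route. The data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

lowest <- slice_min(gapminder, lifeExp, n = 1)
lowest
```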
Exercise 8: Highest life expectancy
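A possible solution using `slice_max()`, the counterpart of `slice_min()` (data assumed to come from the `gapminder` package):

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

highest <- slice_max(gapminder, lifeExp, n = 1)
highest
```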
Exercise 9: Shortest life expectancy in the Americas in 1962
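A possible solution that filters to the continent and year first, then picks the minimum (data assumed to come from the `gapminder` package):

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

americas_1962 <- filter(gapminder, continent == "Americas", year == 1962)
shortest <- slice_min(americas_1962, lifeExp, n = 1)
shortest
```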
Exercise 10: Summary statistics for life expectancy
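A possible solution. The choice of statistics is an assumption, since the exercise text is not shown; the data is assumed to come from the `gapminder` package.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

lifeexp_stats <- summarise(
  gapminder,
  mean   = mean(lifeExp),
  median = median(lifeExp),
  sd     = sd(lifeExp),
  min    = min(lifeExp),
  max    = max(lifeExp)
)
lifeexp_stats
```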
Exercise 11: Life expectancy by year
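A possible solution, assuming the exercise asks for the mean life expectancy in each year (data assumed to come from the `gapminder` package):

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

lifeexp_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(mean_lifeExp = mean(lifeExp), .groups = "drop")

lifeexp_by_year
```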
Exercise 12: Explore join statements
Already demonstrated in the “Joining Data” section above.
Exercise 13: Read a file
This requires a file like gapminder.xlsx. Assuming it’s available:
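A hedged sketch of the read step. The relative path is an assumption, and the `file.exists()` guard is added here only so the code does not fail when the file is absent.

```r
library(openxlsx)

# Assumed location: adjust to wherever gapminder.xlsx is stored
xlsx_file <- "gapminder.xlsx"

if (file.exists(xlsx_file)) {
  gap_from_excel <- read.xlsx(xlsx_file, sheet = 1)  # first worksheet
  head(gap_from_excel)
} else {
  message("File not found: ", xlsx_file)
}
```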
Exercise 14: Get and split working directory path
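A possible solution. It splits on `/` on the assumption that `getwd()` returns a path with forward slashes, which R does on all platforms.

```r
library(stringr)

wd <- getwd()                    # current working directory as a single string
path_parts <- str_split(wd, "/") # list with one character vector of path components

path_parts[[1]]
```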
Conclusion
This document covers key tidyverse functionalities and ggplot2 visualizations using the gapminder dataset. It includes data manipulation, file handling, string manipulation, and various plotting techniques, along with solutions to the exercises.
To run this R Markdown file, save it with a .Rmd extension and open it in RStudio or another R Markdown-compatible environment.