Restructuring Data

1 / 52

Toolbox Refresher

2 / 52

In previous lessons, we defined data cleaning and learned how to organize our data. Now we can get started with using R, and in particular, on first getting our data into sensible, usable structures.

Let’s begin by going over the tools and packages that we’ll be using for this part and throughout this course.

Tools for preparing, transforming, and restructuring data

📦 dplyr
📦 tidyr

Part of the tidyverse

Flexible, but built around features that support a particular way of doing things

  • Hard to use incorrectly by accident
  • Less code to achieve a goal
  • Shared grammar and input/output data structures
3 / 52

When working in R for data cleaning, we'll be calling on functions from two popular packages: dplyr and tidyr. Both are written and maintained by RStudio staff with extensive community contributions. They are part of a larger set of packages referred to as the tidyverse.

These tools are meant to be flexible, but they are built around features that support a particular way of doing specific tasks.

Functions in different tidyverse packages follow tidy data principles, so they work well together. And, by having specialized purposes, each one does only one thing, but does it well.

Having established that this course is tidyverse-oriented, let's go over another tool that we'll be using to make our code easy to read.

Pipes

To perform multiple operations in sequence:

  • Nested code

  • Intermediate objects

  • Pipes

4 / 52

When we work with functions to perform many sequential operations, we have three options.

We can use nested code, in which function calls are nested inside more function calls so that the results get evaluated from the inside out.

We can also assign objects with intermediate results, passed on to the next function in our sequence.

Or, as we’ll be doing in this course, we can use pipes.

Let me show you what this looks like.

Nested code

grades <- c(7, 8, 8, 10, 6, 9, 5, 9, 8)
round(mean(grades), digits = 2)
## [1] 7.78

Intermediate objects

mn_grades <- mean(grades)
rd_mn_grades <- round(mn_grades, digits = 2)
rd_mn_grades
## [1] 7.78
5 / 52

Nested code looks like this.

Here, we have a vector ‘grades’ with some numeric values. Our sequence of operations is to first get the mean grade, and then round this result. In nested code, the mean function is nested inside the round function.

Intermediate objects work in a similar way. First we assign an object with the mean of the grades, then we pass this object to the rounding function.

Notice how both approaches give us the same result.

Pipes

For structuring sequential operations left-to-right:

Left Hand Side
(LHS)
 

pipe operator
%>% from 📦 magrittr

Right Hand Side
(RHS)
 

6 / 52

Alternatively, we can use pipes to avoid nested code or having to assign results to intermediate objects that we don't really need.

Pipes let us structure sequential operations from left to right.

When we work with pipes, there is always something on the left hand side of a pipe operator, and something else on the right hand side.

The pipe operator appears in orange here. It's originally from the magrittr package.

Pipes

  • Take an object on the Left Hand Side of the pipe and insert it into a function as an argument on the Right Hand Side

  • By default, LHS is placed as the first argument in the call

  • Packages in the tidyverse load %>% automatically

7 / 52

More specifically, pipes take an object on the left hand side of the pipe operator and insert it into a function on the right hand side.

By default, the object on the left hand side becomes the first argument in the function call.

By the way, the pipe operator gets loaded by the tidyverse packages that we'll be using.

Let me summarize the main advantages of using pipes to write code.

Pipes

  • Minimize the need for intermediate objects and nested code

  • Make code readable

  • Easy to add or remove steps at any point in the sequence of operations

Insert with Ctrl + Shift + M

library(magrittr)
grades %>%
  mean() %>%
  round(digits = 2)
## [1] 7.78
8 / 52

Let me summarize the main advantages of using pipes to write code.

First, by chaining our operations with pipes we pass the results onto the next function, so we don’t need nested code or to assign intermediate results to objects we won’t use again.

This makes our code more readable, and

it's also easier to add or remove steps in our sequence of operations.

A useful shortcut to remember is that in RStudio we can insert pipes using Ctrl Shift M, or Cmd Shift M on a Mac.

The example from earlier looks like this if we use pipes. First we load the magrittr package, take the grades object, pass it to the mean function, and then pass this result to the round function. Note how we don't have to write the first argument in functions that are on the right hand side of a pipe.

We can now learn about the functions from dplyr that we’ll use extensively throughout this course.

Useful 📦 dplyr functions

  • Individual functions for the most common operations

  • Each function does only one thing, but does it well

  • Intuitive, user-friendly functions for common data manipulation tasks:

    • Subset columns

    • Subset rows

    • Create or modify columns

9 / 52

As I mentioned earlier, the dplyr package provides us with separate functions for doing the most common data manipulation tasks.

dplyr provides separate functions for the most common operations.

Each function does only one thing, but does it well.

When we're cleaning data, we usually do at least one, if not all three, of the following:

subset columns,

subset rows,

or create or modify columns.

name              style      alc_by_volume  price_range  country  star_rating
County            Lager      4.2            low          USA      3.40
Mountain Special  Pale Ale   4.1            low          USA      2.70
Lagartos          Lager      5.1            mid          Mexico   4.10
Jealous Hobo      Porter     6.5            mid          USA      3.36
Strasse           Stout      8.3            high         Germany  4.89
Arbre             Pale Ale   3.4            low          Belgium  4.01
K1997             Lager      4.0            high         Japan    3.88
Lagartos Cero     Lager      0.0            mid          Mexico   2.00
Airball           Irish Ale  6.0            high         USA      3.64
Clean Tackle      Stout      7.2            mid          England  4.20
beers <- tibble::tribble(
  ~name, ~style, ~alc_by_volume, ~price_range, ~country, ~star_rating,
  "County", "Lager", 4.2, "low", "USA", 3.4,
  "Mountain Special", "Pale Ale", 4.1, "low", "USA", 2.7,
  "Lagartos", "Lager", 5.1, "mid", "Mexico", 4.1,
  "Jealous Hobo", "Porter", 6.5, "mid", "USA", 3.36,
  "Strasse", "Stout", 8.3, "high", "Germany", 4.89,
  "Arbre", "Pale Ale", 3.4, "low", "Belgium", 4.01,
  "K1997", "Lager", 4, "high", "Japan", 3.88,
  "Lagartos Cero", "Lager", 0, "mid", "Mexico", 2,
  "Airball", "Irish Ale", 6, "high", "USA", 3.64,
  "Clean Tackle", "Stout", 7.2, "mid", "England", 4.2
)
10 / 52

These data here describe some fictional beers and their properties.

select() Subset variables

  • by name
  • by position
  • by type
beers %>% select(name, style, star_rating)

select animation

11 / 52

The first thing we might want to try is to subset some variables from the original table. For this we use the select function. select() helps us subset variables, which we can specify by their name, their position in the data from left to right, or even by type.

In the example, we’re selecting three variables from the beers object, and we’re doing that with a pipe and by passing their names to the select function.

The beers object is passed by the pipe as the first argument of the select function, and the other arguments are the names of the variables we want to keep, which are name, style, and star rating.
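
Selection by position and by type are not shown on the slide; here is a minimal sketch of both, assuming the beers data from above and dplyr 1.0 or later for where():

beers %>% select(1, 2, 6)            # by position: the first, second, and last columns
beers %>% select(where(is.numeric))  # by type: every numeric column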

mutate() Create and modify columns

  • create new
  • overwrite
beers %>%
  select(name, style, star_rating) %>%
  mutate(stars = round(star_rating, digits = 0))

mutate animation

12 / 52

Next up is creating new columns or modifying existing ones. We can do this with the mutate function. The syntax for creating new columns is quite flexible, but typically we need to specify the column's name and what it will contain. Let's see what's happening in the example.

To the transformation from before, in which we selected some variables from the beers object, we are adding an additional step using a pipe. With mutate, we’re creating a new variable called stars, which is itself the result of rounding the values of an existing column.

When we use mutate, we create new columns, but if the name of the new column is the same as an existing one, it will be overwritten.
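
Overwriting is not shown in the example; a minimal sketch, reusing an existing column name so that mutate replaces it rather than adding a new column:

beers %>%
  mutate(star_rating = round(star_rating, digits = 0))  # same name, so star_rating is overwritten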

filter() Subset rows

  • keep rows that meet a condition
beers %>%
  select(name, style, star_rating) %>%
  mutate(stars = round(star_rating, digits = 0)) %>%
  filter(stars > 3)

filter animation

13 / 52

Finally, we can also work on rows. To subset rows that meet a condition, we use the filter function. Again, we’re using a pipe to add a new step to the sequence from before. With the new variable stars, we want to keep only the rows in which the values of stars are greater than 3.

Notice how with the pipes we can keep track of the different steps, and that we didn’t have to work with intermediate objects for the results from each step. This is how we will work throughout this course.

My Turn

  • Load the mammal sleep data bundled with ggplot2

  • Select the "name" and "conservation" columns, as well as all numeric variables

  • Create a new column that contains the values of 'awake' multiplied by the values in 'brainwt'

  • Filter the data to keep only rows with 'sleep_total' > 4

I'll use pipes to chain the operations

14 / 52

Now, it's my turn to use pipes and functions from dplyr to transform a dataset.
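
The demo code itself is not captured in this export; one possible solution, assuming dplyr is loaded and ggplot2 is installed (the mammal sleep data is the msleep dataset bundled with ggplot2, and the new column name awake_brainwt is an arbitrary choice):

library(dplyr)
library(ggplot2)  # provides the msleep dataset

msleep %>%
  select(name, conservation, where(is.numeric)) %>%  # the two named columns plus all numeric variables
  mutate(awake_brainwt = awake * brainwt) %>%        # new column: 'awake' multiplied by 'brainwt'
  filter(sleep_total > 4)                            # keep rows with 'sleep_total' > 4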

Your Turn

  • Load the mammal sleep data bundled with ggplot2

  • Select "name" and "conservation" columns and those that include the string 'sleep' in their name

  • Create a new column that contains the values of 'sleep_total' multiplied by 3

  • Filter the data to remove domesticated mammals

Use pipes to chain the operations!

15 / 52

Working with Columns with across()

16 / 52

In all the previous examples in which we selected or modified columns, we were working on them either one at a time or naming each one.

Operating on multiple columns

across()

filter or mutate multiple columns

  • By position, name, or type

  • Compound selections of columns
    (e.g. numeric columns that contain "mass" in their name)

17 / 52

Thanks to some recent updates to dplyr, we can operate on multiple columns at once using the across function.

We can specify many columns at once by position, name, or type. I don't recommend using positions, because these are likely to change and are less meaningful.

More importantly, we can use across to create compound selections of columns. For example, selecting or mutating all columns of type numeric, but only if they contain a specific string in their name.
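
A minimal sketch of such a compound selection, assuming the beers data from earlier and dplyr 1.0 or later:

beers %>%
  mutate(across(where(is.numeric) & matches("rating"), round))  # only numeric columns whose name contains "rating"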

across() arguments:

.cols Columns to transform

.fns Function to apply to each of the selected columns

Expects a function, so function name and arguments can be passed separately

beers %>%
  mutate(across(c(alc_by_volume, star_rating), round, digits = 3))
18 / 52

across() has a relatively straightforward syntax.

.cols is how we specify the columns we want to transform,

and .fns is the function to apply to each of those columns.

If you recall the beers data from earlier, we can use across to round both numeric variables at the same time. The .cols argument is a vector with the names of the two columns we're interested in, followed by the function we wish to apply and any additional arguments. Here, it's the round function, with the digits argument passed on as well. Notice we used just the function name, without any parentheses.
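
As a side note, newer releases of dplyr prefer passing a lambda rather than a function name plus separate arguments; a sketch of the equivalent call:

beers %>%
  mutate(across(c(alc_by_volume, star_rating), ~ round(.x, digits = 3)))  # ~ creates a one-off function of .x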

Useful column selection helpers

everything() Match all variables


beers %>%
  mutate(across(everything(), as.character))
19 / 52

To specify sets of columns, there are a number of useful helper functions that we can use alongside or instead of the across function.

First up is everything, which matches all variables. In the example, we’re transforming every variable in the beers object to character.

Useful column selection helpers (continued)

! Take the complement of a set of variables (negate a selection)


beers %>%
  select(!c(style, country))
20 / 52

We can also use the exclamation mark to take the complement of a set of variables. In the example, the exclamation mark negates the set of variables specified by name, so we get everything except style and country.

Useful column selection helpers (continued)

where() Selects variables for which a predicate function returns TRUE

e.g. subset or transform all numeric variables, determined by is.numeric()

beers %>%
  mutate(across(where(is.numeric), log2))
21 / 52

where() is somewhat more elaborate, but it works together with any other function that returns TRUE or FALSE, and then selects or mutates only those variables for which the condition is TRUE.

For example, say we want to apply a base-two logarithm to all variables of type numeric in our beers object. We use the is.numeric function to check all the variables, and the log2 function will only be applied to the variables that meet the condition.

Useful column selection helpers (continued)

matches() Match variable names with a regular expression

e.g. transform variables that contain "rating" in their name

beers %>%
  mutate(across(matches("rating"), log2))
22 / 52

matches() is good for working with variable names, operating only on those that contain a match for our pattern of interest. This can be very useful, and we will see the power of regular expressions in a separate lesson.

In the example code, we're matching any variables whose name contains the word rating, and applying a base-two log transformation.

Useful column selection helpers (continued)

: Select a range of consecutive variables


beers %>%
  select(style:country)
23 / 52

The colon, when used as a selection helper, matches a range of variables that are next to each other in the data. In the example, we’ll be keeping style, country, and every other variable between these two.

Useful column selection helpers (continued)

- Exclude variables
(return all elements except the one preceded by the subtraction operator)

beers %>%
  select(-country)
24 / 52

Finally, we can use the minus sign to exclude variables from a selection. In the example, we exclude country and keep everything else.

beers

name              style      alc_by_volume  price_range  country  star_rating
County            Lager      4.2            low          USA      3.40
Mountain Special  Pale Ale   4.1            low          USA      2.70
Lagartos          Lager      5.1            mid          Mexico   4.10
Jealous Hobo      Porter     6.5            mid          USA      3.36
Strasse           Stout      8.3            high         Germany  4.89
Arbre             Pale Ale   3.4            low          Belgium  4.01
K1997             Lager      4.0            high         Japan    3.88
Lagartos Cero     Lager      0.0            mid          Mexico   2.00
Airball           Irish Ale  6.0            high         USA      3.64
Clean Tackle      Stout      7.2            mid          England  4.20
25 / 52

Let's have another look at the beers data, and apply some transformations to many columns at once. Just looking at this table, we can see two numeric variables: alcohol by volume and star rating. Just for demonstration, let's use the mutate function to apply a numeric transformation to these two columns.

Square root (sqrt()) of alc_by_volume and star_rating

beers %>%
  mutate(across(c(alc_by_volume, star_rating), sqrt))
## # A tibble: 10 x 6
## name style alc_by_volume price_range country star_rating
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 County Lager 2.05 low USA 1.84
## 2 Mountain Special Pale Ale 2.02 low USA 1.64
## 3 Lagartos Lager 2.26 mid Mexico 2.02
## 4 Jealous Hobo Porter 2.55 mid USA 1.83
## 5 Strasse Stout 2.88 high Germany 2.21
## 6 Arbre Pale Ale 1.84 low Belgium 2.00
## 7 K1997 Lager 2 high Japan 1.97
## 8 Lagartos Cero Lager 0 mid Mexico 1.41
## 9 Airball Irish Ale 2.45 high USA 1.91
## 10 Clean Tackle Stout 2.68 mid England 2.05
26 / 52

To apply the square root function to these two columns, one option would be to refer to them by name inside across.

Square root (sqrt()) of all numeric variables

beers %>%
  mutate(across(where(is.numeric), sqrt))
## # A tibble: 10 x 6
## name style alc_by_volume price_range country star_rating
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 County Lager 2.05 low USA 1.84
## 2 Mountain Special Pale Ale 2.02 low USA 1.64
## 3 Lagartos Lager 2.26 mid Mexico 2.02
## 4 Jealous Hobo Porter 2.55 mid USA 1.83
## 5 Strasse Stout 2.88 high Germany 2.21
## 6 Arbre Pale Ale 1.84 low Belgium 2.00
## 7 K1997 Lager 2 high Japan 1.97
## 8 Lagartos Cero Lager 0 mid Mexico 1.41
## 9 Airball Irish Ale 2.45 high USA 1.91
## 10 Clean Tackle Stout 2.68 mid England 2.05
27 / 52

Alternatively, we can use the where function with is.numeric to simplify our code and apply the square root function to every numeric variable. Both approaches give the same result.

My Turn

  • Load the midwest data bundled with ggplot2

  • Keep only rows for Wisconsin (WI)

  • Subset the 'county' column and all columns that match the string 'perc' (hint: use a selection helper)

  • log-transform all numeric variables

  • Append the suffix "_check" to every value

28 / 52
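
The demo code is not captured in this export; one possible solution, assuming dplyr is loaded and ggplot2 is installed (midwest is bundled with ggplot2):

library(dplyr)
library(ggplot2)  # provides the midwest dataset

midwest %>%
  filter(state == "WI") %>%                             # keep only Wisconsin rows
  select(county, matches("perc")) %>%                   # 'county' plus columns matching "perc"
  mutate(across(where(is.numeric), log)) %>%            # log-transform all numeric variables
  mutate(across(everything(), ~ paste0(.x, "_check")))  # append "_check" to every value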

Your Turn

  • Load the midwest data bundled with ggplot2

  • Keep only rows for Ohio (OH)

  • Subset the 'county' column and all columns that match the string 'pop' (hint: use a selection helper)

  • Square-root transform all numeric variables

29 / 52

Pivoting Data

30 / 52

In the lesson on data organization, we went over tidy data principles and took a first look at long and wide structures for data. Now, we can go ahead and use R to reshape and pivot our data between these different structures.

Wide Data

Human-readable
Easier entry
Easier interactive editing

Long Data

Often meets Tidy Data principles
Analysis-ready
Maps well to plot axes and aesthetics

wide/long animation

31 / 52

As the names suggest, data can take either wide or long structures. The difference between these two goes beyond just having different proportions of rows and columns; it goes back to tidy data principles.

Wide data is what we see often and what we use for data entry or reporting. This structure is readable, concise, and easy to edit.

On the other hand, long data is more likely to be tidy and more ready for use in our workflows. The fundamental difference lies with how variables and observations relate to rows and columns. The good news is that we can transform the data between structures.

Reshaping data with 📦 tidyr

wide to long: pivot_longer()

pivot_longer(data, cols, names_to, values_to)

long to wide: pivot_wider()

pivot_wider(data, names_from, values_from)
32 / 52

To reshape data from wide to long and vice versa, we’ll use functions from the tidyr package, which, if you remember, is part of the tidyverse.

The names of the relevant functions are also quite easy to remember.

To go from wide to long, we use pivot_longer

To go from long to wide, we use pivot_wider

Habitat  County      Frogs  Crayfish  Owls
Marsh    Clarkville  7      5         4
Urban    Smith       0      0         2
Wetland  Wilson      9      2         1
Forest   Smith       4      0         7
critters <-
  tibble::tribble(
    ~Habitat, ~County, ~Frogs, ~Crayfish, ~Owls,
    "Marsh", "Clarkville", 7L, 5L, 4L,
    "Urban", "Smith", 0L, 0L, 2L,
    "Wetland", "Wilson", 9L, 2L, 1L,
    "Forest", "Smith", 4L, 0L, 7L
  )

Variables (attributes measured):

Habitat

County

Critter Type

Individuals Encountered

33 / 52

I'll illustrate these transformations with this simple dataset that enumerates some wildlife sightings across different habitats and counties. Just looking at this table, we can see that it is wider than it is long.

Going deeper into the structure of the data though, we will come to realize that this wide data is not tidy. Why? Because there are actually four attributes being measured, but their values do not all correspond to columns. Specifically, the values for critter type and the number of individuals encountered are not in their own columns. We can still interpret the information stored in these data, but for many purposes a long, tidy version might be preferable. Let's see how we can go about reshaping this critter data.

Wide to long

critters_long <- critters %>%
  pivot_longer(
    cols = c(Frogs, Crayfish, Owls),
    names_to = "critter_type",
    values_to = "individuals_encountered"
  )

cols Which columns have data in their names

names_to Name for the new column to be created from cols

values_to Name of the column to be created from the contents of the cells in cols

34 / 52

For this, we use the pivot longer function. Notice again how we’re using pipes. Pivot longer has three main arguments that we need to specify.

Cols are the columns that actually contain data in their names. In our example, these would be the columns for each of the little animals, which are actually levels of an attribute.

Having recognized that there were data in the column names that should be in their own column, we can give that new column a name. This is what the names_to argument is for. In this case, let's call the new column critter_type.

Finally, the content of the columns we specified with the cols argument also needs to go into a separate variable, and it will need a name. In our example, this would be the number of individuals spotted, and we can use that as a name.
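
As an aside, the cols argument accepts the same selection helpers we saw earlier, so the three columns do not have to be spelled out one by one; a sketch of an equivalent call:

critters %>%
  pivot_longer(cols = !c(Habitat, County),  # everything except the two ID columns
               names_to = "critter_type",
               values_to = "individuals_encountered")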

critters %>%
  pivot_longer(cols = c(Frogs, Crayfish, Owls),
               names_to = "critter_type",
               values_to = "individuals_encountered")
## # A tibble: 12 x 4
## Habitat County critter_type individuals_encountered
## <chr> <chr> <chr> <int>
## 1 Marsh Clarkville Frogs 7
## 2 Marsh Clarkville Crayfish 5
## 3 Marsh Clarkville Owls 4
## 4 Urban Smith Frogs 0
## 5 Urban Smith Crayfish 0
## 6 Urban Smith Owls 2
## 7 Wetland Wilson Frogs 9
## 8 Wetland Wilson Crayfish 2
## 9 Wetland Wilson Owls 1
## 10 Forest Smith Frogs 4
## 11 Forest Smith Crayfish 0
## 12 Forest Smith Owls 7
35 / 52

Let's see how the long table looks.

The same information is now in a long, tidy version. Notice how the function handled all the combinations for us, and how there is more repetition with this structure. Now, each of the four measured attributes is in a column and each row is now an observation with a value for each variable.

Long to wide

critters_long %>%
  pivot_wider(names_from = critter_type,
              values_from = individuals_encountered)

names_from Which columns' values will be used for new column names

values_from Which column has the cell values for the new output columns

36 / 52

We can just as easily reshape the long version back to a wide version.

For this, we use two main arguments.

First, we need to specify which column contains the values that need to be spread out across as many column names as there are unique values in the column. We do this with the names from argument.

Then, we need to specify which column has the values that will populate the new columns. We use values from for this.

critters_long %>%
  pivot_wider(names_from = critter_type,
              values_from = individuals_encountered)
## # A tibble: 4 x 5
## Habitat County Frogs Crayfish Owls
## <chr> <chr> <int> <int> <int>
## 1 Marsh Clarkville 7 5 4
## 2 Urban Smith 0 0 2
## 3 Wetland Wilson 9 2 1
## 4 Forest Smith 4 0 7
37 / 52

With our example, the new “critter type” column has the names of the critters that we want as separate columns, and the “individuals encountered” column has the numeric values that will also be spread out into the new columns.

When we run this code, we get our initial wide dataset back.

Maybe an animated version of this transformation can illustrate the process better. Look at how the same information can appear in different layouts. Each one has its advantages and disadvantages, but it's important to know that shifting between the two is not an issue.

My Turn

  • Load the dog ranks data ("dogranks_mine.csv")

  • Pivot the data (wide to long and back to wide)

39 / 52
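
The demo is not captured in this export, and the structure of "dogranks_mine.csv" is not shown, so this is only a shape sketch with hypothetical columns (a Breed identifier plus one column per year):

library(tidyr)

dogranks <- read.csv("dogranks_mine.csv")

dogranks_long <- dogranks %>%
  pivot_longer(cols = !Breed, names_to = "year", values_to = "rank")  # wide to long

dogranks_long %>%
  pivot_wider(names_from = year, values_from = rank)  # and back to wide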

Your Turn

  • Load the dog ranks data ("dogranks_your.csv")

  • Pivot the data (wide to long and back to wide)

40 / 52

Coalescing Columns

41 / 52

By now, we’ve gone over subsetting rows and columns, creating new columns, and reshaping entire datasets between wide and long structures. These tools are pretty much everything we need to restructure our data. Now, we’ll learn about two more functions from dplyr and tidyr that may come in handy during the restructuring stage of our data cleaning.

First, we’ll learn about coalescing columns.

Coalescing values sounds strange and complicated, but as we'll see, this comes up often during data cleaning, and the important thing is that we can avoid having to do it by hand.

Coalescing columns is essentially lumping together two or more columns with data for the same variable, in which there may be missing values. We want to combine the data, keeping values from one column and filling any missing data with values from other columns.

coalesce() from 📦 dplyr

  • Find and return the first non-NA value at each position for a set of vectors

  • Returns a single vector with the non-NA values that appear first

42 / 52

To do this, we use the coalesce function from dplyr.

Technically, coalesce finds and returns the first non-NA value at each position for two or more vectors. Remember, columns in rectangular data are also vectors.

When we coalesce a set of vectors, the function returns a single vector with the non-NA values that appear first at each position.

coalesce() from 📦 dplyr

x <- c(1, 2, NA, NA, 5)
y <- c(NA, NA, 3, 4, 5)
z <- c(1, 4, 3, 5, 3)
coalesce(x, y, z)
## [1] 1 2 3 4 5
43 / 52

Let's demonstrate this with some simple vectors. x, y, and z are numeric vectors of the same length, and some of them have missing values.

We want a single vector with as few NAs as possible. If we coalesce these three vectors in this particular order, the values from vector y at the third and fourth positions fill in the missing values in vector x, and vector z is not used.

The order of the vectors here is important.

coalesce() from 📦 dplyr

x <- c(1, 2, NA, NA, 5)
y <- c(NA, NA, 3, 4, 5)
z <- c(1, 4, 3, 5, 3)
coalesce(z, x, y)
## [1] 1 4 3 5 3
44 / 52

Let's do this again, but in a different order, this time z, x, and y. There are no missing values in vector z, so it is returned as-is.

Coalesce is a very useful function, but we need to be mindful of how we order and prioritize its inputs. Let's use coalesce in a more realistic scenario, to really illustrate its potential.

Sample  Main.Sensor  Backup.Sensor  Manual.Reading
A       NA           12.1           12.0
B       12           NA             11.9
C       16           NA             NA
D       3            9.5            NA
E       NA           13.0           12.0
sensors <-
  tibble::tribble(
    ~Sample, ~Main.Sensor, ~Backup.Sensor, ~Manual.Reading,
    "A", NA, 12.1, 12,
    "B", 12L, NA, 11.9,
    "C", 16L, NA, NA,
    "D", 3L, 9.5, NA,
    "E", NA, 13, 12
  )
45 / 52

These data have temperature readings for different samples from two automated sensors, plus a manual reading with an external device.

For whatever reason, there are missing values in the main sensor data, so we can coalesce the main, backup, and manual readings into a single column. These data have an inherent priority and order, in which we want to fill in gaps in the main sensor data with data from the backup sensor, with the manual readings as a third option.


sensors %>%
  mutate(Final.Reading = coalesce(Main.Sensor, Backup.Sensor, Manual.Reading))
## # A tibble: 5 x 5
## Sample Main.Sensor Backup.Sensor Manual.Reading Final.Reading
## <chr> <int> <dbl> <dbl> <dbl>
## 1 A NA 12.1 12 12.1
## 2 B 12 NA 11.9 12
## 3 C 16 NA NA 16
## 4 D 3 9.5 NA 3
## 5 E NA 13 12 13
46 / 52

The code would look like this: we use mutate to create a new column called final reading, which is the output from coalescing the three columns with the data. In this case, we specified the same order in which they appear in the data. The new column combines data from the first column with values from the second one, and the third one was not needed.

This animation might also show the process better. If we have a record of the process and we don’t overwrite our original data, we may even want to subset the data to keep only the new column.
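
A sketch of that last step, keeping only the sample identifier and the coalesced reading:

sensors %>%
  mutate(Final.Reading = coalesce(Main.Sensor, Backup.Sensor, Manual.Reading)) %>%
  select(Sample, Final.Reading)  # drop the three source columns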

Filling Adjacent Values

48 / 52

To wrap up, let's see how we can fill adjacent values in R. Doing this helps us avoid unnecessary NA values and makes for less ambiguous data. The logic is similar to the one you may be familiar with from filling adjacent values in spreadsheet software.

fill() from 📦 tidyr

animal  population
Bird    21
NA      5
NA      7
NA      23
NA      74
Frog    23
NA      65
NA      12
NA      5


animal_populations

  • Fill missing values in a column (top to bottom by default)

  • Contiguous/adjacent values assumed to be the same

  • Values only appear when they change along a column
49 / 52

We'll use the fill function from the tidyr package. The usage is straightforward.

Look at this two-column table with data on animal populations. To avoid repetition, values in the "animal" column only appear when they change to a different one.

Fill will copy the value down into missing values, assuming that contiguous values are the same, and will stop at the next non-missing value.
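
fill() works top to bottom by default; the .direction argument changes this. A quick sketch (with this table the default "down" is what we want, so this is just to show the argument):

animal_populations %>%
  fill(animal, .direction = "up")  # fill upward instead; "downup" and "updown" are also accepted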

fill()

animal_populations
## # A tibble: 9 x 2
## animal population
## <chr> <dbl>
## 1 Bird 21
## 2 <NA> 5
## 3 <NA> 7
## 4 <NA> 23
## 5 <NA> 74
## 6 Frog 23
## 7 <NA> 65
## 8 <NA> 12
## 9 <NA> 5
animal_populations %>%
  fill(animal)
## # A tibble: 9 x 2
## animal population
## <chr> <dbl>
## 1 Bird 21
## 2 Bird 5
## 3 Bird 7
## 4 Bird 23
## 5 Bird 74
## 6 Frog 23
## 7 Frog 65
## 8 Frog 12
## 9 Frog 5
50 / 52

We can specify which variable we want to fill, by name, as an argument to the fill function. Here we are filling the animal column, and we can see that the values were copied down, replacing the NAs.

My Turn

  • Load the fish landings data 'fish-landings.csv'

  • Fill the 'Fish' and 'Lake' columns

  • Create a new column, coalescing the three numeric variables
51 / 52
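
The worked example is not captured in this export, and only the 'Fish' and 'Lake' column names are known, so this is a shape sketch in which the three numeric column names are hypothetical:

library(dplyr)
library(tidyr)

fish_landings <- read.csv("fish-landings.csv")

fish_landings %>%
  fill(Fish, Lake) %>%  # carry the last non-missing value down each column
  mutate(combined = coalesce(reading_1, reading_2, reading_3))  # hypothetical numeric column names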

Your Turn

  • Load the fish landings data 'fish-landings.csv'

  • Fill the 'Fish' and 'Lake' columns

  • Reorder the numeric variables ('Commission reported total' first) and create a new column, coalescing the three numeric variables
52 / 52
