In previous lessons, we defined data cleaning and learned how to organize our data. Now we can get started with using R, and in particular, on first getting our data into sensible, usable structures.
Let’s begin by going over the tools and packages that we’ll be using for this part and throughout this course.
When working in R for data cleaning, we’ll be calling on functions from two popular packages: dplyr and tidyr. Both are written and maintained by RStudio staff with extensive community contributions. They are part of a larger set of packages referred to as the tidyverse.
📦 dplyr
📦 tidyr
Part of the tidyverse
Flexible, but developed around built-in features that support a particular way of doing things
These tools are meant to be flexible, but they are built around features that support a particular way of doing specific tasks.
Functions in different tidyverse packages follow tidy data principles, so they work well together. And, by having specialized purposes, each one does only one thing, but does it well.
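If you want to follow along, here is a minimal setup sketch, assuming the packages are not installed yet:

install.packages(c("dplyr", "tidyr"))  # run once if the packages are missing
library(dplyr)
library(tidyr)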
Having established that this course is tidyverse-oriented, let’s go over another tool that we’ll be using to make our code easy to read.
When we work with functions to perform many sequential operations, we have three options.
To perform multiple operations in sequence:
Nested code
Intermediate objects
Pipes
We can use nested code, in which function calls are nested inside other function calls so that the results are evaluated from the inside out.
We can also assign objects with intermediate results, which are passed on to the next function in our sequence.
Or, as we’ll be doing in this course, we can use pipes.
Let me show you what this looks like.
grades <- c(7, 8, 8, 10, 6, 9, 5, 9, 8)
round(mean(grades), digits = 2)
## [1] 7.78
Nested code looks like this.
Here, we have a vector ‘grades’ with some numeric values. Our sequence of operations is to first get the mean grade, and then round this result. In nested code, the mean function is nested inside the round function.
mn_grades <- mean(grades)
rd_mn_grades <- round(mn_grades, digits = 2)
rd_mn_grades
## [1] 7.78
Intermediate objects work in a similar way. First we assign an object with the mean of the grades, then we pass this object to the rounding function.
Notice how both approaches give us the same result.
For structuring sequential operations left-to-right:
Left Hand Side (LHS)
pipe operator %>% (from 📦 magrittr)
Right Hand Side (RHS)
Alternatively, we can use pipes to avoid nested code or having to assign results to intermediate objects that we don’t really need.
Pipes let us structure sequential operations from left to right.
When we work with pipes, there is always something on the left hand side of a pipe operator, and something else on the right hand side.
The pipe operator appears in orange here. It’s originally from the magrittr package.
More specifically, pipes take an object on the left hand side of the pipe operator and insert it into a function on the right hand side.
Take an object on the Left Hand Side of the pipe and insert it into a function as an argument on the Right Hand Side
By default, LHS is placed as the first argument in the call
By default, the object on the left hand side becomes the first argument in the function call.
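As a quick illustration, assuming the pipe is available (for example, after loading magrittr or a tidyverse package), these two calls on the grades vector from earlier are equivalent:

mean(grades)       # grades passed explicitly as the first argument
grades %>% mean()  # the pipe inserts grades as the first argument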
Packages in the tidyverse load %>% automatically
By the way, the pipe operator gets loaded by the tidyverse packages that we’ll be using.
Let me summarize the main advantages of using pipes to write code.
First, by chaining our operations with pipes we pass the results onto the next function, so we don’t need nested code or to assign intermediate results to objects we won’t use again.
This makes our code more readable, and
it’s also easier to add or remove steps in our sequence of operations.
A useful shortcut to remember is that in RStudio we can insert pipes with Ctrl+Shift+M, or Cmd+Shift+M on a Mac.
Minimize the need for intermediate objects and nested code
Make code readable
Easy to add or remove steps at any point in the sequence of operations
Insert with Ctrl + Shift + M
library(magrittr)
grades %>% mean() %>% round(digits = 2)
## [1] 7.78
The example from earlier looks like this if we use pipes. First we load the magrittr package, take the grades object, pass it to the mean function, and then pass this result to the round function. Note how we don’t have to write the first argument in functions that are on the right hand side of a pipe.
We can now learn about the functions from dplyr that we’ll use extensively throughout this course.
As I mentioned earlier, the dplyr package provides us with separate functions for doing the most common data manipulation tasks.
dplyr functions
Individual functions for the most common operations
Each function does only one thing, but does it well
Intuitive, user-friendly functions for common data manipulation tasks:
Subset columns
Subset rows
Create or modify columns
When we’re cleaning data, we usually do at least one, if not all three, of the following: subset columns, subset rows, or create or modify columns.
name | style | alc_by_volume | price_range | country | star_rating |
---|---|---|---|---|---|
County | Lager | 4.2 | low | USA | 3.40 |
Mountain Special | Pale Ale | 4.1 | low | USA | 2.70 |
Lagartos | Lager | 5.1 | mid | Mexico | 4.10 |
Jealous Hobo | Porter | 6.5 | mid | USA | 3.36 |
Strasse | Stout | 8.3 | high | Germany | 4.89 |
Arbre | Pale Ale | 3.4 | low | Belgium | 4.01 |
K1997 | Lager | 4.0 | high | Japan | 3.88 |
Lagartos Cero | Lager | 0.0 | mid | Mexico | 2.00 |
Airball | Irish Ale | 6.0 | high | USA | 3.64 |
Clean Tackle | Stout | 7.2 | mid | England | 4.20 |
beers <- tibble::tribble(
  ~name,              ~style,      ~alc_by_volume, ~price_range, ~country,  ~star_rating,
  "County",           "Lager",     4.2,            "low",        "USA",     3.4,
  "Mountain Special", "Pale Ale",  4.1,            "low",        "USA",     2.7,
  "Lagartos",         "Lager",     5.1,            "mid",        "Mexico",  4.1,
  "Jealous Hobo",     "Porter",    6.5,            "mid",        "USA",     3.36,
  "Strasse",          "Stout",     8.3,            "high",       "Germany", 4.89,
  "Arbre",            "Pale Ale",  3.4,            "low",        "Belgium", 4.01,
  "K1997",            "Lager",     4,              "high",       "Japan",   3.88,
  "Lagartos Cero",    "Lager",     0,              "mid",        "Mexico",  2,
  "Airball",          "Irish Ale", 6,              "high",       "USA",     3.64,
  "Clean Tackle",     "Stout",     7.2,            "mid",        "England", 4.2
)
These data describe some fictional beers and their properties.
select()
Subset variables
beers %>% select(name, style, star_rating)
The first thing we might want to try is to subset some variables from the original table. For this we use the select function. select helps us subset variables, which we can specify by name, by position in the data from left to right, or even by type.
In the example, we’re selecting three variables from the beers object, and we’re doing that with a pipe and by passing their names to the select function.
The beers object is passed by the pipe as the first argument of the select function, and the other arguments are the names of the variables we want to keep, which are name, style, and star rating.
mutate()
Create and modify columns
beers %>%
  select(name, style, star_rating) %>%
  mutate(stars = round(star_rating, digits = 0))
Next up is creating new columns or modifying existing ones. We can do this with the mutate function. The syntax for creating new columns is quite flexible, but typically we need to specify the column’s name and its contents. Let’s see what’s happening in the example.
To the transformation from before, in which we selected some variables from the beers object, we are adding an additional step using a pipe. With mutate, we’re creating a new variable called stars, which is itself the result of rounding the values of an existing column.
When we use mutate, we create new columns, but if the name of the new column is the same as an existing one, it will be overwritten.
filter()
Subset rows
beers %>%
  select(name, style, star_rating) %>%
  mutate(stars = round(star_rating, digits = 0)) %>%
  filter(stars > 3)
Finally, we can also work on rows. To subset rows that meet a condition, we use the filter function. Again, we’re using a pipe to add a new step to the sequence from before. With the new variable stars, we want to keep only the rows in which the values of stars are greater than 3.
Notice how with the pipes we can keep track of the different steps, and that we didn’t have to work with intermediate objects for the results from each step. This is how we will work throughout this course.
Now, it’s my turn to use pipes and functions from dplyr to transform a dataset.
Load the mammal sleep data bundled with ggplot2
Select the "name" and "conservation" columns, as well as all numeric variables
Create a new column that contains the values of 'awake' multiplied by the values in 'brainwt'
Filter the data to keep only rows with 'sleep_total' > 4
I'll use pipes to chain the operations
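Here is a sketch of one possible solution; it assumes the msleep dataset bundled with ggplot2, and the name of the new column (awake_brainwt) is just an illustrative choice:

library(ggplot2)  # provides the msleep dataset
library(dplyr)

msleep %>%
  select(name, conservation, where(is.numeric)) %>%
  mutate(awake_brainwt = awake * brainwt) %>%  # illustrative name for the new column
  filter(sleep_total > 4)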
Load the mammal sleep data bundled with ggplot2
Select "name" and "conservation" columns and those that include the string 'sleep' in their name
Create a new column that contains the values of 'sleep_total' multiplied by 3
Filter the data to remove domesticated mammals
Use pipes to chain the operations!
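A sketch of one way to approach this; the new column name sleep_total_x3 is hypothetical, and the last step assumes domesticated mammals are flagged as "domesticated" in the conservation column:

library(ggplot2)  # provides the msleep dataset
library(dplyr)

msleep %>%
  select(name, conservation, contains("sleep")) %>%
  mutate(sleep_total_x3 = sleep_total * 3) %>%  # hypothetical column name
  filter(conservation != "domesticated" | is.na(conservation))  # a bare != would also drop NAs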
In all the previous examples in which we selected or modified columns, we were either working on them one at a time or having to name each one.
across()
filter or mutate multiple columns
Thanks to some recent updates to dplyr, we can operate on multiple columns at once using the across function.
We can specify many columns at once by position, name, or type. I don’t recommend using positions, because these are likely to change and are less meaningful.
By position, name, or type
Compound selections of columns
(e.g. numeric columns that contain "mass" in their name)
More importantly, we can use across to create compound selections of columns; for example, selecting or mutating all columns of type numeric, but only if they contain a specific string in their name.
across has a relatively straightforward syntax.
across() arguments:
.cols: Columns to transform
.fns: Function to apply to each of the selected columns
.cols is how we specify the columns we want to transform, and .fns is the function to apply to each of the columns.
Expects a function, so function name and arguments can be passed separately
beers %>% mutate(across(c(alc_by_volume, star_rating), round, digits = 3))
If you recall the beers data from earlier, we can use across to round all the numeric variables at the same time. The .cols argument is a vector with the names of the two columns we’re interested in, and after that comes the function that we wish to apply and any additional arguments. Here, it’s the round function, with the digits argument passed on as well. Notice we used just the function name, without any parentheses.
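As a side note, newer versions of dplyr prefer wrapping the function and its arguments in a purrr-style lambda; a sketch of an equivalent call:

beers %>%
  mutate(across(c(alc_by_volume, star_rating), ~ round(.x, digits = 3)))  # ~ and .x define the lambda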
To specify sets of columns, there are a number of useful helper functions that we can use alongside or instead of the across function.
everything()
Match all variables
beers %>% mutate(across(everything(), as.character))
First up is everything, which matches all variables. In the example, we’re transforming every variable in the beers object to character.
!
Take the complement of a set of variables (negate a selection)
beers %>% select(!c(style, country))
We can also use the exclamation mark to take the complement of a set of variables. In the example, the exclamation mark negates the set of variables that are specified by name, so we get everything except style and country.
where()
Selects variables for which a predicate function returns TRUE
e.g. subset or transform all numeric variables, determined by
is.numeric()
beers %>% mutate(across(where(is.numeric), log2))
where is somewhat more elaborate: it works together with any other function that returns TRUE or FALSE, and then selects or mutates only those variables for which the condition is TRUE.
For example, say we want to apply a base-two logarithm to all variables of type numeric in our beers object. We use the is.numeric function to check all the variables, and then the log2 function will only be applied to variables that meet the condition.
matches()
Match variable names with a regular expression
e.g. drop variables with numbers in their name
beers %>% mutate(across(matches("rating"), log2))
matches is good for working with variable names, operating only on those that contain a match for our pattern of interest. This can be very useful, and we will see the power of regular expressions in a separate lesson.
In the example code, we’re matching any variables whose name contains the word rating, and applying a base-two log transformation.
:
Select a range of consecutive variables
beers %>% select(style:country)
The colon, when used as a selection helper, matches a range of variables that are next to each other in the data. In the example, we’ll be keeping style, country, and every other variable between these two.
-
Exclude variables
(return all elements except the one preceded by the subtraction operator)
beers %>% select(-country)
Finally, we can use the minus sign to exclude variables from a selection. In the example, we exclude country and keep everything else.
beers
name | style | alc_by_volume | price_range | country | star_rating |
---|---|---|---|---|---|
County | Lager | 4.2 | low | USA | 3.40 |
Mountain Special | Pale Ale | 4.1 | low | USA | 2.70 |
Lagartos | Lager | 5.1 | mid | Mexico | 4.10 |
Jealous Hobo | Porter | 6.5 | mid | USA | 3.36 |
Strasse | Stout | 8.3 | high | Germany | 4.89 |
Arbre | Pale Ale | 3.4 | low | Belgium | 4.01 |
K1997 | Lager | 4.0 | high | Japan | 3.88 |
Lagartos Cero | Lager | 0.0 | mid | Mexico | 2.00 |
Airball | Irish Ale | 6.0 | high | USA | 3.64 |
Clean Tackle | Stout | 7.2 | mid | England | 4.20 |
Let’s have another look at the beers data, and apply some transformations to many columns at once. Just looking at this table, we can see two numeric variables: alcohol by volume and star rating. Just for demonstration, let’s use the mutate function to apply a numeric transformation to these two columns.
Square root (sqrt()) of alc_by_volume and star_rating
beers %>% mutate(across(c(alc_by_volume, star_rating), sqrt))
## # A tibble: 10 x 6
##    name             style     alc_by_volume price_range country star_rating
##    <chr>            <chr>             <dbl> <chr>       <chr>         <dbl>
##  1 County           Lager              2.05 low         USA            1.84
##  2 Mountain Special Pale Ale           2.02 low         USA            1.64
##  3 Lagartos         Lager              2.26 mid         Mexico         2.02
##  4 Jealous Hobo     Porter             2.55 mid         USA            1.83
##  5 Strasse          Stout              2.88 high        Germany        2.21
##  6 Arbre            Pale Ale           1.84 low         Belgium        2.00
##  7 K1997            Lager              2    high        Japan          1.97
##  8 Lagartos Cero    Lager              0    mid         Mexico         1.41
##  9 Airball          Irish Ale          2.45 high        USA            1.91
## 10 Clean Tackle     Stout              2.68 mid         England        2.05
To apply the square root function to these two columns, one option would be to refer to them by name inside across.
Square root (sqrt()) of all numeric variables
beers %>% mutate(across(where(is.numeric), sqrt))
## # A tibble: 10 x 6
##    name             style     alc_by_volume price_range country star_rating
##    <chr>            <chr>             <dbl> <chr>       <chr>         <dbl>
##  1 County           Lager              2.05 low         USA            1.84
##  2 Mountain Special Pale Ale           2.02 low         USA            1.64
##  3 Lagartos         Lager              2.26 mid         Mexico         2.02
##  4 Jealous Hobo     Porter             2.55 mid         USA            1.83
##  5 Strasse          Stout              2.88 high        Germany        2.21
##  6 Arbre            Pale Ale           1.84 low         Belgium        2.00
##  7 K1997            Lager              2    high        Japan          1.97
##  8 Lagartos Cero    Lager              0    mid         Mexico         1.41
##  9 Airball          Irish Ale          2.45 high        USA            1.91
## 10 Clean Tackle     Stout              2.68 mid         England        2.05
Alternatively, we can use the where function with is.numeric to simplify our code and apply the square root function to every numeric variable. Both approaches give the same result.
Load the midwest data bundled with ggplot2
Keep only rows for Wisconsin (WI)
Subset the 'county' column and all columns that match the string 'perc' (hint: use a selection helper)
log-transform all numeric variables
Append the suffix "_check" to every value
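One possible solution sketch, assuming the midwest dataset bundled with ggplot2 and interpreting "log-transform" as the natural log; note that appending the suffix coerces every column to character:

library(ggplot2)  # provides the midwest dataset
library(dplyr)

midwest %>%
  filter(state == "WI") %>%
  select(county, matches("perc")) %>%
  mutate(across(where(is.numeric), log)) %>%            # natural log, one interpretation
  mutate(across(everything(), ~ paste0(.x, "_check")))  # coerces all columns to character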
Load the midwest data bundled with ggplot2
Keep only rows for Ohio (OH)
Subset the 'county' column and all columns that match the string 'pop' (hint: use a selection helper)
Square-root transform all numeric variables
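And a sketch for this variant, under the same assumptions about the midwest data:

midwest %>%
  filter(state == "OH") %>%
  select(county, contains("pop")) %>%
  mutate(across(where(is.numeric), sqrt))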
In the lesson on data organization, we went over tidy data principles, and glanced over long and wide structures for data. Now, we can go ahead and use R to reshape and pivot our data between these different structures.
Wide data: human-readable, easier entry, easier interactive editing
Long data: often meets Tidy Data principles, analysis-ready, maps well to plot axes and aesthetics
As the names suggest, data can take either wide or long structures. The difference between these two structures goes beyond just having different proportions of rows and columns; this goes back to tidy data principles.
Wide data is what we see often and what we use for data entry or reporting. This structure is readable, concise, and easy to edit.
On the other hand, long data is more likely to be tidy and more ready for use in our workflows. The fundamental difference lies with how variables and observations relate to rows and columns. The good news is that we can transform the data between structures.
To reshape data from wide to long and vice versa, we’ll use functions from the tidyr package, which, if you remember, is part of the tidyverse.
The names of the relevant functions are also quite easy to remember.
tidyr
wide to long: pivot_longer()
pivot_longer(data, cols, names_to, values_to)
To go from wide to long, we use pivot_longer
long to wide: pivot_wider()
pivot_wider(data, names_from, values_from)
To go from long to wide, we use pivot_wider
Habitat | County | Frogs | Crayfish | Owls |
---|---|---|---|---|
Marsh | Clarkville | 7 | 5 | 4 |
Urban | Smith | 0 | 0 | 2 |
Wetland | Wilson | 9 | 2 | 1 |
Forest | Smith | 4 | 0 | 7 |
critters <- tibble::tribble(
  ~Habitat,  ~County,      ~Frogs, ~Crayfish, ~Owls,
  "Marsh",   "Clarkville", 7L,     5L,        4L,
  "Urban",   "Smith",      0L,     0L,        2L,
  "Wetland", "Wilson",     9L,     2L,        1L,
  "Forest",  "Smith",      4L,     0L,        7L
)
I’ll illustrate these transformations with this simple dataset that enumerates some wildlife sightings across different habitats and counties. Just looking at this table, we can see that it is wider than it is long.
Variables (attributes measured):
Habitat
County
Critter Type
Individuals Encountered
Going deeper into the structure of the data, though, we will come to realize that this wide data is not tidy. Why? Because there are actually four attributes being measured, but their values do not all correspond to columns. Specifically, the values for critter type and the number of individuals encountered are not in their own columns. We can still interpret the information stored in these data, but for many purposes a long, tidy version might be preferable. Let’s see how we can go about reshaping this critter data.
critters_long <- critters %>%
  pivot_longer(
    cols = c(Frogs, Crayfish, Owls),
    names_to = "critter_type",
    values_to = "individuals_encountered"
  )
For this, we use the pivot longer function. Notice again how we’re using pipes. Pivot longer has three main arguments that we need to specify.
Cols are the columns that actually contain data in their names. In our example, these would be the columns for each of the little animals, which are actually levels of an attribute.
Having recognized that there were data in the column names that should be in their own column, we can give that new column a name. This is what the names_to argument is for. In this case, let’s call the new column critter_type.
cols: Which columns have data in their names
names_to: Name for the new column to be created from cols
values_to: Name of the column to be created from the contents of the cells in cols
Finally, the content of the columns we specified with the cols argument also needs to go into a separate variable, and it will need a name. In our example, this would be the number of individuals spotted, and we can use that as a name.
critters %>%
  pivot_longer(
    cols = c(Frogs, Crayfish, Owls),
    names_to = "critter_type",
    values_to = "individuals_encountered"
  )
## # A tibble: 12 x 4
##    Habitat County     critter_type individuals_encountered
##    <chr>   <chr>      <chr>                          <int>
##  1 Marsh   Clarkville Frogs                              7
##  2 Marsh   Clarkville Crayfish                           5
##  3 Marsh   Clarkville Owls                               4
##  4 Urban   Smith      Frogs                              0
##  5 Urban   Smith      Crayfish                           0
##  6 Urban   Smith      Owls                               2
##  7 Wetland Wilson     Frogs                              9
##  8 Wetland Wilson     Crayfish                           2
##  9 Wetland Wilson     Owls                               1
## 10 Forest  Smith      Frogs                              4
## 11 Forest  Smith      Crayfish                           0
## 12 Forest  Smith      Owls                               7
Let’s see how the long table looks.
The same information is now in a long, tidy version. Notice how the function handled all the combinations for us, and how there is more repetition with this structure. Now, each of the four measured attributes is in a column and each row is now an observation with a value for each variable.
We can just as easily reshape the long version back to a wide version.
For this, we use two main arguments.
First, we need to specify which column contains the values that will become the new column names; there will be as many new columns as there are unique values in that column. We do this with the names_from argument.
names_from: Which column's values will be used for new column names
values_from: Which column has the cell values for the new output columns
Then, we need to specify which column has the values that will populate the new columns. We use values from for this.
critters_long %>%
  pivot_wider(
    names_from = critter_type,
    values_from = individuals_encountered
  )
## # A tibble: 4 x 5
##   Habitat County     Frogs Crayfish  Owls
##   <chr>   <chr>      <int>    <int> <int>
## 1 Marsh   Clarkville     7        5     4
## 2 Urban   Smith          0        0     2
## 3 Wetland Wilson         9        2     1
## 4 Forest  Smith          4        0     7
With our example, the new “critter type” column has the names of the critters that we want as separate columns, and the “individuals encountered” column has the numeric values that will also be spread out into the new columns.
When we run this code, we get our initial wide dataset back.
Maybe an animated version of this transformation can illustrate the process better. Look at how the same information can appear in different layouts. Each one has its advantages and disadvantages, but it’s important to know that it’s not an issue to shift between the two.
Load the dog ranks data ("dogranks_mine.csv")
Pivot the data (wide to long and back to wide)
Load the dog ranks data ("dogranks_your.csv")
Pivot the data (wide to long and back to wide)
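The layout of the dog ranks files is not shown here, so this is only a sketch: it assumes a wide table with one identifying column followed by one column per ranking period, and the names year and rank are hypothetical:

library(readr)  # read_csv()
library(tidyr)  # pivot_longer(), pivot_wider(); also re-exports %>%

dogranks <- read_csv("dogranks_mine.csv")

dogranks_long <- dogranks %>%
  pivot_longer(cols = -1, names_to = "year", values_to = "rank")  # pivot everything but the first column

dogranks_long %>%
  pivot_wider(names_from = year, values_from = rank)  # back to the original wide layout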
By now, we’ve gone over subsetting rows and columns, creating new columns, and reshaping entire datasets between wide and long structures. These tools are pretty much everything we need to restructure our data. Now, we’ll learn about two more functions from dplyr and tidyr that may come in handy during the restructuring stage of our data cleaning.
First, we’ll learn about coalescing columns.
Coalescing values sounds strange and complicated, but as we’ll see, this comes up often during data cleaning and the important thing is that we can avoid having to do this by hand.
Coalescing columns is essentially lumping together two or more columns with data for the same variable, in which there may be missing values. We want to combine the data, keeping values from one column and filling any missing data with values from other columns.
To do this, we use the coalesce function from dplyr.
dplyr
Technically, coalesce finds and returns the first non-NA value at each position for two or more vectors. Remember, columns in rectangular data are also vectors.
Find and return the first non-NA value at each position for a set of vectors
Returns a single vector with the non-NA values that appear first
When we coalesce a set of vectors, the function returns a single vector with the non-NA values that appear first.
x <- c(1, 2, NA, NA, 5)
y <- c(NA, NA, 3, 4, 5)
z <- c(1, 4, 3, 5, 3)
Let’s demonstrate this with some simple vectors. x, y, and z are numeric vectors of the same length, and some have missing values.
coalesce(x, y, z)
## [1] 1 2 3 4 5
We want a single vector with as few NAs as possible. If we coalesce these three vectors in this particular order, the values from vector y at the third and fourth positions fill in the missing values in vector x, and then vector z is not used.
The order of the vectors here is important.
Let’s do this again, but in a different order: this time z, x, and y. There are no missing values in vector z, so it is returned as is.
coalesce(z, x, y)
## [1] 1 4 3 5 3
coalesce is a very useful function, but we need to be mindful of how we order and prioritize its inputs. Let’s use coalesce in a more realistic scenario to really illustrate its potential.
Sample | Main.Sensor | Backup.Sensor | Manual.Reading |
---|---|---|---|
A | NA | 12.1 | 12.0 |
B | 12 | NA | 11.9 |
C | 16 | NA | NA |
D | 3 | 9.5 | NA |
E | NA | 13.0 | 12.0 |
sensors <- tibble::tribble(
  ~Sample, ~Main.Sensor, ~Backup.Sensor, ~Manual.Reading,
  "A",     NA,           12.1,           12,
  "B",     12L,          NA,             11.9,
  "C",     16L,          NA,             NA,
  "D",     3L,           9.5,            NA,
  "E",     NA,           13,             12
)
These data have temperature readings for different samples from two automated sensors, plus a manual reading with an external device.
For whatever reason, there are missing values in the main sensor data, so we can coalesce the main, backup, and manual readings into a single column. These data have an inherent priority and order, in which we want to fill in gaps in the main sensor data with data from the backup sensor, with the manual readings as a third option.
sensors %>%
  mutate(Final.Reading = coalesce(Main.Sensor, Backup.Sensor, Manual.Reading))
## # A tibble: 5 x 5
##   Sample Main.Sensor Backup.Sensor Manual.Reading Final.Reading
##   <chr>        <int>         <dbl>          <dbl>         <dbl>
## 1 A               NA          12.1           12            12.1
## 2 B               12          NA             11.9          12
## 3 C               16          NA             NA            16
## 4 D                3           9.5           NA             3
## 5 E               NA          13             12            13
The code would look like this: we use mutate to create a new column called Final.Reading, which is the output from coalescing the three columns with the data. In this case, we specified the same order in which they appear in the data. The new column combines data from the first column with values from the second one, and the third one was not needed.
This animation might also show the process better. If we have a record of the process and we don’t overwrite our original data, we may even want to subset the data to keep only the new column.
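For instance, a small sketch of that last idea, keeping only the sample ID and the coalesced reading:

sensors %>%
  mutate(Final.Reading = coalesce(Main.Sensor, Backup.Sensor, Manual.Reading)) %>%
  select(Sample, Final.Reading)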
To wrap up, let’s see how we can fill adjacent values in R. Doing this helps us avoid unnecessary NA values and makes for less ambiguous data. The logic is similar to the one you may be familiar with from filling adjacent values in spreadsheet software.
We’ll use the fill function from the tidyr package. The usage is straight forward.
tidyr
animal | population |
---|---|
Bird | 21 |
NA | 5 |
NA | 7 |
NA | 23 |
NA | 74 |
Frog | 23 |
NA | 65 |
NA | 12 |
NA | 5 |
animal_populations
We’ll use the fill function from the tidyr package. The usage is straight forward.
Look at this two-column table with data on animal populations. To avoid repetition, values in the animal column only appear when they change to a different one.
Fill will copy the value down into missing values, assuming that contiguous values are the same, and will stop at the next non-missing value.
animal_populations
## # A tibble: 9 x 2
##   animal population
##   <chr>       <dbl>
## 1 Bird           21
## 2 <NA>            5
## 3 <NA>            7
## 4 <NA>           23
## 5 <NA>           74
## 6 Frog           23
## 7 <NA>           65
## 8 <NA>           12
## 9 <NA>            5
animal_populations %>% fill(animal)
## # A tibble: 9 x 2
##   animal population
##   <chr>       <dbl>
## 1 Bird           21
## 2 Bird            5
## 3 Bird            7
## 4 Bird           23
## 5 Bird           74
## 6 Frog           23
## 7 Frog           65
## 8 Frog           12
## 9 Frog            5
We can specify which variable we want to fill, by name, as an argument to the fill function. Here we are filling the animal column, and we can see that the values were copied down, replacing the NAs.
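fill works downwards by default; if the known value sat at the bottom of each block instead, the .direction argument would handle it. A small sketch:

animal_populations %>% fill(animal, .direction = "down")  # the default
animal_populations %>% fill(animal, .direction = "up")    # fills upwards; "downup" and "updown" also exist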