Manipulating Data
Bending data to answer your questions with dplyr.
To manipulate data in the tidyverse, we go to the dplyr package - a toolbox of different "data pliers" we can use to modify and manipulate our dataset. Just like readr, dplyr also comes with a handy cheatsheet for data transformation, which you can find here.
Today we're going to cover 5 key functions that'll come in handy in your data transformation. These are filter(), select(), mutate(), summarise() and group_by().
But before we get onto that, let's talk about another key feature in the tidyverse that makes reading and writing code a breeze! Ladies and gentlemen, meet the pipe, %>%.
Piping your data
The pipe is a useful tool that simplifies writing code by allowing us to string together functions into a single pipeline, which can be executed all at once. Let's take another look at our cleaning code from the last chapter.
Here we have each command running by itself, and storing data into an intermediate variable called titanic_no_cabin in the middle. While this is perfectly fine, if we do everything in a stepwise manner, we're quickly going to clog up our environment with a whole bunch of intermediate nonsense we don't really want.
Another way to approach this would be to combine these into a single command, like this.
This works fine too, but it's easy to get confused as to what is going on, and changing code on the go can be a right pain. Instead, we can use the pipe to string functions together like this.
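As a rough sketch of the three styles, assuming the cleaning amounted to dropping a Cabin column and then removing rows with a missing Age (the second step, the column names and the toy values are assumptions, not the code from the last chapter):

```r
library(dplyr)

# Hypothetical stand-in for the raw Titanic data (made-up values)
titanic <- tibble(
  Name  = c("Passenger A", "Passenger B", "Passenger C"),
  Age   = c(35, NA, 8),
  Cabin = c("C85", NA, NA)
)

# 1. Stepwise, with an intermediate variable left in the environment
titanic_no_cabin <- select(titanic, -Cabin)
titanic_stepwise <- filter(titanic_no_cabin, !is.na(Age))

# 2. Nested into a single command (has to be read from the inside out)
titanic_nested <- filter(select(titanic, -Cabin), !is.na(Age))

# 3. Piped (reads from top to bottom, one step per line)
titanic_piped <- titanic %>%
  select(-Cabin) %>%
  filter(!is.na(Age))
```

All three give the same result; only the piped version avoids both the intermediate variable and the inside-out reading order.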
The pipe takes the output from the function on the left and uses it as the first input for the function on the right (or next line down). In a nutshell, the following two commands are equivalent:
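A minimal illustration of that equivalence, using a made-up data frame (toy and its column x are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(x = c(1, 2, 3))

# Without the pipe: the data frame is passed as the first argument
without_pipe <- filter(toy, x > 1)

# With the pipe: the left-hand side is slotted in as the first argument
with_pipe <- toy %>% filter(x > 1)

# Check that both commands return the same data frame
identical(without_pipe, with_pipe)
```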
The pipe makes it easy to follow and edit code. If you want more resources on the pipe, you can check this video or the documentation here.
Hint: Rather than typing out the symbols %>% every time you want a pipe, you can use the RStudio shortcut Shift+Command+M on Mac or Shift+Control+M on Windows.
With pipes in hand, let's go transform some data!
Challenges
Challenge 1.1
filter()
The first function we're going to have a look at is filter(). This is a useful function for selecting a subset of observations (rows) of our data, based on whatever conditions we might want. Remember those comparisons we were using back in the first chapter? Now we get a chance to use them!
Let's say I want to look at the subset of passengers aboard the Titanic who were 35 years of age. Using filter(), this is super easy to do! All we need to do is tell the filter function we want the "Age" variable to be equal to 35.
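A minimal sketch of that filter, run on a small made-up stand-in for the Titanic data (the Name column and all values are assumptions):

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Age  = c(35, 40, NA, 35)
)

# Keep only the rows where Age is exactly 35
aged_35 <- titanic %>%
  filter(Age == 35)

aged_35
```

Note that filter() only keeps rows where the condition is TRUE, so passengers with a missing Age are dropped as well.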
In addition to our comparison operators (==, <=, >=, !=, < and >), we can also combine these using the logical and (&) and or (|) operators to do stuff like:
Find passengers who are 35 or 40...
Find passengers who are between 35 and 40...
Find female passengers from 1st class...
Or female passengers from 1st class who survived the voyage...
and the list goes on! Using filter(), you can slice up your dataset into whatever specific subsets you might want.
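The four subsets listed above could be sketched like this. The column names Sex, Pclass and Survived, the 0/1 coding of Survived, and treating "between" as inclusive are all assumptions, and the data is made up:

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name     = c("Passenger A", "Passenger B", "Passenger C", "Passenger D", "Passenger E"),
  Sex      = c("female", "male", "female", "male", "female"),
  Age      = c(35, 40, 38, 22, 8),
  Pclass   = c(1, 3, 1, 2, 3),
  Survived = c(1, 0, 0, 1, 1)
)

# Passengers who are 35 or 40
age_35_or_40 <- titanic %>% filter(Age == 35 | Age == 40)

# Passengers who are between 35 and 40 (inclusive)
age_35_to_40 <- titanic %>% filter(Age >= 35 & Age <= 40)

# Female passengers from 1st class
first_class_women <- titanic %>% filter(Sex == "female" & Pclass == 1)

# Female passengers from 1st class who survived
first_class_women_survived <- titanic %>%
  filter(Sex == "female" & Pclass == 1 & Survived == 1)
```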
Challenges
Challenge 2.1
Challenge 2.2
select()
Here's one we've already seen in action. While filter() works to subset our data on a row by row basis, select() lets us subset our data by columns. We can do this in a couple of ways: either by dropping variables we don't want by placing a minus sign (-) in front of them, like we did while cleaning the dataset, or by specifying which columns we want to keep by name.
Let's look at an example. Suppose we want to create a new data frame that only contains the names and the age of the passengers. You can do this by:
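For instance, assuming the columns are called Name and Age (the data here is a made-up stand-in and names_and_ages is a hypothetical variable name):

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C"),
  Age  = c(35, 40, 8),
  Fare = c(80, 8, 20)
)

# Keep only the Name and Age columns, by name
names_and_ages <- titanic %>%
  select(Name, Age)

# The same result, by dropping the column we don't want
names_and_ages_dropped <- titanic %>%
  select(-Fare)
```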
We can also combine this with filter to grab only the specific information we're interested in for our subsets. Say we want the names and fares of women in first class who survived. We can get that info like this:
Note that the select() function needs to come after the filter() function, otherwise the columns we're filtering on would already have been dropped and there wouldn't be anything for us to filter by!
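A sketch of that filter-then-select pipeline, assuming columns named Sex, Pclass, Survived (coded 0/1) and Fare, on made-up data:

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name     = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Sex      = c("female", "male", "female", "female"),
  Pclass   = c(1, 1, 1, 3),
  Survived = c(1, 1, 0, 1),
  Fare     = c(80, 50, 70, 8)
)

# Filter first (while Sex, Pclass and Survived still exist),
# then keep only the columns we want to look at
surviving_first_class_women <- titanic %>%
  filter(Sex == "female" & Pclass == 1 & Survived == 1) %>%
  select(Name, Fare)

surviving_first_class_women
```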
Challenges
Challenge 3.1
Challenge 3.2
mutate()
While filter() and select() are useful tools for reducing our dataset, mutate() allows us to do the opposite and generate new columns in our dataset.
Say we want to have a representation of the fare variable adjusted for inflation to today's prices, to give a better feel for the value of Titanic tickets. A little poking around suggests that we would need to multiply the fare values by roughly 82 to get an equivalent price (in 2020 AUD$). Using mutate() we can create a new variable in our dataset called fareToday.
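A minimal sketch, assuming the fare column is called Fare (the values here are made up):

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C"),
  Fare = c(7.25, 80, 10)
)

# Add a new column with the fare scaled by the rough factor of 82
titanic <- titanic %>%
  mutate(fareToday = Fare * 82)

titanic
```

Because the result is assigned back to titanic, the new column is kept for later steps; without the assignment, mutate() only prints the modified data frame.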
Now that we've set up an equivalent price, let's have a look at how expensive these tickets were.
$42,011 is a damn lot for a ticket!
mutate() is also useful for creating new variables for us to filter or sort by. Say we're interested in separating out children and adults aboard the Titanic. We can use mutate() to make a new logical variable stating whether a passenger was a child or adult using comparisons from our Age variable, like this.
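One way this might look, assuming a cut-off of 18 years and a new column called child (both are assumptions, and the data is made up):

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Age  = c(35, 8, 17, NA)
)

# child is TRUE for passengers under 18, FALSE otherwise,
# and NA where the age is missing
titanic <- titanic %>%
  mutate(child = Age < 18)

titanic
```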
This is quite a powerful feature, especially when combined with the next two tools, group_by() and summarise().
Challenges
Challenge 4.1
group_by() and summarise()
group_by() and summarise() are a pair of functions which work hand in hand to draw comparisons between, and answer questions about, subsets of our dataset.
summarise() is a function that creates a new data frame of summary statistics from our entire dataset for one or more variables. To use it, we call the summarise() function, with names for our new columns, and whatever summary functions we want to use within it. These summary functions are things like:
mean() to give us the mean value of a variable.
sd() to give us the standard deviation of a variable.
min() giving us the lowest value of a variable.
max() giving us the highest value of a variable.
n() giving us the number of observations (rows) in the current group.
and many more.
Let's go ahead and generate some summary statistics about the ages of passengers on the Titanic.
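A sketch of such a summary on made-up data (the new column names are hypothetical; na.rm = TRUE is included on the assumption that some ages may be missing):

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D", "Passenger E"),
  Age  = c(35, 40, 8, 22, NA)
)

# One row of summary statistics for the whole dataset
age_summary <- titanic %>%
  summarise(
    mean_age   = mean(Age, na.rm = TRUE),
    sd_age     = sd(Age, na.rm = TRUE),
    min_age    = min(Age, na.rm = TRUE),
    max_age    = max(Age, na.rm = TRUE),
    passengers = n()  # counts rows, including those with a missing Age
  )

age_summary
```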
Simple!
Where summarise() really gets powerful is when we combine it with group_by() to perform summaries on a group by group basis. Let's rerun those age summary statistics, but this time we're going to group by class beforehand.
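A grouped version might look like the sketch below, assuming the class column is called Pclass. The data is made up, so the numbers will not match the real dataset:

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Pclass = c(1, 1, 2, 3, 3, 3),
  Age    = c(35, 45, 30, 20, 8, 26)
)

# One row of summary statistics per passenger class
age_by_class <- titanic %>%
  group_by(Pclass) %>%
  summarise(
    mean_age   = mean(Age, na.rm = TRUE),
    passengers = n()
  )

age_by_class
```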
Now things are really getting interesting! We can see that passengers in 1st class were considerably older than those in 2nd and 3rd (means = 38.1, 29.9, and 25.1 respectively), and that there were far fewer 1st than 3rd class passengers (184 vs. 355).
With mutate(), group_by() and summarise(), we can answer all kinds of questions about our dataset!
Were males or females more likely to survive?
Who paid more? Kids or adults?
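Both questions can be approached with the same pattern. The sketch below assumes Survived is coded 0/1 (so its mean is the proportion that survived) and that a child is anyone under 18; the data is made up:

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Sex      = c("female", "male", "female", "male", "male", "female"),
  Age      = c(35, 40, 8, 10, 30, 28),
  Survived = c(1, 0, 1, 0, 1, 0),
  Fare     = c(80, 8, 20, 20, 50, 10)
)

# Were males or females more likely to survive?
survival_by_sex <- titanic %>%
  group_by(Sex) %>%
  summarise(survival_rate = mean(Survived))

# Who paid more? Kids or adults?
fare_by_age_group <- titanic %>%
  mutate(child = Age < 18) %>%
  group_by(child) %>%
  summarise(mean_fare = mean(Fare))
```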
And by grouping by multiple variables we can dive deeper into understanding the dataset, such as understanding the boarding dynamics of different classes.
Here we can see that the majority of passengers (especially those in 3rd class) embarked at Southampton, with a bunch more first class passengers joining at Cherbourg, and very few hopping aboard at Queenstown.
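Grouping by two variables at once could be sketched as follows, assuming the port column is called Embarked and uses the codes S, C and Q (made-up data, so the counts are illustrative only):

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data (made-up values)
titanic <- tibble(
  Pclass   = c(1, 1, 3, 3, 3, 2),
  Embarked = c("S", "C", "S", "S", "Q", "S")
)

# One row per combination of class and port of embarkation
boarding <- titanic %>%
  group_by(Pclass, Embarked) %>%
  summarise(passengers = n(), .groups = "drop")

boarding
```

The .groups = "drop" argument (available in dplyr 1.0 and later) returns an ungrouped data frame, which avoids surprises in later steps.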
Challenges
Challenge 5.1
Challenge 5.2
Saving your data
Having performed our analyses, the last thing we might want to do is save our data into a file somewhere. When manipulating data in the tidyverse, the only place anything is changing is within RStudio itself, not in the underlying files we loaded the data from.
Fortunately, it is easy to save data into a file for use by other programs, or to bring back into R at a later date.
Let's save our cleaned dataset into a new csv file named "titanic_cleaned.csv" using write_csv().
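A minimal sketch using readr, with a made-up data frame standing in for the cleaned dataset (the variable name titanic_cleaned is an assumption):

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for the cleaned Titanic data (made-up values)
titanic_cleaned <- tibble(
  Name = c("Passenger A", "Passenger B"),
  Age  = c(35, 40)
)

# Write the data frame to a csv file in the current working directory
write_csv(titanic_cleaned, "titanic_cleaned.csv")

# Read it back in to check that the file was written
titanic_reloaded <- read_csv("titanic_cleaned.csv")
```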
This will save a new csv file to your working directory, and should appear in your files panel looking a little like this.
If you want to change the working directory, you can call setwd() with the path to the new folder.
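For example, using base R's getwd() and setwd(). Here tempdir() is only a placeholder so the snippet runs anywhere; in practice you would pass the path to your own project folder:

```r
# Check which folder R is currently reading from and writing to
old_wd <- getwd()

# Point R at a different folder (tempdir() is a stand-in for your own path)
setwd(tempdir())
getwd()

# Switch back to the original folder
setwd(old_wd)
```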
Or you can select it manually by going to the menu and clicking on Session > Set Working Directory .
Challenges
Challenge 6.1
With these tools under your belt, you're now ready to go tackle your own datasets and read, transform, export, and question to your heart's content.
Go for it!