The Tidyverse
Your one stop shop for data analysis.
Last updated
Now that we're familiar with the RStudio interface, and the data types and structures used in R, we're ready to start manipulating and analysing our own real-world data with a little help from a collection of packages known as the tidyverse.
At its core, the tidyverse is a collection of packages designed to work together as a complete pipeline, covering every stage of a data analysis on tidy data. It serves as an alternative to the functions built into base R.
I use the tidyverse for my data analyses for two main reasons:
All the packages in the tidyverse fit together seamlessly and I don't need to worry about compatibility issues between different functions from different sources.
Tidyverse scripts are easier to write, read and understand than base R code (thanks largely to something called a pipe).
While messy data can be messy in myriad ways, all tidy data follows the same structure, allowing us to easily manipulate and transform our data however we want.
For a dataset to be considered tidy, it needs to follow three key rules:
Every different variable in our dataset gets a column to itself.
Every different observation or object measured in our dataset gets a row to itself.
Every different value in our dataset gets its own cell.
In practice, this looks a little something like this!
Remember that in a data frame, a column corresponds to a vector and a row corresponds to a list.
Let's look at the cats dataset we created in the data structures chapter. Is this a tidy dataset?
Yes it is! We have our four variables (name, colour, weight and hates_mondays) in their own columns, our four observations (Otis, Luna, Puss and Garfield) in their own rows, and only a single value in each of our cells.
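As a rough sketch, a data frame with this shape could be built as below. The cat names and column headings come from the text; the colours, weights and Monday opinions are made-up placeholder values, not necessarily the ones used in the data structures chapter.

```r
# Hypothetical values for illustration only: one column per variable,
# one row per cat, one value per cell.
cats <- data.frame(
  name          = c("Otis", "Luna", "Puss", "Garfield"),
  colour        = c("black", "white", "ginger", "orange"),
  weight        = c(4.2, 3.1, 5.0, 8.4),
  hates_mondays = c(FALSE, FALSE, FALSE, TRUE)
)

# Each column is a vector holding a single type of value
str(cats)
```

Because every variable sits in its own column, you can check or change one variable (say, weight) without touching the others.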
If your own data isn't tidy, you most certainly can make it so! While we won't be talking about it in this class, the tidyr package exists specifically to help you transform your data from a non-tidy to a tidy format. You can read more about tidyr and how to use it in the tidyr documentation.
If you haven't yet, the first thing we're going to need to do is install the tidyverse package onto our computers for R to use. To do this, we need to run:
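Assuming you're installing from CRAN (R's default package repository), the usual command is:

```r
# Download and install the tidyverse collection of packages from CRAN.
# This needs an internet connection and only has to be done once.
install.packages("tidyverse")
```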
R will download and install a bunch of different files for us - don't worry if this takes a little while, we're installing quite a lot of stuff here!
Once we've downloaded the package, the next thing we need to do is load it into our R environment using the library() function, like so:
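For the tidyverse, that call looks like this (it assumes the installation step has already finished):

```r
# Attach the core tidyverse packages to the current R session
library(tidyverse)
```

R will usually print a short message listing the packages it attached and any function names that now clash with ones already loaded.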
The library() function attaches the tidyverse to our R session, allowing us to use the functions it contains. While it seems a bit redundant to need to load a package after we've already installed it, this is actually a useful safety feature to make sure we're using the functions we mean to. Note that you only need to install the tidyverse once, but you need to call library() every time you start a new R session.
In R, function names are not unique and multiple different packages might contain functions by the same name that do different things. Loading only the packages you're using in a project helps to prevent these accidents from occurring.
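To make that concrete, here is a small sketch of one well-known name clash: both the stats package (which ships with R) and dplyr (part of the tidyverse) define a function called filter(). The example data is made up, and the code assumes dplyr is installed. Writing package::function() states explicitly which version you mean.

```r
# stats::filter() applies a linear filter to a numeric series
# (here, a three-point moving average)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that match a condition
# (hypothetical weights, for illustration only)
cat_weights <- data.frame(weight = c(2, 5, 7))
heavy_cats <- dplyr::filter(cat_weights, weight > 4)

print(moving_avg)
print(heavy_cats)
```

Same name, completely different jobs. Once the tidyverse is loaded, a bare filter() refers to the dplyr version, so the prefix is how you reach the stats one.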
Now that we've got the tidyverse up and running, let's jump in and start playing with a real-world dataset!