Reading and Cleaning Data
Getting your data in, and ready to work with.
Before we can do any data manipulation or management, we need to get some data into R in the first place!
The tidyverse package used to read data into R is, maybe unsurprisingly, called readr. It contains several different functions used to read in different types of tabular data. You can read about these different functions in the tidyr cheatsheet here (RStudio puts out many of these helpful cheatsheets as quick references for useful packages - I have a bunch printed out and stuck to the wall in my office, they're great!).
Today, we are going to use the
read_csv()function to read in a dataset (in csv format) containing information about passengers aboard the titanic. We can do this using the following code:
> titanic = read_csv("https://goo.gl/4Gqsnz")
This line of code downloads a comma separated value (CSV) file containing our dataset from the web link https://goo.gl/4Gqsnz, and loads it into a variable named titanic in our RStudio environment. Pretty neat huh?
Note that with this command you can also import files from your computer by specifying a filename instead of the url we used.
Now that we have our dataset ready to roll, let's take a look at it! There are a couple of different ways we can go about this.
One quick and easy way to check out a dataset is by using the
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
$ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florc…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "male"…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55…
$ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
$ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37345…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
$ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C103…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
glimpse()gives us a quick overview of what our dataset contains, including the number of observations (rows) in our dataset, the number of variables (columns) in our dataset, and the first few entries in each of those variables. To better understand the data, here are what each of these variables represent:
- PassengerId: Just an indexing number unique for each passenger.
- Survived: Whether or not the passenger survived. 0 = No, 1 = Yes. These are commonly used for binary outcomes, and if coerced into a logical would be
- Pclass: Ticket class: 1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class.
- Name: A character string containing the name of each passenger.
- Sex: Character strings for passenger sex (
"male"/ "female") note that everything is in lower case, as R is case-sensitive.
- Age: Age in years.
- SibSp: The number of siblings/spouses aboard the titanic with the passenger
- Parch: The number of parents/children aboard the titanic with the passenger
- Ticket: Another character string containing the ticket ID of the passenger.
- Fare: The price paid for tickets in pounds Sterling (multiply by 82 for a roughly equivalent price in 2020 AUD. Keep in mind that unskilled workers made around 1 pound a week, these were expensive tickets!)
- Cabin: The cabin number of the passengers (character).
- Embarked: Where passengers boarded the titanic. C = Cherbourg, Q = Queenstown, S = Southampton).
Another useful command for having a quick look at a dataset is the
summary()function, which can provide some quick summary statistics for our variables.
PassengerId Survived Pclass Name Sex Age SibSp
Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891 Length:891 Min. : 0.42 Min. :0.000
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character 1st Qu.:20.12 1st Qu.:0.000
Median :446.0 Median :0.0000 Median :3.000 Mode :character Mode :character Median :28.00 Median :0.000
Mean :446.0 Mean :0.3838 Mean :2.309 Mean :29.70 Mean :0.523
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:38.00 3rd Qu.:1.000
Max. :891.0 Max. :1.0000 Max. :3.000 Max. :80.00 Max. :8.000
Parch Ticket Fare Cabin Embarked
Min. :0.0000 Length:891 Min. : 0.00 Length:891 Length:891
1st Qu.:0.0000 Class :character 1st Qu.: 7.91 Class :character Class :character
Median :0.0000 Mode :character Median : 14.45 Mode :character Mode :character
Mean :0.3816 Mean : 32.20
3rd Qu.:0.0000 3rd Qu.: 31.00
Max. :6.0000 Max. :512.33
Lastly, if we want to explore the whole dataset, we can access RStudio's inbuilt data viewer by clicking the table icon next to our dataset in the environment panel or by using the
View your data in a spreadsheet format.
Now that we have our data in our RStudio environment, the next thing we need to do is clean it! Cleaning data refers to the process of removing errors or missing values from the dataset, and is a crucial step in data analysis - how can we trust the results of an analysis if the data going into it isn't up to scratch?
Looking at the
glimpse()from before, we can see quite a few missing values in our dataset. In R, these missing values are represented by an
NAtag. Removing or resolving these NA's is an important part of cleaning our data, as many functions in R won't know what to do when they encounter such values.
Let's have a look at which variables contain these missing values. To do this, we're going to combine 2 functions:
is.na()returns a logical
TRUE/FALSEvalue whenever it finds a missing value.
colSums()sums up all the values in each column of our dataframe.
Let's go ahead and see where those NAs are located.
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
0 0 0 0 0 177 0 0 0 0
Here we can see we have 177 missing values in our
Agecolumn, 2 missing values in our
Embarkedcolumn, and a whopping 687 missing values in our
Cabinnumber column. Given that our dataset contains 891 observations in total, this is a huge fraction!
The most straightforward way to deal with missing values is to delete the observations (rows) that contain them from our dataset, but doing so in this case would leave us very little data to work with. Instead, let's first remove the entire
Cabinvariable from our dataset given that we won't be using this variable for our analysis.
In order to remove the column we are going to use a function from the tidyverse (dplyr):
# We create a new clean titanic dataset,
# and select everything MINUS the cabin variable
> titanic_no_cabin <- select(titanic, -Cabin)
We now remove the entries with missing values using the
na.omit()function. That might look a little like this:
# we create a new clean titanic dataset,
# then we remove the NA's
> titanic_cleaned <- na.omit(titanic_no_cabin)
# and we're left with a dataset free from NA's!
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
0 0 0 0 0 0 0 0 0 0
Of course, depending on your specific data your cleaning needs may look a little different to this, but for now, this dataset is ready to go!
Now that we've imported and cleaned our data, here is a challenge to test what you've learned.
1. How many observations (rows) are now in the clean data set?
2. How many variables (columns) are now in the clean data set?
# You can answer both questions by typing:
# You can also find the answer with the following commands:
1. Is there commmands to import .tsv files?
2. How can you import excel files?
HINT: Look at the tidyr/readr cheatsheet.
> df <- read_tsv("file_name")
# 2. Use the "readxl" package