Unless you are working with perfect data you will run into missing values. These absent data points are often the bane of the data analyst, as many algorithms do not play nicely with NA’s. Luckily, R presents us with many options for identifying, reformatting, omitting, and even replacing missing values in our data.
For the following examples we will use the airquality data set, which comes pre-loaded in R. It contains 153 daily readings of 6 variables: Ozone, Solar.R, Wind, Temp, Month, and Day.
Step One: Where the NA’s at?
First, we need to search our data frame to see if it has missing values. The following function can be used to get the job done.
is.na(airquality)
The function is.na() checks every value in the data frame. It returns TRUE if a value is missing and FALSE if it is present.
This is a little overwhelming, especially with large datasets. Wrapping is.na() in the colSums() function gives the total number of NA’s per column.
colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day
     37       7       0       0       0       0
That is much more manageable, and now we can see in which columns the NA’s are located.
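If you want a few more summary numbers before moving on, here is a minimal sketch using only base R. The variable names (total_na, share_na, rows_with_na) are my own and not part of any package.

```r
# airquality ships with base R, so no packages are needed
data(airquality)

# Total number of missing cells in the whole data frame
total_na <- sum(is.na(airquality))

# Proportion of missing values in each column
share_na <- colMeans(is.na(airquality))

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(airquality))

total_na             # 44 (37 in Ozone + 7 in Solar.R)
round(share_na, 3)
rows_with_na
```

The per-column proportions are often easier to judge than raw counts, since they show at a glance whether a column is mostly intact or mostly empty.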
Step Two: Reformatting the NA’s
Missing data points are not always represented by NA. In fact, there is a whole language of symbols used to represent missing data. To convert specific values to NA we will use the naniar package and its replace_with_na() function.
For this example let’s create a dataset based on a famous instance of missing information.
Name <- c('Lancelot', 'Robin', 'Galahad', 'Arthur', 'Bedevere')
Quest <- c('Holy Grail', 'Holy Grail', 'Holy Grail', 'Holy Grail', 'N / A')
Fav_Color<-c('Blue', NA, "Don't Know", 'N_A', 'N/A')
df<-data.frame(Name,Quest,Fav_Color)
df
Name Quest Fav_Color
1 Lancelot Holy Grail Blue
2 Robin Holy Grail <NA>
3 Galahad Holy Grail Don't Know
4 Arthur Holy Grail N_A
5 Bedevere N / A N/A
Looking at the data we can see that there are multiple representations of NA. We can replace them one column at a time or address the whole dataframe at once.
To address individual columns use replace_with_na()
Here we will just replace the unknown value in the Quest column.
library(naniar)
replace_with_na(df, replace= list(Quest = 'N / A'))
Name Quest Fav_Color
1 Lancelot Holy Grail Blue
2 Robin Holy Grail <NA>
3 Galahad Holy Grail Don't Know
4 Arthur Holy Grail N_A
5 Bedevere <NA> N/A
To handle the entire data frame use replace_with_na_all()
Here we will give this function a condition listing all the values we want to replace with NA. The result is a new data frame that we will call df2. Strictly speaking it is a tibble, but if you don’t know what that is, don’t worry: here it behaves just like a data frame.
library(naniar)
df2 <- replace_with_na_all(df, condition = ~ .x %in% c('N / A', 'N_A', 'N/A', "Don't Know"))
df2
# A tibble: 5 x 3
Name Quest Fav_Color
<chr> <chr> <chr>
1 Lancelot Holy Grail Blue
2 Robin Holy Grail NA
3 Galahad Holy Grail NA
4 Arthur Holy Grail NA
5 Bedevere NA NA
Note: in condition = ~ .x %in% ..., the .x stands for the values of each column in turn, so the test is applied to every column in the data frame. %in% is an operator that checks whether each element appears in a vector.
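To see what that %in% test is doing under the hood, here is a rough base R sketch of the same clean-up. It is not how naniar is implemented; the names markers and df_clean are my own, and stringsAsFactors = FALSE is set only so the columns are plain character vectors on any R version.

```r
# Rebuild the knights data so this block runs on its own
df <- data.frame(
  Name      = c('Lancelot', 'Robin', 'Galahad', 'Arthur', 'Bedevere'),
  Quest     = c('Holy Grail', 'Holy Grail', 'Holy Grail', 'Holy Grail', 'N / A'),
  Fav_Color = c('Blue', NA, "Don't Know", 'N_A', 'N/A'),
  stringsAsFactors = FALSE
)

# Every spelling of "missing" that we want to turn into a real NA
markers <- c('N / A', 'N_A', 'N/A', "Don't Know")

# %in% returns one TRUE/FALSE per element; an existing NA is not a marker
df$Fav_Color %in% markers
# FALSE FALSE  TRUE  TRUE  TRUE

# Apply the same test to every column and blank out the matches
df_clean <- df
df_clean[] <- lapply(df, function(col) replace(col, col %in% markers, NA))

colSums(is.na(df_clean))
```

The df_clean[] <- ... form keeps the result a data frame instead of a plain list.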
Step Three: Omitting NA’s
Now that the NA’s are standardized they can be omitted, if desired, using the na.omit() function. But beware: na.omit() removes every row that contains an NA, which can greatly reduce the size of your dataset. Let’s look at this with the Monty Python data and the airquality data. For airquality we will just look at the dimensions, as it is too big to put all of it on screen.
na.omit(df2)
# A tibble: 1 x 3
Name Quest Fav_Color
<chr> <chr> <chr>
1 Lancelot Holy Grail Blue
dim(airquality)
[1] 153 6
# creating a new data frame without the NA's
air_quality_omitted <- na.omit(airquality)
dim(air_quality_omitted)
[1] 111 6
Wow! In the first example we lost all of our Knights except Lancelot and in the second example we lost 42 rows. This is something you should always keep in the back of your mind when omitting NA’s.
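If only some columns matter for your analysis, you can limit the damage by dropping rows based on those columns alone. This is a small base R sketch; the object names ozone_complete and both_complete are made up for illustration.

```r
data(airquality)

# Drop only the rows where Ozone is missing; NA's in other columns are kept
ozone_complete <- airquality[!is.na(airquality$Ozone), ]
nrow(ozone_complete)   # 116, i.e. 153 rows minus the 37 missing Ozone values

# complete.cases() on a subset of columns does the same for several columns
both_complete <- airquality[complete.cases(airquality[, c("Ozone", "Solar.R")]), ]
nrow(both_complete)
```

Because Ozone and Solar.R are the only columns with NA’s in airquality, the second result matches what na.omit() gives on the full data frame.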
Step Four: Replacing the NA’s
If omitting the NA’s left a sour taste in your mouth, then I have good news! There is another way. Using certain functions, we can take the existing data and make approximations of what the NA’s might be. This is very useful in high-dimensional data, where the number of features often exceeds the number of observations and omitting a row is not an option.
Replacing with a specific value
Using tidyr and its replace_na() function we can choose the value that replaces our NA’s.
I’m going to guess that Bedevere’s quest was also searching for the Holy Grail so let’s make that replacement.
library(tidyr)
replace_na(df2, replace = list(Quest='Holy Grail'))
# A tibble: 5 x 3
Name Quest Fav_Color
<chr> <chr> <chr>
1 Lancelot Holy Grail Blue
2 Robin Holy Grail NA
3 Galahad Holy Grail NA
4 Arthur Holy Grail NA
5 Bedevere Holy Grail NA
Replacing with a calculated value
This same approach can be used for replacing NA’s with the mean or median value of that column.
Returning to the airquality data set, let’s replace all the missing Ozone measurements with the median of the column. Note that when calculating the median we have to drop the NA’s first, because median() returns NA for a vector that contains NA’s.
# First get the median of the column. Remember to omit the NA's.
oz_median <- median(na.omit(airquality$Ozone))
air_quality2<-replace_na(airquality, replace = list(Ozone = oz_median))
colSums(is.na(air_quality2))
Ozone Solar.R Wind Temp Month Day
0 7 0 0 0 0
#Note there are now zero NA's in the Ozone column
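For comparison, here is a base R sketch of the same median replacement. It uses na.rm = TRUE, which is an equivalent way to skip the NA’s, and it converts Ozone to numeric first because the median may be fractional while the column is stored as integers. As far as I can tell, recent tidyr versions cast the replacement to the column’s type, so if replace_na() complains about the Ozone column, this is one way around it. The name air_quality_base is my own.

```r
data(airquality)

# na.rm = TRUE drops the NA's inside median(), just like na.omit() does beforehand
oz_median <- median(airquality$Ozone, na.rm = TRUE)

# Work on a copy so the original data set stays untouched
air_quality_base <- airquality

# Ozone is stored as integer; make it numeric so a fractional median fits
air_quality_base$Ozone <- as.numeric(air_quality_base$Ozone)

# Overwrite only the missing entries
air_quality_base$Ozone[is.na(air_quality_base$Ozone)] <- oz_median

colSums(is.na(air_quality_base))
```

Only the Ozone column changes; the 7 NA’s in Solar.R are still there.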
Now let’s replace the NA’s in the Solar.R column with the mean value, but this time all in one line.
replace_na(airquality, replace = list(Solar.R = mean(na.omit(airquality$Solar.R))))
Again, note the na.omit() before calculating the mean.
BONUS: Using machine learning to replace NA’s
Ok, this is the big brain stuff, and describing it all could be a whole new post, but the gist is that certain algorithms can look at the data and make educated guesses as to what the NA’s would be. The best example is a random forest.
First, we set a seed for reproducibility. Then, using the randomForest package and its rfImpute() function, we feed in the dataset, the response column, the number of iterations, and the number of trees. The response column cannot contain NA’s, as it is the value we are trying to predict. In this case we will pretend we are trying to predict temperature from the rest of the airquality data.
library(randomForest)
set.seed(25)
# Temp is the response; every other column is a predictor whose NA's get imputed
new_airquality <- rfImpute(Temp ~ ., data = airquality, iter = 5, ntree = 200)
new_airquality
For more information check out StatQuest’s video. It is truly the gold standard on this topic.
Conclusion
This concludes our basic look at handling NA’s in R. Please note that these are just a few of the many ways to deal with missing values. As you spend more time with R you will find more methods that let you tailor your approach to NA’s. If you have made it this far, thanks for the read! If you have any methods of handling NA’s that you really like, please share them in the comments. I’d love to hear them!