Handling NA’s In R

By georgeazen on January 4, 2021

Unless you are working with perfect data you will run into missing values. These absent data points are often the bane of the data analyst, as many algorithms do not play nicely with NA’s. Luckily, R presents us with many options for identifying, reformatting, omitting, and even replacing missing values in our data.

For the following examples we will use the airquality data set, which comes pre-loaded in R. It contains 153 daily air quality readings, each with 6 variables: Ozone, Solar.R, Wind, Temp, Month, and Day.

Step One: Where the NA’s at?

First, we need to search our data frame to see if it has missing values. The following function can be used to get the job done.

is.na(airquality)

The function is.na() checks every value in the data frame. It returns TRUE where a value is missing and FALSE where it is present.

This is a little overwhelming, especially with large data sets. Wrapping is.na() in colSums() gives the total number of NA’s per column. Note the capital S: R is case-sensitive, so colsums() will not work.

colSums(is.na(airquality))


  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

That is much more manageable, and now we can see in which columns the NA’s are located.
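A few other base R helpers answer the same question from different angles. The snippet below is only a sketch using functions that ship with R; the variable names are made up for illustration.

```r
# Is there any NA at all? Returns a single TRUE or FALSE.
any_missing <- anyNA(airquality)

# Total number of missing cells across the whole data frame.
total_missing <- sum(is.na(airquality))

# Share of missing values per column, as a percentage.
pct_missing <- round(colMeans(is.na(airquality)) * 100, 1)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(airquality))

any_missing
total_missing   # 44, i.e. the 37 Ozone NA's plus the 7 Solar.R NA's
pct_missing
complete_rows
```

The percentage view is handy when columns have very different amounts of missing data, since a raw count says little without the number of rows next to it.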

Step Two: Reformatting the NA’s

Data points that are not available will not always be represented by NA. In fact, there is a whole language of symbols used to represent missing data. To convert specific values to NA we will use the naniar package and its replace_with_na() function.

For this example let’s create a dataset based on a famous instance of missing information.

Name <- c('Lancelot', 'Robin', 'Galahad', 'Arthur', 'Bedevere')
Quest <- c('Holy Grail', 'Holy Grail', 'Holy Grail', 'Holy Grail', 'N / A')
Fav_Color <- c('Blue', NA, "Don't Know", 'N_A', 'N/A')

df <- data.frame(Name, Quest, Fav_Color)

df

       Name        Quest     Fav_Color
1  Lancelot   Holy Grail          Blue
2     Robin   Holy Grail          <NA>
3   Galahad   Holy Grail    Don't Know
4    Arthur   Holy Grail           N_A
5  Bedevere        N / A           N/A

Looking at the data we can see that there are multiple representations of NA. We can replace them one column at a time or address the whole dataframe at once.

To address individual columns, use replace_with_na(). Here we will replace just the unknown value in the Quest column.

library(naniar)

replace_with_na(df, replace= list(Quest = 'N / A'))


       Name        Quest     Fav_Color
1  Lancelot   Holy Grail          Blue
2     Robin   Holy Grail          <NA>
3   Galahad   Holy Grail    Don't Know
4    Arthur   Holy Grail           N_A
5  Bedevere         <NA>           N/A

To handle the entire data frame, use replace_with_na_all(). Here we feed the function a condition listing all the values we want to replace with NA, and store the result in a new data frame called df2. The result is actually a tibble, but if you don’t know what that is, don’t worry: it behaves much like a data frame.

library(naniar)

df2 <- replace_with_na_all(df, condition = ~ .x %in% c('N / A', 'N_A', 'N/A', "Don't Know"))

df2

# A tibble: 5 x 3
  Name        Quest        Fav_Color
  <chr>       <chr>        <chr>    
1 Lancelot    Holy Grail   Blue     
2 Robin       Holy Grail   NA       
3 Galahad     Holy Grail   NA       
4 Arthur      Holy Grail   NA       
5 Bedevere    NA           NA

Note: condition = ~ .x %in% c(...) is a formula that is applied to every column in the data frame, with .x standing for the values being tested. %in% is an operator that checks whether each element appears in a vector.
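If you would rather not install a package for this, roughly the same clean-up can be sketched in base R. This is an alternative to the naniar call rather than what naniar does internally, and the names knights, na_strings, and is_sentinel are made up for the example.

```r
# Rebuild the example data so this block runs on its own.
knights <- data.frame(
  Name      = c("Lancelot", "Robin", "Galahad", "Arthur", "Bedevere"),
  Quest     = c("Holy Grail", "Holy Grail", "Holy Grail", "Holy Grail", "N / A"),
  Fav_Color = c("Blue", NA, "Don't Know", "N_A", "N/A"),
  stringsAsFactors = FALSE
)

# Strings that should count as missing.
na_strings <- c("N / A", "N_A", "N/A", "Don't Know")

# Test every column against the list; the result is a logical matrix
# with one row per knight and one column per variable.
is_sentinel <- sapply(knights, function(col) col %in% na_strings)

# Overwrite the matching cells with a real NA.
knights[is_sentinel] <- NA

colSums(is.na(knights))
```

Unlike replace_with_na_all(), this keeps the result as a plain data frame instead of returning a tibble.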

Step Three: Omitting NA’s

Now that the NA’s are standardized they can be omitted, if desired, using the na.omit() function. But beware: na.omit() removes every row that contains an NA, which can greatly reduce the size of your data set. Let’s look at this with the Monty Python data and the airquality data. For airquality we will just look at the dimensions, as it is too big to put all of it on screen.

na.omit(df2)

# A tibble: 1 x 3
  Name     Quest      Fav_Color
  <chr>    <chr>      <chr>    
1 Lancelot Holy Grail Blue  




dim(airquality)
[1] 153   6

# creating a new df without the NA's
air_quality_omitted <- na.omit(airquality)

dim(air_quality_omitted) 
[1] 111   6

Wow! In the first example we lost all of our Knights except Lancelot and in the second example we lost 42 rows. This is something you should always keep in the back of your mind when omitting NA’s.
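One way to soften the blow is to drop a row only when the columns you actually need are missing. The sketch below uses base R indexing with is.na() and complete.cases(); which columns count as "needed" is an assumption made for the example.

```r
# Keep rows where Ozone was recorded; NA's in other columns are allowed.
ozone_known <- airquality[!is.na(airquality$Ozone), ]

# Same idea for several columns at once: complete.cases() on a subset.
key_cols <- c("Ozone", "Solar.R")
both_known <- airquality[complete.cases(airquality[, key_cols]), ]

dim(ozone_known)   # 116 rows: 153 minus the 37 missing Ozone values
dim(both_known)    # 111 rows, the same as na.omit() for this data set
```

Here the second version matches na.omit() only because Ozone and Solar.R are the only columns with NA’s. On data with gaps in other columns it would keep more rows.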

Step Four: Replacing the NA’s

If omitting the NA’s left a sour taste in your mouth, then I have good news! There is another way. Using certain functions, we can take the existing data and make approximations of what the NA’s might be. This is very useful in high-dimensional data, where the number of features often exceeds the number of observations and omitting a row is not an option.

Replacing with a specific value

Using tidyr and the replace_na() function we can dictate which value we want to replace our NA’s with.

I’m going to guess that Bedevere’s quest was also searching for the Holy Grail so let’s make that replacement.

library(tidyr)

replace_na(df2, replace = list(Quest='Holy Grail'))

# A tibble: 5 x 3
  Name        Quest        Fav_Color
  <chr>       <chr>        <chr>    
1 Lancelot    Holy Grail   Blue     
2 Robin       Holy Grail   NA       
3 Galahad     Holy Grail   NA       
4 Arthur      Holy Grail   NA       
5 Bedevere    Holy Grail   NA

Replacing with a calculated value

This same approach can be used for replacing NA’s with the mean or median value of that column.

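Returning to the airquality data set, let’s replace all the missing ozone measurements with the median of the column. Note that when calculating the median we have to omit the NA’s from the vector, because median() returns NA for a vector that contains NA’s.

# First get the median of the column. Remember to omit the NA's.

oz_median <- median(na.omit(airquality$Ozone))

# Ozone is stored as integers and the median may be fractional, so convert
# the column to numeric first; recent versions of tidyr will not change a
# column's type during replacement.

air_quality2 <- airquality
air_quality2$Ozone <- as.numeric(air_quality2$Ozone)
air_quality2 <- replace_na(air_quality2, replace = list(Ozone = oz_median))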


colSums(is.na(air_quality2))


  Ozone Solar.R    Wind    Temp   Month     Day 
      0       7       0       0       0       0 

#Note there are now zero NA's in the Ozone column

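Now let’s replace the NA’s in the Solar.R column with the mean value, but this time with the mean computed inline.

air_quality3 <- transform(airquality, Solar.R = as.numeric(Solar.R))
replace_na(air_quality3, replace = list(Solar.R = mean(na.omit(airquality$Solar.R))))

Again, note the na.omit() before calculating the mean. Solar.R is also stored as integers, hence the conversion to numeric first.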
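As a side note, median() and mean() can skip NA’s on their own through the na.rm argument, and the replacement itself can be done with plain indexing. The following is a base R sketch of the same two replacements, not a drop-in for the tidyr code; the names ending in _alt and air_filled are made up for the example.

```r
# na.rm = TRUE tells median() and mean() to ignore NA's themselves.
oz_median_alt  <- median(airquality$Ozone, na.rm = TRUE)
solar_mean_alt <- mean(airquality$Solar.R, na.rm = TRUE)

# Both agree with the na.omit() versions used above.
stopifnot(oz_median_alt == median(na.omit(airquality$Ozone)))
stopifnot(solar_mean_alt == mean(na.omit(airquality$Solar.R)))

# Work on a copy so the original data set stays untouched.
air_filled <- airquality

# Overwrite only the missing cells in each column.
air_filled$Ozone[is.na(air_filled$Ozone)]     <- oz_median_alt
air_filled$Solar.R[is.na(air_filled$Solar.R)] <- solar_mean_alt

colSums(is.na(air_filled))
```

Base R quietly converts the integer columns to numeric when a fractional value is assigned, so no explicit conversion is needed here.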

BONUS: Using machine learning to replace NA’s

OK, this is the big-brain stuff, and describing it all could fill a whole new post. The gist is that certain algorithms can look at the data and make educated guesses as to what the NA’s would be. The best example is a random forest.

First, we set a seed for reproducibility. Then, using the randomForest package and its rfImpute() function, we feed in a formula naming the response column, the data set, the number of iterations, and the number of trees. The response column cannot contain NA’s, as it is the value we are trying to predict. In this case we will pretend we are trying to predict temperature from the rest of the airquality data.

library(randomForest)

set.seed(25)

new_airquality <- rfImpute(Temp ~ ., data = airquality, iter = 5, ntree = 200)

new_airquality
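It is worth checking that the imputation really filled every gap. The sketch below assumes the randomForest package is installed and uses the formula interface of rfImpute(); the name imputed is made up, and the block simply does nothing if the package is missing.

```r
# Only run the check when randomForest is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(25)

  # Temp is the response; every other column is a predictor to impute.
  imputed <- randomForest::rfImpute(Temp ~ ., data = airquality,
                                    iter = 5, ntree = 200)

  # Every count should now be zero.
  print(colSums(is.na(imputed)))

  # Look at a few rows whose Ozone value was originally missing.
  print(head(imputed[is.na(airquality$Ozone), ]))
}
```

Comparing the imputed rows against the originals like this is a quick sanity check that the guesses fall in a plausible range.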

For more information check out StatQuest’s video. It is truly the gold standard on this topic.

Conclusion

This concludes our basic look at handling NA’s in R. Please note that these are just a few of the many ways to deal with missing values. As you spend more time with R you will find more methods that let you tailor your approach to NA’s. If you have made it this far, thanks for the read, and if you have any methods of handling NA’s that you really like, please share them in the comments. I’d love to hear them!
