Reading excel files into R

It’s well known that data preparation is often the most time consuming part of any data science project. Before you can start doing much with the data though it first needs to be imported into your analysis software, something which can sometimes prove more tricky than anticipated.

The default in Rstudio if importing an excel file is set to read_excel in the readxl package so it will do something like below when reading in.

library (readxl)
myfile <- read_excel ("myfile.XLS")
view (myfile)

However if you are reading in a large sparse data set containing nominal attributes then be careful if using this function to read your file! By default the function only looks at the first 100 rows when assigning a column type. If you have a nominal attribute whose first 100 rows are empty, then by default it will be assigned the type numeric. You can get round this by explicitly specifying the column types using col_types argument but if you have a large number of attributes this is going to be time consuming.

The standard advice given in R documentation seems to be to avoid reading in excel files in the first place. It’s suggested to use the originating application to save the file as a text or comma separated value file and import it using read.csv or similar.

library (utils) 
read.csv (myfile.csv, header=TRUE, sep=",")

It may not always be possible to save the excel file in an easier format for import however. I was working with a file recently with around 250 attributes with a lot of lengthy names containing reserved characters etc, so when saved to csv the header didn’t import properly. One way around this is just to drop the column names from the import and manually assign them later. A bit of a pain if you have hundreds of attributes though!

The easiest workaround I found for this was simply to use the read_excel2 function created by Hadley Wickham which is reproduced below. This seems to be a common query from R users and has been discussed on stack overflow for example here.

## Source: https://github.com/tidyverse/readxl/blob/master/R/read_excel.R 
library(readxl)

read_excel2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = 1050000L) {

  path <- readxl:::check_file(path)
  ext <- tolower(tools::file_ext(path))

  switch(readxl:::excel_format(path),
         xls =  read_xls2(path, sheet, col_names, col_types, na, skip, n_max),
         xlsx = read_xlsx2(path, sheet, col_names, col_types, na, skip, n_max)
  )
}
read_xls2 <- function(path, sheet = 1, col_names = TRUE, col_types = NULL,
                     na = "", skip = 0, n_max = n_max) {

  sheet <- readxl:::standardise_sheet(sheet, readxl:::xls_sheets(path))

  has_col_names <- isTRUE(col_names)
  if (has_col_names) {
    col_names <- readxl:::xls_col_names(path, sheet, nskip = skip)
  } else if (readxl:::isFALSE(col_names)) {
    col_names <- paste0("X", seq_along(readxl:::xls_col_names(path, sheet)))
  }

  if (is.null(col_types)) {
    col_types <- readxl:::xls_col_types(
      path, sheet, na = na, nskip = skip, has_col_names = has_col_names, n = n_max
    )
  }

  readxl:::xls_cols(path, sheet, col_names = col_names, col_types = col_types, 
                    na = na, nskip = skip + has_col_names)
}

read_xlsx2 <- function(path, sheet = 1L, col_names = TRUE, col_types = NULL,
                       na = "", skip = 0, n_max = n_max) {
  path <- readxl:::check_file(path)
  sheet <-
    readxl:::standardise_sheet(sheet, readxl:::xlsx_sheets(path))

  if (is.null(col_types)) {
    col_types <-
      readxl:::xlsx_col_types(
        path = path, sheet = sheet, na = na, nskip = skip + isTRUE(col_names), n = n_max
      )
  }

  readxl:::read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, na = na,
             nskip = skip)
}

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.