apply() yourself in R

Posted on November 14, 2016

A few months ago I wrote what I thought was a quite useful post on list comprehensions in R, which, after working with numerous datasets since, I have realized is almost useless. In that post, I suggested a few ways to generate vectors of data using non-vectorized functions. Packages such as foreach, plyr, and dplyr offer more sophisticated tools for complex manipulation and grouping. Most of the time, however, all I really want is to generate a vector of data based on some existing data structure. For sample data, we’ll use a list of latitudes and longitudes for a few locations in Nova Scotia, Canada, and try to generate the distances between adjacent cities.

locations <- prettymapr::geocode(c("digby, NS", "middleton, NS", 
                                   "wolfville, NS", "windsor, NS", 
                                   "halifax, NS"))
locations <- locations[c("query", "lon", "lat")]
query          lon        lat
digby, NS      -65.75857  44.61940
middleton, NS  -65.06807  44.94243
wolfville, NS  -64.36449  45.09123
windsor, NS    -64.13637  44.99051
halifax, NS    -63.57497  44.64842

For those of you who don’t know, the cities listed here are along what could be a bus route from Digby to Halifax, and one could plausibly be interested in the distance between each consecutive pair of stops. Finding the distance between two lat/lon points is quite easy using the geosphere package:

library(geosphere)
distGeo(c(-65.75857, 44.61940), c(-65.06807, 44.94243)) # about 65 km
## [1] 65385.73

However, in this situation we need to compute the distance between the previous point and the current point, which does not map onto a single obvious vectorized call. It is possible to do this using a standard for loop, but for loops in R tend to be verbose and are easy to write in a slow way (for example, by growing a vector one element at a time). To avoid this, we need sapply().

At heart, sapply() takes a vector (or list) input, applies a function to each item, and produces as simple an output as it can. In most cases this is a vector, but if each result has a length greater than 1, the output is a matrix (when all results have the same length) or a list (when they do not).

sapply(c("first item", "second item", "third item"), nchar)
##  first item second item  third item 
##          10          11          10

In this example, we apply nchar to each item individually, returning the result as a vector (the names above each item mean that the vector is a named vector, which we can suppress by passing USE.NAMES=FALSE). In the case of nchar this is unnecessary, because nchar is already vectorized (i.e. passing in a vector of values returns a vector of the same length), but to do something more complicated, we need to specify a custom function.

sapply(c("first item", "second item", "third item"), function(item) {
  if(item == "first item") {
    return(1)
  } else if(item == "second item") {
    return(2)
  } else if(item == "third item") {
    return(3)
  }
}, USE.NAMES = FALSE)
## [1] 1 2 3
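
To make the earlier remark about output shape concrete, here is a small base R sketch. The inputs are made up for illustration; the point is only how sapply() simplifies its result depending on what the function returns, and how vapply() can be used when you would rather get an error than a surprise.

```r
# Each call returns one value: the result simplifies to a plain vector.
lengths_out <- sapply(c("a", "bb", "ccc"), nchar, USE.NAMES = FALSE)

# Each call returns two values: the result simplifies to a matrix
# with one column per input item.
range_out <- sapply(list(c(1, 5), c(2, 9), c(4, 4)), range)

# Calls return results of different lengths: nothing to simplify to,
# so the result stays a list.
seq_out <- sapply(1:3, function(n) seq_len(n))

# vapply() is a stricter alternative: we declare that each result must be
# a single integer, and it fails if the function returns anything else.
safe_out <- vapply(c("a", "bb", "ccc"), nchar, integer(1), USE.NAMES = FALSE)
```

If the shape of the output matters downstream, checking it with is.matrix() or is.list() (or using vapply()) is probably safer than relying on what sapply() happens to return.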

A third common use of sapply() is to use the indices as well as the values within the function.

values <- c("first item", "second item", "third item")
sapply(seq_along(values), function(index) {
  paste(rep(values[index], index), collapse="/")
})
## [1] "first item"                       "second item/second item"         
## [3] "third item/third item/third item"

In most cases, what you’re trying to accomplish can be done using a vectorized function (the if/else example above could use a few nested ifelse calls), but there are a few cases where this will not work:

  • A calculation involves values before/after the current value, or depends on the index of the value in addition to the value itself
  • A calculation involves multiple columns in a data frame, and the target function is not vectorized
  • It is necessary to construct a data structure more complicated than a vector (usually a list) from a vector.
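
As a rough illustration of these points, the sketch below uses invented data (the ranges data frame and all variable names are assumptions for the example). It first shows the nested ifelse() alternative to the if/else function, then a multi-column calculation with a function that only accepts one pair of values at a time, and finally lapply() for building a list from a vector.

```r
values <- c("first item", "second item", "third item")

# Vectorized alternative to the if/else function: nested ifelse() calls.
codes <- ifelse(values == "first item", 1,
         ifelse(values == "second item", 2,
         ifelse(values == "third item", 3, NA)))

# Multiple columns, non-vectorized function: seq() only accepts a single
# from/to pair, so we loop over row numbers and call it once per row.
ranges <- data.frame(from = c(1, 10, 20), to = c(3, 12, 25))
span_sums <- sapply(seq_len(nrow(ranges)), function(rownumber) {
  sum(seq(ranges$from[rownumber], ranges$to[rownumber]))
})

# Building a list from a vector: lapply() always returns a list,
# here one character vector of words per input string.
pieces <- lapply(values, function(item) strsplit(item, " ")[[1]])
```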

Returning to our list of Nova Scotian towns along a fictional bus line, the calculation we want to do falls into the first two categories. The first step is to generate a list of distances between adjacent points.

locations$distance <- c(0, sapply(2:nrow(locations), function(rownumber) {
  distGeo(c(locations$lon[rownumber-1], locations$lat[rownumber-1]), 
          c(locations$lon[rownumber], locations$lat[rownumber]))
}))
query          lon        lat       distance
digby, NS      -65.75857  44.61940      0.00
middleton, NS  -65.06807  44.94243  65385.65
wolfville, NS  -64.36449  45.09123  57871.46
windsor, NS    -64.13637  44.99051  21173.59
halifax, NS    -63.57497  44.64842  58453.64
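
For comparison, the same adjacent-distance calculation can be sketched without any packages, alongside the equivalent for loop. The haversine() helper below is a hypothetical stand-in for distGeo(): it assumes a spherical Earth with a radius of 6371 km rather than an ellipsoid, so its results will differ slightly from the table above. The coordinates are copied from that table, and the stops data frame is just a name chosen for this example.

```r
# Coordinates copied from the table above.
stops <- data.frame(
  query = c("digby, NS", "middleton, NS", "wolfville, NS",
            "windsor, NS", "halifax, NS"),
  lon = c(-65.75857, -65.06807, -64.36449, -64.13637, -63.57497),
  lat = c(44.61940, 44.94243, 45.09123, 44.99051, 44.64842)
)

# Hypothetical stand-in for geosphere::distGeo(): great-circle distance in
# metres on a sphere (assumed radius 6371 km), not on an ellipsoid.
haversine <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# sapply() version: the same row-number pattern as in the post.
dist_sapply <- c(0, sapply(2:nrow(stops), function(rownumber) {
  haversine(stops$lon[rownumber - 1], stops$lat[rownumber - 1],
            stops$lon[rownumber], stops$lat[rownumber])
}))

# for loop version: pre-allocate the result, then fill it in row by row.
dist_loop <- numeric(nrow(stops))
for (rownumber in 2:nrow(stops)) {
  dist_loop[rownumber] <- haversine(
    stops$lon[rownumber - 1], stops$lat[rownumber - 1],
    stops$lon[rownumber], stops$lat[rownumber]
  )
}

# Both approaches compute the same numbers.
same_result <- identical(dist_sapply, dist_loop)
```

The loop is not wrong, only longer: the sapply() version avoids the separate pre-allocation step and keeps the calculation in a single expression.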

Often a calculation depends heavily on both the index of a value and the value itself, in which case sapply(seq_len(nrow(some.data.frame)), function(rownumber) ...) does the job. I find that I use this construct a few times a week during an average week of programming. While there is still no list comprehension in R (a Python construct), sapply() is as close as it gets.
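
As one more minimal sketch of the row-number construct, the example below computes a trailing mean over the current row and up to two rows before it. The readings data frame and its values are invented for illustration.

```r
# Invented sample data.
readings <- data.frame(value = c(2, 4, 6, 8, 10))

# For each row, average the current value and up to two previous values.
# seq_len(nrow(...)) yields an empty sequence for a zero-row data frame,
# whereas 1:nrow(...) would yield c(1, 0).
trailing_mean <- sapply(seq_len(nrow(readings)), function(rownumber) {
  first_row <- max(1, rownumber - 2)
  mean(readings$value[first_row:rownumber])
})
```

The result depends on both the position of each row and its neighbours, which is the kind of calculation this construct handles well.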
