A Basic Introduction
BB
Overview
“And, if you then would like to know just what one of the functions does, that can also be answered, by looking in the source for R itself. Admittedly, the computations are written in a form of C that is not exactly for the faint of heart…
…All the same, I would encouragethe exploration for anyone with a general understanding of programming in C or similar languages, and with an interest in how computations are done in R.”
As Chambers, and many note, R’s source code, along with the C API, is uh, a little difficult understand/navigate for a number of different reasons…
First, there are two separate sets of functions, one in-use by R-Core and another set for S language compatibility that is no longer actively maintained maintained. However, the latter set is often used in what few third-party tutorials are available online which adds to the complexity.
Additionally, very few of those tutorials that do exist, either in the R documentation, or other books/websites, go into detail regarding both a) getting data into C from R and b) getting data from C to R all in the same function. Instead, most are either just ‘Hello World’ or the same ‘Convolve’ example over and over again.
Finally, while C is not a complicated language, its built-in operator overloading (for example, “*” can signify either multiplication or pointers/dereferencing) takes some getting used to as well as the fact any non-trivial operations (e.g., those “close to the machine”) require a considerable amount of forethought. Thus, a C newbie who wishes to extend R via C not only has to learn at least some C, but also needs to learn the syntax of a mostly undocumented API. This is why multiple people recommend avoiding the latter and going straight to RCPP (Hadley Wickham, Dirk Eddelbuettel, Romain François, etc.).
However, all that being said, there are some benefits that come with taking the time (often considerable!) to get a feel for how the system works. Notably, it gives one the tools to begin reading and understanding the source code and for those dead set on using the C API to write or link existing C code, provides one with the means to do so.
For those reasons, as well as my own, I will briefly try and supplement the information that already exists, consequently the structure will be as follows:
C in R
To begin, I want to briefly touch on ways to compile and use C code in R. In my piece on C for R users, I used the shell function to compile and run pure C programs in the command line and view output in R. However, if you want to interface with the R C API, there are a number of better ways. One, is to use the Knitr and RMarkdown and specify the C engine, which I utilize further on in this piece. The other, and one you may already familiar with, is the inline package which allows you to both compile and assign your code to an R wrapper function. Note that the default method is to compile via CPP. If you want to use C, you will need to specify that as well as the interface you wish to use. Of course, for most (toy) programs this shouldn’t matter as a valid C program will also compile in C++.
‘.C’ vs. ‘.Call’
There are a number of different interfaces available, including ‘.C’, ‘.Call’, and ‘.External’. The latter two are mostly the same except that ‘.External’ takes a variable number of arguments. As this walkthrough is already beastly long, I focus mainly on ‘.Call’ with only a brief nod to ‘.C’.
With all of that said, let’s finally look at some code. In particular, the knitr C engine example.
library(inline) code1 <- " *x= *x * *x; " square <- cfunction(signature(x="numeric"), code1, language = "C", convention = ".C") x <- 5 square(x)
## $x ## [1] 25
I actually started playing with C in R using this example and quickly became confused as all hell as when I tried to change the function so it didn’t take pointers, it failed to compile. The answer, of course, only obvious in hindsight, is that R doesn’t have a scalar type and everything is a vector at heart, or in C, an array. Note also that this function uses the ‘.C’ interface, which has the following characteristics:
- It takes care of many of the gory details that ‘.Call’ requires the programmer to handle
- All arguments are copied so it lacks the potential to be as efficient as ‘.Call’
- Only basic types can be passed to ‘.C’
In order to write this same function using the ‘.Call’ interface, one would do the following…
code2 <- " SEXP out = PROTECT(allocVector(REALSXP, 1)); // SEXP = R object // PROTECT = protect against R garbage collection // allocVector = request space, length 1, for ... // REALSXP = real or double REAL(out)[0] = REAL(x)[0] * REAL(x)[0]; // REAL()[0] = access first, C is zero indexed, element // UNPROTECT = must match number of PROTECTS (e.g., 1) UNPROTECT(1); return(out); " square2 <- cfunction(signature(x="numeric"), code2, language = "C", convention = ".Call") square2(x)
## [1] 25
Note that I omit a detailed discussion of the API macros as they are covered elsewhere (R API Cheat Sheet). Again, here is a comparison for ‘hello world’…
hw_text1 <- ' Rprintf("Hello World\\n"); ' hw_text2 <- ' Rprintf("Hello World\\n"); return(R_NilValue); ' Hello_World1 <- cfunction(signature(), hw_text1, language = "C", convention = ".C") Hello_World2 <- cfunction(signature(), hw_text2, language = "C", convention = ".Call") # Returns list Hello_World1()
## Hello World
## list()
# Returns NULL (or whatever else you specify...) Hello_World2()
## Hello World
## NULL
Returning to ‘square’, even though both interfaces are making copies in this case, ‘.Call’ is faster…
library(microbenchmark) microbenchmark(square(x))
## Unit: nanoseconds ## expr min lq mean median uq max neval ## square(x) 0 365 420.03 365 366 5106 100
microbenchmark(square2(x))
## Unit: nanoseconds ## expr min lq mean median uq max neval ## square2(x) 0 0.5 142.89 1 1 6200 100
Old vs. New Macros
I found the fact that a lot of sample code uses the old, un-maintained S compatibility macros to be confusing until I bookmarked, and thoroughly reviewed the Rdefines.h file. Given that the R source code is very much under development, and that the older function definitions are more of a relic than anything else, it makes sense to use the same ones in use by the R-core team. Nonetheless, I still found several of the tutorials that use the old macros extremely helpful; it just took me longer to understand. Because of this, I figure I’ll go ahead and update the some functions from these sites with newer code, in particular, Jonathon Callahan’s ‘helloC1’ function which counts string length and Søren Højsgaard’s matrix multiplication exampleexample.
helloC1_NEW <- cfunction(c(greeting = 'character'),' int i, vectorLength, stringLength; SEXP result; //PROTECT(greeting = AS_CHARACTER(greeting)); // OLD PROTECT(greeting = coerceVector(greeting,STRSXP)); // NEW /* Both functions below are correct actually. From the R Extension Manual: "Note that LENGTH is intended to be used for vectors: length is more generally applicable." */ //vectorLength = LENGTH(greeting); vectorLength = length(greeting); //PROTECT(result = NEW_INTEGER(vectorLength)); // OLD PROTECT(result = allocVector(INTSXP,vectorLength)); for (i=0; i<vectorLength; i++) { stringLength = strlen(CHAR(STRING_ELT(greeting, i))); INTEGER(result)[i] = stringLength; } UNPROTECT(2); return(result); ', language = "C", convention = ".Call") helloC1_NEW(c('Hello', 'Hola','Bonjour'))
## [1] 5 4 7
For space purposes, I only present the updated code with a few notes and edits for clarity…
mat_new <- cfunction(c(X = "matrix", Y="matrix"), body= int i, vectorLength, stringLength; SEXP result; //PROTECT(greeting = AS_CHARACTER(greeting)); // OLD PROTECT(greeting = coerceVector(greeting,STRSXP)); // NEW /* Both functions below are correct actually. From the R Extension Manual: "Note that LENGTH is intended to be used for vectors: length is more generally applicable." */ //vectorLength = LENGTH(greeting); vectorLength = length(greeting); //PROTECT(result = NEW_INTEGER(vectorLength)); // OLD PROTECT(result = allocVector(INTSXP,vectorLength)); for (i=0; i<vectorLength; i++) { stringLength = strlen(CHAR(STRING_ELT(greeting, i))); INTEGER(result)[i] = stringLength; } UNPROTECT(2); return(result); ',language = "C", convention = ".Call") mat <- matrix(c(1:4), nrow = 2, ncol = 2) mat_new(mat,mat)
## [,1] [,2] ## [1,] 7 15 ## [2,] 10 22
Wrapping C Functions
While it probably makes a lot more sense to write new code for functions/packages using RCPP, an additional reason beyond learning to read R source code one might use the C API is if they wanted to use an already existing C function/library in R and either don’t want to worry about any non-compatibility issues or for whatever reason just want to keep things in C. More importantly, the act of wrapping existing C programs so that they can be used in R is a good way to demonstrate how to use the R C API, or, at the very least, get started with it.
Consequently, I create wrappers for two (relatively) simple C functions related to string processing. Specifically, I use an implementation of the Levenshtein string distance and the Martin Porter’s Stemming Algorithm, both of which are hosted on Github by Titus Wormer, although the latter is the same as the author’s original C version.
First, I’ll create a very basic wrapper in C. Briefly, the function computes the minimum distance between two strings by taking the sum of the total number of deletions, insertions, or substitutions required to turn one string into another, presumably different string Levenshtein String Distance. The function is short and fairly simple, so I include it, along with a very basic wrapper function designed to get data into and out of R. Specifically, the function sends a string, or STRSXP, and then returns a double, or REALSXP, back to R.
#include <string.h> #include <stdlib.h> #include <R.h> #include <Rdefines.h> unsigned int levenshtein(const char *a, const char *b) { unsigned int length = strlen(a); unsigned int bLength = strlen(b); unsigned int *cache = calloc(length, sizeof(unsigned int)); unsigned int index = 0; unsigned int bIndex = 0; unsigned int distance; unsigned int bDistance; unsigned int result; char code; /* Shortcut optimizations / degenerate cases. */ if (a == b) { return 0; } if (length == 0) { return bLength; } if (bLength == 0) { return length; } /* initialize the vector. */ while (index < length) { cache[index] = index + 1; index++; } /* Loop. */ while (bIndex < bLength) { code = b[bIndex]; result = distance = bIndex++; index = -1; while (++index < length) { bDistance = code == a[index] ? distance : distance + 1; distance = cache[index]; cache[index] = result = distance > result ? bDistance > result ? result + 1 : bDistance : bDistance > distance ? distance + 1 : bDistance; } } free(cache); return result; } SEXP lv(SEXP x, SEXP y) { const char * s1 = CHAR(STRING_ELT(x, 0)); const char * s2 = CHAR(STRING_ELT(y, 0)); SEXP myint; PROTECT(myint = allocVector(INTSXP,1)); INTEGER(myint)[0] = levenshtein(s1, s2); UNPROTECT(1); return(myint); }
Then all that needs done is to load the ‘.dll’ file and wrap it in a function…
# Using Knitr 'c' engine automatically generates a '.dll' file on Windows.. dyn.load('../c2094e18284b.dll') lv <- function(string1,string2) { out <- .Call('lv',string1,string2) out } lv('foo','moo')
## [1] 1
lv('pig','horse')
## [1] 5
Let’s look at one more (minimal) example that takes a string input and then returns a string output, Martin Porter’s Stemming Algorithm. This is frequently used in natural language processing as a first step before more advanced processing Algorithm. Here, I just link to the function as a ‘.h’ file in the header. Visit the link to the right to view the actual function.
#include <string.h> #include <stdlib.h> #include "strmr.h" #include <R.h> #include <Rdefines.h> SEXP strmr(SEXP x) { char *word = CHAR(STRING_ELT(x, 0)); int end = stem(word, 0, strlen(word) - 1); word[end + 1] = 0; SEXP out = PROTECT(allocVector(STRSXP, 1)); SET_STRING_ELT(out, 0, mkChar(word)); UNPROTECT(1); return(out); }
dyn.load('../c17d866f046b2.dll') strmr <- function(x) { out <- .Call('strmr',x) out } strmr('populous')
## [1] "popul"
Errors & Validation
Although these programs work, it certainly isn’t ready to be released into the wild. Most importantly, if too many, too few, or the wrong type of arguments are passed to the function, it results in a segfault (be sure to save anything important if you want to test this!). This is par for the course with this API,
“Bugs will bite. These are ancient and relatively low-level languages. Programming problems are more likely. And, the consequences can be much worse than bugs in R: It’s quite easy to kill the R session with a memory fault, for example.”
There are a couple of ways around this. One is to just take care of it in R using functions like stop or stopifnot to make sure the correct arguments are given. The other is to use the error function in C and do type, length, etc. checking there.
feed_me_an_int <- cfunction(c(x='ANY'),' if (!isInteger(x)) { error("I wanted an int!\\n"); } else { Rprintf("Thanks for feeding me an int!\\n"); } return(R_NilValue); ',language=c('C'),convention=c('.Call')) feed_me_an_int(3)
## Error in feed_me_an_int(3): I wanted an int!
feed_me_an_int(NaN)
## Error in feed_me_an_int(NaN): I wanted an int!
feed_me_an_int('foo')
## Error in feed_me_an_int("foo"): I wanted an int!
feed_me_an_int(5L)
## Thanks for feeding me an int!
## NULL
Another option is to use the ‘.External’ macros for variable-length argument lists. These “cute” Lisp-influenced macros (https://www.r-project.org/conferences/useR-2004/Keynotes/Dalgaard.pdf) allow you to extract variable numbers of arguments from an R call,
first = CADR(args);
second = CADDR(args);
third = CADDDR(args);
fourth = CAD4R(args);
provide convenient ways to access the first four arguments. More generally we can use the CDR and CAR macros as in
args = CDR(args); a = CAR(args);
args = CDR(args); b = CAR(args);
mult_first_2 <- cfunction(c(args='ANY'),' SEXP x = CADR(args); // first SEXP y = CADDR(args); // second SEXP res = PROTECT(allocVector(REALSXP, 1)); REAL(res)[0] = asReal(x) * asReal(y); UNPROTECT(1); return res; ',language=c('C'),convention=c('.Call')) # More or less aping Hadley's Advanced R example here... call1 <- quote(f(x=2,y=3)) mult_first_2(call1)
## [1] 6
call2 <- quote(f(y=c(1,2,3),z=4)) mult_first_2(call2)
## [1] NA
call3 <- quote(f(z=c(1,2,3))) mult_first_2(call3)
## [1] NA
# DO NOT, HOWEVER, DO THIS: # mult_first_2(quote('foo')) # OR THIS: # mult_first_2('foo') # OR... YOU GET THE IDEA..
Handling NA’s is another beast completely,
“Note that the ISNA macro, and the similar macros ISNAN (which checks for NaN or NA) and R_FINITE (which is false for NA and all the special values), only apply to numeric values of type double. Missingness of integers, logicals and character strings can be tested by equality to the constants NA_INTEGER, NA_LOGICAL and NA_STRING. These and NA_REAL can be used to set elements of R vectors to NA.”
That’s about all I can cover without going mad at the present time. However, this is far from all there is to know. I hope to eventually spend more time with R source code at some point in the future…
Other Resources
As a parting note, I’ll provide links to some of the sources that I found helpful in my quest to understand the R API. A fair number are fairly old and use the older, S-compatible functions which I found confusing. I try and make a small note for each regarding what I useful (or confusing). Personally, I find it helpful to cross-reference a number of different sources when I’m trying to understand something new so if you are at all like me, hopefully this list is helpful.
- R Extensions: If you ask a question on the R-Dev Listserve, they’ll tell you to read this…
- R Docs Example: The R documentation provides a quality, commented example that makes use of many of the topics discussed above..
- Example Functions from R-Extension Manual
- Advanced R C API Chapter: Who else?
- Hadley’s Notes on C on Github
- Winston Chang’s R Source Code Mirror on Github
- Seth Falcon’s 2010 Slides on Native Interfaces: Slightly dated but fantastic resource overall…
- Søren Højsgaard’s Statistical Computing R-C Interface Notes.
- Brian Ripley’s 2009 R Course Slides.
- Peter Dalgaard’s Keynotes Slides from UseR 2004: A few great nuggets about the R language are included here.
- Johnathon Callahan’s R C API Tutorial: Dated, but excellent walkthrough…
- More Bioconductor Resources
- Github Page Linking R Functions to C Code
- R API Cheat Sheet, by Simon Urbanek
- R’s .C Interface, by Roger Peng
- What is a SEXP?: Deep Dive