Very Lazy Cliff-Notes Version
For a lot of people, including myself, this is not a particularly thrilling topic to read or write about. However, as someone who does not come from a CS background (not even close), I find that ignorance in this area is a major obstacle to better understanding the fundamentals of R, or of any language for that matter. For that reason, I want to document some basic facts so that I, and others, can use them as a reference, since I will inevitably forget a lot of this material. Because I’m obviously no expert in this field (or many others, for that matter), I will rely heavily on sources, base-R code examples, and just a sprinkling of my own commentary.
This follows the examples given by the R documentation, except that at times I make certain function arguments explicit to make it clearer what is happening.
The first, and most important, topic in my opinion is how data is stored in a computer’s memory.
“A computer cannot store “numbers” or “letters”. The only thing a computer can store and work with is bits. A bit is binary, it is either a 0 or a 1. In fact from a physics perspective, a bit is just a blip of electricity that either is or isn’t there.”
Consequently, it is from building blocks of 0’s and 1’s that everything, including R, is built. Bits are then combined to form larger units of memory. These include the following (a word, not shown in the table, is either 32 or 64 bits depending on the system, e.g., 4 vs. 4+ GB of RAM):
| Name | Value |
|---|---|
| bit | 1 binary digit (0/1) |
| nibble | 4 bits |
| byte | 8 bits |
| kilobyte | 1024 bytes (i.e., 8192 bits) |
| megabyte | 1024 kilobytes |
| gigabyte | 1024 megabytes |
| terabyte | 1024 gigabytes |
| petabyte | 1024 terabytes |
| exabyte | 1024 petabytes |
Since a byte holds 8 binary digits, there are 2^8 = 256 possible combinations for a byte (00000000 through 11111111), which means a byte can take 256 distinct values. This number is large enough to capture most characters on a standard keyboard.
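The arithmetic is easy to check in R itself. This is only a minimal sketch, and the variable name is my own rather than anything from the sources quoted here.

```r
# Number of distinct values that 8 binary digits can take
n_values <- 2^8
n_values

# The smallest and largest values a single byte (a raw element) can hold
as.integer(as.raw(c(0, 255)))

# Anything outside 0..255 does not fit in one byte:
# as.raw() turns it into 00 and issues a warning
suppressWarnings(as.raw(256))
```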
Accordingly…
“Since characters (letters, decimal digits and special characters such as punctuation marks, etc) can be represented with bytes, a standard is needed to insure that the code that’s used on your computer is the same as the code that is used on mine. There are two standard codes that use one byte to represent a character, ASCII (ass’-key) and EBCDIC (ib’-suh-dik). ASCII, the American Standard Code for Information Interchange, is the code that is most commonly used today. EBCDIC, Extended Binary Coded Decimal Interchange Code, was used by IBM on its large mainframe computers in the past.”
However…
“In the past the ASCII character set dominated computing. This set defines 128 characters including 0 to 9, upper and lower case alpha-numeric and a few control characters such as a new line. To store these characters required 7 bits since 2^7 = 128, but 8 bits were typically used for performance reasons…
…The limitation of only having 256 characters led to the development of Unicode, a standard framework aimed at creating a single character set for every reasonable writing system. Typically, Unicode characters require sixteen bits of storage. Eight bits is one byte, or ASCII character. So two ASCII characters would use two bytes or 16 bits. A pure text document containing 100 characters would use 100 bytes (800 bits).”
Encoding can then be seen as the process of mapping characters to bytes as is shown in the ASCII sample mapping below.
| Bit representation | Character |
|---|---|
| 01000001 | A |
| 01000010 | B |
| 01000011 | C |
| 01000100 | D |
| 01000101 | E |
| 01010010 | R |
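The mapping in the table can be reproduced with base R. The sketch below assumes single-byte (ASCII) characters only, and `char_to_bits` is a hypothetical helper name of my own, not a base-R function.

```r
# Hypothetical helper: the 8-bit pattern of a single-byte (ASCII) character
char_to_bits <- function(ch) {
  # rawToBits() lists the least significant bit first, so reverse the order
  bits <- rev(as.integer(rawToBits(charToRaw(ch))))
  paste(bits, collapse = "")
}

chars <- c("A", "B", "C", "D", "E", "R")
mapping <- vapply(chars, char_to_bits, character(1))
mapping
```

Because `rawToBits` reports bits starting from the least significant one, the reversal is what makes the output match the usual left-to-right way of writing a byte.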
Beyond bytes and encoding, it is also important to be at least familiar with the hexadecimal (base-16) system, as it is found everywhere in computing and, in particular, in memory addresses. The following table maps decimal numbers to their binary and hexadecimal representations.
| Decimal | Binary | Hexadecimal |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 10 | 2 |
| 3 | 11 | 3 |
| 4 | 100 | 4 |
| 5 | 101 | 5 |
| 6 | 110 | 6 |
| 7 | 111 | 7 |
| 8 | 1000 | 8 |
| 9 | 1001 | 9 |
| 10 | 1010 | A |
| 11 | 1011 | B |
| 12 | 1100 | C |
| 13 | 1101 | D |
| 14 | 1110 | E |
| 15 | 1111 | F |
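The same table can be rebuilt in R, which is a handy way to convince yourself of the pattern. This is a rough sketch: `to_binary4` is a hypothetical helper of mine, and it pads every value to four binary digits instead of dropping leading zeros as the table does.

```r
decimal <- 0:15

# Hexadecimal digits via sprintf's %X conversion (upper-case hex)
hex <- sprintf("%X", decimal)

# Hypothetical helper: fixed-width 4-bit binary string for values 0..15
to_binary4 <- function(n) {
  # intToBits() returns 32 bits, least significant first; keep 4 and reverse
  paste(rev(as.integer(intToBits(n))[1:4]), collapse = "")
}
binary <- vapply(decimal, to_binary4, character(1))

data.frame(decimal, binary, hex)

# strtoi() goes the other way once it is told the base
strtoi("1010", base = 2L)
strtoi("A", base = 16L)
```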
The following examples center around the ‘raw’ data type which, as the documentation puts it…
“The raw type is intended to hold raw bytes. It is possible to extract subsequences of bytes, and to replace elements (but only by elements of a raw vector)… …A raw vector is printed with each byte separately represented as a pair of hex digits”
xx <- raw(length = 2) # length of raw vector
xx[1] <- as.raw(40)
xx[2] <- charToRaw("A")
xx
## [1] 28 41
# 28 (hex) = (2*16^1) + (8*16^0) = 40 (decimal)
# 41 (hex) = 65 (decimal) = ASCII code for "A"
dput(xx) ## as.raw(c(0x28, 0x41))
## as.raw(c(0x28, 0x41))
# 0x = prefix that marks a hexadecimal literal
as.integer(xx) ## 40 65
## [1] 40 65
rm(xx)
Conversions operate as follows:
“charToRaw converts a length-one character string to raw bytes. It does so without taking into account any declared encoding”
Whereas…
“rawToChar converts raw bytes either to a single character string or a character vector of single bytes (with “” for 0). (Note that a single character string could contain embedded nuls; only trailing nulls are allowed and will be removed.) In either case it is possible to create a result which is invalid in a multibyte locale, e.g. one using UTF-8. Long vectors are allowed if multiple is true.”
x <- "A test string"
(y <- charToRaw(x))
## [1] 41 20 74 65 73 74 20 73 74 72 69 6e 67
rawToChar(y)
## [1] "A test string"
rawToChar(y, multiple = TRUE)
## [1] "A" " " "t" "e" "s" "t" " " "s" "t" "r" "i" "n" "g"
(xx <- c(y, charToRaw("&"), charToRaw("more")))
## [1] 41 20 74 65 73 74 20 73 74 72 69 6e 67 26 6d 6f 72 65
rawToChar(xx)
## [1] "A test string&more"
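To tie the raw bytes back to the ASCII codes discussed earlier, here is a small round-trip sketch. It assumes a plain ASCII string, and the variable names `bytes` and `codes` are mine.

```r
x <- "A test string"
bytes <- charToRaw(x)

# Each byte is just a number between 0 and 255
codes <- as.integer(bytes)
codes[1]            # the code for "A"
utf8ToInt("A")      # the same code, obtained directly from the character

# Going back: integers -> raw bytes -> string
rawToChar(as.raw(codes))

# Splitting into single characters and pasting them back gives the original
paste(rawToChar(bytes, multiple = TRUE), collapse = "")
```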
I’m not sure how useful the bit shifting is within an R context, but I include the R documentation’s example for illustrative purposes. Also, note the conversion functions as well…
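- rawShift(x, n) shifts the bits in x by n positions: to the right for positive n and to the left for negative n, where “right” refers to the least-significant-bit-first order that rawToBits uses (so a shift of 1 doubles the value)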
- rawToBits returns a raw vector of 8 times the length of a raw vector with entries 0 or 1
- intToBits returns a raw vector of 32 times the length of an integer vector with entries 0 or 1
Finally, although not covered here, note that there are bitwise logical operators as well (see ?bitwAnd)…
rawShift(y, 1)
## [1] 82 40 e8 ca e6 e8 40 e6 e8 e4 d2 dc ce
rawShift(y, -2)
## [1] 10 08 1d 19 1c 1d 08 1c 1d 1c 1a 1b 19
# Gibberish
rawToChar(rawShift(y, 1))
## [1] "‚@èÊæè@æèäÒÜÎ"
rawToBits(y)
## [1] 01 00 00 00 00 00 01 00 00 00 00 00 00 01 00 00 00 00 01 00 01 01 01
## [24] 00 01 00 01 00 00 01 01 00 01 01 00 00 01 01 01 00 00 00 01 00 01 01
## [47] 01 00 00 00 00 00 00 01 00 00 01 01 00 00 01 01 01 00 00 00 01 00 01
## [70] 01 01 00 00 01 00 00 01 01 01 00 01 00 00 01 00 01 01 00 00 01 01 01
## [93] 00 01 01 00 01 01 01 00 00 01 01 00
intToBits(5)
## [1] 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 00
showBits <- function(r) stats::symnum(as.logical(rawToBits(r))) # symbolic number coding
z <- as.raw(5)
z ; showBits(z)
## [1] 05
## [1] | . | . . . . .
showBits(rawShift(z, 1)) # shift to right
## [1] . | . | . . . .
showBits(rawShift(z, 2))
## [1] . . | . | . . .
showBits(z)
## [1] | . | . . . . .
showBits(rawShift(z, -1)) # shift to left
## [1] . | . . . . . .
showBits(rawShift(z, -2)) # ..
## [1] | . . . . . . .
showBits(rawShift(z, -3)) # shifted off entirely
## [1] . . . . . . . .
bitwShiftR(-1, 1:31) # shifts of 2^32-1 = 4294967295
## [1] 2147483647 1073741823 536870911 268435455 134217727 67108863
## [7] 33554431 16777215 8388607 4194303 2097151 1048575
## [13] 524287 262143 131071 65535 32767 16383
## [19] 8191 4095 2047 1023 511 255
## [25] 127 63 31 15 7 3
## [31] 1
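One way I find it easier to think about the shifts above is as plain arithmetic. The following is a minimal sketch of that reading, using only the functions already shown; the variable `k` is mine.

```r
z <- as.raw(5)

# Positive n doubles the value once per position
# (bits pushed past the end of the byte are dropped)
as.integer(rawShift(z, 1))   # 5 * 2
as.integer(rawShift(z, 2))   # 5 * 4

# Negative n halves it, discarding the remainder
as.integer(rawShift(z, -1))  # 5 %/% 2
as.integer(rawShift(z, -3))  # nothing left

# bitwShiftR() treats -1 as the unsigned value 2^32 - 1,
# so shifting right by k leaves 2^(32 - k) - 1
k <- 1:31
all(bitwShiftR(-1, k) == 2^(32 - k) - 1)
```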
The R documentation has the following to say regarding encoding:
“Character strings in R can be declared to be encoded in “latin1” or “UTF-8” or as “bytes”. These declarations can be read by Encoding, which will return a character vector of values “latin1”, “UTF-8” “bytes” or “unknown”, or set, when value is recycled as needed and other values are silently treated as “unknown”.
ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings. Strings marked as “bytes” are intended to be non-ASCII strings which should be manipulated as bytes, and never converted to a character encoding (so writing them to a text file is not supported).
enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying.”
## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)
## [1] "latin1"
Encoding(x) <- "latin1"
x
## [1] "façile"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
## [1] "latin1" "UTF-8"
c(x, xx)
## [1] "façile" "façile"
Encoding(xx) <- "bytes"
xx # will be encoded in hex
## [1] "fa\\xc3\\xa7ile"
cat("xx = ", xx, "\n", sep = "")
## xx = fa\xc3\xa7ile
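Looking at the underlying bytes makes the difference between the two encodings concrete. This is a small sketch that rebuilds the same strings as above and inspects them with charToRaw and nchar; nothing here goes beyond base R.

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8")

# Same text, different bytes
charToRaw(x)   # the c-cedilla is the single byte e7 in latin1
charToRaw(xx)  # and the two bytes c3 a7 in UTF-8

# Byte counts differ, character counts do not
nchar(x, type = "bytes")
nchar(xx, type = "bytes")
nchar(xx, type = "chars")
```

The `c3 a7` pair is the same one that shows up in the escaped output once the string is marked as "bytes".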
i <- as.hexmode("7fffffff")
i; class(i)
## [1] "7fffffff"
## [1] "hexmode"
identical(as.integer(i), .Machine$integer.max)
## [1] TRUE
hm <- as.hexmode(c(NA, 1)); hm
## [1] NA "1"
as.integer(hm)
## [1] NA 1
With strtoi it is possible to…
“Convert strings to integers according to the given base using the C function strtol, or choose a suitable base following the C rules.
For the default base = 0L, the base chosen from the string representation of that element of x, so different elements can have different bases (see the first example). The standard C rules for choosing the base are that octal constants (prefix 0 not followed by x or X) and hexadecimal constants (prefix 0x or 0X) are interpreted as base 8 and 16; all other strings are interpreted as base 10.
For a base greater than 10, letters a to z (or A to Z) are used to represent 10 to 35.”
strtoi(c("0xff", "077", "123"))
## [1] 255 63 123
strtoi(c("ffff", "FFFF"), 16L)
## [1] 65535 65535
strtoi(c("177", "377"), 8L)
## [1] 127 255
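To see what a base conversion actually does, here is a hand-rolled version for hexadecimal strings. It is only a sketch for illustration: `hex_to_int` is a hypothetical helper of mine, it assumes valid hex digits with no 0x prefix, and in practice strtoi is the function to use.

```r
# Hypothetical helper: hexadecimal string -> number by positional arithmetic
hex_to_int <- function(hex) {
  digits <- strsplit(tolower(hex), "")[[1]]
  # Value of each digit: "0".."9" -> 0..9, "a".."f" -> 10..15
  values <- match(digits, c(0:9, letters[1:6])) - 1
  # Weight each digit by a power of 16, highest power first
  powers <- rev(seq_along(values)) - 1
  sum(values * 16^powers)
}

hex_to_int("28")    # 2*16 + 8
hex_to_int("ffff")

# Agrees with strtoi() for the same input
hex_to_int("ffff") == strtoi("ffff", 16L)
```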
With all that in context, one can use .Internal(address()) to find the address of an object in memory…
x <- 5
.Internal(address(x))
## <pointer: 0x0000000010cc8690>
# Sample address on my machine...
strtoi('0x000000001ce05810', 16)
## [1] 484464656