Memory & Encoding Operations – Benjamin I. Bass

Very Lazy Cliff-Notes Version

For a lot of people, including myself, this is not a particularly thrilling topic to read or write about. However, as someone who does not come from a CS background (not even close), ignorance in this area is a major obstacle to better understanding R fundamentals or any language, for that matter. For that reason, I want to document some basic facts so that I, and others, can use it as a reference as I will inevitably forget a lot of this material. Because I’m obviously no expert in this field (or many others for that matter), I will rely heavily on sources, base-R code examples, and just a sprinkling of my own commentary.

This follows the examples given by the R documentation, except that at times I make certain function arguments explicit to make it clearer what is happening.

The first, and most important, topic in my opinion is how data is stored in a computer’s memory.

“A computer cannot store “numbers” or “letters”. The only thing a computer can store and work with is bits. A bit is binary, it is either a 0 or a 1. In fact from a physics perspective, a bit is just a blip of electricity that either is or isn’t there.”

Efficient R

Consequently, it is from building blocks of 0’s and 1’s that everything, including R, is built. Bits are then combined to form larger units of memory. These include the following (a word is either 32 or 64 bits depending on the system, e.g., 4 vs 4+ GB RAM):

Name	Value
bit	1 binary digit (0/1)
nibble	4 bits
byte	8 bits
kilobyte	1024 bytes (i.e., 8192 bits)
megabyte	1024 kilobytes
gigabyte	1024 megabytes
terabyte	1024 gigabytes
petabyte	1024 terabytes
exabyte	1024 petabytes

Considering that a byte holds 8 binary digits, there are 2^8 = 256 different possible combinations for a byte (00000000 through 11111111). This number is sufficiently large to capture most characters on a standard keyboard.
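The arithmetic above is easy to check in R. This is only a minimal sketch; the variable names are my own and do not come from any of the quoted sources.

```r
# Number of distinct values that n bits can represent: 2^n
bits_per_byte <- 8L
n_values <- 2^bits_per_byte
n_values

# A raw vector can hold every one of those byte values, from 0 to 255
all_bytes <- as.raw(0:255)
length(all_bytes)
range(as.integer(all_bytes))

# Units from the table, expressed in bytes (each step multiplies by 1024)
unit_sizes <- 1024^(1:3)
names(unit_sizes) <- c("kilobyte", "megabyte", "gigabyte")
unit_sizes
```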

Accordingly…

“Since characters (letters, decimal digits and special characters such as punctuation marks, etc) can be represented with bytes, a standard is needed to insure that the code that’s used on your computer is the same as the code that is used on mine. There are two standard codes that use one byte to represent a character, ASCII (ass’-key) and EBCDIC (ib’-suh-dik). ASCII, the American Standard Code for Information Interchange, is the code that is most commonly used today. EBCDIC, Extended Binary Coded Decimal Interchange Code, was used by IBM on its large mainframe computers in the past.”

Scranton CS Dept.

However…

“In the past the ASCII character set dominated computing. This set defines 128 characters including 0 to 9, upper and lower case alpha-numeric and a few control characters such as a new line. To store these characters required 7 bits since 2^7 = 128, but 8 bits were typically used for performance reasons…

…The limitation of only having 256 characters led to the development of Unicode, a standard framework aimed at creating a single character set for every reasonable writing system. Typically, Unicode characters require sixteen bits of storage. Eight bits is one byte, or ASCII character. So two ASCII characters would use two bytes or 16 bits. A pure text document containing 100 characters would use 100 bytes (800 bits).”

Efficient R
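To see that storage is not always one byte per character, one can compare character counts with byte counts. The sketch below assumes the strings are written with \u escapes, which R marks as UTF-8; in that particular encoding an ASCII character takes one byte and other characters take two to four. The variable names are illustrative only.

```r
ascii_text <- "R"
accented   <- "\u00e7"   # c with cedilla
euro       <- "\u20ac"   # euro sign

# Each string is a single character...
nchar(c(ascii_text, accented, euro), type = "chars")

# ...but the number of bytes needed to store it differs
nchar(c(ascii_text, accented, euro), type = "bytes")

# The bytes themselves, printed as hex pairs
charToRaw(accented)
charToRaw(euro)
```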

Encoding can then be seen as the process of mapping characters to bytes as is shown in the ASCII sample mapping below.

Bit representation	Character
01000001	A
01000010	B
01000011	C
01000100	D
01000101	E
01010010	R
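The mapping in the table can be reproduced from within R. The helper char_to_bits below is a hypothetical name of my own, and the sketch assumes its input is a single ASCII character (so exactly one byte).

```r
# Hypothetical helper: the 8-bit pattern of a single-byte character
char_to_bits <- function(ch) {
  byte <- charToRaw(ch)                 # one byte for an ASCII character
  bits <- as.integer(rawToBits(byte))   # rawToBits lists the least significant bit first
  paste(rev(bits), collapse = "")       # reverse so the most significant bit comes first
}

letters_to_show <- c("A", "B", "C", "D", "E", "R")
vapply(letters_to_show, char_to_bits, character(1))
```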

Beyond bytes and encoding, it is also important to at least be familiar with the hexadecimal (base-16) system, as it is found everywhere in computing and, in particular, in memory addresses. The following table maps decimal numbers to their binary and hexadecimal digits.

Decimal	Binary	Hexadecimal
0	0	0
1	1	1
2	10	2
3	11	3
4	100	4
5	101	5
6	110	6
7	111	7
8	1000	8
9	1001	9
10	1010	A
11	1011	B
12	1100	C
13	1101	D
14	1110	E
15	1111	F
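The same table can be generated in base R, which is a handy way to check a conversion. This is a rough sketch: to_binary is a hypothetical helper of my own, written under the assumption that its input is a non-negative integer that fits in R's 32-bit integer type.

```r
# Hypothetical helper: binary digits of a non-negative integer
to_binary <- function(n) {
  bits <- as.integer(intToBits(n))     # 32 bits, least significant bit first
  top <- max(which(bits == 1L), 1L)    # position of the highest set bit (at least 1)
  paste(rev(bits[seq_len(top)]), collapse = "")
}

n <- 0:15
data.frame(
  decimal     = n,
  binary      = vapply(n, to_binary, character(1)),
  hexadecimal = toupper(format(as.hexmode(n)))
)

# Going the other way, strtoi() parses a string in a given base
strtoi("1111", base = 2L)
strtoi("F", base = 16L)
```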

The following examples center around the ‘raw’ data type which, as the documentation puts it…

“The raw type is intended to hold raw bytes. It is possible to extract subsequences of bytes, and to replace elements (but only by elements of a raw vector)… …A raw vector is printed with each byte separately represented as a pair of hex digits”

R Documentation

xx <- raw(length = 2) # length of raw vector

xx[1] <- as.raw(40)

xx[2] <- charToRaw("A")

xx

## [1] 28 41

# 0x28 = (2 * 16^1) + (8 * 16^0) = 40
# 0x41 = 65 = ASCII code for "A"

dput(xx) ## as.raw(c(0x28, 0x41))

## as.raw(c(0x28, 0x41))

# 0x = prefix that marks a hexadecimal (base-16) literal

as.integer(xx) ## 40 65

## [1] 40 65

rm(xx)
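To make the link between a printed byte and its decimal value explicit, the two hex digits can be pulled apart with integer division. This is just an illustrative sketch; the names high and low are my own.

```r
b <- as.raw(40)
as.character(b)            # the byte as a pair of hex digits

value <- as.integer(b)
high  <- value %/% 16L     # first hex digit (the 16^1 place)
low   <- value %% 16L      # second hex digit (the 16^0 place)
c(high = high, low = low)

# Recombining the digits gives back the decimal value
high * 16L + low

# A hexadecimal literal in R code is just another way to write the same number
0x28 == 40
```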

Conversions operate as follows:

“charToRaw converts a length-one character string to raw bytes. It does so without taking into account any declared encoding”

Whereas…

“rawToChar converts raw bytes either to a single character string or a character vector of single bytes (with “” for 0). (Note that a single character string could contain embedded nuls; only trailing nulls are allowed and will be removed.) In either case it is possible to create a result which is invalid in a multibyte locale, e.g. one using UTF-8. Long vectors are allowed if multiple is true.”

x <- "A test string"
(y <- charToRaw(x))

##  [1] 41 20 74 65 73 74 20 73 74 72 69 6e 67

rawToChar(y)

## [1] "A test string"

rawToChar(y, multiple = TRUE)

##  [1] "A" " " "t" "e" "s" "t" " " "s" "t" "r" "i" "n" "g"

(xx <- c(y,  charToRaw("&"), charToRaw("more")))

##  [1] 41 20 74 65 73 74 20 73 74 72 69 6e 67 26 6d 6f 72 65

rawToChar(xx)

## [1] "A test string&more"

I’m not sure how useful bit shifting is within an R context, but I include the R documentation’s example for illustrative purposes. Note the related conversion functions as well…

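rawShift(x, n) shifts the bits in x by n positions. Positive n moves bits to the right in the least-significant-bit-first display produced by rawToBits, which doubles the value for each position; negative n shifts the other way. Allowed values of n are -8 to 8.
rawToBits returns a raw vector 8 times the length of its raw input, with entries 0 or 1 (least significant bit first)
intToBits returns a raw vector 32 times the length of its integer input, with entries 0 or 1 (least significant bit first)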

Finally, although not covered in detail here, note that there are bitwise logical operators as well (see ?bitwAnd)…

rawShift(y, 1)

##  [1] 82 40 e8 ca e6 e8 40 e6 e8 e4 d2 dc ce

rawShift(y, -2)

##  [1] 10 08 1d 19 1c 1d 08 1c 1d 1c 1a 1b 19

# Gibberish
rawToChar(rawShift(y, 1))

## [1] "‚@èÊæè@æèäÒÜÎ"

rawToBits(y)

##   [1] 01 00 00 00 00 00 01 00 00 00 00 00 00 01 00 00 00 00 01 00 01 01 01
##  [24] 00 01 00 01 00 00 01 01 00 01 01 00 00 01 01 01 00 00 00 01 00 01 01
##  [47] 01 00 00 00 00 00 00 01 00 00 01 01 00 00 01 01 01 00 00 00 01 00 01
##  [70] 01 01 00 00 01 00 00 01 01 01 00 01 00 00 01 00 01 01 00 00 01 01 01
##  [93] 00 01 01 00 01 01 01 00 00 01 01 00

intToBits(5)

##  [1] 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 00

showBits <- function(r) stats::symnum(as.logical(rawToBits(r))) # symbolic number coding

z <- as.raw(5)
z ; showBits(z)

## [1] 05

## [1] | . | . . . . .

showBits(rawShift(z, 1)) # shift to right

## [1] . | . | . . . .

showBits(rawShift(z, 2))

## [1] . . | . | . . .

showBits(z)

## [1] | . | . . . . .

showBits(rawShift(z, -1)) # shift to left

## [1] . | . . . . . .

showBits(rawShift(z, -2)) # ..

## [1] | . . . . . . .

showBits(rawShift(z, -3)) # shifted off entirely

## [1] . . . . . . . .

bitwShiftR(-1, 1:31) # shifts of 2^32-1 = 4294967295

##  [1] 2147483647 1073741823  536870911  268435455  134217727   67108863
##  [7]   33554431   16777215    8388607    4194303    2097151    1048575
## [13]     524287     262143     131071      65535      32767      16383
## [19]       8191       4095       2047       1023        511        255
## [25]        127         63         31         15          7          3
## [31]          1
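For completeness, here is a brief sketch of the bitwise logical operators mentioned earlier, alongside the relationship between shifting and multiplying or dividing by 2. The values 12 and 10 are arbitrary choices of mine, picked because their bit patterns are easy to read.

```r
a <- 12L   # binary 1100
b <- 10L   # binary 1010

bitwAnd(a, b)   # bits set in both:        1000
bitwOr(a, b)    # bits set in either:      1110
bitwXor(a, b)   # bits set in exactly one: 0110

# For integers, a left shift by k multiplies by 2^k,
# and a right shift by k divides by 2^k, dropping the remainder
bitwShiftL(5L, 2L)
bitwShiftR(20L, 2L)

# rawShift() with positive n doubles the byte value per position
as.integer(rawShift(as.raw(5), 1L))
as.integer(rawShift(as.raw(5), -1L))
```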

The R documentation has the following to say regarding encoding:

“Character strings in R can be declared to be encoded in “latin1” or “UTF-8” or as “bytes”. These declarations can be read by Encoding, which will return a character vector of values “latin1”, “UTF-8” “bytes” or “unknown”, or set, when value is recycled as needed and other values are silently treated as “unknown”.

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings. Strings marked as “bytes” are intended to be non-ASCII strings which should be manipulated as bytes, and never converted to a character encoding (so writing them to a text file is not supported).

enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying.”

R Documentation

## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)

## [1] "latin1"

Encoding(x) <- "latin1"
x

## [1] "façile"

xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))

## [1] "latin1" "UTF-8"

c(x, xx)

## [1] "façile" "façile"

Encoding(xx) <- "bytes"
xx # will be encoded in hex

## [1] "fa\\xc3\\xa7ile"

cat("xx = ", xx, "\n", sep = "")

## xx = fa\xc3\xa7ile
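What the declared encoding changes is how the underlying bytes are interpreted, and looking at those bytes directly makes that clearer. In this sketch I build the latin1 string from explicit byte values (my own choice, so the example does not depend on how the current locale parses "\xE7"); otherwise it follows the same steps as the documentation example.

```r
# "façile" as latin1 bytes: the c with cedilla is the single byte e7
x <- rawToChar(as.raw(c(0x66, 0x61, 0xe7, 0x69, 0x6c, 0x65)))
Encoding(x) <- "latin1"

# The same text converted to UTF-8
xx <- iconv(x, from = "latin1", to = "UTF-8")

charToRaw(x)    # e7 is one byte in latin1
charToRaw(xx)   # the same character becomes the two bytes c3 a7 in UTF-8

# Same number of characters, different number of bytes
nchar(c(x, xx), type = "chars")
nchar(c(x, xx), type = "bytes")

# enc2utf8() arrives at the same bytes as the iconv() call
identical(charToRaw(enc2utf8(x)), charToRaw(xx))
```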

i <- as.hexmode("7fffffff")
i; class(i)

## [1] "7fffffff"

## [1] "hexmode"

identical(as.integer(i), .Machine$integer.max)

## [1] TRUE

hm <- as.hexmode(c(NA, 1)); hm

## [1] NA  "1"

as.integer(hm)

## [1] NA  1

With strtoi it is possible to…

“Convert strings to integers according to the given base using the C function strtol, or choose a suitable base following the C rules.

For the default base = 0L, the base chosen from the string representation of that element of x, so different elements can have different bases (see the first example). The standard C rules for choosing the base are that octal constants (prefix 0 not followed by x or X) and hexadecimal constants (prefix 0x or 0X) are interpreted as base 8 and 16; all other strings are interpreted as base 10.

For a base greater than 10, letters a to z (or A to Z) are used to represent 10 to 35.”

R Documentation

strtoi(c("0xff", "077", "123"))

## [1] 255  63 123

strtoi(c("ffff", "FFFF"), 16L)

## [1] 65535 65535

strtoi(c("177", "377"), 8L)

## [1] 127 255
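A few more cases may help show how the base argument behaves. In this sketch I pass the base explicitly every time rather than relying on the default, and the input strings are arbitrary examples of my own.

```r
# Binary: digits 0 and 1 only
strtoi("1010", base = 2L)

# Base 36 uses 0-9 and a-z, so "z" stands for 35
strtoi("zz", base = 36L)   # 35 * 36 + 35

# A string that is not a valid number in the given base yields NA
strtoi("12abc", base = 10L)
```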

With all that in context, one can use .Internal(address()) to find the address of an object in memory…

x <- 5
.Internal(address(x))

## <pointer: 0x0000000010cc8690>

# Sample address on my machine...

strtoi('0x000000001ce05810',16)

## [1] 484464656
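One caveat worth noting: strtoi returns an R integer, so an address larger than .Machine$integer.max comes back as NA. The sketch below works around that by summing the hex digits into a double. Both hex_to_double and the longer address are made up by me for illustration, and the result is only exact for values below 2^53.

```r
# Hypothetical helper: convert a hex string to a double, digit by digit
hex_to_double <- function(hex) {
  hex <- sub("^0[xX]", "", hex)                          # drop the optional 0x prefix
  digits <- strtoi(strsplit(hex, "")[[1]], base = 16L)   # one value per hex digit
  powers <- rev(seq_along(digits)) - 1                   # exponent for each position
  sum(digits * 16^powers)
}

# Agrees with strtoi() for an address that fits in an integer
hex_to_double("0x000000001ce05810")

# A made-up 64-bit style address: too large for strtoi(), fine as a double
big_address <- "0x00007ffee3b5c000"
strtoi(big_address, 16L)
format(hex_to_double(big_address), scientific = FALSE)
```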