R Code Optimization

August 16, 2011

Handling Large Data with R

The following experiments are inspired by this excellent presentation by Ryan Rosario: http://statistics.org.il/wp-content/uploads/2010/04/Big_Memory%20V0.pdf. R offers many I/O functions for reading and writing data, such as ‘read.table’ and ‘write.table’ -> http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files. With data growing larger by the day, many new methodologies are available for achieving faster I/O.
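As a baseline, the classic utils route looks like this. A minimal sketch with a small illustrative matrix and a temporary file, not the benchmark data:

```r
# Classic ASCII I/O from the utils package (small example, illustrative only)
path <- tempfile(fileext = ".csv")
m <- matrix(rnorm(9), nrow = 3)

# write.table serializes the matrix as plain text
write.table(m, file = path, sep = ",", col.names = FALSE, row.names = FALSE)

# read.table parses it back into a data frame
foo <- read.table(path, sep = ",", header = FALSE)
```

Every value makes a round trip through its text representation, which is exactly why this path becomes slow at scale.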

From the presentation above, many solutions are proposed (R libraries). Here are some benchmarking results with respect to the I/O.

Testing bigmemory package

Test Background & Motivation

R works in RAM, which can cause performance issues with large datasets. The bigmemory package creates a variable X <- big.matrix(...), where X is a pointer to a dataset held in RAM or on the hard drive. Just as in the C world, we create a reference to the object rather than a copy. Because the objects (such as matrices) are addressed through pointer references, multiple R processes can access them concurrently, which allows memory-efficient parallel analysis.
The bigmemory package mainly uses a binary file format, versus the ASCII/classic approach of the R utils package.

Testing tools
Test Scenario

Reading and writing a large matrix using (write.table,read.table) vs (big.matrix,read.big.matrix).
i. Create a large matrix of random double values.

# rnorm must produce all 10000 x 10000 values; rnorm(10000, ...) would silently recycle
x1 <- matrix(rnorm(10000 * 10000, mean = 1.0, sd = 10.0), nrow = 10000, ncol = 10000)

ii. Write and read the large matrix using write.table and read.csv.

timeit({ write.table(x1, file = filepath, sep = ",", eol = "\n", dec = ".",
                     col.names = FALSE, row.names = FALSE) })
timeit({ foo <- read.csv(filepath, header = FALSE) })
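The timeit helper used throughout is not a base R function; a minimal sketch, assuming it is simply a thin wrapper around system.time that reports elapsed wall-clock seconds:

```r
# Hypothetical timing helper: evaluates the expression inside system.time()
# and returns the elapsed wall-clock seconds (R's lazy evaluation means the
# braced expression is only run when system.time() forces it).
timeit <- function(expr) {
  elapsed <- system.time(expr)[["elapsed"]]
  cat("Elapsed:", elapsed, "sec\n")
  invisible(elapsed)
}
```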

iii. Write and read a large matrix using bigmemory package

# as.big.matrix() converts an existing R matrix; big.matrix() would allocate an empty one
timeit({ X <- as.big.matrix(x1, type = "double", separated = FALSE,
                            backingfile = "BigMem.bin", descriptorfile = "BigMem.desc",
                            shared = TRUE) })

timeit({ foo <- read.big.matrix(filepath, sep = ",", header = FALSE,
                                col.names = NULL, row.names = NULL,
                                has.row.names = FALSE, ignore.row.names = FALSE,
                                type = "double", backingfile = "BigMem.bin",
                                descriptorfile = "BigMem.desc", shared = TRUE) })

iv. Testing using my.read.lines

timeit({ foo <- my.read.lines(filepath) })
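my.read.lines comes from the presentation linked above and its exact code is not reproduced here; a hypothetical sketch of the same idea (one large readChar call, then a single split on newlines, avoiding the per-line overhead of readLines):

```r
# Hypothetical sketch of a chunk-style reader: pull the whole file into one
# string with readChar(), then split it into lines in a single strsplit() call.
my.read.lines <- function(path) {
  size <- file.info(path)$size              # total bytes to read in one call
  txt  <- readChar(path, size, useBytes = TRUE)
  strsplit(txt, "\n", fixed = TRUE)[[1]]
}
```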
Test Results

Platform: Dell Precision desktop with an Intel Core 2 Quad CPU @ 2.66 GHz and 7.93 GB of RAM.

utils         Elapsed (sec)   bigmemory         Elapsed (sec)   File size on disk (.csv)   Time saved by bigmemory
write.table   369.79          big.matrix        1.51            1.7 GB                     99%
read.csv      313.03          read.big.matrix   141.50          1.7 GB                     55%

* my.read.lines(filepath) took 23.73 seconds.

Test Discussion

The timing results show that bigmemory provides large gains in I/O speed, and the values read back into foo match the original data.
The read.big.matrix function creates a .bin backing file of 789 MB (versus the 1.7 GB .csv), which lets large objects (matrices, etc.) live in RAM and be addressed through pointer references; see the ‘backingfile’ and ‘descriptorfile’ parameters. When a new R session is started, the user re-attaches the object through its descriptor file via attach.big.matrix(‘BigMem.desc’). This way several R processes can share memory objects through ‘call by reference’.
The .desc file is an S4 object -> https://github.com/hadley/devtools/wiki/S4
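Re-attaching through the descriptor works as follows; a minimal sketch, assuming the bigmemory package is installed (the demo.bin/demo.desc file names are illustrative, standing in for the BigMem files above):

```r
library(bigmemory)  # assumes the bigmemory package is installed

# Create a small file-backed matrix; the same pattern applies to BigMem.desc.
dir <- tempdir()
X <- as.big.matrix(matrix(1:4, nrow = 2),
                   backingfile = "demo.bin", descriptorfile = "demo.desc",
                   backingpath = dir)

# In a fresh R session only the descriptor file is needed: no data is copied,
# Y is a pointer to the same backing file as X.
Y <- attach.big.matrix(file.path(dir, "demo.desc"))
Y[2, 2]
```

Because both handles reference the same backing file, changes made through one are visible through the other, which is what enables shared-memory parallel work.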

i. Faster in computation.
ii. Takes less space on the file system.
iii. Subsequent loading of the data can be achieved via ‘call by reference’.