R Code Optimization

August 16, 2011

Handling Large Data with R

The following experiments are inspired by this excellent presentation by Ryan Rosario: http://statistics.org.il/wp-content/uploads/2010/04/Big_Memory%20V0.pdf. R provides many I/O functions for reading and writing data, such as 'read.table' and 'write.table' (see http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files). With data growing larger by the day, several newer approaches are available for achieving faster I/O.

The presentation above proposes a number of solutions (R packages). Here are some benchmarking results for the I/O operations.

Testing the bigmemory package

Test Background & Motivation

R works on data held in RAM, which can cause performance issues with large datasets. The bigmemory package creates a variable X <- big.matrix(...), where X is a pointer to a dataset stored in RAM or on the hard drive. Just as in the C world, we create a reference to the object rather than a copy. Because the matrix is accessed through this pointer, multiple R processes running in parallel can share the same memory object, which allows memory-efficient parallel analysis.
The bigmemory package mainly uses a binary file format, in contrast to the ASCII/classic approach of the R utils package.
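
To make the pointer/reference behaviour concrete, here is a minimal sketch (the tiny matrix and values are purely illustrative, not part of the benchmark):

library(bigmemory)

# a small, purely in-RAM big.matrix (no backing file) just to demonstrate reference semantics
x <- big.matrix(nrow = 3, ncol = 3, type = "double", init = 0)

y <- x        # y is another handle to the same underlying data, not a copy
y[1, 1] <- 42
x[1, 1]       # also 42 -- both names point at the same memory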

Testing tools
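
The timings below rely on a timeit wrapper that is not defined in the post; here is a minimal sketch, assuming it simply reports the elapsed time of an expression via system.time:

timeit <- function(expr) {
  # evaluate the expression and print the elapsed wall-clock time in seconds
  elapsed <- system.time(expr)["elapsed"]
  print(elapsed)
  invisible(elapsed)
}
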
Test Scenario

Reading and writing a large matrix using the utils functions (write.table, read.csv) vs the bigmemory functions (big.matrix, read.big.matrix).
i. Create a large matrix of random double values.

x1 <- matrix(rnorm(10000 * 10000, mean = 1.0, sd = 10.0), nrow = 10000, ncol = 10000)

ii. Write and read the matrix using write.table and read.csv.

# filepath holds the location of the CSV file used in these tests
timeit({write.table(x1, file = filepath, sep = ",", eol = "\n", dec = ".", col.names = FALSE)})
timeit({foo <- read.csv(filepath)})

iii. Write and read the matrix using the bigmemory package.

library(bigmemory)
timeit({X <- big.matrix(nrow = 10000, ncol = 10000, type = "double", separated = FALSE,
                        backingfile = "BigMem.bin", descriptorfile = "BigMem.desc", shared = TRUE)
        X[, ] <- x1})  # create a file-backed big.matrix, then copy the values of x1 into it

timeit({foo <- read.big.matrix(filepath, sep = ",", header = FALSE, col.names = NULL, row.names = NULL,
                               has.row.names = FALSE, ignore.row.names = FALSE,
                               type = "double", backingfile = "BigMem.bin",
                               descriptorfile = "BigMem.desc", shared = TRUE)})

iv. Read the file using my.read.lines (a custom helper; a sketch is given below).

timeit({foo <- my.read.lines(filepath)})
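
my.read.lines is not defined in the post; a minimal sketch, assuming it slurps the whole file with readChar and splits it into lines (the helper name and approach are assumptions, not taken from the original):

my.read.lines <- function(fname) {
  # read the entire file into memory in one call, then split it on newlines
  buf <- readChar(fname, file.info(fname)$size, useBytes = TRUE)
  strsplit(buf, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
}
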
Test Results

Platform: Dell Precision desktop with an Intel Core 2 Quad CPU @ 2.66 GHz and 7.93 GB of RAM.

utils          Total elapsed time (sec)   bigmemory         Total elapsed time (sec)   File size on disk (.csv)   Computation time saved by bigmemory
write.table    369.79                     big.matrix        1.51                       1.7 GB                     99%
read.csv       313.03                     read.big.matrix   141.50                     1.7 GB                     55%

* my.read.lines(filepath) took 23.73 secs.

Test Discussion

The timing results show that bigmemory provides large gains in I/O speed, and the values read back into foo are accurate.
The read.big.matrix call creates a binary backing file of 789 MB. Large objects (matrices, etc.) are kept in memory (in RAM) and accessed through a pointer object; see the 'backingfile' and 'descriptorfile' parameters. When a new R session is started, the user re-attaches the object via the descriptor file with attach.big.matrix('BigMem.desc'), so several R processes can share memory objects by 'call by reference'.
The .desc file is an S4 object -> https://github.com/hadley/devtools/wiki/S4
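
To illustrate the attach step (a minimal sketch; the file name matches the descriptor created above, and the element access is only an example):

# in a new R session (or a second R process)
library(bigmemory)
X <- attach.big.matrix("BigMem.desc")  # re-attach the shared, file-backed matrix
dim(X)                                 # 10000 x 10000
X[1, 1]                                # read a value without loading the whole matrix into RAM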

Advantages:
i. Much faster I/O.
ii. The binary backing file takes less space on the file system than the CSV.
iii. Subsequent loads of the data can be done by 'call by reference' via the descriptor file.
