If you have CULA installed, please try out our gputools R package with gpuFastICA. We especially encourage feedback from users: did the package install correctly, and did the gpuFastICA routine work as expected? Tested on a subset of GSE6306, the non-GPU fastICA took over four hours, while gpuFastICA took just 80 seconds, a speedup of more than 180-fold.
R is the most popular open source statistical environment in the biomedical research community. However, most popular R function implementations involve no parallelism, so for large data-parallel analysis tasks they can only be executed as separate instances on multicore or cluster hardware. The arrival of modern graphics processing units (GPUs) with user-friendly programming tools, such as Nvidia's CUDA toolkit (http://www.nvidia.com/cuda), makes it possible to increase the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, which is a key obstacle to the widespread adoption of GPUs in biomedical research.
To overcome this obstacle, we decided to port the R functions we use most frequently in our work to the GPU using CUDA. Ideally, if a CUDA-compatible GPU and driver are present on a user's machine, the user need only prefix "gpu" to the original function name to take advantage of the GPU implementation of the corresponding R function. Achieving this ideal is one of our primary goals, so that any biomedical researcher can harness the computational power of a GPU using a familiar tool. Since our code is open source, researchers may customize the R interfaces to their particular needs. In addition, because CUDA uses shared libraries and unobtrusive extensions to the C programming language, any experienced C programmer can customize the underlying code.
Using the CUDA extension to C and the shared linear algebra library CUBLAS, we have implemented a variety of statistical analysis functions with R interfaces that execute with different degrees of parallelism on a graphics processing unit (GPU). If an algorithm consists of common vector or matrix operations, each performed once, we involve the GPU by implementing those operations with calls to CUBLAS. If an algorithm involves computing the elements of a large matrix, we can often simply assign each thread executing on the GPU a portion of a row and/or column. Algorithms for which we have implemented GPU-enabled versions include the calculation of distances between sets of points (R dist function), hierarchical clustering (R hclust function), Pearson and Kendall correlation coefficients (similar to the R cor function), and the Granger test ('granger.test' in the R MSBVAR package).
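The idea of giving each GPU thread a portion of a row can be illustrated on the CPU. The following is only a minimal sketch in Python, not the package's CUDA code: the function names `distance_row` and `distance_matrix` are hypothetical, and the loop over rows stands in for work that independent GPU threads could do concurrently.

```python
import math

def distance_row(points, i):
    """Euclidean distances from point i to every point.

    In this sketch, one call is the unit of work that a single GPU
    thread (or group of threads) would own.
    """
    pi = points[i]
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(pi, pj)))
            for pj in points]

def distance_matrix(points):
    # No row depends on any other row, so these iterations could run
    # in parallel; here they simply run one after another.
    return [distance_row(points, i) for i in range(len(points))]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = distance_matrix(points)
print(d[0])
```

Because each row is computed without reading any other row, no synchronization between workers is needed, which is what makes this kind of matrix computation a natural fit for a GPU.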
We are committed to implementing more R GPU functions, and we hope to contribute packages to the open source community via our project's website. The initial package is hosted by CRAN as gputools, a source package for UNIX and Linux systems. Install the package in the usual R manner. If there is any trouble with the installation, please see the notes included in the source distribution as the file INSTALL. We welcome contributions to the RGPGPU effort and encourage any comments or suggestions.
Figure 1 provides performance comparisons between the original R functions, assuming a four-thread data-parallel solution on an Intel Core i7 920, and our GPU-enabled R functions on a GTX 295 GPU. The speedup test consisted of running each of three algorithms on five randomly generated data sets. The Granger causality algorithm was tested with a lag of 2 for 200, 400, 600, 800, and 1000 random variables with 10 observations each. Complete hierarchical clustering was tested with 1000, 2000, 4000, 6000, and 8000 points. Calculation of Kendall's correlation coefficient was tested with 20, 30, 40, 50, and 60 random variables with 10000 observations each.

Figure 2 provides a performance comparison between the function 'granger.test' from the package 'MSBVAR' and our gpuGranger function. We used a single CPU thread on an Intel Core i7 920 and a GTX 260 GPU. The Granger causality algorithm was tested with a lag of 2 for 200, 400, 600, 800, and 1000 random variables with 10 observations each.

Figure 3 provides a performance comparison between the function 'hclust' and our gpuHclust function. We used a single CPU thread on an Intel Core i7 920 and a GTX 260 GPU. Complete hierarchical clustering was tested with 1000, 2000, 4000, 6000, and 8000 points.
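For readers unfamiliar with the method being timed, complete hierarchical clustering can be sketched as follows. This is a naive CPU illustration in Python under stated assumptions, not the algorithm used by 'hclust' or gpuHclust: the function name `complete_linkage` is hypothetical, and ties between equally close cluster pairs are broken by taking the first pair found.

```python
def complete_linkage(dist):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix of pairwise distances. Returns the list
    of merges as (cluster_a, cluster_b, height), where each cluster is
    a sorted tuple of point indices.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Four points on a line, forming two obvious pairs.
positions = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = complete_linkage(dist)
for cluster_a, cluster_b, height in merges:
    print(cluster_a, cluster_b, height)
```

The search for the closest pair of clusters dominates the cost and grows quickly with the number of points, which is why the test sizes of several thousand points are demanding on a single CPU thread.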

Figure 4 provides a performance comparison between the function 'cor' and our gpuCor function with 'method' set to 'kendall'. We used a single CPU thread on an Intel Core i7 920 and a GTX 260 GPU. Calculation of Kendall's correlation coefficient was tested with 20, 30, 40, 50, and 60 random variables with 10000 observations each.
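To show why Kendall's coefficient is expensive at this scale, here is a minimal Python sketch of the pairwise definition. It is an illustration only, not the code behind 'cor' or gpuCor: the name `kendall_tau` is hypothetical, and the sketch computes the simple tau-a variant, which makes no correction for ties.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant when x and y order it the same way.
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

same_order = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
reverse_order = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3], [1, 3, 2])
print(same_order, reverse_order, one_swap)
```

With 10000 observations, each pair of variables requires 10000 × 9999 / 2 = 49,995,000 comparisons under this definition. The comparisons are independent of one another, so they can be divided among many threads.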
The gputools R package is free for academic use. For commercial use, please contact the brainarray admin.
Download the gputools source package from the CRAN gputools page. Send questions, comments, and bug reports to Josh Buckner. The INSTALL text file, included in the package itself, may be helpful during installation of the source package.
In case you are curious, here is a table listing the compute capability of various Nvidia products. The table comes from the CUDA Programming Guide, a free PDF that ships with the CUDA toolkit under the doc directory. The function gpuCor and the SVM functions require double precision arithmetic on the device, so devices with compute capability less than 1.3 may give unsatisfactory results when using those functions. The rest of the package should work fine on cards with compute capability less than 1.3.
Device name  Compute capability 

GeForce GTX 295  1.3 
GeForce GTX 285, GTX 280  1.3 
GeForce GTX 260  1.3 
GeForce 9800 GX2  1.1 
GeForce GTS 250, GTS 150, 9800 GTX, 9800 GTX+, 8800 GTS 512  1.1 
GeForce 8800 Ultra, 8800 GTX  1.0 
GeForce 9800 GT, 8800 GT, 9800M GTX  1.1 
GeForce GT 130, 9600 GSO, 8800 GS, 8800M GTX, 9800M GT  1.1 
GeForce 8800 GTS  1.0 
GeForce 9600 GT, 8800M GTS, 9800M GTS  1.1 
GeForce 9700M GT  1.1 
GeForce GT 120, 9500 GT, 8600 GTS, 8600 GT, 9700M GT, 9650M GS, 9600M GT, 9600M GS, 9500M GS, 8700M GT, 8600M GT, 8600M GS  1.1 
GeForce G100, 8500 GT, 8400 GS, 8400M GT, 9500M G, 9300M G, 8400M GS, 9400 mGPU, 9300 mGPU, 8300 mGPU, 8200 mGPU, 8100 mGPU  1.1 
GeForce 9300M GS, 9200M GS, 9100M G, 8400M G  1.1 
Tesla S1070  1.3 
Tesla C1060  1.3 
Tesla S870  1.0 
Tesla D870  1.0 
Tesla C870  1.0 
Quadro Plex 2200 D2  1.3 
Quadro Plex 2100 D4  1.1 
Quadro Plex 2100 Model S4  1.0 
Quadro Plex 1000 Model IV  1.0 
Quadro FX 5800  1.3 
Quadro FX 4800  1.3 
Quadro FX 4700 X2  1.1 
Quadro FX 3700M  1.1 
Quadro FX 5600  1.0 
Quadro FX 3700  1.1 
Quadro FX 3600M  1.1 
Quadro FX 4600  1.0 
Quadro FX 2700M  1.1 
Quadro FX 1700, FX 570, NVS 320M, FX 1700M, FX 1600M, FX 770M, FX 570M  1.1 
Quadro FX 370, NVS 290, NVS 140M, NVS 135M, FX 360M  1.1 
Quadro FX 370M, NVS 130M  1.1 
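
The rule stated above the table (double precision needs compute capability 1.3 or higher) can be applied mechanically. The following Python sketch is illustrative only: the dictionary holds a handful of entries copied from the table rather than the full list, and the names `COMPUTE_CAPABILITY` and `supports_double_precision` are hypothetical, not part of gputools.

```python
# A few entries copied from the table above; not the complete list.
COMPUTE_CAPABILITY = {
    "GeForce GTX 295": "1.3",
    "GeForce GTX 260": "1.3",
    "GeForce 9800 GX2": "1.1",
    "Tesla C1060": "1.3",
    "Tesla C870": "1.0",
    "Quadro FX 5800": "1.3",
    "Quadro FX 4600": "1.0",
}

def supports_double_precision(capability):
    """Return True if a capability string such as "1.3" is at least 1.3."""
    major, minor = (int(part) for part in capability.split("."))
    # Compare as (major, minor) tuples so that a hypothetical "1.10"
    # would rank above "1.3", unlike a floating-point comparison.
    return (major, minor) >= (1, 3)

for device, capability in COMPUTE_CAPABILITY.items():
    ok = supports_double_precision(capability)
    print(device, capability, "double precision" if ok else "single precision only")
```

Devices reported as single precision only are the ones on which gpuCor and the SVM functions may give unsatisfactory results.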