What is R and why is it used?

R is a language and environment for statistical computing and graphics. It is a GNU project which provides a wide variety of statistical and graphical techniques, and is highly extensible. It includes

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

It is used for data analysis, statistical computing and data visualization.

Installing R on RHEL

To get R running on RHEL 6, we first need to enable the EPEL repository.

#rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm

#yum install R

Installing RStudio

RStudio Server is a web-based integrated development environment (IDE) for R.

#yum install http://download1.rstudio.org/rstudio-0.97.320-x86_64.rpm

After the installation, you can access the server in a browser on port 8787.

How to check whether R is installed

#rpm -qa|grep R

#yum list|grep R

#R
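Because grep R matches every package whose name contains a capital R, the checks above can be noisy. A narrower check could look like the sketch below. The helper name check_cmd is made up for this example; command -v is the standard shell way to look up a program on the PATH.

```shell
# check_cmd NAME: report whether NAME can be found on the PATH.
# check_cmd is a hypothetical helper name, not part of R or RHEL.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
        return 1
    fi
}

# Typical use on the server:
#   check_cmd R      # is the R launcher on the PATH?
#   rpm -q R         # query the package by its exact name
#   R --version      # print the installed R version
```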

RStudio Server is configured by adding entries to two configuration files:

/etc/rstudio/rserver.conf

/etc/rstudio/rsession.conf

Network port: www-port=80

Bind address: www-address=0.0.0.0

User library path: r-libs-user=~/R/packages

CRAN repository: r-cran-repos=https://mirrors.nics.utk.edu/cran/

R session parameters are specified in /etc/rstudio/rsession.conf

session-timeout-minutes=30

After editing the configuration files, verify them with:

rstudio-server verify-installation
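As a rough sketch, the settings listed above could be staged in a scratch directory and reviewed before being copied to /etc/rstudio/. The function name write_rstudio_conf is hypothetical. Placing r-libs-user and r-cran-repos in rsession.conf is an assumption based on the RStudio Server documentation, so check it against your version.

```shell
# write_rstudio_conf DIR: write sample rserver.conf and rsession.conf into DIR.
# Hypothetical helper; it only writes files and does not touch /etc/rstudio.
write_rstudio_conf() {
    mkdir -p "$1" || return 1

    # Server-level settings: listening port and bind address.
    cat > "$1/rserver.conf" <<'EOF'
www-port=80
www-address=0.0.0.0
EOF

    # Session-level settings (assumed to belong in rsession.conf).
    cat > "$1/rsession.conf" <<'EOF'
r-libs-user=~/R/packages
r-cran-repos=https://mirrors.nics.utk.edu/cran/
session-timeout-minutes=30
EOF
}

# Typical use:
#   write_rstudio_conf /tmp/rstudio-conf
#   cp /tmp/rstudio-conf/*.conf /etc/rstudio/
#   rstudio-server verify-installation
```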

Installing R packages

Method 1: Install from source

Download the add-on R package (say, mypkg) and run the following command in a Unix shell to install it to /my/own/R-packages/:

$ R CMD INSTALL -l /my/own/R-packages/ mypkg

Method 2: Install from CRAN directly

Type the following command in the R console to install the package to /my/own/R-packages/ directly from CRAN:

> install.packages("mypkg", lib="/my/own/R-packages/")

Load the library

Type the following command in the R console to load the package:

> library("mypkg", lib.loc="/my/own/R-packages/")
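Passing the library path on every call gets repetitive. One possible alternative, sketched below, is to put the private library directory on R's search path through the R_LIBS environment variable. R_LIBS is a standard R variable holding a colon-separated list of library directories, but use_r_lib is a helper name invented for this example.

```shell
# use_r_lib DIR: create DIR if needed and put it at the front of R_LIBS.
# use_r_lib is a hypothetical helper name; R_LIBS is read by R at startup.
use_r_lib() {
    mkdir -p "$1" || return 1
    # Prepend DIR, keeping any directories already listed in R_LIBS.
    R_LIBS="$1${R_LIBS:+:$R_LIBS}"
    export R_LIBS
}

# Typical use (the path is the placeholder used in the text):
#   use_r_lib /my/own/R-packages
#   R CMD INSTALL mypkg
```

With the variable set in the shell that starts R, the directory should appear first in .libPaths(), so install.packages("mypkg") and library("mypkg") can be called without the lib and lib.loc arguments.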

R use case

Installing RHadoop

RHadoop is a collection of three R packages: rmr, rhdfs and rhbase. The rmr package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file management in R, and rhbase provides HBase database management from within R.

Install base R first. On RHEL with EPEL the package is named R (r-base is the Debian/Ubuntu package name).

#yum install R

Then we need to install the RHadoop packages with their dependencies. rmr requires Rcpp, RJSONIO, digest, functional, stringr and plyr, while rhdfs requires rJava.

As part of the installation, we need to reconfigure Java for the rJava package and set the HADOOP_CMD variable for the rhdfs package. The corresponding tar.gz archives must be downloaded first; each one is then installed with the R CMD INSTALL command.

#R CMD INSTALL Rcpp_0.10.2.tar.gz
#R CMD INSTALL RJSONIO_1.0-1.tar.gz
#R CMD INSTALL digest_0.6.2.tar.gz
#R CMD INSTALL functional_0.1.tar.gz
#R CMD INSTALL stringr_0.6.2.tar.gz
#R CMD INSTALL plyr_1.8.tar.gz
#R CMD INSTALL rmr2_2.0.2.tar.gz

#JAVA_HOME=/home/istvan/jdk1.6.0_38/jre R CMD javareconf
#R CMD INSTALL rJava_0.9-3.tar.gz
#HADOOP_CMD=/home/istvan/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz
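The order matters, because each archive must be installed after the packages it depends on. A small wrapper along the following lines could run the installs in the order given and stop at the first failure. It is only a sketch: install_tarballs and its -n (dry run) option are invented for this example and are not part of R.

```shell
# install_tarballs [-n] FILE...: run R CMD INSTALL on each archive in order.
# With -n, only print the commands that would be run (dry run).
# Hypothetical helper; it stops at the first failed install.
install_tarballs() {
    dry_run=0
    if [ "$1" = "-n" ]; then
        dry_run=1
        shift
    fi
    for tarball in "$@"; do
        if [ "$dry_run" -eq 1 ]; then
            echo "R CMD INSTALL $tarball"
        else
            R CMD INSTALL "$tarball" || return 1
        fi
    done
}

# Typical use, dependencies first:
#   install_tarballs -n Rcpp_0.10.2.tar.gz RJSONIO_1.0-1.tar.gz rmr2_2.0.2.tar.gz
```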

Sample R script to count the countries whose GDP is greater than Apple's revenue

The input file, GDP_converted.csv, looks like this:

Country Code,Number,Country Name,GDP
USA,1,United States,14991300
CHN,2,China,7318499
JPN,3,Japan,5867154
DEU,4,Germany,3600833
FRA,5,France,2773032

The gdp.R script looks like this: 

Sys.setenv(HADOOP_HOME="/home/anoop/hadoop")
# HADOOP_CMD must point to the hadoop executable, not the install directory
Sys.setenv(HADOOP_CMD="/home/anoop/hadoop/bin/hadoop")

library(rmr2)
library(rhdfs)

setwd("/home/anoop/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)

hdfs.init()
gdp.values <- to.dfs(gdp)

# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508

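# Map: label each row "less" or "greater" by comparing the GDP column (4th) with Apple's revenue
gdp.map.fn <- function(k,v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}

# Reduce: count how many values arrived for each key
count.reduce.fn <- function(k,v) {
  keyval(k, length(v))
}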

count <- mapreduce(input=gdp.values,
                   map = gdp.map.fn,
                   reduce = count.reduce.fn)

from.dfs(count)

R will initiate a Hadoop streaming job to process the data with MapReduce. Once the job finishes, from.dfs returns the counts, and in RStudio we can plot them as a histogram.
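To sanity-check the counts without Hadoop, the same comparison can be run locally on the CSV file. The sketch below uses awk and is not how rmr2 does the work. The function name count_gdp is made up, and it assumes a one-line header, GDP in the fourth comma-separated field, and no commas inside field values.

```shell
# count_gdp FILE THRESHOLD: count rows whose 4th field (GDP) is below THRESHOLD
# ("less") or not below it ("greater"), mirroring the map and reduce steps.
# Hypothetical helper; skips the header line and any row with fewer than 4 fields.
count_gdp() {
    awk -F, -v limit="$2" '
        NR > 1 && NF >= 4 {
            if ($4 + 0 < limit + 0) less++; else greater++
        }
        END {
            printf "greater %d\nless %d\n", greater, less
        }
    ' "$1"
}

# Typical use, with Apple 2012 revenue in millions of USD:
#   count_gdp GDP_converted.csv 156508
```

For the five sample rows shown earlier, every GDP value exceeds 156508, so this prints greater 5 and less 0.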
