R
R is available on the cluster. R can also be installed on your computer for free by visiting the R-project main page.
For more information about R, you might want to use the R manual or RSeek, a search engine for R-related resources.
Running R on the Cluster
You can run R interactively on the cluster with
R
You can also run a .R file in batch mode with
R CMD BATCH Rfile.R
where Rfile.R is your R file. To run your R command in the background, see [[cluster:managing_jobs|Managing Jobs]].

Introduction to R
The following section comes initially from an introductory talk on R given by Paul Bailey in February 2011. The data used in the examples is located at [[http://terpconnect.umd.edu/~pdbailey/R/MDemp.csv|this link]].

R Background
* Based on Bell Labs' S
* Open source software
* Large group of contributors
* Most R code is written in R
* Computationally intensive code is written in FORTRAN or C
* Datasets and matrices are native types
* Easy, customizable graphics

R Pros
* Free
* Easy to get a sense of what is going on with data
* Excellent at simulation
* Interfaces with lots of other software (e.g. WinBUGS, SQL)

R Cons
* Uses RAM to store data
* Support mainly via listservs
* Difficult to get started

Read in Data
* There are some type-specific methods:
dat <- read.csv("MDemp.csv")
and a general method:
dat <- read.table("MDemp.csv",sep=",")
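After reading the data in, it is worth checking that it loaded as expected. A minimal sketch, assuming MDemp.csv is in your working directory and using the dat data frame from above:
dat <- read.csv("MDemp.csv")   # read the data as above
dim(dat)                       # number of rows and columns
head(dat)                      # first few rows
str(dat)                       # name and type of each column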
Getting Help
- You can use the following command to get the help page for a command:
?
- To search for text in the help pages, use the following command (both are shown in the example below):
??
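For example (lm and "regression" are used here only as illustrations):
?lm            # open the help page for the lm function
??regression   # search all installed help pages for the word "regression"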
Summary
- Getting summaries is easy:
summary(dat)
- You can also focus on one variable
summary(dat$num_child)
table(dat$num_child)
Subset Data
- When you index a data frame with
[condition,]
you select only the rows where the condition is true (an alternative using subset() is sketched below):
dat.lf <- dat[dat$emp %in% c("emp","unemp"),]
dat.hs <- dat.lf[dat.lf$educ==39,]
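An alternative sketch of the same first subset using the subset() function (not from the original talk, but equivalent here):
dat.lf <- subset(dat, emp %in% c("emp","unemp"))   # keep only employed or unemployed rows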
Linear Models
- The lm function fits linear models with a formula:
lm1 <- lm(weekly_earn ~ age + year, data=dat)
summary(lm1)
- You can also treat a variable as a factor:
dat$yearf <- as.factor(dat$year)
lm2 <- lm(weekly_earn ~ age + yearf, data=dat)
summary(lm2)
- And change the contrasts used for the factor:
contrasts(dat$yearf) <- "contr.sum"
lm3 <- lm(weekly_earn ~ age + yearf, data=dat)
summary(lm3)
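Once a model is fit, you can pull pieces out of the fitted object. A small sketch using lm1 from above (the age and year values in the prediction are made-up illustrations):
coef(lm1)      # estimated coefficients
confint(lm1)   # confidence intervals for the coefficients
predict(lm1, newdata=data.frame(age=40, year=2005))   # predicted weekly earnings for a hypothetical person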
Aggregate
- Allows you to create summary statistics for groups
- First argument is what you want to summarize
- Second argument is what you want to group by
- The third argument is the function to apply to each group
agg.hs <- aggregate(dat.hs$emps, by=list(dat.hs$yq), mean)
- The column names of the result are a little odd; they can be renamed afterwards, as in the sketch below.
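A minimal sketch of cleaning up those names (the new names are just illustrative choices):
names(agg.hs)   # aggregate() calls the columns "Group.1" and "x" by default
names(agg.hs) <- c("yq", "mean_emps")
head(agg.hs)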
Merge
- Combines two datasets by matching on shared columns
merged <- merge(data.a,data.b)
* Lots of options for this one (see the sketch below and ?merge)
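A short sketch of some of those options; data.a and data.b are placeholders, and the column names id and person_id are made up for illustration:
merged  <- merge(data.a, data.b, by="id", all.x=TRUE)           # match on "id", keep unmatched rows from data.a
merged2 <- merge(data.a, data.b, by.x="id", by.y="person_id")   # key columns named differently in each dataset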
Parallel
Some basic info can be found at the High Performance Computing CRAN task view. You can use the “parallel” package (which merges both “snow” and “multicore”).
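A minimal sketch of the parallel package, assuming a toy function and 4 cores (on the cluster, match the core count to what your job requested):
library(parallel)
slow_square <- function(x) { Sys.sleep(0.1); x^2 }   # a toy function that takes a while

# fork-based version (Linux): run over 4 cores
res <- mclapply(1:100, slow_square, mc.cores=4)

# snow-style cluster version (also works on Windows)
cl <- makeCluster(4)
res2 <- parLapply(cl, 1:100, slow_square)
stopCluster(cl)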
You can also use the Rmpi and npRmpi packages. You have your choice of MPI-2 libraries (either OpenMPI or MPICH2). You will have to install the packages in your own userspace (which requires compilation).
| | OpenMPI | MPICH2 |
| --- | --- | --- |
| Before anything (installation or usage) | >module load openmpi-x86_64 | >module load mpich2-x86_64 |
| Installation | R> install.packages("<package>", configure.args="--with-Rmpi-include=/usr/lib64/openmpi/1.4-gcc/include --with-Rmpi-libpath=/usr/lib64/openmpi/1.4-gcc/lib --with-Rmpi-type=OPENMPI") | R> install.packages("<package>", configure.args="--with-Rmpi-include=/usr/include/mpich2-x86_64 --with-Rmpi-libpath=/usr/lib64/mpich2/lib --with-Rmpi-type=MPICH") |
A good intro guide is npRmpi: A package for parallel distributed kernel estimation in R.
Other functions
- merge: merges datasets
- glm: fits limited dependent variable models
- optim: minimizes / finds zeros
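Two quick sketches of glm and optim; the employment dummy variable and the function being minimized are made up for illustration (glm reuses dat.lf from the subsetting example above):
# logistic regression for the probability of being employed
dat.lf$employed <- as.numeric(dat.lf$emp == "emp")
glm1 <- glm(employed ~ age, data=dat.lf, family=binomial)
summary(glm1)

# minimize a simple function of two parameters, starting from c(0, 0)
f <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2
opt <- optim(c(0, 0), f)
opt$par   # should be close to c(1, -2)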