
R

R is available on the cluster. R can also be installed on your computer for free by visiting the R-project main page.

For more information about R, see the R manual or RSeek, a search engine for R-related resources.

Running R on the Cluster

You can run R interactively on the cluster with:

You can also run a .R file in batch mode with:

R CMD BATCH Rfile.R

where Rfile.R is your R file. By default the output is written to Rfile.Rout in the same directory. To run your R command in the background, see Managing Jobs.
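As a rough illustration of what a batch script might contain, here is a minimal sketch. The computation and the output file name results.rds are hypothetical placeholders, not part of the original talk or the cluster setup.

```r
# Rfile.R: a minimal script suitable for batch mode (contents are illustrative).
set.seed(1)                            # make the simulated numbers reproducible
x <- rnorm(100)                        # placeholder for the real work
res <- c(mean = mean(x), sd = sd(x))   # a small named vector of results

print(res)                             # printed output ends up in Rfile.Rout
saveRDS(res, "results.rds")            # hypothetical file name; reload with readRDS()
```

Saving results explicitly is usually more convenient than parsing them back out of the .Rout file afterwards.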

Introduction to R

The following section is adapted from an introductory talk on R given by Paul Bailey in February 2011. The data used in the examples is located at this link.

R Background

Based on Bell Labs S
Open source software
- Large group of contributors
- Most R code is written in R
- Computationally intensive code written in FORTRAN or C
Datasets and matrices are native types
Easy, customizable graphics

R Pros

Free
Easy to get a sense of what is going on with data
Excellent at simulation
Interfaces with lots of other software (e.g. WinBUGS, SQL)

R Cons

Uses RAM to store data
Support mainly via mailing lists
Difficult to get started

Read in Data

There are type-specific methods, such as read.csv for comma-separated files:
```
dat <- read.csv("MDemp.csv")
```
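and a general method, read.table. Unlike read.csv, it does not assume a header row by default, so header=TRUE has to be set explicitly:

dat <- read.table("MDemp.csv",sep=",",header=TRUE)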
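To see that the two readers agree once the options match, here is a small self-contained sketch. It writes a made-up three-column CSV to a temporary file instead of using MDemp.csv, so the values and the column set are assumptions for illustration only.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("age,educ,emp",
             "34,39,emp",
             "51,43,unemp"), path)

# Type-specific reader: header = TRUE and sep = "," are its defaults.
a <- read.csv(path)

# General reader: the same settings have to be spelled out.
b <- read.table(path, sep = ",", header = TRUE)

all.equal(a, b)   # compares the two data frames
str(a)            # column names and types as R parsed them
```

Without header = TRUE, read.table would treat the first line as data and name the columns V1, V2, V3.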

Getting Help

* You can use ? followed by a command name (for example ?lm) to get the help page for a command
* To search for text in the help pages use ?? followed by the search term

Summary

* Getting summaries is easy

summary(dat)

* You can also focus on one variable

summary(dat$num_child)
table(dat$num_child)

* (Extended example)
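The calls above can be tried on a toy data frame. In this sketch only the column name num_child comes from the text; the values and the age column are made up.

```r
# Made-up stand-in for the talk's data.
dat <- data.frame(age       = c(25, 40, 33, 58, 47),
                  num_child = c(0, 2, 1, 0, 2))

summary(dat)                  # min, quartiles, mean and max for every column
summary(dat$num_child)        # the same statistics for a single variable

tab <- table(dat$num_child)   # counts of each distinct value
tab
prop.table(tab)               # the same counts expressed as shares
```

summary treats a numeric column as continuous, while table counts distinct values, so the two are complementary for a small-integer variable like a number of children.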

Subset Data

* When you reference something with [condition,] you can select rows

dat.lf <- dat[dat$emp %in% c("emp","unemp"),]
dat.hs <- dat.lf[dat.lf$educ==39,]
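A self-contained sketch of the same two-step subsetting follows. The data values and the "nilf" category are hypothetical; the column names and the code educ == 39 are taken from the example above.

```r
# Made-up data; "nilf" (not in labor force) is a hypothetical category.
dat <- data.frame(emp         = c("emp", "unemp", "nilf", "emp", "emp"),
                  educ        = c(39, 39, 43, 40, 39),
                  weekly_earn = c(600, 0, 0, 900, 750))

# Keep rows in the labor force, then rows with educ equal to 39.
dat.lf <- dat[dat$emp %in% c("emp", "unemp"), ]
dat.hs <- dat.lf[dat.lf$educ == 39, ]

# The same selection in one step with subset().
dat.hs2 <- subset(dat, emp %in% c("emp", "unemp") & educ == 39)

nrow(dat.hs)
all.equal(dat.hs, dat.hs2)
```

The trailing comma matters: dat[condition, ] selects rows, whereas dat[condition] would try to select columns.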

Linear Models

* The lm function fits linear models with a "formula"

lm1 <- lm(weekly_earn ~ age + year,data=dat)
summary(lm1)

* You can also treat a variable as a "factor"

dat$yearf <- as.factor(dat$year)
lm2 <- lm(weekly_earn ~ age + yearf,data=dat)
summary(lm2)

* And change the constraints (contrasts) used to code the factor

contrasts(dat$yearf) <- "contr.sum"
lm3 <- lm(weekly_earn ~ age + yearf,data=dat)
summary(lm3)
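To make the difference between the three fits concrete, here is a sketch on simulated data. The data-generating numbers are invented; only the variable names follow the text.

```r
set.seed(42)
n <- 200
# Simulated data with made-up coefficients.
dat <- data.frame(age  = sample(20:60, n, replace = TRUE),
                  year = rep(2008:2010, length.out = n))
dat$weekly_earn <- 300 + 8 * dat$age + 20 * (dat$year - 2008) + rnorm(n, sd = 50)

# year as a number: a single slope for year.
lm1 <- lm(weekly_earn ~ age + year, data = dat)

# year as a factor: one dummy per year, relative to the first year.
dat$yearf <- as.factor(dat$year)
lm2 <- lm(weekly_earn ~ age + yearf, data = dat)

# Sum-to-zero contrasts: year effects are deviations from the average year.
contrasts(dat$yearf) <- "contr.sum"
lm3 <- lm(weekly_earn ~ age + yearf, data = dat)

names(coef(lm1))
names(coef(lm2))
names(coef(lm3))
all.equal(fitted(lm2), fitted(lm3))   # same fit, different parameterization
```

Changing the contrasts changes how the year coefficients are read, not the fitted values.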

Aggregate

* Allows you to create summary statistics for groups
* First argument is what you want to summarize
* Second argument is what you want to group by (it needs one entry per row of the first argument)
* Third argument is what to do to the groups

agg.hs <- aggregate(dat.hs$emps,by=list(dat.hs$yq),mean)

* Result names are a little odd: the columns are called Group.1 and x.
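The sketch below shows the default result names and one way to get friendlier ones. The data are made up, and treating emps as a 0/1 employment indicator and yq as a year-quarter label is an assumption about the original dataset.

```r
# Made-up data; emps is assumed to be a 0/1 indicator and yq a year-quarter label.
dat.hs <- data.frame(yq   = c("2009Q1", "2009Q1", "2009Q2", "2009Q2", "2009Q2"),
                     emps = c(1, 0, 1, 1, 0))

# Vector interface: grouping column and result get generic names.
agg1 <- aggregate(dat.hs$emps, by = list(dat.hs$yq), mean)
names(agg1)

# Formula interface: the original column names are kept.
agg2 <- aggregate(emps ~ yq, data = dat.hs, FUN = mean)
names(agg2)
agg2
```

Naming the list element, as in by = list(yq = dat.hs$yq), is another way to control the name of the grouping column.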

Merge

* Joins two datasets on their shared columns

merged <- merge(data.a,data.b)

* Lots of options for this one
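A few of the more commonly used options can be sketched on two toy tables. The tables and their columns are invented for illustration; by, all.x and all are standard arguments of merge.

```r
# Two made-up tables sharing the column id.
data.a <- data.frame(id = c(1, 2, 3), age = c(30, 45, 52))
data.b <- data.frame(id = c(2, 3, 4), weekly_earn = c(700, 820, 640))

# Default: keep only ids present in both tables.
merged <- merge(data.a, data.b)

# Keep every row of data.a; unmatched rows get NA.
left <- merge(data.a, data.b, by = "id", all.x = TRUE)

# Keep every row of both tables.
full <- merge(data.a, data.b, by = "id", all = TRUE)

merged
left
full
```

Stating by explicitly is a reasonable habit, since the default silently joins on every column name the two tables have in common.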

Parallel

Some basic info can be found at the [http://cran.r-project.org/web/views/HighPerformanceComputing.html High Performance Computing CRAN view]. You can use the "parallel" package, which merges both "snow" and "multicore".
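As a minimal sketch of the "parallel" package, the example below runs a small bootstrap on two local worker processes. The task, the data, and the helper name boot_mean are hypothetical, and the worker count of 2 is an assumption that should be matched to the cores your job actually has.

```r
library(parallel)

# Hypothetical task: the mean of one bootstrap resample of the data.
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3)   # made-up observations

cl <- makeCluster(2)                   # start 2 local worker processes
clusterSetRNGStream(cl, 123)           # reproducible random streams on the workers
res <- parLapply(cl, 1:100, boot_mean, data = x)
stopCluster(cl)                        # always release the workers

boot <- unlist(res)
length(boot)
sd(boot)                               # bootstrap standard error of the mean
```

On Linux, mclapply from the same package is a fork-based alternative that needs no explicit cluster object.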

You can also use [http://cran.r-project.org/web/packages/Rmpi/index.html Rmpi] and [http://cran.r-project.org/web/packages/npRmpi/index.html npRmpi] packages. You have your choice of MPI2 libraries (both OpenMPI and MPICH2). You will have to install the packages in your userspace (requiring compilation).


A good intro guide is [http://onlinelibrary.wiley.com/doi/10.1002/jae.1221/pdf npRmpi: A package for parallel distributed kernel estimation in R], published in the Journal of Applied Econometrics.

Other functions

* merge merges datasets
* glm fits limited dependent variable models
* optim minimizes / finds zeros
* [http://cran.r-project.org/web/views/Econometrics.html contributed econometrics packages]
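To connect glm and optim, the sketch below fits a logit model with glm and then recovers roughly the same estimates by minimizing the negative log-likelihood with optim. The data are simulated with made-up coefficients, and the names employed, negll and negll_gr are hypothetical.

```r
set.seed(1)
n   <- 500
age <- runif(n, 20, 60)
# Simulated binary outcome from a logit model with made-up coefficients.
employed <- rbinom(n, 1, plogis(-2 + 0.05 * age))

# Logit fit with glm.
fit <- glm(employed ~ age, family = binomial(link = "logit"))
coef(fit)

# Negative log-likelihood of the same model and its gradient.
negll <- function(b) {
  eta <- b[1] + b[2] * age
  -sum(employed * eta - log1p(exp(eta)))
}
negll_gr <- function(b) {
  r <- employed - plogis(b[1] + b[2] * age)   # response minus fitted probability
  -c(sum(r), sum(r * age))
}

# Minimize with optim, starting from zero.
opt <- optim(c(0, 0), negll, gr = negll_gr, method = "BFGS")
opt$par
opt$convergence   # 0 indicates that optim reports convergence
```

optim minimizes by default, which is why the log-likelihood is negated before being passed in.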
