R
R is available on the cluster. R can also be installed on your computer for free by visiting the R-project main page.
For more information about R, see the R manual or RSeek, a search engine for R-related resources.
Running R on the Cluster
You can run R interactively on the cluster with:
R
You can also run a .R file in batch mode with
R CMD BATCH Rfile.R
where Rfile.R is your R file. To run your R command in the background, see Managing Jobs.
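As a rough sketch of what a batch-friendly Rfile.R might contain: in batch mode nothing is shown on screen, and R CMD BATCH writes the console output to Rfile.Rout, so it usually helps to save results to a file explicitly. The computation and the output path below are placeholders, not part of the original talk.

```r
# Rfile.R: a hypothetical batch script (contents are illustrative only)
set.seed(1)                        # make the run reproducible
x <- rnorm(100)                    # placeholder computation
result <- data.frame(mean = mean(x), sd = sd(x))

# Assumed output location; on the cluster you would pick your own directory
out_file <- file.path(tempdir(), "result.csv")
write.csv(result, out_file, row.names = FALSE)

print(result)                      # in batch mode this ends up in Rfile.Rout
```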
Introduction to R
The following section is adapted from an introductory talk on R given by Paul Bailey in February 2011. The data used in the examples is located at this link.
R Background
- Based on Bell Labs S
- Open source software
- Large group of contributors
- Most R code is written in R
- Computationally intensive code written in FORTRAN or C
- Datasets, matrices are native types
- Easy, customizable graphics
R Pros
- Free
- Easy to get a sense of what is going on with data
- Excellent at simulation
- Interfaces with lots of other software (e.g., WinBUGS, SQL)
R Cons
- Uses RAM to store data
- Support is mainly via listservs (mailing lists)
- Difficult to get started
Read in Data
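- There are some type-specific methods, such as read.csv for comma-separated files:
dat <- read.csv("MDemp.csv")
- And there is a general method, read.table, where you describe the format yourself. Unlike read.csv, it does not assume a header row, so ask for one:
dat <- read.table("MDemp.csv", sep=",", header=TRUE)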
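To see how the two readers relate, here is a small self-contained sketch. It writes a tiny CSV to a temporary file (the column names and values are invented for illustration, not taken from MDemp.csv) and reads it both ways.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
# Column names and values are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,weekly_earn", "30,500", "45,720"), tmp)

a <- read.csv(tmp)                              # header = TRUE and sep = "," are the defaults
b <- read.table(tmp, sep = ",", header = TRUE)  # the same choices, spelled out

str(a)            # inspect column names and types
all.equal(a, b)   # compare the two results
```

Without header = TRUE, read.table would treat the first line as data and name the columns V1, V2.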
Getting Help
* You can use the ? command to get the help page for a command
* To search for text in help text use the ?? command
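The shortcuts above have function equivalents that can be handy inside scripts. The sketch below uses only functions from base R and the utils package; the topics looked up are arbitrary examples.

```r
# help() is the function behind the ? shortcut
h <- help("read.csv")      # same lookup as ?read.csv; print(h) displays the page

# help.search() is the function behind the ?? shortcut (it can be slow)
# help.search("linear model")

# apropos() lists objects on the search path whose names match a pattern
read_funs <- apropos("^read\\.")
print(read_funs)

# example() runs the examples from a help page
# example(mean)
```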
Summary
* Getting summaries is easy
summary(dat)
* You can also focus on one variable
summary(dat$num_child)
table(dat$num_child)
* (Extended example)
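The same calls can be tried on a toy data frame. The values below are invented and only stand in for the talk's dataset.

```r
# Toy data standing in for the talk's dataset; values are invented.
toy <- data.frame(
  num_child   = c(0, 2, 1, 0, 3, 2),
  weekly_earn = c(500, 720, 610, 480, 900, 650)
)

summary(toy)                     # minimum, quartiles, mean and maximum per column
summary(toy$num_child)           # the same for a single variable

counts <- table(toy$num_child)   # how often each distinct value occurs
print(counts)
```

summary() suits continuous variables, while table() is usually more informative for variables with few distinct values.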
Subset Data
* When you reference something with [condition,] you can select rows
dat.lf <- dat[dat$emp %in% c("emp","unemp"),]
dat.hs <- dat.lf[dat.lf$educ==39,]
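A minimal sketch of row selection on invented data. The "nilf" label and the education codes are assumptions made for illustration and may not match the coding in the real dataset.

```r
# Invented data; the labels and codes are assumptions for illustration.
toy <- data.frame(
  emp  = c("emp", "unemp", "nilf", "emp", "unemp"),
  educ = c(39, 40, 39, 39, 43),
  stringsAsFactors = FALSE
)

# A logical vector before the comma picks rows;
# leaving the part after the comma empty keeps every column.
in_lf  <- toy$emp %in% c("emp", "unemp")
toy.lf <- toy[in_lf, ]

# Conditions can be combined with & (and) and | (or)
toy.hs <- toy[in_lf & toy$educ == 39, ]

# subset() expresses the same selection without repeating toy$
toy.hs2 <- subset(toy, emp %in% c("emp", "unemp") & educ == 39)
```

Forgetting the comma, as in toy[in_lf], selects columns rather than rows, which is a common early mistake.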
Linear Models
* The lm function fits linear models with a "formula"
lm1 <- lm(weekly_earn ~ age + year,data=dat)
summary(lm1)
* You can also treat a variable as a "factor"
dat$yearf <- as.factor(dat$year)
lm2 <- lm(weekly_earn ~ age + yearf,data=dat)
summary(lm2)
* And change the contrasts (the constraints used to identify the factor coefficients)
contrasts(dat$yearf) <- "contr.sum"
lm3 <- lm(weekly_earn ~ age + yearf,data=dat)
summary(lm3)
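The three fits can be compared on simulated data. This is a sketch under assumed values: the data-generating process below is invented, and only the variable names follow the examples above.

```r
set.seed(42)
n <- 200
# Simulated data; the coefficients used here are arbitrary.
sim <- data.frame(
  age  = sample(20:60, n, replace = TRUE),
  year = sample(2008:2010, n, replace = TRUE)
)
sim$weekly_earn <- 300 + 8 * sim$age + 20 * (sim$year - 2008) + rnorm(n, sd = 50)

# year as a number: a single slope for year
fit_num <- lm(weekly_earn ~ age + year, data = sim)

# year as a factor: one coefficient per level except the baseline level
sim$yearf <- as.factor(sim$year)
fit_trt <- lm(weekly_earn ~ age + yearf, data = sim)

# sum-to-zero contrasts: level effects are deviations from the average level
contrasts(sim$yearf) <- "contr.sum"
fit_sum <- lm(weekly_earn ~ age + yearf, data = sim)

coef(fit_trt)
coef(fit_sum)

# The two factor models are the same fit with different parameterizations
same_fit <- all.equal(fitted(fit_trt), fitted(fit_sum))
print(same_fit)
```

Changing the contrasts changes how the year coefficients are read, but not the fitted values or the residuals.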
Aggregate
* Allows you to create summary statistics for groups
* First argument is what you want to summarize
* Second argument is what you want to group by
* Third argument is what to do to the groups
agg.hs <- aggregate(dat.hs$emps,by=list(dat.hs$yq),mean)
* Result names are a little odd: the columns are called Group.1 and x.
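The odd result names can be avoided in two ways, sketched here on invented data. The emps column is assumed to be a 0/1 indicator and yq a year-quarter label; neither is confirmed by the original talk.

```r
# Invented data: emps is assumed to be a 0/1 indicator, yq a year-quarter label.
toy <- data.frame(
  yq   = c("2009Q1", "2009Q1", "2009Q2", "2009Q2", "2009Q2"),
  emps = c(1, 0, 1, 1, 0)
)

# Vector interface: the result columns are named Group.1 and x
agg1 <- aggregate(toy$emps, by = list(toy$yq), FUN = mean)
names(agg1)

# Naming the list element names the grouping column
agg2 <- aggregate(toy$emps, by = list(yq = toy$yq), FUN = mean)
names(agg2)

# The formula interface keeps both original names
agg3 <- aggregate(emps ~ yq, data = toy, FUN = mean)
names(agg3)
```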
Merge
* Joins two datasets on shared columns
merged <- merge(data.a,data.b)
* Lots of options for this one
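A few of the more commonly used options, sketched on two invented data frames that share an id column:

```r
# Two invented datasets sharing the column "id"
data.a <- data.frame(id = c(1, 2, 3), age = c(30, 45, 52))
data.b <- data.frame(id = c(2, 3, 4), weekly_earn = c(720, 610, 480))

# Default: join on all shared column names, keep only matching rows
inner <- merge(data.a, data.b)

# Keep every row of data.a; unmatched rows get NA in data.b's columns
left <- merge(data.a, data.b, by = "id", all.x = TRUE)

# Keep every row from both datasets
full <- merge(data.a, data.b, by = "id", all = TRUE)

print(full)
```

When the key columns have different names in the two datasets, the by.x and by.y arguments name them separately.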
Parallel
Some basic info can be found at the [http://cran.r-project.org/web/views/HighPerformanceComputing.html High Performance Computing CRAN view]. You can use the “parallel” package, which combines the functionality of the older “snow” and “multicore” packages.
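A minimal sketch of the “parallel” package, assuming two worker processes on a single machine. The worker count and the slow_square function are made up for illustration; on the cluster the worker count should presumably match the cores your job has requested.

```r
library(parallel)

# A deliberately slow, hypothetical task
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

cl <- makeCluster(2)                     # start 2 worker processes (arbitrary choice)
res <- parLapply(cl, 1:8, slow_square)   # like lapply, but spread across the workers
stopCluster(cl)                          # always release the workers when done

squares <- unlist(res)
print(squares)
```

On Linux and macOS, mclapply(1:8, slow_square, mc.cores = 2) is a fork-based alternative that needs no explicit cluster object.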
You can also use the [http://cran.r-project.org/web/packages/Rmpi/index.html Rmpi] and [http://cran.r-project.org/web/packages/npRmpi/index.html npRmpi] packages. You have your choice of MPI2 libraries (OpenMPI or MPICH2). You will have to install the packages in your own userspace, which requires compiling them.
A good intro guide is [http://onlinelibrary.wiley.com/doi/10.1002/jae.1221/pdf npRmpi: A package for parallel distributed kernel estimation in R], published in the Journal of Applied Econometrics.
Other functions
* merge merges datasets
* glm fits limited dependent variable models
* optim minimizes / finds zeros
* [http://cran.r-project.org/web/views/Econometrics.html contributed econometrics packages]
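As an illustration of how glm and optim relate, the sketch below fits a logit model both ways on simulated data. The data-generating values are invented, and the hand-written likelihood is one possible formulation rather than anything from the original talk.

```r
set.seed(7)
n <- 500
x <- rnorm(n)
# Simulated binary outcome; intercept -0.5 and slope 1.2 are arbitrary choices
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Logit model via glm
fit_glm <- glm(y ~ x, family = binomial(link = "logit"))

# The same model by minimizing the negative log-likelihood with optim
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}
fit_opt <- optim(c(0, 0), negloglik, method = "BFGS")

# The two sets of estimates should agree closely
rbind(glm = coef(fit_glm), optim = fit_opt$par)
```

glm is the convenient route for standard models, while optim is useful when the likelihood has to be written by hand.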