Thursday, May 17, 2012

Notes R


NotesR
=======
This is for Windows mainly. However, many of the notes may apply to Linux as well.

Contents
=========
Setup / Install
R Upgrade new version
R - changs in 2.15.0
Running R
Quick Commands
R.matlab
R object types
Arrays
Packages - Manual Install
Packages
R - Stat functions
R - Time Series functions
Exploratory Data Analysis
Plotting on multiple windows
Writing to HTML files
List
Sets Operations
R - functions
3D plots
Plotting features
ECDF ecdf Empirical CDF
data.frames
Reading Data Files
serialIndepTest
[R] Operators
RUnit - Testing
Linear Least Squares Regression
R Interoperability
Calling Fortran from R
Fitting Distributions using VGAM::vglm and others
Error Handling
Sweave
Global variables and scope


Setup / Install
================
Mirror Site: http://cran.ms.unimelb.edu.au/
Installed in directory:
   C:\Program Files\R\R-2.12.1\bin\i386

To install additional packages, use the GUI to download and install.

Help files:
http://127.0.0.1:21879/library/utils/html/help.html
C:\Program Files\R\R-2.12.1\doc\html\index.html

Appendix A in manual is a helpful starter.

Path - typical installation path is:  C:\Program Files\R\R-2.12.1\bin
Include the path in the PATH variable or another envrionment variable


R Upgrade new version
======================
The R installation is setup such that multiple versions can exist on Windows.
For example it is possible to have installations at:
   C:\program files\R\R-2.12.1
   C:\program files\R\R-2.12.2
To perform the upgrade, suppose R-2.12.1 already exist and you wanted the new version R-2.12.2. The steps are:
1. Download R-2.12.2-win.exe
2. Close any R applications that is running.
3. Install by running R-2.12.2-win.exe
4. In Windows Explorer, go to    C:\program files\R\R-2.12.1\library.
   Highlight all folders and COPY (Ctrl-C)
5. Go to the new installation path C:\program files\R\R-2.12.2\library.
   Paste (Ctrl-V) BUT DO NOT OVERRIDE existing folders.
   This action will copy the libraries YOU installed in the old version to the new version BUT does not delete the base libraries that exist already in the new installation.
6. Run R-2.12.2, go to Packages -> Update packages...
7. If you wish, you can try uninstalling the older version. Please note that I have not done this so I cannot say if this step will affect the new version or not.



R - changes in 2.15.0
======================
Some changes to R functions which may are incompatible with previous versions are:
old: fisk(link.a = "loge", link.scale = "loge", earg.a=list(), earg.scale=list(), init.a = NULL, init.scale = NULL, zero = NULL)
new: fisk(lshape1.a = "loge", lscale = "loge", eshape1.a = list(),  escale = list(), ishape1.a = NULL, iscale = NULL, zero = NULL)

old: VGAM::psinmad( a=..., q.arg=...) changed to
new: VGAM::psinmad( shape1.a=..., shape3.q=...)

old: VGAM::qfisk( a=....)
new: VGAM::qfisk( shape1.a=...)


Ref: http://www.r-project.org/
  C-LEVEL FACILITIES:

    o Passing R objects other than atomic vectors, functions, lists and
      environments to .C() is now deprecated and will give a warning.
      Most cases (especially NULL) are actually coding errors.  NULL
      will be disallowed in future.

      .C() now passes a pairlist as a SEXP to the compiled code.  This
      is as was documented, but pairlists were in reality handled
      differently as a legacy from the early days of R.

    o call_R and call_S are deprecated.  They still exist in the
      headers and as entry points, but are no longer documented and
      should not be used for new code.


Running R
==========
Two ways to use the R software
1. Rgui.exe - this is the windowing environment.

2. Rterm.exe - this is a command terminal environment.
- run by typing in any cmd terminal:
C:\Program Files\R\R-2.12.1\bin\i386

3. Running in batch mode:
Rterm.exe --no-restore --no-save < infile > outfile 2>&1

Alternatively, creaate a file called R.bat with the following content:
         "c:\Program Files\r\R-2.13.1\bin\x64\Rterm.exe" --vanilla
To run,
R < test.R > test.out 2>&1
Or,

    start cmd /K "R.bat < aaa.R > aaa.out 2>&1"     - new CMD window remains open
    start cmd /C "R.bat < aaa.R > aaa.out 2>&1"     - new CMD window closes


4. Running scripts from Windows CMD terminal.
- set windows environment variable to installation path, eg:
    Rbin=C:\Program Files\R\R-2.12.1\bin
- go to the dir with R file eg file called Initialize.R
- Run:    "%Rbin%\Rscript" Initialize.R
- OR:    START "runMe" cmd /K "%rbin%\Rscript" runMe.R

5. Running scripts from inside an R console.
- source("Initialize.R", print.eval = TRUE)  
the print.eval = TRUE ensures that evaluation in script file is printed to screen

6. Running Parallel from Windows CMD terminal
- START "someName" "%rbin%\Rscript" someScript1.R
- START "someName" "%rbin%\Rscript" someScript2.R
the START is a windows command that forks open a separtate Windows terminal


Quick Commands
===============
license(),  licence() - for distribution details.
contributors() - list of contributors
citation() - on how to cite R or R packages in publications.
demo() - for some demos
help() - for on-line help
help.start() - for an HTML browser interface to help.
help(par), ?par             - help for function par
??par                       - multiple lists of help for par
q() - to quit R.
<command> - typing just the command line reveals the code
<command()> - executes the function.
ls()                        - list variables in the workspace
objects()                   - list of objects in the workspace
rm(a, b) - removes object a and b in the workspace
rm(list=ls())               - WARNING: removes all variables
rm(a, envir = globalenv())  - removes object from the Global Environment
search()                    - list the loaded packages
getAnywhere(x)              - find which packages x belong too, good when multiple packages have same function name.
#                           - comments symbol
list.files()                - list the files in the current directory
getwd()                     - get the path of the current directory\
setwd("c:\\data")           - sets the working directory to c:\data
print(<r commands>)         - prints output to screen, useful for R scripts.
paste("aa",xx,sep="")       - concatenate strings; note xx may be a number
match(), pmatch(), grep     - string matching
y<-unlist(strsplit(x, "_")) - parse or split the string x, using delimiter as "_". Access each item by y[0], y[1], etc.

which(data==const)          - mask. Returns the indices where condition is true.
which(data==const, arr.ind=TRUE)    Returns indices properly according to dimensions. If arr.ind=FALSE (default),
                              then even for multi dim array, it returns the index cumulated in ONE dimension only.
which.max                   - get the index of the maximum value of a vector.
max.col                     - get the index of the maximum value of a matrix (have not tried)
options(digits=xxx)         - change precision display, where xxx is the number of digits required.
format(aaa, digits=9)       - change precision display
sprintf("%.14f",aaa)        - change precision display
calibrate::textxy           - special text formatting (also see text in core package)
source("filename.R")        - Batch: runs commands in file filename.R
sink("filename.R")          - Batch: outputs to file filename.R, MUST USE PRINT COMMAND to print to file using SINK.
sink() - return output to terminal
write(t(x), file="a.txt", ncolumns=31, append=TRUE, sep="\t")
                            - write 2D matrix x (need transpose t), to file called a.txt, display in 31 columns, with tab separation and append to existing file.
- write to console by putting argument file="".
save(x, file='  ')          - to save a variable to file, in binary format. See sink() command to save file in text format.
load(file='   ')            - to load data, in binary format. Note: Variables are automatically loaded and overwrite existing variables.
file.exists('filepath')     - checks if a file exists and returns TRUE/FALSE.
memory.size()               - set/get the memory size in MB
gc()                        - garbage collection
a %in% b                    - test to see if element a is in vector b.
compare(a,b)                - Package: compare. Compares objects a and b to see if they are the same.
runif(n, min=0, max=1)      - Random Uniform Distribution - draw n points randomly.
set.seed(x)                 - Sets the seed for random generators.
table(xWords)               - split data into frequency bins
c(rbind(vecA, vecB))        - interleave 2 vectors of data
file.exists('filename')     - checks if the file exists
Sys.Date() - returns system date in format 'yyyy-mm-dd'
Sys.time()                  - returns date and time in format 'yyyy-mm-dd hh:mm:ss: EST'
system.time([blah])         - time it takes to execute blah.
process.time()              - call this, run R statement, then call again and take difference between two times.
gc.time()                   - garbage collection time.


str(xxx)                    - information about the components of xxx
summary(xxx)                - information summary about xxx
names(xxx)                  - information about the names of the components of xxx.

t(x)                        - Transpose of matrix x.
m1 <- matrix(1:12, nc=3)    - create a matrix with 3 columns using the 12 data points
                              Data are distributed columnwise by default.
m1 <- m1[-4,]               - remove the 4th row from a matrix.
m2 <- rbind(m1, c(1,2,3))   - add items 1,2,3 as a new row of matrix m1.
1:5                         - like linspace, generates {1,2,3,4,5}
seq(a,b, by=c)              - like linspace, generates sequence from a to b with step c.
rep(x, [times|each])        - replicates vector x with various options.
a %% b                      - the remainder of the number of times a is divisible by b
a %/% b                     - the modulus or the number of times a is divisible by b

dyn.load("a.dll", local=FALSE) - load the dll into R. The local=FALSE puts it in Global space rather than local space.
dyn.unload()                - unload the dll from R, required if the dll needs to be recompiled.
is.loaded()                 - check whether a subroutine name has been loaded into R.
R.version                   - to list the system information R is run on.

sapply(X, FUN, ...)         - a user friendly version of lapply. It applies a function to the list of objects.
                            - if X is a list of collection, eg x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))   - then sapply(x, mean) will return 3 means, one for each group.
- if X is a vector of elements, then the function is applied to each element.
                              http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm
as.vector                   - vectors have NO dimension. Can us this to make num[1:14(1d)] into num[1:14]

months(aDate)               - return Month in characters, abbreviate=TRUE to get short version.
as.Date(aDate, format)      - translate in R Date object for input with specified format.
for(ii in 1:10){}           - for loop control structure
for(ii in c(1:10){}         - for loop control structure - using a numbers from a list
next                        - to skip to next iteration in for loop.
break                       - to break out of a for loop.

is.null(arg)                - tests if the argument is NULL
is.nan(arg)                 - tests if the argument is NaN
is.finite(arg)              - tests if the argument is finite
is.infinite(arg)            - tests if the argument is infinite
.Machine$double.eps         - gives machine EPSILON value
exists('aVar')              - tests if the variable called aVar is defined or not.

knots(ecdf(X))              - retrieves distinct values of ecdf
setdiff(A,B)                - data in A, not in B


R.matlab
=========
it appears the R package is not compatible with all versions of .mat file
you must save file as v6: save('D:\DATA\QARiskApp\mloss-v6.mat','mLoss','mLogLoss','-v6')

 readMat(file.path("<mat file>"))          reads the contents of "mat file"
 bs <- readMat(file.path("<mat file>"))    reads the contents of "mat file" and store in a structure "bs"

If the data stored in BS are made of many different types of data, eg integer, vectors, and come in different shapes, eg scalar or vectors, then the relevant commands are:
ll        - info about the general structures available
bs[[1]]   - first item of the structure. This can be scalar, 1D, 2D, etc matrices.
bs$<var1> - same as above, accesses the first component of the structure.




R object types
===============
Ref: http://cran.r-project.org/doc/contrib/R_language.pdf
http://cran.r-project.org/doc/manuals/R-lang.html#Basic-types

The types of objects are:
vectors, arrays, factors, lists, data frames, functions

Intrinsic attributes of objectts are:
mode(object)
length(object)

To change the type of an object, use:
as.characters(object)
as.integers(object)
as.<etc..>(object)

Objects can change size dynamically, or
by using length:  length(object) <-3      # sets the array to new size 3
by resizing itself:   object <- object[2* 1:5]

Objects have the following attributes:  class, comment, dim, dimnames, names, row.names and tsp
attributes(object)      lists all the attributes
attr(object, "dim")     get or set the attribute "dim"


Arrays
=======
Any vector can be made into array if its "dim" attribute is defined. This gives the object a multi-dimensional shape.

The various ways of defining arrays are:
Z <- array(h, dim=c(3,4,2))
Z <- h ; dim(Z) <- c(3,4,2)
scal = "blah" ;  zz <- array(scal, dim=c(3));     # resize into zz = { blah, blah, blah }
Z <- array(0, dim=0)    # to create an array of zero length, it is of type numeric(0)


Outer product can be defined as:
ab <- a %o% b
ab <- outer(a, b, "*")
f <- function(x, y) cos(y)/(1 + x^2);  z <- outer(x, y, f)

tapply() function - provides masking functionality.
Example:
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
                  "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
                  "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
                  "sa",  "act", "nsw", "vic", "vic", "act")
statef <- factor(state)
     > statef
      [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa
     [16] tas sa  nt  wa  vic qld nsw nsw wa  sa  act nsw vic vic act
     Levels:  act nsw nt qld sa tas vic wa
levels(statef)
     [1] "act" "nsw" "nt"  "qld" "sa"  "tas" "vic" "wa"
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
                    61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
                    59, 46, 58, 43)
incmeans <- tapply(incomes, statef, mean)
giving a means vector with the components labelled by the levels (using function mean())
        act    nsw     nt    qld     sa    tas    vic     wa
     44.500 57.333 55.500 53.600 55.000 60.500 56.000 52.250
stderr <- function(x) sqrt(var(x)/length(x))     # user defined function
incster <- tapply(incomes, statef, stderr)
> incster
     act    nsw  nt    qld     sa tas   vic     wa
     1.5 4.3102 4.5 4.1061 2.7386 0.5 5.244 2.6575

Subarray - let A = array of size [3,3]
A[,2]    = second column of A
A[2,]    = second row of A


Packages - Manual Install
==========================
1. Download packages from: http://cran.ms.unimelb.edu.au/
2. In R, select Packages -> Install Packages from Local Zip file
3. Choose the zip file from the location where it was downloaded.


Packages
=========
sessionInfo()           - tells the packages loaded.
library() to see which packages are installed
library(boot)           to load the package called boot - gives error   if package not found
require(boot)           to load the package called boot - gives warning if package not found
library(help = evir)    to see the list of functions in the package
search()                to see which package are loaded
loadedNamespaces()      to see the packages loaded but not on the search list.
<lib>::<func>()         to specify overloaded function using its library name to avoid confusion, eg base::Version()

QQ Plots {stats}        : qqnorm, qqline, qqplot
importFrom(foo, f, g)   loads only the functions f,g from packages foo
import(foo, bar)        loads all functions which are exported from the packages foo and bar.
detach("package:R.matlab", unload=TRUE)     remove from search() path a data.frame that has been attached or a                                                   package that was attached by library.
unloadNamespace("ns")   unloads the namespace ns.

unloadNamespace("R.oo")
R.oo::Class()

detach("package:R.methodsS3", unload=TRUE)
removes R.methodsS3 from search() and loadedNamespaces()


Calling a function with package notation, eg copula::genFun
will bring package copula back into loadedNamespaces() but not in search().
unloadNamespace("copula") - will remove it again from loadedNamespaces()

library(copula) - adds copula to search() and loadedNamespaces()
unloadNamespace("copula") - removes it from search() AND loadedNamespaces()
genFun() - calling has no effect
copula::genFun - brings copula into loadedNamespaces()

library(copula) - adds copula to search() and loadedNamespaces()
detach("package:copula") - removes it from search() but NOT REMOVEd from loadedNamespaces()
genFun() - calling has no effect
copula::genFun - is OK

To use packages with Namespaces
- no need to use library() or any other way to load functions
- in the code, simply call [package]::[function]      # note the double colon
- when finished, call: unloadNamespace("[package]")
- using this method loads the package in and out of the Namespaces. It has not been added in the search path.

To use packages with NO Namespaces
- call the library by: library([package]). This will load the package into the search path and add to the Namespaces. But since the package has no namespace, nothing is added to the Namespaces.
- in the code, just call the function using its name:  [function](arguments)
- when finished, call : detach("[package]:[function]") # note the single colon. This will remove it from the search path. Since it has no Namespace, there is no point in calling: unloadNamespace().



Searching rules with Namespaces
It searches in package namespace first, then among the functions imported by that package, then R base, and lastly in the “normal” search path (as from ‘search()’). See “Writing R Extensions”, last paragraph before the end of sub-section. Consequently, the correct version of ‘set.vertex.attribute’ is used.

Namespace REF: http://www.r-bloggers.com/namespaces-and-name-conflicts/


R - Stat functions
===================
runif - Uniform random generator
rt - Student-T random generator
rexp - Exponential random generator
rweibull - Weibull random generator
rgamma - Gamma random generator
rpareto                 - Pareto random generator http://www.commanster.eu/rcode.html, http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/rmutil/html/Pareto.html

R - Time Series functions
==========================
This section is taken from http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf
written by Vito Ricci

INPUT
cycle(): gives the positions in the cycle of each observation (stats)
deltat(): returns the time interval between observations (stats)
end(): extracts and encodes the times the last observation were taken (stats)
frequency(): returns the number of samples per unit time (stats)
read.ts(): reads a time series file (tseries)
start(): extracts and encodes the times the first observation were taken (stats)
time(): creates the vector of times at which a time series was sampled (stats)
ts(): creates time-series objects (stats)
window(): is a generic function which extracts the subset of the object 'x' observed between the times 'start' and 'end'. If a frequency is specified, the series is then re-sampled at the new frequency (stats)

TS DECOMPOSITION
decompose(): decomposes a time series into seasonal, trend and irregular components using moving averages. Deals with additive or multiplicative seasonal component (stats)
filter(): linear filtering on a time series (stats)
HoltWinters(): computes Holt-Winters Filtering of a given time series (stats)
sfilter(): removes seasonal fluctuation using a simple moving average (ast)
spectrum(): estimates the spectral density of a time series (stats)
stl(): decomposes a time series into seasonal, trend and irregular components using 'loess' (stats)
tsr(): decomposes a time series into trend, seasonal and irregular. Deals with additive and multiplicative components (ast)

TESTS
adf.test(): computes the Augmented Dickey-Fuller test for the null that 'x' has a unit root (tseries)
Box.test(): computes the Box-Pierce or Ljung-Box test statistic for examining the null hypothesis of independence in a given time series (stats)
bds.test(): computes and prints the BDS test statistic for the null that 'x' is a series of i.i.d. random variables (tseries)
bptest(): performs the Breusch-Pagan test for heteroskedasticity of residuals (lmtest)
dwtest(): performs the Durbin-Watson test for autocorrelation of residuals (lmtest)
jarque.bera.test(): Jarque-Bera test for normality (tseries)
kpss.test(): computes KPSS test for stationarity (tseries)
shapiro.test(): Shapiro-Wilk Normality Test (stats)

STOCHASTIC MODELS
ar(): fits an autoregressive time series model to the data, by default selecting the complexity by AIC (stats)
arima(): fits an ARIMA model to a univariate time series (stats)
arima.sim(): simulate from an ARIMA model (stats)
arma(): fits an ARMA model to a univariate time series by conditional least squares (tseries)
garch(): fits a Generalized Autoregressive Conditional Heteroscedastic GARCH(p, q) time series model to the data by computing the maximum-likelihood estimates of the conditionally normal model (tseries)

GRAPHICS
lag.plot: plots time series against lagged versions of themselves. Helps visualizing "auto-dependence" even
when auto-correlations vanish (stats)
monthplot(): plots a seasonal (or other) subseries of a time series (stats)
plot.ts(): plotting time-series objects (stats)
seaplot(): plotting seasonal sub-series or profile (ast)R functions for time series analysis by Vito Ricci (vito_ricci@yahoo.com) R.0.5 26/11/04
seqplot.ts(): plots a two time series on the same plot frame (tseries)
tsdiag(): a generic function to plot time-series diagnostics (stats)
ts.plot(): plots several time series on a common plot. Unlike 'plot.ts' the series can have a different time bases, but they should have the same frequency (stats)

MISCELLANEOUS
acf(), pacf(), ccf(): the function 'acf' computes (and by default plots) estimates of the autocovariance or autocorrelation function. Function 'pacf' is the function used for the partial autocorrelations. Function 'ccf' computes the cross-correlation or cross-covariance of two univariate series (stats)
diff.ts(): returns suitably lagged and iterated differences (stats)
lag(): computes a lagged version of a time series, shifting the time base back by a given number of observations (stats)



Exploratory Data Analysis
==========================
summary(data)   Min, 1stQ, Median, Mean, 3rd Q, Max
fivenum(data)   Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input data.
stem(data)      Stem and leaf plot



Plotting on multiple windows
=============================
# Create 3 plots
dev.new()  # Or X11()
dev.1 <- as.integer(dev.cur())
dev.new()
dev.2 <- as.integer(dev.cur())
dev.new()
dev.3 <- as.integer(dev.cur())

x <- seq(1, 100, 0.1)

# Switch to device 1
dev.set(dev.1)
plot(x, sin(x), "l")
# Switch to device 3
dev.set(dev.3)
plot(x, cos(x), "l")
# Add something to graph #1
dev.set(dev.1)
points(x, cos(x), "l", col="red")

Writing to HTML files
=======================
This functionality of writing to HTML files are facilitated by either of the following packages:
i)  HTMLutils - which depends on R2HTML
ii) hwriter

The functionalities are best described using examples.
The examples here are based on the hwriter package only.

<code>
pHand <- openPage(fileOut, title=title)
hwrite(title, page=pHand, center=TRUE, heading=1)
hwrite('',pHand, br=TRUE)

hwriteImage(fileOut_hist, pHand, br=TRUE)
hwrite('Histogram',pHand, br=TRUE)

closePage(pHand, splash=TRUE)
</code>

The html file is created using the openPage() function and finishes off using closePage() function.
pHand - The file handle. This is needed by closePage as well as other html functions to identify which file to write to.
hwrite() - the function used to write html text. Note there are many arguments in hwrite() to control HTML parameters like: p, br, heading, center, etc.
hwriteImage() - the function to write an image to the html file. In this example, the path of the image is stored in the string fileOut_hist.

List
=====
A list in R can be made up of different data types (numbers, strings) as well as different size objects.
An example is :
   Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
In the example above, the list has got 4 components which are: name, wife, no.children, child.ages
The components can be accessed as Lst$wife or Lst[[2]]
The fourth component is a vector and each elements {4,7,9} can be accessed as:
   Lst[[4]][1],     Lst[[4]][2],     Lst[[4]][3]   OR
   Lst$child.ages[1],     Lst$child.ages[2],     Lst$child.ages[3]   OR
   Lst[["child.ages"]][1],     Lst[["child.ages"]][2],     Lst[["child.ages"]][3]   OR
Creating list: list(name1=obj1, name2=obj2, ....)
Removing elements from a list by index : lll[-c(2,3)],   remove second and third elements from the list.
Removing elements from a list by value : lll[lll != 'element']
Extracting elements from a list: lll[c(2,3)]   extracting the second and third elements from the list.
Append an item (aaa) to an existing list(lll):   append(lll, list(aaa))
Append an item to an 1D array of list
aa <- array(list(), dim=2) # create an array of list items, initially all items are NULL
aa[1] <- c(3.2) # so aa[1,2][[1]] = 3.2
aa[1] <-  list(c(aa[1][[1]], 3.4)) # dereference the list, then concatenate, then make into a list again.
Append an item to an array of list
aa <- array(list(), dim=c(2,2)) # create an array of list items, initially all items are NULL
aa[1,2] <- c(3.2) # so aa[1,2][[1]] = 3.2
aa[1,2] <-  list(c(aa[1,2][[1]], 3.4)) # dereference the list, then concatenate, then make into a list again.

To transform a list (ll) into just values, with no names:
ull <- unlist(ll)
names(ull) <- NULL

Dynamic reduction of list: This scans through a list, and at the same time removing specific elements of the list.
<code>
      while(length(qList) != 0) {
        ... blah ...
# Update the qList by removing the certain elements, or rather choosing only specific elements to remain.
        qList <- qList[qList != as.character(dataSet$xposId)]  
        ... blah ...
      }   # end while
</code>


Sets Operations
================
is.element(elem, set)                  - test if element is in the set
union(x, y)
intersect(x, y)
setdiff(x, y)
setequal(x, y)
a & b                   - element wise comparison, say a is {TRUE, FALSE, TRUE}, b is {FALSE, FALSE, TRUE}, then answer is {FALSE, FALSE, TRUE}



R - functions
===============
The example below best illustrate the structure of a function in R. Here are the things to note.
- the name of the function in the example is myFunction
- the argument list (a,b) are INPUT arguments. They are placed next to the 'function' keyword.
- the output of the function is the most tricky part in R. Essentially, the output must be the last variable assignment in the function. So the last line has the variable 'out'.
- to produce multiple outputs, one way (as illustrated), is to create a list called 'out'. Then at the very last line, aggregate all your desired output variables into the list.

<code>
myFunction <- function(a, b){
out <- NULL    # initialize output

c <- a + b
d <- a - b

# Final Result OUTPUT
out <- list(mySum=c, myDiff=d)
}
</code>

When calling the function, assign the function to a variable or handler,
     result <- myFunction(2.3, 5.1)

To access the multiple output variables:
result$mySum
result$myDiff

3D plots
=========
The 3D plotting functions are:
image(x,y,z, ...)      - heat plot
contour(x,y,z, ...)    - contour plot
persp(x,y,z, ...)      - wireframe plot

x = {1:rows},    y = {1:cols}   are 1D vectors
z[1:rows, 1:cols] is the 2D matrix

3D Cloud plot example:
Data type must be
#str(data )
#'data.frame':   14880 obs. of  4 variables:
# $ x  : num [1:14880(1d)] 1 2 3 4 5 6 7 8 9 10 ...
# $ y  : num [1:14880(1d)] 1 1 1 1 1 1 1 1 1 1 ...
# $ z  : num [1:14880(1d)] 1 1 1 1 1 1 1 1 1 1 ...
# $ val: Factor w/ 3 levels "-1","0","1": 1 1 1 1 1 1 1 2 2 2 ...
The plot function call is like:
<code>
thisPoints = Rows(trellis.par.get("superpose.symbol"), 1:3)
thisPoints$pch <- c(1,3,2)
thisPoints$col <- c("#0080ff",   "#ff00ff",   "#ffff00")

cloud(cRG ~ cBU * cRT, data = data , groups = val, screen = list(z = 20, x = -70),
perspective = FALSE, zoom = 1.0, pch = thisPoints$pch , col = thisPoints$col,
key = list(title = "Cluster", x = .15, y=.85, corner = c(1,-0.3),
border = TRUE,
points = thisPoints,
text = list(levels(data$val))))
</code>
Parameters like 'screen' are found under ?panel.cloud
show.settings()  - will reveal the settings for "superpose.symbol"
?points  - to see more options for points parameter
pch      - controls the symbols for the points eg 1=circle,2=triangle,3=plus
col      - color in Hex code


Plotting features
====================
All the parameters here appear as "plot( .... <param> ....)" unless otherwise specified:

plot(... xlim=c(-1, 10) ...) - specify the limit / boundary on the x-axis
text(x,y,labels,...) - specify text labels in a plot
lines(xvec, yvec)                   - draw line between 2 points
abline                              - draw line across whole graph using m,c from y=mx+c
plot(.... type='l' ...)             - type: l=lines, p=points
plot(.... las= .....)               - orientation of axis labels
plot(.... main='title' ....)        - Title for the graph

To control the margin around the plot, specify the margin specification first:
par(oma=c(4,2,2,2))
barplot(.....)

Plotting to PNG files &
Plotting Empirical CDFs
# let vData be a vector of data point
aECDF <- ecdf(vData)      # generate a ECDF object called aECDF
summary(aECDF)            # summary stats for the ECDF
png("filename.png", bg="transparent")
plot(aECDF, main="title of graph", ylab="ylabel", xlab="xlabel")
dev.off()
        # more examples
x10<-seq(1,10)
F10 <- ecdf(x10)
plot(F10)
plot(F10(x10), xlim=c(0,12))
plot(F10, xlim=c(0,12))
plot(x10, F10(x10), xlim=c(0,12), verticals=TRUE)
# another example
xData <-  rweibull(100, shape=0.5)+3.0
fdata <- ecdf(xData)          # note fdata is a ecdf function
plot(fdata, xlim=c(0,25))     # special plot takes in the ecdf function
plot(xData, fdata(xData), xlim=c(0,25))     # equivalent to using plot on the ecdf function


Adding points to existing plots:
- use the points() function, eg.
points(vecX, vecY, ....)

Plotting 2 or more data in one graph
plot(x), par(new=TRUE), plot(y)


ECDF ecdf Empirical CDF
==========================
aa <- c(1,2,3,4,5)
bb <- c(1,1,2,2,3,3,4,4,5,5)
plot(ecdf(aa))
par(new=TRUE)
plot(ecdf(bb))
### This results in identical ecdf

cc <- c(1,2,2,3,3,4,5)
plot(ecdf(aa))
par(new=TRUE)
plot(ecdf(cc))
### These two plots are different. The gaps are 1.0/7 ~ 0.148
### For the ecdf(cc), the x values are 1,2,3,4,5, the y values are: 0.14, 0.42, 0.71, 0.86, 1.0

dd <- c(2,3,3,2,5,1,4)
plot(ecdf(cc))
par(new=TRUE)
plot(ecdf(dd))
# Sorting has no effect on ecdf(), the data would be sorted by ecdf() anyway.

stepfun()
eg. myfun <- stepfun(1:10, cumsum(c(0, rep(0.1, 10))))
- first argument is x values, second argument starts at zero and is one element longer
-       1,  2,  3,  4,  5,  6,  7,  8,  9,  10     x values
- 0.0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0     y values
stepfun() plot looks identical to ecdf plot, but with step lines drawn.


data.frames
=============
1. Creating data.frames from existing mixed type array.
 r1 <- c(2,1, 0.44)
 r2 <- c(23,11, 0.424)
 dff <- data.frame(rbind(r1, r2))
 colnames(dff) <- c("a", "b", "c")
 rownames(dff) <- NULL     removes the row names in a data frame

where the last row changes the default column names.

2. Creating data.frames from two vectors
age <- c(.....)
height <- c(.....)
village <- data.frame(age=age, height=height)

3. Accessing data.frames
village[1,2]
village["age", "height"]

4. Utility functions
nrow(village)

5. Read in data.frame from input file - see section on "Reading Data Files"

6. To hide, remove the first column which is row numbering, need to use
print(dataframe, row.names=FALSE)

Reading Data Files
====================
1. Reading data from text files into data frame.
Example: given the text data is in the following format:
RG BU RT cMap
1 1 1 -1
2 1 1 -1
3 1 1 -1
1 2 1 -1
2 2 1 -1

- say the data above is in the file called data.txt
- to read the data:
     df <- read.table('data.txt', header=TRUE)
- the df structure would automatically be a data.frame.





serialIndepTest
================
1. d <- serialIndepTestSim(n=100, lag.max=3)
2. test <- serialIndepTest(x,d)
3. test
4. dependogram(test,print=TRUE)

Above are the steps to perform the serial independence tests using the serialIndepTest.
Step 1 creates the d object which is the result of simulation, which does not require any input data from the sample to be tested.
Step 2 performs the actual independence test.
Step 3 prints out the results. The Fisher result is better "tends to give the best results and was found to frequently outperform the test based on In".
Step 4. Plots and prints out the results.

Note that in Step 4, it tests all the subsets of lag of the original time series. The subsets are the Mobius decomposition. The original series is independent iff all the subsets are also independent. If any of the vertical line (test statistic) is below the dot(critical value), then that subset cannot reject the null hypothesis of independence. Only when the line is above the dot, then we reject the independence hypothesis for that subset.

Note (n - lag.max) >= 2

"Modeling Multivariate Distributions with Continuous Margins Using the Copula R package", Ivan Kojadinovic, Jun Yan, Journal of Statistical Software, May 2010, Vol 34, Issue 9.



[R] Operators
==============
- Minus, can be unary or binary
+ Plus, can be unary or binary
! Unary not
~ Tilde, used for model formulae, can be either unary or binary
? Help
: Sequence, binary (in model formulae: interaction)
* Multiplication, binary
/ Division, binary
^ Exponentiation, binary
%x% Special binary operators, x can be replaced by any valid name
%% Modulus, binary
%/% Integer divide, binary
%*% Matrix product, binary
%o% Outer product, binary
%x% Kronecker product, binary
%in% Matching operator, binary (in model formulae: nesting)
< Less than, binary
> Greater than, binary
== Equal to, binary
>= Greater than or equal to, binary
<= Less than or equal to, binary
& And, binary, vectorized
&& And, binary, not vectorized
| Or, binary, vectorized
|| Or, binary, not vectorized
<- Left assignment, binary
<<- Assigning to Global / R-Environment variables , see assignOps{base}
-> Right assignment, binary
$ List subset, binary


RUnit - Testing
================
RUnit is a unit testing package that allows each function to be tested automatically.
The following are instructions of how to set it up.

Ref: http://cran.r-project.org/web/packages/RUnit/vignettes/RUnit.pdf

1. Assume there is a function called c2f which needs to be tested. The c2f function can be like:
c2f <- function(c) return (9.5 * c + 32)

2. Create a file called "runitc2f.r" and write the test function as follows:
<code>
test.c2f <- function() {
checkEquals(c2f(0), 32)
checkEquals(c2f(10), 50)
checkException(c2f("xx"))
}
</code>

3. Create another file, with any name, as long as it is loaded using "source" command in R.
   Then define the test suite as:

<code>
testsuite.c2f <- defineTestSuite("c2f",
dirs = file.path( <path to your test files> ),
testFileRegexp = "^runit.+\\.r",
testFuncRegexp = "^test.+",
rngKind = "Marsaglia-Multicarry",
rngNormalKind = "Kinderman-Ramage")
</code>

The name of this test suite file and the actual name of the defined testsuite does not matter much. But the most important names are the names of the test file which must be "runit..." and the actual test function names which must be "test....". These restrictions are defined in the definition of the test suite above.

4. If the test suite is defined in the file "testsuite.R", then to run the unit tests:
<code>
source('testsuite.R')
testResult <- runTestSuite(testsuite.c2f)
printTextProtocol(testResult)
</code>

5. To test Exception cases.
- in the actual function, there need to be a catch for the error and a call to the "stop" function
- in the runitXXXX.r file, the checkException is called with the function call as its argument, eg:
checkException( funcName(var1, var2, ...) )


Linear Least Squares Regression
================================
There is a few steps to perform the linear regression. The following example illustrates the various functions to facilitate regression analysis.

Example:   Assume variable year, rate
    cor(year, rate)   - a first check for correlation

suppose we wish to fit the linear model as: rate = slope x year + intercept
    fit <- lm(rate ~ year)
fit returns the coefficients for the linear equation such that
fit$coefficients[1] -> intercept
fit$coefficients[2] -> slope

To calculate the residuals:
res <- rate - slope x year + intercept
alternatively,
residuals(fit)

To plot the data and the regression line:
plot(year, rate)
abline(fit)

Finally, a whole lot of information can be obtained by:
summary(fit)


R Interoperability
===================
http://cran.r-project.org/doc/manuals/R-exts.html#Package-structure
http://www.revolutionanalytics.com/why-revolution-r/customer-testimonials.php
http://www.omegahat.org/RwxWidgets/ExampleDocuments/RC++Methods.pdf

using .Call - a souped up version of .C. It provides better ability to use R functions from within C code, and the C code can then be wrapped up to be used in R.
http://www.biostat.jhsph.edu/~bcaffo/statcomp/files/dotCall.pdf


Calling Fortran from R
=========================
http://www.stat.sc.edu/~grego/courses/stat740/handout.pcfortran.pdf
http://www.stat.umn.edu/~charlie/rc/
http://math.acadiau.ca/ACMMaC/howtos/Fortran_R.html

The steps to call Fortran subroutines from R is listed here using simple examples to illustrate the process.
A few notes before beginning:
- R can call Fortran Subroutines easily. Apparently it cannot call Fortran Functions directly. If a Fortran Function is needed, then it should be changed into a subroutine or create a wrapper function (this is a purely Fortran exercise and will not be detailed here).
- To call the Fortran from R, this example uses the .C() R function. Apparently the .Fortran() in R has some instabilities, so it will not be detailed in this example.

1. Create the Fortran subroutine, compile into a dll and expose the subroutines to be exported.
There are many ways and conventions of doing this. However, this example will only show the one that has been tested here and works. This example is based on Intel Fortran 11 on Windows. Also the naming convention to export the subroutine name is STDCALL (there are a few others, but this works here). The sample code is:

<code>
module modFtest
    implicit none
    contains

    subroutine oneIntegerInput_dll(aaa)
    !DEC$ ATTRIBUTES DLLEXPORT, STDCALL, ALIAS:'_oneIntegerInput_dll' :: oneIntegerInput_dll
    !DEC$ ATTRIBUTES REFERENCE :: aaa
        implicit none
        integer(4), INTENT(IN) :: aaa
        integer(4) :: bbb
        bbb=aaa+1
    end subroutine oneIntegerInput_dll

end module modFtest
</code>

A few things to be noted:
- The subroutine oneIntegerInput_dll is in a module called modFtest.
- The modFtest module is found in the CallFromR project. When compiled, the output would be a callfromr.dll.
- There can be multiple modules with multiple subroutines, constructed under this CallFromR project which would be accessible from its callfromr.dll
- The Intel Fortran (for Windows) way of exporting symbols is to use the DLLEXPORT attribute and the REFERENCE attribute to declare its variables.
- To be safe, just keep the Fortran variables to primitive data types. And declare all variables as REFERENCE.
- The accessibility of the Fortran variables can still be controlled via the INTENT statements.


2. Loading the DLL from R.
Within R, the dll can be loaded as:
       dyn.load("callfromr.dll")

To unload, use:
dyn.unload("callfromr.dll")

If the dll is not in the current directory, the path may need to be provided. Ensure the path separation / or \ is correct for your system.
WARNING: If the dll is still loaded by R, then any attempt to recompile the Fortran code may fail because the dll is being used by R. The dll has to be unloaded before the dll can be recompiled by Fortran.


3. To use the subroutines from the dll, call as follows:
callfromr=.C("oneIntegerInput_dll", as.integer(3))

Note that the subroutine name is case sensitive. Although Fortran itself is not case-sensitive, the exporting of the name using DLLEXPORT has specified the name to be oneIntegerInput_dll with case-sensitivity.
The arguments of the subroutine need to be transformed from R, using as.integer() or something else.


4. Troubleshooting.
To check is a subroutine name has been loaded via the dll, type:
     is.loaded("oneIntegerInput_dll")


5. Advanced arguments passing.
The above steps is the first steps to show Fortran - R compatibility. The section here will explore a combination of arguments that can be passed between Fortran and R with expanded examples.
WARNING:
a) In Fortran, DO NOT USE real(4). Instead use real(8)
b) For integers at least, when calling for R, need to cast the integer using as.integer(iii) in the .C(...) argument list.
c) Fortran array arguments must be declared like FCN(N), not FCN(1:*)

A. One integer in, one integer out
        integer(4), INTENT(IN) :: aaa
        integer(4), INTENT(OUT) :: bbb
The call can be made by:
callfromr=.C("twoIntInOut_dll", as.integer(3), as.integer(0))
The result is that callfromr will be a list of two integers.

B. Double Precision, one in, one out
        REAL(8), INTENT(IN) :: aaa
        REAL(8), INTENT(OUT) :: bbb
The call can be made by:
callfromr=.C("twoDblInOut_dll", as.double(3.1), as.double(0.0))

C. Vector of double precisions
        integer(4), INTENT(IN) :: nn
        REAL(8), INTENT(IN) :: aaa(nn)
        REAL(8), INTENT(OUT) :: bbb(nn)
Setting up an R example
aaa <- seq(1,5)*1.2        # produces { 1.2 2.4 3.6 4.8 6.0 }
bbb<-seq.int(0, 0,length.out=5)    # produces { 0,0,0,0,0 }
The call from R is:
callfromr=.C("vecInOut_dll", as.integer(length(aaa)), as.double(aaa), as.double(bbb))
The result will have 3 items, 1 integer, 1 input vector and 1 output vector.


#### Testing MleFitTruncLN
!DEC$ ATTRIBUTES REFERENCE :: K, L, X, w, parVal, parFlag, stderr, factor , cdf, sList

   integer K                        !input, number of elements in the sample
   double precision L               !input, the known truncation level
   double precision X(1:*)           !input, sample data array
   double precision w(1:*)           !input array of weighting factors
 
   double precision :: parVal(1:*)    !output parameter values (mu and sigma)
   integer parFlag(1:*)               !input, option flag
 
   double precision stderr(1:*)   !output stderr(mu) and stderr(sigma)
   double precision factor(1:*)   !output truncation correction factor, for frequency fitting
   double precision cdf(1:K)      !output, cdf values at x points
   integer :: sList        ! error message structure              

   K <- 100
   L <- 2.0
   X <- abs(rnorm(K)) + 2*L    # rlnorm(K, meanlog =  1.553 , sdlog = 0.117)     #
   w <- seq(1.0,1.0,length.out=K)
   parVal <- c(0.0, 1.0) # mu, sigma - Need INITIAL VALUES
   parFlag <- c(0, 0) # correspond to mu, sigma. 0 is the default action. 1 means to not calculate that parameter
   stderr <- c(0.0, 0.0)  
   factor <- c(0.0, 0.0)            # Truncation factor
   cdf <- seq(0,0,length.out=K)  
   sList <- 0

callfromr=.C("MleFitTruncLN", as.integer(K), as.double(L), as.double(X), as.double(w), as.double(parVal), as.integer(parFlag), as.double(stderr), as.double(factor), as.double(cdf), as.integer(sList))

#Results
List of 10
 $ : int 100
 $ : num 2
 $ : num [1:100] 4.44 4.24 4.52 4.37 4.97 ...
 $ : num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
 $ : num [1:2] 1.553 0.117
 $ : int [1:2] 0 0
 $ : num [1:2] 0.01168 0.00826
 $ : num [1:2] 1.00e+02 1.31e-21
 $ : num [1:100] 0.297 0.178 0.349 0.25 0.664 ...
 $ : int 0

#ECDF plot of original data
xecdf <- ecdf(X)
plot(xecdf)

# Compare with theoretical plot based on parameters found by MLE
xtest <- seq(0,8, length.out=150)
ptest <- plnorm(xtest, meanlog =  1.553 , sdlog = 0.117)
dev.new()
plot(xtest, ptest)

# Compare original and simulated curve by using adk.test
xtest <- rlnorm(150, meanlog =  1.553 , sdlog = 0.117)
#xtest <- rlnorm(150, meanlog =  1.549 , sdlog = 0.112)
adkOut <- adk.test(list(x1=X, x2=xtest))


Fitting Distributions using VGAM::vglm and others
==================================================
To fit a random distribution function, one can use the package VGAM with its vglm function.
An example is:
    ylg <- rweibull(5000, 3.2, 1.9)
    f3_dist <- VGAM::vglm(formula=ylg ~ 1, family=VGAM::weibull )

In the above example, we create a test random weibull distribution given the parameters (3.2, 1.9)
To use the VGAM::vglm function to find the parameters,
   let the formula be Y~1
   let the family be VGAM::weibull which is the distribution we want to fit.
Note the output coefficients returned in f3_dist are NOT the parameters.
The parameters are obtained by:
> VGAM::Coef(f3_dist)
          shape            scale
3.25349663371564 1.90127764909165

> VGAM::summaryvglm(f3_dist)
Call:
VGAM::vglm(formula = ylg ~ 1, family = VGAM::weibull)
Pearson Residuals:
                        Min               1Q           Median              3Q             Max
log(shape) -11.108862089551 -0.3482171960797  0.3379254287005 0.6850168947179 0.7955259659881
log(scale)  -1.492863009759 -0.7246117432266 -0.2719877869469 0.4750868268133 7.0563972998482
Coefficients:
                        Value        Std. Error        t value
(Intercept):1 1.1797303052868 0.011026577908436 106.9897038849
(Intercept):2 0.6425261070716 0.004577189943693 140.3756704388
Number of linear predictors:  2
Names of linear predictors: log(shape), log(scale)
Dispersion Parameter for weibull family:   1
Log-likelihood: -4310.23959834649 on 9998 degrees of freedom
Number of Iterations: 2

Notice that predictors are log(shape), log(scale) and their corresponding values are 1.179, 0.642.
These two values are called the Intercepts because our model is ylg~1
Since they are log, the actual values should be exp(1.179) and exp(0.642) which becomes 3.253 and 1.901.
These last two are consistent with the values from VGAM::Coef which are the parameters for the example
weibull distribution.


Alternative:
library(MASS)
fitdistr(ylg, "weibull")



Fitting poisson with glm:
xSamp <- c(18 15 20 16 29 11 11)
bbb <- glm(xSamp~1, family=poisson())

> summary(bbb)
Call:
glm(formula = xSamp ~ 1, family = poisson())
Deviance Residuals:
            Min               1Q           Median               3Q              Max
-1.588903781006  -1.058919519420  -0.279181657945   0.438723321313   2.603294064982
Coefficients:
                  Estimate     Std. Error  z value   Pr(>|z|)  
(Intercept) 2.841581593727 0.091287051899 31.12798 < 2.22e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 12.67798701909  on 6  degrees of freedom
Residual deviance: 12.67798701909  on 6  degrees of freedom
AIC: 47.14811314083
Number of Fisher Scoring iterations: 4

To get the fitted Poisson parameter:
1. fitted.values(bbb)
              1                2                3                4                5                6                7
17.1428571428641 17.1428571428641 17.1428571428641 17.1428571428641 17.1428571428641 17.1428571428641 17.1428571428641

2. Check with :   mean(xSamp)
[1] 17.1428571428571

3. Get value of intercept from summary -> 2.84158
Convert this: exp(2.84158)  = 17.1428




Error Handling
===============
Ref: http://mazamascience.com/WorkingWithData/?p=912

The general structure for tryCatch() has the following form:

result = tryCatch({
    expr
}, warning = function(w) {
    warning-handler-code
}, error = function(e) {
    error-handler-code
}, finally {
    cleanup-code
}

Alternative is the:    try()


Errors:
LoadLibrary failure:  %1 is not a valid Win32 application.

Sweave
========
http://www.statistik.lmu.de/~leisch/Sweave/Sweave-manual.pdf
http://www.r-bloggers.com/sweave-tutorial-1-using-sweave-r-and-make-to-generate-a-pdf-of-multiple-choice-questions/
http://www.r-bloggers.com/search/Sweave
http://www.cepe.ethz.ch/education/NPecoHS2010/Sartori-Sweave.pdf



Global variables and scope
===========================
R can handle variables with the same name defined in different scopes.
Example:
  aaa <- 2.0   - defined in the interactive environment
  go into function func() and print gives
     aaa = 2.0
  In function func(), define
  aaa <- 3.0  
  Printing gives
     aaa = 3.0
  Still in func(), rm(aaa) then print(aaa) will give:
     aaa=2.0

This behaviour is only valid when aaa is originally a global env variable and the aaa defined in the function is local.
In the function, if aaa is defined as global, eg. aaa <<- 3.0
then this will replace the global definition.

No comments: