Documentation

Data Manipulation

Data Structure

Basic

  • no 0-dim (scalar) type: a single value is a vector of length 1

dimension | homogeneous   | heterogeneous
1d        | atomic vector | list
2d        | matrix        | data.frame
nd        | array         |

type               | suffix | NA            | check                                   | coercion rule
logical            |        | NA            | is.logical(), is.atomic()               | 1
integer            | L      | NA_integer_   | is.integer(), is.numeric(), is.atomic() | 2
double             |        | NA_real_      | is.double(), is.numeric(), is.atomic()  | 3
character (string) |        | NA_character_ | is.character(), is.atomic()             | 4
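
The coercion rule column can be read as an ordering: when atomic values of different types are combined, R converts them all to the type with the highest number. A minimal base-R sketch (the variable names are only illustrative):

```r
# Combining atomic types coerces to the most general type present:
# logical < integer < double < character.
a <- c(TRUE, 2L)     # logical + integer -> integer
b <- c(2L, 3.5)      # integer + double  -> double
d <- c(3.5, "x")     # double  + string  -> character

types <- c(typeof(a), typeof(b), typeof(d))
print(types)

# There is no scalar type: a single value is a vector of length 1.
x <- 1L
print(length(x))
print(is.atomic(x))
```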

Factor

  • create:
    f <- factor(c("a","b"), levels=c("a","b","c"))
    f
    #> [1] a b
    #> Levels: a b c
  • view: levels(f)
  • feature:
    • built on top of integer vector
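    • contains only predefined values; assigning any other value gives NA with a warning
    • combining factors: c(factor("a"),factor("b")) returned the integer codes before R 4.1.0; since R 4.1.0 it returns a factor with the combined levels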
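
A short sketch of both features, using the factor created above. It only relies on base R; the value "z" is a made-up example of a level that was not declared.

```r
f <- factor(c("a", "b"), levels = c("a", "b", "c"))

# A factor is an integer vector plus a "levels" attribute.
codes <- as.integer(f)   # position of each value within levels(f)
print(typeof(f))
print(codes)

# Assigning a value that is not a predefined level produces NA
# (R also emits a warning, silenced here to keep the output clean).
g <- f
suppressWarnings(g[1] <- "z")
print(is.na(g[1]))
print(levels(g))
```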

Read/Write

csv

  • read: e.g.
    df <- read.csv(path, header=TRUE, sep=",", quote="\"", na.strings=c("", "NA"), col.names=c(), stringsAsFactors=FALSE)
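    • difference:
      function   | delimiter   | hierarchy
      read.table | white space | origin (base function)
      read.csv   | ,           | wrapper
      read.csv2  | ;           | wrapper
      read.delim | \t          | wrapper
    • performance:
      • colClasses: declare column types up front; integer columns reportedly use about 14 times less memory than the same values stored as strings
      • nrows: supplying it, even as a mild over-estimate, helps memory usage
      • comment.char="": turning off comment detection speeds up reading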
  • write: doc
    write.table(df, file=path, sep=",", row.names=FALSE)
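
A possible round trip that combines the write call with the performance hints above. The sample data and the temporary path are placeholders, not part of the original notes.

```r
# Hypothetical sample data; the temporary path stands in for a real file.
df <- data.frame(user = c("John", "Peter"), id = c(1L, 2L),
                 stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")

# write.table() with sep="," and no row names produces a plain CSV file.
write.table(df, file = path, sep = ",", row.names = FALSE)

# Declaring colClasses and nrows spares read.csv() from guessing.
df2 <- read.csv(path, header = TRUE,
                colClasses = c("character", "integer"),
                nrows = 10, comment.char = "",
                stringsAsFactors = FALSE)
str(df2)
unlink(path)   # clean up the temporary file
```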

rds

  • read: df <- readRDS(path)
  • write: saveRDS(df, file=path)
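
As a rough illustration, saveRDS() and readRDS() store and restore a single object, so the restored copy can be bound to any name. The data and the temporary file below are assumptions made for the example.

```r
# Hypothetical data; a temporary file stands in for a real path.
df <- data.frame(user = c("John", "Peter"), id = c(1L, 2L),
                 stringsAsFactors = FALSE)
path <- tempfile(fileext = ".rds")

saveRDS(df, file = path)     # serialize one object
restored <- readRDS(path)    # read it back under any name

print(all.equal(df, restored))
unlink(path)                 # clean up the temporary file
```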

xlsx

  • read:
    require(XLConnect)
    wb <- loadWorkbook(path)
    df <- readWorksheet(wb, sheet="Sheet1", header=FALSE)
  • write:
    require(XLConnect)
    wb <- loadWorkbook(path, create=TRUE)
    createSheet(wb, name="Sheet1")
    writeWorksheet(wb, df, sheet="Sheet1", startRow=1, startCol=1)
    saveWorkbook(wb)

json

  • read: e.g.
    require(rjson)
    j <- fromJSON(file=path, method='C')
  • write:
    require(rjson)
    json <- toJSON(l)
    fileConn <- file(path)
    writeLines(json, fileConn)
    close(fileConn)

File System

work directory

variable

  • view: ls()
  • remove: rm(d)

Create

  • data.frame
    • manually:
      • fix: fix(df) e.g.
      • edit: df <- edit(df) e.g.
    • from list: e.g.
      l <- list(user=character(), id=integer())
      l$user <- c('John', 'Peter')
      l$id <- c(1, 2)
      df <- as.data.frame(l)
    • from vectors: e.g.
      user <- c('John', 'Peter') # character vector
      id <- c(1, 2)
      df <- data.frame(user, id, stringsAsFactors=FALSE) # keep strings as character, not factor
    • from matrix: e.g.
      m <- matrix(c(1,2,3,4), ncol=2, nrow=2, dimnames=list(c('row1', 'row2'), c('var1', 'var2')))
      df <- as.data.frame(m)
  • list
    • from vector: l <- list(user=c("John","Peter"),id=c(1))
  • vector: c(1,2,3); nested calls such as c(1,c(2,3)) are always flattened
  • array
    • matrix: from vector:
      m <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE, dimnames=list(paste0("r",1:5), paste0("c",1:4)))
    • array: multi-dim
      array(1:12, c(2,3,2))
      #> , , 1
      #>
      #>      [,1] [,2] [,3]
      #> [1,]    1    3    5
      #> [2,]    2    4    6
      #>
      #> , , 2
      #>
      #>      [,1] [,2] [,3]
      #> [1,]    7    9   11
      #> [2,]    8   10   12

View

  • general: e.g.
    • print: d
    • name: e.g.
      • create (three equivalent ways):
        v <- c(a=1,b=2,c=3)
        names(v) <- c("a","b","c")
        v <- setNames(1:3, c("a","b","c"))
      • remove:
        names(v) <- NULL
        unname(v)
    • type:
      • class(d): the class the object presents for method dispatch (OOP)
      • typeof(d): R's internal storage type
      • mode(d): the S-language type, as defined by Becker, Chambers & Wilks
    • attribute: arbitrary additional metadata e.g.
      • attr(f, "attr_name")
      • attributes(d) e.g.
      • structure: str(d) best way to discover type
        • data.frame:
          'data.frame': 4 obs. of 2 variables:
          $ var1: num 1 2 3 4
          $ var2: num 4 3 2 1
        • list:
          List of 2
          $ var1: num [1:4] 1 2 3 4
          $ var2: num [1:2] 1 2
        • vector: num [1:4] 1 2 3 4
        • matrix: num [1:4, 1:2] 1 2 3 4 1 2 3 4
  • data.frame:
    • print: View(df)
    • first n rows: head(df, n=10)
    • last n rows: tail(df, n=10)
    • list column names:
      colnames(df)
      names(df)
    • dimension:
      dim(df)
      ncol(df), length(df)
      nrow(df)
  • list:
    • list keys: names(l)
  • vector:
  • matrix:
    • dimension: dim(m)
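
To make the difference between class(), typeof() and mode() concrete, here is a small sketch. The helper report() is hypothetical and exists only for this example.

```r
df <- data.frame(var1 = c(1, 2, 3, 4), var2 = c(4, 3, 2, 1))
m  <- matrix(1:8, nrow = 4)
f  <- factor(c("a", "b"))

# class():  what the object presents itself as (drives method dispatch)
# typeof(): how R stores it internally
# mode():   the older S-compatible grouping of types
report <- function(x) {
  c(class = class(x)[1], typeof = typeof(x), mode = mode(x))
}

print(report(df))
print(report(m))
print(report(f))

# str() shows the structure in one call.
str(df)
```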

Index/Recode

  • data.frame
    • select column:
      • [] returns a data.frame (a list of columns):

        df[1:2] df[c("user","id")]

           user id
        1  John  1
        2 Peter  2
      • [[]] $ [,] return the column itself as a vector:

        df$user df[["user"]] df[,"user"] df[,1]

        [1] "John"  "Peter"
    • exclude column: df[c(-3,-5)] df[!(names(df) %in% c("user","id"))] df$user <- df$id <- NULL
    • filter: e.g.
      df <- df[which(df$user=="John" & df$id==1), ] # which() returns the matching row indices and drops NA

      or

      subset(df, user=="John" & id==1, select=c(user))
    • replace
      • built-in: df$user[df$id==1] <- "Bob"
      • match: df$user <- c("Bob","David")[match(df$user, c("John","Peter"))]
      • plyr:
        library(plyr)
        df$user <- revalue(df$user, c("John"="Bob", "Peter"="David"))
        df$user <- mapvalues(df$user, from=c("John","Peter"), to=c("Bob","David"))
      • cut: df$level <- cut(df$id, breaks=c(-Inf,2,4,Inf), labels=c("low","medium","high"))
    • random sample: df[sample(1:nrow(df), 50, replace=FALSE),]
  • list:
    • element: l[[1]] l[["user"]] l$user
  • vector:
    • index: e.g.
      • positive number:
        • v[2] v[c(2,3)]
        • v[c(1,1)] returns the first element twice
        • v[c(2.1,3.2)] doubles are truncated to integers
      • negative number:
        • v[-c(1,2)] remove elements
        • v[c(-1,2)] is an error: can’t mix positive & negative index
      • logical:
        • v[c(TRUE,TRUE,FALSE)] is v[c(1,2)] for a vector of length 3
        • v[c(TRUE,FALSE)] recycled to v[c(TRUE,FALSE,TRUE,FALSE...)]
        • v[c(NA)] is NA for every element (the logical NA is recycled)
      • nothing: v[] is v
      • zero: v[0] is a zero-length vector, e.g. numeric(0)
      • name:
        v <- setNames(v, letters[1:4])
        v[c("a","b")]
    • replace: e.g.
      • built-in: v[v=="beta"] <- "two"
      • plyr:
        library(plyr)
        revalue(v, c("beta"="two", "gamma"="three"))
        mapvalues(v, from=c("beta","gamma"), to=c("two","three"))
    • find index by value: which(v==value)
  • matrix: e.g.
    • column: m[,4]
    • row: m[3,]
    • subset: m[2,3] m[2:4,1:3]
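
The indexing rules above can be tried out with a small sketch like the following. The data frame and the named vector are made-up sample data.

```r
df <- data.frame(user = c("John", "Peter"), id = c(1, 2),
                 stringsAsFactors = FALSE)

# Single brackets keep the data.frame wrapper; [[ ]] and $ extract the column.
sub_df  <- df["user"]
sub_vec <- df[["user"]]
print(class(sub_df))
print(class(sub_vec))

# Drop columns by name with a logical mask.
no_id <- df[!(names(df) %in% "id")]
print(names(no_id))

# Vector indexing: positive, negative, logical (recycled), and by name.
v <- setNames(c(10, 20, 30, 40), letters[1:4])
print(v[c(2, 3)])
print(v[-c(1, 2)])
print(v[c(TRUE, FALSE)])
print(v[c("a", "b")])

# Position of a value.
pos <- which(v == 30)
print(pos)
```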

Convert

factor2numeric

e.g.

as.numeric(as.character(f))
as.numeric(levels(f))[f]
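
Both forms go through the level labels because calling as.numeric() directly on a factor returns its integer codes. A quick check with made-up values:

```r
f <- factor(c("10", "5", "10"))

wrong  <- as.numeric(f)                 # integer codes of the levels
right1 <- as.numeric(as.character(f))   # convert via the labels
right2 <- as.numeric(levels(f))[f]      # convert the levels once, then index

print(wrong)
print(right1)
print(right2)
```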

factor2character

e.g.

# convert all columns
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
# convert only factor columns
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

data.frame2list (group by)

  • split
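
A minimal sketch of split() as a group-by, assuming a data frame with a grouping column user and a numeric column income (both hypothetical):

```r
df <- data.frame(user = c("John", "Peter", "John"),
                 income = c(100, 200, 300),
                 stringsAsFactors = FALSE)

# split() returns a named list with one data.frame per group.
groups <- split(df, df$user)
print(names(groups))
print(nrow(groups[["John"]]))

# A typical follow-up: apply a function to each group.
totals <- sapply(groups, function(g) sum(g$income))
print(totals)
```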

list2vector

  • unlist: v <- unlist(l, recursive=TRUE, use.names=FALSE) e.g.

string2vector

strsplit("a,b,c", ",")[[1]] is c("a","b","c"); strsplit() returns a list with one element per input string

NA

  • check: is.na(d) e.g.
  • edit to missing: df$v1[df$v1==0] <- NA
  • remove missing:
    • na.omit()
    • mean(x, na.rm=TRUE)
    • df[complete.cases(df),] keeps only complete rows; df[!complete.cases(df),] selects the rows with missing values
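
The following sketch walks through the NA operations above on a made-up data frame:

```r
df <- data.frame(v1 = c(1, 0, 3), v2 = c("a", NA, "c"),
                 stringsAsFactors = FALSE)

df$v1[df$v1 == 0] <- NA            # recode 0 as missing

print(is.na(df$v1))
avg <- mean(df$v1, na.rm = TRUE)   # ignores the NA
print(avg)

complete   <- df[complete.cases(df), ]    # rows without any NA
incomplete <- df[!complete.cases(df), ]   # rows with at least one NA
print(nrow(complete))
print(nrow(incomplete))
print(nrow(na.omit(df)))                  # same rows as 'complete'
```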

Unique/Duplicate

  • data.frame e.g.
    • duplicate by row
      • bool: duplicated(df)
      • filter: df[duplicated(df),]
      • filter w/out repeat: unique(df[duplicated(df),])
      • by column: e.g.
    • unique:
      • all columns: unique(df) df[!duplicated(df),]
      • by column: e.g.
        library(data.table)
        d <- as.data.table(df)
        unique(d, by=c("user", "id"))
  • vector e.g.
    • duplicate
      • bool: duplicated(v)
      • filter: v[duplicated(v)]
      • filter w/out repeat: unique(v[duplicated(v)])
    • unique: unique(v) v[!duplicated(v)]
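
To see how duplicated() and unique() relate, a small sketch with made-up values. The last line is a base-R alternative for uniqueness by a chosen column, offered as an assumption rather than as the method used above.

```r
v <- c("a", "b", "a", "c", "a", "b")

dup_flags <- duplicated(v)            # TRUE for every repeat after the first
repeats   <- v[dup_flags]             # repeats, possibly listed more than once
print(dup_flags)
print(repeats)
print(unique(repeats))                # each repeated value once
print(unique(v))                      # first occurrence of every value

df <- data.frame(user = c("John", "Peter", "John"), id = c(1, 2, 1))
dedup <- df[!duplicated(df), ]        # drop fully identical rows

# Base-R alternative for uniqueness by selected columns.
by_user <- df[!duplicated(df["user"]), ]
print(by_user)
```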

Sort

  • data.frame
    • by column:

      df <- df[order(df$user, -df$id),]

      or

      library(plyr)
      arrange(df, user, desc(id))
    • sort columns:
      • matrix-style: df <- df[, order(colnames(df))]
      • list-style: df <- df[order(colnames(df))] df <- df[c(3,2,1)]
    • trick
      • sort by all columns, from left to right: df[do.call(order, as.list(df)),]
      • reverse sort:
        • number: -df$id
        • factor: convert to number with -xtfrm(df$user)
        • character: convert to factor, then to number
  • vector: sort(v, decreasing=TRUE)
  • matrix
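
A base-R sketch of the sorting tricks above, on made-up data. It applies xtfrm() directly to a character column, which is an assumption about convenience rather than something stated in the notes.

```r
df <- data.frame(user = c("Peter", "John", "John"), id = c(2, 1, 3),
                 stringsAsFactors = FALSE)

# user ascending, id descending within each user
s1 <- df[order(df$user, -df$id), ]

# user descending: xtfrm() maps values to sortable numbers that can be negated
s2 <- df[order(-xtfrm(df$user), df$id), ]

# sort by all columns, from left to right
s3 <- df[do.call(order, as.list(df)), ]

print(s1)
print(s2)
print(s3)
```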

Merge

  • data.frame
    • merge: e.g.
      df3 <- merge(df1, df2, by.x=c("var1", "var2"), by.y=c("var3", "var4"), all.x=TRUE, sort=FALSE) # sort=FALSE: do not sort the result on the by columns
    • cbind: df <- cbind(df, age=age)
    • rbind: df <- rbind(df, df2) e.g.
    • plyr::rbind.fill(): fill missing columns e.g.
  • vector
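
A small sketch of a left join plus rbind() and cbind(), using two hypothetical data frames whose key columns have different names:

```r
df1 <- data.frame(user = c("John", "Peter", "Mary"), id = c(1, 2, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("John", "Peter"), age = c(30, 40),
                  stringsAsFactors = FALSE)

# Left join: keep every row of df1 and fill unmatched columns with NA.
joined <- merge(df1, df2, by.x = "user", by.y = "name",
                all.x = TRUE, sort = FALSE)
print(joined)

# rbind() needs matching column names; cbind() needs matching row counts.
stacked <- rbind(df1, data.frame(user = "Ann", id = 4,
                                 stringsAsFactors = FALSE))
widened <- cbind(df1, active = c(TRUE, FALSE, TRUE))
print(dim(stacked))
print(dim(widened))
```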

Count

  • data.frame:
    nrow(df)
    ncol(df)
    dim(df)
  • list: length(l)
  • vector: length(v)
  • matrix: dim(m)

Summarize

  • data.frame e.g.
    • ddply:
      library(plyr)
      dfs <- ddply(df, c("user", "id"), 
                       summarise, 
                       N=sum(!is.na(income)), 
                       mean=mean(income, na.rm=TRUE), 
                       sd=sd(income, na.rm=TRUE), 
                       se=sd/sqrt(N))
    • aggregate: dfs <- aggregate(df["income"], by=df[c("user","id")], FUN=length)
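
For setups without plyr, a rough base-R counterpart can be built from aggregate() alone. The sample data below is hypothetical, and each statistic needs its own call.

```r
df <- data.frame(user = c("John", "John", "Peter", "Peter"),
                 income = c(100, NA, 200, 400),
                 stringsAsFactors = FALSE)

# Extra arguments such as na.rm are passed through to FUN.
means <- aggregate(df["income"], by = df["user"], FUN = mean, na.rm = TRUE)

# Count the non-missing values per group.
counts <- aggregate(df["income"], by = df["user"],
                    FUN = function(x) sum(!is.na(x)))

print(means)
print(counts)
```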

Trick

String

substring

substr("abcdef", 2, 4) is "bcd"

x <- "abcdef"; substr(x, 2, 4) <- "ghi" changes x to "aghief" (the replacement form needs a variable on the left)

concatenate

paste("x", 1:3, sep="") is c("x1","x2","x3")

search

grep("B", c("a", "b", "c"), ignore.case=TRUE) is 2 (ignore.case=TRUE is ignored when fixed=TRUE is set)

replace

sub("B", "D", "abc", ignore.case=TRUE) is "aDc"

case

  • Uppercase: toupper("abc") e.g.
  • Lowercase: tolower("ABC")
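
The string helpers in this section can be checked together in one short base-R sketch. The strsplit() line also covers splitting by a character.

```r
x <- "abcdef"
part <- substr(x, 2, 4)          # characters 2 through 4
substr(x, 2, 4) <- "ghi"         # in-place replacement needs a variable

labels  <- paste("x", 1:3, sep = "")
hit     <- grep("B", c("a", "b", "c"), ignore.case = TRUE)
swapped <- sub("B", "D", "abc", ignore.case = TRUE)
pieces  <- strsplit("a,b,c", ",")[[1]]   # strsplit() returns a list
upper   <- toupper("abc")

print(part)
print(x)
print(labels)
print(hit)
print(swapped)
print(pieces)
print(upper)
```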

format character

  • new line: \n
  • tab: \t

split by character

Date

e.g.

Function

Recursive

Variable Scope

  • global environment variable: assign('var', value, envir=.GlobalEnv)
  • assign to a variable in an enclosing (parent) scope: var <<- value (falls back to the global environment if var is not found)
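
A common use of <<- is a function that keeps private state. The names make_counter and global_flag below are hypothetical and only serve the illustration.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates 'count' in the enclosing function's frame
    count
  }
}

counter <- make_counter()
first  <- counter()
second <- counter()
print(second)

# assign() with envir=.GlobalEnv always targets the global environment.
assign("global_flag", TRUE, envir = .GlobalEnv)
print(exists("global_flag", envir = .GlobalEnv))
```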

Closure

(http://stackoverflow.com/questions/1169534/writing-functions-in-r-keeping-scoping-in-mind)

Loop

Reshape

apply*

(http://www.statmethods.net/management/controlstructures.html)

Parallel Computing

Memory

e.g.

Graph
