summarize               package:Hmisc               R Documentation

_S_u_m_m_a_r_i_z_e _S_c_a_l_a_r_s _o_r _M_a_t_r_i_c_e_s _b_y _C_r_o_s_s-_C_l_a_s_s_i_f_i_c_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     'summarize' is a fast version of 'summary(formula,
     method="cross",overall=FALSE)' for producing stratified summary
     statistics and storing them in a data frame for plotting
     (especially with trellis 'xyplot' and 'dotplot' and Hmisc
     'xYplot').  Unlike 'aggregate', 'summarize' accepts a matrix as
     its first argument and a multi-valued 'FUN' argument.  It also
     labels the variables in the new data frame using their original
     names.  Unlike methods based on 'tapply',
     'summarize' stores the values of the stratification variables
     using their original types, e.g., a numeric 'by' variable will
     remain a numeric variable in the collapsed data frame. 'summarize'
     also retains '"label"' attributes for variables. 'summarize' works
     especially well with the Hmisc 'xYplot' function for displaying
     multiple summaries of a single variable on each panel, such as
     means and upper and lower confidence limits.

     'subsAttr' saves attributes that are commonly preserved across row
     subsetting (i.e., it does not save 'dim', 'dimnames', or 'names'
     attributes).

     'asNumericMatrix' converts a data frame into a numeric matrix.

     'matrix2dataFrame' converts a numeric matrix back into a data
     frame if attributes and storage modes of the original variables
     are saved by calling 'subsAttr'.

_U_s_a_g_e:

     summarize(X, by, FUN, ..., 
               stat.name=deparse(substitute(X)),
               type=c('variables','matrix'), subset=TRUE)

     asNumericMatrix(x)

     subsAttr(x)

     matrix2dataFrame(x, at, restoreAll=TRUE)

_A_r_g_u_m_e_n_t_s:

       X: a vector or matrix capable of being operated on by the
          function specified as the 'FUN' argument 

      by: one or more stratification variables.  If a single variable,
          'by' may be a vector, otherwise it should be a list. Using
          the Hmisc 'llist' function instead of 'list' will result in
          individual variable names being accessible to 'summarize'. 
          For example, you can specify 'llist(age.group,sex)' or
          'llist(Age=age.group,sex)'.  The latter gives 'age.group' a
          new temporary name, 'Age'.  

     FUN: a function of a single argument (a vector, or a matrix when
          'X' is a matrix), used to create the statistical summaries
          for 'summarize'.  'FUN' may compute any number of statistics.  

     ...: extra arguments are passed to 'FUN'

stat.name: the name to use when creating the main summary variable.  By
          default, the name of the 'X' argument is used.  Set
          'stat.name' to 'NULL' to suppress this name replacement. 

    type: Specify 'type="matrix"' to store the summary variables (if
          there is more than one) in a matrix. 

  subset: a logical vector or integer vector of subscripts used to
          specify the subset of data to use in the analysis.  The
          default is to use all observations. 

       x: a data frame (for 'asNumericMatrix') or a numeric matrix (for
          'matrix2dataFrame').  For 'subsAttr', 'x' may be a data
          frame, list, or a vector. 

      at: result of 'subsAttr' 

restoreAll: set to 'FALSE' to only restore attributes 'label', 'units',
          and 'levels' instead of all attributes 

_V_a_l_u_e:

     For 'summarize', a data frame containing the 'by' variables and
     the statistical summaries (the first of which is named the same as
     the 'X' variable unless 'stat.name' is given).  If
     'type="matrix"', the summaries are stored in a single variable in
     the data frame, and this variable is a matrix.

     'asNumericMatrix' returns a numeric matrix.

     'matrix2dataFrame' returns a data frame.

     'subsAttr' returns a list of attribute lists if its argument is a
     list or data frame, and otherwise a list containing the attributes
     of a single variable.

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'label', 'cut2', 'llist', 'by'

_E_x_a_m_p_l_e_s:

     ## Not run: 
     s <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
                    stat.name='Proportion')
     dotplot(Proportion ~ size | bone, data=s)
     ## End(Not run)

     set.seed(1)
     temperature <- rnorm(300, 70, 10)
     month <- sample(1:12, 300, TRUE)
     year  <- sample(2000:2001, 300, TRUE)
     g <- function(x)c(Mean=mean(x,na.rm=TRUE),Median=median(x,na.rm=TRUE))
     summarize(temperature, month, g)
     mApply(temperature, month, g)

     mApply(temperature, month, mean, na.rm=TRUE)
     w <- summarize(temperature, month, mean, na.rm=TRUE)
     library(lattice)
     xyplot(temperature ~ month, data=w) # plot mean temperature by month
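
     # A small check, added as an illustration, of the statement in the
     # Description that a numeric 'by' variable stays numeric, whereas
     # tapply keeps the strata only as character names.  The names dose,
     # resp and sdose are hypothetical and not part of the original
     # examples.
     library(Hmisc)
     set.seed(3)
     dose  <- rep(c(0.5, 1, 2), each=20)
     resp  <- rnorm(60, mean=dose)
     sdose <- summarize(resp, dose, mean)
     sdose
     is.numeric(sdose$dose)                         # strata kept as numbers
     is.character(names(tapply(resp, dose, mean)))  # tapply: strata are names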

     w <- summarize(temperature, llist(year,month), 
                    quantile, probs=c(.5,.25,.75), na.rm=TRUE, type='matrix')
     xYplot(Cbind(temperature[,1],temperature[,-1]) ~ month | year, data=w)
     mApply(temperature, llist(year,month),
            quantile, probs=c(.5,.25,.75), na.rm=TRUE)
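
     # A minimal sketch, not part of the original examples, contrasting
     # the two 'type' settings described under Value.  The names tm, mo,
     # q3, sv and sm are made up for this illustration.
     library(Hmisc)
     set.seed(2)
     tm <- rnorm(120, 70, 10)
     mo <- rep(1:12, 10)
     q3 <- function(x) quantile(x, probs=c(.5,.25,.75), na.rm=TRUE)
     sv <- summarize(tm, mo, q3)                 # one column per statistic
     sm <- summarize(tm, mo, q3, type='matrix')  # statistics in one matrix
     names(sv)
     names(sm)
     is.matrix(sm$tm)   # the summary variable should itself be a matrix
     dim(sm$tm)         # one row per month, one column per quantile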

     # Compute the median and outer quartiles.  The outer quartiles are
     # displayed using "error bars"
     set.seed(111)
     dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
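     # Remove the earlier global month and year; otherwise they would mask
     # the columns of the attached dfr
     rm(month, year)
     attach(dfr)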
     y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
     s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
     s
     mApply(y, llist(month,year), smedian.hilow, conf.int=.5)

     xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s, 
            keys='lines', method='alt')
     # Can also do:
     s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
                    stat.name=c('y','Q1','Q3'))
     xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s, keys='lines')
     # To display means and bootstrapped nonparametric confidence intervals
     # use for example:
     s <- summarize(y, llist(month,year), smean.cl.boot)
     xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)

     # For each subject use the trapezoidal rule to compute the area under
     # the (time,response) curve using the Hmisc trap.rule function
     x <- cbind(time=c(1,2,4,7, 1,3,5,10),response=c(1,3,2,4, 1,3,2,4))
     subject <- c(rep(1,4),rep(2,4))
     trap.rule(x[1:4,1],x[1:4,2])
     summarize(x, subject, function(y) trap.rule(y[,1],y[,2]))

     ## Not run: 
     # Another approach would be to properly re-shape the mm array below
     # This assumes no missing cells.  There are many other approaches.
     # mApply will do this well while allowing for missing cells.
     m <- tapply(y, list(year,month), quantile, probs=c(.25,.5,.75))
     mm <- array(unlist(m), dim=c(3,2,12), 
                 dimnames=list(c('lower','median','upper'),c('1997','1998'),
                               as.character(1:12)))
     # aggregate will help but it only allows you to compute one quantile
     # at a time; see also the Hmisc mApply function
     dframe <- aggregate(y, list(Year=year,Month=month), quantile, probs=.5)

     # Compute expected life length by race assuming an exponential
     # distribution - can also use summarize
     g <- function(y) { # computations for one race group
       futime <- y[,1]; event <- y[,2]
       sum(futime)/sum(event)  # assume event=1 for death, 0=alive
     }
     mApply(cbind(followup.time, death), race, g)

     # To run mApply on a data frame:
     m <- mApply(asNumericMatrix(x), race, h)
     # Here assume h is a function that returns a matrix similar to x
     at <- subsAttr(x)  # get original attributes and storage modes
     matrix2dataFrame(m, at)

     # Get stratified weighted means
     g <- function(y) wtd.mean(y[,1],y[,2])
     summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
     mApply(cbind(y,wts), llist(sex,race), g)

     # Compare speed of mApply and summarize vs. by for computing
     # stratified medians
     d <- data.frame(sex=sample(c('female','male'),100000,TRUE),
                     country=sample(letters,100000,TRUE),
                     y1=runif(100000), y2=runif(100000))
     g <- function(x) {
       # medians of the difference and of the sum of y1 and y2
       c(med.diff=median(x[,'y1']-x[,'y2']),
         med.sum =median(x[,'y1']+x[,'y2']))
     }

     system.time(by(d, llist(sex=d$sex,country=d$country), g))
     system.time({
                  x <- asNumericMatrix(d)
                  a <- subsAttr(d)
                  m <- mApply(x, llist(sex=d$sex,country=d$country), g)
                 })
     system.time({
                  x <- asNumericMatrix(d)
                  summarize(x, llist(sex=d$sex, country=d$country), g)
                 })

     # An example where each subject has one record per diagnosis but sex of
     # subject is duplicated for all the rows a subject has.  Get the cross-
     # classified frequencies of diagnosis (dx) by sex and plot the results
     # with a dot plot

     count <- rep(1,length(dx))
     d <- summarize(count, llist(dx,sex), sum)
     Dotplot(dx ~ count | sex, data=d)
     ## End(Not run)
     detach('dfr')

     # Run summarize on a matrix to get column means
     x <- c(1:19,NA)
     y <- 101:120
     z <- cbind(x, y)
     g <- c(rep(1, 10), rep(2, 10))
     summarize(z, g, colMeans, na.rm=TRUE, stat.name='x')
     # Also works on an all numeric data frame
     summarize(as.data.frame(z), g, colMeans, na.rm=TRUE, stat.name='x')
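
     # Round trip through a numeric matrix and back, as outlined in the
     # Description.  A minimal sketch: the data frame dd and the names
     # att, mat and back are invented for this illustration.
     library(Hmisc)
     dd <- data.frame(sex=factor(c('female','male','male','female')),
                      age=c(31, 45, 52, 28))
     att  <- subsAttr(dd)          # attributes and storage modes per variable
     mat  <- asNumericMatrix(dd)   # factor levels become integer codes
     back <- matrix2dataFrame(mat, att)
     sapply(back, class)
     levels(back$sex)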

