curveRep                package:Hmisc                R Documentation

_R_e_p_r_e_s_e_n_t_a_t_i_v_e _C_u_r_v_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     'curveRep' finds representative curves from a relatively large
     collection of curves.  The curves usually represent time-response
     profiles as in serial (longitudinal or repeated) data with
     possibly unequal time points and greatly varying sample sizes per
     subject.  After excluding records containing missing 'x' or 'y',
     records are first stratified into 'kn' groups having similar
     sample sizes per curve (subject).  Within these strata, curves are
     next stratified according to the distribution of 'x' points per
     curve (typically measurement times per subject).  The 'clara'
     clustering/partitioning function is used to do this, clustering on
     one, two, or three 'x' characteristics depending on the minimum
     sample size in the current interval of sample size.  If the
     interval has a minimum number of unique 'values' of one,
     clustering is done on the single 'x' values.  If the minimum
     number of unique 'x' values is two, clustering is done to create
     groups that are similar on both 'min(x)' and 'max(x)'.  For groups
     containing no fewer than three unique 'x' values, clustering is
     done on the trio of values 'min(x)', 'max(x)', and the longest gap
     between any successive 'x'.  Then within sample size and 'x'
     distribution strata, clustering of time-response profiles is based
     on 'p' values of 'y' all evaluated at the same 'p' equally-spaced
     'x''s within the stratum.  An option allows per-curve data to be
     smoothed with 'lowess' before proceeding.  Outer 'x' values are
     taken as extremes of 'x' across all curves within the stratum.
     Linear interpolation within curves is used to estimate 'y' at the
     grid of 'x''s.  For curves within the stratum that do not extend
     to the most extreme 'x' values in that stratum, extrapolation uses
     flat lines from the observed extremes in the curve unless
     'extrap=TRUE'. The 'p' 'y' values are clustered using 'clara'.

     'print' and 'plot' methods show results.  By specifying an
     auxiliary 'idcol' variable to 'plot', other variables such as
     treatment may be depicted to allow the analyst to determine for
     example whether subjects on different treatments are assigned to
     different time-response profiles.  To write the frequencies of a
     variable such as treatment in the upper left corner of each panel
     (instead of the grand total number of clusters in that panel),
     specify 'freq'.

     'curveSmooth' takes a set of curves and smooths them using
     'lowess'.  If the number of unique 'x' points in a curve is less
     than 'p', the smooth is evaluated at the unique 'x' values. 
     Otherwise it is evaluated at an equally spaced set of 'x' points
     over the observed range.  If fewer than 3 unique 'x's are in a
     curve, those points are used and smoothing is not done.

_U_s_a_g_e:

     curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5,
              force1 = TRUE, metric = c("euclidean", "manhattan"),
              smooth=FALSE, extrap=FALSE, pr=FALSE)

     ## S3 method for class 'curveRep':
     print(x, ...)

     ## S3 method for class 'curveRep':
     plot(x, which=1:length(res), method=c('all','lattice'),
                             m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE,
                             idcol=NULL, freq=NULL, xlim=range(x), ylim=range(y),
                             xlab='x', ylab='y', ...)
     curveSmooth(x, y, id, p=NULL, pr=TRUE)

_A_r_g_u_m_e_n_t_s:

       x: a numeric vector, typically measurement times. For
          'plot.curveRep' is an object created by 'curveRep'.

       y: a numeric vector of response values

      id: a vector of curve (subject) identifiers, the same length as
          'x' and 'y'

      kn: number of curve sample size groups to construct. 'curveRep'
          tries to divide the data into equal numbers of curves across
          sample size intervals.

  kxdist: maximum number of x-distribution clusters to derive using
          'clara'

       k: maximum number of x-y profile clusters to derive using
          'clara'

       p: number of 'x' points at which to interpolate 'y' for profile
          clustering.  For 'curveSmooth' is the number of equally
          spaced points at which to evaluate the lowess smooth, and if
          'p' is omitted the smooth is evaluated at the original 'x'
          values (which will allow 'curveRep' to still know the 'x'
          distribution

  force1: By default if any curves have only one point, all curves
          consisting of one point will be placed in a separate stratum.
           To prevent this separation, set 'force1=FALSE'.

  metric: see 'clara'

  smooth: By default, linear interpolation is used on raw data to
          obtain 'y' values to cluster to determine x-y profiles.
          Specify 'smooth=TRUE' to replace observed points with
          'lowess' before computing 'y' points on the grid. Also, when
          'smooth' is used, it may be desirable to use 'extrap=TRUE'.

  extrap: set to 'TRUE' to use linear extrapolation to evaluate 'y'
          points for x-y clustering.  Not recommended unless smoothing
          has been or is being done.

      pr: set to 'TRUE' to print progress notes

   which: an integer vector specifying which sample size intervals to
          plot.  Must be specified if 'method='lattice'' and must be a
          single number in that case.

  method: The default makes individual plots of possibly all
          x-distribution by sample size by cluster combinations.  Fewer
          may be plotted by specifying 'which'.  Specify
          'method='lattice'' to show a lattice 'xyplot' of a single
          sample size interval, with x distributions going across and
          clusters going down.

       m: the number of curves in a cluster to randomly sample if there
          are more than 'm' in a cluster.  Default is to draw all
          curves in a cluster.  For 'method='lattice'' you can specify
          'm='quantiles'' to use the 'xYplot' function to show
          quantiles of 'y' as a function of {x}, with the quantiles
          specified by the 'probs' argument.  This cannot be used to
          draw a group containing 'n=1'.

      nx: applies if 'm='quantiles''.  See 'xYplot'.

   probs: 3-vector of probabilities with the central quantile first. 
          Default uses quartiles.

    fill: for 'method='all'', by default if a sample size
          x-distribution stratum did not have enough curves to stratify
          into 'k' x-y profiles, empty graphs are drawn so that a
          matrix of graphs will have the next row starting with a
          different sample size range or x-distribution.  See the
          example below.

   idcol: a named vector to be used as a table lookup for color
          assignments (does not apply when 'm='quantile'').  The names
          of this vector are curve 'id's and the values are color names
          or numbers.

    freq: a named vector to be used as a table lookup for a grouping
          variable such as treatment.  The names are curve 'id's and
          values are any values useful for grouping in a frequency
          tabulation.

xlim, ylim, xlab, ylab: plotting parameters.  Default ranges are the
          ranges in the entire set of raw data given to 'curveRep'.

     ...: arguments passed to other functions.

_D_e_t_a_i_l_s:

     In the graph titles for the default graphic output, 'n' refers to
     the minimum sample size, 'x' refers to the sequential
     x-distribution cluster, and 'c' refers to the sequential x-y
     profile cluster.  Graphs from 'method='lattice'' are produced by
     'xyplot' and in the panel titles 'distribution' refers to the
     x-distribution stratum and 'cluster' refers to the x-y profile
     cluster.

_V_a_l_u_e:

     a list of class ''curveRep'' with the following elements 

     res: a hierarchical list first split by sample size intervals,
          then by x distribution clusters, then containing a vector of
          cluster numbers with 'id' values as a names attribute

      ns: a table of frequencies of sample sizes per curve after
          removing 'NA's

   nomit: total number of records excluded due to 'NA's

missfreq: a table of frequencies of number of 'NA's excluded per curve

   ncuts: cut points for sample size intervals

      kn: number of sample size intervals

  kxdist: number of clusters on x distribution

       k: number of clusters of curves within sample size and
          distribution groups

       p: number of points at which to evaluate each curve for
          clustering

       x: 

       y: 

      id: input data after removing 'NA's

     'curveSmooth' returns a list with elements 'x,y,id'.

_N_o_t_e:

     The references describe other methods for deriving representative
     curves, but those methods were not used here.  The last reference
     motivated 'curveRep' however.

_A_u_t_h_o_r(_s):

     Frank Harrell
      Department of Biostatistics
      Vanderbilt University
      f.harrell@vanderbilt.edu

_R_e_f_e_r_e_n_c_e_s:

     Segal M. (1994): Representative curves for longitudinal data via
     regression trees.  J Comp Graph Stat 3:214-233.

     Jones MC, Rice JA (1992): Displaying the important features of
     large collections of similar curves.  Am Statistician 46:140-145.

     Zhen X, Simpson JA, et al (2005): Data from a study of
     effectiveness suggested potential prognostic factors related to
     the patterns of shoulder pain.  J Clin Epi 58:823-830.

_S_e_e _A_l_s_o:

     'clara','dataRep'

_E_x_a_m_p_l_e_s:

     ## Not run: 
     # Simulate 200 curves with pre-curve sample sizes ranging from 1 to 10
     # Make curves with odd-numbered IDs have an x-distribution that is random
     # uniform [0,1] and those with even-numbered IDs have an x-dist. that is
     # half as wide but still centered at 0.5.  Shift y values higher with
     # increasing IDs
     set.seed(1)
     N <- 200
     nc <- sample(1:10, N, TRUE)
     id <- rep(1:N, nc)
     x <- y <- id
     for(i in 1:N) {
       x[id==i] <- if(i 
       y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10)
     }

     w <- curveRep(x, y, id, kxdist=2, p=10)
     w
     par(ask=TRUE, mfrow=c(4,5))
     plot(w)                # show everything, profiles going across
     par(mfrow=c(2,5))
     plot(w,1)              # show n=1 results
     # Use a color assignment table, assigning low curves to green and
     # high to red.  Unique curve (subject) IDs are the names of the vector.
     cols <- c(rep('green', N/2), rep('red', N/2))
     names(cols) <- as.character(1:N)
     plot(w, 3, idcol=cols)
     par(ask=FALSE, mfrow=c(1,1))

     plot(w, 1, 'lattice')  # show n=1 results
     plot(w, 3, 'lattice')  # show n=4-5 results
     plot(w, 3, 'lattice', idcol=cols)  # same but different color mapping
     plot(w, 3, 'lattice', m=1)  # show a single "representative" curve
     # Show median, 10th, and 90th percentiles of supposedly representative curves
     plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9))
     # Same plot but with much less grouping of x variable
     plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2)

     # Smooth data before profiling.  This allows later plotting to plot
     # smoothed representative curves rather than raw curves (which
     # specifying smooth=TRUE to curveRep would do, if curveSmooth was not used)
     d <- curveSmooth(x, y, id)
     w <- with(d, curveRep(x, y, id))

     # Example to show that curveRep can cluster profiles correctly when
     # there is no noise.  In the data there are four profiles - flat, flat
     # at a higher mean y, linearly increasing then flat, and flat at the
     # first height except for a sharp triangular peak

     set.seed(1)
     x <- 0:100
     m <- length(x)
     profile <- matrix(NA, nrow=m, ncol=4)
     profile[,1] <- rep(0, m)
     profile[,2] <- rep(3, m)
     profile[,3] <- c(0:3, rep(3, m-4))
     profile[,4] <- c(0,1,3,1,rep(0,m-4))
     col <- c('black','blue','green','red')
     matplot(x, profile, type='l', col=col)
     xeval <- seq(0, 100, length.out=5)
     s <- x 
     matplot(x[s], profile[s,], type='l', col=col)

     id <- rep(1:100, each=m)
     X <- Y <- id
     cols <- character(100)
     names(cols) <- as.character(1:100)
     for(i in 1:100) {
       s <- id==i
       X[s] <- x
       j <- sample(1:4,1)
       Y[s] <- profile[,j]
       cols[i] <- col[j]
     }
     table(cols)
     yl <- c(-1,4)
     w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4)
     plot(w, 1, 'lattice', idcol=cols, ylim=yl)
     # Found 4 clusters but two have same profile
     w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3)
     plot(w, 1, 'lattice', idcol=cols, freq=cols, ylim=yl)
     # Incorrectly combined black and red because default value p=5 did
     # not result in different profiles at x=xeval
     w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40)
     plot(w, 1, 'lattice', idcol=cols, ylim=yl)
     # Found correct clusters because evaluated curves at 40 equally
     # spaced points and could find the sharp triangular peak in profile 4
     ## End(Not run)

