redun                 package:Hmisc                 R Documentation

_R_e_d_u_n_d_a_n_c_y _A_n_a_l_y_s_i_s

_D_e_s_c_r_i_p_t_i_o_n:

     Uses flexible parametric additive models (see 'areg' and its use
     of regression splines) to determine how well each variable can be
     predicted from the remaining variables.  Variables are dropped in
     a stepwise fashion: at each step the most predictable variable is
     removed, and the variables still in the list serve as predictors
     for the next step.  The process continues until no variable still
     in the list of predictors can be predicted with an R^2 or adjusted
     R^2 of at least 'r2', or until dropping the variable with the
     highest R^2 (adjusted or ordinary) would cause a variable that was
     dropped earlier to no longer be predicted at least at the 'r2'
     level from the now smaller list of predictors.
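
     The stepwise procedure described above can be sketched in base R.
     This is a simplified illustration, not the code used by 'redun':
     it assumes numeric columns only and fits ordinary linear models
     with 'lm' instead of the spline-based 'areg' fits.  The helper
     names 'r2_from' and 'simple_redun' are hypothetical.

```r
# Illustrative sketch only: linear models on numeric columns,
# not the spline-based 'areg' fits that 'redun' uses.
r2_from <- function(d, target, predictors, type = "ordinary") {
  s <- summary(lm(reformulate(predictors, response = target), data = d))
  if (type == "adjusted") s$adj.r.squared else s$r.squared
}

simple_redun <- function(d, r2 = 0.9, type = "ordinary") {
  keep <- names(d)
  dropped <- character(0)
  while (length(keep) >= 2) {
    # R^2 for predicting each remaining variable from the other ones
    r2s <- sapply(keep, function(v) r2_from(d, v, setdiff(keep, v), type))
    if (max(r2s) < r2) break          # nothing left is redundant
    cand <- names(which.max(r2s))     # most predictable variable
    new_keep <- setdiff(keep, cand)
    # Stop if a variable dropped earlier would fall below the cutoff
    # when predicted from the smaller list of predictors
    if (length(dropped) > 0) {
      back <- sapply(dropped, function(v) r2_from(d, v, new_keep, type))
      if (any(back < r2)) break
    }
    keep <- new_keep
    dropped <- c(dropped, cand)
  }
  list(kept = keep, dropped = dropped)
}

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
res <- simple_redun(data.frame(x1, x2, x3), r2 = 0.8)
print(res)
```

     In this toy data 'x3' is nearly a linear function of 'x1' and
     'x2', so the sketch should drop 'x3' and keep the other two.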

_U_s_a_g_e:

     redun(formula, data=NULL, subset=NULL, r2 = 0.9,
           type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
           allcat=FALSE, minfreq=0, pr = FALSE, ...)
     ## S3 method for class 'redun':
     print(x, digits=3, long=TRUE, ...)

_A_r_g_u_m_e_n_t_s:

 formula: a formula.  Enclose a variable in 'I()' to force linearity.

    data: a data frame

  subset: usual subsetting expression

      r2: ordinary or adjusted R^2 cutoff for redundancy

    type: which R^2 to compare against 'r2'.  The default,
          '"ordinary"', uses ordinary R^2; specify '"adjusted"' to use
          adjusted R^2.

      nk: number of knots to use for continuous variables.  Use 'nk=0'
          to force linearity for all variables.

 tlinear: set to 'FALSE' to allow a variable to be automatically
          nonlinearly transformed (see 'areg') while it is being
          predicted.  By default, only continuous variables on the
          right-hand side (i.e., while they are acting as predictors)
          are automatically transformed, using regression splines.
          Estimating transformations for target (dependent) variables
          causes more overfitting than doing so for predictors.

  allcat: set to 'TRUE' to ensure that all categories of categorical
          variables having more than two categories are redundant (see
          details below)

 minfreq: For a binary or categorical variable, there must be at least
          two categories with at least 'minfreq' observations or the
          variable will be dropped and not checked for redundancy
          against other variables.  'minfreq' also specifies the
          minimum frequency of a category or its complement before
          that category is considered when 'allcat=TRUE'.

      pr: set to 'TRUE' to monitor progress of the stepwise algorithm

     ...: arguments passed to 'dataframeReduce' to remove "difficult"
          variables from 'data' when 'formula' is '~.' (that is, when
          all variables in 'data' are used).  'data' must be specified
          when these arguments are used.  Ignored for 'print'.

       x: an object created by 'redun'

  digits: number of digits to which to round R^2 values when printing

    long: set to 'FALSE' to prevent the 'print' method from printing
          the R^2 history and the original R^2 with which each variable
          can be predicted from ALL other variables.
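
     The two roles of 'minfreq' described above can be illustrated
     with a small base R sketch.  It only restates the documented
     rules; it is not taken from the 'redun' source, and the helper
     names 'passes_minfreq' and 'eligible_categories' are
     hypothetical.

```r
# Rule 1 (as documented): a binary or categorical variable is checked
# for redundancy only if at least two categories have >= minfreq
# observations.
passes_minfreq <- function(x, minfreq = 0) {
  counts <- as.vector(table(x))
  sum(counts >= minfreq) >= 2
}

# Rule 2 (as documented, for allcat=TRUE): a category is considered
# only if both it and its complement have >= minfreq observations.
eligible_categories <- function(x, minfreq = 0) {
  tab <- table(x)
  counts <- as.vector(tab)
  n <- sum(counts)
  names(tab)[counts >= minfreq & (n - counts) >= minfreq]
}

x <- factor(c(rep("a", 50), rep("b", 45), rep("c", 5)))
chk40 <- passes_minfreq(x, minfreq = 40)       # two levels reach 40
chk48 <- passes_minfreq(x, minfreq = 48)       # only one level reaches 48
elig  <- eligible_categories(x, minfreq = 10)  # "c" is too infrequent
print(list(chk40 = chk40, chk48 = chk48, elig = elig))
```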

_D_e_t_a_i_l_s:

     A categorical variable is deemed redundant if a linear combination
     of dummy variables representing it can be predicted from a linear
     combination of other variables.  For example, if there were 4
     cities in the data and each city's rainfall was also present as a
     variable, with virtually the same rainfall reported for all
     observations for a city, city would be redundant given rainfall
     (or vice versa; the one declared redundant would be the first one
     in the formula). If two cities had the same rainfall, 'city' might
     be declared redundant even though tied cities might be deemed
     non-redundant in another setting.  To ensure that all categories
     may be predicted well from other variables, use the 'allcat'
     option.  To ignore categories that are too infrequent or too
     frequent, set 'minfreq' to a nonzero integer.  When the number of
     observations in the category is below this number or the number of
     observations not in the category is below this number, no attempt
     is made to predict observations being in that category
     individually for the purpose of redundancy detection.
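
     The city and rainfall example can be mimicked with simulated
     data.  The sketch below is an assumption-laden illustration in
     base R: it uses 'cancor' and 'lm' with purely linear terms,
     whereas 'redun' relies on 'areg', and the data, city labels and
     rainfall means are invented.  It contrasts the default notion of
     redundancy (some linear combination of the dummy variables is
     predictable) with the stricter per-category notion requested by
     'allcat=TRUE'.

```r
set.seed(2)
city <- factor(rep(c("A", "B", "C", "D"), each = 25))
city_mean <- c(A = 10, B = 20, C = 20, D = 40)  # B and C are tied
rain <- city_mean[as.character(city)] + rnorm(100, sd = 0.5)

# Default notion: squared canonical correlation between the dummy
# variables for city and the other variable (here only rainfall)
dummies <- model.matrix(~ city)[, -1]
r2_any <- cancor(dummies, cbind(rain))$cor[1]^2

# Per-category notion: predict each category indicator separately
# (linear in rainfall only, for simplicity)
r2_each <- sapply(levels(city), function(lv)
  summary(lm(as.numeric(city == lv) ~ rain))$r.squared)

print(round(r2_any, 3))
print(round(r2_each, 3))
```

     Here the combined measure is high because rainfall separates the
     cities as a group, while the indicator for a tied city such as
     'B' is poorly predicted on its own.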

_V_a_l_u_e:

     an object of class '"redun"'

_A_u_t_h_o_r(_s):

     Frank Harrell 
      Department of Biostatistics 
      Vanderbilt University 
      f.harrell@vanderbilt.edu

_S_e_e _A_l_s_o:

     'areg', 'dataframeReduce', 'transcan', 'varclus'

_E_x_a_m_p_l_e_s:

     set.seed(1)
     n <- 100
     x1 <- runif(n)
     x2 <- runif(n)
     x3 <- x1 + x2 + runif(n)/10
     x4 <- x1 + x2 + x3 + runif(n)/10
     x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
     x6 <- 1*(x5=='a' | x5=='c')
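     # Default: a factor is redundant if a linear combination of its
     # dummy variables is predictable from the other variables
     redun(~x1+x2+x3+x4+x5+x6, r2=.8)
     # Require at least two categories with 40 or more observations
     # before a categorical variable is checked for redundancy
     redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
     # Require every category of x5 to be predictable
     redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
     # x5 is no longer redundant but x6 is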

