RegressionInterface       package:fRegression       R Documentation

_U_n_i_v_a_r_i_a_t_e _R_e_g_r_e_s_s_i_o_n _M_o_d_e_l_l_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     A collection and description of easy-to-use functions to perform
     univariate regression analysis with several methods, to analyse
     and summarize the fit, and to predict for new data records. 

     The models include:

       '"LM"'        Linear Modelling,
       '"GLM"'       Generalized Linear Modelling,
       '"GAM"'       Generalized Additive Modelling,
       '"PPR"'       Projection Pursuit Regression,
       '"POLYMARS"'  Polychotomous MARS, and
       '"NNET"'      Feedforward Neural Network Modelling.

     Available methods are:

       'predict'        Predict method for objects of class 'fREG',
       'print'          Print method for objects of class 'fREG',
       'plot'           Plot method for objects of class 'fREG',
       'summary'        Summary method for objects of class 'fREG',
       'fitted.values'  Fitted values method for objects of class 'fREG',
       'residuals'      Residuals method for objects of class 'fREG'.

     The print method prints the returned object from a regression 
     fit, and the summary method performs a diagnostic analysis and 
     summarizes the results of the fit in detail. The plot method 
     produces diagnostic plots. The predict method forecasts from 
     new data records. Two further methods return the fitted values 
     and the residuals.


_U_s_a_g_e:

     regSim(model = c("LM3", "LOGIT3", "GAM3"), n = 100, returnClass =
         c("timeSeries", "data.frame"))

     regFit(formula, data, use = c("lm", "rlm", "am", "ppr", "nnet", 
         "polymars"), title = NULL, description = NULL, ...)
     gregFit(formula, family, data, use = c("glm", "gam"), 
         title = NULL, description = NULL, ...)
         
     ## S3 method for class 'fREG':
     predict(object, newdata, se.fit = FALSE, type = "response", ...)

     show.fREG(object)
     ## S3 method for class 'fREG':
     plot(x, ...)
     ## S3 method for class 'fREG':
     summary(object, ...)

     ## S3 method for class 'fREG':
     coef(object, ...)
     ## S3 method for class 'fREG':
     fitted(object, ...)
     ## S3 method for class 'fREG':
     residuals(object, ...)
     ## S3 method for class 'fREG':
     vcov(object, ...)

_A_r_g_u_m_e_n_t_s:

data, newdata: 'data' is the data frame containing the variables in the
          model. By default the variables are taken from
          'environment(formula)', typically the environment from which
          'regFit' is called. 'newdata' is the data frame from which
          to predict. 

description: a brief character string describing the project. 

  family: a description of the error distribution and link function to
          be  used in 'glm' and 'gam' models. See 'glm'  and 'family'
          for more details. 

 formula: a symbolic description of the model to be fitted. 
           A typical 'glm' predictor has the form 'response ~ terms', 
          where 'response' is the (numeric) response vector and 'terms'
          is a series of terms which specifies a (linear) predictor for
          'response'. For 'binomial' models the response can also be
          specified as a 'factor'. 
           A 'gam' formula, see also 'gam.models', additionally allows
          smooth terms on the right-hand side of the formula. See
          'gam.side.conditions' for details and examples. 

returnClass: [regSim] - 
           a character value which describes what should be returned,
          either a '"timeSeries"' object with dummy daily dates 
          starting 1970-01-01 (the default) or a data frame. 

     use: a character string denoting the regression method used to
          fit the model; 'use' must be one of the strings in the
          default argument:
           '"lm"' for linear regression modelling, 
           '"rlm"' for robust linear regression modelling,
           '"am"' for additive modelling,
           '"ppr"' for projection pursuit regression,
           '"polymars"' for polychotomous MARS, and
           '"nnet"' for feedforward neural network modelling; 
          for 'gregFit', either '"glm"' or '"gam"'. 

   model: [regSim] - 
           a character string selecting one of three benchmark
          models: '"LM3"', '"LOGIT3"', or '"GAM3"'. 

       n: [regSim] - 
           an integer value setting the length of the series to be
          simulated.  The default value is 100. 

object, x: [regFit] - 
           an object returned by the regression function 'regFit', 
          which serves as input for the 'predict', 'print', 'summary',
          'print.summary', and 'plot' methods. Some methods allow
          additional arguments to be passed. 

  se.fit: [predict] - 
           ... 

   title: a character string which allows for a project title. 

    type: a character string, the type of prediction. 

     ...: additional optional arguments to be passed to the underlying 
          functions. For details we refer to the following help
          pages: 'lm', 'glm', 'gam', 'ppr', 'polymars', or 'nnet'.  

_D_e_t_a_i_l_s:

     *LM - Linear Modelling:* 

        Univariate linear regression analysis is a statistical
     methodology  that assumes a linear relationship between some
     predictor variables  and a response variable. The goal is to
     estimate the coefficients  and to predict new data from the
     estimated linear relationship. The function 'plot.lm' provides
     four plots: a plot of residuals  against fitted values, a
     Scale-Location plot of sqrt{| residuals |}  against fitted values,
     a normal QQ plot, and a plot of Cook's  distances versus row
     labels. 
       '[stats:lm]' 
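
     A minimal base-R sketch of this case (the variable names are
     illustrative; 'regFit(..., use = "lm")' wraps such an 'lm' call):

     ```r
     ## Simulate a simple linear relationship and recover its coefficients.
     set.seed(42)
     x <- runif(100)
     y <- 2 + 3 * x + rnorm(100, sd = 0.1)

     fit <- lm(y ~ x)
     coef(fit)                  # intercept close to 2, slope close to 3

     ## The four diagnostic plots described above:
     ## par(mfrow = c(2, 2)); plot(fit)
     ```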

     *GLM - Generalized Linear Models:* 

      Generalized linear modelling extends the linear model in two
     directions. (i) with a monotonic differentiable link function
     describing how the  expected values are related to the linear
     predictor, and (ii) with  response variables having a probability
     distribution from an exponential  family.  
      '[stats:glm]' 
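
     Points (i) and (ii) can be sketched in base R with a binomial GLM
     and the canonical logit link (variable names are illustrative):

     ```r
     ## Logistic regression: binary response, logit link, binomial family.
     set.seed(1)
     x <- rnorm(200)
     p <- 1 / (1 + exp(-(1 + 2 * x)))      # true success probabilities
     y <- rbinom(200, size = 1, prob = p)

     fit <- glm(y ~ x, family = binomial(link = "logit"))
     coef(fit)              # roughly recovers intercept 1 and slope 2
     ```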

     *GAM - Generalized Additive Models:* 

        An additive model generalizes a linear model by smoothing
     each predictor term individually. A generalized additive model
     extends the additive model in the same spirit as the generalized
     linear model extends the linear model, namely by allowing a link
     function and non-normal distributions from the exponential
     family.  
      '[mgcv:gam]' 
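
     A short sketch with 'mgcv' (a recommended package, assumed to be
     installed); the smooth term 's(x)' replaces a linear term:

     ```r
     library(mgcv)                  # provides gam() and s() smooth terms

     set.seed(7)
     x <- runif(300)
     y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)

     fit <- gam(y ~ s(x))           # one smooth term, gaussian family
     summary(fit)$r.sq              # adjusted R-squared of the fit
     ```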

     *PPR - Projection Pursuit Regression:* 

        The basic method is given by Friedman (1984), and is
     essentially  the same code used by S-PLUS's 'ppreg'. It is
     observed that  this code is extremely sensitive to the compiler
     used. The algorithm first adds up to 'max.terms' ridge terms (by
     default 'ppr.nterms'), one at a time; it will use fewer if it is
     unable to find a term to add that makes a sufficient difference.
     The levels of  optimization (argument 'optlevel'), by default 2,
     differ in  how thoroughly the models are refitted during this
     process. At level 0 the existing ridge terms are not refitted.  At
     level 1 the projection directions are not refitted, but the ridge
     functions and the regression coefficients are. Levels 2 and 3
     refit  all the terms; level 3 is more careful to re-balance the
     contributions from each regressor at each step and so is a little
     less likely to converge to a saddle point of the sum of squares
     criterion. The  'plot' method plots Ridge functions for the
     projection pursuit  regression fit.  
      '[stats:ppr]' 
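
     A minimal 'stats::ppr' sketch (variable names illustrative);
     'nterms' and 'max.terms' correspond to the term-addition process
     described above:

     ```r
     ## Projection pursuit regression on a simple two-predictor surface.
     set.seed(3)
     x1 <- runif(200)
     x2 <- runif(200)
     y  <- (x1 + x2)^2 + rnorm(200, sd = 0.1)

     fit <- ppr(y ~ x1 + x2, nterms = 1, max.terms = 2)
     ## plot(fit) would draw the fitted ridge function(s)
     ```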

     *POLYMARS - Polychotomous MARS:* 

      The algorithm employed by 'polymars' is different from the 
     MARS(tm) algorithm of Friedman (1991), though it has many
     similarities.  Also the name 'polymars' has been used for this
     algorithm well  before MARS was trademarked.  
      Additional arguments which can be passed to the '"polymars"'
     estimator are: 
      'maxsize' - the maximum number of basis functions that the model
     is  allowed to grow to in the stepwise addition procedure. Default
     is  min(6*(n^{1/3}),n/4,100), where 'n' is the number of 
     observations.  
      'gcv' - parameter used to find the overall best model from a 
     sequence of fitted models. The residual sum of squares of a model 
     is penalized by dividing by the square of  '1-(gcv x model
     size)/cases'.    A larger gcv value would tend to produce a
     smaller model. 
       'additive' - Should the fitted model be additive in the
     predictors?  
       'startmodel' - the first model that is to be fit by 'polymars'. 
     It is either an object of the class 'polymars' or a model  dreamed
     up by the user. In that case, it takes the form of a  '4 x n'
     matrix, where 'n' is the  number of basis  functions in the
     starting model excluding the intercept. Each  row corresponds to
     one basis function (with two possible components).  Column 1 is
     the index of the first predictor involved. Column 2 is  a possible
     knot in this predictor. If column 2 is 'NA', the  first component
     is linear. Column 3 is the possible second predictor  involved (if
     column 3 is 'NA' the basis function only depends  on one
     predictor). Column 4 contains the possible knot for the  predictor
     in column 3, and it is 'NA' when this component is  linear. 
     Example: if a row reads '3 NA 2 4.7', the corresponding  basis
     function is [X_3 * (X_2-4.7)_+]; if a row reads  '2 4.3 NA NA' the
     corresponding basis function is [(X_2-4.3)_+]. A fifth column
     can be added with 1s and 0s; the 1s specify which basis functions
     of the startmodel must be in each model. Thus, these  functions
     stay in the model during the whole stepwise fitting  procedure. If
     'startmodel' is not specified 'polymars'  starts with a model that
     only contains  the intercept.  
      'weights' - optional vector of observation weights; if supplied, 
     the algorithm fits to minimize the sum of the weights multiplied 
     by the squared residuals. The length of weights must be the same 
     as the number of observations. The weights must be nonnegative.  
      'no.interact' - an optional matrix used if certain predictor 
     interactions are not allowed in the model. It is given as a 
     matrix of size '2 x m', with predictor indices as entries.   The
     two predictors of any row cannot have interaction terms with  each
     other.  
      'knots' - defines how the function is to find potential knots 
     for the spline basis functions.  This can be set to the maximum 
     number of knots you would like to be considered for each
     predictor.  Usually, to avoid the design matrix becoming singular
     the actual  number of knots produced is constrained to at most
     every third  order statistic in any predictor. This constraint can
     be adjusted  using the 'knot.space' argument. It can also be a
     vector with  the number of potential knots for each predictor.
     Again the actual number of knots produced is constrained to be at
     most every third order statistic in any predictor. A third
     possibility is to provide a matrix where each column corresponds
     to the ordered knots you would like to have considered  for that
     predictor.  This matrix should be filled out to a rectangular data
     structure  with NAs.  The default is 'min(20, round(n/4))' knots
     per predictor.  When specifying knots as a vector an entry of '-1'
     indicates  that the predictor is a categorical variable and each
     unique entry in its column is treated as a level. When
     specifying knots as a single number or a matrix and there are 
     categorical variables these are specified separately as such using
      the factor argument.  
      'knot.space' - is an integer describing the minimum number of 
     order statistics apart that two knots can be. Knots should not be
     too close together, to ensure numerical stability.  
      'ts.resp' - testset responses for model selection. Should have 
     the same number of columns as the training set response. A testset
      can be used for the model selection. Depending on the value of 
     classify, either the model with the smallest testset residual  sum
     of squares or the smallest testset classification error is 
     provided. Overrides 'gcv'.  
      'ts.pred' - testset predictors. Should have the same number of 
     columns as the training set predictors.  
      'ts.weights' - testset observation weights. A vector of length
     equal to the number  of cases of the testset. All weights must be
     non-negative.  
      'classify' - when the response is discrete (categorical),
     polymars  can be used for classification. In particular, when 
     'classify = TRUE', a discrete response with 'K' levels  is
     replaced by 'K' indicator variables as response. Model  selection
     is still being carried out using gcv, except when a  testset is
     provided, in which case testset misclassification is  used to
     select the best model.  
      'factors' - used to indicate that certain variables in the
     predictor  set are categorical variables. Specified as a vector
     containing the  appropriate predictor indices (column numbers of
     categorical  variables in predictors matrix). Factors can also be
     set when the  'knots' argument is given as a vector, with '-1' as 
      the appropriate entries for factors.  
      'tolerance' - for each possible candidate to be added/deleted 
     the resulting residual sums of squares of the model, with/without 
     this candidate, must be calculated. The inversion of the
     "X-transpose by X" matrix, X being the design matrix, is done by
     an updating procedure, cf. C.R. Rao - Linear Statistical
     Inference and Its Applications, 2nd edition, page 33. In the
     inversion the size of the bottom right-hand entry of this matrix
     is critical. If its value is near zero, or the value of its
     inverse is almost zero, then the inversion procedure becomes
     somewhat inaccurate. The lower the tolerance value, the more
     careful the procedure is in selecting candidates for addition to
     the model, but it may exclude too conservatively. On the other
     hand, if the tolerance is set too high, a spurious result with a
     singular or otherwise sub-optimal model may occur. By default
     tolerance is set to 1.0e-5.  
      'verbose' - when set  to 'TRUE', the function will print  out a
     line for each addition or deletion stage. For  example, " + 8 : 5
     3.25 2 NA" means adding interaction basis  function of predictor 5
     with knot at 3.25 and predictor 2 (linear),  to make a model of
     size 8, including intercept.  
      '[polspline:polymars]' 
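
     The 'startmodel' encoding above can be sketched in plain R; only
     the matrix construction is shown, since the 'polymars' call
     itself requires the 'polspline' package:

     ```r
     ## Encode the two example rows from the text as a startmodel matrix:
     ##   "3 NA 2 4.7"  -> basis function  X_3 * (X_2 - 4.7)_+
     ##   "2 4.3 NA NA" -> basis function  (X_2 - 4.3)_+
     startmodel <- rbind(c(3, NA,  2, 4.7),
                         c(2, 4.3, NA, NA))
     colnames(startmodel) <- c("pred1", "knot1", "pred2", "knot2")
     startmodel

     ## With polspline installed, this would be passed on as, e.g.:
     ## fit <- polspline::polymars(responses, predictors, startmodel = startmodel)
     ```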

     *NNET - Feedforward Neural Network Regression:* 

        If the response in 'formula' is a factor, an appropriate 
     classification network is constructed; this has one output and 
     entropy fit if the number of levels is two, and a number of 
     outputs equal to the number of classes and a softmax output  stage
     for more levels. If the response is not a factor, it is  passed on
     unchanged to 'nnet.default'. A quasi-Newton  optimizer is used,
     written in 'C'.  
      '[nnet:nnet]'
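
     A regression sketch (non-factor response) with the recommended
     'nnet' package; 'linout = TRUE' requests a linear output unit
     instead of the default logistic one:

     ```r
     library(nnet)                  # single-hidden-layer networks

     set.seed(5)
     x <- runif(200)
     y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

     ## size sets the number of hidden units; trace = FALSE mutes output.
     fit <- nnet(y ~ x, size = 4, linout = TRUE, trace = FALSE, maxit = 500)
     ```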

_V_a_l_u_e:

     *Function regFit:*  
      returns an S4 object of class '"fREG"', with the following 
     slots:

    call: the matched function call. 

    data: the input data in form of a data.frame. 

description: allows for a brief project description. 

     fit: the results as a list returned from the underlying regression
          model function, e.g.
           'fit$parameters' - the fitted model parameters,
           'fit$residuals' - the model residuals, 
           'fit$fitted.values' - the fitted values of the model,
           and many more. For details we refer to the help pages of the
          selected regression model.  

  method: the selected regression model naming the applied method. 

 formula: the formula expression describing the model. 

  family: the selected family and link name if available, otherwise a
          character vector of two empty strings. 

parameters: named parameters or coefficients of the fitted model. 

   title: a title string. 


     *Methods:* 

     The output from the 'print' method gives information at  least
     about the function call, the fitted model parameters, and the
     residuals variance. 

     The 'plot' method produces three figures: the first plots the
     series of residuals, the second shows a quantile-quantile plot of
     the residuals, and the third plots the fitted values against the
     residuals. Additional plots can be generated from the plot method
     (if available) of the underlying model, see the example below.  

     The 'summary' method provides additional information, like errors
     on the model parameters as far as available, and adds  additional
     information about the fit.   

     The 'predict' method forecasts from a fitted model. The returned
     values are the same as produced by the prediction function of the
     selected regression model. Especially, '$fit'  returns the
     forecast vector. 
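
     The '$fit' component can be illustrated with a plain 'lm' fit on
     the built-in 'cars' data, standing in for an 'fREG' object:

     ```r
     ## predict() with se.fit = TRUE returns a list whose $fit component
     ## is the forecast vector, as described above.
     fit  <- lm(dist ~ speed, data = cars)
     pred <- predict(fit, newdata = data.frame(speed = c(10, 20)), se.fit = TRUE)

     pred$fit        # two point forecasts
     pred$se.fit     # their standard errors
     ```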

     The 'residuals' and 'fitted.values' methods return the residuals
     and the fitted values as numeric vectors.

_N_o_t_e:

     The 'regFit' function offers an easy-to-use wrapper for several
     regression models. There is nothing really new in this package. 
     However, the benefit you get is that all regression models 
     share a common argument list: a formula describing the input 
     data, if required an argument to specify the family function, 
     and a string naming the type of the desired regression model. 
     In addition, the user can pass further arguments to the 
     underlying functions, which allows the modelling process to be 
     tailored. 

     The 'print', 'plot', 'summary', and 'predict' methods all
     produce output in the same format. This makes it very easy to
     compare and interpret the results obtained from different
     algorithms implemented in different functions. 

     For further information we refer to the original help pages of the
     functions  'lm',  'glm', 'gam', 'ppr', 'polymars', and 'nnet'.

_A_u_t_h_o_r(_s):

     The R core team for the 'lm' functions from R's 'base' package, 
      B.R. Ripley for the 'glm' functions from R's 'base' package, 
      S.N. Wood for the 'gam' functions from R's 'mgcv' package, 
      N.N. for the 'ppr' functions from R's 'modreg' package, 
      M. O' Connors for the 'polymars' functions from R's '?' package, 
      The R core team for the 'nnet' functions from R's 'nnet' package, 
      Diethelm Wuertz for the Rmetrics R-port.

_R_e_f_e_r_e_n_c_e_s:

     Belsley D.A., Kuh E., Welsch R.E. (1980); _Regression
     Diagnostics_; Wiley, New York.

     Dobson, A.J. (1990); _An Introduction to Generalized Linear
     Models_; Chapman and Hall, London.

     Draper N.R., Smith H. (1981); _Applied Regression Analysis_; 
     Wiley, New York.

     Friedman, J.H. (1991);  _Multivariate Adaptive Regression Splines
     (with discussion)_, The Annals of Statistics 19, 1-141.

     Friedman J.H., and Stuetzle W. (1981);  _Projection Pursuit
     Regression_;  Journal of the American Statistical Association 76,
     817-823.

     Friedman J.H. (1984); _SMART User's Guide_;  Laboratory for
     Computational Statistics,  Stanford University Technical Report
     No. 1.

     Green, Silverman (1994); _Nonparametric Regression and Generalized
     Linear Models_; Chapman and Hall.

     Gu, Wahba (1991);  _Minimizing GCV/GML Scores with Multiple
     Smoothing Parameters via the Newton Method_; SIAM J. Sci. Statist.
     Comput. 12, 383-398.

     Hastie T., Tibshirani R. (1990); _Generalized Additive Models_;
     Chapman and Hall, London.

     Kooperberg Ch., Bose S., and  Stone C.J. (1997); _Polychotomous
     Regression_, Journal of the American Statistical Association 92,
     117-127.

     McCullagh P., Nelder, J.A. (1989); _Generalized Linear Models_;
     Chapman and Hall, London.

     Myers R.H. (1986); _Classical and Modern Regression with
     Applications_;  Duxbury, Boston.

     Rousseeuw P.J., Leroy, A. (1987); _Robust Regression and Outlier
     Detection_; Wiley, New York.

     Seber G.A.F. (1977); _Linear Regression Analysis_;  Wiley, New
     York.

     Stone C.J., Hansen M., Kooperberg Ch., and Truong Y.K. (1997);
     _The Use of Polynomial Splines and Their Tensor Products in
     Extended Linear Modeling (with discussion)_; The Annals of
     Statistics 25, 1371-1470.

     Venables, W.N., Ripley, B.D. (1999); _Modern Applied Statistics
     with S-PLUS_;  Springer, New York.

     Wahba (1990);  _Spline Models of Observational Data_; SIAM.

     Weisberg S. (1985); _Applied Linear Regression_;   Wiley, New
     York.

     Wood (2000);  _Modelling and Smoothing Parameter Estimation  with
     Multiple  Quadratic Penalties_; JRSSB 62, 413-428.

     Wood (2001);  _mgcv: GAMs and Generalized Ridge Regression for R_.
     R News 1, 20-25.

     Wood (2001); _Thin Plate Regression Splines_.

     There exists a vast literature on regression; the references
     listed above are just a small sample of what is available. The
     book by Myers is an introductory text that covers much of the
     recent advances in regression technology. Seber's book is at a
     higher mathematical level and covers much of the classical
     theory of least squares.

_E_x_a_m_p_l_e_s:

     ## Not run: 
     ## regFit -
        data(recession) 
        
     ## myPlot -
        # Plot the recession indicator together with in-sample predictions.
        myPlot = function(recession, in.sample) {
          Date = recession[, "date"]
          # Convert numeric YYYYMM dates to fractional years:
          Date = trunc(Date/100) + (Date-100*trunc(Date/100))/12
          Recession = recession[, "recession"]
          inSample = as.vector(in.sample)
          plot(Date, Recession, type = "n", main = "US Recession")
          grid()
          lines(Date, Recession, type = "h", col = "steelblue")
          lines(Date, inSample) 
        }
        
     ## Generalized Additive Modelling:
        require(mgcv)
        par(mfrow = c(2, 2))
        fit = gregFit(formula = recession ~ s(tbills3m) + s(tbonds10y),
          family = gaussian(), data = recession, use = "gam")
        # In Sample Prediction:
        in.sample = predict(fit, newdata = recession)$fit  
        myPlot(recession, in.sample)
        # Summary:
        summary(fit)
        # Add plots from the original plot method:
        gam.fit = fit@fit
        class(gam.fit) = "gam"
        plot(gam.fit)
     ## End(Not run)

