cqo                   package:VGAM                   R Documentation

_F_i_t_t_i_n_g _C_o_n_s_t_r_a_i_n_e_d _Q_u_a_d_r_a_t_i_c _O_r_d_i_n_a_t_i_o_n (_C_Q_O)

_D_e_s_c_r_i_p_t_i_o_n:

     A _constrained quadratic ordination_ (CQO; formerly called
     _canonical Gaussian ordination_ or CGO) model is fitted using the
     _quadratic reduced-rank vector generalized linear model_
     (QRR-VGLM) framework.

_U_s_a_g_e:

     cqo(formula, family, data = list(), weights = NULL, subset = NULL,
         na.action = na.fail, etastart = NULL, mustart = NULL,
         coefstart = NULL, control = qrrvglm.control(...), offset = NULL,
         method = "cqo.fit", model = FALSE, x.arg = TRUE, y.arg = TRUE,
         contrasts = NULL, constraints = NULL, extra = NULL,
         smart = TRUE, ...)

_A_r_g_u_m_e_n_t_s:

     In this documentation, M is the number of linear predictors, S is
     the number of responses (species). Then M=S for Poisson and
     binomial species data, and M=2S for negative binomial and gamma
     distributed species data.

 formula: a symbolic description of the model to be fit. The RHS of the
          formula is applied to each linear predictor. Different
          variables in each linear predictor can be chosen by
          specifying constraint matrices. 

  family: a function of class '"vglmff"' describing what statistical
          model is to be fitted.  Currently the following families are
          supported: 'poissonff', 'binomialff' ('logit' and 'cloglog'
          links available), 'negbinomial', 'gamma2', 'gaussianff'.
          Sometimes special arguments are required for 'cqo()', e.g.,
          'binomialff(mv=TRUE)'. Also, 'quasipoissonff' and
          'quasibinomialff' may or may not work.

    data: an optional data frame containing the variables in the model.
          By default the variables are taken from
          'environment(formula)', typically the environment from which
          'cqo' is called.

 weights: an optional vector or matrix of (prior) weights  to be used
          in the fitting process. Currently, this argument should not
          be used.

  subset: an optional logical vector specifying a subset of
          observations to be used in the fitting process.

na.action: a function which indicates what should happen when the data
          contain 'NA's.  The default is set by the 'na.action' setting
          of 'options', and is 'na.fail' if that is unset. The
          ``factory-fresh'' default is 'na.omit'.

etastart: starting values for the linear predictors. It is a M-column
          matrix. If M=1 then it may be a vector. Currently, this
          argument probably should not be used.

 mustart: starting values for the  fitted values. It can be a vector or
          a matrix.  Some family functions do not make use of this
          argument. Currently, this argument probably should not be
          used.

coefstart: starting values for the coefficient vector. Currently, this
          argument probably should not be used.

 control: a list of parameters for controlling the fitting process. See
          'qrrvglm.control' for details.

  offset: This argument must not be used.

  method: the method to be used in fitting the model. The default (and
          presently only) method 'cqo.fit' uses _iteratively reweighted
          least squares_ (IRLS).

   model: a logical value indicating whether the _model frame_ should
          be assigned in the 'model' slot.

x.arg, y.arg: logical values indicating whether the model matrix and
          response matrix used in the fitting process should be
          assigned in the 'x' and 'y' slots. Note the model matrix is
          the LM model matrix.

contrasts: an optional list. See the 'contrasts.arg' of
          'model.matrix.default'.

constraints: an optional list  of constraint matrices. The components
          of the list must be named with the term it corresponds to
          (and it must match in character format).  Each constraint
          matrix must have M rows, and be of full-column rank. By
          default, constraint matrices are the M by M identity matrix
          unless arguments in the family function itself override these
          values.  If 'constraints' is used it must contain _all_ the
          terms; an incomplete list is not accepted. Constraint
          matrices for x_2 variables are taken as the identity matrix.

   extra: an optional list with any extra information that might be
          needed by the family function.

   smart: logical value indicating whether smart prediction
          ('smartpred') will be used.

     ...: further arguments passed into 'qrrvglm.control'.

_D_e_t_a_i_l_s:

     QRR-VGLMs or _constrained quadratic ordination_ (CQO) models are
     estimated here by maximum likelihood estimation. Optimal linear
     combinations of the environmental variables are computed, called
     _latent variables_ (these appear as 'lv' for R=1 else 'lv1',
     'lv2', etc. in the output).  Here, R is the _rank_ or the number
     of ordination axes.  Each species' response is then a regression
     of these latent variables using quadratic polynomials on a
     transformed scale (e.g., log for Poisson counts, logit for
     presence/absence responses).  The solution is obtained iteratively
     in order to maximize the log-likelihood function, or equivalently,
     minimize the deviance.

     The central formula (for Poisson and binomial species data) is
     given by

        eta = B_1^T x_1 + A nu + sum_{m=1}^M (nu^T D_m nu) e_m

     where x_1 is a vector (usually just a 1 for an intercept), x_2 is
     a vector of environmental variables, nu=C^T x_2 is a R-vector of
     latent variables, e_m is a vector of 0s but with a 1 in the mth
     position. The eta are a vector of linear/additive predictors,
     e.g., the mth element is eta_m = log(E[Y_m]) for the mth species. 
     The matrices B_1, A, C and D_m are estimated from the data, i.e.,
     contain the regression coefficients.  The tolerance matrices
     satisfy T_s = -(0.5 D_s^(-1). Many important CQO details are
     directly related to arguments in 'qrrvglm.control', e.g., the
     argument 'Norrr' specifies which variables comprise x_1.

     Theoretically, the four most popular 'VGAM' family functions to be
     used with 'cqo' correspond to the Poisson, binomial, normal, and
     negative binomial distributions. The latter is a 2-parameter
     model. All of these are implemented, as well as the 2-parameter
     gamma.  The Poisson is or should be catered for by
     'quasipoissonff' and 'poissonff', and the binomial by
     'quasibinomialff' and 'binomialff'. Those beginning with '"quasi"'
     have dispersion parameters that are estimated for each species.

     For initial values, the function '.Init.Poisson.QO' should work
     reasonably well if the data is Poisson with species having equal
     tolerances.  It can be quite good on binary data too.  Otherwise
     the 'Cinit' argument in 'qrrvglm.control' can be used.

     It is possible to relax the quadratic form to an additive model. 
     The result is a data-driven approach rather than a model-driven
     approach, so that CQO is extended to _constrained additive
     ordination_ (CAO) when R=1.  See 'cao' for more details.

_V_a_l_u_e:

     An object of class '"qrrvglm"'.  Note that the slot 'misc' has a
     list component called 'deviance.Bestof' which gives the history of
     deviances over all the iterations.

_W_a_r_n_i_n_g:

     Local solutions are not uncommon when fitting CQO models.  To
     increase the chances of obtaining the global solution, increase
     the value of the argument 'Bestof' in 'qrrvglm.control'. For
     reproducibility of the results, it pays to set a different random
     number seed before calling 'cqo' (the function 'set.seed' does
     this).  The function 'cqo' chooses initial values for *C* using
     '.Init.Poisson.QO()' if 'Use.Init.Poisson.QO=TRUE', else random
     numbers.

     Unless 'ITolerances=TRUE' or 'EqualTolerances=FALSE', CQO is
     computationally expensive. It pays to keep the rank down to 1 or
     2.  If 'EqualTolerances=TRUE' and 'ITolerances=FALSE' then the
     cost grows quickly with the number of species and sites (in terms
     of memory requirements and time).  The data needs to conform quite
     closely to the statistical model, and the environmental range of
     the data should be wide in order for the quadratics to fit the
     data well (bell-shaped response surfaces).  If not, RR-VGLMs will
     be more appropriate because the response is linear on the
     transformed scale (e.g., log or logit) and the ordination is
     called _constrained linear ordination_ or CLO.

     Like many regression models, CQO is sensitive to outliers (in the
     environmental and species data), sparse data, high leverage
     points, multicollinearity etc.  For these reasons, it is necessary
     to examine the data carefully for these features and take
     corrective action (e.g., omitting certain species, sites,
     environmental variables from the analysis, transforming certain
     environmental variables, etc.). Any optimum lying outside the
     convex hull of the site scores should not be trusted.  Fitting a
     CAO is recommended first, then upon transformations etc., possibly
     a CQO can be fitted.

     For binary data, it is necessary to have `enough' data.  In
     general, the number of sites n ought to be much larger than the
     number of species _S_, e.g., at least 100 sites for two species.
     Compared to count (Poisson) data, numerical problems occur more
     frequently with presence/absence (binary) data.  For example, if
     'Rank=1' and if the response data for each species is a string of
     all absences, then all presences, then all absences (when
     enumerated along the latent variable) then infinite parameter
     estimates will occur.  In general, setting 'ITolerances=TRUE' may
     help.

     This function was formerly called 'cgo'. It has been renamed to
     reinforce a new nomenclature described in Yee (2006).

_N_o_t_e:

     By default, a rank-1 equal-tolerances QRR-VGLM model is fitted
     (see 'qrrvglm.control' for the default control parameters). The
     latent variables are always transformed so that they are
     uncorrelated. By default, the argument 'trace' is 'TRUE' meaning a
     running log is printed out while the computations are taking
     place.  This is because the algorithm is computationally
     expensive, therefore users might think that their computers have
     frozen if 'trace=FALSE'!

     The argument 'Bestof' in 'qrrvglm.control' controls the number of
     models fitted (each uses different starting values) to the data.
     This argument is important because convergence may be to a _local_
     solution rather than the _global_ solution. Using more starting
     values increases the chances of finding the global solution. 
     Always plot an ordination diagram (use the generic function
     'lvplot') and see if it looks sensible.  Local solutions arise
     because the optimization problem is highly nonlinear, and this is
     particularly true for CAO.

     Many of the arguments applicable to 'cqo' are common to 'vglm' and
     'rrvglm.control'. The most important arguments are 'Rank',
     'Norrr', 'Bestof',  'ITolerances', 'EqualTolerances', 'isdlv', and
     'MUXfactor'.

     When fitting a 2-parameter model such as the negative binomial or
     gamma, it pays to set 'EqualTolerances=TRUE' and
     'ITolerances=FALSE'. This is because numerical problems can occur
     when fitting the model far away from the global solution when
     'ITolerances=TRUE'. Setting the two arguments as described will
     slow down the computation considerably, however it is numerically
     more stable.

     In Example 1 below, an unequal-tolerances rank-1 QRR-VGLM is
     fitted to the hunting spiders dataset. In Example 2 below, an
     equal-tolerances rank-2 QRR-VGLM is fitted to the hunting spiders
     dataset. The numerical difficulties encountered in fitting the
     rank-2 model suggests a rank-1 model is probably preferable. In
     Example 3 below, constrained binary quadratic ordination (in old
     nomenclature, constrained Gaussian logit ordination) is fitted to
     some simulated data coming from a species packing model. With
     multivariate binary responses, one must use 'mv=TRUE' to indicate
     that the response (matrix) is multivariate. Otherwise, it is
     interpreted as a single binary response variable.

     Sometime in the future, this function might handle input of the
     form 'cqo(x, y)', where 'x' and 'y' are matrices containing the
     environmental and species data respectively.

_A_u_t_h_o_r(_s):

     Thomas W. Yee

_R_e_f_e_r_e_n_c_e_s:

     Yee, T. W. (2004) A new technique for maximum-likelihood canonical
     Gaussian ordination. _Ecological Monographs_, *74*, 685-701.

     ter Braak, C. J. F. and Prentice, I. C. (1988) A theory of
     gradient analysis. _Advances in Ecological Research_, *18*,
     271-317.

     Yee, T. W. (2006) Constrained additive ordination. _Ecology_,
     *87*, 203-213.

_S_e_e _A_l_s_o:

     'qrrvglm.control', 'Coef.qrrvglm', 'rcqo', 'cao', 'uqo', 'rrvglm',
     'poissonff', 'binomialff', 'negbinomial', 'gamma2',
     'lvplot.qrrvglm', 'persp.qrrvglm', 'trplot.qrrvglm', 'vglm',
     'set.seed', 'hspider'.

     Documentation accompanying the 'VGAM' package at <URL:
     http://www.stat.auckland.ac.nz/~yee> contains further information
     and examples.

_E_x_a_m_p_l_e_s:

     # Example 1; Fit an unequal tolerances model to the hunting spiders data
     data(hspider)
     hspider[,1:6]=scale(hspider[,1:6]) # Standardize the environmental variables
     p1 = cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi, Auloalbi, 
                    Pardlugu, Pardmont, Pardnigr, Pardpull, Trocterr, Zoraspin) ~
              WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
              fam=poissonff, data=hspider, Crow1positive=FALSE, ITol=FALSE)
     sort(p1@misc$deviance.Bestof) # A history of all the iterations
     if(deviance(p1) > 1177) stop("suboptimal fit obtained")

     ## Not run: 
     S = ncol(p1@y) # Number of species
     clr = (1:(S+1))[-7] # omits yellow
     lvplot(p1, y=TRUE, lcol=clr, pch=1:S, pcol=clr, las=1) # ordination diagram
     legend("topright", leg=dimnames(p1@y)[[2]], col=clr,
            pch=1:S, merge=TRUE, bty="n", lty=1:S, lwd=2)
     ## End(Not run)
     (cp = Coef(p1))

     (a = cp@lv[cp@lvOrder])  # The ordered site scores along the gradient
     # Names of the ordered sites along the gradient:
     rownames(cp@lv)[cp@lvOrder]
     (a = (cp@Optimum)[,cp@OptimumOrder]) # The ordered optima along the gradient
     a = a[!is.na(a)] # Delete the species that is not unimodal
     names(a)         # Names of the ordered optima along the gradient

     ## Not run: 
     trplot(p1, whichSpecies=1:3, log="xy", type="b", lty=1, lwd=2,
            col=c("blue","red","green"), label=TRUE) -> ii # trajectory plot
     legend(0.00005, 0.3, paste(ii$species[,1], ii$species[,2], sep=" and "),
            lwd=2, lty=1, col=c("blue","red","green"))
     abline(a=0, b=1, lty="dashed")

     S = ncol(p1@y) # Number of species
     clr = (1:(S+1))[-7] # omits yellow
     persp(p1, col=clr, label=TRUE, las=1) # perspective plot
     ## End(Not run)

     # Example 2: A rank-2 equal tolerances CQO model with Poisson data
     # This example is numerically fraught.
     set.seed(555)
     p2 = cqo(cbind(Alopacce, Alopcune, Alopfabr, Arctlute, Arctperi, Auloalbi,
                    Pardlugu, Pardmont, Pardnigr, Pardpull, Trocterr, Zoraspin) ~
              WaterCon + BareSand + FallTwig + CoveMoss + CoveHerb + ReflLux,
              fam=poissonff, data=hspider, Crow1positive=FALSE,
     #        ITol=FALSE, EqualTol=TRUE,
              Rank=2, Bestof=1, isdlv=c(2.1,0.9))
     sort(p2@misc$deviance.Bestof) # A history of all the iterations
     if(deviance(p2) > 1127) stop("suboptimal fit obtained")
     ## Not run: 
     lvplot(p2, ellips=FALSE, label=TRUE, xlim=c(-3,4),
            C=TRUE, Ccol="brown", sites=TRUE, scol="grey", 
            pcol="blue", pch="+", chull=TRUE, ccol="grey")
     ## End(Not run)

     # Example 3: species packing model with presence/absence data
     n = 200; p = 5; S = 5
     mydata = rcqo(n, p, S, fam="binomial", hiabundance=4,
                   EqualTol=TRUE, ESOpt=TRUE, EqualMax=TRUE)
     myform = attr(mydata, "formula")
     b1 = cqo(myform, fam=binomialff(mv=TRUE, link="cloglog"), data=mydata)
     sort(b1@misc$deviance.Bestof) # A history of all the iterations
     ## Not run: 
     lvplot(b1, y=TRUE, lcol=1:S, pch=1:S, pcol=1:S, las=1)
     ## End(Not run)
     Coef(b1)

     # Compare the fitted model with the 'truth'
     cbind(truth=attr(mydata, "ccoefficients"), fitted=ccoef(b1))

