colAUC                package:caTools                R Documentation

_C_o_l_u_m_n-_w_i_s_e _A_r_e_a _U_n_d_e_r _R_O_C _C_u_r_v_e (_A_U_C)

_D_e_s_c_r_i_p_t_i_o_n:

     Calculate Area Under the ROC Curve (AUC) for every column of a 
     matrix. Also, can be used to plot the ROC curves.

_U_s_a_g_e:

       auc = colAUC(X, y, plotROC=FALSE, alg=c("Wilcoxon","ROC"))

_A_r_g_u_m_e_n_t_s:

       X: A matrix or data frame. Rows contain samples  and columns
          contain features/variables.

       y: Class labels for the 'X' data samples.  A response vector
          with one label for each row/component of 'X'. Can be either a
          factor, string or a numeric vector.

 plotROC: Plot ROC curves. Use only for small number of features.  If
          'TRUE', will set 'alg' to "ROC".

     alg: Algorithm to use: "ROC" integrates ROC curves, while
          "Wilcoxon" uses Wilcoxon Rank Sum Test to get the same
          results. Default "Wilcoxon" is faster. This argument is
          mostly provided for verification.

_D_e_t_a_i_l_s:

     AUC is a very useful measure of similarity between two classes
     measuring area under "Receiver Operating Characteristic" or ROC
     curve. In case of data with no ties all sections of ROC curve are
     either horizontal or vertical, in case of data with ties diagonal 
     sections can also occur. Area under the ROC curve is calculated
     using  'trapz' function. AUC is always in between 0.5  (two
     classes are statistically identical) and 1.0 (there is a threshold
     value that can achieve a perfect separation between the classes).

     Area under ROC Curve (AUC) measure is very similar to Wilcoxon
     Rank Sum Test  (see 'wilcox.test') and Mann-Whitney U Test. 

     There are numerous other functions for calculating AUC in other
     packages.  Unfortunately none of them had all the properties that
     were needed for  classification preprocessing, to lower the
     dimensionality of the data (from  tens of thousands to hundreds)
     before applying standard classification  algorithms. 

     The main properties of this code are: 

        *  Ability to work with multi-dimensional data ('X' can have
           many  columns).

        *  Ability to work with multi-class datasets ('y' can have more
            than 2 different values).

        *  Speed - this code was written to calculate AUC's of large
           number of  features, fast.

        *  Returned AUC is always bigger than 0.5, which is equivalent
           of  testing for each feature 'colAUC(x,y)' and
           'colAUC(-x,y)' and returning the value of the bigger one.

     If those properties do not fit your problem, see "See Also" and
     "Examples"  sections for AUC  functions in other packages that
     might be a better fit for your needs.

_V_a_l_u_e:

     An output is a single matrix with the same number of columns as
     'X' and  "n choose 2" ( frac{n!}{(n-2)! 2!} = n(n-1)/2 {n!/((n-2)!
     2!) = n(n-1)/2} ) number of rows,  where n is number of unique
     labels in 'y' list. For example, if 'y'  contains only two unique
     class labels ( 'length(unique(lab))==2' ) than output  matrix will
      have a single row containing AUC of each column. If more than 
     two unique labels are present than AUC is calculated for every
     possible  pairing of classes ("n choose 2" of them).

     For multi-class AUC "Total AUC" as defined by Hand & Till (2001)
     can be  calculated by 'colMeans(auc)'.

_A_u_t_h_o_r(_s):

     Jarek Tuszynski (SAIC) jaroslaw.w.tuszynski@saic.com

_R_e_f_e_r_e_n_c_e_s:


        *  Mason, S.J. and Graham, N.E. (1982) _Areas beneath the
           relative  operating characteristics (ROC) and relative
           operating levels (ROL)  curves: Statistical significance and
           interpretation_,  Q. J. R.  Meteorol. Soc. textbf{30}
           291-303. 

        *  Fawcett, Tom (2004) _ROC Graphs: Notes and Practical 
           Considerations for Researchers_,  <URL:
           http://home.comcast.net/~tom.fawcett/public_html/papers/ROC101.pdf>
           and <URL:
           http://home.comcast.net/~tom.fawcett/public_html/ROCCH/>

        *  Hand, David and Till, Robert (2001) _A Simple 
           Generalization of the Area Under the ROC Curve for Multiple
           Class  Classification Problems_;  Machine Learning 45(2):
           171-186

        *  See  <URL:
           http://www.medicine.mcgill.ca/epidemiology/hanley/software/>
            to find articles below: 

           *  Hanley and McNeil (1982), _The Meaning and Use of the
              Area  under a Receiver Operating Characteristic (ROC)
              Curve_,  Radiology 143: 29-36.

           *  Hanley and McNeil (1983), _A Method of Comparing the
              Areas   under ROC curves derived from same cases_,
              Radiology 148: 839-843.

           *  McNeil and Hanley (1984), _Statistical Approaches to the 
               Analysis of ROC curves_, Medical Decision Making 4(2):
              136-149.

        *  See <URL:
           http://rocr.bioinf.mpi-sb.mpg.de/evaluation_literature.html>
           for bibliography of 'ROCR' package. 

_S_e_e _A_l_s_o:


        *  'wilcox.test' and 'pwilcox'

        *  'wilcox.exact' from 'exactRankTests' package

        *  'wilcox_test' from 'coin' package

        *  'AUC' from 'ROC' package

        *  'performance' from 'ROCR' package

        *  'auROC' from 'limma' package 

        *  'ROC' from 'Epi' package

        *  'roc.area' from 'verification' package

        *  'rcorr.cens' from 'Hmisc' package

_E_x_a_m_p_l_e_s:

     # Load MASS library with "cats" data set that have following columns: sex, body
     # weight, hart weight. Calculate how good weights are in predicting sex of cats.
     # 2 classes; 2 features; 144 samples
     library(MASS); data(cats);
     colAUC(cats[,2:3], cats[,1], plotROC=TRUE) 

     # Load rpart library with "kyphosis" data set that records if kyphosis
     # deformation was present after corrective surgery. Calculate how good age, 
     # number and position of vertebrae are in predicting successful operation. 
     # 2 classes; 3 features; 81 samples
     library(rpart); data(kyphosis);
     colAUC(kyphosis[,2:4], kyphosis[,1], plotROC=TRUE)

     # Example of 3-class 4-feature 150-sample iris data
     data(iris)
     colAUC(iris[,-5], iris[,5], plotROC=TRUE)
     cat("Total AUC: \n"); 
     colMeans(colAUC(iris[,-5], iris[,5]))

     # Test plots in case of data without column names
     Iris = as.matrix(iris[,-5])
     dim(Iris) = c(600,1)
     dim(Iris) = c(150,4)
     colAUC(Iris, iris[,5], plotROC=TRUE)

     # Compare calAUC with other functions designed for similar purpose
     auc = matrix(NA,12,3)
     rownames(auc) = c("colAUC(alg='ROC')", "colAUC(alg='Wilcox')", "sum(rank)",
         "wilcox.test", "wilcox_test", "wilcox.exact", "roc.area", "AUC", 
         "performance", "ROC", "auROC", "rcorr.cens")
     colnames(auc) = c("AUC(x)", "AUC(-x)", "AUC(x+noise)")
     X = cbind(cats[,2], -cats[,2], cats[,2]+rnorm(nrow(cats)) )
     y = ifelse(cats[,1]=='F',0,1)
     for (i in 1:3) {
       x = X[,i]
       x1 = x[y==1]; n1 = length(x1);                 # prepare input data ...
       x2 = x[y==0]; n2 = length(x2);                 # ... into required format
       data = data.frame(x=x,y=factor(y))
       auc[1,i] = colAUC(x, y, alg="ROC") 
       auc[2,i] = colAUC(x, y, alg="Wilcox")
       r = rank(c(x1,x2))
       auc[3,i] = (sum(r[1:n1]) - n1*(n1+1)/2) / (n1*n2)
       auc[4,i] = wilcox.test(x1, x2, exact=0)$statistic / (n1*n2) 
       ## Not run: 
       if (require("coin"))
         auc[5,i] = statistic(wilcox_test(x~y, data=data)) / (n1*n2) 
       if (require("exactRankTests"))  
         auc[6,i] = wilcox.exact(x, y, exact=0)$statistic / (n1*n2) 
       if (require("verification"))
         auc[7,i] = roc.area(y, x)$A.tilda 
       if (require("ROC")) 
         auc[8,i] = AUC(rocdemo.sca(y, x, dxrule.sca))    
       if (require("ROCR")) 
         auc[9,i] = performance(prediction( x, y),"auc")@y.values[[1]]
       if (require("Epi"))   auc[10,i] = ROC(x,y,grid=0)$AUC
       if (require("limma")) auc[11,i] = auROC(y, x)
       if (require("Hmisc")) auc[12,i] = rcorr.cens(x, y)[1]
       
     ## End(Not run)
     }
     print(auc)
     stopifnot(auc[1, ]==auc[2, ])   # results of 2 alg's in colAUC must be the same
     stopifnot(auc[1,1]==auc[3,1])   # compare with wilcox.test results

     # time trials
     x = matrix(runif(100*1000),100,1000)
     y = (runif(100)>0.5)
     system.time(colAUC(x,y,alg="ROC"   ))
     system.time(colAUC(x,y,alg="Wilcox"))

