6  Predictor with Maximum Effect Size

The examples in Chapter 6 require that the following packages are attached to the search path. See the explanation of the function-name conflict in Section 7.3.5.

library(survival)
library(maxEff)
# Loading required package: groupedHyperframe
# Registered S3 method overwritten by 'pROC':
#   method   from            
#   plot.roc spatstat.explore
The search path on the author’s computer running RStudio (Posit Team 2025):
search()
#  [1] ".GlobalEnv"                "package:maxEff"            "package:groupedHyperframe" "package:survival"         
#  [5] "package:stats"             "package:graphics"          "package:grDevices"         "package:utils"            
#  [9] "package:datasets"          "package:methods"           "Autoloads"                 "package:base"

🚧 The prose of Chapter 6 is being rewritten. All code here is correct, though. Expected delivery by 2025-12-31.

Chapter 6 documents methods for selecting, from many candidate numeric predictors, the one to add to a regression model such that the additional predictor has the maximum effect size.

6.1 Compute Aggregated Quantiles

See Section 5.1.

Ki67q = groupedHyperframe::Ki67 |>
  within.data.frame(expr = {
    x = y = NULL # remove x- and y-coords for non-spatial application
  }) |>
  as.groupedHyperframe(group = ~ patientID/tissueID) |>
  quantile(probs = seq.int(from = .01, to = .99, by = .01)) |>
  aggregate(by = ~ patientID)
A hyperframe Ki67q with aggregated quantiles
Ki67q |>
  head()
# Hyperframe:
#   Tstage  PFS adj_rad adj_chemo histology  Her2   HR  node  race age patientID logKi67.quantile
# 1      2 100+   FALSE     FALSE         3  TRUE TRUE  TRUE White  66   PT00037        (numeric)
# 2      1   22   FALSE     FALSE         3 FALSE TRUE FALSE Black  42   PT00039        (numeric)
# 3      1  99+   FALSE        NA         3 FALSE TRUE FALSE White  60   PT00040        (numeric)
# 4      1  99+   FALSE      TRUE         3 FALSE TRUE  TRUE White  53   PT00042        (numeric)
# 5      1  112    TRUE      TRUE         3 FALSE TRUE  TRUE White  52   PT00054        (numeric)
# 6      4   12    TRUE     FALSE         2  TRUE TRUE  TRUE Black  51   PT00059        (numeric)
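
The aggregation can be illustrated with a plain base-R sketch on toy data (the variable names below are hypothetical, and this is not the groupedHyperframe implementation): quantiles are computed within each tissueID, then averaged within each patientID.

```r
# toy base-R sketch of quantile aggregation (not the groupedHyperframe code):
# quantiles per tissueID, then averaged per patientID
set.seed(1)
d <- data.frame(
  patientID = rep(c("PT1", "PT2"), each = 6L),
  tissueID  = rep(1:4, each = 3L),
  logKi67   = rnorm(12L)
)
probs <- c(.25, .50, .75)
q <- lapply(split(d$logKi67, d$tissueID), quantile, probs = probs) # tissue-level quantiles
tissue2patient <- tapply(d$patientID, d$tissueID, \(x) x[1L])      # tissue -> patient lookup
agg <- lapply(split(q, tissue2patient), \(x) Reduce(`+`, x) / length(x)) # patient-level means
```

Each element of agg is one patient's aggregated quantile vector, mirroring the (numeric) hypercolumn slices shown above.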

6.2 Data Partition

Partition into a training (80%) and test (20%) set.

set.seed(32); id = sample.int(n = nrow(Ki67q), size = nrow(Ki67q)*.8)
s0 = Ki67q[id, , drop = FALSE] # training set
s1 = Ki67q[-id, , drop = FALSE] # test set
dim(s0)
# [1] 497  12
dim(s1)
# [1] 125  12

6.3 Starting Model

Let’s consider a starting model of the endpoint PFS with the predictor Tstage on the training set s0. As of package spatstat.geom v3.6.0.3, the S3 method dispatch spatstat.geom::as.data.frame.hyperframe() warns on hypercolumns that cannot be converted to a data.frame. Therefore, users should call suppressWarnings(), or set #| warning: false in an R markdown code chunk.
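
A generic sketch of that workaround (the warning-raising converter below is a hypothetical stand-in, not the spatstat.geom code):

```r
# hypothetical stand-in for a converter that warns on non-convertible columns
as_df_noisy <- function(x) {
  warning("some hypercolumns cannot be converted")
  data.frame(x = x)
}
df <- suppressWarnings(as_df_noisy(1:3)) # warning is muffled; the value is kept
```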

m0 = coxph(PFS ~ Tstage, data = s0)
Starting coxph model m0
summary(m0)
# Call:
# coxph(formula = PFS ~ Tstage, data = s0)
# 
#   n= 497, number of events= 91 
# 
#          coef exp(coef) se(coef)    z Pr(>|z|)    
# Tstage 0.6725    1.9591   0.1061 6.34 2.29e-10 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
#        exp(coef) exp(-coef) lower .95 upper .95
# Tstage     1.959     0.5104     1.591     2.412
# 
# Concordance= 0.654  (se = 0.028 )
# Likelihood ratio test= 32.23  on 1 df,   p=1e-08
# Wald test            = 40.2  on 1 df,   p=2e-10
# Score (logrank) test = 43.73  on 1 df,   p=4e-11

6.4 Adding numeric Predictor

a0 = m0 |>
  add_numeric(x = ~ logKi67.quantile) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of three steps.

First, function add_numeric()

  • considers each “slice” of the numeric-hypercolumn logKi67.quantile as an additional numeric predictor. Users are encouraged to read the package groupedHyperframe vignette, Appendix Section anylist, for more details;
  • updates the starting model m0 with each one of the additional numeric predictors, respectively;
  • returns an 'add_numeric' object, which inherits from the class 'add_'. This is a list of objects with internal class 'add_numeric_' (Section 33.1).

Second, the S3 method dispatch sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional numeric predictors.

Lastly, the S3 method dispatch utils:::head.default() chooses the top n elements from the input.
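
In essence, the pipeline fits one updated model per candidate predictor, ranks the candidates by the absolute value of their regression coefficient, and keeps the top few. A minimal base-R sketch of this idea on simulated data, with lm() standing in for coxph() (all names here are hypothetical):

```r
# base-R sketch: add each candidate, rank by |coefficient|, keep the top two
set.seed(42)
n <- 50L
d <- data.frame(y = rnorm(n), Tstage = sample(1:4, size = n, replace = TRUE))
X <- matrix(rnorm(n * 5L), nrow = n, dimnames = list(NULL, paste0("q", 1:5)))
effsize <- apply(X, MARGIN = 2L, FUN = function(x.) {
  m <- lm(y ~ Tstage + x., data = data.frame(d, x. = x.)) # update the starting model
  unname(coef(m)["x."])                                   # coefficient of the candidate
})
top2 <- names(sort(abs(effsize), decreasing = TRUE))[1:2] # maximum |effect size| first
```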

The S3 method dispatch print.add_numeric() displays the calls to the selected numeric predictors.

a0
# logKi67.quantile["6%"]
# logKi67.quantile["5%"]

The S3 method dispatch predict.add_numeric() applies the starting model structure in m0, as well as the additional numeric predictors selected by a0, to the test set s1. The author categorizes this functionality under the S3 generic stats::predict(), instead of stats::update(), to emphasize that this “update” is on a new data set (newdata), even though the workhorse function is the S3 method dispatch stats:::update.default().
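
The distinction can be sketched generically with lm() on hypothetical toy data: the model structure is kept, but the fit is re-estimated on the new data.

```r
# sketch: "predict" here means refitting the same model structure on new data
set.seed(5)
train <- data.frame(x = rnorm(40L)); train$y <- train$x + rnorm(40L)
test  <- data.frame(x = rnorm(10L)); test$y  <- test$x + rnorm(10L)
m <- lm(y ~ x, data = train)
m_new <- update(m, data = test) # stats:::update.default(): same formula, new data
```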

a1 = a0 |> 
  predict(newdata = s1)
Predicted models a1
a1
# logKi67.quantile["6%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z      p
# Tstage 0.5900    1.8041   0.1959 3.012 0.0026
# x.     0.6071    1.8352   0.8687 0.699 0.4846
# 
# Likelihood ratio test=9.23  on 2 df, p=0.009926
# n= 125, number of events= 27 
# 
# logKi67.quantile["5%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z       p
# Tstage 0.5886    1.8015   0.1961 3.002 0.00269
# x.     0.6276    1.8732   0.8746 0.718 0.47299
# 
# Likelihood ratio test=9.25  on 2 df, p=0.009794
# n= 125, number of events= 27

6.5 Adding logical Predictor

6.5.1 Naive Practice

b0 = m0 |>
  add_dummy(x = ~ logKi67.quantile) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of four steps.

First, function add_dummy()

  • dichotomizes (Chapter 21) each “slice” of the numeric-hypercolumn logKi67.quantile into a logical variable;
  • removes the duplicated logical variables;
  • updates the starting model m0 with each one of the additional logical predictors, respectively;
  • returns an 'add_dummy' object, which inherits from the class 'add_'. This is a listof objects with internal class 'add_dummy_' (Section 33.2).

Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition, i.e., keeps only the additional logical predictors whose proportion of TRUE values p1 falls between .05 and .95.

Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional logical predictors.

Lastly, the S3 method dispatch utils:::head.default() chooses the top n elements from the input.
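
A base-R sketch of the dichotomize-filter-rank idea on toy data (lm() stands in for coxph(), and the candidate cutpoints are simply the observed values; all names are hypothetical):

```r
# base-R sketch: dichotomize, deduplicate, balance-filter, then rank by |coef|
set.seed(7)
n <- 30L
x <- rnorm(n); y <- x + rnorm(n)
dummies <- lapply(sort(unique(x)), function(cut) x >= cut) # one dummy per cutpoint
dummies <- dummies[!duplicated(dummies)]      # remove duplicated logical variables
p1 <- vapply(dummies, mean, FUN.VALUE = 0)    # proportion of TRUE values
keep <- dummies[p1 > .05 & p1 < .95]          # balance of the partition
eff <- vapply(keep, function(z) unname(coef(lm(y ~ z))[2L]), FUN.VALUE = 0)
best <- keep[order(abs(eff), decreasing = TRUE)][1:2] # top two by |effect size|
```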

The S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.

b0
# logKi67.quantile["95%"]>=6.76825904668107
# logKi67.quantile["94%"]>=6.72093005757954

The S3 method dispatch predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by b0, to the test set s1.

b1 = b0 |> 
  predict(newdata = s1)
Predicted models b1
b1
# logKi67.quantile["95%"]>=6.76825904668107 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z       p
# Tstage  0.6287    1.8753   0.1922  3.271 0.00107
# x.TRUE -0.2630    0.7687   0.4693 -0.560 0.57515
# 
# Likelihood ratio test=9.05  on 2 df, p=0.01086
# n= 125, number of events= 27 
# 
# logKi67.quantile["94%"]>=6.72093005757954 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#            coef exp(coef) se(coef)      z       p
# Tstage  0.61820   1.85558  0.19162  3.226 0.00125
# x.TRUE -0.09157   0.91249  0.46920 -0.195 0.84526
# 
# Likelihood ratio test=8.78  on 2 df, p=0.01238
# n= 125, number of events= 27

6.5.2 via Repeated Partitions

set.seed(83); c0 = m0 |> 
  add_dummy_partition(~ logKi67.quantile, times = 20L, p = .8) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of four steps.

First, function add_dummy_partition() creates a dichotomizing rule for each additional numeric predictor in the following steps.

  1. Generates twenty (times = 20L) partitions of the training set s0, using function caret::createDataPartition() or statusPartition() (Section 33.5), at a .8 vs. .2 (= 1 - .8) ratio.
  2. For the \(i\)-th partition \((i=1,\cdots,20)\) of the training set s0,
    • creates a dichotomizing rule of the numeric predictor in the \(i\)-th training-subset of s0 using function node1() (Chapter 21);
    • applies such dichotomizing rule to the numeric predictor in the \(i\)-th test-subset of s0;
    • updates the starting model m0 with the \(i\)-th test-subset of s0, as well as the logical predictor from the previous step;
    • records the estimated regression coefficients, i.e., effect size effsize, of the logical predictor.
  3. Selects the dichotomizing rule from the partition whose logical predictor attains the median effect size in its own test-subset of s0.
  4. Returns an 'add_dummy' object.

Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of each additional logical predictor in its own test-subset of s0.

Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional logical predictors in their own test-subsets of s0.

Lastly, the S3 method dispatch utils:::head.default() chooses the top n elements from the input.
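
The repeated-partition selection can be sketched in base R on toy data (a simple median split stands in for the node1() rule, and lm() for coxph(); all names are hypothetical):

```r
# base-R sketch: learn a cutpoint on each training subset, score it on the
# held-out subset, then keep the rule with the median effect size
set.seed(83)
n <- 200L
x <- rnorm(n); y <- x + rnorm(n)
res <- replicate(20L, {                  # times = 20L repeated partitions
  id <- sample.int(n, size = n * .8)     # a .8 vs .2 partition
  cut <- median(x[id])                   # stand-in for the node1() rule
  z <- x[-id] >= cut                     # apply the rule on the held-out subset
  c(cut = cut, eff = unname(coef(lm(y[-id] ~ z))[2L]))
})
i <- which.min(abs(res["eff", ] - median(res["eff", ]))) # median effect size
chosen <- res["cut", i]                  # the selected dichotomizing cutpoint
```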

Similar to Section 6.5.1, the S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.

c0
# logKi67.quantile["96%"]>=6.8055032497215
# logKi67.quantile["98%"]>=7.25821545225641

Similar to Section 6.5.1, the S3 method dispatch predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by c0, to the test set s1.

c1 = c0 |> 
  predict(newdata = s1)
Predicted models c1
c1
# logKi67.quantile["96%"]>=6.8055032497215 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z       p
# Tstage  0.6275    1.8730   0.1924  3.261 0.00111
# x.TRUE -0.3212    0.7253   0.4695 -0.684 0.49392
# 
# Likelihood ratio test=9.19  on 2 df, p=0.01012
# n= 125, number of events= 27 
# 
# logKi67.quantile["98%"]>=7.25821545225641 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z        p
# Tstage  0.6346    1.8862   0.1915  3.313 0.000923
# x.TRUE -0.6196    0.5382   0.3935 -1.575 0.115309
# 
# Likelihood ratio test=11.19  on 2 df, p=0.00372
# n= 125, number of events= 27