library(survival)
library(maxEff)
# Loading required package: groupedHyperframe
# Registered S3 method overwritten by 'pROC':
#   method   from
#   plot.roc spatstat.explore
6 Predictor with Maximum Effect Size
The examples in Chapter 6 require that the search path contains the following namespaces. See the explanation of the function name conflict in Section 7.3.5.
The search path on the author’s computer running RStudio (Posit Team 2025):
search()
# [1] ".GlobalEnv" "package:maxEff" "package:groupedHyperframe" "package:survival"
# [5] "package:stats" "package:graphics" "package:grDevices" "package:utils"
# [9] "package:datasets" "package:methods" "Autoloads" "package:base"
🚧 Chapter 6 is being re-written for clarity of English. All code here is correct, though. Expected delivery by 2025-12-31.
Chapter 6 documents methods to select one from many numeric predictors for a regression model, such that the additional predictor has the maximum effect size.
6.1 Compute Aggregated Quantiles
Ki67q = groupedHyperframe::Ki67 |>
  within.data.frame(expr = {
    x = y = NULL # remove x- and y-coords for non-spatial application
  }) |>
  as.groupedHyperframe(group = ~ patientID/tissueID) |>
  quantile(probs = seq.int(from = .01, to = .99, by = .01)) |>
  aggregate(by = ~ patientID)
A hyperframe Ki67q with aggregated quantiles
Ki67q |> head()
# Hyperframe:
# Tstage PFS adj_rad adj_chemo histology Her2 HR node race age patientID logKi67.quantile
# 1 2 100+ FALSE FALSE 3 TRUE TRUE TRUE White 66 PT00037 (numeric)
# 2 1 22 FALSE FALSE 3 FALSE TRUE FALSE Black 42 PT00039 (numeric)
# 3 1 99+ FALSE NA 3 FALSE TRUE FALSE White 60 PT00040 (numeric)
# 4 1 99+ FALSE TRUE 3 FALSE TRUE TRUE White 53 PT00042 (numeric)
# 5 1 112 TRUE TRUE 3 FALSE TRUE TRUE White 52 PT00054 (numeric)
# 6 4 12 TRUE FALSE 2 TRUE TRUE TRUE Black 51 PT00059 (numeric)
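The quantile-then-aggregate idea can be sketched in base R (a simplification on made-up data, not groupedHyperframe's implementation; all data and names below are hypothetical):

```r
set.seed(1)
# toy data: two patients, two tissues each, ten marker values per tissue
d = data.frame(
  patientID = rep(c('PT1', 'PT2'), each = 20L),
  tissueID = rep(1:4, each = 10L),
  marker = rexp(40L)
)
# quantiles within each tissue
q = do.call(rbind, lapply(split(d, ~ tissueID), FUN = \(d.) {
  data.frame(patientID = d.$patientID[1L],
             t(quantile(d.$marker, probs = c(.25, .5, .75))))
}))
# aggregate (average) the tissue-level quantiles by patient
aggregate(. ~ patientID, data = q, FUN = mean)
```

groupedHyperframe additionally keeps the full quantile curve as a numeric hypercolumn, rather than a few fixed probabilities as above.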
6.2 Data Partition
Partition into a training (80%) and test (20%) set.
set.seed(32); id = sample.int(n = nrow(Ki67q), size = nrow(Ki67q)*.8)
s0 = Ki67q[id, , drop = FALSE] # training set
s1 = Ki67q[-id, , drop = FALSE] # test set
dim(s0)
# [1] 497 12
dim(s1)
# [1] 125 12
6.3 Starting Model
Let’s consider a starting model of endpoint PFS with predictor Tstage on the training set s0. As of package spatstat.geom v3.6.0.3, the S3 method dispatch spatstat.geom::as.data.frame.hyperframe() warns on hypercolumns that cannot be converted to a data.frame. Therefore, users should use suppressWarnings(), or set #| warning: false in an R markdown code chunk.
m0 = coxph(PFS ~ Tstage, data = s0)
Starting coxph model m0
summary(m0)
# Call:
# coxph(formula = PFS ~ Tstage, data = s0)
#
# n= 497, number of events= 91
#
# coef exp(coef) se(coef) z Pr(>|z|)
# Tstage 0.6725 1.9591 0.1061 6.34 2.29e-10 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# exp(coef) exp(-coef) lower .95 upper .95
# Tstage 1.959 0.5104 1.591 2.412
#
# Concordance= 0.654 (se = 0.028 )
# Likelihood ratio test= 32.23 on 1 df, p=1e-08
# Wald test = 40.2 on 1 df, p=2e-10
# Score (logrank) test = 43.73 on 1 df, p=4e-11
6.4 Adding numeric Predictor
a0 = m0 |>
  add_numeric(x = ~ logKi67.quantile) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)
The pipeline above consists of three steps.
First, function add_numeric()
- considers each “slice” of the numeric-hypercolumn logKi67.quantile as an additional numeric predictor. Users are encouraged to read the package groupedHyperframe vignette, Appendix Section anylist, for more details;
- updates the starting model m0 with each one of the additional numeric predictors, respectively;
- returns an 'add_numeric' object, which inherits from the class 'add_'. This is a list of objects with internal class 'add_numeric_' (Section 33.1).
Second, the S3 method dispatch sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional numeric predictors.
Lastly, the S3 method dispatch utils:::head.default() chooses the top n elements from the input.
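Conceptually, the three steps resemble this base-R sketch on the survival::lung data (a simplification, not maxEff's implementation; the starting model and the candidate predictors ph.karno and wt.loss are chosen arbitrarily):

```r
library(survival)
m = coxph(Surv(time, status) ~ age, data = lung) # a starting model
# candidate numeric predictors, each tried as one additional term
cands = list(ph.karno = lung$ph.karno, wt.loss = lung$wt.loss)
eff = vapply(cands, FUN = \(x.) {
  # refit the starting model with the candidate; record its coefficient
  coef(coxph(Surv(time, status) ~ age + x., data = cbind(lung, x. = x.)))[['x.']]
}, FUN.VALUE = 0)
# sort by absolute effect size; keep the top candidate
eff[order(abs(eff), decreasing = TRUE)] |> head(n = 1L)
```

add_numeric() does this for every “slice” of a numeric hypercolumn at once, and wraps the results in an 'add_numeric' object.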
The S3 method dispatch print.add_numeric() displays the calls to the selected numeric predictors.
a0
# logKi67.quantile["6%"]
# logKi67.quantile["5%"]
The S3 method dispatch predict.add_numeric() applies the starting model structure in m0, as well as the additional numeric predictors selected by a0, on the test set s1. The author categorizes this functionality under the S3 generic stats::predict(), instead of stats::update(), to emphasize that this “update” is on a newdata, even though the workhorse function is the S3 method dispatch stats:::update.default().
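The point that the “update” happens on newdata can be sketched with an ordinary coxph model (a hypothetical illustration; the train/test split below is arbitrary):

```r
library(survival)
train = lung[1:150, ]
test = lung[151:228, ]
m = coxph(Surv(time, status) ~ age, data = train)
# same model structure, but coefficients re-estimated on the test set
m1 = update(m, data = test)
```

That is, the selected model is re-fitted on, not merely evaluated against, the new data.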
a1 = a0 |>
  predict(newdata = s1)
Predicted models a1
a1
# logKi67.quantile["6%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
#
# coef exp(coef) se(coef) z p
# Tstage 0.5900 1.8041 0.1959 3.012 0.0026
# x. 0.6071 1.8352 0.8687 0.699 0.4846
#
# Likelihood ratio test=9.23 on 2 df, p=0.009926
# n= 125, number of events= 27
#
# logKi67.quantile["5%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
#
# coef exp(coef) se(coef) z p
# Tstage 0.5886 1.8015 0.1961 3.002 0.00269
# x. 0.6276 1.8732 0.8746 0.718 0.47299
#
# Likelihood ratio test=9.25 on 2 df, p=0.009794
# n= 125, number of events= 27
6.5 Adding logical Predictor
6.5.1 Naive Practice
b0 = m0 |>
  add_dummy(x = ~ logKi67.quantile) |>
  subset(subset = p1 > .05 & p1 < .95) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)
The pipeline above consists of four steps.
First, function add_dummy()
- dichotomizes (Chapter 21) each “slice” of the numeric-hypercolumn logKi67.quantile into a logical variable;
- removes the duplicated logical variables;
- updates the starting model m0 with each one of the additional logical predictors, respectively;
- returns an 'add_dummy' object, which inherits from the class 'add_'. This is a list of objects with internal class 'add_dummy_' (Section 33.2).
Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of the additional logical predictor.
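The balance criterion can be illustrated in base R (a hypothetical sketch: p1 plays the role of the proportion of TRUE in each dichotomized predictor):

```r
set.seed(42)
x = rnorm(200L)
# hypothetical dichotomizing thresholds at several quantiles of x
cuts = quantile(x, probs = c(.02, .25, .5, .75, .98))
p1 = vapply(cuts, FUN = \(cut) mean(x >= cut), FUN.VALUE = 0)
# keep only reasonably balanced dichotomizations
p1 > .05 & p1 < .95 # the extreme 2% and 98% splits are dropped
```

Dichotomizations in which nearly all subjects fall on one side carry little information, hence the filter.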
Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute value of the regression coefficients, i.e., effect size effsize, of the additional logical predictors.
Lastly, the S3 method dispatch utils:::head.default() chooses the top n elements from the input.
The S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.
b0
# logKi67.quantile["95%"]>=6.76825904668107
# logKi67.quantile["94%"]>=6.72093005757954
The S3 method dispatch predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by b0, on the test set s1.
b1 = b0 |>
  predict(newdata = s1)
Predicted models b1
b1
# logKi67.quantile["95%"]>=6.76825904668107 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
#
# coef exp(coef) se(coef) z p
# Tstage 0.6287 1.8753 0.1922 3.271 0.00107
# x.TRUE -0.2630 0.7687 0.4693 -0.560 0.57515
#
# Likelihood ratio test=9.05 on 2 df, p=0.01086
# n= 125, number of events= 27
#
# logKi67.quantile["94%"]>=6.72093005757954 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
#
# coef exp(coef) se(coef) z p
# Tstage 0.61820 1.85558 0.19162 3.226 0.00125
# x.TRUE -0.09157 0.91249 0.46920 -0.195 0.84526
#
# Likelihood ratio test=8.78 on 2 df, p=0.01238
# n= 125, number of events= 27
6.5.2 via Repeated Partitions
set.seed(83); c0 = m0 |>
add_dummy_partition(~ logKi67.quantile, times = 20L, p = .8) |>
subset(subset = p1 > .15 & p1 < .85) |>
sort_by(y = abs(effsize), decreasing = TRUE) |>
head(n = 2L)
The pipeline above consists of four steps.
First, function add_dummy_partition() creates a dichotomizing rule for each additional numeric predictor in the following steps.
- Generates twenty (20L) partitions of the training set s0 using functions caret::createDataPartition() or statusPartition() (Section 33.5) at a .8 vs. .2 = (1-.8) ratio.
- For the \(i\)-th partition \((i=1,\cdots,20)\) of the training set s0,
  - creates a dichotomizing rule of the numeric predictor in the \(i\)-th training-subset of s0 using function node1() (Chapter 21);
  - applies such dichotomizing rule to the numeric predictor in the \(i\)-th test-subset of s0;
  - updates the starting model m0 with the \(i\)-th test-subset of s0, as well as the logical predictor from the previous step;
  - records the estimated regression coefficient, i.e., effect size effsize, of the logical predictor.
- Selects the dichotomizing rule based on the partition with the median effect size of the logical predictor in their own, specific test-subset of s0.
- Returns an 'add_dummy' object.
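A stripped-down base-R sketch of this idea (entirely hypothetical: one candidate predictor wt.loss from survival::lung, a median split standing in for node1(), and a plain sample.int() partition standing in for caret::createDataPartition()):

```r
library(survival)
set.seed(83)
d = lung[!is.na(lung$wt.loss), ]
fit1 = \(i) {
  id = sample.int(n = nrow(d), size = nrow(d) * .8)
  cut = median(d$wt.loss[id]) # rule learned on the i-th training-subset
  # evaluate the rule on the i-th test-subset
  z = coxph(Surv(time, status) ~ age + I(wt.loss >= cut), data = d[-id, ])
  c(cut = cut, effsize = unname(coef(z)[2L]))
}
r = vapply(1:20, FUN = fit1, FUN.VALUE = c(cut = 0, effsize = 0))
# keep the threshold from the partition with the median effect size
r['cut', which.min(abs(r['effsize', ] - median(r['effsize', ])))]
```

Selecting by the median, rather than the best, test-subset effect size guards against choosing a threshold that merely overfits one lucky partition.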
Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of the additional logical predictor in their own, specific test-subset of s0.
Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute value of the regression coefficients, i.e., effect size effsize, of the additional logical predictors in their own, specific test-subset of s0.
Lastly, the S3 method dispatch utils:::head.default() chooses the top n elements from the input.
Similar to Section 6.5.1, the S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.
c0
# logKi67.quantile["96%"]>=6.8055032497215
# logKi67.quantile["98%"]>=7.25821545225641
Similar to Section 6.5.1, the S3 method dispatch predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by c0, on the test set s1.
c1 = c0 |>
  predict(newdata = s1)
Predicted models c1
c1
# logKi67.quantile["96%"]>=6.8055032497215 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
#
# coef exp(coef) se(coef) z p
# Tstage 0.6275 1.8730 0.1924 3.261 0.00111
# x.TRUE -0.3212 0.7253 0.4695 -0.684 0.49392
#
# Likelihood ratio test=9.19 on 2 df, p=0.01012
# n= 125, number of events= 27
#
# logKi67.quantile["98%"]>=7.25821545225641 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
#
# coef exp(coef) se(coef) z p
# Tstage 0.6346 1.8862 0.1915 3.313 0.000923
# x.TRUE -0.6196 0.5382 0.3935 -1.575 0.115309
#
# Likelihood ratio test=11.19 on 2 df, p=0.00372
# n= 125, number of events= 27