library(survival)
library(hyper.gam)
# Loading required package: groupedHyperframe
# Registered S3 method overwritten by 'pROC':
# method from
# plot.roc spatstat.explore
5 Quantile Index
The examples in Chapter 5 require that the
search
path contains the followingnamespace
s. See the explanation of the function name conflict in Section 7.3.5.
search
path on author’s computer running RStudio (Posit Team 2025)
search()
# [1] ".GlobalEnv" "package:hyper.gam" "package:groupedHyperframe" "package:survival"
# [5] "package:stats" "package:graphics" "package:grDevices" "package:utils"
# [9] "package:datasets" "package:methods" "Autoloads" "package:base"
To cite the implementation of the Quantile Index (QI) methodology, please use
Zhan T, Yi M, Chervoneva I (2025). “Quantile Index predictors using R package
hyper.gam
.” Bioinformatics, 41(8), btaf430. ISSN 1367-4811, doi:10.1093/bioinformatics/btaf430 https://doi.org/10.1093/bioinformatics/btaf430.
BibTeX and/or BibLaTeX entries for LaTeX users
@Article{,
title = {Quantile Index predictors using R package `hyper.gam`},
author = {Tingting Zhan and Misung Yi and Inna Chervoneva},
journal = {Bioinformatics},
volume = {41},
number = {8},
pages = {btaf430},
year = {2025},
month = {07},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf430},
}
as well as Yi et al. (2023b); Yi et al. (2023a); Yi et al. (2025).
Function(s) in Zhan, Yi, and Chervoneva (2025) but later deprecated
::aggregate_quantile()
groupedHyperframe# Function groupedHyperframe::aggregate_quantile described in
# doi:10.1093/bioinformatics/btaf430 (<https://doi.org/10.1093/bioinformatics/btaf430>)
# has been replaced by pipeline
# <groupedHyperframe> |> quantile() |> aggregate()
# Read vignette for details
# <https://rpubs.com/tingtingzhan/groupedHyperframe>
# Error in groupedHyperframe::aggregate_quantile(): 'aggregate_quantile' is defunct.
# Use '<groupedHyperframe> |> quantile() |> aggregate()' instead.
# See help("Defunct")
5.1 Compute Aggregated Quantiles
We use the data example Ki67
from package groupedHyperframe
(v0.3.0.20251020) in this non-spatial application.
= groupedHyperframe::Ki67 |>
Ki67q within.data.frame(expr = {
= y = NULL # remove x- and y-coords for non-spatial application
x |>
}) as.groupedHyperframe(group = ~ patientID/tissueID) |>
quantile(probs = seq.int(from = .01, to = .99, by = .01)) |>
aggregate(by = ~ patientID)
The returned object Ki67q
is a hyper data frame with
- a numeric-hypercolumn of the aggregated sample quantiles
logKi67.quantile
perpatientID
. These quantiles are calculated at a pre-specified grid of probabilities \(\{p_k, k=1,\cdots,K \} \in [0,1]\). Note that the aggregation must be performed at the level of biologically independent clusters, e.g.,~patientID
, to produce independent quantile predictors. - Metadata, including the outcome of interest, e.g., progression free survival
PFS
,Her2
,HR
, etc.
A hyper data frame Ki67q
: aggregated quantiles
|>
Ki67q head()
# Hyperframe:
# Tstage PFS adj_rad adj_chemo histology Her2 HR node race age patientID logKi67.quantile
# 1 2 100+ FALSE FALSE 3 TRUE TRUE TRUE White 66 PT00037 (numeric)
# 2 1 22 FALSE FALSE 3 FALSE TRUE FALSE Black 42 PT00039 (numeric)
# 3 1 99+ FALSE NA 3 FALSE TRUE FALSE White 60 PT00040 (numeric)
# 4 1 99+ FALSE TRUE 3 FALSE TRUE TRUE White 53 PT00042 (numeric)
# 5 1 112 TRUE TRUE 3 FALSE TRUE TRUE White 52 PT00054 (numeric)
# 6 4 12 TRUE FALSE 2 TRUE TRUE TRUE Black 51 PT00059 (numeric)
Readers are encouraged to learn more about the hyper data frame (hyperframe
) from package spatstat.geom
(Baddeley, Rubak, and Turner 2015; Baddeley and Turner 2005) and each function in this pipeline from,
Function | Purpose | Where |
---|---|---|
as.groupedHyperframe() |
to convert a data frame into a grouped hyper data frame | Chapter 2 |
quantile() |
to calculate the quantiles of each numeric-vector in the numeric-hypercolumn | Section 3.3.2 |
aggregate() |
to aggregate the quantiles over multiple tissues per patient by point-wise means | Section 3.4 |
5.2 Estimate Integrand Surface
Linear quantile index (QI) (Equation 5.1) is a predictor in a functional generalized linear model (James 2002) for outcomes from the exponential family of distributions, or a linear functional Cox model (Gellar et al. 2015) for survival outcomes,
\[ \text{QI}_{i}=\int_{0}^{1} \beta(p)Q_i(p)dp \tag{5.1}\]
where \(Q_i(p)\) is the (aggregated) sample quantiles logKi67.quantile
for the \(i\)-th subject, and \(\beta(p)\) is the unknown coefficient function to be estimated. We use function hyper_gam()
to fit a generalized additive model gam
with integrated linear spline-based smoothness estimation (function mgcv::s()
, Wood 2003). This is a scalar-on-function model (Reiss et al. 2017) that predicts a scalar outcome (e.g., progression free survival time PFS[,1L]
) using the aggregated quantiles function as a functional predictor.
= hyper_gam(PFS ~ logKi67.quantile, data = Ki67q) m0
Nonlinear quantile index (nlQI) (Equation 5.2) is a predictor in the functional generalized additive model (McLean et al. 2014) for outcomes from the exponential family of distributions, or an additive functional Cox model (Cui, Crainiceanu, and Leroux 2021) for survival outcomes.
\[ \text{nlQI}_{i}= \int_{0}^{1} F\big(p, Q_i(p)\big)dp \tag{5.2}\]
where \(F(\cdot,\cdot)\) is an unknown bivariate twice differentiable function. We use function hyper_gam(., nonlinear = TRUE)
to fit a generalized additive model gam
with tensor product interaction estimation (function mgcv::ti()
, Wood 2006).
= hyper_gam(PFS ~ logKi67.quantile, data = Ki67q, nonlinear = TRUE) m1
The fitted functional model m0
and m1
have the S3
class 'hyper_gam'
(Chapter 18).
5.2.1 Visualization
Function integrandSurface()
creates an interactive htmlwidget
(Vaidyanathan et al. 2023, v1.6.4) visualization of the estimated integrand surfaces for the linear (Equation 5.1) or nonlinear quantile index (Equation 5.2) using package plotly
(Sievert 2020, v4.11.0). The integrand surfaces, defined on \(p\in[0,1]\) and \(q\in\text{range}\big\{Q_i(p), i=1,\cdots,n\big\}\), are
\[ \hat{S}(p,q) = \begin{cases} \hat{\beta}(p)\cdot q\\ \hat{F}(p,q) \end{cases} \tag{5.3}\]
Also in this interactive visualization are
- the contour lines on the integrand surfaces (Equation 5.3), as well as their projections along the \(s\)-axis, i.e., onto the \((p,q)\)-plane (a.k.a., the “floor”);
- the estimated linear integrand paths \(\hat{\beta}(p)Q_i(p)\) or the nonlinear integrand paths \(\hat{F}(p, Q_i(p))\) on the integrand surfaces (Equation 5.3);
- the sample quantiles \(Q_i(p)\), i.e., the projections of the estimated linear or nonlinear integrand path along the \(s\)-axis, i.e., onto the \((p,q)\)-plane (a.k.a., the “floor”);
- the projections of the estimated linear or nonlinear integrand path along the \(q\)-axis, i.e., onto the \((p,s)\)-plane (a.k.a., the “backwall”), so that the area under each projected path is equal to the estimated linear (Equation 5.1) or nonlinear quantile index (Equation 5.2).
Figure 5.1 is a collage (via package htmltools
, Cheng et al. 2024, v0.5.8.1) of the interactive htmlwidget
visualizations of the linear and nonlinear integrand surfaces, contours, integrand paths and their projections to the “floor” and “backwall”. Readers may remove the argument n
in function call integrandSurface(, n=21L)
for a more refined surface. The authors must use n=21L
to reduce the htmlwidget
objects size, in order to comply with Quarto Pub file size limit.
Figure: visualize linear and nonlinear quantile index
::tagList(
htmltools|> integrandSurface(n = 21L),
m0 |> integrandSurface(n = 21L)
m1 |>
) ::browsable() htmltools
Static illustrations of the estimated integrand surfaces, e.g., the persp
ective and contour
plots (Section 18.2), are produced by calling the S3
generic functions persp()
and contour()
in package graphics
shipped with vanilla R
. These figures are suppressed to reduce the file size of this vignette.
graphics
Code
|> persp()
m0 |> contour()
m0 |> persp()
m1 |> contour() m1
5.3 Compute Quantile Index Predictor
Linear and nonlinear quantile indices are the predictors in the functional models (Equation 5.1) and (Equation 5.2), respectively. Let’s consider a conventional scenario that we first fit a hyper_gam
model to the training data set, then compute the quantile index predictors in the training and/or test data set using the training model.
First, we partition the 622 patients in hyper data frame Ki67q
into a training data set with 498 patients and a test data set with 124 patients, i.e., a 80% vs. 20% partition.
set.seed(16); id = Ki67q |> nrow() |> seq_len() |> caret::createDataPartition(p = .8)
= Ki67q[id[[1L]],] # training set
Ki67q_0 = Ki67q[-id[[1L]],] # test set Ki67q_1
Next, we fit a functional generalized additive model to the the training data set Ki67q_0
,
= hyper_gam(PFS ~ logKi67.quantile, nonlinear = TRUE, data = Ki67q_0) m1a
We can, but we should not, use the quantile index predictors (Section 18.3) of the training data set for downstream analysis, because these quantile index predictors are optimized on the training data set and the results would be optimistically biased.
Optimistically biased!!
c('PFS', 'age', 'race')] |>
Ki67q_0[,as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
data.frame(nlQI = predict(m1a, newdata = Ki67q_0)) |>
coxph(formula = PFS ~ age + nlQI, data = _)
# Call:
# coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_0[,
# c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_0)))
#
# coef exp(coef) se(coef) z p
# age -0.023320 0.976950 0.008321 -2.803 0.00507
# nlQI 1.118713 3.060911 0.343152 3.260 0.00111
#
# Likelihood ratio test=23.24 on 2 df, p=8.993e-06
# n= 498, number of events= 99
Instead, we should use the quantile index predictors computed in the test data set for downstream analysis,
c('PFS', 'age', 'race')] |>
Ki67q_1[,as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
data.frame(nlQI = predict(m1a, newdata = Ki67q_1)) |>
coxph(formula = PFS ~ age + nlQI, data = _)
# Call:
# coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_1[,
# c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_1)))
#
# coef exp(coef) se(coef) z p
# age -0.01636 0.98377 0.01862 -0.879 0.380
# nlQI 1.21050 3.35516 0.75858 1.596 0.111
#
# Likelihood ratio test=3.42 on 2 df, p=0.1805
# n= 124, number of events= 19