5 Quantile Index

The examples in Chapter 5 require the following packages (see Section 7.4 for an explanation of the function name conflict).

library(survival)
library(hyper.gam)
# Loading required package: groupedHyperframe
# Registered S3 method overwritten by 'pROC':
#   method   from            
#   plot.roc spatstat.explore

Search path and loaded namespaces on the author’s computer

search()
#  [1] ".GlobalEnv"                "package:hyper.gam"         "package:groupedHyperframe" "package:survival"          "package:stats"             "package:graphics"          "package:grDevices"        
#  [8] "package:utils"             "package:datasets"          "package:methods"           "Autoloads"                 "package:base"
loadedNamespaces() |> sort.int()
#   [1] "abind"             "base"              "caret"             "class"             "cli"               "cluster"           "codetools"         "compiler"          "data.table"       
#  [10] "datasets"          "deldir"            "digest"            "doParallel"        "dplyr"             "evaluate"          "farver"            "fastmap"           "fastmatrix"       
#  [19] "foreach"           "future"            "future.apply"      "generics"          "geomtextpath"      "GET"               "ggplot2"           "globals"           "glue"             
#  [28] "goftest"           "gower"             "graphics"          "grDevices"         "grid"              "gridExtra"         "groupedHyperframe" "gtable"            "hardhat"          
#  [37] "htmltools"         "htmlwidgets"       "httr"              "hyper.gam"         "ipred"             "iterators"         "jsonlite"          "knitr"             "lattice"          
#  [46] "lava"              "lazyeval"          "lifecycle"         "listenv"           "lubridate"         "magrittr"          "MASS"              "Matrix"            "matrixStats"      
#  [55] "methods"           "mgcv"              "ModelMetrics"      "nlme"              "nnet"              "otel"              "parallel"          "parallelly"        "patchwork"        
#  [64] "pillar"            "pkgconfig"         "plotly"            "plyr"              "polyclip"          "pracma"            "pROC"              "prodlim"           "purrr"            
#  [73] "R6"                "RColorBrewer"      "Rcpp"              "recipes"           "reshape2"          "rlang"             "rmarkdown"         "rpart"             "rstudioapi"       
#  [82] "S7"                "scales"            "SpatialPack"       "spatstat.data"     "spatstat.explore"  "spatstat.geom"     "spatstat.random"   "spatstat.sparse"   "spatstat.univar"  
#  [91] "spatstat.utils"    "splines"           "stats"             "stats4"            "stringi"           "stringr"           "survival"          "systemfonts"       "tensor"           
# [100] "textshaping"       "tibble"            "tidyr"             "tidyselect"        "timechange"        "timeDate"          "tools"             "utils"             "vctrs"            
# [109] "viridisLite"       "withr"             "xfun"              "yaml"

To cite the implementation of the Quantile Index (QI) methodology, please use

Zhan T, Yi M, Chervoneva I (2025). “Quantile Index predictors using R package hyper.gam.” Bioinformatics, 41(8), btaf430. ISSN 1367-4811, doi:10.1093/bioinformatics/btaf430 https://doi.org/10.1093/bioinformatics/btaf430.

BibTeX and/or BibLaTeX entries for LaTeX users

@Article{,
  title = {Quantile Index predictors using R package `hyper.gam`},
  author = {Tingting Zhan and Misung Yi and Inna Chervoneva},
  journal = {Bioinformatics},
  volume = {41},
  number = {8},
  pages = {btaf430},
  year = {2025},
  month = {07},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btaf430},
}

as well as Yi et al. (2023b); Yi et al. (2023a); Yi et al. (2025).

Function(s) described in Zhan, Yi, and Chervoneva (2025) but now defunct

groupedHyperframe::aggregate_quantile()
# Function groupedHyperframe::aggregate_quantile() described in
# doi:10.1093/bioinformatics/btaf430 (<https://doi.org/10.1093/bioinformatics/btaf430>)
# has been replaced by pipeline
# <groupedHyperframe> |> quantile() |> aggregate()
# Read vignette (mirrors) for details
# <https://tingtingzhan.quarto.pub/groupedhyperframe/bioinformatics_btaf430.html>
# <https://tingtingzhan-groupedhyperframe.netlify.app/bioinformatics_btaf430.html>
# Error in `groupedHyperframe::aggregate_quantile()`:
# ! 'aggregate_quantile' is defunct.
# Use '<groupedHyperframe> |> quantile() |> aggregate()' instead.
# See help("Defunct")

5.1 Compute Aggregated Quantiles

We use the data example Ki67 from package groupedHyperframe (v0.3.2.20251225) in this non-spatial application.

Listing 5.1: Data: a hyper data frame Ki67q

Ki67q = groupedHyperframe::Ki67 |>
  within.data.frame(expr = {
    x = y = NULL # remove x- and y-coords for non-spatial application
  }) |>
  as.groupedHyperframe(group = ~ patientID/tissueID) |>
  quantile(probs = seq.int(from = .01, to = .99, by = .01)) |>
  aggregate(by = ~ patientID)

The returned object Ki67q is a hyper data frame with

a numeric-hypercolumn logKi67.quantile of the aggregated sample quantiles per patientID. These quantiles are calculated at a pre-specified grid of probabilities \(\{p_k, k=1,\cdots,K \} \subset [0,1]\). Note that the aggregation must be performed at the level of biologically independent clusters, e.g., ~patientID, to produce independent quantile predictors.
metadata, including the outcome of interest (progression-free survival PFS) and patient-level covariates such as Her2 and HR.
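
As a rough illustration of what this pipeline computes, the following base-R sketch takes the sample quantiles of each tissue and then averages them point-wise within each patient (the aggregation rule summarized for aggregate() in Section 3.4). The simulated data frame toy and all object names are hypothetical; this is a sketch, not the groupedHyperframe implementation.

```r
set.seed(1)
probs = seq.int(from = .01, to = .99, by = .01)

# hypothetical toy data: 3 patients, 2 tissues per patient, 50 cells per tissue
toy = data.frame(
  patientID = rep(c('A', 'B', 'C'), each = 100L),
  tissueID = rep(rep(c('t1', 't2'), each = 50L), times = 3L),
  logKi67 = rnorm(n = 300L)
)

# for each patient: sample quantiles per tissue, then point-wise mean over tissues
q_patient = toy |>
  split(f = toy$patientID) |>
  lapply(FUN = \(d) {
    d$logKi67 |>
      split(f = d$tissueID) |>
      vapply(FUN = quantile, probs = probs, FUN.VALUE = numeric(length(probs))) |>
      rowMeans()
  })

lengths(q_patient) # one aggregated quantile vector per patient
```

Because the averaging is done within patientID, each patient contributes exactly one quantile vector, which is what makes the resulting predictors independent across rows.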

A hyper data frame Ki67q: aggregated quantiles

Ki67q |>
  head()
# Hyperframe:
#   Tstage  PFS adj_rad adj_chemo histology  Her2   HR  node  race age patientID logKi67.quantile
# 1      2 100+   FALSE     FALSE         3  TRUE TRUE  TRUE White  66   PT00037        (numeric)
# 2      1   22   FALSE     FALSE         3 FALSE TRUE FALSE Black  42   PT00039        (numeric)
# 3      1  99+   FALSE        NA         3 FALSE TRUE FALSE White  60   PT00040        (numeric)
# 4      1  99+   FALSE      TRUE         3 FALSE TRUE  TRUE White  53   PT00042        (numeric)
# 5      1  112    TRUE      TRUE         3 FALSE TRUE  TRUE White  52   PT00054        (numeric)
# 6      4   12    TRUE     FALSE         2  TRUE TRUE  TRUE Black  51   PT00059        (numeric)

Readers are encouraged to learn more about the hyper data frame (hyperframe) from package spatstat.geom (Baddeley, Rubak, and Turner 2015; Baddeley and Turner 2005), and about each function in this pipeline from the earlier chapters listed below.

Sections in Earlier Chapters
Function	Purpose	Where
`as.groupedHyperframe()`	to convert a data frame into a grouped hyper data frame	Chapter 2
`quantile()`	to calculate the quantiles of each numeric-vector in the numeric-hypercolumn	Section 3.3.3
`aggregate()`	to aggregate the quantiles over multiple tissues per patient by point-wise means	Section 3.4

5.2 Estimate Integrand Surface

The linear quantile index (QI) (Equation 5.1) is a predictor in a functional generalized linear model (James 2002) for outcomes from the exponential family of distributions, or in a linear functional Cox model (Gellar et al. 2015) for survival outcomes,

\[ \text{QI}_{i}=\int_{0}^{1} \beta(p)Q_i(p)dp \tag{5.1}\]

where \(Q_i(p)\) is the (aggregated) sample quantile function logKi67.quantile of the \(i\)-th subject, and \(\beta(p)\) is the unknown coefficient function to be estimated. Listing 5.2 fits a generalized additive model (gam) with integrated linear spline-based smoothness estimation using the function mgcv::s() (Wood 2003, v1.9.4). This is a scalar-on-function model (Reiss et al. 2017) that predicts a scalar outcome (e.g., progression-free survival time PFS[,1L]) using the aggregated quantile function as a functional predictor.
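
To make Equation 5.1 concrete, the sketch below approximates the integral with the trapezoidal rule on the same probability grid used in Listing 5.1. The coefficient function beta and the quantile function Q_i are hypothetical stand-ins (in practice \(\beta(p)\) is estimated by the model), the integral is approximated over the grid range \([0.01, 0.99]\) only, and the numerical integration used internally by hyper_gam() may differ.

```r
p = seq.int(from = .01, to = .99, by = .01)

Q_i = qnorm(p, mean = 1, sd = 2) # hypothetical quantile function of one subject
beta = \(p) 2 * p                # hypothetical coefficient function

# integrand of Equation 5.1, evaluated on the grid
integrand = beta(p) * Q_i

# trapezoidal rule: sum of (grid width) x (average of adjacent integrand values)
QI_i = sum(diff(p) * (integrand[-1L] + integrand[-length(integrand)]) / 2)
QI_i
```

With this choice of beta, larger weights are placed on the upper quantiles, so subjects with a heavier upper tail of Q_i receive a larger index.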

Listing 5.2: Linear functional Cox model for survival outcome (Gellar et al. 2015) (Listing 5.1)

m0 = hyper_gam(PFS ~ logKi67.quantile, data = Ki67q)

The nonlinear quantile index (nlQI) (Equation 5.2) is a predictor in the functional generalized additive model (McLean et al. 2014) for outcomes from the exponential family of distributions, or in an additive functional Cox model (Cui, Crainiceanu, and Leroux 2021) for survival outcomes,

\[ \text{nlQI}_{i}= \int_{0}^{1} F\big(p, Q_i(p)\big)dp \tag{5.2}\]

where \(F(\cdot,\cdot)\) is an unknown bivariate twice-differentiable function. Listing 5.3 fits a generalized additive model (gam) with tensor product interaction estimation using the function mgcv::ti() (Wood 2006, v1.9.4).
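
As a short worked step connecting the two definitions: if the bivariate function is restricted to be linear in its second argument, the nonlinear quantile index (Equation 5.2) reduces to the linear quantile index (Equation 5.1),

\[ F(p,q)=\beta(p)\cdot q \quad\Longrightarrow\quad \text{nlQI}_{i}=\int_{0}^{1} \beta(p)Q_i(p)dp=\text{QI}_{i} \]

so the linear model can be read as a special case of the nonlinear one.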

Listing 5.3: Additive functional Cox model for survival outcome (Cui, Crainiceanu, and Leroux 2021) (Listing 5.1)

m1 = hyper_gam(PFS ~ logKi67.quantile, data = Ki67q, nonlinear = TRUE)

The fitted functional models m0 and m1 have the S3 class 'hyper_gam' (Chapter 27).

5.2.1 Visualization

Function integrandSurface() creates an interactive htmlwidget (Vaidyanathan et al. 2023, v1.6.4) visualization of the estimated integrand surfaces for the linear (Equation 5.1) or nonlinear quantile index (Equation 5.2) using package plotly (Sievert 2020, v4.11.0). The integrand surfaces, defined on \(p\in[0,1]\) and \(q\in\text{range}\big\{Q_i(p), i=1,\cdots,n\big\}\), are

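\[ \hat{S}(p,q) = \begin{cases} \hat{\beta}(p)\cdot q & \text{for the linear quantile index (Equation 5.1)}\\ \hat{F}(p,q) & \text{for the nonlinear quantile index (Equation 5.2)} \end{cases} \tag{5.3}\]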

Also in this interactive visualization are

the contour lines on the integrand surfaces (Equation 5.3), as well as their projections along the \(s\)-axis, i.e., onto the \((p,q)\)-plane (a.k.a., the “floor”);
the estimated linear integrand paths \(\hat{\beta}(p)Q_i(p)\) or the nonlinear integrand paths \(\hat{F}(p, Q_i(p))\) on the integrand surfaces (Equation 5.3);
the sample quantiles \(Q_i(p)\), i.e., the projections of the estimated linear or nonlinear integrand paths along the \(s\)-axis onto the \((p,q)\)-plane (a.k.a., the “floor”);
the projections of the estimated linear or nonlinear integrand paths along the \(q\)-axis, i.e., onto the \((p,s)\)-plane (a.k.a., the “backwall”), so that the area under each projected path is equal to the estimated linear (Equation 5.1) or nonlinear quantile index (Equation 5.2).

Figure 5.1 is a collage (via package htmltools, Cheng et al. 2025, v0.5.9) of the interactive htmlwidget visualizations of the linear and nonlinear integrand surfaces, contours, integrand paths and their projections to the “floor” and “backwall”. Listing 5.4 uses n=21L to reduce the size of the htmlwidget objects, in order to comply with the Quarto Pub file size limit.

Listing 5.4: Figure: visualize linear and nonlinear quantile index (Listing 5.2, Listing 5.3)

scene = list(
  xaxis = list(title = 'Probability (p)', tickformat = '.0%', color = 'dodgerblue'), 
  yaxis = list(title = 'Quantile (q)', color = 'deeppink'),
  zaxis = list(title = 'Integrand (s)', color = 'darkolivegreen')
)
htmltools::tagList(
  m0 |> integrandSurface(n = 21L) |> plotly::layout(scene = scene), 
  m1 |> integrandSurface(n = 21L) |> plotly::layout(scene = scene)
) |> 
  htmltools::browsable()

Figure 5.1: Linear (top) and nonlinear (bottom) integrand surfaces, contours, integrand paths and their projections to the “floor” and “backwall”

Static illustrations of the estimated integrand surfaces, e.g., the perspective and contour plots (Section 27.2), are produced by calling the S3 generic functions graphics::persp() and graphics::contour() in package graphics shipped with R version 4.5.2 (2025-10-31). These figures are suppressed to reduce the file size of this vignette.

Listing 5.5: Static figures using package graphics

m0 |> persp()
m0 |> contour()
m1 |> persp()
m1 |> contour()

5.3 Compute Quantile Index Predictor

Linear and nonlinear quantile indices are the predictors in the functional models (Equation 5.1) and (Equation 5.2), respectively. Consider a conventional scenario in which we first fit a hyper_gam model to the training data set, and then use this training model to compute the quantile index predictors in the training and/or test data set.

First, we partition the 622 patients in hyper data frame Ki67q into a training data set with 498 patients and a test data set with 124 patients, i.e., an 80% versus 20% partition.

set.seed(16) # for a reproducible partition
id = Ki67q |> nrow() |> seq_len() |> caret::createDataPartition(p = .8)
Ki67q_0 = Ki67q[id[[1L]],] # training set
Ki67q_1 = Ki67q[-id[[1L]],] # test set
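
For readers without package caret, the idea of an 80% versus 20% partition of row indices can be sketched in base R. This is a simple random split, not the algorithm used by caret::createDataPartition(), so it will not reproduce the partition above; the names id_train and id_test are hypothetical, and n is the number of patients stated above.

```r
set.seed(16)
n = 622L # number of patients (rows) in Ki67q, as stated in the text

# simple random sample of 80% of the row indices (rounded up)
id_train = sample.int(n, size = ceiling(.8 * n))
# the remaining row indices form the test set
id_test = setdiff(seq_len(n), id_train)

c(train = length(id_train), test = length(id_test))
```

The two index sets are disjoint and together cover every row, which is the property the training/test workflow relies on.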

Next, we fit a functional generalized additive model to the training data set Ki67q_0,

Listing 5.6: Functional generalized additive model

m1a = hyper_gam(PFS ~ logKi67.quantile, nonlinear = TRUE, data = Ki67q_0)

We can, but we should not, use the quantile index predictors (Section 27.3) of the training data set for downstream analysis, because these quantile index predictors are optimized on the training data set and the results would be optimistically biased.

Optimistically biased!!

Ki67q_0[,c('PFS', 'age', 'race')] |> 
  as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
  data.frame(nlQI = predict(m1a, newdata = Ki67q_0)) |>
  coxph(formula = PFS ~ age + nlQI, data = _)
# Call:
# coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_0[, 
#     c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_0)))
# 
#           coef exp(coef)  se(coef)      z       p
# age  -0.023320  0.976950  0.008321 -2.803 0.00507
# nlQI  1.118712  3.060910  0.343152  3.260 0.00111
# 
# Likelihood ratio test=23.24  on 2 df, p=8.993e-06
# n= 498, number of events= 99

Instead, we should use the quantile index predictors computed in the test data set for downstream analysis,

Ki67q_1[,c('PFS', 'age', 'race')] |> 
  as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
  data.frame(nlQI = predict(m1a, newdata = Ki67q_1)) |>
  coxph(formula = PFS ~ age + nlQI, data = _)
# Call:
# coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_1[, 
#     c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_1)))
# 
#          coef exp(coef) se(coef)      z     p
# age  -0.01636   0.98377  0.01862 -0.879 0.380
# nlQI  1.21050   3.35517  0.75858  1.596 0.111
# 
# Likelihood ratio test=3.42  on 2 df, p=0.1805
# n= 124, number of events= 19
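
The optimism of evaluating a model on its own training data is not specific to quantile indices. The following base-R simulation is a sketch under stated assumptions: the predictors are pure noise, an ordinary linear model via stats::lm() stands in for hyper_gam(), and all object names are hypothetical. Even though no predictor is associated with the outcome, the training error looks better than the error on independent test data.

```r
set.seed(1)
n = 200L # observations per data set
k = 80L  # number of pure-noise predictors

# hypothetical simulator: outcome y is independent of all predictors
sim = function() data.frame(y = rnorm(n), matrix(rnorm(n * k), nrow = n))
train = sim()
test = sim()

fit = lm(y ~ ., data = train)

# mean squared error on the data used for fitting (optimistic)
mse_train = mean(residuals(fit)^2)
# mean squared error on independent data (honest)
mse_test = mean((test$y - predict(fit, newdata = test))^2)

c(train = mse_train, test = mse_test)
```

The same reasoning motivates computing the quantile index predictors in the test data set before any downstream analysis.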

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman and Hall/CRC Press. https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/p/book/9781482210200/.

Baddeley, Adrian, and Rolf Turner. 2005. “spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42. https://doi.org/10.18637/jss.v012.i06.

Cheng, Joe, Carson Sievert, Barret Schloerke, Winston Chang, Yihui Xie, and Jeff Allen. 2025. htmltools: Tools for HTML. https://CRAN.R-project.org/package=htmltools.

Cui, Erjia, Ciprian M. Crainiceanu, and Andrew Leroux. 2021. “Additive Functional Cox Model.” Journal of Computational and Graphical Statistics 30 (3): 780–93. https://doi.org/10.1080/10618600.2020.1853550.

Gellar, Jonathan E., Elizabeth Colantuoni, Dale M. Needham, and Ciprian M. Crainiceanu. 2015. “Cox Regression Models with Functional Covariates for Survival Data.” Statistical Modelling 15 (3): 256–78. https://doi.org/10.1177/1471082X14565526.

James, Gareth M. 2002. “Generalized Linear Models with Functional Predictors.” Journal of the Royal Statistical Society Series B: Statistical Methodology 64 (3): 411–32. https://doi.org/10.1111/1467-9868.00342.

McLean, Mathew W., Giles Hooker, Ana-Maria Staicu, Fabian Scheipl, and David Ruppert. 2014. “Functional Generalized Additive Models.” Journal of Computational and Graphical Statistics 23 (1): 249–69. https://doi.org/10.1080/10618600.2012.729985.

Reiss, Philip T., Jeff Goldsmith, Han Lin Shang, and R. Todd Ogden. 2017. “Methods for Scalar-on-Function Regression.” International Statistical Review 85 (2): 228–49. https://doi.org/10.1111/insr.12163.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com.

Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, Carson Sievert, and Kenton Russell. 2023. htmlwidgets: HTML Widgets for R. https://CRAN.R-project.org/package=htmlwidgets.

Wood, Simon N. 2003. “Thin-Plate Regression Splines.” Journal of the Royal Statistical Society Series B: Statistical Methodology 65 (1): 95–114. https://doi.org/10.1111/1467-9868.00374.

———. 2006. “Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models.” Biometrics 62 (4): 1025–36. https://doi.org/10.1111/j.1541-0420.2006.00574.x.

Yi, Misung, Tingting Zhan, Amy R. Peck, Jeffrey A. Hooke, Albert J. Kovatich, Craig D. Shriver, Hai Hu, Yunguang Sun, Hallgeir Rui, and Inna Chervoneva. 2023a. “Quantile Index Biomarkers Based on Single-Cell Expression Data.” Laboratory Investigation 103 (8): 100158. https://doi.org/10.1016/j.labinv.2023.100158.

———. 2023b. “Selection of Optimal Quantile Protein Biomarkers Based on Cell-Level Immunohistochemistry Data.” BMC Bioinformatics 24 (1): 298. https://doi.org/10.1186/s12859-023-05408-8.

Yi, Misung, Tingting Zhan, Hallgeir Rui, and Inna Chervoneva. 2025. “Functional Protein Biomarkers Based on Distributions of Expression Levels in Single-Cell Imaging Data.” Bioinformatics 41 (5): btaf182. https://doi.org/10.1093/bioinformatics/btaf182.

Zhan, Tingting, Misung Yi, and Inna Chervoneva. 2025. “Quantile Index Predictors Using R Package hyper.gam.” Bioinformatics 41 (8): btaf430. https://doi.org/10.1093/bioinformatics/btaf430.