Germline variant analysis via resampling

Background: acute leukemia as a model for cancer

Childhood cancers, of which acute leukemia is the most common, present both a clinical need and a model for cancer more generally. These tumors often lack recurrent gene mutations, instead driven by structural variants (SVs) which rewire multiple genes. Many SVs are missed by standard karyotyping (chromosomal analysis), and all raise a key question: where do these SVs come from?

A seemingly straightforward answer is that mammals have three deeply conserved genes, millions of years old, whose function is to cut-and-paste DNA, and when these genes are inappropriately activated, structural variants ensue. The precise nature of these cut-and-paste enzymes (RAG1, PGBD5, and THAP9) addresses a key difference between pediatric and adult tumors (namely, the fact that most pediatric tumors harbor very few mutations). RAG1 is critical for immune cell maturation, as it allows B and T (lymphoid) cells to activate in response to specific non-self proteins. However, in lymphoblastic leukemia and lymphoma (which together consitute the most common type of childhood cancer), off-target activity of RAG1 is sufficient to induce some of the same gene fusions seen in patients. Meanwhile, in most childhood solid tumors, the PGBD5 transposase gene is inappropriately activated, leading to the type of cut-and-paste SVs that characterize ultra-low-mutation-burden rhabdoid tumors.

But there is at least one major exception: none of the known transposase genes appear to be activated in acute myeloid leukemia (AML), nor are classical tumor suppressor genes like TP53 often mutated or deleted in pediatric AML patients. Why is AML different from most other childhood tumors? Could inherited genetic variants explain the missing mechanism in AML (and especially pediatric AML)?

A preview

For counts, such as pathogenic germline variants in leukemia, we can use Poisson models:

Code

combined <- read.csv(url("https://trichelab.github.io/data/combined.csv"),row=1)
combined$Source <- relevel(relevel(relevel(factor(combined$Source), 4), 4), 4)
combined$Lineage <- relevel(relevel(factor(combined$Lineage), 3), 3)
combined$Group <- paste(combined$AgeGroup, gsub("\\-", "", combined$Lineage))
combined$Group <- relevel(relevel(factor(combined$Group), 4), 4)
with(combined, table(Group, Sex))
##                 Sex
## Group            Female Male
##   Pediatric BALL    789  941
##   Pediatric TALL    108  322
##   Adult AML         140  200
##   Pediatric AML     251  268
fit2 <- glm(G_PLP ~ Group + Source, data=combined, family=poisson)
summary(fit2)
## 
## Call:
## glm(formula = G_PLP ~ Group + Source, family = poisson, data = combined)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.05445    0.18186 -22.295   <2e-16 ***
## GroupPediatric TALL  0.32892    0.35170   0.935   0.3497    
## GroupAdult AML       0.07997    1.12443   0.071   0.9433    
## GroupPediatric AML   0.49548    1.00940   0.491   0.6235    
## SourceTARGET         1.06301    1.02985   1.032   0.3020    
## SourceUBTF           1.24533    1.06695   1.167   0.2431    
## SourceBEAT_AML       1.23658    1.10782   1.116   0.2643    
## SourceZeroCancer     1.68716    0.72613   2.324   0.0202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 739.77  on 3018  degrees of freedom
## Residual deviance: 680.26  on 3011  degrees of freedom
## AIC: 910.72
## 
## Number of Fisher Scoring iterations: 6

library(nloptr)
library(effects)
plot(predictorEffects(fit2), ylab="Burden", axes=list(x=list(rotate=35)))

However, sometimes we are curious if a natural boundary exists in the data, suggesting that a more than one process is generating the observations. We can use resampling and mixture models to probe this:

Code

byGroup <- split(combined$G_PLP, combined$Group)
rs <- function(y, n, m, slop=20) {
  s <- n + (round(slop/2)) - sample(seq_len(slop), 1)
  replicate(m, sum(y[sample.int(n=length(y), size=s)] > 0) / s)
}

set.seed(1234)
library(reshape2)
res <- data.frame(lapply(byGroup, rs, n=300, m=3333, slop=20)) 
results <- melt(res, variable.name="Group", value.name="Burden")

library(mixR) 
library(ggplot2)
mixres <- select(results$Burden, ncomp=2:4)
ncomp <- mixres$ncomp[mixres$best == "*"]
fit <- mixfit(results$Burden, ncomp=ncomp)
p1 <- plot(fit, 
           title=paste("Germline P/LP variant burden across", 
                        nrow(combined), "leukemia patients"), 
           xlab="Germline burden",
           theme="minimal") + 
           scale_x_continuous(labels = scales::label_percent(), 
                              limits = c(0, 0.15))
p1

We might ask whether the observed cohorts split along roughly similar lines:

Code

library(ggplot2)
library(ggbeeswarm)
p2 <- ggplot(results, aes(x=Group, y=Burden, color=Group)) + 
        scale_y_continuous(labels = scales::percent, limits = c(0, 0.15),
                           breaks = seq(0, .15, .025)) + 
        ylab("Germline burden") +
        geom_hline(yintercept=0.05, lty=3) + 
        geom_quasirandom() + 
        theme_classic() + 
        theme(legend.position="none", 
              axis.text.x = element_text(angle = 35, vjust = 1, hjust=1))
p2

Finally, we might want to plot the two results alongside one another.

Code

library(patchwork)
p1 + p2

Below, we will walk through each of these steps.

Modeling germline burden with Poisson regression

We have collected and harmonized several large WGS/WES studies to investigate evidence for disparities in genetic variants between acute leukemia types. Here, we shall focus on germline pathogenic/likely pathogenic (P/LP) variants with seemingly relevant consequences for blood and immune development. Counts of these (somewhat rare) germline P/LP variants are integers, so we use Poisson regression for a first pass to model variant counts across groups.

It seems that, after correcting for who called the variants, there are a wide range of estimates for the relative risk of germline P/LP variants by group. B-ALL seems to have the lowest burden, but beyond that, it’s tough to make any strong conclusions about what’s going on. Let’s try another way.

Resampling-based inference of germline burden

We note that AML seems to predict a higher germline variant load than ALL. But suppose the Poisson model just isn’t a good fit for the data, or different authors apply different criteria, or something else is wrong with the way we have modeled the occurrence of germline variants. We can apply an orthogonal brute-force approach, which is to resample over and over without replacement. Note: we are tabulating the fraction of subjects with 1 or more G_PLP variants, rather than estimating the probability of seeing any G_PLP variants in a group of cases, which is subtly different from Poisson regression. In the end, we arrive at similar conclusions (specifically, More Research Is Needed :tm:).

When you ran the code in the Poisson regression chunks above (and you did run the code, right? That’s the whole point of this exercise), you may have noticed a commented-out piece of code to make effect size plots. The same approach we took to resampling can be applied to all sorts of questions, and in fact the sampling package on CRAN is dedicated to implementing fancy designs for biased sampling.

Exercise

Can you find examples where our predictions are inaccurate? Are there specific cases with high variant burdens not explained by any fit?

Modeling human biology is complicated (you heard it here first). Nevertheless, it is somewhat unlikely that AML and ALL would show such significant differences in clinically annotated germline variant frequency across five different cohorts (each comprising multiple clinical trials, across decades of studies) by chance. It seems that relevant germline variants are found in at least 5% of AML cases, and the > 5% burden in AML consistently emerges across age groups and studies. Moreover, pediatric AML is the group of patients where this threshold is most consistently crossed by random samples of patients from existing cohorts.

We have previously proposed that all patients who present with acute leukemia should be offered germline genetic testing, as should their family members, in light of the role for related donors in stem cell transplantation. The above findings serve to strengthen this statement: given the cost (median of about $1.1M US) to treat a child with AML, successfully or otherwise, the cost of germline genetic testing (about $400 for whole-genome sequencing in 2024) is a drop in the ocean. Registries like Project EveryChild ease the challenge of historical comparison, so one might ask, what’s standing in the way?

Session information

Code

sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] patchwork_1.2.0  ggbeeswarm_0.7.2 ggplot2_3.5.0    mixR_0.2.0      
## [5] reshape2_1.4.4   effects_4.2-2    carData_3.0-5    nloptr_2.0.3    
## 
## loaded via a namespace (and not attached):
##  [1] generics_0.1.3    utf8_1.2.4        bspm_0.5.5        stringi_1.8.3    
##  [5] lattice_0.22-5    lme4_1.1-35.1     digest_0.6.35     magrittr_2.0.3   
##  [9] evaluate_0.23     grid_4.3.3        estimability_1.5  fastmap_1.1.1    
## [13] plyr_1.8.9        jsonlite_1.8.8    Matrix_1.6-5      nnet_7.3-19      
## [17] DBI_1.2.2         survival_3.5-8    fansi_1.0.6       scales_1.3.0     
## [21] cli_3.6.2         mitools_2.4       rlang_1.1.3       munsell_0.5.0    
## [25] splines_4.3.3     withr_3.0.0       yaml_2.3.8        tools_4.3.3      
## [29] dplyr_1.1.4       minqa_1.2.6       colorspace_2.1-0  boot_1.3-30      
## [33] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   stringr_1.5.1    
## [37] htmlwidgets_1.6.4 vipor_0.4.7       MASS_7.3-60.0.1   insight_0.19.8   
## [41] beeswarm_0.4.0    pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.4     
## [45] glue_1.7.0        Rcpp_1.0.12       tidyselect_1.2.1  xfun_0.42        
## [49] tibble_3.2.1      knitr_1.45        farver_2.1.1      htmltools_0.5.7  
## [53] nlme_3.1-164      survey_4.4-1      labeling_0.4.3    rmarkdown_2.26   
## [57] compiler_4.3.3