Code
paste("mixR", packageVersion("mixR"))
[1] "mixR 0.2.0"
Code
paste("Rcpp", packageVersion("Rcpp"))
[1] "Rcpp 1.0.12"
Code
paste("bifurcatoR", packageVersion("bifurcatoR"))
[1] "bifurcatoR 0.99.20240320"
It is not uncommon for a genetic condition to show unexplained heterogeneity; the same genotype(s) can produce different effects in comparable individuals. A particularly vivid example comes from neurofibromatosis type 1: identical twins who look nothing alike.
Is this divergence restricted to rare predisposition syndromes? Studies of the TRIM28 gene in childhood obesity suggest it is not, and that genetic differences in epigenetic machinery may create stochastic branch-points in development. We might wonder if approaches to finding āsplitsā could be handy.
In short, we often ask āone lump or two?ā Mixture models can help answer this.
Modeling observed data as a mixture of groups is fairly intuitive, and the process is iterative:
In the bifurcatoR
package, we farm out the compute-intensive iteration to the mixR
package, because itās fast (using Rcpp
for loops) and it supports Weibull mixture models (as for survival analyses).
Itās a good idea to keep track of versions, in case something were to change.
When this document was first compiled, the versions used were:
paste("mixR", packageVersion("mixR"))
[1] "mixR 0.2.0"
paste("Rcpp", packageVersion("Rcpp"))
[1] "Rcpp 1.0.12"
paste("bifurcatoR", packageVersion("bifurcatoR"))
[1] "bifurcatoR 0.99.20240320"
Below you can find a quick review of mixture model math.
Formally, a finite mixture model is a statistical model where two or more random processes, usually with different averages, happen to best explain the patterns in observed data. If the processes are distinguished by something like a genotype or an observable phenotype, we might apply a statistical test (like Studentās t-test) to gauge how often weād see such differences by chance. But in our line of work, we often want to identify distinguishable lumps that do not correspond to labels such as genotypes. This is where mixture models come in handy. A simple example is a mixture of two Gaussian (Normal) distributions:
\[ p(x) = \Sigma^K_{k=1}\pi_k\mathcal{N}(x|\mu_k,\sigma_k) \]
If we only have two components (K=2), then the probability \(\pi_1\) of being in group 1 is \(1 - \pi_2\), where \(\pi_2\) is the probability of NOT being in it. Formally, \(\pi_k = Pr(K = k | X = x)\). In a simple two-component model, we can start from a guess for \(\pi_1\) based on k-means or cutting a dendrogram. We will seek parameters \(\theta_k = (\mu_k, \sigma_k)\) for each group \(k\) to maximize the probability of observing our actual data given our model.
The joint probability \(\prod_{i=1}^n p(x; \theta)\) of observing the data \(X\) given the proposed parameters \(\theta\) is referred to as the likelihood \(\mathcal{L}(\theta|X)\) of the model, with \(n\) observations and parameters \(\theta = (\theta_1, ..., \theta_K) = (\theta_1, \theta_2)\) for us.
\[ \mathcal{L}(\theta|X) = \prod_{i=1}^n \sum_{k=1}^k \pi_k N(x_i; \theta_k) = \prod_{i=1}^n (\pi_1 N(x_i; \theta_1) + (1 - \pi_1) N(x_i; \theta_2)) \]
Normally we could take logs, set the first and second derivatives to 0, and solve for appropriate values of \(\theta\), but since there is a sum in the way, we alternate back and forth between assigning each observation to group 1 or 2, then solving for the parameters \(\theta_1\) and \(\theta_2\) which maximize \(\mathcal{L}(\theta_k|X_i)\) for all \(X_i\) where \(K_i = k\). It turns out that
\(X\) need not be from a Normal (Gaussian) distribution for this to work, something that will be explored later.
Convergence can take a while if there are a many observations, which is why we use mixR
for the calculations. More on the mathematical details and cute math tricks can be found at Matt Stephensā Five Minute Stats.
Here we fit a mixture model to expression of EVI1 from the MECOM gene locus.
# identical to data(MLL, package="bifurcatoR")
library(bifurcatoR)
data(MLL, package="bifurcatoR")
<- read.csv(url("https://trichelab.github.io/data/MLL.csv"), row=1)
MLLcsv stopifnot(identical(MLL, MLLcsv))
<- MLLcsv
MLL
# fit mixture
library(mixR)
<- mixfit(MLL$MECOM, ncomp=2)
MLL_EVI1_fit plot(MLL_EVI1_fit, xlab="log(EVI1 transcripts)",
title="EVI1 expression in MLL-rearranged leukemia")
<- MLL_EVI1_fit$comp.prob # component probability, i.e., group
classification $EVI1high <- (apply(classification, 1, which.max) > 1) MLL
The above grouping is highly predictive of response to therapy in leukemia driven by MLL gene fusions:
library(survival)
library(survminer)
<- survfit(Surv(OS, OSI) ~ EVI1high, data=MLL)
mixFit ggsurvplot(mixFit, conf.int=TRUE, pval=TRUE, xlab="Overall survival (days)",
tables.theme = theme_cleantable(), palette=c("darkgreen","darkred"))
MLL-rearranged subjects from COG AML trials have the following characteristics:
library(table1)
<- function(x, ...) {
pvalue <- unlist(x)
y <- factor(rep(1:length(x), times=sapply(x, length)))
g if (is.numeric(y)) p <- t.test(y ~ g)$p.value
if (!is.numeric(y)) p <- chisq.test(table(y, g))$p.value
c("", sub("<", "<", format.pval(p, digits=3, eps=0.001)))
}
<- names(which(table(MLL$fusion) > 5))
common $AgeGroup <- relevel(relevel(factor(MLL$AgeGroup), 2), 3)
MLL$Fusion <- ifelse(MLL$fusion %in% common,
MLL$fusion, "Other MLL fusions")
MLL$Fusion <- relevel(relevel(relevel(factor(MLL$Fusion), 6), 4), 6)
MLLtable1(~ AgeGroup + Sex + Fusion | Protocol,
data=MLL, extra.col=list(p=pvalue))
AAML03P1 (N=8) |
AAML0531 (N=106) |
AAML1031 (N=248) |
Overall (N=362) |
p | |
---|---|---|---|---|---|
AgeGroup | |||||
Infant | 1 (12.5%) | 50 (47.2%) | 113 (45.6%) | 164 (45.3%) | 0.726 |
Child | 5 (62.5%) | 40 (37.7%) | 97 (39.1%) | 142 (39.2%) | |
AYA | 2 (25.0%) | 16 (15.1%) | 38 (15.3%) | 56 (15.5%) | |
Sex | |||||
Female | 5 (62.5%) | 57 (53.8%) | 111 (44.8%) | 173 (47.8%) | 0.372 |
Male | 3 (37.5%) | 49 (46.2%) | 137 (55.2%) | 189 (52.2%) | |
Fusion | |||||
KMT2A-MLLT3 | 1 (12.5%) | 26 (24.5%) | 92 (37.1%) | 119 (32.9%) | 0.38 |
KMT2A-MLLT10 | 1 (12.5%) | 26 (24.5%) | 54 (21.8%) | 81 (22.4%) | |
KMT2A-MLLT4 | 3 (37.5%) | 14 (13.2%) | 26 (10.5%) | 43 (11.9%) | |
KMT2A-ELL | 3 (37.5%) | 14 (13.2%) | 21 (8.5%) | 38 (10.5%) | |
KMT2A-MLLT1 | 0 (0%) | 4 (3.8%) | 14 (5.6%) | 18 (5.0%) | |
KMT2A-MLLT11 | 0 (0%) | 1 (0.9%) | 6 (2.4%) | 7 (1.9%) | |
KMT2A-SEPT6 | 0 (0%) | 4 (3.8%) | 8 (3.2%) | 12 (3.3%) | |
Other MLL fusions | 0 (0%) | 17 (16.0%) | 27 (10.9%) | 44 (12.2%) |
In the following sections, you can run (and edit) our code and data in your web browser to follow along (or test assumptions we have made in our analyses).
We often want to choose the most useful predictor of some outcome, not just the easiest to measure. In childhood leukemia, an example of this comes from the Mixed-Lineage Leukemia or MLL protein, encoded by the KMT2A gene on chromosome 11, band q23. Gene fusions involving MLL are quite common in infant leukemia, with dozens of partner genes, historically yielding poor outcomes. Over the past few decades, improvements in care have changed this to the point that most people now ask instead, āwhat determines if these drugs will work?ā As we shall see, a patientās genetic lesions arenāt always the best answer.
Above, we fitted a mixture model to expression of the EVI1 isoform of the MECOM gene locus, which has the peculiar property of acting as either an oncogene or a tumor suppressor, depending on which isoform is expressed. It looks like there really are just two major groups. Letās label the patients by which group they fall into. (Note that the high-expression group is smaller.) For a quick sanity check, letās also plot the group-wise expression levels.
In large clinical trials, the primary endpoint (the thing the trial is meant to improve upon) usually involves time to an event. If that event is death, we refer to the analysis as a survival analysis. We might ask whether the gene fusion partner is informative in terms of MLL-rearranged patient survival.
Well, thatās a big huge mess. Letās limit the analysis to fusion partners that appear more than 5 times in the data, since otherwise we donāt have much power.
Now weāre getting somewhere, but as the confidence intervals (pastel highlights) suggest, thereās a tremendous amount of overlap between some ācommonā fusions. For example, patients with SEPT6 fusions clearly do better than patients with MLLT4. Are there individual-level differences at play here?
Hmm. It looks like most SEPT6 fusion cases have low EVI1 expression, whereas most MLLT4 cases have high EVI1 expression. Perhaps this is worth modeling. Letās compare how well our EVI1/MECOM RNA expression predicts survival in MLL-rearranged leukemia cases, and letās see if this might be better than basing treatment on the fusion partner. (EVI1-high cells tend to be chemoresistant.)
It sure looks like EVI1/MECOM expression level predicts outcomes better. Of course, it might be even more straightforward to build these and other models if it turned out that the treatment outcome was binary, i.e. some patients will be cured and some wonāt. Then we could focus the majority of our research efforts on finding cures for those patients who donāt have any good treatment options. We can fit mixtures to time-to-event (survival) outcomes, too, but we will want to use a slightly different model, since the assumption underlying survival analysis is that the risk of having an event goes up exponentially over time as a function of the hazard rate. Here we will also entertain the possibility of there being more than two groups. (Most leukemia patients are considered cured after 5 years, after which point anyone who hasnāt succumbed to their disease is eventually lost to followup.)
It appears that MLL-rearranged patients fall into two outcome groups. For lack of better words, we will refer to these groups as standard-risk and high-risk. How well does EVI1/MECOM expression predict a patientās eventual outcome?
Not perfect, but certainly better than expected by chance, given the grouping. Obviously, there are many other possible predictors of outcome, and we include numerous gene expression levels along with patient age and (where available) blast counts (i.e., how many of the cells in their bone marrow are immature). For example, the CD34 and MECOM genes turn out to be jointly informative; and modeling the two together turns out to be more powerful than either alone. But having an idea of whether the data supports natural grouping sure is handy.