Modeling phenotypes with mixtures

Background: unexplained phenotypic variability

It is not uncommon for a genetic condition to show unexplained heterogeneity; the same genotype(s) can produce different effects in comparable individuals. A particularly vivid example comes from neurofibromatosis type 1: identical twins who look nothing alike.

Is this divergence restricted to rare predisposition syndromes? Studies of the TRIM28 gene in childhood obesity suggest it is not, and that genetic differences in epigenetic machinery may create stochastic branch-points in development. We might wonder if approaches to finding ‘splits’ could be handy.

In short, we often ask ‘one lump or two?’ Mixture models can help answer this.

Mixture modeling

Modeling observed data as a mixture of groups is fairly intuitive, and the process is iterative:

Old faithful geyser eruption delays and durations

In the bifurcatoR package, we farm out the compute-intensive iteration to the mixR package, because it’s fast (using Rcpp for loops) and it supports Weibull mixture models (as for survival analyses).

It’s a good idea to keep track of versions, in case something were to change.

When this document was first compiled, the versions used were:

Code

paste("mixR", packageVersion("mixR"))

[1] "mixR 0.2.0"

Code

paste("Rcpp", packageVersion("Rcpp"))

[1] "Rcpp 1.0.12"

Code

paste("bifurcatoR", packageVersion("bifurcatoR"))

[1] "bifurcatoR 0.99.20240320"

Mixture models: the math

Below you can find a quick review of mixture model math.

Mixture model math (click to expand)

Formally, a finite mixture model is a statistical model where two or more random processes, usually with different averages, happen to best explain the patterns in observed data. If the processes are distinguished by something like a genotype or an observable phenotype, we might apply a statistical test (like Student’s t-test) to gauge how often we’d see such differences by chance. But in our line of work, we often want to identify distinguishable lumps that do not correspond to labels such as genotypes. This is where mixture models come in handy. A simple example is a mixture of two Gaussian (Normal) distributions:

\[ p(x) = \Sigma^K_{k=1}\pi_k\mathcal{N}(x|\mu_k,\sigma_k) \]

If we only have two components (K=2), then the probability \(\pi_1\) of being in group 1 is \(1 - \pi_2\), where \(\pi_2\) is the probability of NOT being in it. Formally, \(\pi_k = Pr(K = k | X = x)\). In a simple two-component model, we can start from a guess for \(\pi_1\) based on k-means or cutting a dendrogram. We will seek parameters \(\theta_k = (\mu_k, \sigma_k)\) for each group \(k\) to maximize the probability of observing our actual data given our model.

The joint probability \(\prod_{i=1}^n p(x; \theta)\) of observing the data \(X\) given the proposed parameters \(\theta\) is referred to as the likelihood \(\mathcal{L}(\theta|X)\) of the model, with \(n\) observations and parameters \(\theta = (\theta_1, ..., \theta_K) = (\theta_1, \theta_2)\) for us.

\[ \mathcal{L}(\theta|X) = \prod_{i=1}^n \sum_{k=1}^k \pi_k N(x_i; \theta_k) = \prod_{i=1}^n (\pi_1 N(x_i; \theta_1) + (1 - \pi_1) N(x_i; \theta_2)) \]

Normally we could take logs, set the first and second derivatives to 0, and solve for appropriate values of \(\theta\), but since there is a sum in the way, we alternate back and forth between assigning each observation to group 1 or 2, then solving for the parameters \(\theta_1\) and \(\theta_2\) which maximize \(\mathcal{L}(\theta_k|X_i)\) for all \(X_i\) where \(K_i = k\). It turns out that
\(X\) need not be from a Normal (Gaussian) distribution for this to work, something that will be explored later.

Convergence can take a while if there are a many observations, which is why we use mixR for the calculations. More on the mathematical details and cute math tricks can be found at Matt Stephens’ Five Minute Stats.

A preview from clinical trials

Here we fit a mixture model to expression of EVI1 from the MECOM gene locus.

Code

# identical to data(MLL, package="bifurcatoR")
library(bifurcatoR)
data(MLL, package="bifurcatoR")
MLLcsv <- read.csv(url("https://trichelab.github.io/data/MLL.csv"), row=1)
stopifnot(identical(MLL, MLLcsv))
MLL <- MLLcsv

# fit mixture 
library(mixR)
MLL_EVI1_fit <- mixfit(MLL$MECOM, ncomp=2)
plot(MLL_EVI1_fit, xlab="log(EVI1 transcripts)",
     title="EVI1 expression in MLL-rearranged leukemia")

Code

classification <- MLL_EVI1_fit$comp.prob # component probability, i.e., group
MLL$EVI1high <- (apply(classification, 1, which.max) > 1)

The above grouping is highly predictive of response to therapy in leukemia driven by MLL gene fusions:

Code

library(survival) 
library(survminer)
mixFit <- survfit(Surv(OS, OSI) ~ EVI1high, data=MLL)
ggsurvplot(mixFit, conf.int=TRUE, pval=TRUE, xlab="Overall survival (days)",
      tables.theme = theme_cleantable(), palette=c("darkgreen","darkred"))

Table 1

MLL-rearranged subjects from COG AML trials have the following characteristics:

Code

library(table1) 
pvalue <- function(x, ...) {
  y <- unlist(x)
  g <- factor(rep(1:length(x), times=sapply(x, length)))
  if (is.numeric(y)) p <- t.test(y ~ g)$p.value
  if (!is.numeric(y)) p <- chisq.test(table(y, g))$p.value
  c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

common <- names(which(table(MLL$fusion) > 5))
MLL$AgeGroup <- relevel(relevel(factor(MLL$AgeGroup), 2), 3)
MLL$Fusion <- ifelse(MLL$fusion %in% common, 
                     MLL$fusion, "Other MLL fusions")
MLL$Fusion <- relevel(relevel(relevel(factor(MLL$Fusion), 6), 4), 6)
table1(~ AgeGroup + Sex + Fusion | Protocol,
       data=MLL, extra.col=list(p=pvalue))

	AAML03P1 (N=8)	AAML0531 (N=106)	AAML1031 (N=248)	Overall (N=362)	p
AgeGroup
Infant	1 (12.5%)	50 (47.2%)	113 (45.6%)	164 (45.3%)	0.726
Child	5 (62.5%)	40 (37.7%)	97 (39.1%)	142 (39.2%)
AYA	2 (25.0%)	16 (15.1%)	38 (15.3%)	56 (15.5%)
Sex
Female	5 (62.5%)	57 (53.8%)	111 (44.8%)	173 (47.8%)	0.372
Male	3 (37.5%)	49 (46.2%)	137 (55.2%)	189 (52.2%)
Fusion
KMT2A-MLLT3	1 (12.5%)	26 (24.5%)	92 (37.1%)	119 (32.9%)	0.38
KMT2A-MLLT10	1 (12.5%)	26 (24.5%)	54 (21.8%)	81 (22.4%)
KMT2A-MLLT4	3 (37.5%)	14 (13.2%)	26 (10.5%)	43 (11.9%)
KMT2A-ELL	3 (37.5%)	14 (13.2%)	21 (8.5%)	38 (10.5%)
KMT2A-MLLT1	0 (0%)	4 (3.8%)	14 (5.6%)	18 (5.0%)
KMT2A-MLLT11	0 (0%)	1 (0.9%)	6 (2.4%)	7 (1.9%)
KMT2A-SEPT6	0 (0%)	4 (3.8%)	8 (3.2%)	12 (3.3%)
Other MLL fusions	0 (0%)	17 (16.0%)	27 (10.9%)	44 (12.2%)

In the following sections, you can run (and edit) our code and data in your web browser to follow along (or test assumptions we have made in our analyses).

Non-genetic heterogeneity in leukemia

We often want to choose the most useful predictor of some outcome, not just the easiest to measure. In childhood leukemia, an example of this comes from the Mixed-Lineage Leukemia or MLL protein, encoded by the KMT2A gene on chromosome 11, band q23. Gene fusions involving MLL are quite common in infant leukemia, with dozens of partner genes, historically yielding poor outcomes. Over the past few decades, improvements in care have changed this to the point that most people now ask instead, “what determines if these drugs will work?” As we shall see, a patient’s genetic lesions aren’t always the best answer.

Thirty years of trial data

Above, we fitted a mixture model to expression of the EVI1 isoform of the MECOM gene locus, which has the peculiar property of acting as either an oncogene or a tumor suppressor, depending on which isoform is expressed. It looks like there really are just two major groups. Let’s label the patients by which group they fall into. (Note that the high-expression group is smaller.) For a quick sanity check, let’s also plot the group-wise expression levels.

A survival analysis of gene fusions

In large clinical trials, the primary endpoint (the thing the trial is meant to improve upon) usually involves time to an event. If that event is death, we refer to the analysis as a survival analysis. We might ask whether the gene fusion partner is informative in terms of MLL-rearranged patient survival.

Well, that’s a big huge mess. Let’s limit the analysis to fusion partners that appear more than 5 times in the data, since otherwise we don’t have much power.

A survival analysis of common MLL fusions

Now we’re getting somewhere, but as the confidence intervals (pastel highlights) suggest, there’s a tremendous amount of overlap between some ‘common’ fusions. For example, patients with SEPT6 fusions clearly do better than patients with MLLT4. Are there individual-level differences at play here?

Examining evidence for a mixture

Hmm. It looks like most SEPT6 fusion cases have low EVI1 expression, whereas most MLLT4 cases have high EVI1 expression. Perhaps this is worth modeling. Let’s compare how well our EVI1/MECOM RNA expression predicts survival in MLL-rearranged leukemia cases, and let’s see if this might be better than basing treatment on the fusion partner. (EVI1-high cells tend to be chemoresistant.)

A mixture of outcomes

It sure looks like EVI1/MECOM expression level predicts outcomes better. Of course, it might be even more straightforward to build these and other models if it turned out that the treatment outcome was binary, i.e. some patients will be cured and some won’t. Then we could focus the majority of our research efforts on finding cures for those patients who don’t have any good treatment options. We can fit mixtures to time-to-event (survival) outcomes, too, but we will want to use a slightly different model, since the assumption underlying survival analysis is that the risk of having an event goes up exponentially over time as a function of the hazard rate. Here we will also entertain the possibility of there being more than two groups. (Most leukemia patients are considered cured after 5 years, after which point anyone who hasn’t succumbed to their disease is eventually lost to followup.)

A mixture of survival distributions

It appears that MLL-rearranged patients fall into two outcome groups. For lack of better words, we will refer to these groups as standard-risk and high-risk. How well does EVI1/MECOM expression predict a patient’s eventual outcome?

A test of phenotypic and outcome fits

Not perfect, but certainly better than expected by chance, given the grouping. Obviously, there are many other possible predictors of outcome, and we include numerous gene expression levels along with patient age and (where available) blast counts (i.e., how many of the cells in their bone marrow are immature). For example, the CD34 and MECOM genes turn out to be jointly informative; and modeling the two together turns out to be more powerful than either alone. But having an idea of whether the data supports natural grouping sure is handy.