Predicting cancer survival with machine learning

Treatment regimens in cancer involve a significant tradeoff between efficacy and side effects. To assign a patient an optimal therapy, it is crucial to be able to predict their risk level soon after diagnosis. The increase in molecular data generated in a clinical setting enables ever more accurate predictions, but sophisticated approaches are required to train the predictive algorithms. Continuing with the machine learning topic of our previous blog post, we now focus on predicting cancer survival with machine learning, using a method called random forest as an example.

Key points

  • Machine learning enables stratifying patients by their risk based on high-throughput data, such as genome-wide expression and mutation data
  • Random forest is a robust and popular algorithm designed to increase the generalizability of simpler models
  • To predict survival, both the predictive algorithm and the performance metrics must be able to take censoring into account
  • It is important to evaluate the predictive model's performance on test data which has not been used in training the model


Last time we introduced the key concepts in machine learning, considering a simple case of predicting treatment response. With the basics thus covered, we will now take a closer look at the inner workings of a predictive algorithm — this time in the context of predicting survival.

Survival analyses are commonly conducted using Kaplan-Meier estimators (on single predictors, or input variables) and Cox regression (on a handful of predictors). Such approaches do not, however, readily apply to high-dimensional data with complex, non-linear interactions between predictors, such as genome-wide measurements. Machine learning is designed for just this type of problem.

Seeing the tree for the forest

One common machine learning algorithm adapted to survival analysis is called random forest. To understand how a random forest works, let us first consider the simpler algorithm on which it is based: a decision tree.

The figure below shows a toy example of a decision tree being trained on a set of six patients, three of whom are alive and three deceased (after, say, five years from diagnosis). Using the expression levels of just two genes quantified from a primary tumor biopsy, the trained tree is able to perfectly separate the two patient groups. However, the tree’s performance on test data (eight new patients that were not used in training the model) is worse, with two patients being misclassified. This depicts a central challenge in machine learning: a predictive model may perform very well on training data, but poorly on unseen test data. Poor generalizability can result from overfitting or from a training data set that is biased or just too small.
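To make the idea of training a decision node more concrete, here is a minimal sketch of how a tree might pick the threshold for one gene. It assumes the Gini impurity as the splitting criterion (our post does not depend on a specific criterion), and the expression values and the names `gini` and `best_split` are made up for illustration.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (0 = alive, 1 = deceased)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, weighted impurity) of the best split on one gene."""
    best = (float("nan"), float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds lie halfway between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best


# Made-up expression values of one gene for six patients (hypothetical data).
gene_a = [1.2, 0.8, 1.0, 3.1, 2.7, 3.5]
status = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(gene_a, status)
```

On this toy input the chosen threshold falls between the two patient groups and the impurity drops to zero, which is exactly the kind of perfect fit on training data that may not carry over to new patients.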

What is random forest?

It turns out that overfitting is a known problem of decision trees, and random forest has been developed to counteract just that. A random forest consists of multiple decision trees which are trained by randomly subsampling both the training data (patients, in our example) and predictors (genes). This results in a collection of decision trees which are all biased but, importantly, each in its own way. A prediction produced by a random forest is a combination (majority vote, for instance) of the predictions of its individual trees. Random forest is thus a so-called ensemble classifier, a wisdom-of-the-crowds algorithm. This approach effectively reduces the overfitting problem of individual decision trees.
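The recipe above (random bags of patients, randomly chosen genes, majority vote) can be sketched in a few lines. This is a deliberately simplified illustration, not a production implementation: each "tree" is a single split on one random gene, whereas real random forests grow deeper trees and typically draw a random subset of predictors at every node. The data and the names `train_stump`, `train_forest` and `predict` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Patient = Sequence[float]        # expression values, one per gene
Tree = Callable[[Patient], int]  # returns 0 (alive) or 1 (deceased)


def train_stump(X: List[Patient], y: List[int], gene: int) -> Tree:
    """Fit a one-split tree on one gene by minimizing training errors."""
    best_err, best_thr, best_hi = len(y) + 1, 0.0, 1
    for thr in sorted({x[gene] for x in X}):
        for hi in (0, 1):
            # Predict `hi` above the threshold and the other class otherwise.
            err = sum((hi if x[gene] > thr else 1 - hi) != label
                      for x, label in zip(X, y))
            if err < best_err:
                best_err, best_thr, best_hi = err, thr, hi
    return lambda p: best_hi if p[gene] > best_thr else 1 - best_hi


def train_forest(X: List[Patient], y: List[int],
                 n_trees: int = 25, seed: int = 0) -> List[Tree]:
    """Train each stump on a bootstrap sample (bag) and a random gene."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(len(X)) for _ in X]  # patients, with replacement
        gene = rng.randrange(len(X[0]))           # one randomly chosen predictor
        forest.append(train_stump([X[i] for i in bag], [y[i] for i in bag], gene))
    return forest


def predict(forest: List[Tree], patient: Patient) -> int:
    """Combine the trees by majority vote."""
    return Counter(tree(patient) for tree in forest).most_common(1)[0][0]


# Made-up expression values of two genes for six patients (hypothetical data).
X_train = [[1.0, 1.4], [1.2, 0.9], [0.8, 1.1], [3.0, 3.3], [3.4, 3.0], [3.1, 3.6]]
y_train = [0, 0, 0, 1, 1, 1]
forest = train_forest(X_train, y_train)
prediction = predict(forest, [3.0, 3.0])
```

An odd number of trees is used here so that a two-class majority vote can never end in a tie.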

The figure below illustrates a random forest (consisting of just three little trees) trained on our toy data. Each tree is trained using a random sample of the full training set (a bag), and each decision node of each tree is defined by a randomly selected gene. While the trees perform well on the patients in their bag (tree-specific training data), they perform poorly on out-of-bag patients. Clearly, restricting access to data in training a tree decreases its performance, which should not surprise us!

What may surprise us, however, is that the combined prediction of these poor trees on the test data is great — indeed better than that of the simple decision tree model in the first case. The figure below illustrates how individual trees contribute to the ensemble prediction of a random forest, resulting in a combined decision boundary in the space defined by predictors, or input variables.

From toy data to real-world data

While our example data above consists of fewer patients and measurements than is required for any real-world prediction problem, it gives us an idea of how an algorithm may derive predictions of a target variable (vital status, in the example) based on input variables (gene expression values). Furthermore, it highlights the challenge of generalizability and how machine learning algorithms can be resilient against overfitting by design.

Before demonstrating survival predictions on real-world data, we must acknowledge another simplification in our toy data. We assumed a simple dead-or-alive classification task. In reality, survival data is more complicated: patients are either dead or censored. Censored patients are alive as per their last follow-up, after which they might have died soon or lived happily ever after, but we don’t know. This so-called right-censoring is a form of missingness that any survival analysis must take into account. The figure on the right illustrates survival data: the time from diagnosis to either death or last follow-up is known, and the patients alive at their last follow-up are considered censored.
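In code, right-censored survival data is usually stored as a pair of values per patient: a follow-up time and a flag telling whether death was observed. The sketch below uses that representation to compute a Kaplan-Meier estimate, showing how censored patients still count as "at risk" until their last follow-up. The follow-up times are made up and the function name `kaplan_meier` is our own.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) steps at each observed death time."""
    curve = []
    survival = 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        # Everyone still in follow-up at time t is at risk, censored or not.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up in months; True = death observed, False = censored.
times = [6.0, 12.0, 12.0, 20.0, 30.0]
events = [True, False, True, True, False]
curve = kaplan_meier(times, events)
```

For this toy cohort the estimated survival drops to roughly 0.8, 0.6 and 0.3 at 6, 12 and 20 months. The patient censored at 12 months never causes a drop, but does shrink the at-risk group afterwards.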

An adaptation of random forests to survival prediction, random survival forests (RSF), applies to just this type of data. Instead of categorizing patients into dead or alive, it aims at stratifying the patients based on their estimated risk. The prediction of such a model, thus, is not a binary classification but a continuous risk score.

Evaluating a model

Since the model output (prediction) is a continuous risk score and not an easily verifiable dead-or-alive category, how does one estimate the model’s performance? One good answer is the c-index, or concordance index. The c-index is calculated as the fraction of the pairs of patients in which the one with a higher predicted risk actually died before the one with a lower predicted risk. (A pair can only be evaluated if we know which patient died first. If the patient with the shorter follow-up is censored, the order is unknown, and such pairs are omitted from the calculation. This is a limitation due to censoring that we have to accept.) A c-index of 0.5 corresponds to a random, good-for-nothing predictor, whereas 1 represents a perfect result as far as ordering patients by risk goes. This interpretation of the c-index might remind you of that of AUC discussed in our last post. In fact, the c-index is a generalization of AUC to non-binary prediction problems.
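As a rough illustration, the c-index can be computed by looping over all patient pairs. The sketch below follows the common Harrell-style definition, in which a pair is counted only when the patient with the shorter observed time has died. Tied risk scores count as half concordant and pairs with tied times are skipped, both of which are simplifying assumptions. The data and the name `concordance_index` are hypothetical.

```python
from typing import List


def concordance_index(times: List[float], events: List[bool],
                      risks: List[float]) -> float:
    """Fraction of comparable pairs in which the higher-risk patient died first."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the one known to die first
        for j in range(len(times)):
            if times[i] < times[j]:  # patient i died before j's last record
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up (months), death flags and predicted risk scores.
times = [6.0, 12.0, 12.0, 20.0, 30.0]
events = [True, False, True, True, False]
risks = [0.9, 0.4, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risks)  # 6 of 7 comparable pairs agree
```

Here only 7 of the 10 possible pairs are comparable. The single discordant pair is the patient who died at 12 months with a lower predicted risk than the one who died at 20 months.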

Another way to evaluate a model’s performance is to group patients by their predicted risks, plot the group-specific Kaplan-Meier estimators and compare the groups using a log-rank test. The better the predictive model, the clearer the separation between high- and low-risk groups. It is important to evaluate these performance metrics based on predictions on test data, not just training data. The former is likely to be worse, yet a lot more informative than the latter.
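For readers curious about what the log-rank test computes, here is a minimal two-group sketch. At every death time it compares the observed number of deaths in one group with the number expected if both groups shared the same risk, and turns the accumulated difference into a chi-square statistic with one degree of freedom. The function name `logrank_test` and the example data are our own assumptions, and an established statistics package is preferable for real analyses.

```python
import math
from typing import List, Tuple


def logrank_test(times_a: List[float], events_a: List[bool],
                 times_b: List[float], events_b: List[bool]) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    pooled = ([(t, e, 0) for t, e in zip(times_a, events_a)]
              + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e}):
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in pooled if e and u == t)         # deaths at time t
        d_a = sum(1 for u, e, g in pooled if e and u == t and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("not enough events to compare the groups")
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Hypothetical high-risk and low-risk groups (months to death, all observed).
high = [2.0, 3.0, 4.0, 5.0, 6.0]
low = [20.0, 21.0, 22.0, 23.0, 24.0]
chi2, p_value = logrank_test(high, [True] * 5, low, [True] * 5)
```

With such clearly separated groups the p-value is small. Two identical groups would instead give a statistic of zero and a p-value of one.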

Predicting liver cancer survival with random survival forest

The figure below shows results of stratifying liver cancer patients by the gene expression profiles of their primary tumors, using public data from The Cancer Genome Atlas (TCGA). Of the 369 liver cancer patients from the TCGA-LIHC data set, we randomly selected 80%, or 269 patients, as a training cohort to build a random survival forest. The remaining 20%, or 73 patients, were kept as a test cohort. The risk stratification on the training data is, unsurprisingly, great. The c-index is 0.925 and a binary grouping by the median risk yields risk groups with clearly distinct survival patterns and a log-rank p-value below 0.0001.
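The two bookkeeping steps mentioned above, randomly holding out a test cohort and grouping patients by the median predicted risk, are simple to express in code. The sketch below is a generic illustration with made-up patient identifiers and risk scores, not the exact procedure used for the figure. The names `split_cohort` and `median_risk_groups` are hypothetical.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def split_cohort(patient_ids: Sequence[str], train_fraction: float = 0.8,
                 seed: int = 42) -> Tuple[List[str], List[str]]:
    """Randomly split patients into non-overlapping training and test cohorts."""
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def median_risk_groups(risk: Dict[str, float]) -> Dict[str, str]:
    """Label each patient 'high' or 'low' relative to the median predicted risk."""
    cutoff = statistics.median(risk.values())
    return {pid: ("high" if r > cutoff else "low") for pid, r in risk.items()}


# Hypothetical cohort of 100 patients and a few made-up risk scores.
ids = [f"P{i:03d}" for i in range(100)]
train_ids, test_ids = split_cohort(ids)
groups = median_risk_groups({"a": 0.1, "b": 0.5, "c": 0.9, "d": 0.7})
```

Fixing the random seed keeps the split reproducible, so that the test cohort stays untouched across repeated training runs.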

So how well does this model generalize? The predictions on the test set of 73 patients have a c-index of 0.61, and the binary grouping yields a p-value of 0.041. The performance is clearly better than nothing, but worse than on the training data. This highlights the importance of using separate test data to evaluate the model. Performance on training data, after all, tells very little about the applicability of the built model and identified biomarkers in eventual clinical use.

Beyond gene expression

In our examples above, we only considered gene expression levels as predictors, or potential biomarkers. An important benefit of machine learning models such as random forests is that they make it easy to integrate any available data, such as clinical variables and mutations, as inputs to the model. More data generally means better predictions, as long as the applied training and validation schemes ensure generalizability!

Learn more

Our previous blog post, which introduces concepts of machine learning with a focus on treatment response prediction.

A paper from our customer, in which we trained random forest models to compare the predictive power of various transcript panels for prostate cancer prognosis:
Lehto, T. K., Stürenberg, C., Malén, A., Erickson, A. M., Koistinen, H., Mills, I. G., Rannikko, A., & Mirtti, T. (2021). Transcript analysis of commercial prostate cancer risk stratification panels in hard-to-predict grade group 2-4 prostate cancers. The Prostate, 10.1002/pros.24108. Advance online publication.

An interview with our customer, for whom we trained machine learning models to categorize cancer patients into risk groups.

Are you interested in having machine learning applied to your data? Leave us a message, and we can discuss how we can help you!