Predicting treatment response with machine learning
High-throughput molecular measurements are becoming more common in clinical trials. Coupled with advances in machine learning, this development opens up possibilities for predicting a priori whether a patient will benefit from a treatment, based on algorithms trained on large datasets. Here, we introduce the key concepts and common approaches in training such algorithms.
- Applying machine learning to data from clinical trials enables biomarker discovery and algorithmically tailored treatment.
- Predictive models vary in complexity: simple methods, such as logistic regression, are interpretable and work with small sample sizes, whereas complex methods, such as deep learning, are more powerful but require large sample sizes. Intermediate methods, such as random forest, strike a balance between the extremes.
- Overfitting, or a model’s poor generalizability to data it was not trained on, is a central challenge in machine learning. Feature selection and model regularization are used to limit overfitting, and cross-validation is used to detect it.
- Area under the ROC curve (AUC) is a common and robust measure of a predictive model’s performance. The ROC curve itself highlights the tradeoff between sensitivity and specificity.
- The complexity of a predictive model and the feasibility of quantifying its biomarkers in a clinical setting affect the model's clinical applicability.
The aim in applying machine learning to clinical data is to build a computational model that predicts a target variable (such as response to treatment) from input variables (any available demographic, clinical and molecular data) in a patient-specific manner. In addition to the predictive model, such an analysis provides answers to the following key questions:
How well can treatment response be predicted from the input variables?
Which input variables are predictive biomarkers?
Our focus here is on predicting treatment response, but most of what follows also applies to predicting other clinical variables, such as patient survival.
Training the model
Typically, the target variable is treated as a binary variable, e.g., 0 = did not respond and 1 = did respond to treatment. The model is algorithmically trained to predict the target variable based on the input variables.
In the training phase, an algorithm uses both the input variables and target variable to model the relationship between them (see figure below). Any data used in the training phase is called training data. After the model is trained, it can be used to predict the target variable using only the input variables. Importantly, the model’s performance can be estimated on validation data that was not used in the training.
Even when the target variable is binary, the model prediction is typically probabilistic — a number between 0 and 1. A threshold can be defined to binarize the prediction. Setting a threshold involves a tradeoff between sensitivity and specificity (see section Model validation and performance estimation).
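The effect of the threshold can be illustrated with a minimal sketch. The ten predicted probabilities and known responses below are hypothetical values chosen only for illustration, and the function names are our own.

```python
# Hypothetical predicted response probabilities for ten patients (illustrative only).
predicted = [0.92, 0.81, 0.77, 0.64, 0.58, 0.45, 0.39, 0.30, 0.22, 0.08]
# Known responses for the same patients: 1 = responded, 0 = did not respond.
known = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]


def binarize(probabilities, threshold):
    """Call a patient a predicted responder (1) when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def sensitivity_specificity(known, labels):
    """Return (sensitivity, specificity) for binary known and predicted labels."""
    tp = sum(1 for k, l in zip(known, labels) if k == 1 and l == 1)
    tn = sum(1 for k, l in zip(known, labels) if k == 0 and l == 0)
    positives = sum(known)
    negatives = len(known) - positives
    return tp / positives, tn / negatives


# A lower threshold favors sensitivity, a higher threshold favors specificity.
for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(known, binarize(predicted, threshold))
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these example values, lowering the threshold finds more of the true responders at the cost of more false alarms, and raising it does the opposite.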
Input variables can be pre-filtered to reduce noise, model complexity and computation. This pre-filtering is called feature selection: variables that are deemed uninformative are filtered out, and informative variables are kept. Furthermore, model complexity can be restricted in the training phase by means of regularization, that is, penalizing the algorithm for increasing the model’s complexity. Limiting the model’s complexity aims at avoiding overfitting, in which the model learns the training data too well: it performs accurately on the training data but poorly on validation data.
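As a rough illustration of regularization, the sketch below fits a logistic regression by plain gradient descent with an L2 penalty, one common form of regularization among several. The tiny dataset, the function names and the parameter values are all assumptions made for this example. In practice one would normally use an established library rather than hand-written training code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, l2_penalty=0.0, learning_rate=0.1, n_iterations=2000):
    """Fit logistic regression by gradient descent, penalizing large weights (L2)."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, target in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            error = p - target
            for j, x in enumerate(row):
                grad_w[j] += error * x / n_samples
            grad_b += error / n_samples
        # The penalty term pulls every weight toward zero at each step.
        weights = [w - learning_rate * (g + l2_penalty * w)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical data: the first variable separates the classes, the second is noise.
X = [[2.0, 0.1], [1.5, -0.3], [1.0, 0.4], [0.5, -0.2],
     [-0.5, 0.3], [-1.0, -0.1], [-1.5, 0.2], [-2.0, -0.4]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

plain_weights, _ = train_logistic(X, y, l2_penalty=0.0)
penalized_weights, _ = train_logistic(X, y, l2_penalty=1.0)
print("without penalty:", plain_weights)
print("with L2 penalty:", penalized_weights)
```

The penalized model ends up with smaller weights, which is the sense in which regularization restricts model complexity.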
Types of models
Computational models used to predict a target variable from input variables come in many forms, but the main tradeoffs in model selection concern the models’ complexity. More complex models can make better predictions, but they require more data, are harder to interpret, and are more prone to overfitting. We can categorize types of models based on their complexity as in the table below:
Model validation and performance estimation
In quantifying a predictive model’s performance, one needs to ensure not only that it works accurately enough on the data it was trained on, but also that it is likely to work on new samples that were not used in training the model.
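Let us assume we have trained a model to predict treatment response (a value between 0 and 1), and that we run the model with the input variables of ten patients whose response is known. By binarizing the prediction with any threshold, say 0.5, we can tabulate the predicted and known responses of the ten patients as a confusion matrix, which could look like this:
- Known responders (6 patients): 4 predicted as responders (true positives), 2 predicted as non-responders (false negatives)
- Known non-responders (4 patients): 1 predicted as a responder (false positive), 3 predicted as non-responders (true negatives)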
Most typical performance metrics of predictive algorithms, or classifiers, are based on a confusion matrix. In the example above, the treatment response of 7 out of 10 patients was correctly predicted, giving an overall accuracy of 70%. The false positive rate (FPR) was 1 out of 4, or 25%, and false negative rate (FNR) 2 out of 6, or 33%. Additionally, one can calculate the true positive rate (TPR, or sensitivity) as 1 − FNR = 67% and true negative rate (TNR, or specificity) as 1 − FPR = 75%.
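The arithmetic above can be reproduced with a short sketch. The four counts are the ones implied by the rates quoted in the text (4 true positives, 2 false negatives, 1 false positive, 3 true negatives). The function name is our own.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common performance metrics from the four cells of a confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "sensitivity": tp / (tp + fn),  # true positive rate = 1 - FNR
        "specificity": tn / (tn + fp),  # true negative rate = 1 - FPR
    }


# Counts implied by the example in the text: 6 known responders, 4 known non-responders.
metrics = classification_metrics(tp=4, fn=2, fp=1, tn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```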
Although the overall accuracy is a simple and easy-to-understand measure of the model’s performance, the problem is that it requires defining a hard threshold to binarize the predictions. A more robust performance metric, which does not depend on a single threshold value, is called area under the curve (AUC), and it is computed based on a ROC curve. A ROC curve shows how the true positive rate depends on the false positive rate (see figure below).
A ROC curve can be produced by gradually increasing the binarization threshold from 0 (where all patients are classified as responders, and both the TPR and FPR are 1, which is the upper right corner of the ROC plot) to 1 (where all patients are classified as non-responders, and both the TPR and FPR are 0, which is the lower left corner of the ROC plot). The AUC is simply the area in the plot that remains under the ROC curve. A random classifier has an expected AUC of 0.5, corresponding to a straight diagonal ROC curve (dashed red line in the figure above). An ideal, perfect classifier would have the maximal AUC of 1. An AUC below 0.5 is possible, but it means the predictions are worse than random guessing, in which case inverting them would give an AUC above 0.5. Useful real-world classifiers thus have an AUC somewhere between 0.5 and 1.
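The threshold sweep can be sketched in a few lines. The predictions and known responses below are hypothetical, and the function names are our own. For convenience the sweep runs from the highest threshold to the lowest, which traces the same curve from the lower left corner to the upper right. The area is computed with the trapezoidal rule.

```python
def roc_points(known, predicted):
    """Return (FPR, TPR) points obtained by sweeping the threshold over all predicted values."""
    positives = sum(known)
    negatives = len(known) - positives
    points = [(0.0, 0.0)]  # threshold above every prediction: nobody is called a responder
    for threshold in sorted(set(predicted), reverse=True):
        tp = sum(1 for k, p in zip(known, predicted) if p >= threshold and k == 1)
        fp = sum(1 for k, p in zip(known, predicted) if p >= threshold and k == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predictions and known responses for ten patients (illustrative only).
predicted = [0.92, 0.81, 0.77, 0.64, 0.58, 0.45, 0.39, 0.30, 0.22, 0.08]
known = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

points = roc_points(known, predicted)
print("ROC points:", points)
print("AUC:", round(area_under_curve(points), 3))
```

For these example values the AUC is about 0.83, which equals the fraction of responder and non-responder pairs in which the responder received the higher prediction.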
Ideally, one would train the model using training data and have a separate validation cohort to estimate the performance (most importantly the AUC), to ensure the model is generalizable. However, with a limited sample size, a common approach is to use cross-validation instead. In cross-validation, the model is trained with a randomly selected part of the data and validated with the remaining data. This process is repeated, often so that each patient is used nine times in training and once in validating a model (10-fold cross-validation). An appropriately defined average of the separately trained models can then be used as the final model.
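The splitting step of 10-fold cross-validation can be sketched as follows. This is a minimal illustration with a hypothetical helper name and cohort size. Established libraries offer ready-made and more complete versions, for example splits that keep the proportion of responders similar in each fold.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (training, validation) index lists so each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random but reproducible assignment to folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical cohort of 50 patients: each round trains on 45 and validates on 5.
for round_number, (training, validation) in enumerate(k_fold_indices(50), start=1):
    print(f"round {round_number}: {len(training)} training, {len(validation)} validation")
```

In each round, a model would be trained on the training indices only and its performance (for example, the AUC) estimated on the validation indices.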
One model or several models?
In any real-world machine learning project, the main limitation tends to be the amount of data. Larger sample sizes enable better models, better biomarker discovery and deeper biological conclusions from the findings. Combining data from separate yet similar patient cohorts (when available) to train a single model is a way to increase the sample size. However, if more than one variable systematically differs between the cohorts (e.g., age, sex, measurement platform), these variables may be confounded: the algorithm may treat one variable as a proxy for the other even when there is no true relationship between them. With multiple cohorts, it therefore makes sense to train a model both on all cohorts combined and separately on each cohort. Careful comparison of the predictive variables between the models may reveal possible confounding factors.
A word on clinical applicability
In addition to the predictive power of a model, there are other factors determining its usefulness. One is its complexity: a simple model is, compared to a complex one, faster and easier to implement as an easy-to-use tool for both research and clinical purposes. Another factor is the feasibility of quantifying the required biomarkers in a clinical setting. For this reason, it might make more sense to train the model on data from samples of bodily fluids rather than solid tissue biopsies to find biomarkers that require less invasive sample collection.
A paper from our customer, in which we trained random forest models to compare the predictive power of various transcript panels for prostate cancer prognosis:
Lehto, T. K., Stürenberg, C., Malén, A., Erickson, A. M., Koistinen, H., Mills, I. G., Rannikko, A., & Mirtti, T. (2021). Transcript analysis of commercial prostate cancer risk stratification panels in hard-to-predict grade group 2-4 prostate cancers. The Prostate. Advance online publication. https://doi.org/10.1002/pros.24108
An interview with our customer, for whom we trained machine learning models to categorize cancer patients into risk groups.