Clonal evolutionary analysis of cancer
Understanding how cancer evolves under selective pressures helps identify candidate targets and biomarkers for personalized therapy. This process, often referred to as tumor clonal evolution, can be studied by reconstructing the evolutionary history of tumor cell populations. This article explains how a tumor’s clonal structure can be reconstructed using DNA sequencing data.
What is tumor evolution?
Cancer cells within a tumor descend from a common ancestor cell. This founding cancer cell acquired mutations that enabled malignant growth. These mutations are inherited by all of its descendants and are referred to as clonal mutations.
Over time, some descendants acquire additional mutations. These mutations are present only in a subset of tumor cells and are therefore called subclonal mutations. Cells sharing such mutations form subclones.
Mutations serve as markers that allow cells to be grouped into clones. As mutations accumulate, subclones give rise to further subclones, producing a hierarchical lineage of related cell populations.
Mutations may confer selective advantages such as increased proliferation, immune evasion, or treatment resistance. Clones carrying such mutations may expand, while others decline. Tracking these changes across tumor samples allows the evolutionary dynamics of the tumor to be reconstructed. In addition to revealing biological mechanisms, this reconstruction may highlight mutations or genes that could serve as therapeutic targets or biomarkers.
Sampling and sequencing strategies for tumor evolution studies
Reconstructing the clonal phylogeny of a tumor requires detecting genomic alterations such as:
- small somatic mutations (SNVs and indels)
- copy-number alterations (CNAs)
These alterations are detected using high-throughput DNA sequencing. In addition to tumor DNA, clonal analysis benefits from sequencing matched normal DNA, typically obtained from blood. This enables distinguishing somatic mutations from germline variants.
Additional tumor samples greatly increase the information available for evolutionary reconstruction. Useful sampling strategies include:
- multiple regions from the same tumor
- samples from different time points (e.g., diagnosis and relapse)
- samples from primary tumor and metastases
The sampling design determines the evolutionary questions that can be addressed. For example, longitudinal samples allow tracking clonal expansions during treatment, whereas multi-region sampling reveals spatial heterogeneity.
Tumor biopsies may be sequenced using:
- whole-genome sequencing (WGS)
- whole-exome sequencing (WES)
- targeted sequencing panels
Broader sequencing improves detection of structural variation and copy-number alterations, while deeper sequencing increases sensitivity for detecting rarer subclones. Deep whole-genome sequencing is ideal, but deep exome sequencing or whole-genome sequencing combined with ultra-deep panel sequencing are viable compromises.
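The link between sequencing depth and sensitivity for rare subclones can be approximated with a simple binomial model of read sampling. The sketch below is an illustration only: the function name, the threshold of three supporting reads, and the example VAF are assumptions, and real variant callers apply additional error models and filters.

```python
from math import comb

def detection_probability(depth: int, vaf: float, min_alt_reads: int = 3) -> float:
    """Probability of observing at least `min_alt_reads` variant reads.

    Assumes each of the `depth` reads independently carries the variant
    with probability `vaf` (binomial sampling, no sequencing errors).
    """
    p_too_few = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few

# Hypothetical case: a heterozygous mutation in a subclone that makes up 10%
# of a pure, diploid sample has an expected VAF of 0.05.
for depth in (50, 200, 1000):
    print(depth, round(detection_probability(depth, 0.05), 3))
```

Under these assumptions, the detection probability is a little under one half at 50x and above 0.99 at 200x, which is why depth matters more than breadth when the goal is to find small subclones.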
In this article we focus on approaches based on bulk DNA sequencing, but the clonal structure can also be interrogated with:
- Single-cell DNA sequencing. While less accessible than bulk sequencing, single-cell sequencing avoids many of the technical hurdles discussed in this article by providing direct evidence of mutations that co-occur in the same cell.
- Spatial transcriptomics. This approach leverages the gene dosage effect to infer copy-number changes from spatial gene expression patterns. The method is not sensitive to rare clones or clones that lack copy-number alterations, but it can reveal how clones are spatially distributed and how they relate to variation in the tumor microenvironment.
Sequencing of circulating tumor DNA (ctDNA) from liquid biopsies can also be used to monitor clonal dynamics over time. Because liquid biopsies can be obtained repeatedly with minimal invasiveness, they are useful for tracking known mutations and disease burden. However, reconstructing a full clonal phylogeny from ctDNA alone is more challenging, because the tumor-derived fraction of cell-free DNA is often low and the signal correspondingly weak.
Overview of the bioinformatics workflow
Reconstructing tumor clonal structure from sequencing data involves several computational steps:
- Pre-processing sequencing reads and alignment to a reference genome
- Calling small somatic mutations
- Calling somatic copy-number alterations
- Estimating cancer-cell fractions of mutations
- Detecting clones by clustering mutations
- Reconstructing the clonal phylogeny
The downstream steps depend strongly on the quality of upstream results. Errors or biases introduced during variant calling or copy-number estimation propagate through the workflow. For this reason, intermediate outputs typically require careful inspection, parameter tuning and filtering.
The first two steps are standard in somatic mutation analysis. The remaining steps are largely specific to clonal analysis and are described in more detail below.
Copy-number calling in clonality analysis
Copy-number calling is essential in clonal analysis, since the copy number at mutation loci must be known in order to estimate the mutations’ cancer-cell fractions. In addition, a copy-number alteration can itself define a clone.
How does a copy-number caller work?
A copy-number caller first segments the genome based on two signals:
- the read-depth ratio between tumor and matched normal sample
- the B-allele fraction (BAF) of germline SNPs
Changes in these signals indicate genomic breakpoints separating regions with different copy numbers.
Each segment is then assigned a copy number consistent with the observed signals. For example, a depth ratio of 1 together with a BAF of 0.5 is consistent with a diploid segment (copy number 2).
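To make the relationship between copy number and the two signals concrete, the following sketch computes the depth ratio and BAF expected for a segment with given allele-specific copy numbers. It is a simplified model, not the algorithm of any particular caller: it assumes diploid normal cells, a single tumor population, no noise or GC bias, and it reports the BAF of the minor allele (so values are at most 0.5). The function and parameter names are hypothetical.

```python
def expected_signals(major: int, minor: int, purity: float, ploidy: float = 2.0):
    """Return (depth_ratio, baf) expected for one segment.

    major, minor: allele-specific copy numbers in cancer cells
    purity:       fraction of cancer cells in the sample
    ploidy:       average copy number of the cancer genome
    """
    total = major + minor
    # Average number of copies per cell at this segment (tumor + normal cells).
    segment_copies = purity * total + 2 * (1 - purity)
    # Average number of copies per cell across the whole genome.
    genome_copies = purity * ploidy + 2 * (1 - purity)
    depth_ratio = segment_copies / genome_copies
    # Fraction of copies that carry the minor allele of a heterozygous germline SNP.
    if segment_copies > 0:
        baf = (purity * minor + (1 - purity)) / segment_copies
    else:
        baf = float("nan")  # no DNA left at this locus (pure tumor, homozygous deletion)
    return depth_ratio, baf

# Hypothetical segments at 80% purity in a diploid tumor.
for major, minor in [(1, 1), (2, 1), (2, 0), (1, 0)]:
    ratio, baf = expected_signals(major, minor, purity=0.8)
    print((major, minor), round(ratio, 3), round(baf, 3))
```

A caller essentially works in the opposite direction: given the observed depth ratio and BAF of each segment, it searches for the integer copy numbers, purity, and ploidy that best explain them.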
There is a trade-off between accuracy and genomic resolution: callers favoring longer segments produce more accurate copy-number estimates for large CNAs but may miss small CNAs altogether.
Copy numbers must be allele specific
For clonal analysis, copy numbers must be allele-specific in order to estimate mutation multiplicity.
For example, a copy number of three at a mutated site may correspond to three different mutation multiplicities:
- AAB (two reference alleles and one mutated allele)
- ABB (one reference allele and two mutated alleles)
- BBB (three mutated alleles)
This distinction is important because the number of mutated alleles directly affects the expected variant allele frequency.
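As a small worked example, the expected variant allele frequency for the three genotypes above can be computed from the multiplicity, the total copy number, and the sample purity. The sketch assumes diploid normal cells and a noise-free sample; the function name and the purity of 80% are chosen for illustration.

```python
def expected_vaf(multiplicity: int, total_cn: int, purity: float,
                 ccf: float = 1.0, normal_cn: int = 2) -> float:
    """Expected VAF of a somatic mutation in a tumor-normal mixture."""
    # Mutated copies per cell, averaged over all cells in the sample.
    mutated_copies = purity * ccf * multiplicity
    # All copies of the locus per cell, averaged over tumor and normal cells.
    all_copies = purity * total_cn + (1 - purity) * normal_cn
    return mutated_copies / all_copies

# Clonal mutation at a locus with copy number three, purity 80% (assumed).
for genotype, multiplicity in [("AAB", 1), ("ABB", 2), ("BBB", 3)]:
    print(genotype, round(expected_vaf(multiplicity, total_cn=3, purity=0.8), 3))
```

The three genotypes give clearly different expected VAFs even though the mutation is present in every cancer cell in all three cases.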
Ploidy and purity
Tumor samples usually contain a mixture of cancer cells and normal cells such as stromal or immune cells. The fraction of cancer cells in the sample is called tumor purity.
Copy-number callers typically estimate purity together with tumor ploidy, the average copy number across the cancer genome.
These parameters are strongly coupled. For example, doubling the ploidy estimate from diploid to tetraploid while lowering the purity estimate may be just as consistent with the observed data. As a result, multiple purity-ploidy combinations may fit the same data.
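The ambiguity can be demonstrated numerically. In the sketch below, a set of hypothetical segments is described once with purity p and the original copy numbers, and once with every copy number doubled and purity lowered to p / (2 - p). Under a simple noise-free mixture model with diploid normal cells (an assumption, as are all names and values), both descriptions predict identical depth ratios and B-allele fractions.

```python
def segment_signals(major, minor, purity, ploidy):
    """Expected (depth ratio, minor-allele BAF) of a segment in a tumor-normal mixture."""
    total = major + minor
    segment_copies = purity * total + 2 * (1 - purity)   # copies per cell at this segment
    genome_copies = purity * ploidy + 2 * (1 - purity)   # copies per cell genome-wide
    return segment_copies / genome_copies, (purity * minor + (1 - purity)) / segment_copies

def mean_total(segments):
    """Ploidy as the mean total copy number, assuming equally long segments."""
    return sum(major + minor for major, minor in segments) / len(segments)

# Hypothetical (major, minor) copy numbers.
segments = [(1, 1), (2, 1), (1, 0), (2, 0)]
purity_a = 0.8
solution_a = [segment_signals(ma, mi, purity_a, mean_total(segments))
              for ma, mi in segments]

# Alternative solution: all copy numbers doubled, purity lowered accordingly.
doubled = [(2 * ma, 2 * mi) for ma, mi in segments]
purity_b = purity_a / (2 - purity_a)
solution_b = [segment_signals(ma, mi, purity_b, mean_total(doubled))
              for ma, mi in doubled]

# Largest discrepancy between the two solutions over all segments and both signals.
max_difference = max(abs(x - y)
                     for sig_a, sig_b in zip(solution_a, solution_b)
                     for x, y in zip(sig_a, sig_b))
print(round(purity_b, 3), max_difference)
```

Because the data cannot separate the two solutions, callers need additional criteria, such as a preference for solutions without implausible copy-number states, or the independent information described in the next paragraph.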
Errors in purity estimates propagate directly into downstream clonality estimates. Independent information, such as histological purity estimates or typical ploidy distributions for the tumor type, may help identify unrealistic solutions.
Subclonal copy-number alterations
Ideally, copy-number alterations are also assigned clonality estimates to distinguish clonal from subclonal CNAs. Detecting subclonal CNAs is difficult, and many tools do not attempt it. Whether or not subclonal CNAs are called, their presence adds noise to downstream analysis and complicates interpretation.
From variant allele fractions to cancer cell fractions
The next step is estimating each mutation’s cancer-cell fraction (CCF), that is, the fraction of cancer cells in which the mutation is present. This is done by adjusting variant allele frequencies (VAFs) for allele-specific copy numbers and sample purity.
The observed VAF is defined as the fraction of reads supporting the variant among all reads covering the site. For a heterozygous mutation in a diploid cancer genome, in the absence of CNAs and normal-cell contamination, the relationship is simple: the CCF is twice the VAF.
In practice, the relationship must be adjusted for:
- sample purity
- local copy number
- number of mutated alleles
These factors mean that mutations with very different allele fractions may still occur in the same fraction of cancer cells.
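The adjustment can be written as a one-line formula. The sketch below gives a point estimate only and assumes diploid normal cells, a clonal copy-number state at the locus, and a known multiplicity; dedicated tools additionally model read-count uncertainty. The function name and the example VAFs are illustrative assumptions.

```python
def ccf_from_vaf(vaf: float, purity: float, total_cn: int,
                 multiplicity: int, normal_cn: int = 2) -> float:
    """Point estimate of the cancer-cell fraction of a mutation."""
    # Average number of copies of the locus per cell in the sample.
    average_copies = purity * total_cn + (1 - purity) * normal_cn
    # Mutated copies per cell divided by mutated copies per mutated cancer cell.
    return vaf * average_copies / (purity * multiplicity)

purity = 0.8  # assumed sample purity
examples = [
    # (description, VAF, total copy number, multiplicity)
    ("heterozygous, diploid", 0.40, 2, 1),
    ("copy-neutral LOH (BB)", 0.80, 2, 2),
    ("gain, one mutated copy (AAB)", 0.8 / 2.8, 3, 1),
    ("heterozygous, diploid, subclonal", 0.20, 2, 1),
]
for description, vaf, total_cn, multiplicity in examples:
    print(description, round(ccf_from_vaf(vaf, purity, total_cn, multiplicity), 2))
```

The first three mutations have quite different VAFs but the same estimated CCF of about 1, whereas the last one, with a VAF half that of the first, is present in about half of the cancer cells.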
The figure below illustrates this mapping from VAFs to CCFs using six example mutations with different copy-number contexts and an assumed sample purity of 80%.
In this example, five mutations are clonal (present in all cancer cells) while one mutation occurs in only half of the cancer cells. Note, for instance, that the mutation in a region with copy-neutral loss of heterozygosity (genotype BB) shows a particularly high VAF, in fact equal to the sample purity.
Note that terminology varies by context. Terms such as clonality, cellularity, and cellular prevalence are sometimes used interchangeably with cancer-cell fraction, although they may also refer to the fraction of all cells in a sample rather than only cancer cells.
Detecting clones
Once cancer-cell fractions have been estimated for all mutations across all samples, clones can be inferred. (Typically, only SNVs and indels are used here, but CNAs could be used as well.)
This is done by clustering mutations based on their CCF profiles across samples. Mutations belonging to the same clone should have similar CCF values.
Because artifacts and germline variants can distort the clustering, it is common to perform clustering on a carefully filtered subset of mutations. The remaining mutations can be assigned to clusters afterward.
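Assigning the held-out mutations afterward can be as simple as picking the nearest cluster in CCF space. The sketch below uses Euclidean distance to hypothetical cluster centroids, with one CCF value per sample; this is a deliberately naive stand-in for the probabilistic assignment that clustering tools typically perform, and all names and values are assumptions.

```python
from math import dist

def assign_to_clusters(ccf_profiles: dict, centroids: dict) -> dict:
    """Map each mutation to the cluster with the closest centroid.

    ccf_profiles: mutation id -> tuple of CCFs, one per sample
    centroids:    cluster id  -> tuple of mean CCFs, one per sample
    """
    assignment = {}
    for mutation, profile in ccf_profiles.items():
        assignment[mutation] = min(
            centroids, key=lambda cluster: dist(profile, centroids[cluster])
        )
    return assignment

# Hypothetical centroids from clustering a high-confidence subset (two samples).
centroids = {
    "clonal": (1.0, 1.0),
    "subclone_1": (0.4, 0.9),
    "subclone_2": (0.3, 0.0),
}
# Hypothetical mutations that were held out of the initial clustering.
held_out = {
    "mut_a": (0.95, 1.05),
    "mut_b": (0.45, 0.80),
    "mut_c": (0.25, 0.05),
}
print(assign_to_clusters(held_out, centroids))
```

In practice it is sensible to leave mutations unassigned when they are far from every centroid, since these are often the artifacts the filtering was meant to exclude.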
Mutation filtering strategies, as well as clustering parameters such as the maximum number of clusters and the minimum number of mutations per cluster, should be tuned when the output is implausible, for example when no clonal cluster is found.
A successful clustering should produce:
- one cluster with CCF ≈ 1 in all samples (the ancestral clone)
- several clusters with CCF < 1 in at least one sample (subclones)
Superclonal clusters (estimated CCF >> 1) typically indicate problems with upstream steps.
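These sanity checks are easy to automate. The following sketch labels a cluster from its mean CCF in each sample; the tolerance of 0.1 is an arbitrary assumption that should be adapted to the noise level of the data, and the function name is hypothetical.

```python
def label_cluster(mean_ccfs, tolerance: float = 0.1) -> str:
    """Label a cluster from its mean CCF in each sample."""
    if any(ccf > 1 + tolerance for ccf in mean_ccfs):
        return "superclonal"   # CCF well above 1: suspect upstream problems
    if all(ccf >= 1 - tolerance for ccf in mean_ccfs):
        return "clonal"        # CCF close to 1 in every sample
    return "subclonal"         # CCF clearly below 1 in at least one sample

# Hypothetical clusters with mean CCFs at diagnosis and relapse.
clusters = {
    "c1": (0.98, 1.03),
    "c2": (0.40, 0.85),
    "c3": (0.30, 0.00),
    "c4": (1.60, 1.00),
}
labels = {name: label_cluster(ccfs) for name, ccfs in clusters.items()}
print(labels)
```

A result with exactly one clonal cluster and no superclonal clusters is a reasonable starting point for phylogeny reconstruction; anything else is a cue to revisit purity, copy-number calls, or filtering.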
Clustering results can reveal interesting patterns, such as clones that expand, contract, or appear only in specific samples. The example below shows a subclone (violet) which has expanded in relapse and given rise to a new subclone (yellow).
Reconstructing clonal phylogeny
After identifying mutation clusters corresponding to clones, these clusters can be arranged into a clonal phylogeny describing their evolutionary relationships.
This process applies simple rules such as:
- Infinite sites assumption: each mutation is assumed to occur only once during evolution
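- Pigeonhole principle (also known as the sum rule): the combined CCF of sibling subclones cannot exceed that of their parent clone; consequently, two clones whose CCFs sum to more than 1 in a sample must lie on the same lineage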
- Crossing rule: if two clones swap relative abundance across samples, they must represent branching lineages
These rules restrict the set of phylogenetic trees consistent with the observed CCF values.
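For a pair of clones, the rules above translate into two simple tests on their CCF profiles. The sketch below lists which relationships remain possible for a pair; the tolerance, function names, and CCF values are assumptions, and the infinite sites assumption is taken for granted.

```python
def can_be_ancestor(ancestor, descendant, tolerance: float = 0.05) -> bool:
    """An ancestor must be at least as prevalent as its descendant in every sample."""
    return all(a + tolerance >= d for a, d in zip(ancestor, descendant))

def must_share_lineage(clone_x, clone_y, tolerance: float = 0.05) -> bool:
    """If two CCFs sum to more than 1 in some sample, some cells carry both
    clones' mutations, so the clones cannot sit on separate branches."""
    return any(x + y > 1 + tolerance for x, y in zip(clone_x, clone_y))

def allowed_relations(clone_x, clone_y, tolerance: float = 0.05) -> set:
    """Relationships between two clones that are compatible with their CCFs."""
    allowed = set()
    if can_be_ancestor(clone_x, clone_y, tolerance):
        allowed.add("x above y")
    if can_be_ancestor(clone_y, clone_x, tolerance):
        allowed.add("y above x")
    if not must_share_lineage(clone_x, clone_y, tolerance):
        allowed.add("branching")
    return allowed

# Hypothetical CCFs in two samples.
print(allowed_relations((0.6, 0.3), (0.3, 0.6)))  # abundances swap between samples
print(allowed_relations((0.8, 0.7), (0.5, 0.4)))  # CCFs sum to more than 1
print(allowed_relations((0.8, 0.3), (0.5, 0.6)))  # no relationship fits
```

The first pair illustrates the crossing rule (only branching remains), the second the pigeonhole principle (only a linear relationship remains), and the third a conflict that points to problems in earlier steps.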
Multiple alternative phylogenies may be uncovered. If no phylogeny satisfies these constraints across all samples, earlier analysis steps should be re-examined.
The example below illustrates how alternative phylogenies may appear plausible when considering individual samples but become constrained when multiple samples are analyzed jointly.
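The same effect can be reproduced with a brute-force enumeration for a toy case. The sketch below tries every possible parent assignment for three hypothetical clones and keeps the trees in which the combined CCF of the children never exceeds that of their parent. Exhaustive enumeration is feasible only for a handful of clones, and the names, tolerance, and CCF values are assumptions.

```python
from itertools import product

def reaches_root(clone, parent_of, root) -> bool:
    """Follow parent links upward; fail if a cycle is met before the root."""
    seen = set()
    while clone != root:
        if clone in seen:
            return False
        seen.add(clone)
        clone = parent_of[clone]
    return True

def sum_rule_holds(parent_of, ccf, tolerance) -> bool:
    """Children's combined CCF must not exceed their parent's in any sample."""
    n_samples = len(next(iter(ccf.values())))
    for parent in ccf:
        children = [c for c, p in parent_of.items() if p == parent]
        for s in range(n_samples):
            if sum(ccf[c][s] for c in children) > ccf[parent][s] + tolerance:
                return False
    return True

def valid_trees(ccf, root, tolerance: float = 0.05) -> list:
    """All parent assignments that form a tree and satisfy the sum rule."""
    clones = [c for c in ccf if c != root]
    trees = []
    for parents in product(list(ccf), repeat=len(clones)):
        parent_of = dict(zip(clones, parents))
        if all(reaches_root(c, parent_of, root) for c in clones) and \
                sum_rule_holds(parent_of, ccf, tolerance):
            trees.append(parent_of)
    return trees

# Hypothetical CCFs of clones A (ancestral), B and C in two samples.
joint = {"A": (1.0, 1.0), "B": (0.6, 0.3), "C": (0.3, 0.6)}
sample_1 = {clone: ccfs[:1] for clone, ccfs in joint.items()}
sample_2 = {clone: ccfs[1:] for clone, ccfs in joint.items()}

for label, data in [("sample 1", sample_1), ("sample 2", sample_2), ("joint", joint)]:
    print(label, valid_trees(data, root="A"))
```

With these values, each sample on its own is compatible with two trees (a branching tree and one linear tree), while the joint analysis leaves only the branching tree.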
Interpreting clonal evolution
The inferred phylogeny can be visualized using trees (with clones or, alternatively, samples as nodes), or river/fish plots (see below), annotated with known or novel mutations of interest.
By identifying expanding clones, tumor evolutionary analysis helps in discovering mutations associated with events such as:
- Metastasis
- Relapse
- Treatment resistance
However, it may be necessary to observe recurrence across a whole cohort of patients to sift driver mutations from passengers that are co-selected within the same clone.
Beyond identifying genes or specific mutations associated with progression, clonal diversity, also called intratumor heterogeneity (ITH), may be a useful biomarker in its own right. Other features beyond individual mutations that can be quantified for the tumor or its clones include:
- Tumor mutational burden (TMB)
- Neoantigen load
- Mutational signatures, which may reveal clone-specific mutational processes or artifactual clones.
Troubleshooting
High-fidelity clonal reconstruction is not an automated process. Tailoring the analysis workflow, selecting appropriate tools and parameters, and interpreting the results require expertise. In particular, a good understanding of tumor evolution and computational cancer genomics, as well as knowledge of cancer type-specific idiosyncrasies, is needed.
Even with high-quality data and careful analysis, the results will have limitations and remain probabilistic in nature. Re-analysis with alternative tools and careful interpretation can help identify the most robust and reproducible findings.
Common issues include:
- No phylogenies recovered. This is common when the number of samples and/or mutations is high. The more clusters, the more likely it is that some of them will violate the constraints of clonal reconstruction, such as the crossing rule.
- Multiple phylogenies recovered. This is common when the number of samples and/or mutations is low, or when the samples are very similar. A small number of clusters may be consistent with multiple contradictory phylogenies.
Whether or not you encounter these issues, it is worth stepping back through the workflow:
- Inspect the clusters. How many clusters are there? How many mutations does each cluster contain? How wide is the CCF distribution within each cluster? Which clusters violate the crossing rule? Are there superclonal clusters (CCF >> 1)? Is a cluster with CCF ≈ 1 missing?
- Focus on suspicious clusters. What do their VAF distributions look like? Does the relationship between VAF and CCF make sense? Do the mutations coincide with CNAs? Do they include known recurrent cancer mutations? Are their mutational signatures associated with known biological processes or artifacts?
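A per-cluster summary table answers several of these questions at a glance. The sketch below computes the number of mutations, the mean CCF, and the spread of the CCFs for each cluster in one sample; the input format and the example values are hypothetical.

```python
from statistics import mean, pstdev

def summarize_clusters(mutations) -> dict:
    """Summarize clusters from (cluster_id, ccf) pairs of one sample."""
    by_cluster = {}
    for cluster_id, ccf in mutations:
        by_cluster.setdefault(cluster_id, []).append(ccf)
    return {
        cluster_id: {
            "n": len(values),             # number of mutations in the cluster
            "mean_ccf": mean(values),     # cluster location
            "sd_ccf": pstdev(values),     # cluster width
        }
        for cluster_id, values in by_cluster.items()
    }

# Hypothetical clustering output for a single sample.
mutations = [
    ("c1", 1.02), ("c1", 0.97), ("c1", 1.01),
    ("c2", 0.45), ("c2", 0.55),
    ("c3", 1.70),
]
for cluster_id, stats in summarize_clusters(mutations).items():
    print(cluster_id, stats)
```

In this toy output, the single-mutation cluster with a mean CCF far above 1 would be the first candidate for closer inspection.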
Results that contradict biological or mathematical constraints commonly stem from:
- Limited sampling. Bulk clonal reconstruction depends on differences in clone prevalence across samples. If too few samples are available, or if the samples are too similar, the data may not contain enough information to resolve a unique phylogeny.
- Shallow sequencing. Exome sequencing at 50x depth often does not support fine-grained reconstruction of clonal structure. Focus instead on quantities that can be estimated reliably, such as intratumor heterogeneity, tumor mutational burden, neoantigen load, mutational signatures, and mutations shared or private between samples.
- Low mutation counts. Some tumors contain only a small number of somatic mutations, particularly in exome or targeted panel data. With too few informative mutations, clustering and phylogenetic reconstruction may remain ambiguous. Try carefully relaxed filtering, consider broader sequencing, or focus on the summary statistics listed above instead.
- False-positive mutations. Incomplete filtering of germline variants and artifactual mutations may lead to phantom clusters that render clonal reconstruction impossible. Try more stringent filtering.
- Bad purity and ploidy estimates. Underestimated purity and overestimated ploidy inflate CCF estimates, while overestimated purity and underestimated ploidy compress the range of CCFs. Consider evaluating multiple purity-ploidy solutions and consulting independent sources of information about sample purity and tumor ploidy.
- Bad copy-number calls. In tumors with extensive copy-number changes and high intratumor heterogeneity, some CNAs will be present only in subsets of cells. These subclonal CNAs complicate the relationship between VAF and CCF. Furthermore, short CNAs may be missed entirely by callers; if such CNAs coincide with small somatic mutations, the CCFs of these mutations will be distorted. Finally, errors in estimates of mutation multiplicity (arising from incorrect allele-specific CNA calls) will cause similar problems. Consider using alternative callers, including copy-number callers that estimate CNA clonality, or exclude mutations located in regions with suspicious copy-number states (or any CNAs) during the clustering step.
- Overfitting during clustering. Allowing too many clusters or very small cluster sizes can cause noise to be interpreted as biological structure. Consider increasing the minimum cluster size, limiting the number of clusters, or clustering only a high-confidence subset of mutations before assigning the remainder.
- Overly strong assumptions. Computational models rely on assumptions that are sometimes violated. The infinite sites assumption, for instance, does not always hold: the same mutation can occur independently more than once, and mutations can also be lost, for example through deletion of the mutated allele. Try running alternative tools or adjusting the tools’ parameters.
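The effect of a wrong purity estimate, mentioned in the list above, can be illustrated with a single mutation. The sketch below takes a clonal heterozygous mutation in a diploid region of a sample whose true purity is 80% and recomputes its CCF under different assumed purities. It assumes diploid normal cells and noise-free data, and the function name and values are chosen for illustration.

```python
def ccf_estimate(vaf: float, assumed_purity: float,
                 total_cn: int = 2, multiplicity: int = 1) -> float:
    """CCF point estimate under an assumed purity (diploid normal cells)."""
    average_copies = assumed_purity * total_cn + 2 * (1 - assumed_purity)
    return vaf * average_copies / (assumed_purity * multiplicity)

# Expected VAF of a clonal heterozygous mutation at a true purity of 80%.
observed_vaf = 0.4

estimates = {
    assumed: ccf_estimate(observed_vaf, assumed) for assumed in (0.6, 0.8, 1.0)
}
for assumed, ccf in estimates.items():
    print(assumed, round(ccf, 2))
```

Underestimating purity pushes the clonal mutation to a CCF above 1, which is one way superclonal clusters arise, while overestimating purity makes the same mutation look subclonal.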
Conclusion
Clonal evolutionary analysis can reveal how tumors diversify, respond to selective pressures, and progress through treatment, relapse, and metastasis. However, reliable reconstruction of clonal structure from bulk sequencing data depends on much more than running a single tool. Variant calling, copy-number analysis, purity and ploidy estimation, mutation filtering, clustering, and phylogeny reconstruction all influence one another, and errors at any stage can distort the final result.
For this reason, successful clonal analysis requires careful study design, appropriate sequencing strategies, robust bioinformatics workflows, and expert interpretation of intermediate and final outputs. When done well, it can provide a useful framework for identifying candidate drivers, characterizing intratumor heterogeneity, comparing samples across time and space, and extracting clinically or biologically relevant signals from complex cancer sequencing datasets.
At Genevia Technologies, we help researchers design and execute cancer genomics analyses that are tailored to the biological question, sample material, and sequencing data available. If you are planning a cancer sequencing study or need help interpreting tumor evolution data, feel free to contact us to discuss your project.