Clonal evolutionary analysis of cancer

Understanding how cancer manages to progress under evolutionary pressures helps us discover new drug targets and biomarkers for personalized therapy. Here, we explain the process and key concepts in reconstructing and tracking a tumor's clonal structure using DNA sequencing data.

What is tumor evolution?

Note: Defining the basic concepts of tumor evolution builds on certain reasonable assumptions. It pays to think of cases where they may not hold, and how it will affect the analysis!

All cells of the same tumor derive from a common ancestor. This cancer-founding cell carried the mutations that made it cancerous in the first place. These mutations are inherited by all its progeny and are called clonal mutations. The founder's descendants may later acquire additional mutations, which are called subclonal: mutations shared by some, but not all, cancer cells.

Assuming (!) that the same somatic mutation may occur only once in the history of a cancer, a set of cancer cells which share a subclonal mutation is considered a subclone. As mutations accumulate, subclones beget further subclones, leading to a pedigree of hierarchically branching and linearly descending clones.

Since a mutation may confer an advantage on a cancer cell — helping it evade treatment, for instance — clones may expand or die under selective pressures. Tracking the tumor's clonal structure across such processes lets us witness evolution almost in real time. While that is cool in itself, it may also reveal genes and mutations to develop new therapies against, or to use as biomarkers for clinical decisions.

Sampling and sequencing strategies for tumor evolution studies

Reconstructing the clonal phylogeny of a tumor relies on identifying genomic alterations such as small mutations (SNVs and indels) and copy-number alterations (CNAs). These mutations can be detected by high-throughput DNA sequencing. Both bulk tumor sequencing and single-cell DNA sequencing can be used for an in-depth clonal analysis, but we focus here on bulk sequencing.

Clonal analysis of a tumor benefits greatly from having patient-matched normal DNA sequencing, usually from a blood sample. The analysis is further enriched if the collection of a patient's tumor samples comprises any of the following:

  • samples from different regions of the same tumor and same time point,
  • samples taken before and after treatment (e.g. at diagnosis and relapse),
  • samples taken from both primary tumor and metastases.

Naturally, the types and timepoints of the sequenced tumor biopsies determine which research questions they may be used to answer.

DNA from the biopsies may be subjected to whole-genome, exome or targeted panel sequencing. Generally, broader sequencing (e.g., whole-genome instead of exome) enables more accurate detection of copy-number alterations, whereas deeper sequencing (e.g. panel sequencing at 2000x instead of exome sequencing at 100x) enables detecting rarer subclonal mutations. Performing exome sequencing only, rather than a combination of wide and deep approaches, is a viable compromise.


Of note, sequencing cell-free DNA (cfDNA) from liquid biopsies allows for more frequent and less invasive clonal tracking than sequencing surgically harvested biopsies. It applies well for clinical tracking of disease burden and marker mutations, but less well for reconstructing a clonal phylogeny from scratch.

Overview of the bioinformatics workflow

The computational work from raw sequencing data to a reconstructed clonal phylogeny involves the following core elements:

  • Pre-processing sequencing reads and alignment to reference genome
  • Calling small somatic mutations
  • Calling somatic copy-number alterations
  • Estimating the cancer-cell fractions of mutations
  • Detecting clones by clustering the mutations
  • Reconstructing the clonal phylogeny

The downstream analysis is sensitive to noise and errors from earlier steps. For this reason, most steps will require a thorough inspection of their outcome and experimenting with different tools, parameter values and filters. This, in turn, requires an understanding of the rationale and inner workings of the applied tools.

The first two steps are essentially the same as in any somatic mutation analysis, but the latter four are either specific to clonal analysis or require adapting to its needs. Next, we take a closer look at those four steps.

Copy-number calling in clonality analysis

Copy-number calling is essential in clonality analysis since the copy number at the sites of small mutations is required to estimate their clonality. Even in the absence of small mutations, a copy-number alteration itself is a qualifying mutation for defining clones.

How does a copy-number caller work?

A copy-number caller first segments the genome based on two metrics:

  • the read depth ratio between the tumor and matched normal sample and
  • the so-called B-allele fractions (BAFs) of germline SNPs.

A shift in the average depth ratio or BAF along the genome is interpreted as a genomic breakpoint.

Next, each segment is assigned a copy number based on the same two metrics. A depth ratio of 1 and a BAF of 0.5, for instance, are consistent with a copy number of 2.
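To make the assignment step concrete, here is a minimal Python sketch. It assumes a simplified mixture model: the sample consists of tumor cells and diploid normal cells, purity and ploidy are already known, and the BAF is folded so that it describes the minor allele. The function names, the max_cn limit and the least-squares fit are illustrative assumptions, not the method of any particular caller.

```python
def expected_depth_ratio(total_cn, purity, ploidy):
    """Expected tumor/normal depth ratio of a segment (simplified model)."""
    sample_copies = purity * total_cn + (1 - purity) * 2
    average_copies = purity * ploidy + (1 - purity) * 2
    return sample_copies / average_copies


def expected_baf(minor_cn, total_cn, purity):
    """Expected folded BAF of a heterozygous germline SNP in the segment."""
    all_alleles = purity * total_cn + (1 - purity) * 2
    if all_alleles == 0:  # pure tumor with a homozygous deletion: no reads
        return 0.5
    return (purity * minor_cn + (1 - purity)) / all_alleles


def best_fit(depth_ratio, baf, purity, ploidy, max_cn=6):
    """Return the (total, minor) copy numbers closest to the observation."""
    candidates = [(total, minor)
                  for total in range(max_cn + 1)
                  for minor in range(total // 2 + 1)]

    def distance(candidate):
        total, minor = candidate
        return ((expected_depth_ratio(total, purity, ploidy) - depth_ratio) ** 2
                + (expected_baf(minor, total, purity) - baf) ** 2)

    return min(candidates, key=distance)


# Hypothetical segment observed in a sample with 80% purity and ploidy 2.
print(best_fit(depth_ratio=1.0, baf=0.5, purity=0.8, ploidy=2.0))
```

Under these assumptions, a depth ratio of 1 and a BAF of 0.5 map to two copies, one of them from the minor allele. Real callers also model noise, GC content and segment length, which this sketch leaves out.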

There is a tradeoff involved: copy-number callers favoring longer segments may have less noisy copy-number estimates but a worse genomic resolution.

Copy numbers must be allele specific

Not all copy-number callers are allele specific, but here allele specificity is required. This means that for a copy number of, say, three, a distinction must be made between genotypes AAB and ABB, where A denotes a non-mutated allele and B a mutated allele. We'll see in a minute why this is important.

Ploidy and purity: a certain uncertainty

Copy-number calling must also take into account that a tumor sample will include some non-cancerous cells such as adjacent normal tissue and tumor-infiltrating immune cells. The expected fraction of normal cell contamination depends on the tumor type and other factors, and may vary from less than one percent to a vast majority of all cells.

Purity (1 − normal contamination), sometimes called cellularity, is often estimated explicitly in copy-number analysis, together with the ploidy, that is, the average copy number across the tumor genome. Purity and ploidy go hand in hand; if one is estimated incorrectly, so is the other. Notably, doubling the ploidy estimate (e.g., from diploid to tetraploid) while lowering the purity estimate may be just as consistent with the observed data as the original purity-ploidy estimate.
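The ambiguity can be demonstrated with a small Python sketch. It assumes the same simplified mixture of tumor cells and diploid normal cells as commonly used for depth ratio and BAF; the segment values are hypothetical. Under this model, doubling every copy number and lowering the purity from p to p / (2 − p) produces exactly the same expected signals.

```python
def depth_ratio(total_cn, purity, ploidy):
    """Expected tumor/normal depth ratio (simplified mixture model)."""
    return ((purity * total_cn + 2 * (1 - purity))
            / (purity * ploidy + 2 * (1 - purity)))


def baf(minor_cn, total_cn, purity):
    """Expected folded B-allele fraction of a germline SNP."""
    return ((purity * minor_cn + (1 - purity))
            / (purity * total_cn + 2 * (1 - purity)))


def signals(segments, purity, ploidy):
    """Expected (depth ratio, BAF) for each (total, minor) segment."""
    return [(depth_ratio(total, purity, ploidy), baf(minor, total, purity))
            for total, minor in segments]


# Solution 1: hypothetical near-diploid tumor at 80% purity.
segments = [(2, 1), (3, 1), (1, 0)]  # (total copies, minor-allele copies)
purity_1, ploidy_1 = 0.8, 2.0

# Solution 2: every copy number doubled, purity lowered to p / (2 - p).
doubled = [(2 * total, 2 * minor) for total, minor in segments]
purity_2, ploidy_2 = purity_1 / (2 - purity_1), 2 * ploidy_1

for first, second in zip(signals(segments, purity_1, ploidy_1),
                         signals(doubled, purity_2, ploidy_2)):
    print(first, second)
```

Each printed pair matches up to floating-point rounding, so depth ratio and BAF alone cannot tell the two solutions apart. This is one reason an orthogonal purity or ploidy estimate is valuable.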

An erroneous purity estimate will bias clonality estimates downstream in the workflow. For this reason, it may be helpful to cross-reference purity with an orthogonal, histological estimate. (The histological estimate may also be inaccurate, since it is based not on the exact cells whose DNA was sequenced but on a separate part of the biopsy.)

An orthogonal confirmation of ploidy can be useful as well; even general ploidy statistics on the tumor type in question may rescue an unrealistic ploidy call.

Subclonal CNAs

Ideally, CNAs are assigned clonality estimates to help distinguish between clonal and subclonal alterations. Detecting subclonal CNAs is difficult, and many tools do not even attempt it. Whether subclonal CNAs are called or not, the fact that they happen adds noise to downstream analysis and makes everything harder!

From variant allele fractions to cancer cell fractions

We are getting closer to detecting clones. First, one needs to estimate the clonality, or cancer-cell fraction (CCF), of individual mutations. This is done by adjusting variant allele fractions (VAFs) for allele-specific copy numbers and sample purity.

The variant allele fraction is the fraction of mutation-carrying reads among all reads covering the site of a mutation. With no CNAs and no normal cell contamination, the mapping from VAFs to CCFs is straightforward: for a heterozygous mutation, CCF is simply twice the VAF.

Normal cell contamination lowers the expected VAF for a given CCF, as does a relative gain in non-mutated alleles, whereas a relative gain in mutated alleles increases the expected VAF. For this reason, mutations with very different VAFs may still be present in the same fraction of cancer cells, i.e. have the same CCF.

The example below highlights this with six mutations with associated copy numbers and genotypes, and how they map to CCFs when the sample purity is estimated at 80%.


In this example, we have five clonal mutations (present in all cancer cells) and just one subclonal mutation present in half the cells. Note that the mutation with a copy-neutral loss of heterozygosity (LOH) — the BB genotype — has a high VAF, in fact the same 80% as sample purity.
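The mapping between VAF and CCF can be sketched in a few lines of Python. The formula below is a commonly used simplification that assumes diploid normal cells, a known number of mutated copies per cancer cell, and no subclonal CNAs. The function names and the VAF values are illustrative: they were chosen to mirror the scenario described above at 80% purity and are not taken from the original figure.

```python
def expected_vaf(ccf, mutated_copies, total_cn, purity, normal_cn=2):
    """Expected VAF of a mutation present in a fraction `ccf` of cancer cells."""
    all_copies = purity * total_cn + (1 - purity) * normal_cn
    return purity * ccf * mutated_copies / all_copies


def ccf_from_vaf(vaf, mutated_copies, total_cn, purity, normal_cn=2):
    """Invert expected_vaf: estimate the CCF from an observed VAF."""
    all_copies = purity * total_cn + (1 - purity) * normal_cn
    return vaf * all_copies / (purity * mutated_copies)


purity = 0.8
# (genotype, mutated copies, total copies, VAF); illustrative values only
mutations = [
    ("AB", 1, 2, 0.40),
    ("AB", 1, 2, 0.20),     # the subclonal mutation
    ("BB", 2, 2, 0.80),     # copy-neutral LOH
    ("AAB", 1, 3, 0.2857),
    ("ABB", 2, 3, 0.5714),
    ("B", 1, 1, 0.6667),    # loss of the non-mutated allele
]

for genotype, mutated, total, vaf in mutations:
    ccf = ccf_from_vaf(vaf, mutated, total, purity)
    print(f"{genotype:>3}  VAF={vaf:.2f}  CCF={ccf:.2f}")
```

With these inputs, VAFs ranging from about 0.29 to 0.80 all map to a CCF of roughly 1, while the AB mutation with a VAF of 0.20 maps to a CCF of 0.5. In practice the number of mutated copies is usually not known and has to be inferred, which adds further uncertainty.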

A note on the terminology: clonality, cellularity and cellular prevalence may all be used as synonyms for cancer-cell fraction. However, depending on the context, they may also refer to the fraction of all cells in a sample, rather than the fraction of cancer cells. The difference is considerable with low-purity samples.

Detecting clones

Calling clones is where the previously called mutations from all the patient's tumor samples, along with their CCFs, come together to reveal the clones.

Clones are detected by clustering the mutations by their CCFs across the samples. It is common to use a subset of stringently filtered mutations for this clustering: false mutation calls (artifacts or germline variants mistaken for somatic mutations) or biased CCF estimates can cause more harm than omitting some true somatic mutations from the cluster analysis. The omitted mutations can be assigned to the clusters afterwards.

Mutation filtering strategies, as well as clustering parameters such as the maximum number of clusters and minimum number of mutations per cluster, benefit from tuning if the output does not make sense.

A successful clustering should result in one cluster with CCF approximately 1 in all samples (the ancestral clone) and other clusters with CCF < 1 in at least one sample (subclones). If superclonal clusters with CCF >> 1 show up, they may need to be filtered out, or previous steps revisited.
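A quick sanity check of the clustering output along these lines could look like the following Python sketch. The thresholds (tolerance and superclonal_margin), the labels and the cluster values are assumptions made for illustration; sensible cut-offs depend on sequencing depth and the noise in the CCF estimates.

```python
def classify_cluster(ccfs, tolerance=0.1, superclonal_margin=0.3):
    """Label a mutation cluster from its mean CCF in each sample."""
    if any(ccf > 1 + superclonal_margin for ccf in ccfs):
        return "superclonal"  # suspicious: filter out or revisit earlier steps
    if all(ccf >= 1 - tolerance for ccf in ccfs):
        return "clonal"       # candidate ancestral clone
    return "subclonal"        # CCF clearly below 1 in at least one sample


# Hypothetical cluster CCFs in two samples: (diagnosis, relapse)
clusters = {
    "C1": [1.00, 0.98],
    "C2": [0.35, 0.80],
    "C3": [0.00, 0.40],
    "C4": [1.60, 1.50],
}

for name, ccfs in clusters.items():
    print(name, ccfs, classify_cluster(ccfs))

labels = [classify_cluster(ccfs) for ccfs in clusters.values()]
if labels.count("clonal") != 1:
    print("Warning: expected exactly one clonal cluster")
```

Here C1 would be the ancestral clone, C2 and C3 subclones, and C4 a superclonal cluster that hints at a problem with copy numbers, purity or mutation calls.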

While the clonal phylogeny may not yet be obvious from the cluster analysis, it may reveal interesting patterns in clones (and their mutations) which expand or shrink, or are completely private to a sample. The example below shows a subclone (violet) which has expanded in relapse and given rise to a new subclone (yellow).


Reconstructing clonal phylogeny

Once credible clone-defining mutation clusters have been established, they are used to construct a phylogenetic tree to uncover their hierarchical structure. (Remember: a cell with a subclonal identity also belongs to its parent clones, all the way up to the ancestral clone.)

This process applies simple rules such as

  • infinite sites hypothesis: a mutation can occur strictly once in evolution,
  • the pigeonhole principle: the combined CCF of branching subclones must not be higher than that of their shared parent clone (otherwise they are linear),
  • crossing rule: if a clone has a higher CCF than another clone in one sample, but the opposite is true in another sample, the clones must be branching.
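The rules above can be turned into a brute-force search, sketched below in Python. The sketch assumes each clone is summarized by one CCF per sample, allows a small tolerance for noise, and enumerates every possible parent assignment, which is only feasible for a handful of clones. The function names, the tolerance and the CCF values are hypothetical.

```python
from itertools import product


def reaches_root(clone, parent_of, root):
    """True if following parents from `clone` ends at the root (no cycles)."""
    seen = set()
    while clone != root:
        if clone in seen:
            return False
        seen.add(clone)
        clone = parent_of[clone]
    return True


def sum_rule_holds(parent_of, ccf, sample, tol):
    """Children's combined CCF must not exceed their parent's CCF."""
    child_sum = {}
    for child, parent in parent_of.items():
        child_sum[parent] = child_sum.get(parent, 0.0) + ccf[child][sample]
    return all(total <= ccf[parent][sample] + tol
               for parent, total in child_sum.items())


def valid_trees(ccf, root, tol=0.05):
    """Return all parent assignments that satisfy the sum rule in every sample."""
    clones = [clone for clone in ccf if clone != root]
    n_samples = len(ccf[root])
    trees = []
    for parents in product(list(ccf), repeat=len(clones)):
        parent_of = dict(zip(clones, parents))
        if not all(reaches_root(clone, parent_of, root) for clone in clones):
            continue
        if all(sum_rule_holds(parent_of, ccf, s, tol) for s in range(n_samples)):
            trees.append(parent_of)
    return trees


# Hypothetical CCFs of three clones in two samples.
ccf = {"A": [1.0, 1.0], "B": [0.6, 0.3], "C": [0.3, 0.6]}

sample_1_only = {clone: values[:1] for clone, values in ccf.items()}
print(len(valid_trees(sample_1_only, root="A")))  # trees plausible in sample 1
print(valid_trees(ccf, root="A"))                 # trees valid in both samples
```

In this toy case two trees fit the first sample alone, but only the branching tree, with B and C both descending directly from A, fits both samples. The crossing rule is not coded separately: it follows from applying the sum rule in every sample, since B cannot be an ancestor of C where C has the higher CCF, and vice versa.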

Multiple alternative phylogenies that satisfy the rules may be uncovered. It is also possible that no phylogenetic tree satisfies the required conditions in all samples, suggesting quality issues with the data or any analysis upstream.

The example below shows alternative clonal phylogenies that are plausible considering one sample at a time, and the only combined phylogeny that works for both samples.


The uncovered phylogeny can be visualized using trees (with clones or, alternatively, samples as nodes), or river/fish plots (see below), annotated with known or novel mutations of interest.


In addition to identifying genes or specific mutations that associate with events such as metastasis or treatment resistance, it is worth noting that clonal diversity, or intratumor heterogeneity (ITH), may be a valid biomarker itself. Other relevant metrics above the level of individual mutations, and which can be quantified for the tumor or individual clones, include

  • tumor mutational burden (TMB) and
  • mutational signatures, which may reveal clone-specific mutational processes or artifactual clones.

Naturally, identifying novel causative mutations and other genomic features may require performing the analysis on an entire cohort of patients to identify recurrent events as a way to sift genomic drivers from passengers.

Learn more


A PCAWG consortium study of clonal diversity across a number of tumor types:

Another PCAWG study focusing on mutation timing, something omitted in this blog post:

A more detailed presentation of everything discussed above:

A tumor evolutionary study of metastasis in prostate cancer, co-authored by our CSO:

Gundem, G. et al. (2015). The evolutionary history of lethal metastatic prostate cancer. Nature, 520(7547), 353–357.
