Integrative analysis of RNA-seq and ChIP-seq data
The flood of big omics data has washed plenty of new buzzwords into the realm of biological research. Integrative data analysis is one of the most popular, having floated around in grant applications for some years now. Here we focus on the integration of RNA-sequencing and ChIP-sequencing data: what questions can such multi-omics analysis answer, and what are the actual analyses involved?
Integrative data analysis follows the rationale of the whole being more than the sum of its parts. Instead of measuring and analyzing just a single data type, why not run multiple, complementary omics measurements and analyze them in an integrative manner to uncover the delicate interplay of, say, transcription factor binding and gene transcription?
Combining ChIP-seq and RNA-seq data from the same samples, so the reasoning goes, allows one to tease out insights into the mechanisms of gene expression that would remain hidden were the data analyzed separately.
ChIP-seq, or chromatin immunoprecipitation sequencing, is used to map the sites of chromatin-bound proteins in a genome-wide manner. Chromatin fragments with the protein of interest are selected with the aid of an antibody, and the bound DNA is sequenced to reveal the binding sites.
To keep things brief, we narrow our focus here to ChIP-seq of transcription factors (TFs). Transcription factors are proteins that bind to a gene promoter or enhancer to either recruit or block the RNA polymerase, thus regulating the transcription rate of their target genes.
What exactly does my favorite transcription factor do?
Let us assume you are interested in a specific TF. To find out where it binds, you would run ChIP-seq for that protein from an appropriate model. To quantify the TF’s regulatory impact, you would modify that model by knocking out the TF and measure global gene expression with RNA-seq from both the wild-type and knockout model, perhaps in a few replicates. The setting is illustrated in the figure below.
Analyzing the two data types from this setting separately already answers a range of questions. The ChIP-seq data, with the help of some bioinformatics, tells you:
Where does the TF bind? (peak calling and annotation)
What is the sequence motif of the binding sites? (de novo motif discovery)
What do the TF-bound genes have in common? (pathway enrichment analysis)
The RNA-seq data, on the other hand, tells you:
Which genes’ expression is altered by the TF? (differential expression analysis)
Do the differentially expressed genes share a potential binding motif? (de novo motif discovery)
What do the differentially expressed genes have in common? (pathway enrichment analysis)
Data integration: the lazy way...
Note that the questions above are different for the separate data types, even though they sound similar. TF binding does not automatically imply altered expression and vice versa. The regulation and its direction (does the expression go up or down?) might depend on the combinatorial effect of multiple TFs, i.e. cooperative binding. Furthermore, regulation can be indirect: the change in gene expression may be an effect of a long regulatory path downstream from the TF of interest.
This is where data integration comes in. The questions for the integrative analysis of RNA-seq and ChIP-seq data are:
Which of the regulated genes are direct targets of the TF?
Is the TF an activator, repressor, or both?
Does the TF have different binding partners depending on the direction of regulation?
In other words, computational analysis combining these two types of data provides more detailed information on whether the potential regulation is direct or not and whether the binding partners modulate the direction of regulation (activation vs. repression).
The simplest way to integrate these data is to compare the set of differentially expressed genes with the set of genes that have a TF binding site in their promoter (or other suitably defined regulatory region), as exemplified by the Venn diagram below.
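The set comparison can be sketched in a few lines of Python. This is a minimal illustration, not a full pipeline: the gene names are hypothetical, the function name `overlap_enrichment` is our own, and the one-sided hypergeometric test is one common way (our assumption, not something prescribed above) to ask whether the overlap is larger than expected by chance.

```python
from math import comb

def overlap_enrichment(de_genes, bound_genes, universe):
    """Return the overlap of two gene sets and a one-sided hypergeometric P-value."""
    universe = set(universe)
    de = set(de_genes) & universe        # differentially expressed genes
    bound = set(bound_genes) & universe  # genes with a peak in their promoter
    overlap = de & bound                 # candidate direct targets
    N, K, n, k = len(universe), len(bound), len(de), len(overlap)
    # P(X >= k) when drawing n genes from N, of which K are bound
    p_value = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)
    return overlap, p_value

# Hypothetical toy data: 20 expressed genes, 5 DE genes, 6 bound genes
universe = [f"gene{i}" for i in range(1, 21)]
de_genes = ["gene1", "gene2", "gene3", "gene4", "gene5"]
bound_genes = ["gene1", "gene2", "gene3", "gene4", "gene10", "gene11"]

direct_candidates, p_value = overlap_enrichment(de_genes, bound_genes, universe)
print(sorted(direct_candidates), round(p_value, 4))
```

The choice of universe matters: restricting it to genes that are actually expressed in the model usually gives a more honest P-value than using every annotated gene.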
The binding motifs of partner TFs could be studied using an equally simple approach: identify the enriched binding motifs near the observed ChIP-seq peaks separately for up- and down-regulated genes and compare them statistically.
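As a rough sketch of such a statistical comparison, one could count how often a partner motif occurs in peaks near up-regulated versus down-regulated genes and apply Fisher's exact test to the resulting 2x2 table. The test choice, the function name and the counts below are all assumptions made for illustration; real motif counts would come from a motif scanning tool.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical counts: peaks with / without the partner motif
up_with, up_without = 8, 2      # peaks near up-regulated genes
down_with, down_without = 2, 8  # peaks near down-regulated genes

p_motif = fisher_exact_two_sided(up_with, up_without, down_with, down_without)
print(round(p_motif, 4))
```

When many motifs are tested this way, the resulting P-values should be corrected for multiple testing.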
...and the more rigorous approach
The type of analysis described so far is rather standard and straightforward, but it does qualify as integrative analysis. A more rigorous approach, however, would apply a statistical model of gene regulation that uses both data types.
A great example of an integrative omics algorithm, and a commonly used tool designed specifically to answer the three integrative questions above, is BETA (Binding and Expression Target Analysis; Wang et al., Nature Protocols, 2013).
As its input, BETA takes the identified binding sites and differentially expressed genes along with associated expression fold changes. At the heart of BETA is a computational model to quantify the regulatory potential of a gene based on the number and distance of TF binding sites around the gene. With default settings, it scans up to 100 kb from the transcription start site of each gene in order to incorporate not only the promoter, but also more distal cis-regulatory elements, i.e. enhancers.
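The idea of a distance-weighted regulatory potential can be sketched as follows. The exponential decay and its constants follow the formula given in the BETA publication as we understand it (each binding site contributes exp(-(0.5 + 4d)), where d is its distance to the TSS as a fraction of 100 kb); treat them as an assumption and check the tool's documentation before relying on them. The function name and coordinates are hypothetical.

```python
from math import exp

def regulatory_potential(tss, peak_positions, max_distance=100_000):
    """Sum distance-weighted contributions of binding sites around one TSS.

    Sites closer to the TSS contribute more; sites farther away than
    max_distance are ignored.
    """
    score = 0.0
    for position in peak_positions:
        distance = abs(position - tss)
        if distance <= max_distance:
            # Assumed decay: exp(-(0.5 + 4 * relative distance))
            score += exp(-(0.5 + 4 * distance / max_distance))
    return score

# Hypothetical gene with a TSS at 1,000,000 bp and three peaks on the same chromosome
tss = 1_000_000
peaks = [1_000_000, 1_050_000, 1_200_000]  # at the TSS, 50 kb away, 200 kb away

score = regulatory_potential(tss, peaks)
print(round(score, 4))
```

In this toy case the peak at the TSS dominates the score, the peak 50 kb away adds a small amount, and the peak 200 kb away is outside the window and adds nothing.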
BETA then combines the regulatory potential and the expression difference to produce a P-value-like score reflecting the likelihood of the gene being a real, direct target of the TF. It also runs a statistical test to categorize the TF as a repressor, activator, or both, and tests whether the binding motifs of other known TFs around the observed binding sites associate with activation or repression. In other words, it outputs a list of putative modulators: partnering TFs that may change the direction of regulation.
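One simple way to combine two lines of evidence into a single score is a rank product, which, to our understanding, is also the idea behind BETA's target ranking. The sketch below is a simplified illustration and not BETA's actual implementation: the function name, gene names and values are hypothetical, ties are broken arbitrarily, and up- and down-regulated genes are not handled separately.

```python
def rank_product(potential, expression_stat):
    """Combine two per-gene scores into a rank product (smaller is better).

    Both inputs map gene -> value, where larger values mean stronger evidence
    (e.g. regulatory potential and absolute log fold change).
    """
    genes = sorted(set(potential) & set(expression_stat))
    n = len(genes)

    def ranks(values):
        # Rank 1 goes to the gene with the largest value
        ordered = sorted(genes, key=lambda g: values[g], reverse=True)
        return {gene: index + 1 for index, gene in enumerate(ordered)}

    binding_rank = ranks(potential)
    expression_rank = ranks(expression_stat)
    return {g: (binding_rank[g] / n) * (expression_rank[g] / n) for g in genes}

# Hypothetical values for three genes
potential = {"A": 0.9, "B": 0.1, "C": 0.5}        # regulatory potential
expression_stat = {"A": 3.0, "B": 2.0, "C": 0.2}  # absolute log2 fold change

scores = rank_product(potential, expression_stat)
best_target = min(scores, key=scores.get)
print(best_target, scores)
```

A gene needs to rank well on both binding and expression to get a small product, which is what distinguishes a likely direct target from a gene that is merely bound or merely differentially expressed.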
Compared to the “separate analysis plus Venn diagrams” approach, a truly integrative algorithm based on solid statistical models lets you take the conclusions one or two steps further and, hopefully, improves the likelihood of making biologically meaningful findings. The challenge, on the other hand, is making sure that this more advanced tool, and the assumptions behind its models, fit your data and hypotheses.