How to analyze RNA-seq data

Transcriptome analysis using next generation sequencing data has become increasingly popular and you might be considering running your own study. Let’s assume you approach Genevia with a typical RNA-seq data set involving raw reads from a set of samples and would like us to perform the analyses. Here we outline the kind of analyses that your samples could go through.

The question “How to analyze RNA-seq data” could of course be answered in a number of ways. Although there is a multitude of software and environments that can be used in the analysis, the steps to take in an analysis workflow are typically very similar and would most likely follow the outline presented here.

Let us look at a typical RNA-seq data analysis workflow that includes the following steps:

1. Quality control

The overall quality of the sequencing reads in the RNA-seq data is first inspected to ensure that nothing went wrong when the samples were prepared and the data produced.

At this step, we check whether any adapter sequences remain in the data and whether any sequencing reads have worse quality than expected. If so, the adapter sequences are removed and low-quality read ends are trimmed, so that the data of the sample can still be used. Occasionally an entire sample turns out to be of poor quality, in which case it can be removed from the study.
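To make the trimming idea concrete, here is a minimal sketch in Python. It is an illustration only, not our production pipeline: the function names, the Phred cut-off of 20, the Phred+33 quality encoding and the example read are all assumptions, and dedicated trimming tools also handle partial and mismatched adapters, which this sketch does not.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_end(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 8 bases of insert followed by the start of an adapter.
adapter = "AGATCGGAAGAGC"
read_seq = "ACGTACGT" + adapter
# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
read_qual = "IIIIII##" + "I" * len(adapter)

seq, qual = trim_adapter(read_seq, read_qual, adapter)
trimmed_seq, trimmed_qual = trim_low_quality_end(seq, qual)
print(trimmed_seq, trimmed_qual)
```

The adapter is removed first, so that the quality trimming then operates on the true end of the sequenced fragment.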

2. Read alignment and normalisation

Each good-quality RNA read needs to be associated with its gene of origin by sequence alignment to a reference genome.

After alignment, the reads at each gene position are counted in order to obtain a gene-specific expression value for all genes. These expression values are further normalised so that they can be compared across samples in statistical analyses and visualisations.
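As a rough illustration of why normalisation matters, the sketch below scales raw counts to counts per million (CPM). CPM is only one simple option, chosen here for clarity; the text above does not prescribe a particular method, and the sample names, gene names and counts are made up.

```python
def counts_per_million(counts):
    """Scale each sample's gene counts so that they sum to one million."""
    normalised = {}
    for sample, gene_counts in counts.items():
        # Assumes every sample has at least one counted read.
        library_size = sum(gene_counts.values())
        normalised[sample] = {
            gene: count * 1_000_000 / library_size
            for gene, count in gene_counts.items()
        }
    return normalised


# Hypothetical counts: sample_B was sequenced twice as deeply as sample_A.
raw_counts = {
    "sample_A": {"gene1": 500, "gene2": 1500, "gene3": 8000},
    "sample_B": {"gene1": 1000, "gene2": 3000, "gene3": 16000},
}

cpm = counts_per_million(raw_counts)
print(cpm["sample_A"]["gene1"], cpm["sample_B"]["gene1"])
```

Although the raw counts of sample_B are double those of sample_A, the normalised values are identical, because the difference came from sequencing depth alone.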

3. PCA

Having all the expression values per sample to hand, it is now possible to proceed to visualizing and statistically comparing the samples.

A principal component analysis (PCA) is a practical and rapid way to check sample similarity within experimental groups prior to other statistical tests. A PCA can efficiently reveal outliers and, although no-one would wish it, even samples mixed up accidentally in the laboratory! An example case of a PCA is shown below, with the different cell type samples forming separate groups. Sometimes PCA reveals the samples to be highly similar, with little likelihood of finding any differentially expressed genes. In such cases, the samples appear intermixed rather than forming separate groups.

A PCA can be used to check the sample similarity within experimental groups prior to statistical comparisons, as well as to confirm the differences between experimental groups. Severe outliers can also be pinpointed and considered for removal after a careful study of their origin.
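For readers curious about what happens under the hood, the following sketch computes the first principal component with plain Python. It is a simplified illustration under stated assumptions: the expression values are invented, only the first component is computed (using power iteration), and in practice one would use an established numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, per-sample scores) for PC1 via power iteration."""
    n_samples, n_genes = len(rows), len(rows[0])
    # Centre each gene (column) on its mean across samples.
    means = [sum(r[j] for r in rows) / n_samples for j in range(n_genes)]
    centred = [[r[j] - means[j] for j in range(n_genes)] for r in rows]

    # Repeatedly apply X^T X to a vector without forming the matrix.
    vec = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(iterations):
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centred]
        new = [
            sum(centred[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_genes)
        ]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break  # no variance in the data
        vec = [x / norm for x in new]

    scores = [sum(x * v for x, v in zip(row, vec)) for row in centred]
    return vec, scores


# Hypothetical normalised expression: rows are samples, columns are genes.
sample_names = ["type1_a", "type1_b", "type2_a", "type2_b"]
expression = [
    [10.0, 2.0, 5.0],
    [10.2, 2.1, 5.1],
    [3.0, 8.0, 5.0],
    [3.1, 8.2, 4.9],
]

direction, scores = first_principal_component(expression)
for name, score in zip(sample_names, scores):
    print(name, round(score, 2))
```

In this toy data set the two cell types end up on opposite sides of the first component, which is the kind of separation one hopes to see before testing for differential expression.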

4. Statistical tests

The statistical comparisons that aim at identifying differentially expressed genes between sample groups remain at the heart of transcriptomics data analysis.

Depending on the experimental setting, the approaches taken may vary. The typical approach is to compare sample groups pairwise using a statistical test that can take into account sample dependencies, such as paired samples or other variables.

A typical output of the analysis includes a list of significantly differentially expressed genes between the conditions, together with their fold changes for expression levels and p-values for their significance. The list is typically filtered to include only genes with at least a 2-fold change in expression, in either direction, and a significant p-value.
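The filtering step could look roughly like the sketch below. The result table, the gene names and the 0.05 significance cut-off are assumptions made for illustration; only the 2-fold threshold comes from the text above. Fold changes are assumed to be on a log2 scale, so a 2-fold change corresponds to an absolute log2 fold change of 1.

```python
import math


def filter_de_genes(results, min_fold_change=2.0, max_p_value=0.05):
    """Keep genes changed at least min_fold_change-fold, up or down."""
    min_abs_log2fc = math.log2(min_fold_change)
    return [
        row for row in results
        if abs(row["log2_fold_change"]) >= min_abs_log2fc
        and row["p_value"] < max_p_value
    ]


# Hypothetical output of a statistical comparison between two conditions.
results = [
    {"gene": "geneA", "log2_fold_change": 2.3, "p_value": 0.001},
    {"gene": "geneB", "log2_fold_change": -1.5, "p_value": 0.01},
    {"gene": "geneC", "log2_fold_change": 0.4, "p_value": 0.0005},
    {"gene": "geneD", "log2_fold_change": 3.0, "p_value": 0.20},
]

significant = filter_de_genes(results)
significant_genes = [row["gene"] for row in significant]
print(significant_genes)
```

Here geneC is dropped because its change is too small, and geneD because its p-value is not significant, even though its fold change is large.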

5. Pathway enrichment analysis

The lists of differentially expressed genes may naturally include a few expected hits that are easy to identify and to link to biological processes.

However, more often there are hundreds of differentially expressed genes that have not previously been associated with the process under study. One then needs to understand what biological processes or pathways their up- or down-regulation may be associated with. One way to disentangle the functional meaning of the genes is to perform pathway enrichment analyses.

These analyses determine whether any pathway terms in databases are annotated to the list of differentially expressed genes at a frequency greater than would be expected due to chance alone. The typical output of such an analysis is a list of significantly enriched pathway terms together with p-values and with the original differentially expressed genes associated with each one.
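One common way to ask "more often than expected by chance" is a one-sided hypergeometric test, sketched below for a single pathway. The choice of test and all the numbers are assumptions for illustration; the text above does not name a specific method, and in a real analysis the p-values of all tested pathways would also be corrected for multiple testing.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution."""
    total = comb(n_background, n_de)
    upper = min(n_in_pathway, n_de)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, 5 of them in the pathway,
# 4 differentially expressed genes, 3 of which belong to the pathway.
p_value = enrichment_p_value(
    n_background=20, n_in_pathway=5, n_de=4, n_overlap=3
)
print(round(p_value, 4))
```

By chance alone we would expect about one of the four differentially expressed genes to fall in the pathway, so observing three gives a small p-value.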

6. Integration with other data types

These steps will typically appear in our workflow, yet each data set is analysed in a tailored fashion, taking the data set's requirements and the customer's interests into account.

The data could also be combined in integrative analyses with other data types, such as miRNA or proteomics data. Assuming that we also had data on differentially expressed miRNAs, we could look up their predicted target genes in databases and find cases where both the genes and their potential regulator miRNAs are differentially expressed, in order to identify regulator-target relationships.
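The miRNA example boils down to intersecting lists, as in the sketch below. The miRNA names, gene names and target predictions are placeholders rather than entries from a real database, and a fuller analysis might additionally require the miRNA and its target to change in opposite directions.

```python
def find_regulator_target_pairs(de_mirnas, de_genes, predicted_targets):
    """Pair each DE miRNA with its predicted targets that are also DE."""
    pairs = []
    for mirna in sorted(de_mirnas):
        targets = predicted_targets.get(mirna, set())
        for gene in sorted(targets & de_genes):
            pairs.append((mirna, gene))
    return pairs


# Hypothetical inputs: sets of differentially expressed (DE) miRNAs and
# genes, plus predicted targets as they might be read from a database.
de_mirnas = {"miR-x", "miR-y"}
de_genes = {"gene1", "gene2", "gene5"}
predicted_targets = {
    "miR-x": {"gene1", "gene3"},
    "miR-y": {"gene2", "gene5", "gene7"},
    "miR-z": {"gene1"},  # not differentially expressed, so ignored
}

pairs = find_regulator_target_pairs(de_mirnas, de_genes, predicted_targets)
print(pairs)
```

Each resulting pair is a candidate regulator-target relationship that is supported by both data types.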

That is the main story of RNA-seq analysis in brief. As a next step, you could tell us more about your own experiment so that we can plan your analysis!

What would you like to know more about?