Single-cell RNA sequencing data analysis
Single-cell RNA sequencing enables cataloging and studying cellular identities at a scale and resolution unmatched by bulk sequencing.
Single-cell RNA sequencing (scRNA-seq) is one of the most rapidly advancing and diversifying technologies in molecular biology. The ability to study gene expression at the resolution of single cells has been as transformative as the advent of bulk RNA sequencing was before it.
In addition to single-cell RNA-seq, a number of other next-generation sequencing (NGS)-based assays have been adapted to single-cell protocols. These include genomic, proteomic and epigenetic assays, notably single-cell ATAC-sequencing, which is commonly performed in conjunction with scRNA-seq.
Platforms and protocols for scRNA-seq vary in their throughput (number of cells) and transcript coverage (3'/5' tag-based vs whole-transcript). Our team has experience working with several technologies, such as 10X Genomics, Drop-Seq, the BD Rhapsody system and protocols of the CEL-Seq and Smart-Seq families.
Here we present typical single-cell analyses, focusing on scRNA-seq but also covering its integration with other common single-cell assays. We also list single-cell papers that we have published.
Quality control and preprocessing
As with any NGS data, the analysis of single-cell sequencing data starts with quality control and preprocessing.
Raw sequencing reads are quality-tested and metrics such as cell quality, accuracy, and diversity are generated. Reads are then aligned to an applicable reference genome or transcriptome, and additional metrics such as the number of cells, reads per cell, genes per cell, sequencing saturation and fraction of mitochondrial transcripts are plotted and inspected.
These QC metrics inform us about the overall quality of the libraries and the usability of the samples, and enable us to identify and remove low-quality cells.
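As a minimal sketch of how such per-cell metrics can drive filtering, the Python example below computes total counts, detected genes and the mitochondrial fraction for a toy count matrix. The gene names, counts, the "MT-" prefix convention and the thresholds are all illustrative assumptions; in a real project the cutoffs are chosen per dataset after inspecting the QC plots.

```python
# Toy count matrix: one row per cell, one column per gene (all values are made up).
genes = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1", "CD3E"]
counts = {
    "cell_1": [120, 80, 10, 5, 30],
    "cell_2": [3, 1, 40, 30, 0],   # low complexity, mostly mitochondrial reads
    "cell_3": [90, 60, 8, 4, 0],
}


def qc_metrics(row, gene_names, mito_prefix="MT-"):
    """Return total counts, detected genes and mitochondrial fraction for one cell."""
    total = sum(row)
    detected = sum(1 for c in row if c > 0)
    mito = sum(c for c, g in zip(row, gene_names) if g.startswith(mito_prefix))
    return {
        "total_counts": total,
        "genes_detected": detected,
        "mito_fraction": mito / total if total else 0.0,
    }


def passes_qc(m, min_genes=4, max_mito_fraction=0.2):
    """Hypothetical thresholds, for illustration only."""
    return m["genes_detected"] >= min_genes and m["mito_fraction"] <= max_mito_fraction


metrics = {cell: qc_metrics(row, genes) for cell, row in counts.items()}
kept_cells = [cell for cell, m in metrics.items() if passes_qc(m)]
print(kept_cells)
```

In this toy example, cell_2 is removed because most of its reads map to mitochondrial genes, which is a common sign of a damaged cell.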
Further preprocessing is often carried out to remove unwanted signal, or noise, from certain downstream analyses. Typical steps include
- imputation to estimate read counts for dropouts, or genes with zero transcripts due to technical, rather than biological, reasons,
- normalization to remove biases due to e.g., differences in cell sizes, and
- reducing the data to representative variables such as highly variable genes or principal components.
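The normalization and feature-selection steps listed above can be sketched in a few lines of Python. This is a simplified illustration, assuming library-size normalization to a fixed target sum followed by a log transform, and ranking genes by plain variance; production tools typically model the mean-variance relationship instead. The toy counts, gene names and the target sum of 10,000 are assumptions made for the example.

```python
import math
from statistics import pvariance

genes = ["G1", "G2", "G3"]
# Toy raw counts, one row per cell. Cells 1-2 and cells 3-4 have the same
# composition but different sequencing depth.
counts = [
    [10, 0, 90],
    [20, 0, 180],
    [50, 50, 0],
    [5, 5, 0],
]


def normalize_log1p(row, target_sum=10_000):
    """Scale a cell to a common library size, then apply log(1 + x)."""
    total = sum(row)
    return [math.log1p(c / total * target_sum) for c in row]


def top_variable_genes(matrix, gene_names, n_top=2):
    """Rank genes by variance across cells (a deliberately naive HVG criterion)."""
    variances = {
        g: pvariance([row[j] for row in matrix]) for j, g in enumerate(gene_names)
    }
    return sorted(variances, key=variances.get, reverse=True)[:n_top]


normalized = [normalize_log1p(row) for row in counts]
hvgs = top_variable_genes(normalized, genes)
print(hvgs)
```

After normalization, the first two cells have identical profiles even though one was sequenced twice as deeply, which is the depth bias that normalization is meant to remove.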
Preprocessed single-cell RNA-seq data is clustered to identify groups of similar cells and visualized using non-linear dimensionality reduction algorithms, such as t-SNE and UMAP, and correlation heatmaps to unveil general patterns of cell heterogeneity.
These visualizations help us answer technical questions such as:
- Do the biological replicates resemble each other?
- Are there outlier samples or cells?
- Are the cell clusters distinct?
...and biological questions such as:
- How heterogeneous are the underlying cell types/states?
- Do distinct samples (e.g., different tissues, treatments or time points) form separate clusters?
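The first technical question above, whether replicates resemble each other, is what a correlation heatmap summarizes. The sketch below shows one way to compute its entries, assuming that the counts of each sample are summed into a pseudobulk profile, log-transformed and compared with Pearson correlation. The sample names and counts are invented for illustration.

```python
import math


def pseudobulk(cells):
    """Sum counts over all cells of a sample, gene by gene, then log-transform."""
    return [math.log1p(sum(column)) for column in zip(*cells)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy data: each sample holds two cells measured over three genes.
samples = {
    "control_rep1": [[10, 0, 3], [12, 1, 2]],
    "control_rep2": [[11, 0, 4], [9, 1, 3]],
    "treated_rep1": [[1, 9, 3], [0, 12, 2]],
}
profiles = {name: pseudobulk(cells) for name, cells in samples.items()}

r_replicates = pearson(profiles["control_rep1"], profiles["control_rep2"])
r_conditions = pearson(profiles["control_rep1"], profiles["treated_rep1"])
print(round(r_replicates, 3), round(r_conditions, 3))
```

Replicates of the same condition should correlate strongly, while a sample that correlates poorly with its replicates is a candidate outlier.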
Cell type identification
Identifying and characterizing cell types (and more refined cell states) is central to most single-cell projects.
It all starts with identifying features (e.g., genes, proteins, accessible regions) that are specific to each cell cluster. These markers are defined by a differential expression (DE) comparison between each cell cluster and the remaining ones, yielding DE statistics such as fold change and statistical significance.
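The "one cluster versus the rest" comparison can be illustrated with a small Python sketch. It computes only the log2 fold change of mean expression with a pseudocount; a significance test (for example a rank-sum test) would be added in a real analysis. The expression values and cluster labels are made up for the example.

```python
import math
from statistics import mean

# Hypothetical normalized expression per gene across six cells.
expression = {
    "CD3E": [5.0, 6.0, 0.0, 1.0, 0.0, 0.0],
    "MS4A1": [0.0, 0.0, 4.0, 4.0, 0.0, 1.0],
    "ACTB": [8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
}
clusters = ["T", "T", "B", "B", "other", "other"]


def log2_fold_change(values, labels, cluster, pseudocount=1.0):
    """log2 ratio of mean expression inside the cluster vs all remaining cells."""
    inside = [v for v, lab in zip(values, labels) if lab == cluster]
    outside = [v for v, lab in zip(values, labels) if lab != cluster]
    return math.log2((mean(inside) + pseudocount) / (mean(outside) + pseudocount))


def rank_markers(expr, labels, cluster):
    """Sort genes from most to least enriched in the given cluster."""
    scores = {g: log2_fold_change(v, labels, cluster) for g, v in expr.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


t_markers = rank_markers(expression, clusters, "T")
b_markers = rank_markers(expression, clusters, "B")
print(t_markers[0], b_markers[0])
```

The uniformly expressed gene receives a fold change of zero for every cluster, so it is never reported as a marker.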
The cluster markers can be visualized using scatter plots, violin plots, and heatmaps.
Markers are further annotated to biologically meaningful terms, such as biological processes, signaling pathways or specific diseases. Such analyses may rely either on over-representation analysis or gene set enrichment analysis, which both result in a list of enriched gene sets with relevant statistics and annotations.
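Over-representation analysis is commonly based on a hypergeometric (one-sided Fisher) test: given a gene universe, how likely is it to see at least the observed overlap between the marker list and a gene set by chance? The sketch below implements that test with the Python standard library. The gene identifiers and set sizes are invented, and multiple-testing correction across gene sets is left out.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_set, n_markers, n_overlap):
    """P(overlap >= n_overlap) when n_markers genes are drawn from n_universe
    genes, n_set of which belong to the gene set of interest."""
    upper = min(n_set, n_markers)
    favourable = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_markers - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_markers)


# Toy example: 20 genes measured, a 5-gene pathway, 4 cluster markers.
universe = {f"gene{i}" for i in range(20)}
pathway = {"gene0", "gene1", "gene2", "gene3", "gene4"}
markers = {"gene0", "gene1", "gene2", "gene10"}

overlap = markers & pathway
p_value = hypergeom_enrichment_p(
    len(universe), len(pathway), len(markers), len(overlap)
)
print(len(overlap), round(p_value, 4))
```

Here three of the four markers fall in the pathway, which is unlikely by chance in a universe of 20 genes, so the pathway would be reported as enriched.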
Single-cell datasets are typically also integrated with publicly available data in order to exploit the cell-type information from already annotated datasets or cell atlases. This enables transferring cell labels into the analyzed dataset.
The transferred cell labels and identified markers and their annotations are used, together with prior information on cell-type/state markers, to identify the captured cell types.
In addition to characterizing distinct cellular identities, single-cell data lends itself to identifying continuums of gradual change in cell state, or trajectories. Uncovering such continuums is also called pseudotime analysis — while all cells are sampled at the same time point, individual cells may represent different stages in a temporal process such as differentiation.
De novo reconstruction of lineage differentiation and cell maturation trajectories allows exploring cellular dynamics, delineating cell developmental lineages, and characterizing transitions between cell states along a latent pseudotime dimension.
An ensemble of trajectory inference algorithms may be used for robust identification of root and terminal cellular states, branching points, and lineages. Single cells are ranked across deterministic or probabilistic lineages, and their ranking indicates their progression in a dynamic process of interest.
This type of analysis may also utilize the ratio of processed and unprocessed transcripts to infer whether a gene's expression is increasing or decreasing in a given cell. Combining this information from all quantified genes at a given state enables inferring the direction and pace of change in states. This is called RNA velocity analysis.
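The idea can be made concrete with a deliberately simplified steady-state model, in which the velocity of a gene in a cell is the unspliced (unprocessed) abundance minus a fitted degradation ratio times the spliced (processed) abundance. The sketch below assumes a least-squares fit through the origin over all cells, whereas dedicated tools use more refined fitting; the abundances are made up.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin of unspliced vs spliced abundance.
    Under the steady-state assumption this approximates the degradation ratio."""
    numerator = sum(u * s for u, s in zip(unspliced, spliced))
    denominator = sum(s * s for s in spliced)
    return numerator / denominator


def velocities(unspliced, spliced):
    """Per-cell velocity for one gene: positive means expression is increasing."""
    gamma = fit_gamma(unspliced, spliced)
    return [u - gamma * s for u, s in zip(unspliced, spliced)]


# Toy abundances of one gene in four cells.
spliced = [2.0, 4.0, 6.0, 8.0]
unspliced = [2.0, 2.0, 3.0, 3.0]

gamma = fit_gamma(unspliced, spliced)
v = velocities(unspliced, spliced)
print(gamma, [round(x, 2) for x in v])
```

Cells with more unspliced transcript than the steady-state ratio predicts get a positive velocity (the gene is being induced), and cells with less get a negative one.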
Integrative single-cell analyses
Integrative single-cell analyses bring together different datasets, including different data types and species. This enables more accurate and detailed cell labeling and mechanistic insight into gene regulation in the studied system. Such analyses rely on common properties, or anchors, between the datasets, such as matched features (e.g., genes or homologues) or matched cells.
Integrating multiple single-cell RNA-seq datasets
Perhaps the most common integration of single-cell datasets takes place between scRNA-seq datasets from different sources or technology platforms. Using genes as anchors, a successful integration removes the technical bias while retaining biological variance of the datasets.
Combining different scRNA-seq datasets is particularly helpful when there is a well-characterized public expression atlas available for a relevant tissue or organism.
Integrating single-cell RNA-seq and epigenomics
Integrating single-cell RNA-seq data with single-cell ATAC-seq or single-cell methylation data often relies on matched cells as anchors (when the measurements derive from the same cells as in, e.g., 10X Genomics Multiome technology).
Combining expression data with chromatin accessibility or methylation profiles enables more robust identification of cell types and allows for quantifying the effect of chromatin state on expression in individual cell types.
Integrating single-cell RNA-seq and proteomics
Since proteins, rather than transcripts, are key drivers of cellular functions, single-cell proteomics complements scRNA-seq experiments with more accurate estimates of cells' functional states.
Single-cell proteomic profiling (CITE-seq, flow cytometry, mass cytometry, and mass spectrometry) comes in different degrees of throughput (number of proteins quantified) and can be targeted specifically to surface proteins, as in CITE-seq, which quantifies a panel of surface proteins from cells with matched scRNA-seq reads.
Surface proteins are particularly useful in cell type identification, while the inclusion of cytosolic proteins enables better characterization of pathway and gene-regulatory activities.
Cross-species integrative analysis enables the identification of cell-type phylogenies that define the relationships of evolutionary and developmental mechanisms between different organisms. Shared homologues are used as anchors in cross-species integration.
This is particularly helpful when a disease/organ is better characterized at single-cell resolution in an animal model than in humans.
Ligand-receptor (LR) analysis uncovers cell-cell interactions that coordinate homeostasis, development, and other system-level functions. Changes and dysfunction in such interactions may go unnoticed in an analysis limited to the internal state of individual cells or cell types.
Ligand-receptor analysis identifies and quantifies intercellular interactions based on the expression of known receptors and their ligands. The interactions may take place within or between tissues, and the strength of this interaction is compared between biological conditions of interest, such as patient groups, disease states, and treatments.
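One simple way to quantify such an interaction is to multiply the mean ligand expression in the sending cell type by the mean receptor expression in the receiving cell type. The sketch below uses that score purely as an illustration; it is an assumption, not necessarily the scoring used in any particular tool, and the expression values, cell-type labels and single ligand-receptor pair are made up.

```python
from statistics import mean

# Hypothetical normalized expression per gene across four cells.
expression = {
    "TGFB1": [4.0, 6.0, 0.0, 1.0],    # ligand
    "TGFBR2": [1.0, 1.0, 3.0, 5.0],   # receptor
}
cell_types = ["fibroblast", "fibroblast", "T cell", "T cell"]
lr_pairs = [("TGFB1", "TGFBR2")]  # would come from a curated database in practice


def mean_by_type(values, labels):
    """Average expression of one gene within each cell type."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}


def lr_scores(expr, labels, pairs):
    """Score every (ligand, receptor, sender, receiver) combination."""
    scores = {}
    for ligand, receptor in pairs:
        ligand_means = mean_by_type(expr[ligand], labels)
        receptor_means = mean_by_type(expr[receptor], labels)
        for sender, lig in ligand_means.items():
            for receiver, rec in receptor_means.items():
                scores[(ligand, receptor, sender, receiver)] = lig * rec
    return scores


scores = lr_scores(expression, cell_types, lr_pairs)
strongest = max(scores, key=scores.get)
print(strongest, scores[strongest])
```

Computing the same scores separately for each biological condition and comparing them is one way to highlight interactions that change between, for example, patient groups or treatments.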
Spatial transcriptomic analysis
Spatially resolved single-cell transcriptomic assays couple expression data with the cell's positional context in a tissue or organ. This is particularly useful in the study of complex solid tissues, such as tumors and their microenvironment.
Spatial transcriptomic analysis involves cell/spot clustering in space, identification of spatially variable genes and resolving cell types in space.
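A widely used statistic for spatial variability is Moran's I, which measures whether neighboring spots have similar expression values. The sketch below assumes spots on a regular grid with binary neighbor weights (spots are neighbors when they are one grid step apart); the grid and expression patterns are synthetic, and this is one possible criterion rather than a description of a specific pipeline.

```python
def morans_i(values, coords):
    """Moran's I with binary weights: spots one grid step apart are neighbors."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            distance = abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1])
            if distance == 1:
                weighted_sum += deviations[i] * deviations[j]
                weight_total += 1.0
    variance_sum = sum(d * d for d in deviations)
    return (n / weight_total) * (weighted_sum / variance_sum)


# Synthetic 4 x 4 grid of spots.
coords = [(row, col) for row in range(4) for col in range(4)]
regional_gene = [1.0 if col < 2 else 0.0 for _, col in coords]         # left half "on"
alternating_gene = [float((row + col) % 2) for row, col in coords]     # checkerboard

i_regional = morans_i(regional_gene, coords)
i_alternating = morans_i(alternating_gene, coords)
print(round(i_regional, 3), round(i_alternating, 3))
```

Values near +1 indicate a gene expressed in a contiguous region, values near 0 indicate no spatial structure, and negative values indicate that neighbors tend to differ.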
Retaining the positional information of sequenced cells adds to the accuracy of identifying cell types and ligand-receptor interactions. It also enables spatial visualization of gene expression or chromatin accessibility (in the case of scATAC-seq) and integrating imaging-based data into the analysis.
Even in the case of lower-resolution assays, like 10X Visium, multimodal spatial analysis helps in correcting gene expression values and imputing dropout events.
Meet some of our single-cell experts
I specialize in gene and genome regulation, particularly in immunology, cancer research, DNA repair and cellular senescence.
For over 10 years, I have developed and applied computational pipelines to analyze data from transcriptomic and epigenomic sequencing assays, including scRNA-seq, scATAC-seq, spatial transcriptomics, ChIP-seq, RNA-seq, GRO-seq, ATAC-seq, CAGE-seq, XR-seq, DRIP-seq, BLISS-seq, Damage-seq, INI-seq, and HiC.
I have enjoyed working in multidisciplinary teams — as a bioinformatician, postdoc researcher, head of a single cell NGS bioinformatics facility and, most recently, as a project manager at Genevia.
I am an experienced biologist/bioinformatician specialized in mapping and functionally interrogating DNA regulatory elements and their target genes in normal cell development and disease using high-throughput genomics and genome editing tools.
I have over 7 years of experience in profiling accessible chromatin using high-throughput methods (DNase I-seq, ATAC-seq, ChIP-seq, CUT&RUN) and in transcriptomic profiling (RNA-seq) along hematopoietic development. I also have 4+ years of experience in analyzing single-cell multi-omic data (10X Genomics scRNA-seq, scATAC-seq) using a variety of computational tools.
Additionally, my all-around experience in life science data analysis and statistical implementation includes cancer database mining, mutational signature analysis, survival analysis, and machine learning applications.
As a scientist, I specialize in cellular differentiation and RNA biology. I have been studying the interplay between transcriptional regulators and non-coding RNAs in a multitude of biomedical contexts, including mesenchymal stem cell differentiation, endothelial cell differentiation, atherosclerosis and leukemia.
From the methodological perspective, I have been trained as a comprehensive systems biologist and generated NGS datasets myself in my research projects. I am experienced in analyzing sequencing data from RNA (mRNA-seq, short RNA-seq, scRNA-seq, GRO-seq, TT-seq) and DNA libraries (ChIP-seq, ATAC-seq, CITE-seq) and in integrating different data modalities to gain ever deeper insight into complex systems.
- Single-cell expression analysis (Customer case: Becton Dickinson)
- Single-cell RNA sequencing data analysis (blog post)
- RNA-seq data analysis
Selected publications from our team
- Armaka, M. et al. (2022). Single-cell multimodal analysis identifies common regulatory programs in synovial fibroblasts of rheumatoid arthritis patients and modeled TNF-driven arthritis. Genome medicine, 14(1), 78. https://doi.org/10.1186/s13073-022-01081-3
- Pham, T. et al. (2022). Modeling human extraembryonic mesoderm cells using naive pluripotent stem cells. Cell stem cell, 29(9), 1346–1365.e10. https://doi.org/10.1016/j.stem.2022.08.001
- Detsika, M. G. et al. (2022). Upregulation of CD55 complement regulator in distinct PBMC subpopulations of COVID-19 patients is associated with suppression of interferon responses. bioRxiv, 2022.10.07.510750. https://doi.org/10.1101/2022.10.07.510750
- Roos, K. et al. (2022). Single-cell RNA-seq analysis and cell-cluster deconvolution of the human preovulatory follicular fluid cells provide insights into the pathophysiology of ovarian hyporesponse. Frontiers in endocrinology, 13, 945347. https://doi.org/10.3389/fendo.2022.945347
- Smith, C. et al. (2022). A comparative transcriptomic analysis of ...* -like peptide-1 receptor- and glucose-dependent insulinotropic polypeptide-expressing cells in the hypothalamus. Appetite, 174, 106022. https://doi.org/10.1016/j.appet.2022.106022
- Tzaferis, C. et al. (2022). SCALA: A web application for multimodal analysis of single cell next generation sequencing data. bioRxiv, 2022.11.24.517826. https://doi.org/10.1101/2022.11.24.517826
- Taavitsainen, S. et al. (2021). Single-cell ATAC and RNA sequencing reveal pre-existing and persistent cells associated with prostate cancer relapse. Nature communications, 12(1), 5307. https://doi.org/10.1038/s41467-021-25624-1
- Georgolopoulos, G. et al. (2021). Discrete regulatory modules instruct hematopoietic lineage commitment and differentiation. Nature communications, 12(1), 6790. https://doi.org/10.1038/s41467-021-27159-x
- Mehtonen, J. et al. (2020). Single cell characterization of B-lymphoid differentiation and leukemic cell states during chemotherapy in ETV6-RUNX1-positive pediatric leukemia identifies drug-targetable transcription factor activities. Genome medicine, 12(1), 99. https://doi.org/10.1186/s13073-020-00799-2
- Adriaenssens, A. E. et al. (2019). Glucose-Dependent Insulinotropic Polypeptide Receptor-Expressing Cells in the Hypothalamus Regulate Food Intake. Cell metabolism, 30(5), 987–996.e6. https://doi.org/10.1016/j.cmet.2019.07.013
* Names of pharmaceuticals removed to comply with regulation in certain countries