DNA sequencing data analysis

Understand the effects of genetic variation and mutations with DNA sequencing data analysis.

DNA sequencing comes in many forms. Whole-genome sequencing (WGS), whole-exome sequencing (WES) and targeted sequencing enable the study of heritable and somatic DNA variants. In addition to NGS data, SNP and CGH arrays can be used to identify genetic polymorphisms and copy-number variants, respectively. Metagenomic whole-genome sequencing of microbial communities allows their composition and functions to be analyzed.

We routinely analyze DNA sequence data to address research questions in both basic biology and biomedical settings. Below we present some typical DNA-sequencing data analyses. If you would like to learn how we can help you get the most out of your DNA-seq data, leave us a message and we will book a short call with one of our experts.


Variant analysis

In most cases, DNA sequencing is used to identify and analyze genetic variants. These variants can be small nucleotide substitutions, insertions, deletions, copy-number alterations or structural variants. Furthermore, they may be heritable polymorphisms or somatic mutations.

Variant analysis typically starts with quality control of the raw DNA-sequencing data and alignment of the sequencing reads against a reference genome. Variants that differ between the sample and the public reference, or between different samples, can then be computationally identified.

A crucial part of variant analysis is annotating the detected variants. Annotations such as allele frequencies (both in-sample and in public databases such as gnomAD), predicted effects on protein structure or gene regulation, and predicted pathogenicity allow for flexible selection or ranking of variants for downstream analyses and interpretation.
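To illustrate how annotations can drive variant selection and ranking, here is a minimal Python sketch. It assumes the variants have already been annotated and loaded into dictionaries; the field names (`gnomad_af`, `consequence`), the severity ranking and the 1% frequency threshold are hypothetical choices made for this example, not a fixed convention.

```python
from typing import Optional

# Hypothetical annotated variants; field names and values are illustrative only.
variants = [
    {"id": "var1", "gnomad_af": 0.25, "consequence": "missense"},
    {"id": "var2", "gnomad_af": 0.0004, "consequence": "stop_gained"},
    {"id": "var3", "gnomad_af": None, "consequence": "synonymous"},
    {"id": "var4", "gnomad_af": 0.002, "consequence": "missense"},
]

# Illustrative severity ranking (higher means more severe); an assumption.
SEVERITY = {"stop_gained": 3, "missense": 2, "synonymous": 1}


def is_rare(af: Optional[float], max_af: float = 0.01) -> bool:
    """Treat variants absent from the database (None) as rare."""
    return af is None or af <= max_af


def select_and_rank(variants, max_af=0.01):
    """Keep rare variants, then rank by severity and population frequency."""
    rare = [v for v in variants if is_rare(v["gnomad_af"], max_af)]
    # Most severe first; ties broken by lower allele frequency (None counts as 0).
    return sorted(
        rare,
        key=lambda v: (-SEVERITY.get(v["consequence"], 0), v["gnomad_af"] or 0.0),
    )


ranked = select_and_rank(variants)
print([v["id"] for v in ranked])
```

In practice the thresholds and ranking criteria depend on the research question; a study of rare disease would typically filter far more strictly than a study of common polymorphisms.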

Variant analysis in cancer research often focuses on identifying somatic mutations that accelerate tumorigenesis (driver mutations) or that can be used to diagnose a patient or predict their course of disease. However, non-driver mutations (passengers) carry information, too. They improve the reliability of analyses of mutational signatures and cancer cell clonality. Learn more about mutation analysis in cancer research.

Genome assembly

For organisms with no reference genomes or highly dynamic genomes, DNA-sequencing data analysis starts with assembling a genome de novo. Genome assembly benefits from deep whole-genome sequencing.

An assembled genome is annotated based on sequence homology, predicted gene sequences and, if available, RNA-sequencing data from the same organism. If annotated genomes exist for closely related species, the annotation can be improved by transferring gene information to the newly assembled genome.

The quality of an assembled genome is assessed using metrics such as N50, L50 and completeness with regard to highly conserved orthologs. A new high-quality genome enables analyses of pan-genomes, population genetics and much more!
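As a brief illustration of the contiguity metrics, the sketch below computes N50 and L50 from a list of contig lengths. N50 is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly size; L50 is the number of contigs needed to get there. The function name and the example lengths are made up for this example.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum to half the assembly
        # size defines N50; the number of contigs used so far is L50.
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly with a total length of 300
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```

A higher N50 and a lower L50 indicate a more contiguous assembly, but neither says anything about correctness, which is why completeness against conserved orthologs is assessed separately.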


Metagenomics

Metagenomics offers an unbiased view into the microbial diversity of ecological niches, such as host organisms and soil. Using shotgun whole-genome sequencing data, reads are assembled into contigs and assigned to species or operational taxonomic units (OTUs).

Identified species or OTUs are organized into a phylogeny and quantified. The functions brought about by individual genes or multi-gene pathways present in the sequenced community can be identified using public databases.
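The quantification step can be sketched in its simplest form as turning per-read taxon assignments into relative abundances. The taxon labels and the `unclassified` placeholder below are hypothetical, and the sketch deliberately ignores corrections that real analyses often apply, such as normalizing for genome length.

```python
from collections import Counter

# Hypothetical per-read taxon assignments from an upstream classifier.
assignments = [
    "taxonA", "taxonB", "taxonA", "taxonC",
    "taxonA", "taxonB", "unclassified", "taxonA",
]


def relative_abundance(assignments, ignore=("unclassified",)):
    """Return the fraction of classified reads assigned to each taxon."""
    counts = Counter(a for a in assignments if a not in ignore)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


abundance = relative_abundance(assignments)
for taxon, fraction in sorted(abundance.items()):
    print(f"{taxon}\t{fraction:.3f}")
```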

Note that 16S amplicon sequencing, a cost-effective alternative to metagenomic sequencing, can be used to identify species and build phylogenies, but it does not allow for high-quality functional analyses.

Population genetics

Genome-wide measurements of individuals sampled from related populations contain rich information on the populations’ structure, genealogy and history. Population genetic analyses of non-model organisms often begin with genome assembly and annotation, and proceed to identifying genetic polymorphisms in the sampled populations. Downstream analyses based on these polymorphisms and their allele frequencies help to study evolutionary phenomena such as speciation and adaptation.

Typical analyses involve principal component analysis, analysis of genetic variation within and between populations to identify loci affected by evolutionary selection, and analyses of population admixtures, phylogeny and demographic histories.
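One widely used way to compare variation within and between populations is the fixation index FST. The text above does not name a specific statistic, so FST is our choice for this illustration. The sketch below computes a simple heterozygosity-based FST for a single biallelic locus in two equally weighted populations; estimators used in practice additionally correct for sample size and combine information across loci.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """Simple FST = (H_T - H_S) / H_T for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_mean = (p1 + p2) / 2.0
    h_total = heterozygosity(p_mean)  # H_T: heterozygosity of the pooled population
    if h_total == 0.0:
        # The locus is fixed for the same allele in both populations.
        return 0.0
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0  # H_S
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies at three loci
for p1, p2 in [(0.3, 0.3), (0.1, 0.9), (0.0, 1.0)]:
    print(p1, p2, round(fst_two_populations(p1, p2), 3))
```

Loci with unusually high differentiation compared with the genome-wide background are candidates for having been affected by selection, although demographic history can produce similar patterns.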

Genome-wide association analysis

Biomedically motivated population-scale genetic analyses aim to identify genes and variants associated with relevant phenotypes or diseases. Apart from the few diseases that are monogenic and strongly heritable, most diseases require large, population-level sample sizes to achieve sufficient statistical power to find associations. Such genome-wide association studies (GWAS) are based on SNP-array or DNA-sequencing data from biobanks or other large repositories.

A GWAS produces summary statistics on the association between each individual variant and the studied disease. In the case of polygenic diseases, individual variants may have very small effect sizes even when the disease is strongly heritable. In such cases, polygenic risk scores (PRS) can be used to sum the effects of a large number of variants, resulting in a combined risk score with potential clinical utility.
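In its most basic form, a polygenic risk score is a weighted sum: each variant's effect size from the GWAS summary statistics is multiplied by the number of effect alleles the individual carries, and the products are added up. The sketch below shows this calculation. The variant identifiers, effect sizes and genotypes are invented for the example, and treating a missing genotype as zero effect alleles is a simplifying assumption.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: variant ID -> number of effect alleles carried (0, 1 or 2)
    weights: variant ID -> per-allele effect size from GWAS summary statistics
    Variants missing from `dosages` are treated as dosage 0 (an assumption).
    """
    return sum(weight * dosages.get(variant, 0) for variant, weight in weights.items())


# Hypothetical effect sizes (for example, log odds ratios) for three variants
weights = {"variant_1": 0.12, "variant_2": -0.05, "variant_3": 0.30}

# Hypothetical genotypes of one individual
dosages = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

score = polygenic_risk_score(dosages, weights)
print(f"PRS = {score:.2f}")
```

A raw score like this is only meaningful relative to a reference distribution, so it is usually compared against the scores of a matched population before any interpretation.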

Meet some of our genomics experts

I am specialised in population genetics and evolutionary genetics, with over 8 years’ experience constructing Python and R based workflows for a wide range of genetic datasets.

I have worked with data types including whole-genome re-sequencing, multispecies whole-genome alignments, RNA sequencing and SNP genotyping arrays (high and low density). From these I have analysed types of variation including SNPs/SNVs, INDELs and CNVs, both in present-day populations using polymorphism data and over evolutionary time using divergence data.

I have worked on a broad range of scientific topics in my career including genome evolution, detecting targets of selection, cancer genetics and viral analysis. Organism-wise, I have worked on data from across the tree of life, including humans, mice, fish, birds, insects, grasses and viruses.

Henry Barton, Scientific Project Manager, Genevia Technologies Oy

My experience of over 8 years analyzing large volumes of multi-omics data covers a broad range of topics, including metagenomics, genomics of non-model organisms and genome-wide association studies.

I am also a software developer in the context of biological data analysis, including larger projects involving multi-disciplinary teams and end-users.

I have a broad range of experience in bioinformatics including, but not limited to, genome assembly & annotation, comparative genomics, phylogenetics, identification of novel species, transcriptomics, antimicrobial resistance and machine learning applications in a biomedical context.

Felipe Simao, Scientific Project Manager, R&D manager, Genevia Technologies Oy

I am a bioinformatician specialised in cancer genomics and genetics, with 10 years of experience analysing omics data in numerous genetic, genomic, transcriptomic and epigenetic studies.

While my research focus has been on cancer, I have also gained experience in a number of other fields, such as immunology, aging and developmental biology. In recent years, I have also applied machine learning methods to harness biomedical data in various clinical applications.

Tommi Rantapero, Scientific Project Manager, Genevia Technologies Oy

Learn more

Above we introduced some of the computational analyses applied to various types of DNA-seq data. However, our team's experience runs much deeper: take a look at our references and publications.

Learn more about DNA-seq data analysis

References and customer cases

Selected publications from our customers

  • Yuan, O. et al. (2022). A somatic mutation in moesin drives progression into acute myeloid leukemia. Science advances, 8(16), eabm9987. https://doi.org/10.1126/sciadv.abm9987
  • Wahlström, G. et al. (2022). The variant rs77559646 associated with aggressive prostate cancer disrupts ANO7 mRNA splicing and protein expression. Human molecular genetics, ddac012. Advance online publication. https://doi.org/10.1093/hmg/ddac012
  • Kundu, S. et al. (2021). Common and mutation specific phenotypes of KRAS and BRAF mutations in colorectal cancer cells revealed by integrative -omics analysis. Journal of experimental & clinical cancer research : CR, 40(1), 225. https://doi.org/10.1186/s13046-021-02025-2
  • Pernaute-Lau, L. et al. (2021). Pharmacogene Sequencing of a Gabonese Population with Severe Plasmodium falciparum Malaria Reveals Multiple Novel Variants with Putative Relevance for Antimalarial Treatment. Antimicrobial agents and chemotherapy, 65(7), e0027521. https://doi.org/10.1128/AAC.00275-21
  • Åvall-Jääskeläinen, S. et al. (2021). Genomic Analysis of Staphylococcus aureus Isolates Associated With Peracute Non-gangrenous or Gangrenous Mastitis and Comparison With Other Mastitis-Associated Staphylococcus aureus Isolates. Frontiers in microbiology, 12, 688819. https://doi.org/10.3389/fmicb.2021.688819
  • Wullt, B. et al. (2021). Immunomodulation-A Molecular Solution to Treating Patients with Severe Bladder Pain Syndrome?. European urology open science, 31, 49–58. https://doi.org/10.1016/j.euros.2021.07.003
  • Gallegos, J. E. et al. (2020). Challenges and opportunities for strain verification by whole-genome sequencing. Scientific reports, 10(1), 5873. https://doi.org/10.1038/s41598-020-62364-6
  • Tikkanen, T. et al. (2018). Seshat: A Web service for accurate annotation, validation, and analysis of TP53 variants generated by conventional and next-generation sequencing. Human mutation, 39(7), 925–933. https://doi.org/10.1002/humu.23543

Selected publications from our team

  • Rajamäki, K. et al. (2021). Genetic and Epigenetic Characteristics of Inflammatory Bowel Disease-Associated Colorectal Cancer. Gastroenterology, 161(2), 592–607. https://doi.org/10.1053/j.gastro.2021.04.042
  • Vandekerkhove, G. et al. (2021). Plasma ctDNA is a tumor tissue surrogate and enables clinical-genomic stratification of metastatic bladder cancer. Nature communications, 12(1), 184. https://doi.org/10.1038/s41467-020-20493-6
  • Cerqueira, J. et al. (2021). Independent and cumulative coeliac disease-susceptibility loci are associated with distinct disease phenotypes. Journal of human genetics, 66(6), 613–623. https://doi.org/10.1038/s10038-020-00888-5
  • Yusuf, L. et al. (2020). Noncoding regions underpin avian bill shape diversification at macroevolutionary scales. Genome research, 30(4), 553–565. https://doi.org/10.1101/gr.255752.119
  • Lindfors, K. et al. (2020). Metagenomics of the faecal virome indicate a cumulative effect of enterovirus and gluten amount on the risk of coeliac disease autoimmunity in genetically at risk children: the TEDDY study. Gut, 69(8), 1416–1422. https://doi.org/10.1136/gutjnl-2019-319809
  • Manni, M. et al. (2020). The Genome of the Blind Soil-Dwelling and Ancestrally Wingless Dipluran Campodea augens: A Key Reference Hexapod for Studying the Emergence of Insect Innovations. Genome biology and evolution, 12(1), 3534–3549. https://doi.org/10.1093/gbe/evz260
  • Hayes, K. et al. (2020). A Study of Faster-Z Evolution in the Great Tit (Parus major). Genome biology and evolution, 12(3), 210–222. https://doi.org/10.1093/gbe/evaa044
  • Rotenberg, D. et al. (2020). Genome-enabled insights into the biology of thrips as crop pests. BMC biology, 18(1), 142. https://doi.org/10.1186/s12915-020-00862-9
  • Oeyen, J. P. et al. (2020). Sawfly Genomes Reveal Evolutionary Acquisitions That Fostered the Mega-Radiation of Parasitoid and Eusocial Hymenoptera. Genome biology and evolution, 12(7), 1099–1188. https://doi.org/10.1093/gbe/evaa106
  • Taavitsainen, S. et al. (2019). Evaluation of Commercial Circulating Tumor DNA Test in Metastatic Prostate Cancer. JCO precision oncology, 3, PO.19.00014. https://doi.org/10.1200/PO.19.00014
  • Zeng, K. et al. (2019). Methods for Estimating Demography and Detecting Between-Locus Differences in the Effective Population Size and Mutation Rate. Molecular biology and evolution, 36(2), 423–433. https://doi.org/10.1093/molbev/msy212
  • Barton, H. J. et al. (2019). The Impact of Natural Selection on Short Insertion and Deletion Variation in the Great Tit Genome. Genome biology and evolution, 11(6), 1514–1524. https://doi.org/10.1093/gbe/evz068
  • Olofsson, J. K. et al. (2019). Population-Specific Selection on Standing Variation Generated by Lateral Gene Transfers in a Grass. Current biology : CB, 29(22), 3921–3927.e5. https://doi.org/10.1016/j.cub.2019.09.023
  • Kriventseva, E. V. et al. (2019). OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research, 47(D1), D807–D811. https://doi.org/10.1093/nar/gky1053
  • Gao, Q. et al. (2018). Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. Cell reports, 23(1), 227–238.e3. https://doi.org/10.1016/j.celrep.2018.03.050
  • Lin, J. et al. (2018). Bioinformatics Assembling and Assessment of Novel Coxsackievirus B1 Genome. Methods in molecular biology (Clifton, N.J.), 1838, 261–272. https://doi.org/10.1007/978-1-4939-8682-8_18
  • Kaikkonen, E. et al. (2018). ANO7 is associated with aggressive prostate cancer. International journal of cancer, 143(10), 2479–2487. https://doi.org/10.1002/ijc.31746
  • Barton, H. J. et al. (2018). New Methods for Inferring the Distribution of Fitness Effects for INDELs and SNPs. Molecular biology and evolution, 35(6), 1536–1546. https://doi.org/10.1093/molbev/msy054
  • Kim, J. M. et al. (2018). A high-density SNP chip for genotyping great tit (Parus major) populations and its application to studying the genetic architecture of exploration behaviour. Molecular ecology resources, 18(4), 877–891. https://doi.org/10.1111/1755-0998.12778
  • Corcoran, P. et al. (2017). Determinants of the Efficacy of Natural Selection on Coding and Noncoding Variability in Two Passerine Species. Genome biology and evolution, 9(11), 2987–3007. https://doi.org/10.1093/gbe/evx213
  • Ioannidis, P. et al. (2017). Genomic Features of the Damselfly Calopteryx splendens Representing a Sister Clade to Most Insect Orders. Genome biology and evolution, 9(2), 415–430. https://doi.org/10.1093/gbe/evx006
  • Määttä, K. et al. (2016). Whole-exome sequencing of Finnish hereditary breast cancer families. European journal of human genetics : EJHG, 25(1), 85–93. https://doi.org/10.1038/ejhg.2016.141
  • Pritchard, C. C. et al. (2016). Inherited DNA-Repair Gene Mutations in Men with Metastatic Prostate Cancer. The New England journal of medicine, 375(5), 443–453. https://doi.org/10.1056/NEJMoa1603144
  • Laitinen, V. H. et al. (2016). Germline copy number variation analysis in Finnish families with hereditary prostate cancer. The Prostate, 76(3), 316–324. https://doi.org/10.1002/pros.23123
  • Hoy, M. A. et al. (2016). Genome Sequencing of the Phytoseiid Predatory Mite Metaseiulus occidentalis Reveals Completely Atomized Hox Genes and Superdynamic Intron Evolution. Genome biology and evolution, 8(6), 1762–1775. https://doi.org/10.1093/gbe/evw048
  • Simão, F. A. et al. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics (Oxford, England), 31(19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351
  • Neafsey, D. E. et al. (2015). Mosquito genomics. Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science (New York, N.Y.), 347(6217), 1258522. https://doi.org/10.1126/science.1258522


Contact us

Leave your email address here with a brief description of your needs, and we will contact you to get things moving forward!

Antti Ylipää, CEO, co-founder, Genevia Technologies Oy, +358 40 747 7672

