What is multiple testing correction?
Most quantitative results in bioinformatics come with statistical scores such as p-values. While p-values are helpful in sorting out true phenomena from flukes, raw p-values may be misleading. In high-throughput data analysis in particular, p-values are corrected for multiple testing. Here we demonstrate multiple testing correction, using differential expression analysis as an example.
What is multiple testing?
Multiple testing refers to simultaneous testing of multiple hypotheses. Typical cases in bioinformatics include transcriptome-wide differential expression (DE) analyses or genome-wide association studies (GWAS). These analyses may yield thousands of p-values, one for each gene (in the case of DE analysis) or for each variant (GWAS).
P-values are conventionally thresholded at, e.g., 0.05, to identify events that would be rare (fewer than 1 in 20 cases) if the null hypothesis were true. For an individual test, a significant p-value is unlikely to arise just by chance, but with a high number of tested hypotheses, "accidentally significant" false-positive findings are inevitable.
In a differential expression analysis of 10,000 genes, about 500 (or 5%) would be expected to have a p-value below 0.05 just by chance, even when no true differences are present. That is a lot of false discoveries!
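The arithmetic above can be checked with a small simulation. This is only a sketch: it assumes that p-values under the null hypothesis are uniformly distributed between 0 and 1 and that the tests are independent, and the variable names are illustrative.

```python
import random

rng = random.Random(0)           # fixed seed so the run is repeatable
n_genes = 10_000
alpha = 0.05

# Assumption: with no true differences, each p-value is uniform on [0, 1).
null_pvalues = [rng.random() for _ in range(n_genes)]

# Count the "accidentally significant" genes.
n_false_positives = sum(p < alpha for p in null_pvalues)
expected_false_positives = n_genes * alpha

# Assumption: independent tests. Probability of at least one false positive.
prob_at_least_one = 1 - (1 - alpha) ** n_genes

print("observed false positives:", n_false_positives)
print("expected false positives:", expected_false_positives)
print("P(at least one false positive):", prob_at_least_one)
```

The observed count varies from seed to seed but stays close to the expected value, and at this scale at least one false positive is practically certain.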
Even when true differences are present, false positives will still contaminate the significant findings.
The good news is that the number of false positives can be controlled by correcting the p-values based on their distribution.
What do p-value distributions tell us?
Whenever multiple testing takes place, it is a great idea to inspect the p-value distribution. A flat distribution with 5% of the p-values below 0.05 suggests no true differences. In such cases, any reasonable p-value adjustment should render all findings non-significant. On the other hand, an inflated rate of significant values (higher than 5% for p < 0.05) suggests true differences.
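A quick way to perform this check without a plot is to bin the p-values and look at the first bin. The sketch below is not the code behind the figures in this post. It assumes a simple two-sided z-test per gene, and the effect size of 4 standard errors for the true DE genes is an arbitrary choice made for illustration.

```python
import math
import random

def simulate_pvalues(n_genes, n_de, effect, rng):
    """Two-sided z-test p-values; the first n_de genes carry a true shift."""
    pvalues = []
    for gene in range(n_genes):
        mean = effect if gene < n_de else 0.0
        z = rng.gauss(mean, 1.0)
        pvalues.append(math.erfc(abs(z) / math.sqrt(2)))
    return pvalues

def histogram_counts(pvalues, n_bins=20):
    """Count p-values per equal-width bin; with 20 bins, bin 0 holds p < 0.05."""
    counts = [0] * n_bins
    for p in pvalues:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

rng = random.Random(42)
n_genes = 10_000
flat = histogram_counts(simulate_pvalues(n_genes, 0, 0.0, rng))
mixed = histogram_counts(simulate_pvalues(n_genes, 1_000, 4.0, rng))

flat_fraction = flat[0] / n_genes    # close to 0.05 for a flat distribution
mixed_fraction = mixed[0] / n_genes  # inflated when true differences exist

print("no true DE genes, fraction with p < 0.05:", flat_fraction)
print("10% true DE genes, fraction with p < 0.05:", mixed_fraction)
```

In the flat case all 20 bins hold roughly the same number of genes, while in the mixed case the first bin stands out clearly above the rest.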
Consider the histograms below. The left one shows a p-value distribution from a 10,000-gene DE analysis with simulated data with no true DE genes. The right one shows the same for a simulated case with 1,000 (or 10%) true DE genes. The dashed line shows the conventional p = 0.05 cutoff, and ground-truth DE genes are indicated in red.
As expected, about 500 genes are significant in the "no true DE genes" case, with a 100% false discovery rate (FDR): every significant gene is a false positive. The approximately 1,500 significant genes in the "10% DE genes" case contain around 500 false positives as well, yielding an FDR of approx. 33% (500 out of 1,500). While a p-value cutoff of 0.05 manages to detect almost all true DE genes, the 33% "contamination" is high.
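The FDR figures quoted here follow directly from the counts: the FDR is the share of false positives among all significant findings. A minimal sketch using the rounded numbers from the text (the function name is illustrative, and returning 0 when nothing is significant is a common convention assumed here):

```python
def false_discovery_rate(n_false_positives, n_true_positives):
    """Fraction of significant findings that are false positives."""
    n_significant = n_false_positives + n_true_positives
    if n_significant == 0:
        return 0.0  # convention: no discoveries means no false discoveries
    return n_false_positives / n_significant

# "No true DE genes": about 500 significant genes, all of them false.
fdr_null = false_discovery_rate(500, 0)

# "10% DE genes": about 1,500 significant genes, around 500 of them false.
fdr_mixed = false_discovery_rate(500, 1000)

print(fdr_null)             # 1.0
print(round(fdr_mixed, 2))  # 0.33
```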
How are p-values corrected?
There are many methods for correcting, or adjusting, p-values for multiple testing. Below you see the effect of p-value correction in the two above-mentioned cases, using two different methods.
The first method, Bonferroni, is very conservative: nearly all p-values are shifted to the non-significant end of the distribution, close to 1. Bonferroni clearly works for the "no true DE genes" case: we have zero false positives — only true negatives, as we should. For the case with true differences, however, we have a high false-negative rate: almost no true DE genes are detected. Clearly, signor Bonferroni is too conservative for our needs.
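For reference, the Bonferroni adjustment itself is simple: each p-value is multiplied by the number of tests and capped at 1. The sketch below is a plain-Python illustration with made-up p-values, not the code used for the figures.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p times the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

raw = [0.0001, 0.01, 0.04, 0.5]  # illustrative values only
adjusted = bonferroni(raw)
print(adjusted)  # [0.0004, 0.04, 0.16, 1.0]
```

With 10,000 genes instead of four, a raw p-value has to be below 0.000005 to stay under 0.05 after adjustment, which explains why so few true DE genes survive.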
The second method, the widely used Benjamini-Hochberg (B-H) procedure, is designed to control the FDR. It transforms p-values into estimated FDR levels: calling genes significant at a B-H cutoff of 0.05, for instance, aims to keep the expected false discovery rate among them at or below 5%. From the distributions above we see that B-H indeed seems to detect the true DE genes with very few false positives or false negatives.
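The B-H adjustment can be written in a few lines. In this sketch each p-value is scaled by the number of tests divided by its rank (1 for the smallest), and a running minimum taken from the largest p-value downwards keeps the adjusted values monotone. The function name and the example p-values are illustrative. In practice an established implementation from a statistics package is the safer choice.

```python
def benjamini_hochberg(pvalues):
    """B-H adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
adjusted = benjamini_hochberg(raw)
print([round(q, 4) for q in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```

Note how the adjustment depends on rank: the smallest p-value is multiplied by the full number of tests, the largest is left unchanged. This is what makes B-H less severe than Bonferroni.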
The volcano plots below give more detail and reveal the trends between statistical significance (vertical axis) and effect size (horizontal axis). In particular, we notice that the decrease in FDR comes — in this case — at a cost of slightly more false negatives (undetected DE genes). The B-H adjustment is generally good at finding a balance between the two, and may in fact improve both the FDR and detection rate.
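The trade-off between FDR and detection rate can also be tabulated directly. The following sketch compares a raw cutoff, Bonferroni and B-H on simulated data. It is not the simulation behind the plots in this post: it assumes a two-sided z-test per gene and an arbitrary effect size of 4 standard errors for the true DE genes, so the exact numbers will differ from the figures.

```python
import math
import random

def simulate(n_genes=10_000, n_de=1_000, effect=4.0, seed=1):
    """Return z-test p-values and ground-truth labels (first n_de genes are DE)."""
    rng = random.Random(seed)
    pvalues, is_de = [], []
    for gene in range(n_genes):
        de = gene < n_de
        z = rng.gauss(effect if de else 0.0, 1.0)
        pvalues.append(math.erfc(abs(z) / math.sqrt(2)))
        is_de.append(de)
    return pvalues, is_de

def bh_reject(pvalues, alpha=0.05):
    """B-H step-up rule: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= alpha * k / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

def summarize(rejected, is_de):
    """Return (FDR, fraction of true DE genes detected)."""
    tp = sum(r and d for r, d in zip(rejected, is_de))
    fp = sum(r and not d for r, d in zip(rejected, is_de))
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return fdr, tp / sum(is_de)

pvalues, is_de = simulate()
m = len(pvalues)
results = {
    "raw p < 0.05": summarize([p < 0.05 for p in pvalues], is_de),
    "Bonferroni": summarize([p * m < 0.05 for p in pvalues], is_de),
    "Benjamini-Hochberg": summarize(bh_reject(pvalues), is_de),
}
for name, (fdr, detected) in results.items():
    print(f"{name:>20}: FDR = {fdr:.3f}, detected = {detected:.3f}")
```

Under these assumptions, B-H should land between the other two: far fewer false discoveries than the raw cutoff, and far more detected DE genes than Bonferroni.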
For a given case of multiple testing, it may not be obvious which method to use for p-value correction. The important thing, however, is to carefully consider what exactly is being tested, and whether the p-value distribution corresponds to what one would expect. Just being able to identify a flat p-value distribution is a good start!
(Fun fact: the code to simulate, analyze and plot these data was generated by an AI, OpenAI's ChatGPT. The AI chat was given prompts such as "simulate such and such data", "make the histograms cleaner", "fix this bug" etc. Very little human touch was required.)