You may have heard of these terms before, or perhaps not; to each their own path. Either way, they are crucial concepts for understanding and using p-values, so whether this tutorial is a refresher or your first contact with them, I hope it will be of use to you.

That being said, I will not be re-introducing p-values, as we already have an article for that **here**.

**The intuition**

Before we dive into the maths and probabilities of the problem, let’s build an intuition for the multiple testing problem by talking about the national lottery.

What is your probability of winning? It would take an enormous amount of luck.

But what are the chances that someone you know wins? More probable, but still unlikely.

And now, if you take the whole country, what is the probability of someone winning? At that point, is it still just luck?

Your chances of winning are low, the chances of someone winning are not.

And this corresponds to the multiple testing problem.

**Back to science**

As discussed in the introduction to p-values article, the threshold under which a p-value is considered significant corresponds to the probability of observing data this extreme when the truth of the universe is “normal” (this is the 0.05 threshold). Observing such data would be, with a stringent enough threshold, luck. But if you were to run the same test on each of the ~20 000 genes in the human genome, we would end up in a similar situation as with our lottery: the probability of observing something “by chance” rather than because it is true increases dramatically.
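To put rough numbers on this intuition, here is a back-of-the-envelope sketch (assuming independent tests, each run at the conventional 0.05 threshold) of how fast the chance of at least one false positive grows with the number of tests:

```python
# Chance of at least one false positive across many independent tests,
# each performed at the conventional alpha = 0.05 threshold.
alpha = 0.05

for n_tests in (1, 10, 100, 20_000):
    # P(no false positive in n tests) = (1 - alpha)^n, so:
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>6} tests -> P(at least one false positive) = {p_any_false_positive:.4f}")
```

Already at 10 tests the chance of a spurious “significant” result is about 40%, at 100 tests it exceeds 99%, and at 20 000 gene-level tests it is a near certainty.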

To phrase it differently, our p-value threshold in a “single experiment” scenario corresponds to the maximum false positive rate we are willing to accept. In most cases, we consider (by convention) that we are taking a 5% risk of saying that something happens when it doesn’t. However, if you repeat an experiment enough times, or run many similar experiments, that 5% chance **is** going to materialise, and it becomes more likely with every repetition, leading you to publish false results, or to lose time trying to reinforce a result that isn’t there.

**Tools to compensate for this**

Both of the methods below are easily accessible in the R programming language through the `p.adjust` function: call it on a vector of p-values and pass the `method` argument for the correction you want to apply (see the manual page here).

*Bonferroni*

The Bonferroni method is the most stringent, meaning you are the least likely to falsely report a negative result as a positive one. However, it may at the same time cut out a good chunk of real positive results.
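The adjustment itself is simple: each p-value is multiplied by the number of tests. The article points to R’s `p.adjust` for this; the following is a minimal Python sketch of the same operation (the p-values are hypothetical, purely for illustration):

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of
    tests, capping at 1. Mirrors R's p.adjust(p, method = "bonferroni")."""
    m = len(pvalues)
    return [min(p * m, 1.0) for p in pvalues]

# Hypothetical p-values from four tests.
pvals = [0.001, 0.01, 0.04, 0.2]
print(bonferroni(pvals))
```

Note that `0.04`, significant on its own at the 0.05 threshold, becomes `0.16` after correction: this is the stringency (and the cost) of Bonferroni.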

*FDR correction*

FDR, or False Discovery Rate, correction aims for a healthy middle ground. Most notably, if you have many very low p-values, those will not be penalised too much.
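The classic FDR-controlling procedure, and the one behind R’s `p.adjust(p, method = "BH")`, is Benjamini–Hochberg. Here is a minimal Python sketch of it, valid under the assumption of independent (or positively dependent) tests:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR adjustment, mirroring R's
    p.adjust(p, method = "BH"): each p-value is scaled by
    m / rank, then a running minimum (from the largest p-value
    down) enforces monotonicity."""
    m = len(pvalues)
    # Indices ordered from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.001, 0.01, 0.04, 0.2]))
```

Compare this with the Bonferroni behaviour: the smallest p-values are scaled far more gently (the smallest is still multiplied by m, but the second by m/2, the third by m/3, and so on), which is why a batch of genuinely low p-values survives FDR correction largely intact.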

These two corrections rest on different philosophies: Bonferroni limits the probability of reporting any negative result as positive, whereas FDR limits the proportion of reported positives that are false. Debating which is best would take us into some rather deep weeds. Know that an FDR < 5% and a corrected p-value < 5% do not mean exactly the same thing, but both are perfectly usable in your scientific articles.

For a more detailed and mathematical look at multiple testing and its correction, I encourage you to read the following article:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6099145/
