OK, let me briefly stand on my soapbox, indulge in some shameless self-promotion, and offer two perspectives on the pursuit of statistical evidence.

- The weirdness of the p-value

We’re all indoctrinated under the canopy of null hypothesis significance testing (NHST) and become so familiar with the definition of the p-value that we tend to ignore its weirdness. First we build a straw man: e.g., that there is absolutely no effect (e.g., no activation at all) in a brain region. Then we attack the straw man by showing how weird the data are if we truly believe the straw man is real. Lastly, if the weirdness reaches a ridiculous level (e.g., crosses the magic number of 0.05), we feel comfortable enough to present the results.
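To make the "weirdness" reading of the p-value concrete, here is a minimal Monte Carlo sketch (not any particular package's implementation; the Gaussian noise with known SD is a deliberately simplified assumption):

```python
import random
import statistics

def straw_man_p_value(observed_mean, n, noise_sd, n_sims=20_000, seed=0):
    """Monte Carlo p-value: how 'weird' is the observed group mean if
    the straw man (zero effect everywhere) were literally true?

    Assumes Gaussian noise with a known SD -- an illustrative toy,
    not a realistic fMRI noise model.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Simulate a dataset under the straw-man null and take its mean.
        null_mean = statistics.fmean(rng.gauss(0.0, noise_sd) for _ in range(n))
        # Count null datasets at least as extreme as what we observed.
        if abs(null_mean) >= abs(observed_mean):
            hits += 1
    return hits / n_sims

# An observed mean effect of 0.5 with n=20 and noise SD 1.0
# comes out near 0.025 with these settings:
p = straw_man_p_value(0.5, n=20, noise_sd=1.0)
```

Note that the number says nothing about the probability that the effect exists; it only measures how surprising the data are under the straw man.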

The side effects resulting from hard thresholding with a fixed weirdness level are:

** When dichotomizing a continuous world with a threshold, we face difficult scenarios such as the following: do we truly believe that a result with a p-value of 0.051, or a cluster of 53 voxels when the cluster threshold is 54 voxels, carries no statistical evidence at all?

** The difference between a “significant” result and an “insignificant” one is not itself necessarily significant.

** It is not rare to see the definition of the p-value, i.e., p (data | H[sub]0[/sub]), misinterpreted as p (H[sub]0[/sub] | data). In fact, the misinterpretation reflects the real research interest: the investigator is most likely interested in the latter probability, not in the artificial straw man or in the weirdness measured by the p-value. With the data at hand, what is the probability that the effect of interest is positive or negative?
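The gap between the two probabilities can be made concrete with Bayes' rule. In the toy numbers below, the likelihood under the alternative and the prior are made-up assumptions, which is exactly the point: the p-value machinery never states them, so the two quantities cannot be equated.

```python
def posterior_h0(p_data_given_h0, p_data_given_h1, prior_h0=0.5):
    """Bayes' rule: turn p(data | H0) into p(H0 | data).

    p_data_given_h1 and prior_h0 are illustrative assumptions --
    ingredients that NHST never supplies.
    """
    prior_h1 = 1.0 - prior_h0
    # Total probability of the data across both hypotheses.
    evidence = p_data_given_h0 * prior_h0 + p_data_given_h1 * prior_h1
    return p_data_given_h0 * prior_h0 / evidence

# A 'significant' p(data | H0) of 0.04 need not mean H0 is improbable:
post = posterior_h0(0.04, 0.3)   # ≈ 0.118, far from 0.04
```

Even with an even-odds prior, the posterior probability of the null is roughly three times the p-value-like quantity here.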

- Information waste due to inefficient modeling

The current approach of massively univariate modeling actually involves two layers of modeling. In the first layer, we apply the same model to all spatial elements (voxels, regions, or region pairs), resulting in as many models as there are elements, under the following assumptions:

** We pretend that all those spatial elements are isolated and unrelated to each other.

** We pretend we know nothing about the effect of interest ahead of time. Therefore, we fully trust the effect estimates from the model (e.g., GLM, AN(C)OVA, LME); that is, we assume the effect can take any value from -infinity to +infinity with equal likelihood. By the same token, a variance can take any value from -infinity to +infinity (rather than from 0 to +infinity) with equal likelihood as well. This is why a negative variance estimate is allowed under ANOVA!

The modeling inefficiency is this: because the first assumption above is false, we have to deal with the famous problem of “multiple testing” by constructing a second layer of modeling that counteracts the false assumption by controlling the overall weirdness. This correction step, through various modeling approaches (e.g., accounting for spatial relatedness, permutation testing), can be suffocating, as real experience in the field shows. The falsehood of the second assumption leads to further information waste.
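The negative-variance remark above can be illustrated with a toy balanced one-way random-effects ANOVA: the standard method-of-moments estimator, sigma_b^2 = (MSB - MSW) / n, has nothing constraining it to be non-negative. A minimal sketch (the data are invented for illustration):

```python
import statistics

def between_variance_mom(groups):
    """Method-of-moments estimate of the between-group variance
    component in a balanced one-way random-effects ANOVA:
    sigma_b^2 = (MSB - MSW) / n.  Nothing forces it to stay >= 0.
    """
    k = len(groups)            # number of groups
    n = len(groups[0])         # observations per group (balanced design)
    group_means = [statistics.fmean(g) for g in groups]
    grand_mean = statistics.fmean(group_means)
    # Mean square between groups and mean square within groups.
    msb = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    msw = sum(statistics.variance(g) for g in groups) / k
    return (msb - msw) / n

# Groups with nearly identical means but large within-group spread
# give MSB < MSW, hence a negative 'variance' estimate:
groups = [[1.0, 9.0, 5.0], [4.9, 5.1, 5.0], [2.0, 8.0, 5.0]]
est = between_variance_mom(groups)   # negative
```

A model that acknowledged up front that a variance lives on [0, +infinity) would never produce such an estimate.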

Now comes my shameless self-promotion. The points above are further elaborated here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,157054,157054#msg-157054

A solution is offered as well, which may or may not be applicable to your scenario.

I’m just looking for some tips or ideas, because we really want to follow your guidelines, but using a per-voxel p-value of 0.002 is making it difficult to keep any findings.

Even if you stick with NHST, you can still adopt a more reasonable approach than dichotomization: highlighting instead of hiding. That is, show everything without hiding anything, and at the same time highlight the results with strong evidence (e.g., those lucky clusters). You can even highlight results along a gradient of statistical evidence (e.g., using multiple thresholds).
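One way to realize "highlighting instead of hiding" is to map each element's p-value to a display opacity rather than to an all-or-nothing mask. A minimal sketch, where the two reference p-values and the floor opacity are purely illustrative choices, not recommendations:

```python
import math

def evidence_alpha(p_value, full_at=0.001, fade_to=0.1):
    """Map a p-value to a display opacity in [0.15, 1.0].

    Strong evidence (p <= full_at) is fully opaque; weaker evidence
    fades gradually on a log scale instead of vanishing at a hard
    threshold.  full_at, fade_to, and the 0.15 floor are illustrative.
    """
    if p_value <= full_at:
        return 1.0
    if p_value >= fade_to:
        return 0.15          # still visible: nothing is hidden
    # Log-linear ramp between the two reference p-values.
    span = math.log(fade_to) - math.log(full_at)
    frac = (math.log(fade_to) - math.log(p_value)) / span
    return 0.15 + 0.85 * frac
```

Feeding these alphas into whatever renderer you use shows the full results picture while letting the strongest evidence stand out.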