AFNI and the pursuit of significant findings

Hi!

I guess this is the 100th post regarding this matter but maybe there is now some new info out there.

I really appreciate how thorough you guys have been in addressing the issues that surfaced after the Eklund paper (he is actually my PhD co-supervisor). As far as I can tell, your approach is to reach a theoretically “true” false-positive cluster rate of 0.05 once all the thresholds have been set.

If I have a task that only makes sense to analyse in a factorial manner, usually via 3dMVM, my only choice is to use the ACF method. If I want to be an honest scientist and be able to say that “these results are multiple-comparison corrected according to the AFNI guidelines”, I basically have to follow what is in Bob’s magnificent PowerPoint presentation. That means that even though the resulting table shows cluster sizes for voxel-wise p-values as high as 0.05, a true cluster alpha of 0.05 is only reached for p-values at or below 0.002.
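
For reference, the workflow we follow is roughly the sketch below (the file names and the $subj variable are just placeholders for our own data):

# per subject: estimate the ACF parameters (a b c) from the residual time series
3dFWHMx -detrend -mask mask_epi_anat.$subj+tlrc \
        -acf tmp.ACF.$subj.1D errts.$subj+tlrc

# average the three ACF parameters across subjects, then feed the group-averaged
# estimates into 3dClustSim to get the cluster-size table
3dClustSim -acf 0.629 2.702 13.012   \
           -mask group_mask+tlrc     \
           -prefix ClustSim.ACF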

In practice, this makes it really hard to get any “publishable” results (3dttest++ with -Clustsim and -ETAC seems to be even more conservative). I have tried to reduce the testing volume by using only the union of all the mask_epi_anat masks and then keeping just the GM portion of it.
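
Concretely, the reduced mask is built along these lines (a sketch only; the file names are placeholders, and the exact options should be double-checked against 3dmask_tool -help):

# union of the per-subject EPI/anat intersection masks
3dmask_tool -input mask_epi_anat.*+tlrc.HEAD -union -prefix group_mask_union

# keep only the gray-matter portion (GM_mask is a group-level GM mask; step() binarizes it)
3dcalc -a group_mask_union+tlrc -b GM_mask+tlrc \
       -expr 'a*step(b)' -prefix group_mask_epi_anat_gm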

If we find activations in “predicted” areas, or areas that make so much sense (since we, e.g., stimulate the insula), but the cluster only survives when using the table value for a p of 0.005 instead of 0.002, is that really unpublishable? How would you deal with that?

I’m just looking for some tips or ideas, because we really want to follow your guidelines, but using a per-voxel p-value of 0.002 makes it difficult to keep any findings.

Again, we are really grateful for all your hard work!

OK, let me briefly stand on my soapbox, rant with some shameless self-promotion, and offer two perspectives on the pursuit of statistical evidence.

  1. The weirdness of the p-value

We’re all indoctrinated under the canopy of null hypothesis significance testing (NHST) and become so familiar with the definition of the p-value that we tend to ignore its weirdness. First we build a straw man: e.g., there is absolutely no effect (e.g., no activation at all) in a brain region. Then we attack the straw man by showing how weird the data are if we truly believe the straw man is real. Lastly, if the weirdness reaches a ridiculous level (e.g., the magic number of 0.05), we feel comfortable enough to present the results.

The side effects resulting from hard thresholding with a fixed weirdness level are:

** When dichotomizing a continuous world with a threshold, we face difficult scenarios such as the following: do we truly believe that a result with a p-value of 0.051, or a cluster of 53 voxels when the cluster threshold is 54 voxels, does not carry any statistical evidence?

** The difference between a “significant” result and an “insignificant” one is not necessarily significant.

** It is not rare to see the definition of the p-value, i.e., p(data | H0), misinterpreted as p(H0 | data). In fact, the misinterpretation reflects the real research interest; that is, the investigator is most likely interested in the latter probability, rather than in the artificial straw man or the weirdness p-value: with the data at hand, what is the probability that the effect of interest is positive (or negative)?
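
(Just to make the distinction concrete, the two quantities are tied together by Bayes’ rule,

P(H_0 \mid \text{data}) = \frac{P(\text{data} \mid H_0)\, P(H_0)}{P(\text{data})},

so converting the p-value into the probability that the investigator actually cares about requires a prior P(H_0) and the marginal P(data) -- neither of which appears anywhere in the NHST machinery.)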

  2. Information waste due to inefficient modeling

The current approach of massively univariate modeling actually involves two layers of modeling. The first layer applies the same model to all spatial elements (voxels, regions, or region pairs), resulting in as many models as there are elements, with the following assumptions:

** We pretend that all those spatial elements are isolated and unrelated to each other.

** We pretend we know nothing about the effect of interest ahead of time. Therefore, we fully trust the effect estimates from the model (e.g., GLM, AN(C)OVA, LME); that is, the effect can take any value from -infinity to +infinity with equal likelihood. By the same token, a variance can take any value from -infinity (or at best 0) to +infinity with equal likelihood as well. This is why a negative variance estimate is allowed under ANOVA!

The modeling inefficiency is that, because the first assumption above is false, we have to deal with the famous problem of “multiple testing” by constructing a second layer of modeling that counteracts the false assumption by controlling the overall weirdness. This correction step, through various modeling approaches (e.g., accounting for spatial relatedness, permutation), can be suffocating, as real experience in the field shows. The falsehood of the second assumption leads to further information waste.

Now comes my shameless self-promotion. The points above are further elaborated here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,157054,157054#msg-157054
A solution is offered there as well, which may or may not be applicable to your scenario.

I’m just looking for some tips or ideas, because we really want to follow your guidelines, but using a per-voxel p-value of 0.002 makes it difficult to keep any findings.

Even if you stick with NHST, you can still adopt a more reasonable approach than dichotomization: highlighting instead of hiding. That is, show everything without hiding anything, and at the same time highlight the results with strong evidence (e.g., those lucky clusters). You can also highlight the results along a gradient of statistical evidence (e.g., different thresholds).
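
For example (just a sketch, not a prescription: the dataset names, sub-brick indices, and cluster sizes are placeholders, and the exact threshold syntax should be verified against 3dClusterize -help), you could generate cluster maps at a couple of voxel-wise thresholds and then overlay them, with different levels of highlighting, on top of the full, unthresholded effect estimates:

# strict layer: voxel-wise p = 0.001 with the matching size from your 3dClustSim table
3dClusterize -inset stats.group+tlrc -ithr 1 -idat 0  \
             -mask group_mask+tlrc -NN 1              \
             -bisided p=0.001 -clust_nvox 8           \
             -pref_map Clust_p001

# lenient layer: voxel-wise p = 0.005, again with the corresponding table value
3dClusterize -inset stats.group+tlrc -ithr 1 -idat 0  \
             -mask group_mask+tlrc -NN 1              \
             -bisided p=0.005 -clust_nvox 19          \
             -pref_map Clust_p005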

Thank you so much Gang!

Feel free to step up to the soap box any time…

Then we will give the rare p < 0.002 (0.001) findings extra weight, but we will not throw away anything interesting at the higher p-values (e.g., 0.005 with its corresponding cluster threshold).

The annoying thing is that it is nice to be able to write “multiple-comparison corrected” in a paper. For example, SPM just provides the clusters and claims they are multiple-comparison corrected, which is nice for the user (but you are more in the dark about how it was done). If we present a 0.005 finding, your simulated alpha is not actually 0.05 and we cannot write that it is multiple-comparison corrected. Right?

If we present a 0.005 finding, your simulated alpha is not actually 0.05 and we cannot write that it is multiple-comparison corrected. Right?

Leaving modeling issues aside, I don’t think that the current approach of hiding everything below the threshold is healthy. Instead of varying the voxel-wise p-value, I suggest that you fix the voxel-wise p-value (whatever is currently considered acceptable) and find the clusters at varying cluster-level FWE rates (alpha), such as 0.05, 0.06, 0.07, etc. Then you can report those clusters of interest with an alpha value above but still reasonably close to 0.05. Let go of the obsession with the p-value, and don’t treat the watermark of 0.05 as something carved in stone.

Thanks Gang!

This leaves, for now, a final set of questions.

  1. The “currently considered acceptable threshold” I get from Bob’s PowerPoint presentation and from these two papers:
    fMRI clustering and false-positive rates.
    FMRI Clustering in AFNI: False-Positive Rates Redux.
    From these I gather that one should use no p-value greater than 0.002. Is there any new info or paper out there? Both are from 2017.

  2. We would typically use this table:


# 3dClustSim -acf .62900675 2.70203459 13.01153783 -mask /data/dsk2/iaps/group_mask/group_mask_epi_anat_gm+tlrc.
# 2-sided thresholding
# Grid: 64x76x64 3.00x3.00x3.00 mm^3 (35260 voxels in mask)
#
# CLUSTER SIZE THRESHOLD(pthr,alpha) in Voxels
# -NN 1  | alpha = Prob(Cluster >= given size)
#  pthr  | .10000 .05000 .02000 .01000
# ------ | ------ ------ ------ ------
 0.050000    95.2  115.8  147.0  177.0
 0.020000    39.6   48.2   62.4   76.7
 0.010000    23.5   28.5   36.1   43.1
 0.005000    15.0   18.2   22.9   26.5
 0.002000     9.1   11.0   13.8   16.3
 0.001000     6.6    7.9    9.9   11.5
 0.000500     4.9    5.9    7.4    8.5
 0.000200     3.5    4.2    5.2    6.1
 0.000100     2.7    3.3    4.1    4.8

This table only shows cluster alphas at the 0.10, 0.05, 0.02, and 0.01 levels. How do I get 3dClustSim (the current command can be seen at the top of the table) to produce, e.g., 0.06 or 0.07, as per your suggestion?

  3. “Then you can report those clusters of interest with an alpha value above but still reasonably close to 0.05. Let go of the obsession with the p-value, and don’t treat the watermark of 0.05 as something carved in stone.”

Yes, this sounds good and I agree! But journals are often keen on knowing whether your statistical results are multiple-comparison corrected or not (an annoyingly binary statement, I know). Writing a long discussion about thresholding in a paper can get kind of complicated, and I can’t really cite the AFNI message board…

Don’t you think that a reviewer who is unfamiliar with AFNI, and with the fact that these methods are relatively conservative compared to other software packages, would be kind of sceptical of a cluster alpha over 0.05? More so than of a per-voxel p-value of 0.05 that results in an alpha of 0.05?

Thanks so much for taking the time with this. We are approaching submission on one of our AFNI projects, so this issue is taking its toll on us.
Thanks!

Hi, Robin-

Re. your Q2–
You can choose your desired “alpha” values in 3dClustSim with this option:


-athr a1 .. an = list of corrected (whole volume) alpha-values at which
                  the simulation will print out the cluster size
                  thresholds.  For each 'p' and 'a', the smallest cluster
                  size C(p,a) for which the probability of the 'p'-thresholded
                  image having a noise-only cluster of size C is less than 'a'
                  is the output (cf. the sample output, below)
                  [default = 0.10 0.05 0.02 0.01]
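
So, for example, appending -athr with your desired values to the command you already ran should do it (untested here, just reusing your own command from above):

3dClustSim -acf .62900675 2.70203459 13.01153783                           \
           -mask /data/dsk2/iaps/group_mask/group_mask_epi_anat_gm+tlrc.   \
           -athr 0.10 0.07 0.06 0.05 0.02 0.01

That would add 0.06 and 0.07 columns alongside the default ones.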

Re. your Q1–
You have cited the Big 3 of sources for clustering/thresholding stuff in AFNI to date. There has been a bit more work on ETAC by Bob, but the main point of it is included in his slides quite nicely.

Note that another REALLY relevant issue that a surprising number of published papers still do not get correct is the sidedness of testing to use. Using a pair of one-sided t-tests without correction doubles the FPR of the reported results (and actually does so while using a doubled p-value, which might even cause further FPR inflation). This is a model- and simulation-free result -- it is purely a mathematical point. As much as you strive to have a ‘valid’ p-value threshold, please also use the correct testing for your hypotheses:
https://www.ncbi.nlm.nih.gov/pubmed/30265768 (HBM)
https://www.biorxiv.org/content/10.1101/328567v1 (bioRxiv)
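
The arithmetic behind the doubling is simple (T is a test statistic with a symmetric null distribution and c_p is its one-sided critical value at level p; this is just an illustration, not a quote from the papers):

P(T > c_p \mid H_0) + P(T < -c_p \mid H_0) = p + p = 2p

i.e., reporting whichever of the two one-sided tests happens to come out “significant” is effectively a two-sided test run at level 2p, not p.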

Re. your Q3–
Well, I’m not sure if there is a question there, or what the precise question is.

–pt

Hi Paul,

Thanks for your input!

Great regarding the cluster sizes, and good to know that I have the correct sources for citation!

Regarding running two 1-sided t-tests: we almost always use the 2-sided tables that are generated by 3dClustSim. Perhaps we could get away with a 1-sided test on some of our tasks, but we have chosen to basically always use 2-sided testing since we are interested in differences in both directions. Hope that sounds reasonable.

Yeah, “Q3” wasn’t really a question. It was just to illustrate that we agree it is unreasonable to throw away findings that, e.g., are one voxel from surviving, or are only significant at an alpha of 0.06-0.07, especially if it is a bilateral finding in predicted areas. My point, though, was that it might be tricky with some reviewers, since we can’t, even though it is close, write that the finding is FWE corrected within AFNI, because what we can cite is only the three sources above, and they call for a p-value no larger than 0.002. And I can’t really cite a discussion like this. So now we are thinking about how to phrase these situations in upcoming manuscripts without downplaying the finding too much. A finding that might be a false negative, since the method is kind of conservative (at least that is our experience so far).

it might be tricky with some reviewers, since we can’t, even though it is close, write that the finding is FWE corrected within AFNI, because what we can cite is only the three sources above, and they call for a p-value no larger than 0.002. And I can’t really cite a discussion like this. So now we are thinking about how to phrase these situations in upcoming manuscripts without downplaying the finding too much. A finding that might be a false negative, since the method is kind of conservative (at least that is our experience so far).

Robin,

Here is another chance for my shameless self-promotion. Since you do have extra supporting evidence (bilateral regions, plus information from the literature if available) for the unfortunate cluster(s), I don’t see why any reviewer would object to reporting such results. Here is a paragraph from a paper of ours from a couple of years ago (https://www.biorxiv.org/content/10.1101/064212v2) that you may cite:


The issue of reporting marginally significant effects is controversial (e.g., Pritschet et al., 2016). Should
one not report a cluster simply because it cannot pass the rigorous statistical thresholding through FWE/FDR
control at the present group size? We argue that, even if a cluster fails to survive rigorous correction, it does
not necessarily mean that the results are not worth reporting, because they may be suggestive and provide some
benchmark for future confirmation. Statistical inference should not be a binary decision, and the inclusion of
effect estimates allows for a consistent approach to avoid this and to achieve a balance between false positives and
false negatives (Lieberman and Cunningham, 2009). Thus we propose a two-tier approach to reporting clusters.
In addition to the conventional FWE control, we believe that, if the individual voxels within a region achieve
a basic significance level (e.g., p ≤ 0.05) and if the cluster possesses some practically significant spatial extent
(e.g., less than the minimum cluster size required by a family-wise error correction scheme but still roughly
within the underlying anatomical structure), its reporting is warranted.

Thanks to both of you, really!

I think I have enough now to go ahead :). This will be a helpful thread to revisit!

Many thanks,
Robin