Calculating Outliers

nikinbaidar · May 26, 2022, 1:51pm

The help page for the command 3dToutcount explains that outliers are defined as:

* The trend and MAD of each time series are calculated.
  - MAD = median absolute deviation = median absolute value of time series minus trend.
* In each time series, points that are 'far away' from the trend are called outliers, where 'far' is defined by
    alpha * sqrt(PI/2) * MAD
    alpha = qginv(0.001/N) (inverse of reversed Gaussian CDF)
    N = length of time series

I cannot find any mathematical reference for how outliers are defined using the “inverse of reversed Gaussian CDF”. Can someone please elaborate on how outliers are defined here, and on how the upper and lower bounds are calculated? I would appreciate any help from you folks. Thank you!

rickr · June 21, 2022, 2:10pm

Hello,

This is not all fully clear to me either, but there might be some points worth making.

The Gaussian CDF should be the cumulative distribution function of the standard normal (Gaussian “bell”) curve, with the CDF going from 0 as x approaches -inf, through (0, 0.5) (i.e. at x=0, half of the probability has accumulated), and approaching 1 as x approaches +inf. The “reversed” form of this is just a reflection over the y-axis, leaving a curve that starts at 1, still goes through (0, 0.5), and then approaches 0 as x approaches +inf.

So this reversed Gaussian CDF, when restricted to x > 0, is akin to a z-score to p-value conversion (a 1-tailed one; the usual 2-tailed p-value would be twice this).

For kicks, consider the commands (in tcsh syntax):

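# start with some arbitrary value
set val = 2.5
# reversed Gaussian CDF evaluated at val
ccalc "qg($val)"
# 2-tailed p-value for a z-score of val, divided by 2; the two results should agree
ccalc `cdf -t2p fizt $val`/2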
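
For anyone without AFNI at hand, roughly the same comparison can be sketched in Python with only the standard library. This assumes that qg(x) is the upper-tail area 1 - Phi(x) of the standard normal, which is my reading of "reversed Gaussian CDF" and not something taken from the AFNI source. The function names here are stand-ins, not AFNI's implementation.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def qg(x):
    """Assumed stand-in for AFNI's qg(): upper-tail area 1 - Phi(x)."""
    return 1.0 - std_normal.cdf(x)


def two_tailed_p(z):
    """Two-sided p-value for a z-score."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))


val = 2.5
print(qg(val))                # upper-tail area beyond val
print(two_tailed_p(val) / 2)  # half of the 2-tailed p-value, same number
```

Both lines print the same value (about 0.0062 for val = 2.5), which is the sense in which the reversed CDF acts as a 1-tailed z-to-p conversion.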

That makes its inverse (qginv()) akin to a p-value to z-score conversion.
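
A minimal sketch of that inverse, again in Python and again assuming qg(x) = 1 - Phi(x). Under that assumption qginv(p) is the z-score whose upper-tail area is p, which by symmetry equals minus the ordinary inverse CDF at p. The names qg and qginv below only mirror the AFNI expression names; they are not the AFNI code.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal


def qg(x):
    """Assumed reading of qg(): upper-tail area 1 - Phi(x)."""
    return 1.0 - _STD.cdf(x)


def qginv(p):
    """z such that the upper-tail area beyond z is p (requires 0 < p < 1).

    1 - Phi(z) = p  implies  z = Phi^{-1}(1 - p) = -Phi^{-1}(p).
    """
    return -_STD.inv_cdf(p)


z = qginv(0.025)
print(z)      # close to 1.96
print(qg(z))  # back to about 0.025
```

Using `-inv_cdf(p)` instead of `inv_cdf(1 - p)` avoids losing precision when p is very small, which matters once p is divided by the number of time points.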

So alpha = qginv(0.001/N) is possibly like p=0.001, Bonferroni corrected for the number of time points (as noted by Paul), then converted to a z-score. The sqrt(pi/2) is unclear to me, though maybe it comes from a Gaussian integral (exp(-x^2/2) integrates to sqrt(pi/2) over x > 0), or maybe it is another normalizing factor.
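
To get a feel for the numbers, here is a hedged sketch of alpha as a function of N, under the same assumption that qginv(p) is the z-score with upper-tail area p. It also checks one fact that may bear on the sqrt(pi/2) factor: for a zero-mean Gaussian, the mean absolute deviation is sigma * sqrt(2/pi), so sqrt(pi/2) times a mean absolute deviation estimates sigma. Whether that is the intent in the program is a guess on my part. For a median absolute deviation, the usual Gaussian consistency factor is about 1.4826 instead. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal


def alpha_for_length(n, base_p=0.001):
    """Assumed form of alpha = qginv(base_p / n): z with upper-tail area base_p / n."""
    return -_STD.inv_cdf(base_p / n)


def mean_abs_of_standard_normal(n_points=20000):
    """Approximate E|X| for X ~ N(0, 1) by averaging |x| over evenly spaced quantiles."""
    quantiles = (_STD.inv_cdf((i + 0.5) / n_points) for i in range(n_points))
    return sum(abs(q) for q in quantiles) / n_points


# alpha grows slowly as the time series gets longer
for n in (50, 200, 1000):
    print(n, alpha_for_length(n))

# sqrt(pi/2) * E|X| is close to 1, i.e. to sigma of the standard normal
print(math.sqrt(math.pi / 2) * mean_abs_of_standard_normal())
```

For time series of a few hundred points, alpha comes out somewhat above 4, so the cutoff is a bit more than four "sigma-like" units from the trend.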

These p to z conversions might then give a reasonable justification for what values constitute outliers, when the deviations are compared with the median absolute deviation.
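
Putting the pieces together, the rule from the help text could be sketched as below. This is an illustration under stated assumptions, not the program's actual code: the trend is simplified to a constant (the median of the series), whereas the program fits a trend whose details are not given in this thread, and qginv is assumed to be the upper-tail inverse as above. On this reading the upper and lower bounds are simply trend plus and minus alpha * sqrt(pi/2) * MAD. The function name `outlier_flags` is hypothetical.

```python
import math
from statistics import NormalDist, median


def outlier_flags(series, base_p=0.001):
    """Flag points farther than alpha * sqrt(pi/2) * MAD from the trend.

    Simplification (assumption): the trend is the series median, a constant.
    Returns (flags, lower_bound, upper_bound).
    """
    n = len(series)
    trend = median(series)
    residuals = [x - trend for x in series]
    mad = median(abs(r) for r in residuals)      # median absolute deviation
    alpha = -NormalDist().inv_cdf(base_p / n)    # assumed qginv(base_p / n)
    limit = alpha * math.sqrt(math.pi / 2) * mad
    flags = [abs(r) > limit for r in residuals]
    return flags, trend - limit, trend + limit


# made-up example data: one obvious spike at index 6
data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0, 10.1, 9.9, 10.0]
flags, lower, upper = outlier_flags(data)
print(lower, upper)
print([i for i, is_out in enumerate(flags) if is_out])  # only the spike is flagged
```

With a real time-varying trend, the same bounds would be computed per time point around the fitted trend value, not around a single constant.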

Does that seem at least modestly reasonable?

rick