3dDeconvolve matrix setup time

Hi,

I’m running 3dDeconvolve on a moderately sized dataset with a large number of CSPLIN regressors. Specifically, the design has 120 conditions, each with 8 basis functions, for a total of 960 signal regressors, plus an additional 80 baseline regressors. The dataset itself has 8 runs of 260 time points each (2080 images total), where each volume is 83×99×50 voxels.

So it’s a decent-sized dataset with a lot of parameters. When I run 3dDeconvolve on this model on our high-performance cluster, I find that the “matrix setup time” takes a long time: 7066.22 s, or nearly 2 hours.

What computation is taking 2 hours during the “matrix setup time”?

For example, if I load “statout.xmat.1D” into R and fit a simple linear model with one dependent variable, it takes about 600 ms:

df1 <- read.table("statout.xmat.1D")
y <- rnorm(nrow(df1))
system.time(lm(y ~ ., data=df1))

   user  system elapsed
  0.621   0.019   0.641

Even for 1000 dependent variables, the time is only 2.692 seconds:

y <- matrix(rnorm(nrow(df1)*1000), nrow(df1), 1000)
system.time(lm(y ~ ., data=df1))

   user  system elapsed
  2.692   0.064   2.756

Any ideas why 3dDeconvolve is taking 2 hours to “set up the matrix”?

thanks,

Brad Buchsbaum

Full output of 3dDeconvolve run below.

** AFNI converts NIFTI_datatype=4 (INT16) in file … to FLOAT32
Warnings of this type will be muted for this session.
Set AFNI_NIFTI_TYPE_WARN to YES to see them all, NO to see none.
++ 3dDeconvolve extending num_stimts from 120 to 200 due to -ortvec
++ 3dDeconvolve: AFNI version=AFNI_18.1.01 (Dec 21 2018) [64-bit]
++ Authored by: B. Douglas Ward, et al.
++ current memory malloc-ated = 3,362,882 bytes (about 3.4 million [mega])
++ loading dataset …
++ current memory malloc-ated = 3,421,820,762 bytes (about 3.4 billion [giga])
++ Auto-catenated input datasets treated as multiple imaging runs
++ Auto-catenated datasets start at: 0 260 520 780 1040 1300 1560 1820
++ Skipping check for initial transients
++ -stim_times using TR=2 s for stimulus timing conversion
++ -stim_times using TR=1 s for any -iresp output datasets
++ -stim_times 1 using GLOBAL times
++ -stim_times 2 using GLOBAL times
[ ... -stim_times 3 through 119 using GLOBAL times; 117 identical lines omitted ... ]
++ -stim_times 120 using GLOBAL times
++ Number of time points: 2080 (no censoring)

++ Number of parameters: 1040 [80 baseline ; 960 signal]
++ Memory required for output bricks = 3,815,974,800 bytes (about 3.8 billion [giga])
++ Wrote matrix values to file statout.xmat.1D
++ ========= Things you can do with the matrix file =========
++ (a) Linear regression with ARMA(1,1) modeling of serial correlation:

-mask …
-Rbeta coefout_REML -fout -tout -rout -Rbuck statout_REML -Rvar statout_REMLvar -verb

++ N.B.: 3dREMLfit command above written to file statout.REML_cmd
++ (b) Visualization/analysis of the matrix via ExamineXmat.R
++ (c) Synthesis of sub-model datasets using 3dSynthesize
++ ==========================================================
++ Matrix setup time = 7066.22 s
++ current memory malloc-ated = 14,219,692,914 bytes (about 14 billion [giga])
++ Calculations starting; elapsed time=7090.540
++ voxel loop:0123456789.012345
CANCELLED AT 2019-01-13T22:58:21 DUE TO TIME LIMIT ***

Hi Brad,

It is probably the setup of all of the sub-models, too. 3dDeconvolve does not just solve the one full model for the betas; it also has to remove single terms or groups of terms to get the partial (t or F) statistics. Having 200 regressors in the model will make that effort costly.

  • rick

This occurs before any parallelization across multiple jobs, which confuses me. I would have assumed that the sub-model estimation was parallelized.

Would removing -tout and -fout reduce this problem?

It seems that even with no -tout or -fout, and with -nofullf_atall added, the “matrix setup time” does not shrink.

Also, using the R ‘lm’ example again: R computes marginal t-statistics for all 1000-odd parameters via the ‘summary’ function, and this runs very quickly (for one time series).

df1 <- read.table("statout.xmat.1D")
y <- rnorm(nrow(df1))
lm.1 <- lm(y ~ ., data=df1)
system.time(summary(lm.1))

   user  system elapsed
  0.031   0.001   0.030
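To check my understanding of what those sub-models compute: a partial F statistic compares the full model against a reduced model with one stimulus class’s columns deleted. A toy version of one such test in Python/NumPy (made-up dimensions; surely not literally what 3dDeconvolve does internally):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_base = rng.standard_normal((n, 5))   # baseline regressors
X_stim = rng.standard_normal((n, 8))   # one stimulus class (8 basis functions)
X_full = np.hstack([X_base, X_stim])
y = X_full @ rng.standard_normal(13) + rng.standard_normal(n)

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

rss_full = rss(X_full, y)
rss_red = rss(X_base, y)               # sub-model: stimulus columns removed
q, dof = X_stim.shape[1], n - X_full.shape[1]
F = ((rss_red - rss_full) / q) / (rss_full / dof)
print(F > 1.0)
```

Each stimulus class would need its own reduced model, so with 200 classes there are 200 extra reduced-model matrices to prepare on top of the full one.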

This is before any sub-models are solved. The estimation of parameters (solving the models) is, I believe, what is parallelized across voxels.

Though again, I am not sure exactly why this takes so long.

Yes, I would think that removing -fout and -tout would greatly reduce that time. But that is another guess…

  • rick

Hmmm, I will try to take a closer look.

  • rick

Yes, it is just initializing all of the sub-model matrices: basically extracting sub-matrices and computing transposes, products, and inverses.

  • rick
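A back-of-envelope count (my own estimate, not taken from the 3dDeconvolve source) suggests that description can plausibly account for the ~7000 s: if each of the ~200 sub-models recomputes X'X and its inverse from scratch, and the matrix code runs as plain unoptimized loops at roughly 1e8 multiply-adds per second, the total lands right around two hours.

```python
# Rough cost of recomputing X'X and inv(X'X) for every sub-model.
# Dimensions are from this run; the 1e8 ops/s rate is a guess for
# plain (non-BLAS) loops.
n, p, groups = 2080, 1040, 200
xtx = n * p**2                 # forming X'X for one sub-model
inv = p**3                     # inverting it
total = groups * (xtx + inv)
print(f"{total:.2e} multiply-adds")          # prints 6.75e+11 multiply-adds
print(f"~{total / 1e8:.0f} s at 1e8 ops/s")  # ~6749 s, close to the 7066 s observed
```

An optimized BLAS would chew through that count in minutes, which may be where the optimization headroom is.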

Seems like there is room for optimization here. Thanks for investigating.