Fatal Signal 11 (SIGSEGV) received during 3dTrackID PROB

I’m receiving Fatal Signal 11 (SIGSEGV) when I run 3dTrackID PROB in parallel.
It works fine when I run one subject at a time or even in a for loop; however, I would much prefer to run them in parallel to save time.
I have run all commands leading up to this (e.g. 3dDWItoDT and 3dDWUncert) in parallel with no problem; therefore, I believe the error is related to AFNI. Any ideas what might be causing this? My command and the error (example from one subject, though it happened for all) are pasted below. Thank you!

find SPN01* -type f | nohup parallel 'dtipath="/home/eji/newdata/projects/spins/dwi" && cd {} && mkdir tract && cd tract && 3dTrackID -mode PROB -uncert $dtipath/{}/uncert/Uncert+orig. -netrois $dtipath/{}/parc318.nii.gz -mask $dtipath/{}/T1_dwspace_mask.nii.gz -prefix prob -dti_in $dtipath/{}/tensors/DT_ -alg_Thresh_FA 0.2 -alg_Nmonte 1000 -alg_Nseed_Vox 5 -alg_Thresh_Frac 0.1 -nifti -dump_rois AFNI -no_indipair_out -overwrite' ::: * > nohup_tractography.out &

++ Tracking mode: PROB
++ Number of ROIs in netw[0] = 318
++ No refset labeltable for naming things.
++ SEARCHING for vector files with prefix '/home/eji/newdata/projects/spins/dwi/SPN01_MRP_0105_01/tensors/DT_':
FINDING: 'V1' 'V2' 'V3'
++ SEARCHING for scalar files with prefix '/home/eji/newdata/projects/spins/dwi/SPN01_MRP_0105_01/tensors/DT_':
FINDING: not:'DT' 'FA' 'L1' not:'L2' not:'L3' 'MD' 'RD'
++ Done with scalar search, found: 4 parameters
→ so will have 15 output data matrices.
++ Effective Monte iterations: 5000. Fraction threshold set: 0.10000
→ Ntrack voxel threshold: 500.

Fatal Signal 11 (SIGSEGV) received
Bottom of Debug Stack
** AFNI version = AFNI_20.3.03 Compile date = Dec 7 2020
** [[Precompiled binary linux_ubuntu_16_64: Dec 7 2020]]
** Program Death **
** If you report this crash to the AFNI message board,
** please copy the error messages EXACTLY, and give
** the command line you used to run the program, and
** any other information needed to repeat the problem.
** You may later be asked to upload data to help debug.
** Crash log is appended to file /home/eji/.afni.crashlog

------ CRASH LOG ------------------------------***
Fatal Signal 11 (SIGSEGV) received
… recent internal history …
mri_killpurge -- check if im==NULL ptr=0x21ba600
mri_killpurge -- can't killpurge NULL fname! {17729 ms}
----mri_killpurge [4]: EXIT} (file=mri_purger.c line=270) to mri_free {17729 ms}
mri_free -- free im {17729 ms}
---mri_free [3]: EXIT} (file=mri_free.c line=67) to EDIT_substitute_brick {17729 ms}
--EDIT_substitute_brick [2]: EXIT} (file=edt_substbrick.c line=69) to Bottom of Debug Stack {17729 ms}
++THD_delete_3dim_dataset [2]: {ENTRY (file=thd_delete.c line=131) from Bottom of Debug Stack {17731 ms}
+++THD_delete_datablock [3]: {ENTRY (file=thd_delete.c line=27) from THD_delete_3dim_dataset {17731 ms}
THD_delete_datablock -- call purge {17731 ms}
++++THD_purge_datablock [4]: {ENTRY (file=thd_purgedblk.c line=15) from THD_delete_datablock {17731 ms}
THD_purge_datablock -- MEM_MALLOC: clearing sub-bricks {17731 ms}
+++++mri_killpurge [5]: {ENTRY (file=mri_purger.c line=259) from THD_purge_datablock {17731 ms}
mri_killpurge -- check if im==NULL ptr=0x21b62b0
mri_killpurge -- can't killpurge NULL fname! {17731 ms}
-----mri_killpurge [5]: EXIT} (file=mri_purger.c line=270) to THD_purge_datablock {17731 ms}
----THD_purge_datablock [4]: EXIT} (file=thd_purgedblk.c line=39) to THD_delete_datablock {17731 ms}
THD_delete_datablock -- destroy imarr {17731 ms}
++++mri_free [4]: {ENTRY (file=mri_free.c line=49) from THD_delete_datablock {17731 ms}
mri_free -- call killpurge {17731 ms}
+++++mri_killpurge [5]: {ENTRY (file=mri_purger.c line=259) from mri_free {17731 ms}
mri_killpurge -- check if im==NULL ptr=0x21b62b0
mri_killpurge -- can't killpurge NULL fname! {17731 ms}
-----mri_killpurge [5]: EXIT} (file=mri_purger.c line=270) to mri_free {17731 ms}
mri_free -- free im {17731 ms}
----mri_free [4]: EXIT} (file=mri_free.c line=67) to THD_delete_datablock {17731 ms}
THD_delete_datablock -- free brick_ stuff {17731 ms}
THD_delete_datablock -- KILL_KILL {17731 ms}
THD_delete_datablock -- free attributes {17731 ms}
---THD_delete_datablock [3]: EXIT} (file=thd_delete.c line=122) to THD_delete_3dim_dataset {17731 ms}
THD_delete_3dim_dataset -- KILL_KILL {17731 ms}
--THD_delete_3dim_dataset [2]: EXIT} (file=thd_delete.c line=179) to Bottom of Debug Stack {17731 ms}

------ CRASH LOG ------------------------------**
Fatal Signal 11 (SIGSEGV) received
… recent internal history …
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}
++TrackItP_NEW_M [2]: {ENTRY (file=DoTrackit.c line=1322) from Bottom of Debug Stack {853714 ms}
--TrackItP_NEW_M [2]: EXIT} (file=DoTrackit.c line=1555) to Bottom of Debug Stack {853714 ms}

** AFNI compile date = Dec 7 2020
** [[Precompiled binary linux_ubuntu_16_64: Dec 7 2020]]
** Program Crash **

Hi, Ellen-

I wonder if it is a memory problem, as 318 parcels is a lot for tracking (the memory consumption goes like Nroi**2).

To test this, could you try changing these two parts:


-netrois $dtipath/{}/parc318.nii.gz
-prefix prob

to


-netrois $dtipath/{}/parc318.nii.gz'<0..5>'
-prefix TEST

… which would presumably give you a network of 5 ROIs to track.

Does that solve the issue?

--pt

Hi Paul,
Thanks for your suggestion that it could be a memory issue. The test you proposed results in a syntax issue for me because I can’t include single quotes within the nohup command (a single quote ends the command).

However, I monitored the memory usage of one subject using htop (it continuously grew… to 48GB by the end… yikes!). Is it possible that there is a bug in the code where the memory usage keeps growing? Or perhaps it’s really because of the 318 parcels.

I have decreased the number of subjects that can be run at once based on our total RAM of 256GB. So, while it was fine to run 32 subjects at once on our 32-core machine for 3dDWUncert and other AFNI commands, it appears best to manually decrease that to ~5 subjects in parallel to be safe for 3dTrackID. That makes sense, because I initially decreased to 25 subjects, then 20, then 15, and they still crashed. At 5, so far so good.
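
As a rough sketch of that sizing, the job count can be derived from the numbers above (256GB total, ~48GB observed per subject, 32 cores). The function name max_jobs is made up for illustration, and the per-job figure is only the peak seen in htop for one subject, so treat the result as a starting point rather than a guarantee.

```shell
# max_jobs (hypothetical helper): largest number of concurrent jobs that
# fits in RAM, capped at the core count and never below 1.
# Arguments: total RAM in GB, peak RAM per job in GB, number of cores.
max_jobs() {
    local total_gb="$1" per_job_gb="$2" cores="$3"
    local n=$(( total_gb / per_job_gb ))   # integer division rounds down
    if [ "$n" -gt "$cores" ]; then n="$cores"; fi
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

njobs=$(max_jobs 256 48 32)
echo "$njobs"    # 5
```

The value could then be passed to GNU parallel with -j "$njobs". GNU parallel also has a --memfree option that delays starting new jobs until a given amount of memory is free, which may be worth a look for runs whose memory grows over time.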

Thanks again for the memory hint.

Hi, Ellen-

Yes, what happens for probabilistic tracking is:
Each voxel makes a matrix to keep a tally of the tracks going through it, or more specifically the N = n(n+1) independent elements of the upper-half triangle of the tracking matrix, where ‘n’ is the number of ROIs. For your case, n=318 --> N = 101442. Each of those elements is 4 bytes, I believe, which means that each voxel requires ~400 kilobytes, or 0.4MB, or 0.0004GB, of memory. So, if you have 100000 voxels in your mask, that would be 40GB of memory.
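
The arithmetic above can be sketched in shell. The n(n+1) element count and the 4-byte element size are simply the figures quoted in this post (the latter stated as a belief), not values checked against the 3dTrackID source, and the function name trackid_mem_bytes is made up for illustration.

```shell
# trackid_mem_bytes (hypothetical helper): rough memory estimate for a
# 3dTrackID PROB run.
# Assumptions: n(n+1) tally elements per voxel, 4 bytes per element.
# Arguments: number of ROIs, number of mask voxels, [bytes per element].
trackid_mem_bytes() {
    local n_roi="$1" n_vox="$2" bytes_per_elem="${3:-4}"
    echo $(( n_roi * (n_roi + 1) * bytes_per_elem * n_vox ))
}

bytes=$(trackid_mem_bytes 318 100000)
echo "$bytes bytes, ~$(( bytes / 1000000000 )) GB"   # 40576800000 bytes, ~40 GB
```

Swapping in the real voxel count of a subject's mask gives a per-subject figure to compare against the available RAM.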

So, if you need that many ROIs, another way to save a bit of memory is to restrict the number of voxels in the mask. Typically, for humans over the age of 5 years, the standard FA threshold for tracking is 0.2. So, one might consider making the “mask” dataset by taking the wholebrain mask and intersecting it with an FA>0.2 mask, because no tracts could wander below the FA threshold anyways (so why set aside memory for them?). In the probabilistic case, though, the FA value gets perturbed according to its uncertainty, so you would want a bit of extra allowance for places where FA is initially slightly below threshold and gets perturbed above it; in reality you might use FA>0.15 as the extra threshold, and use 3dcalc to make this new mask:


3dcalc -a DT_FA -b WB_MASK -expr 'step(a-0.15)*step(b)' -prefix WB_w_FA_gt_0.15.nii.gz

While this makes sense in theory, in practice I have found that this does not save thaaat much memory in most cases, maybe of order 10%, so it isn’t going to allow you to parallelize tremendously more. I don’t know that this would really be worth it, unless you are really at a borderline case of runtime ability on a computer.

Sooo, yes, every program might have its own memory requirements, and even those depend on the type of data and inputs. For example, a smaller network would use a lot less memory in 3dTrackID; the “large” FreeSurfer parcellation with about 200 ROIs has about 60% of the Nrois, which means about 40% of the memory usage, which would be pretty significant savings in this case. 3dDWUncert has the advantage of being parallelized with OpenMP, so it can use multiple threads to run faster than its initial, single-processor design. Yay. For 3dTrackID: large networks will require large memory. At least until I have a looot of free time on my hands.
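
The 200-vs-318 comparison can be checked with a couple of lines of shell, again assuming the per-voxel tally scales as n(n+1); the function name mem_percent is made up for illustration.

```shell
# mem_percent (hypothetical helper): memory of a smaller network as a
# whole-number percentage of a larger one, assuming n(n+1) scaling.
mem_percent() {
    local n_small="$1" n_big="$2"
    echo $(( 100 * n_small * (n_small + 1) / (n_big * (n_big + 1)) ))
}

mem_percent 200 318    # 39 (integer division truncates 39.6)
```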

A note on the test case: it is too bad that nohup doesn’t pass along quotes. Perhaps if you put the entire command in a script, that could help. Or, you could make a test dset beforehand, without nohup (because it is quick to do), such as with:


3dcalc -a DSET"<0..5>" -expr 'a' -prefix DSET_0-5_vals.nii.gz

But you have resolved this particular issue, even without this!
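
For the script route, a minimal sketch might look like the following. The file name run_trackid.sh, the function name, and the DRY_RUN switch are all made up for illustration; the options are copied from the command earlier in this thread, with the 5-ROI test selector and TEST prefix swapped in. It also assumes each subject directory sits directly under the dtipath folder, and it uses mkdir -p so that reruns do not fail on an existing tract directory.

```shell
#!/bin/bash
# run_trackid.sh (hypothetical name): one 3dTrackID PROB run per subject.
# Inside a script file the selector needs no special escaping: double quotes
# keep '<' and '>' from being read as shell redirections.
dtipath="/home/eji/newdata/projects/spins/dwi"

run_trackid() {
    local subj="$1"
    # Build the argument list for this subject.
    set -- 3dTrackID -mode PROB \
        -uncert  "$dtipath/$subj/uncert/Uncert+orig." \
        -netrois "$dtipath/$subj/parc318.nii.gz<0..5>" \
        -mask    "$dtipath/$subj/T1_dwspace_mask.nii.gz" \
        -prefix  TEST \
        -dti_in  "$dtipath/$subj/tensors/DT_" \
        -alg_Thresh_FA 0.2 -alg_Nmonte 1000 -alg_Nseed_Vox 5 \
        -alg_Thresh_Frac 0.1 -nifti -dump_rois AFNI \
        -no_indipair_out -overwrite
    if [ -n "$DRY_RUN" ]; then
        echo "$*"    # DRY_RUN (hypothetical switch): print the command only
    else
        mkdir -p "$dtipath/$subj/tract" && cd "$dtipath/$subj/tract" && "$@"
    fi
}

# Run for the subject named on the command line, if any.
if [ "$#" -gt 0 ]; then
    run_trackid "$1"
fi
```

Once the file is made executable, something like `nohup parallel -j 5 ./run_trackid.sh {} ::: SPN01* > nohup_tractography.out &` run from the folder holding the subject directories would call it once per subject, with no nested quotes on the command line.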

--pt