Well, having a cluster to run on is great, but there are some subtleties. (I assume your reported times are from job start to finish, not including the wait time for getting nodes on the system?) Actually, I’m curious how long the jobs took to finish.
Firstly, parallelization on a cluster is different from parallelization on a single computer with multiple cores (–> ask The Bob for more details, surely!). You can get a sense of this from the list of “anecdotal” processing times recorded for FreeSurfer recon-all, which I ran with various degrees of parallelization on both Biowulf and on my own computer, reported here:
Note how the times are pretty different, even though Biowulf is a very impressive computing system.
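For reference, the degree of parallelization in a recon-all run is set at the command line; this is just a sketch (the subject ID is made up, and the -parallel/-openmp flags assume FreeSurfer 6 or later):

```shell
# Serial run (one core):
#   recon-all -all -s sub-001
#
# Parallelized run over 4 cores (FreeSurfer 6+):
#   recon-all -all -s sub-001 -parallel -openmp 4
#
# On Biowulf, match the -openmp count to the CPUs you request
# from slurm, e.g.:
#   sbatch --cpus-per-task=4 my_recon_script.sh
```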
Secondly, the read/write steps on the cluster can be pretty slow in the main directories (say, compared to a desktop). For some programs (like the scripts made by afni_proc.py), we tend to write to a temporary scratch disk, do all the work there, and copy everything back to the main directories when done; this has dramatically improved processing times in the past. (NB: the Biowulf folks are always improving things, and several related operations have gotten faster over time; we still tend to use temporary scratch disks when possible.) Some notes on using the Biowulf scratch disk for a FreeSurfer recon-all run (with script!) are provided here:
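In miniature, the “work on scratch, copy back” pattern looks like the sketch below. On Biowulf the node-local scratch is /lscratch/$SLURM_JOB_ID (requested at submission time with something like `sbatch --gres=lscratch:50` for ~50 GB); the directory names and file contents here are stand-ins for illustration:

```shell
datadir=$(mktemp -d)                 # stand-in for your shared-filesystem data dir
echo "anatomical data" > "$datadir"/anat.nii

# use the job's local scratch if we are on a compute node, else a temp dir
scratch=${SLURM_JOB_ID:+/lscratch/$SLURM_JOB_ID}
scratch=${scratch:-$(mktemp -d)}

# 1) copy inputs over to the fast local disk
mkdir -p "$scratch"/work
cp -r "$datadir"/. "$scratch"/work/

# 2) do all the processing there (recon-all, the afni_proc.py script, ...)
cd "$scratch"/work
echo "derived result" > results.txt

# 3) copy everything back to the shared filesystem when done
cp -r "$scratch"/work/. "$datadir"/
```

The win comes from steps 1-3 being the only traffic to the slow shared filesystem; all the many small reads/writes in the middle hit the fast local disk.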
Thirdly, there are different sets of CPUs on Biowulf (different partitions, such as “norm”, “quick”, etc.), and some might be faster or slower than others. Even within a given partition, speed can vary. NB: “quick” here does not mean it processes faster, just that its jobs must be quick, i.e., finish in under 4 hours. So, your timing tests might vary with this.
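The partition and walltime are chosen at submission time; for example (flag values here are purely illustrative):

```shell
# "quick" partition jobs must finish in under 4 hours, so request a
# walltime inside that limit:
#   sbatch --partition=quick --time=03:59:00 my_job.sh
#
# "norm" allows longer jobs:
#   sbatch --partition=norm --time=1-00:00:00 my_job.sh
#
# Biowulf's freen tool shows what is currently free per partition:
#   freen
```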
Another consideration is that the message passing that allows for parallelization also takes work: doubling the number of processors in a parallel job will not simply double the speed, but might, say, bring the runtime down to 60-70% of what it was (and this estimate is totally dependent on the kind of job being done, the method of parallelization, and the systems being used; I really shouldn’t put any numbers here, but just to give a sense…). Also, as you increase the number of processors, that coordination overhead becomes more expensive, so your marginal benefit will likely keep decreasing. At some point you can actually have too many processors for a task, because so much effort is spent on dividing up the work and re-gathering the results. Finally, one can even get “process thrashing”, where the act of trying to parallelize causes the computer so much hassle that it slows the work to a crawl (or: over-parallelizing becomes nearly paralyzing).
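As a toy illustration of those diminishing returns (this is just Amdahl’s law with made-up numbers, not a model of any particular program): suppose 80% of a job parallelizes perfectly and the remaining 20% stays serial:

```shell
# speedup(n) = 1 / ( (1-p) + p/n ),  p = parallelizable fraction
for n in 1 2 4 8 16 72; do
    awk -v n="$n" -v p=0.8 'BEGIN {
        printf "%3d CPUs -> %.2fx speedup\n", n, 1/((1-p) + p/n)
    }'
done
```

With these (invented) numbers, 72 CPUs gets you under 4.8x, and no number of CPUs can pass 1/(1-p) = 5x; and that ceiling is *before* counting the coordination overhead itself, which grows with the CPU count.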
I don’t know that 72 processors is causing thrashing, but it seems like overkill. Also, requesting that many processors could lead to a long delay in getting the resources from the slurm job manager, so you are effectively adding more wait time for yourself.
Finally, different programs benefit from different levels of parallelization. The scripts from afni_proc.py have several steps; an important fraction of those (the alignment steps and the cluster-simulation estimates) benefit from parallelization, but not all pieces do, so having more cores will only speed up some parts. For @SSwarper, @animal_warper, or 3dQwarp, parallelization will help a lot, because the bulk of each one’s steps benefits.
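AFNI’s OpenMP-capable programs (3dQwarp among them, and therefore @SSwarper and @animal_warper, which call it) take their thread count from the standard OpenMP environment variable, so one way to cap the parallel pieces is (the count of 8 is just illustrative):

```shell
# cap AFNI's OpenMP-enabled steps at 8 threads
export OMP_NUM_THREADS=8

# the serial steps of an afni_proc.py script ignore this; only the
# OpenMP-capable programs pick it up
echo "OpenMP threads requested: $OMP_NUM_THREADS"
```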
OK, so, if it were me running on Biowulf, I would probably start with 8-16 CPUs. Also consider whether you are batch processing a lot of subjects: if you use a lot of CPUs per subject, then slurm might make some subjects wait to start until others have finished, whereas if you used fewer, you might be able to start them all at around the same time. Each subject might take a bit longer, but overall you benefit from having them all run in parallel, and the group finishes earlier.
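On Biowulf, that kind of batch submission is typically done with the swarm tool: one command line per subject in a command file, with a modest per-subject CPU request. The subject script names and flag values below are made up for illustration:

```shell
# one processing command per subject
cat > do_all.swarm <<'EOF'
tcsh proc.sub-001.tcsh
tcsh proc.sub-002.tcsh
tcsh proc.sub-003.tcsh
EOF

# submit with, e.g., 8 CPUs and 16 GB of RAM per subject:
#   swarm -f do_all.swarm -t 8 -g 16 --time 12:00:00
wc -l < do_all.swarm       # 3 subjects -> 3 lines
```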
If you want, you could also post the command you are using to submit the swarm/sbatch job, and we could see about suggesting any improvements there.