Akos's Apps got their speedup from many different things, not all of them related to SSE2 or SSE3 instructions, even though the Apps ran faster on CPUs capable of these.
Most of these things have been incorporated in the current SSE code.
I have written code that makes use of SSE2 instructions (double-precision vectors), but on CPUs capable of SSE2 it actually runs slightly slower than the FPU code in the current Apps; the way these CPUs handle their "virtually two" FPUs seems to be faster. Akos thinks it might give a small advantage on the new Core architecture ("Woodcrest", I think, as I haven't seen any speedup on Core Duo CPUs), but that's all.
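For illustration, here is a minimal sketch of what such SSE2 double-precision code looks like. The multiply-accumulate loop and the array names are my own example, not the actual search code; it only shows the two-doubles-per-register style of SSE2 that competes with the x87 FPU path:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: c[i] += a[i] * b[i] over double arrays,
   processing two elements per 128-bit vector. The length n is
   assumed even here for brevity. */
static void madd_sse2(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);            /* load two doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        vc = _mm_add_pd(vc, _mm_mul_pd(va, vb));     /* two lanes at once */
        _mm_storeu_pd(c + i, vc);
    }
}
```

On CPUs of that era, a loop like this only pays off if the data layout lets both lanes do useful work; with awkward layouts the x87 units can keep pace, which matches the measurement above.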
There are two places in the code that might benefit from SSE3 instructions, but the overall speedup should be only a few percent. I am currently looking into a possibility of avoiding a conditional jump there as well, which in combination might end up giving a speedup that would really be noticeable, but I can't promise that.
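As a generic sketch of what avoiding a conditional jump means: a data-dependent branch can often be replaced by arithmetic selection. The min() example below is purely illustrative, not the actual condition in the App:

```c
/* Branchy version: the CPU must predict the jump, and mispredictions
   are expensive when the condition is data-dependent. */
static int min_branchy(int a, int b)
{
    return a < b ? a : b;
}

/* Branchless version: compute a mask from the sign bit and blend.
   Assumes a - b does not overflow; d >> 31 as arithmetic shift is
   what all mainstream compilers do for signed int. */
static int min_branchless(int a, int b)
{
    int d = a - b;
    return b + (d & (d >> 31));  /* d>>31 is -1 when a < b, else 0 */
}
```

Whether this actually wins depends on how predictable the branch was; for well-predicted branches the branchy version can be just as fast.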
The average variation in the reported CPU times for identical Workunits is around 5%, so any speedup of the code would need to break this barrier to be noticeable at all. I don't think there are many possibilities left in the current code for that (and neither does Akos).
Also, the effort necessary for optimization grows continuously - I had to reorder some data structures, including rewriting the operations on them, to make the SSE2 code work, and still got next to nothing out of it.
We (Akos & me) ran out of big, striking ideas some time ago, and the small ideas don't gain much either. For the current code (and the current run) I think we have almost reached the top end. There might be 10% speedup we can get from assembler coding for specific CPUs, and maybe another 10% from playing with compilers on the code around our "kernel", but that's about it.
BM

Beta App including SSE2/3 optimisations
Short update on SSE3 (FISTTP instruction): using it at the first place in the program actually makes things slower than what the compiler generates. At the second place the overall speedup is below what you can measure reliably (maybe 2.5% or so). Definitely not worth another CPU type distinction.
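For context: FISTTP (new in SSE3) truncates the x87 top-of-stack value to an integer without the costly rounding-mode switch that classic x87 code needs. In C source the operation is just a cast; the difference is entirely in what the compiler emits for it:

```c
/* Double-to-int truncation: the operation FISTTP accelerates.
   The C semantics are fixed (truncation toward zero); the cost
   depends on the code generated:
   - plain x87:  fnstcw/fldcw dance to switch the rounding mode
   - SSE3:       a single FISTTP
   - SSE2 scalar: cvttsd2si, no mode switch needed at all */
static int trunc_to_int(double x)
{
    return (int)x;
}
```

This is why the gain is so situational: a compiler already targeting SSE2 scalar math never pays the x87 mode-switch penalty in the first place.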
BM
RE: That's odd. Akos's s4
1. On the same CPU?
2. In at least two places where Akos previously used a FISTTP instruction (requiring SSE3 capability) we are now using different code that doesn't require SSE at all and is equally fast.
I'm not surprised.
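One well-known SSE-free technique for fast double-to-int conversion from that era is the "magic constant" trick - this is my guess at the kind of replacement meant above, not necessarily the actual code. Note that unlike FISTTP it rounds to nearest rather than truncating:

```c
#include <stdint.h>
#include <string.h>

/* Add 2^52 + 2^51 so the rounded integer lands in the low mantissa
   bits of the IEEE-754 double, then read those bits back. Valid for
   |x| < 2^31 with the FPU in its default round-to-nearest mode.
   No SSE instruction is involved anywhere. */
static int32_t round_to_int_magic(double x)
{
    double d = x + 6755399441055744.0;  /* 2^52 + 2^51 */
    int64_t bits;
    memcpy(&bits, &d, sizeof bits);     /* reinterpret the bit pattern */
    return (int32_t)bits;               /* low 32 bits hold the result */
}
```

The appeal is that it is one addition plus a store/load, with no rounding-mode switch, which is how SSE-free code can match a FISTTP-based path.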
BM
@DanNeely: I think with the
@DanNeely: I think with the combined competence of Akos and me we can just make more efficient use of SSE in the current Apps.
@MetalWarrior: I have seen machines where the variation was about 10%. I didn't do a statistical analysis of that; 5% on average was just my educated guess. Maybe the machines with larger variations have a problem with CPU time measurement, maybe it depends on how often the App needs to be restarted, and maybe it has gotten better with the BOINC code in the latest Apps or the Core Clients in use by now. My impression still dates from the beginning of S5R1. Anyway, thanks for the report, and sorry for my sloppiness.
BM