yao

Performance

As of version 3.0, yao can use either the Apple Veclib or FFTW as its FFT engine, so it can run on any *nix machine. FFTW and the Apple Veclib perform similarly, although the details depend on the size of the problem.

The following table gives a few examples of how fast the main loop runs on various machines. Display was set to zero for these tests. Note that the large difference with and without noise for sh6m2-bench.par is largely explained by the large number of pixels per subaperture in this example (Poisson noise, although optimized using a compiled routine in this package, is still very expensive).
File             Machine          FFT engine  it/sec  WF sensing (ms)  Comments
sh6m2-bench.par  PBG4 800MHz      Veclib      62.3    8.16             w/o noise
sh6m2-bench.par  PBG4 800MHz      FFTW        58.4    9.02             w/o noise
sh6m2-bench.par  G5 2x2GHz        Veclib      205.5   2.70             w/o noise
sh6m2-bench.par  G5 2x2GHz        FFTW        184.5   3.14             w/o noise
sh6m2-bench.par  Athlon 2x2.8GHz  FFTW        158.0   2.75             w/o noise
sh6m2-bench.par  PBG4 800MHz      Veclib      28.4    27.6             w/ noise
sh6m2-bench.par  PBG4 800MHz      FFTW        27.4    28.4             w/ noise
sh6m2-bench.par  G5 2x2GHz        Veclib      97.2    8.14             w/ noise
sh6m2-bench.par  G5 2x2GHz        FFTW        92.6    8.54             w/ noise
sh6m2-bench.par  Athlon 2x2.8GHz  FFTW        84.0    8.29
c188-bench.par   PBG4 800MHz      Veclib      7.44    62.14
c188-bench.par   PBG4 800MHz      FFTW        7.40    60.27
c188-bench.par   G5 2x2GHz        Veclib      24.7    26.3
c188-bench.par   G5 2x2GHz        FFTW        27.5    22.55
c188-bench.par   Athlon 2x2.8GHz  FFTW        13.2    39.38
mcao2-bench.par  PBG4 800MHz      Veclib      0.63    425
mcao2-bench.par  G5 2x2GHz        Veclib      2.56    136.5
mcao2-bench.par  Athlon 2x2.8GHz  FFTW        1.25    182

Table: Performance for yao v3.0
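
To read the table, it can help to convert the iteration rate into a time per iteration and compare it with the WF sensing time. The sketch below does that arithmetic. It assumes that the WF sensing time in the table is a per-iteration figure, which the table does not state explicitly, and the function names are made up for this example (they are not part of yao).

```c
#include <assert.h>

/* Time spent in one main-loop iteration, in milliseconds,
 * computed from the iterations-per-second column. */
double iteration_time_ms(double it_per_sec)
{
    assert(it_per_sec > 0.0);
    return 1000.0 / it_per_sec;
}

/* Fraction of one iteration spent in wavefront sensing.
 * Assumption: wfs_ms is the WF sensing time per iteration. */
double wfs_fraction(double it_per_sec, double wfs_ms)
{
    return wfs_ms / iteration_time_ms(it_per_sec);
}
```

Under that assumption, sh6m2-bench.par on the PowerBook G4 with Veclib takes about 16.05ms per iteration without noise, of which wavefront sensing is roughly 51%. With noise the iteration takes about 35.2ms and wavefront sensing accounts for roughly 78% of it.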

Yao-FFTW and yao-veclib perform quite similarly on the PowerBook G4 and the Dual G5. FFT tests and FFTW benchmarks for the G5 show that the Apple Veclib wins for small array sizes, and that this trend reverses at around 128x128.

Something I can't explain in these benchmarks is the poor performance of the Athlon relative to the G5 for large FFT problems like c188-bench.par. I have double-checked that the performance of the FFT itself is good. For instance, using the bench -s #x# utility provided with the FFTW distribution, 128x128 FFTs take 530 microseconds on the Athlon and 490 microseconds on the G5, close to what could be expected, and 1024x1024 takes 179ms on the Athlon and 322ms on the G5. This does not fit with the poor Athlon performance on c188-bench.par. I suspect something is wrong with my coding in _cwfs() (yao_fast.c) and that it is somehow better tolerated by the G5 (OS X) than by the Athlon (Linux).

mcao2-bench.par is a very exhaustive simulation of the Gemini MCAO system. It uses a pupil of 128 pixels (6cm sampling in the pupil plane), five 16x16 LGS WFSs (including LGS elongation and centroid gain optimization), four TT WFSs, all including noise, and three high-order DMs (approximately 950 actuators in total) including the effect of hysteresis, and it estimates the PSF at 16 field locations and 2 different wavelengths. That's quite a lot of calculations. Here again, the G5 shows a significant advantage over the Athlon. The major difference in fact comes from the computation of the DM surfaces (405ms on the Athlon vs 113ms on the G5), a compiled routine that is basically a two-line loop accumulating the scaled influence functions into the final DM shape array. I can't understand why the Athlon is so slow on this routine (even though I compiled it with "-O3").
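
For readers who want a picture of the kind of loop described above, here is a minimal sketch of accumulating scaled influence functions into a DM shape array. It is not the routine from yao_fast.c: the function name dm_shape, the argument names, and the memory layout (one influence function after another, each n_pix values long) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the yao source.
 * influence: n_act influence functions of n_pix floats each,
 *            stored one after another (assumed layout)
 * command:   n_act actuator commands
 * shape:     n_pix output values, overwritten with the DM shape */
void dm_shape(const float *influence, const float *command,
              float *shape, size_t n_pix, size_t n_act)
{
    assert(influence && command && shape);

    /* Start from a flat mirror. */
    for (size_t p = 0; p < n_pix; p++)
        shape[p] = 0.0f;

    /* Add each influence function, scaled by its command. */
    for (size_t a = 0; a < n_act; a++) {
        const float c = command[a];
        const float *f = influence + a * n_pix;
        for (size_t p = 0; p < n_pix; p++)
            shape[p] += c * f[p];
    }
}
```

A loop like this does very little arithmetic for each value it reads, so its speed may depend as much on memory access as on the processor. That is only one thing that could be checked, not an explanation of the timings above.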

Page updated on UT $Date: 2007/12/12 23:29:23 $
