yao
Performance
From version 3.0, Yao has been ported to use either the Apple Veclib or FFTW. It can thus run on any *nix machine. The performance of FFTW and the Apple Veclib is relatively similar, although details depend on the size of the problem.
The following table gives a few examples of how fast the main loop runs on various machines. Display was set to zero for these tests. Note that the large difference with and without noise for sh6m2-bench.par is largely explained by the large number of pixels/subaperture in this example (Poisson noise, although optimized using a compiled routine in this package, is still very expensive).
File | Machine | FFT Engine | #it/sec | WF sensing time | Comments |
sh6m2-bench.par | PBG4 800MHz | Veclib | 62.3 | 8.16ms | w/o noise |
sh6m2-bench.par | PBG4 800MHz | FFTW | 58.4 | 9.02ms | w/o noise |
sh6m2-bench.par | G5 2x2GHz | Veclib | 205.5 | 2.70ms | w/o noise |
sh6m2-bench.par | G5 2x2GHz | FFTW | 184.5 | 3.14ms | w/o noise |
sh6m2-bench.par | Athlon 2x2.8GHz | FFTW | 158.0 | 2.75ms | w/o noise |
sh6m2-bench.par | PBG4 800MHz | Veclib | 28.4 | 27.6ms | w/ noise |
sh6m2-bench.par | PBG4 800MHz | FFTW | 27.4 | 28.4ms | w/ noise |
sh6m2-bench.par | G5 2x2GHz | Veclib | 97.2 | 8.14ms | w/ noise |
sh6m2-bench.par | G5 2x2GHz | FFTW | 92.6 | 8.54ms | w/ noise |
sh6m2-bench.par | Athlon 2x2.8GHz | FFTW | 84.0 | 8.29ms | w/ noise |
c188-bench.par | PBG4 800MHz | Veclib | 7.44 | 62.14ms | |
c188-bench.par | PBG4 800MHz | FFTW | 7.40 | 60.27ms | |
c188-bench.par | G5 2x2GHz | Veclib | 24.7 | 26.3ms | |
c188-bench.par | G5 2x2GHz | FFTW | 27.5 | 22.55ms | |
c188-bench.par | Athlon 2x2.8GHz | FFTW | 13.2 | 39.38ms | |
mcao2-bench.par | PBG4 800MHz | Veclib | 0.63 | 425ms | |
mcao2-bench.par | G5 2x2GHz | Veclib | 2.56 | 136.5ms | |
mcao2-bench.par | Athlon 2x2.8GHz | FFTW | 1.25 | 182ms | |
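To read the two numeric columns together, it can help to convert the iteration rate into a time per iteration and compare it with the WF sensing time. The sketch below does that arithmetic. It assumes that the "WF sensing time" column is the time spent in wavefront sensing per loop iteration, which the table does not state explicitly; the function names are hypothetical and not part of yao.

```c
#include <assert.h>

/* Time of one loop iteration in milliseconds, from an iterations/sec figure. */
double iteration_ms(double it_per_sec)
{
    assert(it_per_sec > 0.0);
    return 1000.0 / it_per_sec;
}

/* Fraction of one iteration spent in wavefront sensing.
 * Assumption: wfs_ms is the WF sensing time per iteration. */
double wfs_fraction(double it_per_sec, double wfs_ms)
{
    return wfs_ms / iteration_ms(it_per_sec);
}
```

Under that assumption, the first row (62.3 it/sec, 8.16ms) corresponds to about 16.1ms per iteration, of which roughly half is wavefront sensing.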
Yao-FFTW and yao-veclib display quite similar performance on the PowerBook G4 and the Dual G5. FFT tests and FFTW benchmarks for the G5 show that the Apple veclib wins for small array sizes, and that this trend is inverted at around 128x128.
Something I can't explain in these benchmarks is the poor performance of the Athlon relative to the G5 for large FFT problems like c188-bench.par. I have double-checked that the performance of the FFT itself is good: for instance, using the bench -s #x# utility provided with the FFTW distribution, I get 530 microseconds to do 128x128 FFTs on the Athlon and 490 microseconds on the G5, close to what could be expected. 1024x1024 takes 179ms on the Athlon and 322ms on the G5. This does not fit with the poor Athlon performance on c188-bench.par. I suspect something is wrong with my coding in _cwfs() (yao_fast.c) and that somehow it is better tolerated by the G5 (OS X) than by the Athlon (Linux).
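The mismatch is easier to see as ratios. The sketch below is only the arithmetic on the figures quoted above; the helper names are hypothetical and not part of yao or FFTW.

```c
#include <assert.h>

/* Speedup of machine A over machine B from elapsed times (same units).
 * A value above 1 means A is faster. */
double speedup_from_times(double time_a, double time_b)
{
    assert(time_a > 0.0 && time_b > 0.0);
    return time_b / time_a;
}

/* Speedup of machine A over machine B from rates such as iterations/sec. */
double speedup_from_rates(double rate_a, double rate_b)
{
    assert(rate_a > 0.0 && rate_b > 0.0);
    return rate_a / rate_b;
}
```

With the Athlon as A and the G5 as B, speedup_from_times(179.0, 322.0) gives about 1.8 for the 1024x1024 FFT, while speedup_from_rates(13.2, 27.5) gives about 0.48 for c188-bench.par with FFTW. The Athlon is faster on the large FFT alone yet runs the full loop at about half the G5 rate.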
mcao2-bench.par is a very exhaustive simulation of the Gemini MCAO system. It uses a pupil of 128 pixels (6cm sampling in the pupil plane), five 16x16 LGS WFSs (including LGS elongation and centroid gain optimization), four TT WFSs, all including noise, and three high-order DMs (a total of approximately 950 actuators) including the effect of hysteresis, and it estimates the PSF at 16 field locations and 2 different wavelengths. That's quite a lot of calculations. Here again, the G5 shows a significant advantage over the Athlon. The major difference in fact comes from the computation of the DM surfaces (405ms on the Athlon vs 113ms on the G5), a compiled routine that is basically a two-line loop accumulating the scaled influence functions into the final DM shape array. I can't understand why the Athlon is so slow on this routine (even though I compiled it with "-O3").
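For readers unfamiliar with this kind of routine, here is a minimal sketch of an accumulation loop of the type described. It is not the code from yao_fast.c: the function name dm_shape, the argument layout (influence functions stored back to back), and the use of single precision are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the actual yao routine.
 * influence: n_act influence functions stored back to back,
 *            each n_pix floats long (assumed layout)
 * command:   one scalar command per actuator
 * shape:     output DM surface of n_pix floats, overwritten */
void dm_shape(const float *influence, const float *command,
              size_t n_act, size_t n_pix, float *shape)
{
    assert(influence != NULL && command != NULL && shape != NULL);

    /* Start from a flat surface. */
    for (size_t i = 0; i < n_pix; i++)
        shape[i] = 0.0f;

    /* Add each influence function scaled by its actuator command. */
    for (size_t k = 0; k < n_act; k++) {
        const float c = command[k];
        const float *f = influence + k * n_pix;
        for (size_t i = 0; i < n_pix; i++)
            shape[i] += c * f[i];
    }
}
```

A loop written this way performs one multiply-add per influence-function value read, so its speed may depend more on memory throughput than on arithmetic. Whether that plays any role in the Athlon timings above has not been checked.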