Here are some simple benchmarks of this quad-double implementation. These benchmarks were run using CMUCL on a 1.42 GHz PPC. Each column gives the time relative to the same operation on a double-float. The %quad-double column gives the time for the internal implementation, without the overhead of CLOS. The QD-REAL column shows the effect of CLOS dispatch.
|Square root||125||133||There is no FP sqrt instruction on a PPC|
Here are some timing results using CMUCL on a 1.5 GHz UltraSparc IIIi
|Square root||13400||13600||UltraSparc has a FP sqrt instruction|
Here are some timing results using CMUCL with SSE2 support on a 3.06 GHz Core i3
Hida's QD package includes a few timing tests. A Lisp equivalent was written, and here are the timing results. The Lisp version tries to match the QD reference exactly, but there is no guarantee that it does.
|Test||QD||Oct||Relative speed Oct/QD|
The second and third columns are in microseconds per operation. The last column is the time for Oct relative to QD. All of these were run on a 1.5 GHz UltraSparc III. Sun Studio 11 was used to compile the C code. CMUCL 2007-10 was used for the Lisp code.
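As a rough illustration of how the numbers in such a table are derived, here is a minimal Python sketch. It is not the harness used by QD or Oct; the names `microseconds_per_op` and `relative_time` are hypothetical and exist only for this example.

```python
import math
import time


def microseconds_per_op(fn, repetitions):
    """Call fn() `repetitions` times and return the mean cost in microseconds."""
    start = time.perf_counter()
    for _ in range(repetitions):
        fn()
    elapsed_seconds = time.perf_counter() - start
    return elapsed_seconds * 1e6 / repetitions


def relative_time(candidate_us, reference_us):
    """Ratio of two per-operation times; values above 1 mean the candidate is slower."""
    return candidate_us / reference_us


if __name__ == "__main__":
    # Stand-ins for the two implementations being compared (assumed, not from the tests).
    reference_us = microseconds_per_op(lambda: math.sqrt(2.0), 100_000)
    candidate_us = microseconds_per_op(lambda: 2.0 ** 0.5, 100_000)
    print(reference_us, candidate_us, relative_time(candidate_us, reference_us))
```

The loop overhead is included in both measurements, so the ratio is only meaningful when the operation itself dominates the cost of the loop.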
It's surprising that Oct does as well as it does. To be fair, the times for Oct include the cost of CLOS dispatch, since QD uses templates and classes in the tests. Except for add and mul, QD and Oct are within a few percent. The sin test is a bit slower in Oct. I don't know why, but the test did include the accurate argument reduction. The log test is quite a bit faster for Oct. This is probably due to a different algorithm: QD computes the log with a Newton iteration, while Oct uses Halley's iteration.
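To make the difference between the two iterations concrete, here is a small sketch in plain double precision. It is not the code from QD or Oct. Both functions solve exp(y) = a for y. The Newton step is y + a*exp(-y) - 1, which converges quadratically. The Halley step is y + 2*(a - exp(y))/(a + exp(y)), which converges cubically and so needs fewer steps, each costing one exp. The function names, the starting guess, and the step counts are assumptions made for the illustration.

```python
import math


def log_newton(a, y0, steps):
    """Approximate log(a) by Newton's iteration on f(y) = exp(y) - a."""
    y = y0
    for _ in range(steps):
        # y - f(y)/f'(y) simplifies to y + a*exp(-y) - 1
        y = y + a * math.exp(-y) - 1.0
    return y


def log_halley(a, y0, steps):
    """Approximate log(a) by Halley's iteration on f(y) = exp(y) - a."""
    y = y0
    for _ in range(steps):
        e = math.exp(y)
        # Halley's update for this f simplifies to y + 2*(a - e)/(a + e)
        y = y + 2.0 * (a - e) / (a + e)
    return y


if __name__ == "__main__":
    a, guess = 10.0, 2.0  # deliberately rough starting guess
    exact = math.log(a)
    for steps in range(1, 5):
        print(steps,
              abs(log_newton(a, guess, steps) - exact),
              abs(log_halley(a, guess, steps) - exact))
```

In an extended-precision setting the starting guess would presumably be the double-float log, so only a few steps are needed with either method; whether that fully explains the timing difference is not something this sketch can show.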