OpenGL: Is there an accurate approximation of the acos() function?

I need a double-precision acos() function in a compute shader. Since GLSL has no built-in double-precision acos(), I tried to implement my own.

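First, I implemented a Taylor series, following the equation on Wikipedia (Taylor series), with precomputed factorial values. But this turns out to be inaccurate near 1: the maximum error is about 0.08 with 40 iterations.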
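For reference, the standard series in question, written out. This is the textbook expansion, not copied from the asker's code:

```latex
\arcsin x = \sum_{n=0}^{\infty} \frac{(2n)!}{4^{n}\,(n!)^{2}\,(2n+1)}\, x^{2n+1},
\qquad
\arccos x = \frac{\pi}{2} - \arcsin x,
\qquad |x| \le 1 .
```

At |x| = 1 the coefficients decay only like n^(-3/2), so the remainder after N terms shrinks roughly like 1/sqrt(pi N). For N = 40 that is on the order of 0.09, which is consistent with the error of about 0.08 reported above.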

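I also implemented this method, which works very well on the CPU with a maximum error of -2.22045e-16, but I have some trouble implementing it in the shader.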

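Currently, I am using an acos() approximation function from here, where somebody posted his approximation functions. I am using the most accurate function from that site and now get a maximum error of -7.60454e-08, but that error is also too high.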

My code for this function is:

double myACOS(double x)
{
    double part[4];
    part[0] = 32768.0/2835.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+sqrt(2.0+2.0*x))));
    part[1] = 256.0/135.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+2.0*x)));
    part[2] = 8.0/135.0*sqrt(2.0-sqrt(2.0+2.0*x));
    part[3] = 1.0/2835.0*sqrt(2.0-2.0*x);
    return (part[0]-part[1]+part[2]-part[3]);
}

Does anyone know of another implementation of acos() that is very accurate and, if possible, easy to implement in a shader?

Some system information:

Nvidia GT 555M
Running OpenGL 4.3 with optirun

Related discussion

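The NVIDIA GT 555M GPU is a device with compute capability 2.1, so it has native hardware support for basic double-precision operations, including fused multiply-add (FMA). As with all NVIDIA GPUs, the square root operation is emulated. I am familiar with CUDA, but not with GLSL. According to version 4.3 of the GLSL specification, it exposes double-precision FMA as the function fma() and provides a double-precision square root, sqrt(). It is not clear whether the sqrt() implementation is correctly rounded according to IEEE-754 rules. I will assume that it is, by analogy with CUDA.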

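Instead of a Taylor series, one would want to use a polynomial minimax approximation, which reduces the number of terms required. Minimax approximations are typically generated with a variant of the Remez algorithm. To optimize both speed and accuracy, the use of FMA is essential. Evaluating the polynomial with Horner's scheme helps accuracy. The code below uses a second-order Horner scheme. As in DanceIgel's answer, acos is conveniently computed using an asin approximation as the basic building block, combined with standard mathematical identities.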

With 400M test vectors, the maximum relative error seen with the code below was 2.67e-16, and the maximum ulp error observed was 1.442 ulp.

/* compute arcsin (a) for a in [-9/16, 9/16] */
double asin_core (double a)
{
    double q, r, s, t;

    s = a * a;
    q = s * s;
    r = 5.5579749017470502e-2;
    t = -6.2027913464120114e-2;
    r = fma (r, q, 5.4224464349245036e-2);
    t = fma (t, q, -1.1326992890324464e-2);
    r = fma (r, q, 1.5268872539397656e-2);
    t = fma (t, q, 1.0493798473372081e-2);
    r = fma (r, q, 1.4106045900607047e-2);
    t = fma (t, q, 1.7339776384962050e-2);
    r = fma (r, q, 2.2372961589651054e-2);
    t = fma (t, q, 3.0381912707941005e-2);
    r = fma (r, q, 4.4642857881094775e-2);
    t = fma (t, q, 7.4999999991367292e-2);
    r = fma (r, s, t);
    r = fma (r, s, 1.6666666666670193e-1);
    t = a * s;
    r = fma (r, t, a);

    return r;
}

/* Compute arccosine (a), maximum error observed: 1.4316 ulp
   Double-precision factorization of π courtesy of Tor Myklebust
*/
double my_acos (double a)
{
    double r;

    r = (a > 0.0) ? -a : a; // avoid modifying the "sign" of NaNs
    if (r > -0.5625) {
        /* arccos(x) = pi/2 - arcsin(x) */
        r = fma (9.3282184640716537e-1, 1.6839188885261840e+0, asin_core (r));
    } else {
        /* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2)) */
        r = 2.0 * asin_core (sqrt (fma (0.5, r, 0.5)));
    }
    if (!(a > 0.0) && (a >= -1.0)) { // avoid modifying the "sign" of NaNs
        /* arccos (-x) = pi - arccos(x) */
        r = fma (1.8656436928143307e+0, 1.6839188885261840e+0, -r);
    }
    return r;
}

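My current accurate shader implementation of acos() is a mix of the usual Taylor series and the answer from Bence. With 40 iterations I get an accuracy of 4.44089e-16 relative to the acos() implementation from math.h. Maybe it is not the best, but it works for me.

Here it is: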

// PI is assumed to be defined elsewhere in the shader as a double-precision constant.

// Taylor series of asin(x): the x term plus 40 further terms.
double myASIN2(double x)
{
    double sum, tempExp;
    tempExp = x;
    double factor = 1.0;
    double divisor = 1.0;
    sum = x;
    for(int i = 0; i < 40; i++)
    {
        tempExp *= x*x;
        divisor += 2.0;
        factor *= (2.0*double(i) + 1.0)/((double(i)+1.0)*2.0);
        sum += factor*tempExp/divisor;
    }
    return sum;
}

// Keep the series argument small: for |x| > 0.71 use asin(x) = ±(PI/2 - asin(sqrt(1 - x*x))).
double myASIN(double x)
{
    if(abs(x) <= 0.71)
        return myASIN2(x);
    else if( x > 0)
        return (PI/2.0-myASIN2(sqrt(1.0-(x*x))));
    else //x < 0 or x is NaN
        return (myASIN2(sqrt(1.0-(x*x)))-PI/2.0);
}

double myACOS(double x)
{
    return (PI/2.0 - myASIN(x));
}

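Any comments on what could be done better? For example, one could use a LUT for the values of factor, but in my shader acos() is called only once, so there is no need for it.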