OpenGL ES 3.0: Programming Guide, Second Edition (2014)

Appendix A. GL_HALF_FLOAT

GL_HALF_FLOAT is a vertex and texture data type supported by OpenGL ES 3.0. The GL_HALF_FLOAT data type is used to specify 16-bit floating-point values. This can be useful, for example, in specifying vertex attributes such as texture coordinates, normals, binormals, and tangent vectors. Using GL_HALF_FLOAT rather than GL_FLOAT provides a two times reduction in memory bandwidth required to read vertex or texture data by the GPU.

One might argue that we can use GL_SHORT or GL_UNSIGNED_SHORT instead of a 16-bit floating-point data type and get the same memory footprint and bandwidth savings. However, with that approach, you will need to scale the data or matrices appropriately and apply a transform in the vertex shader. For example, consider the case where a texture pattern is to be repeated four times horizontally and vertically over a quad. GL_SHORT can be used to store the texture coordinates. The texture coordinates could be stored as a value of 4.12 or 8.8. The texture coordinate values stored as GL_SHORT are scaled by (1 << 12) or (1 << 8) to give us a fixed-point representation that uses 4 bits or 8 bits of integer and 12 bits or 8 bits of fraction. Because OpenGL ES does not understand such a format, the vertex shader will then need to apply a matrix to unscale these values, which affects the vertex shading performance. These additional transforms are not required if a 16-bit floating-point format is used. Further, values represented as floating-point numbers have a larger dynamic range than fixed-point values because of the use of an exponent in the representation.

Note

Fixed-point values have a different error metric than floating-point values. The absolute error in a floating-point number is proportional to the magnitude of the value, whereas the absolute error in a fixed-point format is constant. Developers need to be aware of these precision issues when choosing which data type to use when generating coordinates for a particular format.

16-Bit Floating-Point Number

Figure A-1 describes the representation of a half-float number. A half-float is a 16-bit floating-point number with 10 bits of mantissa m, 5 bits of exponent e, and a sign bit s.

Figure A-1 A 16-Bit Floating-Point Number

The following rules should be used when interpreting a 16-bit floating-point number:

• If exponent e is between 1 and 30, the half-float value is computed as (– l)^s * 2^e-15 * (1 + m/1024).

• If exponent e and mantissa m are both 0, the half-float value is 0.0. The sign bit is used to represent –ve 0.0 or +ve 0.0.

• If exponent e is 0 and mantissa m is not 0, the half-float value is a denormalized number.

• If exponent e is 31, the half-float value is either infinity (+ve or –ve) or a NaN (“not a number”) depending on whether the mantissa m is zero.

A few examples follow:

0 00000 0000000000 = 0.0
0 00000 0000001111 = a denorm value
0 11111 0000000000 = positive infinity
1 11111 0000000000 = negative infinity
0 11111 0000011000 = NaN
1 11111 1111111111 = NaN
0 01111 0000000000 = 1.0
1 01110 0000000000 = −0.5
0 10100 1010101010 = 54.375

OpenGL ES 3.0 implementations must be able to accept input half-float data values that are infinity, NaN, or denormalized numbers. They do not have to support 16-bit floating-point arithmetic operations with these values. Most implementations will convert denormalized numbers and NaN values to zero.

Converting a Float to a Half-Float

The following routines describe how to convert a single-precision floating-point number to a half-float value, and vice versa. The conversion routines are useful when vertex attributes are generated using single-precision floating-point calculations but then converted to half-floats before they are used as vertex attributes:

// −15 stored using a single-precision bias of 127
const unsigned int HALF_FLOAT_MIN_BIASED_EXP_AS_SINGLE_FP_EXP =
0x38000000;
// max exponent value in single precision that will be converted
// to Inf orNaN when stored as a half-float
const unsigned int HALF_FLOAT_MAX_BIASED_EXP_AS_SINGLE_FP_EXP =
0x47800000;

// 255 is the max exponent biased value
const unsigned int FLOAT_MAX_BIASED_EXP = (0x1F << 23);

const unsigned int HALF_FLOAT_MAX_BIASED_EXP = (0x1F << 10);

typedef unsigned short hfloat;

hfloat
convertFloatToHFloat(float *f)
{
unsigned int x = *(unsignedint *)f;
unsignedint sign = (unsigned short)(x >> 31);
unsignedint mantissa;
unsignedint exp;
hfloat hf;

// get mantissa
mantissa = x & ((1 << 23) − 1);
// get exponent bits
exp = X & FLOAT_MAX_BIASED_EXP;
if (exp >= HALF_FLOAT_MAX_BIASED_EXP_AS_SINGLE_FP_EXP)
{
// check if the original single-precision float number
// is a NaN
if (mantissa && (exp == FLOAT_MAX_BIASED_EXP))
{
// we have a single-precision NaN
mantissa = (1 << 23) − 1;
}
else
{
// 16-bit half-float representation stores number
// as Inf mantissa = 0;
}
hf = (((hfloat)sign) << 15) |
(hfloat)(HALF_FLOAT_MAX_BIASED_EXP) |
(hfloat)(mantissa >> 13);
}
// check if exponent is <= −15
else if (exp <= HALF_FLOAT_MIN_BIASED_EXP_AS_SINGLE_FP_EXP)
{
// store a denorm half-float value or zero
exp = (HALF_FLOAT_MIN_BIASED_EXP_AS_SINGLE_FP_EXP − exp)
>> 23;
mantissa >>= (14 + exp);

hf = (((hfloat)sign) << 15) | (hfloat)(mantissa);
}
else
{
hf = (((hfloat)sign) << 15) |
(hfloat)
((exp − HALF_FLOAT_MIN_BIASED_EXP_AS_SINGLE_FP_EXP)
>> 13)|
(hfloat)(mantissa >> 13);
}
return hf;
}
float
convertHFloatToFloat(hfloat hf)
{
unsignedint sign = (unsignedint)(hf >> 15);
unsignedint mantissa = (unsignedint)(hf &
((1 << 10) − 1));
unsignedint exp = (unsignedint)(hf &
HALF_FLOAT_MAX_BIASED_EXP);
unsignedint f;

if (exp == HALF_FLOAT_MAX_BIASED_EXP)
{
// we have a half-float NaN or Inf
// half-float NaNs will be converted to a single-
// precision NaN
// half-float Infs will be converted to a single-
// precision Inf
exp = FLOAT_MAX_BIASED_EXP;
if (mantissa)
mantissa = (1 << 23) − 1; // set all bits to
// indicate a NaN
}
else if (exp == 0x0)
{
// convert half-float zero/denorm to single-precision
// value
if (mantissa)
{
mantissa <<= 1;
exp = HALF_FLOAT_MIN_BIASED_EXP_AS_SINGLE_FP_EXP;
// check for leading 1 in denorm mantissa
while ((mantissa & (1 << 10)) == 0)
{
// for every leading 0, decrement single-
// precision exponent by 1
// and shift half-float mantissa value to the
// left mantissa <<= 1;
exp −= (1 << 23);
}
// clamp the mantissa to 10 bits
mantissa &= ((I << 10) − 1);
// shift left to generate single-precision mantissa
// of 23-bits mantissa <<= 13;
}
}
else
{
// shift left to generate single-precision mantissa of
// 23-bits mantissa <<= 13;
// generate single-precision biased exponent value
exp = (exp << 13) +
HALF_FLOAT_MIN_BIASED_EXP_AS_SINGLE_FP_EXP;
}
f = (sign << 31) | exp | mantissa;
return *((float *)&f);
}