Use BFloat16 Mixed Precision for TensorFlow Keras Inference

Brain Floating Point Format (BFloat16) is a 16-bit floating-point format designed for machine learning. It consists of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Because it has the same number of exponent bits as FP32, BFloat16 covers the same dynamic range while using only half the memory.
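The bit layout described above can be illustrated with a small standard-library sketch. It converts an FP32 value to BFloat16 by simply keeping the upper 16 bits. This truncation is a simplification chosen for clarity (real hardware and libraries typically round to nearest instead), and the helper names are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BFloat16 pattern obtained by truncating an FP32 value."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    # The upper 16 bits hold the sign, all 8 exponent bits, and the top 7 mantissa bits.
    return bits32 >> 16


def bf16_bits_to_fp32(bits16: int) -> float:
    """Expand a BFloat16 bit pattern back to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


def split_fields(bits16: int) -> tuple:
    """Split a BFloat16 bit pattern into (sign, exponent, mantissa)."""
    sign = bits16 >> 15
    exponent = (bits16 >> 7) & 0xFF  # 8 exponent bits, same width as FP32
    mantissa = bits16 & 0x7F         # 7 mantissa bits
    return sign, exponent, mantissa


bits = fp32_to_bf16_bits(3.14159)
print(split_fields(bits))        # sign, biased exponent, mantissa
print(bf16_bits_to_fp32(bits))   # close to 3.14159, with reduced precision
```

Since the exponent field is unchanged, very large or very small FP32 magnitudes survive the conversion. Only the precision of the mantissa is reduced.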

BFloat16 mixed precision combines BFloat16 and FP32 during training and inference, which can increase performance and reduce memory usage. Compared to FP16 mixed precision, BFloat16 mixed precision has better numerical stability.

When running BF16 mixed precision inference on CPU, the model has commonly been pretrained in FP32. With the InferenceOptimizer.quantize(..., precision='bf16') API, you can run BF16 mixed precision inference on an FP32 pretrained model with a few lines of code.

⚠️ Warning

BigDL-Nano enables Intel’s oneDNN optimizations by default. oneDNN BFloat16 is only supported on platforms with the AVX512 instruction set.

On platforms without hardware acceleration for BFloat16, BFloat16 inference performance may be poor.

Let’s take a MobileNetV2 Keras model pretrained on the ImageNet dataset as an example. The model here is in FP32, as its dtype policy shows.

[ ]:

from tensorflow import keras

fp32_model = keras.applications.MobileNetV2(weights="imagenet")

[2]:

print(f"The model's dtype policy is {fp32_model.dtype_policy.name}")

The model's dtype policy is float32

Without Extra Accelerator

To run BF16 mixed precision inference, one approach is to convert the layers (except for the input one) in the Keras model to have mixed_bfloat16 as their dtype policy. To achieve this, simply import BigDL-Nano InferenceOptimizer and quantize your model without an extra accelerator:

[ ]:

from bigdl.nano.tf.keras import InferenceOptimizer

bf16_model = InferenceOptimizer.quantize(fp32_model, precision='bf16')
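For intuition, here is a minimal sketch of what a BFloat16 mixed precision dtype policy means for a single Keras layer. It is independent of BigDL-Nano and does not describe how InferenceOptimizer works internally. Keras names this policy `mixed_bfloat16`; the layer and input shapes below are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow import keras

# Under the Keras "mixed_bfloat16" policy, computations run in bfloat16
# while variables (weights) stay in float32.
policy = keras.mixed_precision.Policy("mixed_bfloat16")
print(f"compute dtype: {policy.compute_dtype}")
print(f"variable dtype: {policy.variable_dtype}")

# A hypothetical layer created with this policy.
layer = keras.layers.Dense(4, dtype=policy)
outputs = layer(tf.ones((2, 8)))  # float32 inputs are cast to the compute dtype

print(f"output dtype: {outputs.dtype}")
print(f"kernel dtype: {layer.kernel.dtype}")
```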

📝 Note

Please note that 'bf16' quantization without an extra accelerator also modifies the fp32_model itself. If you still need the original model after the optimization, load it again or make a copy of it before the optimization.
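One possible way to keep a copy before the optimization is sketched below, using the standard Keras `clone_model` call and copying the weights over. The `copy_keras_model` helper and the tiny stand-in model are hypothetical; whether a clone made this way is sufficient for your own model (for example, one with custom layers) is an assumption you should verify.

```python
import numpy as np
from tensorflow import keras


def copy_keras_model(model):
    """Return an independent copy of a Keras model with identical weights."""
    # clone_model rebuilds the architecture with freshly initialized weights,
    # so the trained weights have to be copied over explicitly.
    clone = keras.models.clone_model(model)
    clone.set_weights(model.get_weights())
    return clone


# A tiny stand-in model; in this guide you would pass fp32_model instead.
tiny_model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(2),
])
backup_model = copy_keras_model(tiny_model)

sample = np.ones((1, 8), dtype="float32")
print(np.allclose(np.asarray(tiny_model(sample)), np.asarray(backup_model(sample))))
```

With such a helper, `fp32_backup = copy_keras_model(fp32_model)` could be called before the quantization, leaving `fp32_backup` available afterwards.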

With Extra Accelerator

You can also run BF16 mixed precision inference with OpenVINO as the accelerator at the same time. To achieve this, simply import BigDL-Nano InferenceOptimizer and quantize your model with accelerator='openvino':

[ ]:

# load the model again if you ran the previous code cell,
# which conducts bf16 quantization without extra accelerator in place
fp32_model = keras.applications.MobileNetV2(weights="imagenet")

from bigdl.nano.tf.keras import InferenceOptimizer

bf16_ov_model = InferenceOptimizer.quantize(fp32_model,
                                            precision='bf16',
                                            accelerator='openvino')

📝 Note

Unlike the 'bf16' quantization without an accelerator, the optimization here is not performed in place.

Please also note that when you have a custom model to quantize (e.g. one inherited from tf.keras.Model), you need to specify the input_spec parameter to let the OpenVINO accelerator know the shape of the model input.

Please refer to the API documentation for more information on InferenceOptimizer.quantize.
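As a rough sketch of the custom-model case, the snippet below defines a hypothetical subclassed model and an input specification for it. Passing a `tf.TensorSpec` as `input_spec` is an assumption here, as are the `TinyClassifier` class and the `quantize_custom_model` helper; check the API documentation for the exact form that `input_spec` expects.

```python
import tensorflow as tf


class TinyClassifier(tf.keras.Model):
    """Hypothetical custom model, used only for illustration."""

    def __init__(self):
        super().__init__()
        self.pool = tf.keras.layers.GlobalAveragePooling2D()
        self.head = tf.keras.layers.Dense(10)

    def call(self, inputs):
        return self.head(self.pool(inputs))


custom_model = TinyClassifier()

# Describe the expected input: any batch size, 224x224 RGB images (assumed shape).
input_spec = tf.TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32)


def quantize_custom_model(model, spec):
    """Quantize a custom model with OpenVINO; requires BigDL-Nano to be installed."""
    # Imported lazily so the rest of the sketch runs without BigDL-Nano.
    from bigdl.nano.tf.keras import InferenceOptimizer

    return InferenceOptimizer.quantize(model,
                                       precision='bf16',
                                       accelerator='openvino',
                                       input_spec=spec)  # assumed to accept a TensorSpec
```

A call such as `quantize_custom_model(custom_model, input_spec)` would then mirror the MobileNetV2 example above.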

After quantizing your model with or without an extra accelerator, you can then run BF16 mixed precision inference as normal:

[5]:

import numpy as np
import time

test_data = np.random.rand(32, 224, 224, 3)

# FP32 inference
st1 = time.time()
for _ in range(100):
    fp32_model(test_data)
print(f'The time for 100 iterations of FP32 inference is: {time.time() - st1} s')

# BF16 mixed precision inference
st2 = time.time()
for _ in range(100):
    bf16_model(test_data)
print(f'The time for 100 iterations of BF16 mixed precision inference is: {time.time() - st2} s')

# BF16 mixed precision inference with OpenVINO
st3 = time.time()
for _ in range(100):
    bf16_ov_model(test_data)
print(f'The time for 100 iterations of BF16 mixed precision inference with OpenVINO is: {time.time() - st3} s')

The time for 100 iterations of FP32 inference is: 20.940536975860596 s
The time for 100 iterations of BF16 mixed precision inference is: 16.39808487892151 s
The time for 100 iterations of BF16 mixed precision inference with OpenVINO is: 5.174031972885132 s
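Timings like the ones above include any one-time setup cost incurred on the first calls. A minimal sketch of a reusable timing helper that runs a few warm-up iterations first is shown below; the `time_inference` name and its default values are illustrative choices, not part of BigDL-Nano.

```python
import time


def time_inference(predict_fn, batch, iterations=100, warmup=5):
    """Return the wall-clock seconds spent on `iterations` calls of predict_fn(batch)."""
    # Warm-up calls are excluded from the measurement so that one-time
    # initialization does not distort the result.
    for _ in range(warmup):
        predict_fn(batch)

    start = time.perf_counter()  # monotonic clock intended for measuring intervals
    for _ in range(iterations):
        predict_fn(batch)
    return time.perf_counter() - start
```

For example, `time_inference(bf16_model, test_data)` and `time_inference(fp32_model, test_data)` could be compared directly, since every Keras model in this guide is callable on a batch.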

📚 Related Readings

How to install BigDL-Nano

How to install BigDL-Nano in Google Colab