Know about Quantization in TensorFlow

Akshith Kumar · Published in Analytics Vidhya · Sep 15, 2021

A detailed look at the miracles of quantization

Motivation

Whenever I work on a deep learning project, I train a model and save it for production, and the saved model often takes up a huge amount of memory. I started researching how to shrink a saved model, and that is when I came across the term "quantization". In this post, I'd like to explain quantization, the theory behind it, and walk through some code samples.

Photo by Ussama Azam on Unsplash

There are two forms of quantization:

  1. Post-training quantization
  2. Quantization aware training

Start with post-training quantization since it’s easier to use, though quantization aware training is often better for model accuracy.


Quantization reduces a model's weights from floating-point values to integers, which shrinks the saved model and frees up memory.
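
To see why this saves space, here is a minimal sketch (plain NumPy, independent of TensorFlow) comparing the storage cost of the same weights held as 32-bit floats versus 8-bit integers; the linear scaling used here is only a toy stand-in for what real converters do per tensor or per channel:

import numpy as np

# One million weights stored as 32-bit floats.
float_weights = np.random.randn(1_000_000).astype(np.float32)

# Toy linear quantization to int8 (real converters choose scales more carefully).
scale = np.abs(float_weights).max() / 127.0
int_weights = np.round(float_weights / scale).astype(np.int8)

print(float_weights.nbytes / 1e6, "MB as float32")  # ~4.0 MB
print(int_weights.nbytes / 1e6, "MB as int8")       # ~1.0 MB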

Overview of Post-training quantization

Post-training quantization includes general techniques to reduce CPU and hardware accelerator latency, processing, and model size with little degradation in model accuracy. These techniques can be performed on an already-trained float TensorFlow model and applied during TensorFlow Lite conversion. These techniques are enabled as options in the TensorFlow Lite converter.

Weights can be converted to types with reduced precision, such as 16-bit floats or 8-bit integers. 16-bit floats are generally recommended for GPU acceleration, and 8-bit integers for CPU execution.
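
Here is a minimal sketch of what this looks like with the TensorFlow Lite converter; "saved_model_dir" and the output filename are placeholders for your own trained model:

import tensorflow as tf

# Dynamic-range quantization: weights are stored as 8-bit integers (CPU-friendly).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_int8_weights = converter.convert()

# Float16 quantization: weights are stored as 16-bit floats (GPU-friendly).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_weights = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_int8_weights)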


With a few lines of code like the sketch above, we can reduce the size of the saved model and get it ready for production. However, the accuracy may drop slightly compared with the originally trained model when post-training quantization is used. In that case, use quantization aware training.


Overview of Quantization aware training

Quantization aware training emulates inference-time quantization, creating a model that downstream tools then use to produce actually quantized models. The quantized models use lower precision (e.g. 8-bit integers instead of 32-bit floats), which brings benefits during deployment.

Quantization brings improvements via model compression and latency reduction. With the API defaults, the model size shrinks by 4x, and we typically see between 1.5–4x improvements in CPU latency in the tested backends.


To apply quantization aware training, we first wrap the trained TF model with the quantization API. We then compile the quantization-aware model and fit it for a few epochs to fine-tune it. Finally, we run the fine-tuned model through the TFLite converter to extract a TFLite model that keeps good accuracy when deployed, as in the sketch below.
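
A minimal sketch of that workflow using the TensorFlow Model Optimization toolkit; the MNIST model, data, and epoch counts here are placeholders for your own setup:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model and data; substitute your own trained model and dataset.
(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
train_images = train_images / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=1)

# Wrap the trained model so that training emulates inference-time quantization.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Recompile and fine-tune the quantization-aware model for a few epochs.
q_aware_model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=["accuracy"])
q_aware_model.fit(train_images, train_labels, epochs=1, validation_split=0.1)

# Convert the fine-tuned model into an actually quantized TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

with open("qat_model.tflite", "wb") as f:
    f.write(quantized_tflite_model)

Comparing the size of the resulting .tflite file with the original saved model (for example with os.path.getsize) is a quick way to confirm the compression.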


Conclusion

Through quantization, we can reduce the size of a saved TF model while keeping good accuracy for production. Moreover, it is easy to implement, lightweight, and, most importantly, saves a lot of memory when building large neural networks.

Hope you liked it. Thanks for reading!
