

Optimising CNN Model on Low Power Vision DSP

March 27, 2024

The Client

The customer, an IP company, specializes in vision-based DSPs utilized for Imaging, Computer Vision, and AI applications.


The project aimed to execute the end-to-end Inception-V3 CNN image classification ML model inference on the Customer’s Vision DSP.


The project used a range of tools and techniques, including C/C++, quantization, CNN inference, DSP intrinsics, DMA, and data tiling.

We identified an Inception-V3 floating-point model trained on the ImageNet dataset, with Top-1 and Top-5 accuracies of 74% and 91.62% respectively. We then quantized the float model to the INT8 data type using MulticoreWare’s custom quantization algorithm, and implemented an x86-based reference Inception-V3 pipeline for the INT8 data type.
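The custom quantization algorithm itself is not described in this case study. As a point of reference only, the following is a minimal sketch of generic per-tensor affine INT8 quantization, a different and much simpler technique than whatever the custom algorithm does. The names `QuantParams`, `choose_quant_params`, `quantize`, and `dequantize` are hypothetical and chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-tensor quantization parameters (not taken from the case study).
struct QuantParams {
    float scale;              // real-valued size of one integer step
    std::int32_t zero_point;  // integer code that represents real 0.0
};

// Map the observed range [lo, hi], widened to include 0, onto [-128, 127].
QuantParams choose_quant_params(const std::vector<float>& data) {
    float lo = 0.0f, hi = 0.0f;
    for (float v : data) {
        lo = std::min(lo, v);
        hi = std::max(hi, v);
    }
    if (hi == lo) return {1.0f, 0};  // empty or all-zero tensor
    const float scale = (hi - lo) / 255.0f;
    long zp = std::lround(-128.0f - lo / scale);
    zp = std::min(127L, std::max(-128L, zp));
    return {scale, static_cast<std::int32_t>(zp)};
}

// Quantize one value with rounding and saturation to the INT8 range.
std::int8_t quantize(float x, const QuantParams& q) {
    assert(q.scale > 0.0f);
    long v = std::lround(x / q.scale) + q.zero_point;
    v = std::min(127L, std::max(-128L, v));
    return static_cast<std::int8_t>(v);
}

// Recover the approximate real value from an INT8 code.
float dequantize(std::int8_t v, const QuantParams& q) {
    return q.scale * static_cast<float>(static_cast<std::int32_t>(v) - q.zero_point);
}
```

In this scheme real 0.0 maps exactly to the zero point, and values inside the observed range round-trip with an error of about half a step (`scale / 2`). Values outside the range saturate at -128 or 127.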

Top-5 / Top-1 Classification Accuracy for Float vs. 8-Bit Quantized Graph
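For readers unfamiliar with the metrics in this comparison, a prediction counts as a Top-k hit when the true class is among the k highest-scoring classes. The sketch below shows one way to check that for a single image. The function name `in_top_k` is hypothetical, and ties are resolved in favour of the true class, which is an assumption rather than something the case study specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the true label is among the k highest-scoring classes.
// Ties are resolved in favour of the true label (an illustrative assumption).
bool in_top_k(const std::vector<float>& scores, std::size_t label, std::size_t k) {
    assert(label < scores.size());
    std::size_t better = 0;  // classes scoring strictly higher than the true label
    for (float s : scores) {
        if (s > scores[label]) ++better;
    }
    return better < k;
}
```

Top-1 or Top-5 accuracy over a dataset is then the fraction of images for which `in_top_k` returns true with k set to 1 or 5.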

MulticoreWare hand-optimized various layers and operations of the Inception-V3 model for the Vision DSP, creating an end-to-end intrinsics-based pipeline whose accuracy matches that of the x86-based INT8 pipeline. Because Inception-V3 has many layers and the DSP has limited on-chip data memory, we designed and implemented DMA and data tiling algorithms to move data efficiently from external memory to on-chip memory.
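To make the tiling constraint concrete, the sketch below computes how many output rows of an INT8 convolution layer fit in a given on-chip memory budget when the input tile, output tile, and full weight set must all be resident at once. It is an illustration under stated assumptions, not the tiling scheme used in the project: the struct `ConvShape`, the helper names, the row-wise split, and the example memory budget are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one INT8 convolution layer (1 byte per element).
struct ConvShape {
    std::size_t in_w, in_c;           // input width and channels
    std::size_t out_w, out_h, out_c;  // output width, height and channels
    std::size_t kernel, stride;       // square kernel size and stride
};

// Bytes needed on chip to produce `out_rows` output rows in one tile.
std::size_t tile_bytes(const ConvShape& s, std::size_t out_rows) {
    assert(out_rows > 0);
    // Input rows needed include the kernel overlap (halo) between tiles.
    const std::size_t in_rows = (out_rows - 1) * s.stride + s.kernel;
    const std::size_t input = in_rows * s.in_w * s.in_c;
    const std::size_t output = out_rows * s.out_w * s.out_c;
    const std::size_t weights = s.kernel * s.kernel * s.in_c * s.out_c;
    return input + output + weights;
}

// Largest number of output rows per tile that fits in `budget` bytes
// (0 means even a single row does not fit and another split is needed).
std::size_t max_out_rows_per_tile(const ConvShape& s, std::size_t budget) {
    std::size_t best = 0;
    for (std::size_t r = 1; r <= s.out_h; ++r) {
        if (tile_bytes(s, r) > budget) break;
        best = r;
    }
    return best;
}
```

For an illustrative 3x3, stride-1 layer with a 35-wide, 288-channel input and a 33x33x64 output, an assumed 512 KiB budget allows 27 output rows per tile, so the 33-row output would be processed in two tiles. A result of 0 signals that the weights or a single row already exceed the budget, in which case the split would also have to run over channels or width.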

Custom Quantization Logic:

MulticoreWare’s solution featured custom quantization logic with minimal loss in Top-1 and Top-5 classification accuracy for the quantized model. We hand-optimized approximately 94 layers of the Inception-V3 model using DSP intrinsics, with performance closely matching the theoretical estimates. Additionally, our team implemented data tiling of inputs, outputs, and weights and built an end-to-end optimized Inception-V3 pipeline that effectively hides DMA data transfer latency.
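A common way to hide DMA latency is double (ping-pong) buffering: while the DSP computes on one on-chip buffer, the DMA engine fills the other. The sketch below shows only the ordering of operations. It is an assumption-laden illustration, not the project's pipeline: `dma_start_copy` is a hypothetical stand-in that copies synchronously with `memcpy`, whereas a real DMA call would return immediately and be paired with a wait, and the per-tile "compute" is a plain sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an asynchronous DMA transfer from external memory.
// Here it copies immediately; real hardware would overlap it with compute.
void dma_start_copy(const std::int8_t* src, std::int8_t* dst, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Sums all elements of `ext`, processing it tile by tile with two buffers.
std::int64_t sum_tiled_double_buffered(const std::vector<std::int8_t>& ext,
                                       std::size_t tile) {
    assert(tile > 0);
    const std::size_t n_tiles = (ext.size() + tile - 1) / tile;
    if (n_tiles == 0) return 0;

    std::vector<std::int8_t> buf[2] = {std::vector<std::int8_t>(tile),
                                       std::vector<std::int8_t>(tile)};
    // Length of tile t; the last tile may be shorter.
    auto len = [&](std::size_t t) { return std::min(tile, ext.size() - t * tile); };

    // Prefetch the first tile before entering the loop.
    dma_start_copy(ext.data(), buf[0].data(), len(0));

    std::int64_t acc = 0;
    for (std::size_t t = 0; t < n_tiles; ++t) {
        const std::size_t cur = t & 1;
        // Kick off the transfer of the next tile into the other buffer.
        if (t + 1 < n_tiles) {
            dma_start_copy(ext.data() + (t + 1) * tile, buf[cur ^ 1].data(), len(t + 1));
        }
        // On real hardware, a wait for tile t's transfer would go here.
        for (std::size_t i = 0; i < len(t); ++i) acc += buf[cur][i];
    }
    return acc;
}
```

With this ordering, the transfer time is hidden whenever computing on one tile takes at least as long as fetching the next. Otherwise the pipeline is bound by memory bandwidth.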

CNN Model: Inception-V3 (Pre-Trained with ImageNet Dataset)

Convolutional Neural Network Architecture Details
Number of Convolution layers
Number of Concatenation layers
Number of Pooling layers

Business Impact

MulticoreWare’s efforts resulted in the customer achieving a processing speed of 30 FPS for input images sized at 299x299x3 while maintaining Top-1 and Top-5 accuracy levels similar to the float accuracy. This served as an excellent demonstration for the customer to showcase to their clients.
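A 30 FPS target leaves roughly 33.3 ms per frame for the whole pipeline. The sketch below converts between frame rate and cycle budget. The case study does not state the DSP clock frequency, so the function names and any clock value passed in are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycles available for one frame at a target frame rate.
std::uint64_t cycle_budget_per_frame(std::uint64_t clock_hz, std::uint32_t target_fps) {
    assert(target_fps > 0);
    return clock_hz / target_fps;
}

// Frame rate achieved when one end-to-end inference takes `cycles_per_frame` cycles.
double achieved_fps(std::uint64_t clock_hz, std::uint64_t cycles_per_frame) {
    assert(cycles_per_frame > 0);
    return static_cast<double>(clock_hz) / static_cast<double>(cycles_per_frame);
}
```

For example, at an assumed 600 MHz clock, 30 FPS corresponds to a budget of 20 million cycles per frame, covering compute and any DMA latency that is not hidden.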

Table: Performance Achieved Based on Memory Modeling Type (With Tiling and DMA)
Columns: Memory Modeling, DDR latency [clock cycles], FPS


This case study highlights MulticoreWare’s expertise in Quantization and DSPs. For a more comprehensive understanding of our solutions and services, please contact us at
