MulticoreWare

Optimising CNN Model on Low Power Vision DSP

March 27, 2024

The Client

The customer, an IP company, specializes in vision-based DSPs utilized for Imaging, Computer Vision, and AI applications.

Challenge

The project aimed to run end-to-end inference of the Inception-V3 CNN image classification model on the customer’s Vision DSP.

Solution

The project used a range of tools and techniques, including C/C++, quantization, CNN inference, DSP intrinsics, DMA, and data tiling.

We selected an Inception-V3 floating-point model trained on the ImageNet dataset, with Top-1 and Top-5 accuracy of 74% and 91.62% respectively. We then quantized the float model to the INT8 data type using MulticoreWare’s custom quantization algorithm. Subsequently, we implemented an x86-based reference Inception-V3 pipeline for the INT8 data type.
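The custom quantization algorithm itself is not described in this case study. As a rough illustration of what float-to-INT8 quantization involves, the sketch below uses a generic symmetric per-tensor scheme, which is an assumption and not MulticoreWare’s method. The helper names `compute_scale`, `quantize`, and `dequantize` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: symmetric per-tensor scale so that max|x| maps to 127.
// This is a generic textbook scheme, not the custom algorithm from the case study.
float compute_scale(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

// Map each float to the nearest INT8 step, clamped to [-127, 127].
std::vector<std::int8_t> quantize(const std::vector<float>& x, float scale) {
    assert(scale > 0.0f);
    std::vector<std::int8_t> q(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        float r = std::round(x[i] / scale);
        r = std::min(127.0f, std::max(-127.0f, r));
        q[i] = static_cast<std::int8_t>(r);
    }
    return q;
}

// Recover an approximation of the original floats.
std::vector<float> dequantize(const std::vector<std::int8_t>& q, float scale) {
    std::vector<float> x(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) {
        x[i] = scale * static_cast<float>(q[i]);
    }
    return x;
}
```

With this scheme, the round-trip error for values inside the clamped range is at most half a quantization step (`scale / 2`), which is one reason a well-chosen scale can keep the accuracy loss small.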

Top-5 / Top-1 Classification Accuracy for Float vs. 8-Bit Quantized Graph
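For readers unfamiliar with the metric, Top-k accuracy is the fraction of images whose true label is among the k highest-scoring classes. The sketch below shows one way to compute it; the function names `in_top_k` and `top_k_accuracy` are hypothetical, and ties are counted in favour of the true label, which is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if fewer than k classes score strictly higher than the true label.
// Ties are resolved in favour of the true label (an assumption of this sketch).
bool in_top_k(const std::vector<float>& scores, std::size_t label, std::size_t k) {
    assert(label < scores.size());
    std::size_t higher = 0;
    for (float s : scores) {
        if (s > scores[label]) ++higher;
    }
    return higher < k;
}

// Fraction of samples whose true label is within the top k predictions.
double top_k_accuracy(const std::vector<std::vector<float>>& all_scores,
                      const std::vector<std::size_t>& labels,
                      std::size_t k) {
    assert(all_scores.size() == labels.size() && !labels.empty());
    std::size_t hits = 0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (in_top_k(all_scores[i], labels[i], k)) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(labels.size());
}
```

Running the same function over the float and INT8 outputs for the same validation images gives directly comparable Top-1 (k = 1) and Top-5 (k = 5) figures.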

MulticoreWare hand-optimized various layers/operations in the Inception-V3 model for the Vision DSP, creating an end-to-end intrinsic-based pipeline whose accuracy matches that of the x86-based INT8 pipeline. Because Inception-V3 has many layers and the DSP has limited on-chip data memory, we designed and implemented DMA and data tiling algorithms to manage data transfer from external to on-chip memory efficiently.
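The actual tiling algorithm is not detailed here. As a minimal sketch of the idea, the code below splits a convolution’s output into row tiles so that the INT8 input rows each tile needs fit within an on-chip budget. It assumes an unpadded convolution and budgets only the input tile (a real plan would also account for weights and outputs). The `Tile` struct and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one row tile: output rows produced and
// the input rows that must be brought on-chip to produce them.
struct Tile {
    int out_row_begin;
    int out_row_end;   // exclusive
    int in_row_begin;
    int in_row_end;    // exclusive
};

// Input rows required to produce `out_rows` rows of an unpadded convolution.
int input_rows_needed(int out_rows, int kernel, int stride) {
    return (out_rows - 1) * stride + kernel;
}

// Split the output into row tiles whose INT8 input fits in `budget_bytes`.
// Simplification: only the input tile is budgeted, not weights or outputs.
std::vector<Tile> plan_row_tiles(int out_height, int in_width, int channels,
                                 int kernel, int stride, std::size_t budget_bytes) {
    assert(out_height > 0 && in_width > 0 && channels > 0 && kernel > 0 && stride > 0);
    const std::size_t row_bytes =
        static_cast<std::size_t>(in_width) * static_cast<std::size_t>(channels);
    int rows = 0;  // largest number of output rows per tile that fits the budget
    while (rows < out_height &&
           static_cast<std::size_t>(input_rows_needed(rows + 1, kernel, stride)) * row_bytes
               <= budget_bytes) {
        ++rows;
    }
    assert(rows > 0 && "budget too small for even one output row");
    std::vector<Tile> tiles;
    for (int r = 0; r < out_height; r += rows) {
        const int end = std::min(out_height, r + rows);
        tiles.push_back({r, end, r * stride, (end - 1) * stride + kernel});
    }
    return tiles;
}
```

Adjacent tiles overlap in their input rows whenever the kernel is larger than the stride, so part of the input is fetched more than once. Larger tiles reduce that overhead, which is why the on-chip memory size matters so much for performance.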

Custom Quantization Logic:

MulticoreWare’s solution featured custom quantization logic with minimal loss in Top-1 and Top-5 classification accuracy for the quantized model. We hand-optimized approximately 94 layers of the Inception-V3 model using DSP intrinsics, achieving performance close to the theoretical estimates. Additionally, our team implemented data tiling of inputs, outputs, and weights and built an end-to-end optimized Inception-V3 pipeline that effectively hides DMA data transfer latency.
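A common way to hide DMA latency is double buffering: while the core processes one tile, the next tile is already being transferred into a second buffer. Whether the pipeline described here uses exactly this scheme is not stated, so the sketch below is an assumption. `dma_copy` is a hypothetical stand-in implemented with a synchronous `memcpy`; on a real DSP the transfer would be asynchronous and use the vendor’s DMA API. `process_tile` is a placeholder kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an asynchronous DMA transfer. Here it is a plain
// synchronous copy; real hardware would return at once and signal completion later.
void dma_copy(std::int8_t* dst, const std::int8_t* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Placeholder compute kernel: sums the elements of one tile.
std::int64_t process_tile(const std::int8_t* tile, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += tile[i];
    return sum;
}

// Process `external` tile by tile using two on-chip buffers (ping-pong).
std::int64_t run_double_buffered(const std::vector<std::int8_t>& external,
                                 std::size_t tile_bytes) {
    assert(tile_bytes > 0);
    std::vector<std::int8_t> buf[2] = {std::vector<std::int8_t>(tile_bytes),
                                       std::vector<std::int8_t>(tile_bytes)};
    const std::size_t total = external.size();
    const std::size_t num_tiles = (total + tile_bytes - 1) / tile_bytes;
    auto tile_len = [&](std::size_t t) {
        return std::min(tile_bytes, total - t * tile_bytes);
    };
    std::int64_t acc = 0;
    if (num_tiles == 0) return acc;
    // Prime the pipeline with the first tile.
    dma_copy(buf[0].data(), external.data(), tile_len(0));
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t cur = t % 2;
        if (t + 1 < num_tiles) {
            // Start fetching the next tile into the other buffer.
            dma_copy(buf[1 - cur].data(),
                     external.data() + (t + 1) * tile_bytes, tile_len(t + 1));
        }
        // On real hardware, wait here for the transfer of tile t to complete.
        acc += process_tile(buf[cur].data(), tile_len(t));
    }
    return acc;
}
```

With truly asynchronous transfers, the fetch of tile t + 1 overlaps the computation on tile t, so the transfer time is hidden whenever it is shorter than the compute time per tile.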

CNN Model: Inception-V3 (Pre-Trained with ImageNet Dataset)

Convolutional Neural Network Architecture Details

Number of Convolution layers | 94
Number of Concatenation layers | 11
Number of Pooling layers | 14

Business Impact

MulticoreWare’s efforts resulted in the customer achieving a processing speed of 30 FPS for input images of size 299x299x3 while keeping Top-1 and Top-5 accuracy close to that of the float model. This served as an excellent demonstration for the customer to showcase to their clients.

Memory Modeling - DDR latency [clock cycles] | FPS
100 | 30.42
0 | 31.09

Performance Achieved Based On Memory Modeling Type (With Tiling And DMA)
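To put the two FPS figures in perspective, they can be converted to per-frame time and a relative difference. The sketch below does only that arithmetic on the reported numbers; the helper names are hypothetical.

```cpp
#include <cassert>

// Time per frame in milliseconds for a given frame rate.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Percentage drop in frame rate relative to the ideal (zero-latency) case.
double relative_drop_percent(double fps_ideal, double fps_actual) {
    assert(fps_ideal > 0.0);
    return 100.0 * (fps_ideal - fps_actual) / fps_ideal;
}
```

For the reported values, `frame_time_ms(30.42)` is about 32.87 ms and `frame_time_ms(31.09)` is about 32.16 ms, a difference of roughly 0.7 ms per frame. `relative_drop_percent(31.09, 30.42)` is about 2.2%, so modeling a 100-cycle DDR latency costs only around 2% of throughput, consistent with most of the transfer latency being hidden.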

Conclusion

This case study highlights MulticoreWare’s expertise in quantization and DSP optimization. For a more comprehensive understanding of our solutions and services, please contact us at info@multicorewareinc.com.
