Optimizing NN Ops and Stitching the Inference Pipeline for AI Accelerator Hardware
October 6, 2022

The client is the leader in memory-efficient computation for Artificial Intelligence workloads, providing ultra-efficient, high-performance AI chips that enable new frontiers in AI applications. By combining the power efficiency of memory-efficient computation with the robustness of digital processing, the customer has developed a groundbreaking chip architecture for neural network inference. The client was looking for a technology partner to build software-accelerated neural network inference layers for their inference chips.
The Project
The customer’s inference chip was new to the market. To support popular neural network architectures and obtain the best inference performance, these models had to be added to the customer’s SDK, and the individual layers of each network had to be optimized for the inference chip. The customer required each neural network inference operator to be hand-optimized to leverage their unique memory-efficient architecture, using SIMD instructions on the ALUs and MIMD parallelism across the chip.
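At the operator level, that optimization pattern can be illustrated with a generic, hedged sketch: MIMD across the chip is modeled here as independent work partitions per core, and SIMD on each ALU as a branch-friendly, unit-stride inner loop. The example below uses plain C with OpenMP purely as a stand-in; the customer’s actual intrinsics, memory hierarchy, and SDK are proprietary and differ from what is shown.

```c
#include <stddef.h>
#include <omp.h>

/* Illustrative only: a ReLU-style elementwise operator.
 * MIMD across the chip is modeled as one OpenMP thread per core, each
 * owning a contiguous slice of the tensor; SIMD on each ALU is modeled
 * as a simple unit-stride inner loop the compiler can vectorize.
 * The real kernels used the customer's proprietary instruction set. */
void relu_forward(const float *restrict in, float *restrict out, size_t n)
{
    #pragma omp parallel
    {
        const int core  = omp_get_thread_num();
        const int cores = omp_get_num_threads();
        const size_t chunk = (n + cores - 1) / cores;
        const size_t start = (size_t)core * chunk;
        const size_t end   = (start + chunk < n) ? start + chunk : n;

        /* Vector-friendly loop: unit stride and no data-dependent
         * control flow, so it maps cleanly onto wide SIMD lanes. */
        #pragma omp simd
        for (size_t i = start; i < end; ++i)
            out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}
```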
Challenges
The customer’s architecture is unique in the market and, at the same time, quite complex. Their development ecosystem, including the compiler and debugger, was still under development, and the SDK was expected to be updated continually during project execution.
The customer wanted to build a team of 10+ engineers, each with knowledge and experience in microarchitecture-aware kernel optimization and able to start contributing to the project within a short period.
The MulticoreWare Advantage & Approach
MCW formed a team of 10+ engineers in a short time. The team brought substantial experience from similar projects in which MCW had optimized computer vision and machine learning training and inference algorithms for various DSP, GPU, and NPU platforms.
The customer gave MCW a set of NN architectures built for object detection and segmentation problems, along with a target FPS for each, whose layers needed to be optimized for the customer’s inference chip. MCW therefore had to hand-optimize the individual layers of these architectures, stitch the pipeline together, and ensure that each complete network ran at its target FPS on the hardware.
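Stitching the pipeline essentially means chaining the hand-optimized layer kernels so that each layer’s output buffer becomes the next layer’s input, with intermediate tensors placed in on-chip data memory wherever they fit. The sketch below shows one possible shape of that glue code; the `layer_t` descriptor and the ping-pong scratch buffers are hypothetical placeholders, not the customer’s SDK.

```c
#include <stddef.h>

/* Hypothetical layer descriptor: each hand-optimized kernel is exposed
 * through a uniform function pointer so the pipeline can be chained. */
typedef struct {
    const char *name;
    void (*run)(const float *in, float *out, void *params);
    void *params;
    size_t out_elems;   /* number of elements the layer produces */
} layer_t;

/* Run the whole network by ping-ponging between two scratch buffers,
 * so each layer reads the previous layer's output directly.  In the
 * real design such buffers would live in on-chip data memory. */
const float *run_network(const layer_t *layers, size_t n_layers,
                         const float *input,
                         float *scratch_a, float *scratch_b)
{
    const float *cur_in = input;
    float *bufs[2] = { scratch_a, scratch_b };

    for (size_t i = 0; i < n_layers; ++i) {
        float *cur_out = bufs[i & 1];
        layers[i].run(cur_in, cur_out, layers[i].params);
        cur_in = cur_out;   /* the next layer consumes this output */
    }
    return cur_in;          /* output of the final layer */
}
```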
Execution flow from the start to the end of the project:
- MCW drew up a detailed design plan for how the end-to-end NN model would be architected on the device, including inter-layer communication and data placement, and got it approved by the customer.
- Ops/layers in the network were split across team members, each individually optimizing different layers of the network with a focus on functionality and performance. Each layer was carefully designed and optimized to fully utilize the device’s compute elements and data memory. The functional correctness of each layer was tested against TensorFlow layer outputs (see the sketch after this list).
- After the individual layers were optimized, the end-to-end model was assembled and tested with random and real-time data to verify its functional correctness on the simulator platform.
- Once a model passed all tests in the simulator environment, MCW ran it on the actual inference device and measured its performance.
- The MCW team optimized the layers of each network in such a way that every network met or exceeded the target FPS set by the customer.
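A hedged sketch of the kind of harness used at this stage is shown below: an element-wise tolerance check of the optimized output against a reference tensor (per-layer references came from TensorFlow in the project; how the reference is produced is left abstract here), plus a simple loop that times repeated end-to-end inferences to report FPS. The function names and the `run_frame` callback are illustrative assumptions, not part of the customer’s tooling.

```c
#include <math.h>
#include <stddef.h>
#include <time.h>

/* Element-wise tolerance check of an optimized output tensor against a
 * reference tensor.  In the project, per-layer references were TensorFlow
 * layer outputs; the source of `ref` is left abstract in this sketch. */
static int outputs_match(const float *got, const float *ref,
                         size_t n, float tol)
{
    for (size_t i = 0; i < n; ++i)
        if (fabsf(got[i] - ref[i]) > tol)
            return 0;
    return 1;
}

/* Time repeated end-to-end inferences of an opaque `run_frame` callback
 * (standing in for one full pass through the stitched network) and
 * report frames per second. */
static double measure_fps(void (*run_frame)(void *ctx), void *ctx,
                          int iterations)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; ++i)
        run_frame(ctx);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return iterations / secs;
}
```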
Outcome
MCW designed the end-to-end neural network inference with minimal help from the customer and achieved the target FPS. The networks ran successfully on both the simulator platform and the device, with functional correctness and the desired performance. The customer was pleased with both the result and the execution model, in which MCW delivered largely independently and minimized the overhead on the customer’s critical engineers.