
Deploying Vision Language Action (VLA) based AI Models in Robotics: Optimization for Real-Time Edge Inference

July 1, 2025

Author: Selventhiran Rengaraj is an Associate Technical Project Manager in the Mobility & Transportation Business Unit at MulticoreWare. He has hands-on experience developing robotics stacks for ground and underwater robots and works on cutting-edge AI and ADAS perception-stack optimization on leading automotive semiconductor platforms.

Introduction: VLA Models in Robotics – The Shift to Multi-Modality

The robotics industry is in the midst of a major paradigm shift, driven by the emergence of foundation models: large-scale, multi-modal AI systems trained to understand vision, language, and action in a unified framework. Models like Google’s RT-2, which translates web-scale vision-language knowledge into robotic actions, and PaLM-E, an embodied language model capable of reasoning across diverse sensor inputs and tasks, are setting new benchmarks for generalization and task versatility.

However, these models come with a significant trade-off: their size and compute demands make them impractical for real-time deployment on resource-constrained edge platforms and cost-constrained robots. This opens an opportunity for models like CogACT, which strike a balance between multi-modal reasoning and architectural efficiency. In this blog, we dive into the CogACT architecture and the edge optimization of such VLA-style robot models.

CogACT: A General-Purpose Robotic Intelligence Stack

CogACT is a next-generation, large-scale multi-modal model designed to power general-purpose robotic autonomy. It is built around three core modules, Vision, Language, and Action, that work together to perceive, reason, and act in real-world environments. With a total of ~7.6 billion parameters, CogACT brings the scale and generalization of foundation models into robotics without compromising on task-level execution.

How It Works

Vision Module

Built on high-capacity transformers like DINOv2 and SigLIP, this module processes raw images into perceptual tokens. Trained on large-scale datasets, it captures both spatial layouts and object-level semantics with high fidelity.
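To make the tokenization step concrete, here is a minimal sketch using the publicly released DINOv2 checkpoint via torch.hub. The model variant, input size, and dummy preprocessing are illustrative assumptions; CogACT's production vision stack (which also incorporates SigLIP) differs in detail.

```python
# Minimal sketch: turning an RGB image into perceptual tokens with DINOv2.
# Assumes the public facebookresearch/dinov2 torch.hub entry point; CogACT's
# actual vision stack and preprocessing may differ.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Dummy image batch: 1 x 3 x 224 x 224, values already normalized.
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    feats = model.forward_features(image)

# Patch tokens serve as the "perceptual tokens" handed to the language module:
# shape (1, 256, 384) for a 224x224 input with 14x14 patches (ViT-S/14).
patch_tokens = feats["x_norm_patchtokens"]
print(patch_tokens.shape)
```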

Language Module

Powered by the LLaMA-2 large language model (LLM), this module blends visual context with language instructions to understand goals, reason through intent, and ground actions in the environment. It also enables flexible task execution by adapting to diverse natural language prompts, from simple object manipulation to more complex sequential tasks.
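A hedged sketch of how such fusion is commonly wired up: vision tokens are linearly projected into the LLM's embedding space and prepended to the embedded instruction tokens. The dimensions and the single Linear projector here are illustrative placeholders, not CogACT's exact interface.

```python
# Illustrative sketch of vision-language fusion: project perceptual tokens
# into the LLM's embedding space and prepend them to the instruction tokens.
# All dimensions and the linear projector are assumptions for illustration.
import torch
import torch.nn as nn

VISION_DIM, LLM_DIM = 384, 4096          # ViT-S/14 tokens -> LLaMA-2 7B width
projector = nn.Linear(VISION_DIM, LLM_DIM)

patch_tokens = torch.randn(1, 256, VISION_DIM)   # from the vision module
text_embeds = torch.randn(1, 32, LLM_DIM)        # embedded instruction tokens

# One multimodal sequence: [vision tokens | instruction tokens].
vision_embeds = projector(patch_tokens)
fused = torch.cat([vision_embeds, text_embeds], dim=1)   # (1, 288, 4096)

# The fused sequence is then processed by the LLM's transformer layers, whose
# final hidden state acts as the "cognition feature" for the action module.
print(fused.shape)
```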

Action Module

To generate smooth, multi-step actions, CogACT uses a Diffusion Transformer. It translates cognitive features into temporally consistent motion commands, enabling complex, real-world tasks like grasping, placing, or navigating.
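The following toy sketch shows the general shape of diffusion-based action decoding: starting from Gaussian noise, a conditional denoiser is iterated to produce a short, temporally coherent action chunk. The tiny MLP denoiser, step count, and update rule are stand-ins; CogACT's actual Diffusion Transformer and noise schedule are more sophisticated.

```python
# Toy sketch of diffusion-based action decoding: from Gaussian noise, apply a
# conditional denoiser for T steps to produce a short action sequence (here,
# 16 steps of 7-DoF arm commands). Everything below is a placeholder.
import torch
import torch.nn as nn

HORIZON, ACTION_DIM, COND_DIM, T = 16, 7, 4096, 50

class Denoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM + COND_DIM + 1, 256),
            nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, actions, cond, t):
        # actions: (B, HORIZON, ACTION_DIM); cond: (B, COND_DIM); t: step index
        b, h, _ = actions.shape
        c = cond.unsqueeze(1).expand(b, h, -1)
        ts = torch.full((b, h, 1), float(t) / T)
        return self.net(torch.cat([actions, c, ts], dim=-1))

denoiser = Denoiser()
cond = torch.randn(1, COND_DIM)                  # cognition feature from the LLM
actions = torch.randn(1, HORIZON, ACTION_DIM)    # start from pure noise

with torch.no_grad():
    for t in reversed(range(T)):
        # Each pass predicts (and removes) a fraction of the remaining noise.
        actions = actions - denoiser(actions, cond, t) / T

print(actions.shape)   # (1, 16, 7): one temporally consistent action chunk
```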

A Real-World Example

  • Give CogACT an image of a cluttered tabletop and the instruction: “Move the Pepsi can near the orange”.
  • It will recognize the objects, reason through the instruction, plan a collision-free path, and output a sequence of actions for a robotic arm to physically move the can next to the orange.
Figure: An overview of the CogACT model
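Putting the three modules together, a schematic end-to-end step looks like the sketch below. The stub functions stand in for the module sketches above and are not the actual CogACT API.

```python
import torch

# Stubs standing in for the three module sketches above; placeholder shapes only.
def encode_image(image):            return torch.randn(1, 256, 384)
def reason(tokens, instruction):    return torch.randn(1, 4096)
def denoise_actions(cognition):     return torch.randn(1, 16, 7)

def cogact_step(image, instruction):
    tokens = encode_image(image)             # Vision: perceptual tokens
    cognition = reason(tokens, instruction)  # Language: grounded reasoning
    return denoise_actions(cognition)        # Action: diffusion decoding

actions = cogact_step(torch.randn(1, 3, 224, 224),
                      "Move the Pepsi can near the orange")
print(actions.shape)   # one 16-step chunk of 7-DoF arm commands
```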

CogACT has already been used in robotics scenarios like mobile manipulation, indoor navigation, warehouse automation, and multi-agent collaboration: tasks that require high-level understanding and fine-grained action control. Its architecture enables robots to act with context, intent, and temporal coherence.

But with this level of intelligence comes significant computational overhead, making edge deployment a real challenge. That’s where our work at MulticoreWare comes in: transforming powerful but heavy models like CogACT into edge-ready systems without compromising their core capabilities.

Our Approach

CogACT, with its 7.6B parameters and multi-stream architecture, presented a unique challenge. Using a combination of optimization techniques, including quantization, pruning, and model graph tuning, we significantly reduced its inference time. As a result, we achieved 1.3× faster performance, translating to around a 26% reduction in latency, all while preserving the model’s original accuracy and behavior.
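As a hedged illustration of two of these techniques applied to a stand-in network, the sketch below combines PyTorch's built-in L1 unstructured pruning with dynamic INT8 quantization of Linear layers. The pruning ratio and toy model are assumptions; the production pipeline for CogACT is layer- and hardware-specific.

```python
# Illustration only: pruning + dynamic INT8 quantization on a toy model.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 7))

# 1) Prune 30% of the smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # bake the mask into the weights

# 2) Quantize the remaining Linear weights to INT8 (dynamic quantization).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(quantized(x).shape)   # same interface, smaller and faster kernels
```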

Figures: Results of the original CogACT model vs. the MulticoreWare-optimized CogACT model (1.3× faster)

We successfully deployed the optimized model on real-world edge platforms, proving that even foundation-scale robotics models like CogACT can be made efficient and practical for on-device execution.
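One common path to on-device execution is exporting the optimized network to ONNX so it can run under an embedded runtime such as ONNX Runtime. The sketch below shows that export step for a stand-in module; the file name and shapes are illustrative, and deploying the full CogACT stack additionally involves per-target compilation and tuning.

```python
# Minimal sketch of an edge-deployment path: export a (stand-in) PyTorch
# module to ONNX for execution under an embedded runtime.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 512), nn.ReLU(), nn.Linear(512, 7))
model.eval()

dummy = torch.randn(1, 4096)
torch.onnx.export(
    model, dummy, "action_head.onnx",          # illustrative file name
    input_names=["cognition"], output_names=["action"],
    dynamic_axes={"cognition": {0: "batch"}, "action": {0: "batch"}},
)
```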

Applications of VLA Models: Why Optimization Matters

VLA models like CogACT are ushering in a new era of robotic intelligence by enabling machines to understand and act upon complex, high-level instructions. Their potential applications span a wide array of real-world domains:

Warehouse Automation

Robots can understand flexible commands like “Stack all the red boxes near the loading bay,” and figure out object types, spatial relationships, and task sequences on the fly.

Healthcare Robotics

In hospitals or elder care settings, robots powered by VLA models can safely follow spoken instructions, navigate through crowded spaces, and assist with simple fetch-and-carry tasks.

Household Assistance

Whether it’s tidying up or following multi-step instructions like “Put the dishes in the sink and wipe the counter,” VLA-based robots make it easier for humans to interact with machines in a natural way.

Multi-Agent Collaboration

In environments where several robots need to work together, such as coordinated drone fleets or warehouse bots, a shared understanding of language and vision helps improve coordination, efficiency, and safety.

But while these models promise general-purpose autonomy, deploying them in the field, especially on low-power, mobile, or real-time systems, requires overcoming steep computational challenges. That’s why optimization is not just beneficial but essential. Edge optimization delivers:

  • Fast, real-time responses to dynamic environments (see the timing sketch after this list).
  • Energy efficiency for mobile or battery-powered robots.
  • Compliance with strict memory and compute limits on embedded platforms.
  • Reliable performance for safety-critical tasks.
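As a rough illustration of how the first requirement is checked in practice, here is a small timing harness that measures mean inference latency against an assumed control-loop budget. The 100 ms budget and the stand-in model are placeholders.

```python
# Timing harness sketch: measure mean latency against a control-loop budget.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 512), nn.ReLU(), nn.Linear(512, 7))
model.eval()
x = torch.randn(1, 4096)

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = (time.perf_counter() - start) / runs * 1e3

BUDGET_MS = 100.0                       # assumed: a 10 Hz control loop
print(f"mean latency: {latency_ms:.2f} ms "
      f"({'within' if latency_ms <= BUDGET_MS else 'over'} budget)")
```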

By optimizing VLA models like CogACT, we bridge the gap between foundational intelligence and deployable autonomy, bringing sophisticated reasoning to practical robotics applications, from warehouses to wheels to underwater exploration.

Our Expertise in AI-Powered Edge Solutions

  • Proficiency Across 150+ SOTA AI Models: Custom optimization for CPUs, GPUs, DSPs, NPUs, and low-power edge AI SoCs across modalities.
  • Edge-First BEV (Bird’s Eye View) Algorithms for Diverse Mobility Systems: Tailored BEV pipelines for micro-mobility, two-wheeled, and four-wheeled platforms, as well as AMR/AGV-based mobile robots.
  • End-to-End Robotics Perception Stack Development: Experience building modular perception systems including object detection, depth estimation, semantic segmentation, and sensor fusion tailored for robotics use cases.
  • Expertise in BEV & Vision Transformers: Optimized models such as BEVFormer, BEV-SegFormer, and Lift-Splat-Shoot (LSS) for automotive AI accelerators.
  • Advanced Quantization of Fusion Models: INT8 quantization of camera + LiDAR models like DeepFusion, BEVDet, and DeepInteraction without sacrificing accuracy.
  • SLAM, Mapping & Navigation Algorithms: In-house expertise in visual-inertial SLAM, 3D mapping, and real-time navigation for autonomous robotic systems in GPS-denied and dynamic environments.

Conclusion: From Intelligence to On-Device Autonomy

Our method is optimized for hardware efficiency, prioritizes low latency, and emphasizes high accuracy, designed to enable real-world deployment in industries such as autonomous vehicles, warehouse robotics, last-mile delivery, smart infrastructure, and more. At MulticoreWare, we leverage our specialized expertise to enhance and accelerate AI solutions tailored to the specific demands of your unique use cases. To learn more about how we are building efficient AI solutions, write to us at info@multicorewareinc.com.
