LLVM Expertise

MulticoreWare has leading expertise in the LLVM compiler infrastructure.  Our LLVM team is located near the University of Illinois at Urbana-Champaign, where the LLVM project began in 2000.  Our researchers work with professors and graduate students at the University of Illinois to develop and maintain LLVM back-ends, front-ends, and other infrastructure projects.

Sample Projects

HCC: C++ AMP for Linux

The Heterogeneous Compute Compiler (HCC) was originally developed by MulticoreWare, AMD, and Microsoft as a C++ AMP-to-OpenCL compiler.  Built on LLVM, it is the first compiler to enable C++ AMP on Linux and OS X across a range of hardware accelerators such as GPUs and DSPs.

Custom MC Assembler

Custom machine code assemblers allow LLVM to directly target unique and custom hardware architectures.  MulticoreWare works with vendors to design custom assemblers that enable LLVM support on their unique architectures.


MulticoreWare’s MxPA technology provides an OpenCL 1.2 stack based on Clang and LLVM.  This allows a higher-level language frontend to take advantage of hardware acceleration across various GPU, mobile, and server architectures.


A back-end with target-specific optimizations is necessary to extract performance from your code.  MxPA technology consolidates our expertise in loop transformations, vectorization, and novel memory optimizations for architectures with unique memory hierarchies, such as DSPs.

Heterogeneous Compute Compiler (HCC)

The Heterogeneous Compute Compiler (HCC) is an open-source project developed and maintained by MulticoreWare for the HSA Foundation.  HCC relies heavily on the LLVM compiler infrastructure and implements a front-end that accepts C++ AMP, C++17 Parallel STL, and OpenMP.  HCC outputs HSAIL for HSA-enabled devices, or a SPIR binary and OpenCL-C for OpenCL-enabled devices.

  • HCC is an integral part of AMD's GPUOpen and Boltzmann initiatives.
  • HCC supports C++ AMP and its HSA extensions.
  • HCC supports OpenMP 3.1 and 4.0 on the CPU. Accelerator offloading for OpenMP 4.x is coming soon.
  • HCC can output HSAIL, OpenCL-C, or a SPIR binary.

HSA Foundation

Heterogeneous System Architecture

The HSA Foundation is a consortium of companies that are building a heterogeneous compute software ecosystem based on open-source, royalty-free standards.  HSA hardware provides a shared memory space that, together with HSA-enabled software, can simplify heterogeneous programming and eliminate much of the overhead of using multiple architectures simultaneously.


MulticoreWare has been a supporting member of the HSA Foundation since 2012.  We developed the HSAIL (HSA Intermediate Language) simulator, which lets consortium members test their HSA implementations, and the Heterogeneous Compute Compiler (HCC), which compiles C++ AMP, C++17 Parallel STL, and OpenMP to HSAIL.

Tools & Technologies

MulticoreWare has a large portfolio of compiler tools and technologies to profile and improve performance, enable new functionality, and provide custom solutions for any hardware platform.

Multicore Cross Platform Architecture (MxPA)

MxPA (Multicore Cross Platform Architecture) is a licensable base of IP to enable the implementation of accelerator languages on new platforms.  Designed with LLVM, MxPA provides the infrastructure necessary to support OpenCL, CUDA, Renderscript, or any custom accelerator language on a target platform.
MxPA can also provide a functional OpenCL backend if no OpenCL driver is available on a platform, to enable end-to-end OpenCL support.  MulticoreWare can integrate the entirety of MxPA or partial features into your platform’s compiler and runtime.
Key Features
  • LLVM Front-end support for OpenCL, Renderscript, CUDA, and more.
  • Provides complete OpenCL kernel translation and runtime.
  • Automatically adjusts thread parallelism granularity for a target platform.
  • Automatically performs DMA transfers and kernel fusion.
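
To give a feel for what adjusting thread granularity means, here is a minimal Python sketch, not MxPA code. It assumes a kernel written as one function call per work-item and a target with a fixed number of hardware threads; the names `coarsen` and `run_coarsened` are hypothetical. Each hardware thread is handed one contiguous range of work-item IDs instead of a single ID.

```python
from typing import Callable, List


def coarsen(global_size: int, hw_threads: int) -> List[range]:
    """Split work-item IDs [0, global_size) into at most hw_threads contiguous ranges."""
    chunks = min(global_size, hw_threads)
    if chunks <= 0:
        return []
    base, extra = divmod(global_size, chunks)
    ranges = []
    start = 0
    for i in range(chunks):
        # The first `extra` chunks take one additional work-item each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def run_coarsened(kernel: Callable[[int], int], global_size: int, hw_threads: int) -> List[int]:
    """Run `kernel` once per work-item, grouped by chunk (serially here, for illustration)."""
    out = [0] * global_size
    for chunk in coarsen(global_size, hw_threads):  # one chunk per hardware thread
        for gid in chunk:
            out[gid] = kernel(gid)
    return out


# Ten work-items mapped onto a hypothetical target with four hardware threads.
chunks = [list(r) for r in coarsen(10, 4)]
squares = run_coarsened(lambda gid: gid * gid, 5, 2)
```

The chunks run one after another in this sketch; on real hardware each chunk would be executed by its own thread.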

Parallel Path Analyzer (PPA)

PPA (Parallel Path Analyzer) is a performance profiling tool developed and licensed by MulticoreWare.  PPA allows a user to view timelines and gather metrics on specialized accelerators, and can be customized to support your hardware.  Identify critical paths and bottlenecks in your application, analyze the use of system resources, visualize call graphs among multiple processor and accelerator cores, and measure where application time is spent with a global clock.
PPA plugs into Microsoft Visual Studio, or works as a standalone tool.  A custom version of PPA is included by AMD in their APP SDK.
Key Features
  • Profile heterogeneous processors such as GPUs, APUs, and DSPs.
  • Native support for profiling OpenCL applications.
  • C/C++/Java APIs for custom instrumentation of your code.

Task Manager - Load Balancing

Splitting work fairly between multiple, heterogeneous cores is difficult in scenarios where cores may exhibit vastly different performance characteristics and may be throttled to meet thermal or power limits.  MulticoreWare developed Task Manager (TM) as a set of APIs to enable programmers to design task-based applications and allow a runtime to handle scheduling and load balancing automatically.  TM has been integrated into the AMD APP SDK and can target multiple platforms, including CPUs and OpenCL-capable devices such as AMD, Intel, and Nvidia GPUs.
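
The idea of letting a runtime balance tasks across unequal cores can be illustrated with a short Python simulation. This is a sketch of generic pull-based scheduling, not the TM API: the function name `schedule`, the device names, and the relative speeds are all assumptions made for the example. Each task goes to whichever device becomes idle first, so faster devices naturally take more tasks.

```python
import heapq
from typing import Dict, List


def schedule(task_costs: List[float], device_speeds: Dict[str, float]) -> Dict[str, List[int]]:
    """Simulate workers pulling tasks from a shared queue as soon as they are idle."""
    # Heap entries are (time at which the device becomes idle, device name).
    idle_at = [(0.0, name) for name in sorted(device_speeds)]
    heapq.heapify(idle_at)
    assignment: Dict[str, List[int]] = {name: [] for name in device_speeds}
    for task_id, cost in enumerate(task_costs):
        now, name = heapq.heappop(idle_at)  # device that is free earliest
        assignment[name].append(task_id)
        # A faster device finishes the same task sooner.
        heapq.heappush(idle_at, (now + cost / device_speeds[name], name))
    return assignment


# Six equal tasks; the hypothetical "gpu" runs twice as fast as the "cpu".
plan = schedule([4.0] * 6, {"cpu": 1.0, "gpu": 2.0})
```

In this run the faster device ends up with four of the six tasks. If a device were throttled mid-run, its lower effective speed would shift later tasks to the other device without any change to the application code.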

GMAC - Data Coherence

Accelerators often have their own local memory, to which data must be copied before processing can occur.  When several accelerators are in use, input and output data must be managed for each of them.  GMAC (Global Memory for Accelerators) automates data movement between disparate memory spaces and presents the programmer with a single global memory space.  GMAC has been adapted to work with OpenCL and CUDA devices through a C/C++ API.  A custom version of GMAC is integrated in the AMD APP SDK.
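
The following Python sketch shows the general idea of hiding data movement behind a single logical buffer. It is a toy model under simplifying assumptions, not the GMAC API: the class `SharedBuffer` and its methods are hypothetical, the "device" is just a second Python list, and only one copy is treated as valid at a time.

```python
from typing import Callable, List


class SharedBuffer:
    """One logical buffer backed by a host copy and a stand-in device copy."""

    def __init__(self, data: List[float]) -> None:
        self._host = list(data)
        self._device: List[float] = []  # stands in for accelerator-local memory
        self._valid = "host"            # which copy currently holds the latest data
        self.transfers = 0              # number of copies actually performed

    def _sync_to(self, side: str) -> None:
        """Copy data to `side` only if that side is stale."""
        if self._valid == side:
            return
        if side == "device":
            self._device = list(self._host)
        else:
            self._host = list(self._device)
        self._valid = side
        self.transfers += 1

    def write_host(self, index: int, value: float) -> None:
        self._sync_to("host")
        self._host[index] = value

    def read_host(self) -> List[float]:
        self._sync_to("host")
        return list(self._host)

    def run_kernel(self, fn: Callable[[float], float]) -> None:
        """Apply `fn` to every element on the device side."""
        self._sync_to("device")
        self._device = [fn(v) for v in self._device]


buf = SharedBuffer([1.0, 2.0, 3.0])
buf.run_kernel(lambda v: v * 2)  # triggers one host-to-device copy
buf.run_kernel(lambda v: v + 1)  # data is already on the device, no copy
result = buf.read_host()         # triggers one device-to-host copy
```

The caller never issues an explicit copy; the buffer tracks which side is current and moves data only when the other side is touched.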

Data Layout

Different cores on the same system often expect different memory layouts, and the programmer must translate data from one layout on the CPU to another on the GPU to maximize performance.  MulticoreWare’s Data Layout engine (DL) automates data layout conversion via OpenCL source-to-source transforms.  DL may be linked into your application or used as a library with OpenCL, CUDA, or PyOpenCL.
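
A common instance of such a conversion is going from an array of structures (often convenient on the CPU) to a structure of arrays (often preferred on GPUs for contiguous access). The Python sketch below illustrates that transformation only; it is not DL's interface, and the function names `aos_to_soa` and `soa_to_aos` are hypothetical.

```python
from typing import Dict, List, Sequence


def aos_to_soa(records: Sequence[Dict[str, float]], fields: Sequence[str]) -> Dict[str, List[float]]:
    """Turn a list of records into one contiguous list per field."""
    return {name: [rec[name] for rec in records] for name in fields}


def soa_to_aos(columns: Dict[str, List[float]], fields: Sequence[str]) -> List[Dict[str, float]]:
    """Inverse conversion: rebuild one record per element index."""
    length = len(columns[fields[0]]) if fields else 0
    return [{name: columns[name][i] for name in fields} for i in range(length)]


# Two 3-D points stored as records, then regrouped by coordinate.
points = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 4.0, "y": 5.0, "z": 6.0},
]
soa = aos_to_soa(points, ["x", "y", "z"])
round_trip = soa_to_aos(soa, ["x", "y", "z"])
```

After conversion, all `x` values sit next to each other, which is the access pattern that neighboring GPU threads typically want.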

Slot Maximizer - Work Coalescing

Slot Maximizer (SM) enables automated tuning of OpenCL kernels to available hardware capability by merging similar work items to remove redundant operations.  SM works at the system level, in the OpenCL driver, to detect similarities between operations performed in separate kernels.  At runtime, work may be coalesced if overhead is impacting performance.  SM is part of a set of technologies developed by MulticoreWare that may be integrated into a hardware accelerator runtime framework.
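
How SM detects and merges similar work is not detailed here, but the general idea of coalescing redundant operations can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the `OPS` table, the `run_coalesced` function, and the simplification that two work items are "similar" only when they apply the same operation to the same argument.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical operations standing in for kernels.
OPS: Dict[str, Callable[[int], int]] = {
    "square": lambda v: v * v,
    "negate": lambda v: -v,
}


def run_coalesced(work_items: List[Tuple[str, int]]) -> Tuple[List[int], int]:
    """Evaluate each distinct (operation, argument) pair once and share the result."""
    cache: Dict[Tuple[str, int], int] = {}
    results: List[int] = []
    for item in work_items:
        if item not in cache:
            op, arg = item
            cache[item] = OPS[op](arg)  # the only place real work happens
        results.append(cache[item])
    return results, len(cache)


queue = [("square", 3), ("negate", 3), ("square", 3), ("square", 4), ("negate", 3)]
results, evaluations = run_coalesced(queue)
```

Five work items are submitted, but only three distinct operations are actually evaluated; the duplicates reuse earlier results.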