We can profile your software to determine exactly where the most time is spent and where you should focus your efforts. We’ve built custom tools to profile applications across a range of platforms and can plug these tools into your software and platform to get a birds-eye view of bottlenecks.
Port your algorithm to a smaller, faster, cheaper, or more power-efficient platform. We can analyze your algorithm and suggest a better platform, or target a platform you’ve already chosen. We’re architecture-agnostic, so we’ll find what works best for your software and hardware target.
We can analyze your algorithm or application to optimize it for power and performance on your existing platform. We’ll find and exploit the parallelism of both your algorithm and your hardware. We have experience working in many application domains, and can help your software work better for you.
Implement your idea or algorithm quickly and efficiently. We can start with a simple C model, MATLAB code, or just a mathematical description of your problem and implement it to your performance, power, and cost specifications. We can also compare various platforms so you can choose what best fits your needs.
Acceleration for Any Platform
Whether it’s a multi-core CPU or a deeply embedded custom design, our expertise can help accelerate your code.
GPUs have become ubiquitous at every level of computing, from the world’s largest supercomputers to the smartphone in your pocket. GPU programming is one of our core competencies, and we have one of the largest teams in the world with expertise in CUDA, OpenCL, C++ AMP, and other GPU-targeted APIs. We can quickly and efficiently port your code to the GPU platform of your choice, whether mobile, workstation, cloud, or cluster.
Embedded, DSPs, and FPGA
Power efficiency often demands more specialized hardware and software. Many algorithms can take advantage of the benefits of these low power platforms, but programming for them quickly becomes a challenge. We can help port your algorithm or application down to deeply embedded architectures, DSPs, or FPGAs. We’ve worked directly with manufacturers of these platforms to gain a deep understanding of how to use them well.
Cloud & Cluster
Computing as a service has taken off, but software hasn’t caught up yet. We’ve designed applications that scale to hundreds or thousands of compute nodes and work efficiently in GPU- and FPGA-enabled cloud services. We work directly with cloud and cluster providers to improve their back-end software, and with customers who have massive compute requirements to improve their scaling, reliability, and costs.
CPUs have gone almost exclusively multi-core, yet millions of lines of software remain single-threaded serial code. We can take your code, find its inherent parallelism, and specialize it for ARM, Intel or AMD x86, MIPS, or IBM Power cores, taking full advantage of each architecture’s SIMD instruction sets and intrinsics. Our experience with all of these architectures can help improve both your performance and your power usage.
We've accelerated code using all of these platforms, languages, and APIs, and more.
Nvidia CUDA Acceleration
ARM Assembly and NEON Instructions
MATLAB Code Porting and Acceleration
C++ AMP Acceleration
Qualcomm Snapdragon and Hexagon DSP
Cadence Tensilica DSP
Don’t see your hardware or software platform here? It might not have made the list, but our expertise can certainly be applied to your platform as well.
MulticoreWare has one of the largest and oldest CUDA-experienced teams in the industry. Our CTO, Dr. Wen-Mei Hwu, a professor at the University of Illinois at Urbana-Champaign, worked with NVIDIA Chief Scientist Dr. David Kirk to develop the first CUDA Center of Excellence in 2008. Our COO, Curtis Davis, was a co-founder of AGEIA and creator of the PhysX engine. After AGEIA was sold to NVIDIA in 2008, Curtis became VP of PhysX, leading the largest CUDA development team in the world to port PhysX to run on NVIDIA GPUs. Curtis’s team included Dr. Lihua Zhang, MulticoreWare’s VP of China Operations, and several key engineers who joined MulticoreWare after it was founded in 2009.
OpenCL is the primary competitor to CUDA and is supported on the widest range of platforms. MulticoreWare implements and licenses an OpenCL platform called MxPA to multiple semiconductor companies, where it serves as the default OpenCL implementation on their platforms.
MulticoreWare is a contributing member of the Khronos Group and develops OpenCL tools. MulticoreWare has accelerated applications in a wide variety of domains, including video and image processing, computer vision, Raster Image Processing (RIP), big data (Hadoop), and neural networks.
Microsoft C++ AMP
MulticoreWare works closely with Microsoft to extend the C++ AMP framework, enabling cross-platform support for this powerful heterogeneous computing model. MulticoreWare’s extensions combine high developer productivity with the highest possible performance across platforms.
MulticoreWare has developed Kalmar, a C++ compiler implementation. Kalmar can take a program conforming to the C++ AMP 1.2 standard and transform it into HSAIL, SPIR binary, or OpenCL C. With Kalmar, developers can write accelerated code in C++ AMP targeting Windows, Linux, or macOS systems running a variety of GPU architectures.
MATLAB Algorithm Porting & Acceleration
MATLAB’s built-in support for vector and matrix representations of data makes it well suited to GPU-accelerated platforms. Many of MATLAB’s built-in functions are already GPU-enabled, but further acceleration can be achieved by modifying code to use more native GPU functions, keeping data in GPU memory, and avoiding data copies between the GPU and the CPU.
A typical approach is to invoke kernels directly from MATLAB using the MEX interface and MATLAB’s CUDA support. Performance-critical functions can be rewritten in C/C++ and CUDA and called from MATLAB through the MEX interface, as if they were built-in MATLAB functions. Alternatively, the entire algorithm can be ported to another language.
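To illustrate the MEX-plus-CUDA approach described above, here is a minimal sketch of a GPU-aware MEX function, assuming MATLAB’s GPU MEX API from the Parallel Computing Toolbox; the kernel, file name, and scaling factor are hypothetical:

```cuda
// times_two_mex.cu -- hypothetical example: a MEX function that doubles a
// gpuArray of single-precision values entirely on the GPU.
// Build from MATLAB with: mexcuda times_two_mex.cu
#include "mex.h"
#include "gpu/mxGPUArray.h"

__global__ void times_two(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray const *prhs[])
{
    mxInitGPU();  // initialize the MathWorks GPU API

    // Wrap the incoming gpuArray; the data stays on the device, no host copy.
    mxGPUArray const *in  = mxGPUCreateFromMxArray(prhs[0]);
    mxGPUArray       *out = mxGPUCopyGPUArray(in);  // writable device copy

    int    n = (int)mxGPUGetNumberOfElements(out);
    float *d = (float *)mxGPUGetData(out);

    times_two<<<(n + 255) / 256, 256>>>(d, n);

    // Return the result as a gpuArray, still resident on the GPU, so later
    // GPU-enabled MATLAB operations avoid any round trip through host memory.
    plhs[0] = mxGPUCreateMxArrayOnGPU(out);
    mxGPUDestroyGPUArray(in);
    mxGPUDestroyGPUArray(out);
}
```

From MATLAB this would be called like any built-in function, e.g. `h = times_two(gpuArray(single(1:10)));`, with the result remaining on the device.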
Case Study 1: MEX and CUDA Interface
The MulticoreWare team helped a defense-industry client achieve roughly a 4X overall speedup across an entire application. MATLAB was retained as the driver program, and MulticoreWare converted MATLAB function calls to MATLAB’s native GPU functions to achieve the first level of optimization. This was followed by identifying hot spots where successive functions could be executed using GPU data pointers, without copying data back and forth between the CPU and GPU. MulticoreWare also wrote custom CUDA kernels and integrated them with the MATLAB framework using the MEX interface.
Case Study 2: Image-Processing Pipeline
The MulticoreWare team converted stages of an existing image-processing pipeline from MATLAB to C/C++ and CUDA, and implemented additional stages from the client’s specifications and documents. An input data stream was captured with high-speed cameras and processed frame by frame. The original application took 10 minutes to process one frame of data. The C/C++ pipeline created by MulticoreWare cut processing time to one minute, a 10X speedup over the MATLAB implementation. Further optimizations enabled the data to remain entirely on the GPU between stages of the pipeline, reducing the final per-frame processing time to less than one second, an impressive 600X acceleration over the original MATLAB pipeline.
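The pipeline-level optimization described in this case study can be sketched as follows: each frame is copied to the GPU once, flows through the stage kernels via device pointers, and is copied back only at the end. The stage kernels below are hypothetical placeholders, not the client’s actual filters:

```cuda
#include <cuda_runtime.h>

// Hypothetical stage kernels standing in for the real pipeline stages.
__global__ void stage1_filter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;            // placeholder filtering
}

__global__ void stage2_combine(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;                   // placeholder combination
}

void process_frame(const float *host_in, float *host_out, int n)
{
    float *d_in, *d_tmp;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_tmp, n * sizeof(float));

    // One host-to-device copy per frame...
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // ...then every stage reads and writes device memory directly,
    // with no intermediate transfers back to the CPU.
    int blocks = (n + 255) / 256;
    stage1_filter<<<blocks, 256>>>(d_in, d_tmp, n);
    stage2_combine<<<blocks, 256>>>(d_tmp, n);

    // One device-to-host copy retrieves the finished frame.
    cudaMemcpy(host_out, d_tmp, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_tmp);
}
```

Keeping intermediate buffers device-resident like this removes per-stage PCIe transfers, which is typically where the largest share of the speedup comes from in transfer-bound pipelines.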
Case Study 3: Medical X‑Ray Processing
The client approached us with a medical X-ray image processing and combination task in which multiple partial images with different focal planes required filtering and recombination. MulticoreWare analyzed the system, which included an FPGA-controlled phased array, a pipeline to move data to a workstation, and the original MATLAB and C code, and recommended software and hardware specifications to enable real-time processing. Compute-heavy sections of the code were rewritten as CUDA kernels and targeted at a 9-GPU system that met all of the customer’s specifications for power, cost, accuracy, and reliability.
FPGA Design and Development
FPGAs can offer the advantages of high performance, reconfigurability, and fast development cycles. We’ve ported our own machine learning, computer vision, and video processing libraries to FPGAs and can do the same for your applications or algorithms. MulticoreWare offers full design and development services for FPGAs. If your FPGA supports OpenCL, we can leverage its capabilities to develop quickly and efficiently, bypassing the many man-months of effort typically required to bring up FPGA applications. We also have RTL experts for performance-critical applications and tuning.
Xilinx Alliance Partner
MulticoreWare is a Xilinx Alliance Member certified for the SDAccel™ development environment. The Xilinx Alliance Program is a worldwide ecosystem of qualified companies that collaborate with Xilinx to further the development of All Programmable technologies. As a member of this alliance, MulticoreWare offers design services for Xilinx FPGAs using the SDAccel™ environment. SDAccel™, a member of the Xilinx SDx™ family, combines the industry’s first architecturally optimizing compiler supporting any combination of OpenCL™, C, and C++ kernels with the libraries and development boards needed for the first complete CPU/GPU-like development and run-time experience on FPGAs.
We’ve provided our code acceleration services across numerous application domains. Here are just a few of the open-source projects we’ve contributed to. Contact us about your code or project, proprietary or open source, and we’ll lend our expertise to the problem.