INSIDE DSP ARTICLES  

Implementing SIMD in Software
By BDTI, 12/6/2006

Many high-performance DSP and general-purpose processors are equipped with SIMD (single instruction, multiple data) hardware and instructions. SIMD enables a processor to execute a single instruction (say, an addition) on multiple independent sets of data in parallel, producing multiple independent results.

SIMD support has become increasingly common because it improves performance on many types of DSP-oriented applications (which tend to perform the same operations over and over again). But what if you’re implementing a DSP-oriented application on an embedded processor without SIMD support? Depending on the specifics of the application, the data, and the processor, it might make sense to emulate SIMD in software.

In this article, we'll give some examples of how to implement software-based SIMD, and quantify the processing speed-up you can achieve with this technique.

We'll also explain how to determine when "soft SIMD" is likely to be useful. Unlike many other optimization techniques, soft SIMD is only useful for a relatively small range of applications and processors, and it imposes significant limitations on input operands. But in digital signal processing applications, where you're often trying to squeeze every last cycle out of a processor and you're willing to put in significant extra work to meet performance targets, it's a useful addition to your optimization toolkit. Here at BDTI we've used soft SIMD optimizations in a couple of demanding projects, and they've proven to be worth the effort.

SIMD processors
In a processor with explicit SIMD support, one operation is executed on two (or more) operand sets to produce multiple outputs. This is often accomplished by having the processor treat registers as containing multiple data words; for example, a 32-bit register can be treated as containing two 16-bit data words, as illustrated in Figure 1. In this case, the processor automatically handles sign and carry propagation for each of the SIMD calculations. Most processors with explicit support for SIMD operations also provide instructions to allow the programmer to pack and unpack the registers with SIMD operands.
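
To make the lane behavior concrete, here is a minimal C sketch that models what a dual 16-bit SIMD add does with a 32-bit register. The function names (pack2x16, simd_add2x16) are hypothetical and do not correspond to any particular instruction set, and the sketch assumes unsigned lanes that wrap around on overflow; real SIMD hardware may instead use signed or saturating arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit words into one 32-bit register image:
   'hi' occupies bits 31..16, 'lo' occupies bits 15..0. */
static uint32_t pack2x16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Model of a dual 16-bit SIMD add (assumed unsigned, wrapping).
   Each lane is added on its own, so a carry out of the low lane
   is discarded instead of rippling into the high lane. */
static uint32_t simd_add2x16(uint32_t a, uint32_t b)
{
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));
    return pack2x16(hi, lo);
}
```

With a = pack2x16(1, 0xFFFF) and b = pack2x16(2, 1), simd_add2x16 returns 0x00030000: the low lane wraps to 0 and the high lane holds 1 + 2 = 3. An ordinary 32-bit add of the same two words gives 0x00040000, because the carry out of the low half spills into the high half. Containing that carry is the work that SIMD hardware does automatically.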


Figure 1. Diagram of SIMD (single instruction, multiple data) processing. SIMD performs the same operation simultaneously on multiple sets of operands under the control of a single instruction.

SIMD tends to be very effective at accelerating algorithms and applications that are highly parallel and repetitive, such as block FIR filtering or vector addition. As you’d expect, it’s not effective for processing that’s predominantly serial, as in control code, or when operands are scattered in memory and can't be easily packed into and unpacked from the registers.

Try a softer approach?
If you’re implementing an algorithm that’s a good fit for SIMD but your processor doesn’t explicitly support SIMD, you might want to consider emulating SIMD in software to boost performance. You'll need to evaluate whether the application, data, and processor are good candidates for soft SIMD. In most cases it only makes sense if the processor has registers that are wide enough to hold at least four data operands; for example, if the processor has 32-bit registers and your data is 8 bits wide. In contrast, if your processor has 32-bit registers and you need 16-bit operands (for example), you’re less likely to find soft SIMD useful. That’s because some of the overhead of implementing soft SIMD is constant regardless of how many words you process in parallel.
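
As an illustration of the 32-bit register, 8-bit data case, the following C sketch packs four 8-bit values into one 32-bit word and adds all four lanes with a single ordinary addition. The helper names (pack4x8, unpack4x8, soft_add4x8) are hypothetical, and the sketch assumes unsigned operands no larger than 127 so that no lane's sum can carry into its neighbor. Treat it as one possible shape for the technique rather than a definitive implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word, v[0] in the lowest byte. */
static uint32_t pack4x8(const uint8_t v[4])
{
    return (uint32_t)v[0] | ((uint32_t)v[1] << 8) |
           ((uint32_t)v[2] << 16) | ((uint32_t)v[3] << 24);
}

/* Split a 32-bit word back into its four 8-bit lanes. */
static void unpack4x8(uint32_t w, uint8_t v[4])
{
    v[0] = (uint8_t)w;
    v[1] = (uint8_t)(w >> 8);
    v[2] = (uint8_t)(w >> 16);
    v[3] = (uint8_t)(w >> 24);
}

/* Soft-SIMD add: one 32-bit add performs four 8-bit adds.
   Assumption: every lane of a and b is <= 127, so each sum fits in
   8 bits and no carry crosses a lane boundary. */
static uint32_t soft_add4x8(uint32_t a, uint32_t b)
{
    assert(((a | b) & 0x80808080u) == 0); /* debug-only range check */
    return a + b;
}
```

Adding the lanes {10, 20, 30, 127} and {1, 2, 3, 127} this way yields {11, 22, 33, 254}. The packing and unpacking steps are the overhead: they cost the same whether the word carries two lanes or four, so the technique pays off more as the number of lanes per register grows.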

The input operands must be able to tolerate a fairly severe range restriction: you'll often find that soft SIMD is only useful if your operands can stay within, say, a 7- or 8-bit range. Soft SIMD is also best suited for operating on fairly large data sets, where the overhead can be amortized over many operations.
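
The sketch below shows, under stated assumptions, why the operand range matters and how the work looks on a larger block. It assumes data already packed as four unsigned 8-bit lanes per 32-bit word, with a 7-bit limit on each lane; the names LANE_MSBS, block_fits_7bit, and soft_add_block are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANE_MSBS 0x80808080u /* top bit of each 8-bit lane */

/* Returns 1 if every lane in the block lies in the 7-bit range 0..127.
   Intended as a one-time validation pass, not part of the inner loop. */
static int block_fits_7bit(const uint32_t *w, size_t nwords)
{
    uint32_t seen = 0;
    for (size_t i = 0; i < nwords; i++)
        seen |= w[i];
    return (seen & LANE_MSBS) == 0;
}

/* Adds two packed blocks, four 8-bit lanes per 32-bit word.
   Results are only meaningful if both inputs pass block_fits_7bit. */
static void soft_add_block(uint32_t *dst, const uint32_t *a,
                           const uint32_t *b, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = a[i] + b[i]; /* one add handles four lanes */
}
```

If a lane exceeds the limit, its neighbor is corrupted. For example, 200 + 100 in the lowest lane gives 0x000000C8 + 0x00000064 = 0x0000012C, which leaves 0x2C (44) in lane 0 and a stray 1 in lane 1. With in-range data, such as 0x01020304 + 0x10203040 = 0x11223344, each lane holds its own correct sum, and the loop performs one addition per four data elements.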

Video and imaging applications are often good candidates for soft SIMD because they use large data sets, execute the same operations repetitively, often use 8-bit data, and are frequently implemented on embedded processors that have relatively wide registers and no native SIMD support (like some ARM and MIPS cores).

Copyright 2006-2008 by BDTI