Measuring Performance of DSP Code
By BDTI, 2/5/2007

Measuring the performance of real-time digital signal processing code is essential.  But whether you're using a simulator or hardware, it can be a headache to get accurate, repeatable performance measurements.  In this article we'll cover some of the common pitfalls you might encounter, and present some techniques for working around them.

Why Measure?

Digital signal processing applications typically have tight speed and cost constraints, and may also have challenging real-time deadlines. In most cases, these application requirements can only be met by thoroughly optimized code.

Optimizing signal processing code (and debugging it) is much easier if the software developer can accurately measure the processing time spent in various code segments. This information helps the developer see where cycles are wasted, assess how close the code is to optimal, devise changes that will improve performance, and observe the performance impact of each optimization.
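As a rough illustration of per-segment measurement, the sketch below accumulates minimum, maximum, and total cycle counts for one code segment over repeated runs. It assumes the target exposes a free-running 32-bit cycle counter; the `cycle_source` callback and every other name here are hypothetical stand-ins, not part of any vendor's tools.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: returns the current value of a free-running
   32-bit cycle counter on the target. */
typedef uint32_t (*cycle_source)(void);

/* Running statistics for one code segment. */
typedef struct {
    uint32_t min_cycles;
    uint32_t max_cycles;
    uint64_t total_cycles;
    uint32_t runs;
} segment_stats;

void stats_init(segment_stats *s)
{
    s->min_cycles = UINT32_MAX;
    s->max_cycles = 0;
    s->total_cycles = 0;
    s->runs = 0;
}

/* Record one run. Unsigned subtraction gives the right answer even if
   the counter wrapped once between start and end. */
void stats_record(segment_stats *s, uint32_t start, uint32_t end)
{
    uint32_t elapsed = end - start;
    if (elapsed < s->min_cycles) s->min_cycles = elapsed;
    if (elapsed > s->max_cycles) s->max_cycles = elapsed;
    s->total_cycles += elapsed;
    s->runs++;
}

/* Time `segment` `repeats` times using the supplied counter. */
void measure_segment(segment_stats *s, cycle_source now,
                     void (*segment)(void), uint32_t repeats)
{
    for (uint32_t i = 0; i < repeats; i++) {
        uint32_t t0 = now();
        segment();
        uint32_t t1 = now();
        stats_record(s, t0, t1);
    }
}
```

A large gap between the minimum and maximum is often a sign that something outside the segment itself, such as cache state or an interrupt, is affecting the measurement. The counts also include the overhead of reading the counter, which you may want to measure separately with an empty segment and subtract.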

To meet real-time performance constraints, the programmer also needs to be able to guarantee that the code will execute within a certain amount of time, every time.  Again, this will typically require accurate performance measurements, preferably on hardware that's similar to what will be used in the final system.
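To make the real-time constraint concrete, here is a minimal sketch of a cycle-budget check for block-based processing. The function names, the idea of reserving a percentage of headroom, and all numeric values are illustrative assumptions rather than anything prescribed in this article.

```c
#include <stdint.h>

/* Cycles available to process one block of samples before the next
   block arrives. Integer division rounds down, which is the
   conservative direction for a budget. */
uint64_t cycle_budget(uint64_t clock_hz, uint32_t sample_rate_hz,
                      uint32_t block_size)
{
    return clock_hz * block_size / sample_rate_hz;
}

/* Returns 1 if the worst-case measured cycle count fits within the
   budget after reserving margin_percent of it as headroom. */
int meets_deadline(uint64_t worst_case_cycles, uint64_t budget_cycles,
                   uint32_t margin_percent)
{
    if (margin_percent >= 100)
        return 0;  /* no usable budget left */
    uint64_t usable = budget_cycles * (100 - margin_percent) / 100;
    return worst_case_cycles <= usable;
}
```

For example, a hypothetical 600 MHz processor handling 48 kHz audio in 64-sample blocks has 800,000 cycles per block. With 10% held in reserve, a worst-case measurement of 700,000 cycles passes and one of 730,000 cycles does not. Note that the check is only as good as the worst-case figure fed into it, which is why the measurement needs to be accurate and taken under realistic conditions.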

Measurement Tools

Engineers measure the performance of their DSP software using either a software simulator or some form of hardware.  Simulators are usually easier to set up and use than hardware, and are more readily shared among several engineers. Simulators often also let you see internal details of the processor's operation while it is executing code, such as pipeline stalls and cache behavior. This information is extremely useful for code optimization and debugging, and is usually impractical to obtain from hardware.  Simulators are often painfully slow, however, which usually precludes the use of lengthy test vectors. They also usually can't run applications in real time, which is an awkward limitation for applications that require real-time interaction with external system components, as in motor control.

Hardware runs code much faster than a simulator does, but if the chip is new or still under development, hardware may not be available, or the hardware that is available may have bugs.

Simulator Shortfalls

If you're writing code for a DSP processor, you're likely to be able to get a cycle-accurate instruction-set simulator. That's because cycle-accurate simulators are considered a must-have tool for optimizing and measuring the performance of DSP code. (In contrast to DSPs, if you're working with an embedded general-purpose processor or a PC processor, you'll find that cycle-accurate instruction-set simulators are less common. General-purpose processor vendors typically don't design their tools with the needs of real-time signal processing software developers in mind.) Figure 1 shows a screen shot of simulator/profiler output from Texas Instruments' Code Composer Studio.

Figure 1. Simulator output from Texas Instruments' Code Composer Studio.

"Cycle-accurate" is a relative term, however.  Instruction-set simulators are usually only cycle-accurate for the processor core and possibly level-one memory; they rarely provide cycle-accurate models of caches, peripherals, I/O, or other levels of memory. One problem we've encountered is that simulator documentation often doesn't clearly explain the boundary between what is accurately modeled and what isn't, which can lead to unwanted surprises when the code is finally run on hardware. (This is particularly true for newer processors with immature tools.) It's a good idea to devise a few simple tests to gauge the accuracy of the simulator rather than relying entirely on the vendor's documentation. Don't assume that the simulator (or its documentation) is complete, or even correct.

To verify a simulator's cycle-accuracy, you need to know how many cycles a given code fragment should require. Read the vendor's descriptions of pipeline behavior (including forwarding paths) and of instruction latency and throughput, then construct code examples that exercise instruction sequences that are important to your application and that have relatively complex pipeline behavior or latency.  For example, load-use latency is often a critical issue in performance-optimized signal processing algorithms, as are multiply-accumulate throughput and latency.  Construct small fragments of code that specifically test these relationships, particularly where longer latencies exist, and verify that the simulator's cycle counts match the counts you predicted from the documentation.  A similar methodology can be applied to verifying latency and throughput of the memory hierarchy.
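The fragments described above might look something like the following sketch. It is written in C for readability, and every name in it (`mac_dependent`, `mac_independent`, `load_use_chain`, `cycle_check`, `cycles_match`) is hypothetical. On a real target you would likely write these fragments in assembly, or at least inspect the compiler's output, because an optimizing compiler is free to restructure the loops and hide the very dependency you are trying to test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Latency test: each multiply-accumulate depends on the previous
   accumulator value, so the loop exposes MAC result latency. */
int32_t mac_dependent(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Throughput test: four independent accumulators let back-to-back
   MACs issue without waiting on one another. n must be a multiple
   of 4. */
int32_t mac_independent(const int16_t *x, const int16_t *h, size_t n)
{
    assert(n % 4 == 0);
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    return a0 + a1 + a2 + a3;
}

/* Load-use test: the address of each load depends on the value
   returned by the previous load. */
size_t load_use_chain(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    while (steps--)
        idx = next[idx];
    return idx;
}

/* One test case: the cycle count predicted by hand from the vendor's
   pipeline documentation, and the count the simulator reported. */
typedef struct {
    const char *name;
    uint32_t expected_cycles;
    uint32_t measured_cycles;
} cycle_check;

/* Returns 1 if the measured count is within `tolerance` cycles of the
   expected count, 0 otherwise. */
int cycles_match(const cycle_check *c, uint32_t tolerance)
{
    uint32_t diff = c->measured_cycles > c->expected_cycles
                  ? c->measured_cycles - c->expected_cycles
                  : c->expected_cycles - c->measured_cycles;
    return diff <= tolerance;
}
```

Comparing the cycle counts of the dependent and independent MAC loops separates latency from throughput: if the documentation says the two should differ and the simulator reports identical counts (or vice versa), that is a hint that the pipeline model is incomplete. The return values are there so that you can also confirm each fragment computes what you intended before trusting its timing.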

Copyright 2006-2008 by BDTI