INSIDE DSP ARTICLES  

Massively Parallel Processors for DSP: Development Tools
By BDTI, 9/24/2007

Leveraging Parallel Resources

As mentioned earlier, making efficient use of a massively parallel processor requires that the application be distributed efficiently across the processing elements. It’s important to ensure, for example, that no single overloaded processing element bogs down the entire chip, that memory and bus contention don’t create bottlenecks, and that interdependent tasks can exchange data efficiently.

The tools and approaches used to ensure that parallel resources are used effectively depend, to some extent, on the processor architecture. In a SIMD (single instruction, multiple data) architecture, where the processing elements all perform the same operation (e.g., a multiply-accumulate) at a given instant, the programming paradigm tends to be quite similar to that of traditional single-core chips. The difference is that the programmer must select (or create) SIMD-appropriate algorithms, and organize and access data in a way that keeps the SIMD computational units from being starved for data. Depending on the application being implemented, this process may be anywhere from straightforward to extremely complicated.
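To make the SIMD idea concrete, here is a minimal sketch in plain Python that imitates a four-lane multiply-accumulate. The lane count, the function name `simd_dot`, and the zero-padding strategy are assumptions chosen for illustration; they do not describe any particular chip. The point is the data organization: the inputs are padded and walked in lane-sized groups so that every lane has an operand on every step.

```python
LANES = 4  # hypothetical SIMD width, chosen only for illustration


def simd_dot(a, b, lanes=LANES):
    """Dot product computed the way a SIMD multiply-accumulate array would.

    Each outer-loop step stands for one SIMD instruction: every lane
    performs the same multiply-accumulate on its own pair of operands.
    """
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")

    # Pad with zeros so the data divides evenly into lane-sized groups;
    # otherwise some lanes would have no operand on the final step.
    pad = (-len(a)) % lanes
    a = list(a) + [0] * pad
    b = list(b) + [0] * pad

    acc = [0] * lanes  # one accumulator per lane
    for base in range(0, len(a), lanes):
        for lane in range(lanes):  # conceptually simultaneous
            acc[lane] += a[base + lane] * b[base + lane]

    # Final reduction across the lanes.
    return sum(acc)


result = simd_dot([1, 2, 3], [4, 5, 6])
print(result)
```

On real hardware the inner loop would be a single instruction, and the cost of arranging data into lane-sized, contiguous groups is where much of the programming effort tends to go.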

One challenge for massively parallel architectures is that most embedded signal processing software is developed in C. This is unfortunate because C, an inherently sequential language, is not well suited to describing parallel operations. In fact, C tends to obscure the parallelism present in a given algorithm. For this reason, vendors of massively parallel processors typically provide an alternative means for the programmer to specify parallel operations.

For example, consider the approach taken by one vendor of massively parallel SIMD chips, Stream Processors, Inc. (SPI). The SPI Storm-1 architecture contains a MIPS-compatible processor mated to a large array of ALUs that operate in SIMD fashion. To develop an application for this architecture, the programmer begins with a standard C-language implementation. The programmer then identifies compute-intensive tasks that are suitable for SIMD acceleration, and specifies these tasks and their related data sets using SPI’s extensions to ANSI C (i.e., with intrinsics). Based on these extensions, the compiler allocates space for the SIMD data sets, schedules the tasks, and manages data synchronization.

If the processing element array consists of processors rather than SIMD-style ALUs, then the application will need to be partitioned into processor-sized chunks, and these chunks will then need to be mapped to specific processors. In the homogeneous case, every processor is equally well suited to perform every task, though the programmer may want to consider processor location on the chip (e.g., some processors may be nearer to relevant on-chip peripherals) and inter-processor communication when deciding which processor should execute a given task. If the processors are heterogeneous, the designer may also need to consider which processors are computationally best suited to handle which tasks. In some cases this mapping will be obvious, and in others it won’t. Moreover, it may make sense to map a task to a processor that isn’t the best match in terms of computational capability, if doing so simplifies data communication or some other element of the design.
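The mapping problem described above can be sketched as a simple greedy heuristic. The following Python example is only an illustration: the processor names, processor types, task names, and cycle counts are all made up, and the heuristic deliberately ignores processor location and inter-processor communication, which a real design would have to weigh. It places the heaviest tasks first and assigns each one to whichever processor would finish it soonest.

```python
def map_tasks(task_costs, processors):
    """Greedy mapping of tasks onto (possibly heterogeneous) processors.

    task_costs: {task: {processor_type: cycles}} (hypothetical estimates)
    processors: {processor_name: processor_type}
    Returns (assignment, load): the chosen processor per task and the
    total cycles assigned to each processor.
    """
    load = {proc: 0 for proc in processors}
    assignment = {}

    # Place the heaviest tasks first, ranked by their best-case cost.
    order = sorted(task_costs,
                   key=lambda t: min(task_costs[t].values()),
                   reverse=True)

    for task in order:
        best = None  # (finish_time, processor, cost)
        for proc, ptype in processors.items():
            cost = task_costs[task].get(ptype)
            if cost is None:
                continue  # this processor type cannot run the task
            finish = load[proc] + cost
            if best is None or finish < best[0]:
                best = (finish, proc, cost)
        if best is None:
            raise ValueError(f"no processor can run task {task!r}")
        _, proc, cost = best
        load[proc] += cost
        assignment[task] = proc

    return assignment, load


# Illustrative data only: two signal processing cores and one control core.
processors = {"dsp0": "dsp", "dsp1": "dsp", "ctl0": "ctrl"}
task_costs = {
    "fir":     {"dsp": 100, "ctrl": 400},
    "fft":     {"dsp": 120, "ctrl": 500},
    "viterbi": {"dsp": 90,  "ctrl": 300},
    "framing": {"dsp": 60,  "ctrl": 40},
}

assignment, load = map_tasks(task_costs, processors)
print(assignment)
print(load)
```

Extending the cost function with a communication penalty is one way to capture the case where a computationally weaker processor is still the better choice.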

Some parallel chip companies, such as picoChip, offer tools that can do the mapping automatically. The picoChip PC102 chip contains three different types of processors that are used in a MIMD fashion (i.e., each processor runs its own program and processes its own data). Some of the processors are optimized for control functionality, others for signal processing, and others for memory-intensive tasks, like buffering. All of the processors are capable of communicating with each other. In picoChip’s design paradigm, the programmer defines an application as a block diagram in which each block has its own associated program, written in C or in assembly. The programmer can indicate that a block should be split across processors, and specifies the bandwidth (I/O) and connectivity requirements for each block using structured VHDL. In this manner, the programmer partitions the application into processor-sized chunks. The compiler then uses the bandwidth data and block diagram to automatically map the blocks to the processors in a way that attempts to optimize the communications paths between processors. The programmer isn’t required to specify which type of processor each task will run on, but has the option to do so when needed.
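As a rough illustration of bandwidth-driven mapping, the Python sketch below places the blocks of a small block diagram onto a 2x2 grid of processor slots so that high-bandwidth links span short distances. This is not picoChip's algorithm or input format: the block names, bandwidth figures, grid layout, and the cost model (bandwidth multiplied by Manhattan distance) are all assumptions, and the exhaustive search is only practical for a handful of blocks.

```python
from itertools import permutations


def comm_cost(placement, links):
    """Sum of bandwidth x Manhattan distance over all links (assumed model)."""
    total = 0
    for (src, dst), bandwidth in links.items():
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        total += bandwidth * (abs(x1 - x2) + abs(y1 - y2))
    return total


def best_placement(blocks, slots, links):
    """Try every assignment of blocks to slots and keep the cheapest one."""
    best = None  # (cost, placement)
    for perm in permutations(slots, len(blocks)):
        placement = dict(zip(blocks, perm))
        cost = comm_cost(placement, links)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


# Hypothetical block diagram with per-link bandwidth requirements.
blocks = ["adc_in", "filter", "demod", "decode"]
slots = [(0, 0), (1, 0), (0, 1), (1, 1)]  # a 2x2 grid of processors
links = {
    ("adc_in", "filter"): 80,
    ("filter", "demod"): 80,
    ("demod", "decode"): 10,
    ("adc_in", "decode"): 1,
}

cost, placement = best_placement(blocks, slots, links)
print(cost, placement)
```

A production tool would need a scalable search and would also have to respect processor types and interconnect capacity, but the objective has the same flavor: keep the heaviest communication paths short.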

picoChip’s intent with this approach is to keep the programmer from having to manually partition the application across processors. If the compiler’s mapping is sufficient to meet performance targets, then the programmer is off the hook. If not, however, the programmer may need to determine how to help the compiler redistribute and rebalance the load, possibly by specifying the processor to which a given task should be assigned, or by re-architecting the overall implementation approach and creating a new block diagram. To facilitate the optimization process, picoChip provides tools that show processor loading, utilization, and interconnect requirements.
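The kind of information such a loading report conveys can be sketched in a few lines of Python. The function below is a hypothetical stand-in, not a picoChip tool: it assumes that each processor's busy cycles and a common per-frame cycle budget are already known, and it simply ranks processors by utilization and flags any that exceed the budget.

```python
def utilization_report(busy_cycles, budget_cycles):
    """Rank processors by utilization and flag overloaded ones.

    busy_cycles: {processor_name: cycles needed per frame} (assumed input)
    budget_cycles: cycles available per frame on each processor
    Returns a list of (processor, utilization, overloaded) tuples,
    with the most heavily loaded processor first.
    """
    if budget_cycles <= 0:
        raise ValueError("budget_cycles must be positive")

    rows = []
    for proc, cycles in busy_cycles.items():
        utilization = cycles / budget_cycles
        rows.append((proc, utilization, utilization > 1.0))

    # The bottleneck processor appears at the top of the report.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up numbers: p1 needs more cycles than the frame budget allows.
rows = utilization_report({"p0": 50, "p1": 150, "p2": 100}, 100)
for proc, utilization, overloaded in rows:
    print(f"{proc}: {utilization:.0%}{'  OVERLOADED' if overloaded else ''}")
```

A processor at the top of such a list with utilization above 100% is the one that would bog down the whole chip, and is the natural candidate for splitting its block or reassigning work.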

Copyright 2006-2008 by BDTI