DSP Bricklayer: November 2011

Friday, November 25, 2011

DSP of the second kind

DSP of the Second Kind (1995-2005)
The goal of this section is to give a historical perspective on the period when DSPs became CPUs and vice versa.

Background

We are now in 1996, and for the last couple of years DSP has been fashionable in Silicon Valley. To complicate things further, DSP is seen as just a subset of the shining keyword of the 1990s: "MULTIMEDIA". "TA!DA!" Pretty much all processor vendors are preparing something. A good summary of that time is found in Jeff Bier's work [ref 1 and 2].

Description

Roughly speaking, there were three and a half types, listed here in order of interest.

Add a DSP COP to a CPU (or, more likely, an MCU).
Add a DSP extension to an existing CPU ISA
Start from scratch and build a completely new architecture. It can be:

a DSP based on RISC principles (ZSP)
a DSP based on the "more RISCy than thou" principle (C6x)
a RISC CPU equally good at DSP (TriCore)
a DSP with a register file

Multimedia processors (MM)

We will not mention MM further, except that, for the sake of simplification, we will put Pentium MMX in this category rather than under ISA extensions. Remember that we are speaking about DSPs.

Not considered: dual-core platforms, such as those consisting of a DSP core plus a CPU core.

When looking at the 3 remaining categories, they can all be grouped under DSP extensions. Categories 1 and 2 differ only by the level of integration. Category 1 integrates DSP as a COP; category 2 integrates DSP as a separate execution unit or as a fatter data path. In the end it all became implementation details. Funnily enough, the category 3 designs (the new architectures) all started monolithic, but under the influence of the deconstruction school (TenSilica) many came up with the concept of a simpler core + extension.
We can also lamely argue that a C64x can be seen as a CPU (1 cluster + the 3 simplest execution units (EU)) or a powerful DSP (2 clusters with all 4 EUs).

Category 1 - DSP COP

This category was largely made obsolete by categories 2 and 3, so we will mention only a few products:

SH-DSP (1996) had a COP added to the SH-3 core: "good classic model"
ARM Piccolo (1997) was architecturally identical, but with the bonus of original solutions to some basic problems. (To be studied in detail.)
Siemens C166/ST10 MAC (1998) started as a COP and finished as the fully integrated ST Super10.
And later, for instance, the Massana Filu, which was a piece of IP (a COP) to be attached to a host.
We will not mention here the Tensilica Vectra or similar which are filed under Vector Processor.

Category 2 - DSP extensions to CPU

Here are examples of DSP extensions to common CPU ISAs. Note that not all of them are "Q format" types; some are more commonly classified as MM extensions (*VP*). From our perspective, the difference between the two is better understood in terms of DSP generation.

PowerPC

Altivec and variants (*VP*)
MPC8xx DSP addendum (MPC8xxRMAD Rev.0.1., 10/2003)

ARM (we are a bit lost here)

Move
Neon (*VP*)
MM extensions
SIMD
ARM9E
Xscale WMMX , WMMX2

MIPS

Lexra DSP extensions
MIPS DSP extensions

Coldfire DSPon
Hitachi MM extensions
TenSilica Vectra, VectraLX (*VP*)
ARC SIMD at MPR05
PIC MCU adds DSP (2005)
Intel SSE4
Sparc, HP --> see Ruby Lee, AMD, Alpha, MIPS Madmax

Category 3 - New DSP Architectures

Further sub categorized as

New CPU

The only example as such is TriCore (1996).

New DSP

Blackfin (1998) presented itself as a hybrid but was really a DSP (hey, a 40-bit native register width gives you away)
Tiger Sharc (1999)
Starcore (1999)
TI C62x (1997) became C64x (2000), then C66 (2010)

And of course C67

New name: ZSP (1997)
tons of other hopefuls
We will not mention the Infineon Carmel and the DSP group family (see "revenge of the trees" and "the last honest DSP" in chapter DSP of the First Kind).

Category (outside): dual-core platforms

-> see platforms, Multi-Core.

For the record

68356
Dual core consisting of a DSP plus a CPU (typ: ARM7 + OAK)

References

BDTI “DSP on General Purpose Processors—An Overview”, presentation to MicroDesign Resources dinner meeting, January 1997.
--> giving rise to the BDT Guide - DSP on GP CPU (1997)
Comparing the 2004 and 1995 BDTI guides also reveals how much things evolved.
BDTI guide on TriCore (not available)
BDTI guide on StarCore
Multiple press vulgarus articles on "hybrid", C compilation and register files.
The many TI VLIW white papers

Thursday, November 24, 2011

VADD

BDT Benchmark - Vector Add

Goal
Introduction to benchmarking as an architecture tool. Implicit confusion between hardware and software to teach architecture.

Background
Here will go a long explanation of manufacturer benchmarks (in the 80s), before BDTI, and other material explaining that vector add is not a BDT copyright. But by choosing Ntaps=40, BDT turned it into a standard, in the same way that they turned the FFT256 into another benchmark standard.
Description
The VADD401 BB is remarkable because it is both very simple and complex. I believe that studying it will cover maybe 50% of the low-hanging fruit of structural architecture.
We will now present results for several machines (DSP, CPU, etc.). For some of the results we used BDT as a reference. We also use "matlab" as a descriptive language.
The cycle equations take the form: total cycles = n*40 + ot (overhead) + p (pipeline), where n is the cycle cost per element. In most cases ot and p are merged into one.
The operation is
z = x+y where x, y, z are of native data size (typ: 16-bit)

1) Architecture 1980
Machine 1: Theory= 3-bus DSP
fetch
decode
x=ld(mem1); y=ld(mem2);
for ii=1:39, z=add21(x,y); mem3=st(z); x=ld(mem1); y=ld(mem2);end;
z= add21(x,y); mem3=st(z);

Machine 2: Standard = 2-bus DSP
fetch
decode
x=ld(mem1); y=ld(mem2);
for ii=1:40, z=add21(x,y); x=ld(mem1); y =ld(mem2);
mem3=st(z);
end
Note: the load "one too far" is not considered a problem.

Machine 3: CPU = 1-bus
fetch
decode
counter=40;
while(counter>=1)
x= ld(mem1);
y= ld(mem2);
z= add21(x,y);
mem3=st(z);
dec(counter);
branch(top); %dummy to mark the cost associated with the branch
end

Giving the following results

machine 1: 1*(N-1) + 3 = 42
machine 2: 2*N+4 = 84

machine 3: 6*N+3 = 243
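As a quick check of the three cycle equations, here is a minimal sketch in Python (used instead of the Matlab-style notation above only so that it runs anywhere). The function name `vadd_cycles_1980` and the dictionary labels are ours, not BDT terminology; the formulas are the ones quoted just above.

```python
def vadd_cycles_1980(n_elements=40):
    """Cycle counts for the three archi80 machines, using the formulas quoted above."""
    n = n_elements
    return {
        "machine 1 (3-bus DSP)": 1 * (n - 1) + 3,  # one cycle per element, fully overlapped
        "machine 2 (2-bus DSP)": 2 * n + 4,        # two cycles per element
        "machine 3 (1-bus CPU)": 6 * n + 3,        # six cycles per element, branch included
    }


if __name__ == "__main__":
    for name, cycles in vadd_cycles_1980().items():
        print(f"{name}: {cycles} cycles")
```

Changing `n_elements` shows how quickly the per-element cost dominates the fixed overhead.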

The main architectural trade-offs are:
- number of buses
- ZOL (Zero Overhead loop).

2) Architecture 2000

Machine 11: 2-bus 4 lane DSP
fetch
decode
dispatch
ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
for ii=1:10, ssssz=add84(ssssx,ssssy); ssssx=ld_by4(mem1); ssssy=ld_by4(mem2)
mem3=st_by4(ssssz);
end

Machine 12: 1-bus 8 lane CPU
fetch
decode
dispatch
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
for ii=1:5
ssssssssz=add168(ssssssssx,ssssssssy);
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
mem3=st_by8(ssssssssz);
end

Machine 12: 1-bus 8 lane CPU with 'proper' epilog
fetch
decode
dispatch
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
for ii=1:4
ssssssssz=add168(ssssssssx,ssssssssy);
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
mem3=st_by8(ssssssssz);
end
ssssssssz=add168(ssssssssx,ssssssssy);
mem3=st_by8(ssssssssz);

Machine 13: 1-bus 8 lane CPU with interleaved data
fetch
decode
dispatch
[ssssssssx ssssssssy ]=ld_by16(mem);
for ii=1:4
ssssssssz =add168(ssssssssx,ssssssssy);
[ssssssssx ssssssssy ]=ld_by16(mem);
mem=st_by8(ssssssssz);
end
ssssssssz =add168(ssssssssx,ssssssssy); mem=st_by8(ssssssssz);

Giving the following results

machine 11: 2*N/4 +5 = 25
machine 12: 4*N/8 +7 = 27

machine 13: 3*N/8 +6 = 21
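The same check for the archi 2000 machines, again as a hedged Python sketch. Each machine is described by three numbers read off the formulas above: cycles per loop iteration, lanes per iteration, and fixed overhead. The function name and the tuple layout are our own assumptions, and the sketch only handles N that is a multiple of the lane count.

```python
def vadd_cycles_2000(n_elements=40):
    """Cycle counts for machines 11-13, using the formulas quoted above."""
    # (cycles per iteration, lanes per iteration, fixed overhead)
    machines = {
        "machine 11 (2-bus, 4 lanes)": (2, 4, 5),
        "machine 12 (1-bus, 8 lanes)": (4, 8, 7),
        "machine 13 (1-bus, 8 lanes, interleaved)": (3, 8, 6),
    }
    results = {}
    for name, (cycles_per_iter, lanes, overhead) in machines.items():
        if n_elements % lanes != 0:
            raise ValueError("this sketch assumes N is a multiple of the lane count")
        results[name] = cycles_per_iter * (n_elements // lanes) + overhead
    return results


if __name__ == "__main__":
    for name, cycles in vadd_cycles_2000().items():
        print(f"{name}: {cycles} cycles")
```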

The main architectural trade-offs are:
- replacing multiple buses with a single bus
- a wider bus
- a datapath with sub-word parallelism (multiple lanes)
- reorganisation of data in memory
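Sub-word parallelism can be illustrated in software. The sketch below, in Python, emulates a 4-lane 16-bit add (the `add84` of machine 11) on one 64-bit word, using a standard masking trick so that carries never cross a lane boundary. The names `pack4`, `unpack4` and `add84_swar` are hypothetical helpers for this illustration; each lane wraps around on overflow, and no saturation is modelled.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = 0xFFFF
HIGH_BITS = 0x8000800080008000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # lower 15 bits of every lane


def pack4(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack4(word):
    """Split a 64-bit word back into four 16-bit lane values."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def add84_swar(a, b):
    """Add four 16-bit lanes at once; each lane wraps modulo 2**16."""
    # Add the low 15 bits of each lane: the carry stops at bit 15 of that lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the top bit of each lane with XOR (addition without carry out).
    return low_sum ^ ((a ^ b) & HIGH_BITS)


if __name__ == "__main__":
    x = pack4([1, 0xFFFF, 0x7FFF, 1234])
    y = pack4([1, 1, 1, 4321])
    print(unpack4(add84_swar(x, y)))
```

A real multi-lane datapath does this in hardware by cutting the carry chain; the masking is only the software equivalent.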

Saturday, November 19, 2011

FP DSP

Floating Point (FP) DSPs
[updated nov2015]
The goal of this section is to understand the relevance of FP DSP to the world of DSP today.

Background

[See Edward Lee's SPmag tutorial, circa 1990, for more.]
The first FP DSPs came from ATT and OKI (1983). The ADI Sharc was the first successful one, though it never matched the TI hype ("I've seen the future and it is floating point") introduced in 1987. Anyway, by 1991 [ref. 1] each of the 4 DSP stars had an FP architecture:

TI C3x then C4x
ADI Sharc
Moto DSP96000
ATT DSP32C

That was about the time that DSP people started questioning the wisdom of "old" DSP architectures as opposed to "modern RISC". For instance, the Intel i860 was starting to cut them to pieces in many markets [ref 2,3].

TBD: We will not go into details here, but the argument was largely biased because DSPs were primarily SOCs whereas the i860 was just.. well, it was the i860. Try comparing a Bugatti Veyron with a Porsche Cayenne. And Intel with ADI..

Going back to the DSP mainstream story, the FP DSPs took 4% of the GP DSP market and 20 years later it was still the same. They look largely irrelevant, so why bother?

Remembering all these years, it is funny to consider the time we spent explaining to people that FP DSPs were not going to replace FXP DSPs.

Features and recent evolution

Indeed, why bother with FP DSPs?
Firstly, they were the first DSPs to be architected in a "modern way".

For instance, the C40 had a register file, and emphasis was put on the C compiler.
Then its successor, the C67x, was VLIW,

which is an extreme CPU technique.. superscalar without a safety net...

The same thing happened with ADI Sharc (the only competition to TI). The Sharc was turned into a modern CPU (Tiger Sharc). Mind you, they did not go overboard. It was a static superscalar not a VLIW.

Secondly, TI introduced DSPs which natively execute both FP and integer.

The evolution of the low cost C28xx with the C283xx.
The evolution of the C67x workhorse into the C674x family
Most interesting is the most advanced DSP core, the C66, which is natively both FP and 32-bit integer (and 64-bit). It seems to be quite a major statement that the only "relevant" DSP family has chosen to go the FP way. As described by Gene Frantz [ref. 6], the main reason is matrix computation, which is another way to spell Matlab.

The old arguments turned upside down
The argument against FP was that the complexity was not worth it. Nowadays, with the matrix problem, it is the other way round. If you use FXP, you must work in at least 32-bit (and you lose the cost advantage of data size) and develop much longer algorithms (so you lose the code size advantage and, worse, the power consumption advantage). Altogether, the system price of an FP datapath is less than that of an FXP one.
In particular, in a CPU architecture an FP unit is just added next to the IP unit, so it is easy to figure out the cost.
(Well.. the exception model might suffer a bit too..)
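To make the FXP burden a little more concrete, here is a minimal Python sketch of a single Q15 multiply next to its floating point equivalent. The scaling, rounding and saturation steps are exactly what the programmer (or the datapath) has to carry around in FXP and what FP hides. The helper names (`to_q15`, `q15_mul`, `saturate_q15`) are ours, and round-half-up is just one possible rounding choice assumed for this illustration.

```python
Q15_MIN, Q15_MAX = -32768, 32767


def saturate_q15(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(real):
    """Convert a real number in [-1, 1) to Q15, saturating at the range limits."""
    return saturate_q15(int(round(real * 32768)))


def q15_mul(a, b):
    """Multiply two Q15 values: wide product, rounding, rescaling, saturation."""
    product = a * b                         # Q30 intermediate, needs 32 bits
    rounded = (product + (1 << 14)) >> 15   # round half up, rescale to Q15
    return saturate_q15(rounded)            # -1 * -1 would overflow without this


if __name__ == "__main__":
    half = to_q15(0.5)
    print(q15_mul(half, half) / 32768)      # fixed-point result, as a real number
    print(0.5 * 0.5)                        # floating point: no bookkeeping at all
```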

Future of FP DSPs

The future of FP DSPs is not bright. All existing architectures (C67, Sharc) survive in their existing application space (military or audio) but have no serious roadmap. The only recent introductions such as C673x and C66x are the result of convergence more than FP evolution.
Also, the C66 is built like a CPU: its most striking features come from high-end CPUs (level 3 interconnect and caches), with a few DSP features left over from the past. It stands a good chance in infrastructure (given the poor competition.. PPC, Intel) but has no future as the next core for OMAP 7. The day will come when TI has to decide whether C66x support is worth the bother or whether A9, A15, A21 or A333 is the best bet.

Lessons for Coprocessing

On the other hand, there is no doubt that FP has a good future in COPs and AS-DSPs.

Designers have full freedom and can draw from the rich past of FP DSPs. For instance, on the choice of data size, they can use 16+6 (as the first FP DSP from OKI) or 80-bit a la Intel.
Matlab Mapping becomes unidimensional.
But not all problems can be solved.

FP (single) is not enough. That, at least, is what we've heard from the audio claque (because linearity is 23-bit).
More seriously, we primarily live in a digital world (not a numeric world). The numeric values (traditional DSP) are overwhelmed by the packeting and bit stuffing. Take any audio or video codec.

References

Ray Weiss "32-bit FP DSP processors" , EDN Nov 7 1991
Steve Paavola " GP processors target FP DSP" www.edn.mag April 1, 1999
Plenty of similar articles in RTC magazine circa 2000 , e.g. from Spectrum Signal Processing
The EDN DSP Directories (before it became a joke), had good 1 page description of the 4 FP "old" DSPs (ex: june9,1994)
And so has the BDTI bible version 1995.
Gene Frantz White Paper " Where will Floating Point take us?" http://www.ti.com/lit/wp/spry145/spry145.pdf, oct 2010

This is the latest of Gene's white papers on FXP vs FP. See also:
Jim Larimer, Daniel Chen "Fixed or floating? ..." EDN 1995

Sunday, November 13, 2011

COP -VRAM

VECTOR RAM etc..
Goal
To explain the impact of the Vector RAM (VRAM) concept on COP design today.

Background
The mid-90s saw the good old RISC school (Berkeley) starting a new revolution: the VRAM. As stated by Patterson (2 t's, 1 s) et al., "the goal of Intelligent RAM (IRAM) is to design a cost-effective computer by designing a processor in a memory fabrication process(DRAM), and include memory-on-chip".
Already we can notice that memory-on-chip and processor in DRAM are two concepts which are a bit confused here.
Anyway by 1998 the concept became focused on VRAM (or VIRAM or V-IRAM or Vector IRAM).
It was a Vector Processor (VP) inside the memory array (Krste Asanovic was the VP guy).
As of today the concept is frozen at 2002 (last look this morning), and it is a pity.

RAM coprocessor in DSP
(hmm.. this section relying on personal memory might not be 100% accurate)
Some of the first DSPs were hardwired DSP functions (such as the AMI FFT 1980, Motorola CAFIR 1986, Inmos Adaptive FIR 1988) which, from the host perspective, were just a memory map.
In other words, it was the simplest programming model (write Xarray, write Yarray, write parameters, run, wait a little, read Zarray).
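The programming model just described can be sketched as a toy Python model. Everything here is hypothetical: the class `MemoryMappedCop`, its fields and the choice of a 16-bit vector add as the hardwired function are assumptions made for illustration, not a description of any of the chips named above.

```python
class MemoryMappedCop:
    """Toy model of a hardwired DSP function that the host sees as plain memory."""

    def __init__(self, size):
        self.x = [0] * size               # Xarray: written by the host
        self.y = [0] * size               # Yarray: written by the host
        self.z = [0] * size               # Zarray: read back by the host
        self.params = {"length": size}    # parameter block
        self.busy = False                 # status flag polled by the host

    def run(self):
        # In hardware this would only start the block; here the work is done at once.
        self.busy = True
        for i in range(self.params["length"]):
            self.z[i] = (self.x[i] + self.y[i]) & 0xFFFF  # 16-bit wraparound add
        self.busy = False


def host_vector_add(cop, x, y):
    """Host side: write Xarray, write Yarray, write parameters, run, wait, read Zarray."""
    n = len(x)
    cop.x[:n] = x
    cop.y[:n] = y
    cop.params["length"] = n
    cop.run()
    while cop.busy:   # "wait a little"
        pass
    return cop.z[:n]


if __name__ == "__main__":
    cop = MemoryMappedCop(40)
    print(host_vector_add(cop, list(range(40)), [1] * 40))
```

Note that the host code contains no DSP instruction at all: only reads and writes, which is the whole point of the model.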
Now on the 3 levels of architecture, silicon and software, implementing a block dsp function as a piece of memory is totally coherent (which is rare).
Note that the dsp function can be in series with the memory array (so as to be totally transparent) but more likely is part of the memory map.
Going back to the VRAM concept, from a 1980 dsp archi perspective, there is no doubt that it is a serious step forward. Hence it should be considered as such when designing a COP buried inside a memory array.
And obviously we have Matlab implementation in mind here.

References found in my garbage can 1998

"Brass/IRAM retreat" multiple papers including Christo Kozyrakis micro-architecture, June 24,1998
David Patterson et al. "A case for Intelligent RAM: IRAM" IEEE Micro, April 1997
Randi Thomas and Katherine Yelick "Efficient FFTs on IRAM" Berkeley, circa 1998?

Google

wikipedia..search IRAM
Patterson (2t, 1s)
much easier .. krste Asanovic
even better.. Kozyrakis ..it is not a name it is a trademark
and plenty of others

Saturday, November 12, 2011

COP - ASPRO, CAM

ASPRO, CAM and Find

The goal is to introduce the COP designer to one of the 4 or 5 major structures of computing (see note 1): Content Addressable Memory (CAM). We also briefly mention the very large field of ASsociative PROcessors (ASPRO). Behind all that, there is an interesting parallel between CAM and the FIND instruction.

(note 1) along with arithmetic, bit field logic and lookup table.

Introduction

A common problem with Matlab is that IF does not work on vectors. This should not be surprising, since it fits the DSP ISA principle where we define predicated instructions to replace branching.

DSP ISA

IF corresponds to a change of program flow and impacts the very early fetch.
Predication corresponds to two executions "in parallel" with a final mux to make the right choice. The hardware performance impact is negligible.

MATLAB

IF can be seen as a change of program flow, and it is difficult to visualize it on a vector.
Predication is implemented with logicals.

Example

y=(rbit==0)?x+7:x; 

in Matlab

y=x; 

if rbit==0, y=x+7; end

will work only on a scalar. The brute-force solution is to create a for loop (as in C).

But if we use logicals

cc=eq(rbit,0);

y(cc)=x(cc)+7;

y(~cc)=x(~cc);

This looks much more like predicated DSP code, and it works on vectors.
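For readers without Matlab at hand, the same predicated form can be sketched in plain Python. Both paths are computed for every element and a final per-element selection plays the role of the mux. The function name `predicated_add7` is ours; the constant 7 and the test on `rbit` come from the example above.

```python
def predicated_add7(x, rbit):
    """Vector form of y = (rbit == 0) ? x + 7 : x, without any branch on the data."""
    cc = [r == 0 for r in rbit]      # logical mask, like cc = eq(rbit, 0)
    taken = [xi + 7 for xi in x]     # "then" path, computed for every element
    not_taken = list(x)              # "else" path, computed for every element
    # Final mux: pick one of the two precomputed results per element.
    return [t if c else n for c, t, n in zip(cc, taken, not_taken)]


if __name__ == "__main__":
    print(predicated_add7([1, 2, 3, 4], [0, 1, 0, 1]))
```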

Now, the funny thing is that while this is perfectly legal, the standard Matlab style is to use the Find instruction.

FIND

Without going into the details of the usage of Find (see Mathworks), the implementation of Find as a COP or as C code is a heck of a challenge. But, to a first order, its structure is based on a CAM.
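As a first-order illustration only, here is a Python sketch of the simplest use of Find on a vector: returning the (1-based, as in Matlab) indices of the nonzero entries, and then using those indices to update only the selected elements. The function names are ours, and the sketch ignores the other forms of Find (matrices, row/column outputs, a limit on the number of results).

```python
def find(mask):
    """Return the 1-based indices of the nonzero entries of a vector."""
    return [i + 1 for i, m in enumerate(mask) if m]


def add7_where_zero(x, rbit):
    """Find-style version of the example: y = x, then y(idx) = x(idx) + 7."""
    idx = find([r == 0 for r in rbit])
    y = list(x)
    for i in idx:
        y[i - 1] = x[i - 1] + 7      # convert back to Python's 0-based indexing
    return y


if __name__ == "__main__":
    print(find([0, 1, 0, 1, 1]))
    print(add7_where_zero([1, 2, 3, 4], [0, 1, 0, 1]))
```

The awkward part for hardware is visible even here: the length of the result depends on the data.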

CAM
CAMs have been around for a long time and had(?) their hour of glory in networking chips. In DSP, a CAM is a natural for everything 'RECOG' (speech recognition, pattern recognition, etc.).
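A behavioural sketch of a CAM, in Python, may help here. The host presents a key and gets back the addresses of all matching entries; the optional care mask gives a ternary-style match where some bits are ignored. The class `Cam`, its 16-bit word width and the mask convention are assumptions made for this illustration, and the loop stands in for what the hardware does in parallel across all entries.

```python
class Cam:
    """Toy content addressable memory: search by content, get addresses back."""

    def __init__(self, depth):
        self.cells = [None] * depth       # None marks an empty entry

    def write(self, address, word):
        self.cells[address] = word

    def search(self, key, care_mask=0xFFFF):
        """Return the addresses whose word matches key on the bits set in care_mask."""
        # Hardware compares every entry at once; software has to loop.
        return [address for address, word in enumerate(self.cells)
                if word is not None and ((word ^ key) & care_mask) == 0]


if __name__ == "__main__":
    cam = Cam(4)
    cam.write(0, 0x1234)
    cam.write(1, 0xABCD)
    cam.write(3, 0x1234)
    print(cam.search(0x1234))                     # exact match
    print(cam.search(0x12FF, care_mask=0xFF00))   # match on the upper byte only
```

The list of matching addresses is the same kind of result that Find produces, which is the parallel drawn in this post.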

ASPRO

It is the generalization of the CAM concept to general purpose computing. ASPRO is as old as the world, and a large body of references is available.

References found in my 2004 garbage can

Florin Baboescu et al. "hardware implementation of a tree-based IP lookup Algorithm", ST Microelectronics, Year YYYY
Kohonen 1980 .. a book
Asanovic and Chapman 1990
Asanovic "the Space chip" keyword: PADMAVATI
Cypress Ternary CAM
Djamshid Tavangarian, "flag-oriented parallel Associative Architectures and Applications" IEEE Nov 1994
NeoMagic "the technology of APA" ..this one beats them all! Genius or monumental stupidity? still open after all these years.

there were at least 2 challenges :

software required double optimization and porting
iterative algorithms were impossible. We are working on that one!

Romain Saha MOSAIC " CAM speeds up lossless compression" EDN 09.29.03

see also www.commsdesign.com circa 2003

Good try, Romain!

ALTERA A.N. "Implementing High Speed Search Applis with Altera CAM" July 2001, A.N 119
GEC Plessey "PNC1480 LAN CAM"

gosh!

FURTHER : INTELLIGENT MEMORIES seen by COMPUTER DESIGN MAY 1998 !!

Matrix transposing and Multiplying
Wavelets transforms
Lossy and lossless compression
Cryptography
Graphics Accelerator

FURTHER: replacing time by space

Using GBytes of memory to store all possible outcomes.

Not your average chip.
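At a much smaller scale, the idea of replacing time by space can be sketched in Python: instead of computing a saturating 8-bit add each time, we store all 65536 possible outcomes once and then only index the table. The operation chosen, the table layout and the names `SAT_ADD_TABLE` and `sat_add_u8` are assumptions for illustration; the GByte version is the same idea with wider operands.

```python
# All possible outcomes of an unsigned 8-bit saturating add: 256 * 256 one-byte entries.
SAT_ADD_TABLE = bytes(min(a + b, 255) for a in range(256) for b in range(256))


def sat_add_u8(a, b):
    """Saturating add by pure lookup: the two operands together form the address."""
    return SAT_ADD_TABLE[(a << 8) | b]


if __name__ == "__main__":
    print(len(SAT_ADD_TABLE))      # memory cost, in bytes
    print(sat_add_u8(3, 4))
    print(sat_add_u8(200, 100))    # saturates at 255
```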

Sunday, November 6, 2011

DSP - Architecture Past, Future

DSP :: Architecture Past, Future

The dinosaurs (70s)

Computers only (MIT Lincoln)
The first chips (TRW, AMD bit slices, other bit slices)

The first steps (79-80)

Intel 2920
Bell Lab internal

DSP of the First Kind (80-95) : A good compromise between application specific and General Purpose (GP)

The PSI
NEC 7811
AMI2811
TI 320C10

TI C10,C20,C25

Motorola 56000
And many more
ADI enters the fray (21xx)
ATT becomes public (16xx)
TI starts a revolution every 4 years

FP (89)
MP (93)
VLIW (97)

Motorola weathercock: 24/24 16/24 24/16 16/16
ADI "one day I will be bigger"
ATT "only the professional"
Refer to BDT 1995 for summary
Smelling the RISC takeover
DSP group "the revenge of the trees"
Carmel "the last honest DSP!"

DSP of the Second Kind: Back into CPU mainstream (95-05)

TriCore
ARM9E
Hitachi SH
Extensions to CPU ISA (ARM, MIPS, PPC, Intel, HP, SUN)
TI: from C62 to C66
ZSP, Starcore
Blackfin, Tiger Sharc

DSP of the Lost Kind? What shall we try now? (05-10)

Customized DSP

Wireless DSP
Customisable DSP
Customizable Core: TenSilica, Arc,3-DSP

Multi-PE (Processing Elements)

Impact of MM architecture

MP, MC (Multi channels), MT
Re-configurable computing
Heterogeneous Platforms
Matlab and Custom DSP
Is GPU the latest smoking pot? Or is it Cell?

DSP of Any Kind

DSP boards
DSP custom Chips
FPGA platforms

DSP of the Third Kind (10-20?)

Back to the future: Coprocessor seen as a low complexity DSP

ASP DSP
COP DSP
A platform to build the platform: the framework

Added complexity: Matlab to implementation

Architecture evolution

For simplicity's sake, we will use the terms archi80 (DSP of the first kind), archi90 (DSP of the second kind) and archi2000 (DSP of the lost kind).

The typical archi80 is made of 3 blocks (DAU, AGU, PCU), buses and memories. These 3 blocks correspond respectively to the Data Arithmetic Unit, the Address Generation Unit and the Program Control Unit. The terminology matches the architecture previously developed in bit slice designs. To be more precise, the architecture was centered on designing datapaths with a microcode memory in the control plane.
The typical archi90 turns the concept by 90 degrees, so that everything is now seen from the instruction perspective. On top of that, a new central block is added: the Register File. Hence the 3 aforementioned blocks became respectively (IP, LS, Fetch/Decode). There is no doubt that archi90 is much more sophisticated than archi80 and provides a solid foundation. But it is also so complex that it needs a scientific foundation (Computer Architecture) and specialized engineers (Architects)...

When RISC proponents mention the simplicity of the RISC model (as, for instance, the DLX or MIPS1), they refer to the CPU debate of RISC versus CISC (in which they misplaced DSPs). But implementing a DSP with RISC principles is not an average ARM7 design.

The archi2000 opened new directions in architecture but none of them was a major change.

Many new directions (such as multi-core) have very little to do with cores. They are just platform choices.
Configurable cores (TenSilica, Arc) boil down to a finer-grained way of building a SOC.
Reconfigurable computing is a real breakthrough since it combines (re)build and run in hardware. Unfortunately it is still in its infancy.
And so are similar concepts like JIT interpreter (Transmeta) .
Because of its popularity, the C6x with its VLIW and its 2 cluster datapath is a real breakthrough. But TI is kind of going backward. Firstly, their VLIW looks more and more like a static superscalar, and the 2 clusters never expanded into 4, 8, 16 clusters. Instead they rely on clock speed, multiple multi-level caches and MP for performance. Hardly original...
All similar attempts (Multi Media CPU) are now dead.
For a time there was a revival with the IBM Cell and Nvidia GPUs, but it is difficult to see the future of DSP when looking at these beasts.
We had a series of gizmos such as stream processors which petered out like the rest.
Now none of these new trends were useless. All will have some impact on future DSP architectures.
For our purpose, our favorite trend is configurable computing: for instance, a DSP which includes a Matlab JIT, or a DSP which reconfigures its operating units on the fly.

The archi2010 starts from a simple observation

All efforts are now on platform design.
Software (the lack of it, the price to develop it) is the main killer for a platform
Developing a new GP core is useless.
Only a customized core can justify the investment. And not even a full core, because of the price of developing new tools and software.
The space remaining for DSP is either coprocessing or a DSP core so simple that the tool development effort is minimal.