Friday, November 25, 2011

DSP of the second kind

DSP of the Second Kind (1995-2005)
The goal of this section is to give a historical perspective on the time when DSPs became CPUs and vice versa.


Background
We are now in 1996, and for the last couple of years DSP has been fashionable in Silicon Valley. To complicate things further, DSP is seen as just a subset of the shining keyword of the 1990s: "MULTIMEDIA". "TA!DA!" Pretty much all processor vendors are preparing something. A good summary of that time is found in Jeff Bier's work [ref 1 and 2].

Description
Roughly speaking, there were three and a half types, classified here in order of interest.
  1. Add a DSP coprocessor (COP) to a CPU (or, more likely, an MCU).
  2. Add a DSP extension to an existing CPU ISA.
  3. Start from scratch and build a completely new architecture. It can be:
    1. a DSP based on RISC principles (ZSP)
    2. a DSP based on the "more RISCY than Thou" principle (C6x)
    3. a RISC CPU equally good at DSP (TriCore)
    4. a DSP with a register file 
  4. Multimedia processors (MM)
    1. We will not mention MM further, except that, for the sake of simplification, we put Pentium MMX in this category rather than under ISA extensions. Remember that we are speaking about DSPs.
  5. Not considered: dual-core platforms consisting of a DSP core plus a CPU core.
Looking at the 3 remaining categories, they can all be grouped under DSP extensions. Categories 1 and 2 differ only by the level of integration: category 1 integrates the DSP as a COP, while category 2 integrates it as a separate execution unit or as a fatter data path. In the end it all became implementation details. Funnily enough, the category 3 designs (the new architectures) all started monolithic, but under the influence of the deconstruction school (TenSilica) many came up with the concept of a simpler core plus extensions.
We can also lamely argue that a C64x can be seen as a CPU (1 cluster + the 3 simplest execution units (EU)) or as a powerful DSP (2 clusters with all 4 EUs).

Category 1- DSP COP

This category was largely obsoleted by categories 2 and 3, but we will mention a few products:
  1. SH-DSP (1996) had a COP added to the SH-3 core: a "good classic model".
  2. ARM Piccolo (1997) was architecturally identical, but with the bonus of original solutions to some basic problems. (To be studied in detail.)
  3. Siemens C166/ST10 MAC (1998) started as a COP and finished as the fully integrated ST Super10.
  4. And later, for instance, Massana Filu, which was a piece of IP (a COP) to be attached to a host.
  5. We will not mention the Tensilica Vectra or similar here; they are filed under Vector Processor.
Category 2- DSP extensions to CPU
Here are examples of DSP extensions to common CPU ISAs. Note that not all of them are "Q format" types; some are more commonly classified as MM extensions (*VP*). From our perspective, the difference between the two is better understood in terms of DSP generation.
  1. PowerPC
    1. Altivec and variants (*VP*)
    2. MPC8xx  DSP addendum (MPC8xxRMAD Rev.0.1., 10/2003)
  2. ARM (we are a bit lost here)
    1. Move
    2. Neon (*VP*)
    3. MM extensions
    4. SIMD
    5. ARM9E
    6. Xscale WMMX , WMMX2
  3. MIPS
    1. Lexra DSP extensions
    2. MIPS DSP extensions
  4. Coldfire DSPon
  5. Hitachi MM extensions
  6. TenSilica Vectra, VectraLX (*VP*)
  7. ARC SIMD at MPR05
  8. PIC MCU adds DSP (2005)
  9. Intel SSE4
  10. Sparc, HP --> see Ruby Lee, AMD, Alpha, MIPS Madmax

Category 3- New DSP Architectures
Further subcategorized as
  1. New CPU
    1. The only example as such is TriCore (1996).
  2. New DSP
    1. Blackfin (1998) presented itself as a hybrid but is really a DSP (hey, a 40-bit native register width gives you away)
    2. Tiger Sharc (1999)
    3. Starcore (1999)
    4. TI C62x (1997) became C64x(2000) then C66 (2010)
      1. And of course C67
    5. New name: ZSP (1997)
    6. tons of other hopefuls
    7. We will not mention the Infineon Carmel and the DSP group family (see "revenge of the trees" and "the last honest DSP" in chapter DSP of the First Kind). 
Category (outside): dual-core platforms
-> see platforms, Multi-Core.
  • For the record
    • 68356
    • Dual core consisting of a DSP  plus a CPU (typ: ARM7 + OAK)

    References
    1. BDTI  “DSP on General Purpose Processors—An Overview”, presentation to MicroDesign Resources dinner meeting, January 1997. 
    2. --> giving rise to the BDT Guide - DSP on GP CPU (1997)
    3. also comparing BDTI guide 2004 and 1995 reveals the amount of the evolution.
    4. BDTI guide on TriCore (not available)
    5. BDTI guide on StarCore 
    6. Multiple press vulgarus articles on "hybrid", C compilation and register files.
    7. The many TI VLIW white papers 

      Thursday, November 24, 2011

      VADD

      BDT Benchmark - Vector Add

      Goal
      Introduction to benchmarking as an architecture tool. Implicit confusion between hardware and software to teach architecture.

      Background
      Here will go a long explanation of manufacturer benchmarks (in the 80s), before BDTI, and other material to explain that vector add is not a BDT copyright. But by choosing Ntaps=40, BDT turned it into a standard, in the same way that they turned the FFT256 into another benchmark standard.
      Description
      The VADD401 BB is remarkable because it is both very simple and complex. I believe that studying it will cover maybe 50% of the low-hanging fruit of structural architecture.
      We will now present results for several machines (DSP, CPU, etc.). For some of the results we used BDT as a reference. We also use "matlab" as a descriptive language.
      The cycle equation is defined as: total cycles = n*40 + ot (overhead) + p (pipeline), where n is the number of cycles per loop iteration. In most cases ot and p are merged into a single constant.
      The operation is 
        z =x+y   where x,y,z are native data size (typ: 16-bit)


      1) Architecture 1980
      Machine 1: Theory= 3-bus DSP
      fetch
      decode
      x=ld(mem1); y=ld(mem2);
      for ii=1:39, z=add21(x,y); mem3=st(z);   x=ld(mem1); y=ld(mem2);end;  
      z= add21(x,y); mem3=st(z);     

      Machine 2: Standard = 2-bus DSP
      fetch
      decode
      x=ld(mem1);  y=ld(mem2);
      for ii=1:40, z=add21(x,y); x=ld(mem1); y =ld(mem2);  
                    mem3=st(z);
      end     
      Note: the load "one too far" is not considered a problem.



      Machine 3:  CPU  = 1-bus 
      fetch
      decode
      counter=40;
      while(counter>=1)
        x= ld(mem1); 
        y= ld(mem2);
        z= add21(x,y);
        mem3=st(z);
        dec(counter);
        branch(top); % dummy to mark the cost associated with the branch
      end


      Giving the following results

      machine 1:    1*(N-1) + 3 = 42
      machine 2:    2*N+4 = 84

      machine 3:    6*N+3 = 243
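
      The three results above can be reproduced with a small Python sketch of the cycle equation. This is only an illustration: the function name total_cycles and the dictionary layout are ours, and overhead and pipeline fill are merged into one constant, as the text does.

```python
# Cycle model from the text: total = cycles per iteration * iterations + overhead.
# Overhead and pipeline fill are merged into a single constant.
N = 40  # vector length of the benchmark

def total_cycles(cycles_per_iter, iterations, overhead):
    return cycles_per_iter * iterations + overhead

machines_1980 = {
    "machine 1 (3-bus DSP)": total_cycles(1, N - 1, 3),
    "machine 2 (2-bus DSP)": total_cycles(2, N, 4),
    "machine 3 (1-bus CPU)": total_cycles(6, N, 3),
}

for name, cycles in machines_1980.items():
    # Cycles per element shows how far each machine is from the ideal of 1
    print(f"{name}: {cycles} cycles, {cycles / N:.2f} cycles per element")
```

      Changing N or the per-iteration cost makes it easy to see how quickly the branch and bus penalties of the 1-bus CPU add up.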



      The main architectural trade-offs are :
      - number of buses 
      - ZOL (Zero Overhead loop).


      2) Architecture 2000


      Machine 11: 2-bus 4 lane DSP 
      fetch
      decode
      dispatch
      ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
      for ii=1:10, ssssz=add84(ssssx,ssssy); ssssx=ld_by4(mem1); ssssy=ld_by4(mem2)  
                   mem3=st_by4(ssssz);
      end     

      Machine 12: 1-bus 8 lane CPU 
      fetch
      decode
      dispatch
      ssssssssx=ld_by8(mem1);
      ssssssssy=ld_by8(mem2);
      for ii=1:5
                  ssssssssz=add168(ssssssssx,ssssssssy); 
                  ssssssssx=ld_by8(mem1); 
                  ssssssssy=ld_by8(mem2);  
                  mem3=st_by8(ssssssssz);
      end     


      Machine 12: 1-bus 8 lane CPU with 'proper' epilog 
      fetch
      decode
      dispatch
      ssssssssx=ld_by8(mem1);
      ssssssssy=ld_by8(mem2);
      for ii=1:4
                  ssssssssz=add168(ssssssssx,ssssssssy);
                  ssssssssx=ld_by8(mem1);
                  ssssssssy=ld_by8(mem2);
                  mem3=st_by8(ssssssssz);
      end
      ssssssssz=add168(ssssssssx,ssssssssy);
      mem3=st_by8(ssssssssz);


      Machine 13: 1-bus 8 lane CPU with interleaved data 
      fetch
      decode
      dispatch
      [ssssssssx  ssssssssy ]=ld_by16(mem); 
      for ii=1:4
                  ssssssssz =add168(ssssssssx,ssssssssy); 
                  [ssssssssx  ssssssssy ]=ld_by16(mem); 
                  mem=st_by8(ssssssssz);
      end
      ssssssssz =add168(ssssssssx,ssssssssy);
      mem=st_by8(ssssssssz);


      Giving the following results

      machine 11:    2*N/4 +5 = 25
      machine 12:    4*N/8 +7 = 27

      machine 13:    3*N/8 +6 = 21
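
      The same cycle equation can be written with the lane count as a parameter. The Python sketch below is ours (names such as total_cycles and machines_2000 are hypothetical) and only re-computes the three figures above.

```python
N = 40  # vector length of the benchmark

def total_cycles(cycles_per_iter, lanes, overhead, n=N):
    # One loop iteration handles `lanes` elements, so the loop runs n/lanes times
    assert n % lanes == 0, "vector length must be a multiple of the lane count"
    return cycles_per_iter * (n // lanes) + overhead

machines_2000 = {
    11: total_cycles(2, 4, 5),   # 2-bus, 4-lane DSP
    12: total_cycles(4, 8, 7),   # 1-bus, 8-lane CPU
    13: total_cycles(3, 8, 6),   # 1-bus, 8-lane CPU, interleaved data
}

for number, cycles in machines_2000.items():
    print(f"machine {number}: {cycles} cycles")
```

      With only 5 to 10 loop iterations left, the fixed overhead is no longer negligible: it is 5 of the 25 cycles for machine 11.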



      The main architectural trade-offs are :
      - replacing multiple buses by single bus 
      - larger width bus 
      - datapath with sub-word parallelism (multi lanes).
      - reorganisation of data in memory
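
      To make sub-word parallelism concrete, here is a minimal Python sketch of a 4-lane add on 16-bit data packed into one 64-bit word. It is a software model under our own assumptions: the names pack, unpack and add_4x16 are hypothetical, and each lane wraps around on overflow, whereas a real DSP datapath would often saturate instead.

```python
LANES = 4
HIGH = 0x8000800080008000   # top bit of every 16-bit lane
LOW = 0x7FFF7FFF7FFF7FFF    # the other 15 bits of every lane

def pack(values):
    # Lane 0 goes into the least significant 16 bits
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def unpack(word):
    return [(word >> (16 * i)) & 0xFFFF for i in range(LANES)]

def add_4x16(a, b):
    # Add the low 15 bits of each lane: the sum cannot carry into the next lane
    partial = (a & LOW) + (b & LOW)
    # Restore the top bit of each lane with an XOR, which never carries
    return partial ^ ((a ^ b) & HIGH)

x = [1, 0xFFFF, 0x8000, 1234]
y = [2, 1, 0x8000, 4321]
z = unpack(add_4x16(pack(x), pack(y)))
print(z)
```

      The point of the masks is that one wide adder does four independent adds, which is exactly what the lanes buy in hardware by cutting the carry chain.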


      Saturday, November 19, 2011

      FP DSP

      Floating Point (FP) DSPs  
      [updated nov2015]
      The goal of this section is to understand the relevance of FP DSP to the world of DSP today.

      Background
      [see Edward Lee's SPmag tutorial, circa 1990, for more]
      The first FP DSPs were from ATT and OKI (1983). The ADI Sharc was the first successful one, though not as successful as the TI hype introduced in 1987 ("I've seen the future and it is floating point") would suggest. Anyway, by 1991 [ref. 1] each of the 4 DSP stars had an FP architecture:
      • TI C3x then C4x
      • ADI Sharc
      • Moto DSP96000
      • ATT DSP32C
      That was about the time that DSP people started questioning the wisdom of "old" DSP architectures as opposed to "modern RISC". For instance, the Intel i860 was starting to cut them to pieces in many markets [ref 2,3].
      • TBD: We will not go into details here, but the argument was largely biased because DSPs were primarily SOCs whereas the i860 was just... well, it was the i860. Try comparing a Bugatti Veyron with a Porsche Cayenne. And Intel with ADI.
      Going back to the DSP mainstream story, the FP DSPs took 4% of the GP DSP market, and 20 years later it was still the same. They look largely irrelevant, so why bother?
      • Remembering all these years, it is funny to consider the time we spent explaining to people that FP DSPs were not going to replace FXP DSPs.

      Features and recent evolution
      Indeed, why bother with FP DSPs?
      First, they were the first DSPs to be architected in a "modern way".
      • For instance, the C40 had a register file, and emphasis was put on the C compiler.
      • Then its successor, the C67x, was VLIW
        • which is an extreme CPU technique: superscalar without a safety net...
      • The same thing happened with the ADI Sharc (the only competition to TI). The Sharc was turned into a modern CPU (Tiger Sharc). Mind you, they did not go overboard. It was a static superscalar, not a VLIW.
      Second, TI introduced DSPs which natively execute both FP and integer.
      • The evolution of the low cost C28xx with the C283xx.
      • The evolution of the C67x workhorse into the C674x family
      • Most interesting is the most advanced DSP core, the C66, which is natively both FP and integer 32-bit (and 64-bit). It seems to be quite a major statement that the only "relevant" DSP family has chosen to go the FP way. As described by Gene Frantz [ref. 6], the main reason is matrix computation, which is another way to spell Matlab.

      The old arguments turned upside down
      The argument against FP was that the complexity was not worth it. Nowadays, with the matrix problem, it is the other way round. If you use FXP, you must work in at least 32-bit (so you lose the cost advantage of data size) and develop much longer algorithms (so you lose the code size advantage and, worse, the power consumption advantage). Altogether, the system price of an FP datapath is less than that of an FXP one.
      In CPU architecture especially, an FP unit is just added next to the IP unit, so it is easy to figure out the cost.
      (Well.. the exception model might suffer a bit too..)
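
      As a small illustration of the extra work FXP imposes on the programmer, here is a Python sketch of a fixed-point multiply. It is our own example and assumes the common Q15 format (16-bit, 15 fractional bits), which is smaller than the 32-bit case discussed above; the helper names are hypothetical. In FP the same operation is simply a*b.

```python
Q15_ONE = 1 << 15            # scale factor: 1.0 would be 32768, just out of range
Q15_MAX, Q15_MIN = 32767, -32768

def saturate(v):
    # Clamp to the 16-bit signed range instead of wrapping around
    return max(Q15_MIN, min(Q15_MAX, v))

def to_q15(x):
    return saturate(round(x * Q15_ONE))

def q15_mul(a, b):
    # 16x16 -> 32-bit product, rescaled by 2^15, then saturated
    return saturate((a * b) >> 15)

def q15_to_float(a):
    return a / Q15_ONE

a, b = to_q15(0.5), to_q15(0.25)
print(q15_to_float(q15_mul(a, b)))   # fixed-point result
print(0.5 * 0.25)                    # floating-point result, no scaling needed
```

      Every multiply needs a rescale and a saturation decision, and the programmer has to track the scaling of every intermediate value by hand. That bookkeeping is what makes FXP algorithms longer.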


      Future of FP DSPs
      The future of FP DSPs is not bright. All existing architectures (C67, Sharc) survive in their existing application space (military or audio) but have no serious roadmap. The only recent introductions, such as the C673x and C66x, are the result of convergence more than of FP evolution.
      Also, the C66 is built like a CPU: its most striking features come from high-end CPUs (level 3 interconnect and caches), with a few DSP features left over from the past. It stands a good chance in infrastructure (given the poor competition: PPC, Intel) but has no future as the next core for OMAP 7. The day will come when TI has to decide whether C66x support is worth the bother or whether A9, A15, A21 or A333 is the better bet.


      Lessons for Coprocessing

      On the other hand there is no doubt that FP has a good future in Cops and AS-DSPs.

      1. Designers have full freedom and can draw from the rich past of FP DSPs. For instance, for the data size they can use 16+6 (like the first FP DSP from OKI) or 80-bit a la Intel.
      2. Matlab Mapping becomes unidimensional.
      3. But not all problems can be solved.
        1. FP (single) is not enough. That, at least, is what we've heard from the audio claque (because linearity is 23-bit).
        2. More seriously, we primarily live in a digital world (not a numeric world). The numeric values (traditional DSP) are overwhelmed by the packeting and bit stuffing. Take any audio or video codec.




      References
      1. Ray Weiss "32-bit FP DSP processors" , EDN Nov 7 1991
      2. Steve Paavola " GP processors target FP DSP" www.edn.mag April 1, 1999
      3. Plenty of similar articles in RTC magazine circa 2000 , e.g. from Spectrum Signal Processing
      4. The EDN DSP Directories (before it became a joke), had good 1 page description of the 4 FP "old" DSPs (ex: june9,1994) 
      5. And so has the BDTI bible version 1995.
      6. Gene Frantz White Paper " Where will Floating Point take us?" http://www.ti.com/lit/wp/spry145/spry145.pdf, oct 2010
        1. this is the latest of Gene's white papers on FXP vs FP. See also:
        2.  Jim Larimer, Daniel Chen "Fixed or floating? ..." EDN 1995

      Sunday, November 13, 2011

      COP -VRAM

      VECTOR RAM etc..
      Goal
      To explain the impact of the Vector RAM (VRAM) concept on COP design today.

      Background
      The mid-90s saw the good old RISC school (Berkeley) starting a new revolution: the VRAM. As put by Patterson (2 t's, 1 s) et al., "the goal of Intelligent RAM (IRAM) is to design a cost-effective computer by designing a processor in a memory fabrication process (DRAM), and include memory-on-chip".
      Already we can notice that memory-on-chip and processor-in-DRAM are two concepts which are a bit confused here.
      Anyway, by 1998 the concept had become focused on VRAM (or VIRAM or V-IRAM or Vector IRAM).
      It was a Vector Processor (VP) inside the memory array (Krste Asanovic was the VP guy).
      As of today the concept is frozen at 2002 (last look this morning), and it is a pity.

      RAM coprocessor in DSP
      (hmm.. this section relying on personal memory might not be 100% accurate)
      Some of the first DSPs were hardwired DSP functions (such as AMI FFT 1980, Motorola CAFIR 1986, Inmos Adaptive FIR 1988) which, from the host's perspective, were just a memory map.
      In other words, it was the simplest programming model (write Xarray, write Yarray, write parameters, run, wait a little, read Zarray).
      Now, on the 3 levels of architecture, silicon and software, implementing a block DSP function as a piece of memory is totally coherent (which is rare).
      Note that the DSP function can be in series with the memory array (so as to be totally transparent) but more likely is part of the memory map.
      Going back to the VRAM concept, from a 1980 DSP architecture perspective there is no doubt that it is a serious step forward. Hence it should be considered as such when designing a COP buried inside a memory array.
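
      The host-side programming model described above (write Xarray, write Yarray, write parameters, run, wait, read Zarray) can be sketched in Python as follows. This is a toy model and not any of the chips named: the address map, the register names and the choice of a vector add as the hardwired function are all our assumptions.

```python
class MemoryMappedVadd:
    """Toy model of a hardwired vector-add block seen by the host as plain memory."""
    X_BASE, Y_BASE, Z_BASE = 0x000, 0x100, 0x200       # hypothetical address map
    LEN_REG, CTRL_REG, STATUS_REG = 0x300, 0x301, 0x302

    def __init__(self):
        self.mem = [0] * 0x303

    def write(self, addr, value):
        self.mem[addr] = value
        if addr == self.CTRL_REG and value == 1:        # writing 1 to CTRL starts a run
            self._run()

    def read(self, addr):
        return self.mem[addr]

    def _run(self):
        # The "hardware": in this model it completes instantly
        for i in range(self.mem[self.LEN_REG]):
            total = self.mem[self.X_BASE + i] + self.mem[self.Y_BASE + i]
            self.mem[self.Z_BASE + i] = total & 0xFFFF  # 16-bit wrap-around
        self.mem[self.STATUS_REG] = 1                   # done flag

def host_vadd(cop, x, y):
    for i, v in enumerate(x):
        cop.write(cop.X_BASE + i, v)                    # write Xarray
    for i, v in enumerate(y):
        cop.write(cop.Y_BASE + i, v)                    # write Yarray
    cop.write(cop.LEN_REG, len(x))                      # write parameters
    cop.write(cop.STATUS_REG, 0)
    cop.write(cop.CTRL_REG, 1)                          # run
    while cop.read(cop.STATUS_REG) == 0:                # wait a little
        pass
    return [cop.read(cop.Z_BASE + i) for i in range(len(x))]  # read Zarray

print(host_vadd(MemoryMappedVadd(), [1, 2, 3], [10, 20, 30]))
```

      The host code contains nothing but reads and writes, which is why the model is coherent across architecture, silicon and software.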
      And obviously we have Matlab implementation in mind here.

      References found in my garbage can 1998
      1. "Brass/IRAM retreat" multiple papers including Christos Kozyrakis micro-architecture, June 24, 1998
      2. David Patterson et al., "A case for Intelligent RAM: IRAM", IEEE Micro, April 1997
      3. Randi Thomas and Katherine Yelick, "Efficient FFTs on IRAM", Berkeley, circa 1998?
      Google
      1. wikipedia..search IRAM
      2. Patterson (2t, 1s)
      3. much easier .. Krste Asanovic
      4. even better.. Kozyrakis ..it is not a name it is a trademark
      5. and plenty of others

      Saturday, November 12, 2011

      COP - ASPRO, CAM

      ASPRO, CAM and Find


      The goal is to introduce the COP designer to one of the 4 or 5 major structures of computing (see note 1): Content Addressable Memory (CAM). We also briefly mention the very large field of ASsociative PROcessors (ASPRO). Behind all that, there is an interesting parallel between CAM and the FIND instruction.

      (note 1) along with arithmetic, bit field logic and lookup table.

      Introduction
      A common problem with Matlab is that IF does not work on vectors. This should not be surprising, since it fits the DSP ISA principle where we define predicated instructions to replace branching.
      • DSP ISA
        • IF corresponds to a change of program flow and impacts the very early fetch.
        • Predication corresponds to two executions "in parallel" with a final mux to make the right choice. The hardware performance impact is negligible.
      •  MATLAB
        • IF can be seen as a change of program flow, and it is difficult to visualize it on a vector.
        • Predication is implemented with logicals.
      Example 
      y=(rbit==0)?x+7:x;

      In Matlab,
      y=x;
      if rbit==0, y=x+7; end
      will work only on scalars. The brute-force solution is to create a for loop (as in C).
      But if we use logicals
      cc= eq(rbit,0)
      y(cc)=x(cc)+7;

      y(~cc)=x(~cc);
      This looks much more like predicated DSP code, and it works on vectors.
      Now, the funny thing is that, while it is perfectly legal, the standard Matlab style is to use the Find instruction.
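
      For readers without Matlab at hand, the two styles can be sketched in plain Python. This is our own illustration: the function names are hypothetical, and the find helper returns 0-based indices whereas Matlab's find is 1-based.

```python
def predicated_add7(x, rbit):
    # Condition-code vector, one flag per element (the 'cc' logical above)
    cc = [r == 0 for r in rbit]
    # Both outcomes are computed, then a per-element mux picks one
    taken = [v + 7 for v in x]
    return [t if c else v for c, t, v in zip(cc, taken, x)]

def find(flags):
    # Indices of the true flags (0-based here)
    return [i for i, f in enumerate(flags) if f]

def find_add7(x, rbit):
    # Find style: build the index list first, then update only those elements
    y = list(x)
    for i in find([r == 0 for r in rbit]):
        y[i] = x[i] + 7
    return y

x = [1, 2, 3, 4]
rbit = [0, 1, 0, 1]
print(predicated_add7(x, rbit))
print(find_add7(x, rbit))
```

      Both give the same result. The difference is structural: the predicated form does a fixed amount of work per element, while the find form produces an index list whose length depends on the data.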

      FIND
      Without going into the details of the usage of Find (see Mathworks), the implementation of Find as a COP or as C code is a heck of a challenge. But, to a first order, its structure is based on a CAM.

      CAM
      CAMs have been around for a long time and had(?) their hour of glory in networking chips. In DSP a CAM is a natural for everything 'RECOG' (speech recognition, pattern recognition, etc.).
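
      A behavioural model of a CAM fits in a few lines of Python. This is only a sketch under our own assumptions (the class name Cam and its methods are hypothetical, and only exact matching is modelled, not ternary matching): the hardware compares the key against every entry in the same cycle, which the loop below stands in for.

```python
class Cam:
    """Behavioural model of a content addressable memory (exact match only)."""

    def __init__(self, depth):
        self.entries = [None] * depth      # None marks an empty entry

    def write(self, address, word):
        self.entries[address] = word

    def search(self, key):
        # One match line per entry; in hardware all comparisons happen in parallel
        match_lines = [word == key for word in self.entries]
        # Return the addresses whose match line is set
        return [addr for addr, hit in enumerate(match_lines) if hit]

cam = Cam(depth=8)
for address, word in enumerate([0x12, 0x34, 0x12, 0x56]):
    cam.write(address, word)

print(cam.search(0x12))
print(cam.search(0x99))
```

      A normal RAM maps an address to a word. The search method does the reverse and maps a word to the addresses holding it, which is the same shape as find(x==key).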

      ASPRO
      It is the generalization of the CAM concept to general purpose computing. ASPRO is as old as the world, and a large body of references is available.


      References found in my 2004 garbage can
      1. Florin Baboescu et al., "Hardware implementation of a tree-based IP lookup Algorithm", ST Microelectronics, Year YYYY
      2. Kohonen 1980 .. a book
      3. Asanovic and Chapman 1990
      4. Asanovic"the Space chip"    keyword: PADMAVATI
      5. Cypress Ternary CAM
      6. Djamshid Tavangarian, "Flag-oriented parallel Associative Architectures and Applications", IEEE, Nov 1994
      7. NeoMagic "the technology of APA" ..this one beats them all! Genius or monumental stupidity? still open after all these years. 
        1. there were at least 2 challenges : 
          1. software required double optimization and porting
          2. iterative algorithms were impossible. We are working on that one!
      8. Romain Saha MOSAIC " CAM speeds up lossless compression"  EDN 09.29.03
        1. see also www.commsdesign.com circa 2003
          • Good try, Romain!
      9. ALTERA A.N. "Implementing High Speed Search Applis with Altera CAM" July 2001, A.N 119
      10. GEC Plessey "PNC1480 LAN CAM" 
        • gosh!
      FURTHER : INTELLIGENT MEMORIES seen by COMPUTER DESIGN MAY 1998 !!
      • Matrix transposing and Multiplying
      • Wavelets transforms
      • Lossy and lossless compression
      • Cryptography
      • Graphics Accelerator
      FURTHER: replacing time by space

      1. Using GBytes memory to store all possible outcomes. 
        • Not your average chip.

      Sunday, November 6, 2011

      DSP - Architecture Past, Future

      DSP  ::  Architecture Past, Future
        1. The dinosaurs (70s)
          1. Computers only (MIT Lincoln)
          2. The first chips (TRW, AMD bit slices, other bit slices)
        2. The first steps (79-80)
          1. Intel 2920
          2. Bell Lab internal
        3. DSP of the First Kind (80-95) : A good compromise between application specific and General Purpose (GP)
          1. The PSI
          2. NEC 7811
          3. AMI2811
          4. TI 320C10
            1. TI C10,C20,C25
          5. Motorola 56000
          6. And many more
          7. ADI enters the fray (21xx)
          8. ATT becomes public (16xx)
          9. TI starts a revolution every 4 years 
            1. FP (89)
            2. MP (93)
            3. VLIW (97)
          10. Motorola weathercock: 24/24 16/24 24/16 16/16
          11. ADI "one day I will be bigger"
          12. ATT "only the professional" 
          13. Refer to BDT 1995 for summary
          14. Smelling the RISC takeover
          15. DSP group "the revenge of the trees"
          16. Carmel "the last honest DSP!" 
        4. DSP of the Second Kind: Back into CPU mainstream (95-05)
          1. TriCore
          2. ARM9E
          3. Hitachi SH
          4. Extensions to CPU ISA (ARM, MIPS, PPC, Intel, HP, SUN)
          5. TI: from C62 to C66
          6.  ZSP, Starcore
          7. Blackfin, Tiger Sharc
        5. DSP of the Lost Kind? What shall we try now? (05-10)
          1. Customized DSP
            1. Wireless DSP
            2. Customisable DSP
            3. Customizable Core: TenSilica, Arc,3-DSP
          2. Multi-PE (Processing Elements)
            1. Impact of MM architecture
          3. MP, MC (Multi channels), MT
          4. Re-configurable computing
          5. Heterogeneous Platforms
          6. Matlab and Custom DSP
          7. Is GPU the latest smoking pot? Or is it Cell? 
        6. DSP of Any Kind
          1. DSP boards
          2. DSP custom Chips
          3. FPGA platforms
        7. DSP of the Third Kind (10-20?)
          1. Back to the future: Coprocessor seen as a low complexity DSP
            1. ASP DSP
            2. COP DSP
            3. A platform to build the platform: the framework
          2. Added complexity: Matlab to implementation



      Architecture evolution
      For simplification sake, we will use the terms archi80 (DSP of the first kind),  archi 90 (DSP of the second kind) and archi 2000 (DSP of the lost kind).
      • The typical archi80 is made of 3 blocks (DAU, AGU, PCU), buses and memories. These 3 blocks are respectively the Data Arithmetic Unit, the Address Generation Unit and the Program Control Unit. The terminology matches the architecture previously developed in bit-slice designs. To be more precise, the architecture was centered on designing datapaths with a microcode memory in the control plane.
      • The typical archi90 turns the concept by 90 degrees so that now everything is seen from the instruction perspective. To that, a new central block is added: the Register File. Hence the 3 aforementioned blocks became respectively (IP, LS, Fetch/Decode). There is no doubt that archi90 is much more sophisticated than archi80 and provides a solid foundation. But it is also so complex that it needs a scientific foundation (Computer Architecture) and specialized engineers (Architects)...
        • When Risc proponents mention the simplicity of the Risc model (as for instance the DLX, MIPS1) they refer to CPU debate of RISC versus CISC (in which they misplaced DSPs). But implementing a DSP with Risc principles is not an average ARM7 design.
      • The archi2000 opened new directions in architecture but none of them was a major change. 
        • Many new directions (such as multi-core) have very little to do with cores. They are just platform choices. 
        • Configurable cores (TenSilica, Arc) boil down to a finer-grained way of building a SOC.
        • Reconfigurable computing is a real breakthrough since it combines (re)build and run in hardware. Unfortunately it is still in its infancy.
        • And so are similar concepts like JIT interpreter (Transmeta) .
        • Because of its popularity, the C6x with its VLIW and its 2 cluster datapath is a real breakthrough. But TI is kind of going backward. Firstly their VLIW looks more and more like static SuperScalar and the 2 cluster never expanded into 4,8,16 clusters. Instead they rely on clock speed, multiple multi-level caches and MP for performance. Hardly original...
        • All similar attempts (Multi Media CPU) are now dead. 
        • For a time there was a revival with the IBM Cell and Nvidia GPUs, but it is difficult to see the future of DSP when looking at these beasts.
        • We had a series of gizmos such as stream processors which petered out like the rest. 
        • Now none of these new trends were useless. All will have some impact on future DSP architectures.
        • For our purpose, our favorite trend is configurable computing. For instance, a DSP which includes a Matlab JIT or a DSP which reconfigures its operating units on the fly.
      • The archi2010 makes a simple observation:
        • All efforts are now on platform design.
        • Software (the lack of it, the price to develop it) is the main killer for a platform.
        • Developing a new GP core is useless.
        • Only a customized core can justify the investment. And not even a full core, because of the price of developing new tools and software.
        • The space remaining for DSP is either coprocessing or a DSP core so simple that the tool development effort is minimal.
          • the second case is for simple apps (1K-2K code)



      REFERENCES -Found in the garage - Filed under DSP 
      1. Floating Point DSP
      REFERENCES -Found in the garage - Filed under CPU
      1. MultiMedia MM CPU
      2. DLX