Sunday, December 25, 2011

Wireless Cellular COP

Wireless Cellular COP
The goal of this section is to describe the historical impact of the MIPS increase in wireless platforms.
  • Wireless coprocessing
  • Wireless coprocessor
  • Wireless extension
  • ASPI (Application-Specific Instructions)
  • Wireless DSP  {tbd Paulin et al.}

Background                                                            
Until GSM (1987), DSP was not a very profitable business. Over the next 10 years, DSPs and many CPU architectures were driven by the wireless craze. With the 3G introduction (1998), the quiet MIPS evolution took a serious blow. Now based on CDMA, the partitioning of tasks between software and hardware became a very serious problem. More than 10 years later, with the 4G introduction, things have slowed down to a normal pace and the hardware school has won. Most DSP-intensive tasks use some kind of dedicated processing and there are no signs it will change.
1987-1996
In 1987, the standard DSP had around 20 MIPS (we will use MIPS instead of DSP MIPS, see note 1) and could implement a speech coder or a modem. As both applications increased linearly in complexity, so did the DSPs (and vice versa). The GSM standard was devised to fit nicely into the then most common TI DSP. With the increase in Channel Coding (CC) complexity we could see a slight shadow on the horizon, but nothing we would not be able to tackle in software. In a nutshell, DSP was wireless, wireless was DSP, DSP was software, and software efficiency was achieved by creating new instructions (so-called ASPI), especially the "Viterbi" instructions [ref 1,2]. This did not last long.

1) By introducing CDMA, Qualcomm changed the MIPS partitioning. Until then, (de)modulation schemes had very little impact on the DSP, but CDMA required (say) another 100 MIPS, which could only be implemented in hardware.
2) A revolutionary CC scheme, Turbo Coding (TC), was now on the books and required another order of magnitude. Even without mentioning TC, Viterbi Coding (VC) had reached another 100 MIPS.
1996-2001
The space odyssey. By 1997 it was clear that an ALL-software implementation of the wireless communication standard would require a DSP with 300 (DSP) MIPS. And we are speaking about a basic cell phone (speech or 16 kbit/s modem). But in 1996 we had two extreme schools. The first one not only wanted to do everything in software but also to integrate the MCU tasks in the DSP; hence they went on designing the 300 MIPS DSP. The second one had high-end RISC CPUs already running at 300 MHz or more and believed, wrongly, that downgrading cost-wise was possible and that writing the software was a question of man-months. Both are dead, but the idea still floats around [6].
Over the years we quietly realized that, except for the speech coder, the cellular standards were best implemented in coprocessors.

2000-2003
That was half the story. Concurrently, the wireless DSP and its little MCU became the wireless platform, and very soon the little MCU had an even bigger MIPS problem due to data, audio and video. With 3G, wireless platforms turned into 3G computers and soon the multimedia crowd ("I smell dollars now") was proposing the SIMD engine as a solution to the issue.

But, once more, the solution was to introduce COPs to mop up the MIPS excess [ref 3,4,5].
Today and tomorrow
All DSP-intensive tasks use some form of dedicated processing (which can be a COP or a programmable DSP). From the host perspective everything is a COP. Because, for each COP, there is no sign that the algorithm will simplify over time (on the contrary), the architecture partitioning will not change. What will change is the methodology to go from the Matlab algorithm (or experimentation) to the implementation.

References
  1. For instance one can refer to the TI family ISA (C5x --> C54x --> C55x).
  2. Or the DSP Group (CEVA) Pine, Oak, Palm evolution. A famous instruction took 1 cycle per value to find the address of the minimum in an array. It used a wire to block the AGU address... good luck to the pipeline.
  3. Sollenberger et al., "BCM 2132", Hot Chips, Aug 2003
    1. COPs (to the ARM9E): EDGE accelerator, CC accelerator, Incremental Redundancy, (external DSP) Teak
  4. "Intel XScale, Wireless MMX", Hot Chips, Aug 2004
    1. Execution units: shift and permute unit to solve the sub-word parallelism problem
  5. "Renesas Hitachi", Hot Chips, Aug 2004
    1. COPs: 3D graphics, MPEG4, LCD controller, Java accelerator
  6. Happy Camper and his troop, "Terminal centric view of reconfigurable system architecture and enabling components and technologies", IEEE Comm. Mag., May 2004
   

Saturday, December 17, 2011

FP hardware

Floating Point (FP) hardware
The goal of this section is two-sided:
1) Emphasize the historical importance of the FP COP among all classes of COP. For that, we cover all types of FP hardware: peripheral, coprocessor, execution unit (FPU), building block, FP boards, FP DSPs...
2) Highlight the natural relation to Matlab (and the ease of implementation).

Background
On May 8, 1980, Electronics [ref 1] had a wonderful concept drawing (*) of two hands working together to illustrate the upcoming Intel Numeric Processor (sic!) [ref 2], which happened to be the birth of a new type of chip: the coprocessor (COP).
At the time this was real progress, as the only way to add FP performance to microprocessor-based equipment was to use either the AMD 9511/2 peripheral [ref 7] or a specialized board.

Of course, computation-intensive coprocessing had existed in mainframes, including the somewhat difficult dialog between the Cray-1 and DEC minicomputers.
Today the FPU is so much a part of the CPU that it is difficult to understand all the brouhaha over a separate chip. But one has to put oneself in the context of limited silicon real estate. Incidentally, this is also the situation with today's embedded resources.
Very soon Motorola answered with the 68881, which benefited from the more advanced interface of the 68000 architecture [ref 4]. In the same spirit, but without the same success, NS [ref 5] came up with their own chips.
Meanwhile, as the years went by, Intel was climbing up the numbering scheme with the 80287, then the 80387 [ref 3], then the 80487. Except it took us a while to realize that the 80487 was not really a coprocessor chip. While we will not bet on it, the 487 was purely an integration exercise and the interface might have been the same COP interface. But it was now 1990, the heyday of computer architecture, and the RISC approach was dominant. The FP COP became the FP unit (FPU), totally integrated into the pipeline [ref 8] and, in the case of superscalar machines, working in parallel with the Integer Processing (IP) unit. From that point the story is largely available on the web [refer to Hot Chips] and goes well beyond this background's scope. Much more relevant to our scope is the story of FP Building Blocks (see FP BB).

FPU and FP extensions
We will give an incomplete list of FP units and FP extensions. FP extensions are characterized by a separate ISA and a separate document added to the base architecture. This is not the case for the Pentium, which is natively FP, but it corresponds to pretty much all other architectures, even the PPC.
  1. PPC: Book E may 99
  2. ARM: 
  3. MIPS:
  4. Hitachi: SH7705 FPU
  5. TenSilica: 
  6. TriCore: TriCore FPU
  7. TI DSP C28xxx
    1. This is an interesting case as they offer two very different solutions.
    2. The 283xx core is a 28xx core to which an FPU has been added. This is the standard situation.
    3. The 2803x family, called Piccolo, does not change the 28xx core. The FPU is an I/O COP which acts as a CPU front end by processing signals coming from the ADC modules. [ref Piccolo Control Law Accelerator]

FP BB (1985-1995) [ref 10]
1) At first we had the usual DSP/bit-slice school (TRW, ADI, AMD), which naturally increased their portfolios with a DAU (see the bit slice section). They started at 4 bits and went up the integer curve: 8, 16, 32. At 32 bits, FP had the same data size, so why not? It has the disadvantage of a larger die due to the FP adder, but note that the multiplier is only 24x24. The biggest issue is the large step in complexity (such as the IEEE standard), which is nevertheless a relatively small price to pay for the comfort of numeric accuracy. Very soon, as always in FP, 32-bit was not enough and 64-bit chips appeared, eventually the main market, especially the one associated with the IEEE standard.
2) Driven by new names (Weitek, Cyrix, IDT), the 64-bit FP BB turned into the Intel coprocessor-socket business, which at this writing still baffles me completely. They must have missed the RISC revolution somehow. Even with the 486, Weitek was still pushing the usage of an external COP. That being said, there are quite a few lessons to learn: their block diagrams are marvels, a heck of a datapath, and a good host-to-COP interface (that was no Cray-2).
Quick list
    1. MULTIPLIER
      1. TRW 1042
      2. Weitek 1032,1064
      3. Weitek 2264  (IEEE)
      4. ADI 3210
    2. ALU, RALU,
      1. TRW 1022/3 (22-bit FP)
      2. Weitek 1033, 1065
      3. Weitek 2265 (IEEE)
      4. ADI 3220
    3. DAU -->  COP
      1. AMD 29325/C327
      2. Weitek 3132/3, 3364
      3. LSI 64132, TI ACT8847
    4. Intel COP
      1. Weitek 3167, Cyrix 83D87
FP DSP
follow this link 

FP IP (1995-1999), FP in FPGA (now) [ref 11]
Last in the whole story is the Intellectual Property (IP) craze of the second half of the 90s, where amateurs wrote a bit of C code and sold it as a product (virtual silicon?).
Today the best ones are productized pieces of IP for FPGAs. Xilinx, Altera, etc. have solid specs describing these components. Multiple other sources can be found on the web.
GPU (now)
The story would not be complete without mentioning the GPU. While its architecture is somewhat exemplary, it is well beyond the scope of this section.

Topics
  1. Coprocessor Interface
    1. 68881 outsmarts the 8087 with non blocking
  2. How much IEEE complexity?
    1. The "sacred" standard.
    2. ARM cheeky solution: problem with exception? no problemo, invent a new IEEE model
  3. TI gives up integer because Matrices and Matlab: the C66

Saturday, December 10, 2011

AccelProc in Communications

Accelerated Processing in Communications

The goal is to analyze accelerated processing in telecommunications. Are there any specific patterns? Lessons and trends are drawn.

Background
Because of the serial nature of the communication signal, and its speed ratio relative to the CPU, the only realistic way to implement a communication function was to design a dedicated peripheral. This was the case in the 70s/80s (for instance UART, HDLC). With the dramatic increase in CPU raw speed (late 80s) it became possible to implement a communication protocol in firmware [ref 1]. But by the 90s it was back to hardware, and since that time there has been an acceleration in communication bandwidth which renders software implementation impractical (the so-called "Shannon gap"). Still, we have seen over the years [ref 2] multiple attempts at so-called "wire processors", but we feel it is a bit of a white rabbit chase.

Description
Traditionally, communication (COM) is divided between infrastructure and terminal; in the middle stands the gateway. Obviously infrastructure (base station, DSLAM, VoIP line cards, CMTS) requires N times more processing power than a terminal.
Because most COM processing is done in dedicated hardware peripherals, what is the scope for coprocessing? We answer this question by listing a few COM applications which use COP(s). NPUs and cell-phone platform coprocessing are covered elsewhere.
 
BTS
The cellular BTS requires massive MIPS in Channel Coding (CC), chip-rate modulation schemes (such as CDMA) and OFDM. These constitute the standard COPs:
- CC COPS: Viterbi, Turbo Codes [ref3]
- Chip Rate Processing: Rake receiver, De-spreader, Searcher [ref4]
- OFDM: FFTer  [TI website see C66x core apps].
And a special mention to the gutsy BAZIL, which was made of an LSI/ZSP DSP and an "uncommitted array" of gates to implement the typical bit-level algorithms [ref 5].
We also worked on an architecture made of uncommitted COPs (linked through SRIO) and commanded by linked lists in global memory. This is what we will call a high-end COP interface.

 
PHY
The point is that there are NOT that many ways to architect a PHY; the most flexible one is to use a cheap core (ARM7, 8051, Cortex) and to connect it through a light interface to hardware datapaths. This can be considered de facto coprocessing.
The problems start when the datapaths require more and more DSP, most of all when the signal processing functions become adaptive and require some light processing from the cheap core. Further, as time progresses (and customer requests pile up), the signal processing becomes less linear and more control oriented. Then the cheap cores and light interfaces very soon need more help (a coprocessor for the coprocessor? why not?).


Protocol Processor
Over the years many people attempted to create a new type of processor: the PP. It was to be to the link layer what the DSP is to the physical layer. Largely buried in the dustbin of history, the PP should still not be dismissed. The reason for its failure has more to do with the advent of more complete solutions, such as the PPP (Packet Protocol Processor) and the NPU (Network Processing Unit), than with the concept itself.
As a coprocessor the PP is a very useful building block, more likely to be loosely than tightly coupled. As far as references go, all I have is a lot of NDAs [TBD].

I/O Processor
This is the same story as above: many attempts for little result. As a matter of fact, the classic reason for failure is speed. As the main CPU goes up the speed curve, the dedicated processor gets left behind and becomes a point solution. However, from a coprocessing perspective the equation does not hold anymore: the host speed is pretty stable relative to the system workload.
So an IOP can also be a pretty simple COP [ref 6, Intel 8089] which insulates the host CPU from real-time communication events. The interface is another story {bandwidth and shared memory issues}.

References 
  1. TI 320C20 DSP doing HDLC
  2. Intel at Hot chips but when?
  3. Nat Seshan, "New TMS320C6416 Solves the DSP Challenge", EPF, June 2001
  4. A. Gatherer et al., "A UMTS Baseband Receiver Chip for Infrastructure Applications" (TI TCI110), Hot Chips 2003
    1. A most interesting example of TI using multiple ARM7s + hardware blocks (instead of a programmable DSP) to implement a communication function.
    2. The interface is standard mid-end (setup of contexts, parallel buses, DMA).
  5. Neil Stollon, "Bazil", EPF, June 2001
    1. A heck of a paradigm shift. Still waiting after all these years.
  6. Robin Jigour "Data concentration techniques unload host computers" EDN March 4, 1981
    1. Usage of 8089 IOP with a good description of interprocessor communication handled by linked message blocks
FURTHER : {TBD}

Saturday, December 3, 2011

BB - AGU

AGU (Address Generation Unit)

The goal of this section is to cover in detail the design of an AGU.

Background
The first AGUs go back to the 70s, when people were using bit slices to design boards [ref 1]. With the introduction of the first GP DSPs, the AGU became slightly more sophisticated. In the mid-80s, AGUs reached their peak, not in general-purpose DSPs but in Building Block (BB) DSPs: first the byte slice from AMD (with the 29540, exclusively for FFT) [ref 2], and then the word-slice family from ADI (the ADSP-1410) [ref 3].
The more modern DSPs of the 90s never went overboard on AGUs, mainly due to the simplifying influence of RISC and computer architecture. The AGU gave way to the more general concept of the LSU (Load/Store Unit).

Description
Before describing an AGU there is a pre-requisite and a post-requisite.
The pre-requisite is to have a good idea of the standard addressing modes and their implications.
The post-requisite is the vast number of CPU issues which are not considered here since they are beyond the scope of a simple DSP or COP design: for instance, cache effects, speculation, non-blocking loads, store multiple, user/supervisor mode, etc.


Features
  1. Number of data buses and data memories 
    1. This should give the number of AGUs. But note that it is just as valid to implement (say) a 3-address-bus AGU as 1 unit or as 3 single units. It is the classical trade-off of shared resources versus locality.
  2. Some questions in standard addressing modes:
    1. the simplest DSP architectures do not require stack pointer
    2. pre increment can be implemented in 2 cycles

  3. Circular addressing
  4. Bit reverse addressing (and all FFT related)
    1. while not strictly related to AGU, FFT can use the Block Floating Point mode which implies a load or store with scaling.
  5. Paging
and more advanced 
  1. Vector Addressing modes (concept of strides)
  2.  Interleaving (as used for decimation)
  3. Complex addressing (as in complex numbers)
Interesting
  1. PC relative addressing
  2. A bit processing unit which takes care of bit test/set and semaphores. (see StarCore).
  3. A mask unit
  4. Using an AGU as a second data unit
Number of buses and data memories 
Traditional DSPs had a 2-bus/memory architecture (such that Acc += X*Y) or even a 3-bus/memory architecture (such that Z = X*Y). When the very-high-performance Infineon Carmel was first developed, it had 6 buses. The reason was the dual multiplier (since Z = X*Y in parallel with W = U*V gives 6 buses). Obviously it is difficult to be more flexible!
When we designed TriCore, one key idea was to separate memory and data type. So, instead of having the standard DSP X and Y memories, each 16 bits wide, we had a single 32-bit-wide memory. This not only dramatically simplified the core design but also fell perfectly in line with the TriCore base being a 32-bit CPU. And when we improved the design with a dual MAC, we doubled the bandwidth with a 64-bit bus.
The bottom line is that multiple memories can be avoided. Having 2xN or 1x2N is exactly the same in terms of bandwidth and performance. The main impact is more complex programming (to be explained; see software pipelining).

On the other hand, some algorithms become so complex to implement on a single-bus architecture that the simplicity of the single bus gets lost.

Standard addressing modes
The standard addressing modes are generally all similar (p++, p--, p+K and variations). The main issues are the number and width of the registers. The only real difficulty is the stack pointer, which requires a pre-increment (++p), totally at odds with the other addressing modes (they all use post-modification). This is problematic since addresses must be generated as fast as possible and having an adder in the worst-case path is not recommended.
The solutions are multiple:
- have a 2 cycle instructions for push (or pop)
- prepare all addresses in advance and have a final mux
- bite the bullet and implement it; this is even truer if the AGU contains a base+index addressing mode.  

Modulo addressing
Modulo (or circular) addressing is easy to define, explain and implement:
     if m >= 0 (the pointer is incrementing):
       address = (ptr + m >= base + length) ? ptr + m - length : ptr + m
     if m < 0 (the pointer is decrementing):
       address = (ptr + m < base) ? ptr + m + length : ptr + m
What are the issues?
1) Number of concurrent circular buffers.
1 is not enough, 2 is a good compromise, but a need for 3 can easily be met. In a regular ISA, 4 or 8 is not unusual.
2) The issue of simplification.
The standard equation above requires 4 registers (ptr, inc, base, length). This can be simplified to 3 and even down to 2 registers. It is very tempting to simplify by masking address bits; the circular buffer length must then be a power of 2. This is a very poor solution.

3) The last issue concerns modern CPUs more than classic DSPs.
Generally a DSP has a unique data size (the native size). In a modern CPU the memory access is independent of the data type. Hence accessing a long (2 packed shorts) on a circular buffer of shorts will run into alignment problems.
The 4 register model
current pointer (p), modifier (m), base (b), length (l).
The modifier (also called offset) is the post increment/decrement.

cc1 = gt(m,0);
cc2 = lt(m,0);
if cc1
   if (p+m) < (b+l), p = p+m;
   else p = p+m-l; end;
end
if cc2
   if (p+m) >= b, p = p+m;
   else p = p+m+l; end;
end
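
As a cross-check, here is a minimal Matlab golden model of that update (our own sketch, not any vendor's exact semantics; it assumes abs(m) <= l):

function p = circ_update(p, m, b, l)
% 4-register circular update: pointer p, modifier m, base b, length l
if m > 0
    if p + m < b + l, p = p + m; else, p = p + m - l; end
elseif m < 0
    if p + m >= b,    p = p + m; else, p = p + m + l; end
end
end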

The 3 register model
current pointer (p) , base(b),  length(l).
Same equation as above but m is taken from the opcode immediate field. (say -16:+15).

The 2 register model (easy)
In this scheme the two registers are the current pointer (p) and a mask (m). The length of the buffer is limited to a power of 2, and the base must start on a power-of-2 boundary. The mask gives the position where the pointer is cut in two. For example, for a 64-entry circular buffer:
  p = concat(p_up, p_lo), where p_up is the pointer with its lower 6 bits masked off and p_lo is a counter running 0:63.
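
A small Matlab sketch of that masked update (our own illustration; the base value is made up and must be 64-aligned):

mask = uint32(63);              % length 64 = 2^6
p    = uint32(4096 + 60);       % aligned base + offset 60
m    = uint32(5);
p_up = p - bitand(p, mask);     % upper bits, lower 6 bits cleared
p_lo = bitand(p + m, mask);     % lower 6 bits wrap modulo 64
p    = p_up + p_lo              % 4096 + 1, i.e. wrapped past the end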
The 2 register model (freescale 56800)

This is similar to the above but with much less severe limitations. The base can be anywhere and the length can take any value. The trick is that the buffer will always occupy the space of the next power of 2. So a modulo 366 will require the space of 512 words but will have 366 as the upper bound (and zero as the lower bound).

The 2 register model (TriCore )
On TriCore we had 16 address registers and there was no reason not to try for the best, which was at the time 8 simultaneous circular buffers... We are joking; in fact this number is given by the regularity of the instruction set. Since we had the concept of a register pair, it was simplest to define modulo addressing as using any register pair, but with a special meaning.

So we had to match 3 values, current pointer (p) , base(b),  length(l) to a register pair (say D0,D1).

And we mapped it as follows
D1= base    (32-bit)  
D0= length || pointer (both 16-bit) 
Finally, the post modification (m) was a 10-bit signed value given by the opcode immediate field.  
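
A hypothetical Matlab sketch of that packed update (our own naming and made-up values, not the exact TriCore semantics):

D1  = uint32(hex2dec('A0001000'));                    % base address
D0  = bitor(bitshift(uint32(200), 16), uint32(196));  % length = 200, pointer = 196
m   = int32(8);                                       % 10-bit signed modifier from the opcode
len = double(bitshift(D0, -16));
ptr = double(bitand(D0, uint32(65535)));
ptr = mod(ptr + double(m), len);                      % wrap inside [0, len)
D0  = bitor(bitshift(uint32(len), 16), uint32(ptr));
addr = double(D1) + ptr;                              % effective address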


Bit reverse addressing
Bit reverse is the kind of feature which looks like a bottomless pit. There is no end to the complexity.

1) First let us start with the golden model, which is not so easy to generate.
  • Generating it in C is itself prone to errors, so this is not a very good start for verification.
  • Obviously, using a mirror on a binary table of increasing numbers will give the result.
  • But in fact the bit-reversed table can be easily generated. For instance in Matlab, with x the input index and nbits the address width:
       y = dec2bin(x, nbits);
       bitrev = bin2dec(y(end:-1:1));
  • Or constructively: starting from (0,1), a new pair of numbers is generated by multiplying by 2 (0,2), to which the same pair plus 1 is concatenated (1,3). The next step, from (0,2,1,3), generates [0,4,2,6] and [1,5,3,7]. Etc., ad vitam aeternam. You can use Excel, Matlab or C to do that (see the sketch below).
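
A minimal Matlab sketch of that doubling construction (our own code; 0-based indices, N a power of 2):

N = 8;
t = 0;
while numel(t) < N
    t = [2*t, 2*t+1];          % double, then append the same values plus 1
end
t                              % [0 4 2 6 1 5 3 7]: the bit-reversed order of 0:7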
2) The implementation in hardware seems trivial, but it is not.
  • We can for instance reverse the wires, such as bitrev(13:0) = bit(0:13).
  • The problem is that this is only valid for a word width of 14 (a 16K FFT). You cannot use the least significant 8 bits to do an FFT256.
  • But note that more modular methods exist (an adder with reversed carry).
References

1. R. J. Karwoski, "A general purpose address controller for real-time and array processor applications", TRW reprint (1981?)
2. AMD 29540, as part of the byte-slice portfolio
3. ADI ADSP-1410 Word-Slice Address Generator data sheet (~1985)
4. Eric Martin, "   ", 12 May 1986
5. Motorola DSP56800 Family Manual, chapter 4, AGU
6. ADI Blackfin
7. TI C80x
8. StarCore SC1400
9. Infineon TriCore Architecture Manual
10. Infineon Carmel
11. TI C55x
12. Bier, Shoham et al., "DSP Processor Fundamentals", Section 6: Addressing, BDTI 1994-96

Bit Slice BB

Bit Slice Building Blocks

The goal of this section is to cover a quick history of bit-slice logic and its impact on the first generation of fully integrated DSPs, and to provide a methodology for today's COPs.


Background
See references AMD Mick and Brick, blakeley etc..

Description
Bit-slice logic was not only a series of components but, most of all, a technique to build a General Purpose (GP) computer and, best of all, a GP DSP.
Basically any DSP can be built from 3 units (or BBs):
  • PCU: Program control Unit
  • DAU: Data Arithmetic Unit
  • AGU: Address Generation Unit
The way the 3 units are connected does not vary. What varies is the features of each unit, the number (and width) of buses, the number of memories and the microcode.


Program Control Unit (PCU)
Its function is to fetch instructions and to decode them before dispatching to the respective units. In the bit-slice model the decoding is not part of the PCU: since the instructions are very wide and monolithic, there is an intermediate level (the microprogram store). A good example of a PCU is the 29110 sequencer from the original AMD 29xx bit-slice family, but all DSPs have a section describing the PCU; that section goes by various names.
 

Data Arithmetic Unit (DAU)
The DAU goes by many names and comes in many types. The original AMD 29xx family had the 2901, called an RALU (register file + ALU). And because we were processing signals, we added a multiplier, which gave rise to the 2 standard topologies (multiplier in series or in parallel). The introduction of GP DSPs more or less standardized the 16x16 multiplier + 40-bit accumulator structure. With the 90s and the popularization of computer architecture came first the name IP (Integer Processing), then the parallel topology with all units attached to the register file, then new types of units (Bit Manipulation Unit, Bit Field Unit, Shuffle), then duplication of units (superscalar) and sub-word parallelism.
Today, the DAU is where most of the action is. Especially since our goal is to implement Matlab functions.   

Address Generation Unit (AGU)
The AGU is a bit of a poor man's BB in the bit-slice family. For instance, the original AMD family did not have one (it was simpler to implement it with an RALU). This is still partially true today: an AGU is a degenerate ALU (1), but like anything dealing with memories it should not be underestimated.
See here for more details.


(1) In the simplistic model of computer architecture there is no AGU. Firstly there is no load/store parallelism and moreover there is a unified register file (see ARM7, ARM9, MIPS, PPC, etc..). {obviously I do not go into superscalar here}.

Friday, November 25, 2011

DSP of the second kind

DSP of the Second Kind (1995-2005)
The goal of this section is a historical perspective covering the time when DSPs became CPUs and vice versa.


Background
We are now in 1996 and for the last couple of years DSP has been fashionable in Silicon Valley. To complicate things further, DSP is just seen as a subset of the shining keyword of the 1990s: "MULTIMEDIA". Ta-da! Pretty much every processor vendor is preparing something. A good summary of that time is found in Jeff Bier's presentations [ref 1 and 2].

Description
Roughly speaking there were 3 and a half types, here classified by interest order.
  1. Add a DSP COP to a CPU (or more likely a MCU).
  2. Add a DSP extension to an existing CPU ISA
  3. Start from scratch and build a completely new architecture. It can be:
    1. a DSP based on RISC principle (ZSP)
    2. a DSP based on more RISCY than Thou Principle(C6x)
    3. a RISC CPU equally good at DSP (TriCore)
    4. a DSP with a register file 
  4. Multimedia processors (MM)
    1. We will not mention MM further, except that for the sake of simplification we will put Pentium MMX  in this category instead of ISA extension. Remember that we are speaking about DSPs.
  5. Not considered: dual-core platforms such as consisting of a DSP core  plus a CPU core.
When looking at the 3 remaining categories, they can all be grouped under DSP extensions. Categories 1 and 2 differ only in the level of integration: category 1 integrates the DSP as a COP, category 2 integrates the DSP as a separate execution unit or as a fatter datapath. In the end it all comes down to implementation details. Funnily enough, the category 3 (new) architectures all started monolithic, but under the influence of the deconstruction school (TenSilica) many came around to the concept of a simpler core + extension.
We can also lamely argue that a C64x can be seen as a CPU (1 cluster + the 3 simplest execution units (EU)) or as a powerful DSP (2 clusters with all 4 EUs).

Category 1- DSP COP

Largely obsoleted by category 2 and 3, we will mention a few products:
  1. SH-DSP (1996) had a COP added to the SH-3 core ("a good classic model")
  2. ARM Piccolo (1997) was architecturally identical, but with the bonus of original solutions to some basic problems (to be studied in detail)
  3. Siemens C166/ST10 MAC (1998) started as a COP and finished as the fully integrated ST Super10
  4. And later, for instance, the Massana FILU, which was a piece of IP (a COP) to be attached to a host
  5. We will not mention here the Tensilica Vectra or similar which are filed under Vector Processor.  
Category 2- DSP extensions to CPU
Here are examples of DSP extensions to common CPU ISAs. Note that not all of them are "Q format" types; some are more commonly classified as MM extensions (*VP*). From our perspective, the difference between the two is better understood in terms of DSP generation.
  1. PowerPC
    1. Altivec and variants (*VP*)
    2. MPC8xx  DSP addendum (MPC8xxRMAD Rev.0.1., 10/2003)
  2. ARM (we are a bit lost here)
    1. Move
    2. Neon (*VP*)
    3. MM extensions
    4. SIMD
    5. ARM9E
    6. Xscale WMMX , WMMX2
  3. MIPS
    1. Lexra DSP extensions
    2. MIPS DSP extensions
  4. Coldfire DSPon
  5. Hitachi MM extensions
  6. TenSilica Vectra, VectraLX (*VP*)
  7. ARC SIMD at MPR05
  8. PIC MCU adds DSP (2005)
  9. Intel SSE4
  10. Sparc, HP --> see Ruby Lee, AMD, Alpha, MIPS Madmax

Category 3- New DSP Architectures
Further sub categorized as
  1. New CPU
    1. The unique illustration as such is TriCore (1996). 
  2. New DSP   
    1. Blackfin (1998) presented itself as a hybrid, but it is really a DSP (hey, the 40-bit native register width gives you away)
    2. Tiger Sharc (1999)
    3. Starcore (1999)
    4. TI C62x (1997) became C64x(2000) then C66 (2010)
      1. And of course C67
    5. New name: ZSP (1997)
    6. tons of other hopefuls
    7. We will not mention the Infineon Carmel and the DSP group family (see "revenge of the trees" and "the last honest DSP" in chapter DSP of the First Kind). 
Category (outside): dual-core platforms
-> see platforms, Multi-Core.
  • For the record
    • 68356
    • Dual core consisting of a DSP  plus a CPU (typ: ARM7 + OAK)

    References
    1. BDTI  “DSP on General Purpose Processors—An Overview”, presentation to MicroDesign Resources dinner meeting, January 1997. 
    2. --> giving rise to the BDT Guide - DSP on GP CPU (1997)
    3. also comparing BDTI guide 2004 and 1995 reveals the amount of the evolution.
    4. BDTI guide on TriCore (not available)
    5. BDTI guide on StarCore 
    6. Multiple press vulgarus articles on "hybrid", C compilation and register files.
    7. The many TI VLIW white papers 

      Thursday, November 24, 2011

      VADD

      BDT Benchmark - Vector Add

      Goal
      Introduction to benchmarking as an architecture tool, with a deliberate blurring of hardware and software in order to teach architecture.

      Background
      Here will go a long explanation of manufacturer benchmarks (in the 80s), before BDTI and other material, to explain that vector add is not a BDTI copyright. But by choosing Ntaps=40, BDTI turned it into a standard, in the same way that they turned the FFT256 into another benchmark standard.
      Description
      The VADD401 BB is remarkable because it is both very simple and complex. I believe that studying it will cover maybe 50% of the low-hanging fruit of structural architecture.
      We will now present results for several machines (DSP, CPU, etc.). For some of the results we used BDTI as a reference. We also use "Matlab" as a descriptive language.
      The cycle equation is defined as: total cycles = n*40 + ot (overhead) + p (pipeline), where n is the number of cycles per sample. In most cases ot and p are merged into one.
      The operation is
        z = x + y   where x, y, z are of native data size (typically 16-bit)
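
      A hedged Matlab golden model of the operation (our own reference code, not BDTI's):

      N = 40;
      x = int16(randi([-32768, 32767], 1, N));
      y = int16(randi([-32768, 32767], 1, N));
      z = x + y;    % note: int16 arithmetic saturates in Matlab, unlike a wrapping DSP add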


      1) Architecture 1980
      Machine 1: Theory= 3-bus DSP
      fetch
      decode
      x=ld(mem1); y=ld(mem2);
      for ii=1:39, z=add21(x,y); mem3=st(z);   x=ld(mem1); y=ld(mem2);end;  
      z= add21(x,y); mem3=st(z);     

      Machine 2: Standard = 2-bus DSP
      fetch
      decode
      x=ld(mem1);  y=ld(mem2);
      for ii=1:40, z=add21(x,y); x=ld(mem1); y =ld(mem2);  
                    mem3=st(z);
      end     
      Note: the load "one too far" is not considered a problem.



      Machine 3:  CPU  = 1-bus 
      fetch
      decode
      counter=40;
      while(counter>=1)
        x= ld(mem1); 
        y= ld(mem2);
        z= add21(x,y);
        mem3=st(z);
        dec(counter);
        branch(top); % dummy to mark the cost associated with the branch
      end


      Giving following results

      machine 1:    1*(N-1) + 3 = 42
      machine 2:    2*N+4 = 84

      machine 3:    6*N+3 = 243



      The main architectural trade-offs are :
      - number of buses 
      - ZOL (Zero Overhead loop).


      2) Architecture 2000


      Machine 11: 2-bus 4 lane DSP 
      fetch
      decode
      dispatch
      ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
      for ii=1:10, ssssz=add84(ssssx,ssssy); ssssx=ld_by4(mem1); ssssy=ld_by4(mem2)  
                   mem3=st_by4(ssssz);
      end     

      Machine 12: 1-bus 8 lane CPU 
      fetch
      decode
      dispatch
      ssssssssx=ld_by8(mem1);
      ssssssssy=ld_by8(mem2);
      for ii=1:5
                  ssssssssz=add168(ssssssssx,ssssssssy); 
                  ssssssssx=ld_by8(mem1); 
                  ssssssssy=ld_by8(mem2);  
                  mem3=st_by8(ssssssssz);
      end     


      Machine 12: 1-bus 8 lane CPU with 'proper' epilog 
      fetch
      decode
      dispatch
      ssssssssx=ld_by8(mem1);
      ssssssssy=ld_by8(mem2);
      for ii=1:4
                  ssssssssz=add168(ssssssssx,ssssssssy); 
                  ssssssssx=ld_by8(mem1); 
                  ssssssssy=ld_by8(mem2);  
                  mem=st_by8(ssssssssz);
      end     
      ssssssssz=add168(ssssssssx,ssssssssy);  
      mem=st_by8(ssssssssz);


      Machine 13: 1-bus 8 lane CPU with interleaved data 
      fetch
      decode
      dispatch
      [ssssssssx  ssssssssy ]=ld_by16(mem); 
      for ii=1:4
                  ssssssssz =add168(ssssssssx,ssssssssy); 
                  [ssssssssx  ssssssssy ]=ld_by16(mem); 
                  mem=st(ssssssssz);
      end     
      ssssssssz =add168(ssssssssx,ssssssssy);
      mem=st_by8(ssssssssz);


      Giving following results

      machine 11:    2*N/4 +5 = 25
      machine 12:    4*N/8 +7 = 27

      machine 13:    3*N/8 +6 = 21



      The main architectural trade-offs are :
      - replacing multiple buses by single bus 
      - larger width bus 
      - datapath with sub-word parallelism (multi lanes).
      - reorganisation of data in memory


      Saturday, November 19, 2011

      FP DSP

      Floating Point (FP) DSPs  
      [updated nov2015]
      The goal of this section is to understand the relevance of FP DSP to the world of DSP today.

      Background
      [see the Edward Lee Signal Processing Magazine tutorial, circa 1990, for more]
      The first FP DSPs came from AT&T and OKI (1983); the ADI Sharc was the first really successful one, though not as hyped as TI's ("I've seen the future and it is floating point"), introduced in 1987. Anyway, by 1991 [ref 1] each of the 4 DSP stars had an FP architecture:
      • TI C3x then C4x
      • ADI Sharc
      • Moto DSP96000
      • ATT DSP32C
      That was about the time DSP people started questioning the wisdom of "old" DSP architectures as opposed to "modern RISC". For instance, the Intel i860 was starting to cut them to pieces in many markets [ref 2,3].
      • TBD: We will not go into details here, but the argument was largely biased because DSPs were primarily SOCs whereas the i860 was just... well, it was the i860. Try comparing a Bugatti Veyron with a Porsche Cayenne. And Intel with ADI...
      Going back to the DSP mainstream story, the FP DSPs took 4% of the GP DSP market and 20 years later it was still the same. They look largely irrelevant, so why bother?
      • Remembering all these years, it is funny to consider the time we spent explaining to people that FP DSPs were not going to replace FXP DSPs. 

      Features and recent evolution
      Indeed, why bother with FP DSPs?
      Firstly, they were the first DSPs to be architected in a "modern" way.
      • For instance the C40 had a register file and emphasis was put on the C compiler. 
      • Then its successor, the C67x was VLIW
        • which is an extreme CPU technique.. superscalar without safety net... 
      • The same thing happened with ADI Sharc (the only competition to TI). The Sharc was turned into a modern CPU (Tiger Sharc). Mind you, they did not go overboard. It was a static superscalar not a VLIW.
      Secondly, TI introduced DSPs which natively execute both FP and integer.
      • The evolution of the low cost C28xx with the C283xx.
      • The evolution of the C67x workhorse into the C674x family
      • Most interesting is the most advanced DSP core, the C66, which is natively both FP and 32-bit (and 64-bit) integer. It seems to be quite a major statement that the only "relevant" DSP family has chosen to go the FP way. As described by Gene Frantz [ref. 6], the main reason is matrix computation, which is another way to spell Matlab.

      The old arguments turned upside down
      The argument against FP used to be that the complexity was not worth it. Nowadays, with the matrix problem, it is the other way round. If you use FXP you must work in at least 32-bit (so you lose the cost advantage of the smaller data size) and develop much longer algorithms (so you lose the code-size advantage and, worse, the power consumption). Altogether, the system price of an FP datapath is lower than that of an FXP one.
      Especially since, in a CPU architecture, an FP unit is just added next to the IP unit, so it is easy to figure out the cost.
      (Well... the exception model might suffer a bit too..)
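
      A small Matlab illustration of the point (our own example): a Q15 dot product needs a wide accumulator and explicit rescaling, where single-precision FP needs none of that bookkeeping.

      n  = 64;
      x  = randn(1, n)/8;  y = randn(1, n)/8;                % keep values well inside [-1, 1)
      xq = int16(round(x*2^15));  yq = int16(round(y*2^15)); % Q15 quantization
      acc   = sum(int64(xq) .* int64(yq));                   % Q30 products; needs a wide accumulator
      z_fxp = double(acc) / 2^30;                            % rescale back to a real value
      z_fp  = sum(single(x) .* single(y));                   % the FP version: one line, no scaling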


      Future of FP DSPs
      The future of FP DSPs is not bright. All the existing architectures (C67, Sharc) survive in their existing application spaces (military or audio) but have no serious roadmap. The only recent introductions, such as the C673x and C66x, are the result of convergence more than of FP evolution.
      Also, the C66 is built like a CPU: its most striking features come from high-end CPUs (level-3 interconnect and caches), with a few DSP features left from the past. It stands a good chance in infrastructure (given the poor competition... PPC, Intel) but has no future as the next core for OMAP 7. One day TI will have to decide whether C66x support is worth the bother or whether an A9, A15, A21 or A333 is the better bet.


      Lessons for Coprocessing

      On the other hand, there is no doubt that FP has a good future in COPs and AS-DSPs.

      1. Designers have full freedom and can draw from the rich past of FP DSPs. For instance, on the choice of data size, they can use 16+6 (as the first FP DSP from OKI) or 80-bit a la Intel.
      2. Matlab Mapping becomes unidimensional.
      3. But not all problems can be solved. 
        1. Single-precision FP is not enough. That, at least, we have heard from the audio claque (because linearity is 23-bit).
        2. More seriously, we primarily live in a digital world (not a numeric one). The numeric values (traditional DSP) are overwhelmed by packetization and bit stuffing. Take any audio or video codec.




      References
      1. Ray Weiss "32-bit FP DSP processors" , EDN Nov 7 1991
      2. Steve Paavola " GP processors target FP DSP" www.edn.mag April 1, 1999
      3. Plenty of similar articles in RTC magazine circa 2000 , e.g. from Spectrum Signal Processing
      4. The EDN DSP Directories (before it became a joke), had good 1 page description of the 4 FP "old" DSPs (ex: june9,1994) 
      5. And so has the BDTI bible version 1995.
      6. Gene Frantz White Paper " Where will Floating Point take us?" http://www.ti.com/lit/wp/spry145/spry145.pdf, oct 2010
        1. this is the latest on Gene's white papers on FXP vs FP. See also::
        2.  Jim Larimer, Daniel Chen "Fixed or floating? ..." EDN 1995

      Sunday, November 13, 2011

      COP -VRAM

      VECTOR RAM etc..
      Goal
      To explain the impact of the Vector RAM (VRAM) concept on COP design today.

      Background
      The mid-90s saw the good old RISC school (Berkeley) starting a new revolution: the VRAM. As stated by Patterson (2 t's, 1 s) et al., "the goal of Intelligent RAM (IRAM) is to design a cost-effective computer by designing a processor in a memory fabrication process (DRAM), and include memory on-chip".
      Already we can notice that memory-on-chip and processor in DRAM are two concepts which are a bit confused here.
      Anyway, by 1998 the concept became focused on VRAM (or VIRAM, or V-IRAM, or Vector IRAM).
      It was a Vector Processor (VP) inside the memory array (Krste Asanovic was the VP guy).
      As of today the concept seems frozen in 2002 (last look this morning), and it is a pity.

      RAM coprocessor in DSP
      (hmm.. this section relying on personal memory might not be 100% accurate)
      Some of the first DSPs were hardwired DSP functions (such as the AMI FFT 1980, Motorola CAFIR 1986, Inmos adaptive FIR 1988) which, from the host perspective, were just a memory map.
      In other words it was the simplest programming model (write the X array, write the Y array, write the parameters, run, wait a little, read the Z array).
      Now, on the 3 levels of architecture, silicon and software, implementing a block DSP function as a piece of memory is totally coherent (which is rare).
      Note that the DSP function can be in series with the memory array (so as to be totally transparent) but more likely it is part of the memory map.
      Going back to the VRAM concept: from a 1980 DSP architecture perspective there is no doubt that it is a serious step forward. Hence it should be considered as such when designing a COP buried inside a memory array.
      And obviously we have Matlab implementation in mind here.

      References found in my garbage can 1998
      1. "Brass/IRAM retreat" multiple papers including Christo Kozyrakis micro-architecture, June 24,1998
      2. David Patterson etc.. "A case for Intelligent RAM: IRAM"  IEEE Micro, April 1997
      3. Randi Thomas and Katherine Yellick "Efficient FFTs on IRAM" Berkeley, circa 1998? 
      Google
      1. wikipedia..search IRAM
      2. Patterson (2t, 1s)
      3. much easier .. krste Asanovic 
      4. even better.. Kozyrakis ..it is not a name it is a trademark
      5. and plenty of others

      Saturday, November 12, 2011

      COP - ASPRO, CAM

      ASPRO, CAM and Find


      The goal is to introduce the COP designer to one of the 4 or 5 major structures of computing (see note 1): the Content Addressable Memory (CAM). We also briefly mention the very large field of ASsociative PROcessors (ASPRO). Behind all that, there is an interesting parallel between the CAM and the FIND instruction.

      (note 1) Along with arithmetic, bit-field logic and the lookup table.

      Introduction
      A common problem with Matlab is that IF does not work on vectors. This should not be surprising, since it mirrors the DSP ISA principle where we define predicated instructions to replace branching.
      • DSP ISA
        • IF corresponds to a change of program flow and impacts the very early fetch.
        • Predication corresponds to two executions "in parallel" with a final mux to make the right choice. The hardware performance impact is negligible.
      • MATLAB
        • IF can be seen as a change of program flow and it is difficult to visualize it on a vector.
        • Predication is implemented with logicals.
      Example 
      y=(rbit==0)?x+7:x;

      in Matlab
      y=x;
      if rbit==0 , y= x+7;
      will work only on scalar. The brute force solution is to create a for loop (like in C).
      But if we use logicals
      cc= eq(rbit,0)
      y(cc)=x(cc)+7;

      y(~cc)=x(~cc);
      This looks much more like predicated DSP code and it is works on vector.
      Now, the funny thing is while it is perfect legal, the standard Matlab style is to use the Find instruction.
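
      For completeness, a sketch of the find-based version (our own code; same result on vectors):

      idx = find(rbit == 0);     % indices where the predicate is true
      y   = x;
      y(idx) = x(idx) + 7;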

      FIND
      Without going into the details of the usage of Find (see MathWorks), the implementation of Find as a COP or as C code is a heck of a challenge. But, to first order, its structure is based on a CAM.

      CAM
      CAMs have been around for a long time and had(?) their hour of glory in networking chips. In DSP a CAM is a natural for everything 'RECOG' (speech recognition, pattern recognition, etc.).

      ASPRO
      It is the generalization of the CAM concept to general purpose computing. ASPRO is as old as the world exists and a large body of references is available.


      References found in my 2004 garbage can
      1. Florin baboescu etc.."hardware implementation of a tree-based IP lookup Algorithm",  ST Microelectronics, Year YYYY
      2. Kohonen 1980 .. a book
      3. Asanovic and chapman 1990
      4. Asanovic"the Space chip"    keyword: PADMAVATI
      5. Cypress Ternary CAM
      6. Djamshid tavagarian, "flag-oriented parallel Associative Architectures and Applications"  IEEE Nov 1994
      7. NeoMagic "the technology of APA" ..this one beats them all! Genius or monumental stupidity? still open after all these years. 
        1. there were at least 2 challenges : 
          1. software required double optimization and porting
          2. iterative algorithms were impossible. We are working on that one!
      8. Romain Saha MOSAIC " CAM speeds up lossless compression"  EDN 09.29.03
        1. see also www.commsdesign.com circa 2003
          • Good try romain!
      9. ALTERA A.N. "Implementing High Speed Search Applis with Altera CAM" July 2001, A.N 119
      10. GEC Plessey "PNC1480 LAN CAM" 
        • gosh!
      FURTHER : INTELLIGENT MEMORIES seen by COMPUTER DESIGN MAY 1998 !!
      • Matrix transposing and Multiplying
      • Wavelets transforms
      • Lossy and lossless compression
      • Cryptography
      • Graphics Accelerator
      FURTHER: replacing time by space

      1. Using GBytes memory to store all possible outcomes. 
        • Not your average chip.

      Sunday, November 6, 2011

      DSP - Architecture Past, Future

      DSP  ::  Architecture Past, Future
        1. The dinosaurs (70s)
          1. Computers only (MIT Lincoln)
          2. The first chips (TRW, AMD bit slices, other bit slices)
        2. The first steps (79-80)
          1. Intel 2920
          2. Bell Lab internal
        3. DSP of the First Kind (80-95) : A good compromise between application specific and General Purpose (GP)
          1. The PSI
          2. NEC 7811
          3. AMI2811
          4. TI 320C10
            1. TI C10,C20,C25
          5. Motorola 56000
          6. And a lot many more
          7. ADI enters the fray (21xx)
          8. ATT becomes public (16xx)
          9. TI starts a revolution every 4 years 
            1. FP (89)
            2. MP (93)
            3. VLIW (97)
          10. Motorola weathercock: 24/24 16/24 24/16 16/16
          11. ADI "one day I will be bigger"
          12. ATT "only the professional" 
          13. Refer to BDT 1995 for summary
          14. Smelling the RISC takeover
          15. DSP group "the revenge of the trees"
          16. Carmel "the last honest DSP!" 
        4. DSP of the Second Kind: Back into CPU mainstream (95-05)
          1. TriCore
          2. ARM9E
          3. Hitachi SH
          4. Extensions to CPU ISA (ARM, MIPS, PPC, Intel, HP, SUN)
          5. TI: from C62 to C66
          6.  ZSP, Starcore
          7. Blackfin, Tiger Sharc
        5. DSP of the Lost Kind? What shall we try now? (05-10)
          1. Customized DSP
            1. Wireless DSP
            2. Customisable DSP
            3. Customizable Core: TenSilica, Arc,3-DSP
          2. Multi-PE (Processing Elements)
            1. Impact of MM architecture
          3. MP, MC (Multi channels), MT
          4. Re-configurable computing
          5. Heterogeneous Platforms
          6. Matlab and Custom DSP
          7. Is GPU the latest smoking pot? Or is it Cell? 
        6. DSP of Any Kind
          1. DSP boards
          2. DSP custom Chips
          3. FPGA platforms
        7. DSP of the Third Kind (10-20?)
          1. Back to the future: Coprocessor seen as a low complexity DSP
            1. ASP DSP
            2. COP DSP
            3. A platform to build the platform: the framework
          2. Added complexity: Matlab to implementation



      Architecture evolution
      For simplification's sake, we will use the terms archi80 (DSP of the first kind), archi90 (DSP of the second kind) and archi2000 (DSP of the lost kind).
      • The typical archi80 is made of 3 blocks (DAU, AGU, PCU), buses and memories. The 3 blocks correspond respectively to the Data Arithmetic Unit, the Address Generation Unit and the Program Control Unit. The terminology matches the architecture previously developed in bit-slice designs. To be more precise, the architecture was centered on designing datapaths, with a microcode memory in the control plane.
      • The typical archi90 turns the concept by 90 degrees, so that now everything is seen from the instruction perspective. To that end, a new central block is added: the Register File. Hence the 3 aforementioned blocks become respectively IP, LS and Fetch/Decode. There is no doubt that archi90 is much more sophisticated than archi80 and provides a solid foundation. But it is also so complex that it needs a scientific discipline (computer architecture) and specialized engineers (architects)...
        • When RISC proponents mention the simplicity of the RISC model (for instance the DLX or MIPS I) they refer to the CPU debate of RISC versus CISC (in which they misplaced DSPs). But implementing a DSP with RISC principles is not an average ARM7 design.
      • The archi2000 opened new directions in architecture, but none of them was a major change.
        • Many new directions (such as multi-core) have very little to do with cores. They are just platform choices.
        • Configurable cores (TenSilica, ARC) boil down to a finer-grain way of building an SOC.
        • Reconfigurable computing is a real breakthrough since it combines (re)building and running in hardware. Unfortunately it is still in its infancy.
        • And so are similar concepts like the JIT interpreter (Transmeta).
        • Because of its popularity, the C6x with its VLIW and its 2-cluster datapath is a real breakthrough. But TI is kind of going backward. Firstly, their VLIW looks more and more like a static superscalar, and the 2 clusters never expanded into 4, 8 or 16 clusters. Instead they rely on clock speed, multiple multi-level caches and MP for performance. Hardly original...
        • All similar attempts (Multimedia CPUs) are now dead.
        • For a time there was a revival with the IBM Cell and Nvidia GPUs, but it is difficult to see the future of DSP when looking at these beasts.
        • We had a series of gizmos, such as stream processors, which petered out like the rest.
        • Now, none of these new trends was useless. All will have some impact on future DSP architectures.
        • For our purpose the favorite trend is configurable computing: for instance a DSP which includes a Matlab JIT, or a DSP which reconfigures its operating units on the fly.
      • The archi2010 era leads to a simple observation:
        • All efforts are now on platform design.
        • Software (the lack of it, the price to develop it) is the main killer for a platform.
        • Developing a new GP core is useless.
        • Only a customized core can justify the investment. And not even a full core, because of the price of developing new tools and software.
        • The space remaining for DSP is either coprocessing or a DSP core so simple that the tool development effort is minimal.
          • the second case is for simple apps (1K-2K of code)



      REFERENCES -Found in the garage - Filed under DSP 
      1. Floating Point DSP
      REFERENCES -Found in the garage - Filed under CPU
      1. MultiMedia MM CPU
      2. DLX

        Sunday, October 30, 2011

        Mukesh Patel

        Mukesh Patel and Nazomi (sep 2000)

        So here we are in meeting room 21, me as architecture expert, to review Nazomi IP ( a Java accelerator); on the other side is Mukesh Patel the CTO  and while waiting for Saddam the H and the rest, we have a long discussion on the maturity of the Java market.
        Me, as always Mr Suspicious, need to be convinced.
        1. Firstly our DSP core (or any DSP for that matter) is not going to implement Java.
        2. Secondly it seems compulsory for our CPU core if we want to stay competitive with ARM as a cellphone platform. BUT (and it was not clear at the time) this does not fit in our platform strategy since our CPU core is mainly used  on the modem side, especially as TriCore is a CPU+DSP.
        3. That leaves the AP (Application Processor) side; no question about that, except that the AP paradigm shift has not happened yet (this is 2000).
        4. Finally, I am not convinced of the cell phone as THE third screen.
        Coming from a different side, Mukesh explains that he was in Japan where he saw a few applications (like reading serials on the subway, but I might be wrong).
        Anyway, I am grateful to Mukesh for being one of the first guys to expand my cellphone architecture perspective from western to eastern.
        ................ 
        The really good part of the meeting was Saddam the H, who was the marketing guy in charge of core licensing.
        As someone mentioned later, "maybe he did not know that he was supposed to license OUT our cores, not license IN other people's cores". After the introductions, he started to ask if we could use the Java IP for our DSP core...
        Well, at this stage my best use of time was to avoid opening my big mouth, go into power-down, look at the Nazomi documentation and listen to Mukesh...
        Wonder where he is now.
         Corrections:  
        Looking at my notes, in 2000 Nazomi was called Jedi Technologies (wow! glad they changed names).

        Sunday, October 23, 2011

        COP - Networking Apps

        Coprocessor in Networking Applications
        The goal of this section is a quick overview of COPs

        Description
        Classified in two type of processing
          1. Centralized COP
          2. I/O COP
          Intuitively we understand that an I/O COP can be multiple (one per line) and in-line (between the PHY/MAC and the CPU), whereas the centralized COP is spatially unique and sits next to the CPU.
          We are not networking specialists (see the references instead), so we will not try to file the following COPs into either of the 2 processing spaces. Instead we will list them by how much their algorithmic content looks like a DSP algorithm.
          1. Encryption
            1. elliptic curves
            2. Galois fields
          2. Security
          3. Pattern matching, Reg-Ex
            1. TCAM

          4. TCP/IP acceleration
          5. Search, Classification

          References

          Linley Group

          Linley

          Linley Gwennap

          Among the 4 MPR musketeers, Linley Gwennap was the best at converting from CPU-centric coverage to platform coverage. Remarkably, he did it very early (1999-2000) by jumping on the NPU bandwagon.
          1. At first, he was presenting special sessions at MPF (and maybe HC).
          2. Then as MPF folded, he created his own group (www.linleygroup.com) which is THE reference in NPU or more exactly in platforms in the infrastructure ( more than networking) space.
          3. And frankly we wish they would expand in other spaces.
          Still from our perspective we will list the interesting features offered by NPU and infrastructure platforms (for deeper understanding the reader should refer to the above reference)
          • Complex SOC architecture where bandwidth and sometimes latency are critical
            • --> advances in buses and fabrics.
          • Naturally prone to multi-processing (MP) either as multi-cores or innovative multi-chips (good luck for latency).
            • --> advances in partitioning and topology
          • Even better, first serious implementations of MT (see Mario the MultiThread Champion) circa 2000.
          • Large diversity of Coprocessors.

          Mario MultoFredo

          So here we were circa 1996, the whole architecture team still working on the Dolphin definition, and I felt pretty much annoyed.

          " Why do i have to listen to this guy as we are late | confused | living in hippie land and .. what the heck.. I have to listen to another croonie and ..what the heck should I care for another CA (Computer Architecture)   technique which look like the others, miles ahead of any embedded implementation"
          At this stage I was still very DSP centric and little did I know that 5 years later I would push for Tricore next gen. to use Multithreading (MT). {this is another story}
          Anyway, we all listened very patiently and at the end of the meeting Bruce told me
          " this is Mario! nobody listened to him at NS, and since he did a thesis on MT, he is pushing it all over the valley"
Well! We moved on to other, greater things and filed MT under "nice to know". Not surprisingly, MT was not seen or heard of in any embedded architecture (ARM, MIPS, PPC, SH, etc.) at the time, let alone a DSP; instead it re-appeared in a Network Processor (NPU).
Circa 2000, ClearSpeed presented their NPU at Hot Chips (or maybe Microprocessor Forum) and there, you guessed it, was Mario in the center of the arena. In a way it was satisfying to see that a gifted and focused architect always finds a way to push his baby... {not sure about that, mind you}
           

          STOP ME !

          STOP ME (if you've heard that one before)!
Herein are personal remembrances of meeting people, ideas, weirdos and opportunities.
I do not guarantee the exact dates and I changed the names as I see fit.
But they were always triggered by my memory, which I expect to be reliable at separating fantasy from true events.
• Mario MultoFredo  "if he is not the father of multi-threading, he is at least its brightest son"
          • Linley Gwennap  "easy to remember; it's not gwenAPP"
          • Mukesh Patel "JAVA acceleration for eastern  architecture"

          Friday, October 7, 2011

          Coprocessors (COP)

          Coprocessors (COP) 


           The goal of this section is to understand the scope and definitions of coprocessors (COP).
          • Difference between processors and COP. 
          • Difference between COP and peripheral.
          • ........already we can visualize that a COP is less than a processor but more than a Peripheral.
          • Difference between COP and accelerator
          • ........... at one time it was very clear
a COP sits next to a CPU on the level 1 bus (like an 80387 or, more recently, a GPU)
              • an accelerator sits on the level 3 or system bus (like a Turbo decoder) 
          • Difference between COP and I/O processor (IOP)
          • Difference between COP and Intelligent Peripheral (IPERI)
          • ........... at one time it was very clear
a COP is a CPU extension whereas an IPERI is a CPU-agnostic block (see the evolution from the Am9511 to the 8087).
          • Difference between COP and Application Specific Processor (ASP)
          • Difference between COP and DSP
            • ...........at one time it was very clear
              • a DSP is not a COP! 
But seen from the perspective of the software running on the Host (ARM), that is exactly what it is becoming: just another API.
            • .............and more
            • For simplification sake, we will use the definition
              • A COP is less than a CPU but more than a Peripheral.
A COP is dedicated to filling the system gap between the standard CPU and the standard Peripheral. It can be a custom function or an application-specific processor.
Alternatively, the term "Accelerated Processing" fits our definition.
            Background
At the beginning of the 80s, Intel came up with a Floating Point Unit (the 8087), which is the de facto first coprocessor. Its instruction set (ISA) and interface are well documented. Following the very successful 8087, Intel came up with a series of COPs: the 80130 for multitasking software, the 82586 for the local network, the 82730 for text (screen). While the 80130 was tightly coupled, the last two were bus-coupled (not unlike the 8089 I/O processor). In fact, not surprisingly for Intel, everything was a COP.
Co-processing was then one of the solutions to the so-called problem of extended processing: how to cover all possible applications without specializing the microprocessor ISA? There were 3 types of solutions:
1. the macrostore. The most commonly used subroutines are stored in ROM as micro-instructions instead of being executed from the external program store. Proponent: Texas Instruments.
  1. not the brightest tree in the forest; still, it could make sense in a DSP COP. File under microprogramming techniques.
2. the coprocessor. The most commonly used subroutines are executed by a specialized processor.
3. the intelligent peripheral. The advantage is that concurrent processing is easy, but the big disadvantage is that it is not transparent to the programmer.
After 1983, CPU architecture became the standardized way, so ---> refer to .. for the rest of the story.

              Topics
1. Most interesting was the Motorola answer (the 68881) to the 8087, which had a similar ISA but a much better non-blocking interface. In other words, Motorola introduced concurrent processing. Incidentally, Motorola also introduced the first major COP architecture topic: what type of interface?


              A non exhaustive list of COPs and their applications
DSP coprocessors: a list of DSP ISA extensions in roughly chronological order.
While ISA extensions and COPs are not synonymous, we group them together for simplicity.
The advantages of a COP over an ISA extension are obvious: the specs, including the interface and the DSP extensions, are physically separate from the rest of the core, hence it is easier to model, implement and validate.
              • NS FX161 (~ 1991) had a DSP COP
  • amazing trick!! the integer registers and the DSP registers were skewed by 1 bit (because of the q format)
• ARM PICCOLO (~1996), a point solution for the GSM speech coder.
  • It stands out today for its very original way of interfacing to the core:
                • both tightly coupled and asynchronous data through a FIFO.
                  • TBD : integrates notes and ICSPAT 98 Moerman class notes
                  • TBD:  matlab model   
  • !!! circular addressing on the register file (a generic software sketch follows this list)
                  • TBD: matlab model
                • http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/arm/dsp.html
              • HITACHI: SH-DSP  (1996)
              • PPC: ALTIVEC (circa 1998)
• XTENSA (Tensilica, circa 2003)
• dsPIC (2005?)
              • ........
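Since the Piccolo notes above are still TBD, here is only a generic software sketch (in C; the names and sizes are invented) of circular (modulo) addressing, the feature that DSP address generators, and register-file tricks like Piccolo's, provide in hardware. The classic use is a FIR delay line.

    #include <stdint.h>

    #define N_TAPS 8u   /* illustrative length; a power of two makes the wrap cheap */

    /* Software circular buffer: what a hardware modulo address generator gives for free. */
    typedef struct {
        int16_t  buf[N_TAPS];
        unsigned head;                 /* index of the newest sample */
    } delay_line_t;

    static void dl_push(delay_line_t *dl, int16_t x)
    {
        dl->head = (dl->head + 1u) & (N_TAPS - 1u);   /* wrap without a branch */
        dl->buf[dl->head] = x;
    }

    /* k = 0 returns the newest sample, k = 1 the previous one, and so on. */
    static int16_t dl_tap(const delay_line_t *dl, unsigned k)
    {
        return dl->buf[(dl->head - k) & (N_TAPS - 1u)];
    }

    /* FIR filter on top of the circular delay line: y = sum_k h[k] * x[n-k] */
    static int32_t fir(const delay_line_t *dl, const int16_t h[N_TAPS])
    {
        int32_t acc = 0;
        for (unsigned k = 0; k < N_TAPS; k++)
            acc += (int32_t)h[k] * dl_tap(dl, k);
        return acc;
    }

In hardware (or in a Piccolo-style register file) the two masked index updates disappear, which is the whole attraction of circular addressing.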

                Design Issues
                • Tightly or Loosely coupled?
                • How do you return data to the core? interrupt?
  • blocking or non-blocking (see the sketch after this list)
                • Memory hierarchy position
                  • Level 0 memory : COP has access to the core Register File.
                    • the COP is just another execution unit inside the DAU
                  • Level 1 memory: COP has access to level 1 Memory
  • even better: the COP sits in the same place as a level 1 memory
                      • see FFTer from ?
                  • Level 2 or 3 memory: COP sits on one of the system Buses
                • Instruction Set or not?
  • In theory an instruction set is a good idea, but it implies a lot of added complexities, none of them major, yet the whole can become unmanageable:
                    • tool issues
                    • C compiler or not
                    • opcode design
    • added power consumption due to instruction fetch
  • Parameters are preferable, especially a combination of build-time and run-time parameters.
                • Scheduling techniques
                  • "pure" datapath    
  • vector processing (access to data as a block in memory)
    • length, stride
  • pipelined data path
  • sequential; concept of a clock; if cycle==1 ... if cycle==2 ...
  • autonomous: z = FFT64(x)
                • Topology: how many ports? 
  • a port must be a physical reality, not a pointer to a structure.
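To make the blocking / non-blocking and memory-hierarchy questions concrete, here is a minimal host-side sketch of a hypothetical loosely coupled, memory-mapped COP (level 2/3 in the list above). The register layout, base address and bit assignments are invented for illustration only; they do not describe any real COP's interface.

    #include <stdint.h>

    /* Hypothetical register map of a block-processing COP. */
    typedef volatile struct {
        uint32_t ctrl;     /* bit 0: START                                 */
        uint32_t status;   /* bit 0: DONE                                  */
        uint32_t src;      /* address of the input block                   */
        uint32_t dst;      /* address of the output block                  */
        uint32_t len;      /* block length (think vector: length, stride)  */
    } cop_regs_t;

    #define COP_BASE   ((cop_regs_t *)0x40001000u)   /* made-up base address */
    #define COP_START  (1u << 0)
    #define COP_DONE   (1u << 0)

    /* Blocking call: start the COP, then spin until DONE.
       Simple for the programmer, but the host stalls for the whole COP run. */
    static void cop_run_blocking(uint32_t src, uint32_t dst, uint32_t len)
    {
        cop_regs_t *c = COP_BASE;
        c->src = src; c->dst = dst; c->len = len;
        c->ctrl = COP_START;
        while (!(c->status & COP_DONE))
            ;                                        /* busy-wait */
    }

    /* Non-blocking pair: kick the job off, do other work, poll (or take an
       interrupt) later. This is 68881-style concurrency in software form. */
    static void cop_kick(uint32_t src, uint32_t dst, uint32_t len)
    {
        cop_regs_t *c = COP_BASE;
        c->src = src; c->dst = dst; c->len = len;
        c->ctrl = COP_START;
    }

    static int cop_done(void)
    {
        return (COP_BASE->status & COP_DONE) != 0;
    }

Note how the job descriptor (src, dst, len) is a set of run-time parameters rather than an instruction stream, which is the "parameters are preferable" argument made above.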

                Further Topics 
                • Accelerated Processing  (AP)
  • Traditionally AP is divided into several techniques:
                    • Central Core (CPU) based
                      • Specialized CPU
                      • CPU + COP(s)
                    • Periphery (Non  CPU) based
                      • Intelligent peripherals
                      • FPGA
                    • anything between  Core and Periphery.
                      • to simplify: Core is level 0 memory and Periphery is level 2 or 3.
                    • anything outside the chip is considered periphery.
                    • since our focus is customization we do not consider massive parallelism as a solution. 
  • For our application space (DSP) it is simpler to treat AP and COP as a single topic.

                Advanced Topics
                • FPGA Nodes:  Combining Massive Parallelism (MPP)  and customization
  • MPP machines kind of disappeared from the DSP (and embedded) scene for obvious reasons of programming model and power efficiency.
  • The next generation is based on a slightly different approach:
  • you have a switch fabric (say 16x16, i.e. 256 nodes) and each node is dedicated to a function
    • in fact some people proposed a fabric based on the FFT trellis instead of rows/columns
                  • this approach is interesting because it is more sophisticated than our proposed signal graph
                    • since we map a Matlab/Simulink flow. 
                  • We are not familiar with the state of the art but it does not seem that this type of solution went deeper than FPGA implementation.
                  • And maybe it is the right technology.
References: my garage, Google and questions
                  1. "Making software acceleration simple" Critical Blue, 2002
                    1. http://www.criticalblue.com/
                    2. What is the Critical Blue philosophy? the methodology? the application space?
3. Is there a paradigm shift?
                    4. Any link to dsp? Matlab? 
                  2. "OptimoDE.;...etc" ARM, Hot Chips August 2004
                    1. http://www.hotchips.org/archives/hc16/3_Tue/12_HC16_Sess9_Pres3_bw.pdf
                      1. see also the PPT slides from CCCP, University of Michigan
                    2. http://www.iqmagazineonline.com/magazine/pdf/v3_n3_pdf/Pg74_ARM_Phonex.pdf
3. Originally developed by Adelante (an offspring of Philips research). They were partially bought by ARM.
4. OptimoDE is a general-purpose (GP) COP. What is wrong with this approach?
5. OptimoDE is a GP methodology to design a COP. Advantages and limitations?
6. It is based on a VLIW core. What is the one big thing wrong with VLIW?
                      1. Compare a C55x MAC2 and a C62x MAC2
                      2. Compare evolution C62, C64, C64+
                      3. What is code footprint?
                      4. What is compound instruction?
                      5. What is a thick and thin operator (data-path)?
                  3. "Creating FPGA-based Co-processors for DSPs using Model Based Designs..." Avnet, Xilinx, April 2009
                  4. " Extreme Processing" Max Barron Instat/MDR, October 14, 2002
1. This reference, while excellent, illustrates what we do not want to do. Max Barron used the term "extreme" because he delved into architectures which were massively parallel and general purpose.
2. Here we consider solutions (COPs) that are customized for efficiency and specific to a task.
                      1. note: efficiency can also mean parallelism
                  5. "Accelerator Architecture" IEEE micro July/August 2008
6. Anand Balaram, Andrew Volk, "Text coprocessor brings quality to CRT displays", EDN, Feb 17, 1983
                    1. Including 80186-82730 interface
                    2. Software interface: command block, screen characteristics interface, string pointer list and display data strings
7. Stan Groves, "Standard interface keys processor design", Electronics, Nov 17, 1983
8. Michael Cruess, "The 68000 coprocessor interface: an overview", Motorola document dated June 8, 1982
  1. the author is on LinkedIn

                      Wednesday, September 28, 2011

                      Found in the webs

                      Oxford DSP


                      1.2 Background Checks

                      1.2 Background Checks

1. Hennessy and Patterson
                        1. Computer Architecture
                      2. Coprocessors (COP) 
                      3. The Core
                        • Which type of core? -->  let us start with a DSP of the first kind
                        • Is a DSP core best for DSP?
                        • Xilinx has a MAC. What is the next level?
                      4. Benchmark and benchmarking
1. BDTI
                          1. VADD
                        2. The noble art of profiling
                          1. Profiling an ever changing reconfigurable machine
                        3. Optimization and tuning
                      5. Fixed Point Dialects
                      6. Algorithms, signals, structures, tips and tricks
                        1. Signal Processing
                        2. Matlab
1. Interesting feature: is M (Matlab) predication the same as CA predication?
                      7. Once a (logic) designer always a  pain in the ass designer 
                        1. Logic
                        2. Arithmetic
                        3. Sequential
                        4. Xor constructs
                      8. SOC architecture
                        1. the ideal : NS Dgt Answer Phone -- 3 serial ports

                      Fixed Point Dialects

                      Fixed Point (FXP) dialects
The goal of this section is to summarize, in as few words as possible, the gigantic complexity of FXP.

                      Background
1. When is 2q30 actually 1q30?
2. Float or integer?
3. First it was q15
4. Then came q14
5. Then came 1q15, 1q31, 2q30, etc.
6. Then we got lost: 1q31 = 1q15, but one dialect's 1q15 ~= another dialect's 1q15!! (see the small numeric sketch below)
7. Still ...
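A small numeric illustration (in C; assuming the common convention that the digits after the q count the fractional bits and the leading digit counts the integer bits, sign included) of how the same bit pattern changes value with the dialect:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* value = raw / 2^frac_bits, the one formula behind every q format */
    static double q_value(int32_t raw, int frac_bits)
    {
        return ldexp((double)raw, -frac_bits);
    }

    int main(void)
    {
        int16_t raw16 = 0x4000;                                 /* same 16-bit pattern */
        printf("0x4000 as 1q15 : %g\n", q_value(raw16, 15));    /* 0.5 */
        printf("0x4000 as 2q14 : %g\n", q_value(raw16, 14));    /* 1.0 */

        int32_t raw32 = 0x40000000;                             /* same 32-bit pattern */
        printf("0x40000000 as 1q31: %g\n", q_value(raw32, 31)); /* 0.5 */
        printf("0x40000000 as 2q30: %g\n", q_value(raw32, 30)); /* 1.0 */
        return 0;
    }

Much of the dialect confusion boils down to whether the notation counts the sign bit, the integer bits, or neither.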

                      List of FXP dialects
                      • Assembly language on first generation DSPs 
                      • ITU basic operators
• Matlab fi
• Simulink Fixed Point
• SystemC
                      • DSP extensions in C
                      • Proprietary FXP languages from CAD companies
                      Issues

                      FXP and integer
When translating FP to FXP, the common mistake is to think that replacing the float data types with integer data types will suffice. This would be true if the operations were additions or based on additions. But as soon as multiplications are involved this does not work so well, and it gets worse with divisions, transcendental functions, etc.
The reason is that (in 99% of cases) an FXP number is not an integer. It is a fractional number. It can be purely fractional (magnitude less than 1) or partly fractional (for instance, a range of +31 to -32 leaves 10 bits of fraction in a 16-bit wide word).
Hence it is well known that an integer 16x16 multiply gives 32 bits while a fractional 16x16 multiply gives 31 bits (plus one case that must saturate). Also, it is easy to visualize that an integer division and a fractional division grow in opposite directions.
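A minimal sketch of that last 16x16 point in C: the fractional (q15) multiply is essentially the ITU basic operator L_mult, i.e. the raw product shifted left by one bit, with the single (-1)x(-1) case saturated.

    #include <stdint.h>
    #include <stdio.h>

    /* Integer multiply: 16 x 16 -> full 32-bit product. */
    static int32_t int_mul16(int16_t a, int16_t b)
    {
        return (int32_t)a * (int32_t)b;
    }

    /* Fractional (q15) multiply: 16 x 16 -> 31 significant bits, doubled to sit
       in q31; the only overflow case, (-1) * (-1) = 0x40000000, must saturate. */
    static int32_t q15_mul(int16_t a, int16_t b)
    {
        int32_t p = (int32_t)a * (int32_t)b;
        if (p == 0x40000000)                 /* -32768 * -32768        */
            return INT32_MAX;                /* saturate to 0x7FFFFFFF */
        return 2 * p;                        /* the one-bit left shift */
    }

    int main(void)
    {
        /* 0.5 * 0.5: both operands are 0x4000 */
        printf("integer 0x4000*0x4000 : 0x%08lX\n", (long)int_mul16(0x4000, 0x4000)); /* 0x10000000 */
        printf("q15     0x4000*0x4000 : 0x%08lX\n", (long)q15_mul(0x4000, 0x4000));   /* 0x20000000 = 0.25 in q31 */
        return 0;
    }

The product of two signed 16-bit fractions carries a redundant sign bit (except for (-1)x(-1)), which is why the one-bit left shift is safe; that shift and the saturation case are exactly the kind of detail a DSP's fractional multiply mode, or the ITU basic operators, handle for you.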