Sunday, December 25, 2011

Wireless Cellular COP

Wireless Cellular COP
The goal of this section is to describe the historical impact of the MIPS increase in wireless platforms.
  • Wireless coprocessing
  • Wireless coprocessor
  • Wireless extension
  • ASPI (Application Specific Instructions)
  • Wireless DSP  {tbd Paulin & all}

Background                                                            
Until GSM (1987), DSP was not a very profitable business. Over the next 10 years, DSP and many CPU architectures were driven by the wireless craze. With the 3G introduction (1998), the quiet MIPS evolution took a serious blow. Now based on CDMA, the partitioning of tasks between software and hardware became a very serious problem. Now, more than 10 years later, with the 4G introduction, things have slowed down to a normal pace and the hardware school has won. Most DSP-intensive tasks use some kind of dedicated processing and there is no sign it will change.
1987-1996
In 1987, the standard DSP had around 20 MIPS (we will use MIPS instead of DSP MIPS, see note 1) and could implement a speech coder or a modem. As both applications increased linearly in complexity, so did the DSPs (and vice versa). The GSM standard was devised to fit nicely in the then most common TI DSP. With the increase in Channel Coding (CC) complexity we could see a slight shadow on the horizon, but nothing we would not be able to tackle in software. In a nutshell, DSP was wireless, wireless was DSP, DSP was software, and software efficiency was achieved by creating new instructions (so-called ASPI), especially the "Viterbi" instructions [ref 1,2]. This did not last long.

1) By introducing CDMA, Qualcomm changed the MIPS partitioning. Up to then, (de)modulation schemes had had very little impact on the DSP, but CDMA required (say) another 100 MIPS, which could only be implemented in hardware.
2) A revolutionary CC scheme, Turbo Coding (TC), was now on the books and required another order of magnitude. Even without mentioning TC, Viterbi Coding (VC) alone had reached another 100 MIPS.
1996-2001
The space odyssey. By 1997 it was clear that an ALL-software implementation of the wireless communication standard would require a DSP with 300 (DSP) MIPS. And we are speaking about a basic cell phone (speech or a 16 kbit/s modem). But in 1996, we had two extreme schools. The first one not only wanted to do everything in software but also to integrate the MCU tasks into the DSP; hence they went on designing the 300 MIPS DSP. The second one had high-end RISC CPUs already running at 300 MHz or more and believed, wrongly, that downgrading cost-wise was possible and that writing the software was a question of man-months. Both are dead, but the idea still floats around [6].
Over the years we quietly realized that, except for the speech coder, the cellular standards were best implemented in coprocessors.

2000-2003
That was half the story. Concurrently, the wireless DSP and its little MCU became the wireless platform, and very soon the little MCU had an even bigger MIPS problem due to data, audio and video. With 3G, wireless platforms turned into 3G computers, and soon the Multi Media crowd ("I smell Dollars Now") was proposing the SIMD engine as the solution to the issue.

But, once more, the solution was to introduce COPs to mop up the MIPS excess [ref 3,4,5].
Today and tomorrow
All DSP-intensive tasks use some form of dedicated processing (which can be a COP or a programmable DSP). From the host perspective everything is a COP. Because, for each COP, there is no sign that the algorithm will simplify over time (on the contrary), the architecture partitioning will not change. What will change is the methodology used to transform the Matlab algorithm (or experimentation) into an implementation.

References
  1. For instance one can refer to the TI family ISA (C5x -->C54x --> C55x )
  2. Or the DSP Group (CEVA) Pine, Oak, Palm evolution. A famous instruction took 1 cycle per value to find the address of the minimum in an array. It used a wire to block the AGU address... good luck to the pipeline.
  3. Sollenberger et al., "BCM 2132", HC, Aug 2003
    1. COP (to ARM9E): Edge Accelerator, CC accelerator, Incremental Redundancy, (external DSP) Teak  
  4. "Intel XScale, Wireless MMX", HC, Aug 2004
    1. Execution Units: Shift and Permute unit to solve the sub-word parallelism problem
  5. "Renesas Hitachi" HC, Aug 2004
    1. COP: 3D graphics, MPEG4, LCD controller, java accelerator
  6. Happy Camper and his troop "Terminal centric view of reconfigurable system architecture and enabling components and technologies" IEEE Com. Mag, May 2004
   

Saturday, December 17, 2011

FP hardware

Floating Point (FP) hardware
The goal of this section is two-sided:
1) Emphasize the historical importance of the FP COP among all classes of COP. For that, this covers all types of FP hardware: peripheral, coprocessor, execution unit (FPU), Building Block, FP boards, FP DSP...
2) Highlight the natural relation to Matlab (and the ease of implementation).

Background
On May 8, 1980, Electronics [ref 1] had a wonderful concept drawing (*) of two hands working together to illustrate the upcoming Intel Numeric Processor (sic!) [ref 2], which happened to be the birth of a new type of chip: the Coprocessor (COP).
At the time, this was real progress, as the only way to add FP performance to microprocessor-based equipment was to use either the peripheral AMD 9511/2 [ref 7] or a specialized board.

Of course, computationally intensive co-processing had existed in mainframes, including the somewhat difficult dialog between the Cray-1 and DEC mini-computers.
Today, the FPU is so much a part of a CPU that it is difficult to understand all the brouhaha around a separate chip. But one has to place it in the context of limited silicon real estate. Incidentally, this is also the case for today's embedded resources.
Very soon Motorola answered with the 68881, which benefited from the more advanced interface of the 68000 architecture [ref 4]. In the same spirit, but without the same success, NS [ref 5] came up with their own chips.
Meanwhile, as the years went by, Intel was climbing up the numbering scheme with the 80287, then the 80387 [ref 3], then the 80487. Except it took us a while to realize that the 80487 was not a chip. While we will not bet on it, the 487 was purely an integration exercise and the interface might have been the same COP interface. But it was now 1990, the heyday of Computer Architecture, so by then the RISC architecture was dominant. The FP COP became the FP unit (FPU), totally integrated into the pipeline [ref 8] and, in the case of superscalars, working in parallel with the Integer Processing (IP) unit. From that point, the story is largely available on the web [refer to Hot Chips] and goes much beyond the background scope. Much more relevant to our scope is the story of FP Building Blocks (see FP BB).

FPU and FP extensions
We will give an incomplete list of FP units and FP extensions. FP extensions are characterized by a separate ISA and document which is added to the base architecture. This is not the case for the Pentium, which is natively FP, but it corresponds to pretty much all other architectures, even the PPC.
  1. PPC: Book E may 99
  2. ARM: 
  3. MIPS:
  4. Hitachi: SH7705 FPU
  5. TenSilica: 
  6. TriCore: TriCore FPU
  7. TI DSP C28xxx
    1. This is an interesting case as they offer two very different solutions.
    2. The 283xx core is a 28xx core to which an FPU has been added. This is the standard situation.
    3. The 2803x family, called Piccolo, does not change the 28xx core. The FPU is an I/O COP which acts as a CPU front end by processing the signals coming from the ADC modules. [ref Piccolo Control Law Accelerator]

FP BB (1985-1995) [ref 10]
1) At first we had the usual DSP/Bit Slice school (TRW, ADI, AMD) which naturally increased their portfolio with a DAU (see the bit slice section). They started at 4-bit and went up the integer curve: 8, 16, 32. At 32-bit, FP had the same data size, so why not? It presents the disadvantage of a larger size due to the FP adder, but note that the multiplier is only 24x24. The biggest issue is the large step in complexity (such as the IEEE standard), which is a relatively small price to pay for the comfort of numeric accuracy. Very soon, as always in FP, 32-bit was not enough and 64-bit chips appeared, going even further into the only market especially associated with the IEEE standard.
2) Driven by new names (Weitek, Cyrix, IDT), the 64-bit FP BB turned into the Intel coprocessor socket, which at this writing baffles me completely. They must have missed the RISC revolution somehow. Even with the 486, Weitek was still pushing the usage of an external COP. That being said, there are quite a few lessons to learn: their block diagrams are marvels, a heck of a datapath, and a good host-to-COP interface (that was no Cray-2).
Quick list
    1. MULTIPLIER
      1. TRW 1042
      2. Weitek 1032,1064
      3. Weitek 2264  (IEEE)
      4. ADI 3210
    2. ALU, RALU,
      1. TRW 1022/3 (22-bit FP)
      2. Weitek 1033, 1065
      3. Weitek 2265 (IEEE)
      4. ADI 3220
    3. DAU -->  COP
      1. AMD 29325/C327
      2. Weitek 3132/3, 3364
      3. LSI 64132, TI ACT8847
    4. Intel COP
      1. Weitek 3167, Cyrix 83D87
FP DSP
follow this link 

FP IP (1995-1999), FP in FPGA (now) [ref 11]
Last in the whole story is the Intellectual Property (IP) craze of the second part of the 90s, where amateurs were writing a bit of C code and selling it as a product (virtual silicon?).
Today, the best ones are productized pieces of IP for FPGAs. Xilinx, Altera, etc. have solid specs describing these components. Multiple other sources can be found on the web.
GPU (now)
The story cannot be complete without mentioning the GPU. While the architecture is somewhat exemplary, it is well beyond the scope of this section.

Topics
  1. Coprocessor Interface
    1. The 68881 outsmarts the 8087 with non-blocking operation
  2. How much IEEE complexity?
    1. The "sacred" standard.
    2. ARM's cheeky solution: a problem with exceptions? No problemo, invent a new IEEE model
  3. TI gives up on integer because of matrices and Matlab: the C66

Saturday, December 10, 2011

AccelProc in Communications

Accelerated Processing in Communications

The goal is to analyze accelerated processing in telecommunications. Are there any specific ways of doing it? Lessons and trends are drawn.

Background
Because of the serial nature of the communication signal, and its speed ratio relative to the CPU, the only realistic way to implement a communication function was to design a dedicated peripheral. This was the case in the 70s/80s (for instance UART, HDLC). With the dramatic increase in raw CPU speed (end of the 80s) it became possible to implement a communication protocol in firmware [ref 1]. But by the 90s it was back to hardware, and since that time there has been an acceleration in communication bandwidth which renders software implementation impractical (the so-called "Shannon gap"). Still, we have seen over the years [ref 2] multiple attempts at the so-called "wire processor", but we feel it is a bit of a white rabbit chase.

Description
Traditionally, communication (COM) is divided between infrastructure and terminal. In the middle stands the gateway. Obviously the infrastructure (Base Station, DSLAM, VoIP line cards, CMTS) requires N times more processing power than a terminal.
Because most COM processing is done in a dedicated hardware peripheral, what is the scope for coprocessing? We answer this question by listing a few COM applications which use COP(s). Covered elsewhere are NPUs and cell phone platform coprocessing.
 
BTS
The cellular BTS requires massive MIPS in Channel Coding (CC), chip rate modulation schemes (such as CDMA) and OFDM. These constitute the standard COPs:
- CC COPs: Viterbi, Turbo Codes [ref 3]
- Chip Rate Processing: Rake receiver, De-spreader, Searcher [ref 4]
- OFDM: FFTer [TI website, see C66x core apps].
And a special mention to the gutsy BAZIL, which was made of an LSI/ZSP DSP and an "uncommitted array" of gates to implement the typical bit-level algorithms [ref 5].
We also worked on an architecture made of uncommitted COPs (linked through SRIO) and commanded by linked lists in global memory. This is what we will call a high-end COP interface (a sketch of such a descriptor chain follows).
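As an illustration only (the field names and layout below are invented for this sketch, not taken from any product), such a descriptor chain could look like this in C:

    #include <stdint.h>

    typedef struct cop_task {
        uint32_t src;               /* physical address of the input block        */
        uint32_t dst;               /* physical address of the output block       */
        uint32_t length;            /* payload size in bytes                      */
        uint32_t opcode;            /* which COP function to run                  */
        volatile uint32_t status;   /* written back by the COP when done          */
        struct cop_task *next;      /* next descriptor; NULL terminates the chain */
    } cop_task_t;

    /* the host builds a chain of cop_task_t in global memory and hands the head
       pointer to the COP (say through a doorbell register); the COP walks the
       list, executes each task and updates the status fields */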

 
PHY
The point is that there are NOT that many ways to architect a PHY; the most flexible one is to use a cheap core (ARM7, 8051, Cortex) and to connect it through a light interface to the hardware data paths. This can be considered de-facto coprocessing (a register-level sketch follows below).
The problems start when the data paths require more and more DSP, most of all when the signal processing functions become adaptive and require some light processing from the cheap core. Further, as time progresses (and customers keep requesting), the signal processing becomes less linear and more control oriented. Then the cheap core and its light interface very soon need more help (a coprocessor to the coprocessor? why not?).
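To make the "light interface" concrete, here is a minimal C sketch assuming a hypothetical memory-mapped register block (the base address and register layout are invented for illustration):

    #include <stdint.h>

    #define PHY_BASE   0x40001000u                      /* hypothetical base address */
    #define PHY_CTRL   (*(volatile uint32_t *)(PHY_BASE + 0x0))
    #define PHY_STATUS (*(volatile uint32_t *)(PHY_BASE + 0x4))
    #define PHY_COEFF  (*(volatile uint32_t *)(PHY_BASE + 0x8))

    /* the "light processing" done by the cheap core: adapt a datapath
       coefficient, kick the hardware, wait for completion */
    static void phy_retune(uint32_t new_coeff) {
        PHY_COEFF = new_coeff;
        PHY_CTRL |= 1u;                     /* start bit */
        while ((PHY_STATUS & 1u) == 0)      /* poll the done bit */
            ;
    }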


Protocol Processor
Over the years many people have attempted the creation of a new type of processor: the PP. It was to be for the Link Layer what the DSP is for the Physical Layer. Largely buried in the dustbin of history, the PP should still not be dismissed. The reason for its failure has more to do with the advent of more complete solutions, such as the PPP (Packet Protocol Processor) and the NPU (Network Processing Unit), than with the concept itself.
As a coprocessor the PP is a very useful building block, more likely to be loosely than tightly coupled. As far as references go, all I have is a lot of NDAs [TBD].

I/O Processor
This is the same story as above: many attempts for little result. As a matter of fact, the classic reason for failure is speed. As the main CPU goes up the speed curve, the dedicated processor gets left behind and becomes a point solution. However, from a coprocessing perspective, the equation does not hold anymore: the host speed is pretty stable relative to the system workload.
So an IOP can also be a pretty simple COP [ref 6, Intel 8089] which insulates the host CPU from real-time communication events. The interface is another story {bandwidth and shared memory issues}.

References 
  1. TI 320C20 DSP doing HDLC
  2. Intel at Hot chips but when?
  3. Nat Seshan, "New TMS320C6416 Solves the DSP Challenge", EPF, June 2001
  4. A. Gatherer et al., "A UMTS Baseband Rx Chip for Infrastructure Apps" (TI TCI110), HC 2003
    1. A most interesting example of TI using multiple ARM7s + hardware blocks (instead of a programmable DSP) to implement a communication function.
    2. The interface is standard mid-end (setup of contexts, parallel buses, DMA)
  5. Neil Stollon, "Bazil", EPF, June 2001
    1. A heck of a paradigm shift. Still waiting after all these years.
  6. Robin Jigour, "Data concentration techniques unload host computers", EDN, March 4, 1981
    1. Usage of the 8089 IOP with a good description of interprocessor communication handled by linked message blocks
FURTHER : {TBD}

Saturday, December 3, 2011

BB - AGU

AGU (Address Generation Unit)

The goal of this section is to cover in detail the design of an AGU.

Background
The first AGUs go back to the 70s when people were using bit slices to design boards [ref 1]. With the introduction of the first GP DSPs, the AGU became slightly more sophisticated. In the mid-80s, AGUs reached their peak (not in general-purpose DSPs) in Building Block (BB) DSPs: first with the byte slice from AMD (with the 29540 exclusively for FFT) [ref 2], and second with the introduction of the word slice family from ADI (ADSP 1410) [ref 3].
The more modern DSPs of the 90s never went overboard on AGUs, mainly due to the simplifying impact of RISC and computer architecture. The AGU gave way to the more advanced concept of the LSU (Load Store Unit).

Description
Before describing an AGU there is a pre-requisite and a post-requisite.
The pre-requisite is to have a good idea of the standard addressing modes and their implications.
The post-requisite is the vast number of CPU issues which are not considered HERE since they are beyond the scope of a simple DSP or COP design: for instance, cache effects, speculation, non-blocking loads, store multiple, user/supervisor mode, etc.


Features
  1. Number of data buses and data memories 
    1. This should give the number of AGUs. But note that it is as valid to implement (say) a 3-address-bus AGU as 1 unit or as 3 single units. It is the classical trade-off of shared resources versus locality.
  2. Some questions in standard addressing modes:
    1. the simplest DSP architectures do not require a stack pointer
    2. pre-increment can be implemented in 2 cycles

  3. Circular addressing
  4. Bit reverse addressing (and all FFT related)
    1. while not strictly related to the AGU, FFT can use the Block Floating Point mode, which implies a load or store with scaling.
  5. Paging
and more advanced 
  1. Vector Addressing modes (concept of strides)
  2.  Interleaving (as used for decimation)
  3. Complex addressing (as in complex numbers)
Interesting
  1. PC relative addressing
  2. A bit processing unit which takes care of bit test/set and semaphores. (see StarCore).
  3. A mask unit
  4. Using an AGU as a second data unit
Number of buses and data memories 
Traditional DSPs had a 2-bus/memory architecture (such that Acc += X*Y) or even a 3-bus/memory architecture (such that Z = X*Y). When the very high performance Infineon Carmel was first developed, it had 6 buses. The reason was the dual multiplier (since Z = X*Y in parallel with W = U*V gives 6 buses). Obviously it is difficult to be more flexible!
When we designed TriCore, one key idea was to separate memory and data type. So, instead of having the standard DSP X and Y memories, each 16 bits wide, we had a single 32-bit-wide memory. This not only dramatically improved the core design, it also fell perfectly in line with the TriCore base being a 32-bit CPU. And when we improved it with a dual MAC, we doubled the bandwidth to a 64-bit bus.
The bottom line is that multiple memories can be avoided. Having 2xN or 1x2N is exactly the same in terms of bandwidth and performance. The main impact is more complex programming (to be explained, see software pipelining).

On the other hand, there are some algorithms which become so complex to implement on a single-bus architecture that the simplification of the single bus gets lost.

Standard addressing modes
The standard addressing modes are generally all similar (p++, p--, p+K and variations). The main issues are the number and width of the registers. The only real issue is the stack pointer, which requires a pre-increment (++p), totally at odds with the other addressing modes (they all use post-modification). This is problematic since addresses must be generated as fast as possible and having an adder in the worst-case datapath is not recommended.
The solutions are multiple:
- have a 2-cycle instruction for push (or pop)
- prepare all addresses in advance and have a final mux (sketched below)
- bite the bullet and implement it; this is even truer if the AGU contains a base+index addressing mode.
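A small C model of the "prepare all addresses in advance" trick (the names are ours, purely for illustration; in hardware the two candidates are computed in parallel and a late mux selects one, so the adder is not in series with the mode decode):

    #include <stdint.h>

    typedef struct { uint32_t p; int32_t m; } agu_t;

    static uint32_t agu_address(agu_t *a, int pre_modify) {
        uint32_t plain  = a->p;                   /* candidate for post-modify (p++)       */
        uint32_t bumped = a->p + (uint32_t)a->m;  /* candidate for pre-modify (++p, push)  */
        a->p = bumped;                            /* the pointer is updated either way     */
        return pre_modify ? bumped : plain;       /* the "final mux"                       */
    }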

Modulo addressing
Modulo (or Circular) addressing is easy to define, explain and implement:

     if m >= 0, the pointer is incrementing:
        address = (ptr + m) >= (base + length) ? ptr + m - length : ptr + m
     if m < 0, the pointer is decrementing:
        address = (ptr + m) < base ? ptr + m + length : ptr + m
What are the issues?
1) Number of concurrent circular buffers?
1 is not enough, 2 is a good compromise, but a need for 3 can easily be met. In a regular ISA, 4 or 8 is not usual.
2) There is the issue of simplification.
The standard equation above requires 4 registers (ptr, inc, base, length). This can be simplified to 3 and even down to 2 registers. It is very tempting to simplify by masking address bits; the circular buffer must then be a power of 2. This is a very poor solution.

3) The last issue concerns modern CPUs more than classic DSPs.
Generally a DSP has a unique data size (the native size). In a modern CPU, the memory access width is independent of the data type. Hence accessing a long (2 packed shorts) on a circular buffer of shorts will run into alignment problems.
The 4 register model
current ptr (p), modif (m), base (b), length (l).
modif (also called offset) is the post increment/decrement.

   cc1 = gt(m,0);
   cc2 = lt(m,0);
   if cc1
      if (p+m) < (b+l), p = p+m;
      else p = p+m-l; end;
   end
   if cc2
      if (p+m) >= b, p = p+m;
      else p = p+m+l; end;
   end
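For readers who prefer something compilable, here is a minimal C sketch of the same 4-register update (post-modify, assuming |m| <= l; the struct and function names are ours):

    #include <stdint.h>

    typedef struct {
        uint32_t p;   /* current pointer                 */
        int32_t  m;   /* modif: post increment/decrement */
        uint32_t b;   /* base of the circular buffer     */
        uint32_t l;   /* length of the circular buffer   */
    } circ4_t;

    static uint32_t circ4_access(circ4_t *r) {
        uint32_t addr = r->p;                  /* post-modify: the access uses the old p */
        int64_t  next = (int64_t)r->p + r->m;
        if (r->m > 0) {                        /* cc1: incrementing */
            if (next >= (int64_t)r->b + r->l) next -= r->l;
        } else if (r->m < 0) {                 /* cc2: decrementing */
            if (next < (int64_t)r->b) next += r->l;
        }
        r->p = (uint32_t)next;
        return addr;
    }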

The 3 register model
current pointer (p), base (b), length (l).
Same equation as above, but m is taken from the opcode immediate field (say -16:+15).

The 2 register model (easy)
In this scheme the two registers are the current pointer (p) and a mask (m). The length of the buffer is limited to a power of 2, and the base must start on a power-of-2 boundary. The mask gives the position where the pointer is cut in two. For example, for a 64-long circular buffer:
   p = concat(p_up, p_lo)   where p_up is the pointer with its lower 6 bits masked and p_lo is a counter 0:63
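A minimal C sketch of this masked update (illustrative names; for the 64-long buffer of the example, mask = 0x3F and the base is 64-aligned):

    #include <stdint.h>

    static uint32_t circ_mask_access(uint32_t *p, int32_t m, uint32_t mask) {
        uint32_t addr = *p;                                   /* post-modify access          */
        uint32_t up   = *p & ~mask;                           /* p_up: upper bits stay fixed */
        uint32_t lo   = (uint32_t)(*p + (uint32_t)m) & mask;  /* p_lo: wraps around for free */
        *p = up | lo;
        return addr;
    }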
The 2 register model (freescale 56800)

This is similar to the above but with much less severe limitations. The base can be anywhere and the length can be any value. The trick is that the buffer will always take up the space of the next power of 2. So a modulo 366 will require the space of 512 words but will have 366 as the upper bound (and zero as the lower bound).

The 2 register model (TriCore )
On TriCore we had 16 address registers and there was no reason not to try for the best, which was at the time 8 simultaneous circular buffers... We are joking; in fact this number is given by the regularity of the instruction set. Since we had the concept of the register pair, it was simpler to define a modulo access as using any register pair, but with a special meaning.

So we had to map 3 values, current pointer (p), base (b), length (l), onto a register pair (say D0, D1).

And we mapped it as follows:
D1 = base (32-bit)
D0 = length || pointer (both 16-bit)
Finally, the post-modification (m) was a 10-bit signed value given by the opcode immediate field.
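A C sketch of how such a packed pair can be decoded and updated (our own modelling, post-modify with the same wrap rule as the 4-register model and assuming |m| < length; the authoritative semantics are in the TriCore architecture manual [ref 9]):

    #include <stdint.h>

    typedef struct { uint32_t d1_base; uint32_t d0_len_ptr; } circ_pair_t;

    static uint32_t circ_pair_access(circ_pair_t *r, int32_t m /* 10-bit signed immediate */) {
        uint16_t len  = (uint16_t)(r->d0_len_ptr >> 16);      /* upper half: length  */
        uint16_t ptr  = (uint16_t)(r->d0_len_ptr & 0xFFFFu);  /* lower half: pointer */
        uint32_t addr = r->d1_base + ptr;                     /* effective address   */
        int32_t  next = (int32_t)ptr + m;                     /* post-modification   */
        if      (next >= (int32_t)len) next -= len;
        else if (next < 0)             next += len;
        r->d0_len_ptr = ((uint32_t)len << 16) | (uint16_t)next;
        return addr;
    }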


Bit reverse addressing
Bit reverse is the kind of feature which looks like a bottomless pit. There is no end to the complexity.

1) First, let us start with the golden model, which is not so easy to generate.
  • Generating it in C is itself prone to errors, so this is not a very good start for verification.
  • Obviously, using a mirror on a binary table of increasing numbers will give the result.
  • But in fact bit reverse can be easily generated. For instance in Matlab, with x the input index (scalar or vector) and n the number of bits:
       y = dec2bin(x, n);
       bitrev = bin2dec(y(:, end:-1:1));
  • Or logically. Starting from (0,1), a new pair of numbers is generated by multiplying by 2 (0,2), to which is concatenated the same pair but with +1 (1,3). And the next step (0,2,1,3) generates [0,4,2,6] and [1,5,3,7]. Etc., ad vitam aeternam. You can use Excel, Matlab or C to do that (a C sketch follows).
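Here is a small C version of that doubling construction (for NBITS = 3 it prints 0 4 2 6 1 5 3 7):

    #include <stdio.h>

    #define NBITS 3
    #define N (1 << NBITS)

    int main(void) {
        unsigned tab[N] = {0};
        unsigned n = 1;                                               /* current table length */
        for (int s = 0; s < NBITS; s++) {
            for (unsigned i = 0; i < n; i++) tab[i] <<= 1;            /* multiply by 2        */
            for (unsigned i = 0; i < n; i++) tab[n + i] = tab[i] + 1; /* append the +1 copies */
            n <<= 1;
        }
        for (unsigned i = 0; i < N; i++) printf("%u ", tab[i]);
        printf("\n");
        return 0;
    }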
2) The implementation in hardware seems trivial, but it is not.
  • We can for instance reverse the wires, such as bitrev(13:0) = bit(0:13).
  • The problem is that this is only valid for a word width of 14 (a 16K FFT). You cannot use the least significant 8 bits to do an FFT256.
  • But note that more modular methods exist (adder with reversed carry); see the sketch below.
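A C sketch of the reversed-carry idea: add 1 starting from the MSB of an nbits-wide index; iterating from 0 steps through the bit-reversed order for any nbits (0, 4, 2, 6, 1, 5, 3, 7 for nbits = 3):

    static unsigned bitrev_next(unsigned i, unsigned nbits) {
        unsigned bit = 1u << (nbits - 1);   /* inject the carry at the MSB                  */
        while (i & bit) {                   /* propagate the carry downwards                */
            i ^= bit;
            bit >>= 1;
        }
        return i | bit;                     /* set the first zero found; wraps to 0 at end  */
    }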
References

1. R.J. Karwoski, "A general purpose address controller for real time and array processor applications", reprinted from TRW (1981?)
2. AMD 29540, as part of the byte slice portfolio
3. ADI ADSP 1410 Word Slice Address Generator Data Sheet (~1985)
4. Eric Martin, "   ", 12 May 1986
5. Motorola DSP56800 family manual chapter 4 AGU
6.  ADI Blackfin
7.  TI C80x
8. Starcore SC1400
9.  Infineon Tricore Architecture manual  
10.  Infineon Carmel
11.  C55x
12.  Bier, Shoham & all “DSP processor fundamentals :: Section 6.  Addressing” BDTI 1994-96

Bit Slice BB

Bit Slice Building Blocks

The goal of this section is to cover a quick history of bit slice logic and its impact on the first generation of fully integrated DSPs, and to provide a methodology for today's COPs.


Background
See references: AMD (Mick and Brick), Blakeley, etc.

Description
Bit slice logic was not only a series of components but, most of all, a technique to build a General Purpose (GP) computer and, best of all, a GP DSP.
Basically, any DSP can be built from 3 units (or BBs):
  • PCU: Program control Unit
  • DAU: Data Arithmetic Unit
  • AGU: Address Generation Unit
The way the 3 units are connected does not vary. What varies is the feature set of each unit, the number (and width) of the buses, the number of memories, and the microcode.


Program Control Unit (PCU)
Its function is to fetch instructions and to decode them before dispatching them to the respective units. In the bit slice model the decoding is not part of the PCU. Since the instructions are very wide and monolithic, there is an intermediate level (the microprogram store). A good example of a PCU is the 2910 sequencer from the original AMD 2900 bit-slice family, but all DSPs have a section describing the PCU; this section goes by various names.
 

Data Arithmetic Unit (DAU)
The DAU goes by many names and comes in many types. The original AMD 29xx family had the 2901, called a RALU (register file + ALU). And because we were processing signals, we added a multiplier, which gave rise to the 2 standard topologies (multiplier in series or in parallel). The introduction of GP DSPs more or less standardized the 16x16 multiplier + 40-bit accumulator structure. With the 90s and the popularization of computer architecture came first the name IP (Integer Processing), then the parallel topology with all the units attached to the register file, then new types of units (Bit Manipulation Unit, Bit Field Unit, Shuffle), then duplication of units (superscalar) and sub-word parallelism.
Today, the DAU is where most of the action is, especially since our goal is to implement Matlab functions.

Address Generation Unit (AGU)
The AGU is a bit of the poor man's BB of the bit slice family. For instance, the original AMD family did not have one (it was simpler to implement one with a RALU). This is still partially true today: an AGU is a degenerate ALU (1), but like anything dealing with memories it should not be underestimated.
See here for more details.


(1) In the simplistic model of computer architecture there is no AGU: firstly there is no load/store parallelism, and moreover there is a unified register file (see ARM7, ARM9, MIPS, PPC, etc.). {Obviously I do not go into superscalar here.}