DSP Bricklayer: BB

AGU (Address Generation Unit)

The goal of this section is to cover in detail the design of an AGU.

Background

The first AGUs go back to the 70s, when people were using bit-slice parts to design boards [ref 1]. With the introduction of the first general-purpose (GP) DSPs, the AGU became slightly more sophisticated. In the mid-80s, AGUs reached their peak (though not in general-purpose DSPs) in Building Block (BB) DSP: first with the byte slice from AMD (with the 29540, exclusively for FFT) [ref 2], and then with the introduction of the word slice family from ADI (ADSP 1410) [ref 3].
The more modern DSPs of the 90s never went overboard on AGUs, mainly due to the simplifying influence of RISC and computer architecture. The AGUs gave way to the more advanced concept of the LSU (Load Store Unit).

Description

Before describing an AGU there is a pre-requisite and a post-requisite.
The pre-requisite is to have a good idea of the standard addressing modes and their implications.
The post-requisite is the vast number of CPU issues which are not considered here, since they are beyond the scope of a simple DSP or COP design: for instance cache effects, speculation, non-blocking loads, store multiple, user/supervisor mode, etc.

Features

Number of data buses and data memories

This should give the number of AGUs. But note that it is just as valid to implement (say) a 3-address-bus AGU as 1 unit or as 3 single units. It is the classical trade-off of shared resources versus locality.

Some questions in standard addressing modes:

the simplest DSP architectures do not require a stack pointer
pre-increment can be implemented in 2 cycles

Circular addressing
Bit reverse addressing (and all FFT related)

while not strictly related to AGU, FFT can use the Block Floating Point mode which implies a load or store with scaling.

Paging

and more advanced

Vector Addressing modes (concept of strides)
Interleaving (as used for decimation)
Complex addressing (as in complex numbers)

Interesting

PC relative addressing
A bit processing unit which takes care of bit test/set and semaphores. (see StarCore).
A mask unit
Using an AGU as a second data unit

Number of buses and data memories

Traditional DSPs had a 2-bus/memory architecture (such that Acc += X*Y) or even a 3-bus/memory architecture (such that Z = X*Y). When the very high performance Infineon Carmel was first developed, it had 6 buses. The reason was the dual multiplier (since Z = X*Y in parallel with W = U*V gives 6 buses). Obviously it is difficult to be more flexible!
When we designed TriCore, one key idea was to separate memory and data type. So, instead of having the standard DSP X and Y memories, each 16 bits wide, we had a single 32-bit wide memory. This not only dramatically improved the core design, but it also fell perfectly in line, since the TriCore base was a 32-bit CPU. And when we improved it by having a dual MAC, we doubled the bandwidth to a 64-bit bus.
The bottom line is that multiple memories can be avoided. Having 2xN or 1x2N is exactly the same in terms of bandwidth and performance. The main impact is more complex programming (to be explained, see software pipelining).

On the other hand, there are some algorithms which become so complex to implement on a single-bus architecture that the simplification of the single bus gets lost.

Standard addressing modes

The standard addressing modes are generally all similar (p++, p--, p+K and variations). The main issues are the number and width of registers. The only real issue is the stack pointer, which requires a pre-increment (++p) that is totally at odds with the other addressing modes (they all use post-modification). This is problematic since addresses must be generated as fast as possible, and having an adder in the worst-case datapath is not recommended.
There are multiple solutions:
- have a 2-cycle instruction for push (or pop)
- prepare all addresses in advance and have a final mux
- bite the bullet and implement it; this is even truer if the AGU contains a base+index addressing mode.

Modulo addressing

Modulo (or circular) addressing is easy to define, explain and implement.
If m >= 0, the pointer is incrementing:

      address = (ptr + m) >= (base + length) ? ptr + m - length : ptr + m

If m < 0, the pointer is decrementing:

      address = (ptr + m) < (base) ? ptr + m + length : ptr + m

What are the issues?
1) Number of concurrent circular buffers?
1 is not enough, 2 is a good compromise, but a need for 3 is easily met. In a regular ISA, 4 or 8 is not unusual.
2) There is the issue of simplification.
The standard equation above requires 4 registers (ptr, inc, base, length). This can be simplified to 3 and even down to 2 registers. It is very tempting to simplify by masking address bits. The circular buffer length must then be a power of 2. This is a very poor solution.

3) The last issue concerns modern CPUs more than classic DSPs.
Generally a DSP has a unique data size (the native size). In a modern CPU, the memory access is independent of the data type. Hence accessing a long (2 packed shorts) in a circular buffer of shorts will run into alignment problems.

The 4 register model
The registers are the current pointer (p), the modifier (m), the base (b) and the length (l).
The modifier (also called offset) is the post increment/decrement.

      cc1 = gt(m,0);
      cc2 = lt(m,0);

      if cc1
            if (p+m) < (b+l),  p = p+m;
            else p = p+m-l; end;
      end

      if cc2
            if (p+m) >= b,  p = p+m;
            else p = p+m+l; end;
      end
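The 4 register update can be written as a small behavioural model. This is only a sketch in Python, not RTL: the function name circ_update is hypothetical, and it assumes |m| <= l so that a single add or subtract of the length is enough to bring the pointer back into the buffer.

```python
def circ_update(p, m, b, l):
    """Post-modify pointer p by m inside the circular buffer [b, b + l)."""
    # Assumption: one correction step only, so |m| must not exceed the length.
    assert abs(m) <= l, "a single correction only works for |m| <= length"
    q = p + m
    if m > 0 and q >= b + l:   # ran past the top of the buffer: wrap down
        q -= l
    elif m < 0 and q < b:      # ran below the base: wrap up
        q += l
    return q


# Walk a 5-word buffer at base 100 with a stride of +2.
p, trace = 100, []
for _ in range(6):
    trace.append(p)            # address used for the access (post-modify)
    p = circ_update(p, 2, 100, 5)
print(trace)                   # [100, 102, 104, 101, 103, 100]
```

Note that the address used for the access is the old pointer. The comparison and the correction only affect the next address, which is why post-modification keeps the adder out of the critical address path.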

The 3 register model
The registers are the current pointer (p), the base (b) and the length (l).
Same equation as above, but m is taken from the opcode immediate field (say -16:+15).

The 2 register model (easy)
In this scheme the two registers are the current pointer (p) and a mask (m). The length of the buffer is limited to a power of 2, and the base must start on a multiple of that power of 2. The mask gives the position where the pointer is cut in two. For example, for a 64-word circular buffer:
p = concat(p_up, p_lo), where p_up is the pointer with its lower 6 bits masked and p_lo is a counter running over 0:63.
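A minimal sketch of this mask scheme in Python follows. The name circ_update_mask is hypothetical, and the argument m is here the post-modifier while mask plays the role of the mask register (0x3F for the 64-word example).

```python
def circ_update_mask(p, m, mask):
    """Post-modify p by m; only the bits set in mask are allowed to count."""
    p_up = p & ~mask           # frozen upper part, i.e. the aligned base
    p_lo = (p + m) & mask      # wrapping counter 0..mask
    return p_up | p_lo


# 64-word buffer based at 0x1000: mask = 0x3F
print(hex(circ_update_mask(0x1000 + 63, 1, 0x3F)))   # 0x1000
print(hex(circ_update_mask(0x1000, -1, 0x3F)))       # 0x103f
```

No comparator is needed, which is what makes the scheme so tempting, but the price is the power-of-2 length and the aligned base.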
The 2 register model (Freescale 56800)

This is similar to the above but with much less severe limitations. The base can be anywhere and the length can be any value. The trick is that the buffer will always take the space of the next power of 2. So a modulo 366 will require the space of 512 words, but will have 366 as upper bound (and zero as lower bound).
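One way to picture this scheme is the following Python sketch. It is an interpretation of the description above and has not been checked against the family manual: it assumes that the lower bound is recovered by clearing the low k bits of the pointer, where 2^k is the next power of 2 at or above the length. The function name is hypothetical.

```python
def circ_update_pow2_region(p, m, length):
    """Any length, but the buffer sits in a power-of-two sized region."""
    # Assumption: the region is found by clearing the low k bits of p.
    k = (length - 1).bit_length()      # region size is 2**k >= length
    region_mask = (1 << k) - 1
    base = p & ~region_mask            # lower bound: offset zero
    off = (p & region_mask) + m        # offset inside the buffer
    if off >= length:                  # past the upper bound: wrap down
        off -= length
    elif off < 0:                      # below zero: wrap up
        off += length
    return base | off


# Modulo 366 occupies a 512-word region (here starting at 0x2000).
print(hex(circ_update_pow2_region(0x2000 + 365, 1, 366)))        # 0x2000
print(circ_update_pow2_region(0x2000, -1, 366) - 0x2000)         # 365
```

Compared with the mask scheme, a comparator on the offset is needed, but no separate base register.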

The 2 register model (TriCore)
On TriCore we had 16 address registers, and there was no reason not to go for the best, which at the time was 8 simultaneous circular buffers... We are joking: in fact this number is given by the regularity of the instruction set. Since we had the concept of a register pair, it was simpler to define a modulo buffer as using any register pair, but with a special meaning.

So we had to match 3 values, the current pointer (p), the base (b) and the length (l), to a register pair (say D0,D1).

And we mapped it as follows:
D1= base (32-bit)
D0= length || pointer (both 16-bit)
Finally, the post modification (m) was a 10-bit signed value given by the opcode immediate field.
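The register pair mapping can be modelled as below. This is a sketch under stated assumptions, not a description of the actual hardware: it assumes the length sits in the upper 16 bits of D0 and the pointer in the lower 16 bits, and that the 16-bit pointer is an offset relative to the base. The function name circ_access is hypothetical.

```python
def circ_access(d0, d1, m):
    """Return (effective_address, new_d0) for one post-modified access."""
    assert -512 <= m <= 511, "m is a 10-bit signed immediate"
    length = (d0 >> 16) & 0xFFFF       # assumed: upper half of D0
    index = d0 & 0xFFFF                # assumed: lower half of D0, offset from base
    ea = (d1 + index) & 0xFFFFFFFF     # address used for this access
    index += m                         # post-modification
    if index >= length:
        index -= length
    elif index < 0:
        index += length
    return ea, (length << 16) | index


d1 = 0x8000                            # base
d0 = (6 << 16) | 4                     # length 6, pointer at offset 4
ea, d0 = circ_access(d0, d1, 4)
print(hex(ea), d0 & 0xFFFF)            # 0x8004 2
```

Because the length and the pointer share one register, only that register is written back on each access and the base register stays read-only.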

Bit reverse addressing

Bit reverse is the kind of feature which looks like a bottomless pit. There is no end to the complexity.

1) First let us start with the golden model, which is not so easy to generate.

Generating it in C is itself prone to errors, so this is not a very good start for verification.
Obviously, using a mirror on a binary table of increasing numbers will give the result.
But in fact bit reverse can be easily generated. For instance in Matlab (the string must be padded to the full address width n, otherwise the leading zeros are lost):

    with x the input and n the number of address bits

       y = dec2bin(x, n)

       bitrev = bin2dec(y(end:-1:1))

Or logically. Starting from (0,1), a new pair of numbers is generated by multiplying by 2, giving (0,2), to which is concatenated the same pair but with +1, giving (1,3). The next step, starting from (0,2,1,3), generates [0,4,2,6] [1,5,3,7]. Etc. ad vitam aeternam. You can use Excel, Matlab or C to do that.
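The doubling construction is short enough to write down and cross-check against the string reversal. The sketch below uses Python rather than Matlab, and both function names are hypothetical.

```python
def bitrev_table(n_bits):
    """Build the bit-reversed order of 0 .. 2**n_bits - 1 by doubling."""
    table = [0]
    for _ in range(n_bits):
        doubled = [2 * v for v in table]              # multiply by 2
        table = doubled + [v + 1 for v in doubled]    # same values, +1
    return table


def bitrev_by_string(x, n_bits):
    """Reference model: reverse the zero-padded binary string."""
    return int(format(x, "0{}b".format(n_bits))[::-1], 2)


print(bitrev_table(3))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

Having two independent ways of generating the table is useful for verification, since one can serve as the golden model for the other.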

2) The implementation in hardware seems trivial but it is not

We can, for instance, reverse the wires, such as bitrev(13:0) = bit(0:13).
The problem is that this is only valid for an address width of 14 bits (a 16K FFT). You cannot use the least significant 8 bits to do an FFT256, since their reversed image comes out on the upper wires.
But note that more modular methods exist (an adder with reverse carry).
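The reverse-carry idea can be sketched as a bit-serial model in Python. The assumption here is the usual formulation: the pointer is incremented by N/2 while the carry ripples from the MSB down towards the LSB. The function names are hypothetical, and a real AGU would implement this as a carry chain rather than a loop.

```python
def reverse_carry_add(a, b, n_bits):
    """Add two n_bits-wide values with the carry rippling from MSB to LSB."""
    result, carry = 0, 0
    for bit in range(n_bits - 1, -1, -1):   # start at the MSB
        s = ((a >> bit) & 1) + ((b >> bit) & 1) + carry
        result |= (s & 1) << bit
        carry = s >> 1                      # carry moves to the next lower bit
    return result                           # carry out of the LSB is dropped


def bitrev_sequence(n_bits):
    """Successive addresses obtained by adding N/2 with reverse carry."""
    step = 1 << (n_bits - 1)                # N/2
    addr, seq = 0, []
    for _ in range(1 << n_bits):
        seq.append(addr)
        addr = reverse_carry_add(addr, step, n_bits)
    return seq


print(bitrev_sequence(3))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

The FFT size is now just a parameter (the position where the carry chain starts), so the same logic serves an FFT256 and a 16K FFT, which the fixed wire swap cannot do.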

References

1. R. J. Karwoski, “A general purpose address controller for real time and array processor applications”, reprinted from (TRW) (1981??)

2. AMD 29540 as part of the byte slice folio

3. ADI ADSP 1410 Word slice Address Generator Data Sheet (~1985)

4. Eric Martin “ “ 12 may 1986

5. Motorola DSP56800 family manual chapter 4 AGU

6. ADI Blackfin

7. TI C80x

8. StarCore SC1400

9. Infineon TriCore Architecture manual

10. Infineon Carmel

11. C55x

12. Bier, Shoham et al., “DSP processor fundamentals :: Section 6. Addressing”, BDTI 1994-96

Saturday, December 3, 2011

BB - AGU
