DSP Bricklayer: VADD

BDT Benchmark - Vector Add

Goal
Introduce benchmarking as an architecture tool. The boundary between hardware and software is deliberately blurred in order to teach architecture.

Background
A long explanation will go here: manufacturer benchmarks (in the 1980s), before BDTI, and other background, to explain that vector add is not a BDT copyright. By choosing Ntaps=40, however, BDT turned it into a standard, in the same way that they turned FFT256 into another benchmark standard.

Description
The VADD401 BB is remarkable because it is both very simple and complex. I believe that studying it covers perhaps 50% of the low-hanging fruit of structural architecture.
We now present results for several machines (DSP, CPU, etc.). For some of the results we used BDT as a reference. We also use MATLAB-like notation as the descriptive language.
The cycle equations have the form: total cycles = n*N + ot (overhead) + p (pipeline), where N = 40 is the vector length and n is the cycle cost per element. In most cases ot and p are merged into a single constant.
The operation is
z = x + y, where x, y, z are of the native data size (typically 16-bit).
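
As a functional reference for the operation above, here is a minimal Python sketch of the vector add on 16-bit words. The post does not say whether overflow wraps or saturates, so the `wrap16` helper (a hypothetical name) assumes two's-complement wraparound; `vadd_ref` is likewise an illustrative name, not part of any BDT definition.

```python
def wrap16(v):
    """Wrap an integer to signed 16-bit two's complement (assumed overflow rule)."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000


def vadd_ref(x, y):
    """Reference vector add: z[i] = x[i] + y[i] on 16-bit words."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [wrap16(a + b) for a, b in zip(x, y)]


N = 40                 # vector length used throughout this post
x = list(range(N))     # example input data (arbitrary)
y = [100] * N
z = vadd_ref(x, y)
```

Every machine described below should produce the same z as this reference; only the number of cycles differs.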

1) Architecture 1980
Machine 1: Theory= 3-bus DSP
fetch
decode
x=ld(mem1); y=ld(mem2);
for ii=1:39, z=add21(x,y); mem3=st(z); x=ld(mem1); y=ld(mem2);end;
z= add21(x,y); mem3=st(z);

Machine 2: Standard = 2-bus DSP
fetch
decode
x=ld(mem1); y=ld(mem2);
for ii=1:40, z=add21(x,y); x=ld(mem1); y =ld(mem2);
mem3=st(z);
end
Note: the load "one too far" is not considered a problem.
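
To make the "one too far" load concrete, here is a Python sketch of the Machine 2 loop structure. It assumes a shared read pointer `rd` (a hypothetical name) and pads each input with one extra word so the final load is harmless; on real hardware that read would simply land on whatever follows the buffer.

```python
N = 40


def vadd_machine2(mem1, mem2):
    """Software-pipelined loop shaped like Machine 2.

    Returns (z, highest index loaded from the inputs).
    """
    mem3 = []
    rd = 0
    x, y = mem1[rd], mem2[rd]          # prolog load of element 0
    rd += 1
    last_loaded = 0
    for _ in range(N):
        z = x + y                      # add the operands loaded one step earlier
        x, y = mem1[rd], mem2[rd]      # load the next pair; reads index N on the last pass
        last_loaded = rd
        rd += 1
        mem3.append(z)                 # store the result
    return mem3, last_loaded


# One padding word per input absorbs the extra load.
mem1 = list(range(N)) + [0]
mem2 = [1] * N + [0]
z, last = vadd_machine2(mem1, mem2)
```

Here `last` comes out as N (index 40), one past the final valid element. Without the padding word, the same loop raises an IndexError in Python.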

Machine 3: CPU = 1-bus
fetch
decode
counter=40;
while(counter>=1)
x= ld(mem1);
y= ld(mem2);
z= add21(x,y);
mem3=st(z);
dec(counter);
branch(top); % dummy, marks the cost associated with the branch
end
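
One possible reading of the 6*N+3 figure quoted for machine 3 in the results is to charge one cycle per pseudo-operation listed above. This accounting is an assumption on my part; it is not stated explicitly, and it does not necessarily carry over to the other machines. A minimal Python sketch of that count:

```python
def machine3_cycles(N):
    """Count one cycle per pseudo-operation of Machine 3 (assumed cost model)."""
    cycles = 0
    cycles += 1          # fetch
    cycles += 1          # decode
    counter = N
    cycles += 1          # counter = N
    while counter >= 1:
        cycles += 1      # x = ld(mem1)
        cycles += 1      # y = ld(mem2)
        cycles += 1      # z = add21(x, y)
        cycles += 1      # mem3 = st(z)
        counter -= 1
        cycles += 1      # dec(counter)
        cycles += 1      # branch(top)
    return cycles


total = machine3_cycles(40)
```

Under this model the loop body costs 6 cycles per element and the setup costs 3.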

This gives the following results (N = 40):

machine 1: 1*(N-1) + 3 = 42
machine 2: 2*N + 4 = 84
machine 3: 6*N + 3 = 243
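
The three formulas can be checked and compared with a few lines of Python. This is only arithmetic on the equations above; the dictionary name `cycles_1980` is illustrative.

```python
N = 40  # vector length

# Cycle formulas copied from the results above.
cycles_1980 = {
    "machine 1": 1 * (N - 1) + 3,
    "machine 2": 2 * N + 4,
    "machine 3": 6 * N + 3,
}

baseline = cycles_1980["machine 1"]
for name, c in cycles_1980.items():
    # cycles per element, and slowdown relative to the 3-bus DSP
    print(f"{name}: {c} cycles, {c / N:.2f} cycles/element, {c / baseline:.2f}x machine 1")
```

The cycles-per-element column makes the cost of dropping buses and of losing the zero-overhead loop directly visible.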

The main architectural trade-offs are:
- number of buses
- ZOL (zero-overhead loop)

2) Architecture 2000

Machine 11: 2-bus 4 lane DSP
fetch
decode
dispatch
ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
for ii=1:10, ssssz=add84(ssssx,ssssy); ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
mem3=st_by4(ssssz);
end

Machine 12: 1-bus 8 lane CPU
fetch
decode
dispatch
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
for ii=1:5
ssssssssz=add168(ssssssssx,ssssssssy);
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
mem3=st_by8(ssssssssz);
end

Machine 12: 1-bus 8 lane CPU with 'proper' epilog
fetch
decode
dispatch
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
for ii=1:4
ssssssssz=add168(ssssssssx,ssssssssy);
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
mem3=st_by8(ssssssssz);
end
ssssssssz=add168(ssssssssx,ssssssssy);
mem3=st_by8(ssssssssz);
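
The prolog / loop / epilog shape of this variant can be sketched in Python as follows. The inner helper `ld_by8` mirrors the pseudo-operation of the same name; passing it an explicit index, and counting the loads, are additions made for illustration.

```python
N, LANES = 40, 8


def vadd_8lane_epilog(mem1, mem2):
    """Prolog + (N/LANES - 1) loop passes + epilog.

    Returns (z, number of 8-lane loads issued per input).
    """
    def ld_by8(mem, i):
        # Load 8 consecutive elements starting at index i.
        return mem[i:i + LANES]

    mem3 = []
    x, y = ld_by8(mem1, 0), ld_by8(mem2, 0)          # prolog
    loads = 1
    for k in range(1, N // LANES):                   # 4 passes for N = 40
        z = [a + b for a, b in zip(x, y)]            # 8-lane add
        x, y = ld_by8(mem1, k * LANES), ld_by8(mem2, k * LANES)
        loads += 1
        mem3.extend(z)                               # 8-lane store
    z = [a + b for a, b in zip(x, y)]                # epilog: last add and store
    mem3.extend(z)
    return mem3, loads


mem1 = list(range(N))
mem2 = [10] * N
z, loads = vadd_8lane_epilog(mem1, mem2)
```

With the epilog peeled off, each input is loaded exactly N/8 = 5 times, so no load goes past the end of the buffers.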

Machine 13: 1-bus 8 lane CPU with interleaved data
fetch
decode
dispatch
[ssssssssx ssssssssy ]=ld_by16(mem);
for ii=1:4
ssssssssz =add168(ssssssssx,ssssssssy);
[ssssssssx ssssssssy ]=ld_by16(mem);
mem=st_by8(ssssssssz);
end
ssssssssz =add168(ssssssssx,ssssssssy); mem=st_by8(ssssssssz);
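
The interleaved layout is not spelled out above. A plausible reading, assumed here, is that memory holds 8 words of x followed by 8 words of y, repeated, so that one 16-word load delivers both operands. The function names below are hypothetical, and the sketch ignores the software pipelining to focus on the data layout.

```python
N, LANES = 40, 8


def interleave(x, y):
    """Build the assumed layout: [x0..x7, y0..y7, x8..x15, y8..y15, ...]."""
    mem = []
    for i in range(0, len(x), LANES):
        mem.extend(x[i:i + LANES])
        mem.extend(y[i:i + LANES])
    return mem


def vadd_interleaved(mem):
    """Vector add where each 16-word load yields one x block and one y block."""
    z = []
    for base in range(0, len(mem), 2 * LANES):
        block = mem[base:base + 2 * LANES]           # one ld_by16
        xs, ys = block[:LANES], block[LANES:]        # split into the two operands
        z.extend(a + b for a, b in zip(xs, ys))      # 8-lane add and store
    return z


x = list(range(N))
y = [1000] * N
mem = interleave(x, y)
z = vadd_interleaved(mem)
```

The reorganisation moves work out of the loop: the two loads per pass of machine 12 become one wider load, at the price of preparing the data in this layout beforehand.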

This gives the following results (N = 40):

machine 11: 2*N/4 + 5 = 25
machine 12: 4*N/8 + 7 = 27
machine 13: 3*N/8 + 6 = 21
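
As a quick check, the formulas can be evaluated in Python and compared against the 243 cycles of the 1-bus CPU of the 1980 section (machine 3). The names `cycles_2000` and `MACHINE3_CYCLES` are illustrative.

```python
N = 40                  # vector length
MACHINE3_CYCLES = 243   # 6*N+3, the 1-bus CPU result from the 1980 section

# Cycle formulas copied from the results above.
cycles_2000 = {
    "machine 11": 2 * N // 4 + 5,
    "machine 12": 4 * N // 8 + 7,
    "machine 13": 3 * N // 8 + 6,
}

for name, c in cycles_2000.items():
    # cycles per element, and speedup over the scalar 1-bus CPU
    print(f"{name}: {c} cycles, {c / N:.3f} cycles/element, "
          f"{MACHINE3_CYCLES / c:.1f}x faster than machine 3")
```

Note how the constant term (5 to 7 cycles) is now a large share of the total: with only 5 to 10 loop passes, overhead and pipeline depth matter as much as the per-element cost.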

The main architectural trade-offs are:
- replacing multiple buses with a single bus
- a wider bus
- a datapath with sub-word parallelism (multiple lanes)
- reorganisation of data in memory

Thursday, November 24, 2011
