Thursday, November 24, 2011

VADD

BDT Benchmark - Vector Add

Goal
Introduce benchmarking as an architecture tool. The description implicitly blurs the line between hardware and software in order to teach architecture.

Background
A longer explanation of manufacturer benchmarks (in the 1980s), before BDTI, will go here, to make clear that vector add is not a BDT copyright. But by choosing Ntaps=40, BDT turned it into a standard, in the same way that it turned the FFT256 into another benchmark standard.

Description
The VADD401 BB is remarkable because it is both very simple and complex. I believe that studying it will cover maybe 50% of the low-hanging fruit of structural architecture.
We will now present results for several machines (DSP, CPU, etc.). For some of the results we used BDT as a reference. We also use "matlab" as the descriptive language.
The cycle equation is defined as: total cycles = n*N + ot (overhead) + p (pipeline), where N = 40 is the vector length and n is the number of cycles per element. In most cases ot and p are merged into a single term.
The operation is
  z = x+y   where x, y, z are of the native data size (typically 16-bit)
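
As a reference point for the listings that follow, here is a minimal Python sketch of the operation and of the cycle equation. The function names (vadd, total_cycles) are illustrative and not part of the benchmark. The wrap-around on 16-bit overflow is an assumption: the text only says "native data size", and many DSPs would saturate instead.

```python
N = 40  # vector length used by the benchmark


def vadd(x, y, bits=16):
    """Element-wise z = x + y on signed `bits`-wide words.

    Assumption: overflow wraps around (two's complement).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    z = []
    for a, b in zip(x, y):
        s = (a + b) & mask                           # keep the low `bits` bits
        z.append(s - (1 << bits) if s & sign else s)  # reinterpret as signed
    return z


def total_cycles(n, overhead, pipeline=0, length=N):
    """total cycles = n*length + ot (overhead) + p (pipeline)."""
    return n * length + overhead + pipeline
```

For example, total_cycles(2, 4) evaluates a machine that spends 2 cycles per element with 4 cycles of combined overhead and pipeline cost.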


1) Architecture 1980
Machine 1: Theory= 3-bus DSP
fetch
decode
x=ld(mem1); y=ld(mem2);
for ii=1:39, z=add21(x,y); mem3=st(z);   x=ld(mem1); y=ld(mem2);end;  
z= add21(x,y); mem3=st(z);     

Machine 2: Standard = 2-bus DSP
fetch
decode
x=ld(mem1);  y=ld(mem2);
for ii=1:40, z=add21(x,y); x=ld(mem1); y =ld(mem2);  
              mem3=st(z);
end     
Note: the last pass of the loop loads one element past the end of x and y. This load "one too far" is not considered a problem.
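
The "one too far" load can be made visible with a small Python model of the Machine 2 loop. This is only a functional sketch: it models the order of loads, adds and stores, not the cycle timing. The names vadd_pipelined and padded are hypothetical. Python checks list bounds, so the model needs one padding word per input, where the hardware simply reads and discards whatever follows the vector.

```python
def vadd_pipelined(mem1, mem2, n):
    """Software-pipelined loop in the style of Machine 2.

    mem1 and mem2 must hold n + 1 readable words, because the last
    pass of the loop loads one element too far.
    """
    mem3 = []
    x, y = mem1[0], mem2[0]               # prologue: first pair of loads
    for i in range(n):
        z = x + y                         # add the pair loaded in the previous pass
        x, y = mem1[i + 1], mem2[i + 1]   # load for the next pass (index n on the last pass)
        mem3.append(z)                    # store the result
    return mem3


def padded(values, pad=0):
    """Append one dummy word so the final, unused load stays in bounds."""
    return list(values) + [pad]
```

Calling vadd_pipelined(padded(x), padded(y), 40) returns the 40 sums. Without the padding word, the last load raises an IndexError in this model.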



Machine 3:  CPU  = 1-bus 
fetch
decode
counter=40;
while(counter>=1)
  x= ld(mem1); 
  y= ld(mem2);
  z= add21(x,y);
  mem3=st(z);
  dec(counter);
  branch(top); % dummy, marks the cost associated with the branch
end


This gives the following results (N = 40):

machine 1:    1*(N-1) + 3 = 42
machine 2:    2*N+4 = 84
machine 3:    6*N+3 = 243
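
The three formulas can be checked by evaluating them directly. The sketch below only restates the formulas given above as Python functions of N; the dictionary and function names are illustrative.

```python
N = 40  # vector length of the benchmark

# Cycle formulas copied from the results above, as functions of N.
CYCLES_1980 = {
    "machine 1 (3-bus DSP)": lambda n: 1 * (n - 1) + 3,
    "machine 2 (2-bus DSP)": lambda n: 2 * n + 4,
    "machine 3 (1-bus CPU)": lambda n: 6 * n + 3,
}


def cycle_table(formulas, n=N):
    """Evaluate every formula for a vector of length n."""
    return {name: f(n) for name, f in formulas.items()}


for name, cycles in cycle_table(CYCLES_1980).items():
    print(f"{name}: {cycles} cycles, {cycles / N:.2f} cycles per element")
```

Changing n in cycle_table shows how the fixed overhead matters less as the vector grows, while the per-element cost (1, 2 or 6 cycles) dominates.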



The main architectural trade-offs are:
- number of buses
- ZOL (zero-overhead loop).


2) Architecture 2000


Machine 11: 2-bus 4 lane DSP 
fetch
decode
dispatch
ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
for ii=1:10, ssssz=add84(ssssx,ssssy); ssssx=ld_by4(mem1); ssssy=ld_by4(mem2);
             mem3=st_by4(ssssz);
end     

Machine 12: 1-bus 8 lane CPU 
fetch
decode
dispatch
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
for ii=1:5
            ssssssssz=add168(ssssssssx,ssssssssy); 
            ssssssssx=ld_by8(mem1); 
            ssssssssy=ld_by8(mem2);  
            mem3=st_by8(ssssssssz);
end     


Machine 12: 1-bus 8 lane CPU with 'proper' epilog 
fetch
decode
dispatch
ssssssssx=ld_by8(mem1);
ssssssssy=ld_by8(mem2);
for ii=1:4
            ssssssssz=add168(ssssssssx,ssssssssy);
            ssssssssx=ld_by8(mem1);
            ssssssssy=ld_by8(mem2);
            mem3=st_by8(ssssssssz);
end
ssssssssz=add168(ssssssssx,ssssssssy);
mem3=st_by8(ssssssssz);


Machine 13: 1-bus 8 lane CPU with interleaved data 
fetch
decode
dispatch
[ssssssssx  ssssssssy ]=ld_by16(mem);
for ii=1:4
            ssssssssz =add168(ssssssssx,ssssssssy);
            [ssssssssx  ssssssssy ]=ld_by16(mem);
            mem=st_by8(ssssssssz);
end
ssssssssz =add168(ssssssssx,ssssssssy);
mem=st_by8(ssssssssz);
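
The interleaved layout of Machine 13 can be sketched functionally in Python. The block layout (8 words of x followed by 8 words of y, repeated) is an assumption read off the ld_by16 listing above, and the Python function names are illustrative. The sketch models data movement only, not cycles, and for simplicity collects the results in a separate list rather than writing them back to mem.

```python
LANES = 8  # sub-words handled per operation


def interleave(x, y, lanes=LANES):
    """Lay memory out as [x0..x7, y0..y7, x8..x15, y8..y15, ...] (assumed layout)."""
    mem = []
    for i in range(0, len(x), lanes):
        mem.extend(x[i:i + lanes])
        mem.extend(y[i:i + lanes])
    return mem


def ld_by16(mem, block, lanes=LANES):
    """One wide load: returns the x lanes and the y lanes of a 16-word block."""
    base = block * 2 * lanes
    return mem[base:base + lanes], mem[base + lanes:base + 2 * lanes]


def vadd_interleaved(mem, n, lanes=LANES):
    out = []
    xs, ys = ld_by16(mem, 0)                      # prologue load
    for block in range(1, n // lanes):            # 4 passes for n = 40
        zs = [a + b for a, b in zip(xs, ys)]      # 8-lane add
        xs, ys = ld_by16(mem, block)              # single load fetches both operands
        out.extend(zs)                            # 8-lane store
    out.extend(a + b for a, b in zip(xs, ys))     # epilogue: last add and store
    return out
```

With n = 40 the loop body runs 4 times and the epilogue handles the fifth block, so no load goes past the end of the data.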


This gives the following results (N = 40):

machine 11:    2*N/4 +5 = 25
machine 12:    4*N/8 +7 = 27
machine 13:    3*N/8 +6 = 21
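
A short Python sketch makes the split between per-element cost and fixed overhead explicit for these three machines. It only restates the formulas above. The function names are illustrative, and the integer division assumes N is a multiple of the lane count.

```python
N = 40  # vector length of the benchmark


def machine11(n=N):
    return 2 * n // 4 + 5   # 2 cycles per group of 4 elements, overhead 5


def machine12(n=N):
    return 4 * n // 8 + 7   # 4 cycles per group of 8 elements, overhead 7


def machine13(n=N):
    return 3 * n // 8 + 6   # 3 cycles per group of 8 elements, overhead 6


def cycles_per_element(machine, n=N):
    """Average cost per element, fixed overhead included."""
    return machine(n) / n


for name, m in [("machine 11", machine11),
                ("machine 12", machine12),
                ("machine 13", machine13)]:
    print(f"{name}: {m()} cycles, {cycles_per_element(m):.3f} cycles per element")
```

Machines 11 and 12 have the same slope of 0.5 cycles per element (2/4 and 4/8) and differ only in overhead, while the interleaved layout of machine 13 lowers the slope to 3/8.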



The main architectural trade-offs are :
- replacing multiple buses with a single bus
- a wider bus
- datapath with sub-word parallelism (multiple lanes)
- reorganisation of data in memory

