DSP Bricklayer: May 2012

Bit Wise BB
This is a heck of a topic!
Firstly, it overlaps ISA studies and DSP Building Blocks in at least 2 specific places (the DAU and the shuffle units).
Secondly, the variety of function classes is very wide: for instance pack/unpack, Count Leading Signs, Galois fields, bit manipulation.
Thirdly, and this will serve as an introduction to the topic, the functions themselves can get amazingly out of hand.
Finally, Matlab's bit-wise capability is limited and inconsistent. As of this writing, our latest bit-wise library is floating-point based!

Background

In 1996, when we were designing the TriCore ISA, Bruce came up with a set of so-called "permute" instructions, and I do remember Rod's reaction as somewhere between bafflement and irritation.
On Bruce's side, the truth is that all media processors had this class of instructions [ref: search Ruby Lee].
On Rod's side, we all know that the most precious resource in ISA design is the opcode. And the biggest issue with bit-wise instructions is that they are opcode hogs (always requiring dozens of control bits).
For the sake of the story, I would add that since 1996 I have been involved with this issue multiple times and have seen a few people fall into this trap.

Example: TI C64+ : instruction PACKHL2
To illustrate this topic, we will now discuss in detail the PACKHL2 instruction of the TI C64+ DSP. We will then expand to all C64+ PACK instructions.

Naming convention and syntax issue
The instruction mnemonic PACK implies that the destination register is smaller than the source register(s), HL is a subtype, and 2 in the TI terminology means 2 sub-word operations. Since the registers are 32 bits wide, ADD2 (SUB2, etc.) means 2 x 16-bit additions, and ADD4 means 4 x 8-bit additions in parallel.
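
To make the "2" and "4" suffixes concrete, here is a minimal Python sketch of sub-word additions inside a 32-bit register. The function names add32, add2 and add4 are illustrative only, and wraparound (non-saturating) lanes are an assumption of this sketch, not a statement about the exact C64+ behaviour.

```python
MASK8 = 0xFF
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def add32(a: int, b: int) -> int:
    """Plain 32-bit addition: a carry can ripple across the whole register."""
    return (a + b) & MASK32


def add2(a: int, b: int) -> int:
    """Two independent 16-bit additions; no carry crosses the half-word boundary."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


def add4(a: int, b: int) -> int:
    """Four independent 8-bit additions; no carry crosses a byte boundary."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = (((a >> shift) & MASK8) + ((b >> shift) & MASK8)) & MASK8
        result |= lane_sum << shift
    return result


if __name__ == "__main__":
    a, b = 0x0001FFFF, 0x00010001
    print(hex(add32(a, b)), hex(add2(a, b)), hex(add4(a, b)))
```

The same two 32-bit operands give three different results depending only on where the carries are cut, which is exactly the information the suffix carries.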

The syntax is
PACKHL2(src1,src2, dst) where src1,src2, dst are all 32 bit registers

Question: since the syntax is exactly the same as
ADD (src1,src2,dst) where src1,src2, dst are all 32-bit registers
why do you need the suffix 2?
Answer: In a 32-bit architecture, the 32-bit register is generic. All instructions use 32-bit registers. What matters is the operation performed inside the 32-bit registers. In this case, 2 implies 16-bit data.
Question: PACK implies some kind of register demotion. The syntax dst = src1 <op> src2 is just like any standard two-operand syntax: the 2 registers are operated upon and the result is written to the destination. Where is the demotion in this type of operation?
Indeed, we would be more comfortable with dst32 = PACK(src64), where src64 is a pair of 32-bit registers (such as A3_A2).
                         PACKHL A3_A2,A0 ; a syntax using a 32-bit register pair (64bit )
                         PACKHL A2, A8, A0 ; TI syntax using two 32-bit registers
but the advantage of the TI syntax is obvious: register flexibility.

Definition
We are now at the core of the matter: what is the definition of the PACKHL2 instruction? First, we want something simple to describe this instruction. Writing a "C" definition is rather wordy, and the TI standard description is more complicated than needed. The simplest description is to consider the 2 registers side by side, src2 on the left (made of the two half-words D_C) and src1 on the right (respectively B_A); the result is then C_B.
                      D_C B_A

C_B

We have now the following definitions:
                        Z = concat(hi(X),lo(Y));    % hi() and lo() are self-explanatory functions
                        Z = [hi(X) lo(Y)];          % even more Matlab-like
                        Z = [X(31:16) Y(15:0)];     % not so Matlab

And they all look clear. But concatenation is not the same as packing. In fact, intuitively, it is the opposite: one increases the variable size, the other reduces it.
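
The three one-line definitions above translate almost literally into Python. This is only a sketch: hi, lo and concat mirror the names used in the text, X and Y are simply the two 32-bit operands of the formula, and pack_hi_lo is a hypothetical name introduced here for illustration.

```python
def hi(x: int) -> int:
    """Upper half-word, bits 31..16, of a 32-bit value."""
    return (x >> 16) & 0xFFFF


def lo(x: int) -> int:
    """Lower half-word, bits 15..0, of a 32-bit value."""
    return x & 0xFFFF


def concat(upper: int, lower: int) -> int:
    """Glue two 16-bit values into one 32-bit value."""
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)


def pack_hi_lo(x: int, y: int) -> int:
    """Z = concat(hi(X), lo(Y)), exactly as written in the text."""
    return concat(hi(x), lo(y))


if __name__ == "__main__":
    print(hex(pack_hi_lo(0x11112222, 0x33334444)))
```

Seen this way, the "packing" is really a selection: 64 bits of source come in, 32 bits are thrown away, and concat only glues the survivors together.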

Towards a general definition of PACK
Let us start with the most general definition. The register size is 64 bits and the granularity is 8 bits. Both values are very reasonable in 32-bit architectures. We will call this instruction PERM(ute).
PERM(src64, dst64, controlword);
We will see later that PACKHL2 is just a sub-case of PERM.
PERM is easily described, as shown in the following example:

  src64   HGFEDCBA

         8x8 switch

  dst64   AAGHFEDC

In this description each letter represents a byte; note the little-endian choice (byte A is the least significant and sits on the right).
But then what is the control word, and how many bits do you need? The number of bits is the problem. In this case the destination is a 64-bit register made of 8 sub-components (bytes). Each destination byte can receive any of the 8 source bytes (a 3-bit control), and since there are 8 destination bytes, the total is 3 x 8 = 24 bits. Not an easy thing to fit in a 32-bit opcode! And using a register to hold the control word is a poor solution: it means a 3-instruction sequence (MVK, MVKH, PERM).
Finally, the control word syntax is very straightforward. We just use the representation of the destination register. In the example above it is:
PERM(src64,dst64, AAGHFEDC)
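
A minimal Python sketch of this generic PERM follows. It assumes the letter convention of the example (A is source byte 0, the least significant byte, and the leftmost letter of the control string feeds the most significant destination byte). The names perm and encode_control, and the particular 24-bit packing of the selectors, are assumptions made for illustration, not an existing instruction encoding.

```python
def perm(src64: int, control: str) -> int:
    """Byte permute: each control letter selects the source byte for one destination byte."""
    if len(control) != 8:
        raise ValueError("control must name 8 destination bytes")
    dst = 0
    for letter in control:  # leftmost letter = most significant destination byte
        index = ord(letter) - ord("A")  # A -> source byte 0, ..., H -> source byte 7
        if not 0 <= index <= 7:
            raise ValueError("control letters must be in A..H")
        dst = (dst << 8) | ((src64 >> (8 * index)) & 0xFF)
    return dst


def encode_control(control: str) -> int:
    """Pack the eight 3-bit selectors into one 24-bit control word."""
    word = 0
    for letter in control:
        word = (word << 3) | (ord(letter) - ord("A"))
    return word


if __name__ == "__main__":
    # Use the ASCII codes of the letters as byte values so the result is easy to read.
    src = int.from_bytes(b"HGFEDCBA", "big")
    print(perm(src, "AAGHFEDC").to_bytes(8, "big"))
    print(oct(encode_control("AAGHFEDC")))
```

The datapath is a dozen lines; the cost is all in encode_control, which needs the full 24 bits whatever the permutation.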

Matching PERM and PACK

Using a definition similar to the one above, it can be seen that PACKHL2 is equivalent to:

PERM(src32,src32,dst32, FEDC)
In fact we can match all C64+ PACK Instructions in the same way (see table)

It must be noted that the C64+ DPACK instructions are effectively more like PERM than PACK since the sources (combined) and destination have the same 64-bit width.
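
As a sketch of this matching, the Python below places two 32-bit sources side by side, as in the D_C B_A picture of the Definition section, and applies a 4-letter control word. The name perm32 and the choice of sample values are hypothetical; the code models the description given in this post, not the TI manual.

```python
def perm32(left32: int, right32: int, control: str) -> int:
    """Select 4 destination bytes out of the 8 bytes of two side-by-side 32-bit sources."""
    if len(control) != 4:
        raise ValueError("control must name 4 destination bytes")
    src64 = ((left32 & 0xFFFFFFFF) << 32) | (right32 & 0xFFFFFFFF)
    dst = 0
    for letter in control:  # leftmost letter = most significant destination byte
        index = ord(letter) - ord("A")  # A -> byte 0 of the 64-bit pair
        if not 0 <= index <= 7:
            raise ValueError("control letters must be in A..H")
        dst = (dst << 8) | ((src64 >> (8 * index)) & 0xFF)
    return dst


if __name__ == "__main__":
    left = 0xDDDDCCCC   # half-words D_C
    right = 0xBBBBAAAA  # half-words B_A
    print(hex(perm32(left, right, "FEDC")))  # the middle four bytes, i.e. C_B
```

With only 4 destination bytes the control word shrinks to 4 x 3 = 12 bits, which is still far more than a fixed mnemonic such as PACKHL2 spends.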

Conclusions

Defining bit-wise functions just by looking at the datapath is relatively easy and can give simple yet very powerful structures. Architects (attracted by elegance) will always love that.
The problem is the control path, which can rapidly get out of hand.
To illustrate the problem, we took an instruction from the C64+ instruction set (PACKHL2) and extended it to a generic permute instruction. The number of control bits required to describe that instruction would be 24.
This problem applies equally to a building block or a coprocessor unit.
With reference to the C64+, we will now compare the two approaches:

Advantages of using a general PERM instruction

Conceptually, it is very simple.
Software implementation is very direct.
Very flexible: any source byte can go to any destination byte. Any source byte can be duplicated (or replicated n times) in destination.
No need for long studies and drastic selections to choose the right PACK datapaths to implement (this is very often the case with bytes).

C64+ offers only 2 choices: byte even and byte odd.

Shortcomings of using a general PERM instruction

Needs 24 bits of control. This is not realistic in a 32-bit ISA.

C64+ defines only 8 instructions. The footprint is minimal.

Having 24 control bits gives power(2,24) possibilities (about 16.8 million); how do you test that? (By construction?)
The flexibility advantage may be an illusion, since some features are still missing. Just looking at the C64+ ISA, for instance: sign extension and packing with saturation.

While TI made the right choices for the C64+ ISA, a different situation (64-bit ISA, dedicated COP, etc.) might give different results. The astute reader, we are sure, already has plenty of ideas and solutions.
BUT this is not the point of this section. The point is to make sure that you understand the main risk associated with bit-wise instructions (the control bits). To be forewarned is to be ...

Sunday, May 20, 2012

