Sunday, October 30, 2011

Mukesh Patel

Mukesh Patel and Nazomi (sep 2000)

So here we are in meeting room 21, me as the architecture expert, to review the Nazomi IP (a Java accelerator). On the other side is Mukesh Patel, the CTO, and while waiting for Saddam the H and the rest, we have a long discussion on the maturity of the Java market.
I, as always Mr Suspicious, need to be convinced.
  1. Firstly, our DSP core (or any DSP for that matter) is not going to implement Java.
  2. Secondly, Java seems compulsory for our CPU core if we want to stay competitive with ARM as a cellphone platform. BUT (and it was not clear at the time) this does not fit our platform strategy, since our CPU core is mainly used on the modem side, especially as TriCore is a CPU+DSP.
  3. That leaves the AP (Application Processor) side; no question about that, except that the AP paradigm shift has not happened yet (this is 2000).
  4. Finally, I am not convinced of the cell phone as THE third screen.
Coming from a different angle, Mukesh explains that he was in Japan, where he saw a few applications (like reading serials on the subway, but I might be wrong).
Anyway, I am grateful to Mukesh for being one of the first people to expand my cellphone architecture perspective from Western to Eastern.
................ 
The really good part of the meeting was Saddam the H, who was the marketing person in charge of core licensing.
As someone mentioned later, "maybe he did not know that he was supposed to license OUT our cores, not license IN other people's cores". After the introductions, he started to ask if we could use the Java IP for our DSP core ...
Well, at this stage, my best use of time was to avoid opening my big mouth, go into power-down, look at the Nazomi documentation and listen to Mukesh...
Wonder where he is now.
Corrections:
Looking at my notes, in 2000 Nazomi was called Jedi Technologies (wow! glad they changed names).

Sunday, October 23, 2011

COP - Networking Apps

Coprocessor in Networking Applications
The goal of this section is to give a quick overview of COPs.

Description
COPs are classified into two types of processing:
    1. Centralized COP
    2. I/O COP
Intuitively, we understand that an I/O COP can be replicated (one per line) and sits in-line (between the PHY/MAC and the CPU), whereas the centralized COP is spatially unique and sits next to the CPU.
    We are not networking specialists (see the reference instead), so we will not try to file the following COPs under either of the two processing types. Instead, we will list them by how much their algorithmic content looks like a DSP algorithm.
    1. Encryption
      1. elliptic curves
      2. Galois fields
    2. Security
    3. Pattern matching, Reg-Ex
      1. TCAM

    4. TCP/IP acceleration
    5. Search, Classification

    References

    Linley Group

    Linley

    Linley Gwennap

    Among the 4 MPR musketeers, Linley Gwennap was the best at reconverting from CPU-centric to platform-centric. Remarkably, he did it very early (1999-2000) by jumping on the NPU bandwagon.
    1. At first, he was presenting special sessions at MPF (and maybe HC).
    2. Then, as MPF folded, he created his own group (www.linleygroup.com), which is THE reference in NPUs or, more exactly, in platforms for the infrastructure (more than networking) space.
    3. And frankly, we wish they would expand into other spaces.
    Still, from our perspective, we will list the interesting features offered by NPUs and infrastructure platforms (for a deeper understanding the reader should refer to the above reference):
    • Complex SOC architecture where bandwidth and sometimes latency are critical
      • --> advances in buses and fabrics.
    • Naturally prone to multi-processing (MP), either as multi-cores or as innovative multi-chips (good luck with latency).
      • --> advances in partitioning and topology
    • Even better, first serious implementations of MT (see Mario the MultiThread Champion) circa 2000.
    • Large diversity of Coprocessors.

    Mario MultoFredo

    So here we were circa 1996, the whole architecture team still working on the Dolphin definition, and I felt pretty annoyed.

    "Why do I have to listen to this guy when we are late | confused | living in hippie land and .. what the heck .. I have to listen to another crony and .. why the heck should I care about another CA (Computer Architecture) technique which looks like the others, miles ahead of any embedded implementation?"
    At this stage I was still very DSP-centric, and little did I know that 5 years later I would push for the TriCore next gen. to use multithreading (MT). {this is another story}
    Anyway, we all listened very patiently, and at the end of the meeting Bruce told me
    "This is Mario! Nobody listened to him at NS, and since he did a thesis on MT, he is pushing it all over the valley."
    Well! We went on to other, better, greater things and filed MT under "nice to know". Not surprisingly, MT was not seen or heard of in any embedded architecture (ARM, MIPS, PPC, SH, etc.) at the time, let alone DSP; instead it reappeared in a Network Processor (NPU).
    Circa 2000 Clearspeed presented their NPU at Hot Chips (or maybe Microprocessor Forum) and there, you have guessed it, was Mario in the center of the arena. In a way it was satisfying to see that a gifted and focused architect always finds a way to push his baby... {not sure about that, mind you}
     

    STOP ME !

    STOP ME (if you've heard that one before)!
    Herein are personal remembrances of meeting people, ideas, weirdos and opportunities.
    I do not guarantee the exact dates, and I changed the names as I saw fit.
    But they were always triggered by my brainy memory, which I expect to be reliable at separating fantasy from true events.
    • Mario MultoFredo  "if he is not the father of multi-threading, he is at least its brightest son"
    • Linley Gwennap  "easy to remember; it's not gwenAPP"
    • Mukesh Patel "JAVA acceleration for eastern  architecture"

    Tuesday, October 18, 2011

    Friday, October 7, 2011

    Coprocessors (COP)



     The goal of this section is to understand the scope and definitions of coprocessors (COP).
    • Difference between processors and COP. 
    • Difference between COP and peripheral.
    • ........already we can visualize that a COP is less than a processor but more than a Peripheral.
    • Difference between COP and accelerator
    • ........... at one time it was very clear
        • a COP sits next to a CPU on the level 1 bus (like an 80387 or, more recently, a GPU)
        • an accelerator sits on the level 3 or system bus (like a Turbo decoder) 
    • Difference between COP and I/O processor (IOP)
    • Difference between COP and Intelligent Peripheral (IPERI)
    • ........... at one time it was very clear
        • a COP is a CPU extension whereas an IPERI is a CPU-agnostic block (see the evolution from the Am9511 to the 8087).
    • Difference between COP and Application Specific Processor (ASP)
    • Difference between COP and DSP
      • ...........at one time it was very clear
        • a DSP is not a COP! 
        • But seen from the perspective of software running on the host (ARM), it is becoming exactly that: just another API.
      • .............and more
      • For simplicity's sake, we will use the definition
        • A COP is less than a CPU but more than a Peripheral.
        • A COP is dedicated to filling the system gap between a standard CPU and a standard Peripheral. It can be a custom function or an application specific processor.
      • Alternatively, the term "Accelerated processing " fits our definition.
      Background
      At the beginning of the 80s, Intel came up with a Floating Point Unit (the 8087), which became the de facto first coprocessor. The instruction set (ISA) and the interface are well documented. Following the very successful 8087, Intel came up with a series of COPs: the 80130 for multitasking software, the 82586 for local networks, the 82730 for text (screen). While the 80130 was tightly coupled, the last two were bus coupled (not unlike the 8089 I/O processor). In fact, not surprisingly for Intel, everything was a COP.
      Co-processing was then one of the solutions to the so-called problem of extended processing: how to cover all possible applications without specializing the microprocessor ISA? There were 3 types of solutions:
      1. the macrostore. The most commonly used subroutines are stored in ROM ("ROMed") as micro-instructions instead of being executed from the external program store. Proponent: Texas Instruments
        1. not the brightest tree in the forest; still, it could make sense in a DSP COP. File under microprogramming techniques.
      2. the coprocessor. The most commonly used subroutines are executed by a specialized processor.
      3. the intelligent peripheral. The advantage is that concurrent processing is easy; the big disadvantage is that it is not transparent to the programmer.
      After 1983, CPU architecture becomes the standardized way so ---> refer to .. for the rest of the story.

        Topics
        1. Most interesting was the Motorola answer (68881) to the 8087, which had a similar ISA but a much better, non-blocking interface. In other words, Motorola introduced concurrent processing. Incidentally, Motorola also introduced the first major COP architecture topic: what type of interface to use.
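To make the blocking versus non-blocking distinction concrete, here is a minimal sketch in Python. It is not a model of the actual 68881 or 8087 protocols: the coprocessor is simply a worker thread, and `cop_sqrt`, `blocking_issue` and `non_blocking_issue` are hypothetical names made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def cop_sqrt(x):
    """Stand-in for a long-latency coprocessor operation (hypothetical)."""
    return math.sqrt(x)


def blocking_issue(cop, x):
    # Blocking interface: the host stalls until the coprocessor returns.
    return cop.submit(cop_sqrt, x).result()


def non_blocking_issue(cop, x, host_work):
    # Non-blocking interface: the host issues the operation...
    pending = cop.submit(cop_sqrt, x)
    # ...keeps executing its own code while the coprocessor is busy...
    host_result = host_work()
    # ...and synchronizes only when it actually needs the result.
    return pending.result(), host_result


# One worker thread plays the role of the single coprocessor.
with ThreadPoolExecutor(max_workers=1) as cop:
    a = blocking_issue(cop, 16.0)
    b, h = non_blocking_issue(cop, 25.0, lambda: sum(range(10)))
```

In the non-blocking case the host only pays the coprocessor latency if its own work finishes first, which is the whole point of concurrent processing.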


        A non exhaustive list of COPs and their applications
        DSP coprocessors: a list of DSP ISA extensions, in roughly chronological order
        While ISA extension and COP are not synonymous, we grouped them together for simplicity.
        The advantages of a COP over an ISA extension are obvious: the specs, including the interface and the DSP extensions, are physically separate from the rest of the core, hence they are easier to model, to implement and to validate.
        • NS FX161 (~ 1991) had a DSP COP
          • amazing trick!! the integer registers and the DSP registers were skewed by 1 bit (because of the Q format)
        • ARM PICCOLO (~1996) point solution for GSM speech coder.
          • Still stands out today for its very original way of interfacing to the core:
          • both tightly coupled and asynchronous, with data passing through a FIFO.
            • TBD: integrate notes and ICSPAT 98 Moerman class notes
            • TBD:  matlab model   
          • !!! circular addressing on register file.
            • TBD: matlab model
          • http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/arm/dsp.html
        • HITACHI: SH-DSP  (1996)
        • PPC: ALTIVEC (circa 1998)
        • XTENSA (circa 2003)
        • dsPIC (2005?)
        • ........
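As background for the 1-bit skew mentioned above, here is a minimal sketch of where a 1-bit offset between integer and fractional (Q format) arithmetic comes from. This is generic Q15 arithmetic, not a description of how the FX161 was actually wired; `to_q15` and `q15_mul` are hypothetical helper names.

```python
def to_q15(x):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, with saturation."""
    return max(-32768, min(32767, int(round(x * 32768))))


def q15_mul(a, b):
    """Fractional multiply: Q15 x Q15, result realigned to Q15."""
    product_q30 = a * b              # integer multiply: result has 2 sign bits (Q30)
    product_q31 = product_q30 << 1   # 1-bit shift drops the redundant sign bit (Q31)
    # Keep the top 16 bits and saturate (only -1 * -1 overflows).
    return max(-32768, min(32767, product_q31 >> 16))


half = to_q15(0.5)               # 16384
quarter = q15_mul(half, half)    # 8192, i.e. 0.25 in Q15
```

The same multiplier serves both integer and fractional data; the only difference is the 1-bit realignment of the product.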

          Design Issues
          • Tightly or Loosely coupled?
          • How do you return data to the core? interrupt?
            • blocking , non blocking
          • Memory hierarchy position
            • Level 0 memory : COP has access to the core Register File.
              • the COP is just another execution unit inside the DAU
            • Level 1 memory: COP has access to level 1 Memory
              • even better: COP sits in the same place as a level 1 memory
                • see FFTer from ?
            • Level 2 or 3 memory: COP sits on one of the system Buses
          • Instruction Set or not?
            • In theory an instruction set is a good idea, but it implies a lot of added complexities; none of them is major, but the whole can become unmanageable.
              • tool issues
              • C compiler or not
              • opcode design
              • added power consumption due to fetch,
            • parameters are preferable, especially a combination of build-time and run-time parameters.
          • Scheduling techniques
            • "pure" datapath    
            • vector processing  (access to data as block in memory)
              • length,stride
            • pipeline data path 
            • sequential ; concept of clock  ; if cycle==1 ... if cycle==2...
            • autonomous  z= FFT64(x)
          • Topology: how many ports? 
            • a port must be a physical reality and not be a pointer to a structure.
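The "parameters instead of an instruction set" option and the vector (length, stride) access pattern from the list above can be sketched together. This is only an illustration under stated assumptions: `VectorMacCop`, its `max_length` build parameter and its `run` parameters are hypothetical names, and the memory is a plain Python list.

```python
class VectorMacCop:
    """Hypothetical parameter-driven COP: no instruction set, only parameters."""

    def __init__(self, memory, max_length):
        # Build parameters: fixed once, when the block is instantiated.
        self.memory = memory
        self.max_length = max_length

    def run(self, base_x, base_y, length, stride):
        # Run parameters: written by the host before each start.
        if not 0 < length <= self.max_length:
            raise ValueError("length outside the build-time range")
        acc = 0
        for i in range(length):
            # Vector access: data is read as a block in memory, every `stride` words.
            acc += self.memory[base_x + i * stride] * self.memory[base_y + i * stride]
        return acc


memory = list(range(16))
cop = VectorMacCop(memory, max_length=8)
# Multiply-accumulate the even-indexed words against the odd-indexed words.
result = cop.run(base_x=0, base_y=1, length=4, stride=2)
```

The host never fetches COP opcodes here; it only programs a handful of registers, which avoids the tool, compiler and fetch-power issues listed above.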

          Further Topics 
          • Accelerated Processing  (AP)
            • Traditionally AP is divided into several techniques
              • Central Core (CPU) based
                • Specialized CPU
                • CPU + COP(s)
              • Periphery (Non  CPU) based
                • Intelligent peripherals
                • FPGA
              • anything between  Core and Periphery.
                • to simplify: Core is level 0 memory and Periphery is level 2 or 3.
              • anything outside the chip is considered periphery.
              • since our focus is customization we do not consider massive parallelism as a solution. 
            • For our application space (dsp) it is simpler to treat AP and COP as a single topic.   

          Advanced Topics
          • FPGA Nodes:  Combining Massive Parallelism (MPP)  and customization
            • MPP machines more or less disappeared from the DSP (and embedded) scene for obvious reasons of programming model and power efficiency.
            • the next generation is based on a slightly different approach
            • you have a switch fabric (say 16x16 or 256 nodes) and each node is dedicated to a function
              • in fact some people proposed a fabric based on the FFT trellis instead of rows/columns
            • this approach is interesting because it is more sophisticated than our proposed signal graph
              • since we map a Matlab/Simulink flow. 
            • We are not familiar with the state of the art but it does not seem that this type of solution went deeper than FPGA implementation.
            • And maybe it is the right technology.
            References: my garage, Google and questions
            1. "Making software acceleration simple" Critical Blue, 2002
              1. http://www.criticalblue.com/
              2. What is the Critical Blue philosophy? the methodology? the application space?
              3. Is there a paradigm shift?
              4. Any link to dsp? Matlab? 
            2. "OptimoDE.;...etc" ARM, Hot Chips August 2004
              1. http://www.hotchips.org/archives/hc16/3_Tue/12_HC16_Sess9_Pres3_bw.pdf
                1. see also the PPT slides from CCCP, University of Michigan
              2. http://www.iqmagazineonline.com/magazine/pdf/v3_n3_pdf/Pg74_ARM_Phonex.pdf
              3. Originally developed by Adelante (an offspring of a Philips research company). They were partially bought by ARM.
              4. OptimoDE is a general purpose (GP) COP. What is wrong with this approach?
              5. OptimoDE is a GP methodology to design a COP. Advantages and limitations?
              6. It is based on a VLIW core. What is the one big thing wrong with VLIW?
                1. Compare a C55x MAC2 and a C62x MAC2
                2. Compare evolution C62, C64, C64+
                3. What is code footprint?
                4. What is a compound instruction?
                5. What is a thick and thin operator (data-path)?
            3. "Creating FPGA-based Co-processors for DSPs using Model Based Designs..." Avnet, Xilinx, April 2009
            4. "Extreme Processing" Max Baron, In-Stat/MDR, October 14, 2002
              1. This reference, while excellent, illustrates what we do not want to do. Max Baron used the term "extreme" because he delved into some architectures that were massively parallel and general purpose.
              2. Here we consider a solution (COP) as being customized for efficiency and specific to a task.
                1. note: efficiency can also mean parallelism
            5. "Accelerator Architecture" IEEE Micro, July/August 2008
            6. Anand Balaram, Andrew Volk "Text coprocessor brings quality to CRT displays" EDN, Feb 17, 1983
              1. Including 80186-82730 interface
              2. Software interface: command block, screen characteristics interface, string pointer list and display data strings
            7. Stan Groves "Standard interface keys processor design" Electronics, Nov 17, 1983
            8. Michael Cruess "The 68000 coprocessor interface: an overview" Motorola document dated June 8, 1982
              1. the author is on LinkedIn