In traditional CC-NUMA machines such as DASH, FLASH, Alewife and NUMA-Q, the DSM is managed at cache-line granularity, and data from other clusters is copied into a cache attached to each processor. The consistency protocol is a simple invalidate policy, the interconnection network is a simple mesh or ring, and the directory scheme is based on one-to-one data transfer.
Although such a mechanism works efficiently in these systems with a limited number of processors, it is not suitable for a system with thousands of processors. For example, a large amount of memory is required for caches and directories, and the invalidate policy based on one-to-one data transfer often causes network congestion when many processors share the same data.
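The congestion problem can be illustrated with a toy message-count model (my own simplification, not an analysis from the JUMP-1 papers): with one-to-one transfer, the home node's link must carry two packets per sharer, while a multicast with in-network acknowledgment combining keeps the home link's load constant.

```python
# Toy model of traffic on the home node's network link when a write
# invalidates S sharers. Function names and the model itself are
# illustrative assumptions, not part of any real protocol specification.

def home_link_traffic_one_to_one(sharers: int) -> int:
    # One-to-one policy: the home node sends one invalidation packet to
    # each sharer and receives one acknowledgment back from each.
    return 2 * sharers

def home_link_traffic_multicast(sharers: int) -> int:
    # Multicast with ack combining: one multicast packet leaves the home
    # node, and the network merges acknowledgments into a single reply.
    return 2

# With 1000 sharers, the one-to-one home link carries 2000 packets,
# while the multicast home link still carries only 2.
```

The gap grows linearly with the number of sharers, which is why a multicast/ack-combining mechanism matters for a machine with thousands of processors.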
In order to address these problems, the following methods are used in JUMP-1.
- Each processor (SuperSPARC+) shares a global virtual address space with a two-stage TLB implementation, and the directory is attached not to every cache line but to every page, while data transfer is performed at cache-line granularity. Part of the cluster memory is available as an L3 (Level-3) cache which stores copies of other clusters' memory.
- Various types of cache coherence protocols can be utilized dynamically, including not only an invalidate policy but also an update policy. In traditional CC-NUMA systems, an update policy has never been implemented, since it requires a large number of packet transfers. However, it can be useful for applications which exchange data frequently through aggressive access to shared data. For efficient implementation of such an update protocol, the MBP-light provides dedicated hardware mechanisms which multicast network packets and collect acknowledgment packets.
- Reduced Hierarchical Bitmap Directory schemes (RHBDs) are introduced to manage the directory efficiently. In the RHBD, the bitmap directory is reduced and carried in the packet header, so a multicast can be routed quickly without a directory access at each level of the hierarchy. The hierarchical structure of the RDT is suitable for an efficient implementation of the RHBD.
- Each processor is equipped with a sophisticated custom snoop cache chip as the L2 cache. Various cache protocols, including cache injection, are supported by this chip. A relaxed consistency model is implemented using write buffers in the L2 cache. Moreover, a virtual queue message transfer scheme on the shared memory can be implemented efficiently.
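The reduced-bitmap idea behind the RHBD can be sketched as follows. This is a hypothetical simplification (group size, cluster count and function names are my assumptions, not values from the JUMP-1 design): each bit in the header bitmap covers a fixed-size group of clusters, so a router can derive the multicast destinations from the header alone, at the cost of delivering to a superset of the true sharers.

```python
# Sketch of a reduced (coarse-grained) bitmap directory carried in a
# packet header. All constants below are illustrative assumptions.
GROUP = 8        # assumed number of clusters covered by one bit
N_CLUSTERS = 64  # assumed total number of clusters

def reduce_bitmap(sharers):
    """Compress a set of sharing cluster IDs into one bit per group."""
    bits = 0
    for c in sharers:
        bits |= 1 << (c // GROUP)
    return bits

def expand_bitmap(bits):
    """Destinations a router derives from the header bitmap.

    This is a superset of the true sharers: every cluster in a marked
    group receives the multicast, whether or not it actually shares
    the data.
    """
    dests = []
    for g in range(N_CLUSTERS // GROUP):
        if bits & (1 << g):
            dests.extend(range(g * GROup, (g + 1) * GROUP) if False else range(g * GROUP, (g + 1) * GROUP))
    return dests
```

Because the bitmap travels in the packet header, routers need no per-hop directory lookup; the price is the extra deliveries to non-sharing clusters in marked groups.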
Overview
JUMP-1 has SuperSPARC+ processors as its element processors. Each cluster board has 4 SuperSPARC+s, 16MB SDRAM and a custom processor which manages the distributed shared memory, MBP-light.
Clusters are connected with the interconnection network, RDT (Recursive Diagonal Torus).
Structure of Cluster Boards
This picture shows a cluster board of JUMP-1. It has 4 PEs, L2 cache controllers, cluster bus chips, MBIF (maintenance bus interface), STAFF-Link interface, cluster memory and so on.
Each SuperSPARC+ is connected to the cluster bus via an L2 cache controller. The width of the cluster bus is 64 bits, and requests sent on the cluster bus are processed by MBP-light. MBP-light manages the distributed shared memory, the STAFF-Link (parallel I/O), the MBIF, the RDT router, etc.
Interconnection Network
The interconnection network of JUMP-1 is the RDT (Recursive Diagonal Torus). To manage distributed shared memory, it is important to support an efficient multicast mechanism. The RDT is suitable for multicasting because it includes both torus and fat-tree structures.
In a massively parallel system with a large number of nodes, it is also important to keep the diameter of the interconnection network small. The RDT has a simple structure, yet it keeps the diameter small.
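The diameter benefit of layering upper-level tori over a base torus can be shown with a toy bound (my own simplification, not the exact RDT analysis): in a plain n x n torus a packet needs up to n/2 hops per dimension, while an upper-level torus connecting every k-th node lets a packet ride the upper level and then make a short local correction.

```python
# Toy diameter bounds for a 2-D torus with and without one upper level.
# The two-level formula is a hypothetical illustration of the idea,
# not the published RDT diameter formula.

def torus_diameter(n: int) -> int:
    # Plain n x n torus: up to n/2 hops in each of the two dimensions.
    return 2 * (n // 2)

def two_level_diameter(n: int, k: int) -> int:
    # Upper-level torus over every k-th node: at most (n/k)/2 upper-level
    # hops plus k/2 local hops, in each of the two dimensions.
    return 2 * ((n // k) // 2 + k // 2)

# For a 64 x 64 base torus, the plain diameter is 64 hops, while a single
# upper level with k = 8 cuts the bound to 16 hops.
```

Recursively adding further levels shrinks the bound again, which is how the RDT keeps its diameter small while retaining a simple torus-based structure.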
MBP-light (MBP=Memory Based Processor)
MBP-light is an ASIC with a 4-stage pipelined 16-bit RISC core. It is connected to the cluster bus and to the interconnection network, the RDT. MBP-light is the heart of JUMP-1: it manages the distributed shared memory, the I/O (STAFF-Link), system monitoring (MBIF), and so on.
Overview of MBP-light
352-pin TBGA / 0.4um embedded array
Random Logic: 106,905 gates
Internal Memory: 44,848 bits
RDT (Recursive Diagonal Torus)
The RDT router was developed in our laboratory. It contains both CMOS and ECL logic so that it can drive the RDT links directly.
Overview of RDT Router Chip
299-pin CMOS SOG, 0.5um
125k gates, ECL devices
Bi-CMOS gate: 0.11ns typ.
CMOS gate: 0.06ns typ.