# **Optimized Core-links for Low-latency NoCs**

Ryuta Kawano<sup>†</sup>, Seiichi Tade<sup>†</sup>, Ikki Fujiwara<sup>††</sup>, Hiroki Matsutani<sup>†</sup>, Hideharu Amano<sup>†</sup>, Michihiro Koibuchi<sup>††</sup> <sup>†</sup>Keio University <sup>††</sup>National Institute of Informatics

blackbus@am.ics.keio.ac.jp

- <u>Conventional NoCs</u>
- Small-world Networks
  - Difficulty in applying on Chips
- How do we reduce path hops of NoCs?
  - Adding multiple links between a core and routers
  - Optimization method for picking core-links
- Evaluations
  - Zero-load latencies
  - Costs
  - Full-system simulation
- Conclusions

### **Increasing # of Cores on NoCs**



### Intel Teraflops Chip: 80-tile 2D Mesh





### **Small-world Topology for off-Chip Network**

Reduction of # of hops using *small-world effects* 

[Koibuchi et al. 2012]



Ring + Non-Random Links

Ring + Random Links

### **Small-world Topology on Chips**



- 2D Mesh + inter-router additional links [Ogras et al. 2006]
  - Need to use custom routing
    - to reduce path hops
    - to avoid deadlocks

### **Multiple Core-links to Reduce Path Hops**

#### Idea: Router topology + multiple links between a core and multiple routers



### Our Idea: Reduction of First and Last 1-hop Latencies with Shortcut Core-links



- Using the shortest path between source and destination cores
  - Achieving lower hops by *small-world effects*
- Maintaining **regularities** of router topologies
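The hop reduction can be illustrated with a minimal sketch. It assumes routers sit at `(x, y)` coordinates of a 2D mesh and that inter-router hops under dimension-order routing equal the Manhattan distance; the function names and the link placements below are hypothetical and are not taken from the slides.

```python
from itertools import product

def router_hops(a, b):
    """Inter-router hops on a 2D mesh (assumed equal to Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def core_hops(src_links, dst_links):
    """Fewest inter-router hops over every pair of routers the two cores attach to."""
    return min(router_hops(s, d) for s, d in product(src_links, dst_links))

# Hypothetical 8x8 mesh example: each core keeps the link to its home router
# and gains one shortcut core-link to a router 4 tiles away (Manhattan).
single_src, single_dst = [(0, 0)], [(7, 7)]
multi_src, multi_dst = [(0, 0), (2, 2)], [(7, 7), (5, 5)]

baseline = core_hops(single_src, single_dst)  # one core-link per core
shortcut = core_hops(multi_src, multi_dst)    # two core-links per core
print(baseline, shortcut)
```

In this toy case the corner-to-corner path shrinks because both endpoints enter and leave the unchanged mesh closer to each other, while the router topology itself stays regular.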

### **Optimization Method for Picking Core-links**

- Problem: Longer core-links lower the operating frequency
- Solution: Optimization using a GA (genetic algorithm)



Corresponding Topology

Providing best tradeoff between link length and # of hops

### **Zero-load Latency**



- 8×8 Mesh router topology + optimized core-links
- Max. core-link length: 4 tiles

Reduction of max. / ave. zero-load latencies by up to **49 % / 58 %** 
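To show how hop count feeds into zero-load latency, here is a rough sketch of one common latency model for wormhole switching. The model itself is an assumption, not the definition used for the results above: the head flit pays the router latency at every router and the link latency on every link (including the two core-links), and the remaining flits follow one per cycle. The default values mirror the router latency (3 cycles), link latency (1 cycle), and 5-flit packets listed among the full-system simulation parameters.

```python
def zero_load_latency(router_hops, router_cycles=3, link_cycles=1, packet_flits=5):
    """Estimated cycles until the tail flit arrives in an otherwise idle network.

    Assumed model (not taken from the slides): header latency plus serialization.
    """
    routers = router_hops + 1   # routers traversed, first and last included
    links = router_hops + 2     # inter-router links plus source/destination core-links
    head = routers * router_cycles + links * link_cycles
    return head + (packet_flits - 1)  # remaining flits trail the head one per cycle

# Hypothetical comparison: a 14-hop path versus a 6-hop path.
print(zero_load_latency(14), zero_load_latency(6))
```

Under this model, latency grows linearly with the number of inter-router hops, so any hop reduction from shortcut core-links translates directly into fewer cycles, unless longer core-links need the 2-cycle link latency.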

### Costs (8×8 Mesh)

- Wire Density Overhead on each tile
  - two links per core

| Statistic | Links per tile |
|-----------|----------------|
| Max.      | 5.40           |
| Ave.      | 3.06           |
| SD        | 1.58           |

- Router area (Fujitsu 65 nm process)

| Configuration    | Router area         |
|------------------|---------------------|
| 1 link per core  | 7.71 mm<sup>2</sup> |
| 2 links per core | 9.86 mm<sup>2</sup> |

Increase by 27.8 %

- Energy Consumption
  - 1.2 V supply voltage
  - 65 nm CMOS Process
  - Wire capacitance load: 0.20 pJ/mm



Increase by 3.0 % at minimum

### Parameters of Full-system Simulation

- gem5 [Binkert et al. 2011] is used as the full-system simulator
- Seven applications from the OpenMP implementation of the NAS Parallel Benchmarks are used

#### **Parameters of Router Network**

| Parameter                         | Value                   |
|-----------------------------------|-------------------------|
| Switching                         | Wormhole                |
| Packet length                     | 1 or 5 flits            |
| Flit length                       | 128 bits                |
| # of VCs                          | 3                       |
| Size of VC                        | 4 flits                 |
| Router latency                    | 3 cycles                |
| Link latency                      | 1 or 2 cycles           |
| Router topology                   | 8×8 Mesh                |
| Max. link length                  | 4 tiles                 |
| Routing (inter-router)            | XY routing              |
| Routing (between router and core) | Selecting shortest path |

#### **Parameters of Simulation**

| Parameter                                       | Value              |
|-------------------------------------------------|--------------------|
| Processor                                       | x86 (64-bit)       |
| L1 cache size                                   | 32 KB (line: 64 B) |
| L1 cache latency                                | 1 cycle            |
| L2 cache size                                   | 256 KB (assoc: 8)  |
| L2 cache latency                                | 6 cycles           |
| Memory size                                     | 2 GB               |
| Memory latency                                  | 160 cycles         |
| Chip configuration: # of CPUs / L1 caches       |                    |
| Chip configuration: # of L2 caches              | 48                 |
| Chip configuration: # of directory controllers  | 8                  |
### **Full-system Simulation Results**

- 8×8 Mesh router topology
- Max. core-link length: 4 tiles



Reduction of application execution time by up to 10.1 %

### Conclusions

- Idea: Multiple links between a core and routers on NoCs
- Design: GA optimization for picking core-links
  - Reduction of max. / ave. zero-load latencies by up to 49 % / 58 %
  - Reduction of application execution time by up to 10.1 %



### **Backup Slides**

### Definition of Fitness Function f<sub>k</sub>

- Length of the *j*-th core-link for the *i*-th core $(0 \le i < N,\ 0 \le j < x)$: $l_{i,j}$
- # of hops between the *p*-th and *q*-th cores $(0 \le p < N,\ 0 \le q < N)$: $h_{p,q}$
- # of links longer than the given maximum link length: $I$
- Supplemental parameters: $\alpha = \beta = 1000$

$$f_{k} = \begin{cases} \alpha\left(\beta \cdot I + \sum_{i=0}^{N-1} \sum_{j=0}^{x-1} l_{i,j}\right) & (I > 0) \quad \text{(1)} \\ \max(h) + \operatorname{mean}(h) & (\text{otherwise}) \quad \text{(2)} \end{cases}$$

- Reducing link length with function (1)
- Reducing # of hops with function (2)
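The two cases of the fitness function can be written out as a short sketch. It assumes that the GA minimizes the fitness value (so any individual with over-long links scores far worse than any feasible one), that `mean(h)` is taken over all core pairs including `p == q`, and that link lengths and hop counts are supplied as nested lists; the argument names are hypothetical.

```python
from statistics import mean

ALPHA = BETA = 1000  # supplemental parameters from the slide

def fitness(link_lengths, hops, max_len):
    """link_lengths[i][j] is l_{i,j}; hops[p][q] is h_{p,q}. Lower is assumed better."""
    all_lengths = [l for per_core in link_lengths for l in per_core]
    too_long = sum(1 for l in all_lengths if l > max_len)  # I
    if too_long > 0:
        # Case (1): penalize violations, then pull total link length down.
        return ALPHA * (BETA * too_long + sum(all_lengths))
    # Case (2): all links are legal, so optimize the hop counts.
    flat = [h for row in hops for h in row]
    return max(flat) + mean(flat)

# Hypothetical two-core example with a 4-tile length limit.
hops = [[0, 2], [2, 0]]
print(fitness([[0, 3], [0, 5]], hops, max_len=4))  # one link is too long
print(fitness([[0, 3], [0, 4]], hops, max_len=4))  # all links are legal
```

Because case (1) is scaled by $\alpha$, the search is first driven toward configurations that respect the maximum link length, and only then toward fewer hops.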

#### **GA Parameters**

- # of individuals: 100, # of generations: 20000
- Probability of crossover / mutation: 1 % / 20 %
- Tournament size: 3
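A minimal GA loop using these parameters might look like the following sketch. Several details are assumptions, since the slides do not specify them: the genome is a list of integer choices (for example, one core-link candidate per core), crossover is one-point, the mutation probability applies per individual, the best individual is tracked across generations, and fitness is minimized. The toy fitness (`sum`) and the short run of 30 generations are placeholders.

```python
import random

POP_SIZE = 100          # number of individuals (from the slide)
P_CROSSOVER = 0.01      # crossover probability (from the slide)
P_MUTATION = 0.20       # mutation probability (from the slide)
TOURNAMENT_SIZE = 3     # tournament size (from the slide)

def tournament(pop, fitness, rng):
    """Return the fittest (lowest value) of TOURNAMENT_SIZE random individuals."""
    return min(rng.sample(pop, TOURNAMENT_SIZE), key=fitness)

def evolve(fitness, genome_len, n_choices, generations, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(n_choices) for _ in range(genome_len)]
           for _ in range(POP_SIZE)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < POP_SIZE:
            a = list(tournament(pop, fitness, rng))
            b = list(tournament(pop, fitness, rng))
            if rng.random() < P_CROSSOVER:       # assumed one-point crossover
                cut = rng.randrange(1, genome_len)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            for child in (a, b):
                if rng.random() < P_MUTATION:    # assumed: re-draw one random gene
                    child[rng.randrange(genome_len)] = rng.randrange(n_choices)
                nxt.append(child)
        pop = nxt[:POP_SIZE]
        best = min(pop + [best], key=fitness)    # remember the best seen so far
    return best

# Toy run: minimize the sum of genes (placeholder for the real fitness function).
best = evolve(sum, genome_len=8, n_choices=5, generations=30)
print(best, sum(best))
```

In a real run, `fitness` would decode each genome into a set of core-links and evaluate it with $f_k$, and `generations` would be set to 20000.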