# SOC Design for HPC: Technology Analysis & Requirements

#### Peter M. Kogge, McCourtney Prof. of CS & Engr., University of Notre Dame

Acknowledgement: This work was funded in part by the US Dept. of Energy, Sandia National Labs, as part of their Xcaliber and XGC projects.

## Thesis

- Today's COTS designs typically have an "inward" focus
- For HPC, "outward" is far more crucial
  - Memory, esp. random access
  - Off-chip bandwidth
- This talk
  - Take-aways from TOP500
  - Take-aways from a Big Data problem
  - Energy discussion
- The biggest gains seem to come from rethinking system architecture
- SOC, if done right, seems to be the right direction

# **Today's Architecture Classes**

- Heavyweight: traditional 100+W multi-core
- Lightweight: lower power single chip system
- Hybrid/Heterogeneous: Heavyweight/GPU combination
- **Big/Little**: Same ISA, different microarchitectures
- **Other**: XMT, Convey





## We All Know The Story: Unbroken Growth in TOP500 Rmax



# Floating Point Efficiency Remains High for Linpack



## **But Not All Benchmarks Double/Year**





### **Memory Growth Has Slowed**



# **And Memory per Flop/s Is Dropping!**



## A Real-World Big Data Problem



# Configurations

- Baseline: Lexis Nexis HPCC Configuration
  - 100 4-node Blades in 10 racks
- Memory Rich Configuration
  - Same as above but with maxed DRAM for RAM Disk
- 2015 Configuration
  - 4X cores/socket, DRAM, switched Infiniband
- 2015 Configuration with DRAM for RAM Disk
- Lightweight Configurations
  - 2 racks of Calxeda-like ARM-based SOCs
- Xcaliber: Memory Stack-Based
- Xcaliber with all computing at bottom

# **Possible "Lightweight" System**

- Assume Calxeda System on a Chip
  - 4 1-1.4GHz ARM A9 cores w/ FPU
  - Single DDR3 2 rank controller
  - Networking: GigE, XAUI
  - Supports up to 5 SATA
  - Fabric: 8x8 crossbar, 10Gbps links
    - 3 internal, 5 external
- Calxeda Reference card:



- 4 SATA sockets/SOC for disk connections
- 8 interfaces for off-card fabric
- 2U Blade (based on Boston Viridis Chassis)
  - 12 reference cards + up to 24 SATA
- Assumed Configuration of 40 blades, 2 racks
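The sizes of the baseline and lightweight configurations can be compared with a few lines of arithmetic. This is a rough sketch, not the author's model: the blade, card, rack, and core counts come from the slides, but `SOCS_PER_CARD` is an assumption (the slides do not state how many SOCs sit on one reference card), so the SOC and core totals are illustrative only.

```python
# Baseline: 100 4-node blades in 10 racks (from the Configurations slide).
BASELINE_BLADES = 100
BASELINE_NODES_PER_BLADE = 4
BASELINE_RACKS = 10

# Lightweight: 40 blades in 2 racks, 12 reference cards per blade (from the slides).
LIGHT_BLADES = 40
LIGHT_CARDS_PER_BLADE = 12
LIGHT_RACKS = 2
CORES_PER_SOC = 4   # "4 1-1.4GHz ARM A9 cores" per SOC
SOCS_PER_CARD = 4   # ASSUMPTION: not stated in the slides

baseline_nodes = BASELINE_BLADES * BASELINE_NODES_PER_BLADE
baseline_nodes_per_rack = baseline_nodes // BASELINE_RACKS

light_cards = LIGHT_BLADES * LIGHT_CARDS_PER_BLADE
light_socs = light_cards * SOCS_PER_CARD          # depends on the assumption above
light_cores = light_socs * CORES_PER_SOC          # depends on the assumption above
light_socs_per_rack = light_socs // LIGHT_RACKS

print(f"baseline: {baseline_nodes} nodes, {baseline_nodes_per_rack} per rack")
print(f"lightweight: {light_cards} cards, {light_socs} SOCs, {light_cores} cores")
print(f"lightweight: {light_socs_per_rack} SOCs per rack")
```

Changing `SOCS_PER_CARD` rescales the lightweight totals; the card count (480) does not depend on it.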





Images from www.calxeda.com 6/2/12



http://www.boston.co.uk/solutions/viridis/viridis-2u.aspx/

#### **X-Caliber-like Architecture**





(b) X-caliber Node Mockup


M's built from 3D stacks of memory. Each stack:

- 32 GB DRAM
- 256GB PCM
- Logic chip at bottom
- 64 0.5GB "Vaults"
- 8 full-duplex links, 32 GB/s each direction
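The per-stack numbers above can be cross-checked with a short calculation. This is a minimal sketch that assumes the 32 GB/s figure applies to each link in each direction; if it is instead the aggregate over all 8 links, the bandwidth totals below would be 8 times too large.

```python
# Per-stack figures taken from the list above.
VAULTS_PER_STACK = 64
GB_PER_VAULT = 0.5
DRAM_GB_PER_STACK = 32
PCM_GB_PER_STACK = 256
LINKS_PER_STACK = 8
GBS_PER_LINK_PER_DIR = 32   # ASSUMPTION: 32 GB/s is per link, per direction

# Vault capacity should add up to the stated DRAM capacity.
vault_capacity_gb = VAULTS_PER_STACK * GB_PER_VAULT

# Aggregate link bandwidth per stack under the per-link assumption.
bw_per_dir_gbs = LINKS_PER_STACK * GBS_PER_LINK_PER_DIR
bw_both_dirs_gbs = 2 * bw_per_dir_gbs

# Ratio of PCM to DRAM capacity in one stack.
pcm_to_dram = PCM_GB_PER_STACK / DRAM_GB_PER_STACK

print(f"vault capacity: {vault_capacity_gb} GB (stated DRAM: {DRAM_GB_PER_STACK} GB)")
print(f"link bandwidth: {bw_per_dir_gbs} GB/s per direction, {bw_both_dirs_gbs} GB/s total")
print(f"PCM:DRAM capacity ratio: {pcm_to_dram}")
```

The 64 vaults of 0.5 GB account exactly for the 32 GB of DRAM, so the vault and capacity figures are mutually consistent.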

Figure: Memory System (M) and Embedded Memory Processor (EMP); labeled blocks include the EMP and the Mem Network Interface.

- Two computation Units
  - Right next to the DRAM vault memory controller (VAU)
  - To aggregate between DRAM vaults (EMP)
- "Memory Network" Centric
- Homenode for all addresses
  - Owns the address, data, and its state, "coherency"
- Three Control-Flow Options
  - In the Processor ("Memory is the Accelerator"), conventional
  - In the Memory System ("Processor is the Accelerator"), our approach
  - Both, probably un-programmable
- At 1-2 GHz, 4 EMPs per vault
  - 64 vaults


2-4K threads per node in the memory system!

# **Details: Heavyweight Alternatives**



## **Non-Heavyweights**



## Comparison





#### **Sample Path – Off Module Access**

1. Check local L1 (miss)
2. Go thru TLB to remote L3 (miss)
3. Across chip to correct port (thru routing table RAM)
4. Off-chip to router chip
5. 3 times thru router and out
6. Across microprocessor chip to correct DRAM I/F
7. Off-chip to get to correct DRAM chip
8. Cross DRAM chip to correct array block
9. Access DRAM Array
10. Return data to correct I/F
11. Off-chip to return data to microprocessor
12. Across chip to Routing Table
13. Across microprocessor to correct I/O port
14. Off-chip to correct router chip
15. 3 times thru router and out
16. Across microprocessor to correct core
17. Save in L2, L1 as required
18. Into Register File
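To see how little of this path is the DRAM access itself, the 18 steps can be tallied by kind of event. This is an illustrative sketch: the category labels (`cache`, `on_chip`, `off_chip`, `router`, `dram_array`, `register_file`) are this sketch's own grouping of the steps, not terminology from the slides, and the two router steps are counted three times each as the list states.

```python
from collections import Counter

# (step number, kind of event, repetitions). The "kind" labels are an
# assumed grouping chosen for this sketch.
PATH = [
    (1, "cache", 1),           # check local L1
    (2, "cache", 1),           # TLB to remote L3
    (3, "on_chip", 1),         # across chip to port
    (4, "off_chip", 1),        # to router chip
    (5, "router", 3),          # 3 times thru router
    (6, "on_chip", 1),         # across chip to DRAM I/F
    (7, "off_chip", 1),        # to DRAM chip
    (8, "on_chip", 1),         # across DRAM chip
    (9, "dram_array", 1),      # the actual array access
    (10, "on_chip", 1),        # back to DRAM chip I/F
    (11, "off_chip", 1),       # back to microprocessor
    (12, "on_chip", 1),        # across chip to routing table
    (13, "on_chip", 1),        # across chip to I/O port
    (14, "off_chip", 1),       # to router chip
    (15, "router", 3),         # 3 times thru router
    (16, "on_chip", 1),        # across chip to core
    (17, "cache", 1),          # save in L2, L1
    (18, "register_file", 1),  # into register file
]

counts = Counter()
for _step, kind, reps in PATH:
    counts[kind] += reps

total_events = sum(counts.values())
for kind, n in counts.most_common():
    print(f"{kind:14s} {n}")
print("total events:", total_events)
```

Under this grouping, a single DRAM array access is accompanied by 4 off-chip crossings and 6 router traversals, which is the point the path is meant to illustrate.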

## **Relook at Exascale Strawman**



| Operation               | Cost   |
|-------------------------|--------|
| Register File Access    | 0.16   |
| SRAM Access             | 0.23   |
| DRAM Access             | 1      |
| On-chip movement        | 0.0187 |
| Thru Silicon Vias (TSV) | 0.011  |
| Chip-to-Board           | 2      |
| Chip-to-optical         | 10     |
| Router on-chip          | 2      |

| Step             | Target | pJ     | # Occurrences | Total pJ | % of Total |
|------------------|--------|--------|---------------|----------|------------|
| Read Alphas      | Remote | 13,819 | 4             | 55,276   | 16.5%      |
| Read pivot row   | Remote | 13,819 | 4             | 55,276   | 16.5%      |
| Read 1st Y[i]    | Local  | 1,380  | 88            | 121,5…   | …1%        |
| Read Other Y[i]s | L1     | 39     | 264           | 10,2…    | …%         |
| Write Y's        | L1     | 39     | 352           | 13,900   | 4.2%       |
| Flush Y's        | Local  | 891    | 88            | 78,380   | 23.4%      |
| Total            |        |        |               | 334,056  |            |
| Ave per Flop     |        |        |               | 475      |            |
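The weight of the remote accesses in this table can be checked directly. The sketch below uses only the two remote rows, the occurrence counts, and the 334,056 pJ total; it is a reading of the table, not a reproduction of the author's energy model.

```python
# Remote rows from the table: per-access energy (pJ) and occurrence count.
REMOTE_PJ_PER_ACCESS = 13_819
REMOTE_ROWS = {"Read Alphas": 4, "Read pivot row": 4}

# Occurrence counts for all six rows, in table order.
ALL_OCCURRENCES = [4, 4, 88, 264, 352, 88]
TOTAL_PJ = 334_056

# Energy of each remote row, and of both together.
remote_row_pj = {name: REMOTE_PJ_PER_ACCESS * n for name, n in REMOTE_ROWS.items()}
remote_pj = sum(remote_row_pj.values())

# Share of accesses versus share of energy.
remote_accesses = sum(REMOTE_ROWS.values())
total_accesses = sum(ALL_OCCURRENCES)
access_share = remote_accesses / total_accesses
energy_share = remote_pj / TOTAL_PJ

print(remote_row_pj)
print(f"remote accesses: {remote_accesses} of {total_accesses} ({access_share:.1%})")
print(f"remote energy:   {remote_pj} of {TOTAL_PJ} pJ ({energy_share:.1%})")
```

The 8 remote accesses are 1% of the 800 accesses in the table but account for about a third of the total energy.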

#### **If this is true, 1 EF/s = 0.5 GW!**
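The 0.5 GW figure follows from multiplying energy per flop by the flop rate. A minimal sketch of that conversion, reading the table's "Ave per Flop" entry as 475 pJ (an interpretation of the table, stated here as an assumption):

```python
# Energy per flop, read from the "Ave per Flop" row (assumed to be 475 pJ).
PJ_PER_FLOP = 475
JOULES_PER_PJ = 1e-12

# One exaflop per second.
FLOPS_PER_SECOND = 1e18

# Power in watts = joules per flop * flops per second.
power_w = PJ_PER_FLOP * JOULES_PER_PJ * FLOPS_PER_SECOND
power_mw = power_w / 1e6
power_gw = power_w / 1e9

print(f"{power_mw:.0f} MW = {power_gw:.3f} GW")
```

Under that reading the result is about 475 MW, which rounds to the 0.5 GW quoted on the slide.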

#### **Access vs Reach**



# What Does This Tell Us?

- Cannot afford **ANY** memory references
- Many more energy sinks than you think
- Cost of Interconnect *Dominates*
- Must design for on-board or stacked DRAM
- Need to redesign the entire access path:
  - Alternative memory technologies reduce access cost
  - Alternative packaging reduces bit movement cost
  - Alternative transport protocols reduce # bits moved
  - Alternative execution models reduce # of movements

#### AND IT GETS <u>MUCH WORSE</u> FOR CACHE UNFRIENDLY PROBLEMS