# Computing's Energy Problem (and what we can do about it)

Mark Horowitz

Stanford University horowitz@ee.stanford.edu

#### **Everything Has A Computer Inside**



#### The Reason is Simple: Moore's Law Made Gates Cheap



Fig. 2 Number of components per Integrated function for minimum cost per component extrapolated vs time.

# Dennard's Scaling Made Them Fast & Low Energy



#### The triple play:

Dennard, JSSC, pp. 256-268, Oct. 1974

# **Our Expectation**

#### Cray-1: world's fastest computer 1976-1982

- 64Mb memory (50ns cycle time)
- 40Kb register (6ns cycle time)
- ~1 million gates (4/5 input NAND)
- 80MHz clock
- 115kW

#### In 45nm (30 years later)

- < 3 mm<sup>2</sup>
- > 1 GHz
- ~ 1 W
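As a rough sanity check on this comparison, average energy per clock cycle can be estimated as power divided by clock frequency. The sketch below uses only the numbers on the slide, and it assumes the 45nm bounds ("> 1 GHz", "~ 1 W") can be treated as point values. The function name is illustrative.

```python
def energy_per_cycle_joules(power_watts: float, clock_hz: float) -> float:
    """Average energy spent per clock cycle: P / f."""
    return power_watts / clock_hz


# Cray-1 figures from the slide: 115 kW at 80 MHz.
cray1 = energy_per_cycle_joules(115e3, 80e6)

# 45nm figures from the slide, taken as point values (assumption): 1 W at 1 GHz.
scaled_45nm = energy_per_cycle_joules(1.0, 1e9)

improvement = cray1 / scaled_45nm
print(f"Cray-1: {cray1 * 1e3:.2f} mJ/cycle")
print(f"45nm:   {scaled_45nm * 1e9:.2f} nJ/cycle")
print(f"Ratio:  {improvement:.3g}x")
```

Under these assumptions the energy per cycle drops by roughly six orders of magnitude, which is the kind of improvement the "expectation" refers to.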



CRAY-1

# **Supporting Evidence**



http://cpudb.stanford.edu/



#### **The Power Limit**



http://cpudb.stanford.edu/



#### This Power Problem Is Not Going Away: $P = \alpha C V_{dd}^2 f$



http://cpudb.stanford.edu/
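To make the power formula in this slide's title concrete, here is a minimal sketch that evaluates it for two supply voltages. All parameter values are illustrative assumptions, not data from the talk. The point is only that power depends quadratically on the supply voltage, so once the voltage stops scaling, any increase in frequency or switched capacitance shows up directly as power.

```python
def dynamic_power(alpha: float, cap_farads: float, vdd_volts: float, freq_hz: float) -> float:
    """Dynamic switching power: P = alpha * C * Vdd^2 * f."""
    return alpha * cap_farads * vdd_volts ** 2 * freq_hz


# Illustrative values only (assumptions): activity factor 0.1,
# 1 nF of total switched capacitance, 1 GHz clock.
base = dynamic_power(alpha=0.1, cap_farads=1e-9, vdd_volts=1.0, freq_hz=1e9)
low_v = dynamic_power(alpha=0.1, cap_farads=1e-9, vdd_volts=0.7, freq_hz=1e9)

print(f"Vdd = 1.0 V: {base:.3f} W")
print(f"Vdd = 0.7 V: {low_v:.3f} W ({low_v / base:.0%} of base)")
```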

**Think About It** 

$$P = \frac{\text{Energy}}{\text{Op}} \times \frac{\text{Ops}}{\text{Second}}$$

# **Technology to the Rescue?**





# **Problems w/ Replacing CMOS**

#### **Pretty fundamental physics**

• Avoiding this problem will be hard



• fJ/gate, 10ps delays, 10<sup>9</sup> working devices

Catch-22



**The Truth About Innovation** 



# **Our CMOS Future**

#### Will see tremendous innovative uses of computation

- Capability of today's technology is incredible
- Can add computing and communication for nearly \$0
- The key question is: what problems need to be solved?

#### Most performance systems will be energy limited

• These systems will be optimized for energy efficiency

#### Power = Energy/Op \* Ops/sec
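Rearranging this relation shows why energy per operation sets performance once power is capped: for a fixed power budget, the achievable operation rate is the budget divided by the energy per operation. The sketch below assumes a hypothetical 1 W budget; the function name and the energy values are illustrative.

```python
def max_ops_per_second(power_budget_watts: float, energy_per_op_joules: float) -> float:
    """Rearranged from Power = Energy/Op * Ops/sec."""
    return power_budget_watts / energy_per_op_joules


budget_watts = 1.0  # hypothetical power budget
for pj_per_op in (100.0, 10.0, 1.0):
    rate = max_ops_per_second(budget_watts, pj_per_op * 1e-12)
    print(f"{pj_per_op:6.1f} pJ/op -> {rate:.1e} ops/sec")
```

Each 10x reduction in energy per operation buys 10x more throughput at the same power.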

#### **Processor Energy – Delay Trade-off**



#### **The Rise of Multi-Core Processors**



#### The Stagnation of Multi-Core Processors



# Aside: Throughput Optimized FP



#### Aside: Floating Point Optimization 180nm – ITRS 10nm



#### Have A Shiny Ball, Now What?



#### **Signal Processing ASICs**



Markovic, EE292 Class, Stanford, 2013

#### The Push For Specialized Hardware



#### ABSTRACT

Since 2005, processor designers have increased […] exploit Moore's Law scaling, rather than focusing […] performance. The failure of Dennard scaling, to which […] multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper […] multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and […] of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of […]

Categories and Subject Descriptors: C.0 [Computer Systems Organization] General - Modeling of computer architecture; C.0 [Computer Systems Organization] General - System architectures

General Terms: Design, Measurement, Performance

Keywords: Dark Silicon, Modeling, Power, Technology Scaling



#### **Before Talking About Specialization**



#### **Don't Forget Memory System Energy**



#### **Processor Energy w/ Corrected Cache Sizes**



#### **Processor Energy Breakdown**



## **Data Center Energy Specs**



Malladi, ISCA, 2012

# SO HOW WILL ACCELERATORS HELP?

#### What Is Going On Here?



# **ASIC's Dirty Little Secret**

#### All the ASIC applications have absurd locality

• And work on short integer data



# **Rough Energy Numbers (45nm)**

| Integer op   | Energy  | FP op         | Energy | Memory access      | Energy     |
|--------------|---------|---------------|--------|--------------------|------------|
| Add, 8 bit   | 0.03 pJ | FAdd, 16 bit  | 0.4 pJ | 8KB cache (64 bit)  | 10 pJ      |
| Add, 32 bit  | 0.1 pJ  | FAdd, 32 bit  | 0.9 pJ | 32KB cache (64 bit) | 20 pJ      |
| Mult, 8 bit  | 0.2 pJ  | FMult, 16 bit | 1 pJ   | 1MB cache (64 bit)  | 100 pJ     |
| Mult, 32 bit | 3 pJ    | FMult, 32 bit | 4 pJ   | DRAM               | 1.3-2.6 nJ |
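One way to read these rough numbers is to ask how many arithmetic operations cost the same energy as a single memory access. The sketch below copies the table values into a dictionary; the key names and the helper function are illustrative assumptions.

```python
# Rough 45nm energy numbers from the table above, in picojoules.
ENERGY_PJ = {
    "int_add_8": 0.03,
    "int_add_32": 0.1,
    "int_mult_8": 0.2,
    "int_mult_32": 3.0,
    "fp_add_16": 0.4,
    "fp_add_32": 0.9,
    "fp_mult_16": 1.0,
    "fp_mult_32": 4.0,
    "cache_8kb": 10.0,
    "cache_32kb": 20.0,
    "cache_1mb": 100.0,
    "dram_low": 1300.0,   # lower end of the 1.3-2.6 nJ range
    "dram_high": 2600.0,  # upper end of the 1.3-2.6 nJ range
}


def ops_per_access(op: str, access: str) -> float:
    """Number of `op` operations that cost as much energy as one `access`."""
    return ENERGY_PJ[access] / ENERGY_PJ[op]


for access in ("cache_8kb", "cache_1mb", "dram_low"):
    for op in ("int_add_8", "int_add_32", "fp_mult_32"):
        ratio = ops_per_access(op, access)
        print(f"one {access} access ~ {ratio:,.0f} x {op}")
```

With these values, even the smallest cache access costs hundreds of 8-bit adds, and a DRAM access costs thousands of 32-bit adds, which is why locality dominates the energy budget.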

#### Instruction Energy Breakdown



## The Truth: It's More About the Algorithm than the Hardware





# **Highly Local Computation Model**



# **Compose These Cores into a Pipeline**



#### Program in space, not time

• Makes building programmable hardware more difficult

# Working on System to Explore This Space

#### Takes high-level program

• Graph of stencil kernels

#### Maps to hardware-level assembly

• Compute graph of operations for each kernel

#### **Currently we map the result to:**

• FPGA, custom ASIC
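To illustrate what a "graph of stencil kernels" could look like at the program level, here is a minimal software sketch. It assumes a stencil kernel is a function of a small fixed window around each pixel, and that kernels are chained so one kernel's output feeds the next. The `stencil` helper and both kernels are hypothetical and are not the API of the system described here.

```python
from typing import Callable, List

Image = List[List[int]]
Window = List[List[int]]


def stencil(image: Image, size: int, kernel: Callable[[Window], int]) -> Image:
    """Apply `kernel` to every size x size window (valid region only)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - size + 1):
        out_row = []
        for c in range(cols - size + 1):
            # Each output depends only on a small local window of the input.
            window = [row[c:c + size] for row in image[r:r + size]]
            out_row.append(kernel(window))
        out.append(out_row)
    return out


def box_sum(window: Window) -> int:
    """3x3 kernel: sum of all values in the window (short integer math)."""
    return sum(sum(row) for row in window)


def horizontal_gradient(window: Window) -> int:
    """2x2 kernel: right column minus left column."""
    return (window[0][1] + window[1][1]) - (window[0][0] + window[1][0])


image = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 5],
]

# A two-node kernel graph: box_sum feeds horizontal_gradient.
blurred = stencil(image, 3, box_sum)
edges = stencil(blurred, 2, horizontal_gradient)
print(blurred)
print(edges)
```

Because every output depends only on a small neighborhood, a hardware mapping only needs to keep a few rows of the image in local storage rather than fetching from a large memory for each operation.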

# **Enabling Innovation**

#### You don't just compile applications to efficiency

• Need to tweak the application to fit constraints

#### Need to enable application experts to play

• They know how to "cheat" and still get good results

# **Remember This Trade-off?**



http://genesis2.stanford.edu/

# **Chip Constructor Idea**



http://genesis2.stanford.edu/

## Not All Systems Are On The Bleeding Edge



## **App Store For Hardware**



# There's almost no limit to what iPhone can do.

The App Store has the best selection of mobile apps — from Apple and third-party developers. And they're all designed specifically for iPhone. The more apps you download, the more you'll realize your iPhone can do just about anything you can imagine.



#### What Arduino can do

Arduino can sense the environment by receiving input from a variety of sensors and can affect its surroundings by controlling lights, motors, and other actuators. The microcontroller on the board is programmed using the Arduino programming language (based on Wiring) and the Arduino development environment.

#### Community

The community of Arduino enthusiasts is vast, and includes region-specific groups and special interest groups. The community is an excellent further source of assistance on all topics such as accessory selection, project assistance, and ideas of all sorts.

## Challenge

#### If technology is scaling more slowly

- We can incorporate current design knowledge into tools
- To create extensible system constructors

#### If killer products are going to be application driven

• Application experts need to design them

#### We can leverage the 1<sup>st</sup> bullet to enable the 2<sup>nd</sup>

• To usher in a new wave of innovative computing products