Workshop on System-on-a-Chip Design for HPC (Denver, 2014)
The HPC-SoC workshop will focus on semi-custom, application-targeted SoC designs and server processors for HPC and datacenters, with the goal of developing a strategy for an open fabric targeted at SoC designs for high-end computing applications.
This workshop will explore the following questions:
Technology Inventory and Requirements Analysis: We will survey the currently available IP building blocks and identify where gaps exist in current IP circuit technologies and design tools that will be crucial to HPC- and datacenter-targeted SoC ASICs. The NoC fabrics that connect the IP components together do not guarantee a trouble-free SoC design process, and many components that are crucial for HPC are not available for licensing. In addition, technology integration challenges, such as the low-cost integration of memory cubes and advanced packaging, remain.
Shekhar Borkar: Senior Fellow, Intel Corporation
SoC for HPC?—Mindset is the biggest impediment: An HPC system is an embedded machine that should be tuned to its HPC workloads. For the last two decades, general-purpose COTS processors were employed for two reasons: (1) the cost of a custom design was prohibitive, and (2) by the time a custom design was realized, it had become obsolete compared to the COTS parts. Things are different now: frequency has flattened, energy efficiency is king, data movement dominates, and extreme parallelism provides performance. More importantly, advances in design tools allow custom designs (SoCs) to be realized in a matter of months! That is why we must rethink and consider embracing SoC design for HPC to provide customized, HW/SW co-designed, efficient solutions.
Franz Franchetti, Carnegie Mellon University
Technology Inventory and Requirements Analysis: Efficiently solving large problems on SoC ASICs requires rethinking SoC fundamentals. Typically, accelerators and special functional units are put together with CPUs on SoCs to allow for efficient computation (think Qualcomm Snapdragon or TI OMAP). However, the data set that can be held on chip is limited to a few MB, which severely limits the usefulness of on-chip accelerators for HPC and HPEC applications. In the context of the DARPA PERFECT program, we are developing the concept of memory-side accelerators that are placed on logic layers within 3D stacks, such as the memory-controller layer of Micron’s HMC, enabling near-memory computing. We demonstrated in simulation a 100x speed-up and a 1000x power-efficiency gain for the notoriously hard problem of sparse matrix-matrix multiplication.
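For readers unfamiliar with the kernel, below is a minimal sketch of the row-by-row (Gustavson-style) formulation of sparse matrix-matrix multiplication; the function name and CSR-array interface are illustrative, not taken from the PERFECT work. The data-dependent, irregular accesses into B are exactly the memory traffic that near-memory accelerators aim to absorb.

```python
def spgemm(A_indptr, A_indices, A_data, B_indptr, B_indices, B_data):
    """Gustavson row-by-row sparse matrix-matrix multiply.

    A and B are given in CSR form (indptr/indices/data arrays); the
    result is returned as one {column: value} dict per row of C = A*B.
    The inner loop's indirect, data-dependent reads of B's rows are
    what make this kernel memory-bound on conventional hierarchies.
    """
    n_rows = len(A_indptr) - 1
    C = []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        for jj in range(A_indptr[i], A_indptr[i + 1]):
            j, a = A_indices[jj], A_data[jj]
            # scale row j of B by a and merge it into the accumulator
            for kk in range(B_indptr[j], B_indptr[j + 1]):
                k = B_indices[kk]
                acc[k] = acc.get(k, 0.0) + a * B_data[kk]
        C.append(acc)
    return C
```

For example, with A = [[1,0],[0,2]] and B = [[0,3],[4,0]] in CSR form, the result rows are {1: 3.0} and {0: 8.0}, i.e., C = [[0,3],[8,0]].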
Mark Horowitz, Stanford
It is all about energy, which is all about communication/memory: There is a common view that since machines are energy limited, we need to create specialized computing cores on our SoCs to enable the next generation of HPC. Yet if you look at the energy breakdown of a modern “processor” chip, you will find that over 50% of the chip energy is dissipated in the memory system that supports the cores. This result is for a complex OOO machine; the ratio for low-power cores is much smaller. When you add the external DRAM energy, the ratio of memory to compute energy becomes even larger. While this result is not surprising to people in the implementation business, it should cause people to question how specialized accelerators can help improve energy efficiency. Isn’t that working on the wrong part of the equation? If one looks more carefully at accelerators, one sees that they are all tuned to a very narrow set of applications. The apps are not only data parallel, but they also have absurd memory locality. If they didn’t have absurd locality, their energy would be dominated by their memory fetches. Thus the main question HPC designers should focus on is converting their core kernels to have absurd locality. If that is possible, there might be a generalized computing engine for this class of applications.
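A back-of-envelope model makes the argument concrete. The per-operation energies below are illustrative placeholders (actual values depend strongly on process node, microarchitecture, and memory system), but the conclusion, that operand reuse rather than faster arithmetic dominates the energy equation, holds across a wide range of assumptions.

```python
# Back-of-envelope kernel energy model. The constants are ILLUSTRATIVE
# placeholders, not measured values.
PJ_PER_FLOP = 20.0        # assumed energy of one double-precision FLOP (pJ)
PJ_PER_DRAM_BYTE = 250.0  # assumed energy to move one byte from DRAM (pJ)

def kernel_energy_pj(flops, dram_bytes):
    """Total energy (pJ) of a kernel performing `flops` operations and
    moving `dram_bytes` to/from external memory."""
    return flops * PJ_PER_FLOP + dram_bytes * PJ_PER_DRAM_BYTE

# Streaming kernel, no reuse: one 8-byte operand fetched per FLOP.
low_locality = kernel_energy_pj(flops=1e6, dram_bytes=8e6)
# Blocked kernel reusing each operand 100x ("absurd locality").
high_locality = kernel_energy_pj(flops=1e6, dram_bytes=8e4)
```

Under these illustrative assumptions, the streaming kernel spends roughly 99% of its energy on DRAM traffic, and blocking alone buys a roughly 50x total-energy reduction, with no accelerator required.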
State of the Art: What can be done to leverage commodity embedded IP components, tools, and design methodologies to create HPC-targeted designs? We will review the current state-of-the-art SoC design workflow and the key technology components that are currently available on the open market.
Michael Holmes, Manager ASIC/SoC Products, Sandia National Laboratories
The evolution of the complex SoC has driven additional levels of design abstraction into the SoC design flow. Highly abstracted, fast-executing IP block models now permit increased complexity and chip-scale architecture evaluation for large multicore software/hardware platform development. Using these design tools together with a comprehensive selection of pre-verified building-block IP cells and extensive verification suites allows the designer to make architectural trades and focus on key performance parameters, including throughput and power. This methodology can be used to rapidly optimize the platform to the problem. The new top layer of design and simulation abstraction also creates a new layer in the hierarchy of designers in the SoC development process, which now includes true system-level co-design of the hardware and software platform. Stated alternatively, this new methodology brings system-level designers much closer to the SoC design and architecture trades. This, in turn, has resulted in the development of new rapid-prototyping tools and a large library of verified IP cells, from IP vendors and third parties, that support design at this increased abstraction level using available building blocks in a cohesive environment.
Steven Nowick, Columbia University
Asynchronous (i.e., clockless) and GALS (i.e., globally-asynchronous locally-synchronous) networks-on-chip offer the potential for significant improvements in performance, energy, reliability, and scalability, since they eliminate the rigidity and overhead of the fixed-rate clock and allow distributed assembly and communication of components. As such, they provide a flexible digital integrative medium for SoCs in the HPC domain, allowing scalable and modular assembly and communication of Lego-like IP components. Several recent asynchronous/GALS NoC implementations, using largely standard-cell design methodology, have demonstrated significantly lower system latency and power (including in comparison to clock-gated synchronous designs), as well as smaller area footprint, while obtaining comparable throughput, in direct comparison to leading synchronous NoCs in identical technology. Additional benefits include: (i) the ability to support highly differentiated per-link performance, and (ii) near-instant wakeup from sleep modes. Semi-automated CAD tool flows have been used, including with synchronous CAD tools (Synopsys DC/IC Compiler), though remaining tool challenges will be discussed.
Martin Deneroff, Green Wave Inc.
This talk will walk participants through the steps and end-to-end costs of implementing an SoC. The cost data and observations about the steps come from experience gathered in creating the D.E. Shaw Anton 1 SoC (for which Marty was Chief Architect) and the Green Wave SoC design.
Mike Filippo, ARM Inc.
Today’s state-of-the-art SoC IP is rapidly approaching the complexity and capability of modern high-performance hand-tuned HPC systems. Low-power, high-performance CPU IP combined with highly scalable system solutions are enabling order-of-magnitude gains in overall platform capability. The variety of CPU, GPU, system IP, and accelerator/IO IP available in the SoC market today, and that coming in the next 2-3 years, will be well suited to systems of all kinds and can be expected to meet the requirements of the HPC market as well. One-size-fits-all HPC solutions will no longer be the norm: SoC architectures will be adapted to fit the specific needs of the various HPC workloads, just as SoCs are currently being tailored to many different and varied market segments today. However, SoC IP providers must ensure their products are well tuned to avoid losing performance and/or efficiency in large HPC system designs.
Software Infrastructure: What will be required of our software environment to take full advantage of rapidly evolving SoC designs? What would need to change in our software engineering practices to keep up with a more flexible and rapidly evolving hardware design target?
Franz Franchetti, Carnegie Mellon University
Software for HPC and HPEC systems depends crucially on a collection of high-performance math libraries providing the usual primitives like FFT and BLAS, and often goes well beyond this. In an environment like SoC ASICs, porting and performance portability become crucial issues. An important but elusive capability is to quickly port and tune the low-level high-performance math library infrastructure to new machines, to avoid the opportunity cost of wasting precious computing cycles. With Spiral (www.spiral.net, www.spiralgen.com) we have demonstrated such a system, which solves the performance-portability problem for a set of kernel functions, mainly frequency-domain methods. Spiral generates efficient programs for parallel platforms including vector architectures, shared- and distributed-memory platforms, and GPUs. Scalability was demonstrated with the automatic synthesis of the HPC Challenge Global FFT, a large 1D FFT run across a whole supercomputer system, and by rapidly retargeting to the new Intel Xeon Phi processor and the K computer.
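As a point of reference for the kernel family Spiral tunes, here is the textbook radix-2 Cooley-Tukey recursion (a hand-written sketch, not Spiral output; Spiral's value lies in automatically exploring many such recursive breakdowns and mapping them efficiently onto a target platform):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    Splits the input into even- and odd-indexed halves, transforms
    each recursively, then combines them with twiddle factors. Spiral
    searches over many alternative factorizations of this recursion.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

As a sanity check, fft([1, 1, 1, 1]) yields [4, 0, 0, 0] (up to rounding), matching the DFT of a constant signal.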
Sudhakar Yalamanchili, Georgia Tech
The Evolution of Disaggregated Compute: As data movement dominates, we will see the continued diversification of compute into accelerators and into the memory and I/O hierarchies. We will need to think about programming models for such organizations and about how such capabilities are exposed or utilized by languages, compilers/debuggers, and runtimes. This will have a major impact on software infrastructures.
Wen-Mei Hwu, University of Illinois
Intentional Programming for Productive Development of Portable, High-Efficiency Software in the SoC Heterogeneous Computing Era: In a modern SoC computing system, we have a large variety of computing devices such as CPU cores, GPUs, DSPs, and specialized micro-engines. The OpenCL 2.0 and HSA standards offer low-level programming interfaces that provide some basic level of portability. However, the algorithms are typically already baked into the kernel code. In this talk, I will advocate the use of Intentional Programming to better separate the concerns of what needs to be done and how the work should be done. This would allow better algorithm adjustment and more global optimization for applications. I will show some initial results from the Triolet-Python project.
Simulation/Modeling: SoCs pose challenges to existing monolithic, CPU-centric simulation environments that were originally designed for cell-phone-scale systems. What new technologies will be required to bring to the HPC-SoC design space the kind of design agility that competitive consumer-electronics designs currently rely on?
Bruce Childers, University of Pittsburgh
Multi-fidelity modeling and simulation: Design decisions need to be made early, but these decisions invariably depend on lower-level ones. The design process must consider how decisions affect the whole stack — from applications, to system software, to the hardware SoC itself. This issue is compounded by HPC systems with thousands (millions?) of nodes, complex interconnects, and complex parallel applications. It is hard enough and time-consuming to model conventional, non-customized systems at scale; now we also have to consider the possibility of customization at scale.
Customization inherently relies on exploring many design options; the exploration will need to be fast and accurate at scale. Thus, we need high-level abstract models, yet the models must consider interplay with the lower-level, later design decisions. The models have to consider customization of both software and hardware. An interactive, flexible exploration environment that provides automatic guidance (agent-based design exploration) would be ideal.
Model customization for software and hardware: The possibility to customize exists at all layers — e.g., if the compiler can’t exploit the SoC accelerator, what good does that accelerator do? We need to model how well the compiler can use the accelerator, what is needed to support it, how hard it is to include, etc. The same can be said for the application and the OS/runtime system. Thus, we need to model how customization affects both individual software/hardware components and the whole stack, as well as the interactions/interplay among components in the context of the customized whole system.
Verification and validation of the models/simulation: This problem gets worse when the application, system software, and hardware are customized. The design flow hinted at above requires “plugging together” models at multiple levels of fidelity: how do we verify that these are composed correctly and work together? Good engineering practice will surely help, i.e., defining interfaces, clear separation of concerns, etc. However, I think more is needed. Formally defined interface semantics may offer a chance to automatically reason about the way models are composed. Look at the certified-compiler community for inspiration: they have defined formal methods to compose optimization passes. Validation is very difficult here (impossible?) — I’d settle for even a coarse confidence metric (computed automatically) that hints at how much to trust the results (I’m not asking for much!).
X. Sharon Hu, University of Notre Dame
Compared to embedded system design, developing SoCs for HPC must deal with many of the same problems. However, the scale of HPC systems and the breadth of HPC applications present some unique difficulties for simulation and modeling. For example, how should one choose which accelerators to include, given the variety of applications to be executed on the same SoC, and what are the implications of using certain accelerators for inter-node performance? Hardware-software co-design approaches and tools (e.g., support of multiple abstraction levels, functional as well as non-functional simulation) developed for embedded systems could be a good starting point for developing SoC design tools for HPC systems. However, the scale of HPC applications is beyond what can be handled by state-of-the-art design tools for embedded applications. Multi-scale physics and multi-scale decision-making approaches may provide insights on how to tackle the challenges of designing SoCs for HPC.
Sudhakar Yalamanchili, Georgia Tech
Integrated Power, Performance, and Reliability Modeling: The simulation and modeling environments of HPC-SoCs should include support for (i) multi-scale models, and (ii) integrated models of physical behaviors produced by future workloads. For example, with regard to multi-scale modeling, co-design will become increasingly important between the behavior of diverse IP components, power delivery, power management, thermal fields, and device reliability. These phenomena operate over different time scales but encompass important performance-limiting interactions. Moreover, these environments must enable composition of both the functional behaviors and the physical models of IP blocks to support design-space exploration over alternative HPC-SoC designs. The ability to construct higher-level abstract physical models (e.g., power and reliability) of IP blocks will be key to analysis early in the design process. The ability to pursue these investigations in the context of alternative packaging models (e.g., 2.5D and 3D) will also be an important component of future modeling and simulation environments. In the Manifold and SST projects, an effort has been underway to define a set of abstractions and interfaces for this purpose.
Arun Rodrigues, Sandia National Laboratories
The System-on-Chip revolution opens up vast new design spaces. While a number of detailed tools exist to perform final validation, the early design space exploration requires the ability to quickly explore many designs at a high level. The unique requirements of HPC mean that this design exploration requires simulation of large parallel systems, not just portions of a node. To address these issues, the Structural Simulation Toolkit (SST) has been developed to allow large parallel simulation of advanced architectures at multiple scales. This talk explores the capabilities of the SST and provides examples for how it can be used to examine novel memory and network designs.
Open Technologies that Support an SoC Ecosystem for HPC: What open technologies, tools, and open-source gateware are available to engage the academic and research community involved in exploring the design space for high-performance SoCs?
Krste Asanovic, University of California, Berkeley
SuperOpen: Building a truly open HPC-SoC infrastructure around RISC-V and Chisel
Traditional embedded IP, although adequate to the task of building HPC-SoCs, is not open. Although license fees and licensing business models can be a barrier to HPC-SoC, avoiding license fees is not the primary benefit of open-source hardware. More important benefits are:
- Allow innovation from a larger community – existing IP vendors don’t care about HPC, so modifications to support HPC won’t happen to their closed-source code. Open-source designs can improve far faster than closed-source designs.
- Preservation of software investment when the company supporting an ISA or system-bus architecture dies (basically every ISA has died at some point, yet someone still owns its IP, and ARM and Intel won’t live forever).
- Auditable hardware designs – no government wants to trust SoC IP from a foreign vendor.
I’ll describe our work on the open RISC-V ISA and the Chisel hardware construction language, including recent vector microprocessor prototypes measured at 16 GFLOPS/W for IEEE 754-2008 DGEMM.
Bruce Childers, University of Pittsburgh
I am sure you’ll get many answers about specific design tools/IP blocks, e.g., OpenSPARC, OpenRISC, OCP, SystemC, Gem5, SST, etc. In addition to the tools, models, and IP blocks, we need to find mechanisms to work closely as a community — leveraging effort — to achieve open, custom SoCs for HPC. While opencores.org is a repository for open-source IP blocks, it covers a diverse set of IP and mostly acts as a distribution repository. Instead, the exchange should be more active: build/engage community, clearinghouse for the artifacts (tools, models, IP blocks, etc.), online training, online access to tools/models, online access to experimental data (outcomes that can be reproduced/validated by others), etc. We are trying something similar for architecture through Open Curation for Computer Architecture Modeling (OCCAM). This might be a starting point – at least, some of the underlying software infrastructure could serve the HPC SoC exchange. There are other infrastructures that could be used, such as HubZero. A great example of an active, successful exchange is nanohub.org.
David Donofrio, Lawrence Berkeley National Lab
Recent advancements in technology scaling have shown a trend towards greater integration with large-scale chips containing thousands of processors connected to memories and other I/O devices using non-trivial network topologies. Software simulation proves insufficient to study the tradeoffs in such complex systems due to slow execution time, whereas hardware RTL development is too time-consuming. We present OpenSoC Fabric, an on-chip network generation infrastructure which aims to provide a parameterizable and powerful on-chip network generator for evaluating future high performance computing architectures based on SoC technology. OpenSoC Fabric leverages a new hardware DSL, Chisel, which contains powerful abstractions provided by its base language, Scala, and generates both software and hardware models, in the form of C++ and Verilog, from a single code base. The OpenSoC Fabric infrastructure is modeled after existing state-of-the-art simulators, offers a large and powerful collection of configuration options, and follows object-oriented design and functional programming to make functionality extension as easy as possible.
1801 California St., Ste. 2800
Denver, CO 80202
Optional meal fee is $25 per day, which includes lunch, breakfast snacks, coffee, and refreshments.
The Leidos offices are located in downtown Denver, four blocks from the Denver Convention Center, with plentiful nearby hotels. For this reason we did not reserve a block of rooms; rather, participants can choose their preferred hotel chain in order to obtain government rates. Estimated costs range from $156/night at a nearby Residence Inn to $195/night at the downtown Denver Marriott Courtyard.
- Registration: May 20 – June 15
- Workshop: August 26–27 (Tuesday–Wednesday), 8:30am–5:30pm
- Steering Committee: August 28, 8:30am-12:00pm (Thursday)
- Ang, James (SNL)
- Bergman, Larry (NASA/JPL)
- Harrod, William (DOE/SC)
- Hiller, Jon (STA representing DARPA/MTO)
- Jiang, Hong (NSF)
- Shalf, John (LBNL)
- Wheeler, Noel (LPS/ACS)
- Asanovic, Krste (UC Berkeley)
- Bhandarkar, Dileep (Qualcomm)
- Boas, Bill (System Fabric Works)
- Borkar, Shekhar (Intel Corporation)
- Burke, Daniel (LBNL)
- Childers, Bruce (Univ of Pittsburgh)
- Deneroff, Martin (Green Wave Inc.)
- Donofrio, David (LBNL)
- Filippo, Mike (ARM)
- Franchetti, Franz (CMU)
- Hass, David (Broadcom)
- Hemmert, Scott (SNL)
- Hoang, Thuc (NNSA/ASC)
- Holmes, Michael (SNL)
- Horowitz, Mark (Stanford)
- Hu, Sharon (University of Notre Dame)
- Hwu, Wen-Mei (University of Illinois)
- Kogge, Peter (University of Notre Dame)
- Macalusso, Tina (Leidos)
- Mudge, Trevor (University of Michigan)
- Nowick, Steve (Columbia University)
- Nussbaum, Sebastian (AMD)
- Phillips, Steve (Superion Technology)
- Powell, Wes (NASA GSFC)
- Some, Rafi (NASA JPL)
- Yalamanchili, Sudhakar (Georgia Tech)
Tuesday August 26
8:00-8:30 Coffee and breakfast pastries
8:30-9:00 Welcome and Overview of Goals for the Workshop
8:30-9:00 John Shalf (LBNL), Jim Ang (SNL) (Slides)
9:00-9:45 Agency Perspectives: Noel Wheeler (LPS/ACS), Larry Bergman (NASA), Jon Hiller (STA for DARPA)
9:45-12:15 Session 1: Technology Inventory and Requirements Analysis
9:45-10:00 Shekhar Borkar, Intel Labs (Slides)
10:30-12:15 Session 1: Technology Inventory and Requirements Analysis (continued)
10:30-10:45 Franz Franchetti, CMU (Slides)
10:45-11:00 Peter Kogge, Notre Dame (Slides)
11:00-11:15 Mark Horowitz, Stanford (Slides)
11:15-12:15 Group Discussion / Panel on Technology & Requirements
1:00-3:00 Session 2: State of the Art
1:00-1:15 Mike Filippo, ARM (Slides)
1:15-1:30 Mike Holmes, SNL (Slides)
1:30-1:45 Steven Nowick, Columbia (Slides)
1:45-2:00 Martin Deneroff, EMU Solutions (Slides)
2:00-3:00 Group Discussion/Panel on State of the Art
3:30-5:30 Session 3: Software Infrastructure
3:30-3:45 Franz Franchetti, CMU (Slides)
3:45-4:00 Bill Boas, SFW (Slides)
4:00-4:15 Sudhakar Yalamanchili, Georgia Tech (Slides)
4:15-4:30 Wen-Mei Hwu, University of Illinois (Slides)
4:30-5:30 Group Discussion/Panel on Software Infrastructure
5:30 Arrange for group dining, informal social hour
Wednesday August 27
8:00-8:30 Coffee and breakfast pastries
8:30-9:00 Welcome and Highlights from Sessions 1-3
8:30-9:00 Jim Ang (SNL), John Shalf (LBNL)
9:00-10:45 Session 4: Simulation and Modeling
9:00-9:15 Sharon Hu, Notre Dame (Slides)
9:15-9:30 Bruce Childers, Univ. of Pittsburgh (Slides)
9:30-9:45 Sudhakar Yalamanchili, Georgia Tech (Slides)
9:45-10:00 Arun Rodrigues, SNL (Slides)
10:30-10:45 Hong Jiang (NSF) - Agency Perspective (Slides)
10:45-11:30 Group Discussion / Panel on Simulation and Modeling
11:30-12:15 Session 5: Open Technologies for the SoC Eco-system
11:30-11:45 Bruce Childers, Univ. of Pittsburgh (Slides)
11:45-12:00 David Donofrio, LBNL (Slides)
12:00-12:15 Krste Asanovic, UCB (Slides)
1:00-3:00 Resume Session 5: Open Technologies that support HPC leverage of the SoC Eco-system
1:00-2:00 Group Discussion / Panel on Open Technologies for SoC Ecosystem
2:00-3:00 SoC Industry Roundtable
David Hass, Broadcom
Dileep Bhandarkar, Qualcomm
Mike Filippo, ARM
Steve Philips, Superion
Sebastian Nussbaum, AMD
Shekhar Borkar, Intel
Bill Boas, System Fabric Works
3:30-5:30 Session 6: Next Steps
3:30-5:30 Group Discussion:
- What are the key points that you take from this workshop?
- Any surprises from our discussions?
- What issues remain in SoC for HPC?
- Recommendations for Follow-on Workshop Topic(s) and Participants
5:30 Arrange for group dining, informal social hour
Thursday August 28 Steering Committee Only
8:00-8:30 Coffee and breakfast pastries
8:30-9:00 Goals and Outcomes for the Workshop
10:15-11:15 Writing assignments, timelines and close-out
12:00-1:00 Lunch on site