Software Challenges in the Multicore Era (9:15 - 10:45 AM)

Software for High Performance Embedded Multiprocessors

Kenneth Mackenzie (Reservoir)

Chip multiprocessors (CMPs) are displacing application-specific integrated circuits (ASICs) in domains where designers must balance very high performance/power requirements against flexibility and programmer productivity. Software development for such applications is challenging because the resulting performance/power is compared against that of purpose-built ASICs. A key problem for the software is that such CMPs are built with relatively little memory per core in order to devote as much chip area as possible to computation. In this talk, I will describe Reservoir Labs' experiences developing software and tools in such domains (e.g., radar, networking, physics modeling). In particular, Reservoir is developing R-Stream, a parallelizing C compiler that uses a "polyhedral" representation of programs and architectures, allowing it to explicitly model space and other constraints as part of the mapping problem.

Ken Mackenzie is a Consulting Engineer at Reservoir Labs, Inc. Ken joined Reservoir Labs in 2003 after five years as an Assistant Professor in the College of Computing at Georgia Tech. He received a B.S. in Electrical Engineering and an M.S. and Ph.D. in Electrical Engineering and Computer Science from MIT in 1990, 1990, and 1998, respectively. As a graduate student, Ken worked as a member of the Alewife project, which built a novel, large-scale, distributed shared-memory multiprocessor. His Ph.D. thesis work developed a fast, protected messaging system for Fugu, a machine that extended Alewife with multitasking, virtual memory, and an Exokernel operating system. At Georgia Tech, Ken received an NSF CAREER award in 1999 for a project in which he and his students developed a form of software caching using dynamic binary rewriting, targeted at embedded systems. At Reservoir Labs, Ken has worked as a consultant on customer projects involving network processors and a custom supercomputer for a molecular dynamics application, as well as serving as PI for a DOE-funded, high-speed network intrusion detection project.

Doing real science on a Petaflop MPP

John Levesque (Cray)

This talk will discuss how several organizations have been able to perform "breakthrough" science on the Cray XT system. With the advent of the XT architecture, many researchers have scaled their applications to hundreds of sustained teraflops of computation and have been able to perform much more accurate simulations, improving understanding of the science being investigated.
Scaling to tens of thousands of processors can be difficult; however, a surprisingly large number of applications have scaled without much work. This talk will discuss the difficulties that needed to be addressed to enable these scientific breakthroughs. Many of these results came from a coordinated effort between the scientific researchers and experts from Cray's Supercomputing Center of Excellence, directed by the speaker, John Levesque.

John Levesque is the director of the Cray Supercomputing Center of Excellence at the Department of Energy's Oak Ridge National Laboratory (ORNL), home to the world's most powerful supercomputer for open (non-classified) scientific research. Levesque leads a team of engineers providing application and high performance computing expertise to researchers using ORNL systems. The Center of Excellence offers scientists and engineers from around the globe access to deep HPC application performance knowledge, allowing them to focus on their science or engineering challenge. Levesque also works closely with ORNL staff members to optimize application performance on ORNL's Cray X1E and XT3 supercomputers. Through this work, ORNL is running the Parallel Ocean Program (POP) at speeds up to 1.5 times faster than it runs on the Earth Simulator in Japan.
From 2001 to 2003, Levesque was a senior principal engineer responsible for the benchmarking initiatives for the Cray X1 system, playing a key role in helping CINECA, Italy's national supercomputer center, and the Department of Defense Modernization Program optimize their applications. Prior to Cray, Levesque was the director of the Advanced Computer Technology Center at IBM's Watson Research Center, where his team received the Scientific Achievement Award for contributing more than $200 million in new business sales in 2000.
With more than 35 years of experience in high performance computing, Levesque is a recognized expert in the optimization of FORTRAN programs and parallel vector system architectures. He has been involved with the Advanced Strategic Computing Initiative at the US Department of Energy and with the National Academy of Sciences, and is a regular contributor to the Cray User Conference. Levesque has also contributed to numerous publications on the optimization of programs for advanced scientific computers, including a collection of papers entitled "Concurrency and Computations" in 2003 and a book entitled "Guidebook to FORTRAN on Supercomputers," published in 1989. Levesque has a Bachelor of Arts and a Master of Arts in Mathematics from the University of New Mexico in Albuquerque, New Mexico.

Blue Gene System Software Design and Implementation

Robert W. Wisniewski (IBM)


Four years ago, Blue Gene made a significant impact by introducing an ultra-scalable computer with a focus on low power. Since then, BG/L has held the number 1 spot on the Top500 list for seven lists in a row. On the recent November 2007 list, the number 2 spot was taken by BG/P, the next line of Blue Gene computers. Blue Gene/P is designed to scale to several petaflops at 256 racks, with 256Ki nodes and 1Mi processor cores. There are unique challenges to designing software to run at this scale. In this talk I will describe the system software stack that runs on the Blue Gene/P machines and the motivation behind its design. I will focus on the strategies the team has used to get software to run in this ultrascale environment. I will also spend a little time describing some of the multi-core challenges we will face in the next generation.

Robert W. Wisniewski received his PhD from the University of Rochester. Prior to coming to IBM Research, he worked at SGI on operating system design and bring-up for their high-end Origin servers, as well as on real-time performance on parallel machines. He started at IBM Research working on the K42 project, a research effort aimed at designing, from the ground up, a scalable, customizable operating system for machines ranging from small parallel systems to the large-scale machines used in scientific computing. As part of the K42 effort, he made contributions to Linux, including LTT (the Linux Trace Toolkit) and relayfs. He was involved in Phases I and II of the IBM PERCS DARPA HPCS project. For Phase II of PERCS he worked on CPO (Continuous Program Optimization), which is aimed at using vertically integrated performance data to automatically improve the performance of both applications and the underlying system. Following his work on PERCS, he contributed to the CSO (Commercial Scale Out) project in the area of performance understanding. He is currently a research scientist and manager of the Blue Gene Software Team at IBM Research. His research interests include scalable parallel systems, first-class system customization, performance monitoring, and using performance monitoring information for continuous program and system optimization.

Hardware Challenges in the Multicore Era (11:00 AM - 12:00 PM)

Extending Moore's Law: Challenges and Opportunities

Shih-Lien Lu (Intel)


Moore's Law will continue to hold in the foreseeable future, but voltage scaling is slowing down. This talk covers two topics. First, we examine the challenge of evaluating the design of a system with hundreds or even thousands of processors, and a potential solution. As microprocessor architectures move from emphasizing single-core performance to multi-core designs, simulation techniques employed successfully for small designs, with a single-digit or low two-digit number of cores, are becoming insufficient. The Research Accelerator for Multiple Processors (RAMP) project proposes to use FPGAs as building blocks of an infrastructure for evaluating new architectures with many processors. Second, we examine some issues facing microprocessors as supply voltage continues to scale down. These issues cause the processor to fail; in particular, memory bits in a processor become unstable. We propose architectural techniques that enable microprocessor caches (L1 and L2) to operate at low voltages despite very high memory cell failure rates.


Shih-Lien Lu is currently a Principal Engineer with the Microprocessor Technology Labs of Intel's Corporate Technology Group in Hillsboro, Oregon. He leads a team of researchers working on microarchitecture techniques to extend Moore's Law. He received his BS in EECS from UC Berkeley, and an MS and PhD, both in CSE, from UCLA. Prior to joining Intel in 1999, he worked from 1984 to 1991 on the MOSIS project at USC/ISI, which provides VLSI fabrication services to the research and education community. After MOSIS he was on the faculty of the ECE Department at Oregon State University. While at OSU, he received the College of Engineering Carter Award for outstanding and inspirational teaching in 1995 and the Engelbrecht Young Faculty Award in 1996.

Back to the Future: The Transition to Multi-Core Processors in Mobile Applications

Ty Garibay (TI)

Now that multi-core processors from Intel and AMD are dominating shipments in the PC market, the obvious question is when this same technology wave will hit the mobile applications market. This presentation will look at the transition to multi-core processors in other markets and seek to extrapolate how a similar transition might play out in mobile phones, personal media players, etc. We will finish with a short introduction to the first licensable processor core designed for multi-processing mobile applications, ARM's Cortex-A9.

Ty Garibay is the Program Manager for ARM Processor development in Texas Instruments' Wireless Terminals Business Unit and site manager for TI's Austin office. Currently, he and his team are focused on the completion of the industry's first 45nm implementation of ARM's Cortex-A8 super-scalar processor core. Over the previous 20+ years, Ty has designed microprocessors at ARM, Alchemy Semiconductor, SGI/MIPS, Cyrix and Motorola, participating in all phases of design from circuits and layout through architecture and product definition. He holds over 30 patents in the areas of integrated circuit design, computer architecture and design methodology.

GPUs and Massively Parallel Architectures (1:00 - 3:00 PM)

Blue Gene: The Next Generation

Valentina Salapura (IBM)


The new Blue Gene/P chip multiprocessor (CMP) scales node performance using a multi-core system-on-a-chip design. It exploits a novel way of reducing coherence cost by filtering out useless coherence actions, resulting in a design with improved power and performance characteristics. In addition to multi-threaded execution, parallelism is exploited at the process, data, and instruction levels. The dual floating-point unit and the dual-issue out-of-order PowerPC 450 processor core exploit data- and instruction-level parallelism, respectively. To exploit process-level parallelism, special emphasis was put on efficient communication primitives, including hardware support for the MPI protocol, such as low-latency barriers, and five highly optimized communication networks.
As a result of this deliberate design-for-scalability approach, Blue Gene supercomputers offer unprecedented scalability, in some cases by orders of magnitude, to a wide range of scientific applications. These applications have advanced scientific discovery, which is the real merit and ultimate measure of success of the Blue Gene system family.


Valentina Salapura is an IBM Master Inventor and System Architect at the IBM T.J. Watson Research Center. Dr. Salapura has been a technical leader for the Blue Gene program since its inception. She has contributed to the architecture and implementation of several generations of Blue Gene Systems focusing on multiprocessor interconnect and synchronization, and multithreaded, multicore architecture design and evaluation. Most recently, she has been unit lead for several units of Blue Gene/P, as well as a leader of the chip and system bringup effort. Valentina Salapura is a recipient of the 2006 ACM Gordon Bell Prize for Special Achievements for the Blue Gene/L Supercomputer and Quantum Chromodynamics. Dr. Salapura has received several corporate awards for her technical contributions. Dr. Salapura is the author of over 60 papers on processor architecture and high-performance computing, and holds many patents in this area. Dr. Salapura is a Senior Member of the IEEE.

The Democratization Of Parallel Computing

David Luebke (NVIDIA)

Modern GPUs provide a level of massively parallel computation that was once the preserve of supercomputers like the MasPar and Connection Machine. NVIDIA's Tesla architecture for GPU Computing provides a fully programmable, massively multithreaded chip with up to 128 scalar processor cores and over twelve thousand threads, capable of delivering hundreds of billions of operations per second. Researchers across many scientific and engineering disciplines are using this platform to accelerate important computations by up to 2 orders of magnitude.
I will provide an overview of the Tesla architecture and explore the transition it represents in massively parallel computing: from the domain of supercomputers to that of commodity "manycore" hardware available to all. I will also introduce CUDA, a scalable parallel programming model and software environment. By providing a small set of readily understood extensions to the C/C++ languages, CUDA allows programmers to focus on writing efficient parallel algorithms without the burden of learning a multitude of new programming constructs. Finally, as the GPU is the only widely available commodity "manycore" chip today, I will explore its importance as a research platform for investigating important issues in parallel programming and architecture.


David Luebke is a Research Scientist at NVIDIA Corporation, which he joined in 2006 after eight years on the faculty of the University of Virginia. He has a Ph.D. in Computer Science from the University of North Carolina and a Bachelors degree in Chemistry from the Colorado College. Luebke's principal research interests are general-purpose GPU computing and realistic real-time computer graphics. Specific recent projects include fast multi-layer subsurface scattering for realistic skin rendering, temperature-aware graphics architecture, scientific computation on graphics hardware, advanced reflectance and illumination models for real-time rendering, and image-based acquisition of real-world environments. Past projects include the book "Level of Detail for 3D Graphics", for which Luebke was the lead author, and the Virtual Monticello museum exhibit, which ran for over 3 months and helped attract over 110,000 visitors as a centerpiece of the major exhibition "Jefferson's America and Napoleon's France" at the New Orleans Museum of Art.

Challenges for GPU Architecture

Michael Doggett (AMD)


GPU architectures have evolved into massively parallel multi-core machines. This talk will review GPU architecture by looking at AMD's ATI Radeon 2900XT, a GPU capable of massively parallel computation for high-performance 3D graphics and general-purpose algorithms. The shader core uses multithreading to hide memory-access latency so that compute units are kept busy. This high level of parallelism is achieved through a hierarchy of compute elements and programmed via an abstracted sequential API. New generations of GPUs are challenged to offer more programmable flexibility for compute and graphics while increasing performance for existing APIs and applications.


Michael Doggett is a Principal Member of Technical Staff in AMD's Graphics Product Group. He has worked on the Radeon 2900 and, previously, the Xbox 360 GPU, and continues to work on upcoming high-end GPUs. He worked as a post-doc at the University of Tuebingen, Germany, on displacement mapping and volume rendering hardware. He has a B.E., B.Sc., and PhD from the University of New South Wales, Sydney, Australia.

Future CPU Architectures: The Shift from Traditional Models

Eric Sprangle (Intel)


While Moore's Law is alive and well in silicon scaling technology, it is clear that microprocessors have encountered significant technical issues that will influence the overall direction of future architectures. This talk discusses the recent history of Intel microprocessors and some of the rationale that guided the development of those processors. Further, the talk highlights why future microprocessor architectures will likely look different from those of the past.
The traditional microprocessor architecture uses hardware techniques such as out-of-order processing to extract higher performance from applications that have little or no explicit parallelism. These hardware techniques have continued to improve performance, but at the cost of significantly increasing the power consumption of traditional microprocessors. The power increases have led not only to higher electrical power delivery costs, but also to higher costs for dissipating that power, resulting in high ambient noise, larger enclosures, and hotter laps. To avoid a future that requires asbestos-based jeans to safely handle laptops, the microprocessor architecture must change to deliver higher performance without significantly higher power.
It is likely that microprocessor architecture will evolve from the ubiquitous single-core, single-threaded machine that we know and love to an architecture that employs more cores and more threads. This shift is apparent in today's market, where general-purpose processors have adopted techniques such as Hyper-Threading Technology and multi-core designs. This talk will speculate on some potential next steps for that technology and some of the potential implications for software development.


Eric Sprangle is a Principal Engineer with Intel's Visual Computing Group in Austin. Eric has been with Intel for eight years, working on the Intel Pentium 4 processor family, and he is currently one of the lead architects on the Larrabee project. Prior to joining Intel, Eric worked at ROSS Technology. Eric enjoys training for and racing in triathlons.

Computer Security Research in Atlanta (3:15 - 4:30 PM)

Sven Krasser (Secure Computing)

Dr. Sven Krasser is Director of Data Mining Research at Secure Computing Corporation. In this role, he leads the data analysis and classification efforts for TrustedSource, Secure Computing's industry-leading global reputation service. He is a recognized authority on anti-spam and web security threats, the lead inventor of numerous key patent-pending technologies, and the author of various publications on networking and security.
Dr. Krasser received a Vordiplom in Electrical Engineering from the Technische Universitaet Carolo-Wilhelmina zu Braunschweig in Braunschweig, Germany, in 2000, and an M.S. and Ph.D. in Electrical and Computer Engineering from the Georgia Institute of Technology in Atlanta, Georgia, in 2002 and 2004.

Paul Royal (Damballa)

Paul Royal is Principal Researcher at Damballa, Inc., an Atlanta-based company whose primary focus is on detection and remediation of targeted threats such as bots. In his role at Damballa, Paul collaborates with researchers and engineers to design new techniques for and apply ongoing research efforts in the implementation of network sensors and analyzers used for the discovery and identification of bot behavior.
Paul received his Bachelor of Science and Master of Science in Computer Science from the Georgia Institute of Technology in 2004 and 2006, respectively. As a graduate student, Paul studied binary analysis under Dr. Wenke Lee, focusing on automated malware processing and transformation. Paul's graduate academic work yielded PolyUnpack, the implementation of a technique for the automated extraction of the hidden code in packed malware. PolyUnpack is part of the ISC OARC malware repository and will be integrated into a future version of the Anubis malware analysis service.

Tom Cross (IBM/ISS)

Tom Cross is a vulnerability researcher with IBM Internet Security Systems X-Force. Tom's primary responsibility is to reproduce known vulnerabilities for the purpose of developing detection and prevention strategies. He also studies emerging technologies, particularly enterprise and carrier Voice over IP, as well as mobile IP networks, to develop a better understanding of the security threats these infrastructures will face and how best to mitigate them. To this end, Tom works closely with research teams at the Georgia Tech Information Security Center who are working on IBM-funded projects in these areas. Tom has worked for IBM Internet Security Systems since 2003, prior to which he worked as a network security consultant and as a Director of Security Engineering for a large ISP. Tom holds a Bachelor's degree in Computer Engineering from Georgia Tech.