
In Praise of Computer Organization and Design: The Hardware/Software Interface, Fifth Edition

"Textbook selection is often a frustrating act of compromise—pedagogy, content coverage, quality of exposition, level of rigor, cost. Computer Organization and Design is the rare book that hits all the right notes across the board, without compromise. It is not only the premier computer organization textbook, it is a shining example of what all computer science textbooks could and should be."
—Michael Goldweber, Xavier University

"I have been using Computer Organization and Design for years, from the very first edition. The new Fifth Edition is yet another outstanding improvement on an already classic text. The evolution from desktop computing to mobile computing to Big Data brings new coverage of embedded processors such as the ARM, new material on how software and hardware interact to increase performance, and cloud computing. All this without sacrificing the fundamentals."
—Ed Harcourt, St. Lawrence University

"To Millennials: Computer Organization and Design is the computer architecture book you should keep on your (virtual) bookshelf. The book is both old and new, because it develops venerable principles—Moore's Law, abstraction, common case fast, redundancy, memory hierarchies, parallelism, and pipelining—but illustrates them with contemporary designs, e.g., ARM Cortex A8 and Intel Core i7."
—Mark D. Hill, University of Wisconsin-Madison

"The new edition of Computer Organization and Design keeps pace with advances in emerging embedded and many-core (GPU) systems, where tablets and smartphones are quickly becoming our new desktops. This text acknowledges these changes, but continues to provide a rich foundation of the fundamentals in computer organization and design which will be needed for the designers of hardware and software that power this new class of devices and systems."
—Dave Kaeli, Northeastern University

"The Fifth Edition of Computer Organization and Design provides more than an introduction to computer architecture. It prepares the reader for the changes necessary to meet the ever-increasing performance needs of mobile systems and big data processing at a time that difficulties in semiconductor scaling are making all systems power constrained. In this new era for computing, hardware and software must be co-designed and system-level architecture is as critical as component-level optimizations."
—Christos Kozyrakis, Stanford University

"Patterson and Hennessy brilliantly address the issues in ever-changing computer hardware architectures, emphasizing interactions among hardware and software components at various abstraction levels. By interspersing I/O and parallelism concepts with a variety of mechanisms in hardware and software throughout the book, the new edition achieves an excellent holistic presentation of computer architecture for the PostPC era. This book is an essential guide for hardware and software professionals facing energy efficiency and parallelization challenges, from tablet PCs to cloud computing."
—Jae C. Oh, Syracuse University




FIFTH EDITION
Computer Organization and Design
THE HARDWARE/SOFTWARE INTERFACE


David A. Patterson has been teaching computer architecture at the University of California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM and CRA.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. These projects earned three dissertation awards from ACM. His current research projects are Algorithm-Machine-People and Algorithms and Specializers for Provably Optimal Implementations with Resilience and Efficiency. The AMP Lab is developing scalable machine learning algorithms, warehouse-scale-computer-friendly programming models, and crowd-sourcing tools to gain valuable insights quickly from big data in the cloud. The ASPIRE Lab uses deep hardware and software co-tuning to achieve the highest possible performance and energy efficiency for mobile and rack computing systems.

John L. Hennessy is the tenth president of Stanford University, where he has been a member of the faculty since 1977 in the departments of electrical engineering and computer science. Hennessy is a Fellow of the IEEE and ACM; a member of the National Academy of Engineering, the National Academy of Science, and the American Philosophical Society; and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received seven honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems (now MIPS Technologies), which developed one of the first commercial RISC microprocessors. As of 2006, over 2 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups both as an early-stage advisor and an investor.


To Linda,
who has been, is, and always will be the love of my life


ACKNOWLEDGMENTS

Figures 1.7, 1.8 Courtesy of iFixit (www.ifixit.com).
Figure 1.9 Courtesy of Chipworks (www.chipworks.com).
Figure 1.13 Courtesy of Intel.
Figures 1.10.1, 1.10.2, 4.15.2 Courtesy of the Charles Babbage Institute, University of Minnesota Libraries, Minneapolis.
Figures 1.10.3, 4.15.1, 4.15.3, 5.12.3, 6.14.2 Courtesy of IBM.
Figure 1.10.4 Courtesy of Cray Inc.
Figure 1.10.5 Courtesy of Apple Computer, Inc.
Figure 1.10.6 Courtesy of the Computer History Museum.
Figures 5.17.1, 5.17.2 Courtesy of Museum of Science, Boston.
Figure 5.17.4 Courtesy of MIPS Technologies, Inc.
Figure 6.15.1 Courtesy of NASA Ames Research Center.


Preface

The most beautiful thing we can experience is the mysterious. It is the source of all true art and science.
Albert Einstein, What I Believe, 1930

About This Book

We believe that learning in computer science and engineering should reflect the current state of the field, as well as introduce the principles that are shaping computing. We also feel that readers in every specialty of computing need to appreciate the organizational paradigms that determine the capabilities, performance, energy, and, ultimately, the success of computer systems.

Modern computer technology requires professionals of every computing specialty to understand both hardware and software. The interaction between hardware and software at a variety of levels also offers a framework for understanding the fundamentals of computing. Whether your primary interest is hardware or software, computer science or electrical engineering, the central ideas in computer organization and design are the same. Thus, our emphasis in this book is to show the relationship between hardware and software and to focus on the concepts that are the basis for current computers.

The recent switch from uniprocessor to multicore microprocessors confirmed the soundness of this perspective, a perspective we have held since the first edition. While programmers could once ignore that advice and rely on computer architects, compiler writers, and silicon engineers to make their programs run faster or be more energy-efficient without change, that era is over. For programs to run faster, they must become parallel. While the goal of many researchers is to make it possible for programmers to be unaware of the underlying parallel nature of the hardware they are programming, it will take many years to realize this vision. Our view is that for at least the next decade, most programmers are going to have to understand the hardware/software interface if they want programs to run efficiently on parallel computers.

The audience for this book includes those with little experience in assembly language or logic design who need to understand basic computer organization as well as readers with backgrounds in assembly language and/or logic design who want to learn how to design a computer or understand how a system works and why it performs as it does.



About the Other Book

Some readers may be familiar with Computer Architecture: A Quantitative Approach, popularly known as Hennessy and Patterson. (This book in turn is often called Patterson and Hennessy.) Our motivation in writing the earlier book was to describe the principles of computer architecture using solid engineering fundamentals and quantitative cost/performance tradeoffs. We used an approach that combined examples and measurements, based on commercial systems, to create realistic design experiences. Our goal was to demonstrate that computer architecture could be learned using quantitative methodologies instead of a descriptive approach. It was intended for the serious computing professional who wanted a detailed understanding of computers.

A majority of the readers for this book do not plan to become computer architects. The performance and energy efficiency of future software systems will be dramatically affected, however, by how well software designers understand the basic hardware techniques at work in a system. Thus, compiler writers, operating system designers, database programmers, and most other software engineers need a firm grounding in the principles presented in this book. Similarly, hardware designers must understand clearly the effects of their work on software applications.

Thus, we knew that this book had to be much more than a subset of the material in Computer Architecture, and the material was extensively revised to match the different audience. We were so happy with the result that the subsequent editions of Computer Architecture were revised to remove most of the introductory material; hence, there is much less overlap today than with the first editions of both books.

Changes for the Fifth Edition

We had six major goals for the fifth edition of Computer Organization and Design: demonstrate the importance of understanding hardware with a running example; highlight major themes across the topics using margin icons that are introduced early; update examples to reflect the changeover from the PC era to the PostPC era; spread the material on I/O throughout the book rather than isolating it into a single chapter; update the technical content to reflect changes in the industry since the publication of the fourth edition in 2009; and put appendices and optional sections online instead of including a CD to lower costs and to make this edition viable as an electronic book.

Before discussing the goals in detail, let's look at the table on the next page. It shows the hardware and software paths through the material. Chapters 1, 4, 5, and 6 are found on both paths, no matter what the experience or the focus. Chapter 1 discusses the importance of energy and how it motivates the switch from single core to multicore microprocessors and introduces the eight great ideas in computer architecture. Chapter 2 is likely to be review material for the hardware-oriented, but it is essential reading for the software-oriented, especially for those readers interested in learning more about compilers and object-oriented programming languages. Chapter 3 is for readers interested in constructing a datapath or in learning more about floating-point arithmetic. Some will skip parts of Chapter 3, either because they don't need them or because they offer a review. However, we introduce the running example of matrix multiply in this chapter, showing how subword parallelism offers a fourfold improvement, so don't skip Sections 3.6 to 3.8. Chapter 4 explains pipelined processors. Sections 4.1, 4.5, and 4.10 give overviews and Section 4.12 gives the next performance boost for matrix multiply for those with a software focus. Those with a hardware focus, however, will find that this chapter presents core material; they may also, depending on their background, want to read Appendix C on logic design first. The last chapter, on multicores, multiprocessors, and clusters, is mostly new content and should be read by everyone. It was significantly reorganized in this edition to make the flow of ideas more natural and to include much more depth on GPUs, warehouse scale computers, and the hardware-software interface of network interface cards that are key to clusters.

The first of the six goals for this fifth edition was to demonstrate the importance of understanding modern hardware to get good performance and energy efficiency with a concrete example. As mentioned above, we start with subword parallelism in Chapter 3 to improve matrix multiply by a factor of 4. We double performance in Chapter 4 by unrolling the loop to demonstrate the value of instruction level parallelism. Chapter 5 doubles performance again by optimizing for caches using blocking. Finally, Chapter 6 demonstrates a speedup of 14 from 16 processors by using thread-level parallelism. All four optimizations in total add just 24 lines of C code to our initial matrix multiply example.
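To make the running example concrete, here is a minimal sketch in C of the kind of unoptimized matrix multiply that such a sequence of optimizations starts from. It only illustrates the general shape of the computation; the function name dgemm, the column-major indexing, and the exact loop order are assumptions for this sketch, not the book's code.

    /* Unoptimized double-precision matrix multiply: C = C + A * B,
       where A, B, and C are n-by-n matrices stored in column-major order.
       Later chapters speed up this kind of loop nest with subword parallelism,
       loop unrolling, cache blocking, and parallel loops. */
    void dgemm(int n, double *A, double *B, double *C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i + j * n];                 /* cij = C[i][j] */
                for (int k = 0; k < n; ++k)
                    cij += A[i + k * n] * B[k + j * n];    /* cij += A[i][k] * B[k][j] */
                C[i + j * n] = cij;
            }
    }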

The second goal was to help readers separate the forest from the trees by identifying eight great ideas of computer architecture early and then pointing out all the places they occur throughout the rest of the book. We use (hopefully) easy to remember margin icons and highlight the corresponding word in the text to remind readers of these eight themes. There are nearly 100 citations in the book. No chapter has fewer than seven examples of great ideas, and no idea is cited fewer than five times. Performance via parallelism, pipelining, and prediction are the three most popular great ideas, followed closely by Moore's Law. The processor chapter (4) is the one with the most examples, which is not a surprise since it probably received the most attention from computer architects. The one great idea found in every chapter is performance via parallelism, which is a pleasant observation given the recent emphasis on parallelism in the field and in editions of this book.

The third goal was to recognize the generation change in computing from the PC era to the PostPC era in this edition with our examples and material. Thus, Chapter 1 dives into the guts of a tablet computer rather than a PC, and Chapter 6 describes the computing infrastructure of the cloud. We also feature the ARM, which is the instruction set of choice in the personal mobile devices of the PostPC era, as well as the x86 instruction set that dominated the PC era and (so far) dominates cloud computing.

The fourth goal was to spread the I/O material throughout the book rather than have it in its own chapter, much as we spread parallelism throughout all the chapters in the fourth edition. Hence, I/O material in this edition can be found in Sections 1.4, 4.9, 5.2, 5.5, 5.11, and 6.9. The thought is that readers (and instructors) are more likely to cover I/O if it's not segregated to its own chapter.

This is a fast-moving field, and, as is always the case for our new editions, an important goal is to update the technical content. The running example is the ARM Cortex A8 and the Intel Core i7, reflecting our PostPC era. Other highlights include an overview of the new 64-bit instruction set of ARMv8, a tutorial on GPUs that explains their unique terminology, more depth on the warehouse scale computers that make up the cloud, and a deep dive into 10 Gigabit Ethernet cards.

To keep the main book short and compatible with electronic books, we placed the optional material as online appendices instead of on a companion CD as in prior editions.

Finally, we updated all the exercises in the book.

While some elements changed, we have preserved useful book elements from prior editions. To make the book work better as a reference, we still place definitions of new terms in the margins at their first occurrence. The book element called "Understanding Program Performance" helps readers understand the performance of their programs and how to improve it, just as the "Hardware/Software Interface" book element helped readers understand the tradeoffs at this interface. "The Big Picture" section remains so that the reader sees the forest despite all the trees. "Check Yourself" sections help readers to confirm their comprehension of the material on the first time through, with answers provided at the end of each chapter. This edition still includes the green MIPS reference card, which was inspired by the "Green Card" of the IBM System/360. This card has been updated and should be a handy reference when writing MIPS assembly language programs.

Instructor Support

We have collected a great deal of material to help instructors teach courses using this book. Solutions to exercises, figures from the book, lecture slides, and other materials are available to adopters from the publisher. Check the publisher's Web site for more information:

textbooks.elsevier.com/9780124077263

Concluding Remarks

If you read the following acknowledgments section, you will see that we went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining, resilient bugs, please contact the publisher by electronic mail at cod5bugs@mkp.com or by low-tech mail using the address found on the copyright page.

This edition is the second break in the long-standing collaboration between Hennessy and Patterson, which started in 1989. The demands of running one of the world's great universities meant that President Hennessy could no longer make the substantial commitment to create a new edition. The remaining author felt once again like a tightrope walker without a safety net. Hence, the people in the acknowledgments and Berkeley colleagues played an even larger role in shaping the contents of this book. Nevertheless, this time around there is only one author to blame for the new material in what you are about to read.

Acknowledgments for the Fifth Edition

With every edition of this book, we are very fortunate to receive help from many readers, reviewers, and contributors. Each of these people has helped to make this book better.

Chapter 6 was so extensively revised that we did a separate review for ideas and contents, and I made changes based on the feedback from every reviewer. I'd like to thank Christos Kozyrakis of Stanford University for suggesting using the network interface for clusters to demonstrate the hardware-software interface of I/O and for suggestions on organizing the rest of the chapter; Mario Flagsilk of Stanford University for providing details, diagrams, and performance measurements of the NetFPGA NIC; and the following for suggestions on how to improve the chapter: David Kaeli of Northeastern University, Partha Ranganathan of HP Labs, David Wood of the University of Wisconsin, and my Berkeley colleagues Siamak Faridani, Shoaib Kamil, Yunsup Lee, Zhangxi Tan, and Andrew Waterman.

Special thanks goes to Rimas Avizenis of UC Berkeley, who developed the various versions of matrix multiply and supplied the performance numbers as well. As I worked with his father while I was a graduate student at UCLA, it was a nice symmetry to work with Rimas at UCB.

I also wish to thank my longtime collaborator Randy Katz of UC Berkeley, who helped develop the concept of great ideas in computer architecture as part of the extensive revision of an undergraduate class that we did together.

I'd like to thank David Kirk, John Nickolls, and their colleagues at NVIDIA (Michael Garland, John Montrym, Doug Voorhies, Lars Nyland, Erik Lindholm, Paulius Micikevicius, Massimiliano Fatica, Stuart Oberman, and Vasily Volkov) for writing the first in-depth appendix on GPUs. I'd like to express again my appreciation to Jim Larus, recently named Dean of the School of Computer and Communications Science at EPFL, for his willingness in contributing his expertise on assembly language programming, as well as for welcoming readers of this book with regard to using the simulator he developed and maintains.

I am also very grateful to Jason Bakos of the University of South Carolina, who updated and created new exercises for this edition, working from originals prepared for the fourth edition by Perry Alexander (The University of Kansas); Javier Bruguera (Universidade de Santiago de Compostela); Matthew Farrens (University of California, Davis); David Kaeli (Northeastern University); Nicole Kaiyan (University of Adelaide); John Oliver (Cal Poly, San Luis Obispo); Milos Prvulovic (Georgia Tech); and Jichuan Chang, Jacob Leverich, Kevin Lim, and Partha Ranganathan (all from Hewlett-Packard).

Additional thanks goes to Jason Bakos for developing the new lecture slides.


I am grateful to the many instructors who have answered the publisher's surveys, reviewed our proposals, and attended focus groups to analyze and respond to our plans for this edition. They include the following individuals: Focus Groups in 2012: Bruce Barton (Suffolk County Community College), Jeff Braun (Montana Tech), Ed Gehringer (North Carolina State), Michael Goldweber (Xavier University), Ed Harcourt (St. Lawrence University), Mark Hill (University of Wisconsin, Madison), Patrick Homer (University of Arizona), Norm Jouppi (HP Labs), Dave Kaeli (Northeastern University), Christos Kozyrakis (Stanford University), Zachary Kurmas (Grand Valley State University), Jae C. Oh (Syracuse University), Lu Peng (LSU), Milos Prvulovic (Georgia Tech), Partha Ranganathan (HP Labs), David Wood (University of Wisconsin), Craig Zilles (University of Illinois at Urbana-Champaign). Surveys and Reviews: Mahmoud Abou-Nasr (Wayne State University), Perry Alexander (The University of Kansas), Hakan Aydin (George Mason University), Hussein Badr (State University of New York at Stony Brook), Mac Baker (Virginia Military Institute), Ron Barnes (George Mason University), Douglas Blough (Georgia Institute of Technology), Kevin Bolding (Seattle Pacific University), Miodrag Bolic (University of Ottawa), John Bonomo (Westminster College), Jeff Braun (Montana Tech), Tom Briggs (Shippensburg University), Scott Burgess (Humboldt State University), Fazli Can (Bilkent University), Warren R. Carithers (Rochester Institute of Technology), Bruce Carlton (Mesa Community College), Nicholas Carter (University of Illinois at Urbana-Champaign), Anthony Cocchi (The City University of New York), Don Cooley (Utah State University), Robert D. Cupper (Allegheny College), Edward W. Davis (North Carolina State University), Nathaniel J. Davis (Air Force Institute of Technology), Molisa Derk (Oklahoma City University), Derek Eager (University of Saskatchewan), Ernest Ferguson (Northwest Missouri State University), Rhonda Kay Gaede (The University of Alabama), Etienne M. Gagnon (UQAM), Costa Gerousis (Christopher Newport University), Paul Gillard (Memorial University of Newfoundland), Michael Goldweber (Xavier University), Georgia Grant (College of San Mateo), Merrill Hall (The Master's College), Tyson Hall (Southern Adventist University), Ed Harcourt (St. Lawrence University), Justin E. Harlow (University of South Florida), Paul F. Hemler (Hampden-Sydney College), Martin Herbordt (Boston University), Steve J. Hodges (Cabrillo College), Kenneth Hopkinson (Cornell University), Dalton Hunkins (St. Bonaventure University), Baback Izadi (State University of New York—New Paltz), Reza Jafari, Robert W. Johnson (Colorado Technical University), Bharat Joshi (University of North Carolina, Charlotte), Nagarajan Kandasamy (Drexel University), Rajiv Kapadia, Ryan Kastner (University of California, Santa Barbara), E.J. Kim (Texas A&M University), Jihong Kim (Seoul National University), Jim Kirk (Union University), Geoffrey S. Knauth (Lycoming College), Manish M. Kochhal (Wayne State), Suzan Koknar-Tezel (Saint Joseph's University), Angkul Kongmunvattana (Columbus State University), April Kontostathis (Ursinus College), Christos Kozyrakis (Stanford University), Danny Krizanc (Wesleyan University), Ashok Kumar, S. Kumar (The University of Texas), Zachary Kurmas (Grand Valley State University), Robert N. Lea (University of Houston), Baoxin Li (Arizona State University), Li Liao (University of Delaware), Gary Livingston

A special thanks also goes to Mark Smotherman for making multiple passes to find technical and writing glitches that significantly improved the quality of this edition.

We wish to thank the extended Morgan Kaufmann family for agreeing to publish this book again under the able leadership of Todd Green and Nate McFadden: I certainly couldn't have completed the book without them. We also want to extend thanks to Lisa Jones, who managed the book production process, and Russell Purdy, who did the cover design. The new cover cleverly connects the PostPC era content of this edition to the cover of the first edition.

The contributions of the nearly 150 people we mentioned here have helped make this fifth edition what I hope will be our best book yet. Enjoy!

David A. Patterson




1
Computer Abstractions and Technology

Civilization advances by extending the number of important operations which we can perform without thinking about them.
Alfred North Whitehead, An Introduction to Mathematics, 1911

1.1 Introduction 3
1.2 Eight Great Ideas in Computer Architecture 11
1.3 Below Your Program 13
1.4 Under the Covers 16
1.5 Technologies for Building Processors and Memory 24

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.



Computers have led to a third revolution for civilization, with the information revolution taking its place alongside the agricultural and the industrial revolutions. The resulting multiplication of humankind's intellectual strength and reach naturally has affected our everyday lives profoundly and changed the ways in which the search for new knowledge is carried out. There is now a new vein of scientific investigation, with computational scientists joining theoretical and experimental scientists in the exploration of new frontiers in astronomy, biology, chemistry, and physics, among others.

The computer revolution continues. Each time the cost of computing improves by another factor of 10, the opportunities for computers multiply. Applications that were economically infeasible suddenly become practical. In the recent past, the following applications were "computer science fiction."

■ Computers in automobiles: Until microprocessors improved dramatically in price and performance in the early 1980s, computer control of cars was ludicrous. Today, computers reduce pollution, improve fuel efficiency via engine controls, and increase safety through blind spot warnings, lane departure warnings, moving object detection, and air bag inflation to protect occupants in a crash.

■ Cell phones: Who would have dreamed that advances in computer systems would lead to more than half of the planet having mobile phones, allowing person-to-person communication to almost anyone anywhere in the world?

■ Human genome project: The cost of computer equipment to map and analyze human DNA sequences was hundreds of millions of dollars. It's unlikely that anyone would have considered this project had the computer costs been 10 to 100 times higher, as they would have been 15 to 25 years earlier. Moreover, costs continue to drop; you will soon be able to acquire your own genome, allowing medical care to be tailored to you.

■ World Wide Web: Not in existence at the time of the first edition of this book, the web has transformed our society. For many, the web has replaced libraries and newspapers.

■ Search engines: As the content of the web grew in size and in value, finding relevant information became increasingly important. Today, many people rely on search engines for such a large part of their lives that it would be a hardship to go without them.

Clearly, advances in this technology now affect almost every aspect of our society. Hardware advances have allowed programmers to create wonderfully useful software, which explains why computers are omnipresent. Today's science fiction suggests tomorrow's killer applications: already on their way are glasses that augment reality, the cashless society, and cars that can drive themselves.



Classes of Computing Applications and Their Characteristics

Although a common set of hardware technologies (see Sections 1.4 and 1.5) is used in computers ranging from smart home appliances to cell phones to the largest supercomputers, these different applications have different design requirements and employ the core hardware technologies in different ways. Broadly speaking, computers are used in three different classes of applications.

Personal computers (PCs) are possibly the best known form of computing, which readers of this book have likely used extensively. Personal computers emphasize delivery of good performance to single users at low cost and usually execute third-party software. This class of computing drove the evolution of many computing technologies, which is only about 35 years old!

Servers are the modern form of what were once much larger computers, and are usually accessed only via a network. Servers are oriented to carrying large workloads, which may consist of either single complex applications—usually a scientific or engineering application—or handling many small jobs, such as would occur in building a large web server. These applications are usually based on software from another source (such as a database or simulation system), but are often modified or customized for a particular function. Servers are built from the same basic technology as desktop computers, but provide for greater computing, storage, and input/output capacity. In general, servers also place a greater emphasis on dependability, since a crash is usually more costly than it would be on a single-user PC.

Servers span the widest range in cost and capability. At the low end, a server may be little more than a desktop computer without a screen or keyboard and cost a thousand dollars. These low-end servers are typically used for file storage, small business applications, or simple web serving (see Section 6.10). At the other extreme are supercomputers, which at the present consist of tens of thousands of processors and many terabytes of memory, and cost tens to hundreds of millions of dollars. Supercomputers are usually used for high-end scientific and engineering calculations, such as weather forecasting, oil exploration, protein structure determination, and other large-scale problems. Although such supercomputers represent the peak of computing capability, they represent a relatively small fraction of the servers and a relatively small fraction of the overall computer market in terms of total revenue.

Embedded computers are the largest class of computers and span the widest range of applications and performance. Embedded computers include the microprocessors found in your car, the computers in a television set, and the networks of processors that control a modern airplane or cargo ship. Embedded computing systems are designed to run one application or one set of related applications that are normally integrated with the hardware and delivered as a single system; thus, despite the large number of embedded computers, most users never really see that they are using a computer!

personal computer (PC) A computer designed for use by an individual, usually incorporating a graphics display, a keyboard, and a mouse.

server A computer used for running larger programs for multiple users, often simultaneously, and typically accessed only via a network.

supercomputer A class of computers with the highest performance and cost; they are configured as servers and typically cost tens to hundreds of millions of dollars.

terabyte (TB) Originally 1,099,511,627,776 (2^40) bytes, although communications and secondary storage systems developers started using the term to mean 1,000,000,000,000 (10^12) bytes. To reduce confusion, we now use the term tebibyte (TiB) for 2^40 bytes, defining terabyte (TB) to mean 10^12 bytes. Figure 1.1 shows the full range of decimal and binary values and names.
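As a quick check of the size difference: 2^40 = 1,099,511,627,776 ≈ 1.0995 × 10^12, so a binary "terabyte" is nearly 10% larger than the decimal 10^12 bytes, and the gap grows with each larger prefix.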

embedded computer A computer inside another device used for running one predetermined application or collection of software.



multicore microprocessor A microprocessor containing multiple processors ("cores") in a single integrated circuit.

In the last decade, advances in computer design and memory technology have greatly reduced the importance of small memory size in most applications other than those in embedded computing systems.

Programmers interested in performance now need to understand the issues that have replaced the simple memory model of the 1960s: the parallel nature of processors and the hierarchical nature of memories. Moreover, as we explain in Section 1.7, today's programmers need to worry about energy efficiency of their programs running either on the PMD or in the Cloud, which also requires understanding what is below your code. Programmers who seek to build competitive versions of software will therefore need to increase their knowledge of computer organization.

We are honored to have the opportunity to explain what's inside this revolutionary machine, unraveling the software below your program and the hardware under the covers of your computer. By the time you complete this book, we believe you will be able to answer the following questions:

■ How are programs written in a high-level language, such as C or Java, translated into the language of the hardware, and how does the hardware execute the resulting program? Comprehending these concepts forms the basis of understanding the aspects of both the hardware and software that affect program performance.

■ What is the interface between the software and the hardware, and how does software instruct the hardware to perform needed functions? These concepts are vital to understanding how to write many kinds of software.

■ What determines the performance of a program, and how can a programmer improve the performance? As we will see, this depends on the original program, the software translation of that program into the computer's language, and the effectiveness of the hardware in executing the program.

■ What techniques can be used by hardware designers to improve performance? This book will introduce the basic concepts of modern computer design. The interested reader will find much more material on this topic in our advanced book, Computer Architecture: A Quantitative Approach.

■ What techniques can be used by hardware designers to improve energy efficiency? What can the programmer do to help or hinder energy efficiency?

■ What are the reasons for and the consequences of the recent switch from sequential processing to parallel processing? This book gives the motivation, describes the current hardware mechanisms to support parallelism, and surveys the new generation of "multicore" microprocessors (see Chapter 6).

■ Since the first commercial computer in 1951, what great ideas did computer architects come up with that lay the foundation of modern computing?



To demonstrate the impact of the ideas in this book, we improve the performance of a C program that multiplies a matrix times a vector in a sequence of chapters. Each step leverages understanding how the underlying hardware really works in a modern microprocessor to improve performance by a factor of 200!

■ In the category of data level parallelism, in Chapter 3 we use subword parallelism via C intrinsics to increase performance by a factor of 3.8.

■ In the category of instruction level parallelism, in Chapter 4 we use loop unrolling to exploit multiple instruction issue and out-of-order execution hardware to increase performance by another factor of 2.3.

■ In the category of memory hierarchy optimization, in Chapter 5 we use cache blocking to increase performance on large matrices by another factor of 2.5.

■ In the category of thread level parallelism, in Chapter 6 we use parallel for loops in OpenMP to exploit multicore hardware to increase performance by another factor of 14 (see the sketch after this list).
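As a taste of that last step, here is a minimal sketch, assuming a C compiler with OpenMP support, of how a parallel for pragma spreads the outer loop of a matrix multiply across cores. It shows the style of change involved rather than the book's Chapter 6 code; the function name and indexing are assumptions for this sketch.

    #include <omp.h>   /* OpenMP header; the pragma below does the real work */

    /* Parallel matrix multiply sketch: iterations of the outermost loop are
       divided among the available cores, so each thread works on its own
       values of i and no two threads write the same element of C. */
    void dgemm_parallel(int n, double *A, double *B, double *C)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i + j * n];
                for (int k = 0; k < n; ++k)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }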

Check Yourself

Check Yourself sections are designed to help readers assess whether they comprehend the major concepts introduced in a chapter and understand the implications of those concepts. Some Check Yourself questions have simple answers; others are for discussion among a group. Answers to the specific questions can be found at the end of the chapter. Check Yourself questions appear only at the end of a section, making it easy to skip them if you are sure you understand the material.

1. The number of embedded processors sold every year greatly outnumbers the number of PC and even PostPC processors. Can you confirm or deny this insight based on your own experience? Try to count the number of embedded processors in your home. How does it compare with the number of conventional computers in your home?

2. As mentioned earlier, both the software and hardware affect the performance of a program. Can you think of examples where each of the following is the right place to look for a performance bottleneck?

■ The algorithm chosen
■ The programming language or compiler
■ The operating system
■ The processor
■ The I/O system and devices

■ The I/O system <strong>and</strong> devices



compiler A program that translates high-level language statements into assembly language statements.

binary digit Also called a bit. One of the two numbers in base 2 (0 or 1) that are the components of information.

instruction A command that computer hardware understands and obeys.

assembler A program that translates a symbolic version of instructions into the binary version.

assembly language A symbolic representation of machine instructions.

machine language A binary representation of machine instructions.

Compilers perform another vital function: the translation of a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute. Given the sophistication of modern programming languages and the simplicity of the instructions executed by the hardware, the translation from a high-level language program to hardware instructions is complex. We give a brief overview of the process here and then go into more depth in Chapter 2 and in Appendix A.

From a High-Level Language to the Language of Hardware

To actually speak to electronic hardware, you need to send electrical signals. The easiest signals for computers to understand are on and off, and so the computer alphabet is just two letters. Just as the 26 letters of the English alphabet do not limit how much can be written, the two letters of the computer alphabet do not limit what computers can do. The two symbols for these two letters are the numbers 0 and 1, and we commonly think of the computer language as numbers in base 2, or binary numbers. We refer to each "letter" as a binary digit or bit. Computers are slaves to our commands, which are called instructions. Instructions, which are just collections of bits that the computer understands and obeys, can be thought of as numbers. For example, the bits

1000110010100000

tell one computer to add two numbers. Chapter 2 explains why we use numbers for instructions and data; we don't want to steal that chapter's thunder, but using numbers for both instructions and data is a foundation of computing.

The first programmers communicated to computers in binary numbers, but this was so tedious that they quickly invented new notations that were closer to the way humans think. At first, these notations were translated to binary by hand, but this process was still tiresome. Using the computer to help program the computer, the pioneers invented programs to translate from symbolic notation to binary. The first of these programs was named an assembler. This program translates a symbolic version of an instruction into the binary version. For example, the programmer would write

add A,B

and the assembler would translate this notation into

1000110010100000

This instruction tells the computer to add the two numbers A and B. The name coined for this symbolic language, still used today, is assembly language. In contrast, the binary language that the machine understands is the machine language.

Although a tremendous improvement, assembly language is still far from the notations a scientist might like to use to simulate fluid flow or that an accountant might use to balance the books. Assembly language requires the programmer to write one line for every instruction that the computer will follow, forcing the programmer to think like the computer.
programmer to think like the computer.



A compiler enables a programmer to write this high-level language expression:

A + B

The compiler would compile it into this assembly language statement:

add A,B

As shown above, the assembler would translate this statement into the binary instructions that tell the computer to add the two numbers A and B.
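To make the chain concrete, here is a small illustration in C; the MIPS-style assembly in the comment is one plausible translation a compiler could produce, shown as an assumption for this sketch rather than the output of any particular compiler.

    /* A one-statement high-level-language computation. */
    int sum(int a, int b)
    {
        return a + b;   /* A compiler might translate this body into MIPS-style
                           assembly such as:
                               add $v0, $a0, $a1   # result = a + b
                               jr  $ra             # return to the caller
                           and an assembler would then turn each line into a
                           binary machine instruction like the one shown above. */
    }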

High-level programming languages offer several important benefits. First, they allow the programmer to think in a more natural language, using English words and algebraic notation, resulting in programs that look much more like text than like tables of cryptic symbols (see Figure 1.4). Moreover, they allow languages to be designed according to their intended use. Hence, Fortran was designed for scientific computation, Cobol for business data processing, Lisp for symbol manipulation, and so on. There are also domain-specific languages for even narrower groups of users, such as those interested in simulation of fluids, for example.

The second advantage of programming languages is improved programmer productivity. One of the few areas of widespread agreement in software development is that it takes less time to develop programs when they are written in languages that require fewer lines to express an idea. Conciseness is a clear advantage of high-level languages over assembly language.

The final advantage is that programming languages allow programs to be independent of the computer on which they were developed, since compilers and assemblers can translate high-level language programs to the binary instructions of any computer. These three advantages are so strong that today little programming is done in assembly language.

1.4 Under the Covers

input device A mechanism through which the computer is fed information, such as a keyboard.

output device A mechanism that conveys the result of a computation to a user, such as a display, or to another computer.

Now that we have looked below your program to uncover the underlying software, let’s open the covers of your computer to learn about the underlying hardware. The underlying hardware in any computer performs the same basic functions: inputting data, outputting data, processing data, and storing data. How these functions are performed is the primary topic of this book, and subsequent chapters deal with different parts of these four tasks.

When we come to an important point in this book, a point so important that we hope you will remember it forever, we emphasize it by identifying it as a Big Picture item. We have about a dozen Big Pictures in this book, the first being the five components of a computer that perform the tasks of inputting, outputting, processing, and storing data.

Two key components of computers are input devices, such as the microphone, and output devices, such as the speaker. As the names suggest, input feeds the computer, and output is the result of computation sent to the user. Some devices, such as wireless networks, provide both input and output to the computer.

Chapters 5 and 6 describe input/output (I/O) devices in more detail, but let’s take an introductory tour through the computer hardware, starting with the external I/O devices.

The BIG Picture

The five classic components of a computer are input, output, memory, datapath, and control, with the last two sometimes combined and called the processor. Figure 1.5 shows the standard organization of a computer. This organization is independent of hardware technology: you can place every piece of every computer, past and present, into one of these five categories. To help you keep all this in perspective, the five components of a computer are shown on the front page of each of the following chapters, with the portion of interest to that chapter highlighted.

FIGURE 1.5 The organization of a computer, showing the five classic components. The processor gets instructions and data from memory. Input writes data to memory, and output reads data from memory. Control sends the signals that determine the operations of the datapath, memory, input, and output.



liquid crystal display A display technology using a thin layer of liquid polymers that can be used to transmit or block light according to whether a charge is applied.

active matrix display A liquid crystal display using a transistor to control the transmission of light at each individual pixel.

pixel The smallest individual picture element. Screens are composed of hundreds of thousands to millions of pixels, organized in a matrix.

Through computer displays I have landed an airplane on the deck of a moving carrier, observed a nuclear particle hit a potential well, flown in a rocket at nearly the speed of light and watched a computer reveal its innermost workings.

Ivan Sutherland, the “father” of computer graphics, Scientific American, 1984

Through the Looking Glass

The most fascinating I/O device is probably the graphics display. Most personal mobile devices use liquid crystal displays (LCDs) to get a thin, low-power display. The LCD is not the source of light; instead, it controls the transmission of light. A typical LCD includes rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display, from either a light source behind the display or less often from reflected light. The rods straighten out when a current is applied and no longer bend the light. Since the liquid crystal material is between two screens polarized at 90 degrees, the light cannot pass through unless it is bent. Today, most LCD displays use an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images. A red-green-blue mask associated with each dot on the display determines the intensity of the three-color components in the final image; in a color active matrix LCD, there are three transistor switches at each point.

The image is composed of a matrix of picture elements, or pixels, which can be represented as a matrix of bits, called a bit map. Depending on the size of the screen and the resolution, the display matrix in a typical tablet ranges in size from 1024 × 768 to 2048 × 1536. A color display might use 8 bits for each of the three colors (red, blue, and green), for 24 bits per pixel, permitting millions of different colors to be displayed.

The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics display at the refresh rate. Figure 1.6 shows a frame buffer with a simplified design of just 4 bits per pixel.

The goal of the bit map is to faithfully represent what is on the screen. The challenges in graphics systems arise because the human eye is very good at detecting even subtle changes on the screen.

FIGURE 1.6 Each coordinate in the frame buffer on the left determines the shade of the corresponding coordinate for the raster scan CRT display on the right. Pixel (X0, Y0) contains the bit pattern 0011, which is a lighter shade on the screen than the bit pattern 1101 in pixel (X1, Y1).
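As a rough sketch of the idea, and not the programming interface of any real graphics hardware, a frame buffer can be modeled as a two-dimensional array holding one small bit pattern per pixel; a display refresh simply reads the array out. The 4-bit shades mirror the simplified design of Figure 1.6.

# Minimal model of a frame buffer: one 4-bit shade per pixel.
WIDTH, HEIGHT, BITS_PER_PIXEL = 4, 4, 4
frame_buffer = [[0] * WIDTH for _ in range(HEIGHT)]   # all pixels start at shade 0

def set_pixel(x, y, shade):
    """Store a 4-bit shade (0..15) for pixel (x, y)."""
    assert 0 <= shade < (1 << BITS_PER_PIXEL)
    frame_buffer[y][x] = shade

# The two pixels of Figure 1.6: (X0, Y0) holds 0011 and (X1, Y1) holds 1101.
set_pixel(0, 0, 0b0011)
set_pixel(1, 1, 0b1101)

def refresh():
    """Read every pixel's bit pattern out, as a display refresh would."""
    for row in frame_buffer:
        print(" ".join(format(shade, "04b") for shade in row))

refresh()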



Touchscreen

While PCs also use LCD displays, the tablets and smartphones of the PostPC era have replaced the keyboard and mouse with touch-sensitive displays, which have the wonderful user interface advantage of users pointing directly at what they are interested in rather than indirectly with a mouse.

While there are a variety of ways to implement a touch screen, many tablets today use capacitive sensing. Since people are electrical conductors, if an insulator like glass is covered with a transparent conductor, touching distorts the electrostatic field of the screen, which results in a change in capacitance. This technology can allow multiple touches simultaneously, which allows gestures that can lead to attractive user interfaces.

Opening the Box

Figure 1.7 shows the contents of the Apple iPad 2 tablet computer. Unsurprisingly, of the five classic components of the computer, I/O dominates this reading device. The list of I/O devices includes a capacitive multitouch LCD display, front-facing camera, rear-facing camera, microphone, headphone jack, speakers, accelerometer, gyroscope, Wi-Fi network, and Bluetooth network. The datapath, control, and memory are a tiny portion of the components.

The small rectangles in Figure 1.8 contain the devices that drive our advancing technology, called integrated circuits and nicknamed chips. The A5 package seen in the middle of Figure 1.8 contains two ARM processors that operate with a clock rate of 1 GHz. The processor is the active part of the computer, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. Occasionally, people call the processor the CPU, for the more bureaucratic-sounding central processor unit.

Descending even lower into the hardware, Figure 1.9 reveals details of a microprocessor. The processor logically comprises two main components: datapath and control, the respective brawn and brain of the processor. The datapath performs the arithmetic operations, and control tells the datapath, memory, and I/O devices what to do according to the wishes of the instructions of the program. Chapter 4 explains the datapath and control for a higher-performance design.

The A5 package in Figure 1.8 also includes two memory chips, each with 2 gibibits of capacity, thereby supplying 512 MiB. The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. The memory is built from DRAM chips. DRAM stands for dynamic random access memory. Multiple DRAMs are used together to contain the instructions and data of a program. In contrast to sequential access memories, such as magnetic tapes, the RAM portion of the term DRAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.

Descending into the depths of any component of the hardware reveals insights into the computer. Inside the processor is another type of memory—cache memory.

integrated circuit Also called a chip. A device combining dozens to millions of transistors.

central processor unit (CPU) Also called processor. The active part of the computer, which contains the datapath and control and which adds numbers, tests numbers, signals I/O devices to activate, and so on.

datapath The component of the processor that performs arithmetic operations.

control The component of the processor that commands the datapath, memory, and I/O devices according to the instructions of the program.

memory The storage area in which programs are kept when they are running and that contains the data needed by the running programs.

dynamic random access memory (DRAM) Memory built as an integrated circuit; it provides random access to any location. Access times are 50 nanoseconds and cost per gigabyte in 2012 was $5 to $10.



FIGURE 1.7 Components of the Apple iPad 2 A1395. The metal back of the iPad (with the reversed Apple logo in the middle) is in the center. At the top is the capacitive multitouch screen and LCD display. To the far right is the 3.8 V, 25 watt-hour, polymer battery, which consists of three Li-ion cell cases and offers 10 hours of battery life. To the far left is the metal frame that attaches the LCD to the back of the iPad. The small components surrounding the metal back in the center are what we think of as the computer; they are often L-shaped to fit compactly inside the case next to the battery. Figure 1.8 shows a close-up of the L-shaped board to the lower left of the metal case, which is the logic printed circuit board that contains the processor and the memory. The tiny rectangle below the logic board contains a chip that provides wireless communication: Wi-Fi, Bluetooth, and FM tuner. It fits into a small slot in the lower left corner of the logic board. Near the upper left corner of the case is another L-shaped component, which is a front-facing camera assembly that includes the camera, headphone jack, and microphone. Near the right upper corner of the case is the board containing the volume control and silent/screen rotation lock button along with a gyroscope and accelerometer. These last two chips combine to allow the iPad to recognize 6-axis motion. The tiny rectangle next to it is the rear-facing camera. Near the bottom right of the case is the L-shaped speaker assembly. The cable at the bottom is the connector between the logic board and the camera/volume control board. The board between the cable and the speaker assembly is the controller for the capacitive touchscreen. (Courtesy iFixit, www.ifixit.com)

FIGURE 1.8 The logic board of the Apple iPad 2 in Figure 1.7. The photo highlights five integrated circuits. The large integrated circuit in the middle is the Apple A5 chip, which contains dual ARM processor cores that run at 1 GHz as well as 512 MB of main memory inside the package. Figure 1.9 shows a photograph of the processor chip inside the A5 package. The similar-sized chip to the left is the 32 GB flash memory chip for non-volatile storage. There is an empty space between the two chips where a second flash chip can be installed to double the storage capacity of the iPad. The chips to the right of the A5 include the power controller and I/O controller chips. (Courtesy iFixit, www.ifixit.com)



cache memory A small, fast memory that acts as a buffer for a slower, larger memory.

FIGURE 1.9 The processor integrated circuit inside the A5 package. The size of the chip is 12.1 by 10.1 mm, and it was manufactured originally in a 45-nm process (see Section 1.5). It has two identical ARM processors or cores in the middle left of the chip and a PowerVR graphical processor unit (GPU) with four datapaths in the upper left quadrant. To the left and bottom side of the ARM cores are interfaces to main memory (DRAM). (Courtesy Chipworks, www.chipworks.com)

static random access memory (SRAM) Also memory built as an integrated circuit, but faster and less dense than DRAM.

Cache memory consists of a small, fast memory that acts as a buffer for the DRAM memory. (The nontechnical definition of cache is a safe place for hiding things.) Cache is built using a different memory technology, static random access memory (SRAM). SRAM is faster but less dense, and hence more expensive, than DRAM (see Chapter 5). SRAM and DRAM are two layers of the memory hierarchy.



To distinguish between the volatile memory used to hold data and programs while they are running and this nonvolatile memory used to store data and programs between runs, the term main memory or primary memory is used for the former, and secondary memory for the latter. Secondary memory forms the next lower layer of the memory hierarchy. DRAMs have dominated main memory since 1975, but magnetic disks dominated secondary memory starting even earlier. Because of their size and form factor, personal mobile devices use flash memory, a nonvolatile semiconductor memory, instead of disks. Figure 1.8 shows the chip containing the flash memory of the iPad 2. While slower than DRAM, it is much cheaper than DRAM in addition to being nonvolatile. Although costing more per bit than disks, it is smaller, it comes in much smaller capacities, it is more rugged, and it is more power efficient than disks. Hence, flash memory is the standard secondary memory for PMDs. Alas, unlike disks and DRAM, flash memory bits wear out after 100,000 to 1,000,000 writes. Thus, file systems must keep track of the number of writes and have a strategy to avoid wearing out storage, such as by moving popular data. Chapter 5 describes disks and flash memory in more detail.
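As a purely illustrative sketch of the bookkeeping involved (real flash translation layers are far more elaborate), one simple wear-leveling policy keeps a write count per block and directs each new write to the least-worn block:

# Toy wear-leveling idea: track writes per flash block and always write
# to the block that has been written the fewest times so far.
NUM_BLOCKS = 8
write_counts = [0] * NUM_BLOCKS

def choose_block():
    """Pick the least-worn block for the next write."""
    return min(range(NUM_BLOCKS), key=lambda b: write_counts[b])

def write_block(data):
    block = choose_block()
    write_counts[block] += 1
    # ... the data would be programmed into `block` here ...
    return block

for _ in range(20):
    write_block(b"payload")
print(write_counts)   # writes end up spread roughly evenly across the blocks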

Communicating with Other Computers

We’ve explained how we can input, compute, display, and save data, but there is still one missing item found in today’s computers: computer networks. Just as the processor shown in Figure 1.5 is connected to memory and I/O devices, networks interconnect whole computers, allowing computer users to extend the power of computing by including communication. Networks have become so popular that they are the backbone of current computer systems; a new personal mobile device or server without a network interface would be ridiculed. Networked computers have several major advantages:

■ Communication: Information is exchanged between computers at high speeds.

■ Resource sharing: Rather than each computer having its own I/O devices, computers on the network can share I/O devices.

■ Nonlocal access: By connecting computers over long distances, users need not be near the computer they are using.

main memory Also called primary memory. Memory used to hold programs while they are running; typically consists of DRAM in today’s computers.

secondary memory Nonvolatile memory used to store programs and data between runs; typically consists of flash memory in PMDs and magnetic disks in servers.

magnetic disk Also called hard disk. A form of nonvolatile secondary memory composed of rotating platters coated with a magnetic recording material. Because they are rotating mechanical devices, access times are about 5 to 20 milliseconds and cost per gigabyte in 2012 was $0.05 to $0.10.

flash memory A nonvolatile semiconductor memory. It is cheaper and slower than DRAM but more expensive per bit and faster than magnetic disks. Access times are about 5 to 50 microseconds and cost per gigabyte in 2012 was $0.75 to $1.00.

local area network (LAN) A network designed to carry data within a geographically confined area, typically within a single building.

wide area network (WAN) A network extended over hundreds of kilometers that can span a continent.

Networks vary in length and performance, with the cost of communication increasing according to both the speed of communication and the distance that information travels. Perhaps the most popular type of network is Ethernet. It can be up to a kilometer long and transfer at up to 40 gigabits per second. Its length and speed make Ethernet useful to connect computers on the same floor of a building; hence, it is an example of what is generically called a local area network. Local area networks are interconnected with switches that can also provide routing services and security. Wide area networks cross continents and are the backbone of the Internet, which supports the web. They are typically based on optical fibers and are leased from telecommunication companies.

Networks have changed the face of computing in the last 30 years, both by becoming much more ubiquitous and by making dramatic increases in performance. In the 1970s, very few individuals had access to electronic mail, the Internet and web did not exist, and physically mailing magnetic tapes was the primary way to transfer large amounts of data between two locations. Local area networks were almost nonexistent, and the few existing wide area networks had limited capacity and restricted access.

As networking technology improved, it became much cheaper and had a much higher capacity. For example, the first standardized local area network technology, developed about 30 years ago, was a version of Ethernet that had a maximum capacity (also called bandwidth) of 10 million bits per second, typically shared by tens of, if not a hundred, computers. Today, local area network technology offers a capacity of from 1 to 40 gigabits per second, usually shared by at most a few computers. Optical communications technology has allowed similar growth in the capacity of wide area networks, from hundreds of kilobits to gigabits and from hundreds of computers connected to a worldwide network to millions of computers connected. This combination of dramatic rise in deployment of networking combined with increases in capacity have made network technology central to the information revolution of the last 30 years.

For the last decade another innovation in networking is reshaping the way computers communicate. Wireless technology is widespread, which enabled the PostPC Era. The ability to make a radio in the same low-cost semiconductor technology (CMOS) used for memory and microprocessors enabled a significant improvement in price, leading to an explosion in deployment. Currently available wireless technologies, called by the IEEE standard name 802.11, allow for transmission rates from 1 to nearly 100 million bits per second. Wireless technology is quite a bit different from wire-based networks, since all users in an immediate area share the airwaves.

Check Yourself

■ Semiconductor DRAM memory, flash memory, and disk storage differ significantly. For each technology, list its volatility, approximate relative access time, and approximate relative cost compared to DRAM.

1.5 Technologies for Building Processors and Memory

Processors and memory have improved at an incredible rate, because computer designers have long embraced the latest in electronic technology to try to win the race to design a better computer. Figure 1.10 shows the technologies that have



FIGURE 1.13 A 12-inch (300 mm) wafer of Intel Core i7 (Courtesy Intel). The number of dies on this 300 mm (12 inch) wafer at 100% yield is 280, each 20.7 by 10.5 mm. The several dozen partially rounded chips at the boundaries of the wafer are useless; they are included because it’s easier to create the masks used to pattern the silicon. This die uses a 32-nanometer technology, which means that the smallest features are approximately 32 nm in size, although they are typically somewhat smaller than the actual feature size, which refers to the size of the transistors as “drawn” versus the final manufactured size.

called dies and more informally known as chips. Figure 1.13 shows a photograph of a wafer containing microprocessors before they have been diced; earlier, Figure 1.9 shows an individual microprocessor die.

Dicing enables you to discard only those dies that were unlucky enough to contain the flaws, rather than the whole wafer. This concept is quantified by the yield of a process, which is defined as the percentage of good dies from the total number of dies on the wafer.

The cost of an integrated circuit rises quickly as the die size increases, due both to the lower yield and the smaller number of dies that fit on a wafer. To reduce the cost, using the next generation process shrinks a large die as it uses smaller sizes for both transistors and wires. This improves the yield and the die count per wafer. A 32-nanometer (nm) process was typical in 2012, which means essentially that the smallest feature size on the die is 32 nm.

die The individual rectangular sections that are cut from a wafer, more informally known as chips.

yield The percentage of good dies from the total number of dies on the wafer.



Once you’ve found good dies, they are connected to the input/output pins of a package, using a process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging, and then they are shipped to customers.

Elaboration: The cost of an integrated circuit can be expressed in three simple equations:

Cost per die = Cost per wafer / (Dies per wafer × Yield)

Dies per wafer ≈ Wafer area / Die area

Yield = 1 / (1 + (Defects per area × Die area/2))²

The first equation is straightforward to derive. The second is an approximation, since it does not subtract the area near the border of the round wafer that cannot accommodate the rectangular dies (see Figure 1.13). The final equation is based on empirical observations of yields at integrated circuit factories, with the exponent related to the number of critical processing steps.

Hence, depending on the defect rate and the size of the die and wafer, costs are generally not linear in the die area.
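A small Python sketch of the three equations, using the 20.7 by 10.5 mm die of Figure 1.13 and invented values for the wafer cost and defect density (the dollar and defect numbers are assumptions for illustration, not data from the text):

import math

def cost_per_die(cost_per_wafer, wafer_diameter_mm, die_area_mm2, defects_per_mm2):
    """Apply the three cost equations from the Elaboration above."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    dies_per_wafer = wafer_area / die_area_mm2                     # approximation
    yield_ = 1 / (1 + defects_per_mm2 * die_area_mm2 / 2) ** 2     # empirical model
    return cost_per_wafer / (dies_per_wafer * yield_)

print(cost_per_die(cost_per_wafer=5000.0,       # assumed cost per wafer
                   wafer_diameter_mm=300,
                   die_area_mm2=20.7 * 10.5,    # die size from Figure 1.13
                   defects_per_mm2=0.001))      # assumed defect density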

Check Yourself

A key factor in determining the cost of an integrated circuit is volume. Which of the following are reasons why a chip made in high volume should cost less?

1. With high volumes, the manufacturing process can be tuned to a particular design, increasing the yield.

2. It is less work to design a high-volume part than a low-volume part.

3. The masks used to make the chip are expensive, so the cost per chip is lower for higher volumes.

4. Engineering development costs are high and largely independent of volume; thus, the development cost per die is lower with high-volume parts.

5. High-volume parts usually have smaller die sizes than low-volume parts and therefore have higher yield per wafer.

1.6 Performance

Assessing the performance of computers can be quite challenging. The scale and intricacy of modern software systems, together with the wide range of performance improvement techniques employed by hardware designers, have made performance assessment much more difficult.

When trying to choose among different computers, performance is an important attribute. Accurately measuring and comparing different computers is critical to



purchasers and therefore to designers. The people selling computers know this as well. Often, salespeople would like you to see their computer in the best possible light, whether or not this light accurately reflects the needs of the purchaser’s application. Hence, understanding how best to measure performance and the limitations of performance measurements is important in selecting a computer.

The rest of this section describes different ways in which performance can be determined; then, we describe the metrics for measuring performance from the viewpoint of both a computer user and a designer. We also look at how these metrics are related and present the classical processor performance equation, which we will use throughout the text.

Defining Performance

When we say one computer has better performance than another, what do we mean? Although this question might seem simple, an analogy with passenger airplanes shows how subtle the question of performance can be. Figure 1.14 lists some typical passenger airplanes, together with their cruising speed, range, and capacity. If we wanted to know which of the planes in this table had the best performance, we would first need to define performance. For example, considering different measures of performance, we see that the plane with the highest cruising speed was the Concorde (retired from service in 2003), the plane with the longest range is the DC-8, and the plane with the largest capacity is the 747.

Airplane            Passenger capacity   Cruising range (miles)   Cruising speed (m.p.h.)   Passenger throughput (passengers × m.p.h.)
Boeing 777                 375                  4630                      610                      228,750
Boeing 747                 470                  4150                      610                      286,700
BAC/Sud Concorde           132                  4000                     1350                      178,200
Douglas DC-8-50            146                  8720                      544                       79,424

FIGURE 1.14 The capacity, range, and speed for a number of commercial airplanes. The last column shows the rate at which the airplane transports passengers, which is the capacity times the cruising speed (ignoring range and takeoff and landing times).

Let’s suppose we define performance in terms of speed. This still leaves two possible definitions. You could define the fastest plane as the one with the highest cruising speed, taking a single passenger from one point to another in the least time. If you were interested in transporting 450 passengers from one point to another, however, the 747 would clearly be the fastest, as the last column of the figure shows. Similarly, we can define computer performance in several different ways.

If you were running a program on two different desktop computers, you’d say that the faster one is the desktop computer that gets the job done first. If you were running a datacenter that had several servers running jobs submitted by many users, you’d say that the faster computer was the one that completed the most jobs during a day.

response time Also called execution time. The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.



throughput Also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time.

As an individual computer user, you are interested in reducing response time—the time between the start and completion of a task—also referred to as execution time. Datacenter managers are often interested in increasing throughput or bandwidth—the total amount of work done in a given time. Hence, in most cases, we will need different performance metrics as well as different sets of applications to benchmark personal mobile devices, which are more focused on response time, versus servers, which are more focused on throughput.

Throughput and Response Time

EXAMPLE

Do the following changes to a computer system increase throughput, decrease response time, or both?

1. Replacing the processor in a computer with a faster version

2. Adding additional processors to a system that uses multiple processors for separate tasks—for example, searching the web

ANSWER

Decreasing response time almost always improves throughput. Hence, in case 1, both response time and throughput are improved. In case 2, no one task gets work done faster, so only throughput increases.

If, however, the demand for processing in the second case was almost as large as the throughput, the system might force requests to queue up. In this case, increasing the throughput could also improve response time, since it would reduce the waiting time in the queue. Thus, in many real computer systems, changing either execution time or throughput often affects the other.

In discussing the performance of computers, we will be primarily concerned with response time for the first few chapters. To maximize performance, we want to minimize response time or execution time for some task. Thus, we can relate performance and execution time for a computer X:

Performance_X = 1 / Execution time_X

This means that for two computers X and Y, if the performance of X is greater than the performance of Y, we have

Performance_X > Performance_Y

1 / Execution time_X > 1 / Execution time_Y

Execution time_Y > Execution time_X

That is, the execution time on Y is longer than that on X, if X is faster than Y.



In discussing a computer design, we often want to relate the performance of two different computers quantitatively. We will use the phrase “X is n times faster than Y”—or equivalently “X is n times as fast as Y”—to mean

Performance_X / Performance_Y = n

If X is n times as fast as Y, then the execution time on Y is n times as long as it is on X:

Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

Relative Performance

EXAMPLE

If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?

ANSWER

We know that A is n times as fast as B if

Performance_A / Performance_B = Execution time_B / Execution time_A = n

Thus the performance ratio is

15 / 10 = 1.5

and A is therefore 1.5 times as fast as B.

In the above example, we could also say that computer B is 1.5 times slower than computer A, since

Performance_A / Performance_B = 1.5

means that

Performance_A / 1.5 = Performance_B
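The same ratio expressed as a short Python sketch, using the times from the example:

def n_times_as_fast(execution_time_x, execution_time_y):
    """Performance_X / Performance_Y, computed from execution times."""
    return execution_time_y / execution_time_x

print(n_times_as_fast(10, 15))   # 1.5: A (10 s) is 1.5 times as fast as B (15 s)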



For simplicity, we will normally use the terminology as fast as when we try to compare computers quantitatively. Because performance and execution time are reciprocals, increasing performance requires decreasing execution time. To avoid the potential confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time” when we mean “increase performance” and “decrease execution time.”

CPU execution time Also called CPU time. The actual time the CPU spends computing for a specific task.

user CPU time The CPU time spent in a program itself.

system CPU time The CPU time spent in the operating system performing tasks on behalf of the program.

Measuring Performance

Time is the measure of computer performance: the computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. However, time can be defined in different ways, depending on what we count. The most straightforward definition of time is called wall clock time, response time, or elapsed time. These terms mean the total time to complete a task, including disk accesses, memory accesses, input/output (I/O) activities, operating system overhead—everything.

Computers are often shared, however, and a processor may work on several programs simultaneously. In such cases, the system may try to optimize throughput rather than attempt to minimize the elapsed time for one program. Hence, we often want to distinguish between the elapsed time and the time over which the processor is working on our behalf. CPU execution time or simply CPU time, which recognizes this distinction, is the time the CPU spends computing for this task and does not include time spent waiting for I/O or running other programs. (Remember, though, that the response time experienced by the user will be the elapsed time of the program, not the CPU time.) CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks on behalf of the program, called system CPU time. Differentiating between system and user CPU time is difficult to do accurately, because it is often hard to assign responsibility for operating system activities to one user program rather than another and because of the functionality differences among operating systems.

For consistency, we maintain a distinction between performance based on elapsed time and that based on CPU execution time. We will use the term system performance to refer to elapsed time on an unloaded system and CPU performance to refer to user CPU time. We will focus on CPU performance in this chapter, although our discussions of how to summarize performance can be applied to either elapsed time or CPU time measurements.

Understanding Program Performance

Different applications are sensitive to different aspects of the performance of a computer system. Many applications, especially those running on servers, depend as much on I/O performance, which, in turn, relies on both hardware and software. Total elapsed time measured by a wall clock is the measurement of interest. In some application environments, the user may care about throughput, response time, or a complex combination of the two (e.g., maximum throughput with a worst-case response time). To improve the performance of a program, one must have a clear definition of what performance metric matters and then proceed to look for performance bottlenecks by measuring program execution and looking for the likely bottlenecks. In the following chapters, we will describe how to search for bottlenecks and improve performance in various parts of the system.

Although as computer users we care about time, when we examine the details of a computer it’s convenient to think about performance in other metrics. In particular, computer designers may want to think about a computer by using a measure that relates to how fast the hardware can perform basic functions. Almost all computers are constructed using a clock that determines when events take place in the hardware. These discrete time intervals are called clock cycles (or ticks, clock ticks, clock periods, clocks, cycles). Designers refer to the length of a clock period both as the time for a complete clock cycle (e.g., 250 picoseconds, or 250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse of the clock period. In the next subsection, we will formalize the relationship between the clock cycles of the hardware designer and the seconds of the computer user.

clock cycle Also called tick, clock tick, clock period, clock, or cycle. The time for one clock period, usually of the processor clock, which runs at a constant rate.

clock period The length of each clock cycle.

Check Yourself

1. Suppose we know that an application that uses both personal mobile devices and the Cloud is limited by network performance. For the following changes, state whether only the throughput improves, both response time and throughput improve, or neither improves.

a. An extra network channel is added between the PMD and the Cloud, increasing the total network throughput and reducing the delay to obtain network access (since there are now two channels).

b. The networking software is improved, thereby reducing the network communication delay, but not increasing throughput.

c. More memory is added to the computer.

2. Computer C’s performance is 4 times as fast as the performance of computer B, which runs a given application in 28 seconds. How long will computer C take to run that application?

CPU Performance and Its Factors

Users and designers often examine performance using different metrics. If we could relate these different metrics, we could determine the effect of a design change on the performance as experienced by the user. Since we are confining ourselves to CPU performance at this point, the bottom-line performance measure is CPU



execution time. A simple formula relates the most basic metrics (clock cycles and clock cycle time) to CPU time:

CPU execution time for a program = CPU clock cycles for a program × Clock cycle time

Alternatively, because clock rate and clock cycle time are inverses,

CPU execution time for a program = CPU clock cycles for a program / Clock rate

This formula makes it clear that the hardware designer can improve performance by reducing the number of clock cycles required for a program or the length of the clock cycle. As we will see in later chapters, the designer often faces a trade-off between the number of clock cycles needed for a program and the length of each cycle. Many techniques that decrease the number of clock cycles may also increase the clock cycle time.

Improving Performance

EXAMPLE

Our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock. We are trying to help a computer designer build a computer, B, which will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?

ANSWER

Let’s first find the number of clock cycles required for the program on A:

CPU time_A = CPU clock cycles_A / Clock rate_A

10 seconds = CPU clock cycles_A / (2 × 10⁹ cycles/second)

CPU clock cycles_A = 10 seconds × 2 × 10⁹ cycles/second = 20 × 10⁹ cycles

CPU time for B can be found using this equation:

CPU time_B = 1.2 × CPU clock cycles_A / Clock rate_B

6 seconds = 1.2 × 20 × 10⁹ cycles / Clock rate_B

Clock rate_B = 1.2 × 20 × 10⁹ cycles / 6 seconds = 0.2 × 20 × 10⁹ cycles/second = 4 × 10⁹ cycles/second = 4 GHz

To run the program in 6 seconds, B must have twice the clock rate of A.
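The same arithmetic, written out as a short Python check of the example’s numbers:

# Computer A: 10 seconds at a 2 GHz clock; target for B: 6 seconds,
# with B needing 1.2 times as many clock cycles as A.
clock_rate_a = 2e9                             # cycles per second
cpu_time_a = 10.0                              # seconds
clock_cycles_a = cpu_time_a * clock_rate_a     # 20e9 cycles

cpu_time_b = 6.0                               # seconds
clock_cycles_b = 1.2 * clock_cycles_a          # 24e9 cycles
clock_rate_b = clock_cycles_b / cpu_time_b     # 4e9 cycles per second

print(clock_rate_b / 1e9, "GHz")               # 4.0 GHz, twice A's clock rate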

Instruction Performance

The performance equations above did not include any reference to the number of instructions needed for the program. However, since the compiler clearly generated instructions to execute, and the computer had to execute the instructions to run the program, the execution time must depend on the number of instructions in a program. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction. Therefore, the number of clock cycles required for a program can be written as

CPU clock cycles = Instructions for a program × Average clock cycles per instruction

The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is often abbreviated as CPI. Since different instructions may take different amounts of time depending on what they do, CPI is an average of all the instructions executed in the program. CPI provides one way of comparing two different implementations of the same instruction set architecture, since the number of instructions executed for a program will, of course, be the same.

clock cycles per instruction (CPI) Average number of clock cycles per instruction for a program or program fragment.

Using the Performance Equation

EXAMPLE

Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program and by how much?

ANSWER

We know that each computer executes the same number of instructions for the program; let’s call this number I. First, find the number of processor clock cycles for each computer:

CPU clock cycles_A = I × 2.0

CPU clock cycles_B = I × 1.2

Now we can compute the CPU time for each computer:

CPU time_A = CPU clock cycles_A × Clock cycle time = I × 2.0 × 250 ps = 500 × I ps

Likewise, for B:

CPU time_B = I × 1.2 × 500 ps = 600 × I ps

Clearly, computer A is faster. The amount faster is given by the ratio of the execution times:

CPU performance_A / CPU performance_B = Execution time_B / Execution time_A = 600 × I ps / 500 × I ps = 1.2

We can conclude that computer A is 1.2 times as fast as computer B for this program.

instruction count The number of instructions executed by the program.

The Classic CPU Performance Equation

We can now write this basic performance equation in terms of instruction count (the number of instructions executed by the program), CPI, and clock cycle time:

CPU time = Instruction count × CPI × Clock cycle time

or, since the clock rate is the inverse of clock cycle time:

CPU time = Instruction count × CPI / Clock rate

These formulas are particularly useful because they separate the three key factors that affect performance. We can use these formulas to compare two different implementations or to evaluate a design alternative if we know its impact on these three parameters.
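As a small sketch, the equation can be wrapped in a function and applied to the two implementations of the example above (the instruction count is arbitrary because it cancels in the ratio):

def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = Instruction count x CPI x Clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

I = 1_000_000                        # any instruction count; it cancels in the ratio
time_a = cpu_time(I, 2.0, 250e-12)   # computer A: CPI 2.0, 250 ps clock cycle
time_b = cpu_time(I, 1.2, 500e-12)   # computer B: CPI 1.2, 500 ps clock cycle
print(time_b / time_a)               # 1.2: A is 1.2 times as fast as B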



Although power provides a limit to what we can cool, in the PostPC Era the really critical resource is energy. Battery life can trump performance in the personal mobile device, and the architects of warehouse scale computers try to reduce the costs of powering and cooling 100,000 servers as the costs are high at this scale. Just as measuring time in seconds is a safer measure of program performance than a rate like MIPS (see Section 1.10), the energy metric joules is a better measure than a power rate like watts, which is just joules/second.

The dominant technology for integrated circuits is called CMOS (complementary metal oxide semiconductor). For CMOS, the primary source of energy consumption is so-called dynamic energy—that is, energy that is consumed when transistors switch states from 0 to 1 and vice versa. The dynamic energy depends on the capacitive loading of each transistor and the voltage applied:

Energy ∝ Capacitive load × Voltage²

This equation is the energy of a pulse during the logic transition of 0 → 1 → 0 or 1 → 0 → 1. The energy of a single transition is then

Energy ∝ 1/2 × Capacitive load × Voltage²

The power required per transistor is just the product of energy of a transition and the frequency of transitions:

Power ∝ 1/2 × Capacitive load × Voltage² × Frequency switched

Frequency switched is a function of the clock rate. The capacitive load per transistor is a function of both the number of transistors connected to an output (called the fanout) and the technology, which determines the capacitance of both wires and transistors.

With regard to Figure 1.16, how could clock rates grow by a factor of 1000 while power grew by only a factor of 30? Energy and thus power can be reduced by lowering the voltage, which occurred with each new generation of technology, and power is a function of the voltage squared. Typically, the voltage was reduced about 15% per generation. In 20 years, voltages have gone from 5 V to 1 V, which is why the increase in power is only 30 times.

Relative Power

EXAMPLE

Suppose we developed a new, simpler processor that has 85% of the capacitive load of the more complex older processor. Further, assume that it has adjustable voltage so that it can reduce voltage 15% compared to processor B, which results in a 15% shrink in frequency. What is the impact on dynamic power?



ANSWER

Power_new / Power_old = ((Capacitive load × 0.85) × (Voltage × 0.85)² × (Frequency switched × 0.85)) / (Capacitive load × Voltage² × Frequency switched)

Thus the power ratio is

0.85⁴ = 0.52

Hence, the new processor uses about half the power of the old processor.
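The ratio is easy to check numerically; a short Python sketch of the calculation:

# Capacitive load, voltage (squared), and switching frequency each scale by 0.85.
scale = 0.85
power_ratio = scale * scale**2 * scale   # load x voltage^2 x frequency
print(power_ratio)                       # 0.522..., about half the old dynamic power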

The problem today is that further lowering of the voltage appears to make the transistors too leaky, like water faucets that cannot be completely shut off. Even today about 40% of the power consumption in server chips is due to leakage. If transistors started leaking more, the whole process could become unwieldy.

To try to address the power problem, designers have already attached large devices to increase cooling, and they turn off parts of the chip that are not used in a given clock cycle. Although there are many more expensive ways to cool chips and thereby raise their power to, say, 300 watts, these techniques are generally too expensive for personal computers and even servers, not to mention personal mobile devices.

Since computer designers slammed into a power wall, they needed a new way forward. They chose a different path from the way they designed microprocessors for their first 30 years.

Elaboration: Although dynamic energy is the primary source of energy consumption in CMOS, static energy consumption occurs because of leakage current that flows even when a transistor is off. In servers, leakage is typically responsible for 40% of the energy consumption. Thus, increasing the number of transistors increases power dissipation, even if the transistors are always off. A variety of design techniques and technology innovations are being deployed to control leakage, but it’s hard to lower voltage further.

Elaboration: Power is a challenge for integrated circuits for two reasons. First, power must be brought in and distributed around the chip; modern microprocessors use hundreds of pins just for power and ground! Similarly, multiple levels of chip interconnect are used solely for power and ground distribution to portions of the chip. Second, power is dissipated as heat and must be removed. Server chips can burn more than 100 watts, and cooling the chip and the surrounding system is a major expense in Warehouse Scale Computers (see Chapter 6).



Description                          Name         Instruction    CPI    Clock cycle time   Execution time   Reference time   SPECratio
                                                  count × 10⁹           (seconds × 10⁻⁹)   (seconds)        (seconds)
Interpreted string processing        perl            2252        0.60        0.376              508             9770            19.2
Block-sorting compression            bzip2           2390        0.70        0.376              629             9650            15.4
GNU C compiler                       gcc              794        1.20        0.376              358             8050            22.5
Combinatorial optimization           mcf              221        2.66        0.376              221             9120            41.2
Go game (AI)                         go              1274        1.10        0.376              527            10490            19.9
Search gene sequence                 hmmer           2616        0.60        0.376              590             9330            15.8
Chess game (AI)                      sjeng           1948        0.80        0.376              586            12100            20.7
Quantum computer simulation          libquantum       659        0.44        0.376              109            20720           190.0
Video compression                    h264avc         3793        0.50        0.376              713            22130            31.0
Discrete event simulation library    omnetpp          367        2.10        0.376              290             6250            21.5
Games/path finding                   astar           1250        1.00        0.376              470             7020            14.9
XML parsing                          xalancbmk       1045        0.70        0.376              275             6900            25.1
Geometric mean                       –                  –           –            –                –                –            25.7

FIGURE 1.18 SPECINTC2006 benchmarks running on a 2.66 GHz Intel Core i7 920. As the equation on page 35 explains, execution time is the product of the three factors in this table: instruction count in billions, clocks per instruction (CPI), and clock cycle time in nanoseconds. SPECratio is simply the reference time, which is supplied by SPEC, divided by the measured execution time. The single number quoted as SPECINTC2006 is the geometric mean of the SPECratios.

set focusing on processor performance (now called SPEC89), which has evolved<br />

through five generations. The latest is SPEC CPU2006, which consists of a set of 12<br />

integer benchmarks (CINT2006) <strong>and</strong> 17 floating-point benchmarks (CFP2006).<br />

The integer benchmarks vary from part of a C compiler to a chess program to a<br />

quantum computer simulation. The floating-point benchmarks include structured<br />

grid codes for finite element modeling, particle method codes for molecular<br />

dynamics, <strong>and</strong> sparse linear algebra codes for fluid dynamics.<br />

Figure 1.18 describes the SPEC integer benchmarks <strong>and</strong> their execution time<br />

on the Intel Core i7 <strong>and</strong> shows the factors that explain execution time: instruction<br />

count, CPI, <strong>and</strong> clock cycle time. Note that CPI varies by more than a factor of 5.<br />

To simplify the marketing of computers, SPEC decided to report a single number<br />

to summarize all 12 integer benchmarks. Dividing the execution time of a reference<br />

processor by the execution time of the measured computer normalizes the execution<br />

time measurements; this normalization yields a measure, called the SPECratio, which<br />

has the advantage that bigger numeric results indicate faster performance. That is,<br />

the SPECratio is the inverse of execution time. A CINT2006 or CFP2006 summary<br />

measurement is obtained by taking the geometric mean of the SPECratios.<br />
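The following short C program is a minimal sketch (not from the text) of how SPECratios and their geometric mean can be computed; the reference and measured times are taken from the first three rows of Figure 1.18, and the log/exp trick is just one convenient way to form the n-th root of the product.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Reference and measured execution times (seconds) for perl, bzip2,
       and gcc from Figure 1.18. */
    double reference[] = { 9770.0, 9650.0, 8050.0 };
    double measured[]  = {  508.0,  629.0,  358.0 };
    int n = 3;

    double log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double specratio = reference[i] / measured[i];   /* bigger means faster */
        printf("SPECratio %d = %.1f\n", i, specratio);
        log_sum += log(specratio);                       /* accumulate in log space */
    }
    /* Geometric mean = exp(mean of the logs) = nth root of the product. */
    printf("Geometric mean = %.1f\n", exp(log_sum / n));
    return 0;
}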

Elaboration: When comparing two computers using SPECratios, use the geometric<br />

mean so that it gives the same relative answer no matter what computer is used to<br />

normalize the results. If we averaged the normalized execution time values with an<br />

arithmetic mean, the results would vary depending on the computer we choose as the<br />

reference.


48 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

The formula for the geometric mean is

Geometric mean = ⁿ√( ∏ from i = 1 to n of Execution time ratio_i )

where Execution time ratio_i is the execution time, normalized to the reference computer, for the ith program of a total of n in the workload, and

∏ from i = 1 to n of a_i means the product a₁ × a₂ × … × aₙ

SPEC Power Benchmark

Given the increasing importance of energy and power, SPEC added a benchmark to measure power. It reports power consumption of servers at different workload levels, divided into 10% increments, over a period of time. Figure 1.19 shows the results for a server using Intel Nehalem processors similar to the above.

Target Load %    Performance (ssj_ops)    Average Power (watts)

100% 865,618 258<br />

90% 786,688 242<br />

80% 698,051 224<br />

70% 607,826 204<br />

60% 521,391 185<br />

50% 436,757 170<br />

40% 345,919 157<br />

30% 262,071 146<br />

20% 176,061 135<br />

10% 86,784 121<br />

0% 0 80<br />

Overall Sum 4,787,166 1922<br />

∑ssj_ops / ∑power = 2490<br />

FIGURE 1.19 SPECpower_ssj2008 running on a dual socket 2.66 GHz Intel Xeon X5650<br />

with 16 GB of DRAM <strong>and</strong> one 100 GB SSD disk.<br />

SPECpower started with another SPEC benchmark for Java business applications<br />

(SPECJBB2005), which exercises the processors, caches, <strong>and</strong> main memory as well<br />

as the Java virtual machine, compiler, garbage collector, <strong>and</strong> pieces of the operating<br />

system. Performance is measured in throughput, <strong>and</strong> the units are business<br />

operations per second. Once again, to simplify the marketing of computers, SPEC


1.10 Fallacies <strong>and</strong> Pitfalls 49<br />

boils these numbers down to a single number, called “overall ssj_ops per watt.” The<br />

formula for this single summarizing metric is<br />

overall ssj_ops per watt = ( ∑ ssj_ops_i ) / ( ∑ power_i ), summing i from 0 to 10

where ssj_ops_i is performance at each 10% increment and power_i is power consumed at each performance level.
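As an illustrative sketch (not part of the SPEC material itself), the summary metric can be reproduced directly from the columns of Figure 1.19:

#include <stdio.h>

int main(void) {
    /* Performance (ssj_ops) and average power (watts) at each 10% load
       increment, from 100% load down to 0% (active idle), per Figure 1.19. */
    double ssj_ops[] = { 865618, 786688, 698051, 607826, 521391,
                         436757, 345919, 262071, 176061,  86784, 0 };
    double watts[]   = {    258,    242,    224,    204,    185,
                            170,    157,    146,    135,    121, 80 };
    double ops_sum = 0.0, power_sum = 0.0;
    for (int i = 0; i < 11; i++) {
        ops_sum   += ssj_ops[i];
        power_sum += watts[i];
    }
    /* Overall ssj_ops per watt = (sum of ssj_ops) / (sum of power);
       Figure 1.19 reports this as 2490. */
    printf("overall ssj_ops per watt = %.1f\n", ops_sum / power_sum);
    return 0;
}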

1.10<br />

Fallacies <strong>and</strong> Pitfalls<br />

The purpose of a section on fallacies <strong>and</strong> pitfalls, which will be found in every<br />

chapter, is to explain some commonly held misconceptions that you might<br />

encounter. We call them fallacies. When discussing a fallacy, we try to give a<br />

counterexample. We also discuss pitfalls, or easily made mistakes. Often pitfalls are<br />

generalizations of principles that are only true in a limited context. The purpose<br />

of these sections is to help you avoid making these mistakes in the computers you<br />

may design or use. Cost/performance fallacies <strong>and</strong> pitfalls have ensnared many a<br />

computer architect, including us. Accordingly, this section suffers no shortage of<br />

relevant examples. We start with a pitfall that traps many designers <strong>and</strong> reveals an<br />

important relationship in computer design.<br />

Pitfall: Expecting the improvement of one aspect of a computer to increase overall<br />

performance by an amount proportional to the size of the improvement.<br />

The great idea of making the common case fast has a demoralizing corollary<br />

that has plagued designers of both hardware <strong>and</strong> software. It reminds us that the<br />

opportunity for improvement is affected by how much time the event consumes.<br />

A simple design problem illustrates it well. Suppose a program runs in 100<br />

seconds on a computer, with multiply operations responsible for 80 seconds of this<br />

time. How much do I have to improve the speed of multiplication if I want my<br />

program to run five times faster?<br />

The execution time of the program after making the improvement is given by<br />

the following simple equation known as Amdahl’s Law:<br />

Execution time after improvement =
    (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

For this problem:

Execution time after improvement = (80 seconds / n) + (100 − 80 seconds)

Science must begin<br />

with myths, <strong>and</strong> the<br />

criticism of myths.<br />

Sir Karl Popper, The<br />

Philosophy of Science,<br />

1957<br />

Amdahl’s Law<br />

A rule stating that<br />

the performance<br />

enhancement possible<br />

with a given improvement<br />

is limited by the amount<br />

that the improved feature<br />

is used. It is a quantitative<br />

version of the law of<br />

diminishing returns.


50 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

Since we want the performance to be five times faster, the new execution time<br />

should be 20 seconds, giving<br />

20 seconds = (80 seconds / n) + 20 seconds

0 = 80 seconds / n

That is, there is no amount by which we can enhance multiply to achieve a fivefold

increase in performance, if multiply accounts for only 80% of the workload. The<br />

performance enhancement possible with a given improvement is limited by the amount<br />

that the improved feature is used. In everyday life this concept also yields what we call<br />

the law of diminishing returns.<br />

We can use Amdahl’s Law to estimate performance improvements when we<br />

know the time consumed for some function <strong>and</strong> its potential speedup. Amdahl’s<br />

Law, together with the CPU performance equation, is a h<strong>and</strong>y tool for evaluating<br />

potential enhancements. Amdahl’s Law is explored in more detail in the exercises.<br />

Amdahl’s Law is also used to argue for practical limits to the number of parallel<br />

processors. We examine this argument in the Fallacies <strong>and</strong> Pitfalls section of<br />

Chapter 6.<br />
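A small C sketch (illustrative only, using the numbers from the multiply example) makes the limit concrete: no matter how large the multiply speedup n becomes, the 20 unaffected seconds cap the overall speedup below 5.

#include <stdio.h>

/* Amdahl's Law: new time = affected time / improvement + unaffected time. */
static double time_after(double affected, double unaffected, double improvement) {
    return affected / improvement + unaffected;
}

int main(void) {
    double total = 100.0, multiply = 80.0;                 /* seconds */
    for (double n = 2.0; n <= 64.0; n *= 2.0) {
        double t = time_after(multiply, total - multiply, n);
        printf("multiply %2.0fx faster -> %5.2f s total (speedup %.2f)\n",
               n, t, total / t);
    }
    return 0;
}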

Fallacy: <strong>Computer</strong>s at low utilization use little power.<br />

Power efficiency matters at low utilizations because server workloads vary.<br />

Utilization of servers in Google’s warehouse scale computer, for example, is<br />

between 10% <strong>and</strong> 50% most of the time <strong>and</strong> at 100% less than 1% of the time. Even<br />

given five years to learn how to run the SPECpower benchmark well, the specially<br />

configured computer with the best results in 2012 still uses 33% of the peak power<br />

at 10% of the load. Systems in the field that are not configured for the SPECpower<br />

benchmark are surely worse.<br />

Since servers’ workloads vary but use a large fraction of peak power, Luiz<br />

Barroso <strong>and</strong> Urs Hölzle [2007] argue that we should redesign hardware to achieve<br />

“energy-proportional computing.” If future servers used, say, 10% of peak power at<br />

10% workload, we could reduce the electricity bill of datacenters <strong>and</strong> become good<br />

corporate citizens in an era of increasing concern about CO 2<br />

emissions.<br />

Fallacy: <strong>Design</strong>ing for performance <strong>and</strong> designing for energy efficiency are<br />

unrelated goals.<br />

Since energy is power over time, it is often the case that hardware or software<br />

optimizations that take less time save energy overall even if the optimization takes<br />

a bit more energy when it is used. One reason is that all of the rest of the computer is<br />

consuming energy while the program is running, so even if the optimized portion<br />

uses a little more energy, the reduced time can save the energy of the whole system.<br />

Pitfall: Using a subset of the performance equation as a performance metric.<br />

We have already warned about the danger of predicting performance based on<br />

simply one of clock rate, instruction count, or CPI. Another common mistake


1.10 Fallacies <strong>and</strong> Pitfalls 51<br />

is to use only two of the three factors to compare performance. Although using<br />

two of the three factors may be valid in a limited context, the concept is also<br />

easily misused. Indeed, nearly all proposed alternatives to the use of time as the<br />

performance metric have led eventually to misleading claims, distorted results, or<br />

incorrect interpretations.<br />

One alternative to time is MIPS (million instructions per second). For a given<br />

program, MIPS is simply<br />

MIPS = Instruction count / (Execution time × 10⁶)

Since MIPS is an instruction execution rate, MIPS specifies performance inversely<br />

to execution time; faster computers have a higher MIPS rating. The good news<br />

about MIPS is that it is easy to underst<strong>and</strong>, <strong>and</strong> faster computers mean bigger<br />

MIPS, which matches intuition.<br />

There are three problems with using MIPS as a measure for comparing computers.<br />

First, MIPS specifies the instruction execution rate but does not take into account<br />

the capabilities of the instructions. We cannot compare computers with different<br />

instruction sets using MIPS, since the instruction counts will certainly differ.<br />

Second, MIPS varies between programs on the same computer; thus, a computer<br />

cannot have a single MIPS rating. For example, by substituting for execution time,<br />

we see the relationship between MIPS, clock rate, <strong>and</strong> CPI:<br />

MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / (((Instruction count × CPI) / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

million instructions<br />

per second (MIPS)<br />

A measurement of<br />

program execution speed<br />

based on the number of<br />

millions of instructions.<br />

MIPS is computed as the<br />

instruction count divided<br />

by the product of the<br />

execution time <strong>and</strong> 10 6 .<br />

The CPI varied by a factor of 5 for SPEC CPU2006 on an Intel Core i7 computer<br />

in Figure 1.18, so MIPS does as well. Finally, <strong>and</strong> most importantly, if a new<br />

program executes more instructions but each instruction is faster, MIPS can vary<br />

independently from performance!<br />
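The mismatch is easy to demonstrate with made-up numbers (the two machines and their values below are hypothetical, chosen only so that the MIPS ranking and the execution-time ranking disagree):

#include <stdio.h>

int main(void) {
    double clock_rate = 4e9;                /* 4 GHz for both machines     */
    double instr[2]   = { 4e9, 9e9 };       /* instruction counts          */
    double cpi[2]     = { 1.0, 0.8 };       /* clocks per instruction      */

    for (int i = 0; i < 2; i++) {
        double exec_time = instr[i] * cpi[i] / clock_rate;
        double mips      = instr[i] / (exec_time * 1e6);  /* equals clock rate / (CPI * 1e6) */
        printf("Machine %c: time = %.2f s, MIPS = %.0f\n",
               'A' + i, exec_time, mips);
    }
    /* The second machine posts the higher MIPS rating yet takes longer
       to run the program, which is exactly the pitfall. */
    return 0;
}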

Consider the following performance measurements for a program:<br />

Check Yourself

Measurement          Computer A    Computer B
Instruction count    10 billion    8 billion
Clock rate           4 GHz         4 GHz
CPI                  1.0           1.1

a. Which computer has the higher MIPS rating?<br />

b. Which computer is faster?


1.13 Exercises 55<br />

e. Library reserve desk<br />

f. Increasing the gate area on a CMOS transistor to decrease its switching time<br />

g. Adding electromagnetic aircraft catapults (which are electrically-powered<br />

as opposed to current steam-powered models), allowed by the increased power<br />

generation offered by the new reactor technology<br />

h. Building self-driving cars whose control systems partially rely on existing sensor<br />

systems already installed into the base vehicle, such as lane departure systems <strong>and</strong><br />

smart cruise control systems<br />

1.3 [2] Describe the steps that transform a program written in a high-level<br />

language such as C into a representation that is directly executed by a computer<br />

processor.<br />

1.4 [2] Assume a color display using 8 bits for each of the primary colors<br />

(red, green, blue) per pixel <strong>and</strong> a frame size of 1280 × 1024.<br />

a. What is the minimum size in bytes of the frame buffer to store a frame?<br />

b. How long would it take, at a minimum, for the frame to be sent over a 100<br />

Mbit/s network?<br />

1.5 [4] Consider three different processors P1, P2, <strong>and</strong> P3 executing<br />

the same instruction set. P1 has a 3 GHz clock rate <strong>and</strong> a CPI of 1.5. P2 has a<br />

2.5 GHz clock rate <strong>and</strong> a CPI of 1.0. P3 has a 4.0 GHz clock rate <strong>and</strong> has a CPI<br />

of 2.2.<br />

a. Which processor has the highest performance expressed in instructions per second?<br />

b. If the processors each execute a program in 10 seconds, find the number of<br />

cycles <strong>and</strong> the number of instructions.<br />

c. We are trying to reduce the execution time by 30% but this leads to an increase<br />

of 20% in the CPI. What clock rate should we have to get this time reduction?<br />

1.6 [20] Consider two different implementations of the same instruction<br />

set architecture. The instructions can be divided into four classes according to<br />

their CPI (class A, B, C, <strong>and</strong> D). P1 with a clock rate of 2.5 GHz <strong>and</strong> CPIs of 1, 2, 3,<br />

<strong>and</strong> 3, <strong>and</strong> P2 with a clock rate of 3 GHz <strong>and</strong> CPIs of 2, 2, 2, <strong>and</strong> 2.<br />

Given a program with a dynamic instruction count of 1.0E6 instructions divided<br />

into classes as follows: 10% class A, 20% class B, 50% class C, <strong>and</strong> 20% class D,<br />

which implementation is faster?<br />

a. What is the global CPI for each implementation?<br />

b. Find the clock cycles required in both cases.


56 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

1.7 [15] Compilers can have a profound impact on the performance<br />

of an application. Assume that for a program, compiler A results in a dynamic<br />

instruction count of 1.0E9 <strong>and</strong> has an execution time of 1.1 s, while compiler B<br />

results in a dynamic instruction count of 1.2E9 <strong>and</strong> an execution time of 1.5 s.<br />

a. Find the average CPI for each program given that the processor has a clock cycle<br />

time of 1 ns.<br />

b. Assume the compiled programs run on two different processors. If the execution<br />

times on the two processors are the same, how much faster is the clock of the<br />

processor running compiler A’s code versus the clock of the processor running<br />

compiler B’s code?<br />

c. A new compiler is developed that uses only 6.0E8 instructions <strong>and</strong> has an<br />

average CPI of 1.1. What is the speedup of using this new compiler versus using<br />

compiler A or B on the original processor?<br />

1.8 The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6<br />

GHz <strong>and</strong> voltage of 1.25 V. Assume that, on average, it consumed 10 W of static<br />

power <strong>and</strong> 90 W of dynamic power.<br />

The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz <strong>and</strong> voltage<br />

of 0.9 V. Assume that, on average, it consumed 30 W of static power <strong>and</strong> 40 W of<br />

dynamic power.<br />

1.8.1 [5] For each processor find the average capacitive loads.<br />

1.8.2 [5] Find the percentage of the total dissipated power comprised by<br />

static power <strong>and</strong> the ratio of static power to dynamic power for each technology.<br />

1.8.3 [15] If the total dissipated power is to be reduced by 10%, how much<br />

should the voltage be reduced to maintain the same leakage current? Note: power<br />

is defined as the product of voltage <strong>and</strong> current.<br />

1.9 Assume for arithmetic, load/store, <strong>and</strong> branch instructions, a processor has<br />

CPIs of 1, 12, <strong>and</strong> 5, respectively. Also assume that on a single processor a program<br />

requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store<br />

instructions, <strong>and</strong> 256 million branch instructions. Assume that each processor has<br />

a 2 GHz clock frequency.<br />

Assume that, as the program is parallelized to run over multiple cores, the number<br />

of arithmetic and load/store instructions per processor is divided by 0.7 × p (where

p is the number of processors) but the number of branch instructions per processor<br />

remains the same.<br />

1.9.1 [5] Find the total execution time for this program on 1, 2, 4, <strong>and</strong> 8<br />

processors, <strong>and</strong> show the relative speedup of the 2, 4, <strong>and</strong> 8 processor result relative<br />

to the single processor result.


1.13 Exercises 57<br />

1.9.2 [10] If the CPI of the arithmetic instructions was doubled,<br />

what would the impact be on the execution time of the program on 1, 2, 4, or 8<br />

processors?<br />

1.9.3 [10] To what should the CPI of load/store instructions be<br />

reduced in order for a single processor to match the performance of four processors<br />

using the original CPI values?<br />

1.10 Assume a 15 cm diameter wafer has a cost of 12, contains 84 dies, <strong>and</strong> has<br />

0.020 defects/cm 2 . Assume a 20 cm diameter wafer has a cost of 15, contains 100<br />

dies, <strong>and</strong> has 0.031 defects/cm 2 .<br />

1.10.1 [10] Find the yield for both wafers.<br />

1.10.2 [5] Find the cost per die for both wafers.<br />

1.10.3 [5] If the number of dies per wafer is increased by 10% <strong>and</strong> the<br />

defects per area unit increases by 15%, find the die area <strong>and</strong> yield.<br />

1.10.4 [5] Assume a fabrication process improves the yield from 0.92 to<br />

0.95. Find the defects per area unit for each version of the technology given a die<br />

area of 200 mm 2 .<br />

1.11 The result of the SPEC CPU2006 bzip2 benchmark running on an AMD

Barcelona has an instruction count of 2.389E12, an execution time of 750 s, <strong>and</strong> a<br />

reference time of 9650 s.<br />

1.11.1 [5] Find the CPI if the clock cycle time is 0.333 ns.<br />

1.11.2 [5] Find the SPECratio.<br />

1.11.3 [5] Find the increase in CPU time if the number of instructions<br />

of the benchmark is increased by 10% without affecting the CPI.<br />

1.11.4 [5] Find the increase in CPU time if the number of instructions<br />

of the benchmark is increased by 10% <strong>and</strong> the CPI is increased by 5%.<br />

1.11.5 [5] Find the change in the SPECratio for this change.<br />

1.11.6 [10] Suppose that we are developing a new version of the AMD<br />

Barcelona processor with a 4 GHz clock rate. We have added some additional<br />

instructions to the instruction set in such a way that the number of instructions<br />

has been reduced by 15%. The execution time is reduced to 700 s <strong>and</strong> the new<br />

SPECratio is 13.7. Find the new CPI.<br />

1.11.7 [10] This CPI value is larger than obtained in 1.11.1 as the clock<br />

rate was increased from 3 GHz to 4 GHz. Determine whether the increase in the<br />

CPI is similar to that of the clock rate. If they are dissimilar, why?<br />

1.11.8 [5] By how much has the CPU time been reduced?


58 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

1.11.9 [10] For a second benchmark, libquantum, assume an execution<br />

time of 960 ns, CPI of 1.61, <strong>and</strong> clock rate of 3 GHz. If the execution time is<br />

reduced by an additional 10% without affecting the CPI and with a clock rate of

4 GHz, determine the number of instructions.<br />

1.11.10 [10] Determine the clock rate required to give a further 10%<br />

reduction in CPU time while maintaining the number of instructions <strong>and</strong> with the<br />

CPI unchanged.<br />

1.11.11 [10] Determine the clock rate if the CPI is reduced by 15% <strong>and</strong><br />

the CPU time by 20% while the number of instructions is unchanged.<br />

1.12 Section 1.10 cites as a pitfall the utilization of a subset of the performance<br />

equation as a performance metric. To illustrate this, consider the following two<br />

processors. P1 has a clock rate of 4 GHz, average CPI of 0.9, <strong>and</strong> requires the<br />

execution of 5.0E9 instructions. P2 has a clock rate of 3 GHz, an average CPI of<br />

0.75, <strong>and</strong> requires the execution of 1.0E9 instructions.<br />

1.12.1 [5] One usual fallacy is to consider the computer with the<br />

largest clock rate as having the largest performance. Check if this is true for P1 <strong>and</strong><br />

P2.<br />

1.12.2 [10] Another fallacy is to consider that the processor executing<br />

the largest number of instructions will need a larger CPU time. Considering that<br />

processor P1 is executing a sequence of 1.0E9 instructions <strong>and</strong> that the CPI of<br />

processors P1 <strong>and</strong> P2 do not change, determine the number of instructions that P2<br />

can execute in the same time that P1 needs to execute 1.0E9 instructions.<br />

1.12.3 [10] A common fallacy is to use MIPS (millions of<br />

instructions per second) to compare the performance of two different processors,<br />

<strong>and</strong> consider that the processor with the largest MIPS has the largest performance.<br />

Check if this is true for P1 <strong>and</strong> P2.<br />

1.12.4 [10] Another common performance figure is MFLOPS (millions<br />

of floating-point operations per second), defined as<br />

MFLOPS = No. FP operations / (execution time × 1E6)<br />

but this figure has the same problems as MIPS. Assume that 40% of the instructions<br />

executed on both P1 <strong>and</strong> P2 are floating-point instructions. Find the MFLOPS<br />

figures for the programs.<br />

1.13 Another pitfall cited in Section 1.10 is expecting to improve the overall<br />

performance of a computer by improving only one aspect of the computer. Consider<br />

a computer running a program that requires 250 s, with 70 s spent executing FP<br />

instructions, 85 s spent executing L/S instructions, and 40 s spent executing branch

instructions.<br />

1.13.1 [5] By how much is the total time reduced if the time for FP<br />

operations is reduced by 20%?


1.13 Exercises 59<br />

1.13.2 [5] By how much is the time for INT operations reduced if the<br />

total time is reduced by 20%?<br />

1.13.3 [5] Can the total time be reduced by 20% by reducing only

the time for branch instructions?<br />

1.14 Assume a program requires the execution of 50 × 10⁶ FP instructions, 110 × 10⁶ INT instructions, 80 × 10⁶ L/S instructions, and 16 × 10⁶ branch

instructions. The CPI for each type of instruction is 1, 1, 4, <strong>and</strong> 2, respectively.<br />

Assume that the processor has a 2 GHz clock rate.<br />

1.14.1 [10] By how much must we improve the CPI of FP instructions if<br />

we want the program to run two times faster?<br />

1.14.2 [10] By how much must we improve the CPI of L/S instructions<br />

if we want the program to run two times faster?<br />

1.14.3 [5] By how much is the execution time of the program improved<br />

if the CPI of INT <strong>and</strong> FP instructions is reduced by 40% <strong>and</strong> the CPI of L/S <strong>and</strong><br />

Branch is reduced by 30%?<br />

1.15 [5] When a program is adapted to run on multiple processors in<br />

a multiprocessor system, the execution time on each processor is comprised of<br />

computing time <strong>and</strong> the overhead time required for locked critical sections <strong>and</strong>/or<br />

to send data from one processor to another.<br />

Assume a program requires t = 100 s of execution time on one processor. When run<br />

on p processors, each processor requires t/p s, as well as an additional 4 s of overhead,

irrespective of the number of processors. Compute the per-processor execution<br />

time for 2, 4, 8, 16, 32, 64, <strong>and</strong> 128 processors. For each case, list the corresponding<br />

speedup relative to a single processor <strong>and</strong> the ratio between actual speedup versus<br />

ideal speedup (speedup if there was no overhead).<br />

§1.1, page 10: Discussion questions: many answers are acceptable.<br />

§1.4, page 24: DRAM memory: volatile, short access time of 50 to 70 nanoseconds,<br />

<strong>and</strong> cost per GB is $5 to $10. Disk memory: nonvolatile, access times are 100,000<br />

to 400,000 times slower than DRAM, <strong>and</strong> cost per GB is 100 times cheaper than<br />

DRAM. Flash memory: nonvolatile, access times are 100 to 1000 times slower than<br />

DRAM, <strong>and</strong> cost per GB is 7 to 10 times cheaper than DRAM.<br />

§1.5, page 28: 1, 3, <strong>and</strong> 4 are valid reasons. Answer 5 can be generally true because<br />

high volume can make the extra investment to reduce die size by, say, 10% a good<br />

economic decision, but it doesn’t have to be true.<br />

§1.6, page 33: 1. a: both, b: latency, c: neither. 7 seconds.<br />

§1.6, page 40: b.<br />

§1.10, page 51: a. <strong>Computer</strong> A has the higher MIPS rating. b. <strong>Computer</strong> B is faster.<br />

Answers to<br />

Check Yourself


2<br />

I speak Spanish<br />

to God, Italian to<br />

women, French to<br />

men, <strong>and</strong> German to<br />

my horse.<br />

Charles V, Holy Roman Emperor<br />

(1500–1558)<br />

Instructions:<br />

Language of the<br />

<strong>Computer</strong><br />

2.1 Introduction 62<br />

2.2 Operations of the <strong>Computer</strong> Hardware 63<br />

2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 66<br />

2.4 Signed <strong>and</strong> Unsigned Numbers 73<br />

2.5 Representing Instructions in the<br />

<strong>Computer</strong> 80<br />

2.6 Logical Operations 87<br />

2.7 Instructions for Making Decisions 90<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


2.2 Operations of the <strong>Computer</strong> Hardware 65<br />

instruction. Another difference from C is that comments always terminate at the<br />

end of a line.<br />

The natural number of oper<strong>and</strong>s for an operation like addition is three: the<br />

two numbers being added together <strong>and</strong> a place to put the sum. Requiring every<br />

instruction to have exactly three oper<strong>and</strong>s, no more <strong>and</strong> no less, conforms to the<br />

philosophy of keeping the hardware simple: hardware for a variable number of<br />

oper<strong>and</strong>s is more complicated than hardware for a fixed number. This situation<br />

illustrates the first of three underlying principles of hardware design:<br />

<strong>Design</strong> Principle 1: Simplicity favors regularity.<br />

We can now show, in the two examples that follow, the relationship of programs<br />

written in higher-level programming languages to programs in this more primitive<br />

notation.<br />

Compiling Two C Assignment Statements into MIPS<br />

This segment of a C program contains the five variables a, b, c, d, <strong>and</strong> e. Since<br />

Java evolved from C, this example <strong>and</strong> the next few work for either high-level<br />

programming language:<br />

EXAMPLE<br />

a = b + c;<br />

d = a – e;<br />

The translation from C to MIPS assembly language instructions is performed<br />

by the compiler. Show the MIPS code produced by a compiler.<br />

A MIPS instruction operates on two source oper<strong>and</strong>s <strong>and</strong> places the result<br />

in one destination oper<strong>and</strong>. Hence, the two simple statements above compile<br />

directly into these two MIPS assembly language instructions:<br />

ANSWER<br />

add a, b, c<br />

sub d, a, e<br />

Compiling a Complex C Assignment into MIPS<br />

A somewhat complex statement contains the five variables f, g, h, i, <strong>and</strong> j:<br />

EXAMPLE<br />

f = (g + h) – (i + j);<br />

What might a C compiler produce?


68 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

ANSWER<br />

The compiled program is very similar to the prior example, except we replace<br />

the variables with the register names mentioned above plus two temporary<br />

registers, $t0 <strong>and</strong> $t1, which correspond to the temporary variables above:<br />

add $t0,$s1,$s2 # register $t0 contains g + h<br />

add $t1,$s3,$s4 # register $t1 contains i + j<br />

sub $s0,$t0,$t1 # f gets $t0 – $t1, which is (g + h)–(i + j)<br />

data transfer<br />

instruction A comm<strong>and</strong><br />

that moves data between<br />

memory <strong>and</strong> registers.<br />

address A value used to<br />

delineate the location of<br />

a specific data element<br />

within a memory array.<br />

Memory Oper<strong>and</strong>s<br />

Programming languages have simple variables that contain single data elements,<br />

as in these examples, but they also have more complex data structures—arrays <strong>and</strong><br />

structures. These complex data structures can contain many more data elements<br />

than there are registers in a computer. How can a computer represent <strong>and</strong> access<br />

such large structures?<br />

Recall the five components of a computer introduced in Chapter 1 <strong>and</strong> repeated<br />

on page 61. The processor can keep only a small amount of data in registers, but<br />

computer memory contains billions of data elements. Hence, data structures<br />

(arrays <strong>and</strong> structures) are kept in memory.<br />

As explained above, arithmetic operations occur only on registers in MIPS<br />

instructions; thus, MIPS must include instructions that transfer data between<br />

memory <strong>and</strong> registers. Such instructions are called data transfer instructions.<br />

To access a word in memory, the instruction must supply the memory address.<br />

Memory is just a large, single-dimensional array, with the address acting as the<br />

index to that array, starting at 0. For example, in Figure 2.2, the address of the third<br />

data element is 2, <strong>and</strong> the value of Memory [2] is 10.<br />

Address    Data
   3        100
   2         10
   1        101
   0          1

FIGURE 2.2 Memory addresses <strong>and</strong> contents of memory at those locations. If these elements<br />

were words, these addresses would be incorrect, since MIPS actually uses byte addressing, with each word<br />

representing four bytes. Figure 2.3 shows the memory addressing for sequential word addresses.<br />

The data transfer instruction that copies data from memory to a register is<br />

traditionally called load. The format of the load instruction is the name of the<br />

operation followed by the register to be loaded, then a constant <strong>and</strong> register used to<br />

access memory. The sum of the constant portion of the instruction <strong>and</strong> the contents<br />

of the second register forms the memory address. The actual MIPS name for this<br />

instruction is lw, st<strong>and</strong>ing for load word.


2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 69<br />

Compiling an Assignment When an Oper<strong>and</strong> Is in Memory<br />

Let’s assume that A is an array of 100 words <strong>and</strong> that the compiler has<br />

associated the variables g <strong>and</strong> h with the registers $s1 <strong>and</strong> $s2 as before.<br />

Let’s also assume that the starting address, or base address, of the array is in<br />

$s3. Compile this C assignment statement:<br />

EXAMPLE<br />

g = h + A[8];<br />

Although there is a single operation in this assignment statement, one of<br />

the oper<strong>and</strong>s is in memory, so we must first transfer A[8] to a register. The<br />

address of this array element is the sum of the base of the array A, found in<br />

register $s3, plus the number to select element 8. The data should be placed<br />

in a temporary register for use in the next instruction. Based on Figure 2.2, the<br />

first compiled instruction is<br />

ANSWER<br />

lw $t0,8($s3) # Temporary reg $t0 gets A[8]

(We’ll be making a slight adjustment to this instruction, but we’ll use this<br />

simplified version for now.) The following instruction can operate on the value<br />

in $t0 (which equals A[8]) since it is in a register. The instruction must add<br />

h (contained in $s2) to A[8] (contained in $t0) <strong>and</strong> put the sum in the<br />

register corresponding to g (associated with $s1):<br />

add $s1,$s2,$t0 # g = h + A[8]

The constant in a data transfer instruction (8) is called the offset, <strong>and</strong> the<br />

register added to form the address ($s3) is called the base register.<br />

In addition to associating variables with registers, the compiler allocates data<br />

structures like arrays <strong>and</strong> structures to locations in memory. The compiler can then<br />

place the proper starting address into the data transfer instructions.<br />

Since 8-bit bytes are useful in many programs, virtually all architectures today<br />

address individual bytes. Therefore, the address of a word matches the address of<br />

one of the 4 bytes within the word, <strong>and</strong> addresses of sequential words differ by 4.<br />

For example, Figure 2.3 shows the actual MIPS addresses for the words in Figure<br />

2.2; the byte address of the third word is 8.<br />

In MIPS, words must start at addresses that are multiples of 4. This requirement<br />

is called an alignment restriction, <strong>and</strong> many architectures have it. (Chapter 4<br />

suggests why alignment leads to faster data transfers.)<br />

Hardware/<br />

Software<br />

Interface<br />

alignment restriction<br />

A requirement that data<br />

be aligned in memory on<br />

natural boundaries.


70 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

Byte Address    Data
     12         100
      8          10
      4         101
      0           1

FIGURE 2.3 Actual MIPS memory addresses <strong>and</strong> contents of memory for those words.<br />

The changed addresses are highlighted to contrast with Figure 2.2. Since MIPS addresses each byte, word<br />

addresses are multiples of 4: there are 4 bytes in a word.<br />

<strong>Computer</strong>s divide into those that use the address of the leftmost or “big end” byte<br />

as the word address versus those that use the rightmost or “little end” byte. MIPS is<br />

in the big-endian camp. Since the order matters only if you access the identical data<br />

both as a word <strong>and</strong> as four bytes, few need to be aware of the endianess. (Appendix<br />

A shows the two options to number bytes in a word.)<br />
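For readers curious how this looks from C, here is a tiny, illustrative check (not from the text) of which camp the machine running it falls into, by examining the byte stored at a word's lowest address:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 1;                        /* 0x00000001 */
    uint8_t first_byte = *(uint8_t *)&word;   /* byte at the lowest address */
    /* Big-endian machines keep the most significant byte (0) first;
       little-endian machines keep the least significant byte (1) first. */
    printf("this host is %s-endian\n", first_byte == 1 ? "little" : "big");
    return 0;
}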

Byte addressing also affects the array index. To get the proper byte address in the<br />

code above, the offset to be added to the base register $s3 must be 4 × 8, or 32, so

that the load address will select A[8] <strong>and</strong> not A[8/4]. (See the related pitfall on<br />

page 160 of Section 2.19.)<br />
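A one-line C check (illustrative, not from the text) shows the same 4 × 8 arithmetic that a compiler folds into the lw offset:

#include <stdio.h>

int main(void) {
    int A[100];
    /* With 4-byte words, element 8 lives 4 * 8 = 32 bytes past the base. */
    long offset = (char *)&A[8] - (char *)&A[0];
    printf("byte offset of A[8] = %ld\n", offset);   /* prints 32 */
    return 0;
}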

The instruction complementary to load is traditionally called store; it copies data<br />

from a register to memory. The format of a store is similar to that of a load: the<br />

name of the operation, followed by the register to be stored, then offset to select<br />

the array element, <strong>and</strong> finally the base register. Once again, the MIPS address is<br />

specified in part by a constant <strong>and</strong> in part by the contents of a register. The actual<br />

MIPS name is sw, st<strong>and</strong>ing for store word.<br />

Hardware/<br />

Software<br />

Interface<br />

As the addresses in loads <strong>and</strong> stores are binary numbers, we can see why the<br />

DRAM for main memory comes in binary sizes rather than in decimal sizes. That<br />

is, in gibibytes (2³⁰) or tebibytes (2⁴⁰), not in gigabytes (10⁹) or terabytes (10¹²); see

Figure 1.1.


2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 71<br />

Compiling Using Load <strong>and</strong> Store<br />

Assume variable h is associated with register $s2 <strong>and</strong> the base address of<br />

the array A is in $s3. What is the MIPS assembly code for the C assignment<br />

statement below?<br />

EXAMPLE<br />

A[12] = h + A[8];<br />

Although there is a single operation in the C statement, now two of the<br />

oper<strong>and</strong>s are in memory, so we need even more MIPS instructions. The first<br />

two instructions are the same as in the prior example, except this time we use<br />

the proper offset for byte addressing in the load word instruction to select<br />

A[8], <strong>and</strong> the add instruction places the sum in $t0:<br />

ANSWER<br />

lw $t0,32($s3) # Temporary reg $t0 gets A[8]<br />

add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[8]<br />

The final instruction stores the sum into A[12], using 48 (4 × 12) as the offset

<strong>and</strong> register $s3 as the base register.<br />

sw $t0,48($s3) # Stores h + A[8] back into A[12]<br />

Load word <strong>and</strong> store word are the instructions that copy words between<br />

memory <strong>and</strong> registers in the MIPS architecture. Other br<strong>and</strong>s of computers use<br />

other instructions along with load <strong>and</strong> store to transfer data. An architecture with<br />

such alternatives is the Intel x86, described in Section 2.17.<br />

Many programs have more variables than computers have registers. Consequently,<br />

the compiler tries to keep the most frequently used variables in registers <strong>and</strong> places<br />

the rest in memory, using loads <strong>and</strong> stores to move variables between registers <strong>and</strong><br />

memory. The process of putting less commonly used variables (or those needed<br />

later) into memory is called spilling registers.<br />

The hardware principle relating size <strong>and</strong> speed suggests that memory must be<br />

slower than registers, since there are fewer registers. This is indeed the case; data<br />

accesses are faster if data is in registers instead of memory.<br />

Moreover, data is more useful when in a register. A MIPS arithmetic instruction<br />

can read two registers, operate on them, <strong>and</strong> write the result. A MIPS data transfer<br />

instruction only reads one oper<strong>and</strong> or writes one oper<strong>and</strong>, without operating on it.<br />

Thus, registers take less time to access <strong>and</strong> have higher throughput than memory,<br />

making data in registers both faster to access <strong>and</strong> simpler to use. Accessing registers<br />

also uses less energy than accessing memory. To achieve highest performance <strong>and</strong><br />

conserve energy, an instruction set architecture must have a sufficient number of<br />

registers, <strong>and</strong> compilers must use registers efficiently.<br />

Hardware/<br />

Software<br />

Interface


74 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

We number the bits 0, 1, 2, 3, . . . from right to left in a word. The drawing below<br />

shows the numbering of bits within a MIPS word <strong>and</strong> the placement of the number<br />

1011 two<br />

:<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1<br />

(32 bits wide)<br />

least significant bit The<br />

rightmost bit in a MIPS<br />

word.<br />

most significant bit The<br />

leftmost bit in a MIPS<br />

word.<br />

Since words are drawn vertically as well as horizontally, leftmost <strong>and</strong> rightmost<br />

may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost<br />

bit (bit 0 above) <strong>and</strong> most significant bit to the leftmost bit (bit 31).<br />

The MIPS word is 32 bits long, so we can represent 2³² different 32-bit patterns. It is natural to let these combinations represent the numbers from 0 to 2³² − 1 (4,294,967,295 ten):

0000 0000 0000 0000 0000 0000 0000 0000 two<br />

= 0 ten<br />

0000 0000 0000 0000 0000 0000 0000 0001 two<br />

= 1 ten<br />

0000 0000 0000 0000 0000 0000 0000 0010 two<br />

= 2 ten<br />

. . . . . .<br />

1111 1111 1111 1111 1111 1111 1111 1101 two<br />

= 4,294,967,293 ten<br />

1111 1111 1111 1111 1111 1111 1111 1110 two<br />

= 4,294,967,294 ten<br />

1111 1111 1111 1111 1111 1111 1111 1111 two<br />

= 4,294,967,295 ten<br />

That is, 32-bit binary numbers can be represented in terms of the bit value times a<br />

power of 2 (here xi means the ith bit of x):<br />

(x31 × 2³¹) + (x30 × 2³⁰) + (x29 × 2²⁹) + … + (x1 × 2¹) + (x0 × 2⁰)

For reasons we will shortly see, these positive numbers are called unsigned numbers.<br />

Hardware/<br />

Software<br />

Interface<br />

Base 2 is not natural to human beings; we have 10 fingers <strong>and</strong> so find base 10<br />

natural. Why didn’t computers use decimal? In fact, the first commercial computer<br />

did offer decimal arithmetic. The problem was that the computer still used on<br />

<strong>and</strong> off signals, so a decimal digit was simply represented by several binary digits.<br />

Decimal proved so inefficient that subsequent computers reverted to all binary,<br />

converting to base 10 only for the relatively infrequent input/output events.<br />

Keep in mind that the binary bit patterns above are simply representatives of<br />

numbers. Numbers really have an infinite number of digits, with almost all being<br />

0 except for a few of the rightmost digits. We just don’t normally show leading 0s.<br />

Hardware can be designed to add, subtract, multiply, <strong>and</strong> divide these binary<br />

bit patterns. If the number that is the proper result of such operations cannot be<br />

represented by these rightmost hardware bits, overflow is said to have occurred.


2.4 Signed <strong>and</strong> Unsigned Numbers 75<br />

It’s up to the programming language, the operating system, <strong>and</strong> the program to<br />

determine what to do if overflow occurs.<br />

<strong>Computer</strong> programs calculate both positive <strong>and</strong> negative numbers, so we need a<br />

representation that distinguishes the positive from the negative. The most obvious<br />

solution is to add a separate sign, which conveniently can be represented in a single<br />

bit; the name for this representation is sign <strong>and</strong> magnitude.<br />

Alas, sign <strong>and</strong> magnitude representation has several shortcomings. First, it’s<br />

not obvious where to put the sign bit. To the right? To the left? Early computers<br />

tried both. Second, adders for sign <strong>and</strong> magnitude may need an extra step to set<br />

the sign because we can’t know in advance what the proper sign will be. Finally, a<br />

separate sign bit means that sign <strong>and</strong> magnitude has both a positive <strong>and</strong> a negative<br />

zero, which can lead to problems for inattentive programmers. As a result of these<br />

shortcomings, sign <strong>and</strong> magnitude representation was soon ab<strong>and</strong>oned.<br />

In the search for a more attractive alternative, the question arose as to what<br />

would be the result for unsigned numbers if we tried to subtract a large number<br />

from a small one. The answer is that it would try to borrow from a string of leading<br />

0s, so the result would have a string of leading 1s.<br />

Given that there was no obvious better alternative, the final solution was to pick<br />

the representation that made the hardware simple: leading 0s mean positive, <strong>and</strong><br />

leading 1s mean negative. This convention for representing signed binary numbers<br />

is called two’s complement representation:<br />

0000 0000 0000 0000 0000 0000 0000 0000 two<br />

= 0 ten<br />

0000 0000 0000 0000 0000 0000 0000 0001 two<br />

= 1 ten<br />

0000 0000 0000 0000 0000 0000 0000 0010 two<br />

= 2 ten<br />

. . . . . .<br />

0111 1111 1111 1111 1111 1111 1111 1101 two<br />

= 2,147,483,645 ten<br />

0111 1111 1111 1111 1111 1111 1111 1110 two<br />

= 2,147,483,646 ten<br />

0111 1111 1111 1111 1111 1111 1111 1111 two<br />

= 2,147,483,647 ten<br />

1000 0000 0000 0000 0000 0000 0000 0000 two<br />

= –2,147,483,648 ten<br />

1000 0000 0000 0000 0000 0000 0000 0001 two<br />

= –2,147,483,647 ten<br />

1000 0000 0000 0000 0000 0000 0000 0010 two<br />

= –2,147,483,646 ten<br />

. . . . . .<br />

1111 1111 1111 1111 1111 1111 1111 1101 two<br />

= –3 ten<br />

1111 1111 1111 1111 1111 1111 1111 1110 two<br />

= –2 ten<br />

1111 1111 1111 1111 1111 1111 1111 1111 two<br />

= –1 ten<br />

The positive half of the numbers, from 0 to 2,147,483,647 ten (2³¹ − 1), use the same representation as before. The following bit pattern (1000 . . . 0000 two) represents the most negative number −2,147,483,648 ten (−2³¹). It is followed by a declining set of negative numbers: −2,147,483,647 ten (1000 . . . 0001 two) down to −1 ten (1111 . . . 1111 two).

Two’s complement does have one negative number, −2,147,483,648 ten, that

has no corresponding positive number. Such imbalance was also a worry to the<br />

inattentive programmer, but sign <strong>and</strong> magnitude had problems for both the<br />

programmer <strong>and</strong> the hardware designer. Consequently, every computer today uses<br />

two’s complement binary representations for signed numbers.


76 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

Two’s complement representation has the advantage that all negative numbers<br />

have a 1 in the most significant bit. Consequently, hardware needs to test only<br />

this bit to see if a number is positive or negative (with the number 0 considered<br />

positive). This bit is often called the sign bit. By recognizing the role of the sign bit,<br />

we can represent positive <strong>and</strong> negative 32-bit numbers in terms of the bit value<br />

times a power of 2:<br />

(x31 × −2³¹) + (x30 × 2³⁰) + (x29 × 2²⁹) + … + (x1 × 2¹) + (x0 × 2⁰)

The sign bit is multiplied by −2³¹, and the rest of the bits are then multiplied by

positive versions of their respective base values.<br />

EXAMPLE<br />

Binary to Decimal Conversion<br />

What is the decimal value of this 32-bit two’s complement number?<br />

1111 1111 1111 1111 1111 1111 1111 1100 two<br />

ANSWER<br />

Substituting the number’s bit values into the formula above:<br />

(1 × −2³¹) + (1 × 2³⁰) + (1 × 2²⁹) + … + (1 × 2²) + (0 × 2¹) + (0 × 2⁰)
= −2³¹ + 2³⁰ + 2²⁹ + … + 2² + 0 + 0
= −2,147,483,648 ten + 2,147,483,644 ten
= −4 ten

We’ll see a shortcut to simplify conversion from negative to positive soon.<br />
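As a quick illustration (assuming a machine with 32-bit ints, which is essentially universal today), the same bit pattern can be printed under both interpretations in C:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t bits = 0xFFFFFFFCu;     /* 1111 ... 1100 in binary */
    /* The identical 32 bits, read as unsigned and as two's complement signed.
       (The conversion to int32_t is implementation-defined in older C standards,
       but every two's complement machine yields -4.) */
    int32_t value = (int32_t)bits;
    printf("as unsigned: %u\n", (unsigned)bits);   /* 4294967292 */
    printf("as signed  : %d\n", (int)value);       /* -4         */
    return 0;
}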

Just as an operation on unsigned numbers can overflow the capacity of hardware<br />

to represent the result, so can an operation on two’s complement numbers. Overflow<br />

occurs when the leftmost retained bit of the binary bit pattern is not the same as the<br />

infinite number of digits to the left (the sign bit is incorrect): a 0 on the left of the bit<br />

pattern when the number is negative or a 1 when the number is positive.<br />
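A tiny C example (illustrative; it routes the addition through unsigned arithmetic to stay within defined behavior) shows the symptom described above: the sum of two large positive numbers comes out with the wrong sign.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t a = 2000000000, b = 2000000000;
    /* The true sum, 4,000,000,000, does not fit in 32 signed bits, so the
       retained bits carry an incorrect (negative) sign bit. */
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);   /* wraps modulo 2^32 */
    printf("%d + %d = %d\n", a, b, sum);                  /* prints a negative number */
    return 0;
}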

Hardware/<br />

Software<br />

Interface<br />

Signed versus unsigned applies to loads as well as to arithmetic. The function of a<br />

signed load is to copy the sign repeatedly to fill the rest of the register—called sign<br />

extension—but its purpose is to place a correct representation of the number within<br />

that register. Unsigned loads simply fill with 0s to the left of the data, since the<br />

number represented by the bit pattern is unsigned.<br />

When loading a 32-bit word into a 32-bit register, the point is moot; signed <strong>and</strong><br />

unsigned loads are identical. MIPS does offer two flavors of byte loads: load byte (lb)<br />

treats the byte as a signed number <strong>and</strong> thus sign-extends to fill the 24 left-most bits<br />

of the register, while load byte unsigned (lbu) works with unsigned integers. Since C<br />

programs almost always use bytes to represent characters rather than consider bytes<br />

as very short signed integers, lbu is used practically exclusively for byte loads.


2.4 Signed <strong>and</strong> Unsigned Numbers 77<br />

Hardware/Software Interface

Unlike the numbers discussed above, memory addresses naturally start at 0 and continue to the largest address. Put another way, negative addresses make no sense. Thus, programs want to deal sometimes with numbers that can be positive or negative and sometimes with numbers that can be only positive. Some programming languages reflect this distinction. C, for example, names the former integers (declared as int in the program) and the latter unsigned integers (unsigned int). Some C style guides even recommend declaring the former as signed int to keep the distinction clear.

Let’s examine two useful shortcuts when working with two’s complement numbers. The first shortcut is a quick way to negate a two’s complement binary number. Simply invert every 0 to 1 and every 1 to 0, then add one to the result. This shortcut is based on the observation that the sum of a number and its inverted representation must be 111 . . . 111 two, which represents −1. Since x + x̄ = −1, therefore x + x̄ + 1 = 0, or x̄ + 1 = −x. (We use the notation x̄ to mean invert every bit in x from 0 to 1 and vice versa.)

EXAMPLE

Negation Shortcut

Negate 2 ten, and then check the result by negating −2 ten.

ANSWER

2 ten = 0000 0000 0000 0000 0000 0000 0000 0010 two

Negating this number by inverting the bits and adding one,

  1111 1111 1111 1111 1111 1111 1111 1101 two
+                                        1 two
= 1111 1111 1111 1111 1111 1111 1111 1110 two
= −2 ten

Going the other direction, 1111 1111 1111 1111 1111 1111 1111 1110 two is first inverted and then incremented:

  0000 0000 0000 0000 0000 0000 0000 0001 two
+                                        1 two
= 0000 0000 0000 0000 0000 0000 0000 0010 two
= 2 ten
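The shortcut is easy to verify in C (a minimal sketch; ~ is C’s invert-every-bit operator):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t x = 2;
    int32_t negated = ~x + 1;                    /* invert the bits, add one */
    printf("~2 + 1    = %d\n", negated);         /* prints -2 */
    printf("~(-2) + 1 = %d\n", ~negated + 1);    /* back to 2 */
    return 0;
}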


78 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

Our next shortcut tells us how to convert a binary number represented in n bits<br />

to a number represented with more than n bits. For example, the immediate field<br />

in the load, store, branch, add, <strong>and</strong> set on less than instructions contains a two’s<br />

complement 16-bit number, representing −32,768 ten (−2¹⁵) to 32,767 ten (2¹⁵ − 1).

To add the immediate field to a 32-bit register, the computer must convert that 16-<br />

bit number to its 32-bit equivalent. The shortcut is to take the most significant bit<br />

from the smaller quantity—the sign bit—<strong>and</strong> replicate it to fill the new bits of the<br />

larger quantity. The old nonsign bits are simply copied into the right portion of the<br />

new word. This shortcut is commonly called sign extension.<br />

EXAMPLE<br />

Sign Extension Shortcut<br />

Convert 16-bit binary versions of 2 ten and −2 ten to 32-bit binary numbers.

ANSWER<br />

The 16-bit binary version of the number 2 is<br />

0000 0000 0000 0010 two<br />

= 2 ten<br />

It is converted to a 32-bit number by making 16 copies of the value in the most<br />

significant bit (0) <strong>and</strong> placing that in the left-h<strong>and</strong> half of the word. The right<br />

half gets the old value:<br />

0000 0000 0000 0000 0000 0000 0000 0010 two<br />

= 2 ten<br />

Let’s negate the 16-bit version of 2 using the earlier shortcut. Thus,<br />

0000 0000 0000 0010 two<br />

becomes<br />

1111 1111 1111 1101 two<br />

+ 1 two<br />

= 1111 1111 1111 1110 two<br />

Creating a 32-bit version of the negative number means copying the sign bit<br />

16 times <strong>and</strong> placing it on the left:<br />

1111 1111 1111 1111 1111 1111 1111 1110 two<br />

= –2 ten<br />

This trick works because positive two’s complement numbers really have an infinite<br />

number of 0s on the left <strong>and</strong> negative two’s complement numbers have an infinite<br />

number of 1s. The binary bit pattern representing a number hides leading bits to fit<br />

the width of the hardware; sign extension simply restores some of them.
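In C the compiler applies the same rule whenever a narrow signed value is widened; the sketch below (illustrative values only) contrasts it with the zero extension used for unsigned data, mirroring the lb versus lbu distinction.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t  small  = -2;        /* 16-bit two's complement: 0xFFFE           */
    int32_t  wide   = small;     /* sign-extended: the sign bit is replicated  */
    uint16_t usmall = 0xFFFE;    /* same bits, treated as unsigned             */
    uint32_t uwide  = usmall;    /* zero-extended instead                      */
    printf("sign-extended: 0x%08X (%d)\n", (unsigned)wide, (int)wide);
    printf("zero-extended: 0x%08X (%u)\n", (unsigned)uwide, (unsigned)uwide);
    return 0;
}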


2.4 Signed <strong>and</strong> Unsigned Numbers 79<br />

Summary<br />

The main point of this section is that we need to represent both positive <strong>and</strong><br />

negative integers within a computer word, <strong>and</strong> although there are pros <strong>and</strong> cons to<br />

any option, the unanimous choice since 1965 has been two’s complement.<br />

Elaboration: For signed decimal numbers, we used “−” to represent negative because there are no limits to the size of a decimal number. Given a fixed word size, binary and hexadecimal (see Figure 2.4) bit strings can encode the sign; hence we do not normally use “+” or “−” with binary or hexadecimal notation.

Check Yourself

What is the decimal value of this 64-bit two's complement number?

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1000 two

1) –4 ten
2) –8 ten
3) –16 ten
4) 18,446,744,073,709,551,609 ten

Elaboration: Two's complement gets its name from the rule that the unsigned sum of an n-bit number and its n-bit negative is 2^n; hence, the negation or complement of a number x is 2^n – x, or its "two's complement."

A third alternative representation to two's complement and sign and magnitude is called one's complement. The negative of a one's complement is found by inverting each bit, from 0 to 1 and from 1 to 0, or x̄. This relation helps explain its name, since the complement of x is 2^n – x – 1. It was also an attempt to be a better solution than sign and magnitude, and several early scientific computers did use the notation. This representation is similar to two's complement except that it also has two 0s: 00 . . . 00 two is positive 0 and 11 . . . 11 two is negative 0. The most negative number, 10 . . . 000 two, represents –2,147,483,647 ten, and so the positives and negatives are balanced. One's complement adders did need an extra step to subtract a number, and hence two's complement dominates today.

A fi nal notation, which we will look at when we discuss fl oating point in Chapter 3,<br />

is to represent the most negative value by 00 . . . 000 two<br />

<strong>and</strong> the most positive value<br />

by 11 . . . 11 two<br />

, with 0 typically having the value 10 . . . 00 two<br />

. This is called a biased<br />

notation, since it biases the number such that the number plus the bias has a nonnegative<br />

representation.<br />

one's complement: A notation that represents the most negative value by 10 . . . 000 two and the most positive value by 01 . . . 11 two, leaving an equal number of negatives and positives but ending up with two zeros, one positive (00 . . . 00 two) and one negative (11 . . . 11 two). The term is also used to mean the inversion of every bit in a pattern: 0 to 1 and 1 to 0.

biased notation: A notation that represents the most negative value by 00 . . . 000 two and the most positive value by 11 . . . 11 two, with 0 typically having the value 10 . . . 00 two, thereby biasing the number such that the number plus the bias has a non-negative representation.


2.5 Representing Instructions in the Computer

This layout of the instruction is called the instruction format. As you can see<br />

from counting the number of bits, this MIPS instruction takes exactly 32 bits—the<br />

same size as a data word. In keeping with our design principle that simplicity favors<br />

regularity, all MIPS instructions are 32 bits long.<br />

To distinguish it from assembly language, we call the numeric version of<br />

instructions machine language <strong>and</strong> a sequence of such instructions machine code.<br />

It would appear that you would now be reading <strong>and</strong> writing long, tedious strings<br />

of binary numbers. We avoid that tedium by using a higher base than binary that<br />

converts easily into binary. Since almost all computer data sizes are multiples of<br />

4, hexadecimal (base 16) numbers are popular. Since base 16 is a power of 2,<br />

we can trivially convert by replacing each group of four binary digits by a single<br />

hexadecimal digit, <strong>and</strong> vice versa. Figure 2.4 converts between hexadecimal <strong>and</strong><br />

binary.<br />

instruction format: A form of representation of an instruction composed of fields of binary numbers.

machine language: Binary representation used for communication within a computer system.

hexadecimal: Numbers in base 16.

Hexadecimal Binary Hexadecimal Binary Hexadecimal Binary Hexadecimal Binary<br />

0 hex 0000 two 4 hex 0100 two 8 hex 1000 two c hex 1100 two<br />

1 hex 0001 two 5 hex 0101 two 9 hex 1001 two d hex 1101 two<br />

2 hex 0010 two 6 hex 0110 two a hex 1010 two e hex 1110 two<br />

3 hex 0011 two 7 hex 0111 two b hex 1011 two f hex 1111 two<br />

FIGURE 2.4 The hexadecimal-binary conversion table. Just replace one hexadecimal digit by the corresponding four binary digits,<br />

<strong>and</strong> vice versa. If the length of the binary number is not a multiple of 4, go from right to left.<br />

Because we frequently deal with different number bases, to avoid confusion<br />

we will subscript decimal numbers with ten, binary numbers with two, <strong>and</strong><br />

hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By<br />

the way, C <strong>and</strong> Java use the notation 0xnnnn for hexadecimal numbers.<br />

EXAMPLE

Binary to Hexadecimal and Back

Convert the following hexadecimal and binary numbers into the other base:

eca8 6420 hex

0001 0011 0101 0111 1001 1011 1101 1111 two



ANSWER<br />

Using Figure 2.4, the answer is just a table lookup one way:

eca8 6420 hex = 1110 1100 1010 1000 0110 0100 0010 0000 two

And then the other direction:

0001 0011 0101 0111 1001 1011 1101 1111 two = 1357 9bdf hex
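The four-bits-per-digit rule is easy to check mechanically. The following C sketch (our own code; the test value is taken from the example above) prints a 32-bit word in hexadecimal and then nibble by nibble in binary:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0xeca86420u;
    printf("hex: %08x\nbin:", (unsigned)word);
    for (int i = 7; i >= 0; i--) {              /* walk the 8 nibbles, left to right */
        unsigned nibble = (word >> (4 * i)) & 0xF;
        printf(" ");
        for (int b = 3; b >= 0; b--)            /* 4 binary digits per hex digit     */
            printf("%u", (nibble >> b) & 1u);
    }
    printf("\n");   /* prints: 1110 1100 1010 1000 0110 0100 0010 0000 */
    return 0;
}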

MIPS Fields<br />

MIPS fields are given names to make them easier to discuss:<br />

op rs rt rd shamt funct<br />

6 bits 5 bits 5 bits 5 bits 5 bits 6 bits<br />

Here is the meaning of each name of the fields in MIPS instructions:<br />

opcode: The field that denotes the operation and format of an instruction.

■ op: Basic operation of the instruction, traditionally called the opcode.<br />

■ rs: The first register source oper<strong>and</strong>.<br />

■ rt: The second register source oper<strong>and</strong>.<br />

■ rd: The register destination oper<strong>and</strong>. It gets the result of the operation.<br />

■ shamt: Shift amount. (Section 2.6 explains shift instructions <strong>and</strong> this term; it<br />

will not be used until then, <strong>and</strong> hence the field contains zero in this section.)<br />

■ funct: Function. This field, often called the function code, selects the specific<br />

variant of the operation in the op field.<br />

A problem occurs when an instruction needs longer fields than those shown<br />

above. For example, the load word instruction must specify two registers <strong>and</strong> a<br />

constant. If the address were to use one of the 5-bit fields in the format above, the<br />

constant within the load word instruction would be limited to only 2^5, or 32. This

constant is used to select elements from arrays or data structures, <strong>and</strong> it often needs<br />

to be much larger than 32. This 5-bit field is too small to be useful.<br />

Hence, we have a conflict between the desire to keep all instructions the same<br />

length <strong>and</strong> the desire to have a single instruction format. This leads us to the final<br />

hardware design principle:



<strong>Design</strong> Principle 3: Good design dem<strong>and</strong>s good compromises.<br />

The compromise chosen by the MIPS designers is to keep all instructions the<br />

same length, thereby requiring different kinds of instruction formats for different<br />

kinds of instructions. For example, the format above is called R-type (for register)<br />

or R-format. A second type of instruction format is called I-type (for immediate)<br />

or I-format <strong>and</strong> is used by the immediate <strong>and</strong> data transfer instructions. The fields<br />

of I-format are<br />

op rs rt constant or address<br />

6 bits 5 bits 5 bits 16 bits<br />

The 16-bit address means a load word instruction can load any word within a region of ±2^15 or 32,768 bytes (±2^13 or 8192 words) of the address in the base register rs. Similarly, add immediate is limited to constants no larger than ±2^15.

We see that more than 32 registers would be difficult in this format, as the rs <strong>and</strong> rt<br />

fields would each need another bit, making it harder to fit everything in one word.<br />

Let’s look at the load word instruction from page 71:<br />

lw $t0,32($s3) # Temporary reg $t0 gets A[8]<br />

Here, 19 (for $s3) is placed in the rs field, 8 (for $t0) is placed in the rt field, <strong>and</strong><br />

32 is placed in the address field. Note that the meaning of the rt field has changed<br />

for this instruction: in a load word instruction, the rt field specifies the destination<br />

register, which receives the result of the load.<br />

Although multiple formats complicate the hardware, we can reduce the complexity<br />

by keeping the formats similar. For example, the first three fields of the R-type <strong>and</strong><br />

I-type formats are the same size <strong>and</strong> have the same names; the length of the fourth<br />

field in I-type is equal to the sum of the lengths of the last three fields of R-type.<br />

In case you were wondering, the formats are distinguished by the values in the<br />

first field: each format is assigned a distinct set of values in the first field (op) so that<br />

the hardware knows whether to treat the last half of the instruction as three fields<br />

(R-type) or as a single field (I-type). Figure 2.5 shows the numbers used in each<br />

field for the MIPS instructions covered so far.<br />

Instruction Format op rs rt rd shamt funct address<br />

add R 0 reg reg reg 0 32 ten n.a.<br />

sub (subtract) R 0 reg reg reg 0 34 ten n.a.<br />

add immediate I 8 ten reg reg n.a. n.a. n.a. constant<br />

lw (load word) I 35 ten reg reg n.a. n.a. n.a. address<br />

sw (store word) I 43 ten reg reg n.a. n.a. n.a. address<br />

FIGURE 2.5 MIPS instruction encoding. In the table above, “reg” means a register number between 0<br />

<strong>and</strong> 31, “address” means a 16-bit address, <strong>and</strong> “n.a.” (not applicable) means this field does not appear in this<br />

format. Note that add <strong>and</strong> sub instructions have the same value in the op field; the hardware uses the funct<br />

field to decide the variant of the operation: add (32) or subtract (34).



EXAMPLE<br />

Translating MIPS Assembly Language into Machine Language<br />

We can now take an example all the way from what the programmer writes<br />

to what the computer executes. If $t1 has the base of the array A <strong>and</strong> $s2<br />

corresponds to h, the assignment statement<br />

A[300] = h + A[300];<br />

is compiled into<br />

lw $t0,1200($t1) # Temporary reg $t0 gets A[300]<br />

add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[300]<br />

sw $t0,1200($t1) # Stores h + A[300] back into A[300]<br />

What is the MIPS machine language code for these three instructions?<br />

ANSWER<br />

For convenience, let’s first represent the machine language instructions using<br />

decimal numbers. From Figure 2.5, we can determine the three machine<br />

language instructions:<br />

Op    rs    rt    rd    address/shamt    funct
35     9     8          1200
 0    18     8     8       0              32
43     9     8          1200

The lw instruction is identified by 35 (see Figure 2.5) in the first field<br />

(op). The base register 9 ($t1) is specified in the second field (rs), <strong>and</strong> the<br />

destination register 8 ($t0) is specified in the third field (rt). The offset to<br />

select A[300] (1200 = 300 × 4) is found in the final field (address).

The add instruction that follows is specified with 0 in the first field (op) <strong>and</strong><br />

32 in the last field (funct). The three register oper<strong>and</strong>s (18, 8, <strong>and</strong> 8) are found<br />

in the second, third, <strong>and</strong> fourth fields <strong>and</strong> correspond to $s2, $t0, <strong>and</strong> $t0.<br />

The sw instruction is identified with 43 in the first field. The rest of this final<br />

instruction is identical to the lw instruction.<br />

Since 1200 ten = 0000 0100 1011 0000 two, the binary equivalent to the decimal form is:

100011 01001 01000 0000 0100 1011 0000<br />

000000 10010 01000 01000 00000 100000<br />

101011 01001 01000 0000 0100 1011 0000
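The packing of fields into a 32-bit word can also be sketched in C. The helper names below (encode_r, encode_i) are ours, not MIPS terminology; the field widths and the three instructions come from the example above:

#include <stdio.h>
#include <stdint.h>

static uint32_t encode_r(unsigned op, unsigned rs, unsigned rt,
                         unsigned rd, unsigned shamt, unsigned funct) {
    /* R-type: op | rs | rt | rd | shamt | funct (6+5+5+5+5+6 bits) */
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
}

static uint32_t encode_i(unsigned op, unsigned rs, unsigned rt, uint16_t imm) {
    /* I-type: op | rs | rt | 16-bit constant or address */
    return (op << 26) | (rs << 21) | (rt << 16) | imm;
}

int main(void) {
    printf("%08x\n", (unsigned)encode_i(35, 9, 8, 1200));     /* lw  $t0,1200($t1): 8d2804b0 */
    printf("%08x\n", (unsigned)encode_r(0, 18, 8, 8, 0, 32)); /* add $t0,$s2,$t0:   02484020 */
    printf("%08x\n", (unsigned)encode_i(43, 9, 8, 1200));     /* sw  $t0,1200($t1): ad2804b0 */
    return 0;
}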



The dual of a shift left is a shift right. The actual names of the two MIPS shift instructions are shift left logical (sll) and shift right logical (srl). The following instruction performs the operation above, assuming that the original value was in register $s0 and the result should go in register $t2:

sll $t2,$s0,4 # reg $t2 = reg $s0 << 4 bits
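A quick C sketch of the effect (the starting value is ours, chosen only for illustration): shifting left by i bits multiplies by 2^i, so the sll above multiplies by 16:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t s0 = 9;
    uint32_t t2 = s0 << 4;             /* same effect as sll $t2,$s0,4 */
    printf("%u\n", (unsigned)t2);      /* prints 144, i.e. 9 * 2^4     */
    return 0;
}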



To place a value into one of these seas of 0s, there is the dual to AND, called<br />

OR. It is a bit-by-bit operation that places a 1 in the result if either oper<strong>and</strong> bit is<br />

a 1. To elaborate, if the registers $t1 <strong>and</strong> $t2 are unchanged from the preceding<br />

example, the result of the MIPS instruction<br />

or $t0,$t1,$t2 # reg $t0 = reg $t1 | reg $t2<br />

is this value in register $t0:

0000 0000 0000 0000 0011 1101 1100 0000 two

OR: A logical bit-by-bit operation with two operands that calculates a 1 if there is a 1 in either operand.

The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if one operand bit is a 0, and vice versa. Using our prior notation, it calculates x̄.

NOT: A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.

In keeping with the three-operand format, the designers of MIPS decided to include the instruction NOR (NOT OR) instead of NOT. If one operand is zero, then it is equivalent to NOT: A NOR 0 = NOT (A OR 0) = NOT (A).

NOR: A logical bit-by-bit operation with two operands that calculates the NOT of the OR of the two operands. That is, it calculates a 1 only if there is a 0 in both operands.

If the register $t1 is unchanged from the preceding example and register $t3 has the value 0, the result of the MIPS instruction

nor $t0,$t1,$t3 # reg $t0 = ~ (reg $t1 | reg $t3)

is this value in register $t0:

1111 1111 1111 1111 1100 0011 1111 1111 two

Figure 2.8 above shows the relationship between the C <strong>and</strong> Java operators <strong>and</strong> the<br />

MIPS instructions. Constants are useful in AND <strong>and</strong> OR logical operations as well<br />

as in arithmetic operations, so MIPS also provides the instructions <strong>and</strong> immediate<br />

(<strong>and</strong>i) <strong>and</strong> or immediate (ori). Constants are rare for NOR, since its main use is<br />

to invert the bits of a single oper<strong>and</strong>; thus, the MIPS instruction set architecture has<br />

no immediate version of NOR.<br />

Elaboration: The full MIPS instruction set also includes exclusive or (XOR), which sets the bit to 1 when two corresponding bits differ, and to 0 when they are the same. C allows bit fields or fields to be defined within words, both allowing objects to be packed within a word and to match an externally enforced interface such as an I/O device. All fields must fit within a single word. Fields are unsigned integers that can be as short as 1 bit. C compilers insert and extract fields using logical instructions in MIPS: and, or, sll, and srl.

Elaboration: Logical AND immediate and logical OR immediate put 0s into the upper 16 bits to form a 32-bit constant, unlike add immediate, which does sign extension.
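As a sketch of both answers to the Check Yourself question below (our own C code; the 3-bit field position is an arbitrary choice), here is a field isolated once with AND and once with a shift left followed by a shift right:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0x00003DC0u;             /* the OR result shown above                            */
    uint32_t masked  = word & 0x00000700u;   /* 1. AND: zero everything but the field (left in place) */
    uint32_t shifted = (word << 21) >> 29;   /* 2. shift left then right: field ends up right-justified */
    printf("%08x %u\n", (unsigned)masked, (unsigned)shifted);  /* prints: 00000500 5 */
    return 0;
}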

Check Yourself

Which operations can isolate a field in a word?

1. AND
2. A shift left followed by a shift right


2.7 Instructions for Making Decisions

The next assignment statement performs a single operation, <strong>and</strong> if all the<br />

oper<strong>and</strong>s are allocated to registers, it is just one instruction:<br />

add $s0,$s1,$s2 # f = g + h (skipped if i ≠ j)<br />

We now need to go to the end of the if statement. This example introduces<br />

another kind of branch, often called an unconditional branch. This instruction<br />

says that the processor always follows the branch. To distinguish between<br />

conditional <strong>and</strong> unconditional branches, the MIPS name for this type of<br />

instruction is jump, abbreviated as j (the label Exit is defined below).<br />

conditional branch: An instruction that requires the comparison of two values and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the comparison.

j Exit # go to Exit

The assignment statement in the else portion of the if statement can again be<br />

compiled into a single instruction. We just need to append the label Else to<br />

this instruction. We also show the label Exit that is after this instruction,<br />

showing the end of the if-then-else compiled code:<br />

Else: sub $s0,$s1,$s2 # f = g – h (skipped if i = j)

Exit:<br />

Notice that the assembler relieves the compiler <strong>and</strong> the assembly language<br />

programmer from the tedium of calculating addresses for branches, just as it does<br />

for calculating data addresses for loads <strong>and</strong> stores (see Section 2.12).<br />

[Flowchart: the test i == j selects between two boxes; when i = j the left box executes f = g + h, and when i ≠ j control goes to Else:, where the right box executes f = g – h; both paths rejoin at Exit:.]

FIGURE 2.9 Illustration of the options in the if statement above. The left box corresponds to the then part of the if statement, and the right box corresponds to the else part.



Hardware/Software Interface

Compilers frequently create branches <strong>and</strong> labels where they do not appear in<br />

the programming language. Avoiding the burden of writing explicit labels <strong>and</strong><br />

branches is one benefit of writing in high-level programming languages <strong>and</strong> is a<br />

reason coding is faster at that level.<br />

Loops<br />

Decisions are important both for choosing between two alternatives—found in if<br />

statements—<strong>and</strong> for iterating a computation—found in loops. The same assembly<br />

instructions are the building blocks for both cases.<br />

EXAMPLE<br />

Compiling a while Loop in C<br />

Here is a traditional loop in C:<br />

while (save[i] == k)<br />

i += 1;<br />

Assume that i <strong>and</strong> k correspond to registers $s3 <strong>and</strong> $s5 <strong>and</strong> the base of the<br />

array save is in $s6. What is the MIPS assembly code corresponding to this<br />

C segment?<br />

ANSWER<br />

The first step is to load save[i] into a temporary register. Before we can load<br />

save[i] into a temporary register, we need to have its address. Before we<br />

can add i to the base of array save to form the address, we must multiply the<br />

index i by 4 due to the byte addressing problem. Fortunately, we can use shift<br />

left logical, since shifting left by 2 bits multiplies by 2^2, or 4 (see page 88 in the

prior section). We need to add the label Loop to it so that we can branch back<br />

to that instruction at the end of the loop:<br />

Loop: sll $t1,$s3,2 # Temp reg $t1 = i * 4<br />

To get the address of save[i], we need to add $t1 <strong>and</strong> the base of save in $s6:<br />

add $t1,$t1,$s6 # $t1 = address of save[i]<br />

Now we can use that address to load save[i] into a temporary register:<br />

lw $t0,0($t1) # Temp reg $t0 = save[i]<br />

The next instruction performs the loop test, exiting if save[i] ≠ k:<br />

bne $t0,$s5,Exit # go to Exit if save[i] ≠ k



The next instruction adds 1 to i:<br />

addi $s3,$s3,1 # i = i + 1<br />

The end of the loop branches back to the while test at the top of the loop. We<br />

just add the Exit label after it, <strong>and</strong> we’re done:<br />

j Loop # go to Loop<br />

Exit:<br />

(See the exercises for an optimization of this sequence.)<br />

Such sequences of instructions that end in a branch are so fundamental to compiling<br />

that they are given their own buzzword: a basic block is a sequence of instructions<br />

without branches, except possibly at the end, <strong>and</strong> without branch targets or branch<br />

labels, except possibly at the beginning. One of the first early phases of compilation<br />

is breaking the program into basic blocks.<br />

basic block: A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning).

The test for equality or inequality is probably the most popular test, but sometimes it is useful to see if a variable is less than another variable. For example, a for loop may want to test to see if the index variable is less than 0. Such comparisons are accomplished in MIPS assembly language with an instruction that compares two registers and sets a third register to 1 if the first is less than the second; otherwise, it is set to 0. The MIPS instruction is called set on less than, or slt. For example,

slt $t0, $s3, $s4 # $t0 = 1 if $s3 < $s4

means that register $t0 is set to 1 if the value in register $s3 is less than the value<br />

in register $s4; otherwise, register $t0 is set to 0.<br />

Constant oper<strong>and</strong>s are popular in comparisons, so there is an immediate version<br />

of the set on less than instruction. To test if register $s2 is less than the constant<br />

10, we can just write<br />

slti $t0,$s2,10 # $t0 = 1 if $s2 < 10<br />

MIPS compilers use the slt, slti, beq, bne, <strong>and</strong> the fixed value of 0 (always<br />

available by reading register $zero) to create all relative conditions: equal, not<br />

equal, less than, less than or equal, greater than, greater than or equal.<br />

Hardware/Software Interface



Heeding von Neumann’s warning about the simplicity of the “equipment,” the<br />

MIPS architecture doesn’t include branch on less than because it is too complicated;<br />

either it would stretch the clock cycle time or it would take extra clock cycles per<br />

instruction. Two faster instructions are more useful.<br />

Hardware/Software Interface

Comparison instructions must deal with the dichotomy between signed <strong>and</strong><br />

unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit<br />

represents a negative number <strong>and</strong>, of course, is less than any positive number,<br />

which must have a 0 in the most significant bit. With unsigned integers, on the<br />

other h<strong>and</strong>, a 1 in the most significant bit represents a number that is larger than<br />

any that begins with a 0. (We’ll soon take advantage of this dual meaning of the<br />

most significant bit to reduce the cost of the array bounds checking.)<br />

MIPS offers two versions of the set on less than comparison to h<strong>and</strong>le these<br />

alternatives. Set on less than (slt) <strong>and</strong> set on less than immediate (slti) work with<br />

signed integers. Unsigned integers are compared using set on less than unsigned<br />

(sltu) <strong>and</strong> set on less than immediate unsigned (sltiu).<br />

EXAMPLE<br />

Signed versus Unsigned Comparison<br />

Suppose register $s0 has the binary number<br />

1111 1111 1111 1111 1111 1111 1111 1111 two<br />

<strong>and</strong> that register $s1 has the binary number<br />

0000 0000 0000 0000 0000 0000 0000 0001 two<br />

What are the values of registers $t0 <strong>and</strong> $t1 after these two instructions?<br />

slt $t0, $s0, $s1 # signed comparison
sltu $t1, $s0, $s1 # unsigned comparison

ANSWER<br />

The value in register $s0 represents –1 ten if it is a signed integer and 4,294,967,295 ten if it is an unsigned integer. The value in register $s1 represents 1 ten in either case. Then register $t0 has the value 1, since –1 ten < 1 ten, and register $t1 has the value 0, since 4,294,967,295 ten > 1 ten.
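The same comparison can be sketched in C (our own code): the single bit pattern 0xFFFFFFFF orders differently depending on whether it is treated as signed or unsigned:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t s0 = 0xFFFFFFFFu, s1 = 0x00000001u;
    int t0 = (int32_t)s0 < (int32_t)s1;    /* signed:   -1 < 1             -> 1 */
    int t1 = s0 < s1;                      /* unsigned: 4,294,967,295 < 1  -> 0 */
    printf("%d %d\n", t0, t1);             /* prints: 1 0 */
    return 0;
}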



Treating signed numbers as if they were unsigned gives us a low-cost way of checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays. The key is that negative integers in two's complement notation look like large numbers in unsigned notation; that is, the most significant bit is a sign bit in the former notation but a large part of the number in the latter. Thus, an unsigned comparison of x < y also checks if x is negative as well as if x is less than y.

EXAMPLE

Bounds Check Shortcut

Use this shortcut to reduce an index-out-of-bounds check: jump to IndexOutOfBounds if $s1 ≥ $t2 or if $s1 is negative.

ANSWER

The checking code just uses sltu to do both checks:

sltu $t0,$s1,$t2 # $t0=0 if $s1>=length or $s1<0
beq $t0,$zero,IndexOutOfBounds # if bad, go to IndexOutOfBounds
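A C sketch of the same trick (the function name in_bounds is ours): casting a possibly negative index to unsigned lets a single comparison do both checks:

#include <stdio.h>
#include <stdint.h>

static int in_bounds(int32_t index, uint32_t length) {
    return (uint32_t)index < length;    /* 0 <= index < length, in one test */
}

int main(void) {
    printf("%d %d %d\n",
           in_bounds(5, 10),     /* 1: inside                               */
           in_bounds(10, 10),    /* 0: index equals length                  */
           in_bounds(-1, 10));   /* 0: -1 looks like 4,294,967,295 unsigned */
    return 0;
}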


2.8 Supporting Procedures in Computer Hardware

You can think of a procedure like a spy who leaves with a secret plan, acquires<br />

resources, performs the task, covers his or her tracks, <strong>and</strong> then returns to the point<br />

of origin with the desired result. Nothing else should be perturbed once the mission<br />

is complete. Moreover, a spy operates on only a “need to know” basis, so the spy<br />

can’t make assumptions about his employer.<br />

Similarly, in the execution of a procedure, the program must follow these six<br />

steps:<br />

1. Put parameters in a place where the procedure can access them.<br />

2. Transfer control to the procedure.<br />

3. Acquire the storage resources needed for the procedure.<br />

4. Perform the desired task.<br />

5. Put the result value in a place where the calling program can access it.<br />

6. Return control to the point of origin, since a procedure can be called from<br />

several points in a program.<br />

As mentioned above, registers are the fastest place to hold data in a computer,<br />

so we want to use them as much as possible. MIPS software follows the following<br />

convention for procedure calling in allocating its 32 registers:<br />

■ $a0–$a3: four argument registers in which to pass parameters<br />

■ $v0–$v1: two value registers in which to return values<br />

■ $ra: one return address register to return to the point of origin<br />

In addition to allocating these registers, MIPS assembly language includes an<br />

instruction just for the procedures: it jumps to an address <strong>and</strong> simultaneously<br />

saves the address of the following instruction in register $ra. The jump-<strong>and</strong>-link<br />

instruction (jal) is simply written<br />

jal ProcedureAddress<br />

The link portion of the name means that an address or link is formed that points<br />

to the calling site to allow the procedure to return to the proper address. This “link,”<br />

stored in register $ra (register 31), is called the return address. The return address

is needed because the same procedure could be called from several parts of the<br />

program.<br />

To support such situations, computers like MIPS use the jump register instruction (jr), introduced above to help with case statements, meaning an unconditional jump to the address specified in a register:

jr $ra

jump-and-link instruction: An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).

return address: A link to the calling site that allows a procedure to return to the proper address; in MIPS it is stored in register $ra.



caller: The program that instigates a procedure and provides the necessary parameter values.

callee: A procedure that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller.

program counter (PC): The register containing the address of the instruction in the program being executed.

stack: A data structure for spilling registers organized as a last-in-first-out queue.

stack pointer: A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found. In MIPS, it is register $sp.

push: Add element to stack.

pop: Remove element from stack.

The jump register instruction jumps to the address stored in register $ra—<br />

which is just what we want. Thus, the calling program, or caller, puts the parameter<br />

values in $a0–$a3 <strong>and</strong> uses jal X to jump to procedure X (sometimes named<br />

the callee). The callee then performs the calculations, places the results in $v0 <strong>and</strong><br />

$v1, <strong>and</strong> returns control to the caller using jr $ra.<br />

Implicit in the stored-program idea is the need to have a register to hold the<br />

address of the current instruction being executed. For historical reasons, this<br />

register is almost always called the program counter, abbreviated PC in the MIPS<br />

architecture, although a more sensible name would have been instruction address<br />

register. The jal instruction actually saves PC 4 in register $ra to link to the<br />

following instruction to set up the procedure return.<br />

Using More Registers<br />

Suppose a compiler needs more registers for a procedure than the four argument<br />

<strong>and</strong> two return value registers. Since we must cover our tracks after our mission<br />

is complete, any registers needed by the caller must be restored to the values that<br />

they contained before the procedure was invoked. This situation is an example in<br />

which we need to spill registers to memory, as mentioned in the Hardware/Software<br />

Interface section above.<br />

The ideal data structure for spilling registers is a stack—a last-in-first-out<br />

queue. A stack needs a pointer to the most recently allocated address in the stack<br />

to show where the next procedure should place the registers to be spilled or where<br />

old register values are found. The stack pointer is adjusted by one word for each<br />

register that is saved or restored. MIPS software reserves register 29 for the stack<br />

pointer, giving it the obvious name $sp. Stacks are so popular that they have their<br />

own buzzwords for transferring data to <strong>and</strong> from the stack: placing data onto the<br />

stack is called a push, <strong>and</strong> removing data from the stack is called a pop.<br />

By historical precedent, stacks “grow” from higher addresses to lower addresses.<br />

This convention means that you push values onto the stack by subtracting from the<br />

stack pointer. Adding to the stack pointer shrinks the stack, thereby popping values<br />

off the stack.<br />
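A small C sketch of this convention (the array, names, and sizes are ours, purely for illustration): the pointer moves toward lower addresses on a push and back up on a pop:

#include <stdio.h>
#include <stdint.h>

#define WORDS 8
static uint32_t memory[WORDS];            /* stand-in for the stack's memory       */
static uint32_t *sp = memory + WORDS;     /* start just past the high-address end  */

static void     push(uint32_t value) { *--sp = value; }  /* subtract, then store   */
static uint32_t pop(void)            { return *sp++; }   /* load, then add         */

int main(void) {
    push(17);
    push(42);
    uint32_t first = pop();
    uint32_t second = pop();
    printf("%u %u\n", (unsigned)first, (unsigned)second);  /* prints: 42 17 (last in, first out) */
    return 0;
}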

EXAMPLE<br />

Compiling a C Procedure That Doesn’t Call Another Procedure<br />

Let’s turn the example on page 65 from Section 2.2 into a C procedure:<br />

int leaf_example (int g, int h, int i, int j)
{
  int f;
  f = (g + h) - (i + j);
  return f;
}

What is the compiled MIPS assembly code?



ANSWER

The parameter variables g, h, i, and j correspond to the argument registers $a0, $a1, $a2, and $a3, and f corresponds to $s0. The compiled program starts with the label of the procedure:

leaf_example:

The next step is to save the registers used by the procedure. The C assignment<br />

statement in the procedure body is identical to the example on page 68, which<br />

uses two temporary registers. Thus, we need to save three registers: $s0, $t0,<br />

<strong>and</strong> $t1. We “push” the old values onto the stack by creating space for three<br />

words (12 bytes) on the stack <strong>and</strong> then store them:<br />

addi $sp, $sp, -12 # adjust stack to make room for 3 items

sw $t1, 8($sp) # save register $t1 for use afterwards<br />

sw $t0, 4($sp) # save register $t0 for use afterwards<br />

sw $s0, 0($sp) # save register $s0 for use afterwards<br />

Figure 2.10 shows the stack before, during, <strong>and</strong> after the procedure call.<br />

The next three statements correspond to the body of the procedure, which<br />

follows the example on page 68:<br />

add $t0,$a0,$a1 # register $t0 contains g + h<br />

add $t1,$a2,$a3 # register $t1 contains i + j<br />

sub $s0,$t0,$t1 # f = $t0 – $t1, which is (g + h)–(i + j)<br />

To return the value of f, we copy it into a return value register:<br />

add $v0,$s0,$zero # returns f ($v0 = $s0 + 0)<br />

Before returning, we restore the three old values of the registers we saved by<br />

“popping” them from the stack:<br />

lw $s0, 0($sp) # restore register $s0 for caller<br />

lw $t0, 4($sp) # restore register $t0 for caller<br />

lw $t1, 8($sp) # restore register $t1 for caller<br />

addi $sp,$sp,12 # adjust stack to delete 3 items<br />

The procedure ends with a jump register using the return address:<br />

jr $ra # jump back to calling routine<br />

In the previous example, we used temporary registers <strong>and</strong> assumed their old<br />

values must be saved <strong>and</strong> restored. To avoid saving <strong>and</strong> restoring a register whose<br />

value is never used, which might happen with a temporary register, MIPS software<br />

separates 18 of the registers into two groups:<br />

■ $t0–$t9: temporary registers that are not preserved by the callee (called<br />

procedure) on a procedure call<br />

■ $s0–$s7: saved registers that must be preserved on a procedure call (if<br />

used, the callee saves <strong>and</strong> restores them)



[Figure: three snapshots of the stack drawn from high address (top) to low address (bottom). In (a) $sp points at the top of the stack before the call; in (b) $sp has moved down past the saved contents of registers $t1, $t0, and $s0; in (c) $sp is back where it started after the call.]

FIGURE 2.10 The values of the stack pointer and the stack (a) before, (b) during, and (c) after the procedure call. The stack pointer always points to the "top" of the stack, or the last word in the stack in this drawing.

This simple convention reduces register spilling. In the example above, since the<br />

caller does not expect registers $t0 <strong>and</strong> $t1 to be preserved across a procedure<br />

call, we can drop two stores <strong>and</strong> two loads from the code. We still must save <strong>and</strong><br />

restore $s0, since the callee must assume that the caller needs its value.<br />

Nested Procedures<br />

Procedures that do not call others are called leaf procedures. Life would be simple if<br />

all procedures were leaf procedures, but they aren’t. Just as a spy might employ other<br />

spies as part of a mission, who in turn might use even more spies, so do procedures<br />

invoke other procedures. Moreover, recursive procedures even invoke “clones” of<br />

themselves. Just as we need to be careful when using registers in procedures, more<br />

care must also be taken when invoking nonleaf procedures.<br />

For example, suppose that the main program calls procedure A with an argument<br />

of 3, by placing the value 3 into register $a0 <strong>and</strong> then using jal A. Then suppose<br />

that procedure A calls procedure B via jal B with an argument of 7, also placed<br />

in $a0. Since A hasn’t finished its task yet, there is a conflict over the use of register<br />

$a0. Similarly, there is a conflict over the return address in register $ra, since it<br />

now has the return address for B. Unless we take steps to prevent the problem, this<br />

conflict will eliminate procedure A’s ability to return to its caller.<br />

One solution is to push all the other registers that must be preserved onto<br />

the stack, just as we did with the saved registers. The caller pushes any argument<br />

registers ($a0–$a3) or temporary registers ($t0–$t9) that are needed after<br />

the call. The callee pushes the return address register $ra <strong>and</strong> any saved registers<br />

($s0–$s7) used by the callee. The stack pointer $sp is adjusted to account for the<br />

number of registers placed on the stack. Upon the return, the registers are restored<br />

from memory <strong>and</strong> the stack pointer is readjusted.



EXAMPLE

Compiling a Recursive C Procedure, Showing Nested Procedure Linking

Let's tackle a recursive procedure that calculates factorial:

int fact (int n)
{
  if (n < 1) return (1);
  else return (n * fact(n - 1));
}

What is the MIPS assembly code?<br />

ANSWER

The parameter variable n corresponds to the argument register $a0. The compiled program starts with the label of the procedure and then saves two registers on the stack, the return address and $a0:

fact:<br />

addi $sp, $sp, -8 # adjust stack for 2 items

sw $ra, 4($sp) # save the return address<br />

sw $a0, 0($sp) # save the argument n<br />

The first time fact is called, sw saves an address in the program that called<br />

fact. The next two instructions test whether n is less than 1, going to L1 if<br />

n ≥ 1.<br />

slti $t0,$a0,1 # test for n < 1<br />

beq $t0,$zero,L1 # if n >= 1, go to L1<br />

If n is less than 1, fact returns 1 by putting 1 into a value register: it adds 1 to<br />

0 <strong>and</strong> places that sum in $v0. It then pops the two saved values off the stack<br />

<strong>and</strong> jumps to the return address:<br />

addi $v0,$zero,1 # return 1<br />

addi $sp,$sp,8 # pop 2 items off stack<br />

jr $ra # return to caller<br />

Before popping two items off the stack, we could have loaded $a0 <strong>and</strong><br />

$ra. Since $a0 <strong>and</strong> $ra don’t change when n is less than 1, we skip those<br />

instructions.<br />

If n is not less than 1, the argument n is decremented <strong>and</strong> then fact is<br />

called again with the decremented value:<br />

L1: addi $a0,$a0,-1 # n >= 1: argument gets (n - 1)

jal fact # call fact with (n - 1)



The next instruction is where fact returns. Now the old return address <strong>and</strong><br />

old argument are restored, along with the stack pointer:<br />

lw $a0, 0($sp) # return from jal: restore argument n<br />

lw $ra, 4($sp) # restore the return address<br />

addi $sp, $sp, 8 # adjust stack pointer to pop 2 items<br />

Next, the value register $v0 gets the product of old argument $a0 <strong>and</strong><br />

the current value of the value register. We assume a multiply instruction is<br />

available, even though it is not covered until Chapter 3:<br />

mul $v0,$a0,$v0 # return n * fact (n – 1)<br />

Finally, fact jumps again to the return address:<br />

jr $ra # return to the caller<br />

Hardware/Software Interface

global pointer: The register that is reserved to point to the static area.

A C variable is generally a location in storage, <strong>and</strong> its interpretation depends both<br />

on its type <strong>and</strong> storage class. Examples include integers <strong>and</strong> characters (see Section<br />

2.9). C has two storage classes: automatic <strong>and</strong> static. Automatic variables are local to<br />

a procedure <strong>and</strong> are discarded when the procedure exits. Static variables exist across<br />

exits from <strong>and</strong> entries to procedures. C variables declared outside all procedures<br />

are considered static, as are any variables declared using the keyword static. The<br />

rest are automatic. To simplify access to static data, MIPS software reserves another<br />

register, called the global pointer, or $gp.<br />

Figure 2.11 summarizes what is preserved across a procedure call. Note that<br />

several schemes preserve the stack, guaranteeing that the caller will get the same<br />

data back on a load from the stack as it stored onto the stack. The stack above $sp<br />

is preserved simply by making sure the callee does not write above $sp; $sp is<br />

Preserved<br />

Saved registers: $s0–$s7<br />

Stack pointer register: $sp<br />

Return address register: $ra<br />

Stack above the stack pointer<br />

Not preserved<br />

Temporary registers: $t0–$t9<br />

Argument registers: $a0–$a3<br />

Return value registers: $v0–$v1<br />

Stack below the stack pointer<br />

FIGURE 2.11 What is <strong>and</strong> what is not preserved across a procedure call. If the software relies<br />

on the frame pointer register or on the global pointer register, discussed in the following subsections, they<br />

are also preserved.



itself preserved by the callee adding exactly the same amount that was subtracted<br />

from it; <strong>and</strong> the other registers are preserved by saving them on the stack (if they<br />

are used) <strong>and</strong> restoring them from there.<br />

Allocating Space for New Data on the Stack<br />

The final complexity is that the stack is also used to store variables that are local<br />

to the procedure but do not fit in registers, such as local arrays or structures. The<br />

segment of the stack containing a procedure’s saved registers <strong>and</strong> local variables is<br />

called a procedure frame or activation record. Figure 2.12 shows the state of the<br />

stack before, during, <strong>and</strong> after the procedure call.<br />

Some MIPS software uses a frame pointer ($fp) to point to the first word of<br />

the frame of a procedure. A stack pointer might change during the procedure, <strong>and</strong><br />

so references to a local variable in memory might have different offsets depending<br />

on where they are in the procedure, making the procedure harder to underst<strong>and</strong>.<br />

Alternatively, a frame pointer offers a stable base register within a procedure for<br />

local memory-references. Note that an activation record appears on the stack<br />

whether or not an explicit frame pointer is used. We’ve been avoiding using $fp by<br />

avoiding changes to $sp within a procedure: in our examples, the stack is adjusted<br />

only on entry <strong>and</strong> exit of the procedure.<br />

procedure frame: Also called activation record. The segment of the stack containing a procedure's saved registers and local variables.

frame pointer: A value denoting the location of the saved registers and local variables for a given procedure.

[Figure: three snapshots of the stack drawn from high address (top) to low address (bottom). In (a) $fp and $sp sit together at the pre-call top of stack; in (b) the new frame holds the saved argument registers (if any), the saved return address, the saved saved registers (if any), and local arrays and structures (if any), with $fp at the first word of the frame and $sp at its bottom; in (c) $fp and $sp are restored to their original positions.]

FIGURE 2.12 Illustration of the stack allocation (a) before, (b) during, <strong>and</strong> (c) after the<br />

procedure call. The frame pointer ($fp) points to the first word of the frame, often a saved argument<br />

register, <strong>and</strong> the stack pointer ($sp) points to the top of the stack. The stack is adjusted to make room for<br />

all the saved registers <strong>and</strong> any memory-resident local variables. Since the stack pointer may change during<br />

program execution, it’s easier for programmers to reference variables via the stable frame pointer, although it<br />

could be done just with the stack pointer <strong>and</strong> a little address arithmetic. If there are no local variables on the<br />

stack within a procedure, the compiler will save time by not setting <strong>and</strong> restoring the frame pointer. When a<br />

frame pointer is used, it is initialized using the address in $sp on a call, <strong>and</strong> $sp is restored using $fp. This<br />

information is also found in Column 4 of the MIPS Reference Data Card at the front of this book.



Figure 2.14 summarizes the register conventions for the MIPS assembly<br />

language. This convention is another example of making the common case fast:<br />

most procedures can be satisfied with up to 4 arguments, 2 registers for a return<br />

value, 8 saved registers, <strong>and</strong> 10 temporary registers without ever going to memory.<br />

Name      Register number   Usage                                          Preserved on call?
$zero     0                 The constant value 0                           n.a.
$v0–$v1   2–3               Values for results and expression evaluation   no
$a0–$a3   4–7               Arguments                                      no
$t0–$t7   8–15              Temporaries                                    no
$s0–$s7   16–23             Saved                                          yes
$t8–$t9   24–25             More temporaries                               no
$gp       28                Global pointer                                 yes
$sp       29                Stack pointer                                  yes
$fp       30                Frame pointer                                  yes
$ra       31                Return address                                 yes

FIGURE 2.14 MIPS register conventions. Register 1, called $at, is reserved for the assembler (see Section 2.12), and registers 26–27, called $k0–$k1, are reserved for the operating system. This information is also found in Column 2 of the MIPS Reference Data Card at the front of this book.

Elaboration: What if there are more than four parameters? The MIPS convention is<br />

to place the extra parameters on the stack just above the frame pointer. The procedure<br />

then expects the first four parameters to be in registers $a0 through $a3 and the rest

in memory, addressable via the frame pointer.<br />

As mentioned in the caption of Figure 2.12, the frame pointer is convenient because<br />

all references to variables in the stack within a procedure will have the same offset.<br />

The frame pointer is not necessary, however. The GNU MIPS C compiler uses a frame<br />

pointer, but the C compiler from MIPS does not; it treats register 30 as another save<br />

register ($s8).<br />

Elaboration: Some recursive procedures can be implemented iteratively without using recursion. Iteration can significantly improve performance by removing the overhead associated with recursive procedure calls. For example, consider a procedure used to accumulate a sum:

int sum (int n, int acc) {
  if (n > 0)
    return sum(n - 1, acc + n);
  else
    return acc;
}

Consider the procedure call sum(3,0). This will result in recursive calls to sum(2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four times.
sum(2,3), sum(1,5), <strong>and</strong> sum(0,6), <strong>and</strong> then the result 6 will be returned four


2.9 Communicating with People

EXAMPLE

ASCII versus Binary Numbers

We could represent numbers as strings of ASCII digits instead of as integers. How much does storage increase if the number 1 billion is represented in ASCII versus a 32-bit integer?

ANSWER

One billion is 1,000,000,000, so it would take 10 ASCII digits, each 8 bits long. Thus the storage expansion would be (10 × 8)/32 or 2.5. Beyond the expansion in storage, the hardware to add, subtract, multiply, and divide such decimal numbers is difficult and would consume more energy. Such difficulties explain why computing professionals are raised to believe that binary is natural and that the occasional decimal computer is bizarre.

A series of instructions can extract a byte from a word, so load word <strong>and</strong> store<br />

word are sufficient for transferring bytes as well as words. Because of the popularity<br />

of text in some programs, however, MIPS provides instructions to move bytes. Load<br />

byte (lb) loads a byte from memory, placing it in the rightmost 8 bits of a register.<br />

Store byte (sb) takes a byte from the rightmost 8 bits of a register <strong>and</strong> writes it to<br />

memory. Thus, we copy a byte with the sequence<br />

lb $t0,0($sp) # Read byte from source
sb $t0,0($gp) # Write byte to destination

Characters are normally combined into strings, which have a variable number<br />

of characters. There are three choices for representing a string: (1) the first position<br />

of the string is reserved to give the length of a string, (2) an accompanying variable<br />

has the length of the string (as in a structure), or (3) the last position of a string is<br />

indicated by a character used to mark the end of a string. C uses the third choice,<br />

terminating a string with a byte whose value is 0 (named null in ASCII). Thus,<br />

the string “Cal” is represented in C by the following 4 bytes, shown as decimal<br />

numbers: 67, 97, 108, 0. (As we shall see, Java uses the first option.)
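A two-line C check of that claim (our own snippet): printing the bytes of the string "Cal" shows the terminating 0:

#include <stdio.h>

int main(void) {
    const char s[] = "Cal";        /* 4 bytes: 'C', 'a', 'l', and the null byte */
    for (int i = 0; i < 4; i++)
        printf("%d ", s[i]);       /* prints: 67 97 108 0 */
    printf("\n");
    return 0;
}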



EXAMPLE<br />

Compiling a String Copy Procedure, Showing How to Use C Strings<br />

The procedure strcpy copies string y to string x using the null byte<br />

termination convention of C:<br />

void strcpy (char x[], char y[])
{
  int i;
  i = 0;
  while ((x[i] = y[i]) != '\0') /* copy & test byte */
    i += 1;
}

What is the MIPS assembly code?<br />

ANSWER<br />

Below is the basic MIPS assembly code segment. Assume that base addresses<br />

for arrays x <strong>and</strong> y are found in $a0 <strong>and</strong> $a1, while i is in $s0. strcpy<br />

adjusts the stack pointer <strong>and</strong> then saves the saved register $s0 on the stack:<br />

strcpy:<br />

addi $sp,$sp,-4 # adjust stack for 1 more item

sw $s0, 0($sp) # save $s0<br />

To initialize i to 0, the next instruction sets $s0 to 0 by adding 0 to 0 <strong>and</strong><br />

placing that sum in $s0:<br />

add $s0,$zero,$zero # i = 0 + 0<br />

This is the beginning of the loop. The address of y[i] is first formed by adding<br />

i to y[]:<br />

L1: add $t1,$s0,$a1 # address of y[i] in $t1<br />

Note that we don’t have to multiply i by 4 since y is an array of bytes <strong>and</strong> not<br />

of words, as in prior examples.<br />

To load the character in y[i], we use load byte unsigned, which puts the<br />

character into $t2:<br />

lbu<br />

$t2, 0($t1) # $t2 = y[i]<br />

A similar address calculation puts the address of x[i] in $t3, <strong>and</strong> then the<br />

character in $t2 is stored at that address.



add $t3,$s0,$a0 # address of x[i] in $t3<br />

sb $t2, 0($t3) # x[i] = y[i]<br />

Next, we exit the loop if the character was 0. That is, we exit if it is the last<br />

character of the string:<br />

beq<br />

$t2,$zero,L2 # if y[i] == 0, go to L2<br />

If not, we increment i <strong>and</strong> loop back:<br />

addi $s0, $s0,1 # i = i + 1<br />

j L1 # go to L1<br />

If we don’t loop back, it was the last character of the string; we restore $s0 <strong>and</strong><br />

the stack pointer, <strong>and</strong> then return.<br />

L2: lw $s0, 0($sp) # y[i] == 0: end of string.<br />

# Restore old $s0<br />

addi $sp,$sp,4 # pop 1 word off stack<br />

jr $ra # return<br />

String copies usually use pointers instead of arrays in C to avoid the operations<br />

on i in the code above. See Section 2.14 for an explanation of arrays versus<br />

pointers.<br />
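As a sketch of that pointer style (our own code, named copy_string so it does not collide with the C library's strcpy), the whole loop collapses into the while condition:

#include <stdio.h>

void copy_string(char *x, const char *y) {
    while ((*x++ = *y++) != '\0')   /* copy a byte, then test the byte just copied */
        ;                           /* all the work happens in the condition       */
}

int main(void) {
    char buf[4];
    copy_string(buf, "Cal");
    printf("%s\n", buf);            /* prints: Cal */
    return 0;
}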

Since the procedure strcpy above is a leaf procedure, the compiler could<br />

allocate i to a temporary register <strong>and</strong> avoid saving <strong>and</strong> restoring $s0. Hence,<br />

instead of thinking of the $t registers as being just for temporaries, we can think of<br />

them as registers that the callee should use whenever convenient. When a compiler<br />

finds a leaf procedure, it exhausts all temporary registers before using registers it<br />

must save.<br />

Characters <strong>and</strong> Strings in Java<br />

Unicode is a universal encoding of the alphabets of most human languages. Figure<br />

2.16 gives a list of Unicode alphabets; there are almost as many alphabets in Unicode<br />

as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for<br />

characters. By default, it uses 16 bits to represent a character.



Latin      | Malayalam                             | Tagbanwa                | General Punctuation
Greek      | Sinhala                               | Khmer                   | Spacing Modifier Letters
Cyrillic   | Thai                                  | Mongolian               | Currency Symbols
Armenian   | Lao                                   | Limbu                   | Combining Diacritical Marks
Hebrew     | Tibetan                               | Tai Le                  | Combining Marks for Symbols
Arabic     | Myanmar                               | Kangxi Radicals         | Superscripts and Subscripts
Syriac     | Georgian                              | Hiragana                | Number Forms
Thaana     | Hangul Jamo                           | Katakana                | Mathematical Operators
Devanagari | Ethiopic                              | Bopomofo                | Mathematical Alphanumeric Symbols
Bengali    | Cherokee                              | Kanbun                  | Braille Patterns
Gurmukhi   | Unified Canadian Aboriginal Syllabic  | Shavian                 | Optical Character Recognition
Gujarati   | Ogham                                 | Osmanya                 | Byzantine Musical Symbols
Oriya      | Runic                                 | Cypriot Syllabary       | Musical Symbols
Tamil      | Tagalog                               | Tai Xuan Jing Symbols   | Arrows
Telugu     | Hanunoo                               | Yijing Hexagram Symbols | Box Drawing
Kannada    | Buhid                                 | Aegean Numbers          | Geometric Shapes

FIGURE 2.16 Example alphabets in Unicode. Unicode version 4.0 has more than 160 "blocks," which is their name for a collection of symbols. Each block is a multiple of 16. For example, Greek starts at 0370hex, and Cyrillic at 0400hex. The first three columns show 48 blocks that correspond to human languages in roughly Unicode numerical order. The last column has 16 blocks that are multilingual and are not in order. A 16-bit encoding, called UTF-16, is the default. A variable-length encoding, called UTF-8, keeps the ASCII subset as eight bits and uses 16 or 32 bits for the other characters. UTF-32 uses 32 bits per character. To learn more, see www.unicode.org.

The MIPS instruction set has explicit instructions to load and store such 16-bit quantities, called halfwords. Load half (lh) loads a halfword from memory, placing it in the rightmost 16 bits of a register. Like load byte, load half (lh) treats the halfword as a signed number and thus sign-extends to fill the 16 leftmost bits of the register, while load halfword unsigned (lhu) works with unsigned integers. Thus, lhu is the more popular of the two. Store half (sh) takes a halfword from the rightmost 16 bits of a register and writes it to memory. We copy a halfword with the sequence

lhu $t0,0($sp) # Read halfword (16 bits) from source
sh  $t0,0($gp) # Write halfword (16 bits) to destination

Strings are a standard Java class with special built-in support and predefined methods for concatenation, comparison, and conversion. Unlike C, Java includes a word that gives the length of the string, similar to Java arrays.



32-Bit Immediate Operands

Although constants are frequently short and fit into the 16-bit field, sometimes they are bigger. The MIPS instruction set includes the instruction load upper immediate (lui) specifically to set the upper 16 bits of a constant in a register, allowing a subsequent instruction to specify the lower 16 bits of the constant. Figure 2.17 shows the operation of lui.

EXAMPLE: Loading a 32-Bit Constant

What is the MIPS assembly code to load this 32-bit constant into register $s0?

0000 0000 0011 1101 0000 1001 0000 0000

ANSWER

First, we would load the upper 16 bits, which is 61 in decimal, using lui:

lui $s0, 61 # 61 decimal = 0000 0000 0011 1101 binary

The value of register $s0 afterward is

0000 0000 0011 1101 0000 0000 0000 0000

The next step is to insert the lower 16 bits, whose decimal value is 2304:

ori $s0, $s0, 2304 # 2304 decimal = 0000 1001 0000 0000

The final value in register $s0 is the desired value:

0000 0000 0011 1101 0000 1001 0000 0000

The machine language version of lui $t0, 255 (register $t0 is register 8):

001111 00000 01000 0000 0000 1111 1111

Contents of register $t0 after executing lui $t0, 255:

0000 0000 1111 1111 0000 0000 0000 0000

FIGURE 2.17 The effect of the lui instruction. The instruction lui transfers the 16-bit immediate constant field value into the leftmost 16 bits of the register, filling the lower 16 bits with 0s.



Either the compiler or the assembler must break large constants into pieces and then reassemble them into a register. As you might expect, the immediate field's size restriction may be a problem for memory addresses in loads and stores as well as for constants in immediate instructions. If this job falls to the assembler, as it does for MIPS software, then the assembler must have a temporary register available in which to create the long values. This need is a reason for the register $at (assembler temporary), which is reserved for the assembler.

Hence, the symbolic representation of the MIPS machine language is no longer limited by the hardware, but by whatever the creator of an assembler chooses to include (see Section 2.12). We stick close to the hardware to explain the architecture of the computer, noting when we use the enhanced language of the assembler that is not found in the processor.

Hardware/Software Interface

Elaboration: Creating 32-bit constants needs care. The instruction addi copies the left-most bit of the 16-bit immediate field of the instruction into the upper 16 bits of a word. Logical or immediate from Section 2.6 loads 0s into the upper 16 bits and hence is used by the assembler in conjunction with lui to create 32-bit constants.
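For example, here is a sketch of the kind of expansion an assembler could generate for a load from a data word at the made-up address 1001 8004hex, which does not fit in a 16-bit offset; the full address is built in $at with lui and ori, and the load then uses a zero offset (the exact expansion a real assembler chooses may differ):

lui $at, 0x1001      # upper 16 bits of the address into $at
ori $at, $at, 0x8004 # OR in the lower 16 bits
lw  $t0, 0($at)      # $t0 = Memory[1001 8004hex]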

Addressing in Branches and Jumps

The MIPS jump instructions have the simplest addressing. They use the final MIPS instruction format, called the J-type, which consists of 6 bits for the operation field and the rest of the bits for the address field. Thus,

j 10000 # go to location 10000

could be assembled into this format (it's actually a bit more complicated, as we will see):

  2      | 10000
  6 bits | 26 bits

where the value of the jump opcode is 2 and the jump address is 10000.

Unlike the jump instruction, the conditional branch instruction must specify two operands in addition to the branch address. Thus,

bne $s0,$s1,Exit # go to Exit if $s0 ≠ $s1

is assembled into this instruction, leaving only 16 bits for the branch address:

  5      | 16     | 17     | Exit
  6 bits | 5 bits | 5 bits | 16 bits



If addresses of the program had to fit in this 16-bit field, it would mean that no program could be bigger than 2^16, which is far too small to be a realistic option today. An alternative would be to specify a register that would always be added to the branch address, so that a branch instruction would calculate the following:

Program counter = Register + Branch address

PC-relative addressing: An addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction.

This sum allows the program to be as large as 2^32 and still be able to use conditional branches, solving the branch address size problem. Then the question is, which register?

The answer comes from seeing how conditional branches are used. Conditional branches are found in loops and in if statements, so they tend to branch to a nearby instruction. For example, about half of all conditional branches in SPEC benchmarks go to locations less than 16 instructions away. Since the program counter (PC) contains the address of the current instruction, we can branch within ±2^15 words of the current instruction if we use the PC as the register to be added to the address. Almost all loops and if statements are much smaller than 2^16 words, so the PC is the ideal choice.

This form of branch addressing is called PC-relative addressing. As we shall see in Chapter 4, it is convenient for the hardware to increment the PC early to point to the next instruction. Hence, the MIPS address is actually relative to the address of the following instruction (PC + 4) as opposed to the current instruction (PC). It is yet another example of making the common case fast, which in this case is addressing nearby instructions.

Like most recent computers, MIPS uses PC-relative addressing for all conditional branches, because the destination of these instructions is likely to be close to the branch. On the other hand, jump-and-link instructions invoke procedures that have no reason to be near the call, so they normally use other forms of addressing. Hence, the MIPS architecture offers long addresses for procedure calls by using the J-type format for both jump and jump-and-link instructions.

Since all MIPS instructions are 4 bytes long, MIPS stretches the distance of the branch by having PC-relative addressing refer to the number of words to the next instruction instead of the number of bytes. Thus, the 16-bit field can branch four times as far by interpreting the field as a relative word address rather than as a relative byte address. Similarly, the 26-bit field in jump instructions is also a word address, meaning that it represents a 28-bit byte address.
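As a quick worked example with made-up addresses: if the bne above sits at byte address 80016 and Exit labels the instruction at byte address 80032, the 16-bit field holds the word distance from the following instruction, (80032 − (80016 + 4)) / 4 = 3; the hardware later recovers the target as 80020 + 3 × 4 = 80032.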

Elaboration: Since the PC is 32 bits, 4 bits must come from somewhere else for jumps. The MIPS jump instruction replaces only the lower 28 bits of the PC, leaving the upper 4 bits of the PC unchanged. The loader and linker (Section 2.12) must be careful to avoid placing a program across an address boundary of 256 MB (64 million instructions); otherwise, a jump must be replaced by a jump register instruction preceded by other instructions to load the full 32-bit address into a register.



Hardware/Software Interface

Most conditional branches are to a nearby location, but occasionally they branch far away, farther than can be represented in the 16 bits of the conditional branch instruction. The assembler comes to the rescue just as it did with large addresses or constants: it inserts an unconditional jump to the branch target, and inverts the condition so that the branch decides whether to skip the jump.

EXAMPLE: Branching Far Away

Given a branch on register $s0 being equal to register $s1,

beq $s0, $s1, L1

replace it by a pair of instructions that offers a much greater branching distance.

ANSWER

These instructions replace the short-address conditional branch:

bne $s0, $s1, L2
j L1
L2:

addressing mode: One of several addressing regimes delimited by their varied use of operands and/or addresses.

MIPS Addressing Mode Summary

Multiple forms of addressing are generically called addressing modes. Figure 2.18 shows how operands are identified for each addressing mode. The MIPS addressing modes are the following (a representative instruction for each mode appears after the list):

1. Immediate addressing, where the operand is a constant within the instruction itself
2. Register addressing, where the operand is a register
3. Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction
4. PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction
5. Pseudodirect addressing, where the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC
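As a quick illustration (a sketch; the particular instructions and labels are just representative), one MIPS instruction per addressing mode:

addi $t0, $t1, 4     # 1. immediate: the constant 4 sits inside the instruction
add  $t0, $t1, $t2   # 2. register: all operands are registers
lw   $t0, 8($t1)     # 3. base or displacement: address = $t1 + 8
beq  $t0, $t1, L1    # 4. PC-relative: L1 is encoded as a word offset from PC + 4
j    L2              # 5. pseudodirect: 26-bit word address concatenated with upper PC bits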



Decoding Machine Language

Sometimes you are forced to reverse-engineer machine language to create the original assembly language. One example is when looking at a "core dump." Figure 2.19 shows the MIPS encoding of the fields for the MIPS machine language. This figure helps when translating by hand between assembly language and machine language.

EXAMPLE: Decoding Machine Code

What is the assembly language statement corresponding to this machine instruction?

00af8020hex

ANSWER

The first step in converting hexadecimal to binary is to find the op fields:

0000 0000 1010 1111 1000 0000 0010 0000

We look at the op field to determine the operation. Referring to Figure 2.19, when bits 31-29 are 000 and bits 28-26 are 000, it is an R-format instruction. Let's reformat the binary instruction into R-format fields, listed in Figure 2.20:

op     rs    rt    rd    shamt funct
000000 00101 01111 10000 00000 100000

The bottom portion of Figure 2.19 determines the operation of an R-format instruction. In this case, bits 5-3 are 100 and bits 2-0 are 000, which means this binary pattern represents an add instruction.

We decode the rest of the instruction by looking at the field values. The decimal values are 5 for the rs field, 15 for rt, and 16 for rd (shamt is unused). Figure 2.14 shows that these numbers represent registers $a1, $t7, and $s0. Now we can reveal the assembly instruction:

add $s0,$a1,$t7



Name       | Fields                                               | Comments
Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits  | All MIPS instructions are 32 bits long
R-format   | op     | rs     | rt     | rd     | shamt  | funct   | Arithmetic instruction format
I-format   | op     | rs     | rt     | address/immediate         | Transfer, branch, immediate format
J-format   | op     | target address                              | Jump instruction format

FIGURE 2.20 MIPS instruction formats.

Figure 2.20 shows all the MIPS instruction formats. Figure 2.1 on page 64 shows the MIPS assembly language revealed in this chapter. The remaining hidden portion of MIPS instructions deals mainly with arithmetic and real numbers, which are covered in the next chapter.

Check Yourself

I. What is the range of addresses for conditional branches in MIPS (K = 1024)?

1. Addresses between 0 and 64K − 1
2. Addresses between 0 and 256K − 1
3. Addresses up to about 32K before the branch to about 32K after
4. Addresses up to about 128K before the branch to about 128K after

II. What is the range of addresses for jump and jump and link in MIPS (M = 1024K)?

1. Addresses between 0 and 64M − 1
2. Addresses between 0 and 256M − 1
3. Addresses up to about 32M before the branch to about 32M after
4. Addresses up to about 128M before the branch to about 128M after
5. Anywhere within a block of 64M addresses where the PC supplies the upper 6 bits
6. Anywhere within a block of 256M addresses where the PC supplies the upper 4 bits

III. What is the MIPS assembly language instruction corresponding to the machine instruction with the value 0000 0000hex?

1. j
2. R-format
3. addi
4. sll
5. mfc0
6. Undefined opcode: there is no legal instruction that corresponds to 0



2.11 Parallelism and Instructions: Synchronization

Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation usually means some tasks are writing new values that others must read. To know when a task is finished writing so that it is safe for another to read, the tasks need to synchronize. If they don't synchronize, there is a danger of a data race, where the results of the program can change depending on how events happen to occur.

data race: Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, and they occur one after another.

For example, recall the analogy of the eight reporters writing a story on page 44 of Chapter 1. Suppose one reporter needs to read all the prior sections before writing a conclusion. Hence, he or she must know when the other reporters have finished their sections, so that there is no danger of sections being changed afterwards. That is, they had better synchronize the writing and reading of each section so that the conclusion will be consistent with what is printed in the prior sections.

In computing, synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. In this section, we focus on the implementation of lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create regions where only a single processor can operate, called a mutual exclusion, as well as to implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. That is, nothing else can interpose itself between the read and the write of the memory location. Without such a capability, the cost of building basic synchronization primitives will be high and will increase unreasonably as the processor count increases.

There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically. In general, architects do not expect users to employ the basic hardware primitives, but instead expect that the primitives will be used by system programmers to build a synchronization library, a process that is often complex and tricky.

Let's start with one such hardware primitive and show how it can be used to build a basic synchronization primitive. One typical operation for building synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a register for a value in memory.

To see how to use this to build a basic synchronization primitive, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and 1 is used to indicate that the lock is unavailable. A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access, and 0 otherwise.



In the latter case, the value is also changed to 1, preventing any competing exchange in another processor from also retrieving a 0.

For example, consider two processors that each try to do the exchange simultaneously: this race is broken, since exactly one of the processors will perform the exchange first, returning 0, and the second processor will return 1 when it does the exchange. The key to using the exchange primitive to implement synchronization is that the operation is atomic: the exchange is indivisible, and two simultaneous exchanges will be ordered by the hardware. It is impossible for two processors trying to set the synchronization variable in this manner to both think they have simultaneously set the variable.

Implementing a single atomic memory operation introduces some challenges in the design of the processor, since it requires both a memory read and a write in a single, uninterruptible instruction.

An alternative is to have a pair of instructions in which the second instruction returns a value showing whether the pair of instructions was executed as if the pair were atomic. The pair of instructions is effectively atomic if it appears as if all other operations executed by any processor occurred before or after the pair. Thus, when an instruction pair is effectively atomic, no other processor can change the value between the instruction pair.

In MIPS this pair of instructions includes a special load called a load linked and a special store called a store conditional. These instructions are used in sequence: if the contents of the memory location specified by the load linked are changed before the store conditional to the same address occurs, then the store conditional fails. The store conditional is defined to both store the value of a (presumably different) register in memory and to change the value of that register to a 1 if it succeeds and to a 0 if it fails. Since the load linked returns the initial value, and the store conditional returns 1 only if it succeeds, the following sequence implements an atomic exchange on the memory location specified by the contents of $s1:

again: addi $t0,$zero,1      # copy locked value
       ll   $t1,0($s1)       # load linked
       sc   $t0,0($s1)       # store conditional
       beq  $t0,$zero,again  # branch if store fails
       add  $s4,$zero,$t1    # put load value in $s4

Any time a processor intervenes and modifies the value in memory between the ll and sc instructions, the sc returns 0 in $t0, causing the code sequence to try again. At the end of this sequence the contents of $s4 and the memory location specified by $s1 have been atomically exchanged.
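Building on that sequence, here is a sketch (not from the text) of a simple spin lock that uses the same load linked/store conditional pair; the lock word's address is assumed to be in $s1, with 0 meaning free and 1 meaning taken:

lock:   addi $t0,$zero,1      # value to install: 1 = locked
        ll   $t1,0($s1)       # load linked: read the current lock value
        bne  $t1,$zero,lock   # spin while another processor holds the lock
        sc   $t0,0($s1)       # try to claim the lock atomically
        beq  $t0,$zero,lock   # store conditional failed: something interfered, retry
        ...                   # critical section goes here
unlock: sw   $zero,0($s1)     # release: an ordinary store of 0 is enough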

Elaboration: Although it was presented for multiprocessor synchronization, atomic exchange is also useful for the operating system in dealing with multiple processes in a single processor. To make sure nothing interferes in a single processor, the store conditional also fails if the processor does a context switch between the two instructions (see Chapter 5).



C program
  → (Compiler)
Assembly language program
  → (Assembler)
Object: Machine language module    +    Object: Library routine (machine language)
  → (Linker)
Executable: Machine language program
  → (Loader)
Memory

FIGURE 2.21 A translation hierarchy for C. A high-level language program is first compiled into an assembly language program and then assembled into an object module in machine language. The linker combines multiple modules with library routines to resolve all references. The loader then places the machine code into the proper memory locations for execution by the processor. To speed up the translation process, some steps are skipped or combined. Some compilers produce object modules directly, and some systems use linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked library routines are x.a, dynamically linked library routines are x.so, and executable files by default are called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect.

pseudoinstruction: A common variation of assembly language instructions often treated as if it were an instruction in its own right.

Assembler

Since assembly language is an interface to higher-level software, the assembler can also treat common variations of machine language instructions as if they were instructions in their own right. The hardware need not implement these instructions; however, their appearance in assembly language simplifies translation and programming. Such instructions are called pseudoinstructions.

As mentioned above, the MIPS hardware makes sure that register $zero always has the value 0. That is, whenever register $zero is used, it supplies a 0, and the programmer cannot change the value of register $zero. Register $zero is used to create the assembly language instruction that copies the contents of one register to another. Thus the MIPS assembler accepts this instruction even though it is not found in the MIPS architecture:

move $t0,$t1 # register $t0 gets register $t1



The assembler converts this assembly language instruction into the machine language equivalent of the following instruction:

add $t0,$zero,$t1 # register $t0 gets 0 + register $t1

The MIPS assembler also converts blt (branch on less than) into the two instructions slt and bne mentioned in the example on page 95. Other examples include bgt, bge, and ble. It also converts branches to faraway locations into a branch and jump. As mentioned above, the MIPS assembler allows 32-bit constants to be loaded into a register despite the 16-bit limit of the immediate instructions.
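For instance, a sketch of the kind of expansion the assembler could produce for blt, using the reserved register $at (the exact expansion is up to the assembler):

blt $s1, $s2, Less    # pseudoinstruction written by the programmer
# might become:
slt $at, $s1, $s2     # $at = 1 if $s1 < $s2
bne $at, $zero, Less  # branch to Less if the comparison was true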

In summary, pseudoinstructions give MIPS a richer set of assembly language instructions than those implemented by the hardware. The only cost is reserving one register, $at, for use by the assembler. If you are going to write assembly programs, use pseudoinstructions to simplify your task. To understand the MIPS architecture and be sure to get best performance, however, study the real MIPS instructions found in Figures 2.1 and 2.19.

Assemblers will also accept numbers in a variety of bases. In addition to binary and decimal, they usually accept a base that is more succinct than binary yet converts easily to a bit pattern. MIPS assemblers use hexadecimal.

Such features are convenient, but the primary task of an assembler is assembly into machine code. The assembler turns the assembly language program into an object file, which is a combination of machine language instructions, data, and information needed to place instructions properly in memory.

To produce the binary version of each instruction in the assembly language program, the assembler must determine the addresses corresponding to all labels. Assemblers keep track of labels used in branches and data transfer instructions in a symbol table. As you might expect, the table contains pairs of symbols and addresses.

The object file for UNIX systems typically contains six distinct pieces:

■ The object file header describes the size and position of the other pieces of the object file.
■ The text segment contains the machine language code.
■ The static data segment contains data allocated for the life of the program. (UNIX allows programs to use both static data, which is allocated throughout the program, and dynamic data, which can grow or shrink as needed by the program. See Figure 2.13.)
■ The relocation information identifies instructions and data words that depend on absolute addresses when the program is loaded into memory.
■ The symbol table contains the remaining labels that are not defined, such as external references.

symbol table: A table that matches names of labels to the addresses of the memory words that instructions occupy.



■ The debugging information contains a concise description of how the modules were compiled so that a debugger can associate machine instructions with C source files and make data structures readable.

The next subsection shows how to attach such routines that have already been assembled, such as library routines.

linker: Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.

executable file: A functional program in the format of an object file that contains no unresolved references. It can contain symbol tables and debugging information. A "stripped executable" does not contain that information. Relocation information may be included for the loader.

Linker

What we have presented so far suggests that a single change to one line of one procedure requires compiling and assembling the whole program. Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful for standard library routines, because programmers would be compiling and assembling routines that by definition almost never change. An alternative is to compile and assemble each procedure independently, so that a change to one line would require compiling and assembling only one procedure. This alternative requires a new systems program, called a link editor or linker, which takes all the independently assembled machine language programs and "stitches" them together.

There are three steps for the linker:

1. Place code and data modules symbolically in memory.
2. Determine the addresses of data and instruction labels.
3. Patch both the internal and external references.

The linker uses the relocation information and symbol table in each object module to resolve all undefined labels. Such references occur in branch instructions, jump instructions, and data addresses, so the job of this program is much like that of an editor: it finds the old addresses and replaces them with the new addresses. Editing is the origin of the name "link editor," or linker for short. The reason a linker is useful is that it is much faster to patch code than it is to recompile and reassemble.

If all external references are resolved, the linker next determines the memory locations each module will occupy. Recall that Figure 2.13 on page 104 shows the MIPS convention for allocation of program and data to memory. Since the files were assembled in isolation, the assembler could not know where a module's instructions and data would be placed relative to other modules. When the linker places a module in memory, all absolute references, that is, memory addresses that are not relative to a register, must be relocated to reflect its true location.

The linker produces an executable file that can be run on a computer. Typically, this file has the same format as an object file, except that it contains no unresolved references. It is possible to have partially linked files, such as library routines, that still have unresolved addresses and hence result in object files.



EXAMPLE: Linking Object Files

Link the two object files below. Show updated addresses of the first few instructions of the completed executable file. We show the instructions in assembly language just to make the example understandable; in reality, the instructions would be numbers.

Note that in the object files we have highlighted the addresses and symbols that must be updated in the link process: the instructions that refer to the addresses of procedures A and B and the instructions that refer to the addresses of data words X and Y.

Object file for procedure A:

Object file header     | Name: Procedure A | Text size: 100hex | Data size: 20hex
Text segment           | Address 0: lw $a0, 0($gp) | Address 4: jal 0 | ...
Data segment           | 0: (X) | ...
Relocation information | Address 0, instruction type lw, dependency X | Address 4, instruction type jal, dependency B
Symbol table           | Label X: - | Label B: -

Object file for procedure B:

Object file header     | Name: Procedure B | Text size: 200hex | Data size: 30hex
Text segment           | Address 0: sw $a1, 0($gp) | Address 4: jal 0 | ...
Data segment           | 0: (Y) | ...
Relocation information | Address 0, instruction type sw, dependency Y | Address 4, instruction type jal, dependency A
Symbol table           | Label Y: - | Label A: -


128 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

ANSWER

Procedure A needs to find the address for the variable labeled X to put in the load instruction and to find the address of procedure B to place in the jal instruction. Procedure B needs the address of the variable labeled Y for the store instruction and the address of procedure A for its jal instruction.

From Figure 2.13 on page 104, we know that the text segment starts at address 40 0000hex and the data segment at 1000 0000hex. The text of procedure A is placed at the first address and its data at the second. The object file header for procedure A says that its text is 100hex bytes and its data is 20hex bytes, so the starting address for procedure B text is 40 0100hex, and its data starts at 1000 0020hex.

Executable file header | Text size: 300hex | Data size: 50hex

Text segment:
  0040 0000hex: lw $a0, 8000hex($gp)
  0040 0004hex: jal 40 0100hex
  ...
  0040 0100hex: sw $a1, 8020hex($gp)
  0040 0104hex: jal 40 0000hex
  ...

Data segment:
  1000 0000hex: (X)
  ...
  1000 0020hex: (Y)
  ...


Now the linker updates the address fields of the instructions. It uses the instruction type field to know the format of the address to be edited. We have two types here:

1. The jals are easy because they use pseudodirect addressing. The jal at address 40 0004hex gets 40 0100hex (the address of procedure B) in its address field, and the jal at 40 0104hex gets 40 0000hex (the address of procedure A) in its address field.

2. The load and store addresses are harder because they are relative to a base register. This example uses the global pointer as the base register. Figure 2.13 shows that $gp is initialized to 1000 8000hex. To get the address 1000 0000hex (the address of word X), we place 8000hex in the address field of lw at address 40 0000hex. Similarly, we place 8020hex in the address field of sw at address 40 0100hex to get the address 1000 0020hex (the address of word Y).



Elaboration: Recall that MIPS instructions are word aligned, so jal drops the right two bits to increase the instruction's address range. Thus, it uses 26 bits to create a 28-bit byte address. Hence, the actual address in the lower 26 bits of the jal instruction in this example is 10 0040hex, rather than 40 0100hex.

Loader

Now that the executable file is on disk, the operating system reads it to memory and starts it. The loader follows these steps in UNIX systems:

1. Reads the executable file header to determine the size of the text and data segments.
2. Creates an address space large enough for the text and data.
3. Copies the instructions and data from the executable file into memory.
4. Copies the parameters (if any) to the main program onto the stack.
5. Initializes the machine registers and sets the stack pointer to the first free location.
6. Jumps to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program. When the main routine returns, the start-up routine terminates the program with an exit system call.

Sections A.3 and A.4 in Appendix A describe linkers and loaders in more detail.

Dynamically Linked Libraries

The first part of this section describes the traditional approach to linking libraries before the program is run. Although this static approach is the fastest way to call library routines, it has a few disadvantages:

■ The library routines become part of the executable code. If a new version of the library is released that fixes bugs or supports new hardware devices, the statically linked program keeps using the old version.
■ It loads all routines in the library that are called anywhere in the executable, even if those calls are not executed. The library can be large relative to the program; for example, the standard C library is 2.5 MB.

These disadvantages lead to dynamically linked libraries (DLLs), where the library routines are not linked and loaded until the program is run. Both the program and library routines keep extra information on the location of nonlocal procedures and their names. In the initial version of DLLs, the loader ran a dynamic linker, using the extra information in the file to find the appropriate libraries and to update all external references.

loader: A systems program that places an object program in main memory so that it is ready to execute.

dynamically linked libraries (DLLs): Library routines that are linked to a program during execution.

"Virtually every problem in computer science can be solved by another level of indirection."
David Wheeler



The downside of the initial version of DLLs was that it still linked all routines of the library that might be called, versus only those that are called during the running of the program. This observation led to the lazy procedure linkage version of DLLs, where each routine is linked only after it is called.

Like many innovations in our field, this trick relies on a level of indirection. Figure 2.22 shows the technique. It starts with the nonlocal routines calling a set of dummy routines at the end of the program, with one entry per nonlocal routine. These dummy entries each contain an indirect jump.

The first time the library routine is called, the program calls the dummy entry and follows the indirect jump. It points to code that puts a number in a register to identify the desired library routine and then jumps to the dynamic linker/loader.

[Figure 2.22 shows two panels. (a) First call to DLL routine: the program's jal reaches a dummy entry whose lw/jr indirect jump leads, via an identifying constant, to the dynamic linker/loader, which remaps the DLL routine. (b) Subsequent calls to DLL routine: the same jal and indirect jump now go directly to the DLL routine.]

FIGURE 2.22 Dynamically linked library via lazy procedure linkage. (a) Steps for the first time a call is made to the DLL routine. (b) The steps to find the routine, remap it, and link it are skipped on subsequent calls. As we will see in Chapter 5, the operating system may avoid copying the desired routine by remapping it using virtual memory management.
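To make the indirection concrete, here is a sketch (the names and layout are invented for illustration, not taken from the figure): the call site always jals to a fixed stub, and the stub jumps through a data word that initially points at glue code for the dynamic linker and is later overwritten with the address of the real routine.

        jal  sin_stub      # the program always calls the stub, never sin directly
        ...
sin_stub:
        lw   $t9, sin_ptr  # sin_ptr is a data word holding an address
        jr   $t9           # first call: dynamic-linker glue; later calls: the real sin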


2.13 A C Sort Example to Put It All Together

void swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

FIGURE 2.24 A C procedure that swaps two locations in memory. This subsection uses this procedure in a sorting example.

The Procedure swap

Let's start with the code for the procedure swap in Figure 2.24. This procedure simply swaps two locations in memory. When translating from C to assembly language by hand, we follow these general steps:

1. Allocate registers to program variables.
2. Produce code for the body of the procedure.
3. Preserve registers across the procedure invocation.

This section describes the swap procedure in these three pieces, concluding by putting all the pieces together.

Register Allocation for swap

As mentioned on pages 98-99, the MIPS convention on parameter passing is to use registers $a0, $a1, $a2, and $a3. Since swap has just two parameters, v and k, they will be found in registers $a0 and $a1. The only other variable is temp, which we associate with register $t0 since swap is a leaf procedure (see page 100). This register allocation corresponds to the variable declarations in the first part of the swap procedure in Figure 2.24.

Code for the Body of the Procedure swap

The remaining lines of C code in swap are

temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

Recall that the memory address for MIPS refers to the byte address, and so words are really 4 bytes apart. Hence we need to multiply the index k by 4 before adding it to the address. Forgetting that sequential word addresses differ by 4 instead of 1 is a common mistake in assembly language programming.



The Procedure sort

To ensure that you appreciate the rigor of programming in assembly language, we'll try a second, longer example. In this case, we'll build a routine that calls the swap procedure. This program sorts an array of integers, using bubble or exchange sort, which is one of the simplest if not the fastest sorts. Figure 2.26 shows the C version of the program. Once again, we present this procedure in several steps, concluding with the full procedure.

void sort (int v[], int n)
{
  int i, j;
  for (i = 0; i < n; i += 1) {
    for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
      swap(v,j);
    }
  }
}

FIGURE 2.26 A C procedure that performs a sort on the array v.

Register Allocation for sort

The two parameters of the procedure sort, v and n, are in the parameter registers $a0 and $a1, and we assign register $s0 to i and register $s1 to j.

Code for the Body of the Procedure sort

The procedure body consists of two nested for loops and a call to swap that includes parameters. Let's unwrap the code from the outside to the middle.

The first translation step is the first for loop:

for (i = 0; i < n; i += 1) {

The initialization and the increment of i each take a single instruction (move $s0, $zero and addi $s0, $s0, 1), as the skeleton below shows.



The loop should be exited if i < n is not true or, said another way, should be exited if i ≥ n. The set on less than instruction sets register $t0 to 1 if $s0 < $a1 and to 0 otherwise. Since we want to test if $s0 ≥ $a1, we branch if register $t0 is 0. This test takes two instructions:

for1tst: slt  $t0, $s0, $a1     # reg $t0 = 0 if $s0 ≥ $a1 (i ≥ n)
         beq  $t0, $zero, exit1 # go to exit1 if $s0 ≥ $a1 (i ≥ n)

The bottom of the loop just jumps back to the loop test:

         j    for1tst           # jump to test of outer loop
exit1:

The skeleton code of the first for loop is then

         move $s0, $zero        # i = 0
for1tst: slt  $t0, $s0, $a1     # reg $t0 = 0 if $s0 ≥ $a1 (i ≥ n)
         beq  $t0, $zero, exit1 # go to exit1 if $s0 ≥ $a1 (i ≥ n)
         ...
         (body of first for loop)
         ...
         addi $s0, $s0, 1       # i += 1
         j    for1tst           # jump to test of outer loop
exit1:

Voila! (The exercises explore writing faster code for similar loops.)

The second for loop looks like this in C:

for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {

The initialization portion of this loop is again one instruction:

addi $s1, $s0, -1 # j = i - 1

The decrement of j at the end of the loop is also one instruction:

addi $s1, $s1, -1 # j -= 1

The loop test has two parts. We exit the loop if either condition fails, so the first test must exit the loop if it fails (j < 0):

for2tst: slti $t0, $s1, 0       # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)

This branch will skip over the second condition test. If it doesn't skip, j ≥ 0.
This branch will skip over the second condition test. If it doesn’t skip, j ≥ 0.



The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤ v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte address) and add it to the base address of v:

sll $t1, $s1, 2   # reg $t1 = j * 4
add $t2, $a0, $t1 # reg $t2 = v + (j * 4)

Now we load v[j]:

lw $t3, 0($t2)    # reg $t3 = v[j]

Since we know that the second element is just the following word, we add 4 to the address in register $t2 to get v[j + 1]:

lw $t4, 4($t2)    # reg $t4 = v[j + 1]

The test of v[j] ≤ v[j + 1] is the same as v[j + 1] ≥ v[j], so the two instructions of the exit test are

slt $t0, $t4, $t3     # reg $t0 = 0 if $t4 ≥ $t3
beq $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3

The bottom of the loop jumps back to the inner loop test:

j for2tst             # jump to test of inner loop

Combining the pieces, the skeleton of the second for loop looks like this:

         addi $s1, $s0, -1      # j = i - 1
for2tst: slti $t0, $s1, 0       # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)
         sll  $t1, $s1, 2       # reg $t1 = j * 4
         add  $t2, $a0, $t1     # reg $t2 = v + (j * 4)
         lw   $t3, 0($t2)       # reg $t3 = v[j]
         lw   $t4, 4($t2)       # reg $t4 = v[j + 1]
         slt  $t0, $t4, $t3     # reg $t0 = 0 if $t4 ≥ $t3
         beq  $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3
         ...
         (body of second for loop)
         ...
         addi $s1, $s1, -1      # j -= 1
         j    for2tst           # jump to test of inner loop
exit2:

The Procedure Call in sort

The next step is the body of the second for loop:

swap(v,j);

Calling swap is easy enough:

jal swap



Passing Parameters in sort

The problem comes when we want to pass parameters because the sort procedure needs the values in registers $a0 and $a1, yet the swap procedure needs to have its parameters placed in those same registers. One solution is to copy the parameters for sort into other registers earlier in the procedure, making registers $a0 and $a1 available for the call of swap. (This copy is faster than saving and restoring on the stack.) We first copy $a0 and $a1 into $s2 and $s3 during the procedure:

move $s2, $a0 # copy parameter $a0 into $s2
move $s3, $a1 # copy parameter $a1 into $s3

Then we pass the parameters to swap with these two instructions:

move $a0, $s2 # first swap parameter is v
move $a1, $s1 # second swap parameter is j

Preserving Registers in sort

The only remaining code is the saving and restoring of registers. Clearly, we must save the return address in register $ra, since sort is a procedure and is called itself. The sort procedure also uses the saved registers $s0, $s1, $s2, and $s3, so they must be saved. The prologue of the sort procedure is then

addi $sp,$sp,-20 # make room on stack for 5 registers
sw   $ra,16($sp) # save $ra on stack
sw   $s3,12($sp) # save $s3 on stack
sw   $s2, 8($sp) # save $s2 on stack
sw   $s1, 4($sp) # save $s1 on stack
sw   $s0, 0($sp) # save $s0 on stack

The tail of the procedure simply reverses all these instructions, then adds a jr to return.
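For concreteness, a sketch of that tail, simply mirroring the prologue above:

lw   $s0, 0($sp) # restore $s0 from stack
lw   $s1, 4($sp) # restore $s1 from stack
lw   $s2, 8($sp) # restore $s2 from stack
lw   $s3,12($sp) # restore $s3 from stack
lw   $ra,16($sp) # restore return address
addi $sp,$sp,20  # pop the 5 words off the stack
jr   $ra         # return to the caller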

The Full Procedure sort

Now we put all the pieces together in Figure 2.27, being careful to replace references to registers $a0 and $a1 in the for loops with references to registers $s2 and $s3. Once again, to make the code easier to follow, we identify each block of code with its purpose in the procedure. In this example, nine lines of the sort procedure in C became 35 lines in the MIPS assembly language.

Elaboration: One optimization that works with this example is procedure inlining. Instead of passing arguments in parameters and invoking the code with a jal instruction, the compiler would copy the code from the body of the swap procedure where the call to swap appears in the code. Inlining would avoid four instructions in this example. The downside of the inlining optimization is that the compiled code would be bigger if the inlined procedure is called from several locations. Such a code expansion might turn into lower performance if it increased the cache miss rate; see Chapter 5.


2.14 Arrays versus Pointers

clear1(int array[], int size)
{
  int i;
  for (i = 0; i < size; i += 1)
    array[i] = 0;
}

clear2(int *array, int size)
{
  int *p;
  for (p = &array[0]; p < &array[size]; p = p + 1)
    *p = 0;
}

FIGURE 2.30 Two C procedures for setting an array to all zeros. Clear1 uses indices, while clear2 uses pointers. The second procedure needs some explanation for those unfamiliar with C. The address of a variable is indicated by &, and the object pointed to by a pointer is indicated by *. The declarations declare that array and p are pointers to integers. The first part of the for loop in clear2 assigns the address of the first element of array to the pointer p. The second part of the for loop tests to see if the pointer is pointing beyond the last element of array. Incrementing a pointer by one, in the last part of the for loop, means moving the pointer to the next sequential object of its declared size. Since p is a pointer to integers, the compiler will generate MIPS instructions to increment p by four, the number of bytes in a MIPS integer. The assignment in the loop places 0 in the object pointed to by p.

Finally, we can store 0 in that address:

sw $zero, 0($t2) # array[i] = 0

This instruction is the end of the body of the loop, so the next step is to increment i:

addi $t0,$t0,1 # i = i + 1

The loop test checks if i is less than size:

slt $t3,$t0,$a1     # $t3 = (i < size)
bne $t3,$zero,loop1 # if (i < size) go to loop1

We have now seen all the pieces of the procedure. Here is the MIPS code for clearing an array using indices:

       move $t0,$zero       # i = 0
loop1: sll  $t1,$t0,2       # $t1 = i * 4
       add  $t2,$a0,$t1     # $t2 = address of array[i]
       sw   $zero, 0($t2)   # array[i] = 0
       addi $t0,$t0,1       # i = i + 1
       slt  $t3,$t0,$a1     # $t3 = (i < size)
       bne  $t3,$zero,loop1 # if (i < size) go to loop1

(This code works as long as size is greater than 0; ANSI C requires a test of size before the loop, but we'll skip that legality here.)



Pointer Version of Clear

The second procedure that uses pointers allocates the two parameters array and size to the registers $a0 and $a1 and allocates p to register $t0. The code for the second procedure starts with assigning the pointer p to the address of the first element of the array:

move $t0,$a0 # p = address of array[0]

The next code is the body of the for loop, which simply stores 0 into p:

loop2: sw $zero,0($t0) # Memory[p] = 0

This instruction implements the body of the loop, so the next code is the iteration increment, which changes p to point to the next word:

addi $t0,$t0,4 # p = p + 4

Incrementing a pointer by 1 means moving the pointer to the next sequential object in C. Since p is a pointer to integers, each of which uses 4 bytes, the compiler increments p by 4.

The loop test is next. The first step is calculating the address of the last element of array. Start with multiplying size by 4 to get its byte address:

sll $t1,$a1,2 # $t1 = size * 4

and then we add the product to the starting address of the array to get the address of the first word after the array:

add $t2,$a0,$t1 # $t2 = address of array[size]

The loop test is simply to see if p is less than the last element of array:

slt $t3,$t0,$t2 # $t3 = (p < &array[size])
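Assembling those pieces gives a sketch of the whole pointer version; the closing bne is assumed here to mirror the test of the index version:

       move $t0,$a0         # p = address of array[0]
loop2: sw   $zero,0($t0)    # Memory[p] = 0
       addi $t0,$t0,4       # p = p + 4
       sll  $t1,$a1,2       # $t1 = size * 4
       add  $t2,$a0,$t1     # $t2 = address of array[size]
       slt  $t3,$t0,$t2     # $t3 = (p < &array[size])
       bne  $t3,$zero,loop2 # if (p < &array[size]) go to loop2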


2.16 Real Stuff: ARMv7 (32-bit) Instructions

by any amount, add it to the other registers to form the address, and then update one register with this new address.

Addressing mode                           | ARM | MIPS
Register operand                          |  X  |  X
Immediate operand                         |  X  |  X
Register + offset (displacement or based) |  X  |  X
Register + register (indexed)             |  X  |  —
Register + scaled register (scaled)       |  X  |  —
Register + offset and update register     |  X  |  —
Register + register and update register   |  X  |  —
Autoincrement, autodecrement              |  X  |  —
PC-relative data                          |  X  |  —

FIGURE 2.33 Summary of data addressing modes. ARM has separate register indirect and register offset addressing modes, rather than just putting 0 in the offset of the latter mode. To get greater addressing range, ARM shifts the offset left 1 or 2 bits if the data size is halfword or word.

Compare <strong>and</strong> Conditional Branch<br />

MIPS uses the contents of registers to evaluate conditional branches. ARM uses the<br />

traditional four condition code bits stored in the program status word: negative,<br />

zero, carry, <strong>and</strong> overflow. They can be set on any arithmetic or logical instruction;<br />

unlike earlier architectures, this setting is optional on each instruction. An<br />

explicit option leads to fewer problems in a pipelined implementation. ARM uses<br />

conditional branches to test condition codes to determine all possible unsigned<br />

<strong>and</strong> signed relations.<br />

CMP subtracts one oper<strong>and</strong> from the other <strong>and</strong> the difference sets the condition<br />

codes. Compare negative (CMN) adds one oper<strong>and</strong> to the other, <strong>and</strong> the sum sets<br />

the condition codes. TST performs logical AND on the two oper<strong>and</strong>s to set all<br />

condition codes but overflow, while TEQ uses exclusive OR to set the first three<br />

condition codes.<br />

One unusual feature of ARM is that every instruction has the option of executing<br />

conditionally, depending on the condition codes. Every instruction starts with a<br />

4-bit field that determines whether it will act as a no operation instruction (nop)<br />

or as a real instruction, depending on the condition codes. Hence, conditional<br />

branches are properly considered as conditionally executing the unconditional<br />

branch instruction. Conditional execution allows avoiding a branch to jump over a<br />

single instruction. It takes less code space <strong>and</strong> time to simply conditionally execute<br />

one instruction.<br />
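To make the payoff concrete, consider a C statement of the kind this feature targets. The register assignments in the comments below are hypothetical, and the MIPS and ARM translations are only sketched in the comments.

/* A one-instruction guarded update: */
if (x == 0)
    y = y + 1;

/* On MIPS (assuming x in $s0 and y in $s1) this needs a branch around the add:
       bne  $s0, $zero, skip
       addi $s1, $s1, 1
   skip:
   ARM can instead set the condition codes with CMP and conditionally execute
   the add (ADDEQ), so no branch is needed to skip the single instruction. */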

Figure 2.34 shows the instruction formats for ARM <strong>and</strong> MIPS. The principal<br />

differences are the 4-bit conditional execution field in every instruction <strong>and</strong> the<br />

smaller register field, because ARM has half the number of registers.



Register-register
  ARM   31-28 Opx(4) | 27-20 Op(8)  | 19-16 Rs1(4) | 15-12 Rd(4) | 11-4 Opx(8)      | 3-0 Rs2(4)
  MIPS  31-26 Op(6)  | 25-21 Rs1(5) | 20-16 Rs2(5) | 15-11 Rd(5) | 10-6 Const(5)    | 5-0 Opx(6)

Data transfer
  ARM   31-28 Opx(4) | 27-20 Op(8)  | 19-16 Rs1(4) | 15-12 Rd(4) | 11-0 Const(12)
  MIPS  31-26 Op(6)  | 25-21 Rs1(5) | 20-16 Rd(5)  | 15-0 Const(16)

Branch
  ARM   31-28 Opx(4) | 27-24 Op(4)  | 23-0 Const(24)
  MIPS  31-26 Op(6)  | 25-21 Rs1(5) | 20-16 Opx(5)/Rs2(5) | 15-0 Const(16)

Jump/Call
  ARM   31-28 Opx(4) | 27-24 Op(4)  | 23-0 Const(24)
  MIPS  31-26 Op(6)  | 25-0 Const(26)

(Field types: opcode, register, constant.)

FIGURE 2.34 Instruction formats, ARM and MIPS. The differences result from whether the
architecture has 16 or 32 registers.

Unique Features of ARM<br />

Figure 2.35 shows a few arithmetic-logical instructions not found in MIPS. Since<br />

ARM does not have a dedicated register for 0, it has separate opcodes to perform<br />

some operations that MIPS can do with $zero. In addition, ARM has support for<br />

multiword arithmetic.<br />

ARM’s 12-bit immediate field has a novel interpretation. The eight least-significant
bits are zero-extended to a 32-bit value, then rotated right the number of bits
specified in the first four bits of the field multiplied by two. One advantage is
that this scheme can represent all powers of two in a 32-bit word. Whether this split
actually catches more immediates than a simple 12-bit field would be an interesting
study.
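As a sketch of this interpretation (not ARM reference code), the 12-bit field can be decoded in C as an 8-bit constant rotated right by twice the 4-bit rotation amount; the function name is ours.

#include <stdint.h>

/* Decode an ARMv7-style 12-bit data-processing immediate: bits [7:0] are an
   8-bit constant, bits [11:8] give a rotation amount that is doubled before
   rotating right within 32 bits. */
uint32_t arm_immediate(uint32_t field12)
{
    uint32_t imm8   = field12 & 0xFF;          /* eight least-significant bits */
    uint32_t rotate = ((field12 >> 8) & 0xF) * 2;
    if (rotate == 0)
        return imm8;
    return (imm8 >> rotate) | (imm8 << (32 - rotate));
}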

Oper<strong>and</strong> shifting is not limited to immediates. The second register of all<br />

arithmetic <strong>and</strong> logical processing operations has the option of being shifted before<br />

being operated on. The shift options are shift left logical, shift right logical, shift<br />

right arithmetic, <strong>and</strong> rotate right.
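In C terms, the four shift options behave roughly as follows; this is a sketch for 32-bit values, and the helper names are ours.

#include <stdint.h>

uint32_t sll(uint32_t x, unsigned n) { return x << n; }   /* shift left logical */
uint32_t srl(uint32_t x, unsigned n) { return x >> n; }   /* shift right logical: fills with 0s */
int32_t  sra(int32_t  x, unsigned n) { return x >> n; }   /* shift right arithmetic: replicates the sign bit
                                                              (implementation-defined in C, universal in practice) */
uint32_t ror(uint32_t x, unsigned n)                       /* rotate right, for 0 < n < 32 */
{
    return (x >> n) | (x << (32 - n));
}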


2.17 Real Stuff: x86 Instructions

in parallel. Not only does this change enable more multimedia operations;<br />

it gives the compiler a different target for floating-point operations than<br />

the unique stack architecture. Compilers can choose to use the eight SSE<br />

registers as floating-point registers like those found in other computers. This<br />

change boosted the floating-point performance of the Pentium 4, the first<br />

microprocessor to include SSE2 instructions.<br />

■ 2003: A company other than Intel enhanced the x86 architecture this time.<br />

AMD announced a set of architectural extensions to increase the address<br />

space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address<br />

space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also<br />

increases the number of registers to 16 <strong>and</strong> increases the number of 128-<br />

bit SSE registers to 16. The primary ISA change comes from adding a new<br />

mode called long mode that redefines the execution of all x86 instructions<br />

with 64-bit addresses <strong>and</strong> data. To address the larger number of registers, it<br />

adds a new prefix to instructions. Depending how you count, long mode also<br />

adds four to ten new instructions <strong>and</strong> drops 27 old ones. PC-relative data<br />

addressing is another extension. AMD64 still has a mode that is identical<br />

to x86 (legacy mode) plus a mode that restricts user programs to x86 but<br />

allows operating systems to use AMD64 (compatibility mode). These modes<br />

allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64<br />

architecture.<br />

■ 2004: Intel capitulates <strong>and</strong> embraces AMD64, relabeling it Extended Memory<br />

64 Technology (EM64T). The major difference is that Intel added a 128-bit<br />

atomic compare <strong>and</strong> swap instruction, which probably should have been<br />

included in AMD64. At the same time, Intel announced another generation of<br />

media extensions. SSE3 adds 13 instructions to support complex arithmetic,<br />

graphics operations on arrays of structures, video encoding, floating-point<br />

conversion, <strong>and</strong> thread synchronization (see Section 2.11). AMD added SSE3<br />

in subsequent chips <strong>and</strong> the missing atomic swap instruction to AMD64 to<br />

maintain binary compatibility with Intel.<br />

■ 2006: Intel announces 54 new instructions as part of the SSE4 instruction set<br />

extensions. These extensions perform tweaks like sum of absolute differences,<br />

dot products for arrays of structures, sign or zero extension of narrow data to<br />

wider sizes, population count, <strong>and</strong> so on. They also added support for virtual<br />

machines (see Chapter 5).<br />

■ 2007: AMD announces 170 instructions as part of SSE5, including 46<br />

instructions of the base instruction set that adds three oper<strong>and</strong> instructions<br />

like MIPS.<br />

■ 2011: Intel ships the Advanced Vector Extension that exp<strong>and</strong>s the SSE<br />

register width from 128 to 256 bits, thereby redefining about 250 instructions<br />

<strong>and</strong> adding 128 new instructions.



This history illustrates the impact of the “golden h<strong>and</strong>cuffs” of compatibility on<br />

the x86, as the existing software base at each step was too important to jeopardize<br />

with significant architectural changes.<br />

Whatever the artistic failures of the x86, keep in mind that this instruction set<br />

largely drove the PC generation of computers <strong>and</strong> still dominates the cloud portion<br />

of the PostPC Era. Manufacturing 350M x86 chips per year may seem small<br />

compared to 9 billion ARMv7 chips, but many companies would love to control<br />

such a market. Nevertheless, this checkered ancestry has led to an architecture that<br />

is difficult to explain <strong>and</strong> impossible to love.<br />

Brace yourself for what you are about to see! Do not try to read this section<br />

with the care you would need to write x86 programs; the goal instead is to give you<br />

familiarity with the strengths <strong>and</strong> weaknesses of the world’s most popular desktop<br />

architecture.<br />

Rather than show the entire 16-bit, 32-bit, <strong>and</strong> 64-bit instruction set, in this<br />

section we concentrate on the 32-bit subset that originated with the 80386. We start<br />

our explanation with the registers <strong>and</strong> addressing modes, move on to the integer<br />

operations, <strong>and</strong> conclude with an examination of instruction encoding.<br />

x86 Registers <strong>and</strong> Data Addressing Modes<br />

The registers of the 80386 show the evolution of the instruction set (Figure 2.36).<br />

The 80386 extended all 16-bit registers (except the segment registers) to 32 bits,<br />

prefixing an E to their name to indicate the 32-bit version. We’ll refer to them<br />

generically as GPRs (general-purpose registers). The 80386 contains only eight<br />

GPRs. This means MIPS programs can use four times as many <strong>and</strong> ARMv7 twice<br />

as many.<br />

Figure 2.37 shows the arithmetic, logical, <strong>and</strong> data transfer instructions are<br />

two-oper<strong>and</strong> instructions. There are two important differences here. The x86<br />

arithmetic <strong>and</strong> logical instructions must have one oper<strong>and</strong> act as both a source<br />

<strong>and</strong> a destination; ARMv7 <strong>and</strong> MIPS allow separate registers for source <strong>and</strong><br />

destination. This restriction puts more pressure on the limited registers, since one<br />

source register must be modified. The second important difference is that one of<br />

the oper<strong>and</strong>s can be in memory. Thus, virtually any instruction may have one<br />

oper<strong>and</strong> in memory, unlike ARMv7 <strong>and</strong> MIPS.<br />

Data memory-addressing modes, described in detail below, offer two sizes of<br />

addresses within the instruction. These so-called displacements can be 8 bits or 32<br />

bits.<br />

Although a memory oper<strong>and</strong> can use any addressing mode, there are restrictions<br />

on which registers can be used in a mode. Figure 2.38 shows the x86 addressing<br />

modes <strong>and</strong> which GPRs cannot be used with each mode, as well as how to get the<br />

same effect using MIPS instructions.<br />

x86 Integer Operations<br />

The 8086 provides support for both 8-bit (byte) <strong>and</strong> 16-bit (word) data types. The<br />

80386 adds 32-bit addresses <strong>and</strong> data (double words) in the x86. (AMD64 adds 64-



Name      Use
EAX       GPR 0
ECX       GPR 1
EDX       GPR 2
EBX       GPR 3
ESP       GPR 4
EBP       GPR 5
ESI       GPR 6
EDI       GPR 7
CS        Code segment pointer
SS        Stack segment pointer (top of stack)
DS        Data segment pointer 0
ES        Data segment pointer 1
FS        Data segment pointer 2
GS        Data segment pointer 3
EIP       Instruction pointer (PC)
EFLAGS    Condition codes

FIGURE 2.36 The 80386 register set. Starting with the 80386, the top eight registers were extended
to 32 bits and could also be used as general-purpose registers.

Source/destination operand type    Second source operand
Register                           Register
Register                           Immediate
Register                           Memory
Memory                             Register
Memory                             Immediate

FIGURE 2.37 Instruction types for the arithmetic, logical, and data transfer instructions.
The x86 allows the combinations shown. The only restriction is the absence of a memory-memory mode.
Immediates may be 8, 16, or 32 bits in length; a register is any one of the 14 major registers in Figure 2.36
(not EIP or EFLAGS).



Mode                             Description                                    Register restrictions   MIPS equivalent
Register indirect                Address is in a register.                      Not ESP or EBP          lw $s0,0($s1)
Based mode with 8- or 32-bit     Address is contents of base register plus      Not ESP                 lw $s0,100($s1) # <=16-bit displacement
  displacement                   displacement.
Base plus scaled index           The address is Base + (2^Scale x Index),       Base: any GPR           mul $t0,$s2,4
                                 where Scale has the value 0, 1, 2, or 3.       Index: not ESP          add $t0,$t0,$s1
                                                                                                        lw  $s0,0($t0)
Base plus scaled index with      The address is Base + (2^Scale x Index)        Base: any GPR           mul $t0,$s2,4
  8- or 32-bit displacement      + displacement, where Scale has the value      Index: not ESP          add $t0,$t0,$s1
                                 0, 1, 2, or 3.                                                         lw  $s0,100($t0) # <=16-bit displacement



The first two categories are unremarkable, except that the arithmetic <strong>and</strong> logic<br />

instruction operations allow the destination to be either a register or a memory<br />

location. Figure 2.39 shows some typical x86 instructions <strong>and</strong> their functions.<br />

Conditional branches on the x86 are based on condition codes or flags, like<br />

ARMv7. Condition codes are set as a side effect of an operation; most are used<br />

to compare the value of a result to 0. Branches then test the condition codes. PC-relative
branch addresses must be specified in the number of bytes, since unlike MIPS,
80386 instructions are not all 4 bytes in length.

Instruction     Function
je name         if equal(condition code) {EIP=name}; EIP–128 <= name < EIP+128



Instruction        Meaning
Control            Conditional and unconditional branches
  jnz, jz          Jump if condition to EIP + 8-bit offset; JNE (for JNZ), JE (for JZ) are alternative names
  jmp              Unconditional jump—8-bit or 16-bit offset
  call             Subroutine call—16-bit offset; return address pushed onto stack
  ret              Pops return address from stack and jumps to it
  loop             Loop branch—decrement ECX; jump to EIP + 8-bit displacement if ECX ≠ 0
Data transfer      Move data between registers or between register and memory
  move             Move between two registers or between register and memory
  push, pop        Push source operand on stack; pop operand from stack top to a register
  les              Load ES and one of the GPRs from memory
Arithmetic, logical   Arithmetic and logical operations using the data registers and memory
  add, sub         Add source to destination; subtract source from destination; register-memory format
  cmp              Compare source and destination; register-memory format
  shl, shr, rcr    Shift left; shift logical right; rotate right with carry condition code as fill
  cbw              Convert byte in eight rightmost bits of EAX to 16-bit word in right of EAX
  test             Logical AND of source and destination sets condition codes
  inc, dec         Increment destination, decrement destination
  or, xor          Logical OR; exclusive OR; register-memory format
String             Move between string operands; length given by a repeat prefix
  movs             Copies from string source to destination by incrementing ESI and EDI; may be repeated
  lods             Loads a byte, word, or doubleword of a string into the EAX register

FIGURE 2.40 Some typical operations on the x86. Many operations use register-memory format,
where either the source or the destination may be memory and the other may be a register or immediate
operand.

of the instructions that address memory. The base plus scaled index mode uses a second<br />

postbyte, labeled “sc, index, base.”<br />

Figure 2.42 shows the encoding of the two postbyte address specifiers for<br />

both 16-bit <strong>and</strong> 32-bit mode. Unfortunately, to underst<strong>and</strong> fully which registers<br />

<strong>and</strong> which addressing modes are available, you need to see the encoding of all<br />

addressing modes <strong>and</strong> sometimes even the encoding of the instructions.<br />

x86 Conclusion<br />

Intel had a 16-bit microprocessor two years before its competitors’ more elegant<br />

architectures, such as the Motorola 68000, <strong>and</strong> this head start led to the selection<br />

of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that<br />

the x86 is more difficult to build than computers like ARMv7 <strong>and</strong> MIPS, but the<br />

large market meant in the PC Era that AMD <strong>and</strong> Intel could afford more resources



a. JE EIP + displacement
     JE(4) | Condition(4) | Displacement(8)

b. CALL
     CALL(8) | Offset(32)

c. MOV EBX, [EDI + 45]
     MOV(6) | d(1) | w(1) | r/m Postbyte(8) | Displacement(8)

d. PUSH ESI
     PUSH(5) | Reg(3)

e. ADD EAX, #6765
     ADD(4) | Reg(3) | w(1) | Immediate(32)

f. TEST EDX, #42
     TEST(7) | w(1) | Postbyte(8) | Immediate(32)

FIGURE 2.41 Typical x86 instruction formats. Figure 2.42 shows the encoding of the postbyte.
Many instructions contain the 1-bit field w, which says whether the operation is a byte or a double word. The
d field in MOV is used in instructions that may move to or from memory and shows the direction of the move.
The ADD instruction requires 32 bits for the immediate field, because in 32-bit mode, the immediates are
either 8 bits or 32 bits. The immediate field in the TEST is 32 bits long because there is no 8-bit immediate for
test in 32-bit mode. Overall, instructions may vary from 1 to 15 bytes in length. The long length comes from
extra 1-byte prefixes, having both a 4-byte immediate and a 4-byte displacement address, using an opcode of
2 bytes, and using the scaled index mode specifier, which adds another byte.

to help overcome the added complexity. What the x86 lacks in style, it made up for<br />

in market size, making it beautiful from the right perspective.<br />

Its saving grace is that the most frequently used x86 architectural components<br />

are not too difficult to implement, as AMD <strong>and</strong> Intel have demonstrated by rapidly<br />

improving performance of integer programs since 1978. To get that performance,



reg field (depends on the w bit and the mode):
  reg   w = 0   w = 1, 16b   w = 1, 32b
   0     AL        AX           EAX
   1     CL        CX           ECX
   2     DL        DX           EDX
   3     BL        BX           EBX
   4     AH        SP           ESP
   5     CH        BP           EBP
   6     DH        SI           ESI
   7     BH        DI           EDI

r/m field (depends on the mod field and the address size):
  r/m   mod = 0, 16b   mod = 0, 32b   mod = 1 (16b / 32b)       mod = 2 (16b / 32b)        mod = 3
   0    addr=BX+SI     =EAX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   1    addr=BX+DI     =ECX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   2    addr=BP+SI     =EDX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   3    addr=BP+DI     =EBX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   4    addr=SI        =(sib)         SI+disp8 / (sib)+disp8    SI+disp16 / (sib)+disp32   same as reg field
   5    addr=DI        =disp32        DI+disp8 / EBP+disp8      DI+disp16 / EBP+disp32     same as reg field
   6    addr=disp16    =ESI           BP+disp8 / ESI+disp8      BP+disp16 / ESI+disp32     same as reg field
   7    addr=BX        =EDI           BX+disp8 / EDI+disp8      BX+disp16 / EDI+disp32     same as reg field

FIGURE 2.42 The encoding of the first address specifier of the x86: mod, reg, r/m. The first columns show the encoding
of the 3-bit reg field, which depends on the w bit from the opcode and whether the machine is in 16-bit mode (8086) or 32-bit mode (80386).
The remaining columns explain the mod and r/m fields. The meaning of the 3-bit r/m field depends on the value in the 2-bit mod field and the
address size. Basically, the registers used in the address calculation are listed under mod = 0, with mod = 1
adding an 8-bit displacement and mod = 2 adding a 16-bit or 32-bit displacement, depending on the address mode. The exceptions are 1) r/m
= 6 when mod = 1 or mod = 2 in 16-bit mode selects BP plus the displacement; 2) r/m = 5 when mod = 1 or mod = 2 in 32-bit mode selects
EBP plus displacement; and 3) r/m = 4 in 32-bit mode when mod does not equal 3, where (sib) means use the scaled index mode shown in
Figure 2.38. When mod = 3, the r/m field indicates a register, using the same encoding as the reg field combined with the w bit.

compilers must avoid the portions of the architecture that are hard to implement<br />

fast.<br />

In the PostPC Era, however, despite considerable architectural <strong>and</strong> manufacturing<br />

expertise, x86 has not yet been competitive in the personal mobile device.<br />

2.18 Real Stuff: ARMv8 (64-bit) Instructions<br />

Of the many potential problems in an instruction set, the one that is almost impossible<br />

to overcome is having too small a memory address. While the x86 was successfully<br />

extended first to 32-bit addresses <strong>and</strong> then later to 64-bit addresses, many of its<br />

brethren were left behind. For example, the 16-bit address MOStek 6502 powered the<br />

Apple II, but even given this headstart with the first commercially successful personal<br />

computer, its lack of address bits condemned it to the dustbin of history.<br />

ARM architects could see the writing on the wall of their 32-bit address<br />

computer, <strong>and</strong> began design of the 64-bit address version of ARM in 2007. It was<br />

finally revealed in 2013. Rather than some minor cosmetic changes to make all<br />

the registers 64 bits wide, which is basically what happened to the x86, ARM did a<br />

complete overhaul. The good news is that if you know MIPS it will be very easy to<br />

pick up ARMv8, as the 64-bit version is called.<br />

First, as compared to MIPS, ARM dropped virtually all of the unusual features<br />

of v7:<br />

■ There is no conditional execution field, as there was in nearly every instruction<br />

in v7.



This battle between compilers <strong>and</strong> assembly language coders is another situation<br />

in which humans are losing ground. For example, C offers the programmer a<br />

chance to give a hint to the compiler about which variables to keep in registers<br />

versus spilled to memory. When compilers were poor at register allocation, such<br />

hints were vital to performance. In fact, some old C textbooks spent a fair amount<br />

of time giving examples that effectively use register hints. Today’s C compilers<br />

generally ignore such hints, because the compiler does a better job at allocation<br />

than the programmer does.<br />
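For example, an old-style hint looked like the sketch below; modern compilers typically treat the register keyword as a no-op when allocating registers.

/* Old-style register hints: the programmer asks the compiler to keep the loop
   counter and the running sum in registers. Today this has little or no effect. */
int sum_array(const int a[], int n)
{
    register int i;
    register int sum = 0;
    for (i = 0; i < n; i += 1)
        sum += a[i];
    return sum;
}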

Even if writing by h<strong>and</strong> resulted in faster code, the dangers of writing in assembly<br />

language are the longer time spent coding <strong>and</strong> debugging, the loss in portability,<br />

<strong>and</strong> the difficulty of maintaining such code. One of the few widely accepted axioms<br />

of software engineering is that coding takes longer if you write more lines, <strong>and</strong> it<br />

clearly takes many more lines to write a program in assembly language than in C<br />

or Java. Moreover, once it is coded, the next danger is that it will become a popular<br />

program. Such programs always live longer than expected, meaning that someone<br />

will have to update the code over several years <strong>and</strong> make it work with new releases<br />

of operating systems <strong>and</strong> new models of machines. Writing in higher-level language<br />

instead of assembly language not only allows future compilers to tailor the code<br />

to future machines; it also makes the software easier to maintain <strong>and</strong> allows the<br />

program to run on more br<strong>and</strong>s of computers.<br />

Fallacy: The importance of commercial binary compatibility means successful<br />

instruction sets don’t change.<br />

While backwards binary compatibility is sacrosanct, Figure 2.43 shows that the x86<br />

architecture has grown dramatically. The average is more than one instruction per<br />

month over its 35-year lifetime!<br />

Pitfall: Forgetting that sequential word addresses in machines with byte addressing<br />

do not differ by one.<br />

Many an assembly language programmer has toiled over errors made by assuming<br />

that the address of the next word can be found by incrementing the address in a<br />

register by one instead of by the word size in bytes. Forewarned is forearmed!<br />
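The difference is easy to see in C, where pointer arithmetic already scales by the operand size; the sketch below just prints the addresses involved (the actual values printed will vary).

#include <stdio.h>

int main(void)
{
    int a[10];
    int *p = &a[0];

    /* Pointer arithmetic moves by sizeof(int), so p + 1 is 4 bytes past p. */
    printf("a[0] at %p, a[1] at %p\n", (void *)p, (void *)(p + 1));

    /* The pitfall: adding 1 to the byte address lands in the middle of a[0]. */
    char *wrong = (char *)p + 1;
    printf("wrong \"next word\" at %p\n", (void *)wrong);
    return 0;
}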

Pitfall: Using a pointer to an automatic variable outside its defining procedure.<br />

A common mistake in dealing with pointers is to pass a result from a procedure<br />

that includes a pointer to an array that is local to that procedure. Following the<br />

stack discipline in Figure 2.12, the memory that contains the local array will be<br />

reused as soon as the procedure returns. Pointers to automatic variables can lead<br />

to chaos.
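A minimal C sketch of the mistake described here; the names are illustrative only.

/* BUG: the function returns a pointer to a local (automatic) array. The
   array's stack memory is reclaimed as soon as the procedure returns, so
   the caller is left holding a dangling pointer. */
int *fill_local(int n)
{
    int local[10];
    for (int i = 0; i < 10; i += 1)
        local[i] = n + i;
    return local;          /* pointer to an automatic variable escapes */
}

/* Caller: int *p = fill_local(5);  ...any later use of p[0] is undefined. */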



We also saw the great idea of making the common case fast applied to instruction
sets as well as computer architecture. Examples of making the common MIPS
case fast include PC-relative addressing for conditional branches and immediate
addressing for larger constant operands.

Above this machine level is assembly language, a language that humans can read.<br />

The assembler translates it into the binary numbers that machines can underst<strong>and</strong>,<br />

<strong>and</strong> it even “extends” the instruction set by creating symbolic instructions that<br />

aren’t in the hardware. For instance, constants or addresses that are too big are<br />

broken into properly sized pieces, common variations of instructions are given<br />

their own name, <strong>and</strong> so on. Figure 2.44 lists the MIPS instructions we have covered<br />

Real MIPS instructions                 Name    Format
  add                                  add     R
  subtract                             sub     R
  add immediate                        addi    I
  load word                            lw      I
  store word                           sw      I
  load half                            lh      I
  load half unsigned                   lhu     I
  store half                           sh      I
  load byte                            lb      I
  load byte unsigned                   lbu     I
  store byte                           sb      I
  load linked                          ll      I
  store conditional                    sc      I
  load upper immediate                 lui     I
  and                                  and     R
  or                                   or      R
  nor                                  nor     R
  and immediate                        andi    I
  or immediate                         ori     I
  shift left logical                   sll     R
  shift right logical                  srl     R
  branch on equal                      beq     I
  branch on not equal                  bne     I
  set less than                        slt     R
  set less than immediate              slti    I
  set less than immediate unsigned     sltiu   I
  jump                                 j       J
  jump register                        jr      R
  jump and link                        jal     J

Pseudo MIPS instructions               Name    Format
  move                                 move    R
  multiply                             mult    R
  multiply immediate                   multi   I
  load immediate                       li      I
  branch less than                     blt     I
  branch less than or equal            ble     I
  branch greater than                  bgt     I
  branch greater than or equal         bge     I

FIGURE 2.44 The MIPS instruction set covered so far, with the real MIPS instructions
on the left and the pseudoinstructions on the right. Appendix A (Section A.10) describes the
full MIPS architecture. Figure 2.1 shows more details of the MIPS architecture revealed in this chapter. The
information given here is also found in Columns 1 and 2 of the MIPS Reference Data Card at the front of
the book.


2.22 Exercises

2.3 [5] For the following C statement, what is the corresponding<br />

MIPS assembly code? Assume that the variables f, g, h, i, <strong>and</strong> j are assigned to<br />

registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4, respectively. Assume that the base address<br />

of the arrays A <strong>and</strong> B are in registers $s6 <strong>and</strong> $s7, respectively.<br />

B[8] = A[i−j];<br />

2.4 [5] For the MIPS assembly instructions below, what is the<br />

corresponding C statement? Assume that the variables f, g, h, i, <strong>and</strong> j are assigned<br />

to registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4, respectively. Assume that the base address<br />

of the arrays A <strong>and</strong> B are in registers $s6 <strong>and</strong> $s7, respectively.<br />

sll $t0, $s0, 2 # $t0 = f * 4<br />

add $t0, $s6, $t0 # $t0 = &A[f]<br />

sll $t1, $s1, 2 # $t1 = g * 4<br />

add $t1, $s7, $t1 # $t1 = &B[g]<br />

lw $s0, 0($t0) # f = A[f]<br />

addi $t2, $t0, 4<br />

lw $t0, 0($t2)<br />

add $t0, $t0, $s0<br />

sw $t0, 0($t1)<br />

2.5 [5] For the MIPS assembly instructions in Exercise 2.4, rewrite<br />

the assembly code to minimize the number of MIPS instructions (if possible)

needed to carry out the same function.<br />

2.6 The table below shows 32-bit values of an array stored in memory.<br />

Address Data<br />

24 2<br />

38 4<br />

32 3<br />

36 6<br />

40 1



2.6.1 [5] For the memory locations in the table above, write C<br />

code to sort the data from lowest to highest, placing the lowest value in the<br />

smallest memory location shown in the figure. Assume that the data shown<br />

represents the C variable called Array, which is an array of type int, <strong>and</strong> that<br />

the first number in the array shown is the first element in the array. Assume<br />

that this particular machine is a byte-addressable machine <strong>and</strong> a word consists<br />

of four bytes.<br />

2.6.2 [5] For the memory locations in the table above, write MIPS<br />

code to sort the data from lowest to highest, placing the lowest value in the smallest<br />

memory location. Use a minimum number of MIPS instructions. Assume the base<br />

address of Array is stored in register $s6.<br />

2.7 [5] Show how the value 0xabcdef12 would be arranged in memory<br />

of a little-endian <strong>and</strong> a big-endian machine. Assume the data is stored starting at<br />

address 0.<br />

2.8 [5] Translate 0xabcdef12 into decimal.<br />

2.9 [5] Translate the following C code to MIPS. Assume that the<br />

variables f, g, h, i, <strong>and</strong> j are assigned to registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4,<br />

respectively. Assume that the base address of the arrays A <strong>and</strong> B are in registers $s6<br />

<strong>and</strong> $s7, respectively. Assume that the elements of the arrays A <strong>and</strong> B are 4-byte<br />

words:<br />

B[8] = A[i] + A[j];<br />

2.10 [5] Translate the following MIPS code to C. Assume that the<br />

variables f, g, h, i, <strong>and</strong> j are assigned to registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4,<br />

respectively. Assume that the base address of the arrays A <strong>and</strong> B are in registers $s6<br />

<strong>and</strong> $s7, respectively.<br />

addi $t0, $s6, 4<br />

add $t1, $s6, $0<br />

sw $t1, 0($t0)<br />

lw $t0, 0($t0)<br />

add $s0, $t1, $t0<br />

2.11 [5] For each MIPS instruction, show the value of the opcode<br />

(OP), source register (RS), <strong>and</strong> target register (RT) fields. For the I-type instructions,<br />

show the value of the immediate field, <strong>and</strong> for the R-type instructions, show the<br />

value of the destination register (RD) field.



2.12 Assume that registers $s0 <strong>and</strong> $s1 hold the values 0x80000000 <strong>and</strong><br />

0xD0000000, respectively.<br />

2.12.1 [5] What is the value of $t0 for the following assembly code?<br />

add $t0, $s0, $s1<br />

2.12.2 [5] Is the result in $t0 the desired result, or has there been overflow?<br />

2.12.3 [5] For the contents of registers $s0 <strong>and</strong> $s1 as specified above,<br />

what is the value of $t0 for the following assembly code?<br />

sub $t0, $s0, $s1<br />

2.12.4 [5] Is the result in $t0 the desired result, or has there been overflow?<br />

2.12.5 [5] For the contents of registers $s0 <strong>and</strong> $s1 as specified above,<br />

what is the value of $t0 for the following assembly code?<br />

add $t0, $s0, $s1<br />

add $t0, $t0, $s0<br />

2.12.6 [5] Is the result in $t0 the desired result, or has there been<br />

overflow?<br />

2.13 Assume that $s0 holds the value 128ten.

2.13.1 [5] For the instruction add $t0, $s0, $s1, what is the range(s) of<br />

values for $s1 that would result in overflow?<br />

2.13.2 [5] For the instruction sub $t0, $s0, $s1, what is the range(s) of<br />

values for $s1 that would result in overflow?<br />

2.13.3 [5] For the instruction sub $t0, $s1, $s0, what is the range(s) of<br />

values for $s1 that would result in overflow?<br />

2.14 [5] Provide the type <strong>and</strong> assembly language instruction for the<br />

following binary value: 0000 0010 0001 0000 1000 0000 0010 0000 two<br />

2.15 [5] Provide the type <strong>and</strong> hexadecimal representation of<br />

following instruction: sw $t1, 32($t2)



2.16 [5] Provide the type, assembly language instruction, <strong>and</strong> binary<br />

representation of instruction described by the following MIPS fields:<br />

op=0, rs=3, rt=2, rd=3, shamt=0, funct=34<br />

2.17 [5] Provide the type, assembly language instruction, <strong>and</strong> binary<br />

representation of instruction described by the following MIPS fields:<br />

op=0x23, rs=1, rt=2, const=0x4<br />

2.18 Assume that we would like to exp<strong>and</strong> the MIPS register file to 128 registers<br />

<strong>and</strong> exp<strong>and</strong> the instruction set to contain four times as many instructions.<br />

2.18.1 [5] How would this affect the size of each of the bit fields in

the R-type instructions?<br />

2.18.2 [5] How would this affect the size of each of the bit fields in

the I-type instructions?<br />

2.18.3 [5] How could each of the two proposed changes decrease
the size of a MIPS assembly program? On the other hand, how could the proposed
change increase the size of a MIPS assembly program?

2.19 Assume the following register contents:<br />

$t0 = 0xAAAAAAAA, $t1 = 0x12345678<br />

2.19.1 [5] For the register values shown above, what is the value of $t2<br />

for the following sequence of instructions?<br />

sll $t2, $t0, 44<br />

or $t2, $t2, $t1<br />

2.19.2 [5] For the register values shown above, what is the value of $t2<br />

for the following sequence of instructions?<br />

sll $t2, $t0, 4<br />

<strong>and</strong>i $t2, $t2, −1<br />

2.19.3 [5] For the register values shown above, what is the value of $t2<br />

for the following sequence of instructions?<br />

srl $t2, $t0, 3<br />

<strong>and</strong>i $t2, $t2, 0xFFEF



2.20 [5] Find the shortest sequence of MIPS instructions that extracts bits<br />

16 down to 11 from register $t0 <strong>and</strong> uses the value of this field to replace bits 31<br />

down to 26 in register $t1 without changing the other 26 bits of register $t1.<br />

2.21 [5] Provide a minimal set of MIPS instructions that may be used to<br />

implement the following pseudoinstruction:<br />

not $t1, $t2<br />

// bit-wise invert<br />

2.22 [5] For the following C statement, write a minimal sequence of MIPS
assembly instructions that does the identical operation. Assume $t1 = A, $t2 = B,
and $s1 is the base address of C.

A = C[0] << 4;

2.25 The following instruction is not included in the MIPS instruction set:

rpt $t2, loop # if (R[rs] > 0) R[rs]=R[rs]−1, PC=PC+4+BranchAddr

2.25.1 [5] If this instruction were to be implemented in the MIPS<br />

instruction set, what is the most appropriate instruction format?<br />

2.25.2 [5] What is the shortest sequence of MIPS instructions that<br />

performs the same operation?



2.26 Consider the following MIPS loop:<br />

LOOP: slt $t2, $0, $t1<br />

beq $t2, $0, DONE<br />

subi $t1, $t1, 1<br />

addi $s2, $s2, 2<br />

j LOOP<br />

DONE:<br />

2.26.1 [5] Assume that the register $t1 is initialized to the value 10. What<br />

is the value in register $s2 assuming $s2 is initially zero?<br />

2.26.2 [5] For each of the loops above, write the equivalent C code<br />

routine. Assume that the registers $s1, $s2, $t1, <strong>and</strong> $t2 are integers A, B, i, <strong>and</strong><br />

temp, respectively.<br />

2.26.3 [5] For the loops written in MIPS assembly above, assume that<br />

the register $t1 is initialized to the value N. How many MIPS instructions are<br />

executed?<br />

2.27 [5] Translate the following C code to MIPS assembly code. Use a<br />

minimum number of instructions. Assume that the values of a, b, i, <strong>and</strong> j are in<br />

registers $s0, $s1, $t0, <strong>and</strong> $t1, respectively. Also, assume that register $s2 holds<br />

the base address of the array D.<br />

for(i=0; i<a; i++)
    for(j=0; j<b; j++)
        D[4*j] = i + j;



addi $t1, $t1, 1<br />

slti $t2, $t1, 100<br />

bne $t2, $s0, LOOP<br />

2.30 [5] Rewrite the loop from Exercise 2.29 to reduce the number of<br />

MIPS instructions executed.<br />

2.31 [5] Implement the following C code in MIPS assembly. What is the<br />

total number of MIPS instructions needed to execute the function?<br />

int fib(int n){
    if (n == 0)
        return 0;
    else if (n == 1)
        return 1;
    else
        return fib(n−1) + fib(n−2);
}

2.32 [5] Functions can often be implemented by compilers “in-line.” An<br />

in-line function is when the body of the function is copied into the program space,<br />

allowing the overhead of the function call to be eliminated. Implement an “in-line”<br />

version of the C code above in MIPS assembly. What is the reduction in the total<br />

number of MIPS assembly instructions needed to complete the function? Assume<br />

that the C variable n is initialized to 5.<br />

2.33 [5] For each function call, show the contents of the stack after the<br />

function call is made. Assume the stack pointer is originally at address 0x7ffffffc,<br />

<strong>and</strong> follow the register conventions as specified in Figure 2.11.<br />

2.34 Translate function f into MIPS assembly language. If you need to use<br />

registers $t0 through $t7, use the lower-numbered registers first. Assume the<br />

function declaration for func is “int func(int a, int b);”. The code for function

f is as follows:<br />

int f(int a, int b, int c, int d){<br />

return func(func(a,b),c+d);<br />

}



2.35 [5] Can we use the tail-call optimization in this function? If no,<br />

explain why not. If yes, what is the difference in the number of executed instructions<br />

in f with <strong>and</strong> without the optimization?<br />

2.36 [5] Right before your function f from Exercise 2.34 returns, what do<br />

we know about contents of registers $t5, $s3, $ra, <strong>and</strong> $sp? Keep in mind that<br />

we know what the entire function f looks like, but for function func we only know<br />

its declaration.<br />

2.37 [5] Write a program in MIPS assembly language to convert an ASCII
number string containing positive and negative integer decimal strings, to an
integer. Your program should expect register $a0 to hold the address of a
null-terminated string containing some combination of the digits 0 through 9. Your
program should compute the integer value equivalent to this string of digits, then
place the number in register $v0. If a non-digit character appears anywhere in the
string, your program should stop with the value −1 in register $v0. For example,
if register $a0 points to a sequence of three bytes 50ten, 52ten, 0ten (the
null-terminated string “24”), then when the program stops, register $v0 should contain
the value 24ten.

2.38 [5] Consider the following code:<br />

lbu $t0, 0($t1)<br />

sw $t0, 0($t2)<br />

Assume that the register $t1 contains the address 0x1000 0000 <strong>and</strong> the register<br />

$t2 contains the address 0x1000 0010. Note the MIPS architecture utilizes<br />

big-endian addressing. Assume that the data (in hexadecimal) at address 0x1000<br />

0000 is: 0x11223344. What value is stored at the address pointed to by register<br />

$t2?<br />

2.39 [5] Write the MIPS assembly code that creates the 32-bit constant<br />

0010 0000 0000 0001 0100 1001 0010 0100 two<br />

<strong>and</strong> stores that value to<br />

register $t1.<br />

2.40 [5] If the current value of the PC is 0x00000000, can you use<br />

a single jump instruction to get to the PC address as shown in Exercise 2.39?<br />

2.41 [5] If the current value of the PC is 0x00000600, can you use<br />

a single branch instruction to get to the PC address as shown in Exercise 2.39?



2.42 [5] If the current value of the PC is 0x1FFFf000, can you use<br />

a single branch instruction to get to the PC address as shown in Exercise 2.39?<br />

2.43 [5] Write the MIPS assembly code to implement the following C<br />

code:<br />

lock(lk);<br />

shvar=max(shvar,x);<br />

unlock(lk);<br />

Assume that the address of the lk variable is in $a0, the address of the shvar<br />

variable is in $a1, <strong>and</strong> the value of variable x is in $a2. Your critical section should<br />

not contain any function calls. Use ll/sc instructions to implement the lock()<br />

operation, <strong>and</strong> the unlock() operation is simply an ordinary store instruction.<br />

2.44 [5] Repeat Exercise 2.43, but this time use ll/sc to perform<br />

an atomic update of the shvar variable directly, without using lock() <strong>and</strong><br />

unlock(). Note that in this problem there is no variable lk.<br />

2.45 [5] Using your code from Exercise 2.43 as an example, explain what<br />

happens when two processors begin to execute this critical section at the same<br />

time, assuming that each processor executes exactly one instruction per cycle.<br />

2.46 Assume for a given processor the CPI of arithmetic instructions is 1,<br />

the CPI of load/store instructions is 10, <strong>and</strong> the CPI of branch instructions is<br />

3. Assume a program has the following instruction breakdowns: 500 million<br />

arithmetic instructions, 300 million load/store instructions, 100 million branch<br />

instructions.<br />

2.46.1 [5] Suppose that new, more powerful arithmetic instructions are<br />

added to the instruction set. On average, through the use of these more powerful<br />

arithmetic instructions, we can reduce the number of arithmetic instructions<br />

needed to execute a program by 25%, <strong>and</strong> the cost of increasing the clock cycle<br />

time by only 10%. Is this a good design choice? Why?<br />

2.46.2 [5] Suppose that we find a way to double the performance of<br />

arithmetic instructions. What is the overall speedup of our machine? What if we<br />

find a way to improve the performance of arithmetic instructions by 10 times?<br />

2.47 Assume that for a given program 70% of the executed instructions are<br />

arithmetic, 10% are load/store, <strong>and</strong> 20% are branch.



2.47.1 [5] Given this instruction mix <strong>and</strong> the assumption that an<br />

arithmetic instruction requires 2 cycles, a load/store instruction takes 6 cycles, <strong>and</strong><br />

a branch instruction takes 3 cycles, find the average CPI.<br />

2.47.2 [5] For a 25% improvement in performance, how many cycles, on<br />

average, may an arithmetic instruction take if load/store <strong>and</strong> branch instructions<br />

are not improved at all?<br />

2.47.3 [5] For a 50% improvement in performance, how many cycles, on<br />

average, may an arithmetic instruction take if load/store <strong>and</strong> branch instructions<br />

are not improved at all?<br />

Answers to Check Yourself

§2.2, page 66: MIPS, C, Java<br />

§2.3, page 72: 2) Very slow<br />

§2.4, page 79: 2) 8 ten<br />

§2.5, page 87: 4) sub $t2, $t0, $t1<br />

§2.6, page 89: Both. AND with a mask pattern of 1s will leave 0s everywhere but

the desired field. Shifting left by the correct amount removes the bits from the left<br />

of the field. Shifting right by the appropriate amount puts the field into the rightmost<br />

bits of the word, with 0s in the rest of the word. Note that AND leaves the<br />

field where it was originally, <strong>and</strong> the shift pair moves the field into the rightmost<br />

part of the word.<br />

§2.7, page 96: I. All are true. II. 1).<br />

§2.8, page 106: Both are true.<br />

§2.9, page 111: I. 1) <strong>and</strong> 2) II. 3)<br />

§2.10, page 120: I. 4) 128K. II. 6) a block of 256M. III. 4) sll<br />

§2.11, page 123: Both are true.<br />

§2.12, page 132: 4) Machine independence.




3  Arithmetic for Computers

Numerical precision is the very soul of science.
Sir D’Arcy Wentworth Thompson, On Growth and Form, 1917

3.1 Introduction 178
3.2 Addition and Subtraction 178
3.3 Multiplication 183
3.4 Division 189
3.5 Floating Point 196
3.6 Parallelism and Computer Arithmetic: Subword Parallelism 222
3.7 Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86 224



3.2 Addition and Subtraction

        (0)  (0)  (1)  (1)  (0)        (Carries)
. . .    0    0    0    1    1    1
. . .    0    0    0    1    1    0
-----------------------------------
. . .    0    0    1    1    0    1

FIGURE 3.1 Binary addition, showing carries from right to left. The rightmost bit adds 1
to 0, resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the operation
for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit and a carry out of 1. The
third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 and a sum bit of 1. The fourth bit is 1 +
0 + 0, yielding a 1 sum and no carry.

0000 0000 0000 0000 0000 0000 0000 0111 two = 7 ten<br />

– 0000 0000 0000 0000 0000 0000 0000 0110 two = 6 ten<br />

= 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten<br />

or via addition using the two’s complement representation of 6:<br />

0000 0000 0000 0000 0000 0000 0000 0111 two = 7 ten<br />

+ 1111 1111 1111 1111 1111 1111 1111 1010 two = –6 ten<br />

= 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten<br />

Recall that overflow occurs when the result from an operation cannot be
represented with the available hardware, in this case a 32-bit word. When can
overflow occur in addition? When adding operands with different signs, overflow
cannot occur. The reason is the sum must be no larger than one of the operands.
For example, −10 + 4 = −6. Since the operands fit in 32 bits and the sum is no
larger than an operand, the sum must fit in 32 bits as well. Therefore, no overflow
can occur when adding positive and negative operands.

There are similar restrictions to the occurrence of overflow during subtract, but
it’s just the opposite principle: when the signs of the operands are the same, overflow
cannot occur. To see this, remember that c − a = c + (−a), because we subtract by
negating the second operand and then add. Therefore, when we subtract operands
of the same sign we end up by adding operands of different signs. From the prior
paragraph, we know that overflow cannot occur in this case either.

Knowing when overflow cannot occur in addition <strong>and</strong> subtraction is all well <strong>and</strong><br />

good, but how do we detect it when it does occur? Clearly, adding or subtracting<br />

two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.<br />

The lack of a 33rd bit means that when overflow occurs, the sign bit is set with<br />

the value of the result instead of the proper sign of the result. Since we need just one<br />

extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two<br />

positive numbers <strong>and</strong> the sum is negative, or vice versa. This spurious sum means<br />

a carry out occurred into the sign bit.<br />

Overflow occurs in subtraction when we subtract a negative number from a<br />

positive number <strong>and</strong> get a negative result, or when we subtract a positive number<br />

from a negative number <strong>and</strong> get a positive result. Such a ridiculous result means a<br />

borrow occurred from the sign bit. Figure 3.2 shows the combination of operations,<br />

oper<strong>and</strong>s, <strong>and</strong> results that indicate an overflow.
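The sign-bit rule for addition can be sketched in C as follows, assuming 32-bit two's complement integers; the helper name is ours.

#include <stdint.h>

/* Overflow on signed addition occurs only when both operands have the same
   sign and the (wrapped) sum has the opposite sign. The wrapping addition is
   done on unsigned values to keep the C well defined. */
int add_overflows(int32_t a, int32_t b)
{
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}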



more detail; Chapter 5 describes other situations where exceptions <strong>and</strong> interrupts<br />

occur.)<br />

MIPS includes a register called the exception program counter (EPC) to contain<br />

the address of the instruction that caused the exception. The instruction move from<br />

system control (mfc0) is used to copy EPC into a general-purpose register so that<br />

MIPS software has the option of returning to the offending instruction via a jump<br />

register instruction.<br />

interrupt An exception<br />

that comes from outside<br />

of the processor. (Some<br />

architectures use the<br />

term interrupt for all<br />

exceptions.)<br />

Summary<br />

A major point of this section is that, independent of the representation, the finite<br />

word size of computers means that arithmetic operations can create results that<br />

are too large to fit in this fixed word size. It’s easy to detect overflow in unsigned<br />

numbers, although these are almost always ignored because programs don’t want to<br />

detect overflow for address arithmetic, the most common use of natural numbers.<br />

Two’s complement presents a greater challenge, yet some software systems require<br />

detection of overflow, so today all computers have a way to detect it.<br />

Some programming languages allow two’s complement integer arithmetic<br />

on variables declared byte <strong>and</strong> half, whereas MIPS only has integer arithmetic<br />

operations on full words. As we recall from Chapter 2, MIPS does have data transfer<br />

operations for bytes <strong>and</strong> halfwords. What MIPS instructions should be generated<br />

for byte <strong>and</strong> halfword arithmetic operations?<br />

1. Load with lbu, lhu; arithmetic with add, sub, mult, div; then store using<br />

sb, sh.<br />

2. Load with lb, lh; arithmetic with add, sub, mult, div; then store using<br />

sb, sh.<br />

3. Load with lb, lh; arithmetic with add, sub, mult, div, using AND to mask<br />

result to 8 or 16 bits after each operation; then store using sb, sh.<br />

Check<br />

Yourself<br />

Elaboration: One feature not generally found in general-purpose microprocessors is<br />

saturating operations. Saturation means that when a calculation overflows, the result<br />

is set to the largest positive number or most negative number, rather than a modulo<br />

calculation as in two’s complement arithmetic. Saturation is likely what you want for media<br />

operations. For example, the volume knob on a radio set would be frustrating if, as you<br />

turned it, the volume would get continuously louder for a while <strong>and</strong> then immediately very<br />

soft. A knob with saturation would stop at the highest volume no matter how far you turned<br />

it. Multimedia extensions to st<strong>and</strong>ard instruction sets often offer saturating arithmetic.<br />
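For instance, a saturating 16-bit add might be sketched in C as below; the function name is ours, and real multimedia instruction sets do this in hardware for many narrow values at once.

#include <stdint.h>

/* Saturating addition for 16-bit signed samples: instead of wrapping around
   on overflow, clamp the result to the largest or smallest representable value. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* exact in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}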

Elaboration: MIPS can trap on overflow, but unlike many other computers, there is
no conditional branch to test overflow. A sequence of MIPS instructions can discover



3.3 Multiplication<br />

Now that we have completed the explanation of addition <strong>and</strong> subtraction, we are<br />

ready to build the more vexing operation of multiplication.<br />

First, let’s review the multiplication of decimal numbers in longh<strong>and</strong> to remind<br />

ourselves of the steps of multiplication <strong>and</strong> the names of the oper<strong>and</strong>s. For reasons<br />

that will become clear shortly, we limit this decimal example to using only the<br />

digits 0 and 1. Multiplying 1000ten by 1001ten:

    Multiplicand      1000ten
    Multiplier      x 1001ten
                      1000
                     0000
                    0000
                   1000
    Product        1001000ten

    Multiplication is vexation, Division is as bad; The rule of three doth puzzle me,
    And practice drives me mad.
    Anonymous, Elizabethan manuscript, 1570

The first oper<strong>and</strong> is called the multiplic<strong>and</strong> <strong>and</strong> the second the multiplier.<br />

The final result is called the product. As you may recall, the algorithm learned in<br />

grammar school is to take the digits of the multiplier one at a time from right to<br />

left, multiplying the multiplic<strong>and</strong> by the single digit of the multiplier, <strong>and</strong> shifting<br />

the intermediate product one digit to the left of the earlier intermediate products.<br />

The first observation is that the number of digits in the product is considerably<br />

larger than the number in either the multiplic<strong>and</strong> or the multiplier. In fact, if we<br />

ignore the sign bits, the length of the multiplication of an n-bit multiplic<strong>and</strong> <strong>and</strong> an<br />

m-bit multiplier is a product that is n m bits long. That is, n m bits are required<br />

to represent all possible products. Hence, like add, multiply must cope with<br />

overflow because we frequently want a 32-bit product as the result of multiplying<br />

two 32-bit numbers.<br />
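The same point can be seen in C: to keep all n + m product bits of two 32-bit operands, the operands must be widened to 64 bits before the multiply. The small sketch below is ours, for illustration only.

#include <stdint.h>

/* Full 64-bit product of two 32-bit operands: widen first, then multiply.
   Multiplying two int32_t values directly yields only a 32-bit result. */
int64_t full_product(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;    /* 32 + 32 = 64 bits, so no overflow */
}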

In this example, we restricted the decimal digits to 0 <strong>and</strong> 1. With only two<br />

choices, each step of the multiplication is simple:<br />

1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place if the multiplier digit is a 1, or

2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.

Although the decimal example above happens to use only 0 <strong>and</strong> 1, multiplication<br />

of binary numbers must always use 0 <strong>and</strong> 1, <strong>and</strong> thus always offers only these two<br />

choices.<br />

Now that we have reviewed the basics of multiplication, the traditional next<br />

step is to provide the highly optimized multiply hardware. We break with tradition<br />

in the belief that you will gain a better underst<strong>and</strong>ing by seeing the evolution of<br />

the multiply hardware <strong>and</strong> algorithm through multiple generations. For now, let’s<br />

assume that we are multiplying only positive numbers.



[Flowchart in Figure 3.4: test Multiplier0; if it is 1, add the multiplicand to the product and place the result in the Product register (step 1a); shift the Multiplicand register left 1 bit (step 2); shift the Multiplier register right 1 bit (step 3); if fewer than 32 repetitions, repeat; after the 32nd repetition, done.]

FIGURE 3.4 The first multiplication algorithm, using the hardware shown in Figure 3.3. If<br />

the least significant bit of the multiplier is 1, add the multiplic<strong>and</strong> to the product. If not, go to the next step.<br />

Shift the multiplic<strong>and</strong> left <strong>and</strong> the multiplier right in the next two steps. These three steps are repeated 32<br />

times.<br />
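For readers who prefer code to flowcharts, here is a minimal C sketch of this first algorithm for unsigned operands, with a 64-bit Multiplicand register and Product register; it mirrors Figure 3.4 rather than the refined hardware discussed next, and the names are our own.

#include <stdint.h>

/* Shift-and-add multiply, mirroring Figure 3.4 (unsigned operands). */
uint64_t multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;
    uint64_t mcand   = multiplicand;      /* 64-bit Multiplicand register */

    for (int i = 0; i < 32; i++) {        /* 32 repetitions */
        if (multiplier & 1)               /* 1. test Multiplier0 */
            product += mcand;             /* 1a. add multiplicand to product */
        mcand <<= 1;                      /* 2. shift Multiplicand left 1 bit */
        multiplier >>= 1;                 /* 3. shift Multiplier right 1 bit */
    }
    return product;
}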

This algorithm <strong>and</strong> hardware are easily refined to take 1 clock cycle per step.<br />

The speed-up comes from performing the operations in parallel: the multiplier<br />

<strong>and</strong> multiplic<strong>and</strong> are shifted while the multiplic<strong>and</strong> is added to the product if the<br />

multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of<br />

the multiplier <strong>and</strong> gets the preshifted version of the multiplic<strong>and</strong>. The hardware is<br />

usually further optimized to halve the width of the adder <strong>and</strong> registers by noticing<br />

where there are unused portions of registers <strong>and</strong> adders. Figure 3.5 shows the<br />

revised hardware.



Iteration Step Multiplier Multiplic<strong>and</strong> Product<br />

0 Initial values 0011 0000 0010 0000 0000<br />

1 1a: 1 ⇒ Prod = Prod + Mc<strong>and</strong> 0011 0000 0010 0000 0010<br />

2: Shift left Multiplic<strong>and</strong> 0011 0000 0100 0000 0010<br />

3: Shift right Multiplier 0001 0000 0100 0000 0010<br />

2 1a: 1 ⇒ Prod = Prod + Mc<strong>and</strong> 0001 0000 0100 0000 0110<br />

2: Shift left Multiplic<strong>and</strong> 0001 0000 1000 0000 0110<br />

3: Shift right Multiplier 0000 0000 1000 0000 0110<br />

3 1: 0 ⇒ No operation 0000 0000 1000 0000 0110<br />

2: Shift left Multiplic<strong>and</strong> 0000 0001 0000 0000 0110<br />

3: Shift right Multiplier 0000 0001 0000 0000 0110<br />

4 1: 0 ⇒ No operation 0000 0001 0000 0000 0110<br />

2: Shift left Multiplic<strong>and</strong> 0000 0010 0000 0000 0110<br />

3: Shift right Multiplier 0000 0010 0000 0000 0110<br />

FIGURE 3.6 Multiply example using algorithm in Figure 3.4. The bit examined to determine the<br />

next step is circled in color.<br />

Signed Multiplication<br />

So far, we have dealt with positive numbers. The easiest way to underst<strong>and</strong> how<br />

to deal with signed numbers is to first convert the multiplier <strong>and</strong> multiplic<strong>and</strong> to<br />

positive numbers <strong>and</strong> then remember the original signs. The algorithms should<br />

then be run for 31 iterations, leaving the signs out of the calculation. As we learned<br />

in grammar school, we need to negate the product only if the original signs disagree.

It turns out that the last algorithm will work for signed numbers, provided that<br />

we remember that we are dealing with numbers that have infinite digits, <strong>and</strong> we are<br />

only representing them with 32 bits. Hence, the shifting steps would need to extend<br />

the sign of the product for signed numbers. When the algorithm completes, the<br />

lower word would have the 32-bit product.<br />

Faster Multiplication<br />

Moore’s Law has provided so much more in resources that hardware designers can<br />

now build much faster multiplication hardware. Whether the multiplic<strong>and</strong> is to be<br />

added or not is known at the beginning of the multiplication by looking at each of<br />

the 32 multiplier bits. Faster multiplications are possible by essentially providing<br />

one 32-bit adder for each bit of the multiplier: one input is the multiplic<strong>and</strong> ANDed<br />

with a multiplier bit, <strong>and</strong> the other is the output of a prior adder.<br />

A straightforward approach would be to connect the outputs of adders on the<br />

right to the inputs of adders on the left, making a stack of adders 32 high. An<br />

alternative way to organize these 32 additions is in a parallel tree, as Figure 3.7<br />

shows. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit add times.
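A software analogy of that tree, sketched below under our own naming: the 32 partial products are formed by ANDing the multiplicand with each multiplier bit (shifted into position) and then summed pairwise in log2(32) = 5 levels rather than in one long chain. Real hardware would use carry-save adders, so this only illustrates the dataflow.

#include <stdint.h>

/* Sum the 32 partial products as a balanced tree: 5 levels of pairwise adds
   instead of a chain of 32 dependent adds. */
uint64_t tree_multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t pp[32];
    for (int i = 0; i < 32; i++)          /* partial product i: bit i of the multiplier gates the multiplicand */
        pp[i] = ((multiplier >> i) & 1) ? (uint64_t)multiplicand << i : 0;

    for (int width = 32; width > 1; width /= 2)   /* log2(32) = 5 levels */
        for (int i = 0; i < width / 2; i++)
            pp[i] = pp[2 * i] + pp[2 * i + 1];    /* adds within a level are independent */

    return pp[0];
}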



3.4 Division<br />

The reciprocal operation of multiply is divide, an operation that is even less frequent<br />

<strong>and</strong> even more quirky. It even offers the opportunity to perform a mathematically<br />

invalid operation: dividing by 0.<br />

Let’s start with an example of long division using decimal numbers to recall the<br />

names of the oper<strong>and</strong>s <strong>and</strong> the grammar school division algorithm. For reasons<br />

similar to those in the previous section, we limit the decimal digits to just 0 or 1.<br />

The example is dividing 1,001,010ten by 1000ten:

                      1001ten       Quotient
    Divisor 1000ten ) 1001010ten    Dividend
                     −1000
                        10
                        101
                        1010
                       −1000
                          10ten     Remainder

Divide’s two oper<strong>and</strong>s, called the dividend <strong>and</strong> divisor, <strong>and</strong> the result, called<br />

the quotient, are accompanied by a second result, called the remainder. Here is<br />

another way to express the relationship between the components:<br />

Dividend = Quotient × Divisor + Remainder

where the remainder is smaller than the divisor. Infrequently, programs use the<br />

divide instruction just to get the remainder, ignoring the quotient.<br />

The basic grammar school division algorithm tries to see how big a number<br />

can be subtracted, creating a digit of the quotient on each attempt. Our carefully<br />

selected decimal example uses only the numbers 0 <strong>and</strong> 1, so it’s easy to figure out<br />

how many times the divisor goes into the portion of the dividend: it’s either 0 times<br />

or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to<br />

these two choices, thereby simplifying binary division.<br />

Let’s assume that both the dividend <strong>and</strong> the divisor are positive <strong>and</strong> hence the<br />

quotient <strong>and</strong> the remainder are nonnegative. The division oper<strong>and</strong>s <strong>and</strong> both<br />

results are 32-bit values, <strong>and</strong> we will ignore the sign for now.<br />

A Division Algorithm <strong>and</strong> Hardware<br />

Figure 3.8 shows hardware to mimic our grammar school algorithm. We start with<br />

the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move<br />

the divisor to the right one digit, so we start with the divisor placed in the left half<br />

of the 64-bit Divisor register <strong>and</strong> shift it right 1 bit each step to align it with the<br />

dividend. The Remainder register is initialized with the dividend.<br />

Divide et impera.<br />

Latin for “Divide <strong>and</strong><br />

rule,” ancient political<br />

maxim cited by<br />

Machiavelli, 1532<br />

dividend A number<br />

being divided.<br />

divisor A number that<br />

the dividend is divided by.<br />

quotient The primary<br />

result of a division;<br />

a number that when<br />

multiplied by the<br />

divisor <strong>and</strong> added to the<br />

remainder produces the<br />

dividend.<br />

remainder The<br />

secondary result of<br />

a division; a number<br />

that when added to the<br />

product of the quotient<br />

<strong>and</strong> the divisor produces<br />

the dividend.



[Flowchart in Figure 3.9: subtract the Divisor register from the Remainder register and place the result in the Remainder register (step 1); test the remainder: if the remainder is ≥ 0, shift the Quotient register to the left, setting the new rightmost bit to 1 (step 2a); if the remainder is < 0, restore the original value by adding the Divisor register back to the Remainder register and shift the Quotient register left, setting the new least significant bit to 0 (step 2b); shift the Divisor register right 1 bit (step 3); if fewer than 33 repetitions, repeat; after the 33rd repetition, done.]

FIGURE 3.9 A division algorithm, using the hardware in Figure 3.8. If the remainder is positive,<br />

the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after<br />

step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient <strong>and</strong> adds<br />

the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the<br />

divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times.<br />
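A minimal C sketch of this restoring algorithm for unsigned 32-bit operands appears below; it is our code, following Figure 3.9 rather than the improved hardware of Figure 3.11, and the caller must guard against a zero divisor.

#include <stdint.h>

/* Restoring division in the style of Figure 3.9, unsigned 32-bit operands.
   The caller must ensure divisor != 0. */
void divide(uint32_t dividend, uint32_t divisor,
            uint32_t *quotient, uint32_t *remainder)
{
    uint64_t rem = dividend;                   /* Remainder register starts with the dividend */
    uint64_t div = (uint64_t)divisor << 32;    /* Divisor starts in the left half */
    uint32_t quo = 0;

    for (int i = 0; i < 33; i++) {             /* 33 repetitions */
        rem -= div;                            /* 1. subtract */
        if ((int64_t)rem >= 0) {
            quo = (quo << 1) | 1;              /* 2a. shift quotient left, new bit = 1 */
        } else {
            rem += div;                        /* 2b. restore, new bit = 0 */
            quo = quo << 1;
        }
        div >>= 1;                             /* 3. shift divisor right 1 bit */
    }
    *quotient  = quo;
    *remainder = (uint32_t)rem;
}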

This algorithm <strong>and</strong> hardware can be refined to be faster <strong>and</strong> cheaper. The speedup<br />

comes from shifting the oper<strong>and</strong>s <strong>and</strong> the quotient simultaneously with the<br />

subtraction. This refinement halves the width of the adder <strong>and</strong> registers by noticing<br />

where there are unused portions of registers <strong>and</strong> adders. Figure 3.11 shows the<br />

revised hardware.



Elaboration: The one complication of signed division is that we must also set the sign of the remainder. Remember that the following equation must always hold:

Dividend = Quotient × Divisor + Remainder

To understand how to set the sign of the remainder, let's look at the example of dividing all the combinations of ±7ten by ±2ten. The first case is easy:

+7 ÷ +2: Quotient = +3, Remainder = +1

Checking the results:

7 = 3 × 2 + (+1) = 6 + 1

If we change the sign of the dividend, the quotient must change as well:

−7 ÷ +2: Quotient = −3

Rewriting our basic formula to calculate the remainder:

Remainder = (Dividend − Quotient × Divisor) = −7 − (−3 × +2) = −7 − (−6) = −1

Checking the results again:

−7 = −3 × 2 + (−1) = −6 − 1

The reason the answer isn't a quotient of −4 and a remainder of +1, which would also fit this formula, is that the absolute value of the quotient would then change depending on the sign of the dividend and the divisor! Clearly, if

−(x ÷ y) ≠ (−x) ÷ y

programming would be an even greater challenge. This anomalous behavior is avoided by following the rule that the dividend and remainder must have the same signs, no matter what the signs of the divisor and quotient.

We calculate the other combinations by following the same rule:

+7 ÷ −2: Quotient = −3, Remainder = +1

−7 ÷ −2: Quotient = +3, Remainder = −1
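This is also the convention C has followed since C99: integer division truncates toward zero, so the remainder keeps the sign of the dividend. A few lines of C (ours) confirm all four combinations:

#include <stdio.h>

/* C99 division truncates toward zero; the remainder takes the dividend's sign. */
int main(void)
{
    printf("%d %d\n",  7 /  2,  7 %  2);   /* prints  3  1 */
    printf("%d %d\n", -7 /  2, -7 %  2);   /* prints -3 -1 */
    printf("%d %d\n",  7 / -2,  7 % -2);   /* prints -3  1 */
    printf("%d %d\n", -7 / -2, -7 % -2);   /* prints  3 -1 */
    return 0;
}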



Hardware/<br />

Software<br />

Interface<br />

MIPS divide instructions ignore overflow, so software must determine whether the<br />

quotient is too large. In addition to overflow, division can also result in an improper<br />

calculation: division by 0. Some computers distinguish these two anomalous events.<br />

MIPS software must check the divisor to discover division by 0 as well as overflow.<br />

Elaboration: An even faster algorithm does not immediately add the divisor back if the remainder is negative. It simply adds the divisor to the shifted remainder in the following step, since (r + d) × 2 − d = r × 2 + d × 2 − d = r × 2 + d. This nonrestoring division algorithm, which takes 1 clock cycle per step, is explored further in the exercises; the algorithm above is called restoring division. A third algorithm that doesn't save the result of the subtract if it's negative is called a nonperforming division algorithm. It averages one-third fewer arithmetic operations.

3.5 Floating Point<br />

Speed gets you<br />

nowhere if you’re<br />

headed the wrong way.<br />

American proverb<br />

scientific notation<br />

A notation that renders<br />

numbers with a single<br />

digit to the left of the<br />

decimal point.<br />

normalized A number<br />

in floating-point notation<br />

that has no leading 0s.<br />

Going beyond signed <strong>and</strong> unsigned integers, programming languages support<br />

numbers with fractions, which are called reals in mathematics. Here are some<br />

examples of reals:<br />

3.14159265…ten (pi)

2.71828…ten (e)

0.000000001ten or 1.0ten × 10^−9 (seconds in a nanosecond)

3,155,760,000ten or 3.15576ten × 10^9 (seconds in a typical century)

Notice that in the last case, the number didn’t represent a small fraction, but it<br />

was bigger than we could represent with a 32-bit signed integer. The alternative<br />

notation for the last two numbers is called scientific notation, which has a single<br />

digit to the left of the decimal point. A number in scientific notation that has no<br />

leading 0s is called a normalized number, which is the usual way to write it. For example, 1.0ten × 10^−9 is in normalized scientific notation, but 0.1ten × 10^−8 and 10.0ten × 10^−10 are not.

Just as we can show decimal numbers in scientific notation, we can also show<br />

binary numbers in scientific notation:<br />

1.0two × 2^−1

To keep a binary number in normalized form, we need a base that we can increase<br />

or decrease by exactly the number of bits the number must be shifted to have one<br />

nonzero digit to the left of the decimal point. Only a base of 2 fulfills our need. Since<br />

the base is not 10, we also need a new name for decimal point; binary point will do fine.



<strong>Computer</strong> arithmetic that supports such numbers is called floating point<br />

because it represents numbers in which the binary point is not fixed, as it is for<br />

integers. The programming language C uses the name float for such numbers. Just<br />

as in scientific notation, numbers are represented as a single nonzero digit to the<br />

left of the binary point. In binary, the form is<br />

floating point<br />

<strong>Computer</strong> arithmetic that<br />

represents numbers in<br />

which the binary point is<br />

not fixed.<br />

1.xxxxxxxxxtwo × 2^yyyy

(Although the computer represents the exponent in base 2 as well as the rest of the<br />

number, to simplify the notation we show the exponent in decimal.)<br />

A st<strong>and</strong>ard scientific notation for reals in normalized form offers three<br />

advantages. It simplifies exchange of data that includes floating-point numbers;<br />

it simplifies the floating-point arithmetic algorithms to know that numbers will<br />

always be in this form; <strong>and</strong> it increases the accuracy of the numbers that can be<br />

stored in a word, since the unnecessary leading 0s are replaced by real digits to the<br />

right of the binary point.<br />

Floating-Point Representation<br />

A designer of a floating-point representation must find a compromise between the<br />

size of the fraction <strong>and</strong> the size of the exponent, because a fixed word size means<br />

you must take a bit from one to add a bit to the other. This tradeoff is between<br />

precision <strong>and</strong> range: increasing the size of the fraction enhances the precision<br />

of the fraction, while increasing the size of the exponent increases the range of<br />

numbers that can be represented. As our design guideline from Chapter 2 reminds<br />

us, good design dem<strong>and</strong>s good compromise.<br />

Floating-point numbers are usually a multiple of the size of a word. The<br />

representation of a MIPS floating-point number is shown below, where s is the sign<br />

of the floating-point number (1 meaning negative), exponent is the value of the<br />

8-bit exponent field (including the sign of the exponent), <strong>and</strong> fraction is the 23-bit<br />

number. As we recall from Chapter 2, this representation is sign <strong>and</strong> magnitude,<br />

since the sign is a separate bit from the rest of the number.<br />

fraction The value,<br />

generally between 0 <strong>and</strong><br />

1, placed in the fraction<br />

field. The fraction is also<br />

called the mantissa.<br />

exponent In the<br />

numerical representation<br />

system of floating-point<br />

arithmetic, the value that<br />

is placed in the exponent<br />

field.<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

s exponent fraction<br />

1 bit 8 bits 23 bits<br />

In general, floating-point numbers are of the form<br />

(−1)^S × F × 2^E

F involves the value in the fraction field <strong>and</strong> E involves the value in the exponent<br />

field; the exact relationship to these fields will be spelled out soon. (We will shortly<br />

see that MIPS does something slightly more sophisticated.)



overflow (floating-point) A situation in which a positive exponent becomes too large to fit in the exponent field.

underflow (floating-point) A situation in which a negative exponent becomes too large to fit in the exponent field.

double precision<br />

A floating-point value<br />

represented in two 32-bit<br />

words.<br />

single precision<br />

A floating-point value<br />

represented in a single 32-<br />

bit word.<br />

These chosen sizes of exponent <strong>and</strong> fraction give MIPS computer arithmetic<br />

an extraordinary range. Fractions almost as small as 2.0ten × 10^−38 and numbers almost as large as 2.0ten × 10^38 can be represented in a computer. Alas, extraordinary

differs from infinite, so it is still possible for numbers to be too large. Thus, overflow<br />

interrupts can occur in floating-point arithmetic as well as in integer arithmetic.<br />

Notice that overflow here means that the exponent is too large to be represented<br />

in the exponent field.<br />

Floating point offers a new kind of exceptional event as well. Just as programmers<br />

will want to know when they have calculated a number that is too large to be<br />

represented, they will want to know if the nonzero fraction they are calculating<br />

has become so small that it cannot be represented; either event could result in a<br />

program giving incorrect answers. To distinguish it from overflow, we call this<br />

event underflow. This situation occurs when the negative exponent is too large to<br />

fit in the exponent field.<br />

One way to reduce chances of underflow or overflow is to offer another format<br />

that has a larger exponent. In C this number is called double, <strong>and</strong> operations on<br />

doubles are called double precision floating-point arithmetic; single precision<br />

floating point is the name of the earlier format.<br />

The representation of a double precision floating-point number takes two MIPS<br />

words, as shown below, where s is still the sign of the number, exponent is the value<br />

of the 11-bit exponent field, <strong>and</strong> fraction is the 52-bit number in the fraction field.<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

s<br />

exponent<br />

fraction<br />

1 bit 11 bits 20 bits<br />

fraction (continued)<br />

32 bits<br />

MIPS double precision allows numbers almost as small as 2.0ten × 10^−308 and almost as large as 2.0ten × 10^308. Although double precision does increase the exponent

range, its primary advantage is its greater precision because of the much larger<br />

fraction.<br />
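The C library exposes these limits through <float.h>; printing them is a quick way to confirm the single and double precision ranges just described. (The sketch below is ours; the printed values are the exact normalized limits, slightly different from the rounded 2.0 × 10^±38 and 2.0 × 10^±308 figures above.)

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Smallest and largest normalized magnitudes for IEEE 754 single and double. */
    printf("float:  %e to %e\n", FLT_MIN, FLT_MAX);   /* about 1.2e-38 to 3.4e+38 */
    printf("double: %e to %e\n", DBL_MIN, DBL_MAX);   /* about 2.2e-308 to 1.8e+308 */
    return 0;
}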

These formats go beyond MIPS. They are part of the IEEE 754 floating-point<br />

st<strong>and</strong>ard, found in virtually every computer invented since 1980. This st<strong>and</strong>ard has<br />

greatly improved both the ease of porting floating-point programs <strong>and</strong> the quality<br />

of computer arithmetic.<br />

To pack even more bits into the signific<strong>and</strong>, IEEE 754 makes the leading 1-bit<br />

of normalized binary numbers implicit. Hence, the number is actually 24 bits long<br />

in single precision (implied 1 and a 23-bit fraction), and 53 bits long in double precision (1 + 52). To be precise, we use the term significand to represent the 24-

or 53-bit number that is 1 plus the fraction, <strong>and</strong> fraction when we mean the 23- or<br />

52-bit number. Since 0 has no leading 1, it is given the reserved exponent value 0 so<br />

that the hardware won’t attach a leading 1 to it.



Single precision Double precision Object represented<br />

Exponent Fraction Exponent Fraction<br />

0 0 0 0 0<br />

0 Nonzero 0 Nonzero ± denormalized number<br />

1–254 Anything 1–2046 Anything ± floating-point number<br />

255 0 2047 0 ± infinity<br />

255 Nonzero 2047 Nonzero NaN (Not a Number)<br />

FIGURE 3.13 IEEE 754 encoding of floating-point numbers. A separate sign bit determines the

sign. Denormalized numbers are described in the Elaboration on page 222. This information is also found in<br />

Column 4 of the MIPS Reference Data Card at the front of this book.<br />

Thus 00 … 00two represents 0; the representation of the rest of the numbers uses the form from before with the hidden 1 added:

(−1)^S × (1 + Fraction) × 2^E

where the bits of the fraction represent a number between 0 and 1 and E specifies the value in the exponent field, to be given in detail shortly. If we number the bits of the fraction from left to right s1, s2, s3, …, then the value is

(−1)^S × (1 + (s1 × 2^−1) + (s2 × 2^−2) + (s3 × 2^−3) + (s4 × 2^−4) + …) × 2^E

Figure 3.13 shows the encodings of IEEE 754 floating-point numbers. Other<br />

features of IEEE 754 are special symbols to represent unusual events. For example,<br />

instead of interrupting on a divide by 0, software can set the result to a bit pattern<br />

representing ∞ or ∞; the largest exponent is reserved for these special symbols.<br />

When the programmer prints the results, the program will print an infinity symbol.<br />

(For the mathematically trained, the purpose of infinity is to form topological<br />

closure of the reals.)<br />

IEEE 754 even has a symbol for the result of invalid operations, such as 0/0<br />

or subtracting infinity from infinity. This symbol is NaN, for Not a Number. The<br />

purpose of NaNs is to allow programmers to postpone some tests <strong>and</strong> decisions to<br />

a later time in the program when they are convenient.<br />

The designers of IEEE 754 also wanted a floating-point representation that could<br />

be easily processed by integer comparisons, especially for sorting. This desire is<br />

why the sign is in the most significant bit, allowing a quick test of less than, greater<br />

than, or equal to 0. (It’s a little more complicated than a simple integer sort, since<br />

this notation is essentially sign <strong>and</strong> magnitude rather than two’s complement.)<br />

Placing the exponent before the signific<strong>and</strong> also simplifies the sorting of<br />

floating-point numbers using integer comparison instructions, since numbers with<br />

bigger exponents look larger than numbers with smaller exponents, as long as both<br />

exponents have the same sign.



Negative exponents pose a challenge to simplified sorting. If we use two’s<br />

complement or any other notation in which negative exponents have a 1 in the<br />

most significant bit of the exponent field, a negative exponent will look like a big number. For example, 1.0two × 2^−1 would be represented as

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />

(Remember that the leading 1 is implicit in the significand.) The value 1.0two × 2^+1 would look like the smaller binary number

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />

The desirable notation must therefore represent the most negative exponent as<br />

00 … 00 two<br />

<strong>and</strong> the most positive as 11 … 11 two<br />

. This convention is called biased<br />

notation, with the bias being the number subtracted from the normal, unsigned<br />

representation to determine the real value.<br />

IEEE 754 uses a bias of 127 for single precision, so an exponent of −1 is represented by the bit pattern of the value −1 + 127ten, or 126ten = 0111 1110two, and +1 is represented by 1 + 127, or 128ten = 1000 0000two. The exponent bias for

double precision is 1023. Biased exponent means that the value represented by a floating-point number is really

(−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

The range of single precision numbers is then from as small as

±1.00000000000000000000000two × 2^−126

to as large as

±1.11111111111111111111111two × 2^+127.

Let’s demonstrate.



Floating-Point Representation<br />

Show the IEEE 754 binary representation of the number −0.75ten in single and double precision.

EXAMPLE<br />

ANSWER

The number −0.75ten is also

−3/4ten or −3/2^2

It is also represented by the binary fraction

−11two/2^2, or −0.11two

In scientific notation, the value is

−0.11two × 2^0

and in normalized scientific notation, it is

−1.1two × 2^−1

The general representation for a single precision number is

(−1)^S × (1 + Fraction) × 2^(Exponent − 127)

Subtracting the bias 127 from the exponent of −1.1two × 2^−1 yields

(−1)^1 × (1 + .1000 0000 0000 0000 0000 000two) × 2^(126 − 127)

The single precision binary representation of −0.75ten is then

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />

1 bit 8 bits 23 bits<br />

The double precision representation is<br />

(−1)^1 × (1 + .1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000two) × 2^(1022 − 1023)

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />

1 bit 11 bits 20 bits<br />

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />

32 bits



Now let’s try going the other direction.<br />

EXAMPLE<br />

Converting Binary to Decimal Floating Point<br />

What decimal number is represented by this single precision float?<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />

ANSWER<br />

The sign bit is 1, the exponent field contains 129, and the fraction field contains 1 × 2^−2 = 1/4, or 0.25. Using the basic equation,

(−1)^S × (1 + Fraction) × 2^(Exponent − Bias) = (−1)^1 × (1 + 0.25) × 2^(129 − 127)
= −1 × 1.25 × 2^2
= −1.25 × 4
= −5.0
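The same decoding can be done mechanically in C by extracting the three fields from the 32-bit pattern. The sketch below is ours and handles only normalized numbers (it ignores the reserved exponent values 0 and 255); it reproduces the −5.0 result for the bit pattern above, which is 0xC0A00000 in hexadecimal.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a normalized single precision pattern:
   (-1)^S x (1 + Fraction) x 2^(Exponent - 127). */
double decode_single(uint32_t bits)
{
    uint32_t sign     = bits >> 31;              /* bit 31 */
    uint32_t exponent = (bits >> 23) & 0xFF;     /* bits 30-23 */
    uint32_t fraction = bits & 0x7FFFFF;         /* bits 22-0 */

    double significand = 1.0 + fraction / 8388608.0;       /* 1 + fraction / 2^23 */
    double value = significand * pow(2.0, (int)exponent - 127);
    return sign ? -value : value;
}

int main(void)
{
    printf("%f\n", decode_single(0xC0A00000));   /* prints -5.000000 */
    return 0;
}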

In the next few subsections, we will give the algorithms for floating-point<br />

addition <strong>and</strong> multiplication. At their core, they use the corresponding integer<br />

operations on the signific<strong>and</strong>s, but extra bookkeeping is necessary to h<strong>and</strong>le the<br />

exponents <strong>and</strong> normalize the result. We first give an intuitive derivation of the<br />

algorithms in decimal <strong>and</strong> then give a more detailed, binary version in the figures.<br />

Elaboration: Following IEEE guidelines, the IEEE 754 committee was reformed 20<br />

years after the st<strong>and</strong>ard to see what changes, if any, should be made. The revised<br />

st<strong>and</strong>ard IEEE 754-2008 includes nearly all the IEEE 754-1985 <strong>and</strong> adds a 16-bit format<br />

(“half precision”) <strong>and</strong> a 128-bit format (“quadruple precision”). No hardware has yet been<br />

built that supports quadruple precision, but it will surely come. The revised standard also adds decimal floating-point arithmetic, which IBM mainframes have implemented.

Elaboration: In an attempt to increase range without removing bits from the signific<strong>and</strong>,<br />

some computers before the IEEE 754 st<strong>and</strong>ard used a base other than 2. For example,<br />

the IBM 360 <strong>and</strong> 370 mainframe computers use base 16. Since changing the IBM<br />

exponent by one means shifting the signific<strong>and</strong> by 4 bits, “normalized” base 16 numbers<br />

can have up to 3 leading bits of 0s! Hence, hexadecimal digits mean that up to 3 bits must<br />

be dropped from the signific<strong>and</strong>, which leads to surprising problems in the accuracy of<br />

floating-point arithmetic. IBM mainframes now support IEEE 754 as well as the hex format.



Floating-Point Addition<br />

Let’s add numbers in scientific notation by h<strong>and</strong> to illustrate the problems in<br />

floating-point addition: 9.999ten × 10^1 + 1.610ten × 10^−1. Assume that we can store only four decimal digits of the significand and two decimal digits of the exponent.

Step 1. To be able to add these numbers properly, we must align the decimal point of the number that has the smaller exponent. Hence, we need a form of the smaller number, 1.610ten × 10^−1, that matches the larger exponent. We obtain this by observing that there are multiple representations of an unnormalized floating-point number in scientific notation:

1.610ten × 10^−1 = 0.1610ten × 10^0 = 0.01610ten × 10^1

The number on the right is the version we desire, since its exponent matches the exponent of the larger number, 9.999ten × 10^1. Thus, the first step shifts the significand of the smaller number to the right until its corrected exponent matches that of the larger number. But we can represent only four decimal digits so, after shifting, the number is really

0.016 × 10^1

Step 2. Next comes the addition of the significands:

      9.999ten
    + 0.016ten
     10.015ten

The sum is 10.015ten × 10^1.

Step 3. This sum is not in normalized scientific notation, so we need to<br />

adjust it:<br />

10.015ten × 10^1 = 1.0015ten × 10^2

Thus, after the addition we may have to shift the sum to put it into<br />

normalized form, adjusting the exponent appropriately. This example<br />

shows shifting to the right, but if one number were positive <strong>and</strong> the<br />

other were negative, it would be possible for the sum to have many<br />

leading 0s, requiring left shifts. Whenever the exponent is increased<br />

or decreased, we must check for overflow or underflow—that is, we<br />

must make sure that the exponent still fits in its field.<br />

Step 4. Since we assumed that the signific<strong>and</strong> can be only four digits long<br />

(excluding the sign), we must round the number. In our grammar<br />

school algorithm, the rules truncate the number if the digit to the<br />

right of the desired point is between 0 <strong>and</strong> 4 <strong>and</strong> add 1 to the digit if<br />

the number to the right is between 5 <strong>and</strong> 9. The number<br />

1.0015ten × 10^2



is rounded to four digits in the significand to 1.002ten × 10^2

since the fourth digit to the right of the decimal point was between 5<br />

<strong>and</strong> 9. Notice that if we have bad luck on rounding, such as adding 1<br />

to a string of 9s, the sum may no longer be normalized <strong>and</strong> we would<br />

need to perform step 3 again.<br />

Figure 3.14 shows the algorithm for binary floating-point addition that follows<br />

this decimal example. Steps 1 <strong>and</strong> 2 are similar to the example just discussed:<br />

adjust the signific<strong>and</strong> of the number with the smaller exponent <strong>and</strong> then add the<br />

two signific<strong>and</strong>s. Step 3 normalizes the results, forcing a check for overflow or<br />

underflow. The test for overflow <strong>and</strong> underflow in step 3 depends on the precision<br />

of the oper<strong>and</strong>s. Recall that the pattern of all 0 bits in the exponent is reserved <strong>and</strong><br />

used for the floating-point representation of zero. Moreover, the pattern of all 1 bits<br />

in the exponent is reserved for indicating values <strong>and</strong> situations outside the scope of<br />

normal floating-point numbers (see the Elaboration on page 222). For the example<br />

below, remember that for single precision, the maximum exponent is 127, and the minimum exponent is −126.

EXAMPLE<br />

Binary Floating-Point Addition<br />

Try adding the numbers 0.5ten and −0.4375ten

in binary using the algorithm in<br />

Figure 3.14.<br />

ANSWER<br />

Let’s first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:

0.5ten = 1/2ten = 1/2^1 = 0.1two = 0.1two × 2^0 = 1.000two × 2^−1

−0.4375ten = −7/16ten = −7/2^4 = −0.0111two = −0.0111two × 2^0 = −1.110two × 2^−2

Now we follow the algorithm:

Step 1. The significand of the number with the lesser exponent (−1.110two × 2^−2) is shifted right until its exponent matches the larger number:

−1.110two × 2^−2 = −0.111two × 2^−1

Step 2. Add the significands:

1.000two × 2^−1 + (−0.111two × 2^−1) = 0.001two × 2^−1



Step 3. Normalize the sum, checking for overflow or underflow:

0.001two × 2^−1 = 0.010two × 2^−2 = 0.100two × 2^−3 = 1.000two × 2^−4

Since 127 ≥ −4 ≥ −126, there is no overflow or underflow. (The biased exponent would be −4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)

Step 4. Round the sum:<br />

1.000two × 2^−4

The sum already fits exactly in 4 bits, so there is no change to the bits<br />

due to rounding.<br />

This sum is then

1.000two × 2^−4 = 0.0001000two = 0.0001two = 1/2^4 = 1/16ten = 0.0625ten

This sum is what we would expect from adding 0.5ten to −0.4375ten.
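Because 0.5, −0.4375, and 0.0625 are all exact binary fractions, the same sum can be checked directly with C's single precision type (a quick sketch of ours):

#include <stdio.h>

int main(void)
{
    /* All three values are exact in binary floating point,
       so this addition involves no rounding at all. */
    float sum = 0.5f + (-0.4375f);
    printf("%f\n", sum);              /* prints 0.062500 */
    return 0;
}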

Many computers dedicate hardware to run floating-point operations as fast as possible.<br />

Figure 3.15 sketches the basic organization of hardware for floating-point addition.<br />

Floating-Point Multiplication<br />

Now that we have explained floating-point addition, let’s try floating-point<br />

multiplication. We start by multiplying decimal numbers in scientific notation by<br />

hand: 1.110ten × 10^10 × 9.200ten × 10^−5. Assume that we can store only four digits

of the signific<strong>and</strong> <strong>and</strong> two digits of the exponent.<br />

Step 1. Unlike addition, we calculate the exponent of the product by simply adding the exponents of the operands together:

New exponent = 10 + (−5) = 5

Let's do this with the biased exponents as well to make sure we obtain the same result: 10 + 127 = 137, and −5 + 127 = 122, so

New exponent = 137 + 122 = 259

This result is too large for the 8-bit exponent field, so something is amiss! The problem is with the bias because we are adding the biases as well as the exponents:

New exponent = (10 + 127) + (−5 + 127) = (5 + 2 × 127) = 259

Accordingly, to get the correct biased sum when we add biased numbers,<br />

we must subtract the bias from the sum:



[Block diagram in Figure 3.15: a small ALU compares the two exponents; the exponent difference controls multiplexors that select the larger exponent and route the significands; the smaller significand is shifted right; the big ALU adds the significands; the result is normalized by shifting left or right while incrementing or decrementing the exponent; rounding hardware then produces the final sign, exponent, and fraction.]

FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond<br />

to each block, from top to bottom. First, the exponent of one oper<strong>and</strong> is subtracted from the other using the small ALU to determine which is<br />

larger <strong>and</strong> by how much. This difference controls the three multiplexors; from left to right, they select the larger exponent, the signific<strong>and</strong> of the<br />

smaller number, <strong>and</strong> the signific<strong>and</strong> of the larger number. The smaller signific<strong>and</strong> is shifted right, <strong>and</strong> then the signific<strong>and</strong>s are added together<br />

using the big ALU. The normalization step then shifts the sum left or right <strong>and</strong> increments or decrements the exponent. Rounding then creates<br />

the final result, which may require normalizing again to produce the actual final result.



New exponent = 137 + 122 − 127 = 259 − 127 = 132 = (5 + 127)

<strong>and</strong> 5 is indeed the exponent we calculated initially.<br />

Step 2. Next comes the multiplication of the significands:

        1.110ten
      × 9.200ten
           0000
          0000
        2220
       9990
    10212000ten

There are three digits to the right of the decimal point for each<br />

oper<strong>and</strong>, so the decimal point is placed six digits from the right in the<br />

product signific<strong>and</strong>:<br />

10.212000ten

Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212 × 10^5.

Step 3. This product is unnormalized, so we need to normalize it:<br />

10.212ten × 10^5 = 1.0212ten × 10^6

Thus, after the multiplication, the product can be shifted right one digit<br />

to put it in normalized form, adding 1 to the exponent. At this point,<br />

we can check for overflow <strong>and</strong> underflow. Underflow may occur if both<br />

oper<strong>and</strong>s are small—that is, if both have large negative exponents.<br />

Step 4. We assumed that the signific<strong>and</strong> is only four digits long (excluding the<br />

sign), so we must round the number. The number<br />

1.0212ten × 10^6

is rounded to four digits in the significand to

1.021ten × 10^6

Step 5. The sign of the product depends on the signs of the original oper<strong>and</strong>s.<br />

If they are both the same, the sign is positive; otherwise, it’s negative.<br />

Hence, the product is<br />

+1.021ten × 10^6

The sign of the sum in the addition algorithm was determined by<br />

addition of the signific<strong>and</strong>s, but in multiplication, the sign of the<br />

product is determined by the signs of the oper<strong>and</strong>s.



Once again, as Figure 3.16 shows, multiplication of binary floating-point numbers<br />

is quite similar to the steps we have just completed. We start with calculating<br />

the new exponent of the product by adding the biased exponents, being sure to<br />

subtract one bias to get the proper result. Next is multiplication of signific<strong>and</strong>s,<br />

followed by an optional normalization step. The size of the exponent is checked<br />

for overflow or underflow, <strong>and</strong> then the product is rounded. If rounding leads to<br />

further normalization, we once again check for exponent size. Finally, set the sign<br />

bit to 1 if the signs of the oper<strong>and</strong>s were different (negative product) or to 0 if they<br />

were the same (positive product).<br />

EXAMPLE<br />

ANSWER<br />

Binary Floating-Point Multiplication<br />

Let’s try multiplying the numbers 0.5ten and −0.4375ten, using the steps in Figure 3.16.

In binary, the task is multiplying 1.000two × 2^−1 by −1.110two × 2^−2.

Step 1. Adding the exponents without bias:

−1 + (−2) = −3

or, using the biased representation:

(−1 + 127) + (−2 + 127) − 127 = (−1 − 2) + (127 + 127 − 127)
= −3 + 127 = 124

Step 2. Multiplying the significands:

        1.000two
      × 1.110two
           0000
          1000
         1000
        1000
        1110000two

The product is 1.110000two × 2^−3, but we need to keep it to 4 bits, so it is 1.110two × 2^−3.

Step 3. Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since 127 ≥ −3 ≥ −126, there is no overflow or underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the exponent fits.)

Step 4. Rounding the product makes no change:

1.110two × 2^−3



Step 5. Since the signs of the original operands differ, make the sign of the product negative. Hence, the product is

−1.110two × 2^−3

Converting to decimal to check our results:

−1.110two × 2^−3 = −0.001110two = −0.00111two = −7/2^5 = −7/32ten = −0.21875ten

The product of 0.5ten and −0.4375ten is indeed −0.21875ten.
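As with the addition example, the operands and the product are exact binary fractions, so the hand calculation can be checked with one line of C (again, a sketch of ours):

#include <stdio.h>

int main(void)
{
    /* Both operands and the product are exact, so no rounding occurs. */
    float product = 0.5f * (-0.4375f);
    printf("%f\n", product);          /* prints -0.218750 */
    return 0;
}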

Floating-Point Instructions in MIPS<br />

MIPS supports the IEEE 754 single precision <strong>and</strong> double precision formats with<br />

these instructions:<br />

■ Floating-point addition, single (add.s) <strong>and</strong> addition, double (add.d)<br />

■ Floating-point subtraction, single (sub.s) <strong>and</strong> subtraction, double (sub.d)<br />

■ Floating-point multiplication, single (mul.s) <strong>and</strong> multiplication, double (mul.d)<br />

■ Floating-point division, single (div.s) <strong>and</strong> division, double (div.d)<br />

■ Floating-point comparison, single (c.x.s) <strong>and</strong> comparison, double (c.x.d),<br />

where x may be equal (eq), not equal (neq), less than (lt), less than or equal<br />

(le), greater than (gt), or greater than or equal (ge)<br />

■ Floating-point branch, true (bc1t) <strong>and</strong> branch, false (bc1f)<br />

Floating-point comparison sets a bit to true or false, depending on the comparison<br />

condition, <strong>and</strong> a floating-point branch then decides whether or not to branch,<br />

depending on the condition.<br />

The MIPS designers decided to add separate floating-point registers—called<br />

$f0, $f1, $f2, …—used either for single precision or double precision. Hence,<br />

they included separate loads <strong>and</strong> stores for floating-point registers: lwc1 <strong>and</strong><br />

swc1. The base registers for floating-point data transfers which are used for<br />

addresses remain integer registers. The MIPS code to load two single precision<br />

numbers from memory, add them, <strong>and</strong> then store the sum might look like this:<br />

lwc1  $f4,c($sp)    # Load 32-bit F.P. number into F4
lwc1  $f6,a($sp)    # Load 32-bit F.P. number into F6
add.s $f2,$f4,$f6   # F2 = F4 + F6 single precision
swc1  $f2,b($sp)    # Store 32-bit F.P. number from F2

A double precision register is really an even-odd pair of single precision registers,<br />

using the even register number as its name. Thus, the pair of single precision<br />

registers $f2 <strong>and</strong> $f3 also form the double precision register named $f2.<br />

Figure 3.17 summarizes the floating-point portion of the MIPS architecture revealed<br />

in this chapter, with the additions to support floating point shown in color. Similar to<br />

Figure 2.19 in Chapter 2, Figure 3.18 shows the encoding of these instructions.



op(31:26):<br />

28–26<br />

0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />

31–29<br />

0(000) Rfmt Bltz/gez j jal beq bne blez bgtz<br />

1(001) addi addiu slti sltiu ANDi ORi xORi lui<br />

2(010) TLB FlPt<br />

3(011)<br />

4(100) lb lh lwl lw lbu lhu lwr<br />

5(101) sb sh swl sw swr<br />

6(110) lwc0 lwc1<br />

7(111) swc0 swc1<br />

op(31:26) = 010001 (FlPt), (rt(16:16) = 0 => c = f, rt(16:16) = 1 => c = t), rs(25:21):<br />

23–21<br />

0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />

25–24<br />

0(00) mfc1 cfc1 mtc1 ctc1<br />

1(01) bc1.c<br />

2(10) f = single f = double<br />

3(11)<br />

op(31:26) = 010001 (FlPt), (f above: 10000 => f = s, 10001 => f = d), funct(5:0):<br />

2–0<br />

0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />

5–3<br />

0(000) add.f sub.f mul.f div.f abs.f mov.f neg.f<br />

1(001)<br />

2(010)<br />

3(011)<br />

4(100) cvt.s.f cvt.d.f cvt.w.f<br />

5(101)<br />

6(110) c.f.f c.un.f c.eq.f c.ueq.f c.olt.f c.ult.f c.ole.f c.ule.f<br />

7(111) c.sf.f c.ngle.f c.seq.f c.ngl.f c.lt.f c.nge.f c.le.f c.ngt.f<br />

FIGURE 3.18 MIPS floating-point instruction encoding. This notation gives the value of a field by row <strong>and</strong> by column. For example,<br />

in the top portion of the figure, lw is found in row number 4 (100 two<br />

for bits 31–29 of the instruction) <strong>and</strong> column number 3 (011 two<br />

for bits<br />

28–26 of the instruction), so the corresponding value of the op field (bits 31–26) is 100011 two<br />

. Underscore means the field is used elsewhere.<br />

For example, FlPt in row 2 and column 1 (op = 010001two) is defined in the bottom part of the figure. Hence sub.f in row 0 and column 1 of the bottom section means that the funct field (bits 5–0) of the instruction is 000001two and the op field (bits 31–26) is 010001two. Note that the 5-bit rs field, specified in the middle portion of the figure, determines whether the operation is single precision (f = s, so rs = 10000) or double precision (f = d, so rs = 10001). Similarly, bit 16 of the instruction determines if the bc1.c instruction tests for true (bit 16 = 1 => bc1.t) or false (bit 16 = 0 => bc1.f). Instructions in color are described in Chapter 2 or this chapter, with Appendix A covering all instructions.

This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.



Hardware/<br />

Software<br />

Interface<br />

One issue that architects face in supporting floating-point arithmetic is whether<br />

to use the same registers used by the integer instructions or to add a special set<br />

for floating point. Because programs normally perform integer operations <strong>and</strong><br />

floating-point operations on different data, separating the registers will only<br />

slightly increase the number of instructions needed to execute a program. The<br />

major impact is to create a separate set of data transfer instructions to move data<br />

between floating-point registers <strong>and</strong> memory.<br />

The benefits of separate floating-point registers are having twice as many<br />

registers without using up more bits in the instruction format, having twice the<br />

register b<strong>and</strong>width by having separate integer <strong>and</strong> floating-point register sets, <strong>and</strong><br />

being able to customize registers to floating point; for example, some computers<br />

convert all sized oper<strong>and</strong>s in registers into a single internal format.<br />

EXAMPLE<br />

Compiling a Floating-Point C Program into MIPS Assembly Code<br />

Let’s convert a temperature in Fahrenheit to Celsius:<br />

float f2c (float fahr)<br />

{<br />

return ((5.0/9.0) *(fahr – 32.0));<br />

}<br />

Assume that the floating-point argument fahr is passed in $f12 <strong>and</strong> the<br />

result should go in $f0. (Unlike integer registers, floating-point register 0 can<br />

contain a number.) What is the MIPS assembly code?<br />

ANSWER<br />

We assume that the compiler places the three floating-point constants in<br />

memory within easy reach of the global pointer $gp. The first two instructions<br />

load the constants 5.0 <strong>and</strong> 9.0 into floating-point registers:<br />

f2c:<br />

lwc1 $f16,const5($gp) # $f16 = 5.0 (5.0 in memory)<br />

lwc1 $f18,const9($gp) # $f18 = 9.0 (9.0 in memory)<br />

They are then divided to get the fraction 5.0/9.0:<br />

div.s $f16, $f16, $f18 # $f16 = 5.0 / 9.0



(Many compilers would divide 5.0 by 9.0 at compile time <strong>and</strong> save the single<br />

constant 5.0/9.0 in memory, thereby avoiding the divide at runtime.) Next, we<br />

load the constant 32.0 <strong>and</strong> then subtract it from fahr ($f12):<br />

lwc1 $f18, const32($gp) # $f18 = 32.0

sub.s $f18, $f12, $f18 # $f18 = fahr – 32.0<br />

Finally, we multiply the two intermediate results, placing the product in $f0 as<br />

the return result, <strong>and</strong> then return<br />

mul.s $f0, $f16, $f18 # $f0 = (5/9)*(fahr – 32.0)<br />

jr $ra                  # return

Now let’s perform floating-point operations on matrices, code commonly<br />

found in scientific programs.<br />

Compiling Floating-Point C Procedure with Two-Dimensional<br />

Matrices into MIPS<br />

EXAMPLE<br />

Most floating-point calculations are performed in double precision. Let’s perform<br />

matrix multiply of C = C + A * B. It is commonly called DGEMM,

for Double precision, General Matrix Multiply. We’ll see versions of DGEMM<br />

again in Section 3.8 <strong>and</strong> subsequently in Chapters 4, 5, <strong>and</strong> 6. Let’s assume C,<br />

A, <strong>and</strong> B are all square matrices with 32 elements in each dimension.<br />

void mm (double c[32][32], double a[32][32], double b[32][32])

{<br />

int i, j, k;<br />

for (i = 0; i != 32; i = i + 1)<br />

for (j = 0; j != 32; j = j + 1)<br />

for (k = 0; k != 32; k = k + 1)<br />

c[i][j] = c[i][j] + a[i][k] *b[k][j];<br />

}<br />

The array starting addresses are parameters, so they are in $a0, $a1, <strong>and</strong> $a2.<br />

Assume that the integer variables are in $s0, $s1, <strong>and</strong> $s2, respectively.<br />

What is the MIPS assembly code for the body of the procedure?<br />

Note that c[i][j] is used in the innermost loop above. Since the loop index<br />

is k, the index does not affect c[i][j], so we can avoid loading <strong>and</strong> storing<br />

c[i][j] each iteration. Instead, the compiler loads c[i][j] into a register<br />

outside the loop, accumulates the sum of the products of a[i][k] <strong>and</strong><br />

ANSWER



b[k][j] in that same register, <strong>and</strong> then stores the sum into c[i][j] upon<br />

termination of the innermost loop.<br />

We keep the code simpler by using the assembly language pseudoinstructions<br />

li (which loads a constant into a register), <strong>and</strong> l.d <strong>and</strong> s.d (which the<br />

assembler turns into a pair of data transfer instructions, lwc1 or swc1, to a<br />

pair of floating-point registers).<br />

The body of the procedure starts with saving the loop termination value of<br />

32 in a temporary register <strong>and</strong> then initializing the three for loop variables:<br />

mm:...<br />

li $t1, 32 # $t1 = 32 (row size/loop end)<br />

li $s0, 0 # i = 0; initialize 1st for loop<br />

L1: li $s1, 0 # j = 0; restart 2nd for loop<br />

L2: li $s2, 0 # k = 0; restart 3rd for loop<br />

To calculate the address of c[i][j], we need to know how a 32 × 32, two-dimensional array is stored in memory. As you might expect, its layout is the

same as if there were 32 single-dimension arrays, each with 32 elements. So the<br />

first step is to skip over the i “single-dimensional arrays,” or rows, to get the<br />

one we want. Thus, we multiply the index in the first dimension by the size of<br />

the row, 32. Since 32 is a power of 2, we can use a shift instead:<br />

sll $t2, $s0, 5 # $t2 = i * 2^5 (size of row of c)

Now we add the second index to select the jth element of the desired row:<br />

addu $t2, $t2, $s1<br />

# $t2 = i * size(row) + j<br />

To turn this sum into a byte index, we multiply it by the size of a matrix element<br />

in bytes. Since each element is 8 bytes for double precision, we can instead shift<br />

left by 3:<br />

sll $t2, $t2, 3<br />

# $t2 = byte offset of [i][j]<br />

Next we add this sum to the base address of c, giving the address of c[i][j],<br />

<strong>and</strong> then load the double precision number c[i][j] into $f4:<br />

addu $t2, $a0, $t2 # $t2 = byte address of c[i][j]<br />

l.d $f4, 0($t2) # $f4 = 8 bytes of c[i][j]<br />

The following five instructions are virtually identical to the last five: calculate<br />

the address <strong>and</strong> then load the double precision number b[k][j].<br />

L3: sll $t0, $s2, 5 # $t0 = k * 2^5 (size of row of b)

addu $t0, $t0, $s1 # $t0 = k * size(row) + j<br />

sll $t0, $t0, 3 # $t0 = byte offset of [k][j]<br />

addu $t0, $a2, $t0 # $t0 = byte address of b[k][j]<br />

l.d $f16, 0($t0) # $f16 = 8 bytes of b[k][j]<br />

Similarly, the next five instructions are like the last five: calculate the address<br />

<strong>and</strong> then load the double precision number a[i][k].



sll $t0, $s0, 5 # $t0 = i * 2^5 (size of row of a)

addu $t0, $t0, $s2 # $t0 = i * size(row) + k<br />

sll $t0, $t0, 3 # $t0 = byte offset of [i][k]<br />

addu $t0, $a1, $t0 # $t0 = byte address of a[i][k]<br />

l.d $f18, 0($t0) # $f18 = 8 bytes of a[i][k]<br />

Now that we have loaded all the data, we are finally ready to do some floating-point

operations! We multiply elements of a <strong>and</strong> b located in registers $f18<br />

<strong>and</strong> $f16, <strong>and</strong> then accumulate the sum in $f4.<br />

mul.d $f16, $f18, $f16 # $f16 = a[i][k] * b[k][j]<br />

add.d $f4, $f4, $f16 # $f4 = c[i][j] + a[i][k] * b[k][j]

The final block increments the index k <strong>and</strong> loops back if the index is not 32.<br />

If it is 32, <strong>and</strong> thus the end of the innermost loop, we need to store the sum<br />

accumulated in $f4 into c[i][j].<br />

addiu $s2, $s2, 1 # $k = k + 1<br />

bne $s2, $t1, L3 # if (k != 32) go to L3<br />

s.d $f4, 0($t2) # c[i][j] = $f4<br />

Similarly, these final four instructions increment the index variable of the<br />

middle <strong>and</strong> outermost loops, looping back if the index is not 32 <strong>and</strong> exiting if<br />

the index is 32.<br />

addiu $s1, $s1, 1 # $j = j + 1<br />

bne $s1, $t1, L2 # if (j != 32) go to L2<br />

addiu $s0, $s0, 1 # $i = i + 1<br />

bne $s0, $t1, L1 # if (i != 32) go to L1<br />

…<br />
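The address arithmetic that the sll/addu sequences build up is easy to restate in C. The fragment below is purely illustrative (the helper name and flat-pointer style are ours, not part of the example above); it computes the same byte offsets for 32 × 32 matrices of 8-byte doubles.

#include <stdint.h>

/* Illustrative helper: byte offset of element [i][j] in a 32 x 32 matrix of
   doubles, formed exactly the way the assembly forms it. */
static uintptr_t offset_32x32(int i, int j)
{
    uintptr_t word = ((uintptr_t)i << 5) + (uintptr_t)j;  /* i * 32 + j */
    return word << 3;                                     /* * 8 bytes  */
}

/* The innermost loop of mm with the address arithmetic made explicit. */
static void mm_inner(double *c, double *a, double *b, int i, int j)
{
    double sum = *(double *)((char *)c + offset_32x32(i, j)); /* load c[i][j] once  */
    for (int k = 0; k != 32; k = k + 1)
        sum += *(double *)((char *)a + offset_32x32(i, k)) *
               *(double *)((char *)b + offset_32x32(k, j));
    *(double *)((char *)c + offset_32x32(i, j)) = sum;        /* store at loop exit */
}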

Figure 3.22 below shows the x86 assembly language code for a slightly different<br />

version of DGEMM in Figure 3.21.<br />

Elaboration: The array layout discussed in the example, called row-major order, is<br />

used by C <strong>and</strong> many other programming languages. Fortran instead uses column-major<br />

order, whereby the array is stored column by column.<br />
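The two layouts differ only in how a pair of indices is flattened into one. As a small illustration (the macro names are our own), for an n × n matrix kept in a one-dimensional block of memory:

/* Element [i][j] of an n x n matrix stored in a flat array: */
#define IDX_ROW_MAJOR(i, j, n)  ((i) * (n) + (j))   /* C: row by row            */
#define IDX_COL_MAJOR(i, j, n)  ((i) + (j) * (n))   /* Fortran: column by column */

Note that the DGEMM code in Figure 3.21 below indexes its flat arrays as C[i+j*n], the column-major pattern, which matches the Fortran-heritage convention of the standard DGEMM libraries.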

Elaboration: Only 16 of the 32 MIPS floating-point registers could originally be used<br />

for double precision operations: $f0, $f2, $f4, …, $f30. Double precision is computed<br />

using pairs of these single precision registers. The odd-numbered floating-point registers<br />

were used only to load <strong>and</strong> store the right half of 64-bit floating-point numbers. MIPS-32<br />

added l.d <strong>and</strong> s.d to the instruction set. MIPS-32 also added “paired single” versions of<br />

all floating-point instructions, where a single instruction results in two parallel floating-point<br />

operations on two 32-bit oper<strong>and</strong>s inside 64-bit registers (see Section 3.6). For example,<br />

add.ps $f0, $f2, $f4 is equivalent to add.s $f0, $f2, $f4 followed by add.s<br />

$f1, $f3, $f5.



Elaboration: Another reason for separate integer and floating-point registers is that
microprocessors in the 1980s didn’t have enough transistors to put the floating-point unit
on the same chip as the integer unit. Hence, the floating-point unit, including the floating-point

registers, was optionally available as a second chip. Such optional accelerator<br />

chips are called coprocessors, <strong>and</strong> explain the acronym for floating-point loads in MIPS:<br />

lwc1 means load word to coprocessor 1, the floating-point unit. (Coprocessor 0 deals<br />

with virtual memory, described in Chapter 5.) Since the early 1990s, microprocessors<br />

have integrated floating point (<strong>and</strong> just about everything else) on chip, <strong>and</strong> hence the term<br />

coprocessor joins accumulator <strong>and</strong> core memory as quaint terms that date the speaker.<br />

Elaboration: As mentioned in Section 3.4, accelerating division is more challenging
than multiplication. In addition to SRT, another technique to leverage a fast multiplier
is Newton’s iteration, where division is recast as finding the zero of a function to find
the reciprocal 1/c, which is then multiplied by the other operand. Iteration techniques
cannot be rounded properly without calculating many extra bits. A TI chip solved this
problem by calculating an extra-precise reciprocal.

Elaboration: Java embraces IEEE 754 by name in its definition of Java floating-point
data types and operations. Thus, the code in the first example could have well been
generated for a class method that converted Fahrenheit to Celsius.

The second example above uses multiple dimensional arrays, which are not explicitly<br />

supported in Java. Java allows arrays of arrays, but each array may have its own length,<br />

unlike multiple dimensional arrays in C. Like the examples in Chapter 2, a Java version<br />

of this second example would require a good deal of checking code for array bounds,<br />

including a new length calculation at the end of row access. It would also need to check<br />

that the object reference is not null.<br />

guard The first of two<br />

extra bits kept on the<br />

right during intermediate<br />

calculations of floating-point

numbers; used<br />

to improve rounding<br />

accuracy.<br />

round Method to<br />

make the intermediate<br />

floating-point result fit<br />

the floating-point format;<br />

the goal is typically to find<br />

the nearest number that<br />

can be represented in the<br />

format.<br />

Accurate Arithmetic<br />

Unlike integers, which can represent exactly every number between the smallest <strong>and</strong><br />

largest number, floating-point numbers are normally approximations for a number<br />

they can’t really represent. The reason is that an infinite variety of real numbers<br />

exists between, say, 0 and 1, but no more than 2^53 can be represented exactly in

double precision floating point. The best we can do is getting the floating-point<br />

representation close to the actual number. Thus, IEEE 754 offers several modes of<br />

rounding to let the programmer pick the desired approximation.<br />
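On systems with C99 support, the rounding mode can be exercised through the standard <fenv.h> interface. The fragment below is only a small sketch of that idea, not part of the example that follows; whether a compiler honors #pragma STDC FENV_ACCESS and run-time mode changes varies by platform.

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile float x = 1.0f, y = 3.0f;

    fesetround(FE_TONEAREST);            /* IEEE 754 default: round to nearest even */
    float nearest = x / y;

    fesetround(FE_UPWARD);               /* always round toward +infinity           */
    float up = x / y;

    printf("%.9g %.9g\n", nearest, up);  /* the two quotients differ by one ulp     */
    return 0;
}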

Rounding sounds simple enough, but to round accurately requires the hardware<br />

to include extra bits in the calculation. In the preceding examples, we were vague<br />

on the number of bits that an intermediate representation can occupy, but clearly,<br />

if every intermediate result had to be truncated to the exact number of digits, there<br />

would be no opportunity to round. IEEE 754, therefore, always keeps two extra bits<br />

on the right during intermediate additions, called guard <strong>and</strong> round, respectively.<br />

Let’s do a decimal example to illustrate their value.



Rounding with Guard Digits<br />

EXAMPLE

Add 2.56ten × 10^0 to 2.34ten × 10^2, assuming that we have three significant
decimal digits. Round to the nearest decimal number with three significant
decimal digits, first with guard and round digits, and then without them.

ANSWER

First we must shift the smaller number to the right to align the exponents, so
2.56ten × 10^0 becomes 0.0256ten × 10^2. Since we have guard and round digits,
we are able to represent the two least significant digits when we align exponents.
The guard digit holds 5 and the round digit holds 6. The sum is

  2.3400ten
+ 0.0256ten
-----------
  2.3656ten

Thus the sum is 2.3656ten × 10^2. Since we have two digits to round, we want
values 0 to 49 to round down and 51 to 99 to round up, with 50 being the
tiebreaker. Rounding the sum up with three significant digits yields 2.37ten × 10^2.

Doing this without guard and round digits drops two digits from the
calculation. The new sum is then

  2.34ten
+ 0.02ten
---------
  2.36ten

The answer is 2.36ten × 10^2, off by 1 in the last digit from the sum above.

Since the worst case for rounding would be when the actual number is halfway<br />

between two floating-point representations, accuracy in floating point is normally<br />

measured in terms of the number of bits in error in the least significant bits of the<br />

signific<strong>and</strong>; the measure is called the number of units in the last place, or ulp. If<br />

a number were off by 2 in the least significant bits, it would be called off by 2 ulps.<br />

Provided there are no overflow, underflow, or invalid operation exceptions, IEEE

754 guarantees that the computer uses the number that is within one-half ulp.<br />

units in the last place<br />

(ulp) The number of<br />

bits in error in the least<br />

significant bits of the<br />

signific<strong>and</strong> between<br />

the actual number <strong>and</strong><br />

the number that can be<br />

represented.<br />
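One way to see the size of an ulp from a program is with the standard C function nextafterf, which returns the adjacent representable value; the short probe below is our own illustration.

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 1.0e6f;
    float ulp = nextafterf(x, INFINITY) - x;  /* gap to the next larger float */
    printf("ulp near %g is %g\n", x, ulp);    /* prints 0.0625 for 1.0e6      */
    return 0;
}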

Elaboration: Although the example above really needed just one extra digit, multiply<br />

can need two. A binary product may have one leading 0 bit; hence, the normalizing step<br />

must shift the product one bit left. This shifts the guard digit into the least significant bit<br />

of the product, leaving the round bit to help accurately round the product.<br />

IEEE 754 has four rounding modes: always round up (toward +∞), always round down
(toward −∞), truncate, and round to nearest even. The final mode determines what to

do if the number is exactly halfway in between. The U.S. Internal Revenue Service (IRS)
always rounds 0.50 dollars up, possibly to the benefit of the IRS. A more equitable way
would be to round up this case half the time and round down the other half. IEEE 754
says that if the least significant bit retained in a halfway case would be odd, add one;
if it would be even, truncate.



In an attempt to squeeze every last bit of precision from a floating-point operation,
the standard allows some numbers to be represented in unnormalized form. Rather than
having a gap between 0 and the smallest normalized number, IEEE allows denormalized
numbers (also known as denorms or subnormals). They have the same exponent as
zero but a nonzero fraction. They allow a number to degrade in significance until it
becomes 0, called gradual underflow. For example, the smallest positive single precision
normalized number is

1.0000 0000 0000 0000 0000 000two × 2^−126

but the smallest single precision denormalized number is

0.0000 0000 0000 0000 0000 001two × 2^−126, or 1.0two × 2^−149

For double precision, the denorm gap goes from 1.0 × 2^−1022 to 1.0 × 2^−1074.

The possibility of an occasional unnormalized operand has given headaches to
floating-point designers who are trying to build fast floating-point units. Hence, many
computers cause an exception if an operand is denormalized, letting software complete
the operation. Although software implementations are perfectly valid, their lower
performance has lessened the popularity of denorms in portable floating-point software.

Moreover, if programmers do not expect denorms, their programs may surprise them.<br />
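The gradual underflow region is visible from C: FLT_MIN in <float.h> is the smallest normalized single precision value, and stepping below it with nextafterf lands on a denormalized number. The probe below is only an illustration and assumes the usual IEEE 754 float.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    float smallest_norm   = FLT_MIN;                    /* 1.0two x 2^-126    */
    float largest_denorm  = nextafterf(FLT_MIN, 0.0f);  /* just below FLT_MIN */
    float smallest_denorm = nextafterf(0.0f, 1.0f);     /* 1.0two x 2^-149    */

    printf("%g %g %g\n", smallest_norm, largest_denorm, smallest_denorm);
    return 0;
}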

3.6<br />

Parallelism <strong>and</strong> <strong>Computer</strong> Arithmetic:<br />

Subword Parallelism<br />

Since every desktop microprocessor by definition has its own graphical displays,<br />

as transistor budgets increased it was inevitable that support would be added for<br />

graphics operations.<br />

Many graphics systems originally used 8 bits to represent each of the three<br />

primary colors plus 8 bits for a location of a pixel. The addition of speakers <strong>and</strong><br />

microphones for teleconferencing <strong>and</strong> video games suggested support of sound as<br />

well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient.<br />

Every microprocessor has special support so that bytes <strong>and</strong> halfwords take up<br />

less space when stored in memory (see Section 2.9), but due to the infrequency of<br />

arithmetic operations on these data sizes in typical integer programs, there was<br />

little support beyond data transfers. Architects recognized that many graphics<br />

<strong>and</strong> audio applications would perform the same operation on vectors of this data.<br />

By partitioning the carry chains within a 128-bit adder, a processor could use<br />

parallelism to perform simultaneous operations on short vectors of sixteen 8-bit<br />

oper<strong>and</strong>s, eight 16-bit oper<strong>and</strong>s, four 32-bit oper<strong>and</strong>s, or two 64-bit oper<strong>and</strong>s. The<br />

cost of such partitioned adders was small.<br />
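The effect of a partitioned carry chain can be imitated in ordinary C by masking off the bit that would otherwise carry across a lane boundary, a trick often called SWAR (SIMD within a register). The sketch below is our own illustration of the idea for four 8-bit lanes in a 32-bit word; it is not a description of any particular processor's adder.

#include <stdint.h>

/* Add four independent 8-bit lanes packed into 32-bit words; no carry
   propagates from one lane into the next (each lane wraps modulo 256). */
static uint32_t add4x8(uint32_t x, uint32_t y)
{
    uint32_t low7 = (x & 0x7f7f7f7f) + (y & 0x7f7f7f7f); /* carries stay inside each byte  */
    uint32_t top  = (x ^ y) & 0x80808080;                /* each lane's bit 7, added mod 2 */
    return low7 ^ top;                                   /* splice bit 7 back in           */
}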

Given that the parallelism occurs within a wide word, the extensions are<br />

classified as subword parallelism. It is also classified under the more general name<br />

of data-level parallelism. They have also been called vector or SIMD, for single

instruction, multiple data (see Section 6.6). The rising popularity of multimedia



1. void dgemm (int n, double* A, double* B, double* C)<br />

2. {<br />

3. for (int i = 0; i < n; ++i)<br />

4. for (int j = 0; j < n; ++j)<br />

5. {<br />

6. double cij = C[i+j*n]; /* cij = C[i][j] */<br />

7. for( int k = 0; k < n; k++ )<br />

8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */<br />

9. C[i+j*n] = cij; /* C[i][j] = cij */<br />

10. }<br />

11. }<br />

FIGURE 3.21 Unoptimized C version of a double precision matrix multiply, widely known as DGEMM for<br />

Double-precision GEneral Matrix Multiply (GEMM). Because we are passing the matrix dimension as the parameter<br />

n, this version of DGEMM uses single dimensional versions of matrices C, A, <strong>and</strong> B <strong>and</strong> address arithmetic to get better<br />

performance instead of using the more intuitive two-dimensional arrays that we saw in Section 3.5. The comments remind<br />

us of this more intuitive notation.<br />

1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0<br />

2. mov %rsi,%rcx # register %rcx = %rsi<br />

3. xor %eax,%eax # register %eax = 0<br />

4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1<br />

5. add %r9,%rcx # register %rcx = %rcx + %r9<br />

6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A<br />

7. add $0x1,%rax # register %rax = %rax + 1<br />

8. cmp %eax,%edi # compare %eax to %edi<br />

9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0<br />

10. jg 30 # jump if %eax > %edi<br />

11. add $0x1,%r11d # register %r11 = %r11 + 1<br />

12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element<br />

FIGURE 3.22 The x86 assembly language for the body of the nested loops generated by compiling the<br />

optimized C code in Figure 3.21. Although it is dealing with just 64 bits of data, the compiler uses the AVX version of
the instructions instead of SSE2, presumably so that it can use three addresses per instruction instead of two (see the Elaboration

in Section 3.7).


3.8 Going Faster: Subword Parallelism <strong>and</strong> Matrix Multiply 227<br />

1. #include <x86intrin.h>

2. void dgemm (int n, double* A, double* B, double* C)<br />

3. {<br />

4. for ( int i = 0; i < n; i+=4 )<br />

5. for ( int j = 0; j < n; j++ ) {<br />

6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */<br />

7. for( int k = 0; k < n; k++ )<br />

8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */<br />

9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n),<br />

10. _mm256_broadcast_sd(B+k+j*n)));<br />

11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */<br />

12. }<br />

13. }<br />

FIGURE 3.23 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel<br />

instructions for the x86. Figure 3.24 shows the assembly language produced by the compiler for the inner loop.<br />

While compiler writers may eventually be able to routinely produce high-quality

code that uses the AVX instructions of the x86, for now we must “cheat” by<br />

using C intrinsics that more or less tell the compiler exactly how to produce good<br />

code. Figure 3.23 shows the enhanced version of Figure 3.21 for which the Gnu C<br />

compiler produces AVX code. Figure 3.24 shows annotated x86 code that is the<br />

output of compiling using gcc with the -O3 level of optimization.

The declaration on line 6 of Figure 3.23 uses the __m256d data type, which tells<br />

the compiler the variable will hold 4 double-precision floating-point values. The<br />

intrinsic _mm256_load_pd() also on line 6 uses AVX instructions to load 4<br />

double-precision floating-point numbers in parallel (_pd) from the matrix C into<br />

c0. The address calculation C+i+j*n on line 6 represents element C[i+j*n].<br />

Symmetrically, the final step on line 11 uses the intrinsic _mm256_store_pd()<br />

to store 4 double-precision floating-point numbers from c0 into the matrix C.<br />

As we’re going through 4 elements each iteration, the outer for loop on line 4<br />

increments i by 4 instead of by 1 as on line 3 of Figure 3.21.<br />

Inside the loops, on line 9 we first load 4 elements of A again using _mm256_<br />

load_pd(). To multiply these elements by one element of B, on line 10 we first<br />

use the intrinsic _mm256_broadcast_sd(), which makes 4 identical copies<br />

of the scalar double precision number—in this case an element of B—in one of the<br />

YMM registers. We then use _mm256_mul_pd() on line 9 to multiply the four<br />

double-precision results in parallel. Finally, _mm256_add_pd() on line 8 adds<br />

the 4 products to the 4 sums in c0.<br />
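One practical detail if you try code like Figure 3.23 yourself: _mm256_load_pd and _mm256_store_pd expect 32-byte-aligned addresses (the vmovapd instructions in Figure 3.24 are the aligned forms), and the i loop assumes n is a multiple of 4. A possible calling sketch using C11's aligned_alloc is shown below; the harness itself is our own and not part of the figure. With gcc, such code is typically compiled with -O3 -mavx.

#include <stdlib.h>
#include <string.h>

void dgemm(int n, double* A, double* B, double* C);   /* the routine of Figure 3.23 */

int main(void)
{
    int n = 32;                                   /* assumed to be a multiple of 4 */
    size_t bytes = (size_t)n * n * sizeof(double);
    double *A = aligned_alloc(32, bytes);         /* 32-byte alignment for AVX     */
    double *B = aligned_alloc(32, bytes);
    double *C = aligned_alloc(32, bytes);
    /* ... fill A and B with data ... */
    memset(C, 0, bytes);                          /* dgemm accumulates into C      */
    dgemm(n, A, B, C);
    free(A); free(B); free(C);
    return 0;
}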

Figure 3.24 shows resulting x86 code for the body of the inner loops produced<br />

by the compiler. You can see the five AVX instructions—they all start with v <strong>and</strong>



1. vmovapd (%r11),%ymm0 # Load 4 elements of C into %ymm0<br />

2. mov %rbx,%rcx # register %rcx = %rbx<br />

3. xor %eax,%eax # register %eax = 0<br />

4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element<br />

5. add $0x8,%rax # register %rax = %rax + 8<br />

6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements<br />

7. add %r9,%rcx # register %rcx = %rcx + %r9<br />

8. cmp %r10,%rax # compare %r10 to %rax<br />

9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0<br />

10. jne 50 # jump if %r10 != %rax

11. add $0x1,%esi # register %esi = %esi + 1

12. vmovapd %ymm0,(%r11) # Store %ymm0 into 4 C elements<br />

FIGURE 3.24 The x86 assembly language for the body of the nested loops generated by compiling<br />

the optimized C code in Figure 3.23. Note the similarities to Figure 3.22, with the primary difference being that the<br />

five floating-point operations are now using YMM registers <strong>and</strong> using the pd versions of the instructions for parallel double<br />

precision instead of the sd version for scalar double precision.<br />

four of the five use pd for parallel double precision—that correspond to the C<br />

intrinsics mentioned above. The code is very similar to that in Figure 3.22 above:<br />

both use 12 instructions, the integer instructions are nearly identical (but different<br />

registers), <strong>and</strong> the floating-point instruction differences are generally just going<br />

from scalar double (sd) using XMM registers to parallel double (pd) with YMM<br />

registers. The one exception is line 4 of Figure 3.24. Every element of A must be<br />

multiplied by one element of B. One solution is to place four identical copies of the<br />

64-bit B element side-by-side into the 256-bit YMM register, which is just what the<br />

instruction vbroadcastsd does.<br />

For matrices of dimensions of 32 by 32, the unoptimized DGEMM in Figure 3.21<br />

runs at 1.7 GigaFLOPS (FLoating point Operations Per Second) on one core of a<br />

2.6 GHz Intel Core i7 (S<strong>and</strong>y Bridge). The optimized code in Figure 3.23 performs<br />

at 6.4 GigaFLOPS. The AVX version is 3.85 times as fast, which is very close to the<br />

factor of 4.0 increase that you might hope for from performing 4 times as many<br />

operations at a time by using subword parallelism.<br />
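As a rough check on such numbers, DGEMM performs about 2n^3 floating-point operations (one multiply and one add per innermost iteration), so a hypothetical timing harness could convert a measured time into a rate as sketched below. For n = 32 that is 65,536 operations, so 1.7 GFLOPS corresponds to roughly 40 microseconds per matrix multiply.

/* Illustrative only: GFLOPS for an n x n x n matrix multiply that ran in
   the given number of seconds. */
double dgemm_gflops(int n, double seconds)
{
    double flops = 2.0 * n * n * n;    /* one multiply + one add per iteration */
    return flops / seconds / 1e9;
}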

Elaboration: As mentioned in the Elaboration in Section 1.6, Intel offers Turbo mode<br />

that temporarily runs at a higher clock rate until the chip gets too hot. This Intel Core i7<br />

(S<strong>and</strong>y Bridge) can increase from 2.6 GHz to 3.3 GHz in Turbo mode. The results above<br />

are with Turbo mode turned off. If we turn it on, we improve all the results by the increase<br />

in the clock rate of 3.3/2.6 = 1.27 to 2.1 GFLOPS for unoptimized DGEMM <strong>and</strong> 8.1<br />

GFLOPS with AVX. Turbo mode works particularly well when using only a single core of<br />

an eight-core chip, as in this case, as it lets that single core use much more than its fair<br />

share of power since the other cores are idle.



3.9 Fallacies <strong>and</strong> Pitfalls<br />

Arithmetic fallacies <strong>and</strong> pitfalls generally stem from the difference between the<br />

limited precision of computer arithmetic <strong>and</strong> the unlimited precision of natural<br />

arithmetic.<br />

Fallacy: Just as a left shift instruction can replace an integer multiply by a<br />

power of 2, a right shift is the same as an integer division by a power of 2.<br />

Thus mathematics may be defined as the subject in which we never know what we are talking about, nor whether what we are saying is true.
Bertrand Russell, Recent Words on the Principles of Mathematics, 1901

Recall that a binary number x, where xi means the ith bit, represents the number

… + (x3 × 2^3) + (x2 × 2^2) + (x1 × 2^1) + (x0 × 2^0)

Shifting the bits of x right by n bits would seem to be the same as dividing by
2^n. And this is true for unsigned integers. The problem is with signed integers. For
example, suppose we want to divide −5ten by 4ten; the quotient should be −1ten. The
two’s complement representation of −5ten is

1111 1111 1111 1111 1111 1111 1111 1011two

According to this fallacy, shifting right by two should divide by 4ten (2^2):

0011 1111 1111 1111 1111 1111 1111 1110two

With a 0 in the sign bit, this result is clearly wrong. The value created by the shift
right is actually 1,073,741,822ten instead of −1ten.

A solution would be to have an arithmetic right shift that extends the sign bit
instead of shifting in 0s. A 2-bit arithmetic shift right of −5ten produces

1111 1111 1111 1111 1111 1111 1111 1110two

The result is −2ten instead of −1ten; close, but no cigar.
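The same contrast is easy to observe in C, where / on integers rounds toward zero while >> applied to a negative value is implementation-defined (most compilers produce an arithmetic shift). A small illustrative check:

#include <stdio.h>

int main(void)
{
    int x = -5;
    printf("%d\n", x / 4);    /* -1: integer division rounds toward zero       */
    printf("%d\n", x >> 2);   /* typically -2: arithmetic shift rounds downward */
    return 0;
}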

Pitfall: Floating-point addition is not associative.<br />

Associativity holds for a sequence of two’s complement integer additions, even if the<br />

computation overflows. Alas, because floating-point numbers are approximations<br />

of real numbers <strong>and</strong> because computer arithmetic has limited precision, it does<br />

not hold for floating-point numbers. Given the great range of numbers that can be<br />

represented in floating point, problems occur when adding two large numbers of<br />

opposite signs plus a small number. For example, let’s see if c + (a + b) = (c + a) + b.
Assume c = −1.5ten × 10^38, a = 1.5ten × 10^38, and b = 1.0, and that these are
all single precision numbers.


c + (a + b) = −1.5ten × 10^38 + (1.5ten × 10^38 + 1.0)
            = −1.5ten × 10^38 + (1.5ten × 10^38)
            = 0.0

(c + a) + b = (−1.5ten × 10^38 + 1.5ten × 10^38) + 1.0
            = (0.0ten) + 1.0
            = 1.0

Since floating-point numbers have limited precision and result in approximations
of real results, 1.5ten × 10^38 is so much larger than 1.0ten that 1.5ten × 10^38 + 1.0 is still
1.5ten × 10^38. That is why the sum of c, a, and b is 0.0 or 1.0, depending on the order
of the floating-point additions, so c + (a + b) ≠ (c + a) + b. Therefore, floating-point
addition is not associative.
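The example translates directly into a few lines of C; this little check is our own illustration, not part of the text.

#include <stdio.h>

int main(void)
{
    float c = -1.5e38f, a = 1.5e38f, b = 1.0f;
    printf("%f\n", c + (a + b));   /* prints 0.000000 */
    printf("%f\n", (c + a) + b);   /* prints 1.000000 */
    return 0;
}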

Fallacy: Parallel execution strategies that work for integer data types also work<br />

for floating-point data types.<br />

Programs have typically been written first to run sequentially before being rewritten<br />

to run concurrently, so a natural question is, “Do the two versions get the same<br />

answer?” If the answer is no, you presume there is a bug in the parallel version that<br />

you need to track down.<br />

This approach assumes that computer arithmetic does not affect the results when<br />

going from sequential to parallel. That is, if you were to add a million numbers<br />

together, you would get the same results whether you used 1 processor or 1000<br />

processors. This assumption holds for two’s complement integers, since integer<br />

addition is associative. Alas, since floating-point addition is not associative, the<br />

assumption does not hold.<br />

A more vexing version of this fallacy occurs on a parallel computer where the<br />

operating system scheduler may use a different number of processors depending<br />

on what other programs are running on a parallel computer. As the varying<br />

number of processors from each run would cause the floating-point sums to be<br />

calculated in different orders, getting slightly different answers each time despite<br />

running identical code with identical input may flummox unaware parallel<br />

programmers.<br />

Given this qu<strong>and</strong>ary, programmers who write parallel code with floating-point<br />

numbers need to verify whether the results are credible even if they don’t give the<br />

same exact answer as the sequential code. The field that deals with such issues is<br />

called numerical analysis, which is the subject of textbooks in its own right. Such<br />

concerns are one reason for the popularity of numerical libraries such as LAPACK<br />

and ScaLAPACK, which have been validated in both their sequential and parallel

forms.<br />
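A sequential-versus-parallel discrepancy can be reproduced even on one processor simply by changing the summation order, which is essentially what a different processor count does. The sketch below, with made-up data, sums the same four numbers left to right and then in two halves, as a two-processor reduction would.

#include <stdio.h>

int main(void)
{
    float x[4] = { 1.0e8f, 1.0f, -1.0e8f, 1.0f };   /* made-up values */

    float seq = 0.0f;                  /* one "processor": left to right   */
    for (int i = 0; i < 4; i++)
        seq += x[i];

    float half1 = x[0] + x[1];         /* two "processors", combined after */
    float half2 = x[2] + x[3];
    float par = half1 + half2;

    printf("%f %f\n", seq, par);       /* prints 1.000000 0.000000         */
    return 0;
}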

Pitfall: The MIPS instruction add immediate unsigned (addiu) sign-extends<br />

its 16-bit immediate field.<br />




Despite its name, add immediate unsigned (addiu) is used to add constants to<br />

signed integers when we don’t care about overflow. MIPS has no subtract immediate<br />

instruction, <strong>and</strong> negative numbers need sign extension, so the MIPS architects<br />

decided to sign-extend the immediate field.<br />

Fallacy: Only theoretical mathematicians care about floating-point accuracy.<br />

Newspaper headlines of November 1994 prove this statement is a fallacy (see<br />

Figure 3.25). The following is the inside story behind the headlines.<br />

The Pentium used a st<strong>and</strong>ard floating-point divide algorithm that generates<br />

multiple quotient bits per step, using the most significant bits of divisor <strong>and</strong><br />

dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup<br />

table containing 2, 1, 0, 1, or 2. The guess is multiplied by the divisor <strong>and</strong><br />

subtracted from the remainder to generate a new remainder. Like nonrestoring<br />

division, if a previous guess gets too large a remainder, the partial remainder is<br />

adjusted in a subsequent pass.<br />

Evidently, there were five elements of the table from the 80486 that Intel<br />

engineers thought could never be accessed, <strong>and</strong> they optimized the logic to return<br />

0 instead of +2 in these situations on the Pentium. Intel was wrong: while the first 11

FIGURE 3.25 A sampling of newspaper <strong>and</strong> magazine articles from November 1994,<br />

including the New York Times, San Jose Mercury News, San Francisco Chronicle, <strong>and</strong><br />

Infoworld. The Pentium floating-point divide bug even made the “Top 10 List” of the David Letterman<br />

Late Show on television. Intel eventually took a $300 million write-off to replace the buggy chips.



bits were always correct, errors would show up occasionally in bits 12 to 52, or the<br />

4th to 15th decimal digits.<br />

A math professor at Lynchburg College in Virginia, Thomas Nicely, discovered the<br />

bug in September 1994. After calling Intel technical support <strong>and</strong> getting no official<br />

reaction, he posted his discovery on the Internet. This post led to a story in a trade<br />

magazine, which in turn caused Intel to issue a press release. It called the bug a glitch<br />

that would affect only theoretical mathematicians, with the average spreadsheet<br />

user seeing an error every 27,000 years. IBM Research soon counterclaimed that the<br />

average spreadsheet user would see an error every 24 days. Intel soon threw in the<br />

towel by making the following announcement on December 21:<br />

“We at Intel wish to sincerely apologize for our h<strong>and</strong>ling of the recently publicized<br />

Pentium processor flaw. The Intel Inside symbol means that your computer has<br />

a microprocessor second to none in quality <strong>and</strong> performance. Thous<strong>and</strong>s of Intel<br />

employees work very hard to ensure that this is true. But no microprocessor is<br />

ever perfect. What Intel continues to believe is technically an extremely minor<br />

problem has taken on a life of its own. Although Intel firmly st<strong>and</strong>s behind the<br />

quality of the current version of the Pentium processor, we recognize that many<br />

users have concerns. We want to resolve these concerns. Intel will exchange the<br />

current version of the Pentium processor for an updated version, in which this<br />

floating-point divide flaw is corrected, for any owner who requests it, free of<br />

charge anytime during the life of their computer.”<br />

Analysts estimate that this recall cost Intel $500 million, <strong>and</strong> Intel engineers did not<br />

get a Christmas bonus that year.<br />

This story brings up a few points for everyone to ponder. How much cheaper<br />

would it have been to fix the bug in July 1994? What was the cost to repair the<br />

damage to Intel’s reputation? And what is the corporate responsibility in disclosing<br />

bugs in a product so widely used <strong>and</strong> relied upon as a microprocessor?<br />

3.10 Concluding Remarks<br />

Over the decades, computer arithmetic has become largely st<strong>and</strong>ardized, greatly<br />

enhancing the portability of programs. Two’s complement binary integer arithmetic is<br />

found in every computer sold today, <strong>and</strong> if it includes floating point support, it offers<br />

the IEEE 754 binary floating-point arithmetic.<br />

<strong>Computer</strong> arithmetic is distinguished from paper-<strong>and</strong>-pencil arithmetic by the<br />

constraints of limited precision. This limit may result in invalid operations through<br />

calculating numbers larger or smaller than the predefined limits. Such anomalies, called<br />

“overflow” or “underflow,” may result in exceptions or interrupts, emergency events<br />

similar to unplanned subroutine calls. Chapters 4 <strong>and</strong> 5 discuss exceptions in more detail.<br />

Floating-point arithmetic has the added challenge of being an approximation<br />

of real numbers, <strong>and</strong> care needs to be taken to ensure that the computer number



selected is the representation closest to the actual number. The challenges of<br />

imprecision <strong>and</strong> limited representation of floating point are part of the inspiration<br />

for the field of numerical analysis. The recent switch to parallelism shines the<br />

searchlight on numerical analysis again, as solutions that were long considered<br />

safe on sequential computers must be reconsidered when trying to find the fastest<br />

algorithm for parallel computers that still achieves a correct result.<br />

Data-level parallelism, specifically subword parallelism, offers a simple path to<br />

higher performance for programs that are intensive in arithmetic operations for<br />

either integer or floating-point data. We showed that we could speed up matrix<br />

multiply nearly fourfold by using instructions that could execute four floating-point

operations at a time.<br />

With the explanation of computer arithmetic in this chapter comes a description<br />

of much more of the MIPS instruction set. One point of confusion is the instructions<br />

covered in these chapters versus instructions executed by MIPS chips versus the<br />

instructions accepted by MIPS assemblers. Two figures try to make this clear.<br />

Figure 3.26 lists the MIPS instructions covered in this chapter <strong>and</strong> Chapter 2.<br />

We call the set of instructions on the left-h<strong>and</strong> side of the figure the MIPS core. The<br />

instructions on the right we call the MIPS arithmetic core. On the left of Figure 3.27<br />

are the instructions the MIPS processor executes that are not found in Figure 3.26.<br />

We call the full set of hardware instructions MIPS-32. On the right of Figure 3.27<br />

are the instructions accepted by the assembler that are not part of MIPS-32. We call<br />

this set of instructions Pseudo MIPS.<br />

Figure 3.28 gives the popularity of the MIPS instructions for SPEC CPU2006<br />

integer <strong>and</strong> floating-point benchmarks. All instructions are listed that were<br />

responsible for at least 0.2% of the instructions executed.<br />

Note that although programmers <strong>and</strong> compiler writers may use MIPS-32 to<br />

have a richer menu of options, MIPS core instructions dominate integer SPEC<br />

CPU2006 execution, <strong>and</strong> the integer core plus arithmetic core dominate SPEC<br />

CPU2006 floating point, as the table below shows.<br />

Instruction subset       Integer   Fl. pt.
MIPS core                   98%      31%
MIPS arithmetic core         2%      66%
Remaining MIPS-32            0%       3%

For the rest of the book, we concentrate on the MIPS core instructions—the integer<br />

instruction set excluding multiply <strong>and</strong> divide—to make the explanation of computer<br />

design easier. As you can see, the MIPS core includes the most popular MIPS<br />

instructions; be assured that underst<strong>and</strong>ing a computer that runs the MIPS core<br />

will give you sufficient background to underst<strong>and</strong> even more ambitious computers.<br />

No matter what the instruction set or its size—MIPS, ARM, x86—never forget that<br />

bit patterns have no inherent meaning. The same bit pattern may represent a signed<br />

integer, unsigned integer, floating-point number, string, instruction, <strong>and</strong> so on. In<br />

stored program computers, it is the operation on the bit pattern that determines its<br />

meaning.



Remaining MIPS-32 Name Format Pseudo MIPS Name Format<br />

exclusive or (rs ⊕ rt) xor R absolute value abs rd,rs<br />

exclusive or immediate xori I negate (signed or unsigned) negs rd,rs<br />

shift right arithmetic sra R rotate left rol rd,rs,rt<br />

shift left logical variable sllv R rotate right ror rd,rs,rt<br />

shift right logical variable srlv R multiply <strong>and</strong> don’t check oflw (signed or uns.) muls rd,rs,rt<br />

shift right arithmetic variable srav R multiply <strong>and</strong> check oflw (signed or uns.) mulos rd,rs,rt<br />

move to Hi mthi R divide <strong>and</strong> check overflow div rd,rs,rt<br />

move to Lo mtlo R divide and don’t check overflow divu rd,rs,rt

load halfword lh I remainder (signed or unsigned) rems rd,rs,rt<br />

load byte lb I load immediate li rd,imm<br />

load word left (unaligned) lwl I load address la rd,addr<br />

load word right (unaligned) lwr I load double ld rd,addr<br />

store word left (unaligned) swl I store double sd rd,addr<br />

store word right (unaligned) swr I unaligned load word ulw rd,addr<br />

load linked (atomic update) ll I unaligned store word usw rd,addr<br />

store cond. (atomic update) sc I unaligned load halfword (signed or uns.) ulhs rd,addr<br />

move if zero movz R unaligned store halfword ush rd,addr<br />

move if not zero movn R branch b Label<br />

multiply <strong>and</strong> add (S or uns.) madds R branch on equal zero beqz rs,L<br />

multiply <strong>and</strong> subtract (S or uns.) msubs I branch on compare (signed or unsigned) bxs rs,rt,L<br />

branch on ≥ zero <strong>and</strong> link bgezal I (x = lt, le, gt, ge)<br />

branch on < zero <strong>and</strong> link bltzal I set equal seq rd,rs,rt<br />

jump <strong>and</strong> link register jalr R set not equal sne rd,rs,rt<br />

branch compare to zero bxz I set on compare (signed or unsigned) sxs rd,rs,rt<br />

branch compare to zero likely bxzl I (x = lt, le, gt, ge)<br />

(x = lt, le, gt, ge) load to floating point (s or d) l.f rd,addr<br />

branch compare reg likely bxl I store from floating point (s or d) s.f rd,addr<br />

trap if compare reg tx R<br />

trap if compare immediate txi I<br />

(x = eq, neq, lt, le, gt, ge)<br />

return from exception rfe R<br />

system call syscall I<br />

break (cause exception) break I<br />

move from FP to integer mfc1 R<br />

move to FP from integer mtc1 R<br />

FP move (s or d) mov.f R<br />

FP move if zero (s or d) movz.f R<br />

FP move if not zero (s or d) movn.f R<br />

FP square root (s or d) sqrt.f R<br />

FP absolute value (s or d) abs.f R<br />

FP negate (s or d) neg.f R<br />

FP convert (w, s, or d) cvt.f.f R<br />

FP compare un (s or d) c.xn.f R<br />

FIGURE 3.27 Remaining MIPS-32 <strong>and</strong> Pseudo MIPS instruction sets. f means single (s) or double (d) precision floating-point<br />

instructions, <strong>and</strong> s means signed <strong>and</strong> unsigned (u) versions. MIPS-32 also has FP instructions for multiply <strong>and</strong> add/sub (madd.f/ msub.f),<br />

ceiling (ceil.f), truncate (trunc.f), round (round.f), <strong>and</strong> reciprocal (recip.f). The underscore represents the letter to include to represent<br />

that datatype.



3.12 Exercises<br />

3.1 [5] What is 5ED4 07A4 when these values represent unsigned 16-<br />

bit hexadecimal numbers? The result should be written in hexadecimal. Show your<br />

work.<br />

3.2 [5] What is 5ED4 07A4 when these values represent signed 16-<br />

bit hexadecimal numbers stored in sign-magnitude format? The result should be<br />

written in hexadecimal. Show your work.<br />

3.3 [10] Convert 5ED4 into a binary number. What makes base 16<br />

(hexadecimal) an attractive numbering system for representing values in<br />

computers?<br />

3.4 [5] What is 4365 3412 when these values represent unsigned 12-bit<br />

octal numbers? The result should be written in octal. Show your work.<br />

3.5 [5] What is 4365 3412 when these values represent signed 12-bit<br />

octal numbers stored in sign-magnitude format? The result should be written in<br />

octal. Show your work.<br />

3.6 [5] Assume 185 <strong>and</strong> 122 are unsigned 8-bit decimal integers. Calculate<br />

185 – 122. Is there overflow, underflow, or neither?<br />

3.7 [5] Assume 185 <strong>and</strong> 122 are signed 8-bit decimal integers stored in<br />

sign-magnitude format. Calculate 185 122. Is there overflow, underflow, or<br />

neither?<br />

3.8 [5] Assume 185 <strong>and</strong> 122 are signed 8-bit decimal integers stored in<br />

sign-magnitude format. Calculate 185 122. Is there overflow, underflow, or<br />

neither?<br />

3.9 [10] Assume 151 <strong>and</strong> 214 are signed 8-bit decimal integers stored in<br />

two’s complement format. Calculate 151 214 using saturating arithmetic. The<br />

result should be written in decimal. Show your work.<br />

3.10 [10] Assume 151 <strong>and</strong> 214 are signed 8-bit decimal integers stored in<br />

two’s complement format. Calculate 151 214 using saturating arithmetic. The<br />

result should be written in decimal. Show your work.<br />

3.11 [10] Assume 151 <strong>and</strong> 214 are unsigned 8-bit integers. Calculate 151<br />

214 using saturating arithmetic. The result should be written in decimal. Show<br />

your work.<br />

3.12 [20] Using a table similar to that shown in Figure 3.6, calculate the<br />

product of the octal unsigned 6-bit integers 62 <strong>and</strong> 12 using the hardware described<br />

in Figure 3.3. You should show the contents of each register on each step.<br />

Never give in, never<br />

give in, never, never,<br />

never—in nothing,<br />

great or small, large or<br />

petty—never give in.<br />

Winston Churchill,<br />

address at Harrow<br />

School, 1941



3.13 [20] Using a table similar to that shown in Figure 3.6, calculate the<br />

product of the hexadecimal unsigned 8-bit integers 62 <strong>and</strong> 12 using the hardware<br />

described in Figure 3.5. You should show the contents of each register on each step.<br />

3.14 [10] Calculate the time necessary to perform a multiply using the<br />

approach given in Figures 3.3 <strong>and</strong> 3.4 if an integer is 8 bits wide <strong>and</strong> each step<br />

of the operation takes 4 time units. Assume that in step 1a an addition is always<br />

performed—either the multiplic<strong>and</strong> will be added, or a zero will be. Also assume<br />

that the registers have already been initialized (you are just counting how long it<br />

takes to do the multiplication loop itself). If this is being done in hardware, the<br />

shifts of the multiplic<strong>and</strong> <strong>and</strong> multiplier can be done simultaneously. If this is being<br />

done in software, they will have to be done one after the other. Solve for each case.<br />

3.15 [10] Calculate the time necessary to perform a multiply using the<br />

approach described in the text (31 adders stacked vertically) if an integer is 8 bits<br />

wide <strong>and</strong> an adder takes 4 time units.<br />

3.16 [20] Calculate the time necessary to perform a multiply using the<br />

approach given in Figure 3.7 if an integer is 8 bits wide <strong>and</strong> an adder takes 4 time<br />

units.<br />

3.17 [20] As discussed in the text, one possible performance enhancement<br />

is to do a shift and add instead of an actual multiplication. Since 9 × 6, for example,
can be written (2 × 2 × 2 + 1) × 6, we can calculate 9 × 6 by shifting 6 to the left 3
times and then adding 6 to that result. Show the best way to calculate 0x33 × 0x55
using shifts and adds/subtracts. Assume both inputs are 8-bit unsigned integers.

3.18 [20] Using a table similar to that shown in Figure 3.10, calculate<br />

74 divided by 21 using the hardware described in Figure 3.8. You should show<br />

the contents of each register on each step. Assume both inputs are unsigned 6-bit<br />

integers.<br />

3.19 [30] Using a table similar to that shown in Figure 3.10, calculate<br />

74 divided by 21 using the hardware described in Figure 3.11. You should show<br />

the contents of each register on each step. Assume A <strong>and</strong> B are unsigned 6-bit<br />

integers. This algorithm requires a slightly different approach than that shown in<br />

Figure 3.9. You will want to think hard about this, do an experiment or two, or else<br />

go to the web to figure out how to make this work correctly. (Hint: one possible<br />

solution involves using the fact that Figure 3.11 implies the remainder register can<br />

be shifted either direction.)<br />

3.20 [5] What decimal number does the bit pattern 0x0C000000

represent if it is a two’s complement integer? An unsigned integer?<br />

3.21 [10] If the bit pattern 0x0C000000 is placed into the Instruction

Register, what MIPS instruction will be executed?<br />

3.22 [10] What decimal number does the bit pattern 0x0C000000

represent if it is a floating point number? Use the IEEE 754 st<strong>and</strong>ard.



3.23 [10] Write down the binary representation of the decimal number<br />

63.25 assuming the IEEE 754 single precision format.<br />

3.24 [10] Write down the binary representation of the decimal number<br />

63.25 assuming the IEEE 754 double precision format.<br />

3.25 [10] Write down the binary representation of the decimal number<br />

63.25 assuming it was stored using the single precision IBM format (base 16,<br />

instead of base 2, with 7 bits of exponent).<br />

3.26 [20] Write down the binary bit pattern to represent −1.5625 × 10^−1

assuming a format similar to that employed by the DEC PDP-8 (the leftmost 12<br />

bits are the exponent stored as a two’s complement number, <strong>and</strong> the rightmost 24<br />

bits are the fraction stored as a two’s complement number). No hidden 1 is used.<br />

Comment on how the range <strong>and</strong> accuracy of this 36-bit pattern compares to the<br />

single <strong>and</strong> double precision IEEE 754 st<strong>and</strong>ards.<br />

3.27 [20] IEEE 754-2008 contains a half precision that is only 16 bits<br />

wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide <strong>and</strong> has a bias<br />

of 15, <strong>and</strong> the mantissa is 10 bits long. A hidden 1 is assumed. Write down the<br />

bit pattern to represent −1.5625 × 10^−1 assuming a version of this format, which

uses an excess-16 format to store the exponent. Comment on how the range <strong>and</strong><br />

accuracy of this 16-bit floating point format compares to the single precision IEEE<br />

754 st<strong>and</strong>ard.<br />

3.28 [20] The Hewlett-Packard 2114, 2115, <strong>and</strong> 2116 used a format<br />

with the leftmost 16 bits being the fraction stored in two’s complement format,<br />

followed by another 16-bit field which had the leftmost 8 bits as an extension of the<br />

fraction (making the fraction 24 bits long), <strong>and</strong> the rightmost 8 bits representing<br />

the exponent. However, in an interesting twist, the exponent was stored in sign-magnitude

format with the sign bit on the far right! Write down the bit pattern to<br />

represent −1.5625 × 10^−1 assuming this format. No hidden 1 is used. Comment on

how the range <strong>and</strong> accuracy of this 32-bit pattern compares to the single precision<br />

IEEE 754 st<strong>and</strong>ard.<br />

3.29 [20] Calculate the sum of 2.6125 × 10^1 and 4.150390625 × 10^−1

by h<strong>and</strong>, assuming A <strong>and</strong> B are stored in the 16-bit half precision described in<br />

Exercise 3.27. Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the<br />

nearest even. Show all the steps.<br />

3.30 [30] Calculate the product of −8.0546875 × 10^0 and 1.79931640625 × 10^−1
by hand, assuming A and B are stored in the 16-bit half precision format

described in Exercise 3.27. Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round<br />

to the nearest even. Show all the steps; however, as is done in the example in the<br />

text, you can do the multiplication in human-readable format instead of using the<br />

techniques described in Exercises 3.12 through 3.14. Indicate if there is overflow<br />

or underflow. Write your answer in both the 16-bit floating point format described<br />

in Exercise 3.27 <strong>and</strong> also as a decimal number. How accurate is your result? How<br />

does it compare to the number you get if you do the multiplication on a calculator?



3.31 [30] Calculate by hand 8.625 × 10^1 divided by 4.875 × 10^0. Show

all the steps necessary to achieve your answer. Assume there is a guard, a round bit,<br />

<strong>and</strong> a sticky bit, <strong>and</strong> use them if necessary. Write the final answer in both the 16-bit<br />

floating point format described in Exercise 3.27 <strong>and</strong> in decimal <strong>and</strong> compare the<br />

decimal result to that which you get if you use a calculator.<br />

3.32 [20] Calculate (3.984375 × 10^−1 + 3.4375 × 10^−1) + 1.771 × 10^3

by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision format<br />

described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard, 1<br />

round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.33 [20] Calculate 3.984375 × 10^−1 + (3.4375 × 10^−1 + 1.771 × 10^3)

by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision format<br />

described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard, 1<br />

round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.34 [10] Based on your answers to 3.32 and 3.33, does (3.984375 × 10^−1 +
3.4375 × 10^−1) + 1.771 × 10^3 = 3.984375 × 10^−1 + (3.4375 × 10^−1 + 1.771 ×
10^3)?

3.35 [30] Calculate (3.41796875 × 10^−3 × 6.34765625 × 10^−3) × 1.05625 × 10^2
by hand, assuming each of the values are stored in the 16-bit half precision

format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />

1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.36 [30] Calculate 3.41796875 × 10^−3 × (6.34765625 × 10^−3 × 1.05625 × 10^2)
by hand, assuming each of the values are stored in the 16-bit half precision

format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />

1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.37 [10] Based on your answers to 3.35 and 3.36, does (3.41796875 × 10^−3 ×
6.34765625 × 10^−3) × 1.05625 × 10^2 = 3.41796875 × 10^−3 × (6.34765625 ×
10^−3 × 1.05625 × 10^2)?

3.38 [30] Calculate 1.666015625 × 10^0 × (1.9760 × 10^4 − 1.9744 × 10^4)
by hand, assuming each of the values are stored in the 16-bit half precision

format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />

1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.39 [30] Calculate (1.666015625 × 10^0 × 1.9760 × 10^4) − (1.666015625 ×
10^0 × 1.9744 × 10^4) by hand, assuming each of the values are stored in the

16-bit half precision format described in Exercise 3.27 (<strong>and</strong> also described in the<br />

text). Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even.<br />

Show all the steps, <strong>and</strong> write your answer in both the 16-bit floating point format<br />

<strong>and</strong> in decimal.



3.40 [10] Based on your answers to 3.38 and 3.39, does (1.666015625 ×
10^0 × 1.9760 × 10^4) − (1.666015625 × 10^0 × 1.9744 × 10^4) = 1.666015625 ×
10^0 × (1.9760 × 10^4 − 1.9744 × 10^4)?

3.41 [10] Using the IEEE 754 floating point format, write down the bit<br />

pattern that would represent 1/4. Can you represent 1/4 exactly?<br />

3.42 [10] What do you get if you add 1/4 to itself 4 times? What is 1/4<br />

4? Are they the same? What should they be?<br />

3.43 [10] Write down the bit pattern in the fraction of value 1/3 assuming<br />

a floating point format that uses binary numbers in the fraction. Assume there are<br />

24 bits, <strong>and</strong> you do not need to normalize. Is this representation exact?<br />

3.44 [10] Write down the bit pattern in the fraction assuming a floating<br />

point format that uses Binary Coded Decimal (base 10) numbers in the fraction<br />

instead of base 2. Assume there are 24 bits, <strong>and</strong> you do not need to normalize. Is<br />

this representation exact?<br />

3.45 [10] Write down the bit pattern assuming that we are using base 15<br />

numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9<br />

<strong>and</strong> A–F. Base 15 numbers would use 0–9 <strong>and</strong> A–E.) Assume there are 24 bits, <strong>and</strong><br />

you do not need to normalize. Is this representation exact?<br />

3.46 [20] Write down the bit pattern assuming that we are using base 30<br />

numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9<br />

<strong>and</strong> A–F. Base 30 numbers would use 0–9 <strong>and</strong> A–T.) Assume there are 20 bits, <strong>and</strong><br />

you do not need to normalize. Is this representation exact?<br />

3.47 [45] The following C code implements a four-tap FIR filter on<br />

input array sig_in. Assume that all arrays are 16-bit fixed-point values.<br />

for (i = 3; i < 128; i++)
    sig_out[i] = sig_in[i-3] * f[0] + sig_in[i-2] * f[1] +
                 sig_in[i-1] * f[2] + sig_in[i]   * f[3];

Assume you are to write an optimized implementation of this code in assembly

language on a processor that has SIMD instructions <strong>and</strong> 128-bit registers. Without<br />

knowing the details of the instruction set, briefly describe how you would<br />

implement this code, maximizing the use of sub-word operations <strong>and</strong> minimizing<br />

the amount of data that is transferred between registers <strong>and</strong> memory. State all your<br />

assumptions about the instructions you use.<br />

§3.2, page 182: 2.<br />

§3.5, page 221: 3.<br />

Answers to<br />

Check Yourself


4<br />

The Processor<br />

In a major matter, no<br />

details are small.<br />

French Proverb<br />

4.1 Introduction 244<br />

4.2 Logic <strong>Design</strong> Conventions 248<br />

4.3 Building a Datapath 251<br />

4.4 A Simple Implementation Scheme 259<br />

4.5 An Overview of Pipelining 272<br />

4.6 Pipelined Datapath <strong>and</strong> Control 286<br />

4.7 Data Hazards: Forwarding versus<br />

Stalling 303<br />

4.8 Control Hazards 316<br />

4.9 Exceptions 325<br />

4.10 Parallelism via Instructions 332<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


4.1 Introduction

However, it illustrates the key principles used in creating a datapath <strong>and</strong> designing<br />

the control. The implementation of the remaining instructions is similar.<br />

In examining the implementation, we will have the opportunity to see how the<br />

instruction set architecture determines many aspects of the implementation, <strong>and</strong><br />

how the choice of various implementation strategies affects the clock rate <strong>and</strong> CPI<br />

for the computer. Many of the key design principles introduced in Chapter 1 can<br />

be illustrated by looking at the implementation, such as Simplicity favors regularity.<br />

In addition, most concepts used to implement the MIPS subset in this chapter are<br />

the same basic ideas that are used to construct a broad spectrum of computers,<br />

from high-performance servers to general-purpose microprocessors to embedded<br />

processors.<br />

An Overview of the Implementation<br />

In Chapter 2, we looked at the core MIPS instructions, including the integer<br />

arithmetic-logical instructions, the memory-reference instructions, <strong>and</strong> the branch<br />

instructions. Much of what needs to be done to implement these instructions is the<br />

same, independent of the exact class of instruction. For every instruction, the first<br />

two steps are identical:<br />

1. Send the program counter (PC) to the memory that contains the code <strong>and</strong><br />

fetch the instruction from that memory.<br />

2. Read one or two registers, using fields of the instruction to select the registers<br />

to read. For the load word instruction, we need to read only one register, but<br />

most other instructions require reading two registers.<br />

After these two steps, the actions required to complete the instruction depend<br />

on the instruction class. Fortunately, for each of the three instruction classes<br />

(memory-reference, arithmetic-logical, <strong>and</strong> branches), the actions are largely the<br />

same, independent of the exact instruction. The simplicity <strong>and</strong> regularity of the<br />

MIPS instruction set simplifies the implementation by making the execution of<br />

many of the instruction classes similar.<br />

For example, all instruction classes, except jump, use the arithmetic-logical unit<br />

(ALU) after reading the registers. The memory-reference instructions use the ALU<br />

for an address calculation, the arithmetic-logical instructions for the operation<br />

execution, <strong>and</strong> branches for comparison. After using the ALU, the actions required<br />

to complete various instruction classes differ. A memory-reference instruction<br />

will need to access the memory either to read data for a load or write data for a<br />

store. An arithmetic-logical or load instruction must write the data from the ALU<br />

or memory back into a register. Lastly, for a branch instruction, we may need to<br />

change the next instruction address based on the comparison; otherwise, the PC<br />

should be incremented by 4 to get the address of the next instruction.<br />
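As a rough illustration added here (not taken from the text), the two shared steps and the class-dependent follow-up can be sketched in C. The array names instr_mem and regs, the function name, and the use of numeric opcodes for the MIPS subset (35 for lw, 43 for sw, 4 for beq) are assumptions made only for this sketch:

#include <stdint.h>

/* Hypothetical sketch, not the book's implementation: instr_mem, regs, and pc are invented names. */
extern uint32_t instr_mem[], regs[32], pc;

void execute_one_instruction(void)
{
    uint32_t instr = instr_mem[pc >> 2];   /* 1. send the PC to the instruction memory and fetch */
    uint32_t rs = (instr >> 21) & 0x1f;    /* 2. fields of the instruction select the registers  */
    uint32_t rt = (instr >> 16) & 0x1f;    /*    to read (a load needs only the first one)       */
    uint32_t a = regs[rs], b = regs[rt];

    switch (instr >> 26) {                 /* the remaining actions depend on the class          */
    case 35: /* lw:  ALU adds a to the sign-extended offset, memory is read, rt is written */ break;
    case 43: /* sw:  same address calculation, then b is written to data memory            */ break;
    case 4:  /* beq: ALU compares a and b; the PC gets PC + 4 or the branch target         */ break;
    default: /* R-type: ALU operates on a and b, and the result is written to register rd  */ break;
    }
    (void)a; (void)b;                      /* the class-specific details are sketched later      */
}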

Figure 4.1 shows the high-level view of a MIPS implementation, focusing on<br />

the various functional units <strong>and</strong> their interconnection. Although this figure shows<br />

most of the flow of data through the processor, it omits two important aspects of<br />

instruction execution.



First, in several places, Figure 4.1 shows data going to a particular unit as coming<br />

from two different sources. For example, the value written into the PC can come<br />

from one of two adders, the data written into the register file can come from either<br />

the ALU or the data memory, <strong>and</strong> the second input to the ALU can come from<br />

a register or the immediate field of the instruction. In practice, these data lines<br />

cannot simply be wired together; we must add a logic element that chooses from<br />

among the multiple sources <strong>and</strong> steers one of those sources to its destination. This<br />

selection is commonly done with a device called a multiplexor, although this device<br />

might better be called a data selector. Appendix B describes the multiplexor, which<br />

selects from among several inputs based on the setting of its control lines. The<br />

control lines are set based primarily on information taken from the instruction<br />

being executed.<br />
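As a rough software analogy (an illustration added here, not hardware), a two-input multiplexor simply steers one of its sources to the output under a control line:

#include <stdint.h>

/* Illustrative only: a 2-to-1 multiplexor (data selector) modeled as a C function. */
static inline uint32_t mux2(uint32_t in0, uint32_t in1, int control)
{
    return control ? in1 : in0;   /* the control line, set from the instruction, picks the source */
}

A multiplexor with more data inputs needs more control bits, but the idea is unchanged.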

The second omission in Figure 4.1 is that several of the units must be controlled<br />

depending on the type of instruction. For example, the data memory must read<br />


FIGURE 4.1 An abstract view of the implementation of the MIPS subset showing the<br />

major functional units <strong>and</strong> the major connections between them. All instructions start by using<br />

the program counter to supply the instruction address to the instruction memory. After the instruction is<br />

fetched, the register oper<strong>and</strong>s used by an instruction are specified by fields of that instruction. Once the<br />

register oper<strong>and</strong>s have been fetched, they can be operated on to compute a memory address (for a load or<br />

store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or a compare (for a<br />

branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to<br />

a register. If the operation is a load or store, the ALU result is used as an address to either store a value from<br />

the registers or load a value from memory into the registers. The result from the ALU or memory is written<br />

back into the register file. Branches require the use of the ALU output to determine the next instruction<br />

address, which comes either from the ALU (where the PC <strong>and</strong> branch offset are summed) or from an adder<br />

that increments the current PC by 4. The thick lines interconnecting the functional units represent buses,<br />

which consist of multiple signals. The arrows are used to guide the reader in knowing how information flows.<br />

Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot<br />

where the lines cross.



on a load and write on a store. The register file must be written only on a load

or an arithmetic-logical instruction. And, of course, the ALU must perform one<br />

of several operations. (Appendix B describes the detailed design of the ALU.)<br />

Like the multiplexors, control lines that are set on the basis of various fields in the<br />

instruction direct these operations.<br />

Figure 4.2 shows the datapath of Figure 4.1 with the three required multiplexors<br />

added, as well as control lines for the major functional units. A control unit,<br />

which has the instruction as an input, is used to determine how to set the control<br />

lines for the functional units <strong>and</strong> two of the multiplexors. The third multiplexor,<br />


FIGURE 4.2 The basic implementation of the MIPS subset, including the necessary multiplexors <strong>and</strong> control lines.<br />

The top multiplexor (“Mux”) controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled<br />

by the gate that “ANDs” together the Zero output of the ALU <strong>and</strong> a control signal that indicates that the instruction is a branch. The middle<br />

multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or<br />

the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine<br />

whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction<br />

(for a load or store). The added control lines are straightforward <strong>and</strong> determine the operation performed at the ALU, whether the data memory<br />

should read or write, <strong>and</strong> whether the registers should perform a write operation. The control lines are shown in color to make them easier to<br />

see.



sign-extend To increase<br />

the size of a data item by<br />

replicating the high-order<br />

sign bit of the original<br />

data item in the high-order

bits of the larger,<br />

destination data item.<br />

branch target<br />

address The address<br />

specified in a branch,<br />

which becomes the new<br />

program counter (PC)<br />

if the branch is taken. In<br />

the MIPS architecture the<br />

branch target is given by<br />

the sum of the offset field<br />

of the instruction <strong>and</strong> the<br />

address of the instruction<br />

following the branch.<br />

branch taken<br />

A branch where the<br />

branch condition is<br />

satisfied <strong>and</strong> the program<br />

counter (PC) becomes<br />

the branch target. All<br />

unconditional jumps are<br />

taken branches.<br />

branch not taken (or untaken branch)

A branch where the<br />

branch condition is false<br />

<strong>and</strong> the program counter<br />

(PC) becomes the address<br />

of the instruction that<br />

sequentially follows the<br />

branch.<br />

Next, consider the MIPS load word <strong>and</strong> store word instructions, which have the<br />

general form lw $t1,offset_value($t2) or sw $t1,offset_value<br />

($t2). These instructions compute a memory address by adding the base register,<br />

which is $t2, to the 16-bit signed offset field contained in the instruction. If the<br />

instruction is a store, the value to be stored must also be read from the register file<br />

where it resides in $t1. If the instruction is a load, the value read from memory<br />

must be written into the register file in the specified register, which is $t1. Thus,<br />

we will need both the register file <strong>and</strong> the ALU from Figure 4.7.<br />

In addition, we will need a unit to sign-extend the 16-bit offset field in the<br />

instruction to a 32-bit signed value, <strong>and</strong> a data memory unit to read from or write<br />

to. The data memory must be written on store instructions; hence, data memory<br />

has read <strong>and</strong> write control signals, an address input, <strong>and</strong> an input for the data to be<br />

written into memory. Figure 4.8 shows these two elements.<br />
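A hedged C sketch of these two elements follows; the array names regs and data_mem stand in for the register file and data memory and are invented for this illustration only:

#include <stdint.h>

extern uint32_t regs[32], data_mem[];

/* Sign-extension unit: replicate bit 15 of the 16-bit offset into the upper 16 bits. */
static int32_t sign_extend16(uint32_t offset) { return (int32_t)(int16_t)offset; }

/* lw $t1, offset($t2): the base register plus the sign-extended offset forms the address. */
void load_word(int t1, int t2, uint32_t offset16)
{
    uint32_t addr = regs[t2] + (uint32_t)sign_extend16(offset16);
    regs[t1] = data_mem[addr >> 2];         /* MemRead: the loaded value is written into $t1   */
}

/* sw $t1, offset($t2): the same address calculation, but the register value goes to memory. */
void store_word(int t1, int t2, uint32_t offset16)
{
    uint32_t addr = regs[t2] + (uint32_t)sign_extend16(offset16);
    data_mem[addr >> 2] = regs[t1];         /* MemWrite: $t1 is stored at the computed address */
}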

The beq instruction has three oper<strong>and</strong>s, two registers that are compared for<br />

equality, <strong>and</strong> a 16-bit offset used to compute the branch target address relative<br />

to the branch instruction address. Its form is beq $t1,$t2,offset. To<br />

implement this instruction, we must compute the branch target address by adding<br />

the sign-extended offset field of the instruction to the PC. There are two details in<br />

the definition of branch instructions (see Chapter 2) to which we must pay attention:<br />

■ The instruction set architecture specifies that the base for the branch address<br />

calculation is the address of the instruction following the branch. Since we<br />

compute PC + 4 (the address of the next instruction) in the instruction fetch<br />

datapath, it is easy to use this value as the base for computing the branch<br />

target address.<br />

■ The architecture also states that the offset field is shifted left 2 bits so that it<br />

is a word offset; this shift increases the effective range of the offset field by a<br />

factor of 4.<br />

To deal with the latter complication, we will need to shift the offset field by 2.<br />

As well as computing the branch target address, we must also determine whether<br />

the next instruction is the instruction that follows sequentially or the instruction<br />

at the branch target address. When the condition is true (i.e., the oper<strong>and</strong>s are<br />

equal), the branch target address becomes the new PC, <strong>and</strong> we say that the branch<br />

is taken. If the oper<strong>and</strong>s are not equal, the incremented PC should replace the<br />

current PC (just as for any other normal instruction); in this case, we say that the<br />

branch is not taken.<br />
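In C-like terms (a sketch added for illustration, with invented names), the target calculation and the taken/not-taken choice look like this:

#include <stdint.h>

/* Illustrative sketch of the two jobs of the branch datapath; all names are invented here. */
static int32_t sign_extend16(uint32_t offset) { return (int32_t)(int16_t)offset; }

uint32_t next_pc_for_beq(uint32_t pc, uint32_t offset16, uint32_t reg_a, uint32_t reg_b)
{
    uint32_t pc_plus_4 = pc + 4;                            /* the base the architecture specifies */
    uint32_t target = pc_plus_4 + ((uint32_t)sign_extend16(offset16) << 2);   /* word offset << 2  */
    return (reg_a == reg_b) ? target : pc_plus_4;           /* branch taken vs. branch not taken   */
}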

Thus, the branch datapath must do two operations: compute the branch target<br />

address <strong>and</strong> compare the register contents. (Branches also affect the instruction<br />

fetch portion of the datapath, as we will deal with shortly.) Figure 4.9 shows the<br />

structure of the datapath segment that h<strong>and</strong>les branches. To compute the branch<br />

target address, the branch datapath includes a sign extension unit, from Figure 4.8<br />

<strong>and</strong> an adder. To perform the compare, we need to use the register file shown in<br />

Figure 4.7a to supply the two register oper<strong>and</strong>s (although we will not need to write<br />

into the register file). In addition, the comparison can be done using the ALU we



FIGURE 4.9 The datapath for a branch uses the ALU to evaluate the branch condition <strong>and</strong><br />

a separate adder to compute the branch target as the sum of the incremented PC <strong>and</strong> the<br />

sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2<br />

bits. The unit labeled Shift left 2 is simply a routing of the signals between input <strong>and</strong> output that adds 00 two<br />

to the low-order end of the sign-extended offset field; no actual shift hardware is needed, since the amount of<br />

the “shift” is constant. Since we know that the offset was sign-extended from 16 bits, the shift will throw away<br />

only “sign bits.” Control logic is used to decide whether the incremented PC or branch target should replace<br />

the PC, based on the Zero output of the ALU.<br />

Creating a Single Datapath<br />

Now that we have examined the datapath components needed for the individual<br />

instruction classes, we can combine them into a single datapath <strong>and</strong> add the control<br />

to complete the implementation. This simplest datapath will attempt to execute all<br />

instructions in one clock cycle. This means that no datapath resource can be used<br />

more than once per instruction, so any element needed more than once must be<br />

duplicated. We therefore need a memory for instructions separate from one for<br />

data. Although some of the functional units will need to be duplicated, many of the<br />

elements can be shared by different instruction flows.<br />

To share a datapath element between two different instruction classes, we may<br />

need to allow multiple connections to the input of an element, using a multiplexor<br />

<strong>and</strong> control signal to select among the multiple inputs.



Building a Datapath<br />

The operations of arithmetic-logical (or R-type) instructions <strong>and</strong> the memory<br />

instructions datapath are quite similar. The key differences are the following:<br />

■ The arithmetic-logical instructions use the ALU, with the inputs coming<br />

from the two registers. The memory instructions can also use the ALU<br />

to do the address calculation, although the second input is the sign-extended

16-bit offset field from the instruction.<br />

■ The value stored into a destination register comes from the ALU (for an<br />

R-type instruction) or the memory (for a load).<br />

EXAMPLE

Show how to build a datapath for the operational portion of the memory-reference and arithmetic-logical instructions that uses a single register file and a single ALU to handle both types of instructions, adding any necessary multiplexors.

ANSWER

To create a datapath with only a single register file and a single ALU, we must support two different sources for the second ALU input, as well as two different sources for the data stored into the register file. Thus, one multiplexor is placed at the ALU input and another at the data input to the register file. Figure 4.10 shows the operational portion of the combined datapath.
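Expressed as C selections (an illustration with invented signal names, not the book's code), the two added multiplexors correspond to:

#include <stdint.h>

/* Illustrative only: ALUSrc picks the second ALU input, MemtoReg picks the register write data. */
uint32_t select_alu_b(uint32_t read_data_2, uint32_t sign_ext_imm, int alusrc)
{
    return alusrc ? sign_ext_imm : read_data_2;     /* 1 for lw/sw (use the offset), 0 for R-type */
}

uint32_t select_write_data(uint32_t alu_result, uint32_t mem_read_data, int memtoreg)
{
    return memtoreg ? mem_read_data : alu_result;   /* 1 for a load, 0 for an R-type result       */
}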

Now we can combine all the pieces to make a simple datapath for the core<br />

MIPS architecture by adding the datapath for instruction fetch (Figure 4.6), the<br />

datapath from R-type <strong>and</strong> memory instructions (Figure 4.10), <strong>and</strong> the datapath<br />

for branches (Figure 4.9). Figure 4.11 shows the datapath we obtain by composing<br />

the separate pieces. The branch instruction uses the main ALU for comparison of<br />

the register oper<strong>and</strong>s, so we must keep the adder from Figure 4.9 for computing<br />

the branch target address. An additional multiplexor is required to select either the<br />

sequentially following instruction address (PC + 4) or the branch target address to<br />

be written into the PC.<br />

Now that we have completed this simple datapath, we can add the control unit.<br />

The control unit must be able to take inputs <strong>and</strong> generate a write signal for each<br />

state element, the selector control for each multiplexor, <strong>and</strong> the ALU control. The<br />

ALU control is different in a number of ways, <strong>and</strong> it will be useful to design it first<br />

before we design the rest of the control unit.<br />

Check Yourself

I. Which of the following is correct for a load instruction? Refer to Figure 4.10.

a. MemtoReg should be set to cause the data from memory to be sent to the register file.



FIGURE 4.10 The datapath for the memory instructions <strong>and</strong> the R-type instructions. This example shows how a single<br />

datapath can be assembled from the pieces in Figures 4.7 <strong>and</strong> 4.8 by adding multiplexors. Two multiplexors are needed, as described in the<br />

example.<br />


FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different<br />

instruction classes. The components come from Figures 4.6, 4.9, <strong>and</strong> 4.10. This datapath can execute the basic instructions (load-store<br />

word, ALU operations, <strong>and</strong> branches) in a single clock cycle. Just one additional multiplexor is needed to integrate branches. The support for<br />

jumps will be added later.



Field 0 rs rt rd shamt funct<br />

Bit positions 31:26 25:21 20:16 15:11 10:6 5:0<br />

a. R-type instruction<br />

Field 35 or 43 rs rt address<br />

Bit positions 31:26 25:21 20:16 15:0<br />

b. Load or store instruction<br />

Field 4 rs rt address<br />

Bit positions 31:26 25:21 20:16 15:0<br />

c. Branch instruction<br />

FIGURE 4.14 The three instruction classes (R-type, load <strong>and</strong> store, <strong>and</strong> branch) use two<br />

different instruction formats. The jump instructions use another format, which we will discuss shortly.<br />

(a) Instruction format for R-format instructions, which all have an opcode of 0. These instructions have three<br />

register oper<strong>and</strong>s: rs, rt, <strong>and</strong> rd. Fields rs <strong>and</strong> rt are sources, <strong>and</strong> rd is the destination. The ALU function is<br />

in the funct field <strong>and</strong> is decoded by the ALU control design in the previous section. The R-type instructions<br />

that we implement are add, sub, AND, OR, <strong>and</strong> slt. The shamt field is used only for shifts; we will ignore it<br />

in this chapter. (b) Instruction format for load (opcode = 35 ten) and store (opcode = 43 ten) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory. (c) Instruction format for branch equal (opcode = 4). The registers rs and rt are the

source registers that are compared for equality. The 16-bit address field is sign-extended, shifted, <strong>and</strong> added<br />

to the PC + 4 to compute the branch target address.<br />

opcode The field that<br />

denotes the operation <strong>and</strong><br />

format of an instruction.<br />

the formats of the three instruction classes: the R-type, branch, <strong>and</strong> load-store<br />

instructions. Figure 4.14 shows these formats.<br />

There are several major observations about this instruction format that we will<br />

rely on:<br />

■ The op field, which as we saw in Chapter 2 is called the opcode, is always<br />

contained in bits 31:26. We will refer to this field as Op[5:0].<br />

■ The two registers to be read are always specified by the rs <strong>and</strong> rt fields, at<br />

positions 25:21 <strong>and</strong> 20:16. This is true for the R-type instructions, branch<br />

equal, <strong>and</strong> store.<br />

■ The base register for load <strong>and</strong> store instructions is always in bit positions<br />

25:21 (rs).<br />

■ The 16-bit offset for branch equal, load, <strong>and</strong> store is always in positions 15:0.<br />

■ The destination register is in one of two places. For a load it is in bit positions<br />

20:16 (rt), while for an R-type instruction it is in bit positions 15:11 (rd).<br />

Thus, we will need to add a multiplexor to select which field of the instruction<br />

is used to indicate the register number to be written, as sketched below.
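These observations translate directly into field extraction plus one selection. The sketch below is illustrative only; the struct, function, and signal names are invented here:

#include <stdint.h>

/* Illustrative decode of the fixed field positions, plus the destination-register
   multiplexor (RegDst); the names are assumptions made for this sketch. */
struct fields { uint32_t op, rs, rt, rd, imm16; };

struct fields decode(uint32_t instr)
{
    struct fields f;
    f.op    = instr >> 26;            /* Op[5:0] is always bits 31:26                    */
    f.rs    = (instr >> 21) & 0x1f;   /* first source / base register, bits 25:21        */
    f.rt    = (instr >> 16) & 0x1f;   /* second source (or load destination), bits 20:16 */
    f.rd    = (instr >> 11) & 0x1f;   /* R-type destination, bits 15:11                  */
    f.imm16 = instr & 0xffff;         /* 16-bit offset for beq, lw, and sw, bits 15:0    */
    return f;
}

uint32_t write_register_number(struct fields f, int regdst)
{
    return regdst ? f.rd : f.rt;      /* the added multiplexor: rd for R-type, rt for lw */
}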

The first design principle from Chapter 2—simplicity favors regularity—pays off<br />

here in specifying control.



FIGURE 4.15 The datapath of Figure 4.11 with all necessary multiplexors <strong>and</strong> all control lines identified. The control<br />

lines are shown in color. The ALU control block has also been added. The PC does not require a write control, since it is written once at the end<br />

of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address.<br />

Using this information, we can add the instruction labels <strong>and</strong> extra multiplexor<br />

(for the Write register number input of the register file) to the simple datapath.<br />

Figure 4.15 shows these additions plus the ALU control block, the write signals for<br />

state elements, the read signal for the data memory, <strong>and</strong> the control signals for the<br />

multiplexors. Since all the multiplexors have two inputs, they each require a single<br />

control line.<br />

Figure 4.15 shows seven single-bit control lines plus the 2-bit ALUOp control<br />

signal. We have already defined how the ALUOp control signal works, <strong>and</strong> it is<br />

useful to define what the seven other control signals do informally before we<br />

determine how to set these control signals during instruction execution. Figure<br />

4.16 describes the function of these seven control lines.<br />

Now that we have looked at the function of each of the control signals, we can<br />

look at how to set them. The control unit can set all but one of the control signals<br />

based solely on the opcode field of the instruction. The PCSrc control line is the<br />

exception. That control line should be asserted if the instruction is branch on equal<br />

(a decision that the control unit can make) <strong>and</strong> the Zero output of the ALU, which<br />

is used for equality comparison, is asserted. To generate the PCSrc signal, we will<br />

need to AND together a signal from the control unit, which we call Branch, with<br />

the Zero signal out of the ALU.



FIGURE 4.17 The simple datapath with the control unit. The input to the control unit is the 6-bit opcode field from the instruction.<br />

The outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, <strong>and</strong> MemtoReg), three<br />

signals for controlling reads <strong>and</strong> writes in the register file <strong>and</strong> data memory (RegWrite, MemRead, <strong>and</strong> MemWrite), a 1-bit signal used in<br />

determining whether to possibly branch (Branch), <strong>and</strong> a 2-bit control signal for the ALU (ALUOp). An AND gate is used to combine the<br />

branch control signal <strong>and</strong> the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now<br />

a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.<br />

Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow

of information:<br />

1. The instruction is fetched, <strong>and</strong> the PC is incremented.<br />

2. Two registers, $t2 <strong>and</strong> $t3, are read from the register file; also, the main<br />

control unit computes the setting of the control lines during this step.<br />

3. The ALU operates on the data read from the register file, using the function<br />

code (bits 5:0, which is the funct field, of the instruction) to generate the<br />

ALU function.



3. The ALU computes the sum of the value read from the register file <strong>and</strong> the<br />

sign-extended, lower 16 bits of the instruction (offset).<br />

4. The sum from the ALU is used as the address for the data memory.<br />

5. The data from the memory unit is written into the register file; the register<br />

destination is given by bits 20:16 of the instruction ($t1).<br />

Finally, we can show the operation of the branch-on-equal instruction, such as<br />

beq $t1, $t2, offset, in the same fashion. It operates much like an R-format<br />

instruction, but the ALU output is used to determine whether the PC is written with<br />

PC + 4 or the branch target address. Figure 4.21 shows the four steps in execution:<br />

1. An instruction is fetched from the instruction memory, <strong>and</strong> the PC is<br />

incremented.<br />


FIGURE 4.21 The datapath in operation for a branch-on-equal instruction. The control lines, datapath units, <strong>and</strong> connections<br />

that are active are highlighted. After using the register file <strong>and</strong> ALU to perform the compare, the Zero output is used to select the next program<br />

counter from between the two c<strong>and</strong>idates.



single-cycle<br />

implementation Also<br />

called single clock cycle<br />

implementation. An<br />

implementation in which<br />

an instruction is executed<br />

in one clock cycle. While<br />

easy to underst<strong>and</strong>, it is<br />

too slow to be practical.<br />

Now that we have a single-cycle implementation of most of the MIPS core<br />

instruction set, let’s add the jump instruction to show how the basic datapath <strong>and</strong><br />

control can be extended to h<strong>and</strong>le other instructions in the instruction set.<br />

EXAMPLE<br />

Implementing Jumps<br />

Figure 4.17 shows the implementation of many of the instructions we looked at<br />

in Chapter 2. One class of instructions missing is that of the jump instruction.<br />

Extend the datapath <strong>and</strong> control of Figure 4.17 to include the jump instruction.<br />

Describe how to set any new control lines.<br />

ANSWER<br />

The jump instruction, shown in Figure 4.23, looks somewhat like a branch<br />

instruction but computes the target PC differently <strong>and</strong> is not conditional. Like<br />

a branch, the low-order 2 bits of a jump address are always 00 two<br />

. The next<br />

lower 26 bits of this 32-bit address come from the 26-bit immediate field in the<br />

instruction. The upper 4 bits of the address that should replace the PC come<br />

from the PC of the jump instruction plus 4. Thus, as sketched in C after the following list, we can implement a jump by storing into the PC the concatenation of

■ the upper 4 bits of the current PC + 4 (these are bits 31:28 of the<br />

sequentially following instruction address)<br />

■ the 26-bit immediate field of the jump instruction<br />

■ the bits 00 two<br />
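A hedged C sketch of this concatenation (the function and parameter names are invented for illustration):

#include <stdint.h>

/* Illustrative sketch: form the jump target address by concatenation. */
uint32_t jump_target(uint32_t pc_plus_4, uint32_t imm26)
{
    return (pc_plus_4 & 0xF0000000u)        /* upper 4 bits come from PC + 4                  */
         | ((imm26 & 0x03FFFFFFu) << 2);    /* 26-bit field shifted left 2; low bits become 00 */
}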

Figure 4.24 shows the addition of the control for jump added to Figure 4.17. An<br />

additional multiplexor is used to select the source for the new PC value, which<br />

is either the incremented PC (PC + 4), the branch target PC, or the jump target<br />

PC. One additional control signal is needed for the additional multiplexor. This<br />

control signal, called Jump, is asserted only when the instruction is a jump—<br />

that is, when the opcode is 2.<br />

Field 000010 address<br />

Bit positions 31:26 25:0<br />

FIGURE 4.23 Instruction format for the jump instruction (opcode = 2). The destination<br />

address for a jump instruction is formed by concatenating the upper 4 bits of the current PC + 4 to the 26-bit<br />

address field in the jump instruction <strong>and</strong> adding 00 as the 2 low-order bits.



FIGURE 4.24 The simple control <strong>and</strong> datapath are extended to h<strong>and</strong>le the jump instruction. An additional multiplexor (at<br />

the upper right) is used to choose between the jump target <strong>and</strong> either the branch target or the sequential instruction following this one. This<br />

multiplexor is controlled by the jump control signal. The jump target address is obtained by shifting the lower 26 bits of the jump instruction<br />

left 2 bits, effectively adding 00 as the low-order bits, <strong>and</strong> then concatenating the upper 4 bits of PC + 4 as the high-order bits, thus yielding a<br />

32-bit address.<br />

Why a Single-Cycle Implementation Is Not Used Today<br />

Although the single-cycle design will work correctly, it would not be used in<br />

modern designs because it is inefficient. To see why this is so, notice that the clock<br />

cycle must have the same length for every instruction in this single-cycle design.<br />

Of course, the longest possible path in the processor determines the clock cycle.<br />

This path is almost certainly a load instruction, which uses five functional units<br />

in series: the instruction memory, the register file, the ALU, the data memory, <strong>and</strong><br />

the register file. Although the CPI is 1 (see Chapter 1), the overall performance of<br />

a single-cycle implementation is likely to be poor, since the clock cycle is too long.<br />

The penalty for using the single-cycle design with a fixed clock cycle is significant,<br />

but might be considered acceptable for this small instruction set. Historically, early



computers with very simple instruction sets did use this implementation technique.<br />

However, if we tried to implement the floating-point unit or an instruction set with<br />

more complex instructions, this single-cycle design wouldn’t work well at all.<br />

Because we must assume that the clock cycle is equal to the worst-case delay<br />

for all instructions, it’s useless to try implementation techniques that reduce the<br />

delay of the common case but do not improve the worst-case cycle time. A single-cycle

implementation thus violates the great idea from Chapter 1 of making the<br />

common case fast.<br />

In the next section, we'll look at another implementation technique, called

pipelining, that uses a datapath very similar to the single-cycle datapath but is<br />

much more efficient by having a much higher throughput. Pipelining improves<br />

efficiency by executing multiple instructions simultaneously.<br />

Check<br />

Yourself<br />

Look at the control signals in Figure 4.22. Can you combine any together? Can any<br />

control signal output in the figure be replaced by the inverse of another? (Hint: take<br />

into account the don’t cares.) If so, can you use one signal for the other without<br />

adding an inverter?<br />

4.5 An Overview of Pipelining<br />

Never waste time.<br />

American proverb<br />

pipelining An<br />

implementation<br />

technique in which<br />

multiple instructions are<br />

overlapped in execution,<br />

much like an assembly<br />

line.<br />

Pipelining is an implementation technique in which multiple instructions are<br />

overlapped in execution. Today, pipelining is nearly universal.<br />

This section relies heavily on one analogy to give an overview of the pipelining<br />

terms <strong>and</strong> issues. If you are interested in just the big picture, you should concentrate<br />

on this section <strong>and</strong> then skip to Sections 4.10 <strong>and</strong> 4.11 to see an introduction to the<br />

advanced pipelining techniques used in recent processors such as the Intel Core i7<br />

<strong>and</strong> ARM Cortex-A8. If you are interested in exploring the anatomy of a pipelined<br />

computer, this section is a good introduction to Sections 4.6 through 4.9.<br />

Anyone who has done a lot of laundry has intuitively used pipelining. The nonpipelined<br />

approach to laundry would be as follows:<br />

1. Place one dirty load of clothes in the washer.<br />

2. When the washer is finished, place the wet load in the dryer.<br />

3. When the dryer is finished, place the dry load on a table <strong>and</strong> fold.<br />

4. When folding is finished, ask your roommate to put the clothes away.<br />

When your roommate is done, start over with the next dirty load.<br />

The pipelined approach takes much less time, as Figure 4.25 shows. As soon<br />

as the washer is finished with the first load <strong>and</strong> placed in the dryer, you load the<br />

washer with the second dirty load. When the first load is dry, you place it on the<br />

table to start folding, move the wet load to the dryer, <strong>and</strong> put the next dirty load



pipeline, in this case four: washing, drying, folding, <strong>and</strong> putting away. Therefore,<br />

pipelined laundry is potentially four times faster than nonpipelined: 20 loads would<br />

take about 5 times as long as 1 load, while 20 loads of sequential laundry takes 20<br />

times as long as 1 load. It’s only 2.3 times faster in Figure 4.25, because we only<br />

show 4 loads. Notice that at the beginning <strong>and</strong> end of the workload in the pipelined<br />

version in Figure 4.25, the pipeline is not completely full; this start-up and wind-down

affects performance when the number of tasks is not large compared to the<br />

number of stages in the pipeline. If the number of loads is much larger than 4, then<br />

the stages will be full most of the time <strong>and</strong> the increase in throughput will be very<br />

close to 4.<br />

The same principles apply to processors where we pipeline instruction-execution.<br />

MIPS instructions classically take five steps:<br />

1. Fetch instruction from memory.<br />

2. Read registers while decoding the instruction. The regular format of MIPS<br />

instructions allows reading <strong>and</strong> decoding to occur simultaneously.<br />

3. Execute the operation or calculate an address.<br />

4. Access an oper<strong>and</strong> in data memory.<br />

5. Write the result into a register.<br />

Hence, the MIPS pipeline we explore in this chapter has five stages. The following<br />

example shows that pipelining speeds up instruction execution just as it speeds up<br />

the laundry.<br />

EXAMPLE

Single-Cycle versus Pipelined Performance

To make this discussion concrete, let’s create a pipeline. In this example, <strong>and</strong> in<br />

the rest of this chapter, we limit our attention to eight instructions: load word<br />

(lw), store word (sw), add (add), subtract (sub), AND (<strong>and</strong>), OR (or), set<br />

less than (slt), <strong>and</strong> branch on equal (beq).<br />

Compare the average time between instructions of a single-cycle<br />

implementation, in which all instructions take one clock cycle, to a pipelined<br />

implementation. The operation times for the major functional units in this<br />

example are 200 ps for memory access, 200 ps for ALU operation, <strong>and</strong> 100 ps<br />

for register file read or write. In the single-cycle model, every instruction takes<br />

exactly one clock cycle, so the clock cycle must be stretched to accommodate<br />

the slowest instruction.<br />

ANSWER

Figure 4.26 shows the time required for each of the eight instructions.

The single-cycle design must allow for the slowest instruction—in Figure<br />

4.26 it is lw—so the time required for every instruction is 800 ps. Similarly



to Figure 4.25, Figure 4.27 compares nonpipelined <strong>and</strong> pipelined execution<br />

of three load word instructions. Thus, the time between the first <strong>and</strong> fourth<br />

instructions in the nonpipelined design is 3 × 800 ps, or 2400 ps.

All the pipeline stages take a single clock cycle, so the clock cycle must be long<br />

enough to accommodate the slowest operation. Just as the single-cycle design<br />

must take the worst-case clock cycle of 800 ps, even though some instructions<br />

can be as fast as 500 ps, the pipelined execution clock cycle must have the<br />

worst-case clock cycle of 200 ps, even though some stages take only 100 ps.<br />

Pipelining still offers a fourfold performance improvement: the time between<br />

the first <strong>and</strong> fourth instructions is 3 × 200 ps or 600 ps.<br />

We can turn the pipelining speed-up discussion above into a formula. If the<br />

stages are perfectly balanced, then the time between instructions on the pipelined<br />

processor—assuming ideal conditions—is equal to<br />

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions <strong>and</strong> with a large number of instructions, the speed-up<br />

from pipelining is approximately equal to the number of pipe stages; a five-stage<br />

pipeline is nearly five times faster.<br />

The formula suggests that a five-stage pipeline should offer nearly a fivefold<br />

improvement over the 800 ps nonpipelined time, or a 160 ps clock cycle. The<br />

example shows, however, that the stages may be imperfectly balanced. Moreover,<br />

pipelining involves some overhead, the source of which will be clearer shortly.<br />

Thus, the time per instruction in the pipelined processor will exceed the minimum<br />

possible, <strong>and</strong> speed-up will be less than the number of pipeline stages.<br />
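As a quick check of the formula with the numbers above (a worked example added here, not a quotation from the text): for a program of 1,000,000 load instructions, the nonpipelined design needs 1,000,000 × 800 ps = 800,000,000 ps, while the five-stage pipeline needs about (1,000,000 + 4) × 200 ps ≈ 200,000,800 ps. The speed-up is therefore about 4.0; it approaches 800 ps / 200 ps = 4 rather than the ideal 5 because the stages are not perfectly balanced.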

Instruction class | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time
Load word (lw) | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps | 800 ps
Store word (sw) | 200 ps | 100 ps | 200 ps | 200 ps | | 700 ps
R-format (add, sub, AND, OR, slt) | 200 ps | 100 ps | 200 ps | | 100 ps | 600 ps
Branch (beq) | 200 ps | 100 ps | 200 ps | | | 500 ps

FIGURE 4.26 Total time for each instruction calculated from the time for each component.<br />

This calculation assumes that the multiplexors, control unit, PC accesses, <strong>and</strong> sign extension unit have no<br />

delay.



Pipelining improves performance by increasing instruction throughput, as<br />

opposed to decreasing the execution time of an individual instruction, but instruction<br />

throughput is the important metric because real programs execute billions of<br />

instructions.<br />

<strong>Design</strong>ing Instruction Sets for Pipelining<br />

Even with this simple explanation of pipelining, we can get insight into the design<br />

of the MIPS instruction set, which was designed for pipelined execution.<br />

First, all MIPS instructions are the same length. This restriction makes it much<br />

easier to fetch instructions in the first pipeline stage <strong>and</strong> to decode them in the<br />

second stage. In an instruction set like the x86, where instructions vary from 1 byte<br />

to 15 bytes, pipelining is considerably more challenging. Recent implementations<br />

of the x86 architecture actually translate x86 instructions into simple operations<br />

that look like MIPS instructions <strong>and</strong> then pipeline the simple operations rather<br />

than the native x86 instructions! (See Section 4.10.)<br />

Second, MIPS has only a few instruction formats, with the source register fields<br />

being located in the same place in each instruction. This symmetry means that the<br />

second stage can begin reading the register file at the same time that the hardware<br />

is determining what type of instruction was fetched. If MIPS instruction formats<br />

were not symmetric, we would need to split stage 2, resulting in six pipeline stages.<br />

We will shortly see the downside of longer pipelines.<br />

Third, memory oper<strong>and</strong>s only appear in loads or stores in MIPS. This restriction<br />

means we can use the execute stage to calculate the memory address <strong>and</strong> then<br />

access memory in the following stage. If we could operate on the oper<strong>and</strong>s in<br />

memory, as in the x86, stages 3 <strong>and</strong> 4 would exp<strong>and</strong> to an address stage, memory<br />

stage, <strong>and</strong> then execute stage.<br />

Fourth, as discussed in Chapter 2, oper<strong>and</strong>s must be aligned in memory. Hence,<br />

we need not worry about a single data transfer instruction requiring two data<br />

memory accesses; the requested data can be transferred between processor <strong>and</strong><br />

memory in a single pipeline stage.<br />

Pipeline Hazards<br />

There are situations in pipelining when the next instruction cannot execute in the<br />

following clock cycle. These events are called hazards, <strong>and</strong> there are three different<br />

types.<br />

Structural Hazards

The first hazard is called a structural hazard. It means that the hardware cannot<br />

support the combination of instructions that we want to execute in the same clock<br />

cycle. A structural hazard in the laundry room would occur if we used a washer-dryer

combination instead of a separate washer <strong>and</strong> dryer, or if our roommate was<br />

busy doing something else <strong>and</strong> wouldn’t put clothes away. Our carefully scheduled<br />

pipeline plans would then be foiled.<br />

structural hazard When<br />

a planned instruction<br />

cannot execute in the<br />

proper clock cycle because<br />

the hardware does not<br />

support the combination<br />

of instructions that are set<br />

to execute.



As we said above, the MIPS instruction set was designed to be pipelined,<br />

making it fairly easy for designers to avoid structural hazards when designing a<br />

pipeline. Suppose, however, that we had a single memory instead of two memories.<br />

If the pipeline in Figure 4.27 had a fourth instruction, we would see that in the<br />

same clock cycle the first instruction is accessing data from memory while the<br />

fourth instruction is fetching an instruction from that same memory. Without two<br />

memories, our pipeline could have a structural hazard.<br />

data hazard Also<br />

called a pipeline data<br />

hazard. When a planned<br />

instruction cannot<br />

execute in the proper<br />

clock cycle because data<br />

that is needed to execute<br />

the instruction is not yet<br />

available.<br />

forwarding Also called<br />

bypassing. A method of<br />

resolving a data hazard<br />

by retrieving the missing<br />

data element from<br />

internal buffers rather<br />

than waiting for it to<br />

arrive from programmer-visible

registers or<br />

memory.<br />

Data Hazards<br />

Data hazards occur when the pipeline must be stalled because one step must wait<br />

for another to complete. Suppose you found a sock at the folding station for which<br />

no match existed. One possible strategy is to run down to your room <strong>and</strong> search<br />

through your clothes bureau to see if you can find the match. Obviously, while you<br />

are doing the search, loads must wait that have completed drying <strong>and</strong> are ready to<br />

fold as well as those that have finished washing <strong>and</strong> are ready to dry.<br />

In a computer pipeline, data hazards arise from the dependence of one<br />

instruction on an earlier one that is still in the pipeline (a relationship that does not<br />

really exist when doing laundry). For example, suppose we have an add instruction<br />

followed immediately by a subtract instruction that uses the sum ($s0):<br />

add $s0, $t0, $t1<br />

sub $t2, $s0, $t3<br />

Without intervention, a data hazard could severely stall the pipeline. The add<br />

instruction doesn’t write its result until the fifth stage, meaning that we would have<br />

to waste three clock cycles in the pipeline.<br />

Although we could try to rely on compilers to remove all such hazards, the<br />

results would not be satisfactory. These dependences happen just too often <strong>and</strong> the<br />

delay is just too long to expect the compiler to rescue us from this dilemma.<br />

The primary solution is based on the observation that we don’t need to wait for<br />

the instruction to complete before trying to resolve the data hazard. For the code<br />

sequence above, as soon as the ALU creates the sum for the add, we can supply it as<br />

an input for the subtract. Adding extra hardware to retrieve the missing item early<br />

from the internal resources is called forwarding or bypassing.<br />
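One way such hardware might detect when to forward is sketched below in C. This anticipates the detailed treatment in Section 4.7; the pipeline-register field and signal names here are invented for illustration, not the book's:

#include <stdint.h>

/* Hedged sketch: if the instruction now finishing its EX stage will write the register that the
   following instruction wants to read, steer the ALU input from that fresh result instead of the
   stale value read from the register file. */
struct ex_mem { int reg_write; uint32_t rd; uint32_t alu_result; };

uint32_t alu_input_a(struct ex_mem prior, uint32_t id_ex_rs, uint32_t reg_file_value)
{
    if (prior.reg_write && prior.rd != 0 && prior.rd == id_ex_rs)
        return prior.alu_result;   /* forward the just-computed sum (e.g., $s0 from the add) */
    return reg_file_value;         /* otherwise use the value read from the register file    */
}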

EXAMPLE<br />

Forwarding with Two Instructions<br />

For the two instructions above, show what pipeline stages would be connected<br />

by forwarding. Use the drawing in Figure 4.28 to represent the datapath during<br />

the five stages of the pipeline. Align a copy of the datapath for each instruction,<br />

similar to the laundry pipeline in Figure 4.25.



[Figure 4.28 shows add $s0, $t0, $t1 passing through the stages IF, ID, EX, MEM, and WB, one every 200 ps.]

FIGURE 4.28 Graphical representation of the instruction pipeline, similar in spirit to<br />

the laundry pipeline in Figure 4.25. Here we use symbols representing the physical resources with<br />

the abbreviations for pipeline stages used throughout the chapter. The symbols for the five stages: IF for<br />

the instruction fetch stage, with the box representing instruction memory; ID for the instruction decode/<br />

register file read stage, with the drawing showing the register file being read; EX for the execution stage,<br />

with the drawing representing the ALU; MEM for the memory access stage, with the box representing data<br />

memory; <strong>and</strong> WB for the write-back stage, with the drawing showing the register file being written. The<br />

shading indicates the element is used by the instruction. Hence, MEM has a white background because add<br />

does not access the data memory. Shading on the right half of the register file or memory means the element<br />

is read in that stage, <strong>and</strong> shading of the left half means it is written in that stage. Hence the right half of ID is<br />

shaded in the second stage because the register file is read, <strong>and</strong> the left half of WB is shaded in the fifth stage<br />

because the register file is written.<br />

ANSWER

Figure 4.29 shows the connection to forward the value in $s0 after the execution stage of the add instruction as input to the execution stage of the sub instruction.

In this graphical representation of events, forwarding paths are valid only if the<br />

destination stage is later in time than the source stage. For example, there cannot<br />

be a valid forwarding path from the output of the memory access stage in the first<br />

instruction to the input of the execution stage of the following, since that would<br />

mean going backward in time.<br />

Forwarding works very well <strong>and</strong> is described in detail in Section 4.7. It cannot<br />

prevent all pipeline stalls, however. For example, suppose the first instruction was a<br />

load of $s0 instead of an add. As we can imagine from looking at Figure 4.29, the<br />

[Figure 4.29 shows add $s0, $t0, $t1 and sub $t2, $s0, $t3 each passing through IF, ID, EX, MEM, and WB, offset by one clock cycle, with a forwarding path from the EX stage of add to the EX stage of sub.]

FIGURE 4.29 Graphical representation of forwarding. The connection shows the forwarding path<br />

from the output of the EX stage of add to the input of the EX stage for sub, replacing the value from register<br />

$s0 read in the second stage of sub.



Find the hazards in the preceding code segment <strong>and</strong> reorder the instructions<br />

to avoid any pipeline stalls.<br />

ANSWER

Both add instructions have a hazard because of their respective dependence on the immediately preceding lw instruction. Notice that bypassing eliminates several other potential hazards, including the dependence of the first add on the first lw and any hazards for store instructions. Moving up the third lw instruction to become the third instruction eliminates both hazards:

lw $t1, 0($t0)<br />

lw $t2, 4($t0)<br />

lw $t4, 8($t0)<br />

add $t3, $t1,$t2<br />

sw $t3, 12($t0)<br />

add $t5, $t1,$t4<br />

sw $t5, 16($t0)<br />

On a pipelined processor with forwarding, the reordered sequence will<br />

complete in two fewer cycles than the original version.<br />
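As a rough check on that claim, here is a back-of-the-envelope model (my own sketch, assuming each load-use dependence costs exactly one stall cycle even with forwarding):

    def pipeline_cycles(num_instructions, stalls, stages=5):
        # First instruction takes `stages` cycles; each later instruction
        # finishes one cycle after its predecessor; each stall adds a cycle.
        return stages + (num_instructions - 1) + stalls

    # Original order: each add immediately follows the lw that produces one of
    # its operands, so each add causes one load-use stall even with forwarding.
    print(pipeline_cycles(7, stalls=2))   # 13 cycles
    # Reordered: hoisting the third lw removes both load-use stalls.
    print(pipeline_cycles(7, stalls=0))   # 11 cycles, i.e., two fewer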

Forwarding yields another insight into the MIPS architecture, in addition to the<br />

four mentioned on page 277. Each MIPS instruction writes at most one result <strong>and</strong><br />

does this in the last stage of the pipeline. Forwarding is harder if there are multiple<br />

results to forward per instruction or if there is a need to write a result early on in<br />

instruction execution.<br />

Elaboration: The name “forwarding” comes from the idea that the result is passed forward from an earlier instruction to a later instruction. “Bypassing” comes from passing the result around the register file to the desired unit.

Control Hazards<br />

The third type of hazard is called a control hazard, arising from the need to make a<br />

decision based on the results of one instruction while others are executing.<br />

Suppose our laundry crew was given the happy task of cleaning the uniforms<br />

of a football team. Given how filthy the laundry is, we need to determine whether<br />

the detergent <strong>and</strong> water temperature setting we select is strong enough to get the<br />

uniforms clean but not so strong that the uniforms wear out sooner. In our laundry<br />

pipeline, we have to wait until after the second stage to examine the dry uniform to<br />

see if we need to change the washer setup or not. What to do?<br />

Here is the first of two solutions to control hazards in the laundry room <strong>and</strong> its<br />

computer equivalent.<br />

Stall: Just operate sequentially until the first batch is dry <strong>and</strong> then repeat until<br />

you have the right formula.<br />

This conservative option certainly works, but it is slow.<br />

control hazard Also called branch hazard. When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.


4.6 Pipelined Datapath and Control

[Figure 4.33: the single-cycle datapath divided into five sections labeled IF: instruction fetch, ID: instruction decode/register file read, EX: execute/address calculation, MEM: memory access, and WB: write back.]

FIGURE 4.33 The single-cycle datapath from Section 4.4 (similar to Figure 4.17). Each step of the instruction can be mapped onto the datapath from left to right. The only exceptions are the update of the PC and the write-back step, shown in color, which sends either the ALU result or the data from memory to the left to be written into the register file. (Normally we use color lines for control, but these are data lines.)

Instructions and data move generally from left to right through the five stages as they complete execution. Returning to our laundry analogy, clothes get cleaner, drier, and more organized as they move through the line, and they never move backward.

There are, however, two exceptions to this left-to-right flow of instructions:<br />

■ The write-back stage, which places the result back into the register file in the<br />

middle of the datapath<br />

■ The selection of the next value of the PC, choosing between the incremented<br />

PC <strong>and</strong> the branch address from the MEM stage<br />

Data flowing from right to left does not affect the current instruction; these<br />

reverse data movements influence only later instructions in the pipeline. Note that the first right-to-left flow of data can lead to data hazards and the second leads to control hazards.

One way to show what happens in pipelined execution is to pretend that each<br />

instruction has its own datapath, <strong>and</strong> then to place these datapaths on a timeline to<br />

show their relationship. Figure 4.34 shows the execution of the instructions in Figure<br />

4.27 by displaying their private datapaths on a common timeline. We use a stylized<br />

version of the datapath in Figure 4.33 to show the relationships in Figure 4.34.<br />

Figure 4.34 seems to suggest that three instructions need three datapaths.<br />

Instead, we add registers to hold data so that portions of a single datapath can be<br />

shared during instruction execution.<br />

For example, as Figure 4.34 shows, the instruction memory is used during<br />

only one of the five stages of an instruction, allowing it to be shared by following<br />

instructions during the other four stages. To retain the value of an individual<br />

instruction for its other four stages, the value read from instruction memory must<br />

be saved in a register. Similar arguments apply to every pipeline stage, so we must<br />

place registers wherever there are dividing lines between stages in Figure 4.33.<br />

Returning to our laundry analogy, we might have a basket between each pair of<br />

stages to hold the clothes for the next step.<br />

[Figure 4.34: lw $1, 100($0), lw $2, 200($0), and lw $3, 300($0), each flowing through IM, Reg, ALU, DM, and Reg across clock cycles CC 1 through CC 7.]

FIGURE 4.34 Instructions being executed using the single-cycle datapath in Figure 4.33, assuming pipelined execution. Similar to Figures 4.28 through 4.30, this figure pretends that each instruction has its own datapath, and shades each portion according to use. Unlike those figures, each stage is labeled by the physical resource used in that stage, corresponding to the portions of the datapath in Figure 4.33. IM represents the instruction memory and the PC in the instruction fetch stage, Reg stands for the register file and sign extender in the instruction decode/register file read stage (ID), and so on. To maintain proper time order, this stylized datapath breaks the register file into two logical parts: registers read during register fetch (ID) and registers written during write back (WB). This dual use is represented by drawing the unshaded left half of the register file using dashed lines in the ID stage, when it is not being written, and the unshaded right half in dashed lines in the WB stage, when it is not being read. As before, we assume the register file is written in the first half of the clock cycle and the register file is read during the second half.



Figure 4.35 shows the pipelined datapath with the pipeline registers highlighted.<br />

All instructions advance during each clock cycle from one pipeline register<br />

to the next. The registers are named for the two stages separated by that register.<br />

For example, the pipeline register between the IF <strong>and</strong> ID stages is called IF/ID.<br />

Notice that there is no pipeline register at the end of the write-back stage. All<br />

instructions must update some state in the processor—the register file, memory, or<br />

the PC—so a separate pipeline register is redundant to the state that is updated. For<br />

example, a load instruction will place its result in 1 of the 32 registers, <strong>and</strong> any later<br />

instruction that needs that data will simply read the appropriate register.<br />
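As a rough illustration of this bookkeeping (my own sketch, not part of the text), each pipeline register can be modeled as a slot whose contents are copied one position to the right on every clock edge:

    # Each pipeline register is modeled as a slot; on every clock edge the
    # contents advance one slot to the right.  Updating from the back of the
    # pipeline forward keeps each slot's old value from being overwritten
    # before it has been passed along.
    def clock(regs, fetched):
        regs["MEM/WB"] = regs["EX/MEM"]
        regs["EX/MEM"] = regs["ID/EX"]
        regs["ID/EX"] = regs["IF/ID"]
        regs["IF/ID"] = fetched

    regs = {"IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None}
    for instr in ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)", None, None]:
        clock(regs, instr)
        print(regs)   # each lw occupies a different register every cycle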

Of course, every instruction updates the PC, whether by incrementing it or by<br />

setting it to a branch destination address. The PC can be thought of as a pipeline<br />

register: one that feeds the IF stage of the pipeline. Unlike the shaded pipeline<br />

registers in Figure 4.35, however, the PC is part of the visible architectural state;<br />

its contents must be saved when an exception occurs, while the contents of the<br />

pipeline registers can be discarded. In the laundry analogy, you could think of the<br />

PC as corresponding to the basket that holds the load of dirty clothes before the<br />

wash step.<br />

To show how the pipelining works, throughout this chapter we show sequences<br />

of figures to demonstrate operation over time. These extra pages would seem to<br />

require much more time for you to underst<strong>and</strong>. Fear not; the sequences take much<br />

less time than it might appear, because you can compare them to see what changes occur in each clock cycle. Section 4.7 describes what happens when there are data hazards between pipelined instructions; ignore them for now.

[Figure 4.35: the datapath of Figure 4.33 with the four pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

FIGURE 4.35 The pipelined version of the datapath in Figure 4.33. The pipeline registers, in color, separate each pipeline stage. They are labeled by the stages that they separate; for example, the first is labeled IF/ID because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64 bits, respectively.

Figures 4.36 through 4.38, our first sequence, show the active portions of the<br />

datapath highlighted as a load instruction goes through the five stages of pipelined<br />

execution. We show a load first because it is active in all five stages. As in Figures<br />

4.28 through 4.30, we highlight the right half of registers or memory when they are<br />

being read <strong>and</strong> highlight the left half when they are being written.<br />

We show the instruction abbreviation lw with the name of the pipe stage that is<br />

active in each figure. The five stages are the following:<br />

1. Instruction fetch: The top portion of Figure 4.36 shows the instruction being<br />

read from memory using the address in the PC <strong>and</strong> then being placed in the<br />

IF/ID pipeline register. The PC address is incremented by 4 <strong>and</strong> then written<br />

back into the PC to be ready for the next clock cycle. This incremented<br />

address is also saved in the IF/ID pipeline register in case it is needed later<br />

for an instruction, such as beq. The computer cannot know which type of<br />

instruction is being fetched, so it must prepare for any instruction, passing<br />

potentially needed information down the pipeline.<br />

2. Instruction decode <strong>and</strong> register file read: The bottom portion of Figure 4.36<br />

shows the instruction portion of the IF/ID pipeline register supplying the<br />

16-bit immediate field, which is sign-extended to 32 bits, <strong>and</strong> the register<br />

numbers to read the two registers. All three values are stored in the ID/EX<br />

pipeline register, along with the incremented PC address. We again transfer<br />

everything that might be needed by any instruction during a later clock<br />

cycle.<br />

3. Execute or address calculation: Figure 4.37 shows that the load instruction<br />

reads the contents of register 1 <strong>and</strong> the sign-extended immediate from the<br />

ID/EX pipeline register <strong>and</strong> adds them using the ALU. That sum is placed in<br />

the EX/MEM pipeline register.<br />

4. Memory access: The top portion of Figure 4.38 shows the load instruction<br />

reading the data memory using the address from the EX/MEM pipeline<br />

register <strong>and</strong> loading the data into the MEM/WB pipeline register.<br />

5. Write-back: The bottom portion of Figure 4.38 shows the final step: reading<br />

the data from the MEM/WB pipeline register <strong>and</strong> writing it into the register<br />

file in the middle of the figure.<br />
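Summarizing the walk-through as a small sketch of my own (the field names below are informal labels, not signal names from the figures), a load leaves the following behind in each pipeline register:

    # What a load word leaves in each pipeline register, stage by stage.
    # (Informal labels only; Figure 4.41 later adds the write register number.)
    lw_pipeline_contents = {
        "IF/ID":  ["fetched instruction", "PC + 4"],
        "ID/EX":  ["read data 1", "read data 2", "sign-extended immediate", "PC + 4"],
        "EX/MEM": ["effective address (ALU result)"],
        "MEM/WB": ["data read from memory"],
    }
    for reg, contents in lw_pipeline_contents.items():
        print(f"{reg:7} holds {', '.join(contents)}")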

This walk-through of the load instruction shows that any information needed<br />

in a later pipe stage must be passed to that stage via a pipeline register. Walking<br />

through a store instruction shows the similarity of instruction execution, as well<br />

as passing the information for later stages. Here are the five pipe stages of the store<br />

instruction:


[Figure 4.36: the pipelined datapath drawn twice, labeled "lw, Instruction fetch" with the IF stage highlighted and "lw, Instruction decode" with the ID stage highlighted.]

FIGURE 4.36 IF and ID: First and second pipe stages of an instruction, with the active portions of the datapath in Figure 4.35 highlighted. The highlighting convention is the same as that used in Figure 4.28. As in Section 4.2, there is no confusion when reading and writing registers, because the contents change only on the clock edge. Although the load needs only the top register in stage 2, the processor doesn’t know what instruction is being decoded, so it sign-extends the 16-bit constant and reads both registers into the ID/EX pipeline register. We don’t need all three operands, but it simplifies control to keep all three.


[Figure 4.37: the pipelined datapath labeled "lw, Execution" with the EX-stage hardware highlighted.]

FIGURE 4.37 EX: The third pipe stage of a load instruction, highlighting the portions of the datapath in Figure 4.35 used in this pipe stage. The register is added to the sign-extended immediate, and the sum is placed in the EX/MEM pipeline register.

1. Instruction fetch: The instruction is read from memory using the address<br />

in the PC <strong>and</strong> then is placed in the IF/ID pipeline register. This stage occurs<br />

before the instruction is identified, so the top portion of Figure 4.36 works<br />

for store as well as load.<br />

2. Instruction decode <strong>and</strong> register file read: The instruction in the IF/ID pipeline<br />

register supplies the register numbers for reading two registers <strong>and</strong> extends<br />

the sign of the 16-bit immediate. These three 32-bit values are all stored<br />

in the ID/EX pipeline register. The bottom portion of Figure 4.36 for load<br />

instructions also shows the operations of the second stage for stores. These<br />

first two stages are executed by all instructions, since it is too early to know<br />

the type of the instruction.<br />

3. Execute <strong>and</strong> address calculation: Figure 4.39 shows the third step; the<br />

effective address is placed in the EX/MEM pipeline register.<br />

4. Memory access: The top portion of Figure 4.40 shows the data being written<br />

to memory. Note that the register containing the data to be stored was read in<br />

an earlier stage <strong>and</strong> stored in ID/EX. The only way to make the data available<br />

during the MEM stage is to place the data into the EX/MEM pipeline register<br />

in the EX stage, just as we stored the effective address into EX/MEM.


[Figure 4.38: the pipelined datapath drawn twice, labeled "lw, Memory" with the MEM stage highlighted and "lw, Write-back" with the WB stage highlighted.]

FIGURE 4.38 MEM and WB: The fourth and fifth pipe stages of a load instruction, highlighting the portions of the datapath in Figure 4.35 used in this pipe stage. Data memory is read using the address in the EX/MEM pipeline registers, and the data is placed in the MEM/WB pipeline register. Next, data is read from the MEM/WB pipeline register and written into the register file in the middle of the datapath. Note: there is a bug in this design that is repaired in Figure 4.41.


[Figure 4.39: the pipelined datapath labeled "sw, Execution" with the EX-stage hardware highlighted.]

FIGURE 4.39 EX: The third pipe stage of a store instruction. Unlike the third stage of the load instruction in Figure 4.37, the second register value is loaded into the EX/MEM pipeline register to be used in the next stage. Although it wouldn’t hurt to always write this second register into the EX/MEM pipeline register, we write the second register only on a store instruction to make the pipeline easier to understand.

5. Write-back: The bottom portion of Figure 4.40 shows the final step of the<br />

store. For this instruction, nothing happens in the write-back stage. Since<br />

every instruction behind the store is already in progress, we have no way<br />

to accelerate those instructions. Hence, an instruction passes through a<br />

stage even if there is nothing to do, because later instructions are already<br />

progressing at the maximum rate.<br />

The store instruction again illustrates that to pass something from an early pipe<br />

stage to a later pipe stage, the information must be placed in a pipeline register;<br />

otherwise, the information is lost when the next instruction enters that pipeline<br />

stage. For the store instruction we needed to pass one of the registers read in the<br />

ID stage to the MEM stage, where it is stored in memory. The data was first placed<br />

in the ID/EX pipeline register <strong>and</strong> then passed to the EX/MEM pipeline register.<br />

Load <strong>and</strong> store illustrate a second key point: each logical component of the<br />

datapath—such as instruction memory, register read ports, ALU, data memory,<br />

<strong>and</strong> register write port—can be used only within a single pipeline stage. Otherwise,<br />

we would have a structural hazard (see page 277). Hence these components, <strong>and</strong><br />

their control, can be associated with a single pipeline stage.<br />

[Figure 4.40: the pipelined datapath drawn twice, labeled "sw, Memory" and "sw, Write-back", with the active portions highlighted.]

FIGURE 4.40 MEM and WB: The fourth and fifth pipe stages of a store instruction. In the fourth stage, the data is written into data memory for the store. Note that the data comes from the EX/MEM pipeline register and that nothing is changed in the MEM/WB pipeline register. Once the data is written in memory, there is nothing left for the store instruction to do, so nothing happens in stage 5.

Now we can uncover a bug in the design of the load instruction. Did you see it? Which register is changed in the final stage of the load? More specifically, which

instruction supplies the write register number? The instruction in the IF/ID pipeline<br />

register supplies the write register number, yet this instruction occurs considerably<br />

after the load instruction!<br />

Hence, we need to preserve the destination register number in the load<br />

instruction. Just as store passed the register contents from the ID/EX to the EX/<br />

MEM pipeline registers for use in the MEM stage, load must pass the register<br />

number from the ID/EX through EX/MEM to the MEM/WB pipeline register for<br />

use in the WB stage. Another way to think about the passing of the register number<br />

is that to share the pipelined datapath, we need to preserve the instruction read<br />

during the IF stage, so each pipeline register contains a portion of the instruction<br />

needed for that stage <strong>and</strong> later stages.<br />
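As a toy illustration of that fix (mine, not the book's), the destination register number is simply copied forward one pipeline register per clock so that it reaches WB together with the loaded data:

    # lw $10, 20($1): the rt field (register 10) names the write register.
    write_reg = {"ID/EX": None, "EX/MEM": None, "MEM/WB": None}

    write_reg["ID/EX"] = 10                      # latched during ID
    write_reg["EX/MEM"] = write_reg["ID/EX"]     # copied at the end of EX
    write_reg["MEM/WB"] = write_reg["EX/MEM"]    # copied at the end of MEM
    # During WB the register file write port is driven by MEM/WB, so the loaded
    # data goes into register $10, not into whatever register the instruction
    # currently sitting in IF/ID happens to name.
    print(write_reg["MEM/WB"])                   # 10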

Figure 4.41 shows the correct version of the datapath, passing the write register<br />

number first to the ID/EX register, then to the EX/MEM register, <strong>and</strong> finally to the<br />

MEM/WB register. The register number is used during the WB stage to specify<br />

the register to be written. Figure 4.42 is a single drawing of the corrected datapath,<br />

highlighting the hardware used in all five stages of the load word instruction in<br />

Figures 4.36 through 4.38. See Section 4.8 for an explanation of how to make the<br />

branch instruction work as expected.<br />

[Figure 4.41: the pipelined datapath with a new path, shown in color, carrying the write register number through ID/EX, EX/MEM, and MEM/WB back to the register file's Write register input.]

FIGURE 4.41 The corrected pipelined datapath to handle the load instruction properly. The write register number now comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/WB pipeline register, adding five more bits to the last three pipeline registers. This new path is shown in color.

Graphically Representing Pipelines

Pipelining can be difficult to understand, since many instructions are simultaneously executing in a single datapath in every clock cycle. To aid understanding, there are two basic styles of pipeline figures: multiple-clock-cycle pipeline diagrams, such as Figure 4.34 on page 288, and single-clock-cycle pipeline diagrams, such as Figures 4.36 through 4.40. The multiple-clock-cycle diagrams are simpler but do not contain all the details. For example, consider the following five-instruction sequence:

lw $10, 20($1)<br />

sub $11, $2, $3<br />

add $12, $3, $4<br />

lw $13, 24($1)<br />

add $14, $5, $6<br />

Figure 4.43 shows the multiple-clock-cycle pipeline diagram for these<br />

instructions. Time advances from left to right across the page in these diagrams,<br />

<strong>and</strong> instructions advance from the top to the bottom of the page, similar to the<br />

laundry pipeline in Figure 4.25. A representation of the pipeline stages is placed<br />

in each portion along the instruction axis, occupying the proper clock cycles.<br />

These stylized datapaths represent the five stages of our pipeline graphically, but<br />

a rectangle naming each pipe stage works just as well. Figure 4.44 shows the more<br />

traditional version of the multiple-clock-cycle pipeline diagram. Note that Figure<br />

4.43 shows the physical resources used at each stage, while Figure 4.44 uses the<br />

name of each stage.<br />
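For readers who want to experiment, the following sketch (an illustration of the diagram style, not anything from the text) prints a Figure 4.44-style multiple-clock-cycle diagram for the five instructions above, assuming no stalls:

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]
    program = ["lw  $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4",
               "lw  $13, 24($1)", "add $14, $5, $6"]

    cycles = len(program) + len(STAGES) - 1            # 9 clock cycles in all
    print("instruction".ljust(18) + "".join(f"CC{c:<3}" for c in range(1, cycles + 1)))
    for i, instr in enumerate(program):
        row = [" " * 5] * cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:<5}"                 # instruction i is in stage s during cycle i + s + 1
        print(instr.ljust(18) + "".join(row))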

Single-clock-cycle pipeline diagrams show the state of the entire datapath during<br />

a single clock cycle, <strong>and</strong> usually all five instructions in the pipeline are identified by<br />

labels above their respective pipeline stages. We use this type of figure to show the<br />

details of what is happening within the pipeline during each clock cycle; typically, the drawings appear in groups to show pipeline operation over a sequence of clock cycles.

FIGURE 4.42 The portion of the datapath in Figure 4.41 that is used in all five stages of a load instruction.

[Figure 4.44: lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1); and add $14, $5, $6 charted across clock cycles CC 1 through CC 9, with each stage named Instruction fetch, Instruction decode, Execution, Data access, or Write-back.]

FIGURE 4.44 Traditional multiple-clock-cycle pipeline diagram of five instructions in Figure 4.43.

[Figure 4.45: the pipelined datapath during clock cycle 5, with add $14, $5, $6 in instruction fetch, lw $13, 24($1) in instruction decode, add $12, $3, $4 in execution, sub $11, $2, $3 in memory, and lw $10, 20($1) in write-back.]

FIGURE 4.45 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in Figures 4.43 and 4.44. As you can see, a single-clock-cycle figure is a vertical slice through a multiple-clock-cycle diagram.

1. Allowing jumps, branches, <strong>and</strong> ALU instructions to take fewer stages than<br />

the five required by the load instruction will increase pipeline performance<br />

under all circumstances.



2. Trying to allow some instructions to take fewer cycles does not help, since<br />

the throughput is determined by the clock cycle; the number of pipe stages<br />

per instruction affects latency, not throughput.<br />

3. You cannot make ALU instructions take fewer cycles because of the writeback<br />

of the result, but branches <strong>and</strong> jumps can take fewer cycles, so there is<br />

some opportunity for improvement.<br />

4. Instead of trying to make instructions take fewer cycles, we should explore<br />

making the pipeline longer, so that instructions take more cycles, but the<br />

cycles are shorter. This could improve performance.<br />

In the 6600 Computer, perhaps even more than in any previous computer, the control system is the difference.
James Thornton, Design of a Computer: The Control Data 6600, 1970

Pipelined Control<br />

Just as we added control to the single-cycle datapath in Section 4.3, we now add<br />

control to the pipelined datapath. We start with a simple design that views the<br />

problem through rose-colored glasses.<br />

The first step is to label the control lines on the existing datapath. Figure 4.46<br />

shows those lines. We borrow as much as we can from the control for the simple<br />

datapath in Figure 4.17. In particular, we use the same ALU control logic, branch<br />

logic, destination-register-number multiplexor, <strong>and</strong> control lines. These functions<br />

are defined in Figures 4.12, 4.16, <strong>and</strong> 4.18. We reproduce the key information in<br />

Figures 4.47 through 4.49 on a single page to make the following discussion easier<br />

to follow.<br />

As was the case for the single-cycle implementation, we assume that the PC is<br />

written on each clock cycle, so there is no separate write signal for the PC. By the<br />

same argument, there are no separate write signals for the pipeline registers (IF/<br />

ID, ID/EX, EX/MEM, <strong>and</strong> MEM/WB), since the pipeline registers are also written<br />

during each clock cycle.<br />

To specify control for the pipeline, we need only set the control values during<br />

each pipeline stage. Because each control line is associated with a component active<br />

in only a single pipeline stage, we can divide the control lines into five groups<br />

according to the pipeline stage.<br />

1. Instruction fetch: The control signals to read instruction memory <strong>and</strong> to<br />

write the PC are always asserted, so there is nothing special to control in this<br />

pipeline stage.<br />

2. Instruction decode/register file read: As in the previous stage, the same thing<br />

happens at every clock cycle, so there are no optional control lines to set.<br />

3. Execution/address calculation: The signals to be set are RegDst, ALUOp,<br />

<strong>and</strong> ALUSrc (see Figures 4.47 <strong>and</strong> 4.48). The signals select the Result register,<br />

the ALU operation, <strong>and</strong> either Read data 2 or a sign-extended immediate<br />

for the ALU.


[Figure 4.46: the pipelined datapath of Figure 4.41 with the control lines labeled: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, and MemtoReg, plus the ALU control block fed by instruction bits 15–0.]

FIGURE 4.46 The pipelined datapath of Figure 4.41 with the control signals identified. This datapath borrows the control logic for PC source, register destination number, and ALU control from Section 4.4. Note that we now need the 6-bit funct field (function code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register. Recall that these 6 bits are also the 6 least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can supply them from the immediate field since sign extension leaves these bits unchanged.

Instruction opcode | ALUOp | Instruction operation | Function code | Desired ALU action | ALU control input
LW                 | 00    | load word             | XXXXXX        | add                | 0010
SW                 | 00    | store word            | XXXXXX        | add                | 0010
Branch equal       | 01    | branch equal          | XXXXXX        | subtract           | 0110
R-type             | 10    | add                   | 100000        | add                | 0010
R-type             | 10    | subtract              | 100010        | subtract           | 0110
R-type             | 10    | AND                   | 100100        | AND                | 0000
R-type             | 10    | OR                    | 100101        | OR                 | 0001
R-type             | 10    | set on less than      | 101010        | set on less than   | 0111

FIGURE 4.47 A copy of Figure 4.12. This figure shows how the ALU control bits are set depending on the ALUOp control bits and the different function codes for the R-type instruction.


Signal name | Effect when deasserted (0) | Effect when asserted (1)
RegDst   | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.
ALUSrc   | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc    | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.
MemRead  | None. | Data memory contents designated by the address input are put on the Read data output.
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.

FIGURE 4.48 A copy of Figure 4.16. The function of each of seven control signals is defined. The ALU control lines (ALUOp) are defined in the second column of Figure 4.47. When a 1-bit control to a 2-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Note that PCSrc is controlled by an AND gate in Figure 4.46. If the Branch signal and the ALU Zero signal are both set, then PCSrc is 1; otherwise, it is 0. Control sets the Branch signal only during a beq instruction; otherwise, PCSrc is set to 0.

Instruction | Execution/address calculation stage control lines | Memory access stage control lines | Write-back stage control lines
            | RegDst  ALUOp1  ALUOp0  ALUSrc                     | Branch  MemRead  MemWrite          | RegWrite  MemtoReg
R-format    |   1       1       0       0                        |   0       0        0               |    1         0
lw          |   0       0       0       1                        |   0       1        0               |    1         1
sw          |   X       0       0       1                        |   0       0        1               |    0         X
beq         |   X       0       1       0                        |   1       0        0               |    0         X

FIGURE 4.49 The values of the control lines are the same as in Figure 4.18, but they have been shuffled into three groups corresponding to the last three pipeline stages.

4. Memory access: The control lines set in this stage are Branch, MemRead, <strong>and</strong><br />

MemWrite. The branch equal, load, <strong>and</strong> store instructions set these signals,<br />

respectively. Recall that PCSrc in Figure 4.48 selects the next sequential<br />

address unless control asserts Branch <strong>and</strong> the ALU result was 0.<br />

5. Write-back: The two control lines are MemtoReg, which decides between sending the ALU result or the memory value to the register file, and RegWrite, which writes the chosen value.

Since pipelining the datapath leaves the meaning of the control lines unchanged,<br />

we can use the same control values. Figure 4.49 has the same values as in Section<br />

4.4, but now the nine control lines are grouped by pipeline stage.
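One informal way to picture this grouping (a sketch of my own, not code from the book) is as a per-opcode control word whose three fields ride along in the pipeline registers, each stage consuming its own field; Figure 4.51 below shows the hardware version of the same idea.

    # Control values from Figure 4.49, grouped by the stage that uses them
    # ('X' marks a don't-care).
    CONTROL = {
        "R-format": {"EX": {"RegDst": 1,  "ALUOp": "10", "ALUSrc": 0},
                     "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                     "WB": {"RegWrite": 1, "MemtoReg": 0}},
        "lw":  {"EX": {"RegDst": 0,  "ALUOp": "00", "ALUSrc": 1},
                "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                "WB": {"RegWrite": 1, "MemtoReg": 1}},
        "sw":  {"EX": {"RegDst": "X", "ALUOp": "00", "ALUSrc": 1},
                "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                "WB": {"RegWrite": 0, "MemtoReg": "X"}},
        "beq": {"EX": {"RegDst": "X", "ALUOp": "01", "ALUSrc": 0},
                "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                "WB": {"RegWrite": 0, "MemtoReg": "X"}},
    }

    # The whole word is created during ID; each later stage consumes its own
    # group and hands the remainder to the next pipeline register.
    id_ex = dict(CONTROL["lw"])                        # EX, MEM, and WB groups
    ex_mem = {k: id_ex[k] for k in ("MEM", "WB")}      # EX group used up in EX
    mem_wb = {k: ex_mem[k] for k in ("WB",)}           # MEM group used up in MEM
    print(mem_wb)                                      # {'WB': {'RegWrite': 1, 'MemtoReg': 1}}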


[Figure 4.51: the datapath of Figure 4.46 with a Control block in the ID stage whose EX, M, and WB groups travel through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

FIGURE 4.51 The pipelined datapath of Figure 4.46, with the control signals connected to the control portions of the pipeline registers. The control values for the last three stages are created during the instruction decode stage and then placed in the ID/EX pipeline register. The control lines for each pipe stage are used, and remaining control lines are then passed to the next pipeline stage.

4.7 Data Hazards: Forwarding versus Stalling

Let’s look at a sequence with many dependences, shown in color:<br />

sub $2, $1,$3 # Register $2 written by sub<br />

<strong>and</strong> $12,$2,$5 # 1st oper<strong>and</strong>($2) depends on sub<br />

or $13,$6,$2 # 2nd oper<strong>and</strong>($2) depends on sub<br />

add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub<br />

sw $15,100($2) # Base ($2) depends on sub<br />

The last four instructions are all dependent on the result in register $2 of the<br />

first instruction. If register $2 had the value 10 before the subtract instruction <strong>and</strong><br />

−20 afterwards, the programmer intends that −20 will be used in the following<br />

instructions that refer to register $2.



How would this sequence perform with our pipeline? Figure 4.52 illustrates the<br />

execution of these instructions using a multiple-clock-cycle pipeline representation.<br />

To demonstrate the execution of this instruction sequence in our current pipeline,<br />

the top of Figure 4.52 shows the value of register $2, which changes during the<br />

middle of clock cycle 5, when the sub instruction writes its result.<br />

The last potential hazard can be resolved by the design of the register file<br />

hardware: What happens when a register is read <strong>and</strong> written in the same clock<br />

cycle? We assume that the write is in the first half of the clock cycle <strong>and</strong> the read<br />

is in the second half, so the read delivers what is written. As is the case for many<br />

implementations of register files, we have no data hazard in this case.<br />
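A minimal way to picture that register-file convention (my own sketch, not from the text) is to treat each clock cycle as two half-cycles, doing the write in the first half so that a read in the second half of the same cycle returns the new value:

    def register_file_cycle(regs, write=None, read=None):
        # First half of the clock cycle: perform any pending write.
        if write is not None:
            number, value = write
            regs[number] = value
        # Second half: a read returns the value written earlier in this cycle.
        if read is not None:
            return regs[read]

    regs = {2: 10}
    print(register_file_cycle(regs, write=(2, -20), read=2))   # -20, not 10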

[Figure 4.52: the value of register $2 shown as 10 through clock cycle 4, 10/−20 in cycle 5, and −20 afterward, while sub $2, $1, $3 and the four dependent instructions flow through IM, Reg, DM, and Reg.]

FIGURE 4.52 Pipelined dependences in a five-instruction sequence using simplified datapaths to show the dependences. All the dependent actions are shown in color, and “CC 1” at the top of the figure means clock cycle 1. The first instruction writes into $2, and all the following instructions read $2. This register is written in clock cycle 5, so the proper value is unavailable before clock cycle 5. (A read of a register during a clock cycle returns the value written at the end of the first half of the cycle, when such a write occurs.) The colored lines from the top datapath to the lower ones show the dependences. Those that must go backward in time are pipeline data hazards.

Figure 4.52 shows that the values read for register $2 would not be the result of the sub instruction unless the read occurred during clock cycle 5 or later. Thus, the instructions that would get the correct value of −20 are add and sw; the AND and OR instructions would get the incorrect value 10! Using this style of drawing, such problems become apparent when a dependence line goes backward in time.

As mentioned in Section 4.5, the desired result is available at the end of the<br />

EX stage or clock cycle 3. When is the data actually needed by the AND <strong>and</strong> OR<br />

instructions? At the beginning of the EX stage, or clock cycles 4 <strong>and</strong> 5, respectively.<br />

Thus, we can execute this segment without stalls if we simply forward the data as<br />

soon as it is available to any units that need it before it is available to read from the<br />

register file.<br />

How does forwarding work? For simplicity in the rest of this section, we consider<br />

only the challenge of forwarding to an operation in the EX stage, which may be<br />

either an ALU operation or an effective address calculation. This means that when<br />

an instruction tries to use a register in its EX stage that an earlier instruction<br />

intends to write in its WB stage, we actually need the values as inputs to the ALU.<br />

A notation that names the fields of the pipeline registers allows for a more<br />

precise notation of dependences. For example, “ID/EX.RegisterRs” refers to the<br />

number of one register whose value is found in the pipeline register ID/EX; that is,<br />

the one from the first read port of the register file. The first part of the name, to the<br />

left of the period, is the name of the pipeline register; the second part is the name of<br />

the field in that register. Using this notation, the two pairs of hazard conditions are<br />

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs<br />

1b. EX/MEM.RegisterRd = ID/EX.RegisterRt<br />

2a. MEM/WB.RegisterRd = ID/EX.RegisterRs<br />

2b. MEM/WB.RegisterRd = ID/EX.RegisterRt<br />

The first hazard in the sequence on page 304 is on register $2, between the<br />

result of sub $2,$1,$3 <strong>and</strong> the first read oper<strong>and</strong> of <strong>and</strong> $12,$2,$5. This<br />

hazard can be detected when the <strong>and</strong> instruction is in the EX stage <strong>and</strong> the prior<br />

instruction is in the MEM stage, so this is hazard 1a:<br />

EX/MEM.RegisterRd = ID/EX.RegisterRs = $2<br />

EXAMPLE<br />

Dependence Detection<br />

Classify the dependences in this sequence from page 304:<br />

sub $2, $1, $3 # Register $2 set by sub<br />

<strong>and</strong> $12, $2, $5 # 1st oper<strong>and</strong>($2) set by sub<br />

or $13, $6, $2 # 2nd oper<strong>and</strong>($2) set by sub<br />

add $14, $2, $2 # 1st($2) & 2nd($2) set by sub<br />

sw $15, 100($2) # Index($2) set by sub


ANSWER

As mentioned above, the sub-and is a type 1a hazard. The remaining hazards are as follows:

■ The sub-or is a type 2b hazard:

MEM/WB.RegisterRd = ID/EX.RegisterRt = $2

■ The two dependences on sub-add are not hazards because the register<br />

file supplies the proper data during the ID stage of add.<br />

■ There is no data hazard between sub <strong>and</strong> sw because sw reads $2 the<br />

clock cycle after sub writes $2.<br />
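As an illustration only (a small sketch under the simplifying assumptions of this section, not code from the text), the four conditions 1a through 2b can be checked mechanically once we know which register numbers sit in EX/MEM, MEM/WB, and ID/EX:

    # Register numbers in the pipeline at the moment 'and $12, $2, $5' is in EX
    # and 'sub $2, $1, $3' is in MEM (nothing of interest is in WB).
    ex_mem_rd = 2                 # destination of sub
    mem_wb_rd = None              # destination of whatever instruction is in WB
    id_ex_rs, id_ex_rt = 2, 5     # source registers of the and instruction

    def hazards(ex_mem_rd, mem_wb_rd, id_ex_rs, id_ex_rt):
        found = []
        if ex_mem_rd == id_ex_rs: found.append("1a")   # EX/MEM.RegisterRd = ID/EX.RegisterRs
        if ex_mem_rd == id_ex_rt: found.append("1b")   # EX/MEM.RegisterRd = ID/EX.RegisterRt
        if mem_wb_rd == id_ex_rs: found.append("2a")   # MEM/WB.RegisterRd = ID/EX.RegisterRs
        if mem_wb_rd == id_ex_rt: found.append("2b")   # MEM/WB.RegisterRd = ID/EX.RegisterRt
        return found

    print(hazards(ex_mem_rd, mem_wb_rd, id_ex_rs, id_ex_rt))   # ['1a']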

Because some instructions do not write registers, this policy is inaccurate;<br />

sometimes it would forward when it shouldn’t. One solution is simply to check<br />

to see if the RegWrite signal will be active: examining the WB control field of the<br />

pipeline register during the EX <strong>and</strong> MEM stages determines whether RegWrite<br />

is asserted. Recall that MIPS requires that every use of $0 as an oper<strong>and</strong> must<br />

yield an oper<strong>and</strong> value of 0. In the event that an instruction in the pipeline has<br />

$0 as its destination (for example, sll $0, $1, 2), we want to avoid forwarding<br />

its possibly nonzero result value. Not forwarding results destined for $0 frees the<br />

assembly programmer <strong>and</strong> the compiler of any requirement to avoid using $0 as<br />

a destination. The conditions above thus work properly as long as we add EX/MEM.RegisterRd ≠ 0 to the first hazard condition and MEM/WB.RegisterRd ≠ 0 to the second.

Now that we can detect hazards, half of the problem is resolved—but we must<br />

still forward the proper data.<br />

Figure 4.53 shows the dependences between the pipeline registers <strong>and</strong> the inputs<br />

to the ALU for the same code sequence as in Figure 4.52. The change is that the<br />

dependence begins from a pipeline register, rather than waiting for the WB stage to<br />

write the register file. Thus, the required data exists in time for later instructions,<br />

with the pipeline registers holding the data to be forwarded.<br />

If we can take the inputs to the ALU from any pipeline register rather than just<br />

ID/EX, then we can forward the proper data. By adding multiplexors to the input<br />

of the ALU, <strong>and</strong> with the proper controls, we can run the pipeline at full speed in<br />

the presence of these data dependences.<br />

For now, we will assume the only instructions we need to forward are the four<br />

R-format instructions: add, sub, AND, <strong>and</strong> OR. Figure 4.54 shows a close-up of<br />

the ALU <strong>and</strong> pipeline register before <strong>and</strong> after adding forwarding. Figure 4.55<br />

shows the values of the control lines for the ALU multiplexors that select either the<br />

register file values or one of the forwarded values.<br />

This forwarding control will be in the EX stage, because the ALU forwarding<br />

multiplexors are found in that stage. Thus, we must pass the oper<strong>and</strong> register<br />

numbers from the ID stage via the ID/EX pipeline register to determine whether<br />

to forward values. We already have the rt field (bits 20–16). Before forwarding, the<br />

ID/EX register had no need to include space to hold the rs field. Hence, rs (bits<br />

25–21) is added to ID/EX.
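For concreteness, here is a hypothetical helper of my own (not part of the design in the figures) showing where those bit positions come from when the register-number fields are pulled out of a 32-bit MIPS instruction word:

    # Hypothetical helper: pulling the register-number fields out of a 32-bit
    # MIPS instruction word so they can be latched into ID/EX.
    def fields(instruction):
        return {
            "rs":  (instruction >> 21) & 0x1F,   # bits 25-21
            "rt":  (instruction >> 16) & 0x1F,   # bits 20-16
            "rd":  (instruction >> 11) & 0x1F,   # bits 15-11 (R-format)
            "imm": instruction & 0xFFFF,         # bits 15-0  (I-format)
        }

    print(fields(0x00231022))   # sub $2, $1, $3 -> rs=1, rt=3, rd=2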


[Figure 4.54: part (a), no forwarding, shows ID/EX, EX/MEM, and MEM/WB feeding the ALU and data memory directly; part (b), with forwarding, adds ForwardA and ForwardB multiplexors on the ALU inputs and a forwarding unit driven by Rs, Rt, Rd, EX/MEM.RegisterRd, and MEM/WB.RegisterRd.]

FIGURE 4.54 On the top are the ALU and pipeline registers before adding forwarding. On the bottom, the multiplexors have been expanded to add the forwarding paths, and we show the forwarding unit. The new hardware is shown in color. This figure is a stylized drawing, however, leaving out details from the full datapath such as the sign extension hardware. Note that the ID/EX.RegisterRt field is shown twice, once to connect to the Mux and once to the forwarding unit, but it is a single signal. As in the earlier discussion, this ignores forwarding of a store value to a store instruction. Also note that this mechanism works for slt instructions as well.


Mux control   | Source | Explanation
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result.

FIGURE 4.55 The control values for the forwarding multiplexors in Figure 4.54. The signed immediate that is another input to the ALU is described in the Elaboration at the end of this section.

Note that the EX/MEM.RegisterRd field is the register destination for either an ALU instruction (which comes from the Rd field of the instruction) or a load (which comes from the Rt field).

1. EX hazard:

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

This case forwards the result from the previous instruction to either input of the ALU. If the previous instruction is going to write to the register file, and the write register number matches the read register number of ALU inputs A or B, provided it is not register 0, then steer the multiplexor to pick the value instead from the pipeline register EX/MEM.

2. MEM hazard:

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

As mentioned above, there is no hazard in the WB stage, because we assume that the register file supplies the correct result if the instruction in the ID stage reads the same register written by the instruction in the WB stage. Such a register file performs another form of forwarding, but it occurs within the register file.

One complication is potential data hazards between the result of the instruction in the WB stage, the result of the instruction in the MEM stage, and the source operand of the instruction in the ALU stage. For example, when summing a vector of numbers in a single register, a sequence of instructions will all read and write to the same register:

   add $1, $1, $2
   add $1, $1, $3
   add $1, $1, $4
   . . .



In this case, the result is forwarded from the MEM stage because the result in the MEM stage is the more recent result. Thus, the control for the MEM hazard would be (the additions are the not(...) terms, which give the EX hazard priority):

   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd ≠ 0)
       and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
       and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd ≠ 0)
       and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
       and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
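The following C sketch models the complete ForwardA/ForwardB decision just described, with the EX hazard taking priority over the MEM hazard. It is a software model under assumed structure and names (wb_info, forward_src), not the book's hardware.

    #include <stdbool.h>
    #include <stdint.h>

    enum forward_src { FROM_ID_EX = 0, FROM_MEM_WB = 1, FROM_EX_MEM = 2 };  /* 00, 01, 10 */

    struct wb_info { bool reg_write; uint8_t rd; };   /* write-back info held in EX/MEM or MEM/WB */

    /* Returns the multiplexor setting for one ALU operand whose source register number is src. */
    static enum forward_src forward(uint8_t src, struct wb_info ex_mem, struct wb_info mem_wb)
    {
        /* EX hazard: the instruction one ahead writes src, and src is not register 0. */
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == src)
            return FROM_EX_MEM;                 /* most recent result wins */
        /* MEM hazard: taken only when the EX/MEM stage is not already forwarding src. */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == src)
            return FROM_MEM_WB;
        return FROM_ID_EX;                      /* no hazard: use the register-file value */
    }
    /* ForwardA = forward(ID/EX.RegisterRs, ...); ForwardB = forward(ID/EX.RegisterRt, ...). */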

Figure 4.56 shows the hardware necessary to support forwarding for operations that use results during the EX stage.

[Figure 4.56 drawing: the pipelined datapath with the forwarding multiplexors at the ALU inputs and the forwarding unit comparing EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs and Rt fields carried in ID/EX]

FIGURE 4.56 The datapath modified to resolve hazards via forwarding. Compared with the datapath in Figure 4.51, the additions are the multiplexors to the inputs to the ALU. This figure is a more stylized drawing, however, leaving out details from the full datapath, such as the branch hardware and the sign extension hardware.



nop: An instruction that does no operation to change state.

The hazard detection unit operates during the ID stage so that it can insert the stall between the load and its use. Checking for load instructions, the control for the hazard detection unit is this single condition:

   if (ID/EX.MemRead and
       ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
        (ID/EX.RegisterRt = IF/ID.RegisterRt)))
          stall the pipeline

The first line tests to see if the instruction is a load: the only instruction that reads data memory is a load. The next two lines check to see if the destination register field of the load in the EX stage matches either source register of the instruction in the ID stage. If the condition holds, the instruction stalls one clock cycle. After this 1-cycle stall, the forwarding logic can handle the dependence and execution proceeds. (If there were no forwarding, then the instructions in Figure 4.58 would need another stall cycle.)
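A small C model of that condition follows; the struct and field names are assumptions made for illustration. It returns true exactly when a load in EX is about to be used by the instruction in ID.

    #include <stdbool.h>
    #include <stdint.h>

    struct id_ex_stage { bool mem_read; uint8_t rt; };   /* load indicator and its destination      */
    struct if_id_stage { uint8_t rs, rt; };               /* source fields of the instruction in ID */

    static bool must_stall(struct id_ex_stage id_ex, struct if_id_stage if_id)
    {
        return id_ex.mem_read &&                    /* only loads read data memory      */
               (id_ex.rt == if_id.rs ||             /* load result feeds a source ...   */
                id_ex.rt == if_id.rt);              /* ... of the instruction in ID     */
    }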

If the instruction in the ID stage is stalled, then the instruction in the IF stage must also be stalled; otherwise, we would lose the fetched instruction. Preventing these two instructions from making progress is accomplished simply by preventing the PC register and the IF/ID pipeline register from changing. Provided these registers are preserved, the instruction in the IF stage will continue to be read using the same PC, and the registers in the ID stage will continue to be read using the same instruction fields in the IF/ID pipeline register. Returning to our favorite analogy, it's as if you restart the washer with the same clothes and let the dryer continue tumbling empty. Of course, like the dryer, the back half of the pipeline starting with the EX stage must be doing something; what it is doing is executing instructions that have no effect: nops.

How can we insert these nops, which act like bubbles, into the pipeline? In Figure 4.49, we see that deasserting all nine control signals (setting them to 0) in the EX, MEM, and WB stages will create a "do nothing" or nop instruction. By identifying the hazard in the ID stage, we can insert a bubble into the pipeline by changing the EX, MEM, and WB control fields of the ID/EX pipeline register to 0. These benign control values are percolated forward at each clock cycle with the proper effect: no registers or memories are written if the control values are all 0.
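In software terms, stalling amounts to freezing the front of the pipeline and zeroing the control fields passed into ID/EX, as in this hedged sketch; the struct layout and names are invented for illustration only.

    #include <stdbool.h>
    #include <string.h>

    struct ctrl { unsigned ex_bits, mem_bits, wb_bits; };  /* the nine control signals, grouped by stage */
    struct machine {
        bool pc_write, if_id_write;   /* freeze signals for the PC and IF/ID registers */
        struct ctrl id_ex_ctrl;       /* control fields entering the ID/EX register    */
    };

    static void insert_bubble(struct machine *m)
    {
        m->pc_write    = false;       /* refetch the same instruction next cycle       */
        m->if_id_write = false;       /* keep the stalled instruction in IF/ID         */
        memset(&m->id_ex_ctrl, 0, sizeof m->id_ex_ctrl);   /* a nop flows down the pipe */
    }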

Figure 4.59 shows what really happens in the hardware: the pipeline execution slot associated with the AND instruction is turned into a nop and all instructions beginning with the AND instruction are delayed one cycle. Like an air bubble in a water pipe, a stall bubble delays everything behind it and proceeds down the instruction pipe one stage each cycle until it exits at the end. In this example, the hazard forces the AND and OR instructions to repeat in clock cycle 4 what they did in clock cycle 3: AND reads registers and decodes, and OR is refetched from instruction memory. Such repeated work is what a stall looks like, but its effect is to stretch the time of the AND and OR instructions and delay the fetch of the add instruction.

Figure 4.60 highlights the pipeline connections for both the hazard detection unit and the forwarding unit. As before, the forwarding unit controls the ALU multiplexors, while the hazard detection unit controls the writing of the PC and IF/ID registers plus the multiplexor that chooses between the real control values and all 0s.



[Figure 4.60 drawing: the pipelined datapath with the hazard detection unit (driven by ID/EX.MemRead and controlling PCWrite, IF/IDWrite, and the control-zeroing multiplexor) and the forwarding unit fed by the Rs, Rt, and Rd fields]

FIGURE 4.60 Pipelined control overview, showing the two multiplexors for forwarding, the hazard detection unit, and the forwarding unit. Although the ID and EX stages have been simplified (the sign-extended immediate and branch logic are missing), this drawing gives the essence of the forwarding hardware requirements.

Elaboration: Regarding the remark earlier about setting control lines to 0 to avoid writing registers or memory: only the signals RegWrite and MemWrite need be 0, while the other control signals can be don't cares.

There are a thousand hacking at the branches of evil to one who is striking at the root.
Henry David Thoreau, Walden, 1854

4.8 Control Hazards

Thus far, we have limited our concern to hazards involving arithmetic operations and data transfers. However, as we saw in Section 4.5, there are also pipeline hazards involving branches. Figure 4.61 shows a sequence of instructions and indicates when the branch would occur in this pipeline. An instruction must be fetched at every clock cycle to sustain the pipeline, yet in our design the decision about whether to branch doesn't occur until the MEM pipeline stage. As mentioned in Section 4.5,



Forwarding for the operands of branches was formerly handled by the ALU forwarding logic, but the introduction of the equality test unit in ID will require new forwarding logic. Note that the bypassed source operands of a branch can come from either the ALU/MEM or MEM/WB pipeline latches.

2. Because the values in a branch comparison are needed during ID but may be produced later in time, it is possible that a data hazard can occur and a stall will be needed. For example, if an ALU instruction immediately preceding a branch produces one of the operands for the comparison in the branch, a stall will be required, since the EX stage for the ALU instruction will occur after the ID cycle of the branch. By extension, if a load is immediately followed by a conditional branch that is on the load result, two stall cycles will be needed, as the result from the load appears at the end of the MEM cycle but is needed at the beginning of ID for the branch.

Despite these difficulties, moving the branch execution to the ID stage is an improvement, because it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched. The exercises explore the details of implementing the forwarding path and detecting the hazard.

To flush instructions in the IF stage, we add a control line, called IF.Flush, that zeros the instruction field of the IF/ID pipeline register. Clearing the register transforms the fetched instruction into a nop, an instruction that has no action and changes no state.

Pipelined Branch

EXAMPLE
Show what happens when the branch is taken in this instruction sequence, assuming the pipeline is optimized for branches that are not taken and that we moved the branch execution to the ID stage:

   36  sub $10, $4, $8
   40  beq $1, $3, 7     # PC-relative branch to 40 + 4 + 7 * 4 = 72
   44  and $12, $2, $5
   48  or  $13, $2, $6
   52  add $14, $4, $2
   56  slt $15, $6, $7
   . . .
   72  lw  $4, 50($7)

ANSWER
Figure 4.62 shows what happens when a branch is taken. Unlike Figure 4.61, there is only one pipeline bubble on a taken branch.



The limitations on delayed branch scheduling arise from (1) the restrictions on the instructions that are scheduled into the delay slots and (2) our ability to predict at compile time whether a branch is likely to be taken or not.

Delayed branching was a simple and effective solution for a five-stage pipeline issuing one instruction each clock cycle. As processors go to both longer pipelines and issuing multiple instructions per clock cycle (see Section 4.10), the branch delay becomes longer, and a single delay slot is insufficient. Hence, delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches. Simultaneously, the growth in available transistors per chip due to Moore's Law has made dynamic prediction relatively cheaper.

a. From before:
      add $s1, $s2, $s3
      if $s2 = 0 then
          (delay slot)
   becomes:
      if $s2 = 0 then
          add $s1, $s2, $s3

b. From target:
      sub $t4, $t5, $t6
      . . .
      add $s1, $s2, $s3
      if $s1 = 0 then
          (delay slot)
   becomes:
      add $s1, $s2, $s3
      if $s1 = 0 then
          sub $t4, $t5, $t6

c. From fall-through:
      add $s1, $s2, $s3
      if $s1 = 0 then
          (delay slot)
      sub $t4, $t5, $t6
   becomes:
      add $s1, $s2, $s3
      if $s1 = 0 then
          sub $t4, $t5, $t6

FIGURE 4.64 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of $s1 in the branch condition prevents the add instruction (whose destination is $s1) from being moved into the branch delay slot. In (b) the branch delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the sub instruction when the branch goes in the unexpected direction. By "OK" we mean that the work is wasted, but the program will still execute correctly. This is the case, for example, if $t4 were an unused temporary register when the branch goes in the unexpected direction.



branch target buffer: A structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.

correlating predictor: A branch predictor that combines local behavior of a particular branch and global information about the behavior of some recent number of executed branches.

tournament branch predictor: A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.

Elaboration: A branch predictor tells us whether or not a branch is taken, but still requires the calculation of the branch target. In the five-stage pipeline, this calculation takes one cycle, meaning that taken branches will have a 1-cycle penalty. Delayed branches are one approach to eliminate that penalty. Another approach is to use a cache to hold the destination program counter or destination instruction, using a branch target buffer.

The 2-bit dynamic prediction scheme uses only information about a particular branch. Researchers noticed that using information about both a local branch and the global behavior of recently executed branches together yields greater prediction accuracy for the same number of prediction bits. Such predictors are called correlating predictors. A typical correlating predictor might have two 2-bit predictors for each branch, with the choice between predictors made based on whether the last executed branch was taken or not taken. Thus, the global branch behavior can be thought of as adding additional index bits for the prediction lookup.

A more recent innovation in branch prediction is the use of tournament predictors. A tournament predictor uses multiple predictors, tracking, for each branch, which predictor yields the best results. A typical tournament predictor might contain two predictions for each branch index: one based on local information and one based on global branch behavior. A selector would choose which predictor to use for any given prediction. The selector can operate similarly to a 1- or 2-bit predictor, favoring whichever of the two predictors has been more accurate. Some recent microprocessors use such elaborate predictors.
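As a point of reference, here is a C model of the 2-bit saturating counters from which such predictors are built; the table size and the PC-based index are assumptions of this sketch, not a specification of any particular processor.

    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_ENTRIES 1024
    static uint8_t pred_table[PRED_ENTRIES];   /* 0-1: predict not taken, 2-3: predict taken */

    static unsigned pred_index(uint32_t pc) { return (pc >> 2) % PRED_ENTRIES; }

    static bool predict_taken(uint32_t pc)
    {
        return pred_table[pred_index(pc)] >= 2;
    }

    static void predictor_update(uint32_t pc, bool taken)   /* called with the actual outcome */
    {
        uint8_t *c = &pred_table[pred_index(pc)];
        if (taken  && *c < 3) (*c)++;          /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;          /* saturate at strongly not taken */
    }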

Elaboration: One way to reduce the number of conditional branches is to add conditional move instructions. Instead of changing the PC with a conditional branch, the instruction conditionally changes the destination register of the move. If the condition fails, the move acts as a nop. For example, one version of the MIPS instruction set architecture has two new instructions called movn (move if not zero) and movz (move if zero). Thus, movn $8, $11, $4 copies the contents of register 11 into register 8, provided that the value in register 4 is nonzero; otherwise, it does nothing.

The ARMv7 instruction set has a condition field in most instructions. Hence, ARM programs could have fewer conditional branches than MIPS programs.
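For example, a compiler may turn the C selection below into a conditional move such as movn/movz (or an ARM conditionally executed instruction) instead of a conditional branch; whether it actually does so depends on the compiler and target, so treat this only as an illustration of the idea.

    int abs_value(int x)
    {
        int r = x;
        if (x < 0)        /* candidate for a conditional move: no change of control flow needed */
            r = -x;
        return r;
    }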

Pipeline Summary

We started in the laundry room, showing principles of pipelining in an everyday setting. Using that analogy as a guide, we explained instruction pipelining step-by-step, starting with the single-cycle datapath and then adding pipeline registers, forwarding paths, data hazard detection, branch prediction, and flushing instructions on exceptions. Figure 4.65 shows the final evolved datapath and control. We now are ready for yet another control hazard: the sticky issue of exceptions.

[Figure 4.65 drawing: the full pipelined datapath with the hazard detection unit, forwarding unit, branch equality test in ID, and IF.Flush]

FIGURE 4.65 The final datapath and control for this chapter. Note that this is a stylized figure rather than a detailed datapath, so it's missing the ALUsrc Mux from Figure 4.57 and the multiplexor controls from Figure 4.51.

Check Yourself
Consider three branch prediction schemes: predict not taken, predict taken, and dynamic prediction. Assume that they all have zero penalty when they predict correctly and two cycles when they are wrong. Assume that the average prediction accuracy of the dynamic predictor is 90%. Which predictor is the best choice for the following branches?

1. A branch that is taken with 5% frequency
2. A branch that is taken with 95% frequency
3. A branch that is taken with 70% frequency

4.9 Exceptions

To make a computer with automatic program-interruption facilities behave [sequentially] was not an easy matter, because the number of instructions in various stages of processing when an interrupt signal occurs may be large.
Fred Brooks, Jr., Planning a Computer System: Project Stretch, 1962

Control is the most challenging aspect of processor design: it is both the hardest part to get right and the hardest part to make fast. One of the hardest parts of control is implementing exceptions and interrupts, events other than branches or jumps that change the normal flow of instruction execution.



As we did for the taken branch in the previous section, we must flush the instructions that follow the add instruction from the pipeline and begin fetching instructions from the new address. We will use the same mechanism we used for taken branches, but this time the exception causes the deasserting of control lines.

When we dealt with branch mispredict, we saw how to flush the instruction in the IF stage by turning it into a nop. To flush instructions in the ID stage, we use the multiplexor already in the ID stage that zeros control signals for stalls. A new control signal, called ID.Flush, is ORed with the stall signal from the hazard detection unit to flush during ID. To flush the instruction in the EX phase, we use a new signal called EX.Flush to cause new multiplexors to zero the control lines. To start fetching instructions from location 8000 0180hex, which is the MIPS exception address, we simply add an additional input to the PC multiplexor that sends 8000 0180hex to the PC. Figure 4.66 shows these changes.

This example points out a problem with exceptions: if we do not stop execution in the middle of the instruction, the programmer will not be able to see the original value of register $1 that helped cause the overflow, because it will be clobbered as the destination register of the add instruction. Because of careful planning, the overflow exception is detected during the EX stage; hence, we can use the EX.Flush signal to prevent the instruction in the EX stage from writing its result in the WB stage. Many exceptions require that we eventually complete the instruction that caused the exception as if it executed normally. The easiest way to do this is to flush the instruction and restart it from the beginning after the exception is handled.

The final step is to save the address of the offending instruction in the exception program counter (EPC). In reality, we save the address + 4, so the exception-handling software routine must first subtract 4 from the saved value. Figure 4.66 shows a stylized version of the datapath, including the branch hardware and necessary accommodations to handle exceptions.

Exception in a Pipelined Computer

EXAMPLE
Given this instruction sequence,

   40hex  sub $11, $2, $4
   44hex  and $12, $2, $5
   48hex  or  $13, $2, $6
   4Chex  add $1, $2, $1
   50hex  slt $15, $6, $7
   54hex  lw  $16, 50($7)
   . . .



[Figure 4.66 drawing: the pipelined datapath with IF.Flush, ID.Flush, and EX.Flush controls, Cause and EPC registers, and the 8000 0180hex input to the PC multiplexor]

FIGURE 4.66 The datapath with controls to handle exceptions. The key additions include a new input with the value 8000 0180hex in the multiplexor that supplies the new PC value; a Cause register to record the cause of the exception; and an Exception PC register to save the address of the instruction that caused the exception. The 8000 0180hex input to the multiplexor is the initial address to begin fetching instructions in the event of an exception. Although not shown, the ALU overflow signal is an input to the control unit.

assume the instructions to be invoked on an exception begin like this:

   80000180hex  sw $26, 1000($0)
   80000184hex  sw $27, 1004($0)
   . . .

Show what happens in the pipeline if an overflow exception occurs in the add instruction.

ANSWER
Figure 4.67 shows the events, starting with the add instruction in the EX stage. The overflow is detected during that phase, and 8000 0180hex is forced into the PC. Clock cycle 7 shows that the add and following instructions are flushed, and the first instruction of the exception code is fetched. Note that the address of the instruction following the add is saved: 4Chex + 4 = 50hex.



[Figure 4.67 drawing: pipeline contents during clock cycles 6 and 7; in clock 6 the instructions lw $16, 50($7), slt $15, $6, $7, add $1, $2, $1, or $13, ..., and $12, ... occupy the stages, and in clock 7 the add and later instructions have become bubbles while sw $26, 1000($0) is fetched from 8000 0180hex]

FIGURE 4.67 The result of an exception due to arithmetic overflow in the add instruction. The overflow is detected during the EX stage of clock 6, saving the address following the add in the EPC register (4Chex + 4 = 50hex). Overflow causes all the Flush signals to be set near the end of this clock cycle, deasserting control values (setting them to 0) for the add. Clock cycle 7 shows the instructions converted to bubbles in the pipeline plus the fetching of the first instruction of the exception routine, sw $26, 1000($0), from instruction location 8000 0180hex. Note that the AND and OR instructions, which are prior to the add, still complete. Although not shown, the ALU overflow signal is an input to the control unit.



We mentioned five examples of exceptions on page 326, and we will see others in Chapter 5. With five instructions active in any clock cycle, the challenge is to associate an exception with the appropriate instruction. Moreover, multiple exceptions can occur simultaneously in a single clock cycle. The solution is to prioritize the exceptions so that it is easy to determine which is serviced first. In most MIPS implementations, the hardware sorts exceptions so that the earliest instruction is interrupted.

I/O device requests and hardware malfunctions are not associated with a specific instruction, so the implementation has some flexibility as to when to interrupt the pipeline. Hence, the mechanism used for other exceptions works just fine.

The EPC captures the address of the interrupted instruction, and the MIPS Cause register records all possible exceptions in a clock cycle, so the exception software must match the exception to the instruction. An important clue is knowing in which pipeline stage a type of exception can occur. For example, an undefined instruction is discovered in the ID stage, and invoking the operating system occurs in the EX stage. Exceptions are collected in the Cause register in a pending exception field so that the hardware can interrupt based on later exceptions, once the earliest one has been serviced.

Hardware/Software Interface
The hardware and the operating system must work in conjunction so that exceptions behave as you would expect. The hardware contract is normally to stop the offending instruction in midstream, let all prior instructions complete, flush all following instructions, set a register to show the cause of the exception, save the address of the offending instruction, and then jump to a prearranged address. The operating system contract is to look at the cause of the exception and act appropriately. For an undefined instruction, hardware failure, or arithmetic overflow exception, the operating system normally kills the program and returns an indicator of the reason. For an I/O device request or an operating system service call, the operating system saves the state of the program, performs the desired task, and, at some point in the future, restores the program to continue execution. In the case of I/O device requests, we may often choose to run another task before resuming the task that requested the I/O, since that task may often not be able to proceed until the I/O is complete. Exceptions are why the ability to save and restore the state of any task is critical. One of the most important and frequent uses of exceptions is handling page faults and TLB exceptions; Chapter 5 describes these exceptions and their handling in more detail.

imprecise interrupt: Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.

Elaboration: The difficulty of always associating the correct exception with the correct instruction in pipelined computers has led some computer designers to relax this requirement in noncritical cases. Such processors are said to have imprecise interrupts or imprecise exceptions. In the example above, PC would normally have 58hex at the start of the clock cycle after the exception is detected, even though the offending instruction is at address 4Chex.



Another example is that we might speculate that a store that precedes a load does not refer to the same address, which would allow the load to be executed before the store. The difficulty with speculation is that it may be wrong. So, any speculation mechanism must include both a method to check if the guess was right and a method to unroll or back out the effects of the instructions that were executed speculatively. The implementation of this back-out capability adds complexity.

Speculation may be done in the compiler or by the hardware. For example, the compiler can use speculation to reorder instructions, moving an instruction across a branch or a load across a store. The processor hardware can perform the same transformation at runtime using techniques we discuss later in this section.

The recovery mechanisms used for incorrect speculation are rather different. In the case of speculation in software, the compiler usually inserts additional instructions that check the accuracy of the speculation and provide a fix-up routine to use when the speculation is incorrect. In hardware speculation, the processor usually buffers the speculative results until it knows they are no longer speculative. If the speculation is correct, the instructions are completed by allowing the contents of the buffers to be written to the registers or memory. If the speculation is incorrect, the hardware flushes the buffers and re-executes the correct instruction sequence.
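A compiler-style sketch of the store/load case mentioned above follows: the load is hoisted above the store, and an explicit address check with a fix-up reload repairs the value if the speculation was wrong. The function and its exact form are our own illustration, not a prescribed transformation.

    int hoist_load(int *p, int *q, int v)
    {
        int t = *q;     /* speculative: load moved above the store it originally followed */
        *p = v;         /* the store that came first in the original program order        */
        if (p == q)     /* check inserted by the compiler: did the addresses alias?       */
            t = *q;     /* fix-up: redo the load so it sees the stored value              */
        return t;
    }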

Speculation introduces one other possible problem: speculating on certain instructions may introduce exceptions that were formerly not present. For example, suppose a load instruction is moved in a speculative manner, but the address it uses is not legal when the speculation is incorrect. The result would be an exception that should not have occurred. The problem is complicated by the fact that if the load instruction were not speculative, then the exception must occur! In compiler-based speculation, such problems are avoided by adding special speculation support that allows such exceptions to be ignored until it is clear that they really should occur. In hardware-based speculation, exceptions are simply buffered until it is clear that the instruction causing them is no longer speculative and is ready to complete; at that point the exception is raised, and normal exception handling proceeds.

Since speculation can improve performance when done properly and decrease performance when done carelessly, significant effort goes into deciding when it is appropriate to speculate. Later in this section, we will examine both static and dynamic techniques for speculation.

issue packet: The set of instructions that issues together in one clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.

Static Multiple Issue

Static multiple-issue processors all use the compiler to assist with packaging instructions and handling hazards. In a static issue processor, you can think of the set of instructions issued in a given clock cycle, which is called an issue packet, as one large instruction with multiple operations. This view is more than an analogy. Since a static multiple-issue processor usually restricts what mix of instructions can be initiated in a given clock cycle, it is useful to think of the issue packet as a single instruction allowing several operations in certain predefined fields. This view led to the original name for this approach: Very Long Instruction Word (VLIW).

Most static issue processors also rely on the compiler to take on some responsibility for handling data and control hazards. The compiler's responsibilities may include static branch prediction and code scheduling to reduce or prevent all hazards. Let's look at a simple static issue version of a MIPS processor, before we describe the use of these techniques in more aggressive processors.

An Example: Static Multiple Issue with the MIPS ISA

To give a flavor of static multiple issue, we consider a simple two-issue MIPS processor, where one of the instructions can be an integer ALU operation or branch and the other can be a load or store. Such a design is like that used in some embedded MIPS processors. Issuing two instructions per cycle will require fetching and decoding 64 bits of instructions. In many static multiple-issue processors, and essentially all VLIW processors, the layout of simultaneously issuing instructions is restricted to simplify the decoding and instruction issue. Hence, we will require that the instructions be paired and aligned on a 64-bit boundary, with the ALU or branch portion appearing first. Furthermore, if one instruction of the pair cannot be used, we require that it be replaced with a nop. Thus, the instructions always issue in pairs, possibly with a nop in one slot. Figure 4.68 shows how the instructions look as they go into the pipeline in pairs.

Very Long Instruction Word (VLIW): A style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields.

Static multiple-issue processors vary in how they deal with potential data and control hazards. In some designs, the compiler takes full responsibility for removing all hazards, scheduling the code and inserting no-ops so that the code executes without any need for hazard detection or hardware-generated stalls. In others, the hardware detects data hazards and generates stalls between two issue packets, while requiring that the compiler avoid all dependences within an instruction pair. Even so, a hazard generally forces the entire issue packet containing the dependent instruction to stall. Whether the software must handle all hazards or only try to reduce the fraction of hazards between separate issue packets, the appearance of having a large single instruction with multiple operations is reinforced. We will assume the second approach for this example.

Instruction type             Pipe stages
ALU or branch instruction    IF  ID  EX  MEM WB
Load or store instruction    IF  ID  EX  MEM WB
ALU or branch instruction        IF  ID  EX  MEM WB
Load or store instruction        IF  ID  EX  MEM WB
ALU or branch instruction            IF  ID  EX  MEM WB
Load or store instruction            IF  ID  EX  MEM WB
ALU or branch instruction                IF  ID  EX  MEM WB
Load or store instruction                IF  ID  EX  MEM WB

FIGURE 4.68 Static two-issue pipeline in operation. The ALU and data transfer instructions are issued at the same time. Here we have assumed the same five-stage structure as used for the single-issue pipeline. Although this is not strictly necessary, it does have some advantages. In particular, keeping the register writes at the end of the pipeline simplifies the handling of exceptions and the maintenance of a precise exception model, which become more difficult in multiple-issue processors.
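Under the issue restrictions just described, a compiler (or an assertion in a pipeline simulator) could check whether a candidate pair may issue together roughly as in the sketch below; the instruction classification and field names are assumptions of this illustration.

    #include <stdbool.h>
    #include <stdint.h>

    enum iclass { ALU_OR_BRANCH, LOAD_OR_STORE, OTHER };

    struct insn { enum iclass cls; bool is_nop; bool writes_reg; uint8_t rd, rs, rt; };

    /* True if slot 0 (ALU/branch position) and slot 1 (load/store position) may issue together. */
    static bool legal_pair(struct insn slot0, struct insn slot1)
    {
        if (!slot0.is_nop && slot0.cls != ALU_OR_BRANCH) return false;  /* first slot: ALU or branch  */
        if (!slot1.is_nop && slot1.cls != LOAD_OR_STORE) return false;  /* second slot: load or store */
        if (slot0.writes_reg && slot0.rd != 0 &&
            (slot0.rd == slot1.rs || slot0.rd == slot1.rt))
            return false;                           /* no dependence allowed within the pair */
        return true;
    }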

To issue an ALU and a data transfer operation in parallel, the first need for additional hardware, beyond the usual hazard detection and stall logic, is extra ports in the register file (see Figure 4.69). In one clock cycle we may need to read two registers for the ALU operation and two more for a store, and also one write port for an ALU operation and one write port for a load. Since the ALU is tied up for the ALU operation, we also need a separate adder to calculate the effective address for data transfers. Without these extra resources, our two-issue pipeline would be hindered by structural hazards.

Clearly, this two-issue processor can improve performance by up to a factor of two. Doing so, however, requires that twice as many instructions be overlapped in execution, and this additional overlap increases the relative performance loss from data and control hazards. For example, in our simple five-stage pipeline, loads have a use latency of one clock cycle, which prevents one instruction from using the result without stalling. In the two-issue, five-stage pipeline the result of a load instruction cannot be used on the next clock cycle. This means that the next two instructions cannot use the load result without stalling. Furthermore, ALU instructions that had no use latency in the simple five-stage pipeline now have a one-instruction use latency, since the results cannot be used in the paired load or store. To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compiler or hardware scheduling techniques are needed, and static multiple issue requires that the compiler take on this role.

[Figure 4.69 drawing: the static two-issue datapath, with 64 bits fetched per cycle, two sign-extend units, extra register-file ports, and a second ALU for address calculation]

FIGURE 4.69 A static two-issue datapath. The additions needed for double issue are highlighted: another 32 bits from instruction memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles address calculations for data transfers and the top ALU handles everything else.

use latency: Number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.

Simple Multiple-Issue Code Scheduling

EXAMPLE
How would this loop be scheduled on a static two-issue pipeline for MIPS?

   Loop: lw   $t0, 0($s1)        # $t0 = array element
         addu $t0, $t0, $s2      # add scalar in $s2
         sw   $t0, 0($s1)        # store result
         addi $s1, $s1, -4       # decrement pointer
         bne  $s1, $zero, Loop   # branch if $s1 != 0

Reorder the instructions to avoid as many pipeline stalls as possible. Assume branches are predicted, so that control hazards are handled by the hardware.

ANSWER
The first three instructions have data dependences, and so do the last two. Figure 4.70 shows the best schedule for these instructions. Notice that just one pair of instructions has both issue slots used. It takes four clocks per loop iteration; at four clocks to execute five instructions, we get the disappointing CPI of 0.8 versus the best case of 0.5, or an IPC of 1.25 versus 2.0. Notice that in computing CPI or IPC, we do not count any nops executed as useful instructions. Doing so would improve CPI, but not performance!

   ALU or branch instruction     Data transfer instruction   Clock cycle
   Loop:                         lw  $t0, 0($s1)             1
         addi $s1, $s1, -4                                   2
         addu $t0, $t0, $s2                                  3
         bne  $s1, $zero, Loop   sw  $t0, 4($s1)             4

FIGURE 4.70 The scheduled code as it would look on a two-issue MIPS pipeline. The empty slots are no-ops.



loop unrolling: A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.

An important compiler technique to get more performance from loops is loop unrolling, where multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping instructions from different iterations.

Loop Unrolling for Multiple-Issue Pipelines

EXAMPLE
See how well loop unrolling and scheduling work in the example above. For simplicity assume that the loop index is a multiple of four.

register renaming: The renaming of registers by the compiler or hardware to remove antidependences.

antidependence: Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.

ANSWER
To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of lw, add, and sw, plus one addi and one bne. Figure 4.71 shows the unrolled and scheduled code.

During the unrolling process, the compiler introduced additional registers ($t1, $t2, $t3). The goal of this process, called register renaming, is to eliminate dependences that are not true data dependences, but could either lead to potential hazards or prevent the compiler from flexibly scheduling the code. Consider how the unrolled code would look using only $t0. There would be repeated instances of lw $t0, 0($s1) and addu $t0, $t0, $s2 followed by sw $t0, 4($s1), but these sequences, despite using $t0, are actually completely independent: no data values flow between one set of these instructions and the next set. This case is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence that is also called a true dependence.

Renaming the registers during the unrolling process allows the compiler to subsequently move these independent instructions so as to better schedule the code. The renaming process eliminates the name dependences, while preserving the true dependences.

Notice now that 12 of the 14 instructions in the loop execute as pairs. It takes 8 clocks for 4 loop iterations, or 2 clocks per iteration, which yields a CPI of 8/14 = 0.57. Loop unrolling and scheduling with dual issue gave us an improvement factor of almost 2, partly from reducing the loop control instructions and partly from dual issue execution. The cost of this performance improvement is using four temporary registers rather than one, as well as a significant increase in code size.

   ALU or branch instruction     Data transfer instruction   Clock cycle
   Loop: addi $s1, $s1, -16      lw $t0, 0($s1)              1
                                 lw $t1, 12($s1)             2
         addu $t0, $t0, $s2      lw $t2, 8($s1)              3
         addu $t1, $t1, $s2      lw $t3, 4($s1)              4
         addu $t2, $t2, $s2      sw $t0, 16($s1)             5
         addu $t3, $t3, $s2      sw $t1, 12($s1)             6
                                 sw $t2, 8($s1)              7
         bne  $s1, $zero, Loop   sw $t3, 4($s1)              8

FIGURE 4.71 The unrolled and scheduled code of Figure 4.70 as it would look on a static two-issue MIPS pipeline. The empty slots are no-ops. Since the first instruction in the loop decrements $s1 by 16, the addresses loaded are the original value of $s1, then that address minus 4, minus 8, and minus 12.
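In C terms, the unrolling and renaming above correspond roughly to rewriting the loop with four independent temporaries per iteration. The source loop below is our own reconstruction of what the MIPS code computes (adding a scalar to every array element), so treat it as an assumption rather than the book's source program.

    void add_scalar(int *a, int n, int s)   /* n assumed to be a multiple of 4 */
    {
        /* Rolled form: for (int i = 0; i < n; i++) a[i] += s; */
        for (int i = 0; i + 3 < n; i += 4) {
            int t0 = a[i], t1 = a[i + 1], t2 = a[i + 2], t3 = a[i + 3];  /* "renamed" temporaries */
            a[i]     = t0 + s;
            a[i + 1] = t1 + s;
            a[i + 2] = t2 + s;
            a[i + 3] = t3 + s;
        }
    }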

Dynamic Multiple-Issue Processors

Dynamic multiple-issue processors are also known as superscalar processors, or simply superscalars. In the simplest superscalar processors, instructions issue in order, and the processor decides whether zero, one, or more instructions can issue in a given clock cycle. Obviously, achieving good performance on such a processor still requires the compiler to try to schedule instructions to move dependences apart and thereby improve the instruction issue rate. Even with such compiler scheduling, there is an important difference between this simple superscalar and a VLIW processor: the code, whether scheduled or not, is guaranteed by the hardware to execute correctly. Furthermore, compiled code will always run correctly independent of the issue rate or pipeline structure of the processor. In some VLIW designs, this has not been the case, and recompilation was required when moving across different processor models; in other static issue processors, code would run correctly across different implementations, but often so poorly as to make compilation effectively required.

Many superscalars extend the basic framework of dynamic issue decisions to include dynamic pipeline scheduling. Dynamic pipeline scheduling chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls. Let's start with a simple example of avoiding a data hazard. Consider the following code sequence:

   lw   $t0, 20($s2)
   addu $t1, $t0, $t2
   sub  $s4, $s4, $t3
   slti $t5, $s4, 20

Even though the sub instruction is ready to execute, it must wait for the lw and addu to complete first, which might take many clock cycles if memory is slow. (Chapter 5 explains cache misses, the reason that memory accesses are sometimes very slow.) Dynamic pipeline scheduling allows such hazards to be avoided either fully or partially.

superscalar: An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.

dynamic pipeline scheduling: Hardware support for reordering the order of instruction execution so as to avoid stalls.

Dynamic Pipeline Scheduling

Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering them to avoid stalls. In such processors, the pipeline is divided into three major units: an instruction fetch and issue unit, multiple functional units (a dozen or more in high-end designs in 2013), and a commit unit. Figure 4.72 shows the model. The first unit fetches instructions, decodes them, and sends each instruction to a corresponding functional unit for execution. Each functional unit has buffers, called reservation stations, which hold the operands and the operation. (The Elaboration discusses an alternative to reservation stations used by many recent processors.) As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated. When the result is completed, it is sent to any reservation stations waiting for this particular result as well as to the commit unit, which buffers the result until it is safe to put the result into the register file or, for a store, into memory. The buffer in the commit unit, often called the reorder buffer, is also used to supply operands, in much the same way as forwarding logic does in a statically scheduled pipeline. Once a result is committed to the register file, it can be fetched directly from there, just as in a normal pipeline.

[Figure 4.72 drawing: an instruction fetch and decode unit issuing in order to reservation stations; integer, floating point, and load-store functional units executing out of order; and a commit unit committing in order]

FIGURE 4.72 The three primary units of a dynamically scheduled pipeline. The final step of updating the state is also called retirement or graduation.

commit unit: The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.

reservation station: A buffer within a functional unit that holds the operands and the operation.

reorder buffer: The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.

The combination of buffering operands in the reservation stations and results in the reorder buffer provides a form of register renaming, just like that used by the compiler in our earlier loop-unrolling example on page 338. To see how this conceptually works, consider the following steps:



1. When an instruction issues, it is copied to a reservation station for the appropriate functional unit. Any operands that are available in the register file or reorder buffer are also immediately copied into the reservation station. The instruction is buffered in the reservation station until all the operands and the functional unit are available. For the issuing instruction, the register copy of the operand is no longer required, and if a write to that register occurred, the value could be overwritten.

2. If an operand is not in the register file or reorder buffer, it must be waiting to be produced by a functional unit. The name of the functional unit that will produce the result is tracked. When that unit eventually produces the result, it is copied directly into the waiting reservation station from the functional unit, bypassing the registers.

These steps effectively use the reorder buffer and the reservation stations to implement register renaming.
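The two steps can be modeled in C roughly as below; the tag/value representation and the function name are assumptions used only to make the renaming effect concrete, not a description of any real microarchitecture.

    #include <stdbool.h>
    #include <stdint.h>

    struct operand {
        bool    ready;       /* true: value holds the data                            */
        int32_t value;       /* the operand value, if ready                           */
        int     producer;    /* otherwise: tag of the functional unit that produces it */
    };

    /* Steps 1 and 2: fill one reservation-station operand at issue time. */
    static struct operand capture(bool avail, int32_t current_value, int producing_unit)
    {
        struct operand op;
        if (avail) {            /* value already in the register file or reorder buffer */
            op.ready = true;  op.value = current_value;  op.producer = -1;
        } else {                /* wait for the functional unit; result arrives by bypass */
            op.ready = false; op.value = 0;              op.producer = producing_unit;
        }
        return op;
    }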

Conceptually, you can think of a dynamically scheduled pipeline as analyzing<br />

the data flow structure of a program. The processor then executes the instructions<br />

in some order that preserves the data flow order of the program. This style of<br />

execution is called an out-of-order execution, since the instructions can be<br />

executed in a different order than they were fetched.<br />

To make programs behave as if they were running on a simple in-order pipeline,<br />

the instruction fetch <strong>and</strong> decode unit is required to issue instructions in order,<br />

which allows dependences to be tracked, <strong>and</strong> the commit unit is required to write<br />

results to registers <strong>and</strong> memory in program fetch order. This conservative mode is<br />

called in-order commit. Hence, if an exception occurs, the computer can point to<br />

the last instruction executed, <strong>and</strong> the only registers updated will be those written<br />

by instructions before the instruction causing the exception. Although the front<br />

end (fetch <strong>and</strong> issue) <strong>and</strong> the back end (commit) of the pipeline run in order,<br />

the functional units are free to initiate execution whenever the data they need is<br />

available. Today, all dynamically scheduled pipelines use in-order commit.<br />
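The in-order commit rule itself can be expressed as a short loop over a circular reorder buffer: only completed entries at the head may update the architectural state, so no later instruction ever commits before an earlier one. Again, this is a conceptual sketch rather than the structure of any real commit unit; names such as ROB_SIZE and arch_reg are assumptions made for the example.

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 64
#define NUM_REGS 32

typedef struct {
    bool    valid;      /* entry is occupied                  */
    bool    complete;   /* result has been produced           */
    int     dest_reg;   /* architectural register to update   */
    int64_t result;
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int         rob_head;              /* oldest un-committed entry */
static int64_t     arch_reg[NUM_REGS];    /* programmer-visible state  */

/* Commit in program order: stop at the first entry that is not done.  */
void commit(void) {
    while (rob[rob_head].valid && rob[rob_head].complete) {
        arch_reg[rob[rob_head].dest_reg] = rob[rob_head].result;
        rob[rob_head].valid = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}

Because younger entries are simply discarded when the head raises an exception or a misprediction is discovered, in-order commit makes precise interrupts straightforward.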

Dynamic scheduling is often extended by including hardware-based speculation,<br />

especially for branch outcomes. By predicting the direction of a branch, a<br />

dynamically scheduled processor can continue to fetch <strong>and</strong> execute instructions<br />

along the predicted path. Because the instructions are committed in order, we know<br />

whether or not the branch was correctly predicted before any instructions from the<br />

predicted path are committed. A speculative, dynamically scheduled pipeline can<br />

also support speculation on load addresses, allowing load-store reordering, <strong>and</strong><br />

using the commit unit to avoid incorrect speculation. In the next section, we will<br />

look at the use of dynamic scheduling with speculation in the Intel Core i7 design.<br />

out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.

in-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.


4.10 Parallelism via Instructions 343<br />

Modern, high-performance microprocessors are capable of issuing several instructions<br />

per clock; unfortunately, sustaining that issue rate is very difficult. For example, despite<br />

the existence of processors that can issue four to six instructions per clock, very few applications can

sustain more than two instructions per clock. There are two primary reasons for this.<br />

First, within the pipeline, the major performance bottlenecks arise from<br />

dependences that cannot be alleviated, thus reducing the parallelism among<br />

instructions <strong>and</strong> the sustained issue rate. Although little can be done about true data<br />

dependences, often the compiler or hardware does not know precisely whether a<br />

dependence exists or not, <strong>and</strong> so must conservatively assume the dependence exists.<br />

For example, code that makes use of pointers, particularly in ways that may lead to<br />

aliasing, will lead to more implied potential dependences. In contrast, the greater<br />

regularity of array accesses often allows a compiler to deduce that no dependences<br />

exist. Similarly, branches that cannot be accurately predicted, whether at runtime or at
compile time, will limit the ability to exploit ILP. Often, additional ILP is available, but

the ability of the compiler or the hardware to find ILP that may be widely separated<br />

(sometimes by the execution of thous<strong>and</strong>s of instructions) is limited.<br />
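A small C example, not taken from the text, illustrates the aliasing point. In the first function the compiler (or the hardware) must assume that the two pointers might refer to the same location, which forces a potential dependence between the store and the following load. When the accesses are regular and the pointers are known not to alias, no such dependence exists and the operations can be reordered or overlapped freely.

/* The store through p may change *q, so the load of *q cannot safely be
   moved above the store: a potential (implied) dependence.             */
int maybe_aliased(int *p, int *q) {
    *p = *p + 1;
    return *q;     /* must appear to read the possibly updated value    */
}

/* With restrict-qualified pointers and regular index expressions, the
   compiler can prove the accesses are independent, exposing more
   instruction-level parallelism.                                       */
void independent(int n, double * restrict a, double * restrict b) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1.0;   /* iterations touch provably distinct elements */
}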

Second, losses in the memory hierarchy (the topic of Chapter 5) also limit the<br />

ability to keep the pipeline full. Some memory system stalls can be hidden, but<br />

limited amounts of ILP also limit the extent to which such stalls can be hidden.<br />

Hardware/Software Interface

Energy Efficiency <strong>and</strong> Advanced Pipelining<br />

The downside to the increasing exploitation of instruction-level parallelism via<br />

dynamic multiple issue <strong>and</strong> speculation is potential energy inefficiency. Each<br />

innovation was able to turn more transistors into performance, but they often did<br />

so very inefficiently. Now that we have hit the power wall, we are seeing designs<br />

with multiple processors per chip where the processors are not as deeply pipelined<br />

or as aggressively speculative as their predecessors.

The belief is that while the simpler processors are not as fast as their sophisticated<br />

brethren, they deliver better performance per joule, so that they can deliver more<br />

performance per chip when designs are constrained more by energy than they are<br />

by number of transistors.<br />

Figure 4.73 shows the number of pipeline stages, the issue width, speculation level,<br />

clock rate, cores per chip, <strong>and</strong> power of several past <strong>and</strong> recent microprocessors. Note<br />

the drop in pipeline stages <strong>and</strong> power as companies switch to multicore designs.<br />

Elaboration: A commit unit controls updates to the register file <strong>and</strong> memory. Some<br />

dynamically scheduled processors update the register file immediately during execution,<br />

using extra registers to implement the renaming function <strong>and</strong> preserving the older copy of a<br />

register until the instruction updating the register is no longer speculative. Other processors<br />

buffer the result, typically in a structure called a reorder buffer, <strong>and</strong> the actual update to the<br />

register file occurs later as part of the commit. Stores to memory must be buffered until<br />

commit time either in a store buffer (see Chapter 5) or in the reorder buffer. The commit unit<br />

allows the store to write to memory from the buffer when the buffer has a valid address <strong>and</strong><br />

valid data, <strong>and</strong> when the store is no longer dependent on predicted branches.
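The release condition described in this Elaboration can also be stated as a simple predicate. The sketch below is illustrative only; the field names are assumptions, not the structure used by any actual processor.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     addr_valid;   /* effective address has been computed       */
    bool     data_valid;   /* store data has been produced              */
    bool     speculative;  /* still depends on a predicted branch       */
    uint64_t addr;
    uint64_t data;
} store_buffer_entry_t;

/* A buffered store may be released to memory only when it has a valid
   address, valid data, and is no longer speculative.                   */
bool may_write_to_memory(const store_buffer_entry_t *e) {
    return e->addr_valid && e->data_valid && !e->speculative;
}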


344 Chapter 4 The Processor<br />

Microprocessor              Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-Order/Speculation  Cores/Chip  Power
Intel 486                   1989  25 MHz      5                1            No                        1           5 W
Intel Pentium               1993  66 MHz      5                2            No                        1           10 W
Intel Pentium Pro           1997  200 MHz     10               3            Yes                       1           29 W
Intel Pentium 4 Willamette  2001  2000 MHz    22               3            Yes                       1           75 W
Intel Pentium 4 Prescott    2004  3600 MHz    31               3            Yes                       1           103 W
Intel Core                  2006  2930 MHz    14               4            Yes                       2           75 W
Intel Core i5 Nehalem       2010  3300 MHz    14               4            Yes                       1           87 W
Intel Core i5 Ivy Bridge    2012  3400 MHz    14               4            Yes                       8           77 W

FIGURE 4.73 Record of Intel Microprocessors in terms of pipeline complexity, number of cores, and power. The Pentium 4 pipeline stages do not include the commit stages. If we included them, the Pentium 4 pipelines would be even deeper.

Elaboration: Memory accesses benefi t from nonblocking caches, which continue<br />

servicing cache accesses during a cache miss (see Chapter 5). Out-of-order execution<br />

processors need the cache design to allow instructions to execute during a miss.<br />

Check Yourself

State whether the following techniques or components are associated primarily<br />

with a software- or hardware-based approach to exploiting ILP. In some cases, the<br />

answer may be both.<br />

1. Branch prediction<br />

2. Multiple issue<br />

3. VLIW<br />

4. Superscalar<br />

5. Dynamic scheduling<br />

6. Out-of-order execution<br />

7. Speculation<br />

8. Reorder buffer<br />

9. Register renaming<br />

4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel<br />

Core i7 Pipelines<br />

Figure 4.74 describes the two microprocessors we examine in this section, whose<br />

targets are the two bookends of the PostPC Era.


4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel Core i7 Pipelines 345<br />

Processor                      ARM A8                  Intel Core i7 920
Market                         Personal Mobile Device  Server, Cloud
Thermal design power           2 Watts                 130 Watts
Clock rate                     1 GHz                   2.66 GHz
Cores/Chip                     1                       4
Floating point?                No                      Yes
Multiple issue?                Dynamic                 Dynamic
Peak instructions/clock cycle  2                       4
Pipeline stages                14                      14
Pipeline schedule              Static in-order         Dynamic out-of-order with speculation
Branch prediction              2-level                 2-level
1st level caches / core        32 KiB I, 32 KiB D      32 KiB I, 32 KiB D
2nd level cache / core         128 - 1024 KiB          256 KiB
3rd level cache (shared)       --                      2 - 8 MiB

FIGURE 4.74 Specification of the ARM Cortex-A8 and the Intel Core i7 920.

The ARM Cortex-A8<br />

The ARM Cortex-A8 runs at 1 GHz with a 14-stage pipeline. It uses dynamic

multiple issue, with two instructions per clock cycle. It is a static in-order pipeline,<br />

in that instructions issue, execute, <strong>and</strong> commit in order. The pipeline consists of<br />

three sections for instruction fetch, instruction decode, <strong>and</strong> execute. Figure 4.75<br />

shows the overall pipeline.<br />

The first three stages fetch two instructions at a time and try to keep a 12-entry instruction prefetch buffer full. It uses a two-level branch predictor with a 512-entry branch target buffer, a 4096-entry global history buffer, and an 8-entry return stack to predict future returns. When the branch prediction is

wrong, it empties the pipeline, resulting in a 13-clock cycle misprediction penalty.<br />

The five stages of the decode pipeline determine if there are dependences<br />

between a pair of instructions, which would force sequential execution, <strong>and</strong> in<br />

which pipeline of the execution stages to send the instructions.<br />

The six stages of the instruction execution section offer one pipeline for load<br />

<strong>and</strong> store instructions <strong>and</strong> two pipelines for arithmetic operations, although only<br />

the first of the pair can h<strong>and</strong>le multiplies. Either instruction from the pair can be<br />

issued to the load-store pipeline. The execution stages have full bypassing between<br />

the three pipelines.<br />

Figure 4.76 shows the CPI of the A8 using small versions of programs derived<br />

from the SPEC2000 benchmarks. While the ideal CPI is 0.5, the best case here is<br />

1.4, the median case is 2.0, <strong>and</strong> the worst case is 5.2. For the median case, 80% of<br />

the stalls are due to the pipelining hazards <strong>and</strong> 20% are stalls due to the memory


4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel Core i7 Pipelines 349<br />

issue the micro-ops from the buffer, eliminating the need for the instruction<br />

fetch <strong>and</strong> instruction decode stages to be activated.<br />

5. Perform the basic instruction issue—Looking up the register location in the<br />

register tables, renaming the registers, allocating a reorder buffer entry, <strong>and</strong><br />

fetching any results from the registers or reorder buffer before sending the<br />

micro-ops to the reservation stations.<br />

6. The i7 uses a 36-entry centralized reservation station shared by six functional<br />

units. Up to six micro-ops may be dispatched to the functional units every<br />

clock cycle.<br />

7. The individual function units execute micro-ops <strong>and</strong> then results are sent<br />

back to any waiting reservation station as well as to the register retirement<br />

unit, where they will update the register state, once it is known that the<br />

instruction is no longer speculative. The entry corresponding to the<br />

instruction in the reorder buffer is marked as complete.<br />

8. When one or more instructions at the head of the reorder buffer have been<br />

marked as complete, the pending writes in the register retirement unit are<br />

executed, <strong>and</strong> the instructions are removed from the reorder buffer.<br />

Elaboration: Hardware in the second <strong>and</strong> fourth steps can combine or fuse operations<br />

together to reduce the number of operations that must be performed. Macro-op fusion<br />

in the second step takes x86 instruction combinations, such as compare followed by a<br />

branch, <strong>and</strong> fuses them into a single operation. Microfusion in the fourth step combines<br />

micro-operation pairs such as load/ALU operation <strong>and</strong> ALU operation/store <strong>and</strong> issues<br />

them to a single reservation station (where they can still issue independently), thus<br />

increasing the usage of the buffer. In a study of the Intel Core architecture, which also<br />

incorporated microfusion <strong>and</strong> macrofusion, Bird et al. [2007] discovered that microfusion<br />

had little impact on performance, while macrofusion appears to have a modest positive<br />

impact on integer performance <strong>and</strong> little impact on floating-point performance.<br />

Performance of the Intel Core i7 920<br />

Figure 4.78 shows the CPI of the Intel Core i7 for each of the SPEC2006 benchmarks.<br />

While the ideal CPI is 0.25, the best case here is 0.44, the median case is 0.79, <strong>and</strong><br />

the worst case is 2.67.<br />

While it is difficult to differentiate between pipeline stalls <strong>and</strong> memory stalls<br />

in a dynamic out-of-order execution pipeline, we can show the effectiveness of<br />

branch prediction <strong>and</strong> speculation. Figure 4.79 shows the percentage of branches<br />

mispredicted and the percentage of the work (measured by the number of micro-ops

dispatched into the pipeline) that does not retire (that is, their results are<br />

annulled) relative to all micro-op dispatches. The min, median, <strong>and</strong> max of branch<br />

mispredictions are 0%, 2%, <strong>and</strong> 10%. For wasted work, they are 1%, 18%, <strong>and</strong> 39%.<br />

The wasted work in some cases closely matches the branch misprediction rates,<br />

such as for gobmk <strong>and</strong> astar. In several instances, such as mcf, the wasted work<br />

seems relatively larger than the misprediction rate. This divergence is likely due


350 Chapter 4 The Processor<br />

[Bar chart omitted: CPI of each SPEC2006 integer benchmark (libquantum, h264ref, hmmer, perlbench, bzip2, xalancbmk, sjeng, gobmk, astar, gcc, omnetpp, mcf), with each bar split into ideal CPI and stalls/misspeculation; the values range from 0.44 to 2.67.]

FIGURE 4.78 CPI of Intel Core i7 920 running SPEC2006 integer benchmarks.

[Bar chart omitted: branch misprediction percentage and wasted-work percentage for each SPEC2006 integer benchmark; mispredictions range from 0% to 10% and wasted work from 1% to 39%.]

FIGURE 4.79 Percentage of branch mispredictions and wasted work due to unfruitful speculation of Intel Core i7 920 running SPEC2006 integer benchmarks.


352 Chapter 4 The Processor<br />

#include <x86intrin.h>
#define UNROLL (4)

void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=UNROLL*4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c[4];
      for ( int x = 0; x < UNROLL; x++ )
        c[x] = _mm256_load_pd(C+i+x*4+j*n);

      for ( int k = 0; k < n; k++ )
      {
        __m256d b = _mm256_broadcast_sd(B+k+j*n);
        for ( int x = 0; x < UNROLL; x++ )
          c[x] = _mm256_add_pd(c[x],
                 _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
      }

      for ( int x = 0; x < UNROLL; x++ )
        _mm256_store_pd(C+i+x*4+j*n, c[x]);
    }
}

FIGURE 4.80 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86 (Figure 3.23) and loop unrolling to create more opportunities for instruction-level parallelism. Figure 4.81 shows the assembly language produced by the compiler for the inner loop, which unrolls the three for-loop bodies to expose instruction-level parallelism.

instruction, since we can use the four copies of the B element in register %ymm0<br />

repeatedly throughout the loop. Thus, the 5 AVX instructions in Figure 3.24<br />

become 17 in Figure 4.81, <strong>and</strong> the 7 integer instructions appear in both, although<br />

the constants and addressing change to account for the unrolling. Hence, despite

unrolling 4 times, the number of instructions in the body of the loop only doubles:<br />

from 12 to 24.<br />

Figure 4.82 shows the performance increase of DGEMM for 32x32 matrices in

going from unoptimized to AVX <strong>and</strong> then to AVX with unrolling. Unrolling more<br />

than doubles performance, going from 6.4 GFLOPS to 14.6 GFLOPS. Optimizations<br />

for subword parallelism <strong>and</strong> instruction level parallelism result in an overall<br />

speedup of 8.8 versus the unoptimized DGEMM in Figure 3.21.<br />

Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with<br />

Turbo mode turned off. If we turn it on, as in Chapter 3, the temporary increase in the clock rate of 3.3/2.6 = 1.27 improves all the results: to 2.1 GFLOPS for unoptimized DGEMM, 8.1 GFLOPS with AVX, and 18.6 GFLOPS with unrolling and AVX. As mentioned

in Section 3.8, Turbo mode works particularly well in this case because it is using only<br />

a single core of an eight-core chip.


4.12 Going Faster: Instruction-Level Parallelism <strong>and</strong> Matrix Multiply 353<br />

1  vmovapd (%r11),%ymm4              # Load 4 elements of C into %ymm4
2  mov %rbx,%rax                     # register %rax = %rbx
3  xor %ecx,%ecx                     # register %ecx = 0
4  vmovapd 0x20(%r11),%ymm3          # Load 4 elements of C into %ymm3
5  vmovapd 0x40(%r11),%ymm2          # Load 4 elements of C into %ymm2
6  vmovapd 0x60(%r11),%ymm1          # Load 4 elements of C into %ymm1
7  vbroadcastsd (%rcx,%r9,1),%ymm0   # Make 4 copies of B element
8  add $0x8,%rcx                     # register %rcx = %rcx + 8
9  vmulpd (%rax),%ymm0,%ymm5         # Parallel mul %ymm0, 4 A elements
10 vaddpd %ymm5,%ymm4,%ymm4          # Parallel add %ymm5, %ymm4
11 vmulpd 0x20(%rax),%ymm0,%ymm5     # Parallel mul %ymm0, 4 A elements
12 vaddpd %ymm5,%ymm3,%ymm3          # Parallel add %ymm5, %ymm3
13 vmulpd 0x40(%rax),%ymm0,%ymm5     # Parallel mul %ymm0, 4 A elements
14 vmulpd 0x60(%rax),%ymm0,%ymm0     # Parallel mul %ymm0, 4 A elements
15 add %r8,%rax                      # register %rax = %rax + %r8
16 cmp %r10,%rcx                     # compare %rcx to %r10
17 vaddpd %ymm5,%ymm2,%ymm2          # Parallel add %ymm5, %ymm2
18 vaddpd %ymm0,%ymm1,%ymm1          # Parallel add %ymm0, %ymm1
19 jne 68                            # loop back if %rcx != %r10
20 add $0x1,%esi                     # register %esi = %esi + 1
21 vmovapd %ymm4,(%r11)              # Store %ymm4 into 4 C elements
22 vmovapd %ymm3,0x20(%r11)          # Store %ymm3 into 4 C elements
23 vmovapd %ymm2,0x40(%r11)          # Store %ymm2 into 4 C elements
24 vmovapd %ymm1,0x60(%r11)          # Store %ymm1 into 4 C elements

FIGURE 4.81 The x86 assembly language for the body of the nested loops generated by compiling the unrolled C code in Figure 4.80.

Elaboration: There are no pipeline stalls despite the reuse of register %ymm5 in lines<br />

9 to 17 of Figure 4.81 because the Intel Core i7 pipeline renames the registers.

Are the following statements true or false?<br />

1. The Intel Core i7 uses a multiple-issue pipeline to directly execute x86<br />

instructions.<br />

2. Both the A8 <strong>and</strong> the Core i7 use dynamic multiple issue.<br />

3. The Core i7 microarchitecture has many more registers than x86 requires.<br />

4. The Intel Core i7 uses less than half the pipeline stages of the earlier Intel<br />

Pentium 4 Prescott (see Figure 4.73).<br />

Check Yourself


4.14 Fallacies <strong>and</strong> Pitfalls 355<br />

4.14 Fallacies <strong>and</strong> Pitfalls<br />

Fallacy: Pipelining is easy.<br />

Our books testify to the subtlety of correct pipeline execution. Our advanced book<br />

had a pipeline bug in its first edition, despite its being reviewed by more than 100<br />

people <strong>and</strong> being class-tested at 18 universities. The bug was uncovered only when<br />

someone tried to build the computer in that book. The fact that the Verilog needed to
describe a pipeline like that of the Intel Core i7 runs to many thousands of lines is

an indication of the complexity. Beware!<br />

Fallacy: Pipelining ideas can be implemented independent of technology.<br />

When the number of transistors on-chip <strong>and</strong> the speed of transistors made a<br />

five-stage pipeline the best solution, the delayed branch (see the Elaboration

on page 255) was a simple solution to control hazards. With longer pipelines,<br />

superscalar execution, <strong>and</strong> dynamic branch prediction, it is now redundant. In<br />

the early 1990s, dynamic pipeline scheduling took too many resources <strong>and</strong> was<br />

not required for high performance, but as transistor budgets continued to double<br />

due to Moore's Law and logic became much faster than memory, multiple

functional units <strong>and</strong> dynamic pipelining made more sense. Today, concerns about<br />

power are leading to less aggressive designs.<br />

Pitfall: Failure to consider instruction set design can adversely impact pipelining.<br />

Many of the difficulties of pipelining arise because of instruction set complications.<br />

Here are some examples:<br />

■ Widely variable instruction lengths <strong>and</strong> running times can lead to imbalance<br />

among pipeline stages <strong>and</strong> severely complicate hazard detection in a design<br />

pipelined at the instruction set level. This problem was overcome, initially<br />

in the DEC VAX 8500 in the late 1980s, using the micro-operations <strong>and</strong><br />

micropipelined scheme that the Intel Core i7 employs today. Of course, the<br />

overhead of translation and maintaining correspondence between the micro-operations

<strong>and</strong> the actual instructions remains.<br />

■ Sophisticated addressing modes can lead to different sorts of problems.<br />

Addressing modes that update registers complicate hazard detection. Other<br />

addressing modes that require multiple memory accesses substantially<br />

complicate pipeline control <strong>and</strong> make it difficult to keep the pipeline flowing<br />

smoothly.<br />

■ Perhaps the best example is the DEC Alpha <strong>and</strong> the DEC NVAX. In<br />

comparable technology, the newer instruction set architecture of the Alpha<br />

allowed an implementation that is more than twice as fast
as the NVAX. In another example, Bhandarkar and Clark [1991] compared the

MIPS M/2000 <strong>and</strong> the DEC VAX 8700 by counting clock cycles of the SPEC<br />

benchmarks; they concluded that although the MIPS M/2000 executes more


358 Chapter 4 The Processor<br />

4.3 When processor designers consider a possible improvement to the processor<br />

datapath, the decision usually depends on the cost/performance trade-off. In<br />

the following three problems, assume that we are starting with a datapath from<br />

Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem, <strong>and</strong> Control blocks have<br />

latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, <strong>and</strong> 100 ps, respectively,<br />

<strong>and</strong> costs of 1000, 30, 10, 100, 200, 2000, <strong>and</strong> 500, respectively.<br />

Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the<br />

latency of the ALU <strong>and</strong> will add a cost of 600 to the ALU. The result will be 5% fewer<br />

instructions executed since we will no longer need to emulate the MUL instruction.<br />

4.3.1 [10] What is the clock cycle time with <strong>and</strong> without this improvement?<br />

4.3.2 [10] What is the speedup achieved by adding this improvement?<br />

4.3.3 [10] Compare the cost/performance ratio with <strong>and</strong> without this<br />

improvement.<br />

4.4 Problems in this exercise assume that logic blocks needed to implement a<br />

processor’s datapath have the following latencies:<br />

I-Mem Add Mux ALU Regs D-Mem Sign-Extend Shift-Left-2<br />

200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps<br />

4.4.1 [10] If the only thing we need to do in a processor is fetch consecutive<br />

instructions (Figure 4.6), what would the cycle time be?<br />

4.4.2 [10] Consider a datapath similar to the one in Figure 4.11, but for a<br />

processor that only has one type of instruction: unconditional PC-relative branch.<br />

What would the cycle time be for this datapath?<br />

4.4.3 [10] Repeat 4.4.2, but this time we need to support only conditional<br />

PC-relative branches.<br />

The remaining three problems in this exercise refer to the datapath element Shift-left-2:

4.4.4 [10] Which kinds of instructions require this resource?<br />

4.4.5 [20] For which kinds of instructions (if any) is this resource on the<br />

critical path?<br />

4.4.6 [10] Assuming that we only support beq <strong>and</strong> add instructions,<br />

discuss how changes in the given latency of this resource affect the cycle time of the<br />

processor. Assume that the latencies of other resources do not change.


4.17 Exercises 359<br />

4.5 For the problems in this exercise, assume that there are no pipeline stalls <strong>and</strong><br />

that the breakdown of executed instructions is as follows:<br />

add addi not beq lw sw<br />

20% 20% 0% 25% 25% 10%<br />

4.5.1 [10] In what fraction of all cycles is the data memory used?<br />

4.5.2 [10] In what fraction of all cycles is the input of the sign-extend<br />

circuit needed? What is this circuit doing in cycles in which its input is not needed?<br />

4.6 When silicon chips are fabricated, defects in materials (e.g., silicon) <strong>and</strong><br />

manufacturing errors can result in defective circuits. A very common defect is for<br />

one wire to affect the signal in another. This is called a cross-talk fault. A special<br />

class of cross-talk faults is when a signal is connected to a wire that has a constant<br />

logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a stuck-at-1

fault, <strong>and</strong> the affected signal always has a logical value of 0 or 1, respectively.<br />

The following problems refer to bit 0 of the Write Register input on the register file<br />

in Figure 4.24.<br />

4.6.1 [10] Let us assume that processor testing is done by filling the<br />

PC, registers, <strong>and</strong> data <strong>and</strong> instruction memories with some values (you can choose<br />

which values), letting a single instruction execute, then reading the PC, memories,<br />

<strong>and</strong> registers. These values are then examined to determine if a particular fault is<br />

present. Can you design a test (values for PC, memories, <strong>and</strong> registers) that would<br />

determine if there is a stuck-at-0 fault on this signal?<br />

4.6.2 [10] Repeat 4.6.1 for a stuck-at-1 fault. Can you use a single<br />

test for both stuck-at-0 <strong>and</strong> stuck-at-1? If yes, explain how; if no, explain why not.<br />

4.6.3 [60] If we know that the processor has a stuck-at-1 fault on<br />

this signal, is the processor still usable? To be usable, we must be able to convert<br />

any program that executes on a normal MIPS processor into a program that works<br />

on this processor. You can assume that there is enough free instruction memory<br />

<strong>and</strong> data memory to let you make the program longer <strong>and</strong> store additional<br />

data. Hint: the processor is usable if every instruction “broken” by this fault can<br />

be replaced with a sequence of “working” instructions that achieve the same<br />

effect.<br />

4.6.4 [10] Repeat 4.6.1, but now the fault to test for is whether the "MemRead" control signal becomes 0 when the RegDst control signal is 0 (no fault otherwise).

4.6.5 [10] Repeat 4.6.4, but now the fault to test for is whether the "Jump" control signal becomes 0 when the RegDst control signal is 0 (no fault otherwise).


360 Chapter 4 The Processor<br />

4.7 In this exercise we examine in detail how an instruction is executed in a<br />

single-cycle datapath. Problems in this exercise refer to a clock cycle in which the<br />

processor fetches the following instruction word:<br />

10101100011000100000000000010100.<br />

Assume that data memory is all zeros <strong>and</strong> that the processor’s registers have the<br />

following values at the beginning of the cycle in which the above instruction word<br />

is fetched:<br />

r0 r1 r2 r3 r4 r5 r6 r8 r12 r31<br />

0 –1 2 –3 –4 10 6 8 2 –16<br />

4.7.1 [5] What are the outputs of the sign-extend <strong>and</strong> the jump “Shift left<br />

2” unit (near the top of Figure 4.24) for this instruction word?<br />

4.7.2 [10] What are the values of the ALU control unit’s inputs for this<br />

instruction?<br />

4.7.3 [10] What is the new PC address after this instruction is executed?<br />

Highlight the path through which this value is determined.<br />

4.7.4 [10] For each Mux, show the values of its data output during the<br />

execution of this instruction <strong>and</strong> these register values.<br />

4.7.5 [10] For the ALU <strong>and</strong> the two add units, what are their data input<br />

values?<br />

4.7.6 [10] What are the values of all inputs for the “Registers” unit?<br />
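For problems such as 4.7, it can help to pull the instruction word apart mechanically. The sketch below extracts the standard MIPS I-format fields (opcode, rs, rt, and the 16-bit immediate) from the given 32-bit word, written here in hexadecimal; it only shows the bit manipulation involved, not the answers to the exercise.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* The instruction word from the problem statement, in hexadecimal. */
    uint32_t word = 0xAC620014u;   /* = 1010 1100 0110 0010 0000 0000 0001 0100 */

    unsigned opcode = (word >> 26) & 0x3F;        /* bits 31..26               */
    unsigned rs     = (word >> 21) & 0x1F;        /* bits 25..21               */
    unsigned rt     = (word >> 16) & 0x1F;        /* bits 20..16               */
    int      imm    = (int16_t)(word & 0xFFFF);   /* sign-extended low 16 bits */

    printf("opcode=%u rs=%u rt=%u imm=%d\n", opcode, rs, rt, imm);
    return 0;
}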

4.8 In this exercise, we examine how pipelining affects the clock cycle time of the<br />

processor. Problems in this exercise assume that individual stages of the datapath<br />

have the following latencies:<br />

IF ID EX MEM WB<br />

250ps 350ps 150ps 300ps 200ps<br />

Also, assume that instructions executed by the processor are broken down as<br />

follows:<br />

alu beq lw sw<br />

45% 20% 20% 15%<br />

4.8.1 [5] What is the clock cycle time in a pipelined <strong>and</strong> non-pipelined<br />

processor?<br />
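One way to set up 4.8.1 is sketched below, assuming, as is conventional in this chapter, that the non-pipelined (single-cycle) clock cycle is the sum of the stage latencies and the pipelined clock cycle is the latency of the slowest stage. The values come from the table above.

#include <stdio.h>

int main(void) {
    /* IF, ID, EX, MEM, WB latencies in picoseconds, from the table above. */
    int stage_ps[5] = { 250, 350, 150, 300, 200 };

    int sum = 0, max = 0;
    for (int i = 0; i < 5; i++) {
        sum += stage_ps[i];
        if (stage_ps[i] > max) max = stage_ps[i];
    }

    printf("non-pipelined clock cycle: %d ps\n", sum);  /* sum of all stages */
    printf("pipelined clock cycle:     %d ps\n", max);  /* slowest stage     */
    return 0;
}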

4.8.2 [10] What is the total latency of an LW instruction in a pipelined<br />

<strong>and</strong> non-pipelined processor?


4.17 Exercises 363<br />

4.10.6 [10] Assuming stall-on-branch <strong>and</strong> no delay slots, what is the new<br />

clock cycle time <strong>and</strong> execution time of this instruction sequence if beq address<br />

computation is moved to the MEM stage? What is the speedup from this change?<br />

Assume that the latency of the EX stage is reduced by 20 ps <strong>and</strong> the latency of the<br />

MEM stage is unchanged when branch outcome resolution is moved from EX to<br />

MEM.<br />

4.11 Consider the following loop.<br />

loop:lw r1,0(r1)<br />

<strong>and</strong> r1,r1,r2<br />

lw r1,0(r1)<br />

lw r1,0(r1)<br />

beq r1,r0,loop<br />

Assume that perfect branch prediction is used (no stalls due to control hazards),<br />

that there are no delay slots, <strong>and</strong> that the pipeline has full forwarding support. Also<br />

assume that many iterations of this loop are executed before the loop exits.<br />

4.11.1 [10] Show a pipeline execution diagram for the third iteration of<br />

this loop, from the cycle in which we fetch the first instruction of that iteration up<br />

to (but not including) the cycle in which we can fetch the first instruction of the<br />

next iteration. Show all instructions that are in the pipeline during these cycles (not<br />

just those from the third iteration).<br />

4.11.2 [10] How often (as a percentage of all cycles) do we have a cycle in<br />

which all five pipeline stages are doing useful work?<br />

4.12 This exercise is intended to help you underst<strong>and</strong> the cost/complexity/<br />

performance trade-offs of forwarding in a pipelined processor. Problems in this<br />

exercise refer to pipelined datapaths from Figure 4.45. These problems assume<br />

that, of all the instructions executed in a processor, the following fraction of these<br />

instructions have a particular type of RAW data dependence. The type of RAW<br />

data dependence is identified by the stage that produces the result (EX or MEM)<br />

<strong>and</strong> the instruction that consumes the result (1st instruction that follows the one<br />

that produces the result, 2nd instruction that follows, or both). We assume that the<br />

register write is done in the first half of the clock cycle <strong>and</strong> that register reads are<br />

done in the second half of the cycle, so “EX to 3rd” <strong>and</strong> “MEM to 3rd” dependences<br />

are not counted because they cannot result in data hazards. Also, assume that the<br />

CPI of the processor is 1 if there are no data hazards.<br />

EX to 1st Only   MEM to 1st Only   EX to 2nd Only   MEM to 2nd Only   EX to 1st and MEM to 2nd   Other RAW Dependences
5%               20%               5%               10%               10%                        10%


4.17 Exercises 365<br />

4.13.2 [10] Repeat 4.13.1 but now use nops only when a hazard cannot be<br />

avoided by changing or rearranging these instructions. You can assume register R7<br />

can be used to hold temporary values in your modified code.<br />

4.13.3 [10] If the processor has forwarding, but we forgot to implement<br />

the hazard detection unit, what happens when this code executes?<br />

4.13.4 [20] If there is forwarding, for the first five cycles during the<br />

execution of this code, specify which signals are asserted in each cycle by hazard<br />

detection <strong>and</strong> forwarding units in Figure 4.60.<br />

4.13.5 [10] If there is no forwarding, what new inputs <strong>and</strong> output signals<br />

do we need for the hazard detection unit in Figure 4.60? Using this instruction<br />

sequence as an example, explain why each signal is needed.<br />

4.13.6 [20] For the new hazard detection unit from 4.13.5, specify which<br />

output signals it asserts in each of the first five cycles during the execution of this<br />

code.<br />

4.14 This exercise is intended to help you underst<strong>and</strong> the relationship between<br />

delay slots, control hazards, <strong>and</strong> branch execution in a pipelined processor. In<br />

this exercise, we assume that the following MIPS code is executed on a pipelined<br />

processor with a 5-stage pipeline, full forwarding, <strong>and</strong> a predict-taken branch<br />

predictor:<br />

lw r2,0(r1)<br />

label1: beq r2,r0,label2 # not taken once, then taken<br />

lw r3,0(r2)<br />

beq r3,r0,label1 # taken<br />

add r1,r3,r1<br />

label2: sw r1,0(r2)<br />

4.14.1 [10] Draw the pipeline execution diagram for this code, assuming<br />

there are no delay slots <strong>and</strong> that branches execute in the EX stage.<br />

4.14.2 [10] Repeat 4.14.1, but assume that delay slots are used. In the<br />

given code, the instruction that follows the branch is now the delay slot instruction<br />

for that branch.<br />

4.14.3 [20] One way to move branch resolution one stage earlier is to
avoid needing an ALU operation for conditional branches. The branch instructions would

be “bez rd,label” <strong>and</strong> “bnez rd,label”, <strong>and</strong> it would branch if the register has<br />

<strong>and</strong> does not have a zero value, respectively. Change this code to use these branch<br />

instructions instead of beq. You can assume that register R8 is available for you<br />

to use as a temporary register, <strong>and</strong> that an seq (set if equal) R-type instruction can<br />

be used.


366 Chapter 4 The Processor<br />

Section 4.8 describes how the severity of control hazards can be reduced by moving<br />

branch execution into the ID stage. This approach involves a dedicated comparator<br />

in the ID stage, as shown in Figure 4.62. However, this approach potentially adds<br />

to the latency of the ID stage, <strong>and</strong> requires additional forwarding logic <strong>and</strong> hazard<br />

detection.<br />

4.14.4 [10] Using the first branch instruction in the given code as an<br />

example, describe the hazard detection logic needed to support branch execution<br />

in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed<br />

to detect?<br />

4.14.5 [10] For the given code, what is the speedup achieved by moving<br />

branch execution into the ID stage? Explain your answer. In your speedup<br />

calculation, assume that the additional comparison in the ID stage does not affect<br />

clock cycle time.<br />

4.14.6 [10] Using the first branch instruction in the given code as an<br />

example, describe the forwarding support that must be added to support branch<br />

execution in the ID stage. Compare the complexity of this new forwarding unit to<br />

the complexity of the existing forwarding unit in Figure 4.62.<br />

4.15 The importance of having a good branch predictor depends on how often<br />

conditional branches are executed. Together with branch predictor accuracy, this<br />

will determine how much time is spent stalling due to mispredicted branches. In<br />

this exercise, assume that the breakdown of dynamic instructions into various<br />

instruction categories is as follows:<br />

R-Type BEQ JMP LW SW<br />

40% 25% 5% 25% 5%<br />

Also, assume the following branch predictor accuracies:<br />

Always-Taken Always-Not-Taken 2-Bit<br />

45% 55% 85%<br />

4.15.1 [10] Stall cycles due to mispredicted branches increase the<br />

CPI. What is the extra CPI due to mispredicted branches with the always-taken<br />

predictor? Assume that branch outcomes are determined in the EX stage, that there<br />

are no data hazards, <strong>and</strong> that no delay slots are used.<br />

4.15.2 [10] Repeat 4.15.1 for the “always-not-taken” predictor.<br />

4.15.3 [10] Repeat 4.15.1 for the 2-bit predictor.

4.15.4 [10] With the 2-bit predictor, what speedup would be achieved if<br />

we could convert half of the branch instructions in a way that replaces a branch<br />

instruction with an ALU instruction? Assume that correctly <strong>and</strong> incorrectly<br />

predicted instructions have the same chance of being replaced.


4.17 Exercises 367<br />

4.15.5 [10] With the 2-bit predictor, what speedup would be achieved if<br />

we could convert half of the branch instructions in a way that replaced each branch<br />

instruction with two ALU instructions? Assume that correctly <strong>and</strong> incorrectly<br />

predicted instructions have the same chance of being replaced.<br />

4.15.6 [10] Some branch instructions are much more predictable than<br />

others. If we know that 80% of all executed branch instructions are easy-to-predict<br />

loop-back branches that are always predicted correctly, what is the accuracy of the<br />

2-bit predictor on the remaining 20% of the branch instructions?<br />
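For problems 4.15.1 through 4.15.3, the extra CPI contributed by mispredicted branches can be organized as branch frequency times misprediction rate times misprediction penalty. The sketch below leaves the penalty as a parameter, since its value depends on the stage in which branch outcomes are determined and is an assumption you must take from the problem statement; the branch frequency and predictor accuracies come from the tables above.

#include <stdio.h>

/* Extra CPI = fraction of branches * (1 - predictor accuracy) * stall cycles per miss */
static double extra_cpi(double branch_frac, double accuracy, int penalty_cycles) {
    return branch_frac * (1.0 - accuracy) * penalty_cycles;
}

int main(void) {
    double beq_frac = 0.25;   /* BEQ frequency from the instruction mix table          */
    int    penalty  = 2;      /* ASSUMPTION: stall cycles per mispredicted branch      */

    printf("always-taken:     %.4f\n", extra_cpi(beq_frac, 0.45, penalty));
    printf("always-not-taken: %.4f\n", extra_cpi(beq_frac, 0.55, penalty));
    printf("2-bit:            %.4f\n", extra_cpi(beq_frac, 0.85, penalty));
    return 0;
}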

4.16 This exercise examines the accuracy of various branch predictors for the<br />

following repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT<br />

4.16.1 [5] What is the accuracy of always-taken <strong>and</strong> always-not-taken<br />

predictors for this sequence of branch outcomes?<br />

4.16.2 [5] What is the accuracy of the two-bit predictor for the first 4<br />

branches in this pattern, assuming that the predictor starts off in the bottom left<br />

state from Figure 4.63 (predict not taken)?<br />

4.16.3 [10] What is the accuracy of the two-bit predictor if this pattern is<br />

repeated forever?<br />

4.16.4 [30] <strong>Design</strong> a predictor that would achieve a perfect accuracy if<br />

this pattern is repeated forever. Your predictor should be a sequential circuit with

one output that provides a prediction (1 for taken, 0 for not taken) <strong>and</strong> no inputs<br />

other than the clock <strong>and</strong> the control signal that indicates that the instruction is a<br />

conditional branch.<br />

4.16.5 [10] What is the accuracy of your predictor from 4.16.4 if it is<br />

given a repeating pattern that is the exact opposite of this one?<br />

4.16.6 [20] Repeat 4.16.4, but now your predictor should be able to<br />

eventually (after a warm-up period during which it can make wrong predictions)<br />

start perfectly predicting both this pattern <strong>and</strong> its opposite. Your predictor should<br />

have an input that tells it what the real outcome was. Hint: this input lets your<br />

predictor determine which of the two repeating patterns it is given.<br />
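For exercise 4.16, a two-bit saturating-counter predictor is easy to simulate in a few lines of C. The sketch below measures its accuracy on the repeating T, NT, T, T, NT pattern. The state numbering (0 = strongly not taken through 3 = strongly taken) and the initial state are assumptions of the sketch; set the initial state and the number of repetitions to match the particular sub-problem and the state specified from Figure 4.63.

#include <stdio.h>

int main(void) {
    /* Counter states: 0 and 1 predict not taken; 2 and 3 predict taken. */
    int state = 0;                        /* ASSUMED start: strongly not taken */
    int pattern[5] = { 1, 0, 1, 1, 0 };   /* T, NT, T, T, NT                   */
    int correct = 0, total = 0;

    for (int rep = 0; rep < 1000; rep++) {
        for (int i = 0; i < 5; i++) {
            int outcome    = pattern[i];
            int prediction = (state >= 2);          /* taken in states 2 and 3 */
            if (prediction == outcome) correct++;
            total++;
            /* Saturating update: move toward taken on taken, else toward not taken. */
            if (outcome) { if (state < 3) state++; }
            else         { if (state > 0) state--; }
        }
    }
    printf("accuracy over %d branches: %.1f%%\n", total, 100.0 * correct / total);
    return 0;
}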

4.17 This exercise explores how exception h<strong>and</strong>ling affects pipeline design. The<br />

first three problems in this exercise refer to the following two instructions:<br />

Instruction 1: BNE R1, R2, Label
Instruction 2: LW  R1, 0(R1)

4.17.1 [5] Which exceptions can each of these instructions trigger? For<br />

each of these exceptions, specify the pipeline stage in which it is detected.


368 Chapter 4 The Processor<br />

4.17.2 [10] If there is a separate h<strong>and</strong>ler address for each exception, show<br />

how the pipeline organization must be changed to be able to h<strong>and</strong>le this exception.<br />

You can assume that the addresses of these h<strong>and</strong>lers are known when the processor<br />

is designed.<br />

4.17.3 [10] If the second instruction is fetched right after the first<br />

instruction, describe what happens in the pipeline when the first instruction causes<br />

the first exception you listed in 4.17.1. Show the pipeline execution diagram from<br />

the time the first instruction is fetched until the time the first instruction of the<br />

exception h<strong>and</strong>ler is completed.<br />

4.17.4 [20] In vectored exception h<strong>and</strong>ling, the table of exception h<strong>and</strong>ler<br />

addresses is in data memory at a known (fixed) address. Change the pipeline to<br />

implement this exception h<strong>and</strong>ling mechanism. Repeat 4.17.3 using this modified<br />

pipeline <strong>and</strong> vectored exception h<strong>and</strong>ling.<br />

4.17.5 [15] We want to emulate vectored exception h<strong>and</strong>ling (described<br />

in 4.17.4) on a machine that has only one fixed h<strong>and</strong>ler address. Write the code<br />

that should be at that fixed address. Hint: this code should identify the exception,<br />

get the right address from the exception vector table, <strong>and</strong> transfer execution to that<br />

h<strong>and</strong>ler.<br />

4.18 In this exercise we compare the performance of 1-issue <strong>and</strong> 2-issue<br />

processors, taking into account program transformations that can be made to<br />

optimize for 2-issue execution. Problems in this exercise refer to the following loop<br />

(written in C):<br />

for(i=0;i!=j;i+=2)<br />

b[i]=a[i]-a[i+1];

When writing MIPS code, assume that variables are kept in registers as follows, <strong>and</strong><br />

that all registers except those indicated as Free are used to keep various variables,<br />

so they cannot be used for anything else.<br />

i j a b c Free<br />

R5 R6 R1 R2 R3 R10, R11, R12<br />

4.18.1 [10] Translate this C code into MIPS instructions. Your translation<br />

should be direct, without rearranging instructions to achieve better performance.<br />

4.18.2 [10] If the loop exits after executing only two iterations, draw a<br />

pipeline diagram for your MIPS code from 4.18.1 executed on a 2-issue processor<br />

shown in Figure 4.69. Assume the processor has perfect branch prediction <strong>and</strong> can<br />

fetch any two instructions (not just consecutive instructions) in the same cycle.<br />

4.18.3 [10] Rearrange your code from 4.18.1 to achieve better<br />

performance on a 2-issue statically scheduled processor from Figure 4.69.


4.17 Exercises 369<br />

4.18.4 [10] Repeat 4.18.2, but this time use your MIPS code from 4.18.3.<br />

4.18.5 [10] What is the speedup of going from a 1-issue processor to<br />

a 2-issue processor from Figure 4.69? Use your code from 4.18.1 for both 1-issue<br />

<strong>and</strong> 2-issue, <strong>and</strong> assume that 1,000,000 iterations of the loop are executed. As in<br />

4.18.2, assume that the processor has perfect branch predictions, <strong>and</strong> that a 2-issue<br />

processor can fetch any two instructions in the same cycle.<br />

4.18.6 [10] Repeat 4.18.5, but this time assume that in the 2-issue<br />

processor one of the instructions to be executed in a cycle can be of any kind, <strong>and</strong><br />

the other must be a non-memory instruction.<br />

4.19 This exercise explores energy efficiency <strong>and</strong> its relationship with performance.<br />

Problems in this exercise assume the following energy consumption for activity in<br />

Instruction memory, Registers, <strong>and</strong> Data memory. You can assume that the other<br />

components of the datapath spend a negligible amount of energy.<br />

I-Mem 1 Register Read Register Write D-Mem Read D-Mem Write<br />

140pJ 70pJ 60pJ 140pJ 120pJ<br />

Assume that components in the datapath have the following latencies. You can<br />

assume that the other components of the datapath have negligible latencies.<br />

I-Mem Control Register Read or Write ALU D-Mem Read or Write<br />

200ps 150ps 90ps 90ps 250ps<br />

4.19.1 [10] How much energy is spent to execute an ADD<br />

instruction in a single-cycle design <strong>and</strong> in the 5-stage pipelined design?<br />

4.19.2 [10] What is the worst-case MIPS instruction in terms of<br />

energy consumption, <strong>and</strong> what is the energy spent to execute it?<br />

4.19.3 [10] If energy reduction is paramount, how would you<br />

change the pipelined design? What is the percentage reduction in the energy spent<br />

by an LW instruction after this change?<br />

4.19.4 [10] What is the performance impact of your changes from<br />

4.19.3?<br />

4.19.5 [10] We can eliminate the MemRead control signal <strong>and</strong> have<br />

the data memory be read in every cycle, i.e., we can permanently have MemRead=1.<br />

Explain why the processor still functions correctly after this change. What is the<br />

effect of this change on clock frequency <strong>and</strong> energy consumption?<br />

4.19.6 [10] If an idle unit spends 10% of the power it would spend<br />

if it were active, what is the energy spent by the instruction memory in each cycle?<br />

What percentage of the overall energy spent by the instruction memory does this<br />

idle energy represent?


370 Chapter 4 The Processor<br />

Answers to<br />

Check Yourself<br />

§4.1, page 248: 3 of 5: Control, Datapath, Memory. Input <strong>and</strong> Output are missing.<br />

§4.2, page 251: false. Edge-triggered state elements make simultaneous reading <strong>and</strong><br />

writing both possible <strong>and</strong> unambiguous.<br />

§4.3, page 257: I. a. II. c.<br />

§4.4, page 272: Yes, Branch <strong>and</strong> ALUOp0 are identical. In addition, MemtoReg <strong>and</strong><br />

RegDst are inverses of one another. You don’t need an inverter; simply use the other<br />

signal <strong>and</strong> flip the order of the inputs to the multiplexor!<br />

§4.5, page 285: 1. Stall on the lw result. 2. Bypass the first add result written into

$t1. 3. No stall or bypass required.<br />

§4.6, page 298: Statements 2 <strong>and</strong> 4 are correct; the rest are incorrect.<br />

§4.8, page 324: 1. Predict not taken. 2. Predict taken. 3. Dynamic prediction.<br />

§4.9, page 332: The first instruction, since it is logically executed before the others.<br />

§4.10, page 344: 1. Both. 2. Both. 3. Software. 4. Hardware. 5. Hardware. 6.<br />

Hardware. 7. Both. 8. Hardware. 9. Both.<br />

§4.11, page 353: First two are false <strong>and</strong> the last two are true.


This page intentionally left blank


5<br />

Ideally one would desire an<br />

indefinitely large memory<br />

capacity such that any<br />

particular … word would be<br />

immediately available. … We<br />

are … forced to recognize the<br />

possibility of constructing a<br />

hierarchy of memories, each<br />

of which has greater capacity<br />

than the preceding but which<br />

is less quickly accessible.<br />

A. W. Burks, H. H. Goldstine, <strong>and</strong><br />

J. von Neumann<br />

Preliminary Discussion of the Logical <strong>Design</strong> of an<br />

Electronic Computing Instrument, 1946<br />

Large <strong>and</strong> Fast:<br />

Exploiting Memory<br />

Hierarchy<br />

5.1 Introduction 374<br />

5.2 Memory Technologies 378<br />

5.3 The Basics of Caches 383<br />

5.4 Measuring <strong>and</strong> Improving Cache<br />

Performance 398<br />

5.5 Dependable Memory Hierarchy 418<br />

5.6 Virtual Machines 424<br />

5.7 Virtual Memory 427<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


5.1 Introduction 375<br />

[Figure omitted: the memory hierarchy drawn as a pyramid. Moving down from the processor, speed goes from fastest to slowest, size from smallest to biggest, and cost per bit from highest to lowest; the current technologies at successive levels are SRAM, DRAM, and magnetic disk.]

FIGURE 5.1 The basic structure of a memory hierarchy. By implementing the memory system as a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many personal mobile devices, and may lead to a new level in the storage hierarchy for desktop and server computers; see Section 5.2.

Just as accesses to books on the desk naturally exhibit locality, locality in<br />

programs arises from simple <strong>and</strong> natural program structures. For example,<br />

most programs contain loops, so instructions <strong>and</strong> data are likely to be accessed<br />

repeatedly, showing high amounts of temporal locality. Since instructions are<br />

normally accessed sequentially, programs also show high spatial locality. Accesses<br />

to data also exhibit a natural spatial locality. For example, sequential accesses to<br />

elements of an array or a record will naturally have high degrees of spatial locality.<br />
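A small C fragment makes the two kinds of locality concrete: the running sum is reused on every iteration (temporal locality), while the array elements are touched at adjacent memory locations one after another (spatial locality).

/* Summing an array: the variable sum exhibits temporal locality (reused
   every iteration), and the accesses a[0], a[1], a[2], ... exhibit spatial
   locality (consecutive words in memory). The loop's instructions are also
   fetched repeatedly, giving temporal locality in the instruction stream.  */
double sum_array(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}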

We take advantage of the principle of locality by implementing the memory<br />

of a computer as a memory hierarchy. A memory hierarchy consists of multiple<br />

levels of memory with different speeds <strong>and</strong> sizes. The faster memories are more<br />

expensive per bit than the slower memories <strong>and</strong> thus are smaller.<br />

Figure 5.1 shows the faster memory is close to the processor <strong>and</strong> the slower,<br />

less expensive memory is below it. The goal is to present the user with as much<br />

memory as is available in the cheapest technology, while providing access at the<br />

speed offered by the fastest memory.<br />

The data is similarly hierarchical: a level closer to the processor is generally a<br />

subset of any level further away, <strong>and</strong> all the data is stored at the lowest level. By<br />

analogy, the books on your desk form a subset of the library you are working in,<br />

which is in turn a subset of all the libraries on campus. Furthermore, as we move<br />

away from the processor, the levels take progressively longer to access, just as we<br />

might encounter in a hierarchy of campus libraries.<br />

A memory hierarchy can consist of multiple levels, but data is copied between<br />

only two adjacent levels at a time, so we can focus our attention on just two levels.<br />

memory hierarchy A structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.


376 Chapter 5 Large <strong>and</strong> Fast: Exploiting Memory Hierarchy<br />


block (or line) The minimum unit of information that can be either present or not present in a cache.

hit rate The fraction of memory accesses found in a level of the memory hierarchy.

miss rate The fraction of memory accesses not found in a level of the memory hierarchy.

hit time The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.

miss penalty The time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, insert it in the level that experienced the miss, and then pass the block to the requestor.

FIGURE 5.2 Every pair of levels in the memory hierarchy can be thought of as having an<br />

upper <strong>and</strong> lower level. Within each level, the unit of information that is present or not is called a block or<br />

a line. Usually we transfer an entire block when we copy something between levels.<br />

The upper level—the one closer to the processor—is smaller <strong>and</strong> faster than the lower<br />

level, since the upper level uses technology that is more expensive. Figure 5.2 shows<br />

that the minimum unit of information that can be either present or not present in<br />

the two-level hierarchy is called a block or a line; in our library analogy, a block of<br />

information is one book.<br />

If the data requested by the processor appears in some block in the upper level,<br />

this is called a hit (analogous to your finding the information in one of the books<br />

on your desk). If the data is not found in the upper level, the request is called a miss.<br />

The lower level in the hierarchy is then accessed to retrieve the block containing the<br />

requested data. (Continuing our analogy, you go from your desk to the shelves to<br />

find the desired book.) The hit rate, or hit ratio, is the fraction of memory accesses<br />

found in the upper level; it is often used as a measure of the performance of the<br />

memory hierarchy. The miss rate (1−hit rate) is the fraction of memory accesses<br />

not found in the upper level.<br />

Since performance is the major reason for having a memory hierarchy, the time<br />

to service hits <strong>and</strong> misses is important. Hit time is the time to access the upper level<br />

of the memory hierarchy, which includes the time needed to determine whether<br />

the access is a hit or a miss (that is, the time needed to look through the books on<br />

the desk). The miss penalty is the time to replace a block in the upper level with<br />

the corresponding block from the lower level, plus the time to deliver this block to<br />

the processor (or the time to get another book from the shelves <strong>and</strong> place it on the<br />

desk). Because the upper level is smaller <strong>and</strong> built using faster memory parts, the<br />

hit time will be much smaller than the time to access the next level in the hierarchy,<br />

which is the major component of the miss penalty. (The time to examine the books<br />

on the desk is much smaller than the time to get up <strong>and</strong> get a new book from the<br />

shelves.)
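To make these definitions concrete, here is a small C sketch showing how hit time, miss rate, and miss penalty combine into an average access time; the numbers are made-up, illustrative values rather than figures from the text.

#include <stdio.h>

/* Illustrative sketch: how hit time, miss rate, and miss penalty combine.
   The values below are assumptions chosen only for the example. */
int main(void) {
    double hit_time_ns     = 1.0;    /* time to check and read the upper level */
    double miss_penalty_ns = 100.0;  /* time to fetch a block from the lower level */
    double hit_rate        = 0.97;   /* fraction of accesses found in the upper level */
    double miss_rate       = 1.0 - hit_rate;

    /* Every access pays the hit time; misses additionally pay the miss penalty. */
    double avg_access_ns = hit_time_ns + miss_rate * miss_penalty_ns;
    printf("miss rate = %.2f, average access time = %.1f ns\n", miss_rate, avg_access_ns);
    return 0;
}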



As we will see in this chapter, the concepts used to build memory systems affect<br />

many other aspects of a computer, including how the operating system manages<br />

memory <strong>and</strong> I/O, how compilers generate code, <strong>and</strong> even how applications use<br />

the computer. Of course, because all programs spend much of their time accessing<br />

memory, the memory system is necessarily a major factor in determining<br />

performance. The reliance on memory hierarchies to achieve performance<br />

has meant that programmers, who used to be able to think of memory as a flat,<br />

r<strong>and</strong>om access storage device, now need to underst<strong>and</strong> that memory is a hierarchy<br />

to get good performance. We show how important this underst<strong>and</strong>ing is in later<br />

examples, such as Figure 5.18 on page 408, <strong>and</strong> Section 5.14, which shows how to<br />

double matrix multiply performance.<br />

Since memory systems are critical to performance, computer designers devote a<br />

great deal of attention to these systems <strong>and</strong> develop sophisticated mechanisms for<br />

improving the performance of the memory system. In this chapter, we discuss the<br />

major conceptual ideas, although we use many simplifications <strong>and</strong> abstractions to<br />

keep the material manageable in length <strong>and</strong> complexity.<br />

Programs exhibit both temporal locality, the tendency to reuse recently<br />

accessed data items, <strong>and</strong> spatial locality, the tendency to reference data<br />

items that are close to other recently accessed items. Memory hierarchies<br />

take advantage of temporal locality by keeping more recently accessed<br />

data items closer to the processor. Memory hierarchies take advantage of<br />

spatial locality by moving blocks consisting of multiple contiguous words<br />

in memory to upper levels of the hierarchy.<br />

Figure 5.3 shows that a memory hierarchy uses smaller <strong>and</strong> faster<br />

memory technologies close to the processor. Thus, accesses that hit in the<br />

highest level of the hierarchy can be processed quickly. Accesses that miss<br />

go to lower levels of the hierarchy, which are larger but slower. If the hit<br />

rate is high enough, the memory hierarchy has an effective access time<br />

close to that of the highest (<strong>and</strong> fastest) level <strong>and</strong> a size equal to that of the<br />

lowest (<strong>and</strong> largest) level.<br />

In most systems, the memory is a true hierarchy, meaning that data<br />

cannot be present in level i unless it is also present in level i + 1.

The BIG<br />

Picture<br />

Which of the following statements are generally true?<br />

1. Memory hierarchies take advantage of temporal locality.<br />

2. On a read, the value returned depends on which blocks are in the cache.<br />

3. Most of the cost of the memory hierarchy is at the highest level.<br />

4. Most of the capacity of the memory hierarchy is at the lowest level.<br />

Check<br />

Yourself


5.2 Memory Technologies

SRAM Technology<br />

SRAMs are simply integrated circuits that are memory arrays with (usually) a<br />

single access port that can provide either a read or a write. SRAMs have a fixed<br />

access time to any datum, though the read <strong>and</strong> write access times may differ.<br />

SRAMs don’t need to refresh <strong>and</strong> so the access time is very close to the cycle<br />

time. SRAMs typically use six to eight transistors per bit to prevent the information<br />

from being disturbed when read. SRAM needs only minimal power to retain the<br />

charge in st<strong>and</strong>by mode.<br />

In the past, most PCs <strong>and</strong> server systems used separate SRAM chips for either<br />

their primary, secondary, or even tertiary caches. Today, thanks to Moore’s Law, all<br />

levels of caches are integrated onto the processor chip, so the market for separate<br />

SRAM chips has nearly evaporated.<br />

DRAM Technology<br />

In a SRAM, as long as power is applied, the value can be kept indefinitely. In a<br />

dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor.<br />

A single transistor is then used to access this stored charge, either to read the<br />

value or to overwrite the charge stored there. Because DRAMs use only a single<br />

transistor per bit of storage, they are much denser <strong>and</strong> cheaper per bit than SRAM.<br />

As DRAMs store the charge on a capacitor, it cannot be kept indefinitely <strong>and</strong> must<br />

periodically be refreshed. That is why this memory structure is called dynamic, as<br />

opposed to the static storage in an SRAM cell.<br />

To refresh the cell, we merely read its contents <strong>and</strong> write it back. The charge<br />

can be kept for several milliseconds. If every bit had to be read out of the DRAM<br />

<strong>and</strong> then written back individually, we would constantly be refreshing the DRAM,<br />

leaving no time for accessing it. Fortunately, DRAMs use a two-level decoding<br />

structure, <strong>and</strong> this allows us to refresh an entire row (which shares a word line)<br />

with a read cycle followed immediately by a write cycle.<br />

Figure 5.4 shows the internal organization of a DRAM, <strong>and</strong> Figure 5.5 shows<br />

how the density, cost, <strong>and</strong> access time of DRAMs have changed over the years.<br />

The row organization that helps with refresh also helps with performance. To<br />

improve performance, DRAMs buffer rows for repeated access. The buffer acts<br />

like an SRAM; by changing the address, r<strong>and</strong>om bits can be accessed in the buffer<br />

until the next row access. This capability improves the access time significantly,<br />

since the access time to bits in the row is much lower. Making the chip wider also<br />

improves the memory b<strong>and</strong>width of the chip. When the row is in the buffer, it<br />

can be transferred by successive addresses at whatever the width of the DRAM is<br />

(typically 4, 8, or 16 bits), or by specifying a block transfer <strong>and</strong> the starting address<br />

within the buffer.<br />

To further improve the interface to processors, DRAMs added clocks <strong>and</strong> are<br />

properly called Synchronous DRAMs or SDRAMs. The advantage of SDRAMs<br />

is that the use of a clock eliminates the time for the memory <strong>and</strong> processor to<br />

synchronize. The speed advantage of synchronous DRAMs comes from the ability<br />

to transfer the bits in the burst without having to specify additional address bits.



write from multiple banks, with each having its own row buffer. Sending an address<br />

to several banks permits them all to read or write simultaneously. For example,<br />

with four banks, there is just one access time <strong>and</strong> then accesses rotate between<br />

the four banks to supply four times the b<strong>and</strong>width. This rotating access scheme is<br />

called address interleaving.<br />

Although Personal Mobile Devices like the iPad (see Chapter 1) use individual<br />

DRAMs, memory for servers is commonly sold on small boards called dual inline

memory modules (DIMMs). DIMMs typically contain 4–16 DRAMs, <strong>and</strong> they are<br />

normally organized to be 8 bytes wide for server systems. A DIMM using DDR4-<br />

3200 SDRAMs could transfer at 8 × 3200 = 25,600 megabytes per second. Such

DIMMs are named after their b<strong>and</strong>width: PC25600. Since a DIMM can have so<br />

many DRAM chips that only a portion of them are used for a particular transfer, we<br />

need a term to refer to the subset of chips in a DIMM that share common address<br />

lines. To avoid confusion with the internal DRAM names of row <strong>and</strong> banks, we use<br />

the term memory rank for such a subset of chips in a DIMM.<br />
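The bandwidth naming above is just arithmetic on the DIMM width and transfer rate; a minimal C sketch of that calculation for the DDR4-3200 example follows.

#include <stdio.h>

/* Sketch of the DIMM bandwidth arithmetic in the text:
   an 8-byte-wide DIMM of DDR4-3200 parts transfers 8 x 3200 = 25,600 MB/s. */
int main(void) {
    int width_bytes          = 8;     /* DIMMs for servers are normally 8 bytes wide */
    int transfers_per_second = 3200;  /* millions of transfers per second (DDR4-3200) */
    int megabytes_per_second = width_bytes * transfers_per_second;
    printf("PC%d: %d MB/s\n", megabytes_per_second, megabytes_per_second);
    return 0;
}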

Elaboration: One way to measure the performance of the memory system behind the<br />

caches is the Stream benchmark [McCalpin, 1995]. It measures the performance of<br />

long vector operations. They have no temporal locality <strong>and</strong> they access arrays that are<br />

larger than the cache of the computer being tested.<br />

Flash Memory<br />

Flash memory is a type of electrically erasable programmable read-only memory<br />

(EEPROM).<br />

Unlike disks <strong>and</strong> DRAM, but like other EEPROM technologies, writes can wear out<br />

flash memory bits. To cope with such limits, most flash products include a controller<br />

to spread the writes by remapping blocks that have been written many times to less<br />

trodden blocks. This technique is called wear leveling. With wear leveling, personal<br />

mobile devices are very unlikely to exceed the write limits in the flash. Such wear<br />

leveling lowers the potential performance of flash, but it is needed unless higher-level

software monitors block wear. Flash controllers that perform wear leveling can<br />

also improve yield by mapping out memory cells that were manufactured incorrectly.<br />

Disk Memory<br />

As Figure 5.6 shows, a magnetic hard disk consists of a collection of platters, which<br />

rotate on a spindle at 5400 to 15,000 revolutions per minute. The metal platters are<br />

covered with magnetic recording material on both sides, similar to the material found<br />

on a cassette or videotape. To read <strong>and</strong> write information on a hard disk, a movable arm<br />

containing a small electromagnetic coil called a read-write head is located just above<br />

each surface. The entire drive is permanently sealed to control the environment inside<br />

the drive, which, in turn, allows the disk heads to be much closer to the drive surface.<br />

Each disk surface is divided into concentric circles, called tracks. There are<br />

typically tens of thous<strong>and</strong>s of tracks per surface. Each track is in turn divided into<br />

track One of thousands of concentric circles that makes up the surface of a magnetic disk.



sector One of the segments that make up a track on a magnetic disk; a sector is the smallest amount of information that is read or written on a disk.

sectors that contain the information; each track may have thous<strong>and</strong>s of sectors.<br />

Sectors are typically 512 to 4096 bytes in size. The sequence recorded on the<br />

magnetic media is a sector number, a gap, the information for that sector including<br />

error correction code (see Section 5.5), a gap, the sector number of the next sector,<br />

<strong>and</strong> so on.<br />

The disk heads for each surface are connected together <strong>and</strong> move in conjunction,<br />

so that every head is over the same track of every surface. The term cylinder is used<br />

to refer to all the tracks under the heads at a given point on all surfaces.<br />

FIGURE 5.6 A disk showing 10 disk platters <strong>and</strong> the read/write heads. The diameter of<br />

today’s disks is 2.5 or 3.5 inches, <strong>and</strong> there are typically one or two platters per drive today.<br />

seek The process of<br />

positioning a read/write<br />

head over the proper<br />

track on a disk.<br />

To access data, the operating system must direct the disk through a three-stage<br />

process. The first step is to position the head over the proper track. This operation is<br />

called a seek, <strong>and</strong> the time to move the head to the desired track is called the seek time.<br />

Disk manufacturers report minimum seek time, maximum seek time, <strong>and</strong> average<br />

seek time in their manuals. The first two are easy to measure, but the average is open to<br />

wide interpretation because it depends on the seek distance. The industry calculates<br />

average seek time as the sum of the time for all possible seeks divided by the number<br />

of possible seeks. Average seek times are usually advertised as 3 ms to 13 ms, but,<br />

depending on the application <strong>and</strong> scheduling of disk requests, the actual average seek<br />

time may be only 25% to 33% of the advertised number because of locality of disk



references. This locality arises both because of successive accesses to the same file <strong>and</strong><br />

because the operating system tries to schedule such accesses together.<br />

Once the head has reached the correct track, we must wait for the desired sector<br />

to rotate under the read/write head. This time is called the rotational latency or<br />

rotational delay. The average latency to the desired information is halfway around<br />

the disk. Disks rotate at 5400 RPM to 15,000 RPM. The average rotational latency<br />

at 5400 RPM is<br />

Average rotational latency = 0.5 rotation / 5400 RPM
                           = 0.5 rotation / (5400 RPM / (60 seconds/minute))
                           = 0.0056 seconds = 5.6 ms

rotational latency Also called rotational delay. The time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time.

The last component of a disk access, transfer time, is the time to transfer a block<br />

of bits. The transfer time is a function of the sector size, the rotation speed, <strong>and</strong> the<br />

recording density of a track. Transfer rates in 2012 were between 100 <strong>and</strong> 200 MB/sec.<br />

One complication is that most disk controllers have a built-in cache that stores<br />

sectors as they are passed over; transfer rates from the cache are typically higher,<br />

<strong>and</strong> were up to 750 MB/sec (6 Gbit/sec) in 2012.<br />
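As a rough illustration of how these components add up, the following C sketch computes the average rotational latency at 5400 RPM and a total access time; the seek time, sustained transfer rate, and controller overhead used here are assumed example values, not figures from the text.

#include <stdio.h>

/* Sketch of disk access time: average rotational latency is half a rotation,
   and the total adds seek, transfer, and controller overhead (assumed values). */
int main(void) {
    double rpm = 5400.0;
    double avg_rotational_ms = 0.5 / (rpm / 60.0) * 1000.0;   /* = 5.6 ms at 5400 RPM */

    double seek_ms              = 4.0;      /* assumed measured average seek */
    double sector_bytes         = 4096.0;
    double transfer_bytes_per_s = 150.0e6;  /* assumed sustained media transfer rate */
    double transfer_ms          = sector_bytes / transfer_bytes_per_s * 1000.0;
    double controller_ms        = 0.2;      /* assumed controller overhead */

    printf("rotational latency %.1f ms, total access %.2f ms\n",
           avg_rotational_ms,
           seek_ms + avg_rotational_ms + transfer_ms + controller_ms);
    return 0;
}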

Alas, where block numbers are located is no longer intuitive. The assumptions of<br />

the sector-track-cylinder model above are that nearby blocks are on the same track,<br />

blocks in the same cylinder take less time to access since there is no seek time,<br />

<strong>and</strong> some tracks are closer than others. The reason for the change was the raising<br />

of the level of the disk interfaces. To speed up sequential transfers, these higher-level

interfaces organize disks more like tapes than like r<strong>and</strong>om access devices.<br />

The logical blocks are ordered in serpentine fashion across a single surface, trying<br />

to capture all the sectors that are recorded at the same bit density to try to get best<br />

performance. Hence, sequential blocks may be on different tracks.<br />

In summary, the two primary differences between magnetic disks <strong>and</strong><br />

semiconductor memory technologies are that disks have a slower access time because<br />

they are mechanical devices—flash is 1000 times as fast <strong>and</strong> DRAM is 100,000 times<br />

as fast—yet they are cheaper per bit because they have very high storage capacity at a<br />

modest cost—disk is 10 to 100 times cheaper. Magnetic disks are nonvolatile like flash,

but unlike flash there is no write wear-out problem. However, flash is much more<br />

rugged <strong>and</strong> hence a better match to the jostling inherent in personal mobile devices.<br />

5.3 The Basics of Caches<br />

In our library example, the desk acted as a cache—a safe place to store things<br />

(books) that we needed to examine. Cache was the name chosen to represent the<br />

level of the memory hierarchy between the processor <strong>and</strong> main memory in the first<br />

commercial computer to have this extra level. The memories in the datapath in<br />

Chapter 4 are simply replaced by caches. Today, although this remains the dominant<br />

Cache: a safe place<br />

for hiding or storing<br />

things.<br />

Webster’s New World<br />

Dictionary of the<br />

American Language,<br />

Third College Edition,<br />

1988



direct-mapped cache<br />

A cache structure in<br />

which each memory<br />

location is mapped to<br />

exactly one location in the<br />

cache.<br />

use of the word cache, the term is also used to refer to any storage managed to take<br />

advantage of locality of access. Caches first appeared in research computers in the<br />

early 1960s and in production computers later in that same decade; every general-purpose

computer built today, from servers to low-power embedded processors,<br />

includes caches.<br />

In this section, we begin by looking at a very simple cache in which the processor<br />

requests are each one word <strong>and</strong> the blocks also consist of a single word. (Readers<br />

already familiar with cache basics may want to skip to Section 5.4.) Figure 5.7 shows<br />

such a simple cache, before <strong>and</strong> after requesting a data item that is not initially in<br />

the cache. Before the request, the cache contains a collection of recent references<br />

X1, X2, …, Xn−1, and the processor requests a word Xn that is not in the cache. This request results in a miss, and the word Xn is brought from memory into the cache.

In looking at the scenario in Figure 5.7, there are two questions to answer: How<br />

do we know if a data item is in the cache? Moreover, if it is, how do we find it? The<br />

answers are related. If each word can go in exactly one place in the cache, then it<br />

is straightforward to find the word if it is in the cache. The simplest way to assign<br />

a location in the cache for each word in memory is to assign the cache location<br />

based on the address of the word in memory. This cache structure is called direct<br />

mapped, since each memory location is mapped directly to exactly one location in<br />

the cache. The typical mapping between addresses <strong>and</strong> cache locations for a directmapped<br />

cache is usually simple. For example, almost all direct-mapped caches use<br />

this mapping to find a block:<br />

(Block address) modulo (Number of blocks in the cache)<br />

tag A field in a table used<br />

for a memory hierarchy<br />

that contains the address<br />

information required<br />

to identify whether the<br />

associated block in the<br />

hierarchy corresponds to<br />

a requested word.<br />

If the number of entries in the cache is a power of 2, then modulo can be computed simply by using the low-order log2 (cache size in blocks) bits of the address. Thus, an 8-block cache uses the three lowest bits (8 = 2^3) of the block address. For example, Figure 5.8 shows how the memory addresses between 1 (00001 in binary) and 29 (11101 in binary) map to locations 1 (001 in binary) and 5 (101 in binary) in a direct-mapped cache of eight words.
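A small C sketch of this mapping for the eight-block cache of Figure 5.8; the sample block addresses are chosen only for illustration.

#include <stdio.h>

/* Direct-mapped placement: the low-order log2(8) = 3 bits of the block address
   select the cache index, and the remaining upper bits form the tag. */
int main(void) {
    unsigned cache_blocks = 8;                   /* must be a power of 2 */
    unsigned addresses[]  = {1, 21, 29};         /* example word (block) addresses */
    int n = sizeof(addresses) / sizeof(addresses[0]);

    for (int i = 0; i < n; i++) {
        unsigned index = addresses[i] % cache_blocks;  /* keep the low-order bits */
        unsigned tag   = addresses[i] / cache_blocks;  /* the remaining upper bits */
        printf("block address %2u -> cache index %u, tag %u\n",
               addresses[i], index, tag);
    }
    return 0;
}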

Because each cache location can contain the contents of a number of different<br />

memory locations, how do we know whether the data in the cache corresponds<br />

to a requested word? That is, how do we know whether a requested word is in the<br />

cache or not? We answer this question by adding a set of tags to the cache. The<br />

tags contain the address information required to identify whether a word in the<br />

cache corresponds to the requested word. The tag needs only to contain the upper<br />

portion of the address, corresponding to the bits that are not used as an index into<br />

the cache. For example, in Figure 5.8 we need only have the upper 2 of the 5 address<br />

bits in the tag, since the lower 3-bit index field of the address selects the block.<br />

Architects omit the index bits because they are redundant, since by definition the<br />

index field of any address of a cache block must be that block number.<br />

We also need a way to recognize that a cache block does not have valid<br />

information. For instance, when a processor starts up, the cache does not have good<br />

data, <strong>and</strong> the tag fields will be meaningless. Even after executing many instructions,



we have conflicting demands for a block. The word at address 18 (10010 in binary) should be brought into cache block 2 (010 in binary). Hence, it must replace the word at address 26 (11010 in binary), which is already in cache block 2 (010 in binary). This behavior allows a

cache to take advantage of temporal locality: recently referenced words replace less<br />

recently referenced words.<br />

This situation is directly analogous to needing a book from the shelves <strong>and</strong><br />

having no more space on your desk—some book already on your desk must be<br />

returned to the shelves. In a direct-mapped cache, there is only one place to put the<br />

newly requested item <strong>and</strong> hence only one choice of what to replace.<br />

We know where to look in the cache for each possible address: the low-order bits<br />

of an address can be used to find the unique cache entry to which the address could<br />

map. Figure 5.10 shows how a referenced address is divided into<br />

■ A tag field, which is used to compare with the value of the tag field of the<br />

cache<br />

■ A cache index, which is used to select the block<br />

The index of a cache block, together with the tag contents of that block, uniquely<br />

specifies the memory address of the word contained in the cache block. Because<br />

the index field is used as an address to reference the cache, <strong>and</strong> because an n-bit<br />

field has 2^n values, the total number of entries in a direct-mapped cache must be a

power of 2. In the MIPS architecture, since words are aligned to multiples of four<br />

bytes, the least significant two bits of every address specify a byte within a word.<br />

Hence, the least significant two bits are ignored when selecting a word in the block.<br />

The total number of bits needed for a cache is a function of the cache size <strong>and</strong><br />

the address size, because the cache includes both the storage for the data <strong>and</strong> the<br />

tags. The size of the block above was one word, but normally it is several. For the<br />

following situation:<br />

■ 32-bit addresses<br />

■ A direct-mapped cache<br />

■ The cache size is 2^n blocks, so n bits are used for the index

■ The block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within the block, and two bits are used for the byte part of the address

the size of the tag field is

32 − (n + m + 2).

The total number of bits in a direct-mapped cache is

2^n × (block size + tag size + valid field size).



Bits in a Cache<br />

EXAMPLE<br />

How many total bits are required for a direct-mapped cache with 16 KiB of<br />

data <strong>and</strong> 4-word blocks, assuming a 32-bit address?<br />

ANSWER<br />

We know that 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2), there are 1024 (2^10) blocks. Each block has 4 × 32 or 128 bits of data plus a tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is

2^10 × (4 × 32 + (32 − 10 − 2 − 2) + 1) = 2^10 × 147 = 147 Kibibits

or 18.4 KiB for a 16 KiB cache. For this cache, the total number of bits in the cache is about 1.15 times as many as needed just for the storage of the data.
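The same bit count can be reproduced mechanically; here is a minimal C sketch of the formula, evaluated for this example's parameters (n = 10 index bits, m = 2 word-offset bits).

#include <stdio.h>

/* Bit count for a direct-mapped cache with 32-bit addresses:
   total bits = 2^n * (block data bits + tag bits + 1 valid bit). */
int main(void) {
    int n = 10;                              /* 2^10 = 1024 blocks */
    int m = 2;                               /* 2^2  = 4 words per block */
    int data_bits = (1 << m) * 32;           /* 128 bits of data per block */
    int tag_bits  = 32 - n - m - 2;          /* 18 bits of tag */
    long total    = (1L << n) * (data_bits + tag_bits + 1);
    printf("total = %ld bits = %ld Kibibits\n", total, total / 1024);
    return 0;
}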

Mapping an Address to a Multiword Cache Block<br />

EXAMPLE<br />

Consider a cache with 64 blocks <strong>and</strong> a block size of 16 bytes. To what block<br />

number does byte address 1200 map?<br />

ANSWER<br />

We saw the formula on page 384. The block is given by<br />

(Block address) modulo (Number of blocks in the cache)<br />

where the address of the block is

⌊Byte address / Bytes per block⌋

Notice that this block address is the block containing all addresses between

⌊Byte address / Bytes per block⌋ × Bytes per block
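A minimal C sketch that applies the mapping above to this example's numbers (byte address 1200, 16-byte blocks, a 64-block cache):

#include <stdio.h>

/* Byte address -> memory block address (truncating division) -> cache block
   (modulo the number of blocks in the cache). */
int main(void) {
    unsigned byte_address    = 1200;
    unsigned bytes_per_block = 16;
    unsigned cache_blocks    = 64;

    unsigned block_address = byte_address / bytes_per_block;   /* 1200 / 16 = 75 */
    unsigned cache_block   = block_address % cache_blocks;     /* 75 mod 64 = 11 */
    printf("byte %u -> memory block %u -> cache block %u\n",
           byte_address, block_address, cache_block);
    return 0;
}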



the block from the next lower level of the hierarchy <strong>and</strong> load it into the cache. The<br />

time to fetch the block has two parts: the latency to the first word <strong>and</strong> the transfer<br />

time for the rest of the block. Clearly, unless we change the memory system, the<br />

transfer time—<strong>and</strong> hence the miss penalty—will likely increase as the block size<br />

increases. Furthermore, the improvement in the miss rate starts to decrease as the<br />

blocks become larger. The result is that the increase in the miss penalty overwhelms<br />

the decrease in the miss rate for blocks that are too large, <strong>and</strong> cache performance<br />

thus decreases. Of course, if we design the memory to transfer larger blocks more<br />

efficiently, we can increase the block size <strong>and</strong> obtain further improvements in cache<br />

performance. We discuss this topic in the next section.<br />

Elaboration: Although it is hard to do anything about the longer latency component of<br />

the miss penalty for large blocks, we may be able to hide some of the transfer time so<br />

that the miss penalty is effectively smaller. The simplest method for doing this, called<br />

early restart, is simply to resume execution as soon as the requested word of the block<br />

is returned, rather than wait for the entire block. Many processors use this technique<br />

for instruction access, where it works best. Instruction accesses are largely sequential,<br />

so if the memory system can deliver a word every clock cycle, the processor may be<br />

able to restart operation when the requested word is returned, with the memory system<br />

delivering new instruction words just in time. This technique is usually less effective for<br />

data caches because it is likely that the words will be requested from the block in a<br />

less predictable way, <strong>and</strong> the probability that the processor will need another word from<br />

a different cache block before the transfer completes is high. If the processor cannot<br />

access the data cache because a transfer is ongoing, then it must stall.<br />

An even more sophisticated scheme is to organize the memory so that the requested<br />

word is transferred from the memory to the cache first. The remainder of the block is then transferred, starting with the address after the requested word and wrapping around to the beginning of the block. This technique, called requested word first or critical word first, can be slightly faster than early restart, but it is limited by the same

properties that limit early restart.<br />

cache miss A request for<br />

data from the cache that<br />

cannot be filled because<br />

the data is not present in<br />

the cache.<br />

H<strong>and</strong>ling Cache Misses<br />

Before we look at the cache of a real system, let’s see how the control unit deals with<br />

cache misses. (We describe a cache controller in detail in Section 5.9). The control<br />

unit must detect a miss <strong>and</strong> process the miss by fetching the requested data from<br />

memory (or, as we shall see, a lower-level cache). If the cache reports a hit, the<br />

computer continues using the data as if nothing happened.<br />

Modifying the control of a processor to h<strong>and</strong>le a hit is trivial; misses, however,<br />

require some extra work. The cache miss h<strong>and</strong>ling is done in collaboration with<br />

the processor control unit <strong>and</strong> with a separate controller that initiates the memory<br />

access <strong>and</strong> refills the cache. The processing of a cache miss creates a pipeline stall<br />

(Chapter 4) as opposed to an interrupt, which would require saving the state of all<br />

registers. For a cache miss, we can stall the entire processor, essentially freezing<br />

the contents of the temporary <strong>and</strong> programmer-visible registers, while we wait



for memory. More sophisticated out-of-order processors can allow execution of<br />

instructions while waiting for a cache miss, but we’ll assume in-order processors<br />

that stall on cache misses in this section.<br />

Let’s look a little more closely at how instruction misses are h<strong>and</strong>led; the same<br />

approach can be easily extended to h<strong>and</strong>le data misses. If an instruction access<br />

results in a miss, then the content of the Instruction register is invalid. To get the<br />

proper instruction into the cache, we must be able to instruct the lower level in the<br />

memory hierarchy to perform a read. Since the program counter is incremented in<br />

the first clock cycle of execution, the address of the instruction that generates an<br />

instruction cache miss is equal to the value of the program counter minus 4. Once<br />

we have the address, we need to instruct the main memory to perform a read. We<br />

wait for the memory to respond (since the access will take multiple clock cycles),<br />

<strong>and</strong> then write the words containing the desired instruction into the cache.<br />

We can now define the steps to be taken on an instruction cache miss:<br />

1. Send the original PC value (current PC – 4) to the memory.<br />

2. Instruct main memory to perform a read <strong>and</strong> wait for the memory to<br />

complete its access.<br />

3. Write the cache entry, putting the data from memory in the data portion of<br />

the entry, writing the upper bits of the address (from the ALU) into the tag<br />

field, <strong>and</strong> turning the valid bit on.<br />

4. Restart the instruction execution at the first step, which will refetch the<br />

instruction, this time finding it in the cache.<br />

The control of the cache on a data access is essentially identical: on a miss, we<br />

simply stall the processor until the memory responds with the data.<br />
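The steps above are performed by hardware, but a toy software model can make the sequence concrete. The sketch below assumes a direct-mapped cache of one-word blocks and uses an array as a stand-in for the lower level of the hierarchy; it illustrates the check, miss, fill, retry sequence and is not how a real controller is built.

#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 8

struct entry { int valid; uint32_t tag; uint32_t data; };
static struct entry cache[NBLOCKS];
static uint32_t main_memory[1024];     /* stand-in for the lower level of the hierarchy */

uint32_t cache_read(uint32_t addr) {
    uint32_t block = addr >> 2;        /* word address: drop the 2 byte-offset bits */
    uint32_t index = block % NBLOCKS;
    uint32_t tag   = block / NBLOCKS;

    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;      /* hit: continue as if nothing happened */

    /* miss: "stall", read the word from the lower level, then fill the entry */
    uint32_t word = main_memory[block % 1024];
    cache[index].valid = 1;
    cache[index].tag   = tag;
    cache[index].data  = word;
    return word;                       /* the retried access now hits */
}

int main(void) {
    main_memory[3] = 42;               /* pretend word address 3 holds a value */
    uint32_t first  = cache_read(12);  /* byte address 12 = word 3: miss, filled */
    uint32_t second = cache_read(12);  /* same address: hit */
    printf("%u %u\n", first, second);
    return 0;
}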

H<strong>and</strong>ling Writes<br />

Writes work somewhat differently. Suppose on a store instruction, we wrote the<br />

data into only the data cache (without changing main memory); then, after the<br />

write into the cache, memory would have a different value from that in the cache.<br />

In such a case, the cache <strong>and</strong> memory are said to be inconsistent. The simplest way<br />

to keep the main memory <strong>and</strong> the cache consistent is always to write the data into<br />

both the memory <strong>and</strong> the cache. This scheme is called write-through.<br />

The other key aspect of writes is what occurs on a write miss. We first fetch the<br />

words of the block from memory. After the block is fetched <strong>and</strong> placed into the<br />

cache, we can overwrite the word that caused the miss into the cache block. We also<br />

write the word to main memory using the full address.<br />

Although this design h<strong>and</strong>les writes very simply, it would not provide very<br />

good performance. With a write-through scheme, every write causes the data<br />

to be written to main memory. These writes will take a long time, likely at least<br />

100 processor clock cycles, <strong>and</strong> could slow down the processor considerably. For<br />

example, suppose 10% of the instructions are stores. If the CPI without cache<br />

write-through A scheme in which writes always update both the cache and the next lower level of the memory hierarchy, ensuring that data is always consistent between the two.



write buffer A queue that holds data while the data is waiting to be written to memory.

write-back A scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.

misses was 1.0, spending 100 extra cycles on every write would lead to a CPI of<br />

1.0 + 100 × 10% = 11, reducing performance by more than a factor of 10.

One solution to this problem is to use a write buffer. A write buffer stores the<br />

data while it is waiting to be written to memory. After writing the data into the<br />

cache <strong>and</strong> into the write buffer, the processor can continue execution. When a write<br />

to main memory completes, the entry in the write buffer is freed. If the write buffer<br />

is full when the processor reaches a write, the processor must stall until there is an<br />

empty position in the write buffer. Of course, if the rate at which the memory can<br />

complete writes is less than the rate at which the processor is generating writes, no<br />

amount of buffering can help, because writes are being generated faster than the<br />

memory system can accept them.<br />

The rate at which writes are generated may also be less than the rate at which the<br />

memory can accept them, <strong>and</strong> yet stalls may still occur. This can happen when the<br />

writes occur in bursts. To reduce the occurrence of such stalls, processors usually<br />

increase the depth of the write buffer beyond a single entry.<br />

The alternative to a write-through scheme is a scheme called write-back. In a<br />

write-back scheme, when a write occurs, the new value is written only to the block<br />

in the cache. The modified block is written to the lower level of the hierarchy when<br />

it is replaced. Write-back schemes can improve performance, especially when<br />

processors can generate writes as fast or faster than the writes can be h<strong>and</strong>led by<br />

main memory; a write-back scheme is, however, more complex to implement than<br />

write-through.<br />

In the rest of this section, we describe caches from real processors, <strong>and</strong> we<br />

examine how they h<strong>and</strong>le both reads <strong>and</strong> writes. In Section 5.8, we will describe<br />

the h<strong>and</strong>ling of writes in more detail.<br />

Elaboration: Writes introduce several complications into caches that are not present<br />

for reads. Here we discuss two of them: the policy on write misses and efficient

implementation of writes in write-back caches.<br />

Consider a miss in a write-through cache. The most common strategy is to allocate a<br />

block in the cache, called write allocate. The block is fetched from memory <strong>and</strong> then the<br />

appropriate portion of the block is overwritten. An alternative strategy is to update the portion<br />

of the block in memory but not put it in the cache, called no write allocate. The motivation is<br />

that sometimes programs write entire blocks of data, such as when the operating system<br />

zeros a page of memory. In such cases, the fetch associated with the initial write miss may<br />

be unnecessary. Some computers allow the write allocation policy to be changed on a per<br />

page basis.<br />

Actually implementing stores efficiently in a cache that uses a write-back strategy is

more complex than in a write-through cache. A write-through cache can write the data<br />

into the cache <strong>and</strong> read the tag; if the tag mismatches, then a miss occurs. Because the<br />

cache is write-through, the overwriting of the block in the cache is not catastrophic, since<br />

memory has the correct value. In a write-back cache, we must first write the block back to memory if the data in the cache is modified and we have a cache miss. If we simply

overwrote the block on a store instruction before we knew whether the store had hit in<br />

the cache (as we could for a write-through cache), we would destroy the contents of the<br />

block, which is not backed up in the next lower level of the memory hierarchy.



In a write-back cache, because we cannot overwrite the block, stores either require<br />

two cycles (a cycle to check for a hit followed by a cycle to actually perform the write) or<br />

require a write buffer to hold that data—effectively allowing the store to take only one<br />

cycle by pipelining it. When a store buffer is used, the processor does the cache lookup<br />

<strong>and</strong> places the data in the store buffer during the normal cache access cycle. Assuming<br />

a cache hit, the new data is written from the store buffer into the cache on the next<br />

unused cache access cycle.<br />

By comparison, in a write-through cache, writes can always be done in one cycle.<br />

We read the tag <strong>and</strong> write the data portion of the selected block. If the tag matches<br />

the address of the block being written, the processor can continue normally, since the<br />

correct block has been updated. If the tag does not match, the processor generates a<br />

write miss to fetch the rest of the block corresponding to that address.<br />

Many write-back caches also include write buffers that are used to reduce the miss<br />

penalty when a miss replaces a modified block. In such a case, the modified block is

moved to a write-back buffer associated with the cache while the requested block is read<br />

from memory. The write-back buffer is later written back to memory. Assuming another<br />

miss does not occur immediately, this technique halves the miss penalty when a dirty<br />

block must be replaced.<br />

An Example Cache: The Intrinsity FastMATH Processor<br />

The Intrinsity FastMATH is an embedded microprocessor that uses the MIPS<br />

architecture <strong>and</strong> a simple cache implementation. Near the end of the chapter, we<br />

will examine the more complex cache designs of ARM <strong>and</strong> Intel microprocessors,<br />

but we start with this simple, yet real, example for pedagogical reasons. Figure 5.12<br />

shows the organization of the Intrinsity FastMATH data cache.<br />

This processor has a 12-stage pipeline. When operating at peak speed, the<br />

processor can request both an instruction word <strong>and</strong> a data word on every clock.<br />

To satisfy the dem<strong>and</strong>s of the pipeline without stalling, separate instruction<br />

<strong>and</strong> data caches are used. Each cache is 16 KiB, or 4096 words, with 16-word<br />

blocks.<br />

Read requests for the cache are straightforward. Because there are separate<br />

data <strong>and</strong> instruction caches, we need separate control signals to read <strong>and</strong> write<br />

each cache. (Remember that we need to update the instruction cache when a miss<br />

occurs.) Thus, the steps for a read request to either cache are as follows:<br />

1. Send the address to the appropriate cache. The address comes either from<br />

the PC (for an instruction) or from the ALU (for data).<br />

2. If the cache signals hit, the requested word is available on the data lines.<br />

Since there are 16 words in the desired block, we need to select the right one.<br />

A block index field is used to control the multiplexor (shown at the bottom<br />

of the figure), which selects the requested word from the 16 words in the<br />

indexed block.



To take advantage of spatial locality, a cache must have a block size larger than<br />

one word. The use of a larger block decreases the miss rate <strong>and</strong> improves the<br />

efficiency of the cache by reducing the amount of tag storage relative to the amount<br />

of data storage in the cache. Although a larger block size decreases the miss rate, it<br />

can also increase the miss penalty. If the miss penalty increased linearly with the<br />

block size, larger blocks could easily lead to lower performance.<br />

To avoid performance loss, the b<strong>and</strong>width of main memory is increased to<br />

transfer cache blocks more efficiently. Common methods for increasing b<strong>and</strong>width<br />

external to the DRAM are making the memory wider <strong>and</strong> interleaving. DRAM<br />

designers have steadily improved the interface between the processor <strong>and</strong> memory<br />

to increase the b<strong>and</strong>width of burst mode transfers to reduce the cost of larger cache<br />

block sizes.<br />

Check<br />

Yourself<br />

The speed of the memory system affects the designer’s decision on the size of<br />

the cache block. Which of the following cache designer guidelines are generally<br />

valid?<br />

1. The shorter the memory latency, the smaller the cache block<br />

2. The shorter the memory latency, the larger the cache block<br />

3. The higher the memory b<strong>and</strong>width, the smaller the cache block<br />

4. The higher the memory b<strong>and</strong>width, the larger the cache block<br />

5.4 Measuring and Improving Cache Performance

In this section, we begin by examining ways to measure <strong>and</strong> analyze cache<br />

performance. We then explore two different techniques for improving cache<br />

performance. One focuses on reducing the miss rate by reducing the probability<br />

that two different memory blocks will contend for the same cache location. The<br />

second technique reduces the miss penalty by adding an additional level to the<br />

hierarchy. This technique, called multilevel caching, first appeared in high-end<br />

computers selling for more than $100,000 in 1990; since then it has become<br />

common on personal mobile devices selling for a few hundred dollars!



CPU time can be divided into the clock cycles that the CPU spends executing<br />

the program <strong>and</strong> the clock cycles that the CPU spends waiting for the memory<br />

system. Normally, we assume that the costs of cache accesses that are hits are part<br />

of the normal CPU execution cycles. Thus,<br />

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

The memory-stall clock cycles come primarily from cache misses, <strong>and</strong> we make<br />

that assumption here. We also restrict the discussion to a simplified model of the<br />

memory system. In real processors, the stalls generated by reads <strong>and</strong> writes can be<br />

quite complex, <strong>and</strong> accurate performance prediction usually requires very detailed<br />

simulations of the processor <strong>and</strong> memory system.<br />

Memory-stall clock cycles can be defined as the sum of the stall cycles coming<br />

from reads plus those coming from writes:<br />

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

The read-stall cycles can be defined in terms of the number of read accesses per<br />

program, the miss penalty in clock cycles for a read, <strong>and</strong> the read miss rate:<br />

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

Writes are more complicated. For a write-through scheme, we have two sources of<br />

stalls: write misses, which usually require that we fetch the block before continuing<br />

the write (see the Elaboration on page 394 for more details on dealing with writes),<br />

<strong>and</strong> write buffer stalls, which occur when the write buffer is full when a write<br />

occurs. Thus, the cycles stalled for writes equals the sum of these two:<br />

Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls

Because the write buffer stalls depend on the proximity of writes, <strong>and</strong> not just<br />

the frequency, it is not possible to give a simple equation to compute such stalls.<br />

Fortunately, in systems with a reasonable write buffer depth (e.g., four or more<br />

words) <strong>and</strong> a memory capable of accepting writes at a rate that significantly exceeds<br />

the average write frequency in programs (e.g., by a factor of 2), the write buffer<br />

stalls will be small, <strong>and</strong> we can safely ignore them. If a system did not meet these<br />

criteria, it would not be well designed; instead, the designer should have used either<br />

a deeper write buffer or a write-back organization.



Write-back schemes also have potential additional stalls arising from the need<br />

to write a cache block back to memory when the block is replaced. We will discuss<br />

this more in Section 5.8.<br />

In most write-through cache organizations, the read <strong>and</strong> write miss penalties are<br />

the same (the time to fetch the block from memory). If we assume that the write<br />

buffer stalls are negligible, we can combine the reads <strong>and</strong> writes by using a single<br />

miss rate <strong>and</strong> the miss penalty:<br />

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

We can also factor this as

Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Let’s consider a simple example to help us underst<strong>and</strong> the impact of cache<br />

performance on processor performance.<br />

Calculating Cache Performance<br />

EXAMPLE<br />

Assume the miss rate of an instruction cache is 2% <strong>and</strong> the miss rate of the data<br />

cache is 4%. If a processor has a CPI of 2 without any memory stalls <strong>and</strong> the<br />

miss penalty is 100 cycles for all misses, determine how much faster a processor<br />

would run with a perfect cache that never missed. Assume the frequency of all<br />

loads <strong>and</strong> stores is 36%.<br />

ANSWER<br />

The number of memory miss cycles for instructions in terms of the Instruction<br />

count (I) is<br />

Instruction miss cycles = I × 2% × 100 = 2.00 × I

As the frequency of all loads and stores is 36%, we can find the number of memory miss cycles for data references:

Data miss cycles = I × 36% × 4% × 100 = 1.44 × I


5.4 Measuring <strong>and</strong> Improving Cache Performance 401<br />

The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. This is more than three cycles of memory stall per instruction. Accordingly, the total CPI including memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is

CPU time with stalls / CPU time with perfect cache
    = (I × CPI_stall × Clock cycle) / (I × CPI_perfect × Clock cycle)
    = CPI_stall / CPI_perfect
    = 5.44 / 2

The performance with the perfect cache is better by 5.44 / 2 = 2.72.
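The arithmetic of this example is easy to reproduce; the following C sketch recomputes the stall cycles per instruction and the resulting CPI from the stated miss rates and penalty.

#include <stdio.h>

/* Reproduces the example's numbers: 2% I-cache and 4% D-cache miss rates,
   36% loads/stores, a 100-cycle miss penalty, and a base CPI of 2. */
int main(void) {
    double base_cpi     = 2.0;
    double i_miss_rate  = 0.02;
    double d_miss_rate  = 0.04;
    double mem_ops_frac = 0.36;    /* loads and stores per instruction */
    double miss_penalty = 100.0;

    double i_stalls  = i_miss_rate * miss_penalty;                /* 2.00 per instruction */
    double d_stalls  = mem_ops_frac * d_miss_rate * miss_penalty; /* 1.44 per instruction */
    double cpi_stall = base_cpi + i_stalls + d_stalls;            /* 5.44 */

    printf("CPI with stalls = %.2f, slowdown vs. perfect cache = %.2fx\n",
           cpi_stall, cpi_stall / base_cpi);
    return 0;
}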

What happens if the processor is made faster, but the memory system is not? The<br />

amount of time spent on memory stalls will take up an increasing fraction of the<br />

execution time; Amdahl’s Law, which we examined in Chapter 1, reminds us of<br />

this fact. A few simple examples show how serious this problem can be. Suppose<br />

we speed up the computer in the previous example by reducing its CPI from 2 to 1 without changing the clock rate, which might be done with an improved pipeline. The system with cache misses would then have a CPI of 1 + 3.44 = 4.44, and the system with the perfect cache would be 4.44 / 1 = 4.44 times as fast.

The amount of execution time spent on memory stalls would have risen from 3.44 / 5.44 = 63% to 3.44 / 4.44 = 77%.

Similarly, increasing the clock rate without changing the memory system also<br />

increases the performance lost due to cache misses.<br />

The previous examples <strong>and</strong> equations assume that the hit time is not a factor in<br />

determining cache performance. Clearly, if the hit time increases, the total time to<br />

access a word from the memory system will increase, possibly causing an increase in<br />

the processor cycle time. Although we will see additional examples of what can increase



hit time shortly, one example is increasing the cache size. A larger cache could clearly<br />

have a longer access time, just as, if your desk in the library was very large (say, 3 square<br />

meters), it would take longer to locate a book on the desk. An increase in hit time<br />

likely adds another stage to the pipeline, since it may take multiple cycles for a cache<br />

hit. Although it is more complex to calculate the performance impact of a deeper<br />

pipeline, at some point the increase in hit time for a larger cache could dominate the<br />

improvement in hit rate, leading to a decrease in processor performance.<br />

To capture the fact that the time to access data for both hits <strong>and</strong> misses affects<br />

performance, designers sometime use average memory access time (AMAT) as<br />

a way to examine alternative cache designs. Average memory access time is the<br />

average time to access memory considering both hits <strong>and</strong> misses <strong>and</strong> the frequency<br />

of different accesses; it is equal to the following:<br />

AMAT Time for a hit Miss rate Miss penalty<br />

Calculating Average Memory Access Time<br />

EXAMPLE<br />

Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of<br />

20 clock cycles, a miss rate of 0.05 misses per instruction, <strong>and</strong> a cache access<br />

time (including hit detection) of 1 clock cycle. Assume that the read <strong>and</strong> write<br />

miss penalties are the same <strong>and</strong> ignore other write stalls.<br />

ANSWER<br />

The average memory access time per instruction is<br />

AMAT = Time for a hit + Miss rate × Miss penalty
     = 1 + 0.05 × 20
     = 2 clock cycles

or 2 ns.

The next subsection discusses alternative cache organizations that decrease<br />

miss rate but may sometimes increase hit time; additional examples appear in<br />

Section 5.15, Fallacies <strong>and</strong> Pitfalls.<br />

Reducing Cache Misses by More Flexible Placement<br />

of Blocks<br />

So far, when we place a block in the cache, we have used a simple placement scheme:<br />

A block can go in exactly one place in the cache. As mentioned earlier, it is called<br />

direct mapped because there is a direct mapping from any block address in memory<br />

to a single location in the upper level of the hierarchy. However, there is actually a<br />

whole range of schemes for placing blocks. Direct mapped, where a block can be<br />

placed in exactly one location, is at one extreme.



At the other extreme is a scheme where a block can be placed in any location<br />

in the cache. Such a scheme is called fully associative, because a block in memory<br />

may be associated with any entry in the cache. To find a given block in a fully<br />

associative cache, all the entries in the cache must be searched because a block<br />

can be placed in any one. To make the search practical, it is done in parallel with<br />

a comparator associated with each cache entry. These comparators significantly<br />

increase the hardware cost, effectively making fully associative placement practical<br />

only for caches with small numbers of blocks.<br />

The middle range of designs between direct mapped <strong>and</strong> fully associative<br />

is called set associative. In a set-associative cache, there are a fixed number of<br />

locations where each block can be placed. A set-associative cache with n locations<br />

for a block is called an n-way set-associative cache. An n-way set-associative cache<br />

consists of a number of sets, each of which consists of n blocks. Each block in the<br />

memory maps to a unique set in the cache given by the index field, <strong>and</strong> a block can<br />

be placed in any element of that set. Thus, a set-associative placement combines<br />

direct-mapped placement <strong>and</strong> fully associative placement: a block is directly<br />

mapped into a set, <strong>and</strong> then all the blocks in the set are searched for a match. For<br />

example, Figure 5.14 shows where block 12 may be placed in a cache with eight<br />

blocks total, according to the three block placement policies.<br />

Remember that in a direct-mapped cache, the position of a memory block is<br />

given by<br />

fully associative cache A cache structure in which a block can be placed in any location in the cache.

set-associative cache A cache that has a fixed number of locations (at least two) where each block can be placed.

(Block number) modulo (Number of blocks in the cache)<br />


FIGURE 5.14 The location of a memory block whose address is 12 in a cache with eight<br />

blocks varies for direct-mapped, set-associative, and fully associative placement. In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 modulo 4) = 0; the memory block could be in either element of the set. In a fully associative

placement, the memory block for block address 12 can appear in any of the eight cache blocks.



In a set-associative cache, the set containing a memory block is given by<br />

(Block number) modulo (Number of sets in the cache)<br />

Since the block may be placed in any element of the set, all the tags of all the elements<br />

of the set must be searched. In a fully associative cache, the block can go anywhere,<br />

<strong>and</strong> all tags of all the blocks in the cache must be searched.<br />
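A short C sketch of this placement rule for an eight-block cache at the associativities shown in Figure 5.15, using memory block 12 as in Figure 5.14.

#include <stdio.h>

/* The set index is the block number modulo the number of sets; the block may
   then go in any of the ways within that set. */
int main(void) {
    unsigned total_blocks = 8;
    unsigned block_number = 12;
    unsigned ways[] = {1, 2, 4, 8};   /* direct mapped ... fully associative */

    for (int i = 0; i < 4; i++) {
        unsigned sets = total_blocks / ways[i];
        printf("%u-way: block %u -> set %u (any of %u ways)\n",
               ways[i], block_number, block_number % sets, ways[i]);
    }
    return 0;
}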

We can also think of all block placement strategies as a variation on set<br />

associativity. Figure 5.15 shows the possible associativity structures for an eight-block

cache. A direct-mapped cache is simply a one-way set-associative cache:<br />

each cache entry holds one block <strong>and</strong> each set has one element. A fully associative<br />

cache with m entries is simply an m-way set-associative cache; it has one set with m<br />

blocks, <strong>and</strong> an entry can reside in any block within that set.<br />

The advantage of increasing the degree of associativity is that it usually decreases<br />

the miss rate, as the next example shows. The main disadvantage, which we discuss<br />

in more detail shortly, is a potential increase in the hit time.<br />

[Figure 5.15 diagram: the eight-block cache drawn as one-way set associative (direct mapped) with eight one-block sets, two-way set associative with four sets of two blocks, four-way set associative with two sets of four blocks, and eight-way set associative (fully associative) with one set of eight blocks; each block holds a tag and data.]

FIGURE 5.15 An eight-block cache configured as direct mapped, two-way set associative,<br />

four-way set associative, <strong>and</strong> fully associative. The total size of the cache in blocks is equal to the<br />

number of sets times the associativity. Thus, for a fixed cache size, increasing the associativity decreases<br />

the number of sets while increasing the number of elements per set. With eight blocks, an eight-way set-associative

cache is the same as a fully associative cache.



is replaced. (We will discuss other replacement rules in more detail shortly.)<br />

Using this replacement rule, the contents of the set-associative cache after each<br />

reference looks like this:<br />

Address of memory    Hit        Contents of cache blocks after reference
block accessed       or miss    Set 0        Set 0        Set 1    Set 1

0                    miss       Memory[0]
8                    miss       Memory[0]    Memory[8]
0                    hit        Memory[0]    Memory[8]
6                    miss       Memory[0]    Memory[6]
8                    miss       Memory[8]    Memory[6]

Notice that when block 6 is referenced, it replaces block 8, since block 8 has<br />

been less recently referenced than block 0. The two-way set-associative cache<br />

has four misses, one less than the direct-mapped cache.<br />

The fully associative cache has four cache blocks (in a single set); any<br />

memory block can be stored in any cache block. The fully associative cache has<br />

the best performance, with only three misses:<br />

Address of memory    Hit        Contents of cache blocks after reference
block accessed       or miss    Block 0      Block 1      Block 2      Block 3

0                    miss       Memory[0]
8                    miss       Memory[0]    Memory[8]
0                    hit        Memory[0]    Memory[8]
6                    miss       Memory[0]    Memory[8]    Memory[6]
8                    hit        Memory[0]    Memory[8]    Memory[6]

For this series of references, three misses is the best we can do, because three<br />

unique block addresses are accessed. Notice that if we had eight blocks in the<br />

cache, there would be no replacements in the two-way set-associative cache<br />

(check this for yourself), <strong>and</strong> it would have the same number of misses as the<br />

fully associative cache. Similarly, if we had 16 blocks, all 3 caches would have<br />

the same number of misses. Even this trivial example shows that cache size <strong>and</strong><br />

associativity are not independent in determining cache performance.<br />

How much of a reduction in the miss rate is achieved by associativity?<br />

Figure 5.16 shows the improvement for a 64 KiB data cache with a 16-word block,<br />

<strong>and</strong> associativity ranging from direct mapped to eight-way. Going from one-way<br />

to two-way associativity decreases the miss rate by about 15%, but there is little<br />

further improvement in going to higher associativity.



Associativity    Data miss rate
      1              10.3%
      2               8.6%
      4               8.3%
      8               8.1%

FIGURE 5.16 The data cache miss rates for an organization like the Intrinsity FastMATH<br />

processor for SPEC CPU2000 benchmarks with associativity varying from one-way to<br />

eight-way. These results for 10 SPEC CPU2000 programs are from Hennessy <strong>and</strong> Patterson (2003).<br />

[Figure 5.17 diagram: an address divided into three fields, Tag | Index | Block offset.]

FIGURE 5.17 The three portions of an address in a set-associative or direct-mapped<br />

cache. The index is used to select the set, then the tag is used to choose the block by comparison with the<br />

blocks in the selected set. The block offset is the address of the desired data within the block.<br />

Locating a Block in the Cache<br />

Now, let’s consider the task of finding a block in a cache that is set associative.<br />

Just as in a direct-mapped cache, each block in a set-associative cache includes<br />

an address tag that gives the block address. The tag of every cache block within<br />

the appropriate set is checked to see if it matches the block address from the<br />

processor. Figure 5.17 decomposes the address. The index value is used to select<br />

the set containing the address of interest, <strong>and</strong> the tags of all the blocks in the set<br />

must be searched. Because speed is of the essence, all the tags in the selected set are<br />

searched in parallel. As in a fully associative cache, a sequential search would make<br />

the hit time of a set-associative cache too slow.<br />

If the total cache size is kept the same, increasing the associativity increases the<br />

number of blocks per set, which is the number of simultaneous compares needed<br />

to perform the search in parallel: each increase by a factor of 2 in associativity<br />

doubles the number of blocks per set <strong>and</strong> halves the number of sets. Accordingly,<br />

each factor-of-2 increase in associativity decreases the size of the index by 1 bit <strong>and</strong><br />

increases the size of the tag by 1 bit. In a fully associative cache, there is effectively<br />

only one set, <strong>and</strong> all the blocks must be checked in parallel. Thus, there is no index,<br />

<strong>and</strong> the entire address, excluding the block offset, is compared against the tag of<br />

every block. In other words, we search the entire cache without any indexing.<br />

In a direct-mapped cache, only a single comparator is needed, because the entry can<br />

be in only one block, <strong>and</strong> we access the cache simply by indexing. Figure 5.18 shows<br />

that in a four-way set-associative cache, four comparators are needed, together with<br />

a 4-to-1 multiplexor to choose among the four potential members of the selected set.<br />

The cache access consists of indexing the appropriate set <strong>and</strong> then searching the tags<br />

of the set. The costs of an associative cache are the extra comparators <strong>and</strong> any delay<br />

imposed by having to do the compare <strong>and</strong> select from among the elements of the set.
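The following C sketch models that lookup in software; the structure, sizes, and names are illustrative assumptions, and the loop stands in for the tag comparators that hardware operates in parallel.

/* Software sketch of a four-way set-associative lookup. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 256                  /* assumed number of sets for this sketch */
#define BLOCK_BYTES 16            /* four-word blocks: a 4-bit block offset */

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
struct line cache[SETS][WAYS];

/* Returns true on a hit; *way reports which of the four comparators matched. */
bool lookup(uint32_t addr, int *way)
{
    uint32_t index = (addr / BLOCK_BYTES) % SETS;   /* index selects the set   */
    uint32_t tag   = (addr / BLOCK_BYTES) / SETS;   /* remaining address bits  */
    for (int w = 0; w < WAYS; w++)                  /* hardware does these     */
        if (cache[index][w].valid &&                /* compares in parallel    */
            cache[index][w].tag == tag) {
            *way = w;                               /* drives the 4-to-1 mux   */
            return true;
        }
    return false;
}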



Choosing Which Block to Replace<br />

When a miss occurs in a direct-mapped cache, the requested block can go in<br />

exactly one position, <strong>and</strong> the block occupying that position must be replaced. In<br />

an associative cache, we have a choice of where to place the requested block, <strong>and</strong><br />

hence a choice of which block to replace. In a fully associative cache, all blocks are<br />

c<strong>and</strong>idates for replacement. In a set-associative cache, we must choose among the<br />

blocks in the selected set.<br />

The most commonly used scheme is least recently used (LRU), which we used<br />

in the previous example. In an LRU scheme, the block replaced is the one that has<br />

been unused for the longest time. The set associative example on page 405 uses<br />

LRU, which is why we replaced Memory(0) instead of Memory(6).<br />

LRU replacement is implemented by keeping track of when each element in a<br />

set was used relative to the other elements in the set. For a two-way set-associative<br />

cache, tracking when the two elements were used can be implemented by keeping<br />

a single bit in each set <strong>and</strong> setting the bit to indicate an element whenever that<br />

element is referenced. As associativity increases, implementing LRU gets harder; in<br />

Section 5.8, we will see an alternative scheme for replacement.<br />
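For the two-way case, the single-bit bookkeeping just described can be sketched in a few lines of C; the array name and number of sets are assumptions made only for illustration.

#define NUM_SETS 1024

int lru_bit[NUM_SETS];          /* which way of each set was used most recently */

void touch(int set, int way)    /* call on every hit or fill                     */
{
    lru_bit[set] = way;         /* this way is now the most recently used        */
}

int victim(int set)             /* way to replace on a miss                      */
{
    return 1 - lru_bit[set];    /* the other way is the least recently used      */
}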

least recently used<br />

(LRU) A replacement<br />

scheme in which the<br />

block replaced is the one<br />

that has been unused for<br />

the longest time.<br />

Size of Tags versus Set Associativity<br />

Increasing associativity requires more comparators <strong>and</strong> more tag bits per<br />

cache block. Assuming a cache of 4096 blocks, a 4-word block size, <strong>and</strong> a<br />

32-bit address, find the total number of sets <strong>and</strong> the total number of tag bits<br />

for caches that are direct mapped, two-way <strong>and</strong> four-way set associative, <strong>and</strong><br />

fully associative.<br />

EXAMPLE<br />

ANSWER

Since there are 16 (= 2^4) bytes per block, a 32-bit address yields 32 - 4 = 28 bits to be used for index and tag. The direct-mapped cache has the same number of sets as blocks, and hence 12 bits of index, since log2(4096) = 12; hence, the total number is (28 - 12) × 4096 = 16 × 4096 ≈ 66 K tag bits.

Each degree of associativity decreases the number of sets by a factor of 2 and thus decreases the number of bits used to index the cache by 1 and increases the number of bits in the tag by 1. Thus, for a two-way set-associative cache, there are 2048 sets, and the total number of tag bits is (28 - 11) × 2 × 2048 = 34 × 2048 ≈ 70 K tag bits. For a four-way set-associative cache, the total number of sets is 1024, and the total number is (28 - 10) × 4 × 1024 = 72 × 1024 ≈ 74 K tag bits.

For a fully associative cache, there is only one set with 4096 blocks, and the tag is 28 bits, leading to 28 × 4096 × 1 ≈ 115 K tag bits.
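The arithmetic is mechanical enough to script. The C sketch below (ours, not from the text) recomputes the totals for all four organizations.

/* Tag-bit totals for a 4096-block cache, 4-word blocks, 32-bit addresses. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int blocks = 4096, addr_bits = 32, offset_bits = 4;   /* 16-byte blocks      */
    int assoc[] = {1, 2, 4, 4096};                        /* 4096-way here means */
                                                          /* fully associative   */
    for (int i = 0; i < 4; i++) {
        int sets = blocks / assoc[i];
        int index_bits = (int) log2(sets);
        int tag_bits = addr_bits - offset_bits - index_bits;
        printf("%4d-way: %4d sets, %2d-bit tag, %6d total tag bits\n",
               assoc[i], sets, tag_bits, tag_bits * blocks);
    }
    return 0;
}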



Reducing the Miss Penalty Using Multilevel Caches<br />

All modern computers make use of caches. To close the gap further between the<br />

fast clock rates of modern processors <strong>and</strong> the increasingly long time required to<br />

access DRAMs, most microprocessors support an additional level of caching. This<br />

second-level cache is normally on the same chip <strong>and</strong> is accessed whenever a miss<br />

occurs in the primary cache. If the second-level cache contains the desired data,<br />

the miss penalty for the first-level cache will be essentially the access time of the<br />

second-level cache, which will be much less than the access time of main memory.<br />

If neither the primary nor the secondary cache contains the data, a main memory<br />

access is required, <strong>and</strong> a larger miss penalty is incurred.<br />

How significant is the performance improvement from the use of a secondary<br />

cache? The next example shows us.<br />

Performance of Multilevel Caches<br />

EXAMPLE<br />


Suppose we have a processor with a base CPI of 1.0, assuming all references<br />

hit in the primary cache, <strong>and</strong> a clock rate of 4 GHz. Assume a main memory<br />

access time of 100 ns, including all the miss h<strong>and</strong>ling. Suppose the miss rate<br />

per instruction at the primary cache is 2%. How much faster will the processor<br />

be if we add a secondary cache that has a 5 ns access time for either a hit or<br />

a miss <strong>and</strong> is large enough to reduce the miss rate to main memory to 0.5%?<br />

ANSWER

The miss penalty to main memory is

100 ns / (0.25 ns per clock cycle) = 400 clock cycles

The effective CPI with one level of caching is given by

Total CPI = Base CPI + Memory-stall cycles per instruction

For the processor with one level of caching,

Total CPI = 1.0 + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9

With two levels of caching, a miss in the primary (or first-level) cache can be<br />

satisfied either by the secondary cache or by main memory. The miss penalty<br />

for an access to the second-level cache is<br />

5 ns / (0.25 ns per clock cycle) = 20 clock cycles



If the miss is satisfied in the secondary cache, then this is the entire miss<br />

penalty. If the miss needs to go to main memory, then the total miss penalty is<br />

the sum of the secondary cache access time <strong>and</strong> the main memory access time.<br />

Thus, for a two-level cache, total CPI is the sum of the stall cycles from both<br />

levels of cache <strong>and</strong> the base CPI:<br />

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
          = 1 + 2% × 20 + 0.5% × 400 = 1 + 0.4 + 2.0 = 3.4

Thus, the processor with the secondary cache is faster by

9.0 / 3.4 = 2.6

Alternatively, we could have computed the stall cycles by summing the stall cycles of those references that hit in the secondary cache ((2% - 0.5%) × 20 = 0.3). The stall cycles of those references that go to main memory, which must include the cost to access the secondary cache as well as the main memory access time, are (0.5% × (20 + 400) = 2.1). The sum, 1.0 + 0.3 + 2.1, is again 3.4.
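The same calculation can be expressed directly in code. The following sketch (ours, not from the text) recomputes the one-level and two-level CPIs and the resulting speedup.

/* CPI with and without a secondary cache, using the example's parameters. */
#include <stdio.h>

int main(void)
{
    double base_cpi = 1.0;
    double cycle_ns = 0.25;                    /* 4 GHz clock                   */
    double mem_ns = 100.0, l2_ns = 5.0;
    double l1_miss = 0.02, l2_miss = 0.005;    /* misses per instruction        */

    double mem_penalty = mem_ns / cycle_ns;    /* 400 clock cycles              */
    double l2_penalty  = l2_ns / cycle_ns;     /*  20 clock cycles              */

    double cpi_one_level = base_cpi + l1_miss * mem_penalty;             /* 9.0 */
    double cpi_two_level = base_cpi + l1_miss * l2_penalty
                                    + l2_miss * mem_penalty;             /* 3.4 */

    printf("One level: %.1f  Two levels: %.1f  Speedup: %.1f\n",
           cpi_one_level, cpi_two_level, cpi_one_level / cpi_two_level);
    return 0;
}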

The design considerations for a primary <strong>and</strong> secondary cache are significantly<br />

different, because the presence of the other cache changes the best choice versus<br />

a single-level cache. In particular, a two-level cache structure allows the primary<br />

cache to focus on minimizing hit time to yield a shorter clock cycle or fewer<br />

pipeline stages, while allowing the secondary cache to focus on miss rate to reduce<br />

the penalty of long memory access times.<br />

The effect of these changes on the two caches can be seen by comparing each<br />

cache to the optimal design for a single level of cache. In comparison to a single-level

cache, the primary cache of a multilevel cache is often smaller. Furthermore,<br />

the primary cache may use a smaller block size, to go with the smaller cache size <strong>and</strong><br />

also to reduce the miss penalty. In comparison, the secondary cache will be much<br />

larger than in a single-level cache, since the access time of the secondary cache is<br />

less critical. With a larger total size, the secondary cache may use a larger block size<br />

than appropriate with a single-level cache. It often uses higher associativity than<br />

the primary cache given the focus of reducing miss rates.<br />

multilevel cache<br />

A memory hierarchy with<br />

multiple levels of caches,<br />

rather than just a cache<br />

<strong>and</strong> main memory.<br />

Sorting has been exhaustively analyzed to find better algorithms: Bubble Sort,<br />

Quicksort, Radix Sort, and so on. Figure 5.19(a) shows instructions executed per item sorted for Radix Sort versus Quicksort. As expected, for large arrays, Radix

Sort has an algorithmic advantage over Quicksort in terms of number of operations.<br />

Figure 5.19(b) shows time per key instead of instructions executed. We see that the<br />

lines start on the same trajectory as in Figure 5.19(a), but then the Radix Sort line<br />

Understanding Program Performance



[Figure 5.19 plots: (a) Instructions/item, (b) Clock cycles/item, and (c) Cache misses/item, each plotted against Size (K items to sort) from 4 to 4096, for Radix Sort and Quicksort.]

FIGURE 5.19 Comparing Quicksort <strong>and</strong> Radix Sort by (a) instructions executed per item<br />

sorted, (b) time per item sorted, <strong>and</strong> (c) cache misses per item sorted. This data is from a<br />

paper by LaMarca <strong>and</strong> Ladner [1996]. Due to such results, new versions of Radix Sort have been invented<br />

that take memory hierarchy into account, to regain its algorithmic advantages (see Section 5.15). The basic<br />

idea of cache optimizations is to use all the data in a block repeatedly before it is replaced on a miss.



diverges as the data to sort increases. What is going on? Figure 5.19(c) answers by<br />

looking at the cache misses per item sorted: Quicksort consistently has many fewer<br />

misses per item to be sorted.<br />

Alas, st<strong>and</strong>ard algorithmic analysis often ignores the impact of the memory<br />

hierarchy. As faster clock rates <strong>and</strong> Moore’s Law allow architects to squeeze all of<br />

the performance out of a stream of instructions, using the memory hierarchy well<br />

is critical to high performance. As we said in the introduction, underst<strong>and</strong>ing the<br />

behavior of the memory hierarchy is critical to underst<strong>and</strong>ing the performance of<br />

programs on today’s computers.<br />

Software Optimization via Blocking<br />

Given the importance of the memory hierarchy to program performance, not<br />

surprisingly many software optimizations were invented that can dramatically<br />

improve performance by reusing data within the cache <strong>and</strong> hence lower miss rates<br />

due to improved temporal locality.<br />

When dealing with arrays, we can get good performance from the memory<br />

system if we store the array in memory so that accesses to the array are sequential<br />

in memory. Suppose that we are dealing with multiple arrays, however, with some<br />

arrays accessed by rows <strong>and</strong> some by columns. Storing the arrays row-by-row<br />

(called row major order) or column-by-column (column major order) does not<br />

solve the problem because both rows <strong>and</strong> columns are used in every loop iteration.<br />

Instead of operating on entire rows or columns of an array, blocked algorithms<br />

operate on submatrices or blocks. The goal is to maximize accesses to the data<br />

loaded into the cache before the data are replaced; that is, improve temporal locality<br />

to reduce cache misses.<br />

For example, the inner loops of DGEMM (lines 4 through 9 of Figure 3.21 in<br />

Chapter 3) are<br />

for (int j = 0; j < n; ++j)
{
    double cij = C[i+j*n];               /* cij = C[i][j] */
    for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n];      /* cij += A[i][k]*B[k][j] */
    C[i+j*n] = cij;                      /* C[i][j] = cij */
}

It reads all N-by-N elements of B, reads the same N elements in what corresponds to<br />

one row of A repeatedly, <strong>and</strong> writes what corresponds to one row of N elements of<br />

C. (The comments make the rows <strong>and</strong> columns of the matrices easier to identify.)<br />

Figure 5.20 gives a snapshot of the accesses to the three arrays. A dark shade<br />

indicates a recent access, a light shade indicates an older access, <strong>and</strong> white means<br />

not yet accessed.



[Figure 5.20 diagram: 6-by-6 snapshots of the three arrays, with C indexed by i and j, A indexed by i and k, and B indexed by k and j; shading marks how recently each element has been accessed.]

FIGURE 5.20 A snapshot of the three arrays C, A, and B when N = 6 and i = 1. The age of accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses. Compared to Figure 5.21, elements of A and B are read repeatedly to calculate new elements of C. The variables i, j, and k are shown along the rows or columns used to access the arrays.

The number of capacity misses clearly depends on N <strong>and</strong> the size of the cache. If<br />

it can hold all three N-by-N matrices, then all is well, provided there are no cache<br />

conflicts. We purposely picked the matrix size to be 32 by 32 in DGEMM for<br />

Chapters 3 and 4 so that this would be the case. Each matrix is 32 × 32 = 1024 elements and each element is 8 bytes, so the three matrices occupy 24 KiB, which comfortably fits in the 32 KiB data cache of the Intel Core i7 (Sandy Bridge).

If the cache can hold one N-by-N matrix and one row of N, then at least the ith row of A and the array B may stay in the cache. Less than that and misses may occur for both B and C. In the worst case, there would be 2N^3 + N^2 memory words accessed for N^3 operations.

To ensure that the elements being accessed can fit in the cache, the original code<br />

is changed to compute on a submatrix. Hence, we essentially invoke the version of<br />

DGEMM from Figure 4.80 in Chapter 4 repeatedly on matrices of size BLOCKSIZE<br />

by BLOCKSIZE. BLOCKSIZE is called the blocking factor.<br />

Figure 5.21 shows the blocked version of DGEMM. The function do_block is<br />

DGEMM from Figure 3.21 with three new parameters si, sj, <strong>and</strong> sk to specify<br />

the starting position of each submatrix of of A, B, <strong>and</strong> C. The two inner loops of the<br />

do_block now compute in steps of size BLOCKSIZE rather than the full length<br />

of B <strong>and</strong> C. The gcc optimizer removes any function call overhead by “inlining” the<br />

function; that is, it inserts the code directly to avoid the conventional parameter<br />

passing <strong>and</strong> return address bookkeeping instructions.<br />

Figure 5.22 illustrates the accesses to the three arrays using blocking. Looking<br />

only at capacity misses, the total number of memory words accessed is 2N^3/BLOCKSIZE + N^2. This total is an improvement by about a factor of BLOCKSIZE.

Hence, blocking exploits a combination of spatial <strong>and</strong> temporal locality, since A<br />

benefits from spatial locality <strong>and</strong> B benefits from temporal locality.



#define BLOCKSIZE 32
void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C)
{
    for (int i = si; i < si+BLOCKSIZE; ++i)
        for (int j = sj; j < sj+BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];             /* cij = C[i][j] */
            for (int k = sk; k < sk+BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                    /* C[i][j] = cij */
        }
}

void dgemm (int n, double* A, double* B, double* C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

FIGURE 5.21 Cache blocked version of DGEMM in Figure 3.21. Assume C is initialized to zero. The do_block<br />

function is basically DGEMM from Chapter 3 with new parameters to specify the starting positions of the submatrices of<br />

BLOCKSIZE. The gcc optimizer can remove the function overhead instructions by inlining the do_block function.<br />
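Figure 5.21 shows no caller. A hypothetical driver like the following (our sketch, not part of the figure) could exercise it; it assumes n is a multiple of BLOCKSIZE and that C starts out zeroed, as the caption requires.

#include <stdlib.h>

void dgemm(int n, double *A, double *B, double *C);   /* Figure 5.21 */

int main(void)
{
    int n = 960;                               /* a multiple of BLOCKSIZE (32)  */
    double *A = malloc(n * n * sizeof(double));
    double *B = malloc(n * n * sizeof(double));
    double *C = calloc(n * n, sizeof(double)); /* C initialized to zero         */

    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    dgemm(n, A, B, C);                         /* C = A * B, block by block     */

    free(A); free(B); free(C);
    return 0;
}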

[Figure 5.22 diagram: the same three 6-by-6 arrays as in Figure 5.20, but with only a 3-by-3 block of each array touched so far.]

FIGURE 5.22 The age of accesses to the arrays C, A, and B when BLOCKSIZE = 3. Note that, in contrast to Figure 5.20, fewer elements are accessed.

Although we have aimed at reducing cache misses, blocking can also be used to<br />

help register allocation. By taking a small blocking size such that the block can be<br />

held in registers, we can minimize the number of loads <strong>and</strong> stores in the program,<br />

which also improves performance.



[Figure 5.23 bar chart, GFLOPS by matrix size (reading the bar labels left to right): Unoptimized 1.7, 1.5, 1.3, 0.8 and Blocked 1.7, 1.6, 1.6, 1.5 for 32x32, 160x160, 480x480, and 960x960, respectively.]

FIGURE 5.23 Performance of unoptimized DGEMM (Figure 3.21) versus cache blocked<br />

DGEMM (Figure 5.21) as the matrix dimension varies from 32x32 (where all three matrices<br />

fit in the cache) to 960x960.<br />

Figure 5.23 shows the impact of cache blocking on the performance of the<br />

unoptimized DGEMM as we increase the matrix size beyond where all three<br />

matrices fit in the cache. The unoptimized performance is halved for the largest<br />

matrix. The cache-blocked version is less than 10% slower even at matrices that are<br />

960x960, or 900 times larger than the 32 × 32 matrices in Chapters 3 <strong>and</strong> 4.<br />

global miss rate The<br />

fraction of references<br />

that miss in all levels of a<br />

multilevel cache.<br />

local miss rate The<br />

fraction of references to<br />

one level of a cache that<br />

miss; used in multilevel<br />

hierarchies.<br />

Elaboration: Multilevel caches create several complications. First, there are now<br />

several different types of misses <strong>and</strong> corresponding miss rates. In the example on<br />

pages 410–411, we saw the primary cache miss rate <strong>and</strong> the global miss rate—the<br />

fraction of references that missed in all cache levels. There is also a miss rate for the<br />

secondary cache, which is the ratio of all misses in the secondary cache divided by the<br />

number of accesses to it. This miss rate is called the local miss rate of the secondary<br />

cache. Because the primary cache filters accesses, especially those with good spatial

<strong>and</strong> temporal locality, the local miss rate of the secondary cache is much higher than the<br />

global miss rate. For the example on pages 410–411, we can compute the local miss<br />

rate of the secondary cache as 0.5%/2% = 25%! Luckily, the global miss rate dictates

how often we must access the main memory.<br />
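In code, that relationship is a one-line division; the small sketch below (ours) uses the numbers from the example on pages 410-411.

/* Local miss rate of the secondary cache from the global and L1 miss rates. */
#include <stdio.h>

int main(void)
{
    double l1_miss_rate = 0.02;       /* primary-cache misses per reference    */
    double global_miss_rate = 0.005;  /* references that miss in all levels    */

    /* The secondary cache only sees the primary cache's misses, so: */
    double l2_local_miss_rate = global_miss_rate / l1_miss_rate;
    printf("L2 local miss rate = %.0f%%\n", l2_local_miss_rate * 100);  /* 25% */
    return 0;
}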

Elaboration: With out-of-order processors (see Chapter 4), performance is more<br />

complex, since they execute instructions during the miss penalty. Instead of instruction<br />

miss rates <strong>and</strong> data miss rates, we use misses per instruction, <strong>and</strong> this formula:<br />

Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency - Overlapped miss latency)



There is no general way to calculate overlapped miss latency, so evaluations of<br />

memory hierarchies for out-of-order processors inevitably require simulation of the<br />

processor <strong>and</strong> the memory hierarchy. Only by seeing the execution of the processor<br />

during each miss can we see if the processor stalls waiting for data or simply finds other

work to do. A guideline is that the processor often hides the miss penalty for an L1<br />

cache miss that hits in the L2 cache, but it rarely hides a miss to the L2 cache.<br />

Elaboration: The performance challenge for algorithms is that the memory hierarchy<br />

varies between different implementations of the same architecture in cache size,<br />

associativity, block size, <strong>and</strong> number of caches. To cope with such variability, some<br />

recent numerical libraries parameterize their algorithms <strong>and</strong> then search the parameter<br />

space at runtime to find the best combination for a particular computer. This approach

is called autotuning.<br />

Which of the following is generally true about a design with multiple levels of<br />

caches?<br />

1. First-level caches are more concerned about hit time, <strong>and</strong> second-level<br />

caches are more concerned about miss rate.<br />

2. First-level caches are more concerned about miss rate, <strong>and</strong> second-level<br />

caches are more concerned about hit time.<br />

Check<br />

Yourself<br />

Summary<br />

In this section, we focused on four topics: cache performance, using associativity to<br />

reduce miss rates, the use of multilevel cache hierarchies to reduce miss penalties,<br />

<strong>and</strong> software optimizations to improve effectiveness of caches.<br />

The memory system has a significant effect on program execution time. The<br />

number of memory-stall cycles depends on both the miss rate <strong>and</strong> the miss penalty.<br />

The challenge, as we will see in Section 5.8, is to reduce one of these factors without<br />

significantly affecting other critical factors in the memory hierarchy.<br />

To reduce the miss rate, we examined the use of associative placement schemes.<br />

Such schemes can reduce the miss rate of a cache by allowing more flexible<br />

placement of blocks within the cache. Fully associative schemes allow blocks to be<br />

placed anywhere, but also require that every block in the cache be searched to satisfy<br />

a request. The higher costs make large fully associative caches impractical. Set-associative

caches are a practical alternative, since we need only search among the<br />

elements of a unique set that is chosen by indexing. Set-associative caches have higher<br />

miss rates but are faster to access. The amount of associativity that yields the best<br />

performance depends on both the technology <strong>and</strong> the details of the implementation.<br />

We looked at multilevel caches as a technique to reduce the miss penalty by<br />

allowing a larger secondary cache to handle misses to the primary cache. Second-level

caches have become commonplace as designers find that limited silicon <strong>and</strong><br />

the goals of high clock rates prevent primary caches from becoming large. The<br />

secondary cache, which is often ten or more times larger than the primary cache,<br />

h<strong>and</strong>les many accesses that miss in the primary cache. In such cases, the miss<br />

penalty is that of the access time to the secondary cache (typically < 10 processor



error detection<br />

code A code that<br />

enables the detection of<br />

an error in data, but not<br />

the precise location <strong>and</strong>,<br />

hence, correction of the<br />

error.<br />

The Hamming Single Error Correcting, Double Error<br />

Detecting Code (SEC/DED)<br />

Richard Hamming invented a popular redundancy scheme for memory, for which<br />

he received the Turing Award in 1968. To invent redundant codes, it is helpful<br />

to talk about how “close” correct bit patterns can be. What we call the Hamming<br />

distance is just the minimum number of bits that are different between any two<br />

correct bit patterns. For example, the distance between 011011 <strong>and</strong> 001111 is two.<br />

What happens if the minimum distance between members of a code is two, and

we get a one-bit error? It will turn a valid pattern in a code to an invalid one. Thus,<br />

if we can detect whether members of a code are valid or not, we can detect single<br />

bit errors, <strong>and</strong> can say we have a single bit error detection code.<br />

Hamming used a parity code for error detection. In a parity code, the number<br />

of 1s in a word is counted; the word has odd parity if the number of 1s is odd <strong>and</strong><br />

even otherwise. When a word is written into memory, the parity bit is also written<br />

(1 for odd, 0 for even). That is, the parity of the N+1 bit word should always be even.<br />

Then, when the word is read out, the parity bit is read <strong>and</strong> checked. If the parity of the<br />

memory word <strong>and</strong> the stored parity bit do not match, an error has occurred.<br />

EXAMPLE<br />

Calculate the parity of a byte with the value 31ten and show the pattern stored to

memory. Assume the parity bit is on the right. Suppose the most significant bit<br />

was inverted in memory, <strong>and</strong> then you read it back. Did you detect the error?<br />

What happens if the two most significant bits are inverted?<br />

ANSWER<br />

31ten is 00011111two, which has five 1s. To make parity even, we need to write a 1 in the parity bit, or 000111111two. If the most significant bit is inverted when we read it back, we would see 100111111two, which has seven 1s. Since we expect even parity and calculated odd parity, we would signal an error. If the two most significant bits are inverted, we would see 110111111two, which has eight 1s, or even parity, and we would not signal an error.

If there are 2 bits of error, then a 1-bit parity scheme will not detect any errors,<br />

since the parity will match the data with two errors. (Actually, a 1-bit parity scheme<br />

can detect any odd number of errors; however, the probability of having 3 errors is<br />

much lower than the probability of having two, so, in practice, a 1-bit parity code is<br />

limited to detecting a single bit of error.)<br />
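A parity bit is just as cheap to compute in software. The C sketch below (ours, not from the text) reproduces the example above: append an even-parity bit on the right, then recheck the parity on a read.

/* Even-parity scheme: the 9-bit stored word always has an even number of 1s. */
#include <stdio.h>

int count_ones(unsigned v) { int n = 0; for (; v; v >>= 1) n += v & 1; return n; }

int main(void)
{
    unsigned byte = 31;                       /* 00011111, five 1s              */
    unsigned parity = count_ones(byte) & 1;   /* 1, to make the total even      */
    unsigned stored = (byte << 1) | parity;   /* 000111111, parity on the right */

    /* On a read, recompute parity over all nine bits; odd parity means error. */
    printf("error detected: %s\n", (count_ones(stored) & 1) ? "yes" : "no");
    return 0;
}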

Of course, a parity code cannot correct errors, which Hamming wanted to do<br />

as well as detect them. If we used a code that had a minimum distance of 3, then<br />

any single bit error would be closer to the correct pattern than to any other valid<br />

pattern. He came up with an easy to underst<strong>and</strong> mapping of data into a distance 3<br />

code that we call Hamming Error Correction Code (ECC) in his honor. We use extra



EXAMPLE<br />

Assume one byte data value is 10011010two. First show the Hamming ECC code

for that byte, <strong>and</strong> then invert bit 10 <strong>and</strong> show that the ECC code finds <strong>and</strong><br />

corrects the single bit error.<br />

ANSWER<br />

Leaving spaces for the parity bits, the 12-bit pattern is _ _ 1 _ 0 0 1 _ 1 0 1 0.

Position 1 checks bits 1, 3, 5, 7, 9, and 11; the data bits in those positions are 1, 0, 1, 1, 1 (an even number of 1s), so to make the group even parity we set bit 1 to 0.

Position 2 checks bits 2, 3, 6, 7, 10, and 11, which hold 1, 0, 1, 0, 1 (odd parity), so we set position 2 to a 1.

Position 4 checks bits 4, 5, 6, 7, and 12, which hold 0, 0, 1, 0 (odd parity), so we set it to a 1.

Position 8 checks bits 8, 9, 10, 11, and 12, which hold 1, 0, 1, 0 (even parity), so we set it to a 0.

The final code word is 011100101010. Inverting bit 10 changes it to 011100101110.

Parity bit 1 is 0 (its group, bits 1, 3, 5, 7, 9, and 11 of 011100101110, has four 1s, so even parity; this group is OK).

Parity bit 2 is 1 (its group, bits 2, 3, 6, 7, 10, and 11, has five 1s, so odd parity; there is an error somewhere).

Parity bit 4 is 1 (its group, bits 4, 5, 6, 7, and 12, has two 1s, so even parity; this group is OK).

Parity bit 8 is 0 (its group, bits 8, 9, 10, 11, and 12, has three 1s, so odd parity; there is an error somewhere).

Parity bits 2 and 8 are incorrect. As 2 + 8 = 10, bit 10 must be wrong. Hence, we can correct the error by inverting bit 10: 011100101010. Voila!
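The construction in this example can be automated. The following C sketch (ours; the function names are invented) builds the 12-bit code word for 10011010two and then uses the failing parity groups as a syndrome to locate the flipped bit.

/* Hamming SEC over 8 data bits: positions 1,2,4,8 hold parity, the rest data. */
#include <stdio.h>

int bits[13];                                /* bits[1..12] of the code word    */

int is_parity_pos(int n) { return (n & (n - 1)) == 0; }   /* 1, 2, 4, 8         */

void encode(unsigned data)                   /* 8 data bits, MSB placed first   */
{
    int d = 7;
    for (int pos = 1; pos <= 12; pos++)
        if (!is_parity_pos(pos)) bits[pos] = (data >> d--) & 1;
    for (int p = 1; p <= 8; p <<= 1) {       /* parity positions 1, 2, 4, 8     */
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && pos != p) parity ^= bits[pos];
        bits[p] = parity;                    /* even parity over the group      */
    }
}

int syndrome(void)                           /* 0 = no error, else bad position */
{
    int s = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p) parity ^= bits[pos];
        if (parity) s += p;                  /* this group came out odd         */
    }
    return s;
}

int main(void)
{
    encode(0x9A);                            /* 10011010, as in the example     */
    for (int i = 1; i <= 12; i++) printf("%d", bits[i]);   /* 011100101010      */
    bits[10] ^= 1;                           /* inject the example's error      */
    printf("\nerror at position %d\n", syndrome());        /* prints 10         */
    return 0;
}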

Hamming did not stop at single bit error correction code. At the cost of one more<br />

bit, we can make the minimum Hamming distance in a code be 4. This means<br />

we can correct single bit errors <strong>and</strong> detect double bit errors. The idea is to add a<br />

parity bit that is calculated over the whole word. Let’s use a four-bit data word as<br />

an example, which would need only 7 bits for single-bit error correction. Hamming

parity bits H (p1 p2 p3) are computed (even parity as usual) plus the even parity<br />

over the entire word, p4:<br />

Bit position    1    2    3    4    5    6    7    8
Contents        p1   p2   d1   p3   d2   d3   d4   p4

Then the algorithm to correct one error and detect two is just to calculate parity over the ECC groups (H) as before plus one more over the whole group (p4). There are four cases:

1. H is even and p4 is even, so no error occurred.

2. H is odd and p4 is odd, so a correctable single error occurred. (p4 should calculate odd parity if one error occurred.)

3. H is even and p4 is odd, a single error occurred in the p4 bit, not in the rest of the word, so correct the p4 bit.
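The case analysis translates directly into code. In the sketch below (ours), H is reported as a syndrome value, which is 0 when every ECC group has even parity; the final branch is the standard double-error case that completes the list.

/* SEC/DED decision procedure: H = syndrome over the ECC groups (0 means all
   groups have even parity), p4_odd = parity over the whole word is odd.      */
#include <stdio.h>

void check(int H, int p4_odd)
{
    if (H == 0 && !p4_odd)
        printf("no error\n");
    else if (H != 0 && p4_odd)
        printf("single error at position %d: correct it\n", H);
    else if (H == 0 && p4_odd)
        printf("single error in the p4 bit itself: correct p4\n");
    else  /* H nonzero and p4 even: assumed standard fourth case */
        printf("double error detected (not correctable)\n");
}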


5.6 Virtual Machines

allow these separate software stacks to run independently yet share hardware,<br />

thereby consolidating the number of servers. Another example is that some<br />

VMMs support migration of a running VM to a different computer, either<br />

to balance load or to evacuate from failing hardware.<br />

Amazon Web Services (AWS) uses the virtual machines in its cloud computing<br />

offering EC2 for five reasons:<br />

1. It allows AWS to protect users from each other while sharing the same server.<br />

2. It simplifies software distribution within a warehouse scale computer. A<br />

customer installs a virtual machine image configured with the appropriate<br />

software, <strong>and</strong> AWS distributes it to all the instances a customer wants to use.<br />

3. Customers (<strong>and</strong> AWS) can reliably “kill” a VM to control resource usage<br />

when customers complete their work.<br />

4. Virtual machines hide the identity of the hardware on which the customer is<br />

running, which means AWS can keep using old servers <strong>and</strong> introduce new,<br />

more efficient servers. The customer expects performance for instances to<br />

match their ratings in “EC2 Compute Units,” which AWS defines: to “provide<br />

the equivalent CPU capacity of a 1.0–1.2 GHz 2007 AMD Opteron or 2007<br />

Intel Xeon processor.” Thanks to Moore’s Law, newer servers clearly offer<br />

more EC2 Compute Units than older ones, but AWS can keep renting old<br />

servers as long as they are economical.<br />

5. Virtual Machine Monitors can control the rate that a VM uses the processor,<br />

the network, <strong>and</strong> disk space, which allows AWS to offer many price points<br />

of instances of different types running on the same underlying servers.<br />

For example, in 2012 AWS offered 14 instance types, from small st<strong>and</strong>ard<br />

instances at $0.08 per hour to high I/O quadruple extra large instances at<br />

$3.10 per hour.<br />

Hardware/<br />

Software<br />

Interface<br />

In general, the cost of processor virtualization depends on the workload. Userlevel<br />

processor-bound programs have zero virtualization overhead, because the<br />

OS is rarely invoked, so everything runs at native speeds. I/O-intensive workloads<br />

are generally also OS-intensive, executing many system calls <strong>and</strong> privileged<br />

instructions that can result in high virtualization overhead. On the other h<strong>and</strong>, if<br />

the I/O-intensive workload is also I/O-bound, the cost of processor virtualization<br />

can be completely hidden, since the processor is often idle waiting for I/O.<br />

The overhead is determined by both the number of instructions that must be<br />

emulated by the VMM <strong>and</strong> by how much time each takes to emulate them. Hence,<br />

when the guest VMs run the same ISA as the host, as we assume here, the goal



of the architecture <strong>and</strong> the VMM is to run almost all instructions directly on the<br />

native hardware.<br />

Requirements of a Virtual Machine Monitor<br />

What must a VM monitor do? It presents a software interface to guest software, it<br />

must isolate the state of guests from each other, <strong>and</strong> it must protect itself from guest<br />

software (including guest OSes). The qualitative requirements are:<br />

■ Guest software should behave on a VM exactly as if it were running on the<br />

native hardware, except for performance-related behavior or limitations of<br />

fixed resources shared by multiple VMs.<br />

■ Guest software should not be able to change allocation of real system resources<br />

directly.<br />

To “virtualize” the processor, the VMM must control just about everything—access<br />

to privileged state, I/O, exceptions, <strong>and</strong> interrupts—even though the guest VM <strong>and</strong><br />

OS currently running are temporarily using them.<br />

For example, in the case of a timer interrupt, the VMM would suspend the<br />

currently running guest VM, save its state, h<strong>and</strong>le the interrupt, determine which<br />

guest VM to run next, <strong>and</strong> then load its state. Guest VMs that rely on a timer<br />

interrupt are provided with a virtual timer <strong>and</strong> an emulated timer interrupt by the<br />

VMM.<br />

To be in charge, the VMM must be at a higher privilege level than the guest<br />

VM, which generally runs in user mode; this also ensures that the execution of<br />

any privileged instruction will be h<strong>and</strong>led by the VMM. The basic requirements of<br />

a system virtual machine are:

■ At least two processor modes, system <strong>and</strong> user.<br />

■ A privileged subset of instructions that is available only in system mode,<br />

resulting in a trap if executed in user mode; all system resources must be<br />

controllable only via these instructions.<br />

(Lack of) Instruction Set Architecture Support for Virtual<br />

Machines<br />

If VMs are planned for during the design of the ISA, it’s relatively easy to reduce<br />

both the number of instructions that must be executed by a VMM <strong>and</strong> improve<br />

their emulation speed. An architecture that allows the VM to execute directly on<br />

the hardware earns the title virtualizable, <strong>and</strong> the IBM 370 architecture proudly<br />

bears that label.<br />

Alas, since VMs have been considered for PC <strong>and</strong> server applications only fairly<br />

recently, most instruction sets were created without virtualization in mind. These<br />

culprits include x86 <strong>and</strong> most RISC architectures, including ARMv7 <strong>and</strong> MIPS.



Because the VMM must ensure that the guest system only interacts with virtual<br />

resources, a conventional guest OS runs as a user mode program on top of the<br />

VMM. Then, if a guest OS attempts to access or modify information related to<br />

hardware resources via a privileged instruction—for example, reading or writing<br />

a status bit that enables interrupts—it will trap to the VMM. The VMM can then<br />

effect the appropriate changes to corresponding real resources.<br />

Hence, if any instruction that tries to read or write such sensitive information<br />

traps when executed in user mode, the VMM can intercept it <strong>and</strong> support a virtual<br />

version of the sensitive information, as the guest OS expects.<br />

In the absence of such support, other measures must be taken. A VMM must<br />

take special precautions to locate all problematic instructions <strong>and</strong> ensure that they<br />

behave correctly when executed by a guest OS, thereby increasing the complexity<br />

of the VMM <strong>and</strong> reducing the performance of running the VM.<br />

Protection <strong>and</strong> Instruction Set Architecture<br />

Protection is a joint effort of architecture <strong>and</strong> operating systems, but architects<br />

had to modify some awkward details of existing instruction set architectures when<br />

virtual memory became popular.<br />

For example, the x86 instruction POPF loads the flag registers from the top of<br />

the stack in memory. One of the flags is the Interrupt Enable (IE) flag. If you run<br />

the POPF instruction in user mode, rather than trap it, it simply changes all the<br />

flags except IE. In system mode, it does change the IE. Since a guest OS runs in user<br />

mode inside a VM, this is a problem, as it expects to see a changed IE.<br />

Historically, IBM mainframe hardware <strong>and</strong> VMM took three steps to improve<br />

performance of virtual machines:<br />

1. Reduce the cost of processor virtualization.<br />

2. Reduce interrupt overhead cost due to the virtualization.<br />

3. Reduce interrupt cost by steering interrupts to the proper VM without<br />

invoking VMM.<br />

AMD <strong>and</strong> Intel tried to address the first point in 2006 by reducing the cost of<br />

processor virtualization. It will be interesting to see how many generations of<br />

architecture <strong>and</strong> VMM modifications it will take to address all three points, <strong>and</strong><br />

how long before virtual machines of the 21st century will be as efficient as the IBM<br />

mainframes <strong>and</strong> VMMs of the 1970s.<br />

5.7 Virtual Memory<br />

In earlier sections, we saw how caches provided fast access to recently used portions<br />

of a program’s code <strong>and</strong> data. Similarly, the main memory can act as a “cache” for<br />

… a system has been devised to make the core drum combination appear to the programmer as a single level store, the requisite transfers taking place automatically.
Kilburn et al., One-level storage system, 1962



virtual memory<br />

A technique that uses<br />

main memory as a “cache”<br />

for secondary storage.<br />

physical address<br />

An address in main<br />

memory.<br />

protection A set<br />

of mechanisms for<br />

ensuring that multiple<br />

processes sharing the<br />

processor, memory,<br />

or I/O devices cannot<br />

interfere, intentionally<br />

or unintentionally, with<br />

one another by reading or<br />

writing each other’s data.<br />

These mechanisms also<br />

isolate the operating system<br />

from a user process.<br />

page fault An event that<br />

occurs when an accessed<br />

page is not present in<br />

main memory.<br />

virtual address<br />

An address that<br />

corresponds to a location<br />

in virtual space <strong>and</strong> is<br />

translated by address<br />

mapping to a physical<br />

address when memory is<br />

accessed.<br />

the secondary storage, usually implemented with magnetic disks. This technique is<br />

called virtual memory. Historically, there were two major motivations for virtual<br />

memory: to allow efficient <strong>and</strong> safe sharing of memory among multiple programs,<br />

such as for the memory needed by multiple virtual machines for cloud computing,<br />

<strong>and</strong> to remove the programming burdens of a small, limited amount of main<br />

memory. Five decades after its invention, it’s the former reason that reigns today.<br />

Of course, to allow multiple virtual machines to share the same memory, we<br />

must be able to protect the virtual machines from each other, ensuring that a<br />

program can only read <strong>and</strong> write the portions of main memory that have been<br />

assigned to it. Main memory need contain only the active portions of the many<br />

virtual machines, just as a cache contains only the active portion of one program.<br />

Thus, the principle of locality enables virtual memory as well as caches, <strong>and</strong> virtual<br />

memory allows us to efficiently share the processor as well as the main memory.<br />

We cannot know which virtual machines will share the memory with other<br />

virtual machines when we compile them. In fact, the virtual machines sharing<br />

the memory change dynamically while the virtual machines are running. Because<br />

of this dynamic interaction, we would like to compile each program into its<br />

own address space—a separate range of memory locations accessible only to this<br />

program. Virtual memory implements the translation of a program’s address space<br />

to physical addresses. This translation process enforces protection of a program’s<br />

address space from other virtual machines.<br />

The second motivation for virtual memory is to allow a single user program<br />

to exceed the size of primary memory. Formerly, if a program became too large<br />

for memory, it was up to the programmer to make it fit. Programmers divided<br />

programs into pieces <strong>and</strong> then identified the pieces that were mutually exclusive.<br />

These overlays were loaded or unloaded under user program control during<br />

execution, with the programmer ensuring that the program never tried to access<br />

an overlay that was not loaded <strong>and</strong> that the overlays loaded never exceeded the<br />

total size of the memory. Overlays were traditionally organized as modules, each<br />

containing both code <strong>and</strong> data. Calls between procedures in different modules<br />

would lead to overlaying of one module with another.<br />

As you can well imagine, this responsibility was a substantial burden on<br />

programmers. Virtual memory, which was invented to relieve programmers of<br />

this difficulty, automatically manages the two levels of the memory hierarchy<br />

represented by main memory (sometimes called physical memory to distinguish it<br />

from virtual memory) <strong>and</strong> secondary storage.<br />

Although the concepts at work in virtual memory <strong>and</strong> in caches are the same,<br />

their differing historical roots have led to the use of different terminology. A virtual<br />

memory block is called a page, <strong>and</strong> a virtual memory miss is called a page fault.<br />

With virtual memory, the processor produces a virtual address, which is translated<br />

by a combination of hardware <strong>and</strong> software to a physical address, which in turn can<br />

be used to access main memory. Figure 5.25 shows the virtually addressed memory<br />

with pages mapped to main memory. This process is called address mapping or



Virtual address<br />

31 30 29 28 27 15 14 13 12 11 10 9 8<br />

3 2 1 0<br />

Virtual page number<br />

Page offset<br />

Translation<br />

29 28 27 15 14 13 12 11 10 9 8<br />

3 2 1 0<br />

Physical page number<br />

Page offset<br />

Physical address<br />

FIGURE 5.26 Mapping from a virtual to a physical address. The page size is 2^12 = 4 KiB. The number of physical pages allowed in memory is 2^18, since the physical page number has 18 bits in it. Thus, main memory can have at most 1 GiB, while the virtual address space is 4 GiB.
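In software terms, the translation in Figure 5.26 is a shift, a mask, and one table lookup. The sketch below is ours; the page_table array is purely illustrative.

/* Virtual-to-physical translation with 4 KiB pages and a 32-bit virtual address. */
#include <stdint.h>

#define PAGE_BITS 12                          /* 4 KiB pages                   */
uint32_t page_table[1 << 20];                 /* one entry per virtual page    */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;              /* 20-bit virtual page number  */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);  /* low 12 bits pass through    */
    uint32_t ppn    = page_table[vpn];                   /* 18-bit physical page number */
    return (ppn << PAGE_BITS) | offset;                  /* physical address            */
}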

Many design choices in virtual memory systems are motivated by the high cost<br />

of a page fault. A page fault to disk will take millions of clock cycles to process.<br />

(The table on page 378 shows that main memory latency is about 100,000 times<br />

quicker than disk.) This enormous miss penalty, dominated by the time to get the<br />

first word for typical page sizes, leads to several key decisions in designing virtual<br />

memory systems:<br />

■ Pages should be large enough to try to amortize the high access time. Sizes<br />

from 4 KiB to 16 KiB are typical today. New desktop <strong>and</strong> server systems are<br />

being developed to support 32 KiB <strong>and</strong> 64 KiB pages, but new embedded<br />

systems are going in the other direction, to 1 KiB pages.<br />

■ Organizations that reduce the page fault rate are attractive. The primary<br />

technique used here is to allow fully associative placement of pages in<br />

memory.<br />

■ Page faults can be h<strong>and</strong>led in software because the overhead will be small<br />

compared to the disk access time. In addition, software can afford to use clever<br />

algorithms for choosing how to place pages because even small reductions in<br />

the miss rate will pay for the cost of such algorithms.<br />

■ Write-through will not work for virtual memory, since writes take too long.<br />

Instead, virtual memory systems use write-back.



The next few subsections address these factors in virtual memory design.<br />

Elaboration: We present the motivation for virtual memory as many virtual machines<br />

sharing the same memory, but virtual memory was originally invented so that many<br />

programs could share a computer as part of a timesharing system. Since many readers<br />

today have no experience with time-sharing systems, we use virtual machines to motivate<br />

this section.<br />

Elaboration: For servers <strong>and</strong> even PCs, 32-bit address processors are problematic.<br />

Although we normally think of virtual addresses as much larger than physical addresses,<br />

the opposite can occur when the processor address size is small relative to the state<br />

of the memory technology. No single program or virtual machine can benefit, but a collection of programs or virtual machines running at the same time can benefit from

not having to be swapped to memory or by running on parallel processors.<br />

Elaboration: The discussion of virtual memory in this book focuses on paging,<br />

which uses fixed-size blocks. There is also a variable-size block scheme called

segmentation. In segmentation, an address consists of two parts: a segment number<br />

<strong>and</strong> a segment offset. The segment number is mapped to a physical address, <strong>and</strong><br />

the offset is added to find the actual physical address. Because the segment can

vary in size, a bounds check is also needed to make sure that the offset is within<br />

the segment. The major use of segmentation is to support more powerful methods<br />

of protection <strong>and</strong> sharing in an address space. Most operating system textbooks<br />

contain extensive discussions of segmentation compared to paging <strong>and</strong> of the use<br />

of segmentation to logically share the address space. The major disadvantage of<br />

segmentation is that it splits the address space into logically separate pieces that<br />

must be manipulated as a two-part address: the segment number <strong>and</strong> the offset.<br />

Paging, in contrast, makes the boundary between page number <strong>and</strong> offset invisible<br />

to programmers <strong>and</strong> compilers.<br />

Segments have also been used as a method to extend the address space without<br />

changing the word size of the computer. Such attempts have been unsuccessful because<br />

of the awkwardness <strong>and</strong> performance penalties inherent in a two-part address, of which<br />

programmers <strong>and</strong> compilers must be aware.<br />

Many architectures divide the address space into large fixed-size blocks that simplify protection between the operating system and user programs and increase the efficiency

of implementing paging. Although these divisions are often called “segments,” this<br />

mechanism is much simpler than variable block size segmentation <strong>and</strong> is not visible to<br />

user programs; we discuss it in more detail shortly.<br />
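To make the two-part translation concrete, here is a minimal C sketch, assuming a hypothetical segment descriptor that holds a base and a limit; real segmented machines differ in details such as protection bits, but the bounds check and the base-plus-offset addition are the essential steps.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical per-segment descriptor; field names are assumptions. */
    struct segment { uint32_t base; uint32_t limit; };

    extern struct segment segment_table[];

    /* Translate a two-part segmented address (segment number, offset).
       Returns false if the offset falls outside the segment. */
    bool segment_translate(uint32_t seg_num, uint32_t offset, uint32_t *pa)
    {
        struct segment s = segment_table[seg_num];
        if (offset >= s.limit)      /* bounds check against the segment size */
            return false;
        *pa = s.base + offset;      /* physical address = mapped base + offset */
        return true;
    }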

segmentation: A variable-size address mapping scheme in which an address consists of two parts: a segment number, which is mapped to a physical address, and a segment offset.

Placing a Page <strong>and</strong> Finding It Again<br />

Because of the incredibly high penalty for a page fault, designers reduce page fault<br />

frequency by optimizing page placement. If we allow a virtual page to be mapped<br />

to any physical page, the operating system can then choose to replace any page<br />

it wants when a page fault occurs. For example, the operating system can use a


sophisticated algorithm and complex data structures that track page usage to try to choose a page that will not be needed for a long time. The ability to use a clever and flexible replacement scheme reduces the page fault rate and simplifies the use of fully associative placement of pages.

page table: The table containing the virtual to physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.

As mentioned in Section 5.4, the difficulty in using fully associative placement<br />

is in locating an entry, since it can be anywhere in the upper level of the hierarchy.<br />

A full search is impractical. In virtual memory systems, we locate pages by using a<br />

table that indexes the memory; this structure is called a page table, <strong>and</strong> it resides<br />

in memory. A page table is indexed with the page number from the virtual address<br />

to discover the corresponding physical page number. Each program has its own<br />

page table, which maps the virtual address space of that program to main memory.<br />

In our library analogy, the page table corresponds to a mapping between book<br />

titles <strong>and</strong> library locations. Just as the card catalog may contain entries for books<br />

in another library on campus rather than the local branch library, we will see that<br />

the page table may contain entries for pages not present in memory. To indicate the<br />

location of the page table in memory, the hardware includes a register that points to<br />

the start of the page table; we call this the page table register. Assume for now that<br />

the page table is in a fixed <strong>and</strong> contiguous area of memory.<br />
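To make the indexing concrete, here is a minimal C sketch of a single-level lookup, assuming 4 KiB pages, a 32-bit virtual address, and a hypothetical page table entry holding just a valid bit and a physical page number; the real entry layout, and the hardware that performs this translation, are machine specific.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 12                  /* 4 KiB pages (assumed) */

    typedef struct {
        uint32_t valid : 1;                      /* is the page in main memory? */
        uint32_t ppn   : 20;                     /* physical page number (assumed width) */
    } pte_t;

    extern pte_t *page_table_register;           /* points to the start of the page table */

    /* Translate a virtual address; returns false on a page fault. */
    bool translate(uint32_t va, uint32_t *pa)
    {
        uint32_t vpn    = va >> PAGE_OFFSET_BITS;             /* virtual page number */
        uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);
        pte_t pte = page_table_register[vpn];                  /* index the page table */
        if (!pte.valid)
            return false;                                      /* page fault: OS takes over */
        *pa = (pte.ppn << PAGE_OFFSET_BITS) | offset;          /* physical page + offset */
        return true;
    }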

Hardware/<br />

Software<br />

Interface<br />

The page table, together with the program counter <strong>and</strong> the registers, specifies<br />

the state of a virtual machine. If we want to allow another virtual machine to use<br />

the processor, we must save this state. Later, after restoring this state, the virtual<br />

machine can continue execution. We often refer to this state as a process. The<br />

process is considered active when it is in possession of the processor; otherwise, it<br />

is considered inactive. The operating system can make a process active by loading<br />

the process’s state, including the program counter, which will initiate execution at<br />

the value of the saved program counter.<br />

The process’s address space, <strong>and</strong> hence all the data it can access in memory, is<br />

defined by its page table, which resides in memory. Rather than save the entire page<br />

table, the operating system simply loads the page table register to point to the page<br />

table of the process it wants to make active. Each process has its own page table,<br />

since different processes use the same virtual addresses. The operating system is<br />

responsible for allocating the physical memory <strong>and</strong> updating the page tables, so<br />

that the virtual address spaces of different processes do not collide. As we will see<br />

shortly, the use of separate page tables also provides protection of one process from<br />

another.


swap space: The space on the disk reserved for the full virtual memory space of a process.

Page Faults<br />

If the valid bit for a virtual page is off, a page fault occurs. The operating system<br />

must be given control. This transfer is done with the exception mechanism, which<br />

we saw in Chapter 4 <strong>and</strong> will discuss again later in this section. Once the operating<br />

system gets control, it must find the page in the next level of the hierarchy (usually<br />

flash memory or magnetic disk) <strong>and</strong> decide where to place the requested page in<br />

main memory.<br />

The virtual address alone does not immediately tell us where the page is on disk.<br />

Returning to our library analogy, we cannot find the location of a library book on<br />

the shelves just by knowing its title. Instead, we go to the catalog <strong>and</strong> look up the<br />

book, obtaining an address for the location on the shelves, such as the Library of<br />

Congress call number. Likewise, in a virtual memory system, we must keep track<br />

of the location on disk of each page in virtual address space.<br />

Because we do not know ahead of time when a page in memory will be replaced,<br />

the operating system usually creates the space on flash memory or disk for all the<br />

pages of a process when it creates the process. This space is called the swap space.<br />

At that time, it also creates a data structure to record where each virtual page is<br />

stored on disk. This data structure may be part of the page table or may be an<br />

auxiliary data structure indexed in the same way as the page table. Figure 5.28<br />

shows the organization when a single table holds either the physical page number<br />

or the disk address.<br />

The operating system also creates a data structure that tracks which processes<br />

<strong>and</strong> which virtual addresses use each physical page. When a page fault occurs,<br />

if all the pages in main memory are in use, the operating system must choose a<br />

page to replace. Because we want to minimize the number of page faults, most<br />

operating systems try to choose a page that they hypothesize will not be needed<br />

in the near future. Using the past to predict the future, operating systems follow<br />

the least recently used (LRU) replacement scheme, which we mentioned in Section<br />

5.4. The operating system searches for the least recently used page, assuming that<br />

a page that has not been used in a long time is less likely to be needed than a more<br />

recently accessed page. The replaced pages are written to swap space on the disk.<br />

In case you are wondering, the operating system is just another process, <strong>and</strong> these<br />

tables controlling memory are in memory; the details of this seeming contradiction<br />

will be explained shortly.



Elaboration: With a 32-bit virtual address, 4 KiB pages, and 4 bytes per page table entry, we can compute the total page table size:

Number of page table entries = 2^32 / 2^12 = 2^20

Size of page table = 2^20 page table entries × 2^2 bytes/page table entry = 4 MiB

That is, we would need to use 4 MiB of memory for each program in execution at any<br />

time. This amount is not so bad for a single process. What if there are hundreds of<br />

processes running, each with their own page table? And how should we handle 64-bit addresses, which by this calculation would need 2^52 words?
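For reference, the same arithmetic in a few lines of C, including the 64-bit case raised above; the 4 bytes per entry is the assumption used in the calculation.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t entries32 = 1ull << (32 - 12);   /* 2^32 / 2^12 = 2^20 entries */
        uint64_t bytes32   = entries32 * 4;       /* 4 bytes per entry -> 4 MiB */
        uint64_t entries64 = 1ull << (64 - 12);   /* 2^52 entries for 64-bit addresses */

        printf("32-bit: %llu entries, %llu MiB\n",
               (unsigned long long)entries32, (unsigned long long)(bytes32 >> 20));
        printf("64-bit: %llu entries\n", (unsigned long long)entries64);
        return 0;
    }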

A range of techniques is used to reduce the amount of storage required for the page<br />

table. The five techniques below aim at reducing the total maximum storage required as

well as minimizing the main memory dedicated to page tables:<br />

1. The simplest technique is to keep a limit register that restricts the size of the<br />

page table for a given process. If the virtual page number becomes larger than<br />

the contents of the limit register, entries must be added to the page table. This<br />

technique allows the page table to grow as a process consumes more space.<br />

Thus, the page table will only be large if the process is using many pages of<br />

virtual address space. This technique requires that the address space exp<strong>and</strong> in<br />

only one direction.<br />

2. Allowing growth in only one direction is not sufficient, since most languages require<br />

two areas whose size is exp<strong>and</strong>able: one area holds the stack <strong>and</strong> the other area<br />

holds the heap. Because of this duality, it is convenient to divide the page table<br />

<strong>and</strong> let it grow from the highest address down, as well as from the lowest address<br />

up. This means that there will be two separate page tables <strong>and</strong> two separate<br />

limits. The use of two page tables breaks the address space into two segments.<br />

The high-order bit of an address usually determines which segment <strong>and</strong> thus which<br />

page table to use for that address. Since the high-order address bit specifies the<br />

segment, each segment can be as large as one-half of the address space. A<br />

limit register for each segment specifies the current size of the segment, which<br />

grows in units of pages. This type of segmentation is used by many architectures,<br />

including MIPS. Unlike the type of segmentation discussed in the third elaboration<br />

on page 431, this form of segmentation is invisible to the application program,<br />

although not to the operating system. The major disadvantage of this scheme is<br />

that it does not work well when the address space is used in a sparse fashion<br />

rather than as a contiguous set of virtual addresses.<br />

3. Another approach to reducing the page table size is to apply a hashing function<br />

to the virtual address so that the page table need be only the size of the number<br />

of physical pages in main memory. Such a structure is called an inverted page<br />

table. Of course, the lookup process is slightly more complex with an inverted<br />

page table, because we can no longer just index the page table.<br />

4. Multiple levels of page tables can also be used to reduce the total amount of page table storage. The first level maps large fixed-size blocks of virtual address space, perhaps 64 to 256 pages in total. These large blocks are sometimes called segments, and this first-level mapping table is sometimes called a



segment table, though the segments are again invisible to the user. Each entry<br />

in the segment table indicates whether any pages in that segment are allocated<br />

<strong>and</strong>, if so, points to a page table for that segment. Address translation happens<br />

by first looking in the segment table, using the highest-order bits of the address.

If the segment address is valid, the next set of high-order bits is used to index<br />

the page table indicated by the segment table entry. This scheme allows the<br />

address space to be used in a sparse fashion (multiple noncontiguous segments<br />

can be active) without having to allocate the entire page table. Such schemes<br />

are particularly useful with very large address spaces <strong>and</strong> in software systems<br />

that require noncontiguous allocation. The primary disadvantage of this two-level mapping is the more complex process for address translation (a C sketch of such a two-level walk follows this list).

5. To reduce the actual main memory tied up in page tables, most modern systems<br />

also allow the page tables to be paged. Although this sounds tricky, it works<br />

by using the same basic ideas of virtual memory <strong>and</strong> simply allowing the page<br />

tables to reside in the virtual address space. In addition, there are some small<br />

but critical problems, such as a never-ending series of page faults, which must<br />

be avoided. How these problems are overcome is both very detailed <strong>and</strong> typically<br />

highly processor specific. In brief, these problems are avoided by placing all the

page tables in the address space of the operating system <strong>and</strong> placing at least<br />

some of the page tables for the operating system in a portion of main memory<br />

that is physically addressed <strong>and</strong> is always present <strong>and</strong> thus never on disk.<br />
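As a rough illustration of technique 4, the C sketch below walks a hypothetical two-level structure: the high-order bits index a first-level (segment) table whose entries point to per-segment page tables, so page tables exist only for segments actually in use. The field widths and names are assumptions chosen for clarity, not any particular machine's format.

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 12      /* 4 KiB pages (assumed) */
    #define PT_BITS     10      /* second-level index width (assumed) */
    #define SEG_BITS    10      /* first-level index width (assumed) */

    typedef struct { uint32_t valid : 1; uint32_t ppn : 20; } pte_t;
    typedef struct { bool allocated; pte_t *page_table; } seg_entry_t;

    extern seg_entry_t first_level[1u << SEG_BITS];

    bool translate_two_level(uint32_t va, uint32_t *pa)
    {
        uint32_t seg    = va >> (OFFSET_BITS + PT_BITS);          /* highest-order bits */
        uint32_t index  = (va >> OFFSET_BITS) & ((1u << PT_BITS) - 1);
        uint32_t offset = va & ((1u << OFFSET_BITS) - 1);

        seg_entry_t se = first_level[seg];
        if (!se.allocated)                   /* no pages allocated in this segment */
            return false;
        pte_t pte = se.page_table[index];    /* only this segment's table is allocated */
        if (!pte.valid)
            return false;                    /* page fault */
        *pa = (pte.ppn << OFFSET_BITS) | offset;
        return true;
    }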

What about Writes?<br />

The difference between the access time to the cache <strong>and</strong> main memory is tens to<br />

hundreds of cycles, <strong>and</strong> write-through schemes can be used, although we need a<br />

write buffer to hide the latency of the write from the processor. In a virtual memory<br />

system, writes to the next level of the hierarchy (disk) can take millions of processor<br />

clock cycles; therefore, building a write buffer to allow the system to write-through<br />

to disk would be completely impractical. Instead, virtual memory systems must use<br />

write-back, performing the individual writes into the page in memory, <strong>and</strong> copying<br />

the page back to disk when it is replaced in the memory.<br />

A write-back scheme has another major advantage in a virtual memory system.<br />

Because the disk transfer time is small compared with its access time, copying back<br />

an entire page is much more efficient than writing individual words back to the disk.<br />

A write-back operation, although more efficient than transferring individual words, is<br />

still costly. Thus, we would like to know whether a page needs to be copied back when<br />

we choose to replace it. To track whether a page has been written since it was read into<br />

the memory, a dirty bit is added to the page table. The dirty bit is set when any word<br />

in a page is written. If the operating system chooses to replace the page, the dirty bit<br />

indicates whether the page needs to be written out before its location in memory can be<br />

given to another page. Hence, a modified page is often called a dirty page.<br />
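A minimal sketch of the replacement decision in C, assuming a hypothetical page table entry with valid and dirty bits and hypothetical OS helpers write_page_to_swap() and read_page_from_swap(); it shows only when the copy back to disk actually happens.

    struct pte { unsigned valid : 1; unsigned dirty : 1; unsigned ppn : 20; };

    /* Hypothetical helpers provided by the OS; names are assumptions. */
    void write_page_to_swap(unsigned ppn);
    void read_page_from_swap(unsigned disk_addr, unsigned ppn);

    /* Evict the page mapped by *victim and reuse its physical page. */
    void replace_page(struct pte *victim, unsigned new_disk_addr)
    {
        unsigned ppn = victim->ppn;
        if (victim->dirty)                       /* only dirty pages are copied back */
            write_page_to_swap(ppn);
        victim->valid = 0;                       /* old translation no longer usable */
        read_page_from_swap(new_disk_addr, ppn); /* bring in the new page */
    }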

Hardware/<br />

Software<br />

Interface



Because we access the TLB instead of the page table on every reference, the TLB<br />

will need to include other status bits, such as the dirty <strong>and</strong> the reference bits.<br />

On every reference, we look up the virtual page number in the TLB. If we get a<br />

hit, the physical page number is used to form the address, <strong>and</strong> the corresponding<br />

reference bit is turned on. If the processor is performing a write, the dirty bit is also<br />

turned on. If a miss in the TLB occurs, we must determine whether it is a page fault<br />

or merely a TLB miss. If the page exists in memory, then the TLB miss indicates<br />

only that the translation is missing. In such cases, the processor can h<strong>and</strong>le the TLB<br />

miss by loading the translation from the page table into the TLB <strong>and</strong> then trying the<br />

reference again. If the page is not present in memory, then the TLB miss indicates<br />

a true page fault. In this case, the processor invokes the operating system using an<br />

exception. Because the TLB has many fewer entries than the number of pages in<br />

main memory, TLB misses will be much more frequent than true page faults.<br />
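The C sketch below mirrors that sequence for a small, fully associative TLB; the entry format and sizes are assumptions chosen for clarity, and a real TLB compares all entries in parallel in hardware rather than in a loop.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 16
    #define OFFSET_BITS 12

    struct tlb_entry {
        bool     valid;
        bool     dirty;      /* set when the page is written */
        bool     ref;        /* reference bit used for replacement */
        uint32_t vpn;        /* tag: virtual page number */
        uint32_t ppn;        /* physical page number */
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit; on a miss the translation must come from
       the page table (or a page fault must be taken). */
    bool tlb_lookup(uint32_t va, bool is_write, uint32_t *pa)
    {
        uint32_t vpn = va >> OFFSET_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                tlb[i].ref = true;               /* turn on the reference bit */
                if (is_write)
                    tlb[i].dirty = true;         /* turn on the dirty bit */
                *pa = (tlb[i].ppn << OFFSET_BITS) | (va & ((1u << OFFSET_BITS) - 1));
                return true;
            }
        }
        return false;                            /* TLB miss */
    }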

TLB misses can be h<strong>and</strong>led either in hardware or in software. In practice, with<br />

care there can be little performance difference between the two approaches, because<br />

the basic operations are the same in either case.<br />

After a TLB miss occurs <strong>and</strong> the missing translation has been retrieved from the<br />

page table, we will need to select a TLB entry to replace. Because the reference <strong>and</strong><br />

dirty bits are contained in the TLB entry, we need to copy these bits back to the page<br />

table entry when we replace an entry. These bits are the only portion of the TLB<br />

entry that can be changed. Using write-back—that is, copying these entries back at<br />

miss time rather than when they are written—is very efficient, since we expect the<br />

TLB miss rate to be small. Some systems use other techniques to approximate the<br />

reference <strong>and</strong> dirty bits, eliminating the need to write into the TLB except to load<br />

a new table entry on a miss.<br />

Some typical values for a TLB might be<br />

■ TLB size: 16–512 entries<br />

■ Block size: 1–2 page table entries (typically 4–8 bytes each)<br />

■ Hit time: 0.5–1 clock cycle<br />

■ Miss penalty: 10–100 clock cycles<br />

■ Miss rate: 0.01%–1%<br />

<strong>Design</strong>ers have used a wide variety of associativities in TLBs. Some systems use<br />

small, fully associative TLBs because a fully associative mapping has a lower miss<br />

rate; furthermore, since the TLB is small, the cost of a fully associative mapping is<br />

not too high. Other systems use large TLBs, often with small associativity. With<br />

a fully associative mapping, choosing the entry to replace becomes tricky since<br />

implementing a hardware LRU scheme is too expensive. Furthermore, since TLB<br />

misses are much more frequent than page faults <strong>and</strong> thus must be h<strong>and</strong>led more<br />

cheaply, we cannot afford an expensive software algorithm, as we can for page faults.<br />

As a result, many systems provide some support for r<strong>and</strong>omly choosing an entry<br />

to replace. We’ll examine replacement schemes in a little more detail in Section 5.8.



The Intrinsity FastMATH TLB<br />

To see these ideas in a real processor, let’s take a closer look at the TLB of the<br />

Intrinsity FastMATH. The memory system uses 4 KiB pages <strong>and</strong> a 32-bit address<br />

space; thus, the virtual page number is 20 bits long, as in the top of Figure 5.30.<br />

The physical address is the same size as the virtual address. The TLB contains 16<br />

entries, it is fully associative, <strong>and</strong> it is shared between the instruction <strong>and</strong> data<br />

references. Each entry is 64 bits wide <strong>and</strong> contains a 20-bit tag (which is the virtual<br />

page number for that TLB entry), the corresponding physical page number (also 20<br />

bits), a valid bit, a dirty bit, <strong>and</strong> other bookkeeping bits. Like most MIPS systems,<br />

it uses software to h<strong>and</strong>le TLB misses.<br />

Figure 5.30 shows the TLB <strong>and</strong> one of the caches, while Figure 5.31 shows the<br />

steps in processing a read or write request. When a TLB miss occurs, the MIPS<br />

hardware saves the page number of the reference in a special register <strong>and</strong> generates<br />

an exception. The exception invokes the operating system, which h<strong>and</strong>les the miss<br />

in software. To find the physical address for the missing page, the TLB miss routine<br />

indexes the page table using the page number of the virtual address <strong>and</strong> the page<br />

table register, which indicates the starting address of the active process page table.<br />

Using a special set of system instructions that can update the TLB, the operating<br />

system places the physical address from the page table into the TLB. A TLB miss<br />

takes about 13 clock cycles, assuming the code <strong>and</strong> the page table entry are in the<br />

instruction cache <strong>and</strong> data cache, respectively. (We will see the MIPS TLB code<br />

on page 449.) A true page fault occurs if the page table entry does not have a valid<br />

physical address. The hardware maintains an index that indicates the recommended<br />

entry to replace; the recommended entry is chosen r<strong>and</strong>omly.<br />

There is an extra complication for write requests: namely, the write access bit in<br />

the TLB must be checked. This bit prevents the program from writing into pages<br />

for which it has only read access. If the program attempts a write <strong>and</strong> the write<br />

access bit is off, an exception is generated. The write access bit forms part of the<br />

protection mechanism, which we will discuss shortly.<br />

Integrating Virtual Memory, TLBs, <strong>and</strong> Caches<br />

Our virtual memory <strong>and</strong> cache systems work together as a hierarchy, so that data<br />

cannot be in the cache unless it is present in main memory. The operating system<br />

helps maintain this hierarchy by flushing the contents of any page from the cache<br />

when it decides to migrate that page to disk. At the same time, the OS modifies the<br />

page tables <strong>and</strong> TLB, so that an attempt to access any data on the migrated page<br />

will generate a page fault.<br />

Under the best of circumstances, a virtual address is translated by the TLB <strong>and</strong><br />

sent to the cache where the appropriate data is found, retrieved, <strong>and</strong> sent back to<br />

the processor. In the worst case, a reference can miss in all three components of the<br />

memory hierarchy: the TLB, the page table, <strong>and</strong> the cache. The following example<br />

illustrates these interactions in more detail.


[Figure 5.31 flowchart: a virtual address is first looked up in the TLB; a miss raises a TLB miss exception, while a hit produces the physical address. For a read, the cache is accessed, stalling on a miss and delivering data to the CPU on a hit. For a write, the write access bit is checked first (a write protection exception if it is off); on a cache hit the data is written into the cache, the dirty bit is updated, and the data and address are put into the write buffer, while a miss stalls to read the block.]

FIGURE 5.31 Processing a read or a write-through in the Intrinsity FastMATH TLB <strong>and</strong> cache. If the TLB generates a hit,<br />

the cache can be accessed with the resulting physical address. For a read, the cache generates a hit or miss <strong>and</strong> supplies the data or causes a stall<br />

while the data is brought from memory. If the operation is a write, a portion of the cache entry is overwritten for a hit <strong>and</strong> the data is sent to<br />

the write buffer if we assume write-through. A write miss is just like a read miss except that the block is modified after it is read from memory.<br />

Write-back requires writes to set a dirty bit for the cache block, <strong>and</strong> a write buffer is loaded with the whole block only on a read miss or write<br />

miss if the block to be replaced is dirty. Notice that a TLB hit <strong>and</strong> a cache hit are independent events, but a cache hit can only occur after a TLB<br />

hit occurs, which means that the data must be present in memory. The relationship between TLB misses <strong>and</strong> cache misses is examined further<br />

in the following example <strong>and</strong> the exercises at the end of this chapter.



Overall Operation of a Memory Hierarchy<br />

In a memory hierarchy like that of Figure 5.30, which includes a TLB <strong>and</strong> a<br />

cache organized as shown, a memory reference can encounter three different<br />

types of misses: a TLB miss, a page fault, <strong>and</strong> a cache miss. Consider all<br />

the combinations of these three events with one or more occurring (seven<br />

possibilities). For each possibility, state whether this event can actually occur<br />

<strong>and</strong> under what circumstances.<br />

EXAMPLE<br />

Figure 5.32 shows all combinations <strong>and</strong> whether each is possible in practice.<br />

ANSWER<br />

Elaboration: Figure 5.32 assumes that all memory addresses are translated to<br />

physical addresses before the cache is accessed. In this organization, the cache is<br />

physically indexed <strong>and</strong> physically tagged (both the cache index <strong>and</strong> tag are physical,<br />

rather than virtual, addresses). In such a system, the amount of time to access memory,<br />

assuming a cache hit, must accommodate both a TLB access <strong>and</strong> a cache access; of<br />

course, these accesses can be pipelined.<br />

Alternatively, the processor can index the cache with an address that is completely<br />

or partially virtual. This is called a virtually addressed cache, <strong>and</strong> it uses tags that<br />

are virtual addresses; hence, such a cache is virtually indexed <strong>and</strong> virtually tagged. In<br />

such caches, the address translation hardware (TLB) is unused during the normal cache<br />

access, since the cache is accessed with a virtual address that has not been translated<br />

to a physical address. This takes the TLB out of the critical path, reducing cache latency.<br />

When a cache miss occurs, however, the processor needs to translate the address to a<br />

physical address so that it can fetch the cache block from main memory.<br />

virtually addressed cache: A cache that is accessed with a virtual address rather than a physical address.

TLB | Page table | Cache | Possible? If so, under what circumstance?
Hit | Hit | Miss | Possible, although the page table is never really checked if TLB hits.
Miss | Hit | Hit | TLB misses, but entry found in page table; after retry, data is found in cache.
Miss | Hit | Miss | TLB misses, but entry found in page table; after retry, data misses in cache.
Miss | Miss | Miss | TLB misses and is followed by a page fault; after retry, data must miss in cache.
Hit | Miss | Miss | Impossible: cannot have a translation in TLB if page is not present in memory.
Hit | Miss | Hit | Impossible: cannot have a translation in TLB if page is not present in memory.
Miss | Miss | Hit | Impossible: data cannot be allowed in cache if the page is not in memory.

FIGURE 5.32 The possible combinations of events in the TLB, virtual memory system,<br />

<strong>and</strong> cache. Three of these combinations are impossible, <strong>and</strong> one is possible (TLB hit, virtual memory hit,<br />

cache miss) but never detected.


aliasing: A situation in which two addresses access the same object; it can occur in virtual memory when there are two virtual addresses for the same physical page.

physically addressed cache: A cache that is addressed by a physical address.

When the cache is accessed with a virtual address <strong>and</strong> pages are shared between<br />

processes (which may access them with different virtual addresses), there is the<br />

possibility of aliasing. Aliasing occurs when the same object has two names—in this<br />

case, two virtual addresses for the same page. This ambiguity creates a problem, because<br />

a word on such a page may be cached in two different locations, each corresponding<br />

to different virtual addresses. This ambiguity would allow one program to write the data<br />

without the other program being aware that the data had changed. Completely virtually<br />

addressed caches either introduce design limitations on the cache <strong>and</strong> TLB to reduce<br />

aliases or require the operating system, <strong>and</strong> possibly the user, to take steps to ensure<br />

that aliases do not occur.<br />

A common compromise between these two design points is caches that are virtually<br />

indexed—sometimes using just the page-offset portion of the address, which is really<br />

a physical address since it is not translated—but use physical tags. These designs,<br />

which are virtually indexed but physically tagged, attempt to achieve the performance<br />

advantages of virtually indexed caches with the architecturally simpler advantages of a<br />

physically addressed cache. For example, there is no alias problem in this case. Figure<br />

5.30 assumed a 4 KiB page size, but it’s really 16 KiB, so the Intrinsity FastMATH can<br />

use this trick. To pull it off, there must be careful coordination between the minimum<br />

page size, the cache size, <strong>and</strong> associativity.<br />
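One common way to state that coordination, under the usual assumption that the cache index and block offset must come entirely from the untranslated page-offset bits, is that the cache size divided by the associativity must not exceed the page size. The numbers in this small check are illustrative rather than the FastMATH's exact parameters.

    #include <assert.h>

    int main(void)
    {
        unsigned page_size     = 16 * 1024;  /* 16 KiB pages, as noted above */
        unsigned cache_size    = 16 * 1024;  /* illustrative cache capacity */
        unsigned associativity = 1;          /* illustrative: direct mapped */

        /* A virtually indexed, physically tagged cache avoids aliases only if
           its index bits lie entirely within the page offset. */
        assert(cache_size / associativity <= page_size);
        return 0;
    }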

Implementing Protection with Virtual Memory<br />

Perhaps the most important function of virtual memory today is to allow sharing of<br />

a single main memory by multiple processes, while providing memory protection<br />

among these processes <strong>and</strong> the operating system. The protection mechanism must<br />

ensure that although multiple processes are sharing the same main memory, one<br />

renegade process cannot write into the address space of another user process or into<br />

the operating system either intentionally or unintentionally. The write access bit in<br />

the TLB can protect a page from being written. Without this level of protection,<br />

computer viruses would be even more widespread.<br />

Hardware/<br />

Software<br />

Interface<br />

supervisor mode: Also called kernel mode. A mode indicating that a running process is an operating system process.

To enable the operating system to implement protection in the virtual memory<br />

system, the hardware must provide at least the three basic capabilities summarized<br />

below. Note that the first two are the same requirements as needed for virtual<br />

machines (Section 5.6).<br />

1. Support at least two modes that indicate whether the running process is a<br />

user process or an operating system process, variously called a supervisor<br />

process, a kernel process, or an executive process.<br />

2. Provide a portion of the processor state that a user process can read but not<br />

write. This includes the user/supervisor mode bit, which dictates whether<br />

the processor is in user or supervisor mode, the page table pointer, <strong>and</strong> the



TLB. To write these elements, the operating system uses special instructions<br />

that are only available in supervisor mode.<br />

3. Provide mechanisms whereby the processor can go from user mode to<br />

supervisor mode <strong>and</strong> vice versa. The first direction is typically accomplished<br />

by a system call exception, implemented as a special instruction (syscall in<br />

the MIPS instruction set) that transfers control to a dedicated location in<br />

supervisor code space. As with any other exception, the program counter<br />

from the point of the system call is saved in the exception PC (EPC), <strong>and</strong><br />

the processor is placed in supervisor mode. To return to user mode from the<br />

exception, use the return from exception (ERET) instruction, which resets to<br />

user mode <strong>and</strong> jumps to the address in EPC.<br />

By using these mechanisms <strong>and</strong> storing the page tables in the operating system’s<br />

address space, the operating system can change the page tables while preventing a<br />

user process from changing them, ensuring that a user process can access only the<br />

storage provided to it by the operating system.<br />

system call: A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.

We also want to prevent a process from reading the data of another process. For<br />

example, we wouldn’t want a student program to read the grades while they were<br />

in the processor’s memory. Once we begin sharing main memory, we must provide<br />

the ability for a process to protect its data from both reading <strong>and</strong> writing by another<br />

process; otherwise, sharing the main memory will be a mixed blessing!<br />

Remember that each process has its own virtual address space. Thus, if the<br />

operating system keeps the page tables organized so that the independent virtual<br />

pages map to disjoint physical pages, one process will not be able to access another’s<br />

data. Of course, this also requires that a user process be unable to change the page<br />

table mapping. The operating system can assure safety if it prevents the user process<br />

from modifying its own page tables. However, the operating system must be able<br />

to modify the page tables. Placing the page tables in the protected address space of<br />

the operating system satisfies both requirements.<br />

When processes want to share information in a limited way, the operating system<br />

must assist them, since accessing the information of another process requires<br />

changing the page table of the accessing process. The write access bit can be used<br />

to restrict the sharing to just read sharing, <strong>and</strong>, like the rest of the page table, this<br />

bit can be changed only by the operating system. To allow another process, say, P1,<br />

to read a page owned by process P2, P2 would ask the operating system to create<br />

a page table entry for a virtual page in P1’s address space that points to the same<br />

physical page that P2 wants to share. The operating system could use the write<br />

protection bit to prevent P1 from writing the data, if that was P2’s wish. Any bits<br />

that determine the access rights for a page must be included in both the page table<br />

<strong>and</strong> the TLB, because the page table is accessed only on a TLB miss.
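A minimal C sketch of that sequence, with hypothetical types and a hypothetical OS routine; the point is only that the operating system installs an entry in P1's page table that names P2's physical page and leaves the write access bit off.

    struct pte { unsigned valid : 1; unsigned write : 1; unsigned ppn : 20; };

    struct process { struct pte *page_table; };   /* hypothetical per-process state */

    /* Map virtual page vpn of p1 onto the physical page shared_ppn that P2 owns.
       Called by the OS after P2 asks it to share the page read-only. */
    void share_page_readonly(struct process *p1, unsigned vpn, unsigned shared_ppn)
    {
        struct pte entry;
        entry.valid = 1;
        entry.ppn   = shared_ppn;   /* same physical page P2 uses */
        entry.write = 0;            /* write access bit off: P1 may only read */
        p1->page_table[vpn] = entry;
        /* Any TLB entry for this page must carry the same access bits. */
    }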


context switch: A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.

Elaboration: When the operating system decides to change from running process<br />

P1 to running process P2 (called a context switch or process switch), it must ensure<br />

that P2 cannot get access to the page tables of P1 because that would compromise<br />

protection. If there is no TLB, it suffi ces to change the page table register to point to P2’s<br />

page table (rather than to P1’s); with a TLB, we must clear the TLB entries that belong to<br />

P1—both to protect the data of P1 <strong>and</strong> to force the TLB to load the entries for P2. If the<br />

process switch rate were high, this could be quite ineffi cient. For example, P2 might load<br />

only a few TLB entries before the operating system switched back to P1. Unfortunately,<br />

P1 would then fi nd that all its TLB entries were gone <strong>and</strong> would have to pay TLB misses<br />

to reload them. This problem arises because the virtual addresses used by P1 <strong>and</strong> P2<br />

are the same, <strong>and</strong> we must clear out the TLB to avoid confusing these addresses.<br />

A common alternative is to extend the virtual address space by adding a process<br />

identifier or task identifier. The Intrinsity FastMATH has an 8-bit address space ID (ASID) field for this purpose. This small field identifies the currently running process; it is kept in a register loaded by the operating system when it switches processes. The process identifier is concatenated to the tag portion of the TLB, so that a TLB hit occurs only if both the page number and the process identifier match. This combination eliminates the

need to clear the TLB, except on rare occasions.<br />
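In terms of the TLB lookup sketched earlier, the ASID simply becomes part of the tag comparison; the field widths below are assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    struct tlb_entry_asid {
        bool     valid;
        uint8_t  asid;   /* address space ID, loaded when the OS switches processes */
        uint32_t vpn;
        uint32_t ppn;
    };

    /* A hit now requires both the page number and the process identifier to match. */
    bool tlb_hit(const struct tlb_entry_asid *e, uint32_t vpn, uint8_t current_asid)
    {
        return e->valid && e->vpn == vpn && e->asid == current_asid;
    }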

Similar problems can occur for a cache, since on a process switch the cache will<br />

contain data from the running process. These problems arise in different ways for<br />

physically addressed <strong>and</strong> virtually addressed caches, <strong>and</strong> a variety of different solutions,<br />

such as process identifiers, are used to ensure that a process gets its own data.

H<strong>and</strong>ling TLB Misses <strong>and</strong> Page Faults<br />

Although the translation of virtual to physical addresses with a TLB is<br />

straightforward when we get a TLB hit, as we saw earlier, h<strong>and</strong>ling TLB misses <strong>and</strong><br />

page faults is more complex. A TLB miss occurs when no entry in the TLB matches<br />

a virtual address. Recall that a TLB miss can indicate one of two possibilities:<br />

1. The page is present in memory, <strong>and</strong> we need only create the missing TLB<br />

entry.<br />

2. The page is not present in memory, <strong>and</strong> we need to transfer control to the<br />

operating system to deal with a page fault.<br />

MIPS traditionally h<strong>and</strong>les a TLB miss in software. It brings in the page table<br />

entry from memory <strong>and</strong> then re-executes the instruction that caused the TLB miss.<br />

Upon re-executing, it will get a TLB hit. If the page table entry indicates the page is<br />

not in memory, this time it will get a page fault exception.<br />

H<strong>and</strong>ling a TLB miss or a page fault requires using the exception mechanism<br />

to interrupt the active process, transferring control to the operating system, <strong>and</strong><br />

later resuming execution of the interrupted process. A page fault will be recognized<br />

sometime during the clock cycle used to access memory. To restart the instruction<br />

after the page fault is h<strong>and</strong>led, the program counter of the instruction that caused<br />

the page fault must be saved. Just as in Chapter 4, the exception program counter<br />

(EPC) is used to hold this value.



In addition, a TLB miss or page fault exception must be asserted by the end<br />

of the same clock cycle that the memory access occurs, so that the next clock<br />

cycle will begin exception processing rather than continue normal instruction<br />

execution. If the page fault was not recognized in this clock cycle, a load instruction<br />

could overwrite a register, <strong>and</strong> this could be disastrous when we try to restart the<br />

instruction. For example, consider the instruction lw $1,0($1): the computer<br />

must be able to prevent the write pipeline stage from occurring; otherwise, it could<br />

not properly restart the instruction, since the contents of $1 would have been<br />

destroyed. A similar complication arises on stores. We must prevent the write into<br />

memory from actually completing when there is a page fault; this is usually done<br />

by deasserting the write control line to the memory.<br />

Between the time we begin executing the exception h<strong>and</strong>ler in the operating<br />

system <strong>and</strong> the time that the operating system has saved all the state of the process,<br />

the operating system is particularly vulnerable. For example, if another exception<br />

occurred when we were processing the first exception in the operating system, the<br />

control unit would overwrite the exception program counter, making it impossible<br />

to return to the instruction that caused the page fault! We can avoid this disaster<br />

by providing the ability to disable <strong>and</strong> enable exceptions. When an exception first<br />

occurs, the processor sets a bit that disables all other exceptions; this could happen<br />

at the same time the processor sets the supervisor mode bit. The operating system<br />

will then save just enough state to allow it to recover if another exception occurs—<br />

namely, the exception program counter (EPC) <strong>and</strong> Cause registers. EPC <strong>and</strong> Cause<br />

are two of the special control registers that help with exceptions, TLB misses, <strong>and</strong><br />

page faults; Figure 5.33 shows the rest. The operating system can then re-enable<br />

exceptions. These steps make sure that exceptions will not cause the processor<br />

to lose any state <strong>and</strong> thereby be unable to restart execution of the interrupting<br />

instruction.<br />

Hardware/<br />

Software<br />

Interface<br />

exception enable: Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.

Once the operating system knows the virtual address that caused the page fault, it<br />

must complete three steps:<br />

1. Look up the page table entry using the virtual address <strong>and</strong> find the location<br />

of the referenced page on disk.<br />

2. Choose a physical page to replace; if the chosen page is dirty, it must be<br />

written out to disk before we can bring a new virtual page into this physical<br />

page.<br />

3. Start a read to bring the referenced page from disk into the chosen physical<br />

page.
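Putting the three steps together as a C sketch, with hypothetical helper routines (disk-address lookup, victim selection, and disk I/O) standing in for the OS-specific code.

    struct pte { unsigned valid : 1; unsigned dirty : 1; unsigned ppn : 20; };

    /* Hypothetical OS helpers; names and signatures are assumptions. */
    unsigned disk_addr_of(unsigned vpn);           /* where the page lives on disk */
    struct pte *choose_victim(void);               /* pick a physical page to reuse */
    void write_page_to_swap(unsigned ppn);
    void start_disk_read(unsigned disk_addr, unsigned ppn);

    void handle_page_fault(unsigned faulting_vpn)
    {
        unsigned disk_addr = disk_addr_of(faulting_vpn);  /* 1. find the page on disk */

        struct pte *victim = choose_victim();             /* 2. choose a page to replace */
        if (victim->dirty)
            write_page_to_swap(victim->ppn);              /*    write it back if dirty */
        victim->valid = 0;

        start_disk_read(disk_addr, victim->ppn);          /* 3. read the new page in */
        /* Later the OS marks the faulting page valid and resumes the process. */
    }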



The exception invokes the operating system, which h<strong>and</strong>les the miss in software.<br />

Control is transferred to address 8000 0000hex, the location of the TLB miss handler.

To find the physical address for the missing page, the TLB miss routine indexes the<br />

page table using the page number of the virtual address <strong>and</strong> the page table register,<br />

which indicates the starting address of the active process page table. To make this<br />

indexing fast, MIPS hardware places everything you need in the special Context<br />

register: the upper 12 bits have the address of the base of the page table, <strong>and</strong> the<br />

next 18 bits have the virtual address of the missing page. Each page table entry is<br />

one word, so the last 2 bits are 0. Thus, the first two instructions copy the Context<br />

register into the kernel temporary register $k1 <strong>and</strong> then load the page table entry<br />

from that address into $k1. Recall that $k0 <strong>and</strong> $k1 are reserved for the operating<br />

system to use without saving; a major reason for this convention is to make the TLB<br />

miss handler fast.

handler: Name of a software routine invoked to “handle” an exception or interrupt.

Below is the MIPS code for a typical TLB miss handler:

TLBmiss:
  mfc0  $k1,Context   # copy address of PTE into temp $k1
  lw    $k1,0($k1)    # put PTE into temp $k1
  mtc0  $k1,EntryLo   # put PTE into special register EntryLo
  tlbwr               # put EntryLo into TLB entry at Random
  eret                # return from TLB miss exception

As shown above, MIPS has a special set of system instructions to update the<br />

TLB. The instruction tlbwr copies from control register EntryLo into the TLB<br />

entry selected by the control register Random. Random implements random

replacement, so it is basically a free-running counter. A TLB miss takes about a<br />

dozen clock cycles.<br />

Note that the TLB miss h<strong>and</strong>ler does not check to see if the page table entry is<br />

valid. Because the exception for TLB entry missing is much more frequent than<br />

a page fault, the operating system loads the TLB from the page table without<br />

examining the entry <strong>and</strong> restarts the instruction. If the entry is invalid, another<br />

<strong>and</strong> different exception occurs, <strong>and</strong> the operating system recognizes the page fault.<br />

This method makes the frequent case of a TLB miss fast, at a slight performance<br />

penalty for the infrequent case of a page fault.<br />

Once the process that generated the page fault has been interrupted, it transfers<br />

control to 8000 0180hex, a different address than the TLB miss handler. This is

the general address for exception; TLB miss has a special entry point to lower the<br />

penalty for a TLB miss. The operating system uses the exception Cause register<br />

to diagnose the cause of the exception. Because the exception is a page fault, the<br />

operating system knows that extensive processing will be required. Thus, unlike a<br />

TLB miss, it saves the entire state of the active process. This state includes all the<br />

general-purpose <strong>and</strong> floating-point registers, the page table address register, the<br />

EPC, <strong>and</strong> the exception Cause register. Since exception h<strong>and</strong>lers do not usually use<br />

the floating-point registers, the general entry point does not save them, leaving that<br />

to the few h<strong>and</strong>lers that need them.



Figure 5.34 sketches the MIPS code of an exception h<strong>and</strong>ler. Note that we<br />

save <strong>and</strong> restore the state in MIPS code, taking care when we enable <strong>and</strong> disable<br />

exceptions, but we invoke C code to h<strong>and</strong>le the particular exception.<br />

The virtual address that caused the fault depends on whether the fault was an<br />

instruction or data fault. The address of the instruction that generated the fault is<br />

in the EPC. If it was an instruction page fault, the EPC contains the virtual address<br />

of the faulting page; otherwise, the faulting virtual address can be computed by<br />

examining the instruction (whose address is in the EPC) to find the base register<br />

<strong>and</strong> offset field.<br />
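A sketch of that computation in C for a MIPS load or store (I-format) fetched from the address in the EPC: the base register number sits in bits 25–21 and the signed 16-bit offset in bits 15–0. The array of saved registers is a hypothetical stand-in for wherever the handler saved the process state.

    #include <stdint.h>

    extern uint32_t saved_regs[32];    /* process registers saved by the handler (assumed) */

    /* Given the 32-bit encoding of a faulting lw/sw, recover the data address it used. */
    uint32_t faulting_data_address(uint32_t instruction)
    {
        uint32_t base   = (instruction >> 21) & 0x1f;      /* rs field: base register */
        int32_t  offset = (int16_t)(instruction & 0xffff); /* sign-extended 16-bit offset */
        return saved_regs[base] + (uint32_t)offset;        /* base register + offset */
    }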

unmapped: A portion of the address space that cannot have page faults.

Elaboration: This simplified version assumes that the stack pointer (sp) is valid. To avoid the problem of a page fault during this low-level exception code, MIPS sets aside a portion of its address space that cannot have page faults, called unmapped. The operating system places the exception entry point code and the exception stack in unmapped memory. MIPS hardware translates virtual addresses 8000 0000hex to BFFF FFFFhex to physical addresses simply by ignoring the upper bits of the virtual address,

thereby placing these addresses in the low part of physical memory. Thus, the operating<br />

system places exception entry points <strong>and</strong> exception stacks in unmapped memory.<br />

Elaboration: The code in Figure 5.34 shows the MIPS-32 exception return sequence.<br />

The older MIPS-I architecture uses rfe <strong>and</strong> jr instead of eret.<br />

Elaboration: For processors with more complex instructions that can touch many<br />

memory locations <strong>and</strong> write many data items, making instructions restartable is much<br />

harder. Processing one instruction may generate a number of page faults in the middle<br />

of the instruction. For example, x86 processors have block move instructions that touch<br />

thous<strong>and</strong>s of data words. In such processors, instructions often cannot be restarted<br />

from the beginning, as we do for MIPS instructions. Instead, the instruction must be<br />

interrupted <strong>and</strong> later continued midstream in its execution. Resuming an instruction in<br />

the middle of its execution usually requires saving some special state, processing the<br />

exception, <strong>and</strong> restoring that special state. Making this work properly requires careful<br />

<strong>and</strong> detailed coordination between the exception-h<strong>and</strong>ling code in the operating system<br />

<strong>and</strong> the hardware.<br />

Elaboration: Rather than pay an extra level of indirection on every memory access, the<br />

VMM maintains a shadow page table that maps directly from the guest virtual address<br />

space to the physical address space of the hardware. By detecting all modifications to

the guest’s page table, the VMM can ensure the shadow page table entries being used<br />

by the hardware for translations correspond to those of the guest OS environment, with<br />

the exception of the correct physical pages substituted for the real pages in the guest<br />

tables. Hence, the VMM must trap any attempt by the guest OS to change its page table<br />

or to access the page table pointer. This is commonly done by write protecting the guest<br />

page tables <strong>and</strong> trapping any access to the page table pointer by a guest OS. As noted<br />

above, the latter happens naturally if accessing the page table pointer is a privileged<br />

operation.



Elaboration: The final portion of the architecture to virtualize is I/O. This is by far the most difficult part of system virtualization because of the increasing number of I/O devices attached to the computer and the increasing diversity of I/O device types. Another difficulty is the sharing of a real device among multiple VMs, and yet another

comes from supporting the myriad of device drivers that are required, especially if<br />

different guest OSes are supported on the same VM system. The VM illusion can be<br />

maintained by giving each VM generic versions of each type of I/O device driver, <strong>and</strong> then<br />

leaving it to the VMM to h<strong>and</strong>le real I/O.<br />

Elaboration: In addition to virtualizing the instruction set for a virtual machine,<br />

another challenge is virtualization of virtual memory, as each guest OS in every virtual<br />

machine manages its own set of page tables. To make this work, the VMM separates<br />

the notions of real <strong>and</strong> physical memory (which are often treated synonymously), <strong>and</strong><br />

makes real memory a separate, intermediate level between virtual memory <strong>and</strong> physical<br />

memory. (Some use the terms virtual memory, physical memory, <strong>and</strong> machine memory<br />

to name the same three levels.) The guest OS maps virtual memory to real memory<br />

via its page tables, <strong>and</strong> the VMM page tables map the guest’s real memory to physical<br />

memory. The virtual memory architecture is specified either via page tables, as in IBM

VM/370 <strong>and</strong> the x86, or via the TLB structure, as in MIPS.<br />
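Conceptually, the two levels of mapping compose, as the C sketch below illustrates with hypothetical single-level tables; real hardware consults the shadow page table instead of performing two lookups on every access.

    #include <stdint.h>
    #include <stdbool.h>

    struct map { bool valid; uint32_t out_pn; };  /* one hypothetical translation entry */

    extern struct map guest_pt[];  /* guest OS:  virtual page -> real page */
    extern struct map vmm_pt[];    /* VMM:       real page    -> physical page */

    /* Compose the guest and VMM translations for one virtual page number. */
    bool guest_to_physical(uint32_t guest_vpn, uint32_t *physical_pn)
    {
        struct map g = guest_pt[guest_vpn];
        if (!g.valid) return false;        /* guest page fault */
        struct map m = vmm_pt[g.out_pn];
        if (!m.valid) return false;        /* VMM must bring the real page in */
        *physical_pn = m.out_pn;
        return true;
    }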

Summary<br />

Virtual memory is the name for the level of memory hierarchy that manages<br />

caching between the main memory <strong>and</strong> secondary memory. Virtual memory<br />

allows a single program to exp<strong>and</strong> its address space beyond the limits of main<br />

memory. More importantly, virtual memory supports sharing of the main memory<br />

among multiple, simultaneously active processes, in a protected manner.<br />

Managing the memory hierarchy between main memory <strong>and</strong> disk is challenging<br />

because of the high cost of page faults. Several techniques are used to reduce the<br />

miss rate:<br />

1. Pages are made large to take advantage of spatial locality <strong>and</strong> to reduce the<br />

miss rate.<br />

2. The mapping between virtual addresses <strong>and</strong> physical addresses, which is<br />

implemented with a page table, is made fully associative so that a virtual<br />

page can be placed anywhere in main memory.<br />

3. The operating system uses techniques, such as LRU <strong>and</strong> a reference bit, to<br />

choose which pages to replace.



Writes to secondary memory are expensive, so virtual memory uses a write-back<br />

scheme <strong>and</strong> also tracks whether a page is unchanged (using a dirty bit) to avoid<br />

writing unchanged pages.<br />

The virtual memory mechanism provides address translation from a virtual<br />

address used by the program to the physical address space used for accessing<br />

memory. This address translation allows protected sharing of the main memory<br />

<strong>and</strong> provides several additional benefits, such as simplifying memory allocation.<br />

Ensuring that processes are protected from each other requires that only the<br />

operating system can change the address translations, which is implemented by<br />

preventing user programs from changing the page tables. Controlled sharing of<br />

pages among processes can be implemented with the help of the operating system<br />

<strong>and</strong> access bits in the page table that indicate whether the user program has read or<br />

write access to a page.<br />

If a processor had to access a page table resident in memory to translate every<br />

access, virtual memory would be too expensive, as caches would be pointless!<br />

Instead, a TLB acts as a cache for translations from the page table. Addresses are<br />

then translated from virtual to physical using the translations in the TLB.<br />

Caches, virtual memory, <strong>and</strong> TLBs all rely on a common set of principles <strong>and</strong><br />

policies. The next section discusses this common framework.<br />

Understanding Program Performance

Although virtual memory was invented to enable a small memory to act as a large

one, the performance difference between secondary memory <strong>and</strong> main memory<br />

means that if a program routinely accesses more virtual memory than it has<br />

physical memory, it will run very slowly. Such a program would be continuously<br />

swapping pages between memory <strong>and</strong> disk, called thrashing. Thrashing is a disaster<br />

if it occurs, but it is rare. If your program thrashes, the easiest solution is to run it on<br />

a computer with more memory or buy more memory for your computer. A more<br />

complex choice is to re-examine your algorithm <strong>and</strong> data structures to see if you<br />

can change the locality <strong>and</strong> thereby reduce the number of pages that your program<br />

uses simultaneously. This set of popular pages is informally called the working set.<br />

A more common performance problem is TLB misses. Since a TLB might<br />

handle only 32–64 page entries at a time, a program could easily see a high TLB miss rate, as the processor may access less than a quarter mebibyte directly: 64 × 4 KiB = 0.25 MiB. For example, TLB misses are often a challenge for Radix

Sort. To try to alleviate this problem, most computer architectures now support<br />

variable page sizes. For example, in addition to the st<strong>and</strong>ard 4 KiB page, MIPS<br />

hardware supports 16 KiB, 64 KiB, 256 KiB, 1 MiB, 4 MiB, 16 MiB, 64 MiB, <strong>and</strong><br />

256 MiB pages. Hence, if a program uses large page sizes, it can access more<br />

memory directly without TLB misses.<br />

The practical challenge is getting the operating system to allow programs to select these larger page sizes. Once again, the more complex solution to reducing TLB misses is to re-examine the algorithm and data structures to reduce the working set of pages; given the importance of memory accesses to performance and the frequency of TLB misses, some programs with large working sets have been redesigned with that goal.
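As one concrete illustration of asking the operating system for a larger page size, the sketch below is not from the text; it assumes a Linux system on which huge pages have been configured, and the 2 MiB size is illustrative.

    /* A minimal sketch (assumption: Linux with huge pages available) of a program
       requesting a larger page so that each TLB entry covers more memory. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 2 * 1024 * 1024;          /* one 2 MiB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        ((char *)p)[0] = 1;                    /* touch the page */
        munmap(p, len);
        return 0;
    }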

Check Yourself

Match the definitions in the right column to the terms in the left column.

1. L1 cache        a. A cache for a cache
2. L2 cache        b. A cache for disks
3. Main memory     c. A cache for a main memory
4. TLB             d. A cache for page table entries

5.8 A Common Framework for Memory Hierarchy

By now, you've recognized that the different types of memory hierarchies have a great deal in common. Although many of the aspects of memory hierarchies differ quantitatively, many of the policies and features that determine how a hierarchy functions are similar qualitatively. Figure 5.35 shows how some of the quantitative characteristics of memory hierarchies can differ. In the rest of this section, we will discuss the common operational alternatives for memory hierarchies, and how these determine their behavior. We will examine these policies as a series of four questions that apply between any two levels of a memory hierarchy, although for simplicity we will primarily use terminology for caches.

Feature                     | Typical values for L1 caches | Typical values for L2 caches | Typical values for paged memory | Typical values for a TLB
Total size in blocks        | 250–2000                     | 2,500–25,000                 | 16,000–250,000                  | 40–1024
Total size in kilobytes     | 16–64                        | 125–2000                     | 1,000,000–1,000,000,000         | 0.25–16
Block size in bytes         | 16–64                        | 64–128                       | 4000–64,000                     | 4–32
Miss penalty in clocks      | 10–25                        | 100–1000                     | 10,000,000–100,000,000          | 10–1000
Miss rates (global for L2)  | 2%–5%                        | 0.1%–2%                      | 0.00001%–0.0001%                | 0.01%–2%

FIGURE 5.35 The key quantitative design parameters that characterize the major elements of memory hierarchy in a computer. These are typical values for these levels as of 2012. Although the range of values is wide, this is partially because many of the values that have shifted over time are related; for example, as caches become larger to overcome larger miss penalties, block sizes also grow. While not shown, server microprocessors today also have L3 caches, which can be 2 to 8 MiB and contain many more blocks than L2 caches. L3 caches lower the L2 miss penalty to 30 to 40 clock cycles.



implementation, such as whether the cache is on-chip, the technology used for implementing the cache, and the critical role of cache access time in determining the processor cycle time.

Question 3: Which Block Should Be Replaced on a Cache Miss?

When a miss occurs in an associative cache, we must decide which block to replace. In a fully associative cache, all blocks are candidates for replacement. If the cache is set associative, we must choose among the blocks in the set. Of course, replacement is easy in a direct-mapped cache because there is only one candidate.

There are two primary strategies for replacement in set-associative or fully associative caches:

■ Random: Candidate blocks are randomly selected, possibly using some hardware assistance. For example, MIPS supports random replacement for TLB misses.

■ Least recently used (LRU): The block replaced is the one that has been unused for the longest time.

In practice, LRU is too costly to implement for hierarchies with more than a small degree of associativity (two to four, typically), since tracking the usage information is costly. Even for four-way set associativity, LRU is often approximated—for example, by keeping track of which pair of blocks is LRU (which requires 1 bit), and then tracking which block in each pair is LRU (which requires 1 bit per pair). For larger associativity, either LRU is approximated or random replacement is used. In caches, the replacement algorithm is in hardware, which means that the scheme should be easy to implement. Random replacement is simple to build in hardware, and for a two-way set-associative cache, random replacement has a miss rate about 1.1 times higher than LRU replacement. As the caches become larger, the miss rate for both replacement strategies falls, and the absolute difference becomes small. In fact, random replacement can sometimes be better than the simple LRU approximations that are easily implemented in hardware.
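The 3-bit approximation just described for a four-way set can be made concrete with a short sketch; the code below is not from the text and is only an illustration of the pair-plus-within-pair bookkeeping.

    /* A minimal sketch (not from the text) of the 3-bit pseudo-LRU approximation
       for one 4-way set: one bit selects the LRU pair, one bit per pair selects
       the LRU way inside that pair. */
    #include <stdio.h>

    typedef struct {
        unsigned pair_lru;    /* which pair (0 = ways 0/1, 1 = ways 2/3) was used less recently */
        unsigned way_lru[2];  /* which way inside each pair was used less recently */
    } plru_t;

    /* Update the bits on every access to 'way' (0..3). */
    static void plru_touch(plru_t *s, int way) {
        int pair = way >> 1;
        s->pair_lru = pair ^ 1;             /* the other pair is now less recently used */
        s->way_lru[pair] = (way & 1) ^ 1;   /* the other way in this pair is less recently used */
    }

    /* Pick a victim on a miss by following the bits toward the approximate LRU way. */
    static int plru_victim(const plru_t *s) {
        int pair = s->pair_lru;
        return (pair << 1) | s->way_lru[pair];
    }

    int main(void) {
        plru_t s = {0, {0, 0}};
        plru_touch(&s, 0); plru_touch(&s, 3); plru_touch(&s, 1);
        printf("victim way = %d\n", plru_victim(&s));  /* way 2: the only way not touched */
        return 0;
    }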

In virtual memory, some form of LRU is always approximated, since even a tiny reduction in the miss rate can be important when the cost of a miss is enormous. Reference bits or equivalent functionality are often provided to make it easier for the operating system to track a set of less recently used pages. Because misses are so expensive and relatively infrequent, approximating this information primarily in software is acceptable.

Question 4: What Happens on a Write?

A key characteristic of any memory hierarchy is how it deals with writes. We have already seen the two basic options:

■ Write-through: The information is written to both the block in the cache and the block in the lower level of the memory hierarchy (main memory for a cache). The caches in Section 5.3 used this scheme.



■ Write-back: The information is written only to the block in the cache. The modified block is written to the lower level of the hierarchy only when it is replaced. Virtual memory systems always use write-back, for the reasons discussed in Section 5.7.

Both write-back and write-through have their advantages. The key advantages of write-back are the following:

■ Individual words can be written by the processor at the rate that the cache, rather than the memory, can accept them.

■ Multiple writes within a block require only one write to the lower level in the hierarchy.

■ When blocks are written back, the system can make effective use of a high-bandwidth transfer, since the entire block is written.

Write-through has these advantages:

■ Misses are simpler and cheaper because they never require a block to be written back to the lower level.

■ Write-through is easier to implement than write-back, although to be practical, a write-through cache will still need to use a write buffer.
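To make the contrast concrete, the sketch below is not from the text; it shows where each policy performs its writes for a single cache block, and the dirty bit is what lets write-back defer the memory update until eviction.

    /* A minimal sketch (not from the text) of the two write policies for one block. */
    #include <stdbool.h>
    #include <stdio.h>

    #define WORDS_PER_BLOCK 4
    typedef struct { bool valid, dirty; unsigned tag; int data[WORDS_PER_BLOCK]; } block_t;

    /* Stubs standing in for the lower level of the hierarchy. */
    static void memory_write_word(unsigned addr, int value) { printf("mem[%u] = %d\n", addr, value); }
    static void memory_write_block(unsigned tag, const int *data) { (void)data; printf("write back block with tag %u\n", tag); }

    /* Write-through: every store also goes to the lower level. */
    static void store_write_through(block_t *b, unsigned addr, int value) {
        b->data[addr % WORDS_PER_BLOCK] = value;
        memory_write_word(addr, value);
    }

    /* Write-back: only the cache is updated; the block is marked dirty. */
    static void store_write_back(block_t *b, unsigned addr, int value) {
        b->data[addr % WORDS_PER_BLOCK] = value;
        b->dirty = true;
    }

    /* On replacement, a dirty block is written back once, as a full block. */
    static void evict(block_t *b) {
        if (b->valid && b->dirty) memory_write_block(b->tag, b->data);
        b->valid = b->dirty = false;
    }

    int main(void) {
        block_t b = { .valid = true, .dirty = false, .tag = 7 };
        store_write_through(&b, 0, 1);   /* one write to memory per store */
        store_write_back(&b, 1, 2);      /* no memory traffic yet */
        store_write_back(&b, 2, 3);
        evict(&b);                       /* a single block-wide write-back */
        return 0;
    }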

The BIG Picture

Caches, TLBs, and virtual memory may initially look very different, but they rely on the same two principles of locality, and they can be understood by their answers to four questions:

Question 1: Where can a block be placed?
Answer: One place (direct mapped), a few places (set associative), or any place (fully associative).

Question 2: How is a block found?
Answer: There are four methods: indexing (as in a direct-mapped cache), limited search (as in a set-associative cache), full search (as in a fully associative cache), and a separate lookup table (as in a page table).

Question 3: What block is replaced on a miss?
Answer: Typically, either the least recently used or a random block.

Question 4: How are writes handled?
Answer: Each level in the hierarchy can use either write-through or write-back.



In virtual memory systems, only a write-back policy is practical because of the long latency of a write to the lower level of the hierarchy. The rate at which writes are generated by a processor generally exceeds the rate at which the memory system can process them, even allowing for physically and logically wider memories and burst modes for DRAM. Consequently, today lowest-level caches typically use write-back.

The Three Cs: An Intuitive Model for Understanding the Behavior of Memory Hierarchies

In this subsection, we look at a model that provides insight into the sources of misses in a memory hierarchy and how the misses will be affected by changes in the hierarchy. We will explain the ideas in terms of caches, although the ideas carry over directly to any other level in the hierarchy. In this model, all misses are classified into one of three categories (the three Cs):

■ Compulsory misses: These are cache misses caused by the first access to a block that has never been in the cache. These are also called cold-start misses.

■ Capacity misses: These are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur when blocks are replaced and then later retrieved.

■ Conflict misses: These are cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that are eliminated in a fully associative cache of the same size. These cache misses are also called collision misses.

Figure 5.37 shows how the miss rate divides into the three sources. These sources of misses can be directly attacked by changing some aspect of the cache design. Since conflict misses arise directly from contention for the same cache block, increasing associativity reduces conflict misses. Associativity, however, may slow access time, leading to lower overall performance.

Capacity misses can easily be reduced by enlarging the cache; indeed, second-level caches have been growing steadily larger for many years. Of course, when we make the cache larger, we must also be careful about increasing the access time, which could lead to lower overall performance. Thus, first-level caches have been growing slowly, if at all.

Because compulsory misses are generated by the first reference to a block, the primary way for the cache system to reduce the number of compulsory misses is to increase the block size. This will reduce the number of references required to touch each block of the program once, because the program will consist of fewer

three Cs model A cache model in which all cache misses are classified into one of three categories: compulsory misses, capacity misses, and conflict misses.

compulsory miss Also called cold-start miss. A cache miss caused by the first access to a block that has never been in the cache.

capacity miss A cache miss that occurs because the cache, even with full associativity, cannot contain all the blocks needed to satisfy the request.

conflict miss Also called collision miss. A cache miss that occurs in a set-associative or direct-mapped cache when multiple blocks compete for the same set and that are eliminated in a fully associative cache of the same size.



■ Write-back using write allocate

■ Block size is 4 words (16 bytes or 128 bits)

■ Cache size is 16 KiB, so it holds 1024 blocks

■ 32-bit byte addresses

■ The cache includes a valid bit and dirty bit per block

From Section 5.3, we can now calculate the fields of an address for the cache:

■ Cache index is 10 bits

■ Block offset is 4 bits

■ Tag size is 32 - (10 + 4) or 18 bits
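A short sketch, not from the text, shows how these three fields would be extracted from a 32-bit byte address for this cache; the example address is arbitrary.

    /* A minimal sketch (not from the text) of extracting the address fields for
       the direct-mapped cache described above. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4    /* 16-byte blocks */
    #define INDEX_BITS 10    /* 1024 blocks    */

    int main(void) {
        uint32_t addr   = 0x12345678;                          /* arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* the remaining 18 bits */
        printf("tag = 0x%x, index = %u, offset = %u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }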

The signals between the processor and the cache are

■ 1-bit Read or Write signal

■ 1-bit Valid signal, saying whether there is a cache operation or not

■ 32-bit address

■ 32-bit data from processor to cache

■ 32-bit data from cache to processor

■ 1-bit Ready signal, saying the cache operation is complete

The interface between the memory and the cache has the same fields as between the processor and the cache, except that the data fields are now 128 bits wide. The extra memory width is generally found in microprocessors today, which deal with either 32-bit or 64-bit words in the processor while the DRAM controller is often 128 bits. Making the cache block match the width of the DRAM simplified the design. Here are the signals:

■ 1-bit Read or Write signal

■ 1-bit Valid signal, saying whether there is a memory operation or not

■ 32-bit address

■ 128-bit data from cache to memory

■ 128-bit data from memory to cache

■ 1-bit Ready signal, saying the memory operation is complete

Note that the interface to memory is not a fixed number of cycles. We assume a memory controller that will notify the cache via the Ready signal when the memory read or write is finished.

Before describing the cache controller, we need to review finite-state machines, which allow us to control an operation that can take multiple clock cycles.
which allow us to control an operation that can take multiple clock cycles.



FIGURE 5.39 Finite-state machine controllers are typically implemented using a block of combinational logic and a register to hold the current state. The outputs of the combinational logic are the next-state number and the control signals to be asserted for the current state. The inputs to the combinational logic are the current state and any inputs used to determine the next state. Notice that in the finite-state machine used in this chapter, the outputs depend only on the current state, not on the inputs. The Elaboration explains this in more detail.

needed early in the clock cycle, do not depend on the inputs, but only on the current state. In Appendix B, when the implementation of this finite-state machine is taken down to logic gates, the size advantage can be clearly seen. The potential disadvantage of a Moore machine is that it may require additional states. For example, in situations where there is a one-state difference between two sequences of states, the Mealy machine may unify the states by making the outputs depend on the inputs.

FSM for a Simple Cache Controller

Figure 5.40 shows the four states of our simple cache controller:

■ Idle: This state waits for a valid read or write request from the processor, which moves the FSM to the Compare Tag state.

■ Compare Tag: As the name suggests, this state tests to see if the requested read or write is a hit or a miss. The index portion of the address selects the tag to be compared. If the data in the cache block referred to by the index portion of the address is valid, and the tag portion of the address matches the tag, then it is a hit. Either the data is read from the selected word if it is a load or written to the selected word if it is a store. The Cache Ready signal is then



■ Replication: When shared data are being simultaneously read, the caches make a copy of the data item in the local cache. Replication reduces both latency of access and contention for a read shared data item.

Supporting migration and replication is critical to performance in accessing shared data, so many multiprocessors introduce a hardware protocol to maintain coherent caches. The protocols to maintain coherence for multiple processors are called cache coherence protocols. Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block.

The most popular cache coherence protocol is snooping. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. The caches are all accessible via some broadcast medium (a bus or network), and all cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access.

In the following section we explain snooping-based cache coherence as implemented with a shared bus, but any communication medium that broadcasts cache misses to all processors can be used to implement a snooping-based coherence scheme. This broadcasting to all caches makes snooping protocols simple to implement but also limits their scalability.

Snooping Protocols

One method of enforcing coherence is to ensure that a processor has exclusive access to a data item before it writes that item. This style of protocol is called a write invalidate protocol because it invalidates copies in other caches on a write. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated.

Figure 5.42 shows an example of an invalidation protocol for a snooping bus with write-back caches in action. To see how this protocol ensures coherence, consider a write followed by a read by another processor: since the write requires exclusive access, any copy held by the reading processor must be invalidated (hence the protocol name). Thus, when the read occurs, it misses in the cache, and the cache is forced to fetch a new copy of the data. For a write, we require that the writing processor have exclusive access, preventing any other processor from being able to write simultaneously. If two processors do attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated. For the other processor to complete its write, it must obtain a new copy of the data, which must now contain the updated value. Therefore, this protocol also enforces write serialization.



Processor activity     | Bus activity       | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory location X
(initial state)        |                    |                           |                           | 0
CPU A reads X          | Cache miss for X   | 0                         |                           | 0
CPU B reads X          | Cache miss for X   | 0                         | 0                         | 0
CPU A writes a 1 to X  | Invalidation for X | 1                         |                           | 0
CPU B reads X          | Cache miss for X   | 1                         | 1                         | 1

FIGURE 5.42 An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches. We assume that neither cache initially holds X and that the value of X in memory is 0. The CPU and memory contents show the value after the processor and bus activity have both completed. A blank indicates no activity or no copy cached. When the second miss by B occurs, CPU A responds with the value canceling the response from memory. In addition, both the contents of B's cache and the memory contents of X are updated. This update of memory, which occurs when a block becomes shared, simplifies the protocol, but it is possible to track the ownership and force the write-back only if the block is replaced. This requires the introduction of an additional state called "owner," which indicates that a block may be shared, but the owning processor is responsible for updating any other processors and memory when it changes the block or replaces it.
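The sequence in Figure 5.42 can be mimicked with a few lines of code; the sketch below is not from the text and models only a single block shared by two caches, with the write-invalidate rule applied on every write.

    /* A minimal sketch (not from the text) of write invalidate for one block
       shared by two caches, following the sequence of Figure 5.42. */
    #include <stdio.h>

    typedef struct { int valid; int value; } copy_t;

    static int memory_x = 0;
    static copy_t cache[2] = {{0, 0}, {0, 0}};

    static int cpu_read(int id) {
        if (!cache[id].valid) {                 /* miss: get the block from the other cache or memory */
            int other = 1 - id;
            cache[id].value = cache[other].valid ? cache[other].value : memory_x;
            memory_x = cache[id].value;         /* memory updated when the block becomes shared */
            cache[id].valid = 1;
        }
        return cache[id].value;
    }

    static void cpu_write(int id, int v) {
        cache[1 - id].valid = 0;                /* invalidate the other copy before writing */
        cache[id].value = v;
        cache[id].valid = 1;
    }

    int main(void) {
        printf("A reads X: %d\n", cpu_read(0)); /* miss, returns 0 */
        printf("B reads X: %d\n", cpu_read(1)); /* miss, returns 0 */
        cpu_write(0, 1);                        /* invalidation for X */
        printf("B reads X: %d\n", cpu_read(1)); /* miss, returns 1 supplied by A's copy */
        return 0;
    }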

Hardware/Software Interface

One insight is that block size plays an important role in cache coherency. For example, take the case of snooping on a cache with a block size of eight words, with a single word alternatively written and read by two processors. Most protocols exchange full blocks between processors, thereby increasing coherency bandwidth demands.

Large blocks can also cause what is called false sharing: when two unrelated shared variables are located in the same cache block, the full block is exchanged between processors even though the processors are accessing different variables. Programmers and compilers should lay out data carefully to avoid false sharing.
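A small sketch, not from the text, shows one common layout fix: padding a structure so that two counters updated by different threads do not fall in the same cache block. The 64-byte block size and the loop counts are assumptions; compile with -pthread.

    /* A minimal sketch (not from the text) of avoiding false sharing by padding. */
    #include <pthread.h>
    #include <stdio.h>

    #define BLOCK 64                          /* assumed cache block size in bytes */

    struct counters {
        long a;
        char pad[BLOCK - sizeof(long)];       /* keeps b out of a's cache block */
        long b;
    };

    static struct counters c;

    static void *bump_a(void *arg) { for (long i = 0; i < 10000000; i++) c.a++; return arg; }
    static void *bump_b(void *arg) { for (long i = 0; i < 10000000; i++) c.b++; return arg; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a = %ld, b = %ld\n", c.a, c.b);
        return 0;
    }

Without the pad field, a and b would occupy one block, and that block would bounce between the two processors' caches on every increment even though the threads never touch the same variable.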

Elaboration: Although the three properties on pages 466 and 467 are sufficient to ensure coherence, the question of when a written value will be seen is also important. To see why, observe that we cannot require that a read of X in Figure 5.41 instantaneously sees the value written for X by some other processor. If, for example, a write of X on one processor precedes a read of X on another processor very shortly beforehand, it may be impossible to ensure that the read returns the value of the data written, since the written data may not even have left the processor at that point. The issue of exactly when a written value must be seen by a reader is defined by a memory consistency model.

false sharing When two unrelated shared variables are located in the same cache block and the full block is exchanged between processors even though the processors are accessing different variables.



5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

In this section, we will look at the memory hierarchy of the same two microprocessors described in Chapter 4: the ARM Cortex-A8 and Intel Core i7. This section is based on Section 2.6 of Computer Architecture: A Quantitative Approach, 5th edition.

Figure 5.43 summarizes the address sizes and TLBs of the two processors. Note that the A8 has two TLBs with a 32-bit virtual address space and a 32-bit physical address space. The Core i7 has three TLBs with a 48-bit virtual address and a 44-bit physical address. Although the 64-bit registers of the Core i7 could hold a larger virtual address, there was no software need for such a large space, and 48-bit virtual addresses shrink both the page table memory footprint and the TLB hardware.

Figure 5.44 shows their caches. Keep in mind that the A8 has just one processor or core while the Core i7 has four. Both have identically organized 32 KiB, 4-way set associative, L1 instruction caches (per core) with 64 byte blocks. The A8 uses the same design for the data cache, while the Core i7 keeps everything the same except the associativity, which it increases to 8-way. Both use an 8-way set associative unified L2 cache (per core) with 64 byte blocks, although the A8 varies in size from 128 KiB to 1 MiB while the Core i7 is fixed at 256 KiB. As the Core i7 is used for servers, it

Characteristic     | ARM Cortex-A8                                  | Intel Core i7
Virtual address    | 32 bits                                        | 48 bits
Physical address   | 32 bits                                        | 44 bits
Page size          | Variable: 4, 16, 64 KiB, 1, 16 MiB             | Variable: 4 KiB, 2/4 MiB
TLB organization   | 1 TLB for instructions and 1 TLB for data; both TLBs are fully associative, with 32 entries, round robin replacement; TLB misses handled in hardware | 1 TLB for instructions and 1 TLB for data per core; both L1 TLBs are four-way set associative, LRU replacement; L1 I-TLB has 128 entries for small pages, 7 per thread for large pages; L1 D-TLB has 64 entries for small pages, 32 for large pages; the L2 TLB is four-way set associative, LRU replacement; the L2 TLB has 512 entries; TLB misses handled in hardware

FIGURE 5.43 Address translation and TLB hardware for the ARM Cortex-A8 and Intel Core i7 920. Both processors provide support for large pages, which are used for things like the operating system or mapping a frame buffer. The large-page scheme avoids using a large number of entries to map a single object that is always present.
single object that is always present.



advantage of this capability, but large servers and multiprocessors often have memory systems capable of handling more than one outstanding miss in parallel.

The Core i7 has a prefetch mechanism for data accesses. It looks at a pattern of data misses and uses this information to try to predict the next address to start fetching the data before the miss occurs. Such techniques generally work best when accessing arrays in loops.

The sophisticated memory hierarchies of these chips and the large fraction of the dies dedicated to caches and TLBs show the significant design effort expended to try to close the gap between processor cycle times and memory latency.

Performance of the A8 and Core i7 Memory Hierarchies

The memory hierarchy of the Cortex-A8 was simulated with a 1 MiB eight-way set associative L2 cache using the integer Minnespec benchmarks. As mentioned in Chapter 4, Minnespec is a set of benchmarks consisting of the SPEC2000 benchmarks but with different inputs that reduce the running times by several orders of magnitude. Although the use of smaller inputs does not change the instruction mix, it does affect the cache behavior. For example, on mcf, the most memory-intensive SPEC2000 integer benchmark, Minnespec has a miss rate for a 32 KiB cache that is only 65% of the miss rate for the full SPEC2000 version. For a 1 MiB cache the difference is a factor of six! For this reason, one cannot compare the Minnespec benchmarks against the SPEC2000 benchmarks, much less the even larger SPEC2006 benchmarks used for the Core i7 in Figure 5.47. Instead, the data are useful for looking at the relative impact of L1 and L2 misses on overall CPI, which we used in Chapter 4.

The A8 instruction cache miss rates for these benchmarks (and also for the full SPEC2000 versions on which Minnespec is based) are very small even for just the L1: close to zero for most and under 1% for all of them. This low rate probably results from the computationally intensive nature of the SPEC programs and the four-way set associative cache that eliminates most conflict misses. Figure 5.45 shows the data cache results for the A8, which have significant L1 and L2 miss rates. The L1 miss penalty for a 1 GHz Cortex-A8 is 11 clock cycles, while the L2 miss penalty is assumed to be 60 clock cycles. Using these miss penalties, Figure 5.46 shows the average miss penalty per data access.

Figure 5.47 shows the miss rates for the caches of the Core i7 using the SPEC2006 benchmarks. The L1 instruction cache miss rate varies from 0.1% to 1.8%, averaging just over 0.4%. This rate is in keeping with other studies of instruction cache behavior for the SPEC CPU2006 benchmarks, which show low instruction cache miss rates. With L1 data cache miss rates running 5% to 10%, and sometimes higher, the importance of the L2 and L3 caches should be obvious. Since the cost for a miss to memory is over 100 cycles and the average data miss rate in L2 is 4%, L3 is obviously critical. Assuming about half the instructions are loads or stores, without L3 the L2 cache misses could add two cycles per instruction to the CPI! In comparison, the average L3 data miss rate of 1% is still significant but four times lower than the L2 miss rate and six times less than the L1 miss rate.



 1  #include <x86intrin.h>
 2  #define UNROLL (4)
 3  #define BLOCKSIZE 32
 4  void do_block (int n, int si, int sj, int sk,
 5                 double *A, double *B, double *C)
 6  {
 7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
 8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
 9        __m256d c[4];
10        for ( int x = 0; x < UNROLL; x++ )
11          c[x] = _mm256_load_pd(C+i+x*4+j*n);
12          /* c[x] = C[i][j] */
13        for ( int k = sk; k < sk+BLOCKSIZE; k++ )
14        {
15          __m256d b = _mm256_broadcast_sd(B+k+j*n);
16          /* b = B[k][j] */
17          for ( int x = 0; x < UNROLL; x++ )
18            c[x] = _mm256_add_pd(c[x], /* c[x] += A[i][k]*b */
19                   _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20        }
21
22        for ( int x = 0; x < UNROLL; x++ )
23          _mm256_store_pd(C+i+x*4+j*n, c[x]);
24          /* C[i][j] = c[x] */
25      }
26  }
27
28  void dgemm (int n, double* A, double* B, double* C)
29  {
30    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
31      for ( int si = 0; si < n; si += BLOCKSIZE )
32        for ( int sk = 0; sk < n; sk += BLOCKSIZE )
33          do_block(n, si, sj, sk, A, B, C);
34  }

FIGURE 5.48 Optimized C version of DGEMM from Figure 4.80 using cache blocking. These changes are the same ones found in Figure 5.21. The assembly language produced by the compiler for the do_block function is nearly identical to Figure 4.81. Once again, there is no overhead to call the do_block because the compiler inlines the function call.



of A, B, and C. Indeed, lines 28–34 and lines 7–8 in Figure 5.48 are identical to lines 14–20 and lines 5–6 in Figure 5.21, with the exception of incrementing the for loop in line 7 by the amount unrolled.

Unlike the earlier chapters, we do not show the resulting x86 code because the inner loop code is nearly identical to Figure 4.81, as the blocking does not affect the computation, just the order that it accesses data in memory. What does change is the bookkeeping integer instructions to implement the for loops. It expands from 14 instructions before the inner loop and 8 after the loop for Figure 4.80 to 40 and 28 instructions respectively for the bookkeeping code generated for Figure 5.48. Nevertheless, the extra instructions executed pale in comparison to the performance improvement of reducing cache misses. Figure 5.49 compares unoptimized code to the optimizations for subword parallelism, instruction-level parallelism, and caches. Blocking improves performance over unrolled AVX code by factors of 2 to 2.5 for the larger matrices. When we compare unoptimized code to the code with all three optimizations, the performance improvement is factors of 8 to 15, with the largest increase for the largest matrix.

[Figure 5.49 is a bar chart of GFLOPS for the four DGEMM versions (Unoptimized, AVX, AVX + unroll, and AVX + unroll + blocked) at matrix dimensions 32x32, 160x160, 480x480, and 960x960.]

FIGURE 5.49 Performance of four versions of DGEMM from matrix dimensions 32x32 to 960x960. The fully optimized code for the largest matrix is almost 15 times as fast as the unoptimized version in Figure 3.21 in Chapter 3.

Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with Turbo mode turned off. As in Chapters 3 and 4, when we turn it on we improve all the results by the temporary increase in the clock rate of 3.3/2.6 = 1.27. Turbo mode works particularly well in this case because it is using only a single core of an eight-core chip. However, if we want to run fast we should use all cores, which we'll see in Chapter 6.



This mistake catches many people, including the authors (in earlier drafts) and instructors who forget whether they intended the addresses to be in words, bytes, or block numbers. Remember this pitfall when you tackle the exercises.

Pitfall: Having less set associativity for a shared cache than the number of cores or threads sharing that cache.

Without extra care, a parallel program running on 2^n processors or threads can easily allocate data structures to addresses that would map to the same set of a shared L2 cache. If the cache is at least 2^n-way associative, then these accidental conflicts are hidden by the hardware from the program. If not, programmers could face apparently mysterious performance bugs—actually due to L2 conflict misses—when migrating from, say, a 16-core design to a 32-core design if both use 16-way associative L2 caches.

Pitfall: Using average memory access time to evaluate the memory hierarchy of an out-of-order processor.

If a processor stalls during a cache miss, then you can separately calculate the memory-stall time and the processor execution time, and hence evaluate the memory hierarchy independently using average memory access time (see page 399).

If the processor continues to execute instructions, and may even sustain more cache misses during a cache miss, then the only accurate assessment of the memory hierarchy is to simulate the out-of-order processor along with the memory hierarchy.
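As a reminder of the simple model referred to above, average memory access time is the hit time plus the miss rate times the miss penalty; with, say, a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty (illustrative numbers, not from the text), it comes to 1 + 0.05 × 20 = 2 cycles. The pitfall is that this single number says nothing about how much of that penalty an out-of-order processor can overlap with useful work.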

Pitfall: Extending an address space by adding segments on top of an unsegmented address space.

During the 1970s, many programs grew so large that not all the code and data could be addressed with just a 16-bit address. Computers were then revised to offer 32-bit addresses, either through an unsegmented 32-bit address space (also called a flat address space) or by adding 16 bits of segment to the existing 16-bit address. From a marketing point of view, adding segments that were programmer-visible and that forced the programmer and compiler to decompose programs into segments could solve the addressing problem. Unfortunately, there is trouble any time a programming language wants an address that is larger than one segment, such as indices for large arrays, unrestricted pointers, or reference parameters. Moreover, adding segments can turn every address into two words—one for the segment number and one for the segment offset—causing problems in the use of addresses in registers.

Fallacy: Disk failure rates in the field match their specifications.

Two recent studies evaluated large collections of disks to check the relationship between results in the field compared to specifications. One study was of almost 100,000 disks that had quoted MTTF of 1,000,000 to 1,500,000 hours, or AFR of 0.6% to 0.8%. They found AFRs of 2% to 4% to be common, often three to five times higher than the specified rates [Schroeder and Gibson, 2007]. A second study of more than 100,000 disks at Google, which had a quoted AFR of about 1.5%, saw failure rates of 1.7% for drives in their first year rise to 8.6% for drives in their third year, or about five to six times the specified rate [Pinheiro, Weber, and Barroso, 2007].



Problem category: Access sensitive registers without trapping when running in user mode
Problem x86 instructions:
  Store global descriptor table register (SGDT)
  Store local descriptor table register (SLDT)
  Store interrupt descriptor table register (SIDT)
  Store machine status word (SMSW)
  Push flags (PUSHF, PUSHFD)
  Pop flags (POPF, POPFD)

Problem category: When accessing virtual memory mechanisms in user mode, instructions fail the x86 protection checks
Problem x86 instructions:
  Load access rights from segment descriptor (LAR)
  Load segment limit from segment descriptor (LSL)
  Verify if segment descriptor is readable (VERR)
  Verify if segment descriptor is writable (VERW)
  Pop to segment register (POP CS, POP SS, . . .)
  Push segment register (PUSH CS, PUSH SS, . . .)
  Far call to different privilege level (CALL)
  Far return to different privilege level (RET)
  Far jump to different privilege level (JMP)
  Software interrupt (INT)
  Store segment selector register (STR)
  Move to/from segment registers (MOVE)

FIGURE 5.51 Summary of 18 x86 instructions that cause problems for virtualization [Robin and Irvine, 2000]. The first five instructions in the top group allow a program in user mode to read a control register, such as descriptor table registers, without causing a trap. The pop flags instruction modifies a control register with sensitive information but fails silently when in user mode. The protection checking of the segmented architecture of the x86 is the downfall of the bottom group, as each of these instructions checks the privilege level implicitly as part of instruction execution when reading a control register. The checking assumes that the OS must be at the highest privilege level, which is not the case for guest VMs. Only the Move to segment register tries to modify control state, and protection checking foils it as well.

Pitfall: Implementing a virtual machine monitor on an instruction set architecture that wasn't designed to be virtualizable.

Many architects in the 1970s and 1980s weren't careful to make sure that all instructions reading or writing information related to hardware resource information were privileged. This laissez-faire attitude causes problems for VMMs for all of these architectures, including the x86, which we use here as an example. Figure 5.51 describes the 18 instructions that cause problems for virtualization [Robin and Irvine, 2000]. The two broad classes are instructions that

■ Read control registers in user mode that reveals that the guest operating system is running in a virtual machine (such as POPF, mentioned earlier)

■ Check protection as required by the segmented architecture but assume that the operating system is running at the highest privilege level

To simplify implementations of VMMs on the x86, both AMD and Intel have proposed extensions to the architecture via a new mode. Intel's VT-x provides a new execution mode for running VMs, an architected definition of the VM



5.1.4 [10] How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?

5.1.5 [5] References to which variables exhibit temporal locality?

5.1.6 [5] References to which variables exhibit spatial locality?

5.2 Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses.

3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253

5.2.1 [10] For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

5.2.2 [10] For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

5.2.3 [20] You are asked to optimize a cache design for the given references. There are three direct-mapped cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?

There are many different design parameters that are important to a cache's overall performance. Below are listed parameters for different direct-mapped cache designs.

Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle

5.2.4 [15] Calculate the total number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.

5.2.5 [20] Generate a series of read requests that have a lower miss rate on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible solution that would make the cache listed have an equal or lower miss rate than the 2 KiB cache. Discuss the advantages and disadvantages of such a solution.

5.2.6 [15] The formula shown in Section 5.3 shows the typical method to index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different
the cache). Assuming a 32-bit address <strong>and</strong> 1024 blocks in the cache, consider a different



Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0

5.13.1 [5] Assuming an LRU replacement policy, how many hits does this address sequence exhibit?

5.13.2 [5] Assuming an MRU (most recently used) replacement policy, how many hits does this address sequence exhibit?

5.13.3 [5] Simulate a random replacement policy by flipping a coin. For example, "heads" means to evict the first block in a set and "tails" means to evict the second block in a set. How many hits does this address sequence exhibit?

5.13.4 [10] Which address should be evicted at each replacement to maximize the number of hits? How many hits does this address sequence exhibit if you follow this "optimal" policy?

5.13.5 [10] Describe why it is difficult to implement a cache replacement policy that is optimal for all address sequences.

5.13.6 [10] Assume you could make a decision upon each memory reference whether or not you want the requested address to be cached. What impact could this have on miss rate?

5.14 To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called "shadow paging" duplicates each virtual machine's page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables (VA ⇒ PA and PA ⇒ MA) and can walk such tables purely in hardware.

Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch.

5.14.1 [10] What would happen for the given operation sequence for shadow page table and nested page table, respectively?

5.14.2 [10] Assuming an x86-based 4-level page table in both guest and nested page table, how many memory references are needed to service a TLB miss for native vs. nested page table?

5.14.3 [15] Among TLB miss rate, TLB miss latency, page fault rate, and page fault handler latency, which metrics are more important for shadow page table? Which are important for nested page table?
Which are important for nested page table?



5.16 In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your own finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block.

Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn't have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to main memory while the processor is working.

5.16.1 [10] What should happen if the processor issues a request that hits in the cache while a block is being written back to main memory from the write buffer?

5.16.2 [10] What should happen if the processor issues a request that misses in the cache while a block is being written back to main memory from the write buffer?

5.16.3 [30] Design a finite state machine to enable the use of a write buffer.

5.17 Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

5.17.1 [15] List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more possible value of the block if the protocol doesn't ensure cache coherency.

5.17.2 [15] For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.

5.17.3 [10] What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0).

P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;



5.19 In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows:

    struct entry {
        int srcIP;          // remote IP address
        char URL[128];      // request URL (e.g., "GET index.html")
        long long refTime;  // reference time
        int status;         // connection status
        char browser[64];   // client browser name
    } log[NUM_ENTRIES];

Assume the following processing function for the log:

    topK_sourceIP(int hour);

5.19.1 [5] Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?

5.19.2 [10] How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.

5.19.3 [10] Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippet and data.

For the problems below, use data from "Cache Performance for SPEC CPU2000 Benchmarks" (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table.

a. Mesa / gcc
b. mcf / swim

5.19.4 [10] For 64 KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?

5.19.5 [10] Select the set associativity to be used by a 64 KiB L1 data cache shared by both benchmarks. If the L1 cache has to be directly mapped, select the set associativity for the 1 MiB L2 cache.

5.19.6 [20] Give an example in the miss rate table where higher set associativity actually increases miss rate. Construct a cache configuration and reference stream to demonstrate this.
stream to demonstrate this.



Answers to Check Yourself

§5.1, page 377: 1 and 4. (3 is false because the cost of the memory hierarchy varies per computer, but in 2013 the highest cost is usually the DRAM.)
§5.3, page 398: 1 and 4: A lower miss penalty can enable smaller blocks, since you don't have that much latency to amortize, yet higher memory bandwidth usually leads to larger blocks, since the miss penalty is only slightly larger.
§5.4, page 417: 1.
§5.7, page 454: 1-a, 2-c, 3-b, 4-d.
§5.8, page 461: 2. (Both large block sizes and prefetching may reduce compulsory misses, so 1 is false.)




6 Parallel Processors from Client to Cloud

"I swing big, with everything I've got. I hit big or I miss big. I like to live as big as I can."
Babe Ruth, American baseball player

6.1 Introduction 502
6.2 The Difficulty of Creating Parallel Processing Programs 504
6.3 SISD, MIMD, SIMD, SPMD, and Vector 509
6.4 Hardware Multithreading 516
6.5 Multicore and Other Shared Memory Multiprocessors 519
6.6 Introduction to Graphics Processing Units 524




multicore microprocessors instead of multiprocessor microprocessors, presumably to avoid redundancy in naming. Hence, processors are often called cores in a multicore chip. The number of cores is expected to increase with Moore's Law. These multicores are almost always Shared Memory Processors (SMPs), as they usually share a single physical address space. We'll see SMPs more in Section 6.5.

The state of technology today means that programmers who care about performance must become parallel programmers, for sequential code now means slow code.

The tall challenge facing the industry is to create hardware and software that will make it easy to write correct parallel processing programs that will execute efficiently in performance and energy as the number of cores per chip scales.

This abrupt shift in microprocessor design caught many off guard, so there is a great deal of confusion about the terminology and what it means. Figure 6.1 tries to clarify the terms serial, parallel, sequential, and concurrent. The columns of this figure represent the software, which is either inherently sequential or concurrent. The rows of the figure represent the hardware, which is either serial or parallel. For example, the programmers of compilers think of them as sequential programs: the steps include parsing, code generation, optimization, and so on. In contrast, the programmers of operating systems normally think of them as concurrent programs: cooperating processes handling I/O events due to independent jobs running on a computer.

The point of these two axes of Figure 6.1 is that concurrent software can run on serial hardware, such as operating systems for the Intel Pentium 4 uniprocessor, or on parallel hardware, such as an OS on the more recent Intel Core i7. The same is true for sequential software. For example, the MATLAB programmer writes a matrix multiply thinking about it sequentially, but it could run serially on the Pentium 4 or in parallel on the Intel Core i7.

You might guess that the only challenge of the parallel revolution is figuring out how to make naturally sequential software have high performance on parallel hardware, but it is also to make concurrent programs have high performance on multiprocessors as the number of processors increases. With this distinction made, in the rest of this chapter we will use parallel processing program or parallel software to mean either sequential or concurrent software running on parallel hardware. The next section of this chapter describes why it is hard to create efficient parallel processing programs.

multicore microprocessor A microprocessor containing multiple processors (“cores”) in a single integrated circuit. Virtually all microprocessors today in desktops and servers are multicore.

shared memory multiprocessor (SMP) A parallel processor with a single physical address space.

FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. The columns are the software (sequential or concurrent) and the rows are the hardware (serial or parallel):

Serial hardware, sequential software: Matrix Multiply written in MATLAB running on an Intel Pentium 4
Serial hardware, concurrent software: Windows Vista Operating System running on an Intel Pentium 4
Parallel hardware, sequential software: Matrix Multiply written in MATLAB running on an Intel Core i7
Parallel hardware, concurrent software: Windows Vista Operating System running on an Intel Core i7


504 Chapter 6 Parallel Processors from Client to Cloud<br />

Before proceeding further down the path to parallelism, don't forget our initial

incursions from the earlier chapters:<br />

Check<br />

Yourself<br />

■ Chapter 2, Section 2.11: Parallelism <strong>and</strong> Instructions: Synchronization<br />

■ Chapter 3, Section 3.6: Parallelism <strong>and</strong> <strong>Computer</strong> Arithmetic: Subword<br />

Parallelism<br />

■ Chapter 4, Section 4.10: Parallelism via Instructions<br />

■ Chapter 5, Section 5.10: Parallelism <strong>and</strong> Memory Hierarchy: Cache Coherence<br />

True or false: To benefit from a multiprocessor, an application must be concurrent.<br />

6.2<br />

The Difficulty of Creating Parallel<br />

Processing Programs<br />

The difficulty with parallelism is not the hardware; it is that too few important<br />

application programs have been rewritten to complete tasks sooner on multiprocessors.<br />

It is difficult to write software that uses multiple processors to complete one task<br />

faster, <strong>and</strong> the problem gets worse as the number of processors increases.<br />

Why has this been so? Why have parallel processing programs been so much<br />

harder to develop than sequential programs?<br />

The first reason is that you must get better performance or better energy<br />

efficiency from a parallel processing program on a multiprocessor; otherwise, you<br />

would just use a sequential program on a uniprocessor, as sequential programming<br />

is simpler. In fact, uniprocessor design techniques such as superscalar and out-of-order

execution take advantage of instruction-level parallelism (see Chapter 4),<br />

normally without the involvement of the programmer. Such innovations reduced<br />

the dem<strong>and</strong> for rewriting programs for multiprocessors, since programmers<br />

could do nothing <strong>and</strong> yet their sequential programs would run faster on new<br />

computers.<br />

Why is it difficult to write parallel processing programs that are fast, especially<br />

as the number of processors increases? In Chapter 1, we used the analogy of<br />

eight reporters trying to write a single story in hopes of doing the work eight<br />

times faster. To succeed, the task must be broken into eight equal-sized pieces,<br />

because otherwise some reporters would be idle while waiting for the ones with<br />

larger pieces to finish. Another speed-up obstacle could be that the reporters<br />

would spend too much time communicating with each other instead of writing<br />

their pieces of the story. For both this analogy <strong>and</strong> parallel programming,<br />

the challenges include scheduling, partitioning the work into parallel pieces,<br />

balancing the load evenly between the workers, time to synchronize, <strong>and</strong>


6.2 The Difficulty of Creating Parallel Processing Programs 505<br />

overhead for communication between the parties. The challenge is stiffer with the<br />

more reporters for a newspaper story <strong>and</strong> with the more processors for parallel<br />

programming.<br />

Our discussion in Chapter 1 reveals another obstacle, namely Amdahl's Law. It

reminds us that even small parts of a program must be parallelized if the program<br />

is to make good use of many cores.<br />

Speed-up Challenge<br />

Suppose you want to achieve a speed-up of 90 times faster with 100 processors.<br />

What percentage of the original computation can be sequential?<br />

EXAMPLE<br />

Amdahl's Law (Chapter 1) says

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

ANSWER

We can reformulate Amdahl's Law in terms of speed-up versus the original execution time:

Speed-up = Execution time before / ((Execution time before − Execution time affected) + (Execution time affected / Amount of improvement))

This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by improvement is considered the fraction of the original execution time:

Speed-up = 1 / ((1 − Fraction time affected) + (Fraction time affected / Amount of improvement))

Substituting 90 for speed-up and 100 for amount of improvement into the formula above:

90 = 1 / ((1 − Fraction time affected) + (Fraction time affected / 100))


506 Chapter 6 Parallel Processors from Client to Cloud<br />

Then simplifying the formula <strong>and</strong> solving for fraction time affected:<br />

90 × (1 − 0.99 × Fraction time affected) = 1<br />

90 − (90 × 0.99 × Fraction time affected) = 1<br />

90 − 1 = 90 × 0.99 × Fraction time affected

Fraction time affected = 89/89.1 = 0.999<br />

Thus, to achieve a speed-up of 90 from 100 processors, the sequential<br />

percentage can only be 0.1%.<br />
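To check this algebra mechanically, the calculation can be expressed in a few lines of C. This is only a sketch; the names S, P, and F below are ours, not the book's notation.

#include <stdio.h>

/* Check the Amdahl's Law algebra above: given a target speed-up S on
   P processors, solve S = 1 / ((1 - F) + F/P) for F, the fraction of
   the original time that must be parallelizable. */
int main(void) {
    double S = 90.0, P = 100.0;
    double F = (1.0 - 1.0/S) / (1.0 - 1.0/P);   /* closed-form solution */
    printf("fraction parallel = %.4f, sequential = %.2f%%\n",
           F, (1.0 - F) * 100.0);               /* about 0.999 and 0.1% */
    return 0;
}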

Yet, there are applications with plenty of parallelism, as we shall see next.<br />

EXAMPLE<br />

Speed-up Challenge: Bigger Problem<br />

Suppose you want to perform two sums: one is a sum of 10 scalar variables, <strong>and</strong><br />

one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10.<br />

For now let’s assume only the matrix sum is parallelizable; we’ll see soon how to<br />

parallelize scalar sums. What speed-up do you get with 10 versus 40 processors?<br />

Next, calculate the speed-ups assuming the matrices grow to 20 by 20.<br />

ANSWER<br />

If we assume performance is a function of the time for an addition, t, then<br />

there are 10 additions that do not benefit from parallel processors <strong>and</strong> 100<br />

additions that do. If the time for a single processor is 110t, the execution time for 10 processors is

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Execution time after improvement = 100t/10 + 10t = 20t

so the speed-up with 10 processors is 110t/20t = 5.5. The execution time for<br />

40 processors is<br />

Execution time after improvement = 100t/40 + 10t = 12.5t

so the speed-up with 40 processors is 110t/12.5t = 8.8. Thus, for this problem<br />

size, we get about 55% of the potential speed-up with 10 processors, but only<br />

22% with 40.


6.2 The Difficulty of Creating Parallel Processing Programs 507<br />

Look what happens when we increase the matrix. The sequential program now<br />

takes 10t + 400t = 410t. The execution time for 10 processors is<br />

Execution time after improvement = 400t/10 + 10t = 50t

so the speed-up with 10 processors is 410t/50t = 8.2. The execution time for<br />

40 processors is<br />

Execution time after improvement = 400t/40 + 10t = 20t

so the speed-up with 40 processors is 410t/20t = 20.5. Thus, for this larger problem<br />

size, we get 82% of the potential speed-up with 10 processors <strong>and</strong> 51% with 40.<br />
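The same arithmetic for both problem sizes can be packaged as a small C helper. This is only a sketch of the calculation above; the function name speedup is ours, and the unit time t is left implicit.

#include <stdio.h>

/* The 10 scalar additions are serial; the matrix additions (100 or 400)
   are spread over p processors. Time is measured in units of t. */
static double speedup(double serial_adds, double parallel_adds, int p) {
    double before = serial_adds + parallel_adds;
    double after  = serial_adds + parallel_adds / p;
    return before / after;
}

int main(void) {
    printf("10x10 matrix: %4.1f on 10, %4.1f on 40 processors\n",
           speedup(10, 100, 10), speedup(10, 100, 40));   /* 5.5 and 8.8 */
    printf("20x20 matrix: %4.1f on 10, %4.1f on 40 processors\n",
           speedup(10, 400, 10), speedup(10, 400, 40));   /* 8.2 and 20.5 */
    return 0;
}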

These examples show that getting good speed-up on a multiprocessor while<br />

keeping the problem size fixed is harder than getting good speed-up by increasing<br />

the size of the problem. This insight allows us to introduce two terms that describe<br />

ways to scale up.<br />

Strong scaling means measuring speed-up while keeping the problem size fixed.<br />

Weak scaling means that the problem size grows proportionally to the increase in<br />

the number of processors. Let’s assume that the size of the problem, M, is the working<br />

set in main memory, <strong>and</strong> we have P processors. Then the memory per processor for<br />

strong scaling is approximately M/P, <strong>and</strong> for weak scaling, it is approximately M.<br />

Note that the memory hierarchy can interfere with the conventional wisdom<br />

about weak scaling being easier than strong scaling. For example, if the weakly<br />

scaled dataset no longer fits in the last level cache of a multicore microprocessor,<br />

the resulting performance could be much worse than by using strong scaling.<br />

Depending on the application, you can argue for either scaling approach. For<br />

example, the TPC-C debit-credit database benchmark requires that you scale up<br />

the number of customer accounts in proportion to the higher transactions per<br />

minute. The argument is that it's nonsensical to think that a given customer base

is suddenly going to start using ATMs 100 times a day just because the bank gets a<br />

faster computer. Instead, if you're going to demonstrate a system that can perform

100 times the numbers of transactions per minute, you should run the experiment<br />

with 100 times as many customers. Bigger problems often need more data, which<br />

is an argument for weak scaling.<br />

This final example shows the importance of load balancing.<br />

strong scaling Speed-up achieved on a multiprocessor without increasing the size of the problem.

weak scaling Speed-up achieved on a multiprocessor while increasing the size of the problem proportionally to the increase in the number of processors.

Speed-up Challenge: Balancing Load<br />

To achieve the speed-up of 20.5 on the previous larger problem with 40<br />

processors, we assumed the load was perfectly balanced. That is, each of the 40<br />

EXAMPLE


6.3 SISD, MIMD, SIMD, SPMD, <strong>and</strong> Vector 511<br />

data elements from memory, put them in order into a large set of registers, operate<br />

on them sequentially in registers using pipelined execution units, <strong>and</strong> then write<br />

the results back to memory. A key feature of vector architectures is then a set of<br />

vector registers. Thus, a vector architecture might have 32 vector registers, each<br />

with 64 64-bit elements.<br />

Comparing Vector to Conventional Code<br />

Suppose we extend the MIPS instruction set architecture with vector<br />

instructions <strong>and</strong> vector registers. Vector operations use the same names as<br />

MIPS operations, but with the letter V appended. For example, addv.d<br />

adds two double-precision vectors. The vector instructions take as their input<br />

either a pair of vector registers (addv.d) or a vector register <strong>and</strong> a scalar<br />

register (addvs.d). In the latter case, the value in the scalar register is used<br />

as the input for all operations: the operation addvs.d will add the contents

of a scalar register to each element in a vector register. The names lv <strong>and</strong> sv<br />

denote vector load <strong>and</strong> vector store, <strong>and</strong> they load or store an entire vector<br />

of double-precision data. One oper<strong>and</strong> is the vector register to be loaded or<br />

stored; the other oper<strong>and</strong>, which is a MIPS general-purpose register, is the<br />

starting address of the vector in memory. Given this short description, show<br />

the conventional MIPS code versus the vector MIPS code for<br />

EXAMPLE<br />

Y = a × X + Y

where X <strong>and</strong> Y are vectors of 64 double precision floating-point numbers,<br />

initially resident in memory, <strong>and</strong> a is a scalar double precision variable. (This<br />

example is the so-called DAXPY loop that forms the inner loop of the Linpack<br />

benchmark; DAXPY stands for double precision a × X plus Y.) Assume that

the starting addresses of X <strong>and</strong> Y are in $s0 <strong>and</strong> $s1, respectively.<br />

Here is the conventional MIPS code for DAXPY:<br />

      l.d    $f0,a($sp)      ;load scalar a
      addiu  $t0,$s0,#512    ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t1,$t0,$s0     ;compute bound
      bne    $t1,$zero,loop  ;check if done

Here is the vector MIPS code for DAXPY:<br />

ANSWER


512 Chapter 6 Parallel Processors from Client to Cloud<br />

      l.d      $f0,a($sp)    ;load scalar a
      lv       $v1,0($s0)    ;load vector x
      mulvs.d  $v2,$v1,$f0   ;vector-scalar multiply
      lv       $v3,0($s1)    ;load vector y
      addv.d   $v4,$v2,$v3   ;add y to product
      sv       $v4,0($s1)    ;store the result
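For reference, the computation that both code sequences implement is just a short loop. Here it is as a C sketch; the function name daxpy and the array names x and y are ours, and the vector length 64 comes from the example.

/* The DAXPY computation that both the scalar and the vector MIPS
   code above implement: Y = a*X + Y over 64 double-precision elements. */
void daxpy(double a, double x[64], double y[64]) {
    for (int i = 0; i < 64; i += 1)
        y[i] = a * x[i] + y[i];
}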

There are some interesting comparisons between the two code segments in<br />

this example. The most dramatic is that the vector processor greatly reduces the<br />

dynamic instruction b<strong>and</strong>width, executing only 6 instructions versus almost 600<br />

for the traditional MIPS architecture. This reduction occurs both because the vector<br />

operations work on 64 elements at a time <strong>and</strong> because the overhead instructions<br />

that constitute nearly half the loop on MIPS are not present in the vector code. As<br />

you might expect, this reduction in instructions fetched <strong>and</strong> executed saves energy.<br />

Another important difference is the frequency of pipeline hazards (Chapter 4).<br />

In the straightforward MIPS code, every add.d must wait for a mul.d, every<br />

s.d must wait for the add.d <strong>and</strong> every add.d <strong>and</strong> mul.d must wait on l.d.<br />

On the vector processor, each vector instruction will only stall for the first element<br />

in each vector, <strong>and</strong> then subsequent elements will flow smoothly down the pipeline.<br />

Thus, pipeline stalls are required only once per vector operation, rather than once<br />

per vector element. In this example, the pipeline stall frequency on MIPS will be<br />

about 64 times higher than it is on the vector version of MIPS. The pipeline stalls<br />

can be reduced on MIPS by using loop unrolling (see Chapter 4). However, the<br />

large difference in instruction b<strong>and</strong>width cannot be reduced.<br />

Since the vector elements are independent, they can be operated on in parallel,<br />

much like subword parallelism for AVX instructions. All modern vector computers<br />

have vector functional units with multiple parallel pipelines (called vector lanes; see<br />

Figures 6.2 <strong>and</strong> 6.3) that can produce two or more results per clock cycle.<br />

Elaboration: The loop in the example above exactly matched the vector length. When<br />

loops are shorter, vector architectures use a register that reduces the length of vector<br />

operations. When loops are larger, we add bookkeeping code to iterate full-length vector<br />

operations <strong>and</strong> to h<strong>and</strong>le the leftovers. This latter process is called strip mining.<br />

Vector versus Scalar<br />

Vector instructions have several important properties compared to conventional<br />

instruction set architectures, which are called scalar architectures in this context:<br />

■ A single vector instruction specifies a great deal of work: it is equivalent

to executing an entire loop. The instruction fetch <strong>and</strong> decode b<strong>and</strong>width<br />

needed is dramatically reduced.<br />

■ By using a vector instruction, the compiler or programmer indicates that the<br />

computation of each result in the vector is independent of the computation of<br />

other results in the same vector, so hardware does not have to check for data<br />

hazards within a vector instruction.<br />

■ Vector architectures and compilers have a reputation for making it much easier than MIMD multiprocessors to write efficient applications when they contain data-level parallelism.


6.3 SISD, MIMD, SIMD, SPMD, <strong>and</strong> Vector 513<br />

■ Hardware need only check for data hazards between two vector instructions<br />

once per vector oper<strong>and</strong>, not once for every element within the vectors.<br />

Reduced checking can save energy as well as time.<br />

■ Vector instructions that access memory have a known access pattern. If<br />

the vector's elements are all adjacent, then fetching the vector from a set

of heavily interleaved memory banks works very well. Thus, the cost of the<br />

latency to main memory is seen only once for the entire vector, rather than<br />

once for each word of the vector.<br />

■ Because an entire loop is replaced by a vector instruction whose behavior<br />

is predetermined, control hazards that would normally arise from the loop<br />

branch are nonexistent.<br />

■ The savings in instruction b<strong>and</strong>width <strong>and</strong> hazard checking plus the efficient<br />

use of memory b<strong>and</strong>width give vector architectures advantages in power <strong>and</strong><br />

energy versus scalar architectures.<br />

For these reasons, vector operations can be made faster than a sequence of<br />

scalar operations on the same number of data items, <strong>and</strong> designers are motivated<br />

to include vector units if the application domain can often use them.<br />

Vector versus Multimedia Extensions<br />

Like multimedia extensions found in the x86 AVX instructions, a vector instruction<br />

specifies multiple operations. However, multimedia extensions typically specify a<br />

few operations while vector specifies dozens of operations. Unlike multimedia<br />

extensions, the number of elements in a vector operation is not in the opcode but in a<br />

separate register. This distinction means different versions of the vector architecture<br />

can be implemented with a different number of elements just by changing the<br />

contents of that register <strong>and</strong> hence retain binary compatibility. In contrast, a new<br />

large set of opcodes is added each time the vector length changes in the multimedia<br />

extension architecture of the x86: MMX, SSE, SSE2, AVX, AVX2, … .<br />

Also unlike multimedia extensions, the data transfers need not be contiguous.<br />

Vectors support both strided accesses, where the hardware loads every nth data<br />

element in memory, <strong>and</strong> indexed accesses, where hardware finds the addresses of<br />

the items to be loaded in a vector register. Indexed accesses are also called gather-scatter,

in that indexed loads gather elements from main memory into contiguous<br />

vector elements <strong>and</strong> indexed stores scatter vector elements across main memory.<br />
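As a sketch of what these two access patterns mean, here they are written as ordinary C loops over a 64-element vector register; the function and variable names are ours, not part of any instruction set.

/* Semantics of strided and indexed (gather) vector loads, written as
   plain C loops over a 64-element vector register v. */
void strided_load(double v[64], const double *mem, long base, long stride) {
    for (int i = 0; i < 64; i += 1)
        v[i] = mem[base + (long)i * stride];   /* every nth element */
}

void gather_load(double v[64], const double *mem, const long index[64]) {
    for (int i = 0; i < 64; i += 1)
        v[i] = mem[index[i]];                  /* addresses come from a vector */
}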

Like multimedia extensions, vector architectures easily capture the flexibility<br />

in data widths, so it is easy to make a vector operation work on 32 64-bit data<br />

elements or 64 32-bit data elements or 128 16-bit data elements or 256 8-bit data<br />

elements. The parallel semantics of a vector instruction allows an implementation<br />

to execute these operations using a deeply pipelined functional unit, an array of<br />

parallel functional units, or a combination of parallel <strong>and</strong> pipelined functional<br />

units. Figure 6.3 illustrates how to improve vector performance by using parallel<br />

pipelines to execute a vector add instruction.<br />

Vector arithmetic instructions usually only allow element N of one vector<br />

register to take part in operations with element N from other vector registers. This


514 Chapter 6 Parallel Processors from Client to Cloud<br />

[Figure 6.3: an element group of a vector add C = A + B, shown executing on (a) a single add pipeline and (b) four parallel add pipelines (lanes); see the caption below.]

vector lane One or more vector functional units and a portion of the vector register file. Inspired by lanes on highways that increase traffic speed, multiple lanes execute vector operations simultaneously.

Check<br />

Yourself<br />

FIGURE 6.3 Using multiple functional units to improve the performance of a single vector<br />

add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline <strong>and</strong> can complete<br />

one addition per cycle. The vector processor (b) on the right has four add pipelines or lanes <strong>and</strong> can complete<br />

four additions per cycle. The elements within a single vector add instruction are interleaved across the four<br />

lanes.<br />

dramatically simplifies the construction of a highly parallel vector unit, which can<br />

be structured as multiple parallel vector lanes. As with a traffic highway, we can<br />

increase the peak throughput of a vector unit by adding more lanes. Figure 6.4<br />

shows the structure of a four-lane vector unit. Thus, going to four lanes from one<br />

lane reduces the number of clocks per vector instruction by roughly a factor of four.<br />

For multiple lanes to be advantageous, both the applications <strong>and</strong> the architecture<br />

must support long vectors. Otherwise, they will execute so quickly that you’ll run<br />

out of instructions, requiring instruction level parallel techniques like those in<br />

Chapter 4 to supply enough vector instructions.<br />

Generally, vector architectures are a very efficient way to execute data parallel<br />

processing programs; they are better matches to compiler technology than<br />

multimedia extensions; <strong>and</strong> they are easier to evolve over time than the multimedia<br />

extensions to the x86 architecture.<br />

Given these classic categories, we next see how to exploit parallel streams of<br />

instructions to improve the performance of a single processor, which we will reuse<br />

with multiple processors.<br />

True or false: As exemplified in the x86, multimedia extensions can be thought of<br />

as a vector architecture with short vectors that supports only contiguous vector<br />

data transfers.


6.4 Hardware Multithreading 517<br />

Simultaneous multithreading (SMT) is a variation on hardware multithreading<br />

that uses the resources of a multiple-issue, dynamically scheduled pipelined<br />

processor to exploit thread-level parallelism at the same time it exploits instruction-level

parallelism (see Chapter 4). The key insight that motivates SMT is that<br />

multiple-issue processors often have more functional unit parallelism available<br />

than most single threads can effectively use. Furthermore, with register renaming<br />

<strong>and</strong> dynamic scheduling (see Chapter 4), multiple instructions from independent<br />

threads can be issued without regard to the dependences among them; the resolution<br />

of the dependences can be h<strong>and</strong>led by the dynamic scheduling capability.<br />

Since SMT relies on the existing dynamic mechanisms, it does not switch<br />

resources every cycle. Instead, SMT is always executing instructions from multiple<br />

threads, leaving it up to the hardware to associate instruction slots <strong>and</strong> renamed<br />

registers with their proper threads.<br />

Figure 6.5 conceptually illustrates the differences in a processor's ability to exploit

superscalar resources for the following processor configurations. The top portion shows<br />


simultaneous multithreading (SMT) A version of multithreading that lowers the cost of multithreading by utilizing the resources needed for multiple issue, dynamically scheduled microarchitecture.

[Figure 6.5 panels: issue slots (horizontal) versus time (vertical) for threads A through D running alone, and for Coarse MT, Fine MT, and SMT.]

FIGURE 6.5 How four threads use the issue slots of a superscalar processor in different<br />

approaches. The four threads at the top show how each would execute running alone on a st<strong>and</strong>ard<br />

superscalar processor without multithreading support. The three examples at the bottom show how they<br />

would execute running together in three multithreading options. The horizontal dimension represents the<br />

instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles.<br />

An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of<br />

gray <strong>and</strong> color correspond to four different threads in the multithreading processors. The additional pipeline<br />

start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss<br />

in throughput for coarse multithreading.


520 Chapter 6 Parallel Processors from Client to Cloud<br />

uniform memory access (UMA) A multiprocessor in which latency to any word in main memory is about the same no matter which processor requests the access.

nonuniform memory access (NUMA) A type of single address space multiprocessor in which some memory accesses are much faster than others depending on which processor asks for which word.

synchronization The process of coordinating the behavior of two or more processes, which may be running on different processors.

lock A synchronization device that allows access to data to only one processor at a time.

nearly always the case for multicore chips, although a more accurate term would

have been shared-address multiprocessor. Processors communicate through shared<br />

variables in memory, with all processors capable of accessing any memory location<br />

via loads <strong>and</strong> stores. Figure 6.7 shows the classic organization of an SMP. Note that<br />

such systems can still run independent jobs in their own virtual address spaces,<br />

even if they all share a physical address space.<br />

Single address space multiprocessors come in two styles. In the first style, the<br />

latency to a word in memory does not depend on which processor asks for it.<br />

Such machines are called uniform memory access (UMA) multiprocessors. In the<br />

second style, some memory accesses are much faster than others, depending on<br />

which processor asks for which word, typically because main memory is divided<br />

<strong>and</strong> attached to different microprocessors or to different memory controllers on<br />

the same chip. Such machines are called nonuniform memory access (NUMA)<br />

multiprocessors. As you might expect, the programming challenges are harder for<br />

a NUMA multiprocessor than for a UMA multiprocessor, but NUMA machines<br />

can scale to larger sizes <strong>and</strong> NUMAs can have lower latency to nearby memory.<br />

As processors operating in parallel will normally share data, they also need to<br />

coordinate when operating on shared data; otherwise, one processor could start<br />

working on data before another is finished with it. This coordination is called<br />

synchronization, which we saw in Chapter 2. When sharing is supported with a<br />

single address space, there must be a separate mechanism for synchronization. One<br />

approach uses a lock for a shared variable. Only one processor at a time can acquire<br />

the lock, <strong>and</strong> other processors interested in shared data must wait until the original<br />

processor unlocks the variable. Section 2.11 of Chapter 2 describes the instructions<br />

for locking in the MIPS instruction set.<br />
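As a software illustration, here is a minimal sketch of a lock protecting a shared variable, written with POSIX threads rather than the MIPS instructions of Section 2.11; the names shared_sum and add_partial_sum are ours, and the program is built with the -pthread option.

#include <pthread.h>
#include <stdio.h>

/* A minimal sketch of lock-based synchronization with POSIX threads
   (the MIPS instructions of Section 2.11 build the same primitive in
   hardware). Only one thread at a time may update the shared sum. */
static double shared_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_partial_sum(void *arg) {
    double my_part = *(double *)arg;
    pthread_mutex_lock(&lock);      /* acquire: other threads must wait */
    shared_sum += my_part;          /* critical section */
    pthread_mutex_unlock(&lock);    /* release the lock */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    double part[4] = {1.0, 2.0, 3.0, 4.0};    /* made-up partial sums */
    for (int i = 0; i < 4; i += 1)
        pthread_create(&t[i], NULL, add_partial_sum, &part[i]);
    for (int i = 0; i < 4; i += 1)
        pthread_join(t[i], NULL);
    printf("shared_sum = %g\n", shared_sum);  /* expect 10 */
    return 0;
}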

[Figure: multiple processors, each with its own cache, connected by an interconnection network to shared memory and I/O.]

FIGURE 6.7 Classic organization of a shared memory multiprocessor.


6.5 Multicore <strong>and</strong> Other Shared Memory Multiprocessors 521<br />

A Simple Parallel Processing Program for a Shared Address Space<br />

Suppose we want to sum 64,000 numbers on a shared memory multiprocessor<br />

computer with uniform memory access time. Let's assume we have 64

processors.<br />

EXAMPLE<br />

The first step is to ensure a balanced load per processor, so we split the set<br />

of numbers into subsets of the same size. We do not allocate the subsets to a<br />

different memory space, since there is a single memory space for this machine;<br />

we just give different starting addresses to each processor. Pn is the number that<br />

identifies the processor, between 0 <strong>and</strong> 63. All processors start the program by<br />

running a loop that sums their subset of numbers:<br />

ANSWER<br />

sum[Pn] = 0;<br />

for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)<br />

sum[Pn] += A[i]; /*sum the assigned areas*/<br />

(Note the C code i += 1 is just a shorter way to say i = i + 1.)<br />

The next step is to add these 64 partial sums. This step is called a reduction,<br />

where we divide to conquer. Half of the processors add pairs of partial sums,<br />

<strong>and</strong> then a quarter add pairs of the new partial sums, <strong>and</strong> so on until we<br />

have the single, final sum. Figure 6.8 illustrates the hierarchical nature of this<br />

reduction.<br />

In this example, the two processors must synchronize before the consumer<br />

processor tries to read the result from the memory location written by the<br />

producer processor; otherwise, the consumer may read the old value of<br />

reduction A function that processes a data structure and returns a single value.

                0          (half = 1)
             0     1       (half = 2)
          0   1   2   3    (half = 4)
        0 1 2 3 4 5 6 7

FIGURE 6.8 The last four levels of a reduction that sums results from each processor,<br />

from bottom to top. For all processors whose number i is less than half, add the sum produced by<br />

processor number (i + half) to its sum.


522 Chapter 6 Parallel Processors from Client to Cloud<br />

the data. We want each processor to have its own version of the loop counter<br />

variable i, so we must indicate that it is a private variable. Here is the code<br />

(half is private also):<br />

half = 64; /*64 processors in multiprocessor*/
do {
   synch(); /*wait for partial sum completion*/
   if (half%2 != 0 && Pn == 0)
      sum[0] += sum[half-1];
      /*Conditional sum needed when half is
        odd; Processor0 gets missing element */
   half = half/2; /*dividing line on who sums */
   if (Pn < half) sum[Pn] += sum[Pn+half];
} while (half > 1); /*exit with final sum in sum[0] */
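The example leaves synch() undefined. One plausible way to provide it, sketched below with a POSIX barrier, is to make every thread wait until all partial sums from the previous step have been written; this is our assumption, not the book's definition of synch().

#include <pthread.h>

/* A possible synch(): a barrier that makes all 64 processors (threads
   here) wait until every partial sum from the previous step is visible. */
static pthread_barrier_t barrier;    /* initialized once with count 64 */

void synch_init(unsigned count) {
    pthread_barrier_init(&barrier, NULL, count);
}

void synch(void) {
    pthread_barrier_wait(&barrier);
}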

Hardware/<br />

Software<br />

Interface<br />

OpenMP An API for shared memory multiprocessing in C, C++, or Fortran that runs on UNIX and Microsoft platforms. It includes compiler directives, a library, and runtime directives.

Given the long-term interest in parallel programming, there have been hundreds<br />

of attempts to build parallel programming systems. A limited but popular example<br />

is OpenMP. It is just an Application Programmer Interface (API) along with a set of<br />

compiler directives, environment variables, <strong>and</strong> runtime library routines that can<br />

extend st<strong>and</strong>ard programming languages. It offers a portable, scalable, <strong>and</strong> simple<br />

programming model for shared memory multiprocessors. Its primary goal is to<br />

parallelize loops <strong>and</strong> to perform reductions.<br />

Most C compilers already have support for OpenMP. The command to use the

OpenMP API with the UNIX C compiler is just:<br />

cc -fopenmp foo.c

OpenMP extends C using pragmas, which are just comm<strong>and</strong>s to the C macro<br />

preprocessor like #define <strong>and</strong> #include. To set the number of processors we<br />

want to use to be 64, as we wanted in the example above, we just use the comm<strong>and</strong><br />

#define P 64 /* define a constant that we’ll use a few times */<br />

#pragma omp parallel num_threads(P)<br />

That is, the runtime libraries should use 64 parallel threads.<br />

To turn the sequential for loop into a parallel for loop that divides the work<br />

equally between all the threads that we told it to use, we just write (assuming sum<br />

is initialized to 0)<br />

#pragma omp parallel for<br />

for (Pn = 0; Pn < P; Pn += 1)
   for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
      sum[Pn] += A[i]; /*sum the assigned areas*/


6.5 Multicore <strong>and</strong> Other Shared Memory Multiprocessors 523<br />

To perform the reduction, we can use another comm<strong>and</strong> that tells OpenMP<br />

what the reduction operator is <strong>and</strong> what variable you need to use to place the result<br />

of the reduction.<br />

#pragma omp parallel for reduction(+ : FinalSum)<br />

for (i = 0; i < P; i += 1)<br />

FinalSum += sum[i]; /* Reduce to a single number */<br />

Note that it is now up to the OpenMP library to find efficient code to sum 64<br />

numbers efficiently using 64 processors.<br />
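Putting the pieces above together, a compilable sketch might look like the following; the test data in A and the final printf are ours, and the program assumes it is built with cc -fopenmp as shown earlier.

#include <stdio.h>
#include <omp.h>

#define P 64                        /* number of threads, as in the example */

double A[64000];
double sum[P];

int main(void) {
    double FinalSum = 0.0;

    for (int i = 0; i < 64000; i += 1) A[i] = 1.0;   /* made-up input */

    #pragma omp parallel for num_threads(P)          /* parallel partial sums */
    for (int Pn = 0; Pn < P; Pn += 1)
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i += 1)
            sum[Pn] += A[i];        /* sum the assigned areas */

    #pragma omp parallel for reduction(+ : FinalSum) /* reduction */
    for (int i = 0; i < P; i += 1)
        FinalSum += sum[i];         /* reduce to a single number */

    printf("FinalSum = %g\n", FinalSum);             /* expect 64000 */
    return 0;
}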

While OpenMP makes it easy to write simple parallel code, it is not very helpful<br />

with debugging, so many parallel programmers use more sophisticated parallel<br />

programming systems than OpenMP, just as many programmers today use more<br />

productive languages than C.<br />

Given this tour of classic MIMD hardware <strong>and</strong> software, our next path is a more<br />

exotic tour of a type of MIMD architecture with a different heritage <strong>and</strong> thus a very<br />

different perspective on the parallel programming challenge.<br />

True or false: Shared memory multiprocessors cannot take advantage of task-level<br />

parallelism.<br />

Check<br />

Yourself<br />

Elaboration: Some writers repurposed the acronym SMP to mean symmetric<br />

multiprocessor, to indicate that the latency from processor to memory was about the<br />

same for all processors. This shift was done to contrast them from large-scale NUMA<br />

multiprocessors, as both classes used a single address space. As clusters proved much<br />

more popular than large-scale NUMA multiprocessors, in this book we restore SMP to<br />

its original meaning, and use it to contrast with systems that use multiple address spaces,

such as clusters.<br />

Elaboration: An alternative to sharing the physical address space would be to have<br />

separate physical address spaces but share a common virtual address space, leaving<br />

it up to the operating system to h<strong>and</strong>le communication. This approach has been tried,<br />

but it has too high an overhead to offer a practical shared memory abstraction to the<br />

performance-oriented programmer.


526 Chapter 6 Parallel Processors from Client to Cloud<br />

registers than do vector processors. Unlike most vector architectures, GPUs also<br />

rely on hardware multithreading within a single multi-threaded SIMD processor<br />

to hide memory latency (see Section 6.4).<br />

A multithreaded SIMD processor is similar to a Vector Processor, but the former<br />

has many parallel functional units instead of just a few that are deeply pipelined,<br />

as does the latter.<br />

As mentioned above, a GPU contains a collection of multithreaded SIMD<br />

processors; that is, a GPU is a MIMD composed of multithreaded SIMD processors.<br />

For example, NVIDIA has four implementations of the Fermi architecture at<br />

different price points with 7, 11, 14, or 15 multithreaded SIMD processors. To<br />

provide transparent scalability across models of GPUs with differing number of<br />

multithreaded SIMD processors, the Thread Block Scheduler hardware assigns<br />

blocks of threads to multithreaded SIMD processors. Figure 6.9 shows a simplified<br />

block diagram of a multithreaded SIMD processor.<br />

Dropping down one more level of detail, the machine object that the hardware<br />

creates, manages, schedules, <strong>and</strong> executes is a thread of SIMD instructions, which<br />

we will also call a SIMD thread. It is a traditional thread, but it contains exclusively<br />

SIMD instructions. These SIMD threads have their own program counters <strong>and</strong><br />

they run on a multithreaded SIMD processor. The SIMD Thread Scheduler includes<br />

a controller that lets it know which threads of SIMD instructions are ready to<br />

run, <strong>and</strong> then it sends them off to a dispatch unit to be run on the multithreaded<br />

[Figure 6.9 datapath: an instruction register feeds 16 SIMD Lanes (Thread Processors); each lane has its own registers (1K × 32) and a load/store unit, and the lanes connect through an address coalescing unit and an interconnection network to 64 KiB of Local Memory and to Global Memory.]

FIGURE 6.9 Simplified block diagram of the datapath of a multithreaded SIMD Processor.<br />

It has 16 SIMD lanes. The SIMD Thread Scheduler has many independent SIMD threads that it chooses from<br />

to run on this processor.


6.6 Introduction to Graphics Processing Units 527<br />

SIMD processor. It is identical to a hardware thread scheduler in a traditional<br />

multithreaded processor (see Section 6.4), except that it is scheduling threads of<br />

SIMD instructions. Thus, GPU hardware has two levels of hardware schedulers:<br />

1. The Thread Block Scheduler that assigns blocks of threads to multithreaded<br />

SIMD processors, <strong>and</strong><br />

2. the SIMD Thread Scheduler within a SIMD processor, which schedules<br />

when SIMD threads should run.<br />

The SIMD instructions of these threads are 32 wide, so each thread of SIMD<br />

instructions would compute 32 of the elements of the computation. Since the<br />

thread consists of SIMD instructions, the SIMD processor must have parallel<br />

functional units to perform the operation. We call them SIMD Lanes, <strong>and</strong> they are<br />

quite similar to the Vector Lanes in Section 6.3.<br />

Elaboration: The number of lanes per SIMD processor varies across GPU generations.<br />

With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 SIMD Lanes,<br />

so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to<br />

complete. Each thread of SIMD instructions is executed in lock step. Staying with the<br />

analogy of a SIMD processor as a vector processor, you could say that it has 16 lanes,<br />

<strong>and</strong> the vector length would be 32. This wide but shallow nature is why we use the term<br />

SIMD processor instead of vector processor, as it is more intuitive.<br />

Since by definition the threads of SIMD instructions are independent, the SIMD

Thread Scheduler can pick whatever thread of SIMD instructions is ready, <strong>and</strong> need not<br />

stick with the next SIMD instruction in the sequence within a single thread. Thus, using<br />

the terminology of Section 6.4, it uses fine-grained multithreading.<br />

To hold these memory elements, a Fermi SIMD processor has an impressive 32,768<br />

32-bit registers. Just like a vector processor, these registers are divided logically across<br />

the vector lanes or, in this case, SIMD Lanes. Each SIMD Thread is limited to no more than<br />

64 registers, so you might think of a SIMD Thread as having up to 64 vector registers,<br />

with each vector register having 32 elements <strong>and</strong> each element being 32 bits wide.<br />

Since Fermi has 16 SIMD Lanes, each contains 2048 registers. Each CUDA Thread<br />

gets one element of each of the vector registers. Note that a CUDA thread is just a<br />

vertical cut of a thread of SIMD instructions, corresponding to one element executed by<br />

one SIMD Lane. Beware that CUDA Threads are very different from POSIX threads; you<br />

can't make arbitrary system calls or synchronize arbitrarily in a CUDA Thread.

NVIDIA GPU Memory Structures<br />

Figure 6.10 shows the memory structures of an NVIDIA GPU. We call the on-chip

memory that is local to each multithreaded SIMD processor Local Memory.<br />

It is shared by the SIMD Lanes within a multithreaded SIMD processor, but this<br />

memory is not shared between multithreaded SIMD processors. We call the off-chip

DRAM shared by the whole GPU <strong>and</strong> all thread blocks GPU Memory.<br />

Rather than rely on large caches to contain the whole working sets of an<br />

application, GPUs traditionally use smaller streaming caches <strong>and</strong> rely on extensive<br />

multithreading of threads of SIMD instructions to hide the long latency to DRAM,


528 Chapter 6 Parallel Processors from Client to Cloud<br />

[Figure 6.10: each CUDA Thread has Per-CUDA Thread Private Memory; each Thread Block has Per-Block Local Memory; a sequence of grids (Grid 0, Grid 1, ...) with Inter-Grid Synchronization shares GPU Memory.]

FIGURE 6.10 GPU Memory structures. GPU Memory is shared by the vectorized loops. All threads<br />

of SIMD instructions within a thread block share Local Memory.<br />

since their working sets can be hundreds of megabytes. Thus, they will not fit<br />

in the last level cache of a multicore microprocessor. Given the use of hardware<br />

multithreading to hide DRAM latency, the chip area used for caches in system<br />

processors is spent instead on computing resources <strong>and</strong> on the large number of<br />

registers to hold the state of the many threads of SIMD instructions.<br />

Elaboration: While hiding memory latency is the underlying philosophy, note that the<br />

latest GPUs <strong>and</strong> vector processors have added caches. For example, the recent Fermi<br />

architecture has added caches, but they are thought of as either bandwidth filters to

reduce dem<strong>and</strong>s on GPU Memory or as accelerators for the few variables whose latency<br />

cannot be hidden by multithreading. Local memory for stack frames, function calls,<br />

<strong>and</strong> register spilling is a good match to caches, since latency matters when calling a<br />

function. Caches can also save energy, since on-chip cache accesses take much less<br />

energy than accesses to multiple, external DRAM chips.


530 Chapter 6 Parallel Processors from Client to Cloud<br />

Type: Program abstractions

Vectorizable Loop (closest old term outside of GPUs: Vectorizable Loop; official CUDA/NVIDIA GPU term: Grid). Book definition: A vectorizable loop, executed on the GPU, made up of one or more Thread Blocks (bodies of vectorized loop) that can execute in parallel.

Body of Vectorized Loop (closest old term: Body of a (Strip-Mined) Vectorized Loop; CUDA/NVIDIA term: Thread Block). Book definition: A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. They can communicate via Local Memory.

Sequence of SIMD Lane Operations (closest old term: One iteration of a Scalar Loop; CUDA/NVIDIA term: CUDA Thread). Book definition: A vertical cut of a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask and predicate register.

Type: Machine object

A Thread of SIMD Instructions (closest old term: Thread of Vector Instructions; CUDA/NVIDIA term: Warp). Book definition: A traditional thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results stored depending on a per-element mask.

SIMD Instruction (closest old term: Vector Instruction; CUDA/NVIDIA term: PTX Instruction). Book definition: A single SIMD instruction executed across SIMD Lanes.

Type: Processing hardware

Multithreaded SIMD Processor (closest old term: (Multithreaded) Vector Processor; CUDA/NVIDIA term: Streaming Multiprocessor). Book definition: A multithreaded SIMD Processor executes threads of SIMD instructions, independent of other SIMD Processors.

Thread Block Scheduler (closest old term: Scalar Processor; CUDA/NVIDIA term: Giga Thread Engine). Book definition: Assigns multiple Thread Blocks (bodies of vectorized loop) to multithreaded SIMD Processors.

SIMD Thread Scheduler (closest old term: Thread scheduler in a Multithreaded CPU; CUDA/NVIDIA term: Warp Scheduler). Book definition: Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute; includes a scoreboard to track SIMD Thread execution.

SIMD Lane (closest old term: Vector lane; CUDA/NVIDIA term: Thread Processor). Book definition: A SIMD Lane executes the operations in a thread of SIMD instructions on a single element. Results stored depending on mask.

Type: Memory hardware

GPU Memory (closest old term: Main Memory; CUDA/NVIDIA term: Global Memory). Book definition: DRAM memory accessible by all multithreaded SIMD Processors in a GPU.

Local Memory (closest old term: Local Memory; CUDA/NVIDIA term: Shared Memory). Book definition: Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors.

SIMD Lane Registers (closest old term: Vector Lane Registers; CUDA/NVIDIA term: Thread Processor Registers). Book definition: Registers in a single SIMD Lane allocated across a full thread block (body of vectorized loop).

FIGURE 6.12 Quick guide to GPU terms. We use the first column for hardware terms. Four groups<br />

cluster these 12 terms. From top to bottom: Program Abstractions, Machine Objects, Processing Hardware,<br />

<strong>and</strong> Memory Hardware.<br />

make more sense when architects ask, given the hardware invested to do graphics<br />

well, how can we supplement it to improve the performance of a wider range of<br />

applications?<br />

Having covered two different styles of MIMD that have a shared address<br />

space, we next introduce parallel processors where each processor has its<br />

own private address space, which makes it much easier to build much larger<br />

systems. The Internet services that you use every day depend on these large scale<br />

systems.


6.7 Clusters, Warehouse Scale <strong>Computer</strong>s, <strong>and</strong> Other Message-Passing Multiprocessors 533<br />

Given that clusters are constructed from whole computers <strong>and</strong> independent,<br />

scalable networks, this isolation also makes it easier to exp<strong>and</strong> the system without<br />

bringing down the application that runs on top of the cluster.<br />

Their lower cost, higher availability, <strong>and</strong> rapid, incremental exp<strong>and</strong>ability make<br />

clusters attractive to Internet service providers, despite their poorer communication

performance when compared to large-scale shared memory multiprocessors. The<br />

search engines that hundreds of millions of us use every day depend upon this<br />

technology. Amazon, Facebook, Google, Microsoft, <strong>and</strong> others all have multiple<br />

datacenters each with clusters of tens of thous<strong>and</strong>s of servers. Clearly, the use of<br />

multiple processors in Internet service companies has been hugely successful.<br />

Warehouse-Scale <strong>Computer</strong>s<br />

Internet services, such as those described above, necessitated the construction<br />

of new buildings to house, power, <strong>and</strong> cool 100,000 servers. Although they may<br />

be classified as just large clusters, their architecture <strong>and</strong> operation are more<br />

sophisticated. They act as one giant computer <strong>and</strong> cost on the order of $150M<br />

for the building, the electrical <strong>and</strong> cooling infrastructure, the servers, <strong>and</strong> the<br />

networking equipment that connects <strong>and</strong> houses 50,000 to 100,000 servers. We<br />

consider them a new class of computer, called Warehouse-Scale <strong>Computer</strong>s (WSC).<br />

Anyone can build a fast<br />

CPU. The trick is to build a<br />

fast system.<br />

Seymour Cray, considered<br />

the father of the<br />

supercomputer.<br />

The most popular framework for batch processing in a WSC is MapReduce [Dean,<br />

2008] <strong>and</strong> its open-source twin Hadoop. Inspired by the Lisp functions of the same<br />

name, Map first applies a programmer-supplied function to each logical input<br />

record. Map runs on thousands of servers to produce an intermediate result of key-value

pairs. Reduce collects the output of those distributed tasks <strong>and</strong> collapses them<br />

using another programmer-defined function. With appropriate software support,<br />

both are highly parallel yet easy to underst<strong>and</strong> <strong>and</strong> to use. Within 30 minutes, a<br />

novice programmer can run a MapReduce task on thous<strong>and</strong>s of servers.<br />

For example, one MapReduce program calculates the number of occurrences of<br />

every English word in a large collection of documents. Below is a simplified version<br />

of that program, which shows just the inner loop <strong>and</strong> assumes just one occurrence<br />

of all English words found in a document:<br />

Hardware/<br />

Software<br />

Interface<br />

map(String key, String value):
   // key: document name
   // value: document contents
   for each word w in value:
      EmitIntermediate(w, “1”); // Produce list of all words

reduce(String key, Iterator values):
   // key: a word
   // values: a list of counts
   int result = 0;
   for each v in values:
      result += ParseInt(v); // get integer from key-value pair
   Emit(AsString(result));


534 Chapter 6 Parallel Processors from Client to Cloud<br />

The function EmitIntermediate used in the Map function emits each<br />

word in the document <strong>and</strong> the value one. Then the Reduce function sums all the<br />

values per word for each document using ParseInt() to get the number of<br />

occurrences per word in all documents. The MapReduce runtime environment<br />

schedules map tasks <strong>and</strong> reduce tasks to the servers of a WSC.<br />
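To make the data flow concrete, here is a single-machine C sketch of the same word count: map emits (word, 1) pairs, a sort plays the role of the shuffle, and reduce sums each group of equal keys. All names and the tiny test documents are ours; a real MapReduce run distributes these steps across thousands of servers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024

struct pair { char word[32]; int count; };
static struct pair pairs[MAX_PAIRS];
static int npairs = 0;

static void emit_intermediate(const char *w) {        /* EmitIntermediate(w, "1") */
    if (npairs >= MAX_PAIRS) return;                   /* drop overflow in this toy */
    strncpy(pairs[npairs].word, w, sizeof pairs[npairs].word - 1);
    pairs[npairs].word[sizeof pairs[npairs].word - 1] = '\0';
    pairs[npairs].count = 1;
    npairs += 1;
}

static void map(const char *value) {                   /* value: document contents */
    char buf[256], *w;
    strncpy(buf, value, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (w = strtok(buf, " "); w != NULL; w = strtok(NULL, " "))
        emit_intermediate(w);
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const struct pair *)a)->word, ((const struct pair *)b)->word);
}

int main(void) {
    map("to be or not to be");                         /* made-up documents */
    map("to see or not to see");

    qsort(pairs, npairs, sizeof pairs[0], cmp);        /* shuffle/sort by key */

    for (int i = 0; i < npairs; ) {                    /* reduce: sum each group */
        int result = 0, j = i;
        while (j < npairs && strcmp(pairs[j].word, pairs[i].word) == 0) {
            result += pairs[j].count;
            j += 1;
        }
        printf("%s %d\n", pairs[i].word, result);      /* word and its count */
        i = j;
    }
    return 0;
}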

software as a service (SaaS) Rather than selling software that is installed and run on customers’ own computers, software is run at a remote site and made available over the Internet typically via a Web interface to customers. SaaS customers are charged based on use versus on ownership.

At this extreme scale, which requires innovation in power distribution, cooling,<br />

monitoring, <strong>and</strong> operations, the WSC is a modern descendant of the 1970s<br />

supercomputers—making Seymour Cray the godfather of today’s WSC architects.<br />

His extreme computers h<strong>and</strong>led computations that could be done nowhere else, but<br />

were so expensive that only a few companies could afford them. This time the target<br />

is providing information technology for the world instead of high performance<br />

computing for scientists <strong>and</strong> engineers. Hence, WSCs surely play a more important<br />

societal role today than Cray’s supercomputers did in the past.<br />

While they share some common goals with servers, WSCs have three major<br />

distinctions:<br />

1. Ample, easy parallelism: A concern for a server architect is whether the<br />

applications in the targeted marketplace have enough parallelism to justify<br />

the amount of parallel hardware <strong>and</strong> whether the cost is too high for sufficient<br />

communication hardware to exploit this parallelism. A WSC architect has<br />

no such concern. First, batch applications like MapReduce benefit from the<br />

large number of independent data sets that need independent processing,<br />

such as billions of Web pages from a Web crawl. Second, interactive Internet<br />

service applications, also known as Software as a Service (SaaS), can benefit<br />

from millions of independent users of interactive Internet services. Reads<br />

<strong>and</strong> writes are rarely dependent in SaaS, so SaaS rarely needs to synchronize.<br />

For example, search uses a read-only index <strong>and</strong> email is normally reading<br />

<strong>and</strong> writing independent information. We call this type of easy parallelism<br />

Request-Level Parallelism, as many independent efforts can proceed in<br />

parallel naturally with little need for communication or synchronization.<br />

2. Operational Costs Count: Traditionally, server architects design their systems<br />

for peak performance within a cost budget <strong>and</strong> worry about energy only to<br />

make sure they don’t exceed the cooling capacity of their enclosure. They<br />

usually ignored operational costs of a server, assuming that they pale in<br />

comparison to purchase costs. WSCs have longer lifetimes (the building and electrical and cooling infrastructure are often amortized over 10 or more years), so the operational costs add up: energy, power distribution, and

cooling represent more than 30% of the costs of a WSC over 10 years.<br />

3. Scale <strong>and</strong> the Opportunities/Problems Associated with Scale: To construct a<br />

single WSC, you must purchase 100,000 servers along with the supporting<br />

infrastructure, which means volume discounts. Hence, WSCs are so massive



internally that you get economy of scale even if there are not many WSCs.<br />

These economies of scale led to cloud computing, as the lower per unit costs<br />

of a WSC meant that cloud companies could rent servers at a profitable rate<br />

<strong>and</strong> still be below what it costs outsiders to do it themselves. The flip side<br />

of the economic opportunity of scale is the need to cope with the failure<br />

frequency of scale. Even if a server had a Mean Time To Failure of an amazing<br />

25 years (200,000 hours), the WSC architect would need to design for 5<br />

server failures every day. Section 5.15 mentioned annualized disk failure rate<br />

(AFR) was measured at Google at 2% to 4%. If there were 4 disks per server<br />

<strong>and</strong> their annual failure rate was 2%, the WSC architect should expect to see<br />

one disk fail every hour. Thus, fault tolerance is even more important for the<br />

WSC architect than the server architect.<br />
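As a back-of-the-envelope check of the disk numbers in the item above (an estimate, not a measured result):

   100,000 servers × 4 disks/server = 400,000 disks
   400,000 disks × 2% annual failure rate = 8000 disk failures per year
   8000 failures / 8760 hours per year ≈ 0.9 failures per hour,

or roughly one failed disk somewhere in the WSC every hour.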

The economies of scale uncovered by WSC have realized the long dreamed of<br />

goal of computing as a utility. Cloud computing means anyone anywhere with good<br />

ideas, a business model, <strong>and</strong> a credit card can tap thous<strong>and</strong>s of servers to deliver<br />

their vision almost instantly around the world. Of course, there are important<br />

obstacles that could limit the growth of cloud computing—such as security,<br />

privacy, st<strong>and</strong>ards, <strong>and</strong> the rate of growth of Internet b<strong>and</strong>width—but we foresee<br />

them being addressed so that WSCs <strong>and</strong> cloud computing can flourish.<br />

To put the growth rate of cloud computing into perspective, in 2012 Amazon<br />

Web Services announced that it adds enough new server capacity every day to<br />

support all of Amazon’s global infrastructure as of 2003, when Amazon was a<br />

$5.2Bn annual revenue enterprise with 6000 employees.<br />

Now that we underst<strong>and</strong> the importance of message-passing multiprocessors,<br />

especially for cloud computing, we next cover ways to connect the nodes of a WSC<br />

together. Thanks to Moore’s Law <strong>and</strong> the increasing number of cores per chip, we<br />

now need networks inside a chip as well, so these topologies are important in the<br />

small as well as in the large.<br />

Elaboration: The MapReduce framework shuffles and sorts the key-value pairs at the

end of the Map phase to produce groups that all share the same key. These groups are<br />

then passed to the Reduce phase.<br />

Elaboration: Another form of large scale computing is grid computing, where the<br />

computers are spread across large areas, <strong>and</strong> then the programs that run across them<br />

must communicate via long haul networks. The most popular <strong>and</strong> unique form of grid<br />

computing was pioneered by the SETI@home project. As millions of PCs are idle at<br />

any one time doing nothing useful, they could be harvested <strong>and</strong> put to good uses if<br />

someone developed software that could run on those computers <strong>and</strong> then gave each PC<br />

an independent piece of the problem to work on. The first example was the Search for

ExtraTerrestrial Intelligence (SETI), which was launched at UC Berkeley in 1999. Over 5<br />

million computer users in more than 200 countries have signed up for SETI@home, with<br />

more than 50% outside the US. By the end of 2011, the average performance of the<br />

SETI@home grid was 3.5 PetaFLOPS.


6.8 Introduction to Multiprocessor Network Topologies

Because there are numerous topologies to choose from, performance metrics<br />

are needed to distinguish these designs. Two are popular. The first is total network<br />

b<strong>and</strong>width, which is the b<strong>and</strong>width of each link multiplied by the number of links.<br />

This represents the peak b<strong>and</strong>width. For the ring network above, with P processors,<br />

the total network b<strong>and</strong>width would be P times the b<strong>and</strong>width of one link; the total<br />

network b<strong>and</strong>width of a bus is just the b<strong>and</strong>width of that bus.<br />

To balance this best b<strong>and</strong>width case, we include another metric that is closer to<br />

the worst case: the bisection b<strong>and</strong>width. This metric is calculated by dividing the<br />

machine into two halves. Then you sum the b<strong>and</strong>width of the links that cross that<br />

imaginary dividing line. The bisection b<strong>and</strong>width of a ring is two times the link<br />

b<strong>and</strong>width. It is one times the link b<strong>and</strong>width for the bus. If a single link is as fast<br />

as the bus, the ring is only twice as fast as a bus in the worst case, but it is P times<br />

faster in the best case.<br />

Since some network topologies are not symmetric, the question arises<br />

of where to draw the imaginary line when bisecting the machine. Bisection<br />

b<strong>and</strong>width is a worst-case metric, so the answer is to choose the division that<br />

yields the most pessimistic network performance. Stated alternatively, calculate<br />

all possible bisection b<strong>and</strong>widths <strong>and</strong> pick the smallest. We take this pessimistic<br />

view because parallel programs are often limited by the weakest link in the<br />

communication chain.<br />

At the other extreme from a ring is a fully connected network, where every<br />

processor has a bidirectional link to every other processor. For fully connected<br />

networks, the total network bandwidth is P × (P – 1)/2, and the bisection bandwidth is (P/2)².
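The small C program below tabulates the two metrics for a ring, a bus, and a fully connected network, using the formulas just given. It is only an illustration; bandwidths are expressed in units of "links," so multiply by the bandwidth of one link to get bytes per second.

#include <stdio.h>

int main(void) {
    int P = 64;                                 /* number of processors            */
    int ring_total = P;                         /* a ring has P links              */
    int ring_bisection = 2;                     /* cutting a ring crosses 2 links  */
    int bus_total = 1, bus_bisection = 1;       /* one shared bus                  */
    int full_total = P * (P - 1) / 2;           /* one link per pair of nodes      */
    int full_bisection = (P / 2) * (P / 2);     /* (P/2)^2 links cross the cut     */

    printf("P = %d\n", P);
    printf("ring:            total = %5d  bisection = %5d\n", ring_total, ring_bisection);
    printf("bus:             total = %5d  bisection = %5d\n", bus_total, bus_bisection);
    printf("fully connected: total = %5d  bisection = %5d\n", full_total, full_bisection);
    return 0;
}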

The tremendous improvement in performance of fully connected networks is<br />

offset by the tremendous increase in cost. This consequence inspires engineers<br />

to invent new topologies that are between the cost of rings <strong>and</strong> the performance<br />

of fully connected networks. The evaluation of success depends in large part on<br />

the nature of the communication in the workload of parallel programs run on the<br />

computer.<br />

The number of different topologies that have been discussed in publications<br />

would be difficult to count, but only a few have been used in commercial parallel<br />

processors. Figure 6.14 illustrates two of the popular topologies.<br />

An alternative to placing a processor at every node in a network is to leave only<br />

the switch at some of these nodes. The switches are smaller than processor-memory-switch

nodes, <strong>and</strong> thus may be packed more densely, thereby lessening distance <strong>and</strong><br />

increasing performance. Such networks are frequently called multistage networks<br />

to reflect the multiple steps that a message may travel. Types of multistage networks<br />

are as numerous as single-stage networks; Figure 6.15 illustrates two of the popular<br />

multistage organizations. A fully connected or crossbar network allows any<br />

node to communicate with any other node in one pass through the network. An<br />

Omega network uses less hardware than the crossbar network (2n log₂ n versus n² switches), but contention can occur between messages, depending on the pattern

network bandwidth Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.

bisection bandwidth The bandwidth between two equal parts of a multiprocessor. This measure is for a worst case split of the multiprocessor.

fully connected network A network that connects processor-memory nodes by supplying a dedicated communication link between every node.

multistage network A network that supplies a small switch at each node.

crossbar network A network that allows any node to communicate with any other node in one pass through the network.



After covering the performance of networks at a low level of detail in this online

section, the next section shows how to benchmark multiprocessors of all kinds<br />

with much higher-level programs.<br />

6.10 Multiprocessor Benchmarks and Performance Models

As we saw in Chapter 1, benchmarking systems is always a sensitive topic, because<br />

it is a highly visible way to try to determine which system is better. The results affect<br />

not only the sales of commercial systems, but also the reputation of the designers<br />

of those systems. Hence, all participants want to win the competition, but they also<br />

want to be sure that if someone else wins, they deserve to win because they have<br />

a genuinely better system. This desire leads to rules to ensure that the benchmark<br />

results are not simply engineering tricks for that benchmark, but are instead<br />

advances that improve performance of real applications.<br />

To avoid possible tricks, a typical rule is that you can't change the benchmark.

The source code <strong>and</strong> data sets are fixed, <strong>and</strong> there is a single proper answer. Any<br />

deviation from those rules makes the results invalid.<br />

Many multiprocessor benchmarks follow these traditions. A common exception<br />

is to be able to increase the size of the problem so that you can run the benchmark<br />

on systems with a widely different number of processors. That is, many benchmarks<br />

allow weak scaling rather than require strong scaling, even though you must take<br />

care when comparing results for programs running different problem sizes.<br />

Figure 6.16 gives a summary of several parallel benchmarks, also described below:<br />

■ Linpack is a collection of linear algebra routines, <strong>and</strong> the routines for<br />

performing Gaussian elimination constitute what is known as the Linpack<br />

benchmark. The DGEMM routine in the example on page 215 represents a<br />

small fraction of the source code of the Linpack benchmark, but it accounts<br />

for most of the execution time for the benchmark. It allows weak scaling,<br />

letting the user pick any size problem. Moreover, it allows the user to rewrite<br />

Linpack in almost any form <strong>and</strong> in any language, as long as it computes the<br />

proper result <strong>and</strong> performs the same number of floating point operations<br />

for a given problem size. Twice a year, the 500 computers with the fastest<br />

Linpack performance are published at www.top500.org. The first on this list<br />

is considered by the press to be the world's fastest computer.

■ SPECrate is a throughput metric based on the SPEC CPU benchmarks,<br />

such as SPEC CPU 2006 (see Chapter 1). Rather than report performance<br />

of the individual programs, SPECrate runs many copies of the program<br />

simultaneously. Thus, it measures task-level parallelism, as there is no



Pthreads A UNIX API for creating and manipulating threads. It is structured as a library.

■ The NAS (NASA Advanced Supercomputing) parallel benchmarks were<br />

another attempt from the 1990s to benchmark multiprocessors. Taken from<br />

computational fluid dynamics, they consist of five kernels. They allow weak<br />

scaling by defining a few data sets. Like Linpack, these benchmarks can be<br />

rewritten, but the rules require that the programming language can only be C<br />

or Fortran.<br />

■ The recent PARSEC (Princeton Application Repository for Shared Memory<br />

<strong>Computer</strong>s) benchmark suite consists of multithreaded programs that use<br />

Pthreads (POSIX threads) <strong>and</strong> OpenMP (Open MultiProcessing; see<br />

Section 6.5). They focus on emerging computational domains <strong>and</strong> consist of<br />

nine applications <strong>and</strong> three kernels. Eight rely on data parallelism, three rely<br />

on pipelined parallelism, <strong>and</strong> one on unstructured parallelism.<br />

■ On the cloud front, the goal of the Yahoo! Cloud Serving Benchmark (YCSB)<br />

is to compare performance of cloud data services. It offers a framework that<br />

makes it easy for a client to benchmark new data services, using Cass<strong>and</strong>ra<br />

<strong>and</strong> HBase as representative examples. [Cooper, 2010]<br />

The downside of such traditional restrictions to benchmarks is that innovation is<br />

chiefly limited to the architecture <strong>and</strong> compiler. Better data structures, algorithms,<br />

programming languages, <strong>and</strong> so on often cannot be used, since that would give a<br />

misleading result. The system could win because of, say, the algorithm, <strong>and</strong> not<br />

because of the hardware or the compiler.<br />

While these guidelines are underst<strong>and</strong>able when the foundations of computing<br />

are relatively stable, as they were in the 1990s and the first half of this decade,

they are undesirable during a programming revolution. For this revolution to<br />

succeed, we need to encourage innovation at all levels.<br />

Researchers at the University of California at Berkeley have advocated one<br />

approach. They identified 13 design patterns that they claim will be part of<br />

applications of the future. Frameworks or kernels implement these design<br />

patterns. Examples are sparse matrices, structured grids, finite-state machines,<br />

map reduce, <strong>and</strong> graph traversal. By keeping the definitions at a high level, they<br />

hope to encourage innovations at any level of the system. Thus, the system with the<br />

fastest sparse matrix solver is welcome to use any data structure, algorithm, <strong>and</strong><br />

programming language, in addition to novel architectures <strong>and</strong> compilers.<br />

Performance Models<br />

A topic related to benchmarks is performance models. As we have seen with the<br />

increasing architectural diversity in this chapter—multithreading, SIMD, GPUs—<br />

it would be especially helpful if we had a simple model that offered insights into the<br />

performance of different architectures. It need not be perfect, just insightful.<br />

The 3Cs for cache performance from Chapter 5 is an example performance<br />

model. It is not a perfect performance model, since it ignores potentially important



The Roofline Model<br />

This simple model ties floating-point performance, arithmetic intensity, <strong>and</strong> memory<br />

performance together in a two-dimensional graph [Williams, Waterman, <strong>and</strong><br />

Patterson 2009]. Peak floating-point performance can be found using the hardware<br />

specifications mentioned above. The working sets of the kernels we consider here<br />

do not fit in on-chip caches, so peak memory performance may be defined by the<br />

memory system behind the caches. One way to find the peak memory performance<br />

is the Stream benchmark. (See the Elaboration on page 381 in Chapter 5).<br />

Figure 6.18 shows the model, which is done once for a computer, not for each<br />

kernel. The vertical Y-axis is achievable floating-point performance from 0.5 to<br />

64.0 GFLOPs/second. The horizontal X-axis is arithmetic intensity, varying from<br />

1/8 FLOPs/DRAM byte accessed to 16 FLOPs/DRAM byte accessed. Note that the<br />

graph is a log-log scale.<br />

For a given kernel, we can find a point on the X-axis based on its arithmetic<br />

intensity. If we draw a vertical line through that point, the performance of the kernel<br />

on that computer must lie somewhere along that line. We can plot a horizontal line<br />

showing peak floating-point performance of the computer. Obviously, the actual<br />

floating-point performance can be no higher than the horizontal line, since that is<br />

a hardware limit.<br />

[Figure 6.18 plot: attainable GFLOPs/second (Y-axis, log scale from 0.5 to 64.0) versus arithmetic intensity in FLOPs/byte (X-axis, 1/8 to 16). The two roofs are peak memory bandwidth (Stream) and peak floating-point performance; Kernel 1 is memory-bandwidth limited and Kernel 2 is computation limited.]

FIGURE 6.18 Roofline Model [Williams, Waterman, <strong>and</strong> Patterson 2009]. This example has a<br />

peak floating-point performance of 16 GFLOPS/sec <strong>and</strong> a peak memory b<strong>and</strong>width of 16 GB/sec from the<br />

Stream benchmark. (Since Stream is actually four measurements, this line is the average of the four.) The<br />

dotted vertical line in color on the left represents Kernel 1, which has an arithmetic intensity of 0.5 FLOPs/<br />

byte. It is limited by memory b<strong>and</strong>width to no more than 8 GFLOPS/sec on this Opteron X2. The dotted<br />

vertical line to the right represents Kernel 2, which has an arithmetic intensity of 4 FLOPs/byte. It is limited<br />

only computationally to 16 GFLOPS/s. (This data is based on the AMD Opteron X2 (Revision F) using dual<br />

cores running at 2 GHz in a dual socket system.)



How could we plot the peak memory performance, which is measured in bytes/<br />

second? Since the X-axis is FLOPs/byte <strong>and</strong> the Y-axis FLOPs/second, bytes/second<br />

is just a diagonal line at a 45-degree angle in this figure. Hence, we can plot a third<br />

line that gives the maximum floating-point performance that the memory system<br />

of that computer can support for a given arithmetic intensity. We can express the<br />

limits as a formula to plot the line in the graph in Figure 6.18:<br />

Attainable GFLOPs/sec = Min (Peak Memory BW × Arithmetic Intensity, Peak<br />

Floating-Point Performance)<br />

The horizontal <strong>and</strong> diagonal lines give this simple model its name <strong>and</strong> indicate its<br />

value. The roofline sets an upper bound on performance of a kernel depending on<br />

its arithmetic intensity. Given a roofline of a computer, you can apply it repeatedly,<br />

since it doesn't vary by kernel.
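As a small illustration, the sketch below evaluates the Min() bound above in C for the two kernels of Figure 6.18, using that figure's example machine (16 GFLOP/s peak and 16 GB/s Stream bandwidth). The function name and harness are ours, not part of any benchmark.

#include <stdio.h>

/* Attainable GFLOP/s = Min(Peak Memory BW x Arithmetic Intensity,
                            Peak Floating-Point Performance)             */
static double roofline(double peak_gflops, double peak_bw_gbytes, double ai) {
    double memory_bound = peak_bw_gbytes * ai;      /* the slanted part of the roof */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double peak = 16.0, bw = 16.0;                  /* Figure 6.18 example machine  */
    double intensity[] = { 0.5, 4.0 };              /* Kernel 1 and Kernel 2        */
    for (int i = 0; i < 2; i++) {
        double ai = intensity[i];
        printf("intensity %.2f FLOPs/byte -> at most %.1f GFLOP/s (%s bound)\n",
               ai, roofline(peak, bw, ai),
               bw * ai < peak ? "memory-bandwidth" : "computation");
    }
    return 0;
}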

If we think of arithmetic intensity as a pole that hits the roof, either it hits<br />

the slanted part of the roof, which means performance is ultimately limited by<br />

memory b<strong>and</strong>width, or it hits the flat part of the roof, which means performance is<br />

computationally limited. In Figure 6.18, kernel 1 is an example of the former, <strong>and</strong><br />

kernel 2 is an example of the latter.<br />

Note that the ridge point, where the diagonal <strong>and</strong> horizontal roofs meet, offers<br />

an interesting insight into the computer. If it is far to the right, then only kernels<br />

with very high arithmetic intensity can achieve the maximum performance of<br />

that computer. If it is far to the left, then almost any kernel can potentially hit the<br />

maximum performance.<br />

Comparing Two Generations of Opterons<br />

The AMD Opteron X4 (Barcelona) with four cores is the successor to the Opteron<br />

X2 with two cores. To simplify board design, they use the same socket. Hence, they<br />

have the same DRAM channels <strong>and</strong> thus the same peak memory b<strong>and</strong>width. In<br />

addition to doubling the number of cores, the Opteron X4 also has twice the peak<br />

floating-point performance per core: Opteron X4 cores can issue two floating-point<br />

SSE2 instructions per clock cycle, while Opteron X2 cores issue at most one. As the<br />

two systems we are comparing have similar clock rates (2.2 GHz for the Opteron X2 versus 2.3 GHz for the Opteron X4), the Opteron X4 has about four times the peak

floating-point performance of the Opteron X2 with the same DRAM b<strong>and</strong>width.<br />

The Opteron X4 also has a 2MiB L3 cache, which is not found in the Opteron X2.<br />

In Figure 6.19 the roofline models for both systems are compared. As we would<br />

expect, the ridge point moves to the right, from 1 in the Opteron X2 to 5 in the<br />

Opteron X4. Hence, to see a performance gain in the next generation, kernels need<br />

an arithmetic intensity higher than 1, or their working sets must fit in the caches<br />

of the Opteron X4.<br />
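A rough way to see why the ridge point moves is to note that it sits where the two roofs meet, that is, at an arithmetic intensity of approximately

   ridge point ≈ peak floating-point performance ÷ peak memory bandwidth.

Using the Figure 6.18 numbers for the Opteron X2, 16 GFLOP/s ÷ 16 GB/s ≈ 1 FLOP/byte. Quadrupling the floating-point peak while keeping the same DRAM bandwidth pushes that estimate to roughly 4 FLOPs/byte, in line with the shift from 1 toward 5 described above; this is only an estimate, since the measured Stream bandwidths of the two systems differ slightly.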

The roofline model gives an upper bound to performance. Suppose your<br />

program is far below that bound. What optimizations should you perform, <strong>and</strong> in<br />

what order?



Elaboration: The ceilings are ordered so that lower ceilings are easier to optimize. Clearly, a programmer can optimize in any order, but following this sequence reduces the chances of wasting effort on an optimization that has no benefit due to other constraints. Like the 3Cs model, as long as the roofline model delivers on insights, a model can have assumptions that may prove optimistic. For example, roofline assumes the load is balanced between all processors.

Elaboration: An alternative to the Stream benchmark is to use the raw DRAM bandwidth as the roofline. While the raw bandwidth definitely is a hard upper bound, actual memory performance is often so far from that boundary that it's not that useful. That is, no program can go close to that bound. The downside to using Stream is that very careful programming may exceed the Stream results, so the memory roofline may not be as hard a limit as the computational roofline. We stick with Stream because few programmers will be able to deliver more memory bandwidth than Stream discovers.

Elaboration: Although the roofline model shown is for multicore processors, it clearly would work for a uniprocessor as well.

Check Yourself

True or false: The main drawback with conventional approaches to benchmarks<br />

for parallel computers is that the rules that ensure fairness also slow software<br />

innovation.<br />

6.11 Real Stuff: Benchmarking and Rooflines of the Intel Core i7 960 and the NVIDIA Tesla GPU

A group of Intel researchers published a paper [Lee et al., 2010] comparing a<br />

quad-core Intel Core i7 960 with multimedia SIMD extensions to the previous<br />

generation GPU, the NVIDIA Tesla GTX 280. Figure 6.22 lists the characteristics<br />

of the two systems. Both products were purchased in Fall 2009. The Core i7 is<br />

in Intels 45-nanometer semiconductor technology while the GPU is in TSMCs<br />

65-nanometer technology. Although it might have been fairer to have a comparison<br />

by a neutral party or by both interested parties, the purpose of this section is not to<br />

determine how much faster one product is than another, but to try to underst<strong>and</strong><br />

the relative value of features of these two contrasting architecture styles.<br />

The rooflines of the Core i7 960 <strong>and</strong> GTX 280 in Figure 6.23 illustrate the<br />

differences in the computers. Not only does the GTX 280 have much higher<br />

memory b<strong>and</strong>width <strong>and</strong> double-precision floating-point performance, but also its<br />

double-precision ridge point is considerably to the left. The double-precision ridge<br />

point is 0.6 for the GTX 280 versus 3.1 for the Core i7. As mentioned above, it is<br />

much easier to hit peak computational performance the further the ridge point of



                                                        Core i7-960   GTX 280      GTX 480       Ratio 280/i7   Ratio 480/i7
Number of processing elements (cores or SMs)            4             30           15            7.5            3.8
Clock frequency (GHz)                                   3.2           1.3          1.4           0.41           0.44
Die size                                                263           576          520           2.2            2.0
Technology                                              Intel 45 nm   TSMC 65 nm   TSMC 40 nm    1.6            1.0
Power (chip, not module)                                130           130          167           1.0            1.3
Transistors                                             700 M         1400 M       3030 M        2.0            4.4
Memory bandwidth (GBytes/sec)                           32            141          177           4.4            5.5
Single-precision SIMD width                             4             8            32            2.0            8.0
Double-precision SIMD width                             2             1            16            0.5            8.0
Peak single-precision scalar FLOPS (GFLOP/sec)          26            117          63            4.6            2.5
Peak single-precision SIMD FLOPS (GFLOP/sec)            102           311 to 933   515 or 1344   3.0–9.1        6.6–13.1
  (SP 1 add or multiply)                                N.A.          (311)        (515)         (3.0)          (6.6)
  (SP 1 instruction fused multiply-adds)                N.A.          (622)        (1344)        (6.1)          (13.1)
  (Rare SP dual issue fused multiply-add and multiply)  N.A.          (933)        N.A.          (9.1)          –
Peak double-precision SIMD FLOPS (GFLOP/sec)            51            78           515           1.5            10.1

FIGURE 6.22 Intel Core i7-960, NVIDIA GTX 280, <strong>and</strong> GTX 480 specifications. The rightmost columns show the ratios of the<br />

Tesla GTX 280 <strong>and</strong> the Fermi GTX 480 to Core i7. Although the case study is between the Tesla 280 <strong>and</strong> i7, we include the Fermi 480 to show<br />

its relationship to the Tesla 280 since it is described in this chapter. Note that these memory b<strong>and</strong>widths are higher than in Figure 6.23 because<br />

these are DRAM pin b<strong>and</strong>widths <strong>and</strong> those in Figure 6.23 are at the processors as measured by a benchmark program. (From Table 2 in Lee<br />

et al. [2010].)<br />

the roofline is to the left. For single-precision performance, the ridge point moves<br />

far to the right for both computers, so it's much harder to hit the roof of single-precision

performance. Note that the arithmetic intensity of the kernel is based on<br />

the bytes that go to main memory, not the bytes that go to cache memory. Thus,<br />

as mentioned above, caching can change the arithmetic intensity of a kernel on a<br />

particular computer, if most references really go to the cache. Note also that this<br />

b<strong>and</strong>width is for unit-stride accesses in both architectures. Real gather-scatter<br />

addresses can be slower on the GTX 280 <strong>and</strong> on the Core i7, as we shall see.<br />
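The same back-of-the-envelope estimate used above reproduces the double-precision ridge points quoted here from the Figure 6.23 peaks:

   GTX 280:      78 GFLOP/s ÷ 127 GB/s  ≈ 0.6 FLOPs/byte
   Core i7 960:  51.2 GFLOP/s ÷ 16.4 GB/s ≈ 3.1 FLOPs/byte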

The researchers selected the benchmark programs by analyzing the computational<br />

<strong>and</strong> memory characteristics of four recently proposed benchmark suites <strong>and</strong> then<br />

formulated the set of throughput computing kernels that capture these characteristics.<br />

Figure 6.24 shows the performance results, with larger numbers meaning faster. The<br />

Rooflines help explain the relative performance in this case study.<br />

Given that the raw performance specifications of the GTX 280 vary from 2.5 ×<br />

slower (clock rate) to 7.5 × faster (cores per chip) while the performance varies



[Figure 6.23 plots: four rooflines of attainable GFLOP/s versus arithmetic intensity (1/8 to 32). Top row, double precision: Core i7 960 (Nehalem) with Stream = 16.4 GB/s and a 51.2 GF/s ceiling; NVIDIA GTX 280 with Stream = 127 GB/s and a 78 GF/s peak. Bottom row, single precision: Core i7 960 with 102.4 GF/s single-precision and 51.2 GF/s double-precision ceilings; GTX 280 with 624 GF/s single-precision and 78 GF/s double-precision ceilings.]

FIGURE 6.23 Roofline model [Williams, Waterman, <strong>and</strong> Patterson 2009]. These rooflines show double-precision floating-point<br />

performance in the top row <strong>and</strong> single-precision performance in the bottom row. (The DP FP performance ceiling is also in the bottom row<br />

to give perspective.) The Core i7 960 on the left has a peak DP FP performance of 51.2 GFLOP/sec, a SP FP peak of 102.4 GFLOP/sec, <strong>and</strong> a<br />

peak memory b<strong>and</strong>width of 16.4 GBytes/sec. The NVIDIA GTX 280 has a DP FP peak of 78 GFLOP/sec, SP FP peak of 624 GFLOP/sec, <strong>and</strong><br />

127 GBytes/sec of memory b<strong>and</strong>width. The dashed vertical line on the left represents an arithmetic intensity of 0.5 FLOP/byte. It is limited by<br />

memory b<strong>and</strong>width to no more than 8 DP GFLOP/sec or 8 SP GFLOP/sec on the Core i7. The dashed vertical line to the right has an arithmetic<br />

intensity of 4 FLOP/byte. It is limited only computationally to 51.2 DP GFLOP/sec <strong>and</strong> 102.4 SP GFLOP/sec on the Core i7 <strong>and</strong> 78 DP GFLOP/<br />

sec and 512 SP GFLOP/sec on the GTX 280. To hit the highest computation rate on the Core i7 you need to use all 4 cores and SSE instructions

with an equal number of multiplies <strong>and</strong> adds. For the GTX 280, you need to use fused multiply-add instructions on all multithreaded SIMD<br />

processors.



Kernel   Units                  Core i7-960   GTX 280   GTX 280 / i7-960
SGEMM    GFLOP/sec              94            364       3.9
MC       Billion paths/sec      0.8           1.4       1.8
Conv     Million pixels/sec     1250          3500      2.8
FFT      GFLOP/sec              71.4          213       3.0
SAXPY    GBytes/sec             16.8          88.8      5.3
LBM      Million lookups/sec    85            426       5.0
Solv     Frames/sec             103           52        0.5
SpMV     GFLOP/sec              4.9           9.1       1.9
GJK      Frames/sec             67            1020      15.2
Sort     Million elements/sec   250           198       0.8
RC       Frames/sec             5             8.1       1.6
Search   Million queries/sec    50            90        1.8
Hist     Million pixels/sec     1517          2583      1.7
Bilat    Million pixels/sec     83            475       5.7

FIGURE 6.24 Raw <strong>and</strong> relative performance measured for the two platforms. In this study,<br />

SAXPY is just used as a measure of memory b<strong>and</strong>width, so the right unit is GBytes/sec <strong>and</strong> not GFLOP/sec.<br />

(Based on Table 3 in [Lee et al., 2010].)<br />

from 2.0 × slower (Solv) to 15.2 × faster (GJK), the Intel researchers decided to<br />

find the reasons for the differences:<br />

■ Memory b<strong>and</strong>width. The GPU has 4.4 × the memory b<strong>and</strong>width, which helps<br />

explain why LBM <strong>and</strong> SAXPY run 5.0 <strong>and</strong> 5.3 × faster; their working sets are<br />

hundreds of megabytes and hence don't fit into the Core i7 cache. (So as to

access memory intensively, they purposely did not use cache blocking as in<br />

Chapter 5.) Hence, the slope of the rooflines explains their performance. SpMV<br />

also has a large working set, but it only runs 1.9 × faster because the double-precision floating point of the GTX 280 is only 1.5 × as fast as that of the Core i7.

■ Compute b<strong>and</strong>width. Five of the remaining kernels are compute bound:<br />

SGEMM, Conv, FFT, MC, <strong>and</strong> Bilat. The GTX is faster by 3.9, 2.8, 3.0, 1.8, <strong>and</strong><br />

5.7 ×, respectively. The first three of these use single-precision floating-point<br />

arithmetic, <strong>and</strong> GTX 280 single precision is 3 to 6 × faster. MC uses double<br />

precision, which explains why it's only 1.8 × faster since DP performance

is only 1.5 × faster. Bilat uses transcendental functions, which the GTX<br />

280 supports directly. The Core i7 spends two-thirds of its time calculating<br />

transcendental functions for Bilat, so the GTX 280 is 5.7 × faster. This<br />

observation helps point out the value of hardware support for operations that<br />

occur in your workload: double-precision floating point <strong>and</strong> perhaps even<br />

transcendentals.



■ Cache benefits. Ray casting (RC) is only 1.6 × faster on the GTX because<br />

cache blocking with the Core i7 caches prevents it from becoming memory<br />

b<strong>and</strong>width bound (see Sections 5.4 <strong>and</strong> 5.14), as it is on GPUs. Cache<br />

blocking can help Search, too. If the index trees are small so that they fit in<br />

the cache, the Core i7 is twice as fast. Larger index trees make them memory<br />

b<strong>and</strong>width bound. Overall, the GTX 280 runs search 1.8 × faster. Cache<br />

blocking also helps Sort. While most programmers wouldn't run Sort on

a SIMD processor, it can be written with a 1-bit Sort primitive called split.<br />

However, the split algorithm executes many more instructions than a scalar<br />

sort does. As a result, the Core i7 runs 1.25 × as fast as the GTX 280. Note<br />

that caches also help other kernels on the Core i7, since cache blocking allows<br />

SGEMM, FFT, <strong>and</strong> SpMV to become compute bound. This observation reemphasizes<br />

the importance of cache blocking optimizations in Chapter 5.<br />

■ Gather-Scatter. The multimedia SIMD extensions are of little help if the data are<br />

scattered throughout main memory; optimal performance comes only when<br />

accesses are to data aligned on 16-byte boundaries. Thus, GJK gets little benefit

from SIMD on the Core i7. As mentioned above, GPUs offer gather-scatter<br />

addressing that is found in a vector architecture but omitted from most SIMD<br />

extensions. The memory controller even batches accesses to the same DRAM<br />

page together (see Section 5.2). This combination means the GTX 280 runs GJK<br />

a startling 15.2 × as fast as the Core i7, which is larger than any single physical<br />

parameter in Figure 6.22. This observation reinforces the importance of gather-scatter

to vector <strong>and</strong> GPU architectures that is missing from SIMD extensions.<br />

■ Synchronization. The performance of synchronization is limited by atomic<br />

updates, which are responsible for 28% of the total runtime on the Core i7<br />

despite its having a hardware fetch-<strong>and</strong>-increment instruction. Thus, Hist is only<br />

1.7 × faster on the GTX 280. Solv solves a batch of independent constraints in<br />

a small amount of computation followed by barrier synchronization. The Core<br />

i7 benefits from the atomic instructions <strong>and</strong> a memory consistency model that<br />

ensures the right results even if not all previous accesses to memory hierarchy<br />

have completed. Without the memory consistency model, the GTX 280<br />

version launches some batches from the system processor, which leads to the<br />

GTX 280 running 0.5 × as fast as the Core i7. This observation points out how<br />

synchronization performance can be important for some data parallel problems.<br />

It is striking how often weaknesses in the Tesla GTX 280 that were uncovered by<br />

kernels selected by Intel researchers were already being addressed in the successor<br />

architecture to Tesla: Fermi has faster double-precision floating-point performance,<br />

faster atomic operations, <strong>and</strong> caches. It was also interesting that the gather-scatter<br />

support of vector architectures that predate the SIMD instructions by decades was<br />

so important to the effective usefulness of these SIMD extensions, which some had<br />

predicted before the comparison. The Intel researchers noted that 6 of the 14 kernels<br />

would exploit SIMD better with more efficient gather-scatter support on the Core<br />

i7. This study certainly establishes the importance of cache blocking as well.



Now that we have seen a wide range of results of benchmarking different

multiprocessors, let’s return to our DGEMM example to see in detail how much we<br />

have to change the C code to exploit multiple processors.<br />

6.12 Going Faster: Multiple Processors and Matrix Multiply

This section is the final <strong>and</strong> largest step in our incremental performance journey of<br />

adapting DGEMM to the underlying hardware of the Intel Core i7 (S<strong>and</strong>y Bridge).<br />

Each Core i7 has 8 cores, <strong>and</strong> the computer we have been using has 2 Core i7s.<br />

Thus, we have 16 cores on which to run DGEMM.<br />

Figure 6.25 shows the OpenMP version of DGEMM that utilizes those cores.<br />

Note that line 30 is the single line added to Figure 5.48 to make this code run on<br />

multiple processors: an OpenMP pragma that tells the compiler to use multiple<br />

threads in the outermost for loop. It tells the computer to spread the work of the<br />

outermost loop across all the threads.<br />

Figure 6.26 plots a classic multiprocessor speedup graph, showing the<br />

performance improvement versus a single thread as the number of threads increases.

This graph makes it easy to see the challenges of strong scaling versus weak scaling.<br />

When everything fits in the first level data cache, as is the case for 32 × 32 matrices,<br />

adding threads actually hurts performance. The 16-threaded version of DGEMM<br />

is almost half as fast as the single-threaded version in this case. In contrast, the two<br />

largest matrices get a 14 × speedup from 16 threads, <strong>and</strong> hence the classic two “up<br />

<strong>and</strong> to the right” lines in Figure 6.26.<br />

Figure 6.27 shows the absolute performance increase as we increase the number<br />

of threads from 1 to 16. DGEMM now operates at 174 GFLOPS for 960 × 960 matrices. As our unoptimized C version of DGEMM in Figure 3.21 ran this code at just 0.8 GFLOPS, the optimizations in Chapters 3 to 6 that tailor the code to the

underlying hardware result in a speedup of over 200 times!<br />

Next up are our warnings about the fallacies and pitfalls of multiprocessing. The

computer architecture graveyard is filled with parallel processing projects that have<br />

ignored them.<br />

Elaboration: These results are with Turbo mode turned off. We are using a dual-chip system, so not surprisingly we can get the full Turbo speedup (3.3/2.6 = 1.27) with either 1 thread (only 1 core on one of the chips) or 2 threads (1 core per chip). As we increase the number of threads and hence the number of active cores, the benefit of Turbo mode decreases, as there is less of the power budget to spend on the active cores. For 4 threads the average Turbo speedup is 1.23, for 8 it is 1.13, and for 16 it is 1.11.



1  #include <x86intrin.h>
2  #define UNROLL (4)
3  #define BLOCKSIZE 32
4  void do_block (int n, int si, int sj, int sk,
5                 double *A, double *B, double *C)
6  {
7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
9        __m256d c[4];
10       for ( int x = 0; x < UNROLL; x++ )
11         c[x] = _mm256_load_pd(C+i+x*4+j*n);
12         /* c[x] = C[i][j] */
13       for( int k = sk; k < sk+BLOCKSIZE; k++ )
14       {
15         __m256d b = _mm256_broadcast_sd(B+k+j*n);
16         /* b = B[k][j] */
17         for (int x = 0; x < UNROLL; x++)
18           c[x] = _mm256_add_pd(c[x], /* c[x]+=A[i][k]*b */
19                  _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20       }
21
22       for ( int x = 0; x < UNROLL; x++ )
23         _mm256_store_pd(C+i+x*4+j*n, c[x]);
24         /* C[i][j] = c[x] */
25     }
26 }
27
28 void dgemm (int n, double* A, double* B, double* C)
29 {
30 #pragma omp parallel for
31   for ( int sj = 0; sj < n; sj += BLOCKSIZE )
32     for ( int si = 0; si < n; si += BLOCKSIZE )
33       for ( int sk = 0; sk < n; sk += BLOCKSIZE )
34         do_block(n, si, sj, sk, A, B, C);
35 }

FIGURE 6.25 OpenMP version of DGEMM from Figure 5.48. Line 30 is the only OpenMP code, making<br />

the outermost for loop operate in parallel. This line is the only difference from Figure 5.48.<br />
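A minimal harness for trying Figure 6.25 yourself might look like the sketch below; the matrix size, allocation, and timing code are our assumptions, not from the book. The AVX loads in do_block require 32-byte-aligned data and a matrix dimension that is a multiple of BLOCKSIZE, and the file must be compiled with AVX and OpenMP enabled (for example, with GCC-style flags such as -mavx and -fopenmp).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

void dgemm(int n, double *A, double *B, double *C);   /* Figure 6.25 */

int main(void) {
    int n = 960;                                  /* must be a multiple of BLOCKSIZE (32) */
    size_t bytes = (size_t)n * n * sizeof(double);
    double *A = aligned_alloc(32, bytes);         /* 32-byte alignment for _mm256_load_pd */
    double *B = aligned_alloc(32, bytes);
    double *C = aligned_alloc(32, bytes);
    memset(A, 0, bytes); memset(B, 0, bytes); memset(C, 0, bytes);

    printf("OpenMP threads available: %d\n", omp_get_max_threads());
    double t0 = omp_get_wtime();
    dgemm(n, A, B, C);
    double t1 = omp_get_wtime();
    /* A dense n x n matrix multiply performs 2*n^3 floating-point operations. */
    printf("%.1f GFLOPS\n", 2.0 * n * n * n / (t1 - t0) / 1e9);

    free(A); free(B); free(C);
    return 0;
}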

Elaboration: Although the S<strong>and</strong>y Bridge supports two hardware threads per core, we<br />

do not get more performance from 32 threads. The reason is that a single AVX hardware unit

is shared between the two threads multiplexed onto one core, so assigning two threads<br />

per core actually hurts performance due to the multiplexing overhead.


6.13 Fallacies and Pitfalls

One frequently encountered problem occurs when software designed for a<br />

uniprocessor is adapted to a multiprocessor environment. For example, the Silicon<br />

Graphics operating system originally protected the page table with a single lock,<br />

assuming that page allocation is infrequent. In a uniprocessor, this does not<br />

represent a performance problem. In a multiprocessor, it can become a major<br />

performance bottleneck for some programs. Consider a program that uses a large<br />

number of pages that are initialized at start-up, which UNIX does for statically<br />

allocated pages. Suppose the program is parallelized so that multiple processes<br />

allocate the pages. Because page allocation requires the use of the page table, which<br />

is locked whenever it is in use, even an OS kernel that allows multiple threads in the<br />

OS will be serialized if the processes all try to allocate their pages at once (which is<br />

exactly what we might expect at initialization time!).<br />

This page table serialization eliminates parallelism in initialization <strong>and</strong> has<br />

significant impact on overall parallel performance. This performance bottleneck<br />

persists even for task-level parallelism. For example, suppose we split the parallel<br />

processing program apart into separate jobs <strong>and</strong> run them, one job per processor,<br />

so that there is no sharing between the jobs. (This is exactly what one user did,<br />

since he reasonably believed that the performance problem was due to unintended<br />

sharing or interference in his application.) Unfortunately, the lock still serializes all<br />

the jobs, so even the independent job performance is poor.

This pitfall indicates the kind of subtle but significant performance bugs<br />

that can arise when software runs on multiprocessors. Like many other key<br />

software components, the OS algorithms <strong>and</strong> data structures must be rethought<br />

in a multiprocessor context. Placing locks on smaller portions of the page table<br />

effectively eliminated the problem.<br />
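The flavor of that fix can be sketched with POSIX threads. The fragment below is illustrative only (hypothetical names, not the actual SGI kernel code): the single global mutex around the page table is replaced by an array of mutexes, each guarding one portion of the table, so allocations that touch different portions no longer serialize.

#include <pthread.h>

#define NREGIONS 64                       /* number of page-table portions */
static pthread_mutex_t region_lock[NREGIONS];

void page_locks_init(void) {
    for (int i = 0; i < NREGIONS; i++)
        pthread_mutex_init(&region_lock[i], NULL);
}

/* Hash a virtual page number to the lock guarding its portion of the table. */
static pthread_mutex_t *lock_for(unsigned long vpn) {
    return &region_lock[vpn % NREGIONS];
}

void allocate_page(unsigned long vpn) {
    pthread_mutex_lock(lock_for(vpn));    /* contends only with pages in the same portion */
    /* ... find a free frame and update this portion of the page table ... */
    pthread_mutex_unlock(lock_for(vpn));
}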

Fallacy: You can get good vector performance without providing memory<br />

b<strong>and</strong>width.<br />

As we saw with the Roofline model, memory b<strong>and</strong>width is quite important to<br />

all architectures. DAXPY requires 1.5 memory references per floating-point<br />

operation, <strong>and</strong> this ratio is typical of many scientific codes. Even if the floating-point<br />

operations took no time, a Cray-1 could not increase the DAXPY performance of<br />

the vector sequence used, since it was memory limited. The Cray-1 performance on<br />

Linpack jumped when the compiler used blocking to change the computation so<br />

that values could be kept in the vector registers. This approach lowered the number<br />

of memory references per FLOP <strong>and</strong> improved the performance by nearly a factor<br />

of two! Thus, the memory b<strong>and</strong>width on the Cray-1 became sufficient for a loop<br />

that formerly required more b<strong>and</strong>width, which is just what the Roofline model<br />

would predict.



■ In the past, microprocessors <strong>and</strong> multiprocessors were subject to<br />

different definitions of success. When scaling uniprocessor performance,<br />

microprocessor architects were happy if single thread performance went up<br />

by the square root of the increased silicon area. Thus, they were happy with<br />

sublinear performance in terms of resources. Multiprocessor success used<br />

to be defined as linear speed-up as a function of the number of processors,<br />

assuming that the cost of purchase or cost of administration of n processors<br />

was n times as much as one processor. Now that parallelism is happening on-chip

via multicore, we can use the traditional microprocessor metric of being<br />

successful with sublinear performance improvement.<br />

■ The success of just-in-time runtime compilation <strong>and</strong> autotuning makes it<br />

feasible to think of software adapting itself to take advantage of the increasing<br />

number of cores per chip, which provides flexibility that is not available when<br />

limited to static compilers.<br />

■ Unlike in the past, the open source movement has become a critical portion<br />

of the software industry. This movement is a meritocracy, where better<br />

engineering solutions can win the mind share of the developers over legacy<br />

concerns. It also embraces innovation, inviting change to old software <strong>and</strong><br />

welcoming new languages <strong>and</strong> software products. Such an open culture could<br />

be extremely helpful in this time of rapid change.<br />

To motivate readers to embrace this revolution, we demonstrated the potential<br />

of parallelism concretely for matrix multiply on the Intel Core i7 (S<strong>and</strong>y Bridge) in<br />

the Going Faster sections of Chapters 3 to 6:<br />

■ Data-level parallelism in Chapter 3 improved performance by a factor of 3.85<br />

by executing four 64-bit floating-point operations in parallel using the 256-<br />

bit oper<strong>and</strong>s of the AVX instructions, demonstrating the value of SIMD.<br />

■ Instruction-level parallelism in Chapter 4 pushed performance up by another<br />

factor of 2.3 by unrolling loops 4 times to give the out-of-order execution<br />

hardware more instructions to schedule.<br />

■ Cache optimizations in Chapter 5 improved performance of matrices that<br />

didn’t fit into the L1 data cache by another factor of 2.0 to 2.5 by using cache<br />

blocking to reduce cache misses.<br />

■ Thread-level parallelism in this chapter improved performance of matrices<br />

that don’t fit into a single L1 data cache by another factor of 4 to 14 by utilizing<br />

all 16 cores of our multicore chips, demonstrating the value of MIMD. We<br />

did this by adding a single line using an OpenMP pragma.<br />

Using the ideas in this book <strong>and</strong> tailoring the software to this computer added<br />

24 lines of code to DGEMM. For the matrix sizes of 32x32, 160x160, 480x480, <strong>and</strong><br />

960x960, the overall performance speedup from these ideas realized in those two dozen

lines of code is factors of 8, 39, 129, <strong>and</strong> 212!
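As a rough consistency check (the individual factors were measured separately, so they only approximately compose), for the 960x960 case:

   3.85 (SIMD) × 2.3 (loop unrolling) ≈ 8.9
   8.9 × about 2.5 (cache blocking) ≈ 22
   22 × about 9.6 (16 threads, within the 4-to-14 range above) ≈ 212,

which is consistent with the overall factor of 212 reported for the largest matrices.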



backpack <strong>and</strong> then carry them “in parallel”). For each of your activities, discuss if<br />

they are already working in parallel, but if not, why they are not.<br />

6.1.2 [5] Next, consider which of the activities could be carried out<br />

concurrently (e.g., eating breakfast <strong>and</strong> listening to the news). For each of your<br />

activities, describe which other activity could be paired with this activity.<br />

6.1.3 [5] For 6.1.2, what could we change about current systems (e.g.,<br />

showers, clothes, TVs, cars) so that we could perform more tasks in parallel?<br />

6.1.4 [5] Estimate how much shorter time it would take to carry out these<br />

activities if you tried to carry out as many tasks in parallel as possible.<br />

6.2 You are trying to bake 3 blueberry pound cakes. Cake ingredients are as<br />

follows:<br />

1 cup butter, softened<br />

1 cup sugar<br />

4 large eggs<br />

1 teaspoon vanilla extract<br />

1/2 teaspoon salt<br />

1/4 teaspoon nutmeg<br />

1 1/2 cups flour<br />

1 cup blueberries<br />

The recipe for a single cake is as follows:<br />

Step 1: Preheat oven to 325°F (160°C). Grease <strong>and</strong> flour your cake pan.<br />

Step 2: In large bowl, beat together with a mixer butter <strong>and</strong> sugar at medium<br />

speed until light <strong>and</strong> fluffy. Add eggs, vanilla, salt <strong>and</strong> nutmeg. Beat until<br />

thoroughly blended. Reduce mixer speed to low <strong>and</strong> add flour, 1/2 cup at a time,<br />

beating just until blended.<br />

Step 3: Gently fold in blueberries. Spread evenly in prepared baking pan. Bake<br />

for 60 minutes.<br />

6.2.1 [5] Your job is to cook 3 cakes as efficiently as possible. Assuming<br />

that you only have one oven large enough to hold one cake, one large bowl, one<br />

cake pan, <strong>and</strong> one mixer, come up with a schedule to make three cakes as quickly<br />

as possible. Identify the bottlenecks in completing this task.<br />

6.2.2 [5] Assume now that you have three bowls, 3 cake pans <strong>and</strong> 3 mixers.<br />

How much faster is the process now that you have additional resources?



6.2.3 [5] Assume now that you have two friends that will help you cook,<br />

<strong>and</strong> that you have a large oven that can accommodate all three cakes. How will this<br />

change the schedule you arrived at in Exercise 6.2.1 above?<br />

6.2.4 [5] Compare the cake-making task to computing 3 iterations<br />

of a loop on a parallel computer. Identify data-level parallelism <strong>and</strong> task-level<br />

parallelism in the cake-making loop.<br />

6.3 Many computer applications involve searching through a set of data <strong>and</strong><br />

sorting the data. A number of efficient searching <strong>and</strong> sorting algorithms have been<br />

devised in order to reduce the runtime of these tedious tasks. In this problem we<br />

will consider how best to parallelize these tasks.<br />

6.3.1 [10] Consider the following binary search algorithm (a classic divide<br />

<strong>and</strong> conquer algorithm) that searches for a value X in a sorted N-element array A<br />

<strong>and</strong> returns the index of matched entry:<br />

BinarySearch(A[0..N−1], X) {
   low = 0
   high = N − 1
   while (low <= high) {
      mid = (low + high) / 2
      if (A[mid] > X)
         high = mid − 1
      else if (A[mid] < X)
         low = mid + 1
      else
         return mid // found
   }
   return −1 // not found
}



continue until we have lists of size 1 in length. Then starting with sublists of length<br />

1, “merge” the two sublists into a single sorted list.<br />

Mergesort(m)
   var list left, right, result
   if length(m) ≤ 1
      return m
   else
      var middle = length(m) / 2
      for each x in m up to middle
         add x to left
      for each x in m after middle
         add x to right
      left = Mergesort(left)
      right = Mergesort(right)
      result = Merge(left, right)
      return result

The merge step is carried out by the following code:

Merge(left, right)
   var list result
   while length(left) > 0 and length(right) > 0
      if first(left) ≤ first(right)
         append first(left) to result
         left = rest(left)
      else
         append first(right) to result
         right = rest(right)
   if length(left) > 0
      append rest(left) to result
   if length(right) > 0
      append rest(right) to result
   return result

6.5.1 [10] Assume that you have Y cores on a multicore processor to run<br />

MergeSort. Assuming that Y is much smaller than length(m), express the speedup<br />

factor you might expect to obtain for values of Y <strong>and</strong> length(m). Plot these on a<br />

graph.<br />

6.5.2 [10] Next, assume that Y is equal to length(m). How would this affect your conclusions from your previous answer? If you were tasked with obtaining

the best speedup factor possible (i.e., strong scaling), explain how you might<br />

change this code to obtain it.



6.6 Matrix multiplication plays an important role in a number of applications.<br />

Two matrices can only be multiplied if the number of columns of the first matrix is<br />

equal to the number of rows in the second.<br />

Let’s assume we have an m × n matrix A <strong>and</strong> we want to multiply it by an n × p<br />

matrix B. We can express their product as an m × p matrix denoted by AB (or A ⋅ B).<br />

If we assign C = AB, and c_i,j denotes the entry in C at position (i, j), then c_i,j = a_i,1 b_1,j + a_i,2 b_2,j + … + a_i,n b_n,j for each element i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can parallelize the computation of C. Assume that matrices are laid out in memory sequentially as follows: a_1,1, a_2,1, a_3,1, a_4,1, …, etc.

6.6.1 [10] Assume that we are going to compute C on both a single core<br />

shared memory machine <strong>and</strong> a 4-core shared-memory machine. Compute the<br />

speedup we would expect to obtain on the 4-core machine, ignoring any memory<br />

issues.<br />

6.6.2 [10] Repeat Exercise 6.6.1, assuming that updates to C incur a cache miss due to false sharing when consecutive elements in a row (i.e., index i) are updated.

6.6.3 [10] How would you fix the false sharing issue that can occur?<br />

6.7 Consider the following portions of two different programs running at the<br />

same time on four processors in a symmetric multicore processor (SMP). Assume<br />

that before this code is run, both x <strong>and</strong> y are 0.<br />

Core 1: x = 2;<br />

Core 2: y = 2;<br />

Core 3: w = x + y + 1;<br />

Core 4: z = x + y;<br />

6.7.1 [10] What are all the possible resulting values of w, x, y, <strong>and</strong> z? For<br />

each possible outcome, explain how we might arrive at those values. You will need<br />

to examine all possible interleavings of instructions.<br />

6.7.2 [5] How could you make the execution more deterministic so that<br />

only one set of values is possible?<br />

6.8 The dining philosopher’s problem is a classic problem of synchronization <strong>and</strong><br />

concurrency. The general problem is stated as philosophers sitting at a round table<br />

doing one of two things: eating or thinking. When they are eating, they are not<br />

thinking, <strong>and</strong> when they are thinking, they are not eating. There is a bowl of pasta<br />

in the center. A fork is placed in between each philosopher. The result is that each<br />

philosopher has one fork to her left <strong>and</strong> one fork to her right. Given the nature of<br />

eating pasta, the philosopher needs two forks to eat, <strong>and</strong> can only use the forks on<br />

her immediate left <strong>and</strong> right. The philosophers do not speak to one another.
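As a concrete illustration of why this is hard, here is a deliberately naive pthreads sketch in C (the names and the value of N are illustrative): each philosopher takes the fork on her left and then the fork on her right. If all philosophers pick up their left forks at the same moment, none can ever acquire a right fork, and the program deadlocks.

#include <pthread.h>

#define N 5                                /* philosophers and forks */
static pthread_mutex_t fork_lock[N];

static void *philosopher(void *arg)
{
    long i = (long)arg;
    int left  = i;                         /* fork on her left  */
    int right = (i + 1) % N;               /* fork on her right */

    for (;;) {
        /* think ... */
        pthread_mutex_lock(&fork_lock[left]);    /* pick up left fork  */
        pthread_mutex_lock(&fork_lock[right]);   /* then right fork    */
        /* eat ... */
        pthread_mutex_unlock(&fork_lock[right]);
        pthread_mutex_unlock(&fork_lock[left]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&fork_lock[i], NULL);
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Breaking the symmetry (for example, having one philosopher pick up her right fork first, or acquiring forks in a fixed global order) removes the possibility of deadlock.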



Assume all instructions take a single cycle to execute unless noted otherwise or<br />

they encounter a hazard.<br />

6.9.1 [10] Assume that you have 1 SS CPU. How many cycles will it take to<br />

execute these two threads? How many issue slots are wasted due to hazards?<br />

6.9.2 [10] Now assume you have 2 SS CPUs. How many cycles will it take<br />

to execute these two threads? How many issue slots are wasted due to hazards?<br />

6.9.3 [10] Assume that you have 1 MT CPU. How many cycles will it take<br />

to execute these two threads? How many issue slots are wasted due to hazards?<br />

6.10 Virtualization software is being aggressively deployed to reduce the costs of<br />

managing today’s high-performance servers. Companies like VMware, Microsoft

<strong>and</strong> IBM have all developed a range of virtualization products. The general concept,<br />

described in Chapter 5, is that a hypervisor layer can be introduced between the<br />

hardware <strong>and</strong> the operating system to allow multiple operating systems to share<br />

the same physical hardware. The hypervisor layer is then responsible for allocating<br />

CPU <strong>and</strong> memory resources, as well as h<strong>and</strong>ling services typically h<strong>and</strong>led by the<br />

operating system (e.g., I/O).<br />

Virtualization provides an abstract view of the underlying hardware to the hosted<br />

operating system <strong>and</strong> application software. This will require us to rethink how<br />

multi-core <strong>and</strong> multiprocessor systems will be designed in the future to support<br />

the sharing of CPUs <strong>and</strong> memories by a number of operating systems concurrently.<br />

6.10.1 [30] Select two hypervisors on the market today, <strong>and</strong> compare<br />

<strong>and</strong> contrast how they virtualize <strong>and</strong> manage the underlying hardware (CPUs <strong>and</strong><br />

memory).<br />

6.10.2 [15] Discuss what changes may be necessary in future multi-core<br />

CPU platforms in order to better match the resource dem<strong>and</strong>s placed on these<br />

systems. For instance, can multithreading play an effective role in alleviating the<br />

competition for computing resources?<br />

6.11 We would like to execute the loop below as efficiently as possible. We have<br />

two different machines, a MIMD machine <strong>and</strong> a SIMD machine.<br />

for (i=0; i < 2000; i++)<br />

for (j=0; j



6.12 A systolic array is an example of an MISD machine. A systolic array is a<br />

pipeline network or “wavefront” of data processing elements. Each of these elements<br />

does not need a program counter since execution is triggered by the arrival of data.<br />

Clocked systolic arrays compute in “lock-step” with each processor undertaking<br />

alternate compute <strong>and</strong> communication phases.<br />

6.12.1 [10] Consider proposed implementations of a systolic array (you<br />

can find these on the Internet or in technical publications). Then attempt to

program the loop provided in Exercise 6.11 using this MISD model. Discuss any<br />

difficulties you encounter.<br />

6.12.2 [10] Discuss the similarities <strong>and</strong> differences between an MISD <strong>and</strong><br />

SIMD machine. Answer this question in terms of data-level parallelism.<br />

6.13 Assume we want to execute the DAXPY loop shown on page 511 in MIPS assembly on the NVIDIA 8800 GTX GPU described in this chapter. In this problem, we will assume that all math operations are performed on single-precision floating-point numbers (we will rename the loop SAXPY). Assume that instructions take

the following number of cycles to execute.<br />

Loads: 5   Stores: 2   Add.S: 3   Mult.S: 4
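For reference, the single-precision loop (SAXPY) has the following shape in C; the function name and arguments are illustrative:

void saxpy(int n, float a, const float *x, float *y)
{
    /* y[i] = a * x[i] + y[i], all in single precision */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Each iteration performs one load of x[i], one load of y[i], one single-precision multiply, one single-precision add, and one store, which is the instruction mix the cycle counts above apply to.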

6.13.1 [20] Describe how you will construct warps for the SAXPY loop

to exploit the 8 cores provided in a single multiprocessor.<br />

6.14 Download the CUDA Toolkit <strong>and</strong> SDK from http://www.nvidia.com/object/<br />

cuda_get.html. Make sure to use the “emurelease” (Emulation Mode) version of the<br />

code (you will not need actual NVIDIA hardware for this assignment). Build the<br />

example programs provided in the SDK, <strong>and</strong> confirm that they run on the emulator.<br />

6.14.1 [90] Using the “template” SDK sample as a starting point, write a<br />

CUDA program to perform the following vector operations:<br />

1) a − b (vector-vector subtraction)<br />

2) a ⋅ b (vector dot product)<br />

The dot product of two vectors a = [a_1, a_2, … , a_n] and b = [b_1, b_2, … , b_n] is defined as:

a · b = Σ_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + … + a_n b_n

Submit code for each program that demonstrates each operation <strong>and</strong> verifies the<br />

correctness of the results.<br />
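A plain-C reference implementation of the two operations (names illustrative) is useful for checking the GPU results:

#include <stddef.h>

/* c = a - b, element by element */
void vsub_ref(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] - b[i];
}

/* returns a . b = a1*b1 + a2*b2 + ... + an*bn */
float dot_ref(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}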

6.14.2 [90] If you have GPU hardware available, complete a performance<br />

analysis of your program, examining the computation time for the GPU and a CPU

version of your program for a range of vector sizes. Explain any results you see.<br />




6.15 AMD has recently announced that they will be integrating a graphics processing unit with their x86 cores in a single package, though with different clocks for each of the cores. This is an example of a heterogeneous multiprocessor system which we expect to see produced commercially in the near future. One of the key design points will be to allow for fast data communication between the CPU and the GPU. Presently communication must be performed between discrete CPU and GPU chips, but this is changing in AMD's Fusion architecture. Presently the plan is to use multiple (at least 16) PCI Express channels to facilitate intercommunication. Intel is also jumping into this arena with their Larrabee chip. Intel is considering using their QuickPath interconnect technology.

6.15.1 [25] Compare the b<strong>and</strong>width <strong>and</strong> latency associated with these<br />

two interconnect technologies.<br />

6.16 Refer to Figure 6.14b, which shows an n-cube interconnect topology of order<br />

3 that interconnects 8 nodes. One attractive feature of an n-cube interconnection<br />

network topology is its ability to sustain broken links <strong>and</strong> still provide connectivity.<br />

6.16.1 [10] Develop an equation that computes how many links in the<br />

n-cube (where n is the order of the cube) can fail while still guaranteeing that an

unbroken link will exist to connect any node in the n-cube.<br />

6.16.2 [10] Compare the resiliency to failure of the n-cube to a fully connected

interconnection network. Plot a comparison of reliability as a function<br />

of the added number of links for the two topologies.<br />

6.17 Benchmarking is a field of study that involves identifying representative

workloads to run on specific computing platforms in order to be able to objectively<br />

compare performance of one system to another. In this exercise we will compare<br />

two classes of benchmarks: the Whetstone CPU benchmark <strong>and</strong> the PARSEC<br />

Benchmark suite. Select one program from PARSEC. All programs should be freely<br />

available on the Internet. Consider running multiple copies of Whetstone versus<br />

running the PARSEC Benchmark on any of the systems described in Section 6.11.

6.17.1 [60] What is inherently different between these two classes of<br />

workload when run on these multi-core systems?<br />

6.17.2 [60] In terms of the Roofline Model, how dependent will the<br />

results you obtain when running these benchmarks be on the amount of sharing<br />

<strong>and</strong> synchronization present in the workload used?<br />

6.18 When performing computations on sparse matrices, latency in the memory<br />

hierarchy becomes much more of a factor. Sparse matrices lack the spatial locality<br />

in the data stream typically found in matrix operations. As a result, new matrix<br />

representations have been proposed.<br />

One of the earliest sparse matrix representations is the Yale Sparse Matrix Format. It stores an initial sparse m × n matrix M in row form using three one-dimensional



arrays. Let R be the number of nonzero entries in M. We construct an array A<br />

of length R that contains all nonzero entries of M (in left-to-right top-to-bottom<br />

order). We also construct a second array IA of length m + 1 (i.e., one entry per row,<br />

plus one). IA(i) contains the index in A of the first nonzero element of row i. Row<br />

i of the original matrix extends from A(IA(i)) to A(IA(i+1)−1). The third array, JA,<br />

contains the column index of each element of A, so it also is of length R.<br />
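As a small illustration of the format (deliberately not the matrix X from the exercise), consider a 3 × 4 matrix whose only nonzeros are m(1,1) = 5, m(1,3) = 8, m(2,4) = 3, and m(3,2) = 6. Using 0-based array indices in C, the three arrays would be:

/* Row 1: [5, 0, 8, 0]
 * Row 2: [0, 0, 0, 3]
 * Row 3: [0, 6, 0, 0]      R = 4 nonzeros, m = 3 rows */
float A[4]  = { 5.0f, 8.0f, 3.0f, 6.0f };  /* nonzeros, left-to-right, top-to-bottom     */
int   IA[4] = { 0, 2, 3, 4 };              /* A-index where each row starts; last entry = R */
int   JA[4] = { 0, 2, 3, 1 };              /* 0-based column index of each entry of A    */

/* Row i (0-based) occupies A[IA[i]] .. A[IA[i+1] - 1]. */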

6.18.1 [15] Consider the sparse matrix X below <strong>and</strong> write C code that<br />

would store this matrix in Yale Sparse Matrix Format.

Row 1 [1, 2, 0, 0, 0, 0]<br />

Row 2 [0, 0, 1, 1, 0, 0]<br />

Row 3 [0, 0, 0, 0, 9, 0]<br />

Row 4 [2, 0, 0, 0, 0, 2]<br />

Row 5 [0, 0, 3, 3, 0, 7]<br />

Row 6 [1, 3, 0, 0, 0, 1]<br />

6.18.2 [10] In terms of storage space, assuming that each element in<br />

matrix X is single-precision floating point, compute the amount of storage used to store the matrix above in Yale Sparse Matrix Format.

6.18.3 [15] Perform matrix multiplication of Matrix X by Matrix Y<br />

shown below.<br />

[2, 4, 1, 99, 7, 2]<br />

Put this computation in a loop, <strong>and</strong> time its execution. Make sure to increase<br />

the number of times this loop is executed to get good resolution in your timing<br />

measurement. Compare the runtime of using a naïve representation of the matrix,<br />

<strong>and</strong> the Yale Sparse Matrix Format.<br />

6.18.4 [15] Can you find a more efficient sparse matrix representation<br />

(in terms of space <strong>and</strong> computational overhead)?<br />

6.19 In future systems, we expect to see heterogeneous computing platforms<br />

constructed out of heterogeneous CPUs. We have begun to see some appear in the<br />

embedded processing market in systems that contain both floating-point DSPs and microcontroller CPUs in a multichip module package.

Assume that you have three classes of CPU:<br />

CPU A—A moderate speed multi-core CPU (with a floating point unit) that can<br />

execute multiple instructions per cycle.<br />

CPU B—A fast single-core integer CPU (i.e., no floating point unit) that can<br />

execute a single instruction per cycle.<br />

CPU C—A slow vector CPU (with floating point capability) that can execute<br />

multiple copies of the same instruction per cycle.



Answers to Check Yourself

§6.1, page 504: False. Task-level parallelism can help sequential applications and sequential applications can be made to run on parallel hardware, although it is more challenging.

§6.2, page 509: False. Weak scaling can compensate for a serial portion of the<br />

program that would otherwise limit scalability, but not so for strong scaling.<br />

§6.3, page 514: True, but they are missing useful vector features like gather-scatter<br />

<strong>and</strong> vector length registers that improve the efficiency of vector architectures.<br />

(As an elaboration in this section mentions, the AVX2 SIMD extensions offer

indexed loads via a gather operation but not scatter for indexed stores. The Haswell<br />

generation x86 microprocessor is the first to support AVX2.)<br />

§6.4, page 519: 1. True. 2. True.<br />

§6.5, page 523: False. Since the shared address is a physical address, multiple<br />

tasks each in their own virtual address spaces can run well on a shared memory<br />

multiprocessor.<br />

§6.6, page 531: False. Graphics DRAM chips are prized for their higher b<strong>and</strong>width.<br />

§6.7, page 536: 1. False. Sending <strong>and</strong> receiving a message is an implicit<br />

synchronization, as well as a way to share data. 2. True.<br />

§6.8, page 538: True.<br />

§6.10, page 550: True. We likely need innovation at all levels of the hardware <strong>and</strong><br />

software stack for parallel computing to succeed.<br />



APPENDIX A

Assemblers, Linkers, and the SPIM Simulator

James R. Larus
Microsoft Research, Microsoft

“Fear of serious injury cannot alone justify suppression of free speech and assembly.”
Louis Brandeis, Whitney v. California, 1927



[Figure A.1.1 shows three source files, each translated by an assembler into an object file; the linker combines the object files and a program library into an executable file.]

FIGURE A.1.1 The process that produces an executable file. An assembler translates a file of assembly language into an object file, which is linked with other files and libraries into an executable file.

assembler A program that translates a symbolic version of instructions into the binary version.

macro A pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions.

unresolved reference A reference that requires more information from an outside source to be complete.

linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.

permits programmers to use labels to identify <strong>and</strong> name particular memory words<br />

that hold instructions or data.<br />

A tool called an assembler translates assembly language into binary instructions.<br />

Assemblers provide a friendlier representation than a computer’s 0s <strong>and</strong> 1s, which<br />

simplifies writing and reading programs. Symbolic names for operations and locations

are one facet of this representation. Another facet is programming facilities<br />

that increase a program’s clarity. For example, macros, discussed in Section A.2,<br />

enable a programmer to extend the assembly language by defining new operations.<br />

An assembler reads a single assembly language source file <strong>and</strong> produces an<br />

object file containing machine instructions <strong>and</strong> bookkeeping information that<br />

helps combine several object files into a program. Figure A.1.1 illustrates how a<br />

program is built. Most programs consist of several files—also called modules—<br />

that are written, compiled, <strong>and</strong> assembled independently. A program may also use<br />

prewritten routines supplied in a program library. A module typically contains references<br />

to subroutines <strong>and</strong> data defined in other modules <strong>and</strong> in libraries. The code<br />

in a module cannot be executed when it contains unresolved references to labels<br />

in other object files or libraries. Another tool, called a linker, combines a collection<br />

of object <strong>and</strong> library files into an executable file, which a computer can run.<br />

To see the advantage of assembly language, consider the following sequence of<br />

figures, all of which contain a short subroutine that computes <strong>and</strong> prints the sum of<br />

the squares of integers from 0 to 100. Figure A.1.2 shows the machine language that<br />

a MIPS computer executes. With considerable effort, you could use the opcode <strong>and</strong><br />

instruction format tables in Chapter 2 to translate the instructions into a symbolic<br />

program similar to that shown in Figure A.1.3. This form of the routine is much<br />

easier to read, because operations <strong>and</strong> oper<strong>and</strong>s are written with symbols rather



00100111101111011111111111100000<br />

10101111101111110000000000010100<br />

10101111101001000000000000100000<br />

10101111101001010000000000100100<br />

10101111101000000000000000011000<br />

10101111101000000000000000011100<br />

10001111101011100000000000011100<br />

10001111101110000000000000011000<br />

00000001110011100000000000011001<br />

00100101110010000000000000000001<br />

00101001000000010000000001100101<br />

10101111101010000000000000011100<br />

00000000000000000111100000010010<br />

00000011000011111100100000100001<br />

00010100001000001111111111110111<br />

10101111101110010000000000011000<br />

00111100000001000001000000000000<br />

10001111101001010000000000011000<br />

00001100000100000000000011101100<br />

00100100100001000000010000110000<br />

10001111101111110000000000010100<br />

00100111101111010000000000100000<br />

00000011111000000000000000001000<br />

00000000000000000001000000100001<br />

FIGURE A.1.2 MIPS machine language code for a routine to compute <strong>and</strong> print the sum<br />

of the squares of integers between 0 <strong>and</strong> 100.<br />

than with bit patterns. However, this assembly language is still difficult to follow,<br />

because memory locations are named by their address rather than by a symbolic<br />

label.<br />

Figure A.1.4 shows assembly language that labels memory addresses with mnemonic<br />

names. Most programmers prefer to read <strong>and</strong> write this form. Names that<br />

begin with a period, for example .data <strong>and</strong> .globl, are assembler directives<br />

that tell the assembler how to translate a program but do not produce machine<br />

instructions. Names followed by a colon, such as str: or main:, are labels that<br />

name the next memory location. This program is as readable as most assembly<br />

language programs (except for a glaring lack of comments), but it is still difficult<br />

to follow, because many simple operations are required to accomplish simple tasks<br />

<strong>and</strong> because assembly language’s lack of control flow constructs provides few hints<br />

about the program’s operation.<br />

By contrast, the C routine in Figure A.1.5 is both shorter <strong>and</strong> clearer, since variables<br />

have mnemonic names <strong>and</strong> the loop is explicit rather than constructed with<br />

branches. In fact, the C routine is the only one that we wrote. The other forms of<br />

the program were produced by a C compiler <strong>and</strong> assembler.<br />

In general, assembly language plays two roles (see Figure A.1.6). The first role<br />

is the output language of compilers. A compiler translates a program written in a<br />

high-level language (such as C or Pascal) into an equivalent program in machine or<br />

assembler directive<br />

An operation that tells the<br />

assembler how to translate<br />

a program but does not<br />

produce machine instructions;<br />

always begins with<br />

a period.



addiu $29, $29, -32<br />

sw $31, 20($29)<br />

sw $4, 32($29)<br />

sw $5, 36($29)<br />

sw $0, 24($29)<br />

sw $0, 28($29)<br />

lw $14, 28($29)<br />

lw $24, 24($29)<br />

multu $14, $14<br />

addiu $8, $14, 1<br />

slti $1, $8, 101<br />

sw $8, 28($29)<br />

mflo $15<br />

addu $25, $24, $15<br />

bne $1, $0, -9<br />

sw $25, 24($29)<br />

lui $4, 4096<br />

lw $5, 24($29)<br />

jal 1048812<br />

addiu $4, $4, 1072<br />

lw $31, 20($29)<br />

addiu $29, $29, 32<br />

jr $31<br />

move $2, $0<br />

FIGURE A.1.3 The same routine as in Figure A.1.2 written in assembly language. However,<br />

the code for the routine does not label registers or memory locations or include comments.<br />

source language The high-level language in which a program is originally written.

assembly language. The high-level language is called the source language, <strong>and</strong> the<br />

compiler’s output is its target language.<br />

Assembly language’s other role is as a language in which to write programs. This<br />

role used to be the dominant one. Today, however, because of larger main memories<br />

<strong>and</strong> better compilers, most programmers write in a high-level language <strong>and</strong><br />

rarely, if ever, see the instructions that a computer executes. Nevertheless, assembly<br />

language is still important to write programs in which speed or size is critical or to<br />

exploit hardware features that have no analogues in high-level languages.<br />

Although this appendix focuses on MIPS assembly language, assembly programming<br />

on most other machines is very similar. The additional instructions <strong>and</strong><br />

address modes in CISC machines, such as the VAX, can make assembly programs

shorter but do not change the process of assembling a program or provide assembly<br />

language with the advantages of high-level languages, such as type-checking <strong>and</strong><br />

structured control flow.



FIGURE A.1.4 The same routine as in Figure A.1.2 written in assembly language with labels, but no comments. The commands that start with periods are assembler directives (see pages A-47–49). .text indicates that succeeding lines contain instructions. .data indicates that they contain data. .align n indicates that the items on the succeeding lines should be aligned on a 2^n byte boundary. Hence, .align 2 means the next item should be on a word boundary. .globl main declares that main is a global symbol that should be visible to code stored in other files. Finally, .asciiz stores a null-terminated string in memory.

When to Use Assembly Language<br />

The primary reason to program in assembly language, as opposed to an available<br />

high-level language, is that the speed or size of a program is critically important.<br />

For example, consider a computer that controls a piece of machinery, such as a<br />

car’s brakes. A computer that is incorporated in another device, such as a car, is<br />

called an embedded computer. This type of computer needs to respond rapidly<br />

<strong>and</strong> predictably to events in the outside world. Because a compiler introduces



#include <stdio.h>

int
main (int argc, char *argv[])
{
  int i;
  int sum = 0;

  for (i = 0; i <= 100; i = i + 1)
    sum = sum + i * i;
  printf ("The sum from 0 .. 100 is %d\n", sum);
}



This improvement is not necessarily an indication that the high-level language’s<br />

compiler has failed. Compilers typically are better than programmers at producing<br />

uniformly high-quality machine code across an entire program. Pro grammers,<br />

however, underst<strong>and</strong> a program’s algorithms <strong>and</strong> behavior at a deeper level than<br />

a compiler <strong>and</strong> can expend considerable effort <strong>and</strong> ingenuity improving small<br />

sections of the program. In particular, programmers often consider several procedures<br />

simultaneously while writing their code. Compilers typically compile each<br />

procedure in isolation <strong>and</strong> must follow strict conventions governing the use of<br />

registers at procedure boundaries. By retaining commonly used values in registers,<br />

even across procedure boundaries, programmers can make a program run<br />

faster.<br />

Another major advantage of assembly language is the ability to exploit specialized<br />

instructions—for example, string copy or pattern-matching instructions.<br />

Compilers, in most cases, cannot determine that a program loop can be replaced<br />

by a single instruction. However, the programmer who wrote the loop can replace<br />

it easily with a single instruction.<br />

Currently, a programmer’s advantage over a compiler has become difficult to<br />

maintain as compilation techniques improve <strong>and</strong> machines’ pipelines increase in<br />

complexity (Chapter 4).<br />

The final reason to use assembly language is that no high-level language is<br />

available on a particular computer. Many older or specialized computers do not<br />

have a compiler, so a programmer’s only alternative is assembly language.<br />

Drawbacks of Assembly Language<br />

Assembly language has many disadvantages that strongly argue against its widespread<br />

use. Perhaps its major disadvantage is that programs written in assembly<br />

language are inherently machine-specific <strong>and</strong> must be totally rewritten to run on<br />

another computer architecture. The rapid evolution of computers discussed in<br />

Chapter 1 means that architectures become obsolete. An assembly language program<br />

remains tightly bound to its original architecture, even after the computer is

eclipsed by new, faster, <strong>and</strong> more cost-effective machines.<br />

Another disadvantage is that assembly language programs are longer than the<br />

equivalent programs written in a high-level language. For example, the C program<br />

in Figure A.1.5 is 11 lines long, while the assembly program in Figure A.1.4 is<br />

31 lines long. In more complex programs, the ratio of assembly to high-level language<br />

(its expansion factor) can be much larger than the factor of three in this<br />

example. Unfortunately, empirical studies have shown that programmers write

roughly the same number of lines of code per day in assembly as in high-level<br />

languages. This means that programmers are roughly x times more productive in a<br />

high-level language, where x is the assembly language expansion factor.



To compound the problem, longer programs are more difficult to read <strong>and</strong><br />

underst<strong>and</strong>, <strong>and</strong> they contain more bugs. Assembly language exacerbates the problem<br />

because of its complete lack of structure. Common programming idioms,<br />

such as if-then statements <strong>and</strong> loops, must be built from branches <strong>and</strong> jumps. The<br />

resulting programs are hard to read, because the reader must reconstruct every<br />

higher-level construct from its pieces <strong>and</strong> each instance of a statement may be<br />

slightly different. For example, look at Figure A.1.4 <strong>and</strong> answer these questions:<br />

What type of loop is used? What are its lower <strong>and</strong> upper bounds?<br />

Elaboration: Compilers can produce machine language directly instead of relying on<br />

an assembler. These compilers typically execute much faster than those that invoke<br />

an assembler as part of compilation. However, a compiler that generates machine language<br />

must perform many tasks that an assembler normally h<strong>and</strong>les, such as resolving<br />

addresses <strong>and</strong> encoding instructions as binary numbers. The tradeoff is between<br />

compilation speed <strong>and</strong> compiler simplicity.<br />

Elaboration: Despite these considerations, some embedded applications are written<br />

in a high-level language. Many of these applications are large <strong>and</strong> complex programs<br />

that must be extremely reliable. Assembly language programs are longer <strong>and</strong><br />

more difficult to write and read than high-level language programs. This greatly increases the cost of writing an assembly language program and makes it extremely difficult to

verify the correctness of this type of program. In fact, these considerations led the US<br />

Department of Defense, which pays for many complex embedded systems, to develop<br />

Ada, a new high-level language for writing embedded systems.<br />

A.2 Assemblers<br />

external label Also called<br />

global label. A label<br />

referring to an object that<br />

can be referenced from<br />

files other than the one in<br />

which it is defined.<br />

An assembler translates a file of assembly language statements into a file of binary<br />

machine instructions <strong>and</strong> binary data. The translation process has two major<br />

parts. The first step is to find memory locations with labels so that the relationship<br />

between symbolic names and addresses is known when instructions are translated. The second step is to translate each assembly statement by combining the numeric equivalents of opcodes, register specifiers, and labels into a legal instruction. As shown in Figure A.1.1, the assembler produces an output file, called an object file, which contains the machine instructions, data, and bookkeeping information.

An object file typically cannot be executed, because it references procedures or<br />

data in other files. A label is external (also called global) if the labeled object can



be referenced from files other than the one in which it is defined. A label is local<br />

if the object can be used only within the file in which it is defined. In most assemblers,<br />

labels are local by default and must be explicitly declared global. Subroutines

<strong>and</strong> global variables require external labels since they are referenced from many<br />

files in a program. Local labels hide names that should not be visible to other<br />

modules—for example, static functions in C, which can only be called by other<br />

functions in the same file. In addition, compiler-generated names—for example, a<br />

name for the instruction at the beginning of a loop—are local so that the compiler<br />

need not produce unique names in every file.<br />

local label A label<br />

referring to an object that<br />

can be used only within<br />

the file in which it is<br />

defined.<br />

Local <strong>and</strong> Global Labels<br />

Consider the program in Figure A.1.4. The subroutine has an external (global)<br />

label main. It also contains two local labels—loop <strong>and</strong> str—that are only<br />

visible with this assembly language file. Finally, the routine also contains an<br />

unresolved reference to an external label printf, which is the library routine<br />

that prints values. Which labels in Figure A.1.4 could be referenced from<br />

another file?<br />

EXAMPLE<br />

Only global labels are visible outside a file, so the only label that could be<br />

referenced from another file is main.<br />

Since the assembler processes each file in a program individually and in isolation,

it only knows the addresses of local labels. The assembler depends on another tool,<br />

the linker, to combine a collection of object files <strong>and</strong> libraries into an executable<br />

file by resolving external labels. The assembler assists the linker by providing lists

of labels <strong>and</strong> unresolved references.<br />

However, even local labels present an interesting challenge to an assembler.<br />

Unlike names in most high-level languages, assembly labels may be used before<br />

they are defined. In the example in Figure A.1.4, the label str is used by the la<br />

instruction before it is defined. The possibility of a forward reference, like this one,<br />

forces an assembler to translate a program in two steps: first find all labels <strong>and</strong> then<br />

produce instructions. In the example, when the assembler sees the la instruction,<br />

it does not know where the word labeled str is located or even whether str labels<br />

an instruction or datum.<br />

ANSWER<br />

forward reference<br />

A label that is used<br />

before it is defined.



An assembler’s first pass reads each line of an assembly file <strong>and</strong> breaks it into its<br />

component pieces. These pieces, which are called lexemes, are individual words,<br />

numbers, and punctuation characters. For example, the line

ble $t0, 100, loop

contains six lexemes: the opcode ble, the register specifier $t0, a comma, the number 100, a comma, and the symbol loop.

symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy.

If a line begins with a label, the assembler records in its symbol table the name<br />

of the label <strong>and</strong> the address of the memory word that the instruction occupies.<br />

The assembler then calculates how many words of memory the instruction on the<br />

current line will occupy. By keeping track of the instructions’ sizes, the assembler<br />

can determine where the next instruction goes. To compute the size of a variable-length

instruction, like those on the VAX, an assembler has to examine it in detail.<br />

However, fixed-length instructions, like those on MIPS, require only a cursory<br />

examination. The assembler performs a similar calculation to compute the space<br />

required for data statements. When the assembler reaches the end of an assembly<br />

file, the symbol table records the location of each label defined in the file.<br />

The assembler uses the information in the symbol table during a second pass<br />

over the file, which actually produces machine code. The assembler again examines<br />

each line in the file. If the line contains an instruction, the assembler combines<br />

the binary representations of its opcode <strong>and</strong> oper<strong>and</strong>s (register specifiers or<br />

memory address) into a legal instruction. The process is similar to the one used in<br />

Section 2.5 in Chapter 2. Instructions <strong>and</strong> data words that reference an external<br />

symbol defined in another file cannot be completely assembled (they are unresolved),<br />

since the symbol’s address is not in the symbol table. An assembler does<br />

not complain about unresolved references, since the corresponding label is likely<br />

to be defined in another file.<br />
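To make the bookkeeping concrete, here is a highly simplified C sketch of a symbol table and the operations the two passes rely on; the data structures and names are invented for illustration and ignore many details (table growth, duplicate labels, data directives):

#include <string.h>

#define MAX_SYMS 1024

struct symbol { char name[64]; unsigned address; };
static struct symbol symtab[MAX_SYMS];
static int nsyms = 0;

/* Pass 1: when a line begins with a label, record the label and the
 * address of the memory word the next instruction or datum will occupy. */
static void record_label(const char *name, unsigned address)
{
    strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
    symtab[nsyms].address = address;
    nsyms++;
}

/* Pass 2: look up a label used as an operand; a miss means an
 * unresolved reference that is left for the linker to fix. */
static int lookup_label(const char *name, unsigned *address)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0) {
            *address = symtab[i].address;
            return 1;
        }
    return 0;   /* unresolved */
}

A real assembler would also advance a location counter as it scans each line (four bytes per MIPS instruction) and would emit relocation and symbol-table entries for the names that remain unresolved.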

The BIG Picture

Assembly language is a programming language. Its principal difference<br />

from high-level languages such as BASIC, Java, <strong>and</strong> C is that assembly language<br />

provides only a few, simple types of data <strong>and</strong> control flow. Assembly<br />

language programs do not specify the type of value held in a variable.<br />

Instead, a programmer must apply the appropriate operations (e.g., integer<br />

or floating-point addition) to a value. In addition, in assembly language,

programs must implement all control flow with gotos. Both factors make

assembly language programming for any machine—MIPS or x86—more<br />

difficult <strong>and</strong> error-prone than writing in a high-level language.



Elaboration: If an assembler’s speed is important, this two-step process can be done in one pass over the assembly file with a technique known as backpatching. In its pass over the file, the assembler builds a (possibly incomplete) binary representation of every instruction. If the instruction references a label that has not yet been defined, the assembler records the label and instruction in a table. When a label is defined, the assembler consults this table to find all instructions that contain a forward reference to the label. The assembler goes back and corrects their binary representation to incorporate the address of the label. Backpatching speeds assembly because the assembler only reads its input once. However, it requires an assembler to hold the entire binary representation of a program in memory so instructions can be backpatched. This requirement can limit the size of programs that can be assembled. The process is complicated by machines with several types of branches that span different ranges of instructions. When the assembler first sees an unresolved label in a branch instruction, it must either use the largest possible branch or risk having to go back and readjust many instructions to make room for a larger branch.

backpatching A method for translating from assembly language to machine instructions in which the assembler builds a (possibly incomplete) binary representation of every instruction in one pass over a program and then returns to fill in previously undefined labels.

Object File Format<br />

Assemblers produce object files. An object file on UNIX contains six distinct<br />

sections (see Figure A.2.1):<br />

■ The object file header describes the size <strong>and</strong> position of the other pieces of<br />

the file.<br />

■ The text segment contains the machine language code for routines in the<br />

source file. These routines may be unexecutable because of unresolved<br />

references.<br />

■ The data segment contains a binary representation of the data in the source<br />

file. The data also may be incomplete because of unresolved references to<br />

labels in other files.<br />

■ The relocation information identifies instructions <strong>and</strong> data words that<br />

depend on absolute addresses. These references must change if portions of<br />

the program are moved in memory.<br />

■ The symbol table associates addresses with external labels in the source file<br />

<strong>and</strong> lists unresolved references.<br />

■ The debugging information contains a concise description of the way the<br />

program was compiled, so a debugger can find which instruction addresses<br />

correspond to lines in a source file <strong>and</strong> print the data structures in readable<br />

form.<br />

The assembler produces an object file that contains a binary representation of<br />

the program <strong>and</strong> data <strong>and</strong> additional information to help link pieces of a program.<br />

text segment The segment of a UNIX object file that contains the machine language code for routines in the source file.

data segment The segment of a UNIX object or executable file that contains a binary representation of the initialized data used by the program.

relocation information The segment of a UNIX object file that identifies instructions and data words that depend on absolute addresses.

absolute address A variable’s or routine’s actual address in memory.



[Figure A.2.1 shows the six sections of an object file: object file header, text segment, data segment, relocation information, symbol table, and debugging information.]

FIGURE A.2.1 Object file. A UNIX assembler produces an object file with six distinct sections.

This relocation information is necessary because the assembler does not know<br />

which memory locations a procedure or piece of data will occupy after it is linked<br />

with the rest of the program. Procedures <strong>and</strong> data from a file are stored in a contiguous<br />

piece of memory, but the assembler does not know where this mem ory will<br />

be located. The assembler also passes some symbol table entries to the linker. In<br />

particular, the assembler must record which external symbols are defined in a file<br />

<strong>and</strong> what unresolved references occur in a file.<br />

Elaboration: For convenience, assemblers assume each file starts at the same<br />

address (for example, location 0) with the expectation that the linker will relocate the code<br />

<strong>and</strong> data when they are assigned locations in memory. The assembler produces relocation<br />

information, which contains an entry describing each instruction or data word in the file<br />

that references an absolute address. On MIPS, only the subroutine call, load, <strong>and</strong> store<br />

instructions reference absolute addresses. Instructions that use PC- relative addressing,<br />

such as branches, need not be relocated.<br />

Additional Facilities<br />

Assemblers provide a variety of convenience features that help make assembler<br />

programs shorter <strong>and</strong> easier to write, but do not fundamentally change assembly<br />

language. For example, data layout directives allow a programmer to describe data<br />

in a more concise <strong>and</strong> natural manner than its binary representation.<br />

In Figure A.1.4, the directive<br />

.asciiz "The sum from 0 .. 100 is %d\n"

stores characters from the string in memory. Contrast this line with the alternative<br />

of writing each character as its ASCII value (Figure 2.15 in Chapter 2 describes the<br />

ASCII encoding for characters):<br />

.byte 84, 104, 101, 32, 115, 117, 109, 32<br />

.byte 102, 114, 111, 109, 32, 48, 32, 46<br />

.byte 46, 32, 49, 48, 48, 32, 105, 115<br />

.byte 32, 37, 100, 10, 0<br />

The .asciiz directive is easier to read because it represents characters as letters,<br />

not binary numbers. An assembler can translate characters to their binary representation<br />

much faster <strong>and</strong> more accurately than a human can. Data layout directives



specify data in a human-readable form that the assembler translates to binary. Other<br />

layout directives are described in Section A.10.<br />

String Directive<br />

Define the sequence of bytes produced by this directive:<br />

.asciiz “The quick brown fox jumps over the lazy dog”<br />

EXAMPLE<br />

.byte 84, 104, 101, 32, 113, 117, 105, 99<br />

.byte 107, 32, 98, 114, 111, 119, 110, 32<br />

.byte 102, 111, 120, 32, 106, 117, 109, 112<br />

.byte 115, 32, 111, 118, 101, 114, 32, 116<br />

.byte 104, 101, 32, 108, 97, 122, 121, 32<br />

.byte 100, 111, 103, 0<br />

ANSWER<br />

A macro is a pattern-matching and replacement facility that provides a simple

mechanism to name a frequently used sequence of instructions. Instead of repeatedly<br />

typing the same instructions every time they are used, a programmer invokes<br />

the macro <strong>and</strong> the assembler replaces the macro call with the corresponding<br />

sequence of instructions. Macros, like subroutines, permit a programmer to create<br />

<strong>and</strong> name a new abstraction for a common operation. Unlike subroutines, however,<br />

macros do not cause a subroutine call <strong>and</strong> return when the program runs,<br />

since a macro call is replaced by the macro’s body when the program is assembled.<br />

After this replacement, the resulting assembly is indistinguishable from the equivalent<br />

program written without macros.<br />

Macros<br />

As an example, suppose that a programmer needs to print many numbers. The<br />

library routine printf accepts a format string <strong>and</strong> one or more values to print<br />

as its arguments. A programmer could print the integer in register $7 with the<br />

following instructions:<br />

EXAMPLE<br />

.data<br />

int_str: .asciiz "%d"

.text<br />

la $a0, int_str # Load string address<br />

# into first arg



mov $a1, $7 # Load value into<br />

# second arg<br />

jal printf # Call the printf routine<br />

The .data directive tells the assembler to store the string in the program’s data<br />

segment, <strong>and</strong> the .text directive tells the assembler to store the instruc tions<br />

in its text segment.<br />

However, printing many numbers in this fashion is tedious <strong>and</strong> produces a<br />

verbose program that is difficult to underst<strong>and</strong>. An alternative is to introduce<br />

a macro, print_int, to print an integer:<br />

formal parameter A variable that is the argument to a procedure or macro; it is replaced by that argument once the macro is expanded.

.data<br />

int_str: .asciiz "%d"

.text<br />

.macro print_int($arg)<br />

la $a0, int_str # Load string address into<br />

# first arg<br />

mov $a1, $arg # Load macro’s parameter<br />

# ($arg) into second arg<br />

jal printf # Call the printf routine<br />

.end_macro<br />

print_int($7)<br />

The macro has a formal parameter, $arg, that names the argument to the<br />

macro. When the macro is exp<strong>and</strong>ed, the argument from a call is substituted<br />

for the formal parameter throughout the macro’s body. Then the assembler<br />

replaces the call with the macro’s newly exp<strong>and</strong>ed body. In the first call on<br />

print_int, the argument is $7, so the macro exp<strong>and</strong>s to the code<br />

la $a0, int_str<br />

mov $a1, $7<br />

jal printf<br />

In a second call on print_int, say, print_int($t0), the argument is $t0,<br />

so the macro exp<strong>and</strong>s to<br />

la $a0, int_str<br />

mov $a1, $t0<br />

jal printf<br />

What does the call print_int($a0) exp<strong>and</strong> to?



la $a0, int_str<br />

mov $a1, $a0<br />

jal printf<br />

ANSWER<br />

This example illustrates a drawback of macros. A programmer who uses<br />

this macro must be aware that print_int uses register $a0 <strong>and</strong> so cannot<br />

correctly print the value in that register.<br />

Some assemblers also implement pseudoinstructions, which are instructions provided<br />

by an assembler but not implemented in hardware. Chapter 2 contains<br />

many examples of how the MIPS assembler synthesizes pseudoinstructions<br />

<strong>and</strong> addressing modes from the spartan MIPS hardware instruction set. For<br />

example, Section 2.7 in Chapter 2 describes how the assembler synthesizes the<br />

blt instruc tion from two other instructions: slt <strong>and</strong> bne. By extending the<br />

instruction set, the MIPS assembler makes assembly language programming<br />

easier without complicating the hardware. Many pseudoinstructions could also<br />

be simulated with macros, but the MIPS assembler can generate better code for<br />

these instructions because it can use a dedicated register ($at) <strong>and</strong> is able to<br />

optimize the generated code.<br />

Hardware/Software Interface

Elaboration: Assemblers conditionally assemble pieces of code, which permits a programmer to include or exclude groups of instructions when a program is assembled. This feature is particularly useful when several versions of a program differ by a small amount. Rather than keep these programs in separate files (which greatly complicates fixing bugs in the common code), programmers typically merge the versions into a single file. Code particular to one version is conditionally assembled, so it can be excluded when other versions of the program are assembled.

If macros and conditional assembly are useful, why do assemblers for UNIX systems rarely, if ever, provide them? One reason is that most programmers on these systems write programs in higher-level languages like C. Most of the assembly code is produced by compilers, which find it more convenient to repeat code rather than define macros. Another reason is that other tools on UNIX, such as cpp, the C preprocessor, or m4, a general macro processor, can provide macros and conditional assembly for assembly language programs.



system kernel brings a program into memory <strong>and</strong> starts it running. To start a program,<br />

the operating system performs the following steps:<br />

1. It reads the executable file’s header to determine the size of the text <strong>and</strong> data<br />

segments.<br />

2. It creates a new address space for the program. This address space is large<br />

enough to hold the text <strong>and</strong> data segments, along with a stack segment (see<br />

Section A.5).<br />

3. It copies instructions <strong>and</strong> data from the executable file into the new address<br />

space.<br />

4. It copies arguments passed to the program onto the stack.<br />

5. It initializes the machine registers. In general, most registers are cleared, but<br />

the stack pointer must be assigned the address of the first free stack location<br />

(see Section A.5).<br />

6. It jumps to a start-up routine that copies the program’s arguments from the<br />

stack to registers <strong>and</strong> calls the program’s main routine. If the main routine<br />

returns, the start-up routine terminates the program with the exit system call.<br />
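Conceptually, the start-up routine in step 6 behaves like the following C sketch; the name __start is illustrative, and the real routine is supplied by the system and is written in assembly:

#include <stdlib.h>

extern int main(int argc, char *argv[]);

/* Invoked by the operating system with the program's arguments already
 * placed on the stack; calls main and terminates with the exit system call. */
void __start(int argc, char *argv[])
{
    exit(main(argc, argv));
}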

A.5 Memory Usage<br />

static data The portion of memory that contains data whose size is known to the compiler and whose lifetime is the program’s entire execution.

The next few sections elaborate the description of the MIPS architecture presented<br />

earlier in the book. Earlier chapters focused primarily on hardware <strong>and</strong> its relationship<br />

with low-level software. These sections focus primarily on how assembly language<br />

programmers use MIPS hardware. These sections describe a set of conventions<br />

followed on many MIPS systems. For the most part, the hardware does not impose<br />

these conventions. Instead, they represent an agreement among programmers to<br />

follow the same set of rules so that software written by different people can work<br />

together <strong>and</strong> make effective use of MIPS hardware.<br />

Systems based on MIPS processors typically divide memory into three parts<br />

(see Figure A.5.1). The first part, near the bottom of the address space (starting<br />

at address 400000 hex ), is the text segment, which holds the program’s instructions.<br />

The second part, above the text segment, is the data segment, which is further<br />

divided into two parts. Static data (starting at address 10000000 hex ) contains<br />

objects whose size is known to the compiler <strong>and</strong> whose lifetime—the interval<br />

during which a program can access them—is the program’s entire execution. For

example, in C, global variables are statically allocated, since they can be referenced



[Figure A.5.1 shows the layout of memory: the reserved area at the bottom of the address space, the text segment starting at 400000 hex, the data segment starting at 10000000 hex with static data followed by dynamic data growing upward, and the stack segment growing downward from 7fffffff hex.]

FIGURE A.5.1 Layout of memory.

anytime during a program’s execution. The linker both assigns static objects to<br />

locations in the data segment <strong>and</strong> resolves references to these objects.<br />

Immediately above static data is dynamic data. This data, as its name implies, is<br />

allocated by the program as it executes.

Hardware/Software Interface

Because the data segment begins far above the program at address 10000000 hex, load and store instructions cannot directly reference data objects with their 16-bit offset fields (see Section 2.5 in Chapter 2). For example, to load the word in the data segment at address 10010020 hex into register $v0 requires two instructions:

lui $s0, 0x1001 # 0x1001 means 1001 base 16<br />

lw $v0, 0x0020($s0) # 0x10010000 + 0x0020 = 0x10010020<br />

(The 0x before a number means that it is a hexadecimal value. For example, 0x8000<br />

is 8000 hex or 32,768 ten .)<br />

To avoid repeating the lui instruction at every load <strong>and</strong> store, MIPS systems<br />

typically dedicate a register ($gp) as a global pointer to the static data segment. This<br />

register contains address 10008000 hex, so load <strong>and</strong> store instructions can use their<br />

signed 16-bit offset fields to access the first 64 KB of the static data segment. With<br />

this global pointer, we can rewrite the example as a single instruction:<br />

lw $v0, 0x8020($gp)<br />

Of course, a global pointer register makes addressing locations 10000000 hex –<br />

10010000 hex faster than other heap locations. The MIPS compiler usually stores<br />

global variables in this area, because these variables have fixed locations <strong>and</strong> fit better<br />

than other global data, such as arrays.



stack segment The<br />

portion of memory used<br />

by a program to hold<br />

procedure call frames.<br />

In C programs, the malloc library routine finds and returns a new block of memory. Since a compiler cannot predict how

much memory a program will allocate, the operating system exp<strong>and</strong>s the dynamic<br />

data area to meet dem<strong>and</strong>. As the upward arrow in the figure indicates, malloc<br />

exp<strong>and</strong>s the dynamic area with the sbrk system call, which causes the operating<br />

system to add more pages to the program’s virtual address space (see Section 5.7 in<br />

Chapter 5) immediately above the dynamic data segment.<br />

The third part, the program stack segment, resides at the top of the virtual<br />

address space (starting at address 7fffffff hex ). Like dynamic data, the maximum size<br />

of a program’s stack is not known in advance. As the program pushes values on to<br />

the stack, the operating system exp<strong>and</strong>s the stack segment down toward the data<br />

segment.<br />

This three-part division of memory is not the only possible one. However, it has<br />

two important characteristics: the two dynamically exp<strong>and</strong>able segments are as far<br />

apart as possible, <strong>and</strong> they can grow to use a program’s entire address space.<br />

A.6 Procedure Call Convention<br />

register use convention: Also called procedure call convention. A software protocol governing the use of registers by procedures.

Conventions governing the use of registers are necessary when procedures in a<br />

program are compiled separately. To compile a particular procedure, a compiler<br />

must know which registers it may use <strong>and</strong> which registers are reserved for other<br />

procedures. Rules for using registers are called register use or procedure call<br />

conventions. As the name implies, these rules are, for the most part, conventions<br />

followed by software rather than rules enforced by hardware. However, most compilers

<strong>and</strong> programmers try very hard to follow these conventions because violating<br />

them causes insidious bugs.<br />

The calling convention described in this section is the one used by the gcc compiler.<br />

The native MIPS compiler uses a more complex convention that is slightly<br />

faster.<br />

The MIPS CPU contains 32 general-purpose registers that are numbered 0–31.

■ Register $0 always contains the hardwired value 0.

■ Registers $at (1), $k0 (26), <strong>and</strong> $k1 (27) are reserved for the assembler <strong>and</strong><br />

operating system <strong>and</strong> should not be used by user programs or compilers.<br />

■ Registers $a0–$a3 (4–7) are used to pass the first four arguments to rou tines<br />

(remaining arguments are passed on the stack). Registers $v0 <strong>and</strong> $v1 (2, 3)<br />

are used to return values from functions.



■ Registers $t0–$t9 (8–15, 24, 25) are caller-saved registers that are used<br />

to hold temporary quantities that need not be preserved across calls (see<br />

Section 2.8 in Chapter 2).<br />

■ Registers $s0–$s7 (16–23) are callee-saved registers that hold long-lived<br />

values that should be preserved across calls.<br />

■ Register $gp (28) is a global pointer that points to the middle of a 64K block<br />

of memory in the static data segment.<br />

■ Register $sp (29) is the stack pointer, which points to the last location on<br />

the stack. Register $fp (30) is the frame pointer. The jal instruction writes<br />

register $ra (31), the return address from a procedure call. These two registers<br />

are explained in the next section.<br />

The two-letter abbreviations <strong>and</strong> names for these registers—for example $sp<br />

for the stack pointer—reflect the registers’ intended uses in the procedure call<br />

convention. In describing this convention, we will use the names instead of register numbers. Figure A.6.1 lists the registers and describes their intended uses.

Procedure Calls<br />

This section describes the steps that occur when one procedure (the caller) invokes<br />

another procedure (the callee). Programmers who write in a high-level language<br />

(like C or Pascal) never see the details of how one procedure calls another, because<br />

the compiler takes care of this low-level bookkeeping. However, assembly language<br />

programmers must explicitly implement every procedure call <strong>and</strong> return.<br />

Most of the bookkeeping associated with a call is centered around a block<br />

of memory called a procedure call frame. This memory is used for a variety of<br />

purposes:<br />

■ To hold values passed to a procedure as arguments<br />

■ To save registers that a procedure may modify, but which the procedure’s<br />

caller does not want changed<br />

■ To provide space for variables local to a procedure<br />

In most programming languages, procedure calls <strong>and</strong> returns follow a strict<br />

last-in, first-out (LIFO) order, so this memory can be allocated <strong>and</strong> deallocated on<br />

a stack, which is why these blocks of memory are sometimes called stack frames.<br />

Figure A.6.2 shows a typical stack frame. The frame consists of the memory<br />

between the frame pointer ($fp), which points to the first word of the frame,<br />

<strong>and</strong> the stack pointer ($sp), which points to the last word of the frame. The stack<br />

grows down from higher memory addresses, so the frame pointer points above the stack pointer.

caller-saved register: A register whose value must be saved by the calling routine, before a call, if the caller needs that value after the call; the called routine may freely overwrite it.

callee-saved register: A register whose value must be saved and restored by the called routine if it uses the register, so that the caller sees it unchanged across the call.

procedure call frame: A block of memory that is used to hold values passed to a procedure as arguments, to save registers that a procedure may modify but that the procedure's caller does not want changed, and to provide space for variables local to a procedure.


FIGURE A.6.2 Layout of a stack frame. The frame pointer ($fp) points to the first word in the currently executing procedure's stack frame, and the stack pointer ($sp) points to the last word of the frame. From higher to lower memory addresses, the frame holds the memory-passed arguments (argument 6, argument 5, and so on), the saved registers, and the local variables; the stack grows toward lower addresses. The first four arguments are passed in registers, so the fifth argument is the first one stored on the stack.

A stack frame may be built in many different ways; however, the caller <strong>and</strong><br />

callee must agree on the sequence of steps. The steps below describe the calling<br />

convention used on most MIPS machines. This convention comes into play at three<br />

points during a procedure call: immediately before the caller invokes the callee,<br />

just as the callee starts executing, <strong>and</strong> immediately before the callee returns to the<br />

caller. In the first part, the caller puts the procedure call arguments in standard places and invokes the callee. To do so, the caller takes the following steps:

1. Pass arguments. By convention, the first four arguments are passed in registers<br />

$a0–$a3. Any remaining arguments are pushed on the stack <strong>and</strong> appear<br />

at the beginning of the called procedure’s stack frame.<br />

2. Save caller-saved registers. The called procedure can use these registers<br />

($a0–$a3 <strong>and</strong> $t0–$t9) without first saving their value. If the caller expects<br />

to use one of these registers after a call, it must save its value before the call.<br />

3. Execute a jal instruction (see Section 2.8 of Chapter 2), which jumps to the<br />

callee’s first instruction <strong>and</strong> saves the return address in register $ra.



Before a called routine starts running, it must take the following steps to set up<br />

its stack frame:<br />

1. Allocate memory for the frame by subtracting the frame’s size from the stack<br />

pointer.<br />

2. Save callee-saved registers in the frame. A callee must save the values in<br />

these registers ($s0–$s7, $fp, <strong>and</strong> $ra) before altering them, since the<br />

caller expects to find these registers unchanged after the call. Register $fp is<br />

saved by every procedure that allocates a new stack frame. However, register<br />

$ra only needs to be saved if the callee itself makes a call. Any other callee-saved registers that are used must also be saved.

3. Establish the frame pointer by adding the stack frame’s size minus 4 to $sp<br />

<strong>and</strong> storing the sum in register $fp.<br />

Hardware/Software Interface

The MIPS register use convention provides callee- and caller-saved registers, because both types of registers are advantageous in different circumstances. Callee-saved registers are better used to hold long-lived values, such as variables from a user's program. These registers are only saved during a procedure call if the callee expects to use the register. On the other hand, caller-saved registers are better used to hold short-lived quantities that do not persist across a call, such as immediate values in an address calculation. During a call, the callee can also use these registers for short-lived temporaries.
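As a small, hedged sketch of this division of labor (the routine helper and the constants here are hypothetical, not part of the original example): a long-lived loop count is kept in callee-saved $s0 so it survives calls, while a scratch value lives in caller-saved $t0, which helper is free to overwrite.

        li   $s0, 10           # long-lived count in callee-saved $s0
loop:   li   $t0, 4            # short-lived scratch in caller-saved $t0
        add  $a0, $t0, $s0     # build an argument from the temporary
        jal  helper            # helper (hypothetical) may clobber $t0-$t9, $a0-$a3, $v0, $v1
        addi $s0, $s0, -1      # $s0 still holds the count; helper preserved it
        bne  $s0, $zero, loop  # repeat until the count reaches 0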

Finally, the callee returns to the caller by executing the following steps:<br />

1. If the callee is a function that returns a value, place the returned value in<br />

register $v0.<br />

2. Restore all callee-saved registers that were saved upon procedure entry.<br />

3. Pop the stack frame by adding the frame size to $sp.<br />

4. Return by jumping to the address in register $ra.<br />

recursive procedures: Procedures that call themselves either directly or indirectly through a chain of calls.

Elaboration: A programming language that does not permit recursive procedures—<br />

procedures that call themselves either directly or indirectly through a chain of calls—need<br />

not allocate frames on a stack. In a nonrecursive language, each procedure’s frame<br />

may be statically allocated, since only one invocation of a procedure can be active at a<br />

time. Older versions of Fortran prohibited recursion, because statically allocated frames<br />

produced faster code on some older machines. However, on load-store architectures like

MIPS, stack frames may be just as fast, because a frame pointer register points directly



to the active stack frame, which permits a single load or store instruction to access

values in the frame. In addition, recursion is a valuable programming technique.<br />

Procedure Call Example<br />

As an example, consider the C routine<br />

main ()<br />

{<br />

printf ("The factorial of 10 is %d\n", fact (10));

}<br />

int fact (int n)<br />

{<br />

if (n < 1)<br />

return (1);<br />

else<br />

return (n * fact (n - 1));<br />

}<br />

which computes <strong>and</strong> prints 10! (the factorial of 10, 10! = 10 × 9 × . . . × 1). fact is<br />

a recursive routine that computes n! by multiplying n times (n - 1)!. The assembly<br />

code for this routine illustrates how programs manipulate stack frames.<br />

Upon entry, the routine main creates its stack frame and saves the two callee-saved registers it will modify: $fp and $ra. The frame is larger than required for these two registers because the calling convention requires the minimum size of a

stack frame to be 24 bytes. This minimum frame can hold four argument registers<br />

($a0–$a3) <strong>and</strong> the return address $ra, padded to a double-word boundary<br />

(24 bytes). Since main also needs to save $fp, its stack frame must be two words<br />

larger (remember: the stack pointer is kept doubleword aligned).<br />

.text<br />

.globl main<br />

main:<br />

subu $sp,$sp,32 # Stack frame is 32 bytes long<br />

sw $ra,20($sp) # Save return address<br />

sw $fp,16($sp) # Save old frame pointer<br />

addiu $fp,$sp,28 # Set up frame pointer<br />

The routine main then calls the factorial routine <strong>and</strong> passes it the single argument<br />

10. After fact returns, main calls the library routine printf <strong>and</strong> passes it both<br />

a format string <strong>and</strong> the result returned from fact:



li $a0,10 # Put argument (10) in $a0<br />

jal fact # Call factorial function<br />

la $a0,$LC # Put format string in $a0<br />

move $a1,$v0 # Move fact result to $a1<br />

jal printf # Call the print function<br />

Finally, after printing the factorial, main returns. But first, it must restore the<br />

registers it saved <strong>and</strong> pop its stack frame:<br />

lw $ra,20($sp) # Restore return address<br />

lw $fp,16($sp) # Restore frame pointer<br />

addiu $sp,$sp,32 # Pop stack frame<br />

jr $ra # Return to caller<br />

.rdata<br />

$LC:<br />

.ascii "The factorial of 10 is %d\n\000"

The factorial routine is similar in structure to main. First, it creates a stack frame<br />

<strong>and</strong> saves the callee-saved registers it will use. In addition to saving $ra <strong>and</strong> $fp,<br />

fact also saves its argument ($a0), which it will use for the recursive call:<br />

.text<br />

fact:<br />

subu $sp,$sp,32 # Stack frame is 32 bytes long<br />

sw $ra,20($sp) # Save return address<br />

sw $fp,16($sp) # Save frame pointer<br />

addiu $fp,$sp,28 # Set up frame pointer<br />

sw $a0,0($fp) # Save argument (n)<br />

The heart of the fact routine performs the computation from the C program.<br />

It tests whether the argument is greater than 0. If not, the routine returns the<br />

value 1. If the argument is greater than 0, the routine recursively calls itself to<br />

compute fact(n–1) <strong>and</strong> multiplies that value times n:<br />

lw $v0,0($fp) # Load n<br />

bgtz $v0,$L2 # Branch if n > 0<br />

li $v0,1 # Return 1<br />

b $L1 # Jump to code to return

$L2:<br />

lw $v1,0($fp) # Load n<br />

subu $v0,$v1,1 # Compute n - 1<br />

move $a0,$v0 # Move value to $a0



jal fact # Call factorial function<br />

lw $v1,0($fp) # Load n<br />

mul $v0,$v0,$v1 # Compute fact(n-1) * n<br />

Finally, the factorial routine restores the callee-saved registers <strong>and</strong> returns the<br />

value in register $v0:<br />

$L1: # Result is in $v0<br />

lw $ra, 20($sp) # Restore $ra<br />

lw $fp, 16($sp) # Restore $fp<br />

addiu $sp, $sp, 32 # Pop stack<br />

jr $ra # Return to caller<br />

Stack in Recursive Procedure<br />

Figure A.6.3 shows the stack at the call fact(7). main runs first, so its frame<br />

is deepest on the stack. main calls fact(10), whose stack frame is next on the<br />

stack. Each invocation recursively invokes fact to compute the next-lowest<br />

factorial. The stack frames parallel the LIFO order of these calls. What does the<br />

stack look like when the call to fact(10) returns?<br />

EXAMPLE<br />

FIGURE A.6.3 Stack frames during the call of fact(7). main's frame, deepest on the stack, holds the old $ra and old $fp that main saved on entry. Above it (at lower addresses, since the stack grows downward) are the frames for fact(10), fact(9), fact(8), and fact(7); each holds the old $a0 (the argument n), old $ra, and old $fp saved by that invocation.


ANSWER

When the call to fact(10) returns, all of the fact frames have been popped, and the stack again contains only main's frame, which holds the old $ra and old $fp that main saved on entry.

Elaboration: The difference between the MIPS compiler <strong>and</strong> the gcc compiler is that<br />

the MIPS compiler usually does not use a frame pointer, so this register is available as<br />

another callee-saved register, $s8. This change saves a couple of instructions in the<br />

procedure call <strong>and</strong> return sequence. However, it complicates code generation, because<br />

a procedure must access its stack frame with $sp, whose value can change during a<br />

procedure’s execution if values are pushed on the stack.<br />

Another Procedure Call Example<br />

As another example, consider the following routine that computes the tak function,<br />

which is a widely used benchmark created by Ikuo Takeuchi. This function<br />

does not compute anything useful, but is a heavily recursive program that illustrates<br />

the MIPS calling convention.<br />

int tak (int x, int y, int z)<br />

{<br />

if (y < x)<br />

return 1+ tak (tak (x - 1, y, z),<br />

tak (y - 1, z, x),<br />

tak (z - 1, x, y));<br />

else<br />

return z;<br />

}<br />

int main ()<br />

{<br />

tak(18, 12, 6);<br />

}<br />

The assembly code for this program is shown below. The tak function first saves<br />

its return address in its stack frame and its arguments in callee-saved registers,

since the routine may make calls that need to use registers $a0–$a2 <strong>and</strong> $ra. The<br />

function uses callee-saved registers, since they hold values that persist over the



lifetime of the function, which includes several calls that could potentially modify<br />

registers.<br />

.text<br />

.globl tak

tak:<br />

subu $sp, $sp, 40<br />

sw $ra, 32($sp)<br />

sw $s0, 16($sp) # x<br />

move $s0, $a0<br />

sw $s1, 20($sp) # y<br />

move $s1, $a1<br />

sw $s2, 24($sp) # z<br />

move $s2, $a2<br />

sw $s3, 28($sp) # temporary<br />

The routine then begins execution by testing if y < x. If not, it branches to label<br />

L1, which is shown below.<br />

bge $s1, $s0, L1 # if (y < x)<br />

If y < x, then it executes the body of the routine, which contains four recursive<br />

calls. The first call uses almost the same arguments as its parent:<br />

addiu $a0, $s0, -1<br />

move $a1, $s1<br />

move $a2, $s2<br />

jal tak # tak (x - 1, y, z)<br />

move $s3, $v0<br />

Note that the result from the first recursive call is saved in register $s3, so that it<br />

can be used later.<br />

The function now prepares arguments for the second recursive call.<br />

addiu $a0, $s1, -1<br />

move $a1, $s2<br />

move $a2, $s0<br />

jal tak # tak (y - 1, z, x)<br />

In the instructions below, the result from this recursive call is saved in register<br />

$s0. But first we need to read, for the last time, the saved value of the first argument<br />

from this register.



addiu $a0, $s2, -1<br />

move $a1, $s0<br />

move $a2, $s1<br />

move $s0, $v0<br />

jal tak # tak (z - 1, x, y)<br />

After the three inner recursive calls, we are ready for the final recursive call. After<br />

the call, the function’s result is in $v0 <strong>and</strong> control jumps to the function’s epilogue.<br />

move $a0, $s3<br />

move $a1, $s0<br />

move $a2, $v0<br />

jal tak # tak (tak(...), tak(...), tak(...))<br />

addiu $v0, $v0, 1<br />

j L2<br />

The code at label L1 is the else part of the if-then-else statement. It just moves

the value of argument z into the return register <strong>and</strong> falls into the function epilogue.<br />

L1:<br />

move $v0, $s2<br />

The code below is the function epilogue, which restores the saved registers <strong>and</strong><br />

returns the function’s result to its caller.<br />

L2:<br />

lw $ra, 32($sp)<br />

lw $s0, 16($sp)<br />

lw $s1, 20($sp)<br />

lw $s2, 24($sp)<br />

lw $s3, 28($sp)<br />

addiu $sp, $sp, 40<br />

jr $ra<br />

The main routine calls the tak function with its initial arguments, then takes the<br />

computed result (7) <strong>and</strong> prints it using SPIM’s system call for printing integers.<br />

.globl main<br />

main:<br />

subu $sp, $sp, 24<br />

sw $ra, 16($sp)<br />

li $a0, 18<br />

li $a1, 12
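The remainder of main falls on a page missing from this extraction. Based on the description above (load the third argument, call tak, print the computed result with SPIM's system call for printing integers, and return), a hedged reconstruction might read:

li $a2, 6 # Put third argument (6) in $a2
jal tak # Call tak(18, 12, 6)
move $a0, $v0 # Move result to $a0 for printing
li $v0, 1 # System call code for print_int (assumed from Figure A.9.1)
syscall # Print the result
lw $ra, 16($sp) # Restore return address
addiu $sp, $sp, 24 # Pop stack frame
jr $ra # Return to caller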


A.7 Exceptions and Interrupts

These seven registers are part of coprocessor 0’s register set. They are accessed<br />

by the mfc0 <strong>and</strong> mtc0 instructions. After an exception, register EPC contains the<br />

address of the instruction that was executing when the exception occurred. If the<br />

exception was caused by an external interrupt, then the instruction will not have<br />

started executing. All other exceptions are caused by the execution of the instruction<br />

at EPC, except when the offending instruction is in the delay slot of a branch<br />

or jump. In that case, EPC points to the branch or jump instruction <strong>and</strong> the BD bit<br />

is set in the Cause register. When that bit is set, the exception h<strong>and</strong>ler must look<br />

at EPC + 4 for the offending instruction. However, in either case, an exception

h<strong>and</strong>ler properly resumes the program by returning to the instruction at EPC.<br />

If the instruction that caused the exception made a memory access, register<br />

BadVAddr contains the referenced memory location’s address.<br />

The Count register is a timer that increments at a fixed rate (by default, every<br />

10 milliseconds) while SPIM is running. When the value in the Count register<br />

equals the value in the Compare register, a hardware interrupt at priority level 5<br />

occurs.<br />
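As a hedged sketch (the coprocessor 0 register numbers 9 for Count and 11 for Compare are assumed here from SPIM's documentation; they are not stated in this excerpt), a program could arrange for such an interrupt roughly as follows:

mfc0 $t0, $9 # Read current Count (coprocessor 0 register 9, assumed)
addiu $t0, $t0, 100 # Choose a value 100 ticks in the future
mtc0 $t0, $11 # Write it to Compare (coprocessor 0 register 11, assumed)
# When Count reaches Compare, a level 5 hardware interrupt is requested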

Figure A.7.1 shows the subset of the Status register fields implemented by the<br />

MIPS simulator SPIM. The interrupt mask field contains a bit for each of the<br />

six hardware <strong>and</strong> two software interrupt levels. A mask bit that is 1 allows interrupts<br />

at that level to interrupt the processor. A mask bit that is 0 disables interrupts<br />

at that level. When an interrupt arrives, it sets its interrupt pending bit in the<br />

Cause register, even if the mask bit is disabled. When an interrupt is pending, it will<br />

interrupt the processor when its mask bit is subsequently enabled.<br />

The user mode bit is 0 if the processor is running in kernel mode <strong>and</strong> 1 if it is<br />

running in user mode. On SPIM, this bit is fixed at 1, since the SPIM processor<br />

does not implement kernel mode. The exception level bit is normally 0, but is set to<br />

1 after an exception occurs. When this bit is 1, interrupts are disabled <strong>and</strong> the EPC<br />

is not updated if another exception occurs. This bit prevents an exception h<strong>and</strong>ler<br />

from being disturbed by an interrupt or exception, but it should be reset when the<br />

h<strong>and</strong>ler finishes. If the interrupt enable bit is 1, interrupts are allowed. If it is<br />

0, they are disabled.<br />

Figure A.7.2 shows the subset of Cause register fields that SPIM implements.<br />

The branch delay bit is 1 if the last exception occurred in an instruction executed in<br />

the delay slot of a branch. The interrupt pending bits become 1 when an interrupt



faults are requests from a process to the operating system to perform a service,<br />

such as bringing in a page from disk. The operating system processes these requests<br />

and resumes the process. The final type of exception is an interrupt from an external device. Interrupts generally cause the operating system to move data to or from an I/O

device <strong>and</strong> resume the interrupted process.<br />

The code in the example below is a simple exception h<strong>and</strong>ler, which invokes<br />

a routine to print a message at each exception (but not interrupts). This code is<br />

similar to the exception h<strong>and</strong>ler (exceptions.s) used by the SPIM simulator.<br />

Exception H<strong>and</strong>ler<br />

EXAMPLE<br />

The exception h<strong>and</strong>ler first saves register $at, which is used in pseudoinstructions<br />

in the h<strong>and</strong>ler code, then saves $a0 <strong>and</strong> $a1, which it later uses to<br />

pass arguments. The exception h<strong>and</strong>ler cannot store the old values from these<br />

registers on the stack, as would an ordinary routine, because the cause of the<br />

exception might have been a memory reference that used a bad value (such<br />

as 0) in the stack pointer. Instead, the exception h<strong>and</strong>ler stores these registers<br />

in an exception h<strong>and</strong>ler register ($k1, since it can’t access memory without<br />

using $at) <strong>and</strong> two memory locations (save0 <strong>and</strong> save1). If the exception<br />

routine itself could be interrupted, two locations would not be enough since<br />

the second exception would overwrite values saved during the first exception.<br />

However, this simple exception h<strong>and</strong>ler finishes running before it enables<br />

interrupts, so the problem does not arise.<br />

.ktext 0x80000180<br />

move $k1, $at # Save $at register

sw $a0, save0 # H<strong>and</strong>ler is not re-entrant <strong>and</strong> can’t use<br />

sw $a1, save1 # stack to save $a0, $a1<br />

# Don’t need to save $k0/$k1<br />

The exception h<strong>and</strong>ler then moves the Cause <strong>and</strong> EPC registers into CPU<br />

registers. The Cause <strong>and</strong> EPC registers are not part of the CPU register set.<br />

Instead, they are registers in coprocessor 0, which is the part of the CPU that handles exceptions. The instruction mfc0 $k0, $13 moves coprocessor 0's

register 13 (the Cause register) into CPU register $k0. Note that the exception<br />

h<strong>and</strong>ler need not save registers $k0 <strong>and</strong> $k1, because user programs are not<br />

supposed to use these registers. The exception h<strong>and</strong>ler uses the value from the<br />

Cause register to test whether the exception was caused by an interrupt (see the preceding table). If so, the exception is ignored. If the exception was not an

interrupt, the h<strong>and</strong>ler calls print_excp to print a message.



mfc0 $k0, $13 # Move Cause into $k0<br />

srl $a0, $k0, 2 # Extract ExcCode field<br />

andi $a0, $a0, 0xf
beq $a0, 0, done # Branch if ExcCode is Int (0)
move $a0, $k0 # Move Cause into $a0
mfc0 $a1, $14 # Move EPC into $a1

jal print_excp # Print exception error message<br />

Before returning, the exception h<strong>and</strong>ler clears the Cause register; resets<br />

the Status register to enable interrupts <strong>and</strong> clear the EXL bit, which allows<br />

subse quent exceptions to change the EPC register; <strong>and</strong> restores registers $a0,<br />

$a1, <strong>and</strong> $at. It then executes the eret (exception return) instruction, which<br />

returns to the instruction pointed to by EPC. This exception h<strong>and</strong>ler returns<br />

to the instruction following the one that caused the exception, so as to not<br />

re-execute the faulting instruction <strong>and</strong> cause the same exception again.<br />

done: mfc0 $k0, $14 # Bump EPC<br />

addiu $k0, $k0, 4 # Do not re-execute<br />

# faulting instruction<br />

mtc0 $k0, $14 # EPC<br />

mtc0 $0, $13 # Clear Cause register<br />

mfc0 $k0, $12 # Fix Status register<br />

andi $k0, 0xfffd # Clear EXL bit
ori $k0, 0x1 # Enable interrupts

mtc0 $k0, $12<br />

lw $a0, save0 # Restore registers<br />

lw $a1, save1<br />

move $at, $k1
eret # Return to EPC

.kdata<br />

save0: .word 0<br />

save1: .word 0



Elaboration: On real MIPS processors, the return from an exception h<strong>and</strong>ler is more<br />

complex. The exception h<strong>and</strong>ler cannot always jump to the instruction following EPC. For<br />

example, if the instruction that caused the exception was in a branch instruction’s delay<br />

slot (see Chapter 4), the next instruction to execute may not be the following instruction<br />

in memory.<br />

A.8 Input <strong>and</strong> Output<br />

SPIM simulates one I/O device: a memory-mapped console on which a program<br />

can read <strong>and</strong> write characters. When a program is running, SPIM connects its<br />

own terminal (or a separate console window in the X-window version xspim or<br />

the Windows version PCSpim) to the processor. A MIPS program running on<br />

SPIM can read the characters that you type. In addition, if the MIPS program<br />

writes characters to the terminal, they appear on SPIM’s terminal or console window.<br />

One exception to this rule is control-C: this character is not passed to the<br />

program, but instead causes SPIM to stop <strong>and</strong> return to comm<strong>and</strong> mode. When<br />

the program stops running (for example, because you typed control-C or because<br />

the program hit a breakpoint), the terminal is reconnected to SPIM so you can type<br />

SPIM comm<strong>and</strong>s.<br />

To use memory-mapped I/O (see below), spim or xspim must be started<br />

with the -mapped_io flag. PCSpim can enable memory-mapped I/O through a<br />

comm<strong>and</strong> line flag or the “Settings” dialog.<br />

The terminal device consists of two independent units: a receiver <strong>and</strong> a transmitter.<br />

The receiver reads characters from the keyboard. The transmitter displays<br />

characters on the console. The two units are completely independent. This means,<br />

for example, that characters typed at the keyboard are not automatically echoed on<br />

the display. Instead, a program echoes a character by reading it from the receiver<br />

<strong>and</strong> writing it to the transmitter.<br />

A program controls the terminal with four memory-mapped device registers,<br />

as shown in Figure A.8.1. “Memory-mapped’’ means that each register appears as<br />

a special memory location. The Receiver Control register is at location ffff0000 hex .<br />

Only two of its bits are actually used. Bit 0 is called “ready’’: if it is 1, it means<br />

that a character has arrived from the keyboard but has not yet been read from the<br />

Receiver Data register. The ready bit is read-only: writes to it are ignored. The ready<br />

bit changes from 0 to 1 when a character is typed at the keyboard, <strong>and</strong> it changes<br />

from 1 to 0 when the character is read from the Receiver Data register.
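As a hedged sketch of polling the receiver (the Receiver Data address ffff0004hex is assumed here from SPIM's usual register layout; this excerpt states only the Receiver Control address), a program can wait for and read one character like this:

lui $t0, 0xffff # $t0 = ffff0000 hex, base of the terminal's device registers
poll_rx:
lw $t1, 0($t0) # Read Receiver Control
andi $t1, $t1, 1 # Isolate the ready bit
beq $t1, $zero, poll_rx # Spin until a character has arrived
lw $a0, 4($t0) # Read the character from Receiver Data (ffff0004 hex, assumed)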



<strong>and</strong> is read-only. If this bit is 1, the transmitter is ready to accept a new character<br />

for output. If it is 0, the transmitter is still busy writing the previous character.<br />

Bit 1 is “interrupt enable’’ <strong>and</strong> is readable <strong>and</strong> writable. If this bit is set to 1, then<br />

the terminal requests an interrupt at hardware level 0 whenever the transmitter is<br />

ready for a new character, <strong>and</strong> the ready bit becomes 1.<br />

The final device register is the Transmitter Data register (at address ffff000c hex ).<br />

When a value is written into this location, its low-order eight bits (i.e., an ASCII<br />

character as in Figure 2.15 in Chapter 2) are sent to the console. When the Transmitter<br />

Data register is written, the ready bit in the Transmitter Control register is<br />

reset to 0. This bit stays 0 until enough time has elapsed to transmit the character<br />

to the terminal; then the ready bit becomes 1 again. The Trans mitter Data register<br />

should only be written when the ready bit of the Transmitter Control register is 1.<br />

If the transmitter is not ready, writes to the Transmitter Data register are ignored<br />

(the write appears to succeed but the character is not output).<br />
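A matching hedged sketch for output (the Transmitter Control address ffff0008hex is assumed; only the Transmitter Data address ffff000chex is given above):

lui $t0, 0xffff # Base of the terminal's device registers
poll_tx:
lw $t1, 8($t0) # Read Transmitter Control (ffff0008 hex, assumed)
andi $t1, $t1, 1 # Isolate the ready bit
beq $t1, $zero, poll_tx # Spin until the transmitter is free
sw $a0, 0xc($t0) # Write the character to Transmitter Data (ffff000c hex)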

Real computers require time to send characters to a console or terminal. These<br />

time lags are simulated by SPIM. For example, after the transmitter starts to write a<br />

character, the transmitter’s ready bit becomes 0 for a while. SPIM measures time in<br />

instructions executed, not in real clock time. This means that the transmitter does<br />

not become ready again until the processor executes a fixed number of instructions.<br />

If you stop the machine <strong>and</strong> look at the ready bit, it will not change. However, if you<br />

let the machine run, the bit eventually changes back to 1.<br />

A.9 SPIM<br />

SPIM is a software simulator that runs assembly language programs written for<br />

processors that implement the MIPS-32 architecture, specifically Release 1 of this<br />

architecture with a fixed memory mapping, no caches, <strong>and</strong> only coprocessors 0<br />

and 1.² SPIM's name is just MIPS spelled backwards. SPIM can read and immediately

execute assembly language files. SPIM is a self-contained system for running MIPS programs. It contains a debugger and provides a few operating system-like services. SPIM is much slower than a real computer (100 or more times). However, its low cost and wide availability cannot be matched by real hardware!

2. Earlier versions of SPIM (before 7.0) implemented the MIPS-1 architecture used in the original MIPS R2000 processors. This architecture is almost a proper subset of the MIPS-32 architecture, with the difference being the manner in which exceptions are handled. MIPS-32 also introduced approximately 60 new instructions, which are supported by SPIM. Programs that ran on the earlier versions of SPIM and did not use exceptions should run unmodified on newer versions of SPIM. Programs that used exceptions will require minor changes.

An obvious question is, “Why use a simulator when most people have PCs that<br />

contain processors that run significantly faster than SPIM?” One reason is that<br />

the processors in PCs are Intel 80x86s, whose architecture is far less regular and

far more complex to underst<strong>and</strong> <strong>and</strong> program than MIPS processors. The MIPS<br />

architecture may be the epitome of a simple, clean RISC machine.<br />

In addition, simulators can provide a better environment for assembly programming than an actual machine, because they can detect more errors and present a friendlier interface.

Finally, simulators are useful tools in studying computers <strong>and</strong> the programs that<br />

run on them. Because they are implemented in software, not silicon, simulators can<br />

be examined <strong>and</strong> easily modified to add new instructions, build new systems such<br />

as multiprocessors, or simply collect data.<br />

Simulation of a Virtual Machine<br />

The basic MIPS architecture is difficult to program directly because of delayed<br />

branches, delayed loads, <strong>and</strong> restricted address modes. This difficulty is tolerable<br />

since these computers were designed to be programmed in high-level languages<br />

<strong>and</strong> present an interface designed for compilers rather than assembly language<br />

programmers. A good part of the programming complexity results from delayed<br />

instructions. A delayed branch requires two cycles to execute (see the Elaborations<br />

on pages 284 <strong>and</strong> 322 of Chapter 4). In the second cycle, the instruction immediately<br />

following the branch executes. This instruction can perform useful work<br />

that normally would have been done before the branch. It can also be a nop (no<br />

operation) that does nothing. Similarly, delayed loads require two cycles to bring<br />

a value from memory, so the instruction immediately following a load cannot use<br />

the value (see Section 4.2 of Chapter 4).<br />

MIPS wisely chose to hide this complexity by having its assembler implement<br />

a virtual machine. This virtual computer appears to have nondelayed branches<br />

<strong>and</strong> loads <strong>and</strong> a richer instruction set than the actual hardware. The assembler<br />

reorganizes (rearranges) instructions to fill the delay slots. The virtual computer

also provides pseudoinstructions, which appear as real instructions in assembly<br />

language programs. The hardware, however, knows nothing about pseudoinstructions,

so the assembler must translate them into equivalent sequences of actual<br />

machine instructions. For example, the MIPS hardware only provides instructions<br />

to branch when a register is equal to or not equal to 0. Other conditional branches,<br />

such as one that branches when one register is greater than another, are synthesized<br />

by comparing the two registers <strong>and</strong> branching when the result of the comparison<br />

is true (nonzero).<br />
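For example, a sketch of the usual expansion of a branch-greater-than pseudoinstruction (the label target is arbitrary):

bgt $t0, $t1, target # what the programmer writes
# what the assembler may substitute, using its reserved register $at:
slt $at, $t1, $t0 # $at = 1 if $t1 < $t0, that is, if $t0 > $t1
bne $at, $zero, target # branch when the comparison is true (nonzero)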

virtual machine: A virtual computer that appears to have nondelayed branches and loads and a richer instruction set than the actual hardware.



Another surprise (which occurs on the real machine as well) is that a pseudoinstruction<br />

exp<strong>and</strong>s to several machine instructions. When you single-step or<br />

examine memory, the instructions that you see are different from the source

program. The correspondence between the two sets of instructions is fairly simple,<br />

since SPIM does not reorganize instructions to fill slots.<br />

Byte Order<br />

Processors can number bytes within a word so the byte with the lowest number is<br />

either the leftmost or rightmost one. The convention used by a machine is called<br />

its byte order. MIPS processors can operate with either big-endian or little-endian<br />

byte order. For example, in a big-endian machine, the directive .byte 0, 1, 2, 3<br />

would result in a memory word whose bytes, read from left (most significant) to right, are numbered 0, 1, 2, 3, while in a little-endian machine the bytes of the word are numbered 3, 2, 1, 0 in the same left-to-right order.

SPIM operates with both byte orders. SPIM’s byte order is the same as the byte<br />

order of the underlying machine that runs the simulator. For example, on an Intel<br />

80x86, SPIM is little-endian, while on a Macintosh or Sun SPARC, SPIM is big-endian.

System Calls<br />

SPIM provides a small set of operating system–like services through the system<br />

call (syscall) instruction. To request a service, a program loads the system call<br />

code (see Figure A.9.1) into register $v0 <strong>and</strong> arguments into registers $a0–$a3<br />

(or $f12 for floating-point values). System calls that return values put their results<br />

in register $v0 (or $f0 for floating-point results). For example, the following code

prints "the answer = 5":<br />

.data<br />

str:<br />

.asciiz "the answer = "

.text
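The rest of this example lies on a page missing from this extraction. A hedged completion, assuming the usual SPIM call codes from Figure A.9.1 (4 for print_string and 1 for print_int), would be:

li $v0, 4 # System call code for print_string (assumed)
la $a0, str # Address of string to print
syscall # Print "the answer = "
li $v0, 1 # System call code for print_int (assumed)
li $a0, 5 # Integer to print
syscall # Print 5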


A.10 MIPS R2000 Assembly Language

lui $at, 4096<br />

addu $at, $at, $a1<br />

lw $a0, 8($at)<br />

The first instruction loads the upper bits of the label's address into register $at, which

is the register that the assembler reserves for its own use. The second instruction adds<br />

the contents of register $a1 to the label’s partial address. Finally, the load instruction<br />

uses the hardware address mode to add the sum of the lower bits of the label’s address<br />

<strong>and</strong> the offset from the original instruction to the value in register $at.<br />

Assembler Syntax<br />

Comments in assembler files begin with a sharp sign (#). Everything from the<br />

sharp sign to the end of the line is ignored.<br />

Identifiers are a sequence of alphanumeric characters, underbars (_), <strong>and</strong> dots<br />

(.) that do not begin with a number. Instruction opcodes are reserved words that<br />

cannot be used as identifiers. Labels are declared by putting them at the beginning<br />

of a line followed by a colon, for example:<br />

.data<br />

item: .word 1<br />

.text<br />

.globl main # Must be global
main: lw $t0, item

Numbers are base 10 by default. If they are preceded by 0x, they are interpreted<br />

as hexadecimal. Hence, 256 <strong>and</strong> 0x100 denote the same value.<br />

Strings are enclosed in double quotes ("). Special characters in strings follow the

C convention:<br />

■ newline \n<br />

■ tab \t<br />

■ quote \"

SPIM supports a subset of the MIPS assembler directives:<br />

.align n: Align the next datum on a 2^n byte boundary. For example, .align 2 aligns the next value on a word boundary. .align 0 turns off automatic alignment of .half, .word, .float, and .double directives until the next .data or .kdata directive.

.ascii str: Store the string str in memory, but do not null-terminate it.

.asciiz str: Store the string str in memory and null-terminate it.

.byte b1,..., bn: Store the n values in successive bytes of memory.

.data <addr>: Subsequent items are stored in the data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.

.double d1,..., dn: Store the n floating-point double precision numbers in successive memory locations.

.extern sym size: Declare that the datum stored at sym is size bytes large and is a global label. This directive enables the assembler to store the datum in a portion of the data segment that is efficiently accessed via register $gp.

.float f1,..., fn: Store the n floating-point single precision numbers in successive memory locations.

.globl sym: Declare that label sym is global and can be referenced from other files.

.half h1,..., hn: Store the n 16-bit quantities in successive memory halfwords.

.kdata <addr>: Subsequent data items are stored in the kernel data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.

.ktext <addr>: Subsequent items are put in the kernel text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.

.set noat and .set at: The first directive prevents SPIM from complaining about subsequent instructions that use register $at. The second directive re-enables the warning. Since pseudoinstructions expand into code that uses register $at, programmers must be very careful about leaving values in this register.

.space n: Allocates n bytes of space in the current segment (which must be the data segment in SPIM).

.text <addr>: Subsequent items are put in the user text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.

.word w1,..., wn: Store the n 32-bit quantities in successive memory words.

SPIM does not distinguish various parts of the data segment (.data, .rdata, <strong>and</strong><br />

.sdata).<br />
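A brief, hedged sketch of several of these directives in use (the labels and values are invented for illustration):

.data
.align 2 # Align the next item on a word boundary
vec: .word 1, 2, 3, 4 # Four 32-bit values
msg: .asciiz "done\n" # Null-terminated string
buf: .space 64 # Reserve 64 bytes
.text
.globl main # main must be global
main: lw $t0, vec # Load the first element of vec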

Encoding MIPS Instructions<br />

Figure A.10.2 explains how a MIPS instruction is encoded in a binary number.<br />

Each column contains instruction encodings for a field (a contiguous group of<br />

bits) from an instruction. The numbers at the left margin are values for a field.<br />

For example, the j opcode has a value of 2 in the opcode field. The text at the top<br />

of a column names a field <strong>and</strong> specifies which bits it occupies in an instruction.<br />

For example, the op field is contained in bits 26–31 of an instruction. This field<br />

encodes most instructions. However, some groups of instructions use additional<br />

fields to distinguish related instructions. For example, the different floating-point<br />

instructions are specified by bits 0–5. The arrows from the first column show which<br />

opcodes use these additional fields.<br />

Instruction Format<br />

The rest of this appendix describes both the instructions implemented by actual<br />

MIPS hardware <strong>and</strong> the pseudoinstructions provided by the MIPS assembler. The<br />

two types of instructions are easily distinguished. Actual instructions depict the<br />

fields in their binary representation. For example, in<br />

Addition (with overflow)<br />

add rd, rs, rt<br />

0 rs rt rd 0 0x20<br />

6 5 5 5 5 6<br />

the add instruction consists of six fields. Each field's size in bits is the small number

below the field. This instruction begins with six bits of 0s. Register specifiers begin<br />

with an r, so the next field is a 5-bit register specifier called rs. This is the same<br />

register that is the second argument in the symbolic assembly at the left of this<br />

line. Another common field is imm16, which is a 16-bit immediate number.
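As a worked instance of this encoding (register numbers taken from the convention above: $s1 is register 17, $s2 is 18, and $t0 is 8), the instruction add $t0, $s1, $s2 fills the six fields as follows:

op = 0, rs = 17 ($s1), rt = 18 ($s2), rd = 8 ($t0), shamt = 0, funct = 0x20
binary: 000000 10001 10010 01000 00000 100000
hexadecimal: 0x02324020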



Pseudoinstructions follow roughly the same conventions, but omit instruction<br />

encoding information. For example:<br />

Multiply (without overflow)<br />

mul rdest, rsrc1, src2<br />

pseudoinstruction<br />

In pseudoinstructions, rdest <strong>and</strong> rsrc1 are registers <strong>and</strong> src2 is either a register<br />

or an immediate value. In general, the assembler <strong>and</strong> SPIM translate a more<br />

general form of an instruction (e.g., add $v1, $a0, 0x55) to a specialized form<br />

(e.g., addi $v1, $a0, 0x55).<br />

Arithmetic <strong>and</strong> Logical Instructions<br />

Absolute value<br />

abs rdest, rsrc<br />

pseudoinstruction<br />

Put the absolute value of register rsrc in register rdest.<br />

Addition (with overflow)<br />

add rd, rs, rt<br />

0 rs rt rd 0 0x20<br />

6 5 5 5 5 6<br />

Addition (without overflow)<br />

addu rd, rs, rt<br />

0 rs rt rd 0 0x21<br />

6 5 5 5 5 6<br />

Put the sum of registers rs <strong>and</strong> rt into register rd.<br />

Addition immediate (with overflow)<br />

addi rt, rs, imm<br />

8 rs rt imm<br />

6 5 5 16<br />

Addition immediate (without overflow)<br />

addiu rt, rs, imm<br />

9 rs rt imm<br />

6 5 5 16<br />

Put the sum of register rs <strong>and</strong> the sign-extended immediate into register rt.



AND<br />

<strong>and</strong> rd, rs, rt<br />

0 rs rt rd 0 0x24<br />

6 5 5 5 5 6<br />

Put the logical AND of registers rs <strong>and</strong> rt into register rd.<br />

AND immediate<br />

<strong>and</strong>i rt, rs, imm<br />

0xc rs rt imm<br />

6 5 5 16<br />

Put the logical AND of register rs <strong>and</strong> the zero-extended immediate into register<br />

rt.<br />

Count leading ones<br />

clo rd, rs<br />

0x1c rs 0 rd 0 0x21<br />

6 5 5 5 5 6<br />

Count leading zeros<br />

clz rd, rs<br />

0x1c rs 0 rd 0 0x20<br />

6 5 5 5 5 6<br />

Count the number of leading ones (zeros) in the word in register rs <strong>and</strong> put<br />

the result into register rd. If a word is all ones (zeros), the result is 32.<br />

Divide (with overflow)<br />

div rs, rt<br />

0 rs rt 0 0x1a<br />

6 5 5 10 6<br />

Divide (without overflow)<br />

divu rs, rt<br />

0 rs rt 0 0x1b<br />

6 5 5 10 6<br />

Divide register rs by register rt. Leave the quotient in register lo <strong>and</strong> the remainder<br />

in register hi. Note that if an oper<strong>and</strong> is negative, the remainder is unspecified<br />

by the MIPS architecture <strong>and</strong> depends on the convention of the machine on which<br />

SPIM is run.
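For instance, a short sketch of retrieving both results after a divide (the register choices are arbitrary):

div $t0, $t1 # lo = $t0 / $t1, hi = $t0 mod $t1
mflo $t2 # Quotient into $t2
mfhi $t3 # Remainder into $t3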



Divide (with overflow)<br />

div rdest, rsrc1, src2<br />

pseudoinstruction<br />

Divide (without overflow)<br />

divu rdest, rsrc1, src2<br />

pseudoinstruction<br />

Put the quotient of register rsrc1 <strong>and</strong> src2 into register rdest.<br />

Multiply<br />

mult rs, rt<br />

0 rs rt 0 0x18<br />

6 5 5 10 6<br />

Unsigned multiply<br />

multu rs, rt<br />

0 rs rt 0 0x19<br />

6 5 5 10 6<br />

Multiply registers rs <strong>and</strong> rt. Leave the low-order word of the product in register<br />

lo <strong>and</strong> the high-order word in register hi.<br />

Multiply (without overflow)<br />

mul rd, rs, rt<br />

0x1c rs rt rd 0 2<br />

6 5 5 5 5 6<br />

Put the low-order 32 bits of the product of rs <strong>and</strong> rt into register rd.<br />

Multiply (with overflow)<br />

mulo rdest, rsrc1, src2<br />

pseudoinstruction<br />

Unsigned multiply (with overflow)<br />

mulou rdest, rsrc1, src2<br />

pseudoinstruction<br />

Put the low-order 32 bits of the product of register rsrc1 <strong>and</strong> src2 into register<br />

rdest.



Multiply add<br />

madd rs, rt<br />

0x1c rs rt 0 0<br />

6 5 5 10 6<br />

Unsigned multiply add<br />

maddu rs, rt<br />

0x1c rs rt 0 1<br />

6 5 5 10 6<br />

Multiply registers rs <strong>and</strong> rt <strong>and</strong> add the resulting 64-bit product to the 64-bit<br />

value in the concatenated registers lo <strong>and</strong> hi.<br />

Multiply subtract<br />

msub rs, rt<br />

0x1c rs rt 0 4<br />

6 5 5 10 6<br />

Unsigned multiply subtract<br />

msubu rs, rt

0x1c rs rt 0 5<br />

6 5 5 10 6<br />

Multiply registers rs <strong>and</strong> rt <strong>and</strong> subtract the resulting 64-bit product from the 64-<br />

bit value in the concatenated registers lo <strong>and</strong> hi.<br />

Negate value (with overflow)<br />

neg rdest, rsrc<br />

pseudoinstruction<br />

Negate value (without overflow)<br />

negu rdest, rsrc<br />

pseudoinstruction<br />

Put the negative of register rsrc into register rdest.<br />

NOR<br />

nor rd, rs, rt<br />

0 rs rt rd 0 0x27<br />

6 5 5 5 5 6<br />

Put the logical NOR of registers rs <strong>and</strong> rt into register rd.



NOT<br />

not rdest, rsrc<br />

pseudoinstruction<br />

Put the bitwise logical negation of register rsrc into register rdest.<br />

OR<br />

or rd, rs, rt<br />

0 rs rt rd 0 0x25<br />

6 5 5 5 5 6<br />

Put the logical OR of registers rs <strong>and</strong> rt into register rd.<br />

OR immediate<br />

ori rt, rs, imm<br />

0xd rs rt imm<br />

6 5 5 16<br />

Put the logical OR of register rs <strong>and</strong> the zero-extended immediate into register rt.<br />

Remainder<br />

rem rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Unsigned remainder<br />

remu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Put the remainder of register rsrc1 divided by register rsrc2 into register rdest.<br />

Note that if an oper<strong>and</strong> is negative, the remainder is unspecified by the MIPS<br />

architecture <strong>and</strong> depends on the convention of the machine on which SPIM is run.<br />

Shift left logical<br />

sll rd, rt, shamt<br />

0 rs rt rd shamt 0<br />

6 5 5 5 5 6<br />

Shift left logical variable<br />

sllv rd, rt, rs<br />

0 rs rt rd 0 4<br />

6 5 5 5 5 6



Shift right arithmetic<br />

sra rd, rt, shamt<br />

0 rs rt rd shamt 3<br />

6 5 5 5 5 6<br />

Shift right arithmetic variable<br />

srav rd, rt, rs<br />

0 rs rt rd 0 7<br />

6 5 5 5 5 6<br />

Shift right logical<br />

srl rd, rt, shamt<br />

0 rs rt rd shamt 2<br />

6 5 5 5 5 6<br />

Shift right logical variable<br />

srlv rd, rt, rs<br />

0 rs rt rd 0 6<br />

6 5 5 5 5 6<br />

Shift register rt left (right) by the distance indicated by immediate shamt or the<br />

register rs <strong>and</strong> put the result in register rd. Note that argument rs is ignored for<br />

sll, sra, <strong>and</strong> srl.<br />

Rotate left<br />

rol rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Rotate right<br />

ror rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Rotate register rsrc1 left (right) by the distance indicated by rsrc2 <strong>and</strong> put the<br />

result in register rdest.<br />

Subtract (with overflow)<br />

sub rd, rs, rt<br />

0 rs rt rd 0 0x22<br />

6 5 5 5 5 6



Subtract (without overflow)<br />

subu rd, rs, rt<br />

0 rs rt rd 0 0x23<br />

6 5 5 5 5 6<br />

Put the difference of registers rs <strong>and</strong> rt into register rd.<br />

Exclusive OR<br />

xor rd, rs, rt<br />

0 rs rt rd 0 0x26<br />

6 5 5 5 5 6<br />

Put the logical XOR of registers rs <strong>and</strong> rt into register rd.<br />

XOR immediate<br />

xori rt, rs, imm<br />

0xe rs rt Imm<br />

6 5 5 16<br />

Put the logical XOR of register rs <strong>and</strong> the zero-extended immediate into register<br />

rt.<br />

Constant-Manipulating Instructions<br />

Load upper immediate<br />

lui rt, imm<br />

0xf 0 rt imm

6 5 5 16<br />

Load the lower halfword of the immediate imm into the upper halfword of register<br />

rt. The lower bits of the register are set to 0.<br />

Load immediate<br />

li rdest, imm<br />

pseudoinstruction<br />

Move the immediate imm into register rdest.<br />

Comparison Instructions<br />

Set less than<br />

slt rd, rs, rt<br />

0 rs rt rd 0 0x2a<br />

6 5 5 5 5 6



Set less than unsigned<br />

sltu rd, rs, rt<br />

0 rs rt rd 0 0x2b<br />

6 5 5 5 5 6<br />

Set register rd to 1 if register rs is less than rt, <strong>and</strong> to 0 otherwise.<br />

Set less than immediate<br />

slti rt, rs, imm<br />

0xa rs rt imm<br />

6 5 5 16<br />

Set less than unsigned immediate<br />

sltiu rt, rs, imm<br />

0xb rs rt imm<br />

6 5 5 16<br />

Set register rt to 1 if register rs is less than the sign-extended immediate, <strong>and</strong> to<br />

0 otherwise.<br />

Set equal<br />

seq rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 equals rsrc2, <strong>and</strong> to 0 otherwise.<br />

Set greater than equal<br />

sge rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set greater than equal unsigned<br />

sgeu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is greater than or equal to rsrc2, <strong>and</strong> to<br />

0 otherwise.<br />

Set greater than<br />

sgt rdest, rsrc1, rsrc2<br />

pseudoinstruction



Set greater than unsigned<br />

sgtu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is greater than rsrc2, <strong>and</strong> to 0 otherwise.<br />

Set less than equal<br />

sle rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set less than equal unsigned<br />

sleu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is less than or equal to rsrc2, <strong>and</strong> to 0<br />

otherwise.<br />

Set not equal<br />

sne rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is not equal to rsrc2, <strong>and</strong> to 0 otherwise.<br />

Branch Instructions<br />

Branch instructions use a signed 16-bit instruction offset field; hence, they can<br />

jump 2^15 − 1 instructions (not bytes) forward or 2^15 instructions backward. The

jump instruction contains a 26-bit address field. In actual MIPS processors, branch<br />

instructions are delayed branches, which do not transfer control until the instruction<br />

following the branch (its “delay slot”) has executed (see Chapter 4). Delayed branches<br />

affect the offset calculation, since it must be computed relative to the address of the<br />

delay slot instruction (PC + 4), which is when the branch occurs. SPIM does not<br />

simulate this delay slot, unless the -bare or -delayed_branch flags are specified.<br />

In assembly code, offsets are not usually specified as numbers. Instead, an<br />

instruction branches to a label, and the assembler computes the distance between

the branch <strong>and</strong> the target instructions.<br />
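As a small illustration (hypothetical label and register choices), a counted loop can be written entirely with a label, leaving the offset computation to the assembler:

      li    $t0, 0             # loop counter
      li    $t1, 10            # loop limit
loop: addi  $t0, $t0, 1        # increment the counter
      bne   $t0, $t1, loop     # assembler computes the instruction offset back to loop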

In MIPS-32, all actual (not pseudo) conditional branch instructions have a<br />

“likely” variant (for example, beq’s likely variant is beql), which does not execute<br />

the instruction in the branch’s delay slot if the branch is not taken. Do not use


A-60 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

these instructions; they may be removed in subsequent versions of the architecture.

SPIM implements these instructions, but they are not described further.<br />

Branch instruction<br />

b label<br />

pseudoinstruction<br />

Unconditionally branch to the instruction at the label.<br />

Branch coprocessor false<br />

bc1f cc label

0x11 8 cc 0 Offset<br />

6 5 3 2 16<br />

Branch coprocessor true<br />

bc1t cc label

0x11 8 cc 1 Offset<br />

6 5 3 2 16<br />

Conditionally branch the number of instructions specified by the offset if the<br />

floating-point coprocessor’s condition flag numbered cc is false (true). If cc is<br />

omitted from the instruction, condition code flag 0 is assumed.<br />

Branch on equal<br />

beq rs, rt, label<br />

4 rs rt Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs equals rt.<br />

Branch on greater than equal zero<br />

bgez rs, label<br />

1 rs 1 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is greater than or equal to 0.


A.10 MIPS R2000 Assembly Language A-61<br />

Branch on greater than equal zero <strong>and</strong> link<br />

bgezal rs, label<br />

1 rs 0x11 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is greater than or equal to 0. Save the address of the next instruction in register<br />

31.<br />

Branch on greater than zero<br />

bgtz rs, label<br />

7 rs 0 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is greater than 0.<br />

Branch on less than equal zero<br />

blez rs, label<br />

6 rs 0 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is less than or equal to 0.<br />

Branch on less than <strong>and</strong> link<br />

bltzal rs, label<br />

1 rs 0x10 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is less than 0. Save the address of the next instruction in register 31.<br />

Branch on less than zero<br />

bltz rs, label<br />

1 rs 0 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is less than 0.


A-62 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Branch on not equal<br />

bne rs, rt, label<br />

5 rs rt Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is not equal to rt.<br />

Branch on equal zero<br />

beqz rsrc, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if rsrc equals 0.<br />

Branch on greater than equal<br />

bge rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Branch on greater than equal unsigned<br />

bgeu rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is greater than<br />

or equal to rsrc2.<br />

Branch on greater than<br />

bgt rsrc1, rsrc2, label

pseudoinstruction<br />

Branch on greater than unsigned<br />

bgtu rsrc1, rsrc2, label

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is greater than<br />

rsrc2.

Branch on less than equal<br />

ble rsrc1, rsrc2, label

pseudoinstruction


A.10 MIPS R2000 Assembly Language A-63<br />

Branch on less than equal unsigned<br />

bleu rsrc1, rsrc2, label

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is less than or<br />

equal to rsrc2.

Branch on less than<br />

blt rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Branch on less than unsigned<br />

bltu rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is less than<br />

rsrc2.<br />

Branch on not equal zero<br />

bnez rsrc, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc is not equal to 0.<br />

Jump Instructions<br />

Jump<br />

j target<br />

2 target<br />

6 26<br />

Unconditionally jump to the instruction at target.<br />

Jump <strong>and</strong> link<br />

jal target<br />

3 target<br />

6 26<br />

Unconditionally jump to the instruction at target. Save the address of the next<br />

instruction in register $ra.


A-64 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Jump <strong>and</strong> link register<br />

jalr rs, rd<br />

0 rs 0 rd 0 9<br />

6 5 5 5 5 6<br />

Unconditionally jump to the instruction whose address is in register rs. Save the<br />

address of the next instruction in register rd (which defaults to 31).<br />

Jump register<br />

jr rs<br />

0 rs 0 8<br />

6 5 15 6<br />

Unconditionally jump to the instruction whose address is in register rs.<br />
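Together, jal and jr form the usual procedure call and return idiom. A minimal sketch (hypothetical labels; a leaf procedure that does not need to save $ra):

      li    $a0, 21            # argument in $a0
      jal   double_it          # call: return address is saved in $ra
      move  $t0, $v0           # after the return, $t0 = 42
      j     done               # skip over the procedure body

double_it:
      add   $v0, $a0, $a0      # hypothetical leaf procedure: return 2 * $a0
      jr    $ra                # jump back to the instruction after the jal

done: nop                      # execution continues here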

Trap Instructions<br />

Trap if equal<br />

teq rs, rt<br />

0 rs rt 0 0x34<br />

6 5 5 10 6<br />

If register rs is equal to register rt, raise a Trap exception.<br />

Trap if equal immediate<br />

teqi rs, imm<br />

1 rs 0xc imm<br />

6 5 5 16<br />

If register rs is equal to the sign-extended value imm, raise a Trap exception.<br />

Trap if not equal<br />

tne rs, rt

0 rs rt 0 0x36<br />

6 5 5 10 6<br />

If register rs is not equal to register rt, raise a Trap exception.<br />

Trap if not equal immediate<br />

tnei rs, imm

1 rs 0xe imm<br />

6 5 5 16<br />

If register rs is not equal to the sign-extended value imm, raise a Trap exception.


A.10 MIPS R2000 Assembly Language A-65<br />

Trap if greater equal<br />

tge rs, rt<br />

0 rs rt 0 0x30<br />

6 5 5 10 6<br />

Unsigned trap if greater equal<br />

tgeu rs, rt<br />

0 rs rt 0 0x31<br />

6 5 5 10 6<br />

If register rs is greater than or equal to register rt, raise a Trap exception.<br />

Trap if greater equal immediate<br />

tgei rs, imm<br />

1 rs 8 imm<br />

6 5 5 16<br />

Unsigned trap if greater equal immediate<br />

tgeiu rs, imm<br />

1 rs 9 imm<br />

6 5 5 16<br />

If register rs is greater than or equal to the sign-extended value imm, raise a Trap<br />

exception.<br />

Trap if less than<br />

tlt rs, rt<br />

0 rs rt 0 0x32<br />

6 5 5 10 6<br />

Unsigned trap if less than<br />

tltu rs, rt<br />

0 rs rt 0 0x33<br />

6 5 5 10 6<br />

If register rs is less than register rt, raise a Trap exception.<br />

Trap if less than immediate<br />

tlti rs, imm<br />

1 rs 0xa imm

6 5 5 16


A-66 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Unsigned trap if less than immediate<br />

tltiu rs, imm<br />

1 rs 0xb imm

6 5 5 16<br />

If register rs is less than the sign-extended value imm, raise a Trap exception.<br />

Load Instructions<br />

Load address<br />

la rdest, address<br />

pseudoinstruction<br />

Load computed address—not the contents of the location—into register rdest.<br />

Load byte<br />

lb rt, address<br />

0x20 rs rt Offset<br />

6 5 5 16<br />

Load unsigned byte<br />

lbu rt, address<br />

0x24 rs rt Offset<br />

6 5 5 16<br />

Load the byte at address into register rt. The byte is sign-extended by lb, but not<br />

by lbu.<br />

Load halfword<br />

lh rt, address<br />

0x21 rs rt Offset<br />

6 5 5 16<br />

Load unsigned halfword<br />

lhu rt, address<br />

0x25 rs rt Offset<br />

6 5 5 16<br />

Load the 16-bit quantity (halfword) at address into register rt. The halfword is<br />

sign-extended by lh, but not by lhu.


A.10 MIPS R2000 Assembly Language A-67<br />

Load word<br />

lw rt, address<br />

0x23 rs rt Offset<br />

6 5 5 16<br />

Load the 32-bit quantity (word) at address into register rt.<br />

Load word coprocessor 1<br />

lwc1 ft, address

0x31 rs rt Offset<br />

6 5 5 16<br />

Load the word at address into register ft in the floating-point unit.<br />

Load word left<br />

lwl rt, address<br />

0x22 rs rt Offset<br />

6 5 5 16<br />

Load word right<br />

lwr rt, address<br />

0x26 rs rt Offset<br />

6 5 5 16<br />

Load the left (right) bytes from the word at the possibly unaligned address into<br />

register rt.<br />

Load doubleword<br />

ld rdest, address<br />

pseudoinstruction<br />

Load the 64-bit quantity at address into registers rdest <strong>and</strong> rdest + 1.<br />

Unaligned load halfword<br />

ulh rdest, address<br />

pseudoinstruction


A-68 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Unaligned load halfword unsigned<br />

ulhu rdest, address<br />

pseudoinstruction<br />

Load the 16-bit quantity (halfword) at the possibly unaligned address into register<br />

rdest. The halfword is sign-extended by ulh, but not ulhu.<br />

Unaligned load word<br />

ulw rdest, address<br />

pseudoinstruction<br />

Load the 32-bit quantity (word) at the possibly unaligned address into register<br />

rdest.<br />

Load linked<br />

ll rt, address<br />

0x30 rs rt Offset<br />

6 5 5 16<br />

Load the 32-bit quantity (word) at address into register rt <strong>and</strong> start an atomic<br />

read-modify-write operation. This operation is completed by a store conditional<br />

(sc) instruction, which will fail if another processor writes into the block containing<br />

the loaded word. Since SPIM does not simulate multiple processors, the store<br />

conditional operation always succeeds.<br />

Store Instructions<br />

Store byte<br />

sb rt, address<br />

0x28 rs rt Offset<br />

6 5 5 16<br />

Store the low byte from register rt at address.<br />

Store halfword<br />

sh rt, address<br />

0x29 rs rt Offset<br />

6 5 5 16<br />

Store the low halfword from register rt at address.


A.10 MIPS R2000 Assembly Language A-69<br />

Store word<br />

sw rt, address<br />

0x2b rs rt Offset<br />

6 5 5 16<br />

Store the word from register rt at address.<br />

Store word coprocessor 1<br />

swc1 ft, address

0x39 rs ft Offset

6 5 5 16<br />

Store the floating-point value in register ft of floating-point coprocessor at address.<br />

Store double coprocessor 1<br />

sdc1 ft, address

0x3d rs ft Offset<br />

6 5 5 16<br />

Store the doubleword floating-point value in registers ft and ft + 1 of the floating-point
coprocessor at address. Register ft must be even numbered.

Store word left<br />

swl rt, address<br />

0x2a rs rt Offset<br />

6 5 5 16<br />

Store word right<br />

swr rt, address<br />

0x2e rs rt Offset<br />

6 5 5 16<br />

Store the left (right) bytes from register rt at the possibly unaligned address.<br />

Store doubleword<br />

sd rsrc, address<br />

pseudoinstruction<br />

Store the 64-bit quantity in registers rsrc <strong>and</strong> rsrc + 1 at address.


A-70 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Unaligned store halfword<br />

ush rsrc, address<br />

pseudoinstruction<br />

Store the low halfword from register rsrc at the possibly unaligned address.<br />

Unaligned store word<br />

usw rsrc, address<br />

pseudoinstruction<br />

Store the word from register rsrc at the possibly unaligned address.<br />

Store conditional<br />

sc rt, address<br />

0x38 rs rt Offset<br />

6 5 5 16<br />

Store the 32-bit quantity (word) in register rt into memory at address and complete
an atomic read-modify-write operation. If this atomic operation is successful, the
memory word is modified and register rt is set to 1. If the atomic operation fails
because another processor wrote to a location in the block containing the addressed
word, this instruction does not modify memory and writes 0 into register rt. Since
SPIM does not simulate multiple processors, the instruction always succeeds.
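A typical use of ll and sc together is an atomic read-modify-write, such as incrementing a shared counter. A sketch (hypothetical label; the counter's address is assumed to be in $a0):

try:  ll    $t0, 0($a0)        # load linked: read the counter and begin the atomic sequence
      addi  $t0, $t0, 1        # compute the new value
      sc    $t0, 0($a0)        # store conditional: $t0 becomes 1 on success, 0 on failure
      beqz  $t0, try           # if the link was broken, retry from the load linked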

Data Movement Instructions<br />

Move<br />

move rdest, rsrc<br />

pseudoinstruction<br />

Move register rsrc to rdest.<br />

Move from hi<br />

mfhi rd<br />

0 0 rd 0 0x10<br />

6 10 5 5 6


A.10 MIPS R2000 Assembly Language A-71<br />

Move from lo<br />

mflo rd<br />

0 0 rd 0 0x12<br />

6 10 5 5 6<br />

The multiply <strong>and</strong> divide unit produces its result in two additional registers, hi<br />

<strong>and</strong> lo. These instructions move values to <strong>and</strong> from these registers. The multiply,<br />

divide, <strong>and</strong> remainder pseudoinstructions that make this unit appear to operate on<br />

the general registers move the result after the computation finishes.<br />

Move the hi (lo) register to register rd.<br />
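For example, a 32 × 32-bit multiply leaves its 64-bit product in hi and lo, from which these instructions retrieve it (a brief sketch using the mult instruction described earlier in this appendix):

mult  $t0, $t1           # 64-bit product of $t0 and $t1 goes to hi:lo
mflo  $t2                # low 32 bits of the product
mfhi  $t3                # high 32 bits of the product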

Move to hi<br />

mthi rs<br />

0 rs 0 0x11<br />

6 5 15 6<br />

Move to lo<br />

mtlo rs<br />

0 rs 0 0x13<br />

6 5 15 6<br />

Move register rs to the hi (lo) register.<br />

Move from coprocessor 0<br />

mfc0 rt, rd<br />

0x10 0 rt rd 0<br />

6 5 5 5 11<br />

Move from coprocessor 1<br />

mfc1 rt, fs

0x11 0 rt fs 0<br />

6 5 5 5 11<br />

Coprocessors have their own register sets. These instructions move values between<br />

these registers <strong>and</strong> the CPU’s registers.<br />

Move register rd in a coprocessor (register fs in the FPU) to CPU register rt. The<br />

floating-point unit is coprocessor 1.


A-72 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Move double from coprocessor 1<br />

mfc1.d rdest, frsrc1<br />

pseudoinstruction<br />

Move floating-point registers frsrc1 <strong>and</strong> frsrc1 + 1 to CPU registers rdest<br />

<strong>and</strong> rdest + 1.<br />

Move to coprocessor 0<br />

mtc0 rd, rt<br />

0x10 4 rt rd 0<br />

6 5 5 5 11<br />

Move to coprocessor 1<br />

mtc1 rd, fs<br />

0x11 4 rt fs 0<br />

6 5 5 5 11<br />

Move CPU register rt to register rd in a coprocessor (register fs in the FPU).<br />

Move conditional not zero<br />

movn rd, rs, rt<br />

0 rs rt rd 0xb<br />

6 5 5 5 11<br />

Move register rs to register rd if register rt is not 0.<br />

Move conditional zero<br />

movz rd, rs, rt<br />

0 rs rt rd 0xa<br />

6 5 5 5 11<br />

Move register rs to register rd if register rt is 0.<br />

Move conditional on FP false<br />

movf rd, rs, cc<br />

0 rs cc 0 rd 0 1<br />

6 5 3 2 5 5 6<br />

Move CPU register rs to register rd if FPU condition code flag number cc is 0. If<br />

cc is omitted from the instruction, condition code flag 0 is assumed.


A.10 MIPS R2000 Assembly Language A-73<br />

Move conditional on FP true<br />

movt rd, rs, cc<br />

0 rs cc 1 rd 0 1<br />

6 5 3 2 5 5 6<br />

Move CPU register rs to register rd if FPU condition code flag number cc is 1. If<br />

cc is omitted from the instruction, condition code bit 0 is assumed.<br />

Floating-Point Instructions<br />

The MIPS has a floating-point coprocessor (numbered 1) that operates on single<br />

precision (32-bit) <strong>and</strong> double precision (64-bit) floating-point numbers. This<br />

coprocessor has its own registers, which are numbered $f0–$f31. Because these<br />

registers are only 32 bits wide, two of them are required to hold doubles, so only<br />

floating-point registers with even numbers can hold double precision values. The<br />

floating-point coprocessor also has eight condition code (cc) flags, numbered 0–7,<br />

which are set by compare instructions and tested by branch (bc1f or bc1t) and

conditional move instructions.<br />

Values are moved in or out of these registers one word (32 bits) at a time by<br />

lwc1, swc1, mtc1, and mfc1 instructions or one double (64 bits) at a time by ldc1

and sdc1, described above, or by the l.s, l.d, s.s, and s.d pseudoinstructions

described below.<br />

In the actual instructions below, bit 21 is 0 for single precision and 1

for double precision. In the pseudoinstructions below, fdest is a floating-point<br />

register (e.g., $f2).<br />
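As a brief illustration (hypothetical data labels x, y, and z, assumed to be declared as doubles in the data segment), two doubles can be loaded, added, and stored using the l.d and s.d pseudoinstructions and the add.d instruction described in this section:

l.d   $f2, x             # load the double at label x into $f2/$f3
l.d   $f4, y             # load the double at label y into $f4/$f5
add.d $f6, $f2, $f4      # $f6/$f7 = x + y
s.d   $f6, z             # store the sum at label z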

Floating-point absolute value double<br />

abs.d fd, fs<br />

0x11 0x11 0 fs fd 5

6 5 5 5 5 6<br />

Floating-point absolute value single<br />

abs.s fd, fs<br />

0x11 0x10 0 fs fd 5

Compute the absolute value of the floating-point double (single) in register fs <strong>and</strong><br />

put it in register fd.<br />

Floating-point addition double<br />

add.d fd, fs, ft<br />

0x11 0x11 ft fs fd 0<br />

6 5 5 5 5 6


A-74 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Floating-point addition single<br />

add.s fd, fs, ft<br />

0x11 0x10 ft fs fd 0<br />

6 5 5 5 5 6<br />

Compute the sum of the floating-point doubles (singles) in registers fs <strong>and</strong> ft <strong>and</strong><br />

put it in register fd.<br />

Floating-point ceiling to word<br />

ceil.w.d fd, fs<br />

ceil.w.s fd, fs<br />

0x11 0x11 0 fs fd 0xe<br />

6 5 5 5 5 6<br />

0x11 0x10 0 fs fd 0xe<br />

Compute the ceiling of the floating-point double (single) in register fs, convert to<br />

a 32-bit fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />

Compare equal double<br />

c.eq.d cc fs, ft<br />

0x11 0x11 ft fs cc 0 FC 2<br />

6 5 5 5 3 2 2 4<br />

Compare equal single<br />

c.eq.s cc fs, ft<br />

0x11 0x10 ft fs cc 0 FC 2<br />

6 5 5 5 3 2 2 4<br />

Compare the floating-point double (single) in register fs against the one in ft<br />

<strong>and</strong> set the floating-point condition flag cc to 1 if they are equal. If cc is omitted,<br />

condition code flag 0 is assumed.<br />

Compare less than equal double<br />

c.le.d cc fs, ft<br />

0x11 0x11 ft fs cc 0 FC 0xe<br />

6 5 5 5 3 2 2 4<br />

Compare less than equal single<br />

c.le.s cc fs, ft<br />

0x11 0x10 ft fs cc 0 FC 0xe<br />

6 5 5 5 3 2 2 4


A.10 MIPS R2000 Assembly Language A-75<br />

Compare the floating-point double (single) in register fs against the one in ft <strong>and</strong><br />

set the floating-point condition flag cc to 1 if the first is less than or equal to the<br />

second. If cc is omitted, condition code flag 0 is assumed.<br />

Compare less than double<br />

c.lt.d cc fs, ft<br />

0x11 0x11 ft fs cc 0 FC 0xc<br />

6 5 5 5 3 2 2 4<br />

Compare less than single<br />

c.lt.s cc fs, ft<br />

0x11 0x10 ft fs cc 0 FC 0xc<br />

6 5 5 5 3 2 2 4<br />

Compare the floating-point double (single) in register fs against the one in ft<br />

<strong>and</strong> set the condition flag cc to 1 if the first is less than the second. If cc is omitted,<br />

condition code flag 0 is assumed.<br />

Convert single to double<br />

cvt.d.s fd, fs<br />

0x11 0x10 0 fs fd 0x21<br />

6 5 5 5 5 6<br />

Convert integer to double<br />

cvt.d.w fd, fs<br />

0x11 0x14 0 fs fd 0x21<br />

6 5 5 5 5 6<br />

Convert the single precision floating-point number or integer in register fs to a<br />

double (single) precision number <strong>and</strong> put it in register fd.<br />

Convert double to single<br />

cvt.s.d fd, fs<br />

0x11 0x11 0 fs fd 0x20<br />

6 5 5 5 5 6<br />

Convert integer to single<br />

cvt.s.w fd, fs<br />

0x11 0x14 0 fs fd 0x20<br />

6 5 5 5 5 6<br />

Convert the double precision floating-point number or integer in register fs to a<br />

single precision number <strong>and</strong> put it in register fd.


A-76 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Convert double to integer<br />

cvt.w.d fd, fs<br />

0x11 0x11 0 fs fd 0x24<br />

6 5 5 5 5 6<br />

Convert single to integer<br />

cvt.w.s fd, fs<br />

0x11 0x10 0 fs fd 0x24<br />

6 5 5 5 5 6<br />

Convert the double or single precision floating-point number in register fs to an<br />

integer <strong>and</strong> put it in register fd.<br />

Floating-point divide double<br />

div.d fd, fs, ft<br />

0x11 0x11 ft fs fd 3<br />

6 5 5 5 5 6<br />

Floating-point divide single<br />

div.s fd, fs, ft<br />

0x11 0x10 ft fs fd 3<br />

6 5 5 5 5 6<br />

Compute the quotient of the floating-point doubles (singles) in registers fs <strong>and</strong> ft<br />

<strong>and</strong> put it in register fd.<br />

Floating-point floor to word<br />

floor.w.d fd, fs<br />

floor.w.s fd, fs<br />

0x11 0x11 0 fs fd 0xf<br />

6 5 5 5 5 6<br />

0x11 0x10 0 fs fd 0xf<br />

Compute the floor of the floating-point double (single) in register fs <strong>and</strong> put the<br />

resulting word in register fd.<br />

Load floating-point double<br />

l.d fdest, address<br />

pseudoinstruction


A.10 MIPS R2000 Assembly Language A-77<br />

Load floating-point single<br />

l.s fdest, address<br />

pseudoinstruction<br />

Load the floating-point double (single) at address into register fdest.<br />

Move floating-point double<br />

mov.d fd, fs<br />

0x11 0x11 0 fs fd 6<br />

6 5 5 5 5 6<br />

Move floating-point single<br />

mov.s fd, fs<br />

0x11 0x10 0 fs fd 6<br />

6 5 5 5 5 6<br />

Move the floating-point double (single) from register fs to register fd.<br />

Move conditional floating-point double false<br />

movf.d fd, fs, cc<br />

0x11 0x11 cc 0 fs fd 0x11<br />

6 5 3 2 5 5 6<br />

Move conditional floating-point single false<br />

movf.s fd, fs, cc<br />

0x11 0x10 cc 0 fs fd 0x11<br />

6 5 3 2 5 5 6<br />

Move the floating-point double (single) from register fs to register fd if condition

code flag cc is 0. If cc is omitted, condition code flag 0 is assumed.<br />

Move conditional floating-point double true<br />

movt.d fd, fs, cc<br />

0x11 0x11 cc 1 fs fd 0x11<br />

6 5 3 2 5 5 6<br />

Move conditional floating-point single true<br />

movt.s fd, fs, cc<br />

0x11 0x10 cc 1 fs fd 0x11<br />

6 5 3 2 5 5 6


A-78 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Move the floating-point double (single) from register fs to register fd if condition

code flag cc is 1. If cc is omitted, condition code flag 0 is assumed.<br />

Move conditional floating-point double not zero<br />

movn.d fd, fs, rt<br />

0x11 0x11 rt fs fd 0x13<br />

6 5 5 5 5 6<br />

Move conditional floating-point single not zero<br />

movn.s fd, fs, rt<br />

0x11 0x10 rt fs fd 0x13<br />

6 5 5 5 5 6<br />

Move the floating-point double (single) from register fs to register fd if processor

register rt is not 0.<br />

Move conditional floating-point double zero<br />

movz.d fd, fs, rt<br />

0x11 0x11 rt fs fd 0x12<br />

6 5 5 5 5 6<br />

Move conditional floating-point single zero<br />

movz.s fd, fs, rt<br />

0x11 0x10 rt fs fd 0x12<br />

6 5 5 5 5 6<br />

Move the floating-point double (single) from register fs to register fd if processor

register rt is 0.<br />

Floating-point multiply double<br />

mul.d fd, fs, ft<br />

0x11 0x11 ft fs fd 2<br />

6 5 5 5 5 6<br />

Floating-point multiply single<br />

mul.s fd, fs, ft<br />

0x11 0x10 ft fs fd 2<br />

6 5 5 5 5 6<br />

Compute the product of the floating-point doubles (singles) in registers fs <strong>and</strong> ft<br />

<strong>and</strong> put it in register fd.<br />

Negate double<br />

neg.d fd, fs<br />

0x11 0x11 0 fs fd 7<br />

6 5 5 5 5 6


A.10 MIPS R2000 Assembly Language A-79<br />

Negate single<br />

neg.s fd, fs<br />

0x11 0x10 0 fs fd 7<br />

6 5 5 5 5 6<br />

Negate the floating-point double (single) in register fs <strong>and</strong> put it in register fd.<br />

Floating-point round to word<br />

round.w.d fd, fs<br />

0x11 0x11 0 fs fd 0xc<br />

6 5 5 5 5 6<br />

round.w.s fd, fs

0x11 0x10 0 fs fd 0xc

Round the floating-point double (single) value in register fs, convert to a 32-bit<br />

fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />

Square root double<br />

sqrt.d fd, fs<br />

0x11 0x11 0 fs fd 4<br />

6 5 5 5 5 6<br />

Square root single<br />

sqrt.s fd, fs<br />

0x11 0x10 0 fs fd 4<br />

6 5 5 5 5 6<br />

Compute the square root of the floating-point double (single) in register fs <strong>and</strong><br />

put it in register fd.<br />

Store floating-point double<br />

s.d fdest, address<br />

pseudoinstruction<br />

Store floating-point single<br />

s.s fdest, address<br />

pseudoinstruction<br />

Store the floating-point double (single) in register fdest at address.<br />

Floating-point subtract double<br />

sub.d fd, fs, ft<br />

0x11 0x11 ft fs fd 1<br />

6 5 5 5 5 6


A-80 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Floating-point subtract single<br />

sub.s fd, fs, ft<br />

0x11 0x10 ft fs fd 1<br />

6 5 5 5 5 6<br />

Compute the difference of the floating-point doubles (singles) in registers fs <strong>and</strong><br />

ft <strong>and</strong> put it in register fd.<br />

Floating-point truncate to word<br />

trunc.w.d fd, fs<br />

0x11 0x11 0 fs fd 0xd<br />

6 5 5 5 5 6<br />

trunc.w.s fd, fs

0x11 0x10 0 fs fd 0xd

Truncate the floating-point double (single) value in register fs, convert to a 32-bit<br />

fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />

Exception <strong>and</strong> Interrupt Instructions<br />

Exception return<br />

eret<br />

0x10 1 0 0x18<br />

6 1 19 6<br />

Set the EXL bit in coprocessor 0’s Status register to 0 <strong>and</strong> return to the instruction<br />

pointed to by coprocessor 0’s EPC register.<br />

System call<br />

syscall<br />

0 0 0xc<br />

6 20 6<br />

Register $v0 contains the number of the system call (see Figure A.9.1) provided<br />

by SPIM.<br />
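For instance, printing an integer and then exiting can be written as the following sketch, assuming SPIM's usual call codes from Figure A.9.1 (1 for print_int and 10 for exit):

li    $a0, 42            # argument for print_int
li    $v0, 1             # system call code for print_int
syscall                  # prints 42
li    $v0, 10            # system call code for exit
syscall                  # terminate the program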

Break<br />

break code<br />

0 code 0xd<br />

6 20 6<br />

Cause exception code. Exception 1 is reserved for the debugger.<br />

No operation<br />

nop<br />

0 0 0 0 0 0<br />

6 5 5 5 5 6<br />

Do nothing.


A.11 Concluding Remarks A-81<br />

A.11<br />

Concluding Remarks<br />

Programming in assembly language requires a programmer to trade helpful features<br />

of high-level languages—such as data structures, type checking, <strong>and</strong> control<br />

constructs—for complete control over the instructions that a computer executes.<br />

External constraints on some applications, such as response time or program size,<br />

require a programmer to pay close attention to every instruction. However, the<br />

cost of this level of attention is assembly language programs that are longer, more<br />

time-consuming to write, <strong>and</strong> more difficult to maintain than high-level language<br />

programs.<br />

Moreover, three trends are reducing the need to write programs in assembly<br />

language. The first trend is toward the improvement of compilers. Modern compilers<br />

produce code that is typically comparable to the best h<strong>and</strong>written code—<br />

<strong>and</strong> is sometimes better. The second trend is the introduction of new processors<br />

that are not only faster, but in the case of processors that execute multiple instructions<br />

simultaneously, also more difficult to program by h<strong>and</strong>. In addition, the rapid<br />

evolution of the modern computer favors high-level language programs that are<br />

not tied to a single architecture. Finally, we witness a trend toward increasingly<br />

complex applications, characterized by complex graphic interfaces <strong>and</strong> many more<br />

features than their predecessors had. Large applications are written by teams of<br />

programmers and require the modularity and semantic checking features provided

by high-level languages.<br />

Further Reading<br />

Aho, A., R. Sethi, <strong>and</strong> J. Ullman [1985]. Compilers: Principles, Techniques, <strong>and</strong> Tools, Reading, MA: Addison-<br />

Wesley.<br />

Slightly dated <strong>and</strong> lacking in coverage of modern architectures, but still the st<strong>and</strong>ard reference on compilers.<br />

Sweetman, D. [1999]. See MIPS Run, San Francisco, CA: Morgan Kaufmann Publishers.<br />

A complete, detailed, <strong>and</strong> engaging introduction to the MIPS instruction set <strong>and</strong> assembly language programming<br />

on these machines.<br />

Detailed documentation on the MIPS-32 architecture is available on the Web:<br />

MIPS32 Architecture for Programmers Volume I: Introduction to the MIPS32 Architecture<br />

(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />

ArchitectureProgrammingPublicationsforMIPS32/MD00082-2B-MIPS32INT-AFP-02.00.pdf/<br />

getDownload)<br />

MIPS32 Architecture for Programmers Volume II: The MIPS32 Instruction Set<br />

(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />

ArchitectureProgrammingPublicationsforMIPS32/MD00086-2B-MIPS32BIS-AFP-02.00.pdf/getDownload)<br />

MIPS32 Architecture for Programmers Volume III: The MIPS32 Privileged Resource Architecture<br />

(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />

ArchitectureProgrammingPublicationsforMIPS32/MD00090-2B-MIPS32PRA-AFP-02.00.pdf/getDownload)


A.12 Exercises A-83<br />

A.10 [10] Using SPIM, write and test a recursive program for solving

the classic mathematical recreation, the Towers of Hanoi puzzle. (This will require<br />

the use of stack frames to support recursion.) The puzzle consists of three pegs<br />

(1, 2, <strong>and</strong> 3) <strong>and</strong> n disks (the number n can vary; typical values might be in the<br />

range from 1 to 8). Disk 1 is smaller than disk 2, which is in turn smaller than disk<br />

3, <strong>and</strong> so forth, with disk n being the largest. Initially, all the disks are on peg 1,<br />

starting with disk n on the bottom, disk n − 1 on top of that, <strong>and</strong> so forth, up to<br />

disk 1 on the top. The goal is to move all the disks to peg 2. You may only move one<br />

disk at a time, that is, the top disk from any of the three pegs onto the top of either<br />

of the other two pegs. Moreover, there is a constraint: You must not place a larger<br />

disk on top of a smaller disk.<br />

The C program below can be used to help write your assembly language program.<br />

/* move n smallest disks from start to finish using<br />

extra */<br />

void hanoi(int n, int start, int finish, int extra){<br />

if(n != 0){<br />

hanoi(n-1, start, extra, finish);<br />

print_string("Move disk");

print_int(n);<br />

print_string("from peg");

print_int(start);<br />

print_string("to peg");

print_int(finish);<br />

print_string(".\n");

hanoi(n-1, extra, finish, start);<br />

}<br />

}<br />

main(){<br />

int n;<br />

print_string("Enter number of disks>");

n = read_int();<br />

hanoi(n, 1, 2, 3);<br />

return 0;<br />

}


B<br />

A P P E N D I X<br />

I always loved that<br />

word, Boolean.<br />

Claude Shannon<br />

IEEE Spectrum, April 1992<br />

(Shannon’s master’s thesis showed<br />

that the algebra invented by George<br />

Boole in the 1800s could represent the<br />

workings of electrical switches.)<br />

The Basics of Logic<br />

<strong>Design</strong><br />

B.1 Introduction B-3<br />

B.2 Gates, Truth Tables, <strong>and</strong> Logic<br />

Equations B-4<br />

B.3 Combinational Logic B-9<br />

B.4 Using a Hardware Description<br />

Language B-20<br />

B.5 Constructing a Basic Arithmetic Logic<br />

Unit B-26<br />

B.6 Faster Addition: Carry Lookahead B-38<br />

B.7 Clocks B-48<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


B-6 Appendix B The Basics of Logic <strong>Design</strong><br />

Boolean Algebra<br />

Another approach is to express the logic function with logic equations. This<br />

is done with the use of Boolean algebra (named after Boole, a 19th-century<br />

mathematician). In Boolean algebra, all the variables have the values 0 or 1 <strong>and</strong>, in<br />

typical formulations, there are three operators:<br />

■ The OR operator is written as +, as in A + B. The result of an OR operator is
1 if either of the variables is 1. The OR operation is also called a logical sum,
since its result is 1 if either operand is 1.

■ The AND operator is written as ·, as in A · B. The result of an AND operator
is 1 only if both inputs are 1. The AND operator is also called logical product,
since its result is 1 only if both operands are 1.

■ The unary operator NOT is written as Ā. The result of a NOT operator is 1 only if
the input is 0. Applying the operator NOT to a logical value results in an inversion
or negation of the value (i.e., if the input is 0 the output is 1, and vice versa).

There are several laws of Boolean algebra that are helpful in manipulating logic<br />

equations.<br />

■ Identity law: A + 0 = A and A · 1 = A

■ Zero and One laws: A + 1 = 1 and A · 0 = 0

■ Inverse laws: A + Ā = 1 and A · Ā = 0

■ Commutative laws: A + B = B + A and A · B = B · A

■ Associative laws: A + (B + C) = (A + B) + C and A · (B · C) = (A · B) · C

■ Distributive laws: A · (B + C) = (A · B) + (A · C) and A + (B · C) = (A + B) · (A + C)

In addition, there are two other useful theorems, called DeMorgan’s laws, that are<br />

discussed in more depth in the exercises.<br />

Any set of logic functions can be written as a series of equations with an output<br />

on the left-h<strong>and</strong> side of each equation <strong>and</strong> a formula consisting of variables <strong>and</strong> the<br />

three operators above on the right-h<strong>and</strong> side.


B.2 Gates, Truth Tables, <strong>and</strong> Logical Equations B-7<br />

Logic Equations<br />

Show the logic equations for the logic functions, D, E, <strong>and</strong> F, described in the<br />

previous example.<br />

EXAMPLE<br />

ANSWER

Here's the equation for D:

D = A + B + C

F is equally simple:

F = A · B · C

E is a little tricky. Think of it in two parts: what must be true for E to be true<br />

(two of the three inputs must be true), <strong>and</strong> what cannot be true (all three<br />

cannot be true). Thus we can write E as<br />

E = ((A · B) + (A · C) + (B · C)) · \overline{A · B · C}

We can also derive E by realizing that E is true only if exactly two of the inputs<br />

are true. Then we can write E as an OR of the three possible terms that have<br />

two true inputs <strong>and</strong> one false input:<br />

E = (A · B · C̄) + (A · C · B̄) + (B · C · Ā)

Proving that these two expressions are equivalent is explored in the exercises.<br />

In Verilog, we describe combinational logic whenever possible using the assign<br />

statement, which is described beginning on page B-23. We can write a definition<br />

for E using the Verilog exclusive-OR operator as assign E = (A ^ B ^ C) *

(A + B + C) * (A * B * C), which is yet another way to describe this function.<br />

D <strong>and</strong> F have even simpler representations, which are just like the corresponding C<br />

code: D = A | B | C and F = A & B & C.


B-8 Appendix B The Basics of Logic <strong>Design</strong><br />

gate A device that<br />

implements basic logic<br />

functions, such as AND<br />

or OR.<br />

NOR gate An inverted<br />

OR gate.<br />

NAND gate An inverted<br />

AND gate.<br />

Check<br />

Yourself<br />

Gates<br />

Logic blocks are built from gates that implement basic logic functions. For example,<br />

an AND gate implements the AND function, <strong>and</strong> an OR gate implements the OR<br />

function. Since both AND <strong>and</strong> OR are commutative <strong>and</strong> associative, an AND or an<br />

OR gate can have multiple inputs, with the output equal to the AND or OR of all<br />

the inputs. The logical function NOT is implemented with an inverter that always<br />

has a single input. The st<strong>and</strong>ard representation of these three logic building blocks<br />

is shown in Figure B.2.1.<br />

Rather than draw inverters explicitly, a common practice is to add “bubbles”<br />

to the inputs or outputs of a gate to cause the logic value on that input line or<br />

output line to be inverted. For example, Figure B.2.2 shows the logic diagram for<br />

the function \overline{Ā + B}, using explicit inverters on the left and bubbled inputs and

outputs on the right.<br />

Any logical function can be constructed using AND gates, OR gates, <strong>and</strong><br />

inversion; several of the exercises give you the opportunity to try implementing<br />

some common logic functions with gates. In the next section, we’ll see how an<br />

implementation of any logic function can be constructed using this knowledge.<br />

In fact, all logic functions can be constructed with only a single gate type, if that<br />

gate is inverting. The two common inverting gates are called NOR <strong>and</strong> NAND <strong>and</strong><br />

correspond to inverted OR <strong>and</strong> AND gates, respectively. NOR <strong>and</strong> NAND gates are<br />

called universal, since any logic function can be built using this one gate type. The<br />

exercises explore this concept further.<br />

Are the following two logical expressions equivalent? If not, find a setting of the<br />

variables to show they are not:<br />

■ (A · B · C̄) + (A · C · B̄) + (B · C · Ā)

■ B · ((A · C̄) + (C · Ā))

FIGURE B.2.1 St<strong>and</strong>ard drawing for an AND gate, OR gate, <strong>and</strong> an inverter, shown from<br />

left to right. The signals to the left of each symbol are the inputs, while the output appears on the right. The<br />

AND <strong>and</strong> OR gates both have two inputs. Inverters have a single input.<br />


FIGURE B.2.2 Logic gate implementation of \overline{Ā + B} using explicit inverters on the left and
bubbled inputs and outputs on the right. This logic function can be simplified to A · B̄ or, in Verilog,
A & ~B.


B.3 Combinational Logic B-9<br />

B.3 Combinational Logic<br />

In this section, we look at a couple of larger logic building blocks that we use<br />

heavily, <strong>and</strong> we discuss the design of structured logic that can be automatically<br />

implemented from a logic equation or truth table by a translation program. Last,<br />

we discuss the notion of an array of logic blocks.<br />

Decoders<br />

One logic block that we will use in building larger components is a decoder. The<br />

most common type of decoder has an n-bit input and 2^n outputs, where only one

output is asserted for each input combination. This decoder translates the n-bit<br />

input into a signal that corresponds to the binary value of the n-bit input. The<br />

outputs are thus usually numbered, say, Out0, Out1, …, Out2^n − 1. If the value of

the input is i, then Outi will be true <strong>and</strong> all other outputs will be false. Figure B.3.1<br />

shows a 3-bit decoder <strong>and</strong> the truth table. This decoder is called a 3-to-8 decoder<br />

since there are 3 inputs and 8 (2^3) outputs. There is also a logic element called

an encoder that performs the inverse function of a decoder, taking 2^n inputs and

producing an n-bit output.<br />

decoder A logic block<br />

that has an n-bit input<br />

and 2^n outputs, where

only one output is<br />

asserted for each input<br />

combination.<br />


Inputs<br />

Outputs<br />

I2 I1 I0 Out7 Out6 Out5 Out4 Out3 Out2 Out1 Out0

0 0 0 0 0 0 0 0 0 0 1<br />

0 0 1 0 0 0 0 0 0 1 0<br />

0 1 0 0 0 0 0 0 1 0 0<br />

0 1 1 0 0 0 0 1 0 0 0<br />

1 0 0 0 0 0 1 0 0 0 0<br />

1 0 1 0 0 1 0 0 0 0 0<br />

1 1 0 0 1 0 0 0 0 0 0<br />

1 1 1 1 0 0 0 0 0 0 0<br />

a. A 3-bit decoder<br />

b. The truth table for a 3-bit decoder<br />

FIGURE B.3.1 A 3-bit decoder has 3 inputs, called I2, I1, and I0, and 2^3 = 8 outputs, called Out0 to Out7. Only the

output corresponding to the binary value of the input is true, as shown in the truth table. The label 3 on the input to the decoder says that the<br />

input signal is 3 bits wide.


B-10 Appendix B The Basics of Logic <strong>Design</strong><br />


FIGURE B.3.2 A two-input multiplexor on the left <strong>and</strong> its implementation with gates on<br />

the right. The multiplexor has two data inputs (A <strong>and</strong> B), which are labeled 0 <strong>and</strong> 1, <strong>and</strong> one selector input<br />

(S), as well as an output C. Implementing multiplexors in Verilog requires a little more work, especially when<br />

they are wider than two inputs. We show how to do this beginning on page B-23.<br />

selector value Also<br />

called control value. The<br />

control signal that is used<br />

to select one of the input<br />

values of a multiplexor<br />

as the output of the<br />

multiplexor.<br />

Multiplexors<br />

One basic logic function that we use quite often in Chapter 4 is the multiplexor.<br />

A multiplexor might more properly be called a selector, since its output is one of<br />

the inputs that is selected by a control. Consider the two-input multiplexor. The<br />

left side of Figure B.3.2 shows this multiplexor has three inputs: two data values<br />

<strong>and</strong> a selector (or control) value. The selector value determines which of the<br />

inputs becomes the output. We can represent the logic function computed by a<br />

two-input multiplexor, shown in gate form on the right side of Figure B.3.2, as<br />

C = (A · S̄) + (B · S).

Multiplexors can be created with an arbitrary number of data inputs. When<br />

there are only two inputs, the selector is a single signal that selects one of the inputs<br />

if it is true (1) <strong>and</strong> the other if it is false (0). If there are n data inputs, there will<br />

need to be ⎡<br />

⎢log 2<br />

n⎤<br />

⎥ selector inputs. In this case, the multiplexor basically consists<br />

of three parts:<br />

1. A decoder that generates n signals, each indicating a different input value<br />

2. An array of n AND gates, each combining one of the inputs with a signal<br />

from the decoder<br />

3. A single large OR gate that incorporates the outputs of the AND gates<br />

To associate the inputs with selector values, we often label the data inputs numerically<br />

(i.e., 0, 1, 2, 3, …, n − 1) and interpret the data selector inputs as a binary number.

Sometimes, we make use of a multiplexor with undecoded selector signals.<br />

Multiplexors are easily represented combinationally in Verilog by using if<br />

expressions. For larger multiplexors, case statements are more convenient, but care<br />

must be taken to synthesize combinational logic.


B.3 Combinational Logic B-11<br />

Two-Level Logic <strong>and</strong> PLAs<br />

As pointed out in the previous section, any logic function can be implemented with<br />

only AND, OR, <strong>and</strong> NOT functions. In fact, a much stronger result is true. Any logic<br />

function can be written in a canonical form, where every input is either a true or<br />

complemented variable <strong>and</strong> there are only two levels of gates—one being AND <strong>and</strong><br />

the other OR—with a possible inversion on the final output. Such a representation<br />

is called a two-level representation, <strong>and</strong> there are two forms, called sum of products<br />

<strong>and</strong> product of sums. A sum-of-products representation is a logical sum (OR) of<br />

products (terms using the AND operator); a product of sums is just the opposite.<br />

In our earlier example, we had two equations for the output E:<br />

<strong>and</strong><br />

E (( A B) ( A C) ( B C)) ( A B C)<br />

sum of products A form<br />

of logical representation<br />

that employs a logical sum<br />

(OR) of products (terms<br />

joined using the AND<br />

operator).<br />

E = (A · B · C̄) + (A · C · B̄) + (B · C · Ā)

This second equation is in a sum-of-products form: it has two levels of logic <strong>and</strong> the<br />

only inversions are on individual variables. The first equation has three levels of logic.<br />

Elaboration: We can also write E as a product of sums:<br />

E = \overline{(Ā + B̄ + C) · (Ā + C̄ + B) · (B̄ + C̄ + A)}

To derive this form, you need to use DeMorgan’s theorems, which are discussed in the<br />

exercises.<br />
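For reference, a sketch of that derivation: DeMorgan's theorems state that \overline{X + Y} = \overline{X} \cdot \overline{Y} and \overline{X \cdot Y} = \overline{X} + \overline{Y}. Complementing E's sum-of-products form twice and pushing one complement inward with DeMorgan gives the product-of-sums form above:

E = \overline{\overline{(A \cdot B \cdot \overline{C}) + (A \cdot C \cdot \overline{B}) + (B \cdot C \cdot \overline{A})}}
  = \overline{(\overline{A} + \overline{B} + C) \cdot (\overline{A} + \overline{C} + B) \cdot (\overline{B} + \overline{C} + A)}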

In this text, we use the sum-of-products form. It is easy to see that any logic<br />

function can be represented as a sum of products by constructing such a<br />

representation from the truth table for the function. Each truth table entry for<br />

which the function is true corresponds to a product term. The product term<br />

consists of a logical product of all the inputs or the complements of the inputs,<br />

depending on whether the entry in the truth table has a 0 or 1 corresponding to<br />

this variable. The logic function is the logical sum of the product terms where the<br />

function is true. This is more easily seen with an example.


B-12 Appendix B The Basics of Logic <strong>Design</strong><br />

Sum of Products<br />

EXAMPLE<br />

Show the sum-of-products representation for the following truth table for D.<br />

Inputs<br />

Outputs<br />

A B C D<br />

0 0 0 0<br />

0 0 1 1<br />

0 1 0 1<br />

0 1 1 0<br />

1 0 0 1<br />

1 0 1 0<br />

1 1 0 0<br />

1 1 1 1<br />

ANSWER<br />

There are four product terms, since the function is true (1) for four different<br />

input combinations. These are:<br />

Ā · B̄ · C

Ā · B · C̄

programmable logic<br />

array (PLA)<br />

A structured-logic<br />

element composed<br />

of a set of inputs <strong>and</strong><br />

corresponding input<br />

complements <strong>and</strong> two<br />

stages of logic: the first<br />

generates product terms<br />

of the inputs <strong>and</strong> input<br />

complements, <strong>and</strong> the<br />

second generates sum<br />

terms of the product<br />

terms. Hence, PLAs<br />

implement logic functions<br />

as a sum of products.<br />

minterms Also called<br />

product terms. A set<br />

of logic inputs joined<br />

by conjunction (AND<br />

operations); the product<br />

terms form the first logic<br />

stage of the programmable<br />

logic array (PLA).<br />

A · B̄ · C̄

A · B · C

Thus, we can write the function for D as the sum of these terms:<br />

D = (Ā · B̄ · C) + (Ā · B · C̄) + (A · B̄ · C̄) + (A · B · C)

Note that only those truth table entries for which the function is true generate<br />

terms in the equation.<br />

We can use this relationship between a truth table <strong>and</strong> a two-level representation<br />

to generate a gate-level implementation of any set of logic functions. A set of logic<br />

functions corresponds to a truth table with multiple output columns, as we saw in<br />

the example on page B-5. Each output column represents a different logic function,<br />

which may be directly constructed from the truth table.<br />

The sum-of-products representation corresponds to a common structured-logic<br />

implementation called a programmable logic array (PLA). A PLA has a set of<br />

inputs <strong>and</strong> corresponding input complements (which can be implemented with a<br />

set of inverters), <strong>and</strong> two stages of logic. The first stage is an array of AND gates that<br />

form a set of product terms (sometimes called minterms); each product term can<br />

consist of any of the inputs or their complements. The second stage is an array of<br />

OR gates, each of which forms a logical sum of any number of the product terms.<br />

Figure B.3.3 shows the basic form of a PLA.


B.3 Combinational Logic B-13<br />


FIGURE B.3.3 The basic form of a PLA consists of an array of AND gates followed by an<br />

array of OR gates. Each entry in the AND gate array is a product term consisting of any number of inputs or<br />

inverted inputs. Each entry in the OR gate array is a sum term consisting of any number of these product terms.<br />

A PLA can directly implement the truth table of a set of logic functions with<br />

multiple inputs <strong>and</strong> outputs. Since each entry where the output is true requires<br />

a product term, there will be a corresponding row in the PLA. Each output<br />

corresponds to a potential row of OR gates in the second stage. The number of OR<br />

gates corresponds to the number of truth table entries for which the output is true.<br />

The total size of a PLA, such as that shown in Figure B.3.3, is equal to the sum of the<br />

size of the AND gate array (called the AND plane) <strong>and</strong> the size of the OR gate array<br />

(called the OR plane). Looking at Figure B.3.3, we can see that the size of the AND<br />

gate array is equal to the number of inputs times the number of different product<br />

terms, <strong>and</strong> the size of the OR gate array is the number of outputs times the number<br />

of product terms.<br />

A PLA has two characteristics that help make it an efficient way to implement a<br />

set of logic functions. First, only the truth table entries that produce a true value for<br />

at least one output have any logic gates associated with them. Second, each different<br />

product term will have only one entry in the PLA, even if the product term is used<br />

in multiple outputs. Let’s look at an example.<br />

PLAs<br />

Consider the set of logic functions defined in the example on page B-5. Show<br />

a PLA implementation of this example for D, E, <strong>and</strong> F.<br />

EXAMPLE


B-14 Appendix B The Basics of Logic <strong>Design</strong><br />

ANSWER<br />

Here is the truth table we constructed earlier:<br />

Inputs<br />

Outputs<br />

A B C D E F<br />

0 0 0 0 0 0<br />

0 0 1 1 0 0<br />

0 1 0 1 0 0<br />

0 1 1 1 1 0<br />

1 0 0 1 0 0<br />

1 0 1 1 1 0<br />

1 1 0 1 1 0<br />

1 1 1 1 0 1<br />

Since there are seven unique product terms with at least one true value in the<br />

output section, there will be seven columns in the AND plane. The number of<br />

rows in the AND plane is three (since there are three inputs), <strong>and</strong> there are also<br />

three rows in the OR plane (since there are three outputs). Figure B.3.4 shows<br />

the resulting PLA, with the product terms corresponding to the truth table<br />

entries from top to bottom.<br />

read-only memory<br />

(ROM) A memory<br />

whose contents are<br />

designated at creation<br />

time, after which the<br />

contents can only be read.<br />

ROM is used as structured<br />

logic to implement a<br />

set of logic functions by<br />

using the terms in the<br />

logic functions as address<br />

inputs <strong>and</strong> the outputs as<br />

bits in each word of the<br />

memory.<br />

programmable ROM<br />

(PROM) A form of<br />

read-only memory that<br />

can be programmed

when a designer knows its<br />

contents.<br />

Rather than drawing all the gates, as we do in Figure B.3.4, designers often show<br />

just the position of AND gates <strong>and</strong> OR gates. Dots are used on the intersection of a<br />

product term signal line <strong>and</strong> an input line or an output line when a corresponding<br />

AND gate or OR gate is required. Figure B.3.5 shows how the PLA of Figure B.3.4<br />

would look when drawn in this way. The contents of a PLA are fixed when the PLA<br />

is created, although there are also forms of PLA-like structures, called PALs, that<br />

can be programmed electronically when a designer is ready to use them.<br />

ROMs<br />

Another form of structured logic that can be used to implement a set of logic<br />

functions is a read-only memory (ROM). A ROM is called a memory because it<br />

has a set of locations that can be read; however, the contents of these locations are<br />

fixed, usually at the time the ROM is manufactured. There are also programmable<br />

ROMs (PROMs) that can be programmed electronically, when a designer knows<br />

their contents. There are also erasable PROMs; these devices require a slow erasure<br />

process using ultraviolet light, <strong>and</strong> thus are used as read-only memories, except<br />

during the design <strong>and</strong> debugging process.<br />

A ROM has a set of input address lines <strong>and</strong> a set of outputs. The number of<br />

addressable entries in the ROM determines the number of address lines: if the


B.4 Using a Hardware Description Language B-19<br />

elements, which we can represent simply by showing that a given operation will<br />

happen to an entire collection of inputs. Inside a machine, much of the time we<br />

want to select between a pair of buses. A bus is a collection of data lines that is<br />

treated together as a single logical signal. (The term bus is also used to indicate a<br />

shared collection of lines with multiple sources <strong>and</strong> uses.)<br />

For example, in the MIPS instruction set, the result of an instruction that is written<br />

into a register can come from one of two sources. A multiplexor is used to choose<br />

which of the two buses (each 32 bits wide) will be written into the Result register.<br />

The 1-bit multiplexor, which we showed earlier, will need to be replicated 32 times.<br />

We indicate that a signal is a bus rather than a single 1-bit line by showing it with<br />

a thicker line in a figure. Most buses are 32 bits wide; those that are not are explicitly<br />

labeled with their width. When we show a logic unit whose inputs <strong>and</strong> outputs are<br />

buses, this means that the unit must be replicated a sufficient number of times to<br />

accommodate the width of the input. Figure B.3.6 shows how we draw a multiplexor<br />

that selects between a pair of 32-bit buses and how this expands in terms of 1-bit-wide

multiplexors. Sometimes we need to construct an array of logic elements<br />

where the inputs for some elements in the array are outputs from earlier elements.<br />

For example, this is how a multibit-wide ALU is constructed. In such cases, we must<br />

explicitly show how to create wider arrays, since the individual elements of the array<br />

are no longer independent, as they are in the case of a 32-bit-wide multiplexor.<br />

bus In logic design, a collection of data lines that is treated together as a single logical signal; also, a shared collection of lines with multiple sources and uses.

FIGURE B.3.6 A multiplexor is arrayed 32 times to perform a selection between two 32-bit inputs. Note that there is still only one data selection signal used for all 32 1-bit multiplexors. (a) A 32-bit wide 2-to-1 multiplexor; (b) the 32-bit wide multiplexor is actually an array of 32 1-bit multiplexors.


B.4 Using a Hardware Description Language

Readers already familiar with VHDL should find the concepts simple, provided they have been exposed to the syntax of C.

Verilog can specify both a behavioral and a structural definition of a digital system. A behavioral specification describes how a digital system functionally operates. A structural specification describes the detailed organization of a digital system, usually using a hierarchical description. A structural specification can be used to describe a hardware system in terms of a hierarchy of basic elements such as gates and switches. Thus, we could use Verilog to describe the exact contents of the truth tables and datapath of the last section.

With the arrival of hardware synthesis tools, most designers now use Verilog or VHDL to structurally describe only the datapath, relying on logic synthesis to generate the control from a behavioral description. In addition, most CAD systems provide extensive libraries of standardized parts, such as ALUs, multiplexors, register files, memories, and programmable logic blocks, as well as basic gates.

Obtaining an acceptable result using libraries and logic synthesis requires that the specification be written with an eye toward the eventual synthesis and the desired outcome. For our simple designs, this primarily means making clear what we expect to be implemented in combinational logic and what we expect to require sequential logic. In most of the examples we use in this section and the remainder of this appendix, we have written the Verilog with the eventual synthesis in mind.

Datatypes and Operators in Verilog

There are two primary datatypes in Verilog:

1. A wire specifies a combinational signal.

2. A reg (register) holds a value, which can vary with time. A reg need not necessarily correspond to an actual register in an implementation, although it often will.

A register or wire, named X, that is 32 bits wide is declared as an array: reg [31:0] X or wire [31:0] X, which also sets the index of 0 to designate the least significant bit of the register. Because we often want to access a subfield of a register or wire, we can refer to a contiguous set of bits of a register or wire with the notation [starting bit: ending bit], where both indices must be constant values.

An array of registers is used for a structure like a register file or memory. Thus, the declaration

reg [31:0] registerfile[0:31]

specifies a variable registerfile that is equivalent to a MIPS registerfile, where register 0 is the first. When accessing an array, we can refer to a single element, as in C, using the notation registerfile[regnum].

behavioral specification Describes how a digital system operates functionally.

structural specification Describes how a digital system is organized in terms of a hierarchical connection of elements.

hardware synthesis tools Computer-aided design software that can generate a gate-level design based on behavioral descriptions of a digital system.

wire In Verilog, specifies a combinational signal.

reg In Verilog, a register.
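The following short module is not from the text; it is a minimal sketch (with made-up names such as field_examples and regfile) showing how these declarations and the bit-select notation look in practice:

  module field_examples (input  wire [31:0] sum,
                         output wire [15:0] upper);
    reg [31:0] pc;               // a 32-bit value; assigned only inside always blocks
    reg [31:0] regfile [0:31];   // an array of 32 registers, each 32 bits wide

    assign upper = sum[31:16];   // a contiguous subfield; both indices are constants

    always @(*)
      pc = regfile[5];           // reading one element of the register array, as in C
  endmodule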


The possible values for a register or wire in Verilog are

■ 0 or 1, representing logical false or true

■ X, representing unknown, the initial value given to all registers and to any wire not connected to something

■ Z, representing the high-impedance state for tristate gates, which we will not discuss in this appendix

Constant values can be specified as decimal numbers as well as binary, octal, or hexadecimal. We often want to say exactly how large a constant field is in bits. This is done by prefixing the value with a decimal number specifying its size in bits. For example:

■ 4'b0100 specifies a 4-bit binary constant with the value 4, as does 4'd4.

■ −8'h4 specifies an 8-bit constant with the value −4 (in two's complement representation).

Values can also be concatenated by placing them within { } separated by commas. The notation {x{bit field}} replicates bit field x times. For example:

■ {16{2'b01}} creates a 32-bit value with the pattern 0101 … 01.

■ {A[31:16],B[15:0]} creates a value whose upper 16 bits come from A and whose lower 16 bits come from B.

Verilog provides the full set of unary and binary operators from C, including the arithmetic operators (+, −, *, /), the logical operators (&, |, ~), the comparison operators (==, !=, >, <, >=, <=), the shift operators (<<, >>), and C's conditional operator (?, which is used in the form condition ? expr1 : expr2 and returns expr1 if the condition is true and expr2 if it is false). Verilog adds a set of unary logic reduction operators (&, |, ^) that yield a single bit by applying the logical operator to all the bits of an operand. For example, &A returns the value obtained by ANDing all the bits of A together, and ^A returns the reduction obtained by using exclusive OR on all the bits of A.
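As an illustration (module and signal names are invented for this sketch, not taken from the text), the constants, concatenation, and reduction operators described above can be exercised like this:

  module operator_examples (input  wire [31:0] A, B,
                            output wire [31:0] pattern, mixed,
                            output wire        all_ones, parity);
    assign pattern  = {16{2'b01}};          // replication: the 32-bit pattern 0101...01
    assign mixed    = {A[31:16], B[15:0]};  // concatenation: upper half of A, lower half of B
    assign all_ones = &A;                   // reduction AND: 1 only if every bit of A is 1
    assign parity   = ^A;                   // reduction XOR of all the bits of A
  endmodule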

Check Yourself

Which of the following define exactly the same value?

1. 8'b11110000

2. 8'hF0

3. 8'd240

4. {{4{1'b1}},{4{1'b0}}}

5. {4'b1,4'b0}


Structure of a Verilog Program

A Verilog program is structured as a set of modules, which may represent anything from a collection of logic gates to a complete system. Modules are similar to classes in C++, although not nearly as powerful. A module specifies its input and output ports, which describe the incoming and outgoing connections of a module. A module may also declare additional variables. The body of a module consists of:

■ initial constructs, which can initialize reg variables

■ Continuous assignments, which define only combinational logic

■ always constructs, which can define either sequential or combinational logic

■ Instances of other modules, which are used to implement the module being defined

Representing Complex Combinational Logic in Verilog

A continuous assignment, which is indicated with the keyword assign, acts like a combinational logic function: the output is continuously assigned the value, and a change in the input values is reflected immediately in the output value. Wires may only be assigned values with continuous assignments. Using continuous assignments, we can define a module that implements a half-adder, as Figure B.4.1 shows.

FIGURE B.4.1 A Verilog module that defines a half-adder using continuous assignments.
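Figure B.4.1 itself is not reproduced in this extraction; a half-adder written with continuous assignments, along the lines the figure describes, might look like this:

  module half_adder (A, B, Sum, Carry);
    input  A, B;            // the two 1-bit inputs
    output Sum, Carry;
    assign Sum   = A ^ B;   // the sum is the XOR of the two inputs
    assign Carry = A & B;   // the carry out is the AND of the two inputs
  endmodule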

Assign statements are one sure way to write Verilog that generates combinational logic. For more complex structures, however, assign statements may be awkward or tedious to use. It is also possible to use the always block of a module to describe a combinational logic element, although care must be taken. Using an always block allows the inclusion of Verilog control constructs, such as if-then-else, case statements, for statements, and repeat statements, to be used. These statements are similar to those in C with small changes.

An always block specifies an optional list of signals on which the block is sensitive (in a list starting with @).


The always block is re-evaluated if any of the listed signals changes value; if the list is omitted, the always block is constantly re-evaluated. When an always block is specifying combinational logic, the sensitivity list should include all the input signals. If there are multiple Verilog statements to be executed in an always block, they are surrounded by the keywords begin and end, which take the place of the { and } in C. An always block thus looks like this:

  always @(list of signals that cause reevaluation) begin
    Verilog statements including assignments and other control statements
  end

sensitivity list The list of signals that specifies when an always block should be re-evaluated.

Reg variables may only be assigned inside an always block, using a procedural assignment statement (as distinguished from the continuous assignment we saw earlier). There are, however, two different types of procedural assignments. The assignment operator = executes as it does in C; the right-hand side is evaluated, and the left-hand side is assigned the value. Furthermore, it executes like the normal C assignment statement: that is, it is completed before the next statement is executed. Hence, the assignment operator = has the name blocking assignment. This blocking can be useful in the generation of sequential logic, and we will return to it shortly. The other form of assignment (nonblocking) is indicated by <=.

blocking assignment In Verilog, an assignment that completes before the execution of the next statement.

nonblocking assignment An assignment that continues after evaluating the right-hand side, assigning the left-hand side the value only after all right-hand sides are evaluated.
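As a preview of the difference (this sketch is not from the text; the module and signal names are invented), the two forms behave differently when one statement depends on another inside the same always block:

  module assign_contrast (input wire clock, input wire [7:0] c);
    reg [7:0] a1, b1, a2, b2;

    always @(posedge clock) begin
      a1 = c;      // blocking: completes before the next statement runs,
      b1 = a1;     // so b1 receives the new value of a1 (that is, c)
    end

    always @(posedge clock) begin
      a2 <= c;     // nonblocking: all right-hand sides are evaluated first,
      b2 <= a2;    // so b2 receives the value a2 held before this clock edge
    end
  endmodule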


FIGURE B.4.2 A Verilog definition of a 4-to-1 multiplexor with 32-bit inputs, using a case statement. The case statement acts like a C switch statement, except that in Verilog only the code associated with the selected case is executed (as if each case statement had a break at the end) and there is no fall-through to the next statement.
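The figure's code is not reproduced in this extraction; a 4-to-1 multiplexor consistent with the caption could be sketched as follows (the port names here are assumptions, not necessarily those of the figure):

  module mult4to1 (In0, In1, In2, In3, Sel, Out);
    input  [31:0] In0, In1, In2, In3;   // four 32-bit data inputs
    input  [1:0]  Sel;                  // two-bit selector
    output reg [31:0] Out;              // a reg because it is assigned in an always block

    always @(*)                         // re-evaluate whenever any input changes
      case (Sel)                        // like a C switch, but with no fall-through
        2'd0: Out = In0;
        2'd1: Out = In1;
        2'd2: Out = In2;
        2'd3: Out = In3;
      endcase
  endmodule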

FIGURE B.4.3 A Verilog behavioral definition of a MIPS ALU. This could be synthesized using a module library containing basic arithmetic and logical operations.


Check Yourself

Assuming all values are initially zero, what are the values of A and B after executing this Verilog code inside an always block?

C = 1;
A <=


B.5 Constructing a Basic Arithmetic Logic Unit

FIGURE B.5.7 A 32-bit ALU constructed from 32 1-bit ALUs. CarryOut of the less significant bit is connected to the CarryIn of the more significant bit. This organization is called ripple carry.

this is only one step in negating a two's complement number. Notice that the least significant bit still has a CarryIn signal, even though it's unnecessary for addition. What happens if we set this CarryIn to 1 instead of 0? The adder will then calculate a + b + 1. By selecting the inverted version of b, we get exactly what we want:

a + b̄ + 1 = a + (b̄ + 1) = a + (−b) = a − b

The simplicity of the hardware design of a two's complement adder helps explain why two's complement representation has become the universal standard for integer computer arithmetic.
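Stated in Verilog (a sketch, not one of the book's figures), the same identity is a one-liner:

  module subtract_by_add (input  wire [31:0] a, b,
                          output wire [31:0] diff);
    // a + (NOT b) + 1 equals a - b in two's complement; the +1 plays the
    // role of the CarryIn set to 1 in the discussion above.
    assign diff = a + ~b + 32'd1;
  endmodule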


FIGURE B.5.11 A 32-bit ALU constructed from 31 copies of the 1-bit ALU in the top of Figure B.5.10 and one 1-bit ALU in the bottom of that figure. The Less inputs are connected to 0 except for the least significant bit, which is connected to the Set output of the most significant bit. If the ALU performs a − b and we select the input 3 in the multiplexor in Figure B.5.10, then Result = 0 … 001 if a < b, and Result = 0 … 000 otherwise.

Thus, we need a new 1-bit ALU for the most significant bit that has an extra output bit: the adder output. The bottom drawing of Figure B.5.10 shows the design, with this new adder output line called Set, and used only for slt. As long as we need a special ALU for the most significant bit, we added the overflow detection logic since it is also associated with that bit.


Alas, the test of less than is a little more complicated than just described because of overflow, as we explore in the exercises. Figure B.5.11 shows the 32-bit ALU.

Notice that every time we want the ALU to subtract, we set both CarryIn and Binvert to 1. For adds or logical operations, we want both control lines to be 0. We can therefore simplify control of the ALU by combining CarryIn and Binvert into a single control line called Bnegate.

To further tailor the ALU to the MIPS instruction set, we must support conditional branch instructions. These instructions branch either if two registers are equal or if they are unequal. The easiest way to test equality with the ALU is to subtract b from a and then test to see if the result is 0, since

(a − b = 0) ⇒ a = b

Thus, if we add hardware to test if the result is 0, we can test for equality. The simplest way is to OR all the outputs together and then send that signal through an inverter:

Zero = NOT (Result31 + Result30 + … + Result2 + Result1 + Result0)
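In Verilog this OR-and-invert is exactly what the reduction operators of Section B.4 give us; a one-line sketch (written as a fragment inside whatever module computes the 32-bit Result):

  wire Zero;
  assign Zero = ~(|Result);   // OR all 32 result bits together, then invert the output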

Figure B.5.12 shows the revised 32-bit ALU. We can think of the combination of the 1-bit Ainvert line, the 1-bit Binvert line, and the 2-bit Operation lines as 4-bit control lines for the ALU, telling it to perform add, subtract, AND, OR, or set on less than. Figure B.5.13 shows the ALU control lines and the corresponding ALU operation.

Finally, now that we have seen what is inside a 32-bit ALU, we will use the universal symbol for a complete ALU, as shown in Figure B.5.14.

Defining the MIPS ALU in Verilog

Figure B.5.15 shows how a combinational MIPS ALU might be specified in Verilog; such a specification would probably be compiled using a standard parts library that provided an adder, which could be instantiated. For completeness, we show the ALU control for MIPS in Figure B.5.16, which is used in Chapter 4, where we build a Verilog version of the MIPS datapath.

The next question is, “How quickly can this ALU add two 32-bit operands?” We can determine the a and b inputs, but the CarryIn input depends on the operation in the adjacent 1-bit adder. If we trace all the way through the chain of dependencies, we connect the most significant bit to the least significant bit, so the most significant bit of the sum must wait for the sequential evaluation of all 32 1-bit adders. This sequential chain reaction is too slow to be used in time-critical hardware. The next section explores how to speed up addition. This topic is not crucial to understanding the rest of the appendix and may be skipped.


FIGURE B.5.12 The final 32-bit ALU. This adds a Zero detector to Figure B.5.11.

ALU control lines   Function
0000                AND
0001                OR
0010                add
0110                subtract
0111                set on less than
1100                NOR

FIGURE B.5.13 The values of the three ALU control lines, Bnegate and Operation, and the corresponding ALU operations.


FIGURE B.5.14 The symbol commonly used to represent an ALU, as shown in Figure B.5.12. This symbol is also used to represent an adder, so it is normally labeled either with ALU or Adder.

FIGURE B.5.15 A Verilog behavioral definition of a MIPS ALU.
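Figure B.5.15 is not reproduced in this extraction. A behavioral sketch consistent with the control encoding of Figure B.5.13 (the port names are assumptions, not necessarily those of the figure) is:

  module MIPSALU (ALUctl, A, B, ALUOut, Zero);
    input      [3:0]  ALUctl;
    input      [31:0] A, B;
    output reg [31:0] ALUOut;
    output            Zero;

    assign Zero = (ALUOut == 0);      // Zero is true whenever the result is 0

    always @(ALUctl, A, B)            // re-evaluate if these change
      case (ALUctl)
        4'b0000: ALUOut = A & B;               // AND
        4'b0001: ALUOut = A | B;               // OR
        4'b0010: ALUOut = A + B;               // add
        4'b0110: ALUOut = A - B;               // subtract
        4'b0111: ALUOut = (A < B) ? 1 : 0;     // set on less than
        4'b1100: ALUOut = ~(A | B);            // NOR
        default: ALUOut = 0;                   // undefined encodings produce 0
      endcase
  endmodule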


B.6 Faster Addition: Carry Lookahead

significant bit of the adder, in theory we could calculate the CarryIn values to all the remaining bits of the adder in just two levels of logic.

For example, the CarryIn for bit 2 of the adder is exactly the CarryOut of bit 1, so the formula is

CarryIn2 = (b1 · CarryIn1) + (a1 · CarryIn1) + (a1 · b1)

Similarly, CarryIn1 is defined as

CarryIn1 = (b0 · CarryIn0) + (a0 · CarryIn0) + (a0 · b0)

Using the shorter and more traditional abbreviation of ci for CarryIni, we can rewrite the formulas as

c2 = (b1 · c1) + (a1 · c1) + (a1 · b1)
c1 = (b0 · c0) + (a0 · c0) + (a0 · b0)

Substituting the definition of c1 for the first equation results in this formula:

c2 = (a1 · a0 · b0) + (a1 · a0 · c0) + (a1 · b0 · c0) + (b1 · a0 · b0) + (b1 · a0 · c0) + (b1 · b0 · c0) + (a1 · b1)

You can imagine how the equation expands as we get to higher bits in the adder; it grows rapidly with the number of bits. This complexity is reflected in the cost of the hardware for fast carry, making this simple scheme prohibitively expensive for wide adders.

Fast Carry Using the First Level of Abstraction: Propagate and Generate

Most fast-carry schemes limit the complexity of the equations to simplify the hardware, while still making substantial speed improvements over ripple carry. One such scheme is a carry-lookahead adder. In Chapter 1, we said computer systems cope with complexity by using levels of abstraction. A carry-lookahead adder relies on levels of abstraction in its implementation.

Let's factor our original equation as a first step:

ci+1 = (bi · ci) + (ai · ci) + (ai · bi)
     = (ai · bi) + (ai + bi) · ci

If we were to rewrite the equation for c2 using this formula, we would see some repeated patterns:

c2 = (a1 · b1) + (a1 + b1) · ((a0 · b0) + (a0 + b0) · c0)

Note the repeated appearance of (ai · bi) and (ai + bi) in the formula above. These two important factors are traditionally called generate (gi) and propagate (pi):


gi = ai · bi
pi = ai + bi

Using them to define ci+1, we get

ci+1 = gi + pi · ci

To see where the signals get their names, suppose gi is 1. Then

ci+1 = gi + pi · ci = 1 + pi · ci = 1

That is, the adder generates a CarryOut (ci+1) independent of the value of CarryIn (ci). Now suppose that gi is 0 and pi is 1. Then

ci+1 = gi + pi · ci = 0 + 1 · ci = ci

That is, the adder propagates CarryIn to a CarryOut. Putting the two together, CarryIni+1 is a 1 if either gi is 1 or both pi is 1 and CarryIni is 1.

As an analogy, imagine a row of dominoes set on edge. The end domino can be tipped over by pushing one far away, provided there are no gaps between the two. Similarly, a carry out can be made true by a generate far away, provided all the propagates between them are true.

Relying on the definitions of propagate and generate as our first level of abstraction, we can express the CarryIn signals more economically. Let's show it for 4 bits:

c1 = g0 + (p0 · c0)
c2 = g1 + (p1 · g0) + (p1 · p0 · c0)
c3 = g2 + (p2 · g1) + (p2 · p1 · g0) + (p2 · p1 · p0 · c0)
c4 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0) + (p3 · p2 · p1 · p0 · c0)

These equations just represent common sense: CarryIni is a 1 if some earlier adder generates a carry and all intermediary adders propagate a carry. Figure B.6.1 uses plumbing to try to explain carry lookahead.
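These four equations translate almost word for word into Verilog; the following module is a sketch (not one of the book's figures) of a 4-bit carry-lookahead slice:

  module carry_lookahead_4 (input  wire [3:0] a, b,
                            input  wire       c0,
                            output wire [4:1] c);
    wire [3:0] g = a & b;   // generate:  gi = ai . bi
    wire [3:0] p = a | b;   // propagate: pi = ai + bi

    assign c[1] = g[0] | (p[0] & c0);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & c0);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                       | (p[3] & p[2] & p[1] & g[0])
                       | (p[3] & p[2] & p[1] & p[0] & c0);
  endmodule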

Even this simplified form leads to large equations and, hence, considerable logic even for a 16-bit adder. Let's try moving to two levels of abstraction.

Fast Carry Using the Second Level of Abstraction

First, we consider this 4-bit adder with its carry-lookahead logic as a single building block. If we connect them in ripple carry fashion to form a 16-bit adder, the add will be faster than the original with a little more hardware.


To go faster, we'll need carry lookahead at a higher level. To perform carry lookahead for 4-bit adders, we need to propagate and generate signals at this higher level. Here they are for the four 4-bit adder blocks:

P0 = p3 · p2 · p1 · p0
P1 = p7 · p6 · p5 · p4
P2 = p11 · p10 · p9 · p8
P3 = p15 · p14 · p13 · p12

That is, the “super” propagate signal for the 4-bit abstraction (Pi) is true only if each of the bits in the group will propagate a carry.

For the “super” generate signal (Gi), we care only if there is a carry out of the most significant bit of the 4-bit group. This obviously occurs if generate is true for that most significant bit; it also occurs if an earlier generate is true and all the intermediate propagates, including that of the most significant bit, are also true:

G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)

Figure B.6.2 updates our plumbing analogy to show P0 and G0.

Then the equations at this higher level of abstraction for the carry in for each 4-bit group of the 16-bit adder (C1, C2, C3, C4 in Figure B.6.3) are very similar to the carry out equations for each bit of the 4-bit adder (c1, c2, c3, c4) above:

C1 = G0 + (P0 · c0)
C2 = G1 + (P1 · G0) + (P1 · P0 · c0)
C3 = G2 + (P2 · G1) + (P2 · P1 · G0) + (P2 · P1 · P0 · c0)
C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · c0)

Figure B.6.3 shows 4-bit adders connected with such a carry-lookahead unit. The exercises explore the speed differences between these carry schemes, different notations for multibit propagate and generate signals, and the design of a 64-bit adder.
adder.


Both Levels of the Propagate and Generate

EXAMPLE

Determine the gi, pi, Pi, and Gi values of these two 16-bit numbers:

a: 0001 1010 0011 0011two
b: 1110 0101 1110 1011two

Also, what is CarryOut15 (C4)?

ANSWER

Aligning the bits makes it easy to see the values of generate gi (ai · bi) and propagate pi (ai + bi):

a:  0001 1010 0011 0011
b:  1110 0101 1110 1011
gi: 0000 0000 0010 0011
pi: 1111 1111 1111 1011

where the bits are numbered 15 to 0 from left to right. Next, the “super” propagates (P3, P2, P1, P0) are simply the AND of the lower-level propagates:

P3 = 1 · 1 · 1 · 1 = 1
P2 = 1 · 1 · 1 · 1 = 1
P1 = 1 · 1 · 1 · 1 = 1
P0 = 1 · 0 · 1 · 1 = 0

The “super” generates are more complex, so use the following equations:

G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
   = 0 + (1 · 0) + (1 · 0 · 1) + (1 · 0 · 1 · 1) = 0 + 0 + 0 + 0 = 0
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
   = 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) = 0 + 0 + 1 + 0 = 1
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
   = 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0) = 0 + 0 + 0 + 0 = 0
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)
   = 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0) = 0 + 0 + 0 + 0 = 0

Finally, CarryOut15 is

C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · c0)
   = 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) + (1 · 1 · 1 · 0 · 0)
   = 0 + 0 + 1 + 0 + 0 = 1

Hence, there is a carry out when adding these two 16-bit numbers.


FIGURE B.6.3 Four 4-bit ALUs using carry lookahead to form a 16-bit adder. Note that the carries come from the carry-lookahead unit, not from the 4-bit ALUs.


The reason carry lookahead can make carries faster is that all logic begins evaluating the moment the clock cycle begins, and the result will not change once the output of each gate stops changing. By taking the shortcut of going through fewer gates to send the carry in signal, the output of the gates will stop changing sooner, and hence the time for the adder can be less.

To appreciate the importance of carry lookahead, we need to calculate the relative performance between it and ripple carry adders.

Speed of Ripple Carry versus Carry Lookahead

EXAMPLE

One simple way to model time for logic is to assume each AND or OR gate takes the same time for a signal to pass through it. Time is estimated by simply counting the number of gates along the path through a piece of logic. Compare the number of gate delays for paths of two 16-bit adders, one using ripple carry and one using two-level carry lookahead.

ANSWER

Figure B.5.5 shows that the carry out signal takes two gate delays per bit. Then the number of gate delays between a carry in to the least significant bit and the carry out of the most significant is 16 × 2 = 32.

For carry lookahead, the carry out of the most significant bit is just C4, defined in the example. It takes two levels of logic to specify C4 in terms of Pi and Gi (the OR of several AND terms). Pi is specified in one level of logic (AND) using pi, and Gi is specified in two levels using pi and gi, so the worst case for this next level of abstraction is two levels of logic. pi and gi are each one level of logic, defined in terms of ai and bi. If we assume one gate delay for each level of logic in these equations, the worst case is 2 + 2 + 1 = 5 gate delays.

Hence, for the path from carry in to carry out, the 16-bit addition by a carry-lookahead adder is six times faster, using this very simple estimate of hardware speed.

Summary

Carry lookahead offers a faster path than waiting for the carries to ripple through all 32 1-bit adders. This faster path is paved by two signals, generate and propagate. The former creates a carry regardless of the carry input, and the latter passes a carry along. Carry lookahead also gives another example of how abstraction is important in computer design to cope with complexity.

Check Yourself

Using the simple estimate of hardware speed above with gate delays, what is the relative performance of a ripple carry 8-bit add versus a 64-bit add using carry-lookahead logic?

1. A 64-bit carry-lookahead adder is three times faster: 8-bit adds are 16 gate delays and 64-bit adds are 7 gate delays.

2. They are about the same speed, since 64-bit adds need more levels of logic in the 16-bit adder.

3. 8-bit adds are faster than 64 bits, even with carry lookahead.

Elaboration: We have now accounted for all but one of the arithmetic and logical operations for the core MIPS instruction set: the ALU in Figure B.5.14 omits support of shift instructions. It would be possible to widen the ALU multiplexor to include a left shift by 1 bit or a right shift by 1 bit. But hardware designers have created a circuit called a barrel shifter, which can shift from 1 to 31 bits in no more time than it takes to add two 32-bit numbers, so shifting is normally done outside the ALU.

Elaboration: The logic equation for the Sum output of the full adder can be expressed more simply by using a more powerful gate than AND and OR. An exclusive OR gate is true if the two operands disagree; that is,

x ≠ y ⇒ 1 and x = y ⇒ 0

In some technologies, exclusive OR is more efficient than two levels of AND and OR gates. Using the symbol ⊕ to represent exclusive OR, here is the new equation:

Sum = a ⊕ b ⊕ CarryIn

Also, we have drawn the ALU the traditional way, using gates. Computers are designed today in CMOS transistors, which are basically switches. CMOS ALUs and barrel shifters take advantage of these switches and have many fewer multiplexors than shown in our designs, but the design principles are similar.

Elaboration: Using lowercase and uppercase to distinguish the hierarchy of generate and propagate symbols breaks down when you have more than two levels. An alternate notation that scales is gi..j and pi..j for the generate and propagate signals for bits i to j. Thus, g1..1 is the generate signal for bit 1, g4..1 is for bits 4 to 1, and g16..1 is for bits 16 to 1.


B.7 Clocks

clock edge occurs. A signal is valid if it is stable (i.e., not changing), and the value will not change again until the inputs change. Since combinational circuits cannot have feedback, if the inputs to a combinational logic unit are not changed, the outputs will eventually become valid.

Figure B.7.2 shows the relationship among the state elements and the combinational logic blocks in a synchronous, sequential logic design. The state elements, whose outputs change only after the clock edge, provide valid inputs to the combinational logic block. To ensure that the values written into the state elements on the active clock edge are valid, the clock must have a long enough period so that all the signals in the combinational logic block stabilize, and then the clock edge samples those values for storage in the state elements. This constraint sets a lower bound on the length of the clock period, which must be long enough for all state element inputs to be valid.

In the rest of this appendix, as well as in Chapter 4, we usually omit the clock signal, since we are assuming that all state elements are updated on the same clock edge. Some state elements will be written on every clock edge, while others will be written only under certain conditions (such as a register being updated). In such cases, we will have an explicit write signal for that state element. The write signal must still be gated with the clock so that the update occurs only on the clock edge if the write signal is active. We will see how this is done and used in the next section.

One other advantage of an edge-triggered methodology is that it is possible to have a state element that is used as both an input and output to the same combinational logic block, as shown in Figure B.7.3. In practice, care must be taken to prevent races in such situations and to ensure that the clock period is long enough; this topic is discussed further in Section B.11.

Now that we have discussed how clocking is used to update state elements, we can discuss how to construct the state elements.

FIGURE B.7.2 The inputs to a combinational logic block come from a state element, and the outputs are written into a state element. The clock edge determines when the contents of the state elements are updated.


B.8 Memory Elements: Flip-Flops, Latches, and Registers

The simplest type of memory elements are unclocked; that is, they do not have any clock input. Although we only use clocked memory elements in this text, an unclocked latch is the simplest memory element, so let's look at this circuit first. Figure B.8.1 shows an S-R latch (set-reset latch), built from a pair of NOR gates (OR gates with inverted outputs). The outputs Q and Q̄ represent the value of the stored state and its complement. When neither S nor R are asserted, the cross-coupled NOR gates act as inverters and store the previous values of Q and Q̄.

For example, if the output, Q, is true, then the bottom inverter produces a false output (which is Q̄), which becomes the input to the top inverter, which produces a true output, which is Q, and so on. If S is asserted, then the output Q will be asserted and Q̄ will be deasserted, while if R is asserted, then the output Q̄ will be asserted and Q will be deasserted. When S and R are both deasserted, the last values of Q and Q̄ will continue to be stored in the cross-coupled structure. Asserting S and R simultaneously can lead to incorrect operation: depending on how S and R are deasserted, the latch may oscillate or become metastable (this is described in more detail in Section B.11).

This cross-coupled structure is the basis for more complex memory elements that allow us to store data signals. These elements contain additional gates used to store signal values and to cause the state to be updated only in conjunction with a clock. The next section shows how these elements are built.

Flip-Flops and Latches

Flip-flops and latches are the simplest memory elements. In both flip-flops and latches, the output is equal to the value of the stored state inside the element. Furthermore, unlike the S-R latch described above, all the latches and flip-flops we will use from this point on are clocked, which means that they have a clock input and the change of state is triggered by that clock. The difference between a flip-flop and a latch is the point at which the clock causes the state to actually change. In a clocked latch, the state is changed whenever the appropriate inputs change and the clock is asserted, whereas in a flip-flop, the state is changed only on a clock edge. Since throughout this text we use an edge-triggered timing methodology where state is only updated on clock edges, we need only use flip-flops. Flip-flops are often built from latches, so we start by describing the operation of a simple clocked latch and then discuss the operation of a flip-flop constructed from that latch.

flip-flop A memory element for which the output is equal to the value of the stored state inside the element and for which the internal state is changed only on a clock edge.

latch A memory element in which the output is equal to the value of the stored state inside the element and the state is changed whenever the appropriate inputs change and the clock is asserted.

D flip-flop A flip-flop with one data input that stores the value of that input signal in the internal memory when the clock edge occurs.

For computer applications, the function of both flip-flops and latches is to store a signal. A D latch or D flip-flop stores the value of its data input signal in the internal memory. Although there are many other types of latch and flip-flop, the D type is the only basic building block that we will need. A D latch has two inputs and two outputs. The inputs are the data value to be stored (called D) and a clock signal (called C) that indicates when the latch should read the value on the D input and store it.


The outputs are simply the value of the internal state (Q) and its complement (Q̄). When the clock input C is asserted, the latch is said to be open, and the value of the output (Q) becomes the value of the input D. When the clock input C is deasserted, the latch is said to be closed, and the value of the output (Q) is whatever value was stored the last time the latch was open.

Figure B.8.2 shows how a D latch can be implemented with two additional gates added to the cross-coupled NOR gates. Since when the latch is open the value of Q changes as D changes, this structure is sometimes called a transparent latch. Figure B.8.3 shows how this D latch works, assuming that the output Q is initially false and that D changes first.

FIGURE B.8.2 A D latch implemented with NOR gates. A NOR gate acts as an inverter if the other input is 0. Thus, the cross-coupled pair of NOR gates acts to store the state value unless the clock input, C, is asserted, in which case the value of input D replaces the value of Q and is stored. The value of input D must be stable when the clock signal C changes from asserted to deasserted.

FIGURE B.8.3 Operation of a D latch, assuming the output is initially deasserted. When the clock, C, is asserted, the latch is open and the Q output immediately assumes the value of the D input.

As mentioned earlier, we use flip-flops as the basic building block, rather than latches. Flip-flops are not transparent: their outputs change only on the clock edge. A flip-flop can be built so that it triggers on either the rising (positive) or falling (negative) clock edge; for our designs we can use either type. Figure B.8.4 shows how a falling-edge D flip-flop is constructed from a pair of D latches. In a D flip-flop, the output is stored when the clock edge occurs. Figure B.8.5 shows how this flip-flop operates.


FIGURE B.8.4 A D flip-flop with a falling-edge trigger. The first latch, called the master, is open and follows the input D when the clock input, C, is asserted. When the clock input, C, falls, the first latch is closed, but the second latch, called the slave, is open and gets its input from the output of the master latch.

FIGURE B.8.5 Operation of a D flip-flop with a falling-edge trigger, assuming the output is initially deasserted. When the clock input (C) changes from asserted to deasserted, the Q output stores the value of the D input. Compare this behavior to that of the clocked D latch shown in Figure B.8.3. In a clocked latch, the stored value and the output, Q, both change whenever C is high, as opposed to only when C transitions.

Here is a Verilog description of a module for a rising-edge D flip-flop, assuming that clock is the clock input and D is the data input:

  module DFF (clock, D, Q, Qbar);
    input clock, D;
    output reg Q;        // Q is a reg since it is assigned in an always block
    output Qbar;

    assign Qbar = ~Q;    // Qbar is always just the inverse of Q

    always @(posedge clock)   // perform actions whenever the clock rises
      Q = D;
  endmodule

Because the D input is sampled on the clock edge, it must be valid for a period of time immediately before and immediately after the clock edge. The minimum time that the input must be valid before the clock edge is called the setup time; the minimum time during which it must be valid after the clock edge is called the hold time. Thus the inputs to any flip-flop (or anything built using flip-flops) must be valid during a window that begins at time t_setup before the clock edge and ends at t_hold after the clock edge, as shown in Figure B.8.6. Section B.11 talks about clocking and timing constraints, including the propagation delay through a flip-flop, in more detail.

setup time The minimum time that the input to a memory device must be valid before the clock edge.

hold time The minimum time during which the input must be valid after the clock edge.

FIGURE B.8.6 Setup and hold time requirements for a D flip-flop with a falling-edge trigger. The input must be stable for a period of time before the clock edge, as well as after the clock edge. The minimum time the signal must be stable before the clock edge is called the setup time, while the minimum time the signal must be stable after the clock edge is called the hold time. Failure to meet these minimum requirements can result in a situation where the output of the flip-flop may not be predictable, as described in Section B.11. Hold times are usually either 0 or very small and thus not a cause of worry.

We can use an array of D flip-flops to build a register that can hold a multibit datum, such as a byte or word. We used registers throughout our datapaths in Chapter 4.

Register Files

One structure that is central to our datapath is a register file. A register file consists of a set of registers that can be read and written by supplying a register number to be accessed. A register file can be implemented with a decoder for each read or write port and an array of registers built from D flip-flops. Because reading a register does not change any state, we need only supply a register number as an input, and the only output will be the data contained in that register. For writing a register we will need three inputs: a register number, the data to write, and a clock that controls the writing into the register. In Chapter 4, we used a register file that has two read ports and one write port. This register file is drawn as shown in Figure B.8.7. The read ports can be implemented with a pair of multiplexors, each of which is as wide as the number of bits in each register of the register file. Figure B.8.8 shows the implementation of two register read ports for a 32-bit-wide register file.

Implementing the write port is slightly more complex, since we can only change the contents of the designated register. We can do this by using a decoder to generate a signal that can be used to determine which register to write. Figure B.8.9 shows how to implement the write port for a register file. It is important to remember that the flip-flop changes state only on the clock edge. In Chapter 4, we hooked up write signals for the register file explicitly and assumed the clock shown in Figure B.8.9 is attached implicitly.


FIGURE B.8.7 A register file with two read ports and one write port has five inputs and two outputs. The control input Write is shown in color.

FIGURE B.8.8 The implementation of two read ports for a register file with n registers can be done with a pair of n-to-1 multiplexors, each 32 bits wide. The register read number signal is used as the multiplexor selector signal. Figure B.8.9 shows how the write port is implemented.


FIGURE B.8.9 The write port for a register file is implemented with a decoder that is used with the write signal to generate the C input to the registers. All three inputs (the register number, the data, and the write signal) will have setup and hold-time constraints that ensure that the correct data is written into the register file.

What happens if the same register is read and written during a clock cycle? Because the write of the register file occurs on the clock edge, the register will be valid during the time it is read, as we saw earlier in Figure B.7.2. The value returned will be the value written in an earlier clock cycle. If we want a read to return the value currently being written, additional logic in the register file or outside of it is needed. Chapter 4 makes extensive use of such logic.

Specifying Sequential Logic in Verilog

To specify sequential logic in Verilog, we must understand how to generate a clock, how to describe when a value is written into a register, and how to specify sequential control. Let us start by specifying a clock. A clock is not a predefined object in Verilog; instead, we generate a clock by using the Verilog notation #n before a statement; this causes a delay of n simulation time steps before the execution of the statement. In most Verilog simulators, it is also possible to generate a clock as an external input, allowing the user to specify at simulation time the number of clock cycles during which to run a simulation.

The code in Figure B.8.10 implements a simple clock that is high or low for one simulation unit and then switches state. We use the delay capability and blocking assignment to implement the clock.


FIGURE B.8.10 A specification of a clock.
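The code of Figure B.8.10 did not survive the extraction; a clock along the lines described (alternating every simulation time unit, using the #n delay and a blocking assignment) could be written as:

  module clock_gen (output reg clock);
    initial clock = 0;           // start the clock low
    always  #1 clock = ~clock;   // after 1 time unit, toggle; repeats forever
  endmodule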

Next, we must be able to specify the operation of an edge-triggered register. In Verilog, this is done by using the sensitivity list on an always block and specifying as a trigger either the positive or negative edge of a binary variable with the notation posedge or negedge, respectively. Hence, the following Verilog code causes register A to be written with the value b at the positive edge of the clock:
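The code itself is missing from this extraction; it would be along these lines (with A declared as a reg and b as a wire or reg):

  always @(posedge clock)
    A <= b;   // A is updated with the value of b on each rising clock edge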

FIGURE B.8.11 A MIPS register file written in behavioral Verilog. This register file writes on the rising clock edge.

Throughout this chapter and the Verilog sections of Chapter 4, we will assume a positive edge-triggered design. Figure B.8.11 shows a Verilog specification of a MIPS register file that assumes two reads and one write, with only the write being clocked.
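Figure B.8.11 is not reproduced in this extraction; a behavioral register file consistent with that description (two combinational read ports, one clocked write port; the port names are assumptions) might be sketched as:

  module registerfile (input  wire        clock,
                       input  wire        RegWrite,
                       input  wire [4:0]  Read1, Read2, WriteReg,
                       input  wire [31:0] WriteData,
                       output wire [31:0] Data1, Data2);
    reg [31:0] RF [0:31];                    // 32 registers, each 32 bits wide

    assign Data1 = RF[Read1];                // reads are continuous (combinational)
    assign Data2 = RF[Read2];

    always @(posedge clock)                  // only the write is clocked
      if (RegWrite)
        RF[WriteReg] <= WriteData;
  endmodule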


Check Yourself

In the Verilog for the register file in Figure B.8.11, the output ports corresponding to the registers being read are assigned using a continuous assignment, but the register being written is assigned in an always block. Which of the following is the reason?

a. There is no special reason. It was simply convenient.

b. Because Data1 and Data2 are output ports and WriteData is an input port.

c. Because reading is a combinational event, while writing is a sequential event.

B.9 Memory Elements: SRAMs and DRAMs

static random access memory (SRAM) A memory where data is stored statically (as in flip-flops) rather than dynamically (as in DRAM). SRAMs are faster than DRAMs, but less dense and more expensive per bit.

Registers and register files provide the basic building blocks for small memories, but larger amounts of memory are built using either SRAMs (static random access memories) or DRAMs (dynamic random access memories). We first discuss SRAMs, which are somewhat simpler, and then turn to DRAMs.

SRAMs

SRAMs are simply integrated circuits that are memory arrays with (usually) a single access port that can provide either a read or a write. SRAMs have a fixed access time to any datum, though the read and write access characteristics often differ. An SRAM chip has a specific configuration in terms of the number of addressable locations, as well as the width of each addressable location. For example, a 4M × 8 SRAM provides 4M entries, each of which is 8 bits wide. Thus it will have 22 address lines (since 4M = 2^22), an 8-bit data output line, and an 8-bit single data input line. As with ROMs, the number of addressable locations is often called the height, with the number of bits per unit called the width. For a variety of technical reasons, the newest and fastest SRAMs are typically available in narrow configurations: ×1 and ×4. Figure B.9.1 shows the input and output signals for a 2M × 16 SRAM.

FIGURE B.9.1 A 2M × 16 SRAM showing the 21 address lines (2M = 2^21) and 16 data inputs, the 3 control lines, and the 16 data outputs.


B.9 Memory Elements: SRAMs <strong>and</strong> DRAMs B-59<br />

To initiate a read or write access, the Chip select signal must be made active.<br />

For reads, we must also activate the Output enable signal that controls whether or<br />

not the datum selected by the address is actually driven on the pins. The Output<br />

enable is useful for connecting multiple memories to a single-output bus <strong>and</strong> using<br />

Output enable to determine which memory drives the bus. The SRAM read access<br />

time is usually specified as the delay from the time that Output enable is true <strong>and</strong><br />

the address lines are valid until the time that the data is on the output lines. Typical<br />

read access times for SRAMs in 2004 varied from about 2–4 ns for the fastest CMOS<br />

parts, which tend to be somewhat smaller <strong>and</strong> narrower, to 8–20 ns for the typical<br />

largest parts, which in 2004 had more than 32 million bits of data. The dem<strong>and</strong> for<br />

low-power SRAMs for consumer products <strong>and</strong> digital appliances has grown greatly<br />

in the past five years; these SRAMs have much lower st<strong>and</strong>-by <strong>and</strong> access power,<br />

but usually are 5–10 times slower. Most recently, synchronous SRAMs—similar to<br />

the synchronous DRAMs, which we discuss in the next section—have also been<br />

developed.<br />

For writes, we must supply the data to be written <strong>and</strong> the address, as well as<br />

signals to cause the write to occur. When both the Write enable <strong>and</strong> Chip select are<br />

true, the data on the data input lines is written into the cell specified by the address.<br />

There are setup-time <strong>and</strong> hold-time requirements for the address <strong>and</strong> data lines,<br />

just as there were for D flip-flops <strong>and</strong> latches. In addition, the Write enable signal<br />

is not a clock edge but a pulse with a minimum width requirement. The time to<br />

complete a write is specified by the combination of the setup times, the hold times,<br />

<strong>and</strong> the Write enable pulse width.<br />

Large SRAMs cannot be built in the same way we build a register file because, unlike a register file where a 32-to-1 multiplexor might be practical, the 64K-to-1 multiplexor that would be needed for a 64K × 1 SRAM is totally impractical.

Rather than use a giant multiplexor, large memories are implemented with a shared<br />

output line, called a bit line, which multiple memory cells in the memory array can<br />

assert. To allow multiple sources to drive a single line, a three-state buffer (or tristate<br />

buffer) is used. A three-state buffer has two inputs—a data signal <strong>and</strong> an Output<br />

enable—<strong>and</strong> a single output, which is in one of three states: asserted, deasserted,<br />

or high impedance. The output of a tristate buffer is equal to the data input signal,<br />

either asserted or deasserted, if the Output enable is asserted, <strong>and</strong> is otherwise in a<br />

high-impedance state that allows another three-state buffer whose Output enable is<br />

asserted to determine the value of a shared output.<br />

Figure B.9.2 shows a set of three-state buffers wired to form a multiplexor with a<br />

decoded input. It is critical that the Output enable of at most one of the three-state<br />

buffers be asserted; otherwise, the three-state buffers may try to set the output line<br />

differently. By using three-state buffers in the individual cells of the SRAM, each<br />

cell that corresponds to a particular output can share the same output line. The use<br />

of a set of distributed three-state buffers is a more efficient implementation than a<br />

large centralized multiplexor. The three-state buffers are incorporated into the flip-flops that form the basic cells of the SRAM. Figure B.9.3 shows how a small 4 × 2 SRAM might be built, using D latches with an input called Enable that controls the three-state output.
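Because Verilog has a high-impedance value (z), the shared output line can be modeled directly. The sketch below is our own version of the four-buffer multiplexor of Figure B.9.2; it assumes, as the figure's caption requires, that at most one Select input is asserted at a time.

module tristate_mux4 (
  input  wire [3:0] data,    // Data 0 through Data 3
  input  wire [3:0] select,  // decoded enables; at most one may be asserted
  output wire       out      // the shared output (bit) line
);
  // Each continuous assignment models one three-state buffer: it drives
  // the line when its enable is asserted and floats (z) otherwise, so the
  // enabled buffer determines the value of the shared output.
  assign out = select[0] ? data[0] : 1'bz;
  assign out = select[1] ? data[1] : 1'bz;
  assign out = select[2] ? data[2] : 1'bz;
  assign out = select[3] ? data[3] : 1'bz;
endmodule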



[Figure B.9.2 shows four three-state buffers, each with a data input (Data 0–Data 3) and an Enable input driven by one of the decoded Select 0–Select 3 signals; all four Out pins are wired together to form the shared Output.]

FIGURE B.9.2 Four three-state buffers are used to form a multiplexor. Only one of the four Select inputs can be asserted. A three-state buffer with a deasserted Output enable has a high-impedance output that allows a three-state buffer whose Output enable is asserted to drive the shared output line.

The design in Figure B.9.3 eliminates the need for an enormous multiplexor; however, it still requires a very large decoder and a correspondingly large number of word lines. For example, in a 4M × 8 SRAM, we would need a 22-to-4M decoder and 4M word lines (which are the lines used to enable the individual flip-flops)! To circumvent this problem, large memories are organized as rectangular arrays and use a two-step decoding process. Figure B.9.4 shows how a 4M × 8 SRAM might be organized internally using a two-step decode. As we will see, the two-level decoding process is quite important in understanding how DRAMs operate.

Recently we have seen the development of both synchronous SRAMs (SSRAMs)<br />

<strong>and</strong> synchronous DRAMs (SDRAMs). The key capability provided by synchronous<br />

RAMs is the ability to transfer a burst of data from a series of sequential addresses<br />

within an array or row. The burst is defined by a starting address, supplied in the<br />

usual fashion, <strong>and</strong> a burst length. The speed advantage of synchronous RAMs<br />

comes from the ability to transfer the bits in the burst without having to specify<br />

additional address bits. Instead, a clock is used to transfer the successive bits in the<br />

burst. The elimination of the need to specify the address for the transfers within<br />

the burst significantly improves the rate for transferring the block of data. Because<br />

of this capability, synchronous SRAMs <strong>and</strong> DRAMs are rapidly becoming the<br />

RAMs of choice for building memory systems in computers. We discuss the use of<br />

synchronous DRAMs in a memory system in more detail in the next section <strong>and</strong><br />

in Chapter 5.
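The burst idea can be sketched behaviorally. The module below is only our illustration, not a real SDRAM or SSRAM interface: the starting address is supplied once, and a small counter, clocked like the data, steps through the following locations without any further address bits.

module burst_read (
  input  wire        clock,
  input  wire        start,           // begin a 4-word burst
  input  wire [20:0] start_address,
  output reg  [15:0] dout
);
  reg [15:0] mem [0:(1 << 21) - 1];   // the memory array (2M x 16 here)
  reg [20:0] addr;                    // internal address counter
  reg [2:0]  remaining;               // words left in the current burst

  always @(posedge clock) begin
    if (start) begin
      addr      <= start_address;     // a burst is a starting address plus a length
      remaining <= 3'd4;
    end else if (remaining != 0) begin
      dout      <= mem[addr];         // next sequential word, no new address needed
      addr      <= addr + 21'd1;
      remaining <= remaining - 3'd1;
    end
  end
endmodule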



DRAMs<br />

In a static RAM (SRAM), the value stored in a cell is kept on a pair of inverting gates,<br />

<strong>and</strong> as long as power is applied, the value can be kept indefinitely. In a dynamic<br />

RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor. A single<br />

transistor is then used to access this stored charge, either to read the value or to<br />

overwrite the charge stored there. Because DRAMs use only a single transistor per<br />

bit of storage, they are much denser <strong>and</strong> cheaper per bit. By comparison, SRAMs<br />

require four to six transistors per bit. Because DRAMs store the charge on a<br />

capacitor, it cannot be kept indefinitely <strong>and</strong> must periodically be refreshed. That is<br />

why this memory structure is called dynamic, as opposed to the static storage in a<br />

SRAM cell.<br />

To refresh the cell, we merely read its contents <strong>and</strong> write it back. The charge can<br />

be kept for several milliseconds, which might correspond to close to a million clock<br />

cycles. Today, single-chip memory controllers often h<strong>and</strong>le the refresh function<br />

independently of the processor. If every bit had to be read out of the DRAM <strong>and</strong><br />

then written back individually, with large DRAMs containing multiple megabytes,<br />

we would constantly be refreshing the DRAM, leaving no time for accessing it.<br />

Fortunately, DRAMs also use a two-level decoding structure, <strong>and</strong> this allows us<br />

to refresh an entire row (which shares a word line) with a read cycle followed<br />

immediately by a write cycle. Typically, refresh operations consume 1% to 2% of<br />

the active cycles of the DRAM, leaving the remaining 98% to 99% of the cycles<br />

available for reading <strong>and</strong> writing data.<br />
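As a rough worked example with assumed numbers (they are ours, not the text's): if a DRAM has 2048 rows, every row must be refreshed at least once every 4 ms, and one read-then-write refresh cycle takes about 40 ns, then refresh costs $2048 \times 40\,\text{ns} \approx 82\,\mu\text{s}$ out of every 4 ms, or about 2% of the cycles, consistent with the 1% to 2% figure above.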

Elaboration: How does a DRAM read <strong>and</strong> write the signal stored in a cell? The<br />

transistor inside the cell is a switch, called a pass transistor, that allows the value stored<br />

on the capacitor to be accessed for either reading or writing. Figure B.9.5 shows how<br />

the single-transistor cell looks. The pass transistor acts like a switch: when the signal<br />

on the word line is asserted, the switch is closed, connecting the capacitor to the bit<br />

line. If the operation is a write, then the value to be written is placed on the bit line. If<br />

the value is a 1, the capacitor will be charged. If the value is a 0, then the capacitor will<br />

be discharged. Reading is slightly more complex, since the DRAM must detect a very<br />

small charge stored in the capacitor. Before activating the word line for a read, the bit<br />

line is charged to the voltage that is halfway between the low <strong>and</strong> high voltage. Then, by<br />

activating the word line, the charge on the capacitor is read out onto the bit line. This<br />

causes the bit line to move slightly toward the high or low direction, <strong>and</strong> this change is<br />

detected with a sense amplifier, which can detect small changes in voltage.



[Figure B.9.5 shows the word line, pass transistor, capacitor, and bit line of the cell.]

FIGURE B.9.5 A single-transistor DRAM cell contains a capacitor that stores the cell contents and a transistor used to access the cell.

[Figure B.9.6 shows an 11-to-2048 row decoder driving a 2048 × 2048 array; Address[10–0] also feeds the column latches and a multiplexor that produces Dout.]

FIGURE B.9.6 A 4M × 1 DRAM is built with a 2048 × 2048 array. The row access uses 11 bits to select a row, which is then latched in 2048 1-bit latches. A multiplexor chooses the output bit from these 2048 latches. The RAS and CAS signals control whether the address lines are sent to the row decoder or column multiplexor.



DRAMs use a two-level decoder consisting of a row access followed by a column<br />

access, as shown in Figure B.9.6. The row access chooses one of a number of rows<br />

<strong>and</strong> activates the corresponding word line. The contents of all the columns in the<br />

active row are then stored in a set of latches. The column access then selects the<br />

data from the column latches. To save pins <strong>and</strong> reduce the package cost, the same<br />

address lines are used for both the row <strong>and</strong> column address; a pair of signals called<br />

RAS (Row Access Strobe) <strong>and</strong> CAS (Column Access Strobe) are used to signal the<br />

DRAM that either a row or column address is being supplied. Refresh is performed<br />

by simply reading the columns into the column latches <strong>and</strong> then writing the same<br />

values back. Thus, an entire row is refreshed in one cycle. The two-level addressing<br />

scheme, combined with the internal circuitry, makes DRAM access times much<br />

longer (by a factor of 5–10) than SRAM access times. In 2004, typical DRAM access<br />

times ranged from 45 to 65 ns; 256 Mbit DRAMs are in full production, <strong>and</strong> the<br />

first customer samples of 1 GB DRAMs became available in the first quarter of<br />

2004. The much lower cost per bit makes DRAM the choice for main memory,<br />

while the faster access time makes SRAM the choice for caches.<br />
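A behavioral sketch can make the two-step access concrete. The module below is our own illustration of the 2048 × 2048 organization of Figure B.9.6, with invented names; it models only the read path, latching a whole row into the column latches on the falling edge of RAS and selecting one bit on the falling edge of CAS, and it omits writes, refresh, precharge, and all real timing.

module dram_4m_x1 (
  input  wire        ras_n,     // active-low Row Access Strobe
  input  wire        cas_n,     // active-low Column Access Strobe
  input  wire [10:0] address,   // shared row/column address pins
  output reg         dout
);
  reg cell      [0:2047][0:2047];  // 4M one-bit cells as a 2048 x 2048 array
  reg row_latch [0:2047];          // the 2048 column latches

  integer i;

  // Row access: the selected word line copies an entire row into the
  // column latches.
  always @(negedge ras_n)
    for (i = 0; i < 2048; i = i + 1)
      row_latch[i] <= cell[address][i];

  // Column access: the same address pins now pick one of the latched bits.
  always @(negedge cas_n)
    dout <= row_latch[address];
endmodule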

You might observe that a 64M × 4 DRAM actually accesses 8K bits on every row access and then throws away all but 4 of those during a column access. DRAM

designers have used the internal structure of the DRAM as a way to provide<br />

higher b<strong>and</strong>width out of a DRAM. This is done by allowing the column address to<br />

change without changing the row address, resulting in an access to other bits in the<br />

column latches. To make this process faster <strong>and</strong> more precise, the address inputs<br />

were clocked, leading to the dominant form of DRAM in use today: synchronous<br />

DRAM or SDRAM.<br />

Since about 1999, SDRAMs have been the memory chip of choice for most<br />

cache-based main memory systems. SDRAMs provide fast access to a series of bits<br />

within a row by sequentially transferring all the bits in a burst under the control<br />

of a clock signal. In 2004, DDRRAMs (Double Data Rate RAMs), which are called<br />

double data rate because they transfer data on both the rising <strong>and</strong> falling edge of<br />

an externally supplied clock, were the most heavily used form of SDRAMs. As we<br />

discuss in Chapter 5, these high-speed transfers can be used to boost the b<strong>and</strong>width<br />

available out of main memory to match the needs of the processor <strong>and</strong> caches.<br />

Error Correction<br />

Because of the potential for data corruption in large memories, most computer<br />

systems use some sort of error-checking code to detect possible corruption of data.<br />

One simple code that is heavily used is a parity code. In a parity code the number<br />

of 1s in a word is counted; the word has odd parity if the number of 1s is odd <strong>and</strong>



Outputs:

            NSlite   EWlite
  NSgreen      1        0
  EWgreen      0        1

The machine can also be drawn as a graph whose arcs represent the next-state function, with labels on the arcs specifying the input condition as logic functions. Figure B.10.2 shows the graphical representation for this finite-state machine.

[Figure B.10.2 shows the two states NSgreen (output NSlite) and EWgreen (output EWlite); the arc from NSgreen to EWgreen is labeled EWcar, the arc from EWgreen back to NSgreen is labeled NScar, and each state has a self-loop for the complementary condition.]

FIGURE B.10.2 The graphical representation of the two-state traffic light controller. We simplified the logic functions on the state transitions. For example, the transition from NSgreen to EWgreen in the next-state table is $(\overline{\text{NScar}} \cdot \text{EWcar}) + (\text{NScar} \cdot \text{EWcar})$, which is equivalent to EWcar.

A finite-state machine can be implemented with a register to hold the current<br />

state <strong>and</strong> a block of combinational logic that computes the next-state function <strong>and</strong><br />

the output function. Figure B.10.3 shows how a finite-state machine with 4 bits of<br />

state, <strong>and</strong> thus up to 16 states, might look. To implement the finite-state machine<br />

in this way, we must first assign state numbers to the states. This process is called<br />

state assignment. For example, we could assign NSgreen to state 0 <strong>and</strong> EWgreen to<br />

state 1. The state register would contain a single bit. The next-state function would<br />

be given as<br />

$$\text{NextState} = (\overline{\text{CurrentState}} \cdot \text{EWcar}) + (\text{CurrentState} \cdot \overline{\text{NScar}})$$
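Figure B.10.4 gives a Verilog version of this controller; since that listing is not reproduced here, the following is our own minimal sketch of the same two-state machine, using the state assignment just described (NSgreen = 0, EWgreen = 1) and adding a reset input for completeness.

module traffic_light (
  input  wire clock,
  input  wire reset,     // added here so the state starts in NSgreen
  input  wire NScar,
  input  wire EWcar,
  output wire NSlite,
  output wire EWlite
);
  reg state;             // 0 = NSgreen, 1 = EWgreen

  // Next-state function from the equation above:
  // NextState = (not CurrentState and EWcar) or (CurrentState and not NScar)
  always @(posedge clock)
    if (reset) state <= 1'b0;
    else       state <= (~state & EWcar) | (state & ~NScar);

  // Output function from the table above.
  assign NSlite = ~state;
  assign EWlite =  state;
endmodule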



FIGURE B.10.4<br />

A Verilog version of the traffic light controller.<br />

Check<br />

Yourself<br />

What is the smallest number of states in a Moore machine for which a Mealy<br />

machine could have fewer states?<br />

a. Two, since there could be a one-state Mealy machine that might do the same<br />

thing.<br />

b. Three, since there could be a simple Moore machine that went to one of two<br />

different states <strong>and</strong> always returned to the original state after that. For such a<br />

simple machine, a two-state Mealy machine is possible.<br />

c. You need at least four states to exploit the advantages of a Mealy machine<br />

over a Moore machine.<br />

B.11 Timing Methodologies<br />

Throughout this appendix <strong>and</strong> in the rest of the text, we use an edge-triggered<br />

timing methodology. This timing methodology has an advantage in that it is<br />

simpler to explain <strong>and</strong> underst<strong>and</strong> than a level-triggered methodology. In this<br />

section, we explain this timing methodology in a little more detail <strong>and</strong> also<br />

introduce level-sensitive clocking. We conclude this section by briefly discussing



the issue of asynchronous signals <strong>and</strong> synchronizers, an important problem for<br />

digital designers.<br />

The purpose of this section is to introduce the major concepts in clocking<br />

methodology. The section makes some important simplifying assumptions; if you<br />

are interested in underst<strong>and</strong>ing timing methodology in more detail, consult one of<br />

the references listed at the end of this appendix.<br />

We use an edge-triggered timing methodology because it is simpler to explain<br />

<strong>and</strong> has fewer rules required for correctness. In particular, if we assume that all<br />

clocks arrive at the same time, we are guaranteed that a system with edge-triggered<br />

registers between blocks of combinational logic can operate correctly without races<br />

if we simply make the clock long enough. A race occurs when the contents of a<br />

state element depend on the relative speed of different logic elements. In an edge-triggered design, the clock cycle must be long enough to accommodate the path

from one flip-flop through the combinational logic to another flip-flop where it<br />

must satisfy the setup-time requirement. Figure B.11.1 shows this requirement for<br />

a system using rising edge-triggered flip-flops. In such a system the clock period<br />

(or cycle time) must be at least as large as<br />

$$t_{\text{prop}} + t_{\text{combinational}} + t_{\text{setup}}$$

for the worst-case values of these three delays, which are defined as follows:<br />

■ $t_{\text{prop}}$ is the time for a signal to propagate through a flip-flop; it is also sometimes called clock-to-Q.

■ $t_{\text{combinational}}$ is the longest delay for any combinational logic (which by definition is surrounded by two flip-flops).

■ $t_{\text{setup}}$ is the time before the rising clock edge that the input to a flip-flop must be valid.
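As a numeric illustration with assumed values (they are ours, not the text's): if $t_{\text{prop}} = 0.15$ ns, $t_{\text{combinational}} = 0.70$ ns, and $t_{\text{setup}} = 0.15$ ns, then the clock period must be at least $0.15 + 0.70 + 0.15 = 1.0$ ns, limiting the clock rate to 1 GHz.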

[Figure B.11.1 shows two rising-edge-triggered D flip-flops separated by a combinational logic block; the delays t_prop, t_combinational, and t_setup are marked along the path.]

FIGURE B.11.1 In an edge-triggered design, the clock must be long enough to allow signals to be valid for the required setup time before the next clock edge. The time for a flip-flop input to propagate to the flip-flop outputs is $t_{\text{prop}}$; the signal then takes $t_{\text{combinational}}$ to travel through the combinational logic and must be valid $t_{\text{setup}}$ before the next clock edge.



clock skew: The difference in absolute time between the times when two state elements see a clock edge.

We make one simplifying assumption: the hold-time requirements are satisfied,<br />

which is almost never an issue with modern logic.<br />

One additional complication that must be considered in edge-triggered designs<br />

is clock skew. Clock skew is the difference in absolute time between when two state<br />

elements see a clock edge. Clock skew arises because the clock signal will often<br />

use two different paths, with slightly different delays, to reach two different state<br />

elements. If the clock skew is large enough, it may be possible for a state element to<br />

change <strong>and</strong> cause the input to another flip-flop to change before the clock edge is<br />

seen by the second flip-flop.<br />

Figure B.11.2 illustrates this problem, ignoring setup time <strong>and</strong> flip-flop<br />

propagation delay. To avoid incorrect operation, the clock period is increased to<br />

allow for the maximum clock skew. Thus, the clock period must be longer than<br />

$$t_{\text{prop}} + t_{\text{combinational}} + t_{\text{setup}} + t_{\text{skew}}$$

With this constraint on the clock period, the two clocks can also arrive in the<br />

opposite order, with the second clock arriving $t_{\text{skew}}$ earlier, and the circuit will work correctly.

[Figure B.11.2 shows two flip-flops separated by a combinational logic block with delay Δ; the clock arrives at the first flip-flop at time t and at the second flip-flop after t + Δ.]

FIGURE B.11.2 Illustration of how clock skew can cause a race, leading to incorrect operation. Because of the difference in when the two flip-flops see the clock, the signal that is stored into the first flip-flop can race forward and change the input to the second flip-flop before the clock arrives at the second flip-flop.

level-sensitive clocking: A timing methodology in which state changes occur at either high or low clock levels but are not instantaneous as such changes are in edge-triggered designs.

Designers reduce clock-skew problems by carefully routing the clock

signal to minimize the difference in arrival times. In addition, smart designers also<br />

provide some margin by making the clock a little longer than the minimum; this<br />

allows for variation in components as well as in the power supply. Since clock skew<br />

can also affect the hold-time requirements, minimizing the size of the clock skew<br />

is important.<br />

Edge-triggered designs have two drawbacks: they require extra logic <strong>and</strong> they<br />

may sometimes be slower. Just looking at the D flip-flop versus the level-sensitive<br />

latch that we used to construct the flip-flop shows that edge-triggered design<br />

requires more logic. An alternative is to use level-sensitive clocking. Because state<br />

changes in a level-sensitive methodology are not instantaneous, a level-sensitive<br />

scheme is slightly more complex <strong>and</strong> requires additional care to make it operate<br />

correctly.



Level-Sensitive Timing<br />

In level-sensitive timing, the state changes occur at either high or low levels, but<br />

they are not instantaneous as they are in an edge-triggered methodology. Because of<br />

the noninstantaneous change in state, races can easily occur. To ensure that a level-sensitive design will also work correctly if the clock is slow enough, designers use two-phase clocking. Two-phase clocking is a scheme that makes use of two nonoverlapping clock signals. Since the two clocks, typically called φ1 and φ2, are nonoverlapping, at

most one of the clock signals is high at any given time, as Figure B.11.3 shows. We<br />

can use these two clocks to build a system that contains level-sensitive latches but is<br />

free from any race conditions, just as the edge-triggered designs were.<br />

[Figure B.11.3 shows the waveforms of φ1 and φ2 and marks their nonoverlapping periods.]

FIGURE B.11.3 A two-phase clocking scheme showing the cycle of each clock and the nonoverlapping periods.

[Figure B.11.4 shows a chain of alternating latches and combinational logic blocks: a latch clocked by φ1, a combinational block, a latch clocked by φ2, another combinational block, and a final latch clocked by φ1.]

FIGURE B.11.4 A two-phase timing scheme with alternating latches showing how the system operates on both clock phases. The output of a latch is stable on the opposite phase from its C input. Thus, the first block of combinational inputs has a stable input during φ2, and its output is latched by φ2. The second (rightmost) combinational block operates in just the opposite fashion, with stable inputs during φ1. Thus, the delays through the combinational blocks determine the minimum time that the respective clocks must be asserted. The size of the nonoverlapping period is determined by the maximum clock skew and the minimum delay of any logic block.

One simple way to design such a system is to alternate the use of latches that are open on φ1 with latches that are open on φ2. Because both clocks are not asserted at the same time, a race cannot occur. If the input to a combinational block is a φ1 clock, then its output is latched by a φ2 clock, which is open only during φ2, when the input latch is closed and hence has a valid output. Figure B.11.4 shows how a system with two-phase timing and alternating latches operates. As in an edge-triggered design, we must pay attention to clock skew, particularly between the two



clock phases. By increasing the amount of nonoverlap between the two phases, we<br />

can reduce the potential margin of error. Thus, the system is guaranteed to operate<br />

correctly if each phase is long enough <strong>and</strong> if there is large enough nonoverlap<br />

between the phases.<br />
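In Verilog, a level-sensitive latch is written as an always block without an edge qualifier, so one φ1/φ2 stage of the alternating-latch structure in Figure B.11.4 can be sketched as below. The names and the trivial stand-in for the combinational block are our own.

module two_phase_stage (
  input  wire       phi1,
  input  wire       phi2,
  input  wire [7:0] d,
  output reg  [7:0] q1,    // latch open while phi1 is high
  output reg  [7:0] q2     // latch open while phi2 is high
);
  // A stand-in for the combinational logic block between the latches.
  wire [7:0] comb_out = q1 + 8'd1;

  always @(phi1 or d)
    if (phi1) q1 <= d;          // transparent only during phi1

  always @(phi2 or comb_out)
    if (phi2) q2 <= comb_out;   // transparent only during phi2, when q1 is stable
endmodule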

Asynchronous Inputs <strong>and</strong> Synchronizers<br />

By using a single clock or a two-phase clock, we can eliminate race conditions<br />

if clock-skew problems are avoided. Unfortunately, it is impractical to make an<br />

entire system function with a single clock <strong>and</strong> still keep the clock skew small.<br />

While the CPU may use a single clock, I/O devices will probably have their own<br />

clock. An asynchronous device may communicate with the CPU through a series<br />

of h<strong>and</strong>shaking steps. To translate the asynchronous input to a synchronous signal<br />

that can be used to change the state of a system, we need to use a synchronizer,<br />

whose inputs are the asynchronous signal <strong>and</strong> a clock <strong>and</strong> whose output is a signal<br />

synchronous with the input clock.<br />

Our first attempt to build a synchronizer uses an edge-triggered D flip-flop,<br />

whose D input is the asynchronous signal, as Figure B.11.5 shows. Because we<br />

communicate with a h<strong>and</strong>shaking protocol, it does not matter whether we detect<br />

the asserted state of the asynchronous signal on one clock or the next, since the<br />

signal will be held asserted until it is acknowledged. Thus, you might think that this<br />

simple structure is enough to sample the signal accurately, which would be the case<br />

except for one small problem.<br />

metastability: A situation that occurs if a signal is sampled when it is not stable for the required setup and hold times, possibly causing the sampled value to fall in the indeterminate region between a high and low value.

[Figure B.11.5 shows a single D flip-flop whose D input is the asynchronous input, whose C input is the clock, and whose Q output is the synchronous output.]

FIGURE B.11.5 A synchronizer built from a D flip-flop is used to sample an asynchronous signal to produce an output that is synchronous with the clock. This “synchronizer” will not work properly!

The problem is a situation called metastability. Suppose the asynchronous<br />

signal is transitioning between high <strong>and</strong> low when the clock edge arrives. Clearly,<br />

it is not possible to know whether the signal will be latched as high or low. That<br />

problem we could live with. Unfortunately, the situation is worse: when the signal<br />

that is sampled is not stable for the required setup <strong>and</strong> hold times, the flip-flop may<br />

go into a metastable state. In such a state, the output will not have a legitimate high<br />

or low value, but will be in the indeterminate region between them. Furthermore,



the flip-flop is not guaranteed to exit this state in any bounded amount of time.<br />

Some logic blocks that look at the output of the flip-flop may see its output as 0,<br />

while others may see it as 1. This situation is called a synchronizer failure.<br />

In a purely synchronous system, synchronizer failure can be avoided by ensuring<br />

that the setup <strong>and</strong> hold times for a flip-flop or latch are always met, but this is<br />

impossible when the input is asynchronous. Instead, the only solution possible is to<br />

wait long enough before looking at the output of the flip-flop to ensure that its output<br />

is stable, <strong>and</strong> that it has exited the metastable state, if it ever entered it. How long is<br />

long enough? Well, the probability that the flip-flop will stay in the metastable state<br />

decreases exponentially, so after a very short time the probability that the flip-flop<br />

is in the metastable state is very low; however, the probability never reaches 0! So<br />

designers wait long enough such that the probability of a synchronizer failure is very<br />

low, <strong>and</strong> the time between such failures will be years or even thous<strong>and</strong>s of years.<br />

For most flip-flop designs, waiting for a period that is several times longer than<br />

the setup time makes the probability of synchronization failure very low. If the<br />

clock period is longer than the potential metastability period (which is likely), then a

safe synchronizer can be built with two D flip-flops, as Figure B.11.6 shows. If you<br />

are interested in reading more about these problems, look into the references.<br />

synchronizer failure: A situation in which a flip-flop enters a metastable state and where some logic blocks reading the output of the flip-flop see a 0 while others see a 1.

[Figure B.11.6 shows two D flip-flops in series: the asynchronous input feeds the first flip-flop, both flip-flops share the clock, and the output of the second flip-flop is the synchronous output.]

FIGURE B.11.6 This synchronizer will work correctly if the period of metastability that we wish to guard against is less than the clock period. Although the output of the first flip-flop may be metastable, it will not be seen by any other logic element until the second clock, when the second D flip-flop samples the signal, which by that time should no longer be in a metastable state.
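The two flip-flop structure of Figure B.11.6 is short enough to write out; this is our own sketch with invented names, not the book's figure.

module synchronizer (
  input  wire clock,
  input  wire async_in,     // the asynchronous input
  output reg  sync_out      // synchronous with clock, one extra period late
);
  reg first_stage;          // may briefly be metastable after sampling

  always @(posedge clock) begin
    first_stage <= async_in;     // first flip-flop samples the raw input
    sync_out    <= first_stage;  // second flip-flop waits a full clock period
  end
endmodule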

Suppose we have a design with very large clock skew—longer than the register<br />

propagation time. Is it always possible for such a design to slow the clock down<br />

enough to guarantee that the logic operates properly?<br />

a. Yes, if the clock is slow enough the signals can always propagate <strong>and</strong> the<br />

design will work, even if the skew is very large.<br />

b. No, since it is possible that two registers see the same clock edge far enough<br />

apart that a register is triggered, <strong>and</strong> its outputs propagated <strong>and</strong> seen by a<br />

second register with the same clock edge.<br />

Check<br />

Yourself<br />

propagation time: The time required for an input to a flip-flop to propagate to the outputs of the flip-flop.


B.14 Exercises

B.10 [15] §§B.2, B.3 Prove that a two-input multiplexor is also universal by<br />

showing how to build the NAND (or NOR) gate using a multiplexor.<br />

B.11 [5] §§4.2, B.2, B.3 Assume that X consists of 3 bits, x2 x1 x0. Write four<br />

logic functions that are true if <strong>and</strong> only if<br />

■ X contains only one 0<br />

■ X contains an even number of 0s<br />

■ X when interpreted as an unsigned binary number is less than 4<br />

■ X when interpreted as a signed (two’s complement) number is negative<br />

B.12 [5] §§4.2, B.2, B.3 Implement the four functions described in Exercise<br />

B.11 using a PLA.<br />

B.13 [5] §§4.2, B.2, B.3 Assume that X consists of 3 bits, x2 x1 x0, and Y consists of 3 bits, y2 y1 y0. Write logic functions that are true if and only if

■ X < Y, where X and Y are thought of as unsigned binary numbers

■ X < Y, where X and Y are thought of as signed (two's complement) numbers

■ X = Y

Use a hierarchical approach that can be extended to larger numbers of bits. Show how you can extend it to 6-bit comparison.

B.14 [5] §§B.2, B.3 Implement a switching network that has two data inputs<br />

(A <strong>and</strong> B), two data outputs (C <strong>and</strong> D), <strong>and</strong> a control input (S). If S equals 1, the<br />

network is in pass-through mode, <strong>and</strong> C should equal A, <strong>and</strong> D should equal B. If<br />

S equals 0, the network is in crossing mode, <strong>and</strong> C should equal B, <strong>and</strong> D should<br />

equal A.<br />

B.15 [15] §§B.2, B.3 Derive the product-of-sums representation for E shown<br />

on page B-11 starting with the sum-of-products representation. You will need to<br />

use DeMorgan’s theorems.<br />

B.16 [30] §§B.2, B.3 Give an algorithm for constructing the sum-of-products representation for an arbitrary logic equation consisting of AND, OR, and NOT.

The algorithm should be recursive <strong>and</strong> should not construct the truth table in the<br />

process.<br />

B.17 [5] §§B.2, B.3 Show a truth table for a multiplexor (inputs A, B, <strong>and</strong> S;<br />

output C ), using don’t cares to simplify the table where possible.



B.18 [5] §B.3 What is the function implemented by the following Verilog<br />

modules:<br />

module FUNC1 (I0, I1, S, out);<br />

input I0, I1;<br />

input S;<br />

output out;<br />

out = S? I1: I0;<br />

endmodule<br />

module FUNC2 (out,ctl,clk,reset);<br />

output [7:0] out;<br />

input ctl, clk, reset;<br />

reg [7:0] out;<br />

always @(posedge clk)<br />

if (reset) begin<br />

out



[Figure: a 16-bit Adder and a 16-bit Register with In, Out, Load, Rst, and Clk signals.]

B.22 [20] §§B.3, B.4, B.5 Section 3.3 presents basic operation and possible implementations of multipliers. A basic unit of such implementations is a shift-and-add unit. Show a Verilog implementation for this unit. Show how you can use this unit to build a 32-bit multiplier.

B.23 [20] §§B.3, B.4, B.5 Repeat Exercise B.22, but for an unsigned divider rather than a multiplier.

B.24 [15] §B.5 The ALU supported set on less than (slt) using just the sign bit of the adder. Let's try a set on less than operation using the values $-7_{ten}$ and $6_{ten}$. To make it simpler to follow the example, let's limit the binary representations to 4 bits: $1001_{two}$ and $0110_{two}$.

$$1001_{two} - 0110_{two} = 1001_{two} + 1010_{two} = 0011_{two}$$

This result would suggest that $-7 > 6$, which is clearly wrong. Hence, we must factor in overflow in the decision. Modify the 1-bit ALU in Figure B.5.10 on page B-33 to handle slt correctly. Make your changes on a photocopy of this figure to save time.

B.25 [20] §B.6 A simple check for overflow during addition is to see if the<br />

CarryIn to the most significant bit is not the same as the CarryOut of the most<br />

significant bit. Prove that this check is the same as in Figure 3.2.<br />

B.26 [5] §B.6 Rewrite the equations on page B-44 for a carry-lookahead logic for a 16-bit adder using a new notation. First, use the names for the CarryIn signals of the individual bits of the adder. That is, use c4, c8, c12, … instead of C1, C2, C3, …. In addition, let $P_{i,j}$ mean a propagate signal for bits i to j, and $G_{i,j}$ mean a generate signal for bits i to j. For example, the equation

$$C2 = G1 + (P1 \cdot G0) + (P1 \cdot P0 \cdot c0)$$



can be rewritten as

$$c8 = G_{7,4} + (P_{7,4} \cdot G_{3,0}) + (P_{7,4} \cdot P_{3,0} \cdot c0)$$

This more general notation is useful in creating wider adders.

B.27 [15] §B.6 Write the equations for the carry-lookahead logic for a 64-<br />

bit adder using the new notation from Exercise B.26 <strong>and</strong> using 16-bit adders as<br />

building blocks. Include a drawing similar to Figure B.6.3 in your solution.<br />

B.28 [10] §B.6 Now calculate the relative performance of adders. Assume that<br />

hardware corresponding to any equation containing only OR or AND terms, such<br />

as the equations for pi <strong>and</strong> gi on page B-40, takes one time unit T. Equations that<br />

consist of the OR of several AND terms, such as the equations for c1, c2, c3, <strong>and</strong><br />

c4 on page B-40, would thus take two time units, 2T. The reason is it would take T<br />

to produce the AND terms <strong>and</strong> then an additional T to produce the result of the<br />

OR. Calculate the numbers <strong>and</strong> performance ratio for 4-bit adders for both ripple<br />

carry <strong>and</strong> carry lookahead. If the terms in equations are further defined by other<br />

equations, then add the appropriate delays for those intermediate equations, <strong>and</strong><br />

continue recursively until the actual input bits of the adder are used in an equation.<br />

Include a drawing of each adder labeled with the calculated delays <strong>and</strong> the path of<br />

the worst-case delay highlighted.<br />

B.29 [15] §B.6 This exercise is similar to Exercise B.28, but this time calculate<br />

the relative speeds of a 16-bit adder using ripple carry only, ripple carry of 4-bit<br />

groups that use carry lookahead, <strong>and</strong> the carry-lookahead scheme on page B-39.<br />

B.30 [15] §B.6 This exercise is similar to Exercises B.28 <strong>and</strong> B.29, but this<br />

time calculate the relative speeds of a 64-bit adder using ripple carry only, ripple<br />

carry of 4-bit groups that use carry lookahead, ripple carry of 16-bit groups that use<br />

carry lookahead, <strong>and</strong> the carry-lookahead scheme from Exercise B.27.<br />

B.31 [10] §B.6 Instead of thinking of an adder as a device that adds two<br />

numbers <strong>and</strong> then links the carries together, we can think of the adder as a hardware<br />

device that can add three inputs together (ai, bi, ci) <strong>and</strong> produce two outputs<br />

(s, ci 1). When adding two numbers together, there is little we can do with this<br />

observation. When we are adding more than two oper<strong>and</strong>s, it is possible to reduce<br />

the cost of the carry. The idea is to form two independent sums, called S (sum bits)<br />

<strong>and</strong> C (carry bits). At the end of the process, we need to add C <strong>and</strong> S together<br />

using a normal adder. This technique of delaying carry propagation until the end<br />

of a sum of numbers is called carry save addition. The block drawing on the lower<br />

right of Figure B.14.1 (see below) shows the organization, with two levels of carry<br />

save adders connected by a single normal adder.<br />

Calculate the delays to add four 16-bit numbers using full carry-lookahead adders<br />

versus carry save with a carry-lookahead adder forming the final sum. (The time<br />

unit T in Exercise B.28 is the same.)
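For context only (this is background for the carry-save idea, not a solution to the delay calculation asked for above), one level of carry-save addition can be written in a few lines of Verilog; the module and port names are our own.

module carry_save_adder (
  input  wire [15:0] a, b, d,
  output wire [15:0] s,   // sum bits
  output wire [15:0] c    // carry bits; the final result is s + (c << 1)
);
  // Each bit position is a full adder whose carry is saved rather than
  // propagated, so a + b + d is reduced to the two words s and c.
  assign s = a ^ b ^ d;
  assign c = (a & b) | (a & d) | (b & d);
endmodule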



First, show the block organization of the 16-bit carry save adders to add these 16<br />

terms, as shown on the right in Figure B.14.1. Then calculate the delays to add these<br />

16 numbers. Compare this time to the iterative multiplication scheme in Chapter<br />

3 but only assume 16 iterations using a 16-bit adder that has full carry lookahead<br />

whose speed was calculated in Exercise B.29.<br />

B.33 [10] §B.6 There are times when we want to add a collection of numbers<br />

together. Suppose you wanted to add four 4-bit numbers (A, B, E, F) using 1-bit<br />

full adders. Let’s ignore carry lookahead for now. You would likely connect the<br />

1-bit adders in the organization at the top of Figure B.14.1. Below the traditional<br />

organization is a novel organization of full adders. Try adding four numbers using<br />

both organizations to convince yourself that you get the same answer.<br />

B.34 [5] §B.6 First, show the block organization of the 16-bit carry save<br />

adders to add these 16 terms, as shown in Figure B.14.1. Assume that the time delay<br />

through each 1-bit adder is 2T. Calculate the time of adding four 4-bit numbers to<br />

the organization at the top versus the organization at the bottom of Figure B.14.1.<br />

B.35 [5] §B.8 Quite often, you would expect that given a timing diagram<br />

containing a description of changes that take place on a data input D <strong>and</strong> a clock<br />

input C (as in Figures B.8.3 <strong>and</strong> B.8.6 on pages B-52 <strong>and</strong> B-54, respectively), there<br />

would be differences between the output waveforms (Q) for a D latch and a D flip-flop.

In a sentence or two, describe the circumstances (e.g., the nature of the inputs)<br />

for which there would not be any difference between the two output waveforms.<br />

B.36 [5] §B.8 Figure B.8.8 on page B-55 illustrates the implementation of the<br />

register file for the MIPS datapath. Pretend that a new register file is to be built,<br />

but that there are only two registers <strong>and</strong> only one read port, <strong>and</strong> that each register<br />

has only 2 bits of data. Redraw Figure B.8.8 so that every wire in your diagram<br />

corresponds to only 1 bit of data (unlike the diagram in Figure B.8.8, in which<br />

some wires are 5 bits and some wires are 32 bits). Redraw the registers using D flip-flops.

You do not need to show how to implement a D flip-flop or a multiplexor.<br />

B.37 [10] §B.10 A friend would like you to build an “electronic eye” for use<br />

as a fake security device. The device consists of three lights lined up in a row,<br />

controlled by the outputs Left, Middle, <strong>and</strong> Right, which, if asserted, indicate that<br />

a light should be on. Only one light is on at a time, <strong>and</strong> the light “moves” from<br />

left to right <strong>and</strong> then from right to left, thus scaring away thieves who believe that<br />

the device is monitoring their activity. Draw the graphical representation for the<br />

finite-state machine used to specify the electronic eye. Note that the rate of the eye’s<br />

movement will be controlled by the clock speed (which should not be too great)<br />

<strong>and</strong> that there are essentially no inputs.<br />

B.38 [10] §B.10 Assign state numbers to the states of the finite-state machine<br />

you constructed for Exercise B.37 <strong>and</strong> write a set of logic equations for each of the<br />

outputs, including the next-state bits.



B.39 [15] §§B.2, B.8, B.10 Construct a 3-bit counter using three D flip-flops and a selection of gates. The inputs should consist of a signal that resets the

counter to 0, called reset, <strong>and</strong> a signal to increment the counter, called inc. The<br />

outputs should be the value of the counter. When the counter has value 7 <strong>and</strong> is<br />

incremented, it should wrap around <strong>and</strong> become 0.<br />

B.40 [20] §B.10 A Gray code is a sequence of binary numbers with the property<br />

that no more than 1 bit changes in going from one element of the sequence to<br />

another. For example, here is a 3-bit binary Gray code: 000, 001, 011, 010, 110,<br />

111, 101, <strong>and</strong> 100. Using three D flip-flops <strong>and</strong> a PLA, construct a 3-bit Gray code<br />

counter that has two inputs: reset, which sets the counter to 000, <strong>and</strong> inc, which<br />

makes the counter go to the next value in the sequence. Note that the code is cyclic,<br />

so that the value after 100 in the sequence is 000.<br />

B.41 [25] §B.10 We wish to add a yellow light to our traffic light example on<br />

page B-68. We will do this by changing the clock to run at 0.25 Hz (a 4-second clock<br />

cycle time), which is the duration of a yellow light. To prevent the green <strong>and</strong> red lights<br />

from cycling too fast, we add a 30-second timer. The timer has a single input, called<br />

TimerReset, which restarts the timer, <strong>and</strong> a single output, called TimerSignal, which<br />

indicates that the 30-second period has expired. Also, we must redefine the traffic<br />

signals to include yellow. We do this by defining two output signals for each light:

green <strong>and</strong> yellow. If the output NSgreen is asserted, the green light is on; if the output<br />

NSyellow is asserted, the yellow light is on. If both signals are off, the red light is on. Do<br />

not assert both the green <strong>and</strong> yellow signals at the same time, since American drivers<br />

will certainly be confused, even if European drivers underst<strong>and</strong> what this means! Draw<br />

the graphical representation for the finite-state machine for this improved controller.<br />

Choose names for the states that are different from the names of the outputs.<br />

B.42 [15] §B.10 Write down the next-state <strong>and</strong> output-function tables for the<br />

traffic light controller described in Exercise B.41.<br />

B.43 [15] §§B.2, B.10 Assign state numbers to the states in the traffic light

example of Exercise B.41 <strong>and</strong> use the tables of Exercise B.42 to write a set of logic<br />

equations for each of the outputs, including the next-state outputs.<br />

B.44 [15] §§B.3, B.10 Implement the logic equations of Exercise B.43 as a<br />

PLA.<br />

§B.2, page B-8: No. If A = 1, C = 1, B = 0, the first is true, but the second is false.

§B.3, page B-20: C.<br />

§B.4, page B-22: They are all exactly the same.<br />

§B.4, page B-26: A = 0, B = 1.

§B.5, page B-38: 2.<br />

§B.6, page B-47: 1.<br />

§B.8, page B-58: c.<br />

§B.10, page B-72: b.<br />

§B.11, page B-77: b.<br />

Answers to<br />

Check Yourself




Index<br />

Note: Online information is listed by chapter <strong>and</strong> section number followed by page numbers (OL3.11-7). Page references preceded<br />

by a single letter with hyphen refer to appendices.<br />

1-bit ALU, B-26–29. See also Arithmetic<br />

logic unit (ALU)<br />

adder, B-27<br />

CarryOut, B-28<br />

for most significant bit, B-33<br />

illustrated, B-29<br />

logical unit for AND/OR, B-27<br />

performing AND, OR, <strong>and</strong> addition,<br />

B-31, B-33<br />

32-bit ALU, B-29–38. See also Arithmetic<br />

logic unit (ALU)<br />

defining in Verilog, B-35–38<br />

from 31 copies of 1-bit ALU, B-34<br />

illustrated, B-36<br />

ripple carry adder, B-29<br />

tailoring to MIPS, B-31–35<br />

with 32 1-bit ALUs, B-30<br />

32-bit immediate oper<strong>and</strong>s, 112–113<br />

7090/7094 hardware, OL3.11-7<br />

A<br />

Absolute references, 126<br />

Abstractions<br />

hardware/software interface, 22<br />

principle, 22<br />

to simplify design, 11<br />

Accumulator architectures, OL2.21-2<br />

Acronyms, 9<br />

Active matrix, 18<br />

add (Add), 64<br />

add.d (FP Add Double), A-73<br />

add.s (FP Add Single), A-74<br />

Add unsigned instruction, 180<br />

addi (Add Immediate), 64<br />

Addition, 178–182. See also Arithmetic<br />

binary, 178–179<br />

floating-point, 203–206, 211, A-73–74<br />

instructions, A-51<br />

oper<strong>and</strong>s, 179<br />

signific<strong>and</strong>s, 203<br />

speed, 182<br />

addiu (Add Imm.Unsigned), 119<br />

Address interleaving, 381<br />

Address select logic, D-24, D-25<br />

Address space, 428, 431<br />

extending, 479<br />

flat, 479<br />

ID (ASID), 446<br />

inadequate, OL5.17-6<br />

shared, 519–520<br />

single physical, 517<br />

unmapped, 450<br />

virtual, 446<br />

Address translation<br />

for ARM cortex-A8, 471<br />

defined, 429<br />

fast, 438–439<br />

for Intel core i7, 471<br />

TLB for, 438–439<br />

Address-control lines, D-26<br />

Addresses<br />

32-bit immediates, 113–116<br />

base, 69<br />

byte, 69<br />

defined, 68<br />

memory, 77<br />

virtual, 428–431, 450<br />

Addressing<br />

32-bit immediates, 113–116<br />

base, 116<br />

displacement, 116<br />

immediate, 116<br />

in jumps <strong>and</strong> branches, 113–116<br />

MIPS modes, 116–118<br />

PC-relative, 114, 116<br />

pseudodirect, 116<br />

register, 116<br />

x86 modes, 152<br />

Addressing modes, A-45–47<br />

desktop architectures, E-6<br />

addu (Add Unsigned), 64<br />

Advanced Vector Extensions (AVX), 225,<br />

227<br />

AGP, C-9<br />

Algol-60, OL2.21-7<br />

Aliasing, 444<br />

Alignment restriction, 69–70<br />

All-pairs N-body algorithm, C-65<br />

Alpha architecture<br />

bit count instructions, E-29<br />

floating-point instructions, E-28<br />

instructions, E-27–29<br />

no divide, E-28<br />

PAL code, E-28<br />

unaligned load-store, E-28<br />

VAX floating-point formats, E-29<br />

ALU control, 259–261. See also<br />

Arithmetic logic unit (ALU)<br />

bits, 260<br />

logic, D-6<br />

mapping to gates, D-4–7<br />

truth tables, D-5<br />

ALU control block, 263<br />

defined, D-4<br />

generating ALU control bits, D-6<br />

ALUOp, 260, D-6<br />

bits, 260, 261<br />

control signal, 263<br />

Amazon Web Services (AWS), 425<br />

AMD Opteron X4 (Barcelona), 543, 544<br />

AMD64, 151, 224, OL2.21-6<br />

Amdahl’s law, 401, 503<br />

corollary, 49<br />

defined, 49<br />

fallacy, 556<br />

<strong>and</strong> (AND), 64<br />

AND gates, B-12, D-7<br />

AND operation, 88<br />

AND operation, A-52, B-6<br />

<strong>and</strong>i (And Immediate), 64<br />

Annual failure rate (AFR), 418<br />

versus MTTF of disks, 419–420

Antidependence, 338<br />

Antifuse, B-78<br />

Apple computer, OL1.12-7<br />




Apple iPad 2 A1395, 20<br />

logic board of, 20<br />

processor integrated circuit of, 21<br />

Application binary interface (ABI), 22<br />

Application programming interfaces<br />

(APIs)<br />

defined, C-4<br />

graphics, C-14<br />

Architectural registers, 347<br />

Arithmetic, 176–236<br />

addition, 178–182<br />

addition <strong>and</strong> subtraction, 178–182<br />

division, 189–195<br />

fallacies <strong>and</strong> pitfalls, 229–232<br />

floating-point, 196–222<br />

historical perspective, 236<br />

multiplication, 183–188<br />

parallelism <strong>and</strong>, 222–223<br />

Streaming SIMD Extensions <strong>and</strong><br />

advanced vector extensions in x86,<br />

224–225<br />

subtraction, 178–182<br />

subword parallelism, 222–223<br />

subword parallelism <strong>and</strong> matrix<br />

multiply, 225–228<br />

Arithmetic instructions. See also<br />

Instructions<br />

desktop RISC, E-11<br />

embedded RISC, E-14<br />

logical, 251<br />

MIPS, A-51–57<br />

operands, 66–73

Arithmetic intensity, 541<br />

Arithmetic logic unit (ALU). See also<br />

ALU control; Control units<br />

1-bit, B-26–29<br />

32-bit, B-29–38<br />

before forwarding, 309<br />

branch datapath, 254<br />

hardware, 180<br />

memory-reference instruction use, 245<br />

for register values, 252<br />

R-format operations, 253<br />

signed-immediate input, 312<br />

ARM Cortex-A8, 244, 345–346<br />

address translation for, 471<br />

caches in, 472<br />

data cache miss rates for, 474<br />

memory hierarchies of, 471–475<br />

performance of, 473–475<br />

specification, 345<br />

TLB hardware for, 471<br />

ARM instructions, 145–147<br />

12-bit immediate field, 148<br />

addressing modes, 145<br />

block loads <strong>and</strong> stores, 149<br />

brief history, OL2.21-5<br />

calculations, 145–146<br />

compare <strong>and</strong> conditional branch,<br />

147–148<br />

condition field, 324<br />

data transfer, 146<br />

features, 148–149<br />

formats, 148<br />

logical, 149<br />

MIPS similarities, 146<br />

register-register, 146<br />

unique, E-36–37<br />

ARMv7, 62<br />

ARMv8, 158–159<br />

ARPANET, OL1.12-10<br />

Arrays, 415<br />

logic elements, B-18–19<br />

multiple dimension, 218<br />

pointers versus, 141–145<br />

procedures for setting to zero, 142<br />

ASCII<br />

binary numbers versus, 107<br />

character representation, 106<br />

defined, 106<br />

symbols, 109<br />

Assembler directives, A-5<br />

Assemblers, 124–126, A-10–17<br />

conditional code assembly, A-17<br />

defined, 14, A-4<br />

function, 125, A-10<br />

macros, A-4, A-15–17<br />

microcode, D-30<br />

number acceptance, 125<br />

object file, 125<br />

pseudoinstructions, A-17<br />

relocation information, A-13, A-14<br />

speed, A-13<br />

symbol table, A-12<br />

Assembly language, 15<br />

defined, 14, 123<br />

drawbacks, A-9–10<br />

floating-point, 212<br />

high-level languages versus, A-12<br />

illustrated, 15<br />

MIPS, 64, 84, A-45–80<br />

production of, A-8–9<br />

programs, 123<br />

translating into machine language, 84<br />

when to use, A-7–9<br />

Asserted signals, 250, B-4<br />

Associativity<br />

in caches, 405<br />

degree, increasing, 404, 455<br />

increasing, 409<br />

set, tag size versus, 409<br />

Atomic compare <strong>and</strong> swap, 123<br />

Atomic exchange, 121<br />

Atomic fetch-<strong>and</strong>-increment, 123<br />

Atomic memory operation, C-21<br />

Attribute interpolation, C-43–44<br />

Automobiles, computer application in, 4<br />

Average memory access time (AMAT),<br />

402<br />

calculating, 403<br />

B<br />

Backpatching, A-13<br />

B<strong>and</strong>width, 30–31<br />

bisection, 532<br />

external to DRAM, 398<br />

memory, 380–381, 398<br />

network, 535<br />

Barrier synchronization, C-18<br />

defined, C-20<br />

for thread communication, C-34<br />

Base addressing, 69, 116<br />

Base registers, 69<br />

Basic block, 93<br />

Benchmarks, 538–540<br />

defined, 46<br />

Linpack, 538, OL3.11-4<br />

multicores, 522–529<br />

multiprocessor, 538–540<br />

NAS parallel, 540<br />

parallel, 539<br />

PARSEC suite, 540<br />

SPEC CPU, 46–48<br />

SPEC power, 48–49<br />

SPECrate, 538–539<br />

Stream, 548<br />

beq (Branch On Equal), 64<br />

bge (Branch Greater Than or Equal), 125<br />

bgt (Branch Greater Than), 125<br />

Biased notation, 79, 200<br />

Big-endian byte order, 70, A-43<br />

Binary numbers, 81–82



ASCII versus, 107<br />

conversion to decimal numbers, 76<br />

defined, 73<br />

Bisection b<strong>and</strong>width, 532<br />

Bit maps<br />

defined, 18, 73<br />

goal, 18<br />

storing, 18<br />

Bit-Interleaved Parity (RAID 3), OL5.11-5

Bits<br />

ALUOp, 260, 261<br />

defined, 14<br />

dirty, 437<br />

guard, 220<br />

patterns, 220–221<br />

reference, 435<br />

rounding, 220<br />

sign, 75<br />

state, D-8<br />

sticky, 220<br />

valid, 383<br />

ble (Branch Less Than or Equal), 125<br />

Blocking assignment, B-24<br />

Blocking factor, 414<br />

Block-Interleaved Parity (RAID 4),<br />

OL5.11-5–5.11-6<br />

Blocks<br />

combinational, B-4<br />

defined, 376<br />

finding, 456<br />

flexible placement, 402–404<br />

least recently used (LRU), 409<br />

loads/stores, 149<br />

locating in cache, 407–408<br />

miss rate <strong>and</strong>, 391<br />

multiword, mapping addresses to, 390<br />

placement locations, 455–456<br />

placement strategies, 404<br />

replacement selection, 409<br />

replacement strategies, 457<br />

spatial locality exploitation, 391<br />

state, B-4<br />

valid data, 386<br />

blt (Branch Less Than), 125<br />

bne (Branch On Not Equal), 64<br />

Bonding, 28<br />

Boolean algebra, B-6<br />

Bounds check shortcut, 95<br />

Branch datapath<br />

ALU, 254<br />

operations, 254<br />

Branch delay slots<br />

defined, 322<br />

scheduling, 323<br />

Branch equal, 318<br />

Branch instructions, A-59–63<br />

jump instruction versus, 270<br />

list of, A-60–63<br />

pipeline impact, 317<br />

Branch not taken<br />

assumption, 318<br />

defined, 254<br />

Branch prediction<br />

as control hazard solution, 284<br />

buffers, 321, 322<br />

defined, 283<br />

dynamic, 284, 321–323<br />

static, 335<br />

Branch predictors<br />

accuracy, 322<br />

correlation, 324<br />

information from, 324<br />

tournament, 324<br />

Branch taken<br />

cost reduction, 318<br />

defined, 254<br />

Branch target<br />

addresses, 254<br />

buffers, 324<br />

Branches. See also Conditional branches

addressing in, 113–116<br />

compiler creation, 91<br />

condition, 255<br />

decision, moving up, 318<br />

delayed, 96, 255, 284, 318–319, 322, 324

ending, 93<br />

execution in ID stage, 319<br />

pipelined, 318<br />

target address, 318<br />

unconditional, 91<br />

Branch-on-equal instruction, 268<br />

Bubble Sort, 140<br />

Bubbles, 314<br />

Bus-based coherent multiprocessors, OL6.15-7

Buses, B-19<br />

Bytes<br />

addressing, 70<br />

order, 70, A-43<br />

C<br />

C.mmp, OL6.15-4<br />

C language<br />

assignment, compiling into MIPS,<br />

65–66<br />

compiling, 145, OL2.15-2–2.15-3<br />

compiling assignment with registers,<br />

67–68<br />

compiling while loops in, 92<br />

sort algorithms, 141<br />

translation hierarchy, 124<br />

translation to MIPS assembly language,<br />

65<br />

variables, 102<br />

C++ language, OL2.15-27, OL2.21-8<br />

Cache blocking and matrix multiply, 475–476

Cache coherence, 466–470<br />

coherence, 466<br />

consistency, 466<br />

enforcement schemes, 467–468<br />

implementation techniques,<br />

OL5.12-11–5.12-12<br />

migration, 467<br />

problem, 466, 467, 470<br />

protocol example, OL5.12-12–5.12-16<br />

protocols, 468<br />

replication, 468<br />

snooping protocol, 468–469<br />

snoopy, OL5.12-17<br />

state diagram, OL5.12-16<br />

Cache coherency protocol, OL5.12-12–5.12-16

finite-state transition diagram, OL5.12-15

functioning, OL5.12-14<br />

mechanism, OL5.12-14<br />

state diagram, OL5.12-16<br />

states, OL5.12-13<br />

write-back cache, OL5.12-15<br />

Cache controllers, 470<br />

coherent cache implementation techniques, OL5.12-11–5.12-12

implementing, OL5.12-2<br />

snoopy cache coherence, OL5.12-17<br />

SystemVerilog, OL5.12-2<br />

Cache hits, 443<br />

Cache misses<br />

block replacement on, 457<br />

capacity, 459



compulsory, 459<br />

conflict, 459<br />

defined, 392<br />

direct-mapped cache, 404<br />

fully associative cache, 406<br />

handling, 392–393

memory-stall clock cycles, 399<br />

reducing with flexible block placement, 402–404

set-associative cache, 405<br />

steps, 393<br />

in write-through cache, 393<br />

Cache performance, 398–417<br />

calculating, 400<br />

hit time <strong>and</strong>, 401–402<br />

impact on processor performance, 400<br />

Cache-aware instructions, 482<br />

Caches, 383–398. See also Blocks<br />

accessing, 386–389<br />

in ARM cortex-A8, 472<br />

associativity in, 405–406<br />

bits in, 390<br />

bits needed for, 390<br />

contents illustration, 387<br />

defined, 21, 383–384<br />

direct-mapped, 384, 385, 390, 402<br />

empty, 386–387<br />

FSM for controlling, 461–462<br />

fully associative, 403<br />

GPU, C-38<br />

inconsistent, 393<br />

index, 388<br />

in Intel Core i7, 472<br />

Intrinsity FastMATH example, 395–398

locating blocks in, 407–408<br />

locations, 385<br />

multilevel, 398, 410<br />

nonblocking, 472<br />

physically addressed, 443<br />

physically indexed, 443<br />

physically tagged, 443<br />

primary, 410, 417<br />

secondary, 410, 417<br />

set-associative, 403<br />

simulating, 478<br />

size, 389<br />

split, 397<br />

summary, 397–398<br />

tag field, 388<br />

tags, OL5.12-3, OL5.12-11<br />

virtual memory and TLB integration, 440–441

virtually addressed, 443<br />

virtually indexed, 443<br />

virtually tagged, 443<br />

write-back, 394, 395, 458<br />

write-through, 393, 395, 457<br />

writes, 393–395<br />

Callee, 98, 99<br />

Callee-saved register, A-23<br />

Caller, 98<br />

Caller-saved register, A-23<br />

Capabilities, OL5.17-8<br />

Capacity misses, 459<br />

Carry lookahead, B-38–47<br />

4-bit ALUs using, B-45<br />

adder, B-39<br />

fast, with first level of abstraction, B-39–40

fast, with “infinite” hardware, B-38–39<br />

fast, with second level of abstraction, B-40–46

plumbing analogy, B-42, B-43<br />

ripple carry speed versus, B-46<br />

summary, B-46–47<br />

Carry save adders, 188<br />

Cause register<br />

defined, 327<br />

fields, A-34, A-35<br />

CDC 6600, OL1.12-7, OL4.16-3

Cell phones, 7<br />

Central processor unit (CPU). See also Processors

classic performance equation, 36–40<br />

coprocessor 0, A-33–34<br />

defined, 19<br />

execution time, 32, 33–34<br />

performance, 33–35<br />

system, time, 32<br />

time, 399<br />

time measurements, 33–34<br />

user, time, 32<br />

Cg pixel shader program, C-15–17<br />

Characters<br />

ASCII representation, 106<br />

in Java, 109–111<br />

Chips, 19, 25, 26<br />

manufacturing process, 26<br />

Classes<br />

defined, OL2.15-15<br />

packages, OL2.15-21<br />

Clock cycles<br />

defined, 33<br />

memory-stall, 399<br />

number of registers <strong>and</strong>, 67<br />

worst-case delay <strong>and</strong>, 272<br />

Clock cycles per instruction (CPI), 35, 282

one level of caching, 410<br />

two levels of caching, 410<br />

Clock rate<br />

defined, 33<br />

frequency switched as function of, 41<br />

power <strong>and</strong>, 40<br />

Clocking methodology, 249–251, B-48<br />

edge-triggered, 249, B-48, B-73<br />

level-sensitive, B-74, B-75–76<br />

for predictability, 249<br />

Clocks, B-48–50<br />

edge, B-48, B-50<br />

in edge-triggered design, B-73<br />

skew, B-74<br />

specification, B-57<br />

synchronous system, B-48–49<br />

Cloud computing, 533<br />

defined, 7<br />

Cluster networking, 537–538, OL6.9-12<br />

Clusters, OL6.15-8–6.15-9<br />

defined, 30, 500, OL6.15-8<br />

isolation, 530<br />

organization, 499<br />

scientific computing on, OL6.15-8<br />

Cm*, OL6.15-4<br />

CMOS (complementary metal oxide semiconductor), 41

Coarse-grained multithreading, 514<br />

Cobol, OL2.21-7<br />

Code generation, OL2.15-13<br />

Code motion, OL2.15-7<br />

Cold-start miss, 459<br />

Collision misses, 459<br />

Column major order, 413<br />

Combinational blocks, B-4<br />

Combinational control units, D-4–8<br />

Combinational elements, 248<br />

Combinational logic, 249, B-3, B-9–20<br />

arrays, B-18–19<br />

decoders, B-9<br />

defined, B-5<br />

don’t cares, B-17–18<br />

multiplexors, B-10<br />

ROMs, B-14–16<br />

two-level, B-11–14<br />

Verilog, B-23–26



Commercial computer development, OL1.12-4–1.12-10

Commit units<br />

buffer, 339–340<br />

defined, 339–340<br />

in update control, 343<br />

Common case fast, 11<br />

Common subexpression elimination,<br />

OL2.15-6<br />

Communication, 23–24<br />

overhead, reducing, 44–45<br />

thread, C-34<br />

Compact code, OL2.21-4<br />

Comparison instructions, A-57–59<br />

floating-point, A-74–75<br />

list of, A-57–59<br />

Comparisons, 93<br />

constant operands in, 93

signed versus unsigned, 94–95<br />

Compilers, 123–124<br />

branch creation, 92<br />

brief history, OL2.21-9<br />

conservative, OL2.15-6<br />

defined, 14<br />

front end, OL2.15-3<br />

function, 14, 123–124, A-5–6<br />

high-level optimizations, OL2.15-4<br />

ILP exploitation, OL4.16-5<br />

Just In Time (JIT), 132<br />

machine language production, A-8–9, A-10

optimization, 141, OL2.21-9<br />

speculation, 333–334<br />

structure, OL2.15-2<br />

Compiling<br />

C assignment statements, 65–66<br />

C language, 92–93, 145, OL2.15-2–2.15-3

floating-point programs, 214–217<br />

if-then-else, 91<br />

in Java, OL2.15-19<br />

procedures, 98, 101–102<br />

recursive procedures, 101–102<br />

while loops, 92–93<br />

Compressed sparse row (CSR) matrix,<br />

C-55, C-56<br />

Compulsory misses, 459<br />

Computer architects, 11–12

abstraction to simplify design, 11<br />

common case fast, 11<br />

dependability via redundancy, 12<br />

hierarchy of memories, 12<br />

Moore’s law, 11<br />

parallelism, 12<br />

pipelining, 12<br />

prediction, 12<br />

Computers

application classes, 5–6<br />

applications, 4<br />

arithmetic for, 176–236<br />

characteristics, OL1.12-12<br />

commercial development, OL1.12-4–1.12-10

component organization, 17<br />

components, 17, 177<br />

design measure, 53<br />

desktop, 5<br />

embedded, 5, A-7<br />

first, OL1.12-2–1.12-4<br />

in information revolution, 4<br />

instruction representation, 80–87<br />

performance measurement, OL1.12-10<br />

PostPC Era, 6–7<br />

principles, 86<br />

servers, 5<br />

Condition field, 324<br />

Conditional branches<br />

ARM, 147–148<br />

changing program counter with, 324<br />

compiling if-then-else into, 91<br />

defined, 90<br />

desktop RISC, E-16<br />

embedded RISC, E-16<br />

implementation, 96<br />

in loops, 115<br />

PA-RISC, E-34, E-35<br />

PC-relative addressing, 114<br />

RISC, E-10–16<br />

SPARC, E-10–12<br />

Conditional move instructions, 324<br />

Conflict misses, 459<br />

Constant memory, C-40<br />

Constant operands, 72–73

in comparisons, 93<br />

frequent occurrence, 72<br />

Constant-manipulating instructions, A-57

Content Addressable Memory (CAM), 408

Context switch, 446<br />

Control<br />

ALU, 259–261<br />

challenge, 325–326<br />

finishing, 269–270<br />

forwarding, 307<br />

FSM, D-8–21<br />

implementation, optimizing, D-27–28<br />

for jump instruction, 270<br />

mapping to hardware, D-2–32<br />

memory, D-26<br />

organizing, to reduce logic, D-31–32<br />

pipelined, 300–303<br />

Control flow graphs, OL2.15-9–2.15-10<br />

illustrated examples, OL2.15-9,<br />

OL2.15-10<br />

Control functions<br />

ALU, mapping to gates, D-4–7<br />

defining, 264<br />

PLA, implementation, D-7,<br />

D-20–21<br />

ROM, encoding, D-18–19<br />

for single-cycle implementation, 269<br />

Control hazards, 281–282, 316–325<br />

branch delay reduction, 318–319<br />

branch not taken assumption, 318<br />

branch prediction as solution, 284<br />

delayed decision approach, 284<br />

dynamic branch prediction,<br />

321–323<br />

logic implementation in Verilog,<br />

OL4.13-8<br />

pipeline stalls as solution, 282<br />

pipeline summary, 324<br />

simplicity, 317<br />

solutions, 282<br />

static multiple-issue processors and, 335–336

Control lines<br />

asserted, 264<br />

in datapath, 263<br />

execution/address calculation, 300<br />

final three stages, 303<br />

instruction decode/register file read,<br />

300<br />

instruction fetch, 300<br />

memory access, 302<br />

setting of, 264<br />

values, 300<br />

write-back, 302<br />

Control signals<br />

ALUOp, 263<br />

defined, 250<br />

effect of, 264<br />

multi-bit, 264<br />

pipelined datapaths with, 300–303<br />

truth tables, D-14



Control units, 247. See also Arithmetic logic unit (ALU)

address select logic, D-24, D-25<br />

combinational, implementing, D-4–8<br />

with explicit counter, D-23<br />

illustrated, 265<br />

logic equations, D-11<br />

main, designing, 261–264<br />

as microcode, D-28<br />

MIPS, D-10<br />

next-state outputs, D-10, D-12–13<br />

output, 259–261, D-10<br />

Conversion instructions, A-75–76<br />

Cooperative thread arrays (CTAs), C-30<br />

Coprocessors, A-33–34<br />

defined, 218<br />

move instructions, A-71–72<br />

Core MIPS instruction set, 236. See also<br />

MIPS<br />

abstract view, 246<br />

desktop RISC, E-9–11<br />

implementation, 244–248<br />

implementation illustration, 247<br />

overview, 245<br />

subset, 244<br />

Cores<br />

defined, 43<br />

number per chip, 43<br />

Correlation predictor, 324<br />

Cosmic Cube, OL6.15-7<br />

Count register, A-34<br />

CPU, 9<br />

Cray computers, OL3.11-5–3.11-6<br />

Critical word first, 392<br />

Crossbar networks, 535<br />

CTSS (Compatible Time-Sharing System), OL5.18-9

CUDA programming environment, 523, C-5

barrier synchronization, C-18, C-34<br />

development, C-17, C-18<br />

hierarchy of thread groups, C-18<br />

kernels, C-19, C-24<br />

key abstractions, C-18<br />

paradigm, C-19–23<br />

parallel plus-scan template, C-61<br />

per-block shared memory, C-58<br />

plus-reduction implementation, C-63<br />

programs, C-6, C-24<br />

scalable parallel programming with,<br />

C-17–23<br />

shared memories, C-18<br />

threads, C-36<br />

Cyclic redundancy check, 423<br />

Cylinder, 381<br />

D<br />

D flip-flops, B-51, B-53<br />

D latches, B-51, B-52<br />

Data bits, 421<br />

Data flow analysis, OL2.15-11<br />

Data hazards, 278, 303–316. See also Hazards

forwarding, 278, 303–316<br />

load-use, 280, 318<br />

stalls <strong>and</strong>, 313–316<br />

Data layout directives, A-14<br />

Data movement instructions, A-70–73<br />

Data parallel problem decomposition,<br />

C-17, C-18<br />

Data race, 121<br />

Data segment, A-13<br />

Data selectors, 246<br />

Data transfer instructions. See also Instructions

defined, 68<br />

load, 68<br />

offset, 69<br />

store, 71<br />

Datacenters, 7<br />

Data-level parallelism, 508<br />

Datapath elements<br />

defined, 251<br />

sharing, 256<br />

Datapaths<br />

branch, 254<br />

building, 251–259<br />

control signal truth tables, D-14<br />

control unit, 265<br />

defined, 19<br />

design, 251<br />

exception handling, 329

for fetching instructions, 253<br />

for hazard resolution via forwarding,<br />

311<br />

for jump instruction, 270<br />

for memory instructions, 256<br />

for MIPS architecture, 257<br />

in operation for branch-on-equal<br />

instruction, 268<br />

in operation for load instruction, 267<br />

in operation for R-type instruction,<br />

266<br />

operation of, 264–269<br />

pipelined, 286–303<br />

for R-type instructions, 256, 264–265<br />

single, creating, 256<br />

single-cycle, 283<br />

static two-issue, 336<br />

Deasserted signals, 250, B-4<br />

Debugging information, A-13<br />

DEC PDP-8, OL2.21-3<br />

Decimal numbers<br />

binary number conversion to, 76<br />

defined, 73<br />

Decision-making instructions, 90–96<br />

Decoders, B-9<br />

two-level, B-65<br />

Decoding machine language, 118–120<br />

Defect, 26<br />

Delayed branches, 96. See also Branches

as control hazard solution, 284<br />

defined, 255<br />

embedded RISCs <strong>and</strong>, E-23<br />

for five-stage pipelines, 26, 323–324<br />

reducing, 318–319<br />

scheduling limitations, 323<br />

Delayed decision, 284<br />

DeMorgan’s theorems, B-11<br />

Denormalized numbers, 222<br />

Dependability via redundancy, 12<br />

Dependable memory hierarchy, 418–423<br />

failure, defining, 418<br />

Dependences<br />

between pipeline registers, 308<br />

between pipeline registers and ALU inputs, 308

bubble insertion <strong>and</strong>, 314<br />

detection, 306–308<br />

name, 338<br />

sequence, 304<br />

<strong>Design</strong><br />

compromises and, 161

datapath, 251<br />

digital, 354<br />

logic, 248–251, B-1–79<br />

main control unit, 261–264<br />

memory hierarchy, challenges, 460<br />

pipelining instruction sets, 277<br />

Desktop and server RISCs. See also Reduced instruction set computer (RISC) architectures



addressing modes, E-6<br />

architecture summary, E-4<br />

arithmetic/logical instructions, E-11<br />

conditional branches, E-16<br />

constant extension summary, E-9<br />

control instructions, E-11<br />

conventions equivalent to MIPS core,<br />

E-12<br />

data transfer instructions, E-10<br />

features added to, E-45<br />

floating-point instructions, E-12<br />

instruction formats, E-7<br />

multimedia extensions, E-16–18<br />

multimedia support, E-18<br />

types of, E-3<br />

Desktop computers, defined, 5<br />

Device driver, OL6.9-5<br />

DGEMM (Double precision General Matrix Multiply), 225, 352, 413, 553

cache blocked version of, 415<br />

optimized C version of, 226, 227, 476<br />

performance, 354, 416<br />

Dicing, 27<br />

Dies, 26, 26–27<br />

Digital design pipeline, 354<br />

Digital signal-processing (DSP) extensions, E-19

DIMMs (dual inline memory modules), OL5.17-5

Direct Data IO (DDIO), OL6.9-6<br />

Direct memory access (DMA), OL6.9-4<br />

Direct3D, C-13<br />

Direct-mapped caches. See also Caches

address portions, 407<br />

choice of, 456<br />

defined, 384, 402<br />

illustrated, 385<br />

memory block location, 403<br />

misses, 405<br />

single comparator, 407<br />

total number of bits, 390<br />

Dirty bit, 437<br />

Dirty pages, 437<br />

Disk memory, 381–383<br />

Displacement addressing, 116<br />

Distributed Block-Interleaved Parity (RAID 5), OL5.11-6

div (Divide), A-52<br />

div.d (FP Divide Double), A-76<br />

div.s (FP Divide Single), A-76<br />

Divide algorithm, 190<br />

Dividend, 189<br />

Division, 189–195<br />

algorithm, 191<br />

dividend, 189<br />

divisor, 189<br />

Divisor, 189<br />

divu (Divide Unsigned), A-52. See also Arithmetic

faster, 194<br />

floating-point, 211, A-76<br />

hardware, 189–192<br />

hardware, improved version, 192<br />

instructions, A-52–53<br />

in MIPS, 194<br />

oper<strong>and</strong>s, 189<br />

quotient, 189<br />

remainder, 189<br />

signed, 192–194<br />

SRT, 194<br />

Don’t cares, B-17–18<br />

example, B-17–18<br />

term, 261<br />

Double data rate (DDR), 379<br />

Double Data Rate RAMs (DDRRAMs), 379–380, B-65

Double precision. See also Single precision

defined, 198<br />

FMA, C-45–46<br />

GPU, C-45–46, C-74<br />

representation, 201<br />

Double words, 152<br />

Dual inline memory modules (DIMMs),<br />

381<br />

Dynamic branch prediction, 321–323. See also Control hazards

branch prediction buffer, 321<br />

loops and, 321–323

Dynamic hardware predictors, 284<br />

Dynamic multiple-issue processors, 333, 339–341. See also Multiple issue

pipeline scheduling, 339–341<br />

superscalar, 339<br />

Dynamic pipeline scheduling, 339–341<br />

commit unit, 339–340<br />

concept, 339–340<br />

hardware-based speculation, 341<br />

primary units, 340<br />

reorder buffer, 343<br />

reservation station, 339–340<br />

Dynamic random access memory (DRAM), 378, 379–381, B-63–65

b<strong>and</strong>width external to, 398<br />

cost, 23<br />

defined, 19, B-63<br />

DIMM, OL5.17-5<br />

Double Data Rate (DDR), 379–380

early board, OL5.17-4<br />

GPU, C-37–38<br />

growth of capacity, 25<br />

history, OL5.17-2<br />

internal organization of, 380<br />

pass transistor, B-63<br />

SIMM, OL5.17-5, OL5.17-6<br />

single-transistor, B-64<br />

size, 398<br />

speed, 23<br />

synchronous (SDRAM), 379–380,<br />

B-60, B-65<br />

two-level decoder, B-65<br />

Dynamically linked libraries (DLLs), 129–131

defined, 129<br />

lazy procedure linkage version, 130<br />

E<br />

Early restart, 392<br />

Edge-triggered clocking methodology, 249, 250, B-48, B-73

advantage, B-49<br />

clocks, B-73<br />

drawbacks, B-74<br />

illustrated, B-50<br />

rising edge/falling edge, B-48<br />

EDSAC (Electronic Delay Storage Automatic Calculator), OL1.12-3, OL5.17-2

Eispack, OL3.11-4<br />

Electrically erasable programmable read-only memory (EEPROM), 381

Elements<br />

combinational, 248<br />

datapath, 251, 256<br />

memory, B-50–58<br />

state, 248, 250, 252, B-48, B-50<br />

Embedded computers, 5<br />

application requirements, 6<br />

defined, A-7<br />

design, 5<br />

growth, OL1.12-12–1.12-13<br />

Embedded Microprocessor Benchmark Consortium (EEMBC), OL1.12-12



Embedded RISCs. See also Reduced instruction set computer (RISC) architectures

addressing modes, E-6<br />

architecture summary, E-4<br />

arithmetic/logical instructions, E-14<br />

conditional branches, E-16<br />

constant extension summary, E-9<br />

control instructions, E-15<br />

data transfer instructions, E-13<br />

delayed branch <strong>and</strong>, E-23<br />

DSP extensions, E-19<br />

general purpose registers, E-5<br />

instruction conventions, E-15<br />

instruction formats, E-8<br />

multiply-accumulate approaches, E-19<br />

types of, E-4<br />

Encoding<br />

defined, D-31<br />

floating-point instruction, 213<br />

MIPS instruction, 83, 119, A-49<br />

ROM control function, D-18–19<br />

ROM logic function, B-15<br />

x86 instruction, 155–156<br />

ENIAC (Electronic Numerical Integrator and Calculator), OL1.12-2, OL1.12-3, OL5.17-2

EPIC, OL4.16-5<br />

Error correction, B-65–67<br />

Error Detecting <strong>and</strong> Correcting Code<br />

(RAID 2), OL5.11-5<br />

Error detection, B-66<br />

Error detection code, 420<br />

Ethernet, 23<br />

EX stage<br />

load instructions, 292<br />

overflow exception detection, 328<br />

store instructions, 294<br />

Exabyte, 6<br />

Exception enable, 447<br />

Exception handlers, A-36–38

defined, A-35<br />

return from, A-38<br />

Exception program counters (EPCs), 326<br />

address capture, 331<br />

copying, 181<br />

defined, 181, 327<br />

in restart determination, 326–327<br />

transferring, 182<br />

Exceptions, 325–332, A-33–38<br />

association, 331–332<br />

datapath with controls for handling, 329

defined, 180, 326<br />

detecting, 326<br />

event types <strong>and</strong>, 326<br />

imprecise, 331–332<br />

instructions, A-80<br />

interrupts versus, 325–326<br />

in MIPS architecture, 326–327<br />

overflow, 329<br />

PC, 445, 446–447<br />

pipelined computer example, 328<br />

in pipelined implementation, 327–332<br />

precise, 332<br />

reasons for, 326–327<br />

result due to overflow in add<br />

instruction, 330<br />

saving/restoring stage on, 450<br />

Exclusive OR (XOR) instructions, A-57<br />

Executable files, A-4<br />

defined, 126<br />

linker production, A-19<br />

Execute or address calculation stage, 292<br />

Execute/address calculation<br />

control line, 300<br />

load instruction, 292<br />

store instruction, 292<br />

Execution time<br />

as valid performance measure, 51<br />

CPU, 32, 33–34<br />

pipelining and, 286

Explicit counters, D-23, D-26<br />

Exponents, 197–198<br />

External labels, A-10<br />

F<br />

Facilities, A-14–17<br />

Failures, synchronizer, B-77<br />

Fallacies. See also Pitfalls<br />

add immediate unsigned, 227<br />

Amdahl’s law, 556<br />

arithmetic, 229–232<br />

assembly language for performance,<br />

159–160<br />

commercial binary compatibility<br />

importance, 160<br />

defined, 49<br />

GPUs, C-72–74, C-75<br />

low utilization uses little power, 50<br />

peak performance, 556<br />

pipelining, 355–356<br />

powerful instructions mean higher<br />

performance, 159<br />

right shift, 229<br />

False sharing, 469<br />

Fast carry<br />

with “infinite” hardware, B-38–39<br />

with first level of abstraction, B-39–40<br />

with second level of abstraction,<br />

B-40–46<br />

Fast Fourier Transforms (FFT), C-53<br />

Fault avoidance, 419<br />

Fault forecasting, 419<br />

Fault tolerance, 419<br />

Fermi architecture, 523, 552<br />

Field programmable devices (FPDs), B-78<br />

Field programmable gate arrays (FPGAs), B-78

Fields<br />

Cause register, A-34, A-35<br />

defined, 82<br />

format, D-31<br />

MIPS, 82–83<br />

names, 82<br />

Status register, A-34, A-35<br />

Files, register, 252, 257, B-50, B-54–56<br />

Fine-grained multithreading, 514<br />

Finite-state machines (FSMs), 451–466,<br />

B-67–72<br />

control, D-8–22<br />

controllers, 464<br />

for multicycle control, D-9<br />

for simple cache controller, 464–466<br />

implementation, 463, B-70<br />

Mealy, 463<br />

Moore, 463<br />

next-state function, 463, B-67<br />

output function, B-67, B-69<br />

state assignment, B-70<br />

state register implementation, B-71<br />

style of, 463<br />

synchronous, B-67<br />

SystemVerilog, OL5.12-7<br />

traffic light example, B-68–70<br />

Flash memory, 381<br />

characteristics, 23<br />

defined, 23<br />

Flat address space, 479<br />

Flip-flops<br />

D flip-flops, B-51, B-53<br />

defined, B-51



Floating point, 196–222, 224<br />

assembly language, 212<br />

backward step, OL3.11-4–3.11-5<br />

binary to decimal conversion, 202<br />

branch, 211<br />

challenges, 232–233<br />

diversity versus portability, OL3.11-3–3.11-4

division, 211<br />

first dispute, OL3.11-2–3.11-3<br />

form, 197<br />

fused multiply add, 220<br />

guard digits, 218–219<br />

history, OL3.11-3<br />

IEEE 754 standard, 198, 199

instruction encoding, 213<br />

intermediate calculations, 218<br />

machine language, 212<br />

MIPS instruction frequency for, 236<br />

MIPS instructions, 211–213<br />

oper<strong>and</strong>s, 212<br />

overflow, 198<br />

packed format, 224<br />

precision, 230<br />

procedure with two-dimensional matrices, 215–217

programs, compiling, 214–217<br />

registers, 217<br />

representation, 197–202<br />

rounding, 218–219<br />

sign and magnitude, 197

SSE2 architecture, 224–225<br />

subtraction, 211<br />

underflow, 198<br />

units, 219<br />

in x86, 224<br />

Floating vectors, OL3.11-3<br />

Floating-point addition, 203–206<br />

arithmetic unit block diagram, 207<br />

binary, 204<br />

illustrated, 205<br />

instructions, 211, A-73–74<br />

steps, 203–204<br />

Floating-point arithmetic (GPUs), C-41–46

basic, C-42<br />

double precision, C-45–46, C-74<br />

performance, C-44<br />

specialized, C-42–44<br />

supported formats, C-42<br />

texture operations, C-44<br />

Floating-point instructions, A-73–80<br />

absolute value, A-73<br />

addition, A-73–74<br />

comparison, A-74–75<br />

conversion, A-75–76<br />

desktop RISC, E-12<br />

division, A-76<br />

load, A-76–77<br />

move, A-77–78<br />

multiplication, A-78<br />

negation, A-78–79<br />

SPARC, E-31<br />

square root, A-79<br />

store, A-79<br />

subtraction, A-79–80<br />

truncation, A-80<br />

Floating-point multiplication, 206–210<br />

binary, 210–211<br />

illustrated, 209<br />

instructions, 211<br />

signific<strong>and</strong>s, 206<br />

steps, 206–210<br />

Flow-sensitive information, OL2.15-15<br />

Flushing instructions, 318, 319<br />

defined, 319<br />

exceptions and, 331

For loops, 141, OL2.15-26<br />

inner, OL2.15-24<br />

SIMD and, OL6.15-2

Formal parameters, A-16<br />

Format fields, D-31<br />

Fortran, OL2.21-7<br />

Forward references, A-11<br />

Forwarding, 303–316<br />

ALU before, 309<br />

control, 307<br />

datapath for hazard resolution, 311<br />

defined, 278<br />

functioning, 306<br />

graphical representation, 279<br />

illustrations, OL4.13-26–4.13-26<br />

multiple results <strong>and</strong>, 281<br />

multiplexors, 310<br />

pipeline registers before, 309<br />

with two instructions, 278<br />

Verilog implementation, OL4.13-2–4.13-4

Fractions, 197, 198<br />

Frame buffer, 18<br />

Frame pointers, 103<br />

Front end, OL2.15-3<br />

Fully associative caches. See also Caches<br />

block replacement strategies, 457<br />

choice of, 456<br />

defined, 403<br />

memory block location, 403<br />

misses, 406<br />

Fully connected networks, 535<br />

Function code, 82<br />

Fused-multiply-add (FMA) operation, 220, C-45–46

G<br />

Game consoles, C-9<br />

Gates, B-3, B-8<br />

AND, B-12, D-7<br />

delays, B-46<br />

mapping ALU control function to,<br />

D-4–7<br />

NAND, B-8<br />

NOR, B-8, B-50<br />

Gather-scatter, 511, 552<br />

General Purpose GPUs (GPGPUs), C-5

General-purpose registers, 150<br />

architectures, OL2.21-3<br />

embedded RISCs, E-5<br />

Generate<br />

defined, B-40<br />

example, B-44<br />

super, B-41<br />

Gigabyte, 6<br />

Global common subexpression elimination, OL2.15-6

Global memory, C-21, C-39<br />

Global miss rates, 416<br />

Global optimization, OL2.15-5<br />

code, OL2.15-7<br />

implementing, OL2.15-8–2.15-11<br />

Global pointers, 102<br />

GPU computing. See also Graphics processing units (GPUs)

defined, C-5<br />

visual applications, C-6–7<br />

GPU system architectures, C-7–12<br />

graphics logical pipeline, C-10<br />

heterogeneous, C-7–9<br />

implications for, C-24<br />

interfaces <strong>and</strong> drivers, C-9<br />

unified, C-10–12<br />

Graph coloring, OL2.15-12



Graphics displays<br />

computer hardware support, 18<br />

LCD, 18<br />

Graphics logical pipeline, C-10<br />

Graphics processing units (GPUs), 522–529. See also GPU computing

as accelerators, 522<br />

attribute interpolation, C-43–44<br />

defined, 46, 506, C-3<br />

evolution, C-5<br />

fallacies <strong>and</strong> pitfalls, C-72–75<br />

floating-point arithmetic, C-17, C-41–46, C-74

GeForce 8-series generation, C-5<br />

general computation, C-73–74<br />

General Purpose (GPGPUs), C-5<br />

graphics mode, C-6<br />

graphics trends, C-4<br />

history, C-3–4<br />

logical graphics pipeline, C-13–14<br />

mapping applications to, C-55–72<br />

memory, 523<br />

multilevel caches <strong>and</strong>, 522<br />

N-body applications, C-65–72<br />

NVIDIA architecture, 523–526<br />

parallel memory system, C-36–41<br />

parallelism, 523, C-76<br />

performance doubling, C-4<br />

perspective, 527–529<br />

programming, C-12–24<br />

programming interfaces to, C-17<br />

real-time graphics, C-13<br />

summary, C-76<br />

Graphics shader programs, C-14–15<br />

Gresham’s Law, 236, OL3.11-2<br />

Grid computing, 533<br />

Grids, C-19<br />

GTX 280, 548–553<br />

Guard digits<br />

defined, 218<br />

rounding with, 219<br />

H<br />

Half precision, C-42<br />

Halfwords, 110<br />

Hamming, Richard, 420<br />

Hamming distance, 420<br />

Hamming Error Correction Code (ECC), 420–421

calculating, 420–421<br />

Handlers

defined, 449<br />

TLB miss, 448<br />

Hard disks<br />

access times, 23<br />

defined, 23<br />

Hardware<br />

as hierarchical layer, 13<br />

language of, 14–16<br />

operations, 63–66<br />

supporting procedures in, 96–106<br />

synthesis, B-21<br />

translating microprograms to, D-28–32<br />

virtualizable, 426<br />

Hardware description languages. See also Verilog

defined, B-20<br />

using, B-20–26<br />

VHDL, B-20–21<br />

Hardware multithreading, 514–517<br />

coarse-grained, 514<br />

options, 516<br />

simultaneous, 515–517<br />

Hardware-based speculation, 341<br />

Harvard architecture, OL1.12-4<br />

Hazard detection units, 313–314<br />

functions, 314<br />

pipeline connections for, 314<br />

Hazards, 277–278. See also Pipelining<br />

control, 281–282, 316–325<br />

data, 278, 303–316<br />

forwarding and, 312

structural, 277, 294<br />

Heap<br />

allocating space on, 104–106<br />

defined, 104<br />

Heterogeneous systems, C-4–5<br />

architecture, C-7–9<br />

defined, C-3<br />

Hexadecimal numbers, 81–82<br />

binary number conversion to, 81–82<br />

Hierarchy of memories, 12<br />

High-level languages, 14–16, A-6<br />

benefits, 16<br />

computer architectures, OL2.21-5<br />

importance, 16<br />

High-level optimizations, OL2.15-4–2.15-5

Hit rate, 376<br />

Hit time<br />

cache performance <strong>and</strong>, 401–402<br />

defined, 376<br />

Hit under miss, 472<br />

Hold time, B-54<br />

Horizontal microcode, D-32<br />

Hot-swapping, OL5.11-7<br />

Human genome project, 4<br />

I

I/O, A-38–40, OL6.9-2, OL6.9-3<br />

memory-mapped, A-38<br />

on system performance, OL5.11-2<br />

I/O benchmarks. See Benchmarks

IBM 360/85, OL5.17-7<br />

IBM 701, OL1.12-5<br />

IBM 7030, OL4.16-2<br />

IBM ALOG, OL3.11-7<br />

IBM Blue Gene, OL6.15-9–6.15-10<br />

IBM Personal Computer, OL1.12-7, OL2.21-6

IBM System/360 computers, OL1.12-6, OL3.11-6, OL4.16-2

IBM z/VM, OL5.17-8<br />

ID stage<br />

branch execution in, 319<br />

load instructions, 292<br />

store instruction in, 291<br />

IEEE 754 floating-point standard, 198, 199, OL3.11-8–3.11-10. See also Floating point

first chips, OL3.11-8–3.11-9<br />

in GPU arithmetic, C-42–43<br />

implementation, OL3.11-10<br />

rounding modes, 219<br />

today, OL3.11-10<br />

If statements, 114<br />

I-format, 83<br />

If-then-else, 91<br />

Immediate addressing, 116<br />

Immediate instructions, 72<br />

Imprecise interrupts, 331, OL4.16-4<br />

Index-out-of-bounds check, 94–95<br />

Induction variable elimination, OL2.15-7<br />

Inheritance, OL2.15-15<br />

In-order commit, 341<br />

Input devices, 16<br />

Inputs, 261<br />

Instances, OL2.15-15<br />

Instruction count, 36, 38<br />

Instruction decode/register file read stage



control line, 300<br />

load instruction, 289<br />

store instruction, 294<br />

Instruction execution illustrations, OL4.13-16–4.13-17

clock cycle 9, OL4.13-24<br />

clock cycles 1 and 2, OL4.13-21

clock cycles 3 and 4, OL4.13-22

clock cycles 5 and 6, OL4.13-23

clock cycles 7 and 8, OL4.13-24

examples, OL4.13-20–4.13-25<br />

forwarding, OL4.13-26–4.13-31<br />

no hazard, OL4.13-17<br />

pipelines with stalls and forwarding, OL4.13-26, OL4.13-20

Instruction fetch stage<br />

control line, 300<br />

load instruction, 289<br />

store instruction, 294<br />

Instruction formats, 157<br />

ARM, 148<br />

defined, 81<br />

desktop/server RISC architectures, E-7<br />

embedded RISC architectures, E-8<br />

I-type, 83<br />

J-type, 113<br />

jump instruction, 270<br />

MIPS, 148<br />

R-type, 83, 261<br />

x86, 157<br />

Instruction latency, 356<br />

Instruction mix, 39, OL1.12-10<br />

Instruction set architecture<br />

ARM, 145–147<br />

branch address calculation, 254<br />

defined, 22, 52<br />

history, 163<br />

maintaining, 52<br />

protection <strong>and</strong>, 427<br />

thread, C-31–34<br />

virtual machine support, 426–427<br />

Instruction sets, 235, C-49<br />

ARM, 324<br />

design for pipelining, 277<br />

MIPS, 62, 161, 234<br />

MIPS-32, 235<br />

Pseudo MIPS, 233<br />

x86 growth, 161<br />

Instruction-level parallelism (ILP), 354. See also Parallelism

compiler exploitation, OL4.16-5–4.16-6<br />

defined, 43, 333<br />

exploitation, increasing, 343<br />

and matrix multiply, 351–354

Instructions, 60–164, E-25–27, E-40–42. See also Arithmetic instructions; MIPS; Operands

add immediate, 72<br />

addition, 180, A-51<br />

Alpha, E-27–29<br />

arithmetic-logical, 251, A-51–57<br />

ARM, 145–147, E-36–37<br />

assembly, 66<br />

basic block, 93<br />

branch, A-59–63<br />

cache-aware, 482<br />

comparison, A-57–59<br />

conditional branch, 90<br />

conditional move, 324<br />

constant-manipulating, A-57<br />

conversion, A-75–76<br />

core, 233<br />

data movement, A-70–73<br />

data transfer, 68<br />

decision-making, 90–96<br />

defined, 14, 62<br />

desktop RISC conventions, E-12<br />

division, A-52–53<br />

as electronic signals, 80<br />

embedded RISC conventions, E-15<br />

encoding, 83<br />

exception <strong>and</strong> interrupt, A-80<br />

exclusive OR, A-57<br />

fetching, 253<br />

fields, 80<br />

floating-point (x86), 224<br />

floating-point, 211–213, A-73–80<br />

flushing, 318, 319, 331<br />

immediate, 72<br />

introduction to, 62–63<br />

jump, 95, 97, A-63–64<br />

left-to-right flow, 287–288<br />

load, 68, A-66–68<br />

load linked, 122<br />

logical operations, 87–89<br />

M32R, E-40<br />

memory access, C-33–34<br />

memory-reference, 245<br />

multiplication, 188, A-53–54<br />

negation, A-54<br />

nop, 314<br />

PA-RISC, E-34–36<br />

performance, 35–36<br />

pipeline sequence, 313<br />

PowerPC, E-12–13, E-32–34<br />

PTX, C-31, C-32<br />

remainder, A-55<br />

representation in computer, 80–87<br />

restartable, 450<br />

resuming, 450<br />

R-type, 252<br />

shift, A-55–56<br />

SPARC, E-29–32<br />

store, 71, A-68–70<br />

store conditional, 122<br />

subtraction, 180, A-56–57<br />

SuperH, E-39–40<br />

thread, C-30–31<br />

Thumb, E-38<br />

trap, A-64–66<br />

vector, 510<br />

as words, 62<br />

x86, 149–155<br />

Instructions per clock cycle (IPC), 333<br />

Integrated circuits (ICs), 19. See also specific chips

cost, 27<br />

defined, 25<br />

manufacturing process, 26<br />

very large-scale (VLSIs), 25<br />

Intel Core i7, 46–49, 244, 501, 548–553<br />

address translation for, 471<br />

architectural registers, 347<br />

caches in, 472<br />

memory hierarchies of, 471–475<br />

microarchitecture, 338<br />

performance of, 473<br />

SPEC CPU benchmark, 46–48<br />

SPEC power benchmark, 48–49<br />

TLB hardware for, 471<br />

Intel Core i7 920, 346–349<br />

microarchitecture, 347<br />

Intel Core i7 960<br />

benchmarking and rooflines of, 548–553

Intel Core i7 Pipelines, 344, 346–349<br />

memory components, 348<br />

performance, 349–351<br />

program performance, 351<br />

specification, 345<br />

Intel IA-64 architecture, OL2.21-3<br />

Intel Paragon, OL6.15-8



Intel Threading Building Blocks, C-60<br />

Intel x86 microprocessors<br />

clock rate and power for, 40

Interference graphs, OL2.15-12<br />

Interleaving, 398<br />

Interprocedural analysis, OL2.15-14<br />

Interrupt enable, 447<br />

Interrupt h<strong>and</strong>lers, A-33<br />

Interrupt-driven I/O, OL6.9-4<br />

Interrupts<br />

defined, 180, 326<br />

event types <strong>and</strong>, 326<br />

exceptions versus, 325–326<br />

imprecise, 331, OL4.16-4<br />

instructions, A-80<br />

precise, 332<br />

vectored, 327<br />

Intrinsity FastMATH processor, 395–398<br />

caches, 396<br />

data miss rates, 397, 407<br />

read processing, 442<br />

TLB, 440<br />

write-through processing, 442<br />

Inverted page tables, 436<br />

Issue packets, 334<br />

J<br />

j (Jump), 64<br />

jal (Jump And Link), 64<br />

Java<br />

bytecode, 131<br />

bytecode architecture, OL2.15-17<br />

characters in, 109–111<br />

compiling in, OL2.15-19–2.15-20<br />

goals, 131<br />

interpreting, 131, 145, OL2.15-15–<br />

2.15-16<br />

keywords, OL2.15-21<br />

method invocation in, OL2.15-21<br />

pointers, OL2.15-26<br />

primitive types, OL2.15-26<br />

programs, starting, 131–132<br />

reference types, OL2.15-26<br />

sort algorithms, 141<br />

strings in, 109–111<br />

translation hierarchy, 131<br />

while loop compilation in, OL2.15-<br />

18–2.15-19<br />

Java Virtual Machine (JVM), 145,<br />

OL2.15-16<br />

jr (Jump Register), 64<br />

J-type instruction format, 113<br />

Jump instructions, 254, E-26<br />

branch instruction versus, 270<br />

control <strong>and</strong> datapath for, 271<br />

implementing, 270<br />

instruction format, 270<br />

list of, A-63–64<br />

Just In Time (JIT) compilers, 132, 560

K<br />

Karnaugh maps, B-18<br />

Kernel mode, 444<br />

Kernels<br />

CUDA, C-19, C-24<br />

defined, C-19<br />

Kilobyte, 6<br />

L<br />

Labels<br />

global, A-10, A-11<br />

local, A-11<br />

LAPACK, 230<br />

Large-scale multiprocessors, OL6.15-7, OL6.15-9–6.15-10

Latches<br />

D latch, B-51, B-52<br />

defined, B-51<br />

Latency<br />

instruction, 356<br />

memory, C-74–75<br />

pipeline, 286<br />

use, 336–337<br />

lbu (Load Byte Unsigned), 64<br />

Leaf procedures. See also Procedures<br />

defined, 100<br />

example, 109<br />

Least recently used (LRU)<br />

as block replacement strategy, 457<br />

defined, 409<br />

pages, 434<br />

Least significant bits, B-32<br />

defined, 74<br />

SPARC, E-31<br />

Left-to-right instruction flow, 287–288<br />

Level-sensitive clocking, B-74, B-75–76<br />

defined, B-74<br />

two-phase, B-75<br />

lhu (Load Halfword Unsigned), 64<br />

li (Load Immediate), 162<br />

Link, OL6.9-2<br />

Linkers, 126–129, A-18–19<br />

defined, 126, A-4<br />

executable files, 126, A-19<br />

function illustration, A-19<br />

steps, 126<br />

using, 126–129<br />

Linking object files, 126–129<br />

Linpack, 538, OL3.11-4<br />

Liquid crystal displays (LCDs), 18<br />

LISP, SPARC support, E-30<br />

Little-endian byte order, A-43<br />

Live range, OL2.15-11<br />

Livermore Loops, OL1.12-11<br />

ll (Load Linked), 64<br />

Load balancing, 505–506<br />

Load instructions. See also Store instructions

access, C-41<br />

base register, 262<br />

block, 149<br />

compiling with, 71<br />

datapath in operation for, 267<br />

defined, 68<br />

details, A-66–68<br />

EX stage, 292<br />

floating-point, A-76–77<br />

halfword unsigned, 110<br />

ID stage, 291<br />

IF stage, 291<br />

linked, 122, 123<br />

list of, A-66–68<br />

load byte unsigned, 76<br />

load half, 110<br />

load upper immediate, 112, 113<br />

MEM stage, 293<br />

pipelined datapath in, 296<br />

signed, 76<br />

unit for implementing, 255<br />

unsigned, 76<br />

WB stage, 293<br />

Load word, 68, 71<br />

Loaders, 129<br />

Loading, A-19–20<br />

Load-store architectures, OL2.21-3<br />

Load-use data hazard, 280, 318<br />

Load-use stalls, 318<br />

Local area networks (LANs), 24. See also Networks



Local labels, A-11<br />

Local memory, C-21, C-40<br />

Local miss rates, 416<br />

Local optimization, OL2.15-5. See also Optimization

implementing, OL2.15-8<br />

Locality<br />

principle, 374<br />

spatial, 374, 377<br />

temporal, 374, 377<br />

Lock synchronization, 121<br />

Locks, 518<br />

Logic<br />

address select, D-24, D-25<br />

ALU control, D-6<br />

combinational, 250, B-5, B-9–20<br />

components, 249<br />

control unit equations, D-11<br />

design, 248–251, B-1–79<br />

equations, B-7<br />

minimization, B-18<br />

programmable array (PAL), B-78

sequential, B-5, B-56–58<br />

two-level, B-11–14<br />

Logical operations, 87–89<br />

AND, 88, A-52<br />

ARM, 149<br />

desktop RISC, E-11<br />

embedded RISC, E-14<br />

MIPS, A-51–57<br />

NOR, 89, A-54<br />

NOT, 89, A-55<br />

OR, 89, A-55<br />

shifts, 87<br />

Long instruction word (LIW), OL4.16-5

Lookup tables (LUTs), B-79<br />

Loop unrolling<br />

defined, 338, OL2.15-4<br />

for multiple-issue pipelines, 338<br />

register renaming <strong>and</strong>, 338<br />

Loops, 92–93<br />

conditional branches in, 114<br />

for, 141<br />

prediction <strong>and</strong>, 321–323<br />

test, 142, 143<br />

while, compiling, 92–93<br />

lui (Load Upper Imm.), 64<br />

lw (Load Word), 64<br />

lwc1 (Load FP Single), A-73<br />

M<br />

M32R, E-15, E-40<br />

Machine code, 81<br />

Machine instructions, 81<br />

Machine language, 15<br />

branch offset in, 115<br />

decoding, 118–120<br />

defined, 14, 81, A-3<br />

floating-point, 212<br />

illustrated, 15<br />

MIPS, 85<br />

SRAM, 21<br />

translating MIPS assembly language<br />

into, 84<br />

Macros<br />

defined, A-4<br />

example, A-15–17<br />

use of, A-15<br />

Main memory, 428. See also Memory<br />

defined, 23<br />

page tables, 437<br />

physical addresses, 428<br />

Mapping applications, C-55–72<br />

Mark computers, OL1.12-14<br />

Matrix multiply, 225–228, 553–555<br />

Mealy machine, 463–464, B-68, B-71, B-72

Mean time to failure (MTTF), 418

improving, 419<br />

versus AFR of disks, 419–420<br />

Media Access Control (MAC) address, OL6.9-7

Megabyte, 6<br />

Memory<br />

addresses, 77<br />

affinity, 545<br />

atomic, C-21<br />

bandwidth, 380–381, 397

cache, 21, 383–398, 398–417<br />

CAM, 408<br />

constant, C-40<br />

control, D-26<br />

defined, 19<br />

DRAM, 19, 379–380, B-63–65<br />

flash, 23<br />

global, C-21, C-39<br />

GPU, 523<br />

instructions, datapath for, 256<br />

layout, A-21<br />

local, C-21, C-40<br />

main, 23<br />

nonvolatile, 22<br />

oper<strong>and</strong>s, 68–69<br />

parallel system, C-36–41<br />

read-only (ROM), B-14–16<br />

SDRAM, 379–380<br />

secondary, 23<br />

shared, C-21, C-39–40<br />

spaces, C-39<br />

SRAM, B-58–62<br />

stalls, 400<br />

technologies for building, 24–28<br />

texture, C-40<br />

usage, A-20–22<br />

virtual, 427–454<br />

volatile, 22<br />

Memory access instructions, C-33–34<br />

Memory access stage<br />

control line, 302<br />

load instruction, 292<br />

store instruction, 292<br />

Memory bandwidth, 551, 557

Memory consistency model, 469<br />

Memory elements, B-50–58<br />

clocked, B-51<br />

D flip-flop, B-51, B-53<br />

D latch, B-52<br />

DRAMs, B-63–67<br />

flip-flop, B-51<br />

hold time, B-54<br />

latch, B-51<br />

setup time, B-53, B-54<br />

SRAMs, B-58–62<br />

unclocked, B-51<br />

Memory hierarchies, 545<br />

of ARM cortex-A8, 471–475<br />

block (or line), 376<br />

cache performance, 398–417<br />

caches, 383–417<br />

common framework, 454–461<br />

defined, 375<br />

design challenges, 461<br />

development, OL5.17-6–5.17-8<br />

exploiting, 372–498<br />

of Intel core i7, 471–475<br />

level pairs, 376<br />

multiple levels, 375<br />

overall operation of, 443–444<br />

parallelism <strong>and</strong>, 466–470, OL5.11-2<br />

pitfalls, 478–482<br />

program execution time <strong>and</strong>, 417



quantitative design parameters, 454<br />

redundant arrays of inexpensive disks, 470

reliance on, 376<br />

structure, 375<br />

structure diagram, 378<br />

variance, 417<br />

virtual memory, 427–454<br />

Memory rank, 381<br />

Memory technologies, 378–383<br />

disk memory, 381–383<br />

DRAM technology, 378, 379–381<br />

flash memory, 381<br />

SRAM technology, 378, 379<br />

Memory-mapped I/O, OL6.9-3<br />

use of, A-38<br />

Memory-stall clock cycles, 399<br />

Message passing<br />

defined, 529<br />

multiprocessors, 529–534<br />

Metastability, B-76<br />

Methods<br />

defined, OL2.15-5<br />

invoking in Java, OL2.15-20–2.15-21<br />

static, A-20<br />

mfc0 (Move From Control), A-71<br />

mfhi (Move From Hi), A-71<br />

mflo (Move From Lo), A-71<br />

Microarchitectures, 347<br />

Intel Core i7 920, 347<br />

Microcode<br />

assembler, D-30<br />

control unit as, D-28<br />

defined, D-27<br />

dispatch ROMs, D-30–31<br />

horizontal, D-32<br />

vertical, D-32<br />

Microinstructions, D-31<br />

Microprocessors<br />

design shift, 501<br />

multicore, 8, 43, 500–501<br />

Microprograms<br />

as abstract control representation,<br />

D-30<br />

field translation, D-29<br />

translating to hardware, D-28–32<br />

Migration, 467<br />

Million instructions per second (MIPS), 51

Minterms<br />

defined, B-12, D-20<br />

in PLA implementation, D-20<br />

MIP-map, C-44<br />

MIPS, 64, 84, A-45–80<br />

addressing for 32-bit immediates,<br />

116–118<br />

addressing modes, A-45–47<br />

arithmetic core, 233<br />

arithmetic instructions, 63, A-51–57<br />

ARM similarities, 146<br />

assembler directive support, A-47–49<br />

assembler syntax, A-47–49<br />

assembly instruction, mapping, 80–81<br />

branch instructions, A-59–63<br />

comparison instructions, A-57–59<br />

compiling C assignment statements<br />

into, 65<br />

compiling complex C assignment into,<br />

65–66<br />

constant-manipulating instructions,<br />

A-57<br />

control registers, 448<br />

control unit, D-10<br />

CPU, A-46<br />

divide in, 194<br />

exceptions in, 326–327<br />

fields, 82–83<br />

floating-point instructions, 211–213<br />

FPU, A-46<br />

instruction classes, 163<br />

instruction encoding, 83, 119, A-49<br />

instruction formats, 120, 148, A-49–51<br />

instruction set, 62, 162, 234<br />

jump instructions, A-63–66<br />

logical instructions, A-51–57<br />

machine language, 85<br />

memory addresses, 70<br />

memory allocation for program and data, 104

multiply in, 188<br />

opcode map, A-50<br />

oper<strong>and</strong>s, 64<br />

Pseudo, 233, 235<br />

register conventions, 105<br />

static multiple issue with, 335–338<br />

MIPS core<br />

architecture, 195<br />

arithmetic/logical instructions not in,<br />

E-21, E-23<br />

common extensions to, E-20–25<br />

control instructions not in, E-21<br />

data transfer instructions not in, E-20,<br />

E-22<br />

floating-point instructions not in, E-22<br />

instruction set, 233, 244–248, E-9–10<br />

MIPS-16<br />

16-bit instruction set, E-41–42<br />

immediate fields, E-41<br />

instructions, E-40–42<br />

MIPS core instruction changes, E-42<br />

PC-relative addressing, E-41<br />

MIPS-32 instruction set, 235<br />

MIPS-64 instructions, E-25–27<br />

conditional procedure call instructions,<br />

E-27<br />

constant shift amount, E-25<br />

jump/call not PC-relative, E-26<br />

move to/from control registers, E-26<br />

nonaligned data transfers, E-25<br />

NOR, E-25<br />

parallel single precision floating-point<br />

operations, E-27<br />

reciprocal and reciprocal square root, E-27

SYSCALL, E-25<br />

TLB instructions, E-26–27<br />

Mirroring, OL5.11-5<br />

Miss penalty<br />

defined, 376<br />

determination, 391–392<br />

multilevel caches, reducing, 410<br />

Miss rates<br />

block size versus, 392<br />

data cache, 455<br />

defined, 376<br />

global, 416<br />

improvement, 391–392<br />

Intrinsity FastMATH processor, 397<br />

local, 416<br />

miss sources, 460<br />

split cache, 397<br />

Miss under miss, 472<br />

MMX (MultiMedia eXtension), 224<br />

Modules, A-4<br />

Moore machines, 463–464, B-68, B-71,<br />

B-72<br />

Moore’s law, 11, 379, 522, OL6.9-2, C-72–73

Most significant bit<br />

1-bit ALU for, B-33<br />

defined, 74<br />

move (Move), 139



Move instructions, A-70–73<br />

coprocessor, A-71–72<br />

details, A-70–73<br />

floating-point, A-77–78<br />

MS-DOS, OL5.17-11<br />

mul.d (FP Multiply Double), A-78<br />

mul.s (FP Multiply Single), A-78<br />

mult (Multiply), A-53<br />

Multicore, 517–521<br />

Multicore multiprocessors, 8, 43<br />

defined, 8, 500–501<br />

MULTICS (Multiplexed Information and Computing Service), OL5.17-9–5.17-10

Multilevel caches. See also Caches<br />

complications, 416<br />

defined, 398, 416<br />

miss penalty, reducing, 410<br />

performance of, 410<br />

summary, 417–418<br />

Multimedia extensions<br />

desktop/server RISCs, E-16–18<br />

as SIMD extensions to instruction sets,<br />

OL6.15-4<br />

vector versus, 511–512<br />

Multiple dimension arrays, 218<br />

Multiple instruction multiple data (MIMD), 558

defined, 507, 508<br />

first multiprocessor, OL6.15-14<br />

Multiple instruction single data (MISD), 507<br />

Multiple issue, 332–339<br />

code scheduling, 337–338<br />

dynamic, 333, 339–341<br />

issue packets, 334<br />

loop unrolling <strong>and</strong>, 338<br />

processors, 332, 333<br />

static, 333, 334–339<br />

throughput <strong>and</strong>, 342<br />

Multiple processors, 553–555<br />

Multiple-clock-cycle pipeline diagrams, 296–297

five instructions, 298<br />

illustrated, 298<br />

Multiplexors, B-10<br />

controls, 463<br />

in datapath, 263<br />

defined, 246<br />

forwarding, control values, 310<br />

selector control, 256–257<br />

two-input, B-10<br />

Multiplic<strong>and</strong>, 183<br />

Multiplication, 183–188. See also<br />

Arithmetic<br />

fast, hardware, 188<br />

faster, 187–188<br />

first algorithm, 185<br />

floating-point, 206–208, A-78<br />

hardware, 184–186<br />

instructions, 188, A-53–54<br />

in MIPS, 188<br />

multiplic<strong>and</strong>, 183<br />

multiplier, 183<br />

oper<strong>and</strong>s, 183<br />

product, 183<br />

sequential version, 184–186<br />

signed, 187<br />

Multiplier, 183<br />

Multiply algorithm, 186<br />

Multiply-add (MAD), C-42<br />

Multiprocessors<br />

benchmarks, 538–540<br />

bus-based coherent, OL6.15-7<br />

defined, 500<br />

historical perspective, 561<br />

large-scale, OL6.15-7–6.15-8, OL6.15-9–6.15-10

message-passing, 529–534<br />

multithreaded architecture, C-26–27,<br />

C-35–36<br />

organization, 499, 529<br />

for performance, 559<br />

shared memory, 501, 517–521<br />

software, 500<br />

TFLOPS, OL6.15-6<br />

UMA, 518<br />

Multistage networks, 535<br />

Multithreaded multiprocessor architecture, C-25–36

conclusion, C-36<br />

ISA, C-31–34<br />

massive multithreading, C-25–26<br />

multiprocessor, C-26–27<br />

multiprocessor comparison, C-35–36<br />

SIMT, C-27–30<br />

special function units (SFUs), C-35<br />

streaming processor (SP), C-34<br />

thread instructions, C-30–31<br />

threads/thread blocks management,<br />

C-30<br />

Multithreading, C-25–26<br />

coarse-grained, 514<br />

defined, 506<br />

fine-grained, 514<br />

hardware, 514–517<br />

simultaneous (SMT), 515–517<br />

multu (Multiply Unsigned), A-54<br />

Must-information, OL2.15-5<br />

Mutual exclusion, 121<br />

N<br />

Name dependence, 338<br />

NAND gates, B-8<br />

NAS (NASA Advanced Supercomputing), 540

N-body<br />

all-pairs algorithm, C-65<br />

GPU simulation, C-71<br />

mathematics, C-65–67<br />

multiple threads per body, C-68–69<br />

optimization, C-67<br />

performance comparison, C-69–70<br />

results, C-70–72<br />

shared memory use, C-67–68<br />

Negation instructions, A-54, A-78–79<br />

Negation shortcut, 76<br />

Nested procedures, 100–102<br />

compiling recursive procedure<br />

showing, 101–102<br />

NetFPGA 10-Gigabit Ethernet card, OL6.9-2, OL6.9-3

Network of Workstations, OL6.15-8–6.15-9

Network topologies, 534–537<br />

implementing, 536<br />

multistage, 537<br />

Networking, OL6.9-4<br />

operating system in, OL6.9-4–6.9-5<br />

performance improvement, OL6.9-7–6.9-10

Networks, 23–24<br />

advantages, 23<br />

b<strong>and</strong>width, 535<br />

crossbar, 535<br />

fully connected, 535<br />

local area (LANs), 24<br />

multistage, 535<br />

wide area (WANs), 24<br />

Newton’s iteration, 218<br />

Next state<br />

nonsequential, D-24<br />

sequential, D-23



Next-state function, 463, B-67<br />

defined, 463<br />

implementing, with sequencer,<br />

D-22–28<br />

Next-state outputs, D-10, D-12–13<br />

example, D-12–13<br />

implementation, D-12<br />

logic equations, D-12–13<br />

truth tables, D-15<br />

No Redundancy (RAID 0), OL5.11-4<br />

No write allocation, 394<br />

Nonblocking assignment, B-24<br />

Nonblocking caches, 344, 472<br />

Nonuniform memory access (NUMA), 518

Nonvolatile memory, 22<br />

Nops, 314<br />

nor (NOR), 64<br />

NOR gates, B-8<br />

cross-coupled, B-50<br />

D latch implemented with, B-52<br />

NOR operation, 89, A-54, E-25<br />

NOT operation, 89, A-55, B-6<br />

Numbers<br />

binary, 73<br />

computer versus real-world, 221<br />

decimal, 73, 76<br />

denormalized, 222<br />

hexadecimal, 81–82<br />

signed, 73–78<br />

unsigned, 73–78<br />

NVIDIA GeForce 8800, C-46–55<br />

all-pairs N-body algorithm, C-71<br />

dense linear algebra computations,<br />

C-51–53<br />

FFT performance, C-53<br />

instruction set, C-49<br />

performance, C-51<br />

rasterization, C-50<br />

ROP, C-50–51<br />

scalability, C-51<br />

sorting performance, C-54–55<br />

special function approximation<br />

statistics, C-43<br />

special function unit (SFU), C-50<br />

streaming multiprocessor (SM),<br />

C-48–49<br />

streaming processor, C-49–50<br />

streaming processor array (SPA), C-46<br />

texture/processor cluster (TPC),<br />

C-47–48<br />

NVIDIA GPU architecture, 523–526<br />

NVIDIA GTX 280, 548–553<br />

NVIDIA Tesla GPU, 548–553<br />

O<br />

Object files, 125, A-4<br />

debugging information, 124<br />

defined, A-10<br />

format, A-13–14<br />

header, 125, A-13<br />

linking, 126–129<br />

relocation information, 125<br />

static data segment, 125<br />

symbol table, 125, 126<br />

text segment, 125<br />

Object-oriented languages. See also Java<br />

brief history, OL2.21-8<br />

defined, 145, OL2.15-5<br />

One’s complement, 79, B-29<br />

Opcodes<br />

control line setting <strong>and</strong>, 264<br />

defined, 82, 262<br />

OpenGL, C-13<br />

OpenMP (Open MultiProcessing), 520,<br />

540<br />

Operands, 66–73. See also Instructions<br>

32-bit immediate, 112–113<br />

adding, 179<br />

arithmetic instructions, 66<br />

compiling assignment when in<br />

memory, 69<br />

constant, 72–73<br />

division, 189<br />

floating-point, 212<br />

memory, 68–69<br />

MIPS, 64<br />

multiplication, 183<br />

shifting, 148<br />

Operating systems<br />

brief history, OL5.17-9–5.17-12<br />

defined, 13<br />

encapsulation, 22<br />

in networking, OL6.9-4–6.9-5<br />

Operations<br />

atomic, implementing, 121<br />

hardware, 63–66<br />

logical, 87–89<br />

x86 integer, 152, 154–155<br />

Optimization<br />

class explanation, OL2.15-14<br />

compiler, 141<br />

control implementation, D-27–28<br />

global, OL2.15-5<br />

high-level, OL2.15-4–2.15-5<br />

local, OL2.15-5, OL2.15-8<br />

manual, 144<br />

or (OR), 64<br />

OR operation, 89, A-55, B-6<br />

ori (Or Immediate), 64<br />

Out-of-order execution<br />

defined, 341<br />

performance complexity, 416<br />

processors, 344<br />

Output devices, 16<br />

Overflow<br />

defined, 74, 198<br />

detection, 180<br />

exceptions, 329<br />

floating-point, 198<br />

occurrence, 75<br />

saturation <strong>and</strong>, 181<br />

subtraction, 179<br />

P<br />

P+Q redundancy (RAID 6), OL5.11-7<br />

Packed floating-point format, 224<br />

Page faults, 434. See also Virtual memory<br />

for data access, 450<br />

defined, 428<br />

handling, 429, 446–453<br>

virtual address causing, 449, 450<br />

Page tables, 456<br />

defined, 432<br />

illustrated, 435<br />

indexing, 432<br />

inverted, 436<br />

levels, 436–437<br />

main memory, 437<br />

register, 432<br />

storage reduction techniques, 436–437<br />

updating, 432<br />

VMM, 452<br />

Pages. See also Virtual memory<br />

defined, 428<br />

dirty, 437<br />

finding, 432–434<br />

LRU, 434<br />

offset, 429<br />

physical number, 429<br />

placing, 432–434<br>

size, 430<br />

virtual number, 429<br />

Parallel bus, OL6.9-3<br />

Parallel execution, 121<br />

Parallel memory system, C-36–41. See<br />

also Graphics processing units<br />

(GPUs)<br />

caches, C-38<br />

constant memory, C-40<br />

DRAM considerations, C-37–38<br />

global memory, C-39<br />

load/store access, C-41<br />

local memory, C-40<br />

memory spaces, C-39<br />

MMU, C-38–39<br />

ROP, C-41<br />

shared memory, C-39–40<br />

surfaces, C-41<br />

texture memory, C-40<br />

Parallel processing programs, 502–507<br />

creation difficulty, 502–507<br />

defined, 501<br />

for message passing, 519–520<br />

great debates in, OL6.15-5<br />

for shared address space, 519–520<br />

use of, 559<br />

Parallel reduction, C-62<br />

Parallel scan, C-60–63<br />

CUDA template, C-61<br />

inclusive, C-60<br />

tree-based, C-62<br />

Parallel software, 501<br />

Parallelism, 12, 43, 332–344<br />

<strong>and</strong> computer arithmetic, 222–223<br>

data-level, 233, 508<br />

debates, OL6.15-5–6.15-7<br />

GPUs <strong>and</strong>, 523, C-76<br />

instruction-level, 43, 332, 343<br />

memory hierarchies <strong>and</strong>, 466–470,<br />

OL5.11-2<br />

multicore <strong>and</strong>, 517<br />

multiple issue, 332–339<br />

multithreading <strong>and</strong>, 517<br />

performance benefits, 44–45<br />

process-level, 500<br />

redundant arrays of inexpensive<br>

disks, 470<br />

subword, E-17<br />

task, C-24<br />

task-level, 500<br />

thread, C-22<br />

Paravirtualization, 482<br />

PA-RISC, E-14, E-17<br />

branch vectored, E-35<br />

conditional branches, E-34, E-35<br />

debug instructions, E-36<br />

decimal operations, E-35<br />

extract <strong>and</strong> deposit, E-35<br />

instructions, E-34–36<br />

load <strong>and</strong> clear instructions, E-36<br />

multiply/add <strong>and</strong> multiply/subtract,<br />

E-36<br />

nullification, E-34<br />

nullifying branch option, E-25<br />

store bytes short, E-36<br />

synthesized multiply <strong>and</strong> divide,<br />

E-34–35<br />

Parity, OL5.11-5<br />

bits, 421<br />

code, 420, B-65<br />

PARSEC (Princeton Application<br />

Repository for Shared Memory<br />

Computers), 540<br>

Pass transistor, B-63<br />

PCI-Express (PCIe), 537, C-8, OL6.9-2<br />

PC-relative addressing, 114, 116<br />

Peak floating-point performance, 542<br />

Pentium bug morality play, 231–232<br />

Performance, 28–36<br />

assessing, 28<br />

classic CPU equation, 36–40<br />

components, 38<br />

CPU, 33–35<br />

defining, 29–32<br />

equation, using, 36<br />

improving, 34–35<br />

instruction, 35–36<br />

measuring, 33–35, OL1.12-10<br />

program, 39–40<br />

ratio, 31<br />

relative, 31–32<br />

response time, 30–31<br />

sorting, C-54–55<br />

throughput, 30–31<br />

time measurement, 32<br />

Personal computers (PCs), 7<br />

defined, 5<br />

Personal mobile device (PMD)<br />

defined, 7<br />

Petabyte, 6<br />

Physical addresses, 428<br />

mapping to, 428–429<br />

space, 517, 521<br />

Physically addressed caches, 443<br />

Pipeline registers<br />

before forwarding, 309<br />

dependences, 308<br />

forwarding unit selection, 312<br />

Pipeline stalls, 280<br />

avoiding with code reordering, 280<br />

data hazards <strong>and</strong>, 313–316<br />

insertion, 315<br />

load-use, 318<br />

as solution to control hazards, 282<br />

Pipelined branches, 319<br />

Pipelined control, 300–303. See also<br />

Control<br />

control lines, 300, 303<br />

overview illustration, 316<br />

specifying, 300<br />

Pipelined datapaths, 286–303<br />

with connected control signals, 304<br />

with control signals, 300–303<br />

corrected, 296<br />

illustrated, 289<br />

in load instruction stages, 296<br />

Pipelined dependencies, 305<br />

Pipelines<br />

branch instruction impact, 317<br />

effectiveness, improving, OL4.16-4–4.16-5<br>

execute <strong>and</strong> address calculation stage,<br />

290, 292<br />

five-stage, 274, 290, 299<br />

graphic representation, 279, 296–300<br />

instruction decode <strong>and</strong> register file<br />

read stage, 289, 292<br />

instruction fetch stage, 290, 292<br />

instructions sequence, 313<br />

latency, 286<br />

memory access stage, 290, 292<br />

multiple-clock-cycle diagrams,<br />

296–297<br />

performance bottlenecks, 343<br />

single-clock-cycle diagrams, 296–297<br />

stages, 274<br />

static two-issue, 335<br />

write-back stage, 290, 294<br />

Pipelining, 12, 272–286<br />

advanced, 343–344<br />

benefits, 272<br />

control hazards, 281–282<br />

data hazards, 278<br>

exceptions <strong>and</strong>, 327–332<br />

execution time <strong>and</strong>, 286<br />

fallacies, 355–356<br />

hazards, 277–278<br />

instruction set design for, 277<br />

laundry analogy, 273<br />

overview, 272–286<br />

paradox, 273<br />

performance improvement, 277<br />

pitfall, 355–356<br />

simultaneous executing instructions,<br />

286<br />

speed-up formula, 273<br />

structural hazards, 277, 294<br />

summary, 285<br />

throughput <strong>and</strong>, 286<br />

Pitfalls. See also Fallacies<br />

address space extension, 479<br />

arithmetic, 229–232<br />

associativity, 479<br />

defined, 49<br />

GPUs, C-74–75<br />

ignoring memory system behavior, 478<br />

memory hierarchies, 478–482<br />

out-of-order processor evaluation, 479<br />

performance equation subset, 50–51<br />

pipelining, 355–356<br />

pointer to automatic variables, 160<br />

sequential word addresses, 160<br />

simulating cache, 478<br />

software development with<br />

multiprocessors, 556<br />

VMM implementation, 481, 481–482<br />

Pixel shader example, C-15–17<br />

Pixels, 18<br />

Pointers<br />

arrays versus, 141–145<br />

frame, 103<br />

global, 102<br />

incrementing, 143<br />

Java, OL2.15-26<br />

stack, 98, 102<br />

Polling, OL6.9-8<br />

Pop, 98<br />

Power<br />

clock rate <strong>and</strong>, 40<br />

critical nature of, 53<br />

efficiency, 343–344<br />

relative, 41<br />

PowerPC<br />

algebraic right shift, E-33<br />

branch registers, E-32–33<br />

condition codes, E-12<br />

instructions, E-12–13<br />

instructions unique to, E-31–33<br />

load multiple/store multiple, E-33<br />

logical shifted immediate, E-33<br />

rotate with mask, E-33<br />

Precise interrupts, 332<br />

Prediction, 12<br />

2-bit scheme, 322<br />

accuracy, 321, 324<br />

dynamic branch, 321–323<br />

loops <strong>and</strong>, 321–323<br />

steady-state, 321<br />

Prefetching, 482, 544<br />

Primitive types, OL2.15-26<br />

Procedure calls<br />

convention, A-22–33<br />

examples, A-27–33<br />

frame, A-23<br />

preservation across, 102<br />

Procedures, 96–106<br />

compiling, 98<br />

compiling, showing nested procedure<br />

linking, 101–102<br />

execution steps, 96<br />

frames, 103<br />

leaf, 100<br />

nested, 100–102<br />

recursive, 105, A-26–27<br />

for setting arrays to zero, 142<br />

sort, 135–139<br />

strcpy, 108–109<br />

string copy, 108–109<br />

swap, 133<br />

Process identifiers, 446<br />

Process-level parallelism, 500<br />

Processors, 242–356<br />

as cores, 43<br />

control, 19<br />

datapath, 19<br />

defined, 17, 19<br />

dynamic multiple-issue, 333<br />

multiple-issue, 333<br />

out-of-order execution, 344, 416<br />

performance growth, 44<br />

ROP, C-12, C-41<br />

speculation, 333–334<br />

static multiple-issue, 333, 334–339<br />

streaming, C-34<br />

superscalar, 339, 515–516, OL4.16-5<br />

technologies for building, 24–28<br />

two-issue, 336–337<br />

vector, 508–510<br />

VLIW, 335<br />

Product, 183<br />

Product of sums, B-11<br />

Program counters (PCs), 251<br />

changing with conditional branch, 324<br />

defined, 98, 251<br />

exception, 445, 447<br />

incrementing, 251, 253<br />

instruction updates, 289<br />

Program libraries, A-4<br />

Program performance<br />

elements affecting, 39<br />

understanding, 9<br>

Programmable array logic (PAL), B-78<br />

Programmable logic arrays (PLAs)<br />

component dots illustration, B-16<br />

control function implementation, D-7,<br />

D-20–21<br />

defined, B-12<br />

example, B-13–14<br />

illustrated, B-13<br />

ROMs <strong>and</strong>, B-15–16<br />

size, D-20<br />

truth table implementation, B-13<br />

Programmable logic devices (PLDs), B-78<br />

Programmable ROMs (PROMs), B-14<br />

Programming languages. See also specific<br />

languages<br />

brief history of, OL2.21-7–2.21-8<br />

object-oriented, 145<br />

variables, 67<br />

Programs<br />

assembly language, 123<br />

Java, starting, 131–132<br />

parallel processing, 502–507<br />

starting, 123–132<br />

translating, 123–132<br />

Propagate<br />

defined, B-40<br />

example, B-44<br />

super, B-41<br />

Protected keywords, OL2.15-21<br />

Protection<br />

defined, 428<br />

implementing, 444–446<br />

mechanisms, OL5.17-9<br />

VMs for, 424<br />

Protection group, OL5.11-5<br />

Pseudo MIPS<br />

defined, 233<br>

instruction set, 235<br />

Pseudodirect addressing, 116<br />

Pseudoinstructions<br />

defined, 124<br />

summary, 125<br />

Pthreads (POSIX threads), 540<br />

PTX instructions, C-31, C-32<br />

Public keywords, OL2.15-21<br />

Push<br />

defined, 98<br />

using, 100<br />

Q<br />

Quad words, 154<br />

Quicksort, 411, 412<br />

Quotient, 189<br />

R<br />

Race, B-73<br />

Radix sort, 411, 412, C-63–65<br />

CUDA code, C-64<br />

implementation, C-63–65<br />

RAID. See Redundant arrays of<br>

inexpensive disks (RAID)<br />

RAM, 9<br />

Raster operation (ROP) processors, C-12,<br />

C-41, C-50–51<br />

fixed function, C-41<br />

Raster refresh buffer, 18<br />

Rasterization, C-50<br />

Ray casting (RC), 552<br />

Read-only memories (ROMs), B-14–16<br />

control entries, D-16–17<br />

control function encoding, D-18–19<br />

dispatch, D-25<br />

implementation, D-15–19<br />

logic function encoding, B-15<br />

overhead, D-18<br />

PLAs <strong>and</strong>, B-15–16<br />

programmable (PROM), B-14<br />

total size, D-16<br />

Read-stall cycles, 399<br />

Read-write head, 381<br />

Receive message routine, 529<br />

Receiver Control register, A-39<br />

Receiver Data register, A-38, A-39<br />

Recursive procedures, 105, A-26–27. See<br />

also Procedures<br />

clone invocation, 100<br />

stack in, A-29–30<br />

Reduced instruction set computer (RISC)<br />

architectures, E-2–45, OL2.21-5,<br />

OL4.16-4. See also Desktop <strong>and</strong><br />

server RISCs; Embedded RISCs<br />

group types, E-3–4<br />

instruction set lineage, E-44<br />

Reduction, 519<br />

Redundant arrays of inexpensive disks<br />

(RAID), OL5.11-2–5.11-8<br />

history, OL5.11-8<br />

RAID 0, OL5.11-4<br />

RAID 1, OL5.11-5<br />

RAID 2, OL5.11-5<br />

RAID 3, OL5.11-5<br />

RAID 4, OL5.11-5–5.11-6<br />

RAID 5, OL5.11-6–5.11-7<br />

RAID 6, OL5.11-7<br />

spread of, OL5.11-6<br />

summary, OL5.11-7–5.11-8<br />

use statistics, OL5.11-7<br />

Reference bit, 435<br />

References<br />

absolute, 126<br />

forward, A-11<br />

types, OL2.15-26<br />

unresolved, A-4, A-18<br />

Register addressing, 116<br />

Register allocation, OL2.15-11–2.15-13<br />

Register files, B-50, B-54–56<br />

defined, 252, B-50, B-54<br />

in behavioral Verilog, B-57<br />

single, 257<br />

two read ports implementation, B-55<br />

with two read ports/one write port,<br />

B-55<br />

write port implementation, B-56<br />

Register-memory architecture, OL2.21-3<br />

Registers, 152, 153–154<br />

architectural, 325–332<br />

base, 69<br />

callee-saved, A-23<br />

caller-saved, A-23<br />

Cause, A-35<br />

clock cycle time <strong>and</strong>, 67<br />

compiling C assignment with, 67–68<br />

Count, A-34<br />

defined, 66<br />

destination, 83, 262<br />

floating-point, 217<br />

left half, 290<br />

mapping, 80<br />

MIPS conventions, 105<br />

number specification, 252<br />

page table, 432<br />

pipeline, 308, 309, 312<br />

primitives, 66<br />

Receiver Control, A-39<br />

Receiver Data, A-38, A-39<br />

renaming, 338<br />

right half, 290<br />

spilling, 71<br />

Status, 327, A-35<br />

temporary, 67, 99<br />

Transmitter Control, A-39–40<br />

Transmitter Data, A-40<br />

usage convention, A-24<br />

use convention, A-22<br />

variables, 67<br />

Relative performance, 31–32<br />

Relative power, 41<br />

Reliability, 418<br />

Relocation information, A-13, A-14<br />

Remainder<br />

defined, 189<br />

instructions, A-55<br />

Reorder buffers, 343<br />

Replication, 468<br />

Requested word first, 392<br />

Request-level parallelism, 532<br />

Reservation stations<br />

buffering operands in, 340–341<br>

defined, 339–340<br />

Response time, 30–31<br />

Restartable instructions, 448<br />

Return address, 97<br />

Return from exception (ERET), 445<br />

R-format, 262<br />

ALU operations, 253<br />

defined, 83<br />

Ripple carry<br />

adder, B-29<br />

carry lookahead speed versus, B-46<br />

Roofline model, 542–543, 544, 545<br />

with ceilings, 546, 547<br />

computational roofline, 545<br />

illustrated, 542<br />

Opteron generations, 543, 544<br />

with overlapping areas shaded, 547<br />

peak floating-point performance,<br />

542<br />

peak memory performance, 543<br />

with two kernels, 547<br />

Rotational delay. See Rotational latency<br>

Rotational latency, 383<br>

Rounding, 218<br />

accurate, 218<br />

bits, 220<br />

with guard digits, 219<br />

IEEE 754 modes, 219<br />

Row-major order, 217, 413<br />

R-type instructions, 252<br />

datapath for, 264–265<br />

datapath in operation for, 266<br />

S<br />

Saturation, 181<br />

sb (Store Byte), 64<br />

sc (Store Conditional), 64<br />

ScaLAPACK, 230<br>

Scaling<br />

strong, 505, 507<br />

weak, 505<br />

Scientific notation<br />

adding numbers in, 203<br />

defined, 196<br />

for reals, 197<br />

Search engines, 4<br />

Secondary memory, 23<br />

Sectors, 381<br />

Seek, 382<br />

Segmentation, 431<br />

Selector values, B-10<br />

Semiconductors, 25<br />

Send message routine, 529<br />

Sensitivity list, B-24<br />

Sequencers<br />

explicit, D-32<br />

implementing next-state function with,<br />

D-22–28<br />

Sequential logic, B-5<br />

Servers, OL5. See also Desktop <strong>and</strong> server<br />

RISCs<br />

cost <strong>and</strong> capability, 5<br />

Service accomplishment, 418<br />

Service interruption, 418<br />

Set instructions, 93<br />

Set-associative caches, 403. See also<br />

Caches<br />

address portions, 407<br />

block replacement strategies, 457<br />

choice of, 456<br />

four-way, 404, 407<br />

memory-block location, 403<br />

misses, 405–406<br />

n-way, 403<br />

two-way, 404<br />

Setup time, B-53, B-54<br />

sh (Store Halfword), 64<br />

Shaders<br />

defined, C-14<br />

floating-point arithmetic, C-14<br />

graphics, C-14–15<br />

pixel example, C-15–17<br />

Shading languages, C-14<br />

Shadowing, OL5.11-5<br />

Shared memory. See also Memory<br />

as low-latency memory, C-21<br />

caching in, C-58–60<br />

CUDA, C-58<br />

N-body <strong>and</strong>, C-67–68<br />

per-CTA, C-39<br />

SRAM banks, C-40<br />

Shared memory multiprocessors (SMP),<br />

517–521<br />

defined, 501, 517<br />

single physical address space, 517<br />

synchronization, 518<br />

Shift amount, 82<br />

Shift instructions, 87, A-55–56<br />

Sign <strong>and</strong> magnitude, 197<br />

Sign bit, 76<br />

Sign extension, 254<br />

defined, 76<br />

shortcut, 78<br />

Signals<br />

asserted, 250, B-4<br />

control, 250, 263–264<br />

deasserted, 250, B-4<br />

Signed division, 192–194<br />

Signed multiplication, 187<br />

Signed numbers, 73–78<br />

sign <strong>and</strong> magnitude, 75<br />

treating as unsigned, 94–95<br />

Signific<strong>and</strong>s, 198<br />

addition, 203<br />

multiplication, 206<br />

Silicon, 25<br />

as key hardware technology, 53<br />

crystal ingot, 26<br />

defined, 26<br />

wafers, 26<br />

Silicon crystal ingot, 26<br />

SIMD (Single Instruction Multiple Data),<br />

507–508, 558<br />

computers, OL6.15-2–6.15-4<br />

data vector, C-35<br />

extensions, OL6.15-4<br />

for loops <strong>and</strong>, OL6.15-3<br />

massively parallel multiprocessors,<br />

OL6.15-2<br />

small-scale, OL6.15-4<br />

vector architecture, 508–510<br />

in x86, 508<br />

SIMMs (single inline memory modules),<br />

OL5.17-5, OL5.17-6<br />

Simple programmable logic devices<br />

(SPLDs), B-78<br />

Simplicity, 161<br />

Simultaneous multithreading (SMT),<br />

515–517<br />

support, 515<br />

thread-level parallelism, 517<br />

unused issue slots, 515<br />

Single error correcting/Double error<br />

detecting (SEC/DED), 420–422<br>

Single instruction single data (SISD), 507<br />

Single precision. See also Double<br />

precision<br />

binary representation, 201<br />

defined, 198<br />

Single-clock-cycle pipeline diagrams,<br />

296–297<br />

illustrated, 299<br />

Single-cycle datapaths. See also Datapaths<br />

illustrated, 287<br />

instruction execution, 288<br />

Single-cycle implementation<br />

control function for, 269<br />

defined, 270<br />

nonpipelined execution versus<br />

pipelined execution, 276<br />

non-use of, 271–272<br />

penalty, 271–272<br />

pipelined performance versus, 274<br />

Single-instruction multiple-thread<br />

(SIMT), C-27–30<br />

overhead, C-35<br />

multithreaded warp scheduling, C-28<br />

processor architecture, C-28<br />

warp execution <strong>and</strong> divergence,<br />

C-29–30<br />

Single-program multiple data (SPMD),<br />

C-22<br />

sll (Shift Left Logical), 64<br />

slt (Set Less Than), 64<br />

slti (Set Less Than Imm.), 64<br>

sltiu (Set Less Than Imm. Unsigned), 64<br>

sltu (Set Less Than Unsig.), 64<br />

Smalltalk-80, OL2.21-8<br />

Smart phones, 7<br />

Snooping protocol, 468–470<br />

Snoopy cache coherence, OL5.12-7<br />

Software<br>

layers, 13<br>

multiprocessor, 500<br>

parallel, 501<br>

as service, 7, 532, 558<br>

systems, 13<br>

Software optimization<br>

via blocking, 413–418<br>

Sort algorithms, 141<br>

Sort procedure, 135–139. See also<br />

Procedures<br />

code for body, 135–137<br />

full procedure, 138–139<br />

passing parameters in, 138<br />

preserving registers in, 138<br />

procedure call, 137<br />

register allocation for, 135<br />

Sorting performance, C-54–55<br />

Source files, A-4<br />

Source language, A-6<br />

Space allocation<br />

on heap, 104–106<br />

on stack, 103<br />

SPARC<br />

annulling branch, E-23<br />

CASA, E-31<br />

conditional branches, E-10–12<br />

fast traps, E-30<br />

floating-point operations, E-31<br />

instructions, E-29–32<br />

least significant bits, E-31<br />

multiple precision floating-point<br />

results, E-32<br />

nonfaulting loads, E-32<br />

overlapping integer operations, E-31<br />

quadruple precision floating-point<br />

arithmetic, E-32<br />

register windows, E-29–30<br />

support for LISP <strong>and</strong> Smalltalk, E-30<br />

Sparse matrices, C-55–58<br />

Sparse Matrix-Vector multiply (SpMV),<br />

C-55, C-57, C-58<br />

CUDA version, C-57<br />

serial code, C-57<br />

shared memory version, C-59<br />

Spatial locality, 374<br />

large block exploitation of, 391<br />

tendency, 378<br />

SPEC, OL1.12-11–1.12-12<br />

CPU benchmark, 46–48<br />

power benchmark, 48–49<br />

SPEC2000, OL1.12-12<br />

SPEC2006, 233, OL1.12-12<br />

SPEC89, OL1.12-11<br />

SPEC92, OL1.12-12<br />

SPEC95, OL1.12-12<br />

SPECrate, 538–539<br />

SPECratio, 47<br />

Special function units (SFUs), C-35, C-50<br />

defined, C-43<br />

Speculation, 333–334<br />

hardware-based, 341<br />

implementation, 334<br />

performance <strong>and</strong>, 334<br />

problems, 334<br />

recovery mechanism, 334<br />

Speed-up challenge, 503–505<br />

balancing load, 505–506<br />

bigger problem, 504–505<br />

Spilling registers, 71, 98<br />

SPIM, A-40–45<br />

byte order, A-43<br />

features, A-42–43<br />

getting started with, A-42<br />

MIPS assembler directives support,<br />

A-47–49<br />

speed, A-41<br />

system calls, A-43–45<br />

versions, A-42<br />

virtual machine simulation, A-41–42<br />

Split algorithm, 552<br />

Split caches, 397<br />

Square root instructions, A-79<br />

sra (Shift Right Arith.), A-56<br />

srl (Shift Right Logical), 64<br />

Stack architectures, OL2.21-4<br />

Stack pointers<br />

adjustment, 100<br />

defined, 98<br />

values, 100<br />

Stack segment, A-22<br />

Stacks<br />

allocating space on, 103<br />

for arguments, 140<br />

defined, 98<br />

pop, 98<br />

push, 98, 100<br />

recursive procedures, A-29–30<br />

Stalls, 280<br />

as solution to control hazard, 282<br />

avoiding with code reordering, 280<br />

behavioral Verilog with detection,<br />

OL4.13-6–4.13-8<br />

data hazards <strong>and</strong>, 313–316<br />

illustrations, OL4.13-23, OL4.13-30<br />

insertion into pipeline, 315<br />

load-use, 318<br />

memory, 400<br />

write-back scheme, 399<br />

write buffer, 399<br />

Standby spares, OL5.11-8<br>

State<br />

in 2-bit prediction scheme, 322<br />

assignment, B-70, D-27<br />

bits, D-8<br />

exception, saving/restoring, 450<br />

logic components, 249<br />

specification of, 432<br />

State elements<br />

clock <strong>and</strong>, 250<br />

combinational logic <strong>and</strong>, 250<br />

defined, 248, B-48<br />

inputs, 249<br />

in storing/accessing instructions,<br />

252<br />

register file, B-50<br />

Static branch prediction, 335<br />

Static data<br />

as dynamic data, A-21<br />

defined, A-20<br />

segment, 104<br />

Static multiple-issue processors, 333,<br />

334–339. See also Multiple issue<br />

control hazards <strong>and</strong>, 335–336<br />

instruction sets, 335<br />

with MIPS ISA, 335–338<br />

Static r<strong>and</strong>om access memories (SRAMs),<br />

378, 379, B-58–62<br />

array organization, B-62<br />

basic structure, B-61<br />

defined, 21, B-58<br />

fixed access time, B-58<br />

large, B-59<br />

read/write initiation, B-59<br />

synchronous (SSRAMs), B-60<br />

three-state buffers, B-59, B-60<br />

Static variables, 102<br>

Status register<br />

fields, A-34, A-35<br />

Steady-state prediction, 321<br />

Sticky bits, 220<br />

Store buffers, 343<br />

Store instructions. See also Load<br />

instructions<br />

access, C-41<br />

base register, 262<br />

block, 149<br />

compiling with, 71<br />

conditional, 122<br />

defined, 71<br />

details, A-68–70<br />

EX stage, 294<br />

floating-point, A-79<br />

ID stage, 291<br />

IF stage, 291<br />

instruction dependency, 312<br />

list of, A-68–70<br />

MEM stage, 295<br />

unit for implementing, 255<br />

WB stage, 295<br />

Store word, 71<br />

Stored program concept, 63<br />

as computer principle, 86<br />

illustrated, 86<br />

principles, 161<br />

Strcpy procedure, 108–109. See also<br />

Procedures<br />

as leaf procedure, 109<br />

pointers, 109<br />

Stream benchmark, 548<br />

Streaming multiprocessor (SM), C-48–49<br />

Streaming processors, C-34, C-49–50<br />

array (SPA), C-41, C-46<br />

Streaming SIMD Extension 2 (SSE2)<br />

floating-point architecture, 224<br />

Streaming SIMD Extensions (SSE) <strong>and</strong><br />

advanced vector extensions in x86,<br />

224–225<br />

Stretch computer, OL4.16-2<br />

Strings<br />

defined, 107<br />

in Java, 109–111<br />

representation, 107<br />

Strip mining, 510<br />

Striping, OL5.11-4<br />

Strong scaling, 505, 517<br />

Structural hazards, 277, 294<br />

sub (Subtract), 64<br />

sub.d (FP Subtract Double), A-79<br />

sub.s (FP Subtract Single), A-80<br />

Subnormals, 222<br />

Subtraction, 178–182. See also Arithmetic<br />

binary, 178–179<br />

floating-point, 211, A-79–80<br />

instructions, A-56–57<br />

negative number, 179<br />

overflow, 179<br />

subu (Subtract Unsigned), 119<br />

Subword parallelism, 222–223, 352, E-17<br />

<strong>and</strong> matrix multiply, 225–228<br />

Sum of products, B-11, B-12<br />

Supercomputers, OL4.16-3<br />

defined, 5<br />

SuperH, E-15, E-39–40<br />

Superscalars<br />

defined, 339, OL4.16-5<br />

dynamic pipeline scheduling, 339<br />

multithreading options, 516<br />

Surfaces, C-41<br />

sw (Store Word), 64<br />

Swap procedure, 133. See also Procedures<br />

body code, 135<br />

full, 135, 138–139<br />

register allocation, 133<br />

Swap space, 434<br />

swc1 (Store FP Single), A-73<br />

Symbol tables, 125, A-12, A-13<br />

Synchronization, 121–123, 552<br />

barrier, C-18, C-20, C-34<br />

defined, 518<br />

lock, 121<br />

overhead, reducing, 44–45<br />

unlock, 121<br />

Synchronizers<br />

defined, B-76<br />

failure, B-77<br />

from D flip-flop, B-76<br />

Synchronous DRAM (SDRAM), 379–380,<br>

B-60, B-65<br />

Synchronous SRAM (SSRAM), B-60<br />

Synchronous system, B-48<br />

Syntax tree, OL2.15-3<br />

System calls, A-43–45<br />

code, A-43–44<br />

defined, 445<br />

loading, A-43<br />

Systems software, 13<br />

SystemVerilog<br />

cache controller, OL5.12-2<br />

cache data <strong>and</strong> tag modules, OL5.12-6<br>

FSM, OL5.12-7<br>

simple cache block diagram, OL5.12-4<br>

type declarations, OL5.12-2<br>

T<br>

Tablets, 7<br>

Tags<br />

defined, 384<br />

in locating block, 407<br />

page tables <strong>and</strong>, 434<br />

size of, 409<br />

Tail call, 105–106<br />

Task identifiers, 446<br />

Task parallelism, C-24<br />

Task-level parallelism, 500<br />

Tebibyte (TiB), 5<br />

Tesla PTX ISA, C-31–34<br>

arithmetic instructions, C-33<br />

barrier synchronization, C-34<br />

GPU thread instructions, C-32<br />

memory access instructions, C-33–34<br />

Temporal locality, 374<br />

tendency, 378<br />

Temporary registers, 67, 99<br />

Terabyte (TB), 6<br>

defined, 5<br />

Text segment, A-13<br />

Texture memory, C-40<br />

Texture/processor cluster (TPC),<br />

C-47–48<br />

TFLOPS multiprocessor, OL6.15-6<br />

Thrashing, 453<br />

Thread blocks, 528<br />

creation, C-23<br />

defined, C-19<br />

managing, C-30<br />

memory sharing, C-20<br />

synchronization, C-20<br />

Thread parallelism, C-22<br />

Threads<br />

creation, C-23<br />

CUDA, C-36<br />

ISA, C-31–34<br />

managing, C-30<br />

memory latencies <strong>and</strong>, C-74–75<br />

multiple, per body, C-68–69<br />

warps, C-27<br />

Three Cs model, 459–461<br />

Three-state buffers, B-59, B-60<br>

Throughput<br />

defined, 30–31<br />

multiple issue <strong>and</strong>, 342<br />

pipelining <strong>and</strong>, 286, 342<br />

Thumb, E-15, E-38<br />

Timing<br />

asynchronous inputs, B-76–77<br />

level-sensitive, B-75–76<br />

methodologies, B-72–77<br />

two-phase, B-75<br />

TLB misses, 439. See also Translation-lookaside buffer (TLB)<br>

entry point, 449<br />

handler, 449<br>

handling, 446–453<br>

occurrence, 446<br />

problem, 453<br />

Tomasulo’s algorithm, OL4.16-3<br />

Touchscreen, 19<br />

Tournament branch predictors, 324<br>

Tracks, 381–382<br />

Transfer time, 383<br />

Transistors, 25<br />

Translation-lookaside buffer (TLB),<br />

438–439, E-26–27, OL5.17-6. See<br />

also TLB misses<br />

associativities, 439<br />

illustrated, 438<br />

integration, 440–441<br />

Intrinsity FastMATH, 440<br />

typical values, 439<br />

Transmit driver <strong>and</strong> NIC hardware time<br />

versus receive driver <strong>and</strong> NIC hardware<br>

time, OL6.9-8<br />

Transmitter Control register, A-39–40<br />

Transmitter Data register, A-40<br />

Trap instructions, A-64–66<br />

Tree-based parallel scan, C-62<br />

Truth tables, B-5<br />

ALU control lines, D-5<br />

for control bits, 260–261<br />

datapath control outputs, D-17<br />

datapath control signals, D-14<br />

defined, 260<br />

example, B-5<br />

next-state output bits, D-15<br />

PLA implementation, B-13<br />

Two’s complement representation, 75–76<br />

advantage, 75–76<br />

negation shortcut, 76<br />

rule, 79<br />

sign extension shortcut, 78<br />

Two-level logic, B-11–14<br />

Two-phase clocking, B-75<br />

TX-2 computer, OL6.15-4<br />

U<br />

Unconditional branches, 91<br />

Underflow, 198<br />

Unicode<br />

alphabets, 109<br />

defined, 110<br />

example alphabets, 110<br />

Unified GPU architecture, C-10–12<br />

illustrated, C-11<br />

processor array, C-11–12<br />

Uniform memory access (UMA), 518,<br />

C-9<br />

multiprocessors, 519<br />

Units<br />

commit, 339–340, 343<br />

control, 247–248, 259–261, D-4–8,<br />

D-10, D-12–13<br />

defined, 219<br />

floating point, 219<br />

hazard detection, 313, 314–315<br />

for load/store implementation, 255<br />

special function (SFUs), C-35, C-43,<br />

C-50<br />

UNIVAC I, OL1.12-5<br />

UNIX, OL2.21-8, OL5.17-9–5.17-12<br />

AT&T, OL5.17-10<br />

Berkeley version (BSD), OL5.17-10<br />

genius, OL5.17-12<br />

history, OL5.17-9–5.17-12<br />

Unlock synchronization, 121<br />

Unresolved references<br />

defined, A-4<br />

linkers <strong>and</strong>, A-18<br />

Unsigned numbers, 73–78<br />

Use latency<br />

defined, 336–337<br />

one-instruction, 336–337<br />

V<br />

Vacuum tubes, 25<br />

Valid bit, 386<br />

Variables<br />

C language, 102<br />

programming language, 67<br />

register, 67<br />

static, 102<br />

storage class, 102<br />

type, 102<br />

VAX architecture, OL2.21-4, OL5.17-7<br />

Vector lanes, 512<br />

Vector processors, 508–510. See also<br />

Processors<br />

conventional code comparison,<br />

509–510<br />

instructions, 510<br />

multimedia extensions <strong>and</strong>, 511–512<br />

scalar versus, 510–511<br />

Vectored interrupts, 327<br />

Verilog<br />

behavioral definition of MIPS ALU,<br />

B-25<br />

behavioral definition with bypassing,<br />

OL4.13-4–4.13-6<br />

behavioral definition with stalls for<br />

loads, OL4.13-6–4.13-8<br />

behavioral specification, B-21, OL4.13-2–4.13-4<br>

behavioral specification of multicycle<br />

MIPS design, OL4.13-12–4.13-13<br />

behavioral specification with<br />

simulation, OL4.13-2<br />

behavioral specification with stall<br />

detection, OL4.13-6–4.13-8<br />

behavioral specification with synthesis,<br />

OL4.13-11–4.13-16<br />

blocking assignment, B-24<br />

branch hazard logic implementation,<br />

OL4.13-8–4.13-10<br />

combinational logic, B-23–26<br />

datatypes, B-21–22<br />

defined, B-20<br />

forwarding implementation,<br />

OL4.13-4<br />

MIPS ALU definition in, B-35–38<br />

modules, B-23<br />

multicycle MIPS datapath, OL4.13-14<br />

nonblocking assignment, B-24<br />

operators, B-22<br />

program structure, B-23<br />

reg, B-21–22<br />

sensitivity list, B-24<br />

sequential logic specification, B-56–58<br />

structural specification, B-21<br />

wire, B-21–22<br />

Vertical microcode, D-32<br>

Very large-scale integrated (VLSI)<br />

circuits, 25<br />

Very Long Instruction Word (VLIW)<br />

defined, 334–335<br />

first generation computers, OL4.16-5<br />

processors, 335<br />

VHDL, B-20–21<br />

Video graphics array (VGA) controllers,<br />

C-3–4<br />

Virtual addresses<br />

causing page faults, 449<br />

defined, 428<br />

mapping from, 428–429<br />

size, 430<br />

Virtual machine monitors (VMMs)<br />

defined, 424<br />

implementing, 481, 481–482<br />

laissez-faire attitude, 481<br />

page tables, 452<br />

in performance improvement, 427<br />

requirements, 426<br />

Virtual machines (VMs), 424–427<br />

benefits, 424<br />

defined, A-41<br />

illusion, 452<br />

instruction set architecture support,<br />

426–427<br />

performance improvement, 427<br />

for protection improvement, 424<br />

simulation of, A-41–42<br />

Virtual memory, 427–454. See also Pages<br />

address translation, 429, 438–439<br />

integration, 440–441<br />

mechanism, 452–453<br />

motivations, 427–428<br />

page faults, 428, 434<br />

protection implementation,<br />

444–446<br />

segmentation, 431<br />

summary, 452–453<br />

virtualization of, 452<br />

writes, 437<br />

Virtualizable hardware, 426<br />

Virtually addressed caches, 443<br />

Visual computing, C-3<br />

Volatile memory, 22<br />

W<br />

Wafers, 26<br />

defects, 26<br />

dies, 26–27<br />

yield, 27<br />

Warehouse Scale Computers (WSCs), 7,<br>

531–533, 558<br />

Warps, 528, C-27<br />

Weak scaling, 505<br />

Wear levelling, 381<br />

While loops, 92–93<br />

Whirlwind, OL5.17-2<br />

Wide area networks (WANs), 24. See also<br />

Networks<br />

Words<br />

accessing, 68<br />

defined, 66<br />

double, 152<br />

load, 68, 71<br />

quad, 154<br />

store, 71<br />

Working set, 453<br />

World Wide Web, 4<br />

Worst-case delay, 272<br />

Write buffers<br />

defined, 394<br />

stalls, 399<br />

write-back cache, 395<br />

Write invalidate protocols, 468, 469<br />

Write serialization, 467<br />

Write-back caches. See also Caches<br />

advantages, 458<br />

cache coherency protocol, OL5.12-5<br />

complexity, 395<br />

defined, 394, 458<br />

stalls, 399<br />

write buffers, 395<br />

Write-back stage<br />

control line, 302<br />

load instruction, 292<br />

store instruction, 294<br />

Writes<br />

complications, 394<br />

expense, 453<br />

handling, 393–395<br>

memory hierarchy handling of,<br>

457–458<br />

schemes, 394<br />

virtual memory, 437<br />

write-back cache, 394, 395<br />

write-through cache, 394, 395<br />

Write-stall cycles, 400<br />

Write-through caches. See also Caches<br />

advantages, 458<br />

defined, 393, 457<br />

tag mismatch, 394<br />

X<br />

x86, 149–158<br />

Advanced Vector Extensions in, 225<br />

brief history, OL2.21-6<br />

conclusion, 156–158<br />

data addressing modes, 152, 153–154<br />

evolution, 149–152<br />

first address specifier encoding, 158<br />

historical timeline, 149–152<br />

instruction encoding, 155–156<br />

instruction formats, 157<br />

instruction set growth, 161<br />

instruction types, 153<br />

integer operations, 152–155<br />

registers, 152, 153–154<br />

SIMD in, 507–508, 508<br />

Streaming SIMD Extensions in,<br />

224–225<br />

typical instructions/functions, 155<br />

typical operations, 157<br />

Xerox Alto computer, OL1.12-8<br />

XMM, 224<br />

Y<br />

Yahoo! Cloud Serving Benchmark<br />

(YCSB), 540<br />

Yield, 27<br />

YMM, 225<br />

Z<br />

Zettabyte, 6<br>