
In Praise of Computer Organization and Design: The Hardware/Software Interface, Fifth Edition

"Textbook selection is often a frustrating act of compromise—pedagogy, content coverage, quality of exposition, level of rigor, cost. Computer Organization and Design is the rare book that hits all the right notes across the board, without compromise. It is not only the premier computer organization textbook, it is a shining example of what all computer science textbooks could and should be."
—Michael Goldweber, Xavier University

"I have been using Computer Organization and Design for years, from the very first edition. The new Fifth Edition is yet another outstanding improvement on an already classic text. The evolution from desktop computing to mobile computing to Big Data brings new coverage of embedded processors such as the ARM, new material on how software and hardware interact to increase performance, and cloud computing. All this without sacrificing the fundamentals."
—Ed Harcourt, St. Lawrence University

"To Millennials: Computer Organization and Design is the computer architecture book you should keep on your (virtual) bookshelf. The book is both old and new, because it develops venerable principles—Moore's Law, abstraction, common case fast, redundancy, memory hierarchies, parallelism, and pipelining—but illustrates them with contemporary designs, e.g., ARM Cortex A8 and Intel Core i7."
—Mark D. Hill, University of Wisconsin-Madison

"The new edition of Computer Organization and Design keeps pace with advances in emerging embedded and many-core (GPU) systems, where tablets and smartphones are quickly becoming our new desktops. This text acknowledges these changes, but continues to provide a rich foundation of the fundamentals in computer organization and design which will be needed for the designers of hardware and software that power this new class of devices and systems."
—Dave Kaeli, Northeastern University

"The Fifth Edition of Computer Organization and Design provides more than an introduction to computer architecture. It prepares the reader for the changes necessary to meet the ever-increasing performance needs of mobile systems and big data processing at a time that difficulties in semiconductor scaling are making all systems power constrained. In this new era for computing, hardware and software must be co-designed and system-level architecture is as critical as component-level optimizations."
—Christos Kozyrakis, Stanford University

"Patterson and Hennessy brilliantly address the issues in ever-changing computer hardware architectures, emphasizing interactions among hardware and software components at various abstraction levels. By interspersing I/O and parallelism concepts with a variety of mechanisms in hardware and software throughout the book, the new edition achieves an excellent holistic presentation of computer architecture for the PostPC era. This book is an essential guide for hardware and software professionals facing energy efficiency and parallelization challenges, from tablet PCs to cloud computing."
—Jae C. Oh, Syracuse University




FIFTH EDITION
Computer Organization and Design
THE HARDWARE/SOFTWARE INTERFACE


David A. Patterson has been teaching computer architecture at the University of California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM and CRA.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. These projects earned three dissertation awards from ACM. His current research projects are Algorithm-Machine-People and Algorithms and Specializers for Provably Optimal Implementations with Resilience and Efficiency. The AMP Lab is developing scalable machine learning algorithms, warehouse-scale-computer-friendly programming models, and crowd-sourcing tools to gain valuable insights quickly from big data in the cloud. The ASPIRE Lab uses deep hardware and software co-tuning to achieve the highest possible performance and energy efficiency for mobile and rack computing systems.

John L. Hennessy is the tenth president of Stanford University, where he has been a member of the faculty since 1977 in the departments of electrical engineering and computer science. Hennessy is a Fellow of the IEEE and ACM; a member of the National Academy of Engineering, the National Academy of Science, and the American Philosophical Society; and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received seven honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems (now MIPS Technologies), which developed one of the first commercial RISC microprocessors. As of 2006, over 2 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups both as an early-stage advisor and an investor.


To Linda,
who has been, is, and always will be the love of my life


ACKNOWLEDGMENTS

Figures 1.7, 1.8 Courtesy of iFixit (www.ifixit.com).
Figure 1.9 Courtesy of Chipworks (www.chipworks.com).
Figure 1.13 Courtesy of Intel.
Figures 1.10.1, 1.10.2, 4.15.2 Courtesy of the Charles Babbage Institute, University of Minnesota Libraries, Minneapolis.
Figures 1.10.3, 4.15.1, 4.15.3, 5.12.3, 6.14.2 Courtesy of IBM.
Figure 1.10.4 Courtesy of Cray Inc.
Figure 1.10.5 Courtesy of Apple Computer, Inc.
Figure 1.10.6 Courtesy of the Computer History Museum.
Figures 5.17.1, 5.17.2 Courtesy of Museum of Science, Boston.
Figure 5.17.4 Courtesy of MIPS Technologies, Inc.
Figure 6.15.1 Courtesy of NASA Ames Research Center.


Preface

The most beautiful thing we can experience is the mysterious. It is the source of all true art and science.
Albert Einstein, What I Believe, 1930

About This Book

We believe that learning in computer science and engineering should reflect the current state of the field, as well as introduce the principles that are shaping computing. We also feel that readers in every specialty of computing need to appreciate the organizational paradigms that determine the capabilities, performance, energy, and, ultimately, the success of computer systems.

Modern computer technology requires professionals of every computing specialty to understand both hardware and software. The interaction between hardware and software at a variety of levels also offers a framework for understanding the fundamentals of computing. Whether your primary interest is hardware or software, computer science or electrical engineering, the central ideas in computer organization and design are the same. Thus, our emphasis in this book is to show the relationship between hardware and software and to focus on the concepts that are the basis for current computers.

The recent switch from uniprocessor to multicore microprocessors confirmed the soundness of this perspective, a perspective we have held since the first edition. While programmers could once ignore that advice and rely on computer architects, compiler writers, and silicon engineers to make their programs run faster or be more energy-efficient without change, that era is over. For programs to run faster, they must become parallel. While the goal of many researchers is to make it possible for programmers to be unaware of the underlying parallel nature of the hardware they are programming, it will take many years to realize this vision. Our view is that for at least the next decade, most programmers are going to have to understand the hardware/software interface if they want programs to run efficiently on parallel computers.

The audience for this book includes those with little experience in assembly language or logic design who need to understand basic computer organization as well as readers with backgrounds in assembly language and/or logic design who want to learn how to design a computer or understand how a system works and why it performs as it does.



About the Other Book

Some readers may be familiar with Computer Architecture: A Quantitative Approach, popularly known as Hennessy and Patterson. (This book in turn is often called Patterson and Hennessy.) Our motivation in writing the earlier book was to describe the principles of computer architecture using solid engineering fundamentals and quantitative cost/performance tradeoffs. We used an approach that combined examples and measurements, based on commercial systems, to create realistic design experiences. Our goal was to demonstrate that computer architecture could be learned using quantitative methodologies instead of a descriptive approach. It was intended for the serious computing professional who wanted a detailed understanding of computers.

A majority of the readers for this book do not plan to become computer architects. The performance and energy efficiency of future software systems will be dramatically affected, however, by how well software designers understand the basic hardware techniques at work in a system. Thus, compiler writers, operating system designers, database programmers, and most other software engineers need a firm grounding in the principles presented in this book. Similarly, hardware designers must understand clearly the effects of their work on software applications.

Thus, we knew that this book had to be much more than a subset of the material in Computer Architecture, and the material was extensively revised to match the different audience. We were so happy with the result that the subsequent editions of Computer Architecture were revised to remove most of the introductory material; hence, there is much less overlap today than with the first editions of both books.

Changes for the Fifth Edition

We had six major goals for the fifth edition of Computer Organization and Design: demonstrate the importance of understanding hardware with a running example; highlight major themes across the topics using margin icons that are introduced early; update examples to reflect the changeover from the PC era to the PostPC era; spread the material on I/O throughout the book rather than isolating it into a single chapter; update the technical content to reflect changes in the industry since the publication of the fourth edition in 2009; and put appendices and optional sections online instead of including a CD to lower costs and to make this edition viable as an electronic book.

Before discussing the goals in detail, let's look at the table on the next page. It shows the hardware and software paths through the material. Chapters 1, 4, 5, and 6 are found on both paths, no matter what the experience or the focus. Chapter 1 discusses the importance of energy and how it motivates the switch from single core to multicore microprocessors and introduces the eight great ideas in computer architecture. Chapter 2 is likely to be review material for the hardware-oriented, but it is essential reading for the software-oriented, especially for those readers interested in learning more about compilers and object-oriented programming languages. Chapter 3 is for readers interested in constructing a datapath or in learning more about floating-point arithmetic. Some will skip parts of Chapter 3, either because they don't need them or because they offer a review. However, we introduce the running example of matrix multiply in this chapter, showing how subword parallelism offers a fourfold improvement, so don't skip Sections 3.6 to 3.8. Chapter 4 explains pipelined processors. Sections 4.1, 4.5, and 4.10 give overviews and Section 4.12 gives the next performance boost for matrix multiply for those with a software focus. Those with a hardware focus, however, will find that this chapter presents core material; they may also, depending on their background, want to read Appendix C on logic design first. The last chapter, on multicores, multiprocessors, and clusters, is mostly new content and should be read by everyone. It was significantly reorganized in this edition to make the flow of ideas more natural and to include much more depth on GPUs, warehouse scale computers, and the hardware-software interface of network interface cards that are key to clusters.

The first of the six goals for this fifth edition was to demonstrate the importance of understanding modern hardware to get good performance and energy efficiency with a concrete example. As mentioned above, we start with subword parallelism in Chapter 3 to improve matrix multiply by a factor of 4. We double performance in Chapter 4 by unrolling the loop to demonstrate the value of instruction level parallelism. Chapter 5 doubles performance again by optimizing for caches using blocking. Finally, Chapter 6 demonstrates a speedup of 14 from 16 processors by using thread-level parallelism. All four optimizations in total add just 24 lines of C code to our initial matrix multiply example.
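To make the running example concrete, here is a minimal sketch in C of the kind of unoptimized matrix multiply that such a sequence of optimizations starts from. It only illustrates the general shape of the computation; the function name dgemm, the column-major indexing, and the exact loop order are assumptions for this sketch, not the book's code.

    /* Unoptimized double-precision matrix multiply: C = C + A * B,
       where A, B, and C are n-by-n matrices stored in column-major order.
       Later chapters speed up this kind of loop nest with subword parallelism,
       loop unrolling, cache blocking, and parallel loops. */
    void dgemm(int n, double *A, double *B, double *C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i + j * n];                 /* cij = C[i][j] */
                for (int k = 0; k < n; ++k)
                    cij += A[i + k * n] * B[k + j * n];    /* cij += A[i][k] * B[k][j] */
                C[i + j * n] = cij;
            }
    }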

The second goal was to help readers separate the forest from the trees by identifying eight great ideas of computer architecture early and then pointing out all the places they occur throughout the rest of the book. We use (hopefully) easy to remember margin icons and highlight the corresponding word in the text to remind readers of these eight themes. There are nearly 100 citations in the book. No chapter has fewer than seven examples of great ideas, and no idea is cited fewer than five times. Performance via parallelism, pipelining, and prediction are the three most popular great ideas, followed closely by Moore's Law. The processor chapter (4) is the one with the most examples, which is not a surprise since it probably received the most attention from computer architects. The one great idea found in every chapter is performance via parallelism, which is a pleasant observation given the recent emphasis on parallelism in the field and in editions of this book.

The third goal was to recognize the generation change in computing from the PC era to the PostPC era in this edition with our examples and material. Thus, Chapter 1 dives into the guts of a tablet computer rather than a PC, and Chapter 6 describes the computing infrastructure of the cloud. We also feature the ARM, which is the instruction set of choice in the personal mobile devices of the PostPC era, as well as the x86 instruction set that dominated the PC era and (so far) dominates cloud computing.

The fourth goal was to spread the I/O material throughout the book rather than have it in its own chapter, much as we spread parallelism throughout all the chapters in the fourth edition. Hence, I/O material in this edition can be found in Sections 1.4, 4.9, 5.2, 5.5, 5.11, and 6.9. The thought is that readers (and instructors) are more likely to cover I/O if it's not segregated to its own chapter.

This is a fast-moving field, and, as is always the case for our new editions, an important goal is to update the technical content. The running example is the ARM Cortex A8 and the Intel Core i7, reflecting our PostPC era. Other highlights include an overview of the new 64-bit instruction set of ARMv8, a tutorial on GPUs that explains their unique terminology, more depth on the warehouse scale computers that make up the cloud, and a deep dive into 10 Gigabit Ethernet cards.

To keep the main book short and compatible with electronic books, we placed the optional material as online appendices instead of on a companion CD as in prior editions.

Finally, we updated all the exercises in the book.

While some elements changed, we have preserved useful book elements from prior editions. To make the book work better as a reference, we still place definitions of new terms in the margins at their first occurrence. The book element called "Understanding Program Performance" helps readers understand the performance of their programs and how to improve it, just as the "Hardware/Software Interface" book element helped readers understand the tradeoffs at this interface. "The Big Picture" section remains so that the reader sees the forest despite all the trees. "Check Yourself" sections help readers to confirm their comprehension of the material on the first time through, with answers provided at the end of each chapter. This edition still includes the green MIPS reference card, which was inspired by the "Green Card" of the IBM System/360. This card has been updated and should be a handy reference when writing MIPS assembly language programs.

Instructor Support

We have collected a great deal of material to help instructors teach courses using this book. Solutions to exercises, figures from the book, lecture slides, and other materials are available to adopters from the publisher. Check the publisher's Web site for more information:

textbooks.elsevier.com/9780124077263

Concluding Remarks

If you read the following acknowledgments section, you will see that we went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining, resilient bugs, please contact the publisher by electronic mail at cod5bugs@mkp.com or by low-tech mail using the address found on the copyright page.

This edition is the second break in the long-standing collaboration between Hennessy and Patterson, which started in 1989. The demands of running one of the world's great universities meant that President Hennessy could no longer make the substantial commitment to create a new edition. The remaining author felt once again like a tightrope walker without a safety net. Hence, the people in the acknowledgments and Berkeley colleagues played an even larger role in shaping the contents of this book. Nevertheless, this time around there is only one author to blame for the new material in what you are about to read.

Acknowledgments for the Fifth Edition

With every edition of this book, we are very fortunate to receive help from many readers, reviewers, and contributors. Each of these people has helped to make this book better.

Chapter 6 was so extensively revised that we did a separate review for ideas and contents, and I made changes based on the feedback from every reviewer. I'd like to thank Christos Kozyrakis of Stanford University for suggesting using the network interface for clusters to demonstrate the hardware-software interface of I/O and for suggestions on organizing the rest of the chapter; Mario Flagsilk of Stanford University for providing details, diagrams, and performance measurements of the NetFPGA NIC; and the following for suggestions on how to improve the chapter: David Kaeli of Northeastern University, Partha Ranganathan of HP Labs, David Wood of the University of Wisconsin, and my Berkeley colleagues Siamak Faridani, Shoaib Kamil, Yunsup Lee, Zhangxi Tan, and Andrew Waterman.

Special thanks goes to Rimas Avizenis of UC Berkeley, who developed the various versions of matrix multiply and supplied the performance numbers as well. As I worked with his father while I was a graduate student at UCLA, it was a nice symmetry to work with Rimas at UCB.

I also wish to thank my longtime collaborator Randy Katz of UC Berkeley, who helped develop the concept of great ideas in computer architecture as part of the extensive revision of an undergraduate class that we did together.

I'd like to thank David Kirk, John Nickolls, and their colleagues at NVIDIA (Michael Garland, John Montrym, Doug Voorhies, Lars Nyland, Erik Lindholm, Paulius Micikevicius, Massimiliano Fatica, Stuart Oberman, and Vasily Volkov) for writing the first in-depth appendix on GPUs. I'd like to express again my appreciation to Jim Larus, recently named Dean of the School of Computer and Communications Science at EPFL, for his willingness in contributing his expertise on assembly language programming, as well as for welcoming readers of this book with regard to using the simulator he developed and maintains.

I am also very grateful to Jason Bakos of the University of South Carolina, who updated and created new exercises for this edition, working from originals prepared for the fourth edition by Perry Alexander (The University of Kansas); Javier Bruguera (Universidade de Santiago de Compostela); Matthew Farrens (University of California, Davis); David Kaeli (Northeastern University); Nicole Kaiyan (University of Adelaide); John Oliver (Cal Poly, San Luis Obispo); Milos Prvulovic (Georgia Tech); and Jichuan Chang, Jacob Leverich, Kevin Lim, and Partha Ranganathan (all from Hewlett-Packard).

Additional thanks goes to Jason Bakos for developing the new lecture slides.


I am grateful to the many instructors who have answered the publisher's surveys, reviewed our proposals, and attended focus groups to analyze and respond to our plans for this edition. They include the following individuals: Focus Groups in 2012: Bruce Barton (Suffolk County Community College), Jeff Braun (Montana Tech), Ed Gehringer (North Carolina State), Michael Goldweber (Xavier University), Ed Harcourt (St. Lawrence University), Mark Hill (University of Wisconsin, Madison), Patrick Homer (University of Arizona), Norm Jouppi (HP Labs), Dave Kaeli (Northeastern University), Christos Kozyrakis (Stanford University), Zachary Kurmas (Grand Valley State University), Jae C. Oh (Syracuse University), Lu Peng (LSU), Milos Prvulovic (Georgia Tech), Partha Ranganathan (HP Labs), David Wood (University of Wisconsin), Craig Zilles (University of Illinois at Urbana-Champaign). Surveys and Reviews: Mahmoud Abou-Nasr (Wayne State University), Perry Alexander (The University of Kansas), Hakan Aydin (George Mason University), Hussein Badr (State University of New York at Stony Brook), Mac Baker (Virginia Military Institute), Ron Barnes (George Mason University), Douglas Blough (Georgia Institute of Technology), Kevin Bolding (Seattle Pacific University), Miodrag Bolic (University of Ottawa), John Bonomo (Westminster College), Jeff Braun (Montana Tech), Tom Briggs (Shippensburg University), Scott Burgess (Humboldt State University), Fazli Can (Bilkent University), Warren R. Carithers (Rochester Institute of Technology), Bruce Carlton (Mesa Community College), Nicholas Carter (University of Illinois at Urbana-Champaign), Anthony Cocchi (The City University of New York), Don Cooley (Utah State University), Robert D. Cupper (Allegheny College), Edward W. Davis (North Carolina State University), Nathaniel J. Davis (Air Force Institute of Technology), Molisa Derk (Oklahoma City University), Derek Eager (University of Saskatchewan), Ernest Ferguson (Northwest Missouri State University), Rhonda Kay Gaede (The University of Alabama), Etienne M. Gagnon (UQAM), Costa Gerousis (Christopher Newport University), Paul Gillard (Memorial University of Newfoundland), Michael Goldweber (Xavier University), Georgia Grant (College of San Mateo), Merrill Hall (The Master's College), Tyson Hall (Southern Adventist University), Ed Harcourt (St. Lawrence University), Justin E. Harlow (University of South Florida), Paul F. Hemler (Hampden-Sydney College), Martin Herbordt (Boston University), Steve J. Hodges (Cabrillo College), Kenneth Hopkinson (Cornell University), Dalton Hunkins (St. Bonaventure University), Baback Izadi (State University of New York—New Paltz), Reza Jafari, Robert W. Johnson (Colorado Technical University), Bharat Joshi (University of North Carolina, Charlotte), Nagarajan Kandasamy (Drexel University), Rajiv Kapadia, Ryan Kastner (University of California, Santa Barbara), E.J. Kim (Texas A&M University), Jihong Kim (Seoul National University), Jim Kirk (Union University), Geoffrey S. Knauth (Lycoming College), Manish M. Kochhal (Wayne State), Suzan Koknar-Tezel (Saint Joseph's University), Angkul Kongmunvattana (Columbus State University), April Kontostathis (Ursinus College), Christos Kozyrakis (Stanford University), Danny Krizanc (Wesleyan University), Ashok Kumar, S. Kumar (The University of Texas), Zachary Kurmas (Grand Valley State University), Robert N. Lea (University of Houston), Baoxin Li (Arizona State University), Li Liao (University of Delaware), Gary Livingston

A special thanks also goes to Mark Smotherman for making multiple passes to find technical and writing glitches that significantly improved the quality of this edition.

We wish to thank the extended Morgan Kaufmann family for agreeing to publish this book again under the able leadership of Todd Green and Nate McFadden: I certainly couldn't have completed the book without them. We also want to extend thanks to Lisa Jones, who managed the book production process, and Russell Purdy, who did the cover design. The new cover cleverly connects the PostPC era content of this edition to the cover of the first edition.

The contributions of the nearly 150 people we mentioned here have helped make this fifth edition what I hope will be our best book yet. Enjoy!

David A. Patterson




1
Computer Abstractions and Technology

Civilization advances by extending the number of important operations which we can perform without thinking about them.
Alfred North Whitehead, An Introduction to Mathematics, 1911

1.1 Introduction 3
1.2 Eight Great Ideas in Computer Architecture 11
1.3 Below Your Program 13
1.4 Under the Covers 16
1.5 Technologies for Building Processors and Memory 24

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.



Computers have led to a third revolution for civilization, with the information revolution taking its place alongside the agricultural and the industrial revolutions. The resulting multiplication of humankind's intellectual strength and reach naturally has affected our everyday lives profoundly and changed the ways in which the search for new knowledge is carried out. There is now a new vein of scientific investigation, with computational scientists joining theoretical and experimental scientists in the exploration of new frontiers in astronomy, biology, chemistry, and physics, among others.

The computer revolution continues. Each time the cost of computing improves by another factor of 10, the opportunities for computers multiply. Applications that were economically infeasible suddenly become practical. In the recent past, the following applications were "computer science fiction."

■ Computers in automobiles: Until microprocessors improved dramatically in price and performance in the early 1980s, computer control of cars was ludicrous. Today, computers reduce pollution, improve fuel efficiency via engine controls, and increase safety through blind spot warnings, lane departure warnings, moving object detection, and air bag inflation to protect occupants in a crash.

■ Cell phones: Who would have dreamed that advances in computer systems would lead to more than half of the planet having mobile phones, allowing person-to-person communication to almost anyone anywhere in the world?

■ Human genome project: The cost of computer equipment to map and analyze human DNA sequences was hundreds of millions of dollars. It's unlikely that anyone would have considered this project had the computer costs been 10 to 100 times higher, as they would have been 15 to 25 years earlier. Moreover, costs continue to drop; you will soon be able to acquire your own genome, allowing medical care to be tailored to you.

■ World Wide Web: Not in existence at the time of the first edition of this book, the web has transformed our society. For many, the web has replaced libraries and newspapers.

■ Search engines: As the content of the web grew in size and in value, finding relevant information became increasingly important. Today, many people rely on search engines for such a large part of their lives that it would be a hardship to go without them.

Clearly, advances in this technology now affect almost every aspect of our society. Hardware advances have allowed programmers to create wonderfully useful software, which explains why computers are omnipresent. Today's science fiction suggests tomorrow's killer applications: already on their way are glasses that augment reality, the cashless society, and cars that can drive themselves.



Classes of Computing Applications and Their Characteristics

Although a common set of hardware technologies (see Sections 1.4 and 1.5) is used in computers ranging from smart home appliances to cell phones to the largest supercomputers, these different applications have different design requirements and employ the core hardware technologies in different ways. Broadly speaking, computers are used in three different classes of applications.

Personal computers (PCs) are possibly the best known form of computing, which readers of this book have likely used extensively. Personal computers emphasize delivery of good performance to single users at low cost and usually execute third-party software. This class of computing drove the evolution of many computing technologies, which is only about 35 years old!

Servers are the modern form of what were once much larger computers, and are usually accessed only via a network. Servers are oriented to carrying large workloads, which may consist of either single complex applications—usually a scientific or engineering application—or handling many small jobs, such as would occur in building a large web server. These applications are usually based on software from another source (such as a database or simulation system), but are often modified or customized for a particular function. Servers are built from the same basic technology as desktop computers, but provide for greater computing, storage, and input/output capacity. In general, servers also place a greater emphasis on dependability, since a crash is usually more costly than it would be on a single-user PC.

Servers span the widest range in cost and capability. At the low end, a server may be little more than a desktop computer without a screen or keyboard and cost a thousand dollars. These low-end servers are typically used for file storage, small business applications, or simple web serving (see Section 6.10). At the other extreme are supercomputers, which at the present consist of tens of thousands of processors and many terabytes of memory, and cost tens to hundreds of millions of dollars. Supercomputers are usually used for high-end scientific and engineering calculations, such as weather forecasting, oil exploration, protein structure determination, and other large-scale problems. Although such supercomputers represent the peak of computing capability, they represent a relatively small fraction of the servers and a relatively small fraction of the overall computer market in terms of total revenue.

Embedded computers are the largest class of computers and span the widest range of applications and performance. Embedded computers include the microprocessors found in your car, the computers in a television set, and the networks of processors that control a modern airplane or cargo ship. Embedded computing systems are designed to run one application or one set of related applications that are normally integrated with the hardware and delivered as a single system; thus, despite the large number of embedded computers, most users never really see that they are using a computer!

personal computer (PC) A computer designed for use by an individual, usually incorporating a graphics display, a keyboard, and a mouse.

server A computer used for running larger programs for multiple users, often simultaneously, and typically accessed only via a network.

supercomputer A class of computers with the highest performance and cost; they are configured as servers and typically cost tens to hundreds of millions of dollars.

terabyte (TB) Originally 1,099,511,627,776 (2^40) bytes, although communications and secondary storage systems developers started using the term to mean 1,000,000,000,000 (10^12) bytes. To reduce confusion, we now use the term tebibyte (TiB) for 2^40 bytes, defining terabyte (TB) to mean 10^12 bytes. Figure 1.1 shows the full range of decimal and binary values and names.
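As a quick check of the size difference: 2^40 = 1,099,511,627,776 ≈ 1.0995 × 10^12, so a binary "terabyte" is nearly 10% larger than the decimal 10^12 bytes, and the gap grows with each larger prefix.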

embedded computer A computer inside another device used for running one predetermined application or collection of software.



multicore microprocessor A microprocessor containing multiple processors ("cores") in a single integrated circuit.

In the last decade, advances in computer design and memory technology have greatly reduced the importance of small memory size in most applications other than those in embedded computing systems.

Programmers interested in performance now need to understand the issues that have replaced the simple memory model of the 1960s: the parallel nature of processors and the hierarchical nature of memories. Moreover, as we explain in Section 1.7, today's programmers need to worry about energy efficiency of their programs running either on the PMD or in the Cloud, which also requires understanding what is below your code. Programmers who seek to build competitive versions of software will therefore need to increase their knowledge of computer organization.

We are honored to have the opportunity to explain what's inside this revolutionary machine, unraveling the software below your program and the hardware under the covers of your computer. By the time you complete this book, we believe you will be able to answer the following questions:

■ How are programs written in a high-level language, such as C or Java, translated into the language of the hardware, and how does the hardware execute the resulting program? Comprehending these concepts forms the basis of understanding the aspects of both the hardware and software that affect program performance.

■ What is the interface between the software and the hardware, and how does software instruct the hardware to perform needed functions? These concepts are vital to understanding how to write many kinds of software.

■ What determines the performance of a program, and how can a programmer improve the performance? As we will see, this depends on the original program, the software translation of that program into the computer's language, and the effectiveness of the hardware in executing the program.

■ What techniques can be used by hardware designers to improve performance? This book will introduce the basic concepts of modern computer design. The interested reader will find much more material on this topic in our advanced book, Computer Architecture: A Quantitative Approach.

■ What techniques can be used by hardware designers to improve energy efficiency? What can the programmer do to help or hinder energy efficiency?

■ What are the reasons for and the consequences of the recent switch from sequential processing to parallel processing? This book gives the motivation, describes the current hardware mechanisms to support parallelism, and surveys the new generation of "multicore" microprocessors (see Chapter 6).

■ Since the first commercial computer in 1951, what great ideas did computer architects come up with that lay the foundation of modern computing?



To demonstrate the impact of the ideas in this book, we improve the performance of a C program that multiplies a matrix times a vector in a sequence of chapters. Each step leverages understanding how the underlying hardware really works in a modern microprocessor to improve performance by a factor of 200!

■ In the category of data level parallelism, in Chapter 3 we use subword parallelism via C intrinsics to increase performance by a factor of 3.8.

■ In the category of instruction level parallelism, in Chapter 4 we use loop unrolling to exploit multiple instruction issue and out-of-order execution hardware to increase performance by another factor of 2.3.

■ In the category of memory hierarchy optimization, in Chapter 5 we use cache blocking to increase performance on large matrices by another factor of 2.5.

■ In the category of thread level parallelism, in Chapter 6 we use parallel for loops in OpenMP to exploit multicore hardware to increase performance by another factor of 14 (see the sketch after this list).
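As a taste of that last step, here is a minimal sketch, assuming a C compiler with OpenMP support, of how a parallel for pragma spreads the outer loop of a matrix multiply across cores. It shows the style of change involved rather than the book's Chapter 6 code; the function name and indexing are assumptions for this sketch.

    #include <omp.h>   /* OpenMP header; the pragma below does the real work */

    /* Parallel matrix multiply sketch: iterations of the outermost loop are
       divided among the available cores, so each thread works on its own
       values of i and no two threads write the same element of C. */
    void dgemm_parallel(int n, double *A, double *B, double *C)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i + j * n];
                for (int k = 0; k < n; ++k)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }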

Check Yourself

Check Yourself sections are designed to help readers assess whether they comprehend the major concepts introduced in a chapter and understand the implications of those concepts. Some Check Yourself questions have simple answers; others are for discussion among a group. Answers to the specific questions can be found at the end of the chapter. Check Yourself questions appear only at the end of a section, making it easy to skip them if you are sure you understand the material.

1. The number of embedded processors sold every year greatly outnumbers the number of PC and even PostPC processors. Can you confirm or deny this insight based on your own experience? Try to count the number of embedded processors in your home. How does it compare with the number of conventional computers in your home?

2. As mentioned earlier, both the software and hardware affect the performance of a program. Can you think of examples where each of the following is the right place to look for a performance bottleneck?

■ The algorithm chosen
■ The programming language or compiler
■ The operating system
■ The processor
■ The I/O system and devices

■ The I/O system <strong>and</strong> devices



compiler A program that translates high-level language statements into assembly language statements.

binary digit Also called a bit. One of the two numbers in base 2 (0 or 1) that are the components of information.

instruction A command that computer hardware understands and obeys.

assembler A program that translates a symbolic version of instructions into the binary version.

assembly language A symbolic representation of machine instructions.

machine language A binary representation of machine instructions.

Compilers perform another vital function: the translation of a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute. Given the sophistication of modern programming languages and the simplicity of the instructions executed by the hardware, the translation from a high-level language program to hardware instructions is complex. We give a brief overview of the process here and then go into more depth in Chapter 2 and in Appendix A.

From a High-Level Language to the Language of Hardware

To actually speak to electronic hardware, you need to send electrical signals. The easiest signals for computers to understand are on and off, and so the computer alphabet is just two letters. Just as the 26 letters of the English alphabet do not limit how much can be written, the two letters of the computer alphabet do not limit what computers can do. The two symbols for these two letters are the numbers 0 and 1, and we commonly think of the computer language as numbers in base 2, or binary numbers. We refer to each "letter" as a binary digit or bit. Computers are slaves to our commands, which are called instructions. Instructions, which are just collections of bits that the computer understands and obeys, can be thought of as numbers. For example, the bits

1000110010100000

tell one computer to add two numbers. Chapter 2 explains why we use numbers for instructions and data; we don't want to steal that chapter's thunder, but using numbers for both instructions and data is a foundation of computing.

The first programmers communicated to computers in binary numbers, but this was so tedious that they quickly invented new notations that were closer to the way humans think. At first, these notations were translated to binary by hand, but this process was still tiresome. Using the computer to help program the computer, the pioneers invented programs to translate from symbolic notation to binary. The first of these programs was named an assembler. This program translates a symbolic version of an instruction into the binary version. For example, the programmer would write

add A,B

and the assembler would translate this notation into

1000110010100000

This instruction tells the computer to add the two numbers A and B. The name coined for this symbolic language, still used today, is assembly language. In contrast, the binary language that the machine understands is the machine language.

Although a tremendous improvement, assembly language is still far from the notations a scientist might like to use to simulate fluid flow or that an accountant might use to balance the books. Assembly language requires the programmer to write one line for every instruction that the computer will follow, forcing the programmer to think like the computer.
programmer to think like the computer.



A compiler enables a programmer to write this high-level language expression:

A + B

The compiler would compile it into this assembly language statement:

add A,B

As shown above, the assembler would translate this statement into the binary instructions that tell the computer to add the two numbers A and B.
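To make the chain concrete, here is a small illustration in C; the MIPS-style assembly in the comment is one plausible translation a compiler could produce, shown as an assumption for this sketch rather than the output of any particular compiler.

    /* A one-statement high-level-language computation. */
    int sum(int a, int b)
    {
        return a + b;   /* A compiler might translate this body into MIPS-style
                           assembly such as:
                               add $v0, $a0, $a1   # result = a + b
                               jr  $ra             # return to the caller
                           and an assembler would then turn each line into a
                           binary machine instruction like the one shown above. */
    }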

High-level programming languages offer several important benefits. First, they allow the programmer to think in a more natural language, using English words and algebraic notation, resulting in programs that look much more like text than like tables of cryptic symbols (see Figure 1.4). Moreover, they allow languages to be designed according to their intended use. Hence, Fortran was designed for scientific computation, Cobol for business data processing, Lisp for symbol manipulation, and so on. There are also domain-specific languages for even narrower groups of users, such as those interested in simulation of fluids, for example.

The second advantage of programming languages is improved programmer productivity. One of the few areas of widespread agreement in software development is that it takes less time to develop programs when they are written in languages that require fewer lines to express an idea. Conciseness is a clear advantage of high-level languages over assembly language.

The final advantage is that programming languages allow programs to be independent of the computer on which they were developed, since compilers and assemblers can translate high-level language programs to the binary instructions of any computer. These three advantages are so strong that today little programming is done in assembly language.

1.4 Under the Covers

input device A mechanism through which the computer is fed information, such as a keyboard.

output device A mechanism that conveys the result of a computation to a user, such as a display, or to another computer.

Now that we have looked below your program to uncover the underlying software, let’s open the covers of your computer to learn about the underlying hardware. The underlying hardware in any computer performs the same basic functions: inputting data, outputting data, processing data, and storing data. How these functions are performed is the primary topic of this book, and subsequent chapters deal with different parts of these four tasks.

When we come to an important point in this book, a point so important that we hope you will remember it forever, we emphasize it by identifying it as a Big Picture item. We have about a dozen Big Pictures in this book, the first being the five components of a computer that perform the tasks of inputting, outputting, processing, and storing data.

Two key components of computers are input devices, such as the microphone, and output devices, such as the speaker. As the names suggest, input feeds the computer, and output is the result of computation sent to the user. Some devices, such as wireless networks, provide both input and output to the computer.

Chapters 5 and 6 describe input/output (I/O) devices in more detail, but let’s take an introductory tour through the computer hardware, starting with the external I/O devices.

The BIG Picture

The five classic components of a computer are input, output, memory, datapath, and control, with the last two sometimes combined and called the processor. Figure 1.5 shows the standard organization of a computer. This organization is independent of hardware technology: you can place every piece of every computer, past and present, into one of these five categories. To help you keep all this in perspective, the five components of a computer are shown on the front page of each of the following chapters, with the portion of interest to that chapter highlighted.

FIGURE 1.5 The organization of a computer, showing the five classic components. The processor gets instructions and data from memory. Input writes data to memory, and output reads data from memory. Control sends the signals that determine the operations of the datapath, memory, input, and output.



liquid crystal display A display technology using a thin layer of liquid polymers that can be used to transmit or block light according to whether a charge is applied.

active matrix display A liquid crystal display using a transistor to control the transmission of light at each individual pixel.

pixel The smallest individual picture element. Screens are composed of hundreds of thousands to millions of pixels, organized in a matrix.

Through computer displays I have landed an airplane on the deck of a moving carrier, observed a nuclear particle hit a potential well, flown in a rocket at nearly the speed of light and watched a computer reveal its innermost workings.

Ivan Sutherland, the “father” of computer graphics, Scientific American, 1984

Through the Looking Glass

The most fascinating I/O device is probably the graphics display. Most personal mobile devices use liquid crystal displays (LCDs) to get a thin, low-power display. The LCD is not the source of light; instead, it controls the transmission of light. A typical LCD includes rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display, from either a light source behind the display or less often from reflected light. The rods straighten out when a current is applied and no longer bend the light. Since the liquid crystal material is between two screens polarized at 90 degrees, the light cannot pass through unless it is bent. Today, most LCD displays use an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images. A red-green-blue mask associated with each dot on the display determines the intensity of the three-color components in the final image; in a color active matrix LCD, there are three transistor switches at each point.

The image is composed of a matrix of picture elements, or pixels, which can be represented as a matrix of bits, called a bit map. Depending on the size of the screen and the resolution, the display matrix in a typical tablet ranges in size from 1024 × 768 to 2048 × 1536. A color display might use 8 bits for each of the three colors (red, blue, and green), for 24 bits per pixel, permitting millions of different colors to be displayed.

The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics display at the refresh rate. Figure 1.6 shows a frame buffer with a simplified design of just 4 bits per pixel.

The goal of the bit map is to faithfully represent what is on the screen. The challenges in graphics systems arise because the human eye is very good at detecting even subtle changes on the screen.

FIGURE 1.6 Each coordinate in the frame buffer on the left determines the shade of the corresponding coordinate for the raster scan CRT display on the right. Pixel (X0, Y0) contains the bit pattern 0011, which is a lighter shade on the screen than the bit pattern 1101 in pixel (X1, Y1).
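As a rough sketch of the idea, and not the programming interface of any real graphics hardware, a frame buffer can be modeled as a two-dimensional array holding one small bit pattern per pixel; a display refresh simply reads the array out. The 4-bit shades mirror the simplified design of Figure 1.6.

# Minimal model of a frame buffer: one 4-bit shade per pixel.
WIDTH, HEIGHT, BITS_PER_PIXEL = 4, 4, 4
frame_buffer = [[0] * WIDTH for _ in range(HEIGHT)]   # all pixels start at shade 0

def set_pixel(x, y, shade):
    """Store a 4-bit shade (0..15) for pixel (x, y)."""
    assert 0 <= shade < (1 << BITS_PER_PIXEL)
    frame_buffer[y][x] = shade

# The two pixels of Figure 1.6: (X0, Y0) holds 0011 and (X1, Y1) holds 1101.
set_pixel(0, 0, 0b0011)
set_pixel(1, 1, 0b1101)

def refresh():
    """Read every pixel's bit pattern out, as a display refresh would."""
    for row in frame_buffer:
        print(" ".join(format(shade, "04b") for shade in row))

refresh()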



Touchscreen

While PCs also use LCD displays, the tablets and smartphones of the PostPC era have replaced the keyboard and mouse with touch-sensitive displays, which have the wonderful user interface advantage of users pointing directly at what they are interested in rather than indirectly with a mouse.

While there are a variety of ways to implement a touch screen, many tablets today use capacitive sensing. Since people are electrical conductors, if an insulator like glass is covered with a transparent conductor, touching distorts the electrostatic field of the screen, which results in a change in capacitance. This technology can allow multiple touches simultaneously, which allows gestures that can lead to attractive user interfaces.

Opening the Box

Figure 1.7 shows the contents of the Apple iPad 2 tablet computer. Unsurprisingly, of the five classic components of the computer, I/O dominates this reading device. The list of I/O devices includes a capacitive multitouch LCD display, front-facing camera, rear-facing camera, microphone, headphone jack, speakers, accelerometer, gyroscope, Wi-Fi network, and Bluetooth network. The datapath, control, and memory are a tiny portion of the components.

The small rectangles in Figure 1.8 contain the devices that drive our advancing technology, called integrated circuits and nicknamed chips. The A5 package seen in the middle of Figure 1.8 contains two ARM processors that operate with a clock rate of 1 GHz. The processor is the active part of the computer, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. Occasionally, people call the processor the CPU, for the more bureaucratic-sounding central processor unit.

Descending even lower into the hardware, Figure 1.9 reveals details of a microprocessor. The processor logically comprises two main components: datapath and control, the respective brawn and brain of the processor. The datapath performs the arithmetic operations, and control tells the datapath, memory, and I/O devices what to do according to the wishes of the instructions of the program. Chapter 4 explains the datapath and control for a higher-performance design.

The A5 package in Figure 1.8 also includes two memory chips, each with 2 gibibits of capacity, thereby supplying 512 MiB. The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. The memory is built from DRAM chips. DRAM stands for dynamic random access memory. Multiple DRAMs are used together to contain the instructions and data of a program. In contrast to sequential access memories, such as magnetic tapes, the RAM portion of the term DRAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.

Descending into the depths of any component of the hardware reveals insights into the computer. Inside the processor is another type of memory—cache memory.

integrated circuit Also called a chip. A device combining dozens to millions of transistors.

central processor unit (CPU) Also called processor. The active part of the computer, which contains the datapath and control and which adds numbers, tests numbers, signals I/O devices to activate, and so on.

datapath The component of the processor that performs arithmetic operations.

control The component of the processor that commands the datapath, memory, and I/O devices according to the instructions of the program.

memory The storage area in which programs are kept when they are running and that contains the data needed by the running programs.

dynamic random access memory (DRAM) Memory built as an integrated circuit; it provides random access to any location. Access times are 50 nanoseconds and cost per gigabyte in 2012 was $5 to $10.



FIGURE 1.7 Components of the Apple iPad 2 A1395. The metal back of the iPad (with the reversed Apple logo in the middle) is in the center. At the top is the capacitive multitouch screen and LCD display. To the far right is the 3.8 V, 25 watt-hour, polymer battery, which consists of three Li-ion cell cases and offers 10 hours of battery life. To the far left is the metal frame that attaches the LCD to the back of the iPad. The small components surrounding the metal back in the center are what we think of as the computer; they are often L-shaped to fit compactly inside the case next to the battery. Figure 1.8 shows a close-up of the L-shaped board to the lower left of the metal case, which is the logic printed circuit board that contains the processor and the memory. The tiny rectangle below the logic board contains a chip that provides wireless communication: Wi-Fi, Bluetooth, and FM tuner. It fits into a small slot in the lower left corner of the logic board. Near the upper left corner of the case is another L-shaped component, which is a front-facing camera assembly that includes the camera, headphone jack, and microphone. Near the right upper corner of the case is the board containing the volume control and silent/screen rotation lock button along with a gyroscope and accelerometer. These last two chips combine to allow the iPad to recognize 6-axis motion. The tiny rectangle next to it is the rear-facing camera. Near the bottom right of the case is the L-shaped speaker assembly. The cable at the bottom is the connector between the logic board and the camera/volume control board. The board between the cable and the speaker assembly is the controller for the capacitive touchscreen. (Courtesy iFixit, www.ifixit.com)

FIGURE 1.8 The logic board of the Apple iPad 2 in Figure 1.7. The photo highlights five integrated circuits. The large integrated circuit in the middle is the Apple A5 chip, which contains dual ARM processor cores that run at 1 GHz as well as 512 MB of main memory inside the package. Figure 1.9 shows a photograph of the processor chip inside the A5 package. The similar-sized chip to the left is the 32 GB flash memory chip for non-volatile storage. There is an empty space between the two chips where a second flash chip can be installed to double the storage capacity of the iPad. The chips to the right of the A5 include the power controller and I/O controller chips. (Courtesy iFixit, www.ifixit.com)



cache memory A small, fast memory that acts as a buffer for a slower, larger memory.

FIGURE 1.9 The processor integrated circuit inside the A5 package. The size of the chip is 12.1 by 10.1 mm, and it was manufactured originally in a 45-nm process (see Section 1.5). It has two identical ARM processors or cores in the middle left of the chip and a PowerVR graphical processor unit (GPU) with four datapaths in the upper left quadrant. To the left and bottom side of the ARM cores are interfaces to main memory (DRAM). (Courtesy Chipworks, www.chipworks.com)

static random access memory (SRAM) Also memory built as an integrated circuit, but faster and less dense than DRAM.

Cache memory consists of a small, fast memory that acts as a buffer for the DRAM memory. (The nontechnical definition of cache is a safe place for hiding things.) Cache is built using a different memory technology, static random access memory (SRAM). SRAM is faster but less dense, and hence more expensive, than DRAM (see Chapter 5). SRAM and DRAM are two layers of the memory hierarchy.



To distinguish between the volatile memory used to hold data and programs while they are running and this nonvolatile memory used to store data and programs between runs, the term main memory or primary memory is used for the former, and secondary memory for the latter. Secondary memory forms the next lower layer of the memory hierarchy. DRAMs have dominated main memory since 1975, but magnetic disks dominated secondary memory starting even earlier. Because of their size and form factor, personal mobile devices use flash memory, a nonvolatile semiconductor memory, instead of disks. Figure 1.8 shows the chip containing the flash memory of the iPad 2. While slower than DRAM, it is much cheaper than DRAM in addition to being nonvolatile. Although costing more per bit than disks, it is smaller, it comes in much smaller capacities, it is more rugged, and it is more power efficient than disks. Hence, flash memory is the standard secondary memory for PMDs. Alas, unlike disks and DRAM, flash memory bits wear out after 100,000 to 1,000,000 writes. Thus, file systems must keep track of the number of writes and have a strategy to avoid wearing out storage, such as by moving popular data. Chapter 5 describes disks and flash memory in more detail.
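As a purely illustrative sketch of the bookkeeping involved (real flash translation layers are far more elaborate), one simple wear-leveling policy keeps a write count per block and directs each new write to the least-worn block:

# Toy wear-leveling idea: track writes per flash block and always write
# to the block that has been written the fewest times so far.
NUM_BLOCKS = 8
write_counts = [0] * NUM_BLOCKS

def choose_block():
    """Pick the least-worn block for the next write."""
    return min(range(NUM_BLOCKS), key=lambda b: write_counts[b])

def write_block(data):
    block = choose_block()
    write_counts[block] += 1
    # ... the data would be programmed into `block` here ...
    return block

for _ in range(20):
    write_block(b"payload")
print(write_counts)   # writes end up spread roughly evenly across the blocks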

Communicating with Other Computers

We’ve explained how we can input, compute, display, and save data, but there is still one missing item found in today’s computers: computer networks. Just as the processor shown in Figure 1.5 is connected to memory and I/O devices, networks interconnect whole computers, allowing computer users to extend the power of computing by including communication. Networks have become so popular that they are the backbone of current computer systems; a new personal mobile device or server without a network interface would be ridiculed. Networked computers have several major advantages:

■ Communication: Information is exchanged between computers at high speeds.

■ Resource sharing: Rather than each computer having its own I/O devices, computers on the network can share I/O devices.

■ Nonlocal access: By connecting computers over long distances, users need not be near the computer they are using.

main memory Also called primary memory. Memory used to hold programs while they are running; typically consists of DRAM in today’s computers.

secondary memory Nonvolatile memory used to store programs and data between runs; typically consists of flash memory in PMDs and magnetic disks in servers.

magnetic disk Also called hard disk. A form of nonvolatile secondary memory composed of rotating platters coated with a magnetic recording material. Because they are rotating mechanical devices, access times are about 5 to 20 milliseconds and cost per gigabyte in 2012 was $0.05 to $0.10.

flash memory A nonvolatile semiconductor memory. It is cheaper and slower than DRAM but more expensive per bit and faster than magnetic disks. Access times are about 5 to 50 microseconds and cost per gigabyte in 2012 was $0.75 to $1.00.

local area network (LAN) A network designed to carry data within a geographically confined area, typically within a single building.

wide area network (WAN) A network extended over hundreds of kilometers that can span a continent.

Networks vary in length and performance, with the cost of communication increasing according to both the speed of communication and the distance that information travels. Perhaps the most popular type of network is Ethernet. It can be up to a kilometer long and transfer at up to 40 gigabits per second. Its length and speed make Ethernet useful to connect computers on the same floor of a building; hence, it is an example of what is generically called a local area network. Local area networks are interconnected with switches that can also provide routing services and security. Wide area networks cross continents and are the backbone of the Internet, which supports the web. They are typically based on optical fibers and are leased from telecommunication companies.

Networks have changed the face of computing in the last 30 years, both by becoming much more ubiquitous and by making dramatic increases in performance. In the 1970s, very few individuals had access to electronic mail, the Internet and web did not exist, and physically mailing magnetic tapes was the primary way to transfer large amounts of data between two locations. Local area networks were almost nonexistent, and the few existing wide area networks had limited capacity and restricted access.

As networking technology improved, it became much cheaper and had a much higher capacity. For example, the first standardized local area network technology, developed about 30 years ago, was a version of Ethernet that had a maximum capacity (also called bandwidth) of 10 million bits per second, typically shared by tens of, if not a hundred, computers. Today, local area network technology offers a capacity of from 1 to 40 gigabits per second, usually shared by at most a few computers. Optical communications technology has allowed similar growth in the capacity of wide area networks, from hundreds of kilobits to gigabits and from hundreds of computers connected to a worldwide network to millions of computers connected. This combination of dramatic rise in deployment of networking combined with increases in capacity have made network technology central to the information revolution of the last 30 years.

For the last decade another innovation in networking is reshaping the way computers communicate. Wireless technology is widespread, which enabled the PostPC Era. The ability to make a radio in the same low-cost semiconductor technology (CMOS) used for memory and microprocessors enabled a significant improvement in price, leading to an explosion in deployment. Currently available wireless technologies, called by the IEEE standard name 802.11, allow for transmission rates from 1 to nearly 100 million bits per second. Wireless technology is quite a bit different from wire-based networks, since all users in an immediate area share the airwaves.

Check Yourself

■ Semiconductor DRAM memory, flash memory, and disk storage differ significantly. For each technology, list its volatility, approximate relative access time, and approximate relative cost compared to DRAM.

1.5 Technologies for Building Processors and Memory

Processors and memory have improved at an incredible rate, because computer designers have long embraced the latest in electronic technology to try to win the race to design a better computer. Figure 1.10 shows the technologies that have



FIGURE 1.13 A 12-inch (300 mm) wafer of Intel Core i7 (Courtesy Intel). The number of dies on this 300 mm (12 inch) wafer at 100% yield is 280, each 20.7 by 10.5 mm. The several dozen partially rounded chips at the boundaries of the wafer are useless; they are included because it’s easier to create the masks used to pattern the silicon. This die uses a 32-nanometer technology, which means that the smallest features are approximately 32 nm in size, although they are typically somewhat smaller than the actual feature size, which refers to the size of the transistors as “drawn” versus the final manufactured size.

called dies and more informally known as chips. Figure 1.13 shows a photograph of a wafer containing microprocessors before they have been diced; earlier, Figure 1.9 shows an individual microprocessor die.

Dicing enables you to discard only those dies that were unlucky enough to contain the flaws, rather than the whole wafer. This concept is quantified by the yield of a process, which is defined as the percentage of good dies from the total number of dies on the wafer.

The cost of an integrated circuit rises quickly as the die size increases, due both to the lower yield and the smaller number of dies that fit on a wafer. To reduce the cost, using the next generation process shrinks a large die as it uses smaller sizes for both transistors and wires. This improves the yield and the die count per wafer. A 32-nanometer (nm) process was typical in 2012, which means essentially that the smallest feature size on the die is 32 nm.

die The individual rectangular sections that are cut from a wafer, more informally known as chips.

yield The percentage of good dies from the total number of dies on the wafer.



Once you’ve found good dies, they are connected to the input/output pins of a package, using a process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging, and then they are shipped to customers.

Elaboration: The cost of an integrated circuit can be expressed in three simple equations:

Cost per die = Cost per wafer / (Dies per wafer × Yield)

Dies per wafer ≈ Wafer area / Die area

Yield = 1 / (1 + (Defects per area × Die area/2))²

The first equation is straightforward to derive. The second is an approximation, since it does not subtract the area near the border of the round wafer that cannot accommodate the rectangular dies (see Figure 1.13). The final equation is based on empirical observations of yields at integrated circuit factories, with the exponent related to the number of critical processing steps.

Hence, depending on the defect rate and the size of the die and wafer, costs are generally not linear in the die area.
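A small Python sketch of the three equations, using the 20.7 by 10.5 mm die of Figure 1.13 and invented values for the wafer cost and defect density (the dollar and defect numbers are assumptions for illustration, not data from the text):

import math

def cost_per_die(cost_per_wafer, wafer_diameter_mm, die_area_mm2, defects_per_mm2):
    """Apply the three cost equations from the Elaboration above."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    dies_per_wafer = wafer_area / die_area_mm2                     # approximation
    yield_ = 1 / (1 + defects_per_mm2 * die_area_mm2 / 2) ** 2     # empirical model
    return cost_per_wafer / (dies_per_wafer * yield_)

print(cost_per_die(cost_per_wafer=5000.0,       # assumed cost per wafer
                   wafer_diameter_mm=300,
                   die_area_mm2=20.7 * 10.5,    # die size from Figure 1.13
                   defects_per_mm2=0.001))      # assumed defect density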

Check Yourself

A key factor in determining the cost of an integrated circuit is volume. Which of the following are reasons why a chip made in high volume should cost less?

1. With high volumes, the manufacturing process can be tuned to a particular design, increasing the yield.

2. It is less work to design a high-volume part than a low-volume part.

3. The masks used to make the chip are expensive, so the cost per chip is lower for higher volumes.

4. Engineering development costs are high and largely independent of volume; thus, the development cost per die is lower with high-volume parts.

5. High-volume parts usually have smaller die sizes than low-volume parts and therefore have higher yield per wafer.

1.6 Performance

Assessing the performance of computers can be quite challenging. The scale and intricacy of modern software systems, together with the wide range of performance improvement techniques employed by hardware designers, have made performance assessment much more difficult.

When trying to choose among different computers, performance is an important attribute. Accurately measuring and comparing different computers is critical to



purchasers and therefore to designers. The people selling computers know this as well. Often, salespeople would like you to see their computer in the best possible light, whether or not this light accurately reflects the needs of the purchaser’s application. Hence, understanding how best to measure performance and the limitations of performance measurements is important in selecting a computer.

The rest of this section describes different ways in which performance can be determined; then, we describe the metrics for measuring performance from the viewpoint of both a computer user and a designer. We also look at how these metrics are related and present the classical processor performance equation, which we will use throughout the text.

Defining Performance

When we say one computer has better performance than another, what do we mean? Although this question might seem simple, an analogy with passenger airplanes shows how subtle the question of performance can be. Figure 1.14 lists some typical passenger airplanes, together with their cruising speed, range, and capacity. If we wanted to know which of the planes in this table had the best performance, we would first need to define performance. For example, considering different measures of performance, we see that the plane with the highest cruising speed was the Concorde (retired from service in 2003), the plane with the longest range is the DC-8, and the plane with the largest capacity is the 747.

Airplane            Passenger capacity   Cruising range (miles)   Cruising speed (m.p.h.)   Passenger throughput (passengers × m.p.h.)
Boeing 777                 375                  4630                      610                      228,750
Boeing 747                 470                  4150                      610                      286,700
BAC/Sud Concorde           132                  4000                     1350                      178,200
Douglas DC-8-50            146                  8720                      544                       79,424

FIGURE 1.14 The capacity, range, and speed for a number of commercial airplanes. The last column shows the rate at which the airplane transports passengers, which is the capacity times the cruising speed (ignoring range and takeoff and landing times).

Let’s suppose we define performance in terms of speed. This still leaves two possible definitions. You could define the fastest plane as the one with the highest cruising speed, taking a single passenger from one point to another in the least time. If you were interested in transporting 450 passengers from one point to another, however, the 747 would clearly be the fastest, as the last column of the figure shows. Similarly, we can define computer performance in several different ways.

If you were running a program on two different desktop computers, you’d say that the faster one is the desktop computer that gets the job done first. If you were running a datacenter that had several servers running jobs submitted by many users, you’d say that the faster computer was the one that completed the most jobs during a day.

response time Also called execution time. The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.



throughput Also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time.

As an individual computer user, you are interested in reducing response time—the time between the start and completion of a task—also referred to as execution time. Datacenter managers are often interested in increasing throughput or bandwidth—the total amount of work done in a given time. Hence, in most cases, we will need different performance metrics as well as different sets of applications to benchmark personal mobile devices, which are more focused on response time, versus servers, which are more focused on throughput.

Throughput and Response Time

EXAMPLE

Do the following changes to a computer system increase throughput, decrease response time, or both?

1. Replacing the processor in a computer with a faster version

2. Adding additional processors to a system that uses multiple processors for separate tasks—for example, searching the web

ANSWER

Decreasing response time almost always improves throughput. Hence, in case 1, both response time and throughput are improved. In case 2, no one task gets work done faster, so only throughput increases.

If, however, the demand for processing in the second case was almost as large as the throughput, the system might force requests to queue up. In this case, increasing the throughput could also improve response time, since it would reduce the waiting time in the queue. Thus, in many real computer systems, changing either execution time or throughput often affects the other.

In discussing the performance of computers, we will be primarily concerned with response time for the first few chapters. To maximize performance, we want to minimize response time or execution time for some task. Thus, we can relate performance and execution time for a computer X:

Performance_X = 1 / Execution time_X

This means that for two computers X and Y, if the performance of X is greater than the performance of Y, we have

Performance_X > Performance_Y

1 / Execution time_X > 1 / Execution time_Y

Execution time_Y > Execution time_X

That is, the execution time on Y is longer than that on X, if X is faster than Y.



In discussing a computer design, we often want to relate the performance of two different computers quantitatively. We will use the phrase “X is n times faster than Y”—or equivalently “X is n times as fast as Y”—to mean

Performance_X / Performance_Y = n

If X is n times as fast as Y, then the execution time on Y is n times as long as it is on X:

Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

Relative Performance

EXAMPLE

If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?

ANSWER

We know that A is n times as fast as B if

Performance_A / Performance_B = Execution time_B / Execution time_A = n

Thus the performance ratio is

15 / 10 = 1.5

and A is therefore 1.5 times as fast as B.

In the above example, we could also say that computer B is 1.5 times slower than computer A, since

Performance_A / Performance_B = 1.5

means that

Performance_A / 1.5 = Performance_B
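The same ratio expressed as a short Python sketch, using the times from the example:

def n_times_as_fast(execution_time_x, execution_time_y):
    """Performance_X / Performance_Y, computed from execution times."""
    return execution_time_y / execution_time_x

print(n_times_as_fast(10, 15))   # 1.5: A (10 s) is 1.5 times as fast as B (15 s)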



For simplicity, we will normally use the terminology as fast as when we try to compare computers quantitatively. Because performance and execution time are reciprocals, increasing performance requires decreasing execution time. To avoid the potential confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time” when we mean “increase performance” and “decrease execution time.”

CPU execution time Also called CPU time. The actual time the CPU spends computing for a specific task.

user CPU time The CPU time spent in a program itself.

system CPU time The CPU time spent in the operating system performing tasks on behalf of the program.

Measuring Performance

Time is the measure of computer performance: the computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. However, time can be defined in different ways, depending on what we count. The most straightforward definition of time is called wall clock time, response time, or elapsed time. These terms mean the total time to complete a task, including disk accesses, memory accesses, input/output (I/O) activities, operating system overhead—everything.

Computers are often shared, however, and a processor may work on several programs simultaneously. In such cases, the system may try to optimize throughput rather than attempt to minimize the elapsed time for one program. Hence, we often want to distinguish between the elapsed time and the time over which the processor is working on our behalf. CPU execution time or simply CPU time, which recognizes this distinction, is the time the CPU spends computing for this task and does not include time spent waiting for I/O or running other programs. (Remember, though, that the response time experienced by the user will be the elapsed time of the program, not the CPU time.) CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks on behalf of the program, called system CPU time. Differentiating between system and user CPU time is difficult to do accurately, because it is often hard to assign responsibility for operating system activities to one user program rather than another and because of the functionality differences among operating systems.

For consistency, we maintain a distinction between performance based on elapsed time and that based on CPU execution time. We will use the term system performance to refer to elapsed time on an unloaded system and CPU performance to refer to user CPU time. We will focus on CPU performance in this chapter, although our discussions of how to summarize performance can be applied to either elapsed time or CPU time measurements.

Understanding Program Performance

Different applications are sensitive to different aspects of the performance of a computer system. Many applications, especially those running on servers, depend as much on I/O performance, which, in turn, relies on both hardware and software. Total elapsed time measured by a wall clock is the measurement of interest. In some application environments, the user may care about throughput, response time, or a complex combination of the two (e.g., maximum throughput with a worst-case response time). To improve the performance of a program, one must have a clear definition of what performance metric matters and then proceed to look for performance bottlenecks by measuring program execution and looking for the likely bottlenecks. In the following chapters, we will describe how to search for bottlenecks and improve performance in various parts of the system.

Although as computer users we care about time, when we examine the details of a computer it’s convenient to think about performance in other metrics. In particular, computer designers may want to think about a computer by using a measure that relates to how fast the hardware can perform basic functions. Almost all computers are constructed using a clock that determines when events take place in the hardware. These discrete time intervals are called clock cycles (or ticks, clock ticks, clock periods, clocks, cycles). Designers refer to the length of a clock period both as the time for a complete clock cycle (e.g., 250 picoseconds, or 250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse of the clock period. In the next subsection, we will formalize the relationship between the clock cycles of the hardware designer and the seconds of the computer user.

clock cycle Also called tick, clock tick, clock period, clock, or cycle. The time for one clock period, usually of the processor clock, which runs at a constant rate.

clock period The length of each clock cycle.

Check Yourself

1. Suppose we know that an application that uses both personal mobile devices and the Cloud is limited by network performance. For the following changes, state whether only the throughput improves, both response time and throughput improve, or neither improves.

a. An extra network channel is added between the PMD and the Cloud, increasing the total network throughput and reducing the delay to obtain network access (since there are now two channels).

b. The networking software is improved, thereby reducing the network communication delay, but not increasing throughput.

c. More memory is added to the computer.

2. Computer C’s performance is 4 times as fast as the performance of computer B, which runs a given application in 28 seconds. How long will computer C take to run that application?

CPU Performance and Its Factors

Users and designers often examine performance using different metrics. If we could relate these different metrics, we could determine the effect of a design change on the performance as experienced by the user. Since we are confining ourselves to CPU performance at this point, the bottom-line performance measure is CPU



execution time. A simple formula relates the most basic metrics (clock cycles and clock cycle time) to CPU time:

CPU execution time for a program = CPU clock cycles for a program × Clock cycle time

Alternatively, because clock rate and clock cycle time are inverses,

CPU execution time for a program = CPU clock cycles for a program / Clock rate

This formula makes it clear that the hardware designer can improve performance by reducing the number of clock cycles required for a program or the length of the clock cycle. As we will see in later chapters, the designer often faces a trade-off between the number of clock cycles needed for a program and the length of each cycle. Many techniques that decrease the number of clock cycles may also increase the clock cycle time.

Improving Performance

EXAMPLE

Our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock. We are trying to help a computer designer build a computer, B, which will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?

ANSWER

Let’s first find the number of clock cycles required for the program on A:

CPU time_A = CPU clock cycles_A / Clock rate_A

10 seconds = CPU clock cycles_A / (2 × 10⁹ cycles/second)

CPU clock cycles_A = 10 seconds × 2 × 10⁹ cycles/second = 20 × 10⁹ cycles

CPU time for B can be found using this equation:

CPU time_B = 1.2 × CPU clock cycles_A / Clock rate_B

6 seconds = 1.2 × 20 × 10⁹ cycles / Clock rate_B

Clock rate_B = 1.2 × 20 × 10⁹ cycles / 6 seconds = 0.2 × 20 × 10⁹ cycles/second = 4 × 10⁹ cycles/second = 4 GHz

To run the program in 6 seconds, B must have twice the clock rate of A.
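The same arithmetic, written out as a short Python check of the example’s numbers:

# Computer A: 10 seconds at a 2 GHz clock; target for B: 6 seconds,
# with B needing 1.2 times as many clock cycles as A.
clock_rate_a = 2e9                             # cycles per second
cpu_time_a = 10.0                              # seconds
clock_cycles_a = cpu_time_a * clock_rate_a     # 20e9 cycles

cpu_time_b = 6.0                               # seconds
clock_cycles_b = 1.2 * clock_cycles_a          # 24e9 cycles
clock_rate_b = clock_cycles_b / cpu_time_b     # 4e9 cycles per second

print(clock_rate_b / 1e9, "GHz")               # 4.0 GHz, twice A's clock rate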

Instruction Performance

The performance equations above did not include any reference to the number of instructions needed for the program. However, since the compiler clearly generated instructions to execute, and the computer had to execute the instructions to run the program, the execution time must depend on the number of instructions in a program. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction. Therefore, the number of clock cycles required for a program can be written as

CPU clock cycles = Instructions for a program × Average clock cycles per instruction

The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is often abbreviated as CPI. Since different instructions may take different amounts of time depending on what they do, CPI is an average of all the instructions executed in the program. CPI provides one way of comparing two different implementations of the same instruction set architecture, since the number of instructions executed for a program will, of course, be the same.

clock cycles per instruction (CPI) Average number of clock cycles per instruction for a program or program fragment.

Using the Performance Equation

EXAMPLE

Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program and by how much?

ANSWER

We know that each computer executes the same number of instructions for the program; let’s call this number I. First, find the number of processor clock cycles for each computer:

CPU clock cycles_A = I × 2.0

CPU clock cycles_B = I × 1.2

Now we can compute the CPU time for each computer:

CPU time_A = CPU clock cycles_A × Clock cycle time = I × 2.0 × 250 ps = 500 × I ps

Likewise, for B:

CPU time_B = I × 1.2 × 500 ps = 600 × I ps

Clearly, computer A is faster. The amount faster is given by the ratio of the execution times:

CPU performance_A / CPU performance_B = Execution time_B / Execution time_A = 600 × I ps / 500 × I ps = 1.2

We can conclude that computer A is 1.2 times as fast as computer B for this program.

instruction count The number of instructions executed by the program.

The Classic CPU Performance Equation

We can now write this basic performance equation in terms of instruction count (the number of instructions executed by the program), CPI, and clock cycle time:

CPU time = Instruction count × CPI × Clock cycle time

or, since the clock rate is the inverse of clock cycle time:

CPU time = Instruction count × CPI / Clock rate

These formulas are particularly useful because they separate the three key factors that affect performance. We can use these formulas to compare two different implementations or to evaluate a design alternative if we know its impact on these three parameters.
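As a small sketch, the equation can be wrapped in a function and applied to the two implementations of the example above (the instruction count is arbitrary because it cancels in the ratio):

def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = Instruction count x CPI x Clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

I = 1_000_000                        # any instruction count; it cancels in the ratio
time_a = cpu_time(I, 2.0, 250e-12)   # computer A: CPI 2.0, 250 ps clock cycle
time_b = cpu_time(I, 1.2, 500e-12)   # computer B: CPI 1.2, 500 ps clock cycle
print(time_b / time_a)               # 1.2: A is 1.2 times as fast as B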



Although power provides a limit to what we can cool, in the PostPC Era the really critical resource is energy. Battery life can trump performance in the personal mobile device, and the architects of warehouse scale computers try to reduce the costs of powering and cooling 100,000 servers as the costs are high at this scale. Just as measuring time in seconds is a safer measure of program performance than a rate like MIPS (see Section 1.10), the energy metric joules is a better measure than a power rate like watts, which is just joules/second.

The dominant technology for integrated circuits is called CMOS (complementary metal oxide semiconductor). For CMOS, the primary source of energy consumption is so-called dynamic energy—that is, energy that is consumed when transistors switch states from 0 to 1 and vice versa. The dynamic energy depends on the capacitive loading of each transistor and the voltage applied:

Energy ∝ Capacitive load × Voltage²

This equation is the energy of a pulse during the logic transition of 0 → 1 → 0 or 1 → 0 → 1. The energy of a single transition is then

Energy ∝ 1/2 × Capacitive load × Voltage²

The power required per transistor is just the product of energy of a transition and the frequency of transitions:

Power ∝ 1/2 × Capacitive load × Voltage² × Frequency switched

Frequency switched is a function of the clock rate. The capacitive load per transistor is a function of both the number of transistors connected to an output (called the fanout) and the technology, which determines the capacitance of both wires and transistors.

With regard to Figure 1.16, how could clock rates grow by a factor of 1000 while power grew by only a factor of 30? Energy and thus power can be reduced by lowering the voltage, which occurred with each new generation of technology, and power is a function of the voltage squared. Typically, the voltage was reduced about 15% per generation. In 20 years, voltages have gone from 5 V to 1 V, which is why the increase in power is only 30 times.

Relative Power

EXAMPLE

Suppose we developed a new, simpler processor that has 85% of the capacitive load of the more complex older processor. Further, assume that it has adjustable voltage so that it can reduce voltage 15% compared to processor B, which results in a 15% shrink in frequency. What is the impact on dynamic power?



ANSWER

Power_new / Power_old = ((Capacitive load × 0.85) × (Voltage × 0.85)² × (Frequency switched × 0.85)) / (Capacitive load × Voltage² × Frequency switched)

Thus the power ratio is

0.85⁴ = 0.52

Hence, the new processor uses about half the power of the old processor.
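The ratio is easy to check numerically; a short Python sketch of the calculation:

# Capacitive load, voltage (squared), and switching frequency each scale by 0.85.
scale = 0.85
power_ratio = scale * scale**2 * scale   # load x voltage^2 x frequency
print(power_ratio)                       # 0.522..., about half the old dynamic power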

The problem today is that further lowering of the voltage appears to make the transistors too leaky, like water faucets that cannot be completely shut off. Even today about 40% of the power consumption in server chips is due to leakage. If transistors started leaking more, the whole process could become unwieldy.

To try to address the power problem, designers have already attached large devices to increase cooling, and they turn off parts of the chip that are not used in a given clock cycle. Although there are many more expensive ways to cool chips and thereby raise their power to, say, 300 watts, these techniques are generally too expensive for personal computers and even servers, not to mention personal mobile devices.

Since computer designers slammed into a power wall, they needed a new way forward. They chose a different path from the way they designed microprocessors for their first 30 years.

Elaboration: Although dynamic energy is the primary source of energy consumption in CMOS, static energy consumption occurs because of leakage current that flows even when a transistor is off. In servers, leakage is typically responsible for 40% of the energy consumption. Thus, increasing the number of transistors increases power dissipation, even if the transistors are always off. A variety of design techniques and technology innovations are being deployed to control leakage, but it’s hard to lower voltage further.

Elaboration: Power is a challenge for integrated circuits for two reasons. First, power must be brought in and distributed around the chip; modern microprocessors use hundreds of pins just for power and ground! Similarly, multiple levels of chip interconnect are used solely for power and ground distribution to portions of the chip. Second, power is dissipated as heat and must be removed. Server chips can burn more than 100 watts, and cooling the chip and the surrounding system is a major expense in Warehouse Scale Computers (see Chapter 6).



Description                          Name         Instruction    CPI    Clock cycle time   Execution time   Reference time   SPECratio
                                                  count × 10⁹           (seconds × 10⁻⁹)   (seconds)        (seconds)
Interpreted string processing        perl            2252        0.60        0.376              508             9770            19.2
Block-sorting compression            bzip2           2390        0.70        0.376              629             9650            15.4
GNU C compiler                       gcc              794        1.20        0.376              358             8050            22.5
Combinatorial optimization           mcf              221        2.66        0.376              221             9120            41.2
Go game (AI)                         go              1274        1.10        0.376              527            10490            19.9
Search gene sequence                 hmmer           2616        0.60        0.376              590             9330            15.8
Chess game (AI)                      sjeng           1948        0.80        0.376              586            12100            20.7
Quantum computer simulation          libquantum       659        0.44        0.376              109            20720           190.0
Video compression                    h264avc         3793        0.50        0.376              713            22130            31.0
Discrete event simulation library    omnetpp          367        2.10        0.376              290             6250            21.5
Games/path finding                   astar           1250        1.00        0.376              470             7020            14.9
XML parsing                          xalancbmk       1045        0.70        0.376              275             6900            25.1
Geometric mean                       –                  –           –            –                –                –            25.7

FIGURE 1.18 SPECINTC2006 benchmarks running on a 2.66 GHz Intel Core i7 920. As the equation on page 35 explains, execution time is the product of the three factors in this table: instruction count in billions, clocks per instruction (CPI), and clock cycle time in nanoseconds. SPECratio is simply the reference time, which is supplied by SPEC, divided by the measured execution time. The single number quoted as SPECINTC2006 is the geometric mean of the SPECratios.

set focusing on processor performance (now called SPEC89), which has evolved<br />

through five generations. The latest is SPEC CPU2006, which consists of a set of 12<br />

integer benchmarks (CINT2006) <strong>and</strong> 17 floating-point benchmarks (CFP2006).<br />

The integer benchmarks vary from part of a C compiler to a chess program to a<br />

quantum computer simulation. The floating-point benchmarks include structured<br />

grid codes for finite element modeling, particle method codes for molecular<br />

dynamics, <strong>and</strong> sparse linear algebra codes for fluid dynamics.<br />

Figure 1.18 describes the SPEC integer benchmarks <strong>and</strong> their execution time<br />

on the Intel Core i7 <strong>and</strong> shows the factors that explain execution time: instruction<br />

count, CPI, <strong>and</strong> clock cycle time. Note that CPI varies by more than a factor of 5.<br />

To simplify the marketing of computers, SPEC decided to report a single number<br />

to summarize all 12 integer benchmarks. Dividing the execution time of a reference<br />

processor by the execution time of the measured computer normalizes the execution<br />

time measurements; this normalization yields a measure, called the SPECratio, which<br />

has the advantage that bigger numeric results indicate faster performance. That is,<br />

the SPECratio is the inverse of execution time. A CINT2006 or CFP2006 summary<br />

measurement is obtained by taking the geometric mean of the SPECratios.<br />
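The following short C program is a minimal sketch (not from the text) of how SPECratios and their geometric mean can be computed; the reference and measured times are taken from the first three rows of Figure 1.18, and the log/exp trick is just one convenient way to form the n-th root of the product.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Reference and measured execution times (seconds) for perl, bzip2,
       and gcc from Figure 1.18. */
    double reference[] = { 9770.0, 9650.0, 8050.0 };
    double measured[]  = {  508.0,  629.0,  358.0 };
    int n = 3;

    double log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double specratio = reference[i] / measured[i];   /* bigger means faster */
        printf("SPECratio %d = %.1f\n", i, specratio);
        log_sum += log(specratio);                       /* accumulate in log space */
    }
    /* Geometric mean = exp(mean of the logs) = nth root of the product. */
    printf("Geometric mean = %.1f\n", exp(log_sum / n));
    return 0;
}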

Elaboration: When comparing two computers using SPECratios, use the geometric<br />

mean so that it gives the same relative answer no matter what computer is used to<br />

normalize the results. If we averaged the normalized execution time values with an<br />

arithmetic mean, the results would vary depending on the computer we choose as the<br />

reference.


48 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

The formula for the geometric mean is

Geometric mean = ⁿ√( ∏ from i = 1 to n of Execution time ratio_i )

where Execution time ratio_i is the execution time, normalized to the reference computer, for the ith program of a total of n in the workload, and

∏ from i = 1 to n of a_i means the product a₁ × a₂ × … × aₙ

SPEC Power Benchmark

Given the increasing importance of energy and power, SPEC added a benchmark to measure power. It reports power consumption of servers at different workload levels, divided into 10% increments, over a period of time. Figure 1.19 shows the results for a server using Intel Nehalem processors similar to the above.

Target Load %    Performance (ssj_ops)    Average Power (watts)

100% 865,618 258<br />

90% 786,688 242<br />

80% 698,051 224<br />

70% 607,826 204<br />

60% 521,391 185<br />

50% 436,757 170<br />

40% 345,919 157<br />

30% 262,071 146<br />

20% 176,061 135<br />

10% 86,784 121<br />

0% 0 80<br />

Overall Sum 4,787,166 1922<br />

∑ssj_ops / ∑power = 2490<br />

FIGURE 1.19 SPECpower_ssj2008 running on a dual socket 2.66 GHz Intel Xeon X5650<br />

with 16 GB of DRAM <strong>and</strong> one 100 GB SSD disk.<br />

SPECpower started with another SPEC benchmark for Java business applications<br />

(SPECJBB2005), which exercises the processors, caches, <strong>and</strong> main memory as well<br />

as the Java virtual machine, compiler, garbage collector, <strong>and</strong> pieces of the operating<br />

system. Performance is measured in throughput, <strong>and</strong> the units are business<br />

operations per second. Once again, to simplify the marketing of computers, SPEC


1.10 Fallacies <strong>and</strong> Pitfalls 49<br />

boils these numbers down to a single number, called “overall ssj_ops per watt.” The<br />

formula for this single summarizing metric is<br />

overall ssj_ops per watt = ( ∑ ssj_ops_i ) / ( ∑ power_i ), summing i from 0 to 10

where ssj_ops_i is performance at each 10% increment and power_i is power consumed at each performance level.
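As an illustrative sketch (not part of the SPEC material itself), the summary metric can be reproduced directly from the columns of Figure 1.19:

#include <stdio.h>

int main(void) {
    /* Performance (ssj_ops) and average power (watts) at each 10% load
       increment, from 100% load down to 0% (active idle), per Figure 1.19. */
    double ssj_ops[] = { 865618, 786688, 698051, 607826, 521391,
                         436757, 345919, 262071, 176061,  86784, 0 };
    double watts[]   = {    258,    242,    224,    204,    185,
                            170,    157,    146,    135,    121, 80 };
    double ops_sum = 0.0, power_sum = 0.0;
    for (int i = 0; i < 11; i++) {
        ops_sum   += ssj_ops[i];
        power_sum += watts[i];
    }
    /* Overall ssj_ops per watt = (sum of ssj_ops) / (sum of power);
       Figure 1.19 reports this as 2490. */
    printf("overall ssj_ops per watt = %.1f\n", ops_sum / power_sum);
    return 0;
}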

1.10<br />

Fallacies <strong>and</strong> Pitfalls<br />

The purpose of a section on fallacies <strong>and</strong> pitfalls, which will be found in every<br />

chapter, is to explain some commonly held misconceptions that you might<br />

encounter. We call them fallacies. When discussing a fallacy, we try to give a<br />

counterexample. We also discuss pitfalls, or easily made mistakes. Often pitfalls are<br />

generalizations of principles that are only true in a limited context. The purpose<br />

of these sections is to help you avoid making these mistakes in the computers you<br />

may design or use. Cost/performance fallacies <strong>and</strong> pitfalls have ensnared many a<br />

computer architect, including us. Accordingly, this section suffers no shortage of<br />

relevant examples. We start with a pitfall that traps many designers <strong>and</strong> reveals an<br />

important relationship in computer design.<br />

Pitfall: Expecting the improvement of one aspect of a computer to increase overall<br />

performance by an amount proportional to the size of the improvement.<br />

The great idea of making the common case fast has a demoralizing corollary<br />

that has plagued designers of both hardware <strong>and</strong> software. It reminds us that the<br />

opportunity for improvement is affected by how much time the event consumes.<br />

A simple design problem illustrates it well. Suppose a program runs in 100<br />

seconds on a computer, with multiply operations responsible for 80 seconds of this<br />

time. How much do I have to improve the speed of multiplication if I want my<br />

program to run five times faster?<br />

The execution time of the program after making the improvement is given by<br />

the following simple equation known as Amdahl’s Law:<br />

Execution time after improvement =
    (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

For this problem:

Execution time after improvement = (80 seconds / n) + (100 − 80 seconds)

Science must begin<br />

with myths, <strong>and</strong> the<br />

criticism of myths.<br />

Sir Karl Popper, The<br />

Philosophy of Science,<br />

1957<br />

Amdahl’s Law<br />

A rule stating that<br />

the performance<br />

enhancement possible<br />

with a given improvement<br />

is limited by the amount<br />

that the improved feature<br />

is used. It is a quantitative<br />

version of the law of<br />

diminishing returns.


50 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

Since we want the performance to be five times faster, the new execution time<br />

should be 20 seconds, giving<br />

20 seconds = (80 seconds / n) + 20 seconds

0 = 80 seconds / n

That is, there is no amount by which we can enhance multiply to achieve a fivefold

increase in performance, if multiply accounts for only 80% of the workload. The<br />

performance enhancement possible with a given improvement is limited by the amount<br />

that the improved feature is used. In everyday life this concept also yields what we call<br />

the law of diminishing returns.<br />

We can use Amdahl’s Law to estimate performance improvements when we<br />

know the time consumed for some function <strong>and</strong> its potential speedup. Amdahl’s<br />

Law, together with the CPU performance equation, is a h<strong>and</strong>y tool for evaluating<br />

potential enhancements. Amdahl’s Law is explored in more detail in the exercises.<br />

Amdahl’s Law is also used to argue for practical limits to the number of parallel<br />

processors. We examine this argument in the Fallacies <strong>and</strong> Pitfalls section of<br />

Chapter 6.<br />
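A small C sketch (illustrative only, using the numbers from the multiply example) makes the limit concrete: no matter how large the multiply speedup n becomes, the 20 unaffected seconds cap the overall speedup below 5.

#include <stdio.h>

/* Amdahl's Law: new time = affected time / improvement + unaffected time. */
static double time_after(double affected, double unaffected, double improvement) {
    return affected / improvement + unaffected;
}

int main(void) {
    double total = 100.0, multiply = 80.0;                 /* seconds */
    for (double n = 2.0; n <= 64.0; n *= 2.0) {
        double t = time_after(multiply, total - multiply, n);
        printf("multiply %2.0fx faster -> %5.2f s total (speedup %.2f)\n",
               n, t, total / t);
    }
    return 0;
}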

Fallacy: <strong>Computer</strong>s at low utilization use little power.<br />

Power efficiency matters at low utilizations because server workloads vary.<br />

Utilization of servers in Google’s warehouse scale computer, for example, is<br />

between 10% <strong>and</strong> 50% most of the time <strong>and</strong> at 100% less than 1% of the time. Even<br />

given five years to learn how to run the SPECpower benchmark well, the specially<br />

configured computer with the best results in 2012 still uses 33% of the peak power<br />

at 10% of the load. Systems in the field that are not configured for the SPECpower<br />

benchmark are surely worse.<br />

Since servers’ workloads vary but use a large fraction of peak power, Luiz<br />

Barroso <strong>and</strong> Urs Hölzle [2007] argue that we should redesign hardware to achieve<br />

“energy-proportional computing.” If future servers used, say, 10% of peak power at<br />

10% workload, we could reduce the electricity bill of datacenters <strong>and</strong> become good<br />

corporate citizens in an era of increasing concern about CO 2<br />

emissions.<br />

Fallacy: <strong>Design</strong>ing for performance <strong>and</strong> designing for energy efficiency are<br />

unrelated goals.<br />

Since energy is power over time, it is often the case that hardware or software<br />

optimizations that take less time save energy overall even if the optimization takes<br />

a bit more energy when it is used. One reason is that all of the rest of the computer is<br />

consuming energy while the program is running, so even if the optimized portion<br />

uses a little more energy, the reduced time can save the energy of the whole system.<br />

Pitfall: Using a subset of the performance equation as a performance metric.<br />

We have already warned about the danger of predicting performance based on<br />

simply one of clock rate, instruction count, or CPI. Another common mistake


1.10 Fallacies <strong>and</strong> Pitfalls 51<br />

is to use only two of the three factors to compare performance. Although using<br />

two of the three factors may be valid in a limited context, the concept is also<br />

easily misused. Indeed, nearly all proposed alternatives to the use of time as the<br />

performance metric have led eventually to misleading claims, distorted results, or<br />

incorrect interpretations.<br />

One alternative to time is MIPS (million instructions per second). For a given<br />

program, MIPS is simply<br />

MIPS = Instruction count / (Execution time × 10⁶)

Since MIPS is an instruction execution rate, MIPS specifies performance inversely<br />

to execution time; faster computers have a higher MIPS rating. The good news<br />

about MIPS is that it is easy to underst<strong>and</strong>, <strong>and</strong> faster computers mean bigger<br />

MIPS, which matches intuition.<br />

There are three problems with using MIPS as a measure for comparing computers.<br />

First, MIPS specifies the instruction execution rate but does not take into account<br />

the capabilities of the instructions. We cannot compare computers with different<br />

instruction sets using MIPS, since the instruction counts will certainly differ.<br />

Second, MIPS varies between programs on the same computer; thus, a computer<br />

cannot have a single MIPS rating. For example, by substituting for execution time,<br />

we see the relationship between MIPS, clock rate, <strong>and</strong> CPI:<br />

MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / (((Instruction count × CPI) / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

million instructions<br />

per second (MIPS)<br />

A measurement of<br />

program execution speed<br />

based on the number of<br />

millions of instructions.<br />

MIPS is computed as the<br />

instruction count divided<br />

by the product of the<br />

execution time <strong>and</strong> 10 6 .<br />

The CPI varied by a factor of 5 for SPEC CPU2006 on an Intel Core i7 computer<br />

in Figure 1.18, so MIPS does as well. Finally, <strong>and</strong> most importantly, if a new<br />

program executes more instructions but each instruction is faster, MIPS can vary<br />

independently from performance!<br />
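The mismatch is easy to demonstrate with made-up numbers (the two machines and their values below are hypothetical, chosen only so that the MIPS ranking and the execution-time ranking disagree):

#include <stdio.h>

int main(void) {
    double clock_rate = 4e9;                /* 4 GHz for both machines     */
    double instr[2]   = { 4e9, 9e9 };       /* instruction counts          */
    double cpi[2]     = { 1.0, 0.8 };       /* clocks per instruction      */

    for (int i = 0; i < 2; i++) {
        double exec_time = instr[i] * cpi[i] / clock_rate;
        double mips      = instr[i] / (exec_time * 1e6);  /* equals clock rate / (CPI * 1e6) */
        printf("Machine %c: time = %.2f s, MIPS = %.0f\n",
               'A' + i, exec_time, mips);
    }
    /* The second machine posts the higher MIPS rating yet takes longer
       to run the program, which is exactly the pitfall. */
    return 0;
}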

Consider the following performance measurements for a program:<br />

Check Yourself

Measurement          Computer A    Computer B
Instruction count    10 billion    8 billion
Clock rate           4 GHz         4 GHz
CPI                  1.0           1.1

a. Which computer has the higher MIPS rating?<br />

b. Which computer is faster?


1.13 Exercises 55<br />

e. Library reserve desk<br />

f. Increasing the gate area on a CMOS transistor to decrease its switching time<br />

g. Adding electromagnetic aircraft catapults (which are electrically-powered<br />

as opposed to current steam-powered models), allowed by the increased power<br />

generation offered by the new reactor technology<br />

h. Building self-driving cars whose control systems partially rely on existing sensor<br />

systems already installed into the base vehicle, such as lane departure systems <strong>and</strong><br />

smart cruise control systems<br />

1.3 [2] Describe the steps that transform a program written in a high-level<br />

language such as C into a representation that is directly executed by a computer<br />

processor.<br />

1.4 [2] Assume a color display using 8 bits for each of the primary colors<br />

(red, green, blue) per pixel <strong>and</strong> a frame size of 1280 × 1024.<br />

a. What is the minimum size in bytes of the frame buffer to store a frame?<br />

b. How long would it take, at a minimum, for the frame to be sent over a 100<br />

Mbit/s network?<br />

1.5 [4] Consider three different processors P1, P2, <strong>and</strong> P3 executing<br />

the same instruction set. P1 has a 3 GHz clock rate <strong>and</strong> a CPI of 1.5. P2 has a<br />

2.5 GHz clock rate <strong>and</strong> a CPI of 1.0. P3 has a 4.0 GHz clock rate <strong>and</strong> has a CPI<br />

of 2.2.<br />

a. Which processor has the highest performance expressed in instructions per second?<br />

b. If the processors each execute a program in 10 seconds, find the number of<br />

cycles <strong>and</strong> the number of instructions.<br />

c. We are trying to reduce the execution time by 30% but this leads to an increase<br />

of 20% in the CPI. What clock rate should we have to get this time reduction?<br />

1.6 [20] Consider two different implementations of the same instruction<br />

set architecture. The instructions can be divided into four classes according to<br />

their CPI (class A, B, C, <strong>and</strong> D). P1 with a clock rate of 2.5 GHz <strong>and</strong> CPIs of 1, 2, 3,<br />

<strong>and</strong> 3, <strong>and</strong> P2 with a clock rate of 3 GHz <strong>and</strong> CPIs of 2, 2, 2, <strong>and</strong> 2.<br />

Given a program with a dynamic instruction count of 1.0E6 instructions divided<br />

into classes as follows: 10% class A, 20% class B, 50% class C, <strong>and</strong> 20% class D,<br />

which implementation is faster?<br />

a. What is the global CPI for each implementation?<br />

b. Find the clock cycles required in both cases.


56 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

1.7 [15] Compilers can have a profound impact on the performance<br />

of an application. Assume that for a program, compiler A results in a dynamic<br />

instruction count of 1.0E9 <strong>and</strong> has an execution time of 1.1 s, while compiler B<br />

results in a dynamic instruction count of 1.2E9 <strong>and</strong> an execution time of 1.5 s.<br />

a. Find the average CPI for each program given that the processor has a clock cycle<br />

time of 1 ns.<br />

b. Assume the compiled programs run on two different processors. If the execution<br />

times on the two processors are the same, how much faster is the clock of the<br />

processor running compiler A’s code versus the clock of the processor running<br />

compiler B’s code?<br />

c. A new compiler is developed that uses only 6.0E8 instructions <strong>and</strong> has an<br />

average CPI of 1.1. What is the speedup of using this new compiler versus using<br />

compiler A or B on the original processor?<br />

1.8 The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6<br />

GHz <strong>and</strong> voltage of 1.25 V. Assume that, on average, it consumed 10 W of static<br />

power <strong>and</strong> 90 W of dynamic power.<br />

The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz <strong>and</strong> voltage<br />

of 0.9 V. Assume that, on average, it consumed 30 W of static power <strong>and</strong> 40 W of<br />

dynamic power.<br />

1.8.1 [5] For each processor find the average capacitive loads.<br />

1.8.2 [5] Find the percentage of the total dissipated power comprised by<br />

static power <strong>and</strong> the ratio of static power to dynamic power for each technology.<br />

1.8.3 [15] If the total dissipated power is to be reduced by 10%, how much<br />

should the voltage be reduced to maintain the same leakage current? Note: power<br />

is defined as the product of voltage <strong>and</strong> current.<br />

1.9 Assume for arithmetic, load/store, <strong>and</strong> branch instructions, a processor has<br />

CPIs of 1, 12, <strong>and</strong> 5, respectively. Also assume that on a single processor a program<br />

requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store<br />

instructions, <strong>and</strong> 256 million branch instructions. Assume that each processor has<br />

a 2 GHz clock frequency.<br />

Assume that, as the program is parallelized to run over multiple cores, the number<br />

of arithmetic and load/store instructions per processor is divided by 0.7 × p (where

p is the number of processors) but the number of branch instructions per processor<br />

remains the same.<br />

1.9.1 [5] Find the total execution time for this program on 1, 2, 4, <strong>and</strong> 8<br />

processors, <strong>and</strong> show the relative speedup of the 2, 4, <strong>and</strong> 8 processor result relative<br />

to the single processor result.


1.13 Exercises 57<br />

1.9.2 [10] If the CPI of the arithmetic instructions was doubled,<br />

what would the impact be on the execution time of the program on 1, 2, 4, or 8<br />

processors?<br />

1.9.3 [10] To what should the CPI of load/store instructions be<br />

reduced in order for a single processor to match the performance of four processors<br />

using the original CPI values?<br />

1.10 Assume a 15 cm diameter wafer has a cost of 12, contains 84 dies, <strong>and</strong> has<br />

0.020 defects/cm 2 . Assume a 20 cm diameter wafer has a cost of 15, contains 100<br />

dies, <strong>and</strong> has 0.031 defects/cm 2 .<br />

1.10.1 [10] Find the yield for both wafers.<br />

1.10.2 [5] Find the cost per die for both wafers.<br />

1.10.3 [5] If the number of dies per wafer is increased by 10% <strong>and</strong> the<br />

defects per area unit increases by 15%, find the die area <strong>and</strong> yield.<br />

1.10.4 [5] Assume a fabrication process improves the yield from 0.92 to<br />

0.95. Find the defects per area unit for each version of the technology given a die<br />

area of 200 mm 2 .<br />

1.11 The result of the SPEC CPU2006 bzip2 benchmark running on an AMD

Barcelona has an instruction count of 2.389E12, an execution time of 750 s, <strong>and</strong> a<br />

reference time of 9650 s.<br />

1.11.1 [5] Find the CPI if the clock cycle time is 0.333 ns.<br />

1.11.2 [5] Find the SPECratio.<br />

1.11.3 [5] Find the increase in CPU time if the number of instructions<br />

of the benchmark is increased by 10% without affecting the CPI.<br />

1.11.4 [5] Find the increase in CPU time if the number of instructions<br />

of the benchmark is increased by 10% <strong>and</strong> the CPI is increased by 5%.<br />

1.11.5 [5] Find the change in the SPECratio for this change.<br />

1.11.6 [10] Suppose that we are developing a new version of the AMD<br />

Barcelona processor with a 4 GHz clock rate. We have added some additional<br />

instructions to the instruction set in such a way that the number of instructions<br />

has been reduced by 15%. The execution time is reduced to 700 s <strong>and</strong> the new<br />

SPECratio is 13.7. Find the new CPI.<br />

1.11.7 [10] This CPI value is larger than obtained in 1.11.1 as the clock<br />

rate was increased from 3 GHz to 4 GHz. Determine whether the increase in the<br />

CPI is similar to that of the clock rate. If they are dissimilar, why?<br />

1.11.8 [5] By how much has the CPU time been reduced?


58 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />

1.11.9 [10] For a second benchmark, libquantum, assume an execution<br />

time of 960 ns, CPI of 1.61, <strong>and</strong> clock rate of 3 GHz. If the execution time is<br />

reduced by an additional 10% without affecting the CPI and with a clock rate of

4 GHz, determine the number of instructions.<br />

1.11.10 [10] Determine the clock rate required to give a further 10%<br />

reduction in CPU time while maintaining the number of instructions <strong>and</strong> with the<br />

CPI unchanged.<br />

1.11.11 [10] Determine the clock rate if the CPI is reduced by 15% <strong>and</strong><br />

the CPU time by 20% while the number of instructions is unchanged.<br />

1.12 Section 1.10 cites as a pitfall the utilization of a subset of the performance<br />

equation as a performance metric. To illustrate this, consider the following two<br />

processors. P1 has a clock rate of 4 GHz, average CPI of 0.9, <strong>and</strong> requires the<br />

execution of 5.0E9 instructions. P2 has a clock rate of 3 GHz, an average CPI of<br />

0.75, <strong>and</strong> requires the execution of 1.0E9 instructions.<br />

1.12.1 [5] One usual fallacy is to consider the computer with the<br />

largest clock rate as having the largest performance. Check if this is true for P1 <strong>and</strong><br />

P2.<br />

1.12.2 [10] Another fallacy is to consider that the processor executing<br />

the largest number of instructions will need a larger CPU time. Considering that<br />

processor P1 is executing a sequence of 1.0E9 instructions <strong>and</strong> that the CPI of<br />

processors P1 <strong>and</strong> P2 do not change, determine the number of instructions that P2<br />

can execute in the same time that P1 needs to execute 1.0E9 instructions.<br />

1.12.3 [10] A common fallacy is to use MIPS (millions of<br />

instructions per second) to compare the performance of two different processors,<br />

<strong>and</strong> consider that the processor with the largest MIPS has the largest performance.<br />

Check if this is true for P1 <strong>and</strong> P2.<br />

1.12.4 [10] Another common performance figure is MFLOPS (millions<br />

of floating-point operations per second), defined as<br />

MFLOPS = No. FP operations / (execution time × 1E6)<br />

but this figure has the same problems as MIPS. Assume that 40% of the instructions<br />

executed on both P1 <strong>and</strong> P2 are floating-point instructions. Find the MFLOPS<br />

figures for the programs.<br />

1.13 Another pitfall cited in Section 1.10 is expecting to improve the overall<br />

performance of a computer by improving only one aspect of the computer. Consider<br />

a computer running a program that requires 250 s, with 70 s spent executing FP<br />

instructions, 85 s spent executing L/S instructions, and 40 s spent executing branch

instructions.<br />

1.13.1 [5] By how much is the total time reduced if the time for FP<br />

operations is reduced by 20%?


1.13 Exercises 59<br />

1.13.2 [5] By how much is the time for INT operations reduced if the<br />

total time is reduced by 20%?<br />

1.13.3 [5] Can the total time be reduced by 20% by reducing only

the time for branch instructions?<br />

1.14 Assume a program requires the execution of 50 × 10⁶ FP instructions, 110 × 10⁶ INT instructions, 80 × 10⁶ L/S instructions, and 16 × 10⁶ branch

instructions. The CPI for each type of instruction is 1, 1, 4, <strong>and</strong> 2, respectively.<br />

Assume that the processor has a 2 GHz clock rate.<br />

1.14.1 [10] By how much must we improve the CPI of FP instructions if<br />

we want the program to run two times faster?<br />

1.14.2 [10] By how much must we improve the CPI of L/S instructions<br />

if we want the program to run two times faster?<br />

1.14.3 [5] By how much is the execution time of the program improved<br />

if the CPI of INT <strong>and</strong> FP instructions is reduced by 40% <strong>and</strong> the CPI of L/S <strong>and</strong><br />

Branch is reduced by 30%?<br />

1.15 [5] When a program is adapted to run on multiple processors in<br />

a multiprocessor system, the execution time on each processor is comprised of<br />

computing time <strong>and</strong> the overhead time required for locked critical sections <strong>and</strong>/or<br />

to send data from one processor to another.<br />

Assume a program requires t = 100 s of execution time on one processor. When run<br />

on p processors, each processor requires t/p s, as well as an additional 4 s of overhead,

irrespective of the number of processors. Compute the per-processor execution<br />

time for 2, 4, 8, 16, 32, 64, <strong>and</strong> 128 processors. For each case, list the corresponding<br />

speedup relative to a single processor <strong>and</strong> the ratio between actual speedup versus<br />

ideal speedup (speedup if there was no overhead).<br />

§1.1, page 10: Discussion questions: many answers are acceptable.<br />

§1.4, page 24: DRAM memory: volatile, short access time of 50 to 70 nanoseconds,<br />

<strong>and</strong> cost per GB is $5 to $10. Disk memory: nonvolatile, access times are 100,000<br />

to 400,000 times slower than DRAM, <strong>and</strong> cost per GB is 100 times cheaper than<br />

DRAM. Flash memory: nonvolatile, access times are 100 to 1000 times slower than<br />

DRAM, <strong>and</strong> cost per GB is 7 to 10 times cheaper than DRAM.<br />

§1.5, page 28: 1, 3, <strong>and</strong> 4 are valid reasons. Answer 5 can be generally true because<br />

high volume can make the extra investment to reduce die size by, say, 10% a good<br />

economic decision, but it doesn’t have to be true.<br />

§1.6, page 33: 1. a: both, b: latency, c: neither. 7 seconds.<br />

§1.6, page 40: b.<br />

§1.10, page 51: a. <strong>Computer</strong> A has the higher MIPS rating. b. <strong>Computer</strong> B is faster.<br />

Answers to<br />

Check Yourself


2<br />

I speak Spanish<br />

to God, Italian to<br />

women, French to<br />

men, <strong>and</strong> German to<br />

my horse.<br />

Charles V, Holy Roman Emperor<br />

(1500–1558)<br />

Instructions:<br />

Language of the<br />

<strong>Computer</strong><br />

2.1 Introduction 62<br />

2.2 Operations of the <strong>Computer</strong> Hardware 63<br />

2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 66<br />

2.4 Signed <strong>and</strong> Unsigned Numbers 73<br />

2.5 Representing Instructions in the<br />

<strong>Computer</strong> 80<br />

2.6 Logical Operations 87<br />

2.7 Instructions for Making Decisions 90<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


2.2 Operations of the <strong>Computer</strong> Hardware 65<br />

instruction. Another difference from C is that comments always terminate at the<br />

end of a line.<br />

The natural number of oper<strong>and</strong>s for an operation like addition is three: the<br />

two numbers being added together <strong>and</strong> a place to put the sum. Requiring every<br />

instruction to have exactly three oper<strong>and</strong>s, no more <strong>and</strong> no less, conforms to the<br />

philosophy of keeping the hardware simple: hardware for a variable number of<br />

oper<strong>and</strong>s is more complicated than hardware for a fixed number. This situation<br />

illustrates the first of three underlying principles of hardware design:<br />

<strong>Design</strong> Principle 1: Simplicity favors regularity.<br />

We can now show, in the two examples that follow, the relationship of programs<br />

written in higher-level programming languages to programs in this more primitive<br />

notation.<br />

Compiling Two C Assignment Statements into MIPS<br />

This segment of a C program contains the five variables a, b, c, d, <strong>and</strong> e. Since<br />

Java evolved from C, this example <strong>and</strong> the next few work for either high-level<br />

programming language:<br />

EXAMPLE<br />

a = b + c;<br />

d = a – e;<br />

The translation from C to MIPS assembly language instructions is performed<br />

by the compiler. Show the MIPS code produced by a compiler.<br />

A MIPS instruction operates on two source oper<strong>and</strong>s <strong>and</strong> places the result<br />

in one destination oper<strong>and</strong>. Hence, the two simple statements above compile<br />

directly into these two MIPS assembly language instructions:<br />

ANSWER<br />

add a, b, c<br />

sub d, a, e<br />

Compiling a Complex C Assignment into MIPS<br />

A somewhat complex statement contains the five variables f, g, h, i, <strong>and</strong> j:<br />

EXAMPLE<br />

f = (g + h) – (i + j);<br />

What might a C compiler produce?


68 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

ANSWER<br />

The compiled program is very similar to the prior example, except we replace<br />

the variables with the register names mentioned above plus two temporary<br />

registers, $t0 <strong>and</strong> $t1, which correspond to the temporary variables above:<br />

add $t0,$s1,$s2 # register $t0 contains g + h<br />

add $t1,$s3,$s4 # register $t1 contains i + j<br />

sub $s0,$t0,$t1 # f gets $t0 – $t1, which is (g + h)–(i + j)<br />

data transfer<br />

instruction A comm<strong>and</strong><br />

that moves data between<br />

memory <strong>and</strong> registers.<br />

address A value used to<br />

delineate the location of<br />

a specific data element<br />

within a memory array.<br />

Memory Oper<strong>and</strong>s<br />

Programming languages have simple variables that contain single data elements,<br />

as in these examples, but they also have more complex data structures—arrays <strong>and</strong><br />

structures. These complex data structures can contain many more data elements<br />

than there are registers in a computer. How can a computer represent <strong>and</strong> access<br />

such large structures?<br />

Recall the five components of a computer introduced in Chapter 1 <strong>and</strong> repeated<br />

on page 61. The processor can keep only a small amount of data in registers, but<br />

computer memory contains billions of data elements. Hence, data structures<br />

(arrays <strong>and</strong> structures) are kept in memory.<br />

As explained above, arithmetic operations occur only on registers in MIPS<br />

instructions; thus, MIPS must include instructions that transfer data between<br />

memory <strong>and</strong> registers. Such instructions are called data transfer instructions.<br />

To access a word in memory, the instruction must supply the memory address.<br />

Memory is just a large, single-dimensional array, with the address acting as the<br />

index to that array, starting at 0. For example, in Figure 2.2, the address of the third<br />

data element is 2, <strong>and</strong> the value of Memory [2] is 10.<br />

Address    Data
   3        100
   2         10
   1        101
   0          1

FIGURE 2.2 Memory addresses <strong>and</strong> contents of memory at those locations. If these elements<br />

were words, these addresses would be incorrect, since MIPS actually uses byte addressing, with each word<br />

representing four bytes. Figure 2.3 shows the memory addressing for sequential word addresses.<br />

The data transfer instruction that copies data from memory to a register is<br />

traditionally called load. The format of the load instruction is the name of the<br />

operation followed by the register to be loaded, then a constant <strong>and</strong> register used to<br />

access memory. The sum of the constant portion of the instruction <strong>and</strong> the contents<br />

of the second register forms the memory address. The actual MIPS name for this<br />

instruction is lw, st<strong>and</strong>ing for load word.


2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 69<br />

Compiling an Assignment When an Oper<strong>and</strong> Is in Memory<br />

Let’s assume that A is an array of 100 words <strong>and</strong> that the compiler has<br />

associated the variables g <strong>and</strong> h with the registers $s1 <strong>and</strong> $s2 as before.<br />

Let’s also assume that the starting address, or base address, of the array is in<br />

$s3. Compile this C assignment statement:<br />

EXAMPLE<br />

g = h + A[8];<br />

Although there is a single operation in this assignment statement, one of<br />

the oper<strong>and</strong>s is in memory, so we must first transfer A[8] to a register. The<br />

address of this array element is the sum of the base of the array A, found in<br />

register $s3, plus the number to select element 8. The data should be placed<br />

in a temporary register for use in the next instruction. Based on Figure 2.2, the<br />

first compiled instruction is<br />

ANSWER<br />

lw $t0,8($s3) # Temporary reg $t0 gets A[8]

(We’ll be making a slight adjustment to this instruction, but we’ll use this<br />

simplified version for now.) The following instruction can operate on the value<br />

in $t0 (which equals A[8]) since it is in a register. The instruction must add<br />

h (contained in $s2) to A[8] (contained in $t0) <strong>and</strong> put the sum in the<br />

register corresponding to g (associated with $s1):<br />

add $s1,$s2,$t0 # g = h + A[8]

The constant in a data transfer instruction (8) is called the offset, <strong>and</strong> the<br />

register added to form the address ($s3) is called the base register.<br />

In addition to associating variables with registers, the compiler allocates data<br />

structures like arrays <strong>and</strong> structures to locations in memory. The compiler can then<br />

place the proper starting address into the data transfer instructions.<br />

Since 8-bit bytes are useful in many programs, virtually all architectures today<br />

address individual bytes. Therefore, the address of a word matches the address of<br />

one of the 4 bytes within the word, <strong>and</strong> addresses of sequential words differ by 4.<br />

For example, Figure 2.3 shows the actual MIPS addresses for the words in Figure<br />

2.2; the byte address of the third word is 8.<br />

In MIPS, words must start at addresses that are multiples of 4. This requirement<br />

is called an alignment restriction, <strong>and</strong> many architectures have it. (Chapter 4<br />

suggests why alignment leads to faster data transfers.)<br />

Hardware/<br />

Software<br />

Interface<br />

alignment restriction<br />

A requirement that data<br />

be aligned in memory on<br />

natural boundaries.


70 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

Byte Address    Data
     12         100
      8          10
      4         101
      0           1

FIGURE 2.3 Actual MIPS memory addresses <strong>and</strong> contents of memory for those words.<br />

The changed addresses are highlighted to contrast with Figure 2.2. Since MIPS addresses each byte, word<br />

addresses are multiples of 4: there are 4 bytes in a word.<br />

<strong>Computer</strong>s divide into those that use the address of the leftmost or “big end” byte<br />

as the word address versus those that use the rightmost or “little end” byte. MIPS is<br />

in the big-endian camp. Since the order matters only if you access the identical data<br />

both as a word <strong>and</strong> as four bytes, few need to be aware of the endianess. (Appendix<br />

A shows the two options to number bytes in a word.)<br />
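For readers curious how this looks from C, here is a tiny, illustrative check (not from the text) of which camp the machine running it falls into, by examining the byte stored at a word's lowest address:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 1;                        /* 0x00000001 */
    uint8_t first_byte = *(uint8_t *)&word;   /* byte at the lowest address */
    /* Big-endian machines keep the most significant byte (0) first;
       little-endian machines keep the least significant byte (1) first. */
    printf("this host is %s-endian\n", first_byte == 1 ? "little" : "big");
    return 0;
}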

Byte addressing also affects the array index. To get the proper byte address in the<br />

code above, the offset to be added to the base register $s3 must be 4 × 8, or 32, so

that the load address will select A[8] <strong>and</strong> not A[8/4]. (See the related pitfall on<br />

page 160 of Section 2.19.)<br />
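A one-line C check (illustrative, not from the text) shows the same 4 × 8 arithmetic that a compiler folds into the lw offset:

#include <stdio.h>

int main(void) {
    int A[100];
    /* With 4-byte words, element 8 lives 4 * 8 = 32 bytes past the base. */
    long offset = (char *)&A[8] - (char *)&A[0];
    printf("byte offset of A[8] = %ld\n", offset);   /* prints 32 */
    return 0;
}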

The instruction complementary to load is traditionally called store; it copies data<br />

from a register to memory. The format of a store is similar to that of a load: the<br />

name of the operation, followed by the register to be stored, then offset to select<br />

the array element, <strong>and</strong> finally the base register. Once again, the MIPS address is<br />

specified in part by a constant <strong>and</strong> in part by the contents of a register. The actual<br />

MIPS name is sw, st<strong>and</strong>ing for store word.<br />

Hardware/<br />

Software<br />

Interface<br />

As the addresses in loads <strong>and</strong> stores are binary numbers, we can see why the<br />

DRAM for main memory comes in binary sizes rather than in decimal sizes. That<br />

is, in gibibytes (2³⁰) or tebibytes (2⁴⁰), not in gigabytes (10⁹) or terabytes (10¹²); see

Figure 1.1.


2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 71<br />

Compiling Using Load <strong>and</strong> Store<br />

Assume variable h is associated with register $s2 <strong>and</strong> the base address of<br />

the array A is in $s3. What is the MIPS assembly code for the C assignment<br />

statement below?<br />

EXAMPLE<br />

A[12] = h + A[8];<br />

Although there is a single operation in the C statement, now two of the<br />

oper<strong>and</strong>s are in memory, so we need even more MIPS instructions. The first<br />

two instructions are the same as in the prior example, except this time we use<br />

the proper offset for byte addressing in the load word instruction to select<br />

A[8], <strong>and</strong> the add instruction places the sum in $t0:<br />

ANSWER<br />

lw $t0,32($s3) # Temporary reg $t0 gets A[8]<br />

add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[8]<br />

The final instruction stores the sum into A[12], using 48 (4 × 12) as the offset

<strong>and</strong> register $s3 as the base register.<br />

sw $t0,48($s3) # Stores h + A[8] back into A[12]<br />

Load word <strong>and</strong> store word are the instructions that copy words between<br />

memory <strong>and</strong> registers in the MIPS architecture. Other br<strong>and</strong>s of computers use<br />

other instructions along with load <strong>and</strong> store to transfer data. An architecture with<br />

such alternatives is the Intel x86, described in Section 2.17.<br />

Many programs have more variables than computers have registers. Consequently,<br />

the compiler tries to keep the most frequently used variables in registers <strong>and</strong> places<br />

the rest in memory, using loads <strong>and</strong> stores to move variables between registers <strong>and</strong><br />

memory. The process of putting less commonly used variables (or those needed<br />

later) into memory is called spilling registers.<br />

The hardware principle relating size <strong>and</strong> speed suggests that memory must be<br />

slower than registers, since there are fewer registers. This is indeed the case; data<br />

accesses are faster if data is in registers instead of memory.<br />

Moreover, data is more useful when in a register. A MIPS arithmetic instruction<br />

can read two registers, operate on them, <strong>and</strong> write the result. A MIPS data transfer<br />

instruction only reads one oper<strong>and</strong> or writes one oper<strong>and</strong>, without operating on it.<br />

Thus, registers take less time to access <strong>and</strong> have higher throughput than memory,<br />

making data in registers both faster to access <strong>and</strong> simpler to use. Accessing registers<br />

also uses less energy than accessing memory. To achieve highest performance <strong>and</strong><br />

conserve energy, an instruction set architecture must have a sufficient number of<br />

registers, <strong>and</strong> compilers must use registers efficiently.<br />

Hardware/<br />

Software<br />

Interface


74 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

We number the bits 0, 1, 2, 3, . . . from right to left in a word. The drawing below<br />

shows the numbering of bits within a MIPS word <strong>and</strong> the placement of the number<br />

1011 two<br />

:<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1<br />

(32 bits wide)<br />

least significant bit The<br />

rightmost bit in a MIPS<br />

word.<br />

most significant bit The<br />

leftmost bit in a MIPS<br />

word.<br />

Since words are drawn vertically as well as horizontally, leftmost <strong>and</strong> rightmost<br />

may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost<br />

bit (bit 0 above) <strong>and</strong> most significant bit to the leftmost bit (bit 31).<br />

The MIPS word is 32 bits long, so we can represent 2³² different 32-bit patterns. It is natural to let these combinations represent the numbers from 0 to 2³² − 1 (4,294,967,295 ten):

0000 0000 0000 0000 0000 0000 0000 0000 two<br />

= 0 ten<br />

0000 0000 0000 0000 0000 0000 0000 0001 two<br />

= 1 ten<br />

0000 0000 0000 0000 0000 0000 0000 0010 two<br />

= 2 ten<br />

. . . . . .<br />

1111 1111 1111 1111 1111 1111 1111 1101 two<br />

= 4,294,967,293 ten<br />

1111 1111 1111 1111 1111 1111 1111 1110 two<br />

= 4,294,967,294 ten<br />

1111 1111 1111 1111 1111 1111 1111 1111 two<br />

= 4,294,967,295 ten<br />

That is, 32-bit binary numbers can be represented in terms of the bit value times a<br />

power of 2 (here xi means the ith bit of x):<br />

(x31 × 2³¹) + (x30 × 2³⁰) + (x29 × 2²⁹) + … + (x1 × 2¹) + (x0 × 2⁰)

For reasons we will shortly see, these positive numbers are called unsigned numbers.<br />

Hardware/<br />

Software<br />

Interface<br />

Base 2 is not natural to human beings; we have 10 fingers <strong>and</strong> so find base 10<br />

natural. Why didn’t computers use decimal? In fact, the first commercial computer<br />

did offer decimal arithmetic. The problem was that the computer still used on<br />

<strong>and</strong> off signals, so a decimal digit was simply represented by several binary digits.<br />

Decimal proved so inefficient that subsequent computers reverted to all binary,<br />

converting to base 10 only for the relatively infrequent input/output events.<br />

Keep in mind that the binary bit patterns above are simply representatives of<br />

numbers. Numbers really have an infinite number of digits, with almost all being<br />

0 except for a few of the rightmost digits. We just don’t normally show leading 0s.<br />

Hardware can be designed to add, subtract, multiply, <strong>and</strong> divide these binary<br />

bit patterns. If the number that is the proper result of such operations cannot be<br />

represented by these rightmost hardware bits, overflow is said to have occurred.


2.4 Signed <strong>and</strong> Unsigned Numbers 75<br />

It’s up to the programming language, the operating system, <strong>and</strong> the program to<br />

determine what to do if overflow occurs.<br />

<strong>Computer</strong> programs calculate both positive <strong>and</strong> negative numbers, so we need a<br />

representation that distinguishes the positive from the negative. The most obvious<br />

solution is to add a separate sign, which conveniently can be represented in a single<br />

bit; the name for this representation is sign <strong>and</strong> magnitude.<br />

Alas, sign <strong>and</strong> magnitude representation has several shortcomings. First, it’s<br />

not obvious where to put the sign bit. To the right? To the left? Early computers<br />

tried both. Second, adders for sign <strong>and</strong> magnitude may need an extra step to set<br />

the sign because we can’t know in advance what the proper sign will be. Finally, a<br />

separate sign bit means that sign <strong>and</strong> magnitude has both a positive <strong>and</strong> a negative<br />

zero, which can lead to problems for inattentive programmers. As a result of these<br />

shortcomings, sign <strong>and</strong> magnitude representation was soon ab<strong>and</strong>oned.<br />

In the search for a more attractive alternative, the question arose as to what<br />

would be the result for unsigned numbers if we tried to subtract a large number<br />

from a small one. The answer is that it would try to borrow from a string of leading<br />

0s, so the result would have a string of leading 1s.<br />

Given that there was no obvious better alternative, the final solution was to pick<br />

the representation that made the hardware simple: leading 0s mean positive, <strong>and</strong><br />

leading 1s mean negative. This convention for representing signed binary numbers<br />

is called two’s complement representation:<br />

0000 0000 0000 0000 0000 0000 0000 0000 two<br />

= 0 ten<br />

0000 0000 0000 0000 0000 0000 0000 0001 two<br />

= 1 ten<br />

0000 0000 0000 0000 0000 0000 0000 0010 two<br />

= 2 ten<br />

. . . . . .<br />

0111 1111 1111 1111 1111 1111 1111 1101 two<br />

= 2,147,483,645 ten<br />

0111 1111 1111 1111 1111 1111 1111 1110 two<br />

= 2,147,483,646 ten<br />

0111 1111 1111 1111 1111 1111 1111 1111 two<br />

= 2,147,483,647 ten<br />

1000 0000 0000 0000 0000 0000 0000 0000 two<br />

= –2,147,483,648 ten<br />

1000 0000 0000 0000 0000 0000 0000 0001 two<br />

= –2,147,483,647 ten<br />

1000 0000 0000 0000 0000 0000 0000 0010 two<br />

= –2,147,483,646 ten<br />

. . . . . .<br />

1111 1111 1111 1111 1111 1111 1111 1101 two<br />

= –3 ten<br />

1111 1111 1111 1111 1111 1111 1111 1110 two<br />

= –2 ten<br />

1111 1111 1111 1111 1111 1111 1111 1111 two<br />

= –1 ten<br />

The positive half of the numbers, from 0 to 2,147,483,647 ten (2³¹ − 1), use the same representation as before. The following bit pattern (1000 . . . 0000 two) represents the most negative number −2,147,483,648 ten (−2³¹). It is followed by a declining set of negative numbers: −2,147,483,647 ten (1000 . . . 0001 two) down to −1 ten (1111 . . . 1111 two).

Two’s complement does have one negative number, −2,147,483,648 ten, that

has no corresponding positive number. Such imbalance was also a worry to the<br />

inattentive programmer, but sign <strong>and</strong> magnitude had problems for both the<br />

programmer <strong>and</strong> the hardware designer. Consequently, every computer today uses<br />

two’s complement binary representations for signed numbers.


76 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

Two’s complement representation has the advantage that all negative numbers<br />

have a 1 in the most significant bit. Consequently, hardware needs to test only<br />

this bit to see if a number is positive or negative (with the number 0 considered<br />

positive). This bit is often called the sign bit. By recognizing the role of the sign bit,<br />

we can represent positive <strong>and</strong> negative 32-bit numbers in terms of the bit value<br />

times a power of 2:<br />

(x31 × −2³¹) + (x30 × 2³⁰) + (x29 × 2²⁹) + … + (x1 × 2¹) + (x0 × 2⁰)

The sign bit is multiplied by −2³¹, and the rest of the bits are then multiplied by

positive versions of their respective base values.<br />

EXAMPLE<br />

Binary to Decimal Conversion<br />

What is the decimal value of this 32-bit two’s complement number?<br />

1111 1111 1111 1111 1111 1111 1111 1100 two<br />

ANSWER<br />

Substituting the number’s bit values into the formula above:<br />

(1 × −2³¹) + (1 × 2³⁰) + (1 × 2²⁹) + … + (1 × 2²) + (0 × 2¹) + (0 × 2⁰)
= −2³¹ + 2³⁰ + 2²⁹ + … + 2² + 0 + 0
= −2,147,483,648 ten + 2,147,483,644 ten
= −4 ten

We’ll see a shortcut to simplify conversion from negative to positive soon.<br />
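As a quick illustration (assuming a machine with 32-bit ints, which is essentially universal today), the same bit pattern can be printed under both interpretations in C:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t bits = 0xFFFFFFFCu;     /* 1111 ... 1100 in binary */
    /* The identical 32 bits, read as unsigned and as two's complement signed.
       (The conversion to int32_t is implementation-defined in older C standards,
       but every two's complement machine yields -4.) */
    int32_t value = (int32_t)bits;
    printf("as unsigned: %u\n", (unsigned)bits);   /* 4294967292 */
    printf("as signed  : %d\n", (int)value);       /* -4         */
    return 0;
}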

Just as an operation on unsigned numbers can overflow the capacity of hardware<br />

to represent the result, so can an operation on two’s complement numbers. Overflow<br />

occurs when the leftmost retained bit of the binary bit pattern is not the same as the<br />

infinite number of digits to the left (the sign bit is incorrect): a 0 on the left of the bit<br />

pattern when the number is negative or a 1 when the number is positive.<br />
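A tiny C example (illustrative; it routes the addition through unsigned arithmetic to stay within defined behavior) shows the symptom described above: the sum of two large positive numbers comes out with the wrong sign.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t a = 2000000000, b = 2000000000;
    /* The true sum, 4,000,000,000, does not fit in 32 signed bits, so the
       retained bits carry an incorrect (negative) sign bit. */
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);   /* wraps modulo 2^32 */
    printf("%d + %d = %d\n", a, b, sum);                  /* prints a negative number */
    return 0;
}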

Hardware/<br />

Software<br />

Interface<br />

Signed versus unsigned applies to loads as well as to arithmetic. The function of a<br />

signed load is to copy the sign repeatedly to fill the rest of the register—called sign<br />

extension—but its purpose is to place a correct representation of the number within<br />

that register. Unsigned loads simply fill with 0s to the left of the data, since the<br />

number represented by the bit pattern is unsigned.<br />

When loading a 32-bit word into a 32-bit register, the point is moot; signed <strong>and</strong><br />

unsigned loads are identical. MIPS does offer two flavors of byte loads: load byte (lb)<br />

treats the byte as a signed number <strong>and</strong> thus sign-extends to fill the 24 left-most bits<br />

of the register, while load byte unsigned (lbu) works with unsigned integers. Since C<br />

programs almost always use bytes to represent characters rather than consider bytes<br />

as very short signed integers, lbu is used practically exclusively for byte loads.


2.4 Signed <strong>and</strong> Unsigned Numbers 77<br />

Hardware/Software Interface

Unlike the numbers discussed above, memory addresses naturally start at 0 and continue to the largest address. Put another way, negative addresses make no sense. Thus, programs want to deal sometimes with numbers that can be positive or negative and sometimes with numbers that can be only positive. Some programming languages reflect this distinction. C, for example, names the former integers (declared as int in the program) and the latter unsigned integers (unsigned int). Some C style guides even recommend declaring the former as signed int to keep the distinction clear.

Let’s examine two useful shortcuts when working with two’s complement numbers. The first shortcut is a quick way to negate a two’s complement binary number. Simply invert every 0 to 1 and every 1 to 0, then add one to the result. This shortcut is based on the observation that the sum of a number and its inverted representation must be 111 . . . 111 two, which represents −1. Since x + x̄ = −1, therefore x + x̄ + 1 = 0, or x̄ + 1 = −x. (We use the notation x̄ to mean invert every bit in x from 0 to 1 and vice versa.)

EXAMPLE

Negation Shortcut

Negate 2 ten, and then check the result by negating −2 ten.

ANSWER

2 ten = 0000 0000 0000 0000 0000 0000 0000 0010 two

Negating this number by inverting the bits and adding one,

  1111 1111 1111 1111 1111 1111 1111 1101 two
+                                        1 two
= 1111 1111 1111 1111 1111 1111 1111 1110 two
= −2 ten

Going the other direction, 1111 1111 1111 1111 1111 1111 1111 1110 two is first inverted and then incremented:

  0000 0000 0000 0000 0000 0000 0000 0001 two
+                                        1 two
= 0000 0000 0000 0000 0000 0000 0000 0010 two
= 2 ten
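The shortcut is easy to verify in C (a minimal sketch; ~ is C’s invert-every-bit operator):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t x = 2;
    int32_t negated = ~x + 1;                    /* invert the bits, add one */
    printf("~2 + 1    = %d\n", negated);         /* prints -2 */
    printf("~(-2) + 1 = %d\n", ~negated + 1);    /* back to 2 */
    return 0;
}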


78 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

Our next shortcut tells us how to convert a binary number represented in n bits<br />

to a number represented with more than n bits. For example, the immediate field<br />

in the load, store, branch, add, <strong>and</strong> set on less than instructions contains a two’s<br />

complement 16-bit number, representing −32,768 ten (−2¹⁵) to 32,767 ten (2¹⁵ − 1).

To add the immediate field to a 32-bit register, the computer must convert that 16-<br />

bit number to its 32-bit equivalent. The shortcut is to take the most significant bit<br />

from the smaller quantity—the sign bit—<strong>and</strong> replicate it to fill the new bits of the<br />

larger quantity. The old nonsign bits are simply copied into the right portion of the<br />

new word. This shortcut is commonly called sign extension.<br />

EXAMPLE<br />

Sign Extension Shortcut<br />

Convert 16-bit binary versions of 2 ten and −2 ten to 32-bit binary numbers.

ANSWER<br />

The 16-bit binary version of the number 2 is<br />

0000 0000 0000 0010 two<br />

= 2 ten<br />

It is converted to a 32-bit number by making 16 copies of the value in the most<br />

significant bit (0) <strong>and</strong> placing that in the left-h<strong>and</strong> half of the word. The right<br />

half gets the old value:<br />

0000 0000 0000 0000 0000 0000 0000 0010 two<br />

= 2 ten<br />

Let’s negate the 16-bit version of 2 using the earlier shortcut. Thus,<br />

0000 0000 0000 0010 two<br />

becomes<br />

1111 1111 1111 1101 two<br />

+ 1 two<br />

= 1111 1111 1111 1110 two<br />

Creating a 32-bit version of the negative number means copying the sign bit<br />

16 times <strong>and</strong> placing it on the left:<br />

1111 1111 1111 1111 1111 1111 1111 1110 two<br />

= –2 ten<br />

This trick works because positive two’s complement numbers really have an infinite<br />

number of 0s on the left <strong>and</strong> negative two’s complement numbers have an infinite<br />

number of 1s. The binary bit pattern representing a number hides leading bits to fit<br />

the width of the hardware; sign extension simply restores some of them.
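In C the compiler applies the same rule whenever a narrow signed value is widened; the sketch below (illustrative values only) contrasts it with the zero extension used for unsigned data, mirroring the lb versus lbu distinction.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t  small  = -2;        /* 16-bit two's complement: 0xFFFE           */
    int32_t  wide   = small;     /* sign-extended: the sign bit is replicated  */
    uint16_t usmall = 0xFFFE;    /* same bits, treated as unsigned             */
    uint32_t uwide  = usmall;    /* zero-extended instead                      */
    printf("sign-extended: 0x%08X (%d)\n", (unsigned)wide, (int)wide);
    printf("zero-extended: 0x%08X (%u)\n", (unsigned)uwide, (unsigned)uwide);
    return 0;
}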


2.4 Signed <strong>and</strong> Unsigned Numbers 79<br />

Summary<br />

The main point of this section is that we need to represent both positive <strong>and</strong><br />

negative integers within a computer word, <strong>and</strong> although there are pros <strong>and</strong> cons to<br />

any option, the unanimous choice since 1965 has been two’s complement.<br />

Elaboration: For signed decimal numbers, we used “−” to represent negative because there are no limits to the size of a decimal number. Given a fixed word size, binary and hexadecimal (see Figure 2.4) bit strings can encode the sign; hence we do not normally use “+” or “−” with binary or hexadecimal notation.

Check Yourself

What is the decimal value of this 64-bit two's complement number?

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1000 two

1) –4 ten
2) –8 ten
3) –16 ten
4) 18,446,744,073,709,551,609 ten

Elaboration: Two's complement gets its name from the rule that the unsigned sum of an n-bit number and its n-bit negative is 2^n; hence, the negation or complement of a number x is 2^n – x, or its "two's complement."

A third alternative representation to two's complement and sign and magnitude is called one's complement. The negative of a one's complement is found by inverting each bit, from 0 to 1 and from 1 to 0, or x̄. This relation helps explain its name, since the complement of x is 2^n – x – 1. It was also an attempt to be a better solution than sign and magnitude, and several early scientific computers did use the notation. This representation is similar to two's complement except that it also has two 0s: 00 . . . 00 two is positive 0 and 11 . . . 11 two is negative 0. The most negative number, 10 . . . 000 two, represents –2,147,483,647 ten, and so the positives and negatives are balanced. One's complement adders did need an extra step to subtract a number, and hence two's complement dominates today.

A fi nal notation, which we will look at when we discuss fl oating point in Chapter 3,<br />

is to represent the most negative value by 00 . . . 000 two<br />

<strong>and</strong> the most positive value<br />

by 11 . . . 11 two<br />

, with 0 typically having the value 10 . . . 00 two<br />

. This is called a biased<br />

notation, since it biases the number such that the number plus the bias has a nonnegative<br />

representation.<br />

one's complement: A notation that represents the most negative value by 10 . . . 000 two and the most positive value by 01 . . . 11 two, leaving an equal number of negatives and positives but ending up with two zeros, one positive (00 . . . 00 two) and one negative (11 . . . 11 two). The term is also used to mean the inversion of every bit in a pattern: 0 to 1 and 1 to 0.

biased notation: A notation that represents the most negative value by 00 . . . 000 two and the most positive value by 11 . . . 11 two, with 0 typically having the value 10 . . . 00 two, thereby biasing the number such that the number plus the bias has a non-negative representation.


2.5 Representing Instructions in the Computer

This layout of the instruction is called the instruction format. As you can see<br />

from counting the number of bits, this MIPS instruction takes exactly 32 bits—the<br />

same size as a data word. In keeping with our design principle that simplicity favors<br />

regularity, all MIPS instructions are 32 bits long.<br />

To distinguish it from assembly language, we call the numeric version of<br />

instructions machine language <strong>and</strong> a sequence of such instructions machine code.<br />

It would appear that you would now be reading <strong>and</strong> writing long, tedious strings<br />

of binary numbers. We avoid that tedium by using a higher base than binary that<br />

converts easily into binary. Since almost all computer data sizes are multiples of<br />

4, hexadecimal (base 16) numbers are popular. Since base 16 is a power of 2,<br />

we can trivially convert by replacing each group of four binary digits by a single<br />

hexadecimal digit, <strong>and</strong> vice versa. Figure 2.4 converts between hexadecimal <strong>and</strong><br />

binary.<br />

instruction format: A form of representation of an instruction composed of fields of binary numbers.

machine language: Binary representation used for communication within a computer system.

hexadecimal: Numbers in base 16.

Hexadecimal Binary Hexadecimal Binary Hexadecimal Binary Hexadecimal Binary<br />

0 hex 0000 two 4 hex 0100 two 8 hex 1000 two c hex 1100 two<br />

1 hex 0001 two 5 hex 0101 two 9 hex 1001 two d hex 1101 two<br />

2 hex 0010 two 6 hex 0110 two a hex 1010 two e hex 1110 two<br />

3 hex 0011 two 7 hex 0111 two b hex 1011 two f hex 1111 two<br />

FIGURE 2.4 The hexadecimal-binary conversion table. Just replace one hexadecimal digit by the corresponding four binary digits,<br />

<strong>and</strong> vice versa. If the length of the binary number is not a multiple of 4, go from right to left.<br />

Because we frequently deal with different number bases, to avoid confusion<br />

we will subscript decimal numbers with ten, binary numbers with two, <strong>and</strong><br />

hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By<br />

the way, C <strong>and</strong> Java use the notation 0xnnnn for hexadecimal numbers.<br />

EXAMPLE

Binary to Hexadecimal and Back

Convert the following hexadecimal and binary numbers into the other base:

eca8 6420 hex

0001 0011 0101 0111 1001 1011 1101 1111 two



ANSWER<br />

Using Figure 2.4, the answer is just a table lookup one way:

eca8 6420 hex = 1110 1100 1010 1000 0110 0100 0010 0000 two

And then the other direction:

0001 0011 0101 0111 1001 1011 1101 1111 two = 1357 9bdf hex
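The four-bits-per-digit rule is easy to check mechanically. The following C sketch (our own code; the test value is taken from the example above) prints a 32-bit word in hexadecimal and then nibble by nibble in binary:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0xeca86420u;
    printf("hex: %08x\nbin:", (unsigned)word);
    for (int i = 7; i >= 0; i--) {              /* walk the 8 nibbles, left to right */
        unsigned nibble = (word >> (4 * i)) & 0xF;
        printf(" ");
        for (int b = 3; b >= 0; b--)            /* 4 binary digits per hex digit     */
            printf("%u", (nibble >> b) & 1u);
    }
    printf("\n");   /* prints: 1110 1100 1010 1000 0110 0100 0010 0000 */
    return 0;
}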

MIPS Fields<br />

MIPS fields are given names to make them easier to discuss:<br />

op rs rt rd shamt funct<br />

6 bits 5 bits 5 bits 5 bits 5 bits 6 bits<br />

Here is the meaning of each name of the fields in MIPS instructions:<br />

opcode: The field that denotes the operation and format of an instruction.

■ op: Basic operation of the instruction, traditionally called the opcode.<br />

■ rs: The first register source oper<strong>and</strong>.<br />

■ rt: The second register source oper<strong>and</strong>.<br />

■ rd: The register destination oper<strong>and</strong>. It gets the result of the operation.<br />

■ shamt: Shift amount. (Section 2.6 explains shift instructions <strong>and</strong> this term; it<br />

will not be used until then, <strong>and</strong> hence the field contains zero in this section.)<br />

■ funct: Function. This field, often called the function code, selects the specific<br />

variant of the operation in the op field.<br />

A problem occurs when an instruction needs longer fields than those shown<br />

above. For example, the load word instruction must specify two registers <strong>and</strong> a<br />

constant. If the address were to use one of the 5-bit fields in the format above, the<br />

constant within the load word instruction would be limited to only 2^5, or 32. This

constant is used to select elements from arrays or data structures, <strong>and</strong> it often needs<br />

to be much larger than 32. This 5-bit field is too small to be useful.<br />

Hence, we have a conflict between the desire to keep all instructions the same<br />

length <strong>and</strong> the desire to have a single instruction format. This leads us to the final<br />

hardware design principle:



<strong>Design</strong> Principle 3: Good design dem<strong>and</strong>s good compromises.<br />

The compromise chosen by the MIPS designers is to keep all instructions the<br />

same length, thereby requiring different kinds of instruction formats for different<br />

kinds of instructions. For example, the format above is called R-type (for register)<br />

or R-format. A second type of instruction format is called I-type (for immediate)<br />

or I-format <strong>and</strong> is used by the immediate <strong>and</strong> data transfer instructions. The fields<br />

of I-format are<br />

op rs rt constant or address<br />

6 bits 5 bits 5 bits 16 bits<br />

The 16-bit address means a load word instruction can load any word within a region of ±2^15 or 32,768 bytes (±2^13 or 8192 words) of the address in the base register rs. Similarly, add immediate is limited to constants no larger than ±2^15.

We see that more than 32 registers would be difficult in this format, as the rs <strong>and</strong> rt<br />

fields would each need another bit, making it harder to fit everything in one word.<br />

Let’s look at the load word instruction from page 71:<br />

lw $t0,32($s3) # Temporary reg $t0 gets A[8]<br />

Here, 19 (for $s3) is placed in the rs field, 8 (for $t0) is placed in the rt field, <strong>and</strong><br />

32 is placed in the address field. Note that the meaning of the rt field has changed<br />

for this instruction: in a load word instruction, the rt field specifies the destination<br />

register, which receives the result of the load.<br />

Although multiple formats complicate the hardware, we can reduce the complexity<br />

by keeping the formats similar. For example, the first three fields of the R-type <strong>and</strong><br />

I-type formats are the same size <strong>and</strong> have the same names; the length of the fourth<br />

field in I-type is equal to the sum of the lengths of the last three fields of R-type.<br />

In case you were wondering, the formats are distinguished by the values in the<br />

first field: each format is assigned a distinct set of values in the first field (op) so that<br />

the hardware knows whether to treat the last half of the instruction as three fields<br />

(R-type) or as a single field (I-type). Figure 2.5 shows the numbers used in each<br />

field for the MIPS instructions covered so far.<br />

Instruction Format op rs rt rd shamt funct address<br />

add R 0 reg reg reg 0 32 ten n.a.<br />

sub (subtract) R 0 reg reg reg 0 34 ten n.a.<br />

add immediate I 8 ten reg reg n.a. n.a. n.a. constant<br />

lw (load word) I 35 ten reg reg n.a. n.a. n.a. address<br />

sw (store word) I 43 ten reg reg n.a. n.a. n.a. address<br />

FIGURE 2.5 MIPS instruction encoding. In the table above, “reg” means a register number between 0<br />

<strong>and</strong> 31, “address” means a 16-bit address, <strong>and</strong> “n.a.” (not applicable) means this field does not appear in this<br />

format. Note that add <strong>and</strong> sub instructions have the same value in the op field; the hardware uses the funct<br />

field to decide the variant of the operation: add (32) or subtract (34).



EXAMPLE<br />

Translating MIPS Assembly Language into Machine Language<br />

We can now take an example all the way from what the programmer writes<br />

to what the computer executes. If $t1 has the base of the array A <strong>and</strong> $s2<br />

corresponds to h, the assignment statement<br />

A[300] = h + A[300];<br />

is compiled into<br />

lw $t0,1200($t1) # Temporary reg $t0 gets A[300]<br />

add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[300]<br />

sw $t0,1200($t1) # Stores h + A[300] back into A[300]<br />

What is the MIPS machine language code for these three instructions?<br />

ANSWER<br />

For convenience, let’s first represent the machine language instructions using<br />

decimal numbers. From Figure 2.5, we can determine the three machine<br />

language instructions:<br />

Op    rs    rt    rd    address/shamt    funct
35     9     8          1200
 0    18     8     8       0              32
43     9     8          1200

The lw instruction is identified by 35 (see Figure 2.5) in the first field<br />

(op). The base register 9 ($t1) is specified in the second field (rs), <strong>and</strong> the<br />

destination register 8 ($t0) is specified in the third field (rt). The offset to<br />

select A[300] (1200 = 300 × 4) is found in the final field (address).

The add instruction that follows is specified with 0 in the first field (op) <strong>and</strong><br />

32 in the last field (funct). The three register oper<strong>and</strong>s (18, 8, <strong>and</strong> 8) are found<br />

in the second, third, <strong>and</strong> fourth fields <strong>and</strong> correspond to $s2, $t0, <strong>and</strong> $t0.<br />

The sw instruction is identified with 43 in the first field. The rest of this final<br />

instruction is identical to the lw instruction.<br />

Since 1200 ten = 0000 0100 1011 0000 two, the binary equivalent to the decimal form is:

100011 01001 01000 0000 0100 1011 0000<br />

000000 10010 01000 01000 00000 100000<br />

101011 01001 01000 0000 0100 1011 0000
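The packing of fields into a 32-bit word can also be sketched in C. The helper names below (encode_r, encode_i) are ours, not MIPS terminology; the field widths and the three instructions come from the example above:

#include <stdio.h>
#include <stdint.h>

static uint32_t encode_r(unsigned op, unsigned rs, unsigned rt,
                         unsigned rd, unsigned shamt, unsigned funct) {
    /* R-type: op | rs | rt | rd | shamt | funct (6+5+5+5+5+6 bits) */
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
}

static uint32_t encode_i(unsigned op, unsigned rs, unsigned rt, uint16_t imm) {
    /* I-type: op | rs | rt | 16-bit constant or address */
    return (op << 26) | (rs << 21) | (rt << 16) | imm;
}

int main(void) {
    printf("%08x\n", (unsigned)encode_i(35, 9, 8, 1200));     /* lw  $t0,1200($t1): 8d2804b0 */
    printf("%08x\n", (unsigned)encode_r(0, 18, 8, 8, 0, 32)); /* add $t0,$s2,$t0:   02484020 */
    printf("%08x\n", (unsigned)encode_i(43, 9, 8, 1200));     /* sw  $t0,1200($t1): ad2804b0 */
    return 0;
}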



The dual of a shift left is a shift right. The actual names of the two MIPS shift instructions are shift left logical (sll) and shift right logical (srl). The following instruction performs the operation above, assuming that the original value was in register $s0 and the result should go in register $t2:

sll $t2,$s0,4 # reg $t2 = reg $s0 << 4 bits
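A quick C sketch of the effect (the starting value is ours, chosen only for illustration): shifting left by i bits multiplies by 2^i, so the sll above multiplies by 16:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t s0 = 9;
    uint32_t t2 = s0 << 4;             /* same effect as sll $t2,$s0,4 */
    printf("%u\n", (unsigned)t2);      /* prints 144, i.e. 9 * 2^4     */
    return 0;
}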



To place a value into one of these seas of 0s, there is the dual to AND, called<br />

OR. It is a bit-by-bit operation that places a 1 in the result if either oper<strong>and</strong> bit is<br />

a 1. To elaborate, if the registers $t1 <strong>and</strong> $t2 are unchanged from the preceding<br />

example, the result of the MIPS instruction<br />

or $t0,$t1,$t2 # reg $t0 = reg $t1 | reg $t2<br />

is this value in register $t0:

0000 0000 0000 0000 0011 1101 1100 0000 two

OR: A logical bit-by-bit operation with two operands that calculates a 1 if there is a 1 in either operand.

The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if one operand bit is a 0, and vice versa. Using our prior notation, it calculates x̄.

NOT: A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.

In keeping with the three-operand format, the designers of MIPS decided to include the instruction NOR (NOT OR) instead of NOT. If one operand is zero, then it is equivalent to NOT: A NOR 0 = NOT (A OR 0) = NOT (A).

NOR: A logical bit-by-bit operation with two operands that calculates the NOT of the OR of the two operands. That is, it calculates a 1 only if there is a 0 in both operands.

If the register $t1 is unchanged from the preceding example and register $t3 has the value 0, the result of the MIPS instruction

nor $t0,$t1,$t3 # reg $t0 = ~ (reg $t1 | reg $t3)

is this value in register $t0:

1111 1111 1111 1111 1100 0011 1111 1111 two

Figure 2.8 above shows the relationship between the C <strong>and</strong> Java operators <strong>and</strong> the<br />

MIPS instructions. Constants are useful in AND <strong>and</strong> OR logical operations as well<br />

as in arithmetic operations, so MIPS also provides the instructions <strong>and</strong> immediate<br />

(<strong>and</strong>i) <strong>and</strong> or immediate (ori). Constants are rare for NOR, since its main use is<br />

to invert the bits of a single oper<strong>and</strong>; thus, the MIPS instruction set architecture has<br />

no immediate version of NOR.<br />

Elaboration: The full MIPS instruction set also includes exclusive or (XOR), which sets the bit to 1 when two corresponding bits differ, and to 0 when they are the same. C allows bit fields or fields to be defined within words, both allowing objects to be packed within a word and to match an externally enforced interface such as an I/O device. All fields must fit within a single word. Fields are unsigned integers that can be as short as 1 bit. C compilers insert and extract fields using logical instructions in MIPS: and, or, sll, and srl.

Elaboration: Logical AND immediate and logical OR immediate put 0s into the upper 16 bits to form a 32-bit constant, unlike add immediate, which does sign extension.
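As a sketch of both answers to the Check Yourself question below (our own C code; the 3-bit field position is an arbitrary choice), here is a field isolated once with AND and once with a shift left followed by a shift right:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0x00003DC0u;             /* the OR result shown above                            */
    uint32_t masked  = word & 0x00000700u;   /* 1. AND: zero everything but the field (left in place) */
    uint32_t shifted = (word << 21) >> 29;   /* 2. shift left then right: field ends up right-justified */
    printf("%08x %u\n", (unsigned)masked, (unsigned)shifted);  /* prints: 00000500 5 */
    return 0;
}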

Check Yourself

Which operations can isolate a field in a word?

1. AND
2. A shift left followed by a shift right


2.7 Instructions for Making Decisions

The next assignment statement performs a single operation, <strong>and</strong> if all the<br />

oper<strong>and</strong>s are allocated to registers, it is just one instruction:<br />

add $s0,$s1,$s2 # f = g + h (skipped if i ≠ j)<br />

We now need to go to the end of the if statement. This example introduces<br />

another kind of branch, often called an unconditional branch. This instruction<br />

says that the processor always follows the branch. To distinguish between<br />

conditional <strong>and</strong> unconditional branches, the MIPS name for this type of<br />

instruction is jump, abbreviated as j (the label Exit is defined below).<br />

conditional branch: An instruction that requires the comparison of two values and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the comparison.

j Exit # go to Exit

The assignment statement in the else portion of the if statement can again be<br />

compiled into a single instruction. We just need to append the label Else to<br />

this instruction. We also show the label Exit that is after this instruction,<br />

showing the end of the if-then-else compiled code:<br />

Else: sub $s0,$s1,$s2 # f = g – h (skipped if i = j)

Exit:<br />

Notice that the assembler relieves the compiler <strong>and</strong> the assembly language<br />

programmer from the tedium of calculating addresses for branches, just as it does<br />

for calculating data addresses for loads <strong>and</strong> stores (see Section 2.12).<br />

[Flowchart: the test i == j selects between two boxes; when i = j the left box executes f = g + h, and when i ≠ j control goes to Else:, where the right box executes f = g – h; both paths rejoin at Exit:.]

FIGURE 2.9 Illustration of the options in the if statement above. The left box corresponds to the then part of the if statement, and the right box corresponds to the else part.



Hardware/Software Interface

Compilers frequently create branches <strong>and</strong> labels where they do not appear in<br />

the programming language. Avoiding the burden of writing explicit labels <strong>and</strong><br />

branches is one benefit of writing in high-level programming languages <strong>and</strong> is a<br />

reason coding is faster at that level.<br />

Loops<br />

Decisions are important both for choosing between two alternatives—found in if<br />

statements—<strong>and</strong> for iterating a computation—found in loops. The same assembly<br />

instructions are the building blocks for both cases.<br />

EXAMPLE<br />

Compiling a while Loop in C<br />

Here is a traditional loop in C:<br />

while (save[i] == k)<br />

i += 1;<br />

Assume that i <strong>and</strong> k correspond to registers $s3 <strong>and</strong> $s5 <strong>and</strong> the base of the<br />

array save is in $s6. What is the MIPS assembly code corresponding to this<br />

C segment?<br />

ANSWER<br />

The first step is to load save[i] into a temporary register. Before we can load<br />

save[i] into a temporary register, we need to have its address. Before we<br />

can add i to the base of array save to form the address, we must multiply the<br />

index i by 4 due to the byte addressing problem. Fortunately, we can use shift<br />

left logical, since shifting left by 2 bits multiplies by 2^2, or 4 (see page 88 in the

prior section). We need to add the label Loop to it so that we can branch back<br />

to that instruction at the end of the loop:<br />

Loop: sll $t1,$s3,2 # Temp reg $t1 = i * 4<br />

To get the address of save[i], we need to add $t1 <strong>and</strong> the base of save in $s6:<br />

add $t1,$t1,$s6 # $t1 = address of save[i]<br />

Now we can use that address to load save[i] into a temporary register:<br />

lw $t0,0($t1) # Temp reg $t0 = save[i]<br />

The next instruction performs the loop test, exiting if save[i] ≠ k:<br />

bne $t0,$s5,Exit # go to Exit if save[i] ≠ k



The next instruction adds 1 to i:<br />

addi $s3,$s3,1 # i = i + 1<br />

The end of the loop branches back to the while test at the top of the loop. We<br />

just add the Exit label after it, <strong>and</strong> we’re done:<br />

j Loop # go to Loop<br />

Exit:<br />

(See the exercises for an optimization of this sequence.)<br />

Such sequences of instructions that end in a branch are so fundamental to compiling<br />

that they are given their own buzzword: a basic block is a sequence of instructions<br />

without branches, except possibly at the end, <strong>and</strong> without branch targets or branch<br />

labels, except possibly at the beginning. One of the first early phases of compilation<br />

is breaking the program into basic blocks.<br />

basic block: A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning).

The test for equality or inequality is probably the most popular test, but sometimes it is useful to see if a variable is less than another variable. For example, a for loop may want to test to see if the index variable is less than 0. Such comparisons are accomplished in MIPS assembly language with an instruction that compares two registers and sets a third register to 1 if the first is less than the second; otherwise, it is set to 0. The MIPS instruction is called set on less than, or slt. For example,

slt $t0, $s3, $s4 # $t0 = 1 if $s3 < $s4

means that register $t0 is set to 1 if the value in register $s3 is less than the value<br />

in register $s4; otherwise, register $t0 is set to 0.<br />

Constant oper<strong>and</strong>s are popular in comparisons, so there is an immediate version<br />

of the set on less than instruction. To test if register $s2 is less than the constant<br />

10, we can just write<br />

slti $t0,$s2,10 # $t0 = 1 if $s2 < 10<br />

MIPS compilers use the slt, slti, beq, bne, <strong>and</strong> the fixed value of 0 (always<br />

available by reading register $zero) to create all relative conditions: equal, not<br />

equal, less than, less than or equal, greater than, greater than or equal.<br />

Hardware/Software Interface



Heeding von Neumann’s warning about the simplicity of the “equipment,” the<br />

MIPS architecture doesn’t include branch on less than because it is too complicated;<br />

either it would stretch the clock cycle time or it would take extra clock cycles per<br />

instruction. Two faster instructions are more useful.<br />

Hardware/Software Interface

Comparison instructions must deal with the dichotomy between signed <strong>and</strong><br />

unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit<br />

represents a negative number <strong>and</strong>, of course, is less than any positive number,<br />

which must have a 0 in the most significant bit. With unsigned integers, on the<br />

other h<strong>and</strong>, a 1 in the most significant bit represents a number that is larger than<br />

any that begins with a 0. (We’ll soon take advantage of this dual meaning of the<br />

most significant bit to reduce the cost of the array bounds checking.)<br />

MIPS offers two versions of the set on less than comparison to h<strong>and</strong>le these<br />

alternatives. Set on less than (slt) <strong>and</strong> set on less than immediate (slti) work with<br />

signed integers. Unsigned integers are compared using set on less than unsigned<br />

(sltu) <strong>and</strong> set on less than immediate unsigned (sltiu).<br />

EXAMPLE<br />

Signed versus Unsigned Comparison<br />

Suppose register $s0 has the binary number<br />

1111 1111 1111 1111 1111 1111 1111 1111 two<br />

<strong>and</strong> that register $s1 has the binary number<br />

0000 0000 0000 0000 0000 0000 0000 0001 two<br />

What are the values of registers $t0 <strong>and</strong> $t1 after these two instructions?<br />

slt $t0, $s0, $s1 # signed comparison
sltu $t1, $s0, $s1 # unsigned comparison

ANSWER<br />

The value in register $s0 represents –1 ten if it is a signed integer and 4,294,967,295 ten if it is an unsigned integer. The value in register $s1 represents 1 ten in either case. Then register $t0 has the value 1, since –1 ten < 1 ten, and register $t1 has the value 0, since 4,294,967,295 ten > 1 ten.
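The same comparison can be sketched in C (our own code): the single bit pattern 0xFFFFFFFF orders differently depending on whether it is treated as signed or unsigned:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t s0 = 0xFFFFFFFFu, s1 = 0x00000001u;
    int t0 = (int32_t)s0 < (int32_t)s1;    /* signed:   -1 < 1             -> 1 */
    int t1 = s0 < s1;                      /* unsigned: 4,294,967,295 < 1  -> 0 */
    printf("%d %d\n", t0, t1);             /* prints: 1 0 */
    return 0;
}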



Treating signed numbers as if they were unsigned gives us a low-cost way of checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays. The key is that negative integers in two's complement notation look like large numbers in unsigned notation; that is, the most significant bit is a sign bit in the former notation but a large part of the number in the latter. Thus, an unsigned comparison of x < y also checks if x is negative as well as if x is less than y.

EXAMPLE

Bounds Check Shortcut

Use this shortcut to reduce an index-out-of-bounds check: jump to IndexOutOfBounds if $s1 ≥ $t2 or if $s1 is negative.

ANSWER

The checking code just uses sltu to do both checks:

sltu $t0,$s1,$t2 # $t0=0 if $s1>=length or $s1<0
beq $t0,$zero,IndexOutOfBounds # if bad, go to IndexOutOfBounds
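A C sketch of the same trick (the function name in_bounds is ours): casting a possibly negative index to unsigned lets a single comparison do both checks:

#include <stdio.h>
#include <stdint.h>

static int in_bounds(int32_t index, uint32_t length) {
    return (uint32_t)index < length;    /* 0 <= index < length, in one test */
}

int main(void) {
    printf("%d %d %d\n",
           in_bounds(5, 10),     /* 1: inside                               */
           in_bounds(10, 10),    /* 0: index equals length                  */
           in_bounds(-1, 10));   /* 0: -1 looks like 4,294,967,295 unsigned */
    return 0;
}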


2.8 Supporting Procedures in Computer Hardware

You can think of a procedure like a spy who leaves with a secret plan, acquires<br />

resources, performs the task, covers his or her tracks, <strong>and</strong> then returns to the point<br />

of origin with the desired result. Nothing else should be perturbed once the mission<br />

is complete. Moreover, a spy operates on only a “need to know” basis, so the spy<br />

can’t make assumptions about his employer.<br />

Similarly, in the execution of a procedure, the program must follow these six<br />

steps:<br />

1. Put parameters in a place where the procedure can access them.<br />

2. Transfer control to the procedure.<br />

3. Acquire the storage resources needed for the procedure.<br />

4. Perform the desired task.<br />

5. Put the result value in a place where the calling program can access it.<br />

6. Return control to the point of origin, since a procedure can be called from<br />

several points in a program.<br />

As mentioned above, registers are the fastest place to hold data in a computer,<br />

so we want to use them as much as possible. MIPS software follows the following<br />

convention for procedure calling in allocating its 32 registers:<br />

■ $a0–$a3: four argument registers in which to pass parameters<br />

■ $v0–$v1: two value registers in which to return values<br />

■ $ra: one return address register to return to the point of origin<br />

In addition to allocating these registers, MIPS assembly language includes an<br />

instruction just for the procedures: it jumps to an address <strong>and</strong> simultaneously<br />

saves the address of the following instruction in register $ra. The jump-<strong>and</strong>-link<br />

instruction (jal) is simply written<br />

jal ProcedureAddress<br />

The link portion of the name means that an address or link is formed that points<br />

to the calling site to allow the procedure to return to the proper address. This “link,”<br />

stored in register $ra (register 31), is called the return address. The return address

is needed because the same procedure could be called from several parts of the<br />

program.<br />

To support such situations, computers like MIPS use the jump register instruction (jr), introduced above to help with case statements, meaning an unconditional jump to the address specified in a register:

jr $ra

jump-and-link instruction: An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).

return address: A link to the calling site that allows a procedure to return to the proper address; in MIPS it is stored in register $ra.



caller: The program that instigates a procedure and provides the necessary parameter values.

callee: A procedure that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller.

program counter (PC): The register containing the address of the instruction in the program being executed.

stack: A data structure for spilling registers organized as a last-in-first-out queue.

stack pointer: A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found. In MIPS, it is register $sp.

push: Add element to stack.

pop: Remove element from stack.

The jump register instruction jumps to the address stored in register $ra—<br />

which is just what we want. Thus, the calling program, or caller, puts the parameter<br />

values in $a0–$a3 <strong>and</strong> uses jal X to jump to procedure X (sometimes named<br />

the callee). The callee then performs the calculations, places the results in $v0 <strong>and</strong><br />

$v1, <strong>and</strong> returns control to the caller using jr $ra.<br />

Implicit in the stored-program idea is the need to have a register to hold the<br />

address of the current instruction being executed. For historical reasons, this<br />

register is almost always called the program counter, abbreviated PC in the MIPS<br />

architecture, although a more sensible name would have been instruction address<br />

register. The jal instruction actually saves PC 4 in register $ra to link to the<br />

following instruction to set up the procedure return.<br />

Using More Registers<br />

Suppose a compiler needs more registers for a procedure than the four argument<br />

<strong>and</strong> two return value registers. Since we must cover our tracks after our mission<br />

is complete, any registers needed by the caller must be restored to the values that<br />

they contained before the procedure was invoked. This situation is an example in<br />

which we need to spill registers to memory, as mentioned in the Hardware/Software<br />

Interface section above.<br />

The ideal data structure for spilling registers is a stack—a last-in-first-out<br />

queue. A stack needs a pointer to the most recently allocated address in the stack<br />

to show where the next procedure should place the registers to be spilled or where<br />

old register values are found. The stack pointer is adjusted by one word for each<br />

register that is saved or restored. MIPS software reserves register 29 for the stack<br />

pointer, giving it the obvious name $sp. Stacks are so popular that they have their<br />

own buzzwords for transferring data to <strong>and</strong> from the stack: placing data onto the<br />

stack is called a push, <strong>and</strong> removing data from the stack is called a pop.<br />

By historical precedent, stacks “grow” from higher addresses to lower addresses.<br />

This convention means that you push values onto the stack by subtracting from the<br />

stack pointer. Adding to the stack pointer shrinks the stack, thereby popping values<br />

off the stack.<br />
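A small C sketch of this convention (the array, names, and sizes are ours, purely for illustration): the pointer moves toward lower addresses on a push and back up on a pop:

#include <stdio.h>
#include <stdint.h>

#define WORDS 8
static uint32_t memory[WORDS];            /* stand-in for the stack's memory       */
static uint32_t *sp = memory + WORDS;     /* start just past the high-address end  */

static void     push(uint32_t value) { *--sp = value; }  /* subtract, then store   */
static uint32_t pop(void)            { return *sp++; }   /* load, then add         */

int main(void) {
    push(17);
    push(42);
    uint32_t first = pop();
    uint32_t second = pop();
    printf("%u %u\n", (unsigned)first, (unsigned)second);  /* prints: 42 17 (last in, first out) */
    return 0;
}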

EXAMPLE<br />

Compiling a C Procedure That Doesn’t Call Another Procedure<br />

Let’s turn the example on page 65 from Section 2.2 into a C procedure:<br />

int leaf_example (int g, int h, int i, int j)
{
  int f;
  f = (g + h) - (i + j);
  return f;
}

What is the compiled MIPS assembly code?



ANSWER

The parameter variables g, h, i, and j correspond to the argument registers $a0, $a1, $a2, and $a3, and f corresponds to $s0. The compiled program starts with the label of the procedure:

leaf_example:

The next step is to save the registers used by the procedure. The C assignment<br />

statement in the procedure body is identical to the example on page 68, which<br />

uses two temporary registers. Thus, we need to save three registers: $s0, $t0,<br />

<strong>and</strong> $t1. We “push” the old values onto the stack by creating space for three<br />

words (12 bytes) on the stack <strong>and</strong> then store them:<br />

addi $sp, $sp, -12 # adjust stack to make room for 3 items

sw $t1, 8($sp) # save register $t1 for use afterwards<br />

sw $t0, 4($sp) # save register $t0 for use afterwards<br />

sw $s0, 0($sp) # save register $s0 for use afterwards<br />

Figure 2.10 shows the stack before, during, <strong>and</strong> after the procedure call.<br />

The next three statements correspond to the body of the procedure, which<br />

follows the example on page 68:<br />

add $t0,$a0,$a1 # register $t0 contains g + h<br />

add $t1,$a2,$a3 # register $t1 contains i + j<br />

sub $s0,$t0,$t1 # f = $t0 – $t1, which is (g + h)–(i + j)<br />

To return the value of f, we copy it into a return value register:<br />

add $v0,$s0,$zero # returns f ($v0 = $s0 + 0)<br />

Before returning, we restore the three old values of the registers we saved by<br />

“popping” them from the stack:<br />

lw $s0, 0($sp) # restore register $s0 for caller<br />

lw $t0, 4($sp) # restore register $t0 for caller<br />

lw $t1, 8($sp) # restore register $t1 for caller<br />

addi $sp,$sp,12 # adjust stack to delete 3 items<br />

The procedure ends with a jump register using the return address:<br />

jr $ra # jump back to calling routine<br />

In the previous example, we used temporary registers <strong>and</strong> assumed their old<br />

values must be saved <strong>and</strong> restored. To avoid saving <strong>and</strong> restoring a register whose<br />

value is never used, which might happen with a temporary register, MIPS software<br />

separates 18 of the registers into two groups:<br />

■ $t0–$t9: temporary registers that are not preserved by the callee (called<br />

procedure) on a procedure call<br />

■ $s0–$s7: saved registers that must be preserved on a procedure call (if<br />

used, the callee saves <strong>and</strong> restores them)



[Figure: three snapshots of the stack drawn from high address (top) to low address (bottom). In (a) $sp points at the top of the stack before the call; in (b) $sp has moved down past the saved contents of registers $t1, $t0, and $s0; in (c) $sp is back where it started after the call.]

FIGURE 2.10 The values of the stack pointer and the stack (a) before, (b) during, and (c) after the procedure call. The stack pointer always points to the "top" of the stack, or the last word in the stack in this drawing.

This simple convention reduces register spilling. In the example above, since the<br />

caller does not expect registers $t0 <strong>and</strong> $t1 to be preserved across a procedure<br />

call, we can drop two stores <strong>and</strong> two loads from the code. We still must save <strong>and</strong><br />

restore $s0, since the callee must assume that the caller needs its value.<br />

Nested Procedures<br />

Procedures that do not call others are called leaf procedures. Life would be simple if<br />

all procedures were leaf procedures, but they aren’t. Just as a spy might employ other<br />

spies as part of a mission, who in turn might use even more spies, so do procedures<br />

invoke other procedures. Moreover, recursive procedures even invoke “clones” of<br />

themselves. Just as we need to be careful when using registers in procedures, more<br />

care must also be taken when invoking nonleaf procedures.<br />

For example, suppose that the main program calls procedure A with an argument<br />

of 3, by placing the value 3 into register $a0 <strong>and</strong> then using jal A. Then suppose<br />

that procedure A calls procedure B via jal B with an argument of 7, also placed<br />

in $a0. Since A hasn’t finished its task yet, there is a conflict over the use of register<br />

$a0. Similarly, there is a conflict over the return address in register $ra, since it<br />

now has the return address for B. Unless we take steps to prevent the problem, this<br />

conflict will eliminate procedure A’s ability to return to its caller.<br />

One solution is to push all the other registers that must be preserved onto<br />

the stack, just as we did with the saved registers. The caller pushes any argument<br />

registers ($a0–$a3) or temporary registers ($t0–$t9) that are needed after<br />

the call. The callee pushes the return address register $ra <strong>and</strong> any saved registers<br />

($s0–$s7) used by the callee. The stack pointer $sp is adjusted to account for the<br />

number of registers placed on the stack. Upon the return, the registers are restored<br />

from memory <strong>and</strong> the stack pointer is readjusted.



EXAMPLE

Compiling a Recursive C Procedure, Showing Nested Procedure Linking

Let's tackle a recursive procedure that calculates factorial:

int fact (int n)
{
  if (n < 1) return (1);
  else return (n * fact(n - 1));
}

What is the MIPS assembly code?<br />

ANSWER

The parameter variable n corresponds to the argument register $a0. The compiled program starts with the label of the procedure and then saves two registers on the stack, the return address and $a0:

fact:<br />

addi $sp, $sp, -8 # adjust stack for 2 items

sw $ra, 4($sp) # save the return address<br />

sw $a0, 0($sp) # save the argument n<br />

The first time fact is called, sw saves an address in the program that called<br />

fact. The next two instructions test whether n is less than 1, going to L1 if<br />

n ≥ 1.<br />

slti $t0,$a0,1 # test for n < 1<br />

beq $t0,$zero,L1 # if n >= 1, go to L1<br />

If n is less than 1, fact returns 1 by putting 1 into a value register: it adds 1 to<br />

0 <strong>and</strong> places that sum in $v0. It then pops the two saved values off the stack<br />

<strong>and</strong> jumps to the return address:<br />

addi $v0,$zero,1 # return 1<br />

addi $sp,$sp,8 # pop 2 items off stack<br />

jr $ra # return to caller<br />

Before popping two items off the stack, we could have loaded $a0 <strong>and</strong><br />

$ra. Since $a0 <strong>and</strong> $ra don’t change when n is less than 1, we skip those<br />

instructions.<br />

If n is not less than 1, the argument n is decremented <strong>and</strong> then fact is<br />

called again with the decremented value:<br />

L1: addi $a0,$a0,-1 # n >= 1: argument gets (n - 1)

jal fact # call fact with (n - 1)



The next instruction is where fact returns. Now the old return address <strong>and</strong><br />

old argument are restored, along with the stack pointer:<br />

lw $a0, 0($sp) # return from jal: restore argument n<br />

lw $ra, 4($sp) # restore the return address<br />

addi $sp, $sp, 8 # adjust stack pointer to pop 2 items<br />

Next, the value register $v0 gets the product of old argument $a0 <strong>and</strong><br />

the current value of the value register. We assume a multiply instruction is<br />

available, even though it is not covered until Chapter 3:<br />

mul $v0,$a0,$v0 # return n * fact (n – 1)<br />

Finally, fact jumps again to the return address:<br />

jr $ra # return to the caller<br />

Hardware/Software Interface

global pointer: The register that is reserved to point to the static area.

A C variable is generally a location in storage, <strong>and</strong> its interpretation depends both<br />

on its type <strong>and</strong> storage class. Examples include integers <strong>and</strong> characters (see Section<br />

2.9). C has two storage classes: automatic <strong>and</strong> static. Automatic variables are local to<br />

a procedure <strong>and</strong> are discarded when the procedure exits. Static variables exist across<br />

exits from <strong>and</strong> entries to procedures. C variables declared outside all procedures<br />

are considered static, as are any variables declared using the keyword static. The<br />

rest are automatic. To simplify access to static data, MIPS software reserves another<br />

register, called the global pointer, or $gp.<br />

Figure 2.11 summarizes what is preserved across a procedure call. Note that<br />

several schemes preserve the stack, guaranteeing that the caller will get the same<br />

data back on a load from the stack as it stored onto the stack. The stack above $sp<br />

is preserved simply by making sure the callee does not write above $sp; $sp is<br />

Preserved<br />

Saved registers: $s0–$s7<br />

Stack pointer register: $sp<br />

Return address register: $ra<br />

Stack above the stack pointer<br />

Not preserved<br />

Temporary registers: $t0–$t9<br />

Argument registers: $a0–$a3<br />

Return value registers: $v0–$v1<br />

Stack below the stack pointer<br />

FIGURE 2.11 What is <strong>and</strong> what is not preserved across a procedure call. If the software relies<br />

on the frame pointer register or on the global pointer register, discussed in the following subsections, they<br />

are also preserved.



itself preserved by the callee adding exactly the same amount that was subtracted<br />

from it; <strong>and</strong> the other registers are preserved by saving them on the stack (if they<br />

are used) <strong>and</strong> restoring them from there.<br />

Allocating Space for New Data on the Stack<br />

The final complexity is that the stack is also used to store variables that are local<br />

to the procedure but do not fit in registers, such as local arrays or structures. The<br />

segment of the stack containing a procedure’s saved registers <strong>and</strong> local variables is<br />

called a procedure frame or activation record. Figure 2.12 shows the state of the<br />

stack before, during, <strong>and</strong> after the procedure call.<br />

Some MIPS software uses a frame pointer ($fp) to point to the first word of<br />

the frame of a procedure. A stack pointer might change during the procedure, <strong>and</strong><br />

so references to a local variable in memory might have different offsets depending<br />

on where they are in the procedure, making the procedure harder to underst<strong>and</strong>.<br />

Alternatively, a frame pointer offers a stable base register within a procedure for<br />

local memory-references. Note that an activation record appears on the stack<br />

whether or not an explicit frame pointer is used. We’ve been avoiding using $fp by<br />

avoiding changes to $sp within a procedure: in our examples, the stack is adjusted<br />

only on entry <strong>and</strong> exit of the procedure.<br />

procedure frame: Also called activation record. The segment of the stack containing a procedure's saved registers and local variables.

frame pointer: A value denoting the location of the saved registers and local variables for a given procedure.

[Figure: three snapshots of the stack drawn from high address (top) to low address (bottom). In (a) $fp and $sp sit together at the pre-call top of stack; in (b) the new frame holds the saved argument registers (if any), the saved return address, the saved saved registers (if any), and local arrays and structures (if any), with $fp at the first word of the frame and $sp at its bottom; in (c) $fp and $sp are restored to their original positions.]

FIGURE 2.12 Illustration of the stack allocation (a) before, (b) during, <strong>and</strong> (c) after the<br />

procedure call. The frame pointer ($fp) points to the first word of the frame, often a saved argument<br />

register, <strong>and</strong> the stack pointer ($sp) points to the top of the stack. The stack is adjusted to make room for<br />

all the saved registers <strong>and</strong> any memory-resident local variables. Since the stack pointer may change during<br />

program execution, it’s easier for programmers to reference variables via the stable frame pointer, although it<br />

could be done just with the stack pointer <strong>and</strong> a little address arithmetic. If there are no local variables on the<br />

stack within a procedure, the compiler will save time by not setting <strong>and</strong> restoring the frame pointer. When a<br />

frame pointer is used, it is initialized using the address in $sp on a call, <strong>and</strong> $sp is restored using $fp. This<br />

information is also found in Column 4 of the MIPS Reference Data Card at the front of this book.



Figure 2.14 summarizes the register conventions for the MIPS assembly<br />

language. This convention is another example of making the common case fast:<br />

most procedures can be satisfied with up to 4 arguments, 2 registers for a return<br />

value, 8 saved registers, <strong>and</strong> 10 temporary registers without ever going to memory.<br />

Name      Register number   Usage                                          Preserved on call?
$zero     0                 The constant value 0                           n.a.
$v0–$v1   2–3               Values for results and expression evaluation   no
$a0–$a3   4–7               Arguments                                      no
$t0–$t7   8–15              Temporaries                                    no
$s0–$s7   16–23             Saved                                          yes
$t8–$t9   24–25             More temporaries                               no
$gp       28                Global pointer                                 yes
$sp       29                Stack pointer                                  yes
$fp       30                Frame pointer                                  yes
$ra       31                Return address                                 yes

FIGURE 2.14 MIPS register conventions. Register 1, called $at, is reserved for the assembler (see Section 2.12), and registers 26–27, called $k0–$k1, are reserved for the operating system. This information is also found in Column 2 of the MIPS Reference Data Card at the front of this book.

Elaboration: What if there are more than four parameters? The MIPS convention is<br />

to place the extra parameters on the stack just above the frame pointer. The procedure<br />

then expects the first four parameters to be in registers $a0 through $a3 and the rest

in memory, addressable via the frame pointer.<br />

As mentioned in the caption of Figure 2.12, the frame pointer is convenient because<br />

all references to variables in the stack within a procedure will have the same offset.<br />

The frame pointer is not necessary, however. The GNU MIPS C compiler uses a frame<br />

pointer, but the C compiler from MIPS does not; it treats register 30 as another save<br />

register ($s8).<br />

Elaboration: Some recursive procedures can be implemented iteratively without using recursion. Iteration can significantly improve performance by removing the overhead associated with recursive procedure calls. For example, consider a procedure used to accumulate a sum:

int sum (int n, int acc) {
  if (n > 0)
    return sum(n - 1, acc + n);
  else
    return acc;
}

Consider the procedure call sum(3,0). This will result in recursive calls to sum(2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four times.
sum(2,3), sum(1,5), <strong>and</strong> sum(0,6), <strong>and</strong> then the result 6 will be returned four


2.9 Communicating with People

EXAMPLE

ASCII versus Binary Numbers

We could represent numbers as strings of ASCII digits instead of as integers. How much does storage increase if the number 1 billion is represented in ASCII versus a 32-bit integer?

ANSWER

One billion is 1,000,000,000, so it would take 10 ASCII digits, each 8 bits long. Thus the storage expansion would be (10 × 8)/32 or 2.5. Beyond the expansion in storage, the hardware to add, subtract, multiply, and divide such decimal numbers is difficult and would consume more energy. Such difficulties explain why computing professionals are raised to believe that binary is natural and that the occasional decimal computer is bizarre.

A series of instructions can extract a byte from a word, so load word <strong>and</strong> store<br />

word are sufficient for transferring bytes as well as words. Because of the popularity<br />

of text in some programs, however, MIPS provides instructions to move bytes. Load<br />

byte (lb) loads a byte from memory, placing it in the rightmost 8 bits of a register.<br />

Store byte (sb) takes a byte from the rightmost 8 bits of a register <strong>and</strong> writes it to<br />

memory. Thus, we copy a byte with the sequence<br />

lb $t0,0($sp) # Read byte from source
sb $t0,0($gp) # Write byte to destination

Characters are normally combined into strings, which have a variable number<br />

of characters. There are three choices for representing a string: (1) the first position<br />

of the string is reserved to give the length of a string, (2) an accompanying variable<br />

has the length of the string (as in a structure), or (3) the last position of a string is<br />

indicated by a character used to mark the end of a string. C uses the third choice,<br />

terminating a string with a byte whose value is 0 (named null in ASCII). Thus,<br />

the string “Cal” is represented in C by the following 4 bytes, shown as decimal<br />

numbers: 67, 97, 108, 0. (As we shall see, Java uses the first option.)
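A two-line C check of that claim (our own snippet): printing the bytes of the string "Cal" shows the terminating 0:

#include <stdio.h>

int main(void) {
    const char s[] = "Cal";        /* 4 bytes: 'C', 'a', 'l', and the null byte */
    for (int i = 0; i < 4; i++)
        printf("%d ", s[i]);       /* prints: 67 97 108 0 */
    printf("\n");
    return 0;
}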



EXAMPLE<br />

Compiling a String Copy Procedure, Showing How to Use C Strings<br />

The procedure strcpy copies string y to string x using the null byte<br />

termination convention of C:<br />

void strcpy (char x[], char y[])
{
  int i;
  i = 0;
  while ((x[i] = y[i]) != '\0') /* copy & test byte */
    i += 1;
}

What is the MIPS assembly code?<br />

ANSWER<br />

Below is the basic MIPS assembly code segment. Assume that base addresses<br />

for arrays x <strong>and</strong> y are found in $a0 <strong>and</strong> $a1, while i is in $s0. strcpy<br />

adjusts the stack pointer <strong>and</strong> then saves the saved register $s0 on the stack:<br />

strcpy:<br />

addi $sp,$sp,-4 # adjust stack for 1 more item

sw $s0, 0($sp) # save $s0<br />

To initialize i to 0, the next instruction sets $s0 to 0 by adding 0 to 0 <strong>and</strong><br />

placing that sum in $s0:<br />

add $s0,$zero,$zero # i = 0 + 0<br />

This is the beginning of the loop. The address of y[i] is first formed by adding<br />

i to y[]:<br />

L1: add $t1,$s0,$a1 # address of y[i] in $t1<br />

Note that we don’t have to multiply i by 4 since y is an array of bytes <strong>and</strong> not<br />

of words, as in prior examples.<br />

To load the character in y[i], we use load byte unsigned, which puts the<br />

character into $t2:<br />

lbu<br />

$t2, 0($t1) # $t2 = y[i]<br />

A similar address calculation puts the address of x[i] in $t3, <strong>and</strong> then the<br />

character in $t2 is stored at that address.



add $t3,$s0,$a0 # address of x[i] in $t3<br />

sb $t2, 0($t3) # x[i] = y[i]<br />

Next, we exit the loop if the character was 0. That is, we exit if it is the last<br />

character of the string:<br />

beq<br />

$t2,$zero,L2 # if y[i] == 0, go to L2<br />

If not, we increment i <strong>and</strong> loop back:<br />

addi $s0, $s0,1 # i = i + 1<br />

j L1 # go to L1<br />

If we don’t loop back, it was the last character of the string; we restore $s0 <strong>and</strong><br />

the stack pointer, <strong>and</strong> then return.<br />

L2: lw $s0, 0($sp) # y[i] == 0: end of string.<br />

# Restore old $s0<br />

addi $sp,$sp,4 # pop 1 word off stack<br />

jr $ra # return<br />

String copies usually use pointers instead of arrays in C to avoid the operations<br />

on i in the code above. See Section 2.14 for an explanation of arrays versus<br />

pointers.<br />
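As a sketch of that pointer style (our own code, named copy_string so it does not collide with the C library's strcpy), the whole loop collapses into the while condition:

#include <stdio.h>

void copy_string(char *x, const char *y) {
    while ((*x++ = *y++) != '\0')   /* copy a byte, then test the byte just copied */
        ;                           /* all the work happens in the condition       */
}

int main(void) {
    char buf[4];
    copy_string(buf, "Cal");
    printf("%s\n", buf);            /* prints: Cal */
    return 0;
}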

Since the procedure strcpy above is a leaf procedure, the compiler could<br />

allocate i to a temporary register <strong>and</strong> avoid saving <strong>and</strong> restoring $s0. Hence,<br />

instead of thinking of the $t registers as being just for temporaries, we can think of<br />

them as registers that the callee should use whenever convenient. When a compiler<br />

finds a leaf procedure, it exhausts all temporary registers before using registers it<br />

must save.<br />

Characters <strong>and</strong> Strings in Java<br />

Unicode is a universal encoding of the alphabets of most human languages. Figure<br />

2.16 gives a list of Unicode alphabets; there are almost as many alphabets in Unicode<br />

as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for<br />

characters. By default, it uses 16 bits to represent a character.



Latin      | Malayalam                             | Tagbanwa                | General Punctuation
Greek      | Sinhala                               | Khmer                   | Spacing Modifier Letters
Cyrillic   | Thai                                  | Mongolian               | Currency Symbols
Armenian   | Lao                                   | Limbu                   | Combining Diacritical Marks
Hebrew     | Tibetan                               | Tai Le                  | Combining Marks for Symbols
Arabic     | Myanmar                               | Kangxi Radicals         | Superscripts and Subscripts
Syriac     | Georgian                              | Hiragana                | Number Forms
Thaana     | Hangul Jamo                           | Katakana                | Mathematical Operators
Devanagari | Ethiopic                              | Bopomofo                | Mathematical Alphanumeric Symbols
Bengali    | Cherokee                              | Kanbun                  | Braille Patterns
Gurmukhi   | Unified Canadian Aboriginal Syllabic  | Shavian                 | Optical Character Recognition
Gujarati   | Ogham                                 | Osmanya                 | Byzantine Musical Symbols
Oriya      | Runic                                 | Cypriot Syllabary       | Musical Symbols
Tamil      | Tagalog                               | Tai Xuan Jing Symbols   | Arrows
Telugu     | Hanunoo                               | Yijing Hexagram Symbols | Box Drawing
Kannada    | Buhid                                 | Aegean Numbers          | Geometric Shapes

FIGURE 2.16 Example alphabets in Unicode. Unicode version 4.0 has more than 160 "blocks," which is their name for a collection of symbols. Each block is a multiple of 16. For example, Greek starts at 0370hex, and Cyrillic at 0400hex. The first three columns show 48 blocks that correspond to human languages in roughly Unicode numerical order. The last column has 16 blocks that are multilingual and are not in order. A 16-bit encoding, called UTF-16, is the default. A variable-length encoding, called UTF-8, keeps the ASCII subset as eight bits and uses 16 or 32 bits for the other characters. UTF-32 uses 32 bits per character. To learn more, see www.unicode.org.

The MIPS instruction set has explicit instructions to load and store such 16-bit quantities, called halfwords. Load half (lh) loads a halfword from memory, placing it in the rightmost 16 bits of a register. Like load byte, load half (lh) treats the halfword as a signed number and thus sign-extends to fill the 16 leftmost bits of the register, while load halfword unsigned (lhu) works with unsigned integers. Thus, lhu is the more popular of the two. Store half (sh) takes a halfword from the rightmost 16 bits of a register and writes it to memory. We copy a halfword with the sequence

lhu $t0,0($sp) # Read halfword (16 bits) from source
sh  $t0,0($gp) # Write halfword (16 bits) to destination

Strings are a standard Java class with special built-in support and predefined methods for concatenation, comparison, and conversion. Unlike C, Java includes a word that gives the length of the string, similar to Java arrays.



32-Bit Immediate Operands

Although constants are frequently short and fit into the 16-bit field, sometimes they are bigger. The MIPS instruction set includes the instruction load upper immediate (lui) specifically to set the upper 16 bits of a constant in a register, allowing a subsequent instruction to specify the lower 16 bits of the constant. Figure 2.17 shows the operation of lui.

EXAMPLE: Loading a 32-Bit Constant

What is the MIPS assembly code to load this 32-bit constant into register $s0?

0000 0000 0011 1101 0000 1001 0000 0000

ANSWER

First, we would load the upper 16 bits, which is 61 in decimal, using lui:

lui $s0, 61 # 61 decimal = 0000 0000 0011 1101 binary

The value of register $s0 afterward is

0000 0000 0011 1101 0000 0000 0000 0000

The next step is to insert the lower 16 bits, whose decimal value is 2304:

ori $s0, $s0, 2304 # 2304 decimal = 0000 1001 0000 0000

The final value in register $s0 is the desired value:

0000 0000 0011 1101 0000 1001 0000 0000

The machine language version of lui $t0, 255 (register $t0 is register 8):

001111 00000 01000 0000 0000 1111 1111

Contents of register $t0 after executing lui $t0, 255:

0000 0000 1111 1111 0000 0000 0000 0000

FIGURE 2.17 The effect of the lui instruction. The instruction lui transfers the 16-bit immediate constant field value into the leftmost 16 bits of the register, filling the lower 16 bits with 0s.



Either the compiler or the assembler must break large constants into pieces and then reassemble them into a register. As you might expect, the immediate field's size restriction may be a problem for memory addresses in loads and stores as well as for constants in immediate instructions. If this job falls to the assembler, as it does for MIPS software, then the assembler must have a temporary register available in which to create the long values. This need is a reason for the register $at (assembler temporary), which is reserved for the assembler.

Hence, the symbolic representation of the MIPS machine language is no longer limited by the hardware, but by whatever the creator of an assembler chooses to include (see Section 2.12). We stick close to the hardware to explain the architecture of the computer, noting when we use the enhanced language of the assembler that is not found in the processor.

Hardware/Software Interface

Elaboration: Creating 32-bit constants needs care. The instruction addi copies the left-most bit of the 16-bit immediate field of the instruction into the upper 16 bits of a word. Logical or immediate from Section 2.6 loads 0s into the upper 16 bits and hence is used by the assembler in conjunction with lui to create 32-bit constants.
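For example, here is a sketch of the kind of expansion an assembler could generate for a load from a data word at the made-up address 1001 8004hex, which does not fit in a 16-bit offset; the full address is built in $at with lui and ori, and the load then uses a zero offset (the exact expansion a real assembler chooses may differ):

lui $at, 0x1001      # upper 16 bits of the address into $at
ori $at, $at, 0x8004 # OR in the lower 16 bits
lw  $t0, 0($at)      # $t0 = Memory[1001 8004hex]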

Addressing in Branches and Jumps

The MIPS jump instructions have the simplest addressing. They use the final MIPS instruction format, called the J-type, which consists of 6 bits for the operation field and the rest of the bits for the address field. Thus,

j 10000 # go to location 10000

could be assembled into this format (it's actually a bit more complicated, as we will see):

  2      | 10000
  6 bits | 26 bits

where the value of the jump opcode is 2 and the jump address is 10000.

Unlike the jump instruction, the conditional branch instruction must specify two operands in addition to the branch address. Thus,

bne $s0,$s1,Exit # go to Exit if $s0 ≠ $s1

is assembled into this instruction, leaving only 16 bits for the branch address:

  5      | 16     | 17     | Exit
  6 bits | 5 bits | 5 bits | 16 bits



If addresses of the program had to fit in this 16-bit field, it would mean that no program could be bigger than 2^16, which is far too small to be a realistic option today. An alternative would be to specify a register that would always be added to the branch address, so that a branch instruction would calculate the following:

Program counter = Register + Branch address

PC-relative addressing: An addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction.

This sum allows the program to be as large as 2^32 and still be able to use conditional branches, solving the branch address size problem. Then the question is, which register?

The answer comes from seeing how conditional branches are used. Conditional branches are found in loops and in if statements, so they tend to branch to a nearby instruction. For example, about half of all conditional branches in SPEC benchmarks go to locations less than 16 instructions away. Since the program counter (PC) contains the address of the current instruction, we can branch within ±2^15 words of the current instruction if we use the PC as the register to be added to the address. Almost all loops and if statements are much smaller than 2^16 words, so the PC is the ideal choice.

This form of branch addressing is called PC-relative addressing. As we shall see in Chapter 4, it is convenient for the hardware to increment the PC early to point to the next instruction. Hence, the MIPS address is actually relative to the address of the following instruction (PC + 4) as opposed to the current instruction (PC). It is yet another example of making the common case fast, which in this case is addressing nearby instructions.

Like most recent computers, MIPS uses PC-relative addressing for all conditional branches, because the destination of these instructions is likely to be close to the branch. On the other hand, jump-and-link instructions invoke procedures that have no reason to be near the call, so they normally use other forms of addressing. Hence, the MIPS architecture offers long addresses for procedure calls by using the J-type format for both jump and jump-and-link instructions.

Since all MIPS instructions are 4 bytes long, MIPS stretches the distance of the branch by having PC-relative addressing refer to the number of words to the next instruction instead of the number of bytes. Thus, the 16-bit field can branch four times as far by interpreting the field as a relative word address rather than as a relative byte address. Similarly, the 26-bit field in jump instructions is also a word address, meaning that it represents a 28-bit byte address.
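As a quick worked example with made-up addresses: if the bne above sits at byte address 80016 and Exit labels the instruction at byte address 80032, the 16-bit field holds the word distance from the following instruction, (80032 − (80016 + 4)) / 4 = 3; the hardware later recovers the target as 80020 + 3 × 4 = 80032.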

Elaboration: Since the PC is 32 bits, 4 bits must come from somewhere else for jumps. The MIPS jump instruction replaces only the lower 28 bits of the PC, leaving the upper 4 bits of the PC unchanged. The loader and linker (Section 2.12) must be careful to avoid placing a program across an address boundary of 256 MB (64 million instructions); otherwise, a jump must be replaced by a jump register instruction preceded by other instructions to load the full 32-bit address into a register.



Hardware/Software Interface

Most conditional branches are to a nearby location, but occasionally they branch far away, farther than can be represented in the 16 bits of the conditional branch instruction. The assembler comes to the rescue just as it did with large addresses or constants: it inserts an unconditional jump to the branch target, and inverts the condition so that the branch decides whether to skip the jump.

EXAMPLE: Branching Far Away

Given a branch on register $s0 being equal to register $s1,

beq $s0, $s1, L1

replace it by a pair of instructions that offers a much greater branching distance.

ANSWER

These instructions replace the short-address conditional branch:

bne $s0, $s1, L2
j L1
L2:

addressing mode: One of several addressing regimes delimited by their varied use of operands and/or addresses.

MIPS Addressing Mode Summary

Multiple forms of addressing are generically called addressing modes. Figure 2.18 shows how operands are identified for each addressing mode. The MIPS addressing modes are the following (a representative instruction for each mode appears after the list):

1. Immediate addressing, where the operand is a constant within the instruction itself
2. Register addressing, where the operand is a register
3. Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction
4. PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction
5. Pseudodirect addressing, where the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC
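As a quick illustration (a sketch; the particular instructions and labels are just representative), one MIPS instruction per addressing mode:

addi $t0, $t1, 4     # 1. immediate: the constant 4 sits inside the instruction
add  $t0, $t1, $t2   # 2. register: all operands are registers
lw   $t0, 8($t1)     # 3. base or displacement: address = $t1 + 8
beq  $t0, $t1, L1    # 4. PC-relative: L1 is encoded as a word offset from PC + 4
j    L2              # 5. pseudodirect: 26-bit word address concatenated with upper PC bits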



Decoding Machine Language

Sometimes you are forced to reverse-engineer machine language to create the original assembly language. One example is when looking at a "core dump." Figure 2.19 shows the MIPS encoding of the fields for the MIPS machine language. This figure helps when translating by hand between assembly language and machine language.

EXAMPLE: Decoding Machine Code

What is the assembly language statement corresponding to this machine instruction?

00af8020hex

ANSWER

The first step in converting hexadecimal to binary is to find the op fields:

0000 0000 1010 1111 1000 0000 0010 0000

We look at the op field to determine the operation. Referring to Figure 2.19, when bits 31-29 are 000 and bits 28-26 are 000, it is an R-format instruction. Let's reformat the binary instruction into R-format fields, listed in Figure 2.20:

op     rs    rt    rd    shamt funct
000000 00101 01111 10000 00000 100000

The bottom portion of Figure 2.19 determines the operation of an R-format instruction. In this case, bits 5-3 are 100 and bits 2-0 are 000, which means this binary pattern represents an add instruction.

We decode the rest of the instruction by looking at the field values. The decimal values are 5 for the rs field, 15 for rt, and 16 for rd (shamt is unused). Figure 2.14 shows that these numbers represent registers $a1, $t7, and $s0. Now we can reveal the assembly instruction:

add $s0,$a1,$t7



Name       | Fields                                               | Comments
Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits  | All MIPS instructions are 32 bits long
R-format   | op     | rs     | rt     | rd     | shamt  | funct   | Arithmetic instruction format
I-format   | op     | rs     | rt     | address/immediate         | Transfer, branch, immediate format
J-format   | op     | target address                              | Jump instruction format

FIGURE 2.20 MIPS instruction formats.

Figure 2.20 shows all the MIPS instruction formats. Figure 2.1 on page 64 shows the MIPS assembly language revealed in this chapter. The remaining hidden portion of MIPS instructions deals mainly with arithmetic and real numbers, which are covered in the next chapter.

Check Yourself

I. What is the range of addresses for conditional branches in MIPS (K = 1024)?

1. Addresses between 0 and 64K − 1
2. Addresses between 0 and 256K − 1
3. Addresses up to about 32K before the branch to about 32K after
4. Addresses up to about 128K before the branch to about 128K after

II. What is the range of addresses for jump and jump and link in MIPS (M = 1024K)?

1. Addresses between 0 and 64M − 1
2. Addresses between 0 and 256M − 1
3. Addresses up to about 32M before the branch to about 32M after
4. Addresses up to about 128M before the branch to about 128M after
5. Anywhere within a block of 64M addresses where the PC supplies the upper 6 bits
6. Anywhere within a block of 256M addresses where the PC supplies the upper 4 bits

III. What is the MIPS assembly language instruction corresponding to the machine instruction with the value 0000 0000hex?

1. j
2. R-format
3. addi
4. sll
5. mfc0
6. Undefined opcode: there is no legal instruction that corresponds to 0



2.11 Parallelism and Instructions: Synchronization

Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation usually means some tasks are writing new values that others must read. To know when a task is finished writing so that it is safe for another to read, the tasks need to synchronize. If they don't synchronize, there is a danger of a data race, where the results of the program can change depending on how events happen to occur.

data race: Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, and they occur one after another.

For example, recall the analogy of the eight reporters writing a story on page 44 of Chapter 1. Suppose one reporter needs to read all the prior sections before writing a conclusion. Hence, he or she must know when the other reporters have finished their sections, so that there is no danger of sections being changed afterwards. That is, they had better synchronize the writing and reading of each section so that the conclusion will be consistent with what is printed in the prior sections.

In computing, synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. In this section, we focus on the implementation of lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create regions where only a single processor can operate, called a mutual exclusion, as well as to implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. That is, nothing else can interpose itself between the read and the write of the memory location. Without such a capability, the cost of building basic synchronization primitives will be high and will increase unreasonably as the processor count increases.

There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically. In general, architects do not expect users to employ the basic hardware primitives, but instead expect that the primitives will be used by system programmers to build a synchronization library, a process that is often complex and tricky.

Let's start with one such hardware primitive and show how it can be used to build a basic synchronization primitive. One typical operation for building synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a register for a value in memory.

To see how to use this to build a basic synchronization primitive, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and 1 is used to indicate that the lock is unavailable. A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access, and 0 otherwise.



In the latter case, the value is also changed to 1, preventing any competing exchange in another processor from also retrieving a 0.

For example, consider two processors that each try to do the exchange simultaneously: this race is broken, since exactly one of the processors will perform the exchange first, returning 0, and the second processor will return 1 when it does the exchange. The key to using the exchange primitive to implement synchronization is that the operation is atomic: the exchange is indivisible, and two simultaneous exchanges will be ordered by the hardware. It is impossible for two processors trying to set the synchronization variable in this manner to both think they have simultaneously set the variable.

Implementing a single atomic memory operation introduces some challenges in the design of the processor, since it requires both a memory read and a write in a single, uninterruptible instruction.

An alternative is to have a pair of instructions in which the second instruction returns a value showing whether the pair of instructions was executed as if the pair were atomic. The pair of instructions is effectively atomic if it appears as if all other operations executed by any processor occurred before or after the pair. Thus, when an instruction pair is effectively atomic, no other processor can change the value between the instruction pair.

In MIPS this pair of instructions includes a special load called a load linked and a special store called a store conditional. These instructions are used in sequence: if the contents of the memory location specified by the load linked are changed before the store conditional to the same address occurs, then the store conditional fails. The store conditional is defined to both store the value of a (presumably different) register in memory and to change the value of that register to a 1 if it succeeds and to a 0 if it fails. Since the load linked returns the initial value, and the store conditional returns 1 only if it succeeds, the following sequence implements an atomic exchange on the memory location specified by the contents of $s1:

again: addi $t0,$zero,1      # copy locked value
       ll   $t1,0($s1)       # load linked
       sc   $t0,0($s1)       # store conditional
       beq  $t0,$zero,again  # branch if store fails
       add  $s4,$zero,$t1    # put load value in $s4

Any time a processor intervenes and modifies the value in memory between the ll and sc instructions, the sc returns 0 in $t0, causing the code sequence to try again. At the end of this sequence the contents of $s4 and the memory location specified by $s1 have been atomically exchanged.
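Building on that sequence, here is a sketch (not from the text) of a simple spin lock that uses the same load linked/store conditional pair; the lock word's address is assumed to be in $s1, with 0 meaning free and 1 meaning taken:

lock:   addi $t0,$zero,1      # value to install: 1 = locked
        ll   $t1,0($s1)       # load linked: read the current lock value
        bne  $t1,$zero,lock   # spin while another processor holds the lock
        sc   $t0,0($s1)       # try to claim the lock atomically
        beq  $t0,$zero,lock   # store conditional failed: something interfered, retry
        ...                   # critical section goes here
unlock: sw   $zero,0($s1)     # release: an ordinary store of 0 is enough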

Elaboration: Although it was presented for multiprocessor synchronization, atomic exchange is also useful for the operating system in dealing with multiple processes in a single processor. To make sure nothing interferes in a single processor, the store conditional also fails if the processor does a context switch between the two instructions (see Chapter 5).



C program
  → (Compiler)
Assembly language program
  → (Assembler)
Object: Machine language module    +    Object: Library routine (machine language)
  → (Linker)
Executable: Machine language program
  → (Loader)
Memory

FIGURE 2.21 A translation hierarchy for C. A high-level language program is first compiled into an assembly language program and then assembled into an object module in machine language. The linker combines multiple modules with library routines to resolve all references. The loader then places the machine code into the proper memory locations for execution by the processor. To speed up the translation process, some steps are skipped or combined. Some compilers produce object modules directly, and some systems use linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked library routines are x.a, dynamically linked library routines are x.so, and executable files by default are called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect.

pseudoinstruction: A common variation of assembly language instructions often treated as if it were an instruction in its own right.

Assembler

Since assembly language is an interface to higher-level software, the assembler can also treat common variations of machine language instructions as if they were instructions in their own right. The hardware need not implement these instructions; however, their appearance in assembly language simplifies translation and programming. Such instructions are called pseudoinstructions.

As mentioned above, the MIPS hardware makes sure that register $zero always has the value 0. That is, whenever register $zero is used, it supplies a 0, and the programmer cannot change the value of register $zero. Register $zero is used to create the assembly language instruction that copies the contents of one register to another. Thus the MIPS assembler accepts this instruction even though it is not found in the MIPS architecture:

move $t0,$t1 # register $t0 gets register $t1



The assembler converts this assembly language instruction into the machine language equivalent of the following instruction:

add $t0,$zero,$t1 # register $t0 gets 0 + register $t1

The MIPS assembler also converts blt (branch on less than) into the two instructions slt and bne mentioned in the example on page 95. Other examples include bgt, bge, and ble. It also converts branches to faraway locations into a branch and jump. As mentioned above, the MIPS assembler allows 32-bit constants to be loaded into a register despite the 16-bit limit of the immediate instructions.
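For instance, a sketch of the kind of expansion the assembler could produce for blt, using the reserved register $at (the exact expansion is up to the assembler):

blt $s1, $s2, Less    # pseudoinstruction written by the programmer
# might become:
slt $at, $s1, $s2     # $at = 1 if $s1 < $s2
bne $at, $zero, Less  # branch to Less if the comparison was true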

In summary, pseudoinstructions give MIPS a richer set of assembly language instructions than those implemented by the hardware. The only cost is reserving one register, $at, for use by the assembler. If you are going to write assembly programs, use pseudoinstructions to simplify your task. To understand the MIPS architecture and be sure to get best performance, however, study the real MIPS instructions found in Figures 2.1 and 2.19.

Assemblers will also accept numbers in a variety of bases. In addition to binary and decimal, they usually accept a base that is more succinct than binary yet converts easily to a bit pattern. MIPS assemblers use hexadecimal.

Such features are convenient, but the primary task of an assembler is assembly into machine code. The assembler turns the assembly language program into an object file, which is a combination of machine language instructions, data, and information needed to place instructions properly in memory.

To produce the binary version of each instruction in the assembly language program, the assembler must determine the addresses corresponding to all labels. Assemblers keep track of labels used in branches and data transfer instructions in a symbol table. As you might expect, the table contains pairs of symbols and addresses.

The object file for UNIX systems typically contains six distinct pieces:

■ The object file header describes the size and position of the other pieces of the object file.
■ The text segment contains the machine language code.
■ The static data segment contains data allocated for the life of the program. (UNIX allows programs to use both static data, which is allocated throughout the program, and dynamic data, which can grow or shrink as needed by the program. See Figure 2.13.)
■ The relocation information identifies instructions and data words that depend on absolute addresses when the program is loaded into memory.
■ The symbol table contains the remaining labels that are not defined, such as external references.

symbol table: A table that matches names of labels to the addresses of the memory words that instructions occupy.



■ The debugging information contains a concise description of how the modules were compiled so that a debugger can associate machine instructions with C source files and make data structures readable.

The next subsection shows how to attach such routines that have already been assembled, such as library routines.

linker: Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.

executable file: A functional program in the format of an object file that contains no unresolved references. It can contain symbol tables and debugging information. A "stripped executable" does not contain that information. Relocation information may be included for the loader.

Linker

What we have presented so far suggests that a single change to one line of one procedure requires compiling and assembling the whole program. Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful for standard library routines, because programmers would be compiling and assembling routines that by definition almost never change. An alternative is to compile and assemble each procedure independently, so that a change to one line would require compiling and assembling only one procedure. This alternative requires a new systems program, called a link editor or linker, which takes all the independently assembled machine language programs and "stitches" them together.

There are three steps for the linker:

1. Place code and data modules symbolically in memory.
2. Determine the addresses of data and instruction labels.
3. Patch both the internal and external references.

The linker uses the relocation information and symbol table in each object module to resolve all undefined labels. Such references occur in branch instructions, jump instructions, and data addresses, so the job of this program is much like that of an editor: it finds the old addresses and replaces them with the new addresses. Editing is the origin of the name "link editor," or linker for short. The reason a linker is useful is that it is much faster to patch code than it is to recompile and reassemble.

If all external references are resolved, the linker next determines the memory locations each module will occupy. Recall that Figure 2.13 on page 104 shows the MIPS convention for allocation of program and data to memory. Since the files were assembled in isolation, the assembler could not know where a module's instructions and data would be placed relative to other modules. When the linker places a module in memory, all absolute references, that is, memory addresses that are not relative to a register, must be relocated to reflect its true location.

The linker produces an executable file that can be run on a computer. Typically, this file has the same format as an object file, except that it contains no unresolved references. It is possible to have partially linked files, such as library routines, that still have unresolved addresses and hence result in object files.



EXAMPLE: Linking Object Files

Link the two object files below. Show updated addresses of the first few instructions of the completed executable file. We show the instructions in assembly language just to make the example understandable; in reality, the instructions would be numbers.

Note that in the object files we have highlighted the addresses and symbols that must be updated in the link process: the instructions that refer to the addresses of procedures A and B and the instructions that refer to the addresses of data words X and Y.

Object file for procedure A:

Object file header     | Name: Procedure A | Text size: 100hex | Data size: 20hex
Text segment           | Address 0: lw $a0, 0($gp) | Address 4: jal 0 | ...
Data segment           | 0: (X) | ...
Relocation information | Address 0, instruction type lw, dependency X | Address 4, instruction type jal, dependency B
Symbol table           | Label X: - | Label B: -

Object file for procedure B:

Object file header     | Name: Procedure B | Text size: 200hex | Data size: 30hex
Text segment           | Address 0: sw $a1, 0($gp) | Address 4: jal 0 | ...
Data segment           | 0: (Y) | ...
Relocation information | Address 0, instruction type sw, dependency Y | Address 4, instruction type jal, dependency A
Symbol table           | Label Y: - | Label A: -


128 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />

ANSWER

Procedure A needs to find the address for the variable labeled X to put in the load instruction and to find the address of procedure B to place in the jal instruction. Procedure B needs the address of the variable labeled Y for the store instruction and the address of procedure A for its jal instruction.

From Figure 2.13 on page 104, we know that the text segment starts at address 40 0000hex and the data segment at 1000 0000hex. The text of procedure A is placed at the first address and its data at the second. The object file header for procedure A says that its text is 100hex bytes and its data is 20hex bytes, so the starting address for procedure B text is 40 0100hex, and its data starts at 1000 0020hex.

Executable file header | Text size: 300hex | Data size: 50hex

Text segment:
  0040 0000hex: lw $a0, 8000hex($gp)
  0040 0004hex: jal 40 0100hex
  ...
  0040 0100hex: sw $a1, 8020hex($gp)
  0040 0104hex: jal 40 0000hex
  ...

Data segment:
  1000 0000hex: (X)
  ...
  1000 0020hex: (Y)
  ...


Now the linker updates the address fields of the instructions. It uses the instruction type field to know the format of the address to be edited. We have two types here:

1. The jals are easy because they use pseudodirect addressing. The jal at address 40 0004hex gets 40 0100hex (the address of procedure B) in its address field, and the jal at 40 0104hex gets 40 0000hex (the address of procedure A) in its address field.

2. The load and store addresses are harder because they are relative to a base register. This example uses the global pointer as the base register. Figure 2.13 shows that $gp is initialized to 1000 8000hex. To get the address 1000 0000hex (the address of word X), we place 8000hex in the address field of lw at address 40 0000hex. Similarly, we place 8020hex in the address field of sw at address 40 0100hex to get the address 1000 0020hex (the address of word Y).



Elaboration: Recall that MIPS instructions are word aligned, so jal drops the right two bits to increase the instruction's address range. Thus, it uses 26 bits to create a 28-bit byte address. Hence, the actual address in the lower 26 bits of the jal instruction in this example is 10 0040hex, rather than 40 0100hex.

Loader

Now that the executable file is on disk, the operating system reads it to memory and starts it. The loader follows these steps in UNIX systems:

1. Reads the executable file header to determine the size of the text and data segments.
2. Creates an address space large enough for the text and data.
3. Copies the instructions and data from the executable file into memory.
4. Copies the parameters (if any) to the main program onto the stack.
5. Initializes the machine registers and sets the stack pointer to the first free location.
6. Jumps to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program. When the main routine returns, the start-up routine terminates the program with an exit system call.

Sections A.3 and A.4 in Appendix A describe linkers and loaders in more detail.

Dynamically Linked Libraries

The first part of this section describes the traditional approach to linking libraries before the program is run. Although this static approach is the fastest way to call library routines, it has a few disadvantages:

■ The library routines become part of the executable code. If a new version of the library is released that fixes bugs or supports new hardware devices, the statically linked program keeps using the old version.
■ It loads all routines in the library that are called anywhere in the executable, even if those calls are not executed. The library can be large relative to the program; for example, the standard C library is 2.5 MB.

These disadvantages lead to dynamically linked libraries (DLLs), where the library routines are not linked and loaded until the program is run. Both the program and library routines keep extra information on the location of nonlocal procedures and their names. In the initial version of DLLs, the loader ran a dynamic linker, using the extra information in the file to find the appropriate libraries and to update all external references.

loader: A systems program that places an object program in main memory so that it is ready to execute.

dynamically linked libraries (DLLs): Library routines that are linked to a program during execution.

"Virtually every problem in computer science can be solved by another level of indirection."
David Wheeler



The downside of the initial version of DLLs was that it still linked all routines of the library that might be called, versus only those that are called during the running of the program. This observation led to the lazy procedure linkage version of DLLs, where each routine is linked only after it is called.

Like many innovations in our field, this trick relies on a level of indirection. Figure 2.22 shows the technique. It starts with the nonlocal routines calling a set of dummy routines at the end of the program, with one entry per nonlocal routine. These dummy entries each contain an indirect jump.

The first time the library routine is called, the program calls the dummy entry and follows the indirect jump. It points to code that puts a number in a register to identify the desired library routine and then jumps to the dynamic linker/loader.

[Figure 2.22 shows two panels. (a) First call to DLL routine: the program's jal reaches a dummy entry whose lw/jr indirect jump leads, via an identifying constant, to the dynamic linker/loader, which remaps the DLL routine. (b) Subsequent calls to DLL routine: the same jal and indirect jump now go directly to the DLL routine.]

FIGURE 2.22 Dynamically linked library via lazy procedure linkage. (a) Steps for the first time a call is made to the DLL routine. (b) The steps to find the routine, remap it, and link it are skipped on subsequent calls. As we will see in Chapter 5, the operating system may avoid copying the desired routine by remapping it using virtual memory management.
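To make the indirection concrete, here is a sketch (the names and layout are invented for illustration, not taken from the figure): the call site always jals to a fixed stub, and the stub jumps through a data word that initially points at glue code for the dynamic linker and is later overwritten with the address of the real routine.

        jal  sin_stub      # the program always calls the stub, never sin directly
        ...
sin_stub:
        lw   $t9, sin_ptr  # sin_ptr is a data word holding an address
        jr   $t9           # first call: dynamic-linker glue; later calls: the real sin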


2.13 A C Sort Example to Put It All Together

void swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

FIGURE 2.24 A C procedure that swaps two locations in memory. This subsection uses this procedure in a sorting example.

The Procedure swap

Let's start with the code for the procedure swap in Figure 2.24. This procedure simply swaps two locations in memory. When translating from C to assembly language by hand, we follow these general steps:

1. Allocate registers to program variables.
2. Produce code for the body of the procedure.
3. Preserve registers across the procedure invocation.

This section describes the swap procedure in these three pieces, concluding by putting all the pieces together.

Register Allocation for swap

As mentioned on pages 98-99, the MIPS convention on parameter passing is to use registers $a0, $a1, $a2, and $a3. Since swap has just two parameters, v and k, they will be found in registers $a0 and $a1. The only other variable is temp, which we associate with register $t0 since swap is a leaf procedure (see page 100). This register allocation corresponds to the variable declarations in the first part of the swap procedure in Figure 2.24.

Code for the Body of the Procedure swap

The remaining lines of C code in swap are

temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

Recall that the memory address for MIPS refers to the byte address, and so words are really 4 bytes apart. Hence we need to multiply the index k by 4 before adding it to the address. Forgetting that sequential word addresses differ by 4 instead of 1 is a common mistake in assembly language programming.



The Procedure sort

To ensure that you appreciate the rigor of programming in assembly language, we'll try a second, longer example. In this case, we'll build a routine that calls the swap procedure. This program sorts an array of integers, using bubble or exchange sort, which is one of the simplest if not the fastest sorts. Figure 2.26 shows the C version of the program. Once again, we present this procedure in several steps, concluding with the full procedure.

void sort (int v[], int n)
{
  int i, j;
  for (i = 0; i < n; i += 1) {
    for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
      swap(v,j);
    }
  }
}

FIGURE 2.26 A C procedure that performs a sort on the array v.

Register Allocation for sort

The two parameters of the procedure sort, v and n, are in the parameter registers $a0 and $a1, and we assign register $s0 to i and register $s1 to j.

Code for the Body of the Procedure sort

The procedure body consists of two nested for loops and a call to swap that includes parameters. Let's unwrap the code from the outside to the middle.

The first translation step is the first for loop:

for (i = 0; i < n; i += 1) {

The initialization and the increment of i each take a single instruction (move $s0, $zero and addi $s0, $s0, 1), as the skeleton below shows.



The loop should be exited if i < n is not true or, said another way, should be exited if i ≥ n. The set on less than instruction sets register $t0 to 1 if $s0 < $a1 and to 0 otherwise. Since we want to test if $s0 ≥ $a1, we branch if register $t0 is 0. This test takes two instructions:

for1tst: slt  $t0, $s0, $a1     # reg $t0 = 0 if $s0 ≥ $a1 (i ≥ n)
         beq  $t0, $zero, exit1 # go to exit1 if $s0 ≥ $a1 (i ≥ n)

The bottom of the loop just jumps back to the loop test:

         j    for1tst           # jump to test of outer loop
exit1:

The skeleton code of the first for loop is then

         move $s0, $zero        # i = 0
for1tst: slt  $t0, $s0, $a1     # reg $t0 = 0 if $s0 ≥ $a1 (i ≥ n)
         beq  $t0, $zero, exit1 # go to exit1 if $s0 ≥ $a1 (i ≥ n)
         ...
         (body of first for loop)
         ...
         addi $s0, $s0, 1       # i += 1
         j    for1tst           # jump to test of outer loop
exit1:

Voila! (The exercises explore writing faster code for similar loops.)

The second for loop looks like this in C:

for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {

The initialization portion of this loop is again one instruction:

addi $s1, $s0, -1 # j = i - 1

The decrement of j at the end of the loop is also one instruction:

addi $s1, $s1, -1 # j -= 1

The loop test has two parts. We exit the loop if either condition fails, so the first test must exit the loop if it fails (j < 0):

for2tst: slti $t0, $s1, 0       # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)

This branch will skip over the second condition test. If it doesn't skip, j ≥ 0.
This branch will skip over the second condition test. If it doesn’t skip, j ≥ 0.



The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤ v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte address) and add it to the base address of v:

sll $t1, $s1, 2   # reg $t1 = j * 4
add $t2, $a0, $t1 # reg $t2 = v + (j * 4)

Now we load v[j]:

lw $t3, 0($t2)    # reg $t3 = v[j]

Since we know that the second element is just the following word, we add 4 to the address in register $t2 to get v[j + 1]:

lw $t4, 4($t2)    # reg $t4 = v[j + 1]

The test of v[j] ≤ v[j + 1] is the same as v[j + 1] ≥ v[j], so the two instructions of the exit test are

slt $t0, $t4, $t3     # reg $t0 = 0 if $t4 ≥ $t3
beq $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3

The bottom of the loop jumps back to the inner loop test:

j for2tst             # jump to test of inner loop

Combining the pieces, the skeleton of the second for loop looks like this:

         addi $s1, $s0, -1      # j = i - 1
for2tst: slti $t0, $s1, 0       # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)
         sll  $t1, $s1, 2       # reg $t1 = j * 4
         add  $t2, $a0, $t1     # reg $t2 = v + (j * 4)
         lw   $t3, 0($t2)       # reg $t3 = v[j]
         lw   $t4, 4($t2)       # reg $t4 = v[j + 1]
         slt  $t0, $t4, $t3     # reg $t0 = 0 if $t4 ≥ $t3
         beq  $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3
         ...
         (body of second for loop)
         ...
         addi $s1, $s1, -1      # j -= 1
         j    for2tst           # jump to test of inner loop
exit2:

The Procedure Call in sort

The next step is the body of the second for loop:

swap(v,j);

Calling swap is easy enough:

jal swap



Passing Parameters in sort

The problem comes when we want to pass parameters because the sort procedure needs the values in registers $a0 and $a1, yet the swap procedure needs to have its parameters placed in those same registers. One solution is to copy the parameters for sort into other registers earlier in the procedure, making registers $a0 and $a1 available for the call of swap. (This copy is faster than saving and restoring on the stack.) We first copy $a0 and $a1 into $s2 and $s3 during the procedure:

move $s2, $a0 # copy parameter $a0 into $s2
move $s3, $a1 # copy parameter $a1 into $s3

Then we pass the parameters to swap with these two instructions:

move $a0, $s2 # first swap parameter is v
move $a1, $s1 # second swap parameter is j

Preserving Registers in sort

The only remaining code is the saving and restoring of registers. Clearly, we must save the return address in register $ra, since sort is a procedure and is called itself. The sort procedure also uses the saved registers $s0, $s1, $s2, and $s3, so they must be saved. The prologue of the sort procedure is then

addi $sp,$sp,-20 # make room on stack for 5 registers
sw   $ra,16($sp) # save $ra on stack
sw   $s3,12($sp) # save $s3 on stack
sw   $s2, 8($sp) # save $s2 on stack
sw   $s1, 4($sp) # save $s1 on stack
sw   $s0, 0($sp) # save $s0 on stack

The tail of the procedure simply reverses all these instructions, then adds a jr to return.
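For concreteness, a sketch of that tail, simply mirroring the prologue above:

lw   $s0, 0($sp) # restore $s0 from stack
lw   $s1, 4($sp) # restore $s1 from stack
lw   $s2, 8($sp) # restore $s2 from stack
lw   $s3,12($sp) # restore $s3 from stack
lw   $ra,16($sp) # restore return address
addi $sp,$sp,20  # pop the 5 words off the stack
jr   $ra         # return to the caller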

The Full Procedure sort

Now we put all the pieces together in Figure 2.27, being careful to replace references to registers $a0 and $a1 in the for loops with references to registers $s2 and $s3. Once again, to make the code easier to follow, we identify each block of code with its purpose in the procedure. In this example, nine lines of the sort procedure in C became 35 lines in the MIPS assembly language.

Elaboration: One optimization that works with this example is procedure inlining. Instead of passing arguments in parameters and invoking the code with a jal instruction, the compiler would copy the code from the body of the swap procedure where the call to swap appears in the code. Inlining would avoid four instructions in this example. The downside of the inlining optimization is that the compiled code would be bigger if the inlined procedure is called from several locations. Such a code expansion might turn into lower performance if it increased the cache miss rate; see Chapter 5.


2.14 Arrays versus Pointers

clear1(int array[], int size)
{
  int i;
  for (i = 0; i < size; i += 1)
    array[i] = 0;
}

clear2(int *array, int size)
{
  int *p;
  for (p = &array[0]; p < &array[size]; p = p + 1)
    *p = 0;
}

FIGURE 2.30 Two C procedures for setting an array to all zeros. Clear1 uses indices, while clear2 uses pointers. The second procedure needs some explanation for those unfamiliar with C. The address of a variable is indicated by &, and the object pointed to by a pointer is indicated by *. The declarations declare that array and p are pointers to integers. The first part of the for loop in clear2 assigns the address of the first element of array to the pointer p. The second part of the for loop tests to see if the pointer is pointing beyond the last element of array. Incrementing a pointer by one, in the last part of the for loop, means moving the pointer to the next sequential object of its declared size. Since p is a pointer to integers, the compiler will generate MIPS instructions to increment p by four, the number of bytes in a MIPS integer. The assignment in the loop places 0 in the object pointed to by p.

Finally, we can store 0 in that address:

sw $zero, 0($t2) # array[i] = 0

This instruction is the end of the body of the loop, so the next step is to increment i:

addi $t0,$t0,1 # i = i + 1

The loop test checks if i is less than size:

slt $t3,$t0,$a1     # $t3 = (i < size)
bne $t3,$zero,loop1 # if (i < size) go to loop1

We have now seen all the pieces of the procedure. Here is the MIPS code for clearing an array using indices:

       move $t0,$zero       # i = 0
loop1: sll  $t1,$t0,2       # $t1 = i * 4
       add  $t2,$a0,$t1     # $t2 = address of array[i]
       sw   $zero, 0($t2)   # array[i] = 0
       addi $t0,$t0,1       # i = i + 1
       slt  $t3,$t0,$a1     # $t3 = (i < size)
       bne  $t3,$zero,loop1 # if (i < size) go to loop1

(This code works as long as size is greater than 0; ANSI C requires a test of size before the loop, but we'll skip that legality here.)



Pointer Version of Clear

The second procedure that uses pointers allocates the two parameters array and size to the registers $a0 and $a1 and allocates p to register $t0. The code for the second procedure starts with assigning the pointer p to the address of the first element of the array:

move $t0,$a0 # p = address of array[0]

The next code is the body of the for loop, which simply stores 0 into p:

loop2: sw $zero,0($t0) # Memory[p] = 0

This instruction implements the body of the loop, so the next code is the iteration increment, which changes p to point to the next word:

addi $t0,$t0,4 # p = p + 4

Incrementing a pointer by 1 means moving the pointer to the next sequential object in C. Since p is a pointer to integers, each of which uses 4 bytes, the compiler increments p by 4.

The loop test is next. The first step is calculating the address of the last element of array. Start with multiplying size by 4 to get its byte address:

sll $t1,$a1,2 # $t1 = size * 4

and then we add the product to the starting address of the array to get the address of the first word after the array:

add $t2,$a0,$t1 # $t2 = address of array[size]

The loop test is simply to see if p is less than the last element of array:

slt $t3,$t0,$t2 # $t3 = (p < &array[size])
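Assembling those pieces gives a sketch of the whole pointer version; the closing bne is assumed here to mirror the test of the index version:

       move $t0,$a0         # p = address of array[0]
loop2: sw   $zero,0($t0)    # Memory[p] = 0
       addi $t0,$t0,4       # p = p + 4
       sll  $t1,$a1,2       # $t1 = size * 4
       add  $t2,$a0,$t1     # $t2 = address of array[size]
       slt  $t3,$t0,$t2     # $t3 = (p < &array[size])
       bne  $t3,$zero,loop2 # if (p < &array[size]) go to loop2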


2.16 Real Stuff: ARMv7 (32-bit) Instructions

by any amount, add it to the other registers to form the address, and then update one register with this new address.

Addressing mode                           | ARM | MIPS
Register operand                          |  X  |  X
Immediate operand                         |  X  |  X
Register + offset (displacement or based) |  X  |  X
Register + register (indexed)             |  X  |  —
Register + scaled register (scaled)       |  X  |  —
Register + offset and update register     |  X  |  —
Register + register and update register   |  X  |  —
Autoincrement, autodecrement              |  X  |  —
PC-relative data                          |  X  |  —

FIGURE 2.33 Summary of data addressing modes. ARM has separate register indirect and register offset addressing modes, rather than just putting 0 in the offset of the latter mode. To get greater addressing range, ARM shifts the offset left 1 or 2 bits if the data size is halfword or word.

Compare <strong>and</strong> Conditional Branch<br />

MIPS uses the contents of registers to evaluate conditional branches. ARM uses the<br />

traditional four condition code bits stored in the program status word: negative,<br />

zero, carry, <strong>and</strong> overflow. They can be set on any arithmetic or logical instruction;<br />

unlike earlier architectures, this setting is optional on each instruction. An<br />

explicit option leads to fewer problems in a pipelined implementation. ARM uses<br />

conditional branches to test condition codes to determine all possible unsigned<br />

<strong>and</strong> signed relations.<br />

CMP subtracts one oper<strong>and</strong> from the other <strong>and</strong> the difference sets the condition<br />

codes. Compare negative (CMN) adds one oper<strong>and</strong> to the other, <strong>and</strong> the sum sets<br />

the condition codes. TST performs logical AND on the two oper<strong>and</strong>s to set all<br />

condition codes but overflow, while TEQ uses exclusive OR to set the first three<br />

condition codes.<br />

One unusual feature of ARM is that every instruction has the option of executing<br />

conditionally, depending on the condition codes. Every instruction starts with a<br />

4-bit field that determines whether it will act as a no operation instruction (nop)<br />

or as a real instruction, depending on the condition codes. Hence, conditional<br />

branches are properly considered as conditionally executing the unconditional<br />

branch instruction. Conditional execution allows avoiding a branch to jump over a<br />

single instruction. It takes less code space <strong>and</strong> time to simply conditionally execute<br />

one instruction.<br />
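To make the payoff concrete, consider a C statement of the kind this feature targets. The register assignments in the comments below are hypothetical, and the MIPS and ARM translations are only sketched in the comments.

/* A one-instruction guarded update: */
if (x == 0)
    y = y + 1;

/* On MIPS (assuming x in $s0 and y in $s1) this needs a branch around the add:
       bne  $s0, $zero, skip
       addi $s1, $s1, 1
   skip:
   ARM can instead set the condition codes with CMP and conditionally execute
   the add (ADDEQ), so no branch is needed to skip the single instruction. */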

Figure 2.34 shows the instruction formats for ARM <strong>and</strong> MIPS. The principal<br />

differences are the 4-bit conditional execution field in every instruction <strong>and</strong> the<br />

smaller register field, because ARM has half the number of registers.



Register-register
  ARM   31-28 Opx(4) | 27-20 Op(8)  | 19-16 Rs1(4) | 15-12 Rd(4) | 11-4 Opx(8)      | 3-0 Rs2(4)
  MIPS  31-26 Op(6)  | 25-21 Rs1(5) | 20-16 Rs2(5) | 15-11 Rd(5) | 10-6 Const(5)    | 5-0 Opx(6)

Data transfer
  ARM   31-28 Opx(4) | 27-20 Op(8)  | 19-16 Rs1(4) | 15-12 Rd(4) | 11-0 Const(12)
  MIPS  31-26 Op(6)  | 25-21 Rs1(5) | 20-16 Rd(5)  | 15-0 Const(16)

Branch
  ARM   31-28 Opx(4) | 27-24 Op(4)  | 23-0 Const(24)
  MIPS  31-26 Op(6)  | 25-21 Rs1(5) | 20-16 Opx(5)/Rs2(5) | 15-0 Const(16)

Jump/Call
  ARM   31-28 Opx(4) | 27-24 Op(4)  | 23-0 Const(24)
  MIPS  31-26 Op(6)  | 25-0 Const(26)

(Field types: opcode, register, constant.)

FIGURE 2.34 Instruction formats, ARM and MIPS. The differences result from whether the
architecture has 16 or 32 registers.

Unique Features of ARM<br />

Figure 2.35 shows a few arithmetic-logical instructions not found in MIPS. Since<br />

ARM does not have a dedicated register for 0, it has separate opcodes to perform<br />

some operations that MIPS can do with $zero. In addition, ARM has support for<br />

multiword arithmetic.<br />

ARM’s 12-bit immediate field has a novel interpretation. The eight least-significant
bits are zero-extended to a 32-bit value, then rotated right the number of bits
specified in the first four bits of the field multiplied by two. One advantage is
that this scheme can represent all powers of two in a 32-bit word. Whether this split
actually catches more immediates than a simple 12-bit field would be an interesting
study.
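As a sketch of this interpretation (not ARM reference code), the 12-bit field can be decoded in C as an 8-bit constant rotated right by twice the 4-bit rotation amount; the function name is ours.

#include <stdint.h>

/* Decode an ARMv7-style 12-bit data-processing immediate: bits [7:0] are an
   8-bit constant, bits [11:8] give a rotation amount that is doubled before
   rotating right within 32 bits. */
uint32_t arm_immediate(uint32_t field12)
{
    uint32_t imm8   = field12 & 0xFF;          /* eight least-significant bits */
    uint32_t rotate = ((field12 >> 8) & 0xF) * 2;
    if (rotate == 0)
        return imm8;
    return (imm8 >> rotate) | (imm8 << (32 - rotate));
}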

Oper<strong>and</strong> shifting is not limited to immediates. The second register of all<br />

arithmetic <strong>and</strong> logical processing operations has the option of being shifted before<br />

being operated on. The shift options are shift left logical, shift right logical, shift<br />

right arithmetic, <strong>and</strong> rotate right.
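In C terms, the four shift options behave roughly as follows; this is a sketch for 32-bit values, and the helper names are ours.

#include <stdint.h>

uint32_t sll(uint32_t x, unsigned n) { return x << n; }   /* shift left logical */
uint32_t srl(uint32_t x, unsigned n) { return x >> n; }   /* shift right logical: fills with 0s */
int32_t  sra(int32_t  x, unsigned n) { return x >> n; }   /* shift right arithmetic: replicates the sign bit
                                                              (implementation-defined in C, universal in practice) */
uint32_t ror(uint32_t x, unsigned n)                       /* rotate right, for 0 < n < 32 */
{
    return (x >> n) | (x << (32 - n));
}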


2.17 Real Stuff: x86 Instructions

in parallel. Not only does this change enable more multimedia operations;<br />

it gives the compiler a different target for floating-point operations than<br />

the unique stack architecture. Compilers can choose to use the eight SSE<br />

registers as floating-point registers like those found in other computers. This<br />

change boosted the floating-point performance of the Pentium 4, the first<br />

microprocessor to include SSE2 instructions.<br />

■ 2003: A company other than Intel enhanced the x86 architecture this time.<br />

AMD announced a set of architectural extensions to increase the address<br />

space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address<br />

space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also<br />

increases the number of registers to 16 <strong>and</strong> increases the number of 128-<br />

bit SSE registers to 16. The primary ISA change comes from adding a new<br />

mode called long mode that redefines the execution of all x86 instructions<br />

with 64-bit addresses <strong>and</strong> data. To address the larger number of registers, it<br />

adds a new prefix to instructions. Depending how you count, long mode also<br />

adds four to ten new instructions <strong>and</strong> drops 27 old ones. PC-relative data<br />

addressing is another extension. AMD64 still has a mode that is identical<br />

to x86 (legacy mode) plus a mode that restricts user programs to x86 but<br />

allows operating systems to use AMD64 (compatibility mode). These modes<br />

allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64<br />

architecture.<br />

■ 2004: Intel capitulates <strong>and</strong> embraces AMD64, relabeling it Extended Memory<br />

64 Technology (EM64T). The major difference is that Intel added a 128-bit<br />

atomic compare <strong>and</strong> swap instruction, which probably should have been<br />

included in AMD64. At the same time, Intel announced another generation of<br />

media extensions. SSE3 adds 13 instructions to support complex arithmetic,<br />

graphics operations on arrays of structures, video encoding, floating-point<br />

conversion, <strong>and</strong> thread synchronization (see Section 2.11). AMD added SSE3<br />

in subsequent chips <strong>and</strong> the missing atomic swap instruction to AMD64 to<br />

maintain binary compatibility with Intel.<br />

■ 2006: Intel announces 54 new instructions as part of the SSE4 instruction set<br />

extensions. These extensions perform tweaks like sum of absolute differences,<br />

dot products for arrays of structures, sign or zero extension of narrow data to<br />

wider sizes, population count, <strong>and</strong> so on. They also added support for virtual<br />

machines (see Chapter 5).<br />

■ 2007: AMD announces 170 instructions as part of SSE5, including 46<br />

instructions of the base instruction set that adds three oper<strong>and</strong> instructions<br />

like MIPS.<br />

■ 2011: Intel ships the Advanced Vector Extension that exp<strong>and</strong>s the SSE<br />

register width from 128 to 256 bits, thereby redefining about 250 instructions<br />

<strong>and</strong> adding 128 new instructions.



This history illustrates the impact of the “golden h<strong>and</strong>cuffs” of compatibility on<br />

the x86, as the existing software base at each step was too important to jeopardize<br />

with significant architectural changes.<br />

Whatever the artistic failures of the x86, keep in mind that this instruction set<br />

largely drove the PC generation of computers <strong>and</strong> still dominates the cloud portion<br />

of the PostPC Era. Manufacturing 350M x86 chips per year may seem small<br />

compared to 9 billion ARMv7 chips, but many companies would love to control<br />

such a market. Nevertheless, this checkered ancestry has led to an architecture that<br />

is difficult to explain <strong>and</strong> impossible to love.<br />

Brace yourself for what you are about to see! Do not try to read this section<br />

with the care you would need to write x86 programs; the goal instead is to give you<br />

familiarity with the strengths <strong>and</strong> weaknesses of the world’s most popular desktop<br />

architecture.<br />

Rather than show the entire 16-bit, 32-bit, <strong>and</strong> 64-bit instruction set, in this<br />

section we concentrate on the 32-bit subset that originated with the 80386. We start<br />

our explanation with the registers <strong>and</strong> addressing modes, move on to the integer<br />

operations, <strong>and</strong> conclude with an examination of instruction encoding.<br />

x86 Registers <strong>and</strong> Data Addressing Modes<br />

The registers of the 80386 show the evolution of the instruction set (Figure 2.36).<br />

The 80386 extended all 16-bit registers (except the segment registers) to 32 bits,<br />

prefixing an E to their name to indicate the 32-bit version. We’ll refer to them<br />

generically as GPRs (general-purpose registers). The 80386 contains only eight<br />

GPRs. This means MIPS programs can use four times as many <strong>and</strong> ARMv7 twice<br />

as many.<br />

Figure 2.37 shows the arithmetic, logical, <strong>and</strong> data transfer instructions are<br />

two-oper<strong>and</strong> instructions. There are two important differences here. The x86<br />

arithmetic <strong>and</strong> logical instructions must have one oper<strong>and</strong> act as both a source<br />

<strong>and</strong> a destination; ARMv7 <strong>and</strong> MIPS allow separate registers for source <strong>and</strong><br />

destination. This restriction puts more pressure on the limited registers, since one<br />

source register must be modified. The second important difference is that one of<br />

the oper<strong>and</strong>s can be in memory. Thus, virtually any instruction may have one<br />

oper<strong>and</strong> in memory, unlike ARMv7 <strong>and</strong> MIPS.<br />

Data memory-addressing modes, described in detail below, offer two sizes of<br />

addresses within the instruction. These so-called displacements can be 8 bits or 32<br />

bits.<br />

Although a memory oper<strong>and</strong> can use any addressing mode, there are restrictions<br />

on which registers can be used in a mode. Figure 2.38 shows the x86 addressing<br />

modes <strong>and</strong> which GPRs cannot be used with each mode, as well as how to get the<br />

same effect using MIPS instructions.<br />

x86 Integer Operations<br />

The 8086 provides support for both 8-bit (byte) <strong>and</strong> 16-bit (word) data types. The<br />

80386 adds 32-bit addresses <strong>and</strong> data (double words) in the x86. (AMD64 adds 64-



Name      Use
EAX       GPR 0
ECX       GPR 1
EDX       GPR 2
EBX       GPR 3
ESP       GPR 4
EBP       GPR 5
ESI       GPR 6
EDI       GPR 7
CS        Code segment pointer
SS        Stack segment pointer (top of stack)
DS        Data segment pointer 0
ES        Data segment pointer 1
FS        Data segment pointer 2
GS        Data segment pointer 3
EIP       Instruction pointer (PC)
EFLAGS    Condition codes

FIGURE 2.36 The 80386 register set. Starting with the 80386, the top eight registers were extended
to 32 bits and could also be used as general-purpose registers.

Source/destination operand type    Second source operand
Register                           Register
Register                           Immediate
Register                           Memory
Memory                             Register
Memory                             Immediate

FIGURE 2.37 Instruction types for the arithmetic, logical, and data transfer instructions.
The x86 allows the combinations shown. The only restriction is the absence of a memory-memory mode.
Immediates may be 8, 16, or 32 bits in length; a register is any one of the 14 major registers in Figure 2.36
(not EIP or EFLAGS).



Mode                             Description                                    Register restrictions   MIPS equivalent
Register indirect                Address is in a register.                      Not ESP or EBP          lw $s0,0($s1)
Based mode with 8- or 32-bit     Address is contents of base register plus      Not ESP                 lw $s0,100($s1) # <=16-bit displacement
  displacement                   displacement.
Base plus scaled index           The address is Base + (2^Scale x Index),       Base: any GPR           mul $t0,$s2,4
                                 where Scale has the value 0, 1, 2, or 3.       Index: not ESP          add $t0,$t0,$s1
                                                                                                        lw  $s0,0($t0)
Base plus scaled index with      The address is Base + (2^Scale x Index)        Base: any GPR           mul $t0,$s2,4
  8- or 32-bit displacement      + displacement, where Scale has the value      Index: not ESP          add $t0,$t0,$s1
                                 0, 1, 2, or 3.                                                         lw  $s0,100($t0) # <=16-bit displacement



The first two categories are unremarkable, except that the arithmetic <strong>and</strong> logic<br />

instruction operations allow the destination to be either a register or a memory<br />

location. Figure 2.39 shows some typical x86 instructions <strong>and</strong> their functions.<br />

Conditional branches on the x86 are based on condition codes or flags, like<br />

ARMv7. Condition codes are set as a side effect of an operation; most are used<br />

to compare the value of a result to 0. Branches then test the condition codes. PC-relative
branch addresses must be specified in the number of bytes, since unlike MIPS,
80386 instructions are not all 4 bytes in length.

Instruction     Function
je name         if equal(condition code) {EIP=name}; EIP–128 <= name < EIP+128



Instruction        Meaning
Control            Conditional and unconditional branches
  jnz, jz          Jump if condition to EIP + 8-bit offset; JNE (for JNZ), JE (for JZ) are alternative names
  jmp              Unconditional jump—8-bit or 16-bit offset
  call             Subroutine call—16-bit offset; return address pushed onto stack
  ret              Pops return address from stack and jumps to it
  loop             Loop branch—decrement ECX; jump to EIP + 8-bit displacement if ECX ≠ 0
Data transfer      Move data between registers or between register and memory
  move             Move between two registers or between register and memory
  push, pop        Push source operand on stack; pop operand from stack top to a register
  les              Load ES and one of the GPRs from memory
Arithmetic, logical   Arithmetic and logical operations using the data registers and memory
  add, sub         Add source to destination; subtract source from destination; register-memory format
  cmp              Compare source and destination; register-memory format
  shl, shr, rcr    Shift left; shift logical right; rotate right with carry condition code as fill
  cbw              Convert byte in eight rightmost bits of EAX to 16-bit word in right of EAX
  test             Logical AND of source and destination sets condition codes
  inc, dec         Increment destination, decrement destination
  or, xor          Logical OR; exclusive OR; register-memory format
String             Move between string operands; length given by a repeat prefix
  movs             Copies from string source to destination by incrementing ESI and EDI; may be repeated
  lods             Loads a byte, word, or doubleword of a string into the EAX register

FIGURE 2.40 Some typical operations on the x86. Many operations use register-memory format,
where either the source or the destination may be memory and the other may be a register or immediate
operand.

of the instructions that address memory. The base plus scaled index mode uses a second<br />

postbyte, labeled “sc, index, base.”<br />

Figure 2.42 shows the encoding of the two postbyte address specifiers for<br />

both 16-bit <strong>and</strong> 32-bit mode. Unfortunately, to underst<strong>and</strong> fully which registers<br />

<strong>and</strong> which addressing modes are available, you need to see the encoding of all<br />

addressing modes <strong>and</strong> sometimes even the encoding of the instructions.<br />

x86 Conclusion<br />

Intel had a 16-bit microprocessor two years before its competitors’ more elegant<br />

architectures, such as the Motorola 68000, <strong>and</strong> this head start led to the selection<br />

of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that<br />

the x86 is more difficult to build than computers like ARMv7 <strong>and</strong> MIPS, but the<br />

large market meant in the PC Era that AMD <strong>and</strong> Intel could afford more resources



a. JE EIP + displacement
     JE(4) | Condition(4) | Displacement(8)

b. CALL
     CALL(8) | Offset(32)

c. MOV EBX, [EDI + 45]
     MOV(6) | d(1) | w(1) | r/m Postbyte(8) | Displacement(8)

d. PUSH ESI
     PUSH(5) | Reg(3)

e. ADD EAX, #6765
     ADD(4) | Reg(3) | w(1) | Immediate(32)

f. TEST EDX, #42
     TEST(7) | w(1) | Postbyte(8) | Immediate(32)

FIGURE 2.41 Typical x86 instruction formats. Figure 2.42 shows the encoding of the postbyte.
Many instructions contain the 1-bit field w, which says whether the operation is a byte or a double word. The
d field in MOV is used in instructions that may move to or from memory and shows the direction of the move.
The ADD instruction requires 32 bits for the immediate field, because in 32-bit mode, the immediates are
either 8 bits or 32 bits. The immediate field in the TEST is 32 bits long because there is no 8-bit immediate for
test in 32-bit mode. Overall, instructions may vary from 1 to 15 bytes in length. The long length comes from
extra 1-byte prefixes, having both a 4-byte immediate and a 4-byte displacement address, using an opcode of
2 bytes, and using the scaled index mode specifier, which adds another byte.

to help overcome the added complexity. What the x86 lacks in style, it made up for<br />

in market size, making it beautiful from the right perspective.<br />

Its saving grace is that the most frequently used x86 architectural components<br />

are not too difficult to implement, as AMD <strong>and</strong> Intel have demonstrated by rapidly<br />

improving performance of integer programs since 1978. To get that performance,



reg field (depends on the w bit and the mode):
  reg   w = 0   w = 1, 16b   w = 1, 32b
   0     AL        AX           EAX
   1     CL        CX           ECX
   2     DL        DX           EDX
   3     BL        BX           EBX
   4     AH        SP           ESP
   5     CH        BP           EBP
   6     DH        SI           ESI
   7     BH        DI           EDI

r/m field (depends on the mod field and the address size):
  r/m   mod = 0, 16b   mod = 0, 32b   mod = 1 (16b / 32b)       mod = 2 (16b / 32b)        mod = 3
   0    addr=BX+SI     =EAX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   1    addr=BX+DI     =ECX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   2    addr=BP+SI     =EDX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   3    addr=BP+DI     =EBX           same as mod=0 + disp8     same as mod=0 + disp16/32  same as reg field
   4    addr=SI        =(sib)         SI+disp8 / (sib)+disp8    SI+disp16 / (sib)+disp32   same as reg field
   5    addr=DI        =disp32        DI+disp8 / EBP+disp8      DI+disp16 / EBP+disp32     same as reg field
   6    addr=disp16    =ESI           BP+disp8 / ESI+disp8      BP+disp16 / ESI+disp32     same as reg field
   7    addr=BX        =EDI           BX+disp8 / EDI+disp8      BX+disp16 / EDI+disp32     same as reg field

FIGURE 2.42 The encoding of the first address specifier of the x86: mod, reg, r/m. The first columns show the encoding
of the 3-bit reg field, which depends on the w bit from the opcode and whether the machine is in 16-bit mode (8086) or 32-bit mode (80386).
The remaining columns explain the mod and r/m fields. The meaning of the 3-bit r/m field depends on the value in the 2-bit mod field and the
address size. Basically, the registers used in the address calculation are listed under mod = 0, with mod = 1
adding an 8-bit displacement and mod = 2 adding a 16-bit or 32-bit displacement, depending on the address mode. The exceptions are 1) r/m
= 6 when mod = 1 or mod = 2 in 16-bit mode selects BP plus the displacement; 2) r/m = 5 when mod = 1 or mod = 2 in 32-bit mode selects
EBP plus displacement; and 3) r/m = 4 in 32-bit mode when mod does not equal 3, where (sib) means use the scaled index mode shown in
Figure 2.38. When mod = 3, the r/m field indicates a register, using the same encoding as the reg field combined with the w bit.

compilers must avoid the portions of the architecture that are hard to implement<br />

fast.<br />

In the PostPC Era, however, despite considerable architectural <strong>and</strong> manufacturing<br />

expertise, x86 has not yet been competitive in the personal mobile device.<br />

2.18 Real Stuff: ARMv8 (64-bit) Instructions<br />

Of the many potential problems in an instruction set, the one that is almost impossible<br />

to overcome is having too small a memory address. While the x86 was successfully<br />

extended first to 32-bit addresses <strong>and</strong> then later to 64-bit addresses, many of its<br />

brethren were left behind. For example, the 16-bit address MOStek 6502 powered the<br />

Apple II, but even given this headstart with the first commercially successful personal<br />

computer, its lack of address bits condemned it to the dustbin of history.<br />

ARM architects could see the writing on the wall of their 32-bit address<br />

computer, <strong>and</strong> began design of the 64-bit address version of ARM in 2007. It was<br />

finally revealed in 2013. Rather than some minor cosmetic changes to make all<br />

the registers 64 bits wide, which is basically what happened to the x86, ARM did a<br />

complete overhaul. The good news is that if you know MIPS it will be very easy to<br />

pick up ARMv8, as the 64-bit version is called.<br />

First, as compared to MIPS, ARM dropped virtually all of the unusual features<br />

of v7:<br />

■ There is no conditional execution field, as there was in nearly every instruction<br />

in v7.



This battle between compilers <strong>and</strong> assembly language coders is another situation<br />

in which humans are losing ground. For example, C offers the programmer a<br />

chance to give a hint to the compiler about which variables to keep in registers<br />

versus spilled to memory. When compilers were poor at register allocation, such<br />

hints were vital to performance. In fact, some old C textbooks spent a fair amount<br />

of time giving examples that effectively use register hints. Today’s C compilers<br />

generally ignore such hints, because the compiler does a better job at allocation<br />

than the programmer does.<br />
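For example, an old-style hint looked like the sketch below; modern compilers typically treat the register keyword as a no-op when allocating registers.

/* Old-style register hints: the programmer asks the compiler to keep the loop
   counter and the running sum in registers. Today this has little or no effect. */
int sum_array(const int a[], int n)
{
    register int i;
    register int sum = 0;
    for (i = 0; i < n; i += 1)
        sum += a[i];
    return sum;
}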

Even if writing by h<strong>and</strong> resulted in faster code, the dangers of writing in assembly<br />

language are the longer time spent coding <strong>and</strong> debugging, the loss in portability,<br />

<strong>and</strong> the difficulty of maintaining such code. One of the few widely accepted axioms<br />

of software engineering is that coding takes longer if you write more lines, <strong>and</strong> it<br />

clearly takes many more lines to write a program in assembly language than in C<br />

or Java. Moreover, once it is coded, the next danger is that it will become a popular<br />

program. Such programs always live longer than expected, meaning that someone<br />

will have to update the code over several years <strong>and</strong> make it work with new releases<br />

of operating systems <strong>and</strong> new models of machines. Writing in higher-level language<br />

instead of assembly language not only allows future compilers to tailor the code<br />

to future machines; it also makes the software easier to maintain <strong>and</strong> allows the<br />

program to run on more br<strong>and</strong>s of computers.<br />

Fallacy: The importance of commercial binary compatibility means successful<br />

instruction sets don’t change.<br />

While backwards binary compatibility is sacrosanct, Figure 2.43 shows that the x86<br />

architecture has grown dramatically. The average is more than one instruction per<br />

month over its 35-year lifetime!<br />

Pitfall: Forgetting that sequential word addresses in machines with byte addressing<br />

do not differ by one.<br />

Many an assembly language programmer has toiled over errors made by assuming<br />

that the address of the next word can be found by incrementing the address in a<br />

register by one instead of by the word size in bytes. Forewarned is forearmed!<br />
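The difference is easy to see in C, where pointer arithmetic already scales by the operand size; the sketch below just prints the addresses involved (the actual values printed will vary).

#include <stdio.h>

int main(void)
{
    int a[10];
    int *p = &a[0];

    /* Pointer arithmetic moves by sizeof(int), so p + 1 is 4 bytes past p. */
    printf("a[0] at %p, a[1] at %p\n", (void *)p, (void *)(p + 1));

    /* The pitfall: adding 1 to the byte address lands in the middle of a[0]. */
    char *wrong = (char *)p + 1;
    printf("wrong \"next word\" at %p\n", (void *)wrong);
    return 0;
}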

Pitfall: Using a pointer to an automatic variable outside its defining procedure.<br />

A common mistake in dealing with pointers is to pass a result from a procedure<br />

that includes a pointer to an array that is local to that procedure. Following the<br />

stack discipline in Figure 2.12, the memory that contains the local array will be<br />

reused as soon as the procedure returns. Pointers to automatic variables can lead<br />

to chaos.
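A minimal C sketch of the mistake described here; the names are illustrative only.

/* BUG: the function returns a pointer to a local (automatic) array. The
   array's stack memory is reclaimed as soon as the procedure returns, so
   the caller is left holding a dangling pointer. */
int *fill_local(int n)
{
    int local[10];
    for (int i = 0; i < 10; i += 1)
        local[i] = n + i;
    return local;          /* pointer to an automatic variable escapes */
}

/* Caller: int *p = fill_local(5);  ...any later use of p[0] is undefined. */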



We also saw the great idea of making the common case fast applied to instruction
sets as well as computer architecture. Examples of making the common MIPS
case fast include PC-relative addressing for conditional branches and immediate
addressing for larger constant operands.

Above this machine level is assembly language, a language that humans can read.<br />

The assembler translates it into the binary numbers that machines can underst<strong>and</strong>,<br />

<strong>and</strong> it even “extends” the instruction set by creating symbolic instructions that<br />

aren’t in the hardware. For instance, constants or addresses that are too big are<br />

broken into properly sized pieces, common variations of instructions are given<br />

their own name, <strong>and</strong> so on. Figure 2.44 lists the MIPS instructions we have covered<br />

Real MIPS instructions                 Name    Format
  add                                  add     R
  subtract                             sub     R
  add immediate                        addi    I
  load word                            lw      I
  store word                           sw      I
  load half                            lh      I
  load half unsigned                   lhu     I
  store half                           sh      I
  load byte                            lb      I
  load byte unsigned                   lbu     I
  store byte                           sb      I
  load linked                          ll      I
  store conditional                    sc      I
  load upper immediate                 lui     I
  and                                  and     R
  or                                   or      R
  nor                                  nor     R
  and immediate                        andi    I
  or immediate                         ori     I
  shift left logical                   sll     R
  shift right logical                  srl     R
  branch on equal                      beq     I
  branch on not equal                  bne     I
  set less than                        slt     R
  set less than immediate              slti    I
  set less than immediate unsigned     sltiu   I
  jump                                 j       J
  jump register                        jr      R
  jump and link                        jal     J

Pseudo MIPS instructions               Name    Format
  move                                 move    R
  multiply                             mult    R
  multiply immediate                   multi   I
  load immediate                       li      I
  branch less than                     blt     I
  branch less than or equal            ble     I
  branch greater than                  bgt     I
  branch greater than or equal         bge     I

FIGURE 2.44 The MIPS instruction set covered so far, with the real MIPS instructions
on the left and the pseudoinstructions on the right. Appendix A (Section A.10) describes the
full MIPS architecture. Figure 2.1 shows more details of the MIPS architecture revealed in this chapter. The
information given here is also found in Columns 1 and 2 of the MIPS Reference Data Card at the front of
the book.


2.22 Exercises

2.3 [5] For the following C statement, what is the corresponding<br />

MIPS assembly code? Assume that the variables f, g, h, i, <strong>and</strong> j are assigned to<br />

registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4, respectively. Assume that the base address<br />

of the arrays A <strong>and</strong> B are in registers $s6 <strong>and</strong> $s7, respectively.<br />

B[8] = A[i−j];<br />

2.4 [5] For the MIPS assembly instructions below, what is the<br />

corresponding C statement? Assume that the variables f, g, h, i, <strong>and</strong> j are assigned<br />

to registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4, respectively. Assume that the base address<br />

of the arrays A <strong>and</strong> B are in registers $s6 <strong>and</strong> $s7, respectively.<br />

sll $t0, $s0, 2 # $t0 = f * 4<br />

add $t0, $s6, $t0 # $t0 = &A[f]<br />

sll $t1, $s1, 2 # $t1 = g * 4<br />

add $t1, $s7, $t1 # $t1 = &B[g]<br />

lw $s0, 0($t0) # f = A[f]<br />

addi $t2, $t0, 4<br />

lw $t0, 0($t2)<br />

add $t0, $t0, $s0<br />

sw $t0, 0($t1)<br />

2.5 [5] For the MIPS assembly instructions in Exercise 2.4, rewrite<br />

the assembly code to minimize the number of MIPS instructions (if possible)

needed to carry out the same function.<br />

2.6 The table below shows 32-bit values of an array stored in memory.<br />

Address Data<br />

24 2<br />

38 4<br />

32 3<br />

36 6<br />

40 1



2.6.1 [5] For the memory locations in the table above, write C<br />

code to sort the data from lowest to highest, placing the lowest value in the<br />

smallest memory location shown in the figure. Assume that the data shown<br />

represents the C variable called Array, which is an array of type int, <strong>and</strong> that<br />

the first number in the array shown is the first element in the array. Assume<br />

that this particular machine is a byte-addressable machine <strong>and</strong> a word consists<br />

of four bytes.<br />

2.6.2 [5] For the memory locations in the table above, write MIPS<br />

code to sort the data from lowest to highest, placing the lowest value in the smallest<br />

memory location. Use a minimum number of MIPS instructions. Assume the base<br />

address of Array is stored in register $s6.<br />

2.7 [5] Show how the value 0xabcdef12 would be arranged in memory<br />

of a little-endian <strong>and</strong> a big-endian machine. Assume the data is stored starting at<br />

address 0.<br />

2.8 [5] Translate 0xabcdef12 into decimal.<br />

2.9 [5] Translate the following C code to MIPS. Assume that the<br />

variables f, g, h, i, <strong>and</strong> j are assigned to registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4,<br />

respectively. Assume that the base address of the arrays A <strong>and</strong> B are in registers $s6<br />

<strong>and</strong> $s7, respectively. Assume that the elements of the arrays A <strong>and</strong> B are 4-byte<br />

words:<br />

B[8] = A[i] + A[j];<br />

2.10 [5] Translate the following MIPS code to C. Assume that the<br />

variables f, g, h, i, <strong>and</strong> j are assigned to registers $s0, $s1, $s2, $s3, <strong>and</strong> $s4,<br />

respectively. Assume that the base address of the arrays A <strong>and</strong> B are in registers $s6<br />

<strong>and</strong> $s7, respectively.<br />

addi $t0, $s6, 4<br />

add $t1, $s6, $0<br />

sw $t1, 0($t0)<br />

lw $t0, 0($t0)<br />

add $s0, $t1, $t0<br />

2.11 [5] For each MIPS instruction, show the value of the opcode<br />

(OP), source register (RS), <strong>and</strong> target register (RT) fields. For the I-type instructions,<br />

show the value of the immediate field, <strong>and</strong> for the R-type instructions, show the<br />

value of the destination register (RD) field.



2.12 Assume that registers $s0 <strong>and</strong> $s1 hold the values 0x80000000 <strong>and</strong><br />

0xD0000000, respectively.<br />

2.12.1 [5] What is the value of $t0 for the following assembly code?<br />

add $t0, $s0, $s1<br />

2.12.2 [5] Is the result in $t0 the desired result, or has there been overflow?<br />

2.12.3 [5] For the contents of registers $s0 <strong>and</strong> $s1 as specified above,<br />

what is the value of $t0 for the following assembly code?<br />

sub $t0, $s0, $s1<br />

2.12.4 [5] Is the result in $t0 the desired result, or has there been overflow?<br />

2.12.5 [5] For the contents of registers $s0 <strong>and</strong> $s1 as specified above,<br />

what is the value of $t0 for the following assembly code?<br />

add $t0, $s0, $s1<br />

add $t0, $t0, $s0<br />

2.12.6 [5] Is the result in $t0 the desired result, or has there been<br />

overflow?<br />

2.13 Assume that $s0 holds the value 128ten.

2.13.1 [5] For the instruction add $t0, $s0, $s1, what is the range(s) of<br />

values for $s1 that would result in overflow?<br />

2.13.2 [5] For the instruction sub $t0, $s0, $s1, what is the range(s) of<br />

values for $s1 that would result in overflow?<br />

2.13.3 [5] For the instruction sub $t0, $s1, $s0, what is the range(s) of<br />

values for $s1 that would result in overflow?<br />

2.14 [5] Provide the type <strong>and</strong> assembly language instruction for the<br />

following binary value: 0000 0010 0001 0000 1000 0000 0010 0000 two<br />

2.15 [5] Provide the type <strong>and</strong> hexadecimal representation of<br />

following instruction: sw $t1, 32($t2)



2.16 [5] Provide the type, assembly language instruction, <strong>and</strong> binary<br />

representation of instruction described by the following MIPS fields:<br />

op=0, rs=3, rt=2, rd=3, shamt=0, funct=34<br />

2.17 [5] Provide the type, assembly language instruction, <strong>and</strong> binary<br />

representation of instruction described by the following MIPS fields:<br />

op=0x23, rs=1, rt=2, const=0x4<br />

2.18 Assume that we would like to exp<strong>and</strong> the MIPS register file to 128 registers<br />

<strong>and</strong> exp<strong>and</strong> the instruction set to contain four times as many instructions.<br />

2.18.1 [5] How would this affect the size of each of the bit fields in

the R-type instructions?<br />

2.18.2 [5] How would this affect the size of each of the bit fields in

the I-type instructions?<br />

2.18.3 [5] How could each of the two proposed changes decrease
the size of a MIPS assembly program? On the other hand, how could the proposed
change increase the size of a MIPS assembly program?

2.19 Assume the following register contents:<br />

$t0 = 0xAAAAAAAA, $t1 = 0x12345678<br />

2.19.1 [5] For the register values shown above, what is the value of $t2<br />

for the following sequence of instructions?<br />

sll $t2, $t0, 44<br />

or $t2, $t2, $t1<br />

2.19.2 [5] For the register values shown above, what is the value of $t2<br />

for the following sequence of instructions?<br />

sll $t2, $t0, 4<br />

<strong>and</strong>i $t2, $t2, −1<br />

2.19.3 [5] For the register values shown above, what is the value of $t2<br />

for the following sequence of instructions?<br />

srl $t2, $t0, 3<br />

<strong>and</strong>i $t2, $t2, 0xFFEF



2.20 [5] Find the shortest sequence of MIPS instructions that extracts bits<br />

16 down to 11 from register $t0 <strong>and</strong> uses the value of this field to replace bits 31<br />

down to 26 in register $t1 without changing the other 26 bits of register $t1.<br />

2.21 [5] Provide a minimal set of MIPS instructions that may be used to<br />

implement the following pseudoinstruction:<br />

not $t1, $t2<br />

// bit-wise invert<br />

2.22 [5] For the following C statement, write a minimal sequence of MIPS
assembly instructions that does the identical operation. Assume $t1 = A, $t2 = B,
and $s1 is the base address of C.

A = C[0] << 4;

2.25 The following instruction is not included in the MIPS instruction set:

rpt $t2, loop # if (R[rs] > 0) R[rs]=R[rs]−1, PC=PC+4+BranchAddr

2.25.1 [5] If this instruction were to be implemented in the MIPS<br />

instruction set, what is the most appropriate instruction format?<br />

2.25.2 [5] What is the shortest sequence of MIPS instructions that<br />

performs the same operation?



2.26 Consider the following MIPS loop:<br />

LOOP: slt $t2, $0, $t1<br />

beq $t2, $0, DONE<br />

subi $t1, $t1, 1<br />

addi $s2, $s2, 2<br />

j LOOP<br />

DONE:<br />

2.26.1 [5] Assume that the register $t1 is initialized to the value 10. What<br />

is the value in register $s2 assuming $s2 is initially zero?<br />

2.26.2 [5] For each of the loops above, write the equivalent C code<br />

routine. Assume that the registers $s1, $s2, $t1, <strong>and</strong> $t2 are integers A, B, i, <strong>and</strong><br />

temp, respectively.<br />

2.26.3 [5] For the loops written in MIPS assembly above, assume that<br />

the register $t1 is initialized to the value N. How many MIPS instructions are<br />

executed?<br />

2.27 [5] Translate the following C code to MIPS assembly code. Use a<br />

minimum number of instructions. Assume that the values of a, b, i, <strong>and</strong> j are in<br />

registers $s0, $s1, $t0, <strong>and</strong> $t1, respectively. Also, assume that register $s2 holds<br />

the base address of the array D.<br />

for(i=0; i<a; i++)
    for(j=0; j<b; j++)
        D[4*j] = i + j;



addi $t1, $t1, 1<br />

slti $t2, $t1, 100<br />

bne $t2, $s0, LOOP<br />

2.30 [5] Rewrite the loop from Exercise 2.29 to reduce the number of<br />

MIPS instructions executed.<br />

2.31 [5] Implement the following C code in MIPS assembly. What is the<br />

total number of MIPS instructions needed to execute the function?<br />

int fib(int n){
    if (n == 0)
        return 0;
    else if (n == 1)
        return 1;
    else
        return fib(n−1) + fib(n−2);
}

2.32 [5] Functions can often be implemented by compilers “in-line.” An<br />

in-line function is when the body of the function is copied into the program space,<br />

allowing the overhead of the function call to be eliminated. Implement an “in-line”<br />

version of the C code above in MIPS assembly. What is the reduction in the total<br />

number of MIPS assembly instructions needed to complete the function? Assume<br />

that the C variable n is initialized to 5.<br />

2.33 [5] For each function call, show the contents of the stack after the<br />

function call is made. Assume the stack pointer is originally at address 0x7ffffffc,<br />

<strong>and</strong> follow the register conventions as specified in Figure 2.11.<br />

2.34 Translate function f into MIPS assembly language. If you need to use<br />

registers $t0 through $t7, use the lower-numbered registers first. Assume the<br />

function declaration for func is “int func(int a, int b);”. The code for function

f is as follows:<br />

int f(int a, int b, int c, int d){<br />

return func(func(a,b),c+d);<br />

}



2.35 [5] Can we use the tail-call optimization in this function? If no,<br />

explain why not. If yes, what is the difference in the number of executed instructions<br />

in f with <strong>and</strong> without the optimization?<br />

2.36 [5] Right before your function f from Exercise 2.34 returns, what do<br />

we know about contents of registers $t5, $s3, $ra, <strong>and</strong> $sp? Keep in mind that<br />

we know what the entire function f looks like, but for function func we only know<br />

its declaration.<br />

2.37 [5] Write a program in MIPS assembly language to convert an ASCII
number string containing positive and negative integer decimal strings, to an
integer. Your program should expect register $a0 to hold the address of a
null-terminated string containing some combination of the digits 0 through 9. Your
program should compute the integer value equivalent to this string of digits, then
place the number in register $v0. If a non-digit character appears anywhere in the
string, your program should stop with the value −1 in register $v0. For example,
if register $a0 points to a sequence of three bytes 50ten, 52ten, 0ten (the
null-terminated string “24”), then when the program stops, register $v0 should contain
the value 24ten.

2.38 [5] Consider the following code:<br />

lbu $t0, 0($t1)<br />

sw $t0, 0($t2)<br />

Assume that the register $t1 contains the address 0x1000 0000 <strong>and</strong> the register<br />

$t2 contains the address 0x1000 0010. Note the MIPS architecture utilizes<br />

big-endian addressing. Assume that the data (in hexadecimal) at address 0x1000<br />

0000 is: 0x11223344. What value is stored at the address pointed to by register<br />

$t2?<br />

2.39 [5] Write the MIPS assembly code that creates the 32-bit constant<br />

0010 0000 0000 0001 0100 1001 0010 0100 two<br />

<strong>and</strong> stores that value to<br />

register $t1.<br />

2.40 [5] If the current value of the PC is 0x00000000, can you use<br />

a single jump instruction to get to the PC address as shown in Exercise 2.39?<br />

2.41 [5] If the current value of the PC is 0x00000600, can you use<br />

a single branch instruction to get to the PC address as shown in Exercise 2.39?



2.42 [5] If the current value of the PC is 0x1FFFf000, can you use<br />

a single branch instruction to get to the PC address as shown in Exercise 2.39?<br />

2.43 [5] Write the MIPS assembly code to implement the following C<br />

code:<br />

lock(lk);<br />

shvar=max(shvar,x);<br />

unlock(lk);<br />

Assume that the address of the lk variable is in $a0, the address of the shvar<br />

variable is in $a1, <strong>and</strong> the value of variable x is in $a2. Your critical section should<br />

not contain any function calls. Use ll/sc instructions to implement the lock()<br />

operation, <strong>and</strong> the unlock() operation is simply an ordinary store instruction.<br />

2.44 [5] Repeat Exercise 2.43, but this time use ll/sc to perform<br />

an atomic update of the shvar variable directly, without using lock() <strong>and</strong><br />

unlock(). Note that in this problem there is no variable lk.<br />

2.45 [5] Using your code from Exercise 2.43 as an example, explain what<br />

happens when two processors begin to execute this critical section at the same<br />

time, assuming that each processor executes exactly one instruction per cycle.<br />

2.46 Assume for a given processor the CPI of arithmetic instructions is 1,<br />

the CPI of load/store instructions is 10, <strong>and</strong> the CPI of branch instructions is<br />

3. Assume a program has the following instruction breakdowns: 500 million<br />

arithmetic instructions, 300 million load/store instructions, 100 million branch<br />

instructions.<br />

2.46.1 [5] Suppose that new, more powerful arithmetic instructions are<br />

added to the instruction set. On average, through the use of these more powerful<br />

arithmetic instructions, we can reduce the number of arithmetic instructions<br />

needed to execute a program by 25%, <strong>and</strong> the cost of increasing the clock cycle<br />

time by only 10%. Is this a good design choice? Why?<br />

2.46.2 [5] Suppose that we find a way to double the performance of<br />

arithmetic instructions. What is the overall speedup of our machine? What if we<br />

find a way to improve the performance of arithmetic instructions by 10 times?<br />

2.47 Assume that for a given program 70% of the executed instructions are<br />

arithmetic, 10% are load/store, <strong>and</strong> 20% are branch.



2.47.1 [5] Given this instruction mix <strong>and</strong> the assumption that an<br />

arithmetic instruction requires 2 cycles, a load/store instruction takes 6 cycles, <strong>and</strong><br />

a branch instruction takes 3 cycles, find the average CPI.<br />

2.47.2 [5] For a 25% improvement in performance, how many cycles, on<br />

average, may an arithmetic instruction take if load/store <strong>and</strong> branch instructions<br />

are not improved at all?<br />

2.47.3 [5] For a 50% improvement in performance, how many cycles, on<br />

average, may an arithmetic instruction take if load/store <strong>and</strong> branch instructions<br />

are not improved at all?<br />

Answers to Check Yourself

§2.2, page 66: MIPS, C, Java<br />

§2.3, page 72: 2) Very slow<br />

§2.4, page 79: 2) 8 ten<br />

§2.5, page 87: 4) sub $t2, $t0, $t1<br />

§2.6, page 89: Both. AND with a mask pattern of 1s will leave 0s everywhere but

the desired field. Shifting left by the correct amount removes the bits from the left<br />

of the field. Shifting right by the appropriate amount puts the field into the rightmost<br />

bits of the word, with 0s in the rest of the word. Note that AND leaves the<br />

field where it was originally, <strong>and</strong> the shift pair moves the field into the rightmost<br />

part of the word.<br />

§2.7, page 96: I. All are true. II. 1).<br />

§2.8, page 106: Both are true.<br />

§2.9, page 111: I. 1) <strong>and</strong> 2) II. 3)<br />

§2.10, page 120: I. 4) 128K. II. 6) a block of 256M. III. 4) sll<br />

§2.11, page 123: Both are true.<br />

§2.12, page 132: 4) Machine independence.




3  Arithmetic for Computers

Numerical precision is the very soul of science.
Sir D’Arcy Wentworth Thompson, On Growth and Form, 1917

3.1 Introduction 178
3.2 Addition and Subtraction 178
3.3 Multiplication 183
3.4 Division 189
3.5 Floating Point 196
3.6 Parallelism and Computer Arithmetic: Subword Parallelism 222
3.7 Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86 224



3.2 Addition and Subtraction

        (0)  (0)  (1)  (1)  (0)        (Carries)
. . .    0    0    0    1    1    1
. . .    0    0    0    1    1    0
-----------------------------------
. . .    0    0    1    1    0    1

FIGURE 3.1 Binary addition, showing carries from right to left. The rightmost bit adds 1
to 0, resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the operation
for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit and a carry out of 1. The
third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 and a sum bit of 1. The fourth bit is 1 +
0 + 0, yielding a 1 sum and no carry.

0000 0000 0000 0000 0000 0000 0000 0111 two = 7 ten<br />

– 0000 0000 0000 0000 0000 0000 0000 0110 two = 6 ten<br />

= 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten<br />

or via addition using the two’s complement representation of 6:<br />

0000 0000 0000 0000 0000 0000 0000 0111 two = 7 ten<br />

+ 1111 1111 1111 1111 1111 1111 1111 1010 two = –6 ten<br />

= 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten<br />

Recall that overflow occurs when the result from an operation cannot be
represented with the available hardware, in this case a 32-bit word. When can
overflow occur in addition? When adding operands with different signs, overflow
cannot occur. The reason is the sum must be no larger than one of the operands.
For example, −10 + 4 = −6. Since the operands fit in 32 bits and the sum is no
larger than an operand, the sum must fit in 32 bits as well. Therefore, no overflow
can occur when adding positive and negative operands.

There are similar restrictions to the occurrence of overflow during subtract, but
it’s just the opposite principle: when the signs of the operands are the same, overflow
cannot occur. To see this, remember that c − a = c + (−a), because we subtract by
negating the second operand and then add. Therefore, when we subtract operands
of the same sign we end up by adding operands of different signs. From the prior
paragraph, we know that overflow cannot occur in this case either.

Knowing when overflow cannot occur in addition <strong>and</strong> subtraction is all well <strong>and</strong><br />

good, but how do we detect it when it does occur? Clearly, adding or subtracting<br />

two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.<br />

The lack of a 33rd bit means that when overflow occurs, the sign bit is set with<br />

the value of the result instead of the proper sign of the result. Since we need just one<br />

extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two<br />

positive numbers <strong>and</strong> the sum is negative, or vice versa. This spurious sum means<br />

a carry out occurred into the sign bit.<br />

Overflow occurs in subtraction when we subtract a negative number from a<br />

positive number <strong>and</strong> get a negative result, or when we subtract a positive number<br />

from a negative number <strong>and</strong> get a positive result. Such a ridiculous result means a<br />

borrow occurred from the sign bit. Figure 3.2 shows the combination of operations,<br />

oper<strong>and</strong>s, <strong>and</strong> results that indicate an overflow.
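The sign-bit rule for addition can be sketched in C as follows, assuming 32-bit two's complement integers; the helper name is ours.

#include <stdint.h>

/* Overflow on signed addition occurs only when both operands have the same
   sign and the (wrapped) sum has the opposite sign. The wrapping addition is
   done on unsigned values to keep the C well defined. */
int add_overflows(int32_t a, int32_t b)
{
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}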



more detail; Chapter 5 describes other situations where exceptions <strong>and</strong> interrupts<br />

occur.)<br />

MIPS includes a register called the exception program counter (EPC) to contain<br />

the address of the instruction that caused the exception. The instruction move from<br />

system control (mfc0) is used to copy EPC into a general-purpose register so that<br />

MIPS software has the option of returning to the offending instruction via a jump<br />

register instruction.<br />

interrupt An exception<br />

that comes from outside<br />

of the processor. (Some<br />

architectures use the<br />

term interrupt for all<br />

exceptions.)<br />

Summary<br />

A major point of this section is that, independent of the representation, the finite<br />

word size of computers means that arithmetic operations can create results that<br />

are too large to fit in this fixed word size. It’s easy to detect overflow in unsigned<br />

numbers, although these are almost always ignored because programs don’t want to<br />

detect overflow for address arithmetic, the most common use of natural numbers.<br />

Two’s complement presents a greater challenge, yet some software systems require<br />

detection of overflow, so today all computers have a way to detect it.<br />

Some programming languages allow two’s complement integer arithmetic<br />

on variables declared byte <strong>and</strong> half, whereas MIPS only has integer arithmetic<br />

operations on full words. As we recall from Chapter 2, MIPS does have data transfer<br />

operations for bytes <strong>and</strong> halfwords. What MIPS instructions should be generated<br />

for byte <strong>and</strong> halfword arithmetic operations?<br />

1. Load with lbu, lhu; arithmetic with add, sub, mult, div; then store using<br />

sb, sh.<br />

2. Load with lb, lh; arithmetic with add, sub, mult, div; then store using<br />

sb, sh.<br />

3. Load with lb, lh; arithmetic with add, sub, mult, div, using AND to mask<br />

result to 8 or 16 bits after each operation; then store using sb, sh.<br />

Check<br />

Yourself<br />

Elaboration: One feature not generally found in general-purpose microprocessors is<br />

saturating operations. Saturation means that when a calculation overflows, the result<br />

is set to the largest positive number or most negative number, rather than a modulo<br />

calculation as in two’s complement arithmetic. Saturation is likely what you want for media<br />

operations. For example, the volume knob on a radio set would be frustrating if, as you<br />

turned it, the volume would get continuously louder for a while <strong>and</strong> then immediately very<br />

soft. A knob with saturation would stop at the highest volume no matter how far you turned<br />

it. Multimedia extensions to st<strong>and</strong>ard instruction sets often offer saturating arithmetic.<br />
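For instance, a saturating 16-bit add might be sketched in C as below; the function name is ours, and real multimedia instruction sets do this in hardware for many narrow values at once.

#include <stdint.h>

/* Saturating addition for 16-bit signed samples: instead of wrapping around
   on overflow, clamp the result to the largest or smallest representable value. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* exact in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}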

Elaboration: MIPS can trap on overflow, but unlike many other computers, there is
no conditional branch to test overflow. A sequence of MIPS instructions can discover



3.3 Multiplication<br />

Now that we have completed the explanation of addition <strong>and</strong> subtraction, we are<br />

ready to build the more vexing operation of multiplication.<br />

First, let’s review the multiplication of decimal numbers in longh<strong>and</strong> to remind<br />

ourselves of the steps of multiplication <strong>and</strong> the names of the oper<strong>and</strong>s. For reasons<br />

that will become clear shortly, we limit this decimal example to using only the<br />

digits 0 and 1. Multiplying 1000ten by 1001ten:

    Multiplicand      1000ten
    Multiplier      x 1001ten
                      1000
                     0000
                    0000
                   1000
    Product        1001000ten

    Multiplication is vexation, Division is as bad; The rule of three doth puzzle me,
    And practice drives me mad.
    Anonymous, Elizabethan manuscript, 1570

The first oper<strong>and</strong> is called the multiplic<strong>and</strong> <strong>and</strong> the second the multiplier.<br />

The final result is called the product. As you may recall, the algorithm learned in<br />

grammar school is to take the digits of the multiplier one at a time from right to<br />

left, multiplying the multiplic<strong>and</strong> by the single digit of the multiplier, <strong>and</strong> shifting<br />

the intermediate product one digit to the left of the earlier intermediate products.<br />

The first observation is that the number of digits in the product is considerably<br />

larger than the number in either the multiplic<strong>and</strong> or the multiplier. In fact, if we<br />

ignore the sign bits, the length of the multiplication of an n-bit multiplic<strong>and</strong> <strong>and</strong> an<br />

m-bit multiplier is a product that is n m bits long. That is, n m bits are required<br />

to represent all possible products. Hence, like add, multiply must cope with<br />

overflow because we frequently want a 32-bit product as the result of multiplying<br />

two 32-bit numbers.<br />
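The same point can be seen in C: to keep all n + m product bits of two 32-bit operands, the operands must be widened to 64 bits before the multiply. The small sketch below is ours, for illustration only.

#include <stdint.h>

/* Full 64-bit product of two 32-bit operands: widen first, then multiply.
   Multiplying two int32_t values directly yields only a 32-bit result. */
int64_t full_product(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;    /* 32 + 32 = 64 bits, so no overflow */
}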

In this example, we restricted the decimal digits to 0 <strong>and</strong> 1. With only two<br />

choices, each step of the multiplication is simple:<br />

1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place if the multiplier digit is a 1, or

2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.

Although the decimal example above happens to use only 0 <strong>and</strong> 1, multiplication<br />

of binary numbers must always use 0 <strong>and</strong> 1, <strong>and</strong> thus always offers only these two<br />

choices.<br />

Now that we have reviewed the basics of multiplication, the traditional next<br />

step is to provide the highly optimized multiply hardware. We break with tradition<br />

in the belief that you will gain a better underst<strong>and</strong>ing by seeing the evolution of<br />

the multiply hardware <strong>and</strong> algorithm through multiple generations. For now, let’s<br />

assume that we are multiplying only positive numbers.



[Flowchart in Figure 3.4: test Multiplier0; if it is 1, add the multiplicand to the product and place the result in the Product register (step 1a); shift the Multiplicand register left 1 bit (step 2); shift the Multiplier register right 1 bit (step 3); if fewer than 32 repetitions, repeat; after the 32nd repetition, done.]

FIGURE 3.4 The first multiplication algorithm, using the hardware shown in Figure 3.3. If<br />

the least significant bit of the multiplier is 1, add the multiplic<strong>and</strong> to the product. If not, go to the next step.<br />

Shift the multiplic<strong>and</strong> left <strong>and</strong> the multiplier right in the next two steps. These three steps are repeated 32<br />

times.<br />
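For readers who prefer code to flowcharts, here is a minimal C sketch of this first algorithm for unsigned operands, with a 64-bit Multiplicand register and Product register; it mirrors Figure 3.4 rather than the refined hardware discussed next, and the names are our own.

#include <stdint.h>

/* Shift-and-add multiply, mirroring Figure 3.4 (unsigned operands). */
uint64_t multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;
    uint64_t mcand   = multiplicand;      /* 64-bit Multiplicand register */

    for (int i = 0; i < 32; i++) {        /* 32 repetitions */
        if (multiplier & 1)               /* 1. test Multiplier0 */
            product += mcand;             /* 1a. add multiplicand to product */
        mcand <<= 1;                      /* 2. shift Multiplicand left 1 bit */
        multiplier >>= 1;                 /* 3. shift Multiplier right 1 bit */
    }
    return product;
}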

This algorithm <strong>and</strong> hardware are easily refined to take 1 clock cycle per step.<br />

The speed-up comes from performing the operations in parallel: the multiplier<br />

<strong>and</strong> multiplic<strong>and</strong> are shifted while the multiplic<strong>and</strong> is added to the product if the<br />

multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of<br />

the multiplier <strong>and</strong> gets the preshifted version of the multiplic<strong>and</strong>. The hardware is<br />

usually further optimized to halve the width of the adder <strong>and</strong> registers by noticing<br />

where there are unused portions of registers <strong>and</strong> adders. Figure 3.5 shows the<br />

revised hardware.



Iteration Step Multiplier Multiplic<strong>and</strong> Product<br />

0 Initial values 0011 0000 0010 0000 0000<br />

1 1a: 1 ⇒ Prod = Prod + Mc<strong>and</strong> 0011 0000 0010 0000 0010<br />

2: Shift left Multiplic<strong>and</strong> 0011 0000 0100 0000 0010<br />

3: Shift right Multiplier 0001 0000 0100 0000 0010<br />

2 1a: 1 ⇒ Prod = Prod + Mc<strong>and</strong> 0001 0000 0100 0000 0110<br />

2: Shift left Multiplic<strong>and</strong> 0001 0000 1000 0000 0110<br />

3: Shift right Multiplier 0000 0000 1000 0000 0110<br />

3 1: 0 ⇒ No operation 0000 0000 1000 0000 0110<br />

2: Shift left Multiplic<strong>and</strong> 0000 0001 0000 0000 0110<br />

3: Shift right Multiplier 0000 0001 0000 0000 0110<br />

4 1: 0 ⇒ No operation 0000 0001 0000 0000 0110<br />

2: Shift left Multiplic<strong>and</strong> 0000 0010 0000 0000 0110<br />

3: Shift right Multiplier 0000 0010 0000 0000 0110<br />

FIGURE 3.6 Multiply example using algorithm in Figure 3.4. The bit examined to determine the<br />

next step is circled in color.<br />

Signed Multiplication<br />

So far, we have dealt with positive numbers. The easiest way to underst<strong>and</strong> how<br />

to deal with signed numbers is to first convert the multiplier <strong>and</strong> multiplic<strong>and</strong> to<br />

positive numbers <strong>and</strong> then remember the original signs. The algorithms should<br />

then be run for 31 iterations, leaving the signs out of the calculation. As we learned<br />

in grammar school, we need to negate the product only if the original signs disagree.

It turns out that the last algorithm will work for signed numbers, provided that<br />

we remember that we are dealing with numbers that have infinite digits, <strong>and</strong> we are<br />

only representing them with 32 bits. Hence, the shifting steps would need to extend<br />

the sign of the product for signed numbers. When the algorithm completes, the<br />

lower word would have the 32-bit product.<br />

Faster Multiplication<br />

Moore’s Law has provided so much more in resources that hardware designers can<br />

now build much faster multiplication hardware. Whether the multiplic<strong>and</strong> is to be<br />

added or not is known at the beginning of the multiplication by looking at each of<br />

the 32 multiplier bits. Faster multiplications are possible by essentially providing<br />

one 32-bit adder for each bit of the multiplier: one input is the multiplic<strong>and</strong> ANDed<br />

with a multiplier bit, <strong>and</strong> the other is the output of a prior adder.<br />

A straightforward approach would be to connect the outputs of adders on the<br />

right to the inputs of adders on the left, making a stack of adders 32 high. An<br />

alternative way to organize these 32 additions is in a parallel tree, as Figure 3.7<br />

shows. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit add times.
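A software analogy of that tree, sketched below under our own naming: the 32 partial products are formed by ANDing the multiplicand with each multiplier bit (shifted into position) and then summed pairwise in log2(32) = 5 levels rather than in one long chain. Real hardware would use carry-save adders, so this only illustrates the dataflow.

#include <stdint.h>

/* Sum the 32 partial products as a balanced tree: 5 levels of pairwise adds
   instead of a chain of 32 dependent adds. */
uint64_t tree_multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t pp[32];
    for (int i = 0; i < 32; i++)          /* partial product i: bit i of the multiplier gates the multiplicand */
        pp[i] = ((multiplier >> i) & 1) ? (uint64_t)multiplicand << i : 0;

    for (int width = 32; width > 1; width /= 2)   /* log2(32) = 5 levels */
        for (int i = 0; i < width / 2; i++)
            pp[i] = pp[2 * i] + pp[2 * i + 1];    /* adds within a level are independent */

    return pp[0];
}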



3.4 Division<br />

The reciprocal operation of multiply is divide, an operation that is even less frequent<br />

<strong>and</strong> even more quirky. It even offers the opportunity to perform a mathematically<br />

invalid operation: dividing by 0.<br />

Let’s start with an example of long division using decimal numbers to recall the<br />

names of the oper<strong>and</strong>s <strong>and</strong> the grammar school division algorithm. For reasons<br />

similar to those in the previous section, we limit the decimal digits to just 0 or 1.<br />

The example is dividing 1,001,010ten by 1000ten:

                      1001ten       Quotient
    Divisor 1000ten ) 1001010ten    Dividend
                     −1000
                        10
                        101
                        1010
                       −1000
                          10ten     Remainder

Divide’s two oper<strong>and</strong>s, called the dividend <strong>and</strong> divisor, <strong>and</strong> the result, called<br />

the quotient, are accompanied by a second result, called the remainder. Here is<br />

another way to express the relationship between the components:<br />

Dividend = Quotient × Divisor + Remainder

where the remainder is smaller than the divisor. Infrequently, programs use the<br />

divide instruction just to get the remainder, ignoring the quotient.<br />

The basic grammar school division algorithm tries to see how big a number<br />

can be subtracted, creating a digit of the quotient on each attempt. Our carefully<br />

selected decimal example uses only the numbers 0 <strong>and</strong> 1, so it’s easy to figure out<br />

how many times the divisor goes into the portion of the dividend: it’s either 0 times<br />

or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to<br />

these two choices, thereby simplifying binary division.<br />

Let’s assume that both the dividend <strong>and</strong> the divisor are positive <strong>and</strong> hence the<br />

quotient <strong>and</strong> the remainder are nonnegative. The division oper<strong>and</strong>s <strong>and</strong> both<br />

results are 32-bit values, <strong>and</strong> we will ignore the sign for now.<br />

A Division Algorithm <strong>and</strong> Hardware<br />

Figure 3.8 shows hardware to mimic our grammar school algorithm. We start with<br />

the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move<br />

the divisor to the right one digit, so we start with the divisor placed in the left half<br />

of the 64-bit Divisor register <strong>and</strong> shift it right 1 bit each step to align it with the<br />

dividend. The Remainder register is initialized with the dividend.<br />

Divide et impera.<br />

Latin for “Divide <strong>and</strong><br />

rule,” ancient political<br />

maxim cited by<br />

Machiavelli, 1532<br />

dividend A number<br />

being divided.<br />

divisor A number that<br />

the dividend is divided by.<br />

quotient The primary<br />

result of a division;<br />

a number that when<br />

multiplied by the<br />

divisor <strong>and</strong> added to the<br />

remainder produces the<br />

dividend.<br />

remainder The<br />

secondary result of<br />

a division; a number<br />

that when added to the<br />

product of the quotient<br />

<strong>and</strong> the divisor produces<br />

the dividend.



[Flowchart in Figure 3.9: subtract the Divisor register from the Remainder register and place the result in the Remainder register (step 1); test the remainder: if the remainder is ≥ 0, shift the Quotient register to the left, setting the new rightmost bit to 1 (step 2a); if the remainder is < 0, restore the original value by adding the Divisor register back to the Remainder register and shift the Quotient register left, setting the new least significant bit to 0 (step 2b); shift the Divisor register right 1 bit (step 3); if fewer than 33 repetitions, repeat; after the 33rd repetition, done.]

FIGURE 3.9 A division algorithm, using the hardware in Figure 3.8. If the remainder is positive,<br />

the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after<br />

step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient <strong>and</strong> adds<br />

the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the<br />

divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times.<br />
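A minimal C sketch of this restoring algorithm for unsigned 32-bit operands appears below; it is our code, following Figure 3.9 rather than the improved hardware of Figure 3.11, and the caller must guard against a zero divisor.

#include <stdint.h>

/* Restoring division in the style of Figure 3.9, unsigned 32-bit operands.
   The caller must ensure divisor != 0. */
void divide(uint32_t dividend, uint32_t divisor,
            uint32_t *quotient, uint32_t *remainder)
{
    uint64_t rem = dividend;                   /* Remainder register starts with the dividend */
    uint64_t div = (uint64_t)divisor << 32;    /* Divisor starts in the left half */
    uint32_t quo = 0;

    for (int i = 0; i < 33; i++) {             /* 33 repetitions */
        rem -= div;                            /* 1. subtract */
        if ((int64_t)rem >= 0) {
            quo = (quo << 1) | 1;              /* 2a. shift quotient left, new bit = 1 */
        } else {
            rem += div;                        /* 2b. restore, new bit = 0 */
            quo = quo << 1;
        }
        div >>= 1;                             /* 3. shift divisor right 1 bit */
    }
    *quotient  = quo;
    *remainder = (uint32_t)rem;
}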

This algorithm <strong>and</strong> hardware can be refined to be faster <strong>and</strong> cheaper. The speedup<br />

comes from shifting the oper<strong>and</strong>s <strong>and</strong> the quotient simultaneously with the<br />

subtraction. This refinement halves the width of the adder <strong>and</strong> registers by noticing<br />

where there are unused portions of registers <strong>and</strong> adders. Figure 3.11 shows the<br />

revised hardware.



Elaboration: The one complication of signed division is that we must also set the sign of the remainder. Remember that the following equation must always hold:

Dividend = Quotient × Divisor + Remainder

To understand how to set the sign of the remainder, let's look at the example of dividing all the combinations of ±7ten by ±2ten. The first case is easy:

+7 ÷ +2: Quotient = +3, Remainder = +1

Checking the results:

7 = 3 × 2 + (+1) = 6 + 1

If we change the sign of the dividend, the quotient must change as well:

−7 ÷ +2: Quotient = −3

Rewriting our basic formula to calculate the remainder:

Remainder = (Dividend − Quotient × Divisor) = −7 − (−3 × +2) = −7 − (−6) = −1

Checking the results again:

−7 = −3 × 2 + (−1) = −6 − 1

The reason the answer isn't a quotient of −4 and a remainder of +1, which would also fit this formula, is that the absolute value of the quotient would then change depending on the sign of the dividend and the divisor! Clearly, if

−(x ÷ y) ≠ (−x) ÷ y

programming would be an even greater challenge. This anomalous behavior is avoided by following the rule that the dividend and remainder must have the same signs, no matter what the signs of the divisor and quotient.

We calculate the other combinations by following the same rule:

+7 ÷ −2: Quotient = −3, Remainder = +1

−7 ÷ −2: Quotient = +3, Remainder = −1
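This is also the convention C has followed since C99: integer division truncates toward zero, so the remainder keeps the sign of the dividend. A few lines of C (ours) confirm all four combinations:

#include <stdio.h>

/* C99 division truncates toward zero; the remainder takes the dividend's sign. */
int main(void)
{
    printf("%d %d\n",  7 /  2,  7 %  2);   /* prints  3  1 */
    printf("%d %d\n", -7 /  2, -7 %  2);   /* prints -3 -1 */
    printf("%d %d\n",  7 / -2,  7 % -2);   /* prints -3  1 */
    printf("%d %d\n", -7 / -2, -7 % -2);   /* prints  3 -1 */
    return 0;
}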



Hardware/<br />

Software<br />

Interface<br />

MIPS divide instructions ignore overflow, so software must determine whether the<br />

quotient is too large. In addition to overflow, division can also result in an improper<br />

calculation: division by 0. Some computers distinguish these two anomalous events.<br />

MIPS software must check the divisor to discover division by 0 as well as overflow.<br />

Elaboration: An even faster algorithm does not immediately add the divisor back if the remainder is negative. It simply adds the divisor to the shifted remainder in the following step, since (r + d) × 2 − d = r × 2 + d × 2 − d = r × 2 + d. This nonrestoring division algorithm, which takes 1 clock cycle per step, is explored further in the exercises; the algorithm above is called restoring division. A third algorithm that doesn't save the result of the subtract if it's negative is called a nonperforming division algorithm. It averages one-third fewer arithmetic operations.

3.5 Floating Point<br />

Speed gets you<br />

nowhere if you’re<br />

headed the wrong way.<br />

American proverb<br />

scientific notation<br />

A notation that renders<br />

numbers with a single<br />

digit to the left of the<br />

decimal point.<br />

normalized A number<br />

in floating-point notation<br />

that has no leading 0s.<br />

Going beyond signed <strong>and</strong> unsigned integers, programming languages support<br />

numbers with fractions, which are called reals in mathematics. Here are some<br />

examples of reals:<br />

3.14159265…ten (pi)

2.71828…ten (e)

0.000000001ten or 1.0ten × 10^−9 (seconds in a nanosecond)

3,155,760,000ten or 3.15576ten × 10^9 (seconds in a typical century)

Notice that in the last case, the number didn’t represent a small fraction, but it<br />

was bigger than we could represent with a 32-bit signed integer. The alternative<br />

notation for the last two numbers is called scientific notation, which has a single<br />

digit to the left of the decimal point. A number in scientific notation that has no<br />

leading 0s is called a normalized number, which is the usual way to write it. For example, 1.0ten × 10^−9 is in normalized scientific notation, but 0.1ten × 10^−8 and 10.0ten × 10^−10 are not.

Just as we can show decimal numbers in scientific notation, we can also show<br />

binary numbers in scientific notation:<br />

1.0two × 2^−1

To keep a binary number in normalized form, we need a base that we can increase<br />

or decrease by exactly the number of bits the number must be shifted to have one<br />

nonzero digit to the left of the decimal point. Only a base of 2 fulfills our need. Since<br />

the base is not 10, we also need a new name for decimal point; binary point will do fine.



<strong>Computer</strong> arithmetic that supports such numbers is called floating point<br />

because it represents numbers in which the binary point is not fixed, as it is for<br />

integers. The programming language C uses the name float for such numbers. Just<br />

as in scientific notation, numbers are represented as a single nonzero digit to the<br />

left of the binary point. In binary, the form is<br />

floating point<br />

<strong>Computer</strong> arithmetic that<br />

represents numbers in<br />

which the binary point is<br />

not fixed.<br />

1.xxxxxxxxxtwo × 2^yyyy

(Although the computer represents the exponent in base 2 as well as the rest of the<br />

number, to simplify the notation we show the exponent in decimal.)<br />

A st<strong>and</strong>ard scientific notation for reals in normalized form offers three<br />

advantages. It simplifies exchange of data that includes floating-point numbers;<br />

it simplifies the floating-point arithmetic algorithms to know that numbers will<br />

always be in this form; <strong>and</strong> it increases the accuracy of the numbers that can be<br />

stored in a word, since the unnecessary leading 0s are replaced by real digits to the<br />

right of the binary point.<br />

Floating-Point Representation<br />

A designer of a floating-point representation must find a compromise between the<br />

size of the fraction <strong>and</strong> the size of the exponent, because a fixed word size means<br />

you must take a bit from one to add a bit to the other. This tradeoff is between<br />

precision <strong>and</strong> range: increasing the size of the fraction enhances the precision<br />

of the fraction, while increasing the size of the exponent increases the range of<br />

numbers that can be represented. As our design guideline from Chapter 2 reminds<br />

us, good design dem<strong>and</strong>s good compromise.<br />

Floating-point numbers are usually a multiple of the size of a word. The<br />

representation of a MIPS floating-point number is shown below, where s is the sign<br />

of the floating-point number (1 meaning negative), exponent is the value of the<br />

8-bit exponent field (including the sign of the exponent), <strong>and</strong> fraction is the 23-bit<br />

number. As we recall from Chapter 2, this representation is sign <strong>and</strong> magnitude,<br />

since the sign is a separate bit from the rest of the number.<br />

fraction The value,<br />

generally between 0 <strong>and</strong><br />

1, placed in the fraction<br />

field. The fraction is also<br />

called the mantissa.<br />

exponent In the<br />

numerical representation<br />

system of floating-point<br />

arithmetic, the value that<br />

is placed in the exponent<br />

field.<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

s exponent fraction<br />

1 bit 8 bits 23 bits<br />

In general, floating-point numbers are of the form<br />

(−1)^S × F × 2^E

F involves the value in the fraction field <strong>and</strong> E involves the value in the exponent<br />

field; the exact relationship to these fields will be spelled out soon. (We will shortly<br />

see that MIPS does something slightly more sophisticated.)



overflow (floating-point) A situation in which a positive exponent becomes too large to fit in the exponent field.

underflow (floating-point) A situation in which a negative exponent becomes too large to fit in the exponent field.

double precision<br />

A floating-point value<br />

represented in two 32-bit<br />

words.<br />

single precision<br />

A floating-point value<br />

represented in a single 32-<br />

bit word.<br />

These chosen sizes of exponent <strong>and</strong> fraction give MIPS computer arithmetic<br />

an extraordinary range. Fractions almost as small as 2.0ten × 10^−38 and numbers almost as large as 2.0ten × 10^38 can be represented in a computer. Alas, extraordinary

differs from infinite, so it is still possible for numbers to be too large. Thus, overflow<br />

interrupts can occur in floating-point arithmetic as well as in integer arithmetic.<br />

Notice that overflow here means that the exponent is too large to be represented<br />

in the exponent field.<br />

Floating point offers a new kind of exceptional event as well. Just as programmers<br />

will want to know when they have calculated a number that is too large to be<br />

represented, they will want to know if the nonzero fraction they are calculating<br />

has become so small that it cannot be represented; either event could result in a<br />

program giving incorrect answers. To distinguish it from overflow, we call this<br />

event underflow. This situation occurs when the negative exponent is too large to<br />

fit in the exponent field.<br />

One way to reduce chances of underflow or overflow is to offer another format<br />

that has a larger exponent. In C this number is called double, <strong>and</strong> operations on<br />

doubles are called double precision floating-point arithmetic; single precision<br />

floating point is the name of the earlier format.<br />

The representation of a double precision floating-point number takes two MIPS<br />

words, as shown below, where s is still the sign of the number, exponent is the value<br />

of the 11-bit exponent field, <strong>and</strong> fraction is the 52-bit number in the fraction field.<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

s<br />

exponent<br />

fraction<br />

1 bit 11 bits 20 bits<br />

fraction (continued)<br />

32 bits<br />

MIPS double precision allows numbers almost as small as 2.0ten × 10^−308 and almost as large as 2.0ten × 10^308. Although double precision does increase the exponent

range, its primary advantage is its greater precision because of the much larger<br />

fraction.<br />
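The C library exposes these limits through <float.h>; printing them is a quick way to confirm the single and double precision ranges just described. (The sketch below is ours; the printed values are the exact normalized limits, slightly different from the rounded 2.0 × 10^±38 and 2.0 × 10^±308 figures above.)

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Smallest and largest normalized magnitudes for IEEE 754 single and double. */
    printf("float:  %e to %e\n", FLT_MIN, FLT_MAX);   /* about 1.2e-38 to 3.4e+38 */
    printf("double: %e to %e\n", DBL_MIN, DBL_MAX);   /* about 2.2e-308 to 1.8e+308 */
    return 0;
}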

These formats go beyond MIPS. They are part of the IEEE 754 floating-point<br />

st<strong>and</strong>ard, found in virtually every computer invented since 1980. This st<strong>and</strong>ard has<br />

greatly improved both the ease of porting floating-point programs <strong>and</strong> the quality<br />

of computer arithmetic.<br />

To pack even more bits into the signific<strong>and</strong>, IEEE 754 makes the leading 1-bit<br />

of normalized binary numbers implicit. Hence, the number is actually 24 bits long<br />

in single precision (implied 1 and a 23-bit fraction), and 53 bits long in double precision (1 + 52). To be precise, we use the term significand to represent the 24-

or 53-bit number that is 1 plus the fraction, <strong>and</strong> fraction when we mean the 23- or<br />

52-bit number. Since 0 has no leading 1, it is given the reserved exponent value 0 so<br />

that the hardware won’t attach a leading 1 to it.



Single precision Double precision Object represented<br />

Exponent Fraction Exponent Fraction<br />

0 0 0 0 0<br />

0 Nonzero 0 Nonzero ± denormalized number<br />

1–254 Anything 1–2046 Anything ± floating-point number<br />

255 0 2047 0 ± infinity<br />

255 Nonzero 2047 Nonzero NaN (Not a Number)<br />

FIGURE 3.13 IEEE 754 encoding of floating-point numbers. A separate sign bit determines the

sign. Denormalized numbers are described in the Elaboration on page 222. This information is also found in<br />

Column 4 of the MIPS Reference Data Card at the front of this book.<br />

Thus 00 … 00two represents 0; the representation of the rest of the numbers uses the form from before with the hidden 1 added:

(−1)^S × (1 + Fraction) × 2^E

where the bits of the fraction represent a number between 0 and 1 and E specifies the value in the exponent field, to be given in detail shortly. If we number the bits of the fraction from left to right s1, s2, s3, …, then the value is

(−1)^S × (1 + (s1 × 2^−1) + (s2 × 2^−2) + (s3 × 2^−3) + (s4 × 2^−4) + …) × 2^E

Figure 3.13 shows the encodings of IEEE 754 floating-point numbers. Other<br />

features of IEEE 754 are special symbols to represent unusual events. For example,<br />

instead of interrupting on a divide by 0, software can set the result to a bit pattern<br />

representing ∞ or ∞; the largest exponent is reserved for these special symbols.<br />

When the programmer prints the results, the program will print an infinity symbol.<br />

(For the mathematically trained, the purpose of infinity is to form topological<br />

closure of the reals.)<br />

IEEE 754 even has a symbol for the result of invalid operations, such as 0/0<br />

or subtracting infinity from infinity. This symbol is NaN, for Not a Number. The<br />

purpose of NaNs is to allow programmers to postpone some tests <strong>and</strong> decisions to<br />

a later time in the program when they are convenient.<br />

The designers of IEEE 754 also wanted a floating-point representation that could<br />

be easily processed by integer comparisons, especially for sorting. This desire is<br />

why the sign is in the most significant bit, allowing a quick test of less than, greater<br />

than, or equal to 0. (It’s a little more complicated than a simple integer sort, since<br />

this notation is essentially sign <strong>and</strong> magnitude rather than two’s complement.)<br />

Placing the exponent before the signific<strong>and</strong> also simplifies the sorting of<br />

floating-point numbers using integer comparison instructions, since numbers with<br />

bigger exponents look larger than numbers with smaller exponents, as long as both<br />

exponents have the same sign.



Negative exponents pose a challenge to simplified sorting. If we use two’s<br />

complement or any other notation in which negative exponents have a 1 in the<br />

most significant bit of the exponent field, a negative exponent will look like a big number. For example, 1.0two × 2^−1 would be represented as

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />

(Remember that the leading 1 is implicit in the significand.) The value 1.0two × 2^+1 would look like the smaller binary number

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />

The desirable notation must therefore represent the most negative exponent as<br />

00 … 00 two<br />

<strong>and</strong> the most positive as 11 … 11 two<br />

. This convention is called biased<br />

notation, with the bias being the number subtracted from the normal, unsigned<br />

representation to determine the real value.<br />

IEEE 754 uses a bias of 127 for single precision, so an exponent of −1 is represented by the bit pattern of the value −1 + 127ten, or 126ten = 0111 1110two, and +1 is represented by 1 + 127, or 128ten = 1000 0000two. The exponent bias for

double precision is 1023. Biased exponent means that the value represented by a floating-point number is really

(−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

The range of single precision numbers is then from as small as

±1.00000000000000000000000two × 2^−126

to as large as

±1.11111111111111111111111two × 2^+127.

Let’s demonstrate.



Floating-Point Representation<br />

Show the IEEE 754 binary representation of the number −0.75ten in single and double precision.

EXAMPLE<br />

ANSWER

The number −0.75ten is also

−3/4ten or −3/2^2

It is also represented by the binary fraction

−11two/2^2, or −0.11two

In scientific notation, the value is

−0.11two × 2^0

and in normalized scientific notation, it is

−1.1two × 2^−1

The general representation for a single precision number is

(−1)^S × (1 + Fraction) × 2^(Exponent − 127)

Subtracting the bias 127 from the exponent of −1.1two × 2^−1 yields

(−1)^1 × (1 + .1000 0000 0000 0000 0000 000two) × 2^(126 − 127)

The single precision binary representation of −0.75ten is then

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />

1 bit 8 bits 23 bits<br />

The double precision representation is<br />

(−1)^1 × (1 + .1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000two) × 2^(1022 − 1023)

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />

1 bit 11 bits 20 bits<br />

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />

32 bits



Now let’s try going the other direction.<br />

EXAMPLE<br />

Converting Binary to Decimal Floating Point<br />

What decimal number is represented by this single precision float?<br />

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />

1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />

ANSWER<br />

The sign bit is 1, the exponent field contains 129, and the fraction field contains 1 × 2^−2 = 1/4, or 0.25. Using the basic equation,

(−1)^S × (1 + Fraction) × 2^(Exponent − Bias) = (−1)^1 × (1 + 0.25) × 2^(129 − 127)
= −1 × 1.25 × 2^2
= −1.25 × 4
= −5.0
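The same decoding can be done mechanically in C by extracting the three fields from the 32-bit pattern. The sketch below is ours and handles only normalized numbers (it ignores the reserved exponent values 0 and 255); it reproduces the −5.0 result for the bit pattern above, which is 0xC0A00000 in hexadecimal.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a normalized single precision pattern:
   (-1)^S x (1 + Fraction) x 2^(Exponent - 127). */
double decode_single(uint32_t bits)
{
    uint32_t sign     = bits >> 31;              /* bit 31 */
    uint32_t exponent = (bits >> 23) & 0xFF;     /* bits 30-23 */
    uint32_t fraction = bits & 0x7FFFFF;         /* bits 22-0 */

    double significand = 1.0 + fraction / 8388608.0;       /* 1 + fraction / 2^23 */
    double value = significand * pow(2.0, (int)exponent - 127);
    return sign ? -value : value;
}

int main(void)
{
    printf("%f\n", decode_single(0xC0A00000));   /* prints -5.000000 */
    return 0;
}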

In the next few subsections, we will give the algorithms for floating-point<br />

addition <strong>and</strong> multiplication. At their core, they use the corresponding integer<br />

operations on the signific<strong>and</strong>s, but extra bookkeeping is necessary to h<strong>and</strong>le the<br />

exponents <strong>and</strong> normalize the result. We first give an intuitive derivation of the<br />

algorithms in decimal <strong>and</strong> then give a more detailed, binary version in the figures.<br />

Elaboration: Following IEEE guidelines, the IEEE 754 committee was reformed 20<br />

years after the st<strong>and</strong>ard to see what changes, if any, should be made. The revised<br />

st<strong>and</strong>ard IEEE 754-2008 includes nearly all the IEEE 754-1985 <strong>and</strong> adds a 16-bit format<br />

(“half precision”) <strong>and</strong> a 128-bit format (“quadruple precision”). No hardware has yet been<br />

built that supports quadruple precision, but it will surely come. The revised standard also adds decimal floating-point arithmetic, which IBM mainframes have implemented.

Elaboration: In an attempt to increase range without removing bits from the signific<strong>and</strong>,<br />

some computers before the IEEE 754 st<strong>and</strong>ard used a base other than 2. For example,<br />

the IBM 360 <strong>and</strong> 370 mainframe computers use base 16. Since changing the IBM<br />

exponent by one means shifting the signific<strong>and</strong> by 4 bits, “normalized” base 16 numbers<br />

can have up to 3 leading bits of 0s! Hence, hexadecimal digits mean that up to 3 bits must<br />

be dropped from the signific<strong>and</strong>, which leads to surprising problems in the accuracy of<br />

floating-point arithmetic. IBM mainframes now support IEEE 754 as well as the hex format.



Floating-Point Addition<br />

Let’s add numbers in scientific notation by h<strong>and</strong> to illustrate the problems in<br />

floating-point addition: 9.999ten × 10^1 + 1.610ten × 10^−1. Assume that we can store only four decimal digits of the significand and two decimal digits of the exponent.

Step 1. To be able to add these numbers properly, we must align the decimal point of the number that has the smaller exponent. Hence, we need a form of the smaller number, 1.610ten × 10^−1, that matches the larger exponent. We obtain this by observing that there are multiple representations of an unnormalized floating-point number in scientific notation:

1.610ten × 10^−1 = 0.1610ten × 10^0 = 0.01610ten × 10^1

The number on the right is the version we desire, since its exponent matches the exponent of the larger number, 9.999ten × 10^1. Thus, the first step shifts the significand of the smaller number to the right until its corrected exponent matches that of the larger number. But we can represent only four decimal digits so, after shifting, the number is really

0.016 × 10^1

Step 2. Next comes the addition of the significands:

      9.999ten
    + 0.016ten
     10.015ten

The sum is 10.015ten × 10^1.

Step 3. This sum is not in normalized scientific notation, so we need to<br />

adjust it:<br />

10.015ten × 10^1 = 1.0015ten × 10^2

Thus, after the addition we may have to shift the sum to put it into<br />

normalized form, adjusting the exponent appropriately. This example<br />

shows shifting to the right, but if one number were positive <strong>and</strong> the<br />

other were negative, it would be possible for the sum to have many<br />

leading 0s, requiring left shifts. Whenever the exponent is increased<br />

or decreased, we must check for overflow or underflow—that is, we<br />

must make sure that the exponent still fits in its field.<br />

Step 4. Since we assumed that the signific<strong>and</strong> can be only four digits long<br />

(excluding the sign), we must round the number. In our grammar<br />

school algorithm, the rules truncate the number if the digit to the<br />

right of the desired point is between 0 <strong>and</strong> 4 <strong>and</strong> add 1 to the digit if<br />

the number to the right is between 5 <strong>and</strong> 9. The number<br />

1.0015ten × 10^2



is rounded to four digits in the significand to 1.002ten × 10^2

since the fourth digit to the right of the decimal point was between 5<br />

<strong>and</strong> 9. Notice that if we have bad luck on rounding, such as adding 1<br />

to a string of 9s, the sum may no longer be normalized <strong>and</strong> we would<br />

need to perform step 3 again.<br />

Figure 3.14 shows the algorithm for binary floating-point addition that follows<br />

this decimal example. Steps 1 <strong>and</strong> 2 are similar to the example just discussed:<br />

adjust the signific<strong>and</strong> of the number with the smaller exponent <strong>and</strong> then add the<br />

two signific<strong>and</strong>s. Step 3 normalizes the results, forcing a check for overflow or<br />

underflow. The test for overflow <strong>and</strong> underflow in step 3 depends on the precision<br />

of the oper<strong>and</strong>s. Recall that the pattern of all 0 bits in the exponent is reserved <strong>and</strong><br />

used for the floating-point representation of zero. Moreover, the pattern of all 1 bits<br />

in the exponent is reserved for indicating values <strong>and</strong> situations outside the scope of<br />

normal floating-point numbers (see the Elaboration on page 222). For the example<br />

below, remember that for single precision, the maximum exponent is 127, and the minimum exponent is −126.

EXAMPLE<br />

Binary Floating-Point Addition<br />

Try adding the numbers 0.5ten and −0.4375ten

in binary using the algorithm in<br />

Figure 3.14.<br />

ANSWER<br />

Let’s first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:

0.5ten = 1/2ten = 1/2^1 = 0.1two = 0.1two × 2^0 = 1.000two × 2^−1

−0.4375ten = −7/16ten = −7/2^4 = −0.0111two = −0.0111two × 2^0 = −1.110two × 2^−2

Now we follow the algorithm:

Step 1. The significand of the number with the lesser exponent (−1.110two × 2^−2) is shifted right until its exponent matches the larger number:

−1.110two × 2^−2 = −0.111two × 2^−1

Step 2. Add the significands:

1.000two × 2^−1 + (−0.111two × 2^−1) = 0.001two × 2^−1



Step 3. Normalize the sum, checking for overflow or underflow:

0.001two × 2^−1 = 0.010two × 2^−2 = 0.100two × 2^−3 = 1.000two × 2^−4

Since 127 ≥ −4 ≥ −126, there is no overflow or underflow. (The biased exponent would be −4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)

Step 4. Round the sum:<br />

1.000two × 2^−4

The sum already fits exactly in 4 bits, so there is no change to the bits<br />

due to rounding.<br />

This sum is then

1.000two × 2^−4 = 0.0001000two = 0.0001two = 1/2^4 = 1/16ten = 0.0625ten

This sum is what we would expect from adding 0.5ten to −0.4375ten.
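Because 0.5, −0.4375, and 0.0625 are all exact binary fractions, the same sum can be checked directly with C's single precision type (a quick sketch of ours):

#include <stdio.h>

int main(void)
{
    /* All three values are exact in binary floating point,
       so this addition involves no rounding at all. */
    float sum = 0.5f + (-0.4375f);
    printf("%f\n", sum);              /* prints 0.062500 */
    return 0;
}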

Many computers dedicate hardware to run floating-point operations as fast as possible.<br />

Figure 3.15 sketches the basic organization of hardware for floating-point addition.<br />

Floating-Point Multiplication<br />

Now that we have explained floating-point addition, let’s try floating-point<br />

multiplication. We start by multiplying decimal numbers in scientific notation by<br />

hand: 1.110ten × 10^10 × 9.200ten × 10^−5. Assume that we can store only four digits

of the signific<strong>and</strong> <strong>and</strong> two digits of the exponent.<br />

Step 1. Unlike addition, we calculate the exponent of the product by simply adding the exponents of the operands together:

New exponent = 10 + (−5) = 5

Let's do this with the biased exponents as well to make sure we obtain the same result: 10 + 127 = 137, and −5 + 127 = 122, so

New exponent = 137 + 122 = 259

This result is too large for the 8-bit exponent field, so something is amiss! The problem is with the bias because we are adding the biases as well as the exponents:

New exponent = (10 + 127) + (−5 + 127) = (5 + 2 × 127) = 259

Accordingly, to get the correct biased sum when we add biased numbers,<br />

we must subtract the bias from the sum:



[Block diagram in Figure 3.15: a small ALU compares the two exponents; the exponent difference controls multiplexors that select the larger exponent and route the significands; the smaller significand is shifted right; the big ALU adds the significands; the result is normalized by shifting left or right while incrementing or decrementing the exponent; rounding hardware then produces the final sign, exponent, and fraction.]

FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond<br />

to each block, from top to bottom. First, the exponent of one oper<strong>and</strong> is subtracted from the other using the small ALU to determine which is<br />

larger <strong>and</strong> by how much. This difference controls the three multiplexors; from left to right, they select the larger exponent, the signific<strong>and</strong> of the<br />

smaller number, <strong>and</strong> the signific<strong>and</strong> of the larger number. The smaller signific<strong>and</strong> is shifted right, <strong>and</strong> then the signific<strong>and</strong>s are added together<br />

using the big ALU. The normalization step then shifts the sum left or right <strong>and</strong> increments or decrements the exponent. Rounding then creates<br />

the final result, which may require normalizing again to produce the actual final result.



New exponent = 137 + 122 − 127 = 259 − 127 = 132 = (5 + 127)

<strong>and</strong> 5 is indeed the exponent we calculated initially.<br />

Step 2. Next comes the multiplication of the significands:

        1.110ten
      × 9.200ten
           0000
          0000
        2220
       9990
    10212000ten

There are three digits to the right of the decimal point for each<br />

oper<strong>and</strong>, so the decimal point is placed six digits from the right in the<br />

product signific<strong>and</strong>:<br />

10.212000ten

Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212 × 10^5.

Step 3. This product is unnormalized, so we need to normalize it:<br />

10.212ten × 10^5 = 1.0212ten × 10^6

Thus, after the multiplication, the product can be shifted right one digit<br />

to put it in normalized form, adding 1 to the exponent. At this point,<br />

we can check for overflow <strong>and</strong> underflow. Underflow may occur if both<br />

oper<strong>and</strong>s are small—that is, if both have large negative exponents.<br />

Step 4. We assumed that the signific<strong>and</strong> is only four digits long (excluding the<br />

sign), so we must round the number. The number<br />

1.0212ten × 10^6

is rounded to four digits in the significand to

1.021ten × 10^6

Step 5. The sign of the product depends on the signs of the original oper<strong>and</strong>s.<br />

If they are both the same, the sign is positive; otherwise, it’s negative.<br />

Hence, the product is<br />

+1.021ten × 10^6

The sign of the sum in the addition algorithm was determined by<br />

addition of the signific<strong>and</strong>s, but in multiplication, the sign of the<br />

product is determined by the signs of the oper<strong>and</strong>s.



Once again, as Figure 3.16 shows, multiplication of binary floating-point numbers<br />

is quite similar to the steps we have just completed. We start with calculating<br />

the new exponent of the product by adding the biased exponents, being sure to<br />

subtract one bias to get the proper result. Next is multiplication of signific<strong>and</strong>s,<br />

followed by an optional normalization step. The size of the exponent is checked<br />

for overflow or underflow, <strong>and</strong> then the product is rounded. If rounding leads to<br />

further normalization, we once again check for exponent size. Finally, set the sign<br />

bit to 1 if the signs of the oper<strong>and</strong>s were different (negative product) or to 0 if they<br />

were the same (positive product).<br />

EXAMPLE<br />

ANSWER<br />

Binary Floating-Point Multiplication<br />

Let’s try multiplying the numbers 0.5ten and −0.4375ten, using the steps in Figure 3.16.

In binary, the task is multiplying 1.000two × 2^−1 by −1.110two × 2^−2.

Step 1. Adding the exponents without bias:

−1 + (−2) = −3

or, using the biased representation:

(−1 + 127) + (−2 + 127) − 127 = (−1 − 2) + (127 + 127 − 127)
= −3 + 127 = 124

Step 2. Multiplying the significands:

        1.000two
      × 1.110two
           0000
          1000
         1000
        1000
        1110000two

The product is 1.110000two × 2^−3, but we need to keep it to 4 bits, so it is 1.110two × 2^−3.

Step 3. Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since 127 ≥ −3 ≥ −126, there is no overflow or underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the exponent fits.)

Step 4. Rounding the product makes no change:

1.110two × 2^−3



Step 5. Since the signs of the original operands differ, make the sign of the product negative. Hence, the product is

−1.110two × 2^−3

Converting to decimal to check our results:

−1.110two × 2^−3 = −0.001110two = −0.00111two = −7/2^5 = −7/32ten = −0.21875ten

The product of 0.5ten and −0.4375ten is indeed −0.21875ten.
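As with the addition example, the operands and the product are exact binary fractions, so the hand calculation can be checked with one line of C (again, a sketch of ours):

#include <stdio.h>

int main(void)
{
    /* Both operands and the product are exact, so no rounding occurs. */
    float product = 0.5f * (-0.4375f);
    printf("%f\n", product);          /* prints -0.218750 */
    return 0;
}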

Floating-Point Instructions in MIPS<br />

MIPS supports the IEEE 754 single precision <strong>and</strong> double precision formats with<br />

these instructions:<br />

■ Floating-point addition, single (add.s) <strong>and</strong> addition, double (add.d)<br />

■ Floating-point subtraction, single (sub.s) <strong>and</strong> subtraction, double (sub.d)<br />

■ Floating-point multiplication, single (mul.s) <strong>and</strong> multiplication, double (mul.d)<br />

■ Floating-point division, single (div.s) <strong>and</strong> division, double (div.d)<br />

■ Floating-point comparison, single (c.x.s) <strong>and</strong> comparison, double (c.x.d),<br />

where x may be equal (eq), not equal (neq), less than (lt), less than or equal<br />

(le), greater than (gt), or greater than or equal (ge)<br />

■ Floating-point branch, true (bc1t) <strong>and</strong> branch, false (bc1f)<br />

Floating-point comparison sets a bit to true or false, depending on the comparison<br />

condition, <strong>and</strong> a floating-point branch then decides whether or not to branch,<br />

depending on the condition.<br />

The MIPS designers decided to add separate floating-point registers—called<br />

$f0, $f1, $f2, …—used either for single precision or double precision. Hence,<br />

they included separate loads <strong>and</strong> stores for floating-point registers: lwc1 <strong>and</strong><br />

swc1. The base registers for floating-point data transfers which are used for<br />

addresses remain integer registers. The MIPS code to load two single precision<br />

numbers from memory, add them, <strong>and</strong> then store the sum might look like this:<br />

lwc1  $f4,c($sp)    # Load 32-bit F.P. number into F4
lwc1  $f6,a($sp)    # Load 32-bit F.P. number into F6
add.s $f2,$f4,$f6   # F2 = F4 + F6 single precision
swc1  $f2,b($sp)    # Store 32-bit F.P. number from F2

A double precision register is really an even-odd pair of single precision registers,<br />

using the even register number as its name. Thus, the pair of single precision<br />

registers $f2 <strong>and</strong> $f3 also form the double precision register named $f2.<br />

Figure 3.17 summarizes the floating-point portion of the MIPS architecture revealed<br />

in this chapter, with the additions to support floating point shown in color. Similar to<br />

Figure 2.19 in Chapter 2, Figure 3.18 shows the encoding of these instructions.



op(31:26):<br />

28–26<br />

0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />

31–29<br />

0(000) Rfmt Bltz/gez j jal beq bne blez bgtz<br />

1(001) addi addiu slti sltiu ANDi ORi xORi lui<br />

2(010) TLB FlPt<br />

3(011)<br />

4(100) lb lh lwl lw lbu lhu lwr<br />

5(101) sb sh swl sw swr<br />

6(110) lwc0 lwc1<br />

7(111) swc0 swc1<br />

op(31:26) = 010001 (FlPt), (rt(16:16) = 0 => c = f, rt(16:16) = 1 => c = t), rs(25:21):<br />

23–21<br />

0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />

25–24<br />

0(00) mfc1 cfc1 mtc1 ctc1<br />

1(01) bc1.c<br />

2(10) f = single f = double<br />

3(11)<br />

op(31:26) = 010001 (FlPt), (f above: 10000 => f = s, 10001 => f = d), funct(5:0):<br />

2–0<br />

0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />

5–3<br />

0(000) add.f sub.f mul.f div.f abs.f mov.f neg.f<br />

1(001)<br />

2(010)<br />

3(011)<br />

4(100) cvt.s.f cvt.d.f cvt.w.f<br />

5(101)<br />

6(110) c.f.f c.un.f c.eq.f c.ueq.f c.olt.f c.ult.f c.ole.f c.ule.f<br />

7(111) c.sf.f c.ngle.f c.seq.f c.ngl.f c.lt.f c.nge.f c.le.f c.ngt.f<br />

FIGURE 3.18 MIPS floating-point instruction encoding. This notation gives the value of a field by row <strong>and</strong> by column. For example,<br />

in the top portion of the figure, lw is found in row number 4 (100 two<br />

for bits 31–29 of the instruction) <strong>and</strong> column number 3 (011 two<br />

for bits<br />

28–26 of the instruction), so the corresponding value of the op field (bits 31–26) is 100011 two<br />

. Underscore means the field is used elsewhere.<br />

For example, FlPt in row 2 and column 1 (op = 010001two) is defined in the bottom part of the figure. Hence sub.f in row 0 and column 1 of the bottom section means that the funct field (bits 5–0) of the instruction is 000001two and the op field (bits 31–26) is 010001two. Note that the 5-bit rs field, specified in the middle portion of the figure, determines whether the operation is single precision (f = s, so rs = 10000) or double precision (f = d, so rs = 10001). Similarly, bit 16 of the instruction determines if the bc1.c instruction tests for true (bit 16 = 1 => bc1.t) or false (bit 16 = 0 => bc1.f). Instructions in color are described in Chapter 2 or this chapter, with Appendix A covering all instructions.

This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.



Hardware/<br />

Software<br />

Interface<br />

One issue that architects face in supporting floating-point arithmetic is whether<br />

to use the same registers used by the integer instructions or to add a special set<br />

for floating point. Because programs normally perform integer operations <strong>and</strong><br />

floating-point operations on different data, separating the registers will only<br />

slightly increase the number of instructions needed to execute a program. The<br />

major impact is to create a separate set of data transfer instructions to move data<br />

between floating-point registers <strong>and</strong> memory.<br />

The benefits of separate floating-point registers are having twice as many<br />

registers without using up more bits in the instruction format, having twice the<br />

register b<strong>and</strong>width by having separate integer <strong>and</strong> floating-point register sets, <strong>and</strong><br />

being able to customize registers to floating point; for example, some computers<br />

convert all sized oper<strong>and</strong>s in registers into a single internal format.<br />

EXAMPLE<br />

Compiling a Floating-Point C Program into MIPS Assembly Code<br />

Let’s convert a temperature in Fahrenheit to Celsius:<br />

float f2c (float fahr)<br />

{<br />

return ((5.0/9.0) *(fahr – 32.0));<br />

}<br />

Assume that the floating-point argument fahr is passed in $f12 <strong>and</strong> the<br />

result should go in $f0. (Unlike integer registers, floating-point register 0 can<br />

contain a number.) What is the MIPS assembly code?<br />

ANSWER<br />

We assume that the compiler places the three floating-point constants in<br />

memory within easy reach of the global pointer $gp. The first two instructions<br />

load the constants 5.0 <strong>and</strong> 9.0 into floating-point registers:<br />

f2c:<br />

lwc1 $f16,const5($gp) # $f16 = 5.0 (5.0 in memory)<br />

lwc1 $f18,const9($gp) # $f18 = 9.0 (9.0 in memory)<br />

They are then divided to get the fraction 5.0/9.0:<br />

div.s $f16, $f16, $f18 # $f16 = 5.0 / 9.0



(Many compilers would divide 5.0 by 9.0 at compile time <strong>and</strong> save the single<br />

constant 5.0/9.0 in memory, thereby avoiding the divide at runtime.) Next, we<br />

load the constant 32.0 <strong>and</strong> then subtract it from fahr ($f12):<br />

lwc1 $f18, const32($gp) # $f18 = 32.0

sub.s $f18, $f12, $f18 # $f18 = fahr – 32.0<br />

Finally, we multiply the two intermediate results, placing the product in $f0 as<br />

the return result, <strong>and</strong> then return<br />

mul.s $f0, $f16, $f18 # $f0 = (5/9)*(fahr – 32.0)<br />

jr $ra                  # return

Now let’s perform floating-point operations on matrices, code commonly<br />

found in scientific programs.<br />

Compiling Floating-Point C Procedure with Two-Dimensional<br />

Matrices into MIPS<br />

EXAMPLE<br />

Most floating-point calculations are performed in double precision. Let’s perform<br />

matrix multiply of C = C + A * B. It is commonly called DGEMM,

for Double precision, General Matrix Multiply. We’ll see versions of DGEMM<br />

again in Section 3.8 <strong>and</strong> subsequently in Chapters 4, 5, <strong>and</strong> 6. Let’s assume C,<br />

A, <strong>and</strong> B are all square matrices with 32 elements in each dimension.<br />

void mm (double c[32][32], double a[32][32], double b[32][32])

{<br />

int i, j, k;<br />

for (i = 0; i != 32; i = i + 1)<br />

for (j = 0; j != 32; j = j + 1)<br />

for (k = 0; k != 32; k = k + 1)<br />

c[i][j] = c[i][j] + a[i][k] *b[k][j];<br />

}<br />

The array starting addresses are parameters, so they are in $a0, $a1, <strong>and</strong> $a2.<br />

Assume that the integer variables are in $s0, $s1, <strong>and</strong> $s2, respectively.<br />

What is the MIPS assembly code for the body of the procedure?<br />

Note that c[i][j] is used in the innermost loop above. Since the loop index<br />

is k, the index does not affect c[i][j], so we can avoid loading <strong>and</strong> storing<br />

c[i][j] each iteration. Instead, the compiler loads c[i][j] into a register<br />

outside the loop, accumulates the sum of the products of a[i][k] <strong>and</strong><br />

ANSWER



b[k][j] in that same register, <strong>and</strong> then stores the sum into c[i][j] upon<br />

termination of the innermost loop.<br />

We keep the code simpler by using the assembly language pseudoinstructions<br />

li (which loads a constant into a register), <strong>and</strong> l.d <strong>and</strong> s.d (which the<br />

assembler turns into a pair of data transfer instructions, lwc1 or swc1, to a<br />

pair of floating-point registers).<br />

The body of the procedure starts with saving the loop termination value of<br />

32 in a temporary register <strong>and</strong> then initializing the three for loop variables:<br />

mm:...<br />

li $t1, 32 # $t1 = 32 (row size/loop end)<br />

li $s0, 0 # i = 0; initialize 1st for loop<br />

L1: li $s1, 0 # j = 0; restart 2nd for loop<br />

L2: li $s2, 0 # k = 0; restart 3rd for loop<br />

To calculate the address of c[i][j], we need to know how a 32 × 32, two-dimensional array is stored in memory. As you might expect, its layout is the

same as if there were 32 single-dimension arrays, each with 32 elements. So the<br />

first step is to skip over the i “single-dimensional arrays,” or rows, to get the<br />

one we want. Thus, we multiply the index in the first dimension by the size of<br />

the row, 32. Since 32 is a power of 2, we can use a shift instead:<br />

sll $t2, $s0, 5 # $t2 = i * 2^5 (size of row of c)

Now we add the second index to select the jth element of the desired row:<br />

addu $t2, $t2, $s1<br />

# $t2 = i * size(row) + j<br />

To turn this sum into a byte index, we multiply it by the size of a matrix element<br />

in bytes. Since each element is 8 bytes for double precision, we can instead shift<br />

left by 3:<br />

sll $t2, $t2, 3<br />

# $t2 = byte offset of [i][j]<br />

Next we add this sum to the base address of c, giving the address of c[i][j],<br />

<strong>and</strong> then load the double precision number c[i][j] into $f4:<br />

addu $t2, $a0, $t2 # $t2 = byte address of c[i][j]<br />

l.d $f4, 0($t2) # $f4 = 8 bytes of c[i][j]<br />

The following five instructions are virtually identical to the last five: calculate<br />

the address <strong>and</strong> then load the double precision number b[k][j].<br />

L3: sll $t0, $s2, 5 # $t0 = k * 2^5 (size of row of b)

addu $t0, $t0, $s1 # $t0 = k * size(row) + j<br />

sll $t0, $t0, 3 # $t0 = byte offset of [k][j]<br />

addu $t0, $a2, $t0 # $t0 = byte address of b[k][j]<br />

l.d $f16, 0($t0) # $f16 = 8 bytes of b[k][j]<br />

Similarly, the next five instructions are like the last five: calculate the address<br />

<strong>and</strong> then load the double precision number a[i][k].



sll $t0, $s0, 5 # $t0 = i * 2^5 (size of row of a)

addu $t0, $t0, $s2 # $t0 = i * size(row) + k<br />

sll $t0, $t0, 3 # $t0 = byte offset of [i][k]<br />

addu $t0, $a1, $t0 # $t0 = byte address of a[i][k]<br />

l.d $f18, 0($t0) # $f18 = 8 bytes of a[i][k]<br />

Now that we have loaded all the data, we are finally ready to do some floating-point

operations! We multiply elements of a <strong>and</strong> b located in registers $f18<br />

<strong>and</strong> $f16, <strong>and</strong> then accumulate the sum in $f4.<br />

mul.d $f16, $f18, $f16 # $f16 = a[i][k] * b[k][j]<br />

add.d $f4, $f4, $f16 # $f4 = c[i][j] + a[i][k] * b[k][j]

The final block increments the index k <strong>and</strong> loops back if the index is not 32.<br />

If it is 32, <strong>and</strong> thus the end of the innermost loop, we need to store the sum<br />

accumulated in $f4 into c[i][j].<br />

addiu $s2, $s2, 1 # $k = k + 1<br />

bne $s2, $t1, L3 # if (k != 32) go to L3<br />

s.d $f4, 0($t2) # c[i][j] = $f4<br />

Similarly, these final four instructions increment the index variable of the<br />

middle <strong>and</strong> outermost loops, looping back if the index is not 32 <strong>and</strong> exiting if<br />

the index is 32.<br />

addiu $s1, $s1, 1 # $j = j + 1<br />

bne $s1, $t1, L2 # if (j != 32) go to L2<br />

addiu $s0, $s0, 1 # $i = i + 1<br />

bne $s0, $t1, L1 # if (i != 32) go to L1<br />

…<br />
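The address arithmetic that the sll/addu sequences build up is easy to restate in C. The fragment below is purely illustrative (the helper name and flat-pointer style are ours, not part of the example above); it computes the same byte offsets for 32 × 32 matrices of 8-byte doubles.

#include <stdint.h>

/* Illustrative helper: byte offset of element [i][j] in a 32 x 32 matrix of
   doubles, formed exactly the way the assembly forms it. */
static uintptr_t offset_32x32(int i, int j)
{
    uintptr_t word = ((uintptr_t)i << 5) + (uintptr_t)j;  /* i * 32 + j */
    return word << 3;                                     /* * 8 bytes  */
}

/* The innermost loop of mm with the address arithmetic made explicit. */
static void mm_inner(double *c, double *a, double *b, int i, int j)
{
    double sum = *(double *)((char *)c + offset_32x32(i, j)); /* load c[i][j] once  */
    for (int k = 0; k != 32; k = k + 1)
        sum += *(double *)((char *)a + offset_32x32(i, k)) *
               *(double *)((char *)b + offset_32x32(k, j));
    *(double *)((char *)c + offset_32x32(i, j)) = sum;        /* store at loop exit */
}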

Figure 3.22 below shows the x86 assembly language code for a slightly different<br />

version of DGEMM in Figure 3.21.<br />

Elaboration: The array layout discussed in the example, called row-major order, is<br />

used by C <strong>and</strong> many other programming languages. Fortran instead uses column-major<br />

order, whereby the array is stored column by column.<br />
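The two layouts differ only in how a pair of indices is flattened into one. As a small illustration (the macro names are our own), for an n × n matrix kept in a one-dimensional block of memory:

/* Element [i][j] of an n x n matrix stored in a flat array: */
#define IDX_ROW_MAJOR(i, j, n)  ((i) * (n) + (j))   /* C: row by row            */
#define IDX_COL_MAJOR(i, j, n)  ((i) + (j) * (n))   /* Fortran: column by column */

Note that the DGEMM code in Figure 3.21 below indexes its flat arrays as C[i+j*n], the column-major pattern, which matches the Fortran-heritage convention of the standard DGEMM libraries.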

Elaboration: Only 16 of the 32 MIPS floating-point registers could originally be used<br />

for double precision operations: $f0, $f2, $f4, …, $f30. Double precision is computed<br />

using pairs of these single precision registers. The odd-numbered floating-point registers<br />

were used only to load <strong>and</strong> store the right half of 64-bit floating-point numbers. MIPS-32<br />

added l.d <strong>and</strong> s.d to the instruction set. MIPS-32 also added “paired single” versions of<br />

all floating-point instructions, where a single instruction results in two parallel floating-point<br />

operations on two 32-bit oper<strong>and</strong>s inside 64-bit registers (see Section 3.6). For example,<br />

add.ps $f0, $f2, $f4 is equivalent to add.s $f0, $f2, $f4 followed by add.s<br />

$f1, $f3, $f5.



Elaboration: Another reason for separate integer and floating-point registers is that
microprocessors in the 1980s didn’t have enough transistors to put the floating-point unit
on the same chip as the integer unit. Hence, the floating-point unit, including the floating-point

registers, was optionally available as a second chip. Such optional accelerator<br />

chips are called coprocessors, <strong>and</strong> explain the acronym for floating-point loads in MIPS:<br />

lwc1 means load word to coprocessor 1, the floating-point unit. (Coprocessor 0 deals<br />

with virtual memory, described in Chapter 5.) Since the early 1990s, microprocessors<br />

have integrated floating point (<strong>and</strong> just about everything else) on chip, <strong>and</strong> hence the term<br />

coprocessor joins accumulator <strong>and</strong> core memory as quaint terms that date the speaker.<br />

Elaboration: As mentioned in Section 3.4, accelerating division is more challenging
than multiplication. In addition to SRT, another technique to leverage a fast multiplier
is Newton’s iteration, where division is recast as finding the zero of a function to find
the reciprocal 1/c, which is then multiplied by the other operand. Iteration techniques
cannot be rounded properly without calculating many extra bits. A TI chip solved this
problem by calculating an extra-precise reciprocal.

Elaboration: Java embraces IEEE 754 by name in its definition of Java floating-point
data types and operations. Thus, the code in the first example could have well been
generated for a class method that converted Fahrenheit to Celsius.

The second example above uses multiple dimensional arrays, which are not explicitly<br />

supported in Java. Java allows arrays of arrays, but each array may have its own length,<br />

unlike multiple dimensional arrays in C. Like the examples in Chapter 2, a Java version<br />

of this second example would require a good deal of checking code for array bounds,<br />

including a new length calculation at the end of row access. It would also need to check<br />

that the object reference is not null.<br />

guard The first of two<br />

extra bits kept on the<br />

right during intermediate<br />

calculations of floating-point

numbers; used<br />

to improve rounding<br />

accuracy.<br />

round Method to<br />

make the intermediate<br />

floating-point result fit<br />

the floating-point format;<br />

the goal is typically to find<br />

the nearest number that<br />

can be represented in the<br />

format.<br />

Accurate Arithmetic<br />

Unlike integers, which can represent exactly every number between the smallest <strong>and</strong><br />

largest number, floating-point numbers are normally approximations for a number<br />

they can’t really represent. The reason is that an infinite variety of real numbers<br />

exists between, say, 0 and 1, but no more than 2^53 can be represented exactly in

double precision floating point. The best we can do is getting the floating-point<br />

representation close to the actual number. Thus, IEEE 754 offers several modes of<br />

rounding to let the programmer pick the desired approximation.<br />
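On systems with C99 support, the rounding mode can be exercised through the standard <fenv.h> interface. The fragment below is only a small sketch of that idea, not part of the example that follows; whether a compiler honors #pragma STDC FENV_ACCESS and run-time mode changes varies by platform.

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile float x = 1.0f, y = 3.0f;

    fesetround(FE_TONEAREST);            /* IEEE 754 default: round to nearest even */
    float nearest = x / y;

    fesetround(FE_UPWARD);               /* always round toward +infinity           */
    float up = x / y;

    printf("%.9g %.9g\n", nearest, up);  /* the two quotients differ by one ulp     */
    return 0;
}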

Rounding sounds simple enough, but to round accurately requires the hardware<br />

to include extra bits in the calculation. In the preceding examples, we were vague<br />

on the number of bits that an intermediate representation can occupy, but clearly,<br />

if every intermediate result had to be truncated to the exact number of digits, there<br />

would be no opportunity to round. IEEE 754, therefore, always keeps two extra bits<br />

on the right during intermediate additions, called guard <strong>and</strong> round, respectively.<br />

Let’s do a decimal example to illustrate their value.



Rounding with Guard Digits<br />

EXAMPLE

Add 2.56ten × 10^0 to 2.34ten × 10^2, assuming that we have three significant
decimal digits. Round to the nearest decimal number with three significant
decimal digits, first with guard and round digits, and then without them.

ANSWER

First we must shift the smaller number to the right to align the exponents, so
2.56ten × 10^0 becomes 0.0256ten × 10^2. Since we have guard and round digits,
we are able to represent the two least significant digits when we align exponents.
The guard digit holds 5 and the round digit holds 6. The sum is

  2.3400ten
+ 0.0256ten
-----------
  2.3656ten

Thus the sum is 2.3656ten × 10^2. Since we have two digits to round, we want
values 0 to 49 to round down and 51 to 99 to round up, with 50 being the
tiebreaker. Rounding the sum up with three significant digits yields 2.37ten × 10^2.

Doing this without guard and round digits drops two digits from the
calculation. The new sum is then

  2.34ten
+ 0.02ten
---------
  2.36ten

The answer is 2.36ten × 10^2, off by 1 in the last digit from the sum above.

Since the worst case for rounding would be when the actual number is halfway<br />

between two floating-point representations, accuracy in floating point is normally<br />

measured in terms of the number of bits in error in the least significant bits of the<br />

signific<strong>and</strong>; the measure is called the number of units in the last place, or ulp. If<br />

a number were off by 2 in the least significant bits, it would be called off by 2 ulps.<br />

Provided there are no overflow, underflow, or invalid operation exceptions, IEEE

754 guarantees that the computer uses the number that is within one-half ulp.<br />

units in the last place<br />

(ulp) The number of<br />

bits in error in the least<br />

significant bits of the<br />

signific<strong>and</strong> between<br />

the actual number <strong>and</strong><br />

the number that can be<br />

represented.<br />
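One way to see the size of an ulp from a program is with the standard C function nextafterf, which returns the adjacent representable value; the short probe below is our own illustration.

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 1.0e6f;
    float ulp = nextafterf(x, INFINITY) - x;  /* gap to the next larger float */
    printf("ulp near %g is %g\n", x, ulp);    /* prints 0.0625 for 1.0e6      */
    return 0;
}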

Elaboration: Although the example above really needed just one extra digit, multiply<br />

can need two. A binary product may have one leading 0 bit; hence, the normalizing step<br />

must shift the product one bit left. This shifts the guard digit into the least significant bit<br />

of the product, leaving the round bit to help accurately round the product.<br />

IEEE 754 has four rounding modes: always round up (toward +∞), always round down
(toward −∞), truncate, and round to nearest even. The final mode determines what to

do if the number is exactly halfway in between. The U.S. Internal Revenue Service (IRS)
always rounds 0.50 dollars up, possibly to the benefit of the IRS. A more equitable way
would be to round up this case half the time and round down the other half. IEEE 754
says that if the least significant bit retained in a halfway case would be odd, add one;
if it would be even, truncate.



In an attempt to squeeze every last bit of precision from a floating-point operation,
the standard allows some numbers to be represented in unnormalized form. Rather than
having a gap between 0 and the smallest normalized number, IEEE allows denormalized
numbers (also known as denorms or subnormals). They have the same exponent as
zero but a nonzero fraction. They allow a number to degrade in significance until it
becomes 0, called gradual underflow. For example, the smallest positive single precision
normalized number is

1.0000 0000 0000 0000 0000 000two × 2^−126

but the smallest single precision denormalized number is

0.0000 0000 0000 0000 0000 001two × 2^−126, or 1.0two × 2^−149

For double precision, the denorm gap goes from 1.0 × 2^−1022 to 1.0 × 2^−1074.

The possibility of an occasional unnormalized operand has given headaches to
floating-point designers who are trying to build fast floating-point units. Hence, many
computers cause an exception if an operand is denormalized, letting software complete
the operation. Although software implementations are perfectly valid, their lower
performance has lessened the popularity of denorms in portable floating-point software.

Moreover, if programmers do not expect denorms, their programs may surprise them.<br />
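The gradual underflow region is visible from C: FLT_MIN in <float.h> is the smallest normalized single precision value, and stepping below it with nextafterf lands on a denormalized number. The probe below is only an illustration and assumes the usual IEEE 754 float.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    float smallest_norm   = FLT_MIN;                    /* 1.0two x 2^-126    */
    float largest_denorm  = nextafterf(FLT_MIN, 0.0f);  /* just below FLT_MIN */
    float smallest_denorm = nextafterf(0.0f, 1.0f);     /* 1.0two x 2^-149    */

    printf("%g %g %g\n", smallest_norm, largest_denorm, smallest_denorm);
    return 0;
}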

3.6<br />

Parallelism <strong>and</strong> <strong>Computer</strong> Arithmetic:<br />

Subword Parallelism<br />

Since every desktop microprocessor by definition has its own graphical displays,<br />

as transistor budgets increased it was inevitable that support would be added for<br />

graphics operations.<br />

Many graphics systems originally used 8 bits to represent each of the three<br />

primary colors plus 8 bits for a location of a pixel. The addition of speakers <strong>and</strong><br />

microphones for teleconferencing <strong>and</strong> video games suggested support of sound as<br />

well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient.<br />

Every microprocessor has special support so that bytes <strong>and</strong> halfwords take up<br />

less space when stored in memory (see Section 2.9), but due to the infrequency of<br />

arithmetic operations on these data sizes in typical integer programs, there was<br />

little support beyond data transfers. Architects recognized that many graphics<br />

<strong>and</strong> audio applications would perform the same operation on vectors of this data.<br />

By partitioning the carry chains within a 128-bit adder, a processor could use<br />

parallelism to perform simultaneous operations on short vectors of sixteen 8-bit<br />

oper<strong>and</strong>s, eight 16-bit oper<strong>and</strong>s, four 32-bit oper<strong>and</strong>s, or two 64-bit oper<strong>and</strong>s. The<br />

cost of such partitioned adders was small.<br />
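The effect of a partitioned carry chain can be imitated in ordinary C by masking off the bit that would otherwise carry across a lane boundary, a trick often called SWAR (SIMD within a register). The sketch below is our own illustration of the idea for four 8-bit lanes in a 32-bit word; it is not a description of any particular processor's adder.

#include <stdint.h>

/* Add four independent 8-bit lanes packed into 32-bit words; no carry
   propagates from one lane into the next (each lane wraps modulo 256). */
static uint32_t add4x8(uint32_t x, uint32_t y)
{
    uint32_t low7 = (x & 0x7f7f7f7f) + (y & 0x7f7f7f7f); /* carries stay inside each byte  */
    uint32_t top  = (x ^ y) & 0x80808080;                /* each lane's bit 7, added mod 2 */
    return low7 ^ top;                                   /* splice bit 7 back in           */
}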

Given that the parallelism occurs within a wide word, the extensions are<br />

classified as subword parallelism. It is also classified under the more general name<br />

of data-level parallelism. They have also been called vector or SIMD, for single

instruction, multiple data (see Section 6.6). The rising popularity of multimedia



1. void dgemm (int n, double* A, double* B, double* C)<br />

2. {<br />

3. for (int i = 0; i < n; ++i)<br />

4. for (int j = 0; j < n; ++j)<br />

5. {<br />

6. double cij = C[i+j*n]; /* cij = C[i][j] */<br />

7. for( int k = 0; k < n; k++ )<br />

8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */<br />

9. C[i+j*n] = cij; /* C[i][j] = cij */<br />

10. }<br />

11. }<br />

FIGURE 3.21 Unoptimized C version of a double precision matrix multiply, widely known as DGEMM for<br />

Double-precision GEneral Matrix Multiply (GEMM). Because we are passing the matrix dimension as the parameter<br />

n, this version of DGEMM uses single dimensional versions of matrices C, A, <strong>and</strong> B <strong>and</strong> address arithmetic to get better<br />

performance instead of using the more intuitive two-dimensional arrays that we saw in Section 3.5. The comments remind<br />

us of this more intuitive notation.<br />

1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0<br />

2. mov %rsi,%rcx # register %rcx = %rsi<br />

3. xor %eax,%eax # register %eax = 0<br />

4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1<br />

5. add %r9,%rcx # register %rcx = %rcx + %r9<br />

6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A<br />

7. add $0x1,%rax # register %rax = %rax + 1<br />

8. cmp %eax,%edi # compare %eax to %edi<br />

9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0<br />

10. jg 30 # jump if %eax > %edi<br />

11. add $0x1,%r11d # register %r11 = %r11 + 1<br />

12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element<br />

FIGURE 3.22 The x86 assembly language for the body of the nested loops generated by compiling the<br />

optimized C code in Figure 3.21. Although it is dealing with just 64 bits of data, the compiler uses the AVX version of
the instructions instead of SSE2, presumably so that it can use three addresses per instruction instead of two (see the Elaboration

in Section 3.7).


3.8 Going Faster: Subword Parallelism <strong>and</strong> Matrix Multiply 227<br />

1. #include <x86intrin.h>

2. void dgemm (int n, double* A, double* B, double* C)<br />

3. {<br />

4. for ( int i = 0; i < n; i+=4 )<br />

5. for ( int j = 0; j < n; j++ ) {<br />

6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */<br />

7. for( int k = 0; k < n; k++ )<br />

8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */<br />

9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n),<br />

10. _mm256_broadcast_sd(B+k+j*n)));<br />

11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */<br />

12. }<br />

13. }<br />

FIGURE 3.23 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel<br />

instructions for the x86. Figure 3.24 shows the assembly language produced by the compiler for the inner loop.<br />

While compiler writers may eventually be able to routinely produce high-quality

code that uses the AVX instructions of the x86, for now we must “cheat” by<br />

using C intrinsics that more or less tell the compiler exactly how to produce good<br />

code. Figure 3.23 shows the enhanced version of Figure 3.21 for which the Gnu C<br />

compiler produces AVX code. Figure 3.24 shows annotated x86 code that is the<br />

output of compiling using gcc with the -O3 level of optimization.

The declaration on line 6 of Figure 3.23 uses the __m256d data type, which tells<br />

the compiler the variable will hold 4 double-precision floating-point values. The<br />

intrinsic _mm256_load_pd() also on line 6 uses AVX instructions to load 4<br />

double-precision floating-point numbers in parallel (_pd) from the matrix C into<br />

c0. The address calculation C+i+j*n on line 6 represents element C[i+j*n].<br />

Symmetrically, the final step on line 11 uses the intrinsic _mm256_store_pd()<br />

to store 4 double-precision floating-point numbers from c0 into the matrix C.<br />

As we’re going through 4 elements each iteration, the outer for loop on line 4<br />

increments i by 4 instead of by 1 as on line 3 of Figure 3.21.<br />

Inside the loops, on line 9 we first load 4 elements of A again using _mm256_<br />

load_pd(). To multiply these elements by one element of B, on line 10 we first<br />

use the intrinsic _mm256_broadcast_sd(), which makes 4 identical copies<br />

of the scalar double precision number—in this case an element of B—in one of the<br />

YMM registers. We then use _mm256_mul_pd() on line 9 to multiply the four<br />

double-precision results in parallel. Finally, _mm256_add_pd() on line 8 adds<br />

the 4 products to the 4 sums in c0.<br />
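One practical detail if you try code like Figure 3.23 yourself: _mm256_load_pd and _mm256_store_pd expect 32-byte-aligned addresses (the vmovapd instructions in Figure 3.24 are the aligned forms), and the i loop assumes n is a multiple of 4. A possible calling sketch using C11's aligned_alloc is shown below; the harness itself is our own and not part of the figure. With gcc, such code is typically compiled with -O3 -mavx.

#include <stdlib.h>
#include <string.h>

void dgemm(int n, double* A, double* B, double* C);   /* the routine of Figure 3.23 */

int main(void)
{
    int n = 32;                                   /* assumed to be a multiple of 4 */
    size_t bytes = (size_t)n * n * sizeof(double);
    double *A = aligned_alloc(32, bytes);         /* 32-byte alignment for AVX     */
    double *B = aligned_alloc(32, bytes);
    double *C = aligned_alloc(32, bytes);
    /* ... fill A and B with data ... */
    memset(C, 0, bytes);                          /* dgemm accumulates into C      */
    dgemm(n, A, B, C);
    free(A); free(B); free(C);
    return 0;
}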

Figure 3.24 shows resulting x86 code for the body of the inner loops produced<br />

by the compiler. You can see the five AVX instructions—they all start with v <strong>and</strong>



1. vmovapd (%r11),%ymm0 # Load 4 elements of C into %ymm0<br />

2. mov %rbx,%rcx # register %rcx = %rbx<br />

3. xor %eax,%eax # register %eax = 0<br />

4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element<br />

5. add $0x8,%rax # register %rax = %rax + 8<br />

6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements<br />

7. add %r9,%rcx # register %rcx = %rcx + %r9<br />

8. cmp %r10,%rax # compare %r10 to %rax<br />

9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0<br />

10. jne 50 # jump if %r10 != %rax

11. add $0x1,%esi # register %esi = %esi + 1

12. vmovapd %ymm0,(%r11) # Store %ymm0 into 4 C elements<br />

FIGURE 3.24 The x86 assembly language for the body of the nested loops generated by compiling<br />

the optimized C code in Figure 3.23. Note the similarities to Figure 3.22, with the primary difference being that the<br />

five floating-point operations are now using YMM registers <strong>and</strong> using the pd versions of the instructions for parallel double<br />

precision instead of the sd version for scalar double precision.<br />

four of the five use pd for parallel double precision—that correspond to the C<br />

intrinsics mentioned above. The code is very similar to that in Figure 3.22 above:<br />

both use 12 instructions, the integer instructions are nearly identical (but different<br />

registers), <strong>and</strong> the floating-point instruction differences are generally just going<br />

from scalar double (sd) using XMM registers to parallel double (pd) with YMM<br />

registers. The one exception is line 4 of Figure 3.24. Every element of A must be<br />

multiplied by one element of B. One solution is to place four identical copies of the<br />

64-bit B element side-by-side into the 256-bit YMM register, which is just what the<br />

instruction vbroadcastsd does.<br />

For matrices of dimensions of 32 by 32, the unoptimized DGEMM in Figure 3.21<br />

runs at 1.7 GigaFLOPS (FLoating point Operations Per Second) on one core of a<br />

2.6 GHz Intel Core i7 (S<strong>and</strong>y Bridge). The optimized code in Figure 3.23 performs<br />

at 6.4 GigaFLOPS. The AVX version is 3.85 times as fast, which is very close to the<br />

factor of 4.0 increase that you might hope for from performing 4 times as many<br />

operations at a time by using subword parallelism.<br />
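As a rough check on such numbers, DGEMM performs about 2n^3 floating-point operations (one multiply and one add per innermost iteration), so a hypothetical timing harness could convert a measured time into a rate as sketched below. For n = 32 that is 65,536 operations, so 1.7 GFLOPS corresponds to roughly 40 microseconds per matrix multiply.

/* Illustrative only: GFLOPS for an n x n x n matrix multiply that ran in
   the given number of seconds. */
double dgemm_gflops(int n, double seconds)
{
    double flops = 2.0 * n * n * n;    /* one multiply + one add per iteration */
    return flops / seconds / 1e9;
}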

Elaboration: As mentioned in the Elaboration in Section 1.6, Intel offers Turbo mode<br />

that temporarily runs at a higher clock rate until the chip gets too hot. This Intel Core i7<br />

(S<strong>and</strong>y Bridge) can increase from 2.6 GHz to 3.3 GHz in Turbo mode. The results above<br />

are with Turbo mode turned off. If we turn it on, we improve all the results by the increase<br />

in the clock rate of 3.3/2.6 = 1.27 to 2.1 GFLOPS for unoptimized DGEMM <strong>and</strong> 8.1<br />

GFLOPS with AVX. Turbo mode works particularly well when using only a single core of<br />

an eight-core chip, as in this case, as it lets that single core use much more than its fair<br />

share of power since the other cores are idle.



3.9 Fallacies <strong>and</strong> Pitfalls<br />

Arithmetic fallacies <strong>and</strong> pitfalls generally stem from the difference between the<br />

limited precision of computer arithmetic <strong>and</strong> the unlimited precision of natural<br />

arithmetic.<br />

Fallacy: Just as a left shift instruction can replace an integer multiply by a<br />

power of 2, a right shift is the same as an integer division by a power of 2.<br />

Thus mathematics may be defined as the subject in which we never know what we are talking about, nor whether what we are saying is true.
Bertrand Russell, Recent Words on the Principles of Mathematics, 1901

Recall that a binary number x, where xi means the ith bit, represents the number

… + (x3 × 2^3) + (x2 × 2^2) + (x1 × 2^1) + (x0 × 2^0)

Shifting the bits of x right by n bits would seem to be the same as dividing by
2^n. And this is true for unsigned integers. The problem is with signed integers. For
example, suppose we want to divide −5ten by 4ten; the quotient should be −1ten. The
two’s complement representation of −5ten is

1111 1111 1111 1111 1111 1111 1111 1011two

According to this fallacy, shifting right by two should divide by 4ten (2^2):

0011 1111 1111 1111 1111 1111 1111 1110two

With a 0 in the sign bit, this result is clearly wrong. The value created by the shift
right is actually 1,073,741,822ten instead of −1ten.

A solution would be to have an arithmetic right shift that extends the sign bit
instead of shifting in 0s. A 2-bit arithmetic shift right of −5ten produces

1111 1111 1111 1111 1111 1111 1111 1110two

The result is −2ten instead of −1ten; close, but no cigar.
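The same contrast is easy to observe in C, where / on integers rounds toward zero while >> applied to a negative value is implementation-defined (most compilers produce an arithmetic shift). A small illustrative check:

#include <stdio.h>

int main(void)
{
    int x = -5;
    printf("%d\n", x / 4);    /* -1: integer division rounds toward zero       */
    printf("%d\n", x >> 2);   /* typically -2: arithmetic shift rounds downward */
    return 0;
}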

Pitfall: Floating-point addition is not associative.<br />

Associativity holds for a sequence of two’s complement integer additions, even if the<br />

computation overflows. Alas, because floating-point numbers are approximations<br />

of real numbers <strong>and</strong> because computer arithmetic has limited precision, it does<br />

not hold for floating-point numbers. Given the great range of numbers that can be<br />

represented in floating point, problems occur when adding two large numbers of<br />

opposite signs plus a small number. For example, let’s see if c + (a + b) = (c + a) + b.
Assume c = −1.5ten × 10^38, a = 1.5ten × 10^38, and b = 1.0, and that these are
all single precision numbers.


c + (a + b) = −1.5ten × 10^38 + (1.5ten × 10^38 + 1.0)
            = −1.5ten × 10^38 + (1.5ten × 10^38)
            = 0.0

(c + a) + b = (−1.5ten × 10^38 + 1.5ten × 10^38) + 1.0
            = (0.0ten) + 1.0
            = 1.0

Since floating-point numbers have limited precision and result in approximations
of real results, 1.5ten × 10^38 is so much larger than 1.0ten that 1.5ten × 10^38 + 1.0 is still
1.5ten × 10^38. That is why the sum of c, a, and b is 0.0 or 1.0, depending on the order
of the floating-point additions, so c + (a + b) ≠ (c + a) + b. Therefore, floating-point
addition is not associative.
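The example translates directly into a few lines of C; this little check is our own illustration, not part of the text.

#include <stdio.h>

int main(void)
{
    float c = -1.5e38f, a = 1.5e38f, b = 1.0f;
    printf("%f\n", c + (a + b));   /* prints 0.000000 */
    printf("%f\n", (c + a) + b);   /* prints 1.000000 */
    return 0;
}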

Fallacy: Parallel execution strategies that work for integer data types also work<br />

for floating-point data types.<br />

Programs have typically been written first to run sequentially before being rewritten<br />

to run concurrently, so a natural question is, “Do the two versions get the same<br />

answer?” If the answer is no, you presume there is a bug in the parallel version that<br />

you need to track down.<br />

This approach assumes that computer arithmetic does not affect the results when<br />

going from sequential to parallel. That is, if you were to add a million numbers<br />

together, you would get the same results whether you used 1 processor or 1000<br />

processors. This assumption holds for two’s complement integers, since integer<br />

addition is associative. Alas, since floating-point addition is not associative, the<br />

assumption does not hold.<br />

A more vexing version of this fallacy occurs on a parallel computer where the<br />

operating system scheduler may use a different number of processors depending<br />

on what other programs are running on a parallel computer. As the varying<br />

number of processors from each run would cause the floating-point sums to be<br />

calculated in different orders, getting slightly different answers each time despite<br />

running identical code with identical input may flummox unaware parallel<br />

programmers.<br />

Given this qu<strong>and</strong>ary, programmers who write parallel code with floating-point<br />

numbers need to verify whether the results are credible even if they don’t give the<br />

same exact answer as the sequential code. The field that deals with such issues is<br />

called numerical analysis, which is the subject of textbooks in its own right. Such<br />

concerns are one reason for the popularity of numerical libraries such as LAPACK<br />

and ScaLAPACK, which have been validated in both their sequential and parallel

forms.<br />
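A sequential-versus-parallel discrepancy can be reproduced even on one processor simply by changing the summation order, which is essentially what a different processor count does. The sketch below, with made-up data, sums the same four numbers left to right and then in two halves, as a two-processor reduction would.

#include <stdio.h>

int main(void)
{
    float x[4] = { 1.0e8f, 1.0f, -1.0e8f, 1.0f };   /* made-up values */

    float seq = 0.0f;                  /* one "processor": left to right   */
    for (int i = 0; i < 4; i++)
        seq += x[i];

    float half1 = x[0] + x[1];         /* two "processors", combined after */
    float half2 = x[2] + x[3];
    float par = half1 + half2;

    printf("%f %f\n", seq, par);       /* prints 1.000000 0.000000         */
    return 0;
}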

Pitfall: The MIPS instruction add immediate unsigned (addiu) sign-extends<br />

its 16-bit immediate field.<br />




Despite its name, add immediate unsigned (addiu) is used to add constants to<br />

signed integers when we don’t care about overflow. MIPS has no subtract immediate<br />

instruction, <strong>and</strong> negative numbers need sign extension, so the MIPS architects<br />

decided to sign-extend the immediate field.<br />

Fallacy: Only theoretical mathematicians care about floating-point accuracy.<br />

Newspaper headlines of November 1994 prove this statement is a fallacy (see<br />

Figure 3.25). The following is the inside story behind the headlines.<br />

The Pentium used a st<strong>and</strong>ard floating-point divide algorithm that generates<br />

multiple quotient bits per step, using the most significant bits of divisor <strong>and</strong><br />

dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup<br />

table containing 2, 1, 0, 1, or 2. The guess is multiplied by the divisor <strong>and</strong><br />

subtracted from the remainder to generate a new remainder. Like nonrestoring<br />

division, if a previous guess gets too large a remainder, the partial remainder is<br />

adjusted in a subsequent pass.<br />

Evidently, there were five elements of the table from the 80486 that Intel<br />

engineers thought could never be accessed, <strong>and</strong> they optimized the logic to return<br />

0 instead of +2 in these situations on the Pentium. Intel was wrong: while the first 11

FIGURE 3.25 A sampling of newspaper <strong>and</strong> magazine articles from November 1994,<br />

including the New York Times, San Jose Mercury News, San Francisco Chronicle, <strong>and</strong><br />

Infoworld. The Pentium floating-point divide bug even made the “Top 10 List” of the David Letterman<br />

Late Show on television. Intel eventually took a $300 million write-off to replace the buggy chips.



bits were always correct, errors would show up occasionally in bits 12 to 52, or the<br />

4th to 15th decimal digits.<br />

A math professor at Lynchburg College in Virginia, Thomas Nicely, discovered the<br />

bug in September 1994. After calling Intel technical support <strong>and</strong> getting no official<br />

reaction, he posted his discovery on the Internet. This post led to a story in a trade<br />

magazine, which in turn caused Intel to issue a press release. It called the bug a glitch<br />

that would affect only theoretical mathematicians, with the average spreadsheet<br />

user seeing an error every 27,000 years. IBM Research soon counterclaimed that the<br />

average spreadsheet user would see an error every 24 days. Intel soon threw in the<br />

towel by making the following announcement on December 21:<br />

“We at Intel wish to sincerely apologize for our h<strong>and</strong>ling of the recently publicized<br />

Pentium processor flaw. The Intel Inside symbol means that your computer has<br />

a microprocessor second to none in quality <strong>and</strong> performance. Thous<strong>and</strong>s of Intel<br />

employees work very hard to ensure that this is true. But no microprocessor is<br />

ever perfect. What Intel continues to believe is technically an extremely minor<br />

problem has taken on a life of its own. Although Intel firmly st<strong>and</strong>s behind the<br />

quality of the current version of the Pentium processor, we recognize that many<br />

users have concerns. We want to resolve these concerns. Intel will exchange the<br />

current version of the Pentium processor for an updated version, in which this<br />

floating-point divide flaw is corrected, for any owner who requests it, free of<br />

charge anytime during the life of their computer.”<br />

Analysts estimate that this recall cost Intel $500 million, <strong>and</strong> Intel engineers did not<br />

get a Christmas bonus that year.<br />

This story brings up a few points for everyone to ponder. How much cheaper<br />

would it have been to fix the bug in July 1994? What was the cost to repair the<br />

damage to Intel’s reputation? And what is the corporate responsibility in disclosing<br />

bugs in a product so widely used <strong>and</strong> relied upon as a microprocessor?<br />

3.10 Concluding Remarks<br />

Over the decades, computer arithmetic has become largely st<strong>and</strong>ardized, greatly<br />

enhancing the portability of programs. Two’s complement binary integer arithmetic is<br />

found in every computer sold today, <strong>and</strong> if it includes floating point support, it offers<br />

the IEEE 754 binary floating-point arithmetic.<br />

<strong>Computer</strong> arithmetic is distinguished from paper-<strong>and</strong>-pencil arithmetic by the<br />

constraints of limited precision. This limit may result in invalid operations through<br />

calculating numbers larger or smaller than the predefined limits. Such anomalies, called<br />

“overflow” or “underflow,” may result in exceptions or interrupts, emergency events<br />

similar to unplanned subroutine calls. Chapters 4 <strong>and</strong> 5 discuss exceptions in more detail.<br />

Floating-point arithmetic has the added challenge of being an approximation<br />

of real numbers, <strong>and</strong> care needs to be taken to ensure that the computer number



selected is the representation closest to the actual number. The challenges of<br />

imprecision <strong>and</strong> limited representation of floating point are part of the inspiration<br />

for the field of numerical analysis. The recent switch to parallelism shines the<br />

searchlight on numerical analysis again, as solutions that were long considered<br />

safe on sequential computers must be reconsidered when trying to find the fastest<br />

algorithm for parallel computers that still achieves a correct result.<br />

Data-level parallelism, specifically subword parallelism, offers a simple path to<br />

higher performance for programs that are intensive in arithmetic operations for<br />

either integer or floating-point data. We showed that we could speed up matrix<br />

multiply nearly fourfold by using instructions that could execute four floating-point

operations at a time.<br />

With the explanation of computer arithmetic in this chapter comes a description<br />

of much more of the MIPS instruction set. One point of confusion is the instructions<br />

covered in these chapters versus instructions executed by MIPS chips versus the<br />

instructions accepted by MIPS assemblers. Two figures try to make this clear.<br />

Figure 3.26 lists the MIPS instructions covered in this chapter <strong>and</strong> Chapter 2.<br />

We call the set of instructions on the left-h<strong>and</strong> side of the figure the MIPS core. The<br />

instructions on the right we call the MIPS arithmetic core. On the left of Figure 3.27<br />

are the instructions the MIPS processor executes that are not found in Figure 3.26.<br />

We call the full set of hardware instructions MIPS-32. On the right of Figure 3.27<br />

are the instructions accepted by the assembler that are not part of MIPS-32. We call<br />

this set of instructions Pseudo MIPS.<br />

Figure 3.28 gives the popularity of the MIPS instructions for SPEC CPU2006<br />

integer <strong>and</strong> floating-point benchmarks. All instructions are listed that were<br />

responsible for at least 0.2% of the instructions executed.<br />

Note that although programmers <strong>and</strong> compiler writers may use MIPS-32 to<br />

have a richer menu of options, MIPS core instructions dominate integer SPEC<br />

CPU2006 execution, <strong>and</strong> the integer core plus arithmetic core dominate SPEC<br />

CPU2006 floating point, as the table below shows.<br />

Instruction subset       Integer   Fl. pt.
MIPS core                   98%      31%
MIPS arithmetic core         2%      66%
Remaining MIPS-32            0%       3%

For the rest of the book, we concentrate on the MIPS core instructions—the integer<br />

instruction set excluding multiply <strong>and</strong> divide—to make the explanation of computer<br />

design easier. As you can see, the MIPS core includes the most popular MIPS<br />

instructions; be assured that underst<strong>and</strong>ing a computer that runs the MIPS core<br />

will give you sufficient background to underst<strong>and</strong> even more ambitious computers.<br />

No matter what the instruction set or its size—MIPS, ARM, x86—never forget that<br />

bit patterns have no inherent meaning. The same bit pattern may represent a signed<br />

integer, unsigned integer, floating-point number, string, instruction, <strong>and</strong> so on. In<br />

stored program computers, it is the operation on the bit pattern that determines its<br />

meaning.



Remaining MIPS-32 Name Format Pseudo MIPS Name Format<br />

exclusive or (rs ⊕ rt) xor R absolute value abs rd,rs<br />

exclusive or immediate xori I negate (signed or unsigned) negs rd,rs<br />

shift right arithmetic sra R rotate left rol rd,rs,rt<br />

shift left logical variable sllv R rotate right ror rd,rs,rt<br />

shift right logical variable srlv R multiply <strong>and</strong> don’t check oflw (signed or uns.) muls rd,rs,rt<br />

shift right arithmetic variable srav R multiply <strong>and</strong> check oflw (signed or uns.) mulos rd,rs,rt<br />

move to Hi mthi R divide <strong>and</strong> check overflow div rd,rs,rt<br />

move to Lo mtlo R divide and don’t check overflow divu rd,rs,rt

load halfword lh I remainder (signed or unsigned) rems rd,rs,rt<br />

load byte lb I load immediate li rd,imm<br />

load word left (unaligned) lwl I load address la rd,addr<br />

load word right (unaligned) lwr I load double ld rd,addr<br />

store word left (unaligned) swl I store double sd rd,addr<br />

store word right (unaligned) swr I unaligned load word ulw rd,addr<br />

load linked (atomic update) ll I unaligned store word usw rd,addr<br />

store cond. (atomic update) sc I unaligned load halfword (signed or uns.) ulhs rd,addr<br />

move if zero movz R unaligned store halfword ush rd,addr<br />

move if not zero movn R branch b Label<br />

multiply <strong>and</strong> add (S or uns.) madds R branch on equal zero beqz rs,L<br />

multiply <strong>and</strong> subtract (S or uns.) msubs I branch on compare (signed or unsigned) bxs rs,rt,L<br />

branch on ≥ zero <strong>and</strong> link bgezal I (x = lt, le, gt, ge)<br />

branch on < zero <strong>and</strong> link bltzal I set equal seq rd,rs,rt<br />

jump <strong>and</strong> link register jalr R set not equal sne rd,rs,rt<br />

branch compare to zero bxz I set on compare (signed or unsigned) sxs rd,rs,rt<br />

branch compare to zero likely bxzl I (x = lt, le, gt, ge)<br />

(x = lt, le, gt, ge) load to floating point (s or d) l.f rd,addr<br />

branch compare reg likely bxl I store from floating point (s or d) s.f rd,addr<br />

trap if compare reg tx R<br />

trap if compare immediate txi I<br />

(x = eq, neq, lt, le, gt, ge)<br />

return from exception rfe R<br />

system call syscall I<br />

break (cause exception) break I<br />

move from FP to integer mfc1 R<br />

move to FP from integer mtc1 R<br />

FP move (s or d) mov.f R<br />

FP move if zero (s or d) movz.f R<br />

FP move if not zero (s or d) movn.f R<br />

FP square root (s or d) sqrt.f R<br />

FP absolute value (s or d) abs.f R<br />

FP negate (s or d) neg.f R<br />

FP convert (w, s, or d) cvt.f.f R<br />

FP compare un (s or d) c.xn.f R<br />

FIGURE 3.27 Remaining MIPS-32 <strong>and</strong> Pseudo MIPS instruction sets. f means single (s) or double (d) precision floating-point<br />

instructions, <strong>and</strong> s means signed <strong>and</strong> unsigned (u) versions. MIPS-32 also has FP instructions for multiply <strong>and</strong> add/sub (madd.f/ msub.f),<br />

ceiling (ceil.f), truncate (trunc.f), round (round.f), <strong>and</strong> reciprocal (recip.f). The underscore represents the letter to include to represent<br />

that datatype.



3.12 Exercises<br />

3.1 [5] What is 5ED4 07A4 when these values represent unsigned 16-<br />

bit hexadecimal numbers? The result should be written in hexadecimal. Show your<br />

work.<br />

3.2 [5] What is 5ED4 07A4 when these values represent signed 16-<br />

bit hexadecimal numbers stored in sign-magnitude format? The result should be<br />

written in hexadecimal. Show your work.<br />

3.3 [10] Convert 5ED4 into a binary number. What makes base 16<br />

(hexadecimal) an attractive numbering system for representing values in<br />

computers?<br />

3.4 [5] What is 4365 3412 when these values represent unsigned 12-bit<br />

octal numbers? The result should be written in octal. Show your work.<br />

3.5 [5] What is 4365 3412 when these values represent signed 12-bit<br />

octal numbers stored in sign-magnitude format? The result should be written in<br />

octal. Show your work.<br />

3.6 [5] Assume 185 <strong>and</strong> 122 are unsigned 8-bit decimal integers. Calculate<br />

185 – 122. Is there overflow, underflow, or neither?<br />

3.7 [5] Assume 185 <strong>and</strong> 122 are signed 8-bit decimal integers stored in<br />

sign-magnitude format. Calculate 185 122. Is there overflow, underflow, or<br />

neither?<br />

3.8 [5] Assume 185 <strong>and</strong> 122 are signed 8-bit decimal integers stored in<br />

sign-magnitude format. Calculate 185 122. Is there overflow, underflow, or<br />

neither?<br />

3.9 [10] Assume 151 <strong>and</strong> 214 are signed 8-bit decimal integers stored in<br />

two’s complement format. Calculate 151 214 using saturating arithmetic. The<br />

result should be written in decimal. Show your work.<br />

3.10 [10] Assume 151 <strong>and</strong> 214 are signed 8-bit decimal integers stored in<br />

two’s complement format. Calculate 151 214 using saturating arithmetic. The<br />

result should be written in decimal. Show your work.<br />

3.11 [10] Assume 151 <strong>and</strong> 214 are unsigned 8-bit integers. Calculate 151<br />

214 using saturating arithmetic. The result should be written in decimal. Show<br />

your work.<br />

3.12 [20] Using a table similar to that shown in Figure 3.6, calculate the<br />

product of the octal unsigned 6-bit integers 62 <strong>and</strong> 12 using the hardware described<br />

in Figure 3.3. You should show the contents of each register on each step.<br />

Never give in, never<br />

give in, never, never,<br />

never—in nothing,<br />

great or small, large or<br />

petty—never give in.<br />

Winston Churchill,<br />

address at Harrow<br />

School, 1941



3.13 [20] Using a table similar to that shown in Figure 3.6, calculate the<br />

product of the hexadecimal unsigned 8-bit integers 62 <strong>and</strong> 12 using the hardware<br />

described in Figure 3.5. You should show the contents of each register on each step.<br />

3.14 [10] Calculate the time necessary to perform a multiply using the<br />

approach given in Figures 3.3 <strong>and</strong> 3.4 if an integer is 8 bits wide <strong>and</strong> each step<br />

of the operation takes 4 time units. Assume that in step 1a an addition is always<br />

performed—either the multiplic<strong>and</strong> will be added, or a zero will be. Also assume<br />

that the registers have already been initialized (you are just counting how long it<br />

takes to do the multiplication loop itself). If this is being done in hardware, the<br />

shifts of the multiplic<strong>and</strong> <strong>and</strong> multiplier can be done simultaneously. If this is being<br />

done in software, they will have to be done one after the other. Solve for each case.<br />

3.15 [10] Calculate the time necessary to perform a multiply using the<br />

approach described in the text (31 adders stacked vertically) if an integer is 8 bits<br />

wide <strong>and</strong> an adder takes 4 time units.<br />

3.16 [20] Calculate the time necessary to perform a multiply using the<br />

approach given in Figure 3.7 if an integer is 8 bits wide <strong>and</strong> an adder takes 4 time<br />

units.<br />

3.17 [20] As discussed in the text, one possible performance enhancement<br />

is to do a shift and add instead of an actual multiplication. Since 9 × 6, for example,
can be written (2 × 2 × 2 + 1) × 6, we can calculate 9 × 6 by shifting 6 to the left 3
times and then adding 6 to that result. Show the best way to calculate 0x33 × 0x55
using shifts and adds/subtracts. Assume both inputs are 8-bit unsigned integers.

3.18 [20] Using a table similar to that shown in Figure 3.10, calculate<br />

74 divided by 21 using the hardware described in Figure 3.8. You should show<br />

the contents of each register on each step. Assume both inputs are unsigned 6-bit<br />

integers.<br />

3.19 [30] Using a table similar to that shown in Figure 3.10, calculate<br />

74 divided by 21 using the hardware described in Figure 3.11. You should show<br />

the contents of each register on each step. Assume A <strong>and</strong> B are unsigned 6-bit<br />

integers. This algorithm requires a slightly different approach than that shown in<br />

Figure 3.9. You will want to think hard about this, do an experiment or two, or else<br />

go to the web to figure out how to make this work correctly. (Hint: one possible<br />

solution involves using the fact that Figure 3.11 implies the remainder register can<br />

be shifted either direction.)<br />

3.20 [5] What decimal number does the bit pattern 0x0C000000

represent if it is a two’s complement integer? An unsigned integer?<br />

3.21 [10] If the bit pattern 0x0C000000 is placed into the Instruction

Register, what MIPS instruction will be executed?<br />

3.22 [10] What decimal number does the bit pattern 0x0C000000

represent if it is a floating point number? Use the IEEE 754 st<strong>and</strong>ard.



3.23 [10] Write down the binary representation of the decimal number<br />

63.25 assuming the IEEE 754 single precision format.<br />

3.24 [10] Write down the binary representation of the decimal number<br />

63.25 assuming the IEEE 754 double precision format.<br />

3.25 [10] Write down the binary representation of the decimal number<br />

63.25 assuming it was stored using the single precision IBM format (base 16,<br />

instead of base 2, with 7 bits of exponent).<br />

3.26 [20] Write down the binary bit pattern to represent −1.5625 × 10^−1

assuming a format similar to that employed by the DEC PDP-8 (the leftmost 12<br />

bits are the exponent stored as a two’s complement number, <strong>and</strong> the rightmost 24<br />

bits are the fraction stored as a two’s complement number). No hidden 1 is used.<br />

Comment on how the range <strong>and</strong> accuracy of this 36-bit pattern compares to the<br />

single <strong>and</strong> double precision IEEE 754 st<strong>and</strong>ards.<br />

3.27 [20] IEEE 754-2008 contains a half precision that is only 16 bits<br />

wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide <strong>and</strong> has a bias<br />

of 15, <strong>and</strong> the mantissa is 10 bits long. A hidden 1 is assumed. Write down the<br />

bit pattern to represent −1.5625 × 10^−1 assuming a version of this format, which

uses an excess-16 format to store the exponent. Comment on how the range <strong>and</strong><br />

accuracy of this 16-bit floating point format compares to the single precision IEEE<br />

754 st<strong>and</strong>ard.<br />

3.28 [20] The Hewlett-Packard 2114, 2115, <strong>and</strong> 2116 used a format<br />

with the leftmost 16 bits being the fraction stored in two’s complement format,<br />

followed by another 16-bit field which had the leftmost 8 bits as an extension of the<br />

fraction (making the fraction 24 bits long), <strong>and</strong> the rightmost 8 bits representing<br />

the exponent. However, in an interesting twist, the exponent was stored in sign-magnitude

format with the sign bit on the far right! Write down the bit pattern to<br />

represent −1.5625 × 10^−1 assuming this format. No hidden 1 is used. Comment on

how the range <strong>and</strong> accuracy of this 32-bit pattern compares to the single precision<br />

IEEE 754 st<strong>and</strong>ard.<br />

3.29 [20] Calculate the sum of 2.6125 × 10^1 and 4.150390625 × 10^−1

by h<strong>and</strong>, assuming A <strong>and</strong> B are stored in the 16-bit half precision described in<br />

Exercise 3.27. Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the<br />

nearest even. Show all the steps.<br />

3.30 [30] Calculate the product of −8.0546875 × 10^0 and 1.79931640625 × 10^−1
by hand, assuming A and B are stored in the 16-bit half precision format

described in Exercise 3.27. Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round<br />

to the nearest even. Show all the steps; however, as is done in the example in the<br />

text, you can do the multiplication in human-readable format instead of using the<br />

techniques described in Exercises 3.12 through 3.14. Indicate if there is overflow<br />

or underflow. Write your answer in both the 16-bit floating point format described<br />

in Exercise 3.27 <strong>and</strong> also as a decimal number. How accurate is your result? How<br />

does it compare to the number you get if you do the multiplication on a calculator?



3.31 [30] Calculate by hand 8.625 × 10^1 divided by 4.875 × 10^0. Show

all the steps necessary to achieve your answer. Assume there is a guard, a round bit,<br />

<strong>and</strong> a sticky bit, <strong>and</strong> use them if necessary. Write the final answer in both the 16-bit<br />

floating point format described in Exercise 3.27 <strong>and</strong> in decimal <strong>and</strong> compare the<br />

decimal result to that which you get if you use a calculator.<br />

3.32 [20] Calculate (3.984375 × 10^−1 + 3.4375 × 10^−1) + 1.771 × 10^3

by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision format<br />

described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard, 1<br />

round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.33 [20] Calculate 3.984375 × 10^−1 + (3.4375 × 10^−1 + 1.771 × 10^3)

by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision format<br />

described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard, 1<br />

round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.34 [10] Based on your answers to 3.32 and 3.33, does (3.984375 × 10^−1 +
3.4375 × 10^−1) + 1.771 × 10^3 = 3.984375 × 10^−1 + (3.4375 × 10^−1 + 1.771 ×
10^3)?

3.35 [30] Calculate (3.41796875 × 10^−3 × 6.34765625 × 10^−3) × 1.05625 × 10^2
by hand, assuming each of the values are stored in the 16-bit half precision

format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />

1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.36 [30] Calculate 3.41796875 × 10^−3 × (6.34765625 × 10^−3 × 1.05625 × 10^2)
by hand, assuming each of the values are stored in the 16-bit half precision

format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />

1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.37 [10] Based on your answers to 3.35 and 3.36, does (3.41796875 × 10^−3 ×
6.34765625 × 10^−3) × 1.05625 × 10^2 = 3.41796875 × 10^−3 × (6.34765625 ×
10^−3 × 1.05625 × 10^2)?

3.38 [30] Calculate 1.666015625 × 10^0 × (1.9760 × 10^4 − 1.9744 × 10^4)
by hand, assuming each of the values are stored in the 16-bit half precision

format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />

1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />

write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />

3.39 [30] Calculate (1.666015625 × 10^0 × 1.9760 × 10^4) − (1.666015625 ×
10^0 × 1.9744 × 10^4) by hand, assuming each of the values are stored in the

16-bit half precision format described in Exercise 3.27 (<strong>and</strong> also described in the<br />

text). Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even.<br />

Show all the steps, <strong>and</strong> write your answer in both the 16-bit floating point format<br />

<strong>and</strong> in decimal.



3.40 [10] Based on your answers to 3.38 and 3.39, does (1.666015625 ×
10^0 × 1.9760 × 10^4) − (1.666015625 × 10^0 × 1.9744 × 10^4) = 1.666015625 ×
10^0 × (1.9760 × 10^4 − 1.9744 × 10^4)?

3.41 [10] Using the IEEE 754 floating point format, write down the bit<br />

pattern that would represent 1/4. Can you represent 1/4 exactly?<br />

3.42 [10] What do you get if you add 1/4 to itself 4 times? What is 1/4<br />

4? Are they the same? What should they be?<br />

3.43 [10] Write down the bit pattern in the fraction of value 1/3 assuming<br />

a floating point format that uses binary numbers in the fraction. Assume there are<br />

24 bits, <strong>and</strong> you do not need to normalize. Is this representation exact?<br />

3.44 [10] Write down the bit pattern in the fraction assuming a floating<br />

point format that uses Binary Coded Decimal (base 10) numbers in the fraction<br />

instead of base 2. Assume there are 24 bits, <strong>and</strong> you do not need to normalize. Is<br />

this representation exact?<br />

3.45 [10] Write down the bit pattern assuming that we are using base 15<br />

numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9<br />

<strong>and</strong> A–F. Base 15 numbers would use 0–9 <strong>and</strong> A–E.) Assume there are 24 bits, <strong>and</strong><br />

you do not need to normalize. Is this representation exact?<br />

3.46 [20] Write down the bit pattern assuming that we are using base 30<br />

numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9<br />

<strong>and</strong> A–F. Base 30 numbers would use 0–9 <strong>and</strong> A–T.) Assume there are 20 bits, <strong>and</strong><br />

you do not need to normalize. Is this representation exact?<br />

3.47 [45] The following C code implements a four-tap FIR filter on<br />

input array sig_in. Assume that all arrays are 16-bit fixed-point values.<br />

for (i = 3; i < 128; i++)
    sig_out[i] = sig_in[i-3] * f[0] + sig_in[i-2] * f[1] +
                 sig_in[i-1] * f[2] + sig_in[i]   * f[3];

Assume you are to write an optimized implementation of this code in assembly

language on a processor that has SIMD instructions <strong>and</strong> 128-bit registers. Without<br />

knowing the details of the instruction set, briefly describe how you would<br />

implement this code, maximizing the use of sub-word operations <strong>and</strong> minimizing<br />

the amount of data that is transferred between registers <strong>and</strong> memory. State all your<br />

assumptions about the instructions you use.<br />

§3.2, page 182: 2.<br />

§3.5, page 221: 3.<br />

Answers to<br />

Check Yourself


4<br />

The Processor<br />

In a major matter, no<br />

details are small.<br />

French Proverb<br />

4.1 Introduction 244<br />

4.2 Logic <strong>Design</strong> Conventions 248<br />

4.3 Building a Datapath 251<br />

4.4 A Simple Implementation Scheme 259<br />

4.5 An Overview of Pipelining 272<br />

4.6 Pipelined Datapath <strong>and</strong> Control 286<br />

4.7 Data Hazards: Forwarding versus<br />

Stalling 303<br />

4.8 Control Hazards 316<br />

4.9 Exceptions 325<br />

4.10 Parallelism via Instructions 332<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


4.1 Introduction

However, it illustrates the key principles used in creating a datapath <strong>and</strong> designing<br />

the control. The implementation of the remaining instructions is similar.<br />

In examining the implementation, we will have the opportunity to see how the<br />

instruction set architecture determines many aspects of the implementation, <strong>and</strong><br />

how the choice of various implementation strategies affects the clock rate <strong>and</strong> CPI<br />

for the computer. Many of the key design principles introduced in Chapter 1 can<br />

be illustrated by looking at the implementation, such as Simplicity favors regularity.<br />

In addition, most concepts used to implement the MIPS subset in this chapter are<br />

the same basic ideas that are used to construct a broad spectrum of computers,<br />

from high-performance servers to general-purpose microprocessors to embedded<br />

processors.<br />

An Overview of the Implementation<br />

In Chapter 2, we looked at the core MIPS instructions, including the integer<br />

arithmetic-logical instructions, the memory-reference instructions, <strong>and</strong> the branch<br />

instructions. Much of what needs to be done to implement these instructions is the<br />

same, independent of the exact class of instruction. For every instruction, the first<br />

two steps are identical:<br />

1. Send the program counter (PC) to the memory that contains the code <strong>and</strong><br />

fetch the instruction from that memory.<br />

2. Read one or two registers, using fields of the instruction to select the registers<br />

to read. For the load word instruction, we need to read only one register, but<br />

most other instructions require reading two registers.<br />

After these two steps, the actions required to complete the instruction depend<br />

on the instruction class. Fortunately, for each of the three instruction classes<br />

(memory-reference, arithmetic-logical, <strong>and</strong> branches), the actions are largely the<br />

same, independent of the exact instruction. The simplicity <strong>and</strong> regularity of the<br />

MIPS instruction set simplifies the implementation by making the execution of<br />

many of the instruction classes similar.<br />

For example, all instruction classes, except jump, use the arithmetic-logical unit<br />

(ALU) after reading the registers. The memory-reference instructions use the ALU<br />

for an address calculation, the arithmetic-logical instructions for the operation<br />

execution, <strong>and</strong> branches for comparison. After using the ALU, the actions required<br />

to complete various instruction classes differ. A memory-reference instruction<br />

will need to access the memory either to read data for a load or write data for a<br />

store. An arithmetic-logical or load instruction must write the data from the ALU<br />

or memory back into a register. Lastly, for a branch instruction, we may need to<br />

change the next instruction address based on the comparison; otherwise, the PC<br />

should be incremented by 4 to get the address of the next instruction.<br />
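As a rough illustration added here (not taken from the text), the two shared steps and the class-dependent follow-up can be sketched in C. The array names instr_mem and regs, the function name, and the use of numeric opcodes for the MIPS subset (35 for lw, 43 for sw, 4 for beq) are assumptions made only for this sketch:

#include <stdint.h>

/* Hypothetical sketch, not the book's implementation: instr_mem, regs, and pc are invented names. */
extern uint32_t instr_mem[], regs[32], pc;

void execute_one_instruction(void)
{
    uint32_t instr = instr_mem[pc >> 2];   /* 1. send the PC to the instruction memory and fetch */
    uint32_t rs = (instr >> 21) & 0x1f;    /* 2. fields of the instruction select the registers  */
    uint32_t rt = (instr >> 16) & 0x1f;    /*    to read (a load needs only the first one)       */
    uint32_t a = regs[rs], b = regs[rt];

    switch (instr >> 26) {                 /* the remaining actions depend on the class          */
    case 35: /* lw:  ALU adds a to the sign-extended offset, memory is read, rt is written */ break;
    case 43: /* sw:  same address calculation, then b is written to data memory            */ break;
    case 4:  /* beq: ALU compares a and b; the PC gets PC + 4 or the branch target         */ break;
    default: /* R-type: ALU operates on a and b, and the result is written to register rd  */ break;
    }
    (void)a; (void)b;                      /* the class-specific details are sketched later      */
}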

Figure 4.1 shows the high-level view of a MIPS implementation, focusing on<br />

the various functional units <strong>and</strong> their interconnection. Although this figure shows<br />

most of the flow of data through the processor, it omits two important aspects of<br />

instruction execution.



First, in several places, Figure 4.1 shows data going to a particular unit as coming<br />

from two different sources. For example, the value written into the PC can come<br />

from one of two adders, the data written into the register file can come from either<br />

the ALU or the data memory, <strong>and</strong> the second input to the ALU can come from<br />

a register or the immediate field of the instruction. In practice, these data lines<br />

cannot simply be wired together; we must add a logic element that chooses from<br />

among the multiple sources <strong>and</strong> steers one of those sources to its destination. This<br />

selection is commonly done with a device called a multiplexor, although this device<br />

might better be called a data selector. Appendix B describes the multiplexor, which<br />

selects from among several inputs based on the setting of its control lines. The<br />

control lines are set based primarily on information taken from the instruction<br />

being executed.<br />
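As a rough software analogy (an illustration added here, not hardware), a two-input multiplexor simply steers one of its sources to the output under a control line:

#include <stdint.h>

/* Illustrative only: a 2-to-1 multiplexor (data selector) modeled as a C function. */
static inline uint32_t mux2(uint32_t in0, uint32_t in1, int control)
{
    return control ? in1 : in0;   /* the control line, set from the instruction, picks the source */
}

A multiplexor with more data inputs needs more control bits, but the idea is unchanged.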

The second omission in Figure 4.1 is that several of the units must be controlled<br />

depending on the type of instruction. For example, the data memory must read<br />


FIGURE 4.1 An abstract view of the implementation of the MIPS subset showing the<br />

major functional units <strong>and</strong> the major connections between them. All instructions start by using<br />

the program counter to supply the instruction address to the instruction memory. After the instruction is<br />

fetched, the register oper<strong>and</strong>s used by an instruction are specified by fields of that instruction. Once the<br />

register oper<strong>and</strong>s have been fetched, they can be operated on to compute a memory address (for a load or<br />

store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or a compare (for a<br />

branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to<br />

a register. If the operation is a load or store, the ALU result is used as an address to either store a value from<br />

the registers or load a value from memory into the registers. The result from the ALU or memory is written<br />

back into the register file. Branches require the use of the ALU output to determine the next instruction<br />

address, which comes either from the ALU (where the PC <strong>and</strong> branch offset are summed) or from an adder<br />

that increments the current PC by 4. The thick lines interconnecting the functional units represent buses,<br />

which consist of multiple signals. The arrows are used to guide the reader in knowing how information flows.<br />

Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot<br />

where the lines cross.



on a load and write on a store. The register file must be written only on a load

or an arithmetic-logical instruction. And, of course, the ALU must perform one<br />

of several operations. (Appendix B describes the detailed design of the ALU.)<br />

Like the multiplexors, control lines that are set on the basis of various fields in the<br />

instruction direct these operations.<br />

Figure 4.2 shows the datapath of Figure 4.1 with the three required multiplexors<br />

added, as well as control lines for the major functional units. A control unit,<br />

which has the instruction as an input, is used to determine how to set the control<br />

lines for the functional units <strong>and</strong> two of the multiplexors. The third multiplexor,<br />


FIGURE 4.2 The basic implementation of the MIPS subset, including the necessary multiplexors <strong>and</strong> control lines.<br />

The top multiplexor (“Mux”) controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled<br />

by the gate that “ANDs” together the Zero output of the ALU <strong>and</strong> a control signal that indicates that the instruction is a branch. The middle<br />

multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or<br />

the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine<br />

whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction<br />

(for a load or store). The added control lines are straightforward <strong>and</strong> determine the operation performed at the ALU, whether the data memory<br />

should read or write, <strong>and</strong> whether the registers should perform a write operation. The control lines are shown in color to make them easier to<br />

see.



sign-extend To increase<br />

the size of a data item by<br />

replicating the high-order<br />

sign bit of the original<br />

data item in the high-order

bits of the larger,<br />

destination data item.<br />

branch target<br />

address The address<br />

specified in a branch,<br />

which becomes the new<br />

program counter (PC)<br />

if the branch is taken. In<br />

the MIPS architecture the<br />

branch target is given by<br />

the sum of the offset field<br />

of the instruction <strong>and</strong> the<br />

address of the instruction<br />

following the branch.<br />

branch taken<br />

A branch where the<br />

branch condition is<br />

satisfied <strong>and</strong> the program<br />

counter (PC) becomes<br />

the branch target. All<br />

unconditional jumps are<br />

taken branches.<br />

branch not taken (or untaken branch)

A branch where the<br />

branch condition is false<br />

<strong>and</strong> the program counter<br />

(PC) becomes the address<br />

of the instruction that<br />

sequentially follows the<br />

branch.<br />

Next, consider the MIPS load word <strong>and</strong> store word instructions, which have the<br />

general form lw $t1,offset_value($t2) or sw $t1,offset_value<br />

($t2). These instructions compute a memory address by adding the base register,<br />

which is $t2, to the 16-bit signed offset field contained in the instruction. If the<br />

instruction is a store, the value to be stored must also be read from the register file<br />

where it resides in $t1. If the instruction is a load, the value read from memory<br />

must be written into the register file in the specified register, which is $t1. Thus,<br />

we will need both the register file <strong>and</strong> the ALU from Figure 4.7.<br />

In addition, we will need a unit to sign-extend the 16-bit offset field in the<br />

instruction to a 32-bit signed value, <strong>and</strong> a data memory unit to read from or write<br />

to. The data memory must be written on store instructions; hence, data memory<br />

has read <strong>and</strong> write control signals, an address input, <strong>and</strong> an input for the data to be<br />

written into memory. Figure 4.8 shows these two elements.<br />
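A hedged C sketch of these two elements follows; the array names regs and data_mem stand in for the register file and data memory and are invented for this illustration only:

#include <stdint.h>

extern uint32_t regs[32], data_mem[];

/* Sign-extension unit: replicate bit 15 of the 16-bit offset into the upper 16 bits. */
static int32_t sign_extend16(uint32_t offset) { return (int32_t)(int16_t)offset; }

/* lw $t1, offset($t2): the base register plus the sign-extended offset forms the address. */
void load_word(int t1, int t2, uint32_t offset16)
{
    uint32_t addr = regs[t2] + (uint32_t)sign_extend16(offset16);
    regs[t1] = data_mem[addr >> 2];         /* MemRead: the loaded value is written into $t1   */
}

/* sw $t1, offset($t2): the same address calculation, but the register value goes to memory. */
void store_word(int t1, int t2, uint32_t offset16)
{
    uint32_t addr = regs[t2] + (uint32_t)sign_extend16(offset16);
    data_mem[addr >> 2] = regs[t1];         /* MemWrite: $t1 is stored at the computed address */
}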

The beq instruction has three oper<strong>and</strong>s, two registers that are compared for<br />

equality, <strong>and</strong> a 16-bit offset used to compute the branch target address relative<br />

to the branch instruction address. Its form is beq $t1,$t2,offset. To<br />

implement this instruction, we must compute the branch target address by adding<br />

the sign-extended offset field of the instruction to the PC. There are two details in<br />

the definition of branch instructions (see Chapter 2) to which we must pay attention:<br />

■ The instruction set architecture specifies that the base for the branch address<br />

calculation is the address of the instruction following the branch. Since we<br />

compute PC + 4 (the address of the next instruction) in the instruction fetch<br />

datapath, it is easy to use this value as the base for computing the branch<br />

target address.<br />

■ The architecture also states that the offset field is shifted left 2 bits so that it<br />

is a word offset; this shift increases the effective range of the offset field by a<br />

factor of 4.<br />

To deal with the latter complication, we will need to shift the offset field by 2.<br />

As well as computing the branch target address, we must also determine whether<br />

the next instruction is the instruction that follows sequentially or the instruction<br />

at the branch target address. When the condition is true (i.e., the oper<strong>and</strong>s are<br />

equal), the branch target address becomes the new PC, <strong>and</strong> we say that the branch<br />

is taken. If the oper<strong>and</strong>s are not equal, the incremented PC should replace the<br />

current PC (just as for any other normal instruction); in this case, we say that the<br />

branch is not taken.<br />
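In C-like terms (a sketch added for illustration, with invented names), the target calculation and the taken/not-taken choice look like this:

#include <stdint.h>

/* Illustrative sketch of the two jobs of the branch datapath; all names are invented here. */
static int32_t sign_extend16(uint32_t offset) { return (int32_t)(int16_t)offset; }

uint32_t next_pc_for_beq(uint32_t pc, uint32_t offset16, uint32_t reg_a, uint32_t reg_b)
{
    uint32_t pc_plus_4 = pc + 4;                            /* the base the architecture specifies */
    uint32_t target = pc_plus_4 + ((uint32_t)sign_extend16(offset16) << 2);   /* word offset << 2  */
    return (reg_a == reg_b) ? target : pc_plus_4;           /* branch taken vs. branch not taken   */
}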

Thus, the branch datapath must do two operations: compute the branch target<br />

address <strong>and</strong> compare the register contents. (Branches also affect the instruction<br />

fetch portion of the datapath, as we will deal with shortly.) Figure 4.9 shows the<br />

structure of the datapath segment that h<strong>and</strong>les branches. To compute the branch<br />

target address, the branch datapath includes a sign extension unit, from Figure 4.8<br />

<strong>and</strong> an adder. To perform the compare, we need to use the register file shown in<br />

Figure 4.7a to supply the two register oper<strong>and</strong>s (although we will not need to write<br />

into the register file). In addition, the comparison can be done using the ALU we



FIGURE 4.9 The datapath for a branch uses the ALU to evaluate the branch condition <strong>and</strong><br />

a separate adder to compute the branch target as the sum of the incremented PC <strong>and</strong> the<br />

sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2<br />

bits. The unit labeled Shift left 2 is simply a routing of the signals between input <strong>and</strong> output that adds 00 two<br />

to the low-order end of the sign-extended offset field; no actual shift hardware is needed, since the amount of<br />

the “shift” is constant. Since we know that the offset was sign-extended from 16 bits, the shift will throw away<br />

only “sign bits.” Control logic is used to decide whether the incremented PC or branch target should replace<br />

the PC, based on the Zero output of the ALU.<br />

Creating a Single Datapath<br />

Now that we have examined the datapath components needed for the individual<br />

instruction classes, we can combine them into a single datapath <strong>and</strong> add the control<br />

to complete the implementation. This simplest datapath will attempt to execute all<br />

instructions in one clock cycle. This means that no datapath resource can be used<br />

more than once per instruction, so any element needed more than once must be<br />

duplicated. We therefore need a memory for instructions separate from one for<br />

data. Although some of the functional units will need to be duplicated, many of the<br />

elements can be shared by different instruction flows.<br />

To share a datapath element between two different instruction classes, we may<br />

need to allow multiple connections to the input of an element, using a multiplexor<br />

<strong>and</strong> control signal to select among the multiple inputs.



Building a Datapath<br />

The operations of arithmetic-logical (or R-type) instructions <strong>and</strong> the memory<br />

instructions datapath are quite similar. The key differences are the following:<br />

■ The arithmetic-logical instructions use the ALU, with the inputs coming<br />

from the two registers. The memory instructions can also use the ALU<br />

to do the address calculation, although the second input is the sign-extended

16-bit offset field from the instruction.<br />

■ The value stored into a destination register comes from the ALU (for an<br />

R-type instruction) or the memory (for a load).<br />

EXAMPLE

Show how to build a datapath for the operational portion of the memory-reference and arithmetic-logical instructions that uses a single register file and a single ALU to handle both types of instructions, adding any necessary multiplexors.

ANSWER

To create a datapath with only a single register file and a single ALU, we must support two different sources for the second ALU input, as well as two different sources for the data stored into the register file. Thus, one multiplexor is placed at the ALU input and another at the data input to the register file. Figure 4.10 shows the operational portion of the combined datapath.
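Expressed as C selections (an illustration with invented signal names, not the book's code), the two added multiplexors correspond to:

#include <stdint.h>

/* Illustrative only: ALUSrc picks the second ALU input, MemtoReg picks the register write data. */
uint32_t select_alu_b(uint32_t read_data_2, uint32_t sign_ext_imm, int alusrc)
{
    return alusrc ? sign_ext_imm : read_data_2;     /* 1 for lw/sw (use the offset), 0 for R-type */
}

uint32_t select_write_data(uint32_t alu_result, uint32_t mem_read_data, int memtoreg)
{
    return memtoreg ? mem_read_data : alu_result;   /* 1 for a load, 0 for an R-type result       */
}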

Now we can combine all the pieces to make a simple datapath for the core<br />

MIPS architecture by adding the datapath for instruction fetch (Figure 4.6), the<br />

datapath from R-type <strong>and</strong> memory instructions (Figure 4.10), <strong>and</strong> the datapath<br />

for branches (Figure 4.9). Figure 4.11 shows the datapath we obtain by composing<br />

the separate pieces. The branch instruction uses the main ALU for comparison of<br />

the register oper<strong>and</strong>s, so we must keep the adder from Figure 4.9 for computing<br />

the branch target address. An additional multiplexor is required to select either the<br />

sequentially following instruction address (PC + 4) or the branch target address to<br />

be written into the PC.<br />

Now that we have completed this simple datapath, we can add the control unit.<br />

The control unit must be able to take inputs <strong>and</strong> generate a write signal for each<br />

state element, the selector control for each multiplexor, <strong>and</strong> the ALU control. The<br />

ALU control is different in a number of ways, <strong>and</strong> it will be useful to design it first<br />

before we design the rest of the control unit.<br />

Check Yourself

I. Which of the following is correct for a load instruction? Refer to Figure 4.10.

a. MemtoReg should be set to cause the data from memory to be sent to the register file.



FIGURE 4.10 The datapath for the memory instructions <strong>and</strong> the R-type instructions. This example shows how a single<br />

datapath can be assembled from the pieces in Figures 4.7 <strong>and</strong> 4.8 by adding multiplexors. Two multiplexors are needed, as described in the<br />

example.<br />


FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different<br />

instruction classes. The components come from Figures 4.6, 4.9, <strong>and</strong> 4.10. This datapath can execute the basic instructions (load-store<br />

word, ALU operations, <strong>and</strong> branches) in a single clock cycle. Just one additional multiplexor is needed to integrate branches. The support for<br />

jumps will be added later.



Field 0 rs rt rd shamt funct<br />

Bit positions 31:26 25:21 20:16 15:11 10:6 5:0<br />

a. R-type instruction<br />

Field 35 or 43 rs rt address<br />

Bit positions 31:26 25:21 20:16 15:0<br />

b. Load or store instruction<br />

Field 4 rs rt address<br />

Bit positions 31:26 25:21 20:16 15:0<br />

c. Branch instruction<br />

FIGURE 4.14 The three instruction classes (R-type, load <strong>and</strong> store, <strong>and</strong> branch) use two<br />

different instruction formats. The jump instructions use another format, which we will discuss shortly.<br />

(a) Instruction format for R-format instructions, which all have an opcode of 0. These instructions have three<br />

register oper<strong>and</strong>s: rs, rt, <strong>and</strong> rd. Fields rs <strong>and</strong> rt are sources, <strong>and</strong> rd is the destination. The ALU function is<br />

in the funct field <strong>and</strong> is decoded by the ALU control design in the previous section. The R-type instructions<br />

that we implement are add, sub, AND, OR, <strong>and</strong> slt. The shamt field is used only for shifts; we will ignore it<br />

in this chapter. (b) Instruction format for load (opcode = 35 ten) and store (opcode = 43 ten) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory. (c) Instruction format for branch equal (opcode = 4). The registers rs and rt are the

source registers that are compared for equality. The 16-bit address field is sign-extended, shifted, <strong>and</strong> added<br />

to the PC + 4 to compute the branch target address.<br />

opcode The field that<br />

denotes the operation <strong>and</strong><br />

format of an instruction.<br />

the formats of the three instruction classes: the R-type, branch, <strong>and</strong> load-store<br />

instructions. Figure 4.14 shows these formats.<br />

There are several major observations about this instruction format that we will<br />

rely on:<br />

■ The op field, which as we saw in Chapter 2 is called the opcode, is always<br />

contained in bits 31:26. We will refer to this field as Op[5:0].<br />

■ The two registers to be read are always specified by the rs <strong>and</strong> rt fields, at<br />

positions 25:21 <strong>and</strong> 20:16. This is true for the R-type instructions, branch<br />

equal, <strong>and</strong> store.<br />

■ The base register for load <strong>and</strong> store instructions is always in bit positions<br />

25:21 (rs).<br />

■ The 16-bit offset for branch equal, load, <strong>and</strong> store is always in positions 15:0.<br />

■ The destination register is in one of two places. For a load it is in bit positions<br />

20:16 (rt), while for an R-type instruction it is in bit positions 15:11 (rd).<br />

Thus, we will need to add a multiplexor to select which field of the instruction<br />

is used to indicate the register number to be written, as sketched below.
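These observations translate directly into field extraction plus one selection. The sketch below is illustrative only; the struct, function, and signal names are invented here:

#include <stdint.h>

/* Illustrative decode of the fixed field positions, plus the destination-register
   multiplexor (RegDst); the names are assumptions made for this sketch. */
struct fields { uint32_t op, rs, rt, rd, imm16; };

struct fields decode(uint32_t instr)
{
    struct fields f;
    f.op    = instr >> 26;            /* Op[5:0] is always bits 31:26                    */
    f.rs    = (instr >> 21) & 0x1f;   /* first source / base register, bits 25:21        */
    f.rt    = (instr >> 16) & 0x1f;   /* second source (or load destination), bits 20:16 */
    f.rd    = (instr >> 11) & 0x1f;   /* R-type destination, bits 15:11                  */
    f.imm16 = instr & 0xffff;         /* 16-bit offset for beq, lw, and sw, bits 15:0    */
    return f;
}

uint32_t write_register_number(struct fields f, int regdst)
{
    return regdst ? f.rd : f.rt;      /* the added multiplexor: rd for R-type, rt for lw */
}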

The first design principle from Chapter 2—simplicity favors regularity—pays off<br />

here in specifying control.



FIGURE 4.15 The datapath of Figure 4.11 with all necessary multiplexors <strong>and</strong> all control lines identified. The control<br />

lines are shown in color. The ALU control block has also been added. The PC does not require a write control, since it is written once at the end<br />

of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address.<br />

Using this information, we can add the instruction labels <strong>and</strong> extra multiplexor<br />

(for the Write register number input of the register file) to the simple datapath.<br />

Figure 4.15 shows these additions plus the ALU control block, the write signals for<br />

state elements, the read signal for the data memory, <strong>and</strong> the control signals for the<br />

multiplexors. Since all the multiplexors have two inputs, they each require a single<br />

control line.<br />

Figure 4.15 shows seven single-bit control lines plus the 2-bit ALUOp control<br />

signal. We have already defined how the ALUOp control signal works, <strong>and</strong> it is<br />

useful to define what the seven other control signals do informally before we<br />

determine how to set these control signals during instruction execution. Figure<br />

4.16 describes the function of these seven control lines.<br />

Now that we have looked at the function of each of the control signals, we can<br />

look at how to set them. The control unit can set all but one of the control signals<br />

based solely on the opcode field of the instruction. The PCSrc control line is the<br />

exception. That control line should be asserted if the instruction is branch on equal<br />

(a decision that the control unit can make) <strong>and</strong> the Zero output of the ALU, which<br />

is used for equality comparison, is asserted. To generate the PCSrc signal, we will<br />

need to AND together a signal from the control unit, which we call Branch, with<br />

the Zero signal out of the ALU.



FIGURE 4.17 The simple datapath with the control unit. The input to the control unit is the 6-bit opcode field from the instruction.<br />

The outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, <strong>and</strong> MemtoReg), three<br />

signals for controlling reads <strong>and</strong> writes in the register file <strong>and</strong> data memory (RegWrite, MemRead, <strong>and</strong> MemWrite), a 1-bit signal used in<br />

determining whether to possibly branch (Branch), <strong>and</strong> a 2-bit control signal for the ALU (ALUOp). An AND gate is used to combine the<br />

branch control signal <strong>and</strong> the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now<br />

a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.<br />

Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow

of information:<br />

1. The instruction is fetched, <strong>and</strong> the PC is incremented.<br />

2. Two registers, $t2 <strong>and</strong> $t3, are read from the register file; also, the main<br />

control unit computes the setting of the control lines during this step.<br />

3. The ALU operates on the data read from the register file, using the function<br />

code (bits 5:0, which is the funct field, of the instruction) to generate the<br />

ALU function.



3. The ALU computes the sum of the value read from the register file <strong>and</strong> the<br />

sign-extended, lower 16 bits of the instruction (offset).<br />

4. The sum from the ALU is used as the address for the data memory.<br />

5. The data from the memory unit is written into the register file; the register<br />

destination is given by bits 20:16 of the instruction ($t1).<br />

Finally, we can show the operation of the branch-on-equal instruction, such as<br />

beq $t1, $t2, offset, in the same fashion. It operates much like an R-format<br />

instruction, but the ALU output is used to determine whether the PC is written with<br />

PC + 4 or the branch target address. Figure 4.21 shows the four steps in execution:<br />

1. An instruction is fetched from the instruction memory, <strong>and</strong> the PC is<br />

incremented.<br />


FIGURE 4.21 The datapath in operation for a branch-on-equal instruction. The control lines, datapath units, <strong>and</strong> connections<br />

that are active are highlighted. After using the register file <strong>and</strong> ALU to perform the compare, the Zero output is used to select the next program<br />

counter from between the two c<strong>and</strong>idates.



single-cycle<br />

implementation Also<br />

called single clock cycle<br />

implementation. An<br />

implementation in which<br />

an instruction is executed<br />

in one clock cycle. While<br />

easy to underst<strong>and</strong>, it is<br />

too slow to be practical.<br />

Now that we have a single-cycle implementation of most of the MIPS core<br />

instruction set, let’s add the jump instruction to show how the basic datapath <strong>and</strong><br />

control can be extended to h<strong>and</strong>le other instructions in the instruction set.<br />

EXAMPLE<br />

Implementing Jumps<br />

Figure 4.17 shows the implementation of many of the instructions we looked at<br />

in Chapter 2. One class of instructions missing is that of the jump instruction.<br />

Extend the datapath <strong>and</strong> control of Figure 4.17 to include the jump instruction.<br />

Describe how to set any new control lines.<br />

ANSWER<br />

The jump instruction, shown in Figure 4.23, looks somewhat like a branch<br />

instruction but computes the target PC differently <strong>and</strong> is not conditional. Like<br />

a branch, the low-order 2 bits of a jump address are always 00 two<br />

. The next<br />

lower 26 bits of this 32-bit address come from the 26-bit immediate field in the<br />

instruction. The upper 4 bits of the address that should replace the PC come<br />

from the PC of the jump instruction plus 4. Thus, as sketched in C after the following list, we can implement a jump by storing into the PC the concatenation of

■ the upper 4 bits of the current PC + 4 (these are bits 31:28 of the<br />

sequentially following instruction address)<br />

■ the 26-bit immediate field of the jump instruction<br />

■ the bits 00 two<br />
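A hedged C sketch of this concatenation (the function and parameter names are invented for illustration):

#include <stdint.h>

/* Illustrative sketch: form the jump target address by concatenation. */
uint32_t jump_target(uint32_t pc_plus_4, uint32_t imm26)
{
    return (pc_plus_4 & 0xF0000000u)        /* upper 4 bits come from PC + 4                  */
         | ((imm26 & 0x03FFFFFFu) << 2);    /* 26-bit field shifted left 2; low bits become 00 */
}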

Figure 4.24 shows the addition of the control for jump added to Figure 4.17. An<br />

additional multiplexor is used to select the source for the new PC value, which<br />

is either the incremented PC (PC + 4), the branch target PC, or the jump target<br />

PC. One additional control signal is needed for the additional multiplexor. This<br />

control signal, called Jump, is asserted only when the instruction is a jump—<br />

that is, when the opcode is 2.<br />

Field 000010 address<br />

Bit positions 31:26 25:0<br />

FIGURE 4.23 Instruction format for the jump instruction (opcode = 2). The destination<br />

address for a jump instruction is formed by concatenating the upper 4 bits of the current PC + 4 to the 26-bit<br />

address field in the jump instruction <strong>and</strong> adding 00 as the 2 low-order bits.



FIGURE 4.24 The simple control <strong>and</strong> datapath are extended to h<strong>and</strong>le the jump instruction. An additional multiplexor (at<br />

the upper right) is used to choose between the jump target <strong>and</strong> either the branch target or the sequential instruction following this one. This<br />

multiplexor is controlled by the jump control signal. The jump target address is obtained by shifting the lower 26 bits of the jump instruction<br />

left 2 bits, effectively adding 00 as the low-order bits, <strong>and</strong> then concatenating the upper 4 bits of PC + 4 as the high-order bits, thus yielding a<br />

32-bit address.<br />

Why a Single-Cycle Implementation Is Not Used Today<br />

Although the single-cycle design will work correctly, it would not be used in<br />

modern designs because it is inefficient. To see why this is so, notice that the clock<br />

cycle must have the same length for every instruction in this single-cycle design.<br />

Of course, the longest possible path in the processor determines the clock cycle.<br />

This path is almost certainly a load instruction, which uses five functional units<br />

in series: the instruction memory, the register file, the ALU, the data memory, <strong>and</strong><br />

the register file. Although the CPI is 1 (see Chapter 1), the overall performance of<br />

a single-cycle implementation is likely to be poor, since the clock cycle is too long.<br />

The penalty for using the single-cycle design with a fixed clock cycle is significant,<br />

but might be considered acceptable for this small instruction set. Historically, early



computers with very simple instruction sets did use this implementation technique.<br />

However, if we tried to implement the floating-point unit or an instruction set with<br />

more complex instructions, this single-cycle design wouldn’t work well at all.<br />

Because we must assume that the clock cycle is equal to the worst-case delay<br />

for all instructions, it’s useless to try implementation techniques that reduce the<br />

delay of the common case but do not improve the worst-case cycle time. A single-cycle

implementation thus violates the great idea from Chapter 1 of making the<br />

common case fast.<br />

In the next section, we'll look at another implementation technique, called

pipelining, that uses a datapath very similar to the single-cycle datapath but is<br />

much more efficient by having a much higher throughput. Pipelining improves<br />

efficiency by executing multiple instructions simultaneously.<br />

Check<br />

Yourself<br />

Look at the control signals in Figure 4.22. Can you combine any together? Can any<br />

control signal output in the figure be replaced by the inverse of another? (Hint: take<br />

into account the don’t cares.) If so, can you use one signal for the other without<br />

adding an inverter?<br />

4.5 An Overview of Pipelining<br />

Never waste time.<br />

American proverb<br />

pipelining An<br />

implementation<br />

technique in which<br />

multiple instructions are<br />

overlapped in execution,<br />

much like an assembly<br />

line.<br />

Pipelining is an implementation technique in which multiple instructions are<br />

overlapped in execution. Today, pipelining is nearly universal.<br />

This section relies heavily on one analogy to give an overview of the pipelining<br />

terms <strong>and</strong> issues. If you are interested in just the big picture, you should concentrate<br />

on this section <strong>and</strong> then skip to Sections 4.10 <strong>and</strong> 4.11 to see an introduction to the<br />

advanced pipelining techniques used in recent processors such as the Intel Core i7<br />

<strong>and</strong> ARM Cortex-A8. If you are interested in exploring the anatomy of a pipelined<br />

computer, this section is a good introduction to Sections 4.6 through 4.9.<br />

Anyone who has done a lot of laundry has intuitively used pipelining. The nonpipelined<br />

approach to laundry would be as follows:<br />

1. Place one dirty load of clothes in the washer.<br />

2. When the washer is finished, place the wet load in the dryer.<br />

3. When the dryer is finished, place the dry load on a table <strong>and</strong> fold.<br />

4. When folding is finished, ask your roommate to put the clothes away.<br />

When your roommate is done, start over with the next dirty load.<br />

The pipelined approach takes much less time, as Figure 4.25 shows. As soon<br />

as the washer is finished with the first load <strong>and</strong> placed in the dryer, you load the<br />

washer with the second dirty load. When the first load is dry, you place it on the<br />

table to start folding, move the wet load to the dryer, <strong>and</strong> put the next dirty load



pipeline, in this case four: washing, drying, folding, <strong>and</strong> putting away. Therefore,<br />

pipelined laundry is potentially four times faster than nonpipelined: 20 loads would<br />

take about 5 times as long as 1 load, while 20 loads of sequential laundry takes 20<br />

times as long as 1 load. It’s only 2.3 times faster in Figure 4.25, because we only<br />

show 4 loads. Notice that at the beginning <strong>and</strong> end of the workload in the pipelined<br />

version in Figure 4.25, the pipeline is not completely full; this start-up and wind-down

affects performance when the number of tasks is not large compared to the<br />

number of stages in the pipeline. If the number of loads is much larger than 4, then<br />

the stages will be full most of the time <strong>and</strong> the increase in throughput will be very<br />

close to 4.<br />

The same principles apply to processors where we pipeline instruction-execution.<br />

MIPS instructions classically take five steps:<br />

1. Fetch instruction from memory.<br />

2. Read registers while decoding the instruction. The regular format of MIPS<br />

instructions allows reading <strong>and</strong> decoding to occur simultaneously.<br />

3. Execute the operation or calculate an address.<br />

4. Access an oper<strong>and</strong> in data memory.<br />

5. Write the result into a register.<br />

Hence, the MIPS pipeline we explore in this chapter has five stages. The following<br />

example shows that pipelining speeds up instruction execution just as it speeds up<br />

the laundry.<br />

EXAMPLE

Single-Cycle versus Pipelined Performance

To make this discussion concrete, let’s create a pipeline. In this example, <strong>and</strong> in<br />

the rest of this chapter, we limit our attention to eight instructions: load word<br />

(lw), store word (sw), add (add), subtract (sub), AND (<strong>and</strong>), OR (or), set<br />

less than (slt), <strong>and</strong> branch on equal (beq).<br />

Compare the average time between instructions of a single-cycle<br />

implementation, in which all instructions take one clock cycle, to a pipelined<br />

implementation. The operation times for the major functional units in this<br />

example are 200 ps for memory access, 200 ps for ALU operation, <strong>and</strong> 100 ps<br />

for register file read or write. In the single-cycle model, every instruction takes<br />

exactly one clock cycle, so the clock cycle must be stretched to accommodate<br />

the slowest instruction.<br />

ANSWER

Figure 4.26 shows the time required for each of the eight instructions.

The single-cycle design must allow for the slowest instruction—in Figure<br />

4.26 it is lw—so the time required for every instruction is 800 ps. Similarly



to Figure 4.25, Figure 4.27 compares nonpipelined <strong>and</strong> pipelined execution<br />

of three load word instructions. Thus, the time between the first <strong>and</strong> fourth<br />

instructions in the nonpipelined design is 3 × 800 ps, or 2400 ps.

All the pipeline stages take a single clock cycle, so the clock cycle must be long<br />

enough to accommodate the slowest operation. Just as the single-cycle design<br />

must take the worst-case clock cycle of 800 ps, even though some instructions<br />

can be as fast as 500 ps, the pipelined execution clock cycle must have the<br />

worst-case clock cycle of 200 ps, even though some stages take only 100 ps.<br />

Pipelining still offers a fourfold performance improvement: the time between<br />

the first <strong>and</strong> fourth instructions is 3 × 200 ps or 600 ps.<br />

We can turn the pipelining speed-up discussion above into a formula. If the<br />

stages are perfectly balanced, then the time between instructions on the pipelined<br />

processor—assuming ideal conditions—is equal to<br />

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages

Under ideal conditions <strong>and</strong> with a large number of instructions, the speed-up<br />

from pipelining is approximately equal to the number of pipe stages; a five-stage<br />

pipeline is nearly five times faster.<br />

The formula suggests that a five-stage pipeline should offer nearly a fivefold<br />

improvement over the 800 ps nonpipelined time, or a 160 ps clock cycle. The<br />

example shows, however, that the stages may be imperfectly balanced. Moreover,<br />

pipelining involves some overhead, the source of which will be clearer shortly.<br />

Thus, the time per instruction in the pipelined processor will exceed the minimum<br />

possible, <strong>and</strong> speed-up will be less than the number of pipeline stages.<br />
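As a quick check of the formula with the numbers above (a worked example added here, not a quotation from the text): for a program of 1,000,000 load instructions, the nonpipelined design needs 1,000,000 × 800 ps = 800,000,000 ps, while the five-stage pipeline needs about (1,000,000 + 4) × 200 ps ≈ 200,000,800 ps. The speed-up is therefore about 4.0; it approaches 800 ps / 200 ps = 4 rather than the ideal 5 because the stages are not perfectly balanced.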

Instruction class | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time
Load word (lw) | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps | 800 ps
Store word (sw) | 200 ps | 100 ps | 200 ps | 200 ps | | 700 ps
R-format (add, sub, AND, OR, slt) | 200 ps | 100 ps | 200 ps | | 100 ps | 600 ps
Branch (beq) | 200 ps | 100 ps | 200 ps | | | 500 ps

FIGURE 4.26 Total time for each instruction calculated from the time for each component.<br />

This calculation assumes that the multiplexors, control unit, PC accesses, <strong>and</strong> sign extension unit have no<br />

delay.



Pipelining improves performance by increasing instruction throughput, as<br />

opposed to decreasing the execution time of an individual instruction, but instruction<br />

throughput is the important metric because real programs execute billions of<br />

instructions.<br />

<strong>Design</strong>ing Instruction Sets for Pipelining<br />

Even with this simple explanation of pipelining, we can get insight into the design<br />

of the MIPS instruction set, which was designed for pipelined execution.<br />

First, all MIPS instructions are the same length. This restriction makes it much<br />

easier to fetch instructions in the first pipeline stage <strong>and</strong> to decode them in the<br />

second stage. In an instruction set like the x86, where instructions vary from 1 byte<br />

to 15 bytes, pipelining is considerably more challenging. Recent implementations<br />

of the x86 architecture actually translate x86 instructions into simple operations<br />

that look like MIPS instructions <strong>and</strong> then pipeline the simple operations rather<br />

than the native x86 instructions! (See Section 4.10.)<br />

Second, MIPS has only a few instruction formats, with the source register fields<br />

being located in the same place in each instruction. This symmetry means that the<br />

second stage can begin reading the register file at the same time that the hardware<br />

is determining what type of instruction was fetched. If MIPS instruction formats<br />

were not symmetric, we would need to split stage 2, resulting in six pipeline stages.<br />

We will shortly see the downside of longer pipelines.<br />

Third, memory oper<strong>and</strong>s only appear in loads or stores in MIPS. This restriction<br />

means we can use the execute stage to calculate the memory address <strong>and</strong> then<br />

access memory in the following stage. If we could operate on the oper<strong>and</strong>s in<br />

memory, as in the x86, stages 3 <strong>and</strong> 4 would exp<strong>and</strong> to an address stage, memory<br />

stage, <strong>and</strong> then execute stage.<br />

Fourth, as discussed in Chapter 2, oper<strong>and</strong>s must be aligned in memory. Hence,<br />

we need not worry about a single data transfer instruction requiring two data<br />

memory accesses; the requested data can be transferred between processor <strong>and</strong><br />

memory in a single pipeline stage.<br />

Pipeline Hazards<br />

There are situations in pipelining when the next instruction cannot execute in the<br />

following clock cycle. These events are called hazards, <strong>and</strong> there are three different<br />

types.<br />

Structural Hazards

The first hazard is called a structural hazard. It means that the hardware cannot<br />

support the combination of instructions that we want to execute in the same clock<br />

cycle. A structural hazard in the laundry room would occur if we used a washer-dryer

combination instead of a separate washer <strong>and</strong> dryer, or if our roommate was<br />

busy doing something else <strong>and</strong> wouldn’t put clothes away. Our carefully scheduled<br />

pipeline plans would then be foiled.<br />

structural hazard When<br />

a planned instruction<br />

cannot execute in the<br />

proper clock cycle because<br />

the hardware does not<br />

support the combination<br />

of instructions that are set<br />

to execute.



As we said above, the MIPS instruction set was designed to be pipelined,<br />

making it fairly easy for designers to avoid structural hazards when designing a<br />

pipeline. Suppose, however, that we had a single memory instead of two memories.<br />

If the pipeline in Figure 4.27 had a fourth instruction, we would see that in the<br />

same clock cycle the first instruction is accessing data from memory while the<br />

fourth instruction is fetching an instruction from that same memory. Without two<br />

memories, our pipeline could have a structural hazard.<br />

data hazard Also<br />

called a pipeline data<br />

hazard. When a planned<br />

instruction cannot<br />

execute in the proper<br />

clock cycle because data<br />

that is needed to execute<br />

the instruction is not yet<br />

available.<br />

forwarding Also called<br />

bypassing. A method of<br />

resolving a data hazard<br />

by retrieving the missing<br />

data element from<br />

internal buffers rather<br />

than waiting for it to<br />

arrive from programmer-visible

registers or<br />

memory.<br />

Data Hazards<br />

Data hazards occur when the pipeline must be stalled because one step must wait<br />

for another to complete. Suppose you found a sock at the folding station for which<br />

no match existed. One possible strategy is to run down to your room <strong>and</strong> search<br />

through your clothes bureau to see if you can find the match. Obviously, while you<br />

are doing the search, loads must wait that have completed drying <strong>and</strong> are ready to<br />

fold as well as those that have finished washing <strong>and</strong> are ready to dry.<br />

In a computer pipeline, data hazards arise from the dependence of one<br />

instruction on an earlier one that is still in the pipeline (a relationship that does not<br />

really exist when doing laundry). For example, suppose we have an add instruction<br />

followed immediately by a subtract instruction that uses the sum ($s0):<br />

add $s0, $t0, $t1<br />

sub $t2, $s0, $t3<br />

Without intervention, a data hazard could severely stall the pipeline. The add<br />

instruction doesn’t write its result until the fifth stage, meaning that we would have<br />

to waste three clock cycles in the pipeline.<br />

Although we could try to rely on compilers to remove all such hazards, the<br />

results would not be satisfactory. These dependences happen just too often <strong>and</strong> the<br />

delay is just too long to expect the compiler to rescue us from this dilemma.<br />

The primary solution is based on the observation that we don’t need to wait for<br />

the instruction to complete before trying to resolve the data hazard. For the code<br />

sequence above, as soon as the ALU creates the sum for the add, we can supply it as<br />

an input for the subtract. Adding extra hardware to retrieve the missing item early<br />

from the internal resources is called forwarding or bypassing.<br />
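One way such hardware might detect when to forward is sketched below in C. This anticipates the detailed treatment in Section 4.7; the pipeline-register field and signal names here are invented for illustration, not the book's:

#include <stdint.h>

/* Hedged sketch: if the instruction now finishing its EX stage will write the register that the
   following instruction wants to read, steer the ALU input from that fresh result instead of the
   stale value read from the register file. */
struct ex_mem { int reg_write; uint32_t rd; uint32_t alu_result; };

uint32_t alu_input_a(struct ex_mem prior, uint32_t id_ex_rs, uint32_t reg_file_value)
{
    if (prior.reg_write && prior.rd != 0 && prior.rd == id_ex_rs)
        return prior.alu_result;   /* forward the just-computed sum (e.g., $s0 from the add) */
    return reg_file_value;         /* otherwise use the value read from the register file    */
}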

EXAMPLE<br />

Forwarding with Two Instructions<br />

For the two instructions above, show what pipeline stages would be connected<br />

by forwarding. Use the drawing in Figure 4.28 to represent the datapath during<br />

the five stages of the pipeline. Align a copy of the datapath for each instruction,<br />

similar to the laundry pipeline in Figure 4.25.



[Figure 4.28 shows add $s0, $t0, $t1 passing through the stages IF, ID, EX, MEM, and WB, one every 200 ps.]

FIGURE 4.28 Graphical representation of the instruction pipeline, similar in spirit to<br />

the laundry pipeline in Figure 4.25. Here we use symbols representing the physical resources with<br />

the abbreviations for pipeline stages used throughout the chapter. The symbols for the five stages: IF for<br />

the instruction fetch stage, with the box representing instruction memory; ID for the instruction decode/<br />

register file read stage, with the drawing showing the register file being read; EX for the execution stage,<br />

with the drawing representing the ALU; MEM for the memory access stage, with the box representing data<br />

memory; <strong>and</strong> WB for the write-back stage, with the drawing showing the register file being written. The<br />

shading indicates the element is used by the instruction. Hence, MEM has a white background because add<br />

does not access the data memory. Shading on the right half of the register file or memory means the element<br />

is read in that stage, <strong>and</strong> shading of the left half means it is written in that stage. Hence the right half of ID is<br />

shaded in the second stage because the register file is read, <strong>and</strong> the left half of WB is shaded in the fifth stage<br />

because the register file is written.<br />

ANSWER

Figure 4.29 shows the connection to forward the value in $s0 after the execution stage of the add instruction as input to the execution stage of the sub instruction.

In this graphical representation of events, forwarding paths are valid only if the<br />

destination stage is later in time than the source stage. For example, there cannot<br />

be a valid forwarding path from the output of the memory access stage in the first<br />

instruction to the input of the execution stage of the following, since that would<br />

mean going backward in time.<br />

Forwarding works very well <strong>and</strong> is described in detail in Section 4.7. It cannot<br />

prevent all pipeline stalls, however. For example, suppose the first instruction was a<br />

load of $s0 instead of an add. As we can imagine from looking at Figure 4.29, the<br />

[Figure 4.29 shows add $s0, $t0, $t1 and sub $t2, $s0, $t3 each passing through IF, ID, EX, MEM, and WB, offset by one clock cycle, with a forwarding path from the EX stage of add to the EX stage of sub.]

FIGURE 4.29 Graphical representation of forwarding. The connection shows the forwarding path<br />

from the output of the EX stage of add to the input of the EX stage for sub, replacing the value from register<br />

$s0 read in the second stage of sub.



Find the hazards in the preceding code segment <strong>and</strong> reorder the instructions<br />

to avoid any pipeline stalls.<br />

ANSWER

Both add instructions have a hazard because of their respective dependence on the immediately preceding lw instruction. Notice that bypassing eliminates several other potential hazards, including the dependence of the first add on the first lw and any hazards for store instructions. Moving up the third lw instruction to become the third instruction eliminates both hazards:

lw $t1, 0($t0)<br />

lw $t2, 4($t0)<br />

lw $t4, 8($t0)<br />

add $t3, $t1,$t2<br />

sw $t3, 12($t0)<br />

add $t5, $t1,$t4<br />

sw $t5, 16($t0)<br />

On a pipelined processor with forwarding, the reordered sequence will<br />

complete in two fewer cycles than the original version.<br />
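As a rough check on that claim, here is a back-of-the-envelope model (my own sketch, assuming each load-use dependence costs exactly one stall cycle even with forwarding):

    def pipeline_cycles(num_instructions, stalls, stages=5):
        # First instruction takes `stages` cycles; each later instruction
        # finishes one cycle after its predecessor; each stall adds a cycle.
        return stages + (num_instructions - 1) + stalls

    # Original order: each add immediately follows the lw that produces one of
    # its operands, so each add causes one load-use stall even with forwarding.
    print(pipeline_cycles(7, stalls=2))   # 13 cycles
    # Reordered: hoisting the third lw removes both load-use stalls.
    print(pipeline_cycles(7, stalls=0))   # 11 cycles, i.e., two fewer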

Forwarding yields another insight into the MIPS architecture, in addition to the<br />

four mentioned on page 277. Each MIPS instruction writes at most one result <strong>and</strong><br />

does this in the last stage of the pipeline. Forwarding is harder if there are multiple<br />

results to forward per instruction or if there is a need to write a result early on in<br />

instruction execution.<br />

Elaboration: The name “forwarding” comes from the idea that the result is passed forward from an earlier instruction to a later instruction. “Bypassing” comes from passing the result around the register file to the desired unit.

Control Hazards<br />

The third type of hazard is called a control hazard, arising from the need to make a<br />

decision based on the results of one instruction while others are executing.<br />

Suppose our laundry crew was given the happy task of cleaning the uniforms<br />

of a football team. Given how filthy the laundry is, we need to determine whether<br />

the detergent <strong>and</strong> water temperature setting we select is strong enough to get the<br />

uniforms clean but not so strong that the uniforms wear out sooner. In our laundry<br />

pipeline, we have to wait until after the second stage to examine the dry uniform to<br />

see if we need to change the washer setup or not. What to do?<br />

Here is the first of two solutions to control hazards in the laundry room <strong>and</strong> its<br />

computer equivalent.<br />

Stall: Just operate sequentially until the first batch is dry <strong>and</strong> then repeat until<br />

you have the right formula.<br />

This conservative option certainly works, but it is slow.<br />

control hazard Also called branch hazard. When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.


4.6 Pipelined Datapath and Control

[Figure 4.33: the single-cycle datapath divided into five sections labeled IF: instruction fetch, ID: instruction decode/register file read, EX: execute/address calculation, MEM: memory access, and WB: write back.]

FIGURE 4.33 The single-cycle datapath from Section 4.4 (similar to Figure 4.17). Each step of the instruction can be mapped onto the datapath from left to right. The only exceptions are the update of the PC and the write-back step, shown in color, which sends either the ALU result or the data from memory to the left to be written into the register file. (Normally we use color lines for control, but these are data lines.)

Instructions and data move generally from left to right through the five stages as they complete execution. Returning to our laundry analogy, clothes get cleaner, drier, and more organized as they move through the line, and they never move backward.

There are, however, two exceptions to this left-to-right flow of instructions:<br />

■ The write-back stage, which places the result back into the register file in the<br />

middle of the datapath<br />

■ The selection of the next value of the PC, choosing between the incremented<br />

PC <strong>and</strong> the branch address from the MEM stage<br />

Data flowing from right to left does not affect the current instruction; these<br />

reverse data movements influence only later instructions in the pipeline. Note that the first right-to-left flow of data can lead to data hazards and the second leads to control hazards.

One way to show what happens in pipelined execution is to pretend that each<br />

instruction has its own datapath, <strong>and</strong> then to place these datapaths on a timeline to<br />

show their relationship. Figure 4.34 shows the execution of the instructions in Figure<br />

4.27 by displaying their private datapaths on a common timeline. We use a stylized<br />

version of the datapath in Figure 4.33 to show the relationships in Figure 4.34.<br />

Figure 4.34 seems to suggest that three instructions need three datapaths.<br />

Instead, we add registers to hold data so that portions of a single datapath can be<br />

shared during instruction execution.<br />

For example, as Figure 4.34 shows, the instruction memory is used during<br />

only one of the five stages of an instruction, allowing it to be shared by following<br />

instructions during the other four stages. To retain the value of an individual<br />

instruction for its other four stages, the value read from instruction memory must<br />

be saved in a register. Similar arguments apply to every pipeline stage, so we must<br />

place registers wherever there are dividing lines between stages in Figure 4.33.<br />

Returning to our laundry analogy, we might have a basket between each pair of<br />

stages to hold the clothes for the next step.<br />

[Figure 4.34: lw $1, 100($0), lw $2, 200($0), and lw $3, 300($0), each flowing through IM, Reg, ALU, DM, and Reg across clock cycles CC 1 through CC 7.]

FIGURE 4.34 Instructions being executed using the single-cycle datapath in Figure 4.33, assuming pipelined execution. Similar to Figures 4.28 through 4.30, this figure pretends that each instruction has its own datapath, and shades each portion according to use. Unlike those figures, each stage is labeled by the physical resource used in that stage, corresponding to the portions of the datapath in Figure 4.33. IM represents the instruction memory and the PC in the instruction fetch stage, Reg stands for the register file and sign extender in the instruction decode/register file read stage (ID), and so on. To maintain proper time order, this stylized datapath breaks the register file into two logical parts: registers read during register fetch (ID) and registers written during write back (WB). This dual use is represented by drawing the unshaded left half of the register file using dashed lines in the ID stage, when it is not being written, and the unshaded right half in dashed lines in the WB stage, when it is not being read. As before, we assume the register file is written in the first half of the clock cycle and the register file is read during the second half.



Figure 4.35 shows the pipelined datapath with the pipeline registers highlighted.<br />

All instructions advance during each clock cycle from one pipeline register<br />

to the next. The registers are named for the two stages separated by that register.<br />

For example, the pipeline register between the IF <strong>and</strong> ID stages is called IF/ID.<br />

Notice that there is no pipeline register at the end of the write-back stage. All<br />

instructions must update some state in the processor—the register file, memory, or<br />

the PC—so a separate pipeline register is redundant to the state that is updated. For<br />

example, a load instruction will place its result in 1 of the 32 registers, <strong>and</strong> any later<br />

instruction that needs that data will simply read the appropriate register.<br />
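As a rough illustration of this bookkeeping (my own sketch, not part of the text), each pipeline register can be modeled as a slot whose contents are copied one position to the right on every clock edge:

    # Each pipeline register is modeled as a slot; on every clock edge the
    # contents advance one slot to the right.  Updating from the back of the
    # pipeline forward keeps each slot's old value from being overwritten
    # before it has been passed along.
    def clock(regs, fetched):
        regs["MEM/WB"] = regs["EX/MEM"]
        regs["EX/MEM"] = regs["ID/EX"]
        regs["ID/EX"] = regs["IF/ID"]
        regs["IF/ID"] = fetched

    regs = {"IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None}
    for instr in ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)", None, None]:
        clock(regs, instr)
        print(regs)   # each lw occupies a different register every cycle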

Of course, every instruction updates the PC, whether by incrementing it or by<br />

setting it to a branch destination address. The PC can be thought of as a pipeline<br />

register: one that feeds the IF stage of the pipeline. Unlike the shaded pipeline<br />

registers in Figure 4.35, however, the PC is part of the visible architectural state;<br />

its contents must be saved when an exception occurs, while the contents of the<br />

pipeline registers can be discarded. In the laundry analogy, you could think of the<br />

PC as corresponding to the basket that holds the load of dirty clothes before the<br />

wash step.<br />

To show how the pipelining works, throughout this chapter we show sequences<br />

of figures to demonstrate operation over time. These extra pages would seem to<br />

require much more time for you to underst<strong>and</strong>. Fear not; the sequences take much<br />

less time than it might appear, because you can compare them to see what changes occur in each clock cycle. Section 4.7 describes what happens when there are data hazards between pipelined instructions; ignore them for now.

[Figure 4.35: the datapath of Figure 4.33 with the four pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

FIGURE 4.35 The pipelined version of the datapath in Figure 4.33. The pipeline registers, in color, separate each pipeline stage. They are labeled by the stages that they separate; for example, the first is labeled IF/ID because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64 bits, respectively.

Figures 4.36 through 4.38, our first sequence, show the active portions of the<br />

datapath highlighted as a load instruction goes through the five stages of pipelined<br />

execution. We show a load first because it is active in all five stages. As in Figures<br />

4.28 through 4.30, we highlight the right half of registers or memory when they are<br />

being read <strong>and</strong> highlight the left half when they are being written.<br />

We show the instruction abbreviation lw with the name of the pipe stage that is<br />

active in each figure. The five stages are the following:<br />

1. Instruction fetch: The top portion of Figure 4.36 shows the instruction being<br />

read from memory using the address in the PC <strong>and</strong> then being placed in the<br />

IF/ID pipeline register. The PC address is incremented by 4 <strong>and</strong> then written<br />

back into the PC to be ready for the next clock cycle. This incremented<br />

address is also saved in the IF/ID pipeline register in case it is needed later<br />

for an instruction, such as beq. The computer cannot know which type of<br />

instruction is being fetched, so it must prepare for any instruction, passing<br />

potentially needed information down the pipeline.<br />

2. Instruction decode <strong>and</strong> register file read: The bottom portion of Figure 4.36<br />

shows the instruction portion of the IF/ID pipeline register supplying the<br />

16-bit immediate field, which is sign-extended to 32 bits, <strong>and</strong> the register<br />

numbers to read the two registers. All three values are stored in the ID/EX<br />

pipeline register, along with the incremented PC address. We again transfer<br />

everything that might be needed by any instruction during a later clock<br />

cycle.<br />

3. Execute or address calculation: Figure 4.37 shows that the load instruction<br />

reads the contents of register 1 <strong>and</strong> the sign-extended immediate from the<br />

ID/EX pipeline register <strong>and</strong> adds them using the ALU. That sum is placed in<br />

the EX/MEM pipeline register.<br />

4. Memory access: The top portion of Figure 4.38 shows the load instruction<br />

reading the data memory using the address from the EX/MEM pipeline<br />

register <strong>and</strong> loading the data into the MEM/WB pipeline register.<br />

5. Write-back: The bottom portion of Figure 4.38 shows the final step: reading<br />

the data from the MEM/WB pipeline register <strong>and</strong> writing it into the register<br />

file in the middle of the figure.<br />
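Summarizing the walk-through as a small sketch of my own (the field names below are informal labels, not signal names from the figures), a load leaves the following behind in each pipeline register:

    # What a load word leaves in each pipeline register, stage by stage.
    # (Informal labels only; Figure 4.41 later adds the write register number.)
    lw_pipeline_contents = {
        "IF/ID":  ["fetched instruction", "PC + 4"],
        "ID/EX":  ["read data 1", "read data 2", "sign-extended immediate", "PC + 4"],
        "EX/MEM": ["effective address (ALU result)"],
        "MEM/WB": ["data read from memory"],
    }
    for reg, contents in lw_pipeline_contents.items():
        print(f"{reg:7} holds {', '.join(contents)}")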

This walk-through of the load instruction shows that any information needed<br />

in a later pipe stage must be passed to that stage via a pipeline register. Walking<br />

through a store instruction shows the similarity of instruction execution, as well<br />

as passing the information for later stages. Here are the five pipe stages of the store<br />

instruction:


[Figure 4.36: the pipelined datapath drawn twice, labeled "lw, Instruction fetch" with the IF stage highlighted and "lw, Instruction decode" with the ID stage highlighted.]

FIGURE 4.36 IF and ID: First and second pipe stages of an instruction, with the active portions of the datapath in Figure 4.35 highlighted. The highlighting convention is the same as that used in Figure 4.28. As in Section 4.2, there is no confusion when reading and writing registers, because the contents change only on the clock edge. Although the load needs only the top register in stage 2, the processor doesn’t know what instruction is being decoded, so it sign-extends the 16-bit constant and reads both registers into the ID/EX pipeline register. We don’t need all three operands, but it simplifies control to keep all three.


[Figure 4.37: the pipelined datapath labeled "lw, Execution" with the EX-stage hardware highlighted.]

FIGURE 4.37 EX: The third pipe stage of a load instruction, highlighting the portions of the datapath in Figure 4.35 used in this pipe stage. The register is added to the sign-extended immediate, and the sum is placed in the EX/MEM pipeline register.

1. Instruction fetch: The instruction is read from memory using the address<br />

in the PC <strong>and</strong> then is placed in the IF/ID pipeline register. This stage occurs<br />

before the instruction is identified, so the top portion of Figure 4.36 works<br />

for store as well as load.<br />

2. Instruction decode <strong>and</strong> register file read: The instruction in the IF/ID pipeline<br />

register supplies the register numbers for reading two registers <strong>and</strong> extends<br />

the sign of the 16-bit immediate. These three 32-bit values are all stored<br />

in the ID/EX pipeline register. The bottom portion of Figure 4.36 for load<br />

instructions also shows the operations of the second stage for stores. These<br />

first two stages are executed by all instructions, since it is too early to know<br />

the type of the instruction.<br />

3. Execute <strong>and</strong> address calculation: Figure 4.39 shows the third step; the<br />

effective address is placed in the EX/MEM pipeline register.<br />

4. Memory access: The top portion of Figure 4.40 shows the data being written<br />

to memory. Note that the register containing the data to be stored was read in<br />

an earlier stage <strong>and</strong> stored in ID/EX. The only way to make the data available<br />

during the MEM stage is to place the data into the EX/MEM pipeline register<br />

in the EX stage, just as we stored the effective address into EX/MEM.


[Figure 4.38: the pipelined datapath drawn twice, labeled "lw, Memory" with the MEM stage highlighted and "lw, Write-back" with the WB stage highlighted.]

FIGURE 4.38 MEM and WB: The fourth and fifth pipe stages of a load instruction, highlighting the portions of the datapath in Figure 4.35 used in this pipe stage. Data memory is read using the address in the EX/MEM pipeline registers, and the data is placed in the MEM/WB pipeline register. Next, data is read from the MEM/WB pipeline register and written into the register file in the middle of the datapath. Note: there is a bug in this design that is repaired in Figure 4.41.


[Figure 4.39: the pipelined datapath labeled "sw, Execution" with the EX-stage hardware highlighted.]

FIGURE 4.39 EX: The third pipe stage of a store instruction. Unlike the third stage of the load instruction in Figure 4.37, the second register value is loaded into the EX/MEM pipeline register to be used in the next stage. Although it wouldn’t hurt to always write this second register into the EX/MEM pipeline register, we write the second register only on a store instruction to make the pipeline easier to understand.

5. Write-back: The bottom portion of Figure 4.40 shows the final step of the<br />

store. For this instruction, nothing happens in the write-back stage. Since<br />

every instruction behind the store is already in progress, we have no way<br />

to accelerate those instructions. Hence, an instruction passes through a<br />

stage even if there is nothing to do, because later instructions are already<br />

progressing at the maximum rate.<br />

The store instruction again illustrates that to pass something from an early pipe<br />

stage to a later pipe stage, the information must be placed in a pipeline register;<br />

otherwise, the information is lost when the next instruction enters that pipeline<br />

stage. For the store instruction we needed to pass one of the registers read in the<br />

ID stage to the MEM stage, where it is stored in memory. The data was first placed<br />

in the ID/EX pipeline register <strong>and</strong> then passed to the EX/MEM pipeline register.<br />

Load <strong>and</strong> store illustrate a second key point: each logical component of the<br />

datapath—such as instruction memory, register read ports, ALU, data memory,<br />

<strong>and</strong> register write port—can be used only within a single pipeline stage. Otherwise,<br />

we would have a structural hazard (see page 277). Hence these components, <strong>and</strong><br />

their control, can be associated with a single pipeline stage.<br />

[Figure 4.40: the pipelined datapath drawn twice, labeled "sw, Memory" and "sw, Write-back", with the active portions highlighted.]

FIGURE 4.40 MEM and WB: The fourth and fifth pipe stages of a store instruction. In the fourth stage, the data is written into data memory for the store. Note that the data comes from the EX/MEM pipeline register and that nothing is changed in the MEM/WB pipeline register. Once the data is written in memory, there is nothing left for the store instruction to do, so nothing happens in stage 5.

Now we can uncover a bug in the design of the load instruction. Did you see it? Which register is changed in the final stage of the load? More specifically, which

instruction supplies the write register number? The instruction in the IF/ID pipeline<br />

register supplies the write register number, yet this instruction occurs considerably<br />

after the load instruction!<br />

Hence, we need to preserve the destination register number in the load<br />

instruction. Just as store passed the register contents from the ID/EX to the EX/<br />

MEM pipeline registers for use in the MEM stage, load must pass the register<br />

number from the ID/EX through EX/MEM to the MEM/WB pipeline register for<br />

use in the WB stage. Another way to think about the passing of the register number<br />

is that to share the pipelined datapath, we need to preserve the instruction read<br />

during the IF stage, so each pipeline register contains a portion of the instruction<br />

needed for that stage <strong>and</strong> later stages.<br />
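As a toy illustration of that fix (mine, not the book's), the destination register number is simply copied forward one pipeline register per clock so that it reaches WB together with the loaded data:

    # lw $10, 20($1): the rt field (register 10) names the write register.
    write_reg = {"ID/EX": None, "EX/MEM": None, "MEM/WB": None}

    write_reg["ID/EX"] = 10                      # latched during ID
    write_reg["EX/MEM"] = write_reg["ID/EX"]     # copied at the end of EX
    write_reg["MEM/WB"] = write_reg["EX/MEM"]    # copied at the end of MEM
    # During WB the register file write port is driven by MEM/WB, so the loaded
    # data goes into register $10, not into whatever register the instruction
    # currently sitting in IF/ID happens to name.
    print(write_reg["MEM/WB"])                   # 10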

Figure 4.41 shows the correct version of the datapath, passing the write register<br />

number first to the ID/EX register, then to the EX/MEM register, <strong>and</strong> finally to the<br />

MEM/WB register. The register number is used during the WB stage to specify<br />

the register to be written. Figure 4.42 is a single drawing of the corrected datapath,<br />

highlighting the hardware used in all five stages of the load word instruction in<br />

Figures 4.36 through 4.38. See Section 4.8 for an explanation of how to make the<br />

branch instruction work as expected.<br />

[Figure 4.41: the pipelined datapath with a new path, shown in color, carrying the write register number through ID/EX, EX/MEM, and MEM/WB back to the register file's Write register input.]

FIGURE 4.41 The corrected pipelined datapath to handle the load instruction properly. The write register number now comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/WB pipeline register, adding five more bits to the last three pipeline registers. This new path is shown in color.

Graphically Representing Pipelines

Pipelining can be difficult to understand, since many instructions are simultaneously executing in a single datapath in every clock cycle. To aid understanding, there are two basic styles of pipeline figures: multiple-clock-cycle pipeline diagrams, such as Figure 4.34 on page 288, and single-clock-cycle pipeline diagrams, such as Figures 4.36 through 4.40. The multiple-clock-cycle diagrams are simpler but do not contain all the details. For example, consider the following five-instruction sequence:

lw $10, 20($1)<br />

sub $11, $2, $3<br />

add $12, $3, $4<br />

lw $13, 24($1)<br />

add $14, $5, $6<br />

Figure 4.43 shows the multiple-clock-cycle pipeline diagram for these<br />

instructions. Time advances from left to right across the page in these diagrams,<br />

<strong>and</strong> instructions advance from the top to the bottom of the page, similar to the<br />

laundry pipeline in Figure 4.25. A representation of the pipeline stages is placed<br />

in each portion along the instruction axis, occupying the proper clock cycles.<br />

These stylized datapaths represent the five stages of our pipeline graphically, but<br />

a rectangle naming each pipe stage works just as well. Figure 4.44 shows the more<br />

traditional version of the multiple-clock-cycle pipeline diagram. Note that Figure<br />

4.43 shows the physical resources used at each stage, while Figure 4.44 uses the<br />

name of each stage.<br />
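For readers who want to experiment, the following sketch (an illustration of the diagram style, not anything from the text) prints a Figure 4.44-style multiple-clock-cycle diagram for the five instructions above, assuming no stalls:

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]
    program = ["lw  $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4",
               "lw  $13, 24($1)", "add $14, $5, $6"]

    cycles = len(program) + len(STAGES) - 1            # 9 clock cycles in all
    print("instruction".ljust(18) + "".join(f"CC{c:<3}" for c in range(1, cycles + 1)))
    for i, instr in enumerate(program):
        row = [" " * 5] * cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:<5}"                 # instruction i is in stage s during cycle i + s + 1
        print(instr.ljust(18) + "".join(row))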

Single-clock-cycle pipeline diagrams show the state of the entire datapath during<br />

a single clock cycle, <strong>and</strong> usually all five instructions in the pipeline are identified by<br />

labels above their respective pipeline stages. We use this type of figure to show the<br />

details of what is happening within the pipeline during each clock cycle; typically, the drawings appear in groups to show pipeline operation over a sequence of clock cycles.

FIGURE 4.42 The portion of the datapath in Figure 4.41 that is used in all five stages of a load instruction.

[Figure 4.44: lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1); and add $14, $5, $6 charted across clock cycles CC 1 through CC 9, with each stage named Instruction fetch, Instruction decode, Execution, Data access, or Write-back.]

FIGURE 4.44 Traditional multiple-clock-cycle pipeline diagram of five instructions in Figure 4.43.

[Figure 4.45: the pipelined datapath during clock cycle 5, with add $14, $5, $6 in instruction fetch, lw $13, 24($1) in instruction decode, add $12, $3, $4 in execution, sub $11, $2, $3 in memory, and lw $10, 20($1) in write-back.]

FIGURE 4.45 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in Figures 4.43 and 4.44. As you can see, a single-clock-cycle figure is a vertical slice through a multiple-clock-cycle diagram.

1. Allowing jumps, branches, <strong>and</strong> ALU instructions to take fewer stages than<br />

the five required by the load instruction will increase pipeline performance<br />

under all circumstances.



2. Trying to allow some instructions to take fewer cycles does not help, since<br />

the throughput is determined by the clock cycle; the number of pipe stages<br />

per instruction affects latency, not throughput.<br />

3. You cannot make ALU instructions take fewer cycles because of the writeback<br />

of the result, but branches <strong>and</strong> jumps can take fewer cycles, so there is<br />

some opportunity for improvement.<br />

4. Instead of trying to make instructions take fewer cycles, we should explore<br />

making the pipeline longer, so that instructions take more cycles, but the<br />

cycles are shorter. This could improve performance.<br />

In the 6600 Computer, perhaps even more than in any previous computer, the control system is the difference.
James Thornton, Design of a Computer: The Control Data 6600, 1970

Pipelined Control<br />

Just as we added control to the single-cycle datapath in Section 4.3, we now add<br />

control to the pipelined datapath. We start with a simple design that views the<br />

problem through rose-colored glasses.<br />

The first step is to label the control lines on the existing datapath. Figure 4.46<br />

shows those lines. We borrow as much as we can from the control for the simple<br />

datapath in Figure 4.17. In particular, we use the same ALU control logic, branch<br />

logic, destination-register-number multiplexor, <strong>and</strong> control lines. These functions<br />

are defined in Figures 4.12, 4.16, <strong>and</strong> 4.18. We reproduce the key information in<br />

Figures 4.47 through 4.49 on a single page to make the following discussion easier<br />

to follow.<br />

As was the case for the single-cycle implementation, we assume that the PC is<br />

written on each clock cycle, so there is no separate write signal for the PC. By the<br />

same argument, there are no separate write signals for the pipeline registers (IF/<br />

ID, ID/EX, EX/MEM, <strong>and</strong> MEM/WB), since the pipeline registers are also written<br />

during each clock cycle.<br />

To specify control for the pipeline, we need only set the control values during<br />

each pipeline stage. Because each control line is associated with a component active<br />

in only a single pipeline stage, we can divide the control lines into five groups<br />

according to the pipeline stage.<br />

1. Instruction fetch: The control signals to read instruction memory <strong>and</strong> to<br />

write the PC are always asserted, so there is nothing special to control in this<br />

pipeline stage.<br />

2. Instruction decode/register file read: As in the previous stage, the same thing<br />

happens at every clock cycle, so there are no optional control lines to set.<br />

3. Execution/address calculation: The signals to be set are RegDst, ALUOp,<br />

<strong>and</strong> ALUSrc (see Figures 4.47 <strong>and</strong> 4.48). The signals select the Result register,<br />

the ALU operation, <strong>and</strong> either Read data 2 or a sign-extended immediate<br />

for the ALU.


[Figure 4.46: the pipelined datapath of Figure 4.41 with the control lines labeled: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, and MemtoReg, plus the ALU control block fed by instruction bits 15–0.]

FIGURE 4.46 The pipelined datapath of Figure 4.41 with the control signals identified. This datapath borrows the control logic for PC source, register destination number, and ALU control from Section 4.4. Note that we now need the 6-bit funct field (function code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register. Recall that these 6 bits are also the 6 least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can supply them from the immediate field since sign extension leaves these bits unchanged.

Instruction opcode | ALUOp | Instruction operation | Function code | Desired ALU action | ALU control input
LW                 | 00    | load word             | XXXXXX        | add                | 0010
SW                 | 00    | store word            | XXXXXX        | add                | 0010
Branch equal       | 01    | branch equal          | XXXXXX        | subtract           | 0110
R-type             | 10    | add                   | 100000        | add                | 0010
R-type             | 10    | subtract              | 100010        | subtract           | 0110
R-type             | 10    | AND                   | 100100        | AND                | 0000
R-type             | 10    | OR                    | 100101        | OR                 | 0001
R-type             | 10    | set on less than      | 101010        | set on less than   | 0111

FIGURE 4.47 A copy of Figure 4.12. This figure shows how the ALU control bits are set depending on the ALUOp control bits and the different function codes for the R-type instruction.


Signal name | Effect when deasserted (0) | Effect when asserted (1)
RegDst   | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.
ALUSrc   | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc    | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.
MemRead  | None. | Data memory contents designated by the address input are put on the Read data output.
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.

FIGURE 4.48 A copy of Figure 4.16. The function of each of seven control signals is defined. The ALU control lines (ALUOp) are defined in the second column of Figure 4.47. When a 1-bit control to a 2-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Note that PCSrc is controlled by an AND gate in Figure 4.46. If the Branch signal and the ALU Zero signal are both set, then PCSrc is 1; otherwise, it is 0. Control sets the Branch signal only during a beq instruction; otherwise, PCSrc is set to 0.

Instruction | Execution/address calculation stage control lines | Memory access stage control lines | Write-back stage control lines
            | RegDst  ALUOp1  ALUOp0  ALUSrc                     | Branch  MemRead  MemWrite          | RegWrite  MemtoReg
R-format    |   1       1       0       0                        |   0       0        0               |    1         0
lw          |   0       0       0       1                        |   0       1        0               |    1         1
sw          |   X       0       0       1                        |   0       0        1               |    0         X
beq         |   X       0       1       0                        |   1       0        0               |    0         X

FIGURE 4.49 The values of the control lines are the same as in Figure 4.18, but they have been shuffled into three groups corresponding to the last three pipeline stages.

4. Memory access: The control lines set in this stage are Branch, MemRead, <strong>and</strong><br />

MemWrite. The branch equal, load, <strong>and</strong> store instructions set these signals,<br />

respectively. Recall that PCSrc in Figure 4.48 selects the next sequential<br />

address unless control asserts Branch <strong>and</strong> the ALU result was 0.<br />

5. Write-back: The two control lines are MemtoReg, which decides between sending the ALU result or the memory value to the register file, and RegWrite, which writes the chosen value.

Since pipelining the datapath leaves the meaning of the control lines unchanged,<br />

we can use the same control values. Figure 4.49 has the same values as in Section<br />

4.4, but now the nine control lines are grouped by pipeline stage.
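One informal way to picture this grouping (a sketch of my own, not code from the book) is as a per-opcode control word whose three fields ride along in the pipeline registers, each stage consuming its own field; Figure 4.51 below shows the hardware version of the same idea.

    # Control values from Figure 4.49, grouped by the stage that uses them
    # ('X' marks a don't-care).
    CONTROL = {
        "R-format": {"EX": {"RegDst": 1,  "ALUOp": "10", "ALUSrc": 0},
                     "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                     "WB": {"RegWrite": 1, "MemtoReg": 0}},
        "lw":  {"EX": {"RegDst": 0,  "ALUOp": "00", "ALUSrc": 1},
                "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                "WB": {"RegWrite": 1, "MemtoReg": 1}},
        "sw":  {"EX": {"RegDst": "X", "ALUOp": "00", "ALUSrc": 1},
                "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                "WB": {"RegWrite": 0, "MemtoReg": "X"}},
        "beq": {"EX": {"RegDst": "X", "ALUOp": "01", "ALUSrc": 0},
                "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                "WB": {"RegWrite": 0, "MemtoReg": "X"}},
    }

    # The whole word is created during ID; each later stage consumes its own
    # group and hands the remainder to the next pipeline register.
    id_ex = dict(CONTROL["lw"])                        # EX, MEM, and WB groups
    ex_mem = {k: id_ex[k] for k in ("MEM", "WB")}      # EX group used up in EX
    mem_wb = {k: ex_mem[k] for k in ("WB",)}           # MEM group used up in MEM
    print(mem_wb)                                      # {'WB': {'RegWrite': 1, 'MemtoReg': 1}}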


[Figure 4.51: the datapath of Figure 4.46 with a Control block in the ID stage whose EX, M, and WB groups travel through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

FIGURE 4.51 The pipelined datapath of Figure 4.46, with the control signals connected to the control portions of the pipeline registers. The control values for the last three stages are created during the instruction decode stage and then placed in the ID/EX pipeline register. The control lines for each pipe stage are used, and remaining control lines are then passed to the next pipeline stage.

4.7 Data Hazards: Forwarding versus Stalling

Let’s look at a sequence with many dependences, shown in color:<br />

sub $2, $1,$3 # Register $2 written by sub<br />

<strong>and</strong> $12,$2,$5 # 1st oper<strong>and</strong>($2) depends on sub<br />

or $13,$6,$2 # 2nd oper<strong>and</strong>($2) depends on sub<br />

add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub<br />

sw $15,100($2) # Base ($2) depends on sub<br />

The last four instructions are all dependent on the result in register $2 of the<br />

first instruction. If register $2 had the value 10 before the subtract instruction <strong>and</strong><br />

−20 afterwards, the programmer intends that −20 will be used in the following<br />

instructions that refer to register $2.



How would this sequence perform with our pipeline? Figure 4.52 illustrates the<br />

execution of these instructions using a multiple-clock-cycle pipeline representation.<br />

To demonstrate the execution of this instruction sequence in our current pipeline,<br />

the top of Figure 4.52 shows the value of register $2, which changes during the<br />

middle of clock cycle 5, when the sub instruction writes its result.<br />

The last potential hazard can be resolved by the design of the register file<br />

hardware: What happens when a register is read <strong>and</strong> written in the same clock<br />

cycle? We assume that the write is in the first half of the clock cycle <strong>and</strong> the read<br />

is in the second half, so the read delivers what is written. As is the case for many<br />

implementations of register files, we have no data hazard in this case.<br />
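A minimal way to picture that register-file convention (my own sketch, not from the text) is to treat each clock cycle as two half-cycles, doing the write in the first half so that a read in the second half of the same cycle returns the new value:

    def register_file_cycle(regs, write=None, read=None):
        # First half of the clock cycle: perform any pending write.
        if write is not None:
            number, value = write
            regs[number] = value
        # Second half: a read returns the value written earlier in this cycle.
        if read is not None:
            return regs[read]

    regs = {2: 10}
    print(register_file_cycle(regs, write=(2, -20), read=2))   # -20, not 10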

[Figure 4.52: the value of register $2 shown as 10 through clock cycle 4, 10/−20 in cycle 5, and −20 afterward, while sub $2, $1, $3 and the four dependent instructions flow through IM, Reg, DM, and Reg.]

FIGURE 4.52 Pipelined dependences in a five-instruction sequence using simplified datapaths to show the dependences. All the dependent actions are shown in color, and “CC 1” at the top of the figure means clock cycle 1. The first instruction writes into $2, and all the following instructions read $2. This register is written in clock cycle 5, so the proper value is unavailable before clock cycle 5. (A read of a register during a clock cycle returns the value written at the end of the first half of the cycle, when such a write occurs.) The colored lines from the top datapath to the lower ones show the dependences. Those that must go backward in time are pipeline data hazards.

Figure 4.52 shows that the values read for register $2 would not be the result of the sub instruction unless the read occurred during clock cycle 5 or later. Thus, the instructions that would get the correct value of −20 are add and sw; the AND and OR instructions would get the incorrect value 10! Using this style of drawing, such problems become apparent when a dependence line goes backward in time.

As mentioned in Section 4.5, the desired result is available at the end of the<br />

EX stage or clock cycle 3. When is the data actually needed by the AND <strong>and</strong> OR<br />

instructions? At the beginning of the EX stage, or clock cycles 4 <strong>and</strong> 5, respectively.<br />

Thus, we can execute this segment without stalls if we simply forward the data as<br />

soon as it is available to any units that need it before it is available to read from the<br />

register file.<br />

How does forwarding work? For simplicity in the rest of this section, we consider<br />

only the challenge of forwarding to an operation in the EX stage, which may be<br />

either an ALU operation or an effective address calculation. This means that when<br />

an instruction tries to use a register in its EX stage that an earlier instruction<br />

intends to write in its WB stage, we actually need the values as inputs to the ALU.<br />

A notation that names the fields of the pipeline registers allows for a more<br />

precise notation of dependences. For example, “ID/EX.RegisterRs” refers to the<br />

number of one register whose value is found in the pipeline register ID/EX; that is,<br />

the one from the first read port of the register file. The first part of the name, to the<br />

left of the period, is the name of the pipeline register; the second part is the name of<br />

the field in that register. Using this notation, the two pairs of hazard conditions are<br />

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs<br />

1b. EX/MEM.RegisterRd = ID/EX.RegisterRt<br />

2a. MEM/WB.RegisterRd = ID/EX.RegisterRs<br />

2b. MEM/WB.RegisterRd = ID/EX.RegisterRt<br />

The first hazard in the sequence on page 304 is on register $2, between the<br />

result of sub $2,$1,$3 <strong>and</strong> the first read oper<strong>and</strong> of <strong>and</strong> $12,$2,$5. This<br />

hazard can be detected when the <strong>and</strong> instruction is in the EX stage <strong>and</strong> the prior<br />

instruction is in the MEM stage, so this is hazard 1a:<br />

EX/MEM.RegisterRd = ID/EX.RegisterRs = $2<br />

EXAMPLE<br />

Dependence Detection<br />

Classify the dependences in this sequence from page 304:<br />

sub $2, $1, $3 # Register $2 set by sub<br />

<strong>and</strong> $12, $2, $5 # 1st oper<strong>and</strong>($2) set by sub<br />

or $13, $6, $2 # 2nd oper<strong>and</strong>($2) set by sub<br />

add $14, $2, $2 # 1st($2) & 2nd($2) set by sub<br />

sw $15, 100($2) # Index($2) set by sub


ANSWER

As mentioned above, the sub-and is a type 1a hazard. The remaining hazards are as follows:

■ The sub-or is a type 2b hazard:

MEM/WB.RegisterRd = ID/EX.RegisterRt = $2

■ The two dependences on sub-add are not hazards because the register<br />

file supplies the proper data during the ID stage of add.<br />

■ There is no data hazard between sub <strong>and</strong> sw because sw reads $2 the<br />

clock cycle after sub writes $2.<br />
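As an illustration only (a small sketch under the simplifying assumptions of this section, not code from the text), the four conditions 1a through 2b can be checked mechanically once we know which register numbers sit in EX/MEM, MEM/WB, and ID/EX:

    # Register numbers in the pipeline at the moment 'and $12, $2, $5' is in EX
    # and 'sub $2, $1, $3' is in MEM (nothing of interest is in WB).
    ex_mem_rd = 2                 # destination of sub
    mem_wb_rd = None              # destination of whatever instruction is in WB
    id_ex_rs, id_ex_rt = 2, 5     # source registers of the and instruction

    def hazards(ex_mem_rd, mem_wb_rd, id_ex_rs, id_ex_rt):
        found = []
        if ex_mem_rd == id_ex_rs: found.append("1a")   # EX/MEM.RegisterRd = ID/EX.RegisterRs
        if ex_mem_rd == id_ex_rt: found.append("1b")   # EX/MEM.RegisterRd = ID/EX.RegisterRt
        if mem_wb_rd == id_ex_rs: found.append("2a")   # MEM/WB.RegisterRd = ID/EX.RegisterRs
        if mem_wb_rd == id_ex_rt: found.append("2b")   # MEM/WB.RegisterRd = ID/EX.RegisterRt
        return found

    print(hazards(ex_mem_rd, mem_wb_rd, id_ex_rs, id_ex_rt))   # ['1a']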

Because some instructions do not write registers, this policy is inaccurate;<br />

sometimes it would forward when it shouldn’t. One solution is simply to check<br />

to see if the RegWrite signal will be active: examining the WB control field of the<br />

pipeline register during the EX <strong>and</strong> MEM stages determines whether RegWrite<br />

is asserted. Recall that MIPS requires that every use of $0 as an oper<strong>and</strong> must<br />

yield an oper<strong>and</strong> value of 0. In the event that an instruction in the pipeline has<br />

$0 as its destination (for example, sll $0, $1, 2), we want to avoid forwarding<br />

its possibly nonzero result value. Not forwarding results destined for $0 frees the<br />

assembly programmer <strong>and</strong> the compiler of any requirement to avoid using $0 as<br />

a destination. The conditions above thus work properly as long as we add EX/MEM.RegisterRd ≠ 0 to the first hazard condition and MEM/WB.RegisterRd ≠ 0 to the second.

Now that we can detect hazards, half of the problem is resolved—but we must<br />

still forward the proper data.<br />

Figure 4.53 shows the dependences between the pipeline registers <strong>and</strong> the inputs<br />

to the ALU for the same code sequence as in Figure 4.52. The change is that the<br />

dependence begins from a pipeline register, rather than waiting for the WB stage to<br />

write the register file. Thus, the required data exists in time for later instructions,<br />

with the pipeline registers holding the data to be forwarded.<br />

If we can take the inputs to the ALU from any pipeline register rather than just<br />

ID/EX, then we can forward the proper data. By adding multiplexors to the input<br />

of the ALU, <strong>and</strong> with the proper controls, we can run the pipeline at full speed in<br />

the presence of these data dependences.<br />

For now, we will assume the only instructions we need to forward are the four<br />

R-format instructions: add, sub, AND, <strong>and</strong> OR. Figure 4.54 shows a close-up of<br />

the ALU <strong>and</strong> pipeline register before <strong>and</strong> after adding forwarding. Figure 4.55<br />

shows the values of the control lines for the ALU multiplexors that select either the<br />

register file values or one of the forwarded values.<br />

This forwarding control will be in the EX stage, because the ALU forwarding<br />

multiplexors are found in that stage. Thus, we must pass the oper<strong>and</strong> register<br />

numbers from the ID stage via the ID/EX pipeline register to determine whether<br />

to forward values. We already have the rt field (bits 20–16). Before forwarding, the<br />

ID/EX register had no need to include space to hold the rs field. Hence, rs (bits<br />

25–21) is added to ID/EX.
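For concreteness, here is a hypothetical helper of my own (not part of the design in the figures) showing where those bit positions come from when the register-number fields are pulled out of a 32-bit MIPS instruction word:

    # Hypothetical helper: pulling the register-number fields out of a 32-bit
    # MIPS instruction word so they can be latched into ID/EX.
    def fields(instruction):
        return {
            "rs":  (instruction >> 21) & 0x1F,   # bits 25-21
            "rt":  (instruction >> 16) & 0x1F,   # bits 20-16
            "rd":  (instruction >> 11) & 0x1F,   # bits 15-11 (R-format)
            "imm": instruction & 0xFFFF,         # bits 15-0  (I-format)
        }

    print(fields(0x00231022))   # sub $2, $1, $3 -> rs=1, rt=3, rd=2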


[Figure 4.54: part (a), no forwarding, shows ID/EX, EX/MEM, and MEM/WB feeding the ALU and data memory directly; part (b), with forwarding, adds ForwardA and ForwardB multiplexors on the ALU inputs and a forwarding unit driven by Rs, Rt, Rd, EX/MEM.RegisterRd, and MEM/WB.RegisterRd.]

FIGURE 4.54 On the top are the ALU and pipeline registers before adding forwarding. On the bottom, the multiplexors have been expanded to add the forwarding paths, and we show the forwarding unit. The new hardware is shown in color. This figure is a stylized drawing, however, leaving out details from the full datapath such as the sign extension hardware. Note that the ID/EX.RegisterRt field is shown twice, once to connect to the Mux and once to the forwarding unit, but it is a single signal. As in the earlier discussion, this ignores forwarding of a store value to a store instruction. Also note that this mechanism works for slt instructions as well.


Mux control   | Source | Explanation
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result.

FIGURE 4.55 The control values for the forwarding multiplexors in Figure 4.54. The signed immediate that is another input to the ALU is described in the Elaboration at the end of this section.

Note that the EX/MEM.RegisterRd field is the register destination for either an ALU instruction (which comes from the Rd field of the instruction) or a load (which comes from the Rt field).

1. EX hazard:

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

This case forwards the result from the previous instruction to either input of the ALU. If the previous instruction is going to write to the register file, and the write register number matches the read register number of ALU inputs A or B, provided it is not register 0, then steer the multiplexor to pick the value instead from the pipeline register EX/MEM.

2. MEM hazard:

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

As mentioned above, there is no hazard in the WB stage, because we assume that the register file supplies the correct result if the instruction in the ID stage reads the same register written by the instruction in the WB stage. Such a register file performs another form of forwarding, but it occurs within the register file.

One complication is potential data hazards between the result of the instruction in the WB stage, the result of the instruction in the MEM stage, and the source operand of the instruction in the ALU stage. For example, when summing a vector of numbers in a single register, a sequence of instructions will all read and write to the same register:

   add $1, $1, $2
   add $1, $1, $3
   add $1, $1, $4
   . . .



In this case, the result is forwarded from the MEM stage because the result in the MEM stage is the more recent result. Thus, the control for the MEM hazard would be (the additions are the not(...) terms, which give the EX hazard priority):

   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd ≠ 0)
       and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
       and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd ≠ 0)
       and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
       and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
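The following C sketch models the complete ForwardA/ForwardB decision just described, with the EX hazard taking priority over the MEM hazard. It is a software model under assumed structure and names (wb_info, forward_src), not the book's hardware.

    #include <stdbool.h>
    #include <stdint.h>

    enum forward_src { FROM_ID_EX = 0, FROM_MEM_WB = 1, FROM_EX_MEM = 2 };  /* 00, 01, 10 */

    struct wb_info { bool reg_write; uint8_t rd; };   /* write-back info held in EX/MEM or MEM/WB */

    /* Returns the multiplexor setting for one ALU operand whose source register number is src. */
    static enum forward_src forward(uint8_t src, struct wb_info ex_mem, struct wb_info mem_wb)
    {
        /* EX hazard: the instruction one ahead writes src, and src is not register 0. */
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == src)
            return FROM_EX_MEM;                 /* most recent result wins */
        /* MEM hazard: taken only when the EX/MEM stage is not already forwarding src. */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == src)
            return FROM_MEM_WB;
        return FROM_ID_EX;                      /* no hazard: use the register-file value */
    }
    /* ForwardA = forward(ID/EX.RegisterRs, ...); ForwardB = forward(ID/EX.RegisterRt, ...). */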

Figure 4.56 shows the hardware necessary to support forwarding for operations that use results during the EX stage.

[Figure 4.56 drawing: the pipelined datapath with the forwarding multiplexors at the ALU inputs and the forwarding unit comparing EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs and Rt fields carried in ID/EX]

FIGURE 4.56 The datapath modified to resolve hazards via forwarding. Compared with the datapath in Figure 4.51, the additions are the multiplexors to the inputs to the ALU. This figure is a more stylized drawing, however, leaving out details from the full datapath, such as the branch hardware and the sign extension hardware.



nop: An instruction that does no operation to change state.

The hazard detection unit operates during the ID stage so that it can insert the stall between the load and its use. Checking for load instructions, the control for the hazard detection unit is this single condition:

   if (ID/EX.MemRead and
       ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
        (ID/EX.RegisterRt = IF/ID.RegisterRt)))
          stall the pipeline

The first line tests to see if the instruction is a load: the only instruction that reads data memory is a load. The next two lines check to see if the destination register field of the load in the EX stage matches either source register of the instruction in the ID stage. If the condition holds, the instruction stalls one clock cycle. After this 1-cycle stall, the forwarding logic can handle the dependence and execution proceeds. (If there were no forwarding, then the instructions in Figure 4.58 would need another stall cycle.)
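A small C model of that condition follows; the struct and field names are assumptions made for illustration. It returns true exactly when a load in EX is about to be used by the instruction in ID.

    #include <stdbool.h>
    #include <stdint.h>

    struct id_ex_stage { bool mem_read; uint8_t rt; };   /* load indicator and its destination      */
    struct if_id_stage { uint8_t rs, rt; };               /* source fields of the instruction in ID */

    static bool must_stall(struct id_ex_stage id_ex, struct if_id_stage if_id)
    {
        return id_ex.mem_read &&                    /* only loads read data memory      */
               (id_ex.rt == if_id.rs ||             /* load result feeds a source ...   */
                id_ex.rt == if_id.rt);              /* ... of the instruction in ID     */
    }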

If the instruction in the ID stage is stalled, then the instruction in the IF stage must also be stalled; otherwise, we would lose the fetched instruction. Preventing these two instructions from making progress is accomplished simply by preventing the PC register and the IF/ID pipeline register from changing. Provided these registers are preserved, the instruction in the IF stage will continue to be read using the same PC, and the registers in the ID stage will continue to be read using the same instruction fields in the IF/ID pipeline register. Returning to our favorite analogy, it's as if you restart the washer with the same clothes and let the dryer continue tumbling empty. Of course, like the dryer, the back half of the pipeline starting with the EX stage must be doing something; what it is doing is executing instructions that have no effect: nops.

How can we insert these nops, which act like bubbles, into the pipeline? In Figure 4.49, we see that deasserting all nine control signals (setting them to 0) in the EX, MEM, and WB stages will create a "do nothing" or nop instruction. By identifying the hazard in the ID stage, we can insert a bubble into the pipeline by changing the EX, MEM, and WB control fields of the ID/EX pipeline register to 0. These benign control values are percolated forward at each clock cycle with the proper effect: no registers or memories are written if the control values are all 0.
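In software terms, stalling amounts to freezing the front of the pipeline and zeroing the control fields passed into ID/EX, as in this hedged sketch; the struct layout and names are invented for illustration only.

    #include <stdbool.h>
    #include <string.h>

    struct ctrl { unsigned ex_bits, mem_bits, wb_bits; };  /* the nine control signals, grouped by stage */
    struct machine {
        bool pc_write, if_id_write;   /* freeze signals for the PC and IF/ID registers */
        struct ctrl id_ex_ctrl;       /* control fields entering the ID/EX register    */
    };

    static void insert_bubble(struct machine *m)
    {
        m->pc_write    = false;       /* refetch the same instruction next cycle       */
        m->if_id_write = false;       /* keep the stalled instruction in IF/ID         */
        memset(&m->id_ex_ctrl, 0, sizeof m->id_ex_ctrl);   /* a nop flows down the pipe */
    }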

Figure 4.59 shows what really happens in the hardware: the pipeline execution slot associated with the AND instruction is turned into a nop and all instructions beginning with the AND instruction are delayed one cycle. Like an air bubble in a water pipe, a stall bubble delays everything behind it and proceeds down the instruction pipe one stage each cycle until it exits at the end. In this example, the hazard forces the AND and OR instructions to repeat in clock cycle 4 what they did in clock cycle 3: AND reads registers and decodes, and OR is refetched from instruction memory. Such repeated work is what a stall looks like, but its effect is to stretch the time of the AND and OR instructions and delay the fetch of the add instruction.

Figure 4.60 highlights the pipeline connections for both the hazard detection unit and the forwarding unit. As before, the forwarding unit controls the ALU multiplexors, while the hazard detection unit controls the writing of the PC and IF/ID registers plus the multiplexor that chooses between the real control values and all 0s.



[Figure 4.60 drawing: the pipelined datapath with the hazard detection unit (driven by ID/EX.MemRead and controlling PCWrite, IF/IDWrite, and the control-zeroing multiplexor) and the forwarding unit fed by the Rs, Rt, and Rd fields]

FIGURE 4.60 Pipelined control overview, showing the two multiplexors for forwarding, the hazard detection unit, and the forwarding unit. Although the ID and EX stages have been simplified (the sign-extended immediate and branch logic are missing), this drawing gives the essence of the forwarding hardware requirements.

Elaboration: Regarding the remark earlier about setting control lines to 0 to avoid writing registers or memory: only the signals RegWrite and MemWrite need be 0, while the other control signals can be don't cares.

There are a thousand hacking at the branches of evil to one who is striking at the root.
Henry David Thoreau, Walden, 1854

4.8 Control Hazards

Thus far, we have limited our concern to hazards involving arithmetic operations and data transfers. However, as we saw in Section 4.5, there are also pipeline hazards involving branches. Figure 4.61 shows a sequence of instructions and indicates when the branch would occur in this pipeline. An instruction must be fetched at every clock cycle to sustain the pipeline, yet in our design the decision about whether to branch doesn't occur until the MEM pipeline stage. As mentioned in Section 4.5,



Forwarding for the operands of branches was formerly handled by the ALU forwarding logic, but the introduction of the equality test unit in ID will require new forwarding logic. Note that the bypassed source operands of a branch can come from either the ALU/MEM or MEM/WB pipeline latches.

2. Because the values in a branch comparison are needed during ID but may be produced later in time, it is possible that a data hazard can occur and a stall will be needed. For example, if an ALU instruction immediately preceding a branch produces one of the operands for the comparison in the branch, a stall will be required, since the EX stage for the ALU instruction will occur after the ID cycle of the branch. By extension, if a load is immediately followed by a conditional branch that is on the load result, two stall cycles will be needed, as the result from the load appears at the end of the MEM cycle but is needed at the beginning of ID for the branch.

Despite these difficulties, moving the branch execution to the ID stage is an improvement, because it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched. The exercises explore the details of implementing the forwarding path and detecting the hazard.

To flush instructions in the IF stage, we add a control line, called IF.Flush, that zeros the instruction field of the IF/ID pipeline register. Clearing the register transforms the fetched instruction into a nop, an instruction that has no action and changes no state.

Pipelined Branch

EXAMPLE
Show what happens when the branch is taken in this instruction sequence, assuming the pipeline is optimized for branches that are not taken and that we moved the branch execution to the ID stage:

   36  sub $10, $4, $8
   40  beq $1, $3, 7     # PC-relative branch to 40 + 4 + 7 * 4 = 72
   44  and $12, $2, $5
   48  or  $13, $2, $6
   52  add $14, $4, $2
   56  slt $15, $6, $7
   . . .
   72  lw  $4, 50($7)

ANSWER
Figure 4.62 shows what happens when a branch is taken. Unlike Figure 4.61, there is only one pipeline bubble on a taken branch.



The limitations on delayed branch scheduling arise from (1) the restrictions on the instructions that are scheduled into the delay slots and (2) our ability to predict at compile time whether a branch is likely to be taken or not.

Delayed branching was a simple and effective solution for a five-stage pipeline issuing one instruction each clock cycle. As processors go to both longer pipelines and issuing multiple instructions per clock cycle (see Section 4.10), the branch delay becomes longer, and a single delay slot is insufficient. Hence, delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches. Simultaneously, the growth in available transistors per chip due to Moore's Law has made dynamic prediction relatively cheaper.

a. From before:
      add $s1, $s2, $s3
      if $s2 = 0 then
          (delay slot)
   becomes:
      if $s2 = 0 then
          add $s1, $s2, $s3

b. From target:
      sub $t4, $t5, $t6
      . . .
      add $s1, $s2, $s3
      if $s1 = 0 then
          (delay slot)
   becomes:
      add $s1, $s2, $s3
      if $s1 = 0 then
          sub $t4, $t5, $t6

c. From fall-through:
      add $s1, $s2, $s3
      if $s1 = 0 then
          (delay slot)
      sub $t4, $t5, $t6
   becomes:
      add $s1, $s2, $s3
      if $s1 = 0 then
          sub $t4, $t5, $t6

FIGURE 4.64 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of $s1 in the branch condition prevents the add instruction (whose destination is $s1) from being moved into the branch delay slot. In (b) the branch delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the sub instruction when the branch goes in the unexpected direction. By "OK" we mean that the work is wasted, but the program will still execute correctly. This is the case, for example, if $t4 were an unused temporary register when the branch goes in the unexpected direction.



branch target buffer: A structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.

correlating predictor: A branch predictor that combines local behavior of a particular branch and global information about the behavior of some recent number of executed branches.

tournament branch predictor: A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.

Elaboration: A branch predictor tells us whether or not a branch is taken, but still requires the calculation of the branch target. In the five-stage pipeline, this calculation takes one cycle, meaning that taken branches will have a 1-cycle penalty. Delayed branches are one approach to eliminate that penalty. Another approach is to use a cache to hold the destination program counter or destination instruction, using a branch target buffer.

The 2-bit dynamic prediction scheme uses only information about a particular branch. Researchers noticed that using information about both a local branch and the global behavior of recently executed branches together yields greater prediction accuracy for the same number of prediction bits. Such predictors are called correlating predictors. A typical correlating predictor might have two 2-bit predictors for each branch, with the choice between predictors made based on whether the last executed branch was taken or not taken. Thus, the global branch behavior can be thought of as adding additional index bits for the prediction lookup.

A more recent innovation in branch prediction is the use of tournament predictors. A tournament predictor uses multiple predictors, tracking, for each branch, which predictor yields the best results. A typical tournament predictor might contain two predictions for each branch index: one based on local information and one based on global branch behavior. A selector would choose which predictor to use for any given prediction. The selector can operate similarly to a 1- or 2-bit predictor, favoring whichever of the two predictors has been more accurate. Some recent microprocessors use such elaborate predictors.
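As a point of reference, here is a C model of the 2-bit saturating counters from which such predictors are built; the table size and the PC-based index are assumptions of this sketch, not a specification of any particular processor.

    #include <stdbool.h>
    #include <stdint.h>

    #define PRED_ENTRIES 1024
    static uint8_t pred_table[PRED_ENTRIES];   /* 0-1: predict not taken, 2-3: predict taken */

    static unsigned pred_index(uint32_t pc) { return (pc >> 2) % PRED_ENTRIES; }

    static bool predict_taken(uint32_t pc)
    {
        return pred_table[pred_index(pc)] >= 2;
    }

    static void predictor_update(uint32_t pc, bool taken)   /* called with the actual outcome */
    {
        uint8_t *c = &pred_table[pred_index(pc)];
        if (taken  && *c < 3) (*c)++;          /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;          /* saturate at strongly not taken */
    }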

Elaboration: One way to reduce the number of conditional branches is to add conditional move instructions. Instead of changing the PC with a conditional branch, the instruction conditionally changes the destination register of the move. If the condition fails, the move acts as a nop. For example, one version of the MIPS instruction set architecture has two new instructions called movn (move if not zero) and movz (move if zero). Thus, movn $8, $11, $4 copies the contents of register 11 into register 8, provided that the value in register 4 is nonzero; otherwise, it does nothing.

The ARMv7 instruction set has a condition field in most instructions. Hence, ARM programs could have fewer conditional branches than MIPS programs.
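For example, a compiler may turn the C selection below into a conditional move such as movn/movz (or an ARM conditionally executed instruction) instead of a conditional branch; whether it actually does so depends on the compiler and target, so treat this only as an illustration of the idea.

    int abs_value(int x)
    {
        int r = x;
        if (x < 0)        /* candidate for a conditional move: no change of control flow needed */
            r = -x;
        return r;
    }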

Pipeline Summary

We started in the laundry room, showing principles of pipelining in an everyday setting. Using that analogy as a guide, we explained instruction pipelining step-by-step, starting with the single-cycle datapath and then adding pipeline registers, forwarding paths, data hazard detection, branch prediction, and flushing instructions on exceptions. Figure 4.65 shows the final evolved datapath and control. We now are ready for yet another control hazard: the sticky issue of exceptions.

[Figure 4.65 drawing: the full pipelined datapath with the hazard detection unit, forwarding unit, branch equality test in ID, and IF.Flush]

FIGURE 4.65 The final datapath and control for this chapter. Note that this is a stylized figure rather than a detailed datapath, so it's missing the ALUsrc Mux from Figure 4.57 and the multiplexor controls from Figure 4.51.

Check Yourself
Consider three branch prediction schemes: predict not taken, predict taken, and dynamic prediction. Assume that they all have zero penalty when they predict correctly and two cycles when they are wrong. Assume that the average prediction accuracy of the dynamic predictor is 90%. Which predictor is the best choice for the following branches?

1. A branch that is taken with 5% frequency
2. A branch that is taken with 95% frequency
3. A branch that is taken with 70% frequency

4.9 Exceptions

To make a computer with automatic program-interruption facilities behave [sequentially] was not an easy matter, because the number of instructions in various stages of processing when an interrupt signal occurs may be large.
Fred Brooks, Jr., Planning a Computer System: Project Stretch, 1962

Control is the most challenging aspect of processor design: it is both the hardest part to get right and the hardest part to make fast. One of the hardest parts of control is implementing exceptions and interrupts, events other than branches or jumps that change the normal flow of instruction execution.



As we did for the taken branch in the previous section, we must flush the instructions that follow the add instruction from the pipeline and begin fetching instructions from the new address. We will use the same mechanism we used for taken branches, but this time the exception causes the deasserting of control lines.

When we dealt with branch mispredict, we saw how to flush the instruction in the IF stage by turning it into a nop. To flush instructions in the ID stage, we use the multiplexor already in the ID stage that zeros control signals for stalls. A new control signal, called ID.Flush, is ORed with the stall signal from the hazard detection unit to flush during ID. To flush the instruction in the EX phase, we use a new signal called EX.Flush to cause new multiplexors to zero the control lines. To start fetching instructions from location 8000 0180hex, which is the MIPS exception address, we simply add an additional input to the PC multiplexor that sends 8000 0180hex to the PC. Figure 4.66 shows these changes.

This example points out a problem with exceptions: if we do not stop execution in the middle of the instruction, the programmer will not be able to see the original value of register $1 that helped cause the overflow, because it will be clobbered as the destination register of the add instruction. Because of careful planning, the overflow exception is detected during the EX stage; hence, we can use the EX.Flush signal to prevent the instruction in the EX stage from writing its result in the WB stage. Many exceptions require that we eventually complete the instruction that caused the exception as if it executed normally. The easiest way to do this is to flush the instruction and restart it from the beginning after the exception is handled.

The final step is to save the address of the offending instruction in the exception program counter (EPC). In reality, we save the address + 4, so the exception-handling software routine must first subtract 4 from the saved value. Figure 4.66 shows a stylized version of the datapath, including the branch hardware and necessary accommodations to handle exceptions.

Exception in a Pipelined Computer

EXAMPLE
Given this instruction sequence,

   40hex  sub $11, $2, $4
   44hex  and $12, $2, $5
   48hex  or  $13, $2, $6
   4Chex  add $1, $2, $1
   50hex  slt $15, $6, $7
   54hex  lw  $16, 50($7)
   . . .



[Figure 4.66 drawing: the pipelined datapath with IF.Flush, ID.Flush, and EX.Flush controls, Cause and EPC registers, and the 8000 0180hex input to the PC multiplexor]

FIGURE 4.66 The datapath with controls to handle exceptions. The key additions include a new input with the value 8000 0180hex in the multiplexor that supplies the new PC value; a Cause register to record the cause of the exception; and an Exception PC register to save the address of the instruction that caused the exception. The 8000 0180hex input to the multiplexor is the initial address to begin fetching instructions in the event of an exception. Although not shown, the ALU overflow signal is an input to the control unit.

assume the instructions to be invoked on an exception begin like this:

   80000180hex  sw $26, 1000($0)
   80000184hex  sw $27, 1004($0)
   . . .

Show what happens in the pipeline if an overflow exception occurs in the add instruction.

ANSWER
Figure 4.67 shows the events, starting with the add instruction in the EX stage. The overflow is detected during that phase, and 8000 0180hex is forced into the PC. Clock cycle 7 shows that the add and following instructions are flushed, and the first instruction of the exception code is fetched. Note that the address of the instruction following the add is saved: 4Chex + 4 = 50hex.



[Figure 4.67 drawing: pipeline contents during clock cycles 6 and 7; in clock 6 the instructions lw $16, 50($7), slt $15, $6, $7, add $1, $2, $1, or $13, ..., and $12, ... occupy the stages, and in clock 7 the add and later instructions have become bubbles while sw $26, 1000($0) is fetched from 8000 0180hex]

FIGURE 4.67 The result of an exception due to arithmetic overflow in the add instruction. The overflow is detected during the EX stage of clock 6, saving the address following the add in the EPC register (4Chex + 4 = 50hex). Overflow causes all the Flush signals to be set near the end of this clock cycle, deasserting control values (setting them to 0) for the add. Clock cycle 7 shows the instructions converted to bubbles in the pipeline plus the fetching of the first instruction of the exception routine, sw $26, 1000($0), from instruction location 8000 0180hex. Note that the AND and OR instructions, which are prior to the add, still complete. Although not shown, the ALU overflow signal is an input to the control unit.



We mentioned five examples of exceptions on page 326, and we will see others in Chapter 5. With five instructions active in any clock cycle, the challenge is to associate an exception with the appropriate instruction. Moreover, multiple exceptions can occur simultaneously in a single clock cycle. The solution is to prioritize the exceptions so that it is easy to determine which is serviced first. In most MIPS implementations, the hardware sorts exceptions so that the earliest instruction is interrupted.

I/O device requests and hardware malfunctions are not associated with a specific instruction, so the implementation has some flexibility as to when to interrupt the pipeline. Hence, the mechanism used for other exceptions works just fine.

The EPC captures the address of the interrupted instruction, and the MIPS Cause register records all possible exceptions in a clock cycle, so the exception software must match the exception to the instruction. An important clue is knowing in which pipeline stage a type of exception can occur. For example, an undefined instruction is discovered in the ID stage, and invoking the operating system occurs in the EX stage. Exceptions are collected in the Cause register in a pending exception field so that the hardware can interrupt based on later exceptions, once the earliest one has been serviced.

Hardware/Software Interface
The hardware and the operating system must work in conjunction so that exceptions behave as you would expect. The hardware contract is normally to stop the offending instruction in midstream, let all prior instructions complete, flush all following instructions, set a register to show the cause of the exception, save the address of the offending instruction, and then jump to a prearranged address. The operating system contract is to look at the cause of the exception and act appropriately. For an undefined instruction, hardware failure, or arithmetic overflow exception, the operating system normally kills the program and returns an indicator of the reason. For an I/O device request or an operating system service call, the operating system saves the state of the program, performs the desired task, and, at some point in the future, restores the program to continue execution. In the case of I/O device requests, we may often choose to run another task before resuming the task that requested the I/O, since that task may often not be able to proceed until the I/O is complete. Exceptions are why the ability to save and restore the state of any task is critical. One of the most important and frequent uses of exceptions is handling page faults and TLB exceptions; Chapter 5 describes these exceptions and their handling in more detail.

imprecise interrupt: Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.

Elaboration: The difficulty of always associating the correct exception with the correct instruction in pipelined computers has led some computer designers to relax this requirement in noncritical cases. Such processors are said to have imprecise interrupts or imprecise exceptions. In the example above, PC would normally have 58hex at the start of the clock cycle after the exception is detected, even though the offending instruction is at address 4Chex.



Another example is that we might speculate that a store that precedes a load does not refer to the same address, which would allow the load to be executed before the store. The difficulty with speculation is that it may be wrong. So, any speculation mechanism must include both a method to check if the guess was right and a method to unroll or back out the effects of the instructions that were executed speculatively. The implementation of this back-out capability adds complexity.

Speculation may be done in the compiler or by the hardware. For example, the compiler can use speculation to reorder instructions, moving an instruction across a branch or a load across a store. The processor hardware can perform the same transformation at runtime using techniques we discuss later in this section.

The recovery mechanisms used for incorrect speculation are rather different. In the case of speculation in software, the compiler usually inserts additional instructions that check the accuracy of the speculation and provide a fix-up routine to use when the speculation is incorrect. In hardware speculation, the processor usually buffers the speculative results until it knows they are no longer speculative. If the speculation is correct, the instructions are completed by allowing the contents of the buffers to be written to the registers or memory. If the speculation is incorrect, the hardware flushes the buffers and re-executes the correct instruction sequence.
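A compiler-style sketch of the store/load case mentioned above follows: the load is hoisted above the store, and an explicit address check with a fix-up reload repairs the value if the speculation was wrong. The function and its exact form are our own illustration, not a prescribed transformation.

    int hoist_load(int *p, int *q, int v)
    {
        int t = *q;     /* speculative: load moved above the store it originally followed */
        *p = v;         /* the store that came first in the original program order        */
        if (p == q)     /* check inserted by the compiler: did the addresses alias?       */
            t = *q;     /* fix-up: redo the load so it sees the stored value              */
        return t;
    }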

Speculation introduces one other possible problem: speculating on certain instructions may introduce exceptions that were formerly not present. For example, suppose a load instruction is moved in a speculative manner, but the address it uses is not legal when the speculation is incorrect. The result would be an exception that should not have occurred. The problem is complicated by the fact that if the load instruction were not speculative, then the exception must occur! In compiler-based speculation, such problems are avoided by adding special speculation support that allows such exceptions to be ignored until it is clear that they really should occur. In hardware-based speculation, exceptions are simply buffered until it is clear that the instruction causing them is no longer speculative and is ready to complete; at that point the exception is raised, and normal exception handling proceeds.

Since speculation can improve performance when done properly and decrease performance when done carelessly, significant effort goes into deciding when it is appropriate to speculate. Later in this section, we will examine both static and dynamic techniques for speculation.

issue packet: The set of instructions that issues together in one clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.

Static Multiple Issue

Static multiple-issue processors all use the compiler to assist with packaging instructions and handling hazards. In a static issue processor, you can think of the set of instructions issued in a given clock cycle, which is called an issue packet, as one large instruction with multiple operations. This view is more than an analogy. Since a static multiple-issue processor usually restricts what mix of instructions can be initiated in a given clock cycle, it is useful to think of the issue packet as a single instruction allowing several operations in certain predefined fields. This view led to the original name for this approach: Very Long Instruction Word (VLIW).

Most static issue processors also rely on the compiler to take on some responsibility for handling data and control hazards. The compiler's responsibilities may include static branch prediction and code scheduling to reduce or prevent all hazards. Let's look at a simple static issue version of a MIPS processor, before we describe the use of these techniques in more aggressive processors.

An Example: Static Multiple Issue with the MIPS ISA

To give a flavor of static multiple issue, we consider a simple two-issue MIPS processor, where one of the instructions can be an integer ALU operation or branch and the other can be a load or store. Such a design is like that used in some embedded MIPS processors. Issuing two instructions per cycle will require fetching and decoding 64 bits of instructions. In many static multiple-issue processors, and essentially all VLIW processors, the layout of simultaneously issuing instructions is restricted to simplify the decoding and instruction issue. Hence, we will require that the instructions be paired and aligned on a 64-bit boundary, with the ALU or branch portion appearing first. Furthermore, if one instruction of the pair cannot be used, we require that it be replaced with a nop. Thus, the instructions always issue in pairs, possibly with a nop in one slot. Figure 4.68 shows how the instructions look as they go into the pipeline in pairs.

Very Long Instruction Word (VLIW): A style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields.

Static multiple-issue processors vary in how they deal with potential data and control hazards. In some designs, the compiler takes full responsibility for removing all hazards, scheduling the code and inserting no-ops so that the code executes without any need for hazard detection or hardware-generated stalls. In others, the hardware detects data hazards and generates stalls between two issue packets, while requiring that the compiler avoid all dependences within an instruction pair. Even so, a hazard generally forces the entire issue packet containing the dependent instruction to stall. Whether the software must handle all hazards or only try to reduce the fraction of hazards between separate issue packets, the appearance of having a large single instruction with multiple operations is reinforced. We will assume the second approach for this example.

Instruction type             Pipe stages
ALU or branch instruction    IF  ID  EX  MEM WB
Load or store instruction    IF  ID  EX  MEM WB
ALU or branch instruction        IF  ID  EX  MEM WB
Load or store instruction        IF  ID  EX  MEM WB
ALU or branch instruction            IF  ID  EX  MEM WB
Load or store instruction            IF  ID  EX  MEM WB
ALU or branch instruction                IF  ID  EX  MEM WB
Load or store instruction                IF  ID  EX  MEM WB

FIGURE 4.68 Static two-issue pipeline in operation. The ALU and data transfer instructions are issued at the same time. Here we have assumed the same five-stage structure as used for the single-issue pipeline. Although this is not strictly necessary, it does have some advantages. In particular, keeping the register writes at the end of the pipeline simplifies the handling of exceptions and the maintenance of a precise exception model, which become more difficult in multiple-issue processors.
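Under the issue restrictions just described, a compiler (or an assertion in a pipeline simulator) could check whether a candidate pair may issue together roughly as in the sketch below; the instruction classification and field names are assumptions of this illustration.

    #include <stdbool.h>
    #include <stdint.h>

    enum iclass { ALU_OR_BRANCH, LOAD_OR_STORE, OTHER };

    struct insn { enum iclass cls; bool is_nop; bool writes_reg; uint8_t rd, rs, rt; };

    /* True if slot 0 (ALU/branch position) and slot 1 (load/store position) may issue together. */
    static bool legal_pair(struct insn slot0, struct insn slot1)
    {
        if (!slot0.is_nop && slot0.cls != ALU_OR_BRANCH) return false;  /* first slot: ALU or branch  */
        if (!slot1.is_nop && slot1.cls != LOAD_OR_STORE) return false;  /* second slot: load or store */
        if (slot0.writes_reg && slot0.rd != 0 &&
            (slot0.rd == slot1.rs || slot0.rd == slot1.rt))
            return false;                           /* no dependence allowed within the pair */
        return true;
    }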

To issue an ALU and a data transfer operation in parallel, the first need for additional hardware, beyond the usual hazard detection and stall logic, is extra ports in the register file (see Figure 4.69). In one clock cycle we may need to read two registers for the ALU operation and two more for a store, and also one write port for an ALU operation and one write port for a load. Since the ALU is tied up for the ALU operation, we also need a separate adder to calculate the effective address for data transfers. Without these extra resources, our two-issue pipeline would be hindered by structural hazards.

Clearly, this two-issue processor can improve performance by up to a factor of two. Doing so, however, requires that twice as many instructions be overlapped in execution, and this additional overlap increases the relative performance loss from data and control hazards. For example, in our simple five-stage pipeline, loads have a use latency of one clock cycle, which prevents one instruction from using the result without stalling. In the two-issue, five-stage pipeline the result of a load instruction cannot be used on the next clock cycle. This means that the next two instructions cannot use the load result without stalling. Furthermore, ALU instructions that had no use latency in the simple five-stage pipeline now have a one-instruction use latency, since the results cannot be used in the paired load or store. To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compiler or hardware scheduling techniques are needed, and static multiple issue requires that the compiler take on this role.

[Figure 4.69 drawing: the static two-issue datapath, with 64 bits fetched per cycle, two sign-extend units, extra register-file ports, and a second ALU for address calculation]

FIGURE 4.69 A static two-issue datapath. The additions needed for double issue are highlighted: another 32 bits from instruction memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles address calculations for data transfers and the top ALU handles everything else.

use latency: Number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.

Simple Multiple-Issue Code Scheduling

EXAMPLE
How would this loop be scheduled on a static two-issue pipeline for MIPS?

   Loop: lw   $t0, 0($s1)        # $t0 = array element
         addu $t0, $t0, $s2      # add scalar in $s2
         sw   $t0, 0($s1)        # store result
         addi $s1, $s1, -4       # decrement pointer
         bne  $s1, $zero, Loop   # branch if $s1 != 0

Reorder the instructions to avoid as many pipeline stalls as possible. Assume branches are predicted, so that control hazards are handled by the hardware.

ANSWER
The first three instructions have data dependences, and so do the last two. Figure 4.70 shows the best schedule for these instructions. Notice that just one pair of instructions has both issue slots used. It takes four clocks per loop iteration; at four clocks to execute five instructions, we get the disappointing CPI of 0.8 versus the best case of 0.5, or an IPC of 1.25 versus 2.0. Notice that in computing CPI or IPC, we do not count any nops executed as useful instructions. Doing so would improve CPI, but not performance!

   ALU or branch instruction     Data transfer instruction   Clock cycle
   Loop:                         lw  $t0, 0($s1)             1
         addi $s1, $s1, -4                                   2
         addu $t0, $t0, $s2                                  3
         bne  $s1, $zero, Loop   sw  $t0, 4($s1)             4

FIGURE 4.70 The scheduled code as it would look on a two-issue MIPS pipeline. The empty slots are no-ops.



loop unrolling: A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.

An important compiler technique to get more performance from loops is loop unrolling, where multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping instructions from different iterations.

Loop Unrolling for Multiple-Issue Pipelines

EXAMPLE
See how well loop unrolling and scheduling work in the example above. For simplicity assume that the loop index is a multiple of four.

register renaming: The renaming of registers by the compiler or hardware to remove antidependences.

antidependence: Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.

ANSWER
To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of lw, add, and sw, plus one addi and one bne. Figure 4.71 shows the unrolled and scheduled code.

During the unrolling process, the compiler introduced additional registers ($t1, $t2, $t3). The goal of this process, called register renaming, is to eliminate dependences that are not true data dependences, but could either lead to potential hazards or prevent the compiler from flexibly scheduling the code. Consider how the unrolled code would look using only $t0. There would be repeated instances of lw $t0, 0($s1) and addu $t0, $t0, $s2 followed by sw $t0, 4($s1), but these sequences, despite using $t0, are actually completely independent: no data values flow between one set of these instructions and the next set. This case is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence that is also called a true dependence.

Renaming the registers during the unrolling process allows the compiler to subsequently move these independent instructions so as to better schedule the code. The renaming process eliminates the name dependences, while preserving the true dependences.

Notice now that 12 of the 14 instructions in the loop execute as pairs. It takes 8 clocks for 4 loop iterations, or 2 clocks per iteration, which yields a CPI of 8/14 = 0.57. Loop unrolling and scheduling with dual issue gave us an improvement factor of almost 2, partly from reducing the loop control instructions and partly from dual issue execution. The cost of this performance improvement is using four temporary registers rather than one, as well as a significant increase in code size.

   ALU or branch instruction     Data transfer instruction   Clock cycle
   Loop: addi $s1, $s1, -16      lw $t0, 0($s1)              1
                                 lw $t1, 12($s1)             2
         addu $t0, $t0, $s2      lw $t2, 8($s1)              3
         addu $t1, $t1, $s2      lw $t3, 4($s1)              4
         addu $t2, $t2, $s2      sw $t0, 16($s1)             5
         addu $t3, $t3, $s2      sw $t1, 12($s1)             6
                                 sw $t2, 8($s1)              7
         bne  $s1, $zero, Loop   sw $t3, 4($s1)              8

FIGURE 4.71 The unrolled and scheduled code of Figure 4.70 as it would look on a static two-issue MIPS pipeline. The empty slots are no-ops. Since the first instruction in the loop decrements $s1 by 16, the addresses loaded are the original value of $s1, then that address minus 4, minus 8, and minus 12.
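In C terms, the unrolling and renaming above correspond roughly to rewriting the loop with four independent temporaries per iteration. The source loop below is our own reconstruction of what the MIPS code computes (adding a scalar to every array element), so treat it as an assumption rather than the book's source program.

    void add_scalar(int *a, int n, int s)   /* n assumed to be a multiple of 4 */
    {
        /* Rolled form: for (int i = 0; i < n; i++) a[i] += s; */
        for (int i = 0; i + 3 < n; i += 4) {
            int t0 = a[i], t1 = a[i + 1], t2 = a[i + 2], t3 = a[i + 3];  /* "renamed" temporaries */
            a[i]     = t0 + s;
            a[i + 1] = t1 + s;
            a[i + 2] = t2 + s;
            a[i + 3] = t3 + s;
        }
    }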

Dynamic Multiple-Issue Processors

Dynamic multiple-issue processors are also known as superscalar processors, or simply superscalars. In the simplest superscalar processors, instructions issue in order, and the processor decides whether zero, one, or more instructions can issue in a given clock cycle. Obviously, achieving good performance on such a processor still requires the compiler to try to schedule instructions to move dependences apart and thereby improve the instruction issue rate. Even with such compiler scheduling, there is an important difference between this simple superscalar and a VLIW processor: the code, whether scheduled or not, is guaranteed by the hardware to execute correctly. Furthermore, compiled code will always run correctly independent of the issue rate or pipeline structure of the processor. In some VLIW designs, this has not been the case, and recompilation was required when moving across different processor models; in other static issue processors, code would run correctly across different implementations, but often so poorly as to make compilation effectively required.

Many superscalars extend the basic framework of dynamic issue decisions to include dynamic pipeline scheduling. Dynamic pipeline scheduling chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls. Let's start with a simple example of avoiding a data hazard. Consider the following code sequence:

   lw   $t0, 20($s2)
   addu $t1, $t0, $t2
   sub  $s4, $s4, $t3
   slti $t5, $s4, 20

Even though the sub instruction is ready to execute, it must wait for the lw and addu to complete first, which might take many clock cycles if memory is slow. (Chapter 5 explains cache misses, the reason that memory accesses are sometimes very slow.) Dynamic pipeline scheduling allows such hazards to be avoided either fully or partially.

superscalar: An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.

dynamic pipeline scheduling: Hardware support for reordering the order of instruction execution so as to avoid stalls.

Dynamic Pipeline Scheduling

Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering them to avoid stalls. In such processors, the pipeline is divided into three major units: an instruction fetch and issue unit, multiple functional units (a dozen or more in high-end designs in 2013), and a commit unit. Figure 4.72 shows the model. The first unit fetches instructions, decodes them, and sends each instruction to a corresponding functional unit for execution. Each functional unit has buffers, called reservation stations, which hold the operands and the operation. (The Elaboration discusses an alternative to reservation stations used by many recent processors.) As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated. When the result is completed, it is sent to any reservation stations waiting for this particular result as well as to the commit unit, which buffers the result until it is safe to put the result into the register file or, for a store, into memory. The buffer in the commit unit, often called the reorder buffer, is also used to supply operands, in much the same way as forwarding logic does in a statically scheduled pipeline. Once a result is committed to the register file, it can be fetched directly from there, just as in a normal pipeline.

[Figure 4.72 drawing: an instruction fetch and decode unit issuing in order to reservation stations; integer, floating point, and load-store functional units executing out of order; and a commit unit committing in order]

FIGURE 4.72 The three primary units of a dynamically scheduled pipeline. The final step of updating the state is also called retirement or graduation.

commit unit: The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.

reservation station: A buffer within a functional unit that holds the operands and the operation.

reorder buffer: The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.

The combination of buffering operands in the reservation stations and results in the reorder buffer provides a form of register renaming, just like that used by the compiler in our earlier loop-unrolling example on page 338. To see how this conceptually works, consider the following steps:



1. When an instruction issues, it is copied to a reservation station for the appropriate functional unit. Any operands that are available in the register file or reorder buffer are also immediately copied into the reservation station. The instruction is buffered in the reservation station until all the operands and the functional unit are available. For the issuing instruction, the register copy of the operand is no longer required, and if a write to that register occurred, the value could be overwritten.

2. If an operand is not in the register file or reorder buffer, it must be waiting to be produced by a functional unit. The name of the functional unit that will produce the result is tracked. When that unit eventually produces the result, it is copied directly into the waiting reservation station from the functional unit, bypassing the registers.

These steps effectively use the reorder buffer and the reservation stations to implement register renaming.
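The two steps can be modeled in C roughly as below; the tag/value representation and the function name are assumptions used only to make the renaming effect concrete, not a description of any real microarchitecture.

    #include <stdbool.h>
    #include <stdint.h>

    struct operand {
        bool    ready;       /* true: value holds the data                            */
        int32_t value;       /* the operand value, if ready                           */
        int     producer;    /* otherwise: tag of the functional unit that produces it */
    };

    /* Steps 1 and 2: fill one reservation-station operand at issue time. */
    static struct operand capture(bool avail, int32_t current_value, int producing_unit)
    {
        struct operand op;
        if (avail) {            /* value already in the register file or reorder buffer */
            op.ready = true;  op.value = current_value;  op.producer = -1;
        } else {                /* wait for the functional unit; result arrives by bypass */
            op.ready = false; op.value = 0;              op.producer = producing_unit;
        }
        return op;
    }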

Conceptually, you can think of a dynamically scheduled pipeline as analyzing<br />

the data flow structure of a program. The processor then executes the instructions<br />

in some order that preserves the data flow order of the program. This style of<br />

execution is called an out-of-order execution, since the instructions can be<br />

executed in a different order than they were fetched.<br />

To make programs behave as if they were running on a simple in-order pipeline,<br />

the instruction fetch <strong>and</strong> decode unit is required to issue instructions in order,<br />

which allows dependences to be tracked, <strong>and</strong> the commit unit is required to write<br />

results to registers <strong>and</strong> memory in program fetch order. This conservative mode is<br />

called in-order commit. Hence, if an exception occurs, the computer can point to<br />

the last instruction executed, <strong>and</strong> the only registers updated will be those written<br />

by instructions before the instruction causing the exception. Although the front<br />

end (fetch <strong>and</strong> issue) <strong>and</strong> the back end (commit) of the pipeline run in order,<br />

the functional units are free to initiate execution whenever the data they need is<br />

available. Today, all dynamically scheduled pipelines use in-order commit.<br />
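The in-order commit rule itself can be expressed as a short loop over a circular reorder buffer: only completed entries at the head may update the architectural state, so no later instruction ever commits before an earlier one. Again, this is a conceptual sketch rather than the structure of any real commit unit; names such as ROB_SIZE and arch_reg are assumptions made for the example.

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 64
#define NUM_REGS 32

typedef struct {
    bool    valid;      /* entry is occupied                  */
    bool    complete;   /* result has been produced           */
    int     dest_reg;   /* architectural register to update   */
    int64_t result;
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int         rob_head;              /* oldest un-committed entry */
static int64_t     arch_reg[NUM_REGS];    /* programmer-visible state  */

/* Commit in program order: stop at the first entry that is not done.  */
void commit(void) {
    while (rob[rob_head].valid && rob[rob_head].complete) {
        arch_reg[rob[rob_head].dest_reg] = rob[rob_head].result;
        rob[rob_head].valid = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}

Because younger entries are simply discarded when the head raises an exception or a misprediction is discovered, in-order commit makes precise interrupts straightforward.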

Dynamic scheduling is often extended by including hardware-based speculation,<br />

especially for branch outcomes. By predicting the direction of a branch, a<br />

dynamically scheduled processor can continue to fetch <strong>and</strong> execute instructions<br />

along the predicted path. Because the instructions are committed in order, we know<br />

whether or not the branch was correctly predicted before any instructions from the<br />

predicted path are committed. A speculative, dynamically scheduled pipeline can<br />

also support speculation on load addresses, allowing load-store reordering, <strong>and</strong><br />

using the commit unit to avoid incorrect speculation. In the next section, we will<br />

look at the use of dynamic scheduling with speculation in the Intel Core i7 design.<br />

out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.

in-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.


4.10 Parallelism via Instructions 343<br />

Modern, high-performance microprocessors are capable of issuing several instructions<br />

per clock; unfortunately, sustaining that issue rate is very difficult. For example, despite<br />

the existence of processors that can issue four to six instructions per clock, very few applications can

sustain more than two instructions per clock. There are two primary reasons for this.<br />

First, within the pipeline, the major performance bottlenecks arise from<br />

dependences that cannot be alleviated, thus reducing the parallelism among<br />

instructions <strong>and</strong> the sustained issue rate. Although little can be done about true data<br />

dependences, often the compiler or hardware does not know precisely whether a<br />

dependence exists or not, <strong>and</strong> so must conservatively assume the dependence exists.<br />

For example, code that makes use of pointers, particularly in ways that may lead to<br />

aliasing, will lead to more implied potential dependences. In contrast, the greater<br />

regularity of array accesses often allows a compiler to deduce that no dependences<br />

exist. Similarly, branches that cannot be accurately predicted, whether at runtime or at
compile time, will limit the ability to exploit ILP. Often, additional ILP is available, but

the ability of the compiler or the hardware to find ILP that may be widely separated<br />

(sometimes by the execution of thous<strong>and</strong>s of instructions) is limited.<br />
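A small C example, not taken from the text, illustrates the aliasing point. In the first function the compiler (or the hardware) must assume that the two pointers might refer to the same location, which forces a potential dependence between the store and the following load. When the accesses are regular and the pointers are known not to alias, no such dependence exists and the operations can be reordered or overlapped freely.

/* The store through p may change *q, so the load of *q cannot safely be
   moved above the store: a potential (implied) dependence.             */
int maybe_aliased(int *p, int *q) {
    *p = *p + 1;
    return *q;     /* must appear to read the possibly updated value    */
}

/* With restrict-qualified pointers and regular index expressions, the
   compiler can prove the accesses are independent, exposing more
   instruction-level parallelism.                                       */
void independent(int n, double * restrict a, double * restrict b) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1.0;   /* iterations touch provably distinct elements */
}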

Second, losses in the memory hierarchy (the topic of Chapter 5) also limit the<br />

ability to keep the pipeline full. Some memory system stalls can be hidden, but<br />

limited amounts of ILP also limit the extent to which such stalls can be hidden.<br />

Hardware/Software Interface

Energy Efficiency <strong>and</strong> Advanced Pipelining<br />

The downside to the increasing exploitation of instruction-level parallelism via<br />

dynamic multiple issue <strong>and</strong> speculation is potential energy inefficiency. Each<br />

innovation was able to turn more transistors into performance, but they often did<br />

so very inefficiently. Now that we have hit the power wall, we are seeing designs<br />

with multiple processors per chip where the processors are not as deeply pipelined<br />

or as aggressively speculative as their predecessors.

The belief is that while the simpler processors are not as fast as their sophisticated<br />

brethren, they deliver better performance per joule, so that they can deliver more<br />

performance per chip when designs are constrained more by energy than they are<br />

by number of transistors.<br />

Figure 4.73 shows the number of pipeline stages, the issue width, speculation level,<br />

clock rate, cores per chip, <strong>and</strong> power of several past <strong>and</strong> recent microprocessors. Note<br />

the drop in pipeline stages <strong>and</strong> power as companies switch to multicore designs.<br />

Elaboration: A commit unit controls updates to the register file <strong>and</strong> memory. Some<br />

dynamically scheduled processors update the register file immediately during execution,<br />

using extra registers to implement the renaming function <strong>and</strong> preserving the older copy of a<br />

register until the instruction updating the register is no longer speculative. Other processors<br />

buffer the result, typically in a structure called a reorder buffer, <strong>and</strong> the actual update to the<br />

register file occurs later as part of the commit. Stores to memory must be buffered until<br />

commit time either in a store buffer (see Chapter 5) or in the reorder buffer. The commit unit<br />

allows the store to write to memory from the buffer when the buffer has a valid address <strong>and</strong><br />

valid data, <strong>and</strong> when the store is no longer dependent on predicted branches.
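The release condition described in this Elaboration can also be stated as a simple predicate. The sketch below is illustrative only; the field names are assumptions, not the structure used by any actual processor.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     addr_valid;   /* effective address has been computed       */
    bool     data_valid;   /* store data has been produced              */
    bool     speculative;  /* still depends on a predicted branch       */
    uint64_t addr;
    uint64_t data;
} store_buffer_entry_t;

/* A buffered store may be released to memory only when it has a valid
   address, valid data, and is no longer speculative.                   */
bool may_write_to_memory(const store_buffer_entry_t *e) {
    return e->addr_valid && e->data_valid && !e->speculative;
}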


344 Chapter 4 The Processor<br />

Microprocessor              Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-Order/Speculation  Cores/Chip  Power
Intel 486                   1989  25 MHz      5                1            No                        1           5 W
Intel Pentium               1993  66 MHz      5                2            No                        1           10 W
Intel Pentium Pro           1997  200 MHz     10               3            Yes                       1           29 W
Intel Pentium 4 Willamette  2001  2000 MHz    22               3            Yes                       1           75 W
Intel Pentium 4 Prescott    2004  3600 MHz    31               3            Yes                       1           103 W
Intel Core                  2006  2930 MHz    14               4            Yes                       2           75 W
Intel Core i5 Nehalem       2010  3300 MHz    14               4            Yes                       1           87 W
Intel Core i5 Ivy Bridge    2012  3400 MHz    14               4            Yes                       8           77 W

FIGURE 4.73 Record of Intel Microprocessors in terms of pipeline complexity, number of cores, and power. The Pentium 4 pipeline stages do not include the commit stages. If we included them, the Pentium 4 pipelines would be even deeper.

Elaboration: Memory accesses benefi t from nonblocking caches, which continue<br />

servicing cache accesses during a cache miss (see Chapter 5). Out-of-order execution<br />

processors need the cache design to allow instructions to execute during a miss.<br />

Check Yourself

State whether the following techniques or components are associated primarily<br />

with a software- or hardware-based approach to exploiting ILP. In some cases, the<br />

answer may be both.<br />

1. Branch prediction<br />

2. Multiple issue<br />

3. VLIW<br />

4. Superscalar<br />

5. Dynamic scheduling<br />

6. Out-of-order execution<br />

7. Speculation<br />

8. Reorder buffer<br />

9. Register renaming<br />

4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel<br />

Core i7 Pipelines<br />

Figure 4.74 describes the two microprocessors we examine in this section, whose<br />

targets are the two bookends of the PostPC Era.


4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel Core i7 Pipelines 345<br />

Processor                      ARM A8                  Intel Core i7 920
Market                         Personal Mobile Device  Server, Cloud
Thermal design power           2 Watts                 130 Watts
Clock rate                     1 GHz                   2.66 GHz
Cores/Chip                     1                       4
Floating point?                No                      Yes
Multiple issue?                Dynamic                 Dynamic
Peak instructions/clock cycle  2                       4
Pipeline stages                14                      14
Pipeline schedule              Static in-order         Dynamic out-of-order with speculation
Branch prediction              2-level                 2-level
1st level caches / core        32 KiB I, 32 KiB D      32 KiB I, 32 KiB D
2nd level cache / core         128 - 1024 KiB          256 KiB
3rd level cache (shared)       --                      2 - 8 MiB

FIGURE 4.74 Specification of the ARM Cortex-A8 and the Intel Core i7 920.

The ARM Cortex-A8<br />

The ARM Cortex-A8 runs at 1 GHz with a 14-stage pipeline. It uses dynamic

multiple issue, with two instructions per clock cycle. It is a static in-order pipeline,<br />

in that instructions issue, execute, <strong>and</strong> commit in order. The pipeline consists of<br />

three sections for instruction fetch, instruction decode, <strong>and</strong> execute. Figure 4.75<br />

shows the overall pipeline.<br />

The first three stages fetch two instructions at a time and try to keep a 12-entry instruction prefetch buffer full. It uses a two-level branch predictor with a 512-entry branch target buffer, a 4096-entry global history buffer, and an 8-entry return stack to predict future returns. When the branch prediction is

wrong, it empties the pipeline, resulting in a 13-clock cycle misprediction penalty.<br />

The five stages of the decode pipeline determine if there are dependences<br />

between a pair of instructions, which would force sequential execution, <strong>and</strong> in<br />

which pipeline of the execution stages to send the instructions.<br />

The six stages of the instruction execution section offer one pipeline for load<br />

<strong>and</strong> store instructions <strong>and</strong> two pipelines for arithmetic operations, although only<br />

the first of the pair can h<strong>and</strong>le multiplies. Either instruction from the pair can be<br />

issued to the load-store pipeline. The execution stages have full bypassing between<br />

the three pipelines.<br />

Figure 4.76 shows the CPI of the A8 using small versions of programs derived<br />

from the SPEC2000 benchmarks. While the ideal CPI is 0.5, the best case here is<br />

1.4, the median case is 2.0, <strong>and</strong> the worst case is 5.2. For the median case, 80% of<br />

the stalls are due to the pipelining hazards <strong>and</strong> 20% are stalls due to the memory


4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel Core i7 Pipelines 349<br />

issue the micro-ops from the buffer, eliminating the need for the instruction<br />

fetch <strong>and</strong> instruction decode stages to be activated.<br />

5. Perform the basic instruction issue—Looking up the register location in the<br />

register tables, renaming the registers, allocating a reorder buffer entry, <strong>and</strong><br />

fetching any results from the registers or reorder buffer before sending the<br />

micro-ops to the reservation stations.<br />

6. The i7 uses a 36-entry centralized reservation station shared by six functional<br />

units. Up to six micro-ops may be dispatched to the functional units every<br />

clock cycle.<br />

7. The individual function units execute micro-ops <strong>and</strong> then results are sent<br />

back to any waiting reservation station as well as to the register retirement<br />

unit, where they will update the register state, once it is known that the<br />

instruction is no longer speculative. The entry corresponding to the<br />

instruction in the reorder buffer is marked as complete.<br />

8. When one or more instructions at the head of the reorder buffer have been<br />

marked as complete, the pending writes in the register retirement unit are<br />

executed, <strong>and</strong> the instructions are removed from the reorder buffer.<br />

Elaboration: Hardware in the second <strong>and</strong> fourth steps can combine or fuse operations<br />

together to reduce the number of operations that must be performed. Macro-op fusion<br />

in the second step takes x86 instruction combinations, such as compare followed by a<br />

branch, <strong>and</strong> fuses them into a single operation. Microfusion in the fourth step combines<br />

micro-operation pairs such as load/ALU operation <strong>and</strong> ALU operation/store <strong>and</strong> issues<br />

them to a single reservation station (where they can still issue independently), thus<br />

increasing the usage of the buffer. In a study of the Intel Core architecture, which also<br />

incorporated microfusion <strong>and</strong> macrofusion, Bird et al. [2007] discovered that microfusion<br />

had little impact on performance, while macrofusion appears to have a modest positive<br />

impact on integer performance <strong>and</strong> little impact on floating-point performance.<br />

Performance of the Intel Core i7 920<br />

Figure 4.78 shows the CPI of the Intel Core i7 for each of the SPEC2006 benchmarks.<br />

While the ideal CPI is 0.25, the best case here is 0.44, the median case is 0.79, <strong>and</strong><br />

the worst case is 2.67.<br />

While it is difficult to differentiate between pipeline stalls <strong>and</strong> memory stalls<br />

in a dynamic out-of-order execution pipeline, we can show the effectiveness of<br />

branch prediction <strong>and</strong> speculation. Figure 4.79 shows the percentage of branches<br />

mispredicted and the percentage of the work (measured by the number of micro-ops

dispatched into the pipeline) that does not retire (that is, their results are<br />

annulled) relative to all micro-op dispatches. The min, median, <strong>and</strong> max of branch<br />

mispredictions are 0%, 2%, <strong>and</strong> 10%. For wasted work, they are 1%, 18%, <strong>and</strong> 39%.<br />

The wasted work in some cases closely matches the branch misprediction rates,<br />

such as for gobmk <strong>and</strong> astar. In several instances, such as mcf, the wasted work<br />

seems relatively larger than the misprediction rate. This divergence is likely due


350 Chapter 4 The Processor<br />

[Bar chart omitted: CPI of each SPEC2006 integer benchmark (libquantum, h264ref, hmmer, perlbench, bzip2, xalancbmk, sjeng, gobmk, astar, gcc, omnetpp, mcf), with each bar split into ideal CPI and stalls/misspeculation; the values range from 0.44 to 2.67.]

FIGURE 4.78 CPI of Intel Core i7 920 running SPEC2006 integer benchmarks.

[Bar chart omitted: branch misprediction percentage and wasted-work percentage for each SPEC2006 integer benchmark; mispredictions range from 0% to 10% and wasted work from 1% to 39%.]

FIGURE 4.79 Percentage of branch mispredictions and wasted work due to unfruitful speculation of Intel Core i7 920 running SPEC2006 integer benchmarks.


352 Chapter 4 The Processor<br />

#include <x86intrin.h>
#define UNROLL (4)

void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=UNROLL*4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c[4];
      for ( int x = 0; x < UNROLL; x++ )
        c[x] = _mm256_load_pd(C+i+x*4+j*n);

      for ( int k = 0; k < n; k++ )
      {
        __m256d b = _mm256_broadcast_sd(B+k+j*n);
        for ( int x = 0; x < UNROLL; x++ )
          c[x] = _mm256_add_pd(c[x],
                 _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
      }

      for ( int x = 0; x < UNROLL; x++ )
        _mm256_store_pd(C+i+x*4+j*n, c[x]);
    }
}

FIGURE 4.80 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86 (Figure 3.23) and loop unrolling to create more opportunities for instruction-level parallelism. Figure 4.81 shows the assembly language produced by the compiler for the inner loop, which unrolls the three for-loop bodies to expose instruction-level parallelism.

instruction, since we can use the four copies of the B element in register %ymm0<br />

repeatedly throughout the loop. Thus, the 5 AVX instructions in Figure 3.24<br />

become 17 in Figure 4.81, <strong>and</strong> the 7 integer instructions appear in both, although<br />

the constants and addressing change to account for the unrolling. Hence, despite

unrolling 4 times, the number of instructions in the body of the loop only doubles:<br />

from 12 to 24.<br />

Figure 4.82 shows the performance increase of DGEMM for 32x32 matrices in

going from unoptimized to AVX <strong>and</strong> then to AVX with unrolling. Unrolling more<br />

than doubles performance, going from 6.4 GFLOPS to 14.6 GFLOPS. Optimizations<br />

for subword parallelism <strong>and</strong> instruction level parallelism result in an overall<br />

speedup of 8.8 versus the unoptimized DGEMM in Figure 3.21.<br />

Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with<br />

Turbo mode turned off. If we turn it on, as in Chapter 3, the temporary increase in the clock rate of 3.3/2.6 = 1.27 improves all the results: to 2.1 GFLOPS for unoptimized DGEMM, 8.1 GFLOPS with AVX, and 18.6 GFLOPS with unrolling and AVX. As mentioned

in Section 3.8, Turbo mode works particularly well in this case because it is using only<br />

a single core of an eight-core chip.


4.12 Going Faster: Instruction-Level Parallelism <strong>and</strong> Matrix Multiply 353<br />

1  vmovapd (%r11),%ymm4              # Load 4 elements of C into %ymm4
2  mov %rbx,%rax                     # register %rax = %rbx
3  xor %ecx,%ecx                     # register %ecx = 0
4  vmovapd 0x20(%r11),%ymm3          # Load 4 elements of C into %ymm3
5  vmovapd 0x40(%r11),%ymm2          # Load 4 elements of C into %ymm2
6  vmovapd 0x60(%r11),%ymm1          # Load 4 elements of C into %ymm1
7  vbroadcastsd (%rcx,%r9,1),%ymm0   # Make 4 copies of B element
8  add $0x8,%rcx                     # register %rcx = %rcx + 8
9  vmulpd (%rax),%ymm0,%ymm5         # Parallel mul %ymm0, 4 A elements
10 vaddpd %ymm5,%ymm4,%ymm4          # Parallel add %ymm5, %ymm4
11 vmulpd 0x20(%rax),%ymm0,%ymm5     # Parallel mul %ymm0, 4 A elements
12 vaddpd %ymm5,%ymm3,%ymm3          # Parallel add %ymm5, %ymm3
13 vmulpd 0x40(%rax),%ymm0,%ymm5     # Parallel mul %ymm0, 4 A elements
14 vmulpd 0x60(%rax),%ymm0,%ymm0     # Parallel mul %ymm0, 4 A elements
15 add %r8,%rax                      # register %rax = %rax + %r8
16 cmp %r10,%rcx                     # compare %rcx to %r10
17 vaddpd %ymm5,%ymm2,%ymm2          # Parallel add %ymm5, %ymm2
18 vaddpd %ymm0,%ymm1,%ymm1          # Parallel add %ymm0, %ymm1
19 jne 68                            # loop back if %rcx != %r10
20 add $0x1,%esi                     # register %esi = %esi + 1
21 vmovapd %ymm4,(%r11)              # Store %ymm4 into 4 C elements
22 vmovapd %ymm3,0x20(%r11)          # Store %ymm3 into 4 C elements
23 vmovapd %ymm2,0x40(%r11)          # Store %ymm2 into 4 C elements
24 vmovapd %ymm1,0x60(%r11)          # Store %ymm1 into 4 C elements

FIGURE 4.81 The x86 assembly language for the body of the nested loops generated by compiling the unrolled C code in Figure 4.80.

Elaboration: There are no pipeline stalls despite the reuse of register %ymm5 in lines<br />

9 to 17 of Figure 4.81 because the Intel Core i7 pipeline renames the registers.

Are the following statements true or false?<br />

1. The Intel Core i7 uses a multiple-issue pipeline to directly execute x86<br />

instructions.<br />

2. Both the A8 <strong>and</strong> the Core i7 use dynamic multiple issue.<br />

3. The Core i7 microarchitecture has many more registers than x86 requires.<br />

4. The Intel Core i7 uses less than half the pipeline stages of the earlier Intel<br />

Pentium 4 Prescott (see Figure 4.73).<br />

Check Yourself


4.14 Fallacies <strong>and</strong> Pitfalls 355<br />

4.14 Fallacies <strong>and</strong> Pitfalls<br />

Fallacy: Pipelining is easy.<br />

Our books testify to the subtlety of correct pipeline execution. Our advanced book<br />

had a pipeline bug in its first edition, despite its being reviewed by more than 100<br />

people <strong>and</strong> being class-tested at 18 universities. The bug was uncovered only when<br />

someone tried to build the computer in that book. The fact that the Verilog needed to
describe a pipeline like that of the Intel Core i7 runs to many thousands of lines is

an indication of the complexity. Beware!<br />

Fallacy: Pipelining ideas can be implemented independent of technology.<br />

When the number of transistors on-chip <strong>and</strong> the speed of transistors made a<br />

five-stage pipeline the best solution, the delayed branch (see the Elaboration

on page 255) was a simple solution to control hazards. With longer pipelines,<br />

superscalar execution, <strong>and</strong> dynamic branch prediction, it is now redundant. In<br />

the early 1990s, dynamic pipeline scheduling took too many resources <strong>and</strong> was<br />

not required for high performance, but as transistor budgets continued to double<br />

due to Moore's Law and logic became much faster than memory, multiple

functional units <strong>and</strong> dynamic pipelining made more sense. Today, concerns about<br />

power are leading to less aggressive designs.<br />

Pitfall: Failure to consider instruction set design can adversely impact pipelining.<br />

Many of the difficulties of pipelining arise because of instruction set complications.<br />

Here are some examples:<br />

■ Widely variable instruction lengths <strong>and</strong> running times can lead to imbalance<br />

among pipeline stages <strong>and</strong> severely complicate hazard detection in a design<br />

pipelined at the instruction set level. This problem was overcome, initially<br />

in the DEC VAX 8500 in the late 1980s, using the micro-operations <strong>and</strong><br />

micropipelined scheme that the Intel Core i7 employs today. Of course, the<br />

overhead of translation and maintaining correspondence between the micro-operations

<strong>and</strong> the actual instructions remains.<br />

■ Sophisticated addressing modes can lead to different sorts of problems.<br />

Addressing modes that update registers complicate hazard detection. Other<br />

addressing modes that require multiple memory accesses substantially<br />

complicate pipeline control <strong>and</strong> make it difficult to keep the pipeline flowing<br />

smoothly.<br />

■ Perhaps the best example is the DEC Alpha <strong>and</strong> the DEC NVAX. In<br />

comparable technology, the newer instruction set architecture of the Alpha<br />

allowed an implementation that is more than twice as fast
as the NVAX. In another example, Bhandarkar and Clark [1991] compared the

MIPS M/2000 <strong>and</strong> the DEC VAX 8700 by counting clock cycles of the SPEC<br />

benchmarks; they concluded that although the MIPS M/2000 executes more


358 Chapter 4 The Processor<br />

4.3 When processor designers consider a possible improvement to the processor<br />

datapath, the decision usually depends on the cost/performance trade-off. In<br />

the following three problems, assume that we are starting with a datapath from<br />

Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem, <strong>and</strong> Control blocks have<br />

latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, <strong>and</strong> 100 ps, respectively,<br />

<strong>and</strong> costs of 1000, 30, 10, 100, 200, 2000, <strong>and</strong> 500, respectively.<br />

Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the<br />

latency of the ALU <strong>and</strong> will add a cost of 600 to the ALU. The result will be 5% fewer<br />

instructions executed since we will no longer need to emulate the MUL instruction.<br />

4.3.1 [10] What is the clock cycle time with <strong>and</strong> without this improvement?<br />

4.3.2 [10] What is the speedup achieved by adding this improvement?<br />

4.3.3 [10] Compare the cost/performance ratio with <strong>and</strong> without this<br />

improvement.<br />

4.4 Problems in this exercise assume that logic blocks needed to implement a<br />

processor’s datapath have the following latencies:<br />

I-Mem Add Mux ALU Regs D-Mem Sign-Extend Shift-Left-2<br />

200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps<br />

4.4.1 [10] If the only thing we need to do in a processor is fetch consecutive<br />

instructions (Figure 4.6), what would the cycle time be?<br />

4.4.2 [10] Consider a datapath similar to the one in Figure 4.11, but for a<br />

processor that only has one type of instruction: unconditional PC-relative branch.<br />

What would the cycle time be for this datapath?<br />

4.4.3 [10] Repeat 4.4.2, but this time we need to support only conditional<br />

PC-relative branches.<br />

The remaining three problems in this exercise refer to the datapath element Shift-left-2:

4.4.4 [10] Which kinds of instructions require this resource?<br />

4.4.5 [20] For which kinds of instructions (if any) is this resource on the<br />

critical path?<br />

4.4.6 [10] Assuming that we only support beq <strong>and</strong> add instructions,<br />

discuss how changes in the given latency of this resource affect the cycle time of the<br />

processor. Assume that the latencies of other resources do not change.


4.17 Exercises 359<br />

4.5 For the problems in this exercise, assume that there are no pipeline stalls <strong>and</strong><br />

that the breakdown of executed instructions is as follows:<br />

add addi not beq lw sw<br />

20% 20% 0% 25% 25% 10%<br />

4.5.1 [10] In what fraction of all cycles is the data memory used?<br />

4.5.2 [10] In what fraction of all cycles is the input of the sign-extend<br />

circuit needed? What is this circuit doing in cycles in which its input is not needed?<br />

4.6 When silicon chips are fabricated, defects in materials (e.g., silicon) <strong>and</strong><br />

manufacturing errors can result in defective circuits. A very common defect is for<br />

one wire to affect the signal in another. This is called a cross-talk fault. A special<br />

class of cross-talk faults is when a signal is connected to a wire that has a constant<br />

logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a stuck-at-1

fault, <strong>and</strong> the affected signal always has a logical value of 0 or 1, respectively.<br />

The following problems refer to bit 0 of the Write Register input on the register file<br />

in Figure 4.24.<br />

4.6.1 [10] Let us assume that processor testing is done by filling the<br />

PC, registers, <strong>and</strong> data <strong>and</strong> instruction memories with some values (you can choose<br />

which values), letting a single instruction execute, then reading the PC, memories,<br />

<strong>and</strong> registers. These values are then examined to determine if a particular fault is<br />

present. Can you design a test (values for PC, memories, <strong>and</strong> registers) that would<br />

determine if there is a stuck-at-0 fault on this signal?<br />

4.6.2 [10] Repeat 4.6.1 for a stuck-at-1 fault. Can you use a single<br />

test for both stuck-at-0 <strong>and</strong> stuck-at-1? If yes, explain how; if no, explain why not.<br />

4.6.3 [60] If we know that the processor has a stuck-at-1 fault on<br />

this signal, is the processor still usable? To be usable, we must be able to convert<br />

any program that executes on a normal MIPS processor into a program that works<br />

on this processor. You can assume that there is enough free instruction memory<br />

<strong>and</strong> data memory to let you make the program longer <strong>and</strong> store additional<br />

data. Hint: the processor is usable if every instruction “broken” by this fault can<br />

be replaced with a sequence of “working” instructions that achieve the same<br />

effect.<br />

4.6.4 [10] Repeat 4.6.1, but now the fault to test for is whether the "MemRead" control signal becomes 0 when the RegDst control signal is 0 (no fault otherwise).

4.6.5 [10] Repeat 4.6.4, but now the fault to test for is whether the "Jump" control signal becomes 0 when the RegDst control signal is 0 (no fault otherwise).


360 Chapter 4 The Processor<br />

4.7 In this exercise we examine in detail how an instruction is executed in a<br />

single-cycle datapath. Problems in this exercise refer to a clock cycle in which the<br />

processor fetches the following instruction word:<br />

10101100011000100000000000010100.<br />

Assume that data memory is all zeros <strong>and</strong> that the processor’s registers have the<br />

following values at the beginning of the cycle in which the above instruction word<br />

is fetched:<br />

r0 r1 r2 r3 r4 r5 r6 r8 r12 r31<br />

0 –1 2 –3 –4 10 6 8 2 –16<br />

4.7.1 [5] What are the outputs of the sign-extend <strong>and</strong> the jump “Shift left<br />

2” unit (near the top of Figure 4.24) for this instruction word?<br />

4.7.2 [10] What are the values of the ALU control unit’s inputs for this<br />

instruction?<br />

4.7.3 [10] What is the new PC address after this instruction is executed?<br />

Highlight the path through which this value is determined.<br />

4.7.4 [10] For each Mux, show the values of its data output during the<br />

execution of this instruction <strong>and</strong> these register values.<br />

4.7.5 [10] For the ALU <strong>and</strong> the two add units, what are their data input<br />

values?<br />

4.7.6 [10] What are the values of all inputs for the “Registers” unit?<br />
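For problems such as 4.7, it can help to pull the instruction word apart mechanically. The sketch below extracts the standard MIPS I-format fields (opcode, rs, rt, and the 16-bit immediate) from the given 32-bit word, written here in hexadecimal; it only shows the bit manipulation involved, not the answers to the exercise.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* The instruction word from the problem statement, in hexadecimal. */
    uint32_t word = 0xAC620014u;   /* = 1010 1100 0110 0010 0000 0000 0001 0100 */

    unsigned opcode = (word >> 26) & 0x3F;        /* bits 31..26               */
    unsigned rs     = (word >> 21) & 0x1F;        /* bits 25..21               */
    unsigned rt     = (word >> 16) & 0x1F;        /* bits 20..16               */
    int      imm    = (int16_t)(word & 0xFFFF);   /* sign-extended low 16 bits */

    printf("opcode=%u rs=%u rt=%u imm=%d\n", opcode, rs, rt, imm);
    return 0;
}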

4.8 In this exercise, we examine how pipelining affects the clock cycle time of the<br />

processor. Problems in this exercise assume that individual stages of the datapath<br />

have the following latencies:<br />

IF ID EX MEM WB<br />

250ps 350ps 150ps 300ps 200ps<br />

Also, assume that instructions executed by the processor are broken down as<br />

follows:<br />

alu beq lw sw<br />

45% 20% 20% 15%<br />

4.8.1 [5] What is the clock cycle time in a pipelined <strong>and</strong> non-pipelined<br />

processor?<br />
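One way to set up 4.8.1 is sketched below, assuming, as is conventional in this chapter, that the non-pipelined (single-cycle) clock cycle is the sum of the stage latencies and the pipelined clock cycle is the latency of the slowest stage. The values come from the table above.

#include <stdio.h>

int main(void) {
    /* IF, ID, EX, MEM, WB latencies in picoseconds, from the table above. */
    int stage_ps[5] = { 250, 350, 150, 300, 200 };

    int sum = 0, max = 0;
    for (int i = 0; i < 5; i++) {
        sum += stage_ps[i];
        if (stage_ps[i] > max) max = stage_ps[i];
    }

    printf("non-pipelined clock cycle: %d ps\n", sum);  /* sum of all stages */
    printf("pipelined clock cycle:     %d ps\n", max);  /* slowest stage     */
    return 0;
}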

4.8.2 [10] What is the total latency of an LW instruction in a pipelined<br />

<strong>and</strong> non-pipelined processor?


4.17 Exercises 363<br />

4.10.6 [10] Assuming stall-on-branch <strong>and</strong> no delay slots, what is the new<br />

clock cycle time <strong>and</strong> execution time of this instruction sequence if beq address<br />

computation is moved to the MEM stage? What is the speedup from this change?<br />

Assume that the latency of the EX stage is reduced by 20 ps <strong>and</strong> the latency of the<br />

MEM stage is unchanged when branch outcome resolution is moved from EX to<br />

MEM.<br />

4.11 Consider the following loop.<br />

loop:lw r1,0(r1)<br />

<strong>and</strong> r1,r1,r2<br />

lw r1,0(r1)<br />

lw r1,0(r1)<br />

beq r1,r0,loop<br />

Assume that perfect branch prediction is used (no stalls due to control hazards),<br />

that there are no delay slots, <strong>and</strong> that the pipeline has full forwarding support. Also<br />

assume that many iterations of this loop are executed before the loop exits.<br />

4.11.1 [10] Show a pipeline execution diagram for the third iteration of<br />

this loop, from the cycle in which we fetch the first instruction of that iteration up<br />

to (but not including) the cycle in which we can fetch the first instruction of the<br />

next iteration. Show all instructions that are in the pipeline during these cycles (not<br />

just those from the third iteration).<br />

4.11.2 [10] How often (as a percentage of all cycles) do we have a cycle in<br />

which all five pipeline stages are doing useful work?<br />

4.12 This exercise is intended to help you underst<strong>and</strong> the cost/complexity/<br />

performance trade-offs of forwarding in a pipelined processor. Problems in this<br />

exercise refer to pipelined datapaths from Figure 4.45. These problems assume<br />

that, of all the instructions executed in a processor, the following fraction of these<br />

instructions have a particular type of RAW data dependence. The type of RAW<br />

data dependence is identified by the stage that produces the result (EX or MEM)<br />

<strong>and</strong> the instruction that consumes the result (1st instruction that follows the one<br />

that produces the result, 2nd instruction that follows, or both). We assume that the<br />

register write is done in the first half of the clock cycle <strong>and</strong> that register reads are<br />

done in the second half of the cycle, so “EX to 3rd” <strong>and</strong> “MEM to 3rd” dependences<br />

are not counted because they cannot result in data hazards. Also, assume that the<br />

CPI of the processor is 1 if there are no data hazards.<br />

EX to 1st Only   MEM to 1st Only   EX to 2nd Only   MEM to 2nd Only   EX to 1st and MEM to 2nd   Other RAW Dependences
5%               20%               5%               10%               10%                        10%


4.17 Exercises 365<br />

4.13.2 [10] Repeat 4.13.1 but now use nops only when a hazard cannot be<br />

avoided by changing or rearranging these instructions. You can assume register R7<br />

can be used to hold temporary values in your modified code.<br />

4.13.3 [10] If the processor has forwarding, but we forgot to implement<br />

the hazard detection unit, what happens when this code executes?<br />

4.13.4 [20] If there is forwarding, for the first five cycles during the<br />

execution of this code, specify which signals are asserted in each cycle by hazard<br />

detection <strong>and</strong> forwarding units in Figure 4.60.<br />

4.13.5 [10] If there is no forwarding, what new inputs <strong>and</strong> output signals<br />

do we need for the hazard detection unit in Figure 4.60? Using this instruction<br />

sequence as an example, explain why each signal is needed.<br />

4.13.6 [20] For the new hazard detection unit from 4.13.5, specify which<br />

output signals it asserts in each of the first five cycles during the execution of this<br />

code.<br />

4.14 This exercise is intended to help you underst<strong>and</strong> the relationship between<br />

delay slots, control hazards, <strong>and</strong> branch execution in a pipelined processor. In<br />

this exercise, we assume that the following MIPS code is executed on a pipelined<br />

processor with a 5-stage pipeline, full forwarding, <strong>and</strong> a predict-taken branch<br />

predictor:<br />

lw r2,0(r1)<br />

label1: beq r2,r0,label2 # not taken once, then taken<br />

lw r3,0(r2)<br />

beq r3,r0,label1 # taken<br />

add r1,r3,r1<br />

label2: sw r1,0(r2)<br />

4.14.1 [10] Draw the pipeline execution diagram for this code, assuming<br />

there are no delay slots <strong>and</strong> that branches execute in the EX stage.<br />

4.14.2 [10] Repeat 4.14.1, but assume that delay slots are used. In the<br />

given code, the instruction that follows the branch is now the delay slot instruction<br />

for that branch.<br />

4.14.3 [20] One way to move branch resolution one stage earlier is to
avoid needing an ALU operation for conditional branches. The branch instructions would

be “bez rd,label” <strong>and</strong> “bnez rd,label”, <strong>and</strong> it would branch if the register has<br />

<strong>and</strong> does not have a zero value, respectively. Change this code to use these branch<br />

instructions instead of beq. You can assume that register R8 is available for you<br />

to use as a temporary register, <strong>and</strong> that an seq (set if equal) R-type instruction can<br />

be used.


366 Chapter 4 The Processor<br />

Section 4.8 describes how the severity of control hazards can be reduced by moving<br />

branch execution into the ID stage. This approach involves a dedicated comparator<br />

in the ID stage, as shown in Figure 4.62. However, this approach potentially adds<br />

to the latency of the ID stage, <strong>and</strong> requires additional forwarding logic <strong>and</strong> hazard<br />

detection.<br />

4.14.4 [10] Using the first branch instruction in the given code as an<br />

example, describe the hazard detection logic needed to support branch execution<br />

in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed<br />

to detect?<br />

4.14.5 [10] For the given code, what is the speedup achieved by moving<br />

branch execution into the ID stage? Explain your answer. In your speedup<br />

calculation, assume that the additional comparison in the ID stage does not affect<br />

clock cycle time.<br />

4.14.6 [10] Using the first branch instruction in the given code as an<br />

example, describe the forwarding support that must be added to support branch<br />

execution in the ID stage. Compare the complexity of this new forwarding unit to<br />

the complexity of the existing forwarding unit in Figure 4.62.<br />

4.15 The importance of having a good branch predictor depends on how often<br />

conditional branches are executed. Together with branch predictor accuracy, this<br />

will determine how much time is spent stalling due to mispredicted branches. In<br />

this exercise, assume that the breakdown of dynamic instructions into various<br />

instruction categories is as follows:<br />

R-Type BEQ JMP LW SW<br />

40% 25% 5% 25% 5%<br />

Also, assume the following branch predictor accuracies:<br />

Always-Taken Always-Not-Taken 2-Bit<br />

45% 55% 85%<br />

4.15.1 [10] Stall cycles due to mispredicted branches increase the<br />

CPI. What is the extra CPI due to mispredicted branches with the always-taken<br />

predictor? Assume that branch outcomes are determined in the EX stage, that there<br />

are no data hazards, <strong>and</strong> that no delay slots are used.<br />

4.15.2 [10] Repeat 4.15.1 for the “always-not-taken” predictor.<br />

4.15.3 [10] Repeat 4.15.1 for the 2-bit predictor.

4.15.4 [10] With the 2-bit predictor, what speedup would be achieved if<br />

we could convert half of the branch instructions in a way that replaces a branch<br />

instruction with an ALU instruction? Assume that correctly <strong>and</strong> incorrectly<br />

predicted instructions have the same chance of being replaced.


4.17 Exercises 367<br />

4.15.5 [10] With the 2-bit predictor, what speedup would be achieved if<br />

we could convert half of the branch instructions in a way that replaced each branch<br />

instruction with two ALU instructions? Assume that correctly <strong>and</strong> incorrectly<br />

predicted instructions have the same chance of being replaced.<br />

4.15.6 [10] Some branch instructions are much more predictable than<br />

others. If we know that 80% of all executed branch instructions are easy-to-predict<br />

loop-back branches that are always predicted correctly, what is the accuracy of the<br />

2-bit predictor on the remaining 20% of the branch instructions?<br />
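For problems 4.15.1 through 4.15.3, the extra CPI contributed by mispredicted branches can be organized as branch frequency times misprediction rate times misprediction penalty. The sketch below leaves the penalty as a parameter, since its value depends on the stage in which branch outcomes are determined and is an assumption you must take from the problem statement; the branch frequency and predictor accuracies come from the tables above.

#include <stdio.h>

/* Extra CPI = fraction of branches * (1 - predictor accuracy) * stall cycles per miss */
static double extra_cpi(double branch_frac, double accuracy, int penalty_cycles) {
    return branch_frac * (1.0 - accuracy) * penalty_cycles;
}

int main(void) {
    double beq_frac = 0.25;   /* BEQ frequency from the instruction mix table          */
    int    penalty  = 2;      /* ASSUMPTION: stall cycles per mispredicted branch      */

    printf("always-taken:     %.4f\n", extra_cpi(beq_frac, 0.45, penalty));
    printf("always-not-taken: %.4f\n", extra_cpi(beq_frac, 0.55, penalty));
    printf("2-bit:            %.4f\n", extra_cpi(beq_frac, 0.85, penalty));
    return 0;
}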

4.16 This exercise examines the accuracy of various branch predictors for the<br />

following repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT<br />

4.16.1 [5] What is the accuracy of always-taken <strong>and</strong> always-not-taken<br />

predictors for this sequence of branch outcomes?<br />

4.16.2 [5] What is the accuracy of the two-bit predictor for the first 4<br />

branches in this pattern, assuming that the predictor starts off in the bottom left<br />

state from Figure 4.63 (predict not taken)?<br />

4.16.3 [10] What is the accuracy of the two-bit predictor if this pattern is<br />

repeated forever?<br />

4.16.4 [30] <strong>Design</strong> a predictor that would achieve a perfect accuracy if<br />

this pattern is repeated forever. Your predictor should be a sequential circuit with

one output that provides a prediction (1 for taken, 0 for not taken) <strong>and</strong> no inputs<br />

other than the clock <strong>and</strong> the control signal that indicates that the instruction is a<br />

conditional branch.<br />

4.16.5 [10] What is the accuracy of your predictor from 4.16.4 if it is<br />

given a repeating pattern that is the exact opposite of this one?<br />

4.16.6 [20] Repeat 4.16.4, but now your predictor should be able to<br />

eventually (after a warm-up period during which it can make wrong predictions)<br />

start perfectly predicting both this pattern <strong>and</strong> its opposite. Your predictor should<br />

have an input that tells it what the real outcome was. Hint: this input lets your<br />

predictor determine which of the two repeating patterns it is given.<br />
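For exercise 4.16, a two-bit saturating-counter predictor is easy to simulate in a few lines of C. The sketch below measures its accuracy on the repeating T, NT, T, T, NT pattern. The state numbering (0 = strongly not taken through 3 = strongly taken) and the initial state are assumptions of the sketch; set the initial state and the number of repetitions to match the particular sub-problem and the state specified from Figure 4.63.

#include <stdio.h>

int main(void) {
    /* Counter states: 0 and 1 predict not taken; 2 and 3 predict taken. */
    int state = 0;                        /* ASSUMED start: strongly not taken */
    int pattern[5] = { 1, 0, 1, 1, 0 };   /* T, NT, T, T, NT                   */
    int correct = 0, total = 0;

    for (int rep = 0; rep < 1000; rep++) {
        for (int i = 0; i < 5; i++) {
            int outcome    = pattern[i];
            int prediction = (state >= 2);          /* taken in states 2 and 3 */
            if (prediction == outcome) correct++;
            total++;
            /* Saturating update: move toward taken on taken, else toward not taken. */
            if (outcome) { if (state < 3) state++; }
            else         { if (state > 0) state--; }
        }
    }
    printf("accuracy over %d branches: %.1f%%\n", total, 100.0 * correct / total);
    return 0;
}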

4.17 This exercise explores how exception h<strong>and</strong>ling affects pipeline design. The<br />

first three problems in this exercise refer to the following two instructions:<br />

Instruction 1: BNE R1, R2, Label
Instruction 2: LW  R1, 0(R1)

4.17.1 [5] Which exceptions can each of these instructions trigger? For<br />

each of these exceptions, specify the pipeline stage in which it is detected.


368 Chapter 4 The Processor<br />

4.17.2 [10] If there is a separate h<strong>and</strong>ler address for each exception, show<br />

how the pipeline organization must be changed to be able to h<strong>and</strong>le this exception.<br />

You can assume that the addresses of these h<strong>and</strong>lers are known when the processor<br />

is designed.<br />

4.17.3 [10] If the second instruction is fetched right after the first<br />

instruction, describe what happens in the pipeline when the first instruction causes<br />

the first exception you listed in 4.17.1. Show the pipeline execution diagram from<br />

the time the first instruction is fetched until the time the first instruction of the<br />

exception h<strong>and</strong>ler is completed.<br />

4.17.4 [20] In vectored exception h<strong>and</strong>ling, the table of exception h<strong>and</strong>ler<br />

addresses is in data memory at a known (fixed) address. Change the pipeline to<br />

implement this exception h<strong>and</strong>ling mechanism. Repeat 4.17.3 using this modified<br />

pipeline <strong>and</strong> vectored exception h<strong>and</strong>ling.<br />

4.17.5 [15] We want to emulate vectored exception h<strong>and</strong>ling (described<br />

in 4.17.4) on a machine that has only one fixed h<strong>and</strong>ler address. Write the code<br />

that should be at that fixed address. Hint: this code should identify the exception,<br />

get the right address from the exception vector table, <strong>and</strong> transfer execution to that<br />

h<strong>and</strong>ler.<br />

4.18 In this exercise we compare the performance of 1-issue <strong>and</strong> 2-issue<br />

processors, taking into account program transformations that can be made to<br />

optimize for 2-issue execution. Problems in this exercise refer to the following loop<br />

(written in C):<br />

for(i=0;i!=j;i+=2)<br />

b[i]=a[i]-a[i+1];

When writing MIPS code, assume that variables are kept in registers as follows, <strong>and</strong><br />

that all registers except those indicated as Free are used to keep various variables,<br />

so they cannot be used for anything else.<br />

i j a b c Free<br />

R5 R6 R1 R2 R3 R10, R11, R12<br />

4.18.1 [10] Translate this C code into MIPS instructions. Your translation<br />

should be direct, without rearranging instructions to achieve better performance.<br />

4.18.2 [10] If the loop exits after executing only two iterations, draw a<br />

pipeline diagram for your MIPS code from 4.18.1 executed on a 2-issue processor<br />

shown in Figure 4.69. Assume the processor has perfect branch prediction <strong>and</strong> can<br />

fetch any two instructions (not just consecutive instructions) in the same cycle.<br />

4.18.3 [10] Rearrange your code from 4.18.1 to achieve better<br />

performance on a 2-issue statically scheduled processor from Figure 4.69.


4.17 Exercises 369<br />

4.18.4 [10] Repeat 4.18.2, but this time use your MIPS code from 4.18.3.<br />

4.18.5 [10] What is the speedup of going from a 1-issue processor to<br />

a 2-issue processor from Figure 4.69? Use your code from 4.18.1 for both 1-issue<br />

<strong>and</strong> 2-issue, <strong>and</strong> assume that 1,000,000 iterations of the loop are executed. As in<br />

4.18.2, assume that the processor has perfect branch predictions, <strong>and</strong> that a 2-issue<br />

processor can fetch any two instructions in the same cycle.<br />

4.18.6 [10] Repeat 4.18.5, but this time assume that in the 2-issue<br />

processor one of the instructions to be executed in a cycle can be of any kind, <strong>and</strong><br />

the other must be a non-memory instruction.<br />

4.19 This exercise explores energy efficiency <strong>and</strong> its relationship with performance.<br />

Problems in this exercise assume the following energy consumption for activity in<br />

Instruction memory, Registers, <strong>and</strong> Data memory. You can assume that the other<br />

components of the datapath spend a negligible amount of energy.<br />

I-Mem 1 Register Read Register Write D-Mem Read D-Mem Write<br />

140pJ 70pJ 60pJ 140pJ 120pJ<br />

Assume that components in the datapath have the following latencies. You can<br />

assume that the other components of the datapath have negligible latencies.<br />

I-Mem Control Register Read or Write ALU D-Mem Read or Write<br />

200ps 150ps 90ps 90ps 250ps<br />

4.19.1 [10] How much energy is spent to execute an ADD<br />

instruction in a single-cycle design <strong>and</strong> in the 5-stage pipelined design?<br />

4.19.2 [10] What is the worst-case MIPS instruction in terms of<br />

energy consumption, <strong>and</strong> what is the energy spent to execute it?<br />

4.19.3 [10] If energy reduction is paramount, how would you<br />

change the pipelined design? What is the percentage reduction in the energy spent<br />

by an LW instruction after this change?<br />

4.19.4 [10] What is the performance impact of your changes from<br />

4.19.3?<br />

4.19.5 [10] We can eliminate the MemRead control signal <strong>and</strong> have<br />

the data memory be read in every cycle, i.e., we can permanently have MemRead=1.<br />

Explain why the processor still functions correctly after this change. What is the<br />

effect of this change on clock frequency <strong>and</strong> energy consumption?<br />

4.19.6 [10] If an idle unit spends 10% of the power it would spend<br />

if it were active, what is the energy spent by the instruction memory in each cycle?<br />

What percentage of the overall energy spent by the instruction memory does this<br />

idle energy represent?


370 Chapter 4 The Processor<br />

Answers to<br />

Check Yourself<br />

§4.1, page 248: 3 of 5: Control, Datapath, Memory. Input <strong>and</strong> Output are missing.<br />

§4.2, page 251: false. Edge-triggered state elements make simultaneous reading <strong>and</strong><br />

writing both possible <strong>and</strong> unambiguous.<br />

§4.3, page 257: I. a. II. c.<br />

§4.4, page 272: Yes, Branch <strong>and</strong> ALUOp0 are identical. In addition, MemtoReg <strong>and</strong><br />

RegDst are inverses of one another. You don’t need an inverter; simply use the other<br />

signal <strong>and</strong> flip the order of the inputs to the multiplexor!<br />

§4.5, page 285: 1. Stall on the lw result. 2. Bypass the first add result written into

$t1. 3. No stall or bypass required.<br />

§4.6, page 298: Statements 2 <strong>and</strong> 4 are correct; the rest are incorrect.<br />

§4.8, page 324: 1. Predict not taken. 2. Predict taken. 3. Dynamic prediction.<br />

§4.9, page 332: The first instruction, since it is logically executed before the others.<br />

§4.10, page 344: 1. Both. 2. Both. 3. Software. 4. Hardware. 5. Hardware. 6.<br />

Hardware. 7. Both. 8. Hardware. 9. Both.<br />

§4.11, page 353: First two are false <strong>and</strong> the last two are true.


This page intentionally left blank


5<br />

Ideally one would desire an<br />

indefinitely large memory<br />

capacity such that any<br />

particular … word would be<br />

immediately available. … We<br />

are … forced to recognize the<br />

possibility of constructing a<br />

hierarchy of memories, each<br />

of which has greater capacity<br />

than the preceding but which<br />

is less quickly accessible.<br />

A. W. Burks, H. H. Goldstine, <strong>and</strong><br />

J. von Neumann<br />

Preliminary Discussion of the Logical <strong>Design</strong> of an<br />

Electronic Computing Instrument, 1946<br />

Large <strong>and</strong> Fast:<br />

Exploiting Memory<br />

Hierarchy<br />

5.1 Introduction 374<br />

5.2 Memory Technologies 378<br />

5.3 The Basics of Caches 383<br />

5.4 Measuring <strong>and</strong> Improving Cache<br />

Performance 398<br />

5.5 Dependable Memory Hierarchy 418<br />

5.6 Virtual Machines 424<br />

5.7 Virtual Memory 427<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


5.1 Introduction 375<br />

[Figure omitted: the memory hierarchy drawn as a pyramid. Moving down from the processor, speed goes from fastest to slowest, size from smallest to biggest, and cost per bit from highest to lowest; the current technologies at successive levels are SRAM, DRAM, and magnetic disk.]

FIGURE 5.1 The basic structure of a memory hierarchy. By implementing the memory system as a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many personal mobile devices, and may lead to a new level in the storage hierarchy for desktop and server computers; see Section 5.2.

Just as accesses to books on the desk naturally exhibit locality, locality in<br />

programs arises from simple <strong>and</strong> natural program structures. For example,<br />

most programs contain loops, so instructions <strong>and</strong> data are likely to be accessed<br />

repeatedly, showing high amounts of temporal locality. Since instructions are<br />

normally accessed sequentially, programs also show high spatial locality. Accesses<br />

to data also exhibit a natural spatial locality. For example, sequential accesses to<br />

elements of an array or a record will naturally have high degrees of spatial locality.<br />
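A small C fragment makes the two kinds of locality concrete: the running sum is reused on every iteration (temporal locality), while the array elements are touched at adjacent memory locations one after another (spatial locality).

/* Summing an array: the variable sum exhibits temporal locality (reused
   every iteration), and the accesses a[0], a[1], a[2], ... exhibit spatial
   locality (consecutive words in memory). The loop's instructions are also
   fetched repeatedly, giving temporal locality in the instruction stream.  */
double sum_array(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}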

We take advantage of the principle of locality by implementing the memory<br />

of a computer as a memory hierarchy. A memory hierarchy consists of multiple<br />

levels of memory with different speeds <strong>and</strong> sizes. The faster memories are more<br />

expensive per bit than the slower memories <strong>and</strong> thus are smaller.<br />

Figure 5.1 shows the faster memory is close to the processor <strong>and</strong> the slower,<br />

less expensive memory is below it. The goal is to present the user with as much<br />

memory as is available in the cheapest technology, while providing access at the<br />

speed offered by the fastest memory.<br />

The data is similarly hierarchical: a level closer to the processor is generally a<br />

subset of any level further away, <strong>and</strong> all the data is stored at the lowest level. By<br />

analogy, the books on your desk form a subset of the library you are working in,<br />

which is in turn a subset of all the libraries on campus. Furthermore, as we move<br />

away from the processor, the levels take progressively longer to access, just as we<br />

might encounter in a hierarchy of campus libraries.<br />

A memory hierarchy can consist of multiple levels, but data is copied between<br />

only two adjacent levels at a time, so we can focus our attention on just two levels.<br />

memory hierarchy A structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.


376 Chapter 5 Large <strong>and</strong> Fast: Exploiting Memory Hierarchy<br />


block (or line) The minimum unit of information that can be either present or not present in a cache.

hit rate The fraction of memory accesses found in a level of the memory hierarchy.

miss rate The fraction of memory accesses not found in a level of the memory hierarchy.

hit time The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.

miss penalty The time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, insert it in the level that experienced the miss, and then pass the block to the requestor.

FIGURE 5.2 Every pair of levels in the memory hierarchy can be thought of as having an<br />

upper <strong>and</strong> lower level. Within each level, the unit of information that is present or not is called a block or<br />

a line. Usually we transfer an entire block when we copy something between levels.<br />

The upper level—the one closer to the processor—is smaller <strong>and</strong> faster than the lower<br />

level, since the upper level uses technology that is more expensive. Figure 5.2 shows<br />

that the minimum unit of information that can be either present or not present in<br />

the two-level hierarchy is called a block or a line; in our library analogy, a block of<br />

information is one book.<br />

If the data requested by the processor appears in some block in the upper level,<br />

this is called a hit (analogous to your finding the information in one of the books<br />

on your desk). If the data is not found in the upper level, the request is called a miss.<br />

The lower level in the hierarchy is then accessed to retrieve the block containing the<br />

requested data. (Continuing our analogy, you go from your desk to the shelves to<br />

find the desired book.) The hit rate, or hit ratio, is the fraction of memory accesses<br />

found in the upper level; it is often used as a measure of the performance of the<br />

memory hierarchy. The miss rate (1−hit rate) is the fraction of memory accesses<br />

not found in the upper level.<br />

Since performance is the major reason for having a memory hierarchy, the time<br />

to service hits <strong>and</strong> misses is important. Hit time is the time to access the upper level<br />

of the memory hierarchy, which includes the time needed to determine whether<br />

the access is a hit or a miss (that is, the time needed to look through the books on<br />

the desk). The miss penalty is the time to replace a block in the upper level with<br />

the corresponding block from the lower level, plus the time to deliver this block to<br />

the processor (or the time to get another book from the shelves <strong>and</strong> place it on the<br />

desk). Because the upper level is smaller <strong>and</strong> built using faster memory parts, the<br />

hit time will be much smaller than the time to access the next level in the hierarchy,<br />

which is the major component of the miss penalty. (The time to examine the books<br />

on the desk is much smaller than the time to get up <strong>and</strong> get a new book from the<br />

shelves.)
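To make these definitions concrete, here is a small C sketch showing how hit time, miss rate, and miss penalty combine into an average access time; the numbers are made-up, illustrative values rather than figures from the text.

#include <stdio.h>

/* Illustrative sketch: how hit time, miss rate, and miss penalty combine.
   The values below are assumptions chosen only for the example. */
int main(void) {
    double hit_time_ns     = 1.0;    /* time to check and read the upper level */
    double miss_penalty_ns = 100.0;  /* time to fetch a block from the lower level */
    double hit_rate        = 0.97;   /* fraction of accesses found in the upper level */
    double miss_rate       = 1.0 - hit_rate;

    /* Every access pays the hit time; misses additionally pay the miss penalty. */
    double avg_access_ns = hit_time_ns + miss_rate * miss_penalty_ns;
    printf("miss rate = %.2f, average access time = %.1f ns\n", miss_rate, avg_access_ns);
    return 0;
}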



As we will see in this chapter, the concepts used to build memory systems affect<br />

many other aspects of a computer, including how the operating system manages<br />

memory <strong>and</strong> I/O, how compilers generate code, <strong>and</strong> even how applications use<br />

the computer. Of course, because all programs spend much of their time accessing<br />

memory, the memory system is necessarily a major factor in determining<br />

performance. The reliance on memory hierarchies to achieve performance<br />

has meant that programmers, who used to be able to think of memory as a flat,<br />

r<strong>and</strong>om access storage device, now need to underst<strong>and</strong> that memory is a hierarchy<br />

to get good performance. We show how important this underst<strong>and</strong>ing is in later<br />

examples, such as Figure 5.18 on page 408, <strong>and</strong> Section 5.14, which shows how to<br />

double matrix multiply performance.<br />

Since memory systems are critical to performance, computer designers devote a<br />

great deal of attention to these systems <strong>and</strong> develop sophisticated mechanisms for<br />

improving the performance of the memory system. In this chapter, we discuss the<br />

major conceptual ideas, although we use many simplifications <strong>and</strong> abstractions to<br />

keep the material manageable in length <strong>and</strong> complexity.<br />

Programs exhibit both temporal locality, the tendency to reuse recently<br />

accessed data items, <strong>and</strong> spatial locality, the tendency to reference data<br />

items that are close to other recently accessed items. Memory hierarchies<br />

take advantage of temporal locality by keeping more recently accessed<br />

data items closer to the processor. Memory hierarchies take advantage of<br />

spatial locality by moving blocks consisting of multiple contiguous words<br />

in memory to upper levels of the hierarchy.<br />

Figure 5.3 shows that a memory hierarchy uses smaller <strong>and</strong> faster<br />

memory technologies close to the processor. Thus, accesses that hit in the<br />

highest level of the hierarchy can be processed quickly. Accesses that miss<br />

go to lower levels of the hierarchy, which are larger but slower. If the hit<br />

rate is high enough, the memory hierarchy has an effective access time<br />

close to that of the highest (<strong>and</strong> fastest) level <strong>and</strong> a size equal to that of the<br />

lowest (<strong>and</strong> largest) level.<br />

In most systems, the memory is a true hierarchy, meaning that data<br />

cannot be present in level i unless it is also present in level i + 1.

The BIG<br />

Picture<br />

Which of the following statements are generally true?<br />

1. Memory hierarchies take advantage of temporal locality.<br />

2. On a read, the value returned depends on which blocks are in the cache.<br />

3. Most of the cost of the memory hierarchy is at the highest level.<br />

4. Most of the capacity of the memory hierarchy is at the lowest level.<br />

Check<br />

Yourself


5.2 Memory Technologies

SRAM Technology<br />

SRAMs are simply integrated circuits that are memory arrays with (usually) a<br />

single access port that can provide either a read or a write. SRAMs have a fixed<br />

access time to any datum, though the read <strong>and</strong> write access times may differ.<br />

SRAMs don’t need to refresh <strong>and</strong> so the access time is very close to the cycle<br />

time. SRAMs typically use six to eight transistors per bit to prevent the information<br />

from being disturbed when read. SRAM needs only minimal power to retain the<br />

charge in st<strong>and</strong>by mode.<br />

In the past, most PCs <strong>and</strong> server systems used separate SRAM chips for either<br />

their primary, secondary, or even tertiary caches. Today, thanks to Moore’s Law, all<br />

levels of caches are integrated onto the processor chip, so the market for separate<br />

SRAM chips has nearly evaporated.<br />

DRAM Technology<br />

In a SRAM, as long as power is applied, the value can be kept indefinitely. In a<br />

dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor.<br />

A single transistor is then used to access this stored charge, either to read the<br />

value or to overwrite the charge stored there. Because DRAMs use only a single<br />

transistor per bit of storage, they are much denser <strong>and</strong> cheaper per bit than SRAM.<br />

As DRAMs store the charge on a capacitor, it cannot be kept indefinitely <strong>and</strong> must<br />

periodically be refreshed. That is why this memory structure is called dynamic, as<br />

opposed to the static storage in an SRAM cell.<br />

To refresh the cell, we merely read its contents <strong>and</strong> write it back. The charge<br />

can be kept for several milliseconds. If every bit had to be read out of the DRAM<br />

<strong>and</strong> then written back individually, we would constantly be refreshing the DRAM,<br />

leaving no time for accessing it. Fortunately, DRAMs use a two-level decoding<br />

structure, <strong>and</strong> this allows us to refresh an entire row (which shares a word line)<br />

with a read cycle followed immediately by a write cycle.<br />

Figure 5.4 shows the internal organization of a DRAM, <strong>and</strong> Figure 5.5 shows<br />

how the density, cost, <strong>and</strong> access time of DRAMs have changed over the years.<br />

The row organization that helps with refresh also helps with performance. To<br />

improve performance, DRAMs buffer rows for repeated access. The buffer acts<br />

like an SRAM; by changing the address, r<strong>and</strong>om bits can be accessed in the buffer<br />

until the next row access. This capability improves the access time significantly,<br />

since the access time to bits in the row is much lower. Making the chip wider also<br />

improves the memory b<strong>and</strong>width of the chip. When the row is in the buffer, it<br />

can be transferred by successive addresses at whatever the width of the DRAM is<br />

(typically 4, 8, or 16 bits), or by specifying a block transfer <strong>and</strong> the starting address<br />

within the buffer.<br />

To further improve the interface to processors, DRAMs added clocks <strong>and</strong> are<br />

properly called Synchronous DRAMs or SDRAMs. The advantage of SDRAMs<br />

is that the use of a clock eliminates the time for the memory <strong>and</strong> processor to<br />

synchronize. The speed advantage of synchronous DRAMs comes from the ability<br />

to transfer the bits in the burst without having to specify additional address bits.



write from multiple banks, with each having its own row buffer. Sending an address<br />

to several banks permits them all to read or write simultaneously. For example,<br />

with four banks, there is just one access time <strong>and</strong> then accesses rotate between<br />

the four banks to supply four times the b<strong>and</strong>width. This rotating access scheme is<br />

called address interleaving.<br />

Although Personal Mobile Devices like the iPad (see Chapter 1) use individual<br />

DRAMs, memory for servers is commonly sold on small boards called dual inline

memory modules (DIMMs). DIMMs typically contain 4–16 DRAMs, <strong>and</strong> they are<br />

normally organized to be 8 bytes wide for server systems. A DIMM using DDR4-<br />

3200 SDRAMs could transfer at 8 × 3200 = 25,600 megabytes per second. Such

DIMMs are named after their b<strong>and</strong>width: PC25600. Since a DIMM can have so<br />

many DRAM chips that only a portion of them are used for a particular transfer, we<br />

need a term to refer to the subset of chips in a DIMM that share common address<br />

lines. To avoid confusion with the internal DRAM names of row <strong>and</strong> banks, we use<br />

the term memory rank for such a subset of chips in a DIMM.<br />
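The bandwidth naming above is just arithmetic on the DIMM width and transfer rate; a minimal C sketch of that calculation for the DDR4-3200 example follows.

#include <stdio.h>

/* Sketch of the DIMM bandwidth arithmetic in the text:
   an 8-byte-wide DIMM of DDR4-3200 parts transfers 8 x 3200 = 25,600 MB/s. */
int main(void) {
    int width_bytes          = 8;     /* DIMMs for servers are normally 8 bytes wide */
    int transfers_per_second = 3200;  /* millions of transfers per second (DDR4-3200) */
    int megabytes_per_second = width_bytes * transfers_per_second;
    printf("PC%d: %d MB/s\n", megabytes_per_second, megabytes_per_second);
    return 0;
}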

Elaboration: One way to measure the performance of the memory system behind the<br />

caches is the Stream benchmark [McCalpin, 1995]. It measures the performance of<br />

long vector operations. They have no temporal locality <strong>and</strong> they access arrays that are<br />

larger than the cache of the computer being tested.<br />

Flash Memory<br />

Flash memory is a type of electrically erasable programmable read-only memory<br />

(EEPROM).<br />

Unlike disks <strong>and</strong> DRAM, but like other EEPROM technologies, writes can wear out<br />

flash memory bits. To cope with such limits, most flash products include a controller<br />

to spread the writes by remapping blocks that have been written many times to less<br />

trodden blocks. This technique is called wear leveling. With wear leveling, personal<br />

mobile devices are very unlikely to exceed the write limits in the flash. Such wear<br />

leveling lowers the potential performance of flash, but it is needed unless higher-level

software monitors block wear. Flash controllers that perform wear leveling can<br />

also improve yield by mapping out memory cells that were manufactured incorrectly.<br />

Disk Memory<br />

As Figure 5.6 shows, a magnetic hard disk consists of a collection of platters, which<br />

rotate on a spindle at 5400 to 15,000 revolutions per minute. The metal platters are<br />

covered with magnetic recording material on both sides, similar to the material found<br />

on a cassette or videotape. To read <strong>and</strong> write information on a hard disk, a movable arm<br />

containing a small electromagnetic coil called a read-write head is located just above<br />

each surface. The entire drive is permanently sealed to control the environment inside<br />

the drive, which, in turn, allows the disk heads to be much closer to the drive surface.<br />

Each disk surface is divided into concentric circles, called tracks. There are<br />

typically tens of thous<strong>and</strong>s of tracks per surface. Each track is in turn divided into<br />

track One of thousands of concentric circles that makes up the surface of a magnetic disk.



sector One of the segments that make up a track on a magnetic disk; a sector is the smallest amount of information that is read or written on a disk.

sectors that contain the information; each track may have thous<strong>and</strong>s of sectors.<br />

Sectors are typically 512 to 4096 bytes in size. The sequence recorded on the<br />

magnetic media is a sector number, a gap, the information for that sector including<br />

error correction code (see Section 5.5), a gap, the sector number of the next sector,<br />

<strong>and</strong> so on.<br />

The disk heads for each surface are connected together <strong>and</strong> move in conjunction,<br />

so that every head is over the same track of every surface. The term cylinder is used<br />

to refer to all the tracks under the heads at a given point on all surfaces.<br />

FIGURE 5.6 A disk showing 10 disk platters <strong>and</strong> the read/write heads. The diameter of<br />

today’s disks is 2.5 or 3.5 inches, <strong>and</strong> there are typically one or two platters per drive today.<br />

seek The process of<br />

positioning a read/write<br />

head over the proper<br />

track on a disk.<br />

To access data, the operating system must direct the disk through a three-stage<br />

process. The first step is to position the head over the proper track. This operation is<br />

called a seek, <strong>and</strong> the time to move the head to the desired track is called the seek time.<br />

Disk manufacturers report minimum seek time, maximum seek time, <strong>and</strong> average<br />

seek time in their manuals. The first two are easy to measure, but the average is open to<br />

wide interpretation because it depends on the seek distance. The industry calculates<br />

average seek time as the sum of the time for all possible seeks divided by the number<br />

of possible seeks. Average seek times are usually advertised as 3 ms to 13 ms, but,<br />

depending on the application <strong>and</strong> scheduling of disk requests, the actual average seek<br />

time may be only 25% to 33% of the advertised number because of locality of disk



references. This locality arises both because of successive accesses to the same file <strong>and</strong><br />

because the operating system tries to schedule such accesses together.<br />

Once the head has reached the correct track, we must wait for the desired sector<br />

to rotate under the read/write head. This time is called the rotational latency or<br />

rotational delay. The average latency to the desired information is halfway around<br />

the disk. Disks rotate at 5400 RPM to 15,000 RPM. The average rotational latency<br />

at 5400 RPM is<br />

Average rotational latency = 0.5 rotation / 5400 RPM
                           = 0.5 rotation / (5400 RPM / (60 seconds/minute))
                           = 0.0056 seconds = 5.6 ms

rotational latency Also called rotational delay. The time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time.

The last component of a disk access, transfer time, is the time to transfer a block<br />

of bits. The transfer time is a function of the sector size, the rotation speed, <strong>and</strong> the<br />

recording density of a track. Transfer rates in 2012 were between 100 <strong>and</strong> 200 MB/sec.<br />

One complication is that most disk controllers have a built-in cache that stores<br />

sectors as they are passed over; transfer rates from the cache are typically higher,<br />

<strong>and</strong> were up to 750 MB/sec (6 Gbit/sec) in 2012.<br />
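As a rough illustration of how these components add up, the following C sketch computes the average rotational latency at 5400 RPM and a total access time; the seek time, sustained transfer rate, and controller overhead used here are assumed example values, not figures from the text.

#include <stdio.h>

/* Sketch of disk access time: average rotational latency is half a rotation,
   and the total adds seek, transfer, and controller overhead (assumed values). */
int main(void) {
    double rpm = 5400.0;
    double avg_rotational_ms = 0.5 / (rpm / 60.0) * 1000.0;   /* = 5.6 ms at 5400 RPM */

    double seek_ms              = 4.0;      /* assumed measured average seek */
    double sector_bytes         = 4096.0;
    double transfer_bytes_per_s = 150.0e6;  /* assumed sustained media transfer rate */
    double transfer_ms          = sector_bytes / transfer_bytes_per_s * 1000.0;
    double controller_ms        = 0.2;      /* assumed controller overhead */

    printf("rotational latency %.1f ms, total access %.2f ms\n",
           avg_rotational_ms,
           seek_ms + avg_rotational_ms + transfer_ms + controller_ms);
    return 0;
}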

Alas, where block numbers are located is no longer intuitive. The assumptions of<br />

the sector-track-cylinder model above are that nearby blocks are on the same track,<br />

blocks in the same cylinder take less time to access since there is no seek time,<br />

<strong>and</strong> some tracks are closer than others. The reason for the change was the raising<br />

of the level of the disk interfaces. To speed up sequential transfers, these higher-level

interfaces organize disks more like tapes than like r<strong>and</strong>om access devices.<br />

The logical blocks are ordered in serpentine fashion across a single surface, trying<br />

to capture all the sectors that are recorded at the same bit density to try to get best<br />

performance. Hence, sequential blocks may be on different tracks.<br />

In summary, the two primary differences between magnetic disks <strong>and</strong><br />

semiconductor memory technologies are that disks have a slower access time because<br />

they are mechanical devices—flash is 1000 times as fast <strong>and</strong> DRAM is 100,000 times<br />

as fast—yet they are cheaper per bit because they have very high storage capacity at a<br />

modest cost—disk is 10 to 100 times cheaper. Magnetic disks are nonvolatile like flash,

but unlike flash there is no write wear-out problem. However, flash is much more<br />

rugged <strong>and</strong> hence a better match to the jostling inherent in personal mobile devices.<br />

5.3 The Basics of Caches<br />

In our library example, the desk acted as a cache—a safe place to store things<br />

(books) that we needed to examine. Cache was the name chosen to represent the<br />

level of the memory hierarchy between the processor <strong>and</strong> main memory in the first<br />

commercial computer to have this extra level. The memories in the datapath in<br />

Chapter 4 are simply replaced by caches. Today, although this remains the dominant<br />

Cache: a safe place<br />

for hiding or storing<br />

things.<br />

Webster’s New World<br />

Dictionary of the<br />

American Language,<br />

Third College Edition,<br />

1988



direct-mapped cache<br />

A cache structure in<br />

which each memory<br />

location is mapped to<br />

exactly one location in the<br />

cache.<br />

use of the word cache, the term is also used to refer to any storage managed to take<br />

advantage of locality of access. Caches first appeared in research computers in the<br />

early 1960s and in production computers later in that same decade; every general-purpose

computer built today, from servers to low-power embedded processors,<br />

includes caches.<br />

In this section, we begin by looking at a very simple cache in which the processor<br />

requests are each one word <strong>and</strong> the blocks also consist of a single word. (Readers<br />

already familiar with cache basics may want to skip to Section 5.4.) Figure 5.7 shows<br />

such a simple cache, before <strong>and</strong> after requesting a data item that is not initially in<br />

the cache. Before the request, the cache contains a collection of recent references<br />

X1, X2, …, Xn−1, and the processor requests a word Xn that is not in the cache. This request results in a miss, and the word Xn is brought from memory into the cache.

In looking at the scenario in Figure 5.7, there are two questions to answer: How<br />

do we know if a data item is in the cache? Moreover, if it is, how do we find it? The<br />

answers are related. If each word can go in exactly one place in the cache, then it<br />

is straightforward to find the word if it is in the cache. The simplest way to assign<br />

a location in the cache for each word in memory is to assign the cache location<br />

based on the address of the word in memory. This cache structure is called direct<br />

mapped, since each memory location is mapped directly to exactly one location in<br />

the cache. The typical mapping between addresses <strong>and</strong> cache locations for a directmapped<br />

cache is usually simple. For example, almost all direct-mapped caches use<br />

this mapping to find a block:<br />

(Block address) modulo (Number of blocks in the cache)<br />

tag A field in a table used<br />

for a memory hierarchy<br />

that contains the address<br />

information required<br />

to identify whether the<br />

associated block in the<br />

hierarchy corresponds to<br />

a requested word.<br />

If the number of entries in the cache is a power of 2, then modulo can be computed simply by using the low-order log2 (cache size in blocks) bits of the address. Thus, an 8-block cache uses the three lowest bits (8 = 2^3) of the block address. For example, Figure 5.8 shows how the memory addresses between 1 (00001 in binary) and 29 (11101 in binary) map to locations 1 (001 in binary) and 5 (101 in binary) in a direct-mapped cache of eight words.
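A small C sketch of this mapping for the eight-block cache of Figure 5.8; the sample block addresses are chosen only for illustration.

#include <stdio.h>

/* Direct-mapped placement: the low-order log2(8) = 3 bits of the block address
   select the cache index, and the remaining upper bits form the tag. */
int main(void) {
    unsigned cache_blocks = 8;                   /* must be a power of 2 */
    unsigned addresses[]  = {1, 21, 29};         /* example word (block) addresses */
    int n = sizeof(addresses) / sizeof(addresses[0]);

    for (int i = 0; i < n; i++) {
        unsigned index = addresses[i] % cache_blocks;  /* keep the low-order bits */
        unsigned tag   = addresses[i] / cache_blocks;  /* the remaining upper bits */
        printf("block address %2u -> cache index %u, tag %u\n",
               addresses[i], index, tag);
    }
    return 0;
}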

Because each cache location can contain the contents of a number of different<br />

memory locations, how do we know whether the data in the cache corresponds<br />

to a requested word? That is, how do we know whether a requested word is in the<br />

cache or not? We answer this question by adding a set of tags to the cache. The<br />

tags contain the address information required to identify whether a word in the<br />

cache corresponds to the requested word. The tag needs only to contain the upper<br />

portion of the address, corresponding to the bits that are not used as an index into<br />

the cache. For example, in Figure 5.8 we need only have the upper 2 of the 5 address<br />

bits in the tag, since the lower 3-bit index field of the address selects the block.<br />

Architects omit the index bits because they are redundant, since by definition the<br />

index field of any address of a cache block must be that block number.<br />

We also need a way to recognize that a cache block does not have valid<br />

information. For instance, when a processor starts up, the cache does not have good<br />

data, <strong>and</strong> the tag fields will be meaningless. Even after executing many instructions,



we have conflicting demands for a block. The word at address 18 (10010 in binary) should be brought into cache block 2 (010 in binary). Hence, it must replace the word at address 26 (11010 in binary), which is already in cache block 2 (010 in binary). This behavior allows a

cache to take advantage of temporal locality: recently referenced words replace less<br />

recently referenced words.<br />

This situation is directly analogous to needing a book from the shelves <strong>and</strong><br />

having no more space on your desk—some book already on your desk must be<br />

returned to the shelves. In a direct-mapped cache, there is only one place to put the<br />

newly requested item <strong>and</strong> hence only one choice of what to replace.<br />

We know where to look in the cache for each possible address: the low-order bits<br />

of an address can be used to find the unique cache entry to which the address could<br />

map. Figure 5.10 shows how a referenced address is divided into<br />

■ A tag field, which is used to compare with the value of the tag field of the<br />

cache<br />

■ A cache index, which is used to select the block<br />

The index of a cache block, together with the tag contents of that block, uniquely<br />

specifies the memory address of the word contained in the cache block. Because<br />

the index field is used as an address to reference the cache, <strong>and</strong> because an n-bit<br />

field has 2^n values, the total number of entries in a direct-mapped cache must be a

power of 2. In the MIPS architecture, since words are aligned to multiples of four<br />

bytes, the least significant two bits of every address specify a byte within a word.<br />

Hence, the least significant two bits are ignored when selecting a word in the block.<br />

The total number of bits needed for a cache is a function of the cache size <strong>and</strong><br />

the address size, because the cache includes both the storage for the data <strong>and</strong> the<br />

tags. The size of the block above was one word, but normally it is several. For the<br />

following situation:<br />

■ 32-bit addresses<br />

■ A direct-mapped cache<br />

■ The cache size is 2^n blocks, so n bits are used for the index

■ The block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within the block, and two bits are used for the byte part of the address

the size of the tag field is

32 − (n + m + 2).

The total number of bits in a direct-mapped cache is

2^n × (block size + tag size + valid field size).



Bits in a Cache<br />

EXAMPLE<br />

How many total bits are required for a direct-mapped cache with 16 KiB of<br />

data <strong>and</strong> 4-word blocks, assuming a 32-bit address?<br />

ANSWER<br />

We know that 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2), there are 1024 (2^10) blocks. Each block has 4 × 32 or 128 bits of data plus a tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is

2^10 × (4 × 32 + (32 − 10 − 2 − 2) + 1) = 2^10 × 147 = 147 Kibibits

or 18.4 KiB for a 16 KiB cache. For this cache, the total number of bits in the cache is about 1.15 times as many as needed just for the storage of the data.
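The same bit count can be reproduced mechanically; here is a minimal C sketch of the formula, evaluated for this example's parameters (n = 10 index bits, m = 2 word-offset bits).

#include <stdio.h>

/* Bit count for a direct-mapped cache with 32-bit addresses:
   total bits = 2^n * (block data bits + tag bits + 1 valid bit). */
int main(void) {
    int n = 10;                              /* 2^10 = 1024 blocks */
    int m = 2;                               /* 2^2  = 4 words per block */
    int data_bits = (1 << m) * 32;           /* 128 bits of data per block */
    int tag_bits  = 32 - n - m - 2;          /* 18 bits of tag */
    long total    = (1L << n) * (data_bits + tag_bits + 1);
    printf("total = %ld bits = %ld Kibibits\n", total, total / 1024);
    return 0;
}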

Mapping an Address to a Multiword Cache Block<br />

EXAMPLE<br />

Consider a cache with 64 blocks <strong>and</strong> a block size of 16 bytes. To what block<br />

number does byte address 1200 map?<br />

ANSWER<br />

We saw the formula on page 384. The block is given by<br />

(Block address) modulo (Number of blocks in the cache)<br />

where the address of the block is

⌊Byte address / Bytes per block⌋

Notice that this block address is the block containing all addresses between

⌊Byte address / Bytes per block⌋ × Bytes per block
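A minimal C sketch that applies the mapping above to this example's numbers (byte address 1200, 16-byte blocks, a 64-block cache):

#include <stdio.h>

/* Byte address -> memory block address (truncating division) -> cache block
   (modulo the number of blocks in the cache). */
int main(void) {
    unsigned byte_address    = 1200;
    unsigned bytes_per_block = 16;
    unsigned cache_blocks    = 64;

    unsigned block_address = byte_address / bytes_per_block;   /* 1200 / 16 = 75 */
    unsigned cache_block   = block_address % cache_blocks;     /* 75 mod 64 = 11 */
    printf("byte %u -> memory block %u -> cache block %u\n",
           byte_address, block_address, cache_block);
    return 0;
}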



the block from the next lower level of the hierarchy <strong>and</strong> load it into the cache. The<br />

time to fetch the block has two parts: the latency to the first word <strong>and</strong> the transfer<br />

time for the rest of the block. Clearly, unless we change the memory system, the<br />

transfer time—<strong>and</strong> hence the miss penalty—will likely increase as the block size<br />

increases. Furthermore, the improvement in the miss rate starts to decrease as the<br />

blocks become larger. The result is that the increase in the miss penalty overwhelms<br />

the decrease in the miss rate for blocks that are too large, <strong>and</strong> cache performance<br />

thus decreases. Of course, if we design the memory to transfer larger blocks more<br />

efficiently, we can increase the block size <strong>and</strong> obtain further improvements in cache<br />

performance. We discuss this topic in the next section.<br />

Elaboration: Although it is hard to do anything about the longer latency component of<br />

the miss penalty for large blocks, we may be able to hide some of the transfer time so<br />

that the miss penalty is effectively smaller. The simplest method for doing this, called<br />

early restart, is simply to resume execution as soon as the requested word of the block<br />

is returned, rather than wait for the entire block. Many processors use this technique<br />

for instruction access, where it works best. Instruction accesses are largely sequential,<br />

so if the memory system can deliver a word every clock cycle, the processor may be<br />

able to restart operation when the requested word is returned, with the memory system<br />

delivering new instruction words just in time. This technique is usually less effective for<br />

data caches because it is likely that the words will be requested from the block in a<br />

less predictable way, <strong>and</strong> the probability that the processor will need another word from<br />

a different cache block before the transfer completes is high. If the processor cannot<br />

access the data cache because a transfer is ongoing, then it must stall.<br />

An even more sophisticated scheme is to organize the memory so that the requested<br />

word is transferred from the memory to the cache first. The remainder of the block is then transferred, starting with the address after the requested word and wrapping around to the beginning of the block. This technique, called requested word first or critical word first, can be slightly faster than early restart, but it is limited by the same

properties that limit early restart.<br />

cache miss A request for<br />

data from the cache that<br />

cannot be filled because<br />

the data is not present in<br />

the cache.<br />

H<strong>and</strong>ling Cache Misses<br />

Before we look at the cache of a real system, let’s see how the control unit deals with<br />

cache misses. (We describe a cache controller in detail in Section 5.9). The control<br />

unit must detect a miss <strong>and</strong> process the miss by fetching the requested data from<br />

memory (or, as we shall see, a lower-level cache). If the cache reports a hit, the<br />

computer continues using the data as if nothing happened.<br />

Modifying the control of a processor to h<strong>and</strong>le a hit is trivial; misses, however,<br />

require some extra work. The cache miss h<strong>and</strong>ling is done in collaboration with<br />

the processor control unit <strong>and</strong> with a separate controller that initiates the memory<br />

access <strong>and</strong> refills the cache. The processing of a cache miss creates a pipeline stall<br />

(Chapter 4) as opposed to an interrupt, which would require saving the state of all<br />

registers. For a cache miss, we can stall the entire processor, essentially freezing<br />

the contents of the temporary <strong>and</strong> programmer-visible registers, while we wait



for memory. More sophisticated out-of-order processors can allow execution of<br />

instructions while waiting for a cache miss, but we’ll assume in-order processors<br />

that stall on cache misses in this section.<br />

Let’s look a little more closely at how instruction misses are h<strong>and</strong>led; the same<br />

approach can be easily extended to h<strong>and</strong>le data misses. If an instruction access<br />

results in a miss, then the content of the Instruction register is invalid. To get the<br />

proper instruction into the cache, we must be able to instruct the lower level in the<br />

memory hierarchy to perform a read. Since the program counter is incremented in<br />

the first clock cycle of execution, the address of the instruction that generates an<br />

instruction cache miss is equal to the value of the program counter minus 4. Once<br />

we have the address, we need to instruct the main memory to perform a read. We<br />

wait for the memory to respond (since the access will take multiple clock cycles),<br />

<strong>and</strong> then write the words containing the desired instruction into the cache.<br />

We can now define the steps to be taken on an instruction cache miss:<br />

1. Send the original PC value (current PC – 4) to the memory.<br />

2. Instruct main memory to perform a read <strong>and</strong> wait for the memory to<br />

complete its access.<br />

3. Write the cache entry, putting the data from memory in the data portion of<br />

the entry, writing the upper bits of the address (from the ALU) into the tag<br />

field, <strong>and</strong> turning the valid bit on.<br />

4. Restart the instruction execution at the first step, which will refetch the<br />

instruction, this time finding it in the cache.<br />

The control of the cache on a data access is essentially identical: on a miss, we<br />

simply stall the processor until the memory responds with the data.<br />
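The steps above are performed by hardware, but a toy software model can make the sequence concrete. The sketch below assumes a direct-mapped cache of one-word blocks and uses an array as a stand-in for the lower level of the hierarchy; it illustrates the check, miss, fill, retry sequence and is not how a real controller is built.

#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 8

struct entry { int valid; uint32_t tag; uint32_t data; };
static struct entry cache[NBLOCKS];
static uint32_t main_memory[1024];     /* stand-in for the lower level of the hierarchy */

uint32_t cache_read(uint32_t addr) {
    uint32_t block = addr >> 2;        /* word address: drop the 2 byte-offset bits */
    uint32_t index = block % NBLOCKS;
    uint32_t tag   = block / NBLOCKS;

    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;      /* hit: continue as if nothing happened */

    /* miss: "stall", read the word from the lower level, then fill the entry */
    uint32_t word = main_memory[block % 1024];
    cache[index].valid = 1;
    cache[index].tag   = tag;
    cache[index].data  = word;
    return word;                       /* the retried access now hits */
}

int main(void) {
    main_memory[3] = 42;               /* pretend word address 3 holds a value */
    uint32_t first  = cache_read(12);  /* byte address 12 = word 3: miss, filled */
    uint32_t second = cache_read(12);  /* same address: hit */
    printf("%u %u\n", first, second);
    return 0;
}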

H<strong>and</strong>ling Writes<br />

Writes work somewhat differently. Suppose on a store instruction, we wrote the<br />

data into only the data cache (without changing main memory); then, after the<br />

write into the cache, memory would have a different value from that in the cache.<br />

In such a case, the cache <strong>and</strong> memory are said to be inconsistent. The simplest way<br />

to keep the main memory <strong>and</strong> the cache consistent is always to write the data into<br />

both the memory <strong>and</strong> the cache. This scheme is called write-through.<br />

The other key aspect of writes is what occurs on a write miss. We first fetch the<br />

words of the block from memory. After the block is fetched <strong>and</strong> placed into the<br />

cache, we can overwrite the word that caused the miss into the cache block. We also<br />

write the word to main memory using the full address.<br />

Although this design h<strong>and</strong>les writes very simply, it would not provide very<br />

good performance. With a write-through scheme, every write causes the data<br />

to be written to main memory. These writes will take a long time, likely at least<br />

100 processor clock cycles, <strong>and</strong> could slow down the processor considerably. For<br />

example, suppose 10% of the instructions are stores. If the CPI without cache<br />

write-through A scheme in which writes always update both the cache and the next lower level of the memory hierarchy, ensuring that data is always consistent between the two.



write buffer A queue that holds data while the data is waiting to be written to memory.

write-back A scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.

misses was 1.0, spending 100 extra cycles on every write would lead to a CPI of<br />

1.0 + 100 × 10% = 11, reducing performance by more than a factor of 10.

One solution to this problem is to use a write buffer. A write buffer stores the<br />

data while it is waiting to be written to memory. After writing the data into the<br />

cache <strong>and</strong> into the write buffer, the processor can continue execution. When a write<br />

to main memory completes, the entry in the write buffer is freed. If the write buffer<br />

is full when the processor reaches a write, the processor must stall until there is an<br />

empty position in the write buffer. Of course, if the rate at which the memory can<br />

complete writes is less than the rate at which the processor is generating writes, no<br />

amount of buffering can help, because writes are being generated faster than the<br />

memory system can accept them.<br />

The rate at which writes are generated may also be less than the rate at which the<br />

memory can accept them, <strong>and</strong> yet stalls may still occur. This can happen when the<br />

writes occur in bursts. To reduce the occurrence of such stalls, processors usually<br />

increase the depth of the write buffer beyond a single entry.<br />

The alternative to a write-through scheme is a scheme called write-back. In a<br />

write-back scheme, when a write occurs, the new value is written only to the block<br />

in the cache. The modified block is written to the lower level of the hierarchy when<br />

it is replaced. Write-back schemes can improve performance, especially when<br />

processors can generate writes as fast or faster than the writes can be h<strong>and</strong>led by<br />

main memory; a write-back scheme is, however, more complex to implement than<br />

write-through.<br />

In the rest of this section, we describe caches from real processors, <strong>and</strong> we<br />

examine how they h<strong>and</strong>le both reads <strong>and</strong> writes. In Section 5.8, we will describe<br />

the h<strong>and</strong>ling of writes in more detail.<br />

Elaboration: Writes introduce several complications into caches that are not present<br />

for reads. Here we discuss two of them: the policy on write misses and efficient

implementation of writes in write-back caches.<br />

Consider a miss in a write-through cache. The most common strategy is to allocate a<br />

block in the cache, called write allocate. The block is fetched from memory <strong>and</strong> then the<br />

appropriate portion of the block is overwritten. An alternative strategy is to update the portion<br />

of the block in memory but not put it in the cache, called no write allocate. The motivation is<br />

that sometimes programs write entire blocks of data, such as when the operating system<br />

zeros a page of memory. In such cases, the fetch associated with the initial write miss may<br />

be unnecessary. Some computers allow the write allocation policy to be changed on a per<br />

page basis.<br />

Actually implementing stores efficiently in a cache that uses a write-back strategy is

more complex than in a write-through cache. A write-through cache can write the data<br />

into the cache <strong>and</strong> read the tag; if the tag mismatches, then a miss occurs. Because the<br />

cache is write-through, the overwriting of the block in the cache is not catastrophic, since<br />

memory has the correct value. In a write-back cache, we must first write the block back to memory if the data in the cache is modified and we have a cache miss. If we simply

overwrote the block on a store instruction before we knew whether the store had hit in<br />

the cache (as we could for a write-through cache), we would destroy the contents of the<br />

block, which is not backed up in the next lower level of the memory hierarchy.



In a write-back cache, because we cannot overwrite the block, stores either require<br />

two cycles (a cycle to check for a hit followed by a cycle to actually perform the write) or<br />

require a write buffer to hold that data—effectively allowing the store to take only one<br />

cycle by pipelining it. When a store buffer is used, the processor does the cache lookup<br />

<strong>and</strong> places the data in the store buffer during the normal cache access cycle. Assuming<br />

a cache hit, the new data is written from the store buffer into the cache on the next<br />

unused cache access cycle.<br />

By comparison, in a write-through cache, writes can always be done in one cycle.<br />

We read the tag <strong>and</strong> write the data portion of the selected block. If the tag matches<br />

the address of the block being written, the processor can continue normally, since the<br />

correct block has been updated. If the tag does not match, the processor generates a<br />

write miss to fetch the rest of the block corresponding to that address.<br />

Many write-back caches also include write buffers that are used to reduce the miss<br />

penalty when a miss replaces a modified block. In such a case, the modified block is

moved to a write-back buffer associated with the cache while the requested block is read<br />

from memory. The write-back buffer is later written back to memory. Assuming another<br />

miss does not occur immediately, this technique halves the miss penalty when a dirty<br />

block must be replaced.<br />

An Example Cache: The Intrinsity FastMATH Processor<br />

The Intrinsity FastMATH is an embedded microprocessor that uses the MIPS<br />

architecture <strong>and</strong> a simple cache implementation. Near the end of the chapter, we<br />

will examine the more complex cache designs of ARM <strong>and</strong> Intel microprocessors,<br />

but we start with this simple, yet real, example for pedagogical reasons. Figure 5.12<br />

shows the organization of the Intrinsity FastMATH data cache.<br />

This processor has a 12-stage pipeline. When operating at peak speed, the<br />

processor can request both an instruction word <strong>and</strong> a data word on every clock.<br />

To satisfy the dem<strong>and</strong>s of the pipeline without stalling, separate instruction<br />

<strong>and</strong> data caches are used. Each cache is 16 KiB, or 4096 words, with 16-word<br />

blocks.<br />

Read requests for the cache are straightforward. Because there are separate<br />

data <strong>and</strong> instruction caches, we need separate control signals to read <strong>and</strong> write<br />

each cache. (Remember that we need to update the instruction cache when a miss<br />

occurs.) Thus, the steps for a read request to either cache are as follows:<br />

1. Send the address to the appropriate cache. The address comes either from<br />

the PC (for an instruction) or from the ALU (for data).<br />

2. If the cache signals hit, the requested word is available on the data lines.<br />

Since there are 16 words in the desired block, we need to select the right one.<br />

A block index field is used to control the multiplexor (shown at the bottom<br />

of the figure), which selects the requested word from the 16 words in the<br />

indexed block.



To take advantage of spatial locality, a cache must have a block size larger than<br />

one word. The use of a larger block decreases the miss rate <strong>and</strong> improves the<br />

efficiency of the cache by reducing the amount of tag storage relative to the amount<br />

of data storage in the cache. Although a larger block size decreases the miss rate, it<br />

can also increase the miss penalty. If the miss penalty increased linearly with the<br />

block size, larger blocks could easily lead to lower performance.<br />

To avoid performance loss, the b<strong>and</strong>width of main memory is increased to<br />

transfer cache blocks more efficiently. Common methods for increasing b<strong>and</strong>width<br />

external to the DRAM are making the memory wider <strong>and</strong> interleaving. DRAM<br />

designers have steadily improved the interface between the processor <strong>and</strong> memory<br />

to increase the b<strong>and</strong>width of burst mode transfers to reduce the cost of larger cache<br />

block sizes.<br />

Check<br />

Yourself<br />

The speed of the memory system affects the designer’s decision on the size of<br />

the cache block. Which of the following cache designer guidelines are generally<br />

valid?<br />

1. The shorter the memory latency, the smaller the cache block<br />

2. The shorter the memory latency, the larger the cache block<br />

3. The higher the memory b<strong>and</strong>width, the smaller the cache block<br />

4. The higher the memory b<strong>and</strong>width, the larger the cache block<br />

5.4 Measuring and Improving Cache Performance

In this section, we begin by examining ways to measure <strong>and</strong> analyze cache<br />

performance. We then explore two different techniques for improving cache<br />

performance. One focuses on reducing the miss rate by reducing the probability<br />

that two different memory blocks will contend for the same cache location. The<br />

second technique reduces the miss penalty by adding an additional level to the<br />

hierarchy. This technique, called multilevel caching, first appeared in high-end<br />

computers selling for more than $100,000 in 1990; since then it has become<br />

common on personal mobile devices selling for a few hundred dollars!



CPU time can be divided into the clock cycles that the CPU spends executing<br />

the program <strong>and</strong> the clock cycles that the CPU spends waiting for the memory<br />

system. Normally, we assume that the costs of cache accesses that are hits are part<br />

of the normal CPU execution cycles. Thus,<br />

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

The memory-stall clock cycles come primarily from cache misses, <strong>and</strong> we make<br />

that assumption here. We also restrict the discussion to a simplified model of the<br />

memory system. In real processors, the stalls generated by reads <strong>and</strong> writes can be<br />

quite complex, <strong>and</strong> accurate performance prediction usually requires very detailed<br />

simulations of the processor <strong>and</strong> memory system.<br />

Memory-stall clock cycles can be defined as the sum of the stall cycles coming<br />

from reads plus those coming from writes:<br />

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

The read-stall cycles can be defined in terms of the number of read accesses per<br />

program, the miss penalty in clock cycles for a read, <strong>and</strong> the read miss rate:<br />

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

Writes are more complicated. For a write-through scheme, we have two sources of<br />

stalls: write misses, which usually require that we fetch the block before continuing<br />

the write (see the Elaboration on page 394 for more details on dealing with writes),<br />

<strong>and</strong> write buffer stalls, which occur when the write buffer is full when a write<br />

occurs. Thus, the cycles stalled for writes equals the sum of these two:<br />

Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls

Because the write buffer stalls depend on the proximity of writes, <strong>and</strong> not just<br />

the frequency, it is not possible to give a simple equation to compute such stalls.<br />

Fortunately, in systems with a reasonable write buffer depth (e.g., four or more<br />

words) <strong>and</strong> a memory capable of accepting writes at a rate that significantly exceeds<br />

the average write frequency in programs (e.g., by a factor of 2), the write buffer<br />

stalls will be small, <strong>and</strong> we can safely ignore them. If a system did not meet these<br />

criteria, it would not be well designed; instead, the designer should have used either<br />

a deeper write buffer or a write-back organization.



Write-back schemes also have potential additional stalls arising from the need<br />

to write a cache block back to memory when the block is replaced. We will discuss<br />

this more in Section 5.8.<br />

In most write-through cache organizations, the read <strong>and</strong> write miss penalties are<br />

the same (the time to fetch the block from memory). If we assume that the write<br />

buffer stalls are negligible, we can combine the reads <strong>and</strong> writes by using a single<br />

miss rate <strong>and</strong> the miss penalty:<br />

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

We can also factor this as

Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Let’s consider a simple example to help us underst<strong>and</strong> the impact of cache<br />

performance on processor performance.<br />

Calculating Cache Performance<br />

EXAMPLE<br />

Assume the miss rate of an instruction cache is 2% <strong>and</strong> the miss rate of the data<br />

cache is 4%. If a processor has a CPI of 2 without any memory stalls <strong>and</strong> the<br />

miss penalty is 100 cycles for all misses, determine how much faster a processor<br />

would run with a perfect cache that never missed. Assume the frequency of all<br />

loads <strong>and</strong> stores is 36%.<br />

ANSWER<br />

The number of memory miss cycles for instructions in terms of the Instruction<br />

count (I) is<br />

Instruction miss cycles = I × 2% × 100 = 2.00 × I

As the frequency of all loads and stores is 36%, we can find the number of memory miss cycles for data references:

Data miss cycles = I × 36% × 4% × 100 = 1.44 × I


5.4 Measuring <strong>and</strong> Improving Cache Performance 401<br />

The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. This is more than three cycles of memory stall per instruction. Accordingly, the total CPI including memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is

CPU time with stalls / CPU time with perfect cache
    = (I × CPI_stall × Clock cycle) / (I × CPI_perfect × Clock cycle)
    = CPI_stall / CPI_perfect
    = 5.44 / 2

The performance with the perfect cache is better by 5.44 / 2 = 2.72.
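The arithmetic of this example is easy to reproduce; the following C sketch recomputes the stall cycles per instruction and the resulting CPI from the stated miss rates and penalty.

#include <stdio.h>

/* Reproduces the example's numbers: 2% I-cache and 4% D-cache miss rates,
   36% loads/stores, a 100-cycle miss penalty, and a base CPI of 2. */
int main(void) {
    double base_cpi     = 2.0;
    double i_miss_rate  = 0.02;
    double d_miss_rate  = 0.04;
    double mem_ops_frac = 0.36;    /* loads and stores per instruction */
    double miss_penalty = 100.0;

    double i_stalls  = i_miss_rate * miss_penalty;                /* 2.00 per instruction */
    double d_stalls  = mem_ops_frac * d_miss_rate * miss_penalty; /* 1.44 per instruction */
    double cpi_stall = base_cpi + i_stalls + d_stalls;            /* 5.44 */

    printf("CPI with stalls = %.2f, slowdown vs. perfect cache = %.2fx\n",
           cpi_stall, cpi_stall / base_cpi);
    return 0;
}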

What happens if the processor is made faster, but the memory system is not? The<br />

amount of time spent on memory stalls will take up an increasing fraction of the<br />

execution time; Amdahl’s Law, which we examined in Chapter 1, reminds us of<br />

this fact. A few simple examples show how serious this problem can be. Suppose<br />

we speed up the computer in the previous example by reducing its CPI from 2 to 1 without changing the clock rate, which might be done with an improved pipeline. The system with cache misses would then have a CPI of 1 + 3.44 = 4.44, and the system with the perfect cache would be 4.44 / 1 = 4.44 times as fast.

The amount of execution time spent on memory stalls would have risen from 3.44 / 5.44 = 63% to 3.44 / 4.44 = 77%.

Similarly, increasing the clock rate without changing the memory system also<br />

increases the performance lost due to cache misses.<br />

The previous examples <strong>and</strong> equations assume that the hit time is not a factor in<br />

determining cache performance. Clearly, if the hit time increases, the total time to<br />

access a word from the memory system will increase, possibly causing an increase in<br />

the processor cycle time. Although we will see additional examples of what can increase



hit time shortly, one example is increasing the cache size. A larger cache could clearly<br />

have a longer access time, just as, if your desk in the library was very large (say, 3 square<br />

meters), it would take longer to locate a book on the desk. An increase in hit time<br />

likely adds another stage to the pipeline, since it may take multiple cycles for a cache<br />

hit. Although it is more complex to calculate the performance impact of a deeper<br />

pipeline, at some point the increase in hit time for a larger cache could dominate the<br />

improvement in hit rate, leading to a decrease in processor performance.<br />

To capture the fact that the time to access data for both hits <strong>and</strong> misses affects<br />

performance, designers sometime use average memory access time (AMAT) as<br />

a way to examine alternative cache designs. Average memory access time is the<br />

average time to access memory considering both hits <strong>and</strong> misses <strong>and</strong> the frequency<br />

of different accesses; it is equal to the following:<br />

AMAT Time for a hit Miss rate Miss penalty<br />

Calculating Average Memory Access Time<br />

EXAMPLE<br />

Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of<br />

20 clock cycles, a miss rate of 0.05 misses per instruction, <strong>and</strong> a cache access<br />

time (including hit detection) of 1 clock cycle. Assume that the read <strong>and</strong> write<br />

miss penalties are the same <strong>and</strong> ignore other write stalls.<br />

ANSWER<br />

The average memory access time per instruction is<br />

AMAT = Time for a hit + Miss rate × Miss penalty
     = 1 + 0.05 × 20
     = 2 clock cycles

or 2 ns.

The next subsection discusses alternative cache organizations that decrease<br />

miss rate but may sometimes increase hit time; additional examples appear in<br />

Section 5.15, Fallacies <strong>and</strong> Pitfalls.<br />

Reducing Cache Misses by More Flexible Placement<br />

of Blocks<br />

So far, when we place a block in the cache, we have used a simple placement scheme:<br />

A block can go in exactly one place in the cache. As mentioned earlier, it is called<br />

direct mapped because there is a direct mapping from any block address in memory<br />

to a single location in the upper level of the hierarchy. However, there is actually a<br />

whole range of schemes for placing blocks. Direct mapped, where a block can be<br />

placed in exactly one location, is at one extreme.



At the other extreme is a scheme where a block can be placed in any location<br />

in the cache. Such a scheme is called fully associative, because a block in memory<br />

may be associated with any entry in the cache. To find a given block in a fully<br />

associative cache, all the entries in the cache must be searched because a block<br />

can be placed in any one. To make the search practical, it is done in parallel with<br />

a comparator associated with each cache entry. These comparators significantly<br />

increase the hardware cost, effectively making fully associative placement practical<br />

only for caches with small numbers of blocks.<br />

The middle range of designs between direct mapped <strong>and</strong> fully associative<br />

is called set associative. In a set-associative cache, there are a fixed number of<br />

locations where each block can be placed. A set-associative cache with n locations<br />

for a block is called an n-way set-associative cache. An n-way set-associative cache<br />

consists of a number of sets, each of which consists of n blocks. Each block in the<br />

memory maps to a unique set in the cache given by the index field, <strong>and</strong> a block can<br />

be placed in any element of that set. Thus, a set-associative placement combines<br />

direct-mapped placement <strong>and</strong> fully associative placement: a block is directly<br />

mapped into a set, <strong>and</strong> then all the blocks in the set are searched for a match. For<br />

example, Figure 5.14 shows where block 12 may be placed in a cache with eight<br />

blocks total, according to the three block placement policies.<br />

Remember that in a direct-mapped cache, the position of a memory block is<br />

given by<br />

fully associative cache A cache structure in which a block can be placed in any location in the cache.

set-associative cache A cache that has a fixed number of locations (at least two) where each block can be placed.

(Block number) modulo (Number of blocks in the cache)<br />


FIGURE 5.14 The location of a memory block whose address is 12 in a cache with eight<br />

blocks varies for direct-mapped, set-associative, and fully associative placement. In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 modulo 4) = 0; the memory block could be in either element of the set. In a fully associative

placement, the memory block for block address 12 can appear in any of the eight cache blocks.



In a set-associative cache, the set containing a memory block is given by<br />

(Block number) modulo (Number of sets in the cache)<br />

Since the block may be placed in any element of the set, all the tags of all the elements<br />

of the set must be searched. In a fully associative cache, the block can go anywhere,<br />

<strong>and</strong> all tags of all the blocks in the cache must be searched.<br />
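A short C sketch of this placement rule for an eight-block cache at the associativities shown in Figure 5.15, using memory block 12 as in Figure 5.14.

#include <stdio.h>

/* The set index is the block number modulo the number of sets; the block may
   then go in any of the ways within that set. */
int main(void) {
    unsigned total_blocks = 8;
    unsigned block_number = 12;
    unsigned ways[] = {1, 2, 4, 8};   /* direct mapped ... fully associative */

    for (int i = 0; i < 4; i++) {
        unsigned sets = total_blocks / ways[i];
        printf("%u-way: block %u -> set %u (any of %u ways)\n",
               ways[i], block_number, block_number % sets, ways[i]);
    }
    return 0;
}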

We can also think of all block placement strategies as a variation on set<br />

associativity. Figure 5.15 shows the possible associativity structures for an eight-block

cache. A direct-mapped cache is simply a one-way set-associative cache:<br />

each cache entry holds one block <strong>and</strong> each set has one element. A fully associative<br />

cache with m entries is simply an m-way set-associative cache; it has one set with m<br />

blocks, <strong>and</strong> an entry can reside in any block within that set.<br />

The advantage of increasing the degree of associativity is that it usually decreases<br />

the miss rate, as the next example shows. The main disadvantage, which we discuss<br />

in more detail shortly, is a potential increase in the hit time.<br />

[Figure 5.15 diagram: the eight-block cache drawn as one-way set associative (direct mapped) with eight one-block sets, two-way set associative with four sets of two blocks, four-way set associative with two sets of four blocks, and eight-way set associative (fully associative) with one set of eight blocks; each block holds a tag and data.]

FIGURE 5.15 An eight-block cache configured as direct mapped, two-way set associative,<br />

four-way set associative, <strong>and</strong> fully associative. The total size of the cache in blocks is equal to the<br />

number of sets times the associativity. Thus, for a fixed cache size, increasing the associativity decreases<br />

the number of sets while increasing the number of elements per set. With eight blocks, an eight-way set-associative

cache is the same as a fully associative cache.



is replaced. (We will discuss other replacement rules in more detail shortly.)<br />

Using this replacement rule, the contents of the set-associative cache after each<br />

reference looks like this:<br />

Address of memory    Hit        Contents of cache blocks after reference
block accessed       or miss    Set 0        Set 0        Set 1    Set 1

0                    miss       Memory[0]
8                    miss       Memory[0]    Memory[8]
0                    hit        Memory[0]    Memory[8]
6                    miss       Memory[0]    Memory[6]
8                    miss       Memory[8]    Memory[6]

Notice that when block 6 is referenced, it replaces block 8, since block 8 has<br />

been less recently referenced than block 0. The two-way set-associative cache<br />

has four misses, one less than the direct-mapped cache.<br />

The fully associative cache has four cache blocks (in a single set); any<br />

memory block can be stored in any cache block. The fully associative cache has<br />

the best performance, with only three misses:<br />

Address of memory    Hit        Contents of cache blocks after reference
block accessed       or miss    Block 0      Block 1      Block 2      Block 3

0                    miss       Memory[0]
8                    miss       Memory[0]    Memory[8]
0                    hit        Memory[0]    Memory[8]
6                    miss       Memory[0]    Memory[8]    Memory[6]
8                    hit        Memory[0]    Memory[8]    Memory[6]

For this series of references, three misses is the best we can do, because three<br />

unique block addresses are accessed. Notice that if we had eight blocks in the<br />

cache, there would be no replacements in the two-way set-associative cache<br />

(check this for yourself), <strong>and</strong> it would have the same number of misses as the<br />

fully associative cache. Similarly, if we had 16 blocks, all 3 caches would have<br />

the same number of misses. Even this trivial example shows that cache size <strong>and</strong><br />

associativity are not independent in determining cache performance.<br />

How much of a reduction in the miss rate is achieved by associativity?<br />

Figure 5.16 shows the improvement for a 64 KiB data cache with a 16-word block,<br />

<strong>and</strong> associativity ranging from direct mapped to eight-way. Going from one-way<br />

to two-way associativity decreases the miss rate by about 15%, but there is little<br />

further improvement in going to higher associativity.



Associativity    Data miss rate
      1              10.3%
      2               8.6%
      4               8.3%
      8               8.1%

FIGURE 5.16 The data cache miss rates for an organization like the Intrinsity FastMATH<br />

processor for SPEC CPU2000 benchmarks with associativity varying from one-way to<br />

eight-way. These results for 10 SPEC CPU2000 programs are from Hennessy <strong>and</strong> Patterson (2003).<br />

[Figure 5.17 diagram: an address divided into three fields, Tag | Index | Block offset.]

FIGURE 5.17 The three portions of an address in a set-associative or direct-mapped<br />

cache. The index is used to select the set, then the tag is used to choose the block by comparison with the<br />

blocks in the selected set. The block offset is the address of the desired data within the block.<br />

Locating a Block in the Cache<br />

Now, let’s consider the task of finding a block in a cache that is set associative.<br />

Just as in a direct-mapped cache, each block in a set-associative cache includes<br />

an address tag that gives the block address. The tag of every cache block within<br />

the appropriate set is checked to see if it matches the block address from the<br />

processor. Figure 5.17 decomposes the address. The index value is used to select<br />

the set containing the address of interest, <strong>and</strong> the tags of all the blocks in the set<br />

must be searched. Because speed is of the essence, all the tags in the selected set are<br />

searched in parallel. As in a fully associative cache, a sequential search would make<br />

the hit time of a set-associative cache too slow.<br />

If the total cache size is kept the same, increasing the associativity increases the<br />

number of blocks per set, which is the number of simultaneous compares needed<br />

to perform the search in parallel: each increase by a factor of 2 in associativity<br />

doubles the number of blocks per set <strong>and</strong> halves the number of sets. Accordingly,<br />

each factor-of-2 increase in associativity decreases the size of the index by 1 bit <strong>and</strong><br />

increases the size of the tag by 1 bit. In a fully associative cache, there is effectively<br />

only one set, <strong>and</strong> all the blocks must be checked in parallel. Thus, there is no index,<br />

<strong>and</strong> the entire address, excluding the block offset, is compared against the tag of<br />

every block. In other words, we search the entire cache without any indexing.<br />

In a direct-mapped cache, only a single comparator is needed, because the entry can<br />

be in only one block, <strong>and</strong> we access the cache simply by indexing. Figure 5.18 shows<br />

that in a four-way set-associative cache, four comparators are needed, together with<br />

a 4-to-1 multiplexor to choose among the four potential members of the selected set.<br />

The cache access consists of indexing the appropriate set <strong>and</strong> then searching the tags<br />

of the set. The costs of an associative cache are the extra comparators <strong>and</strong> any delay<br />

imposed by having to do the compare <strong>and</strong> select from among the elements of the set.
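The following C sketch models that lookup in software; the structure, sizes, and names are illustrative assumptions, and the loop stands in for the tag comparators that hardware operates in parallel.

/* Software sketch of a four-way set-associative lookup. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 256                  /* assumed number of sets for this sketch */
#define BLOCK_BYTES 16            /* four-word blocks: a 4-bit block offset */

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
struct line cache[SETS][WAYS];

/* Returns true on a hit; *way reports which of the four comparators matched. */
bool lookup(uint32_t addr, int *way)
{
    uint32_t index = (addr / BLOCK_BYTES) % SETS;   /* index selects the set   */
    uint32_t tag   = (addr / BLOCK_BYTES) / SETS;   /* remaining address bits  */
    for (int w = 0; w < WAYS; w++)                  /* hardware does these     */
        if (cache[index][w].valid &&                /* compares in parallel    */
            cache[index][w].tag == tag) {
            *way = w;                               /* drives the 4-to-1 mux   */
            return true;
        }
    return false;
}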



Choosing Which Block to Replace<br />

When a miss occurs in a direct-mapped cache, the requested block can go in<br />

exactly one position, <strong>and</strong> the block occupying that position must be replaced. In<br />

an associative cache, we have a choice of where to place the requested block, <strong>and</strong><br />

hence a choice of which block to replace. In a fully associative cache, all blocks are<br />

c<strong>and</strong>idates for replacement. In a set-associative cache, we must choose among the<br />

blocks in the selected set.<br />

The most commonly used scheme is least recently used (LRU), which we used<br />

in the previous example. In an LRU scheme, the block replaced is the one that has<br />

been unused for the longest time. The set associative example on page 405 uses<br />

LRU, which is why we replaced Memory(0) instead of Memory(6).<br />

LRU replacement is implemented by keeping track of when each element in a<br />

set was used relative to the other elements in the set. For a two-way set-associative<br />

cache, tracking when the two elements were used can be implemented by keeping<br />

a single bit in each set <strong>and</strong> setting the bit to indicate an element whenever that<br />

element is referenced. As associativity increases, implementing LRU gets harder; in<br />

Section 5.8, we will see an alternative scheme for replacement.<br />
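For the two-way case, the single-bit bookkeeping just described can be sketched in a few lines of C; the array name and number of sets are assumptions made only for illustration.

#define NUM_SETS 1024

int lru_bit[NUM_SETS];          /* which way of each set was used most recently */

void touch(int set, int way)    /* call on every hit or fill                     */
{
    lru_bit[set] = way;         /* this way is now the most recently used        */
}

int victim(int set)             /* way to replace on a miss                      */
{
    return 1 - lru_bit[set];    /* the other way is the least recently used      */
}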

least recently used<br />

(LRU) A replacement<br />

scheme in which the<br />

block replaced is the one<br />

that has been unused for<br />

the longest time.<br />

Size of Tags versus Set Associativity<br />

Increasing associativity requires more comparators <strong>and</strong> more tag bits per<br />

cache block. Assuming a cache of 4096 blocks, a 4-word block size, <strong>and</strong> a<br />

32-bit address, find the total number of sets <strong>and</strong> the total number of tag bits<br />

for caches that are direct mapped, two-way <strong>and</strong> four-way set associative, <strong>and</strong><br />

fully associative.<br />

EXAMPLE<br />

ANSWER

Since there are 16 (= 2^4) bytes per block, a 32-bit address yields 32 - 4 = 28 bits to be used for index and tag. The direct-mapped cache has the same number of sets as blocks, and hence 12 bits of index, since log2(4096) = 12; hence, the total number is (28 - 12) × 4096 = 16 × 4096 ≈ 66 K tag bits.

Each degree of associativity decreases the number of sets by a factor of 2 and thus decreases the number of bits used to index the cache by 1 and increases the number of bits in the tag by 1. Thus, for a two-way set-associative cache, there are 2048 sets, and the total number of tag bits is (28 - 11) × 2 × 2048 = 34 × 2048 ≈ 70 K tag bits. For a four-way set-associative cache, the total number of sets is 1024, and the total number is (28 - 10) × 4 × 1024 = 72 × 1024 ≈ 74 K tag bits.

For a fully associative cache, there is only one set with 4096 blocks, and the tag is 28 bits, leading to 28 × 4096 × 1 ≈ 115 K tag bits.
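The arithmetic is mechanical enough to script. The C sketch below (ours, not from the text) recomputes the totals for all four organizations.

/* Tag-bit totals for a 4096-block cache, 4-word blocks, 32-bit addresses. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int blocks = 4096, addr_bits = 32, offset_bits = 4;   /* 16-byte blocks      */
    int assoc[] = {1, 2, 4, 4096};                        /* 4096-way here means */
                                                          /* fully associative   */
    for (int i = 0; i < 4; i++) {
        int sets = blocks / assoc[i];
        int index_bits = (int) log2(sets);
        int tag_bits = addr_bits - offset_bits - index_bits;
        printf("%4d-way: %4d sets, %2d-bit tag, %6d total tag bits\n",
               assoc[i], sets, tag_bits, tag_bits * blocks);
    }
    return 0;
}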



Reducing the Miss Penalty Using Multilevel Caches<br />

All modern computers make use of caches. To close the gap further between the<br />

fast clock rates of modern processors <strong>and</strong> the increasingly long time required to<br />

access DRAMs, most microprocessors support an additional level of caching. This<br />

second-level cache is normally on the same chip <strong>and</strong> is accessed whenever a miss<br />

occurs in the primary cache. If the second-level cache contains the desired data,<br />

the miss penalty for the first-level cache will be essentially the access time of the<br />

second-level cache, which will be much less than the access time of main memory.<br />

If neither the primary nor the secondary cache contains the data, a main memory<br />

access is required, <strong>and</strong> a larger miss penalty is incurred.<br />

How significant is the performance improvement from the use of a secondary<br />

cache? The next example shows us.<br />

Performance of Multilevel Caches<br />

EXAMPLE<br />


Suppose we have a processor with a base CPI of 1.0, assuming all references<br />

hit in the primary cache, <strong>and</strong> a clock rate of 4 GHz. Assume a main memory<br />

access time of 100 ns, including all the miss h<strong>and</strong>ling. Suppose the miss rate<br />

per instruction at the primary cache is 2%. How much faster will the processor<br />

be if we add a secondary cache that has a 5 ns access time for either a hit or<br />

a miss <strong>and</strong> is large enough to reduce the miss rate to main memory to 0.5%?<br />

ANSWER

The miss penalty to main memory is

100 ns / (0.25 ns per clock cycle) = 400 clock cycles

The effective CPI with one level of caching is given by

Total CPI = Base CPI + Memory-stall cycles per instruction

For the processor with one level of caching,

Total CPI = 1.0 + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9

With two levels of caching, a miss in the primary (or first-level) cache can be<br />

satisfied either by the secondary cache or by main memory. The miss penalty<br />

for an access to the second-level cache is<br />

5 ns / (0.25 ns per clock cycle) = 20 clock cycles



If the miss is satisfied in the secondary cache, then this is the entire miss<br />

penalty. If the miss needs to go to main memory, then the total miss penalty is<br />

the sum of the secondary cache access time <strong>and</strong> the main memory access time.<br />

Thus, for a two-level cache, total CPI is the sum of the stall cycles from both<br />

levels of cache <strong>and</strong> the base CPI:<br />

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
          = 1 + 2% × 20 + 0.5% × 400 = 1 + 0.4 + 2.0 = 3.4

Thus, the processor with the secondary cache is faster by

9.0 / 3.4 = 2.6

Alternatively, we could have computed the stall cycles by summing the stall cycles of those references that hit in the secondary cache ((2% - 0.5%) × 20 = 0.3). The stall cycles of those references that go to main memory, which must include the cost to access the secondary cache as well as the main memory access time, are (0.5% × (20 + 400) = 2.1). The sum, 1.0 + 0.3 + 2.1, is again 3.4.
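The same calculation can be expressed directly in code. The following sketch (ours, not from the text) recomputes the one-level and two-level CPIs and the resulting speedup.

/* CPI with and without a secondary cache, using the example's parameters. */
#include <stdio.h>

int main(void)
{
    double base_cpi = 1.0;
    double cycle_ns = 0.25;                    /* 4 GHz clock                   */
    double mem_ns = 100.0, l2_ns = 5.0;
    double l1_miss = 0.02, l2_miss = 0.005;    /* misses per instruction        */

    double mem_penalty = mem_ns / cycle_ns;    /* 400 clock cycles              */
    double l2_penalty  = l2_ns / cycle_ns;     /*  20 clock cycles              */

    double cpi_one_level = base_cpi + l1_miss * mem_penalty;             /* 9.0 */
    double cpi_two_level = base_cpi + l1_miss * l2_penalty
                                    + l2_miss * mem_penalty;             /* 3.4 */

    printf("One level: %.1f  Two levels: %.1f  Speedup: %.1f\n",
           cpi_one_level, cpi_two_level, cpi_one_level / cpi_two_level);
    return 0;
}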

The design considerations for a primary <strong>and</strong> secondary cache are significantly<br />

different, because the presence of the other cache changes the best choice versus<br />

a single-level cache. In particular, a two-level cache structure allows the primary<br />

cache to focus on minimizing hit time to yield a shorter clock cycle or fewer<br />

pipeline stages, while allowing the secondary cache to focus on miss rate to reduce<br />

the penalty of long memory access times.<br />

The effect of these changes on the two caches can be seen by comparing each<br />

cache to the optimal design for a single level of cache. In comparison to a single-level

cache, the primary cache of a multilevel cache is often smaller. Furthermore,<br />

the primary cache may use a smaller block size, to go with the smaller cache size <strong>and</strong><br />

also to reduce the miss penalty. In comparison, the secondary cache will be much<br />

larger than in a single-level cache, since the access time of the secondary cache is<br />

less critical. With a larger total size, the secondary cache may use a larger block size<br />

than appropriate with a single-level cache. It often uses higher associativity than<br />

the primary cache given the focus of reducing miss rates.<br />

multilevel cache<br />

A memory hierarchy with<br />

multiple levels of caches,<br />

rather than just a cache<br />

<strong>and</strong> main memory.<br />

Sorting has been exhaustively analyzed to find better algorithms: Bubble Sort,<br />

Quicksort, Radix Sort, and so on. Figure 5.19(a) shows instructions executed per item sorted for Radix Sort versus Quicksort. As expected, for large arrays, Radix

Sort has an algorithmic advantage over Quicksort in terms of number of operations.<br />

Figure 5.19(b) shows time per key instead of instructions executed. We see that the<br />

lines start on the same trajectory as in Figure 5.19(a), but then the Radix Sort line<br />

Understanding Program Performance



[Figure 5.19 plots: (a) Instructions/item, (b) Clock cycles/item, and (c) Cache misses/item, each plotted against Size (K items to sort) from 4 to 4096, for Radix Sort and Quicksort.]

FIGURE 5.19 Comparing Quicksort <strong>and</strong> Radix Sort by (a) instructions executed per item<br />

sorted, (b) time per item sorted, <strong>and</strong> (c) cache misses per item sorted. This data is from a<br />

paper by LaMarca <strong>and</strong> Ladner [1996]. Due to such results, new versions of Radix Sort have been invented<br />

that take memory hierarchy into account, to regain its algorithmic advantages (see Section 5.15). The basic<br />

idea of cache optimizations is to use all the data in a block repeatedly before it is replaced on a miss.



diverges as the data to sort increases. What is going on? Figure 5.19(c) answers by<br />

looking at the cache misses per item sorted: Quicksort consistently has many fewer<br />

misses per item to be sorted.<br />

Alas, st<strong>and</strong>ard algorithmic analysis often ignores the impact of the memory<br />

hierarchy. As faster clock rates <strong>and</strong> Moore’s Law allow architects to squeeze all of<br />

the performance out of a stream of instructions, using the memory hierarchy well<br />

is critical to high performance. As we said in the introduction, underst<strong>and</strong>ing the<br />

behavior of the memory hierarchy is critical to underst<strong>and</strong>ing the performance of<br />

programs on today’s computers.<br />

Software Optimization via Blocking<br />

Given the importance of the memory hierarchy to program performance, not<br />

surprisingly many software optimizations were invented that can dramatically<br />

improve performance by reusing data within the cache <strong>and</strong> hence lower miss rates<br />

due to improved temporal locality.<br />

When dealing with arrays, we can get good performance from the memory<br />

system if we store the array in memory so that accesses to the array are sequential<br />

in memory. Suppose that we are dealing with multiple arrays, however, with some<br />

arrays accessed by rows <strong>and</strong> some by columns. Storing the arrays row-by-row<br />

(called row major order) or column-by-column (column major order) does not<br />

solve the problem because both rows <strong>and</strong> columns are used in every loop iteration.<br />

Instead of operating on entire rows or columns of an array, blocked algorithms<br />

operate on submatrices or blocks. The goal is to maximize accesses to the data<br />

loaded into the cache before the data are replaced; that is, improve temporal locality<br />

to reduce cache misses.<br />

For example, the inner loops of DGEMM (lines 4 through 9 of Figure 3.21 in<br />

Chapter 3) are<br />

for (int j = 0; j < n; ++j)
{
    double cij = C[i+j*n];               /* cij = C[i][j] */
    for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n];      /* cij += A[i][k]*B[k][j] */
    C[i+j*n] = cij;                      /* C[i][j] = cij */
}

It reads all N-by-N elements of B, reads the same N elements in what corresponds to<br />

one row of A repeatedly, <strong>and</strong> writes what corresponds to one row of N elements of<br />

C. (The comments make the rows <strong>and</strong> columns of the matrices easier to identify.)<br />

Figure 5.20 gives a snapshot of the accesses to the three arrays. A dark shade<br />

indicates a recent access, a light shade indicates an older access, <strong>and</strong> white means<br />

not yet accessed.



[Figure 5.20 diagram: 6-by-6 snapshots of the three arrays, with C indexed by i and j, A indexed by i and k, and B indexed by k and j; shading marks how recently each element has been accessed.]

FIGURE 5.20 A snapshot of the three arrays C, A, and B when N = 6 and i = 1. The age of accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses. Compared to Figure 5.21, elements of A and B are read repeatedly to calculate new elements of C. The variables i, j, and k are shown along the rows or columns used to access the arrays.

The number of capacity misses clearly depends on N <strong>and</strong> the size of the cache. If<br />

it can hold all three N-by-N matrices, then all is well, provided there are no cache<br />

conflicts. We purposely picked the matrix size to be 32 by 32 in DGEMM for<br />

Chapters 3 and 4 so that this would be the case. Each matrix is 32 × 32 = 1024 elements and each element is 8 bytes, so the three matrices occupy 24 KiB, which comfortably fits in the 32 KiB data cache of the Intel Core i7 (Sandy Bridge).

If the cache can hold one N-by-N matrix and one row of N, then at least the ith row of A and the array B may stay in the cache. Less than that and misses may occur for both B and C. In the worst case, there would be 2N^3 + N^2 memory words accessed for N^3 operations.

To ensure that the elements being accessed can fit in the cache, the original code<br />

is changed to compute on a submatrix. Hence, we essentially invoke the version of<br />

DGEMM from Figure 4.80 in Chapter 4 repeatedly on matrices of size BLOCKSIZE<br />

by BLOCKSIZE. BLOCKSIZE is called the blocking factor.<br />

Figure 5.21 shows the blocked version of DGEMM. The function do_block is<br />

DGEMM from Figure 3.21 with three new parameters si, sj, <strong>and</strong> sk to specify<br />

the starting position of each submatrix of of A, B, <strong>and</strong> C. The two inner loops of the<br />

do_block now compute in steps of size BLOCKSIZE rather than the full length<br />

of B <strong>and</strong> C. The gcc optimizer removes any function call overhead by “inlining” the<br />

function; that is, it inserts the code directly to avoid the conventional parameter<br />

passing <strong>and</strong> return address bookkeeping instructions.<br />

Figure 5.22 illustrates the accesses to the three arrays using blocking. Looking<br />

only at capacity misses, the total number of memory words accessed is 2N^3/BLOCKSIZE + N^2. This total is an improvement by about a factor of BLOCKSIZE.

Hence, blocking exploits a combination of spatial <strong>and</strong> temporal locality, since A<br />

benefits from spatial locality <strong>and</strong> B benefits from temporal locality.



#define BLOCKSIZE 32
void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C)
{
    for (int i = si; i < si+BLOCKSIZE; ++i)
        for (int j = sj; j < sj+BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];             /* cij = C[i][j] */
            for (int k = sk; k < sk+BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                    /* C[i][j] = cij */
        }
}

void dgemm (int n, double* A, double* B, double* C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

FIGURE 5.21 Cache blocked version of DGEMM in Figure 3.21. Assume C is initialized to zero. The do_block<br />

function is basically DGEMM from Chapter 3 with new parameters to specify the starting positions of the submatrices of<br />

BLOCKSIZE. The gcc optimizer can remove the function overhead instructions by inlining the do_block function.<br />
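Figure 5.21 shows no caller. A hypothetical driver like the following (our sketch, not part of the figure) could exercise it; it assumes n is a multiple of BLOCKSIZE and that C starts out zeroed, as the caption requires.

#include <stdlib.h>

void dgemm(int n, double *A, double *B, double *C);   /* Figure 5.21 */

int main(void)
{
    int n = 960;                               /* a multiple of BLOCKSIZE (32)  */
    double *A = malloc(n * n * sizeof(double));
    double *B = malloc(n * n * sizeof(double));
    double *C = calloc(n * n, sizeof(double)); /* C initialized to zero         */

    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    dgemm(n, A, B, C);                         /* C = A * B, block by block     */

    free(A); free(B); free(C);
    return 0;
}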

[Figure 5.22 diagram: the same three 6-by-6 arrays as in Figure 5.20, but with only a 3-by-3 block of each array touched so far.]

FIGURE 5.22 The age of accesses to the arrays C, A, and B when BLOCKSIZE = 3. Note that, in contrast to Figure 5.20, fewer elements are accessed.

Although we have aimed at reducing cache misses, blocking can also be used to<br />

help register allocation. By taking a small blocking size such that the block can be<br />

held in registers, we can minimize the number of loads <strong>and</strong> stores in the program,<br />

which also improves performance.



[Figure 5.23 bar chart, GFLOPS by matrix size (reading the bar labels left to right): Unoptimized 1.7, 1.5, 1.3, 0.8 and Blocked 1.7, 1.6, 1.6, 1.5 for 32x32, 160x160, 480x480, and 960x960, respectively.]

FIGURE 5.23 Performance of unoptimized DGEMM (Figure 3.21) versus cache blocked<br />

DGEMM (Figure 5.21) as the matrix dimension varies from 32x32 (where all three matrices<br />

fit in the cache) to 960x960.<br />

Figure 5.23 shows the impact of cache blocking on the performance of the<br />

unoptimized DGEMM as we increase the matrix size beyond where all three<br />

matrices fit in the cache. The unoptimized performance is halved for the largest<br />

matrix. The cache-blocked version is less than 10% slower even at matrices that are<br />

960x960, or 900 times larger than the 32 × 32 matrices in Chapters 3 <strong>and</strong> 4.<br />

global miss rate The<br />

fraction of references<br />

that miss in all levels of a<br />

multilevel cache.<br />

local miss rate The<br />

fraction of references to<br />

one level of a cache that<br />

miss; used in multilevel<br />

hierarchies.<br />

Elaboration: Multilevel caches create several complications. First, there are now<br />

several different types of misses <strong>and</strong> corresponding miss rates. In the example on<br />

pages 410–411, we saw the primary cache miss rate <strong>and</strong> the global miss rate—the<br />

fraction of references that missed in all cache levels. There is also a miss rate for the<br />

secondary cache, which is the ratio of all misses in the secondary cache divided by the<br />

number of accesses to it. This miss rate is called the local miss rate of the secondary<br />

cache. Because the primary cache filters accesses, especially those with good spatial

<strong>and</strong> temporal locality, the local miss rate of the secondary cache is much higher than the<br />

global miss rate. For the example on pages 410–411, we can compute the local miss<br />

rate of the secondary cache as 0.5%/2% = 25%! Luckily, the global miss rate dictates

how often we must access the main memory.<br />
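In code, that relationship is a one-line division; the small sketch below (ours) uses the numbers from the example on pages 410-411.

/* Local miss rate of the secondary cache from the global and L1 miss rates. */
#include <stdio.h>

int main(void)
{
    double l1_miss_rate = 0.02;       /* primary-cache misses per reference    */
    double global_miss_rate = 0.005;  /* references that miss in all levels    */

    /* The secondary cache only sees the primary cache's misses, so: */
    double l2_local_miss_rate = global_miss_rate / l1_miss_rate;
    printf("L2 local miss rate = %.0f%%\n", l2_local_miss_rate * 100);  /* 25% */
    return 0;
}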

Elaboration: With out-of-order processors (see Chapter 4), performance is more<br />

complex, since they execute instructions during the miss penalty. Instead of instruction<br />

miss rates <strong>and</strong> data miss rates, we use misses per instruction, <strong>and</strong> this formula:<br />

Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency - Overlapped miss latency)



There is no general way to calculate overlapped miss latency, so evaluations of<br />

memory hierarchies for out-of-order processors inevitably require simulation of the<br />

processor <strong>and</strong> the memory hierarchy. Only by seeing the execution of the processor<br />

during each miss can we see if the processor stalls waiting for data or simply finds other

work to do. A guideline is that the processor often hides the miss penalty for an L1<br />

cache miss that hits in the L2 cache, but it rarely hides a miss to the L2 cache.<br />

Elaboration: The performance challenge for algorithms is that the memory hierarchy<br />

varies between different implementations of the same architecture in cache size,<br />

associativity, block size, <strong>and</strong> number of caches. To cope with such variability, some<br />

recent numerical libraries parameterize their algorithms <strong>and</strong> then search the parameter<br />

space at runtime to find the best combination for a particular computer. This approach

is called autotuning.<br />

Which of the following is generally true about a design with multiple levels of<br />

caches?<br />

1. First-level caches are more concerned about hit time, <strong>and</strong> second-level<br />

caches are more concerned about miss rate.<br />

2. First-level caches are more concerned about miss rate, <strong>and</strong> second-level<br />

caches are more concerned about hit time.<br />

Check<br />

Yourself<br />

Summary<br />

In this section, we focused on four topics: cache performance, using associativity to<br />

reduce miss rates, the use of multilevel cache hierarchies to reduce miss penalties,<br />

<strong>and</strong> software optimizations to improve effectiveness of caches.<br />

The memory system has a significant effect on program execution time. The<br />

number of memory-stall cycles depends on both the miss rate <strong>and</strong> the miss penalty.<br />

The challenge, as we will see in Section 5.8, is to reduce one of these factors without<br />

significantly affecting other critical factors in the memory hierarchy.<br />

To reduce the miss rate, we examined the use of associative placement schemes.<br />

Such schemes can reduce the miss rate of a cache by allowing more flexible<br />

placement of blocks within the cache. Fully associative schemes allow blocks to be<br />

placed anywhere, but also require that every block in the cache be searched to satisfy<br />

a request. The higher costs make large fully associative caches impractical. Set-associative

caches are a practical alternative, since we need only search among the<br />

elements of a unique set that is chosen by indexing. Set-associative caches have higher<br />

miss rates but are faster to access. The amount of associativity that yields the best<br />

performance depends on both the technology <strong>and</strong> the details of the implementation.<br />

We looked at multilevel caches as a technique to reduce the miss penalty by<br />

allowing a larger secondary cache to handle misses to the primary cache. Second-level

caches have become commonplace as designers find that limited silicon <strong>and</strong><br />

the goals of high clock rates prevent primary caches from becoming large. The<br />

secondary cache, which is often ten or more times larger than the primary cache,<br />

h<strong>and</strong>les many accesses that miss in the primary cache. In such cases, the miss<br />

penalty is that of the access time to the secondary cache (typically < 10 processor



error detection<br />

code A code that<br />

enables the detection of<br />

an error in data, but not<br />

the precise location <strong>and</strong>,<br />

hence, correction of the<br />

error.<br />

The Hamming Single Error Correcting, Double Error<br />

Detecting Code (SEC/DED)<br />

Richard Hamming invented a popular redundancy scheme for memory, for which<br />

he received the Turing Award in 1968. To invent redundant codes, it is helpful<br />

to talk about how “close” correct bit patterns can be. What we call the Hamming<br />

distance is just the minimum number of bits that are different between any two<br />

correct bit patterns. For example, the distance between 011011 <strong>and</strong> 001111 is two.<br />

What happens if the minimum distance between members of a code is two, and

we get a one-bit error? It will turn a valid pattern in a code to an invalid one. Thus,<br />

if we can detect whether members of a code are valid or not, we can detect single<br />

bit errors, <strong>and</strong> can say we have a single bit error detection code.<br />

Hamming used a parity code for error detection. In a parity code, the number<br />

of 1s in a word is counted; the word has odd parity if the number of 1s is odd <strong>and</strong><br />

even otherwise. When a word is written into memory, the parity bit is also written<br />

(1 for odd, 0 for even). That is, the parity of the N+1 bit word should always be even.<br />

Then, when the word is read out, the parity bit is read <strong>and</strong> checked. If the parity of the<br />

memory word <strong>and</strong> the stored parity bit do not match, an error has occurred.<br />

EXAMPLE<br />

Calculate the parity of a byte with the value 31ten and show the pattern stored to

memory. Assume the parity bit is on the right. Suppose the most significant bit<br />

was inverted in memory, <strong>and</strong> then you read it back. Did you detect the error?<br />

What happens if the two most significant bits are inverted?<br />

ANSWER<br />

31ten is 00011111two, which has five 1s. To make parity even, we need to write a 1 in the parity bit, or 000111111two. If the most significant bit is inverted when we read it back, we would see 100111111two, which has seven 1s. Since we expect even parity and calculated odd parity, we would signal an error. If the two most significant bits are inverted, we would see 110111111two, which has eight 1s, or even parity, and we would not signal an error.

If there are 2 bits of error, then a 1-bit parity scheme will not detect any errors,<br />

since the parity will match the data with two errors. (Actually, a 1-bit parity scheme<br />

can detect any odd number of errors; however, the probability of having 3 errors is<br />

much lower than the probability of having two, so, in practice, a 1-bit parity code is<br />

limited to detecting a single bit of error.)<br />
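A parity bit is just as cheap to compute in software. The C sketch below (ours, not from the text) reproduces the example above: append an even-parity bit on the right, then recheck the parity on a read.

/* Even-parity scheme: the 9-bit stored word always has an even number of 1s. */
#include <stdio.h>

int count_ones(unsigned v) { int n = 0; for (; v; v >>= 1) n += v & 1; return n; }

int main(void)
{
    unsigned byte = 31;                       /* 00011111, five 1s              */
    unsigned parity = count_ones(byte) & 1;   /* 1, to make the total even      */
    unsigned stored = (byte << 1) | parity;   /* 000111111, parity on the right */

    /* On a read, recompute parity over all nine bits; odd parity means error. */
    printf("error detected: %s\n", (count_ones(stored) & 1) ? "yes" : "no");
    return 0;
}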

Of course, a parity code cannot correct errors, which Hamming wanted to do<br />

as well as detect them. If we used a code that had a minimum distance of 3, then<br />

any single bit error would be closer to the correct pattern than to any other valid<br />

pattern. He came up with an easy to underst<strong>and</strong> mapping of data into a distance 3<br />

code that we call Hamming Error Correction Code (ECC) in his honor. We use extra



EXAMPLE<br />

Assume one byte data value is 10011010two. First show the Hamming ECC code

for that byte, <strong>and</strong> then invert bit 10 <strong>and</strong> show that the ECC code finds <strong>and</strong><br />

corrects the single bit error.<br />

ANSWER<br />

Leaving spaces for the parity bits, the 12-bit pattern is _ _ 1 _ 0 0 1 _ 1 0 1 0.

Position 1 checks bits 1, 3, 5, 7, 9, and 11; the data bits in those positions are 1, 0, 1, 1, 1 (an even number of 1s), so to make the group even parity we set bit 1 to 0.

Position 2 checks bits 2, 3, 6, 7, 10, and 11, which hold 1, 0, 1, 0, 1 (odd parity), so we set position 2 to a 1.

Position 4 checks bits 4, 5, 6, 7, and 12, which hold 0, 0, 1, 0 (odd parity), so we set it to a 1.

Position 8 checks bits 8, 9, 10, 11, and 12, which hold 1, 0, 1, 0 (even parity), so we set it to a 0.

The final code word is 011100101010. Inverting bit 10 changes it to 011100101110.

Parity bit 1 is 0 (its group, bits 1, 3, 5, 7, 9, and 11 of 011100101110, has four 1s, so even parity; this group is OK).

Parity bit 2 is 1 (its group, bits 2, 3, 6, 7, 10, and 11, has five 1s, so odd parity; there is an error somewhere).

Parity bit 4 is 1 (its group, bits 4, 5, 6, 7, and 12, has two 1s, so even parity; this group is OK).

Parity bit 8 is 0 (its group, bits 8, 9, 10, 11, and 12, has three 1s, so odd parity; there is an error somewhere).

Parity bits 2 and 8 are incorrect. As 2 + 8 = 10, bit 10 must be wrong. Hence, we can correct the error by inverting bit 10: 011100101010. Voila!
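The construction in this example can be automated. The following C sketch (ours; the function names are invented) builds the 12-bit code word for 10011010two and then uses the failing parity groups as a syndrome to locate the flipped bit.

/* Hamming SEC over 8 data bits: positions 1,2,4,8 hold parity, the rest data. */
#include <stdio.h>

int bits[13];                                /* bits[1..12] of the code word    */

int is_parity_pos(int n) { return (n & (n - 1)) == 0; }   /* 1, 2, 4, 8         */

void encode(unsigned data)                   /* 8 data bits, MSB placed first   */
{
    int d = 7;
    for (int pos = 1; pos <= 12; pos++)
        if (!is_parity_pos(pos)) bits[pos] = (data >> d--) & 1;
    for (int p = 1; p <= 8; p <<= 1) {       /* parity positions 1, 2, 4, 8     */
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && pos != p) parity ^= bits[pos];
        bits[p] = parity;                    /* even parity over the group      */
    }
}

int syndrome(void)                           /* 0 = no error, else bad position */
{
    int s = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p) parity ^= bits[pos];
        if (parity) s += p;                  /* this group came out odd         */
    }
    return s;
}

int main(void)
{
    encode(0x9A);                            /* 10011010, as in the example     */
    for (int i = 1; i <= 12; i++) printf("%d", bits[i]);   /* 011100101010      */
    bits[10] ^= 1;                           /* inject the example's error      */
    printf("\nerror at position %d\n", syndrome());        /* prints 10         */
    return 0;
}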

Hamming did not stop at single bit error correction code. At the cost of one more<br />

bit, we can make the minimum Hamming distance in a code be 4. This means<br />

we can correct single bit errors <strong>and</strong> detect double bit errors. The idea is to add a<br />

parity bit that is calculated over the whole word. Let’s use a four-bit data word as<br />

an example, which would need only 7 bits for single-bit error correction. Hamming

parity bits H (p1 p2 p3) are computed (even parity as usual) plus the even parity<br />

over the entire word, p4:<br />

Bit position    1    2    3    4    5    6    7    8
Contents        p1   p2   d1   p3   d2   d3   d4   p4

Then the algorithm to correct one error and detect two is just to calculate parity over the ECC groups (H) as before plus one more over the whole group (p4). There are four cases:

1. H is even and p4 is even, so no error occurred.

2. H is odd and p4 is odd, so a correctable single error occurred. (p4 should calculate odd parity if one error occurred.)

3. H is even and p4 is odd, a single error occurred in the p4 bit, not in the rest of the word, so correct the p4 bit.
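The case analysis translates directly into code. In the sketch below (ours), H is reported as a syndrome value, which is 0 when every ECC group has even parity; the final branch is the standard double-error case that completes the list.

/* SEC/DED decision procedure: H = syndrome over the ECC groups (0 means all
   groups have even parity), p4_odd = parity over the whole word is odd.      */
#include <stdio.h>

void check(int H, int p4_odd)
{
    if (H == 0 && !p4_odd)
        printf("no error\n");
    else if (H != 0 && p4_odd)
        printf("single error at position %d: correct it\n", H);
    else if (H == 0 && p4_odd)
        printf("single error in the p4 bit itself: correct p4\n");
    else  /* H nonzero and p4 even: assumed standard fourth case */
        printf("double error detected (not correctable)\n");
}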


5.6 Virtual Machines

allow these separate software stacks to run independently yet share hardware,<br />

thereby consolidating the number of servers. Another example is that some<br />

VMMs support migration of a running VM to a different computer, either<br />

to balance load or to evacuate from failing hardware.<br />

Amazon Web Services (AWS) uses the virtual machines in its cloud computing<br />

offering EC2 for five reasons:<br />

1. It allows AWS to protect users from each other while sharing the same server.<br />

2. It simplifies software distribution within a warehouse scale computer. A<br />

customer installs a virtual machine image configured with the appropriate<br />

software, <strong>and</strong> AWS distributes it to all the instances a customer wants to use.<br />

3. Customers (<strong>and</strong> AWS) can reliably “kill” a VM to control resource usage<br />

when customers complete their work.<br />

4. Virtual machines hide the identity of the hardware on which the customer is<br />

running, which means AWS can keep using old servers <strong>and</strong> introduce new,<br />

more efficient servers. The customer expects performance for instances to<br />

match their ratings in “EC2 Compute Units,” which AWS defines: to “provide<br />

the equivalent CPU capacity of a 1.0–1.2 GHz 2007 AMD Opteron or 2007<br />

Intel Xeon processor.” Thanks to Moore’s Law, newer servers clearly offer<br />

more EC2 Compute Units than older ones, but AWS can keep renting old<br />

servers as long as they are economical.<br />

5. Virtual Machine Monitors can control the rate that a VM uses the processor,<br />

the network, <strong>and</strong> disk space, which allows AWS to offer many price points<br />

of instances of different types running on the same underlying servers.<br />

For example, in 2012 AWS offered 14 instance types, from small st<strong>and</strong>ard<br />

instances at $0.08 per hour to high I/O quadruple extra large instances at<br />

$3.10 per hour.<br />

Hardware/<br />

Software<br />

Interface<br />

In general, the cost of processor virtualization depends on the workload. Userlevel<br />

processor-bound programs have zero virtualization overhead, because the<br />

OS is rarely invoked, so everything runs at native speeds. I/O-intensive workloads<br />

are generally also OS-intensive, executing many system calls <strong>and</strong> privileged<br />

instructions that can result in high virtualization overhead. On the other h<strong>and</strong>, if<br />

the I/O-intensive workload is also I/O-bound, the cost of processor virtualization<br />

can be completely hidden, since the processor is often idle waiting for I/O.<br />

The overhead is determined by both the number of instructions that must be<br />

emulated by the VMM <strong>and</strong> by how much time each takes to emulate them. Hence,<br />

when the guest VMs run the same ISA as the host, as we assume here, the goal



of the architecture <strong>and</strong> the VMM is to run almost all instructions directly on the<br />

native hardware.<br />

Requirements of a Virtual Machine Monitor<br />

What must a VM monitor do? It presents a software interface to guest software, it<br />

must isolate the state of guests from each other, <strong>and</strong> it must protect itself from guest<br />

software (including guest OSes). The qualitative requirements are:<br />

■ Guest software should behave on a VM exactly as if it were running on the<br />

native hardware, except for performance-related behavior or limitations of<br />

fixed resources shared by multiple VMs.<br />

■ Guest software should not be able to change allocation of real system resources<br />

directly.<br />

To “virtualize” the processor, the VMM must control just about everything—access<br />

to privileged state, I/O, exceptions, <strong>and</strong> interrupts—even though the guest VM <strong>and</strong><br />

OS currently running are temporarily using them.<br />

For example, in the case of a timer interrupt, the VMM would suspend the<br />

currently running guest VM, save its state, h<strong>and</strong>le the interrupt, determine which<br />

guest VM to run next, <strong>and</strong> then load its state. Guest VMs that rely on a timer<br />

interrupt are provided with a virtual timer <strong>and</strong> an emulated timer interrupt by the<br />

VMM.<br />

To be in charge, the VMM must be at a higher privilege level than the guest<br />

VM, which generally runs in user mode; this also ensures that the execution of<br />

any privileged instruction will be h<strong>and</strong>led by the VMM. The basic requirements of<br />

a system virtual machine are:

■ At least two processor modes, system <strong>and</strong> user.<br />

■ A privileged subset of instructions that is available only in system mode,<br />

resulting in a trap if executed in user mode; all system resources must be<br />

controllable only via these instructions.<br />

(Lack of) Instruction Set Architecture Support for Virtual<br />

Machines<br />

If VMs are planned for during the design of the ISA, it’s relatively easy to reduce<br />

both the number of instructions that must be executed by a VMM <strong>and</strong> improve<br />

their emulation speed. An architecture that allows the VM to execute directly on<br />

the hardware earns the title virtualizable, <strong>and</strong> the IBM 370 architecture proudly<br />

bears that label.<br />

Alas, since VMs have been considered for PC <strong>and</strong> server applications only fairly<br />

recently, most instruction sets were created without virtualization in mind. These<br />

culprits include x86 <strong>and</strong> most RISC architectures, including ARMv7 <strong>and</strong> MIPS.



Because the VMM must ensure that the guest system only interacts with virtual<br />

resources, a conventional guest OS runs as a user mode program on top of the<br />

VMM. Then, if a guest OS attempts to access or modify information related to<br />

hardware resources via a privileged instruction—for example, reading or writing<br />

a status bit that enables interrupts—it will trap to the VMM. The VMM can then<br />

effect the appropriate changes to corresponding real resources.<br />

Hence, if any instruction that tries to read or write such sensitive information<br />

traps when executed in user mode, the VMM can intercept it <strong>and</strong> support a virtual<br />

version of the sensitive information, as the guest OS expects.<br />

In the absence of such support, other measures must be taken. A VMM must<br />

take special precautions to locate all problematic instructions <strong>and</strong> ensure that they<br />

behave correctly when executed by a guest OS, thereby increasing the complexity<br />

of the VMM <strong>and</strong> reducing the performance of running the VM.<br />

Protection <strong>and</strong> Instruction Set Architecture<br />

Protection is a joint effort of architecture <strong>and</strong> operating systems, but architects<br />

had to modify some awkward details of existing instruction set architectures when<br />

virtual memory became popular.<br />

For example, the x86 instruction POPF loads the flag registers from the top of<br />

the stack in memory. One of the flags is the Interrupt Enable (IE) flag. If you run<br />

the POPF instruction in user mode, rather than trap it, it simply changes all the<br />

flags except IE. In system mode, it does change the IE. Since a guest OS runs in user<br />

mode inside a VM, this is a problem, as it expects to see a changed IE.<br />

Historically, IBM mainframe hardware <strong>and</strong> VMM took three steps to improve<br />

performance of virtual machines:<br />

1. Reduce the cost of processor virtualization.<br />

2. Reduce interrupt overhead cost due to the virtualization.<br />

3. Reduce interrupt cost by steering interrupts to the proper VM without<br />

invoking VMM.<br />

AMD <strong>and</strong> Intel tried to address the first point in 2006 by reducing the cost of<br />

processor virtualization. It will be interesting to see how many generations of<br />

architecture <strong>and</strong> VMM modifications it will take to address all three points, <strong>and</strong><br />

how long before virtual machines of the 21st century will be as efficient as the IBM<br />

mainframes <strong>and</strong> VMMs of the 1970s.<br />

5.7 Virtual Memory<br />

In earlier sections, we saw how caches provided fast access to recently used portions<br />

of a program’s code <strong>and</strong> data. Similarly, the main memory can act as a “cache” for<br />

… a system has been devised to make the core drum combination appear to the programmer as a single level store, the requisite transfers taking place automatically.
Kilburn et al., One-level storage system, 1962



virtual memory<br />

A technique that uses<br />

main memory as a “cache”<br />

for secondary storage.<br />

physical address<br />

An address in main<br />

memory.<br />

protection A set<br />

of mechanisms for<br />

ensuring that multiple<br />

processes sharing the<br />

processor, memory,<br />

or I/O devices cannot<br />

interfere, intentionally<br />

or unintentionally, with<br />

one another by reading or<br />

writing each other’s data.<br />

These mechanisms also<br />

isolate the operating system<br />

from a user process.<br />

page fault An event that<br />

occurs when an accessed<br />

page is not present in<br />

main memory.<br />

virtual address<br />

An address that<br />

corresponds to a location<br />

in virtual space <strong>and</strong> is<br />

translated by address<br />

mapping to a physical<br />

address when memory is<br />

accessed.<br />

the secondary storage, usually implemented with magnetic disks. This technique is<br />

called virtual memory. Historically, there were two major motivations for virtual<br />

memory: to allow efficient <strong>and</strong> safe sharing of memory among multiple programs,<br />

such as for the memory needed by multiple virtual machines for cloud computing,<br />

<strong>and</strong> to remove the programming burdens of a small, limited amount of main<br />

memory. Five decades after its invention, it’s the former reason that reigns today.<br />

Of course, to allow multiple virtual machines to share the same memory, we<br />

must be able to protect the virtual machines from each other, ensuring that a<br />

program can only read <strong>and</strong> write the portions of main memory that have been<br />

assigned to it. Main memory need contain only the active portions of the many<br />

virtual machines, just as a cache contains only the active portion of one program.<br />

Thus, the principle of locality enables virtual memory as well as caches, <strong>and</strong> virtual<br />

memory allows us to efficiently share the processor as well as the main memory.<br />

We cannot know which virtual machines will share the memory with other<br />

virtual machines when we compile them. In fact, the virtual machines sharing<br />

the memory change dynamically while the virtual machines are running. Because<br />

of this dynamic interaction, we would like to compile each program into its<br />

own address space—a separate range of memory locations accessible only to this<br />

program. Virtual memory implements the translation of a program’s address space<br />

to physical addresses. This translation process enforces protection of a program’s<br />

address space from other virtual machines.<br />

The second motivation for virtual memory is to allow a single user program<br />

to exceed the size of primary memory. Formerly, if a program became too large<br />

for memory, it was up to the programmer to make it fit. Programmers divided<br />

programs into pieces <strong>and</strong> then identified the pieces that were mutually exclusive.<br />

These overlays were loaded or unloaded under user program control during<br />

execution, with the programmer ensuring that the program never tried to access<br />

an overlay that was not loaded <strong>and</strong> that the overlays loaded never exceeded the<br />

total size of the memory. Overlays were traditionally organized as modules, each<br />

containing both code <strong>and</strong> data. Calls between procedures in different modules<br />

would lead to overlaying of one module with another.<br />

As you can well imagine, this responsibility was a substantial burden on<br />

programmers. Virtual memory, which was invented to relieve programmers of<br />

this difficulty, automatically manages the two levels of the memory hierarchy<br />

represented by main memory (sometimes called physical memory to distinguish it<br />

from virtual memory) <strong>and</strong> secondary storage.<br />

Although the concepts at work in virtual memory <strong>and</strong> in caches are the same,<br />

their differing historical roots have led to the use of different terminology. A virtual<br />

memory block is called a page, <strong>and</strong> a virtual memory miss is called a page fault.<br />

With virtual memory, the processor produces a virtual address, which is translated<br />

by a combination of hardware <strong>and</strong> software to a physical address, which in turn can<br />

be used to access main memory. Figure 5.25 shows the virtually addressed memory<br />

with pages mapped to main memory. This process is called address mapping or



Virtual address<br />

31 30 29 28 27 15 14 13 12 11 10 9 8<br />

3 2 1 0<br />

Virtual page number<br />

Page offset<br />

Translation<br />

29 28 27 15 14 13 12 11 10 9 8<br />

3 2 1 0<br />

Physical page number<br />

Page offset<br />

Physical address<br />

FIGURE 5.26 Mapping from a virtual to a physical address. The page size is 2^12 = 4 KiB. The number of physical pages allowed in memory is 2^18, since the physical page number has 18 bits in it. Thus, main memory can have at most 1 GiB, while the virtual address space is 4 GiB.
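In software terms, the translation in Figure 5.26 is a shift, a mask, and one table lookup. The sketch below is ours; the page_table array is purely illustrative.

/* Virtual-to-physical translation with 4 KiB pages and a 32-bit virtual address. */
#include <stdint.h>

#define PAGE_BITS 12                          /* 4 KiB pages                   */
uint32_t page_table[1 << 20];                 /* one entry per virtual page    */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;              /* 20-bit virtual page number  */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);  /* low 12 bits pass through    */
    uint32_t ppn    = page_table[vpn];                   /* 18-bit physical page number */
    return (ppn << PAGE_BITS) | offset;                  /* physical address            */
}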

Many design choices in virtual memory systems are motivated by the high cost<br />

of a page fault. A page fault to disk will take millions of clock cycles to process.<br />

(The table on page 378 shows that main memory latency is about 100,000 times<br />

quicker than disk.) This enormous miss penalty, dominated by the time to get the<br />

first word for typical page sizes, leads to several key decisions in designing virtual<br />

memory systems:<br />

■ Pages should be large enough to try to amortize the high access time. Sizes<br />

from 4 KiB to 16 KiB are typical today. New desktop <strong>and</strong> server systems are<br />

being developed to support 32 KiB <strong>and</strong> 64 KiB pages, but new embedded<br />

systems are going in the other direction, to 1 KiB pages.<br />

■ Organizations that reduce the page fault rate are attractive. The primary<br />

technique used here is to allow fully associative placement of pages in<br />

memory.<br />

■ Page faults can be h<strong>and</strong>led in software because the overhead will be small<br />

compared to the disk access time. In addition, software can afford to use clever<br />

algorithms for choosing how to place pages because even small reductions in<br />

the miss rate will pay for the cost of such algorithms.<br />

■ Write-through will not work for virtual memory, since writes take too long.<br />

Instead, virtual memory systems use write-back.



The next few subsections address these factors in virtual memory design.<br />

Elaboration: We present the motivation for virtual memory as many virtual machines<br />

sharing the same memory, but virtual memory was originally invented so that many<br />

programs could share a computer as part of a timesharing system. Since many readers<br />

today have no experience with time-sharing systems, we use virtual machines to motivate<br />

this section.<br />

Elaboration: For servers <strong>and</strong> even PCs, 32-bit address processors are problematic.<br />

Although we normally think of virtual addresses as much larger than physical addresses,<br />

the opposite can occur when the processor address size is small relative to the state<br />

of the memory technology. No single program or virtual machine can benefit, but a collection of programs or virtual machines running at the same time can benefit from

not having to be swapped to memory or by running on parallel processors.<br />

Elaboration: The discussion of virtual memory in this book focuses on paging,<br />

which uses fixed-size blocks. There is also a variable-size block scheme called

segmentation. In segmentation, an address consists of two parts: a segment number<br />

<strong>and</strong> a segment offset. The segment number is mapped to a physical address, <strong>and</strong><br />

the offset is added to find the actual physical address. Because the segment can

vary in size, a bounds check is also needed to make sure that the offset is within<br />

the segment. The major use of segmentation is to support more powerful methods<br />

of protection <strong>and</strong> sharing in an address space. Most operating system textbooks<br />

contain extensive discussions of segmentation compared to paging <strong>and</strong> of the use<br />

of segmentation to logically share the address space. The major disadvantage of<br />

segmentation is that it splits the address space into logically separate pieces that<br />

must be manipulated as a two-part address: the segment number <strong>and</strong> the offset.<br />

Paging, in contrast, makes the boundary between page number <strong>and</strong> offset invisible<br />

to programmers <strong>and</strong> compilers.<br />

Segments have also been used as a method to extend the address space without<br />

changing the word size of the computer. Such attempts have been unsuccessful because<br />

of the awkwardness <strong>and</strong> performance penalties inherent in a two-part address, of which<br />

programmers <strong>and</strong> compilers must be aware.<br />

Many architectures divide the address space into large fixed-size blocks that simplify protection between the operating system and user programs and increase the efficiency

of implementing paging. Although these divisions are often called “segments,” this<br />

mechanism is much simpler than variable block size segmentation <strong>and</strong> is not visible to<br />

user programs; we discuss it in more detail shortly.<br />
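To make the two-part translation concrete, here is a minimal C sketch, assuming a hypothetical segment descriptor that holds a base and a limit; real segmented machines differ in details such as protection bits, but the bounds check and the base-plus-offset addition are the essential steps.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical per-segment descriptor; field names are assumptions. */
    struct segment { uint32_t base; uint32_t limit; };

    extern struct segment segment_table[];

    /* Translate a two-part segmented address (segment number, offset).
       Returns false if the offset falls outside the segment. */
    bool segment_translate(uint32_t seg_num, uint32_t offset, uint32_t *pa)
    {
        struct segment s = segment_table[seg_num];
        if (offset >= s.limit)      /* bounds check against the segment size */
            return false;
        *pa = s.base + offset;      /* physical address = mapped base + offset */
        return true;
    }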

segmentation: A variable-size address mapping scheme in which an address consists of two parts: a segment number, which is mapped to a physical address, and a segment offset.

Placing a Page <strong>and</strong> Finding It Again<br />

Because of the incredibly high penalty for a page fault, designers reduce page fault<br />

frequency by optimizing page placement. If we allow a virtual page to be mapped<br />

to any physical page, the operating system can then choose to replace any page<br />

it wants when a page fault occurs. For example, the operating system can use a


sophisticated algorithm and complex data structures that track page usage to try to choose a page that will not be needed for a long time. The ability to use a clever and flexible replacement scheme reduces the page fault rate and simplifies the use of fully associative placement of pages.

page table: The table containing the virtual to physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.

As mentioned in Section 5.4, the difficulty in using fully associative placement<br />

is in locating an entry, since it can be anywhere in the upper level of the hierarchy.<br />

A full search is impractical. In virtual memory systems, we locate pages by using a<br />

table that indexes the memory; this structure is called a page table, <strong>and</strong> it resides<br />

in memory. A page table is indexed with the page number from the virtual address<br />

to discover the corresponding physical page number. Each program has its own<br />

page table, which maps the virtual address space of that program to main memory.<br />

In our library analogy, the page table corresponds to a mapping between book<br />

titles <strong>and</strong> library locations. Just as the card catalog may contain entries for books<br />

in another library on campus rather than the local branch library, we will see that<br />

the page table may contain entries for pages not present in memory. To indicate the<br />

location of the page table in memory, the hardware includes a register that points to<br />

the start of the page table; we call this the page table register. Assume for now that<br />

the page table is in a fixed <strong>and</strong> contiguous area of memory.<br />
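To make the indexing concrete, here is a minimal C sketch of a single-level lookup, assuming 4 KiB pages, a 32-bit virtual address, and a hypothetical page table entry holding just a valid bit and a physical page number; the real entry layout, and the hardware that performs this translation, are machine specific.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 12                  /* 4 KiB pages (assumed) */

    typedef struct {
        uint32_t valid : 1;                      /* is the page in main memory? */
        uint32_t ppn   : 20;                     /* physical page number (assumed width) */
    } pte_t;

    extern pte_t *page_table_register;           /* points to the start of the page table */

    /* Translate a virtual address; returns false on a page fault. */
    bool translate(uint32_t va, uint32_t *pa)
    {
        uint32_t vpn    = va >> PAGE_OFFSET_BITS;             /* virtual page number */
        uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);
        pte_t pte = page_table_register[vpn];                  /* index the page table */
        if (!pte.valid)
            return false;                                      /* page fault: OS takes over */
        *pa = (pte.ppn << PAGE_OFFSET_BITS) | offset;          /* physical page + offset */
        return true;
    }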

Hardware/<br />

Software<br />

Interface<br />

The page table, together with the program counter <strong>and</strong> the registers, specifies<br />

the state of a virtual machine. If we want to allow another virtual machine to use<br />

the processor, we must save this state. Later, after restoring this state, the virtual<br />

machine can continue execution. We often refer to this state as a process. The<br />

process is considered active when it is in possession of the processor; otherwise, it<br />

is considered inactive. The operating system can make a process active by loading<br />

the process’s state, including the program counter, which will initiate execution at<br />

the value of the saved program counter.<br />

The process’s address space, <strong>and</strong> hence all the data it can access in memory, is<br />

defined by its page table, which resides in memory. Rather than save the entire page<br />

table, the operating system simply loads the page table register to point to the page<br />

table of the process it wants to make active. Each process has its own page table,<br />

since different processes use the same virtual addresses. The operating system is<br />

responsible for allocating the physical memory <strong>and</strong> updating the page tables, so<br />

that the virtual address spaces of different processes do not collide. As we will see<br />

shortly, the use of separate page tables also provides protection of one process from<br />

another.


swap space: The space on the disk reserved for the full virtual memory space of a process.

Page Faults<br />

If the valid bit for a virtual page is off, a page fault occurs. The operating system<br />

must be given control. This transfer is done with the exception mechanism, which<br />

we saw in Chapter 4 <strong>and</strong> will discuss again later in this section. Once the operating<br />

system gets control, it must find the page in the next level of the hierarchy (usually<br />

flash memory or magnetic disk) <strong>and</strong> decide where to place the requested page in<br />

main memory.<br />

The virtual address alone does not immediately tell us where the page is on disk.<br />

Returning to our library analogy, we cannot find the location of a library book on<br />

the shelves just by knowing its title. Instead, we go to the catalog <strong>and</strong> look up the<br />

book, obtaining an address for the location on the shelves, such as the Library of<br />

Congress call number. Likewise, in a virtual memory system, we must keep track<br />

of the location on disk of each page in virtual address space.<br />

Because we do not know ahead of time when a page in memory will be replaced,<br />

the operating system usually creates the space on flash memory or disk for all the<br />

pages of a process when it creates the process. This space is called the swap space.<br />

At that time, it also creates a data structure to record where each virtual page is<br />

stored on disk. This data structure may be part of the page table or may be an<br />

auxiliary data structure indexed in the same way as the page table. Figure 5.28<br />

shows the organization when a single table holds either the physical page number<br />

or the disk address.<br />

The operating system also creates a data structure that tracks which processes<br />

<strong>and</strong> which virtual addresses use each physical page. When a page fault occurs,<br />

if all the pages in main memory are in use, the operating system must choose a<br />

page to replace. Because we want to minimize the number of page faults, most<br />

operating systems try to choose a page that they hypothesize will not be needed<br />

in the near future. Using the past to predict the future, operating systems follow<br />

the least recently used (LRU) replacement scheme, which we mentioned in Section<br />

5.4. The operating system searches for the least recently used page, assuming that<br />

a page that has not been used in a long time is less likely to be needed than a more<br />

recently accessed page. The replaced pages are written to swap space on the disk.<br />

In case you are wondering, the operating system is just another process, <strong>and</strong> these<br />

tables controlling memory are in memory; the details of this seeming contradiction<br />

will be explained shortly.



Elaboration: With a 32-bit virtual address, 4 KiB pages, and 4 bytes per page table entry, we can compute the total page table size:

Number of page table entries = 2^32 / 2^12 = 2^20

Size of page table = 2^20 page table entries × 2^2 bytes/page table entry = 4 MiB

That is, we would need to use 4 MiB of memory for each program in execution at any<br />

time. This amount is not so bad for a single process. What if there are hundreds of<br />

processes running, each with their own page table? And how should we handle 64-bit addresses, which by this calculation would need 2^52 words?
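For reference, the same arithmetic in a few lines of C, including the 64-bit case raised above; the 4 bytes per entry is the assumption used in the calculation.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t entries32 = 1ull << (32 - 12);   /* 2^32 / 2^12 = 2^20 entries */
        uint64_t bytes32   = entries32 * 4;       /* 4 bytes per entry -> 4 MiB */
        uint64_t entries64 = 1ull << (64 - 12);   /* 2^52 entries for 64-bit addresses */

        printf("32-bit: %llu entries, %llu MiB\n",
               (unsigned long long)entries32, (unsigned long long)(bytes32 >> 20));
        printf("64-bit: %llu entries\n", (unsigned long long)entries64);
        return 0;
    }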

A range of techniques is used to reduce the amount of storage required for the page<br />

table. The five techniques below aim at reducing the total maximum storage required as

well as minimizing the main memory dedicated to page tables:<br />

1. The simplest technique is to keep a limit register that restricts the size of the<br />

page table for a given process. If the virtual page number becomes larger than<br />

the contents of the limit register, entries must be added to the page table. This<br />

technique allows the page table to grow as a process consumes more space.<br />

Thus, the page table will only be large if the process is using many pages of<br />

virtual address space. This technique requires that the address space exp<strong>and</strong> in<br />

only one direction.<br />

2. Allowing growth in only one direction is not sufficient, since most languages require<br />

two areas whose size is exp<strong>and</strong>able: one area holds the stack <strong>and</strong> the other area<br />

holds the heap. Because of this duality, it is convenient to divide the page table<br />

<strong>and</strong> let it grow from the highest address down, as well as from the lowest address<br />

up. This means that there will be two separate page tables <strong>and</strong> two separate<br />

limits. The use of two page tables breaks the address space into two segments.<br />

The high-order bit of an address usually determines which segment <strong>and</strong> thus which<br />

page table to use for that address. Since the high-order address bit specifies the<br />

segment, each segment can be as large as one-half of the address space. A<br />

limit register for each segment specifies the current size of the segment, which<br />

grows in units of pages. This type of segmentation is used by many architectures,<br />

including MIPS. Unlike the type of segmentation discussed in the third elaboration<br />

on page 431, this form of segmentation is invisible to the application program,<br />

although not to the operating system. The major disadvantage of this scheme is<br />

that it does not work well when the address space is used in a sparse fashion<br />

rather than as a contiguous set of virtual addresses.<br />

3. Another approach to reducing the page table size is to apply a hashing function<br />

to the virtual address so that the page table need be only the size of the number<br />

of physical pages in main memory. Such a structure is called an inverted page<br />

table. Of course, the lookup process is slightly more complex with an inverted<br />

page table, because we can no longer just index the page table.<br />

4. Multiple levels of page tables can also be used to reduce the total amount of page table storage. The first level maps large fixed-size blocks of virtual address space, perhaps 64 to 256 pages in total. These large blocks are sometimes called segments, and this first-level mapping table is sometimes called a



segment table, though the segments are again invisible to the user. Each entry<br />

in the segment table indicates whether any pages in that segment are allocated<br />

<strong>and</strong>, if so, points to a page table for that segment. Address translation happens<br />

by first looking in the segment table, using the highest-order bits of the address.

If the segment address is valid, the next set of high-order bits is used to index<br />

the page table indicated by the segment table entry. This scheme allows the<br />

address space to be used in a sparse fashion (multiple noncontiguous segments<br />

can be active) without having to allocate the entire page table. Such schemes<br />

are particularly useful with very large address spaces <strong>and</strong> in software systems<br />

that require noncontiguous allocation. The primary disadvantage of this two-level mapping is the more complex process for address translation (a C sketch of such a two-level walk follows this list).

5. To reduce the actual main memory tied up in page tables, most modern systems<br />

also allow the page tables to be paged. Although this sounds tricky, it works<br />

by using the same basic ideas of virtual memory <strong>and</strong> simply allowing the page<br />

tables to reside in the virtual address space. In addition, there are some small<br />

but critical problems, such as a never-ending series of page faults, which must<br />

be avoided. How these problems are overcome is both very detailed <strong>and</strong> typically<br />

highly processor specific. In brief, these problems are avoided by placing all the

page tables in the address space of the operating system <strong>and</strong> placing at least<br />

some of the page tables for the operating system in a portion of main memory<br />

that is physically addressed <strong>and</strong> is always present <strong>and</strong> thus never on disk.<br />
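As a rough illustration of technique 4, the C sketch below walks a hypothetical two-level structure: the high-order bits index a first-level (segment) table whose entries point to per-segment page tables, so page tables exist only for segments actually in use. The field widths and names are assumptions chosen for clarity, not any particular machine's format.

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 12      /* 4 KiB pages (assumed) */
    #define PT_BITS     10      /* second-level index width (assumed) */
    #define SEG_BITS    10      /* first-level index width (assumed) */

    typedef struct { uint32_t valid : 1; uint32_t ppn : 20; } pte_t;
    typedef struct { bool allocated; pte_t *page_table; } seg_entry_t;

    extern seg_entry_t first_level[1u << SEG_BITS];

    bool translate_two_level(uint32_t va, uint32_t *pa)
    {
        uint32_t seg    = va >> (OFFSET_BITS + PT_BITS);          /* highest-order bits */
        uint32_t index  = (va >> OFFSET_BITS) & ((1u << PT_BITS) - 1);
        uint32_t offset = va & ((1u << OFFSET_BITS) - 1);

        seg_entry_t se = first_level[seg];
        if (!se.allocated)                   /* no pages allocated in this segment */
            return false;
        pte_t pte = se.page_table[index];    /* only this segment's table is allocated */
        if (!pte.valid)
            return false;                    /* page fault */
        *pa = (pte.ppn << OFFSET_BITS) | offset;
        return true;
    }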

What about Writes?<br />

The difference between the access time to the cache <strong>and</strong> main memory is tens to<br />

hundreds of cycles, <strong>and</strong> write-through schemes can be used, although we need a<br />

write buffer to hide the latency of the write from the processor. In a virtual memory<br />

system, writes to the next level of the hierarchy (disk) can take millions of processor<br />

clock cycles; therefore, building a write buffer to allow the system to write-through<br />

to disk would be completely impractical. Instead, virtual memory systems must use<br />

write-back, performing the individual writes into the page in memory, <strong>and</strong> copying<br />

the page back to disk when it is replaced in the memory.<br />

A write-back scheme has another major advantage in a virtual memory system.<br />

Because the disk transfer time is small compared with its access time, copying back<br />

an entire page is much more efficient than writing individual words back to the disk.<br />

A write-back operation, although more efficient than transferring individual words, is<br />

still costly. Thus, we would like to know whether a page needs to be copied back when<br />

we choose to replace it. To track whether a page has been written since it was read into<br />

the memory, a dirty bit is added to the page table. The dirty bit is set when any word<br />

in a page is written. If the operating system chooses to replace the page, the dirty bit<br />

indicates whether the page needs to be written out before its location in memory can be<br />

given to another page. Hence, a modified page is often called a dirty page.<br />
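A minimal sketch of the replacement decision in C, assuming a hypothetical page table entry with valid and dirty bits and hypothetical OS helpers write_page_to_swap() and read_page_from_swap(); it shows only when the copy back to disk actually happens.

    struct pte { unsigned valid : 1; unsigned dirty : 1; unsigned ppn : 20; };

    /* Hypothetical helpers provided by the OS; names are assumptions. */
    void write_page_to_swap(unsigned ppn);
    void read_page_from_swap(unsigned disk_addr, unsigned ppn);

    /* Evict the page mapped by *victim and reuse its physical page. */
    void replace_page(struct pte *victim, unsigned new_disk_addr)
    {
        unsigned ppn = victim->ppn;
        if (victim->dirty)                       /* only dirty pages are copied back */
            write_page_to_swap(ppn);
        victim->valid = 0;                       /* old translation no longer usable */
        read_page_from_swap(new_disk_addr, ppn); /* bring in the new page */
    }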

Hardware/<br />

Software<br />

Interface



Because we access the TLB instead of the page table on every reference, the TLB<br />

will need to include other status bits, such as the dirty <strong>and</strong> the reference bits.<br />

On every reference, we look up the virtual page number in the TLB. If we get a<br />

hit, the physical page number is used to form the address, <strong>and</strong> the corresponding<br />

reference bit is turned on. If the processor is performing a write, the dirty bit is also<br />

turned on. If a miss in the TLB occurs, we must determine whether it is a page fault<br />

or merely a TLB miss. If the page exists in memory, then the TLB miss indicates<br />

only that the translation is missing. In such cases, the processor can h<strong>and</strong>le the TLB<br />

miss by loading the translation from the page table into the TLB <strong>and</strong> then trying the<br />

reference again. If the page is not present in memory, then the TLB miss indicates<br />

a true page fault. In this case, the processor invokes the operating system using an<br />

exception. Because the TLB has many fewer entries than the number of pages in<br />

main memory, TLB misses will be much more frequent than true page faults.<br />
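The C sketch below mirrors that sequence for a small, fully associative TLB; the entry format and sizes are assumptions chosen for clarity, and a real TLB compares all entries in parallel in hardware rather than in a loop.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 16
    #define OFFSET_BITS 12

    struct tlb_entry {
        bool     valid;
        bool     dirty;      /* set when the page is written */
        bool     ref;        /* reference bit used for replacement */
        uint32_t vpn;        /* tag: virtual page number */
        uint32_t ppn;        /* physical page number */
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit; on a miss the translation must come from
       the page table (or a page fault must be taken). */
    bool tlb_lookup(uint32_t va, bool is_write, uint32_t *pa)
    {
        uint32_t vpn = va >> OFFSET_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                tlb[i].ref = true;               /* turn on the reference bit */
                if (is_write)
                    tlb[i].dirty = true;         /* turn on the dirty bit */
                *pa = (tlb[i].ppn << OFFSET_BITS) | (va & ((1u << OFFSET_BITS) - 1));
                return true;
            }
        }
        return false;                            /* TLB miss */
    }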

TLB misses can be h<strong>and</strong>led either in hardware or in software. In practice, with<br />

care there can be little performance difference between the two approaches, because<br />

the basic operations are the same in either case.<br />

After a TLB miss occurs <strong>and</strong> the missing translation has been retrieved from the<br />

page table, we will need to select a TLB entry to replace. Because the reference <strong>and</strong><br />

dirty bits are contained in the TLB entry, we need to copy these bits back to the page<br />

table entry when we replace an entry. These bits are the only portion of the TLB<br />

entry that can be changed. Using write-back—that is, copying these entries back at<br />

miss time rather than when they are written—is very efficient, since we expect the<br />

TLB miss rate to be small. Some systems use other techniques to approximate the<br />

reference <strong>and</strong> dirty bits, eliminating the need to write into the TLB except to load<br />

a new table entry on a miss.<br />

Some typical values for a TLB might be<br />

■ TLB size: 16–512 entries<br />

■ Block size: 1–2 page table entries (typically 4–8 bytes each)<br />

■ Hit time: 0.5–1 clock cycle<br />

■ Miss penalty: 10–100 clock cycles<br />

■ Miss rate: 0.01%–1%<br />

<strong>Design</strong>ers have used a wide variety of associativities in TLBs. Some systems use<br />

small, fully associative TLBs because a fully associative mapping has a lower miss<br />

rate; furthermore, since the TLB is small, the cost of a fully associative mapping is<br />

not too high. Other systems use large TLBs, often with small associativity. With<br />

a fully associative mapping, choosing the entry to replace becomes tricky since<br />

implementing a hardware LRU scheme is too expensive. Furthermore, since TLB<br />

misses are much more frequent than page faults <strong>and</strong> thus must be h<strong>and</strong>led more<br />

cheaply, we cannot afford an expensive software algorithm, as we can for page faults.<br />

As a result, many systems provide some support for r<strong>and</strong>omly choosing an entry<br />

to replace. We’ll examine replacement schemes in a little more detail in Section 5.8.



The Intrinsity FastMATH TLB<br />

To see these ideas in a real processor, let’s take a closer look at the TLB of the<br />

Intrinsity FastMATH. The memory system uses 4 KiB pages <strong>and</strong> a 32-bit address<br />

space; thus, the virtual page number is 20 bits long, as in the top of Figure 5.30.<br />

The physical address is the same size as the virtual address. The TLB contains 16<br />

entries, it is fully associative, <strong>and</strong> it is shared between the instruction <strong>and</strong> data<br />

references. Each entry is 64 bits wide <strong>and</strong> contains a 20-bit tag (which is the virtual<br />

page number for that TLB entry), the corresponding physical page number (also 20<br />

bits), a valid bit, a dirty bit, <strong>and</strong> other bookkeeping bits. Like most MIPS systems,<br />

it uses software to h<strong>and</strong>le TLB misses.<br />

Figure 5.30 shows the TLB <strong>and</strong> one of the caches, while Figure 5.31 shows the<br />

steps in processing a read or write request. When a TLB miss occurs, the MIPS<br />

hardware saves the page number of the reference in a special register <strong>and</strong> generates<br />

an exception. The exception invokes the operating system, which h<strong>and</strong>les the miss<br />

in software. To find the physical address for the missing page, the TLB miss routine<br />

indexes the page table using the page number of the virtual address <strong>and</strong> the page<br />

table register, which indicates the starting address of the active process page table.<br />

Using a special set of system instructions that can update the TLB, the operating<br />

system places the physical address from the page table into the TLB. A TLB miss<br />

takes about 13 clock cycles, assuming the code <strong>and</strong> the page table entry are in the<br />

instruction cache <strong>and</strong> data cache, respectively. (We will see the MIPS TLB code<br />

on page 449.) A true page fault occurs if the page table entry does not have a valid<br />

physical address. The hardware maintains an index that indicates the recommended<br />

entry to replace; the recommended entry is chosen r<strong>and</strong>omly.<br />

There is an extra complication for write requests: namely, the write access bit in<br />

the TLB must be checked. This bit prevents the program from writing into pages<br />

for which it has only read access. If the program attempts a write <strong>and</strong> the write<br />

access bit is off, an exception is generated. The write access bit forms part of the<br />

protection mechanism, which we will discuss shortly.<br />

Integrating Virtual Memory, TLBs, <strong>and</strong> Caches<br />

Our virtual memory <strong>and</strong> cache systems work together as a hierarchy, so that data<br />

cannot be in the cache unless it is present in main memory. The operating system<br />

helps maintain this hierarchy by flushing the contents of any page from the cache<br />

when it decides to migrate that page to disk. At the same time, the OS modifies the<br />

page tables <strong>and</strong> TLB, so that an attempt to access any data on the migrated page<br />

will generate a page fault.<br />

Under the best of circumstances, a virtual address is translated by the TLB <strong>and</strong><br />

sent to the cache where the appropriate data is found, retrieved, <strong>and</strong> sent back to<br />

the processor. In the worst case, a reference can miss in all three components of the<br />

memory hierarchy: the TLB, the page table, <strong>and</strong> the cache. The following example<br />

illustrates these interactions in more detail.


[Figure 5.31 flowchart: a virtual address is first looked up in the TLB; a miss raises a TLB miss exception, while a hit produces the physical address. For a read, the cache is accessed, stalling on a miss and delivering data to the CPU on a hit. For a write, the write access bit is checked first (a write protection exception if it is off); on a cache hit the data is written into the cache, the dirty bit is updated, and the data and address are put into the write buffer, while a miss stalls to read the block.]

FIGURE 5.31 Processing a read or a write-through in the Intrinsity FastMATH TLB <strong>and</strong> cache. If the TLB generates a hit,<br />

the cache can be accessed with the resulting physical address. For a read, the cache generates a hit or miss <strong>and</strong> supplies the data or causes a stall<br />

while the data is brought from memory. If the operation is a write, a portion of the cache entry is overwritten for a hit <strong>and</strong> the data is sent to<br />

the write buffer if we assume write-through. A write miss is just like a read miss except that the block is modified after it is read from memory.<br />

Write-back requires writes to set a dirty bit for the cache block, <strong>and</strong> a write buffer is loaded with the whole block only on a read miss or write<br />

miss if the block to be replaced is dirty. Notice that a TLB hit <strong>and</strong> a cache hit are independent events, but a cache hit can only occur after a TLB<br />

hit occurs, which means that the data must be present in memory. The relationship between TLB misses <strong>and</strong> cache misses is examined further<br />

in the following example <strong>and</strong> the exercises at the end of this chapter.



Overall Operation of a Memory Hierarchy<br />

In a memory hierarchy like that of Figure 5.30, which includes a TLB <strong>and</strong> a<br />

cache organized as shown, a memory reference can encounter three different<br />

types of misses: a TLB miss, a page fault, <strong>and</strong> a cache miss. Consider all<br />

the combinations of these three events with one or more occurring (seven<br />

possibilities). For each possibility, state whether this event can actually occur<br />

<strong>and</strong> under what circumstances.<br />

EXAMPLE<br />

Figure 5.32 shows all combinations <strong>and</strong> whether each is possible in practice.<br />

ANSWER<br />

Elaboration: Figure 5.32 assumes that all memory addresses are translated to<br />

physical addresses before the cache is accessed. In this organization, the cache is<br />

physically indexed <strong>and</strong> physically tagged (both the cache index <strong>and</strong> tag are physical,<br />

rather than virtual, addresses). In such a system, the amount of time to access memory,<br />

assuming a cache hit, must accommodate both a TLB access <strong>and</strong> a cache access; of<br />

course, these accesses can be pipelined.<br />

Alternatively, the processor can index the cache with an address that is completely<br />

or partially virtual. This is called a virtually addressed cache, <strong>and</strong> it uses tags that<br />

are virtual addresses; hence, such a cache is virtually indexed <strong>and</strong> virtually tagged. In<br />

such caches, the address translation hardware (TLB) is unused during the normal cache<br />

access, since the cache is accessed with a virtual address that has not been translated<br />

to a physical address. This takes the TLB out of the critical path, reducing cache latency.<br />

When a cache miss occurs, however, the processor needs to translate the address to a<br />

physical address so that it can fetch the cache block from main memory.<br />

virtually addressed cache: A cache that is accessed with a virtual address rather than a physical address.

TLB | Page table | Cache | Possible? If so, under what circumstance?
Hit | Hit | Miss | Possible, although the page table is never really checked if TLB hits.
Miss | Hit | Hit | TLB misses, but entry found in page table; after retry, data is found in cache.
Miss | Hit | Miss | TLB misses, but entry found in page table; after retry, data misses in cache.
Miss | Miss | Miss | TLB misses and is followed by a page fault; after retry, data must miss in cache.
Hit | Miss | Miss | Impossible: cannot have a translation in TLB if page is not present in memory.
Hit | Miss | Hit | Impossible: cannot have a translation in TLB if page is not present in memory.
Miss | Miss | Hit | Impossible: data cannot be allowed in cache if the page is not in memory.

FIGURE 5.32 The possible combinations of events in the TLB, virtual memory system,<br />

<strong>and</strong> cache. Three of these combinations are impossible, <strong>and</strong> one is possible (TLB hit, virtual memory hit,<br />

cache miss) but never detected.


aliasing: A situation in which two addresses access the same object; it can occur in virtual memory when there are two virtual addresses for the same physical page.

physically addressed cache: A cache that is addressed by a physical address.

When the cache is accessed with a virtual address <strong>and</strong> pages are shared between<br />

processes (which may access them with different virtual addresses), there is the<br />

possibility of aliasing. Aliasing occurs when the same object has two names—in this<br />

case, two virtual addresses for the same page. This ambiguity creates a problem, because<br />

a word on such a page may be cached in two different locations, each corresponding<br />

to different virtual addresses. This ambiguity would allow one program to write the data<br />

without the other program being aware that the data had changed. Completely virtually<br />

addressed caches either introduce design limitations on the cache <strong>and</strong> TLB to reduce<br />

aliases or require the operating system, <strong>and</strong> possibly the user, to take steps to ensure<br />

that aliases do not occur.<br />

A common compromise between these two design points is caches that are virtually<br />

indexed—sometimes using just the page-offset portion of the address, which is really<br />

a physical address since it is not translated—but use physical tags. These designs,<br />

which are virtually indexed but physically tagged, attempt to achieve the performance<br />

advantages of virtually indexed caches with the architecturally simpler advantages of a<br />

physically addressed cache. For example, there is no alias problem in this case. Figure<br />

5.30 assumed a 4 KiB page size, but it’s really 16 KiB, so the Intrinsity FastMATH can<br />

use this trick. To pull it off, there must be careful coordination between the minimum<br />

page size, the cache size, <strong>and</strong> associativity.<br />
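One common way to state that coordination, under the usual assumption that the cache index and block offset must come entirely from the untranslated page-offset bits, is that the cache size divided by the associativity must not exceed the page size. The numbers in this small check are illustrative rather than the FastMATH's exact parameters.

    #include <assert.h>

    int main(void)
    {
        unsigned page_size     = 16 * 1024;  /* 16 KiB pages, as noted above */
        unsigned cache_size    = 16 * 1024;  /* illustrative cache capacity */
        unsigned associativity = 1;          /* illustrative: direct mapped */

        /* A virtually indexed, physically tagged cache avoids aliases only if
           its index bits lie entirely within the page offset. */
        assert(cache_size / associativity <= page_size);
        return 0;
    }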

Implementing Protection with Virtual Memory<br />

Perhaps the most important function of virtual memory today is to allow sharing of<br />

a single main memory by multiple processes, while providing memory protection<br />

among these processes <strong>and</strong> the operating system. The protection mechanism must<br />

ensure that although multiple processes are sharing the same main memory, one<br />

renegade process cannot write into the address space of another user process or into<br />

the operating system either intentionally or unintentionally. The write access bit in<br />

the TLB can protect a page from being written. Without this level of protection,<br />

computer viruses would be even more widespread.<br />

Hardware/<br />

Software<br />

Interface<br />

supervisor mode: Also called kernel mode. A mode indicating that a running process is an operating system process.

To enable the operating system to implement protection in the virtual memory<br />

system, the hardware must provide at least the three basic capabilities summarized<br />

below. Note that the first two are the same requirements as needed for virtual<br />

machines (Section 5.6).<br />

1. Support at least two modes that indicate whether the running process is a<br />

user process or an operating system process, variously called a supervisor<br />

process, a kernel process, or an executive process.<br />

2. Provide a portion of the processor state that a user process can read but not<br />

write. This includes the user/supervisor mode bit, which dictates whether<br />

the processor is in user or supervisor mode, the page table pointer, <strong>and</strong> the



TLB. To write these elements, the operating system uses special instructions<br />

that are only available in supervisor mode.<br />

3. Provide mechanisms whereby the processor can go from user mode to<br />

supervisor mode <strong>and</strong> vice versa. The first direction is typically accomplished<br />

by a system call exception, implemented as a special instruction (syscall in<br />

the MIPS instruction set) that transfers control to a dedicated location in<br />

supervisor code space. As with any other exception, the program counter<br />

from the point of the system call is saved in the exception PC (EPC), <strong>and</strong><br />

the processor is placed in supervisor mode. To return to user mode from the<br />

exception, use the return from exception (ERET) instruction, which resets to<br />

user mode <strong>and</strong> jumps to the address in EPC.<br />

By using these mechanisms <strong>and</strong> storing the page tables in the operating system’s<br />

address space, the operating system can change the page tables while preventing a<br />

user process from changing them, ensuring that a user process can access only the<br />

storage provided to it by the operating system.<br />

system call: A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.

We also want to prevent a process from reading the data of another process. For<br />

example, we wouldn’t want a student program to read the grades while they were<br />

in the processor’s memory. Once we begin sharing main memory, we must provide<br />

the ability for a process to protect its data from both reading <strong>and</strong> writing by another<br />

process; otherwise, sharing the main memory will be a mixed blessing!<br />

Remember that each process has its own virtual address space. Thus, if the<br />

operating system keeps the page tables organized so that the independent virtual<br />

pages map to disjoint physical pages, one process will not be able to access another’s<br />

data. Of course, this also requires that a user process be unable to change the page<br />

table mapping. The operating system can assure safety if it prevents the user process<br />

from modifying its own page tables. However, the operating system must be able<br />

to modify the page tables. Placing the page tables in the protected address space of<br />

the operating system satisfies both requirements.<br />

When processes want to share information in a limited way, the operating system<br />

must assist them, since accessing the information of another process requires<br />

changing the page table of the accessing process. The write access bit can be used<br />

to restrict the sharing to just read sharing, <strong>and</strong>, like the rest of the page table, this<br />

bit can be changed only by the operating system. To allow another process, say, P1,<br />

to read a page owned by process P2, P2 would ask the operating system to create<br />

a page table entry for a virtual page in P1’s address space that points to the same<br />

physical page that P2 wants to share. The operating system could use the write<br />

protection bit to prevent P1 from writing the data, if that was P2’s wish. Any bits<br />

that determine the access rights for a page must be included in both the page table<br />

<strong>and</strong> the TLB, because the page table is accessed only on a TLB miss.
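A minimal C sketch of that sequence, with hypothetical types and a hypothetical OS routine; the point is only that the operating system installs an entry in P1's page table that names P2's physical page and leaves the write access bit off.

    struct pte { unsigned valid : 1; unsigned write : 1; unsigned ppn : 20; };

    struct process { struct pte *page_table; };   /* hypothetical per-process state */

    /* Map virtual page vpn of p1 onto the physical page shared_ppn that P2 owns.
       Called by the OS after P2 asks it to share the page read-only. */
    void share_page_readonly(struct process *p1, unsigned vpn, unsigned shared_ppn)
    {
        struct pte entry;
        entry.valid = 1;
        entry.ppn   = shared_ppn;   /* same physical page P2 uses */
        entry.write = 0;            /* write access bit off: P1 may only read */
        p1->page_table[vpn] = entry;
        /* Any TLB entry for this page must carry the same access bits. */
    }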


context switch: A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.

Elaboration: When the operating system decides to change from running process<br />

P1 to running process P2 (called a context switch or process switch), it must ensure<br />

that P2 cannot get access to the page tables of P1 because that would compromise<br />

protection. If there is no TLB, it suffi ces to change the page table register to point to P2’s<br />

page table (rather than to P1’s); with a TLB, we must clear the TLB entries that belong to<br />

P1—both to protect the data of P1 <strong>and</strong> to force the TLB to load the entries for P2. If the<br />

process switch rate were high, this could be quite ineffi cient. For example, P2 might load<br />

only a few TLB entries before the operating system switched back to P1. Unfortunately,<br />

P1 would then fi nd that all its TLB entries were gone <strong>and</strong> would have to pay TLB misses<br />

to reload them. This problem arises because the virtual addresses used by P1 <strong>and</strong> P2<br />

are the same, <strong>and</strong> we must clear out the TLB to avoid confusing these addresses.<br />

A common alternative is to extend the virtual address space by adding a process<br />

identifier or task identifier. The Intrinsity FastMATH has an 8-bit address space ID (ASID) field for this purpose. This small field identifies the currently running process; it is kept in a register loaded by the operating system when it switches processes. The process identifier is concatenated to the tag portion of the TLB, so that a TLB hit occurs only if both the page number and the process identifier match. This combination eliminates the

need to clear the TLB, except on rare occasions.<br />
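In terms of the TLB lookup sketched earlier, the ASID simply becomes part of the tag comparison; the field widths below are assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    struct tlb_entry_asid {
        bool     valid;
        uint8_t  asid;   /* address space ID, loaded when the OS switches processes */
        uint32_t vpn;
        uint32_t ppn;
    };

    /* A hit now requires both the page number and the process identifier to match. */
    bool tlb_hit(const struct tlb_entry_asid *e, uint32_t vpn, uint8_t current_asid)
    {
        return e->valid && e->vpn == vpn && e->asid == current_asid;
    }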

Similar problems can occur for a cache, since on a process switch the cache will<br />

contain data from the running process. These problems arise in different ways for<br />

physically addressed <strong>and</strong> virtually addressed caches, <strong>and</strong> a variety of different solutions,<br />

such as process identifiers, are used to ensure that a process gets its own data.

H<strong>and</strong>ling TLB Misses <strong>and</strong> Page Faults<br />

Although the translation of virtual to physical addresses with a TLB is<br />

straightforward when we get a TLB hit, as we saw earlier, h<strong>and</strong>ling TLB misses <strong>and</strong><br />

page faults is more complex. A TLB miss occurs when no entry in the TLB matches<br />

a virtual address. Recall that a TLB miss can indicate one of two possibilities:<br />

1. The page is present in memory, <strong>and</strong> we need only create the missing TLB<br />

entry.<br />

2. The page is not present in memory, <strong>and</strong> we need to transfer control to the<br />

operating system to deal with a page fault.<br />

MIPS traditionally h<strong>and</strong>les a TLB miss in software. It brings in the page table<br />

entry from memory <strong>and</strong> then re-executes the instruction that caused the TLB miss.<br />

Upon re-executing, it will get a TLB hit. If the page table entry indicates the page is<br />

not in memory, this time it will get a page fault exception.<br />

H<strong>and</strong>ling a TLB miss or a page fault requires using the exception mechanism<br />

to interrupt the active process, transferring control to the operating system, <strong>and</strong><br />

later resuming execution of the interrupted process. A page fault will be recognized<br />

sometime during the clock cycle used to access memory. To restart the instruction<br />

after the page fault is h<strong>and</strong>led, the program counter of the instruction that caused<br />

the page fault must be saved. Just as in Chapter 4, the exception program counter<br />

(EPC) is used to hold this value.



In addition, a TLB miss or page fault exception must be asserted by the end<br />

of the same clock cycle that the memory access occurs, so that the next clock<br />

cycle will begin exception processing rather than continue normal instruction<br />

execution. If the page fault was not recognized in this clock cycle, a load instruction<br />

could overwrite a register, <strong>and</strong> this could be disastrous when we try to restart the<br />

instruction. For example, consider the instruction lw $1,0($1): the computer<br />

must be able to prevent the write pipeline stage from occurring; otherwise, it could<br />

not properly restart the instruction, since the contents of $1 would have been<br />

destroyed. A similar complication arises on stores. We must prevent the write into<br />

memory from actually completing when there is a page fault; this is usually done<br />

by deasserting the write control line to the memory.<br />

Between the time we begin executing the exception h<strong>and</strong>ler in the operating<br />

system <strong>and</strong> the time that the operating system has saved all the state of the process,<br />

the operating system is particularly vulnerable. For example, if another exception<br />

occurred when we were processing the first exception in the operating system, the<br />

control unit would overwrite the exception program counter, making it impossible<br />

to return to the instruction that caused the page fault! We can avoid this disaster<br />

by providing the ability to disable <strong>and</strong> enable exceptions. When an exception first<br />

occurs, the processor sets a bit that disables all other exceptions; this could happen<br />

at the same time the processor sets the supervisor mode bit. The operating system<br />

will then save just enough state to allow it to recover if another exception occurs—<br />

namely, the exception program counter (EPC) <strong>and</strong> Cause registers. EPC <strong>and</strong> Cause<br />

are two of the special control registers that help with exceptions, TLB misses, <strong>and</strong><br />

page faults; Figure 5.33 shows the rest. The operating system can then re-enable<br />

exceptions. These steps make sure that exceptions will not cause the processor<br />

to lose any state <strong>and</strong> thereby be unable to restart execution of the interrupting<br />

instruction.<br />

Hardware/<br />

Software<br />

Interface<br />

exception enable: Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.

Once the operating system knows the virtual address that caused the page fault, it<br />

must complete three steps:<br />

1. Look up the page table entry using the virtual address <strong>and</strong> find the location<br />

of the referenced page on disk.<br />

2. Choose a physical page to replace; if the chosen page is dirty, it must be<br />

written out to disk before we can bring a new virtual page into this physical<br />

page.<br />

3. Start a read to bring the referenced page from disk into the chosen physical<br />

page.
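Putting the three steps together as a C sketch, with hypothetical helper routines (disk-address lookup, victim selection, and disk I/O) standing in for the OS-specific code.

    struct pte { unsigned valid : 1; unsigned dirty : 1; unsigned ppn : 20; };

    /* Hypothetical OS helpers; names and signatures are assumptions. */
    unsigned disk_addr_of(unsigned vpn);           /* where the page lives on disk */
    struct pte *choose_victim(void);               /* pick a physical page to reuse */
    void write_page_to_swap(unsigned ppn);
    void start_disk_read(unsigned disk_addr, unsigned ppn);

    void handle_page_fault(unsigned faulting_vpn)
    {
        unsigned disk_addr = disk_addr_of(faulting_vpn);  /* 1. find the page on disk */

        struct pte *victim = choose_victim();             /* 2. choose a page to replace */
        if (victim->dirty)
            write_page_to_swap(victim->ppn);              /*    write it back if dirty */
        victim->valid = 0;

        start_disk_read(disk_addr, victim->ppn);          /* 3. read the new page in */
        /* Later the OS marks the faulting page valid and resumes the process. */
    }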



The exception invokes the operating system, which h<strong>and</strong>les the miss in software.<br />

Control is transferred to address 8000 0000hex, the location of the TLB miss handler.

To find the physical address for the missing page, the TLB miss routine indexes the<br />

page table using the page number of the virtual address <strong>and</strong> the page table register,<br />

which indicates the starting address of the active process page table. To make this<br />

indexing fast, MIPS hardware places everything you need in the special Context<br />

register: the upper 12 bits have the address of the base of the page table, <strong>and</strong> the<br />

next 18 bits have the virtual address of the missing page. Each page table entry is<br />

one word, so the last 2 bits are 0. Thus, the first two instructions copy the Context<br />

register into the kernel temporary register $k1 <strong>and</strong> then load the page table entry<br />

from that address into $k1. Recall that $k0 <strong>and</strong> $k1 are reserved for the operating<br />

system to use without saving; a major reason for this convention is to make the TLB<br />

miss handler fast.

handler: Name of a software routine invoked to “handle” an exception or interrupt.

Below is the MIPS code for a typical TLB miss handler:

TLBmiss:
  mfc0  $k1,Context   # copy address of PTE into temp $k1
  lw    $k1,0($k1)    # put PTE into temp $k1
  mtc0  $k1,EntryLo   # put PTE into special register EntryLo
  tlbwr               # put EntryLo into TLB entry at Random
  eret                # return from TLB miss exception

As shown above, MIPS has a special set of system instructions to update the<br />

TLB. The instruction tlbwr copies from control register EntryLo into the TLB<br />

entry selected by the control register Random. Random implements random

replacement, so it is basically a free-running counter. A TLB miss takes about a<br />

dozen clock cycles.<br />

Note that the TLB miss h<strong>and</strong>ler does not check to see if the page table entry is<br />

valid. Because the exception for TLB entry missing is much more frequent than<br />

a page fault, the operating system loads the TLB from the page table without<br />

examining the entry <strong>and</strong> restarts the instruction. If the entry is invalid, another<br />

<strong>and</strong> different exception occurs, <strong>and</strong> the operating system recognizes the page fault.<br />

This method makes the frequent case of a TLB miss fast, at a slight performance<br />

penalty for the infrequent case of a page fault.<br />

Once the process that generated the page fault has been interrupted, it transfers<br />

control to 8000 0180hex, a different address than the TLB miss handler. This is

the general address for exception; TLB miss has a special entry point to lower the<br />

penalty for a TLB miss. The operating system uses the exception Cause register<br />

to diagnose the cause of the exception. Because the exception is a page fault, the<br />

operating system knows that extensive processing will be required. Thus, unlike a<br />

TLB miss, it saves the entire state of the active process. This state includes all the<br />

general-purpose <strong>and</strong> floating-point registers, the page table address register, the<br />

EPC, <strong>and</strong> the exception Cause register. Since exception h<strong>and</strong>lers do not usually use<br />

the floating-point registers, the general entry point does not save them, leaving that<br />

to the few h<strong>and</strong>lers that need them.



Figure 5.34 sketches the MIPS code of an exception h<strong>and</strong>ler. Note that we<br />

save <strong>and</strong> restore the state in MIPS code, taking care when we enable <strong>and</strong> disable<br />

exceptions, but we invoke C code to h<strong>and</strong>le the particular exception.<br />

The virtual address that caused the fault depends on whether the fault was an<br />

instruction or data fault. The address of the instruction that generated the fault is<br />

in the EPC. If it was an instruction page fault, the EPC contains the virtual address<br />

of the faulting page; otherwise, the faulting virtual address can be computed by<br />

examining the instruction (whose address is in the EPC) to find the base register<br />

<strong>and</strong> offset field.<br />
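A sketch of that computation in C for a MIPS load or store (I-format) fetched from the address in the EPC: the base register number sits in bits 25–21 and the signed 16-bit offset in bits 15–0. The array of saved registers is a hypothetical stand-in for wherever the handler saved the process state.

    #include <stdint.h>

    extern uint32_t saved_regs[32];    /* process registers saved by the handler (assumed) */

    /* Given the 32-bit encoding of a faulting lw/sw, recover the data address it used. */
    uint32_t faulting_data_address(uint32_t instruction)
    {
        uint32_t base   = (instruction >> 21) & 0x1f;      /* rs field: base register */
        int32_t  offset = (int16_t)(instruction & 0xffff); /* sign-extended 16-bit offset */
        return saved_regs[base] + (uint32_t)offset;        /* base register + offset */
    }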

unmapped: A portion of the address space that cannot have page faults.

Elaboration: This simplified version assumes that the stack pointer (sp) is valid. To avoid the problem of a page fault during this low-level exception code, MIPS sets aside a portion of its address space that cannot have page faults, called unmapped. The operating system places the exception entry point code and the exception stack in unmapped memory. MIPS hardware translates virtual addresses 8000 0000hex to BFFF FFFFhex to physical addresses simply by ignoring the upper bits of the virtual address,

thereby placing these addresses in the low part of physical memory. Thus, the operating<br />

system places exception entry points <strong>and</strong> exception stacks in unmapped memory.<br />

Elaboration: The code in Figure 5.34 shows the MIPS-32 exception return sequence.<br />

The older MIPS-I architecture uses rfe <strong>and</strong> jr instead of eret.<br />

Elaboration: For processors with more complex instructions that can touch many<br />

memory locations <strong>and</strong> write many data items, making instructions restartable is much<br />

harder. Processing one instruction may generate a number of page faults in the middle<br />

of the instruction. For example, x86 processors have block move instructions that touch<br />

thous<strong>and</strong>s of data words. In such processors, instructions often cannot be restarted<br />

from the beginning, as we do for MIPS instructions. Instead, the instruction must be<br />

interrupted <strong>and</strong> later continued midstream in its execution. Resuming an instruction in<br />

the middle of its execution usually requires saving some special state, processing the<br />

exception, <strong>and</strong> restoring that special state. Making this work properly requires careful<br />

<strong>and</strong> detailed coordination between the exception-h<strong>and</strong>ling code in the operating system<br />

<strong>and</strong> the hardware.<br />

Elaboration: Rather than pay an extra level of indirection on every memory access, the<br />

VMM maintains a shadow page table that maps directly from the guest virtual address<br />

space to the physical address space of the hardware. By detecting all modifications to

the guest’s page table, the VMM can ensure the shadow page table entries being used<br />

by the hardware for translations correspond to those of the guest OS environment, with<br />

the exception of the correct physical pages substituted for the real pages in the guest<br />

tables. Hence, the VMM must trap any attempt by the guest OS to change its page table<br />

or to access the page table pointer. This is commonly done by write protecting the guest<br />

page tables <strong>and</strong> trapping any access to the page table pointer by a guest OS. As noted<br />

above, the latter happens naturally if accessing the page table pointer is a privileged<br />

operation.



Elaboration: The final portion of the architecture to virtualize is I/O. This is by far the most difficult part of system virtualization because of the increasing number of I/O devices attached to the computer and the increasing diversity of I/O device types. Another difficulty is the sharing of a real device among multiple VMs, and yet another

comes from supporting the myriad of device drivers that are required, especially if<br />

different guest OSes are supported on the same VM system. The VM illusion can be<br />

maintained by giving each VM generic versions of each type of I/O device driver, <strong>and</strong> then<br />

leaving it to the VMM to h<strong>and</strong>le real I/O.<br />

Elaboration: In addition to virtualizing the instruction set for a virtual machine,<br />

another challenge is virtualization of virtual memory, as each guest OS in every virtual<br />

machine manages its own set of page tables. To make this work, the VMM separates<br />

the notions of real <strong>and</strong> physical memory (which are often treated synonymously), <strong>and</strong><br />

makes real memory a separate, intermediate level between virtual memory <strong>and</strong> physical<br />

memory. (Some use the terms virtual memory, physical memory, <strong>and</strong> machine memory<br />

to name the same three levels.) The guest OS maps virtual memory to real memory<br />

via its page tables, <strong>and</strong> the VMM page tables map the guest’s real memory to physical<br />

memory. The virtual memory architecture is specified either via page tables, as in IBM

VM/370 <strong>and</strong> the x86, or via the TLB structure, as in MIPS.<br />
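Conceptually, the two levels of mapping compose, as the C sketch below illustrates with hypothetical single-level tables; real hardware consults the shadow page table instead of performing two lookups on every access.

    #include <stdint.h>
    #include <stdbool.h>

    struct map { bool valid; uint32_t out_pn; };  /* one hypothetical translation entry */

    extern struct map guest_pt[];  /* guest OS:  virtual page -> real page */
    extern struct map vmm_pt[];    /* VMM:       real page    -> physical page */

    /* Compose the guest and VMM translations for one virtual page number. */
    bool guest_to_physical(uint32_t guest_vpn, uint32_t *physical_pn)
    {
        struct map g = guest_pt[guest_vpn];
        if (!g.valid) return false;        /* guest page fault */
        struct map m = vmm_pt[g.out_pn];
        if (!m.valid) return false;        /* VMM must bring the real page in */
        *physical_pn = m.out_pn;
        return true;
    }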

Summary<br />

Virtual memory is the name for the level of memory hierarchy that manages<br />

caching between the main memory <strong>and</strong> secondary memory. Virtual memory<br />

allows a single program to exp<strong>and</strong> its address space beyond the limits of main<br />

memory. More importantly, virtual memory supports sharing of the main memory<br />

among multiple, simultaneously active processes, in a protected manner.<br />

Managing the memory hierarchy between main memory <strong>and</strong> disk is challenging<br />

because of the high cost of page faults. Several techniques are used to reduce the<br />

miss rate:<br />

1. Pages are made large to take advantage of spatial locality <strong>and</strong> to reduce the<br />

miss rate.<br />

2. The mapping between virtual addresses <strong>and</strong> physical addresses, which is<br />

implemented with a page table, is made fully associative so that a virtual<br />

page can be placed anywhere in main memory.<br />

3. The operating system uses techniques, such as LRU <strong>and</strong> a reference bit, to<br />

choose which pages to replace.



Writes to secondary memory are expensive, so virtual memory uses a write-back<br />

scheme <strong>and</strong> also tracks whether a page is unchanged (using a dirty bit) to avoid<br />

writing unchanged pages.<br />

The virtual memory mechanism provides address translation from a virtual<br />

address used by the program to the physical address space used for accessing<br />

memory. This address translation allows protected sharing of the main memory<br />

<strong>and</strong> provides several additional benefits, such as simplifying memory allocation.<br />

Ensuring that processes are protected from each other requires that only the<br />

operating system can change the address translations, which is implemented by<br />

preventing user programs from changing the page tables. Controlled sharing of<br />

pages among processes can be implemented with the help of the operating system<br />

<strong>and</strong> access bits in the page table that indicate whether the user program has read or<br />

write access to a page.<br />

If a processor had to access a page table resident in memory to translate every<br />

access, virtual memory would be too expensive, as caches would be pointless!<br />

Instead, a TLB acts as a cache for translations from the page table. Addresses are<br />

then translated from virtual to physical using the translations in the TLB.<br />

Caches, virtual memory, <strong>and</strong> TLBs all rely on a common set of principles <strong>and</strong><br />

policies. The next section discusses this common framework.<br />

Understanding Program Performance

Although virtual memory was invented to enable a small memory to act as a large

one, the performance difference between secondary memory <strong>and</strong> main memory<br />

means that if a program routinely accesses more virtual memory than it has<br />

physical memory, it will run very slowly. Such a program would be continuously<br />

swapping pages between memory <strong>and</strong> disk, called thrashing. Thrashing is a disaster<br />

if it occurs, but it is rare. If your program thrashes, the easiest solution is to run it on<br />

a computer with more memory or buy more memory for your computer. A more<br />

complex choice is to re-examine your algorithm <strong>and</strong> data structures to see if you<br />

can change the locality <strong>and</strong> thereby reduce the number of pages that your program<br />

uses simultaneously. This set of popular pages is informally called the working set.<br />

A more common performance problem is TLB misses. Since a TLB might<br />

handle only 32–64 page entries at a time, a program could easily see a high TLB miss rate, as the processor may access less than a quarter mebibyte directly: 64 × 4 KiB = 0.25 MiB. For example, TLB misses are often a challenge for Radix

Sort. To try to alleviate this problem, most computer architectures now support<br />

variable page sizes. For example, in addition to the st<strong>and</strong>ard 4 KiB page, MIPS<br />

hardware supports 16 KiB, 64 KiB, 256 KiB, 1 MiB, 4 MiB, 16 MiB, 64 MiB, <strong>and</strong><br />

256 MiB pages. Hence, if a program uses large page sizes, it can access more<br />

memory directly without TLB misses.<br />

The practical challenge is getting the operating system to allow programs to select these larger page sizes. Once again, the more complex solution to reducing TLB misses is to re-examine the algorithm and data structures to reduce the working set of pages; given the importance of memory accesses to performance and the frequency of TLB misses, some programs with large working sets have been redesigned with that goal.
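As one concrete illustration of asking the operating system for a larger page size, the sketch below is not from the text; it assumes a Linux system on which huge pages have been configured, and the 2 MiB size is illustrative.

    /* A minimal sketch (assumption: Linux with huge pages available) of a program
       requesting a larger page so that each TLB entry covers more memory. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 2 * 1024 * 1024;          /* one 2 MiB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        ((char *)p)[0] = 1;                    /* touch the page */
        munmap(p, len);
        return 0;
    }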

Check Yourself

Match the definitions in the right column to the terms in the left column.

1. L1 cache        a. A cache for a cache
2. L2 cache        b. A cache for disks
3. Main memory     c. A cache for a main memory
4. TLB             d. A cache for page table entries

5.8 A Common Framework for Memory Hierarchy

By now, you've recognized that the different types of memory hierarchies have a great deal in common. Although many of the aspects of memory hierarchies differ quantitatively, many of the policies and features that determine how a hierarchy functions are similar qualitatively. Figure 5.35 shows how some of the quantitative characteristics of memory hierarchies can differ. In the rest of this section, we will discuss the common operational alternatives for memory hierarchies, and how these determine their behavior. We will examine these policies as a series of four questions that apply between any two levels of a memory hierarchy, although for simplicity we will primarily use terminology for caches.

Feature                     | Typical values for L1 caches | Typical values for L2 caches | Typical values for paged memory | Typical values for a TLB
Total size in blocks        | 250–2000                     | 2,500–25,000                 | 16,000–250,000                  | 40–1024
Total size in kilobytes     | 16–64                        | 125–2000                     | 1,000,000–1,000,000,000         | 0.25–16
Block size in bytes         | 16–64                        | 64–128                       | 4000–64,000                     | 4–32
Miss penalty in clocks      | 10–25                        | 100–1000                     | 10,000,000–100,000,000          | 10–1000
Miss rates (global for L2)  | 2%–5%                        | 0.1%–2%                      | 0.00001%–0.0001%                | 0.01%–2%

FIGURE 5.35 The key quantitative design parameters that characterize the major elements of memory hierarchy in a computer. These are typical values for these levels as of 2012. Although the range of values is wide, this is partially because many of the values that have shifted over time are related; for example, as caches become larger to overcome larger miss penalties, block sizes also grow. While not shown, server microprocessors today also have L3 caches, which can be 2 to 8 MiB and contain many more blocks than L2 caches. L3 caches lower the L2 miss penalty to 30 to 40 clock cycles.



implementation, such as whether the cache is on-chip, the technology used for implementing the cache, and the critical role of cache access time in determining the processor cycle time.

Question 3: Which Block Should Be Replaced on a Cache Miss?

When a miss occurs in an associative cache, we must decide which block to replace. In a fully associative cache, all blocks are candidates for replacement. If the cache is set associative, we must choose among the blocks in the set. Of course, replacement is easy in a direct-mapped cache because there is only one candidate.

There are two primary strategies for replacement in set-associative or fully associative caches:

■ Random: Candidate blocks are randomly selected, possibly using some hardware assistance. For example, MIPS supports random replacement for TLB misses.

■ Least recently used (LRU): The block replaced is the one that has been unused for the longest time.

In practice, LRU is too costly to implement for hierarchies with more than a small degree of associativity (two to four, typically), since tracking the usage information is costly. Even for four-way set associativity, LRU is often approximated—for example, by keeping track of which pair of blocks is LRU (which requires 1 bit), and then tracking which block in each pair is LRU (which requires 1 bit per pair). For larger associativity, either LRU is approximated or random replacement is used. In caches, the replacement algorithm is in hardware, which means that the scheme should be easy to implement. Random replacement is simple to build in hardware, and for a two-way set-associative cache, random replacement has a miss rate about 1.1 times higher than LRU replacement. As the caches become larger, the miss rate for both replacement strategies falls, and the absolute difference becomes small. In fact, random replacement can sometimes be better than the simple LRU approximations that are easily implemented in hardware.
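The 3-bit approximation just described for a four-way set can be made concrete with a short sketch; the code below is not from the text and is only an illustration of the pair-plus-within-pair bookkeeping.

    /* A minimal sketch (not from the text) of the 3-bit pseudo-LRU approximation
       for one 4-way set: one bit selects the LRU pair, one bit per pair selects
       the LRU way inside that pair. */
    #include <stdio.h>

    typedef struct {
        unsigned pair_lru;    /* which pair (0 = ways 0/1, 1 = ways 2/3) was used less recently */
        unsigned way_lru[2];  /* which way inside each pair was used less recently */
    } plru_t;

    /* Update the bits on every access to 'way' (0..3). */
    static void plru_touch(plru_t *s, int way) {
        int pair = way >> 1;
        s->pair_lru = pair ^ 1;             /* the other pair is now less recently used */
        s->way_lru[pair] = (way & 1) ^ 1;   /* the other way in this pair is less recently used */
    }

    /* Pick a victim on a miss by following the bits toward the approximate LRU way. */
    static int plru_victim(const plru_t *s) {
        int pair = s->pair_lru;
        return (pair << 1) | s->way_lru[pair];
    }

    int main(void) {
        plru_t s = {0, {0, 0}};
        plru_touch(&s, 0); plru_touch(&s, 3); plru_touch(&s, 1);
        printf("victim way = %d\n", plru_victim(&s));  /* way 2: the only way not touched */
        return 0;
    }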

In virtual memory, some form of LRU is always approximated, since even a tiny reduction in the miss rate can be important when the cost of a miss is enormous. Reference bits or equivalent functionality are often provided to make it easier for the operating system to track a set of less recently used pages. Because misses are so expensive and relatively infrequent, approximating this information primarily in software is acceptable.

Question 4: What Happens on a Write?

A key characteristic of any memory hierarchy is how it deals with writes. We have already seen the two basic options:

■ Write-through: The information is written to both the block in the cache and the block in the lower level of the memory hierarchy (main memory for a cache). The caches in Section 5.3 used this scheme.



■ Write-back: The information is written only to the block in the cache. The modified block is written to the lower level of the hierarchy only when it is replaced. Virtual memory systems always use write-back, for the reasons discussed in Section 5.7.

Both write-back and write-through have their advantages. The key advantages of write-back are the following:

■ Individual words can be written by the processor at the rate that the cache, rather than the memory, can accept them.

■ Multiple writes within a block require only one write to the lower level in the hierarchy.

■ When blocks are written back, the system can make effective use of a high-bandwidth transfer, since the entire block is written.

Write-through has these advantages:

■ Misses are simpler and cheaper because they never require a block to be written back to the lower level.

■ Write-through is easier to implement than write-back, although to be practical, a write-through cache will still need to use a write buffer.
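To make the contrast concrete, the sketch below is not from the text; it shows where each policy performs its writes for a single cache block, and the dirty bit is what lets write-back defer the memory update until eviction.

    /* A minimal sketch (not from the text) of the two write policies for one block. */
    #include <stdbool.h>
    #include <stdio.h>

    #define WORDS_PER_BLOCK 4
    typedef struct { bool valid, dirty; unsigned tag; int data[WORDS_PER_BLOCK]; } block_t;

    /* Stubs standing in for the lower level of the hierarchy. */
    static void memory_write_word(unsigned addr, int value) { printf("mem[%u] = %d\n", addr, value); }
    static void memory_write_block(unsigned tag, const int *data) { (void)data; printf("write back block with tag %u\n", tag); }

    /* Write-through: every store also goes to the lower level. */
    static void store_write_through(block_t *b, unsigned addr, int value) {
        b->data[addr % WORDS_PER_BLOCK] = value;
        memory_write_word(addr, value);
    }

    /* Write-back: only the cache is updated; the block is marked dirty. */
    static void store_write_back(block_t *b, unsigned addr, int value) {
        b->data[addr % WORDS_PER_BLOCK] = value;
        b->dirty = true;
    }

    /* On replacement, a dirty block is written back once, as a full block. */
    static void evict(block_t *b) {
        if (b->valid && b->dirty) memory_write_block(b->tag, b->data);
        b->valid = b->dirty = false;
    }

    int main(void) {
        block_t b = { .valid = true, .dirty = false, .tag = 7 };
        store_write_through(&b, 0, 1);   /* one write to memory per store */
        store_write_back(&b, 1, 2);      /* no memory traffic yet */
        store_write_back(&b, 2, 3);
        evict(&b);                       /* a single block-wide write-back */
        return 0;
    }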

The BIG Picture

Caches, TLBs, and virtual memory may initially look very different, but they rely on the same two principles of locality, and they can be understood by their answers to four questions:

Question 1: Where can a block be placed?
Answer: One place (direct mapped), a few places (set associative), or any place (fully associative).

Question 2: How is a block found?
Answer: There are four methods: indexing (as in a direct-mapped cache), limited search (as in a set-associative cache), full search (as in a fully associative cache), and a separate lookup table (as in a page table).

Question 3: What block is replaced on a miss?
Answer: Typically, either the least recently used or a random block.

Question 4: How are writes handled?
Answer: Each level in the hierarchy can use either write-through or write-back.



In virtual memory systems, only a write-back policy is practical because of the long latency of a write to the lower level of the hierarchy. The rate at which writes are generated by a processor generally exceeds the rate at which the memory system can process them, even allowing for physically and logically wider memories and burst modes for DRAM. Consequently, today lowest-level caches typically use write-back.

The Three Cs: An Intuitive Model for Understanding the Behavior of Memory Hierarchies

In this subsection, we look at a model that provides insight into the sources of misses in a memory hierarchy and how the misses will be affected by changes in the hierarchy. We will explain the ideas in terms of caches, although the ideas carry over directly to any other level in the hierarchy. In this model, all misses are classified into one of three categories (the three Cs):

■ Compulsory misses: These are cache misses caused by the first access to a block that has never been in the cache. These are also called cold-start misses.

■ Capacity misses: These are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur when blocks are replaced and then later retrieved.

■ Conflict misses: These are cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that are eliminated in a fully associative cache of the same size. These cache misses are also called collision misses.

Figure 5.37 shows how the miss rate divides into the three sources. These sources of misses can be directly attacked by changing some aspect of the cache design. Since conflict misses arise directly from contention for the same cache block, increasing associativity reduces conflict misses. Associativity, however, may slow access time, leading to lower overall performance.

Capacity misses can easily be reduced by enlarging the cache; indeed, second-level caches have been growing steadily larger for many years. Of course, when we make the cache larger, we must also be careful about increasing the access time, which could lead to lower overall performance. Thus, first-level caches have been growing slowly, if at all.

Because compulsory misses are generated by the first reference to a block, the primary way for the cache system to reduce the number of compulsory misses is to increase the block size. This will reduce the number of references required to touch each block of the program once, because the program will consist of fewer

three Cs model A cache model in which all cache misses are classified into one of three categories: compulsory misses, capacity misses, and conflict misses.

compulsory miss Also called cold-start miss. A cache miss caused by the first access to a block that has never been in the cache.

capacity miss A cache miss that occurs because the cache, even with full associativity, cannot contain all the blocks needed to satisfy the request.

conflict miss Also called collision miss. A cache miss that occurs in a set-associative or direct-mapped cache when multiple blocks compete for the same set and that are eliminated in a fully associative cache of the same size.



■ Write-back using write allocate

■ Block size is 4 words (16 bytes or 128 bits)

■ Cache size is 16 KiB, so it holds 1024 blocks

■ 32-bit byte addresses

■ The cache includes a valid bit and dirty bit per block

From Section 5.3, we can now calculate the fields of an address for the cache:

■ Cache index is 10 bits

■ Block offset is 4 bits

■ Tag size is 32 - (10 + 4) or 18 bits
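A short sketch, not from the text, shows how these three fields would be extracted from a 32-bit byte address for this cache; the example address is arbitrary.

    /* A minimal sketch (not from the text) of extracting the address fields for
       the direct-mapped cache described above. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4    /* 16-byte blocks */
    #define INDEX_BITS 10    /* 1024 blocks    */

    int main(void) {
        uint32_t addr   = 0x12345678;                          /* arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* the remaining 18 bits */
        printf("tag = 0x%x, index = %u, offset = %u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }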

The signals between the processor and the cache are

■ 1-bit Read or Write signal

■ 1-bit Valid signal, saying whether there is a cache operation or not

■ 32-bit address

■ 32-bit data from processor to cache

■ 32-bit data from cache to processor

■ 1-bit Ready signal, saying the cache operation is complete

The interface between the memory and the cache has the same fields as between the processor and the cache, except that the data fields are now 128 bits wide. The extra memory width is generally found in microprocessors today, which deal with either 32-bit or 64-bit words in the processor while the DRAM controller is often 128 bits. Making the cache block match the width of the DRAM simplified the design. Here are the signals:

■ 1-bit Read or Write signal

■ 1-bit Valid signal, saying whether there is a memory operation or not

■ 32-bit address

■ 128-bit data from cache to memory

■ 128-bit data from memory to cache

■ 1-bit Ready signal, saying the memory operation is complete

Note that the interface to memory is not a fixed number of cycles. We assume a memory controller that will notify the cache via the Ready signal when the memory read or write is finished.

Before describing the cache controller, we need to review finite-state machines, which allow us to control an operation that can take multiple clock cycles.
which allow us to control an operation that can take multiple clock cycles.



FIGURE 5.39 Finite-state machine controllers are typically implemented using a block of combinational logic and a register to hold the current state. The outputs of the combinational logic are the next-state number and the control signals to be asserted for the current state. The inputs to the combinational logic are the current state and any inputs used to determine the next state. Notice that in the finite-state machine used in this chapter, the outputs depend only on the current state, not on the inputs. The Elaboration explains this in more detail.

needed early in the clock cycle, do not depend on the inputs, but only on the current state. In Appendix B, when the implementation of this finite-state machine is taken down to logic gates, the size advantage can be clearly seen. The potential disadvantage of a Moore machine is that it may require additional states. For example, in situations where there is a one-state difference between two sequences of states, the Mealy machine may unify the states by making the outputs depend on the inputs.

FSM for a Simple Cache Controller

Figure 5.40 shows the four states of our simple cache controller:

■ Idle: This state waits for a valid read or write request from the processor, which moves the FSM to the Compare Tag state.

■ Compare Tag: As the name suggests, this state tests to see if the requested read or write is a hit or a miss. The index portion of the address selects the tag to be compared. If the data in the cache block referred to by the index portion of the address is valid, and the tag portion of the address matches the tag, then it is a hit. Either the data is read from the selected word if it is a load or written to the selected word if it is a store. The Cache Ready signal is then



■ Replication: When shared data are being simultaneously read, the caches make a copy of the data item in the local cache. Replication reduces both latency of access and contention for a read shared data item.

Supporting migration and replication is critical to performance in accessing shared data, so many multiprocessors introduce a hardware protocol to maintain coherent caches. The protocols to maintain coherence for multiple processors are called cache coherence protocols. Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block.

The most popular cache coherence protocol is snooping. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. The caches are all accessible via some broadcast medium (a bus or network), and all cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access.

In the following section we explain snooping-based cache coherence as implemented with a shared bus, but any communication medium that broadcasts cache misses to all processors can be used to implement a snooping-based coherence scheme. This broadcasting to all caches makes snooping protocols simple to implement but also limits their scalability.

Snooping Protocols

One method of enforcing coherence is to ensure that a processor has exclusive access to a data item before it writes that item. This style of protocol is called a write invalidate protocol because it invalidates copies in other caches on a write. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated.

Figure 5.42 shows an example of an invalidation protocol for a snooping bus with write-back caches in action. To see how this protocol ensures coherence, consider a write followed by a read by another processor: since the write requires exclusive access, any copy held by the reading processor must be invalidated (hence the protocol name). Thus, when the read occurs, it misses in the cache, and the cache is forced to fetch a new copy of the data. For a write, we require that the writing processor have exclusive access, preventing any other processor from being able to write simultaneously. If two processors do attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated. For the other processor to complete its write, it must obtain a new copy of the data, which must now contain the updated value. Therefore, this protocol also enforces write serialization.



Processor activity     | Bus activity       | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory location X
(initial state)        |                    |                           |                           | 0
CPU A reads X          | Cache miss for X   | 0                         |                           | 0
CPU B reads X          | Cache miss for X   | 0                         | 0                         | 0
CPU A writes a 1 to X  | Invalidation for X | 1                         |                           | 0
CPU B reads X          | Cache miss for X   | 1                         | 1                         | 1

FIGURE 5.42 An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches. We assume that neither cache initially holds X and that the value of X in memory is 0. The CPU and memory contents show the value after the processor and bus activity have both completed. A blank indicates no activity or no copy cached. When the second miss by B occurs, CPU A responds with the value canceling the response from memory. In addition, both the contents of B's cache and the memory contents of X are updated. This update of memory, which occurs when a block becomes shared, simplifies the protocol, but it is possible to track the ownership and force the write-back only if the block is replaced. This requires the introduction of an additional state called "owner," which indicates that a block may be shared, but the owning processor is responsible for updating any other processors and memory when it changes the block or replaces it.
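The sequence in Figure 5.42 can be mimicked with a few lines of code; the sketch below is not from the text and models only a single block shared by two caches, with the write-invalidate rule applied on every write.

    /* A minimal sketch (not from the text) of write invalidate for one block
       shared by two caches, following the sequence of Figure 5.42. */
    #include <stdio.h>

    typedef struct { int valid; int value; } copy_t;

    static int memory_x = 0;
    static copy_t cache[2] = {{0, 0}, {0, 0}};

    static int cpu_read(int id) {
        if (!cache[id].valid) {                 /* miss: get the block from the other cache or memory */
            int other = 1 - id;
            cache[id].value = cache[other].valid ? cache[other].value : memory_x;
            memory_x = cache[id].value;         /* memory updated when the block becomes shared */
            cache[id].valid = 1;
        }
        return cache[id].value;
    }

    static void cpu_write(int id, int v) {
        cache[1 - id].valid = 0;                /* invalidate the other copy before writing */
        cache[id].value = v;
        cache[id].valid = 1;
    }

    int main(void) {
        printf("A reads X: %d\n", cpu_read(0)); /* miss, returns 0 */
        printf("B reads X: %d\n", cpu_read(1)); /* miss, returns 0 */
        cpu_write(0, 1);                        /* invalidation for X */
        printf("B reads X: %d\n", cpu_read(1)); /* miss, returns 1 supplied by A's copy */
        return 0;
    }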

Hardware/Software Interface

One insight is that block size plays an important role in cache coherency. For example, take the case of snooping on a cache with a block size of eight words, with a single word alternatively written and read by two processors. Most protocols exchange full blocks between processors, thereby increasing coherency bandwidth demands.

Large blocks can also cause what is called false sharing: when two unrelated shared variables are located in the same cache block, the full block is exchanged between processors even though the processors are accessing different variables. Programmers and compilers should lay out data carefully to avoid false sharing.
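A small sketch, not from the text, shows one common layout fix: padding a structure so that two counters updated by different threads do not fall in the same cache block. The 64-byte block size and the loop counts are assumptions; compile with -pthread.

    /* A minimal sketch (not from the text) of avoiding false sharing by padding. */
    #include <pthread.h>
    #include <stdio.h>

    #define BLOCK 64                          /* assumed cache block size in bytes */

    struct counters {
        long a;
        char pad[BLOCK - sizeof(long)];       /* keeps b out of a's cache block */
        long b;
    };

    static struct counters c;

    static void *bump_a(void *arg) { for (long i = 0; i < 10000000; i++) c.a++; return arg; }
    static void *bump_b(void *arg) { for (long i = 0; i < 10000000; i++) c.b++; return arg; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a = %ld, b = %ld\n", c.a, c.b);
        return 0;
    }

Without the pad field, a and b would occupy one block, and that block would bounce between the two processors' caches on every increment even though the threads never touch the same variable.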

Elaboration: Although the three properties on pages 466 and 467 are sufficient to ensure coherence, the question of when a written value will be seen is also important. To see why, observe that we cannot require that a read of X in Figure 5.41 instantaneously sees the value written for X by some other processor. If, for example, a write of X on one processor precedes a read of X on another processor very shortly beforehand, it may be impossible to ensure that the read returns the value of the data written, since the written data may not even have left the processor at that point. The issue of exactly when a written value must be seen by a reader is defined by a memory consistency model.

false sharing When two unrelated shared variables are located in the same cache block and the full block is exchanged between processors even though the processors are accessing different variables.



5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

In this section, we will look at the memory hierarchy of the same two microprocessors described in Chapter 4: the ARM Cortex-A8 and Intel Core i7. This section is based on Section 2.6 of Computer Architecture: A Quantitative Approach, 5th edition.

Figure 5.43 summarizes the address sizes and TLBs of the two processors. Note that the A8 has two TLBs with a 32-bit virtual address space and a 32-bit physical address space. The Core i7 has three TLBs with a 48-bit virtual address and a 44-bit physical address. Although the 64-bit registers of the Core i7 could hold a larger virtual address, there was no software need for such a large space, and 48-bit virtual addresses shrink both the page table memory footprint and the TLB hardware.

Figure 5.44 shows their caches. Keep in mind that the A8 has just one processor or core while the Core i7 has four. Both have identically organized 32 KiB, 4-way set associative, L1 instruction caches (per core) with 64 byte blocks. The A8 uses the same design for the data cache, while the Core i7 keeps everything the same except the associativity, which it increases to 8-way. Both use an 8-way set associative unified L2 cache (per core) with 64 byte blocks, although the A8 varies in size from 128 KiB to 1 MiB while the Core i7 is fixed at 256 KiB. As the Core i7 is used for servers, it

Characteristic     | ARM Cortex-A8                                  | Intel Core i7
Virtual address    | 32 bits                                        | 48 bits
Physical address   | 32 bits                                        | 44 bits
Page size          | Variable: 4, 16, 64 KiB, 1, 16 MiB             | Variable: 4 KiB, 2/4 MiB
TLB organization   | 1 TLB for instructions and 1 TLB for data; both TLBs are fully associative, with 32 entries, round robin replacement; TLB misses handled in hardware | 1 TLB for instructions and 1 TLB for data per core; both L1 TLBs are four-way set associative, LRU replacement; L1 I-TLB has 128 entries for small pages, 7 per thread for large pages; L1 D-TLB has 64 entries for small pages, 32 for large pages; the L2 TLB is four-way set associative, LRU replacement; the L2 TLB has 512 entries; TLB misses handled in hardware

FIGURE 5.43 Address translation and TLB hardware for the ARM Cortex-A8 and Intel Core i7 920. Both processors provide support for large pages, which are used for things like the operating system or mapping a frame buffer. The large-page scheme avoids using a large number of entries to map a single object that is always present.
single object that is always present.



advantage of this capability, but large servers and multiprocessors often have memory systems capable of handling more than one outstanding miss in parallel.

The Core i7 has a prefetch mechanism for data accesses. It looks at a pattern of data misses and uses this information to try to predict the next address to start fetching the data before the miss occurs. Such techniques generally work best when accessing arrays in loops.

The sophisticated memory hierarchies of these chips and the large fraction of the dies dedicated to caches and TLBs show the significant design effort expended to try to close the gap between processor cycle times and memory latency.

Performance of the A8 and Core i7 Memory Hierarchies

The memory hierarchy of the Cortex-A8 was simulated with a 1 MiB eight-way set associative L2 cache using the integer Minnespec benchmarks. As mentioned in Chapter 4, Minnespec is a set of benchmarks consisting of the SPEC2000 benchmarks but with different inputs that reduce the running times by several orders of magnitude. Although the use of smaller inputs does not change the instruction mix, it does affect the cache behavior. For example, on mcf, the most memory-intensive SPEC2000 integer benchmark, Minnespec has a miss rate for a 32 KiB cache that is only 65% of the miss rate for the full SPEC2000 version. For a 1 MiB cache the difference is a factor of six! For this reason, one cannot compare the Minnespec benchmarks against the SPEC2000 benchmarks, much less the even larger SPEC2006 benchmarks used for the Core i7 in Figure 5.47. Instead, the data are useful for looking at the relative impact of L1 and L2 misses on overall CPI, which we used in Chapter 4.

The A8 instruction cache miss rates for these benchmarks (and also for the full SPEC2000 versions on which Minnespec is based) are very small even for just the L1: close to zero for most and under 1% for all of them. This low rate probably results from the computationally intensive nature of the SPEC programs and the four-way set associative cache that eliminates most conflict misses. Figure 5.45 shows the data cache results for the A8, which have significant L1 and L2 miss rates. The L1 miss penalty for a 1 GHz Cortex-A8 is 11 clock cycles, while the L2 miss penalty is assumed to be 60 clock cycles. Using these miss penalties, Figure 5.46 shows the average miss penalty per data access.

Figure 5.47 shows the miss rates for the caches of the Core i7 using the SPEC2006 benchmarks. The L1 instruction cache miss rate varies from 0.1% to 1.8%, averaging just over 0.4%. This rate is in keeping with other studies of instruction cache behavior for the SPEC CPU2006 benchmarks, which show low instruction cache miss rates. With L1 data cache miss rates running 5% to 10%, and sometimes higher, the importance of the L2 and L3 caches should be obvious. Since the cost for a miss to memory is over 100 cycles and the average data miss rate in L2 is 4%, L3 is obviously critical. Assuming about half the instructions are loads or stores, without L3 the L2 cache misses could add two cycles per instruction to the CPI! In comparison, the average L3 data miss rate of 1% is still significant but four times lower than the L2 miss rate and six times less than the L1 miss rate.



 1  #include <x86intrin.h>
 2  #define UNROLL (4)
 3  #define BLOCKSIZE 32
 4  void do_block (int n, int si, int sj, int sk,
 5                 double *A, double *B, double *C)
 6  {
 7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
 8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
 9        __m256d c[4];
10        for ( int x = 0; x < UNROLL; x++ )
11          c[x] = _mm256_load_pd(C+i+x*4+j*n);
12          /* c[x] = C[i][j] */
13        for ( int k = sk; k < sk+BLOCKSIZE; k++ )
14        {
15          __m256d b = _mm256_broadcast_sd(B+k+j*n);
16          /* b = B[k][j] */
17          for ( int x = 0; x < UNROLL; x++ )
18            c[x] = _mm256_add_pd(c[x], /* c[x] += A[i][k]*b */
19                   _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20        }
21
22        for ( int x = 0; x < UNROLL; x++ )
23          _mm256_store_pd(C+i+x*4+j*n, c[x]);
24          /* C[i][j] = c[x] */
25      }
26  }
27
28  void dgemm (int n, double* A, double* B, double* C)
29  {
30    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
31      for ( int si = 0; si < n; si += BLOCKSIZE )
32        for ( int sk = 0; sk < n; sk += BLOCKSIZE )
33          do_block(n, si, sj, sk, A, B, C);
34  }

FIGURE 5.48 Optimized C version of DGEMM from Figure 4.80 using cache blocking. These changes are the same ones found in Figure 5.21. The assembly language produced by the compiler for the do_block function is nearly identical to Figure 4.81. Once again, there is no overhead to call the do_block because the compiler inlines the function call.



of A, B, and C. Indeed, lines 28–34 and lines 7–8 in Figure 5.48 are identical to lines 14–20 and lines 5–6 in Figure 5.21, with the exception of incrementing the for loop in line 7 by the amount unrolled.

Unlike the earlier chapters, we do not show the resulting x86 code because the inner loop code is nearly identical to Figure 4.81, as the blocking does not affect the computation, just the order that it accesses data in memory. What does change is the bookkeeping integer instructions to implement the for loops. It expands from 14 instructions before the inner loop and 8 after the loop for Figure 4.80 to 40 and 28 instructions respectively for the bookkeeping code generated for Figure 5.48. Nevertheless, the extra instructions executed pale in comparison to the performance improvement of reducing cache misses. Figure 5.49 compares unoptimized code to the optimizations for subword parallelism, instruction-level parallelism, and caches. Blocking improves performance over unrolled AVX code by factors of 2 to 2.5 for the larger matrices. When we compare unoptimized code to the code with all three optimizations, the performance improvement is factors of 8 to 15, with the largest increase for the largest matrix.

[Figure 5.49 is a bar chart of GFLOPS for the four DGEMM versions (Unoptimized, AVX, AVX + unroll, and AVX + unroll + blocked) at matrix dimensions 32x32, 160x160, 480x480, and 960x960.]

FIGURE 5.49 Performance of four versions of DGEMM from matrix dimensions 32x32 to 960x960. The fully optimized code for the largest matrix is almost 15 times as fast as the unoptimized version in Figure 3.21 in Chapter 3.

Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with Turbo mode turned off. As in Chapters 3 and 4, when we turn it on we improve all the results by the temporary increase in the clock rate of 3.3/2.6 = 1.27. Turbo mode works particularly well in this case because it is using only a single core of an eight-core chip. However, if we want to run fast we should use all cores, which we'll see in Chapter 6.



This mistake catches many people, including the authors (in earlier drafts) and instructors who forget whether they intended the addresses to be in words, bytes, or block numbers. Remember this pitfall when you tackle the exercises.

Pitfall: Having less set associativity for a shared cache than the number of cores or threads sharing that cache.

Without extra care, a parallel program running on 2^n processors or threads can easily allocate data structures to addresses that would map to the same set of a shared L2 cache. If the cache is at least 2^n-way associative, then these accidental conflicts are hidden by the hardware from the program. If not, programmers could face apparently mysterious performance bugs—actually due to L2 conflict misses—when migrating from, say, a 16-core design to a 32-core design if both use 16-way associative L2 caches.

Pitfall: Using average memory access time to evaluate the memory hierarchy of an out-of-order processor.

If a processor stalls during a cache miss, then you can separately calculate the memory-stall time and the processor execution time, and hence evaluate the memory hierarchy independently using average memory access time (see page 399).

If the processor continues to execute instructions, and may even sustain more cache misses during a cache miss, then the only accurate assessment of the memory hierarchy is to simulate the out-of-order processor along with the memory hierarchy.
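As a reminder of the simple model referred to above, average memory access time is the hit time plus the miss rate times the miss penalty; with, say, a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty (illustrative numbers, not from the text), it comes to 1 + 0.05 × 20 = 2 cycles. The pitfall is that this single number says nothing about how much of that penalty an out-of-order processor can overlap with useful work.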

Pitfall: Extending an address space by adding segments on top of an unsegmented address space.

During the 1970s, many programs grew so large that not all the code and data could be addressed with just a 16-bit address. Computers were then revised to offer 32-bit addresses, either through an unsegmented 32-bit address space (also called a flat address space) or by adding 16 bits of segment to the existing 16-bit address. From a marketing point of view, adding segments that were programmer-visible and that forced the programmer and compiler to decompose programs into segments could solve the addressing problem. Unfortunately, there is trouble any time a programming language wants an address that is larger than one segment, such as indices for large arrays, unrestricted pointers, or reference parameters. Moreover, adding segments can turn every address into two words—one for the segment number and one for the segment offset—causing problems in the use of addresses in registers.

Fallacy: Disk failure rates in the field match their specifications.

Two recent studies evaluated large collections of disks to check the relationship between results in the field compared to specifications. One study was of almost 100,000 disks that had quoted MTTF of 1,000,000 to 1,500,000 hours, or AFR of 0.6% to 0.8%. They found AFRs of 2% to 4% to be common, often three to five times higher than the specified rates [Schroeder and Gibson, 2007]. A second study of more than 100,000 disks at Google, which had a quoted AFR of about 1.5%, saw failure rates of 1.7% for drives in their first year rise to 8.6% for drives in their third year, or about five to six times the specified rate [Pinheiro, Weber, and Barroso, 2007].



Problem category: Access sensitive registers without trapping when running in user mode
Problem x86 instructions:
  Store global descriptor table register (SGDT)
  Store local descriptor table register (SLDT)
  Store interrupt descriptor table register (SIDT)
  Store machine status word (SMSW)
  Push flags (PUSHF, PUSHFD)
  Pop flags (POPF, POPFD)

Problem category: When accessing virtual memory mechanisms in user mode, instructions fail the x86 protection checks
Problem x86 instructions:
  Load access rights from segment descriptor (LAR)
  Load segment limit from segment descriptor (LSL)
  Verify if segment descriptor is readable (VERR)
  Verify if segment descriptor is writable (VERW)
  Pop to segment register (POP CS, POP SS, . . .)
  Push segment register (PUSH CS, PUSH SS, . . .)
  Far call to different privilege level (CALL)
  Far return to different privilege level (RET)
  Far jump to different privilege level (JMP)
  Software interrupt (INT)
  Store segment selector register (STR)
  Move to/from segment registers (MOVE)

FIGURE 5.51 Summary of 18 x86 instructions that cause problems for virtualization [Robin and Irvine, 2000]. The first five instructions in the top group allow a program in user mode to read a control register, such as descriptor table registers, without causing a trap. The pop flags instruction modifies a control register with sensitive information but fails silently when in user mode. The protection checking of the segmented architecture of the x86 is the downfall of the bottom group, as each of these instructions checks the privilege level implicitly as part of instruction execution when reading a control register. The checking assumes that the OS must be at the highest privilege level, which is not the case for guest VMs. Only the Move to segment register tries to modify control state, and protection checking foils it as well.

Pitfall: Implementing a virtual machine monitor on an instruction set architecture that wasn't designed to be virtualizable.

Many architects in the 1970s and 1980s weren't careful to make sure that all instructions reading or writing information related to hardware resource information were privileged. This laissez-faire attitude causes problems for VMMs for all of these architectures, including the x86, which we use here as an example. Figure 5.51 describes the 18 instructions that cause problems for virtualization [Robin and Irvine, 2000]. The two broad classes are instructions that

■ Read control registers in user mode that reveals that the guest operating system is running in a virtual machine (such as POPF, mentioned earlier)

■ Check protection as required by the segmented architecture but assume that the operating system is running at the highest privilege level

To simplify implementations of VMMs on the x86, both AMD and Intel have proposed extensions to the architecture via a new mode. Intel's VT-x provides a new execution mode for running VMs, an architected definition of the VM



5.1.4 [10] How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?

5.1.5 [5] References to which variables exhibit temporal locality?

5.1.6 [5] References to which variables exhibit spatial locality?

5.2 Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses.

3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253

5.2.1 [10] For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

5.2.2 [10] For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

5.2.3 [20] You are asked to optimize a cache design for the given references. There are three direct-mapped cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?

There are many different design parameters that are important to a cache's overall performance. Below are listed parameters for different direct-mapped cache designs.

Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle

5.2.4 [15] Calculate the total number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.

5.2.5 [20] Generate a series of read requests that have a lower miss rate on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible solution that would make the cache listed have an equal or lower miss rate than the 2 KiB cache. Discuss the advantages and disadvantages of such a solution.

5.2.6 [15] The formula shown in Section 5.3 shows the typical method to index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different
the cache). Assuming a 32-bit address <strong>and</strong> 1024 blocks in the cache, consider a different



Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0

5.13.1 [5] Assuming an LRU replacement policy, how many hits does this address sequence exhibit?

5.13.2 [5] Assuming an MRU (most recently used) replacement policy, how many hits does this address sequence exhibit?

5.13.3 [5] Simulate a random replacement policy by flipping a coin. For example, "heads" means to evict the first block in a set and "tails" means to evict the second block in a set. How many hits does this address sequence exhibit?

5.13.4 [10] Which address should be evicted at each replacement to maximize the number of hits? How many hits does this address sequence exhibit if you follow this "optimal" policy?

5.13.5 [10] Describe why it is difficult to implement a cache replacement policy that is optimal for all address sequences.

5.13.6 [10] Assume you could make a decision upon each memory reference whether or not you want the requested address to be cached. What impact could this have on miss rate?

5.14 To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called "shadow paging" duplicates each virtual machine's page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables (VA ⇒ PA and PA ⇒ MA) and can walk such tables purely in hardware.

Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch.

5.14.1 [10] What would happen for the given operation sequence for shadow page table and nested page table, respectively?

5.14.2 [10] Assuming an x86-based 4-level page table in both guest and nested page table, how many memory references are needed to service a TLB miss for native vs. nested page table?

5.14.3 [15] Among TLB miss rate, TLB miss latency, page fault rate, and page fault handler latency, which metrics are more important for shadow page table? Which are important for nested page table?
Which are important for nested page table?



5.16 In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your own finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block.

Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn't have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to main memory while the processor is working.

5.16.1 [10] What should happen if the processor issues a request that hits in the cache while a block is being written back to main memory from the write buffer?

5.16.2 [10] What should happen if the processor issues a request that misses in the cache while a block is being written back to main memory from the write buffer?

5.16.3 [30] Design a finite state machine to enable the use of a write buffer.

5.17 Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

5.17.1 [15] List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more possible value of the block if the protocol doesn't ensure cache coherency.

5.17.2 [15] For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.

5.17.3 [10] What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0).

P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;



5.19 In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows:

    struct entry {
        int srcIP;          // remote IP address
        char URL[128];      // request URL (e.g., "GET index.html")
        long long refTime;  // reference time
        int status;         // connection status
        char browser[64];   // client browser name
    } log[NUM_ENTRIES];

Assume the following processing function for the log:

    topK_sourceIP(int hour);

5.19.1 [5] Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?

5.19.2 [10] How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.

5.19.3 [10] Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippet and data.

For the problems below, use data from "Cache Performance for SPEC CPU2000 Benchmarks" (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table.

a. Mesa / gcc
b. mcf / swim

5.19.4 [10] For 64 KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?

5.19.5 [10] Select the set associativity to be used by a 64 KiB L1 data cache shared by both benchmarks. If the L1 cache has to be directly mapped, select the set associativity for the 1 MiB L2 cache.

5.19.6 [20] Give an example in the miss rate table where higher set associativity actually increases miss rate. Construct a cache configuration and reference stream to demonstrate this.
stream to demonstrate this.



Answers to Check Yourself

§5.1, page 377: 1 and 4. (3 is false because the cost of the memory hierarchy varies per computer, but in 2013 the highest cost is usually the DRAM.)
§5.3, page 398: 1 and 4: A lower miss penalty can enable smaller blocks, since you don't have that much latency to amortize, yet higher memory bandwidth usually leads to larger blocks, since the miss penalty is only slightly larger.
§5.4, page 417: 1.
§5.7, page 454: 1-a, 2-c, 3-b, 4-d.
§5.8, page 461: 2. (Both large block sizes and prefetching may reduce compulsory misses, so 1 is false.)




6 Parallel Processors from Client to Cloud

"I swing big, with everything I've got. I hit big or I miss big. I like to live as big as I can."
Babe Ruth, American baseball player

6.1 Introduction 502
6.2 The Difficulty of Creating Parallel Processing Programs 504
6.3 SISD, MIMD, SIMD, SPMD, and Vector 509
6.4 Hardware Multithreading 516
6.5 Multicore and Other Shared Memory Multiprocessors 519
6.6 Introduction to Graphics Processing Units 524




multicore microprocessors instead of multiprocessor microprocessors, presumably to avoid redundancy in naming. Hence, processors are often called cores in a multicore chip. The number of cores is expected to increase with Moore's Law. These multicores are almost always Shared Memory Processors (SMPs), as they usually share a single physical address space. We'll see SMPs more in Section 6.5.

The state of technology today means that programmers who care about performance must become parallel programmers, for sequential code now means slow code.

The tall challenge facing the industry is to create hardware and software that will make it easy to write correct parallel processing programs that will execute efficiently in performance and energy as the number of cores per chip scales.

This abrupt shift in microprocessor design caught many off guard, so there is a great deal of confusion about the terminology and what it means. Figure 6.1 tries to clarify the terms serial, parallel, sequential, and concurrent. The columns of this figure represent the software, which is either inherently sequential or concurrent. The rows of the figure represent the hardware, which is either serial or parallel. For example, the programmers of compilers think of them as sequential programs: the steps include parsing, code generation, optimization, and so on. In contrast, the programmers of operating systems normally think of them as concurrent programs: cooperating processes handling I/O events due to independent jobs running on a computer.

The point of these two axes of Figure 6.1 is that concurrent software can run on serial hardware, such as operating systems for the Intel Pentium 4 uniprocessor, or on parallel hardware, such as an OS on the more recent Intel Core i7. The same is true for sequential software. For example, the MATLAB programmer writes a matrix multiply thinking about it sequentially, but it could run serially on the Pentium 4 or in parallel on the Intel Core i7.

You might guess that the only challenge of the parallel revolution is figuring out how to make naturally sequential software have high performance on parallel hardware, but it is also to make concurrent programs have high performance on multiprocessors as the number of processors increases. With this distinction made, in the rest of this chapter we will use parallel processing program or parallel software to mean either sequential or concurrent software running on parallel hardware. The next section of this chapter describes why it is hard to create efficient parallel processing programs.

multicore microprocessor A microprocessor containing multiple processors (“cores”) in a single integrated circuit. Virtually all microprocessors today in desktops and servers are multicore.

shared memory multiprocessor (SMP) A parallel processor with a single physical address space.

FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. The columns are the software (sequential or concurrent) and the rows are the hardware (serial or parallel):

Serial hardware, sequential software: Matrix Multiply written in MATLAB running on an Intel Pentium 4
Serial hardware, concurrent software: Windows Vista Operating System running on an Intel Pentium 4
Parallel hardware, sequential software: Matrix Multiply written in MATLAB running on an Intel Core i7
Parallel hardware, concurrent software: Windows Vista Operating System running on an Intel Core i7


504 Chapter 6 Parallel Processors from Client to Cloud<br />

Before proceeding further down the path to parallelism, don't forget our initial

incursions from the earlier chapters:<br />

Check<br />

Yourself<br />

■ Chapter 2, Section 2.11: Parallelism <strong>and</strong> Instructions: Synchronization<br />

■ Chapter 3, Section 3.6: Parallelism <strong>and</strong> <strong>Computer</strong> Arithmetic: Subword<br />

Parallelism<br />

■ Chapter 4, Section 4.10: Parallelism via Instructions<br />

■ Chapter 5, Section 5.10: Parallelism <strong>and</strong> Memory Hierarchy: Cache Coherence<br />

True or false: To benefit from a multiprocessor, an application must be concurrent.<br />

6.2<br />

The Difficulty of Creating Parallel<br />

Processing Programs<br />

The difficulty with parallelism is not the hardware; it is that too few important<br />

application programs have been rewritten to complete tasks sooner on multiprocessors.<br />

It is difficult to write software that uses multiple processors to complete one task<br />

faster, <strong>and</strong> the problem gets worse as the number of processors increases.<br />

Why has this been so? Why have parallel processing programs been so much<br />

harder to develop than sequential programs?<br />

The first reason is that you must get better performance or better energy<br />

efficiency from a parallel processing program on a multiprocessor; otherwise, you<br />

would just use a sequential program on a uniprocessor, as sequential programming<br />

is simpler. In fact, uniprocessor design techniques such as superscalar and out-of-order

execution take advantage of instruction-level parallelism (see Chapter 4),<br />

normally without the involvement of the programmer. Such innovations reduced<br />

the dem<strong>and</strong> for rewriting programs for multiprocessors, since programmers<br />

could do nothing <strong>and</strong> yet their sequential programs would run faster on new<br />

computers.<br />

Why is it difficult to write parallel processing programs that are fast, especially<br />

as the number of processors increases? In Chapter 1, we used the analogy of<br />

eight reporters trying to write a single story in hopes of doing the work eight<br />

times faster. To succeed, the task must be broken into eight equal-sized pieces,<br />

because otherwise some reporters would be idle while waiting for the ones with<br />

larger pieces to finish. Another speed-up obstacle could be that the reporters<br />

would spend too much time communicating with each other instead of writing<br />

their pieces of the story. For both this analogy <strong>and</strong> parallel programming,<br />

the challenges include scheduling, partitioning the work into parallel pieces,<br />

balancing the load evenly between the workers, time to synchronize, <strong>and</strong>


6.2 The Difficulty of Creating Parallel Processing Programs 505<br />

overhead for communication between the parties. The challenge is stiffer with the<br />

more reporters for a newspaper story <strong>and</strong> with the more processors for parallel<br />

programming.<br />

Our discussion in Chapter 1 reveals another obstacle, namely Amdahl's Law. It

reminds us that even small parts of a program must be parallelized if the program<br />

is to make good use of many cores.<br />

Speed-up Challenge<br />

Suppose you want to achieve a speed-up of 90 times faster with 100 processors.<br />

What percentage of the original computation can be sequential?<br />

EXAMPLE<br />

Amdahl's Law (Chapter 1) says

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

ANSWER

We can reformulate Amdahl's Law in terms of speed-up versus the original execution time:

Speed-up = Execution time before / ((Execution time before − Execution time affected) + (Execution time affected / Amount of improvement))

This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by improvement is considered the fraction of the original execution time:

Speed-up = 1 / ((1 − Fraction time affected) + (Fraction time affected / Amount of improvement))

Substituting 90 for speed-up and 100 for amount of improvement into the formula above:

90 = 1 / ((1 − Fraction time affected) + (Fraction time affected / 100))


506 Chapter 6 Parallel Processors from Client to Cloud<br />

Then simplifying the formula <strong>and</strong> solving for fraction time affected:<br />

90 × (1 − 0.99 × Fraction time affected) = 1<br />

90 − (90 × 0.99 × Fraction time affected) = 1<br />

90 − 1 = 90 × 0.99 × Fraction time affected

Fraction time affected = 89/89.1 = 0.999<br />

Thus, to achieve a speed-up of 90 from 100 processors, the sequential<br />

percentage can only be 0.1%.<br />
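To check this algebra mechanically, the calculation can be expressed in a few lines of C. This is only a sketch; the names S, P, and F below are ours, not the book's notation.

#include <stdio.h>

/* Check the Amdahl's Law algebra above: given a target speed-up S on
   P processors, solve S = 1 / ((1 - F) + F/P) for F, the fraction of
   the original time that must be parallelizable. */
int main(void) {
    double S = 90.0, P = 100.0;
    double F = (1.0 - 1.0/S) / (1.0 - 1.0/P);   /* closed-form solution */
    printf("fraction parallel = %.4f, sequential = %.2f%%\n",
           F, (1.0 - F) * 100.0);               /* about 0.999 and 0.1% */
    return 0;
}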

Yet, there are applications with plenty of parallelism, as we shall see next.<br />

EXAMPLE<br />

Speed-up Challenge: Bigger Problem<br />

Suppose you want to perform two sums: one is a sum of 10 scalar variables, <strong>and</strong><br />

one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10.<br />

For now let’s assume only the matrix sum is parallelizable; we’ll see soon how to<br />

parallelize scalar sums. What speed-up do you get with 10 versus 40 processors?<br />

Next, calculate the speed-ups assuming the matrices grow to 20 by 20.<br />

ANSWER<br />

If we assume performance is a function of the time for an addition, t, then<br />

there are 10 additions that do not benefit from parallel processors <strong>and</strong> 100<br />

additions that do. If the time for a single processor is 110t, the execution time for 10 processors is

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Execution time after improvement = 100t/10 + 10t = 20t

so the speed-up with 10 processors is 110t/20t = 5.5. The execution time for<br />

40 processors is<br />

Execution time after improvement = 100t/40 + 10t = 12.5t

so the speed-up with 40 processors is 110t/12.5t = 8.8. Thus, for this problem<br />

size, we get about 55% of the potential speed-up with 10 processors, but only<br />

22% with 40.


6.2 The Difficulty of Creating Parallel Processing Programs 507<br />

Look what happens when we increase the matrix. The sequential program now<br />

takes 10t + 400t = 410t. The execution time for 10 processors is<br />

Execution time after improvement = 400t/10 + 10t = 50t

so the speed-up with 10 processors is 410t/50t = 8.2. The execution time for<br />

40 processors is<br />

Execution time after improvement = 400t/40 + 10t = 20t

so the speed-up with 40 processors is 410t/20t = 20.5. Thus, for this larger problem<br />

size, we get 82% of the potential speed-up with 10 processors <strong>and</strong> 51% with 40.<br />
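The same arithmetic for both problem sizes can be packaged as a small C helper. This is only a sketch of the calculation above; the function name speedup is ours, and the unit time t is left implicit.

#include <stdio.h>

/* The 10 scalar additions are serial; the matrix additions (100 or 400)
   are spread over p processors. Time is measured in units of t. */
static double speedup(double serial_adds, double parallel_adds, int p) {
    double before = serial_adds + parallel_adds;
    double after  = serial_adds + parallel_adds / p;
    return before / after;
}

int main(void) {
    printf("10x10 matrix: %4.1f on 10, %4.1f on 40 processors\n",
           speedup(10, 100, 10), speedup(10, 100, 40));   /* 5.5 and 8.8 */
    printf("20x20 matrix: %4.1f on 10, %4.1f on 40 processors\n",
           speedup(10, 400, 10), speedup(10, 400, 40));   /* 8.2 and 20.5 */
    return 0;
}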

These examples show that getting good speed-up on a multiprocessor while<br />

keeping the problem size fixed is harder than getting good speed-up by increasing<br />

the size of the problem. This insight allows us to introduce two terms that describe<br />

ways to scale up.<br />

Strong scaling means measuring speed-up while keeping the problem size fixed.<br />

Weak scaling means that the problem size grows proportionally to the increase in<br />

the number of processors. Let’s assume that the size of the problem, M, is the working<br />

set in main memory, <strong>and</strong> we have P processors. Then the memory per processor for<br />

strong scaling is approximately M/P, <strong>and</strong> for weak scaling, it is approximately M.<br />

Note that the memory hierarchy can interfere with the conventional wisdom<br />

about weak scaling being easier than strong scaling. For example, if the weakly<br />

scaled dataset no longer fits in the last level cache of a multicore microprocessor,<br />

the resulting performance could be much worse than by using strong scaling.<br />

Depending on the application, you can argue for either scaling approach. For<br />

example, the TPC-C debit-credit database benchmark requires that you scale up<br />

the number of customer accounts in proportion to the higher transactions per<br />

minute. The argument is that it's nonsensical to think that a given customer base

is suddenly going to start using ATMs 100 times a day just because the bank gets a<br />

faster computer. Instead, if you're going to demonstrate a system that can perform

100 times the numbers of transactions per minute, you should run the experiment<br />

with 100 times as many customers. Bigger problems often need more data, which<br />

is an argument for weak scaling.<br />

This final example shows the importance of load balancing.<br />

strong scaling Speed-up achieved on a multiprocessor without increasing the size of the problem.

weak scaling Speed-up achieved on a multiprocessor while increasing the size of the problem proportionally to the increase in the number of processors.

Speed-up Challenge: Balancing Load<br />

To achieve the speed-up of 20.5 on the previous larger problem with 40<br />

processors, we assumed the load was perfectly balanced. That is, each of the 40<br />

EXAMPLE


6.3 SISD, MIMD, SIMD, SPMD, <strong>and</strong> Vector 511<br />

data elements from memory, put them in order into a large set of registers, operate<br />

on them sequentially in registers using pipelined execution units, <strong>and</strong> then write<br />

the results back to memory. A key feature of vector architectures is then a set of<br />

vector registers. Thus, a vector architecture might have 32 vector registers, each<br />

with 64 64-bit elements.<br />

Comparing Vector to Conventional Code<br />

Suppose we extend the MIPS instruction set architecture with vector<br />

instructions <strong>and</strong> vector registers. Vector operations use the same names as<br />

MIPS operations, but with the letter V appended. For example, addv.d<br />

adds two double-precision vectors. The vector instructions take as their input<br />

either a pair of vector registers (addv.d) or a vector register <strong>and</strong> a scalar<br />

register (addvs.d). In the latter case, the value in the scalar register is used<br />

as the input for all operations: the operation addvs.d will add the contents

of a scalar register to each element in a vector register. The names lv <strong>and</strong> sv<br />

denote vector load <strong>and</strong> vector store, <strong>and</strong> they load or store an entire vector<br />

of double-precision data. One oper<strong>and</strong> is the vector register to be loaded or<br />

stored; the other oper<strong>and</strong>, which is a MIPS general-purpose register, is the<br />

starting address of the vector in memory. Given this short description, show<br />

the conventional MIPS code versus the vector MIPS code for<br />

EXAMPLE<br />

Y = a × X + Y

where X <strong>and</strong> Y are vectors of 64 double precision floating-point numbers,<br />

initially resident in memory, <strong>and</strong> a is a scalar double precision variable. (This<br />

example is the so-called DAXPY loop that forms the inner loop of the Linpack<br />

benchmark; DAXPY stands for double precision a × X plus Y.) Assume that

the starting addresses of X <strong>and</strong> Y are in $s0 <strong>and</strong> $s1, respectively.<br />

Here is the conventional MIPS code for DAXPY:<br />

      l.d    $f0,a($sp)      ;load scalar a
      addiu  $t0,$s0,#512    ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t1,$t0,$s0     ;compute bound
      bne    $t1,$zero,loop  ;check if done

Here is the vector MIPS code for DAXPY:<br />

ANSWER


512 Chapter 6 Parallel Processors from Client to Cloud<br />

      l.d      $f0,a($sp)    ;load scalar a
      lv       $v1,0($s0)    ;load vector x
      mulvs.d  $v2,$v1,$f0   ;vector-scalar multiply
      lv       $v3,0($s1)    ;load vector y
      addv.d   $v4,$v2,$v3   ;add y to product
      sv       $v4,0($s1)    ;store the result
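For reference, the computation that both code sequences implement is just a short loop. Here it is as a C sketch; the function name daxpy and the array names x and y are ours, and the vector length 64 comes from the example.

/* The DAXPY computation that both the scalar and the vector MIPS
   code above implement: Y = a*X + Y over 64 double-precision elements. */
void daxpy(double a, double x[64], double y[64]) {
    for (int i = 0; i < 64; i += 1)
        y[i] = a * x[i] + y[i];
}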

There are some interesting comparisons between the two code segments in<br />

this example. The most dramatic is that the vector processor greatly reduces the<br />

dynamic instruction b<strong>and</strong>width, executing only 6 instructions versus almost 600<br />

for the traditional MIPS architecture. This reduction occurs both because the vector<br />

operations work on 64 elements at a time <strong>and</strong> because the overhead instructions<br />

that constitute nearly half the loop on MIPS are not present in the vector code. As<br />

you might expect, this reduction in instructions fetched <strong>and</strong> executed saves energy.<br />

Another important difference is the frequency of pipeline hazards (Chapter 4).<br />

In the straightforward MIPS code, every add.d must wait for a mul.d, every<br />

s.d must wait for the add.d <strong>and</strong> every add.d <strong>and</strong> mul.d must wait on l.d.<br />

On the vector processor, each vector instruction will only stall for the first element<br />

in each vector, <strong>and</strong> then subsequent elements will flow smoothly down the pipeline.<br />

Thus, pipeline stalls are required only once per vector operation, rather than once<br />

per vector element. In this example, the pipeline stall frequency on MIPS will be<br />

about 64 times higher than it is on the vector version of MIPS. The pipeline stalls<br />

can be reduced on MIPS by using loop unrolling (see Chapter 4). However, the<br />

large difference in instruction b<strong>and</strong>width cannot be reduced.<br />

Since the vector elements are independent, they can be operated on in parallel,<br />

much like subword parallelism for AVX instructions. All modern vector computers<br />

have vector functional units with multiple parallel pipelines (called vector lanes; see<br />

Figures 6.2 <strong>and</strong> 6.3) that can produce two or more results per clock cycle.<br />

Elaboration: The loop in the example above exactly matched the vector length. When<br />

loops are shorter, vector architectures use a register that reduces the length of vector<br />

operations. When loops are larger, we add bookkeeping code to iterate full-length vector<br />

operations <strong>and</strong> to h<strong>and</strong>le the leftovers. This latter process is called strip mining.<br />

Vector versus Scalar<br />

Vector instructions have several important properties compared to conventional<br />

instruction set architectures, which are called scalar architectures in this context:<br />

■ A single vector instruction specifies a great deal of work: it is equivalent

to executing an entire loop. The instruction fetch <strong>and</strong> decode b<strong>and</strong>width<br />

needed is dramatically reduced.<br />

■ By using a vector instruction, the compiler or programmer indicates that the<br />

computation of each result in the vector is independent of the computation of<br />

other results in the same vector, so hardware does not have to check for data<br />

hazards within a vector instruction.<br />

■ Vector architectures and compilers have a reputation for making it much easier than MIMD multiprocessors to write efficient applications when they contain data-level parallelism.


6.3 SISD, MIMD, SIMD, SPMD, <strong>and</strong> Vector 513<br />

■ Hardware need only check for data hazards between two vector instructions<br />

once per vector oper<strong>and</strong>, not once for every element within the vectors.<br />

Reduced checking can save energy as well as time.<br />

■ Vector instructions that access memory have a known access pattern. If<br />

the vector's elements are all adjacent, then fetching the vector from a set

of heavily interleaved memory banks works very well. Thus, the cost of the<br />

latency to main memory is seen only once for the entire vector, rather than<br />

once for each word of the vector.<br />

■ Because an entire loop is replaced by a vector instruction whose behavior<br />

is predetermined, control hazards that would normally arise from the loop<br />

branch are nonexistent.<br />

■ The savings in instruction b<strong>and</strong>width <strong>and</strong> hazard checking plus the efficient<br />

use of memory b<strong>and</strong>width give vector architectures advantages in power <strong>and</strong><br />

energy versus scalar architectures.<br />

For these reasons, vector operations can be made faster than a sequence of<br />

scalar operations on the same number of data items, <strong>and</strong> designers are motivated<br />

to include vector units if the application domain can often use them.<br />

Vector versus Multimedia Extensions<br />

Like multimedia extensions found in the x86 AVX instructions, a vector instruction<br />

specifies multiple operations. However, multimedia extensions typically specify a<br />

few operations while vector specifies dozens of operations. Unlike multimedia<br />

extensions, the number of elements in a vector operation is not in the opcode but in a<br />

separate register. This distinction means different versions of the vector architecture<br />

can be implemented with a different number of elements just by changing the<br />

contents of that register <strong>and</strong> hence retain binary compatibility. In contrast, a new<br />

large set of opcodes is added each time the vector length changes in the multimedia<br />

extension architecture of the x86: MMX, SSE, SSE2, AVX, AVX2, … .<br />

Also unlike multimedia extensions, the data transfers need not be contiguous.<br />

Vectors support both strided accesses, where the hardware loads every nth data<br />

element in memory, <strong>and</strong> indexed accesses, where hardware finds the addresses of<br />

the items to be loaded in a vector register. Indexed accesses are also called gather-scatter,

in that indexed loads gather elements from main memory into contiguous<br />

vector elements <strong>and</strong> indexed stores scatter vector elements across main memory.<br />
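As a sketch of what these two access patterns mean, here they are written as ordinary C loops over a 64-element vector register; the function and variable names are ours, not part of any instruction set.

/* Semantics of strided and indexed (gather) vector loads, written as
   plain C loops over a 64-element vector register v. */
void strided_load(double v[64], const double *mem, long base, long stride) {
    for (int i = 0; i < 64; i += 1)
        v[i] = mem[base + (long)i * stride];   /* every nth element */
}

void gather_load(double v[64], const double *mem, const long index[64]) {
    for (int i = 0; i < 64; i += 1)
        v[i] = mem[index[i]];                  /* addresses come from a vector */
}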

Like multimedia extensions, vector architectures easily capture the flexibility<br />

in data widths, so it is easy to make a vector operation work on 32 64-bit data<br />

elements or 64 32-bit data elements or 128 16-bit data elements or 256 8-bit data<br />

elements. The parallel semantics of a vector instruction allows an implementation<br />

to execute these operations using a deeply pipelined functional unit, an array of<br />

parallel functional units, or a combination of parallel <strong>and</strong> pipelined functional<br />

units. Figure 6.3 illustrates how to improve vector performance by using parallel<br />

pipelines to execute a vector add instruction.<br />

Vector arithmetic instructions usually only allow element N of one vector<br />

register to take part in operations with element N from other vector registers. This


514 Chapter 6 Parallel Processors from Client to Cloud<br />

[Figure 6.3: an element group of a vector add C = A + B, shown executing on (a) a single add pipeline and (b) four parallel add pipelines (lanes); see the caption below.]

vector lane One or more vector functional units and a portion of the vector register file. Inspired by lanes on highways that increase traffic speed, multiple lanes execute vector operations simultaneously.

Check<br />

Yourself<br />

FIGURE 6.3 Using multiple functional units to improve the performance of a single vector<br />

add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline <strong>and</strong> can complete<br />

one addition per cycle. The vector processor (b) on the right has four add pipelines or lanes <strong>and</strong> can complete<br />

four additions per cycle. The elements within a single vector add instruction are interleaved across the four<br />

lanes.<br />

dramatically simplifies the construction of a highly parallel vector unit, which can<br />

be structured as multiple parallel vector lanes. As with a traffic highway, we can<br />

increase the peak throughput of a vector unit by adding more lanes. Figure 6.4<br />

shows the structure of a four-lane vector unit. Thus, going to four lanes from one<br />

lane reduces the number of clocks per vector instruction by roughly a factor of four.<br />

For multiple lanes to be advantageous, both the applications <strong>and</strong> the architecture<br />

must support long vectors. Otherwise, they will execute so quickly that you’ll run<br />

out of instructions, requiring instruction level parallel techniques like those in<br />

Chapter 4 to supply enough vector instructions.<br />

Generally, vector architectures are a very efficient way to execute data parallel<br />

processing programs; they are better matches to compiler technology than<br />

multimedia extensions; <strong>and</strong> they are easier to evolve over time than the multimedia<br />

extensions to the x86 architecture.<br />

Given these classic categories, we next see how to exploit parallel streams of<br />

instructions to improve the performance of a single processor, which we will reuse<br />

with multiple processors.<br />

True or false: As exemplified in the x86, multimedia extensions can be thought of<br />

as a vector architecture with short vectors that supports only contiguous vector<br />

data transfers.


6.4 Hardware Multithreading 517<br />

Simultaneous multithreading (SMT) is a variation on hardware multithreading<br />

that uses the resources of a multiple-issue, dynamically scheduled pipelined<br />

processor to exploit thread-level parallelism at the same time it exploits instruction-level

parallelism (see Chapter 4). The key insight that motivates SMT is that<br />

multiple-issue processors often have more functional unit parallelism available<br />

than most single threads can effectively use. Furthermore, with register renaming<br />

<strong>and</strong> dynamic scheduling (see Chapter 4), multiple instructions from independent<br />

threads can be issued without regard to the dependences among them; the resolution<br />

of the dependences can be h<strong>and</strong>led by the dynamic scheduling capability.<br />

Since SMT relies on the existing dynamic mechanisms, it does not switch<br />

resources every cycle. Instead, SMT is always executing instructions from multiple<br />

threads, leaving it up to the hardware to associate instruction slots <strong>and</strong> renamed<br />

registers with their proper threads.<br />

Figure 6.5 conceptually illustrates the differences in a processor's ability to exploit

superscalar resources for the following processor configurations. The top portion shows<br />


simultaneous multithreading (SMT) A version of multithreading that lowers the cost of multithreading by utilizing the resources needed for multiple issue, dynamically scheduled microarchitecture.

[Figure 6.5 panels: issue slots (horizontal) versus time (vertical) for threads A through D running alone, and for Coarse MT, Fine MT, and SMT.]

FIGURE 6.5 How four threads use the issue slots of a superscalar processor in different<br />

approaches. The four threads at the top show how each would execute running alone on a st<strong>and</strong>ard<br />

superscalar processor without multithreading support. The three examples at the bottom show how they<br />

would execute running together in three multithreading options. The horizontal dimension represents the<br />

instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles.<br />

An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of<br />

gray <strong>and</strong> color correspond to four different threads in the multithreading processors. The additional pipeline<br />

start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss<br />

in throughput for coarse multithreading.


520 Chapter 6 Parallel Processors from Client to Cloud<br />

uniform memory access (UMA) A multiprocessor in which latency to any word in main memory is about the same no matter which processor requests the access.

nonuniform memory access (NUMA) A type of single address space multiprocessor in which some memory accesses are much faster than others depending on which processor asks for which word.

synchronization The process of coordinating the behavior of two or more processes, which may be running on different processors.

lock A synchronization device that allows access to data to only one processor at a time.

nearly always the case for multicore chips, although a more accurate term would

have been shared-address multiprocessor. Processors communicate through shared<br />

variables in memory, with all processors capable of accessing any memory location<br />

via loads <strong>and</strong> stores. Figure 6.7 shows the classic organization of an SMP. Note that<br />

such systems can still run independent jobs in their own virtual address spaces,<br />

even if they all share a physical address space.<br />

Single address space multiprocessors come in two styles. In the first style, the<br />

latency to a word in memory does not depend on which processor asks for it.<br />

Such machines are called uniform memory access (UMA) multiprocessors. In the<br />

second style, some memory accesses are much faster than others, depending on<br />

which processor asks for which word, typically because main memory is divided<br />

<strong>and</strong> attached to different microprocessors or to different memory controllers on<br />

the same chip. Such machines are called nonuniform memory access (NUMA)<br />

multiprocessors. As you might expect, the programming challenges are harder for<br />

a NUMA multiprocessor than for a UMA multiprocessor, but NUMA machines<br />

can scale to larger sizes <strong>and</strong> NUMAs can have lower latency to nearby memory.<br />

As processors operating in parallel will normally share data, they also need to<br />

coordinate when operating on shared data; otherwise, one processor could start<br />

working on data before another is finished with it. This coordination is called<br />

synchronization, which we saw in Chapter 2. When sharing is supported with a<br />

single address space, there must be a separate mechanism for synchronization. One<br />

approach uses a lock for a shared variable. Only one processor at a time can acquire<br />

the lock, <strong>and</strong> other processors interested in shared data must wait until the original<br />

processor unlocks the variable. Section 2.11 of Chapter 2 describes the instructions<br />

for locking in the MIPS instruction set.<br />
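As a software illustration, here is a minimal sketch of a lock protecting a shared variable, written with POSIX threads rather than the MIPS instructions of Section 2.11; the names shared_sum and add_partial_sum are ours, and the program is built with the -pthread option.

#include <pthread.h>
#include <stdio.h>

/* A minimal sketch of lock-based synchronization with POSIX threads
   (the MIPS instructions of Section 2.11 build the same primitive in
   hardware). Only one thread at a time may update the shared sum. */
static double shared_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_partial_sum(void *arg) {
    double my_part = *(double *)arg;
    pthread_mutex_lock(&lock);      /* acquire: other threads must wait */
    shared_sum += my_part;          /* critical section */
    pthread_mutex_unlock(&lock);    /* release the lock */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    double part[4] = {1.0, 2.0, 3.0, 4.0};    /* made-up partial sums */
    for (int i = 0; i < 4; i += 1)
        pthread_create(&t[i], NULL, add_partial_sum, &part[i]);
    for (int i = 0; i < 4; i += 1)
        pthread_join(t[i], NULL);
    printf("shared_sum = %g\n", shared_sum);  /* expect 10 */
    return 0;
}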

[Figure: multiple processors, each with its own cache, connected by an interconnection network to shared memory and I/O.]

FIGURE 6.7 Classic organization of a shared memory multiprocessor.


6.5 Multicore <strong>and</strong> Other Shared Memory Multiprocessors 521<br />

A Simple Parallel Processing Program for a Shared Address Space<br />

Suppose we want to sum 64,000 numbers on a shared memory multiprocessor<br />

computer with uniform memory access time. Let's assume we have 64

processors.<br />

EXAMPLE<br />

The first step is to ensure a balanced load per processor, so we split the set<br />

of numbers into subsets of the same size. We do not allocate the subsets to a<br />

different memory space, since there is a single memory space for this machine;<br />

we just give different starting addresses to each processor. Pn is the number that<br />

identifies the processor, between 0 <strong>and</strong> 63. All processors start the program by<br />

running a loop that sums their subset of numbers:<br />

ANSWER<br />

sum[Pn] = 0;<br />

for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)<br />

sum[Pn] += A[i]; /*sum the assigned areas*/<br />

(Note the C code i += 1 is just a shorter way to say i = i + 1.)<br />

The next step is to add these 64 partial sums. This step is called a reduction,<br />

where we divide to conquer. Half of the processors add pairs of partial sums,<br />

<strong>and</strong> then a quarter add pairs of the new partial sums, <strong>and</strong> so on until we<br />

have the single, final sum. Figure 6.8 illustrates the hierarchical nature of this<br />

reduction.<br />

In this example, the two processors must synchronize before the consumer<br />

processor tries to read the result from the memory location written by the<br />

producer processor; otherwise, the consumer may read the old value of<br />

reduction A function that processes a data structure and returns a single value.

                0          (half = 1)
             0     1       (half = 2)
          0   1   2   3    (half = 4)
        0 1 2 3 4 5 6 7

FIGURE 6.8 The last four levels of a reduction that sums results from each processor,<br />

from bottom to top. For all processors whose number i is less than half, add the sum produced by<br />

processor number (i + half) to its sum.


522 Chapter 6 Parallel Processors from Client to Cloud<br />

the data. We want each processor to have its own version of the loop counter<br />

variable i, so we must indicate that it is a private variable. Here is the code<br />

(half is private also):<br />

half = 64; /*64 processors in multiprocessor*/
do {
   synch(); /*wait for partial sum completion*/
   if (half%2 != 0 && Pn == 0)
      sum[0] += sum[half-1];
      /*Conditional sum needed when half is
        odd; Processor0 gets missing element */
   half = half/2; /*dividing line on who sums */
   if (Pn < half) sum[Pn] += sum[Pn+half];
} while (half > 1); /*exit with final sum in sum[0] */
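The example leaves synch() undefined. One plausible way to provide it, sketched below with a POSIX barrier, is to make every thread wait until all partial sums from the previous step have been written; this is our assumption, not the book's definition of synch().

#include <pthread.h>

/* A possible synch(): a barrier that makes all 64 processors (threads
   here) wait until every partial sum from the previous step is visible. */
static pthread_barrier_t barrier;    /* initialized once with count 64 */

void synch_init(unsigned count) {
    pthread_barrier_init(&barrier, NULL, count);
}

void synch(void) {
    pthread_barrier_wait(&barrier);
}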

Hardware/<br />

Software<br />

Interface<br />

OpenMP An API for shared memory multiprocessing in C, C++, or Fortran that runs on UNIX and Microsoft platforms. It includes compiler directives, a library, and runtime directives.

Given the long-term interest in parallel programming, there have been hundreds<br />

of attempts to build parallel programming systems. A limited but popular example<br />

is OpenMP. It is just an Application Programmer Interface (API) along with a set of<br />

compiler directives, environment variables, <strong>and</strong> runtime library routines that can<br />

extend st<strong>and</strong>ard programming languages. It offers a portable, scalable, <strong>and</strong> simple<br />

programming model for shared memory multiprocessors. Its primary goal is to<br />

parallelize loops <strong>and</strong> to perform reductions.<br />

Most C compilers already have support for OpenMP. The command to use the

OpenMP API with the UNIX C compiler is just:<br />

cc -fopenmp foo.c

OpenMP extends C using pragmas, which are just comm<strong>and</strong>s to the C macro<br />

preprocessor like #define <strong>and</strong> #include. To set the number of processors we<br />

want to use to be 64, as we wanted in the example above, we just use the comm<strong>and</strong><br />

#define P 64 /* define a constant that we’ll use a few times */<br />

#pragma omp parallel num_threads(P)<br />

That is, the runtime libraries should use 64 parallel threads.<br />

To turn the sequential for loop into a parallel for loop that divides the work<br />

equally between all the threads that we told it to use, we just write (assuming sum<br />

is initialized to 0)<br />

#pragma omp parallel for<br />

for (Pn = 0; Pn < P; Pn += 1)
   for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
      sum[Pn] += A[i]; /*sum the assigned areas*/


6.5 Multicore <strong>and</strong> Other Shared Memory Multiprocessors 523<br />

To perform the reduction, we can use another comm<strong>and</strong> that tells OpenMP<br />

what the reduction operator is <strong>and</strong> what variable you need to use to place the result<br />

of the reduction.<br />

#pragma omp parallel for reduction(+ : FinalSum)<br />

for (i = 0; i < P; i += 1)<br />

FinalSum += sum[i]; /* Reduce to a single number */<br />

Note that it is now up to the OpenMP library to find efficient code to sum 64<br />

numbers efficiently using 64 processors.<br />
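Putting the pieces above together, a compilable sketch might look like the following; the test data in A and the final printf are ours, and the program assumes it is built with cc -fopenmp as shown earlier.

#include <stdio.h>
#include <omp.h>

#define P 64                        /* number of threads, as in the example */

double A[64000];
double sum[P];

int main(void) {
    double FinalSum = 0.0;

    for (int i = 0; i < 64000; i += 1) A[i] = 1.0;   /* made-up input */

    #pragma omp parallel for num_threads(P)          /* parallel partial sums */
    for (int Pn = 0; Pn < P; Pn += 1)
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i += 1)
            sum[Pn] += A[i];        /* sum the assigned areas */

    #pragma omp parallel for reduction(+ : FinalSum) /* reduction */
    for (int i = 0; i < P; i += 1)
        FinalSum += sum[i];         /* reduce to a single number */

    printf("FinalSum = %g\n", FinalSum);             /* expect 64000 */
    return 0;
}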

While OpenMP makes it easy to write simple parallel code, it is not very helpful<br />

with debugging, so many parallel programmers use more sophisticated parallel<br />

programming systems than OpenMP, just as many programmers today use more<br />

productive languages than C.<br />

Given this tour of classic MIMD hardware <strong>and</strong> software, our next path is a more<br />

exotic tour of a type of MIMD architecture with a different heritage <strong>and</strong> thus a very<br />

different perspective on the parallel programming challenge.<br />

True or false: Shared memory multiprocessors cannot take advantage of task-level<br />

parallelism.<br />

Check<br />

Yourself<br />

Elaboration: Some writers repurposed the acronym SMP to mean symmetric<br />

multiprocessor, to indicate that the latency from processor to memory was about the<br />

same for all processors. This shift was done to contrast them from large-scale NUMA<br />

multiprocessors, as both classes used a single address space. As clusters proved much<br />

more popular than large-scale NUMA multiprocessors, in this book we restore SMP to<br />

its original meaning, and use it to contrast with systems that use multiple address spaces,

such as clusters.<br />

Elaboration: An alternative to sharing the physical address space would be to have<br />

separate physical address spaces but share a common virtual address space, leaving<br />

it up to the operating system to h<strong>and</strong>le communication. This approach has been tried,<br />

but it has too high an overhead to offer a practical shared memory abstraction to the<br />

performance-oriented programmer.


526 Chapter 6 Parallel Processors from Client to Cloud<br />

registers than do vector processors. Unlike most vector architectures, GPUs also<br />

rely on hardware multithreading within a single multi-threaded SIMD processor<br />

to hide memory latency (see Section 6.4).<br />

A multithreaded SIMD processor is similar to a Vector Processor, but the former<br />

has many parallel functional units instead of just a few that are deeply pipelined,<br />

as does the latter.<br />

As mentioned above, a GPU contains a collection of multithreaded SIMD<br />

processors; that is, a GPU is a MIMD composed of multithreaded SIMD processors.<br />

For example, NVIDIA has four implementations of the Fermi architecture at<br />

different price points with 7, 11, 14, or 15 multithreaded SIMD processors. To<br />

provide transparent scalability across models of GPUs with differing number of<br />

multithreaded SIMD processors, the Thread Block Scheduler hardware assigns<br />

blocks of threads to multithreaded SIMD processors. Figure 6.9 shows a simplified<br />

block diagram of a multithreaded SIMD processor.<br />

Dropping down one more level of detail, the machine object that the hardware<br />

creates, manages, schedules, <strong>and</strong> executes is a thread of SIMD instructions, which<br />

we will also call a SIMD thread. It is a traditional thread, but it contains exclusively<br />

SIMD instructions. These SIMD threads have their own program counters <strong>and</strong><br />

they run on a multithreaded SIMD processor. The SIMD Thread Scheduler includes<br />

a controller that lets it know which threads of SIMD instructions are ready to<br />

run, <strong>and</strong> then it sends them off to a dispatch unit to be run on the multithreaded<br />

[Figure 6.9 datapath: an instruction register feeds 16 SIMD Lanes (Thread Processors); each lane has its own registers (1K × 32) and a load/store unit, and the lanes connect through an address coalescing unit and an interconnection network to 64 KiB of Local Memory and to Global Memory.]

FIGURE 6.9 Simplified block diagram of the datapath of a multithreaded SIMD Processor.<br />

It has 16 SIMD lanes. The SIMD Thread Scheduler has many independent SIMD threads that it chooses from<br />

to run on this processor.


6.6 Introduction to Graphics Processing Units 527<br />

SIMD processor. It is identical to a hardware thread scheduler in a traditional<br />

multithreaded processor (see Section 6.4), except that it is scheduling threads of<br />

SIMD instructions. Thus, GPU hardware has two levels of hardware schedulers:<br />

1. The Thread Block Scheduler that assigns blocks of threads to multithreaded<br />

SIMD processors, <strong>and</strong><br />

2. the SIMD Thread Scheduler within a SIMD processor, which schedules<br />

when SIMD threads should run.<br />

The SIMD instructions of these threads are 32 wide, so each thread of SIMD<br />

instructions would compute 32 of the elements of the computation. Since the<br />

thread consists of SIMD instructions, the SIMD processor must have parallel<br />

functional units to perform the operation. We call them SIMD Lanes, <strong>and</strong> they are<br />

quite similar to the Vector Lanes in Section 6.3.<br />

Elaboration: The number of lanes per SIMD processor varies across GPU generations.<br />

With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 SIMD Lanes,<br />

so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to<br />

complete. Each thread of SIMD instructions is executed in lock step. Staying with the<br />

analogy of a SIMD processor as a vector processor, you could say that it has 16 lanes,<br />

<strong>and</strong> the vector length would be 32. This wide but shallow nature is why we use the term<br />

SIMD processor instead of vector processor, as it is more intuitive.<br />

Since by definition the threads of SIMD instructions are independent, the SIMD

Thread Scheduler can pick whatever thread of SIMD instructions is ready, <strong>and</strong> need not<br />

stick with the next SIMD instruction in the sequence within a single thread. Thus, using<br />

the terminology of Section 6.4, it uses fine-grained multithreading.<br />

To hold these memory elements, a Fermi SIMD processor has an impressive 32,768<br />

32-bit registers. Just like a vector processor, these registers are divided logically across<br />

the vector lanes or, in this case, SIMD Lanes. Each SIMD Thread is limited to no more than<br />

64 registers, so you might think of a SIMD Thread as having up to 64 vector registers,<br />

with each vector register having 32 elements <strong>and</strong> each element being 32 bits wide.<br />

Since Fermi has 16 SIMD Lanes, each contains 2048 registers. Each CUDA Thread<br />

gets one element of each of the vector registers. Note that a CUDA thread is just a<br />

vertical cut of a thread of SIMD instructions, corresponding to one element executed by<br />

one SIMD Lane. Beware that CUDA Threads are very different from POSIX threads; you<br />

can't make arbitrary system calls or synchronize arbitrarily in a CUDA Thread.

NVIDIA GPU Memory Structures<br />

Figure 6.10 shows the memory structures of an NVIDIA GPU. We call the on-chip

memory that is local to each multithreaded SIMD processor Local Memory.<br />

It is shared by the SIMD Lanes within a multithreaded SIMD processor, but this<br />

memory is not shared between multithreaded SIMD processors. We call the off-chip

DRAM shared by the whole GPU <strong>and</strong> all thread blocks GPU Memory.<br />

Rather than rely on large caches to contain the whole working sets of an<br />

application, GPUs traditionally use smaller streaming caches <strong>and</strong> rely on extensive<br />

multithreading of threads of SIMD instructions to hide the long latency to DRAM,


528 Chapter 6 Parallel Processors from Client to Cloud<br />

[Figure 6.10: each CUDA Thread has Per-CUDA Thread Private Memory; each Thread Block has Per-Block Local Memory; a sequence of grids (Grid 0, Grid 1, ...) with Inter-Grid Synchronization shares GPU Memory.]

FIGURE 6.10 GPU Memory structures. GPU Memory is shared by the vectorized loops. All threads<br />

of SIMD instructions within a thread block share Local Memory.<br />

since their working sets can be hundreds of megabytes. Thus, they will not fit<br />

in the last level cache of a multicore microprocessor. Given the use of hardware<br />

multithreading to hide DRAM latency, the chip area used for caches in system<br />

processors is spent instead on computing resources <strong>and</strong> on the large number of<br />

registers to hold the state of the many threads of SIMD instructions.<br />

Elaboration: While hiding memory latency is the underlying philosophy, note that the<br />

latest GPUs <strong>and</strong> vector processors have added caches. For example, the recent Fermi<br />

architecture has added caches, but they are thought of as either bandwidth filters to

reduce dem<strong>and</strong>s on GPU Memory or as accelerators for the few variables whose latency<br />

cannot be hidden by multithreading. Local memory for stack frames, function calls,<br />

<strong>and</strong> register spilling is a good match to caches, since latency matters when calling a<br />

function. Caches can also save energy, since on-chip cache accesses take much less<br />

energy than accesses to multiple, external DRAM chips.


530 Chapter 6 Parallel Processors from Client to Cloud<br />

Type: Program abstractions

Vectorizable Loop (closest old term outside of GPUs: Vectorizable Loop; official CUDA/NVIDIA GPU term: Grid). Book definition: A vectorizable loop, executed on the GPU, made up of one or more Thread Blocks (bodies of vectorized loop) that can execute in parallel.

Body of Vectorized Loop (closest old term: Body of a (Strip-Mined) Vectorized Loop; CUDA/NVIDIA term: Thread Block). Book definition: A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. They can communicate via Local Memory.

Sequence of SIMD Lane Operations (closest old term: One iteration of a Scalar Loop; CUDA/NVIDIA term: CUDA Thread). Book definition: A vertical cut of a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask and predicate register.

Type: Machine object

A Thread of SIMD Instructions (closest old term: Thread of Vector Instructions; CUDA/NVIDIA term: Warp). Book definition: A traditional thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results stored depending on a per-element mask.

SIMD Instruction (closest old term: Vector Instruction; CUDA/NVIDIA term: PTX Instruction). Book definition: A single SIMD instruction executed across SIMD Lanes.

Type: Processing hardware

Multithreaded SIMD Processor (closest old term: (Multithreaded) Vector Processor; CUDA/NVIDIA term: Streaming Multiprocessor). Book definition: A multithreaded SIMD Processor executes threads of SIMD instructions, independent of other SIMD Processors.

Thread Block Scheduler (closest old term: Scalar Processor; CUDA/NVIDIA term: Giga Thread Engine). Book definition: Assigns multiple Thread Blocks (bodies of vectorized loop) to multithreaded SIMD Processors.

SIMD Thread Scheduler (closest old term: Thread scheduler in a Multithreaded CPU; CUDA/NVIDIA term: Warp Scheduler). Book definition: Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute; includes a scoreboard to track SIMD Thread execution.

SIMD Lane (closest old term: Vector lane; CUDA/NVIDIA term: Thread Processor). Book definition: A SIMD Lane executes the operations in a thread of SIMD instructions on a single element. Results stored depending on mask.

Type: Memory hardware

GPU Memory (closest old term: Main Memory; CUDA/NVIDIA term: Global Memory). Book definition: DRAM memory accessible by all multithreaded SIMD Processors in a GPU.

Local Memory (closest old term: Local Memory; CUDA/NVIDIA term: Shared Memory). Book definition: Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors.

SIMD Lane Registers (closest old term: Vector Lane Registers; CUDA/NVIDIA term: Thread Processor Registers). Book definition: Registers in a single SIMD Lane allocated across a full thread block (body of vectorized loop).

FIGURE 6.12 Quick guide to GPU terms. We use the first column for hardware terms. Four groups<br />

cluster these 12 terms. From top to bottom: Program Abstractions, Machine Objects, Processing Hardware,<br />

<strong>and</strong> Memory Hardware.<br />

make more sense when architects ask, given the hardware invested to do graphics<br />

well, how can we supplement it to improve the performance of a wider range of<br />

applications?<br />

Having covered two different styles of MIMD that have a shared address<br />

space, we next introduce parallel processors where each processor has its<br />

own private address space, which makes it much easier to build much larger<br />

systems. The Internet services that you use every day depend on these large scale<br />

systems.


6.7 Clusters, Warehouse Scale <strong>Computer</strong>s, <strong>and</strong> Other Message-Passing Multiprocessors 533<br />

Given that clusters are constructed from whole computers <strong>and</strong> independent,<br />

scalable networks, this isolation also makes it easier to exp<strong>and</strong> the system without<br />

bringing down the application that runs on top of the cluster.<br />

Their lower cost, higher availability, <strong>and</strong> rapid, incremental exp<strong>and</strong>ability make<br />

clusters attractive to Internet service providers, despite their poorer communication

performance when compared to large-scale shared memory multiprocessors. The<br />

search engines that hundreds of millions of us use every day depend upon this<br />

technology. Amazon, Facebook, Google, Microsoft, <strong>and</strong> others all have multiple<br />

datacenters each with clusters of tens of thous<strong>and</strong>s of servers. Clearly, the use of<br />

multiple processors in Internet service companies has been hugely successful.<br />

Warehouse-Scale <strong>Computer</strong>s<br />

Internet services, such as those described above, necessitated the construction<br />

of new buildings to house, power, <strong>and</strong> cool 100,000 servers. Although they may<br />

be classified as just large clusters, their architecture <strong>and</strong> operation are more<br />

sophisticated. They act as one giant computer <strong>and</strong> cost on the order of $150M<br />

for the building, the electrical <strong>and</strong> cooling infrastructure, the servers, <strong>and</strong> the<br />

networking equipment that connects <strong>and</strong> houses 50,000 to 100,000 servers. We<br />

consider them a new class of computer, called Warehouse-Scale <strong>Computer</strong>s (WSC).<br />

Anyone can build a fast<br />

CPU. The trick is to build a<br />

fast system.<br />

Seymour Cray, considered<br />

the father of the<br />

supercomputer.<br />

The most popular framework for batch processing in a WSC is MapReduce [Dean,<br />

2008] <strong>and</strong> its open-source twin Hadoop. Inspired by the Lisp functions of the same<br />

name, Map first applies a programmer-supplied function to each logical input<br />

record. Map runs on thousands of servers to produce an intermediate result of key-value

pairs. Reduce collects the output of those distributed tasks <strong>and</strong> collapses them<br />

using another programmer-defined function. With appropriate software support,<br />

both are highly parallel yet easy to underst<strong>and</strong> <strong>and</strong> to use. Within 30 minutes, a<br />

novice programmer can run a MapReduce task on thous<strong>and</strong>s of servers.<br />

For example, one MapReduce program calculates the number of occurrences of<br />

every English word in a large collection of documents. Below is a simplified version<br />

of that program, which shows just the inner loop <strong>and</strong> assumes just one occurrence<br />

of all English words found in a document:<br />

Hardware/<br />

Software<br />

Interface<br />

map(String key, String value):
   // key: document name
   // value: document contents
   for each word w in value:
      EmitIntermediate(w, “1”); // Produce list of all words

reduce(String key, Iterator values):
   // key: a word
   // values: a list of counts
   int result = 0;
   for each v in values:
      result += ParseInt(v); // get integer from key-value pair
   Emit(AsString(result));


534 Chapter 6 Parallel Processors from Client to Cloud<br />

The function EmitIntermediate used in the Map function emits each<br />

word in the document <strong>and</strong> the value one. Then the Reduce function sums all the<br />

values per word for each document using ParseInt() to get the number of<br />

occurrences per word in all documents. The MapReduce runtime environment<br />

schedules map tasks <strong>and</strong> reduce tasks to the servers of a WSC.<br />
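To make the data flow concrete, here is a single-machine C sketch of the same word count: map emits (word, 1) pairs, a sort plays the role of the shuffle, and reduce sums each group of equal keys. All names and the tiny test documents are ours; a real MapReduce run distributes these steps across thousands of servers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024

struct pair { char word[32]; int count; };
static struct pair pairs[MAX_PAIRS];
static int npairs = 0;

static void emit_intermediate(const char *w) {        /* EmitIntermediate(w, "1") */
    if (npairs >= MAX_PAIRS) return;                   /* drop overflow in this toy */
    strncpy(pairs[npairs].word, w, sizeof pairs[npairs].word - 1);
    pairs[npairs].word[sizeof pairs[npairs].word - 1] = '\0';
    pairs[npairs].count = 1;
    npairs += 1;
}

static void map(const char *value) {                   /* value: document contents */
    char buf[256], *w;
    strncpy(buf, value, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (w = strtok(buf, " "); w != NULL; w = strtok(NULL, " "))
        emit_intermediate(w);
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const struct pair *)a)->word, ((const struct pair *)b)->word);
}

int main(void) {
    map("to be or not to be");                         /* made-up documents */
    map("to see or not to see");

    qsort(pairs, npairs, sizeof pairs[0], cmp);        /* shuffle/sort by key */

    for (int i = 0; i < npairs; ) {                    /* reduce: sum each group */
        int result = 0, j = i;
        while (j < npairs && strcmp(pairs[j].word, pairs[i].word) == 0) {
            result += pairs[j].count;
            j += 1;
        }
        printf("%s %d\n", pairs[i].word, result);      /* word and its count */
        i = j;
    }
    return 0;
}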

software as a service (SaaS) Rather than selling software that is installed and run on customers’ own computers, software is run at a remote site and made available over the Internet typically via a Web interface to customers. SaaS customers are charged based on use versus on ownership.

At this extreme scale, which requires innovation in power distribution, cooling,<br />

monitoring, <strong>and</strong> operations, the WSC is a modern descendant of the 1970s<br />

supercomputers—making Seymour Cray the godfather of today’s WSC architects.<br />

His extreme computers h<strong>and</strong>led computations that could be done nowhere else, but<br />

were so expensive that only a few companies could afford them. This time the target<br />

is providing information technology for the world instead of high performance<br />

computing for scientists <strong>and</strong> engineers. Hence, WSCs surely play a more important<br />

societal role today than Cray’s supercomputers did in the past.<br />

While they share some common goals with servers, WSCs have three major<br />

distinctions:<br />

1. Ample, easy parallelism: A concern for a server architect is whether the<br />

applications in the targeted marketplace have enough parallelism to justify<br />

the amount of parallel hardware <strong>and</strong> whether the cost is too high for sufficient<br />

communication hardware to exploit this parallelism. A WSC architect has<br />

no such concern. First, batch applications like MapReduce benefit from the<br />

large number of independent data sets that need independent processing,<br />

such as billions of Web pages from a Web crawl. Second, interactive Internet<br />

service applications, also known as Software as a Service (SaaS), can benefit<br />

from millions of independent users of interactive Internet services. Reads<br />

<strong>and</strong> writes are rarely dependent in SaaS, so SaaS rarely needs to synchronize.<br />

For example, search uses a read-only index <strong>and</strong> email is normally reading<br />

<strong>and</strong> writing independent information. We call this type of easy parallelism<br />

Request-Level Parallelism, as many independent efforts can proceed in<br />

parallel naturally with little need for communication or synchronization.<br />

2. Operational Costs Count: Traditionally, server architects design their systems<br />

for peak performance within a cost budget <strong>and</strong> worry about energy only to<br />

make sure they don’t exceed the cooling capacity of their enclosure. They<br />

usually ignored operational costs of a server, assuming that they pale in<br />

comparison to purchase costs. WSCs have longer lifetimes (the building and electrical and cooling infrastructure are often amortized over 10 or more years), so the operational costs add up: energy, power distribution, and

cooling represent more than 30% of the costs of a WSC over 10 years.<br />

3. Scale <strong>and</strong> the Opportunities/Problems Associated with Scale: To construct a<br />

single WSC, you must purchase 100,000 servers along with the supporting<br />

infrastructure, which means volume discounts. Hence, WSCs are so massive



internally that you get economy of scale even if there are not many WSCs.<br />

These economies of scale led to cloud computing, as the lower per unit costs<br />

of a WSC meant that cloud companies could rent servers at a profitable rate<br />

<strong>and</strong> still be below what it costs outsiders to do it themselves. The flip side<br />

of the economic opportunity of scale is the need to cope with the failure<br />

frequency of scale. Even if a server had a Mean Time To Failure of an amazing<br />

25 years (200,000 hours), the WSC architect would need to design for 5<br />

server failures every day. Section 5.15 mentioned annualized disk failure rate<br />

(AFR) was measured at Google at 2% to 4%. If there were 4 disks per server<br />

<strong>and</strong> their annual failure rate was 2%, the WSC architect should expect to see<br />

one disk fail every hour. Thus, fault tolerance is even more important for the<br />

WSC architect than the server architect.<br />
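As a back-of-the-envelope check of the disk numbers in the item above (an estimate, not a measured result):

   100,000 servers × 4 disks/server = 400,000 disks
   400,000 disks × 2% annual failure rate = 8000 disk failures per year
   8000 failures / 8760 hours per year ≈ 0.9 failures per hour,

or roughly one failed disk somewhere in the WSC every hour.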

The economies of scale uncovered by WSC have realized the long dreamed of<br />

goal of computing as a utility. Cloud computing means anyone anywhere with good<br />

ideas, a business model, <strong>and</strong> a credit card can tap thous<strong>and</strong>s of servers to deliver<br />

their vision almost instantly around the world. Of course, there are important<br />

obstacles that could limit the growth of cloud computing—such as security,<br />

privacy, st<strong>and</strong>ards, <strong>and</strong> the rate of growth of Internet b<strong>and</strong>width—but we foresee<br />

them being addressed so that WSCs <strong>and</strong> cloud computing can flourish.<br />

To put the growth rate of cloud computing into perspective, in 2012 Amazon<br />

Web Services announced that it adds enough new server capacity every day to<br />

support all of Amazon’s global infrastructure as of 2003, when Amazon was a<br />

$5.2Bn annual revenue enterprise with 6000 employees.<br />

Now that we underst<strong>and</strong> the importance of message-passing multiprocessors,<br />

especially for cloud computing, we next cover ways to connect the nodes of a WSC<br />

together. Thanks to Moore’s Law <strong>and</strong> the increasing number of cores per chip, we<br />

now need networks inside a chip as well, so these topologies are important in the<br />

small as well as in the large.<br />

Elaboration: The MapReduce framework shuffles and sorts the key-value pairs at the

end of the Map phase to produce groups that all share the same key. These groups are<br />

then passed to the Reduce phase.<br />

Elaboration: Another form of large scale computing is grid computing, where the<br />

computers are spread across large areas, <strong>and</strong> then the programs that run across them<br />

must communicate via long haul networks. The most popular <strong>and</strong> unique form of grid<br />

computing was pioneered by the SETI@home project. As millions of PCs are idle at<br />

any one time doing nothing useful, they could be harvested <strong>and</strong> put to good uses if<br />

someone developed software that could run on those computers <strong>and</strong> then gave each PC<br />

an independent piece of the problem to work on. The first example was the Search for

ExtraTerrestrial Intelligence (SETI), which was launched at UC Berkeley in 1999. Over 5<br />

million computer users in more than 200 countries have signed up for SETI@home, with<br />

more than 50% outside the US. By the end of 2011, the average performance of the<br />

SETI@home grid was 3.5 PetaFLOPS.


6.8 Introduction to Multiprocessor Network Topologies

Because there are numerous topologies to choose from, performance metrics<br />

are needed to distinguish these designs. Two are popular. The first is total network<br />

b<strong>and</strong>width, which is the b<strong>and</strong>width of each link multiplied by the number of links.<br />

This represents the peak b<strong>and</strong>width. For the ring network above, with P processors,<br />

the total network b<strong>and</strong>width would be P times the b<strong>and</strong>width of one link; the total<br />

network b<strong>and</strong>width of a bus is just the b<strong>and</strong>width of that bus.<br />

To balance this best b<strong>and</strong>width case, we include another metric that is closer to<br />

the worst case: the bisection b<strong>and</strong>width. This metric is calculated by dividing the<br />

machine into two halves. Then you sum the b<strong>and</strong>width of the links that cross that<br />

imaginary dividing line. The bisection b<strong>and</strong>width of a ring is two times the link<br />

b<strong>and</strong>width. It is one times the link b<strong>and</strong>width for the bus. If a single link is as fast<br />

as the bus, the ring is only twice as fast as a bus in the worst case, but it is P times<br />

faster in the best case.<br />

Since some network topologies are not symmetric, the question arises<br />

of where to draw the imaginary line when bisecting the machine. Bisection<br />

b<strong>and</strong>width is a worst-case metric, so the answer is to choose the division that<br />

yields the most pessimistic network performance. Stated alternatively, calculate<br />

all possible bisection b<strong>and</strong>widths <strong>and</strong> pick the smallest. We take this pessimistic<br />

view because parallel programs are often limited by the weakest link in the<br />

communication chain.<br />

At the other extreme from a ring is a fully connected network, where every<br />

processor has a bidirectional link to every other processor. For fully connected<br />

networks, the total network bandwidth is P × (P – 1)/2, and the bisection bandwidth is (P/2)².
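The small C program below tabulates the two metrics for a ring, a bus, and a fully connected network, using the formulas just given. It is only an illustration; bandwidths are expressed in units of "links," so multiply by the bandwidth of one link to get bytes per second.

#include <stdio.h>

int main(void) {
    int P = 64;                                 /* number of processors            */
    int ring_total = P;                         /* a ring has P links              */
    int ring_bisection = 2;                     /* cutting a ring crosses 2 links  */
    int bus_total = 1, bus_bisection = 1;       /* one shared bus                  */
    int full_total = P * (P - 1) / 2;           /* one link per pair of nodes      */
    int full_bisection = (P / 2) * (P / 2);     /* (P/2)^2 links cross the cut     */

    printf("P = %d\n", P);
    printf("ring:            total = %5d  bisection = %5d\n", ring_total, ring_bisection);
    printf("bus:             total = %5d  bisection = %5d\n", bus_total, bus_bisection);
    printf("fully connected: total = %5d  bisection = %5d\n", full_total, full_bisection);
    return 0;
}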

The tremendous improvement in performance of fully connected networks is<br />

offset by the tremendous increase in cost. This consequence inspires engineers<br />

to invent new topologies that are between the cost of rings <strong>and</strong> the performance<br />

of fully connected networks. The evaluation of success depends in large part on<br />

the nature of the communication in the workload of parallel programs run on the<br />

computer.<br />

The number of different topologies that have been discussed in publications<br />

would be difficult to count, but only a few have been used in commercial parallel<br />

processors. Figure 6.14 illustrates two of the popular topologies.<br />

An alternative to placing a processor at every node in a network is to leave only<br />

the switch at some of these nodes. The switches are smaller than processor-memory-switch

nodes, <strong>and</strong> thus may be packed more densely, thereby lessening distance <strong>and</strong><br />

increasing performance. Such networks are frequently called multistage networks<br />

to reflect the multiple steps that a message may travel. Types of multistage networks<br />

are as numerous as single-stage networks; Figure 6.15 illustrates two of the popular<br />

multistage organizations. A fully connected or crossbar network allows any<br />

node to communicate with any other node in one pass through the network. An<br />

Omega network uses less hardware than the crossbar network (2n log₂ n versus n² switches), but contention can occur between messages, depending on the pattern

network bandwidth Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.

bisection bandwidth The bandwidth between two equal parts of a multiprocessor. This measure is for a worst case split of the multiprocessor.

fully connected network A network that connects processor-memory nodes by supplying a dedicated communication link between every node.

multistage network A network that supplies a small switch at each node.

crossbar network A network that allows any node to communicate with any other node in one pass through the network.



After covering the performance of networks at a low level of detail in this online

section, the next section shows how to benchmark multiprocessors of all kinds<br />

with much higher-level programs.<br />

6.10 Multiprocessor Benchmarks and Performance Models

As we saw in Chapter 1, benchmarking systems is always a sensitive topic, because<br />

it is a highly visible way to try to determine which system is better. The results affect<br />

not only the sales of commercial systems, but also the reputation of the designers<br />

of those systems. Hence, all participants want to win the competition, but they also<br />

want to be sure that if someone else wins, they deserve to win because they have<br />

a genuinely better system. This desire leads to rules to ensure that the benchmark<br />

results are not simply engineering tricks for that benchmark, but are instead<br />

advances that improve performance of real applications.<br />

To avoid possible tricks, a typical rule is that you can't change the benchmark.

The source code <strong>and</strong> data sets are fixed, <strong>and</strong> there is a single proper answer. Any<br />

deviation from those rules makes the results invalid.<br />

Many multiprocessor benchmarks follow these traditions. A common exception<br />

is to be able to increase the size of the problem so that you can run the benchmark<br />

on systems with a widely different number of processors. That is, many benchmarks<br />

allow weak scaling rather than require strong scaling, even though you must take<br />

care when comparing results for programs running different problem sizes.<br />

Figure 6.16 gives a summary of several parallel benchmarks, also described below:<br />

■ Linpack is a collection of linear algebra routines, <strong>and</strong> the routines for<br />

performing Gaussian elimination constitute what is known as the Linpack<br />

benchmark. The DGEMM routine in the example on page 215 represents a<br />

small fraction of the source code of the Linpack benchmark, but it accounts<br />

for most of the execution time for the benchmark. It allows weak scaling,<br />

letting the user pick any size problem. Moreover, it allows the user to rewrite<br />

Linpack in almost any form <strong>and</strong> in any language, as long as it computes the<br />

proper result <strong>and</strong> performs the same number of floating point operations<br />

for a given problem size. Twice a year, the 500 computers with the fastest<br />

Linpack performance are published at www.top500.org. The first on this list<br />

is considered by the press to be the world's fastest computer.

■ SPECrate is a throughput metric based on the SPEC CPU benchmarks,<br />

such as SPEC CPU 2006 (see Chapter 1). Rather than report performance<br />

of the individual programs, SPECrate runs many copies of the program<br />

simultaneously. Thus, it measures task-level parallelism, as there is no



Pthreads A UNIX API for creating and manipulating threads. It is structured as a library.

■ The NAS (NASA Advanced Supercomputing) parallel benchmarks were<br />

another attempt from the 1990s to benchmark multiprocessors. Taken from<br />

computational fluid dynamics, they consist of five kernels. They allow weak<br />

scaling by defining a few data sets. Like Linpack, these benchmarks can be<br />

rewritten, but the rules require that the programming language can only be C<br />

or Fortran.<br />

■ The recent PARSEC (Princeton Application Repository for Shared Memory<br />

<strong>Computer</strong>s) benchmark suite consists of multithreaded programs that use<br />

Pthreads (POSIX threads) <strong>and</strong> OpenMP (Open MultiProcessing; see<br />

Section 6.5). They focus on emerging computational domains <strong>and</strong> consist of<br />

nine applications <strong>and</strong> three kernels. Eight rely on data parallelism, three rely<br />

on pipelined parallelism, <strong>and</strong> one on unstructured parallelism.<br />

■ On the cloud front, the goal of the Yahoo! Cloud Serving Benchmark (YCSB)<br />

is to compare performance of cloud data services. It offers a framework that<br />

makes it easy for a client to benchmark new data services, using Cass<strong>and</strong>ra<br />

<strong>and</strong> HBase as representative examples. [Cooper, 2010]<br />

The downside of such traditional restrictions to benchmarks is that innovation is<br />

chiefly limited to the architecture <strong>and</strong> compiler. Better data structures, algorithms,<br />

programming languages, <strong>and</strong> so on often cannot be used, since that would give a<br />

misleading result. The system could win because of, say, the algorithm, <strong>and</strong> not<br />

because of the hardware or the compiler.<br />

While these guidelines are underst<strong>and</strong>able when the foundations of computing<br />

are relatively stable, as they were in the 1990s and the first half of this decade,

they are undesirable during a programming revolution. For this revolution to<br />

succeed, we need to encourage innovation at all levels.<br />

Researchers at the University of California at Berkeley have advocated one<br />

approach. They identified 13 design patterns that they claim will be part of<br />

applications of the future. Frameworks or kernels implement these design<br />

patterns. Examples are sparse matrices, structured grids, finite-state machines,<br />

map reduce, <strong>and</strong> graph traversal. By keeping the definitions at a high level, they<br />

hope to encourage innovations at any level of the system. Thus, the system with the<br />

fastest sparse matrix solver is welcome to use any data structure, algorithm, <strong>and</strong><br />

programming language, in addition to novel architectures <strong>and</strong> compilers.<br />

Performance Models<br />

A topic related to benchmarks is performance models. As we have seen with the<br />

increasing architectural diversity in this chapter—multithreading, SIMD, GPUs—<br />

it would be especially helpful if we had a simple model that offered insights into the<br />

performance of different architectures. It need not be perfect, just insightful.<br />

The 3Cs for cache performance from Chapter 5 is an example performance<br />

model. It is not a perfect performance model, since it ignores potentially important



The Roofline Model<br />

This simple model ties floating-point performance, arithmetic intensity, <strong>and</strong> memory<br />

performance together in a two-dimensional graph [Williams, Waterman, <strong>and</strong><br />

Patterson 2009]. Peak floating-point performance can be found using the hardware<br />

specifications mentioned above. The working sets of the kernels we consider here<br />

do not fit in on-chip caches, so peak memory performance may be defined by the<br />

memory system behind the caches. One way to find the peak memory performance<br />

is the Stream benchmark. (See the Elaboration on page 381 in Chapter 5).<br />

Figure 6.18 shows the model, which is done once for a computer, not for each<br />

kernel. The vertical Y-axis is achievable floating-point performance from 0.5 to<br />

64.0 GFLOPs/second. The horizontal X-axis is arithmetic intensity, varying from<br />

1/8 FLOPs/DRAM byte accessed to 16 FLOPs/DRAM byte accessed. Note that the<br />

graph is a log-log scale.<br />

For a given kernel, we can find a point on the X-axis based on its arithmetic<br />

intensity. If we draw a vertical line through that point, the performance of the kernel<br />

on that computer must lie somewhere along that line. We can plot a horizontal line<br />

showing peak floating-point performance of the computer. Obviously, the actual<br />

floating-point performance can be no higher than the horizontal line, since that is<br />

a hardware limit.<br />

[Figure 6.18 plot: attainable GFLOPs/second (Y-axis, log scale from 0.5 to 64.0) versus arithmetic intensity in FLOPs/byte (X-axis, 1/8 to 16). The two roofs are peak memory bandwidth (Stream) and peak floating-point performance; Kernel 1 is memory-bandwidth limited and Kernel 2 is computation limited.]

FIGURE 6.18 Roofline Model [Williams, Waterman, <strong>and</strong> Patterson 2009]. This example has a<br />

peak floating-point performance of 16 GFLOPS/sec <strong>and</strong> a peak memory b<strong>and</strong>width of 16 GB/sec from the<br />

Stream benchmark. (Since Stream is actually four measurements, this line is the average of the four.) The<br />

dotted vertical line in color on the left represents Kernel 1, which has an arithmetic intensity of 0.5 FLOPs/<br />

byte. It is limited by memory b<strong>and</strong>width to no more than 8 GFLOPS/sec on this Opteron X2. The dotted<br />

vertical line to the right represents Kernel 2, which has an arithmetic intensity of 4 FLOPs/byte. It is limited<br />

only computationally to 16 GFLOPS/s. (This data is based on the AMD Opteron X2 (Revision F) using dual<br />

cores running at 2 GHz in a dual socket system.)



How could we plot the peak memory performance, which is measured in bytes/<br />

second? Since the X-axis is FLOPs/byte <strong>and</strong> the Y-axis FLOPs/second, bytes/second<br />

is just a diagonal line at a 45-degree angle in this figure. Hence, we can plot a third<br />

line that gives the maximum floating-point performance that the memory system<br />

of that computer can support for a given arithmetic intensity. We can express the<br />

limits as a formula to plot the line in the graph in Figure 6.18:<br />

Attainable GFLOPs/sec = Min (Peak Memory BW × Arithmetic Intensity, Peak<br />

Floating-Point Performance)<br />

The horizontal <strong>and</strong> diagonal lines give this simple model its name <strong>and</strong> indicate its<br />

value. The roofline sets an upper bound on performance of a kernel depending on<br />

its arithmetic intensity. Given a roofline of a computer, you can apply it repeatedly,<br />

since it doesn't vary by kernel.
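As a small illustration, the sketch below evaluates the Min() bound above in C for the two kernels of Figure 6.18, using that figure's example machine (16 GFLOP/s peak and 16 GB/s Stream bandwidth). The function name and harness are ours, not part of any benchmark.

#include <stdio.h>

/* Attainable GFLOP/s = Min(Peak Memory BW x Arithmetic Intensity,
                            Peak Floating-Point Performance)             */
static double roofline(double peak_gflops, double peak_bw_gbytes, double ai) {
    double memory_bound = peak_bw_gbytes * ai;      /* the slanted part of the roof */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double peak = 16.0, bw = 16.0;                  /* Figure 6.18 example machine  */
    double intensity[] = { 0.5, 4.0 };              /* Kernel 1 and Kernel 2        */
    for (int i = 0; i < 2; i++) {
        double ai = intensity[i];
        printf("intensity %.2f FLOPs/byte -> at most %.1f GFLOP/s (%s bound)\n",
               ai, roofline(peak, bw, ai),
               bw * ai < peak ? "memory-bandwidth" : "computation");
    }
    return 0;
}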

If we think of arithmetic intensity as a pole that hits the roof, either it hits<br />

the slanted part of the roof, which means performance is ultimately limited by<br />

memory b<strong>and</strong>width, or it hits the flat part of the roof, which means performance is<br />

computationally limited. In Figure 6.18, kernel 1 is an example of the former, <strong>and</strong><br />

kernel 2 is an example of the latter.<br />

Note that the ridge point, where the diagonal <strong>and</strong> horizontal roofs meet, offers<br />

an interesting insight into the computer. If it is far to the right, then only kernels<br />

with very high arithmetic intensity can achieve the maximum performance of<br />

that computer. If it is far to the left, then almost any kernel can potentially hit the<br />

maximum performance.<br />

Comparing Two Generations of Opterons<br />

The AMD Opteron X4 (Barcelona) with four cores is the successor to the Opteron<br />

X2 with two cores. To simplify board design, they use the same socket. Hence, they<br />

have the same DRAM channels <strong>and</strong> thus the same peak memory b<strong>and</strong>width. In<br />

addition to doubling the number of cores, the Opteron X4 also has twice the peak<br />

floating-point performance per core: Opteron X4 cores can issue two floating-point<br />

SSE2 instructions per clock cycle, while Opteron X2 cores issue at most one. As the<br />

two systems we are comparing have similar clock rates (2.2 GHz for the Opteron X2 versus 2.3 GHz for the Opteron X4), the Opteron X4 has about four times the peak

floating-point performance of the Opteron X2 with the same DRAM b<strong>and</strong>width.<br />

The Opteron X4 also has a 2MiB L3 cache, which is not found in the Opteron X2.<br />

In Figure 6.19 the roofline models for both systems are compared. As we would<br />

expect, the ridge point moves to the right, from 1 in the Opteron X2 to 5 in the<br />

Opteron X4. Hence, to see a performance gain in the next generation, kernels need<br />

an arithmetic intensity higher than 1, or their working sets must fit in the caches<br />

of the Opteron X4.<br />
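A rough way to see why the ridge point moves is to note that it sits where the two roofs meet, that is, at an arithmetic intensity of approximately

   ridge point ≈ peak floating-point performance ÷ peak memory bandwidth.

Using the Figure 6.18 numbers for the Opteron X2, 16 GFLOP/s ÷ 16 GB/s ≈ 1 FLOP/byte. Quadrupling the floating-point peak while keeping the same DRAM bandwidth pushes that estimate to roughly 4 FLOPs/byte, in line with the shift from 1 toward 5 described above; this is only an estimate, since the measured Stream bandwidths of the two systems differ slightly.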

The roofline model gives an upper bound to performance. Suppose your<br />

program is far below that bound. What optimizations should you perform, <strong>and</strong> in<br />

what order?



Elaboration: The ceilings are ordered so that lower ceilings are easier to optimize. Clearly, a programmer can optimize in any order, but following this sequence reduces the chances of wasting effort on an optimization that has no benefit due to other constraints. Like the 3Cs model, as long as the roofline model delivers on insights, a model can have assumptions that may prove optimistic. For example, roofline assumes the load is balanced between all processors.

Elaboration: An alternative to the Stream benchmark is to use the raw DRAM bandwidth as the roofline. While the raw bandwidth definitely is a hard upper bound, actual memory performance is often so far from that boundary that it's not that useful. That is, no program can go close to that bound. The downside to using Stream is that very careful programming may exceed the Stream results, so the memory roofline may not be as hard a limit as the computational roofline. We stick with Stream because few programmers will be able to deliver more memory bandwidth than Stream discovers.

Elaboration: Although the roofline model shown is for multicore processors, it clearly would work for a uniprocessor as well.

Check Yourself

True or false: The main drawback with conventional approaches to benchmarks<br />

for parallel computers is that the rules that ensure fairness also slow software<br />

innovation.<br />

6.11 Real Stuff: Benchmarking and Rooflines of the Intel Core i7 960 and the NVIDIA Tesla GPU

A group of Intel researchers published a paper [Lee et al., 2010] comparing a<br />

quad-core Intel Core i7 960 with multimedia SIMD extensions to the previous<br />

generation GPU, the NVIDIA Tesla GTX 280. Figure 6.22 lists the characteristics<br />

of the two systems. Both products were purchased in Fall 2009. The Core i7 is<br />

in Intels 45-nanometer semiconductor technology while the GPU is in TSMCs<br />

65-nanometer technology. Although it might have been fairer to have a comparison<br />

by a neutral party or by both interested parties, the purpose of this section is not to<br />

determine how much faster one product is than another, but to try to underst<strong>and</strong><br />

the relative value of features of these two contrasting architecture styles.<br />

The rooflines of the Core i7 960 <strong>and</strong> GTX 280 in Figure 6.23 illustrate the<br />

differences in the computers. Not only does the GTX 280 have much higher<br />

memory b<strong>and</strong>width <strong>and</strong> double-precision floating-point performance, but also its<br />

double-precision ridge point is considerably to the left. The double-precision ridge<br />

point is 0.6 for the GTX 280 versus 3.1 for the Core i7. As mentioned above, it is<br />

much easier to hit peak computational performance the further the ridge point of



                                                        Core i7-960   GTX 280      GTX 480       Ratio 280/i7   Ratio 480/i7
Number of processing elements (cores or SMs)            4             30           15            7.5            3.8
Clock frequency (GHz)                                   3.2           1.3          1.4           0.41           0.44
Die size                                                263           576          520           2.2            2.0
Technology                                              Intel 45 nm   TSMC 65 nm   TSMC 40 nm    1.6            1.0
Power (chip, not module)                                130           130          167           1.0            1.3
Transistors                                             700 M         1400 M       3030 M        2.0            4.4
Memory bandwidth (GBytes/sec)                           32            141          177           4.4            5.5
Single-precision SIMD width                             4             8            32            2.0            8.0
Double-precision SIMD width                             2             1            16            0.5            8.0
Peak single-precision scalar FLOPS (GFLOP/sec)          26            117          63            4.6            2.5
Peak single-precision SIMD FLOPS (GFLOP/sec)            102           311 to 933   515 or 1344   3.0–9.1        6.6–13.1
  (SP 1 add or multiply)                                N.A.          (311)        (515)         (3.0)          (6.6)
  (SP 1 instruction fused multiply-adds)                N.A.          (622)        (1344)        (6.1)          (13.1)
  (Rare SP dual issue fused multiply-add and multiply)  N.A.          (933)        N.A.          (9.1)          –
Peak double-precision SIMD FLOPS (GFLOP/sec)            51            78           515           1.5            10.1

FIGURE 6.22 Intel Core i7-960, NVIDIA GTX 280, <strong>and</strong> GTX 480 specifications. The rightmost columns show the ratios of the<br />

Tesla GTX 280 <strong>and</strong> the Fermi GTX 480 to Core i7. Although the case study is between the Tesla 280 <strong>and</strong> i7, we include the Fermi 480 to show<br />

its relationship to the Tesla 280 since it is described in this chapter. Note that these memory b<strong>and</strong>widths are higher than in Figure 6.23 because<br />

these are DRAM pin b<strong>and</strong>widths <strong>and</strong> those in Figure 6.23 are at the processors as measured by a benchmark program. (From Table 2 in Lee<br />

et al. [2010].)<br />

the roofline is to the left. For single-precision performance, the ridge point moves<br />

far to the right for both computers, so it's much harder to hit the roof of single-precision

performance. Note that the arithmetic intensity of the kernel is based on<br />

the bytes that go to main memory, not the bytes that go to cache memory. Thus,<br />

as mentioned above, caching can change the arithmetic intensity of a kernel on a<br />

particular computer, if most references really go to the cache. Note also that this<br />

b<strong>and</strong>width is for unit-stride accesses in both architectures. Real gather-scatter<br />

addresses can be slower on the GTX 280 <strong>and</strong> on the Core i7, as we shall see.<br />
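The same back-of-the-envelope estimate used above reproduces the double-precision ridge points quoted here from the Figure 6.23 peaks:

   GTX 280:      78 GFLOP/s ÷ 127 GB/s  ≈ 0.6 FLOPs/byte
   Core i7 960:  51.2 GFLOP/s ÷ 16.4 GB/s ≈ 3.1 FLOPs/byte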

The researchers selected the benchmark programs by analyzing the computational<br />

<strong>and</strong> memory characteristics of four recently proposed benchmark suites <strong>and</strong> then<br />

formulated the set of throughput computing kernels that capture these characteristics.<br />

Figure 6.24 shows the performance results, with larger numbers meaning faster. The<br />

Rooflines help explain the relative performance in this case study.<br />

Given that the raw performance specifications of the GTX 280 vary from 2.5 ×<br />

slower (clock rate) to 7.5 × faster (cores per chip) while the performance varies



[Figure 6.23 plots: four rooflines of attainable GFLOP/s versus arithmetic intensity (1/8 to 32). Top row, double precision: Core i7 960 (Nehalem) with Stream = 16.4 GB/s and a 51.2 GF/s ceiling; NVIDIA GTX 280 with Stream = 127 GB/s and a 78 GF/s peak. Bottom row, single precision: Core i7 960 with 102.4 GF/s single-precision and 51.2 GF/s double-precision ceilings; GTX 280 with 624 GF/s single-precision and 78 GF/s double-precision ceilings.]

FIGURE 6.23 Roofline model [Williams, Waterman, <strong>and</strong> Patterson 2009]. These rooflines show double-precision floating-point<br />

performance in the top row <strong>and</strong> single-precision performance in the bottom row. (The DP FP performance ceiling is also in the bottom row<br />

to give perspective.) The Core i7 960 on the left has a peak DP FP performance of 51.2 GFLOP/sec, a SP FP peak of 102.4 GFLOP/sec, <strong>and</strong> a<br />

peak memory b<strong>and</strong>width of 16.4 GBytes/sec. The NVIDIA GTX 280 has a DP FP peak of 78 GFLOP/sec, SP FP peak of 624 GFLOP/sec, <strong>and</strong><br />

127 GBytes/sec of memory b<strong>and</strong>width. The dashed vertical line on the left represents an arithmetic intensity of 0.5 FLOP/byte. It is limited by<br />

memory b<strong>and</strong>width to no more than 8 DP GFLOP/sec or 8 SP GFLOP/sec on the Core i7. The dashed vertical line to the right has an arithmetic<br />

intensity of 4 FLOP/byte. It is limited only computationally to 51.2 DP GFLOP/sec <strong>and</strong> 102.4 SP GFLOP/sec on the Core i7 <strong>and</strong> 78 DP GFLOP/<br />

sec and 512 SP GFLOP/sec on the GTX 280. To hit the highest computation rate on the Core i7 you need to use all 4 cores and SSE instructions

with an equal number of multiplies <strong>and</strong> adds. For the GTX 280, you need to use fused multiply-add instructions on all multithreaded SIMD<br />

processors.



Kernel   Units                  Core i7-960   GTX 280   GTX 280 / i7-960
SGEMM    GFLOP/sec              94            364       3.9
MC       Billion paths/sec      0.8           1.4       1.8
Conv     Million pixels/sec     1250          3500      2.8
FFT      GFLOP/sec              71.4          213       3.0
SAXPY    GBytes/sec             16.8          88.8      5.3
LBM      Million lookups/sec    85            426       5.0
Solv     Frames/sec             103           52        0.5
SpMV     GFLOP/sec              4.9           9.1       1.9
GJK      Frames/sec             67            1020      15.2
Sort     Million elements/sec   250           198       0.8
RC       Frames/sec             5             8.1       1.6
Search   Million queries/sec    50            90        1.8
Hist     Million pixels/sec     1517          2583      1.7
Bilat    Million pixels/sec     83            475       5.7

FIGURE 6.24 Raw <strong>and</strong> relative performance measured for the two platforms. In this study,<br />

SAXPY is just used as a measure of memory b<strong>and</strong>width, so the right unit is GBytes/sec <strong>and</strong> not GFLOP/sec.<br />

(Based on Table 3 in [Lee et al., 2010].)<br />

from 2.0 × slower (Solv) to 15.2 × faster (GJK), the Intel researchers decided to<br />

find the reasons for the differences:<br />

■ Memory b<strong>and</strong>width. The GPU has 4.4 × the memory b<strong>and</strong>width, which helps<br />

explain why LBM <strong>and</strong> SAXPY run 5.0 <strong>and</strong> 5.3 × faster; their working sets are<br />

hundreds of megabytes and hence don't fit into the Core i7 cache. (So as to

access memory intensively, they purposely did not use cache blocking as in<br />

Chapter 5.) Hence, the slope of the rooflines explains their performance. SpMV<br />

also has a large working set, but it only runs 1.9 × faster because the double-precision floating point of the GTX 280 is only 1.5 × as fast as that of the Core i7.

■ Compute b<strong>and</strong>width. Five of the remaining kernels are compute bound:<br />

SGEMM, Conv, FFT, MC, <strong>and</strong> Bilat. The GTX is faster by 3.9, 2.8, 3.0, 1.8, <strong>and</strong><br />

5.7 ×, respectively. The first three of these use single-precision floating-point<br />

arithmetic, <strong>and</strong> GTX 280 single precision is 3 to 6 × faster. MC uses double<br />

precision, which explains why it's only 1.8 × faster since DP performance

is only 1.5 × faster. Bilat uses transcendental functions, which the GTX<br />

280 supports directly. The Core i7 spends two-thirds of its time calculating<br />

transcendental functions for Bilat, so the GTX 280 is 5.7 × faster. This<br />

observation helps point out the value of hardware support for operations that<br />

occur in your workload: double-precision floating point <strong>and</strong> perhaps even<br />

transcendentals.



■ Cache benefits. Ray casting (RC) is only 1.6 × faster on the GTX because<br />

cache blocking with the Core i7 caches prevents it from becoming memory<br />

b<strong>and</strong>width bound (see Sections 5.4 <strong>and</strong> 5.14), as it is on GPUs. Cache<br />

blocking can help Search, too. If the index trees are small so that they fit in<br />

the cache, the Core i7 is twice as fast. Larger index trees make them memory<br />

b<strong>and</strong>width bound. Overall, the GTX 280 runs search 1.8 × faster. Cache<br />

blocking also helps Sort. While most programmers wouldn't run Sort on

a SIMD processor, it can be written with a 1-bit Sort primitive called split.<br />

However, the split algorithm executes many more instructions than a scalar<br />

sort does. As a result, the Core i7 runs 1.25 × as fast as the GTX 280. Note<br />

that caches also help other kernels on the Core i7, since cache blocking allows<br />

SGEMM, FFT, <strong>and</strong> SpMV to become compute bound. This observation reemphasizes<br />

the importance of cache blocking optimizations in Chapter 5.<br />

■ Gather-Scatter. The multimedia SIMD extensions are of little help if the data are<br />

scattered throughout main memory; optimal performance comes only when<br />

accesses are to data aligned on 16-byte boundaries. Thus, GJK gets little benefit

from SIMD on the Core i7. As mentioned above, GPUs offer gather-scatter<br />

addressing that is found in a vector architecture but omitted from most SIMD<br />

extensions. The memory controller even batches accesses to the same DRAM<br />

page together (see Section 5.2). This combination means the GTX 280 runs GJK<br />

a startling 15.2 × as fast as the Core i7, which is larger than any single physical<br />

parameter in Figure 6.22. This observation reinforces the importance of gather-scatter

to vector <strong>and</strong> GPU architectures that is missing from SIMD extensions.<br />

■ Synchronization. The performance of synchronization is limited by atomic<br />

updates, which are responsible for 28% of the total runtime on the Core i7<br />

despite its having a hardware fetch-<strong>and</strong>-increment instruction. Thus, Hist is only<br />

1.7 × faster on the GTX 280. Solv solves a batch of independent constraints in<br />

a small amount of computation followed by barrier synchronization. The Core<br />

i7 benefits from the atomic instructions <strong>and</strong> a memory consistency model that<br />

ensures the right results even if not all previous accesses to memory hierarchy<br />

have completed. Without the memory consistency model, the GTX 280<br />

version launches some batches from the system processor, which leads to the<br />

GTX 280 running 0.5 × as fast as the Core i7. This observation points out how<br />

synchronization performance can be important for some data parallel problems.<br />

It is striking how often weaknesses in the Tesla GTX 280 that were uncovered by<br />

kernels selected by Intel researchers were already being addressed in the successor<br />

architecture to Tesla: Fermi has faster double-precision floating-point performance,<br />

faster atomic operations, <strong>and</strong> caches. It was also interesting that the gather-scatter<br />

support of vector architectures that predate the SIMD instructions by decades was<br />

so important to the effective usefulness of these SIMD extensions, which some had<br />

predicted before the comparison. The Intel researchers noted that 6 of the 14 kernels<br />

would exploit SIMD better with more efficient gather-scatter support on the Core<br />

i7. This study certainly establishes the importance of cache blocking as well.



Now that we have seen a wide range of results of benchmarking different

multiprocessors, let’s return to our DGEMM example to see in detail how much we<br />

have to change the C code to exploit multiple processors.<br />

6.12 Going Faster: Multiple Processors and Matrix Multiply

This section is the final <strong>and</strong> largest step in our incremental performance journey of<br />

adapting DGEMM to the underlying hardware of the Intel Core i7 (S<strong>and</strong>y Bridge).<br />

Each Core i7 has 8 cores, <strong>and</strong> the computer we have been using has 2 Core i7s.<br />

Thus, we have 16 cores on which to run DGEMM.<br />

Figure 6.25 shows the OpenMP version of DGEMM that utilizes those cores.<br />

Note that line 30 is the single line added to Figure 5.48 to make this code run on<br />

multiple processors: an OpenMP pragma that tells the compiler to use multiple<br />

threads in the outermost for loop. It tells the computer to spread the work of the<br />

outermost loop across all the threads.<br />

Figure 6.26 plots a classic multiprocessor speedup graph, showing the<br />

performance improvement versus a single thread as the number of threads increases.

This graph makes it easy to see the challenges of strong scaling versus weak scaling.<br />

When everything fits in the first level data cache, as is the case for 32 × 32 matrices,<br />

adding threads actually hurts performance. The 16-threaded version of DGEMM<br />

is almost half as fast as the single-threaded version in this case. In contrast, the two<br />

largest matrices get a 14 × speedup from 16 threads, <strong>and</strong> hence the classic two “up<br />

<strong>and</strong> to the right” lines in Figure 6.26.<br />

Figure 6.27 shows the absolute performance increase as we increase the number<br />

of threads from 1 to 16. DGEMM now operates at 174 GFLOPS for 960 × 960 matrices. As our unoptimized C version of DGEMM in Figure 3.21 ran this code at just 0.8 GFLOPS, the optimizations in Chapters 3 to 6 that tailor the code to the

underlying hardware result in a speedup of over 200 times!<br />

Next up are our warnings about the fallacies and pitfalls of multiprocessing. The

computer architecture graveyard is filled with parallel processing projects that have<br />

ignored them.<br />

Elaboration: These results are with Turbo mode turned off. We are using a dual-chip system, so not surprisingly we can get the full Turbo speedup (3.3/2.6 = 1.27) with either 1 thread (only 1 core on one of the chips) or 2 threads (1 core per chip). As we increase the number of threads and hence the number of active cores, the benefit of Turbo mode decreases, as there is less of the power budget to spend on the active cores. For 4 threads the average Turbo speedup is 1.23, for 8 it is 1.13, and for 16 it is 1.11.



1  #include <x86intrin.h>
2  #define UNROLL (4)
3  #define BLOCKSIZE 32
4  void do_block (int n, int si, int sj, int sk,
5                 double *A, double *B, double *C)
6  {
7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
9        __m256d c[4];
10       for ( int x = 0; x < UNROLL; x++ )
11         c[x] = _mm256_load_pd(C+i+x*4+j*n);
12         /* c[x] = C[i][j] */
13       for( int k = sk; k < sk+BLOCKSIZE; k++ )
14       {
15         __m256d b = _mm256_broadcast_sd(B+k+j*n);
16         /* b = B[k][j] */
17         for (int x = 0; x < UNROLL; x++)
18           c[x] = _mm256_add_pd(c[x], /* c[x]+=A[i][k]*b */
19                  _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20       }
21
22       for ( int x = 0; x < UNROLL; x++ )
23         _mm256_store_pd(C+i+x*4+j*n, c[x]);
24         /* C[i][j] = c[x] */
25     }
26 }
27
28 void dgemm (int n, double* A, double* B, double* C)
29 {
30 #pragma omp parallel for
31   for ( int sj = 0; sj < n; sj += BLOCKSIZE )
32     for ( int si = 0; si < n; si += BLOCKSIZE )
33       for ( int sk = 0; sk < n; sk += BLOCKSIZE )
34         do_block(n, si, sj, sk, A, B, C);
35 }

FIGURE 6.25 OpenMP version of DGEMM from Figure 5.48. Line 30 is the only OpenMP code, making<br />

the outermost for loop operate in parallel. This line is the only difference from Figure 5.48.<br />
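A minimal harness for trying Figure 6.25 yourself might look like the sketch below; the matrix size, allocation, and timing code are our assumptions, not from the book. The AVX loads in do_block require 32-byte-aligned data and a matrix dimension that is a multiple of BLOCKSIZE, and the file must be compiled with AVX and OpenMP enabled (for example, with GCC-style flags such as -mavx and -fopenmp).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

void dgemm(int n, double *A, double *B, double *C);   /* Figure 6.25 */

int main(void) {
    int n = 960;                                  /* must be a multiple of BLOCKSIZE (32) */
    size_t bytes = (size_t)n * n * sizeof(double);
    double *A = aligned_alloc(32, bytes);         /* 32-byte alignment for _mm256_load_pd */
    double *B = aligned_alloc(32, bytes);
    double *C = aligned_alloc(32, bytes);
    memset(A, 0, bytes); memset(B, 0, bytes); memset(C, 0, bytes);

    printf("OpenMP threads available: %d\n", omp_get_max_threads());
    double t0 = omp_get_wtime();
    dgemm(n, A, B, C);
    double t1 = omp_get_wtime();
    /* A dense n x n matrix multiply performs 2*n^3 floating-point operations. */
    printf("%.1f GFLOPS\n", 2.0 * n * n * n / (t1 - t0) / 1e9);

    free(A); free(B); free(C);
    return 0;
}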

Elaboration: Although the S<strong>and</strong>y Bridge supports two hardware threads per core, we<br />

do not get more performance from 32 threads. The reason is that a single AVX hardware unit

is shared between the two threads multiplexed onto one core, so assigning two threads<br />

per core actually hurts performance due to the multiplexing overhead.


6.13 Fallacies and Pitfalls

One frequently encountered problem occurs when software designed for a<br />

uniprocessor is adapted to a multiprocessor environment. For example, the Silicon<br />

Graphics operating system originally protected the page table with a single lock,<br />

assuming that page allocation is infrequent. In a uniprocessor, this does not<br />

represent a performance problem. In a multiprocessor, it can become a major<br />

performance bottleneck for some programs. Consider a program that uses a large<br />

number of pages that are initialized at start-up, which UNIX does for statically<br />

allocated pages. Suppose the program is parallelized so that multiple processes<br />

allocate the pages. Because page allocation requires the use of the page table, which<br />

is locked whenever it is in use, even an OS kernel that allows multiple threads in the<br />

OS will be serialized if the processes all try to allocate their pages at once (which is<br />

exactly what we might expect at initialization time!).<br />

This page table serialization eliminates parallelism in initialization <strong>and</strong> has<br />

significant impact on overall parallel performance. This performance bottleneck<br />

persists even for task-level parallelism. For example, suppose we split the parallel<br />

processing program apart into separate jobs <strong>and</strong> run them, one job per processor,<br />

so that there is no sharing between the jobs. (This is exactly what one user did,<br />

since he reasonably believed that the performance problem was due to unintended<br />

sharing or interference in his application.) Unfortunately, the lock still serializes all<br />

the jobs, so even the independent job performance is poor.

This pitfall indicates the kind of subtle but significant performance bugs<br />

that can arise when software runs on multiprocessors. Like many other key<br />

software components, the OS algorithms <strong>and</strong> data structures must be rethought<br />

in a multiprocessor context. Placing locks on smaller portions of the page table<br />

effectively eliminated the problem.<br />
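The flavor of that fix can be sketched with POSIX threads. The fragment below is illustrative only (hypothetical names, not the actual SGI kernel code): the single global mutex around the page table is replaced by an array of mutexes, each guarding one portion of the table, so allocations that touch different portions no longer serialize.

#include <pthread.h>

#define NREGIONS 64                       /* number of page-table portions */
static pthread_mutex_t region_lock[NREGIONS];

void page_locks_init(void) {
    for (int i = 0; i < NREGIONS; i++)
        pthread_mutex_init(&region_lock[i], NULL);
}

/* Hash a virtual page number to the lock guarding its portion of the table. */
static pthread_mutex_t *lock_for(unsigned long vpn) {
    return &region_lock[vpn % NREGIONS];
}

void allocate_page(unsigned long vpn) {
    pthread_mutex_lock(lock_for(vpn));    /* contends only with pages in the same portion */
    /* ... find a free frame and update this portion of the page table ... */
    pthread_mutex_unlock(lock_for(vpn));
}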

Fallacy: You can get good vector performance without providing memory<br />

b<strong>and</strong>width.<br />

As we saw with the Roofline model, memory b<strong>and</strong>width is quite important to<br />

all architectures. DAXPY requires 1.5 memory references per floating-point<br />

operation, <strong>and</strong> this ratio is typical of many scientific codes. Even if the floating-point<br />

operations took no time, a Cray-1 could not increase the DAXPY performance of<br />

the vector sequence used, since it was memory limited. The Cray-1 performance on<br />

Linpack jumped when the compiler used blocking to change the computation so<br />

that values could be kept in the vector registers. This approach lowered the number<br />

of memory references per FLOP <strong>and</strong> improved the performance by nearly a factor<br />

of two! Thus, the memory b<strong>and</strong>width on the Cray-1 became sufficient for a loop<br />

that formerly required more b<strong>and</strong>width, which is just what the Roofline model<br />

would predict.



■ In the past, microprocessors <strong>and</strong> multiprocessors were subject to<br />

different definitions of success. When scaling uniprocessor performance,<br />

microprocessor architects were happy if single thread performance went up<br />

by the square root of the increased silicon area. Thus, they were happy with<br />

sublinear performance in terms of resources. Multiprocessor success used<br />

to be defined as linear speed-up as a function of the number of processors,<br />

assuming that the cost of purchase or cost of administration of n processors<br />

was n times as much as one processor. Now that parallelism is happening on-chip

via multicore, we can use the traditional microprocessor metric of being<br />

successful with sublinear performance improvement.<br />

■ The success of just-in-time runtime compilation <strong>and</strong> autotuning makes it<br />

feasible to think of software adapting itself to take advantage of the increasing<br />

number of cores per chip, which provides flexibility that is not available when<br />

limited to static compilers.<br />

■ Unlike in the past, the open source movement has become a critical portion<br />

of the software industry. This movement is a meritocracy, where better<br />

engineering solutions can win the mind share of the developers over legacy<br />

concerns. It also embraces innovation, inviting change to old software <strong>and</strong><br />

welcoming new languages <strong>and</strong> software products. Such an open culture could<br />

be extremely helpful in this time of rapid change.<br />

To motivate readers to embrace this revolution, we demonstrated the potential<br />

of parallelism concretely for matrix multiply on the Intel Core i7 (S<strong>and</strong>y Bridge) in<br />

the Going Faster sections of Chapters 3 to 6:<br />

■ Data-level parallelism in Chapter 3 improved performance by a factor of 3.85<br />

by executing four 64-bit floating-point operations in parallel using the 256-<br />

bit oper<strong>and</strong>s of the AVX instructions, demonstrating the value of SIMD.<br />

■ Instruction-level parallelism in Chapter 4 pushed performance up by another<br />

factor of 2.3 by unrolling loops 4 times to give the out-of-order execution<br />

hardware more instructions to schedule.<br />

■ Cache optimizations in Chapter 5 improved performance of matrices that<br />

didn’t fit into the L1 data cache by another factor of 2.0 to 2.5 by using cache<br />

blocking to reduce cache misses.<br />

■ Thread-level parallelism in this chapter improved performance of matrices<br />

that don’t fit into a single L1 data cache by another factor of 4 to 14 by utilizing<br />

all 16 cores of our multicore chips, demonstrating the value of MIMD. We<br />

did this by adding a single line using an OpenMP pragma.<br />

Using the ideas in this book <strong>and</strong> tailoring the software to this computer added<br />

24 lines of code to DGEMM. For the matrix sizes of 32x32, 160x160, 480x480, <strong>and</strong><br />

960x960, the overall performance speedup from these ideas realized in those two dozen

lines of code is factors of 8, 39, 129, <strong>and</strong> 212!
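As a rough consistency check (the individual factors were measured separately, so they only approximately compose), for the 960x960 case:

   3.85 (SIMD) × 2.3 (loop unrolling) ≈ 8.9
   8.9 × about 2.5 (cache blocking) ≈ 22
   22 × about 9.6 (16 threads, within the 4-to-14 range above) ≈ 212,

which is consistent with the overall factor of 212 reported for the largest matrices.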



backpack <strong>and</strong> then carry them “in parallel”). For each of your activities, discuss if<br />

they are already working in parallel, but if not, why they are not.<br />

6.1.2 [5] Next, consider which of the activities could be carried out<br />

concurrently (e.g., eating breakfast <strong>and</strong> listening to the news). For each of your<br />

activities, describe which other activity could be paired with this activity.<br />

6.1.3 [5] For 6.1.2, what could we change about current systems (e.g.,<br />

showers, clothes, TVs, cars) so that we could perform more tasks in parallel?<br />

6.1.4 [5] Estimate how much shorter time it would take to carry out these<br />

activities if you tried to carry out as many tasks in parallel as possible.<br />

6.2 You are trying to bake 3 blueberry pound cakes. Cake ingredients are as<br />

follows:<br />

1 cup butter, softened<br />

1 cup sugar<br />

4 large eggs<br />

1 teaspoon vanilla extract<br />

1/2 teaspoon salt<br />

1/4 teaspoon nutmeg<br />

1 1/2 cups flour<br />

1 cup blueberries<br />

The recipe for a single cake is as follows:<br />

Step 1: Preheat oven to 325°F (160°C). Grease <strong>and</strong> flour your cake pan.<br />

Step 2: In large bowl, beat together with a mixer butter <strong>and</strong> sugar at medium<br />

speed until light <strong>and</strong> fluffy. Add eggs, vanilla, salt <strong>and</strong> nutmeg. Beat until<br />

thoroughly blended. Reduce mixer speed to low <strong>and</strong> add flour, 1/2 cup at a time,<br />

beating just until blended.<br />

Step 3: Gently fold in blueberries. Spread evenly in prepared baking pan. Bake<br />

for 60 minutes.<br />

6.2.1 [5] Your job is to cook 3 cakes as efficiently as possible. Assuming<br />

that you only have one oven large enough to hold one cake, one large bowl, one<br />

cake pan, <strong>and</strong> one mixer, come up with a schedule to make three cakes as quickly<br />

as possible. Identify the bottlenecks in completing this task.<br />

6.2.2 [5] Assume now that you have three bowls, 3 cake pans <strong>and</strong> 3 mixers.<br />

How much faster is the process now that you have additional resources?



6.2.3 [5] Assume now that you have two friends that will help you cook,<br />

<strong>and</strong> that you have a large oven that can accommodate all three cakes. How will this<br />

change the schedule you arrived at in Exercise 6.2.1 above?<br />

6.2.4 [5] Compare the cake-making task to computing 3 iterations<br />

of a loop on a parallel computer. Identify data-level parallelism <strong>and</strong> task-level<br />

parallelism in the cake-making loop.<br />

6.3 Many computer applications involve searching through a set of data <strong>and</strong><br />

sorting the data. A number of efficient searching <strong>and</strong> sorting algorithms have been<br />

devised in order to reduce the runtime of these tedious tasks. In this problem we<br />

will consider how best to parallelize these tasks.<br />

6.3.1 [10] Consider the following binary search algorithm (a classic divide<br />

<strong>and</strong> conquer algorithm) that searches for a value X in a sorted N-element array A<br />

<strong>and</strong> returns the index of matched entry:<br />

BinarySearch(A[0..N−1], X) {
   low = 0
   high = N − 1
   while (low <= high) {
      mid = (low + high) / 2
      if (A[mid] > X)
         high = mid − 1
      else if (A[mid] < X)
         low = mid + 1
      else
         return mid // found
   }
   return −1 // not found
}



continue until we have lists of size 1 in length. Then starting with sublists of length<br />

1, “merge” the two sublists into a single sorted list.<br />

Mergesort(m)
   var list left, right, result
   if length(m) ≤ 1
      return m
   else
      var middle = length(m) / 2
      for each x in m up to middle
         add x to left
      for each x in m after middle
         add x to right
      left = Mergesort(left)
      right = Mergesort(right)
      result = Merge(left, right)
      return result

The merge step is carried out by the following code:

Merge(left, right)
   var list result
   while length(left) > 0 and length(right) > 0
      if first(left) ≤ first(right)
         append first(left) to result
         left = rest(left)
      else
         append first(right) to result
         right = rest(right)
   if length(left) > 0
      append rest(left) to result
   if length(right) > 0
      append rest(right) to result
   return result

6.5.1 [10] Assume that you have Y cores on a multicore processor to run<br />

MergeSort. Assuming that Y is much smaller than length(m), express the speedup<br />

factor you might expect to obtain for values of Y <strong>and</strong> length(m). Plot these on a<br />

graph.<br />

6.5.2 [10] Next, assume that Y is equal to length(m). How would this affect your conclusions from your previous answer? If you were tasked with obtaining

the best speedup factor possible (i.e., strong scaling), explain how you might<br />

change this code to obtain it.



6.6 Matrix multiplication plays an important role in a number of applications.<br />

Two matrices can only be multiplied if the number of columns of the first matrix is<br />

equal to the number of rows in the second.<br />

Let’s assume we have an m × n matrix A <strong>and</strong> we want to multiply it by an n × p<br />

matrix B. We can express their product as an m × p matrix denoted by AB (or A ⋅ B).<br />

If we assign C = AB, and c_i,j denotes the entry in C at position (i, j), then c_i,j = a_i,1 b_1,j + a_i,2 b_2,j + … + a_i,n b_n,j for each element i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can parallelize the computation of C. Assume that matrices are laid out in memory sequentially as follows: a_1,1, a_2,1, a_3,1, a_4,1, …, etc.

6.6.1 [10] Assume that we are going to compute C on both a single core<br />

shared memory machine <strong>and</strong> a 4-core shared-memory machine. Compute the<br />

speedup we would expect to obtain on the 4-core machine, ignoring any memory<br />

issues.<br />

6.6.2 [10] Repeat Exercise 6.6.1, assuming that updates to C incur a cache miss due to false sharing when consecutive elements in a row (i.e., index i) are updated.

6.6.3 [10] How would you fix the false sharing issue that can occur?<br />

6.7 Consider the following portions of two different programs running at the<br />

same time on four processors in a symmetric multicore processor (SMP). Assume<br />

that before this code is run, both x <strong>and</strong> y are 0.<br />

Core 1: x = 2;<br />

Core 2: y = 2;<br />

Core 3: w = x + y + 1;<br />

Core 4: z = x + y;<br />

6.7.1 [10] What are all the possible resulting values of w, x, y, <strong>and</strong> z? For<br />

each possible outcome, explain how we might arrive at those values. You will need<br />

to examine all possible interleavings of instructions.<br />

6.7.2 [5] How could you make the execution more deterministic so that<br />

only one set of values is possible?<br />

6.8 The dining philosopher’s problem is a classic problem of synchronization <strong>and</strong><br />

concurrency. The general problem is stated as philosophers sitting at a round table<br />

doing one of two things: eating or thinking. When they are eating, they are not<br />

thinking, <strong>and</strong> when they are thinking, they are not eating. There is a bowl of pasta<br />

in the center. A fork is placed in between each philosopher. The result is that each<br />

philosopher has one fork to her left <strong>and</strong> one fork to her right. Given the nature of<br />

eating pasta, the philosopher needs two forks to eat, <strong>and</strong> can only use the forks on<br />

her immediate left <strong>and</strong> right. The philosophers do not speak to one another.
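As a concrete illustration of why this is hard, here is a deliberately naive pthreads sketch in C (the names and the value of N are illustrative): each philosopher takes the fork on her left and then the fork on her right. If all philosophers pick up their left forks at the same moment, none can ever acquire a right fork, and the program deadlocks.

#include <pthread.h>

#define N 5                                /* philosophers and forks */
static pthread_mutex_t fork_lock[N];

static void *philosopher(void *arg)
{
    long i = (long)arg;
    int left  = i;                         /* fork on her left  */
    int right = (i + 1) % N;               /* fork on her right */

    for (;;) {
        /* think ... */
        pthread_mutex_lock(&fork_lock[left]);    /* pick up left fork  */
        pthread_mutex_lock(&fork_lock[right]);   /* then right fork    */
        /* eat ... */
        pthread_mutex_unlock(&fork_lock[right]);
        pthread_mutex_unlock(&fork_lock[left]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&fork_lock[i], NULL);
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Breaking the symmetry (for example, having one philosopher pick up her right fork first, or acquiring forks in a fixed global order) removes the possibility of deadlock.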



Assume all instructions take a single cycle to execute unless noted otherwise or<br />

they encounter a hazard.<br />

6.9.1 [10] Assume that you have 1 SS CPU. How many cycles will it take to<br />

execute these two threads? How many issue slots are wasted due to hazards?<br />

6.9.2 [10] Now assume you have 2 SS CPUs. How many cycles will it take<br />

to execute these two threads? How many issue slots are wasted due to hazards?<br />

6.9.3 [10] Assume that you have 1 MT CPU. How many cycles will it take<br />

to execute these two threads? How many issue slots are wasted due to hazards?<br />

6.10 Virtualization software is being aggressively deployed to reduce the costs of<br />

managing today’s high-performance servers. Companies like VMware, Microsoft

<strong>and</strong> IBM have all developed a range of virtualization products. The general concept,<br />

described in Chapter 5, is that a hypervisor layer can be introduced between the<br />

hardware <strong>and</strong> the operating system to allow multiple operating systems to share<br />

the same physical hardware. The hypervisor layer is then responsible for allocating<br />

CPU <strong>and</strong> memory resources, as well as h<strong>and</strong>ling services typically h<strong>and</strong>led by the<br />

operating system (e.g., I/O).<br />

Virtualization provides an abstract view of the underlying hardware to the hosted<br />

operating system <strong>and</strong> application software. This will require us to rethink how<br />

multi-core <strong>and</strong> multiprocessor systems will be designed in the future to support<br />

the sharing of CPUs <strong>and</strong> memories by a number of operating systems concurrently.<br />

6.10.1 [30] Select two hypervisors on the market today, <strong>and</strong> compare<br />

<strong>and</strong> contrast how they virtualize <strong>and</strong> manage the underlying hardware (CPUs <strong>and</strong><br />

memory).<br />

6.10.2 [15] Discuss what changes may be necessary in future multi-core<br />

CPU platforms in order to better match the resource dem<strong>and</strong>s placed on these<br />

systems. For instance, can multithreading play an effective role in alleviating the<br />

competition for computing resources?<br />

6.11 We would like to execute the loop below as efficiently as possible. We have<br />

two different machines, a MIMD machine <strong>and</strong> a SIMD machine.<br />

for (i=0; i < 2000; i++)<br />

for (j=0; j



6.12 A systolic array is an example of an MISD machine. A systolic array is a<br />

pipeline network or “wavefront” of data processing elements. Each of these elements<br />

does not need a program counter since execution is triggered by the arrival of data.<br />

Clocked systolic arrays compute in “lock-step” with each processor undertaking<br />

alternate compute <strong>and</strong> communication phases.<br />

6.12.1 [10] Consider proposed implementations of a systolic array (you<br />

can find these on the Internet or in technical publications). Then attempt to

program the loop provided in Exercise 6.11 using this MISD model. Discuss any<br />

difficulties you encounter.<br />

6.12.2 [10] Discuss the similarities <strong>and</strong> differences between an MISD <strong>and</strong><br />

SIMD machine. Answer this question in terms of data-level parallelism.<br />

6.13 Assume we want to execute the DAXPY loop shown on page 511 in MIPS assembly on the NVIDIA 8800 GTX GPU described in this chapter. In this problem, we will assume that all math operations are performed on single-precision floating-point numbers (we will rename the loop SAXPY). Assume that instructions take

the following number of cycles to execute.<br />

Loads: 5   Stores: 2   Add.S: 3   Mult.S: 4
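For reference, the single-precision loop (SAXPY) has the following shape in C; the function name and arguments are illustrative:

void saxpy(int n, float a, const float *x, float *y)
{
    /* y[i] = a * x[i] + y[i], all in single precision */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Each iteration performs one load of x[i], one load of y[i], one single-precision multiply, one single-precision add, and one store, which is the instruction mix the cycle counts above apply to.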

6.13.1 [20] Describe how you will construct warps for the SAXPY loop

to exploit the 8 cores provided in a single multiprocessor.<br />

6.14 Download the CUDA Toolkit <strong>and</strong> SDK from http://www.nvidia.com/object/<br />

cuda_get.html. Make sure to use the “emurelease” (Emulation Mode) version of the<br />

code (you will not need actual NVIDIA hardware for this assignment). Build the<br />

example programs provided in the SDK, <strong>and</strong> confirm that they run on the emulator.<br />

6.14.1 [90] Using the “template” SDK sample as a starting point, write a<br />

CUDA program to perform the following vector operations:<br />

1) a − b (vector-vector subtraction)<br />

2) a ⋅ b (vector dot product)<br />

The dot product of two vectors a = [a_1, a_2, … , a_n] and b = [b_1, b_2, … , b_n] is defined as:

a · b = Σ_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + … + a_n b_n

Submit code for each program that demonstrates each operation <strong>and</strong> verifies the<br />

correctness of the results.<br />
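A plain-C reference implementation of the two operations (names illustrative) is useful for checking the GPU results:

#include <stddef.h>

/* c = a - b, element by element */
void vsub_ref(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] - b[i];
}

/* returns a . b = a1*b1 + a2*b2 + ... + an*bn */
float dot_ref(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}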

6.14.2 [90] If you have GPU hardware available, complete a performance<br />

analysis of your program, examining the computation time for the GPU and a CPU

version of your program for a range of vector sizes. Explain any results you see.<br />




6.15 AMD has recently announced that they will be integrating a graphics processing unit with their x86 cores in a single package, though with different clocks for each of the cores. This is an example of a heterogeneous multiprocessor system which we expect to see produced commercially in the near future. One of the key design points will be to allow for fast data communication between the CPU and the GPU. Presently communication must be performed between discrete CPU and GPU chips, but this is changing in AMD's Fusion architecture. Presently the plan is to use multiple (at least 16) PCI Express channels to facilitate intercommunication. Intel is also jumping into this arena with their Larrabee chip. Intel is considering using their QuickPath interconnect technology.

6.15.1 [25] Compare the b<strong>and</strong>width <strong>and</strong> latency associated with these<br />

two interconnect technologies.<br />

6.16 Refer to Figure 6.14b, which shows an n-cube interconnect topology of order<br />

3 that interconnects 8 nodes. One attractive feature of an n-cube interconnection<br />

network topology is its ability to sustain broken links <strong>and</strong> still provide connectivity.<br />

6.16.1 [10] Develop an equation that computes how many links in the<br />

n-cube (where n is the order of the cube) can fail while still guaranteeing that an

unbroken link will exist to connect any node in the n-cube.<br />

6.16.2 [10] Compare the resiliency to failure of the n-cube to a fully connected

interconnection network. Plot a comparison of reliability as a function<br />

of the added number of links for the two topologies.<br />

6.17 Benchmarking is a field of study that involves identifying representative

workloads to run on specific computing platforms in order to be able to objectively<br />

compare performance of one system to another. In this exercise we will compare<br />

two classes of benchmarks: the Whetstone CPU benchmark <strong>and</strong> the PARSEC<br />

Benchmark suite. Select one program from PARSEC. All programs should be freely<br />

available on the Internet. Consider running multiple copies of Whetstone versus<br />

running the PARSEC Benchmark on any of the systems described in Section 6.11.

6.17.1 [60] What is inherently different between these two classes of<br />

workload when run on these multi-core systems?<br />

6.17.2 [60] In terms of the Roofline Model, how dependent will the<br />

results you obtain when running these benchmarks be on the amount of sharing<br />

<strong>and</strong> synchronization present in the workload used?<br />

6.18 When performing computations on sparse matrices, latency in the memory<br />

hierarchy becomes much more of a factor. Sparse matrices lack the spatial locality<br />

in the data stream typically found in matrix operations. As a result, new matrix<br />

representations have been proposed.<br />

One of the earliest sparse matrix representations is the Yale Sparse Matrix Format. It stores an initial sparse m × n matrix M in row form using three one-dimensional



arrays. Let R be the number of nonzero entries in M. We construct an array A<br />

of length R that contains all nonzero entries of M (in left-to-right top-to-bottom<br />

order). We also construct a second array IA of length m + 1 (i.e., one entry per row,<br />

plus one). IA(i) contains the index in A of the first nonzero element of row i. Row<br />

i of the original matrix extends from A(IA(i)) to A(IA(i+1)−1). The third array, JA,<br />

contains the column index of each element of A, so it also is of length R.<br />
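As a small illustration of the format (deliberately not the matrix X from the exercise), consider a 3 × 4 matrix whose only nonzeros are m(1,1) = 5, m(1,3) = 8, m(2,4) = 3, and m(3,2) = 6. Using 0-based array indices in C, the three arrays would be:

/* Row 1: [5, 0, 8, 0]
 * Row 2: [0, 0, 0, 3]
 * Row 3: [0, 6, 0, 0]      R = 4 nonzeros, m = 3 rows */
float A[4]  = { 5.0f, 8.0f, 3.0f, 6.0f };  /* nonzeros, left-to-right, top-to-bottom     */
int   IA[4] = { 0, 2, 3, 4 };              /* A-index where each row starts; last entry = R */
int   JA[4] = { 0, 2, 3, 1 };              /* 0-based column index of each entry of A    */

/* Row i (0-based) occupies A[IA[i]] .. A[IA[i+1] - 1]. */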

6.18.1 [15] Consider the sparse matrix X below <strong>and</strong> write C code that<br />

would store this matrix in Yale Sparse Matrix Format.

Row 1 [1, 2, 0, 0, 0, 0]<br />

Row 2 [0, 0, 1, 1, 0, 0]<br />

Row 3 [0, 0, 0, 0, 9, 0]<br />

Row 4 [2, 0, 0, 0, 0, 2]<br />

Row 5 [0, 0, 3, 3, 0, 7]<br />

Row 6 [1, 3, 0, 0, 0, 1]<br />

6.18.2 [10] In terms of storage space, assuming that each element in<br />

matrix X is single-precision floating point, compute the amount of storage used to store the matrix above in Yale Sparse Matrix Format.

6.18.3 [15] Perform matrix multiplication of Matrix X by Matrix Y<br />

shown below.<br />

[2, 4, 1, 99, 7, 2]<br />

Put this computation in a loop, <strong>and</strong> time its execution. Make sure to increase<br />

the number of times this loop is executed to get good resolution in your timing<br />

measurement. Compare the runtime of using a naïve representation of the matrix,<br />

<strong>and</strong> the Yale Sparse Matrix Format.<br />

6.18.4 [15] Can you find a more efficient sparse matrix representation<br />

(in terms of space <strong>and</strong> computational overhead)?<br />

6.19 In future systems, we expect to see heterogeneous computing platforms<br />

constructed out of heterogeneous CPUs. We have begun to see some appear in the<br />

embedded processing market in systems that contain both floating-point DSPs and microcontroller CPUs in a multichip module package.

Assume that you have three classes of CPU:<br />

CPU A—A moderate speed multi-core CPU (with a floating point unit) that can<br />

execute multiple instructions per cycle.<br />

CPU B—A fast single-core integer CPU (i.e., no floating point unit) that can<br />

execute a single instruction per cycle.<br />

CPU C—A slow vector CPU (with floating point capability) that can execute<br />

multiple copies of the same instruction per cycle.



Answers to Check Yourself

§6.1, page 504: False. Task-level parallelism can help sequential applications and sequential applications can be made to run on parallel hardware, although it is more challenging.

§6.2, page 509: False. Weak scaling can compensate for a serial portion of the<br />

program that would otherwise limit scalability, but not so for strong scaling.<br />

§6.3, page 514: True, but they are missing useful vector features like gather-scatter<br />

<strong>and</strong> vector length registers that improve the efficiency of vector architectures.<br />

(As an elaboration in this section mentions, the AVX2 SIMD extensions offer

indexed loads via a gather operation but not scatter for indexed stores. The Haswell<br />

generation x86 microprocessor is the first to support AVX2.)<br />

§6.4, page 519: 1. True. 2. True.<br />

§6.5, page 523: False. Since the shared address is a physical address, multiple<br />

tasks each in their own virtual address spaces can run well on a shared memory<br />

multiprocessor.<br />

§6.6, page 531: False. Graphics DRAM chips are prized for their higher b<strong>and</strong>width.<br />

§6.7, page 536: 1. False. Sending <strong>and</strong> receiving a message is an implicit<br />

synchronization, as well as a way to share data. 2. True.<br />

§6.8, page 538: True.<br />

§6.10, page 550: True. We likely need innovation at all levels of the hardware <strong>and</strong><br />

software stack for parallel computing to succeed.<br />



APPENDIX A

Assemblers, Linkers, and the SPIM Simulator

James R. Larus
Microsoft Research, Microsoft

“Fear of serious injury cannot alone justify suppression of free speech and assembly.”
Louis Brandeis, Whitney v. California, 1927



[Figure A.1.1 shows three source files, each translated by an assembler into an object file; the linker combines the object files and a program library into an executable file.]

FIGURE A.1.1 The process that produces an executable file. An assembler translates a file of assembly language into an object file, which is linked with other files and libraries into an executable file.

assembler A program that translates a symbolic version of instructions into the binary version.

macro A pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions.

unresolved reference A reference that requires more information from an outside source to be complete.

linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.

permits programmers to use labels to identify <strong>and</strong> name particular memory words<br />

that hold instructions or data.<br />

A tool called an assembler translates assembly language into binary instructions.<br />

Assemblers provide a friendlier representation than a computer’s 0s <strong>and</strong> 1s, which<br />

simplifies writing and reading programs. Symbolic names for operations and locations

are one facet of this representation. Another facet is programming facilities<br />

that increase a program’s clarity. For example, macros, discussed in Section A.2,<br />

enable a programmer to extend the assembly language by defining new operations.<br />

An assembler reads a single assembly language source file <strong>and</strong> produces an<br />

object file containing machine instructions <strong>and</strong> bookkeeping information that<br />

helps combine several object files into a program. Figure A.1.1 illustrates how a<br />

program is built. Most programs consist of several files—also called modules—<br />

that are written, compiled, <strong>and</strong> assembled independently. A program may also use<br />

prewritten routines supplied in a program library. A module typically contains references<br />

to subroutines <strong>and</strong> data defined in other modules <strong>and</strong> in libraries. The code<br />

in a module cannot be executed when it contains unresolved references to labels<br />

in other object files or libraries. Another tool, called a linker, combines a collection<br />

of object <strong>and</strong> library files into an executable file, which a computer can run.<br />

To see the advantage of assembly language, consider the following sequence of<br />

figures, all of which contain a short subroutine that computes <strong>and</strong> prints the sum of<br />

the squares of integers from 0 to 100. Figure A.1.2 shows the machine language that<br />

a MIPS computer executes. With considerable effort, you could use the opcode <strong>and</strong><br />

instruction format tables in Chapter 2 to translate the instructions into a symbolic<br />

program similar to that shown in Figure A.1.3. This form of the routine is much<br />

easier to read, because operations <strong>and</strong> oper<strong>and</strong>s are written with symbols rather



00100111101111011111111111100000<br />

10101111101111110000000000010100<br />

10101111101001000000000000100000<br />

10101111101001010000000000100100<br />

10101111101000000000000000011000<br />

10101111101000000000000000011100<br />

10001111101011100000000000011100<br />

10001111101110000000000000011000<br />

00000001110011100000000000011001<br />

00100101110010000000000000000001<br />

00101001000000010000000001100101<br />

10101111101010000000000000011100<br />

00000000000000000111100000010010<br />

00000011000011111100100000100001<br />

00010100001000001111111111110111<br />

10101111101110010000000000011000<br />

00111100000001000001000000000000<br />

10001111101001010000000000011000<br />

00001100000100000000000011101100<br />

00100100100001000000010000110000<br />

10001111101111110000000000010100<br />

00100111101111010000000000100000<br />

00000011111000000000000000001000<br />

00000000000000000001000000100001<br />

FIGURE A.1.2 MIPS machine language code for a routine to compute <strong>and</strong> print the sum<br />

of the squares of integers between 0 <strong>and</strong> 100.<br />

than with bit patterns. However, this assembly language is still difficult to follow,<br />

because memory locations are named by their address rather than by a symbolic<br />

label.<br />

Figure A.1.4 shows assembly language that labels memory addresses with mnemonic<br />

names. Most programmers prefer to read <strong>and</strong> write this form. Names that<br />

begin with a period, for example .data <strong>and</strong> .globl, are assembler directives<br />

that tell the assembler how to translate a program but do not produce machine<br />

instructions. Names followed by a colon, such as str: or main:, are labels that<br />

name the next memory location. This program is as readable as most assembly<br />

language programs (except for a glaring lack of comments), but it is still difficult<br />

to follow, because many simple operations are required to accomplish simple tasks<br />

<strong>and</strong> because assembly language’s lack of control flow constructs provides few hints<br />

about the program’s operation.<br />

By contrast, the C routine in Figure A.1.5 is both shorter <strong>and</strong> clearer, since variables<br />

have mnemonic names <strong>and</strong> the loop is explicit rather than constructed with<br />

branches. In fact, the C routine is the only one that we wrote. The other forms of<br />

the program were produced by a C compiler <strong>and</strong> assembler.<br />

In general, assembly language plays two roles (see Figure A.1.6). The first role<br />

is the output language of compilers. A compiler translates a program written in a<br />

high-level language (such as C or Pascal) into an equivalent program in machine or<br />

assembler directive<br />

An operation that tells the<br />

assembler how to translate<br />

a program but does not<br />

produce machine instructions;<br />

always begins with<br />

a period.



addiu $29, $29, -32<br />

sw $31, 20($29)<br />

sw $4, 32($29)<br />

sw $5, 36($29)<br />

sw $0, 24($29)<br />

sw $0, 28($29)<br />

lw $14, 28($29)<br />

lw $24, 24($29)<br />

multu $14, $14<br />

addiu $8, $14, 1<br />

slti $1, $8, 101<br />

sw $8, 28($29)<br />

mflo $15<br />

addu $25, $24, $15<br />

bne $1, $0, -9<br />

sw $25, 24($29)<br />

lui $4, 4096<br />

lw $5, 24($29)<br />

jal 1048812<br />

addiu $4, $4, 1072<br />

lw $31, 20($29)<br />

addiu $29, $29, 32<br />

jr $31<br />

move $2, $0<br />

FIGURE A.1.3 The same routine as in Figure A.1.2 written in assembly language. However,<br />

the code for the routine does not label registers or memory locations or include comments.<br />

source language The high-level language in which a program is originally written.

assembly language. The high-level language is called the source language, <strong>and</strong> the<br />

compiler’s output is its target language.<br />

Assembly language’s other role is as a language in which to write programs. This<br />

role used to be the dominant one. Today, however, because of larger main memories<br />

<strong>and</strong> better compilers, most programmers write in a high-level language <strong>and</strong><br />

rarely, if ever, see the instructions that a computer executes. Nevertheless, assembly<br />

language is still important to write programs in which speed or size is critical or to<br />

exploit hardware features that have no analogues in high-level languages.<br />

Although this appendix focuses on MIPS assembly language, assembly programming<br />

on most other machines is very similar. The additional instructions <strong>and</strong><br />

address modes in CISC machines, such as the VAX, can make assembly programs

shorter but do not change the process of assembling a program or provide assembly<br />

language with the advantages of high-level languages, such as type-checking <strong>and</strong><br />

structured control flow.



FIGURE A.1.4 The same routine as in Figure A.1.2 written in assembly language with labels, but no comments. The commands that start with periods are assembler directives (see pages A-47–49). .text indicates that succeeding lines contain instructions. .data indicates that they contain data. .align n indicates that the items on the succeeding lines should be aligned on a 2^n byte boundary. Hence, .align 2 means the next item should be on a word boundary. .globl main declares that main is a global symbol that should be visible to code stored in other files. Finally, .asciiz stores a null-terminated string in memory.

When to Use Assembly Language<br />

The primary reason to program in assembly language, as opposed to an available<br />

high-level language, is that the speed or size of a program is critically important.<br />

For example, consider a computer that controls a piece of machinery, such as a<br />

car’s brakes. A computer that is incorporated in another device, such as a car, is<br />

called an embedded computer. This type of computer needs to respond rapidly<br />

<strong>and</strong> predictably to events in the outside world. Because a compiler introduces



#include <stdio.h>

int
main (int argc, char *argv[])
{
  int i;
  int sum = 0;

  for (i = 0; i <= 100; i = i + 1)
    sum = sum + i * i;
  printf ("The sum from 0 .. 100 is %d\n", sum);
}



This improvement is not necessarily an indication that the high-level language’s<br />

compiler has failed. Compilers typically are better than programmers at producing<br />

uniformly high-quality machine code across an entire program. Pro grammers,<br />

however, underst<strong>and</strong> a program’s algorithms <strong>and</strong> behavior at a deeper level than<br />

a compiler <strong>and</strong> can expend considerable effort <strong>and</strong> ingenuity improving small<br />

sections of the program. In particular, programmers often consider several procedures<br />

simultaneously while writing their code. Compilers typically compile each<br />

procedure in isolation <strong>and</strong> must follow strict conventions governing the use of<br />

registers at procedure boundaries. By retaining commonly used values in registers,<br />

even across procedure boundaries, programmers can make a program run<br />

faster.<br />

Another major advantage of assembly language is the ability to exploit specialized<br />

instructions—for example, string copy or pattern-matching instructions.<br />

Compilers, in most cases, cannot determine that a program loop can be replaced<br />

by a single instruction. However, the programmer who wrote the loop can replace<br />

it easily with a single instruction.<br />

Currently, a programmer’s advantage over a compiler has become difficult to<br />

maintain as compilation techniques improve <strong>and</strong> machines’ pipelines increase in<br />

complexity (Chapter 4).<br />

The final reason to use assembly language is that no high-level language is<br />

available on a particular computer. Many older or specialized computers do not<br />

have a compiler, so a programmer’s only alternative is assembly language.<br />

Drawbacks of Assembly Language<br />

Assembly language has many disadvantages that strongly argue against its widespread<br />

use. Perhaps its major disadvantage is that programs written in assembly<br />

language are inherently machine-specific <strong>and</strong> must be totally rewritten to run on<br />

another computer architecture. The rapid evolution of computers discussed in<br />

Chapter 1 means that architectures become obsolete. An assembly language program<br />

remains tightly bound to its original architecture, even after the computer is

eclipsed by new, faster, <strong>and</strong> more cost-effective machines.<br />

Another disadvantage is that assembly language programs are longer than the<br />

equivalent programs written in a high-level language. For example, the C program<br />

in Figure A.1.5 is 11 lines long, while the assembly program in Figure A.1.4 is<br />

31 lines long. In more complex programs, the ratio of assembly to high-level language<br />

(its expansion factor) can be much larger than the factor of three in this<br />

example. Unfortunately, empirical studies have shown that programmers write

roughly the same number of lines of code per day in assembly as in high-level<br />

languages. This means that programmers are roughly x times more productive in a<br />

high-level language, where x is the assembly language expansion factor.



To compound the problem, longer programs are more difficult to read <strong>and</strong><br />

underst<strong>and</strong>, <strong>and</strong> they contain more bugs. Assembly language exacerbates the problem<br />

because of its complete lack of structure. Common programming idioms,<br />

such as if-then statements <strong>and</strong> loops, must be built from branches <strong>and</strong> jumps. The<br />

resulting programs are hard to read, because the reader must reconstruct every<br />

higher-level construct from its pieces <strong>and</strong> each instance of a statement may be<br />

slightly different. For example, look at Figure A.1.4 <strong>and</strong> answer these questions:<br />

What type of loop is used? What are its lower <strong>and</strong> upper bounds?<br />

Elaboration: Compilers can produce machine language directly instead of relying on<br />

an assembler. These compilers typically execute much faster than those that invoke<br />

an assembler as part of compilation. However, a compiler that generates machine language<br />

must perform many tasks that an assembler normally h<strong>and</strong>les, such as resolving<br />

addresses <strong>and</strong> encoding instructions as binary numbers. The tradeoff is between<br />

compilation speed <strong>and</strong> compiler simplicity.<br />

Elaboration: Despite these considerations, some embedded applications are written<br />

in a high-level language. Many of these applications are large <strong>and</strong> complex programs<br />

that must be extremely reliable. Assembly language programs are longer <strong>and</strong><br />

more difficult to write and read than high-level language programs. This greatly increases the cost of writing an assembly language program and makes it extremely difficult to

verify the correctness of this type of program. In fact, these considerations led the US<br />

Department of Defense, which pays for many complex embedded systems, to develop<br />

Ada, a new high-level language for writing embedded systems.<br />

A.2 Assemblers<br />

external label Also called<br />

global label. A label<br />

referring to an object that<br />

can be referenced from<br />

files other than the one in<br />

which it is defined.<br />

An assembler translates a file of assembly language statements into a file of binary<br />

machine instructions <strong>and</strong> binary data. The translation process has two major<br />

parts. The first step is to find memory locations with labels so that the relationship<br />

between symbolic names and addresses is known when instructions are translated. The second step is to translate each assembly statement by combining the numeric equivalents of opcodes, register specifiers, and labels into a legal instruction. As shown in Figure A.1.1, the assembler produces an output file, called an object file, which contains the machine instructions, data, and bookkeeping information.

An object file typically cannot be executed, because it references procedures or<br />

data in other files. A label is external (also called global) if the labeled object can



be referenced from files other than the one in which it is defined. A label is local<br />

if the object can be used only within the file in which it is defined. In most assemblers,<br />

labels are local by default and must be explicitly declared global. Subroutines

<strong>and</strong> global variables require external labels since they are referenced from many<br />

files in a program. Local labels hide names that should not be visible to other<br />

modules—for example, static functions in C, which can only be called by other<br />

functions in the same file. In addition, compiler-generated names—for example, a<br />

name for the instruction at the beginning of a loop—are local so that the compiler<br />

need not produce unique names in every file.<br />

local label A label<br />

referring to an object that<br />

can be used only within<br />

the file in which it is<br />

defined.<br />

Local <strong>and</strong> Global Labels<br />

Consider the program in Figure A.1.4. The subroutine has an external (global)<br />

label main. It also contains two local labels—loop <strong>and</strong> str—that are only<br />

visible with this assembly language file. Finally, the routine also contains an<br />

unresolved reference to an external label printf, which is the library routine<br />

that prints values. Which labels in Figure A.1.4 could be referenced from<br />

another file?<br />

EXAMPLE<br />

Only global labels are visible outside a file, so the only label that could be<br />

referenced from another file is main.<br />

Since the assembler processes each file in a program individually and in isolation,

it only knows the addresses of local labels. The assembler depends on another tool,<br />

the linker, to combine a collection of object files <strong>and</strong> libraries into an executable<br />

file by resolving external labels. The assembler assists the linker by providing lists

of labels <strong>and</strong> unresolved references.<br />

However, even local labels present an interesting challenge to an assembler.<br />

Unlike names in most high-level languages, assembly labels may be used before<br />

they are defined. In the example in Figure A.1.4, the label str is used by the la<br />

instruction before it is defined. The possibility of a forward reference, like this one,<br />

forces an assembler to translate a program in two steps: first find all labels <strong>and</strong> then<br />

produce instructions. In the example, when the assembler sees the la instruction,<br />

it does not know where the word labeled str is located or even whether str labels<br />

an instruction or datum.<br />

ANSWER<br />

forward reference<br />

A label that is used<br />

before it is defined.



An assembler’s first pass reads each line of an assembly file <strong>and</strong> breaks it into its<br />

component pieces. These pieces, which are called lexemes, are individual words,<br />

numbers, and punctuation characters. For example, the line

ble $t0, 100, loop

contains six lexemes: the opcode ble, the register specifier $t0, a comma, the number 100, a comma, and the symbol loop.

symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy.

If a line begins with a label, the assembler records in its symbol table the name<br />

of the label <strong>and</strong> the address of the memory word that the instruction occupies.<br />

The assembler then calculates how many words of memory the instruction on the<br />

current line will occupy. By keeping track of the instructions’ sizes, the assembler<br />

can determine where the next instruction goes. To compute the size of a variable-length

instruction, like those on the VAX, an assembler has to examine it in detail.<br />

However, fixed-length instructions, like those on MIPS, require only a cursory<br />

examination. The assembler performs a similar calculation to compute the space<br />

required for data statements. When the assembler reaches the end of an assembly<br />

file, the symbol table records the location of each label defined in the file.<br />

The assembler uses the information in the symbol table during a second pass<br />

over the file, which actually produces machine code. The assembler again examines<br />

each line in the file. If the line contains an instruction, the assembler combines<br />

the binary representations of its opcode <strong>and</strong> oper<strong>and</strong>s (register specifiers or<br />

memory address) into a legal instruction. The process is similar to the one used in<br />

Section 2.5 in Chapter 2. Instructions <strong>and</strong> data words that reference an external<br />

symbol defined in another file cannot be completely assembled (they are unresolved),<br />

since the symbol’s address is not in the symbol table. An assembler does<br />

not complain about unresolved references, since the corresponding label is likely<br />

to be defined in another file.<br />
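To make the bookkeeping concrete, here is a highly simplified C sketch of a symbol table and the operations the two passes rely on; the data structures and names are invented for illustration and ignore many details (table growth, duplicate labels, data directives):

#include <string.h>

#define MAX_SYMS 1024

struct symbol { char name[64]; unsigned address; };
static struct symbol symtab[MAX_SYMS];
static int nsyms = 0;

/* Pass 1: when a line begins with a label, record the label and the
 * address of the memory word the next instruction or datum will occupy. */
static void record_label(const char *name, unsigned address)
{
    strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
    symtab[nsyms].address = address;
    nsyms++;
}

/* Pass 2: look up a label used as an operand; a miss means an
 * unresolved reference that is left for the linker to fix. */
static int lookup_label(const char *name, unsigned *address)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0) {
            *address = symtab[i].address;
            return 1;
        }
    return 0;   /* unresolved */
}

A real assembler would also advance a location counter as it scans each line (four bytes per MIPS instruction) and would emit relocation and symbol-table entries for the names that remain unresolved.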

The BIG Picture

Assembly language is a programming language. Its principal difference<br />

from high-level languages such as BASIC, Java, <strong>and</strong> C is that assembly language<br />

provides only a few, simple types of data <strong>and</strong> control flow. Assembly<br />

language programs do not specify the type of value held in a variable.<br />

Instead, a programmer must apply the appropriate operations (e.g., integer<br />

or floating-point addition) to a value. In addition, in assembly language,

programs must implement all control flow with gotos. Both factors make

assembly language programming for any machine—MIPS or x86—more<br />

difficult <strong>and</strong> error-prone than writing in a high-level language.



Elaboration: If an assembler’s speed is important, this two-step process can be done in one pass over the assembly file with a technique known as backpatching. In its pass over the file, the assembler builds a (possibly incomplete) binary representation of every instruction. If the instruction references a label that has not yet been defined, the assembler records the label and instruction in a table. When a label is defined, the assembler consults this table to find all instructions that contain a forward reference to the label. The assembler goes back and corrects their binary representation to incorporate the address of the label. Backpatching speeds assembly because the assembler only reads its input once. However, it requires an assembler to hold the entire binary representation of a program in memory so instructions can be backpatched. This requirement can limit the size of programs that can be assembled. The process is complicated by machines with several types of branches that span different ranges of instructions. When the assembler first sees an unresolved label in a branch instruction, it must either use the largest possible branch or risk having to go back and readjust many instructions to make room for a larger branch.

backpatching A method for translating from assembly language to machine instructions in which the assembler builds a (possibly incomplete) binary representation of every instruction in one pass over a program and then returns to fill in previously undefined labels.

Object File Format<br />

Assemblers produce object files. An object file on UNIX contains six distinct<br />

sections (see Figure A.2.1):<br />

■ The object file header describes the size <strong>and</strong> position of the other pieces of<br />

the file.<br />

■ The text segment contains the machine language code for routines in the<br />

source file. These routines may be unexecutable because of unresolved<br />

references.<br />

■ The data segment contains a binary representation of the data in the source<br />

file. The data also may be incomplete because of unresolved references to<br />

labels in other files.<br />

■ The relocation information identifies instructions <strong>and</strong> data words that<br />

depend on absolute addresses. These references must change if portions of<br />

the program are moved in memory.<br />

■ The symbol table associates addresses with external labels in the source file<br />

<strong>and</strong> lists unresolved references.<br />

■ The debugging information contains a concise description of the way the<br />

program was compiled, so a debugger can find which instruction addresses<br />

correspond to lines in a source file <strong>and</strong> print the data structures in readable<br />

form.<br />

The assembler produces an object file that contains a binary representation of<br />

the program <strong>and</strong> data <strong>and</strong> additional information to help link pieces of a program.<br />

text segment The segment of a UNIX object file that contains the machine language code for routines in the source file.

data segment The segment of a UNIX object or executable file that contains a binary representation of the initialized data used by the program.

relocation information The segment of a UNIX object file that identifies instructions and data words that depend on absolute addresses.

absolute address A variable’s or routine’s actual address in memory.



[Figure A.2.1 shows the six sections of an object file: object file header, text segment, data segment, relocation information, symbol table, and debugging information.]

FIGURE A.2.1 Object file. A UNIX assembler produces an object file with six distinct sections.

This relocation information is necessary because the assembler does not know<br />

which memory locations a procedure or piece of data will occupy after it is linked<br />

with the rest of the program. Procedures <strong>and</strong> data from a file are stored in a contiguous<br />

piece of memory, but the assembler does not know where this mem ory will<br />

be located. The assembler also passes some symbol table entries to the linker. In<br />

particular, the assembler must record which external symbols are defined in a file<br />

<strong>and</strong> what unresolved references occur in a file.<br />

Elaboration: For convenience, assemblers assume each file starts at the same<br />

address (for example, location 0) with the expectation that the linker will relocate the code<br />

<strong>and</strong> data when they are assigned locations in memory. The assembler produces relocation<br />

information, which contains an entry describing each instruction or data word in the file<br />

that references an absolute address. On MIPS, only the subroutine call, load, <strong>and</strong> store<br />

instructions reference absolute addresses. Instructions that use PC- relative addressing,<br />

such as branches, need not be relocated.<br />

Additional Facilities<br />

Assemblers provide a variety of convenience features that help make assembler<br />

programs shorter <strong>and</strong> easier to write, but do not fundamentally change assembly<br />

language. For example, data layout directives allow a programmer to describe data<br />

in a more concise <strong>and</strong> natural manner than its binary representation.<br />

In Figure A.1.4, the directive<br />

.asciiz "The sum from 0 .. 100 is %d\n"

stores characters from the string in memory. Contrast this line with the alternative<br />

of writing each character as its ASCII value (Figure 2.15 in Chapter 2 describes the<br />

ASCII encoding for characters):<br />

.byte 84, 104, 101, 32, 115, 117, 109, 32<br />

.byte 102, 114, 111, 109, 32, 48, 32, 46<br />

.byte 46, 32, 49, 48, 48, 32, 105, 115<br />

.byte 32, 37, 100, 10, 0<br />

The .asciiz directive is easier to read because it represents characters as letters,<br />

not binary numbers. An assembler can translate characters to their binary representation<br />

much faster <strong>and</strong> more accurately than a human can. Data layout directives



specify data in a human-readable form that the assembler translates to binary. Other<br />

layout directives are described in Section A.10.<br />

String Directive<br />

Define the sequence of bytes produced by this directive:<br />

.asciiz “The quick brown fox jumps over the lazy dog”<br />

EXAMPLE<br />

.byte 84, 104, 101, 32, 113, 117, 105, 99<br />

.byte 107, 32, 98, 114, 111, 119, 110, 32<br />

.byte 102, 111, 120, 32, 106, 117, 109, 112<br />

.byte 115, 32, 111, 118, 101, 114, 32, 116<br />

.byte 104, 101, 32, 108, 97, 122, 121, 32<br />

.byte 100, 111, 103, 0<br />

ANSWER<br />

A macro is a pattern-matching and replacement facility that provides a simple

mechanism to name a frequently used sequence of instructions. Instead of repeatedly<br />

typing the same instructions every time they are used, a programmer invokes<br />

the macro <strong>and</strong> the assembler replaces the macro call with the corresponding<br />

sequence of instructions. Macros, like subroutines, permit a programmer to create<br />

<strong>and</strong> name a new abstraction for a common operation. Unlike subroutines, however,<br />

macros do not cause a subroutine call <strong>and</strong> return when the program runs,<br />

since a macro call is replaced by the macro’s body when the program is assembled.<br />

After this replacement, the resulting assembly is indistinguishable from the equivalent<br />

program written without macros.<br />

Macros<br />

As an example, suppose that a programmer needs to print many numbers. The<br />

library routine printf accepts a format string <strong>and</strong> one or more values to print<br />

as its arguments. A programmer could print the integer in register $7 with the<br />

following instructions:<br />

EXAMPLE<br />

.data<br />

int_str: .asciiz "%d"

.text<br />

la $a0, int_str # Load string address<br />

# into first arg



mov $a1, $7 # Load value into<br />

# second arg<br />

jal printf # Call the printf routine<br />

The .data directive tells the assembler to store the string in the program’s data<br />

segment, <strong>and</strong> the .text directive tells the assembler to store the instruc tions<br />

in its text segment.<br />

However, printing many numbers in this fashion is tedious <strong>and</strong> produces a<br />

verbose program that is difficult to underst<strong>and</strong>. An alternative is to introduce<br />

a macro, print_int, to print an integer:<br />

formal parameter A variable that is the argument to a procedure or macro; it is replaced by that argument once the macro is expanded.

.data<br />

int_str: .asciiz "%d"

.text<br />

.macro print_int($arg)<br />

la $a0, int_str # Load string address into<br />

# first arg<br />

mov $a1, $arg # Load macro’s parameter<br />

# ($arg) into second arg<br />

jal printf # Call the printf routine<br />

.end_macro<br />

print_int($7)<br />

The macro has a formal parameter, $arg, that names the argument to the<br />

macro. When the macro is exp<strong>and</strong>ed, the argument from a call is substituted<br />

for the formal parameter throughout the macro’s body. Then the assembler<br />

replaces the call with the macro’s newly exp<strong>and</strong>ed body. In the first call on<br />

print_int, the argument is $7, so the macro exp<strong>and</strong>s to the code<br />

la $a0, int_str<br />

mov $a1, $7<br />

jal printf<br />

In a second call on print_int, say, print_int($t0), the argument is $t0,<br />

so the macro exp<strong>and</strong>s to<br />

la $a0, int_str<br />

mov $a1, $t0<br />

jal printf<br />

What does the call print_int($a0) exp<strong>and</strong> to?



la $a0, int_str<br />

mov $a1, $a0<br />

jal printf<br />

ANSWER<br />

This example illustrates a drawback of macros. A programmer who uses<br />

this macro must be aware that print_int uses register $a0 <strong>and</strong> so cannot<br />

correctly print the value in that register.<br />

Some assemblers also implement pseudoinstructions, which are instructions provided<br />

by an assembler but not implemented in hardware. Chapter 2 contains<br />

many examples of how the MIPS assembler synthesizes pseudoinstructions<br />

<strong>and</strong> addressing modes from the spartan MIPS hardware instruction set. For<br />

example, Section 2.7 in Chapter 2 describes how the assembler synthesizes the<br />

blt instruc tion from two other instructions: slt <strong>and</strong> bne. By extending the<br />

instruction set, the MIPS assembler makes assembly language programming<br />

easier without complicating the hardware. Many pseudoinstructions could also<br />

be simulated with macros, but the MIPS assembler can generate better code for<br />

these instructions because it can use a dedicated register ($at) <strong>and</strong> is able to<br />

optimize the generated code.<br />

Hardware/Software Interface

Elaboration: Assemblers conditionally assemble pieces of code, which permits a programmer to include or exclude groups of instructions when a program is assembled. This feature is particularly useful when several versions of a program differ by a small amount. Rather than keep these programs in separate files (which greatly complicates fixing bugs in the common code), programmers typically merge the versions into a single file. Code particular to one version is conditionally assembled, so it can be excluded when other versions of the program are assembled.

If macros and conditional assembly are useful, why do assemblers for UNIX systems rarely, if ever, provide them? One reason is that most programmers on these systems write programs in higher-level languages like C. Most of the assembly code is produced by compilers, which find it more convenient to repeat code rather than define macros. Another reason is that other tools on UNIX, such as cpp, the C preprocessor, or m4, a general macro processor, can provide macros and conditional assembly for assembly language programs.



system kernel brings a program into memory <strong>and</strong> starts it running. To start a program,<br />

the operating system performs the following steps:<br />

1. It reads the executable file’s header to determine the size of the text <strong>and</strong> data<br />

segments.<br />

2. It creates a new address space for the program. This address space is large<br />

enough to hold the text <strong>and</strong> data segments, along with a stack segment (see<br />

Section A.5).<br />

3. It copies instructions <strong>and</strong> data from the executable file into the new address<br />

space.<br />

4. It copies arguments passed to the program onto the stack.<br />

5. It initializes the machine registers. In general, most registers are cleared, but<br />

the stack pointer must be assigned the address of the first free stack location<br />

(see Section A.5).<br />

6. It jumps to a start-up routine that copies the program’s arguments from the<br />

stack to registers <strong>and</strong> calls the program’s main routine. If the main routine<br />

returns, the start-up routine terminates the program with the exit system call.<br />
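Conceptually, the start-up routine in step 6 behaves like the following C sketch; the name __start is illustrative, and the real routine is supplied by the system and is written in assembly:

#include <stdlib.h>

extern int main(int argc, char *argv[]);

/* Invoked by the operating system with the program's arguments already
 * placed on the stack; calls main and terminates with the exit system call. */
void __start(int argc, char *argv[])
{
    exit(main(argc, argv));
}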

A.5 Memory Usage<br />

static data The portion of memory that contains data whose size is known to the compiler and whose lifetime is the program’s entire execution.

The next few sections elaborate the description of the MIPS architecture presented<br />

earlier in the book. Earlier chapters focused primarily on hardware <strong>and</strong> its relationship<br />

with low-level software. These sections focus primarily on how assembly language<br />

programmers use MIPS hardware. These sections describe a set of conventions<br />

followed on many MIPS systems. For the most part, the hardware does not impose<br />

these conventions. Instead, they represent an agreement among programmers to<br />

follow the same set of rules so that software written by different people can work<br />

together <strong>and</strong> make effective use of MIPS hardware.<br />

Systems based on MIPS processors typically divide memory into three parts<br />

(see Figure A.5.1). The first part, near the bottom of the address space (starting<br />

at address 400000 hex ), is the text segment, which holds the program’s instructions.<br />

The second part, above the text segment, is the data segment, which is further<br />

divided into two parts. Static data (starting at address 10000000 hex ) contains<br />

objects whose size is known to the compiler <strong>and</strong> whose lifetime—the interval<br />

during which a program can access them—is the program’s entire execution. For

example, in C, global variables are statically allocated, since they can be referenced



[Figure A.5.1 shows the layout of memory: the reserved area at the bottom of the address space, the text segment starting at 400000 hex, the data segment starting at 10000000 hex with static data followed by dynamic data growing upward, and the stack segment growing downward from 7fffffff hex.]

FIGURE A.5.1 Layout of memory.

anytime during a program’s execution. The linker both assigns static objects to<br />

locations in the data segment <strong>and</strong> resolves references to these objects.<br />

Immediately above static data is dynamic data. This data, as its name implies, is<br />

allocated by the program as it executes.

Hardware/Software Interface

Because the data segment begins far above the program at address 10000000 hex, load and store instructions cannot directly reference data objects with their 16-bit offset fields (see Section 2.5 in Chapter 2). For example, to load the word in the data segment at address 10010020 hex into register $v0 requires two instructions:

lui $s0, 0x1001 # 0x1001 means 1001 base 16<br />

lw $v0, 0x0020($s0) # 0x10010000 + 0x0020 = 0x10010020<br />

(The 0x before a number means that it is a hexadecimal value. For example, 0x8000<br />

is 8000 hex or 32,768 ten .)<br />

To avoid repeating the lui instruction at every load <strong>and</strong> store, MIPS systems<br />

typically dedicate a register ($gp) as a global pointer to the static data segment. This<br />

register contains address 10008000 hex, so load <strong>and</strong> store instructions can use their<br />

signed 16-bit offset fields to access the first 64 KB of the static data segment. With<br />

this global pointer, we can rewrite the example as a single instruction:<br />

lw $v0, 0x8020($gp)<br />

Of course, a global pointer register makes addressing locations 10000000 hex –<br />

10010000 hex faster than other heap locations. The MIPS compiler usually stores<br />

global variables in this area, because these variables have fixed locations <strong>and</strong> fit better<br />

than other global data, such as arrays.



stack segment The<br />

portion of memory used<br />

by a program to hold<br />

procedure call frames.<br />

In C programs, the malloc library routine finds and returns a new block of memory. Since a compiler cannot predict how

much memory a program will allocate, the operating system exp<strong>and</strong>s the dynamic<br />

data area to meet dem<strong>and</strong>. As the upward arrow in the figure indicates, malloc<br />

exp<strong>and</strong>s the dynamic area with the sbrk system call, which causes the operating<br />

system to add more pages to the program’s virtual address space (see Section 5.7 in<br />

Chapter 5) immediately above the dynamic data segment.<br />

The third part, the program stack segment, resides at the top of the virtual<br />

address space (starting at address 7fffffff hex ). Like dynamic data, the maximum size<br />

of a program’s stack is not known in advance. As the program pushes values on to<br />

the stack, the operating system exp<strong>and</strong>s the stack segment down toward the data<br />

segment.<br />

This three-part division of memory is not the only possible one. However, it has<br />

two important characteristics: the two dynamically exp<strong>and</strong>able segments are as far<br />

apart as possible, <strong>and</strong> they can grow to use a program’s entire address space.<br />

A.6 Procedure Call Convention<br />

register use convention: Also called procedure call convention. A software protocol governing the use of registers by procedures.

Conventions governing the use of registers are necessary when procedures in a<br />

program are compiled separately. To compile a particular procedure, a compiler<br />

must know which registers it may use <strong>and</strong> which registers are reserved for other<br />

procedures. Rules for using registers are called register use or procedure call<br />

conventions. As the name implies, these rules are, for the most part, conventions<br />

followed by software rather than rules enforced by hardware. However, most compilers

<strong>and</strong> programmers try very hard to follow these conventions because violating<br />

them causes insidious bugs.<br />

The calling convention described in this section is the one used by the gcc compiler.<br />

The native MIPS compiler uses a more complex convention that is slightly<br />

faster.<br />

The MIPS CPU contains 32 general-purpose registers that are numbered 0–31.

■ Register $0 always contains the hardwired value 0.

■ Registers $at (1), $k0 (26), <strong>and</strong> $k1 (27) are reserved for the assembler <strong>and</strong><br />

operating system <strong>and</strong> should not be used by user programs or compilers.<br />

■ Registers $a0–$a3 (4–7) are used to pass the first four arguments to rou tines<br />

(remaining arguments are passed on the stack). Registers $v0 <strong>and</strong> $v1 (2, 3)<br />

are used to return values from functions.



■ Registers $t0–$t9 (8–15, 24, 25) are caller-saved registers that are used<br />

to hold temporary quantities that need not be preserved across calls (see<br />

Section 2.8 in Chapter 2).<br />

■ Registers $s0–$s7 (16–23) are callee-saved registers that hold long-lived<br />

values that should be preserved across calls.<br />

■ Register $gp (28) is a global pointer that points to the middle of a 64K block<br />

of memory in the static data segment.<br />

■ Register $sp (29) is the stack pointer, which points to the last location on<br />

the stack. Register $fp (30) is the frame pointer. The jal instruction writes<br />

register $ra (31), the return address from a procedure call. These two registers<br />

are explained in the next section.<br />

The two-letter abbreviations <strong>and</strong> names for these registers—for example $sp<br />

for the stack pointer—reflect the registers’ intended uses in the procedure call<br />

convention. In describing this convention, we will use the names instead of register numbers. Figure A.6.1 lists the registers and describes their intended uses.

Procedure Calls<br />

This section describes the steps that occur when one procedure (the caller) invokes<br />

another procedure (the callee). Programmers who write in a high-level language<br />

(like C or Pascal) never see the details of how one procedure calls another, because<br />

the compiler takes care of this low-level bookkeeping. However, assembly language<br />

programmers must explicitly implement every procedure call <strong>and</strong> return.<br />

Most of the bookkeeping associated with a call is centered around a block<br />

of memory called a procedure call frame. This memory is used for a variety of<br />

purposes:<br />

■ To hold values passed to a procedure as arguments<br />

■ To save registers that a procedure may modify, but which the procedure’s<br />

caller does not want changed<br />

■ To provide space for variables local to a procedure<br />

In most programming languages, procedure calls <strong>and</strong> returns follow a strict<br />

last-in, first-out (LIFO) order, so this memory can be allocated <strong>and</strong> deallocated on<br />

a stack, which is why these blocks of memory are sometimes called stack frames.<br />

Figure A.6.2 shows a typical stack frame. The frame consists of the memory<br />

between the frame pointer ($fp), which points to the first word of the frame,<br />

<strong>and</strong> the stack pointer ($sp), which points to the last word of the frame. The stack<br />

grows down from higher memory addresses, so the frame pointer points above the stack pointer.

caller-saved register: A register whose value must be saved by the calling routine, before a call, if the caller needs that value after the call; the called routine may freely overwrite it.

callee-saved register: A register whose value must be saved and restored by the called routine if it uses the register, so that the caller sees it unchanged across the call.

procedure call frame: A block of memory that is used to hold values passed to a procedure as arguments, to save registers that a procedure may modify but that the procedure's caller does not want changed, and to provide space for variables local to a procedure.


FIGURE A.6.2 Layout of a stack frame. The frame pointer ($fp) points to the first word in the currently executing procedure's stack frame, and the stack pointer ($sp) points to the last word of the frame. From higher to lower memory addresses, the frame holds the memory-passed arguments (argument 6, argument 5, and so on), the saved registers, and the local variables; the stack grows toward lower addresses. The first four arguments are passed in registers, so the fifth argument is the first one stored on the stack.

A stack frame may be built in many different ways; however, the caller <strong>and</strong><br />

callee must agree on the sequence of steps. The steps below describe the calling<br />

convention used on most MIPS machines. This convention comes into play at three<br />

points during a procedure call: immediately before the caller invokes the callee,<br />

just as the callee starts executing, <strong>and</strong> immediately before the callee returns to the<br />

caller. In the first part, the caller puts the procedure call arguments in standard places and invokes the callee. To do so, the caller takes the following steps:

1. Pass arguments. By convention, the first four arguments are passed in registers<br />

$a0–$a3. Any remaining arguments are pushed on the stack <strong>and</strong> appear<br />

at the beginning of the called procedure’s stack frame.<br />

2. Save caller-saved registers. The called procedure can use these registers<br />

($a0–$a3 <strong>and</strong> $t0–$t9) without first saving their value. If the caller expects<br />

to use one of these registers after a call, it must save its value before the call.<br />

3. Execute a jal instruction (see Section 2.8 of Chapter 2), which jumps to the<br />

callee’s first instruction <strong>and</strong> saves the return address in register $ra.



Before a called routine starts running, it must take the following steps to set up<br />

its stack frame:<br />

1. Allocate memory for the frame by subtracting the frame’s size from the stack<br />

pointer.<br />

2. Save callee-saved registers in the frame. A callee must save the values in<br />

these registers ($s0–$s7, $fp, <strong>and</strong> $ra) before altering them, since the<br />

caller expects to find these registers unchanged after the call. Register $fp is<br />

saved by every procedure that allocates a new stack frame. However, register<br />

$ra only needs to be saved if the callee itself makes a call. Any other callee-saved registers that are used must also be saved.

3. Establish the frame pointer by adding the stack frame’s size minus 4 to $sp<br />

<strong>and</strong> storing the sum in register $fp.<br />

Hardware/Software Interface

The MIPS register use convention provides callee- and caller-saved registers, because both types of registers are advantageous in different circumstances. Callee-saved registers are better used to hold long-lived values, such as variables from a user's program. These registers are only saved during a procedure call if the callee expects to use the register. On the other hand, caller-saved registers are better used to hold short-lived quantities that do not persist across a call, such as immediate values in an address calculation. During a call, the callee can also use these registers for short-lived temporaries.
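As a small, hedged sketch of this division of labor (the routine helper and the constants here are hypothetical, not part of the original example): a long-lived loop count is kept in callee-saved $s0 so it survives calls, while a scratch value lives in caller-saved $t0, which helper is free to overwrite.

        li   $s0, 10           # long-lived count in callee-saved $s0
loop:   li   $t0, 4            # short-lived scratch in caller-saved $t0
        add  $a0, $t0, $s0     # build an argument from the temporary
        jal  helper            # helper (hypothetical) may clobber $t0-$t9, $a0-$a3, $v0, $v1
        addi $s0, $s0, -1      # $s0 still holds the count; helper preserved it
        bne  $s0, $zero, loop  # repeat until the count reaches 0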

Finally, the callee returns to the caller by executing the following steps:<br />

1. If the callee is a function that returns a value, place the returned value in<br />

register $v0.<br />

2. Restore all callee-saved registers that were saved upon procedure entry.<br />

3. Pop the stack frame by adding the frame size to $sp.<br />

4. Return by jumping to the address in register $ra.<br />

recursive procedures: Procedures that call themselves either directly or indirectly through a chain of calls.

Elaboration: A programming language that does not permit recursive procedures—<br />

procedures that call themselves either directly or indirectly through a chain of calls—need<br />

not allocate frames on a stack. In a nonrecursive language, each procedure’s frame<br />

may be statically allocated, since only one invocation of a procedure can be active at a<br />

time. Older versions of Fortran prohibited recursion, because statically allocated frames<br />

produced faster code on some older machines. However, on load-store architectures like

MIPS, stack frames may be just as fast, because a frame pointer register points directly



to the active stack frame, which permits a single load or store instruction to access

values in the frame. In addition, recursion is a valuable programming technique.<br />

Procedure Call Example<br />

As an example, consider the C routine<br />

main ()<br />

{<br />

printf ("The factorial of 10 is %d\n", fact (10));

}<br />

int fact (int n)<br />

{<br />

if (n < 1)<br />

return (1);<br />

else<br />

return (n * fact (n - 1));<br />

}<br />

which computes <strong>and</strong> prints 10! (the factorial of 10, 10! = 10 × 9 × . . . × 1). fact is<br />

a recursive routine that computes n! by multiplying n times (n - 1)!. The assembly<br />

code for this routine illustrates how programs manipulate stack frames.<br />

Upon entry, the routine main creates its stack frame and saves the two callee-saved registers it will modify: $fp and $ra. The frame is larger than required for these two registers because the calling convention requires the minimum size of a

stack frame to be 24 bytes. This minimum frame can hold four argument registers<br />

($a0–$a3) <strong>and</strong> the return address $ra, padded to a double-word boundary<br />

(24 bytes). Since main also needs to save $fp, its stack frame must be two words<br />

larger (remember: the stack pointer is kept doubleword aligned).<br />

.text<br />

.globl main<br />

main:<br />

subu $sp,$sp,32 # Stack frame is 32 bytes long<br />

sw $ra,20($sp) # Save return address<br />

sw $fp,16($sp) # Save old frame pointer<br />

addiu $fp,$sp,28 # Set up frame pointer<br />

The routine main then calls the factorial routine <strong>and</strong> passes it the single argument<br />

10. After fact returns, main calls the library routine printf <strong>and</strong> passes it both<br />

a format string <strong>and</strong> the result returned from fact:



li $a0,10 # Put argument (10) in $a0<br />

jal fact # Call factorial function<br />

la $a0,$LC # Put format string in $a0<br />

move $a1,$v0 # Move fact result to $a1<br />

jal printf # Call the print function<br />

Finally, after printing the factorial, main returns. But first, it must restore the<br />

registers it saved <strong>and</strong> pop its stack frame:<br />

lw $ra,20($sp) # Restore return address<br />

lw $fp,16($sp) # Restore frame pointer<br />

addiu $sp,$sp,32 # Pop stack frame<br />

jr $ra # Return to caller<br />

.rdata<br />

$LC:<br />

.ascii "The factorial of 10 is %d\n\000"

The factorial routine is similar in structure to main. First, it creates a stack frame<br />

<strong>and</strong> saves the callee-saved registers it will use. In addition to saving $ra <strong>and</strong> $fp,<br />

fact also saves its argument ($a0), which it will use for the recursive call:<br />

.text<br />

fact:<br />

subu $sp,$sp,32 # Stack frame is 32 bytes long<br />

sw $ra,20($sp) # Save return address<br />

sw $fp,16($sp) # Save frame pointer<br />

addiu $fp,$sp,28 # Set up frame pointer<br />

sw $a0,0($fp) # Save argument (n)<br />

The heart of the fact routine performs the computation from the C program.<br />

It tests whether the argument is greater than 0. If not, the routine returns the<br />

value 1. If the argument is greater than 0, the routine recursively calls itself to<br />

compute fact(n–1) <strong>and</strong> multiplies that value times n:<br />

lw $v0,0($fp) # Load n<br />

bgtz $v0,$L2 # Branch if n > 0<br />

li $v0,1 # Return 1<br />

b $L1 # Jump to code to return

$L2:<br />

lw $v1,0($fp) # Load n<br />

subu $v0,$v1,1 # Compute n - 1<br />

move $a0,$v0 # Move value to $a0



jal fact # Call factorial function<br />

lw $v1,0($fp) # Load n<br />

mul $v0,$v0,$v1 # Compute fact(n-1) * n<br />

Finally, the factorial routine restores the callee-saved registers <strong>and</strong> returns the<br />

value in register $v0:<br />

$L1: # Result is in $v0<br />

lw $ra, 20($sp) # Restore $ra<br />

lw $fp, 16($sp) # Restore $fp<br />

addiu $sp, $sp, 32 # Pop stack<br />

jr $ra # Return to caller<br />

Stack in Recursive Procedure<br />

Figure A.6.3 shows the stack at the call fact(7). main runs first, so its frame<br />

is deepest on the stack. main calls fact(10), whose stack frame is next on the<br />

stack. Each invocation recursively invokes fact to compute the next-lowest<br />

factorial. The stack frames parallel the LIFO order of these calls. What does the<br />

stack look like when the call to fact(10) returns?<br />

EXAMPLE<br />

FIGURE A.6.3 Stack frames during the call of fact(7). main's frame, deepest on the stack, holds the old $ra and old $fp that main saved on entry. Above it (at lower addresses, since the stack grows downward) are the frames for fact(10), fact(9), fact(8), and fact(7); each holds the old $a0 (the argument n), old $ra, and old $fp saved by that invocation.


ANSWER

When the call to fact(10) returns, all of the fact frames have been popped, and the stack again contains only main's frame, which holds the old $ra and old $fp that main saved on entry.

Elaboration: The difference between the MIPS compiler <strong>and</strong> the gcc compiler is that<br />

the MIPS compiler usually does not use a frame pointer, so this register is available as<br />

another callee-saved register, $s8. This change saves a couple of instructions in the<br />

procedure call <strong>and</strong> return sequence. However, it complicates code generation, because<br />

a procedure must access its stack frame with $sp, whose value can change during a<br />

procedure’s execution if values are pushed on the stack.<br />

Another Procedure Call Example<br />

As another example, consider the following routine that computes the tak function,<br />

which is a widely used benchmark created by Ikuo Takeuchi. This function<br />

does not compute anything useful, but is a heavily recursive program that illustrates<br />

the MIPS calling convention.<br />

int tak (int x, int y, int z)<br />

{<br />

if (y < x)<br />

return 1+ tak (tak (x - 1, y, z),<br />

tak (y - 1, z, x),<br />

tak (z - 1, x, y));<br />

else<br />

return z;<br />

}<br />

int main ()<br />

{<br />

tak(18, 12, 6);<br />

}<br />

The assembly code for this program is shown below. The tak function first saves<br />

its return address in its stack frame and its arguments in callee-saved registers,

since the routine may make calls that need to use registers $a0–$a2 <strong>and</strong> $ra. The<br />

function uses callee-saved registers, since they hold values that persist over the



lifetime of the function, which includes several calls that could potentially modify<br />

registers.<br />

.text<br />

.globl tak

tak:<br />

subu $sp, $sp, 40<br />

sw $ra, 32($sp)<br />

sw $s0, 16($sp) # x<br />

move $s0, $a0<br />

sw $s1, 20($sp) # y<br />

move $s1, $a1<br />

sw $s2, 24($sp) # z<br />

move $s2, $a2<br />

sw $s3, 28($sp) # temporary<br />

The routine then begins execution by testing if y < x. If not, it branches to label<br />

L1, which is shown below.<br />

bge $s1, $s0, L1 # if (y < x)<br />

If y < x, then it executes the body of the routine, which contains four recursive<br />

calls. The first call uses almost the same arguments as its parent:<br />

addiu $a0, $s0, -1<br />

move $a1, $s1<br />

move $a2, $s2<br />

jal tak # tak (x - 1, y, z)<br />

move $s3, $v0<br />

Note that the result from the first recursive call is saved in register $s3, so that it<br />

can be used later.<br />

The function now prepares arguments for the second recursive call.<br />

addiu $a0, $s1, -1<br />

move $a1, $s2<br />

move $a2, $s0<br />

jal tak # tak (y - 1, z, x)<br />

In the instructions below, the result from this recursive call is saved in register<br />

$s0. But first we need to read, for the last time, the saved value of the first argument<br />

from this register.



addiu $a0, $s2, -1<br />

move $a1, $s0<br />

move $a2, $s1<br />

move $s0, $v0<br />

jal tak # tak (z - 1, x, y)<br />

After the three inner recursive calls, we are ready for the final recursive call. After<br />

the call, the function’s result is in $v0 <strong>and</strong> control jumps to the function’s epilogue.<br />

move $a0, $s3<br />

move $a1, $s0<br />

move $a2, $v0<br />

jal tak # tak (tak(...), tak(...), tak(...))<br />

addiu $v0, $v0, 1<br />

j L2<br />

The code at label L1 is the else part of the if-then-else statement. It just moves

the value of argument z into the return register <strong>and</strong> falls into the function epilogue.<br />

L1:<br />

move $v0, $s2<br />

The code below is the function epilogue, which restores the saved registers <strong>and</strong><br />

returns the function’s result to its caller.<br />

L2:<br />

lw $ra, 32($sp)<br />

lw $s0, 16($sp)<br />

lw $s1, 20($sp)<br />

lw $s2, 24($sp)<br />

lw $s3, 28($sp)<br />

addiu $sp, $sp, 40<br />

jr $ra<br />

The main routine calls the tak function with its initial arguments, then takes the<br />

computed result (7) <strong>and</strong> prints it using SPIM’s system call for printing integers.<br />

.globl main<br />

main:<br />

subu $sp, $sp, 24<br />

sw $ra, 16($sp)<br />

li $a0, 18<br />

li $a1, 12
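The remainder of main falls on a page missing from this extraction. Based on the description above (load the third argument, call tak, print the computed result with SPIM's system call for printing integers, and return), a hedged reconstruction might read:

li $a2, 6 # Put third argument (6) in $a2
jal tak # Call tak(18, 12, 6)
move $a0, $v0 # Move result to $a0 for printing
li $v0, 1 # System call code for print_int (assumed from Figure A.9.1)
syscall # Print the result
lw $ra, 16($sp) # Restore return address
addiu $sp, $sp, 24 # Pop stack frame
jr $ra # Return to caller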


A.7 Exceptions and Interrupts

These seven registers are part of coprocessor 0’s register set. They are accessed<br />

by the mfc0 <strong>and</strong> mtc0 instructions. After an exception, register EPC contains the<br />

address of the instruction that was executing when the exception occurred. If the<br />

exception was caused by an external interrupt, then the instruction will not have<br />

started executing. All other exceptions are caused by the execution of the instruction<br />

at EPC, except when the offending instruction is in the delay slot of a branch<br />

or jump. In that case, EPC points to the branch or jump instruction <strong>and</strong> the BD bit<br />

is set in the Cause register. When that bit is set, the exception h<strong>and</strong>ler must look<br />

at EPC + 4 for the offending instruction. However, in either case, an exception

h<strong>and</strong>ler properly resumes the program by returning to the instruction at EPC.<br />

If the instruction that caused the exception made a memory access, register<br />

BadVAddr contains the referenced memory location’s address.<br />

The Count register is a timer that increments at a fixed rate (by default, every<br />

10 milliseconds) while SPIM is running. When the value in the Count register<br />

equals the value in the Compare register, a hardware interrupt at priority level 5<br />

occurs.<br />
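As a hedged sketch (the coprocessor 0 register numbers 9 for Count and 11 for Compare are assumed here from SPIM's documentation; they are not stated in this excerpt), a program could arrange for such an interrupt roughly as follows:

mfc0 $t0, $9 # Read current Count (coprocessor 0 register 9, assumed)
addiu $t0, $t0, 100 # Choose a value 100 ticks in the future
mtc0 $t0, $11 # Write it to Compare (coprocessor 0 register 11, assumed)
# When Count reaches Compare, a level 5 hardware interrupt is requested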

Figure A.7.1 shows the subset of the Status register fields implemented by the<br />

MIPS simulator SPIM. The interrupt mask field contains a bit for each of the<br />

six hardware <strong>and</strong> two software interrupt levels. A mask bit that is 1 allows interrupts<br />

at that level to interrupt the processor. A mask bit that is 0 disables interrupts<br />

at that level. When an interrupt arrives, it sets its interrupt pending bit in the<br />

Cause register, even if the mask bit is disabled. When an interrupt is pending, it will<br />

interrupt the processor when its mask bit is subsequently enabled.<br />

The user mode bit is 0 if the processor is running in kernel mode <strong>and</strong> 1 if it is<br />

running in user mode. On SPIM, this bit is fixed at 1, since the SPIM processor<br />

does not implement kernel mode. The exception level bit is normally 0, but is set to<br />

1 after an exception occurs. When this bit is 1, interrupts are disabled <strong>and</strong> the EPC<br />

is not updated if another exception occurs. This bit prevents an exception h<strong>and</strong>ler<br />

from being disturbed by an interrupt or exception, but it should be reset when the<br />

h<strong>and</strong>ler finishes. If the interrupt enable bit is 1, interrupts are allowed. If it is<br />

0, they are disabled.<br />

Figure A.7.2 shows the subset of Cause register fields that SPIM implements.<br />

The branch delay bit is 1 if the last exception occurred in an instruction executed in<br />

the delay slot of a branch. The interrupt pending bits become 1 when an interrupt



faults are requests from a process to the operating system to perform a service,<br />

such as bringing in a page from disk. The operating system processes these requests<br />

and resumes the process. The final type of exception is an interrupt from an external device. Interrupts generally cause the operating system to move data to or from an I/O

device <strong>and</strong> resume the interrupted process.<br />

The code in the example below is a simple exception h<strong>and</strong>ler, which invokes<br />

a routine to print a message at each exception (but not interrupts). This code is<br />

similar to the exception h<strong>and</strong>ler (exceptions.s) used by the SPIM simulator.<br />

Exception H<strong>and</strong>ler<br />

EXAMPLE<br />

The exception h<strong>and</strong>ler first saves register $at, which is used in pseudoinstructions<br />

in the h<strong>and</strong>ler code, then saves $a0 <strong>and</strong> $a1, which it later uses to<br />

pass arguments. The exception h<strong>and</strong>ler cannot store the old values from these<br />

registers on the stack, as would an ordinary routine, because the cause of the<br />

exception might have been a memory reference that used a bad value (such<br />

as 0) in the stack pointer. Instead, the exception h<strong>and</strong>ler stores these registers<br />

in an exception h<strong>and</strong>ler register ($k1, since it can’t access memory without<br />

using $at) <strong>and</strong> two memory locations (save0 <strong>and</strong> save1). If the exception<br />

routine itself could be interrupted, two locations would not be enough since<br />

the second exception would overwrite values saved during the first exception.<br />

However, this simple exception h<strong>and</strong>ler finishes running before it enables<br />

interrupts, so the problem does not arise.<br />

.ktext 0x80000180<br />

move $k1, $at # Save $at register

sw $a0, save0 # H<strong>and</strong>ler is not re-entrant <strong>and</strong> can’t use<br />

sw $a1, save1 # stack to save $a0, $a1<br />

# Don’t need to save $k0/$k1<br />

The exception h<strong>and</strong>ler then moves the Cause <strong>and</strong> EPC registers into CPU<br />

registers. The Cause <strong>and</strong> EPC registers are not part of the CPU register set.<br />

Instead, they are registers in coprocessor 0, which is the part of the CPU that handles exceptions. The instruction mfc0 $k0, $13 moves coprocessor 0's

register 13 (the Cause register) into CPU register $k0. Note that the exception<br />

h<strong>and</strong>ler need not save registers $k0 <strong>and</strong> $k1, because user programs are not<br />

supposed to use these registers. The exception h<strong>and</strong>ler uses the value from the<br />

Cause register to test whether the exception was caused by an interrupt (see the preceding table). If so, the exception is ignored. If the exception was not an

interrupt, the h<strong>and</strong>ler calls print_excp to print a message.



mfc0 $k0, $13 # Move Cause into $k0<br />

srl $a0, $k0, 2 # Extract ExcCode field<br />

andi $a0, $a0, 0xf
beq $a0, 0, done # Branch if ExcCode is Int (0)
move $a0, $k0 # Move Cause into $a0
mfc0 $a1, $14 # Move EPC into $a1

jal print_excp # Print exception error message<br />

Before returning, the exception h<strong>and</strong>ler clears the Cause register; resets<br />

the Status register to enable interrupts <strong>and</strong> clear the EXL bit, which allows<br />

subse quent exceptions to change the EPC register; <strong>and</strong> restores registers $a0,<br />

$a1, <strong>and</strong> $at. It then executes the eret (exception return) instruction, which<br />

returns to the instruction pointed to by EPC. This exception h<strong>and</strong>ler returns<br />

to the instruction following the one that caused the exception, so as to not<br />

re-execute the faulting instruction <strong>and</strong> cause the same exception again.<br />

done: mfc0 $k0, $14 # Bump EPC<br />

addiu $k0, $k0, 4 # Do not re-execute<br />

# faulting instruction<br />

mtc0 $k0, $14 # EPC<br />

mtc0 $0, $13 # Clear Cause register<br />

mfc0 $k0, $12 # Fix Status register<br />

andi $k0, 0xfffd # Clear EXL bit
ori $k0, 0x1 # Enable interrupts

mtc0 $k0, $12<br />

lw $a0, save0 # Restore registers<br />

lw $a1, save1<br />

move $at, $k1
eret # Return to EPC

.kdata<br />

save0: .word 0<br />

save1: .word 0



Elaboration: On real MIPS processors, the return from an exception h<strong>and</strong>ler is more<br />

complex. The exception h<strong>and</strong>ler cannot always jump to the instruction following EPC. For<br />

example, if the instruction that caused the exception was in a branch instruction’s delay<br />

slot (see Chapter 4), the next instruction to execute may not be the following instruction<br />

in memory.<br />

A.8 Input <strong>and</strong> Output<br />

SPIM simulates one I/O device: a memory-mapped console on which a program<br />

can read <strong>and</strong> write characters. When a program is running, SPIM connects its<br />

own terminal (or a separate console window in the X-window version xspim or<br />

the Windows version PCSpim) to the processor. A MIPS program running on<br />

SPIM can read the characters that you type. In addition, if the MIPS program<br />

writes characters to the terminal, they appear on SPIM’s terminal or console window.<br />

One exception to this rule is control-C: this character is not passed to the<br />

program, but instead causes SPIM to stop <strong>and</strong> return to comm<strong>and</strong> mode. When<br />

the program stops running (for example, because you typed control-C or because<br />

the program hit a breakpoint), the terminal is reconnected to SPIM so you can type<br />

SPIM comm<strong>and</strong>s.<br />

To use memory-mapped I/O (see below), spim or xspim must be started<br />

with the -mapped_io flag. PCSpim can enable memory-mapped I/O through a<br />

comm<strong>and</strong> line flag or the “Settings” dialog.<br />

The terminal device consists of two independent units: a receiver <strong>and</strong> a transmitter.<br />

The receiver reads characters from the keyboard. The transmitter displays<br />

characters on the console. The two units are completely independent. This means,<br />

for example, that characters typed at the keyboard are not automatically echoed on<br />

the display. Instead, a program echoes a character by reading it from the receiver<br />

<strong>and</strong> writing it to the transmitter.<br />

A program controls the terminal with four memory-mapped device registers,<br />

as shown in Figure A.8.1. “Memory-mapped’’ means that each register appears as<br />

a special memory location. The Receiver Control register is at location ffff0000 hex .<br />

Only two of its bits are actually used. Bit 0 is called “ready’’: if it is 1, it means<br />

that a character has arrived from the keyboard but has not yet been read from the<br />

Receiver Data register. The ready bit is read-only: writes to it are ignored. The ready<br />

bit changes from 0 to 1 when a character is typed at the keyboard, <strong>and</strong> it changes<br />

from 1 to 0 when the character is read from the Receiver Data register.
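As a hedged sketch of polling the receiver (the Receiver Data address ffff0004hex is assumed here from SPIM's usual register layout; this excerpt states only the Receiver Control address), a program can wait for and read one character like this:

lui $t0, 0xffff # $t0 = ffff0000 hex, base of the terminal's device registers
poll_rx:
lw $t1, 0($t0) # Read Receiver Control
andi $t1, $t1, 1 # Isolate the ready bit
beq $t1, $zero, poll_rx # Spin until a character has arrived
lw $a0, 4($t0) # Read the character from Receiver Data (ffff0004 hex, assumed)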



<strong>and</strong> is read-only. If this bit is 1, the transmitter is ready to accept a new character<br />

for output. If it is 0, the transmitter is still busy writing the previous character.<br />

Bit 1 is “interrupt enable’’ <strong>and</strong> is readable <strong>and</strong> writable. If this bit is set to 1, then<br />

the terminal requests an interrupt at hardware level 0 whenever the transmitter is<br />

ready for a new character, <strong>and</strong> the ready bit becomes 1.<br />

The final device register is the Transmitter Data register (at address ffff000c hex ).<br />

When a value is written into this location, its low-order eight bits (i.e., an ASCII<br />

character as in Figure 2.15 in Chapter 2) are sent to the console. When the Transmitter<br />

Data register is written, the ready bit in the Transmitter Control register is<br />

reset to 0. This bit stays 0 until enough time has elapsed to transmit the character<br />

to the terminal; then the ready bit becomes 1 again. The Trans mitter Data register<br />

should only be written when the ready bit of the Transmitter Control register is 1.<br />

If the transmitter is not ready, writes to the Transmitter Data register are ignored<br />

(the write appears to succeed but the character is not output).<br />
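A matching hedged sketch for output (the Transmitter Control address ffff0008hex is assumed; only the Transmitter Data address ffff000chex is given above):

lui $t0, 0xffff # Base of the terminal's device registers
poll_tx:
lw $t1, 8($t0) # Read Transmitter Control (ffff0008 hex, assumed)
andi $t1, $t1, 1 # Isolate the ready bit
beq $t1, $zero, poll_tx # Spin until the transmitter is free
sw $a0, 0xc($t0) # Write the character to Transmitter Data (ffff000c hex)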

Real computers require time to send characters to a console or terminal. These<br />

time lags are simulated by SPIM. For example, after the transmitter starts to write a<br />

character, the transmitter’s ready bit becomes 0 for a while. SPIM measures time in<br />

instructions executed, not in real clock time. This means that the transmitter does<br />

not become ready again until the processor executes a fixed number of instructions.<br />

If you stop the machine <strong>and</strong> look at the ready bit, it will not change. However, if you<br />

let the machine run, the bit eventually changes back to 1.<br />

A.9 SPIM<br />

SPIM is a software simulator that runs assembly language programs written for<br />

processors that implement the MIPS-32 architecture, specifically Release 1 of this<br />

architecture with a fixed memory mapping, no caches, <strong>and</strong> only coprocessors 0<br />

and 1.² SPIM's name is just MIPS spelled backwards. SPIM can read and immediately

execute assembly language files. SPIM is a self-contained system for running MIPS programs. It contains a debugger and provides a few operating system-like services. SPIM is much slower than a real computer (100 or more times). However, its low cost and wide availability cannot be matched by real hardware!

2. Earlier versions of SPIM (before 7.0) implemented the MIPS-1 architecture used in the original MIPS R2000 processors. This architecture is almost a proper subset of the MIPS-32 architecture, with the difference being the manner in which exceptions are handled. MIPS-32 also introduced approximately 60 new instructions, which are supported by SPIM. Programs that ran on the earlier versions of SPIM and did not use exceptions should run unmodified on newer versions of SPIM. Programs that used exceptions will require minor changes.

An obvious question is, “Why use a simulator when most people have PCs that<br />

contain processors that run significantly faster than SPIM?” One reason is that<br />

the processors in PCs are Intel 80x86s, whose architecture is far less regular and

far more complex to underst<strong>and</strong> <strong>and</strong> program than MIPS processors. The MIPS<br />

architecture may be the epitome of a simple, clean RISC machine.<br />

In addition, simulators can provide a better environment for assembly programming than an actual machine, because they can detect more errors and present a friendlier interface.

Finally, simulators are useful tools in studying computers <strong>and</strong> the programs that<br />

run on them. Because they are implemented in software, not silicon, simulators can<br />

be examined <strong>and</strong> easily modified to add new instructions, build new systems such<br />

as multiprocessors, or simply collect data.<br />

Simulation of a Virtual Machine<br />

The basic MIPS architecture is difficult to program directly because of delayed<br />

branches, delayed loads, <strong>and</strong> restricted address modes. This difficulty is tolerable<br />

since these computers were designed to be programmed in high-level languages<br />

<strong>and</strong> present an interface designed for compilers rather than assembly language<br />

programmers. A good part of the programming complexity results from delayed<br />

instructions. A delayed branch requires two cycles to execute (see the Elaborations<br />

on pages 284 <strong>and</strong> 322 of Chapter 4). In the second cycle, the instruction immediately<br />

following the branch executes. This instruction can perform useful work<br />

that normally would have been done before the branch. It can also be a nop (no<br />

operation) that does nothing. Similarly, delayed loads require two cycles to bring<br />

a value from memory, so the instruction immediately following a load cannot use<br />

the value (see Section 4.2 of Chapter 4).<br />

MIPS wisely chose to hide this complexity by having its assembler implement<br />

a virtual machine. This virtual computer appears to have nondelayed branches<br />

<strong>and</strong> loads <strong>and</strong> a richer instruction set than the actual hardware. The assembler<br />

reorganizes (rearranges) instructions to fill the delay slots. The virtual computer

also provides pseudoinstructions, which appear as real instructions in assembly<br />

language programs. The hardware, however, knows nothing about pseudoinstructions,

so the assembler must translate them into equivalent sequences of actual<br />

machine instructions. For example, the MIPS hardware only provides instructions<br />

to branch when a register is equal to or not equal to 0. Other conditional branches,<br />

such as one that branches when one register is greater than another, are synthesized<br />

by comparing the two registers <strong>and</strong> branching when the result of the comparison<br />

is true (nonzero).<br />
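For example, a sketch of the usual expansion of a branch-greater-than pseudoinstruction (the label target is arbitrary):

bgt $t0, $t1, target # what the programmer writes
# what the assembler may substitute, using its reserved register $at:
slt $at, $t1, $t0 # $at = 1 if $t1 < $t0, that is, if $t0 > $t1
bne $at, $zero, target # branch when the comparison is true (nonzero)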

virtual machine: A virtual computer that appears to have nondelayed branches and loads and a richer instruction set than the actual hardware.



Another surprise (which occurs on the real machine as well) is that a pseudoinstruction<br />

exp<strong>and</strong>s to several machine instructions. When you single-step or<br />

examine memory, the instructions that you see are different from the source

program. The correspondence between the two sets of instructions is fairly simple,<br />

since SPIM does not reorganize instructions to fill slots.<br />

Byte Order<br />

Processors can number bytes within a word so the byte with the lowest number is<br />

either the leftmost or rightmost one. The convention used by a machine is called<br />

its byte order. MIPS processors can operate with either big-endian or little-endian<br />

byte order. For example, in a big-endian machine, the directive .byte 0, 1, 2, 3<br />

would result in a memory word whose bytes, read from left (most significant) to right, are numbered 0, 1, 2, 3, while in a little-endian machine the bytes of the word are numbered 3, 2, 1, 0 in the same left-to-right order.

SPIM operates with both byte orders. SPIM’s byte order is the same as the byte<br />

order of the underlying machine that runs the simulator. For example, on an Intel<br />

80x86, SPIM is little-endian, while on a Macintosh or Sun SPARC, SPIM is big-endian.

System Calls<br />

SPIM provides a small set of operating system–like services through the system<br />

call (syscall) instruction. To request a service, a program loads the system call<br />

code (see Figure A.9.1) into register $v0 <strong>and</strong> arguments into registers $a0–$a3<br />

(or $f12 for floating-point values). System calls that return values put their results<br />

in register $v0 (or $f0 for floating-point results). For example, the following code

prints "the answer = 5":<br />

.data<br />

str:<br />

.asciiz "the answer = "

.text
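The rest of this example lies on a page missing from this extraction. A hedged completion, assuming the usual SPIM call codes from Figure A.9.1 (4 for print_string and 1 for print_int), would be:

li $v0, 4 # System call code for print_string (assumed)
la $a0, str # Address of string to print
syscall # Print "the answer = "
li $v0, 1 # System call code for print_int (assumed)
li $a0, 5 # Integer to print
syscall # Print 5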


A.10 MIPS R2000 Assembly Language

lui $at, 4096<br />

addu $at, $at, $a1<br />

lw $a0, 8($at)<br />

The first instruction loads the upper bits of the label's address into register $at, which

is the register that the assembler reserves for its own use. The second instruction adds<br />

the contents of register $a1 to the label’s partial address. Finally, the load instruction<br />

uses the hardware address mode to add the sum of the lower bits of the label’s address<br />

<strong>and</strong> the offset from the original instruction to the value in register $at.<br />

Assembler Syntax<br />

Comments in assembler files begin with a sharp sign (#). Everything from the<br />

sharp sign to the end of the line is ignored.<br />

Identifiers are a sequence of alphanumeric characters, underbars (_), <strong>and</strong> dots<br />

(.) that do not begin with a number. Instruction opcodes are reserved words that<br />

cannot be used as identifiers. Labels are declared by putting them at the beginning<br />

of a line followed by a colon, for example:<br />

.data<br />

item: .word 1<br />

.text<br />

.globl main # Must be global
main: lw $t0, item

Numbers are base 10 by default. If they are preceded by 0x, they are interpreted<br />

as hexadecimal. Hence, 256 <strong>and</strong> 0x100 denote the same value.<br />

Strings are enclosed in double quotes ("). Special characters in strings follow the

C convention:<br />

■ newline \n<br />

■ tab \t<br />

■ quote \"

SPIM supports a subset of the MIPS assembler directives:<br />

.align n: Align the next datum on a 2^n byte boundary. For example, .align 2 aligns the next value on a word boundary. .align 0 turns off automatic alignment of .half, .word, .float, and .double directives until the next .data or .kdata directive.

.ascii str: Store the string str in memory, but do not null-terminate it.

.asciiz str: Store the string str in memory and null-terminate it.

.byte b1,..., bn: Store the n values in successive bytes of memory.

.data <addr>: Subsequent items are stored in the data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.

.double d1,..., dn: Store the n floating-point double precision numbers in successive memory locations.

.extern sym size: Declare that the datum stored at sym is size bytes large and is a global label. This directive enables the assembler to store the datum in a portion of the data segment that is efficiently accessed via register $gp.

.float f1,..., fn: Store the n floating-point single precision numbers in successive memory locations.

.globl sym: Declare that label sym is global and can be referenced from other files.

.half h1,..., hn: Store the n 16-bit quantities in successive memory halfwords.

.kdata <addr>: Subsequent data items are stored in the kernel data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.

.ktext <addr>: Subsequent items are put in the kernel text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.

.set noat and .set at: The first directive prevents SPIM from complaining about subsequent instructions that use register $at. The second directive re-enables the warning. Since pseudoinstructions expand into code that uses register $at, programmers must be very careful about leaving values in this register.

.space n: Allocates n bytes of space in the current segment (which must be the data segment in SPIM).

.text <addr>: Subsequent items are put in the user text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.

.word w1,..., wn: Store the n 32-bit quantities in successive memory words.

SPIM does not distinguish various parts of the data segment (.data, .rdata, <strong>and</strong><br />

.sdata).<br />
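A brief, hedged sketch of several of these directives in use (the labels and values are invented for illustration):

.data
.align 2 # Align the next item on a word boundary
vec: .word 1, 2, 3, 4 # Four 32-bit values
msg: .asciiz "done\n" # Null-terminated string
buf: .space 64 # Reserve 64 bytes
.text
.globl main # main must be global
main: lw $t0, vec # Load the first element of vec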

Encoding MIPS Instructions<br />

Figure A.10.2 explains how a MIPS instruction is encoded in a binary number.<br />

Each column contains instruction encodings for a field (a contiguous group of<br />

bits) from an instruction. The numbers at the left margin are values for a field.<br />

For example, the j opcode has a value of 2 in the opcode field. The text at the top<br />

of a column names a field <strong>and</strong> specifies which bits it occupies in an instruction.<br />

For example, the op field is contained in bits 26–31 of an instruction. This field<br />

encodes most instructions. However, some groups of instructions use additional<br />

fields to distinguish related instructions. For example, the different floating-point<br />

instructions are specified by bits 0–5. The arrows from the first column show which<br />

opcodes use these additional fields.<br />

Instruction Format<br />

The rest of this appendix describes both the instructions implemented by actual<br />

MIPS hardware <strong>and</strong> the pseudoinstructions provided by the MIPS assembler. The<br />

two types of instructions are easily distinguished. Actual instructions depict the<br />

fields in their binary representation. For example, in<br />

Addition (with overflow)<br />

add rd, rs, rt<br />

0 rs rt rd 0 0x20<br />

6 5 5 5 5 6<br />

the add instruction consists of six fields. Each field's size in bits is the small number

below the field. This instruction begins with six bits of 0s. Register specifiers begin<br />

with an r, so the next field is a 5-bit register specifier called rs. This is the same<br />

register that is the second argument in the symbolic assembly at the left of this<br />

line. Another common field is imm16, which is a 16-bit immediate number.
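As a worked instance of this encoding (register numbers taken from the convention above: $s1 is register 17, $s2 is 18, and $t0 is 8), the instruction add $t0, $s1, $s2 fills the six fields as follows:

op = 0, rs = 17 ($s1), rt = 18 ($s2), rd = 8 ($t0), shamt = 0, funct = 0x20
binary: 000000 10001 10010 01000 00000 100000
hexadecimal: 0x02324020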



Pseudoinstructions follow roughly the same conventions, but omit instruction<br />

encoding information. For example:<br />

Multiply (without overflow)<br />

mul rdest, rsrc1, src2<br />

pseudoinstruction<br />

In pseudoinstructions, rdest <strong>and</strong> rsrc1 are registers <strong>and</strong> src2 is either a register<br />

or an immediate value. In general, the assembler <strong>and</strong> SPIM translate a more<br />

general form of an instruction (e.g., add $v1, $a0, 0x55) to a specialized form<br />

(e.g., addi $v1, $a0, 0x55).<br />

Arithmetic <strong>and</strong> Logical Instructions<br />

Absolute value<br />

abs rdest, rsrc<br />

pseudoinstruction<br />

Put the absolute value of register rsrc in register rdest.<br />

Addition (with overflow)<br />

add rd, rs, rt<br />

0 rs rt rd 0 0x20<br />

6 5 5 5 5 6<br />

Addition (without overflow)<br />

addu rd, rs, rt<br />

0 rs rt rd 0 0x21<br />

6 5 5 5 5 6<br />

Put the sum of registers rs <strong>and</strong> rt into register rd.<br />

Addition immediate (with overflow)<br />

addi rt, rs, imm<br />

8 rs rt imm<br />

6 5 5 16<br />

Addition immediate (without overflow)<br />

addiu rt, rs, imm<br />

9 rs rt imm<br />

6 5 5 16<br />

Put the sum of register rs <strong>and</strong> the sign-extended immediate into register rt.



AND<br />

<strong>and</strong> rd, rs, rt<br />

0 rs rt rd 0 0x24<br />

6 5 5 5 5 6<br />

Put the logical AND of registers rs <strong>and</strong> rt into register rd.<br />

AND immediate<br />

<strong>and</strong>i rt, rs, imm<br />

0xc rs rt imm<br />

6 5 5 16<br />

Put the logical AND of register rs <strong>and</strong> the zero-extended immediate into register<br />

rt.<br />

Count leading ones<br />

clo rd, rs<br />

0x1c rs 0 rd 0 0x21<br />

6 5 5 5 5 6<br />

Count leading zeros<br />

clz rd, rs<br />

0x1c rs 0 rd 0 0x20<br />

6 5 5 5 5 6<br />

Count the number of leading ones (zeros) in the word in register rs <strong>and</strong> put<br />

the result into register rd. If a word is all ones (zeros), the result is 32.<br />

Divide (with overflow)<br />

div rs, rt<br />

0 rs rt 0 0x1a<br />

6 5 5 10 6<br />

Divide (without overflow)<br />

divu rs, rt<br />

0 rs rt 0 0x1b<br />

6 5 5 10 6<br />

Divide register rs by register rt. Leave the quotient in register lo <strong>and</strong> the remainder<br />

in register hi. Note that if an oper<strong>and</strong> is negative, the remainder is unspecified<br />

by the MIPS architecture <strong>and</strong> depends on the convention of the machine on which<br />

SPIM is run.
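For instance, a short sketch of retrieving both results after a divide (the register choices are arbitrary):

div $t0, $t1 # lo = $t0 / $t1, hi = $t0 mod $t1
mflo $t2 # Quotient into $t2
mfhi $t3 # Remainder into $t3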



Divide (with overflow)<br />

div rdest, rsrc1, src2<br />

pseudoinstruction<br />

Divide (without overflow)<br />

divu rdest, rsrc1, src2<br />

pseudoinstruction<br />

Put the quotient of register rsrc1 <strong>and</strong> src2 into register rdest.<br />

Multiply<br />

mult rs, rt<br />

0 rs rt 0 0x18<br />

6 5 5 10 6<br />

Unsigned multiply<br />

multu rs, rt<br />

0 rs rt 0 0x19<br />

6 5 5 10 6<br />

Multiply registers rs <strong>and</strong> rt. Leave the low-order word of the product in register<br />

lo <strong>and</strong> the high-order word in register hi.<br />

Multiply (without overflow)<br />

mul rd, rs, rt<br />

0x1c rs rt rd 0 2<br />

6 5 5 5 5 6<br />

Put the low-order 32 bits of the product of rs <strong>and</strong> rt into register rd.<br />

Multiply (with overflow)<br />

mulo rdest, rsrc1, src2<br />

pseudoinstruction<br />

Unsigned multiply (with overflow)<br />

mulou rdest, rsrc1, src2<br />

pseudoinstruction<br />

Put the low-order 32 bits of the product of register rsrc1 <strong>and</strong> src2 into register<br />

rdest.



Multiply add<br />

madd rs, rt<br />

0x1c rs rt 0 0<br />

6 5 5 10 6<br />

Unsigned multiply add<br />

maddu rs, rt<br />

0x1c rs rt 0 1<br />

6 5 5 10 6<br />

Multiply registers rs <strong>and</strong> rt <strong>and</strong> add the resulting 64-bit product to the 64-bit<br />

value in the concatenated registers lo <strong>and</strong> hi.<br />

Multiply subtract<br />

msub rs, rt<br />

0x1c rs rt 0 4<br />

6 5 5 10 6<br />

Unsigned multiply subtract<br />

msubu rs, rt

0x1c rs rt 0 5<br />

6 5 5 10 6<br />

Multiply registers rs <strong>and</strong> rt <strong>and</strong> subtract the resulting 64-bit product from the 64-<br />

bit value in the concatenated registers lo <strong>and</strong> hi.<br />

Negate value (with overflow)<br />

neg rdest, rsrc<br />

pseudoinstruction<br />

Negate value (without overflow)<br />

negu rdest, rsrc<br />

pseudoinstruction<br />

Put the negative of register rsrc into register rdest.<br />

NOR<br />

nor rd, rs, rt<br />

0 rs rt rd 0 0x27<br />

6 5 5 5 5 6<br />

Put the logical NOR of registers rs <strong>and</strong> rt into register rd.



NOT<br />

not rdest, rsrc<br />

pseudoinstruction<br />

Put the bitwise logical negation of register rsrc into register rdest.<br />

OR<br />

or rd, rs, rt<br />

0 rs rt rd 0 0x25<br />

6 5 5 5 5 6<br />

Put the logical OR of registers rs <strong>and</strong> rt into register rd.<br />

OR immediate<br />

ori rt, rs, imm<br />

0xd rs rt imm<br />

6 5 5 16<br />

Put the logical OR of register rs <strong>and</strong> the zero-extended immediate into register rt.<br />

Remainder<br />

rem rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Unsigned remainder<br />

remu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Put the remainder of register rsrc1 divided by register rsrc2 into register rdest.<br />

Note that if an oper<strong>and</strong> is negative, the remainder is unspecified by the MIPS<br />

architecture <strong>and</strong> depends on the convention of the machine on which SPIM is run.<br />

Shift left logical<br />

sll rd, rt, shamt<br />

0 rs rt rd shamt 0<br />

6 5 5 5 5 6<br />

Shift left logical variable<br />

sllv rd, rt, rs<br />

0 rs rt rd 0 4<br />

6 5 5 5 5 6



Shift right arithmetic<br />

sra rd, rt, shamt<br />

0 rs rt rd shamt 3<br />

6 5 5 5 5 6<br />

Shift right arithmetic variable<br />

srav rd, rt, rs<br />

0 rs rt rd 0 7<br />

6 5 5 5 5 6<br />

Shift right logical<br />

srl rd, rt, shamt<br />

0 rs rt rd shamt 2<br />

6 5 5 5 5 6<br />

Shift right logical variable<br />

srlv rd, rt, rs<br />

0 rs rt rd 0 6<br />

6 5 5 5 5 6<br />

Shift register rt left (right) by the distance indicated by immediate shamt or the<br />

register rs <strong>and</strong> put the result in register rd. Note that argument rs is ignored for<br />

sll, sra, <strong>and</strong> srl.<br />

Rotate left<br />

rol rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Rotate right<br />

ror rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Rotate register rsrc1 left (right) by the distance indicated by rsrc2 <strong>and</strong> put the<br />

result in register rdest.<br />

Subtract (with overflow)<br />

sub rd, rs, rt<br />

0 rs rt rd 0 0x22<br />

6 5 5 5 5 6



Subtract (without overflow)<br />

subu rd, rs, rt<br />

0 rs rt rd 0 0x23<br />

6 5 5 5 5 6<br />

Put the difference of registers rs <strong>and</strong> rt into register rd.<br />

Exclusive OR<br />

xor rd, rs, rt<br />

0 rs rt rd 0 0x26<br />

6 5 5 5 5 6<br />

Put the logical XOR of registers rs <strong>and</strong> rt into register rd.<br />

XOR immediate<br />

xori rt, rs, imm<br />

0xe rs rt Imm<br />

6 5 5 16<br />

Put the logical XOR of register rs <strong>and</strong> the zero-extended immediate into register<br />

rt.<br />

Constant-Manipulating Instructions<br />

Load upper immediate<br />

lui rt, imm<br />

0xf 0 rt imm

6 5 5 16<br />

Load the lower halfword of the immediate imm into the upper halfword of register<br />

rt. The lower bits of the register are set to 0.<br />

Load immediate<br />

li rdest, imm<br />

pseudoinstruction<br />

Move the immediate imm into register rdest.<br />

Comparison Instructions<br />

Set less than<br />

slt rd, rs, rt<br />

0 rs rt rd 0 0x2a<br />

6 5 5 5 5 6



Set less than unsigned<br />

sltu rd, rs, rt<br />

0 rs rt rd 0 0x2b<br />

6 5 5 5 5 6<br />

Set register rd to 1 if register rs is less than rt, <strong>and</strong> to 0 otherwise.<br />

Set less than immediate<br />

slti rt, rs, imm<br />

0xa rs rt imm<br />

6 5 5 16<br />

Set less than unsigned immediate<br />

sltiu rt, rs, imm<br />

0xb rs rt imm<br />

6 5 5 16<br />

Set register rt to 1 if register rs is less than the sign-extended immediate, <strong>and</strong> to<br />

0 otherwise.<br />

Set equal<br />

seq rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 equals rsrc2, <strong>and</strong> to 0 otherwise.<br />

Set greater than equal<br />

sge rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set greater than equal unsigned<br />

sgeu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is greater than or equal to rsrc2, <strong>and</strong> to<br />

0 otherwise.<br />

Set greater than<br />

sgt rdest, rsrc1, rsrc2<br />

pseudoinstruction



Set greater than unsigned<br />

sgtu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is greater than rsrc2, <strong>and</strong> to 0 otherwise.<br />

Set less than equal<br />

sle rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set less than equal unsigned<br />

sleu rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is less than or equal to rsrc2, <strong>and</strong> to 0<br />

otherwise.<br />

Set not equal<br />

sne rdest, rsrc1, rsrc2<br />

pseudoinstruction<br />

Set register rdest to 1 if register rsrc1 is not equal to rsrc2, <strong>and</strong> to 0 otherwise.<br />

Branch Instructions<br />

Branch instructions use a signed 16-bit instruction offset field; hence, they can<br />

jump 2^15 − 1 instructions (not bytes) forward or 2^15 instructions backward. The

jump instruction contains a 26-bit address field. In actual MIPS processors, branch<br />

instructions are delayed branches, which do not transfer control until the instruction<br />

following the branch (its “delay slot”) has executed (see Chapter 4). Delayed branches<br />

affect the offset calculation, since it must be computed relative to the address of the<br />

delay slot instruction (PC + 4), which is when the branch occurs. SPIM does not<br />

simulate this delay slot, unless the -bare or -delayed_branch flags are specified.<br />

In assembly code, offsets are not usually specified as numbers. Instead, an<br />

instruction branches to a label, and the assembler computes the distance between

the branch <strong>and</strong> the target instructions.<br />
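As a small illustration (hypothetical label and register choices), a counted loop can be written entirely with a label, leaving the offset computation to the assembler:

      li    $t0, 0             # loop counter
      li    $t1, 10            # loop limit
loop: addi  $t0, $t0, 1        # increment the counter
      bne   $t0, $t1, loop     # assembler computes the instruction offset back to loop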

In MIPS-32, all actual (not pseudo) conditional branch instructions have a<br />

“likely” variant (for example, beq’s likely variant is beql), which does not execute<br />

the instruction in the branch’s delay slot if the branch is not taken. Do not use


A-60 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

these instructions; they may be removed in subsequent versions of the architecture.

SPIM implements these instructions, but they are not described further.<br />

Branch instruction<br />

b label<br />

pseudoinstruction<br />

Unconditionally branch to the instruction at the label.<br />

Branch coprocessor false<br />

bc1f cc label

0x11 8 cc 0 Offset<br />

6 5 3 2 16<br />

Branch coprocessor true<br />

bc1t cc label

0x11 8 cc 1 Offset<br />

6 5 3 2 16<br />

Conditionally branch the number of instructions specified by the offset if the<br />

floating-point coprocessor’s condition flag numbered cc is false (true). If cc is<br />

omitted from the instruction, condition code flag 0 is assumed.<br />

Branch on equal<br />

beq rs, rt, label<br />

4 rs rt Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs equals rt.<br />

Branch on greater than equal zero<br />

bgez rs, label<br />

1 rs 1 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is greater than or equal to 0.


A.10 MIPS R2000 Assembly Language A-61<br />

Branch on greater than equal zero <strong>and</strong> link<br />

bgezal rs, label<br />

1 rs 0x11 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is greater than or equal to 0. Save the address of the next instruction in register<br />

31.<br />

Branch on greater than zero<br />

bgtz rs, label<br />

7 rs 0 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is greater than 0.<br />

Branch on less than equal zero<br />

blez rs, label<br />

6 rs 0 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is less than or equal to 0.<br />

Branch on less than <strong>and</strong> link<br />

bltzal rs, label<br />

1 rs 0x10 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is less than 0. Save the address of the next instruction in register 31.<br />

Branch on less than zero<br />

bltz rs, label<br />

1 rs 0 Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is less than 0.


A-62 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Branch on not equal<br />

bne rs, rt, label<br />

5 rs rt Offset<br />

6 5 5 16<br />

Conditionally branch the number of instructions specified by the offset if register<br />

rs is not equal to rt.<br />

Branch on equal zero<br />

beqz rsrc, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if rsrc equals 0.<br />

Branch on greater than equal<br />

bge rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Branch on greater than equal unsigned<br />

bgeu rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is greater than<br />

or equal to rsrc2.<br />

Branch on greater than<br />

bgt rsrc1, rsrc2, label

pseudoinstruction<br />

Branch on greater than unsigned<br />

bgtu rsrc1, rsrc2, label

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is greater than<br />

rsrc2.

Branch on less than equal<br />

ble rsrc1, rsrc2, label

pseudoinstruction


A.10 MIPS R2000 Assembly Language A-63<br />

Branch on less than equal unsigned<br />

bleu rsrc1, rsrc2, label

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is less than or<br />

equal to rsrc2.

Branch on less than<br />

blt rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Branch on less than unsigned<br />

bltu rsrc1, rsrc2, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc1 is less than<br />

rsrc2.<br />

Branch on not equal zero<br />

bnez rsrc, label<br />

pseudoinstruction<br />

Conditionally branch to the instruction at the label if register rsrc is not equal to 0.<br />

Jump Instructions<br />

Jump<br />

j target<br />

2 target<br />

6 26<br />

Unconditionally jump to the instruction at target.<br />

Jump <strong>and</strong> link<br />

jal target<br />

3 target<br />

6 26<br />

Unconditionally jump to the instruction at target. Save the address of the next<br />

instruction in register $ra.


A-64 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Jump <strong>and</strong> link register<br />

jalr rs, rd<br />

0 rs 0 rd 0 9<br />

6 5 5 5 5 6<br />

Unconditionally jump to the instruction whose address is in register rs. Save the<br />

address of the next instruction in register rd (which defaults to 31).<br />

Jump register<br />

jr rs<br />

0 rs 0 8<br />

6 5 15 6<br />

Unconditionally jump to the instruction whose address is in register rs.<br />
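Together, jal and jr form the usual procedure call and return idiom. A minimal sketch (hypothetical labels; a leaf procedure that does not need to save $ra):

      li    $a0, 21            # argument in $a0
      jal   double_it          # call: return address is saved in $ra
      move  $t0, $v0           # after the return, $t0 = 42
      j     done               # skip over the procedure body

double_it:
      add   $v0, $a0, $a0      # hypothetical leaf procedure: return 2 * $a0
      jr    $ra                # jump back to the instruction after the jal

done: nop                      # execution continues here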

Trap Instructions<br />

Trap if equal<br />

teq rs, rt<br />

0 rs rt 0 0x34<br />

6 5 5 10 6<br />

If register rs is equal to register rt, raise a Trap exception.<br />

Trap if equal immediate<br />

teqi rs, imm<br />

1 rs 0xc imm<br />

6 5 5 16<br />

If register rs is equal to the sign-extended value imm, raise a Trap exception.<br />

Trap if not equal<br />

tne rs, rt

0 rs rt 0 0x36<br />

6 5 5 10 6<br />

If register rs is not equal to register rt, raise a Trap exception.<br />

Trap if not equal immediate<br />

tnei rs, imm

1 rs 0xe imm<br />

6 5 5 16<br />

If register rs is not equal to the sign-extended value imm, raise a Trap exception.


A.10 MIPS R2000 Assembly Language A-65<br />

Trap if greater equal<br />

tge rs, rt<br />

0 rs rt 0 0x30<br />

6 5 5 10 6<br />

Unsigned trap if greater equal<br />

tgeu rs, rt<br />

0 rs rt 0 0x31<br />

6 5 5 10 6<br />

If register rs is greater than or equal to register rt, raise a Trap exception.<br />

Trap if greater equal immediate<br />

tgei rs, imm<br />

1 rs 8 imm<br />

6 5 5 16<br />

Unsigned trap if greater equal immediate<br />

tgeiu rs, imm<br />

1 rs 9 imm<br />

6 5 5 16<br />

If register rs is greater than or equal to the sign-extended value imm, raise a Trap<br />

exception.<br />

Trap if less than<br />

tlt rs, rt<br />

0 rs rt 0 0x32<br />

6 5 5 10 6<br />

Unsigned trap if less than<br />

tltu rs, rt<br />

0 rs rt 0 0x33<br />

6 5 5 10 6<br />

If register rs is less than register rt, raise a Trap exception.<br />

Trap if less than immediate<br />

tlti rs, imm<br />

1 rs 0xa imm

6 5 5 16


A-66 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Unsigned trap if less than immediate<br />

tltiu rs, imm<br />

1 rs 0xb imm

6 5 5 16<br />

If register rs is less than the sign-extended value imm, raise a Trap exception.<br />

Load Instructions<br />

Load address<br />

la rdest, address<br />

pseudoinstruction<br />

Load computed address—not the contents of the location—into register rdest.<br />

Load byte<br />

lb rt, address<br />

0x20 rs rt Offset<br />

6 5 5 16<br />

Load unsigned byte<br />

lbu rt, address<br />

0x24 rs rt Offset<br />

6 5 5 16<br />

Load the byte at address into register rt. The byte is sign-extended by lb, but not<br />

by lbu.<br />

Load halfword<br />

lh rt, address<br />

0x21 rs rt Offset<br />

6 5 5 16<br />

Load unsigned halfword<br />

lhu rt, address<br />

0x25 rs rt Offset<br />

6 5 5 16<br />

Load the 16-bit quantity (halfword) at address into register rt. The halfword is<br />

sign-extended by lh, but not by lhu.


A.10 MIPS R2000 Assembly Language A-67<br />

Load word<br />

lw rt, address<br />

0x23 rs rt Offset<br />

6 5 5 16<br />

Load the 32-bit quantity (word) at address into register rt.<br />

Load word coprocessor 1<br />

lwc1 ft, address

0x31 rs rt Offset<br />

6 5 5 16<br />

Load the word at address into register ft in the floating-point unit.<br />

Load word left<br />

lwl rt, address<br />

0x22 rs rt Offset<br />

6 5 5 16<br />

Load word right<br />

lwr rt, address<br />

0x26 rs rt Offset<br />

6 5 5 16<br />

Load the left (right) bytes from the word at the possibly unaligned address into<br />

register rt.<br />

Load doubleword<br />

ld rdest, address<br />

pseudoinstruction<br />

Load the 64-bit quantity at address into registers rdest <strong>and</strong> rdest + 1.<br />

Unaligned load halfword<br />

ulh rdest, address<br />

pseudoinstruction


A-68 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Unaligned load halfword unsigned<br />

ulhu rdest, address<br />

pseudoinstruction<br />

Load the 16-bit quantity (halfword) at the possibly unaligned address into register<br />

rdest. The halfword is sign-extended by ulh, but not ulhu.<br />

Unaligned load word<br />

ulw rdest, address<br />

pseudoinstruction<br />

Load the 32-bit quantity (word) at the possibly unaligned address into register<br />

rdest.<br />

Load linked<br />

ll rt, address<br />

0x30 rs rt Offset<br />

6 5 5 16<br />

Load the 32-bit quantity (word) at address into register rt <strong>and</strong> start an atomic<br />

read-modify-write operation. This operation is completed by a store conditional<br />

(sc) instruction, which will fail if another processor writes into the block containing<br />

the loaded word. Since SPIM does not simulate multiple processors, the store<br />

conditional operation always succeeds.<br />

Store Instructions<br />

Store byte<br />

sb rt, address<br />

0x28 rs rt Offset<br />

6 5 5 16<br />

Store the low byte from register rt at address.<br />

Store halfword<br />

sh rt, address<br />

0x29 rs rt Offset<br />

6 5 5 16<br />

Store the low halfword from register rt at address.


A.10 MIPS R2000 Assembly Language A-69<br />

Store word<br />

sw rt, address<br />

0x2b rs rt Offset<br />

6 5 5 16<br />

Store the word from register rt at address.<br />

Store word coprocessor 1<br />

swc1 ft, address

0x39 rs ft Offset

6 5 5 16<br />

Store the floating-point value in register ft of floating-point coprocessor at address.<br />

Store double coprocessor 1<br />

sdc1 ft, address

0x3d rs ft Offset<br />

6 5 5 16<br />

Store the doubleword floating-point value in registers ft and ft + 1 of the floating-point
coprocessor at address. Register ft must be even numbered.

Store word left<br />

swl rt, address<br />

0x2a rs rt Offset<br />

6 5 5 16<br />

Store word right<br />

swr rt, address<br />

0x2e rs rt Offset<br />

6 5 5 16<br />

Store the left (right) bytes from register rt at the possibly unaligned address.<br />

Store doubleword<br />

sd rsrc, address<br />

pseudoinstruction<br />

Store the 64-bit quantity in registers rsrc <strong>and</strong> rsrc + 1 at address.


A-70 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Unaligned store halfword<br />

ush rsrc, address<br />

pseudoinstruction<br />

Store the low halfword from register rsrc at the possibly unaligned address.<br />

Unaligned store word<br />

usw rsrc, address<br />

pseudoinstruction<br />

Store the word from register rsrc at the possibly unaligned address.<br />

Store conditional<br />

sc rt, address<br />

0x38 rs rt Offset<br />

6 5 5 16<br />

Store the 32-bit quantity (word) in register rt into memory at address and complete
an atomic read-modify-write operation. If this atomic operation is successful, the
memory word is modified and register rt is set to 1. If the atomic operation fails
because another processor wrote to a location in the block containing the addressed
word, this instruction does not modify memory and writes 0 into register rt. Since
SPIM does not simulate multiple processors, the instruction always succeeds.
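A typical use of ll and sc together is an atomic read-modify-write, such as incrementing a shared counter. A sketch (hypothetical label; the counter's address is assumed to be in $a0):

try:  ll    $t0, 0($a0)        # load linked: read the counter and begin the atomic sequence
      addi  $t0, $t0, 1        # compute the new value
      sc    $t0, 0($a0)        # store conditional: $t0 becomes 1 on success, 0 on failure
      beqz  $t0, try           # if the link was broken, retry from the load linked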

Data Movement Instructions<br />

Move<br />

move rdest, rsrc<br />

pseudoinstruction<br />

Move register rsrc to rdest.<br />

Move from hi<br />

mfhi rd<br />

0 0 rd 0 0x10<br />

6 10 5 5 6


A.10 MIPS R2000 Assembly Language A-71<br />

Move from lo<br />

mflo rd<br />

0 0 rd 0 0x12<br />

6 10 5 5 6<br />

The multiply <strong>and</strong> divide unit produces its result in two additional registers, hi<br />

<strong>and</strong> lo. These instructions move values to <strong>and</strong> from these registers. The multiply,<br />

divide, <strong>and</strong> remainder pseudoinstructions that make this unit appear to operate on<br />

the general registers move the result after the computation finishes.<br />

Move the hi (lo) register to register rd.<br />
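For example, a 32 × 32-bit multiply leaves its 64-bit product in hi and lo, from which these instructions retrieve it (a brief sketch using the mult instruction described earlier in this appendix):

mult  $t0, $t1           # 64-bit product of $t0 and $t1 goes to hi:lo
mflo  $t2                # low 32 bits of the product
mfhi  $t3                # high 32 bits of the product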

Move to hi<br />

mthi rs<br />

0 rs 0 0x11<br />

6 5 15 6<br />

Move to lo<br />

mtlo rs<br />

0 rs 0 0x13<br />

6 5 15 6<br />

Move register rs to the hi (lo) register.<br />

Move from coprocessor 0<br />

mfc0 rt, rd<br />

0x10 0 rt rd 0<br />

6 5 5 5 11<br />

Move from coprocessor 1<br />

mfc1 rt, fs

0x11 0 rt fs 0<br />

6 5 5 5 11<br />

Coprocessors have their own register sets. These instructions move values between<br />

these registers <strong>and</strong> the CPU’s registers.<br />

Move register rd in a coprocessor (register fs in the FPU) to CPU register rt. The<br />

floating-point unit is coprocessor 1.


A-72 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Move double from coprocessor 1<br />

mfc1.d rdest, frsrc1<br />

pseudoinstruction<br />

Move floating-point registers frsrc1 <strong>and</strong> frsrc1 + 1 to CPU registers rdest<br />

<strong>and</strong> rdest + 1.<br />

Move to coprocessor 0<br />

mtc0 rd, rt<br />

0x10 4 rt rd 0<br />

6 5 5 5 11<br />

Move to coprocessor 1<br />

mtc1 rd, fs<br />

0x11 4 rt fs 0<br />

6 5 5 5 11<br />

Move CPU register rt to register rd in a coprocessor (register fs in the FPU).<br />

Move conditional not zero<br />

movn rd, rs, rt<br />

0 rs rt rd 0xb<br />

6 5 5 5 11<br />

Move register rs to register rd if register rt is not 0.<br />

Move conditional zero<br />

movz rd, rs, rt<br />

0 rs rt rd 0xa<br />

6 5 5 5 11<br />

Move register rs to register rd if register rt is 0.<br />

Move conditional on FP false<br />

movf rd, rs, cc<br />

0 rs cc 0 rd 0 1<br />

6 5 3 2 5 5 6<br />

Move CPU register rs to register rd if FPU condition code flag number cc is 0. If<br />

cc is omitted from the instruction, condition code flag 0 is assumed.


A.10 MIPS R2000 Assembly Language A-73<br />

Move conditional on FP true<br />

movt rd, rs, cc<br />

0 rs cc 1 rd 0 1<br />

6 5 3 2 5 5 6<br />

Move CPU register rs to register rd if FPU condition code flag number cc is 1. If<br />

cc is omitted from the instruction, condition code bit 0 is assumed.<br />

Floating-Point Instructions<br />

The MIPS has a floating-point coprocessor (numbered 1) that operates on single<br />

precision (32-bit) <strong>and</strong> double precision (64-bit) floating-point numbers. This<br />

coprocessor has its own registers, which are numbered $f0–$f31. Because these<br />

registers are only 32 bits wide, two of them are required to hold doubles, so only<br />

floating-point registers with even numbers can hold double precision values. The<br />

floating-point coprocessor also has eight condition code (cc) flags, numbered 0–7,<br />

which are set by compare instructions and tested by branch (bc1f or bc1t) and

conditional move instructions.<br />

Values are moved in or out of these registers one word (32 bits) at a time by<br />

lwc1, swc1, mtc1, and mfc1 instructions or one double (64 bits) at a time by ldc1

and sdc1, described above, or by the l.s, l.d, s.s, and s.d pseudoinstructions

described below.<br />

In the actual instructions below, bit 21 is 0 for single precision and 1

for double precision. In the pseudoinstructions below, fdest is a floating-point<br />

register (e.g., $f2).<br />
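As a brief illustration (hypothetical data labels x, y, and z, assumed to be declared as doubles in the data segment), two doubles can be loaded, added, and stored using the l.d and s.d pseudoinstructions and the add.d instruction described in this section:

l.d   $f2, x             # load the double at label x into $f2/$f3
l.d   $f4, y             # load the double at label y into $f4/$f5
add.d $f6, $f2, $f4      # $f6/$f7 = x + y
s.d   $f6, z             # store the sum at label z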

Floating-point absolute value double<br />

abs.d fd, fs<br />

0x11 0x11 0 fs fd 5

6 5 5 5 5 6<br />

Floating-point absolute value single<br />

abs.s fd, fs<br />

0x11 0x10 0 fs fd 5

Compute the absolute value of the floating-point double (single) in register fs <strong>and</strong><br />

put it in register fd.<br />

Floating-point addition double<br />

add.d fd, fs, ft<br />

0x11 0x11 ft fs fd 0<br />

6 5 5 5 5 6


A-74 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Floating-point addition single<br />

add.s fd, fs, ft<br />

0x11 0x10 ft fs fd 0<br />

6 5 5 5 5 6<br />

Compute the sum of the floating-point doubles (singles) in registers fs <strong>and</strong> ft <strong>and</strong><br />

put it in register fd.<br />

Floating-point ceiling to word<br />

ceil.w.d fd, fs<br />

ceil.w.s fd, fs<br />

0x11 0x11 0 fs fd 0xe<br />

6 5 5 5 5 6<br />

0x11 0x10 0 fs fd 0xe<br />

Compute the ceiling of the floating-point double (single) in register fs, convert to<br />

a 32-bit fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />

Compare equal double<br />

c.eq.d cc fs, ft<br />

0x11 0x11 ft fs cc 0 FC 2<br />

6 5 5 5 3 2 2 4<br />

Compare equal single<br />

c.eq.s cc fs, ft<br />

0x11 0x10 ft fs cc 0 FC 2<br />

6 5 5 5 3 2 2 4<br />

Compare the floating-point double (single) in register fs against the one in ft<br />

<strong>and</strong> set the floating-point condition flag cc to 1 if they are equal. If cc is omitted,<br />

condition code flag 0 is assumed.<br />

Compare less than equal double<br />

c.le.d cc fs, ft<br />

0x11 0x11 ft fs cc 0 FC 0xe<br />

6 5 5 5 3 2 2 4<br />

Compare less than equal single<br />

c.le.s cc fs, ft<br />

0x11 0x10 ft fs cc 0 FC 0xe<br />

6 5 5 5 3 2 2 4


A.10 MIPS R2000 Assembly Language A-75<br />

Compare the floating-point double (single) in register fs against the one in ft <strong>and</strong><br />

set the floating-point condition flag cc to 1 if the first is less than or equal to the<br />

second. If cc is omitted, condition code flag 0 is assumed.<br />

Compare less than double<br />

c.lt.d cc fs, ft<br />

0x11 0x11 ft fs cc 0 FC 0xc<br />

6 5 5 5 3 2 2 4<br />

Compare less than single<br />

c.lt.s cc fs, ft<br />

0x11 0x10 ft fs cc 0 FC 0xc<br />

6 5 5 5 3 2 2 4<br />

Compare the floating-point double (single) in register fs against the one in ft<br />

<strong>and</strong> set the condition flag cc to 1 if the first is less than the second. If cc is omitted,<br />

condition code flag 0 is assumed.<br />

Convert single to double<br />

cvt.d.s fd, fs<br />

0x11 0x10 0 fs fd 0x21<br />

6 5 5 5 5 6<br />

Convert integer to double<br />

cvt.d.w fd, fs<br />

0x11 0x14 0 fs fd 0x21<br />

6 5 5 5 5 6<br />

Convert the single precision floating-point number or integer in register fs to a<br />

double (single) precision number <strong>and</strong> put it in register fd.<br />

Convert double to single<br />

cvt.s.d fd, fs<br />

0x11 0x11 0 fs fd 0x20<br />

6 5 5 5 5 6<br />

Convert integer to single<br />

cvt.s.w fd, fs<br />

0x11 0x14 0 fs fd 0x20<br />

6 5 5 5 5 6<br />

Convert the double precision floating-point number or integer in register fs to a<br />

single precision number <strong>and</strong> put it in register fd.


A-76 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Convert double to integer<br />

cvt.w.d fd, fs<br />

0x11 0x11 0 fs fd 0x24<br />

6 5 5 5 5 6<br />

Convert single to integer<br />

cvt.w.s fd, fs<br />

0x11 0x10 0 fs fd 0x24<br />

6 5 5 5 5 6<br />

Convert the double or single precision floating-point number in register fs to an<br />

integer <strong>and</strong> put it in register fd.<br />

Floating-point divide double<br />

div.d fd, fs, ft<br />

0x11 0x11 ft fs fd 3<br />

6 5 5 5 5 6<br />

Floating-point divide single<br />

div.s fd, fs, ft<br />

0x11 0x10 ft fs fd 3<br />

6 5 5 5 5 6<br />

Compute the quotient of the floating-point doubles (singles) in registers fs <strong>and</strong> ft<br />

<strong>and</strong> put it in register fd.<br />

Floating-point floor to word<br />

floor.w.d fd, fs<br />

floor.w.s fd, fs<br />

0x11 0x11 0 fs fd 0xf<br />

6 5 5 5 5 6<br />

0x11 0x10 0 fs fd 0xf<br />

Compute the floor of the floating-point double (single) in register fs <strong>and</strong> put the<br />

resulting word in register fd.<br />

Load floating-point double<br />

l.d fdest, address<br />

pseudoinstruction


A.10 MIPS R2000 Assembly Language A-77<br />

Load floating-point single<br />

l.s fdest, address<br />

pseudoinstruction<br />

Load the floating-point double (single) at address into register fdest.<br />

Move floating-point double<br />

mov.d fd, fs<br />

0x11 0x11 0 fs fd 6<br />

6 5 5 5 5 6<br />

Move floating-point single<br />

mov.s fd, fs<br />

0x11 0x10 0 fs fd 6<br />

6 5 5 5 5 6<br />

Move the floating-point double (single) from register fs to register fd.<br />

Move conditional floating-point double false<br />

movf.d fd, fs, cc<br />

0x11 0x11 cc 0 fs fd 0x11<br />

6 5 3 2 5 5 6<br />

Move conditional floating-point single false<br />

movf.s fd, fs, cc<br />

0x11 0x10 cc 0 fs fd 0x11<br />

6 5 3 2 5 5 6<br />

Move the floating-point double (single) from register fs to register fd if condition

code flag cc is 0. If cc is omitted, condition code flag 0 is assumed.<br />

Move conditional floating-point double true<br />

movt.d fd, fs, cc<br />

0x11 0x11 cc 1 fs fd 0x11<br />

6 5 3 2 5 5 6<br />

Move conditional floating-point single true<br />

movt.s fd, fs, cc<br />

0x11 0x10 cc 1 fs fd 0x11<br />

6 5 3 2 5 5 6


A-78 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Move the floating-point double (single) from register fs to register fd if condition

code flag cc is 1. If cc is omitted, condition code flag 0 is assumed.<br />

Move conditional floating-point double not zero<br />

movn.d fd, fs, rt<br />

0x11 0x11 rt fs fd 0x13<br />

6 5 5 5 5 6<br />

Move conditional floating-point single not zero<br />

movn.s fd, fs, rt<br />

0x11 0x10 rt fs fd 0x13<br />

6 5 5 5 5 6<br />

Move the floating-point double (single) from register fs to register fd if processor

register rt is not 0.<br />

Move conditional floating-point double zero<br />

movz.d fd, fs, rt<br />

0x11 0x11 rt fs fd 0x12<br />

6 5 5 5 5 6<br />

Move conditional floating-point single zero<br />

movz.s fd, fs, rt<br />

0x11 0x10 rt fs fd 0x12<br />

6 5 5 5 5 6<br />

Move the floating-point double (single) from register fs to register fd if processor

register rt is 0.<br />

Floating-point multiply double<br />

mul.d fd, fs, ft<br />

0x11 0x11 ft fs fd 2<br />

6 5 5 5 5 6<br />

Floating-point multiply single<br />

mul.s fd, fs, ft<br />

0x11 0x10 ft fs fd 2<br />

6 5 5 5 5 6<br />

Compute the product of the floating-point doubles (singles) in registers fs <strong>and</strong> ft<br />

<strong>and</strong> put it in register fd.<br />

Negate double<br />

neg.d fd, fs<br />

0x11 0x11 0 fs fd 7<br />

6 5 5 5 5 6


A.10 MIPS R2000 Assembly Language A-79<br />

Negate single<br />

neg.s fd, fs<br />

0x11 0x10 0 fs fd 7<br />

6 5 5 5 5 6<br />

Negate the floating-point double (single) in register fs <strong>and</strong> put it in register fd.<br />

Floating-point round to word<br />

round.w.d fd, fs<br />

0x11 0x11 0 fs fd 0xc<br />

6 5 5 5 5 6<br />

round.w.s fd, fs

0x11 0x10 0 fs fd 0xc

Round the floating-point double (single) value in register fs, convert to a 32-bit<br />

fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />

Square root double<br />

sqrt.d fd, fs<br />

0x11 0x11 0 fs fd 4<br />

6 5 5 5 5 6<br />

Square root single<br />

sqrt.s fd, fs<br />

0x11 0x10 0 fs fd 4<br />

6 5 5 5 5 6<br />

Compute the square root of the floating-point double (single) in register fs <strong>and</strong><br />

put it in register fd.<br />

Store floating-point double<br />

s.d fdest, address<br />

pseudoinstruction<br />

Store floating-point single<br />

s.s fdest, address<br />

pseudoinstruction<br />

Store the floating-point double (single) in register fdest at address.<br />

Floating-point subtract double<br />

sub.d fd, fs, ft<br />

0x11 0x11 ft fs fd 1<br />

6 5 5 5 5 6


A-80 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />

Floating-point subtract single<br />

sub.s fd, fs, ft<br />

0x11 0x10 ft fs fd 1<br />

6 5 5 5 5 6<br />

Compute the difference of the floating-point doubles (singles) in registers fs <strong>and</strong><br />

ft <strong>and</strong> put it in register fd.<br />

Floating-point truncate to word<br />

trunc.w.d fd, fs<br />

0x11 0x11 0 fs fd 0xd<br />

6 5 5 5 5 6<br />

trunc.w.s fd, fs

0x11 0x10 0 fs fd 0xd

Truncate the floating-point double (single) value in register fs, convert to a 32-bit<br />

fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />

Exception <strong>and</strong> Interrupt Instructions<br />

Exception return<br />

eret<br />

0x10 1 0 0x18<br />

6 1 19 6<br />

Set the EXL bit in coprocessor 0’s Status register to 0 <strong>and</strong> return to the instruction<br />

pointed to by coprocessor 0’s EPC register.<br />

System call<br />

syscall<br />

0 0 0xc<br />

6 20 6<br />

Register $v0 contains the number of the system call (see Figure A.9.1) provided<br />

by SPIM.<br />
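For instance, printing an integer and then exiting can be written as the following sketch, assuming SPIM's usual call codes from Figure A.9.1 (1 for print_int and 10 for exit):

li    $a0, 42            # argument for print_int
li    $v0, 1             # system call code for print_int
syscall                  # prints 42
li    $v0, 10            # system call code for exit
syscall                  # terminate the program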

Break<br />

break code<br />

0 code 0xd<br />

6 20 6<br />

Cause exception code. Exception 1 is reserved for the debugger.<br />

No operation<br />

nop<br />

0 0 0 0 0 0<br />

6 5 5 5 5 6<br />

Do nothing.


A.11 Concluding Remarks A-81<br />

A.11<br />

Concluding Remarks<br />

Programming in assembly language requires a programmer to trade helpful features<br />

of high-level languages—such as data structures, type checking, <strong>and</strong> control<br />

constructs—for complete control over the instructions that a computer executes.<br />

External constraints on some applications, such as response time or program size,<br />

require a programmer to pay close attention to every instruction. However, the<br />

cost of this level of attention is assembly language programs that are longer, more<br />

time-consuming to write, <strong>and</strong> more difficult to maintain than high-level language<br />

programs.<br />

Moreover, three trends are reducing the need to write programs in assembly<br />

language. The first trend is toward the improvement of compilers. Modern compilers<br />

produce code that is typically comparable to the best h<strong>and</strong>written code—<br />

<strong>and</strong> is sometimes better. The second trend is the introduction of new processors<br />

that are not only faster, but in the case of processors that execute multiple instructions<br />

simultaneously, also more difficult to program by h<strong>and</strong>. In addition, the rapid<br />

evolution of the modern computer favors high-level language programs that are<br />

not tied to a single architecture. Finally, we witness a trend toward increasingly<br />

complex applications, characterized by complex graphic interfaces <strong>and</strong> many more<br />

features than their predecessors had. Large applications are written by teams of<br />

programmers and require the modularity and semantic checking features provided

by high-level languages.<br />

Further Reading<br />

Aho, A., R. Sethi, <strong>and</strong> J. Ullman [1985]. Compilers: Principles, Techniques, <strong>and</strong> Tools, Reading, MA: Addison-<br />

Wesley.<br />

Slightly dated <strong>and</strong> lacking in coverage of modern architectures, but still the st<strong>and</strong>ard reference on compilers.<br />

Sweetman, D. [1999]. See MIPS Run, San Francisco, CA: Morgan Kaufmann Publishers.<br />

A complete, detailed, <strong>and</strong> engaging introduction to the MIPS instruction set <strong>and</strong> assembly language programming<br />

on these machines.<br />

Detailed documentation on the MIPS-32 architecture is available on the Web:<br />

MIPS32 Architecture for Programmers Volume I: Introduction to the MIPS32 Architecture<br />

(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />

ArchitectureProgrammingPublicationsforMIPS32/MD00082-2B-MIPS32INT-AFP-02.00.pdf/<br />

getDownload)<br />

MIPS32 Architecture for Programmers Volume II: The MIPS32 Instruction Set<br />

(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />

ArchitectureProgrammingPublicationsforMIPS32/MD00086-2B-MIPS32BIS-AFP-02.00.pdf/getDownload)<br />

MIPS32 Architecture for Programmers Volume III: The MIPS32 Privileged Resource Architecture<br />

(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />

ArchitectureProgrammingPublicationsforMIPS32/MD00090-2B-MIPS32PRA-AFP-02.00.pdf/getDownload)


A.12 Exercises A-83<br />

A.10 [10] Using SPIM, write and test a recursive program for solving

the classic mathematical recreation, the Towers of Hanoi puzzle. (This will require<br />

the use of stack frames to support recursion.) The puzzle consists of three pegs<br />

(1, 2, <strong>and</strong> 3) <strong>and</strong> n disks (the number n can vary; typical values might be in the<br />

range from 1 to 8). Disk 1 is smaller than disk 2, which is in turn smaller than disk<br />

3, <strong>and</strong> so forth, with disk n being the largest. Initially, all the disks are on peg 1,<br />

starting with disk n on the bottom, disk n − 1 on top of that, <strong>and</strong> so forth, up to<br />

disk 1 on the top. The goal is to move all the disks to peg 2. You may only move one<br />

disk at a time, that is, the top disk from any of the three pegs onto the top of either<br />

of the other two pegs. Moreover, there is a constraint: You must not place a larger<br />

disk on top of a smaller disk.<br />

The C program below can be used to help write your assembly language program.<br />

/* move n smallest disks from start to finish using<br />

extra */<br />

void hanoi(int n, int start, int finish, int extra){<br />

if(n != 0){<br />

hanoi(n-1, start, extra, finish);<br />

print_string("Move disk");

print_int(n);<br />

print_string("from peg");

print_int(start);<br />

print_string("to peg");

print_int(finish);<br />

print_string(".\n");

hanoi(n-1, extra, finish, start);<br />

}<br />

}<br />

main(){<br />

int n;<br />

print_string("Enter number of disks>");

n = read_int();<br />

hanoi(n, 1, 2, 3);<br />

return 0;<br />

}


B<br />

A P P E N D I X<br />

I always loved that<br />

word, Boolean.<br />

Claude Shannon<br />

IEEE Spectrum, April 1992<br />

(Shannon’s master’s thesis showed<br />

that the algebra invented by George<br />

Boole in the 1800s could represent the<br />

workings of electrical switches.)<br />

The Basics of Logic<br />

<strong>Design</strong><br />

B.1 Introduction B-3<br />

B.2 Gates, Truth Tables, <strong>and</strong> Logic<br />

Equations B-4<br />

B.3 Combinational Logic B-9<br />

B.4 Using a Hardware Description<br />

Language B-20<br />

B.5 Constructing a Basic Arithmetic Logic<br />

Unit B-26<br />

B.6 Faster Addition: Carry Lookahead B-38<br />

B.7 Clocks B-48<br />

<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />

© 2013 Elsevier Inc. All rights reserved.


B-6 Appendix B The Basics of Logic <strong>Design</strong><br />

Boolean Algebra<br />

Another approach is to express the logic function with logic equations. This<br />

is done with the use of Boolean algebra (named after Boole, a 19th-century<br />

mathematician). In Boolean algebra, all the variables have the values 0 or 1 <strong>and</strong>, in<br />

typical formulations, there are three operators:<br />

■ The OR operator is written as +, as in A + B. The result of an OR operator is
1 if either of the variables is 1. The OR operation is also called a logical sum,
since its result is 1 if either operand is 1.

■ The AND operator is written as ·, as in A · B. The result of an AND operator
is 1 only if both inputs are 1. The AND operator is also called logical product,
since its result is 1 only if both operands are 1.

■ The unary operator NOT is written as Ā. The result of a NOT operator is 1 only if
the input is 0. Applying the operator NOT to a logical value results in an inversion
or negation of the value (i.e., if the input is 0 the output is 1, and vice versa).

There are several laws of Boolean algebra that are helpful in manipulating logic<br />

equations.<br />

■ Identity law: A + 0 = A and A · 1 = A

■ Zero and One laws: A + 1 = 1 and A · 0 = 0

■ Inverse laws: A + Ā = 1 and A · Ā = 0

■ Commutative laws: A + B = B + A and A · B = B · A

■ Associative laws: A + (B + C) = (A + B) + C and A · (B · C) = (A · B) · C

■ Distributive laws: A · (B + C) = (A · B) + (A · C) and A + (B · C) = (A + B) · (A + C)

In addition, there are two other useful theorems, called DeMorgan’s laws, that are<br />

discussed in more depth in the exercises.<br />

Any set of logic functions can be written as a series of equations with an output<br />

on the left-h<strong>and</strong> side of each equation <strong>and</strong> a formula consisting of variables <strong>and</strong> the<br />

three operators above on the right-h<strong>and</strong> side.


B.2 Gates, Truth Tables, <strong>and</strong> Logical Equations B-7<br />

Logic Equations<br />

Show the logic equations for the logic functions, D, E, <strong>and</strong> F, described in the<br />

previous example.<br />

EXAMPLE<br />

ANSWER

Here's the equation for D:

D = A + B + C

F is equally simple:

F = A · B · C

E is a little tricky. Think of it in two parts: what must be true for E to be true<br />

(two of the three inputs must be true), <strong>and</strong> what cannot be true (all three<br />

cannot be true). Thus we can write E as<br />

E = ((A · B) + (A · C) + (B · C)) · \overline{A · B · C}

We can also derive E by realizing that E is true only if exactly two of the inputs<br />

are true. Then we can write E as an OR of the three possible terms that have<br />

two true inputs <strong>and</strong> one false input:<br />

E = (A · B · C̄) + (A · C · B̄) + (B · C · Ā)

Proving that these two expressions are equivalent is explored in the exercises.<br />

In Verilog, we describe combinational logic whenever possible using the assign<br />

statement, which is described beginning on page B-23. We can write a definition<br />

for E using the Verilog exclusive-OR operator as assign E = (A ^ B ^ C) *

(A + B + C) * (A * B * C), which is yet another way to describe this function.<br />

D <strong>and</strong> F have even simpler representations, which are just like the corresponding C<br />

code: D = A | B | C and F = A & B & C.


B-8 Appendix B The Basics of Logic <strong>Design</strong><br />

gate A device that<br />

implements basic logic<br />

functions, such as AND<br />

or OR.<br />

NOR gate An inverted<br />

OR gate.<br />

NAND gate An inverted<br />

AND gate.<br />

Check<br />

Yourself<br />

Gates<br />

Logic blocks are built from gates that implement basic logic functions. For example,<br />

an AND gate implements the AND function, <strong>and</strong> an OR gate implements the OR<br />

function. Since both AND <strong>and</strong> OR are commutative <strong>and</strong> associative, an AND or an<br />

OR gate can have multiple inputs, with the output equal to the AND or OR of all<br />

the inputs. The logical function NOT is implemented with an inverter that always<br />

has a single input. The st<strong>and</strong>ard representation of these three logic building blocks<br />

is shown in Figure B.2.1.<br />

Rather than draw inverters explicitly, a common practice is to add “bubbles”<br />

to the inputs or outputs of a gate to cause the logic value on that input line or<br />

output line to be inverted. For example, Figure B.2.2 shows the logic diagram for<br />

the function \overline{Ā + B}, using explicit inverters on the left and bubbled inputs and

outputs on the right.<br />

Any logical function can be constructed using AND gates, OR gates, <strong>and</strong><br />

inversion; several of the exercises give you the opportunity to try implementing<br />

some common logic functions with gates. In the next section, we’ll see how an<br />

implementation of any logic function can be constructed using this knowledge.<br />

In fact, all logic functions can be constructed with only a single gate type, if that<br />

gate is inverting. The two common inverting gates are called NOR <strong>and</strong> NAND <strong>and</strong><br />

correspond to inverted OR <strong>and</strong> AND gates, respectively. NOR <strong>and</strong> NAND gates are<br />

called universal, since any logic function can be built using this one gate type. The<br />

exercises explore this concept further.<br />

Are the following two logical expressions equivalent? If not, find a setting of the<br />

variables to show they are not:<br />

■ (A · B · C̄) + (A · C · B̄) + (B · C · Ā)

■ B · ((A · C̄) + (C · Ā))

FIGURE B.2.1 St<strong>and</strong>ard drawing for an AND gate, OR gate, <strong>and</strong> an inverter, shown from<br />

left to right. The signals to the left of each symbol are the inputs, while the output appears on the right. The<br />

AND <strong>and</strong> OR gates both have two inputs. Inverters have a single input.<br />


FIGURE B.2.2 Logic gate implementation of \overline{Ā + B} using explicit inverters on the left and
bubbled inputs and outputs on the right. This logic function can be simplified to A · B̄ or, in Verilog,
A & ~B.


B.3 Combinational Logic B-9<br />

B.3 Combinational Logic<br />

In this section, we look at a couple of larger logic building blocks that we use<br />

heavily, <strong>and</strong> we discuss the design of structured logic that can be automatically<br />

implemented from a logic equation or truth table by a translation program. Last,<br />

we discuss the notion of an array of logic blocks.<br />

Decoders<br />

One logic block that we will use in building larger components is a decoder. The<br />

most common type of decoder has an n-bit input and 2^n outputs, where only one

output is asserted for each input combination. This decoder translates the n-bit<br />

input into a signal that corresponds to the binary value of the n-bit input. The<br />

outputs are thus usually numbered, say, Out0, Out1, …, Out2^n − 1. If the value of

the input is i, then Outi will be true <strong>and</strong> all other outputs will be false. Figure B.3.1<br />

shows a 3-bit decoder <strong>and</strong> the truth table. This decoder is called a 3-to-8 decoder<br />

since there are 3 inputs and 8 (2^3) outputs. There is also a logic element called

an encoder that performs the inverse function of a decoder, taking 2^n inputs and

producing an n-bit output.<br />

decoder A logic block<br />

that has an n-bit input<br />

and 2^n outputs, where

only one output is<br />

asserted for each input<br />

combination.<br />


Inputs<br />

Outputs<br />

I2 I1 I0 Out7 Out6 Out5 Out4 Out3 Out2 Out1 Out0

0 0 0 0 0 0 0 0 0 0 1<br />

0 0 1 0 0 0 0 0 0 1 0<br />

0 1 0 0 0 0 0 0 1 0 0<br />

0 1 1 0 0 0 0 1 0 0 0<br />

1 0 0 0 0 0 1 0 0 0 0<br />

1 0 1 0 0 1 0 0 0 0 0<br />

1 1 0 0 1 0 0 0 0 0 0<br />

1 1 1 1 0 0 0 0 0 0 0<br />

a. A 3-bit decoder<br />

b. The truth table for a 3-bit decoder<br />

FIGURE B.3.1 A 3-bit decoder has 3 inputs, called I2, I1, and I0, and 2^3 = 8 outputs, called Out0 to Out7. Only the

output corresponding to the binary value of the input is true, as shown in the truth table. The label 3 on the input to the decoder says that the<br />

input signal is 3 bits wide.


B-10 Appendix B The Basics of Logic <strong>Design</strong><br />


FIGURE B.3.2 A two-input multiplexor on the left <strong>and</strong> its implementation with gates on<br />

the right. The multiplexor has two data inputs (A <strong>and</strong> B), which are labeled 0 <strong>and</strong> 1, <strong>and</strong> one selector input<br />

(S), as well as an output C. Implementing multiplexors in Verilog requires a little more work, especially when<br />

they are wider than two inputs. We show how to do this beginning on page B-23.<br />

selector value Also<br />

called control value. The<br />

control signal that is used<br />

to select one of the input<br />

values of a multiplexor<br />

as the output of the<br />

multiplexor.<br />

Multiplexors<br />

One basic logic function that we use quite often in Chapter 4 is the multiplexor.<br />

A multiplexor might more properly be called a selector, since its output is one of<br />

the inputs that is selected by a control. Consider the two-input multiplexor. The<br />

left side of Figure B.3.2 shows this multiplexor has three inputs: two data values<br />

<strong>and</strong> a selector (or control) value. The selector value determines which of the<br />

inputs becomes the output. We can represent the logic function computed by a<br />

two-input multiplexor, shown in gate form on the right side of Figure B.3.2, as<br />

C = (A · S̄) + (B · S).

Multiplexors can be created with an arbitrary number of data inputs. When<br />

there are only two inputs, the selector is a single signal that selects one of the inputs<br />

if it is true (1) <strong>and</strong> the other if it is false (0). If there are n data inputs, there will<br />

need to be ⎡<br />

⎢log 2<br />

n⎤<br />

⎥ selector inputs. In this case, the multiplexor basically consists<br />

of three parts:<br />

1. A decoder that generates n signals, each indicating a different input value<br />

2. An array of n AND gates, each combining one of the inputs with a signal<br />

from the decoder<br />

3. A single large OR gate that incorporates the outputs of the AND gates<br />

To associate the inputs with selector values, we often label the data inputs numerically<br />

(i.e., 0, 1, 2, 3, …, n − 1) and interpret the data selector inputs as a binary number.

Sometimes, we make use of a multiplexor with undecoded selector signals.<br />

Multiplexors are easily represented combinationally in Verilog by using if<br />

expressions. For larger multiplexors, case statements are more convenient, but care<br />

must be taken to synthesize combinational logic.


B.3 Combinational Logic B-11<br />

Two-Level Logic <strong>and</strong> PLAs<br />

As pointed out in the previous section, any logic function can be implemented with<br />

only AND, OR, <strong>and</strong> NOT functions. In fact, a much stronger result is true. Any logic<br />

function can be written in a canonical form, where every input is either a true or<br />

complemented variable <strong>and</strong> there are only two levels of gates—one being AND <strong>and</strong><br />

the other OR—with a possible inversion on the final output. Such a representation<br />

is called a two-level representation, <strong>and</strong> there are two forms, called sum of products<br />

<strong>and</strong> product of sums. A sum-of-products representation is a logical sum (OR) of<br />

products (terms using the AND operator); a product of sums is just the opposite.<br />

In our earlier example, we had two equations for the output E:<br />

<strong>and</strong><br />

E (( A B) ( A C) ( B C)) ( A B C)<br />

sum of products A form<br />

of logical representation<br />

that employs a logical sum<br />

(OR) of products (terms<br />

joined using the AND<br />

operator).<br />

E = (A · B · C̄) + (A · C · B̄) + (B · C · Ā)

This second equation is in a sum-of-products form: it has two levels of logic <strong>and</strong> the<br />

only inversions are on individual variables. The first equation has three levels of logic.<br />

Elaboration: We can also write E as a product of sums:<br />

E = \overline{(Ā + B̄ + C) · (Ā + C̄ + B) · (B̄ + C̄ + A)}

To derive this form, you need to use DeMorgan’s theorems, which are discussed in the<br />

exercises.<br />
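For reference, a sketch of that derivation: DeMorgan's theorems state that \overline{X + Y} = \overline{X} \cdot \overline{Y} and \overline{X \cdot Y} = \overline{X} + \overline{Y}. Complementing E's sum-of-products form twice and pushing one complement inward with DeMorgan gives the product-of-sums form above:

E = \overline{\overline{(A \cdot B \cdot \overline{C}) + (A \cdot C \cdot \overline{B}) + (B \cdot C \cdot \overline{A})}}
  = \overline{(\overline{A} + \overline{B} + C) \cdot (\overline{A} + \overline{C} + B) \cdot (\overline{B} + \overline{C} + A)}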

In this text, we use the sum-of-products form. It is easy to see that any logic<br />

function can be represented as a sum of products by constructing such a<br />

representation from the truth table for the function. Each truth table entry for<br />

which the function is true corresponds to a product term. The product term<br />

consists of a logical product of all the inputs or the complements of the inputs,<br />

depending on whether the entry in the truth table has a 0 or 1 corresponding to<br />

this variable. The logic function is the logical sum of the product terms where the<br />

function is true. This is more easily seen with an example.


B-12 Appendix B The Basics of Logic <strong>Design</strong><br />

Sum of Products<br />

EXAMPLE<br />

Show the sum-of-products representation for the following truth table for D.<br />

Inputs<br />

Outputs<br />

A B C D<br />

0 0 0 0<br />

0 0 1 1<br />

0 1 0 1<br />

0 1 1 0<br />

1 0 0 1<br />

1 0 1 0<br />

1 1 0 0<br />

1 1 1 1<br />

ANSWER<br />

There are four product terms, since the function is true (1) for four different<br />

input combinations. These are:<br />

Ā · B̄ · C

Ā · B · C̄

programmable logic<br />

array (PLA)<br />

A structured-logic<br />

element composed<br />

of a set of inputs <strong>and</strong><br />

corresponding input<br />

complements <strong>and</strong> two<br />

stages of logic: the first<br />

generates product terms<br />

of the inputs <strong>and</strong> input<br />

complements, <strong>and</strong> the<br />

second generates sum<br />

terms of the product<br />

terms. Hence, PLAs<br />

implement logic functions<br />

as a sum of products.<br />

minterms Also called<br />

product terms. A set<br />

of logic inputs joined<br />

by conjunction (AND<br />

operations); the product<br />

terms form the first logic<br />

stage of the programmable<br />

logic array (PLA).<br />

A · B̄ · C̄

A · B · C

Thus, we can write the function for D as the sum of these terms:<br />

D = (Ā · B̄ · C) + (Ā · B · C̄) + (A · B̄ · C̄) + (A · B · C)

Note that only those truth table entries for which the function is true generate<br />

terms in the equation.<br />

We can use this relationship between a truth table <strong>and</strong> a two-level representation<br />

to generate a gate-level implementation of any set of logic functions. A set of logic<br />

functions corresponds to a truth table with multiple output columns, as we saw in<br />

the example on page B-5. Each output column represents a different logic function,<br />

which may be directly constructed from the truth table.<br />

The sum-of-products representation corresponds to a common structured-logic<br />

implementation called a programmable logic array (PLA). A PLA has a set of<br />

inputs <strong>and</strong> corresponding input complements (which can be implemented with a<br />

set of inverters), <strong>and</strong> two stages of logic. The first stage is an array of AND gates that<br />

form a set of product terms (sometimes called minterms); each product term can<br />

consist of any of the inputs or their complements. The second stage is an array of<br />

OR gates, each of which forms a logical sum of any number of the product terms.<br />

Figure B.3.3 shows the basic form of a PLA.


B.3 Combinational Logic B-13<br />


FIGURE B.3.3 The basic form of a PLA consists of an array of AND gates followed by an<br />

array of OR gates. Each entry in the AND gate array is a product term consisting of any number of inputs or<br />

inverted inputs. Each entry in the OR gate array is a sum term consisting of any number of these product terms.<br />

A PLA can directly implement the truth table of a set of logic functions with<br />

multiple inputs <strong>and</strong> outputs. Since each entry where the output is true requires<br />

a product term, there will be a corresponding row in the PLA. Each output<br />

corresponds to a potential row of OR gates in the second stage. The number of OR<br />

gates corresponds to the number of truth table entries for which the output is true.<br />

The total size of a PLA, such as that shown in Figure B.3.3, is equal to the sum of the<br />

size of the AND gate array (called the AND plane) <strong>and</strong> the size of the OR gate array<br />

(called the OR plane). Looking at Figure B.3.3, we can see that the size of the AND<br />

gate array is equal to the number of inputs times the number of different product<br />

terms, <strong>and</strong> the size of the OR gate array is the number of outputs times the number<br />

of product terms.<br />

A PLA has two characteristics that help make it an efficient way to implement a<br />

set of logic functions. First, only the truth table entries that produce a true value for<br />

at least one output have any logic gates associated with them. Second, each different<br />

product term will have only one entry in the PLA, even if the product term is used<br />

in multiple outputs. Let’s look at an example.<br />

PLAs<br />

Consider the set of logic functions defined in the example on page B-5. Show<br />

a PLA implementation of this example for D, E, <strong>and</strong> F.<br />

EXAMPLE


B-14 Appendix B The Basics of Logic <strong>Design</strong><br />

ANSWER<br />

Here is the truth table we constructed earlier:<br />

Inputs<br />

Outputs<br />

A B C D E F<br />

0 0 0 0 0 0<br />

0 0 1 1 0 0<br />

0 1 0 1 0 0<br />

0 1 1 1 1 0<br />

1 0 0 1 0 0<br />

1 0 1 1 1 0<br />

1 1 0 1 1 0<br />

1 1 1 1 0 1<br />

Since there are seven unique product terms with at least one true value in the<br />

output section, there will be seven columns in the AND plane. The number of<br />

rows in the AND plane is three (since there are three inputs), <strong>and</strong> there are also<br />

three rows in the OR plane (since there are three outputs). Figure B.3.4 shows<br />

the resulting PLA, with the product terms corresponding to the truth table<br />

entries from top to bottom.<br />

read-only memory<br />

(ROM) A memory<br />

whose contents are<br />

designated at creation<br />

time, after which the<br />

contents can only be read.<br />

ROM is used as structured<br />

logic to implement a<br />

set of logic functions by<br />

using the terms in the<br />

logic functions as address<br />

inputs <strong>and</strong> the outputs as<br />

bits in each word of the<br />

memory.<br />

programmable ROM<br />

(PROM) A form of<br />

read-only memory that<br />

can be programmed

when a designer knows its<br />

contents.<br />

Rather than drawing all the gates, as we do in Figure B.3.4, designers often show<br />

just the position of AND gates <strong>and</strong> OR gates. Dots are used on the intersection of a<br />

product term signal line <strong>and</strong> an input line or an output line when a corresponding<br />

AND gate or OR gate is required. Figure B.3.5 shows how the PLA of Figure B.3.4<br />

would look when drawn in this way. The contents of a PLA are fixed when the PLA<br />

is created, although there are also forms of PLA-like structures, called PALs, that<br />

can be programmed electronically when a designer is ready to use them.<br />

ROMs<br />

Another form of structured logic that can be used to implement a set of logic<br />

functions is a read-only memory (ROM). A ROM is called a memory because it<br />

has a set of locations that can be read; however, the contents of these locations are<br />

fixed, usually at the time the ROM is manufactured. There are also programmable<br />

ROMs (PROMs) that can be programmed electronically, when a designer knows<br />

their contents. There are also erasable PROMs; these devices require a slow erasure<br />

process using ultraviolet light, <strong>and</strong> thus are used as read-only memories, except<br />

during the design <strong>and</strong> debugging process.<br />

A ROM has a set of input address lines <strong>and</strong> a set of outputs. The number of<br />

addressable entries in the ROM determines the number of address lines: if the


B.4 Using a Hardware Description Language B-19<br />

elements, which we can represent simply by showing that a given operation will<br />

happen to an entire collection of inputs. Inside a machine, much of the time we<br />

want to select between a pair of buses. A bus is a collection of data lines that is<br />

treated together as a single logical signal. (The term bus is also used to indicate a<br />

shared collection of lines with multiple sources <strong>and</strong> uses.)<br />

For example, in the MIPS instruction set, the result of an instruction that is written<br />

into a register can come from one of two sources. A multiplexor is used to choose<br />

which of the two buses (each 32 bits wide) will be written into the Result register.<br />

The 1-bit multiplexor, which we showed earlier, will need to be replicated 32 times.<br />

We indicate that a signal is a bus rather than a single 1-bit line by showing it with<br />

a thicker line in a figure. Most buses are 32 bits wide; those that are not are explicitly<br />

labeled with their width. When we show a logic unit whose inputs <strong>and</strong> outputs are<br />

buses, this means that the unit must be replicated a sufficient number of times to<br />

accommodate the width of the input. Figure B.3.6 shows how we draw a multiplexor<br />

that selects between a pair of 32-bit buses and how this expands in terms of 1-bit-wide

multiplexors. Sometimes we need to construct an array of logic elements<br />

where the inputs for some elements in the array are outputs from earlier elements.<br />

For example, this is how a multibit-wide ALU is constructed. In such cases, we must<br />

explicitly show how to create wider arrays, since the individual elements of the array<br />

are no longer independent, as they are in the case of a 32-bit-wide multiplexor.<br />

bus In logic design, a collection of data lines that is treated together as a single logical signal; also, a shared collection of lines with multiple sources and uses.

FIGURE B.3.6 A multiplexor is arrayed 32 times to perform a selection between two 32-bit inputs. Note that there is still only one data selection signal used for all 32 1-bit multiplexors. (a) A 32-bit wide 2-to-1 multiplexor; (b) the 32-bit wide multiplexor is actually an array of 32 1-bit multiplexors.


B.4 Using a Hardware Description Language

Readers already familiar with VHDL should find the concepts simple, provided they have been exposed to the syntax of C.

Verilog can specify both a behavioral and a structural definition of a digital system. A behavioral specification describes how a digital system functionally operates. A structural specification describes the detailed organization of a digital system, usually using a hierarchical description. A structural specification can be used to describe a hardware system in terms of a hierarchy of basic elements such as gates and switches. Thus, we could use Verilog to describe the exact contents of the truth tables and datapath of the last section.

With the arrival of hardware synthesis tools, most designers now use Verilog or VHDL to structurally describe only the datapath, relying on logic synthesis to generate the control from a behavioral description. In addition, most CAD systems provide extensive libraries of standardized parts, such as ALUs, multiplexors, register files, memories, and programmable logic blocks, as well as basic gates.

Obtaining an acceptable result using libraries and logic synthesis requires that the specification be written with an eye toward the eventual synthesis and the desired outcome. For our simple designs, this primarily means making clear what we expect to be implemented in combinational logic and what we expect to require sequential logic. In most of the examples we use in this section and the remainder of this appendix, we have written the Verilog with the eventual synthesis in mind.

Datatypes and Operators in Verilog

There are two primary datatypes in Verilog:

1. A wire specifies a combinational signal.

2. A reg (register) holds a value, which can vary with time. A reg need not necessarily correspond to an actual register in an implementation, although it often will.

A register or wire, named X, that is 32 bits wide is declared as an array: reg [31:0] X or wire [31:0] X, which also sets the index of 0 to designate the least significant bit of the register. Because we often want to access a subfield of a register or wire, we can refer to a contiguous set of bits of a register or wire with the notation [starting bit: ending bit], where both indices must be constant values.

An array of registers is used for a structure like a register file or memory. Thus, the declaration

reg [31:0] registerfile[0:31]

specifies a variable registerfile that is equivalent to a MIPS registerfile, where register 0 is the first. When accessing an array, we can refer to a single element, as in C, using the notation registerfile[regnum].

behavioral specification Describes how a digital system operates functionally.

structural specification Describes how a digital system is organized in terms of a hierarchical connection of elements.

hardware synthesis tools Computer-aided design software that can generate a gate-level design based on behavioral descriptions of a digital system.

wire In Verilog, specifies a combinational signal.

reg In Verilog, a register.
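The following short module is not from the text; it is a minimal sketch (with made-up names such as field_examples and regfile) showing how these declarations and the bit-select notation look in practice:

  module field_examples (input  wire [31:0] sum,
                         output wire [15:0] upper);
    reg [31:0] pc;               // a 32-bit value; assigned only inside always blocks
    reg [31:0] regfile [0:31];   // an array of 32 registers, each 32 bits wide

    assign upper = sum[31:16];   // a contiguous subfield; both indices are constants

    always @(*)
      pc = regfile[5];           // reading one element of the register array, as in C
  endmodule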


The possible values for a register or wire in Verilog are

■ 0 or 1, representing logical false or true

■ X, representing unknown, the initial value given to all registers and to any wire not connected to something

■ Z, representing the high-impedance state for tristate gates, which we will not discuss in this appendix

Constant values can be specified as decimal numbers as well as binary, octal, or hexadecimal. We often want to say exactly how large a constant field is in bits. This is done by prefixing the value with a decimal number specifying its size in bits. For example:

■ 4'b0100 specifies a 4-bit binary constant with the value 4, as does 4'd4.

■ −8'h4 specifies an 8-bit constant with the value −4 (in two's complement representation).

Values can also be concatenated by placing them within { } separated by commas. The notation {x{bit field}} replicates bit field x times. For example:

■ {16{2'b01}} creates a 32-bit value with the pattern 0101 … 01.

■ {A[31:16],B[15:0]} creates a value whose upper 16 bits come from A and whose lower 16 bits come from B.

Verilog provides the full set of unary and binary operators from C, including the arithmetic operators (+, −, *, /), the logical operators (&, |, ~), the comparison operators (==, !=, >, <, >=, <=), the shift operators (<<, >>), and C's conditional operator (?, which is used in the form condition ? expr1 : expr2 and returns expr1 if the condition is true and expr2 if it is false). Verilog adds a set of unary logic reduction operators (&, |, ^) that yield a single bit by applying the logical operator to all the bits of an operand. For example, &A returns the value obtained by ANDing all the bits of A together, and ^A returns the reduction obtained by using exclusive OR on all the bits of A.
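As an illustration (module and signal names are invented for this sketch, not taken from the text), the constants, concatenation, and reduction operators described above can be exercised like this:

  module operator_examples (input  wire [31:0] A, B,
                            output wire [31:0] pattern, mixed,
                            output wire        all_ones, parity);
    assign pattern  = {16{2'b01}};          // replication: the 32-bit pattern 0101...01
    assign mixed    = {A[31:16], B[15:0]};  // concatenation: upper half of A, lower half of B
    assign all_ones = &A;                   // reduction AND: 1 only if every bit of A is 1
    assign parity   = ^A;                   // reduction XOR of all the bits of A
  endmodule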

Check Yourself

Which of the following define exactly the same value?

1. 8'b11110000

2. 8'hF0

3. 8'd240

4. {{4{1'b1}},{4{1'b0}}}

5. {4'b1,4'b0}


Structure of a Verilog Program

A Verilog program is structured as a set of modules, which may represent anything from a collection of logic gates to a complete system. Modules are similar to classes in C++, although not nearly as powerful. A module specifies its input and output ports, which describe the incoming and outgoing connections of a module. A module may also declare additional variables. The body of a module consists of:

■ initial constructs, which can initialize reg variables

■ Continuous assignments, which define only combinational logic

■ always constructs, which can define either sequential or combinational logic

■ Instances of other modules, which are used to implement the module being defined

Representing Complex Combinational Logic in Verilog

A continuous assignment, which is indicated with the keyword assign, acts like a combinational logic function: the output is continuously assigned the value, and a change in the input values is reflected immediately in the output value. Wires may only be assigned values with continuous assignments. Using continuous assignments, we can define a module that implements a half-adder, as Figure B.4.1 shows.

FIGURE B.4.1 A Verilog module that defines a half-adder using continuous assignments.
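Figure B.4.1 itself is not reproduced in this extraction; a half-adder written with continuous assignments, along the lines the figure describes, might look like this:

  module half_adder (A, B, Sum, Carry);
    input  A, B;            // the two 1-bit inputs
    output Sum, Carry;
    assign Sum   = A ^ B;   // the sum is the XOR of the two inputs
    assign Carry = A & B;   // the carry out is the AND of the two inputs
  endmodule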

Assign statements are one sure way to write Verilog that generates combinational logic. For more complex structures, however, assign statements may be awkward or tedious to use. It is also possible to use the always block of a module to describe a combinational logic element, although care must be taken. Using an always block allows the inclusion of Verilog control constructs, such as if-then-else, case statements, for statements, and repeat statements, to be used. These statements are similar to those in C with small changes.

An always block specifies an optional list of signals on which the block is sensitive (in a list starting with @).


The always block is re-evaluated if any of the listed signals changes value; if the list is omitted, the always block is constantly re-evaluated. When an always block is specifying combinational logic, the sensitivity list should include all the input signals. If there are multiple Verilog statements to be executed in an always block, they are surrounded by the keywords begin and end, which take the place of the { and } in C. An always block thus looks like this:

  always @(list of signals that cause reevaluation) begin
    Verilog statements including assignments and other control statements
  end

sensitivity list The list of signals that specifies when an always block should be re-evaluated.

Reg variables may only be assigned inside an always block, using a procedural assignment statement (as distinguished from the continuous assignment we saw earlier). There are, however, two different types of procedural assignments. The assignment operator = executes as it does in C; the right-hand side is evaluated, and the left-hand side is assigned the value. Furthermore, it executes like the normal C assignment statement: that is, it is completed before the next statement is executed. Hence, the assignment operator = has the name blocking assignment. This blocking can be useful in the generation of sequential logic, and we will return to it shortly. The other form of assignment (nonblocking) is indicated by <=.

blocking assignment In Verilog, an assignment that completes before the execution of the next statement.

nonblocking assignment An assignment that continues after evaluating the right-hand side, assigning the left-hand side the value only after all right-hand sides are evaluated.
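As a preview of the difference (this sketch is not from the text; the module and signal names are invented), the two forms behave differently when one statement depends on another inside the same always block:

  module assign_contrast (input wire clock, input wire [7:0] c);
    reg [7:0] a1, b1, a2, b2;

    always @(posedge clock) begin
      a1 = c;      // blocking: completes before the next statement runs,
      b1 = a1;     // so b1 receives the new value of a1 (that is, c)
    end

    always @(posedge clock) begin
      a2 <= c;     // nonblocking: all right-hand sides are evaluated first,
      b2 <= a2;    // so b2 receives the value a2 held before this clock edge
    end
  endmodule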


FIGURE B.4.2 A Verilog definition of a 4-to-1 multiplexor with 32-bit inputs, using a case statement. The case statement acts like a C switch statement, except that in Verilog only the code associated with the selected case is executed (as if each case statement had a break at the end) and there is no fall-through to the next statement.
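The figure's code is not reproduced in this extraction; a 4-to-1 multiplexor consistent with the caption could be sketched as follows (the port names here are assumptions, not necessarily those of the figure):

  module mult4to1 (In0, In1, In2, In3, Sel, Out);
    input  [31:0] In0, In1, In2, In3;   // four 32-bit data inputs
    input  [1:0]  Sel;                  // two-bit selector
    output reg [31:0] Out;              // a reg because it is assigned in an always block

    always @(*)                         // re-evaluate whenever any input changes
      case (Sel)                        // like a C switch, but with no fall-through
        2'd0: Out = In0;
        2'd1: Out = In1;
        2'd2: Out = In2;
        2'd3: Out = In3;
      endcase
  endmodule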

FIGURE B.4.3 A Verilog behavioral definition of a MIPS ALU. This could be synthesized using a module library containing basic arithmetic and logical operations.


Check Yourself

Assuming all values are initially zero, what are the values of A and B after executing this Verilog code inside an always block?

C = 1;
A <=


B.5 Constructing a Basic Arithmetic Logic Unit

FIGURE B.5.7 A 32-bit ALU constructed from 32 1-bit ALUs. CarryOut of the less significant bit is connected to the CarryIn of the more significant bit. This organization is called ripple carry.

this is only one step in negating a two's complement number. Notice that the least significant bit still has a CarryIn signal, even though it's unnecessary for addition. What happens if we set this CarryIn to 1 instead of 0? The adder will then calculate a + b + 1. By selecting the inverted version of b, we get exactly what we want:

a + b̄ + 1 = a + (b̄ + 1) = a + (−b) = a − b

The simplicity of the hardware design of a two's complement adder helps explain why two's complement representation has become the universal standard for integer computer arithmetic.
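Stated in Verilog (a sketch, not one of the book's figures), the same identity is a one-liner:

  module subtract_by_add (input  wire [31:0] a, b,
                          output wire [31:0] diff);
    // a + (NOT b) + 1 equals a - b in two's complement; the +1 plays the
    // role of the CarryIn set to 1 in the discussion above.
    assign diff = a + ~b + 32'd1;
  endmodule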


FIGURE B.5.11 A 32-bit ALU constructed from 31 copies of the 1-bit ALU in the top of Figure B.5.10 and one 1-bit ALU in the bottom of that figure. The Less inputs are connected to 0 except for the least significant bit, which is connected to the Set output of the most significant bit. If the ALU performs a − b and we select the input 3 in the multiplexor in Figure B.5.10, then Result = 0 … 001 if a < b, and Result = 0 … 000 otherwise.

Thus, we need a new 1-bit ALU for the most significant bit that has an extra output bit: the adder output. The bottom drawing of Figure B.5.10 shows the design, with this new adder output line called Set, and used only for slt. As long as we need a special ALU for the most significant bit, we added the overflow detection logic since it is also associated with that bit.


Alas, the test of less than is a little more complicated than just described because of overflow, as we explore in the exercises. Figure B.5.11 shows the 32-bit ALU.

Notice that every time we want the ALU to subtract, we set both CarryIn and Binvert to 1. For adds or logical operations, we want both control lines to be 0. We can therefore simplify control of the ALU by combining CarryIn and Binvert into a single control line called Bnegate.

To further tailor the ALU to the MIPS instruction set, we must support conditional branch instructions. These instructions branch either if two registers are equal or if they are unequal. The easiest way to test equality with the ALU is to subtract b from a and then test to see if the result is 0, since

(a − b = 0) ⇒ a = b

Thus, if we add hardware to test if the result is 0, we can test for equality. The simplest way is to OR all the outputs together and then send that signal through an inverter:

Zero = NOT (Result31 + Result30 + … + Result2 + Result1 + Result0)
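In Verilog this OR-and-invert is exactly what the reduction operators of Section B.4 give us; a one-line sketch (written as a fragment inside whatever module computes the 32-bit Result):

  wire Zero;
  assign Zero = ~(|Result);   // OR all 32 result bits together, then invert the output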

Figure B.5.12 shows the revised 32-bit ALU. We can think of the combination of the 1-bit Ainvert line, the 1-bit Binvert line, and the 2-bit Operation lines as 4-bit control lines for the ALU, telling it to perform add, subtract, AND, OR, or set on less than. Figure B.5.13 shows the ALU control lines and the corresponding ALU operation.

Finally, now that we have seen what is inside a 32-bit ALU, we will use the universal symbol for a complete ALU, as shown in Figure B.5.14.

Defining the MIPS ALU in Verilog

Figure B.5.15 shows how a combinational MIPS ALU might be specified in Verilog; such a specification would probably be compiled using a standard parts library that provided an adder, which could be instantiated. For completeness, we show the ALU control for MIPS in Figure B.5.16, which is used in Chapter 4, where we build a Verilog version of the MIPS datapath.

The next question is, “How quickly can this ALU add two 32-bit operands?” We can determine the a and b inputs, but the CarryIn input depends on the operation in the adjacent 1-bit adder. If we trace all the way through the chain of dependencies, we connect the most significant bit to the least significant bit, so the most significant bit of the sum must wait for the sequential evaluation of all 32 1-bit adders. This sequential chain reaction is too slow to be used in time-critical hardware. The next section explores how to speed up addition. This topic is not crucial to understanding the rest of the appendix and may be skipped.


FIGURE B.5.12 The final 32-bit ALU. This adds a Zero detector to Figure B.5.11.

ALU control lines   Function
0000                AND
0001                OR
0010                add
0110                subtract
0111                set on less than
1100                NOR

FIGURE B.5.13 The values of the three ALU control lines, Bnegate and Operation, and the corresponding ALU operations.


FIGURE B.5.14 The symbol commonly used to represent an ALU, as shown in Figure B.5.12. This symbol is also used to represent an adder, so it is normally labeled either with ALU or Adder.

FIGURE B.5.15 A Verilog behavioral definition of a MIPS ALU.
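Figure B.5.15 is not reproduced in this extraction. A behavioral sketch consistent with the control encoding of Figure B.5.13 (the port names are assumptions, not necessarily those of the figure) is:

  module MIPSALU (ALUctl, A, B, ALUOut, Zero);
    input      [3:0]  ALUctl;
    input      [31:0] A, B;
    output reg [31:0] ALUOut;
    output            Zero;

    assign Zero = (ALUOut == 0);      // Zero is true whenever the result is 0

    always @(ALUctl, A, B)            // re-evaluate if these change
      case (ALUctl)
        4'b0000: ALUOut = A & B;               // AND
        4'b0001: ALUOut = A | B;               // OR
        4'b0010: ALUOut = A + B;               // add
        4'b0110: ALUOut = A - B;               // subtract
        4'b0111: ALUOut = (A < B) ? 1 : 0;     // set on less than
        4'b1100: ALUOut = ~(A | B);            // NOR
        default: ALUOut = 0;                   // undefined encodings produce 0
      endcase
  endmodule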


B.6 Faster Addition: Carry Lookahead

significant bit of the adder, in theory we could calculate the CarryIn values to all the remaining bits of the adder in just two levels of logic.

For example, the CarryIn for bit 2 of the adder is exactly the CarryOut of bit 1, so the formula is

CarryIn2 = (b1 · CarryIn1) + (a1 · CarryIn1) + (a1 · b1)

Similarly, CarryIn1 is defined as

CarryIn1 = (b0 · CarryIn0) + (a0 · CarryIn0) + (a0 · b0)

Using the shorter and more traditional abbreviation of ci for CarryIni, we can rewrite the formulas as

c2 = (b1 · c1) + (a1 · c1) + (a1 · b1)
c1 = (b0 · c0) + (a0 · c0) + (a0 · b0)

Substituting the definition of c1 for the first equation results in this formula:

c2 = (a1 · a0 · b0) + (a1 · a0 · c0) + (a1 · b0 · c0) + (b1 · a0 · b0) + (b1 · a0 · c0) + (b1 · b0 · c0) + (a1 · b1)

You can imagine how the equation expands as we get to higher bits in the adder; it grows rapidly with the number of bits. This complexity is reflected in the cost of the hardware for fast carry, making this simple scheme prohibitively expensive for wide adders.

Fast Carry Using the First Level of Abstraction: Propagate and Generate

Most fast-carry schemes limit the complexity of the equations to simplify the hardware, while still making substantial speed improvements over ripple carry. One such scheme is a carry-lookahead adder. In Chapter 1, we said computer systems cope with complexity by using levels of abstraction. A carry-lookahead adder relies on levels of abstraction in its implementation.

Let's factor our original equation as a first step:

ci+1 = (bi · ci) + (ai · ci) + (ai · bi)
     = (ai · bi) + (ai + bi) · ci

If we were to rewrite the equation for c2 using this formula, we would see some repeated patterns:

c2 = (a1 · b1) + (a1 + b1) · ((a0 · b0) + (a0 + b0) · c0)

Note the repeated appearance of (ai · bi) and (ai + bi) in the formula above. These two important factors are traditionally called generate (gi) and propagate (pi):


gi = ai · bi
pi = ai + bi

Using them to define ci+1, we get

ci+1 = gi + pi · ci

To see where the signals get their names, suppose gi is 1. Then

ci+1 = gi + pi · ci = 1 + pi · ci = 1

That is, the adder generates a CarryOut (ci+1) independent of the value of CarryIn (ci). Now suppose that gi is 0 and pi is 1. Then

ci+1 = gi + pi · ci = 0 + 1 · ci = ci

That is, the adder propagates CarryIn to a CarryOut. Putting the two together, CarryIni+1 is a 1 if either gi is 1 or both pi is 1 and CarryIni is 1.

As an analogy, imagine a row of dominoes set on edge. The end domino can be tipped over by pushing one far away, provided there are no gaps between the two. Similarly, a carry out can be made true by a generate far away, provided all the propagates between them are true.

Relying on the definitions of propagate and generate as our first level of abstraction, we can express the CarryIn signals more economically. Let's show it for 4 bits:

c1 = g0 + (p0 · c0)
c2 = g1 + (p1 · g0) + (p1 · p0 · c0)
c3 = g2 + (p2 · g1) + (p2 · p1 · g0) + (p2 · p1 · p0 · c0)
c4 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0) + (p3 · p2 · p1 · p0 · c0)

These equations just represent common sense: CarryIni is a 1 if some earlier adder generates a carry and all intermediary adders propagate a carry. Figure B.6.1 uses plumbing to try to explain carry lookahead.
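These four equations translate almost word for word into Verilog; the following module is a sketch (not one of the book's figures) of a 4-bit carry-lookahead slice:

  module carry_lookahead_4 (input  wire [3:0] a, b,
                            input  wire       c0,
                            output wire [4:1] c);
    wire [3:0] g = a & b;   // generate:  gi = ai . bi
    wire [3:0] p = a | b;   // propagate: pi = ai + bi

    assign c[1] = g[0] | (p[0] & c0);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & c0);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                       | (p[3] & p[2] & p[1] & g[0])
                       | (p[3] & p[2] & p[1] & p[0] & c0);
  endmodule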

Even this simplified form leads to large equations and, hence, considerable logic even for a 16-bit adder. Let's try moving to two levels of abstraction.

Fast Carry Using the Second Level of Abstraction

First, we consider this 4-bit adder with its carry-lookahead logic as a single building block. If we connect them in ripple carry fashion to form a 16-bit adder, the add will be faster than the original with a little more hardware.


To go faster, we'll need carry lookahead at a higher level. To perform carry lookahead for 4-bit adders, we need to propagate and generate signals at this higher level. Here they are for the four 4-bit adder blocks:

P0 = p3 · p2 · p1 · p0
P1 = p7 · p6 · p5 · p4
P2 = p11 · p10 · p9 · p8
P3 = p15 · p14 · p13 · p12

That is, the “super” propagate signal for the 4-bit abstraction (Pi) is true only if each of the bits in the group will propagate a carry.

For the “super” generate signal (Gi), we care only if there is a carry out of the most significant bit of the 4-bit group. This obviously occurs if generate is true for that most significant bit; it also occurs if an earlier generate is true and all the intermediate propagates, including that of the most significant bit, are also true:

G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)

Figure B.6.2 updates our plumbing analogy to show P0 and G0.

Then the equations at this higher level of abstraction for the carry in for each 4-bit group of the 16-bit adder (C1, C2, C3, C4 in Figure B.6.3) are very similar to the carry out equations for each bit of the 4-bit adder (c1, c2, c3, c4) above:

C1 = G0 + (P0 · c0)
C2 = G1 + (P1 · G0) + (P1 · P0 · c0)
C3 = G2 + (P2 · G1) + (P2 · P1 · G0) + (P2 · P1 · P0 · c0)
C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · c0)

Figure B.6.3 shows 4-bit adders connected with such a carry-lookahead unit. The exercises explore the speed differences between these carry schemes, different notations for multibit propagate and generate signals, and the design of a 64-bit adder.
adder.


Both Levels of the Propagate and Generate

EXAMPLE

Determine the gi, pi, Pi, and Gi values of these two 16-bit numbers:

a: 0001 1010 0011 0011two
b: 1110 0101 1110 1011two

Also, what is CarryOut15 (C4)?

ANSWER

Aligning the bits makes it easy to see the values of generate gi (ai · bi) and propagate pi (ai + bi):

a:  0001 1010 0011 0011
b:  1110 0101 1110 1011
gi: 0000 0000 0010 0011
pi: 1111 1111 1111 1011

where the bits are numbered 15 to 0 from left to right. Next, the “super” propagates (P3, P2, P1, P0) are simply the AND of the lower-level propagates:

P3 = 1 · 1 · 1 · 1 = 1
P2 = 1 · 1 · 1 · 1 = 1
P1 = 1 · 1 · 1 · 1 = 1
P0 = 1 · 0 · 1 · 1 = 0

The “super” generates are more complex, so use the following equations:

G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
   = 0 + (1 · 0) + (1 · 0 · 1) + (1 · 0 · 1 · 1) = 0 + 0 + 0 + 0 = 0
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
   = 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) = 0 + 0 + 1 + 0 = 1
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
   = 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0) = 0 + 0 + 0 + 0 = 0
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)
   = 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0) = 0 + 0 + 0 + 0 = 0

Finally, CarryOut15 is

C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · c0)
   = 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) + (1 · 1 · 1 · 0 · 0)
   = 0 + 0 + 1 + 0 + 0 = 1

Hence, there is a carry out when adding these two 16-bit numbers.


FIGURE B.6.3 Four 4-bit ALUs using carry lookahead to form a 16-bit adder. Note that the carries come from the carry-lookahead unit, not from the 4-bit ALUs.


The reason carry lookahead can make carries faster is that all logic begins evaluating the moment the clock cycle begins, and the result will not change once the output of each gate stops changing. By taking the shortcut of going through fewer gates to send the carry in signal, the output of the gates will stop changing sooner, and hence the time for the adder can be less.

To appreciate the importance of carry lookahead, we need to calculate the relative performance between it and ripple carry adders.

Speed of Ripple Carry versus Carry Lookahead

EXAMPLE

One simple way to model time for logic is to assume each AND or OR gate takes the same time for a signal to pass through it. Time is estimated by simply counting the number of gates along the path through a piece of logic. Compare the number of gate delays for paths of two 16-bit adders, one using ripple carry and one using two-level carry lookahead.

ANSWER

Figure B.5.5 shows that the carry out signal takes two gate delays per bit. Then the number of gate delays between a carry in to the least significant bit and the carry out of the most significant is 16 × 2 = 32.

For carry lookahead, the carry out of the most significant bit is just C4, defined in the example. It takes two levels of logic to specify C4 in terms of Pi and Gi (the OR of several AND terms). Pi is specified in one level of logic (AND) using pi, and Gi is specified in two levels using pi and gi, so the worst case for this next level of abstraction is two levels of logic. pi and gi are each one level of logic, defined in terms of ai and bi. If we assume one gate delay for each level of logic in these equations, the worst case is 2 + 2 + 1 = 5 gate delays.

Hence, for the path from carry in to carry out, the 16-bit addition by a carry-lookahead adder is six times faster, using this very simple estimate of hardware speed.

Summary

Carry lookahead offers a faster path than waiting for the carries to ripple through all 32 1-bit adders. This faster path is paved by two signals, generate and propagate. The former creates a carry regardless of the carry input, and the latter passes a carry along. Carry lookahead also gives another example of how abstraction is important in computer design to cope with complexity.

Check Yourself

Using the simple estimate of hardware speed above with gate delays, what is the relative performance of a ripple carry 8-bit add versus a 64-bit add using carry-lookahead logic?

1. A 64-bit carry-lookahead adder is three times faster: 8-bit adds are 16 gate delays and 64-bit adds are 7 gate delays.

2. They are about the same speed, since 64-bit adds need more levels of logic in the 16-bit adder.

3. 8-bit adds are faster than 64 bits, even with carry lookahead.

Elaboration: We have now accounted for all but one of the arithmetic and logical operations for the core MIPS instruction set: the ALU in Figure B.5.14 omits support of shift instructions. It would be possible to widen the ALU multiplexor to include a left shift by 1 bit or a right shift by 1 bit. But hardware designers have created a circuit called a barrel shifter, which can shift from 1 to 31 bits in no more time than it takes to add two 32-bit numbers, so shifting is normally done outside the ALU.

Elaboration: The logic equation for the Sum output of the full adder can be expressed more simply by using a more powerful gate than AND and OR. An exclusive OR gate is true if the two operands disagree; that is,

x ≠ y ⇒ 1 and x = y ⇒ 0

In some technologies, exclusive OR is more efficient than two levels of AND and OR gates. Using the symbol ⊕ to represent exclusive OR, here is the new equation:

Sum = a ⊕ b ⊕ CarryIn

Also, we have drawn the ALU the traditional way, using gates. Computers are designed today in CMOS transistors, which are basically switches. CMOS ALUs and barrel shifters take advantage of these switches and have many fewer multiplexors than shown in our designs, but the design principles are similar.

Elaboration: Using lowercase and uppercase to distinguish the hierarchy of generate and propagate symbols breaks down when you have more than two levels. An alternate notation that scales is gi..j and pi..j for the generate and propagate signals for bits i to j. Thus, g1..1 is the generate signal for bit 1, g4..1 is for bits 4 to 1, and g16..1 is for bits 16 to 1.


B.7 Clocks

clock edge occurs. A signal is valid if it is stable (i.e., not changing), and the value will not change again until the inputs change. Since combinational circuits cannot have feedback, if the inputs to a combinational logic unit are not changed, the outputs will eventually become valid.

Figure B.7.2 shows the relationship among the state elements and the combinational logic blocks in a synchronous, sequential logic design. The state elements, whose outputs change only after the clock edge, provide valid inputs to the combinational logic block. To ensure that the values written into the state elements on the active clock edge are valid, the clock must have a long enough period so that all the signals in the combinational logic block stabilize, and then the clock edge samples those values for storage in the state elements. This constraint sets a lower bound on the length of the clock period, which must be long enough for all state element inputs to be valid.

In the rest of this appendix, as well as in Chapter 4, we usually omit the clock signal, since we are assuming that all state elements are updated on the same clock edge. Some state elements will be written on every clock edge, while others will be written only under certain conditions (such as a register being updated). In such cases, we will have an explicit write signal for that state element. The write signal must still be gated with the clock so that the update occurs only on the clock edge if the write signal is active. We will see how this is done and used in the next section.

One other advantage of an edge-triggered methodology is that it is possible to have a state element that is used as both an input and output to the same combinational logic block, as shown in Figure B.7.3. In practice, care must be taken to prevent races in such situations and to ensure that the clock period is long enough; this topic is discussed further in Section B.11.

Now that we have discussed how clocking is used to update state elements, we can discuss how to construct the state elements.

FIGURE B.7.2 The inputs to a combinational logic block come from a state element, and the outputs are written into a state element. The clock edge determines when the contents of the state elements are updated.


B.8 Memory Elements: Flip-Flops, Latches, and Registers

The simplest type of memory elements are unclocked; that is, they do not have any clock input. Although we only use clocked memory elements in this text, an unclocked latch is the simplest memory element, so let's look at this circuit first. Figure B.8.1 shows an S-R latch (set-reset latch), built from a pair of NOR gates (OR gates with inverted outputs). The outputs Q and Q̄ represent the value of the stored state and its complement. When neither S nor R are asserted, the cross-coupled NOR gates act as inverters and store the previous values of Q and Q̄.

For example, if the output, Q, is true, then the bottom inverter produces a false output (which is Q̄), which becomes the input to the top inverter, which produces a true output, which is Q, and so on. If S is asserted, then the output Q will be asserted and Q̄ will be deasserted, while if R is asserted, then the output Q̄ will be asserted and Q will be deasserted. When S and R are both deasserted, the last values of Q and Q̄ will continue to be stored in the cross-coupled structure. Asserting S and R simultaneously can lead to incorrect operation: depending on how S and R are deasserted, the latch may oscillate or become metastable (this is described in more detail in Section B.11).

This cross-coupled structure is the basis for more complex memory elements that allow us to store data signals. These elements contain additional gates used to store signal values and to cause the state to be updated only in conjunction with a clock. The next section shows how these elements are built.

Flip-Flops and Latches

Flip-flops and latches are the simplest memory elements. In both flip-flops and latches, the output is equal to the value of the stored state inside the element. Furthermore, unlike the S-R latch described above, all the latches and flip-flops we will use from this point on are clocked, which means that they have a clock input and the change of state is triggered by that clock. The difference between a flip-flop and a latch is the point at which the clock causes the state to actually change. In a clocked latch, the state is changed whenever the appropriate inputs change and the clock is asserted, whereas in a flip-flop, the state is changed only on a clock edge. Since throughout this text we use an edge-triggered timing methodology where state is only updated on clock edges, we need only use flip-flops. Flip-flops are often built from latches, so we start by describing the operation of a simple clocked latch and then discuss the operation of a flip-flop constructed from that latch.

flip-flop A memory element for which the output is equal to the value of the stored state inside the element and for which the internal state is changed only on a clock edge.

latch A memory element in which the output is equal to the value of the stored state inside the element and the state is changed whenever the appropriate inputs change and the clock is asserted.

D flip-flop A flip-flop with one data input that stores the value of that input signal in the internal memory when the clock edge occurs.

For computer applications, the function of both flip-flops and latches is to store a signal. A D latch or D flip-flop stores the value of its data input signal in the internal memory. Although there are many other types of latch and flip-flop, the D type is the only basic building block that we will need. A D latch has two inputs and two outputs. The inputs are the data value to be stored (called D) and a clock signal (called C) that indicates when the latch should read the value on the D input and store it.


The outputs are simply the value of the internal state (Q) and its complement (Q̄). When the clock input C is asserted, the latch is said to be open, and the value of the output (Q) becomes the value of the input D. When the clock input C is deasserted, the latch is said to be closed, and the value of the output (Q) is whatever value was stored the last time the latch was open.

Figure B.8.2 shows how a D latch can be implemented with two additional gates added to the cross-coupled NOR gates. Since when the latch is open the value of Q changes as D changes, this structure is sometimes called a transparent latch. Figure B.8.3 shows how this D latch works, assuming that the output Q is initially false and that D changes first.

FIGURE B.8.2 A D latch implemented with NOR gates. A NOR gate acts as an inverter if the other input is 0. Thus, the cross-coupled pair of NOR gates acts to store the state value unless the clock input, C, is asserted, in which case the value of input D replaces the value of Q and is stored. The value of input D must be stable when the clock signal C changes from asserted to deasserted.

FIGURE B.8.3 Operation of a D latch, assuming the output is initially deasserted. When the clock, C, is asserted, the latch is open and the Q output immediately assumes the value of the D input.

As mentioned earlier, we use flip-flops as the basic building block, rather than latches. Flip-flops are not transparent: their outputs change only on the clock edge. A flip-flop can be built so that it triggers on either the rising (positive) or falling (negative) clock edge; for our designs we can use either type. Figure B.8.4 shows how a falling-edge D flip-flop is constructed from a pair of D latches. In a D flip-flop, the output is stored when the clock edge occurs. Figure B.8.5 shows how this flip-flop operates.


FIGURE B.8.4 A D flip-flop with a falling-edge trigger. The first latch, called the master, is open and follows the input D when the clock input, C, is asserted. When the clock input, C, falls, the first latch is closed, but the second latch, called the slave, is open and gets its input from the output of the master latch.

FIGURE B.8.5 Operation of a D flip-flop with a falling-edge trigger, assuming the output is initially deasserted. When the clock input (C) changes from asserted to deasserted, the Q output stores the value of the D input. Compare this behavior to that of the clocked D latch shown in Figure B.8.3. In a clocked latch, the stored value and the output, Q, both change whenever C is high, as opposed to only when C transitions.

Here is a Verilog description of a module for a rising-edge D flip-flop, assuming that clock is the clock input and D is the data input:

  module DFF (clock, D, Q, Qbar);
    input clock, D;
    output reg Q;        // Q is a reg since it is assigned in an always block
    output Qbar;

    assign Qbar = ~Q;    // Qbar is always just the inverse of Q

    always @(posedge clock)   // perform actions whenever the clock rises
      Q = D;
  endmodule

Because the D input is sampled on the clock edge, it must be valid for a period of time immediately before and immediately after the clock edge. The minimum time that the input must be valid before the clock edge is called the setup time; the minimum time during which it must be valid after the clock edge is called the hold time. Thus the inputs to any flip-flop (or anything built using flip-flops) must be valid during a window that begins at time t_setup before the clock edge and ends at t_hold after the clock edge, as shown in Figure B.8.6. Section B.11 talks about clocking and timing constraints, including the propagation delay through a flip-flop, in more detail.

setup time The minimum time that the input to a memory device must be valid before the clock edge.

hold time The minimum time during which the input must be valid after the clock edge.

FIGURE B.8.6 Setup and hold time requirements for a D flip-flop with a falling-edge trigger. The input must be stable for a period of time before the clock edge, as well as after the clock edge. The minimum time the signal must be stable before the clock edge is called the setup time, while the minimum time the signal must be stable after the clock edge is called the hold time. Failure to meet these minimum requirements can result in a situation where the output of the flip-flop may not be predictable, as described in Section B.11. Hold times are usually either 0 or very small and thus not a cause of worry.

We can use an array of D flip-flops to build a register that can hold a multibit datum, such as a byte or word. We used registers throughout our datapaths in Chapter 4.

Register Files

One structure that is central to our datapath is a register file. A register file consists of a set of registers that can be read and written by supplying a register number to be accessed. A register file can be implemented with a decoder for each read or write port and an array of registers built from D flip-flops. Because reading a register does not change any state, we need only supply a register number as an input, and the only output will be the data contained in that register. For writing a register we will need three inputs: a register number, the data to write, and a clock that controls the writing into the register. In Chapter 4, we used a register file that has two read ports and one write port. This register file is drawn as shown in Figure B.8.7. The read ports can be implemented with a pair of multiplexors, each of which is as wide as the number of bits in each register of the register file. Figure B.8.8 shows the implementation of two register read ports for a 32-bit-wide register file.

Implementing the write port is slightly more complex, since we can only change the contents of the designated register. We can do this by using a decoder to generate a signal that can be used to determine which register to write. Figure B.8.9 shows how to implement the write port for a register file. It is important to remember that the flip-flop changes state only on the clock edge. In Chapter 4, we hooked up write signals for the register file explicitly and assumed the clock shown in Figure B.8.9 is attached implicitly.


FIGURE B.8.7 A register file with two read ports and one write port has five inputs and two outputs. The control input Write is shown in color.

FIGURE B.8.8 The implementation of two read ports for a register file with n registers can be done with a pair of n-to-1 multiplexors, each 32 bits wide. The register read number signal is used as the multiplexor selector signal. Figure B.8.9 shows how the write port is implemented.


FIGURE B.8.9 The write port for a register file is implemented with a decoder that is used with the write signal to generate the C input to the registers. All three inputs (the register number, the data, and the write signal) will have setup and hold-time constraints that ensure that the correct data is written into the register file.

What happens if the same register is read and written during a clock cycle? Because the write of the register file occurs on the clock edge, the register will be valid during the time it is read, as we saw earlier in Figure B.7.2. The value returned will be the value written in an earlier clock cycle. If we want a read to return the value currently being written, additional logic in the register file or outside of it is needed. Chapter 4 makes extensive use of such logic.

Specifying Sequential Logic in Verilog

To specify sequential logic in Verilog, we must understand how to generate a clock, how to describe when a value is written into a register, and how to specify sequential control. Let us start by specifying a clock. A clock is not a predefined object in Verilog; instead, we generate a clock by using the Verilog notation #n before a statement; this causes a delay of n simulation time steps before the execution of the statement. In most Verilog simulators, it is also possible to generate a clock as an external input, allowing the user to specify at simulation time the number of clock cycles during which to run a simulation.

The code in Figure B.8.10 implements a simple clock that is high or low for one simulation unit and then switches state. We use the delay capability and blocking assignment to implement the clock.


FIGURE B.8.10 A specification of a clock.
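The code of Figure B.8.10 did not survive the extraction; a clock along the lines described (alternating every simulation time unit, using the #n delay and a blocking assignment) could be written as:

  module clock_gen (output reg clock);
    initial clock = 0;           // start the clock low
    always  #1 clock = ~clock;   // after 1 time unit, toggle; repeats forever
  endmodule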

Next, we must be able to specify the operation of an edge-triggered register. In Verilog, this is done by using the sensitivity list on an always block and specifying as a trigger either the positive or negative edge of a binary variable with the notation posedge or negedge, respectively. Hence, the following Verilog code causes register A to be written with the value b at the positive edge of the clock:
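The code itself is missing from this extraction; it would be along these lines (with A declared as a reg and b as a wire or reg):

  always @(posedge clock)
    A <= b;   // A is updated with the value of b on each rising clock edge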

FIGURE B.8.11 A MIPS register file written in behavioral Verilog. This register file writes on the rising clock edge.

Throughout this chapter and the Verilog sections of Chapter 4, we will assume a positive edge-triggered design. Figure B.8.11 shows a Verilog specification of a MIPS register file that assumes two reads and one write, with only the write being clocked.
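Figure B.8.11 is not reproduced in this extraction; a behavioral register file consistent with that description (two combinational read ports, one clocked write port; the port names are assumptions) might be sketched as:

  module registerfile (input  wire        clock,
                       input  wire        RegWrite,
                       input  wire [4:0]  Read1, Read2, WriteReg,
                       input  wire [31:0] WriteData,
                       output wire [31:0] Data1, Data2);
    reg [31:0] RF [0:31];                    // 32 registers, each 32 bits wide

    assign Data1 = RF[Read1];                // reads are continuous (combinational)
    assign Data2 = RF[Read2];

    always @(posedge clock)                  // only the write is clocked
      if (RegWrite)
        RF[WriteReg] <= WriteData;
  endmodule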


Check Yourself

In the Verilog for the register file in Figure B.8.11, the output ports corresponding to the registers being read are assigned using a continuous assignment, but the register being written is assigned in an always block. Which of the following is the reason?

a. There is no special reason. It was simply convenient.

b. Because Data1 and Data2 are output ports and WriteData is an input port.

c. Because reading is a combinational event, while writing is a sequential event.

B.9 Memory Elements: SRAMs and DRAMs

static random access memory (SRAM) A memory where data is stored statically (as in flip-flops) rather than dynamically (as in DRAM). SRAMs are faster than DRAMs, but less dense and more expensive per bit.

Registers and register files provide the basic building blocks for small memories, but larger amounts of memory are built using either SRAMs (static random access memories) or DRAMs (dynamic random access memories). We first discuss SRAMs, which are somewhat simpler, and then turn to DRAMs.

SRAMs

SRAMs are simply integrated circuits that are memory arrays with (usually) a single access port that can provide either a read or a write. SRAMs have a fixed access time to any datum, though the read and write access characteristics often differ. An SRAM chip has a specific configuration in terms of the number of addressable locations, as well as the width of each addressable location. For example, a 4M × 8 SRAM provides 4M entries, each of which is 8 bits wide. Thus it will have 22 address lines (since 4M = 2^22), an 8-bit data output line, and an 8-bit single data input line. As with ROMs, the number of addressable locations is often called the height, with the number of bits per unit called the width. For a variety of technical reasons, the newest and fastest SRAMs are typically available in narrow configurations: ×1 and ×4. Figure B.9.1 shows the input and output signals for a 2M × 16 SRAM.

FIGURE B.9.1 A 2M × 16 SRAM showing the 21 address lines (2M = 2^21) and 16 data inputs, the 3 control lines, and the 16 data outputs.


B.9 Memory Elements: SRAMs <strong>and</strong> DRAMs B-59<br />

To initiate a read or write access, the Chip select signal must be made active.<br />

For reads, we must also activate the Output enable signal that controls whether or<br />

not the datum selected by the address is actually driven on the pins. The Output<br />

enable is useful for connecting multiple memories to a single-output bus <strong>and</strong> using<br />

Output enable to determine which memory drives the bus. The SRAM read access<br />

time is usually specified as the delay from the time that Output enable is true <strong>and</strong><br />

the address lines are valid until the time that the data is on the output lines. Typical<br />

read access times for SRAMs in 2004 varied from about 2–4 ns for the fastest CMOS<br />

parts, which tend to be somewhat smaller <strong>and</strong> narrower, to 8–20 ns for the typical<br />

largest parts, which in 2004 had more than 32 million bits of data. The dem<strong>and</strong> for<br />

low-power SRAMs for consumer products <strong>and</strong> digital appliances has grown greatly<br />

in the past five years; these SRAMs have much lower st<strong>and</strong>-by <strong>and</strong> access power,<br />

but usually are 5–10 times slower. Most recently, synchronous SRAMs—similar to<br />

the synchronous DRAMs, which we discuss in the next section—have also been<br />

developed.<br />

For writes, we must supply the data to be written <strong>and</strong> the address, as well as<br />

signals to cause the write to occur. When both the Write enable <strong>and</strong> Chip select are<br />

true, the data on the data input lines is written into the cell specified by the address.<br />

There are setup-time <strong>and</strong> hold-time requirements for the address <strong>and</strong> data lines,<br />

just as there were for D flip-flops <strong>and</strong> latches. In addition, the Write enable signal<br />

is not a clock edge but a pulse with a minimum width requirement. The time to<br />

complete a write is specified by the combination of the setup times, the hold times,<br />

<strong>and</strong> the Write enable pulse width.<br />

Large SRAMs cannot be built in the same way we build a register file because, unlike a register file where a 32-to-1 multiplexor might be practical, the 64K-to-1 multiplexor that would be needed for a 64K × 1 SRAM is totally impractical.

Rather than use a giant multiplexor, large memories are implemented with a shared<br />

output line, called a bit line, which multiple memory cells in the memory array can<br />

assert. To allow multiple sources to drive a single line, a three-state buffer (or tristate<br />

buffer) is used. A three-state buffer has two inputs—a data signal <strong>and</strong> an Output<br />

enable—<strong>and</strong> a single output, which is in one of three states: asserted, deasserted,<br />

or high impedance. The output of a tristate buffer is equal to the data input signal,<br />

either asserted or deasserted, if the Output enable is asserted, <strong>and</strong> is otherwise in a<br />

high-impedance state that allows another three-state buffer whose Output enable is<br />

asserted to determine the value of a shared output.<br />

Figure B.9.2 shows a set of three-state buffers wired to form a multiplexor with a<br />

decoded input. It is critical that the Output enable of at most one of the three-state<br />

buffers be asserted; otherwise, the three-state buffers may try to set the output line<br />

differently. By using three-state buffers in the individual cells of the SRAM, each<br />

cell that corresponds to a particular output can share the same output line. The use<br />

of a set of distributed three-state buffers is a more efficient implementation than a<br />

large centralized multiplexor. The three-state buffers are incorporated into the flip-flops that form the basic cells of the SRAM. Figure B.9.3 shows how a small 4 × 2 SRAM might be built, using D latches with an input called Enable that controls the three-state output.
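Because Verilog has a high-impedance value (z), the shared output line can be modeled directly. The sketch below is our own version of the four-buffer multiplexor of Figure B.9.2; it assumes, as the figure's caption requires, that at most one Select input is asserted at a time.

module tristate_mux4 (
  input  wire [3:0] data,    // Data 0 through Data 3
  input  wire [3:0] select,  // decoded enables; at most one may be asserted
  output wire       out      // the shared output (bit) line
);
  // Each continuous assignment models one three-state buffer: it drives
  // the line when its enable is asserted and floats (z) otherwise, so the
  // enabled buffer determines the value of the shared output.
  assign out = select[0] ? data[0] : 1'bz;
  assign out = select[1] ? data[1] : 1'bz;
  assign out = select[2] ? data[2] : 1'bz;
  assign out = select[3] ? data[3] : 1'bz;
endmodule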



[Figure B.9.2 shows four three-state buffers, each with a data input (Data 0–Data 3) and an Enable input driven by one of the decoded Select 0–Select 3 signals; all four Out pins are wired together to form the shared Output.]

FIGURE B.9.2 Four three-state buffers are used to form a multiplexor. Only one of the four Select inputs can be asserted. A three-state buffer with a deasserted Output enable has a high-impedance output that allows a three-state buffer whose Output enable is asserted to drive the shared output line.

The design in Figure B.9.3 eliminates the need for an enormous multiplexor; however, it still requires a very large decoder and a correspondingly large number of word lines. For example, in a 4M × 8 SRAM, we would need a 22-to-4M decoder and 4M word lines (which are the lines used to enable the individual flip-flops)! To circumvent this problem, large memories are organized as rectangular arrays and use a two-step decoding process. Figure B.9.4 shows how a 4M × 8 SRAM might be organized internally using a two-step decode. As we will see, the two-level decoding process is quite important in understanding how DRAMs operate.

Recently we have seen the development of both synchronous SRAMs (SSRAMs)<br />

<strong>and</strong> synchronous DRAMs (SDRAMs). The key capability provided by synchronous<br />

RAMs is the ability to transfer a burst of data from a series of sequential addresses<br />

within an array or row. The burst is defined by a starting address, supplied in the<br />

usual fashion, <strong>and</strong> a burst length. The speed advantage of synchronous RAMs<br />

comes from the ability to transfer the bits in the burst without having to specify<br />

additional address bits. Instead, a clock is used to transfer the successive bits in the<br />

burst. The elimination of the need to specify the address for the transfers within<br />

the burst significantly improves the rate for transferring the block of data. Because<br />

of this capability, synchronous SRAMs <strong>and</strong> DRAMs are rapidly becoming the<br />

RAMs of choice for building memory systems in computers. We discuss the use of<br />

synchronous DRAMs in a memory system in more detail in the next section <strong>and</strong><br />

in Chapter 5.
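The burst idea can be sketched behaviorally. The module below is only our illustration, not a real SDRAM or SSRAM interface: the starting address is supplied once, and a small counter, clocked like the data, steps through the following locations without any further address bits.

module burst_read (
  input  wire        clock,
  input  wire        start,           // begin a 4-word burst
  input  wire [20:0] start_address,
  output reg  [15:0] dout
);
  reg [15:0] mem [0:(1 << 21) - 1];   // the memory array (2M x 16 here)
  reg [20:0] addr;                    // internal address counter
  reg [2:0]  remaining;               // words left in the current burst

  always @(posedge clock) begin
    if (start) begin
      addr      <= start_address;     // a burst is a starting address plus a length
      remaining <= 3'd4;
    end else if (remaining != 0) begin
      dout      <= mem[addr];         // next sequential word, no new address needed
      addr      <= addr + 21'd1;
      remaining <= remaining - 3'd1;
    end
  end
endmodule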



DRAMs<br />

In a static RAM (SRAM), the value stored in a cell is kept on a pair of inverting gates,<br />

<strong>and</strong> as long as power is applied, the value can be kept indefinitely. In a dynamic<br />

RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor. A single<br />

transistor is then used to access this stored charge, either to read the value or to<br />

overwrite the charge stored there. Because DRAMs use only a single transistor per<br />

bit of storage, they are much denser <strong>and</strong> cheaper per bit. By comparison, SRAMs<br />

require four to six transistors per bit. Because DRAMs store the charge on a<br />

capacitor, it cannot be kept indefinitely <strong>and</strong> must periodically be refreshed. That is<br />

why this memory structure is called dynamic, as opposed to the static storage in a<br />

SRAM cell.<br />

To refresh the cell, we merely read its contents <strong>and</strong> write it back. The charge can<br />

be kept for several milliseconds, which might correspond to close to a million clock<br />

cycles. Today, single-chip memory controllers often h<strong>and</strong>le the refresh function<br />

independently of the processor. If every bit had to be read out of the DRAM <strong>and</strong><br />

then written back individually, with large DRAMs containing multiple megabytes,<br />

we would constantly be refreshing the DRAM, leaving no time for accessing it.<br />

Fortunately, DRAMs also use a two-level decoding structure, <strong>and</strong> this allows us<br />

to refresh an entire row (which shares a word line) with a read cycle followed<br />

immediately by a write cycle. Typically, refresh operations consume 1% to 2% of<br />

the active cycles of the DRAM, leaving the remaining 98% to 99% of the cycles<br />

available for reading <strong>and</strong> writing data.<br />
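As a rough worked example with assumed numbers (they are ours, not the text's): if a DRAM has 2048 rows, every row must be refreshed at least once every 4 ms, and one read-then-write refresh cycle takes about 40 ns, then refresh costs $2048 \times 40\,\text{ns} \approx 82\,\mu\text{s}$ out of every 4 ms, or about 2% of the cycles, consistent with the 1% to 2% figure above.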

Elaboration: How does a DRAM read <strong>and</strong> write the signal stored in a cell? The<br />

transistor inside the cell is a switch, called a pass transistor, that allows the value stored<br />

on the capacitor to be accessed for either reading or writing. Figure B.9.5 shows how<br />

the single-transistor cell looks. The pass transistor acts like a switch: when the signal<br />

on the word line is asserted, the switch is closed, connecting the capacitor to the bit<br />

line. If the operation is a write, then the value to be written is placed on the bit line. If<br />

the value is a 1, the capacitor will be charged. If the value is a 0, then the capacitor will<br />

be discharged. Reading is slightly more complex, since the DRAM must detect a very<br />

small charge stored in the capacitor. Before activating the word line for a read, the bit<br />

line is charged to the voltage that is halfway between the low <strong>and</strong> high voltage. Then, by<br />

activating the word line, the charge on the capacitor is read out onto the bit line. This<br />

causes the bit line to move slightly toward the high or low direction, <strong>and</strong> this change is<br />

detected with a sense amplifier, which can detect small changes in voltage.



[Figure B.9.5 shows the word line, pass transistor, capacitor, and bit line of the cell.]

FIGURE B.9.5 A single-transistor DRAM cell contains a capacitor that stores the cell contents and a transistor used to access the cell.

[Figure B.9.6 shows an 11-to-2048 row decoder driving a 2048 × 2048 array; Address[10–0] also feeds the column latches and a multiplexor that produces Dout.]

FIGURE B.9.6 A 4M × 1 DRAM is built with a 2048 × 2048 array. The row access uses 11 bits to select a row, which is then latched in 2048 1-bit latches. A multiplexor chooses the output bit from these 2048 latches. The RAS and CAS signals control whether the address lines are sent to the row decoder or column multiplexor.



DRAMs use a two-level decoder consisting of a row access followed by a column<br />

access, as shown in Figure B.9.6. The row access chooses one of a number of rows<br />

<strong>and</strong> activates the corresponding word line. The contents of all the columns in the<br />

active row are then stored in a set of latches. The column access then selects the<br />

data from the column latches. To save pins <strong>and</strong> reduce the package cost, the same<br />

address lines are used for both the row <strong>and</strong> column address; a pair of signals called<br />

RAS (Row Access Strobe) <strong>and</strong> CAS (Column Access Strobe) are used to signal the<br />

DRAM that either a row or column address is being supplied. Refresh is performed<br />

by simply reading the columns into the column latches <strong>and</strong> then writing the same<br />

values back. Thus, an entire row is refreshed in one cycle. The two-level addressing<br />

scheme, combined with the internal circuitry, makes DRAM access times much<br />

longer (by a factor of 5–10) than SRAM access times. In 2004, typical DRAM access<br />

times ranged from 45 to 65 ns; 256 Mbit DRAMs are in full production, <strong>and</strong> the<br />

first customer samples of 1 GB DRAMs became available in the first quarter of<br />

2004. The much lower cost per bit makes DRAM the choice for main memory,<br />

while the faster access time makes SRAM the choice for caches.<br />
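A behavioral sketch can make the two-step access concrete. The module below is our own illustration of the 2048 × 2048 organization of Figure B.9.6, with invented names; it models only the read path, latching a whole row into the column latches on the falling edge of RAS and selecting one bit on the falling edge of CAS, and it omits writes, refresh, precharge, and all real timing.

module dram_4m_x1 (
  input  wire        ras_n,     // active-low Row Access Strobe
  input  wire        cas_n,     // active-low Column Access Strobe
  input  wire [10:0] address,   // shared row/column address pins
  output reg         dout
);
  reg cell      [0:2047][0:2047];  // 4M one-bit cells as a 2048 x 2048 array
  reg row_latch [0:2047];          // the 2048 column latches

  integer i;

  // Row access: the selected word line copies an entire row into the
  // column latches.
  always @(negedge ras_n)
    for (i = 0; i < 2048; i = i + 1)
      row_latch[i] <= cell[address][i];

  // Column access: the same address pins now pick one of the latched bits.
  always @(negedge cas_n)
    dout <= row_latch[address];
endmodule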

You might observe that a 64M × 4 DRAM actually accesses 8K bits on every row access and then throws away all but 4 of those during a column access. DRAM

designers have used the internal structure of the DRAM as a way to provide<br />

higher b<strong>and</strong>width out of a DRAM. This is done by allowing the column address to<br />

change without changing the row address, resulting in an access to other bits in the<br />

column latches. To make this process faster <strong>and</strong> more precise, the address inputs<br />

were clocked, leading to the dominant form of DRAM in use today: synchronous<br />

DRAM or SDRAM.<br />

Since about 1999, SDRAMs have been the memory chip of choice for most<br />

cache-based main memory systems. SDRAMs provide fast access to a series of bits<br />

within a row by sequentially transferring all the bits in a burst under the control<br />

of a clock signal. In 2004, DDRRAMs (Double Data Rate RAMs), which are called<br />

double data rate because they transfer data on both the rising <strong>and</strong> falling edge of<br />

an externally supplied clock, were the most heavily used form of SDRAMs. As we<br />

discuss in Chapter 5, these high-speed transfers can be used to boost the b<strong>and</strong>width<br />

available out of main memory to match the needs of the processor <strong>and</strong> caches.<br />

Error Correction<br />

Because of the potential for data corruption in large memories, most computer<br />

systems use some sort of error-checking code to detect possible corruption of data.<br />

One simple code that is heavily used is a parity code. In a parity code the number<br />

of 1s in a word is counted; the word has odd parity if the number of 1s is odd <strong>and</strong>



Outputs:

            NSlite   EWlite
  NSgreen      1        0
  EWgreen      0        1

The machine can also be drawn as a graph whose arcs represent the next-state function, with labels on the arcs specifying the input condition as logic functions. Figure B.10.2 shows the graphical representation for this finite-state machine.

[Figure B.10.2 shows the two states NSgreen (output NSlite) and EWgreen (output EWlite); the arc from NSgreen to EWgreen is labeled EWcar, the arc from EWgreen back to NSgreen is labeled NScar, and each state has a self-loop for the complementary condition.]

FIGURE B.10.2 The graphical representation of the two-state traffic light controller. We simplified the logic functions on the state transitions. For example, the transition from NSgreen to EWgreen in the next-state table is $(\overline{\text{NScar}} \cdot \text{EWcar}) + (\text{NScar} \cdot \text{EWcar})$, which is equivalent to EWcar.

A finite-state machine can be implemented with a register to hold the current<br />

state <strong>and</strong> a block of combinational logic that computes the next-state function <strong>and</strong><br />

the output function. Figure B.10.3 shows how a finite-state machine with 4 bits of<br />

state, <strong>and</strong> thus up to 16 states, might look. To implement the finite-state machine<br />

in this way, we must first assign state numbers to the states. This process is called<br />

state assignment. For example, we could assign NSgreen to state 0 <strong>and</strong> EWgreen to<br />

state 1. The state register would contain a single bit. The next-state function would<br />

be given as<br />

$$\text{NextState} = (\overline{\text{CurrentState}} \cdot \text{EWcar}) + (\text{CurrentState} \cdot \overline{\text{NScar}})$$
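Figure B.10.4 gives a Verilog version of this controller; since that listing is not reproduced here, the following is our own minimal sketch of the same two-state machine, using the state assignment just described (NSgreen = 0, EWgreen = 1) and adding a reset input for completeness.

module traffic_light (
  input  wire clock,
  input  wire reset,     // added here so the state starts in NSgreen
  input  wire NScar,
  input  wire EWcar,
  output wire NSlite,
  output wire EWlite
);
  reg state;             // 0 = NSgreen, 1 = EWgreen

  // Next-state function from the equation above:
  // NextState = (not CurrentState and EWcar) or (CurrentState and not NScar)
  always @(posedge clock)
    if (reset) state <= 1'b0;
    else       state <= (~state & EWcar) | (state & ~NScar);

  // Output function from the table above.
  assign NSlite = ~state;
  assign EWlite =  state;
endmodule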



FIGURE B.10.4<br />

A Verilog version of the traffic light controller.<br />

Check<br />

Yourself<br />

What is the smallest number of states in a Moore machine for which a Mealy<br />

machine could have fewer states?<br />

a. Two, since there could be a one-state Mealy machine that might do the same<br />

thing.<br />

b. Three, since there could be a simple Moore machine that went to one of two<br />

different states <strong>and</strong> always returned to the original state after that. For such a<br />

simple machine, a two-state Mealy machine is possible.<br />

c. You need at least four states to exploit the advantages of a Mealy machine<br />

over a Moore machine.<br />

B.11 Timing Methodologies<br />

Throughout this appendix <strong>and</strong> in the rest of the text, we use an edge-triggered<br />

timing methodology. This timing methodology has an advantage in that it is<br />

simpler to explain <strong>and</strong> underst<strong>and</strong> than a level-triggered methodology. In this<br />

section, we explain this timing methodology in a little more detail <strong>and</strong> also<br />

introduce level-sensitive clocking. We conclude this section by briefly discussing



the issue of asynchronous signals <strong>and</strong> synchronizers, an important problem for<br />

digital designers.<br />

The purpose of this section is to introduce the major concepts in clocking<br />

methodology. The section makes some important simplifying assumptions; if you<br />

are interested in underst<strong>and</strong>ing timing methodology in more detail, consult one of<br />

the references listed at the end of this appendix.<br />

We use an edge-triggered timing methodology because it is simpler to explain<br />

<strong>and</strong> has fewer rules required for correctness. In particular, if we assume that all<br />

clocks arrive at the same time, we are guaranteed that a system with edge-triggered<br />

registers between blocks of combinational logic can operate correctly without races<br />

if we simply make the clock long enough. A race occurs when the contents of a<br />

state element depend on the relative speed of different logic elements. In an edge-triggered design, the clock cycle must be long enough to accommodate the path

from one flip-flop through the combinational logic to another flip-flop where it<br />

must satisfy the setup-time requirement. Figure B.11.1 shows this requirement for<br />

a system using rising edge-triggered flip-flops. In such a system the clock period<br />

(or cycle time) must be at least as large as<br />

$$t_{\text{prop}} + t_{\text{combinational}} + t_{\text{setup}}$$

for the worst-case values of these three delays, which are defined as follows:<br />

■ $t_{\text{prop}}$ is the time for a signal to propagate through a flip-flop; it is also sometimes called clock-to-Q.

■ $t_{\text{combinational}}$ is the longest delay for any combinational logic (which by definition is surrounded by two flip-flops).

■ $t_{\text{setup}}$ is the time before the rising clock edge that the input to a flip-flop must be valid.
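As a numeric illustration with assumed values (they are ours, not the text's): if $t_{\text{prop}} = 0.15$ ns, $t_{\text{combinational}} = 0.70$ ns, and $t_{\text{setup}} = 0.15$ ns, then the clock period must be at least $0.15 + 0.70 + 0.15 = 1.0$ ns, limiting the clock rate to 1 GHz.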

[Figure B.11.1 shows two rising-edge-triggered D flip-flops separated by a combinational logic block; the delays t_prop, t_combinational, and t_setup are marked along the path.]

FIGURE B.11.1 In an edge-triggered design, the clock must be long enough to allow signals to be valid for the required setup time before the next clock edge. The time for a flip-flop input to propagate to the flip-flop outputs is $t_{\text{prop}}$; the signal then takes $t_{\text{combinational}}$ to travel through the combinational logic and must be valid $t_{\text{setup}}$ before the next clock edge.



clock skew: The difference in absolute time between the times when two state elements see a clock edge.

We make one simplifying assumption: the hold-time requirements are satisfied,<br />

which is almost never an issue with modern logic.<br />

One additional complication that must be considered in edge-triggered designs<br />

is clock skew. Clock skew is the difference in absolute time between when two state<br />

elements see a clock edge. Clock skew arises because the clock signal will often<br />

use two different paths, with slightly different delays, to reach two different state<br />

elements. If the clock skew is large enough, it may be possible for a state element to<br />

change <strong>and</strong> cause the input to another flip-flop to change before the clock edge is<br />

seen by the second flip-flop.<br />

Figure B.11.2 illustrates this problem, ignoring setup time <strong>and</strong> flip-flop<br />

propagation delay. To avoid incorrect operation, the clock period is increased to<br />

allow for the maximum clock skew. Thus, the clock period must be longer than<br />

$$t_{\text{prop}} + t_{\text{combinational}} + t_{\text{setup}} + t_{\text{skew}}$$

With this constraint on the clock period, the two clocks can also arrive in the<br />

opposite order, with the second clock arriving $t_{\text{skew}}$ earlier, and the circuit will work correctly.

[Figure B.11.2 shows two flip-flops separated by a combinational logic block with delay Δ; the clock arrives at the first flip-flop at time t and at the second flip-flop after t + Δ.]

FIGURE B.11.2 Illustration of how clock skew can cause a race, leading to incorrect operation. Because of the difference in when the two flip-flops see the clock, the signal that is stored into the first flip-flop can race forward and change the input to the second flip-flop before the clock arrives at the second flip-flop.

level-sensitive clocking: A timing methodology in which state changes occur at either high or low clock levels but are not instantaneous as such changes are in edge-triggered designs.

Designers reduce clock-skew problems by carefully routing the clock

signal to minimize the difference in arrival times. In addition, smart designers also<br />

provide some margin by making the clock a little longer than the minimum; this<br />

allows for variation in components as well as in the power supply. Since clock skew<br />

can also affect the hold-time requirements, minimizing the size of the clock skew<br />

is important.<br />

Edge-triggered designs have two drawbacks: they require extra logic <strong>and</strong> they<br />

may sometimes be slower. Just looking at the D flip-flop versus the level-sensitive<br />

latch that we used to construct the flip-flop shows that edge-triggered design<br />

requires more logic. An alternative is to use level-sensitive clocking. Because state<br />

changes in a level-sensitive methodology are not instantaneous, a level-sensitive<br />

scheme is slightly more complex <strong>and</strong> requires additional care to make it operate<br />

correctly.



Level-Sensitive Timing<br />

In level-sensitive timing, the state changes occur at either high or low levels, but<br />

they are not instantaneous as they are in an edge-triggered methodology. Because of<br />

the noninstantaneous change in state, races can easily occur. To ensure that a level-sensitive design will also work correctly if the clock is slow enough, designers use two-phase clocking. Two-phase clocking is a scheme that makes use of two nonoverlapping clock signals. Since the two clocks, typically called φ1 and φ2, are nonoverlapping, at

most one of the clock signals is high at any given time, as Figure B.11.3 shows. We<br />

can use these two clocks to build a system that contains level-sensitive latches but is<br />

free from any race conditions, just as the edge-triggered designs were.<br />

[Figure B.11.3 shows the waveforms of φ1 and φ2 and marks their nonoverlapping periods.]

FIGURE B.11.3 A two-phase clocking scheme showing the cycle of each clock and the nonoverlapping periods.

[Figure B.11.4 shows a chain of alternating latches and combinational logic blocks: a latch clocked by φ1, a combinational block, a latch clocked by φ2, another combinational block, and a final latch clocked by φ1.]

FIGURE B.11.4 A two-phase timing scheme with alternating latches showing how the system operates on both clock phases. The output of a latch is stable on the opposite phase from its C input. Thus, the first block of combinational inputs has a stable input during φ2, and its output is latched by φ2. The second (rightmost) combinational block operates in just the opposite fashion, with stable inputs during φ1. Thus, the delays through the combinational blocks determine the minimum time that the respective clocks must be asserted. The size of the nonoverlapping period is determined by the maximum clock skew and the minimum delay of any logic block.

One simple way to design such a system is to alternate the use of latches that are open on φ1 with latches that are open on φ2. Because both clocks are not asserted at the same time, a race cannot occur. If the input to a combinational block is a φ1 clock, then its output is latched by a φ2 clock, which is open only during φ2, when the input latch is closed and hence has a valid output. Figure B.11.4 shows how a system with two-phase timing and alternating latches operates. As in an edge-triggered design, we must pay attention to clock skew, particularly between the two



clock phases. By increasing the amount of nonoverlap between the two phases, we<br />

can reduce the potential margin of error. Thus, the system is guaranteed to operate<br />

correctly if each phase is long enough <strong>and</strong> if there is large enough nonoverlap<br />

between the phases.<br />
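In Verilog, a level-sensitive latch is written as an always block without an edge qualifier, so one φ1/φ2 stage of the alternating-latch structure in Figure B.11.4 can be sketched as below. The names and the trivial stand-in for the combinational block are our own.

module two_phase_stage (
  input  wire       phi1,
  input  wire       phi2,
  input  wire [7:0] d,
  output reg  [7:0] q1,    // latch open while phi1 is high
  output reg  [7:0] q2     // latch open while phi2 is high
);
  // A stand-in for the combinational logic block between the latches.
  wire [7:0] comb_out = q1 + 8'd1;

  always @(phi1 or d)
    if (phi1) q1 <= d;          // transparent only during phi1

  always @(phi2 or comb_out)
    if (phi2) q2 <= comb_out;   // transparent only during phi2, when q1 is stable
endmodule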

Asynchronous Inputs <strong>and</strong> Synchronizers<br />

By using a single clock or a two-phase clock, we can eliminate race conditions<br />

if clock-skew problems are avoided. Unfortunately, it is impractical to make an<br />

entire system function with a single clock <strong>and</strong> still keep the clock skew small.<br />

While the CPU may use a single clock, I/O devices will probably have their own<br />

clock. An asynchronous device may communicate with the CPU through a series<br />

of h<strong>and</strong>shaking steps. To translate the asynchronous input to a synchronous signal<br />

that can be used to change the state of a system, we need to use a synchronizer,<br />

whose inputs are the asynchronous signal <strong>and</strong> a clock <strong>and</strong> whose output is a signal<br />

synchronous with the input clock.<br />

Our first attempt to build a synchronizer uses an edge-triggered D flip-flop,<br />

whose D input is the asynchronous signal, as Figure B.11.5 shows. Because we<br />

communicate with a h<strong>and</strong>shaking protocol, it does not matter whether we detect<br />

the asserted state of the asynchronous signal on one clock or the next, since the<br />

signal will be held asserted until it is acknowledged. Thus, you might think that this<br />

simple structure is enough to sample the signal accurately, which would be the case<br />

except for one small problem.<br />

metastability: A situation that occurs if a signal is sampled when it is not stable for the required setup and hold times, possibly causing the sampled value to fall in the indeterminate region between a high and low value.

[Figure B.11.5 shows a single D flip-flop whose D input is the asynchronous input, whose C input is the clock, and whose Q output is the synchronous output.]

FIGURE B.11.5 A synchronizer built from a D flip-flop is used to sample an asynchronous signal to produce an output that is synchronous with the clock. This “synchronizer” will not work properly!

The problem is a situation called metastability. Suppose the asynchronous<br />

signal is transitioning between high <strong>and</strong> low when the clock edge arrives. Clearly,<br />

it is not possible to know whether the signal will be latched as high or low. That<br />

problem we could live with. Unfortunately, the situation is worse: when the signal<br />

that is sampled is not stable for the required setup <strong>and</strong> hold times, the flip-flop may<br />

go into a metastable state. In such a state, the output will not have a legitimate high<br />

or low value, but will be in the indeterminate region between them. Furthermore,



the flip-flop is not guaranteed to exit this state in any bounded amount of time.<br />

Some logic blocks that look at the output of the flip-flop may see its output as 0,<br />

while others may see it as 1. This situation is called a synchronizer failure.<br />

In a purely synchronous system, synchronizer failure can be avoided by ensuring<br />

that the setup <strong>and</strong> hold times for a flip-flop or latch are always met, but this is<br />

impossible when the input is asynchronous. Instead, the only solution possible is to<br />

wait long enough before looking at the output of the flip-flop to ensure that its output<br />

is stable, <strong>and</strong> that it has exited the metastable state, if it ever entered it. How long is<br />

long enough? Well, the probability that the flip-flop will stay in the metastable state<br />

decreases exponentially, so after a very short time the probability that the flip-flop<br />

is in the metastable state is very low; however, the probability never reaches 0! So<br />

designers wait long enough such that the probability of a synchronizer failure is very<br />

low, <strong>and</strong> the time between such failures will be years or even thous<strong>and</strong>s of years.<br />

For most flip-flop designs, waiting for a period that is several times longer than<br />

the setup time makes the probability of synchronization failure very low. If the<br />

clock period is longer than the potential metastability period (which is likely), then a

safe synchronizer can be built with two D flip-flops, as Figure B.11.6 shows. If you<br />

are interested in reading more about these problems, look into the references.<br />

synchronizer failure: A situation in which a flip-flop enters a metastable state and where some logic blocks reading the output of the flip-flop see a 0 while others see a 1.

[Figure B.11.6 shows two D flip-flops in series: the asynchronous input feeds the first flip-flop, both flip-flops share the clock, and the output of the second flip-flop is the synchronous output.]

FIGURE B.11.6 This synchronizer will work correctly if the period of metastability that we wish to guard against is less than the clock period. Although the output of the first flip-flop may be metastable, it will not be seen by any other logic element until the second clock, when the second D flip-flop samples the signal, which by that time should no longer be in a metastable state.
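The two flip-flop structure of Figure B.11.6 is short enough to write out; this is our own sketch with invented names, not the book's figure.

module synchronizer (
  input  wire clock,
  input  wire async_in,     // the asynchronous input
  output reg  sync_out      // synchronous with clock, one extra period late
);
  reg first_stage;          // may briefly be metastable after sampling

  always @(posedge clock) begin
    first_stage <= async_in;     // first flip-flop samples the raw input
    sync_out    <= first_stage;  // second flip-flop waits a full clock period
  end
endmodule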

Suppose we have a design with very large clock skew—longer than the register<br />

propagation time. Is it always possible for such a design to slow the clock down<br />

enough to guarantee that the logic operates properly?<br />

a. Yes, if the clock is slow enough the signals can always propagate <strong>and</strong> the<br />

design will work, even if the skew is very large.<br />

b. No, since it is possible that two registers see the same clock edge far enough<br />

apart that a register is triggered, <strong>and</strong> its outputs propagated <strong>and</strong> seen by a<br />

second register with the same clock edge.<br />

Check<br />

Yourself<br />

propagation time: The time required for an input to a flip-flop to propagate to the outputs of the flip-flop.


B.14 Exercises

B.10 [15] §§B.2, B.3 Prove that a two-input multiplexor is also universal by<br />

showing how to build the NAND (or NOR) gate using a multiplexor.<br />

B.11 [5] §§4.2, B.2, B.3 Assume that X consists of 3 bits, x2 x1 x0. Write four<br />

logic functions that are true if <strong>and</strong> only if<br />

■ X contains only one 0<br />

■ X contains an even number of 0s<br />

■ X when interpreted as an unsigned binary number is less than 4<br />

■ X when interpreted as a signed (two’s complement) number is negative<br />

B.12 [5] §§4.2, B.2, B.3 Implement the four functions described in Exercise<br />

B.11 using a PLA.<br />

B.13 [5] §§4.2, B.2, B.3 Assume that X consists of 3 bits, x2 x1 x0, and Y consists of 3 bits, y2 y1 y0. Write logic functions that are true if and only if

■ X < Y, where X and Y are thought of as unsigned binary numbers

■ X < Y, where X and Y are thought of as signed (two's complement) numbers

■ X = Y

Use a hierarchical approach that can be extended to larger numbers of bits. Show how you can extend it to 6-bit comparison.

B.14 [5] §§B.2, B.3 Implement a switching network that has two data inputs<br />

(A <strong>and</strong> B), two data outputs (C <strong>and</strong> D), <strong>and</strong> a control input (S). If S equals 1, the<br />

network is in pass-through mode, <strong>and</strong> C should equal A, <strong>and</strong> D should equal B. If<br />

S equals 0, the network is in crossing mode, <strong>and</strong> C should equal B, <strong>and</strong> D should<br />

equal A.<br />

B.15 [15] §§B.2, B.3 Derive the product-of-sums representation for E shown<br />

on page B-11 starting with the sum-of-products representation. You will need to<br />

use DeMorgan’s theorems.<br />

B.16 [30] §§B.2, B.3 Give an algorithm for constructing the sum-of-products representation for an arbitrary logic equation consisting of AND, OR, and NOT.

The algorithm should be recursive <strong>and</strong> should not construct the truth table in the<br />

process.<br />

B.17 [5] §§B.2, B.3 Show a truth table for a multiplexor (inputs A, B, <strong>and</strong> S;<br />

output C ), using don’t cares to simplify the table where possible.



B.18 [5] §B.3 What is the function implemented by the following Verilog<br />

modules:<br />

module FUNC1 (I0, I1, S, out);<br />

input I0, I1;<br />

input S;<br />

output out;<br />

out = S? I1: I0;<br />

endmodule<br />

module FUNC2 (out,ctl,clk,reset);<br />

output [7:0] out;<br />

input ctl, clk, reset;<br />

reg [7:0] out;<br />

always @(posedge clk)<br />

if (reset) begin<br />

out



[Figure: a 16-bit Adder and a 16-bit Register with In, Out, Load, Rst, and Clk signals.]

B.22 [20] §§B.3, B.4, B.5 Section 3.3 presents basic operation and possible implementations of multipliers. A basic unit of such implementations is a shift-and-add unit. Show a Verilog implementation for this unit. Show how you can use this unit to build a 32-bit multiplier.

B.23 [20] §§B.3, B.4, B.5 Repeat Exercise B.22, but for an unsigned divider rather than a multiplier.

B.24 [15] §B.5 The ALU supported set on less than (slt) using just the sign bit of the adder. Let's try a set on less than operation using the values $-7_{ten}$ and $6_{ten}$. To make it simpler to follow the example, let's limit the binary representations to 4 bits: $1001_{two}$ and $0110_{two}$.

$$1001_{two} - 0110_{two} = 1001_{two} + 1010_{two} = 0011_{two}$$

This result would suggest that $-7 > 6$, which is clearly wrong. Hence, we must factor in overflow in the decision. Modify the 1-bit ALU in Figure B.5.10 on page B-33 to handle slt correctly. Make your changes on a photocopy of this figure to save time.

B.25 [20] §B.6 A simple check for overflow during addition is to see if the<br />

CarryIn to the most significant bit is not the same as the CarryOut of the most<br />

significant bit. Prove that this check is the same as in Figure 3.2.<br />

B.26 [5] §B.6 Rewrite the equations on page B-44 for a carry-lookahead logic for a 16-bit adder using a new notation. First, use the names for the CarryIn signals of the individual bits of the adder. That is, use c4, c8, c12, … instead of C1, C2, C3, …. In addition, let $P_{i,j}$ mean a propagate signal for bits i to j, and $G_{i,j}$ mean a generate signal for bits i to j. For example, the equation

$$C2 = G1 + (P1 \cdot G0) + (P1 \cdot P0 \cdot c0)$$



can be rewritten as

$$c8 = G_{7,4} + (P_{7,4} \cdot G_{3,0}) + (P_{7,4} \cdot P_{3,0} \cdot c0)$$

This more general notation is useful in creating wider adders.

B.27 [15] §B.6 Write the equations for the carry-lookahead logic for a 64-<br />

bit adder using the new notation from Exercise B.26 <strong>and</strong> using 16-bit adders as<br />

building blocks. Include a drawing similar to Figure B.6.3 in your solution.<br />

B.28 [10] §B.6 Now calculate the relative performance of adders. Assume that<br />

hardware corresponding to any equation containing only OR or AND terms, such<br />

as the equations for pi <strong>and</strong> gi on page B-40, takes one time unit T. Equations that<br />

consist of the OR of several AND terms, such as the equations for c1, c2, c3, <strong>and</strong><br />

c4 on page B-40, would thus take two time units, 2T. The reason is it would take T<br />

to produce the AND terms <strong>and</strong> then an additional T to produce the result of the<br />

OR. Calculate the numbers <strong>and</strong> performance ratio for 4-bit adders for both ripple<br />

carry <strong>and</strong> carry lookahead. If the terms in equations are further defined by other<br />

equations, then add the appropriate delays for those intermediate equations, <strong>and</strong><br />

continue recursively until the actual input bits of the adder are used in an equation.<br />

Include a drawing of each adder labeled with the calculated delays <strong>and</strong> the path of<br />

the worst-case delay highlighted.<br />

B.29 [15] §B.6 This exercise is similar to Exercise B.28, but this time calculate<br />

the relative speeds of a 16-bit adder using ripple carry only, ripple carry of 4-bit<br />

groups that use carry lookahead, <strong>and</strong> the carry-lookahead scheme on page B-39.<br />

B.30 [15] §B.6 This exercise is similar to Exercises B.28 <strong>and</strong> B.29, but this<br />

time calculate the relative speeds of a 64-bit adder using ripple carry only, ripple<br />

carry of 4-bit groups that use carry lookahead, ripple carry of 16-bit groups that use<br />

carry lookahead, <strong>and</strong> the carry-lookahead scheme from Exercise B.27.<br />

B.31 [10] §B.6 Instead of thinking of an adder as a device that adds two<br />

numbers <strong>and</strong> then links the carries together, we can think of the adder as a hardware<br />

device that can add three inputs together (ai, bi, ci) <strong>and</strong> produce two outputs<br />

(s, ci 1). When adding two numbers together, there is little we can do with this<br />

observation. When we are adding more than two oper<strong>and</strong>s, it is possible to reduce<br />

the cost of the carry. The idea is to form two independent sums, called S (sum bits)<br />

<strong>and</strong> C (carry bits). At the end of the process, we need to add C <strong>and</strong> S together<br />

using a normal adder. This technique of delaying carry propagation until the end<br />

of a sum of numbers is called carry save addition. The block drawing on the lower<br />

right of Figure B.14.1 (see below) shows the organization, with two levels of carry<br />

save adders connected by a single normal adder.<br />

Calculate the delays to add four 16-bit numbers using full carry-lookahead adders<br />

versus carry save with a carry-lookahead adder forming the final sum. (The time<br />

unit T in Exercise B.28 is the same.)
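For context only (this is background for the carry-save idea, not a solution to the delay calculation asked for above), one level of carry-save addition can be written in a few lines of Verilog; the module and port names are our own.

module carry_save_adder (
  input  wire [15:0] a, b, d,
  output wire [15:0] s,   // sum bits
  output wire [15:0] c    // carry bits; the final result is s + (c << 1)
);
  // Each bit position is a full adder whose carry is saved rather than
  // propagated, so a + b + d is reduced to the two words s and c.
  assign s = a ^ b ^ d;
  assign c = (a & b) | (a & d) | (b & d);
endmodule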



First, show the block organization of the 16-bit carry save adders to add these 16<br />

terms, as shown on the right in Figure B.14.1. Then calculate the delays to add these<br />

16 numbers. Compare this time to the iterative multiplication scheme in Chapter<br />

3 but only assume 16 iterations using a 16-bit adder that has full carry lookahead<br />

whose speed was calculated in Exercise B.29.<br />

B.33 [10] §B.6 There are times when we want to add a collection of numbers<br />

together. Suppose you wanted to add four 4-bit numbers (A, B, E, F) using 1-bit<br />

full adders. Let’s ignore carry lookahead for now. You would likely connect the<br />

1-bit adders in the organization at the top of Figure B.14.1. Below the traditional<br />

organization is a novel organization of full adders. Try adding four numbers using<br />

both organizations to convince yourself that you get the same answer.<br />

B.34 [5] §B.6 First, show the block organization of the 16-bit carry save<br />

adders to add these 16 terms, as shown in Figure B.14.1. Assume that the time delay<br />

through each 1-bit adder is 2T. Calculate the time of adding four 4-bit numbers to<br />

the organization at the top versus the organization at the bottom of Figure B.14.1.<br />

B.35 [5] §B.8 Quite often, you would expect that given a timing diagram<br />

containing a description of changes that take place on a data input D <strong>and</strong> a clock<br />

input C (as in Figures B.8.3 <strong>and</strong> B.8.6 on pages B-52 <strong>and</strong> B-54, respectively), there<br />

would be differences between the output waveforms (Q) for a D latch and a D flip-flop.

In a sentence or two, describe the circumstances (e.g., the nature of the inputs)<br />

for which there would not be any difference between the two output waveforms.<br />

B.36 [5] §B.8 Figure B.8.8 on page B-55 illustrates the implementation of the<br />

register file for the MIPS datapath. Pretend that a new register file is to be built,<br />

but that there are only two registers <strong>and</strong> only one read port, <strong>and</strong> that each register<br />

has only 2 bits of data. Redraw Figure B.8.8 so that every wire in your diagram<br />

corresponds to only 1 bit of data (unlike the diagram in Figure B.8.8, in which<br />

some wires are 5 bits and some wires are 32 bits). Redraw the registers using D flip-flops.

You do not need to show how to implement a D flip-flop or a multiplexor.<br />

B.37 [10] §B.10 A friend would like you to build an “electronic eye” for use<br />

as a fake security device. The device consists of three lights lined up in a row,<br />

controlled by the outputs Left, Middle, <strong>and</strong> Right, which, if asserted, indicate that<br />

a light should be on. Only one light is on at a time, <strong>and</strong> the light “moves” from<br />

left to right <strong>and</strong> then from right to left, thus scaring away thieves who believe that<br />

the device is monitoring their activity. Draw the graphical representation for the<br />

finite-state machine used to specify the electronic eye. Note that the rate of the eye’s<br />

movement will be controlled by the clock speed (which should not be too great)<br />

<strong>and</strong> that there are essentially no inputs.<br />

B.38 [10] §B.10 Assign state numbers to the states of the finite-state machine<br />

you constructed for Exercise B.37 <strong>and</strong> write a set of logic equations for each of the<br />

outputs, including the next-state bits.



B.39 [15] §§B.2, B.8, B.10 Construct a 3-bit counter using three D flip-flops and a selection of gates. The inputs should consist of a signal that resets the

counter to 0, called reset, <strong>and</strong> a signal to increment the counter, called inc. The<br />

outputs should be the value of the counter. When the counter has value 7 <strong>and</strong> is<br />

incremented, it should wrap around <strong>and</strong> become 0.<br />

B.40 [20] §B.10 A Gray code is a sequence of binary numbers with the property<br />

that no more than 1 bit changes in going from one element of the sequence to<br />

another. For example, here is a 3-bit binary Gray code: 000, 001, 011, 010, 110,<br />

111, 101, <strong>and</strong> 100. Using three D flip-flops <strong>and</strong> a PLA, construct a 3-bit Gray code<br />

counter that has two inputs: reset, which sets the counter to 000, <strong>and</strong> inc, which<br />

makes the counter go to the next value in the sequence. Note that the code is cyclic,<br />

so that the value after 100 in the sequence is 000.<br />

B.41 [25] §B.10 We wish to add a yellow light to our traffic light example on<br />

page B-68. We will do this by changing the clock to run at 0.25 Hz (a 4-second clock<br />

cycle time), which is the duration of a yellow light. To prevent the green <strong>and</strong> red lights<br />

from cycling too fast, we add a 30-second timer. The timer has a single input, called<br />

TimerReset, which restarts the timer, <strong>and</strong> a single output, called TimerSignal, which<br />

indicates that the 30-second period has expired. Also, we must redefine the traffic<br />

signals to include yellow. We do this by defining two output signals for each light:

green <strong>and</strong> yellow. If the output NSgreen is asserted, the green light is on; if the output<br />

NSyellow is asserted, the yellow light is on. If both signals are off, the red light is on. Do<br />

not assert both the green <strong>and</strong> yellow signals at the same time, since American drivers<br />

will certainly be confused, even if European drivers underst<strong>and</strong> what this means! Draw<br />

the graphical representation for the finite-state machine for this improved controller.<br />

Choose names for the states that are different from the names of the outputs.<br />

B.42 [15] §B.10 Write down the next-state <strong>and</strong> output-function tables for the<br />

traffic light controller described in Exercise B.41.<br />

B.43 [15] §§B.2, B.10 Assign state numbers to the states in the traffic light

example of Exercise B.41 <strong>and</strong> use the tables of Exercise B.42 to write a set of logic<br />

equations for each of the outputs, including the next-state outputs.<br />

B.44 [15] §§B.3, B.10 Implement the logic equations of Exercise B.43 as a<br />

PLA.<br />

§B.2, page B-8: No. If A = 1, C = 1, B = 0, the first is true, but the second is false.

§B.3, page B-20: C.<br />

§B.4, page B-22: They are all exactly the same.<br />

§B.4, page B-26: A = 0, B = 1.

§B.5, page B-38: 2.<br />

§B.6, page B-47: 1.<br />

§B.8, page B-58: c.<br />

§B.10, page B-72: b.<br />

§B.11, page B-77: b.<br />

Answers to<br />

Check Yourself




Index<br />

Note: Online information is listed by chapter <strong>and</strong> section number followed by page numbers (OL3.11-7). Page references preceded<br />

by a single letter with hyphen refer to appendices.<br />

1-bit ALU, B-26–29. See also Arithmetic<br />

logic unit (ALU)<br />

adder, B-27<br />

CarryOut, B-28<br />

for most significant bit, B-33<br />

illustrated, B-29<br />

logical unit for AND/OR, B-27<br />

performing AND, OR, <strong>and</strong> addition,<br />

B-31, B-33<br />

32-bit ALU, B-29–38. See also Arithmetic<br />

logic unit (ALU)<br />

defining in Verilog, B-35–38<br />

from 31 copies of 1-bit ALU, B-34<br />

illustrated, B-36<br />

ripple carry adder, B-29<br />

tailoring to MIPS, B-31–35<br />

with 32 1-bit ALUs, B-30<br />

32-bit immediate oper<strong>and</strong>s, 112–113<br />

7090/7094 hardware, OL3.11-7<br />

A<br />

Absolute references, 126<br />

Abstractions<br />

hardware/software interface, 22<br />

principle, 22<br />

to simplify design, 11<br />

Accumulator architectures, OL2.21-2<br />

Acronyms, 9<br />

Active matrix, 18<br />

add (Add), 64<br />

add.d (FP Add Double), A-73<br />

add.s (FP Add Single), A-74<br />

Add unsigned instruction, 180<br />

addi (Add Immediate), 64<br />

Addition, 178–182. See also Arithmetic<br />

binary, 178–179<br />

floating-point, 203–206, 211, A-73–74<br />

instructions, A-51<br />

oper<strong>and</strong>s, 179<br />

signific<strong>and</strong>s, 203<br />

speed, 182<br />

addiu (Add Imm.Unsigned), 119<br />

Address interleaving, 381<br />

Address select logic, D-24, D-25<br />

Address space, 428, 431<br />

extending, 479<br />

flat, 479<br />

ID (ASID), 446<br />

inadequate, OL5.17-6<br />

shared, 519–520<br />

single physical, 517<br />

unmapped, 450<br />

virtual, 446<br />

Address translation<br />

for ARM cortex-A8, 471<br />

defined, 429<br />

fast, 438–439<br />

for Intel core i7, 471<br />

TLB for, 438–439<br />

Address-control lines, D-26<br />

Addresses<br />

32-bit immediates, 113–116<br />

base, 69<br />

byte, 69<br />

defined, 68<br />

memory, 77<br />

virtual, 428–431, 450<br />

Addressing<br />

32-bit immediates, 113–116<br />

base, 116<br />

displacement, 116<br />

immediate, 116<br />

in jumps <strong>and</strong> branches, 113–116<br />

MIPS modes, 116–118<br />

PC-relative, 114, 116<br />

pseudodirect, 116<br />

register, 116<br />

x86 modes, 152<br />

Addressing modes, A-45–47<br />

desktop architectures, E-6<br />

addu (Add Unsigned), 64<br />

Advanced Vector Extensions (AVX), 225,<br />

227<br />

AGP, C-9<br />

Algol-60, OL2.21-7<br />

Aliasing, 444<br />

Alignment restriction, 69–70<br />

All-pairs N-body algorithm, C-65<br />

Alpha architecture<br />

bit count instructions, E-29<br />

floating-point instructions, E-28<br />

instructions, E-27–29<br />

no divide, E-28<br />

PAL code, E-28<br />

unaligned load-store, E-28<br />

VAX floating-point formats, E-29<br />

ALU control, 259–261. See also<br />

Arithmetic logic unit (ALU)<br />

bits, 260<br />

logic, D-6<br />

mapping to gates, D-4–7<br />

truth tables, D-5<br />

ALU control block, 263<br />

defined, D-4<br />

generating ALU control bits, D-6<br />

ALUOp, 260, D-6<br />

bits, 260, 261<br />

control signal, 263<br />

Amazon Web Services (AWS), 425<br />

AMD Opteron X4 (Barcelona), 543, 544<br />

AMD64, 151, 224, OL2.21-6<br />

Amdahl’s law, 401, 503<br />

corollary, 49<br />

defined, 49<br />

fallacy, 556<br />

<strong>and</strong> (AND), 64<br />

AND gates, B-12, D-7<br />

AND operation, 88<br />

AND operation, A-52, B-6<br />

<strong>and</strong>i (And Immediate), 64<br />

Annual failure rate (AFR), 418<br />

versus MTTF of disks, 419–420

Antidependence, 338<br />

Antifuse, B-78<br />

Apple computer, OL1.12-7<br />




Apple iPad 2 A1395, 20<br />

logic board of, 20<br />

processor integrated circuit of, 21<br />

Application binary interface (ABI), 22<br />

Application programming interfaces<br />

(APIs)<br />

defined, C-4<br />

graphics, C-14<br />

Architectural registers, 347<br />

Arithmetic, 176–236<br />

addition, 178–182<br />

addition <strong>and</strong> subtraction, 178–182<br />

division, 189–195<br />

fallacies <strong>and</strong> pitfalls, 229–232<br />

floating-point, 196–222<br />

historical perspective, 236<br />

multiplication, 183–188<br />

parallelism <strong>and</strong>, 222–223<br />

Streaming SIMD Extensions <strong>and</strong><br />

advanced vector extensions in x86,<br />

224–225<br />

subtraction, 178–182<br />

subword parallelism, 222–223<br />

subword parallelism <strong>and</strong> matrix<br />

multiply, 225–228<br />

Arithmetic instructions. See also<br />

Instructions<br />

desktop RISC, E-11<br />

embedded RISC, E-14<br />

logical, 251<br />

MIPS, A-51–57<br />

operands, 66–73

Arithmetic intensity, 541<br />

Arithmetic logic unit (ALU). See also<br />

ALU control; Control units<br />

1-bit, B-26–29<br />

32-bit, B-29–38<br />

before forwarding, 309<br />

branch datapath, 254<br />

hardware, 180<br />

memory-reference instruction use, 245<br />

for register values, 252<br />

R-format operations, 253<br />

signed-immediate input, 312<br />

ARM Cortex-A8, 244, 345–346<br />

address translation for, 471<br />

caches in, 472<br />

data cache miss rates for, 474<br />

memory hierarchies of, 471–475<br />

performance of, 473–475<br />

specification, 345<br />

TLB hardware for, 471<br />

ARM instructions, 145–147<br />

12-bit immediate field, 148<br />

addressing modes, 145<br />

block loads <strong>and</strong> stores, 149<br />

brief history, OL2.21-5<br />

calculations, 145–146<br />

compare <strong>and</strong> conditional branch,<br />

147–148<br />

condition field, 324<br />

data transfer, 146<br />

features, 148–149<br />

formats, 148<br />

logical, 149<br />

MIPS similarities, 146<br />

register-register, 146<br />

unique, E-36–37<br />

ARMv7, 62<br />

ARMv8, 158–159<br />

ARPANET, OL1.12-10<br />

Arrays, 415<br />

logic elements, B-18–19<br />

multiple dimension, 218<br />

pointers versus, 141–145<br />

procedures for setting to zero, 142<br />

ASCII<br />

binary numbers versus, 107<br />

character representation, 106<br />

defined, 106<br />

symbols, 109<br />

Assembler directives, A-5<br />

Assemblers, 124–126, A-10–17<br />

conditional code assembly, A-17<br />

defined, 14, A-4<br />

function, 125, A-10<br />

macros, A-4, A-15–17<br />

microcode, D-30<br />

number acceptance, 125<br />

object file, 125<br />

pseudoinstructions, A-17<br />

relocation information, A-13, A-14<br />

speed, A-13<br />

symbol table, A-12<br />

Assembly language, 15<br />

defined, 14, 123<br />

drawbacks, A-9–10<br />

floating-point, 212<br />

high-level languages versus, A-12<br />

illustrated, 15<br />

MIPS, 64, 84, A-45–80<br />

production of, A-8–9<br />

programs, 123<br />

translating into machine language, 84<br />

when to use, A-7–9<br />

Asserted signals, 250, B-4<br />

Associativity<br />

in caches, 405<br />

degree, increasing, 404, 455<br />

increasing, 409<br />

set, tag size versus, 409<br />

Atomic compare <strong>and</strong> swap, 123<br />

Atomic exchange, 121<br />

Atomic fetch-<strong>and</strong>-increment, 123<br />

Atomic memory operation, C-21<br />

Attribute interpolation, C-43–44<br />

Automobiles, computer application in, 4<br />

Average memory access time (AMAT),<br />

402<br />

calculating, 403<br />

B<br />

Backpatching, A-13<br />

B<strong>and</strong>width, 30–31<br />

bisection, 532<br />

external to DRAM, 398<br />

memory, 380–381, 398<br />

network, 535<br />

Barrier synchronization, C-18<br />

defined, C-20<br />

for thread communication, C-34<br />

Base addressing, 69, 116<br />

Base registers, 69<br />

Basic block, 93<br />

Benchmarks, 538–540<br />

defined, 46<br />

Linpack, 538, OL3.11-4<br />

multicores, 522–529<br />

multiprocessor, 538–540<br />

NAS parallel, 540<br />

parallel, 539<br />

PARSEC suite, 540<br />

SPEC CPU, 46–48<br />

SPEC power, 48–49<br />

SPECrate, 538–539<br />

Stream, 548<br />

beq (Branch On Equal), 64<br />

bge (Branch Greater Than or Equal), 125<br />

bgt (Branch Greater Than), 125<br />

Biased notation, 79, 200<br />

Big-endian byte order, 70, A-43<br />

Binary numbers, 81–82



ASCII versus, 107<br />

conversion to decimal numbers, 76<br />

defined, 73<br />

Bisection b<strong>and</strong>width, 532<br />

Bit maps<br />

defined, 18, 73<br />

goal, 18<br />

storing, 18<br />

Bit-Interleaved Parity (RAID 3), OL5.11-5

Bits<br />

ALUOp, 260, 261<br />

defined, 14<br />

dirty, 437<br />

guard, 220<br />

patterns, 220–221<br />

reference, 435<br />

rounding, 220<br />

sign, 75<br />

state, D-8<br />

sticky, 220<br />

valid, 383<br />

ble (Branch Less Than or Equal), 125<br />

Blocking assignment, B-24<br />

Blocking factor, 414<br />

Block-Interleaved Parity (RAID 4),<br />

OL5.11-5–5.11-6<br />

Blocks<br />

combinational, B-4<br />

defined, 376<br />

finding, 456<br />

flexible placement, 402–404<br />

least recently used (LRU), 409<br />

loads/stores, 149<br />

locating in cache, 407–408<br />

miss rate <strong>and</strong>, 391<br />

multiword, mapping addresses to, 390<br />

placement locations, 455–456<br />

placement strategies, 404<br />

replacement selection, 409<br />

replacement strategies, 457<br />

spatial locality exploitation, 391<br />

state, B-4<br />

valid data, 386<br />

blt (Branch Less Than), 125<br />

bne (Branch On Not Equal), 64<br />

Bonding, 28<br />

Boolean algebra, B-6<br />

Bounds check shortcut, 95<br />

Branch datapath<br />

ALU, 254<br />

operations, 254<br />

Branch delay slots<br />

defined, 322<br />

scheduling, 323<br />

Branch equal, 318<br />

Branch instructions, A-59–63<br />

jump instruction versus, 270<br />

list of, A-60–63<br />

pipeline impact, 317<br />

Branch not taken<br />

assumption, 318<br />

defined, 254<br />

Branch prediction<br />

as control hazard solution, 284<br />

buffers, 321, 322<br />

defined, 283<br />

dynamic, 284, 321–323<br />

static, 335<br />

Branch predictors<br />

accuracy, 322<br />

correlation, 324<br />

information from, 324<br />

tournament, 324<br />

Branch taken<br />

cost reduction, 318<br />

defined, 254<br />

Branch target<br />

addresses, 254<br />

buffers, 324<br />

Branches. See also Conditional branches

addressing in, 113–116<br />

compiler creation, 91<br />

condition, 255<br />

decision, moving up, 318<br />

delayed, 96, 255, 284, 318–319, 322, 324

ending, 93<br />

execution in ID stage, 319<br />

pipelined, 318<br />

target address, 318<br />

unconditional, 91<br />

Branch-on-equal instruction, 268<br />

Bubble Sort, 140<br />

Bubbles, 314<br />

Bus-based coherent multiprocessors, OL6.15-7

Buses, B-19<br />

Bytes<br />

addressing, 70<br />

order, 70, A-43<br />

C<br />

C.mmp, OL6.15-4<br />

C language<br />

assignment, compiling into MIPS,<br />

65–66<br />

compiling, 145, OL2.15-2–2.15-3<br />

compiling assignment with registers,<br />

67–68<br />

compiling while loops in, 92<br />

sort algorithms, 141<br />

translation hierarchy, 124<br />

translation to MIPS assembly language,<br />

65<br />

variables, 102<br />

C++ language, OL2.15-27, OL2.21-8<br />

Cache blocking and matrix multiply, 475–476

Cache coherence, 466–470<br />

coherence, 466<br />

consistency, 466<br />

enforcement schemes, 467–468<br />

implementation techniques,<br />

OL5.12-11–5.12-12<br />

migration, 467<br />

problem, 466, 467, 470<br />

protocol example, OL5.12-12–5.12-16<br />

protocols, 468<br />

replication, 468<br />

snooping protocol, 468–469<br />

snoopy, OL5.12-17<br />

state diagram, OL5.12-16<br />

Cache coherency protocol, OL5.12-12–5.12-16

finite-state transition diagram, OL5.12-15

functioning, OL5.12-14<br />

mechanism, OL5.12-14<br />

state diagram, OL5.12-16<br />

states, OL5.12-13<br />

write-back cache, OL5.12-15<br />

Cache controllers, 470<br />

coherent cache implementation techniques, OL5.12-11–5.12-12

implementing, OL5.12-2<br />

snoopy cache coherence, OL5.12-17<br />

SystemVerilog, OL5.12-2<br />

Cache hits, 443<br />

Cache misses<br />

block replacement on, 457<br />

capacity, 459



compulsory, 459<br />

conflict, 459<br />

defined, 392<br />

direct-mapped cache, 404<br />

fully associative cache, 406<br />

handling, 392–393

memory-stall clock cycles, 399<br />

reducing with flexible block placement, 402–404

set-associative cache, 405<br />

steps, 393<br />

in write-through cache, 393<br />

Cache performance, 398–417<br />

calculating, 400<br />

hit time <strong>and</strong>, 401–402<br />

impact on processor performance, 400<br />

Cache-aware instructions, 482<br />

Caches, 383–398. See also Blocks<br />

accessing, 386–389<br />

in ARM cortex-A8, 472<br />

associativity in, 405–406<br />

bits in, 390<br />

bits needed for, 390<br />

contents illustration, 387<br />

defined, 21, 383–384<br />

direct-mapped, 384, 385, 390, 402<br />

empty, 386–387<br />

FSM for controlling, 461–462<br />

fully associative, 403<br />

GPU, C-38<br />

inconsistent, 393<br />

index, 388<br />

in Intel Core i7, 472<br />

Intrinsity FastMATH example, 395–398

locating blocks in, 407–408<br />

locations, 385<br />

multilevel, 398, 410<br />

nonblocking, 472<br />

physically addressed, 443<br />

physically indexed, 443<br />

physically tagged, 443<br />

primary, 410, 417<br />

secondary, 410, 417<br />

set-associative, 403<br />

simulating, 478<br />

size, 389<br />

split, 397<br />

summary, 397–398<br />

tag field, 388<br />

tags, OL5.12-3, OL5.12-11<br />

virtual memory and TLB integration, 440–441

virtually addressed, 443<br />

virtually indexed, 443<br />

virtually tagged, 443<br />

write-back, 394, 395, 458<br />

write-through, 393, 395, 457<br />

writes, 393–395<br />

Callee, 98, 99<br />

Callee-saved register, A-23<br />

Caller, 98<br />

Caller-saved register, A-23<br />

Capabilities, OL5.17-8<br />

Capacity misses, 459<br />

Carry lookahead, B-38–47<br />

4-bit ALUs using, B-45<br />

adder, B-39<br />

fast, with first level of abstraction, B-39–40

fast, with “infinite” hardware, B-38–39<br />

fast, with second level of abstraction, B-40–46

plumbing analogy, B-42, B-43<br />

ripple carry speed versus, B-46<br />

summary, B-46–47<br />

Carry save adders, 188<br />

Cause register<br />

defined, 327<br />

fields, A-34, A-35<br />

CDC 6600, OL1.12-7, OL4.16-3

Cell phones, 7<br />

Central processor unit (CPU). See also Processors

classic performance equation, 36–40<br />

coprocessor 0, A-33–34<br />

defined, 19<br />

execution time, 32, 33–34<br />

performance, 33–35<br />

system, time, 32<br />

time, 399<br />

time measurements, 33–34<br />

user, time, 32<br />

Cg pixel shader program, C-15–17<br />

Characters<br />

ASCII representation, 106<br />

in Java, 109–111<br />

Chips, 19, 25, 26<br />

manufacturing process, 26<br />

Classes<br />

defined, OL2.15-15<br />

packages, OL2.15-21<br />

Clock cycles<br />

defined, 33<br />

memory-stall, 399<br />

number of registers <strong>and</strong>, 67<br />

worst-case delay <strong>and</strong>, 272<br />

Clock cycles per instruction (CPI), 35, 282

one level of caching, 410<br />

two levels of caching, 410<br />

Clock rate<br />

defined, 33<br />

frequency switched as function of, 41<br />

power <strong>and</strong>, 40<br />

Clocking methodology, 249–251, B-48<br />

edge-triggered, 249, B-48, B-73<br />

level-sensitive, B-74, B-75–76<br />

for predictability, 249<br />

Clocks, B-48–50<br />

edge, B-48, B-50<br />

in edge-triggered design, B-73<br />

skew, B-74<br />

specification, B-57<br />

synchronous system, B-48–49<br />

Cloud computing, 533<br />

defined, 7<br />

Cluster networking, 537–538, OL6.9-12<br />

Clusters, OL6.15-8–6.15-9<br />

defined, 30, 500, OL6.15-8<br />

isolation, 530<br />

organization, 499<br />

scientific computing on, OL6.15-8<br />

Cm*, OL6.15-4<br />

CMOS (complementary metal oxide semiconductor), 41

Coarse-grained multithreading, 514<br />

Cobol, OL2.21-7<br />

Code generation, OL2.15-13<br />

Code motion, OL2.15-7<br />

Cold-start miss, 459<br />

Collision misses, 459<br />

Column major order, 413<br />

Combinational blocks, B-4<br />

Combinational control units, D-4–8<br />

Combinational elements, 248<br />

Combinational logic, 249, B-3, B-9–20<br />

arrays, B-18–19<br />

decoders, B-9<br />

defined, B-5<br />

don’t cares, B-17–18<br />

multiplexors, B-10<br />

ROMs, B-14–16<br />

two-level, B-11–14<br />

Verilog, B-23–26



Commercial computer development, OL1.12-4–1.12-10

Commit units<br />

buffer, 339–340<br />

defined, 339–340<br />

in update control, 343<br />

Common case fast, 11<br />

Common subexpression elimination,<br />

OL2.15-6<br />

Communication, 23–24<br />

overhead, reducing, 44–45<br />

thread, C-34<br />

Compact code, OL2.21-4<br />

Comparison instructions, A-57–59<br />

floating-point, A-74–75<br />

list of, A-57–59<br />

Comparisons, 93<br />

constant operands in, 93

signed versus unsigned, 94–95<br />

Compilers, 123–124<br />

branch creation, 92<br />

brief history, OL2.21-9<br />

conservative, OL2.15-6<br />

defined, 14<br />

front end, OL2.15-3<br />

function, 14, 123–124, A-5–6<br />

high-level optimizations, OL2.15-4<br />

ILP exploitation, OL4.16-5<br />

Just In Time (JIT), 132<br />

machine language production, A-8–9, A-10

optimization, 141, OL2.21-9<br />

speculation, 333–334<br />

structure, OL2.15-2<br />

Compiling<br />

C assignment statements, 65–66<br />

C language, 92–93, 145, OL2.15-2–2.15-3

floating-point programs, 214–217<br />

if-then-else, 91<br />

in Java, OL2.15-19<br />

procedures, 98, 101–102<br />

recursive procedures, 101–102<br />

while loops, 92–93<br />

Compressed sparse row (CSR) matrix,<br />

C-55, C-56<br />

Compulsory misses, 459<br />

Computer architects, 11–12

abstraction to simplify design, 11<br />

common case fast, 11<br />

dependability via redundancy, 12<br />

hierarchy of memories, 12<br />

Moore’s law, 11<br />

parallelism, 12<br />

pipelining, 12<br />

prediction, 12<br />

Computers

application classes, 5–6<br />

applications, 4<br />

arithmetic for, 176–236<br />

characteristics, OL1.12-12<br />

commercial development, OL1.12-4–1.12-10

component organization, 17<br />

components, 17, 177<br />

design measure, 53<br />

desktop, 5<br />

embedded, 5, A-7<br />

first, OL1.12-2–1.12-4<br />

in information revolution, 4<br />

instruction representation, 80–87<br />

performance measurement, OL1.12-10<br />

PostPC Era, 6–7<br />

principles, 86<br />

servers, 5<br />

Condition field, 324<br />

Conditional branches<br />

ARM, 147–148<br />

changing program counter with, 324<br />

compiling if-then-else into, 91<br />

defined, 90<br />

desktop RISC, E-16<br />

embedded RISC, E-16<br />

implementation, 96<br />

in loops, 115<br />

PA-RISC, E-34, E-35<br />

PC-relative addressing, 114<br />

RISC, E-10–16<br />

SPARC, E-10–12<br />

Conditional move instructions, 324<br />

Conflict misses, 459<br />

Constant memory, C-40<br />

Constant operands, 72–73

in comparisons, 93<br />

frequent occurrence, 72<br />

Constant-manipulating instructions, A-57

Content Addressable Memory (CAM), 408

Context switch, 446<br />

Control<br />

ALU, 259–261<br />

challenge, 325–326<br />

finishing, 269–270<br />

forwarding, 307<br />

FSM, D-8–21<br />

implementation, optimizing, D-27–28<br />

for jump instruction, 270<br />

mapping to hardware, D-2–32<br />

memory, D-26<br />

organizing, to reduce logic, D-31–32<br />

pipelined, 300–303<br />

Control flow graphs, OL2.15-9–2.15-10<br />

illustrated examples, OL2.15-9,<br />

OL2.15-10<br />

Control functions<br />

ALU, mapping to gates, D-4–7<br />

defining, 264<br />

PLA, implementation, D-7,<br />

D-20–21<br />

ROM, encoding, D-18–19<br />

for single-cycle implementation, 269<br />

Control hazards, 281–282, 316–325<br />

branch delay reduction, 318–319<br />

branch not taken assumption, 318<br />

branch prediction as solution, 284<br />

delayed decision approach, 284<br />

dynamic branch prediction,<br />

321–323<br />

logic implementation in Verilog,<br />

OL4.13-8<br />

pipeline stalls as solution, 282<br />

pipeline summary, 324<br />

simplicity, 317<br />

solutions, 282<br />

static multiple-issue processors and, 335–336

Control lines<br />

asserted, 264<br />

in datapath, 263<br />

execution/address calculation, 300<br />

final three stages, 303<br />

instruction decode/register file read,<br />

300<br />

instruction fetch, 300<br />

memory access, 302<br />

setting of, 264<br />

values, 300<br />

write-back, 302<br />

Control signals<br />

ALUOp, 263<br />

defined, 250<br />

effect of, 264<br />

multi-bit, 264<br />

pipelined datapaths with, 300–303<br />

truth tables, D-14



Control units, 247. See also Arithmetic logic unit (ALU)

address select logic, D-24, D-25<br />

combinational, implementing, D-4–8<br />

with explicit counter, D-23<br />

illustrated, 265<br />

logic equations, D-11<br />

main, designing, 261–264<br />

as microcode, D-28<br />

MIPS, D-10<br />

next-state outputs, D-10, D-12–13<br />

output, 259–261, D-10<br />

Conversion instructions, A-75–76<br />

Cooperative thread arrays (CTAs), C-30<br />

Coprocessors, A-33–34<br />

defined, 218<br />

move instructions, A-71–72<br />

Core MIPS instruction set, 236. See also<br />

MIPS<br />

abstract view, 246<br />

desktop RISC, E-9–11<br />

implementation, 244–248<br />

implementation illustration, 247<br />

overview, 245<br />

subset, 244<br />

Cores<br />

defined, 43<br />

number per chip, 43<br />

Correlation predictor, 324<br />

Cosmic Cube, OL6.15-7<br />

Count register, A-34<br />

CPU, 9<br />

Cray computers, OL3.11-5–3.11-6<br />

Critical word first, 392<br />

Crossbar networks, 535<br />

CTSS (Compatible Time-Sharing System), OL5.18-9

CUDA programming environment, 523, C-5

barrier synchronization, C-18, C-34<br />

development, C-17, C-18<br />

hierarchy of thread groups, C-18<br />

kernels, C-19, C-24<br />

key abstractions, C-18<br />

paradigm, C-19–23<br />

parallel plus-scan template, C-61<br />

per-block shared memory, C-58<br />

plus-reduction implementation, C-63<br />

programs, C-6, C-24<br />

scalable parallel programming with,<br />

C-17–23<br />

shared memories, C-18<br />

threads, C-36<br />

Cyclic redundancy check, 423<br />

Cylinder, 381<br />

D<br />

D flip-flops, B-51, B-53<br />

D latches, B-51, B-52<br />

Data bits, 421<br />

Data flow analysis, OL2.15-11<br />

Data hazards, 278, 303–316. See also Hazards

forwarding, 278, 303–316<br />

load-use, 280, 318<br />

stalls <strong>and</strong>, 313–316<br />

Data layout directives, A-14<br />

Data movement instructions, A-70–73<br />

Data parallel problem decomposition,<br />

C-17, C-18<br />

Data race, 121<br />

Data segment, A-13<br />

Data selectors, 246<br />

Data transfer instructions. See also Instructions

defined, 68<br />

load, 68<br />

offset, 69<br />

store, 71<br />

Datacenters, 7<br />

Data-level parallelism, 508<br />

Datapath elements<br />

defined, 251<br />

sharing, 256<br />

Datapaths<br />

branch, 254<br />

building, 251–259<br />

control signal truth tables, D-14<br />

control unit, 265<br />

defined, 19<br />

design, 251<br />

exception handling, 329

for fetching instructions, 253<br />

for hazard resolution via forwarding,<br />

311<br />

for jump instruction, 270<br />

for memory instructions, 256<br />

for MIPS architecture, 257<br />

in operation for branch-on-equal<br />

instruction, 268<br />

in operation for load instruction, 267<br />

in operation for R-type instruction,<br />

266<br />

operation of, 264–269<br />

pipelined, 286–303<br />

for R-type instructions, 256, 264–265<br />

single, creating, 256<br />

single-cycle, 283<br />

static two-issue, 336<br />

Deasserted signals, 250, B-4<br />

Debugging information, A-13<br />

DEC PDP-8, OL2.21-3<br />

Decimal numbers<br />

binary number conversion to, 76<br />

defined, 73<br />

Decision-making instructions, 90–96<br />

Decoders, B-9<br />

two-level, B-65<br />

Decoding machine language, 118–120<br />

Defect, 26<br />

Delayed branches, 96. See also Branches

as control hazard solution, 284<br />

defined, 255<br />

embedded RISCs <strong>and</strong>, E-23<br />

for five-stage pipelines, 26, 323–324<br />

reducing, 318–319<br />

scheduling limitations, 323<br />

Delayed decision, 284<br />

DeMorgan’s theorems, B-11<br />

Denormalized numbers, 222<br />

Dependability via redundancy, 12<br />

Dependable memory hierarchy, 418–423<br />

failure, defining, 418<br />

Dependences<br />

between pipeline registers, 308<br />

between pipeline registers and ALU inputs, 308

bubble insertion <strong>and</strong>, 314<br />

detection, 306–308<br />

name, 338<br />

sequence, 304<br />

<strong>Design</strong><br />

compromises and, 161

datapath, 251<br />

digital, 354<br />

logic, 248–251, B-1–79<br />

main control unit, 261–264<br />

memory hierarchy, challenges, 460<br />

pipelining instruction sets, 277<br />

Desktop and server RISCs. See also Reduced instruction set computer (RISC) architectures



addressing modes, E-6<br />

architecture summary, E-4<br />

arithmetic/logical instructions, E-11<br />

conditional branches, E-16<br />

constant extension summary, E-9<br />

control instructions, E-11<br />

conventions equivalent to MIPS core,<br />

E-12<br />

data transfer instructions, E-10<br />

features added to, E-45<br />

floating-point instructions, E-12<br />

instruction formats, E-7<br />

multimedia extensions, E-16–18<br />

multimedia support, E-18<br />

types of, E-3<br />

Desktop computers, defined, 5<br />

Device driver, OL6.9-5<br />

DGEMM (Double precision General Matrix Multiply), 225, 352, 413, 553

cache blocked version of, 415<br />

optimized C version of, 226, 227, 476<br />

performance, 354, 416<br />

Dicing, 27<br />

Dies, 26, 26–27<br />

Digital design pipeline, 354<br />

Digital signal-processing (DSP) extensions, E-19

DIMMs (dual inline memory modules), OL5.17-5

Direct Data IO (DDIO), OL6.9-6<br />

Direct memory access (DMA), OL6.9-4<br />

Direct3D, C-13<br />

Direct-mapped caches. See also Caches

address portions, 407<br />

choice of, 456<br />

defined, 384, 402<br />

illustrated, 385<br />

memory block location, 403<br />

misses, 405<br />

single comparator, 407<br />

total number of bits, 390<br />

Dirty bit, 437<br />

Dirty pages, 437<br />

Disk memory, 381–383<br />

Displacement addressing, 116<br />

Distributed Block-Interleaved Parity (RAID 5), OL5.11-6

div (Divide), A-52<br />

div.d (FP Divide Double), A-76<br />

div.s (FP Divide Single), A-76<br />

Divide algorithm, 190<br />

Dividend, 189<br />

Division, 189–195<br />

algorithm, 191<br />

dividend, 189<br />

divisor, 189<br />

Divisor, 189<br />

divu (Divide Unsigned), A-52. See also Arithmetic

faster, 194<br />

floating-point, 211, A-76<br />

hardware, 189–192<br />

hardware, improved version, 192<br />

instructions, A-52–53<br />

in MIPS, 194<br />

oper<strong>and</strong>s, 189<br />

quotient, 189<br />

remainder, 189<br />

signed, 192–194<br />

SRT, 194<br />

Don’t cares, B-17–18<br />

example, B-17–18<br />

term, 261<br />

Double data rate (DDR), 379<br />

Double Data Rate RAMs (DDRRAMs), 379–380, B-65

Double precision. See also Single precision

defined, 198<br />

FMA, C-45–46<br />

GPU, C-45–46, C-74<br />

representation, 201<br />

Double words, 152<br />

Dual inline memory modules (DIMMs),<br />

381<br />

Dynamic branch prediction, 321–323. See also Control hazards

branch prediction buffer, 321<br />

loops and, 321–323

Dynamic hardware predictors, 284<br />

Dynamic multiple-issue processors, 333, 339–341. See also Multiple issue

pipeline scheduling, 339–341<br />

superscalar, 339<br />

Dynamic pipeline scheduling, 339–341<br />

commit unit, 339–340<br />

concept, 339–340<br />

hardware-based speculation, 341<br />

primary units, 340<br />

reorder buffer, 343<br />

reservation station, 339–340<br />

Dynamic random access memory (DRAM), 378, 379–381, B-63–65

b<strong>and</strong>width external to, 398<br />

cost, 23<br />

defined, 19, B-63<br />

DIMM, OL5.17-5<br />

Double Data Rate (DDR), 379–380

early board, OL5.17-4<br />

GPU, C-37–38<br />

growth of capacity, 25<br />

history, OL5.17-2<br />

internal organization of, 380<br />

pass transistor, B-63<br />

SIMM, OL5.17-5, OL5.17-6<br />

single-transistor, B-64<br />

size, 398<br />

speed, 23<br />

synchronous (SDRAM), 379–380,<br />

B-60, B-65<br />

two-level decoder, B-65<br />

Dynamically linked libraries (DLLs), 129–131

defined, 129<br />

lazy procedure linkage version, 130<br />

E<br />

Early restart, 392<br />

Edge-triggered clocking methodology, 249, 250, B-48, B-73

advantage, B-49<br />

clocks, B-73<br />

drawbacks, B-74<br />

illustrated, B-50<br />

rising edge/falling edge, B-48<br />

EDSAC (Electronic Delay Storage Automatic Calculator), OL1.12-3, OL5.17-2

Eispack, OL3.11-4<br />

Electrically erasable programmable read-only memory (EEPROM), 381

Elements<br />

combinational, 248<br />

datapath, 251, 256<br />

memory, B-50–58<br />

state, 248, 250, 252, B-48, B-50<br />

Embedded computers, 5<br />

application requirements, 6<br />

defined, A-7<br />

design, 5<br />

growth, OL1.12-12–1.12-13<br />

Embedded Microprocessor Benchmark Consortium (EEMBC), OL1.12-12



Embedded RISCs. See also Reduced instruction set computer (RISC) architectures

addressing modes, E-6<br />

architecture summary, E-4<br />

arithmetic/logical instructions, E-14<br />

conditional branches, E-16<br />

constant extension summary, E-9<br />

control instructions, E-15<br />

data transfer instructions, E-13<br />

delayed branch <strong>and</strong>, E-23<br />

DSP extensions, E-19<br />

general purpose registers, E-5<br />

instruction conventions, E-15<br />

instruction formats, E-8<br />

multiply-accumulate approaches, E-19<br />

types of, E-4<br />

Encoding<br />

defined, D-31<br />

floating-point instruction, 213<br />

MIPS instruction, 83, 119, A-49<br />

ROM control function, D-18–19<br />

ROM logic function, B-15<br />

x86 instruction, 155–156<br />

ENIAC (Electronic Numerical Integrator and Calculator), OL1.12-2, OL1.12-3, OL5.17-2

EPIC, OL4.16-5<br />

Error correction, B-65–67<br />

Error Detecting <strong>and</strong> Correcting Code<br />

(RAID 2), OL5.11-5<br />

Error detection, B-66<br />

Error detection code, 420<br />

Ethernet, 23<br />

EX stage<br />

load instructions, 292<br />

overflow exception detection, 328<br />

store instructions, 294<br />

Exabyte, 6<br />

Exception enable, 447<br />

Exception handlers, A-36–38

defined, A-35<br />

return from, A-38<br />

Exception program counters (EPCs), 326<br />

address capture, 331<br />

copying, 181<br />

defined, 181, 327<br />

in restart determination, 326–327<br />

transferring, 182<br />

Exceptions, 325–332, A-33–38<br />

association, 331–332<br />

datapath with controls for handling, 329

defined, 180, 326<br />

detecting, 326<br />

event types <strong>and</strong>, 326<br />

imprecise, 331–332<br />

instructions, A-80<br />

interrupts versus, 325–326<br />

in MIPS architecture, 326–327<br />

overflow, 329<br />

PC, 445, 446–447<br />

pipelined computer example, 328<br />

in pipelined implementation, 327–332<br />

precise, 332<br />

reasons for, 326–327<br />

result due to overflow in add<br />

instruction, 330<br />

saving/restoring stage on, 450<br />

Exclusive OR (XOR) instructions, A-57<br />

Executable files, A-4<br />

defined, 126<br />

linker production, A-19<br />

Execute or address calculation stage, 292<br />

Execute/address calculation<br />

control line, 300<br />

load instruction, 292<br />

store instruction, 292<br />

Execution time<br />

as valid performance measure, 51<br />

CPU, 32, 33–34<br />

pipelining and, 286

Explicit counters, D-23, D-26<br />

Exponents, 197–198<br />

External labels, A-10<br />

F<br />

Facilities, A-14–17<br />

Failures, synchronizer, B-77<br />

Fallacies. See also Pitfalls<br />

add immediate unsigned, 227<br />

Amdahl’s law, 556<br />

arithmetic, 229–232<br />

assembly language for performance,<br />

159–160<br />

commercial binary compatibility<br />

importance, 160<br />

defined, 49<br />

GPUs, C-72–74, C-75<br />

low utilization uses little power, 50<br />

peak performance, 556<br />

pipelining, 355–356<br />

powerful instructions mean higher<br />

performance, 159<br />

right shift, 229<br />

False sharing, 469<br />

Fast carry<br />

with “infinite” hardware, B-38–39<br />

with first level of abstraction, B-39–40<br />

with second level of abstraction,<br />

B-40–46<br />

Fast Fourier Transforms (FFT), C-53<br />

Fault avoidance, 419<br />

Fault forecasting, 419<br />

Fault tolerance, 419<br />

Fermi architecture, 523, 552<br />

Field programmable devices (FPDs), B-78<br />

Field programmable gate arrays (FPGAs), B-78

Fields<br />

Cause register, A-34, A-35<br />

defined, 82<br />

format, D-31<br />

MIPS, 82–83<br />

names, 82<br />

Status register, A-34, A-35<br />

Files, register, 252, 257, B-50, B-54–56<br />

Fine-grained multithreading, 514<br />

Finite-state machines (FSMs), 451–466,<br />

B-67–72<br />

control, D-8–22<br />

controllers, 464<br />

for multicycle control, D-9<br />

for simple cache controller, 464–466<br />

implementation, 463, B-70<br />

Mealy, 463<br />

Moore, 463<br />

next-state function, 463, B-67<br />

output function, B-67, B-69<br />

state assignment, B-70<br />

state register implementation, B-71<br />

style of, 463<br />

synchronous, B-67<br />

SystemVerilog, OL5.12-7<br />

traffic light example, B-68–70<br />

Flash memory, 381<br />

characteristics, 23<br />

defined, 23<br />

Flat address space, 479<br />

Flip-flops<br />

D flip-flops, B-51, B-53<br />

defined, B-51



Floating point, 196–222, 224<br />

assembly language, 212<br />

backward step, OL3.11-4–3.11-5<br />

binary to decimal conversion, 202<br />

branch, 211<br />

challenges, 232–233<br />

diversity versus portability, OL3.11-3–3.11-4

division, 211<br />

first dispute, OL3.11-2–3.11-3<br />

form, 197<br />

fused multiply add, 220<br />

guard digits, 218–219<br />

history, OL3.11-3<br />

IEEE 754 standard, 198, 199

instruction encoding, 213<br />

intermediate calculations, 218<br />

machine language, 212<br />

MIPS instruction frequency for, 236<br />

MIPS instructions, 211–213<br />

oper<strong>and</strong>s, 212<br />

overflow, 198<br />

packed format, 224<br />

precision, 230<br />

procedure with two-dimensional matrices, 215–217

programs, compiling, 214–217<br />

registers, 217<br />

representation, 197–202<br />

rounding, 218–219<br />

sign and magnitude, 197

SSE2 architecture, 224–225<br />

subtraction, 211<br />

underflow, 198<br />

units, 219<br />

in x86, 224<br />

Floating vectors, OL3.11-3<br />

Floating-point addition, 203–206<br />

arithmetic unit block diagram, 207<br />

binary, 204<br />

illustrated, 205<br />

instructions, 211, A-73–74<br />

steps, 203–204<br />

Floating-point arithmetic (GPUs), C-41–46

basic, C-42<br />

double precision, C-45–46, C-74<br />

performance, C-44<br />

specialized, C-42–44<br />

supported formats, C-42<br />

texture operations, C-44<br />

Floating-point instructions, A-73–80<br />

absolute value, A-73<br />

addition, A-73–74<br />

comparison, A-74–75<br />

conversion, A-75–76<br />

desktop RISC, E-12<br />

division, A-76<br />

load, A-76–77<br />

move, A-77–78<br />

multiplication, A-78<br />

negation, A-78–79<br />

SPARC, E-31<br />

square root, A-79<br />

store, A-79<br />

subtraction, A-79–80<br />

truncation, A-80<br />

Floating-point multiplication, 206–210<br />

binary, 210–211<br />

illustrated, 209<br />

instructions, 211<br />

signific<strong>and</strong>s, 206<br />

steps, 206–210<br />

Flow-sensitive information, OL2.15-15<br />

Flushing instructions, 318, 319<br />

defined, 319<br />

exceptions and, 331

For loops, 141, OL2.15-26<br />

inner, OL2.15-24<br />

SIMD and, OL6.15-2

Formal parameters, A-16<br />

Format fields, D-31<br />

Fortran, OL2.21-7<br />

Forward references, A-11<br />

Forwarding, 303–316<br />

ALU before, 309<br />

control, 307<br />

datapath for hazard resolution, 311<br />

defined, 278<br />

functioning, 306<br />

graphical representation, 279<br />

illustrations, OL4.13-26–4.13-26<br />

multiple results <strong>and</strong>, 281<br />

multiplexors, 310<br />

pipeline registers before, 309<br />

with two instructions, 278<br />

Verilog implementation, OL4.13-2–4.13-4

Fractions, 197, 198<br />

Frame buffer, 18<br />

Frame pointers, 103<br />

Front end, OL2.15-3<br />

Fully associative caches. See also Caches<br />

block replacement strategies, 457<br />

choice of, 456<br />

defined, 403<br />

memory block location, 403<br />

misses, 406<br />

Fully connected networks, 535<br />

Function code, 82<br />

Fused-multiply-add (FMA) operation, 220, C-45–46

G<br />

Game consoles, C-9<br />

Gates, B-3, B-8<br />

AND, B-12, D-7<br />

delays, B-46<br />

mapping ALU control function to,<br />

D-4–7<br />

NAND, B-8<br />

NOR, B-8, B-50<br />

Gather-scatter, 511, 552<br />

General Purpose GPUs (GPGPUs), C-5

General-purpose registers, 150<br />

architectures, OL2.21-3<br />

embedded RISCs, E-5<br />

Generate<br />

defined, B-40<br />

example, B-44<br />

super, B-41<br />

Gigabyte, 6<br />

Global common subexpression elimination, OL2.15-6

Global memory, C-21, C-39<br />

Global miss rates, 416<br />

Global optimization, OL2.15-5<br />

code, OL2.15-7<br />

implementing, OL2.15-8–2.15-11<br />

Global pointers, 102<br />

GPU computing. See also Graphics processing units (GPUs)

defined, C-5<br />

visual applications, C-6–7<br />

GPU system architectures, C-7–12<br />

graphics logical pipeline, C-10<br />

heterogeneous, C-7–9<br />

implications for, C-24<br />

interfaces <strong>and</strong> drivers, C-9<br />

unified, C-10–12<br />

Graph coloring, OL2.15-12



Graphics displays<br />

computer hardware support, 18<br />

LCD, 18<br />

Graphics logical pipeline, C-10<br />

Graphics processing units (GPUs), 522–529. See also GPU computing

as accelerators, 522<br />

attribute interpolation, C-43–44<br />

defined, 46, 506, C-3<br />

evolution, C-5<br />

fallacies <strong>and</strong> pitfalls, C-72–75<br />

floating-point arithmetic, C-17, C-41–46, C-74

GeForce 8-series generation, C-5<br />

general computation, C-73–74<br />

General Purpose (GPGPUs), C-5<br />

graphics mode, C-6<br />

graphics trends, C-4<br />

history, C-3–4<br />

logical graphics pipeline, C-13–14<br />

mapping applications to, C-55–72<br />

memory, 523<br />

multilevel caches <strong>and</strong>, 522<br />

N-body applications, C-65–72<br />

NVIDIA architecture, 523–526<br />

parallel memory system, C-36–41<br />

parallelism, 523, C-76<br />

performance doubling, C-4<br />

perspective, 527–529<br />

programming, C-12–24<br />

programming interfaces to, C-17<br />

real-time graphics, C-13<br />

summary, C-76<br />

Graphics shader programs, C-14–15<br />

Gresham’s Law, 236, OL3.11-2<br />

Grid computing, 533<br />

Grids, C-19<br />

GTX 280, 548–553<br />

Guard digits<br />

defined, 218<br />

rounding with, 219<br />

H<br />

Half precision, C-42<br />

Halfwords, 110<br />

Hamming, Richard, 420<br />

Hamming distance, 420<br />

Hamming Error Correction Code (ECC), 420–421

calculating, 420–421<br />

Handlers

defined, 449<br />

TLB miss, 448<br />

Hard disks<br />

access times, 23<br />

defined, 23<br />

Hardware<br />

as hierarchical layer, 13<br />

language of, 14–16<br />

operations, 63–66<br />

supporting procedures in, 96–106<br />

synthesis, B-21<br />

translating microprograms to, D-28–32<br />

virtualizable, 426<br />

Hardware description languages. See also Verilog

defined, B-20<br />

using, B-20–26<br />

VHDL, B-20–21<br />

Hardware multithreading, 514–517<br />

coarse-grained, 514<br />

options, 516<br />

simultaneous, 515–517<br />

Hardware-based speculation, 341<br />

Harvard architecture, OL1.12-4<br />

Hazard detection units, 313–314<br />

functions, 314<br />

pipeline connections for, 314<br />

Hazards, 277–278. See also Pipelining<br />

control, 281–282, 316–325<br />

data, 278, 303–316<br />

forwarding and, 312

structural, 277, 294<br />

Heap<br />

allocating space on, 104–106<br />

defined, 104<br />

Heterogeneous systems, C-4–5<br />

architecture, C-7–9<br />

defined, C-3<br />

Hexadecimal numbers, 81–82<br />

binary number conversion to, 81–82<br />

Hierarchy of memories, 12<br />

High-level languages, 14–16, A-6<br />

benefits, 16<br />

computer architectures, OL2.21-5<br />

importance, 16<br />

High-level optimizations, OL2.15-4–2.15-5

Hit rate, 376<br />

Hit time<br />

cache performance <strong>and</strong>, 401–402<br />

defined, 376<br />

Hit under miss, 472<br />

Hold time, B-54<br />

Horizontal microcode, D-32<br />

Hot-swapping, OL5.11-7<br />

Human genome project, 4<br />

I

I/O, A-38–40, OL6.9-2, OL6.9-3<br />

memory-mapped, A-38<br />

on system performance, OL5.11-2<br />

I/O benchmarks. See Benchmarks

IBM 360/85, OL5.17-7<br />

IBM 701, OL1.12-5<br />

IBM 7030, OL4.16-2<br />

IBM ALOG, OL3.11-7<br />

IBM Blue Gene, OL6.15-9–6.15-10<br />

IBM Personal Computer, OL1.12-7, OL2.21-6

IBM System/360 computers, OL1.12-6, OL3.11-6, OL4.16-2

IBM z/VM, OL5.17-8<br />

ID stage<br />

branch execution in, 319<br />

load instructions, 292<br />

store instruction in, 291<br />

IEEE 754 floating-point standard, 198, 199, OL3.11-8–3.11-10. See also Floating point

first chips, OL3.11-8–3.11-9<br />

in GPU arithmetic, C-42–43<br />

implementation, OL3.11-10<br />

rounding modes, 219<br />

today, OL3.11-10<br />

If statements, 114<br />

I-format, 83<br />

If-then-else, 91<br />

Immediate addressing, 116<br />

Immediate instructions, 72<br />

Imprecise interrupts, 331, OL4.16-4<br />

Index-out-of-bounds check, 94–95<br />

Induction variable elimination, OL2.15-7<br />

Inheritance, OL2.15-15<br />

In-order commit, 341<br />

Input devices, 16<br />

Inputs, 261<br />

Instances, OL2.15-15<br />

Instruction count, 36, 38<br />

Instruction decode/register file read stage



control line, 300<br />

load instruction, 289<br />

store instruction, 294<br />

Instruction execution illustrations, OL4.13-16–4.13-17

clock cycle 9, OL4.13-24<br />

clock cycles 1 and 2, OL4.13-21

clock cycles 3 and 4, OL4.13-22

clock cycles 5 and 6, OL4.13-23

clock cycles 7 and 8, OL4.13-24

examples, OL4.13-20–4.13-25<br />

forwarding, OL4.13-26–4.13-31<br />

no hazard, OL4.13-17<br />

pipelines with stalls and forwarding, OL4.13-26, OL4.13-20

Instruction fetch stage<br />

control line, 300<br />

load instruction, 289<br />

store instruction, 294<br />

Instruction formats, 157<br />

ARM, 148<br />

defined, 81<br />

desktop/server RISC architectures, E-7<br />

embedded RISC architectures, E-8<br />

I-type, 83<br />

J-type, 113<br />

jump instruction, 270<br />

MIPS, 148<br />

R-type, 83, 261<br />

x86, 157<br />

Instruction latency, 356<br />

Instruction mix, 39, OL1.12-10<br />

Instruction set architecture<br />

ARM, 145–147<br />

branch address calculation, 254<br />

defined, 22, 52<br />

history, 163<br />

maintaining, 52<br />

protection <strong>and</strong>, 427<br />

thread, C-31–34<br />

virtual machine support, 426–427<br />

Instruction sets, 235, C-49<br />

ARM, 324<br />

design for pipelining, 277<br />

MIPS, 62, 161, 234<br />

MIPS-32, 235<br />

Pseudo MIPS, 233<br />

x86 growth, 161<br />

Instruction-level parallelism (ILP), 354. See also Parallelism

compiler exploitation, OL4.16-5–4.16-6<br />

defined, 43, 333<br />

exploitation, increasing, 343<br />

and matrix multiply, 351–354

Instructions, 60–164, E-25–27, E-40–42. See also Arithmetic instructions; MIPS; Operands

add immediate, 72<br />

addition, 180, A-51<br />

Alpha, E-27–29<br />

arithmetic-logical, 251, A-51–57<br />

ARM, 145–147, E-36–37<br />

assembly, 66<br />

basic block, 93<br />

branch, A-59–63<br />

cache-aware, 482<br />

comparison, A-57–59<br />

conditional branch, 90<br />

conditional move, 324<br />

constant-manipulating, A-57<br />

conversion, A-75–76<br />

core, 233<br />

data movement, A-70–73<br />

data transfer, 68<br />

decision-making, 90–96<br />

defined, 14, 62<br />

desktop RISC conventions, E-12<br />

division, A-52–53<br />

as electronic signals, 80<br />

embedded RISC conventions, E-15<br />

encoding, 83<br />

exception <strong>and</strong> interrupt, A-80<br />

exclusive OR, A-57<br />

fetching, 253<br />

fields, 80<br />

floating-point (x86), 224<br />

floating-point, 211–213, A-73–80<br />

flushing, 318, 319, 331<br />

immediate, 72<br />

introduction to, 62–63<br />

jump, 95, 97, A-63–64<br />

left-to-right flow, 287–288<br />

load, 68, A-66–68<br />

load linked, 122<br />

logical operations, 87–89<br />

M32R, E-40<br />

memory access, C-33–34<br />

memory-reference, 245<br />

multiplication, 188, A-53–54<br />

negation, A-54<br />

nop, 314<br />

PA-RISC, E-34–36<br />

performance, 35–36<br />

pipeline sequence, 313<br />

PowerPC, E-12–13, E-32–34<br />

PTX, C-31, C-32<br />

remainder, A-55<br />

representation in computer, 80–87<br />

restartable, 450<br />

resuming, 450<br />

R-type, 252<br />

shift, A-55–56<br />

SPARC, E-29–32<br />

store, 71, A-68–70<br />

store conditional, 122<br />

subtraction, 180, A-56–57<br />

SuperH, E-39–40<br />

thread, C-30–31<br />

Thumb, E-38<br />

trap, A-64–66<br />

vector, 510<br />

as words, 62<br />

x86, 149–155<br />

Instructions per clock cycle (IPC), 333<br />

Integrated circuits (ICs), 19. See also specific chips

cost, 27<br />

defined, 25<br />

manufacturing process, 26<br />

very large-scale (VLSIs), 25<br />

Intel Core i7, 46–49, 244, 501, 548–553<br />

address translation for, 471<br />

architectural registers, 347<br />

caches in, 472<br />

memory hierarchies of, 471–475<br />

microarchitecture, 338<br />

performance of, 473<br />

SPEC CPU benchmark, 46–48<br />

SPEC power benchmark, 48–49<br />

TLB hardware for, 471<br />

Intel Core i7 920, 346–349<br />

microarchitecture, 347<br />

Intel Core i7 960<br />

benchmarking and rooflines of, 548–553

Intel Core i7 Pipelines, 344, 346–349<br />

memory components, 348<br />

performance, 349–351<br />

program performance, 351<br />

specification, 345<br />

Intel IA-64 architecture, OL2.21-3<br />

Intel Paragon, OL6.15-8



Intel Threading Building Blocks, C-60<br />

Intel x86 microprocessors<br />

clock rate and power for, 40

Interference graphs, OL2.15-12<br />

Interleaving, 398<br />

Interprocedural analysis, OL2.15-14<br />

Interrupt enable, 447<br />

Interrupt h<strong>and</strong>lers, A-33<br />

Interrupt-driven I/O, OL6.9-4<br />

Interrupts<br />

defined, 180, 326<br />

event types <strong>and</strong>, 326<br />

exceptions versus, 325–326<br />

imprecise, 331, OL4.16-4<br />

instructions, A-80<br />

precise, 332<br />

vectored, 327<br />

Intrinsity FastMATH processor, 395–398<br />

caches, 396<br />

data miss rates, 397, 407<br />

read processing, 442<br />

TLB, 440<br />

write-through processing, 442<br />

Inverted page tables, 436<br />

Issue packets, 334<br />

J<br />

j (Jump), 64<br />

jal (Jump And Link), 64<br />

Java<br />

bytecode, 131<br />

bytecode architecture, OL2.15-17<br />

characters in, 109–111<br />

compiling in, OL2.15-19–2.15-20<br />

goals, 131<br />

interpreting, 131, 145, OL2.15-15–<br />

2.15-16<br />

keywords, OL2.15-21<br />

method invocation in, OL2.15-21<br />

pointers, OL2.15-26<br />

primitive types, OL2.15-26<br />

programs, starting, 131–132<br />

reference types, OL2.15-26<br />

sort algorithms, 141<br />

strings in, 109–111<br />

translation hierarchy, 131<br />

while loop compilation in, OL2.15-<br />

18–2.15-19<br />

Java Virtual Machine (JVM), 145,<br />

OL2.15-16<br />

jr (Jump Register), 64<br />

J-type instruction format, 113<br />

Jump instructions, 254, E-26<br />

branch instruction versus, 270<br />

control <strong>and</strong> datapath for, 271<br />

implementing, 270<br />

instruction format, 270<br />

list of, A-63–64<br />

Just In Time (JIT) compilers, 132, 560

K<br />

Karnaugh maps, B-18<br />

Kernel mode, 444<br />

Kernels<br />

CUDA, C-19, C-24<br />

defined, C-19<br />

Kilobyte, 6<br />

L<br />

Labels<br />

global, A-10, A-11<br />

local, A-11<br />

LAPACK, 230<br />

Large-scale multiprocessors, OL6.15-7, OL6.15-9–6.15-10

Latches<br />

D latch, B-51, B-52<br />

defined, B-51<br />

Latency<br />

instruction, 356<br />

memory, C-74–75<br />

pipeline, 286<br />

use, 336–337<br />

lbu (Load Byte Unsigned), 64<br />

Leaf procedures. See also Procedures<br />

defined, 100<br />

example, 109<br />

Least recently used (LRU)<br />

as block replacement strategy, 457<br />

defined, 409<br />

pages, 434<br />

Least significant bits, B-32<br />

defined, 74<br />

SPARC, E-31<br />

Left-to-right instruction flow, 287–288<br />

Level-sensitive clocking, B-74, B-75–76<br />

defined, B-74<br />

two-phase, B-75<br />

lhu (Load Halfword Unsigned), 64<br />

li (Load Immediate), 162<br />

Link, OL6.9-2<br />

Linkers, 126–129, A-18–19<br />

defined, 126, A-4<br />

executable files, 126, A-19<br />

function illustration, A-19<br />

steps, 126<br />

using, 126–129<br />

Linking object files, 126–129<br />

Linpack, 538, OL3.11-4<br />

Liquid crystal displays (LCDs), 18<br />

LISP, SPARC support, E-30<br />

Little-endian byte order, A-43<br />

Live range, OL2.15-11<br />

Livermore Loops, OL1.12-11<br />

ll (Load Linked), 64<br />

Load balancing, 505–506<br />

Load instructions. See also Store instructions

access, C-41<br />

base register, 262<br />

block, 149<br />

compiling with, 71<br />

datapath in operation for, 267<br />

defined, 68<br />

details, A-66–68<br />

EX stage, 292<br />

floating-point, A-76–77<br />

halfword unsigned, 110<br />

ID stage, 291<br />

IF stage, 291<br />

linked, 122, 123<br />

list of, A-66–68<br />

load byte unsigned, 76<br />

load half, 110<br />

load upper immediate, 112, 113<br />

MEM stage, 293<br />

pipelined datapath in, 296<br />

signed, 76<br />

unit for implementing, 255<br />

unsigned, 76<br />

WB stage, 293<br />

Load word, 68, 71<br />

Loaders, 129<br />

Loading, A-19–20<br />

Load-store architectures, OL2.21-3<br />

Load-use data hazard, 280, 318<br />

Load-use stalls, 318<br />

Local area networks (LANs), 24. See also Networks



Local labels, A-11<br />

Local memory, C-21, C-40<br />

Local miss rates, 416<br />

Local optimization, OL2.15-5. See also Optimization

implementing, OL2.15-8<br />

Locality<br />

principle, 374<br />

spatial, 374, 377<br />

temporal, 374, 377<br />

Lock synchronization, 121<br />

Locks, 518<br />

Logic<br />

address select, D-24, D-25<br />

ALU control, D-6<br />

combinational, 250, B-5, B-9–20<br />

components, 249<br />

control unit equations, D-11<br />

design, 248–251, B-1–79<br />

equations, B-7<br />

minimization, B-18<br />

programmable array (PAL), B-78

sequential, B-5, B-56–58<br />

two-level, B-11–14<br />

Logical operations, 87–89<br />

AND, 88, A-52<br />

ARM, 149<br />

desktop RISC, E-11<br />

embedded RISC, E-14<br />

MIPS, A-51–57<br />

NOR, 89, A-54<br />

NOT, 89, A-55<br />

OR, 89, A-55<br />

shifts, 87<br />

Long instruction word (LIW), OL4.16-5

Lookup tables (LUTs), B-79<br />

Loop unrolling<br />

defined, 338, OL2.15-4<br />

for multiple-issue pipelines, 338<br />

register renaming <strong>and</strong>, 338<br />

Loops, 92–93<br />

conditional branches in, 114<br />

for, 141<br />

prediction <strong>and</strong>, 321–323<br />

test, 142, 143<br />

while, compiling, 92–93<br />

lui (Load Upper Imm.), 64<br />

lw (Load Word), 64<br />

lwc1 (Load FP Single), A-73<br />

M<br />

M32R, E-15, E-40<br />

Machine code, 81<br />

Machine instructions, 81<br />

Machine language, 15<br />

branch offset in, 115<br />

decoding, 118–120<br />

defined, 14, 81, A-3<br />

floating-point, 212<br />

illustrated, 15<br />

MIPS, 85<br />

SRAM, 21<br />

translating MIPS assembly language<br />

into, 84<br />

Macros<br />

defined, A-4<br />

example, A-15–17<br />

use of, A-15<br />

Main memory, 428. See also Memory<br />

defined, 23<br />

page tables, 437<br />

physical addresses, 428<br />

Mapping applications, C-55–72<br />

Mark computers, OL1.12-14<br />

Matrix multiply, 225–228, 553–555<br />

Mealy machine, 463–464, B-68, B-71, B-72

Mean time to failure (MTTF), 418

improving, 419<br />

versus AFR of disks, 419–420<br />

Media Access Control (MAC) address, OL6.9-7

Megabyte, 6<br />

Memory<br />

addresses, 77<br />

affinity, 545<br />

atomic, C-21<br />

bandwidth, 380–381, 397

cache, 21, 383–398, 398–417<br />

CAM, 408<br />

constant, C-40<br />

control, D-26<br />

defined, 19<br />

DRAM, 19, 379–380, B-63–65<br />

flash, 23<br />

global, C-21, C-39<br />

GPU, 523<br />

instructions, datapath for, 256<br />

layout, A-21<br />

local, C-21, C-40<br />

main, 23<br />

nonvolatile, 22<br />

oper<strong>and</strong>s, 68–69<br />

parallel system, C-36–41<br />

read-only (ROM), B-14–16<br />

SDRAM, 379–380<br />

secondary, 23<br />

shared, C-21, C-39–40<br />

spaces, C-39<br />

SRAM, B-58–62<br />

stalls, 400<br />

technologies for building, 24–28<br />

texture, C-40<br />

usage, A-20–22<br />

virtual, 427–454<br />

volatile, 22<br />

Memory access instructions, C-33–34<br />

Memory access stage<br />

control line, 302<br />

load instruction, 292<br />

store instruction, 292<br />

Memory bandwidth, 551, 557

Memory consistency model, 469<br />

Memory elements, B-50–58<br />

clocked, B-51<br />

D flip-flop, B-51, B-53<br />

D latch, B-52<br />

DRAMs, B-63–67<br />

flip-flop, B-51<br />

hold time, B-54<br />

latch, B-51<br />

setup time, B-53, B-54<br />

SRAMs, B-58–62<br />

unclocked, B-51<br />

Memory hierarchies, 545<br />

of ARM cortex-A8, 471–475<br />

block (or line), 376<br />

cache performance, 398–417<br />

caches, 383–417<br />

common framework, 454–461<br />

defined, 375<br />

design challenges, 461<br />

development, OL5.17-6–5.17-8<br />

exploiting, 372–498<br />

of Intel core i7, 471–475<br />

level pairs, 376<br />

multiple levels, 375<br />

overall operation of, 443–444<br />

parallelism <strong>and</strong>, 466–470, OL5.11-2<br />

pitfalls, 478–482<br />

program execution time <strong>and</strong>, 417



quantitative design parameters, 454<br />

redundant arrays of inexpensive disks, 470

reliance on, 376<br />

structure, 375<br />

structure diagram, 378<br />

variance, 417<br />

virtual memory, 427–454<br />

Memory rank, 381<br />

Memory technologies, 378–383<br />

disk memory, 381–383<br />

DRAM technology, 378, 379–381<br />

flash memory, 381<br />

SRAM technology, 378, 379<br />

Memory-mapped I/O, OL6.9-3<br />

use of, A-38<br />

Memory-stall clock cycles, 399<br />

Message passing<br />

defined, 529<br />

multiprocessors, 529–534<br />

Metastability, B-76<br />

Methods<br />

defined, OL2.15-5<br />

invoking in Java, OL2.15-20–2.15-21<br />

static, A-20<br />

mfc0 (Move From Control), A-71<br />

mfhi (Move From Hi), A-71<br />

mflo (Move From Lo), A-71<br />

Microarchitectures, 347<br />

Intel Core i7 920, 347<br />

Microcode<br />

assembler, D-30<br />

control unit as, D-28<br />

defined, D-27<br />

dispatch ROMs, D-30–31<br />

horizontal, D-32<br />

vertical, D-32<br />

Microinstructions, D-31<br />

Microprocessors<br />

design shift, 501<br />

multicore, 8, 43, 500–501<br />

Microprograms<br />

as abstract control representation,<br />

D-30<br />

field translation, D-29<br />

translating to hardware, D-28–32<br />

Migration, 467<br />

Million instructions per second (MIPS), 51

Minterms<br />

defined, B-12, D-20<br />

in PLA implementation, D-20<br />

MIP-map, C-44<br />

MIPS, 64, 84, A-45–80<br />

addressing for 32-bit immediates,<br />

116–118<br />

addressing modes, A-45–47<br />

arithmetic core, 233<br />

arithmetic instructions, 63, A-51–57<br />

ARM similarities, 146<br />

assembler directive support, A-47–49<br />

assembler syntax, A-47–49<br />

assembly instruction, mapping, 80–81<br />

branch instructions, A-59–63<br />

comparison instructions, A-57–59<br />

compiling C assignment statements<br />

into, 65<br />

compiling complex C assignment into,<br />

65–66<br />

constant-manipulating instructions,<br />

A-57<br />

control registers, 448<br />

control unit, D-10<br />

CPU, A-46<br />

divide in, 194<br />

exceptions in, 326–327<br />

fields, 82–83<br />

floating-point instructions, 211–213<br />

FPU, A-46<br />

instruction classes, 163<br />

instruction encoding, 83, 119, A-49<br />

instruction formats, 120, 148, A-49–51<br />

instruction set, 62, 162, 234<br />

jump instructions, A-63–66<br />

logical instructions, A-51–57<br />

machine language, 85<br />

memory addresses, 70<br />

memory allocation for program and data, 104

multiply in, 188<br />

opcode map, A-50<br />

oper<strong>and</strong>s, 64<br />

Pseudo, 233, 235<br />

register conventions, 105<br />

static multiple issue with, 335–338<br />

MIPS core<br />

architecture, 195<br />

arithmetic/logical instructions not in,<br />

E-21, E-23<br />

common extensions to, E-20–25<br />

control instructions not in, E-21<br />

data transfer instructions not in, E-20,<br />

E-22<br />

floating-point instructions not in, E-22<br />

instruction set, 233, 244–248, E-9–10<br />

MIPS-16<br />

16-bit instruction set, E-41–42<br />

immediate fields, E-41<br />

instructions, E-40–42<br />

MIPS core instruction changes, E-42<br />

PC-relative addressing, E-41<br />

MIPS-32 instruction set, 235<br />

MIPS-64 instructions, E-25–27<br />

conditional procedure call instructions,<br />

E-27<br />

constant shift amount, E-25<br />

jump/call not PC-relative, E-26<br />

move to/from control registers, E-26<br />

nonaligned data transfers, E-25<br />

NOR, E-25<br />

parallel single precision floating-point<br />

operations, E-27<br />

reciprocal and reciprocal square root, E-27

SYSCALL, E-25<br />

TLB instructions, E-26–27<br />

Mirroring, OL5.11-5<br />

Miss penalty<br />

defined, 376<br />

determination, 391–392<br />

multilevel caches, reducing, 410<br />

Miss rates<br />

block size versus, 392<br />

data cache, 455<br />

defined, 376<br />

global, 416<br />

improvement, 391–392<br />

Intrinsity FastMATH processor, 397<br />

local, 416<br />

miss sources, 460<br />

split cache, 397<br />

Miss under miss, 472<br />

MMX (MultiMedia eXtension), 224<br />

Modules, A-4<br />

Moore machines, 463–464, B-68, B-71,<br />

B-72<br />

Moore’s law, 11, 379, 522, OL6.9-2, C-72–73

Most significant bit<br />

1-bit ALU for, B-33<br />

defined, 74<br />

move (Move), 139



Move instructions, A-70–73<br />

coprocessor, A-71–72<br />

details, A-70–73<br />

floating-point, A-77–78<br />

MS-DOS, OL5.17-11<br />

mul.d (FP Multiply Double), A-78<br />

mul.s (FP Multiply Single), A-78<br />

mult (Multiply), A-53<br />

Multicore, 517–521<br />

Multicore multiprocessors, 8, 43<br />

defined, 8, 500–501<br />

MULTICS (Multiplexed Information and Computing Service), OL5.17-9–5.17-10

Multilevel caches. See also Caches<br />

complications, 416<br />

defined, 398, 416<br />

miss penalty, reducing, 410<br />

performance of, 410<br />

summary, 417–418<br />

Multimedia extensions<br />

desktop/server RISCs, E-16–18<br />

as SIMD extensions to instruction sets,<br />

OL6.15-4<br />

vector versus, 511–512<br />

Multiple dimension arrays, 218<br />

Multiple instruction multiple data (MIMD), 558

defined, 507, 508<br />

first multiprocessor, OL6.15-14<br />

Multiple instruction single data (MISD), 507<br />

Multiple issue, 332–339<br />

code scheduling, 337–338<br />

dynamic, 333, 339–341<br />

issue packets, 334<br />

loop unrolling <strong>and</strong>, 338<br />

processors, 332, 333<br />

static, 333, 334–339<br />

throughput <strong>and</strong>, 342<br />

Multiple processors, 553–555<br />

Multiple-clock-cycle pipeline diagrams, 296–297

five instructions, 298<br />

illustrated, 298<br />

Multiplexors, B-10<br />

controls, 463<br />

in datapath, 263<br />

defined, 246<br />

forwarding, control values, 310<br />

selector control, 256–257<br />

two-input, B-10<br />

Multiplic<strong>and</strong>, 183<br />

Multiplication, 183–188. See also<br />

Arithmetic<br />

fast, hardware, 188<br />

faster, 187–188<br />

first algorithm, 185<br />

floating-point, 206–208, A-78<br />

hardware, 184–186<br />

instructions, 188, A-53–54<br />

in MIPS, 188<br />

multiplic<strong>and</strong>, 183<br />

multiplier, 183<br />

oper<strong>and</strong>s, 183<br />

product, 183<br />

sequential version, 184–186<br />

signed, 187<br />

Multiplier, 183<br />

Multiply algorithm, 186<br />

Multiply-add (MAD), C-42<br />

Multiprocessors<br />

benchmarks, 538–540<br />

bus-based coherent, OL6.15-7<br />

defined, 500<br />

historical perspective, 561<br />

large-scale, OL6.15-7–6.15-8, OL6.15-9–6.15-10

message-passing, 529–534<br />

multithreaded architecture, C-26–27,<br />

C-35–36<br />

organization, 499, 529<br />

for performance, 559<br />

shared memory, 501, 517–521<br />

software, 500<br />

TFLOPS, OL6.15-6<br />

UMA, 518<br />

Multistage networks, 535<br />

Multithreaded multiprocessor architecture, C-25–36

conclusion, C-36<br />

ISA, C-31–34<br />

massive multithreading, C-25–26<br />

multiprocessor, C-26–27<br />

multiprocessor comparison, C-35–36<br />

SIMT, C-27–30<br />

special function units (SFUs), C-35<br />

streaming processor (SP), C-34<br />

thread instructions, C-30–31<br />

threads/thread blocks management,<br />

C-30<br />

Multithreading, C-25–26<br />

coarse-grained, 514<br />

defined, 506<br />

fine-grained, 514<br />

hardware, 514–517<br />

simultaneous (SMT), 515–517<br />

multu (Multiply Unsigned), A-54<br />

Must-information, OL2.15-5<br />

Mutual exclusion, 121<br />

N<br />

Name dependence, 338<br />

NAND gates, B-8<br />

NAS (NASA Advanced Supercomputing), 540

N-body<br />

all-pairs algorithm, C-65<br />

GPU simulation, C-71<br />

mathematics, C-65–67<br />

multiple threads per body, C-68–69<br />

optimization, C-67<br />

performance comparison, C-69–70<br />

results, C-70–72<br />

shared memory use, C-67–68<br />

Negation instructions, A-54, A-78–79<br />

Negation shortcut, 76<br />

Nested procedures, 100–102<br />

compiling recursive procedure<br />

showing, 101–102<br />

NetFPGA 10-Gigabit Ethernet card, OL6.9-2, OL6.9-3

Network of Workstations, OL6.15-8–6.15-9

Network topologies, 534–537<br />

implementing, 536<br />

multistage, 537<br />

Networking, OL6.9-4<br />

operating system in, OL6.9-4–6.9-5<br />

performance improvement, OL6.9-7–6.9-10

Networks, 23–24<br />

advantages, 23<br />

b<strong>and</strong>width, 535<br />

crossbar, 535<br />

fully connected, 535<br />

local area (LANs), 24<br />

multistage, 535<br />

wide area (WANs), 24<br />

Newton’s iteration, 218<br />

Next state<br />

nonsequential, D-24<br />

sequential, D-23



Next-state function, 463, B-67<br />

defined, 463<br />

implementing, with sequencer,<br />

D-22–28<br />

Next-state outputs, D-10, D-12–13<br />

example, D-12–13<br />

implementation, D-12<br />

logic equations, D-12–13<br />

truth tables, D-15<br />

No Redundancy (RAID 0), OL5.11-4<br />

No write allocation, 394<br />

Nonblocking assignment, B-24<br />

Nonblocking caches, 344, 472<br />

Nonuniform memory access (NUMA), 518

Nonvolatile memory, 22<br />

Nops, 314<br />

nor (NOR), 64<br />

NOR gates, B-8<br />

cross-coupled, B-50<br />

D latch implemented with, B-52<br />

NOR operation, 89, A-54, E-25<br />

NOT operation, 89, A-55, B-6<br />

Numbers<br />

binary, 73<br />

computer versus real-world, 221<br />

decimal, 73, 76<br />

denormalized, 222<br />

hexadecimal, 81–82<br />

signed, 73–78<br />

unsigned, 73–78<br />

NVIDIA GeForce 8800, C-46–55<br />

all-pairs N-body algorithm, C-71<br />

dense linear algebra computations,<br />

C-51–53<br />

FFT performance, C-53<br />

instruction set, C-49<br />

performance, C-51<br />

rasterization, C-50<br />

ROP, C-50–51<br />

scalability, C-51<br />

sorting performance, C-54–55<br />

special function approximation<br />

statistics, C-43<br />

special function unit (SFU), C-50<br />

streaming multiprocessor (SM),<br />

C-48–49<br />

streaming processor, C-49–50<br />

streaming processor array (SPA), C-46<br />

texture/processor cluster (TPC),<br />

C-47–48<br />

NVIDIA GPU architecture, 523–526<br />

NVIDIA GTX 280, 548–553<br />

NVIDIA Tesla GPU, 548–553<br />

O<br />

Object files, 125, A-4<br />

debugging information, 124<br />

defined, A-10<br />

format, A-13–14<br />

header, 125, A-13<br />

linking, 126–129<br />

relocation information, 125<br />

static data segment, 125<br />

symbol table, 125, 126<br />

text segment, 125<br />

Object-oriented languages. See also Java<br />

brief history, OL2.21-8<br />

defined, 145, OL2.15-5<br />

One’s complement, 79, B-29<br />

Opcodes<br />

control line setting <strong>and</strong>, 264<br />

defined, 82, 262<br />

OpenGL, C-13<br />

OpenMP (Open MultiProcessing), 520,<br />

540<br />

Operands, 66–73. See also Instructions<br>

32-bit immediate, 112–113<br />

adding, 179<br />

arithmetic instructions, 66<br />

compiling assignment when in<br />

memory, 69<br />

constant, 72–73<br />

division, 189<br />

floating-point, 212<br />

memory, 68–69<br />

MIPS, 64<br />

multiplication, 183<br />

shifting, 148<br />

Operating systems<br />

brief history, OL5.17-9–5.17-12<br />

defined, 13<br />

encapsulation, 22<br />

in networking, OL6.9-4–6.9-5<br />

Operations<br />

atomic, implementing, 121<br />

hardware, 63–66<br />

logical, 87–89<br />

x86 integer, 152, 154–155<br />

Optimization<br />

class explanation, OL2.15-14<br />

compiler, 141<br />

control implementation, D-27–28<br />

global, OL2.15-5<br />

high-level, OL2.15-4–2.15-5<br />

local, OL2.15-5, OL2.15-8<br />

manual, 144<br />

or (OR), 64<br />

OR operation, 89, A-55, B-6<br />

ori (Or Immediate), 64<br />

Out-of-order execution<br />

defined, 341<br />

performance complexity, 416<br />

processors, 344<br />

Output devices, 16<br />

Overflow<br />

defined, 74, 198<br />

detection, 180<br />

exceptions, 329<br />

floating-point, 198<br />

occurrence, 75<br />

saturation <strong>and</strong>, 181<br />

subtraction, 179<br />

P<br />

P+Q redundancy (RAID 6), OL5.11-7<br />

Packed floating-point format, 224<br />

Page faults, 434. See also Virtual memory<br />

for data access, 450<br />

defined, 428<br />

handling, 429, 446–453<br>

virtual address causing, 449, 450<br />

Page tables, 456<br />

defined, 432<br />

illustrated, 435<br />

indexing, 432<br />

inverted, 436<br />

levels, 436–437<br />

main memory, 437<br />

register, 432<br />

storage reduction techniques, 436–437<br />

updating, 432<br />

VMM, 452<br />

Pages. See also Virtual memory<br />

defined, 428<br />

dirty, 437<br />

finding, 432–434<br />

LRU, 434<br />

offset, 429<br />

physical number, 429<br />

placing, 432–434<br>

size, 430<br />

virtual number, 429<br />

Parallel bus, OL6.9-3<br />

Parallel execution, 121<br />

Parallel memory system, C-36–41. See<br />

also Graphics processing units<br />

(GPUs)<br />

caches, C-38<br />

constant memory, C-40<br />

DRAM considerations, C-37–38<br />

global memory, C-39<br />

load/store access, C-41<br />

local memory, C-40<br />

memory spaces, C-39<br />

MMU, C-38–39<br />

ROP, C-41<br />

shared memory, C-39–40<br />

surfaces, C-41<br />

texture memory, C-40<br />

Parallel processing programs, 502–507<br />

creation difficulty, 502–507<br />

defined, 501<br />

for message passing, 519–520<br />

great debates in, OL6.15-5<br />

for shared address space, 519–520<br />

use of, 559<br />

Parallel reduction, C-62<br />

Parallel scan, C-60–63<br />

CUDA template, C-61<br />

inclusive, C-60<br />

tree-based, C-62<br />

Parallel software, 501<br />

Parallelism, 12, 43, 332–344<br />

<strong>and</strong> computer arithmetic, 222–223<br>

data-level, 233, 508<br />

debates, OL6.15-5–6.15-7<br />

GPUs <strong>and</strong>, 523, C-76<br />

instruction-level, 43, 332, 343<br />

memory hierarchies <strong>and</strong>, 466–470,<br />

OL5.11-2<br />

multicore <strong>and</strong>, 517<br />

multiple issue, 332–339<br />

multithreading <strong>and</strong>, 517<br />

performance benefits, 44–45<br />

process-level, 500<br />

redundant arrays of inexpensive<br>

disks, 470<br />

subword, E-17<br />

task, C-24<br />

task-level, 500<br />

thread, C-22<br />

Paravirtualization, 482<br />

PA-RISC, E-14, E-17<br />

branch vectored, E-35<br />

conditional branches, E-34, E-35<br />

debug instructions, E-36<br />

decimal operations, E-35<br />

extract <strong>and</strong> deposit, E-35<br />

instructions, E-34–36<br />

load <strong>and</strong> clear instructions, E-36<br />

multiply/add <strong>and</strong> multiply/subtract,<br />

E-36<br />

nullification, E-34<br />

nullifying branch option, E-25<br />

store bytes short, E-36<br />

synthesized multiply <strong>and</strong> divide,<br />

E-34–35<br />

Parity, OL5.11-5<br />

bits, 421<br />

code, 420, B-65<br />

PARSEC (Princeton Application<br />

Repository for Shared Memory<br />

Computers), 540<br>

Pass transistor, B-63<br />

PCI-Express (PCIe), 537, C-8, OL6.9-2<br />

PC-relative addressing, 114, 116<br />

Peak floating-point performance, 542<br />

Pentium bug morality play, 231–232<br />

Performance, 28–36<br />

assessing, 28<br />

classic CPU equation, 36–40<br />

components, 38<br />

CPU, 33–35<br />

defining, 29–32<br />

equation, using, 36<br />

improving, 34–35<br />

instruction, 35–36<br />

measuring, 33–35, OL1.12-10<br />

program, 39–40<br />

ratio, 31<br />

relative, 31–32<br />

response time, 30–31<br />

sorting, C-54–55<br />

throughput, 30–31<br />

time measurement, 32<br />

Personal computers (PCs), 7<br />

defined, 5<br />

Personal mobile device (PMD)<br />

defined, 7<br />

Petabyte, 6<br />

Physical addresses, 428<br />

mapping to, 428–429<br />

space, 517, 521<br />

Physically addressed caches, 443<br />

Pipeline registers<br />

before forwarding, 309<br />

dependences, 308<br />

forwarding unit selection, 312<br />

Pipeline stalls, 280<br />

avoiding with code reordering, 280<br />

data hazards <strong>and</strong>, 313–316<br />

insertion, 315<br />

load-use, 318<br />

as solution to control hazards, 282<br />

Pipelined branches, 319<br />

Pipelined control, 300–303. See also<br />

Control<br />

control lines, 300, 303<br />

overview illustration, 316<br />

specifying, 300<br />

Pipelined datapaths, 286–303<br />

with connected control signals, 304<br />

with control signals, 300–303<br />

corrected, 296<br />

illustrated, 289<br />

in load instruction stages, 296<br />

Pipelined dependencies, 305<br />

Pipelines<br />

branch instruction impact, 317<br />

effectiveness, improving, OL4.16-4–4.16-5<br>

execute <strong>and</strong> address calculation stage,<br />

290, 292<br />

five-stage, 274, 290, 299<br />

graphic representation, 279, 296–300<br />

instruction decode <strong>and</strong> register file<br />

read stage, 289, 292<br />

instruction fetch stage, 290, 292<br />

instructions sequence, 313<br />

latency, 286<br />

memory access stage, 290, 292<br />

multiple-clock-cycle diagrams,<br />

296–297<br />

performance bottlenecks, 343<br />

single-clock-cycle diagrams, 296–297<br />

stages, 274<br />

static two-issue, 335<br />

write-back stage, 290, 294<br />

Pipelining, 12, 272–286<br />

advanced, 343–344<br />

benefits, 272<br />

control hazards, 281–282<br />

data hazards, 278<br>

exceptions <strong>and</strong>, 327–332<br />

execution time <strong>and</strong>, 286<br />

fallacies, 355–356<br />

hazards, 277–278<br />

instruction set design for, 277<br />

laundry analogy, 273<br />

overview, 272–286<br />

paradox, 273<br />

performance improvement, 277<br />

pitfall, 355–356<br />

simultaneous executing instructions,<br />

286<br />

speed-up formula, 273<br />

structural hazards, 277, 294<br />

summary, 285<br />

throughput <strong>and</strong>, 286<br />

Pitfalls. See also Fallacies<br />

address space extension, 479<br />

arithmetic, 229–232<br />

associativity, 479<br />

defined, 49<br />

GPUs, C-74–75<br />

ignoring memory system behavior, 478<br />

memory hierarchies, 478–482<br />

out-of-order processor evaluation, 479<br />

performance equation subset, 50–51<br />

pipelining, 355–356<br />

pointer to automatic variables, 160<br />

sequential word addresses, 160<br />

simulating cache, 478<br />

software development with<br />

multiprocessors, 556<br />

VMM implementation, 481, 481–482<br />

Pixel shader example, C-15–17<br />

Pixels, 18<br />

Pointers<br />

arrays versus, 141–145<br />

frame, 103<br />

global, 102<br />

incrementing, 143<br />

Java, OL2.15-26<br />

stack, 98, 102<br />

Polling, OL6.9-8<br />

Pop, 98<br />

Power<br />

clock rate <strong>and</strong>, 40<br />

critical nature of, 53<br />

efficiency, 343–344<br />

relative, 41<br />

PowerPC<br />

algebraic right shift, E-33<br />

branch registers, E-32–33<br />

condition codes, E-12<br />

instructions, E-12–13<br />

instructions unique to, E-31–33<br />

load multiple/store multiple, E-33<br />

logical shifted immediate, E-33<br />

rotate with mask, E-33<br />

Precise interrupts, 332<br />

Prediction, 12<br />

2-bit scheme, 322<br />

accuracy, 321, 324<br />

dynamic branch, 321–323<br />

loops <strong>and</strong>, 321–323<br />

steady-state, 321<br />

Prefetching, 482, 544<br />

Primitive types, OL2.15-26<br />

Procedure calls<br />

convention, A-22–33<br />

examples, A-27–33<br />

frame, A-23<br />

preservation across, 102<br />

Procedures, 96–106<br />

compiling, 98<br />

compiling, showing nested procedure<br />

linking, 101–102<br />

execution steps, 96<br />

frames, 103<br />

leaf, 100<br />

nested, 100–102<br />

recursive, 105, A-26–27<br />

for setting arrays to zero, 142<br />

sort, 135–139<br />

strcpy, 108–109<br />

string copy, 108–109<br />

swap, 133<br />

Process identifiers, 446<br />

Process-level parallelism, 500<br />

Processors, 242–356<br />

as cores, 43<br />

control, 19<br />

datapath, 19<br />

defined, 17, 19<br />

dynamic multiple-issue, 333<br />

multiple-issue, 333<br />

out-of-order execution, 344, 416<br />

performance growth, 44<br />

ROP, C-12, C-41<br />

speculation, 333–334<br />

static multiple-issue, 333, 334–339<br />

streaming, C-34<br />

superscalar, 339, 515–516, OL4.16-5<br />

technologies for building, 24–28<br />

two-issue, 336–337<br />

vector, 508–510<br />

VLIW, 335<br />

Product, 183<br />

Product of sums, B-11<br />

Program counters (PCs), 251<br />

changing with conditional branch, 324<br />

defined, 98, 251<br />

exception, 445, 447<br />

incrementing, 251, 253<br />

instruction updates, 289<br />

Program libraries, A-4<br />

Program performance<br />

elements affecting, 39<br />

understanding, 9<br>

Programmable array logic (PAL), B-78<br />

Programmable logic arrays (PLAs)<br />

component dots illustration, B-16<br />

control function implementation, D-7,<br />

D-20–21<br />

defined, B-12<br />

example, B-13–14<br />

illustrated, B-13<br />

ROMs <strong>and</strong>, B-15–16<br />

size, D-20<br />

truth table implementation, B-13<br />

Programmable logic devices (PLDs), B-78<br />

Programmable ROMs (PROMs), B-14<br />

Programming languages. See also specific<br />

languages<br />

brief history of, OL2.21-7–2.21-8<br />

object-oriented, 145<br />

variables, 67<br />

Programs<br />

assembly language, 123<br />

Java, starting, 131–132<br />

parallel processing, 502–507<br />

starting, 123–132<br />

translating, 123–132<br />

Propagate<br />

defined, B-40<br />

example, B-44<br />

super, B-41<br />

Protected keywords, OL2.15-21<br />

Protection<br />

defined, 428<br />

implementing, 444–446<br />

mechanisms, OL5.17-9<br />

VMs for, 424<br />

Protection group, OL5.11-5<br />

Pseudo MIPS<br />

defined, 233<br>

instruction set, 235<br />

Pseudodirect addressing, 116<br />

Pseudoinstructions<br />

defined, 124<br />

summary, 125<br />

Pthreads (POSIX threads), 540<br />

PTX instructions, C-31, C-32<br />

Public keywords, OL2.15-21<br />

Push<br />

defined, 98<br />

using, 100<br />

Q<br />

Quad words, 154<br />

Quicksort, 411, 412<br />

Quotient, 189<br />

R<br />

Race, B-73<br />

Radix sort, 411, 412, C-63–65<br />

CUDA code, C-64<br />

implementation, C-63–65<br />

RAID. See Redundant arrays of<br>

inexpensive disks (RAID)<br />

RAM, 9<br />

Raster operation (ROP) processors, C-12,<br />

C-41, C-50–51<br />

fixed function, C-41<br />

Raster refresh buffer, 18<br />

Rasterization, C-50<br />

Ray casting (RC), 552<br />

Read-only memories (ROMs), B-14–16<br />

control entries, D-16–17<br />

control function encoding, D-18–19<br />

dispatch, D-25<br />

implementation, D-15–19<br />

logic function encoding, B-15<br />

overhead, D-18<br />

PLAs <strong>and</strong>, B-15–16<br />

programmable (PROM), B-14<br />

total size, D-16<br />

Read-stall cycles, 399<br />

Read-write head, 381<br />

Receive message routine, 529<br />

Receiver Control register, A-39<br />

Receiver Data register, A-38, A-39<br />

Recursive procedures, 105, A-26–27. See<br />

also Procedures<br />

clone invocation, 100<br />

stack in, A-29–30<br />

Reduced instruction set computer (RISC)<br />

architectures, E-2–45, OL2.21-5,<br />

OL4.16-4. See also Desktop <strong>and</strong><br />

server RISCs; Embedded RISCs<br />

group types, E-3–4<br />

instruction set lineage, E-44<br />

Reduction, 519<br />

Redundant arrays of inexpensive disks<br />

(RAID), OL5.11-2–5.11-8<br />

history, OL5.11-8<br />

RAID 0, OL5.11-4<br />

RAID 1, OL5.11-5<br />

RAID 2, OL5.11-5<br />

RAID 3, OL5.11-5<br />

RAID 4, OL5.11-5–5.11-6<br />

RAID 5, OL5.11-6–5.11-7<br />

RAID 6, OL5.11-7<br />

spread of, OL5.11-6<br />

summary, OL5.11-7–5.11-8<br />

use statistics, OL5.11-7<br />

Reference bit, 435<br />

References<br />

absolute, 126<br />

forward, A-11<br />

types, OL2.15-26<br />

unresolved, A-4, A-18<br />

Register addressing, 116<br />

Register allocation, OL2.15-11–2.15-13<br />

Register files, B-50, B-54–56<br />

defined, 252, B-50, B-54<br />

in behavioral Verilog, B-57<br />

single, 257<br />

two read ports implementation, B-55<br />

with two read ports/one write port,<br />

B-55<br />

write port implementation, B-56<br />

Register-memory architecture, OL2.21-3<br />

Registers, 152, 153–154<br />

architectural, 325–332<br />

base, 69<br />

callee-saved, A-23<br />

caller-saved, A-23<br />

Cause, A-35<br />

clock cycle time <strong>and</strong>, 67<br />

compiling C assignment with, 67–68<br />

Count, A-34<br />

defined, 66<br />

destination, 83, 262<br />

floating-point, 217<br />

left half, 290<br />

mapping, 80<br />

MIPS conventions, 105<br />

number specification, 252<br />

page table, 432<br />

pipeline, 308, 309, 312<br />

primitives, 66<br />

Receiver Control, A-39<br />

Receiver Data, A-38, A-39<br />

renaming, 338<br />

right half, 290<br />

spilling, 71<br />

Status, 327, A-35<br />

temporary, 67, 99<br />

Transmitter Control, A-39–40<br />

Transmitter Data, A-40<br />

usage convention, A-24<br />

use convention, A-22<br />

variables, 67<br />

Relative performance, 31–32<br />

Relative power, 41<br />

Reliability, 418<br />

Relocation information, A-13, A-14<br />

Remainder<br />

defined, 189<br />

instructions, A-55<br />

Reorder buffers, 343<br />

Replication, 468<br />

Requested word first, 392<br />

Request-level parallelism, 532<br />

Reservation stations<br />

buffering operands in, 340–341<br>

defined, 339–340<br />

Response time, 30–31<br />

Restartable instructions, 448<br />

Return address, 97<br />

Return from exception (ERET), 445<br />

R-format, 262<br />

ALU operations, 253<br />

defined, 83<br />

Ripple carry<br />

adder, B-29<br />

carry lookahead speed versus, B-46<br />

Roofline model, 542–543, 544, 545<br />

with ceilings, 546, 547<br />

computational roofline, 545<br />

illustrated, 542<br />

Opteron generations, 543, 544<br />

with overlapping areas shaded, 547<br />

peak floating-point performance,<br />

542<br />

peak memory performance, 543<br />

with two kernels, 547<br />

Rotational delay. See Rotational latency<br>

Rotational latency, 383<br>

Rounding, 218<br />

accurate, 218<br />

bits, 220<br />

with guard digits, 219<br />

IEEE 754 modes, 219<br />

Row-major order, 217, 413<br />

R-type instructions, 252<br />

datapath for, 264–265<br />

datapath in operation for, 266<br />

S<br />

Saturation, 181<br />

sb (Store Byte), 64<br />

sc (Store Conditional), 64<br />

ScaLAPACK, 230<br>

Scaling<br />

strong, 505, 507<br />

weak, 505<br />

Scientific notation<br />

adding numbers in, 203<br />

defined, 196<br />

for reals, 197<br />

Search engines, 4<br />

Secondary memory, 23<br />

Sectors, 381<br />

Seek, 382<br />

Segmentation, 431<br />

Selector values, B-10<br />

Semiconductors, 25<br />

Send message routine, 529<br />

Sensitivity list, B-24<br />

Sequencers<br />

explicit, D-32<br />

implementing next-state function with,<br />

D-22–28<br />

Sequential logic, B-5<br />

Servers, OL5. See also Desktop <strong>and</strong> server<br />

RISCs<br />

cost <strong>and</strong> capability, 5<br />

Service accomplishment, 418<br />

Service interruption, 418<br />

Set instructions, 93<br />

Set-associative caches, 403. See also<br />

Caches<br />

address portions, 407<br />

block replacement strategies, 457<br />

choice of, 456<br />

four-way, 404, 407<br />

memory-block location, 403<br />

misses, 405–406<br />

n-way, 403<br />

two-way, 404<br />

Setup time, B-53, B-54<br />

sh (Store Halfword), 64<br />

Shaders<br />

defined, C-14<br />

floating-point arithmetic, C-14<br />

graphics, C-14–15<br />

pixel example, C-15–17<br />

Shading languages, C-14<br />

Shadowing, OL5.11-5<br />

Shared memory. See also Memory<br />

as low-latency memory, C-21<br />

caching in, C-58–60<br />

CUDA, C-58<br />

N-body <strong>and</strong>, C-67–68<br />

per-CTA, C-39<br />

SRAM banks, C-40<br />

Shared memory multiprocessors (SMP),<br />

517–521<br />

defined, 501, 517<br />

single physical address space, 517<br />

synchronization, 518<br />

Shift amount, 82<br />

Shift instructions, 87, A-55–56<br />

Sign <strong>and</strong> magnitude, 197<br />

Sign bit, 76<br />

Sign extension, 254<br />

defined, 76<br />

shortcut, 78<br />

Signals<br />

asserted, 250, B-4<br />

control, 250, 263–264<br />

deasserted, 250, B-4<br />

Signed division, 192–194<br />

Signed multiplication, 187<br />

Signed numbers, 73–78<br />

sign <strong>and</strong> magnitude, 75<br />

treating as unsigned, 94–95<br />

Signific<strong>and</strong>s, 198<br />

addition, 203<br />

multiplication, 206<br />

Silicon, 25<br />

as key hardware technology, 53<br />

crystal ingot, 26<br />

defined, 26<br />

wafers, 26<br />

Silicon crystal ingot, 26<br />

SIMD (Single Instruction Multiple Data),<br />

507–508, 558<br />

computers, OL6.15-2–6.15-4<br />

data vector, C-35<br />

extensions, OL6.15-4<br />

for loops <strong>and</strong>, OL6.15-3<br />

massively parallel multiprocessors,<br />

OL6.15-2<br />

small-scale, OL6.15-4<br />

vector architecture, 508–510<br />

in x86, 508<br />

SIMMs (single inline memory modules),<br />

OL5.17-5, OL5.17-6<br />

Simple programmable logic devices<br />

(SPLDs), B-78<br />

Simplicity, 161<br />

Simultaneous multithreading (SMT),<br />

515–517<br />

support, 515<br />

thread-level parallelism, 517<br />

unused issue slots, 515<br />

Single error correcting/Double error<br />

detecting (SEC/DED), 420–422<br>

Single instruction single data (SISD), 507<br />

Single precision. See also Double<br />

precision<br />

binary representation, 201<br />

defined, 198<br />

Single-clock-cycle pipeline diagrams,<br />

296–297<br />

illustrated, 299<br />

Single-cycle datapaths. See also Datapaths<br />

illustrated, 287<br />

instruction execution, 288<br />

Single-cycle implementation<br />

control function for, 269<br />

defined, 270<br />

nonpipelined execution versus<br />

pipelined execution, 276<br />

non-use of, 271–272<br />

penalty, 271–272<br />

pipelined performance versus, 274<br />

Single-instruction multiple-thread<br />

(SIMT), C-27–30<br />

overhead, C-35<br />

multithreaded warp scheduling, C-28<br />

processor architecture, C-28<br />

warp execution <strong>and</strong> divergence,<br />

C-29–30<br />

Single-program multiple data (SPMD),<br />

C-22<br />

sll (Shift Left Logical), 64<br />

slt (Set Less Than), 64<br />

slti (Set Less Than Imm.), 64<br>

sltiu (Set Less Than Imm. Unsigned), 64<br>

sltu (Set Less Than Unsig.), 64<br />

Smalltalk-80, OL2.21-8<br />

Smart phones, 7<br />

Snooping protocol, 468–470<br />

Snoopy cache coherence, OL5.12-7<br />

Software<br>

layers, 13<br>

multiprocessor, 500<br>

parallel, 501<br>

as service, 7, 532, 558<br>

systems, 13<br>

Software optimization<br>

via blocking, 413–418<br>

Sort algorithms, 141<br>

Sort procedure, 135–139. See also<br />

Procedures<br />

code for body, 135–137<br />

full procedure, 138–139<br />

passing parameters in, 138<br />

preserving registers in, 138<br />

procedure call, 137<br />

register allocation for, 135<br />

Sorting performance, C-54–55<br />

Source files, A-4<br />

Source language, A-6<br />

Space allocation<br />

on heap, 104–106<br />

on stack, 103<br />

SPARC<br />

annulling branch, E-23<br />

CASA, E-31<br />

conditional branches, E-10–12<br />

fast traps, E-30<br />

floating-point operations, E-31<br />

instructions, E-29–32<br />

least significant bits, E-31<br />

multiple precision floating-point<br />

results, E-32<br />

nonfaulting loads, E-32<br />

overlapping integer operations, E-31<br />

quadruple precision floating-point<br />

arithmetic, E-32<br />

register windows, E-29–30<br />

support for LISP <strong>and</strong> Smalltalk, E-30<br />

Sparse matrices, C-55–58<br />

Sparse Matrix-Vector multiply (SpMV),<br />

C-55, C-57, C-58<br />

CUDA version, C-57<br />

serial code, C-57<br />

shared memory version, C-59<br />

Spatial locality, 374<br />

large block exploitation of, 391<br />

tendency, 378<br />

SPEC, OL1.12-11–1.12-12<br />

CPU benchmark, 46–48<br />

power benchmark, 48–49<br />

SPEC2000, OL1.12-12<br />

SPEC2006, 233, OL1.12-12<br />

SPEC89, OL1.12-11<br />

SPEC92, OL1.12-12<br />

SPEC95, OL1.12-12<br />

SPECrate, 538–539<br />

SPECratio, 47<br />

Special function units (SFUs), C-35, C-50<br />

defined, C-43<br />

Speculation, 333–334<br />

hardware-based, 341<br />

implementation, 334<br />

performance <strong>and</strong>, 334<br />

problems, 334<br />

recovery mechanism, 334<br />

Speed-up challenge, 503–505<br />

balancing load, 505–506<br />

bigger problem, 504–505<br />

Spilling registers, 71, 98<br />

SPIM, A-40–45<br />

byte order, A-43<br />

features, A-42–43<br />

getting started with, A-42<br />

MIPS assembler directives support,<br />

A-47–49<br />

speed, A-41<br />

system calls, A-43–45<br />

versions, A-42<br />

virtual machine simulation, A-41–42<br />

Split algorithm, 552<br />

Split caches, 397<br />

Square root instructions, A-79<br />

sra (Shift Right Arith.), A-56<br />

srl (Shift Right Logical), 64<br />

Stack architectures, OL2.21-4<br />

Stack pointers<br />

adjustment, 100<br />

defined, 98<br />

values, 100<br />

Stack segment, A-22<br />

Stacks<br />

allocating space on, 103<br />

for arguments, 140<br />

defined, 98<br />

pop, 98<br />

push, 98, 100<br />

recursive procedures, A-29–30<br />

Stalls, 280<br />

as solution to control hazard, 282<br />

avoiding with code reordering, 280<br />

behavioral Verilog with detection,<br />

OL4.13-6–4.13-8<br />

data hazards <strong>and</strong>, 313–316<br />

illustrations, OL4.13-23, OL4.13-30<br />

insertion into pipeline, 315<br />

load-use, 318<br />

memory, 400<br />

write-back scheme, 399<br />

write buffer, 399<br />

Standby spares, OL5.11-8<br>

State<br />

in 2-bit prediction scheme, 322<br />

assignment, B-70, D-27<br />

bits, D-8<br />

exception, saving/restoring, 450<br />

logic components, 249<br />

specification of, 432<br />

State elements<br />

clock <strong>and</strong>, 250<br />

combinational logic <strong>and</strong>, 250<br />

defined, 248, B-48<br />

inputs, 249<br />

in storing/accessing instructions,<br />

252<br />

register file, B-50<br />

Static branch prediction, 335<br />

Static data<br />

as dynamic data, A-21<br />

defined, A-20<br />

segment, 104<br />

Static multiple-issue processors, 333,<br />

334–339. See also Multiple issue<br />

control hazards <strong>and</strong>, 335–336<br />

instruction sets, 335<br />

with MIPS ISA, 335–338<br />

Static r<strong>and</strong>om access memories (SRAMs),<br />

378, 379, B-58–62<br />

array organization, B-62<br />

basic structure, B-61<br />

defined, 21, B-58<br />

fixed access time, B-58<br />

large, B-59<br />

read/write initiation, B-59<br />

synchronous (SSRAMs), B-60<br />

three-state buffers, B-59, B-60<br />

Static variables, 102<br>

Status register<br />

fields, A-34, A-35<br />

Steady-state prediction, 321<br />

Sticky bits, 220<br />

Store buffers, 343<br />

Store instructions. See also Load<br />

instructions<br />

access, C-41<br />

base register, 262<br />

block, 149<br />

compiling with, 71<br />

conditional, 122<br />

defined, 71<br />

details, A-68–70<br />

EX stage, 294<br />

floating-point, A-79<br />

ID stage, 291<br />

IF stage, 291<br />

instruction dependency, 312<br />

list of, A-68–70<br />

MEM stage, 295<br />

unit for implementing, 255<br />

WB stage, 295<br />

Store word, 71<br />

Stored program concept, 63<br />

as computer principle, 86<br />

illustrated, 86<br />

principles, 161<br />

Strcpy procedure, 108–109. See also<br />

Procedures<br />

as leaf procedure, 109<br />

pointers, 109<br />

Stream benchmark, 548<br />

Streaming multiprocessor (SM), C-48–49<br />

Streaming processors, C-34, C-49–50<br />

array (SPA), C-41, C-46<br />

Streaming SIMD Extension 2 (SSE2)<br />

floating-point architecture, 224<br />

Streaming SIMD Extensions (SSE) <strong>and</strong><br />

advanced vector extensions in x86,<br />

224–225<br />

Stretch computer, OL4.16-2<br />

Strings<br />

defined, 107<br />

in Java, 109–111<br />

representation, 107<br />

Strip mining, 510<br />

Striping, OL5.11-4<br />

Strong scaling, 505, 517<br />

Structural hazards, 277, 294<br />

sub (Subtract), 64<br />

sub.d (FP Subtract Double), A-79<br />

sub.s (FP Subtract Single), A-80<br />

Subnormals, 222<br />

Subtraction, 178–182. See also Arithmetic<br />

binary, 178–179<br />

floating-point, 211, A-79–80<br />

instructions, A-56–57<br />

negative number, 179<br />

overflow, 179<br />

subu (Subtract Unsigned), 119<br />

Subword parallelism, 222–223, 352, E-17<br />

<strong>and</strong> matrix multiply, 225–228<br />

Sum of products, B-11, B-12<br />

Supercomputers, OL4.16-3<br />

defined, 5<br />

SuperH, E-15, E-39–40<br />

Superscalars<br />

defined, 339, OL4.16-5<br />

dynamic pipeline scheduling, 339<br />

multithreading options, 516<br />

Surfaces, C-41<br />

sw (Store Word), 64<br />

Swap procedure, 133. See also Procedures<br />

body code, 135<br />

full, 135, 138–139<br />

register allocation, 133<br />

Swap space, 434<br />

swc1 (Store FP Single), A-73<br />

Symbol tables, 125, A-12, A-13<br />

Synchronization, 121–123, 552<br />

barrier, C-18, C-20, C-34<br />

defined, 518<br />

lock, 121<br />

overhead, reducing, 44–45<br />

unlock, 121<br />

Synchronizers<br />

defined, B-76<br />

failure, B-77<br />

from D flip-flop, B-76<br />

Synchronous DRAM (SDRAM), 379–380,<br>

B-60, B-65<br />

Synchronous SRAM (SSRAM), B-60<br />

Synchronous system, B-48<br />

Syntax tree, OL2.15-3<br />

System calls, A-43–45<br />

code, A-43–44<br />

defined, 445<br />

loading, A-43<br />

Systems software, 13<br />

SystemVerilog<br />

cache controller, OL5.12-2<br />

cache data <strong>and</strong> tag modules, OL5.12-6<br>

FSM, OL5.12-7<br>

simple cache block diagram, OL5.12-4<br>

type declarations, OL5.12-2<br>

T<br>

Tablets, 7<br>

Tags<br />

defined, 384<br />

in locating block, 407<br />

page tables <strong>and</strong>, 434<br />

size of, 409<br />

Tail call, 105–106<br />

Task identifiers, 446<br />

Task parallelism, C-24<br />

Task-level parallelism, 500<br />

Tebibyte (TiB), 5<br />

Tesla PTX ISA, C-31–34<br>

arithmetic instructions, C-33<br />

barrier synchronization, C-34<br />

GPU thread instructions, C-32<br />

memory access instructions, C-33–34<br />

Temporal locality, 374<br />

tendency, 378<br />

Temporary registers, 67, 99<br />

Terabyte (TB), 6<br>

defined, 5<br />

Text segment, A-13<br />

Texture memory, C-40<br />

Texture/processor cluster (TPC),<br />

C-47–48<br />

TFLOPS multiprocessor, OL6.15-6<br />

Thrashing, 453<br />

Thread blocks, 528<br />

creation, C-23<br />

defined, C-19<br />

managing, C-30<br />

memory sharing, C-20<br />

synchronization, C-20<br />

Thread parallelism, C-22<br />

Threads<br />

creation, C-23<br />

CUDA, C-36<br />

ISA, C-31–34<br />

managing, C-30<br />

memory latencies <strong>and</strong>, C-74–75<br />

multiple, per body, C-68–69<br />

warps, C-27<br />

Three Cs model, 459–461<br />

Three-state buffers, B-59, B-60<br>

Throughput<br />

defined, 30–31<br />

multiple issue <strong>and</strong>, 342<br />

pipelining <strong>and</strong>, 286, 342<br />

Thumb, E-15, E-38<br />

Timing<br />

asynchronous inputs, B-76–77<br />

level-sensitive, B-75–76<br />

methodologies, B-72–77<br />

two-phase, B-75<br />

TLB misses, 439. See also Translation-lookaside buffer (TLB)<br>

entry point, 449<br />

handler, 449<br>

handling, 446–453<br>

occurrence, 446<br />

problem, 453<br />

Tomasulo’s algorithm, OL4.16-3<br />

Touchscreen, 19<br />

Tournament branch predictors, 324<br>

Tracks, 381–382<br />

Transfer time, 383<br />

Transistors, 25<br />

Translation-lookaside buffer (TLB),<br />

438–439, E-26–27, OL5.17-6. See<br />

also TLB misses<br />

associativities, 439<br />

illustrated, 438<br />

integration, 440–441<br />

Intrinsity FastMATH, 440<br />

typical values, 439<br />

Transmit driver <strong>and</strong> NIC hardware time<br />

versus receive driver <strong>and</strong> NIC hardware<br>

time, OL6.9-8<br />

Transmitter Control register, A-39–40<br />

Transmitter Data register, A-40<br />

Trap instructions, A-64–66<br />

Tree-based parallel scan, C-62<br />

Truth tables, B-5<br />

ALU control lines, D-5<br />

for control bits, 260–261<br />

datapath control outputs, D-17<br />

datapath control signals, D-14<br />

defined, 260<br />

example, B-5<br />

next-state output bits, D-15<br />

PLA implementation, B-13<br />

Two’s complement representation, 75–76<br />

advantage, 75–76<br />

negation shortcut, 76<br />

rule, 79<br />

sign extension shortcut, 78<br />

Two-level logic, B-11–14<br />

Two-phase clocking, B-75<br />

TX-2 computer, OL6.15-4<br />

U<br />

Unconditional branches, 91<br />

Underflow, 198<br />

Unicode<br />

alphabets, 109<br />

defined, 110<br />

example alphabets, 110<br />

Unified GPU architecture, C-10–12<br />

illustrated, C-11<br />

processor array, C-11–12<br />

Uniform memory access (UMA), 518,<br />

C-9<br />

multiprocessors, 519<br />

Units<br />

commit, 339–340, 343<br />

control, 247–248, 259–261, D-4–8,<br />

D-10, D-12–13<br />

defined, 219<br />

floating point, 219<br />

hazard detection, 313, 314–315<br />

for load/store implementation, 255<br />

special function (SFUs), C-35, C-43,<br />

C-50<br />

UNIVAC I, OL1.12-5<br />

UNIX, OL2.21-8, OL5.17-9–5.17-12<br />

AT&T, OL5.17-10<br />

Berkeley version (BSD), OL5.17-10<br />

genius, OL5.17-12<br />

history, OL5.17-9–5.17-12<br />

Unlock synchronization, 121<br />

Unresolved references<br />

defined, A-4<br />

linkers <strong>and</strong>, A-18<br />

Unsigned numbers, 73–78<br />

Use latency<br />

defined, 336–337<br />

one-instruction, 336–337<br />

V<br />

Vacuum tubes, 25<br />

Valid bit, 386<br />

Variables<br />

C language, 102<br />

programming language, 67<br />

register, 67<br />

static, 102<br />

storage class, 102<br />

type, 102<br />

VAX architecture, OL2.21-4, OL5.17-7<br />

Vector lanes, 512<br />

Vector processors, 508–510. See also<br />

Processors<br />

conventional code comparison,<br />

509–510<br />

instructions, 510<br />

multimedia extensions <strong>and</strong>, 511–512<br />

scalar versus, 510–511<br />

Vectored interrupts, 327<br />

Verilog<br />

behavioral definition of MIPS ALU,<br />

B-25<br />

behavioral definition with bypassing,<br />

OL4.13-4–4.13-6<br />

behavioral definition with stalls for<br />

loads, OL4.13-6–4.13-8<br />

behavioral specification, B-21, OL4.13-2–4.13-4<br>

behavioral specification of multicycle<br />

MIPS design, OL4.13-12–4.13-13<br />

behavioral specification with<br />

simulation, OL4.13-2<br />

behavioral specification with stall<br />

detection, OL4.13-6–4.13-8<br />

behavioral specification with synthesis,<br />

OL4.13-11–4.13-16<br />

blocking assignment, B-24<br />

branch hazard logic implementation,<br />

OL4.13-8–4.13-10<br />

combinational logic, B-23–26<br />

datatypes, B-21–22<br />

defined, B-20<br />

forwarding implementation,<br />

OL4.13-4<br />

MIPS ALU definition in, B-35–38<br />

modules, B-23<br />

multicycle MIPS datapath, OL4.13-14<br />

nonblocking assignment, B-24<br />

operators, B-22<br />

program structure, B-23<br />

reg, B-21–22<br />

sensitivity list, B-24<br />

sequential logic specification, B-56–58<br />

structural specification, B-21<br />

wire, B-21–22<br />

Vertical microcode, D-32<br>

Very large-scale integrated (VLSI)<br />

circuits, 25<br />

Very Long Instruction Word (VLIW)<br />

defined, 334–335<br />

first generation computers, OL4.16-5<br />

processors, 335<br />

VHDL, B-20–21<br />

Video graphics array (VGA) controllers,<br />

C-3–4<br />

Virtual addresses<br />

causing page faults, 449<br />

defined, 428<br />

mapping from, 428–429<br />

size, 430<br />

Virtual machine monitors (VMMs)<br />

defined, 424<br />

implementing, 481, 481–482<br />

laissez-faire attitude, 481<br />

page tables, 452<br />

in performance improvement, 427<br />

requirements, 426<br />

Virtual machines (VMs), 424–427<br />

benefits, 424<br />

defined, A-41<br />

illusion, 452<br />

instruction set architecture support,<br />

426–427<br />

performance improvement, 427<br />

for protection improvement, 424<br />

simulation of, A-41–42<br />

Virtual memory, 427–454. See also Pages<br />

address translation, 429, 438–439<br />

integration, 440–441<br />

mechanism, 452–453<br />

motivations, 427–428<br />

page faults, 428, 434<br />

protection implementation,<br />

444–446<br />

segmentation, 431<br />

summary, 452–453<br />

virtualization of, 452<br />

writes, 437<br />

Virtualizable hardware, 426<br />

Virtually addressed caches, 443<br />

Visual computing, C-3<br />

Volatile memory, 22<br />

W<br />

Wafers, 26<br />

defects, 26<br />

dies, 26–27<br />

yield, 27<br />

Warehouse Scale Computers (WSCs), 7,<br>

531–533, 558<br />

Warps, 528, C-27<br />

Weak scaling, 505<br />

Wear levelling, 381<br />

While loops, 92–93<br />

Whirlwind, OL5.17-2<br />

Wide area networks (WANs), 24. See also<br />

Networks<br />

Words<br />

accessing, 68<br />

defined, 66<br />

double, 152<br />

load, 68, 71<br />

quad, 154<br />

store, 71<br />

Working set, 453<br />

World Wide Web, 4<br />

Worst-case delay, 272<br />

Write buffers<br />

defined, 394<br />

stalls, 399<br />

write-back cache, 395<br />

Write invalidate protocols, 468, 469<br />

Write serialization, 467<br />

Write-back caches. See also Caches<br />

advantages, 458<br />

cache coherency protocol, OL5.12-5<br />

complexity, 395<br />

defined, 394, 458<br />

stalls, 399<br />

write buffers, 395<br />

Write-back stage<br />

control line, 302<br />

load instruction, 292<br />

store instruction, 294<br />

Writes<br />

complications, 394<br />

expense, 453<br />

handling, 393–395<br>

memory hierarchy handling of,<br>

457–458<br />

schemes, 394<br />

virtual memory, 437<br />

write-back cache, 394, 395<br />

write-through cache, 394, 395<br />

Write-stall cycles, 400<br />

Write-through caches. See also Caches<br />

advantages, 458<br />

defined, 393, 457<br />

tag mismatch, 394<br />

X<br />

x86, 149–158<br />

Advanced Vector Extensions in, 225<br />

brief history, OL2.21-6<br />

conclusion, 156–158<br />

data addressing modes, 152, 153–154<br />

evolution, 149–152<br />

first address specifier encoding, 158<br />

historical timeline, 149–152<br />

instruction encoding, 155–156<br />

instruction formats, 157<br />

instruction set growth, 161<br />

instruction types, 153<br />

integer operations, 152–155<br />

registers, 152, 153–154<br />

SIMD in, 507–508, 508<br />

Streaming SIMD Extensions in,<br />

224–225<br />

typical instructions/functions, 155<br />

typical operations, 157<br />

Xerox Alto computer, OL1.12-8<br />

XMM, 224<br />

Y<br />

Yahoo! Cloud Serving Benchmark<br />

(YCSB), 540<br />

Yield, 27<br />

YMM, 225<br />

Z<br />

Zettabyte, 6<br>