Computer Organisation and Design (2014)
In Praise of <strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>: The Hardware/<br />
Software Interface, Fifth Edition<br />
“Textbook selection is often a frustrating act of compromise—pedagogy, content<br />
coverage, quality of exposition, level of rigor, cost. <strong>Computer</strong> Organization <strong>and</strong><br />
<strong>Design</strong> is the rare book that hits all the right notes across the board, without<br />
compromise. It is not only the premier computer organization textbook, it is a<br />
shining example of what all computer science textbooks could <strong>and</strong> should be.”<br />
—Michael Goldweber, Xavier University<br />
“I have been using <strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong> for years, from the very<br />
first edition. The new Fifth Edition is yet another outst<strong>and</strong>ing improvement on an<br />
already classic text. The evolution from desktop computing to mobile computing<br />
to Big Data brings new coverage of embedded processors such as the ARM, new<br />
material on how software <strong>and</strong> hardware interact to increase performance, <strong>and</strong><br />
cloud computing. All this without sacrificing the fundamentals.”<br />
—Ed Harcourt, St. Lawrence University<br />
“To Millennials: <strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong> is the computer architecture<br />
book you should keep on your (virtual) bookshelf. The book is both old <strong>and</strong> new,<br />
because it develops venerable principles—Moore's Law, abstraction, common case<br />
fast, redundancy, memory hierarchies, parallelism, <strong>and</strong> pipelining—but illustrates<br />
them with contemporary designs, e.g., ARM Cortex A8 <strong>and</strong> Intel Core i7.”<br />
—Mark D. Hill, University of Wisconsin-Madison<br />
“The new edition of <strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong> keeps pace with advances<br />
in emerging embedded <strong>and</strong> many-core (GPU) systems, where tablets <strong>and</strong><br />
smartphones are quickly becoming our new desktops. This text acknowledges<br />
these changes, but continues to provide a rich foundation of the fundamentals<br />
in computer organization <strong>and</strong> design which will be needed for the designers of<br />
hardware <strong>and</strong> software that power this new class of devices <strong>and</strong> systems.”<br />
—Dave Kaeli, Northeastern University<br />
“The Fifth Edition of <strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong> provides more than an<br />
introduction to computer architecture. It prepares the reader for the changes necessary<br />
to meet the ever-increasing performance needs of mobile systems <strong>and</strong> big data<br />
processing at a time when difficulties in semiconductor scaling are making all systems<br />
power constrained. In this new era for computing, hardware <strong>and</strong> software must be codesigned<br />
<strong>and</strong> system-level architecture is as critical as component-level optimizations.”<br />
—Christos Kozyrakis, Stanford University<br />
“Patterson <strong>and</strong> Hennessy brilliantly address the issues in ever-changing computer<br />
hardware architectures, emphasizing interactions between hardware <strong>and</strong> software<br />
components at various abstraction levels. By interspersing I/O <strong>and</strong> parallelism concepts<br />
with a variety of mechanisms in hardware <strong>and</strong> software throughout the book, the new<br />
edition achieves an excellent holistic presentation of computer architecture for the<br />
PostPC era. This book is an essential guide for hardware <strong>and</strong> software professionals<br />
facing energy-efficiency <strong>and</strong> parallelization challenges, from tablet PCs to cloud computing.”<br />
—Jae C. Oh, Syracuse University
FIFTH EDITION<br />
<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong><br />
THE HARDWARE/SOFTWARE INTERFACE
David A. Patterson has been teaching computer architecture at the University of<br />
California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair<br />
of <strong>Computer</strong> Science. His teaching has been honored by the Distinguished Teaching<br />
Award from the University of California, the Karlstrom Award from ACM, <strong>and</strong> the<br />
Mulligan Education Medal <strong>and</strong> Undergraduate Teaching Award from IEEE. Patterson<br />
received the IEEE Technical Achievement Award <strong>and</strong> the ACM Eckert-Mauchly Award<br />
for contributions to RISC, <strong>and</strong> he shared the IEEE Johnson Information Storage Award<br />
for contributions to RAID. He also shared the IEEE John von Neumann Medal <strong>and</strong><br />
the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the<br />
American Academy of Arts <strong>and</strong> Sciences, the <strong>Computer</strong> History Museum, ACM,<br />
<strong>and</strong> IEEE, <strong>and</strong> he was elected to the National Academy of Engineering, the National<br />
Academy of Sciences, <strong>and</strong> the Silicon Valley Engineering Hall of Fame. He served on<br />
the Information Technology Advisory Committee to the U.S. President, as chair of the<br />
CS division in the Berkeley EECS department, as chair of the Computing Research<br />
Association, <strong>and</strong> as President of ACM. This record led to Distinguished Service Awards<br />
from ACM <strong>and</strong> CRA.<br />
At Berkeley, Patterson led the design <strong>and</strong> implementation of RISC I, likely the first<br />
VLSI reduced instruction set computer, <strong>and</strong> the foundation of the commercial<br />
SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks<br />
(RAID) project, which led to dependable storage systems from many companies.<br />
He was also involved in the Network of Workstations (NOW) project, which led to<br />
cluster technology used by Internet companies <strong>and</strong> later to cloud computing. These<br />
projects earned three dissertation awards from ACM. His current research projects<br />
are Algorithms, Machines, <strong>and</strong> People (AMP) <strong>and</strong> Algorithms <strong>and</strong> Specializers for Provably<br />
Optimal Implementations with Resilience <strong>and</strong> Efficiency (ASPIRE). The AMP Lab is developing scalable<br />
machine learning algorithms, warehouse-scale-computer-friendly programming<br />
models, <strong>and</strong> crowd-sourcing tools to gain valuable insights quickly from big data in<br />
the cloud. The ASPIRE Lab uses deep hardware <strong>and</strong> software co-tuning to achieve the<br />
highest possible performance <strong>and</strong> energy efficiency for mobile <strong>and</strong> rack computing<br />
systems.<br />
John L. Hennessy is the tenth president of Stanford University, where he has been<br />
a member of the faculty since 1977 in the departments of electrical engineering <strong>and</strong><br />
computer science. Hennessy is a Fellow of the IEEE <strong>and</strong> ACM; a member of the<br />
National Academy of Engineering, the National Academy of Science, <strong>and</strong> the American<br />
Philosophical Society; <strong>and</strong> a Fellow of the American Academy of Arts <strong>and</strong> Sciences.<br />
Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to<br />
RISC technology, the 2001 Seymour Cray <strong>Computer</strong> Engineering Award, <strong>and</strong> the 2000<br />
John von Neumann Award, which he shared with David Patterson. He has also received<br />
seven honorary doctorates.<br />
In 1981, he started the MIPS project at Stanford with a h<strong>and</strong>ful of graduate students.<br />
After completing the project in 1984, he took a leave from the university to cofound<br />
MIPS <strong>Computer</strong> Systems (now MIPS Technologies), which developed one of the first<br />
commercial RISC microprocessors. As of 2006, over 2 billion MIPS microprocessors have<br />
been shipped in devices ranging from video games <strong>and</strong> palmtop computers to laser printers<br />
<strong>and</strong> network switches. Hennessy subsequently led the DASH (Directory Architecture<br />
for Shared Memory) project, which prototyped the first scalable cache-coherent<br />
multiprocessor; many of the key ideas have been adopted in modern multiprocessors.<br />
In addition to his technical activities <strong>and</strong> university responsibilities, he has continued to<br />
work with numerous start-ups both as an early-stage advisor <strong>and</strong> an investor.
To Linda,<br />
who has been, is, <strong>and</strong> always will be the love of my life
ACKNOWLEDGMENTS<br />
Figures 1.7, 1.8 Courtesy of iFixit (www.ifixit.com).<br />
Figure 1.9 Courtesy of Chipworks (www.chipworks.com).<br />
Figure 1.13 Courtesy of Intel.<br />
Figures 1.10.1, 1.10.2, 4.15.2 Courtesy of the Charles Babbage<br />
Institute, University of Minnesota Libraries, Minneapolis.<br />
Figures 1.10.3, 4.15.1, 4.15.3, 5.12.3, 6.14.2 Courtesy of IBM.<br />
Figure 1.10.4 Courtesy of Cray Inc.<br />
Figure 1.10.5 Courtesy of Apple <strong>Computer</strong>, Inc.<br />
Figure 1.10.6 Courtesy of the <strong>Computer</strong> History Museum.<br />
Figures 5.17.1, 5.17.2 Courtesy of Museum of Science, Boston.<br />
Figure 5.17.4 Courtesy of MIPS Technologies, Inc.<br />
Figure 6.15.1 Courtesy of NASA Ames Research Center.
Preface<br />
The most beautiful thing we can experience is the mysterious. It is the<br />
source of all true art <strong>and</strong> science.<br />
Albert Einstein, What I Believe, 1930<br />
About This Book<br />
We believe that learning in computer science <strong>and</strong> engineering should reflect<br />
the current state of the field, as well as introduce the principles that are shaping<br />
computing. We also feel that readers in every specialty of computing need<br />
to appreciate the organizational paradigms that determine the capabilities,<br />
performance, energy, <strong>and</strong>, ultimately, the success of computer systems.<br />
Modern computer technology requires professionals of every computing<br />
specialty to underst<strong>and</strong> both hardware <strong>and</strong> software. The interaction between<br />
hardware <strong>and</strong> software at a variety of levels also offers a framework for underst<strong>and</strong>ing<br />
the fundamentals of computing. Whether your primary interest is hardware or<br />
software, computer science or electrical engineering, the central ideas in computer<br />
organization <strong>and</strong> design are the same. Thus, our emphasis in this book is to show<br />
the relationship between hardware <strong>and</strong> software <strong>and</strong> to focus on the concepts that<br />
are the basis for current computers.<br />
The recent switch from uniprocessor to multicore microprocessors confirmed<br />
the soundness of this perspective, which we have held since the first edition. While programmers<br />
could ignore the advice <strong>and</strong> rely on computer architects, compiler writers, <strong>and</strong> silicon<br />
engineers to make their programs run faster or be more energy-efficient without<br />
change, that era is over. For programs to run faster, they must become parallel.<br />
While the goal of many researchers is to make it possible for programmers to be<br />
unaware of the underlying parallel nature of the hardware they are programming,<br />
it will take many years to realize this vision. Our view is that for at least the next<br />
decade, most programmers are going to have to underst<strong>and</strong> the hardware/software<br />
interface if they want programs to run efficiently on parallel computers.<br />
The audience for this book includes those with little experience in assembly<br />
language or logic design who need to underst<strong>and</strong> basic computer organization as<br />
well as readers with backgrounds in assembly language <strong>and</strong>/or logic design who<br />
want to learn how to design a computer or underst<strong>and</strong> how a system works <strong>and</strong><br />
why it performs as it does.
About the Other Book<br />
Some readers may be familiar with <strong>Computer</strong> Architecture: A Quantitative<br />
Approach , popularly known as Hennessy <strong>and</strong> Patterson. (This book in turn is<br />
often called Patterson <strong>and</strong> Hennessy.) Our motivation in writing the earlier book<br />
was to describe the principles of computer architecture using solid engineering<br />
fundamentals <strong>and</strong> quantitative cost/performance tradeoffs. We used an approach<br />
that combined examples <strong>and</strong> measurements, based on commercial systems, to<br />
create realistic design experiences. Our goal was to demonstrate that computer<br />
architecture could be learned using quantitative methodologies instead of a<br />
descriptive approach. It was intended for the serious computing professional who<br />
wanted a detailed underst<strong>and</strong>ing of computers.<br />
A majority of the readers for this book do not plan to become computer<br />
architects. The performance <strong>and</strong> energy efficiency of future software systems will<br />
be dramatically affected, however, by how well software designers underst<strong>and</strong> the<br />
basic hardware techniques at work in a system. Thus, compiler writers, operating<br />
system designers, database programmers, <strong>and</strong> most other software engineers need<br />
a firm grounding in the principles presented in this book. Similarly, hardware<br />
designers must underst<strong>and</strong> clearly the effects of their work on software applications.<br />
Thus, we knew that this book had to be much more than a subset of the material<br />
in <strong>Computer</strong> Architecture , <strong>and</strong> the material was extensively revised to match the<br />
different audience. We were so happy with the result that the subsequent editions of<br />
<strong>Computer</strong> Architecture were revised to remove most of the introductory material;<br />
hence, there is much less overlap today than with the first editions of both books.<br />
Changes for the Fifth Edition<br />
We had six major goals for the fifth edition of <strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>:<br />
demonstrate the importance of underst<strong>and</strong>ing hardware with a running example;<br />
highlight major themes across the topics using margin icons that are introduced<br />
early; update examples to reflect changeover from PC era to PostPC era; spread the<br />
material on I/O throughout the book rather than isolating it into a single chapter;<br />
update the technical content to reflect changes in the industry since the publication<br />
of the fourth edition in 2009; <strong>and</strong> put appendices <strong>and</strong> optional sections online<br />
instead of including a CD to lower costs <strong>and</strong> to make this edition viable as an<br />
electronic book.<br />
Before discussing the goals in detail, let’s look at the table on the next page. It<br />
shows the hardware <strong>and</strong> software paths through the material. Chapters 1, 4, 5, <strong>and</strong><br />
6 are found on both paths, no matter what the experience or the focus. Chapter 1<br />
discusses the importance of energy <strong>and</strong> how it motivates the switch from single<br />
core to multicore microprocessors <strong>and</strong> introduces the eight great ideas in computer<br />
architecture. Chapter 2 is likely to be review material for the hardware-oriented,<br />
but it is essential reading for the software-oriented, especially for those readers<br />
interested in learning more about compilers <strong>and</strong> object-oriented programming<br />
languages. Chapter 3 is for readers interested in constructing a datapath or in
learning more about floating-point arithmetic. Some will skip parts of Chapter 3,<br />
either because they don’t need them or because they offer a review. However, we<br />
introduce the running example of matrix multiply in this chapter, showing how<br />
subword parallelism offers a fourfold improvement, so don’t skip Sections 3.6 to 3.8.<br />
Chapter 4 explains pipelined processors. Sections 4.1, 4.5, <strong>and</strong> 4.10 give overviews<br />
<strong>and</strong> Section 4.12 gives the next performance boost for matrix multiply for those with<br />
a software focus. Those with a hardware focus, however, will find that this chapter<br />
presents core material; they may also, depending on their background, want to read<br />
Appendix C on logic design first. The last chapter, on multicores, multiprocessors,<br />
<strong>and</strong> clusters, is mostly new content <strong>and</strong> should be read by everyone. It was<br />
significantly reorganized in this edition to make the flow of ideas more natural<br />
<strong>and</strong> to include much more depth on GPUs, warehouse scale computers, <strong>and</strong> the<br />
hardware-software interface of network interface cards that are key to clusters.<br />
The first of the six goals for this fifth edition was to demonstrate the importance<br />
of underst<strong>and</strong>ing modern hardware to get good performance <strong>and</strong> energy efficiency<br />
with a concrete example. As mentioned above, we start with subword parallelism<br />
in Chapter 3 to improve matrix multiply by a factor of 4. We double performance<br />
in Chapter 4 by unrolling the loop to demonstrate the value of instruction level<br />
parallelism. Chapter 5 doubles performance again by optimizing for caches using<br />
blocking. Finally, Chapter 6 demonstrates a speedup of 14 from 16 processors by<br />
using thread-level parallelism. All four optimizations in total add just 24 lines of C<br />
code to our initial matrix multiply example.<br />
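The preface does not reproduce the running example's code, but the starting point of such an optimization sequence is a standard triply nested loop. As an illustrative sketch (the function names and block size below are my own, not the book's), here is an unoptimized double-precision matrix multiply (DGEMM) in C, together with a cache-blocked variant of the kind Chapter 5 describes:

```c
#include <stddef.h>

/* Naive DGEMM, C = C + A*B, for square n x n matrices in row-major
   order. This is the kind of unoptimized kernel that subword
   parallelism, loop unrolling, blocking, and threading then improve. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double cij = C[i*n + j];
            for (size_t k = 0; k < n; ++k)
                cij += A[i*n + k] * B[k*n + j];
            C[i*n + j] = cij;
        }
}

/* Cache-blocked variant: compute on BLOCK x BLOCK tiles so the working
   set stays in cache; the result is identical to the naive version. */
#define BLOCK 32
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                /* The "&& < n" guards handle n not divisible by BLOCK. */
                for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                    for (size_t j = jj; j < jj + BLOCK && j < n; ++j) {
                        double cij = C[i*n + j];
                        for (size_t k = kk; k < kk + BLOCK && k < n; ++k)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}
```

The other optimizations mentioned above slot into the same kernel: subword parallelism and loop unrolling transform the innermost loop, while thread-level parallelism distributes iterations of the outer loop across processors.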
The second goal was to help readers separate the forest from the trees by<br />
identifying eight great ideas of computer architecture early <strong>and</strong> then pointing out<br />
all the places they occur throughout the rest of the book. We use (hopefully) easy<br />
to remember margin icons <strong>and</strong> highlight the corresponding word in the text to<br />
remind readers of these eight themes. There are nearly 100 citations in the book.<br />
No chapter has fewer than seven examples of great ideas, <strong>and</strong> no idea is cited fewer than<br />
five times. Performance via parallelism, pipelining, <strong>and</strong> prediction are the three<br />
most popular great ideas, followed closely by Moore’s Law. The processor chapter<br />
(4) is the one with the most examples, which is not a surprise since it probably<br />
received the most attention from computer architects. The one great idea found in<br />
every chapter is performance via parallelism, which is a pleasant observation given<br />
the recent emphasis on parallelism in the field <strong>and</strong> in editions of this book.<br />
The third goal was to recognize the generation change in computing from the<br />
PC era to the PostPC era by this edition with our examples <strong>and</strong> material. Thus,<br />
Chapter 1 dives into the guts of a tablet computer rather than a PC, <strong>and</strong> Chapter 6<br />
describes the computing infrastructure of the cloud. We also feature the ARM,<br />
which is the instruction set of choice in the personal mobile devices of the PostPC<br />
era, as well as the x86 instruction set that dominated the PC Era <strong>and</strong> (so far)<br />
dominates cloud computing.<br />
The fourth goal was to spread the I/O material throughout the book rather<br />
than have it in its own chapter, much as we spread parallelism throughout all the<br />
chapters in the fourth edition. Hence, I/O material in this edition can be found in
Sections 1.4, 4.9, 5.2, 5.5, 5.11, <strong>and</strong> 6.9. The thought is that readers (<strong>and</strong> instructors)<br />
are more likely to cover I/O if it’s not segregated to its own chapter.<br />
This is a fast-moving field, <strong>and</strong>, as is always the case for our new editions, an<br />
important goal is to update the technical content. The running example is the ARM<br />
Cortex A8 <strong>and</strong> the Intel Core i7, reflecting our PostPC Era. Other highlights include<br />
an overview of the new 64-bit instruction set of ARMv8, a tutorial on GPUs that<br />
explains their unique terminology, more depth on the warehouse scale computers<br />
that make up the cloud, <strong>and</strong> a deep dive into 10 Gigabit Ethernet cards.<br />
To keep the main book short <strong>and</strong> compatible with electronic books, we placed<br />
the optional material as online appendices instead of on a companion CD as in<br />
prior editions.<br />
Finally, we updated all the exercises in the book.<br />
While some elements changed, we have preserved useful book elements from<br />
prior editions. To make the book work better as a reference, we still place definitions<br />
of new terms in the margins at their first occurrence. The book element called<br />
“Underst<strong>and</strong>ing Program Performance” helps readers underst<strong>and</strong> the<br />
performance of their programs <strong>and</strong> how to improve it, just as the “Hardware/Software<br />
Interface” book element helped readers underst<strong>and</strong> the tradeoffs at this interface.<br />
“The Big Picture” section remains so that the reader sees the forest despite all the<br />
trees. “Check Yourself” sections help readers confirm their comprehension of the<br />
material on the first time through with answers provided at the end of each chapter.<br />
This edition still includes the green MIPS reference card, which was inspired by the<br />
“Green Card” of the IBM System/360. This card has been updated <strong>and</strong> should be a<br />
h<strong>and</strong>y reference when writing MIPS assembly language programs.<br />
Instructor Support<br />
We have collected a great deal of material to help instructors teach courses using<br />
this book. Solutions to exercises, figures from the book, lecture slides, <strong>and</strong> other<br />
materials are available to adopters from the publisher. Check the publisher’s Web<br />
site for more information:<br />
textbooks.elsevier.com/9780124077263<br />
Concluding Remarks<br />
If you read the following acknowledgments section, you will see that we went to<br />
great lengths to correct mistakes. Since a book goes through many printings, we<br />
have the opportunity to make even more corrections. If you uncover any remaining,<br />
resilient bugs, please contact the publisher by electronic mail at cod5bugs@mkp.<br />
com or by low-tech mail using the address found on the copyright page.<br />
This edition is the second break in the long-st<strong>and</strong>ing collaboration between<br />
Hennessy <strong>and</strong> Patterson, which started in 1989. The dem<strong>and</strong>s of running one of<br />
the world’s great universities meant that President Hennessy could no longer make<br />
the substantial commitment to create a new edition. The remaining author felt
once again like a tightrope walker without a safety net. Hence, the people in the<br />
acknowledgments <strong>and</strong> Berkeley colleagues played an even larger role in shaping<br />
the contents of this book. Nevertheless, this time around there is only one author<br />
to blame for the new material in what you are about to read.<br />
Acknowledgments for the Fifth Edition<br />
With every edition of this book, we are very fortunate to receive help from many<br />
readers, reviewers, <strong>and</strong> contributors. Each of these people has helped to make this<br />
book better.<br />
Chapter 6 was so extensively revised that we did a separate review for ideas <strong>and</strong><br />
contents, <strong>and</strong> I made changes based on the feedback from every reviewer. I’d like to<br />
thank Christos Kozyrakis of Stanford University for suggesting using the network<br />
interface for clusters to demonstrate the hardware-software interface of I/O <strong>and</strong><br />
for suggestions on organizing the rest of the chapter; Mario Flagsilk of Stanford<br />
University for providing details, diagrams, <strong>and</strong> performance measurements of the<br />
NetFPGA NIC; <strong>and</strong> the following for suggestions on how to improve the chapter:<br />
David Kaeli of Northeastern University, Partha Ranganathan of HP Labs,<br />
David Wood of the University of Wisconsin, <strong>and</strong> my Berkeley colleagues Siamak<br />
Faridani , Shoaib Kamil , Yunsup Lee , Zhangxi Tan , <strong>and</strong> Andrew Waterman .<br />
Special thanks goes to Rimas Avizenis of UC Berkeley, who developed the<br />
various versions of matrix multiply <strong>and</strong> supplied the performance numbers as well.<br />
As I worked with his father while I was a graduate student at UCLA, it was a nice<br />
symmetry to work with Rimas at UCB.<br />
I also wish to thank my longtime collaborator R<strong>and</strong>y Katz of UC Berkeley, who<br />
helped develop the concept of great ideas in computer architecture as part of the<br />
extensive revision of an undergraduate class that we did together.<br />
I’d like to thank David Kirk , John Nickolls , <strong>and</strong> their colleagues at NVIDIA<br />
(Michael Garl<strong>and</strong>, John Montrym, Doug Voorhies, Lars Nyl<strong>and</strong>, Erik Lindholm,<br />
Paulius Micikevicius, Massimiliano Fatica, Stuart Oberman, <strong>and</strong> Vasily Volkov)<br />
for writing the first in-depth appendix on GPUs. I’d like to express again my<br />
appreciation to Jim Larus , recently named Dean of the School of <strong>Computer</strong> <strong>and</strong><br />
Communications Science at EPFL, for his willingness to contribute his expertise<br />
on assembly language programming, as well as for welcoming readers of this book<br />
to use the simulator he developed <strong>and</strong> maintains.<br />
I am also very grateful to Jason Bakos of the University of South Carolina,<br />
who updated <strong>and</strong> created new exercises for this edition, working from originals<br />
prepared for the fourth edition by Perry Alex<strong>and</strong>er (The University of Kansas);<br />
Javier Bruguera (Universidade de Santiago de Compostela); Matthew Farrens<br />
(University of California, Davis); David Kaeli (Northeastern University); Nicole<br />
Kaiyan (University of Adelaide); John Oliver (Cal Poly, San Luis Obispo); Milos<br />
Prvulovic (Georgia Tech); <strong>and</strong> Jichuan Chang , Jacob Leverich , Kevin Lim , <strong>and</strong><br />
Partha Ranganathan (all from Hewlett-Packard).<br />
Additional thanks goes to Jason Bakos for developing the new lecture slides.
I am grateful to the many instructors who have answered the publisher’s surveys,<br />
reviewed our proposals, <strong>and</strong> attended focus groups to analyze <strong>and</strong> respond to our<br />
plans for this edition. They include the following individuals: Focus Groups in<br />
2012: Bruce Barton (Suffolk County Community College), Jeff Braun (Montana<br />
Tech), Ed Gehringer (North Carolina State), Michael Goldweber (Xavier University),<br />
Ed Harcourt (St. Lawrence University), Mark Hill (University of Wisconsin,<br />
Madison), Patrick Homer (University of Arizona), Norm Jouppi (HP Labs), Dave<br />
Kaeli (Northeastern University), Christos Kozyrakis (Stanford University),<br />
Zachary Kurmas (Gr<strong>and</strong> Valley State University), Jae C. Oh (Syracuse University),<br />
Lu Peng (LSU), Milos Prvulovic (Georgia Tech), Partha Ranganathan (HP<br />
Labs), David Wood (University of Wisconsin), Craig Zilles (University of Illinois<br />
at Urbana-Champaign). Surveys <strong>and</strong> Reviews: Mahmoud Abou-Nasr (Wayne State<br />
University), Perry Alex<strong>and</strong>er (The University of Kansas), Hakan Aydin (George<br />
Mason University), Hussein Badr (State University of New York at Stony Brook),<br />
Mac Baker (Virginia Military Institute), Ron Barnes (George Mason University),<br />
Douglas Blough (Georgia Institute of Technology), Kevin Bolding (Seattle Pacific<br />
University), Miodrag Bolic (University of Ottawa), John Bonomo (Westminster<br />
College), Jeff Braun (Montana Tech), Tom Briggs (Shippensburg University), Scott<br />
Burgess (Humboldt State University), Fazli Can (Bilkent University), Warren R.<br />
Carithers (Rochester Institute of Technology), Bruce Carlton (Mesa Community<br />
College), Nicholas Carter (University of Illinois at Urbana-Champaign), Anthony<br />
Cocchi (The City University of New York), Don Cooley (Utah State University),<br />
Robert D. Cupper (Allegheny College), Edward W. Davis (North Carolina State<br />
University), Nathaniel J. Davis (Air Force Institute of Technology), Molisa Derk<br />
(Oklahoma City University), Derek Eager (University of Saskatchewan), Ernest<br />
Ferguson (Northwest Missouri State University), Rhonda Kay Gaede (The University<br />
of Alabama), Etienne M. Gagnon (UQAM), Costa Gerousis (Christopher Newport<br />
University), Paul Gillard (Memorial University of Newfoundl<strong>and</strong>), Michael<br />
Goldweber (Xavier University), Georgia Grant (College of San Mateo), Merrill Hall<br />
(The Master’s College), Tyson Hall (Southern Adventist University), Ed Harcourt<br />
(St. Lawrence University), Justin E. Harlow (University of South Florida), Paul F.<br />
Hemler (Hampden-Sydney College), Martin Herbordt (Boston University), Steve<br />
J. Hodges (Cabrillo College), Kenneth Hopkinson (Cornell University), Dalton<br />
Hunkins (St. Bonaventure University), Baback Izadi (State University of New<br />
York—New Paltz), Reza Jafari, Robert W. Johnson (Colorado Technical University),<br />
Bharat Joshi (University of North Carolina, Charlotte), Nagarajan K<strong>and</strong>asamy<br />
(Drexel University), Rajiv Kapadia, Ryan Kastner (University of California,<br />
Santa Barbara), E.J. Kim (Texas A&M University), Jihong Kim (Seoul National<br />
University), Jim Kirk (Union University), Geoffrey S. Knauth (Lycoming College),<br />
Manish M. Kochhal (Wayne State), Suzan Koknar-Tezel (Saint Joseph’s University),<br />
Angkul Kongmunvattana (Columbus State University), April Kontostathis (Ursinus<br />
College), Christos Kozyrakis (Stanford University), Danny Krizanc (Wesleyan<br />
University), Ashok Kumar, S. Kumar (The University of Texas), Zachary Kurmas<br />
(Gr<strong>and</strong> Valley State University), Robert N. Lea (University of Houston), Baoxin<br />
Li (Arizona State University), Li Liao (University of Delaware), Gary Livingston<br />
(University of Massachusetts), Michael Lyle, Douglas W. Lynn (Oregon Institute<br />
of Technology), Yashwant K Malaiya (Colorado State University), Bill Mark<br />
(University of Texas at Austin), An<strong>and</strong>a Mondal (Claflin University), Alvin Moser<br />
(Seattle University), Walid Najjar (University of California, Riverside), Danial J.<br />
Neebel (Loras College), John Nestor (Lafayette College), Jae C. Oh (Syracuse<br />
University), Joe Oldham (Centre College), Timour Paltashev, James Parkerson<br />
(University of Arkansas), Shaunak Pawagi (SUNY at Stony Brook), Steve Pearce, Ted<br />
Pedersen (University of Minnesota), Lu Peng (Louisiana State University), Gregory<br />
D Peterson (The University of Tennessee), Milos Prvulovic (Georgia Tech), Partha<br />
Ranganathan (HP Labs), Dejan Raskovic (University of Alaska, Fairbanks) Brad<br />
Richards (University of Puget Sound), Roman Rozanov, Louis Rubinfield (Villanova<br />
University), Md Abdus Salam (Southern University), Augustine Samba (Kent State<br />
University), Robert Schaefer (Daniel Webster College), Carolyn J. C. Schauble<br />
(Colorado State University), Keith Schubert (CSU San Bernardino), William<br />
L. Schultz, Kelly Shaw (University of Richmond), Shahram Shirani (McMaster<br />
University), Scott Sigman (Drury University), Bruce Smith, David Smith, Jeff W.<br />
Smith (University of Georgia, Athens), Mark Smotherman (Clemson University),<br />
Philip Snyder (Johns Hopkins University), Alex Sprintson (Texas A&M), Timothy<br />
D. Stanley (Brigham Young University), Dean Stevens (Morningside College),<br />
Nozar Tabrizi (Kettering University), Yuval Tamir (UCLA), Alex<strong>and</strong>er Taubin<br />
(Boston University), Will Thacker (Winthrop University), Mithuna Thottethodi<br />
(Purdue University), Manghui Tu (Southern Utah University), Dean Tullsen<br />
(UC San Diego), Rama Viswanathan (Beloit College), Ken Vollmar (Missouri<br />
State University), Guoping Wang (Indiana-Purdue University), Patricia Wenner<br />
(Bucknell University), Kent Wilken (University of California, Davis), David Wolfe<br />
(Gustavus Adolphus College), David Wood (University of Wisconsin, Madison),<br />
Ki Hwan Yum (University of Texas, San Antonio), Mohamed Zahran (City College<br />
of New York), Gerald D. Zarnett (Ryerson University), Nian Zhang (South Dakota<br />
School of Mines & Technology), Jiling Zhong (Troy University), Huiyang Zhou<br />
(The University of Central Florida), Weiyu Zhu (Illinois Wesleyan University).<br />
A special thanks also goes to Mark Smotherman for making multiple passes to<br />
find technical <strong>and</strong> writing glitches that significantly improved the quality of this<br />
edition.<br />
We wish to thank the extended Morgan Kaufmann family for agreeing to publish<br />
this book again under the able leadership of Todd Green <strong>and</strong> Nate McFadden : I<br />
certainly couldn’t have completed the book without them. We also want to extend<br />
thanks to Lisa Jones , who managed the book production process, <strong>and</strong> Russell<br />
Purdy , who did the cover design. The new cover cleverly connects the PostPC Era<br />
content of this edition to the cover of the first edition.<br />
The contributions of the nearly 150 people we mentioned here have helped<br />
make this fifth edition what I hope will be our best book yet. Enjoy!<br />
David A. Patterson
1
Computer Abstractions and Technology

“Civilization advances by extending the number of important operations which we can perform without thinking about them.”
—Alfred North Whitehead, An Introduction to Mathematics, 1911

1.1 Introduction
1.2 Eight Great Ideas in Computer Architecture
1.3 Below Your Program
1.4 Under the Covers
1.5 Technologies for Building Processors and Memory

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.
Computers have led to a third revolution for civilization, with the information revolution taking its place alongside the agricultural and the industrial revolutions. The resulting multiplication of humankind’s intellectual strength and reach naturally has affected our everyday lives profoundly and changed the ways in which the search for new knowledge is carried out. There is now a new vein of scientific investigation, with computational scientists joining theoretical and experimental scientists in the exploration of new frontiers in astronomy, biology, chemistry, and physics, among others.

The computer revolution continues. Each time the cost of computing improves by another factor of 10, the opportunities for computers multiply. Applications that were economically infeasible suddenly become practical. In the recent past, the following applications were “computer science fiction.”

■ Computers in automobiles: Until microprocessors improved dramatically in price and performance in the early 1980s, computer control of cars was ludicrous. Today, computers reduce pollution, improve fuel efficiency via engine controls, and increase safety through blind spot warnings, lane departure warnings, moving object detection, and air bag inflation to protect occupants in a crash.

■ Cell phones: Who would have dreamed that advances in computer systems would lead to more than half of the planet having mobile phones, allowing person-to-person communication to almost anyone anywhere in the world?

■ Human genome project: The cost of computer equipment to map and analyze human DNA sequences was hundreds of millions of dollars. It’s unlikely that anyone would have considered this project had the computer costs been 10 to 100 times higher, as they would have been 15 to 25 years earlier. Moreover, costs continue to drop; you will soon be able to acquire your own genome, allowing medical care to be tailored to you.

■ World Wide Web: Not in existence at the time of the first edition of this book, the web has transformed our society. For many, the web has replaced libraries and newspapers.

■ Search engines: As the content of the web grew in size and in value, finding relevant information became increasingly important. Today, many people rely on search engines for such a large part of their lives that it would be a hardship to go without them.

Clearly, advances in this technology now affect almost every aspect of our society. Hardware advances have allowed programmers to create wonderfully useful software, which explains why computers are omnipresent. Today’s science fiction suggests tomorrow’s killer applications: already on their way are glasses that augment reality, the cashless society, and cars that can drive themselves.
Classes of Computing Applications and Their Characteristics

Although a common set of hardware technologies (see Sections 1.4 and 1.5) is used in computers ranging from smart home appliances to cell phones to the largest supercomputers, these different applications have different design requirements and employ the core hardware technologies in different ways. Broadly speaking, computers are used in three different classes of applications.

Personal computers (PCs) are possibly the best known form of computing, which readers of this book have likely used extensively. Personal computers emphasize delivery of good performance to single users at low cost and usually execute third-party software. This class of computing drove the evolution of many computing technologies, which is only about 35 years old!

Servers are the modern form of what were once much larger computers, and are usually accessed only via a network. Servers are oriented to carrying large workloads, which may consist of either single complex applications—usually a scientific or engineering application—or handling many small jobs, such as would occur in building a large web server. These applications are usually based on software from another source (such as a database or simulation system), but are often modified or customized for a particular function. Servers are built from the same basic technology as desktop computers, but provide for greater computing, storage, and input/output capacity. In general, servers also place a greater emphasis on dependability, since a crash is usually more costly than it would be on a single-user PC.

Servers span the widest range in cost and capability. At the low end, a server may be little more than a desktop computer without a screen or keyboard and cost a thousand dollars. These low-end servers are typically used for file storage, small business applications, or simple web serving (see Section 6.10). At the other extreme are supercomputers, which at the present consist of tens of thousands of processors and many terabytes of memory, and cost tens to hundreds of millions of dollars. Supercomputers are usually used for high-end scientific and engineering calculations, such as weather forecasting, oil exploration, protein structure determination, and other large-scale problems. Although such supercomputers represent the peak of computing capability, they represent a relatively small fraction of the servers and a relatively small fraction of the overall computer market in terms of total revenue.

Embedded computers are the largest class of computers and span the widest range of applications and performance. Embedded computers include the microprocessors found in your car, the computers in a television set, and the networks of processors that control a modern airplane or cargo ship. Embedded computing systems are designed to run one application or one set of related applications that are normally integrated with the hardware and delivered as a single system; thus, despite the large number of embedded computers, most users never really see that they are using a computer!
personal computer (PC): A computer designed for use by an individual, usually incorporating a graphics display, a keyboard, and a mouse.

server: A computer used for running larger programs for multiple users, often simultaneously, and typically accessed only via a network.

supercomputer: A class of computers with the highest performance and cost; they are configured as servers and typically cost tens to hundreds of millions of dollars.

terabyte (TB): Originally 1,099,511,627,776 (2^40) bytes, although communications and secondary storage systems developers started using the term to mean 1,000,000,000,000 (10^12) bytes. To reduce confusion, we now use the term tebibyte (TiB) for 2^40 bytes, defining terabyte (TB) to mean 10^12 bytes. Figure 1.1 shows the full range of decimal and binary values and names.
embedded computer: A computer inside another device used for running one predetermined application or collection of software.
multicore microprocessor: A microprocessor containing multiple processors (“cores”) in a single integrated circuit.

last decade, advances in computer design and memory technology have greatly reduced the importance of small memory size in most applications other than those in embedded computing systems.

Programmers interested in performance now need to understand the issues that have replaced the simple memory model of the 1960s: the parallel nature of processors and the hierarchical nature of memories. Moreover, as we explain in Section 1.7, today’s programmers need to worry about energy efficiency of their programs running either on the PMD or in the Cloud, which also requires understanding what is below your code. Programmers who seek to build competitive versions of software will therefore need to increase their knowledge of computer organization.
We are honored to have the opportunity to explain what’s inside this revolutionary machine, unraveling the software below your program and the hardware under the covers of your computer. By the time you complete this book, we believe you will be able to answer the following questions:

■ How are programs written in a high-level language, such as C or Java, translated into the language of the hardware, and how does the hardware execute the resulting program? Comprehending these concepts forms the basis of understanding the aspects of both the hardware and software that affect program performance.

■ What is the interface between the software and the hardware, and how does software instruct the hardware to perform needed functions? These concepts are vital to understanding how to write many kinds of software.

■ What determines the performance of a program, and how can a programmer improve the performance? As we will see, this depends on the original program, the software translation of that program into the computer’s language, and the effectiveness of the hardware in executing the program.

■ What techniques can be used by hardware designers to improve performance? This book will introduce the basic concepts of modern computer design. The interested reader will find much more material on this topic in our advanced book, Computer Architecture: A Quantitative Approach.

■ What techniques can be used by hardware designers to improve energy efficiency? What can the programmer do to help or hinder energy efficiency?

■ What are the reasons for and the consequences of the recent switch from sequential processing to parallel processing? This book gives the motivation, describes the current hardware mechanisms to support parallelism, and surveys the new generation of “multicore” microprocessors (see Chapter 6).

■ Since the first commercial computer in 1951, what great ideas did computer architects come up with that lay the foundation of modern computing?
To demonstrate the impact of the ideas in this book, we improve the performance of a C program that multiplies a matrix times a vector in a sequence of chapters. Each step leverages understanding how the underlying hardware really works in a modern microprocessor to improve performance by a factor of 200!

■ In the category of data level parallelism, in Chapter 3 we use subword parallelism via C intrinsics to increase performance by a factor of 3.8.

■ In the category of instruction level parallelism, in Chapter 4 we use loop unrolling to exploit multiple instruction issue and out-of-order execution hardware to increase performance by another factor of 2.3.

■ In the category of memory hierarchy optimization, in Chapter 5 we use cache blocking to increase performance on large matrices by another factor of 2.5.

■ In the category of thread level parallelism, in Chapter 6 we use parallel for loops in OpenMP to exploit multicore hardware to increase performance by another factor of 14.
Check Yourself

Check Yourself sections are designed to help readers assess whether they comprehend the major concepts introduced in a chapter and understand the implications of those concepts. Some Check Yourself questions have simple answers; others are for discussion among a group. Answers to the specific questions can be found at the end of the chapter. Check Yourself questions appear only at the end of a section, making it easy to skip them if you are sure you understand the material.
1. The number of embedded processors sold every year greatly outnumbers the number of PC and even PostPC processors. Can you confirm or deny this insight based on your own experience? Try to count the number of embedded processors in your home. How does it compare with the number of conventional computers in your home?

2. As mentioned earlier, both the software and hardware affect the performance of a program. Can you think of examples where each of the following is the right place to look for a performance bottleneck?

■ The algorithm chosen
■ The programming language or compiler
■ The operating system
■ The processor
■ The I/O system and devices
compiler: A program that translates high-level language statements into assembly language statements.

binary digit: Also called a bit. One of the two numbers in base 2 (0 or 1) that are the components of information.

instruction: A command that computer hardware understands and obeys.

assembler: A program that translates a symbolic version of instructions into the binary version.

assembly language: A symbolic representation of machine instructions.

machine language: A binary representation of machine instructions.
Compilers perform another vital function: the translation of a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute. Given the sophistication of modern programming languages and the simplicity of the instructions executed by the hardware, the translation from a high-level language program to hardware instructions is complex. We give a brief overview of the process here and then go into more depth in Chapter 2 and in Appendix A.

From a High-Level Language to the Language of Hardware

To actually speak to electronic hardware, you need to send electrical signals. The easiest signals for computers to understand are on and off, and so the computer alphabet is just two letters. Just as the 26 letters of the English alphabet do not limit how much can be written, the two letters of the computer alphabet do not limit what computers can do. The two symbols for these two letters are the numbers 0 and 1, and we commonly think of the computer language as numbers in base 2, or binary numbers. We refer to each “letter” as a binary digit or bit. Computers are slaves to our commands, which are called instructions. Instructions, which are just collections of bits that the computer understands and obeys, can be thought of as numbers. For example, the bits

1000110010100000

tell one computer to add two numbers. Chapter 2 explains why we use numbers for instructions and data; we don’t want to steal that chapter’s thunder, but using numbers for both instructions and data is a foundation of computing.

The first programmers communicated to computers in binary numbers, but this was so tedious that they quickly invented new notations that were closer to the way humans think. At first, these notations were translated to binary by hand, but this process was still tiresome. Using the computer to help program the computer, the pioneers invented programs to translate from symbolic notation to binary. The first of these programs was named an assembler. This program translates a symbolic version of an instruction into the binary version. For example, the programmer would write

add A,B

and the assembler would translate this notation into

1000110010100000

This instruction tells the computer to add the two numbers A and B. The name coined for this symbolic language, still used today, is assembly language. In contrast, the binary language that the machine understands is the machine language.

Although a tremendous improvement, assembly language is still far from the notations a scientist might like to use to simulate fluid flow or that an accountant might use to balance the books. Assembly language requires the programmer to write one line for every instruction that the computer will follow, forcing the programmer to think like the computer.
A compiler enables a programmer to write this high-level language expression:

A + B

The compiler would compile it into this assembly language statement:

add A,B

As shown above, the assembler would translate this statement into the binary instructions that tell the computer to add the two numbers A and B.
High-level programming languages offer several important benefits. First, they allow the programmer to think in a more natural language, using English words and algebraic notation, resulting in programs that look much more like text than like tables of cryptic symbols (see Figure 1.4). Moreover, they allow languages to be designed according to their intended use. Hence, Fortran was designed for scientific computation, Cobol for business data processing, Lisp for symbol manipulation, and so on. There are also domain-specific languages for even narrower groups of users, such as those interested in simulation of fluids, for example.

The second advantage of programming languages is improved programmer productivity. One of the few areas of widespread agreement in software development is that it takes less time to develop programs when they are written in languages that require fewer lines to express an idea. Conciseness is a clear advantage of high-level languages over assembly language.

The final advantage is that programming languages allow programs to be independent of the computer on which they were developed, since compilers and assemblers can translate high-level language programs to the binary instructions of any computer. These three advantages are so strong that today little programming is done in assembly language.
1.4 Under the Covers
input device: A mechanism through which the computer is fed information, such as a keyboard.

output device: A mechanism that conveys the result of a computation to a user, such as a display, or to another computer.
Now that we have looked below your program to uncover the underlying software, let’s open the covers of your computer to learn about the underlying hardware. The underlying hardware in any computer performs the same basic functions: inputting data, outputting data, processing data, and storing data. How these functions are performed is the primary topic of this book, and subsequent chapters deal with different parts of these four tasks.

When we come to an important point in this book, a point so important that we hope you will remember it forever, we emphasize it by identifying it as a Big Picture item. We have about a dozen Big Pictures in this book, the first being the five components of a computer that perform the tasks of inputting, outputting, processing, and storing data.

Two key components of computers are input devices, such as the microphone, and output devices, such as the speaker. As the names suggest, input feeds the
computer, and output is the result of computation sent to the user. Some devices, such as wireless networks, provide both input and output to the computer.

Chapters 5 and 6 describe input/output (I/O) devices in more detail, but let’s take an introductory tour through the computer hardware, starting with the external I/O devices.
The BIG Picture: The five classic components of a computer are input, output, memory, datapath, and control, with the last two sometimes combined and called the processor. Figure 1.5 shows the standard organization of a computer. This organization is independent of hardware technology: you can place every piece of every computer, past and present, into one of these five categories. To help you keep all this in perspective, the five components of a computer are shown on the front page of each of the following chapters, with the portion of interest to that chapter highlighted.

FIGURE 1.5 The organization of a computer, showing the five classic components. The processor gets instructions and data from memory. Input writes data to memory, and output reads data from memory. Control sends the signals that determine the operations of the datapath, memory, input, and output.
liquid crystal display: A display technology using a thin layer of liquid polymers that can be used to transmit or block light according to whether a charge is applied.

active matrix display: A liquid crystal display using a transistor to control the transmission of light at each individual pixel.

pixel: The smallest individual picture element. Screens are composed of hundreds of thousands to millions of pixels, organized in a matrix.

“Through computer displays I have landed an airplane on the deck of a moving carrier, observed a nuclear particle hit a potential well, flown in a rocket at nearly the speed of light and watched a computer reveal its innermost workings.”
—Ivan Sutherland, the “father” of computer graphics, Scientific American, 1984
Through the Looking Glass

The most fascinating I/O device is probably the graphics display. Most personal mobile devices use liquid crystal displays (LCDs) to get a thin, low-power display. The LCD is not the source of light; instead, it controls the transmission of light. A typical LCD includes rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display, from either a light source behind the display or less often from reflected light. The rods straighten out when a current is applied and no longer bend the light. Since the liquid crystal material is between two screens polarized at 90 degrees, the light cannot pass through unless it is bent. Today, most LCD displays use an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images. A red-green-blue mask associated with each dot on the display determines the intensity of the three-color components in the final image; in a color active matrix LCD, there are three transistor switches at each point.

The image is composed of a matrix of picture elements, or pixels, which can be represented as a matrix of bits, called a bit map. Depending on the size of the screen and the resolution, the display matrix in a typical tablet ranges in size from 1024 × 768 to 2048 × 1536. A color display might use 8 bits for each of the three colors (red, blue, and green), for 24 bits per pixel, permitting millions of different colors to be displayed.

The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics display at the refresh rate. Figure 1.6 shows a frame buffer with a simplified design of just 4 bits per pixel.
The goal of the bit map is to faithfully represent what is on the screen. The challenges in graphics systems arise because the human eye is very good at detecting even subtle changes on the screen.
FIGURE 1.6 Each coordinate in the frame buffer on the left determines the shade of the corresponding coordinate for the raster scan CRT display on the right. Pixel (X0, Y0) contains the bit pattern 0011, which is a lighter shade on the screen than the bit pattern 1101 in pixel (X1, Y1).
Touchscreen<br />
While PCs also use LCD displays, the tablets <strong>and</strong> smartphones of the PostPC era<br />
have replaced the keyboard <strong>and</strong> mouse with touch sensitive displays, which has<br />
the wonderful user interface advantage of users pointing directly what they are<br />
interested in rather than indirectly with a mouse.<br />
While there are a variety of ways to implement a touch screen, many tablets<br />
today use capacitive sensing. Since people are electrical conductors, if an insulator<br />
like glass is covered with a transparent conductor, touching distorts the electrostatic<br />
field of the screen, which results in a change in capacitance. This technology can<br />
allow multiple touches simultaneously, which allows gestures that can lead to<br />
attractive user interfaces.<br />
Opening the Box<br />
Figure 1.7 shows the contents of the Apple iPad 2 tablet computer. Unsurprisingly,<br />
of the five classic components of the computer, I/O dominates this reading device.<br />
The list of I/O devices includes a capacitive multitouch LCD display, front facing<br />
camera, rear facing camera, microphone, headphone jack, speakers, accelerometer,<br />
gyroscope, Wi-Fi network, <strong>and</strong> Bluetooth network. The datapath, control, <strong>and</strong><br />
memory are a tiny portion of the components.<br />
The small rectangles in Figure 1.8 contain the devices that drive our advancing<br />
technology, called integrated circuits and nicknamed chips. The A5 package seen<br />
in the middle of Figure 1.8 contains two ARM processors that operate with a<br />
clock rate of 1 GHz. The processor is the active part of the computer, following the<br />
instructions of a program to the letter. It adds numbers, tests numbers, signals I/O<br />
devices to activate, <strong>and</strong> so on. Occasionally, people call the processor the CPU, for<br />
the more bureaucratic-sounding central processor unit.<br />
Descending even lower into the hardware, Figure 1.9 reveals details of a<br />
microprocessor. The processor logically comprises two main components: datapath<br />
<strong>and</strong> control, the respective brawn <strong>and</strong> brain of the processor. The datapath performs<br />
the arithmetic operations, <strong>and</strong> control tells the datapath, memory, <strong>and</strong> I/O devices<br />
what to do according to the wishes of the instructions of the program. Chapter 4<br />
explains the datapath <strong>and</strong> control for a higher-performance design.<br />
The A5 package in Figure 1.8 also includes two memory chips, each with<br />
2 gibibits of capacity, thereby supplying 512 MiB. The memory is where the<br />
programs are kept when they are running; it also contains the data needed by the<br />
running programs. The memory is built from DRAM chips. DRAM st<strong>and</strong>s for<br />
dynamic r<strong>and</strong>om access memory. Multiple DRAMs are used together to contain<br />
the instructions <strong>and</strong> data of a program. In contrast to sequential access memories,<br />
such as magnetic tapes, the RAM portion of the term DRAM means that memory<br />
accesses take basically the same amount of time no matter what portion of the<br />
memory is read.<br />
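The capacity arithmetic can be checked directly, remembering that gibi- and mebi- are binary (power-of-two) prefixes; the variable names below are our own:

```python
GIBIBIT = 2**30    # bits
MEBIBYTE = 2**20   # bytes

chips = 2
bits_per_chip = 2 * GIBIBIT                       # 2 gibibits per DRAM chip
total_mib = chips * bits_per_chip // 8 // MEBIBYTE
# total_mib is 512, matching the 512 MiB quoted for the A5 package
```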
Descending into the depths of any component of the hardware reveals insights<br />
into the computer. Inside the processor is another type of memory—cache memory.<br />
integrated circuit Also called a chip. A device combining dozens to millions of transistors.<br />
central processor unit (CPU) Also called processor. The active part of the computer, which contains the datapath and control and which adds numbers, tests numbers, signals I/O devices to activate, and so on.<br />
datapath The component of the processor that performs arithmetic operations.<br />
control The component of the processor that commands the datapath, memory, and I/O devices according to the instructions of the program.<br />
memory The storage area in which programs are kept when they are running and that contains the data needed by the running programs.<br />
dynamic random access memory (DRAM) Memory built as an integrated circuit; it provides random access to any location. Access times are 50 nanoseconds and cost per gigabyte in 2012 was $5 to $10.
20 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />
FIGURE 1.7 Components of the Apple iPad 2 A1395. The metal back of the iPad (with the reversed<br />
Apple logo in the middle) is in the center. At the top is the capacitive multitouch screen <strong>and</strong> LCD display. To<br />
the far right is the 3.8 V, 25 watt-hour, polymer battery, which consists of three Li-ion cell cases <strong>and</strong> offers<br />
10 hours of battery life. To the far left is the metal frame that attaches the LCD to the back of the iPad. The<br />
small components surrounding the metal back in the center are what we think of as the computer; they<br />
are often L-shaped to fit compactly inside the case next to the battery. Figure 1.8 shows a close-up of the<br />
L-shaped board to the lower left of the metal case, which is the logic printed circuit board that contains the<br />
processor <strong>and</strong> the memory. The tiny rectangle below the logic board contains a chip that provides wireless<br />
communication: Wi-Fi, Bluetooth, <strong>and</strong> FM tuner. It fits into a small slot in the lower left corner of the logic<br />
board. Near the upper left corner of the case is another L-shaped component, which is a front-facing camera<br />
assembly that includes the camera, headphone jack, <strong>and</strong> microphone. Near the right upper corner of the case<br />
is the board containing the volume control <strong>and</strong> silent/screen rotation lock button along with a gyroscope <strong>and</strong><br />
accelerometer. These last two chips combine to allow the iPad to recognize 6-axis motion. The tiny rectangle<br />
next to it is the rear-facing camera. Near the bottom right of the case is the L-shaped speaker assembly. The<br />
cable at the bottom is the connector between the logic board <strong>and</strong> the camera/volume control board. The<br />
board between the cable <strong>and</strong> the speaker assembly is the controller for the capacitive touchscreen. (Courtesy<br />
iFixit, www.ifixit.com)<br />
FIGURE 1.8 The logic board of Apple iPad 2 in Figure 1.7. The photo highlights five integrated circuits.<br />
The large integrated circuit in the middle is the Apple A5 chip, which contains two ARM processor cores<br />
that run at 1 GHz as well as 512 MB of main memory inside the package. Figure 1.9 shows a photograph of<br />
the processor chip inside the A5 package. The similar sized chip to the left is the 32 GB flash memory chip<br />
for non-volatile storage. There is an empty space between the two chips where a second flash chip can be<br />
installed to double storage capacity of the iPad. The chips to the right of the A5 include power controller <strong>and</strong><br />
I/O controller chips. (Courtesy iFixit, www.ifixit.com)
cache memory A small,<br />
fast memory that acts as a<br />
buffer for a slower, larger<br />
memory.<br />
FIGURE 1.9 The processor integrated circuit inside the A5 package. The chip is 12.1 by 10.1 mm, and<br />
it was manufactured originally in a 45-nm process (see Section 1.5). It has two identical ARM processors or<br />
cores in the middle left of the chip <strong>and</strong> a PowerVR graphical processor unit (GPU) with four datapaths in the<br />
upper left quadrant. To the left <strong>and</strong> bottom side of the ARM cores are interfaces to main memory (DRAM).<br />
(Courtesy Chipworks, www.chipworks.com)<br />
static random access memory (SRAM) Also memory built as an integrated circuit, but faster and less dense than DRAM.<br />
Cache memory consists of a small, fast memory that acts as a buffer for the DRAM<br />
memory. (The nontechnical definition of cache is a safe place for hiding things.)<br />
Cache is built using a different memory technology, static r<strong>and</strong>om access memory<br />
(SRAM). SRAM is faster but less dense, <strong>and</strong> hence more expensive, than DRAM<br />
(see Chapter 5). SRAM <strong>and</strong> DRAM are two layers of the memory hierarchy.
To distinguish between the volatile memory used to hold data <strong>and</strong> programs<br />
while they are running <strong>and</strong> this nonvolatile memory used to store data <strong>and</strong><br />
programs between runs, the term main memory or primary memory is used for<br />
the former, <strong>and</strong> secondary memory for the latter. Secondary memory forms the<br />
next lower layer of the memory hierarchy. DRAMs have dominated main memory<br />
since 1975, but magnetic disks dominated secondary memory starting even earlier.<br />
Because of their size and form factor, personal mobile devices (PMDs) use flash memory,<br />
a nonvolatile semiconductor memory, instead of disks. Figure 1.8 shows the chip<br />
containing the flash memory of the iPad 2. While slower than DRAM, it is much<br />
cheaper than DRAM in addition to being nonvolatile. Although costing more per<br />
bit than disks, it is smaller, it comes in much smaller capacities, it is more rugged,<br />
<strong>and</strong> it is more power efficient than disks. Hence, flash memory is the st<strong>and</strong>ard<br />
secondary memory for PMDs. Alas, unlike disks <strong>and</strong> DRAM, flash memory bits<br />
wear out after 100,000 to 1,000,000 writes. Thus, file systems must keep track of<br />
the number of writes <strong>and</strong> have a strategy to avoid wearing out storage, such as by<br />
moving popular data. Chapter 5 describes disks <strong>and</strong> flash memory in more detail.<br />
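The wear-avoidance strategy the text alludes to can be sketched as a toy wear-leveler that steers each write to the least-worn block. Real flash translation layers are far more elaborate, and every name here is our own invention:

```python
class ToyWearLeveler:
    """Track writes per flash block; always write to the least-worn block."""

    def __init__(self, num_blocks, write_limit=100_000):
        self.writes = [0] * num_blocks
        self.write_limit = write_limit  # flash bits wear out after ~100,000 writes

    def write(self):
        """Choose the least-worn block, record the write, return its index."""
        block = min(range(len(self.writes)), key=self.writes.__getitem__)
        if self.writes[block] >= self.write_limit:
            raise RuntimeError("all blocks worn out")
        self.writes[block] += 1
        return block

# 1000 writes are spread evenly over 4 blocks instead of wearing out one block.
ftl = ToyWearLeveler(4)
for _ in range(1000):
    ftl.write()
```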
Communicating with Other <strong>Computer</strong>s<br />
We’ve explained how we can input, compute, display, <strong>and</strong> save data, but there is<br />
still one missing item found in today’s computers: computer networks. Just as the<br />
processor shown in Figure 1.5 is connected to memory <strong>and</strong> I/O devices, networks<br />
interconnect whole computers, allowing computer users to extend the power of<br />
computing by including communication. Networks have become so popular that<br />
they are the backbone of current computer systems; a new personal mobile device<br />
or server without a network interface would be ridiculed. Networked computers<br />
have several major advantages:<br />
■ Communication: Information is exchanged between computers at high<br />
speeds.<br />
■ Resource sharing: Rather than each computer having its own I/O devices,<br />
computers on the network can share I/O devices.<br />
■ Nonlocal access: By connecting computers over long distances, users need not<br />
be near the computer they are using.<br />
Networks vary in length <strong>and</strong> performance, with the cost of communication<br />
increasing according to both the speed of communication <strong>and</strong> the distance that<br />
information travels. Perhaps the most popular type of network is Ethernet. It can<br />
be up to a kilometer long <strong>and</strong> transfer at up to 40 gigabits per second. Its length <strong>and</strong><br />
speed make Ethernet useful to connect computers on the same floor of a building;<br />
main memory Also called primary memory. Memory used to hold programs while they are running; typically consists of DRAM in today's computers.<br />
secondary memory Nonvolatile memory used to store programs and data between runs; typically consists of flash memory in PMDs and magnetic disks in servers.<br />
magnetic disk Also called hard disk. A form of nonvolatile secondary memory composed of rotating platters coated with a magnetic recording material. Because they are rotating mechanical devices, access times are about 5 to 20 milliseconds and cost per gigabyte in 2012 was $0.05 to $0.10.<br />
flash memory A nonvolatile semiconductor memory. It is cheaper and slower than DRAM but more expensive per bit and faster than magnetic disks. Access times are about 5 to 50 microseconds and cost per gigabyte in 2012 was $0.75 to $1.00.
local area network (LAN) A network designed to carry data within a geographically confined area, typically within a single building.<br />
wide area network (WAN) A network extended over hundreds of kilometers that can span a continent.<br />
Check<br />
Yourself<br />
hence, it is an example of what is generically called a local area network. Local area<br />
networks are interconnected with switches that can also provide routing services<br />
<strong>and</strong> security. Wide area networks cross continents <strong>and</strong> are the backbone of the<br />
Internet, which supports the web. They are typically based on optical fibers <strong>and</strong> are<br />
leased from telecommunication companies.<br />
Networks have changed the face of computing in the last 30 years, both by<br />
becoming much more ubiquitous <strong>and</strong> by making dramatic increases in performance.<br />
In the 1970s, very few individuals had access to electronic mail, the Internet <strong>and</strong><br />
web did not exist, <strong>and</strong> physically mailing magnetic tapes was the primary way to<br />
transfer large amounts of data between two locations. Local area networks were<br />
almost nonexistent, <strong>and</strong> the few existing wide area networks had limited capacity<br />
<strong>and</strong> restricted access.<br />
As networking technology improved, it became much cheaper <strong>and</strong> had a much<br />
higher capacity. For example, the first st<strong>and</strong>ardized local area network technology,<br />
developed about 30 years ago, was a version of Ethernet that had a maximum capacity<br />
(also called b<strong>and</strong>width) of 10 million bits per second, typically shared by tens of, if<br />
not a hundred, computers. Today, local area network technology offers a capacity<br />
of from 1 to 40 gigabits per second, usually shared by at most a few computers.<br />
Optical communications technology has allowed similar growth in the capacity of<br />
wide area networks, from hundreds of kilobits to gigabits <strong>and</strong> from hundreds of<br />
computers connected to a worldwide network to millions of computers connected.<br />
This combination of dramatic rise in deployment of networking combined with<br />
increases in capacity have made network technology central to the information<br />
revolution of the last 30 years.<br />
For the last decade another innovation in networking has been reshaping the way<br />
computers communicate. Wireless technology is now widespread, and it enabled<br />
the PostPC Era. The ability to make a radio in the same low-cost semiconductor<br />
technology (CMOS) used for memory <strong>and</strong> microprocessors enabled a significant<br />
improvement in price, leading to an explosion in deployment. Currently available<br />
wireless technologies, called by the IEEE st<strong>and</strong>ard name 802.11, allow for transmission<br />
rates from 1 to nearly 100 million bits per second. Wireless technology is quite a bit<br />
different from wire-based networks, since all users in an immediate area share the<br />
airwaves.<br />
■ Semiconductor DRAM memory, flash memory, <strong>and</strong> disk storage differ<br />
significantly. For each technology, list its volatility, approximate relative<br />
access time, <strong>and</strong> approximate relative cost compared to DRAM.<br />
1.5 Technologies for Building Processors and Memory<br />
Processors <strong>and</strong> memory have improved at an incredible rate, because computer<br />
designers have long embraced the latest in electronic technology to try to win the<br />
race to design a better computer. Figure 1.10 shows the technologies that have
FIGURE 1.13 A 12-inch (300 mm) wafer of Intel Core i7 (Courtesy Intel). The number of<br />
dies on this 300 mm (12 inch) wafer at 100% yield is 280, each 20.7 by 10.5 mm. The several dozen partially<br />
rounded chips at the boundaries of the wafer are useless; they are included because it’s easier to create the<br />
masks used to pattern the silicon. This die uses a 32-nanometer technology, which means that the smallest<br />
features are approximately 32 nm in size, although they are typically somewhat smaller than the actual feature<br />
size, which refers to the size of the transistors as “drawn” versus the final manufactured size.<br />
called dies <strong>and</strong> more informally known as chips. Figure 1.13 shows a photograph<br />
of a wafer containing microprocessors before they have been diced; earlier, Figure<br />
1.9 shows an individual microprocessor die.<br />
Dicing enables you to discard only those dies that were unlucky enough to<br />
contain the flaws, rather than the whole wafer. This concept is quantified by the<br />
yield of a process, which is defined as the percentage of good dies from the total<br />
number of dies on the wafer.<br />
The cost of an integrated circuit rises quickly as the die size increases, due both<br />
to the lower yield <strong>and</strong> the smaller number of dies that fit on a wafer. To reduce the<br />
cost, a large die can be shrunk by using the next generation process, which uses smaller sizes for<br />
both transistors and wires. This improves the yield and the die count per wafer. A<br />
32-nanometer (nm) process was typical in 2012, which means essentially that the<br />
smallest feature size on the die is 32 nm.<br />
die The individual rectangular sections that are cut from a wafer, more informally known as chips.<br />
yield The percentage of good dies from the total number of dies on the wafer.
Once you’ve found good dies, they are connected to the input/output pins of a<br />
package, using a process called bonding. These packaged parts are tested a final time,<br />
since mistakes can occur in packaging, <strong>and</strong> then they are shipped to customers.<br />
Elaboration: The cost of an integrated circuit can be expressed in three simple<br />
equations:<br />
Cost per die = Cost per wafer / (Dies per wafer × Yield)<br />
Dies per wafer ≈ Wafer area / Die area<br />
Yield = 1 / (1 + (Defects per area × Die area / 2))²<br />
The first equation is straightforward to derive. The second is an approximation,<br />
since it does not subtract the area near the border of the round wafer that cannot<br />
accommodate the rectangular dies (see Figure 1.13). The final equation is based on<br />
empirical observations of yields at integrated circuit factories, with the exponent related<br />
to the number of critical processing steps.<br />
Hence, depending on the defect rate <strong>and</strong> the size of the die <strong>and</strong> wafer, costs are<br />
generally not linear in the die area.<br />
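The three equations in the elaboration translate directly into a small cost calculator. The defect rate and wafer cost below are illustrative assumptions, not numbers from the text; with the Figure 1.13 die (20.7 × 10.5 mm) on a 300 mm wafer, the dies-per-wafer approximation lands somewhat above the quoted 280 because it ignores the unusable edge dies:

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Approximation: wafer area / die area (ignores edge loss)."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    return wafer_area / die_area_mm2

def die_yield(defects_per_mm2, die_area_mm2):
    """Empirical yield model: 1 / (1 + defects * area / 2)^2."""
    return 1.0 / (1.0 + defects_per_mm2 * die_area_mm2 / 2) ** 2

def cost_per_die(cost_per_wafer, wafer_diameter_mm, die_area_mm2, defects_per_mm2):
    good_dies = (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                 * die_yield(defects_per_mm2, die_area_mm2))
    return cost_per_wafer / good_dies

die_area = 20.7 * 10.5                           # Core i7 die from Figure 1.13
cost = cost_per_die(5000, 300, die_area, 0.005)  # $5000/wafer is assumed
```

Because yield falls as die area grows while dies per wafer shrink, the computed cost rises faster than linearly in die area, as the text observes.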
Check<br />
Yourself<br />
A key factor in determining the cost of an integrated circuit is volume. Which of<br />
the following are reasons why a chip made in high volume should cost less?<br />
1. With high volumes, the manufacturing process can be tuned to a particular<br />
design, increasing the yield.<br />
2. It is less work to design a high-volume part than a low-volume part.<br />
3. The masks used to make the chip are expensive, so the cost per chip is lower<br />
for higher volumes.<br />
4. Engineering development costs are high <strong>and</strong> largely independent of volume;<br />
thus, the development cost per die is lower with high-volume parts.<br />
5. High-volume parts usually have smaller die sizes than low-volume parts <strong>and</strong><br />
therefore have higher yield per wafer.<br />
1.6 Performance<br />
Assessing the performance of computers can be quite challenging. The scale <strong>and</strong><br />
intricacy of modern software systems, together with the wide range of performance<br />
improvement techniques employed by hardware designers, have made performance<br />
assessment much more difficult.<br />
When trying to choose among different computers, performance is an important<br />
attribute. Accurately measuring <strong>and</strong> comparing different computers is critical to
purchasers <strong>and</strong> therefore to designers. The people selling computers know this as<br />
well. Often, salespeople would like you to see their computer in the best possible<br />
light, whether or not this light accurately reflects the needs of the purchaser’s<br />
application. Hence, underst<strong>and</strong>ing how best to measure performance <strong>and</strong> the<br />
limitations of performance measurements is important in selecting a computer.<br />
The rest of this section describes different ways in which performance can be<br />
determined; then, we describe the metrics for measuring performance from the<br />
viewpoint of both a computer user <strong>and</strong> a designer. We also look at how these metrics<br />
are related <strong>and</strong> present the classical processor performance equation, which we will<br />
use throughout the text.<br />
Defining Performance<br />
When we say one computer has better performance than another, what do we<br />
mean? Although this question might seem simple, an analogy with passenger<br />
airplanes shows how subtle the question of performance can be. Figure 1.14<br />
lists some typical passenger airplanes, together with their cruising speed, range,<br />
<strong>and</strong> capacity. If we wanted to know which of the planes in this table had the best<br />
performance, we would first need to define performance. For example, considering<br />
different measures of performance, we see that the plane with the highest cruising<br />
speed was the Concorde (retired from service in 2003), the plane with the longest<br />
range is the DC-8, <strong>and</strong> the plane with the largest capacity is the 747.<br />
Airplane | Passenger capacity | Cruising range (miles) | Cruising speed (m.p.h.) | Passenger throughput (passengers × m.p.h.)<br />
Boeing 777 | 375 | 4630 | 610 | 228,750<br />
Boeing 747 | 470 | 4150 | 610 | 286,700<br />
BAC/Sud Concorde | 132 | 4000 | 1350 | 178,200<br />
Douglas DC-8-50 | 146 | 8720 | 544 | 79,424<br />
FIGURE 1.14 The capacity, range, <strong>and</strong> speed for a number of commercial airplanes. The last<br />
column shows the rate at which the airplane transports passengers, which is the capacity times the cruising<br />
speed (ignoring range <strong>and</strong> takeoff <strong>and</strong> l<strong>and</strong>ing times).<br />
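The last column of Figure 1.14 is simply capacity times cruising speed, which is easy to recompute:

```python
planes = {
    "Boeing 777":       (375, 610),
    "Boeing 747":       (470, 610),
    "BAC/Sud Concorde": (132, 1350),
    "Douglas DC-8-50":  (146, 544),
}

# passengers * m.p.h. for each airplane
throughput = {name: seats * mph for name, (seats, mph) in planes.items()}
fastest_mover = max(throughput, key=throughput.get)  # the 747 wins on throughput
```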
Let’s suppose we define performance in terms of speed. This still leaves two<br />
possible definitions. You could define the fastest plane as the one with the highest<br />
cruising speed, taking a single passenger from one point to another in the least time.<br />
If you were interested in transporting 450 passengers from one point to another,<br />
however, the 747 would clearly be the fastest, as the last column of the figure shows.<br />
Similarly, we can define computer performance in several different ways.<br />
If you were running a program on two different desktop computers, you’d say<br />
that the faster one is the desktop computer that gets the job done first. If you were<br />
running a datacenter that had several servers running jobs submitted by many<br />
users, you’d say that the faster computer was the one that completed the most<br />
jobs during a day. As an individual computer user, you are interested in reducing<br />
response time—the time between the start <strong>and</strong> completion of a task—also referred<br />
response time Also called execution time. The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.
throughput Also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time.<br />
to as execution time. Datacenter managers are often interested in increasing<br />
throughput or b<strong>and</strong>width—the total amount of work done in a given time. Hence,<br />
in most cases, we will need different performance metrics as well as different sets<br />
of applications to benchmark personal mobile devices, which are more focused on<br />
response time, versus servers, which are more focused on throughput.<br />
Throughput <strong>and</strong> Response Time<br />
EXAMPLE<br />
Do the following changes to a computer system increase throughput, decrease<br />
response time, or both?<br />
1. Replacing the processor in a computer with a faster version<br />
2. Adding additional processors to a system that uses multiple processors<br />
for separate tasks—for example, searching the web<br />
ANSWER<br />
Decreasing response time almost always improves throughput. Hence, in case<br />
1, both response time and throughput are improved. In case 2, no one task gets<br />
work done faster, so only throughput increases.<br />
If, however, the dem<strong>and</strong> for processing in the second case was almost<br />
as large as the throughput, the system might force requests to queue up. In<br />
this case, increasing the throughput could also improve response time, since<br />
it would reduce the waiting time in the queue. Thus, in many real computer<br />
systems, changing either execution time or throughput often affects the other.<br />
In discussing the performance of computers, we will be primarily concerned with<br />
response time for the first few chapters. To maximize performance, we want to<br />
minimize response time or execution time for some task. Thus, we can relate<br />
performance <strong>and</strong> execution time for a computer X:<br />
Performance_X = 1 / Execution time_X<br />
This means that for two computers X and Y, if the performance of X is greater than<br />
the performance of Y, we have<br />
Performance_X > Performance_Y<br />
1 / Execution time_X > 1 / Execution time_Y<br />
Execution time_Y > Execution time_X<br />
That is, the execution time on Y is longer than that on X, if X is faster than Y.<br />
In discussing a computer design, we often want to relate the performance of two<br />
different computers quantitatively. We will use the phrase “X is n times faster than<br />
Y”—or equivalently “X is n times as fast as Y”—to mean<br />
Performance_X / Performance_Y = n<br />
If X is n times as fast as Y, then the execution time on Y is n times as long as it is<br />
on X:<br />
Performance_X / Performance_Y = Execution time_Y / Execution time_X = n<br />
Relative Performance<br />
EXAMPLE<br />
If computer A runs a program in 10 seconds and computer B runs the same<br />
program in 15 seconds, how much faster is A than B?<br />
ANSWER<br />
We know that A is n times as fast as B if<br />
Performance_A / Performance_B = Execution time_B / Execution time_A = n<br />
Thus the performance ratio is<br />
15 / 10 = 1.5<br />
and A is therefore 1.5 times as fast as B.<br />
In the above example, we could also say that computer B is 1.5 times slower than<br />
computer A, since<br />
Performance_A / Performance_B = 1.5<br />
means that<br />
Performance_A / 1.5 = Performance_B<br />
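The ratio in the example generalizes to any pair of execution times; a one-line helper (the name is our own) makes the relationship explicit:

```python
def times_as_fast(exec_time_y, exec_time_x):
    """'X is n times as fast as Y' means n = Execution time_Y / Execution time_X."""
    return exec_time_y / exec_time_x

n = times_as_fast(15, 10)  # computer A (10 s) vs. computer B (15 s)
```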
For simplicity, we will normally use the terminology as fast as when we try to<br />
compare computers quantitatively. Because performance <strong>and</strong> execution time are<br />
reciprocals, increasing performance requires decreasing execution time. To avoid<br />
the potential confusion between the terms increasing <strong>and</strong> decreasing, we usually<br />
say “improve performance” or “improve execution time” when we mean “increase<br />
performance” <strong>and</strong> “decrease execution time.”<br />
CPU execution time Also called CPU time. The actual time the CPU spends computing for a specific task.<br />
user CPU time The CPU time spent in a program itself.<br />
system CPU time The CPU time spent in the operating system performing tasks on behalf of the program.<br />
Measuring Performance<br />
Time is the measure of computer performance: the computer that performs the<br />
same amount of work in the least time is the fastest. Program execution time is<br />
measured in seconds per program. However, time can be defined in different ways,<br />
depending on what we count. The most straightforward definition of time is called<br />
wall clock time, response time, or elapsed time. These terms mean the total time<br />
to complete a task, including disk accesses, memory accesses, input/output (I/O)<br />
activities, operating system overhead—everything.<br />
<strong>Computer</strong>s are often shared, however, <strong>and</strong> a processor may work on several<br />
programs simultaneously. In such cases, the system may try to optimize throughput<br />
rather than attempt to minimize the elapsed time for one program. Hence, we<br />
often want to distinguish between the elapsed time <strong>and</strong> the time over which the<br />
processor is working on our behalf. CPU execution time or simply CPU time,<br />
which recognizes this distinction, is the time the CPU spends computing for this<br />
task <strong>and</strong> does not include time spent waiting for I/O or running other programs.<br />
(Remember, though, that the response time experienced by the user will be the<br />
elapsed time of the program, not the CPU time.) CPU time can be further divided<br />
into the CPU time spent in the program, called user CPU time, <strong>and</strong> the CPU time<br />
spent in the operating system performing tasks on behalf of the program, called<br />
system CPU time. Differentiating between system <strong>and</strong> user CPU time is difficult to<br />
do accurately, because it is often hard to assign responsibility for operating system<br />
activities to one user program rather than another <strong>and</strong> because of the functionality<br />
differences among operating systems.<br />
For consistency, we maintain a distinction between performance based on<br />
elapsed time <strong>and</strong> that based on CPU execution time. We will use the term system<br />
performance to refer to elapsed time on an unloaded system <strong>and</strong> CPU performance<br />
to refer to user CPU time. We will focus on CPU performance in this chapter,<br />
although our discussions of how to summarize performance can be applied to<br />
either elapsed time or CPU time measurements.<br />
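The elapsed-versus-CPU-time distinction can be observed from a program itself. In Python, `time.perf_counter()` follows the wall clock while `time.process_time()` counts only the CPU time (user plus system) charged to this process, so time spent sleeping shows up in the first but not the second. A sketch, with names of our choosing:

```python
import time

def measure(task):
    """Run task() and return (elapsed_seconds, cpu_seconds)."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    task()
    return time.perf_counter() - wall0, time.process_time() - cpu0

# Sleeping consumes elapsed time but essentially no CPU time.
elapsed, cpu = measure(lambda: time.sleep(0.2))
```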
Underst<strong>and</strong>ing<br />
Program<br />
Performance<br />
Different applications are sensitive to different aspects of the performance of a<br />
computer system. Many applications, especially those running on servers, depend<br />
as much on I/O performance as on CPU performance; I/O performance, in turn,<br />
relies on both hardware and software. Total elapsed time measured by a wall clock<br />
is the measurement of interest. In
1.6 Performance 33<br />
some application environments, the user may care about throughput, response<br />
time, or a complex combination of the two (e.g., maximum throughput with a<br />
worst-case response time). To improve the performance of a program, one must<br />
have a clear definition of what performance metric matters <strong>and</strong> then proceed to<br />
look for performance bottlenecks by measuring program execution <strong>and</strong> looking<br />
for the likely bottlenecks. In the following chapters, we will describe how to search<br />
for bottlenecks <strong>and</strong> improve performance in various parts of the system.<br />
Although as computer users we care about time, when we examine the details<br />
of a computer it’s convenient to think about performance in other metrics. In<br />
particular, computer designers may want to think about a computer by using a<br />
measure that relates to how fast the hardware can perform basic functions. Almost<br />
all computers are constructed using a clock that determines when events take<br />
place in the hardware. These discrete time intervals are called clock cycles (or<br />
ticks, clock ticks, clock periods, clocks, cycles). <strong>Design</strong>ers refer to the length of a<br />
clock period both as the time for a complete clock cycle (e.g., 250 picoseconds, or<br />
250 ps) <strong>and</strong> as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse of the<br />
clock period. In the next subsection, we will formalize the relationship between the<br />
clock cycles of the hardware designer <strong>and</strong> the seconds of the computer user.<br />
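Since clock rate and clock period are inverses, converting between them is a one-liner; using the example values above:

```python
# A 250 ps clock period corresponds to a 4 GHz clock rate.
clock_period_s = 250e-12           # 250 picoseconds
clock_rate_hz = 1.0 / clock_period_s
print(clock_rate_hz / 1e9, "GHz")  # 4.0 GHz
```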
1. Suppose we know that an application that uses both personal mobile<br />
devices <strong>and</strong> the Cloud is limited by network performance. For the following<br />
changes, state whether only the throughput improves, both response time<br />
<strong>and</strong> throughput improve, or neither improves.<br />
a. An extra network channel is added between the PMD <strong>and</strong> the Cloud,<br />
increasing the total network throughput <strong>and</strong> reducing the delay to obtain<br />
network access (since there are now two channels).<br />
b. The networking software is improved, thereby reducing the network<br />
communication delay, but not increasing throughput.<br />
c. More memory is added to the computer.<br />
2. <strong>Computer</strong> C is 4 times as fast as computer B, which runs a given<br />
application in 28 seconds. How long will computer C take to run that<br />
application?<br />
clock cycle Also called tick, clock tick, clock period, clock, or cycle. The time for<br />
one clock period, usually of the processor clock, which runs at a constant rate.<br />
clock period The length of each clock cycle.<br />
Check Yourself<br />
CPU Performance <strong>and</strong> Its Factors<br />
Users <strong>and</strong> designers often examine performance using different metrics. If we could<br />
relate these different metrics, we could determine the effect of a design change<br />
on the performance as experienced by the user. Since we are confining ourselves<br />
to CPU performance at this point, the bottom-line performance measure is CPU
34 Chapter 1 <strong>Computer</strong> Abstractions <strong>and</strong> Technology<br />
execution time. A simple formula relates the most basic metrics (clock cycles <strong>and</strong><br />
clock cycle time) to CPU time:<br />
CPU execution time for a program = CPU clock cycles for a program × Clock cycle time<br />
Alternatively, because clock rate <strong>and</strong> clock cycle time are inverses,<br />
CPU execution time for a program = CPU clock cycles for a program / Clock rate<br />
This formula makes it clear that the hardware designer can improve performance<br />
by reducing the number of clock cycles required for a program or the length of<br />
the clock cycle. As we will see in later chapters, the designer often faces a trade-off<br />
between the number of clock cycles needed for a program <strong>and</strong> the length of each<br />
cycle. Many techniques that decrease the number of clock cycles may also increase<br />
the clock cycle time.<br />
Improving Performance<br />
EXAMPLE<br />
Our favorite program runs in 10 seconds on computer A, which has a 2 GHz<br />
clock. We are trying to help a computer designer build a computer, B, which will<br />
run this program in 6 seconds. The designer has determined that a substantial<br />
increase in the clock rate is possible, but this increase will affect the rest of the<br />
CPU design, causing computer B to require 1.2 times as many clock cycles as<br />
computer A for this program. What clock rate should we tell the designer to<br />
target?<br />
ANSWER<br />
Let’s first find the number of clock cycles required for the program on A:<br />
CPU time_A = CPU clock cycles_A / Clock rate_A<br />
10 seconds = CPU clock cycles_A / (2 × 10⁹ cycles/second)<br />
CPU clock cycles_A = 10 seconds × 2 × 10⁹ cycles/second = 20 × 10⁹ cycles<br />
CPU time for B can be found using this equation:<br />
CPU time_B = (1.2 × CPU clock cycles_A) / Clock rate_B<br />
6 seconds = (1.2 × 20 × 10⁹ cycles) / Clock rate_B<br />
Clock rate_B = (1.2 × 20 × 10⁹ cycles) / 6 seconds = 0.2 × 20 × 10⁹ cycles/second = 4 × 10⁹ cycles/second = 4 GHz<br />
To run the program in 6 seconds, B must have twice the clock rate of A.<br />
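The example's arithmetic can be replayed in a few lines of Python (numbers taken from the example itself):

```python
# Computer A runs the program in 10 s at 2 GHz.
time_a_s = 10.0
rate_a_hz = 2e9
cycles_a = time_a_s * rate_a_hz      # 20 x 10^9 cycles

# Computer B needs 1.2x as many cycles but must finish in 6 s.
time_b_s = 6.0
cycles_b = 1.2 * cycles_a
rate_b_hz = cycles_b / time_b_s      # required clock rate for B
print(rate_b_hz / 1e9, "GHz")        # 4.0 GHz, twice A's rate
```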
Instruction Performance<br />
The performance equations above did not include any reference to the number of<br />
instructions needed for the program. However, since the compiler clearly generated<br />
instructions to execute, <strong>and</strong> the computer had to execute the instructions to run<br />
the program, the execution time must depend on the number of instructions in a<br />
program. One way to think about execution time is that it equals the number of<br />
instructions executed multiplied by the average time per instruction. Therefore, the<br />
number of clock cycles required for a program can be written as<br />
CPU clock cycles = Instructions for a program × Average clock cycles per instruction<br />
The term clock cycles per instruction, which is the average number of clock<br />
cycles each instruction takes to execute, is often abbreviated as CPI. Since different<br />
instructions may take different amounts of time depending on what they do, CPI is<br />
an average of all the instructions executed in the program. CPI provides one way of<br />
comparing two different implementations of the same instruction set architecture,<br />
since the number of instructions executed for a program will, of course, be the<br />
same.<br />
clock cycles per instruction (CPI) Average number of clock cycles per instruction<br />
for a program or program fragment.<br />
Using the Performance Equation<br />
Suppose we have two implementations of the same instruction set architecture.<br />
<strong>Computer</strong> A has a clock cycle time of 250 ps <strong>and</strong> a CPI of 2.0 for some program,<br />
<strong>and</strong> computer B has a clock cycle time of 500 ps <strong>and</strong> a CPI of 1.2 for the same<br />
program. Which computer is faster for this program <strong>and</strong> by how much?<br />
EXAMPLE
ANSWER<br />
We know that each computer executes the same number of instructions for<br />
the program; let’s call this number I. First, find the number of processor clock<br />
cycles for each computer:<br />
CPU clock cycles_A = I × 2.0<br />
CPU clock cycles_B = I × 1.2<br />
Now we can compute the CPU time for each computer:<br />
CPU time_A = CPU clock cycles_A × Clock cycle time_A = I × 2.0 × 250 ps = 500 × I ps<br />
Likewise, for B:<br />
CPU time_B = I × 1.2 × 500 ps = 600 × I ps<br />
Clearly, computer A is faster. The amount faster is given by the ratio of the<br />
execution times:<br />
CPU performance_A / CPU performance_B = Execution time_B / Execution time_A = (600 × I ps) / (500 × I ps) = 1.2<br />
We can conclude that computer A is 1.2 times as fast as computer B for this<br />
program.<br />
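The comparison can also be checked numerically; the instruction count I cancels out, so any positive value gives the same ratio:

```python
I = 1e9  # instruction count; cancels in the ratio, any positive value works

# Computer A: 250 ps clock cycle, CPI 2.0; Computer B: 500 ps, CPI 1.2.
time_a_s = I * 2.0 * 250e-12   # 0.5 s for I = 10^9
time_b_s = I * 1.2 * 500e-12   # 0.6 s for I = 10^9
speedup = time_b_s / time_a_s
print(speedup)                 # 1.2: A is 1.2 times as fast as B
```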
instruction count The number of instructions executed by the program.<br />
The Classic CPU Performance Equation<br />
We can now write this basic performance equation in terms of instruction count<br />
(the number of instructions executed by the program), CPI, <strong>and</strong> clock cycle time:<br />
CPU time = Instruction count × CPI × Clock cycle time<br />
or, since the clock rate is the inverse of clock cycle time:<br />
CPU time = (Instruction count × CPI) / Clock rate<br />
These formulas are particularly useful because they separate the three key factors<br />
that affect performance. We can use these formulas to compare two different<br />
implementations or to evaluate a design alternative if we know its impact on these<br />
three parameters.
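The three factors compose into a small helper, handy for quick what-if comparisons (the values below are hypothetical):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """Classic CPU performance equation: time = IC * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Halving CPI halves CPU time when the other two factors are fixed.
base = cpu_time(10e9, 2.0, 4e9)      # 5.0 s
improved = cpu_time(10e9, 1.0, 4e9)  # 2.5 s
```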
1.7 The Power Wall 41<br />
Although power provides a limit to what we can cool, in the PostPC Era the<br />
really critical resource is energy. Battery life can trump performance in the personal<br />
mobile device, <strong>and</strong> the architects of warehouse scale computers try to reduce the<br />
costs of powering <strong>and</strong> cooling 100,000 servers as the costs are high at this scale. Just<br />
as measuring time in seconds is a safer measure of program performance than a<br />
rate like MIPS (see Section 1.10), the energy metric joules is a better measure than<br />
a power rate like watts, which is just joules/second.<br />
The dominant technology for integrated circuits is called CMOS (complementary<br />
metal oxide semiconductor). For CMOS, the primary source of energy consumption<br />
is so-called dynamic energy—that is, energy that is consumed when transistors<br />
switch states from 0 to 1 <strong>and</strong> vice versa. The dynamic energy depends on the<br />
capacitive loading of each transistor <strong>and</strong> the voltage applied:<br />
Energy ∝ Capacitive load × Voltage²<br />
This equation is the energy of a pulse during the logic transition of 0 → 1 → 0 or<br />
1 → 0 → 1. The energy of a single transition is then<br />
Energy ∝ 1/2 × Capacitive load × Voltage²<br />
The power required per transistor is just the product of energy of a transition <strong>and</strong><br />
the frequency of transitions:<br />
Power ∝ 1/2 × Capacitive load × Voltage² × Frequency switched<br />
Frequency switched is a function of the clock rate. The capacitive load per transistor<br />
is a function of both the number of transistors connected to an output (called the<br />
fanout) <strong>and</strong> the technology, which determines the capacitance of both wires <strong>and</strong><br />
transistors.<br />
With regard to Figure 1.16, how could clock rates grow by a factor of 1000<br />
while power grew by only a factor of 30? Energy <strong>and</strong> thus power can be reduced by<br />
lowering the voltage, which occurred with each new generation of technology, <strong>and</strong><br />
power is a function of the voltage squared. Typically, the voltage was reduced about<br />
15% per generation. In 20 years, voltages have gone from 5 V to 1 V, which is why<br />
the increase in power is only 30 times.<br />
Relative Power<br />
Suppose we developed a new, simpler processor that has 85% of the capacitive<br />
load of the more complex older processor. Further, assume that it has adjustable<br />
voltage so that it can reduce voltage 15% compared to the older processor, which<br />
results in a 15% shrink in frequency. What is the impact on dynamic power?<br />
EXAMPLE
ANSWER<br />
Power_new / Power_old = ((Capacitive load × 0.85) × (Voltage × 0.85)² × (Frequency switched × 0.85)) / (Capacitive load × Voltage² × Frequency switched)<br />
Thus the power ratio is<br />
0.85⁴ = 0.52<br />
Hence, the new processor uses about half the power of the old processor.<br />
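The 0.85⁴ ratio falls out directly when each factor of the dynamic power relation is scaled:

```python
scale = 0.85  # 15% reduction in capacitive load, voltage, and frequency

# Voltage enters squared, so the ratio is 0.85 * 0.85**2 * 0.85 = 0.85**4.
power_ratio = scale * scale**2 * scale
print(round(power_ratio, 2))  # 0.52
```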
The problem today is that further lowering of the voltage appears to make the<br />
transistors too leaky, like water faucets that cannot be completely shut off. Even<br />
today about 40% of the power consumption in server chips is due to leakage. If<br />
transistors started leaking more, the whole process could become unwieldy.<br />
To try to address the power problem, designers have already attached large<br />
devices to increase cooling, <strong>and</strong> they turn off parts of the chip that are not used in<br />
a given clock cycle. Although there are many more expensive ways to cool chips<br />
<strong>and</strong> thereby raise their power to, say, 300 watts, these techniques are generally<br />
too expensive for personal computers <strong>and</strong> even servers, not to mention personal<br />
mobile devices.<br />
Since computer designers slammed into a power wall, they needed a new way<br />
forward. They chose a different path from the way they designed microprocessors<br />
for their first 30 years.<br />
Elaboration: Although dynamic energy is the primary source of energy consumption<br />
in CMOS, static energy consumption occurs because of leakage current that flows even<br />
when a transistor is off. In servers, leakage is typically responsible for 40% of the energy<br />
consumption. Thus, increasing the number of transistors increases power dissipation,<br />
even if the transistors are always off. A variety of design techniques <strong>and</strong> technology<br />
innovations are being deployed to control leakage, but it’s hard to lower voltage further.<br />
Elaboration: Power is a challenge for integrated circuits for two reasons. First, power<br />
must be brought in <strong>and</strong> distributed around the chip; modern microprocessors use<br />
hundreds of pins just for power <strong>and</strong> ground! Similarly, multiple levels of chip interconnect<br />
are used solely for power <strong>and</strong> ground distribution to portions of the chip. Second, power<br />
is dissipated as heat <strong>and</strong> must be removed. Server chips can burn more than 100 watts,<br />
<strong>and</strong> cooling the chip <strong>and</strong> the surrounding system is a major expense in Warehouse Scale<br />
<strong>Computer</strong>s (see Chapter 6).
1.9 Real Stuff: Benchmarking the Intel Core i7 47<br />
Description | Name | Instruction Count × 10⁹ | CPI | Clock cycle time (seconds × 10⁻⁹) | Execution Time (seconds) | Reference Time (seconds) | SPECratio<br />
Interpreted string processing | perl | 2252 | 0.60 | 0.376 | 508 | 9770 | 19.2<br />
Block-sorting compression | bzip2 | 2390 | 0.70 | 0.376 | 629 | 9650 | 15.4<br />
GNU C compiler | gcc | 794 | 1.20 | 0.376 | 358 | 8050 | 22.5<br />
Combinatorial optimization | mcf | 221 | 2.66 | 0.376 | 221 | 9120 | 41.2<br />
Go game (AI) | go | 1274 | 1.10 | 0.376 | 527 | 10490 | 19.9<br />
Search gene sequence | hmmer | 2616 | 0.60 | 0.376 | 590 | 9330 | 15.8<br />
Chess game (AI) | sjeng | 1948 | 0.80 | 0.376 | 586 | 12100 | 20.7<br />
Quantum computer simulation | libquantum | 659 | 0.44 | 0.376 | 109 | 20720 | 190.0<br />
Video compression | h264avc | 3793 | 0.50 | 0.376 | 713 | 22130 | 31.0<br />
Discrete event simulation library | omnetpp | 367 | 2.10 | 0.376 | 290 | 6250 | 21.5<br />
Games/path finding | astar | 1250 | 1.00 | 0.376 | 470 | 7020 | 14.9<br />
XML parsing | xalancbmk | 1045 | 0.70 | 0.376 | 275 | 6900 | 25.1<br />
Geometric mean | – | – | – | – | – | – | 25.7<br />
FIGURE 1.18 SPECINTC2006 benchmarks running on a 2.66 GHz Intel Core i7 920. As the equation on page 35 explains,<br />
execution time is the product of the three factors in this table: instruction count in billions, clocks per instruction (CPI), <strong>and</strong> clock cycle time in<br />
nanoseconds. SPECratio is simply the reference time, which is supplied by SPEC, divided by the measured execution time. The single number<br />
quoted as SPECINTC2006 is the geometric mean of the SPECratios.<br />
set focusing on processor performance (now called SPEC89), which has evolved<br />
through five generations. The latest is SPEC CPU2006, which consists of a set of 12<br />
integer benchmarks (CINT2006) <strong>and</strong> 17 floating-point benchmarks (CFP2006).<br />
The integer benchmarks vary from part of a C compiler to a chess program to a<br />
quantum computer simulation. The floating-point benchmarks include structured<br />
grid codes for finite element modeling, particle method codes for molecular<br />
dynamics, <strong>and</strong> sparse linear algebra codes for fluid dynamics.<br />
Figure 1.18 describes the SPEC integer benchmarks <strong>and</strong> their execution time<br />
on the Intel Core i7 <strong>and</strong> shows the factors that explain execution time: instruction<br />
count, CPI, <strong>and</strong> clock cycle time. Note that CPI varies by more than a factor of 5.<br />
To simplify the marketing of computers, SPEC decided to report a single number<br />
to summarize all 12 integer benchmarks. Dividing the execution time of a reference<br />
processor by the execution time of the measured computer normalizes the execution<br />
time measurements; this normalization yields a measure, called the SPECratio, which<br />
has the advantage that bigger numeric results indicate faster performance. That is,<br />
the SPECratio is the inverse of execution time. A CINT2006 or CFP2006 summary<br />
measurement is obtained by taking the geometric mean of the SPECratios.<br />
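As a check, the SPECratios in Figure 1.18 can be combined with the geometric mean (computed via logarithms for numerical stability):

```python
import math

# SPECratio column of Figure 1.18 (Intel Core i7 920, CINT2006).
specratios = [19.2, 15.4, 22.5, 41.2, 19.9, 15.8, 20.7,
              190.0, 31.0, 21.5, 14.9, 25.1]

# Geometric mean: nth root of the product of n values.
geo_mean = math.exp(sum(math.log(r) for r in specratios) / len(specratios))
print(round(geo_mean, 1))  # ~25.7, matching the figure
```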
Elaboration: When comparing two computers using SPECratios, use the geometric<br />
mean so that it gives the same relative answer no matter what computer is used to<br />
normalize the results. If we averaged the normalized execution time values with an<br />
arithmetic mean, the results would vary depending on the computer we choose as the<br />
reference.
The formula for the geometric mean is<br />
Geometric mean = ⁿ√(∏_{i=1}^{n} Execution time ratio_i)<br />
where Execution time ratio_i is the execution time, normalized to the reference<br />
computer, for the ith program of a total of n in the workload, and ∏_{i=1}^{n} a_i<br />
means the product a_1 × a_2 × … × a_n.<br />
SPEC Power Benchmark<br />
Given the increasing importance of energy <strong>and</strong> power, SPEC added a benchmark<br />
to measure power. It reports power consumption of servers at different workload<br />
levels, divided into 10% increments, over a period of time. Figure 1.19 shows the<br />
results for a server using Intel Nehalem processors similar to the above.<br />
Target Load % | Performance (ssj_ops) | Average Power (watts)<br />
100% | 865,618 | 258<br />
90% | 786,688 | 242<br />
80% | 698,051 | 224<br />
70% | 607,826 | 204<br />
60% | 521,391 | 185<br />
50% | 436,757 | 170<br />
40% | 345,919 | 157<br />
30% | 262,071 | 146<br />
20% | 176,061 | 135<br />
10% | 86,784 | 121<br />
0% | 0 | 80<br />
Overall Sum | 4,787,166 | 1922<br />
∑ssj_ops / ∑power = 2490<br />
FIGURE 1.19 SPECpower_ssj2008 running on a dual socket 2.66 GHz Intel Xeon X5650<br />
with 16 GB of DRAM <strong>and</strong> one 100 GB SSD disk.<br />
SPECpower started with another SPEC benchmark for Java business applications<br />
(SPECJBB2005), which exercises the processors, caches, <strong>and</strong> main memory as well<br />
as the Java virtual machine, compiler, garbage collector, <strong>and</strong> pieces of the operating<br />
system. Performance is measured in throughput, <strong>and</strong> the units are business<br />
operations per second. Once again, to simplify the marketing of computers, SPEC
boils these numbers down to a single number, called “overall ssj_ops per watt.” The<br />
formula for this single summarizing metric is<br />
overall ssj_ops per watt = (∑_{i=0}^{10} ssj_ops_i) / (∑_{i=0}^{10} power_i)<br />
where ssj_ops_i is performance at each 10% increment and power_i is power<br />
consumed at each performance level.<br />
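Applying the formula to the sums reported in Figure 1.19:

```python
# "Overall Sum" row of Figure 1.19: ssj_ops and watts across all load levels.
total_ssj_ops = 4_787_166
total_power_watts = 1922

overall_ssj_ops_per_watt = total_ssj_ops / total_power_watts
print(int(overall_ssj_ops_per_watt))  # 2490, as reported
```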
1.10 Fallacies <strong>and</strong> Pitfalls<br />
The purpose of a section on fallacies <strong>and</strong> pitfalls, which will be found in every<br />
chapter, is to explain some commonly held misconceptions that you might<br />
encounter. We call them fallacies. When discussing a fallacy, we try to give a<br />
counterexample. We also discuss pitfalls, or easily made mistakes. Often pitfalls are<br />
generalizations of principles that are only true in a limited context. The purpose<br />
of these sections is to help you avoid making these mistakes in the computers you<br />
may design or use. Cost/performance fallacies <strong>and</strong> pitfalls have ensnared many a<br />
computer architect, including us. Accordingly, this section suffers no shortage of<br />
relevant examples. We start with a pitfall that traps many designers <strong>and</strong> reveals an<br />
important relationship in computer design.<br />
Pitfall: Expecting the improvement of one aspect of a computer to increase overall<br />
performance by an amount proportional to the size of the improvement.<br />
The great idea of making the common case fast has a demoralizing corollary<br />
that has plagued designers of both hardware <strong>and</strong> software. It reminds us that the<br />
opportunity for improvement is affected by how much time the event consumes.<br />
A simple design problem illustrates it well. Suppose a program runs in 100<br />
seconds on a computer, with multiply operations responsible for 80 seconds of this<br />
time. How much do I have to improve the speed of multiplication if I want my<br />
program to run five times faster?<br />
The execution time of the program after making the improvement is given by<br />
the following simple equation known as Amdahl’s Law:<br />
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected<br />
For this problem:<br />
Execution time after improvement = (80 seconds / n) + (100 − 80 seconds)<br />
Science must begin with myths, and the criticism of myths.<br />
Sir Karl Popper, The Philosophy of Science, 1957<br />
Amdahl’s Law A rule stating that the performance enhancement possible with a<br />
given improvement is limited by the amount that the improved feature is used. It is<br />
a quantitative version of the law of diminishing returns.<br />
Since we want the performance to be five times faster, the new execution time<br />
should be 20 seconds, giving<br />
20 seconds = (80 seconds / n) + 20 seconds<br />
0 = 80 seconds / n<br />
That is, there is no amount by which we can enhance multiply to achieve a fivefold<br />
increase in performance, if multiply accounts for only 80% of the workload. The<br />
performance enhancement possible with a given improvement is limited by the amount<br />
that the improved feature is used. In everyday life this concept also yields what we call<br />
the law of diminishing returns.<br />
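Amdahl’s Law drops straight into code, which makes the conclusion concrete: with 80 seconds affected and 20 seconds unaffected, the 20-second target is reached only as the multiply speedup goes to infinity.

```python
def time_after_improvement(affected_s, unaffected_s, speedup):
    """Amdahl's Law: only the affected portion of execution gets faster."""
    return affected_s / speedup + unaffected_s

# 80 s of multiply, 20 s of everything else (the example above).
# A 4x multiply speedup gives 80/4 + 20 = 40 s, not the 20 s target.
limit_s = time_after_improvement(80.0, 20.0, float("inf"))
print(limit_s)  # 20.0 s: the fivefold goal requires infinite multiply speedup
```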
We can use Amdahl’s Law to estimate performance improvements when we<br />
know the time consumed for some function <strong>and</strong> its potential speedup. Amdahl’s<br />
Law, together with the CPU performance equation, is a h<strong>and</strong>y tool for evaluating<br />
potential enhancements. Amdahl’s Law is explored in more detail in the exercises.<br />
Amdahl’s Law is also used to argue for practical limits to the number of parallel<br />
processors. We examine this argument in the Fallacies <strong>and</strong> Pitfalls section of<br />
Chapter 6.<br />
Fallacy: <strong>Computer</strong>s at low utilization use little power.<br />
Power efficiency matters at low utilizations because server workloads vary.<br />
Utilization of servers in Google’s warehouse scale computer, for example, is<br />
between 10% <strong>and</strong> 50% most of the time <strong>and</strong> at 100% less than 1% of the time. Even<br />
given five years to learn how to run the SPECpower benchmark well, the specially<br />
configured computer with the best results in 2012 still uses 33% of the peak power<br />
at 10% of the load. Systems in the field that are not configured for the SPECpower<br />
benchmark are surely worse.<br />
Since servers’ workloads vary but use a large fraction of peak power, Luiz<br />
Barroso <strong>and</strong> Urs Hölzle [2007] argue that we should redesign hardware to achieve<br />
“energy-proportional computing.” If future servers used, say, 10% of peak power at<br />
10% workload, we could reduce the electricity bill of datacenters <strong>and</strong> become good<br />
corporate citizens in an era of increasing concern about CO₂ emissions.<br />
Fallacy: <strong>Design</strong>ing for performance <strong>and</strong> designing for energy efficiency are<br />
unrelated goals.<br />
Since energy is power over time, it is often the case that hardware or software<br />
optimizations that take less time save energy overall even if the optimization takes<br />
a bit more energy when it is used. One reason is that all of the rest of the computer is<br />
consuming energy while the program is running, so even if the optimized portion<br />
uses a little more energy, the reduced time can save the energy of the whole system.<br />
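This “shorter runs can save energy” argument is just energy = power × time; a toy comparison (numbers hypothetical):

```python
# Whole-system power draw and run time for the same job, two builds.
baseline_power_w, baseline_time_s = 100.0, 10.0    # unoptimized
optimized_power_w, optimized_time_s = 110.0, 8.0   # faster, slightly hungrier

baseline_energy_j = baseline_power_w * baseline_time_s     # 1000 J
optimized_energy_j = optimized_power_w * optimized_time_s  # 880 J
# The optimized build draws 10% more power yet uses 12% less energy.
```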
Pitfall: Using a subset of the performance equation as a performance metric.<br />
We have already warned about the danger of predicting performance based on<br />
simply one of clock rate, instruction count, or CPI. Another common mistake
is to use only two of the three factors to compare performance. Although using<br />
two of the three factors may be valid in a limited context, the concept is also<br />
easily misused. Indeed, nearly all proposed alternatives to the use of time as the<br />
performance metric have led eventually to misleading claims, distorted results, or<br />
incorrect interpretations.<br />
One alternative to time is MIPS (million instructions per second). For a given<br />
program, MIPS is simply<br />
MIPS = Instruction count / (Execution time × 10⁶)<br />
Since MIPS is an instruction execution rate, MIPS specifies performance inversely<br />
to execution time; faster computers have a higher MIPS rating. The good news<br />
about MIPS is that it is easy to underst<strong>and</strong>, <strong>and</strong> faster computers mean bigger<br />
MIPS, which matches intuition.<br />
There are three problems with using MIPS as a measure for comparing computers.<br />
First, MIPS specifies the instruction execution rate but does not take into account<br />
the capabilities of the instructions. We cannot compare computers with different<br />
instruction sets using MIPS, since the instruction counts will certainly differ.<br />
Second, MIPS varies between programs on the same computer; thus, a computer<br />
cannot have a single MIPS rating. For example, by substituting for execution time,<br />
we see the relationship between MIPS, clock rate, <strong>and</strong> CPI:<br />
MIPS = Instruction count / (Execution time × 10⁶) = Instruction count / (((Instruction count × CPI) / Clock rate) × 10⁶) = Clock rate / (CPI × 10⁶)<br />
million instructions per second (MIPS) A measurement of program execution<br />
speed based on the number of millions of instructions. MIPS is computed as the<br />
instruction count divided by the product of the execution time and 10⁶.<br />
The CPI varied by a factor of 5 for SPEC CPU2006 on an Intel Core i7 computer<br />
in Figure 1.18, so MIPS does as well. Finally, <strong>and</strong> most importantly, if a new<br />
program executes more instructions but each instruction is faster, MIPS can vary<br />
independently from performance!<br />
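The pitfall is easy to demonstrate with hypothetical numbers (not taken from any real machine): a program version that executes fewer but slower instructions can finish sooner yet score lower MIPS:

```python
def exec_time_s(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

def mips(instruction_count, time_s):
    return instruction_count / (time_s * 1e6)

# Two versions of a program on the same 2 GHz machine.
t1 = exec_time_s(6e9, 1.0, 2e9)  # 3.0 s
t2 = exec_time_s(4e9, 1.2, 2e9)  # 2.4 s: version 2 is faster...
print(mips(6e9, t1), mips(4e9, t2))  # ...but version 1 has the higher MIPS
```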
Consider the following performance measurements for a program:<br />
Check Yourself<br />
Measurement | <strong>Computer</strong> A | <strong>Computer</strong> B<br />
Instruction count | 10 billion | 8 billion<br />
Clock rate | 4 GHz | 4 GHz<br />
CPI | 1.0 | 1.1<br />
a. Which computer has the higher MIPS rating?<br />
b. Which computer is faster?
1.13 Exercises 55<br />
e. Library reserve desk<br />
f. Increasing the gate area on a CMOS transistor to decrease its switching time<br />
g. Adding electromagnetic aircraft catapults (which are electrically-powered<br />
as opposed to current steam-powered models), allowed by the increased power<br />
generation offered by the new reactor technology<br />
h. Building self-driving cars whose control systems partially rely on existing sensor<br />
systems already installed into the base vehicle, such as lane departure systems <strong>and</strong><br />
smart cruise control systems<br />
1.3 [2] Describe the steps that transform a program written in a high-level<br />
language such as C into a representation that is directly executed by a computer<br />
processor.<br />
1.4 [2] Assume a color display using 8 bits for each of the primary colors<br />
(red, green, blue) per pixel <strong>and</strong> a frame size of 1280 × 1024.<br />
a. What is the minimum size in bytes of the frame buffer to store a frame?<br />
b. How long would it take, at a minimum, for the frame to be sent over a 100<br />
Mbit/s network?<br />
1.5 [4] Consider three different processors P1, P2, <strong>and</strong> P3 executing<br />
the same instruction set. P1 has a 3 GHz clock rate <strong>and</strong> a CPI of 1.5. P2 has a<br />
2.5 GHz clock rate <strong>and</strong> a CPI of 1.0. P3 has a 4.0 GHz clock rate <strong>and</strong> has a CPI<br />
of 2.2.<br />
a. Which processor has the highest performance expressed in instructions per second?<br />
b. If the processors each execute a program in 10 seconds, find the number of<br />
cycles <strong>and</strong> the number of instructions.<br />
c. We are trying to reduce the execution time by 30% but this leads to an increase<br />
of 20% in the CPI. What clock rate should we have to get this time reduction?<br />
1.6 [20] Consider two different implementations of the same instruction<br />
set architecture. The instructions can be divided into four classes according to<br />
their CPI (classes A, B, C, and D). P1 has a clock rate of 2.5 GHz and CPIs of 1, 2, 3,<br />
and 3; P2 has a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.<br />
Given a program with a dynamic instruction count of 1.0E6 instructions divided<br />
into classes as follows: 10% class A, 20% class B, 50% class C, <strong>and</strong> 20% class D,<br />
which implementation is faster?<br />
a. What is the global CPI for each implementation?<br />
b. Find the clock cycles required in both cases.
1.7 [15] Compilers can have a profound impact on the performance<br />
of an application. Assume that for a program, compiler A results in a dynamic<br />
instruction count of 1.0E9 <strong>and</strong> has an execution time of 1.1 s, while compiler B<br />
results in a dynamic instruction count of 1.2E9 <strong>and</strong> an execution time of 1.5 s.<br />
a. Find the average CPI for each program given that the processor has a clock cycle<br />
time of 1 ns.<br />
b. Assume the compiled programs run on two different processors. If the execution<br />
times on the two processors are the same, how much faster is the clock of the<br />
processor running compiler A’s code versus the clock of the processor running<br />
compiler B’s code?<br />
c. A new compiler is developed that uses only 6.0E8 instructions <strong>and</strong> has an<br />
average CPI of 1.1. What is the speedup of using this new compiler versus using<br />
compiler A or B on the original processor?<br />
1.8 The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6<br />
GHz <strong>and</strong> voltage of 1.25 V. Assume that, on average, it consumed 10 W of static<br />
power <strong>and</strong> 90 W of dynamic power.<br />
The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz <strong>and</strong> voltage<br />
of 0.9 V. Assume that, on average, it consumed 30 W of static power <strong>and</strong> 40 W of<br />
dynamic power.<br />
1.8.1 [5] For each processor find the average capacitive loads.<br />
1.8.2 [5] Find the percentage of the total dissipated power comprised by<br />
static power <strong>and</strong> the ratio of static power to dynamic power for each technology.<br />
1.8.3 [15] If the total dissipated power is to be reduced by 10%, how much<br />
should the voltage be reduced to maintain the same leakage current? Note: power<br />
is defined as the product of voltage <strong>and</strong> current.<br />
1.9 Assume for arithmetic, load/store, <strong>and</strong> branch instructions, a processor has<br />
CPIs of 1, 12, <strong>and</strong> 5, respectively. Also assume that on a single processor a program<br />
requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store<br />
instructions, <strong>and</strong> 256 million branch instructions. Assume that each processor has<br />
a 2 GHz clock frequency.<br />
Assume that, as the program is parallelized to run over multiple cores, the number<br />
of arithmetic <strong>and</strong> load/store instructions per processor is divided by 0.7 × p (where<br />
p is the number of processors) but the number of branch instructions per processor<br />
remains the same.<br />
1.9.1 [5] Find the total execution time for this program on 1, 2, 4, <strong>and</strong> 8<br />
processors, <strong>and</strong> show the relative speedup of the 2, 4, <strong>and</strong> 8 processor result relative<br />
to the single processor result.
1.13 Exercises 57<br />
1.9.2 [10] If the CPI of the arithmetic instructions was doubled,<br />
what would the impact be on the execution time of the program on 1, 2, 4, or 8<br />
processors?<br />
1.9.3 [10] To what should the CPI of load/store instructions be<br />
reduced in order for a single processor to match the performance of four processors<br />
using the original CPI values?<br />
1.10 Assume a 15 cm diameter wafer has a cost of 12, contains 84 dies, <strong>and</strong> has<br />
0.020 defects/cm². Assume a 20 cm diameter wafer has a cost of 15, contains 100<br />
dies, <strong>and</strong> has 0.031 defects/cm².<br />
1.10.1 [10] Find the yield for both wafers.<br />
1.10.2 [5] Find the cost per die for both wafers.<br />
1.10.3 [5] If the number of dies per wafer is increased by 10% <strong>and</strong> the<br />
defects per area unit increases by 15%, find the die area <strong>and</strong> yield.<br />
1.10.4 [5] Assume a fabrication process improves the yield from 0.92 to<br />
0.95. Find the defects per area unit for each version of the technology given a die<br />
area of 200 mm².<br />
1.11 The SPEC CPU2006 bzip2 benchmark running on an AMD Barcelona<br />
has an instruction count of 2.389E12, an execution time of 750 s, <strong>and</strong> a<br />
reference time of 9650 s.<br />
1.11.1 [5] Find the CPI if the clock cycle time is 0.333 ns.<br />
1.11.2 [5] Find the SPECratio.<br />
1.11.3 [5] Find the increase in CPU time if the number of instructions<br />
of the benchmark is increased by 10% without affecting the CPI.<br />
1.11.4 [5] Find the increase in CPU time if the number of instructions<br />
of the benchmark is increased by 10% <strong>and</strong> the CPI is increased by 5%.<br />
1.11.5 [5] Find the change in the SPECratio for this change.<br />
1.11.6 [10] Suppose that we are developing a new version of the AMD<br />
Barcelona processor with a 4 GHz clock rate. We have added some additional<br />
instructions to the instruction set in such a way that the number of instructions<br />
has been reduced by 15%. The execution time is reduced to 700 s <strong>and</strong> the new<br />
SPECratio is 13.7. Find the new CPI.<br />
1.11.7 [10] This CPI value is larger than obtained in 1.11.1 as the clock<br />
rate was increased from 3 GHz to 4 GHz. Determine whether the increase in the<br />
CPI is similar to that of the clock rate. If they are dissimilar, why?<br />
1.11.8 [5] By how much has the CPU time been reduced?
1.11.9 [10] For a second benchmark, libquantum, assume an execution<br />
time of 960 ns, CPI of 1.61, <strong>and</strong> clock rate of 3 GHz. If the execution time is<br />
reduced by an additional 10% without affecting the CPI <strong>and</strong> with a clock rate of<br />
4 GHz, determine the number of instructions.<br />
1.11.10 [10] Determine the clock rate required to give a further 10%<br />
reduction in CPU time while maintaining the number of instructions <strong>and</strong> with the<br />
CPI unchanged.<br />
1.11.11 [10] Determine the clock rate if the CPI is reduced by 15% <strong>and</strong><br />
the CPU time by 20% while the number of instructions is unchanged.<br />
1.12 Section 1.10 cites as a pitfall the utilization of a subset of the performance<br />
equation as a performance metric. To illustrate this, consider the following two<br />
processors. P1 has a clock rate of 4 GHz, average CPI of 0.9, <strong>and</strong> requires the<br />
execution of 5.0E9 instructions. P2 has a clock rate of 3 GHz, an average CPI of<br />
0.75, <strong>and</strong> requires the execution of 1.0E9 instructions.<br />
1.12.1 [5] One usual fallacy is to consider the computer with the<br />
largest clock rate as having the largest performance. Check if this is true for P1 <strong>and</strong><br />
P2.<br />
1.12.2 [10] Another fallacy is to consider that the processor executing<br />
the largest number of instructions will need a larger CPU time. Considering that<br />
processor P1 is executing a sequence of 1.0E9 instructions <strong>and</strong> that the CPI of<br />
processors P1 <strong>and</strong> P2 do not change, determine the number of instructions that P2<br />
can execute in the same time that P1 needs to execute 1.0E9 instructions.<br />
1.12.3 [10] A common fallacy is to use MIPS (millions of<br />
instructions per second) to compare the performance of two different processors,<br />
<strong>and</strong> consider that the processor with the largest MIPS has the largest performance.<br />
Check if this is true for P1 <strong>and</strong> P2.<br />
1.12.4 [10] Another common performance figure is MFLOPS (millions<br />
of floating-point operations per second), defined as<br />
MFLOPS = No. FP operations / (execution time × 1E6)<br />
but this figure has the same problems as MIPS. Assume that 40% of the instructions<br />
executed on both P1 <strong>and</strong> P2 are floating-point instructions. Find the MFLOPS<br />
figures for the programs.<br />
1.13 Another pitfall cited in Section 1.10 is expecting to improve the overall<br />
performance of a computer by improving only one aspect of the computer. Consider<br />
a computer running a program that requires 250 s, with 70 s spent executing FP<br />
instructions, 85 s spent executing L/S instructions, <strong>and</strong> 40 s spent executing branch<br />
instructions.<br />
1.13.1 [5] By how much is the total time reduced if the time for FP<br />
operations is reduced by 20%?
1.13.2 [5] By how much is the time for INT operations reduced if the<br />
total time is reduced by 20%?<br />
1.13.3 [5] Can the total time be reduced by 20% by reducing only<br />
the time for branch instructions?<br />
1.14 Assume a program requires the execution of 50 × 10⁶ FP instructions,<br />
110 × 10⁶ INT instructions, 80 × 10⁶ L/S instructions, <strong>and</strong> 16 × 10⁶ branch<br />
instructions. The CPI for each type of instruction is 1, 1, 4, <strong>and</strong> 2, respectively.<br />
Assume that the processor has a 2 GHz clock rate.<br />
1.14.1 [10] By how much must we improve the CPI of FP instructions if<br />
we want the program to run two times faster?<br />
1.14.2 [10] By how much must we improve the CPI of L/S instructions<br />
if we want the program to run two times faster?<br />
1.14.3 [5] By how much is the execution time of the program improved<br />
if the CPI of INT <strong>and</strong> FP instructions is reduced by 40% <strong>and</strong> the CPI of L/S <strong>and</strong><br />
Branch is reduced by 30%?<br />
1.15 [5] When a program is adapted to run on multiple processors in<br />
a multiprocessor system, the execution time on each processor comprises<br />
computing time <strong>and</strong> the overhead time required for locked critical sections <strong>and</strong>/or<br />
to send data from one processor to another.<br />
Assume a program requires t = 100 s of execution time on one processor. When run<br />
on p processors, each processor requires t/p s, as well as an additional 4 s of overhead,<br />
irrespective of the number of processors. Compute the per-processor execution<br />
time for 2, 4, 8, 16, 32, 64, <strong>and</strong> 128 processors. For each case, list the corresponding<br />
speedup relative to a single processor <strong>and</strong> the ratio between actual speedup versus<br />
ideal speedup (speedup if there was no overhead).<br />
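The tabulation the exercise asks for follows directly from the two quantities just defined; a minimal C sketch (the function names are ours, and the 100 s and 4 s figures come from the exercise statement):

```c
#include <assert.h>

/* Execution time per processor: the compute portion shrinks with p,
   but the fixed overhead does not. */
double time_on_p(double t_single, double overhead, int p) {
    return t_single / p + overhead;
}

/* Speedup relative to one processor, which pays no overhead. */
double speedup(double t_single, double overhead, int p) {
    return t_single / time_on_p(t_single, overhead, p);
}
```

For p = 2 this gives 100/2 + 4 = 54 s per processor, a speedup of about 1.85 against an ideal speedup of 2.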
§1.1, page 10: Discussion questions: many answers are acceptable.<br />
§1.4, page 24: DRAM memory: volatile, short access time of 50 to 70 nanoseconds,<br />
<strong>and</strong> cost per GB is $5 to $10. Disk memory: nonvolatile, access times are 100,000<br />
to 400,000 times slower than DRAM, <strong>and</strong> cost per GB is 100 times cheaper than<br />
DRAM. Flash memory: nonvolatile, access times are 100 to 1000 times slower than<br />
DRAM, <strong>and</strong> cost per GB is 7 to 10 times cheaper than DRAM.<br />
§1.5, page 28: 1, 3, <strong>and</strong> 4 are valid reasons. Answer 5 can be generally true because<br />
high volume can make the extra investment to reduce die size by, say, 10% a good<br />
economic decision, but it doesn’t have to be true.<br />
§1.6, page 33: 1. a: both, b: latency, c: neither. 7 seconds.<br />
§1.6, page 40: b.<br />
§1.10, page 51: a. <strong>Computer</strong> A has the higher MIPS rating. b. <strong>Computer</strong> B is faster.<br />
Answers to<br />
Check Yourself
2<br />
I speak Spanish to God, Italian to women, French to men, <strong>and</strong> German to my horse.<br />
Charles V, Holy Roman Emperor<br />
(1500–1558)<br />
Instructions:<br />
Language of the<br />
<strong>Computer</strong><br />
2.1 Introduction 62<br />
2.2 Operations of the <strong>Computer</strong> Hardware 63<br />
2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 66<br />
2.4 Signed <strong>and</strong> Unsigned Numbers 73<br />
2.5 Representing Instructions in the<br />
<strong>Computer</strong> 80<br />
2.6 Logical Operations 87<br />
2.7 Instructions for Making Decisions 90<br />
<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />
© 2013 Elsevier Inc. All rights reserved.
2.2 Operations of the <strong>Computer</strong> Hardware 65<br />
instruction. Another difference from C is that comments always terminate at the<br />
end of a line.<br />
The natural number of oper<strong>and</strong>s for an operation like addition is three: the<br />
two numbers being added together <strong>and</strong> a place to put the sum. Requiring every<br />
instruction to have exactly three oper<strong>and</strong>s, no more <strong>and</strong> no less, conforms to the<br />
philosophy of keeping the hardware simple: hardware for a variable number of<br />
oper<strong>and</strong>s is more complicated than hardware for a fixed number. This situation<br />
illustrates the first of three underlying principles of hardware design:<br />
<strong>Design</strong> Principle 1: Simplicity favors regularity.<br />
We can now show, in the two examples that follow, the relationship of programs<br />
written in higher-level programming languages to programs in this more primitive<br />
notation.<br />
Compiling Two C Assignment Statements into MIPS<br />
This segment of a C program contains the five variables a, b, c, d, <strong>and</strong> e. Since<br />
Java evolved from C, this example <strong>and</strong> the next few work for either high-level<br />
programming language:<br />
EXAMPLE<br />
a = b + c;<br />
d = a – e;<br />
The translation from C to MIPS assembly language instructions is performed<br />
by the compiler. Show the MIPS code produced by a compiler.<br />
A MIPS instruction operates on two source oper<strong>and</strong>s <strong>and</strong> places the result<br />
in one destination oper<strong>and</strong>. Hence, the two simple statements above compile<br />
directly into these two MIPS assembly language instructions:<br />
ANSWER<br />
add a, b, c<br />
sub d, a, e<br />
Compiling a Complex C Assignment into MIPS<br />
A somewhat complex statement contains the five variables f, g, h, i, <strong>and</strong> j:<br />
EXAMPLE<br />
f = (g + h) – (i + j);<br />
What might a C compiler produce?
68 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />
ANSWER<br />
The compiled program is very similar to the prior example, except we replace<br />
the variables with the register names mentioned above plus two temporary<br />
registers, $t0 <strong>and</strong> $t1, which correspond to the temporary variables above:<br />
add $t0,$s1,$s2 # register $t0 contains g + h<br />
add $t1,$s3,$s4 # register $t1 contains i + j<br />
sub $s0,$t0,$t1 # f gets $t0 – $t1, which is (g + h)–(i + j)<br />
data transfer instruction A command that moves data between memory <strong>and</strong> registers.<br />
address A value used to delineate the location of a specific data element within a memory array.<br />
Memory Oper<strong>and</strong>s<br />
Programming languages have simple variables that contain single data elements,<br />
as in these examples, but they also have more complex data structures—arrays <strong>and</strong><br />
structures. These complex data structures can contain many more data elements<br />
than there are registers in a computer. How can a computer represent <strong>and</strong> access<br />
such large structures?<br />
Recall the five components of a computer introduced in Chapter 1 <strong>and</strong> repeated<br />
on page 61. The processor can keep only a small amount of data in registers, but<br />
computer memory contains billions of data elements. Hence, data structures<br />
(arrays <strong>and</strong> structures) are kept in memory.<br />
As explained above, arithmetic operations occur only on registers in MIPS<br />
instructions; thus, MIPS must include instructions that transfer data between<br />
memory <strong>and</strong> registers. Such instructions are called data transfer instructions.<br />
To access a word in memory, the instruction must supply the memory address.<br />
Memory is just a large, single-dimensional array, with the address acting as the<br />
index to that array, starting at 0. For example, in Figure 2.2, the address of the third<br />
data element is 2, <strong>and</strong> the value of Memory[2] is 10.<br />
[Figure: Processor connected to Memory; memory contents by address: 3 holds 100, 2 holds 10, 1 holds 101, 0 holds 1]<br />
FIGURE 2.2 Memory addresses <strong>and</strong> contents of memory at those locations. If these elements<br />
were words, these addresses would be incorrect, since MIPS actually uses byte addressing, with each word<br />
representing four bytes. Figure 2.3 shows the memory addressing for sequential word addresses.<br />
The data transfer instruction that copies data from memory to a register is<br />
traditionally called load. The format of the load instruction is the name of the<br />
operation followed by the register to be loaded, then a constant <strong>and</strong> register used to<br />
access memory. The sum of the constant portion of the instruction <strong>and</strong> the contents<br />
of the second register forms the memory address. The actual MIPS name for this<br />
instruction is lw, st<strong>and</strong>ing for load word.
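The address arithmetic that lw performs can be mimicked in a few lines of C; this is a sketch of the addressing rule only, not of the instruction itself, and the helper name is ours:

```c
#include <assert.h>
#include <stdint.h>

/* A load/store address is the sum of the instruction's constant
   (a signed 16-bit offset) and the contents of the base register. */
uint32_t effective_address(uint32_t base_reg, int16_t offset) {
    return base_reg + (int32_t)offset;   /* offset may be negative */
}
```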
2.3 Oper<strong>and</strong>s of the <strong>Computer</strong> Hardware 69<br />
Compiling an Assignment When an Oper<strong>and</strong> Is in Memory<br />
Let’s assume that A is an array of 100 words <strong>and</strong> that the compiler has<br />
associated the variables g <strong>and</strong> h with the registers $s1 <strong>and</strong> $s2 as before.<br />
Let’s also assume that the starting address, or base address, of the array is in<br />
$s3. Compile this C assignment statement:<br />
EXAMPLE<br />
g = h + A[8];<br />
Although there is a single operation in this assignment statement, one of<br />
the oper<strong>and</strong>s is in memory, so we must first transfer A[8] to a register. The<br />
address of this array element is the sum of the base of the array A, found in<br />
register $s3, plus the number to select element 8. The data should be placed<br />
in a temporary register for use in the next instruction. Based on Figure 2.2, the<br />
first compiled instruction is<br />
ANSWER<br />
lw $t0,8($s3) # Temporary reg $t0 gets A[8]<br />
(We’ll be making a slight adjustment to this instruction, but we’ll use this<br />
simplified version for now.) The following instruction can operate on the value<br />
in $t0 (which equals A[8]) since it is in a register. The instruction must add<br />
h (contained in $s2) to A[8] (contained in $t0) <strong>and</strong> put the sum in the<br />
register corresponding to g (associated with $s1):<br />
add $s1,$s2,$t0 # g = h + A[8]<br />
The constant in a data transfer instruction (8) is called the offset, <strong>and</strong> the<br />
register added to form the address ($s3) is called the base register.<br />
In addition to associating variables with registers, the compiler allocates data<br />
structures like arrays <strong>and</strong> structures to locations in memory. The compiler can then<br />
place the proper starting address into the data transfer instructions.<br />
Since 8-bit bytes are useful in many programs, virtually all architectures today<br />
address individual bytes. Therefore, the address of a word matches the address of<br />
one of the 4 bytes within the word, <strong>and</strong> addresses of sequential words differ by 4.<br />
For example, Figure 2.3 shows the actual MIPS addresses for the words in Figure<br />
2.2; the byte address of the third word is 8.<br />
In MIPS, words must start at addresses that are multiples of 4. This requirement<br />
is called an alignment restriction, <strong>and</strong> many architectures have it. (Chapter 4<br />
suggests why alignment leads to faster data transfers.)<br />
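The two rules just stated, that word element i lives at base + 4 × i and that every word address is a multiple of 4, can be captured in C (a sketch; the function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of word element A[i]: each 32-bit word is 4 bytes. */
uint32_t word_element_address(uint32_t base, uint32_t i) {
    return base + 4u * i;
}

/* MIPS alignment restriction: words start at multiples of 4. */
int is_word_aligned(uint32_t addr) {
    return addr % 4u == 0u;
}
```

With base 0, element 8 lands at byte address 32, exactly the offset used in the load a few pages later.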
Hardware/Software Interface<br />
alignment restriction A requirement that data be aligned in memory on natural boundaries.<br />
[Figure: Processor connected to Memory; memory contents by byte address: 12 holds 100, 8 holds 10, 4 holds 101, 0 holds 1]<br />
FIGURE 2.3 Actual MIPS memory addresses <strong>and</strong> contents of memory for those words.<br />
The changed addresses are highlighted to contrast with Figure 2.2. Since MIPS addresses each byte, word<br />
addresses are multiples of 4: there are 4 bytes in a word.<br />
<strong>Computer</strong>s divide into those that use the address of the leftmost or “big end” byte<br />
as the word address versus those that use the rightmost or “little end” byte. MIPS is<br />
in the big-endian camp. Since the order matters only if you access the identical data<br />
both as a word <strong>and</strong> as four bytes, few need to be aware of the endianness. (Appendix<br />
A shows the two options to number bytes in a word.)<br />
Byte addressing also affects the array index. To get the proper byte address in the<br />
code above, the offset to be added to the base register $s3 must be 4 × 8, or 32, so<br />
that the load address will select A[8] <strong>and</strong> not A[8/4]. (See the related pitfall on<br />
page 160 of Section 2.19.)<br />
The instruction complementary to load is traditionally called store; it copies data<br />
from a register to memory. The format of a store is similar to that of a load: the<br />
name of the operation, followed by the register to be stored, then offset to select<br />
the array element, <strong>and</strong> finally the base register. Once again, the MIPS address is<br />
specified in part by a constant <strong>and</strong> in part by the contents of a register. The actual<br />
MIPS name is sw, st<strong>and</strong>ing for store word.<br />
Hardware/Software Interface<br />
As the addresses in loads <strong>and</strong> stores are binary numbers, we can see why the<br />
DRAM for main memory comes in binary sizes rather than in decimal sizes. That<br />
is, in gibibytes (2³⁰) or tebibytes (2⁴⁰), not in gigabytes (10⁹) or terabytes (10¹²); see<br />
Figure 1.1.
Compiling Using Load <strong>and</strong> Store<br />
Assume variable h is associated with register $s2 <strong>and</strong> the base address of<br />
the array A is in $s3. What is the MIPS assembly code for the C assignment<br />
statement below?<br />
EXAMPLE<br />
A[12] = h + A[8];<br />
Although there is a single operation in the C statement, now two of the<br />
oper<strong>and</strong>s are in memory, so we need even more MIPS instructions. The first<br />
two instructions are the same as in the prior example, except this time we use<br />
the proper offset for byte addressing in the load word instruction to select<br />
A[8], <strong>and</strong> the add instruction places the sum in $t0:<br />
ANSWER<br />
lw $t0,32($s3) # Temporary reg $t0 gets A[8]<br />
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[8]<br />
The final instruction stores the sum into A[12], using 48 (4 × 12) as the offset<br />
<strong>and</strong> register $s3 as the base register.<br />
sw $t0,48($s3) # Stores h + A[8] back into A[12]<br />
Load word <strong>and</strong> store word are the instructions that copy words between<br />
memory <strong>and</strong> registers in the MIPS architecture. Other br<strong>and</strong>s of computers use<br />
other instructions along with load <strong>and</strong> store to transfer data. An architecture with<br />
such alternatives is the Intel x86, described in Section 2.17.<br />
Many programs have more variables than computers have registers. Consequently,<br />
the compiler tries to keep the most frequently used variables in registers <strong>and</strong> places<br />
the rest in memory, using loads <strong>and</strong> stores to move variables between registers <strong>and</strong><br />
memory. The process of putting less commonly used variables (or those needed<br />
later) into memory is called spilling registers.<br />
The hardware principle relating size <strong>and</strong> speed suggests that memory must be<br />
slower than registers, since there are fewer registers. This is indeed the case; data<br />
accesses are faster if data is in registers instead of memory.<br />
Moreover, data is more useful when in a register. A MIPS arithmetic instruction<br />
can read two registers, operate on them, <strong>and</strong> write the result. A MIPS data transfer<br />
instruction only reads one oper<strong>and</strong> or writes one oper<strong>and</strong>, without operating on it.<br />
Thus, registers take less time to access <strong>and</strong> have higher throughput than memory,<br />
making data in registers both faster to access <strong>and</strong> simpler to use. Accessing registers<br />
also uses less energy than accessing memory. To achieve highest performance <strong>and</strong><br />
conserve energy, an instruction set architecture must have a sufficient number of<br />
registers, <strong>and</strong> compilers must use registers efficiently.<br />
Hardware/Software Interface<br />
We number the bits 0, 1, 2, 3, . . . from right to left in a word. The drawing below<br />
shows the numbering of bits within a MIPS word <strong>and</strong> the placement of the number<br />
1011two:<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1<br />
(32 bits wide)<br />
least significant bit The rightmost bit in a MIPS word.<br />
most significant bit The leftmost bit in a MIPS word.<br />
Since words are drawn vertically as well as horizontally, leftmost <strong>and</strong> rightmost<br />
may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost<br />
bit (bit 0 above) <strong>and</strong> most significant bit to the leftmost bit (bit 31).<br />
The MIPS word is 32 bits long, so we can represent 2³² different 32-bit patterns.<br />
It is natural to let these combinations represent the numbers from 0 to 2³² – 1<br />
(4,294,967,295ten):<br />
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten<br />
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten<br />
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten<br />
. . . . . .<br />
1111 1111 1111 1111 1111 1111 1111 1101two = 4,294,967,293ten<br />
1111 1111 1111 1111 1111 1111 1111 1110two = 4,294,967,294ten<br />
1111 1111 1111 1111 1111 1111 1111 1111two = 4,294,967,295ten<br />
That is, 32-bit binary numbers can be represented in terms of the bit value times a<br />
power of 2 (here xi means the ith bit of x):<br />
(x31 × 2³¹) + (x30 × 2³⁰) + (x29 × 2²⁹) + … + (x1 × 2¹) + (x0 × 2⁰)<br />
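The positional formula can be checked mechanically; this C sketch (ours) sums bit i times 2^i over a 32-bit pattern supplied least significant bit first:

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate (x31 * 2^31) + ... + (x1 * 2^1) + (x0 * 2^0),
   where bits[i] holds x_i, the ith bit (bits[0] is the LSB). */
uint32_t unsigned_value(const int bits[32]) {
    uint32_t v = 0;
    for (int i = 0; i < 32; i++)
        v += (uint32_t)bits[i] << i;   /* x_i times 2^i */
    return v;
}
```

For the pattern 1011two shown earlier, the sum is 8 + 0 + 2 + 1 = 11ten.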
For reasons we will shortly see, these positive numbers are called unsigned numbers.<br />
Hardware/Software Interface<br />
Base 2 is not natural to human beings; we have 10 fingers <strong>and</strong> so find base 10<br />
natural. Why didn’t computers use decimal? In fact, the first commercial computer<br />
did offer decimal arithmetic. The problem was that the computer still used on<br />
<strong>and</strong> off signals, so a decimal digit was simply represented by several binary digits.<br />
Decimal proved so inefficient that subsequent computers reverted to all binary,<br />
converting to base 10 only for the relatively infrequent input/output events.<br />
Keep in mind that the binary bit patterns above are simply representatives of<br />
numbers. Numbers really have an infinite number of digits, with almost all being<br />
0 except for a few of the rightmost digits. We just don’t normally show leading 0s.<br />
Hardware can be designed to add, subtract, multiply, <strong>and</strong> divide these binary<br />
bit patterns. If the number that is the proper result of such operations cannot be<br />
represented by these rightmost hardware bits, overflow is said to have occurred.
2.4 Signed <strong>and</strong> Unsigned Numbers 75<br />
It’s up to the programming language, the operating system, <strong>and</strong> the program to<br />
determine what to do if overflow occurs.<br />
<strong>Computer</strong> programs calculate both positive <strong>and</strong> negative numbers, so we need a<br />
representation that distinguishes the positive from the negative. The most obvious<br />
solution is to add a separate sign, which conveniently can be represented in a single<br />
bit; the name for this representation is sign <strong>and</strong> magnitude.<br />
Alas, sign <strong>and</strong> magnitude representation has several shortcomings. First, it’s<br />
not obvious where to put the sign bit. To the right? To the left? Early computers<br />
tried both. Second, adders for sign <strong>and</strong> magnitude may need an extra step to set<br />
the sign because we can’t know in advance what the proper sign will be. Finally, a<br />
separate sign bit means that sign <strong>and</strong> magnitude has both a positive <strong>and</strong> a negative<br />
zero, which can lead to problems for inattentive programmers. As a result of these<br />
shortcomings, sign <strong>and</strong> magnitude representation was soon ab<strong>and</strong>oned.<br />
In the search for a more attractive alternative, the question arose as to what<br />
would be the result for unsigned numbers if we tried to subtract a large number<br />
from a small one. The answer is that it would try to borrow from a string of leading<br />
0s, so the result would have a string of leading 1s.<br />
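This borrowing behavior is easy to observe with C's unsigned arithmetic, which wraps modulo 2^32 (a small demonstration, not part of the text's derivation; the function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Subtracting a larger unsigned number from a smaller one borrows
   past the leftmost bit, leaving a string of leading 1s. */
uint32_t wrap_sub(uint32_t a, uint32_t b) {
    return a - b;   /* unsigned arithmetic wraps modulo 2^32 */
}
```

wrap_sub(2, 3) yields 1111 . . . 1111two, exactly the pattern two's complement will read as –1.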
Given that there was no obvious better alternative, the final solution was to pick<br />
the representation that made the hardware simple: leading 0s mean positive, <strong>and</strong><br />
leading 1s mean negative. This convention for representing signed binary numbers<br />
is called two’s complement representation:<br />
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten<br />
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten<br />
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten<br />
. . . . . .<br />
0111 1111 1111 1111 1111 1111 1111 1101two = 2,147,483,645ten<br />
0111 1111 1111 1111 1111 1111 1111 1110two = 2,147,483,646ten<br />
0111 1111 1111 1111 1111 1111 1111 1111two = 2,147,483,647ten<br />
1000 0000 0000 0000 0000 0000 0000 0000two = –2,147,483,648ten<br />
1000 0000 0000 0000 0000 0000 0000 0001two = –2,147,483,647ten<br />
1000 0000 0000 0000 0000 0000 0000 0010two = –2,147,483,646ten<br />
. . . . . .<br />
1111 1111 1111 1111 1111 1111 1111 1101two = –3ten<br />
1111 1111 1111 1111 1111 1111 1111 1110two = –2ten<br />
1111 1111 1111 1111 1111 1111 1111 1111two = –1ten<br />
The positive half of the numbers, from 0 to 2,147,483,647ten (2³¹ – 1), use the same<br />
representation as before. The following bit pattern (1000 . . . 0000two) represents the most<br />
negative number –2,147,483,648ten (–2³¹). It is followed by a declining set of negative<br />
numbers: –2,147,483,647ten (1000 . . . 0001two) down to –1ten (1111 . . . 1111two).<br />
Two’s complement does have one negative number, –2,147,483,648ten, that<br />
has no corresponding positive number. Such imbalance was also a worry to the<br />
has no corresponding positive number. Such imbalance was also a worry to the<br />
inattentive programmer, but sign <strong>and</strong> magnitude had problems for both the<br />
programmer <strong>and</strong> the hardware designer. Consequently, every computer today uses<br />
two’s complement binary representations for signed numbers.
76 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />
Two’s complement representation has the advantage that all negative numbers<br />
have a 1 in the most significant bit. Consequently, hardware needs to test only<br />
this bit to see if a number is positive or negative (with the number 0 considered<br />
positive). This bit is often called the sign bit. By recognizing the role of the sign bit,<br />
we can represent positive <strong>and</strong> negative 32-bit numbers in terms of the bit value<br />
times a power of 2:<br />
(x31 × –2^31) + (x30 × 2^30) + (x29 × 2^29) + . . . + (x1 × 2^1) + (x0 × 2^0)<br />
The sign bit is multiplied by –2^31, and the rest of the bits are then multiplied by<br />
positive versions of their respective base values.<br />
EXAMPLE<br />
Binary to Decimal Conversion<br />
What is the decimal value of this 32-bit two’s complement number?<br />
1111 1111 1111 1111 1111 1111 1111 1100 two<br />
ANSWER<br />
Substituting the number’s bit values into the formula above:<br />
(1 × –2^31) + (1 × 2^30) + (1 × 2^29) + . . . + (1 × 2^2) + (0 × 2^1) + (0 × 2^0)<br />
= –2^31 + 2^30 + 2^29 + . . . + 2^2 + 0 + 0<br />
= –2,147,483,648 ten + 2,147,483,644 ten<br />
= –4 ten<br />
We’ll see a shortcut to simplify conversion from negative to positive soon.<br />
Just as an operation on unsigned numbers can overflow the capacity of hardware<br />
to represent the result, so can an operation on two’s complement numbers. Overflow<br />
occurs when the leftmost retained bit of the binary bit pattern is not the same as the<br />
infinite number of digits to the left (the sign bit is incorrect): a 0 on the left of the bit<br />
pattern when the number is negative or a 1 when the number is positive.<br />
Hardware/Software Interface<br />
Signed versus unsigned applies to loads as well as to arithmetic. The function of a<br />
signed load is to copy the sign repeatedly to fill the rest of the register—called sign<br />
extension—but its purpose is to place a correct representation of the number within<br />
that register. Unsigned loads simply fill with 0s to the left of the data, since the<br />
number represented by the bit pattern is unsigned.<br />
When loading a 32-bit word into a 32-bit register, the point is moot; signed <strong>and</strong><br />
unsigned loads are identical. MIPS does offer two flavors of byte loads: load byte (lb)<br />
treats the byte as a signed number <strong>and</strong> thus sign-extends to fill the 24 left-most bits<br />
of the register, while load byte unsigned (lbu) works with unsigned integers. Since C<br />
programs almost always use bytes to represent characters rather than consider bytes<br />
as very short signed integers, lbu is used practically exclusively for byte loads.
2.4 Signed <strong>and</strong> Unsigned Numbers 77<br />
Hardware/Software Interface<br />
Unlike the numbers discussed above, memory addresses naturally start at 0<br />
<strong>and</strong> continue to the largest address. Put another way, negative addresses make<br />
no sense. Thus, programs want to deal sometimes with numbers that can be<br />
positive or negative <strong>and</strong> sometimes with numbers that can be only positive.<br />
Some programming languages reflect this distinction. C, for example, names the<br />
former integers (declared as int in the program) <strong>and</strong> the latter unsigned integers<br />
(unsigned int). Some C style guides even recommend declaring the former as<br />
signed int to keep the distinction clear.<br />
Let’s examine two useful shortcuts when working with two’s complement<br />
numbers. The first shortcut is a quick way to negate a two’s complement binary<br />
number. Simply invert every 0 to 1 <strong>and</strong> every 1 to 0, then add one to the result.<br />
This shortcut is based on the observation that the sum of a number and its inverted<br />
representation must be 111 . . . 111 two, which represents –1. Since x + x̄ = –1,<br />
therefore x + x̄ + 1 = 0, or x̄ + 1 = –x. (We use the notation x̄ to mean invert<br />
every bit in x from 0 to 1 and vice versa.)<br />
EXAMPLE<br />
Negation Shortcut<br />
Negate 2 ten, and then check the result by negating –2 ten.<br />
ANSWER<br />
2 ten = 0000 0000 0000 0000 0000 0000 0000 0010 two<br />
Negating this number by inverting the bits and adding one,<br />
1111 1111 1111 1111 1111 1111 1111 1101 two<br />
+ 1 two<br />
= 1111 1111 1111 1111 1111 1111 1111 1110 two<br />
= –2 ten<br />
Going the other direction,<br />
1111 1111 1111 1111 1111 1111 1111 1110 two<br />
is first inverted and then incremented:<br />
0000 0000 0000 0000 0000 0000 0000 0001 two<br />
+ 1 two<br />
= 0000 0000 0000 0000 0000 0000 0000 0010 two<br />
= 2 ten
Our next shortcut tells us how to convert a binary number represented in n bits<br />
to a number represented with more than n bits. For example, the immediate field<br />
in the load, store, branch, add, <strong>and</strong> set on less than instructions contains a two’s<br />
complement 16-bit number, representing –32,768 ten (–2^15) to 32,767 ten (2^15 – 1).<br />
To add the immediate field to a 32-bit register, the computer must convert that 16-<br />
bit number to its 32-bit equivalent. The shortcut is to take the most significant bit<br />
from the smaller quantity—the sign bit—<strong>and</strong> replicate it to fill the new bits of the<br />
larger quantity. The old nonsign bits are simply copied into the right portion of the<br />
new word. This shortcut is commonly called sign extension.<br />
EXAMPLE<br />
Sign Extension Shortcut<br />
Convert 16-bit binary versions of 2 ten and –2 ten to 32-bit binary numbers.<br />
ANSWER<br />
The 16-bit binary version of the number 2 is<br />
0000 0000 0000 0010 two = 2 ten<br />
It is converted to a 32-bit number by making 16 copies of the value in the most<br />
significant bit (0) and placing that in the left-hand half of the word. The right<br />
half gets the old value:<br />
0000 0000 0000 0000 0000 0000 0000 0010 two = 2 ten<br />
Let’s negate the 16-bit version of 2 using the earlier shortcut. Thus,<br />
0000 0000 0000 0010 two<br />
becomes<br />
1111 1111 1111 1101 two<br />
+ 1 two<br />
= 1111 1111 1111 1110 two<br />
Creating a 32-bit version of the negative number means copying the sign bit<br />
16 times <strong>and</strong> placing it on the left:<br />
1111 1111 1111 1111 1111 1111 1111 1110 two = –2 ten<br />
This trick works because positive two’s complement numbers really have an infinite<br />
number of 0s on the left <strong>and</strong> negative two’s complement numbers have an infinite<br />
number of 1s. The binary bit pattern representing a number hides leading bits to fit<br />
the width of the hardware; sign extension simply restores some of them.
Summary<br />
The main point of this section is that we need to represent both positive and<br />
negative integers within a computer word, and although there are pros and cons to<br />
any option, the unanimous choice since 1965 has been two’s complement.<br />
Elaboration: For signed decimal numbers, we used “–” to represent negative<br />
because there are no limits to the size of a decimal number. Given a fixed word size,<br />
binary and hexadecimal (see Figure 2.4) bit strings can encode the sign; hence we do<br />
not normally use “+” or “–” with binary or hexadecimal notation.<br />
Check Yourself<br />
What is the decimal value of this 64-bit two’s complement number?<br />
1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1000 two<br />
1) –4 ten<br />
2) –8 ten<br />
3) –16 ten<br />
4) 18,446,744,073,709,551,609 ten<br />
Elaboration: Two’s complement gets its name from the rule that the unsigned sum<br />
of an n-bit number and its n-bit negative is 2^n; hence, the negation or complement of a<br />
number x is 2^n – x, or its “two’s complement.”<br />
A third alternative representation to two’s complement and sign and magnitude is<br />
called one’s complement. The negative of a one’s complement number is found by inverting<br />
each bit, from 0 to 1 and from 1 to 0, or x̄. This relation helps explain its name since<br />
the complement of x is 2^n – x – 1. It was also an attempt to be a better solution<br />
than sign and magnitude, and several early scientific computers did use the notation.<br />
This representation is similar to two’s complement except that it also has two 0s:<br />
00 . . . 00 two is positive 0 and 11 . . . 11 two is negative 0. The most negative number,<br />
10 . . . 000 two, represents –2,147,483,647 ten, and so the positives and negatives are<br />
balanced. One’s complement adders did need an extra step to subtract a number, and<br />
hence two’s complement dominates today.<br />
A final notation, which we will look at when we discuss floating point in Chapter 3,<br />
is to represent the most negative value by 00 . . . 000 two and the most positive value<br />
by 11 . . . 11 two, with 0 typically having the value 10 . . . 00 two. This is called a biased<br />
notation, since it biases the number such that the number plus the bias has a non-negative<br />
representation.<br />
one’s complement A notation that represents the most negative value by 10 . . . 000 two and the most positive value by 01 . . . 11 two, leaving an equal number of negatives and positives but ending up with two zeros, one positive (00 . . . 00 two) and one negative (11 . . . 11 two). The term is also used to mean the inversion of every bit in a pattern: 0 to 1 and 1 to 0.<br />
biased notation A notation that represents the most negative value by 00 . . . 000 two and the most positive value by 11 . . . 11 two, with 0 typically having the value 10 . . . 00 two, thereby biasing the number such that the number plus the bias has a non-negative representation.
2.5 Representing Instructions in the <strong>Computer</strong> 81<br />
This layout of the instruction is called the instruction format. As you can see<br />
from counting the number of bits, this MIPS instruction takes exactly 32 bits—the<br />
same size as a data word. In keeping with our design principle that simplicity favors<br />
regularity, all MIPS instructions are 32 bits long.<br />
To distinguish it from assembly language, we call the numeric version of<br />
instructions machine language <strong>and</strong> a sequence of such instructions machine code.<br />
It would appear that you would now be reading <strong>and</strong> writing long, tedious strings<br />
of binary numbers. We avoid that tedium by using a higher base than binary that<br />
converts easily into binary. Since almost all computer data sizes are multiples of<br />
4, hexadecimal (base 16) numbers are popular. Since base 16 is a power of 2,<br />
we can trivially convert by replacing each group of four binary digits by a single<br />
hexadecimal digit, <strong>and</strong> vice versa. Figure 2.4 converts between hexadecimal <strong>and</strong><br />
binary.<br />
instruction format A form of representation of an instruction composed of fields of binary numbers.<br />
machine language Binary representation used for communication within a computer system.<br />
hexadecimal Numbers in base 16.<br />
Hexadecimal Binary Hexadecimal Binary Hexadecimal Binary Hexadecimal Binary<br />
0 hex 0000 two 4 hex 0100 two 8 hex 1000 two c hex 1100 two<br />
1 hex 0001 two 5 hex 0101 two 9 hex 1001 two d hex 1101 two<br />
2 hex 0010 two 6 hex 0110 two a hex 1010 two e hex 1110 two<br />
3 hex 0011 two 7 hex 0111 two b hex 1011 two f hex 1111 two<br />
FIGURE 2.4 The hexadecimal-binary conversion table. Just replace one hexadecimal digit by the corresponding four binary digits,<br />
<strong>and</strong> vice versa. If the length of the binary number is not a multiple of 4, go from right to left.<br />
Because we frequently deal with different number bases, to avoid confusion<br />
we will subscript decimal numbers with ten, binary numbers with two, <strong>and</strong><br />
hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By<br />
the way, C <strong>and</strong> Java use the notation 0xnnnn for hexadecimal numbers.<br />
EXAMPLE<br />
Binary to Hexadecimal and Back<br />
Convert the following hexadecimal and binary numbers into the other base:<br />
eca8 6420 hex<br />
0001 0011 0101 0111 1001 1011 1101 1111 two
ANSWER<br />
Using Figure 2.4, the answer is just a table lookup one way:<br />
eca8 6420 hex = 1110 1100 1010 1000 0110 0100 0010 0000 two<br />
And then the other direction:<br />
0001 0011 0101 0111 1001 1011 1101 1111 two = 1357 9bdf hex<br />
MIPS Fields<br />
MIPS fields are given names to make them easier to discuss:<br />
op rs rt rd shamt funct<br />
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits<br />
Here is the meaning of each name of the fields in MIPS instructions:<br />
opcode The field that denotes the operation and format of an instruction.<br />
■ op: Basic operation of the instruction, traditionally called the opcode.<br />
■ rs: The first register source operand.<br />
■ rt: The second register source operand.<br />
■ rd: The register destination operand. It gets the result of the operation.<br />
■ shamt: Shift amount. (Section 2.6 explains shift instructions <strong>and</strong> this term; it<br />
will not be used until then, <strong>and</strong> hence the field contains zero in this section.)<br />
■ funct: Function. This field, often called the function code, selects the specific<br />
variant of the operation in the op field.<br />
A problem occurs when an instruction needs longer fields than those shown<br />
above. For example, the load word instruction must specify two registers <strong>and</strong> a<br />
constant. If the address were to use one of the 5-bit fields in the format above, the<br />
constant within the load word instruction would be limited to only 2^5 or 32. This<br />
constant is used to select elements from arrays or data structures, <strong>and</strong> it often needs<br />
to be much larger than 32. This 5-bit field is too small to be useful.<br />
Hence, we have a conflict between the desire to keep all instructions the same<br />
length <strong>and</strong> the desire to have a single instruction format. This leads us to the final<br />
hardware design principle:
Design Principle 3: Good design demands good compromises.<br />
The compromise chosen by the MIPS designers is to keep all instructions the<br />
same length, thereby requiring different kinds of instruction formats for different<br />
kinds of instructions. For example, the format above is called R-type (for register)<br />
or R-format. A second type of instruction format is called I-type (for immediate)<br />
or I-format <strong>and</strong> is used by the immediate <strong>and</strong> data transfer instructions. The fields<br />
of I-format are<br />
op rs rt constant or address<br />
6 bits 5 bits 5 bits 16 bits<br />
The 16-bit address means a load word instruction can load any word within<br />
a region of ±2^15 or 32,768 bytes (±2^13 or 8192 words) of the address in the base<br />
register rs. Similarly, add immediate is limited to constants no larger than ±2^15.<br />
We see that more than 32 registers would be difficult in this format, as the rs <strong>and</strong> rt<br />
fields would each need another bit, making it harder to fit everything in one word.<br />
Let’s look at the load word instruction from page 71:<br />
lw $t0,32($s3) # Temporary reg $t0 gets A[8]<br />
Here, 19 (for $s3) is placed in the rs field, 8 (for $t0) is placed in the rt field, <strong>and</strong><br />
32 is placed in the address field. Note that the meaning of the rt field has changed<br />
for this instruction: in a load word instruction, the rt field specifies the destination<br />
register, which receives the result of the load.<br />
Although multiple formats complicate the hardware, we can reduce the complexity<br />
by keeping the formats similar. For example, the first three fields of the R-type <strong>and</strong><br />
I-type formats are the same size <strong>and</strong> have the same names; the length of the fourth<br />
field in I-type is equal to the sum of the lengths of the last three fields of R-type.<br />
In case you were wondering, the formats are distinguished by the values in the<br />
first field: each format is assigned a distinct set of values in the first field (op) so that<br />
the hardware knows whether to treat the last half of the instruction as three fields<br />
(R-type) or as a single field (I-type). Figure 2.5 shows the numbers used in each<br />
field for the MIPS instructions covered so far.<br />
Instruction Format op rs rt rd shamt funct address<br />
add R 0 reg reg reg 0 32 ten n.a.<br />
sub (subtract) R 0 reg reg reg 0 34 ten n.a.<br />
add immediate I 8 ten reg reg n.a. n.a. n.a. constant<br />
lw (load word) I 35 ten reg reg n.a. n.a. n.a. address<br />
sw (store word) I 43 ten reg reg n.a. n.a. n.a. address<br />
FIGURE 2.5 MIPS instruction encoding. In the table above, “reg” means a register number between 0<br />
<strong>and</strong> 31, “address” means a 16-bit address, <strong>and</strong> “n.a.” (not applicable) means this field does not appear in this<br />
format. Note that add <strong>and</strong> sub instructions have the same value in the op field; the hardware uses the funct<br />
field to decide the variant of the operation: add (32) or subtract (34).
EXAMPLE<br />
Translating MIPS Assembly Language into Machine Language<br />
We can now take an example all the way from what the programmer writes<br />
to what the computer executes. If $t1 has the base of the array A <strong>and</strong> $s2<br />
corresponds to h, the assignment statement<br />
A[300] = h + A[300];<br />
is compiled into<br />
lw $t0,1200($t1) # Temporary reg $t0 gets A[300]<br />
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[300]<br />
sw $t0,1200($t1) # Stores h + A[300] back into A[300]<br />
What is the MIPS machine language code for these three instructions?<br />
ANSWER<br />
For convenience, let’s first represent the machine language instructions using<br />
decimal numbers. From Figure 2.5, we can determine the three machine<br />
language instructions:<br />
op rs rt rd address/shamt funct<br />
35 9 8 1200<br />
0 18 8 8 0 32<br />
43 9 8 1200<br />
The lw instruction is identified by 35 (see Figure 2.5) in the first field<br />
(op). The base register 9 ($t1) is specified in the second field (rs), <strong>and</strong> the<br />
destination register 8 ($t0) is specified in the third field (rt). The offset to<br />
select A[300] (1200 = 300 × 4) is found in the final field (address).<br />
The add instruction that follows is specified with 0 in the first field (op) <strong>and</strong><br />
32 in the last field (funct). The three register operands (18, 8, and 8) are found<br />
in the second, third, <strong>and</strong> fourth fields <strong>and</strong> correspond to $s2, $t0, <strong>and</strong> $t0.<br />
The sw instruction is identified with 43 in the first field. The rest of this final<br />
instruction is identical to the lw instruction.<br />
Since 1200 ten = 0000 0100 1011 0000 two, the binary equivalent to the decimal<br />
, the binary equivalent to the decimal<br />
form is:<br />
100011 01001 01000 0000 0100 1011 0000<br />
000000 10010 01000 01000 00000 100000<br />
101011 01001 01000 0000 0100 1011 0000
The dual of a shift left is a shift right. The actual names of the two MIPS shift<br />
instructions are shift left logical (sll) and shift right logical (srl). The<br />
following instruction performs the operation above, assuming that the original<br />
value was in register $s0 and the result should go in register $t2:<br />
sll $t2,$s0,4 # reg $t2 = reg $s0 << 4 bits
2.6 Logical Operations 89<br />
To place a value into one of these seas of 0s, there is the dual to AND, called<br />
OR. It is a bit-by-bit operation that places a 1 in the result if either operand bit is<br />
a 1. To elaborate, if the registers $t1 and $t2 are unchanged from the preceding<br />
example, the result of the MIPS instruction<br />
or $t0,$t1,$t2 # reg $t0 = reg $t1 | reg $t2<br />
is this value in register $t0:<br />
0000 0000 0000 0000 0011 1101 1100 0000 two<br />
OR A logical bit-by-bit operation with two operands that calculates a 1 if there is a 1 in either operand.<br />
The final logical operation is a contrarian. NOT takes one operand and places a 1<br />
in the result if one operand bit is a 0, and vice versa. Using our prior notation, it<br />
calculates x̄.<br />
In keeping with the three-operand format, the designers of MIPS decided to<br />
include the instruction NOR (NOT OR) instead of NOT. If one operand is zero,<br />
then it is equivalent to NOT: A NOR 0 = NOT (A OR 0) = NOT (A).<br />
If the register $t1 is unchanged from the preceding example <strong>and</strong> register $t3<br />
has the value 0, the result of the MIPS instruction<br />
nor $t0,$t1,$t3 # reg $t0 = ~ (reg $t1 | reg $t3)<br />
is this value in register $t0:<br />
NOT A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.<br />
NOR A logical bit-by-bit operation with two operands that calculates the NOT of the OR of the two operands. That is, it calculates a 1 only if there is a 0 in both operands.<br />
1111 1111 1111 1111 1100 0011 1111 1111 two<br />
Figure 2.8 above shows the relationship between the C and Java operators and the<br />
MIPS instructions. Constants are useful in AND and OR logical operations as well<br />
as in arithmetic operations, so MIPS also provides the instructions and immediate<br />
(andi) and or immediate (ori). Constants are rare for NOR, since its main use is<br />
to invert the bits of a single operand; thus, the MIPS instruction set architecture has<br />
no immediate version of NOR.<br />
Elaboration: The full MIPS instruction set also includes exclusive or (XOR), which<br />
sets the bit to 1 when two corresponding bits differ, and to 0 when they are the same. C<br />
allows bit fields or fields to be defined within words, both allowing objects to be packed<br />
within a word and to match an externally enforced interface such as an I/O device. All<br />
fields must fit within a single word. Fields are unsigned integers that can be as short as<br />
1 bit. C compilers insert and extract fields using logical instructions in MIPS: and, or,<br />
sll, and srl.<br />
Elaboration: Logical AND immediate <strong>and</strong> logical OR immediate put 0s into the upper<br />
16 bits to form a 32-bit constant, unlike add immediate, which does sign extension.<br />
Check Yourself<br />
Which operations can isolate a field in a word?<br />
1. AND<br />
2. A shift left followed by a shift right
2.7 Instructions for Making Decisions 91<br />
The next assignment statement performs a single operation, and if all the<br />
operands are allocated to registers, it is just one instruction:<br />
add $s0,$s1,$s2 # f = g + h (skipped if i ≠ j)<br />
We now need to go to the end of the if statement. This example introduces<br />
another kind of branch, often called an unconditional branch. This instruction<br />
says that the processor always follows the branch. To distinguish between<br />
conditional <strong>and</strong> unconditional branches, the MIPS name for this type of<br />
instruction is jump, abbreviated as j (the label Exit is defined below).<br />
conditional branch An instruction that requires the comparison of two values and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the comparison.<br />
j Exit # go to Exit<br />
The assignment statement in the else portion of the if statement can again be<br />
compiled into a single instruction. We just need to append the label Else to<br />
this instruction. We also show the label Exit that is after this instruction,<br />
showing the end of the if-then-else compiled code:<br />
Else:sub $s0,$s1,$s2 # f = g – h (skipped if i = j)<br />
Exit:<br />
Notice that the assembler relieves the compiler <strong>and</strong> the assembly language<br />
programmer from the tedium of calculating addresses for branches, just as it does<br />
for calculating data addresses for loads <strong>and</strong> stores (see Section 2.12).<br />
[Flowchart: the test i == j selects between two boxes; when i = j, control falls through to f = g + h and then to Exit; when i ≠ j, control branches to Else: f = g – h.]<br />
FIGURE 2.9 Illustration of the options in the if statement above. The left box corresponds to<br />
the then part of the if statement, and the right box corresponds to the else part.<br />
Hardware/Software Interface<br />
Compilers frequently create branches <strong>and</strong> labels where they do not appear in<br />
the programming language. Avoiding the burden of writing explicit labels <strong>and</strong><br />
branches is one benefit of writing in high-level programming languages <strong>and</strong> is a<br />
reason coding is faster at that level.<br />
Loops<br />
Decisions are important both for choosing between two alternatives—found in if<br />
statements—<strong>and</strong> for iterating a computation—found in loops. The same assembly<br />
instructions are the building blocks for both cases.<br />
EXAMPLE<br />
Compiling a while Loop in C<br />
Here is a traditional loop in C:<br />
while (save[i] == k)<br />
i += 1;<br />
Assume that i <strong>and</strong> k correspond to registers $s3 <strong>and</strong> $s5 <strong>and</strong> the base of the<br />
array save is in $s6. What is the MIPS assembly code corresponding to this<br />
C segment?<br />
ANSWER<br />
The first step is to load save[i] into a temporary register. Before we can load<br />
save[i] into a temporary register, we need to have its address. Before we<br />
can add i to the base of array save to form the address, we must multiply the<br />
index i by 4 due to the byte addressing problem. Fortunately, we can use shift<br />
left logical, since shifting left by 2 bits multiplies by 2^2 or 4 (see page 88 in the<br />
prior section). We need to add the label Loop to it so that we can branch back<br />
to that instruction at the end of the loop:<br />
Loop: sll $t1,$s3,2 # Temp reg $t1 = i * 4<br />
To get the address of save[i], we need to add $t1 <strong>and</strong> the base of save in $s6:<br />
add $t1,$t1,$s6 # $t1 = address of save[i]<br />
Now we can use that address to load save[i] into a temporary register:<br />
lw $t0,0($t1) # Temp reg $t0 = save[i]<br />
The next instruction performs the loop test, exiting if save[i] ≠ k:<br />
bne $t0,$s5,Exit # go to Exit if save[i] ≠ k
The next instruction adds 1 to i:<br />
addi $s3,$s3,1 # i = i + 1<br />
The end of the loop branches back to the while test at the top of the loop. We<br />
just add the Exit label after it, <strong>and</strong> we’re done:<br />
j Loop # go to Loop<br />
Exit:<br />
(See the exercises for an optimization of this sequence.)<br />
Such sequences of instructions that end in a branch are so fundamental to compiling<br />
that they are given their own buzzword: a basic block is a sequence of instructions<br />
without branches, except possibly at the end, <strong>and</strong> without branch targets or branch<br />
labels, except possibly at the beginning. One of the early phases of compilation<br />
is breaking the program into basic blocks.<br />
basic block A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning).<br />
Hardware/Software Interface<br />
The test for equality or inequality is probably the most popular test, but sometimes<br />
it is useful to see if a variable is less than another variable. For example, a for loop<br />
may want to test to see if the index variable is less than 0. Such comparisons are<br />
accomplished in MIPS assembly language with an instruction that compares two<br />
registers and sets a third register to 1 if the first is less than the second; otherwise,<br />
it is set to 0. The MIPS instruction is called set on less than, or slt. For example,<br />
slt $t0, $s3, $s4 # $t0 = 1 if $s3 < $s4<br />
means that register $t0 is set to 1 if the value in register $s3 is less than the value<br />
in register $s4; otherwise, register $t0 is set to 0.<br />
Constant oper<strong>and</strong>s are popular in comparisons, so there is an immediate version<br />
of the set on less than instruction. To test if register $s2 is less than the constant<br />
10, we can just write<br />
slti $t0,$s2,10 # $t0 = 1 if $s2 < 10<br />
Hardware/Software Interface<br />
MIPS compilers use the slt, slti, beq, bne, and the fixed value of 0 (always<br />
available by reading register $zero) to create all relative conditions: equal, not<br />
equal, less than, less than or equal, greater than, greater than or equal.
Heeding von Neumann’s warning about the simplicity of the “equipment,” the<br />
MIPS architecture doesn’t include branch on less than because it is too complicated;<br />
either it would stretch the clock cycle time or it would take extra clock cycles per<br />
instruction. Two faster instructions are more useful.<br />
Hardware/Software Interface<br />
Comparison instructions must deal with the dichotomy between signed <strong>and</strong><br />
unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit<br />
represents a negative number <strong>and</strong>, of course, is less than any positive number,<br />
which must have a 0 in the most significant bit. With unsigned integers, on the<br />
other h<strong>and</strong>, a 1 in the most significant bit represents a number that is larger than<br />
any that begins with a 0. (We’ll soon take advantage of this dual meaning of the<br />
most significant bit to reduce the cost of the array bounds checking.)<br />
MIPS offers two versions of the set on less than comparison to h<strong>and</strong>le these<br />
alternatives. Set on less than (slt) <strong>and</strong> set on less than immediate (slti) work with<br />
signed integers. Unsigned integers are compared using set on less than unsigned<br />
(sltu) <strong>and</strong> set on less than immediate unsigned (sltiu).<br />
EXAMPLE<br />
Signed versus Unsigned Comparison<br />
Suppose register $s0 has the binary number<br />
1111 1111 1111 1111 1111 1111 1111 1111 two<br />
<strong>and</strong> that register $s1 has the binary number<br />
0000 0000 0000 0000 0000 0000 0000 0001 two<br />
What are the values of registers $t0 <strong>and</strong> $t1 after these two instructions?<br />
slt  $t0, $s0, $s1  # signed comparison
sltu $t1, $s0, $s1  # unsigned comparison
ANSWER<br />
The value in register $s0 represents −1 if it is a signed integer and 4,294,967,295 if it is an unsigned integer. The value in register $s1 represents 1 in either case. Then register $t0 has the value 1, since −1 < 1, and register $t1 has the value 0, since 4,294,967,295 > 1.
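The answer can be mirrored in C, where a cast chooses between the two interpretations of the same bit pattern (the model function names are ours):

```c
#include <stdint.h>

/* slt: signed set on less than. */
static uint32_t slt_model(uint32_t rs, uint32_t rt) {
    return (int32_t)rs < (int32_t)rt ? 1u : 0u;
}

/* sltu: unsigned set on less than on the same bit patterns. */
static uint32_t sltu_model(uint32_t rs, uint32_t rt) {
    return rs < rt ? 1u : 0u;
}
```

With $s0 = 0xFFFFFFFF and $s1 = 1, slt sees −1 < 1 and produces 1, while sltu sees 4,294,967,295 > 1 and produces 0.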
2.7 Instructions for Making Decisions 95<br />
Treating signed numbers as if they were unsigned gives us a low-cost way of checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays. The key is that negative integers in two's complement notation look like large numbers in unsigned notation; that is, the most significant bit is a sign bit in the former notation but a large part of the number in the latter. Thus, an unsigned comparison of x < y also checks whether x is negative, in addition to checking whether x is less than y.
Bounds Check Shortcut<br />
EXAMPLE
Use this shortcut to reduce an index-out-of-bounds check: jump to IndexOutOfBounds if $s1 ≥ $t2 or if $s1 is negative.

ANSWER
The checking code just uses sltu to do both checks:

sltu $t0,$s1,$t2                # $t0=0 if $s1>=length or $s1<0
beq  $t0,$zero,IndexOutOfBounds # if bad, goto Error
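The same trick can be sketched in C (the function name is ours): a single unsigned comparison performs both the negativity check and the length check, because a negative index reinterpreted as unsigned becomes a huge number.

```c
#include <stdint.h>

/* One unsigned comparison checks both 0 <= i and i < length.
   A negative i, viewed as unsigned, is larger than any plausible
   length, so it fails i < length too. Mirrors the sltu shortcut. */
static int index_in_bounds(int32_t i, uint32_t length) {
    return (uint32_t)i < length;   /* sltu $t0, $s1, $t2 */
}
```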
2.8 Supporting Procedures in <strong>Computer</strong> Hardware 97<br />
You can think of a procedure like a spy who leaves with a secret plan, acquires<br />
resources, performs the task, covers his or her tracks, <strong>and</strong> then returns to the point<br />
of origin with the desired result. Nothing else should be perturbed once the mission<br />
is complete. Moreover, a spy operates on only a “need to know” basis, so the spy<br />
can’t make assumptions about his employer.<br />
Similarly, in the execution of a procedure, the program must follow these six<br />
steps:<br />
1. Put parameters in a place where the procedure can access them.<br />
2. Transfer control to the procedure.<br />
3. Acquire the storage resources needed for the procedure.<br />
4. Perform the desired task.<br />
5. Put the result value in a place where the calling program can access it.<br />
6. Return control to the point of origin, since a procedure can be called from<br />
several points in a program.<br />
As mentioned above, registers are the fastest place to hold data in a computer, so we want to use them as much as possible. MIPS software uses the following convention for procedure calling in allocating its 32 registers:
■ $a0–$a3: four argument registers in which to pass parameters<br />
■ $v0–$v1: two value registers in which to return values<br />
■ $ra: one return address register to return to the point of origin<br />
In addition to allocating these registers, MIPS assembly language includes an<br />
instruction just for the procedures: it jumps to an address <strong>and</strong> simultaneously<br />
saves the address of the following instruction in register $ra. The jump-<strong>and</strong>-link<br />
instruction (jal) is simply written<br />
jal ProcedureAddress<br />
The link portion of the name means that an address or link is formed that points<br />
to the calling site to allow the procedure to return to the proper address. This “link,”<br />
stored in register $ra (register 31), is called the return address. The return address
is needed because the same procedure could be called from several parts of the<br />
program.<br />
To support such situations, computers like MIPS use the jump register instruction (jr), introduced above to help with case statements, meaning an unconditional jump to the address specified in a register:

jr $ra
jump-and-link instruction: An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).

return address: A link to the calling site that allows a procedure to return to the proper address; in MIPS it is stored in register $ra.
caller: The program that instigates a procedure and provides the necessary parameter values.

callee: A procedure that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller.

program counter (PC): The register containing the address of the instruction in the program being executed.

stack: A data structure for spilling registers, organized as a last-in-first-out queue.

stack pointer: A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found. In MIPS, it is register $sp.

push: Add element to stack.

pop: Remove element from stack.
The jump register instruction jumps to the address stored in register $ra—<br />
which is just what we want. Thus, the calling program, or caller, puts the parameter<br />
values in $a0–$a3 <strong>and</strong> uses jal X to jump to procedure X (sometimes named<br />
the callee). The callee then performs the calculations, places the results in $v0 <strong>and</strong><br />
$v1, <strong>and</strong> returns control to the caller using jr $ra.<br />
Implicit in the stored-program idea is the need to have a register to hold the<br />
address of the current instruction being executed. For historical reasons, this<br />
register is almost always called the program counter, abbreviated PC in the MIPS<br />
architecture, although a more sensible name would have been instruction address<br />
register. The jal instruction actually saves PC + 4 in register $ra to link to the
following instruction to set up the procedure return.<br />
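A toy model, with invented structure and function names, shows what jal and jr $ra do to the program counter and $ra:

```c
#include <stdint.h>

/* Minimal model of the call linkage (names ours, not MIPS's).
   jal saves PC + 4, the address of the following instruction,
   in $ra and then jumps; jr $ra jumps back to that address. */
struct cpu { uint32_t pc, ra; };

static void jal(struct cpu *c, uint32_t target) {
    c->ra = c->pc + 4;   /* link: address of the instruction after the jal */
    c->pc = target;      /* jump to the procedure */
}

static void jr_ra(struct cpu *c) {
    c->pc = c->ra;       /* return to the point of origin */
}

/* A call from address pc to a procedure at target: after the callee
   runs jr $ra, execution resumes at pc + 4. */
static uint32_t call_and_return(uint32_t pc, uint32_t target) {
    struct cpu c = { pc, 0 };
    jal(&c, target);
    jr_ra(&c);
    return c.pc;
}
```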
Using More Registers<br />
Suppose a compiler needs more registers for a procedure than the four argument<br />
<strong>and</strong> two return value registers. Since we must cover our tracks after our mission<br />
is complete, any registers needed by the caller must be restored to the values that<br />
they contained before the procedure was invoked. This situation is an example in<br />
which we need to spill registers to memory, as mentioned in the Hardware/Software<br />
Interface section above.<br />
The ideal data structure for spilling registers is a stack—a last-in-first-out<br />
queue. A stack needs a pointer to the most recently allocated address in the stack<br />
to show where the next procedure should place the registers to be spilled or where<br />
old register values are found. The stack pointer is adjusted by one word for each<br />
register that is saved or restored. MIPS software reserves register 29 for the stack<br />
pointer, giving it the obvious name $sp. Stacks are so popular that they have their<br />
own buzzwords for transferring data to <strong>and</strong> from the stack: placing data onto the<br />
stack is called a push, <strong>and</strong> removing data from the stack is called a pop.<br />
By historical precedent, stacks “grow” from higher addresses to lower addresses.<br />
This convention means that you push values onto the stack by subtracting from the<br />
stack pointer. Adding to the stack pointer shrinks the stack, thereby popping values<br />
off the stack.<br />
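The push and pop conventions can be sketched with a small C model (the array, its size, and the names are ours, standing in for memory and $sp):

```c
#include <stdint.h>

/* Toy downward-growing stack: as in MIPS, pushing subtracts from
   the stack pointer and popping adds to it. "mem" stands in for
   memory and "sp" is a word index playing the role of $sp. */
#define STACK_WORDS 64
static uint32_t mem[STACK_WORDS];
static int sp = STACK_WORDS;    /* stack starts at the high address */

static void push(uint32_t v) {
    sp -= 1;                    /* addi $sp, $sp, -4 (one word) */
    mem[sp] = v;                /* sw   reg, 0($sp)             */
}

static uint32_t pop(void) {
    uint32_t v = mem[sp];       /* lw   reg, 0($sp)             */
    sp += 1;                    /* addi $sp, $sp, 4             */
    return v;
}
```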
EXAMPLE<br />
Compiling a C Procedure That Doesn’t Call Another Procedure<br />
Let’s turn the example on page 65 from Section 2.2 into a C procedure:<br />
int leaf_example (int g, int h, int i, int j)
{
  int f;
  f = (g + h) - (i + j);
  return f;
}
What is the compiled MIPS assembly code?
The parameter variables g, h, i, <strong>and</strong> j correspond to the argument registers<br />
$a0, $a1, $a2, <strong>and</strong> $a3, <strong>and</strong> f corresponds to $s0. The compiled program<br />
starts with the label of the procedure:<br />
ANSWER<br />
leaf_example:<br />
The next step is to save the registers used by the procedure. The C assignment<br />
statement in the procedure body is identical to the example on page 68, which<br />
uses two temporary registers. Thus, we need to save three registers: $s0, $t0,<br />
<strong>and</strong> $t1. We “push” the old values onto the stack by creating space for three<br />
words (12 bytes) on the stack <strong>and</strong> then store them:<br />
addi $sp, $sp, -12 # adjust stack to make room for 3 items
sw $t1, 8($sp) # save register $t1 for use afterwards<br />
sw $t0, 4($sp) # save register $t0 for use afterwards<br />
sw $s0, 0($sp) # save register $s0 for use afterwards<br />
Figure 2.10 shows the stack before, during, <strong>and</strong> after the procedure call.<br />
The next three statements correspond to the body of the procedure, which<br />
follows the example on page 68:<br />
add $t0,$a0,$a1 # register $t0 contains g + h<br />
add $t1,$a2,$a3 # register $t1 contains i + j<br />
sub $s0,$t0,$t1 # f = $t0 – $t1, which is (g + h)–(i + j)<br />
To return the value of f, we copy it into a return value register:<br />
add $v0,$s0,$zero # returns f ($v0 = $s0 + 0)<br />
Before returning, we restore the three old values of the registers we saved by<br />
“popping” them from the stack:<br />
lw $s0, 0($sp) # restore register $s0 for caller<br />
lw $t0, 4($sp) # restore register $t0 for caller<br />
lw $t1, 8($sp) # restore register $t1 for caller<br />
addi $sp,$sp,12 # adjust stack to delete 3 items<br />
The procedure ends with a jump register using the return address:<br />
jr $ra # jump back to calling routine<br />
In the previous example, we used temporary registers <strong>and</strong> assumed their old<br />
values must be saved <strong>and</strong> restored. To avoid saving <strong>and</strong> restoring a register whose<br />
value is never used, which might happen with a temporary register, MIPS software<br />
separates 18 of the registers into two groups:<br />
■ $t0–$t9: temporary registers that are not preserved by the callee (called<br />
procedure) on a procedure call<br />
■ $s0–$s7: saved registers that must be preserved on a procedure call (if<br />
used, the callee saves <strong>and</strong> restores them)
[Figure drawing: the stack grows from a high address toward a low address. In (b), during the call, the stack holds the contents of registers $t1, $t0, and $s0, with $sp pointing at the last word pushed; in (a) and (c), $sp is back at its original position.]
FIGURE 2.10 The values of the stack pointer <strong>and</strong> the stack (a) before, (b) during, <strong>and</strong> (c)<br />
after the procedure call. The stack pointer always points to the “top” of the stack, or the last word in the<br />
stack in this drawing.<br />
This simple convention reduces register spilling. In the example above, since the<br />
caller does not expect registers $t0 <strong>and</strong> $t1 to be preserved across a procedure<br />
call, we can drop two stores <strong>and</strong> two loads from the code. We still must save <strong>and</strong><br />
restore $s0, since the callee must assume that the caller needs its value.<br />
Nested Procedures<br />
Procedures that do not call others are called leaf procedures. Life would be simple if<br />
all procedures were leaf procedures, but they aren’t. Just as a spy might employ other<br />
spies as part of a mission, who in turn might use even more spies, so do procedures<br />
invoke other procedures. Moreover, recursive procedures even invoke “clones” of<br />
themselves. Just as we need to be careful when using registers in procedures, more<br />
care must also be taken when invoking nonleaf procedures.<br />
For example, suppose that the main program calls procedure A with an argument<br />
of 3, by placing the value 3 into register $a0 <strong>and</strong> then using jal A. Then suppose<br />
that procedure A calls procedure B via jal B with an argument of 7, also placed<br />
in $a0. Since A hasn’t finished its task yet, there is a conflict over the use of register<br />
$a0. Similarly, there is a conflict over the return address in register $ra, since it<br />
now has the return address for B. Unless we take steps to prevent the problem, this<br />
conflict will eliminate procedure A’s ability to return to its caller.<br />
One solution is to push all the other registers that must be preserved onto<br />
the stack, just as we did with the saved registers. The caller pushes any argument<br />
registers ($a0–$a3) or temporary registers ($t0–$t9) that are needed after<br />
the call. The callee pushes the return address register $ra <strong>and</strong> any saved registers<br />
($s0–$s7) used by the callee. The stack pointer $sp is adjusted to account for the<br />
number of registers placed on the stack. Upon the return, the registers are restored<br />
from memory <strong>and</strong> the stack pointer is readjusted.
Compiling a Recursive C Procedure, Showing Nested Procedure<br />
Linking<br />
EXAMPLE<br />
Let’s tackle a recursive procedure that calculates factorial:<br />
int fact (int n)<br />
{<br />
if (n < 1) return (1);
else return (n * fact(n - 1));
}<br />
What is the MIPS assembly code?<br />
The parameter variable n corresponds to the argument register $a0. The<br />
compiled program starts with the label of the procedure <strong>and</strong> then saves two<br />
registers on the stack, the return address <strong>and</strong> $a0:<br />
ANSWER<br />
fact:<br />
addi $sp, $sp, -8 # adjust stack for 2 items
sw $ra, 4($sp) # save the return address<br />
sw $a0, 0($sp) # save the argument n<br />
The first time fact is called, sw saves an address in the program that called<br />
fact. The next two instructions test whether n is less than 1, going to L1 if<br />
n ≥ 1.<br />
slti $t0,$a0,1 # test for n < 1<br />
beq $t0,$zero,L1 # if n >= 1, go to L1<br />
If n is less than 1, fact returns 1 by putting 1 into a value register: it adds 1 to<br />
0 <strong>and</strong> places that sum in $v0. It then pops the two saved values off the stack<br />
<strong>and</strong> jumps to the return address:<br />
addi $v0,$zero,1 # return 1<br />
addi $sp,$sp,8 # pop 2 items off stack<br />
jr $ra # return to caller<br />
Before popping two items off the stack, we could have loaded $a0 <strong>and</strong><br />
$ra. Since $a0 <strong>and</strong> $ra don’t change when n is less than 1, we skip those<br />
instructions.<br />
If n is not less than 1, the argument n is decremented <strong>and</strong> then fact is<br />
called again with the decremented value:<br />
L1: addi $a0,$a0,-1 # n >= 1: argument gets (n - 1)
jal fact # call fact with (n - 1)
The next instruction is where fact returns. Now the old return address <strong>and</strong><br />
old argument are restored, along with the stack pointer:<br />
lw $a0, 0($sp) # return from jal: restore argument n<br />
lw $ra, 4($sp) # restore the return address<br />
addi $sp, $sp, 8 # adjust stack pointer to pop 2 items<br />
Next, the value register $v0 gets the product of old argument $a0 <strong>and</strong><br />
the current value of the value register. We assume a multiply instruction is<br />
available, even though it is not covered until Chapter 3:<br />
mul $v0,$a0,$v0 # return n * fact (n – 1)<br />
Finally, fact jumps again to the return address:<br />
jr $ra # return to the caller<br />
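As a sanity check, the C source of fact, instrumented with a frame counter of our own, confirms both the result and the recursion depth implied by the two words pushed per call: fact(n) for n ≥ 0 reaches n + 1 simultaneous activations.

```c
/* The fact procedure from the example, instrumented (the counters
   are ours) to record how many frames are live at the deepest point.
   Each activation corresponds to one 8-byte frame ($ra plus $a0). */
static int depth, max_depth;

int fact(int n) {
    if (++depth > max_depth) max_depth = depth;  /* entering a frame */
    int r = (n < 1) ? 1 : n * fact(n - 1);
    --depth;                                     /* frame popped */
    return r;
}
```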
Hardware/<br />
Software<br />
Interface<br />
global pointer The<br />
register that is reserved to<br />
point to the static area.<br />
A C variable is generally a location in storage, <strong>and</strong> its interpretation depends both<br />
on its type <strong>and</strong> storage class. Examples include integers <strong>and</strong> characters (see Section<br />
2.9). C has two storage classes: automatic <strong>and</strong> static. Automatic variables are local to<br />
a procedure <strong>and</strong> are discarded when the procedure exits. Static variables exist across<br />
exits from <strong>and</strong> entries to procedures. C variables declared outside all procedures<br />
are considered static, as are any variables declared using the keyword static. The<br />
rest are automatic. To simplify access to static data, MIPS software reserves another<br />
register, called the global pointer, or $gp.<br />
Figure 2.11 summarizes what is preserved across a procedure call. Note that<br />
several schemes preserve the stack, guaranteeing that the caller will get the same<br />
data back on a load from the stack as it stored onto the stack. The stack above $sp<br />
is preserved simply by making sure the callee does not write above $sp; $sp is<br />
Preserved:                           Not preserved:
  Saved registers: $s0–$s7             Temporary registers: $t0–$t9
  Stack pointer register: $sp          Argument registers: $a0–$a3
  Return address register: $ra         Return value registers: $v0–$v1
  Stack above the stack pointer        Stack below the stack pointer
FIGURE 2.11 What is <strong>and</strong> what is not preserved across a procedure call. If the software relies<br />
on the frame pointer register or on the global pointer register, discussed in the following subsections, they<br />
are also preserved.
itself preserved by the callee adding exactly the same amount that was subtracted<br />
from it; <strong>and</strong> the other registers are preserved by saving them on the stack (if they<br />
are used) <strong>and</strong> restoring them from there.<br />
Allocating Space for New Data on the Stack<br />
The final complexity is that the stack is also used to store variables that are local<br />
to the procedure but do not fit in registers, such as local arrays or structures. The<br />
segment of the stack containing a procedure’s saved registers <strong>and</strong> local variables is<br />
called a procedure frame or activation record. Figure 2.12 shows the state of the<br />
stack before, during, <strong>and</strong> after the procedure call.<br />
Some MIPS software uses a frame pointer ($fp) to point to the first word of<br />
the frame of a procedure. A stack pointer might change during the procedure, <strong>and</strong><br />
so references to a local variable in memory might have different offsets depending<br />
on where they are in the procedure, making the procedure harder to underst<strong>and</strong>.<br />
Alternatively, a frame pointer offers a stable base register within a procedure for<br />
local memory-references. Note that an activation record appears on the stack<br />
whether or not an explicit frame pointer is used. We’ve been avoiding using $fp by<br />
avoiding changes to $sp within a procedure: in our examples, the stack is adjusted<br />
only on entry <strong>and</strong> exit of the procedure.<br />
procedure frame Also<br />
called activation record.<br />
The segment of the stack<br />
containing a procedure’s<br />
saved registers <strong>and</strong> local<br />
variables.<br />
frame pointer A value<br />
denoting the location of<br />
the saved registers <strong>and</strong><br />
local variables for a given<br />
procedure.<br />
[Figure drawing: in (a), before the call, $fp and $sp coincide near the high address. In (b), during the call, the frame holds, from high to low addresses, the saved argument registers (if any), the saved return address, the saved saved registers (if any), and local arrays and structures (if any), with $fp at the first word of the frame and $sp at the top of the stack. In (c), after the call, $fp and $sp are restored.]
FIGURE 2.12 Illustration of the stack allocation (a) before, (b) during, <strong>and</strong> (c) after the<br />
procedure call. The frame pointer ($fp) points to the first word of the frame, often a saved argument<br />
register, <strong>and</strong> the stack pointer ($sp) points to the top of the stack. The stack is adjusted to make room for<br />
all the saved registers <strong>and</strong> any memory-resident local variables. Since the stack pointer may change during<br />
program execution, it’s easier for programmers to reference variables via the stable frame pointer, although it<br />
could be done just with the stack pointer <strong>and</strong> a little address arithmetic. If there are no local variables on the<br />
stack within a procedure, the compiler will save time by not setting <strong>and</strong> restoring the frame pointer. When a<br />
frame pointer is used, it is initialized using the address in $sp on a call, <strong>and</strong> $sp is restored using $fp. This<br />
information is also found in Column 4 of the MIPS Reference Data Card at the front of this book.
Figure 2.14 summarizes the register conventions for the MIPS assembly<br />
language. This convention is another example of making the common case fast:<br />
most procedures can be satisfied with up to 4 arguments, 2 registers for a return<br />
value, 8 saved registers, <strong>and</strong> 10 temporary registers without ever going to memory.<br />
Name      Register number  Usage                                         Preserved on call?
$zero     0                The constant value 0                          n.a.
$v0–$v1   2–3              Values for results and expression evaluation  no
$a0–$a3   4–7              Arguments                                     no
$t0–$t7   8–15             Temporaries                                   no
$s0–$s7   16–23            Saved                                         yes
$t8–$t9   24–25            More temporaries                              no
$gp       28               Global pointer                                yes
$sp       29               Stack pointer                                 yes
$fp       30               Frame pointer                                 yes
$ra       31               Return address                                yes

FIGURE 2.14 MIPS register conventions. Register 1, called $at, is reserved for the assembler (see Section 2.12), and registers 26–27, called $k0–$k1, are reserved for the operating system. This information is also found in Column 2 of the MIPS Reference Data Card at the front of this book.
Elaboration: What if there are more than four parameters? The MIPS convention is to place the extra parameters on the stack just above the frame pointer. The procedure then expects the first four parameters to be in registers $a0 through $a3 and the rest in memory, addressable via the frame pointer.
As mentioned in the caption of Figure 2.12, the frame pointer is convenient because<br />
all references to variables in the stack within a procedure will have the same offset.<br />
The frame pointer is not necessary, however. The GNU MIPS C compiler uses a frame<br />
pointer, but the C compiler from MIPS does not; it treats register 30 as another save<br />
register ($s8).<br />
Elaboration: Some recursive procedures can be implemented iteratively without using recursion. Iteration can significantly improve performance by removing the overhead associated with recursive procedure calls. For example, consider a procedure used to accumulate a sum:
int sum (int n, int acc) {<br />
if (n > 0)
return sum(n – 1, acc + n);<br />
else<br />
return acc;<br />
}<br />
Consider the procedure call sum(3,0). This will result in recursive calls to sum(2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four times.
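The elaboration's point can be sketched by pairing the accumulator version with the loop a compiler can reduce it to (the iterative rendering is ours):

```c
/* The accumulator version from the elaboration: the recursive call
   is in tail position, so nothing remains to do after it returns. */
int sum(int n, int acc) {
    if (n > 0)
        return sum(n - 1, acc + n);
    else
        return acc;
}

/* The tail call can be replaced by a jump back to the top of the
   procedure, i.e., a loop, removing all call/return overhead. */
int sum_iter(int n, int acc) {
    while (n > 0) {
        acc = acc + n;   /* acc gets acc + n */
        n = n - 1;       /* n gets n - 1     */
    }
    return acc;
}
```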
2.9 Communicating with People 107<br />
ASCII versus Binary Numbers<br />
We could represent numbers as strings of ASCII digits instead of as integers.<br />
How much does storage increase if the number 1 billion is represented in<br />
ASCII versus a 32-bit integer?<br />
EXAMPLE<br />
One billion is 1,000,000,000, so it would take 10 ASCII digits, each 8 bits long. Thus the storage expansion would be (10 × 8)/32 or 2.5. Beyond the expansion
in storage, the hardware to add, subtract, multiply, <strong>and</strong> divide such decimal<br />
numbers is difficult <strong>and</strong> would consume more energy. Such difficulties explain<br />
why computing professionals are raised to believe that binary is natural <strong>and</strong><br />
that the occasional decimal computer is bizarre.<br />
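The arithmetic can be checked in C (the helper is ours; it counts decimal digits with snprintf):

```c
#include <stdio.h>

/* Storage for a number as ASCII text versus a 32-bit integer:
   each decimal digit costs 8 bits, while the integer costs 32 bits,
   so the expansion factor is (digits * 8) / 32. */
static double ascii_expansion(long n) {
    char buf[32];
    int digits = snprintf(buf, sizeof buf, "%ld", n);  /* decimal digits */
    return (digits * 8.0) / 32.0;
}
```

For one billion, ten digits give an expansion of 80/32 = 2.5, matching the answer above.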
ANSWER<br />
A series of instructions can extract a byte from a word, so load word <strong>and</strong> store<br />
word are sufficient for transferring bytes as well as words. Because of the popularity<br />
of text in some programs, however, MIPS provides instructions to move bytes. Load<br />
byte (lb) loads a byte from memory, placing it in the rightmost 8 bits of a register.<br />
Store byte (sb) takes a byte from the rightmost 8 bits of a register <strong>and</strong> writes it to<br />
memory. Thus, we copy a byte with the sequence<br />
lb $t0,0($sp) # Read byte from source
sb $t0,0($gp) # Write byte to destination
Characters are normally combined into strings, which have a variable number<br />
of characters. There are three choices for representing a string: (1) the first position<br />
of the string is reserved to give the length of a string, (2) an accompanying variable<br />
has the length of the string (as in a structure), or (3) the last position of a string is<br />
indicated by a character used to mark the end of a string. C uses the third choice,<br />
terminating a string with a byte whose value is 0 (named null in ASCII). Thus,<br />
the string “Cal” is represented in C by the following 4 bytes, shown as decimal<br />
numbers: 67, 97, 108, 0. (As we shall see, Java uses the first option.)
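Those four bytes can be verified directly in C (the helper names are ours):

```c
#include <string.h>

/* C's choice (3): a trailing null byte marks the end of a string.
   For "Cal", strlen counts the 3 characters, while the object
   itself occupies 4 bytes: 67 ('C'), 97 ('a'), 108 ('l'), and 0. */
static const char cal[] = "Cal";

static size_t cal_chars(void) { return strlen(cal); }   /* characters */
static size_t cal_bytes(void) { return sizeof(cal); }   /* storage    */
```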
EXAMPLE<br />
Compiling a String Copy Procedure, Showing How to Use C Strings<br />
The procedure strcpy copies string y to string x using the null byte<br />
termination convention of C:<br />
void strcpy (char x[], char y[])
{
  int i;
  i = 0;
  while ((x[i] = y[i]) != '\0') /* copy & test byte */
    i += 1;
}
What is the MIPS assembly code?<br />
ANSWER<br />
Below is the basic MIPS assembly code segment. Assume that base addresses<br />
for arrays x <strong>and</strong> y are found in $a0 <strong>and</strong> $a1, while i is in $s0. strcpy<br />
adjusts the stack pointer <strong>and</strong> then saves the saved register $s0 on the stack:<br />
strcpy:<br />
addi $sp,$sp,-4 # adjust stack for 1 more item
sw $s0, 0($sp) # save $s0<br />
To initialize i to 0, the next instruction sets $s0 to 0 by adding 0 to 0 <strong>and</strong><br />
placing that sum in $s0:<br />
add $s0,$zero,$zero # i = 0 + 0<br />
This is the beginning of the loop. The address of y[i] is first formed by adding<br />
i to y[]:<br />
L1: add $t1,$s0,$a1 # address of y[i] in $t1<br />
Note that we don’t have to multiply i by 4 since y is an array of bytes <strong>and</strong> not<br />
of words, as in prior examples.<br />
To load the character in y[i], we use load byte unsigned, which puts the<br />
character into $t2:<br />
lbu $t2, 0($t1) # $t2 = y[i]
A similar address calculation puts the address of x[i] in $t3, <strong>and</strong> then the<br />
character in $t2 is stored at that address.
add $t3,$s0,$a0 # address of x[i] in $t3<br />
sb $t2, 0($t3) # x[i] = y[i]<br />
Next, we exit the loop if the character was 0. That is, we exit if it is the last<br />
character of the string:<br />
beq $t2,$zero,L2 # if y[i] == 0, go to L2
If not, we increment i <strong>and</strong> loop back:<br />
addi $s0, $s0,1 # i = i + 1<br />
j L1 # go to L1<br />
If we don’t loop back, it was the last character of the string; we restore $s0 <strong>and</strong><br />
the stack pointer, <strong>and</strong> then return.<br />
L2: lw $s0, 0($sp) # y[i] == 0: end of string.<br />
# Restore old $s0<br />
addi $sp,$sp,4 # pop 1 word off stack<br />
jr $ra # return<br />
String copies usually use pointers instead of arrays in C to avoid the operations<br />
on i in the code above. See Section 2.14 for an explanation of arrays versus<br />
pointers.<br />
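For illustration, the pointer version alluded to above might look like this (strcpy_ptr and the test helper are our names, chosen to avoid clashing with the library's strcpy):

```c
#include <string.h>

/* Pointer version of the string copy: the operations on i vanish
   because the pointers themselves advance through the bytes. */
static void strcpy_ptr(char *x, const char *y) {
    while ((*x++ = *y++) != '\0')
        ;   /* copy bytes, including the terminating null */
}

/* Copy "Cal" into a buffer and hand back the result. */
static char dest[16];
static const char *copy_cal(void) {
    strcpy_ptr(dest, "Cal");
    return dest;
}
```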
Since the procedure strcpy above is a leaf procedure, the compiler could<br />
allocate i to a temporary register <strong>and</strong> avoid saving <strong>and</strong> restoring $s0. Hence,<br />
instead of thinking of the $t registers as being just for temporaries, we can think of<br />
them as registers that the callee should use whenever convenient. When a compiler<br />
finds a leaf procedure, it exhausts all temporary registers before using registers it<br />
must save.<br />
Characters <strong>and</strong> Strings in Java<br />
Unicode is a universal encoding of the alphabets of most human languages. Figure<br />
2.16 gives a list of Unicode alphabets; there are almost as many alphabets in Unicode<br />
as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for<br />
characters. By default, it uses 16 bits to represent a character.
Latin Malayalam Tagbanwa General Punctuation<br />
Greek Sinhala Khmer Spacing Modifier Letters<br />
Cyrillic Thai Mongolian Currency Symbols<br />
Armenian Lao Limbu Combining Diacritical Marks<br />
Hebrew Tibetan Tai Le Combining Marks for Symbols<br />
Arabic Myanmar Kangxi Radicals Superscripts <strong>and</strong> Subscripts<br />
Syriac Georgian Hiragana Number Forms<br />
Thaana Hangul Jamo Katakana Mathematical Operators<br />
Devanagari Ethiopic Bopomofo Mathematical Alphanumeric Symbols<br />
Bengali Cherokee Kanbun Braille Patterns<br />
Gurmukhi Unified Canadian Aboriginal Syllabic Shavian Optical Character Recognition
Gujarati Ogham Osmanya Byzantine Musical Symbols<br />
Oriya Runic Cypriot Syllabary Musical Symbols<br />
Tamil Tagalog Tai Xuan Jing Symbols Arrows<br />
Telugu Hanunoo Yijing Hexagram Symbols Box Drawing<br />
Kannada Buhid Aegean Numbers Geometric Shapes<br />
FIGURE 2.16 Example alphabets in Unicode. Unicode version 4.0 has more than 160 "blocks," which is their name for a collection of symbols. Each block is a multiple of 16. For example, Greek starts at 0370hex, and Cyrillic at 0400hex. The first three columns show 48 blocks that correspond to human languages in roughly Unicode numerical order. The last column has 16 blocks that are multilingual and are not in order. A 16-bit encoding, called UTF-16, is the default. A variable-length encoding, called UTF-8, keeps the ASCII subset as eight bits and uses 16 or 32 bits for the other characters. UTF-32 uses 32 bits per character. To learn more, see www.unicode.org.
The MIPS instruction set has explicit instructions to load <strong>and</strong> store such 16-<br />
bit quantities, called halfwords. Load half (lh) loads a halfword from memory,<br />
placing it in the rightmost 16 bits of a register. Like load byte, load half (lh) treats<br />
the halfword as a signed number <strong>and</strong> thus sign-extends to fill the 16 leftmost bits<br />
of the register, while load halfword unsigned (lhu) works with unsigned integers.<br />
Thus, lhu is the more popular of the two. Store half (sh) takes a halfword from the<br />
rightmost 16 bits of a register <strong>and</strong> writes it to memory. We copy a halfword with<br />
the sequence<br />
lhu $t0,0($sp) # Read halfword (16 bits) from source<br />
sh $t0,0($gp) # Write halfword (16 bits) to destination<br />
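The difference between lh and lhu is only in how the upper 16 bits of the register are filled. A minimal Python sketch (the function names are ours, not MIPS) models the two loads:

```python
def lh(halfword):
    """Sign-extend a 16-bit halfword, as load half (lh) does."""
    return halfword - 0x10000 if halfword & 0x8000 else halfword

def lhu(halfword):
    """Zero-extend a 16-bit halfword, as load half unsigned (lhu) does."""
    return halfword & 0xFFFF

print(lh(0x8000))   # -32768: the sign bit is copied into the upper bits
print(lhu(0x8000))  # 32768: the upper bits are filled with 0s
```

For any halfword with a 0 sign bit, the two loads agree; they diverge only when bit 15 is set.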
Strings are a st<strong>and</strong>ard Java class with special built-in support <strong>and</strong> predefined<br />
methods for concatenation, comparison, <strong>and</strong> conversion. Unlike C, Java includes a<br />
word that gives the length of the string, similar to Java arrays.
32-Bit Immediate Oper<strong>and</strong>s<br />
Although constants are frequently short <strong>and</strong> fit into the 16-bit field, sometimes they<br />
are bigger. The MIPS instruction set includes the instruction load upper immediate<br />
(lui) specifically to set the upper 16 bits of a constant in a register, allowing a<br />
subsequent instruction to specify the lower 16 bits of the constant. Figure 2.17<br />
shows the operation of lui.<br />
EXAMPLE<br />
Loading a 32-Bit Constant<br />
What is the MIPS assembly code to load this 32-bit constant into register $s0?<br />
0000 0000 0011 1101 0000 1001 0000 0000<br />
ANSWER<br />
First, we would load the upper 16 bits, which is 61 in decimal, using lui:<br />
lui $s0, 61 # 61 decimal = 0000 0000 0011 1101 binary<br />
The value of register $s0 afterward is<br />
0000 0000 0011 1101 0000 0000 0000 0000<br />
The next step is to insert the lower 16 bits, whose decimal value is 2304:<br />
ori $s0, $s0, 2304 # 2304 decimal = 0000 1001 0000 0000<br />
The final value in register $s0 is the desired value:<br />
0000 0000 0011 1101 0000 1001 0000 0000<br />
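The lui/ori pair is just a shift and an OR. A quick Python check of the arithmetic in this example:

```python
upper = 61      # goes into lui's 16-bit immediate field
lower = 2304    # goes into ori's 16-bit immediate field

# lui places upper into the leftmost 16 bits; ori then merges in lower.
value = (upper << 16) | lower

print(hex(value))   # 0x3d0900
```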
The machine language version of lui $t0, 255 # $t0 is register 8:<br />
001111 00000 01000 0000 0000 1111 1111<br />
Contents of register $t0 after executing lui $t0, 255:<br />
0000 0000 1111 1111 0000 0000 0000 0000<br />
FIGURE 2.17 The effect of the lui instruction. The instruction lui transfers the 16-bit immediate constant field value into the<br />
leftmost 16 bits of the register, filling the lower 16 bits with 0s.
2.10 MIPS Addressing for 32-bit Immediates <strong>and</strong> Addresses<br />
Either the compiler or the assembler must break large constants into pieces <strong>and</strong><br />
then reassemble them into a register. As you might expect, the immediate field’s<br />
size restriction may be a problem for memory addresses in loads <strong>and</strong> stores as<br />
well as for constants in immediate instructions. If this job falls to the assembler,<br />
as it does for MIPS software, then the assembler must have a temporary register<br />
available in which to create the long values. This need is a reason for the register<br />
$at (assembler temporary), which is reserved for the assembler.<br />
Hence, the symbolic representation of the MIPS machine language is no longer<br />
limited by the hardware, but by whatever the creator of an assembler chooses to<br />
include (see Section 2.12). We stick close to the hardware to explain the architecture<br />
of the computer, noting when we use the enhanced language of the assembler that<br />
is not found in the processor.<br />
Hardware/Software Interface<br />
Elaboration: Creating 32-bit constants needs care. The instruction addi copies the<br />
leftmost bit of the 16-bit immediate field of the instruction into the upper 16 bits of a<br />
word. Logical or immediate from Section 2.6 loads 0s into the upper 16 bits <strong>and</strong> hence<br />
is used by the assembler in conjunction with lui to create 32-bit constants.<br />
Addressing in Branches <strong>and</strong> Jumps<br />
The MIPS jump instructions have the simplest addressing. They use the final MIPS<br />
instruction format, called the J-type, which consists of 6 bits for the operation field<br />
<strong>and</strong> the rest of the bits for the address field. Thus,<br />
j 10000 # go to location 10000<br />
could be assembled into this format (it’s actually a bit more complicated, as we will<br />
see):<br />
2 10000<br />
6 bits 26 bits<br />
where the value of the jump opcode is 2 <strong>and</strong> the jump address is 10000.<br />
Unlike the jump instruction, the conditional branch instruction must specify<br />
two oper<strong>and</strong>s in addition to the branch address. Thus,<br />
bne $s0,$s1,Exit # go to Exit if $s0 ≠ $s1<br />
is assembled into this instruction, leaving only 16 bits for the branch address:<br />
5 16 17 Exit<br />
6 bits 5 bits 5 bits 16 bits
If addresses of the program had to fit in this 16-bit field, it would mean that no<br />
program could be bigger than 2^16, which is far too small to be a realistic option<br />
today. An alternative would be to specify a register that would always be added<br />
to the branch address, so that a branch instruction would calculate the following:<br />
Program counter = Register + Branch address<br />
PC-relative addressing An addressing regime in which the address is the sum of the program counter (PC) <strong>and</strong> a constant in the instruction.<br />
This sum allows the program to be as large as 2^32 <strong>and</strong> still be able to use<br />
conditional branches, solving the branch address size problem. Then the question<br />
is, which register?<br />
The answer comes from seeing how conditional branches are used. Conditional<br />
branches are found in loops <strong>and</strong> in if statements, so they tend to branch to a<br />
nearby instruction. For example, about half of all conditional branches in SPEC<br />
benchmarks go to locations less than 16 instructions away. Since the program<br />
counter (PC) contains the address of the current instruction, we can branch within<br />
±2^15 words of the current instruction if we use the PC as the register to be added<br />
to the address. Almost all loops <strong>and</strong> if statements are much smaller than 2^16 words,<br />
so the PC is the ideal choice.<br />
This form of branch addressing is called PC-relative addressing. As we shall see<br />
in Chapter 4, it is convenient for the hardware to increment the PC early to point<br />
to the next instruction. Hence, the MIPS address is actually relative to the address<br />
of the following instruction (PC + 4) as opposed to the current instruction (PC).<br />
It is yet another example of making the common case fast, which in this case is<br />
addressing nearby instructions.<br />
Like most recent computers, MIPS uses PC-relative addressing for all conditional<br />
branches, because the destination of these instructions is likely to be close to the<br />
branch. On the other h<strong>and</strong>, jump-<strong>and</strong>-link instructions invoke procedures that<br />
have no reason to be near the call, so they normally use other forms of addressing.<br />
Hence, the MIPS architecture offers long addresses for procedure calls by using the<br />
J-type format for both jump <strong>and</strong> jump-<strong>and</strong>-link instructions.<br />
Since all MIPS instructions are 4 bytes long, MIPS stretches the distance of the<br />
branch by having PC-relative addressing refer to the number of words to the next<br />
instruction instead of the number of bytes. Thus, the 16-bit field can branch four<br />
times as far by interpreting the field as a relative word address rather than as a<br />
relative byte address. Similarly, the 26-bit field in jump instructions is also a word<br />
address, meaning that it represents a 28-bit byte address.<br />
Elaboration: Since the PC is 32 bits, 4 bits must come from somewhere else for<br />
jumps. The MIPS jump instruction replaces only the lower 28 bits of the PC, leaving<br />
the upper 4 bits of the PC unchanged. The loader <strong>and</strong> linker (Section 2.12) must be<br />
careful to avoid placing a program across an address boundary of 256 MB (64 million<br />
instructions); otherwise, a jump must be replaced by a jump register instruction preceded<br />
by other instructions to load the full 32-bit address into a register.
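The word addressing and pseudodirect rules above reduce to a few shifts and masks. A small Python sketch (the function names are ours) of how the hardware forms a branch target and a jump target:

```python
def branch_target(pc, offset16):
    """PC-relative: target = (PC + 4) + sign-extended 16-bit offset * 4."""
    if offset16 & 0x8000:               # sign-extend the 16-bit field
        offset16 -= 0x10000
    return (pc + 4) + (offset16 << 2)   # word offset scaled to bytes

def jump_target(pc, target26):
    """Pseudodirect: upper 4 bits of PC ++ 26-bit field ++ 00."""
    return (pc & 0xF0000000) | (target26 << 2)

print(hex(branch_target(0x40000000, 3)))    # 0x40000010
print(hex(jump_target(0x40000004, 10000)))  # 0x40009c40
```

Note that an offset field of 0xFFFF (that is, −1 words) branches back to the instruction itself, since the base is PC + 4.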
Hardware/Software Interface<br />
Most conditional branches are to a nearby location, but occasionally they branch<br />
far away, farther than can be represented in the 16 bits of the conditional branch<br />
instruction. The assembler comes to the rescue just as it did with large addresses<br />
or constants: it inserts an unconditional jump to the branch target, <strong>and</strong> inverts the<br />
condition so that the branch decides whether to skip the jump.<br />
EXAMPLE<br />
Branching Far Away<br />
Given a branch on register $s0 being equal to register $s1,<br />
beq $s0, $s1, L1<br />
replace it by a pair of instructions that offers a much greater branching distance.<br />
ANSWER<br />
These instructions replace the short-address conditional branch:<br />
bne $s0, $s1, L2<br />
j L1<br />
L2:<br />
addressing mode One of several addressing regimes delimited by their varied use of oper<strong>and</strong>s <strong>and</strong>/or addresses.<br />
MIPS Addressing Mode Summary<br />
Multiple forms of addressing are generically called addressing modes. Figure 2.18<br />
shows how oper<strong>and</strong>s are identified for each addressing mode. The MIPS addressing<br />
modes are the following:<br />
1. Immediate addressing, where the oper<strong>and</strong> is a constant within the instruction<br />
itself<br />
2. Register addressing, where the oper<strong>and</strong> is a register<br />
3. Base or displacement addressing, where the oper<strong>and</strong> is at the memory location<br />
whose address is the sum of a register <strong>and</strong> a constant in the instruction<br />
4. PC-relative addressing, where the branch address is the sum of the PC <strong>and</strong> a<br />
constant in the instruction<br />
5. Pseudodirect addressing, where the jump address is the 26 bits of the<br />
instruction concatenated with the upper bits of the PC
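The five modes differ only in where the operand's value or address comes from. As a rough model (the function and mode names are ours, and the PC-relative offset is assumed already sign-extended):

```python
def effective_address(mode, *, imm=0, reg=0, pc=0, field=0):
    """Where each MIPS addressing mode finds its operand or target
    (a simplified model; registers and memory are plain integers)."""
    if mode == "immediate":     # operand is the constant itself
        return imm
    if mode == "register":      # operand is the register's value
        return reg
    if mode == "base":          # memory address = register + constant
        return reg + imm
    if mode == "pc_relative":   # branch target = (PC + 4) + offset * 4
        return (pc + 4) + (imm << 2)
    if mode == "pseudodirect":  # jump target = PC upper 4 bits ++ field ++ 00
        return (pc & 0xF0000000) | (field << 2)
    raise ValueError(mode)

print(hex(effective_address("base", reg=0x10008000, imm=32)))  # 0x10008020
```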
Decoding Machine Language<br />
Sometimes you are forced to reverse-engineer machine language to create the<br />
original assembly language. One example is when looking at a “core dump.” Figure<br />
2.19 shows the MIPS encoding of the fields for the MIPS machine language. This<br />
figure helps when translating by h<strong>and</strong> between assembly language <strong>and</strong> machine<br />
language.<br />
EXAMPLE<br />
Decoding Machine Code<br />
What is the assembly language statement corresponding to this machine<br />
instruction?<br />
00af8020hex<br />
ANSWER<br />
The first step in converting hexadecimal to binary is to find the op fields:<br />
0000 0000 1010 1111 1000 0000 0010 0000<br />
We look at the op field to determine the operation. Referring to Figure 2.19,<br />
when bits 31–29 are 000 <strong>and</strong> bits 28–26 are 000, it is an R-format instruction.<br />
Let’s reformat the binary instruction into R-format fields, listed in Figure 2.20:<br />
op rs rt rd shamt funct<br />
000000 00101 01111 10000 00000 100000<br />
The bottom portion of Figure 2.19 determines the operation of an R-format<br />
instruction. In this case, bits 5–3 are 100 <strong>and</strong> bits 2–0 are 000, which means<br />
this binary pattern represents an add instruction.<br />
We decode the rest of the instruction by looking at the field values. The<br />
decimal values are 5 for the rs field, 15 for rt, <strong>and</strong> 16 for rd (shamt is unused).<br />
Figure 2.14 shows that these numbers represent registers $a1, $t7, <strong>and</strong> $s0.<br />
Now we can reveal the assembly instruction:<br />
add $s0,$a1,$t7
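The field extraction in this example is pure shifting and masking. A Python sketch (ours, not from the text) that pulls the R-format fields out of 00af8020hex:

```python
def decode_r_format(word):
    """Split a 32-bit MIPS word into R-format fields (see Figure 2.20)."""
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31-26
        "rs":    (word >> 21) & 0x1F,   # bits 25-21
        "rt":    (word >> 16) & 0x1F,   # bits 20-16
        "rd":    (word >> 11) & 0x1F,   # bits 15-11
        "shamt": (word >> 6)  & 0x1F,   # bits 10-6
        "funct": word         & 0x3F,   # bits 5-0
    }

f = decode_r_format(0x00AF8020)
print(f["rs"], f["rt"], f["rd"])  # 5 15 16, i.e. $a1, $t7, $s0
```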
Name | Fields | Comments<br />
Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits | All MIPS instructions are 32 bits long<br />
R-format | op | rs | rt | rd | shamt | funct | Arithmetic instruction format<br />
I-format | op | rs | rt | address/immediate | Transfer, branch, imm. format<br />
J-format | op | target address | Jump instruction format<br />
FIGURE 2.20 MIPS instruction formats.<br />
Figure 2.20 shows all the MIPS instruction formats. Figure 2.1 on page 64 shows<br />
the MIPS assembly language revealed in this chapter. The remaining hidden portion<br />
of MIPS instructions deals mainly with arithmetic <strong>and</strong> real numbers, which are<br />
covered in the next chapter.<br />
Check<br />
Yourself<br />
I. What is the range of addresses for conditional branches in MIPS (K = 1024)?<br />
1. Addresses between 0 <strong>and</strong> 64K − 1<br />
2. Addresses between 0 <strong>and</strong> 256K − 1<br />
3. Addresses up to about 32K before the branch to about 32K after<br />
4. Addresses up to about 128K before the branch to about 128K after<br />
II. What is the range of addresses for jump <strong>and</strong> jump <strong>and</strong> link in MIPS<br />
(M = 1024K)?<br />
1. Addresses between 0 <strong>and</strong> 64M − 1<br />
2. Addresses between 0 <strong>and</strong> 256M − 1<br />
3. Addresses up to about 32M before the branch to about 32M after<br />
4. Addresses up to about 128M before the branch to about 128M after<br />
5. Anywhere within a block of 64M addresses where the PC supplies the<br />
upper 6 bits<br />
6. Anywhere within a block of 256M addresses where the PC supplies the<br />
upper 4 bits<br />
III. What is the MIPS assembly language instruction corresponding to the<br />
machine instruction with the value 0000 0000hex?<br />
1. j<br />
2. R-format<br />
3. addi<br />
4. sll<br />
5. mfc0<br />
6. Undefined opcode: there is no legal instruction that corresponds to 0
2.11 Parallelism <strong>and</strong> Instructions: Synchronization<br />
Parallel execution is easier when tasks are independent, but often they need to<br />
cooperate. Cooperation usually means some tasks are writing new values that<br />
others must read. To know when a task is finished writing so that it is safe for<br />
another to read, the tasks need to synchronize. If they don’t synchronize, there is a<br />
danger of a data race, where the results of the program can change depending on<br />
how events happen to occur.<br />
For example, recall the analogy of the eight reporters writing a story on page 44 of<br />
Chapter 1. Suppose one reporter needs to read all the prior sections before writing<br />
a conclusion. Hence, he or she must know when the other reporters have finished<br />
their sections, so that there is no danger of sections being changed afterwards. That<br />
is, they had better synchronize the writing <strong>and</strong> reading of each section so that the<br />
conclusion will be consistent with what is printed in the prior sections.<br />
In computing, synchronization mechanisms are typically built with user-level<br />
software routines that rely on hardware-supplied synchronization instructions. In<br />
this section, we focus on the implementation of lock <strong>and</strong> unlock synchronization<br />
operations. Lock <strong>and</strong> unlock can be used straightforwardly to create regions<br />
where only a single processor can operate, called mutual exclusion, as well as to<br />
implement more complex synchronization mechanisms.<br />
The critical ability we require to implement synchronization in a multiprocessor<br />
is a set of hardware primitives with the ability to atomically read <strong>and</strong> modify a<br />
memory location. That is, nothing else can interpose itself between the read <strong>and</strong><br />
the write of the memory location. Without such a capability, the cost of building<br />
basic synchronization primitives will be high <strong>and</strong> will increase unreasonably as the<br />
processor count increases.<br />
There are a number of alternative formulations of the basic hardware primitives,<br />
all of which provide the ability to atomically read <strong>and</strong> modify a location, together<br />
with some way to tell if the read <strong>and</strong> write were performed atomically. In general,<br />
architects do not expect users to employ the basic hardware primitives, but<br />
instead expect that the primitives will be used by system programmers to build a<br />
synchronization library, a process that is often complex <strong>and</strong> tricky.<br />
Let’s start with one such hardware primitive <strong>and</strong> show how it can be used to<br />
build a basic synchronization primitive. One typical operation for building<br />
synchronization operations is the atomic exchange or atomic swap, which interchanges<br />
a value in a register for a value in memory.<br />
To see how to use this to build a basic synchronization primitive, assume that<br />
we want to build a simple lock where the value 0 is used to indicate that the lock<br />
is free <strong>and</strong> 1 is used to indicate that the lock is unavailable. A processor tries to set<br />
the lock by doing an exchange of 1, which is in a register, with the memory address<br />
corresponding to the lock. The value returned from the exchange instruction is 1<br />
if some other processor had already claimed access, <strong>and</strong> 0 otherwise. In the latter<br />
data race Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, <strong>and</strong> they occur one after another.<br />
case, the value is also changed to 1, preventing any competing exchange in another<br />
processor from also retrieving a 0.<br />
For example, consider two processors that each try to do the exchange<br />
simultaneously: this race is broken, since exactly one of the processors will perform<br />
the exchange first, returning 0, <strong>and</strong> the second processor will return 1 when it does<br />
the exchange. The key to using the exchange primitive to implement synchronization<br />
is that the operation is atomic: the exchange is indivisible, <strong>and</strong> two simultaneous<br />
exchanges will be ordered by the hardware. It is impossible for two processors<br />
trying to set the synchronization variable in this manner to both think they have<br />
simultaneously set the variable.<br />
Implementing a single atomic memory operation introduces some challenges in<br />
the design of the processor, since it requires both a memory read <strong>and</strong> a write in a<br />
single, uninterruptible instruction.<br />
An alternative is to have a pair of instructions in which the second instruction<br />
returns a value showing whether the pair of instructions was executed as if the pair<br />
were atomic. The pair of instructions is effectively atomic if it appears as if all other<br />
operations executed by any processor occurred before or after the pair. Thus, when<br />
an instruction pair is effectively atomic, no other processor can change the value<br />
between the instruction pair.<br />
In MIPS this pair of instructions includes a special load called a load linked <strong>and</strong><br />
a special store called a store conditional. These instructions are used in sequence:<br />
if the contents of the memory location specified by the load linked are changed<br />
before the store conditional to the same address occurs, then the store conditional<br />
fails. The store conditional is defined to both store the value of a (presumably<br />
different) register in memory <strong>and</strong> to change the value of that register to a 1 if it<br />
succeeds <strong>and</strong> to a 0 if it fails. Since the load linked returns the initial value, <strong>and</strong> the<br />
store conditional returns 1 only if it succeeds, the following sequence implements<br />
an atomic exchange on the memory location specified by the contents of $s1:<br />
again: addi $t0,$zero,1 ;copy locked value<br />
ll $t1,0($s1) ;load linked<br />
sc $t0,0($s1) ;store conditional<br />
beq $t0,$zero,again ;branch if store fails<br />
add $s4,$zero,$t1 ;put load value in $s4<br />
Any time a processor intervenes <strong>and</strong> modifies the value in memory between the<br />
ll <strong>and</strong> sc instructions, the sc returns 0 in $t0, causing the code sequence to try<br />
again. At the end of this sequence the contents of $s4 <strong>and</strong> the memory location<br />
specified by $s1 have been atomically exchanged.<br />
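The same retry idea gives a spin lock. The sketch below (in Python, with a deliberately toy atomic_exchange standing in for the ll/sc pair; all names are ours) shows why exactly one contender obtains the lock:

```python
memory = {"lock": 0}   # 0 = lock free, 1 = lock held

def atomic_exchange(addr, new):
    """Toy stand-in for the ll/sc sequence: swap new with memory[addr]
    as one indivisible step (real hardware guarantees the atomicity)."""
    old = memory[addr]
    memory[addr] = new
    return old

def acquire(addr):
    while atomic_exchange(addr, 1) == 1:   # spin until we swap in a 1 for a 0
        pass

def release(addr):
    memory[addr] = 0                       # an ordinary store frees the lock

acquire("lock")
print(memory["lock"])   # 1: the lock is now held
release("lock")
```

Because the exchange both reads and writes in one step, a second contender arriving between a winner's read and write is impossible; it simply sees a 1 and keeps spinning.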
Elaboration: Although it was presented for multiprocessor synchronization, atomic<br />
exchange is also useful for the operating system in dealing with multiple processes<br />
in a single processor. To make sure nothing interferes in a single processor, the store<br />
conditional also fails if the processor does a context switch between the two instructions<br />
(see Chapter 5).
C program → Compiler → Assembly language program → Assembler → Object: Machine language module (plus Object: Library routine, machine language) → Linker → Executable: Machine language program → Loader → Memory<br />
FIGURE 2.21 A translation hierarchy for C. A high-level language program is first compiled into<br />
an assembly language program <strong>and</strong> then assembled into an object module in machine language. The linker<br />
combines multiple modules with library routines to resolve all references. The loader then places the machine<br />
code into the proper memory locations for execution by the processor. To speed up the translation process,<br />
some steps are skipped or combined. Some compilers produce object modules directly, <strong>and</strong> some systems use<br />
linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention<br />
for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked<br />
library routines are x.a, dynamically linked library routines are x.so, <strong>and</strong> executable files by default are<br />
called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, <strong>and</strong> .EXE to the same effect.<br />
pseudoinstruction A common variation of assembly language instructions often treated as if it were an instruction in its own right.<br />
Assembler<br />
Since assembly language is an interface to higher-level software, the assembler<br />
can also treat common variations of machine language instructions as if they<br />
were instructions in their own right. The hardware need not implement these<br />
instructions; however, their appearance in assembly language simplifies translation<br />
<strong>and</strong> programming. Such instructions are called pseudoinstructions.<br />
As mentioned above, the MIPS hardware makes sure that register $zero always<br />
has the value 0. That is, whenever register $zero is used, it supplies a 0, <strong>and</strong> the<br />
programmer cannot change the value of register $zero. Register $zero is used<br />
to create the assembly language instruction that copies the contents of one register<br />
to another. Thus the MIPS assembler accepts this instruction even though it is not<br />
found in the MIPS architecture:<br />
move $t0,$t1 # register $t0 gets register $t1
2.12 Translating <strong>and</strong> Starting a Program<br />
The assembler converts this assembly language instruction into the machine<br />
language equivalent of the following instruction:<br />
add $t0,$zero,$t1 # register $t0 gets 0 + register $t1<br />
The MIPS assembler also converts blt (branch on less than) into the two<br />
instructions slt <strong>and</strong> bne mentioned in the example on page 95. Other examples<br />
include bgt, bge, <strong>and</strong> ble. It also converts branches to faraway locations into a<br />
branch <strong>and</strong> jump. As mentioned above, the MIPS assembler allows 32-bit constants<br />
to be loaded into a register despite the 16-bit limit of the immediate instructions.<br />
In summary, pseudoinstructions give MIPS a richer set of assembly language<br />
instructions than those implemented by the hardware. The only cost is reserving<br />
one register, $at, for use by the assembler. If you are going to write assembly<br />
programs, use pseudoinstructions to simplify your task. To underst<strong>and</strong> the MIPS<br />
architecture <strong>and</strong> get the best performance, however, study the real MIPS<br />
instructions found in Figures 2.1 <strong>and</strong> 2.19.<br />
Assemblers will also accept numbers in a variety of bases. In addition to binary<br />
<strong>and</strong> decimal, they usually accept a base that is more succinct than binary yet<br />
converts easily to a bit pattern. MIPS assemblers use hexadecimal.<br />
Such features are convenient, but the primary task of an assembler is assembly<br />
into machine code. The assembler turns the assembly language program into an<br />
object file, which is a combination of machine language instructions, data, <strong>and</strong><br />
information needed to place instructions properly in memory.<br />
To produce the binary version of each instruction in the assembly language<br />
program, the assembler must determine the addresses corresponding to all labels.<br />
Assemblers keep track of labels used in branches <strong>and</strong> data transfer instructions<br />
in a symbol table. As you might expect, the table contains pairs of symbols <strong>and</strong><br />
addresses.<br />
The object file for UNIX systems typically contains six distinct pieces:<br />
■ The object file header describes the size <strong>and</strong> position of the other pieces of the<br />
object file.<br />
■ The text segment contains the machine language code.<br />
■ The static data segment contains data allocated for the life of the program.<br />
(UNIX allows programs to use both static data, which is allocated throughout<br />
the program, <strong>and</strong> dynamic data, which can grow or shrink as needed by the<br />
program. See Figure 2.13.)<br />
■ The relocation information identifies instructions <strong>and</strong> data words that depend<br />
on absolute addresses when the program is loaded into memory.<br />
■ The symbol table contains the remaining labels that are not defined, such as<br />
external references.<br />
symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy.<br />
■ The debugging information contains a concise description of how the modules<br />
were compiled so that a debugger can associate machine instructions with C<br />
source files <strong>and</strong> make data structures readable.<br />
The next subsection shows how to attach such routines that have already been<br />
assembled, such as library routines.<br />
linker Also called link editor. A systems program that combines independently assembled machine language programs <strong>and</strong> resolves all undefined labels into an executable file.<br />
executable file A functional program in the format of an object file that contains no unresolved references. It can contain symbol tables <strong>and</strong> debugging information. A “stripped executable” does not contain that information. Relocation information may be included for the loader.<br />
Linker<br />
What we have presented so far suggests that a single change to one line of one<br />
procedure requires compiling <strong>and</strong> assembling the whole program. Complete<br />
retranslation is a terrible waste of computing resources. This repetition is<br />
particularly wasteful for st<strong>and</strong>ard library routines, because programmers would<br />
be compiling <strong>and</strong> assembling routines that by definition almost never change. An<br />
alternative is to compile <strong>and</strong> assemble each procedure independently, so that a<br />
change to one line would require compiling <strong>and</strong> assembling only one procedure.<br />
This alternative requires a new systems program, called a link editor or linker,<br />
which takes all the independently assembled machine language programs <strong>and</strong><br />
“stitches” them together.<br />
There are three steps for the linker:<br />
1. Place code <strong>and</strong> data modules symbolically in memory.<br />
2. Determine the addresses of data <strong>and</strong> instruction labels.<br />
3. Patch both the internal <strong>and</strong> external references.<br />
The linker uses the relocation information <strong>and</strong> symbol table in each object<br />
module to resolve all undefined labels. Such references occur in branch instructions,<br />
jump instructions, <strong>and</strong> data addresses, so the job of this program is much like that<br />
of an editor: it finds the old addresses <strong>and</strong> replaces them with the new addresses.<br />
Editing is the origin of the name “link editor,” or linker for short. The reason a<br />
linker is useful is that it is much faster to patch code than it is to recompile <strong>and</strong><br />
reassemble.<br />
If all external references are resolved, the linker next determines the memory<br />
locations each module will occupy. Recall that Figure 2.13 on page 104 shows<br />
the MIPS convention for allocation of program <strong>and</strong> data to memory. Since the<br />
files were assembled in isolation, the assembler could not know where a module’s<br />
instructions <strong>and</strong> data would be placed relative to other modules. When the linker<br />
places a module in memory, all absolute references, that is, memory addresses that<br />
are not relative to a register, must be relocated to reflect its true location.<br />
The linker produces an executable file that can be run on a computer. Typically,<br />
this file has the same format as an object file, except that it contains no unresolved<br />
references. It is possible to have partially linked files, such as library routines, that<br />
still have unresolved addresses <strong>and</strong> hence result in object files.
2.12 Translating <strong>and</strong> Starting a Program 127<br />
Linking Object Files<br />
Link the two object files below. Show updated addresses of the first few<br />
instructions of the completed executable file. We show the instructions in<br />
assembly language just to make the example underst<strong>and</strong>able; in reality, the<br />
instructions would be numbers.<br />
Note that in the object files we have highlighted the addresses <strong>and</strong> symbols<br />
that must be updated in the link process: the instructions that refer to the<br />
addresses of procedures A <strong>and</strong> B <strong>and</strong> the instructions that refer to the addresses<br />
of data words X <strong>and</strong> Y.<br />
EXAMPLE<br />
Object file header (procedure A):<br />
Name: Procedure A     Text size: 100hex     Data size: 20hex<br />
Text segment:          Address 0: lw $a0, 0($gp)<br />
                       Address 4: jal 0<br />
                       …<br />
Data segment:          Address 0: (X)<br />
                       …<br />
Relocation information: Address 0, instruction type lw, dependency X<br />
                        Address 4, instruction type jal, dependency B<br />
Symbol table:          Label X, address –<br />
                       Label B, address –<br />
Object file header (procedure B):<br />
Name: Procedure B     Text size: 200hex     Data size: 30hex<br />
Text segment:          Address 0: sw $a1, 0($gp)<br />
                       Address 4: jal 0<br />
                       …<br />
Data segment:          Address 0: (Y)<br />
                       …<br />
Relocation information: Address 0, instruction type sw, dependency Y<br />
                        Address 4, instruction type jal, dependency A<br />
Symbol table:          Label Y, address –<br />
                       Label A, address –<br />
128 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />
ANSWER<br />
Procedure A needs to find the address for the variable labeled X to put in the<br />
load instruction <strong>and</strong> to find the address of procedure B to place in the jal<br />
instruction. Procedure B needs the address of the variable labeled Y for the<br />
store instruction <strong>and</strong> the address of procedure A for its jal instruction.<br />
From Figure 2.13 on page 104, we know that the text segment starts at address<br />
40 0000hex and the data segment at 1000 0000hex. The text of procedure A is placed<br />
at the first address and its data at the second. The object file header for procedure A<br />
says that its text is 100hex bytes and its data is 20hex bytes, so the starting address<br />
for procedure B text is 40 0100hex, and its data starts at 1000 0020hex.<br />
Executable file header:<br />
Text size: 300hex     Data size: 50hex<br />
Text segment:     0040 0000hex: lw $a0, 8000hex($gp)<br />
                  0040 0004hex: jal 40 0100hex<br />
                  …<br />
                  0040 0100hex: sw $a1, 8020hex($gp)<br />
                  0040 0104hex: jal 40 0000hex<br />
                  …<br />
Data segment:     1000 0000hex: (X)<br />
                  …<br />
                  1000 0020hex: (Y)<br />
                  …<br />
Now the linker updates the address fields of the instructions. It uses the<br />
instruction type field to know the format of the address to be edited. We have<br />
two types here:<br />
1. The jals are easy because they use pseudodirect addressing. The jal at<br />
address 40 0004hex gets 40 0100hex (the address of procedure B) in its address<br />
field, and the jal at 40 0104hex gets 40 0000hex (the address of procedure A)<br />
in its address field.<br />
2. The load and store addresses are harder because they are relative to a base<br />
register. This example uses the global pointer as the base register. Figure 2.13<br />
shows that $gp is initialized to 1000 8000hex. To get the address 1000 0000hex<br />
(the address of word X), we place 8000hex in the address field of lw at address<br />
40 0000hex. Similarly, we place 8020hex in the address field of sw at address<br />
40 0100hex to get the address 1000 0020hex (the address of word Y).<br />
Elaboration: Recall that MIPS instructions are word aligned, so jal drops the right<br />
two bits to increase the instruction’s address range. Thus, it uses 26 bits to create a<br />
28-bit byte address. Hence, the actual address in the lower 26 bits of the jal instruction<br />
in this example is 10 0040hex, rather than 40 0100hex.<br />
Loader<br />
Now that the executable file is on disk, the operating system reads it to memory <strong>and</strong><br />
starts it. The loader follows these steps in UNIX systems:<br />
1. Reads the executable file header to determine size of the text <strong>and</strong> data<br />
segments.<br />
2. Creates an address space large enough for the text <strong>and</strong> data.<br />
3. Copies the instructions <strong>and</strong> data from the executable file into memory.<br />
4. Copies the parameters (if any) to the main program onto the stack.<br />
5. Initializes the machine registers <strong>and</strong> sets the stack pointer to the first free<br />
location.<br />
6. Jumps to a start-up routine that copies the parameters into the argument<br />
registers <strong>and</strong> calls the main routine of the program. When the main routine<br />
returns, the start-up routine terminates the program with an exit system<br />
call.<br />
Sections A.3 <strong>and</strong> A.4 in Appendix A describe linkers <strong>and</strong> loaders in more detail.<br />
Dynamically Linked Libraries<br />
The first part of this section describes the traditional approach to linking libraries<br />
before the program is run. Although this static approach is the fastest way to call<br />
library routines, it has a few disadvantages:<br />
■ The library routines become part of the executable code. If a new version of<br />
the library is released that fixes bugs or supports new hardware devices, the<br />
statically linked program keeps using the old version.<br />
■ It loads all routines in the library that are called anywhere in the executable,<br />
even if those calls are not executed. The library can be large relative to the<br />
program; for example, the st<strong>and</strong>ard C library is 2.5 MB.<br />
These disadvantages lead to dynamically linked libraries (DLLs), where the<br />
library routines are not linked <strong>and</strong> loaded until the program is run. Both the<br />
program <strong>and</strong> library routines keep extra information on the location of nonlocal<br />
procedures <strong>and</strong> their names. In the initial version of DLLs, the loader ran a dynamic<br />
linker, using the extra information in the file to find the appropriate libraries <strong>and</strong> to<br />
update all external references.<br />
loader: A systems program that places an object program in main memory so that it is ready to execute.<br />
“Virtually every problem in computer science can be solved by another level of indirection.” (David Wheeler)<br />
dynamically linked libraries (DLLs): Library routines that are linked to a program during execution.<br />
The downside of the initial version of DLLs was that it still linked all routines<br />
of the library that might be called, versus only those that are called during the<br />
running of the program. This observation led to the lazy procedure linkage version<br />
of DLLs, where each routine is linked only after it is called.<br />
Like many innovations in our field, this trick relies on a level of indirection.<br />
Figure 2.22 shows the technique. It starts with the nonlocal routines calling a set of<br />
dummy routines at the end of the program, with one entry per nonlocal routine.<br />
These dummy entries each contain an indirect jump.<br />
The first time the library routine is called, the program calls the dummy entry<br />
and follows the indirect jump. It points to code that puts a number in a register to<br />
identify the desired library routine and then jumps to the dynamic linker/loader.<br />
The linker/loader finds the desired routine, remaps it, and changes the address in<br />
the indirect jump location to point to that routine, so that subsequent calls jump<br />
indirectly to the routine without these extra steps.<br />
FIGURE 2.22 Dynamically linked library via lazy procedure linkage. (a) Steps for the first time<br />
a call is made to the DLL routine. (b) The steps to find the routine, remap it, <strong>and</strong> link it are skipped on<br />
subsequent calls. As we will see in Chapter 5, the operating system may avoid copying the desired routine by<br />
remapping it using virtual memory management.
2.13 A C Sort Example to Put It All Together 133<br />
void swap(int v[], int k)<br />
{<br />
int temp;<br />
temp = v[k];<br />
v[k] = v[k+1];<br />
v[k+1] = temp;<br />
}<br />
FIGURE 2.24 A C procedure that swaps two locations in memory. This subsection uses this<br />
procedure in a sorting example.<br />
The Procedure swap<br />
Let’s start with the code for the procedure swap in Figure 2.24. This procedure<br />
simply swaps two locations in memory. When translating from C to assembly<br />
language by h<strong>and</strong>, we follow these general steps:<br />
1. Allocate registers to program variables.<br />
2. Produce code for the body of the procedure.<br />
3. Preserve registers across the procedure invocation.<br />
This section describes the swap procedure in these three pieces, concluding by<br />
putting all the pieces together.<br />
Register Allocation for swap<br />
As mentioned on pages 98–99, the MIPS convention on parameter passing is to<br />
use registers $a0, $a1, $a2, <strong>and</strong> $a3. Since swap has just two parameters, v <strong>and</strong><br />
k, they will be found in registers $a0 <strong>and</strong> $a1. The only other variable is temp,<br />
which we associate with register $t0 since swap is a leaf procedure (see page 100).<br />
This register allocation corresponds to the variable declarations in the first part of<br />
the swap procedure in Figure 2.24.<br />
Code for the Body of the Procedure swap<br />
The remaining lines of C code in swap are<br />
temp = v[k];<br />
v[k] = v[k+1];<br />
v[k+1] = temp;<br />
Recall that the memory address for MIPS refers to the byte address, <strong>and</strong> so<br />
words are really 4 bytes apart. Hence we need to multiply the index k by 4 before<br />
adding it to the address. Forgetting that sequential word addresses differ by 4 instead<br />
of 1 is a common mistake in assembly language programming.<br />
The Procedure sort<br />
To ensure that you appreciate the rigor of programming in assembly language, we’ll<br />
try a second, longer example. In this case, we’ll build a routine that calls the swap<br />
procedure. This program sorts an array of integers, using bubble or exchange sort,<br />
which is one of the simplest sorts, although not the fastest. Figure 2.26 shows the C version<br />
of the program. Once again, we present this procedure in several steps, concluding<br />
with the full procedure.<br />
void sort (int v[], int n)<br />
{<br />
int i, j;<br />
for (i = 0; i < n; i += 1) {<br />
for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {<br />
swap(v,j);<br />
}<br />
}<br />
}<br />
FIGURE 2.26 A C procedure that performs a sort on the array v.<br />
Register Allocation for sort<br />
The two parameters of the procedure sort, v <strong>and</strong> n, are in the parameter registers<br />
$a0 <strong>and</strong> $a1, <strong>and</strong> we assign register $s0 to i <strong>and</strong> register $s1 to j.<br />
Code for the Body of the Procedure sort<br />
The procedure body consists of two nested for loops <strong>and</strong> a call to swap that includes<br />
parameters. Let’s unwrap the code from the outside to the middle.<br />
The first translation step is the first for loop:<br />
for (i = 0; i < n; i += 1) {<br />
The loop should be exited if i < n is not true or, said another way, should be<br />
exited if i ≥ n. The set on less than instruction sets register $t0 to 1 if $s0 <<br />
$a1 <strong>and</strong> to 0 otherwise. Since we want to test if $s0 ≥ $a1, we branch if register<br />
$t0 is 0. This test takes two instructions:<br />
for1tst:slt $t0, $s0, $a1 # reg $t0 = 0 if $s0 ≥ $a1 (i≥n)<br />
beq $t0, $zero,exit1 # go to exit1 if $s0 ≥ $a1 (i≥n)<br />
The bottom of the loop just jumps back to the loop test:<br />
j for1tst # jump to test of outer loop<br />
exit1:<br />
The skeleton code of the first for loop is then<br />
move $s0, $zero # i = 0<br />
for1tst:slt $t0, $s0, $a1 # reg $t0 = 0 if $s0 ≥ $a1 (i≥n)<br />
beq $t0, $zero,exit1 # go to exit1 if $s0 ≥ $a1 (i≥n)<br />
. . .<br />
(body of first for loop)<br />
. . .<br />
addi $s0, $s0, 1 # i += 1<br />
j for1tst # jump to test of outer loop<br />
exit1:<br />
Voila! (The exercises explore writing faster code for similar loops.)<br />
The second for loop looks like this in C:<br />
for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {<br />
The initialization portion of this loop is again one instruction:<br />
addi $s1, $s0, -1 # j = i - 1<br />
The decrement of j at the end of the loop is also one instruction:<br />
addi $s1, $s1, -1 # j -= 1<br />
The loop test has two parts. We exit the loop if either condition fails, so the first<br />
test must exit the loop if it fails (j < 0):<br />
for2tst: slti $t0, $s1, 0 # reg $t0 = 1 if $s1 < 0 (j < 0)<br />
bne $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)<br />
This branch will skip over the second condition test. If it doesn’t skip, j ≥ 0.
The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤<br />
v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte<br />
address) <strong>and</strong> add it to the base address of v:<br />
sll $t1, $s1, 2 # reg $t1 = j * 4<br />
add $t2, $a0, $t1 # reg $t2 = v + (j * 4)<br />
Now we load v[j]:<br />
lw $t3, 0($t2) # reg $t3 = v[j]<br />
Since we know that the second element is just the following word, we add 4 to<br />
the address in register $t2 to get v[j + 1]:<br />
lw $t4, 4($t2) # reg $t4 = v[j + 1]<br />
The test of v[j] ≤ v[j + 1] is the same as v[j + 1] ≥ v[j], so the<br />
two instructions of the exit test are<br />
slt $t0, $t4, $t3 # reg $t0 = 0 if $t4 ≥ $t3<br />
beq $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3<br />
The bottom of the loop jumps back to the inner loop test:<br />
j for2tst # jump to test of inner loop<br />
Combining the pieces, the skeleton of the second for loop looks like this:<br />
addi $s1, $s0, -1 # j = i - 1<br />
for2tst:slti $t0, $s1, 0 # reg $t0 = 1 if $s1 < 0 (j < 0)<br />
bne $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)<br />
sll $t1, $s1, 2 # reg $t1 = j * 4<br />
add $t2, $a0, $t1 # reg $t2 = v + (j * 4)<br />
lw $t3, 0($t2) # reg $t3 = v[j]<br />
lw $t4, 4($t2) # reg $t4 = v[j + 1]<br />
slt $t0, $t4, $t3 # reg $t0 = 0 if $t4 ≥ $t3<br />
beq $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3<br />
. . .<br />
(body of second for loop)<br />
. . .<br />
addi $s1, $s1, -1 # j -= 1<br />
j for2tst # jump to test of inner loop<br />
exit2:<br />
The Procedure Call in sort<br />
The next step is the body of the second for loop:<br />
swap(v,j);<br />
Calling swap is easy enough:<br />
jal swap<br />
Passing Parameters in sort<br />
The problem comes when we want to pass parameters because the sort procedure<br />
needs the values in registers $a0 <strong>and</strong> $a1, yet the swap procedure needs to have its<br />
parameters placed in those same registers. One solution is to copy the parameters<br />
for sort into other registers earlier in the procedure, making registers $a0 <strong>and</strong><br />
$a1 available for the call of swap. (This copy is faster than saving <strong>and</strong> restoring on<br />
the stack.) We first copy $a0 <strong>and</strong> $a1 into $s2 <strong>and</strong> $s3 during the procedure:<br />
move $s2, $a0 # copy parameter $a0 into $s2<br />
move $s3, $a1 # copy parameter $a1 into $s3<br />
Then we pass the parameters to swap with these two instructions:<br />
move $a0, $s2 # first swap parameter is v<br />
move $a1, $s1 # second swap parameter is j<br />
Preserving Registers in sort<br />
The only remaining code is the saving <strong>and</strong> restoring of registers. Clearly, we must<br />
save the return address in register $ra, since sort is a procedure that is itself<br />
called. The sort procedure also uses the saved registers $s0, $s1, $s2, and $s3,<br />
so they must be saved. The prologue of the sort procedure is then<br />
addi $sp,$sp,-20 # make room on stack for 5 registers<br />
sw $ra,16($sp) # save $ra on stack<br />
sw $s3,12($sp) # save $s3 on stack<br />
sw $s2, 8($sp) # save $s2 on stack<br />
sw $s1, 4($sp) # save $s1 on stack<br />
sw $s0, 0($sp) # save $s0 on stack<br />
The tail of the procedure simply reverses all these instructions, then adds a jr to<br />
return.<br />
The Full Procedure sort<br />
Now we put all the pieces together in Figure 2.27, being careful to replace references<br />
to registers $a0 <strong>and</strong> $a1 in the for loops with references to registers $s2 <strong>and</strong> $s3.<br />
Once again, to make the code easier to follow, we identify each block of code with<br />
its purpose in the procedure. In this example, nine lines of the sort procedure in<br />
C became 35 lines in the MIPS assembly language.<br />
Elaboration: One optimization that works with this example is procedure inlining.<br />
Instead of passing arguments in parameters <strong>and</strong> invoking the code with a jal instruction,<br />
the compiler would copy the code from the body of the swap procedure where the call<br />
to swap appears in the code. Inlining would avoid four instructions in this example. The<br />
downside of the inlining optimization is that the compiled code would be bigger if the<br />
inlined procedure is called from several locations. Such a code expansion might turn<br />
into lower performance if it increased the cache miss rate; see Chapter 5.
clear1(int array[], int size)<br />
{<br />
int i;<br />
for (i = 0; i < size; i += 1)<br />
array[i] = 0;<br />
}<br />
clear2(int *array, int size)<br />
{<br />
int *p;<br />
for (p = &array[0]; p < &array[size]; p = p + 1)<br />
*p = 0;<br />
}<br />
FIGURE 2.30 Two C procedures for setting an array to all zeros. Clear1 uses indices,<br />
while clear2 uses pointers. The second procedure needs some explanation for those unfamiliar with C.<br />
The address of a variable is indicated by &, <strong>and</strong> the object pointed to by a pointer is indicated by *. The<br />
declarations declare that array <strong>and</strong> p are pointers to integers. The first part of the for loop in clear2<br />
assigns the address of the first element of array to the pointer p. The second part of the for loop tests to see<br />
if the pointer is pointing beyond the last element of array. Incrementing a pointer by one, in the last part of<br />
the for loop, means moving the pointer to the next sequential object of its declared size. Since p is a pointer to<br />
integers, the compiler will generate MIPS instructions to increment p by four, the number of bytes in a MIPS<br />
integer. The assignment in the loop places 0 in the object pointed to by p.<br />
Finally, we can store 0 in that address:<br />
sw $zero, 0($t2) # array[i] = 0<br />
This instruction is the end of the body of the loop, so the next step is to increment i:<br />
addi $t0,$t0,1 # i = i + 1<br />
The loop test checks if i is less than size:<br />
slt $t3,$t0,$a1 # $t3 = (i < size)<br />
bne $t3,$zero,loop1 # if (i < size) go to loop1<br />
We have now seen all the pieces of the procedure. Here is the MIPS code for<br />
clearing an array using indices:<br />
move $t0,$zero # i = 0<br />
loop1: sll $t1,$t0,2 # $t1 = i * 4<br />
add $t2,$a0,$t1 # $t2 = address of array[i]<br />
sw $zero, 0($t2) # array[i] = 0<br />
addi $t0,$t0,1 # i = i + 1<br />
slt $t3,$t0,$a1 # $t3 = (i < size)<br />
bne $t3,$zero,loop1 # if (i < size) go to loop1<br />
(This code works as long as size is greater than 0; ANSI C requires a test of size<br />
before the loop, but we’ll skip that legality here.)
2.14 Arrays versus Pointers 143<br />
Pointer Version of Clear<br />
The second procedure that uses pointers allocates the two parameters array <strong>and</strong><br />
size to the registers $a0 <strong>and</strong> $a1 <strong>and</strong> allocates p to register $t0. The code for<br />
the second procedure starts with assigning the pointer p to the address of the first<br />
element of the array:<br />
move $t0,$a0 # p = address of array[0]<br />
The next code is the body of the for loop, which simply stores 0 into p:<br />
loop2: sw $zero,0($t0) # Memory[p] = 0<br />
This instruction implements the body of the loop, so the next code is the iteration<br />
increment, which changes p to point to the next word:<br />
addi $t0,$t0,4 # p = p + 4<br />
Incrementing a pointer by 1 means moving the pointer to the next sequential<br />
object in C. Since p is a pointer to integers, each of which uses 4 bytes, the compiler<br />
increments p by 4.<br />
The loop test is next. The first step is calculating the address of the last element<br />
of array. Start with multiplying size by 4 to get its byte address:<br />
sll $t1,$a1,2 # $t1 = size * 4<br />
<strong>and</strong> then we add the product to the starting address of the array to get the address<br />
of the first word after the array:<br />
add $t2,$a0,$t1 # $t2 = address of array[size]<br />
The loop test is simply to see if p is less than the last element of array:<br />
slt $t3,$t0,$t2 # $t3 = (p < &array[size])<br />
bne $t3,$zero,loop2 # if (p < &array[size]) go to loop2<br />
2.16 Real Stuff: ARMv7 (32-bit) Instructions 147<br />
by any amount, add it to the other registers to form the address, <strong>and</strong> then update<br />
one register with this new address.<br />
Addressing mode                             ARM   MIPS<br />
Register operand                             X     X<br />
Immediate operand                            X     X<br />
Register + offset (displacement or based)    X     X<br />
Register + register (indexed)                X     —<br />
Register + scaled register (scaled)          X     —<br />
Register + offset and update register        X     —<br />
Register + register and update register      X     —<br />
Autoincrement, autodecrement                 X     —<br />
PC-relative data                             X     —<br />
FIGURE 2.33 Summary of data addressing modes. ARM has separate register indirect <strong>and</strong> register<br />
offset addressing modes, rather than just putting 0 in the offset of the latter mode. To get greater addressing<br />
range, ARM shifts the offset left 1 or 2 bits if the data size is halfword or word.<br />
Compare <strong>and</strong> Conditional Branch<br />
MIPS uses the contents of registers to evaluate conditional branches. ARM uses the<br />
traditional four condition code bits stored in the program status word: negative,<br />
zero, carry, <strong>and</strong> overflow. They can be set on any arithmetic or logical instruction;<br />
unlike earlier architectures, this setting is optional on each instruction. An<br />
explicit option leads to fewer problems in a pipelined implementation. ARM uses<br />
conditional branches to test condition codes to determine all possible unsigned<br />
<strong>and</strong> signed relations.<br />
CMP subtracts one oper<strong>and</strong> from the other <strong>and</strong> the difference sets the condition<br />
codes. Compare negative (CMN) adds one oper<strong>and</strong> to the other, <strong>and</strong> the sum sets<br />
the condition codes. TST performs logical AND on the two oper<strong>and</strong>s to set all<br />
condition codes but overflow, while TEQ uses exclusive OR to set the first three<br />
condition codes.<br />
One unusual feature of ARM is that every instruction has the option of executing<br />
conditionally, depending on the condition codes. Every instruction starts with a<br />
4-bit field that determines whether it will act as a no operation instruction (nop)<br />
or as a real instruction, depending on the condition codes. Hence, conditional<br />
branches are properly considered as conditionally executing the unconditional<br />
branch instruction. Conditional execution allows avoiding a branch to jump over a<br />
single instruction. It takes less code space <strong>and</strong> time to simply conditionally execute<br />
one instruction.<br />
Figure 2.34 shows the instruction formats for ARM <strong>and</strong> MIPS. The principal<br />
differences are the 4-bit conditional execution field in every instruction <strong>and</strong> the<br />
smaller register field, because ARM has half the number of registers.
Register-register format:<br />
  ARM:  bits 31-28 Opx(4), 27-20 Op(8), 19-16 Rs1(4), 15-12 Rd(4), 11-4 Opx(8), 3-0 Rs2(4)<br />
  MIPS: bits 31-26 Op(6), 25-21 Rs1(5), 20-16 Rs2(5), 15-11 Rd(5), 10-6 Const(5), 5-0 Opx(6)<br />
Data transfer format:<br />
  ARM:  bits 31-28 Opx(4), 27-20 Op(8), 19-16 Rs1(4), 15-12 Rd(4), 11-0 Const(12)<br />
  MIPS: bits 31-26 Op(6), 25-21 Rs1(5), 20-16 Rd(5), 15-0 Const(16)<br />
Branch format:<br />
  ARM:  bits 31-28 Opx(4), 27-24 Op(4), 23-0 Const(24)<br />
  MIPS: bits 31-26 Op(6), 25-21 Rs1(5), 20-16 Opx(5)/Rs2(5), 15-0 Const(16)<br />
Jump/Call format:<br />
  ARM:  bits 31-28 Opx(4), 27-24 Op(4), 23-0 Const(24)<br />
  MIPS: bits 31-26 Op(6), 25-0 Const(26)<br />
FIGURE 2.34 Instruction formats, ARM <strong>and</strong> MIPS. The differences result from whether the<br />
architecture has 16 or 32 registers.<br />
Unique Features of ARM<br />
Figure 2.35 shows a few arithmetic-logical instructions not found in MIPS. Since<br />
ARM does not have a dedicated register for 0, it has separate opcodes to perform<br />
some operations that MIPS can do with $zero. In addition, ARM has support for<br />
multiword arithmetic.<br />
ARM’s 12-bit immediate field has a novel interpretation. The eight least-significant<br />
bits are zero-extended to a 32-bit value, then rotated right the number<br />
of bits specified in the first four bits of the field multiplied by two. One advantage is<br />
that this scheme can represent all powers of two in a 32-bit word. Whether this split<br />
actually catches more immediates than a simple 12-bit field would be an interesting<br />
study.<br />
Oper<strong>and</strong> shifting is not limited to immediates. The second register of all<br />
arithmetic <strong>and</strong> logical processing operations has the option of being shifted before<br />
being operated on. The shift options are shift left logical, shift right logical, shift<br />
right arithmetic, <strong>and</strong> rotate right.
2.17 Real Stuff: x86 Instructions 151<br />
in parallel. Not only does this change enable more multimedia operations;<br />
it gives the compiler a different target for floating-point operations than<br />
the unique stack architecture. Compilers can choose to use the eight SSE<br />
registers as floating-point registers like those found in other computers. This<br />
change boosted the floating-point performance of the Pentium 4, the first<br />
microprocessor to include SSE2 instructions.<br />
■ 2003: A company other than Intel enhanced the x86 architecture this time.<br />
AMD announced a set of architectural extensions to increase the address<br />
space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address<br />
space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also<br />
increases the number of registers to 16 <strong>and</strong> increases the number of 128-<br />
bit SSE registers to 16. The primary ISA change comes from adding a new<br />
mode called long mode that redefines the execution of all x86 instructions<br />
with 64-bit addresses <strong>and</strong> data. To address the larger number of registers, it<br />
adds a new prefix to instructions. Depending how you count, long mode also<br />
adds four to ten new instructions <strong>and</strong> drops 27 old ones. PC-relative data<br />
addressing is another extension. AMD64 still has a mode that is identical<br />
to x86 (legacy mode) plus a mode that restricts user programs to x86 but<br />
allows operating systems to use AMD64 (compatibility mode). These modes<br />
allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64<br />
architecture.<br />
■ 2004: Intel capitulates <strong>and</strong> embraces AMD64, relabeling it Extended Memory<br />
64 Technology (EM64T). The major difference is that Intel added a 128-bit<br />
atomic compare <strong>and</strong> swap instruction, which probably should have been<br />
included in AMD64. At the same time, Intel announced another generation of<br />
media extensions. SSE3 adds 13 instructions to support complex arithmetic,<br />
graphics operations on arrays of structures, video encoding, floating-point<br />
conversion, <strong>and</strong> thread synchronization (see Section 2.11). AMD added SSE3<br />
in subsequent chips <strong>and</strong> the missing atomic swap instruction to AMD64 to<br />
maintain binary compatibility with Intel.<br />
■ 2006: Intel announces 54 new instructions as part of the SSE4 instruction set<br />
extensions. These extensions perform tweaks like sum of absolute differences,<br />
dot products for arrays of structures, sign or zero extension of narrow data to<br />
wider sizes, population count, <strong>and</strong> so on. They also added support for virtual<br />
machines (see Chapter 5).<br />
■ 2007: AMD announces 170 instructions as part of SSE5, including 46<br />
instructions of the base instruction set that adds three oper<strong>and</strong> instructions<br />
like MIPS.<br />
■ 2011: Intel ships the Advanced Vector Extension that exp<strong>and</strong>s the SSE<br />
register width from 128 to 256 bits, thereby redefining about 250 instructions<br />
<strong>and</strong> adding 128 new instructions.
This history illustrates the impact of the “golden handcuffs” of compatibility on<br />
the x86, as the existing software base at each step was too important to jeopardize<br />
with significant architectural changes.<br />
Whatever the artistic failures of the x86, keep in mind that this instruction set<br />
largely drove the PC generation of computers and still dominates the cloud portion<br />
of the PostPC Era. Manufacturing 350M x86 chips per year may seem small<br />
compared to 9 billion ARMv7 chips, but many companies would love to control<br />
such a market. Nevertheless, this checkered ancestry has led to an architecture that<br />
is difficult to explain and impossible to love.<br />
Brace yourself for what you are about to see! Do not try to read this section<br />
with the care you would need to write x86 programs; the goal instead is to give you<br />
familiarity with the strengths and weaknesses of the world’s most popular desktop<br />
architecture.<br />
Rather than show the entire 16-bit, 32-bit, and 64-bit instruction set, in this<br />
section we concentrate on the 32-bit subset that originated with the 80386. We start<br />
our explanation with the registers and addressing modes, move on to the integer<br />
operations, and conclude with an examination of instruction encoding.<br />
x86 Registers and Data Addressing Modes<br />
The registers of the 80386 show the evolution of the instruction set (Figure 2.36).<br />
The 80386 extended all 16-bit registers (except the segment registers) to 32 bits,<br />
prefixing an E to their name to indicate the 32-bit version. We’ll refer to them<br />
generically as GPRs (general-purpose registers). The 80386 contains only eight<br />
GPRs. This means MIPS programs can use four times as many and ARMv7 twice<br />
as many.<br />
Figure 2.37 shows that the arithmetic, logical, and data transfer instructions are<br />
two-operand instructions. There are two important differences here. The x86<br />
arithmetic and logical instructions must have one operand act as both a source<br />
and a destination; ARMv7 and MIPS allow separate registers for source and<br />
destination. This restriction puts more pressure on the limited registers, since one<br />
source register must be modified. The second important difference is that one of<br />
the operands can be in memory. Thus, virtually any instruction may have one<br />
operand in memory, unlike ARMv7 and MIPS.<br />
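The two-operand versus three-operand contrast can be sketched in a few lines of Python. This is a toy register model for illustration only, not real x86 or MIPS semantics; the register names and values are made up.

```python
# Toy model contrasting the two register-operand styles.

def add_two_operand(regs, dst, src):
    # x86 style: the destination is also a source, so its old value is destroyed.
    regs[dst] = regs[dst] + regs[src]

def add_three_operand(regs, dst, src1, src2):
    # ARMv7/MIPS style: both source registers survive the operation.
    regs[dst] = regs[src1] + regs[src2]

x86_regs = {"EAX": 5, "EBX": 7}
add_two_operand(x86_regs, "EAX", "EBX")            # EAX's old value (5) is lost
print(x86_regs["EAX"])                             # 12

mips_regs = {"$s0": 5, "$s1": 7, "$t0": 0}
add_three_operand(mips_regs, "$t0", "$s0", "$s1")  # $s0 and $s1 are preserved
print(mips_regs["$t0"], mips_regs["$s0"])          # 12 5
```

This is why the two-operand restriction puts extra pressure on the eight GPRs: keeping a value alive across an operation requires an extra copy beforehand.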
Data memory-addressing modes, described in detail below, offer two sizes of<br />
addresses within the instruction. These so-called displacements can be 8 bits or 32<br />
bits.<br />
Although a memory operand can use any addressing mode, there are restrictions<br />
on which registers can be used in a mode. Figure 2.38 shows the x86 addressing<br />
modes and which GPRs cannot be used with each mode, as well as how to get the<br />
same effect using MIPS instructions.<br />
x86 Integer Operations<br />
The 8086 provides support for both 8-bit (byte) and 16-bit (word) data types. The<br />
80386 adds 32-bit addresses and data (double words) in the x86. (AMD64 adds 64-bit<br />
addresses and data, called quad words.)<br />
2.17 Real Stuff: x86 Instructions 153<br />
Name (bits 31:0)   Use<br />
EAX                GPR 0<br />
ECX                GPR 1<br />
EDX                GPR 2<br />
EBX                GPR 3<br />
ESP                GPR 4<br />
EBP                GPR 5<br />
ESI                GPR 6<br />
EDI                GPR 7<br />
CS                 Code segment pointer<br />
SS                 Stack segment pointer (top of stack)<br />
DS                 Data segment pointer 0<br />
ES                 Data segment pointer 1<br />
FS                 Data segment pointer 2<br />
GS                 Data segment pointer 3<br />
EIP                Instruction pointer (PC)<br />
EFLAGS             Condition codes<br />
FIGURE 2.36 The 80386 register set. Starting with the 80386, the top eight registers were extended<br />
to 32 bits and could also be used as general-purpose registers.<br />
Source/destination operand type    Second source operand<br />
Register                           Register<br />
Register                           Immediate<br />
Register                           Memory<br />
Memory                             Register<br />
Memory                             Immediate<br />
FIGURE 2.37 Instruction types for the arithmetic, logical, and data transfer instructions.<br />
The x86 allows the combinations shown. The only restriction is the absence of a memory-memory mode.<br />
Immediates may be 8, 16, or 32 bits in length; a register is any one of the 14 major registers in Figure 2.36<br />
(not EIP or EFLAGS).<br />
154 Chapter 2 Instructions: Language of the <strong>Computer</strong><br />
FIGURE 2.38 x86 32-bit addressing modes with register restrictions and the equivalent MIPS code:<br />
■ Register indirect: the address is in a register. Register restrictions: not ESP or EBP. MIPS equivalent: lw $s0,0($s1)<br />
■ Based mode with 8- or 32-bit displacement: the address is the contents of the base register plus the displacement. Register restriction: not ESP. MIPS equivalent: lw $s0,100($s1)<br />
■ Base plus scaled index: the address is Base + (2^Scale x Index), where Scale has the value 0, 1, 2, or 3. Register restrictions: Base may be any GPR; Index may not be ESP.<br />
■ Base plus scaled index with 8- or 32-bit displacement: the address is Base + (2^Scale x Index) + displacement, where Scale has the value 0, 1, 2, or 3. Register restrictions: Base may be any GPR; Index may not be ESP.<br />
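The most general mode computes Base + (2^Scale x Index) + displacement. A minimal Python sketch of that address calculation (the example register values are made up for illustration):

```python
def x86_effective_address(base, index, scale, disp):
    """Base plus scaled index with displacement: Base + (2**Scale * Index) + disp.
    Scale is 0, 1, 2, or 3, so the index is multiplied by 1, 2, 4, or 8."""
    assert scale in (0, 1, 2, 3)
    return base + (1 << scale) * index + disp

# e.g., base register holds 0x1000, index register holds 3,
# scale = 2 (4-byte array elements), displacement = 8:
print(hex(x86_effective_address(0x1000, 3, 2, 8)))  # 0x1014
```

MIPS has no scaled-index mode, so the equivalent sequence multiplies (or shifts) the index and adds it to the base before the lw.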
The first two categories are unremarkable, except that the arithmetic and logic<br />
instruction operations allow the destination to be either a register or a memory<br />
location. Figure 2.39 shows some typical x86 instructions and their functions.<br />
Conditional branches on the x86 are based on condition codes or flags, like<br />
ARMv7. Condition codes are set as a side effect of an operation; most are used<br />
to compare the value of a result to 0. Branches then test the condition codes.<br />
PC-relative branch addresses must be specified in the number of bytes, since unlike<br />
ARMv7 and MIPS, 80386 instructions are not all 4 bytes in length.<br />
Instruction    Function<br />
je name        if equal(condition code) {EIP=name}; EIP−128 ≤ name < EIP+128<br />
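The flag-then-branch split can be modeled in a few lines of Python. This is a deliberately simplified sketch: the real EFLAGS register has many more bits (carry, overflow, parity, and so on).

```python
def cmp_flags(a, b):
    # cmp computes a - b and records facts about the result as condition codes.
    diff = a - b
    return {"ZF": diff == 0,   # zero flag: operands were equal
            "SF": diff < 0}    # sign flag: result was negative (ignoring overflow)

def je_taken(flags):
    # je (jump if equal) is taken when the zero flag is set.
    return flags["ZF"]

print(je_taken(cmp_flags(10, 10)))  # True
print(je_taken(cmp_flags(10, 7)))   # False
```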
Instruction          Meaning<br />
Control              Conditional and unconditional branches<br />
jnz, jz              Jump if condition to EIP + 8-bit offset; JNE (for JNZ), JE (for JZ) are alternative names<br />
jmp                  Unconditional jump—8-bit or 16-bit offset<br />
call                 Subroutine call—16-bit offset; return address pushed onto stack<br />
ret                  Pops return address from stack and jumps to it<br />
loop                 Loop branch—decrement ECX; jump to EIP + 8-bit displacement if ECX ≠ 0<br />
Data transfer        Move data between registers or between register and memory<br />
mov                  Move between two registers or between register and memory<br />
push, pop            Push source operand on stack; pop operand from stack top to a register<br />
les                  Load ES and one of the GPRs from memory<br />
Arithmetic, logical  Arithmetic and logical operations using the data registers and memory<br />
add, sub             Add source to destination; subtract source from destination; register-memory format<br />
cmp                  Compare source and destination; register-memory format<br />
shl, shr, rcr        Shift left; shift logical right; rotate right with carry condition code as fill<br />
cbw                  Convert byte in eight rightmost bits of EAX to 16-bit word in right of EAX<br />
test                 Logical AND of source and destination sets condition codes<br />
inc, dec             Increment destination, decrement destination<br />
or, xor              Logical OR; exclusive OR; register-memory format<br />
String               Move between string operands; length given by a repeat prefix<br />
movs                 Copies from string source to destination by incrementing ESI and EDI; may be repeated<br />
lods                 Loads a byte, word, or doubleword of a string into the EAX register<br />
FIGURE 2.40 Some typical operations on the x86. Many operations use register-memory format,<br />
where either the source or the destination may be memory and the other may be a register or immediate<br />
operand.<br />
of the instructions that address memory. The base plus scaled index mode uses a second<br />
postbyte, labeled “sc, index, base.”<br />
Figure 2.42 shows the encoding of the two postbyte address specifiers for<br />
both 16-bit and 32-bit mode. Unfortunately, to understand fully which registers<br />
and which addressing modes are available, you need to see the encoding of all<br />
addressing modes and sometimes even the encoding of the instructions.<br />
x86 Conclusion<br />
Intel had a 16-bit microprocessor two years before its competitors’ more elegant<br />
architectures, such as the Motorola 68000, <strong>and</strong> this head start led to the selection<br />
of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that<br />
the x86 is more difficult to build than computers like ARMv7 <strong>and</strong> MIPS, but the<br />
large market meant in the PC Era that AMD <strong>and</strong> Intel could afford more resources
a. JE EIP + displacement: JE (4 bits), Condition (4 bits), Displacement (8 bits)<br />
b. CALL: CALL (8 bits), Offset (32 bits)<br />
c. MOV EBX, [EDI + 45]: MOV (6 bits), d (1 bit), w (1 bit), r/m Postbyte (8 bits), Displacement (8 bits)<br />
d. PUSH ESI: PUSH (5 bits), Reg (3 bits)<br />
e. ADD EAX, #6765: ADD (4 bits), Reg (3 bits), w (1 bit), Immediate (32 bits)<br />
f. TEST EDX, #42: TEST (7 bits), w (1 bit), Postbyte (8 bits), Immediate (32 bits)<br />
FIGURE 2.41 Typical x86 instruction formats. Figure 2.42 shows the encoding of the postbyte.<br />
Many instructions contain the 1-bit field w, which says whether the operation is a byte or a double word. The<br />
d field in MOV is used in instructions that may move to or from memory and shows the direction of the move.<br />
The ADD instruction requires 32 bits for the immediate field, because in 32-bit mode, the immediates are<br />
either 8 bits or 32 bits. The immediate field in the TEST is 32 bits long because there is no 8-bit immediate for<br />
test in 32-bit mode. Overall, instructions may vary from 1 to 15 bytes in length. The long length comes from<br />
extra 1-byte prefixes, having both a 4-byte immediate and a 4-byte displacement address, using an opcode of<br />
2 bytes, and using the scaled index mode specifier, which adds another byte.<br />
to help overcome the added complexity. What the x86 lacks in style, it makes up for<br />
in market size, making it beautiful from the right perspective.<br />
Its saving grace is that the most frequently used x86 architectural components<br />
are not too difficult to implement, as AMD and Intel have demonstrated by rapidly<br />
improving performance of integer programs since 1978. To get that performance,<br />
reg w = 0 w = 1 r/m mod = 0 mod = 1 mod = 2 mod = 3<br />
16b 32b 16b 32b 16b 32b 16b 32b<br />
0 AL AX EAX 0 addr=BX+SI =EAX same same same same same<br />
1 CL CX ECX 1 addr=BX+DI =ECX addr as addr as addr as addr as as<br />
2 DL DX EDX 2 addr=BP+SI =EDX mod=0 mod=0 mod=0 mod=0 reg<br />
3 BL BX EBX 3 addr=BP+DI =EBX + disp8 + disp8 + disp16 + disp32 field<br />
4 AH SP ESP 4 addr=SI =(sib) SI+disp8 (sib)+disp8 SI+disp16 (sib)+disp32 “<br />
5 CH BP EBP 5 addr=DI =disp32 DI+disp8 EBP+disp8 DI+disp16 EBP+disp32 “<br />
6 DH SI ESI 6 addr=disp16 =ESI BP+disp8 ESI+disp8 BP+disp16 ESI+disp32 “<br />
7 BH DI EDI 7 addr=BX =EDI BX+disp8 EDI+disp8 BX+disp16 EDI+disp32 “<br />
FIGURE 2.42 The encoding of the first address specifier of the x86: mod, reg, r/m. The first four columns show the encoding<br />
of the 3-bit reg field, which depends on the w bit from the opcode and whether the machine is in 16-bit mode (8086) or 32-bit mode (80386).<br />
The remaining columns explain the mod and r/m fields. The meaning of the 3-bit r/m field depends on the value in the 2-bit mod field and the<br />
address size. Basically, the registers used in the address calculation are listed in the sixth and seventh columns, under mod = 0, with mod = 1<br />
adding an 8-bit displacement and mod = 2 adding a 16-bit or 32-bit displacement, depending on the address mode. The exceptions are 1) r/m<br />
= 6 when mod = 1 or mod = 2 in 16-bit mode selects BP plus the displacement; 2) r/m = 5 when mod = 1 or mod = 2 in 32-bit mode selects<br />
EBP plus displacement; and 3) r/m = 4 in 32-bit mode when mod does not equal 3, where (sib) means use the scaled index mode shown in<br />
Figure 2.38. When mod = 3, the r/m field indicates a register, using the same encoding as the reg field combined with the w bit.<br />
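Pulling the three postbyte fields apart is a pair of shifts and masks. A Python sketch (the example byte is arbitrary, chosen only to exercise the fields):

```python
def decode_postbyte(byte):
    """Split an x86 'mod reg r/m' postbyte into its three fields."""
    assert 0 <= byte <= 0xFF
    mod = (byte >> 6) & 0b11   # top 2 bits
    reg = (byte >> 3) & 0b111  # middle 3 bits
    rm = byte & 0b111          # bottom 3 bits
    return mod, reg, rm

# 0b01_000_101: mod = 1 (an 8-bit displacement follows), reg = 0, r/m = 5,
# which in 32-bit mode means EBP plus disp8 per Figure 2.42.
print(decode_postbyte(0b01000101))  # (1, 0, 5)
```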
compilers must avoid the portions of the architecture that are hard to implement<br />
fast.<br />
In the PostPC Era, however, despite considerable architectural and manufacturing<br />
expertise, x86 has not yet been competitive in the personal mobile device.<br />
2.18 Real Stuff: ARMv8 (64-bit) Instructions<br />
Of the many potential problems in an instruction set, the one that is almost impossible<br />
to overcome is having too small a memory address. While the x86 was successfully<br />
extended first to 32-bit addresses and then later to 64-bit addresses, many of its<br />
brethren were left behind. For example, the 16-bit address MOStek 6502 powered the<br />
Apple II, but even given this head start with the first commercially successful personal<br />
computer, its lack of address bits condemned it to the dustbin of history.<br />
ARM architects could see the writing on the wall of their 32-bit address<br />
computer, and began design of the 64-bit address version of ARM in 2007. It was<br />
finally revealed in 2013. Rather than some minor cosmetic changes to make all<br />
the registers 64 bits wide, which is basically what happened to the x86, ARM did a<br />
complete overhaul. The good news is that if you know MIPS it will be very easy to<br />
pick up ARMv8, as the 64-bit version is called.<br />
First, as compared to MIPS, ARM dropped virtually all of the unusual features<br />
of v7:<br />
■ There is no conditional execution field, as there was in nearly every instruction<br />
in v7.
This battle between compilers and assembly language coders is another situation<br />
in which humans are losing ground. For example, C offers the programmer a<br />
chance to give a hint to the compiler about which variables to keep in registers<br />
versus spilled to memory. When compilers were poor at register allocation, such<br />
hints were vital to performance. In fact, some old C textbooks spent a fair amount<br />
of time giving examples that effectively use register hints. Today’s C compilers<br />
generally ignore such hints, because the compiler does a better job at allocation<br />
than the programmer does.<br />
Even if writing by hand resulted in faster code, the dangers of writing in assembly<br />
language are the longer time spent coding and debugging, the loss in portability,<br />
and the difficulty of maintaining such code. One of the few widely accepted axioms<br />
of software engineering is that coding takes longer if you write more lines, and it<br />
clearly takes many more lines to write a program in assembly language than in C<br />
or Java. Moreover, once it is coded, the next danger is that it will become a popular<br />
program. Such programs always live longer than expected, meaning that someone<br />
will have to update the code over several years and make it work with new releases<br />
of operating systems and new models of machines. Writing in a higher-level language<br />
instead of assembly language not only allows future compilers to tailor the code<br />
to future machines; it also makes the software easier to maintain and allows the<br />
program to run on more brands of computers.<br />
Fallacy: The importance of commercial binary compatibility means successful<br />
instruction sets don’t change.<br />
While backwards binary compatibility is sacrosanct, Figure 2.43 shows that the x86<br />
architecture has grown dramatically. The average is more than one instruction per<br />
month over its 35-year lifetime!<br />
Pitfall: Forgetting that sequential word addresses in machines with byte addressing<br />
do not differ by one.<br />
Many an assembly language programmer has toiled over errors made by assuming<br />
that the address of the next word can be found by incrementing the address in a<br />
register by one instead of by the word size in bytes. Forewarned is forearmed!<br />
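The pitfall is easy to demonstrate with a few lines of Python modeling byte-addressed memory (the base address is made up for illustration):

```python
WORD_SIZE = 4  # bytes per word on MIPS and 32-bit x86

base = 0x10000000
# Correct: consecutive words are WORD_SIZE bytes apart.
word_addrs = [base + i * WORD_SIZE for i in range(4)]
print([hex(a) for a in word_addrs])  # ['0x10000000', '0x10000004', '0x10000008', '0x1000000c']

# The bug: incrementing by 1 lands inside the current word, not at the next one.
buggy_addrs = [base + i for i in range(4)]
assert buggy_addrs[1] != word_addrs[1]
```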
Pitfall: Using a pointer to an automatic variable outside its defining procedure.<br />
A common mistake in dealing with pointers is to pass a result from a procedure<br />
that includes a pointer to an array that is local to that procedure. Following the<br />
stack discipline in Figure 2.12, the memory that contains the local array will be<br />
reused as soon as the procedure returns. Pointers to automatic variables can lead<br />
to chaos.
We also saw the great idea of making the common case fast applied to instruction<br />
sets as well as computer architecture. Examples of making the common MIPS<br />
case fast include PC-relative addressing for conditional branches and immediate<br />
addressing for larger constant operands.<br />
Above this machine level is assembly language, a language that humans can read.<br />
The assembler translates it into the binary numbers that machines can understand,<br />
and it even “extends” the instruction set by creating symbolic instructions that<br />
aren’t in the hardware. For instance, constants or addresses that are too big are<br />
broken into properly sized pieces, common variations of instructions are given<br />
their own name, and so on. Figure 2.44 lists the MIPS instructions we have covered<br />
MIPS instructions                 Name   Format    Pseudo MIPS                   Name   Format<br />
add                               add    R         move                          move   R<br />
subtract                          sub    R         multiply                      mult   R<br />
add immediate                     addi   I         multiply immediate            multi  I<br />
load word                         lw     I         load immediate                li     I<br />
store word                        sw     I         branch less than              blt    I<br />
load half                         lh     I         branch less than or equal     ble    I<br />
load half unsigned                lhu    I         branch greater than           bgt    I<br />
store half                        sh     I         branch greater than or equal  bge    I<br />
load byte                         lb     I<br />
load byte unsigned                lbu    I<br />
store byte                        sb     I<br />
load linked                       ll     I<br />
store conditional                 sc     I<br />
load upper immediate              lui    I<br />
and                               and    R<br />
or                                or     R<br />
nor                               nor    R<br />
and immediate                     andi   I<br />
or immediate                      ori    I<br />
shift left logical                sll    R<br />
shift right logical               srl    R<br />
branch on equal                   beq    I<br />
branch on not equal               bne    I<br />
set less than                     slt    R<br />
set less than immediate           slti   I<br />
set less than immediate unsigned  sltiu  I<br />
jump                              j      J<br />
jump register                     jr     R<br />
jump and link                     jal    J<br />
FIGURE 2.44 The MIPS instruction set covered so far, with the real MIPS instructions<br />
on the left and the pseudoinstructions on the right. Appendix A (Section A.10) describes the<br />
full MIPS architecture. Figure 2.1 shows more details of the MIPS architecture revealed in this chapter. The<br />
information given here is also found in Columns 1 and 2 of the MIPS Reference Data Card at the front of<br />
the book.<br />
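The assembler’s “extension” of the instruction set described above can be sketched as a table-driven rewrite. The two expansions below (move via add with $zero, and blt via slt plus bne using the assembler temporary $at) follow the conventions of this chapter; the parsing is deliberately minimal and only illustrative.

```python
def expand_pseudo(instr):
    """Rewrite a MIPS pseudoinstruction as real instructions; pass real ones through."""
    op, *args = instr.replace(",", "").split()
    if op == "move":                 # move rd, rs  ->  add rd, $zero, rs
        rd, rs = args
        return [f"add {rd}, $zero, {rs}"]
    if op == "blt":                  # blt rs, rt, L  ->  slt $at, rs, rt; bne $at, $zero, L
        rs, rt, label = args
        return [f"slt $at, {rs}, {rt}", f"bne $at, $zero, {label}"]
    return [instr]                   # already a real instruction

print(expand_pseudo("move $t0, $t1"))    # ['add $t0, $zero, $t1']
print(expand_pseudo("blt $s0, $s1, L"))  # ['slt $at, $s0, $s1', 'bne $at, $zero, L']
```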
2.22 Exercises 165<br />
2.3 [5] For the following C statement, what is the corresponding<br />
MIPS assembly code? Assume that the variables f, g, h, i, and j are assigned to<br />
registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address<br />
of the arrays A and B are in registers $s6 and $s7, respectively.<br />
B[8] = A[i−j];<br />
2.4 [5] For the MIPS assembly instructions below, what is the<br />
corresponding C statement? Assume that the variables f, g, h, i, and j are assigned<br />
to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address<br />
of the arrays A and B are in registers $s6 and $s7, respectively.<br />
sll $t0, $s0, 2 # $t0 = f * 4<br />
add $t0, $s6, $t0 # $t0 = &A[f]<br />
sll $t1, $s1, 2 # $t1 = g * 4<br />
add $t1, $s7, $t1 # $t1 = &B[g]<br />
lw $s0, 0($t0) # f = A[f]<br />
addi $t2, $t0, 4<br />
lw $t0, 0($t2)<br />
add $t0, $t0, $s0<br />
sw $t0, 0($t1)<br />
2.5 [5] For the MIPS assembly instructions in Exercise 2.4, rewrite<br />
the assembly code to minimize the number of MIPS instructions (if possible)<br />
needed to carry out the same function.<br />
2.6 The table below shows 32-bit values of an array stored in memory.<br />
Address  Data<br />
24       2<br />
28       4<br />
32       3<br />
36       6<br />
40       1
2.6.1 [5] For the memory locations in the table above, write C<br />
code to sort the data from lowest to highest, placing the lowest value in the<br />
smallest memory location shown in the figure. Assume that the data shown<br />
represents the C variable called Array, which is an array of type int, and that<br />
the first number in the array shown is the first element in the array. Assume<br />
that this particular machine is a byte-addressable machine and a word consists<br />
of four bytes.<br />
2.6.2 [5] For the memory locations in the table above, write MIPS<br />
code to sort the data from lowest to highest, placing the lowest value in the smallest<br />
memory location. Use a minimum number of MIPS instructions. Assume the base<br />
address of Array is stored in register $s6.<br />
2.7 [5] Show how the value 0xabcdef12 would be arranged in memory<br />
of a little-endian and a big-endian machine. Assume the data is stored starting at<br />
address 0.<br />
2.8 [5] Translate 0xabcdef12 into decimal.<br />
2.9 [5] Translate the following C code to MIPS. Assume that the<br />
variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4,<br />
respectively. Assume that the base address of the arrays A and B are in registers $s6<br />
and $s7, respectively. Assume that the elements of the arrays A and B are 4-byte<br />
words:<br />
B[8] = A[i] + A[j];<br />
2.10 [5] Translate the following MIPS code to C. Assume that the<br />
variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4,<br />
respectively. Assume that the base address of the arrays A and B are in registers $s6<br />
and $s7, respectively.<br />
addi $t0, $s6, 4<br />
add $t1, $s6, $0<br />
sw $t1, 0($t0)<br />
lw $t0, 0($t0)<br />
add $s0, $t1, $t0<br />
2.11 [5] For each MIPS instruction, show the value of the opcode<br />
(OP), source register (RS), and target register (RT) fields. For the I-type instructions,<br />
show the value of the immediate field, and for the R-type instructions, show the<br />
value of the destination register (RD) field.
2.12 Assume that registers $s0 and $s1 hold the values 0x80000000 and<br />
0xD0000000, respectively.<br />
2.12.1 [5] What is the value of $t0 for the following assembly code?<br />
add $t0, $s0, $s1<br />
2.12.2 [5] Is the result in $t0 the desired result, or has there been overflow?<br />
2.12.3 [5] For the contents of registers $s0 and $s1 as specified above,<br />
what is the value of $t0 for the following assembly code?<br />
sub $t0, $s0, $s1<br />
2.12.4 [5] Is the result in $t0 the desired result, or has there been overflow?<br />
2.12.5 [5] For the contents of registers $s0 and $s1 as specified above,<br />
what is the value of $t0 for the following assembly code?<br />
add $t0, $s0, $s1<br />
add $t0, $t0, $s0<br />
2.12.6 [5] Is the result in $t0 the desired result, or has there been<br />
overflow?<br />
2.13 Assume that $s0 holds the value 128ten.<br />
2.13.1 [5] For the instruction add $t0, $s0, $s1, what is the range(s) of<br />
values for $s1 that would result in overflow?<br />
2.13.2 [5] For the instruction sub $t0, $s0, $s1, what is the range(s) of<br />
values for $s1 that would result in overflow?<br />
2.13.3 [5] For the instruction sub $t0, $s1, $s0, what is the range(s) of<br />
values for $s1 that would result in overflow?<br />
2.14 [5] Provide the type and assembly language instruction for the<br />
following binary value: 0000 0010 0001 0000 1000 0000 0010 0000two<br />
2.15 [5] Provide the type and hexadecimal representation of the<br />
following instruction: sw $t1, 32($t2)
2.16 [5] Provide the type, assembly language instruction, and binary<br />
representation of the instruction described by the following MIPS fields:<br />
op=0, rs=3, rt=2, rd=3, shamt=0, funct=34<br />
2.17 [5] Provide the type, assembly language instruction, and binary<br />
representation of the instruction described by the following MIPS fields:<br />
op=0x23, rs=1, rt=2, const=0x4<br />
2.18 Assume that we would like to expand the MIPS register file to 128 registers<br />
and expand the instruction set to contain four times as many instructions.<br />
2.18.1 [5] How would this affect the size of each of the bit fields in<br />
the R-type instructions?<br />
2.18.2 [5] How would this affect the size of each of the bit fields in<br />
the I-type instructions?<br />
2.18.3 [5] How could each of the two proposed changes decrease<br />
the size of a MIPS assembly program? On the other hand, how could the proposed<br />
change increase the size of a MIPS assembly program?<br />
2.19 Assume the following register contents:<br />
$t0 = 0xAAAAAAAA, $t1 = 0x12345678<br />
2.19.1 [5] For the register values shown above, what is the value of $t2<br />
for the following sequence of instructions?<br />
sll $t2, $t0, 44<br />
or $t2, $t2, $t1<br />
2.19.2 [5] For the register values shown above, what is the value of $t2<br />
for the following sequence of instructions?<br />
sll $t2, $t0, 4<br />
andi $t2, $t2, −1<br />
2.19.3 [5] For the register values shown above, what is the value of $t2<br />
for the following sequence of instructions?<br />
srl $t2, $t0, 3<br />
andi $t2, $t2, 0xFFEF
2.20 [5] Find the shortest sequence of MIPS instructions that extracts bits<br />
16 down to 11 from register $t0 and uses the value of this field to replace bits 31<br />
down to 26 in register $t1 without changing the other 26 bits of register $t1.<br />
2.21 [5] Provide a minimal set of MIPS instructions that may be used to<br />
implement the following pseudoinstruction:<br />
not $t1, $t2<br />
// bit-wise invert<br />
2.22 [5] For the following C statement, write a minimal sequence of MIPS<br />
assembly instructions that does the identical operation. Assume $t1 = A, $t2 = B,<br />
and $s1 is the base address of C.<br />
A = C[0] << 4;<br />
2.25 The following instruction is not in the MIPS instruction set:<br />
rpt $t2, loop # if(R[rs] > 0) R[rs]=R[rs]−1, PC=PC+4+BranchAddr<br />
2.25.1 [5] If this instruction were to be implemented in the MIPS<br />
instruction set, what is the most appropriate instruction format?<br />
2.25.2 [5] What is the shortest sequence of MIPS instructions that<br />
performs the same operation?
2.26 Consider the following MIPS loop:<br />
LOOP: slt $t2, $0, $t1<br />
beq $t2, $0, DONE<br />
subi $t1, $t1, 1<br />
addi $s2, $s2, 2<br />
j LOOP<br />
DONE:<br />
2.26.1 [5] Assume that the register $t1 is initialized to the value 10. What<br />
is the value in register $s2 assuming $s2 is initially zero?<br />
2.26.2 [5] For each of the loops above, write the equivalent C code<br />
routine. Assume that the registers $s1, $s2, $t1, and $t2 are integers A, B, i, and<br />
temp, respectively.<br />
2.26.3 [5] For the loops written in MIPS assembly above, assume that<br />
the register $t1 is initialized to the value N. How many MIPS instructions are<br />
executed?<br />
2.27 [5] Translate the following C code to MIPS assembly code. Use a<br />
minimum number of instructions. Assume that the values of a, b, i, and j are in<br />
registers $s0, $s1, $t0, and $t1, respectively. Also, assume that register $s2 holds<br />
the base address of the array D.<br />
for(i=0; i < a; i++)<br />
    for(j=0; j < b; j++)<br />
        D[4*j] = i + j;<br />
addi $t1, $t1, 1<br />
slti $t2, $t1, 100<br />
bne $t2, $s0, LOOP<br />
2.30 [5] Rewrite the loop from Exercise 2.29 to reduce the number of<br />
MIPS instructions executed.<br />
2.31 [5] Implement the following C code in MIPS assembly. What is the<br />
total number of MIPS instructions needed to execute the function?<br />
int fib(int n){<br />
if (n==0)<br />
return 0;<br />
else if (n == 1)<br />
return 1;<br />
else<br />
return fib(n−1) + fib(n−2);<br />
}<br />
2.32 [5] Functions can often be implemented by compilers “in-line.” An<br />
in-line function is when the body of the function is copied into the program space,<br />
allowing the overhead of the function call to be eliminated. Implement an “in-line”<br />
version of the C code above in MIPS assembly. What is the reduction in the total<br />
number of MIPS assembly instructions needed to complete the function? Assume<br />
that the C variable n is initialized to 5.<br />
2.33 [5] For each function call, show the contents of the stack after the<br />
function call is made. Assume the stack pointer is originally at address 0x7ffffffc,<br />
<strong>and</strong> follow the register conventions as specified in Figure 2.11.<br />
2.34 Translate function f into MIPS assembly language. If you need to use<br />
registers $t0 through $t7, use the lower-numbered registers first. Assume the<br />
function declaration for func is “int f(int a, int b);”. The code for function<br />
f is as follows:<br />
int f(int a, int b, int c, int d){<br />
return func(func(a,b),c+d);<br />
}
2.35 [5] Can we use the tail-call optimization in this function? If no,<br />
explain why not. If yes, what is the difference in the number of executed instructions<br />
in f with <strong>and</strong> without the optimization?<br />
2.36 [5] Right before your function f from Exercise 2.34 returns, what do<br />
we know about the contents of registers $t5, $s3, $ra, and $sp? Keep in mind that<br />
we know what the entire function f looks like, but for function func we only know<br />
its declaration.<br />
2.37 [5] Write a program in MIPS assembly language to convert an ASCII<br />
number string containing positive <strong>and</strong> negative integer decimal strings, to an<br />
integer. Your program should expect register $a0 to hold the address of a<br />
null-terminated string containing some combination of the digits 0 through 9. Your<br />
program should compute the integer value equivalent to this string of digits, then<br />
place the number in register $v0. If a non-digit character appears anywhere in the<br />
string, your program should stop with the value −1 in register $v0. For example,<br />
if register $a0 points to a sequence of three bytes 50ten, 52ten, 0ten (the<br />
null-terminated string “24”), then when the program stops, register $v0 should<br />
contain the value 24 ten.<br />
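As a specification sketch, the behavior the exercise asks for can be written in plain C; the function name str_to_int and the handling of an optional leading sign are our reading of “positive <strong>and</strong> negative integer decimal strings,” not the book’s, and the MIPS version is left as the exercise itself.<br />

```c
/* Reference semantics for Exercise 2.37, sketched in C (name and
   sign handling are ours): parse a null-terminated decimal string;
   return its value, or -1 if a non-digit appears. */
int str_to_int(const char *s) {
    int sign = 1, value = 0;
    if (*s == '+' || *s == '-') {        /* optional leading sign */
        if (*s == '-') sign = -1;
        s++;
    }
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return -1;                   /* non-digit anywhere: stop with -1 */
        value = value * 10 + (*s - '0'); /* multiply by ten, add the digit */
    }
    return sign * value;
}
```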
2.38 [5] Consider the following code:<br />
lbu $t0, 0($t1)<br />
sw $t0, 0($t2)<br />
Assume that the register $t1 contains the address 0x1000 0000 <strong>and</strong> the register<br />
$t2 contains the address 0x1000 0010. Note the MIPS architecture utilizes<br />
big-endian addressing. Assume that the data (in hexadecimal) at address 0x1000<br />
0000 is: 0x11223344. What value is stored at the address pointed to by register<br />
$t2?<br />
2.39 [5] Write the MIPS assembly code that creates the 32-bit constant<br />
0010 0000 0000 0001 0100 1001 0010 0100 two <strong>and</strong> stores that value to<br />
register $t1.<br />
2.40 [5] If the current value of the PC is 0x00000000, can you use<br />
a single jump instruction to get to the PC address as shown in Exercise 2.39?<br />
2.41 [5] If the current value of the PC is 0x00000600, can you use<br />
a single branch instruction to get to the PC address as shown in Exercise 2.39?
2.22 Exercises 173<br />
2.42 [5] If the current value of the PC is 0x1FFFf000, can you use<br />
a single branch instruction to get to the PC address as shown in Exercise 2.39?<br />
2.43 [5] Write the MIPS assembly code to implement the following C<br />
code:<br />
lock(lk);<br />
shvar=max(shvar,x);<br />
unlock(lk);<br />
Assume that the address of the lk variable is in $a0, the address of the shvar<br />
variable is in $a1, <strong>and</strong> the value of variable x is in $a2. Your critical section should<br />
not contain any function calls. Use ll/sc instructions to implement the lock()<br />
operation, <strong>and</strong> the unlock() operation is simply an ordinary store instruction.<br />
2.44 [5] Repeat Exercise 2.43, but this time use ll/sc to perform<br />
an atomic update of the shvar variable directly, without using lock() <strong>and</strong><br />
unlock(). Note that in this problem there is no variable lk.<br />
2.45 [5] Using your code from Exercise 2.43 as an example, explain what<br />
happens when two processors begin to execute this critical section at the same<br />
time, assuming that each processor executes exactly one instruction per cycle.<br />
2.46 Assume for a given processor the CPI of arithmetic instructions is 1,<br />
the CPI of load/store instructions is 10, <strong>and</strong> the CPI of branch instructions is<br />
3. Assume a program has the following instruction breakdowns: 500 million<br />
arithmetic instructions, 300 million load/store instructions, 100 million branch<br />
instructions.<br />
2.46.1 [5] Suppose that new, more powerful arithmetic instructions are<br />
added to the instruction set. On average, through the use of these more powerful<br />
arithmetic instructions, we can reduce the number of arithmetic instructions<br />
needed to execute a program by 25%, <strong>and</strong> the cost of increasing the clock cycle<br />
time by only 10%. Is this a good design choice? Why?<br />
2.46.2 [5] Suppose that we find a way to double the performance of<br />
arithmetic instructions. What is the overall speedup of our machine? What if we<br />
find a way to improve the performance of arithmetic instructions by 10 times?<br />
2.47 Assume that for a given program 70% of the executed instructions are<br />
arithmetic, 10% are load/store, <strong>and</strong> 20% are branch.
2.47.1 [5] Given this instruction mix <strong>and</strong> the assumption that an<br />
arithmetic instruction requires 2 cycles, a load/store instruction takes 6 cycles, <strong>and</strong><br />
a branch instruction takes 3 cycles, find the average CPI.<br />
2.47.2 [5] For a 25% improvement in performance, how many cycles, on<br />
average, may an arithmetic instruction take if load/store <strong>and</strong> branch instructions<br />
are not improved at all?<br />
2.47.3 [5] For a 50% improvement in performance, how many cycles, on<br />
average, may an arithmetic instruction take if load/store <strong>and</strong> branch instructions<br />
are not improved at all?<br />
Answers to<br />
Check Yourself<br />
§2.2, page 66: MIPS, C, Java<br />
§2.3, page 72: 2) Very slow<br />
§2.4, page 79: 2) 8 ten<br />
§2.5, page 87: 4) sub $t2, $t0, $t1<br />
§2.6, page 89: Both. AND with a mask pattern of 1s leaves 0s everywhere but<br />
the desired field. Shifting left by the correct amount removes the bits from the left<br />
of the field. Shifting right by the appropriate amount puts the field into the rightmost<br />
bits of the word, with 0s in the rest of the word. Note that AND leaves the<br />
field where it was originally, <strong>and</strong> the shift pair moves the field into the rightmost<br />
part of the word.<br />
§2.7, page 96: I. All are true. II. 1).<br />
§2.8, page 106: Both are true.<br />
§2.9, page 111: I. 1) <strong>and</strong> 2) II. 3)<br />
§2.10, page 120: I. 4) 128K. II. 6) a block of 256M. III. 4) sll<br />
§2.11, page 123: Both are true.<br />
§2.12, page 132: 4) Machine independence.
3<br />
Arithmetic for<br />
<strong>Computer</strong>s<br />
Numerical precision<br />
is the very soul of<br />
science.<br />
Sir D’Arcy Wentworth Thompson<br />
On Growth <strong>and</strong> Form, 1917<br />
3.1 Introduction 178<br />
3.2 Addition <strong>and</strong> Subtraction 178<br />
3.3 Multiplication 183<br />
3.4 Division 189<br />
3.5 Floating Point 196<br />
3.6 Parallelism <strong>and</strong> <strong>Computer</strong> Arithmetic:<br />
Subword Parallelism 222<br />
3.7 Real Stuff: Streaming SIMD Extensions <strong>and</strong><br />
Advanced Vector Extensions in x86 224<br />
<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />
© 2013 Elsevier Inc. All rights reserved.
3.2 Addition <strong>and</strong> Subtraction 179<br />
[Figure 3.1 layout: the bit-level addition 0000 0111 two + 0000 0110 two = 0000 1101 two, with the carry into each bit position shown in parentheses.]<br />
FIGURE 3.1 Binary addition, showing carries from right to left. The rightmost bit adds 1<br />
to 0, resulting in the sum of this bit being 1 <strong>and</strong> the carry out from this bit being 0. Hence, the operation<br />
for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit <strong>and</strong> a carry out of 1. The<br />
third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 <strong>and</strong> a sum bit of 1. The fourth bit is 1 +<br />
0 + 0, yielding a 1 sum <strong>and</strong> no carry.<br />
0000 0000 0000 0000 0000 0000 0000 0111 two = 7 ten<br />
– 0000 0000 0000 0000 0000 0000 0000 0110 two = 6 ten<br />
= 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten<br />
or via addition using the two’s complement representation of 6:<br />
0000 0000 0000 0000 0000 0000 0000 0111 two = 7 ten<br />
+ 1111 1111 1111 1111 1111 1111 1111 1010 two = –6 ten<br />
= 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten<br />
Recall that overflow occurs when the result from an operation cannot be<br />
represented with the available hardware, in this case a 32-bit word. When can<br />
overflow occur in addition? When adding oper<strong>and</strong>s with different signs, overflow<br />
cannot occur. The reason is the sum must be no larger than one of the oper<strong>and</strong>s.<br />
For example, −10 + 4 = −6. Since the oper<strong>and</strong>s fit in 32 bits <strong>and</strong> the sum is no<br />
larger than an oper<strong>and</strong>, the sum must fit in 32 bits as well. Therefore, no overflow<br />
can occur when adding positive <strong>and</strong> negative oper<strong>and</strong>s.<br />
There are similar restrictions to the occurrence of overflow during subtract, but<br />
it’s just the opposite principle: when the signs of the oper<strong>and</strong>s are the same, overflow<br />
cannot occur. To see this, remember that c − a = c + (−a) because we subtract by<br />
negating the second oper<strong>and</strong> <strong>and</strong> then add. Therefore, when we subtract oper<strong>and</strong>s<br />
of the same sign we end up by adding oper<strong>and</strong>s of different signs. From the prior<br />
paragraph, we know that overflow cannot occur in this case either.<br />
Knowing when overflow cannot occur in addition <strong>and</strong> subtraction is all well <strong>and</strong><br />
good, but how do we detect it when it does occur? Clearly, adding or subtracting<br />
two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.<br />
The lack of a 33rd bit means that when overflow occurs, the sign bit is set with<br />
the value of the result instead of the proper sign of the result. Since we need just one<br />
extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two<br />
positive numbers <strong>and</strong> the sum is negative, or vice versa. This spurious sum means<br />
a carry out occurred into the sign bit.<br />
Overflow occurs in subtraction when we subtract a negative number from a<br />
positive number <strong>and</strong> get a negative result, or when we subtract a positive number<br />
from a negative number <strong>and</strong> get a positive result. Such a ridiculous result means a<br />
borrow occurred from the sign bit. Figure 3.2 shows the combination of operations,<br />
oper<strong>and</strong>s, <strong>and</strong> results that indicate an overflow.
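The detection rules above can be sketched in C; the function names are ours, and the casts through unsigned arithmetic keep the addition itself well defined even when the signed result would overflow.<br />

```c
#include <stdint.h>

/* Overflow rule from the text: adding operands of the same sign
   overflows exactly when the sum's sign differs from theirs. */
int add_overflows(int32_t a, int32_t b) {
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b); /* wraps safely */
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}

/* Subtraction is the opposite case: the operands' signs differ, and
   the result's sign does not match the first operand's. */
int sub_overflows(int32_t a, int32_t b) {
    int32_t diff = (int32_t)((uint32_t)a - (uint32_t)b);
    return ((a >= 0) != (b >= 0)) && ((diff >= 0) != (a >= 0));
}
```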
more detail; Chapter 5 describes other situations where exceptions <strong>and</strong> interrupts<br />
occur.)<br />
MIPS includes a register called the exception program counter (EPC) to contain<br />
the address of the instruction that caused the exception. The instruction move from<br />
system control (mfc0) is used to copy EPC into a general-purpose register so that<br />
MIPS software has the option of returning to the offending instruction via a jump<br />
register instruction.<br />
interrupt An exception that comes from outside of the processor. (Some<br />
architectures use the term interrupt for all exceptions.)<br />
Summary<br />
A major point of this section is that, independent of the representation, the finite<br />
word size of computers means that arithmetic operations can create results that<br />
are too large to fit in this fixed word size. It’s easy to detect overflow in unsigned<br />
numbers, although these are almost always ignored because programs don’t want to<br />
detect overflow for address arithmetic, the most common use of natural numbers.<br />
Two’s complement presents a greater challenge, yet some software systems require<br />
detection of overflow, so today all computers have a way to detect it.<br />
Some programming languages allow two’s complement integer arithmetic<br />
on variables declared byte <strong>and</strong> half, whereas MIPS only has integer arithmetic<br />
operations on full words. As we recall from Chapter 2, MIPS does have data transfer<br />
operations for bytes <strong>and</strong> halfwords. What MIPS instructions should be generated<br />
for byte <strong>and</strong> halfword arithmetic operations?<br />
1. Load with lbu, lhu; arithmetic with add, sub, mult, div; then store using<br />
sb, sh.<br />
2. Load with lb, lh; arithmetic with add, sub, mult, div; then store using<br />
sb, sh.<br />
3. Load with lb, lh; arithmetic with add, sub, mult, div, using AND to mask<br />
result to 8 or 16 bits after each operation; then store using sb, sh.<br />
Check<br />
Yourself<br />
Elaboration: One feature not generally found in general-purpose microprocessors is<br />
saturating operations. Saturation means that when a calculation overflows, the result<br />
is set to the largest positive number or most negative number, rather than a modulo<br />
calculation as in two’s complement arithmetic. Saturation is likely what you want for media<br />
operations. For example, the volume knob on a radio set would be frustrating if, as you<br />
turned it, the volume would get continuously louder for a while <strong>and</strong> then immediately very<br />
soft. A knob with saturation would stop at the highest volume no matter how far you turned<br />
it. Multimedia extensions to st<strong>and</strong>ard instruction sets often offer saturating arithmetic.<br />
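A minimal C sketch of saturation for 32-bit signed addition follows; computing in 64 bits is just one convenient way to detect the clamp in software, not how the hardware does it, and the function name is ours.<br />

```c
#include <stdint.h>

/* Saturating 32-bit add: on overflow, clamp to the largest positive or
   most negative number instead of wrapping modulo 2^32. */
int32_t sat_add(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a + (int64_t)b;  /* 33 bits is enough, 64 is easy */
    if (wide > INT32_MAX) return INT32_MAX;  /* clamp high */
    if (wide < INT32_MIN) return INT32_MIN;  /* clamp low */
    return (int32_t)wide;
}
```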
Elaboration: MIPS can trap on overflow, but unlike many other computers, there is<br />
no conditional branch to test overflow. A sequence of MIPS instructions can discover<br />
3.3 Multiplication<br />
Now that we have completed the explanation of addition <strong>and</strong> subtraction, we are<br />
ready to build the more vexing operation of multiplication.<br />
First, let’s review the multiplication of decimal numbers in longh<strong>and</strong> to remind<br />
ourselves of the steps of multiplication <strong>and</strong> the names of the oper<strong>and</strong>s. For reasons<br />
that will become clear shortly, we limit this decimal example to using only the<br />
digits 0 <strong>and</strong> 1. Multiplying 1000 ten by 1001 ten:<br />
Multiplic<strong>and</strong> 1000 ten<br />
Multiplier × 1001 ten<br />
1000<br />
0000<br />
0000<br />
1000<br />
Product 1001000 ten<br />
Multiplication is<br />
vexation, Division is<br />
as bad; The rule of<br />
three doth puzzle me,<br />
And practice drives me<br />
mad.<br />
Anonymous,<br />
Elizabethan manuscript,<br />
1570<br />
The first oper<strong>and</strong> is called the multiplic<strong>and</strong> <strong>and</strong> the second the multiplier.<br />
The final result is called the product. As you may recall, the algorithm learned in<br />
grammar school is to take the digits of the multiplier one at a time from right to<br />
left, multiplying the multiplic<strong>and</strong> by the single digit of the multiplier, <strong>and</strong> shifting<br />
the intermediate product one digit to the left of the earlier intermediate products.<br />
The first observation is that the number of digits in the product is considerably<br />
larger than the number in either the multiplic<strong>and</strong> or the multiplier. In fact, if we<br />
ignore the sign bits, the length of the multiplication of an n-bit multiplic<strong>and</strong> <strong>and</strong> an<br />
m-bit multiplier is a product that is n + m bits long. That is, n + m bits are required<br />
to represent all possible products. Hence, like add, multiply must cope with<br />
overflow because we frequently want a 32-bit product as the result of multiplying<br />
two 32-bit numbers.<br />
In this example, we restricted the decimal digits to 0 <strong>and</strong> 1. With only two<br />
choices, each step of the multiplication is simple:<br />
1. Just place a copy of the multiplic<strong>and</strong> (1 × multiplic<strong>and</strong>) in the proper place<br />
if the multiplier digit is a 1, or<br />
2. Place 0 (0 × multiplic<strong>and</strong>) in the proper place if the digit is 0.<br />
Although the decimal example above happens to use only 0 <strong>and</strong> 1, multiplication<br />
of binary numbers must always use 0 <strong>and</strong> 1, <strong>and</strong> thus always offers only these two<br />
choices.<br />
Now that we have reviewed the basics of multiplication, the traditional next<br />
step is to provide the highly optimized multiply hardware. We break with tradition<br />
in the belief that you will gain a better underst<strong>and</strong>ing by seeing the evolution of<br />
the multiply hardware <strong>and</strong> algorithm through multiple generations. For now, let’s<br />
assume that we are multiplying only positive numbers.
Start<br />
Multiplier0 = 1<br />
1. Test<br />
Multiplier0<br />
Multiplier0 = 0<br />
1a. Add multiplic<strong>and</strong> to product <strong>and</strong><br />
place the result in Product register<br />
2. Shift the Multiplic<strong>and</strong> register left 1 bit<br />
3. Shift the Multiplier register right 1 bit<br />
32nd repetition?<br />
No: < 32 repetitions<br />
Yes: 32 repetitions<br />
Done<br />
FIGURE 3.4 The first multiplication algorithm, using the hardware shown in Figure 3.3. If<br />
the least significant bit of the multiplier is 1, add the multiplic<strong>and</strong> to the product. If not, go to the next step.<br />
Shift the multiplic<strong>and</strong> left <strong>and</strong> the multiplier right in the next two steps. These three steps are repeated 32<br />
times.<br />
This algorithm <strong>and</strong> hardware are easily refined to take 1 clock cycle per step.<br />
The speed-up comes from performing the operations in parallel: the multiplier<br />
<strong>and</strong> multiplic<strong>and</strong> are shifted while the multiplic<strong>and</strong> is added to the product if the<br />
multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of<br />
the multiplier <strong>and</strong> gets the preshifted version of the multiplic<strong>and</strong>. The hardware is<br />
usually further optimized to halve the width of the adder <strong>and</strong> registers by noticing<br />
where there are unused portions of registers <strong>and</strong> adders. Figure 3.5 shows the<br />
revised hardware.
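The three-step loop of Figure 3.4 can be sketched in C for unsigned oper<strong>and</strong>s; the register names in the comments follow the figure, and this models the slow first version, not the refined hardware of Figure 3.5.<br />

```c
#include <stdint.h>

/* First multiplication algorithm (Figure 3.4), unsigned 32-bit operands:
   test the low multiplier bit, conditionally add, then shift, 32 times. */
uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t mcand = multiplicand;   /* 64-bit Multiplicand register */
    for (int i = 0; i < 32; i++) {
        if (multiplier & 1)          /* step 1: test Multiplier0 */
            product += mcand;        /* step 1a: add to Product register */
        mcand <<= 1;                 /* step 2: shift Multiplicand left 1 bit */
        multiplier >>= 1;            /* step 3: shift Multiplier right 1 bit */
    }
    return product;                  /* up to 64 bits, as the text notes */
}
```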
Iteration Step Multiplier Multiplic<strong>and</strong> Product<br />
0 Initial values 0011 0000 0010 0000 0000<br />
1 1a: 1 ⇒ Prod = Prod + Mc<strong>and</strong> 0011 0000 0010 0000 0010<br />
2: Shift left Multiplic<strong>and</strong> 0011 0000 0100 0000 0010<br />
3: Shift right Multiplier 0001 0000 0100 0000 0010<br />
2 1a: 1 ⇒ Prod = Prod + Mc<strong>and</strong> 0001 0000 0100 0000 0110<br />
2: Shift left Multiplic<strong>and</strong> 0001 0000 1000 0000 0110<br />
3: Shift right Multiplier 0000 0000 1000 0000 0110<br />
3 1: 0 ⇒ No operation 0000 0000 1000 0000 0110<br />
2: Shift left Multiplic<strong>and</strong> 0000 0001 0000 0000 0110<br />
3: Shift right Multiplier 0000 0001 0000 0000 0110<br />
4 1: 0 ⇒ No operation 0000 0001 0000 0000 0110<br />
2: Shift left Multiplic<strong>and</strong> 0000 0010 0000 0000 0110<br />
3: Shift right Multiplier 0000 0010 0000 0000 0110<br />
FIGURE 3.6 Multiply example using algorithm in Figure 3.4. The bit examined to determine the<br />
next step is circled in color.<br />
Signed Multiplication<br />
So far, we have dealt with positive numbers. The easiest way to underst<strong>and</strong> how<br />
to deal with signed numbers is to first convert the multiplier <strong>and</strong> multiplic<strong>and</strong> to<br />
positive numbers <strong>and</strong> then remember the original signs. The algorithms should<br />
then be run for 31 iterations, leaving the signs out of the calculation. As we learned<br />
in grammar school, we need to negate the product only if the original signs disagree.<br />
It turns out that the last algorithm will work for signed numbers, provided that<br />
we remember that we are dealing with numbers that have infinite digits, <strong>and</strong> we are<br />
only representing them with 32 bits. Hence, the shifting steps would need to extend<br />
the sign of the product for signed numbers. When the algorithm completes, the<br />
lower word would have the 32-bit product.<br />
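The sign-handling recipe above can be sketched in C; the name signed_multiply is ours, and C’s built-in unsigned multiply stands in for the shift-and-add hardware.<br />

```c
#include <stdint.h>

/* Signed multiply via the text's first approach: convert both operands
   to magnitudes, remember the original signs, and negate the product
   only if the signs disagree. */
int64_t signed_multiply(int32_t a, int32_t b) {
    /* widen before negating so INT32_MIN is handled safely */
    uint32_t ua = (a < 0) ? (uint32_t)(-(int64_t)a) : (uint32_t)a;
    uint32_t ub = (b < 0) ? (uint32_t)(-(int64_t)b) : (uint32_t)b;
    uint64_t mag = (uint64_t)ua * ub;    /* unsigned multiply of magnitudes */
    int negative = (a < 0) != (b < 0);   /* negate only if signs disagree */
    return negative ? -(int64_t)mag : (int64_t)mag;
}
```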
Faster Multiplication<br />
Moore’s Law has provided so much more in resources that hardware designers can<br />
now build much faster multiplication hardware. Whether the multiplic<strong>and</strong> is to be<br />
added or not is known at the beginning of the multiplication by looking at each of<br />
the 32 multiplier bits. Faster multiplications are possible by essentially providing<br />
one 32-bit adder for each bit of the multiplier: one input is the multiplic<strong>and</strong> ANDed<br />
with a multiplier bit, <strong>and</strong> the other is the output of a prior adder.<br />
A straightforward approach would be to connect the outputs of adders on the<br />
right to the inputs of adders on the left, making a stack of adders 32 high. An<br />
alternative way to organize these 32 additions is in a parallel tree, as Figure 3.7<br />
shows. Instead of waiting for 32 add times, we wait just log2(32) or five 32-bit<br />
add times.<br />
3.4 Division<br />
The reciprocal operation of multiply is divide, an operation that is even less frequent<br />
<strong>and</strong> even more quirky. It even offers the opportunity to perform a mathematically<br />
invalid operation: dividing by 0.<br />
Let’s start with an example of long division using decimal numbers to recall the<br />
names of the oper<strong>and</strong>s <strong>and</strong> the grammar school division algorithm. For reasons<br />
similar to those in the previous section, we limit the decimal digits to just 0 or 1.<br />
The example is dividing 1,001,010 ten by 1000 ten:<br />
1001 ten Quotient<br />
Divisor 1000 ten 1001010 ten Dividend<br />
−1000<br />
10<br />
101<br />
1010<br />
−1000<br />
10 ten Remainder<br />
Divide’s two oper<strong>and</strong>s, called the dividend <strong>and</strong> divisor, <strong>and</strong> the result, called<br />
the quotient, are accompanied by a second result, called the remainder. Here is<br />
another way to express the relationship between the components:<br />
Dividend = Quotient × Divisor + Remainder<br />
where the remainder is smaller than the divisor. Infrequently, programs use the<br />
divide instruction just to get the remainder, ignoring the quotient.<br />
The basic grammar school division algorithm tries to see how big a number<br />
can be subtracted, creating a digit of the quotient on each attempt. Our carefully<br />
selected decimal example uses only the numbers 0 <strong>and</strong> 1, so it’s easy to figure out<br />
how many times the divisor goes into the portion of the dividend: it’s either 0 times<br />
or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to<br />
these two choices, thereby simplifying binary division.<br />
Let’s assume that both the dividend <strong>and</strong> the divisor are positive <strong>and</strong> hence the<br />
quotient <strong>and</strong> the remainder are nonnegative. The division oper<strong>and</strong>s <strong>and</strong> both<br />
results are 32-bit values, <strong>and</strong> we will ignore the sign for now.<br />
A Division Algorithm <strong>and</strong> Hardware<br />
Figure 3.8 shows hardware to mimic our grammar school algorithm. We start with<br />
the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move<br />
the divisor to the right one digit, so we start with the divisor placed in the left half<br />
of the 64-bit Divisor register <strong>and</strong> shift it right 1 bit each step to align it with the<br />
dividend. The Remainder register is initialized with the dividend.<br />
Divide et impera.<br />
Latin for “Divide <strong>and</strong><br />
rule,” ancient political<br />
maxim cited by<br />
Machiavelli, 1532<br />
dividend A number being divided.<br />
divisor A number that the dividend is divided by.<br />
quotient The primary result of a division; a number that when multiplied by the<br />
divisor <strong>and</strong> added to the remainder produces the dividend.<br />
remainder The secondary result of a division; a number that when added to the<br />
product of the quotient <strong>and</strong> the divisor produces the dividend.<br />
Start<br />
1. Subtract the Divisor register from the<br />
Remainder register <strong>and</strong> place the<br />
result in the Remainder register<br />
Remainder ≥ 0<br />
Test Remainder<br />
Remainder < 0<br />
2a. Shift the Quotient register to the left,<br />
setting the new rightmost bit to 1<br />
2b. Restore the original value by adding<br />
the Divisor register to the Remainder<br />
register <strong>and</strong> placing the sum in the<br />
Remainder register. Also shift the<br />
Quotient register to the left, setting the<br />
new least significant bit to 0<br />
3. Shift the Divisor register right 1 bit<br />
33rd repetition?<br />
No: < 33 repetitions<br />
Yes: 33 repetitions<br />
Done<br />
FIGURE 3.9 A division algorithm, using the hardware in Figure 3.8. If the remainder is positive,<br />
the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after<br />
step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient <strong>and</strong> adds<br />
the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the<br />
divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times.<br />
This algorithm <strong>and</strong> hardware can be refined to be faster <strong>and</strong> cheaper. The speedup<br />
comes from shifting the oper<strong>and</strong>s <strong>and</strong> the quotient simultaneously with the<br />
subtraction. This refinement halves the width of the adder <strong>and</strong> registers by noticing<br />
where there are unused portions of registers <strong>and</strong> adders. Figure 3.11 shows the<br />
revised hardware.
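The restoring algorithm of Figure 3.9 can be sketched in C for unsigned oper<strong>and</strong>s; the struct and names are ours, while the 33 iterations and the placement of the divisor in the left half of a double-width register follow the figure.<br />

```c
#include <stdint.h>

typedef struct { uint32_t quotient, remainder; } divresult;

/* Restoring division (Figure 3.9), unsigned 32-bit operands. */
divresult divide(uint32_t dividend, uint32_t divisor) {
    uint64_t rem = dividend;                 /* Remainder register = dividend */
    uint64_t div = (uint64_t)divisor << 32;  /* Divisor starts in left half */
    uint32_t quo = 0;
    for (int i = 0; i < 33; i++) {           /* 33 repetitions */
        rem -= div;                          /* step 1: subtract */
        if ((int64_t)rem >= 0) {
            quo = (quo << 1) | 1;            /* step 2a: quotient bit is 1 */
        } else {
            rem += div;                      /* step 2b: restore the remainder */
            quo = quo << 1;                  /* quotient bit is 0 */
        }
        div >>= 1;                           /* step 3: shift Divisor right */
    }
    divresult r = { quo, (uint32_t)rem };
    return r;
}
```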
Elaboration: The one complication of signed division is that we must also set the sign<br />
of the remainder. Remember that the following equation must always hold:<br />
Dividend = Quotient × Divisor + Remainder<br />
To underst<strong>and</strong> how to set the sign of the remainder, let’s look at the example of dividing<br />
all the combinations of ±7 ten by ±2 ten. The first case is easy:<br />
+7 ÷ +2: Quotient = +3, Remainder = +1<br />
Checking the results:<br />
7 = 3 × 2 + (+1) = 6 + 1<br />
If we change the sign of the dividend, the quotient must change as well:<br />
−7 ÷ +2: Quotient = −3<br />
Rewriting our basic formula to calculate the remainder:<br />
Remainder = (Dividend − Quotient × Divisor) = −7 − (−3 × +2)<br />
= −7 − (−6) = −1<br />
So,<br />
−7 ÷ +2: Quotient = −3, Remainder = −1<br />
Checking the results again:<br />
−7 = −3 × 2 + (−1) = −6 − 1<br />
The reason the answer isn’t a quotient of −4 <strong>and</strong> a remainder of +1, which would also<br />
fit this formula, is that the absolute value of the quotient would then change depending<br />
on the sign of the dividend <strong>and</strong> the divisor! Clearly, if<br />
−(x ÷ y) ≠ (−x) ÷ y<br />
programming would be an even greater challenge. This anomalous behavior is avoided<br />
by following the rule that the dividend <strong>and</strong> remainder must have the same signs, no<br />
matter what the signs of the divisor <strong>and</strong> quotient.<br />
We calculate the other combinations by following the same rule:<br />
+7 ÷ −2: Quotient = −3, Remainder = +1<br />
−7 ÷ −2: Quotient = +3, Remainder = −1<br />
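As a check on the rule, C (since the C99 st<strong>and</strong>ard) defines / <strong>and</strong> % to behave exactly this way: quotients truncate toward zero, so the remainder always takes the sign of the dividend. All four combinations of ±7 <strong>and</strong> ±2 can be verified directly; the helper name is ours.<br />

```c
#include <assert.h>

/* C99's / and % follow the same rule the text derives: truncation
   toward zero, remainder with the dividend's sign. */
void check_sign_rule(void) {
    assert(+7 / +2 == +3 && +7 % +2 == +1);
    assert(-7 / +2 == -3 && -7 % +2 == -1);  /* not Q = -4, R = +1 */
    assert(+7 / -2 == -3 && +7 % -2 == +1);
    assert(-7 / -2 == +3 && -7 % -2 == -1);
}
```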
196 Chapter 3 Arithmetic for <strong>Computer</strong>s<br />
Hardware/<br />
Software<br />
Interface<br />
MIPS divide instructions ignore overflow, so software must determine whether the<br />
quotient is too large. In addition to overflow, division can also result in an improper<br />
calculation: division by 0. Some computers distinguish these two anomalous events.<br />
MIPS software must check the divisor to discover division by 0 as well as overflow.<br />
Elaboration: An even faster algorithm does not immediately add the divisor back<br />
if the remainder is negative. It simply adds the dividend to the shifted remainder in<br />
the following step, since (r + d) × 2 − d = r × 2 + d × 2 − d = r × 2 + d. This<br />
nonrestoring division algorithm, which takes 1 clock cycle per step, is explored further<br />
in the exercises; the algorithm above is called restoring division. A third algorithm that<br />
doesn’t save the result of the subtract if it’s negative is called a nonperforming division<br />
algorithm. It averages one-third fewer arithmetic operations.<br />
3.5 Floating Point<br />
Speed gets you<br />
nowhere if you’re<br />
headed the wrong way.<br />
American proverb<br />
scientific notation<br />
A notation that renders<br />
numbers with a single<br />
digit to the left of the<br />
decimal point.<br />
normalized A number<br />
in floating-point notation<br />
that has no leading 0s.<br />
Going beyond signed <strong>and</strong> unsigned integers, programming languages support<br />
numbers with fractions, which are called reals in mathematics. Here are some<br />
examples of reals:<br />
3.14159265… ten (pi)<br />
2.71828… ten (e)<br />
0.000000001 ten or 1.0 ten × 10^−9 (seconds in a nanosecond)<br />
3,155,760,000 ten or 3.15576 ten × 10^9 (seconds in a typical century)<br />
Notice that in the last case, the number didn’t represent a small fraction, but it<br />
was bigger than we could represent with a 32-bit signed integer. The alternative<br />
notation for the last two numbers is called scientific notation, which has a single<br />
digit to the left of the decimal point. A number in scientific notation that has no<br />
leading 0s is called a normalized number, which is the usual way to write it. For<br />
example, 1.0 ten × 10^−9 is in normalized scientific notation, but 0.1 ten × 10^−8 <strong>and</strong><br />
10.0 ten × 10^−10 are not.<br />
Just as we can show decimal numbers in scientific notation, we can also show<br />
binary numbers in scientific notation:<br />
1.0 two × 2^−1<br />
To keep a binary number in normalized form, we need a base that we can increase<br />
or decrease by exactly the number of bits the number must be shifted to have one<br />
nonzero digit to the left of the decimal point. Only a base of 2 fulfills our need. Since<br />
the base is not 10, we also need a new name for decimal point; binary point will do fine.
3.5 Floating Point 197<br />
<strong>Computer</strong> arithmetic that supports such numbers is called floating point<br />
because it represents numbers in which the binary point is not fixed, as it is for<br />
integers. The programming language C uses the name float for such numbers. Just<br />
as in scientific notation, numbers are represented as a single nonzero digit to the<br />
left of the binary point. In binary, the form is<br />
floating point<br />
<strong>Computer</strong> arithmetic that<br />
represents numbers in<br />
which the binary point is<br />
not fixed.<br />
1.xxxxxxxxx two × 2^yyyy<br />
(Although the computer represents the exponent in base 2 as well as the rest of the<br />
number, to simplify the notation we show the exponent in decimal.)<br />
A st<strong>and</strong>ard scientific notation for reals in normalized form offers three<br />
advantages. It simplifies exchange of data that includes floating-point numbers;<br />
it simplifies the floating-point arithmetic algorithms to know that numbers will<br />
always be in this form; <strong>and</strong> it increases the accuracy of the numbers that can be<br />
stored in a word, since the unnecessary leading 0s are replaced by real digits to the<br />
right of the binary point.<br />
Floating-Point Representation<br />
A designer of a floating-point representation must find a compromise between the<br />
size of the fraction <strong>and</strong> the size of the exponent, because a fixed word size means<br />
you must take a bit from one to add a bit to the other. This tradeoff is between<br />
precision <strong>and</strong> range: increasing the size of the fraction enhances the precision<br />
of the fraction, while increasing the size of the exponent increases the range of<br />
numbers that can be represented. As our design guideline from Chapter 2 reminds<br />
us, good design dem<strong>and</strong>s good compromise.<br />
Floating-point numbers are usually a multiple of the size of a word. The<br />
representation of a MIPS floating-point number is shown below, where s is the sign<br />
of the floating-point number (1 meaning negative), exponent is the value of the<br />
8-bit exponent field (including the sign of the exponent), <strong>and</strong> fraction is the 23-bit<br />
number. As we recall from Chapter 2, this representation is sign <strong>and</strong> magnitude,<br />
since the sign is a separate bit from the rest of the number.<br />
fraction The value, generally between 0 and 1, placed in the fraction field. The fraction is also called the mantissa.<br />
exponent In the numerical representation system of floating-point arithmetic, the value that is placed in the exponent field.<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
s exponent fraction<br />
1 bit 8 bits 23 bits<br />
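Concretely, these three fields can be pulled out of a C float with shifts and masks. This is an illustrative sketch (the struct and function names are ours), assuming the IEEE 754 single precision layout shown above; memcpy reinterprets the 32 bits without violating C aliasing rules:<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three fields of a single precision number: 1-bit sign,
   8-bit exponent, 23-bit fraction, as in the MIPS layout above. */
struct fp_parts { uint32_t s, exponent, fraction; };

static struct fp_parts fp_fields(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* reinterpret the 32 bits   */
    struct fp_parts p = {
        bits >> 31,                     /* bit 31: sign              */
        (bits >> 23) & 0xFF,            /* bits 30..23: exponent     */
        bits & 0x7FFFFF                 /* bits 22..0: fraction      */
    };
    return p;
}
```

For example, 1.0 has sign 0, exponent field 127, and fraction field 0 under this encoding.<br />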
In general, floating-point numbers are of the form<br />
(−1)^S × F × 2^E<br />
F involves the value in the fraction field and E involves the value in the exponent<br />
field; the exact relationship to these fields will be spelled out soon. (We will shortly<br />
see that MIPS does something slightly more sophisticated.)
198 Chapter 3 Arithmetic for <strong>Computer</strong>s<br />
overflow (floating-point) A situation in which a positive exponent becomes too large to fit in the exponent field.<br />
underflow (floating-point) A situation in which a negative exponent becomes too large to fit in the exponent field.<br />
double precision A floating-point value represented in two 32-bit words.<br />
single precision A floating-point value represented in a single 32-bit word.<br />
These chosen sizes of exponent and fraction give MIPS computer arithmetic an<br />
extraordinary range. Fractions almost as small as 2.0_ten × 10^−38 and numbers<br />
almost as large as 2.0_ten × 10^38 can be represented in a computer. Alas, extraordinary<br />
differs from infinite, so it is still possible for numbers to be too large. Thus, overflow<br />
interrupts can occur in floating-point arithmetic as well as in integer arithmetic.<br />
Notice that overflow here means that the exponent is too large to be represented<br />
in the exponent field.<br />
Floating point offers a new kind of exceptional event as well. Just as programmers<br />
will want to know when they have calculated a number that is too large to be<br />
represented, they will want to know if the nonzero fraction they are calculating<br />
has become so small that it cannot be represented; either event could result in a<br />
program giving incorrect answers. To distinguish it from overflow, we call this<br />
event underflow. This situation occurs when the negative exponent is too large to<br />
fit in the exponent field.<br />
One way to reduce chances of underflow or overflow is to offer another format<br />
that has a larger exponent. In C this number is called double, <strong>and</strong> operations on<br />
doubles are called double precision floating-point arithmetic; single precision<br />
floating point is the name of the earlier format.<br />
The representation of a double precision floating-point number takes two MIPS<br />
words, as shown below, where s is still the sign of the number, exponent is the value<br />
of the 11-bit exponent field, <strong>and</strong> fraction is the 52-bit number in the fraction field.<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
s exponent fraction<br />
1 bit 11 bits 20 bits<br />
fraction (continued)<br />
32 bits<br />
MIPS double precision allows numbers almost as small as 2.0_ten × 10^−308 and<br />
almost as large as 2.0_ten × 10^308. Although double precision does increase the exponent<br />
range, its primary advantage is its greater precision because of the much larger<br />
fraction.<br />
These formats go beyond MIPS. They are part of the IEEE 754 floating-point<br />
st<strong>and</strong>ard, found in virtually every computer invented since 1980. This st<strong>and</strong>ard has<br />
greatly improved both the ease of porting floating-point programs <strong>and</strong> the quality<br />
of computer arithmetic.<br />
To pack even more bits into the signific<strong>and</strong>, IEEE 754 makes the leading 1-bit<br />
of normalized binary numbers implicit. Hence, the number is actually 24 bits long<br />
in single precision (implied 1 and a 23-bit fraction), and 53 bits long in double<br />
precision (1 + 52). To be precise, we use the term significand to represent the 24-<br />
or 53-bit number that is 1 plus the fraction, <strong>and</strong> fraction when we mean the 23- or<br />
52-bit number. Since 0 has no leading 1, it is given the reserved exponent value 0 so<br />
that the hardware won’t attach a leading 1 to it.
Single precision Double precision Object represented<br />
Exponent Fraction Exponent Fraction<br />
0 0 0 0 0<br />
0 Nonzero 0 Nonzero ± denormalized number<br />
1–254 Anything 1–2046 Anything ± floating-point number<br />
255 0 2047 0 ± infinity<br />
255 Nonzero 2047 Nonzero NaN (Not a Number)<br />
FIGURE 3.13 IEEE 754 encoding of floating-point numbers. A separate sign bit determines the<br />
sign. Denormalized numbers are described in the Elaboration on page 222. This information is also found in<br />
Column 4 of the MIPS Reference Data Card at the front of this book.<br />
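The cases of Figure 3.13 can be checked mechanically in C. This is a hedged sketch (fp_class and its return strings are our own labels for the table's rows), assuming IEEE 754 single precision floats:<br />

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Classify a single precision value using the exponent/fraction
   encodings of Figure 3.13. */
static const char *fp_class(float f)
{
    uint32_t bits, exponent, fraction;
    memcpy(&bits, &f, sizeof bits);
    exponent = (bits >> 23) & 0xFF;     /* 8-bit exponent field      */
    fraction = bits & 0x7FFFFF;         /* 23-bit fraction field     */
    if (exponent == 0)
        return fraction == 0 ? "zero" : "denormalized number";
    if (exponent == 255)
        return fraction == 0 ? "infinity" : "NaN";
    return "floating-point number";     /* exponent in 1..254        */
}
```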
Thus 00…00_two represents 0; the representation of the rest of the numbers uses<br />
the form from before with the hidden 1 added:<br />
(−1)^S × (1 + Fraction) × 2^E<br />
where the bits of the fraction represent a number between 0 and 1 and E specifies<br />
the value in the exponent field, to be given in detail shortly. If we number the bits<br />
of the fraction from left to right s1, s2, s3, …, then the value is<br />
(−1)^S × (1 + (s1 × 2^−1) + (s2 × 2^−2) + (s3 × 2^−3) + (s4 × 2^−4) + …) × 2^E<br />
Figure 3.13 shows the encodings of IEEE 754 floating-point numbers. Other<br />
features of IEEE 754 are special symbols to represent unusual events. For example,<br />
instead of interrupting on a divide by 0, software can set the result to a bit pattern<br />
representing +∞ or −∞; the largest exponent is reserved for these special symbols.<br />
When the programmer prints the results, the program will print an infinity symbol.<br />
(For the mathematically trained, the purpose of infinity is to form topological<br />
closure of the reals.)<br />
IEEE 754 even has a symbol for the result of invalid operations, such as 0/0<br />
or subtracting infinity from infinity. This symbol is NaN, for Not a Number. The<br />
purpose of NaNs is to allow programmers to postpone some tests <strong>and</strong> decisions to<br />
a later time in the program when they are convenient.<br />
The designers of IEEE 754 also wanted a floating-point representation that could<br />
be easily processed by integer comparisons, especially for sorting. This desire is<br />
why the sign is in the most significant bit, allowing a quick test of less than, greater<br />
than, or equal to 0. (It’s a little more complicated than a simple integer sort, since<br />
this notation is essentially sign <strong>and</strong> magnitude rather than two’s complement.)<br />
Placing the exponent before the signific<strong>and</strong> also simplifies the sorting of<br />
floating-point numbers using integer comparison instructions, since numbers with<br />
bigger exponents look larger than numbers with smaller exponents, as long as both<br />
exponents have the same sign.
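For non-negative values this property is easy to demonstrate: comparing the raw bit patterns as unsigned integers gives the same ordering as comparing the floats. A sketch, assuming IEEE 754 single precision (less_by_bits is our name):<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Compare two non-negative floats using only an integer comparison of
   their bit patterns. Because the biased exponent sits above the
   fraction, a bigger exponent makes the whole word compare bigger. */
static int less_by_bits(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua < ub;
}
```

For negative operands the sign-and-magnitude encoding reverses the order, which is exactly the complication mentioned above.<br />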
Negative exponents pose a challenge to simplified sorting. If we use two’s<br />
complement or any other notation in which negative exponents have a 1 in the<br />
most significant bit of the exponent field, a negative exponent will look like a big<br />
number. For example, 1.0_two × 2^−1 would be represented as<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />
(Remember that the leading 1 is implicit in the significand.) The value<br />
1.0_two × 2^+1 would look like the smaller binary number<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />
The desirable notation must therefore represent the most negative exponent as<br />
00…00_two and the most positive as 11…11_two. This convention is called biased<br />
notation, with the bias being the number subtracted from the normal, unsigned<br />
representation to determine the real value.<br />
IEEE 754 uses a bias of 127 for single precision, so an exponent of −1 is<br />
represented by the bit pattern of the value −1 + 127_ten, or 126_ten = 0111 1110_two,<br />
and +1 is represented by 1 + 127, or 128_ten = 1000 0000_two. The exponent bias for<br />
double precision is 1023. Biased exponent means that the value represented by a<br />
floating-point number is really<br />
(−1)^S × (1 + Fraction) × 2^(Exponent − Bias)<br />
The range of single precision numbers is then from as small as<br />
±1.00000000000000000000000_two × 2^−126<br />
to as large as<br />
±1.11111111111111111111111_two × 2^+127.<br />
Let's demonstrate.<br />
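Combining the hidden 1 with the biased exponent, the value of a normalized single precision number can be reconstructed from its raw fields. A sketch (fp_value is our name; ldexp from the C library scales by a power of 2):<br />

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Evaluate (-1)^S x (1 + Fraction) x 2^(Exponent - 127) from the raw
   fields of a normalized single precision number. */
static double fp_value(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    int    s        = bits >> 31;
    int    exponent = (bits >> 23) & 0xFF;
    double fraction = (double)(bits & 0x7FFFFF) / 8388608.0;  /* / 2^23 */
    return (s ? -1.0 : 1.0) * ldexp(1.0 + fraction, exponent - 127);
}
```

Every normalized float round-trips exactly, since double precision can hold any single precision value.<br />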
Floating-Point Representation<br />
Show the IEEE 754 binary representation of the number −0.75_ten in single and<br />
double precision.<br />
EXAMPLE<br />
ANSWER<br />
The number −0.75_ten is also<br />
−3/4_ten or −3/2^2_ten<br />
It is also represented by the binary fraction<br />
−11_two/2^2_ten or −0.11_two<br />
In scientific notation, the value is<br />
−0.11_two × 2^0<br />
and in normalized scientific notation, it is<br />
−1.1_two × 2^−1<br />
The general representation for a single precision number is<br />
(−1)^S × (1 + Fraction) × 2^(Exponent − 127)<br />
Subtracting the bias 127 from the exponent of −1.1_two × 2^−1 yields<br />
(−1)^1 × (1 + .1000 0000 0000 0000 0000 000_two) × 2^(126 − 127)<br />
The single precision binary representation of −0.75_ten is then<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
1 bit 8 bits 23 bits<br />
The double precision representation is<br />
(−1)^1 × (1 + .1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000_two) × 2^(1022 − 1023)<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
1 bit 11 bits 20 bits<br />
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<br />
32 bits
Now let’s try going the other direction.<br />
EXAMPLE<br />
Converting Binary to Decimal Floating Point<br />
What decimal number is represented by this single precision float?<br />
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0<br />
1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .<br />
ANSWER<br />
The sign bit is 1, the exponent field contains 129, and the fraction field contains<br />
1 × 2^−2 = 1/4, or 0.25. Using the basic equation,<br />
(−1)^S × (1 + Fraction) × 2^(Exponent − Bias) = (−1)^1 × (1 + 0.25) × 2^(129 − 127)<br />
= −1 × 1.25 × 2^2<br />
= −1.25 × 4<br />
= −5.0<br />
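The answer can be double-checked mechanically: the pattern shown is 1100 0000 1010 0000 0000 0000 0000 0000_two, or 0xC0A00000 in hexadecimal. A small check, assuming IEEE 754 single precision floats in C (from_bits is our name):<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a single precision float. The
   pattern from the example has sign 1, exponent field 129, and
   fraction field 0100...0 (value 0.25). */
static float from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```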
In the next few subsections, we will give the algorithms for floating-point<br />
addition <strong>and</strong> multiplication. At their core, they use the corresponding integer<br />
operations on the signific<strong>and</strong>s, but extra bookkeeping is necessary to h<strong>and</strong>le the<br />
exponents <strong>and</strong> normalize the result. We first give an intuitive derivation of the<br />
algorithms in decimal <strong>and</strong> then give a more detailed, binary version in the figures.<br />
Elaboration: Following IEEE guidelines, the IEEE 754 committee was re-formed 20<br />
years after the standard to see what changes, if any, should be made. The revised<br />
standard IEEE 754-2008 includes nearly all of IEEE 754-1985 and adds a 16-bit format<br />
(“half precision”) and a 128-bit format (“quadruple precision”). No hardware has yet been<br />
built that supports quadruple precision, but it will surely come. The revised standard<br />
also adds decimal floating-point arithmetic, which IBM mainframes have implemented.<br />
Elaboration: In an attempt to increase range without removing bits from the signific<strong>and</strong>,<br />
some computers before the IEEE 754 st<strong>and</strong>ard used a base other than 2. For example,<br />
the IBM 360 <strong>and</strong> 370 mainframe computers use base 16. Since changing the IBM<br />
exponent by one means shifting the signific<strong>and</strong> by 4 bits, “normalized” base 16 numbers<br />
can have up to 3 leading bits of 0s! Hence, hexadecimal digits mean that up to 3 bits must<br />
be dropped from the signific<strong>and</strong>, which leads to surprising problems in the accuracy of<br />
floating-point arithmetic. IBM mainframes now support IEEE 754 as well as the hex format.
Floating-Point Addition<br />
Let’s add numbers in scientific notation by hand to illustrate the problems in<br />
floating-point addition: 9.999_ten × 10^1 + 1.610_ten × 10^−1. Assume that we can store<br />
only four decimal digits of the significand and two decimal digits of the exponent.<br />
Step 1. To be able to add these numbers properly, we must align the decimal<br />
point of the number that has the smaller exponent. Hence, we need a<br />
form of the smaller number, 1.610_ten × 10^−1, that matches the<br />
larger exponent. We obtain this by observing that there are multiple<br />
representations of an unnormalized floating-point number in<br />
scientific notation:<br />
1.610_ten × 10^−1 = 0.1610_ten × 10^0 = 0.01610_ten × 10^1<br />
The number on the right is the version we desire, since its exponent<br />
matches the exponent of the larger number, 9.999_ten × 10^1. Thus, the<br />
first step shifts the significand of the smaller number to the right until<br />
its corrected exponent matches that of the larger number. But we can<br />
represent only four decimal digits so, after shifting, the number is<br />
really<br />
0.016_ten × 10^1<br />
Step 2. Next comes the addition of the significands:<br />
9.999_ten<br />
+ 0.016_ten<br />
= 10.015_ten<br />
The sum is 10.015_ten × 10^1.<br />
Step 3. This sum is not in normalized scientific notation, so we need to<br />
adjust it:<br />
10.015_ten × 10^1 = 1.0015_ten × 10^2<br />
Thus, after the addition we may have to shift the sum to put it into<br />
normalized form, adjusting the exponent appropriately. This example<br />
shows shifting to the right, but if one number were positive <strong>and</strong> the<br />
other were negative, it would be possible for the sum to have many<br />
leading 0s, requiring left shifts. Whenever the exponent is increased<br />
or decreased, we must check for overflow or underflow—that is, we<br />
must make sure that the exponent still fits in its field.<br />
Step 4. Since we assumed that the signific<strong>and</strong> can be only four digits long<br />
(excluding the sign), we must round the number. In our grammar<br />
school algorithm, the rules truncate the number if the digit to the<br />
right of the desired point is between 0 <strong>and</strong> 4 <strong>and</strong> add 1 to the digit if<br />
the number to the right is between 5 <strong>and</strong> 9. The number<br />
1.0015_ten × 10^2
is rounded to four digits in the signific<strong>and</strong> to<br />
1.002_ten × 10^2<br />
since the fourth digit to the right of the decimal point was between 5<br />
<strong>and</strong> 9. Notice that if we have bad luck on rounding, such as adding 1<br />
to a string of 9s, the sum may no longer be normalized <strong>and</strong> we would<br />
need to perform step 3 again.<br />
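The four steps can be mimicked with integer arithmetic on a four-digit decimal significand. The following toy model is our own sketch of this decimal example (positive operands only, truncation during alignment, round half up), not the binary hardware algorithm of Figure 3.14:<br />

```c
#include <assert.h>

/* Toy decimal floating point: a value is sig x 10^(exp - 3), where
   sig is a four-digit significand (1000..9999, i.e., 1.000..9.999). */
struct dfp { long sig; int exp; };

static struct dfp dfp_add(struct dfp a, struct dfp b)
{
    if (a.exp < b.exp) { struct dfp t = a; a = b; b = t; }
    /* Step 1: shift the smaller significand right until exponents match. */
    for (int d = a.exp - b.exp; d > 0; d--)
        b.sig /= 10;                    /* truncate digits shifted out */
    /* Step 2: add the significands. */
    long sum = a.sig + b.sig;
    int exp = a.exp;
    /* Step 3: normalize, adjusting the exponent. */
    if (sum >= 10000) {
        /* Step 4: round the digit that falls off the right end. */
        sum = (sum + 5) / 10;           /* round half up */
        exp += 1;
    }
    return (struct dfp){ sum, exp };
}
```

Running it on the example, {9999, 1} + {1610, −1} yields {1002, 2}, i.e., 1.002_ten × 10^2, matching the result above.<br />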
Figure 3.14 shows the algorithm for binary floating-point addition that follows<br />
this decimal example. Steps 1 <strong>and</strong> 2 are similar to the example just discussed:<br />
adjust the signific<strong>and</strong> of the number with the smaller exponent <strong>and</strong> then add the<br />
two signific<strong>and</strong>s. Step 3 normalizes the results, forcing a check for overflow or<br />
underflow. The test for overflow <strong>and</strong> underflow in step 3 depends on the precision<br />
of the oper<strong>and</strong>s. Recall that the pattern of all 0 bits in the exponent is reserved <strong>and</strong><br />
used for the floating-point representation of zero. Moreover, the pattern of all 1 bits<br />
in the exponent is reserved for indicating values <strong>and</strong> situations outside the scope of<br />
normal floating-point numbers (see the Elaboration on page 222). For the example<br />
below, remember that for single precision, the maximum exponent is 127, and the<br />
minimum exponent is −126.<br />
EXAMPLE<br />
Binary Floating-Point Addition<br />
Try adding the numbers 0.5_ten and −0.4375_ten in binary using the algorithm in<br />
Figure 3.14.<br />
ANSWER<br />
Let’s first look at the binary version of the two numbers in normalized scientific<br />
notation, assuming that we keep 4 bits of precision:<br />
0.5_ten = 1/2_ten = 1/2^1_ten = 0.1_two = 0.1_two × 2^0 = 1.000_two × 2^−1<br />
−0.4375_ten = −7/16_ten = −7/2^4_ten = −0.0111_two = −0.0111_two × 2^0 = −1.110_two × 2^−2<br />
Now we follow the algorithm:<br />
Step 1. The significand of the number with the lesser exponent (−1.110_two × 2^−2)<br />
is shifted right until its exponent matches the larger number:<br />
−1.110_two × 2^−2 = −0.111_two × 2^−1<br />
Step 2. Add the significands:<br />
1.000_two × 2^−1 + (−0.111_two × 2^−1) = 0.001_two × 2^−1
Step 3. Normalize the sum, checking for overflow or underflow:<br />
0.001_two × 2^−1 = 0.010_two × 2^−2 = 0.100_two × 2^−3 = 1.000_two × 2^−4<br />
Since 127 ≥ −4 ≥ −126, there is no overflow or underflow. (The<br />
biased exponent would be −4 + 127, or 123, which is between 1 and<br />
254, the smallest and largest unreserved biased exponents.)<br />
Step 4. Round the sum:<br />
1.000_two × 2^−4<br />
The sum already fits exactly in 4 bits, so there is no change to the bits<br />
due to rounding.<br />
This sum is then<br />
1.000_two × 2^−4 = 0.0001000_two = 0.0001_two = 1/2^4_ten = 1/16_ten = 0.0625_ten<br />
This sum is what we would expect from adding 0.5_ten to −0.4375_ten.<br />
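Since 0.5, −0.4375, and their sum 0.0625 are all exactly representable in single precision, ordinary C floats reproduce this example bit for bit (a check, not part of the book):<br />

```c
#include <assert.h>

/* 0.5 = 1.000_two x 2^-1 and -0.4375 = -1.110_two x 2^-2 are exact in
   IEEE 754, so hardware addition yields exactly 1.000_two x 2^-4. */
static float example_sum(void)
{
    float a = 0.5f;
    float b = -0.4375f;
    return a + b;
}
```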
Many computers dedicate hardware to run floating-point operations as fast as possible.<br />
Figure 3.15 sketches the basic organization of hardware for floating-point addition.<br />
Floating-Point Multiplication<br />
Now that we have explained floating-point addition, let’s try floating-point<br />
multiplication. We start by multiplying decimal numbers in scientific notation by<br />
hand: 1.110_ten × 10^10 × 9.200_ten × 10^−5. Assume that we can store only four digits<br />
of the significand and two digits of the exponent.<br />
Step 1. Unlike addition, we calculate the exponent of the product by simply<br />
adding the exponents of the operands together:<br />
New exponent = 10 + (−5) = 5<br />
Let’s do this with the biased exponents as well to make sure we obtain<br />
the same result: 10 + 127 = 137, and −5 + 127 = 122, so<br />
New exponent = 137 + 122 = 259<br />
This result is too large for the 8-bit exponent field, so something is<br />
amiss! The problem is with the bias because we are adding the biases<br />
as well as the exponents:<br />
New exponent = (10 + 127) + (−5 + 127) = (5 + 2 × 127) = 259<br />
Accordingly, to get the correct biased sum when we add biased numbers,<br />
we must subtract the bias from the sum:
FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond<br />
to each block, from top to bottom. First, the exponent of one oper<strong>and</strong> is subtracted from the other using the small ALU to determine which is<br />
larger <strong>and</strong> by how much. This difference controls the three multiplexors; from left to right, they select the larger exponent, the signific<strong>and</strong> of the<br />
smaller number, <strong>and</strong> the signific<strong>and</strong> of the larger number. The smaller signific<strong>and</strong> is shifted right, <strong>and</strong> then the signific<strong>and</strong>s are added together<br />
using the big ALU. The normalization step then shifts the sum left or right <strong>and</strong> increments or decrements the exponent. Rounding then creates<br />
the final result, which may require normalizing again to produce the actual final result.
New exponent = 137 + 122 − 127 = 259 − 127 = 132 = (5 + 127)<br />
and 5 is indeed the exponent we calculated initially.<br />
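This bias correction is a single integer subtraction. A sketch for single precision, with its bias of 127 (the function name is ours):<br />

```c
#include <assert.h>

/* Add two biased exponents. Each operand carries one copy of the bias,
   so one bias must be subtracted to leave a single copy in the sum. */
static int add_biased_exponents(int biased1, int biased2)
{
    return biased1 + biased2 - 127;
}
```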
Step 2. Next comes the multiplication of the significands:<br />
1.110_ten<br />
× 9.200_ten<br />
0000<br />
0000<br />
2220<br />
9990<br />
10212000_ten<br />
There are three digits to the right of the decimal point for each<br />
operand, so the decimal point is placed six digits from the right in the<br />
product significand:<br />
10.212000_ten<br />
Assuming that we can keep only three digits to the right of the decimal<br />
point, the product is 10.212 × 10^5.<br />
Step 3. This product is unnormalized, so we need to normalize it:<br />
10.212_ten × 10^5 = 1.0212_ten × 10^6<br />
Thus, after the multiplication, the product can be shifted right one digit<br />
to put it in normalized form, adding 1 to the exponent. At this point,<br />
we can check for overflow <strong>and</strong> underflow. Underflow may occur if both<br />
oper<strong>and</strong>s are small—that is, if both have large negative exponents.<br />
Step 4. We assumed that the signific<strong>and</strong> is only four digits long (excluding the<br />
sign), so we must round the number. The number<br />
1.0212_ten × 10^6<br />
is rounded to four digits in the signific<strong>and</strong> to<br />
1.021_ten × 10^6<br />
Step 5. The sign of the product depends on the signs of the original oper<strong>and</strong>s.<br />
If they are both the same, the sign is positive; otherwise, it’s negative.<br />
Hence, the product is<br />
+1.021_ten × 10^6<br />
The sign of the sum in the addition algorithm was determined by<br />
addition of the signific<strong>and</strong>s, but in multiplication, the sign of the<br />
product is determined by the signs of the oper<strong>and</strong>s.
Once again, as Figure 3.16 shows, multiplication of binary floating-point numbers<br />
is quite similar to the steps we have just completed. We start with calculating<br />
the new exponent of the product by adding the biased exponents, being sure to<br />
subtract one bias to get the proper result. Next is multiplication of signific<strong>and</strong>s,<br />
followed by an optional normalization step. The size of the exponent is checked<br />
for overflow or underflow, <strong>and</strong> then the product is rounded. If rounding leads to<br />
further normalization, we once again check for exponent size. Finally, set the sign<br />
bit to 1 if the signs of the oper<strong>and</strong>s were different (negative product) or to 0 if they<br />
were the same (positive product).<br />
EXAMPLE<br />
ANSWER<br />
Binary Floating-Point Multiplication<br />
Let’s try multiplying the numbers 0.5_ten and −0.4375_ten, using the steps in<br />
Figure 3.16.<br />
In binary, the task is multiplying 1.000_two × 2^−1 by −1.110_two × 2^−2.<br />
Step 1. Adding the exponents without bias:<br />
−1 + (−2) = −3<br />
or, using the biased representation:<br />
(−1 + 127) + (−2 + 127) − 127 = (−1 − 2) + (127 + 127 − 127)<br />
= −3 + 127 = 124<br />
Step 2. Multiplying the significands:<br />
1.000_two<br />
× 1.110_two<br />
0000<br />
1000<br />
1000<br />
1000<br />
1110000_two<br />
The product is 1.110000_two × 2^−3, but we need to keep it to 4 bits, so it<br />
is 1.110_two × 2^−3.<br />
Step 3. Now we check the product to make sure it is normalized, and then<br />
check the exponent for overflow or underflow. The product is already<br />
normalized and, since 127 ≥ −3 ≥ −126, there is no overflow or<br />
underflow. (Using the biased representation, 254 ≥ 124 ≥ 1, so the<br />
exponent fits.)<br />
Step 4. Rounding the product makes no change:<br />
1.110_two × 2^−3
Step 5. Since the signs of the original operands differ, make the sign of the<br />
product negative. Hence, the product is<br />
−1.110_two × 2^−3<br />
Converting to decimal to check our results:<br />
−1.110_two × 2^−3 = −0.001110_two = −0.00111_two = −7/2^5_ten = −7/32_ten = −0.21875_ten<br />
The product of 0.5_ten and −0.4375_ten is indeed −0.21875_ten.<br />
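In hardware, Step 5 amounts to a single XOR of the two sign bits. A sketch in C (product_sign is our name, assuming IEEE 754 floats), plus a check that the product itself comes out exactly:<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The sign bit of a product is the XOR of the operands' sign bits;
   the magnitudes never enter into it. */
static uint32_t product_sign(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return (ua ^ ub) >> 31;
}
```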
Floating-Point Instructions in MIPS<br />
MIPS supports the IEEE 754 single precision <strong>and</strong> double precision formats with<br />
these instructions:<br />
■ Floating-point addition, single (add.s) <strong>and</strong> addition, double (add.d)<br />
■ Floating-point subtraction, single (sub.s) <strong>and</strong> subtraction, double (sub.d)<br />
■ Floating-point multiplication, single (mul.s) <strong>and</strong> multiplication, double (mul.d)<br />
■ Floating-point division, single (div.s) <strong>and</strong> division, double (div.d)<br />
■ Floating-point comparison, single (c.x.s) <strong>and</strong> comparison, double (c.x.d),<br />
where x may be equal (eq), not equal (neq), less than (lt), less than or equal<br />
(le), greater than (gt), or greater than or equal (ge)<br />
■ Floating-point branch, true (bc1t) <strong>and</strong> branch, false (bc1f)<br />
Floating-point comparison sets a bit to true or false, depending on the comparison<br />
condition, <strong>and</strong> a floating-point branch then decides whether or not to branch,<br />
depending on the condition.<br />
The MIPS designers decided to add separate floating-point registers—called<br />
$f0, $f1, $f2, …—used either for single precision or double precision. Hence,<br />
they included separate loads <strong>and</strong> stores for floating-point registers: lwc1 <strong>and</strong><br />
swc1. The base registers for floating-point data transfers, which are used for<br />
addresses, remain integer registers. The MIPS code to load two single precision<br />
numbers from memory, add them, and then store the sum might look like this:<br />
lwc1 $f4,c($sp) # Load 32-bit F.P. number into F4<br />
lwc1 $f6,a($sp) # Load 32-bit F.P. number into F6<br />
add.s $f2,$f4,$f6 # F2 = F4 + F6 single precision<br />
swc1 $f2,b($sp) # Store 32-bit F.P. number from F2<br />
A double precision register is really an even-odd pair of single precision registers,<br />
using the even register number as its name. Thus, the pair of single precision<br />
registers $f2 <strong>and</strong> $f3 also form the double precision register named $f2.<br />
Figure 3.17 summarizes the floating-point portion of the MIPS architecture revealed<br />
in this chapter, with the additions to support floating point shown in color. Similar to<br />
Figure 2.19 in Chapter 2, Figure 3.18 shows the encoding of these instructions.
op(31:26):<br />
28–26<br />
0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />
31–29<br />
0(000) Rfmt Bltz/gez j jal beq bne blez bgtz<br />
1(001) addi addiu slti sltiu andi ori xori lui<br />
2(010) TLB FlPt<br />
3(011)<br />
4(100) lb lh lwl lw lbu lhu lwr<br />
5(101) sb sh swl sw swr<br />
6(110) lwc0 lwc1<br />
7(111) swc0 swc1<br />
op(31:26) = 010001 (FlPt), (rt(16:16) = 0 => c = f, rt(16:16) = 1 => c = t), rs(25:21):<br />
23–21<br />
0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />
25–24<br />
0(00) mfc1 cfc1 mtc1 ctc1<br />
1(01) bc1.c<br />
2(10) f = single f = double<br />
3(11)<br />
op(31:26) = 010001 (FlPt), (f above: 10000 => f = s, 10001 => f = d), funct(5:0):<br />
2–0<br />
0(000) 1(001) 2(010) 3(011) 4(100) 5(101) 6(110) 7(111)<br />
5–3<br />
0(000) add.f sub.f mul.f div.f abs.f mov.f neg.f<br />
1(001)<br />
2(010)<br />
3(011)<br />
4(100) cvt.s.f cvt.d.f cvt.w.f<br />
5(101)<br />
6(110) c.f.f c.un.f c.eq.f c.ueq.f c.olt.f c.ult.f c.ole.f c.ule.f<br />
7(111) c.sf.f c.ngle.f c.seq.f c.ngl.f c.lt.f c.nge.f c.le.f c.ngt.f<br />
FIGURE 3.18 MIPS floating-point instruction encoding. This notation gives the value of a field by row and by column. For example,<br />
in the top portion of the figure, lw is found in row number 4 (100_two for bits 31–29 of the instruction) and column number 3 (011_two for bits<br />
28–26 of the instruction), so the corresponding value of the op field (bits 31–26) is 100011_two. Underscore means the field is used elsewhere.<br />
For example, FlPt in row 2 and column 1 (op = 010001_two) is defined in the bottom part of the figure. Hence sub.f in row 0 and column 1 of<br />
the bottom section means that the funct field (bits 5–0) of the instruction is 000001_two and the op field (bits 31–26) is 010001_two. Note that the<br />
5-bit rs field, specified in the middle portion of the figure, determines whether the operation is single precision (f = s, so rs = 10000) or double<br />
precision (f = d, so rs = 10001). Similarly, bit 16 of the instruction determines if the bc1.c instruction tests for true (bit 16 = 1 => bc1.t)<br />
or false (bit 16 = 0 => bc1.f). Instructions in color are described in Chapter 2 or this chapter, with Appendix A covering all instructions.<br />
This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.
Hardware/Software Interface
One issue that architects face in supporting floating-point arithmetic is whether to use the same registers used by the integer instructions or to add a special set for floating point. Because programs normally perform integer operations and floating-point operations on different data, separating the registers will only slightly increase the number of instructions needed to execute a program. The major impact is to create a separate set of data transfer instructions to move data between floating-point registers and memory.
The benefits of separate floating-point registers are having twice as many registers without using up more bits in the instruction format, having twice the register bandwidth by having separate integer and floating-point register sets, and being able to customize registers to floating point; for example, some computers convert all sized operands in registers into a single internal format.
EXAMPLE<br />
Compiling a Floating-Point C Program into MIPS Assembly Code<br />
Let’s convert a temperature in Fahrenheit to Celsius:<br />
float f2c (float fahr)<br />
{<br />
return ((5.0/9.0) * (fahr - 32.0));
}<br />
Assume that the floating-point argument fahr is passed in $f12 and the
result should go in $f0. (Unlike integer registers, floating-point register 0 can<br />
contain a number.) What is the MIPS assembly code?<br />
ANSWER<br />
We assume that the compiler places the three floating-point constants in<br />
memory within easy reach of the global pointer $gp. The first two instructions<br />
load the constants 5.0 and 9.0 into floating-point registers:
f2c:<br />
lwc1 $f16,const5($gp) # $f16 = 5.0 (5.0 in memory)<br />
lwc1 $f18,const9($gp) # $f18 = 9.0 (9.0 in memory)<br />
They are then divided to get the fraction 5.0/9.0:<br />
div.s $f16, $f16, $f18 # $f16 = 5.0 / 9.0
(Many compilers would divide 5.0 by 9.0 at compile time and save the single constant 5.0/9.0 in memory, thereby avoiding the divide at runtime.) Next, we load the constant 32.0 and then subtract it from fahr ($f12):
lwc1 $f18, const32($gp) # $f18 = 32.0
sub.s $f18, $f12, $f18 # $f18 = fahr - 32.0
Finally, we multiply the two intermediate results, placing the product in $f0 as the return result, and then return:
mul.s $f0, $f16, $f18 # $f0 = (5/9)*(fahr - 32.0)
jr $ra # return
Now let’s perform floating-point operations on matrices, code commonly<br />
found in scientific programs.<br />
EXAMPLE
Compiling a Floating-Point C Procedure with Two-Dimensional Matrices into MIPS
Most floating-point calculations are performed in double precision. Let's perform matrix multiply of C = C + A * B. It is commonly called DGEMM, for Double precision, General Matrix Multiply. We'll see versions of DGEMM again in Section 3.8 and subsequently in Chapters 4, 5, and 6. Let's assume C, A, and B are all square matrices with 32 elements in each dimension.
void mm (double c[][], double a[][], double b[][])<br />
{<br />
int i, j, k;<br />
for (i = 0; i != 32; i = i + 1)<br />
for (j = 0; j != 32; j = j + 1)<br />
for (k = 0; k != 32; k = k + 1)<br />
c[i][j] = c[i][j] + a[i][k] *b[k][j];<br />
}<br />
The array starting addresses are parameters, so they are in $a0, $a1, and $a2.
Assume that the integer variables are in $s0, $s1, <strong>and</strong> $s2, respectively.<br />
What is the MIPS assembly code for the body of the procedure?<br />
ANSWER
Note that c[i][j] is used in the innermost loop above. Since the loop index is k, the index does not affect c[i][j], so we can avoid loading and storing c[i][j] each iteration. Instead, the compiler loads c[i][j] into a register outside the loop, accumulates the sum of the products of a[i][k] and
b[k][j] in that same register, <strong>and</strong> then stores the sum into c[i][j] upon<br />
termination of the innermost loop.<br />
We keep the code simpler by using the assembly language pseudoinstructions li (which loads a constant into a register), and l.d and s.d (which the assembler turns into a pair of data transfer instructions, lwc1 or swc1, to a pair of floating-point registers).
The body of the procedure starts with saving the loop termination value of 32 in a temporary register and then initializing the three for loop variables:
mm:...<br />
li $t1, 32 # $t1 = 32 (row size/loop end)<br />
li $s0, 0 # i = 0; initialize 1st for loop<br />
L1: li $s1, 0 # j = 0; restart 2nd for loop<br />
L2: li $s2, 0 # k = 0; restart 3rd for loop<br />
To calculate the address of c[i][j], we need to know how a 32 × 32, two-dimensional array is stored in memory. As you might expect, its layout is the
same as if there were 32 single-dimension arrays, each with 32 elements. So the<br />
first step is to skip over the i “single-dimensional arrays,” or rows, to get the<br />
one we want. Thus, we multiply the index in the first dimension by the size of<br />
the row, 32. Since 32 is a power of 2, we can use a shift instead:<br />
sll $t2, $s0, 5 # $t2 = i * 2^5 (size of row of c)
Now we add the second index to select the jth element of the desired row:<br />
addu $t2, $t2, $s1 # $t2 = i * size(row) + j
To turn this sum into a byte index, we multiply it by the size of a matrix element<br />
in bytes. Since each element is 8 bytes for double precision, we can instead shift<br />
left by 3:<br />
sll $t2, $t2, 3 # $t2 = byte offset of [i][j]
Next we add this sum to the base address of c, giving the address of c[i][j], and then load the double precision number c[i][j] into $f4:
addu $t2, $a0, $t2 # $t2 = byte address of c[i][j]<br />
l.d $f4, 0($t2) # $f4 = 8 bytes of c[i][j]<br />
The following five instructions are virtually identical to the last five: calculate the address and then load the double precision number b[k][j].
L3: sll $t0, $s2, 5 # $t0 = k * 2^5 (size of row of b)
addu $t0, $t0, $s1 # $t0 = k * size(row) + j<br />
sll $t0, $t0, 3 # $t0 = byte offset of [k][j]<br />
addu $t0, $a2, $t0 # $t0 = byte address of b[k][j]<br />
l.d $f16, 0($t0) # $f16 = 8 bytes of b[k][j]<br />
Similarly, the next five instructions are like the last five: calculate the address and then load the double precision number a[i][k].
sll $t0, $s0, 5 # $t0 = i * 2^5 (size of row of a)
addu $t0, $t0, $s2 # $t0 = i * size(row) + k<br />
sll $t0, $t0, 3 # $t0 = byte offset of [i][k]<br />
addu $t0, $a1, $t0 # $t0 = byte address of a[i][k]<br />
l.d $f18, 0($t0) # $f18 = 8 bytes of a[i][k]<br />
Now that we have loaded all the data, we are finally ready to do some floating-point operations! We multiply elements of a and b located in registers $f18 and $f16, and then accumulate the sum in $f4.
mul.d $f16, $f18, $f16 # $f16 = a[i][k] * b[k][j]<br />
add.d $f4, $f4, $f16 # $f4 = c[i][j] + a[i][k] * b[k][j]
The final block increments the index k and loops back if the index is not 32. If it is 32, and thus the end of the innermost loop, we need to store the sum accumulated in $f4 into c[i][j].
addiu $s2, $s2, 1 # k = k + 1
bne $s2, $t1, L3 # if (k != 32) go to L3<br />
s.d $f4, 0($t2) # c[i][j] = $f4<br />
Similarly, these final four instructions increment the index variable of the middle and outermost loops, looping back if the index is not 32 and exiting if the index is 32.
addiu $s1, $s1, 1 # j = j + 1
bne $s1, $t1, L2 # if (j != 32) go to L2<br />
addiu $s0, $s0, 1 # $i = i + 1<br />
bne $s0, $t1, L1 # if (i != 32) go to L1<br />
…<br />
Figure 3.22 below shows the x86 assembly language code for a slightly different<br />
version of DGEMM in Figure 3.21.<br />
Elaboration: The array layout discussed in the example, called row-major order, is used by C and many other programming languages. Fortran instead uses column-major order, whereby the array is stored column by column.
Elaboration: Only 16 of the 32 MIPS floating-point registers could originally be used for double precision operations: $f0, $f2, $f4, …, $f30. Double precision is computed using pairs of these single precision registers. The odd-numbered floating-point registers were used only to load and store the right half of 64-bit floating-point numbers. MIPS-32 added l.d and s.d to the instruction set. MIPS-32 also added “paired single” versions of all floating-point instructions, where a single instruction results in two parallel floating-point operations on two 32-bit operands inside 64-bit registers (see Section 3.6). For example, add.ps $f0, $f2, $f4 is equivalent to add.s $f0, $f2, $f4 followed by add.s $f1, $f3, $f5.
Elaboration: Another reason for separate integer and floating-point registers is that microprocessors in the 1980s didn't have enough transistors to put the floating-point unit on the same chip as the integer unit. Hence, the floating-point unit, including the floating-point registers, was optionally available as a second chip. Such optional accelerator chips are called coprocessors, and explain the acronym for floating-point loads in MIPS: lwc1 means load word to coprocessor 1, the floating-point unit. (Coprocessor 0 deals with virtual memory, described in Chapter 5.) Since the early 1990s, microprocessors have integrated floating point (and just about everything else) on chip, and hence the term coprocessor joins accumulator and core memory as quaint terms that date the speaker.
Elaboration: As mentioned in Section 3.4, accelerating division is more challenging than multiplication. In addition to SRT, another technique to leverage a fast multiplier is Newton's iteration, where division is recast as finding the zero of a function to find the reciprocal 1/c, which is then multiplied by the other operand. Iteration techniques cannot be rounded properly without calculating many extra bits. A TI chip solved this problem by calculating an extra-precise reciprocal.
Elaboration: Java embraces IEEE 754 by name in its definition of Java floating-point data types and operations. Thus, the code in the first example could well have been generated for a class method that converted Fahrenheit to Celsius.
The second example above uses multiple dimensional arrays, which are not explicitly supported in Java. Java allows arrays of arrays, but each array may have its own length, unlike multiple dimensional arrays in C. Like the examples in Chapter 2, a Java version of this second example would require a good deal of checking code for array bounds, including a new length calculation at the end of row access. It would also need to check that the object reference is not null.
guard: The first of two extra bits kept on the right during intermediate calculations of floating-point numbers; used to improve rounding accuracy.

round: Method to make the intermediate floating-point result fit the floating-point format; the goal is typically to find the nearest number that can be represented in the format.
Accurate Arithmetic<br />
Unlike integers, which can represent exactly every number between the smallest and largest number, floating-point numbers are normally approximations for a number they can't really represent. The reason is that an infinite variety of real numbers exists between, say, 0 and 1, but no more than 2^53 can be represented exactly in double precision floating point. The best we can do is getting the floating-point representation close to the actual number. Thus, IEEE 754 offers several modes of rounding to let the programmer pick the desired approximation.
Rounding sounds simple enough, but to round accurately requires the hardware to include extra bits in the calculation. In the preceding examples, we were vague on the number of bits that an intermediate representation can occupy, but clearly, if every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round. IEEE 754, therefore, always keeps two extra bits on the right during intermediate additions, called guard and round, respectively. Let's do a decimal example to illustrate their value.
EXAMPLE
Rounding with Guard Digits
Add 2.56ten × 10^0 to 2.34ten × 10^2, assuming that we have three significant decimal digits. Round to the nearest decimal number with three significant decimal digits, first with guard and round digits, and then without them.
ANSWER
First we must shift the smaller number to the right to align the exponents, so 2.56ten × 10^0 becomes 0.0256ten × 10^2. Since we have guard and round digits, we are able to represent the two least significant digits when we align exponents. The guard digit holds 5 and the round digit holds 6. The sum is

  2.3400ten
+ 0.0256ten
-----------
  2.3656ten

Thus the sum is 2.3656ten × 10^2. Since we have two digits to round, we want values 0 to 49 to round down and 51 to 99 to round up, with 50 being the tiebreaker. Rounding the sum up with three significant digits yields 2.37ten × 10^2.
Doing this without guard and round digits drops two digits from the calculation. The new sum is then

  2.34ten
+ 0.02ten
---------
  2.36ten

The answer is 2.36ten × 10^2, off by 1 in the last digit from the sum above.
Since the worst case for rounding would be when the actual number is halfway between two floating-point representations, accuracy in floating point is normally measured in terms of the number of bits in error in the least significant bits of the significand; the measure is called the number of units in the last place, or ulp. If a number were off by 2 in the least significant bits, it would be called off by 2 ulps. Provided there are no overflow, underflow, or invalid operation exceptions, IEEE 754 guarantees that the computer uses the number that is within one-half ulp.
units in the last place (ulp): The number of bits in error in the least significant bits of the significand between the actual number and the number that can be represented.
Elaboration: Although the example above really needed just one extra digit, multiply<br />
can need two. A binary product may have one leading 0 bit; hence, the normalizing step<br />
must shift the product one bit left. This shifts the guard digit into the least significant bit<br />
of the product, leaving the round bit to help accurately round the product.<br />
IEEE 754 has four rounding modes: always round up (toward +∞), always round down (toward −∞), truncate, and round to nearest even. The final mode determines what to do if the number is exactly halfway in between. The U.S. Internal Revenue Service (IRS) always rounds 0.50 dollars up, possibly to the benefit of the IRS. A more equitable way would be to round up this case half the time and round down the other half. IEEE 754 says that if the least significant bit retained in a halfway case would be odd, add one; if it would be even, truncate.
In an attempt to squeeze every last bit of precision from a floating-point operation, the standard allows some numbers to be represented in unnormalized form. Rather than having a gap between 0 and the smallest normalized number, IEEE allows denormalized numbers (also known as denorms or subnormals). They have the same exponent as zero but a nonzero fraction. They allow a number to degrade in significance until it becomes 0, called gradual underflow. For example, the smallest positive single precision normalized number is

1.0000 0000 0000 0000 0000 000two × 2^-126

but the smallest single precision denormalized number is

0.0000 0000 0000 0000 0000 001two × 2^-126, or 1.0two × 2^-149

For double precision, the denorm gap goes from 1.0 × 2^-1022 to 1.0 × 2^-1074.
The possibility of an occasional unnormalized operand has given headaches to floating-point designers who are trying to build fast floating-point units. Hence, many computers cause an exception if an operand is denormalized, letting software complete the operation. Although software implementations are perfectly valid, their lower performance has lessened the popularity of denorms in portable floating-point software. Moreover, if programmers do not expect denorms, their programs may surprise them.
3.6 Parallelism and Computer Arithmetic: Subword Parallelism
Since every desktop microprocessor by definition has its own graphical displays,<br />
as transistor budgets increased it was inevitable that support would be added for<br />
graphics operations.<br />
Many graphics systems originally used 8 bits to represent each of the three primary colors plus 8 bits for a location of a pixel. The addition of speakers and microphones for teleconferencing and video games suggested support of sound as well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient.
Every microprocessor has special support so that bytes and halfwords take up less space when stored in memory (see Section 2.9), but due to the infrequency of arithmetic operations on these data sizes in typical integer programs, there was little support beyond data transfers. Architects recognized that many graphics and audio applications would perform the same operation on vectors of this data. By partitioning the carry chains within a 128-bit adder, a processor could use parallelism to perform simultaneous operations on short vectors of sixteen 8-bit operands, eight 16-bit operands, four 32-bit operands, or two 64-bit operands. The cost of such partitioned adders was small.
Given that the parallelism occurs within a wide word, the extensions are classified as subword parallelism. It is also classified under the more general name of data-level parallelism. They have also been called vector or SIMD, for single instruction, multiple data (see Section 6.6). The rising popularity of multimedia
1. void dgemm (int n, double* A, double* B, double* C)<br />
2. {<br />
3. for (int i = 0; i < n; ++i)<br />
4. for (int j = 0; j < n; ++j)<br />
5. {<br />
6. double cij = C[i+j*n]; /* cij = C[i][j] */<br />
7. for( int k = 0; k < n; k++ )<br />
8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */<br />
9. C[i+j*n] = cij; /* C[i][j] = cij */<br />
10. }<br />
11. }<br />
FIGURE 3.21 Unoptimized C version of a double precision matrix multiply, widely known as DGEMM for Double-precision GEneral Matrix Multiply (GEMM). Because we are passing the matrix dimension as the parameter n, this version of DGEMM uses single dimensional versions of matrices C, A, and B and address arithmetic to get better performance instead of using the more intuitive two-dimensional arrays that we saw in Section 3.5. The comments remind us of this more intuitive notation.
1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0<br />
2. mov %rsi,%rcx # register %rcx = %rsi<br />
3. xor %eax,%eax # register %eax = 0<br />
4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1<br />
5. add %r9,%rcx # register %rcx = %rcx + %r9<br />
6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A<br />
7. add $0x1,%rax # register %rax = %rax + 1<br />
8. cmp %eax,%edi # compare %eax to %edi<br />
9. vaddsd %xmm1,%xmm0,%xmm0 # Add %xmm1, %xmm0<br />
10. jg 30 # jump if %eax > %edi<br />
11. add $0x1,%r11d # register %r11 = %r11 + 1<br />
12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element<br />
FIGURE 3.22 The x86 assembly language for the body of the nested loops generated by compiling the unoptimized C code in Figure 3.21. Although it is dealing with just 64 bits of data, the compiler uses the AVX version of the instructions instead of SSE2, presumably so that it can use three addresses per instruction instead of two (see the Elaboration in Section 3.7).
3.8 Going Faster: Subword Parallelism and Matrix Multiply
1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)<br />
3. {<br />
4. for ( int i = 0; i < n; i+=4 )<br />
5. for ( int j = 0; j < n; j++ ) {<br />
6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */<br />
7. for( int k = 0; k < n; k++ )<br />
8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */<br />
9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n),<br />
10. _mm256_broadcast_sd(B+k+j*n)));<br />
11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */<br />
12. }<br />
13. }<br />
FIGURE 3.23 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel<br />
instructions for the x86. Figure 3.24 shows the assembly language produced by the compiler for the inner loop.<br />
While compiler writers may eventually be able to routinely produce high-quality code that uses the AVX instructions of the x86, for now we must “cheat” by using C intrinsics that more or less tell the compiler exactly how to produce good code. Figure 3.23 shows the enhanced version of Figure 3.21 for which the Gnu C compiler produces AVX code. Figure 3.24 shows annotated x86 code that is the output of compiling using gcc with the -O3 level of optimization.
The declaration on line 6 of Figure 3.23 uses the __m256d data type, which tells<br />
the compiler the variable will hold 4 double-precision floating-point values. The<br />
intrinsic _mm256_load_pd() also on line 6 uses AVX instructions to load 4<br />
double-precision floating-point numbers in parallel (_pd) from the matrix C into<br />
c0. The address calculation C+i+j*n on line 6 represents element C[i+j*n].<br />
Symmetrically, the final step on line 11 uses the intrinsic _mm256_store_pd()<br />
to store 4 double-precision floating-point numbers from c0 into the matrix C.<br />
As we’re going through 4 elements each iteration, the outer for loop on line 4<br />
increments i by 4 instead of by 1 as on line 3 of Figure 3.21.<br />
Inside the loops, on line 9 we first load 4 elements of A again using _mm256_<br />
load_pd(). To multiply these elements by one element of B, on line 10 we first<br />
use the intrinsic _mm256_broadcast_sd(), which makes 4 identical copies<br />
of the scalar double precision number—in this case an element of B—in one of the<br />
YMM registers. We then use _mm256_mul_pd() on line 9 to multiply the four<br />
double-precision results in parallel. Finally, _mm256_add_pd() on line 8 adds<br />
the 4 products to the 4 sums in c0.<br />
Figure 3.24 shows resulting x86 code for the body of the inner loops produced<br />
by the compiler. You can see the five AVX instructions—they all start with v and
1. vmovapd (%r11),%ymm0 # Load 4 elements of C into %ymm0<br />
2. mov %rbx,%rcx # register %rcx = %rbx<br />
3. xor %eax,%eax # register %eax = 0<br />
4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element<br />
5. add $0x8,%rax # register %rax = %rax + 8<br />
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements<br />
7. add %r9,%rcx # register %rcx = %rcx + %r9<br />
8. cmp %r10,%rax # compare %r10 to %rax<br />
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0<br />
10. jne 50 # jump if not %r10 != %rax<br />
11. add $0x1,%esi # register %esi = %esi + 1
12. vmovapd %ymm0,(%r11) # Store %ymm0 into 4 C elements<br />
FIGURE 3.24 The x86 assembly language for the body of the nested loops generated by compiling the optimized C code in Figure 3.23. Note the similarities to Figure 3.22, with the primary difference being that the five floating-point operations are now using YMM registers and using the pd versions of the instructions for parallel double precision instead of the sd version for scalar double precision.
four of the five use pd for parallel double precision—that correspond to the C<br />
intrinsics mentioned above. The code is very similar to that in Figure 3.22 above:<br />
both use 12 instructions, the integer instructions are nearly identical (but different<br />
registers), <strong>and</strong> the floating-point instruction differences are generally just going<br />
from scalar double (sd) using XMM registers to parallel double (pd) with YMM<br />
registers. The one exception is line 4 of Figure 3.24. Every element of A must be<br />
multiplied by one element of B. One solution is to place four identical copies of the<br />
64-bit B element side-by-side into the 256-bit YMM register, which is just what the<br />
instruction vbroadcastsd does.<br />
For matrices of dimensions of 32 by 32, the unoptimized DGEMM in Figure 3.21<br />
runs at 1.7 GigaFLOPS (FLoating point Operations Per Second) on one core of a<br />
2.6 GHz Intel Core i7 (Sandy Bridge). The optimized code in Figure 3.23 performs
at 6.4 GigaFLOPS. The AVX version is 3.85 times as fast, which is very close to the<br />
factor of 4.0 increase that you might hope for from performing 4 times as many<br />
operations at a time by using subword parallelism.<br />
Elaboration: As mentioned in the Elaboration in Section 1.6, Intel offers Turbo mode<br />
that temporarily runs at a higher clock rate until the chip gets too hot. This Intel Core i7<br />
(Sandy Bridge) can increase from 2.6 GHz to 3.3 GHz in Turbo mode. The results above are with Turbo mode turned off. If we turn it on, we improve all the results by the increase in the clock rate of 3.3/2.6 = 1.27 to 2.1 GFLOPS for unoptimized DGEMM and 8.1
GFLOPS with AVX. Turbo mode works particularly well when using only a single core of<br />
an eight-core chip, as in this case, as it lets that single core use much more than its fair<br />
share of power since the other cores are idle.
3.9 Fallacies <strong>and</strong> Pitfalls<br />
Arithmetic fallacies and pitfalls generally stem from the difference between the
limited precision of computer arithmetic <strong>and</strong> the unlimited precision of natural<br />
arithmetic.<br />
Fallacy: Just as a left shift instruction can replace an integer multiply by a<br />
power of 2, a right shift is the same as an integer division by a power of 2.<br />
Recall that a binary number x, where xi means the ith bit, represents the number

… + (x3 × 2^3) + (x2 × 2^2) + (x1 × 2^1) + (x0 × 2^0)

Shifting the bits of x right by n bits would seem to be the same as dividing by 2^n. And this is true for unsigned integers. The problem is with signed integers. For
example, suppose we want to divide −5ten by 4ten; the quotient should be −1ten. The two's complement representation of −5ten is

1111 1111 1111 1111 1111 1111 1111 1011two

“Thus mathematics may be defined as the subject in which we never know what we are talking about, nor whether what we are saying is true.”
Bertrand Russell, Recent Words on the Principles of Mathematics, 1901
According to this fallacy, shifting right by two should divide by 4ten (2^2):

0011 1111 1111 1111 1111 1111 1111 1110two

With a 0 in the sign bit, this result is clearly wrong. The value created by the shift right is actually 1,073,741,822ten instead of −1ten.
A solution would be to have an arithmetic right shift that extends the sign bit instead of shifting in 0s. A 2-bit arithmetic shift right of −5ten produces

1111 1111 1111 1111 1111 1111 1111 1110two

The result is −2ten instead of −1ten; close, but no cigar.
Pitfall: Floating-point addition is not associative.<br />
Associativity holds for a sequence of two’s complement integer additions, even if the<br />
computation overflows. Alas, because floating-point numbers are approximations<br />
of real numbers and because computer arithmetic has limited precision, it does
not hold for floating-point numbers. Given the great range of numbers that can be<br />
represented in floating point, problems occur when adding two large numbers of<br />
opposite signs plus a small number. For example, let's see if c + (a + b) = (c + a) + b. Assume c = −1.5ten × 10^38, a = 1.5ten × 10^38, and b = 1.0, and that these are all single precision numbers.
c + (a + b) = −1.5ten × 10^38 + (1.5ten × 10^38 + 1.0)
            = −1.5ten × 10^38 + (1.5ten × 10^38)
            = 0.0

(c + a) + b = (−1.5ten × 10^38 + 1.5ten × 10^38) + 1.0
            = (0.0ten) + 1.0
            = 1.0
Since floating-point numbers have limited precision and result in approximations of real results, 1.5ten × 10^38 is so much larger than 1.0 that 1.5ten × 10^38 + 1.0 is still 1.5ten × 10^38. That is why the sum of c, a, and b is 0.0 or 1.0, depending on the order of the floating-point additions, so c + (a + b) ≠ (c + a) + b. Therefore, floating-point addition is not associative.
Fallacy: Parallel execution strategies that work for integer data types also work<br />
for floating-point data types.<br />
Programs have typically been written first to run sequentially before being rewritten<br />
to run concurrently, so a natural question is, “Do the two versions get the same<br />
answer?” If the answer is no, you presume there is a bug in the parallel version that<br />
you need to track down.<br />
This approach assumes that computer arithmetic does not affect the results when<br />
going from sequential to parallel. That is, if you were to add a million numbers<br />
together, you would get the same results whether you used 1 processor or 1000<br />
processors. This assumption holds for two’s complement integers, since integer<br />
addition is associative. Alas, since floating-point addition is not associative, the<br />
assumption does not hold.<br />
A more vexing version of this fallacy occurs on a parallel computer where the<br />
operating system scheduler may use a different number of processors depending<br />
on what other programs are running on a parallel computer. As the varying<br />
number of processors from each run would cause the floating-point sums to be<br />
calculated in different orders, getting slightly different answers each time despite<br />
running identical code with identical input may flummox unaware parallel<br />
programmers.<br />
Given this qu<strong>and</strong>ary, programmers who write parallel code with floating-point<br />
numbers need to verify whether the results are credible even if they don’t give the<br />
same exact answer as the sequential code. The field that deals with such issues is<br />
called numerical analysis, which is the subject of textbooks in its own right. Such<br />
concerns are one reason for the popularity of numerical libraries such as LAPACK<br />
<strong>and</strong> SCALAPAK, which have been validated in both their sequential <strong>and</strong> parallel<br />
forms.<br />
Pitfall: The MIPS instruction add immediate unsigned (addiu) sign-extends<br />
its 16-bit immediate field.<br />
3.9 Fallacies <strong>and</strong> Pitfalls 231<br />
Despite its name, add immediate unsigned (addiu) is used to add constants to<br />
signed integers when we don’t care about overflow. MIPS has no subtract immediate<br />
instruction, <strong>and</strong> negative numbers need sign extension, so the MIPS architects<br />
decided to sign-extend the immediate field.<br />
Fallacy: Only theoretical mathematicians care about floating-point accuracy.<br />
Newspaper headlines of November 1994 prove this statement is a fallacy (see<br />
Figure 3.25). The following is the inside story behind the headlines.<br />
The Pentium used a st<strong>and</strong>ard floating-point divide algorithm that generates<br />
multiple quotient bits per step, using the most significant bits of divisor <strong>and</strong><br />
dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup<br />
table containing −2, −1, 0, +1, or +2. The guess is multiplied by the divisor <strong>and</strong><br />
subtracted from the remainder to generate a new remainder. Like nonrestoring<br />
division, if a previous guess gets too large a remainder, the partial remainder is<br />
adjusted in a subsequent pass.<br />
Evidently, there were five elements of the table from the 80486 that Intel<br />
engineers thought could never be accessed, <strong>and</strong> they optimized the logic to return<br />
0 instead of 2 in these situations on the Pentium. Intel was wrong: while the first 11<br />
FIGURE 3.25 A sampling of newspaper <strong>and</strong> magazine articles from November 1994,<br />
including the New York Times, San Jose Mercury News, San Francisco Chronicle, <strong>and</strong><br />
Infoworld. The Pentium floating-point divide bug even made the “Top 10 List” of the David Letterman<br />
Late Show on television. Intel eventually took a $300 million write-off to replace the buggy chips.
bits were always correct, errors would show up occasionally in bits 12 to 52, or the<br />
4th to 15th decimal digits.<br />
A math professor at Lynchburg College in Virginia, Thomas Nicely, discovered the<br />
bug in September 1994. After calling Intel technical support <strong>and</strong> getting no official<br />
reaction, he posted his discovery on the Internet. This post led to a story in a trade<br />
magazine, which in turn caused Intel to issue a press release. It called the bug a glitch<br />
that would affect only theoretical mathematicians, with the average spreadsheet<br />
user seeing an error every 27,000 years. IBM Research soon counterclaimed that the<br />
average spreadsheet user would see an error every 24 days. Intel soon threw in the<br />
towel by making the following announcement on December 21:<br />
“We at Intel wish to sincerely apologize for our h<strong>and</strong>ling of the recently publicized<br />
Pentium processor flaw. The Intel Inside symbol means that your computer has<br />
a microprocessor second to none in quality <strong>and</strong> performance. Thous<strong>and</strong>s of Intel<br />
employees work very hard to ensure that this is true. But no microprocessor is<br />
ever perfect. What Intel continues to believe is technically an extremely minor<br />
problem has taken on a life of its own. Although Intel firmly st<strong>and</strong>s behind the<br />
quality of the current version of the Pentium processor, we recognize that many<br />
users have concerns. We want to resolve these concerns. Intel will exchange the<br />
current version of the Pentium processor for an updated version, in which this<br />
floating-point divide flaw is corrected, for any owner who requests it, free of<br />
charge anytime during the life of their computer.”<br />
Analysts estimate that this recall cost Intel $500 million, <strong>and</strong> Intel engineers did not<br />
get a Christmas bonus that year.<br />
This story brings up a few points for everyone to ponder. How much cheaper<br />
would it have been to fix the bug in July 1994? What was the cost to repair the<br />
damage to Intel’s reputation? And what is the corporate responsibility in disclosing<br />
bugs in a product so widely used <strong>and</strong> relied upon as a microprocessor?<br />
3.10 Concluding Remarks<br />
Over the decades, computer arithmetic has become largely st<strong>and</strong>ardized, greatly<br />
enhancing the portability of programs. Two’s complement binary integer arithmetic is<br />
found in every computer sold today, <strong>and</strong> if the computer includes floating-point support, it offers<br />
IEEE 754 binary floating-point arithmetic.<br />
<strong>Computer</strong> arithmetic is distinguished from paper-<strong>and</strong>-pencil arithmetic by the<br />
constraints of limited precision. This limit may result in invalid operations through<br />
calculating numbers larger or smaller than the predefined limits. Such anomalies, called<br />
“overflow” or “underflow,” may result in exceptions or interrupts, emergency events<br />
similar to unplanned subroutine calls. Chapters 4 <strong>and</strong> 5 discuss exceptions in more detail.<br />
Floating-point arithmetic has the added challenge of being an approximation<br />
of real numbers, <strong>and</strong> care needs to be taken to ensure that the computer number
selected is the representation closest to the actual number. The challenges of<br />
imprecision <strong>and</strong> limited representation of floating point are part of the inspiration<br />
for the field of numerical analysis. The recent switch to parallelism shines the<br />
searchlight on numerical analysis again, as solutions that were long considered<br />
safe on sequential computers must be reconsidered when trying to find the fastest<br />
algorithm for parallel computers that still achieves a correct result.<br />
Data-level parallelism, specifically subword parallelism, offers a simple path to<br />
higher performance for programs that are intensive in arithmetic operations for<br />
either integer or floating-point data. We showed that we could speed up matrix<br />
multiply nearly fourfold by using instructions that could execute four floating-point<br />
operations at a time.<br />
With the explanation of computer arithmetic in this chapter comes a description<br />
of much more of the MIPS instruction set. One point of confusion is the instructions<br />
covered in these chapters versus instructions executed by MIPS chips versus the<br />
instructions accepted by MIPS assemblers. Two figures try to make this clear.<br />
Figure 3.26 lists the MIPS instructions covered in this chapter <strong>and</strong> Chapter 2.<br />
We call the set of instructions on the left-h<strong>and</strong> side of the figure the MIPS core. The<br />
instructions on the right we call the MIPS arithmetic core. On the left of Figure 3.27<br />
are the instructions the MIPS processor executes that are not found in Figure 3.26.<br />
We call the full set of hardware instructions MIPS-32. On the right of Figure 3.27<br />
are the instructions accepted by the assembler that are not part of MIPS-32. We call<br />
this set of instructions Pseudo MIPS.<br />
Figure 3.28 gives the popularity of the MIPS instructions for SPEC CPU2006<br />
integer <strong>and</strong> floating-point benchmarks. All instructions are listed that were<br />
responsible for at least 0.2% of the instructions executed.<br />
Note that although programmers <strong>and</strong> compiler writers may use MIPS-32 to<br />
have a richer menu of options, MIPS core instructions dominate integer SPEC<br />
CPU2006 execution, <strong>and</strong> the integer core plus arithmetic core dominate SPEC<br />
CPU2006 floating point, as the table below shows.<br />
Instruction subset Integer Fl. pt.<br />
MIPS core 98% 31%<br />
MIPS arithmetic core 2% 66%<br />
Remaining MIPS-32 0% 3%<br />
For the rest of the book, we concentrate on the MIPS core instructions—the integer<br />
instruction set excluding multiply <strong>and</strong> divide—to make the explanation of computer<br />
design easier. As you can see, the MIPS core includes the most popular MIPS<br />
instructions; be assured that underst<strong>and</strong>ing a computer that runs the MIPS core<br />
will give you sufficient background to underst<strong>and</strong> even more ambitious computers.<br />
No matter what the instruction set or its size—MIPS, ARM, x86—never forget that<br />
bit patterns have no inherent meaning. The same bit pattern may represent a signed<br />
integer, unsigned integer, floating-point number, string, instruction, <strong>and</strong> so on. In<br />
stored program computers, it is the operation on the bit pattern that determines its<br />
meaning.
Remaining MIPS-32 Name Format Pseudo MIPS Name Format<br />
exclusive or (rs ⊕ rt) xor R absolute value abs rd,rs<br />
exclusive or immediate xori I negate (signed or unsigned) negs rd,rs<br />
shift right arithmetic sra R rotate left rol rd,rs,rt<br />
shift left logical variable sllv R rotate right ror rd,rs,rt<br />
shift right logical variable srlv R multiply <strong>and</strong> don’t check oflw (signed or uns.) muls rd,rs,rt<br />
shift right arithmetic variable srav R multiply <strong>and</strong> check oflw (signed or uns.) mulos rd,rs,rt<br />
move to Hi mthi R divide <strong>and</strong> check overflow div rd,rs,rt<br />
move to Lo mtlo R divide <strong>and</strong> don’t check overflow divu rd,rs,rt<br />
load halfword lh I remainder (signed or unsigned) rems rd,rs,rt<br />
load byte lb I load immediate li rd,imm<br />
load word left (unaligned) lwl I load address la rd,addr<br />
load word right (unaligned) lwr I load double ld rd,addr<br />
store word left (unaligned) swl I store double sd rd,addr<br />
store word right (unaligned) swr I unaligned load word ulw rd,addr<br />
load linked (atomic update) ll I unaligned store word usw rd,addr<br />
store cond. (atomic update) sc I unaligned load halfword (signed or uns.) ulhs rd,addr<br />
move if zero movz R unaligned store halfword ush rd,addr<br />
move if not zero movn R branch b Label<br />
multiply <strong>and</strong> add (S or uns.) madds R branch on equal zero beqz rs,L<br />
multiply <strong>and</strong> subtract (S or uns.) msubs I branch on compare (signed or unsigned) bxs rs,rt,L<br />
branch on ≥ zero <strong>and</strong> link bgezal I (x = lt, le, gt, ge)<br />
branch on < zero <strong>and</strong> link bltzal I set equal seq rd,rs,rt<br />
jump <strong>and</strong> link register jalr R set not equal sne rd,rs,rt<br />
branch compare to zero bxz I set on compare (signed or unsigned) sxs rd,rs,rt<br />
branch compare to zero likely bxzl I (x = lt, le, gt, ge)<br />
(x = lt, le, gt, ge) load to floating point (s or d) l.f rd,addr<br />
branch compare reg likely bxl I store from floating point (s or d) s.f rd,addr<br />
trap if compare reg tx R<br />
trap if compare immediate txi I<br />
(x = eq, neq, lt, le, gt, ge)<br />
return from exception rfe R<br />
system call syscall I<br />
break (cause exception) break I<br />
move from FP to integer mfc1 R<br />
move to FP from integer mtc1 R<br />
FP move (s or d) mov.f R<br />
FP move if zero (s or d) movz.f R<br />
FP move if not zero (s or d) movn.f R<br />
FP square root (s or d) sqrt.f R<br />
FP absolute value (s or d) abs.f R<br />
FP negate (s or d) neg.f R<br />
FP convert (w, s, or d) cvt.f.f R<br />
FP compare un (s or d) c.xn.f R<br />
FIGURE 3.27 Remaining MIPS-32 <strong>and</strong> Pseudo MIPS instruction sets. f means single (s) or double (d) precision floating-point<br />
instructions, <strong>and</strong> s means signed <strong>and</strong> unsigned (u) versions. MIPS-32 also has FP instructions for multiply <strong>and</strong> add/sub (madd.f/ msub.f),<br />
ceiling (ceil.f), truncate (trunc.f), round (round.f), <strong>and</strong> reciprocal (recip.f). The underscore represents the letter to include to represent<br />
that datatype.
3.12 Exercises<br />
3.1 [5] What is 5ED4 − 07A4 when these values represent unsigned 16-<br />
bit hexadecimal numbers? The result should be written in hexadecimal. Show your<br />
work.<br />
3.2 [5] What is 5ED4 − 07A4 when these values represent signed 16-<br />
bit hexadecimal numbers stored in sign-magnitude format? The result should be<br />
written in hexadecimal. Show your work.<br />
3.3 [10] Convert 5ED4 into a binary number. What makes base 16<br />
(hexadecimal) an attractive numbering system for representing values in<br />
computers?<br />
3.4 [5] What is 4365 − 3412 when these values represent unsigned 12-bit<br />
octal numbers? The result should be written in octal. Show your work.<br />
3.5 [5] What is 4365 − 3412 when these values represent signed 12-bit<br />
octal numbers stored in sign-magnitude format? The result should be written in<br />
octal. Show your work.<br />
3.6 [5] Assume 185 <strong>and</strong> 122 are unsigned 8-bit decimal integers. Calculate<br />
185 – 122. Is there overflow, underflow, or neither?<br />
3.7 [5] Assume 185 <strong>and</strong> 122 are signed 8-bit decimal integers stored in<br />
sign-magnitude format. Calculate 185 + 122. Is there overflow, underflow, or<br />
neither?<br />
3.8 [5] Assume 185 <strong>and</strong> 122 are signed 8-bit decimal integers stored in<br />
sign-magnitude format. Calculate 185 − 122. Is there overflow, underflow, or<br />
neither?<br />
3.9 [10] Assume 151 <strong>and</strong> 214 are signed 8-bit decimal integers stored in<br />
two’s complement format. Calculate 151 + 214 using saturating arithmetic. The<br />
result should be written in decimal. Show your work.<br />
3.10 [10] Assume 151 <strong>and</strong> 214 are signed 8-bit decimal integers stored in<br />
two’s complement format. Calculate 151 − 214 using saturating arithmetic. The<br />
result should be written in decimal. Show your work.<br />
3.11 [10] Assume 151 <strong>and</strong> 214 are unsigned 8-bit integers. Calculate 151 +<br />
214 using saturating arithmetic. The result should be written in decimal. Show<br />
your work.<br />
3.12 [20] Using a table similar to that shown in Figure 3.6, calculate the<br />
product of the octal unsigned 6-bit integers 62 <strong>and</strong> 12 using the hardware described<br />
in Figure 3.3. You should show the contents of each register on each step.<br />
Never give in, never<br />
give in, never, never,<br />
never—in nothing,<br />
great or small, large or<br />
petty—never give in.<br />
Winston Churchill,<br />
address at Harrow<br />
School, 1941
3.13 [20] Using a table similar to that shown in Figure 3.6, calculate the<br />
product of the hexadecimal unsigned 8-bit integers 62 <strong>and</strong> 12 using the hardware<br />
described in Figure 3.5. You should show the contents of each register on each step.<br />
3.14 [10] Calculate the time necessary to perform a multiply using the<br />
approach given in Figures 3.3 <strong>and</strong> 3.4 if an integer is 8 bits wide <strong>and</strong> each step<br />
of the operation takes 4 time units. Assume that in step 1a an addition is always<br />
performed—either the multiplic<strong>and</strong> will be added, or a zero will be. Also assume<br />
that the registers have already been initialized (you are just counting how long it<br />
takes to do the multiplication loop itself). If this is being done in hardware, the<br />
shifts of the multiplic<strong>and</strong> <strong>and</strong> multiplier can be done simultaneously. If this is being<br />
done in software, they will have to be done one after the other. Solve for each case.<br />
3.15 [10] Calculate the time necessary to perform a multiply using the<br />
approach described in the text (31 adders stacked vertically) if an integer is 8 bits<br />
wide <strong>and</strong> an adder takes 4 time units.<br />
3.16 [20] Calculate the time necessary to perform a multiply using the<br />
approach given in Figure 3.7 if an integer is 8 bits wide <strong>and</strong> an adder takes 4 time<br />
units.<br />
3.17 [20] As discussed in the text, one possible performance enhancement<br />
is to do a shift <strong>and</strong> add instead of an actual multiplication. Since 9 × 6, for example,<br />
can be written (2 × 2 × 2 + 1) × 6, we can calculate 9 × 6 by shifting 6 to the left 3<br />
times <strong>and</strong> then adding 6 to that result. Show the best way to calculate 0x33 × 0x55<br />
using shifts <strong>and</strong> adds/subtracts. Assume both inputs are 8-bit unsigned integers.<br />
3.18 [20] Using a table similar to that shown in Figure 3.10, calculate<br />
74 divided by 21 using the hardware described in Figure 3.8. You should show<br />
the contents of each register on each step. Assume both inputs are unsigned 6-bit<br />
integers.<br />
3.19 [30] Using a table similar to that shown in Figure 3.10, calculate<br />
74 divided by 21 using the hardware described in Figure 3.11. You should show<br />
the contents of each register on each step. Assume A <strong>and</strong> B are unsigned 6-bit<br />
integers. This algorithm requires a slightly different approach than that shown in<br />
Figure 3.9. You will want to think hard about this, do an experiment or two, or else<br />
go to the web to figure out how to make this work correctly. (Hint: one possible<br />
solution involves using the fact that Figure 3.11 implies the remainder register can<br />
be shifted either direction.)<br />
3.20 [5] What decimal number does the bit pattern 0x0C000000<br />
represent if it is a two’s complement integer? An unsigned integer?<br />
3.21 [10] If the bit pattern 0x0C000000 is placed into the Instruction<br />
Register, what MIPS instruction will be executed?<br />
3.22 [10] What decimal number does the bit pattern 0x0C000000<br />
represent if it is a floating point number? Use the IEEE 754 st<strong>and</strong>ard.
3.23 [10] Write down the binary representation of the decimal number<br />
63.25 assuming the IEEE 754 single precision format.<br />
3.24 [10] Write down the binary representation of the decimal number<br />
63.25 assuming the IEEE 754 double precision format.<br />
3.25 [10] Write down the binary representation of the decimal number<br />
63.25 assuming it was stored using the single precision IBM format (base 16,<br />
instead of base 2, with 7 bits of exponent).<br />
3.26 [20] Write down the binary bit pattern to represent −1.5625 × 10^−1<br />
assuming a format similar to that employed by the DEC PDP-8 (the leftmost 12<br />
bits are the exponent stored as a two’s complement number, <strong>and</strong> the rightmost 24<br />
bits are the fraction stored as a two’s complement number). No hidden 1 is used.<br />
Comment on how the range <strong>and</strong> accuracy of this 36-bit pattern compares to the<br />
single <strong>and</strong> double precision IEEE 754 st<strong>and</strong>ards.<br />
3.27 [20] IEEE 754-2008 contains a half precision that is only 16 bits<br />
wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide <strong>and</strong> has a bias<br />
of 15, <strong>and</strong> the mantissa is 10 bits long. A hidden 1 is assumed. Write down the<br />
bit pattern to represent −1.5625 × 10^−1 assuming a version of this format, which<br />
uses an excess-16 format to store the exponent. Comment on how the range <strong>and</strong><br />
accuracy of this 16-bit floating point format compares to the single precision IEEE<br />
754 st<strong>and</strong>ard.<br />
3.28 [20] The Hewlett-Packard 2114, 2115, <strong>and</strong> 2116 used a format<br />
with the leftmost 16 bits being the fraction stored in two’s complement format,<br />
followed by another 16-bit field which had the leftmost 8 bits as an extension of the<br />
fraction (making the fraction 24 bits long), <strong>and</strong> the rightmost 8 bits representing<br />
the exponent. However, in an interesting twist, the exponent was stored in sign-magnitude<br />
format with the sign bit on the far right! Write down the bit pattern to<br />
represent −1.5625 × 10^−1 assuming this format. No hidden 1 is used. Comment on<br />
how the range <strong>and</strong> accuracy of this 32-bit pattern compares to the single precision<br />
IEEE 754 st<strong>and</strong>ard.<br />
3.29 [20] Calculate the sum of 2.6125 × 10^1 <strong>and</strong> 4.150390625 × 10^−1<br />
by h<strong>and</strong>, assuming A <strong>and</strong> B are stored in the 16-bit half precision described in<br />
Exercise 3.27. Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the<br />
nearest even. Show all the steps.<br />
3.30 [30] Calculate the product of –8.0546875 × 10^0 <strong>and</strong> 1.79931640625 ×<br />
10^–1 by h<strong>and</strong>, assuming A <strong>and</strong> B are stored in the 16-bit half precision format<br />
described in Exercise 3.27. Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round<br />
to the nearest even. Show all the steps; however, as is done in the example in the<br />
text, you can do the multiplication in human-readable format instead of using the<br />
techniques described in Exercises 3.12 through 3.14. Indicate if there is overflow<br />
or underflow. Write your answer in both the 16-bit floating point format described<br />
in Exercise 3.27 <strong>and</strong> also as a decimal number. How accurate is your result? How<br />
does it compare to the number you get if you do the multiplication on a calculator?
3.31 [30] Calculate by h<strong>and</strong> 8.625 × 10^1 divided by 4.875 × 10^0. Show<br />
all the steps necessary to achieve your answer. Assume there is a guard, a round bit,<br />
<strong>and</strong> a sticky bit, <strong>and</strong> use them if necessary. Write the final answer in both the 16-bit<br />
floating point format described in Exercise 3.27 <strong>and</strong> in decimal <strong>and</strong> compare the<br />
decimal result to that which you get if you use a calculator.<br />
3.32 [20] Calculate (3.984375 × 10^−1 + 3.4375 × 10^−1) + 1.771 × 10^3<br />
by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision format<br />
described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard, 1<br />
round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />
write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />
3.33 [20] Calculate 3.984375 × 10^−1 + (3.4375 × 10^−1 + 1.771 × 10^3)<br />
by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision format<br />
described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard, 1<br />
round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />
write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />
3.34 [10] Based on your answers to 3.32 <strong>and</strong> 3.33, does (3.984375 × 10^−1<br />
+ 3.4375 × 10^−1) + 1.771 × 10^3 = 3.984375 × 10^−1 + (3.4375 × 10^−1 + 1.771 ×<br />
10^3)?<br />
3.35 [30] Calculate (3.41796875 × 10^−3 × 6.34765625 × 10^−3) × 1.05625<br />
× 10^2 by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision<br />
format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />
1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />
write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />
3.36 [30] Calculate 3.41796875 × 10^−3 × (6.34765625 × 10^−3 × 1.05625 ×<br />
10^2) by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision<br />
format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />
1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />
write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />
3.37 [10] Based on your answers to 3.35 <strong>and</strong> 3.36, does (3.41796875 × 10^−3<br />
× 6.34765625 × 10^−3) × 1.05625 × 10^2 = 3.41796875 × 10^−3 × (6.34765625 ×<br />
10^−3 × 1.05625 × 10^2)?<br />
3.38 [30] Calculate 1.666015625 × 10^0 × (1.9760 × 10^4 + −1.9744 ×<br />
10^4) by h<strong>and</strong>, assuming each of the values are stored in the 16-bit half precision<br />
format described in Exercise 3.27 (<strong>and</strong> also described in the text). Assume 1 guard,<br />
1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even. Show all the steps, <strong>and</strong><br />
write your answer in both the 16-bit floating point format <strong>and</strong> in decimal.<br />
3.39 [30] Calculate (1.666015625 × 10^0 × 1.9760 × 10^4) + (1.666015625<br />
× 10^0 × −1.9744 × 10^4) by h<strong>and</strong>, assuming each of the values are stored in the<br />
16-bit half precision format described in Exercise 3.27 (<strong>and</strong> also described in the<br />
text). Assume 1 guard, 1 round bit, <strong>and</strong> 1 sticky bit, <strong>and</strong> round to the nearest even.<br />
Show all the steps, <strong>and</strong> write your answer in both the 16-bit floating point format<br />
<strong>and</strong> in decimal.
3.40 [10] Based on your answers to 3.38 <strong>and</strong> 3.39, does (1.666015625 ×<br />
10^0 × 1.9760 × 10^4) + (1.666015625 × 10^0 × −1.9744 × 10^4) = 1.666015625 ×<br />
10^0 × (1.9760 × 10^4 + −1.9744 × 10^4)?<br />
3.41 [10] Using the IEEE 754 floating point format, write down the bit<br />
pattern that would represent 1/4. Can you represent 1/4 exactly?<br />
3.42 [10] What do you get if you add 1/4 to itself 4 times? What is 1/4 ×<br />
4? Are they the same? What should they be?<br />
3.43 [10] Write down the bit pattern in the fraction of value 1/3 assuming<br />
a floating point format that uses binary numbers in the fraction. Assume there are<br />
24 bits, <strong>and</strong> you do not need to normalize. Is this representation exact?<br />
3.44 [10] Write down the bit pattern in the fraction assuming a floating<br />
point format that uses Binary Coded Decimal (base 10) numbers in the fraction<br />
instead of base 2. Assume there are 24 bits, <strong>and</strong> you do not need to normalize. Is<br />
this representation exact?<br />
3.45 [10] Write down the bit pattern assuming that we are using base 15<br />
numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9<br />
<strong>and</strong> A–F. Base 15 numbers would use 0–9 <strong>and</strong> A–E.) Assume there are 24 bits, <strong>and</strong><br />
you do not need to normalize. Is this representation exact?<br />
3.46 [20] Write down the bit pattern assuming that we are using base 30<br />
numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9<br />
<strong>and</strong> A–F. Base 30 numbers would use 0–9 <strong>and</strong> A–T.) Assume there are 20 bits, <strong>and</strong><br />
you do not need to normalize. Is this representation exact?<br />
3.47 [45] The following C code implements a four-tap FIR filter on<br />
input array sig_in. Assume that all arrays are 16-bit fixed-point values.<br />
for (i = 3; i < 128; i++)<br />
sig_out[i] = sig_in[i-3] * f[0] + sig_in[i-2] * f[1] +<br />
sig_in[i-1] * f[2] + sig_in[i] * f[3];<br />
Assume you are to write an optimized implementation of this code in assembly<br />
language on a processor that has SIMD instructions <strong>and</strong> 128-bit registers. Without<br />
knowing the details of the instruction set, briefly describe how you would<br />
implement this code, maximizing the use of sub-word operations <strong>and</strong> minimizing<br />
the amount of data that is transferred between registers <strong>and</strong> memory. State all your<br />
assumptions about the instructions you use.<br />
Answers to<br />
Check Yourself<br />
§3.2, page 182: 2.<br />
§3.5, page 221: 3.
4<br />
The Processor<br />
In a major matter, no<br />
details are small.<br />
French Proverb<br />
4.1 Introduction 244<br />
4.2 Logic <strong>Design</strong> Conventions 248<br />
4.3 Building a Datapath 251<br />
4.4 A Simple Implementation Scheme 259<br />
4.5 An Overview of Pipelining 272<br />
4.6 Pipelined Datapath <strong>and</strong> Control 286<br />
4.7 Data Hazards: Forwarding versus<br />
Stalling 303<br />
4.8 Control Hazards 316<br />
4.9 Exceptions 325<br />
4.10 Parallelism via Instructions 332<br />
<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />
© 2013 Elsevier Inc. All rights reserved.
4.1 Introduction 245<br />
However, it illustrates the key principles used in creating a datapath and designing<br />
the control. The implementation of the remaining instructions is similar.<br />
In examining the implementation, we will have the opportunity to see how the<br />
instruction set architecture determines many aspects of the implementation, and<br />
how the choice of various implementation strategies affects the clock rate and CPI<br />
for the computer. Many of the key design principles introduced in Chapter 1 can<br />
be illustrated by looking at the implementation, such as Simplicity favors regularity.<br />
In addition, most concepts used to implement the MIPS subset in this chapter are<br />
the same basic ideas that are used to construct a broad spectrum of computers,<br />
from high-performance servers to general-purpose microprocessors to embedded<br />
processors.<br />
An Overview of the Implementation<br />
In Chapter 2, we looked at the core MIPS instructions, including the integer<br />
arithmetic-logical instructions, the memory-reference instructions, and the branch<br />
instructions. Much of what needs to be done to implement these instructions is the<br />
same, independent of the exact class of instruction. For every instruction, the first<br />
two steps are identical:<br />
1. Send the program counter (PC) to the memory that contains the code and<br />
fetch the instruction from that memory.<br />
2. Read one or two registers, using fields of the instruction to select the registers<br />
to read. For the load word instruction, we need to read only one register, but<br />
most other instructions require reading two registers.<br />
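The two shared steps can be sketched in C for a toy machine state. The struct layout and array sizes here are illustrative assumptions, not from the book; the rs and rt field positions are the MIPS ones discussed later in the chapter.

```c
#include <stdint.h>

/* Minimal machine state: program counter, register file, instruction memory
   (word-addressed here for simplicity; sizes are arbitrary). */
struct state {
    uint32_t pc;
    uint32_t regs[32];
    uint32_t imem[1024];
};

/* Step 1: send the PC to the instruction memory and fetch the instruction. */
uint32_t fetch(const struct state *s)
{
    return s->imem[s->pc / 4];
}

/* Step 2: read the two registers selected by the rs (bits 25:21) and
   rt (bits 20:16) fields of the instruction. */
void read_regs(const struct state *s, uint32_t inst,
               uint32_t *rs_val, uint32_t *rt_val)
{
    *rs_val = s->regs[(inst >> 21) & 0x1F];
    *rt_val = s->regs[(inst >> 16) & 0x1F];
}
```

Every instruction class in the subset begins with exactly these two calls; only what happens afterward differs.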
After these two steps, the actions required to complete the instruction depend<br />
on the instruction class. Fortunately, for each of the three instruction classes<br />
(memory-reference, arithmetic-logical, and branches), the actions are largely the<br />
same, independent of the exact instruction. The simplicity and regularity of the<br />
MIPS instruction set simplify the implementation by making the execution of<br />
many of the instruction classes similar.<br />
For example, all instruction classes, except jump, use the arithmetic-logical unit<br />
(ALU) after reading the registers. The memory-reference instructions use the ALU<br />
for an address calculation, the arithmetic-logical instructions for the operation<br />
execution, and branches for comparison. After using the ALU, the actions required<br />
to complete various instruction classes differ. A memory-reference instruction<br />
will need to access the memory either to read data for a load or write data for a<br />
store. An arithmetic-logical or load instruction must write the data from the ALU<br />
or memory back into a register. Lastly, for a branch instruction, we may need to<br />
change the next instruction address based on the comparison; otherwise, the PC<br />
should be incremented by 4 to get the address of the next instruction.<br />
Figure 4.1 shows the high-level view of a MIPS implementation, focusing on<br />
the various functional units and their interconnection. Although this figure shows<br />
most of the flow of data through the processor, it omits two important aspects of<br />
instruction execution.<br />
246 Chapter 4 The Processor<br />
First, in several places, Figure 4.1 shows data going to a particular unit as coming<br />
from two different sources. For example, the value written into the PC can come<br />
from one of two adders, the data written into the register file can come from either<br />
the ALU or the data memory, and the second input to the ALU can come from<br />
a register or the immediate field of the instruction. In practice, these data lines<br />
cannot simply be wired together; we must add a logic element that chooses from<br />
among the multiple sources and steers one of those sources to its destination. This<br />
selection is commonly done with a device called a multiplexor, although this device<br />
might better be called a data selector. Appendix B describes the multiplexor, which<br />
selects from among several inputs based on the setting of its control lines. The<br />
control lines are set based primarily on information taken from the instruction<br />
being executed.<br />
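A 2-to-1 multiplexor is trivial to model in C, which makes the hardware idea concrete: the control line picks which of the two sources is steered to the destination. The function name is illustrative.

```c
#include <stdint.h>

/* Model of a 2-to-1 multiplexor (data selector): the select control line
   steers either in0 or in1 to the output. */
uint32_t mux2(uint32_t in0, uint32_t in1, int select)
{
    return select ? in1 : in0;
}
```

All three multiplexors in Figure 4.2 behave this way; they differ only in what feeds their inputs and what drives their select lines.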
The second omission in Figure 4.1 is that several of the units must be controlled<br />
depending on the type of instruction. For example, the data memory must read<br />
[Figure 4.1 diagram omitted: PC, instruction memory, register file, ALU, data memory, and two adders.]<br />
FIGURE 4.1 An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them. All instructions start by using the program counter to supply the instruction address to the instruction memory. After the instruction is fetched, the register operands used by an instruction are specified by fields of that instruction. Once the register operands have been fetched, they can be operated on to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or a compare (for a branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file. Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and branch offset are summed) or from an adder that increments the current PC by 4. The thick lines interconnecting the functional units represent buses, which consist of multiple signals. The arrows are used to guide the reader in knowing how information flows. Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot where the lines cross.<br />
on a load and write on a store. The register file must be written only on a load<br />
or an arithmetic-logical instruction. And, of course, the ALU must perform one<br />
of several operations. (Appendix B describes the detailed design of the ALU.)<br />
Like the multiplexors, control lines that are set on the basis of various fields in the<br />
instruction direct these operations.<br />
Figure 4.2 shows the datapath of Figure 4.1 with the three required multiplexors<br />
added, as well as control lines for the major functional units. A control unit,<br />
which has the instruction as an input, is used to determine how to set the control<br />
lines for the functional units and two of the multiplexors. The third multiplexor,<br />
which chooses whether PC + 4 or the branch target address is written into the PC,<br />
is controlled by the AND of a Branch control signal and the Zero output of the<br />
ALU, as described in the caption of Figure 4.2.<br />
[Figure 4.2 diagram omitted: the datapath of Figure 4.1 with three multiplexors added and the control lines ALU operation, MemWrite, MemRead, RegWrite, and Branch.]<br />
FIGURE 4.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines. The top multiplexor (“Mux”) controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled by the gate that “ANDs” together the Zero output of the ALU and a control signal that indicates that the instruction is a branch. The middle multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction (for a load or store). The added control lines are straightforward and determine the operation performed at the ALU, whether the data memory should read or write, and whether the registers should perform a write operation. The control lines are shown in color to make them easier to see.<br />
sign-extend: To increase the size of a data item by replicating the high-order sign bit of the original data item in the high-order bits of the larger, destination data item.<br />
branch target address: The address specified in a branch, which becomes the new program counter (PC) if the branch is taken. In the MIPS architecture the branch target is given by the sum of the offset field of the instruction and the address of the instruction following the branch.<br />
branch taken: A branch where the branch condition is satisfied and the program counter (PC) becomes the branch target. All unconditional jumps are taken branches.<br />
branch not taken (or untaken branch): A branch where the branch condition is false and the program counter (PC) becomes the address of the instruction that sequentially follows the branch.<br />
Next, consider the MIPS load word and store word instructions, which have the<br />
general form lw $t1,offset_value($t2) or sw $t1,offset_value($t2).<br />
These instructions compute a memory address by adding the base register,<br />
which is $t2, to the 16-bit signed offset field contained in the instruction. If the<br />
instruction is a store, the value to be stored must also be read from the register file,<br />
where it resides in $t1. If the instruction is a load, the value read from memory<br />
must be written into the register file in the specified register, which is $t1. Thus,<br />
we will need both the register file and the ALU from Figure 4.7.<br />
In addition, we will need a unit to sign-extend the 16-bit offset field in the<br />
instruction to a 32-bit signed value, and a data memory unit to read from or write<br />
to. The data memory must be written on store instructions; hence, data memory<br />
has read and write control signals, an address input, and an input for the data to be<br />
written into memory. Figure 4.8 shows these two elements.<br />
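The sign-extension unit's behavior can be captured in one line of C: replicate bit 15 of the 16-bit offset into the upper 16 bits of the 32-bit result. The cast chain below is a standard idiom; the function name is illustrative.

```c
#include <stdint.h>

/* Sign-extend a 16-bit offset to 32 bits: the conversion to int16_t
   interprets bit 15 as the sign, and widening to int32_t replicates it
   through the upper 16 bits. */
uint32_t sign_extend16(uint16_t offset)
{
    return (uint32_t)(int32_t)(int16_t)offset;
}
```

In hardware this costs no logic at all: the sign bit is simply wired to each of the 16 new high-order bit positions.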
The beq instruction has three operands, two registers that are compared for<br />
equality, and a 16-bit offset used to compute the branch target address relative<br />
to the branch instruction address. Its form is beq $t1,$t2,offset. To<br />
implement this instruction, we must compute the branch target address by adding<br />
the sign-extended offset field of the instruction to the PC. There are two details in<br />
the definition of branch instructions (see Chapter 2) to which we must pay attention:<br />
■ The instruction set architecture specifies that the base for the branch address<br />
calculation is the address of the instruction following the branch. Since we<br />
compute PC + 4 (the address of the next instruction) in the instruction fetch<br />
datapath, it is easy to use this value as the base for computing the branch<br />
target address.<br />
■ The architecture also states that the offset field is shifted left 2 bits so that it<br />
is a word offset; this shift increases the effective range of the offset field by a<br />
factor of 4.<br />
To deal with the latter complication, we will need to shift the offset field by 2.<br />
As well as computing the branch target address, we must also determine whether<br />
the next instruction is the instruction that follows sequentially or the instruction<br />
at the branch target address. When the condition is true (i.e., the operands are<br />
equal), the branch target address becomes the new PC, and we say that the branch<br />
is taken. If the operands are not equal, the incremented PC should replace the<br />
current PC (just as for any other normal instruction); in this case, we say that the<br />
branch is not taken.<br />
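Both branch operations, the target calculation and the taken/not-taken decision, can be sketched directly in C from the rules above. Function names are illustrative; the arithmetic follows the text: base PC + 4, sign-extended offset shifted left 2.

```c
#include <stdint.h>

/* Branch target = (PC + 4) + (sign-extended offset << 2).
   The shift turns the word offset into a byte offset. */
uint32_t branch_target(uint32_t pc, uint16_t offset)
{
    return (pc + 4) + (((uint32_t)(int32_t)(int16_t)offset) << 2);
}

/* Taken when the two register operands are equal; otherwise fall through
   to the incremented PC. */
uint32_t next_pc(uint32_t pc, uint16_t offset,
                 uint32_t rs_val, uint32_t rt_val)
{
    return (rs_val == rt_val) ? branch_target(pc, offset) : pc + 4;
}
```

Note that a negative (sign-extended) offset branches backward, which is what makes loops work.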
Thus, the branch datapath must do two operations: compute the branch target<br />
address and compare the register contents. (Branches also affect the instruction<br />
fetch portion of the datapath, as we will deal with shortly.) Figure 4.9 shows the<br />
structure of the datapath segment that handles branches. To compute the branch<br />
target address, the branch datapath includes a sign extension unit, from Figure 4.8,<br />
and an adder. To perform the compare, we need to use the register file shown in<br />
Figure 4.7a to supply the two register operands (although we will not need to write<br />
into the register file). In addition, the comparison can be done using the ALU we<br />
[Figure 4.9 diagram omitted: the register file and ALU performing the compare, plus a Shift-left-2 unit and adder computing the branch target from PC + 4 and the sign-extended offset.]<br />
FIGURE 4.9 The datapath for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2 bits. The unit labeled Shift left 2 is simply a routing of the signals between input and output that adds 00<sub>two</sub> to the low-order end of the sign-extended offset field; no actual shift hardware is needed, since the amount of the “shift” is constant. Since we know that the offset was sign-extended from 16 bits, the shift will throw away only “sign bits.” Control logic is used to decide whether the incremented PC or branch target should replace the PC, based on the Zero output of the ALU.<br />
Creating a Single Datapath<br />
Now that we have examined the datapath components needed for the individual<br />
instruction classes, we can combine them into a single datapath and add the control<br />
to complete the implementation. This simplest datapath will attempt to execute all<br />
instructions in one clock cycle. This means that no datapath resource can be used<br />
more than once per instruction, so any element needed more than once must be<br />
duplicated. We therefore need a memory for instructions separate from one for<br />
data. Although some of the functional units will need to be duplicated, many of the<br />
elements can be shared by different instruction flows.<br />
To share a datapath element between two different instruction classes, we may<br />
need to allow multiple connections to the input of an element, using a multiplexor<br />
and control signal to select among the multiple inputs.<br />
4.3 Building a Datapath 257<br />
EXAMPLE<br />
Building a Datapath<br />
The datapath operations of the arithmetic-logical (or R-type) instructions and the<br />
memory instructions are quite similar. The key differences are the following:<br />
■ The arithmetic-logical instructions use the ALU, with the inputs coming<br />
from the two registers. The memory instructions can also use the ALU<br />
to do the address calculation, although the second input is the sign-extended<br />
16-bit offset field from the instruction.<br />
■ The value stored into a destination register comes from the ALU (for an<br />
R-type instruction) or the memory (for a load).<br />
Show how to build a datapath for the operational portion of the memory-reference<br />
and arithmetic-logical instructions that uses a single register file<br />
and a single ALU to handle both types of instructions, adding any necessary<br />
multiplexors.<br />
ANSWER<br />
To create a datapath with only a single register file and a single ALU, we must<br />
support two different sources for the second ALU input, as well as two different<br />
sources for the data stored into the register file. Thus, one multiplexor is placed<br />
at the ALU input and another at the data input to the register file. Figure 4.10<br />
shows the operational portion of the combined datapath.<br />
Now we can combine all the pieces to make a simple datapath for the core<br />
MIPS architecture by adding the datapath for instruction fetch (Figure 4.6), the<br />
datapath for R-type and memory instructions (Figure 4.10), and the datapath<br />
for branches (Figure 4.9). Figure 4.11 shows the datapath we obtain by composing<br />
the separate pieces. The branch instruction uses the main ALU for comparison of<br />
the register operands, so we must keep the adder from Figure 4.9 for computing<br />
the branch target address. An additional multiplexor is required to select either the<br />
sequentially following instruction address (PC + 4) or the branch target address to<br />
be written into the PC.<br />
Now that we have completed this simple datapath, we can add the control unit.<br />
The control unit must be able to take inputs and generate a write signal for each<br />
state element, the selector control for each multiplexor, and the ALU control. The<br />
ALU control is different in a number of ways, and it will be useful to design it first<br />
before we design the rest of the control unit.<br />
Check<br />
Yourself<br />
I. Which of the following is correct for a load instruction? Refer to Figure 4.10.<br />
a. MemtoReg should be set to cause the data from memory to be sent to the<br />
register file.<br />
[Figure 4.10 diagram omitted: register file and ALU with the ALUSrc multiplexor on the second ALU input and the MemtoReg multiplexor on the register write data, plus the sign-extension unit and data memory.]<br />
FIGURE 4.10 The datapath for the memory instructions and the R-type instructions. This example shows how a single datapath can be assembled from the pieces in Figures 4.7 and 4.8 by adding multiplexors. Two multiplexors are needed, as described in the example.<br />
[Figure 4.11 diagram omitted: instruction fetch, register file, ALU, data memory, branch adder, and the PCSrc multiplexor combined into one datapath.]<br />
FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different instruction classes. The components come from Figures 4.6, 4.9, and 4.10. This datapath can execute the basic instructions (load-store word, ALU operations, and branches) in a single clock cycle. Just one additional multiplexor is needed to integrate branches. The support for jumps will be added later.<br />
Field 0 rs rt rd shamt funct<br />
Bit positions 31:26 25:21 20:16 15:11 10:6 5:0<br />
a. R-type instruction<br />
Field 35 or 43 rs rt address<br />
Bit positions 31:26 25:21 20:16 15:0<br />
b. Load or store instruction<br />
Field 4 rs rt address<br />
Bit positions 31:26 25:21 20:16 15:0<br />
c. Branch instruction<br />
FIGURE 4.14 The three instruction classes (R-type, load and store, and branch) use two different instruction formats. The jump instructions use another format, which we will discuss shortly. (a) Instruction format for R-format instructions, which all have an opcode of 0. These instructions have three register operands: rs, rt, and rd. Fields rs and rt are sources, and rd is the destination. The ALU function is in the funct field and is decoded by the ALU control design in the previous section. The R-type instructions that we implement are add, sub, AND, OR, and slt. The shamt field is used only for shifts; we will ignore it in this chapter. (b) Instruction format for load (opcode = 35<sub>ten</sub>) and store (opcode = 43<sub>ten</sub>) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory. (c) Instruction format for branch equal (opcode = 4). The registers rs and rt are the source registers that are compared for equality. The 16-bit address field is sign-extended, shifted, and added to the PC + 4 to compute the branch target address.<br />
opcode: The field that denotes the operation and format of an instruction.<br />
the formats of the three instruction classes: the R-type, branch, and load-store<br />
instructions. Figure 4.14 shows these formats.<br />
There are several major observations about this instruction format that we will<br />
rely on:<br />
■ The op field, which as we saw in Chapter 2 is called the opcode, is always<br />
contained in bits 31:26. We will refer to this field as Op[5:0].<br />
■ The two registers to be read are always specified by the rs and rt fields, at<br />
positions 25:21 and 20:16. This is true for the R-type instructions, branch<br />
equal, and store.<br />
■ The base register for load and store instructions is always in bit positions<br />
25:21 (rs).<br />
■ The 16-bit offset for branch equal, load, and store is always in positions 15:0.<br />
■ The destination register is in one of two places. For a load it is in bit positions<br />
20:16 (rt), while for an R-type instruction it is in bit positions 15:11 (rd).<br />
Thus, we will need to add a multiplexor to select which field of the instruction<br />
is used to indicate the register number to be written.<br />
The first design principle from Chapter 2—simplicity favors regularity—pays off<br />
here in specifying control.<br />
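The fixed field positions listed above translate directly into shift-and-mask extraction in C. The sketch below uses the bit positions from Figure 4.14; the test word 0x012A4020 encodes add $t0, $t1, $t2 (rs = 9, rt = 10, rd = 8).

```c
#include <stdint.h>

/* Extract the MIPS instruction fields at their fixed positions. */
uint32_t op_field(uint32_t inst)    { return (inst >> 26) & 0x3F; }  /* 31:26 */
uint32_t rs_field(uint32_t inst)    { return (inst >> 21) & 0x1F; }  /* 25:21 */
uint32_t rt_field(uint32_t inst)    { return (inst >> 16) & 0x1F; }  /* 20:16 */
uint32_t rd_field(uint32_t inst)    { return (inst >> 11) & 0x1F; }  /* 15:11 */
uint32_t funct_field(uint32_t inst) { return inst & 0x3F; }          /* 5:0   */
uint32_t imm_field(uint32_t inst)   { return inst & 0xFFFF; }        /* 15:0  */
```

Because the positions never move with the opcode, the hardware can extract every field in parallel before it even knows which instruction class it has: regularity pays off.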
4.4 A Simple Implementation Scheme 263<br />
[Figure 4.15 diagram omitted: the datapath of Figure 4.11 annotated with instruction-field labels, the ALU control block, and the control lines RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite.]<br />
FIGURE 4.15 The datapath of Figure 4.11 with all necessary multiplexors and all control lines identified. The control lines are shown in color. The ALU control block has also been added. The PC does not require a write control, since it is written once at the end of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address.<br />
Using this information, we can add the instruction labels and extra multiplexor<br />
(for the Write register number input of the register file) to the simple datapath.<br />
Figure 4.15 shows these additions plus the ALU control block, the write signals for<br />
state elements, the read signal for the data memory, and the control signals for the<br />
multiplexors. Since all the multiplexors have two inputs, they each require a single<br />
control line.<br />
Figure 4.15 shows seven single-bit control lines plus the 2-bit ALUOp control<br />
signal. We have already defined how the ALUOp control signal works, and it is<br />
useful to define what the seven other control signals do informally before we<br />
determine how to set these control signals during instruction execution. Figure<br />
4.16 describes the function of these seven control lines.<br />
Now that we have looked at the function of each of the control signals, we can<br />
look at how to set them. The control unit can set all but one of the control signals<br />
based solely on the opcode field of the instruction. The PCSrc control line is the<br />
exception. That control line should be asserted if the instruction is branch on equal<br />
(a decision that the control unit can make) and the Zero output of the ALU, which<br />
is used for equality comparison, is asserted. To generate the PCSrc signal, we will<br />
need to AND together a signal from the control unit, which we call Branch, with<br />
the Zero signal out of the ALU.<br />
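A software model of the main control makes the opcode-only decoding concrete. The sketch below assumes the standard settings for this four-opcode subset, derived from the signal descriptions in the text (register writes on R-format and lw, MemRead on lw, MemWrite on sw, Branch on beq); the 2-bit ALUOp values (2 for R-format, 0 for memory access, 1 for beq) follow the encoding defined earlier in the section and are an assumption here.

```c
#include <stdint.h>

struct ctrl {
    int reg_dst, alu_src, mem_to_reg, reg_write;
    int mem_read, mem_write, branch;
    int alu_op; /* the 2-bit ALUOp field, as an integer */
};

/* Main control: every signal except PCSrc is a function of the opcode
   alone. Opcodes: R-format = 0, lw = 35, sw = 43, beq = 4. */
struct ctrl decode(uint32_t opcode)
{
    struct ctrl c = {0};
    switch (opcode) {
    case 0:  c = (struct ctrl){1, 0, 0, 1, 0, 0, 0, 2}; break; /* R-format */
    case 35: c = (struct ctrl){0, 1, 1, 1, 1, 0, 0, 0}; break; /* lw */
    case 43: c = (struct ctrl){0, 1, 0, 0, 0, 1, 0, 0}; break; /* sw */
    case 4:  c = (struct ctrl){0, 0, 0, 0, 0, 0, 1, 1}; break; /* beq */
    }
    return c;
}

/* PCSrc is derived, not decoded: Branch ANDed with the ALU's Zero output. */
int pc_src(struct ctrl c, int zero)
{
    return c.branch && zero;
}
```

For sw, RegDst and MemtoReg are don't-cares (modeled as 0 here), since no register is written.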
[Figure 4.17 diagram omitted: the datapath of Figure 4.15 with the control unit added; its opcode input and control outputs are described in the caption below.]<br />
FIGURE 4.17 The simple datapath with the control unit. The input to the control unit is the 6-bit opcode field from the instruction. The outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemtoReg), three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1-bit signal used in determining whether to possibly branch (Branch), and a 2-bit control signal for the ALU (ALUOp). An AND gate is used to combine the branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.<br />
think of four steps to execute the instruction; these steps are ordered by the flow<br />
of information:<br />
1. The instruction is fetched, and the PC is incremented.<br />
2. Two registers, $t2 and $t3, are read from the register file; also, the main<br />
control unit computes the setting of the control lines during this step.<br />
3. The ALU operates on the data read from the register file, using the function<br />
code (bits 5:0, which is the funct field, of the instruction) to generate the<br />
ALU function.<br />
3. The ALU computes the sum of the value read from the register file and the<br />
sign-extended, lower 16 bits of the instruction (offset).<br />
4. The sum from the ALU is used as the address for the data memory.<br />
5. The data from the memory unit is written into the register file; the register<br />
destination is given by bits 20:16 of the instruction ($t1).<br />
Finally, we can show the operation of the branch-on-equal instruction, such as<br />
beq $t1, $t2, offset, in the same fashion. It operates much like an R-format<br />
instruction, but the ALU output is used to determine whether the PC is written with<br />
PC + 4 or the branch target address. Figure 4.21 shows the four steps in execution:<br />
1. An instruction is fetched from the instruction memory, and the PC is<br />
incremented.<br />
[Figure 4.21 diagram omitted: the single-cycle datapath with the units and control lines active for beq highlighted.]<br />
FIGURE 4.21 The datapath in operation for a branch-on-equal instruction. The control lines, datapath units, and connections that are active are highlighted. After using the register file and ALU to perform the compare, the Zero output is used to select the next program counter from between the two candidates.<br />
single-cycle implementation: Also called single clock cycle implementation. An implementation in which an instruction is executed in one clock cycle. While easy to understand, it is too slow to be practical.<br />
Now that we have a single-cycle implementation of most of the MIPS core<br />
instruction set, let’s add the jump instruction to show how the basic datapath and<br />
control can be extended to handle other instructions in the instruction set.<br />
EXAMPLE<br />
Implementing Jumps<br />
Figure 4.17 shows the implementation of many of the instructions we looked at<br />
in Chapter 2. One class of instructions missing is that of the jump instruction.<br />
Extend the datapath and control of Figure 4.17 to include the jump instruction.<br />
Describe how to set any new control lines.<br />
ANSWER<br />
The jump instruction, shown in Figure 4.23, looks somewhat like a branch<br />
instruction but computes the target PC differently and is not conditional. Like<br />
a branch, the low-order 2 bits of a jump address are always 00<sub>two</sub>. The next<br />
lower 26 bits of this 32-bit address come from the 26-bit immediate field in the<br />
instruction. The upper 4 bits of the address that should replace the PC come<br />
from the PC of the jump instruction plus 4. Thus, we can implement a jump by<br />
storing into the PC the concatenation of<br />
■ the upper 4 bits of the current PC + 4 (these are bits 31:28 of the<br />
sequentially following instruction address)<br />
■ the 26-bit immediate field of the jump instruction<br />
■ the bits 00<sub>two</sub><br />
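The concatenation described in these bullets can be sketched in a few lines of Python (an illustration of the bit manipulation only, not the hardware; the function name and the use of Python integers for 32-bit values are our assumptions):

```python
def jump_target(pc, instr):
    """Form the next PC for a MIPS j instruction from the current PC and
    the 32-bit encoded instruction word."""
    imm26 = instr & 0x03FFFFFF       # the 26-bit immediate (address) field
    upper4 = (pc + 4) & 0xF0000000   # bits 31:28 of the sequentially next address
    return upper4 | (imm26 << 2)     # shifting left by 2 appends the 00 bits
```

For instance, a j encoded as 0x08000100 (opcode 2, address field 0x100) fetched at PC 0x00400000 yields a jump target of 0x00000400.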
Figure 4.24 shows the control for jump added to Figure 4.17. An<br />
additional multiplexor is used to select the source for the new PC value, which<br />
is either the incremented PC (PC + 4), the branch target PC, or the jump target<br />
PC. One additional control signal is needed for the additional multiplexor. This<br />
control signal, called Jump, is asserted only when the instruction is a jump—<br />
that is, when the opcode is 2.<br />
Field           000010    address<br />
Bit positions   31:26     25:0<br />
FIGURE 4.23 Instruction format for the jump instruction (opcode = 2). The destination<br />
address for a jump instruction is formed by concatenating the upper 4 bits of the current PC + 4 to the 26-bit<br />
address field in the jump instruction and adding 00 as the 2 low-order bits.
4.4 A Simple Implementation Scheme 271<br />
FIGURE 4.24 The simple control and datapath are extended to handle the jump instruction. An additional multiplexor (at<br />
the upper right) is used to choose between the jump target and either the branch target or the sequential instruction following this one. This<br />
multiplexor is controlled by the jump control signal. The jump target address is obtained by shifting the lower 26 bits of the jump instruction<br />
left 2 bits, effectively adding 00 as the low-order bits, and then concatenating the upper 4 bits of PC + 4 as the high-order bits, thus yielding a<br />
32-bit address.<br />
Why a Single-Cycle Implementation Is Not Used Today<br />
Although the single-cycle design will work correctly, it would not be used in<br />
modern designs because it is inefficient. To see why this is so, notice that the clock<br />
cycle must have the same length for every instruction in this single-cycle design.<br />
Of course, the longest possible path in the processor determines the clock cycle.<br />
This path is almost certainly a load instruction, which uses five functional units<br />
in series: the instruction memory, the register file, the ALU, the data memory, and<br />
the register file. Although the CPI is 1 (see Chapter 1), the overall performance of<br />
a single-cycle implementation is likely to be poor, since the clock cycle is too long.<br />
The penalty for using the single-cycle design with a fixed clock cycle is significant,<br />
but might be considered acceptable for this small instruction set. Historically, early
computers with very simple instruction sets did use this implementation technique.<br />
However, if we tried to implement the floating-point unit or an instruction set with<br />
more complex instructions, this single-cycle design wouldn’t work well at all.<br />
Because we must assume that the clock cycle is equal to the worst-case delay<br />
for all instructions, it’s useless to try implementation techniques that reduce the<br />
delay of the common case but do not improve the worst-case cycle time. A single-cycle<br />
implementation thus violates the great idea from Chapter 1 of making the<br />
common case fast.<br />
In the next section, we’ll look at another implementation technique, called<br />
pipelining, that uses a datapath very similar to the single-cycle datapath but is<br />
much more efficient by having a much higher throughput. Pipelining improves<br />
efficiency by executing multiple instructions simultaneously.<br />
Check<br />
Yourself<br />
Look at the control signals in Figure 4.22. Can you combine any together? Can any<br />
control signal output in the figure be replaced by the inverse of another? (Hint: take<br />
into account the don’t cares.) If so, can you use one signal for the other without<br />
adding an inverter?<br />
4.5 An Overview of Pipelining<br />
Never waste time.<br />
American proverb<br />
pipelining An<br />
implementation<br />
technique in which<br />
multiple instructions are<br />
overlapped in execution,<br />
much like an assembly<br />
line.<br />
Pipelining is an implementation technique in which multiple instructions are<br />
overlapped in execution. Today, pipelining is nearly universal.<br />
This section relies heavily on one analogy to give an overview of the pipelining<br />
terms and issues. If you are interested in just the big picture, you should concentrate<br />
on this section and then skip to Sections 4.10 and 4.11 to see an introduction to the<br />
advanced pipelining techniques used in recent processors such as the Intel Core i7<br />
and ARM Cortex-A8. If you are interested in exploring the anatomy of a pipelined<br />
computer, this section is a good introduction to Sections 4.6 through 4.9.<br />
Anyone who has done a lot of laundry has intuitively used pipelining. The nonpipelined<br />
approach to laundry would be as follows:<br />
1. Place one dirty load of clothes in the washer.<br />
2. When the washer is finished, place the wet load in the dryer.<br />
3. When the dryer is finished, place the dry load on a table and fold.<br />
4. When folding is finished, ask your roommate to put the clothes away.<br />
When your roommate is done, start over with the next dirty load.<br />
The pipelined approach takes much less time, as Figure 4.25 shows. As soon<br />
as the washer is finished with the first load and it is placed in the dryer, you load the<br />
washer with the second dirty load. When the first load is dry, you place it on the<br />
table to start folding, move the wet load to the dryer, and put the next dirty load
pipeline, in this case four: washing, drying, folding, and putting away. Therefore,<br />
pipelined laundry is potentially four times faster than nonpipelined: 20 loads would<br />
take about 5 times as long as 1 load, while 20 loads of sequential laundry takes 20<br />
times as long as 1 load. It’s only 2.3 times faster in Figure 4.25, because we only<br />
show 4 loads. Notice that at the beginning and end of the workload in the pipelined<br />
version in Figure 4.25, the pipeline is not completely full; this start-up and wind-down<br />
affects performance when the number of tasks is not large compared to the<br />
number of stages in the pipeline. If the number of loads is much larger than 4, then<br />
the stages will be full most of the time and the increase in throughput will be very<br />
close to 4.<br />
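The laundry arithmetic in this paragraph can be checked with a short sketch (assuming, as the analogy does, that every stage takes exactly one time unit; the function names are ours):

```python
def sequential_time(loads, stages=4):
    # without pipelining, each load passes through all stages before the next starts
    return loads * stages

def pipelined_time(loads, stages=4):
    # fill the pipeline once, then one load finishes per time unit
    return stages + (loads - 1)
```

For 4 loads the speed-up is 16/7 ≈ 2.3, matching Figure 4.25; for 20 loads it is 80/23 ≈ 3.5, approaching the stage count of 4.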
The same principles apply to processors where we pipeline instruction execution.<br />
MIPS instructions classically take five steps:<br />
1. Fetch instruction from memory.<br />
2. Read registers while decoding the instruction. The regular format of MIPS<br />
instructions allows reading and decoding to occur simultaneously.<br />
3. Execute the operation or calculate an address.<br />
4. Access an oper<strong>and</strong> in data memory.<br />
5. Write the result into a register.<br />
Hence, the MIPS pipeline we explore in this chapter has five stages. The following<br />
example shows that pipelining speeds up instruction execution just as it speeds up<br />
the laundry.<br />
EXAMPLE<br />
ANSWER<br />
Single-Cycle versus Pipelined Performance<br />
To make this discussion concrete, let’s create a pipeline. In this example, and in<br />
the rest of this chapter, we limit our attention to eight instructions: load word<br />
(lw), store word (sw), add (add), subtract (sub), AND (and), OR (or), set<br />
less than (slt), and branch on equal (beq).<br />
Compare the average time between instructions of a single-cycle<br />
implementation, in which all instructions take one clock cycle, to a pipelined<br />
implementation. The operation times for the major functional units in this<br />
example are 200 ps for memory access, 200 ps for ALU operation, and 100 ps<br />
for register file read or write. In the single-cycle model, every instruction takes<br />
exactly one clock cycle, so the clock cycle must be stretched to accommodate<br />
the slowest instruction.<br />
Figure 4.26 shows the time required for each of the eight instructions.<br />
The single-cycle design must allow for the slowest instruction—in Figure<br />
4.26 it is lw—so the time required for every instruction is 800 ps. Similarly
to Figure 4.25, Figure 4.27 compares nonpipelined and pipelined execution<br />
of three load word instructions. Thus, the time between the first and fourth<br />
instructions in the nonpipelined design is 3 × 800 ps or 2400 ps.<br />
All the pipeline stages take a single clock cycle, so the clock cycle must be long<br />
enough to accommodate the slowest operation. Just as the single-cycle design<br />
must take the worst-case clock cycle of 800 ps, even though some instructions<br />
can be as fast as 500 ps, the pipelined execution clock cycle must have the<br />
worst-case clock cycle of 200 ps, even though some stages take only 100 ps.<br />
Pipelining still offers a fourfold performance improvement: the time between<br />
the first and fourth instructions is 3 × 200 ps or 600 ps.<br />
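The two timing claims in this answer reduce to one line of arithmetic (a sketch; the cycle times are the 800 ps and 200 ps derived above):

```python
def time_between_first_and_nth(n, cycle_ps):
    # one instruction completes per clock cycle, so the gap is (n - 1) cycles
    return (n - 1) * cycle_ps

nonpipelined_gap = time_between_first_and_nth(4, 800)  # 2400 ps
pipelined_gap = time_between_first_and_nth(4, 200)     # 600 ps
```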
We can turn the pipelining speed-up discussion above into a formula. If the<br />
stages are perfectly balanced, then the time between instructions on the pipelined<br />
processor—assuming ideal conditions—is equal to<br />
Time between instructions<sub>pipelined</sub> = Time between instructions<sub>nonpipelined</sub> / Number of pipe stages<br />
Under ideal conditions and with a large number of instructions, the speed-up<br />
from pipelining is approximately equal to the number of pipe stages; a five-stage<br />
pipeline is nearly five times faster.<br />
The formula suggests that a five-stage pipeline should offer nearly a fivefold<br />
improvement over the 800 ps nonpipelined time, or a 160 ps clock cycle. The<br />
example shows, however, that the stages may be imperfectly balanced. Moreover,<br />
pipelining involves some overhead, the source of which will be clearer shortly.<br />
Thus, the time per instruction in the pipelined processor will exceed the minimum<br />
possible, and speed-up will be less than the number of pipeline stages.<br />
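Plugging the example’s numbers into this formula makes the imbalance visible (a sketch; the stage times are taken from Figure 4.26):

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

single_cycle_ps = sum(STAGE_PS.values())              # 800 ps: the lw worst-case path
ideal_pipelined_ps = single_cycle_ps / len(STAGE_PS)  # 160 ps if stages were balanced
actual_pipelined_ps = max(STAGE_PS.values())          # 200 ps: the slowest stage sets the clock
```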
Instruction class                   Instruction  Register  ALU        Data    Register  Total<br />
                                    fetch        read      operation  access  write     time<br />
Load word (lw)                      200 ps       100 ps    200 ps     200 ps  100 ps    800 ps<br />
Store word (sw)                     200 ps       100 ps    200 ps     200 ps            700 ps<br />
R-format (add, sub, AND, OR, slt)   200 ps       100 ps    200 ps             100 ps    600 ps<br />
Branch (beq)                        200 ps       100 ps    200 ps                       500 ps<br />
FIGURE 4.26 Total time for each instruction calculated from the time for each component.<br />
This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no<br />
delay.
Pipelining improves performance by increasing instruction throughput, as<br />
opposed to decreasing the execution time of an individual instruction, but instruction<br />
throughput is the important metric because real programs execute billions of<br />
instructions.<br />
<strong>Design</strong>ing Instruction Sets for Pipelining<br />
Even with this simple explanation of pipelining, we can get insight into the design<br />
of the MIPS instruction set, which was designed for pipelined execution.<br />
First, all MIPS instructions are the same length. This restriction makes it much<br />
easier to fetch instructions in the first pipeline stage and to decode them in the<br />
second stage. In an instruction set like the x86, where instructions vary from 1 byte<br />
to 15 bytes, pipelining is considerably more challenging. Recent implementations<br />
of the x86 architecture actually translate x86 instructions into simple operations<br />
that look like MIPS instructions and then pipeline the simple operations rather<br />
than the native x86 instructions! (See Section 4.10.)<br />
Second, MIPS has only a few instruction formats, with the source register fields<br />
being located in the same place in each instruction. This symmetry means that the<br />
second stage can begin reading the register file at the same time that the hardware<br />
is determining what type of instruction was fetched. If MIPS instruction formats<br />
were not symmetric, we would need to split stage 2, resulting in six pipeline stages.<br />
We will shortly see the downside of longer pipelines.<br />
Third, memory oper<strong>and</strong>s only appear in loads or stores in MIPS. This restriction<br />
means we can use the execute stage to calculate the memory address and then<br />
access memory in the following stage. If we could operate on the operands in<br />
memory, as in the x86, stages 3 and 4 would expand to an address stage, memory<br />
stage, and then execute stage.<br />
Fourth, as discussed in Chapter 2, operands must be aligned in memory. Hence,<br />
we need not worry about a single data transfer instruction requiring two data<br />
memory accesses; the requested data can be transferred between processor and<br />
memory in a single pipeline stage.<br />
Pipeline Hazards<br />
There are situations in pipelining when the next instruction cannot execute in the<br />
following clock cycle. These events are called hazards, and there are three different<br />
types.<br />
Hazards<br />
The first hazard is called a structural hazard. It means that the hardware cannot<br />
support the combination of instructions that we want to execute in the same clock<br />
cycle. A structural hazard in the laundry room would occur if we used a washer-dryer<br />
combination instead of a separate washer and dryer, or if our roommate was<br />
busy doing something else <strong>and</strong> wouldn’t put clothes away. Our carefully scheduled<br />
pipeline plans would then be foiled.<br />
structural hazard When<br />
a planned instruction<br />
cannot execute in the<br />
proper clock cycle because<br />
the hardware does not<br />
support the combination<br />
of instructions that are set<br />
to execute.
As we said above, the MIPS instruction set was designed to be pipelined,<br />
making it fairly easy for designers to avoid structural hazards when designing a<br />
pipeline. Suppose, however, that we had a single memory instead of two memories.<br />
If the pipeline in Figure 4.27 had a fourth instruction, we would see that in the<br />
same clock cycle the first instruction is accessing data from memory while the<br />
fourth instruction is fetching an instruction from that same memory. Without two<br />
memories, our pipeline could have a structural hazard.<br />
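This clash can be made concrete with a toy model (our own sketch, not the book’s hardware): instruction i occupies stage cycle − i, and a single shared memory is oversubscribed whenever one instruction is in IF while another is in MEM.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")

def shared_memory_conflicts(num_instrs):
    """Return the 0-based clock cycles in which an instruction fetch and a
    data access would both need a single shared memory."""
    conflicts = []
    for cycle in range(num_instrs + len(STAGES) - 1):
        # the set of stages occupied in this cycle by the in-flight instructions
        active = {STAGES[cycle - i] for i in range(num_instrs)
                  if 0 <= cycle - i < len(STAGES)}
        if "IF" in active and "MEM" in active:
            conflicts.append(cycle)
    return conflicts
```

With four instructions the only collision is in cycle 3, exactly when the first instruction’s MEM overlaps the fourth instruction’s IF; with three instructions there is none, which is why the three loads of Figure 4.27 never expose the hazard.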
data hazard Also<br />
called a pipeline data<br />
hazard. When a planned<br />
instruction cannot<br />
execute in the proper<br />
clock cycle because data<br />
that is needed to execute<br />
the instruction is not yet<br />
available.<br />
forwarding Also called<br />
bypassing. A method of<br />
resolving a data hazard<br />
by retrieving the missing<br />
data element from<br />
internal buffers rather<br />
than waiting for it to<br />
arrive from programmer-visible<br />
registers or<br />
memory.<br />
Data Hazards<br />
Data hazards occur when the pipeline must be stalled because one step must wait<br />
for another to complete. Suppose you found a sock at the folding station for which<br />
no match existed. One possible strategy is to run down to your room and search<br />
through your clothes bureau to see if you can find the match. Obviously, while you<br />
are doing the search, loads that have completed drying and are ready to fold, as well<br />
as those that have finished washing and are ready to dry, must wait.<br />
In a computer pipeline, data hazards arise from the dependence of one<br />
instruction on an earlier one that is still in the pipeline (a relationship that does not<br />
really exist when doing laundry). For example, suppose we have an add instruction<br />
followed immediately by a subtract instruction that uses the sum ($s0):<br />
add $s0, $t0, $t1<br />
sub $t2, $s0, $t3<br />
Without intervention, a data hazard could severely stall the pipeline. The add<br />
instruction doesn’t write its result until the fifth stage, meaning that we would have<br />
to waste three clock cycles in the pipeline.<br />
Although we could try to rely on compilers to remove all such hazards, the<br />
results would not be satisfactory. These dependences happen just too often and the<br />
delay is just too long to expect the compiler to rescue us from this dilemma.<br />
The primary solution is based on the observation that we don’t need to wait for<br />
the instruction to complete before trying to resolve the data hazard. For the code<br />
sequence above, as soon as the ALU creates the sum for the add, we can supply it as<br />
an input for the subtract. Adding extra hardware to retrieve the missing item early<br />
from the internal resources is called forwarding or bypassing.<br />
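At its core, the decision to forward is a register-number comparison (a simplified sketch of our own; the full forwarding-unit conditions appear in Section 4.7):

```python
def needs_forwarding(producer_dest, consumer_sources):
    # a read-after-write dependence: the earlier instruction's destination
    # register is a source of the next instruction, so the ALU result must
    # be bypassed instead of read (stale) from the register file
    return producer_dest in consumer_sources
```

For add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, needs_forwarding("$s0", ("$s0", "$t3")) is true, so the adder’s output feeds the subtract’s ALU input directly.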
EXAMPLE<br />
Forwarding with Two Instructions<br />
For the two instructions above, show what pipeline stages would be connected<br />
by forwarding. Use the drawing in Figure 4.28 to represent the datapath during<br />
the five stages of the pipeline. Align a copy of the datapath for each instruction,<br />
similar to the laundry pipeline in Figure 4.25.
Time (ps):           200   400   600   800   1000<br />
add $s0, $t0, $t1:   IF    ID    EX    MEM   WB<br />
FIGURE 4.28 Graphical representation of the instruction pipeline, similar in spirit to<br />
the laundry pipeline in Figure 4.25. Here we use symbols representing the physical resources with<br />
the abbreviations for pipeline stages used throughout the chapter. The symbols for the five stages: IF for<br />
the instruction fetch stage, with the box representing instruction memory; ID for the instruction decode/<br />
register file read stage, with the drawing showing the register file being read; EX for the execution stage,<br />
with the drawing representing the ALU; MEM for the memory access stage, with the box representing data<br />
memory; and WB for the write-back stage, with the drawing showing the register file being written. The<br />
shading indicates the element is used by the instruction. Hence, MEM has a white background because add<br />
does not access the data memory. Shading on the right half of the register file or memory means the element<br />
is read in that stage, and shading of the left half means it is written in that stage. Hence the right half of ID is<br />
shaded in the second stage because the register file is read, and the left half of WB is shaded in the fifth stage<br />
because the register file is written.<br />
Figure 4.29 shows the connection to forward the value in $s0 after the<br />
execution stage of the add instruction as input to the execution stage of the<br />
sub instruction.<br />
ANSWER<br />
In this graphical representation of events, forwarding paths are valid only if the<br />
destination stage is later in time than the source stage. For example, there cannot<br />
be a valid forwarding path from the output of the memory access stage in the first<br />
instruction to the input of the execution stage of the following, since that would<br />
mean going backward in time.<br />
Forwarding works very well and is described in detail in Section 4.7. It cannot<br />
prevent all pipeline stalls, however. For example, suppose the first instruction was a<br />
load of $s0 instead of an add. As we can imagine from looking at Figure 4.29, the<br />
Program execution order (in instructions); time (ps): 200  400  600  800  1000<br />
add $s0, $t0, $t1:   IF    ID    EX    MEM   WB<br />
sub $t2, $s0, $t3:         IF    ID    EX    MEM   WB<br />
FIGURE 4.29 Graphical representation of forwarding. The connection shows the forwarding path<br />
from the output of the EX stage of add to the input of the EX stage for sub, replacing the value from register<br />
$s0 read in the second stage of sub.
Find the hazards in the preceding code segment <strong>and</strong> reorder the instructions<br />
to avoid any pipeline stalls.<br />
Both add instructions have a hazard because of their respective dependence<br />
on the immediately preceding lw instruction. Notice that bypassing eliminates<br />
several other potential hazards, including the dependence of the first add on<br />
the first lw and any hazards for store instructions. Moving up the third lw<br />
instruction to become the third instruction eliminates both hazards:<br />
ANSWER<br />
lw $t1, 0($t0)<br />
lw $t2, 4($t0)<br />
lw $t4, 8($t0)<br />
add $t3, $t1, $t2<br />
sw $t3, 12($t0)<br />
add $t5, $t1, $t4<br />
sw $t5, 16($t0)<br />
On a pipelined processor with forwarding, the reordered sequence will<br />
complete in two fewer cycles than the original version.<br />
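The two-cycle saving can be checked with a small load-use stall counter (a sketch assuming full forwarding, so only a load immediately followed by a consumer of its result stalls, for one cycle; the original ordering below is reconstructed from the answer’s description and is our assumption):

```python
def load_use_stalls(instrs):
    """Count one-cycle load-use stalls. Each instruction is a tuple
    (op, dest_or_stored_reg, source_regs...); with forwarding, stores
    never stall, so a store's data register is deliberately not a trigger."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2:]:
            stalls += 1
    return stalls

reordered = [
    ("lw",  "$t1", "$t0"), ("lw", "$t2", "$t0"), ("lw", "$t4", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", "$t3", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", "$t5", "$t0"),
]
# presumed original order: lw, lw, add, sw, lw, add, sw
original = reordered[:2] + reordered[3:5] + [reordered[2]] + reordered[5:]
```

Here load_use_stalls(original) is 2 and load_use_stalls(reordered) is 0: two fewer cycles, as stated.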
Forwarding yields another insight into the MIPS architecture, in addition to the<br />
four mentioned on page 277. Each MIPS instruction writes at most one result and<br />
does this in the last stage of the pipeline. Forwarding is harder if there are multiple<br />
results to forward per instruction or if there is a need to write a result early on in<br />
instruction execution.<br />
Elaboration: The name “forwarding” comes from the idea that the result is passed<br />
forward from an earlier instruction to a later instruction. “Bypassing” comes from<br />
passing the result around the register file to the desired unit.<br />
Control Hazards<br />
The third type of hazard is called a control hazard, arising from the need to make a<br />
decision based on the results of one instruction while others are executing.<br />
Suppose our laundry crew was given the happy task of cleaning the uniforms<br />
of a football team. Given how filthy the laundry is, we need to determine whether<br />
the detergent and water temperature setting we select is strong enough to get the<br />
uniforms clean but not so strong that the uniforms wear out sooner. In our laundry<br />
pipeline, we have to wait until after the second stage to examine the dry uniform to<br />
see if we need to change the washer setup or not. What to do?<br />
Here is the first of two solutions to control hazards in the laundry room and its<br />
computer equivalent.<br />
Stall: Just operate sequentially until the first batch is dry and then repeat until<br />
you have the right formula.<br />
This conservative option certainly works, but it is slow.<br />
control hazard Also<br />
called branch hazard.<br />
When the proper<br />
instruction cannot<br />
execute in the proper<br />
pipeline clock cycle<br />
because the instruction<br />
that was fetched is not the<br />
one that is needed; that<br />
is, the flow of instruction<br />
addresses is not what the<br />
pipeline expected.
4.6 Pipelined Datapath and Control 287<br />
[Figure 4.33 datapath diagram. Pipeline stage labels: IF: Instruction fetch; ID: Instruction decode/register file read; EX: Execute/address calculation; MEM: Memory access; WB: Write back.]<br />
FIGURE 4.33 The single-cycle datapath from Section 4.4 (similar to Figure 4.17). Each step of the instruction can be mapped<br />
onto the datapath from left to right. The only exceptions are the update of the PC and the write-back step, shown in color, which sends either<br />
the ALU result or the data from memory to the left to be written into the register file. (Normally we use color lines for control, but these are<br />
data lines.)<br />
five stages as they complete execution. Returning to our laundry analogy, clothes<br />
get cleaner, drier, and more organized as they move through the line, and they<br />
never move backward.<br />
There are, however, two exceptions to this left-to-right flow of instructions:<br />
■ The write-back stage, which places the result back into the register file in the<br />
middle of the datapath<br />
■ The selection of the next value of the PC, choosing between the incremented<br />
PC and the branch address from the MEM stage<br />
Data flowing from right to left does not affect the current instruction; these<br />
reverse data movements influence only later instructions in the pipeline. Note that
the first right-to-left flow of data can lead to data hazards and the second leads to<br />
control hazards.<br />
One way to show what happens in pipelined execution is to pretend that each<br />
instruction has its own datapath, and then to place these datapaths on a timeline to<br />
show their relationship. Figure 4.34 shows the execution of the instructions in Figure<br />
4.27 by displaying their private datapaths on a common timeline. We use a stylized<br />
version of the datapath in Figure 4.33 to show the relationships in Figure 4.34.<br />
Figure 4.34 seems to suggest that three instructions need three datapaths.<br />
Instead, we add registers to hold data so that portions of a single datapath can be<br />
shared during instruction execution.<br />
For example, as Figure 4.34 shows, the instruction memory is used during<br />
only one of the five stages of an instruction, allowing it to be shared by following<br />
instructions during the other four stages. To retain the value of an individual<br />
instruction for its other four stages, the value read from instruction memory must<br />
be saved in a register. Similar arguments apply to every pipeline stage, so we must<br />
place registers wherever there are dividing lines between stages in Figure 4.33.<br />
Returning to our laundry analogy, we might have a basket between each pair of<br />
stages to hold the clothes for the next step.<br />
Program execution order (in instructions) versus time (in clock cycles CC 1 through CC 7):<br />
lw $1, 100($0):   IM    Reg   ALU   DM    Reg<br />
lw $2, 200($0):         IM    Reg   ALU   DM    Reg<br />
lw $3, 300($0):               IM    Reg   ALU   DM    Reg<br />
FIGURE 4.34 Instructions being executed using the single-cycle datapath in Figure 4.33,<br />
assuming pipelined execution. Similar to Figures 4.28 through 4.30, this figure pretends that each<br />
instruction has its own datapath, and shades each portion according to use. Unlike those figures, each stage<br />
is labeled by the physical resource used in that stage, corresponding to the portions of the datapath in Figure<br />
4.33. IM represents the instruction memory and the PC in the instruction fetch stage, Reg stands for the<br />
register file and sign extender in the instruction decode/register file read stage (ID), and so on. To maintain<br />
proper time order, this stylized datapath breaks the register file into two logical parts: registers read during<br />
register fetch (ID) and registers written during write back (WB). This dual use is represented by drawing<br />
the unshaded left half of the register file using dashed lines in the ID stage, when it is not being written, and<br />
the unshaded right half in dashed lines in the WB stage, when it is not being read. As before, we assume the<br />
register file is written in the first half of the clock cycle and the register file is read during the second half.<br />
Figure 4.35 shows the pipelined datapath with the pipeline registers highlighted.<br />
All instructions advance during each clock cycle from one pipeline register<br />
to the next. The registers are named for the two stages separated by that register.<br />
For example, the pipeline register between the IF and ID stages is called IF/ID.<br />
Notice that there is no pipeline register at the end of the write-back stage. All<br />
instructions must update some state in the processor—the register file, memory, or<br />
the PC—so a separate pipeline register is redundant to the state that is updated. For<br />
example, a load instruction will place its result in 1 of the 32 registers, and any later<br />
instruction that needs that data will simply read the appropriate register.<br />
Of course, every instruction updates the PC, whether by incrementing it or by<br />
setting it to a branch destination address. The PC can be thought of as a pipeline<br />
register: one that feeds the IF stage of the pipeline. Unlike the shaded pipeline<br />
registers in Figure 4.35, however, the PC is part of the visible architectural state;<br />
its contents must be saved when an exception occurs, while the contents of the<br />
pipeline registers can be discarded. In the laundry analogy, you could think of the<br />
PC as corresponding to the basket that holds the load of dirty clothes before the<br />
wash step.<br />
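The advance of instructions from register to register can be sketched as a shift register, with the PC acting as the register that feeds the IF stage. The toy model below is our illustration, not the book's hardware; each pipeline register simply holds the PC of the instruction currently occupying it.<br />

```python
# Toy model of instructions marching through the pipeline registers of
# Figure 4.35. Each slot holds the PC of the instruction occupying it;
# None means the slot is still empty while the pipeline fills.
def tick(pc, if_id, id_ex, ex_mem):
    """One clock edge: fetch a new instruction and shift the rest forward.
    Whatever was in MEM/WB simply leaves the pipeline; its work is done,
    since it has already updated the register file, memory, or the PC."""
    return pc + 4, pc, if_id, id_ex, ex_mem

pc, if_id, id_ex, ex_mem, mem_wb = 0, None, None, None, None
for _ in range(4):
    pc, if_id, id_ex, ex_mem, mem_wb = tick(pc, if_id, id_ex, ex_mem)

# After four clock cycles the instruction fetched at address 0 has reached
# the MEM/WB register, and four younger instructions are behind it.
assert (pc, if_id, id_ex, ex_mem, mem_wb) == (16, 12, 8, 4, 0)
```

Note that no state is kept past MEM/WB, matching the observation above that a register after write-back would be redundant.<br />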
To show how pipelining works, throughout this chapter we show sequences<br />
of figures to demonstrate operation over time. These extra pages might seem to<br />
require much more time for you to understand. Fear not; the sequences take much<br />
FIGURE 4.35 The pipelined version of the datapath in Figure 4.33. The pipeline registers, in color, separate each pipeline stage.<br />
They are labeled by the stages that they separate; for example, the first is labeled IF/ID because it separates the instruction fetch and instruction<br />
decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the<br />
IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC<br />
address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64<br />
bits, respectively.
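The widths quoted in this caption can be checked by adding up the 32-bit fields each register must carry; the breakdown below is our reading of the datapath (one plausible accounting, not spelled out field-by-field in the caption).<br />

```python
# Tallying the pipeline-register widths for the datapath of Figure 4.35.
WORD = 32
if_id  = WORD + WORD     # incremented PC + fetched instruction
id_ex  = 4 * WORD        # PC, Read data 1, Read data 2, sign-extended immediate
ex_mem = 3 * WORD + 1    # branch target, ALU result, Read data 2, plus 1-bit Zero
mem_wb = 2 * WORD        # memory Read data + ALU result
assert (if_id, id_ex, ex_mem, mem_wb) == (64, 128, 97, 64)
```

The odd 97-bit width of EX/MEM comes from the single-bit Zero flag riding along with three 32-bit values.<br />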
290 Chapter 4 The Processor<br />
less time than it might appear, because you can compare them to see what changes<br />
occur in each clock cycle. Section 4.7 describes what happens when there are data<br />
hazards between pipelined instructions; ignore them for now.<br />
Figures 4.36 through 4.38, our first sequence, show the active portions of the<br />
datapath highlighted as a load instruction goes through the five stages of pipelined<br />
execution. We show a load first because it is active in all five stages. As in Figures<br />
4.28 through 4.30, we highlight the right half of registers or memory when they are<br />
being read and highlight the left half when they are being written.<br />
We show the instruction abbreviation lw with the name of the pipe stage that is<br />
active in each figure. The five stages are the following:<br />
1. Instruction fetch: The top portion of Figure 4.36 shows the instruction being<br />
read from memory using the address in the PC and then being placed in the<br />
IF/ID pipeline register. The PC address is incremented by 4 and then written<br />
back into the PC to be ready for the next clock cycle. This incremented<br />
address is also saved in the IF/ID pipeline register in case it is needed later<br />
for an instruction, such as beq. The computer cannot know which type of<br />
instruction is being fetched, so it must prepare for any instruction, passing<br />
potentially needed information down the pipeline.<br />
2. Instruction decode and register file read: The bottom portion of Figure 4.36<br />
shows the instruction portion of the IF/ID pipeline register supplying the<br />
16-bit immediate field, which is sign-extended to 32 bits, and the register<br />
numbers to read the two registers. All three values are stored in the ID/EX<br />
pipeline register, along with the incremented PC address. We again transfer<br />
everything that might be needed by any instruction during a later clock<br />
cycle.<br />
3. Execute or address calculation: Figure 4.37 shows that the load instruction<br />
reads the contents of register 1 and the sign-extended immediate from the<br />
ID/EX pipeline register and adds them using the ALU. That sum is placed in<br />
the EX/MEM pipeline register.<br />
4. Memory access: The top portion of Figure 4.38 shows the load instruction<br />
reading the data memory using the address from the EX/MEM pipeline<br />
register and loading the data into the MEM/WB pipeline register.<br />
5. Write-back: The bottom portion of Figure 4.38 shows the final step: reading<br />
the data from the MEM/WB pipeline register and writing it into the register<br />
file in the middle of the figure.<br />
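The five steps above can be condensed into a small trace. The register and memory contents here are made up purely for illustration (say lw $10, 20($1) with $1 = 1000 and Mem[1020] = 99); the point is that each stage consumes only values handed to it by the previous pipeline register.<br />

```python
# Tracing lw $10, 20($1) through the five stages, with each dict standing in
# for the pipeline register written at the end of that stage.
regs = {1: 1000, 10: 0}    # register file ($1 and $10); made-up contents
mem = {1020: 99}           # data memory; made-up contents

if_id  = {"rs": 1, "rt": 10, "imm": 20}        # 1. IF: instruction into IF/ID
id_ex  = {"a": regs[if_id["rs"]],              # 2. ID: read the register, and
          "imm": if_id["imm"]}                 #    sign-extend the immediate
ex_mem = {"addr": id_ex["a"] + id_ex["imm"]}   # 3. EX: effective address
mem_wb = {"data": mem[ex_mem["addr"]]}         # 4. MEM: read data memory
regs[10] = mem_wb["data"]                      # 5. WB: write the register file

assert regs[10] == 99
```

Notice that this sketch quietly uses the destination number 10 in the last step; where that number actually comes from in the datapath turns out to matter, as the text explores next.<br />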
This walk-through of the load instruction shows that any information needed<br />
in a later pipe stage must be passed to that stage via a pipeline register. Walking<br />
through a store instruction shows the similarity of instruction execution, as well<br />
as passing the information for later stages. Here are the five pipe stages of the store<br />
instruction:
[Figure 4.36 diagram: two single-clock-cycle snapshots of the datapath, lw in the instruction fetch stage (top) and lw in the instruction decode stage (bottom).]<br />
FIGURE 4.36 IF and ID: First and second pipe stages of an instruction, with the active portions of the datapath in<br />
Figure 4.35 highlighted. The highlighting convention is the same as that used in Figure 4.28. As in Section 4.2, there is no confusion when<br />
reading and writing registers, because the contents change only on the clock edge. Although the load needs only the top register in stage 2,<br />
the processor doesn’t know what instruction is being decoded, so it sign-extends the 16-bit constant and reads both registers into the ID/EX<br />
pipeline register. We don’t need all three operands, but it simplifies control to keep all three.
[Figure 4.37 diagram: single-clock-cycle snapshot of the datapath, lw in the execution stage.]<br />
FIGURE 4.37 EX: The third pipe stage of a load instruction, highlighting the portions of the datapath in Figure 4.35<br />
used in this pipe stage. The register is added to the sign-extended immediate, and the sum is placed in the EX/MEM pipeline register.<br />
1. Instruction fetch: The instruction is read from memory using the address<br />
in the PC and then is placed in the IF/ID pipeline register. This stage occurs<br />
before the instruction is identified, so the top portion of Figure 4.36 works<br />
for store as well as load.<br />
2. Instruction decode and register file read: The instruction in the IF/ID pipeline<br />
register supplies the register numbers for reading two registers and extends<br />
the sign of the 16-bit immediate. These three 32-bit values are all stored<br />
in the ID/EX pipeline register. The bottom portion of Figure 4.36 for load<br />
instructions also shows the operations of the second stage for stores. These<br />
first two stages are executed by all instructions, since it is too early to know<br />
the type of the instruction.<br />
3. Execute and address calculation: Figure 4.39 shows the third step; the<br />
effective address is placed in the EX/MEM pipeline register.<br />
4. Memory access: The top portion of Figure 4.40 shows the data being written<br />
to memory. Note that the register containing the data to be stored was read in<br />
an earlier stage <strong>and</strong> stored in ID/EX. The only way to make the data available<br />
during the MEM stage is to place the data into the EX/MEM pipeline register<br />
in the EX stage, just as we stored the effective address into EX/MEM.
[Figure 4.38 diagram: two single-clock-cycle snapshots of the datapath, lw in the memory stage (top) and lw in the write-back stage (bottom).]<br />
FIGURE 4.38 MEM and WB: The fourth and fifth pipe stages of a load instruction, highlighting the portions of the<br />
datapath in Figure 4.35 used in this pipe stage. Data memory is read using the address in the EX/MEM pipeline registers, and the<br />
data is placed in the MEM/WB pipeline register. Next, data is read from the MEM/WB pipeline register and written into the register file in the<br />
middle of the datapath. Note: there is a bug in this design that is repaired in Figure 4.41.
[Figure 4.39 diagram: single-clock-cycle snapshot of the datapath, sw in the execution stage.]<br />
FIGURE 4.39 EX: The third pipe stage of a store instruction. Unlike the third stage of the load instruction in Figure 4.37, the<br />
second register value is loaded into the EX/MEM pipeline register to be used in the next stage. Although it wouldn’t hurt to always write this<br />
second register into the EX/MEM pipeline register, we write the second register only on a store instruction to make the pipeline easier to<br />
understand.<br />
5. Write-back: The bottom portion of Figure 4.40 shows the final step of the<br />
store. For this instruction, nothing happens in the write-back stage. Since<br />
every instruction behind the store is already in progress, we have no way<br />
to accelerate those instructions. Hence, an instruction passes through a<br />
stage even if there is nothing to do, because later instructions are already<br />
progressing at the maximum rate.<br />
The store instruction again illustrates that to pass something from an early pipe<br />
stage to a later pipe stage, the information must be placed in a pipeline register;<br />
otherwise, the information is lost when the next instruction enters that pipeline<br />
stage. For the store instruction we needed to pass one of the registers read in the<br />
ID stage to the MEM stage, where it is stored in memory. The data was first placed<br />
in the ID/EX pipeline register and then passed to the EX/MEM pipeline register.<br />
Load and store illustrate a second key point: each logical component of the<br />
datapath—such as instruction memory, register read ports, ALU, data memory,<br />
and register write port—can be used only within a single pipeline stage. Otherwise,<br />
we would have a structural hazard (see page 277). Hence these components, and<br />
their control, can be associated with a single pipeline stage.<br />
Now we can uncover a bug in the design of the load instruction. Did you see it?<br />
Which register is changed in the final stage of the load? More specifically, which
[Figure 4.40 diagram: two single-clock-cycle snapshots of the datapath, sw in the memory stage (top) and sw in the write-back stage (bottom).]<br />
FIGURE 4.40 MEM and WB: The fourth and fifth pipe stages of a store instruction. In the fourth stage, the data is written into<br />
data memory for the store. Note that the data comes from the EX/MEM pipeline register and that nothing is changed in the MEM/WB pipeline<br />
register. Once the data is written in memory, there is nothing left for the store instruction to do, so nothing happens in stage 5.
instruction supplies the write register number? The instruction in the IF/ID pipeline<br />
register supplies the write register number, yet this instruction occurs considerably<br />
after the load instruction!<br />
Hence, we need to preserve the destination register number in the load<br />
instruction. Just as store passed the register contents from the ID/EX to the EX/<br />
MEM pipeline registers for use in the MEM stage, load must pass the register<br />
number from the ID/EX through EX/MEM to the MEM/WB pipeline register for<br />
use in the WB stage. Another way to think about the passing of the register number<br />
is that to share the pipelined datapath, we need to preserve the instruction read<br />
during the IF stage, so each pipeline register contains a portion of the instruction<br />
needed for that stage <strong>and</strong> later stages.<br />
Figure 4.41 shows the correct version of the datapath, passing the write register<br />
number first to the ID/EX register, then to the EX/MEM register, and finally to the<br />
MEM/WB register. The register number is used during the WB stage to specify<br />
the register to be written. Figure 4.42 is a single drawing of the corrected datapath,<br />
highlighting the hardware used in all five stages of the load word instruction in<br />
Figures 4.36 through 4.38. See Section 4.8 for an explanation of how to make the<br />
branch instruction work as expected.<br />
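The effect of the fix can be sketched in a few lines (our illustration, not the book's hardware): the destination-register number is just one more field shifted along with the instruction, so by the time WB writes, it uses the number the load itself carried rather than the number of whatever instruction is currently in IF/ID.<br />

```python
# Minimal sketch of the fix in Figure 4.41: the write-register number travels
# with its instruction through ID/EX, EX/MEM, and MEM/WB.
def run(dest_regs):
    """dest_regs: write-register number of each instruction, in program order.
    Returns the register numbers seen by the WB stage, in the order WB sees
    them, after draining the pipeline."""
    id_ex = ex_mem = mem_wb = None
    seen = []
    for d in dest_regs + [None] * 3:       # extra cycles drain the pipeline
        id_ex, ex_mem, mem_wb = d, id_ex, ex_mem
        if mem_wb is not None:
            seen.append(mem_wb)
    return seen

# Each instruction's WB stage uses its own destination number, three cycles
# after that number left the ID stage.
assert run([10, 11, 12]) == [10, 11, 12]
```

Without the pipelined copy, WB would instead read the number of the instruction three slots behind, which is exactly the bug just uncovered.<br />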
Graphically Representing Pipelines<br />
Pipelining can be difficult to understand, since many instructions are simultaneously<br />
executing in a single datapath in every clock cycle. To aid understanding, there are<br />
FIGURE 4.41 The corrected pipelined datapath to handle the load instruction properly. The write register number now<br />
comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/<br />
WB pipeline register, adding five more bits to the last three pipeline registers. This new path is shown in color.
two basic styles of pipeline figures: multiple-clock-cycle pipeline diagrams, such as<br />
Figure 4.34 on page 288, <strong>and</strong> single-clock-cycle pipeline diagrams, such as Figures<br />
4.36 through 4.40. The multiple-clock-cycle diagrams are simpler but do not contain<br />
all the details. For example, consider the following five-instruction sequence:<br />
lw $10, 20($1)<br />
sub $11, $2, $3<br />
add $12, $3, $4<br />
lw $13, 24($1)<br />
add $14, $5, $6<br />
Figure 4.43 shows the multiple-clock-cycle pipeline diagram for these<br />
instructions. Time advances from left to right across the page in these diagrams,<br />
and instructions advance from the top to the bottom of the page, similar to the<br />
laundry pipeline in Figure 4.25. A representation of the pipeline stages is placed<br />
in each portion along the instruction axis, occupying the proper clock cycles.<br />
These stylized datapaths represent the five stages of our pipeline graphically, but<br />
a rectangle naming each pipe stage works just as well. Figure 4.44 shows the more<br />
traditional version of the multiple-clock-cycle pipeline diagram. Note that Figure<br />
4.43 shows the physical resources used at each stage, while Figure 4.44 uses the<br />
name of each stage.<br />
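The rule behind both diagram styles is simple: instruction i (counting from 0) occupies pipeline stage k during clock cycle i + k + 1. A few lines of Python (our sketch, not from the book) reproduce the grid of Figure 4.44 from that rule:<br />

```python
# Printing a text version of a traditional multiple-clock-cycle diagram.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(i, cc):
    """Stage occupied by instruction i (0-based) in clock cycle cc (1-based),
    or '' if the instruction is not in the pipeline that cycle."""
    k = cc - 1 - i
    return STAGES[k] if 0 <= k < len(STAGES) else ""

insns = ["lw  $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4",
         "lw  $13, 24($1)", "add $14, $5, $6"]
for i, text in enumerate(insns):
    row = " ".join(stage_of(i, cc).ljust(3) for cc in range(1, 10))
    print(text.ljust(18), row)
```

The first row reads IF ID EX MEM WB starting in clock cycle 1, and each later row is the same pattern shifted one cycle to the right, just as in Figure 4.44.<br />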
Single-clock-cycle pipeline diagrams show the state of the entire datapath during<br />
a single clock cycle, and usually all five instructions in the pipeline are identified by<br />
labels above their respective pipeline stages. We use this type of figure to show the<br />
details of what is happening within the pipeline during each clock cycle; typically,<br />
IF/ID<br />
ID/EX<br />
EX/MEM<br />
MEM/WB<br />
Add<br />
4<br />
Shift<br />
left 2<br />
Add<br />
Add<br />
resu t<br />
0<br />
M<br />
u<br />
x<br />
1<br />
PC<br />
Address<br />
Instruction<br />
memory<br />
Instruction<br />
Read<br />
register 1 Read<br />
data 1<br />
Read<br />
register 2<br />
Registers Read<br />
Write<br />
data 2<br />
register<br />
Write<br />
data<br />
0<br />
M<br />
u<br />
x<br />
1<br />
Zero<br />
ALU ALU<br />
result<br />
Address<br />
Data<br />
memory<br />
Read<br />
data<br />
1<br />
M<br />
u<br />
x<br />
0<br />
Write<br />
data<br />
16 Sign 32<br />
extend<br />
FIGURE 4.42 The portion of the datapath in Figure 4.41 that is used in all five stages of a load instruction.
[Figures 4.43 and 4.44 diagrams: multiple-clock-cycle pipeline diagrams of the five instructions lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1); add $14, $5, $6 over clock cycles CC 1 through CC 9, drawn first with stylized datapaths (Figure 4.43) and then with named stages (Figure 4.44).]<br />
FIGURE 4.44 Traditional multiple-clock-cycle pipeline diagram of five instructions in Figure 4.43.<br />
[Figure 4.45 diagram: single-clock-cycle snapshot of the datapath at clock cycle 5, with add $14, $5, $6 in instruction fetch; lw $13, 24($1) in instruction decode; add $12, $3, $4 in execution; sub $11, $2, $3 in memory; and lw $10, 20($1) in write-back.]<br />
FIGURE 4.45 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in Figures 4.43 and 4.44.<br />
As you can see, a single-clock-cycle figure is a vertical slice through a multiple-clock-cycle diagram.<br />
1. Allowing jumps, branches, and ALU instructions to take fewer stages than<br />
the five required by the load instruction will increase pipeline performance<br />
under all circumstances.
2. Trying to allow some instructions to take fewer cycles does not help, since<br />
the throughput is determined by the clock cycle; the number of pipe stages<br />
per instruction affects latency, not throughput.<br />
3. You cannot make ALU instructions take fewer cycles because of the write-back<br />
of the result, but branches and jumps can take fewer cycles, so there is<br />
some opportunity for improvement.<br />
4. Instead of trying to make instructions take fewer cycles, we should explore<br />
making the pipeline longer, so that instructions take more cycles, but the<br />
cycles are shorter. This could improve performance.<br />
In the 6600 Computer, perhaps even more than in any previous computer, the control system is the difference.<br />
—James Thornton, Design of a Computer: The Control Data 6600, 1970<br />
Pipelined Control<br />
Just as we added control to the single-cycle datapath in Section 4.3, we now add<br />
control to the pipelined datapath. We start with a simple design that views the<br />
problem through rose-colored glasses.<br />
The first step is to label the control lines on the existing datapath. Figure 4.46<br />
shows those lines. We borrow as much as we can from the control for the simple<br />
datapath in Figure 4.17. In particular, we use the same ALU control logic, branch<br />
logic, destination-register-number multiplexor, and control lines. These functions<br />
are defined in Figures 4.12, 4.16, and 4.18. We reproduce the key information in<br />
Figures 4.47 through 4.49 on a single page to make the following discussion easier<br />
to follow.<br />
As was the case for the single-cycle implementation, we assume that the PC is<br />
written on each clock cycle, so there is no separate write signal for the PC. By the<br />
same argument, there are no separate write signals for the pipeline registers (IF/<br />
ID, ID/EX, EX/MEM, and MEM/WB), since the pipeline registers are also written<br />
during each clock cycle.<br />
To specify control for the pipeline, we need only set the control values during<br />
each pipeline stage. Because each control line is associated with a component active<br />
in only a single pipeline stage, we can divide the control lines into five groups<br />
according to the pipeline stage.<br />
1. Instruction fetch: The control signals to read instruction memory and to<br />
write the PC are always asserted, so there is nothing special to control in this<br />
pipeline stage.<br />
2. Instruction decode/register file read: As in the previous stage, the same thing<br />
happens at every clock cycle, so there are no optional control lines to set.<br />
3. Execution/address calculation: The signals to be set are RegDst, ALUOp,<br />
and ALUSrc (see Figures 4.47 and 4.48). The signals select the Result register,<br />
the ALU operation, and either Read data 2 or a sign-extended immediate<br />
for the ALU.
FIGURE 4.46 The pipelined datapath of Figure 4.41 with the control signals identified. This datapath borrows the control<br />
logic for PC source, register destination number, and ALU control from Section 4.4. Note that we now need the 6-bit funct field (function<br />
code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register. Recall that<br />
these 6 bits are also the 6 least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can supply them from the<br />
immediate field since sign extension leaves these bits unchanged.<br />
Instruction opcode | ALUOp | Instruction operation | Function code | Desired ALU action | ALU control input<br />
LW | 00 | load word | XXXXXX | add | 0010<br />
SW | 00 | store word | XXXXXX | add | 0010<br />
Branch equal | 01 | branch equal | XXXXXX | subtract | 0110<br />
R-type | 10 | add | 100000 | add | 0010<br />
R-type | 10 | subtract | 100010 | subtract | 0110<br />
R-type | 10 | AND | 100100 | AND | 0000<br />
R-type | 10 | OR | 100101 | OR | 0001<br />
R-type | 10 | set on less than | 101010 | set on less than | 0111<br />
FIGURE 4.47 A copy of Figure 4.12. This figure shows how the ALU control bits are set depending on the ALUOp control bits and the<br />
different function codes for the R-type instruction.
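The table reads naturally as a two-level decode: ALUOp from the main control first, then the funct field only when ALUOp says R-type. This sketch (ours, with Python ints standing in for the bit patterns) mirrors Figure 4.47 directly:<br />

```python
# ALU control decode of Figure 4.47: funct field consulted only for R-type.
R_TYPE_FUNCT = {0b100000: 0b0010,   # add
                0b100010: 0b0110,   # subtract
                0b100100: 0b0000,   # AND
                0b100101: 0b0001,   # OR
                0b101010: 0b0111}   # set on less than

def alu_control(alu_op, funct):
    if alu_op == 0b00:              # lw/sw: address calculation
        return 0b0010               # add; funct is a don't-care
    if alu_op == 0b01:              # beq: compare by subtracting
        return 0b0110               # subtract; funct is a don't-care
    return R_TYPE_FUNCT[funct]      # R-type: funct field selects the operation

assert alu_control(0b00, 0b111111) == 0b0010   # lw and sw both add
assert alu_control(0b10, 0b100010) == 0b0110   # R-type subtract
```

The don't-care funct argument for loads, stores, and branches corresponds to the XXXXXX entries in the table.<br />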
302 Chapter 4 The Processor<br />
Signal name | Effect when deasserted (0) | Effect when asserted (1)<br />
RegDst | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).<br />
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.<br />
ALUSrc | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.<br />
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.<br />
MemRead | None. | Data memory contents designated by the address input are put on the Read data output.<br />
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.<br />
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.<br />
FIGURE 4.48 A copy of Figure 4.16. The function of each of seven control signals is defined. The ALU control lines (ALUOp) are defined<br />
in the second column of Figure 4.47. When a 1-bit control to a 2-way multiplexor is asserted, the multiplexor selects the input corresponding<br />
to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Note that PCSrc is controlled by an AND gate in Figure 4.46.<br />
If the Branch signal and the ALU Zero signal are both set, then PCSrc is 1; otherwise, it is 0. Control sets the Branch signal only during a beq<br />
instruction; otherwise, PCSrc is set to 0.<br />
Instruction | EX stage (RegDst, ALUOp1, ALUOp0, ALUSrc) | MEM stage (Branch, MemRead, MemWrite) | WB stage (RegWrite, MemtoReg)<br />
R-format | 1, 1, 0, 0 | 0, 0, 0 | 1, 0<br />
lw | 0, 0, 0, 1 | 0, 1, 0 | 1, 1<br />
sw | X, 0, 0, 1 | 0, 0, 1 | 0, X<br />
beq | X, 0, 1, 0 | 1, 0, 0 | 0, X<br />
FIGURE 4.49 The values of the control lines are the same as in Figure 4.18, but they have been shuffled into three<br />
groups corresponding to the last three pipeline stages.<br />
4. Memory access: The control lines set in this stage are Branch, MemRead, and<br />
MemWrite. The branch equal, load, and store instructions set these signals,<br />
respectively. Recall that PCSrc in Figure 4.48 selects the next sequential<br />
address unless control asserts Branch and the ALU result was 0.<br />
5. Write-back: The two control lines are MemtoReg, which decides between<br />
sending the ALU result or the memory value to the register file, and<br />
RegWrite, which writes the chosen value.<br />
Since pipelining the datapath leaves the meaning of the control lines unchanged,<br />
we can use the same control values. Figure 4.49 has the same values as in Section<br />
4.4, but now the nine control lines are grouped by pipeline stage.
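To make the grouping concrete, the four rows of Figure 4.49 can be transcribed into a small lookup table. The Python sketch below is illustrative only (the dictionary layout and helper name are ours, not part of the hardware); it encodes each instruction's control values by the stage that consumes them, with don't-care (X) values as None:

```python
# Control values from Figure 4.49, grouped by the pipeline stage
# that consumes them. X (don't care) is encoded as None.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0, "Branch": 0},
                 "MEM": {"MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1, "Branch": 0},
                 "MEM": {"MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1, "Branch": 0},
                 "MEM": {"MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0, "Branch": 1},
                 "MEM": {"MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
}

def wb_writes_register(instr):
    """Only instructions with RegWrite asserted update the register file."""
    return CONTROL[instr]["WB"]["RegWrite"] == 1
```

A table like this also makes the bubble-insertion trick later in the section easy to see: setting all nine values to 0 yields an instruction that writes nothing.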
304 Chapter 4 The Processor<br />
FIGURE 4.51 The pipelined datapath of Figure 4.46, with the control signals connected to the control portions of<br />
the pipeline registers. The control values for the last three stages are created during the instruction decode stage and then placed in the<br />
ID/EX pipeline register. The control lines for each pipe stage are used, and remaining control lines are then passed to the next pipeline stage.<br />
Let’s look at a sequence with many dependences, shown in color:<br />
sub $2, $1, $3 # Register $2 written by sub<br />
and $12, $2, $5 # 1st operand ($2) depends on sub<br />
or $13, $6, $2 # 2nd operand ($2) depends on sub<br />
add $14, $2, $2 # 1st ($2) & 2nd ($2) depend on sub<br />
sw $15, 100($2) # Base ($2) depends on sub<br />
The last four instructions are all dependent on the result in register $2 of the<br />
first instruction. If register $2 had the value 10 before the subtract instruction and<br />
−20 afterwards, the programmer intends that −20 will be used in the following<br />
instructions that refer to register $2.
4.7 Data Hazards: Forwarding versus Stalling 305<br />
How would this sequence perform with our pipeline? Figure 4.52 illustrates the<br />
execution of these instructions using a multiple-clock-cycle pipeline representation.<br />
To demonstrate the execution of this instruction sequence in our current pipeline,<br />
the top of Figure 4.52 shows the value of register $2, which changes during the<br />
middle of clock cycle 5, when the sub instruction writes its result.<br />
The last potential hazard can be resolved by the design of the register file<br />
hardware: What happens when a register is read and written in the same clock<br />
cycle? We assume that the write is in the first half of the clock cycle and the read<br />
is in the second half, so the read delivers what is written. As is the case for many<br />
implementations of register files, we have no data hazard in this case.<br />
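The write-first-half, read-second-half convention can be modeled in a few lines. This Python sketch (class and method names are our invention, not from the text) shows a same-cycle read returning the freshly written value:

```python
class RegisterFile:
    """Models a register file that writes in the first half of a clock
    cycle and reads in the second half, so a same-cycle read returns
    the newly written value (the internal forwarding described in the text)."""
    def __init__(self):
        self.regs = [0] * 32

    def clock(self, write_reg=None, write_data=None, read_regs=()):
        # First half of the cycle: perform the write (register $0 stays 0).
        if write_reg is not None and write_reg != 0:
            self.regs[write_reg] = write_data
        # Second half: reads observe the value written above.
        return [self.regs[r] for r in read_regs]

rf = RegisterFile()
rf.clock(write_reg=2, write_data=10)
# sub writes -20 to $2 in the same cycle that a later instruction reads $2:
values = rf.clock(write_reg=2, write_data=-20, read_regs=[2])
```

Because the write lands before the read within the same cycle, no forwarding hardware is needed for instructions three or more slots behind the producer.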
Figure 4.52 shows that the values read for register $2 would not be the result of<br />
the sub instruction unless the read occurred during clock cycle 5 or later. Thus, the<br />
instructions that would get the correct value of −20 are add and sw; the AND and<br />
[Figure 4.52 diagram: the five instructions above shown in multiple-clock-cycle pipeline form over clock cycles CC 1 to CC 9; the value of register $2 is 10 through CC 4, 10/–20 during CC 5, and –20 from CC 6 onward.]<br />
FIGURE 4.52 Pipelined dependences in a five-instruction sequence using simplified datapaths to show the<br />
dependences. All the dependent actions are shown in color, and “CC 1” at the top of the figure means clock cycle 1. The first instruction<br />
writes into $2, and all the following instructions read $2. This register is written in clock cycle 5, so the proper value is unavailable before clock<br />
cycle 5. (A read of a register during a clock cycle returns the value written at the end of the first half of the cycle, when such a write occurs.) The<br />
colored lines from the top datapath to the lower ones show the dependences. Those that must go backward in time are pipeline data hazards.
OR instructions would get the incorrect value 10! Using this style of drawing, such<br />
problems become apparent when a dependence line goes backward in time.<br />
As mentioned in Section 4.5, the desired result is available at the end of the<br />
EX stage or clock cycle 3. When is the data actually needed by the AND and OR<br />
instructions? At the beginning of the EX stage, or clock cycles 4 <strong>and</strong> 5, respectively.<br />
Thus, we can execute this segment without stalls if we simply forward the data as<br />
soon as it is available to any units that need it before it is available to read from the<br />
register file.<br />
How does forwarding work? For simplicity in the rest of this section, we consider<br />
only the challenge of forwarding to an operation in the EX stage, which may be<br />
either an ALU operation or an effective address calculation. This means that when<br />
an instruction tries to use a register in its EX stage that an earlier instruction<br />
intends to write in its WB stage, we actually need the values as inputs to the ALU.<br />
A notation that names the fields of the pipeline registers allows for a more<br />
precise notation of dependences. For example, “ID/EX.RegisterRs” refers to the<br />
number of one register whose value is found in the pipeline register ID/EX; that is,<br />
the one from the first read port of the register file. The first part of the name, to the<br />
left of the period, is the name of the pipeline register; the second part is the name of<br />
the field in that register. Using this notation, the two pairs of hazard conditions are<br />
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs<br />
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt<br />
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs<br />
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt<br />
The first hazard in the sequence on page 304 is on register $2, between the<br />
result of sub $2,$1,$3 and the first read operand of and $12,$2,$5. This<br />
hazard can be detected when the and instruction is in the EX stage and the prior<br />
instruction is in the MEM stage, so this is hazard 1a:<br />
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2<br />
EXAMPLE<br />
Dependence Detection<br />
Classify the dependences in this sequence from page 304:<br />
sub $2, $1, $3 # Register $2 set by sub<br />
and $12, $2, $5 # 1st operand ($2) set by sub<br />
or $13, $6, $2 # 2nd operand ($2) set by sub<br />
add $14, $2, $2 # 1st ($2) & 2nd ($2) set by sub<br />
sw $15, 100($2) # Index ($2) set by sub
ANSWER<br />
As mentioned above, the sub-and is a type 1a hazard. The remaining hazards<br />
are as follows:<br />
■ The sub-or is a type 2b hazard:<br />
MEM/WB.RegisterRd = ID/EX.RegisterRt = $2<br />
■ The two dependences on sub-add are not hazards because the register<br />
file supplies the proper data during the ID stage of add.<br />
■ There is no data hazard between sub and sw because sw reads $2 the<br />
clock cycle after sub writes $2.<br />
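The answer above follows a simple pattern: a producer-consumer distance of one instruction is an EX/MEM (type 1) hazard, a distance of two is a MEM/WB (type 2) hazard, and at three or more the register file itself delivers the value. A hypothetical helper sketching that rule (the function and its interface are ours, assuming a single-issue five-stage pipeline with the register-file internal forwarding just described):

```python
def classify(producer_idx, consumer_idx, operand):
    """Classify a dependence by instruction distance, mirroring the
    1a/1b versus 2a/2b hazard cases in the text.
    operand is "rs" (first source) or "rt" (second source)."""
    gap = consumer_idx - producer_idx
    if gap == 1:
        return "1a" if operand == "rs" else "1b"   # EX/MEM forwarding case
    if gap == 2:
        return "2a" if operand == "rs" else "2b"   # MEM/WB forwarding case
    # Distance 3 or more: write-first-half/read-second-half register file
    return "register file"

# In the sequence on page 304: sub is instruction 0, and is 1,
# or is 2, add is 3, and sw is 4.
```

Running it on the sequence reproduces the answer: the sub-and dependence on rs is 1a, the sub-or dependence on rt is 2b, and the add and sw dependences need no forwarding at all.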
Because some instructions do not write registers, this policy is inaccurate;<br />
sometimes it would forward when it shouldn’t. One solution is simply to check<br />
to see if the RegWrite signal will be active: examining the WB control field of the<br />
pipeline register during the EX and MEM stages determines whether RegWrite<br />
is asserted. Recall that MIPS requires that every use of $0 as an operand must<br />
yield an operand value of 0. In the event that an instruction in the pipeline has<br />
$0 as its destination (for example, sll $0, $1, 2), we want to avoid forwarding<br />
its possibly nonzero result value. Not forwarding results destined for $0 frees the<br />
assembly programmer and the compiler of any requirement to avoid using $0 as<br />
a destination. The conditions above thus work properly as long as we add<br />
EX/MEM.RegisterRd ≠ 0 to the first hazard condition and MEM/WB.RegisterRd ≠ 0 to the<br />
second.<br />
Now that we can detect hazards, half of the problem is resolved—but we must<br />
still forward the proper data.<br />
Figure 4.53 shows the dependences between the pipeline registers and the inputs<br />
to the ALU for the same code sequence as in Figure 4.52. The change is that the<br />
dependence begins from a pipeline register, rather than waiting for the WB stage to<br />
write the register file. Thus, the required data exists in time for later instructions,<br />
with the pipeline registers holding the data to be forwarded.<br />
If we can take the inputs to the ALU from any pipeline register rather than just<br />
ID/EX, then we can forward the proper data. By adding multiplexors to the input<br />
of the ALU, and with the proper controls, we can run the pipeline at full speed in<br />
the presence of these data dependences.<br />
For now, we will assume the only instructions we need to forward are the four<br />
R-format instructions: add, sub, AND, and OR. Figure 4.54 shows a close-up of<br />
the ALU and pipeline register before and after adding forwarding. Figure 4.55<br />
shows the values of the control lines for the ALU multiplexors that select either the<br />
register file values or one of the forwarded values.<br />
This forwarding control will be in the EX stage, because the ALU forwarding<br />
multiplexors are found in that stage. Thus, we must pass the operand register<br />
numbers from the ID stage via the ID/EX pipeline register to determine whether<br />
to forward values. We already have the rt field (bits 20–16). Before forwarding, the<br />
ID/EX register had no need to include space to hold the rs field. Hence, rs (bits<br />
25–21) is added to ID/EX.
FIGURE 4.54 On the top are the ALU and pipeline registers before adding forwarding. On<br />
the bottom, the multiplexors have been expanded to add the forwarding paths, and we show the forwarding<br />
unit. The new hardware is shown in color. This figure is a stylized drawing, however, leaving out details<br />
from the full datapath such as the sign extension hardware. Note that the ID/EX.RegisterRt field is shown<br />
twice, once to connect to the Mux and once to the forwarding unit, but it is a single signal. As in the earlier<br />
discussion, this ignores forwarding of a store value to a store instruction. Also note that this mechanism<br />
works for slt instructions as well.
Mux control   | Source | Explanation<br />
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.<br />
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.<br />
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.<br />
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.<br />
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.<br />
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result.<br />
FIGURE 4.55 The control values for the forwarding multiplexors in Figure 4.54. The signed<br />
immediate that is another input to the ALU is described in the Elaboration at the end of this section.<br />
Note that the EX/MEM.RegisterRd field is the register destination for either<br />
an ALU instruction (which comes from the Rd field of the instruction) or a load<br />
(which comes from the Rt field).<br />
1. EX hazard:<br />
if (EX/MEM.RegWrite<br />
and (EX/MEM.RegisterRd ≠ 0)<br />
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10<br />
if (EX/MEM.RegWrite<br />
and (EX/MEM.RegisterRd ≠ 0)<br />
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10<br />
This case forwards the result from the previous instruction to either input of the<br />
ALU. If the previous instruction is going to write to the register file, and the write<br />
register number matches the read register number of ALU inputs A or B, provided<br />
it is not register 0, then steer the multiplexor to pick the value instead from the<br />
pipeline register EX/MEM.<br />
2. MEM hazard:<br />
if (MEM/WB.RegWrite<br />
and (MEM/WB.RegisterRd ≠ 0)<br />
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01<br />
if (MEM/WB.RegWrite<br />
and (MEM/WB.RegisterRd ≠ 0)<br />
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01<br />
As mentioned above, there is no hazard in the WB stage, because we assume that<br />
the register file supplies the correct result if the instruction in the ID stage reads<br />
the same register written by the instruction in the WB stage. Such a register file<br />
performs another form of forwarding, but it occurs within the register file.<br />
One complication is potential data hazards between the result of the instruction<br />
in the WB stage, the result of the instruction in the MEM stage, and the source<br />
operand of the instruction in the ALU stage. For example, when summing a vector<br />
of numbers in a single register, a sequence of instructions will all read and write to<br />
the same register:<br />
add $1,$1,$2<br />
add $1,$1,$3<br />
add $1,$1,$4<br />
. . .
In this case, the result is forwarded from the MEM stage because the result in the<br />
MEM stage is the more recent result. Thus, the control for the MEM hazard would<br />
be (with the additions highlighted):<br />
if (MEM/WB.RegWrite<br />
and (MEM/WB.RegisterRd ≠ 0)<br />
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)<br />
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))<br />
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01<br />
if (MEM/WB.RegWrite<br />
and (MEM/WB.RegisterRd ≠ 0)<br />
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)<br />
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))<br />
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01<br />
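The two conditions, with the EX hazard taking priority over the MEM hazard, can be captured in a short function. This Python sketch models the forwarding unit's combinational logic (the dictionary-based interface is our own simplification); checking the MEM hazard first and letting the EX hazard overwrite it gives the same priority as suppressing the MEM case when the EX case applies:

```python
def forwarding_controls(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Sketch of the forwarding unit. ex_mem and mem_wb are dicts with
    'RegWrite' (bool) and 'Rd' (destination register number). Returns
    (ForwardA, ForwardB) as the 2-bit mux codes of Figure 4.55."""
    forward_a = forward_b = "00"  # default: operands from the register file

    # MEM hazard: forward the older result from MEM/WB ...
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0:
        if mem_wb["Rd"] == id_ex_rs:
            forward_a = "01"
        if mem_wb["Rd"] == id_ex_rt:
            forward_b = "01"

    # EX hazard: ... unless EX/MEM holds a more recent result, which wins.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == id_ex_rs:
            forward_a = "10"
        if ex_mem["Rd"] == id_ex_rt:
            forward_b = "10"

    return forward_a, forward_b
```

For the double-add sequence above, when the third add is in EX both EX/MEM and MEM/WB hold writes to $1, and the sketch correctly picks the EX/MEM (more recent) value for the first operand.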
Figure 4.56 shows the hardware necessary to support forwarding for operations<br />
that use results during the EX stage. Note that the EX/MEM.RegisterRd field is the<br />
register destination for either an ALU instruction (which comes from the Rd field<br />
of the instruction) or a load (which comes from the Rt field).<br />
FIGURE 4.56 The datapath modified to resolve hazards via forwarding. Compared with the datapath in Figure 4.51, the additions<br />
are the multiplexors to the inputs to the ALU. This figure is a more stylized drawing, however, leaving out details from the full datapath, such<br />
as the branch hardware <strong>and</strong> the sign extension hardware.
use. Checking for load instructions, the control for the hazard detection unit is this<br />
single condition:<br />
if (ID/EX.MemRead and<br />
((ID/EX.RegisterRt = IF/ID.RegisterRs) or<br />
(ID/EX.RegisterRt = IF/ID.RegisterRt)))<br />
stall the pipeline<br />
nop: An instruction that does no operation to change state.<br />
The first line tests to see if the instruction is a load: the only instruction that reads<br />
data memory is a load. The next two lines check to see if the destination register<br />
field of the load in the EX stage matches either source register of the instruction<br />
in the ID stage. If the condition holds, the instruction stalls one clock cycle. After<br />
this 1-cycle stall, the forwarding logic can handle the dependence and execution<br />
proceeds. (If there were no forwarding, then the instructions in Figure 4.58 would<br />
need another stall cycle.)<br />
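The hazard detection condition translates almost directly into code. In this sketch (function and parameter names are ours), a true result means the pipeline must stall for one cycle:

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Sketch of the hazard detection unit's single condition: stall when
    the instruction in EX is a load (MemRead asserted) whose destination
    (its rt field) matches either source register of the instruction in ID."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))
```

For example, with lw $2, 20($1) in EX and and $4, $2, $5 in ID, the load's rt field ($2) matches the and's rs field, so the condition fires and the pipeline stalls one cycle.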
If the instruction in the ID stage is stalled, then the instruction in the IF stage<br />
must also be stalled; otherwise, we would lose the fetched instruction. Preventing<br />
these two instructions from making progress is accomplished simply by preventing<br />
the PC register and the IF/ID pipeline register from changing. Provided these<br />
registers are preserved, the instruction in the IF stage will continue to be read<br />
using the same PC, and the registers in the ID stage will continue to be read using<br />
the same instruction fields in the IF/ID pipeline register. Returning to our favorite<br />
analogy, it’s as if you restart the washer with the same clothes and let the dryer<br />
continue tumbling empty. Of course, like the dryer, the back half of the pipeline<br />
starting with the EX stage must be doing something; what it is doing is executing<br />
instructions that have no effect: nops.<br />
How can we insert these nops, which act like bubbles, into the pipeline? In Figure<br />
4.49, we see that deasserting all nine control signals (setting them to 0) in the EX,<br />
MEM, and WB stages will create a “do nothing” or nop instruction. By identifying<br />
the hazard in the ID stage, we can insert a bubble into the pipeline by changing the<br />
EX, MEM, and WB control fields of the ID/EX pipeline register to 0. These benign<br />
control values are percolated forward at each clock cycle with the proper effect: no<br />
registers or memories are written if the control values are all 0.<br />
Figure 4.59 shows what really happens in the hardware: the pipeline execution<br />
slot associated with the AND instruction is turned into a nop and all instructions<br />
beginning with the AND instruction are delayed one cycle. Like an air bubble in<br />
a water pipe, a stall bubble delays everything behind it and proceeds down the<br />
instruction pipe one stage each cycle until it exits at the end. In this example, the<br />
hazard forces the AND and OR instructions to repeat in clock cycle 4 what they<br />
did in clock cycle 3: AND reads registers and decodes, and OR is refetched from<br />
instruction memory. Such repeated work is what a stall looks like, but its effect is<br />
to stretch the time of the AND and OR instructions and delay the fetch of the add<br />
instruction.<br />
Figure 4.60 highlights the pipeline connections for both the hazard detection<br />
unit and the forwarding unit. As before, the forwarding unit controls the ALU
FIGURE 4.60 Pipelined control overview, showing the two multiplexors for forwarding, the hazard detection unit, and<br />
the forwarding unit. Although the ID and EX stages have been simplified—the sign-extended immediate and branch logic are missing—<br />
this drawing gives the essence of the forwarding hardware requirements.<br />
Elaboration: Regarding the remark earlier about setting control lines to 0 to avoid<br />
writing registers or memory: only the signals RegWrite and MemWrite need be 0, while<br />
the other control signals can be don’t cares.<br />
“There are a thousand hacking at the branches of evil to one who is striking at the root.”<br />
Henry David Thoreau, Walden, 1854<br />
4.8 Control Hazards<br />
Thus far, we have limited our concern to hazards involving arithmetic operations<br />
and data transfers. However, as we saw in Section 4.5, there are also pipeline hazards<br />
involving branches. Figure 4.61 shows a sequence of instructions and indicates when<br />
the branch would occur in this pipeline. An instruction must be fetched at every<br />
clock cycle to sustain the pipeline, yet in our design the decision about whether to<br />
branch doesn’t occur until the MEM pipeline stage. As mentioned in Section 4.5,
Forwarding for the operands of branches was formerly handled by the ALU<br />
forwarding logic, but the introduction of the equality test unit in ID will<br />
require new forwarding logic. Note that the bypassed source operands of a<br />
branch can come from either the ALU/MEM or MEM/WB pipeline latches.<br />
2. Because the values in a branch comparison are needed during ID but may be<br />
produced later in time, it is possible that a data hazard can occur and a stall<br />
will be needed. For example, if an ALU instruction immediately preceding<br />
a branch produces one of the operands for the comparison in the branch,<br />
a stall will be required, since the EX stage for the ALU instruction will<br />
occur after the ID cycle of the branch. By extension, if a load is immediately<br />
followed by a conditional branch that depends on the load result, two stall cycles<br />
will be needed, as the result from the load appears at the end of the MEM<br />
cycle but is needed at the beginning of ID for the branch.<br />
Despite these difficulties, moving the branch execution to the ID stage is an<br />
improvement, because it reduces the penalty of a branch to only one instruction if<br />
the branch is taken, namely, the one currently being fetched. The exercises explore<br />
the details of implementing the forwarding path and detecting the hazard.<br />
To flush instructions in the IF stage, we add a control line, called IF.Flush,<br />
that zeros the instruction field of the IF/ID pipeline register. Clearing the register<br />
transforms the fetched instruction into a nop, an instruction that has no action<br />
and changes no state.<br />
EXAMPLE<br />
Pipelined Branch<br />
Show what happens when the branch is taken in this instruction sequence,<br />
assuming the pipeline is optimized for branches that are not taken and that we<br />
moved the branch execution to the ID stage:<br />
36 sub $10, $4, $8<br />
40 beq $1, $3, 7 # PC-relative branch to 40 + 4 + 7 * 4 = 72<br />
44 and $12, $2, $5<br />
48 or $13, $2, $6<br />
52 add $14, $4, $2<br />
56 slt $15, $6, $7<br />
. . .<br />
72 lw $4, 50($7)<br />
ANSWER<br />
Figure 4.62 shows what happens when a branch is taken. Unlike Figure 4.61,<br />
there is only one pipeline bubble on a taken branch.<br />
The limitations on delayed branch scheduling arise from (1) the restrictions on the<br />
instructions that are scheduled into the delay slots and (2) our ability to predict at<br />
compile time whether a branch is likely to be taken or not.<br />
Delayed branching was a simple and effective solution for a five-stage pipeline<br />
issuing one instruction each clock cycle. As processors go to both longer pipelines<br />
and issuing multiple instructions per clock cycle (see Section 4.10), the branch delay<br />
becomes longer, and a single delay slot is insufficient. Hence, delayed branching has<br />
lost popularity compared to more expensive but more flexible dynamic approaches.<br />
Simultaneously, the growth in available transistors per chip due to Moore’s Law has<br />
made dynamic prediction relatively cheaper.<br />
a. From before:<br />
    add $s1, $s2, $s3<br />
    if $s2 = 0 then<br />
        (delay slot)<br />
becomes:<br />
    if $s2 = 0 then<br />
        add $s1, $s2, $s3<br />
b. From target:<br />
    sub $t4, $t5, $t6<br />
    . . .<br />
    add $s1, $s2, $s3<br />
    if $s1 = 0 then<br />
        (delay slot)<br />
becomes:<br />
    add $s1, $s2, $s3<br />
    if $s1 = 0 then<br />
        sub $t4, $t5, $t6<br />
c. From fall-through:<br />
    add $s1, $s2, $s3<br />
    if $s1 = 0 then<br />
        (delay slot)<br />
    sub $t4, $t5, $t6<br />
becomes:<br />
    add $s1, $s2, $s3<br />
    if $s1 = 0 then<br />
        sub $t4, $t5, $t6<br />
FIGURE 4.64 Scheduling the branch delay slot. The top box in each pair shows the code before<br />
scheduling; the bottom box shows the scheduled code. In (a), the delay slot is scheduled with an independent<br />
instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not<br />
possible. In the code sequences for (b) and (c), the use of $s1 in the branch condition prevents the add<br />
instruction (whose destination is $s1) from being moved into the branch delay slot. In (b) the branch delay<br />
slot is scheduled from the target of the branch; usually the target instruction will need to be copied because<br />
it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability,<br />
such as a loop branch. Finally, the branch may be scheduled from the not-taken fall-through as in (c). To<br />
make this optimization legal for (b) or (c), it must be OK to execute the sub instruction when the branch<br />
goes in the unexpected direction. By “OK” we mean that the work is wasted, but the program will still execute<br />
correctly. This is the case, for example, if $t4 were an unused temporary register when the branch goes in<br />
the unexpected direction.
branch target buffer: A structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.<br />
correlating predictor: A branch predictor that combines local behavior of a particular branch and global information about the behavior of some recent number of executed branches.<br />
tournament branch predictor: A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.<br />
Elaboration: A branch predictor tells us whether or not a branch is taken, but still<br />
requires the calculation of the branch target. In the five-stage pipeline, this calculation<br />
takes one cycle, meaning that taken branches will have a 1-cycle penalty. Delayed<br />
branches are one approach to eliminate that penalty. Another approach is to use a<br />
cache to hold the destination program counter or destination instruction: a branch<br />
target buffer.<br />
The 2-bit dynamic prediction scheme uses only information about a particular branch.<br />
Researchers noticed that combining local information about a particular branch with the<br />
global behavior of recently executed branches yields greater prediction accuracy for<br />
the same number of prediction bits. Such predictors are called correlating predictors.<br />
A typical correlating predictor might have two 2-bit predictors for each branch, with the<br />
choice between predictors made based on whether the last executed branch was taken<br />
or not taken. Thus, the global branch behavior can be thought of as adding additional<br />
index bits for the prediction lookup.<br />
A more recent innovation in branch prediction is the use of tournament predictors. A<br />
tournament predictor uses multiple predictors, tracking, for each branch, which predictor<br />
yields the best results. A typical tournament predictor might contain two predictions for<br />
each branch index: one based on local information and one based on global branch<br />
behavior. A selector would choose which predictor to use for any given prediction. The<br />
selector can operate similarly to a 1- or 2-bit predictor, favoring whichever of the two<br />
predictors has been more accurate. Some recent microprocessors use such elaborate<br />
predictors.<br />
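As a concrete illustration of the 2-bit scheme these elaborations build on, here is a sketch of a single saturating counter (the class name and state encoding are our own; states 0 and 1 predict not taken, states 2 and 3 predict taken, so a prediction must be wrong twice before it flips):

```python
class TwoBitPredictor:
    """A 2-bit saturating counter branch predictor."""
    def __init__(self, state=0):
        self.state = state  # 0 = strongly not taken ... 3 = strongly taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturate at the ends so one anomalous outcome cannot flip
        # a strongly held prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
predictions = []
for taken in [True, True, False, True]:
    predictions.append(p.predict())
    p.update(taken)
```

A correlating predictor would simply keep several such counters per branch and use recent global outcomes to choose among them.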
Elaboration: One way to reduce the number of conditional branches is to add<br />
conditional move instructions. Instead of changing the PC with a conditional branch, the<br />
instruction conditionally changes the destination register of the move. If the condition<br />
fails, the move acts as a nop. For example, one version of the MIPS instruction set<br />
architecture has two new instructions called movn (move if not zero) <strong>and</strong> movz (move<br />
if zero). Thus, movn $8, $11, $4 copies the contents of register 11 into register 8,<br />
provided that the value in register 4 is nonzero; otherwise, it does nothing.<br />
The ARMv7 instruction set has a condition field in most instructions. Hence, ARM<br />
programs could have fewer conditional branches than MIPS programs.<br />
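The behavior of the two conditional moves described above can be sketched in C, with registers modeled as variables (an illustration of the semantics, not the hardware):

```c
#include <stdint.h>

/* movn rd, rs, rt copies rs into rd only when rt is nonzero;
   movz copies only when rt is zero; otherwise each acts as a nop. */

void movn(uint32_t *rd, uint32_t rs, uint32_t rt) {
    if (rt != 0) *rd = rs;
}

void movz(uint32_t *rd, uint32_t rs, uint32_t rt) {
    if (rt == 0) *rd = rs;
}
```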
Pipeline Summary<br />
We started in the laundry room, showing principles of pipelining in an everyday<br />
setting. Using that analogy as a guide, we explained instruction pipelining<br />
step-by-step, starting with the single-cycle datapath and then adding pipeline<br />
registers, forwarding paths, data hazard detection, branch prediction, and flushing<br />
instructions on exceptions. Figure 4.65 shows the final evolved datapath and control.<br />
We now are ready for yet another control hazard: the sticky issue of exceptions.<br />
4.9 Exceptions 325<br />
FIGURE 4.65 The final datapath and control for this chapter. Note that this is a stylized figure rather than a detailed datapath, so<br />
it’s missing the ALUsrc Mux from Figure 4.57 and the multiplexor controls from Figure 4.51.<br />
Check Yourself<br />
Consider three branch prediction schemes: predict not taken, predict taken, and<br />
dynamic prediction. Assume that they all have zero penalty when they predict<br />
correctly and two cycles when they are wrong. Assume that the average prediction<br />
accuracy of the dynamic predictor is 90%. Which predictor is the best choice for<br />
the following branches?<br />
1. A branch that is taken with 5% frequency<br />
2. A branch that is taken with 95% frequency<br />
3. A branch that is taken with 70% frequency<br />
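One way to check your answers is to compute the expected penalty per branch under each scheme; these helpers use the 2-cycle misprediction cost and the 90% dynamic accuracy stated in the question:

```c
/* Expected stall cycles per execution of a branch under each scheme.
   taken_freq is the fraction of executions in which the branch is taken. */

double penalty_predict_not_taken(double taken_freq) { return 2.0 * taken_freq; }
double penalty_predict_taken(double taken_freq)     { return 2.0 * (1.0 - taken_freq); }
double penalty_dynamic(double accuracy)             { return 2.0 * (1.0 - accuracy); }
```

For example, a branch taken 5% of the time costs 0.1 cycles per branch under predict not taken but 0.2 under the 90%-accurate dynamic predictor.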
4.9 Exceptions<br />
Control is the most challenging aspect of processor design: it is both the hardest<br />
part to get right and the hardest part to make fast. One of the hardest parts of<br />
To make a computer with automatic program-interruption facilities behave [sequentially] was not an easy matter, because the number of instructions in various stages of processing when an interrupt signal occurs may be large.<br />
Fred Brooks, Jr., Planning a Computer System: Project Stretch, 1962
328 Chapter 4 The Processor<br />
we did for the taken branch in the previous section, we must flush the instructions<br />
that follow the add instruction from the pipeline and begin fetching instructions<br />
from the new address. We will use the same mechanism we used for taken branches,<br />
but this time the exception causes the deasserting of control lines.<br />
When we dealt with branch misprediction, we saw how to flush the instruction<br />
in the IF stage by turning it into a nop. To flush instructions in the ID stage, we<br />
use the multiplexor already in the ID stage that zeros control signals for stalls. A<br />
new control signal, called ID.Flush, is ORed with the stall signal from the hazard<br />
detection unit to flush during ID. To flush the instruction in the EX phase, we use<br />
a new signal called EX.Flush to cause new multiplexors to zero the control lines. To<br />
start fetching instructions from location 8000 0180hex, which is the MIPS exception<br />
address, we simply add an additional input to the PC multiplexor that sends<br />
8000 0180hex to the PC. Figure 4.66 shows these changes.<br />
This example points out a problem with exceptions: if we do not stop execution<br />
in the middle of the instruction, the programmer will not be able to see the original<br />
value of register $1 that helped cause the overflow because it will be clobbered as<br />
the destination register of the add instruction. Because of careful planning, the<br />
overflow exception is detected during the EX stage; hence, we can use the EX.Flush<br />
signal to prevent the instruction in the EX stage from writing its result in the WB<br />
stage. Many exceptions require that we eventually complete the instruction that<br />
caused the exception as if it executed normally. The easiest way to do this is to flush<br />
the instruction and restart it from the beginning after the exception is handled.<br />
The final step is to save the address of the offending instruction in the exception<br />
program counter (EPC). In reality, we save the address + 4, so the exception handling<br />
software routine must first subtract 4 from the saved value. Figure 4.66 shows<br />
a stylized version of the datapath, including the branch hardware and the necessary<br />
accommodations to handle exceptions.<br />
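Since the EPC holds the offending address plus 4, a handler that wants to restart the instruction must first undo that bias; a one-line sketch:

```c
#include <stdint.h>

/* Recover the address of the instruction that raised the exception
   from the EPC, which holds that address + 4. */

uint32_t restart_address(uint32_t epc) {
    return epc - 4;
}
```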
EXAMPLE<br />
Exception in a Pipelined Computer<br />
Given this instruction sequence,<br />
40hex  sub $11, $2, $4<br />
44hex  and $12, $2, $5<br />
48hex  or  $13, $2, $6<br />
4Chex  add $1, $2, $1<br />
50hex  slt $15, $6, $7<br />
54hex  lw  $16, 50($7)<br />
. . .
FIGURE 4.66 The datapath with controls to handle exceptions. The key additions include a new input with the value 8000 0180hex<br />
in the multiplexor that supplies the new PC value; a Cause register to record the cause of the exception; and an Exception PC register to save<br />
the address of the instruction that caused the exception. The 8000 0180hex input to the multiplexor is the initial address to begin fetching<br />
instructions in the event of an exception. Although not shown, the ALU overflow signal is an input to the control unit.<br />
assume the instructions to be invoked on an exception begin like this:<br />
80000180hex  sw $26, 1000($0)<br />
80000184hex  sw $27, 1004($0)<br />
...<br />
Show what happens in the pipeline if an overflow exception occurs in the add<br />
instruction.<br />
ANSWER<br />
Figure 4.67 shows the events, starting with the add instruction in the EX stage.<br />
The overflow is detected during that phase, and 8000 0180hex is forced into the<br />
PC. Clock cycle 7 shows that the add and following instructions are flushed,<br />
and the first instruction of the exception code is fetched. Note that the address<br />
of the instruction following the add is saved: 4Chex + 4 = 50hex.
FIGURE 4.67 The result of an exception due to arithmetic overflow in the add instruction. The overflow is detected during<br />
the EX stage of clock 6, saving the address following the add in the EPC register (4Chex + 4 = 50hex). Overflow causes all the Flush signals to be set<br />
near the end of this clock cycle, deasserting control values (setting them to 0) for the add. Clock cycle 7 shows the instructions converted to<br />
bubbles in the pipeline plus the fetching of the first instruction of the exception routine—sw $26, 1000($0)—from instruction location<br />
8000 0180hex. Note that the AND and OR instructions, which are prior to the add, still complete. Although not shown, the ALU overflow signal<br />
is an input to the control unit.
We mentioned five examples of exceptions on page 326, and we will see others<br />
in Chapter 5. With five instructions active in any clock cycle, the challenge is<br />
to associate an exception with the appropriate instruction. Moreover, multiple<br />
exceptions can occur simultaneously in a single clock cycle. The solution is to<br />
prioritize the exceptions so that it is easy to determine which is serviced first. In<br />
most MIPS implementations, the hardware sorts exceptions so that the earliest<br />
instruction is interrupted.<br />
I/O device requests <strong>and</strong> hardware malfunctions are not associated with a specific<br />
instruction, so the implementation has some flexibility as to when to interrupt the<br />
pipeline. Hence, the mechanism used for other exceptions works just fine.<br />
The EPC captures the address of the interrupted instruction, and the MIPS<br />
Cause register records all possible exceptions in a clock cycle, so the exception<br />
software must match the exception to the instruction. An important clue is knowing<br />
in which pipeline stage a type of exception can occur. For example, an undefined<br />
instruction is discovered in the ID stage, <strong>and</strong> invoking the operating system<br />
occurs in the EX stage. Exceptions are collected in the Cause register in a pending<br />
exception field so that the hardware can interrupt based on later exceptions, once<br />
the earliest one has been serviced.<br />
Hardware/Software Interface<br />
The hardware and the operating system must work in conjunction so that<br />
exceptions behave as you would expect. The hardware contract is normally to<br />
stop the offending instruction in midstream, let all prior instructions complete,<br />
flush all following instructions, set a register to show the cause of the exception,<br />
save the address of the offending instruction, and then jump to a prearranged<br />
address. The operating system contract is to look at the cause of the exception and<br />
act appropriately. For an undefined instruction, hardware failure, or arithmetic<br />
overflow exception, the operating system normally kills the program and returns<br />
an indicator of the reason. For an I/O device request or an operating system service<br />
call, the operating system saves the state of the program, performs the desired task,<br />
and, at some point in the future, restores the program to continue execution. In<br />
the case of I/O device requests, we may often choose to run another task before<br />
resuming the task that requested the I/O, since that task may often not be able to<br />
proceed until the I/O is complete. Exceptions are why the ability to save and restore<br />
the state of any task is critical. One of the most important and frequent uses of<br />
exceptions is handling page faults and TLB exceptions; Chapter 5 describes these<br />
exceptions and their handling in more detail.<br />
Elaboration: The difficulty of always associating the correct exception with the correct<br />
instruction in pipelined computers has led some computer designers to relax this<br />
requirement in noncritical cases. Such processors are said to have imprecise interrupts<br />
or imprecise exceptions. In the example above, PC would normally have 58hex at the start<br />
of the clock cycle after the exception is detected, even though the offending instruction<br />
imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.
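The operating-system half of the contract described earlier can be sketched as a dispatch on the recorded cause; the cause codes and the two-way split below are illustrative, not the actual MIPS encoding:

```c
/* Dispatch sketch: fatal causes kill the program; service requests
   save state, perform the task, and resume the program later. */

enum cause  { UNDEFINED_INSTRUCTION, HARDWARE_FAILURE, ARITHMETIC_OVERFLOW,
              IO_DEVICE_REQUEST, OS_SERVICE_CALL };
enum action { KILL_PROGRAM, SERVICE_AND_RESUME };

enum action handle_exception(enum cause c) {
    switch (c) {
    case UNDEFINED_INSTRUCTION:
    case HARDWARE_FAILURE:
    case ARITHMETIC_OVERFLOW:
        return KILL_PROGRAM;        /* report the reason and terminate */
    case IO_DEVICE_REQUEST:
    case OS_SERVICE_CALL:
        return SERVICE_AND_RESUME;  /* save state, do the work, resume later */
    }
    return KILL_PROGRAM;            /* unknown cause: fail safe */
}
```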
Another example is that we might speculate that a store that precedes a load does<br />
not refer to the same address, which would allow the load to be executed before the<br />
store. The difficulty with speculation is that it may be wrong. So, any speculation<br />
mechanism must include both a method to check if the guess was right and a<br />
method to unroll or back out the effects of the instructions that were executed<br />
speculatively. The implementation of this back-out capability adds complexity.<br />
Speculation may be done in the compiler or by the hardware. For example, the<br />
compiler can use speculation to reorder instructions, moving an instruction across<br />
a branch or a load across a store. The processor hardware can perform the same<br />
transformation at runtime using techniques we discuss later in this section.<br />
The recovery mechanisms used for incorrect speculation are rather different.<br />
In the case of speculation in software, the compiler usually inserts additional<br />
instructions that check the accuracy of the speculation and provide a fix-up routine<br />
to use when the speculation is incorrect. In hardware speculation, the processor<br />
usually buffers the speculative results until it knows they are no longer speculative.<br />
If the speculation is correct, the instructions are completed by allowing the<br />
contents of the buffers to be written to the registers or memory. If the speculation is<br />
incorrect, the hardware flushes the buffers <strong>and</strong> re-executes the correct instruction<br />
sequence.<br />
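The buffer-then-commit-or-flush recovery just described can be sketched as follows (sizes and names are illustrative):

```c
/* Speculative results go into a buffer; only a correct resolution lets
   them be written to the architecturally visible register file. */

#define NREGS 32

static int regfile[NREGS];                       /* architectural state */
static struct { int reg; int value; } buffer[16];
static int nbuffered;

void speculative_write(int reg, int value) {
    buffer[nbuffered].reg = reg;                 /* buffered, not yet visible */
    buffer[nbuffered].value = value;
    nbuffered++;
}

void resolve(int speculation_correct) {
    if (speculation_correct)                     /* commit the buffered results */
        for (int i = 0; i < nbuffered; i++)
            regfile[buffer[i].reg] = buffer[i].value;
    nbuffered = 0;                               /* on a wrong guess, just discard */
}
```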
Speculation introduces one other possible problem: speculating on certain<br />
instructions may introduce exceptions that were formerly not present. For<br />
example, suppose a load instruction is moved in a speculative manner, but the<br />
address it uses is not legal when the speculation is incorrect. The result would be<br />
an exception that should not have occurred. The problem is complicated by the<br />
fact that if the load instruction were not speculative, then the exception must<br />
occur! In compiler-based speculation, such problems are avoided by adding<br />
special speculation support that allows such exceptions to be ignored until it is<br />
clear that they really should occur. In hardware-based speculation, exceptions<br />
are simply buffered until it is clear that the instruction causing them is no longer<br />
speculative and is ready to complete; at that point the exception is raised, and<br />
normal exception handling proceeds.<br />
Since speculation can improve performance when done properly and decrease<br />
performance when done carelessly, significant effort goes into deciding when it<br />
is appropriate to speculate. Later in this section, we will examine both static and<br />
dynamic techniques for speculation.<br />
issue packet The set of instructions that issues together in one clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.<br />
Static Multiple Issue<br />
Static multiple-issue processors all use the compiler to assist with packaging<br />
instructions and handling hazards. In a static issue processor, you can think of the<br />
set of instructions issued in a given clock cycle, which is called an issue packet, as<br />
one large instruction with multiple operations. This view is more than an analogy.<br />
Since a static multiple-issue processor usually restricts what mix of instructions can<br />
be initiated in a given clock cycle, it is useful to think of the issue packet as a single
instruction allowing several operations in certain predefined fields. This view led to<br />
the original name for this approach: Very Long Instruction Word (VLIW).<br />
Most static issue processors also rely on the compiler to take on some<br />
responsibility for handling data and control hazards. The compiler’s responsibilities<br />
may include static branch prediction <strong>and</strong> code scheduling to reduce or prevent all<br />
hazards. Let’s look at a simple static issue version of a MIPS processor, before we<br />
describe the use of these techniques in more aggressive processors.<br />
An Example: Static Multiple Issue with the MIPS ISA<br />
To give a flavor of static multiple issue, we consider a simple two-issue MIPS<br />
processor, where one of the instructions can be an integer ALU operation or<br />
branch <strong>and</strong> the other can be a load or store. Such a design is like that used in some<br />
embedded MIPS processors. Issuing two instructions per cycle will require fetching<br />
and decoding 64 bits of instructions. In many static multiple-issue processors, and<br />
essentially all VLIW processors, the layout of simultaneously issuing instructions<br />
is restricted to simplify the decoding and instruction issue. Hence, we will require<br />
that the instructions be paired and aligned on a 64-bit boundary, with the ALU<br />
or branch portion appearing first. Furthermore, if one instruction of the pair<br />
cannot be used, we require that it be replaced with a nop. Thus, the instructions<br />
always issue in pairs, possibly with a nop in one slot. Figure 4.68 shows how the<br />
instructions look as they go into the pipeline in pairs.<br />
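The pairing rule can be sketched as a small packing routine; the enum values and the greedy policy here are illustrative, not a real encoder:

```c
/* Each issue packet has an ALU/branch slot followed by a load/store slot;
   a slot that cannot be used in program order is filled with a nop. */

typedef enum { ALU_BR, LOAD_STORE, NOP } kind;

int pack(const kind *prog, int n, kind out[][2]) {
    int packets = 0, i = 0;
    while (i < n) {
        out[packets][0] = NOP;
        out[packets][1] = NOP;
        if (prog[i] == ALU_BR) {
            out[packets][0] = prog[i++];
            if (i < n && prog[i] == LOAD_STORE)
                out[packets][1] = prog[i++];   /* pair ALU with a following load/store */
        } else {
            out[packets][1] = prog[i++];       /* lone load/store: nop in slot 0 */
        }
        packets++;
    }
    return packets;
}

/* Example stream: ALU, load, load, ALU */
static const kind example[4] = { ALU_BR, LOAD_STORE, LOAD_STORE, ALU_BR };
static kind packed[4][2];
```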
Static multiple-issue processors vary in how they deal with potential data and<br />
control hazards. In some designs, the compiler takes full responsibility for removing<br />
all hazards, scheduling the code and inserting no-ops so that the code executes<br />
without any need for hazard detection or hardware-generated stalls. In others,<br />
the hardware detects data hazards and generates stalls between two issue packets,<br />
while requiring that the compiler avoid all dependences within an instruction pair.<br />
Even so, a hazard generally forces the entire issue packet containing the dependent<br />
Very Long Instruction Word (VLIW) A style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields.<br />
Instruction type              Pipe stages<br />
ALU or branch instruction     IF ID EX MEM WB<br />
Load or store instruction     IF ID EX MEM WB<br />
ALU or branch instruction        IF ID EX MEM WB<br />
Load or store instruction        IF ID EX MEM WB<br />
ALU or branch instruction           IF ID EX MEM WB<br />
Load or store instruction           IF ID EX MEM WB<br />
ALU or branch instruction              IF ID EX MEM WB<br />
Load or store instruction              IF ID EX MEM WB<br />
FIGURE 4.68 Static two-issue pipeline in operation. The ALU and data transfer instructions<br />
are issued at the same time. Here we have assumed the same five-stage structure as used for the single-issue<br />
pipeline. Although this is not strictly necessary, it does have some advantages. In particular, keeping the<br />
register writes at the end of the pipeline simplifies the handling of exceptions and the maintenance of a<br />
precise exception model, which become more difficult in multiple-issue processors.<br />
instruction to stall. Whether the software must handle all hazards or only try to<br />
reduce the fraction of hazards between separate issue packets, the appearance of<br />
having a large single instruction with multiple operations is reinforced. We will<br />
assume the second approach for this example.<br />
To issue an ALU and a data transfer operation in parallel, the first need for<br />
additional hardware—beyond the usual hazard detection <strong>and</strong> stall logic—is extra<br />
ports in the register file (see Figure 4.69). In one clock cycle we may need to read<br />
two registers for the ALU operation and two more for a store, and also one write<br />
port for an ALU operation and one write port for a load. Since the ALU is tied<br />
up for the ALU operation, we also need a separate adder to calculate the effective<br />
address for data transfers. Without these extra resources, our two-issue pipeline<br />
would be hindered by structural hazards.<br />
Clearly, this two-issue processor can improve performance by up to a factor of<br />
two. Doing so, however, requires that twice as many instructions be overlapped<br />
in execution, and this additional overlap increases the relative performance loss<br />
from data and control hazards. For example, in our simple five-stage pipeline,<br />
FIGURE 4.69 A static two-issue datapath. The additions needed for double issue are highlighted: another 32 bits from instruction<br />
memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles address<br />
calculations for data transfers and the top ALU handles everything else.<br />
4.10 Parallelism via Instructions 337<br />
loads have a use latency of one clock cycle, which prevents one instruction from<br />
using the result without stalling. In the two-issue, five-stage pipeline the result of<br />
a load instruction cannot be used on the next clock cycle. This means that the next<br />
two instructions cannot use the load result without stalling. Furthermore, ALU<br />
instructions that had no use latency in the simple five-stage pipeline now have a<br />
one-instruction use latency, since the results cannot be used in the paired load or<br />
store. To effectively exploit the parallelism available in a multiple-issue processor,<br />
more ambitious compiler or hardware scheduling techniques are needed, and static<br />
multiple issue requires that the compiler take on this role.<br />
use latency Number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.<br />
Simple Multiple-Issue Code Scheduling<br />
How would this loop be scheduled on a static two-issue pipeline for MIPS?<br />
EXAMPLE<br />
Loop: lw   $t0, 0($s1)        # $t0 = array element<br />
      addu $t0, $t0, $s2      # add scalar in $s2<br />
      sw   $t0, 0($s1)        # store result<br />
      addi $s1, $s1, -4       # decrement pointer<br />
      bne  $s1, $zero, Loop   # branch if $s1 != 0<br />
Reorder the instructions to avoid as many pipeline stalls as possible. Assume<br />
branches are predicted, so that control hazards are handled by the hardware.<br />
The first three instructions have data dependences, <strong>and</strong> so do the last two.<br />
Figure 4.70 shows the best schedule for these instructions. Notice that just<br />
one pair of instructions has both issue slots used. It takes four clocks per loop<br />
iteration; at four clocks to execute five instructions, we get the disappointing<br />
CPI of 0.8 versus the best case of 0.5, or an IPC of 1.25 versus 2.0. Notice<br />
that in computing CPI or IPC, we do not count any nops executed as useful<br />
instructions. Doing so would improve CPI, but not performance!<br />
ANSWER<br />
      ALU or branch instruction   Data transfer instruction   Clock cycle<br />
Loop:                             lw   $t0, 0($s1)            1<br />
      addi $s1, $s1, -4                                       2<br />
      addu $t0, $t0, $s2                                      3<br />
      bne  $s1, $zero, Loop       sw   $t0, 4($s1)            4<br />
FIGURE 4.70 The scheduled code as it would look on a two-issue MIPS pipeline. The empty<br />
slots are no-ops.
loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.<br />
EXAMPLE<br />
An important compiler technique to get more performance from loops<br />
is loop unrolling, where multiple copies of the loop body are made. After<br />
unrolling, there is more ILP available by overlapping instructions from different<br />
iterations.<br />
Loop Unrolling for Multiple-Issue Pipelines<br />
See how well loop unrolling <strong>and</strong> scheduling work in the example above. For<br />
simplicity assume that the loop index is a multiple of four.<br />
ANSWER<br />
register renaming The<br />
renaming of registers<br />
by the compiler or<br />
hardware to remove<br />
antidependences.<br />
antidependence Also<br />
called name<br />
dependence. An<br />
ordering forced by the<br />
reuse of a name, typically<br />
a register, rather than by<br />
a true dependence that<br />
carries a value between<br />
two instructions.<br />
To schedule the loop without any delays, it turns out that we need to make<br />
four copies of the loop body. After unrolling <strong>and</strong> eliminating the unnecessary<br />
loop overhead instructions, the loop will contain four copies each of lw, add,<br />
and sw, plus one addi and one bne. Figure 4.71 shows the unrolled and<br />
scheduled code.<br />
During the unrolling process, the compiler introduced additional registers<br />
($t1, $t2, $t3). The goal of this process, called register renaming, is to<br />
eliminate dependences that are not true data dependences, but could either<br />
lead to potential hazards or prevent the compiler from flexibly scheduling<br />
the code. Consider how the unrolled code would look using only $t0. There<br />
would be repeated instances of lw $t0, 0($s1), addu $t0, $t0, $s2<br />
followed by sw $t0, 4($s1), but these sequences, despite using $t0, are<br />
actually completely independent—no data values flow between one set of these<br />
instructions <strong>and</strong> the next set. This case is what is called an antidependence or<br />
name dependence, which is an ordering forced purely by the reuse of a name,<br />
rather than a real data dependence that is also called a true dependence.<br />
Renaming the registers during the unrolling process allows the compiler<br />
to move these independent instructions subsequently so as to better schedule<br />
      ALU or branch instruction   Data transfer instruction   Clock cycle<br />
Loop: addi $s1, $s1, -16          lw   $t0,  0($s1)           1<br />
                                  lw   $t1, 12($s1)           2<br />
      addu $t0, $t0, $s2          lw   $t2,  8($s1)           3<br />
      addu $t1, $t1, $s2          lw   $t3,  4($s1)           4<br />
      addu $t2, $t2, $s2          sw   $t0, 16($s1)           5<br />
      addu $t3, $t3, $s2          sw   $t1, 12($s1)           6<br />
                                  sw   $t2,  8($s1)           7<br />
      bne  $s1, $zero, Loop       sw   $t3,  4($s1)           8<br />
FIGURE 4.71 The unrolled <strong>and</strong> scheduled code of Figure 4.70 as it would look on a static<br />
two-issue MIPS pipeline. The empty slots are no-ops. Since the first instruction in the loop decrements<br />
$s1 by 16, the addresses loaded are the original value of $s1, then that address minus 4, minus 8, and minus 12.
the code. The renaming process eliminates the name dependences, while<br />
preserving the true dependences.<br />
Notice now that 12 of the 14 instructions in the loop execute as pairs. It takes<br />
8 clocks for 4 loop iterations, or 2 clocks per iteration, which yields a CPI of 8/14<br />
= 0.57. Loop unrolling <strong>and</strong> scheduling with dual issue gave us an improvement<br />
factor of almost 2, partly from reducing the loop control instructions and partly<br />
from dual issue execution. The cost of this performance improvement is using four<br />
temporary registers rather than one, as well as a significant increase in code size.<br />
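The CPI arithmetic in the two examples above is easy to check with a one-line helper (nops are not counted as useful instructions):

```c
/* Cycles per instruction for a schedule: total clocks over useful instructions. */

double cpi(int clocks, int useful_instructions) {
    return (double)clocks / useful_instructions;
}
```

Here cpi(4, 5) gives the 0.8 of Figure 4.70, cpi(8, 14) gives the 0.57 of Figure 4.71, and 8 clocks over 4 iterations is the 2 clocks per iteration quoted in the text.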
Dynamic Multiple-Issue Processors<br />
Dynamic multiple-issue processors are also known as superscalar processors, or<br />
simply superscalars. In the simplest superscalar processors, instructions issue in<br />
order, <strong>and</strong> the processor decides whether zero, one, or more instructions can issue<br />
in a given clock cycle. Obviously, achieving good performance on such a processor<br />
still requires the compiler to try to schedule instructions to move dependences<br />
apart <strong>and</strong> thereby improve the instruction issue rate. Even with such compiler<br />
scheduling, there is an important difference between this simple superscalar<br />
and a VLIW processor: the code, whether scheduled or not, is guaranteed by<br />
the hardware to execute correctly. Furthermore, compiled code will always run<br />
correctly independent of the issue rate or pipeline structure of the processor. In<br />
some VLIW designs, this has not been the case, and recompilation was required<br />
when moving across different processor models; in other static issue processors,<br />
code would run correctly across different implementations, but often so poorly as<br />
to make compilation effectively required.<br />
Many superscalars extend the basic framework of dynamic issue decisions to<br />
include dynamic pipeline scheduling. Dynamic pipeline scheduling chooses<br />
which instructions to execute in a given clock cycle while trying to avoid hazards<br />
<strong>and</strong> stalls. Let’s start with a simple example of avoiding a data hazard. Consider the<br />
following code sequence:<br />
lw $t0, 20($s2)<br />
addu $t1, $t0, $t2<br />
sub $s4, $s4, $t3<br />
slti $t5, $s4, 20<br />
Even though the sub instruction is ready to execute, it must wait for the lw<br />
<strong>and</strong> addu to complete first, which might take many clock cycles if memory is slow.<br />
(Chapter 5 explains cache misses, the reason that memory accesses are sometimes<br />
very slow.) Dynamic pipeline scheduling allows such hazards to be avoided either<br />
fully or partially.<br />
superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.<br />
dynamic pipeline scheduling Hardware support for reordering the order of instruction execution so as to avoid stalls.<br />
Dynamic Pipeline Scheduling<br />
Dynamic pipeline scheduling chooses which instructions to execute next, possibly<br />
reordering them to avoid stalls. In such processors, the pipeline is divided into<br />
three major units: an instruction fetch <strong>and</strong> issue unit, multiple functional units
340 Chapter 4 The Processor<br />
[Figure: an instruction fetch and decode unit issues instructions in order to per-functional-unit reservation stations; the functional units (integer, integer, ..., floating point, load-store) execute out of order and send results to a commit unit, which commits in order.]<br />
FIGURE 4.72 The three primary units of a dynamically scheduled pipeline. The final step of<br />
updating the state is also called retirement or graduation.<br />
commit unit The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.<br />
reservation station A buffer within a functional unit that holds the operands and the operation.<br />
reorder buffer The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.<br />
(a dozen or more in high-end designs in 2013), <strong>and</strong> a commit unit. Figure 4.72<br />
shows the model. The first unit fetches instructions, decodes them, <strong>and</strong> sends<br />
each instruction to a corresponding functional unit for execution. Each functional<br />
unit has buffers, called reservation stations, which hold the oper<strong>and</strong>s <strong>and</strong> the<br />
operation. (The Elaboration discusses an alternative to reservation stations used<br />
by many recent processors.) As soon as the buffer contains all its oper<strong>and</strong>s <strong>and</strong><br />
the functional unit is ready to execute, the result is calculated. When the result is<br />
completed, it is sent to any reservation stations waiting for this particular result<br />
as well as to the commit unit, which buffers the result until it is safe to put the<br />
result into the register file or, for a store, into memory. The buffer in the commit<br />
unit, often called the reorder buffer, is also used to supply oper<strong>and</strong>s, in much the<br />
same way as forwarding logic does in a statically scheduled pipeline. Once a result<br />
is committed to the register file, it can be fetched directly from there, just as in a<br />
normal pipeline.<br />
The combination of buffering oper<strong>and</strong>s in the reservation stations <strong>and</strong> results<br />
in the reorder buffer provides a form of register renaming, just like that used by<br />
the compiler in our earlier loop-unrolling example on page 338. To see how this<br />
conceptually works, consider the following steps:
4.10 Parallelism via Instructions 341<br />
1. When an instruction issues, it is copied to a reservation station for the<br />
appropriate functional unit. Any oper<strong>and</strong>s that are available in the register<br />
file or reorder buffer are also immediately copied into the reservation station.<br />
The instruction is buffered in the reservation station until all the oper<strong>and</strong>s<br />
<strong>and</strong> the functional unit are available. For the issuing instruction, the register<br />
copy of the oper<strong>and</strong> is no longer required, <strong>and</strong> if a write to that register<br />
occurred, the value could be overwritten.<br />
2. If an oper<strong>and</strong> is not in the register file or reorder buffer, it must be waiting to<br />
be produced by a functional unit. The name of the functional unit that will<br />
produce the result is tracked. When that unit eventually produces the result,<br />
it is copied directly into the waiting reservation station from the functional<br />
unit bypassing the registers.<br />
These steps effectively use the reorder buffer <strong>and</strong> the reservation stations to<br />
implement register renaming.<br />
Conceptually, you can think of a dynamically scheduled pipeline as analyzing<br />
the data flow structure of a program. The processor then executes the instructions<br />
in some order that preserves the data flow order of the program. This style of<br />
execution is called an out-of-order execution, since the instructions can be<br />
executed in a different order than they were fetched.<br />
To make programs behave as if they were running on a simple in-order pipeline,<br />
the instruction fetch <strong>and</strong> decode unit is required to issue instructions in order,<br />
which allows dependences to be tracked, <strong>and</strong> the commit unit is required to write<br />
results to registers <strong>and</strong> memory in program fetch order. This conservative mode is<br />
called in-order commit. Hence, if an exception occurs, the computer can point to<br />
the last instruction executed, <strong>and</strong> the only registers updated will be those written<br />
by instructions before the instruction causing the exception. Although the front<br />
end (fetch <strong>and</strong> issue) <strong>and</strong> the back end (commit) of the pipeline run in order,<br />
the functional units are free to initiate execution whenever the data they need is<br />
available. Today, all dynamically scheduled pipelines use in-order commit.<br />
Dynamic scheduling is often extended by including hardware-based speculation,<br />
especially for branch outcomes. By predicting the direction of a branch, a<br />
dynamically scheduled processor can continue to fetch <strong>and</strong> execute instructions<br />
along the predicted path. Because the instructions are committed in order, we know<br />
whether or not the branch was correctly predicted before any instructions from the<br />
predicted path are committed. A speculative, dynamically scheduled pipeline can<br />
also support speculation on load addresses, allowing load-store reordering, <strong>and</strong><br />
using the commit unit to avoid incorrect speculation. In the next section, we will<br />
look at the use of dynamic scheduling with speculation in the Intel Core i7 design.<br />
out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.<br />
in-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.<br />
Modern, high-performance microprocessors are capable of issuing several instructions<br />
per clock; unfortunately, sustaining that issue rate is very difficult. For example, despite<br />
the existence of processors with four to six issues per clock, very few applications can<br />
sustain more than two instructions per clock. There are two primary reasons for this.<br />
First, within the pipeline, the major performance bottlenecks arise from<br />
dependences that cannot be alleviated, thus reducing the parallelism among<br />
instructions <strong>and</strong> the sustained issue rate. Although little can be done about true data<br />
dependences, often the compiler or hardware does not know precisely whether a<br />
dependence exists or not, <strong>and</strong> so must conservatively assume the dependence exists.<br />
For example, code that makes use of pointers, particularly in ways that may lead to<br />
aliasing, will lead to more implied potential dependences. In contrast, the greater<br />
regularity of array accesses often allows a compiler to deduce that no dependences<br />
exist. Similarly, branches that cannot be accurately predicted whether at runtime or<br />
compile time will limit the ability to exploit ILP. Often, additional ILP is available, but<br />
the ability of the compiler or the hardware to find ILP that may be widely separated<br />
(sometimes by the execution of thous<strong>and</strong>s of instructions) is limited.<br />
Second, losses in the memory hierarchy (the topic of Chapter 5) also limit the<br />
ability to keep the pipeline full. Some memory system stalls can be hidden, but<br />
limited amounts of ILP also limit the extent to which such stalls can be hidden.<br />
Hardware/Software Interface<br />
Energy Efficiency <strong>and</strong> Advanced Pipelining<br />
The downside to the increasing exploitation of instruction-level parallelism via<br />
dynamic multiple issue <strong>and</strong> speculation is potential energy inefficiency. Each<br />
innovation was able to turn more transistors into performance, but they often did<br />
so very inefficiently. Now that we have hit the power wall, we are seeing designs<br />
with multiple processors per chip where the processors are not as deeply pipelined<br />
or as aggressively speculative as their predecessors.<br />
The belief is that while the simpler processors are not as fast as their sophisticated<br />
brethren, they deliver better performance per joule, so that they can deliver more<br />
performance per chip when designs are constrained more by energy than they are<br />
by number of transistors.<br />
Figure 4.73 shows the number of pipeline stages, the issue width, speculation level,<br />
clock rate, cores per chip, <strong>and</strong> power of several past <strong>and</strong> recent microprocessors. Note<br />
the drop in pipeline stages <strong>and</strong> power as companies switch to multicore designs.<br />
Elaboration: A commit unit controls updates to the register file <strong>and</strong> memory. Some<br />
dynamically scheduled processors update the register file immediately during execution,<br />
using extra registers to implement the renaming function <strong>and</strong> preserving the older copy of a<br />
register until the instruction updating the register is no longer speculative. Other processors<br />
buffer the result, typically in a structure called a reorder buffer, <strong>and</strong> the actual update to the<br />
register file occurs later as part of the commit. Stores to memory must be buffered until<br />
commit time either in a store buffer (see Chapter 5) or in the reorder buffer. The commit unit<br />
allows the store to write to memory from the buffer when the buffer has a valid address <strong>and</strong><br />
valid data, <strong>and</strong> when the store is no longer dependent on predicted branches.
Microprocessor | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-Order/Speculation | Cores/Chip | Power<br />
Intel 486 | 1989 | 25 MHz | 5 | 1 | No | 1 | 5 W<br />
Intel Pentium | 1993 | 66 MHz | 5 | 2 | No | 1 | 10 W<br />
Intel Pentium Pro | 1997 | 200 MHz | 10 | 3 | Yes | 1 | 29 W<br />
Intel Pentium 4 Willamette | 2001 | 2000 MHz | 22 | 3 | Yes | 1 | 75 W<br />
Intel Pentium 4 Prescott | 2004 | 3600 MHz | 31 | 3 | Yes | 1 | 103 W<br />
Intel Core | 2006 | 2930 MHz | 14 | 4 | Yes | 2 | 75 W<br />
Intel Core i5 Nehalem | 2010 | 3300 MHz | 14 | 4 | Yes | 1 | 87 W<br />
Intel Core i5 Ivy Bridge | 2012 | 3400 MHz | 14 | 4 | Yes | 8 | 77 W<br />
FIGURE 4.73 Record of Intel Microprocessors in terms of pipeline complexity, number of cores, <strong>and</strong> power. The Pentium<br />
4 pipeline stages do not include the commit stages. If we included them, the Pentium 4 pipelines would be even deeper.<br />
Elaboration: Memory accesses benefit from nonblocking caches, which continue<br />
servicing cache accesses during a cache miss (see Chapter 5). Out-of-order execution<br />
processors need the cache design to allow instructions to execute during a miss.<br />
Check Yourself<br />
State whether the following techniques or components are associated primarily<br />
with a software- or hardware-based approach to exploiting ILP. In some cases, the<br />
answer may be both.<br />
1. Branch prediction<br />
2. Multiple issue<br />
3. VLIW<br />
4. Superscalar<br />
5. Dynamic scheduling<br />
6. Out-of-order execution<br />
7. Speculation<br />
8. Reorder buffer<br />
9. Register renaming<br />
4.11 Real Stuff: The ARM Cortex-A8 <strong>and</strong> Intel<br />
Core i7 Pipelines<br />
Figure 4.74 describes the two microprocessors we examine in this section, whose<br />
targets are the two bookends of the PostPC Era.
Processor | ARM A8 | Intel Core i7 920<br />
Market | Personal Mobile Device | Server, Cloud<br />
Thermal design power | 2 Watts | 130 Watts<br />
Clock rate | 1 GHz | 2.66 GHz<br />
Cores/Chip | 1 | 4<br />
Floating point? | No | Yes<br />
Multiple issue? | Dynamic | Dynamic<br />
Peak instructions/clock cycle | 2 | 4<br />
Pipeline stages | 14 | 14<br />
Pipeline schedule | Static in-order | Dynamic out-of-order with speculation<br />
Branch prediction | 2-level | 2-level<br />
1st level caches/core | 32 KiB I, 32 KiB D | 32 KiB I, 32 KiB D<br />
2nd level cache/core | 128-1024 KiB | 256 KiB<br />
3rd level cache (shared) | -- | 2-8 MiB<br />
FIGURE 4.74 Specification of the ARM Cortex-A8 <strong>and</strong> the Intel Core i7 920.<br />
The ARM Cortex-A8<br />
The ARM Cortex-A8 runs at 1 GHz with a 14-stage pipeline. It uses dynamic<br />
multiple issue, with two instructions per clock cycle. It is a static in-order pipeline,<br />
in that instructions issue, execute, <strong>and</strong> commit in order. The pipeline consists of<br />
three sections for instruction fetch, instruction decode, <strong>and</strong> execute. Figure 4.75<br />
shows the overall pipeline.<br />
The first three stages fetch two instructions at a time <strong>and</strong> try to keep a<br />
12-instruction entry prefetch buffer full. It uses a two-level branch predictor with<br />
a 512-entry branch target buffer, a 4096-entry global history buffer, and an<br />
8-entry return stack to predict future returns. When the branch prediction is<br />
wrong, it empties the pipeline, resulting in a 13-clock cycle misprediction penalty.<br />
The five stages of the decode pipeline determine if there are dependences<br />
between a pair of instructions, which would force sequential execution, and to<br />
which pipeline of the execution stages to send the instructions.<br />
The six stages of the instruction execution section offer one pipeline for load<br />
<strong>and</strong> store instructions <strong>and</strong> two pipelines for arithmetic operations, although only<br />
the first of the pair can h<strong>and</strong>le multiplies. Either instruction from the pair can be<br />
issued to the load-store pipeline. The execution stages have full bypassing between<br />
the three pipelines.<br />
Figure 4.76 shows the CPI of the A8 using small versions of programs derived<br />
from the SPEC2000 benchmarks. While the ideal CPI is 0.5, the best case here is<br />
1.4, the median case is 2.0, <strong>and</strong> the worst case is 5.2. For the median case, 80% of<br />
the stalls are due to the pipelining hazards <strong>and</strong> 20% are stalls due to the memory
issue the micro-ops from the buffer, eliminating the need for the instruction<br />
fetch <strong>and</strong> instruction decode stages to be activated.<br />
5. Perform the basic instruction issue—Looking up the register location in the<br />
register tables, renaming the registers, allocating a reorder buffer entry, <strong>and</strong><br />
fetching any results from the registers or reorder buffer before sending the<br />
micro-ops to the reservation stations.<br />
6. The i7 uses a 36-entry centralized reservation station shared by six functional<br />
units. Up to six micro-ops may be dispatched to the functional units every<br />
clock cycle.<br />
7. The individual function units execute micro-ops <strong>and</strong> then results are sent<br />
back to any waiting reservation station as well as to the register retirement<br />
unit, where they will update the register state, once it is known that the<br />
instruction is no longer speculative. The entry corresponding to the<br />
instruction in the reorder buffer is marked as complete.<br />
8. When one or more instructions at the head of the reorder buffer have been<br />
marked as complete, the pending writes in the register retirement unit are<br />
executed, <strong>and</strong> the instructions are removed from the reorder buffer.<br />
Elaboration: Hardware in the second <strong>and</strong> fourth steps can combine or fuse operations<br />
together to reduce the number of operations that must be performed. Macro-op fusion<br />
in the second step takes x86 instruction combinations, such as compare followed by a<br />
branch, <strong>and</strong> fuses them into a single operation. Microfusion in the fourth step combines<br />
micro-operation pairs such as load/ALU operation <strong>and</strong> ALU operation/store <strong>and</strong> issues<br />
them to a single reservation station (where they can still issue independently), thus<br />
increasing the usage of the buffer. In a study of the Intel Core architecture, which also<br />
incorporated microfusion <strong>and</strong> macrofusion, Bird et al. [2007] discovered that microfusion<br />
had little impact on performance, while macrofusion appears to have a modest positive<br />
impact on integer performance <strong>and</strong> little impact on floating-point performance.<br />
Performance of the Intel Core i7 920<br />
Figure 4.78 shows the CPI of the Intel Core i7 for each of the SPEC2006 benchmarks.<br />
While the ideal CPI is 0.25, the best case here is 0.44, the median case is 0.79, <strong>and</strong><br />
the worst case is 2.67.<br />
While it is difficult to differentiate between pipeline stalls <strong>and</strong> memory stalls<br />
in a dynamic out-of-order execution pipeline, we can show the effectiveness of<br />
branch prediction <strong>and</strong> speculation. Figure 4.79 shows the percentage of branches<br />
mispredicted and the percentage of the work (measured by the numbers of micro-ops<br />
dispatched into the pipeline) that does not retire (that is, their results are<br />
annulled) relative to all micro-op dispatches. The min, median, <strong>and</strong> max of branch<br />
mispredictions are 0%, 2%, <strong>and</strong> 10%. For wasted work, they are 1%, 18%, <strong>and</strong> 39%.<br />
The wasted work in some cases closely matches the branch misprediction rates,<br />
such as for gobmk <strong>and</strong> astar. In several instances, such as mcf, the wasted work<br />
seems relatively larger than the misprediction rate. This divergence is likely due
[Figure: bar chart of CPI for each SPEC2006 integer benchmark (libquantum, h264ref, hmmer, perlbench, bzip2, xalancbmk, sjeng, gobmk, astar, gcc, omnetpp, mcf), with each bar split into ideal CPI and stalls/misspeculation; values range from 0.44 to 2.67.]<br />
FIGURE 4.78 CPI of Intel Core i7 920 running SPEC2006 integer benchmarks.<br />
[Figure: branch misprediction percentage and wasted-work percentage for each SPEC2006 integer benchmark; mispredictions range from 0% to 10% and wasted work from 1% to 39%.]<br />
FIGURE 4.79 Percentage of branch mispredictions <strong>and</strong> wasted work due to unfruitful<br />
speculation of Intel Core i7 920 running SPEC2006 integer benchmarks.
#include <x86intrin.h><br />
#define UNROLL (4)<br />
<br />
void dgemm (int n, double* A, double* B, double* C)<br />
{<br />
  for ( int i = 0; i < n; i+=UNROLL*4 )<br />
    for ( int j = 0; j < n; j++ ) {<br />
      __m256d c[4];<br />
      for ( int x = 0; x < UNROLL; x++ )<br />
        c[x] = _mm256_load_pd(C+i+x*4+j*n);<br />
<br />
      for ( int k = 0; k < n; k++ )<br />
      {<br />
        __m256d b = _mm256_broadcast_sd(B+k+j*n);<br />
        for ( int x = 0; x < UNROLL; x++ )<br />
          c[x] = _mm256_add_pd(c[x],<br />
                 _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));<br />
      }<br />
<br />
      for ( int x = 0; x < UNROLL; x++ )<br />
        _mm256_store_pd(C+i+x*4+j*n, c[x]);<br />
    }<br />
}<br />
FIGURE 4.80 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel<br />
instructions for the x86 (Figure 3.23) and loop unrolling to create more opportunities for<br />
instruction-level parallelism. Figure 4.81 shows the assembly language produced by the compiler for the inner<br />
loop, which unrolls the three for-loop bodies to expose instruction level parallelism.<br />
instruction, since we can use the four copies of the B element in register %ymm0<br />
repeatedly throughout the loop. Thus, the 5 AVX instructions in Figure 3.24<br />
become 17 in Figure 4.81, <strong>and</strong> the 7 integer instructions appear in both, although<br />
the constants and addressing change to account for the unrolling. Hence, despite<br />
unrolling 4 times, the number of instructions in the body of the loop only doubles:<br />
from 12 to 24.<br />
Figure 4.82 shows the performance increase of DGEMM for 32x32 matrices in<br />
going from unoptimized to AVX <strong>and</strong> then to AVX with unrolling. Unrolling more<br />
than doubles performance, going from 6.4 GFLOPS to 14.6 GFLOPS. Optimizations<br />
for subword parallelism <strong>and</strong> instruction level parallelism result in an overall<br />
speedup of 8.8 versus the unoptimized DGEMM in Figure 3.21.<br />
Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with<br />
Turbo mode turned off. If we turn it on, then as in Chapter 3 the temporary increase in<br />
clock rate of 3.3/2.6 = 1.27 improves all the results: to 2.1 GFLOPS for unoptimized<br />
DGEMM, 8.1 GFLOPS with AVX, and 18.6 GFLOPS with unrolling and AVX. As mentioned<br />
in Section 3.8, Turbo mode works particularly well in this case because it is using only<br />
a single core of an eight-core chip.<br />
4.12 Going Faster: Instruction-Level Parallelism <strong>and</strong> Matrix Multiply 353<br />
1  vmovapd (%r11),%ymm4            # Load 4 elements of C into %ymm4<br />
2  mov %rbx,%rax                   # register %rax = %rbx<br />
3  xor %ecx,%ecx                   # register %ecx = 0<br />
4  vmovapd 0x20(%r11),%ymm3        # Load 4 elements of C into %ymm3<br />
5  vmovapd 0x40(%r11),%ymm2        # Load 4 elements of C into %ymm2<br />
6  vmovapd 0x60(%r11),%ymm1        # Load 4 elements of C into %ymm1<br />
7  vbroadcastsd (%rcx,%r9,1),%ymm0 # Make 4 copies of B element<br />
8  add $0x8,%rcx                   # register %rcx = %rcx + 8<br />
9  vmulpd (%rax),%ymm0,%ymm5       # Parallel mul %ymm0, 4 A elements<br />
10 vaddpd %ymm5,%ymm4,%ymm4        # Parallel add %ymm5, %ymm4<br />
11 vmulpd 0x20(%rax),%ymm0,%ymm5   # Parallel mul %ymm0, 4 A elements<br />
12 vaddpd %ymm5,%ymm3,%ymm3        # Parallel add %ymm5, %ymm3<br />
13 vmulpd 0x40(%rax),%ymm0,%ymm5   # Parallel mul %ymm0, 4 A elements<br />
14 vmulpd 0x60(%rax),%ymm0,%ymm0   # Parallel mul %ymm0, 4 A elements<br />
15 add %r8,%rax                    # register %rax = %rax + %r8<br />
16 cmp %r10,%rcx                   # compare %r10 to %rcx<br />
17 vaddpd %ymm5,%ymm2,%ymm2        # Parallel add %ymm5, %ymm2<br />
18 vaddpd %ymm0,%ymm1,%ymm1        # Parallel add %ymm0, %ymm1<br />
19 jne 68                          # loop back if %rcx != %r10<br />
20 add $0x1,%esi                   # register %esi = %esi + 1<br />
21 vmovapd %ymm4,(%r11)            # Store %ymm4 into 4 C elements<br />
22 vmovapd %ymm3,0x20(%r11)        # Store %ymm3 into 4 C elements<br />
23 vmovapd %ymm2,0x40(%r11)        # Store %ymm2 into 4 C elements<br />
24 vmovapd %ymm1,0x60(%r11)        # Store %ymm1 into 4 C elements<br />
FIGURE 4.81 The x86 assembly language for the body of the nested loops generated by compiling<br />
the unrolled C code in Figure 4.80.<br />
Elaboration: There are no pipeline stalls despite the reuse of register %ymm5 in lines<br />
9 to 17 of Figure 4.81 because the Intel Core i7 pipeline renames the registers.<br />
Are the following statements true or false?<br />
1. The Intel Core i7 uses a multiple-issue pipeline to directly execute x86<br />
instructions.<br />
2. Both the A8 <strong>and</strong> the Core i7 use dynamic multiple issue.<br />
3. The Core i7 microarchitecture has many more registers than x86 requires.<br />
4. The Intel Core i7 uses less than half the pipeline stages of the earlier Intel<br />
Pentium 4 Prescott (see Figure 4.73).<br />
Check Yourself<br />
4.14 Fallacies <strong>and</strong> Pitfalls<br />
Fallacy: Pipelining is easy.<br />
Our books testify to the subtlety of correct pipeline execution. Our advanced book<br />
had a pipeline bug in its first edition, despite its being reviewed by more than 100<br />
people <strong>and</strong> being class-tested at 18 universities. The bug was uncovered only when<br />
someone tried to build the computer in that book. The fact that the Verilog to<br />
describe a pipeline like that in the Intel Core i7 will be many thous<strong>and</strong>s of lines is<br />
an indication of the complexity. Beware!<br />
Fallacy: Pipelining ideas can be implemented independent of technology.<br />
When the number of transistors on-chip <strong>and</strong> the speed of transistors made a<br />
five-stage pipeline the best solution, then the delayed branch (see the Elaboration<br />
on page 255) was a simple solution to control hazards. With longer pipelines,<br />
superscalar execution, <strong>and</strong> dynamic branch prediction, it is now redundant. In<br />
the early 1990s, dynamic pipeline scheduling took too many resources <strong>and</strong> was<br />
not required for high performance, but as transistor budgets continued to double<br />
due to Moore’s Law <strong>and</strong> logic became much faster than memory, then multiple<br />
functional units <strong>and</strong> dynamic pipelining made more sense. Today, concerns about<br />
power are leading to less aggressive designs.<br />
Pitfall: Failure to consider instruction set design can adversely impact pipelining.<br />
Many of the difficulties of pipelining arise because of instruction set complications.<br />
Here are some examples:<br />
■ Widely variable instruction lengths <strong>and</strong> running times can lead to imbalance<br />
among pipeline stages <strong>and</strong> severely complicate hazard detection in a design<br />
pipelined at the instruction set level. This problem was overcome, initially<br />
in the DEC VAX 8500 in the late 1980s, using the micro-operations <strong>and</strong><br />
micropipelined scheme that the Intel Core i7 employs today. Of course, the<br />
overhead of translation and maintaining correspondence between the micro-operations<br />
and the actual instructions remains.<br />
■ Sophisticated addressing modes can lead to different sorts of problems.<br />
Addressing modes that update registers complicate hazard detection. Other<br />
addressing modes that require multiple memory accesses substantially<br />
complicate pipeline control <strong>and</strong> make it difficult to keep the pipeline flowing<br />
smoothly.<br />
■ Perhaps the best example is the DEC Alpha <strong>and</strong> the DEC NVAX. In<br />
comparable technology, the newer instruction set architecture of the Alpha<br />
allowed an implementation whose performance was more than twice that of the<br />
NVAX. In another example, Bhandarkar and Clark [1991] compared the<br />
MIPS M/2000 <strong>and</strong> the DEC VAX 8700 by counting clock cycles of the SPEC<br />
benchmarks; they concluded that although the MIPS M/2000 executes more
4.3 When processor designers consider a possible improvement to the processor<br />
datapath, the decision usually depends on the cost/performance trade-off. In<br />
the following three problems, assume that we are starting with a datapath from<br />
Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem, <strong>and</strong> Control blocks have<br />
latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, <strong>and</strong> 100 ps, respectively,<br />
<strong>and</strong> costs of 1000, 30, 10, 100, 200, 2000, <strong>and</strong> 500, respectively.<br />
Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the<br />
latency of the ALU <strong>and</strong> will add a cost of 600 to the ALU. The result will be 5% fewer<br />
instructions executed since we will no longer need to emulate the MUL instruction.<br />
4.3.1 [10] What is the clock cycle time with <strong>and</strong> without this improvement?<br />
4.3.2 [10] What is the speedup achieved by adding this improvement?<br />
4.3.3 [10] Compare the cost/performance ratio with <strong>and</strong> without this<br />
improvement.<br />
4.4 Problems in this exercise assume that logic blocks needed to implement a<br />
processor’s datapath have the following latencies:<br />
I-Mem Add Mux ALU Regs D-Mem Sign-Extend Shift-Left-2<br />
200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps<br />
4.4.1 [10] If the only thing we need to do in a processor is fetch consecutive<br />
instructions (Figure 4.6), what would the cycle time be?<br />
4.4.2 [10] Consider a datapath similar to the one in Figure 4.11, but for a<br />
processor that only has one type of instruction: unconditional PC-relative branch.<br />
What would the cycle time be for this datapath?<br />
4.4.3 [10] Repeat 4.4.2, but this time we need to support only conditional<br />
PC-relative branches.<br />
The remaining three problems in this exercise refer to the datapath element Shift-left-2:<br />
4.4.4 [10] Which kinds of instructions require this resource?<br />
4.4.5 [20] For which kinds of instructions (if any) is this resource on the<br />
critical path?<br />
4.4.6 [10] Assuming that we only support beq <strong>and</strong> add instructions,<br />
discuss how changes in the given latency of this resource affect the cycle time of the<br />
processor. Assume that the latencies of other resources do not change.
4.17 Exercises 359<br />
4.5 For the problems in this exercise, assume that there are no pipeline stalls <strong>and</strong><br />
that the breakdown of executed instructions is as follows:<br />
add addi not beq lw sw<br />
20% 20% 0% 25% 25% 10%<br />
4.5.1 [10] In what fraction of all cycles is the data memory used?<br />
4.5.2 [10] In what fraction of all cycles is the input of the sign-extend<br />
circuit needed? What is this circuit doing in cycles in which its input is not needed?<br />
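For instruction-mix questions like these, the fraction of cycles in which a unit is used (with CPI = 1 and no stalls) is just the combined frequency of the instructions that use it. A small sketch using the mix above, with percentages kept as integers to avoid floating-point noise:<br />

```python
# Instruction mix from the table above, in percent.
mix = {"add": 20, "addi": 20, "not": 0, "beq": 25, "lw": 25, "sw": 10}

def usage_percent(users):
    # With CPI = 1 and no stalls, each instruction occupies exactly one
    # cycle, so a unit's busy fraction is the sum of the frequencies of
    # the instructions that use it.
    return sum(mix[i] for i in users)

# Data memory is accessed only by loads and stores:
print(usage_percent(["lw", "sw"]))  # 35
```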
4.6 When silicon chips are fabricated, defects in materials (e.g., silicon) <strong>and</strong><br />
manufacturing errors can result in defective circuits. A very common defect is for<br />
one wire to affect the signal in another. This is called a cross-talk fault. A special<br />
class of cross-talk faults is when a signal is connected to a wire that has a constant<br />
logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a stuck-at-1<br />
fault, <strong>and</strong> the affected signal always has a logical value of 0 or 1, respectively.<br />
The following problems refer to bit 0 of the Write Register input on the register file<br />
in Figure 4.24.<br />
4.6.1 [10] Let us assume that processor testing is done by filling the<br />
PC, registers, <strong>and</strong> data <strong>and</strong> instruction memories with some values (you can choose<br />
which values), letting a single instruction execute, then reading the PC, memories,<br />
<strong>and</strong> registers. These values are then examined to determine if a particular fault is<br />
present. Can you design a test (values for PC, memories, <strong>and</strong> registers) that would<br />
determine if there is a stuck-at-0 fault on this signal?<br />
4.6.2 [10] Repeat 4.6.1 for a stuck-at-1 fault. Can you use a single<br />
test for both stuck-at-0 <strong>and</strong> stuck-at-1? If yes, explain how; if no, explain why not.<br />
4.6.3 [60] If we know that the processor has a stuck-at-1 fault on<br />
this signal, is the processor still usable? To be usable, we must be able to convert<br />
any program that executes on a normal MIPS processor into a program that works<br />
on this processor. You can assume that there is enough free instruction memory<br />
<strong>and</strong> data memory to let you make the program longer <strong>and</strong> store additional<br />
data. Hint: the processor is usable if every instruction “broken” by this fault can<br />
be replaced with a sequence of “working” instructions that achieve the same<br />
effect.<br />
4.6.4 [10] Repeat 4.6.1, but now the fault to test for is whether<br />
the “MemRead” control signal becomes 0 if RegDst control signal is 0, no fault<br />
otherwise.<br />
4.6.5 [10] Repeat 4.6.4, but now the fault to test for is whether the<br />
“Jump” control signal becomes 0 if RegDst control signal is 0, no fault otherwise.
360 Chapter 4 The Processor<br />
4.7 In this exercise we examine in detail how an instruction is executed in a<br />
single-cycle datapath. Problems in this exercise refer to a clock cycle in which the<br />
processor fetches the following instruction word:<br />
10101100011000100000000000010100.<br />
Assume that data memory is all zeros <strong>and</strong> that the processor’s registers have the<br />
following values at the beginning of the cycle in which the above instruction word<br />
is fetched:<br />
r0 r1 r2 r3 r4 r5 r6 r8 r12 r31<br />
0 –1 2 –3 –4 10 6 8 2 –16<br />
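To work with this instruction word it helps to split it into its fields; MIPS I-format words carry a 6-bit opcode, 5-bit rs and rt fields, and a 16-bit immediate. A quick decoding sketch:<br />

```python
word = 0b10101100011000100000000000010100

opcode = word >> 26          # top 6 bits
rs = (word >> 21) & 0x1F     # bits 25..21
rt = (word >> 16) & 0x1F     # bits 20..16
imm = word & 0xFFFF          # low 16 bits
if imm & 0x8000:             # sign-extend the immediate
    imm -= 1 << 16

# opcode 43 (0b101011) is sw, so this word is sw r2, 20(r3)
print(opcode, rs, rt, imm)  # 43 3 2 20
```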
4.7.1 [5] What are the outputs of the sign-extend <strong>and</strong> the jump “Shift left<br />
2” unit (near the top of Figure 4.24) for this instruction word?<br />
4.7.2 [10] What are the values of the ALU control unit’s inputs for this<br />
instruction?<br />
4.7.3 [10] What is the new PC address after this instruction is executed?<br />
Highlight the path through which this value is determined.<br />
4.7.4 [10] For each Mux, show the values of its data output during the<br />
execution of this instruction <strong>and</strong> these register values.<br />
4.7.5 [10] For the ALU <strong>and</strong> the two add units, what are their data input<br />
values?<br />
4.7.6 [10] What are the values of all inputs for the “Registers” unit?<br />
4.8 In this exercise, we examine how pipelining affects the clock cycle time of the<br />
processor. Problems in this exercise assume that individual stages of the datapath<br />
have the following latencies:<br />
IF ID EX MEM WB<br />
250ps 350ps 150ps 300ps 200ps<br />
Also, assume that instructions executed by the processor are broken down as<br />
follows:<br />
alu beq lw sw<br />
45% 20% 20% 15%<br />
4.8.1 [5] What is the clock cycle time in a pipelined <strong>and</strong> non-pipelined<br />
processor?<br />
4.8.2 [10] What is the total latency of an LW instruction in a pipelined<br />
<strong>and</strong> non-pipelined processor?
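The arithmetic here follows directly from the stage latencies: a pipelined clock must fit the slowest stage, a single-cycle clock must fit all stages, and an instruction's total latency is the number of cycles it occupies times the cycle time. Sketch:<br />

```python
stages = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}  # ps

# Pipelined: every stage gets one clock, so the slowest stage sets the
# cycle time. Non-pipelined (single-cycle): one clock must cover all
# stages back to back.
cycle_pipelined = max(stages.values())   # 350
cycle_single = sum(stages.values())      # 1250

# An lw passes through all five stages, one cycle each, in the pipeline:
lw_latency_pipelined = len(stages) * cycle_pipelined  # 1750
print(cycle_pipelined, cycle_single, lw_latency_pipelined)
```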
4.10.6 [10] Assuming stall-on-branch <strong>and</strong> no delay slots, what is the new<br />
clock cycle time <strong>and</strong> execution time of this instruction sequence if beq address<br />
computation is moved to the MEM stage? What is the speedup from this change?<br />
Assume that the latency of the EX stage is reduced by 20 ps <strong>and</strong> the latency of the<br />
MEM stage is unchanged when branch outcome resolution is moved from EX to<br />
MEM.<br />
4.11 Consider the following loop.<br />
loop:lw r1,0(r1)<br />
and r1,r1,r2<br />
lw r1,0(r1)<br />
lw r1,0(r1)<br />
beq r1,r0,loop<br />
Assume that perfect branch prediction is used (no stalls due to control hazards),<br />
that there are no delay slots, <strong>and</strong> that the pipeline has full forwarding support. Also<br />
assume that many iterations of this loop are executed before the loop exits.<br />
4.11.1 [10] Show a pipeline execution diagram for the third iteration of<br />
this loop, from the cycle in which we fetch the first instruction of that iteration up<br />
to (but not including) the cycle in which we can fetch the first instruction of the<br />
next iteration. Show all instructions that are in the pipeline during these cycles (not<br />
just those from the third iteration).<br />
4.11.2 [10] How often (as a percentage of all cycles) do we have a cycle in<br />
which all five pipeline stages are doing useful work?<br />
4.12 This exercise is intended to help you understand the cost/complexity/<br />
performance trade-offs of forwarding in a pipelined processor. Problems in this<br />
exercise refer to pipelined datapaths from Figure 4.45. These problems assume<br />
that, of all the instructions executed in a processor, the following fraction of these<br />
instructions have a particular type of RAW data dependence. The type of RAW<br />
data dependence is identified by the stage that produces the result (EX or MEM)<br />
<strong>and</strong> the instruction that consumes the result (1st instruction that follows the one<br />
that produces the result, 2nd instruction that follows, or both). We assume that the<br />
register write is done in the first half of the clock cycle <strong>and</strong> that register reads are<br />
done in the second half of the cycle, so “EX to 3rd” <strong>and</strong> “MEM to 3rd” dependences<br />
are not counted because they cannot result in data hazards. Also, assume that the<br />
CPI of the processor is 1 if there are no data hazards.<br />
EX to 1st Only | MEM to 1st Only | EX to 2nd Only | MEM to 2nd Only | EX to 1st and MEM to 2nd | Other RAW Dependences<br />
5% | 20% | 5% | 10% | 10% | 10%<br />
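One common way to attack problems like these: without forwarding, a RAW dependence to the very next instruction costs two stall cycles and to the second-next costs one (given write-in-first-half, read-in-second-half register files), so the expected CPI is the base CPI plus the frequency-weighted stalls. A sketch under those assumed stall counts:<br />

```python
# Dependence fractions from the table above.
frac = {"ex_to_1st": 0.05, "mem_to_1st": 0.20, "ex_to_2nd": 0.05,
        "mem_to_2nd": 0.10, "ex_to_1st_and_mem_to_2nd": 0.10,
        "other": 0.10}

# Assumed stall counts for a no-forwarding pipeline: the producer always
# delivers its result in WB, so a dependence to the 1st following
# instruction stalls 2 cycles, to the 2nd stalls 1, and "other"
# dependences (3rd and beyond) stall 0.
stalls = {"ex_to_1st": 2, "mem_to_1st": 2, "ex_to_2nd": 1,
          "mem_to_2nd": 1, "ex_to_1st_and_mem_to_2nd": 2, "other": 0}

cpi = 1 + sum(frac[k] * stalls[k] for k in frac)
print(round(cpi, 2))  # 1.85
```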
4.13.2 [10] Repeat 4.13.1 but now use nops only when a hazard cannot be<br />
avoided by changing or rearranging these instructions. You can assume register R7<br />
can be used to hold temporary values in your modified code.<br />
4.13.3 [10] If the processor has forwarding, but we forgot to implement<br />
the hazard detection unit, what happens when this code executes?<br />
4.13.4 [20] If there is forwarding, for the first five cycles during the<br />
execution of this code, specify which signals are asserted in each cycle by hazard<br />
detection <strong>and</strong> forwarding units in Figure 4.60.<br />
4.13.5 [10] If there is no forwarding, what new inputs <strong>and</strong> output signals<br />
do we need for the hazard detection unit in Figure 4.60? Using this instruction<br />
sequence as an example, explain why each signal is needed.<br />
4.13.6 [20] For the new hazard detection unit from 4.13.5, specify which<br />
output signals it asserts in each of the first five cycles during the execution of this<br />
code.<br />
4.14 This exercise is intended to help you understand the relationship between<br />
delay slots, control hazards, <strong>and</strong> branch execution in a pipelined processor. In<br />
this exercise, we assume that the following MIPS code is executed on a pipelined<br />
processor with a 5-stage pipeline, full forwarding, <strong>and</strong> a predict-taken branch<br />
predictor:<br />
lw r2,0(r1)<br />
label1: beq r2,r0,label2 # not taken once, then taken<br />
lw r3,0(r2)<br />
beq r3,r0,label1 # taken<br />
add r1,r3,r1<br />
label2: sw r1,0(r2)<br />
4.14.1 [10] Draw the pipeline execution diagram for this code, assuming<br />
there are no delay slots <strong>and</strong> that branches execute in the EX stage.<br />
4.14.2 [10] Repeat 4.14.1, but assume that delay slots are used. In the<br />
given code, the instruction that follows the branch is now the delay slot instruction<br />
for that branch.<br />
4.14.3 [20] One way to move the branch resolution one stage earlier is to<br />
not need an ALU operation in conditional branches. The branch instructions would<br />
be “bez rd,label” <strong>and</strong> “bnez rd,label”, <strong>and</strong> it would branch if the register has<br />
<strong>and</strong> does not have a zero value, respectively. Change this code to use these branch<br />
instructions instead of beq. You can assume that register R8 is available for you<br />
to use as a temporary register, <strong>and</strong> that an seq (set if equal) R-type instruction can<br />
be used.
Section 4.8 describes how the severity of control hazards can be reduced by moving<br />
branch execution into the ID stage. This approach involves a dedicated comparator<br />
in the ID stage, as shown in Figure 4.62. However, this approach potentially adds<br />
to the latency of the ID stage, <strong>and</strong> requires additional forwarding logic <strong>and</strong> hazard<br />
detection.<br />
4.14.4 [10] Using the first branch instruction in the given code as an<br />
example, describe the hazard detection logic needed to support branch execution<br />
in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed<br />
to detect?<br />
4.14.5 [10] For the given code, what is the speedup achieved by moving<br />
branch execution into the ID stage? Explain your answer. In your speedup<br />
calculation, assume that the additional comparison in the ID stage does not affect<br />
clock cycle time.<br />
4.14.6 [10] Using the first branch instruction in the given code as an<br />
example, describe the forwarding support that must be added to support branch<br />
execution in the ID stage. Compare the complexity of this new forwarding unit to<br />
the complexity of the existing forwarding unit in Figure 4.62.<br />
4.15 The importance of having a good branch predictor depends on how often<br />
conditional branches are executed. Together with branch predictor accuracy, this<br />
will determine how much time is spent stalling due to mispredicted branches. In<br />
this exercise, assume that the breakdown of dynamic instructions into various<br />
instruction categories is as follows:<br />
R-Type BEQ JMP LW SW<br />
40% 25% 5% 25% 5%<br />
Also, assume the following branch predictor accuracies:<br />
Always-Taken Always-Not-Taken 2-Bit<br />
45% 55% 85%<br />
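The extra CPI from mispredictions is the branch frequency times the misprediction rate times the flush penalty. The penalty below is an assumption (two flushed instructions for branches resolved in EX); the sketch just shows the shape of the calculation:<br />

```python
branch_freq = 0.25   # BEQ fraction from the instruction-mix table above
accuracy = {"always_taken": 0.45, "always_not_taken": 0.55, "two_bit": 0.85}
MISPREDICT_PENALTY = 2  # assumed: branch resolved in EX flushes 2 instructions

def extra_cpi(predictor):
    # Only mispredicted branches pay the penalty.
    return branch_freq * (1 - accuracy[predictor]) * MISPREDICT_PENALTY

for p in accuracy:
    print(p, round(extra_cpi(p), 4))
```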
4.15.1 [10] Stall cycles due to mispredicted branches increase the<br />
CPI. What is the extra CPI due to mispredicted branches with the always-taken<br />
predictor? Assume that branch outcomes are determined in the EX stage, that there<br />
are no data hazards, <strong>and</strong> that no delay slots are used.<br />
4.15.2 [10] Repeat 4.15.1 for the “always-not-taken” predictor.<br />
4.15.3 [10] Repeat 4.15.1 for the 2-bit predictor.<br />
4.15.4 [10] With the 2-bit predictor, what speedup would be achieved if<br />
we could convert half of the branch instructions in a way that replaces a branch<br />
instruction with an ALU instruction? Assume that correctly <strong>and</strong> incorrectly<br />
predicted instructions have the same chance of being replaced.
4.15.5 [10] With the 2-bit predictor, what speedup would be achieved if<br />
we could convert half of the branch instructions in a way that replaced each branch<br />
instruction with two ALU instructions? Assume that correctly <strong>and</strong> incorrectly<br />
predicted instructions have the same chance of being replaced.<br />
4.15.6 [10] Some branch instructions are much more predictable than<br />
others. If we know that 80% of all executed branch instructions are easy-to-predict<br />
loop-back branches that are always predicted correctly, what is the accuracy of the<br />
2-bit predictor on the remaining 20% of the branch instructions?<br />
4.16 This exercise examines the accuracy of various branch predictors for the<br />
following repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT<br />
4.16.1 [5] What is the accuracy of always-taken <strong>and</strong> always-not-taken<br />
predictors for this sequence of branch outcomes?<br />
4.16.2 [5] What is the accuracy of the two-bit predictor for the first 4<br />
branches in this pattern, assuming that the predictor starts off in the bottom left<br />
state from Figure 4.63 (predict not taken)?<br />
4.16.3 [10] What is the accuracy of the two-bit predictor if this pattern is<br />
repeated forever?<br />
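A 2-bit saturating-counter predictor is easy to simulate directly, which is a handy way to check your work on the two problems above. The state encoding below (states 0-1 predict not taken, 2-3 predict taken, starting at 0 for the predict-not-taken corner) is an assumption about how Figure 4.63 is drawn:<br />

```python
def two_bit_predictor(outcomes, start_state=0):
    """Simulate a 2-bit saturating counter; return prediction accuracy.
    States 0,1 predict not taken; states 2,3 predict taken."""
    state = start_state
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Saturating update: move toward taken (3) or not-taken (0).
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

pattern = [True, False, True, True, False]  # T, NT, T, T, NT
acc = two_bit_predictor(pattern * 1000)
print(acc)
```

After a short warm-up the counter visits the same states every iteration, so the long-run accuracy settles at 3 correct predictions out of 5 per repetition.<br />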
4.16.4 [30] <strong>Design</strong> a predictor that would achieve a perfect accuracy if<br />
this pattern is repeated forever. Your predictor should be a sequential circuit with<br />
one output that provides a prediction (1 for taken, 0 for not taken) <strong>and</strong> no inputs<br />
other than the clock <strong>and</strong> the control signal that indicates that the instruction is a<br />
conditional branch.<br />
4.16.5 [10] What is the accuracy of your predictor from 4.16.4 if it is<br />
given a repeating pattern that is the exact opposite of this one?<br />
4.16.6 [20] Repeat 4.16.4, but now your predictor should be able to<br />
eventually (after a warm-up period during which it can make wrong predictions)<br />
start perfectly predicting both this pattern <strong>and</strong> its opposite. Your predictor should<br />
have an input that tells it what the real outcome was. Hint: this input lets your<br />
predictor determine which of the two repeating patterns it is given.<br />
4.17 This exercise explores how exception handling affects pipeline design. The<br />
first three problems in this exercise refer to the following two instructions:<br />
Instruction 1: BNE R1, R2, Label<br />
Instruction 2: LW R1, 0(R1)<br />
4.17.1 [5] Which exceptions can each of these instructions trigger? For<br />
each of these exceptions, specify the pipeline stage in which it is detected.
4.17.2 [10] If there is a separate handler address for each exception, show<br />
how the pipeline organization must be changed to be able to handle this exception.<br />
You can assume that the addresses of these handlers are known when the processor<br />
is designed.<br />
4.17.3 [10] If the second instruction is fetched right after the first<br />
instruction, describe what happens in the pipeline when the first instruction causes<br />
the first exception you listed in 4.17.1. Show the pipeline execution diagram from<br />
the time the first instruction is fetched until the time the first instruction of the<br />
exception handler is completed.<br />
4.17.4 [20] In vectored exception handling, the table of exception handler<br />
addresses is in data memory at a known (fixed) address. Change the pipeline to<br />
implement this exception handling mechanism. Repeat 4.17.3 using this modified<br />
pipeline <strong>and</strong> vectored exception handling.<br />
4.17.5 [15] We want to emulate vectored exception handling (described<br />
in 4.17.4) on a machine that has only one fixed handler address. Write the code<br />
that should be at that fixed address. Hint: this code should identify the exception,<br />
get the right address from the exception vector table, <strong>and</strong> transfer execution to that<br />
handler.<br />
4.18 In this exercise we compare the performance of 1-issue <strong>and</strong> 2-issue<br />
processors, taking into account program transformations that can be made to<br />
optimize for 2-issue execution. Problems in this exercise refer to the following loop<br />
(written in C):<br />
for(i=0;i!=j;i+=2)<br />
b[i]=a[i]–a[i+1];<br />
When writing MIPS code, assume that variables are kept in registers as follows, <strong>and</strong><br />
that all registers except those indicated as Free are used to keep various variables,<br />
so they cannot be used for anything else.<br />
i j a b c Free<br />
R5 R6 R1 R2 R3 R10, R11, R12<br />
4.18.1 [10] Translate this C code into MIPS instructions. Your translation<br />
should be direct, without rearranging instructions to achieve better performance.<br />
4.18.2 [10] If the loop exits after executing only two iterations, draw a<br />
pipeline diagram for your MIPS code from 4.18.1 executed on a 2-issue processor<br />
shown in Figure 4.69. Assume the processor has perfect branch prediction <strong>and</strong> can<br />
fetch any two instructions (not just consecutive instructions) in the same cycle.<br />
4.18.3 [10] Rearrange your code from 4.18.1 to achieve better<br />
performance on a 2-issue statically scheduled processor from Figure 4.69.
4.18.4 [10] Repeat 4.18.2, but this time use your MIPS code from 4.18.3.<br />
4.18.5 [10] What is the speedup of going from a 1-issue processor to<br />
a 2-issue processor from Figure 4.69? Use your code from 4.18.1 for both 1-issue<br />
<strong>and</strong> 2-issue, <strong>and</strong> assume that 1,000,000 iterations of the loop are executed. As in<br />
4.18.2, assume that the processor has perfect branch predictions, <strong>and</strong> that a 2-issue<br />
processor can fetch any two instructions in the same cycle.<br />
4.18.6 [10] Repeat 4.18.5, but this time assume that in the 2-issue<br />
processor one of the instructions to be executed in a cycle can be of any kind, <strong>and</strong><br />
the other must be a non-memory instruction.<br />
4.19 This exercise explores energy efficiency <strong>and</strong> its relationship with performance.<br />
Problems in this exercise assume the following energy consumption for activity in<br />
Instruction memory, Registers, <strong>and</strong> Data memory. You can assume that the other<br />
components of the datapath spend a negligible amount of energy.<br />
I-Mem 1 Register Read Register Write D-Mem Read D-Mem Write<br />
140pJ 70pJ 60pJ 140pJ 120pJ<br />
Assume that components in the datapath have the following latencies. You can<br />
assume that the other components of the datapath have negligible latencies.<br />
I-Mem Control Register Read or Write ALU D-Mem Read or Write<br />
200ps 150ps 90ps 90ps 250ps<br />
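Per-instruction energy is the sum of the energies of the accesses that instruction performs. The accounting below for an ADD (one instruction fetch, two register reads, one register write, no data-memory access) is an assumption made for illustration:<br />

```python
# Energy per access from the table above, in picojoules.
energy = {"imem": 140, "reg_read": 70, "reg_write": 60,
          "dmem_read": 140, "dmem_write": 120}

# Assumed accounting for one ADD instruction: fetch it, read two source
# registers, write one destination register, touch no data memory.
add_energy = energy["imem"] + 2 * energy["reg_read"] + energy["reg_write"]
print(add_energy)  # 340
```

The same bookkeeping extends to other instructions by listing which memories and register ports they actually exercise.<br />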
4.19.1 [10] How much energy is spent to execute an ADD<br />
instruction in a single-cycle design <strong>and</strong> in the 5-stage pipelined design?<br />
4.19.2 [10] What is the worst-case MIPS instruction in terms of<br />
energy consumption, <strong>and</strong> what is the energy spent to execute it?<br />
4.19.3 [10] If energy reduction is paramount, how would you<br />
change the pipelined design? What is the percentage reduction in the energy spent<br />
by an LW instruction after this change?<br />
4.19.4 [10] What is the performance impact of your changes from<br />
4.19.3?<br />
4.19.5 [10] We can eliminate the MemRead control signal <strong>and</strong> have<br />
the data memory be read in every cycle, i.e., we can permanently have MemRead=1.<br />
Explain why the processor still functions correctly after this change. What is the<br />
effect of this change on clock frequency <strong>and</strong> energy consumption?<br />
4.19.6 [10] If an idle unit spends 10% of the power it would spend<br />
if it were active, what is the energy spent by the instruction memory in each cycle?<br />
What percentage of the overall energy spent by the instruction memory does this<br />
idle energy represent?
Answers to<br />
Check Yourself<br />
§4.1, page 248: 3 of 5: Control, Datapath, Memory. Input <strong>and</strong> Output are missing.<br />
§4.2, page 251: false. Edge-triggered state elements make simultaneous reading <strong>and</strong><br />
writing both possible <strong>and</strong> unambiguous.<br />
§4.3, page 257: I. a. II. c.<br />
§4.4, page 272: Yes, Branch <strong>and</strong> ALUOp0 are identical. In addition, MemtoReg <strong>and</strong><br />
RegDst are inverses of one another. You don’t need an inverter; simply use the other<br />
signal <strong>and</strong> flip the order of the inputs to the multiplexor!<br />
§4.5, page 285: 1. Stall on the lw result. 2. Bypass the first add result written into<br />
$t1. 3. No stall or bypass required.<br />
§4.6, page 298: Statements 2 <strong>and</strong> 4 are correct; the rest are incorrect.<br />
§4.8, page 324: 1. Predict not taken. 2. Predict taken. 3. Dynamic prediction.<br />
§4.9, page 332: The first instruction, since it is logically executed before the others.<br />
§4.10, page 344: 1. Both. 2. Both. 3. Software. 4. Hardware. 5. Hardware. 6.<br />
Hardware. 7. Both. 8. Hardware. 9. Both.<br />
§4.11, page 353: First two are false <strong>and</strong> the last two are true.
5<br />
Ideally one would desire an<br />
indefinitely large memory<br />
capacity such that any<br />
particular … word would be<br />
immediately available. … We<br />
are … forced to recognize the<br />
possibility of constructing a<br />
hierarchy of memories, each<br />
of which has greater capacity<br />
than the preceding but which<br />
is less quickly accessible.<br />
A. W. Burks, H. H. Goldstine, <strong>and</strong><br />
J. von Neumann<br />
Preliminary Discussion of the Logical <strong>Design</strong> of an<br />
Electronic Computing Instrument, 1946<br />
Large <strong>and</strong> Fast:<br />
Exploiting Memory<br />
Hierarchy<br />
5.1 Introduction 374<br />
5.2 Memory Technologies 378<br />
5.3 The Basics of Caches 383<br />
5.4 Measuring <strong>and</strong> Improving Cache<br />
Performance 398<br />
5.5 Dependable Memory Hierarchy 418<br />
5.6 Virtual Machines 424<br />
5.7 Virtual Memory 427<br />
<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />
© 2013 Elsevier Inc. All rights reserved.
[Figure art: the hierarchy from the processor down — the fastest, smallest, highest-cost-per-bit memory (SRAM) at the top, DRAM in the middle, and the slowest, biggest, lowest-cost-per-bit memory (magnetic disk) at the bottom.]<br />
FIGURE 5.1 The basic structure of a memory hierarchy. By implementing the memory system as<br />
a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can<br />
be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many personal<br />
mobile devices, <strong>and</strong> may lead to a new level in the storage hierarchy for desktop <strong>and</strong> server computers; see<br />
Section 5.2.<br />
Just as accesses to books on the desk naturally exhibit locality, locality in<br />
programs arises from simple <strong>and</strong> natural program structures. For example,<br />
most programs contain loops, so instructions <strong>and</strong> data are likely to be accessed<br />
repeatedly, showing high amounts of temporal locality. Since instructions are<br />
normally accessed sequentially, programs also show high spatial locality. Accesses<br />
to data also exhibit a natural spatial locality. For example, sequential accesses to<br />
elements of an array or a record will naturally have high degrees of spatial locality.<br />
We take advantage of the principle of locality by implementing the memory<br />
of a computer as a memory hierarchy. A memory hierarchy consists of multiple<br />
levels of memory with different speeds <strong>and</strong> sizes. The faster memories are more<br />
expensive per bit than the slower memories <strong>and</strong> thus are smaller.<br />
Figure 5.1 shows the faster memory is close to the processor <strong>and</strong> the slower,<br />
less expensive memory is below it. The goal is to present the user with as much<br />
memory as is available in the cheapest technology, while providing access at the<br />
speed offered by the fastest memory.<br />
The data is similarly hierarchical: a level closer to the processor is generally a<br />
subset of any level further away, <strong>and</strong> all the data is stored at the lowest level. By<br />
analogy, the books on your desk form a subset of the library you are working in,<br />
which is in turn a subset of all the libraries on campus. Furthermore, as we move<br />
away from the processor, the levels take progressively longer to access, just as we<br />
might encounter in a hierarchy of campus libraries.<br />
A memory hierarchy can consist of multiple levels, but data is copied between<br />
only two adjacent levels at a time, so we can focus our attention on just two levels.<br />
memory hierarchy<br />
A structure that uses<br />
multiple levels of<br />
memories; as the distance<br />
from the processor<br />
increases, the size of the<br />
memories <strong>and</strong> the access<br />
time both increase.
block (or line) The<br />
minimum unit of<br />
information that can<br />
be either present or not<br />
present in a cache.<br />
hit rate The fraction of<br />
memory accesses found<br />
in a level of the memory<br />
hierarchy.<br />
miss rate The fraction<br />
of memory accesses not<br />
found in a level of the<br />
memory hierarchy.<br />
hit time The time<br />
required to access a level<br />
of the memory hierarchy,<br />
including the time needed<br />
to determine whether the<br />
access is a hit or a miss.<br />
miss penalty The time<br />
required to fetch a block<br />
into a level of the memory<br />
hierarchy from the lower<br />
level, including the time<br />
to access the block,<br />
transmit it from one level<br />
to the other, insert it in<br />
the level that experienced<br />
the miss, <strong>and</strong> then pass<br />
the block to the requestor.<br />
FIGURE 5.2 Every pair of levels in the memory hierarchy can be thought of as having an<br />
upper <strong>and</strong> lower level. Within each level, the unit of information that is present or not is called a block or<br />
a line. Usually we transfer an entire block when we copy something between levels.<br />
The upper level—the one closer to the processor—is smaller <strong>and</strong> faster than the lower<br />
level, since the upper level uses technology that is more expensive. Figure 5.2 shows<br />
that the minimum unit of information that can be either present or not present in<br />
the two-level hierarchy is called a block or a line; in our library analogy, a block of<br />
information is one book.<br />
If the data requested by the processor appears in some block in the upper level,<br />
this is called a hit (analogous to your finding the information in one of the books<br />
on your desk). If the data is not found in the upper level, the request is called a miss.<br />
The lower level in the hierarchy is then accessed to retrieve the block containing the<br />
requested data. (Continuing our analogy, you go from your desk to the shelves to<br />
find the desired book.) The hit rate, or hit ratio, is the fraction of memory accesses<br />
found in the upper level; it is often used as a measure of the performance of the<br />
memory hierarchy. The miss rate (1−hit rate) is the fraction of memory accesses<br />
not found in the upper level.<br />
Since performance is the major reason for having a memory hierarchy, the time<br />
to service hits <strong>and</strong> misses is important. Hit time is the time to access the upper level<br />
of the memory hierarchy, which includes the time needed to determine whether<br />
the access is a hit or a miss (that is, the time needed to look through the books on<br />
the desk). The miss penalty is the time to replace a block in the upper level with<br />
the corresponding block from the lower level, plus the time to deliver this block to<br />
the processor (or the time to get another book from the shelves <strong>and</strong> place it on the<br />
desk). Because the upper level is smaller <strong>and</strong> built using faster memory parts, the<br />
hit time will be much smaller than the time to access the next level in the hierarchy,<br />
which is the major component of the miss penalty. (The time to examine the books<br />
on the desk is much smaller than the time to get up <strong>and</strong> get a new book from the<br />
shelves.)
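The relationship described above is usually summarized as the average memory access time (AMAT): every access pays the hit time, and the fraction that misses additionally pays the miss penalty. A sketch with made-up numbers (the 1 ns / 5% / 100 ns values are illustrative assumptions, not from the text):<br />

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one pair of hierarchy levels:
    every access pays the hit time; misses additionally pay the penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative: a 1 ns upper level with a 5% miss rate backed by a
# 100 ns lower level.
print(amat(1.0, 0.05, 100.0))  # 6.0
```

Note how strongly the miss rate is amplified by a large miss penalty: even a 95% hit rate leaves the average six times slower than the hit time alone.<br />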
As we will see in this chapter, the concepts used to build memory systems affect<br />
many other aspects of a computer, including how the operating system manages<br />
memory <strong>and</strong> I/O, how compilers generate code, <strong>and</strong> even how applications use<br />
the computer. Of course, because all programs spend much of their time accessing<br />
memory, the memory system is necessarily a major factor in determining<br />
performance. The reliance on memory hierarchies to achieve performance<br />
has meant that programmers, who used to be able to think of memory as a flat,<br />
random access storage device, now need to understand that memory is a hierarchy<br />
to get good performance. We show how important this understanding is in later<br />
examples, such as Figure 5.18 on page 408, <strong>and</strong> Section 5.14, which shows how to<br />
double matrix multiply performance.<br />
Since memory systems are critical to performance, computer designers devote a<br />
great deal of attention to these systems <strong>and</strong> develop sophisticated mechanisms for<br />
improving the performance of the memory system. In this chapter, we discuss the<br />
major conceptual ideas, although we use many simplifications <strong>and</strong> abstractions to<br />
keep the material manageable in length <strong>and</strong> complexity.<br />
Programs exhibit both temporal locality, the tendency to reuse recently<br />
accessed data items, <strong>and</strong> spatial locality, the tendency to reference data<br />
items that are close to other recently accessed items. Memory hierarchies<br />
take advantage of temporal locality by keeping more recently accessed<br />
data items closer to the processor. Memory hierarchies take advantage of<br />
spatial locality by moving blocks consisting of multiple contiguous words<br />
in memory to upper levels of the hierarchy.<br />
Figure 5.3 shows that a memory hierarchy uses smaller and faster<br />
memory technologies close to the processor. Thus, accesses that hit in the<br />
highest level of the hierarchy can be processed quickly. Accesses that miss<br />
go to lower levels of the hierarchy, which are larger but slower. If the hit<br />
rate is high enough, the memory hierarchy has an effective access time<br />
close to that of the highest (and fastest) level and a size equal to that of the<br />
lowest (and largest) level.<br />
In most systems, the memory is a true hierarchy, meaning that data<br />
cannot be present in level i unless it is also present in level i + 1.<br />
The BIG<br />
Picture<br />
Which of the following statements are generally true?<br />
1. Memory hierarchies take advantage of temporal locality.<br />
2. On a read, the value returned depends on which blocks are in the cache.<br />
3. Most of the cost of the memory hierarchy is at the highest level.<br />
4. Most of the capacity of the memory hierarchy is at the lowest level.<br />
Check<br />
Yourself
5.2 Memory Technologies 379<br />
SRAM Technology<br />
SRAMs are simply integrated circuits that are memory arrays with (usually) a<br />
single access port that can provide either a read or a write. SRAMs have a fixed<br />
access time to any datum, though the read and write access times may differ.<br />
SRAMs don’t need to refresh and so the access time is very close to the cycle<br />
time. SRAMs typically use six to eight transistors per bit to prevent the information<br />
from being disturbed when read. SRAM needs only minimal power to retain the<br />
charge in standby mode.<br />
In the past, most PCs and server systems used separate SRAM chips for either<br />
their primary, secondary, or even tertiary caches. Today, thanks to Moore’s Law, all<br />
levels of caches are integrated onto the processor chip, so the market for separate<br />
SRAM chips has nearly evaporated.<br />
DRAM Technology<br />
In a SRAM, as long as power is applied, the value can be kept indefinitely. In a<br />
dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor.<br />
A single transistor is then used to access this stored charge, either to read the<br />
value or to overwrite the charge stored there. Because DRAMs use only a single<br />
transistor per bit of storage, they are much denser and cheaper per bit than SRAM.<br />
As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must<br />
periodically be refreshed. That is why this memory structure is called dynamic, as<br />
opposed to the static storage in an SRAM cell.<br />
To refresh the cell, we merely read its contents and write it back. The charge<br />
can be kept for several milliseconds. If every bit had to be read out of the DRAM<br />
and then written back individually, we would constantly be refreshing the DRAM,<br />
leaving no time for accessing it. Fortunately, DRAMs use a two-level decoding<br />
structure, and this allows us to refresh an entire row (which shares a word line)<br />
with a read cycle followed immediately by a write cycle.<br />
Figure 5.4 shows the internal organization of a DRAM, and Figure 5.5 shows<br />
how the density, cost, and access time of DRAMs have changed over the years.<br />
The row organization that helps with refresh also helps with performance. To<br />
improve performance, DRAMs buffer rows for repeated access. The buffer acts<br />
like an SRAM; by changing the address, random bits can be accessed in the buffer<br />
until the next row access. This capability improves the access time significantly,<br />
since the access time to bits in the row is much lower. Making the chip wider also<br />
improves the memory bandwidth of the chip. When the row is in the buffer, it<br />
can be transferred by successive addresses at whatever the width of the DRAM is<br />
(typically 4, 8, or 16 bits), or by specifying a block transfer and the starting address<br />
within the buffer.<br />
To further improve the interface to processors, DRAMs added clocks and are<br />
properly called Synchronous DRAMs or SDRAMs. The advantage of SDRAMs<br />
is that the use of a clock eliminates the time for the memory and processor to<br />
synchronize. The speed advantage of synchronous DRAMs comes from the ability<br />
to transfer the bits in the burst without having to specify additional address bits.
write from multiple banks, with each having its own row buffer. Sending an address<br />
to several banks permits them all to read or write simultaneously. For example,<br />
with four banks, there is just one access time and then accesses rotate between<br />
the four banks to supply four times the bandwidth. This rotating access scheme is<br />
called address interleaving.<br />
Although Personal Mobile Devices like the iPad (see Chapter 1) use individual<br />
DRAMs, memory for servers is commonly sold on small boards called dual inline<br />
memory modules (DIMMs). DIMMs typically contain 4–16 DRAMs, and they are<br />
normally organized to be 8 bytes wide for server systems. A DIMM using DDR4-<br />
3200 SDRAMs could transfer at 8 × 3200 = 25,600 megabytes per second. Such<br />
DIMMs are named after their bandwidth: PC25600. Since a DIMM can have so<br />
many DRAM chips that only a portion of them are used for a particular transfer, we<br />
need a term to refer to the subset of chips in a DIMM that share common address<br />
lines. To avoid confusion with the internal DRAM names of row and banks, we use<br />
the term memory rank for such a subset of chips in a DIMM.<br />
Elaboration: One way to measure the performance of the memory system behind the<br />
caches is the Stream benchmark [McCalpin, 1995]. It measures the performance of<br />
long vector operations. They have no temporal locality and they access arrays that are<br />
larger than the cache of the computer being tested.<br />
Flash Memory<br />
Flash memory is a type of electrically erasable programmable read-only memory<br />
(EEPROM).<br />
Unlike disks and DRAM, but like other EEPROM technologies, writes can wear out<br />
flash memory bits. To cope with such limits, most flash products include a controller<br />
to spread the writes by remapping blocks that have been written many times to less<br />
trodden blocks. This technique is called wear leveling. With wear leveling, personal<br />
mobile devices are very unlikely to exceed the write limits in the flash. Such wear<br />
leveling lowers the potential performance of flash, but it is needed unless higher-level<br />
software monitors block wear. Flash controllers that perform wear leveling can<br />
also improve yield by mapping out memory cells that were manufactured incorrectly.<br />
Disk Memory<br />
As Figure 5.6 shows, a magnetic hard disk consists of a collection of platters, which<br />
rotate on a spindle at 5400 to 15,000 revolutions per minute. The metal platters are<br />
covered with magnetic recording material on both sides, similar to the material found<br />
on a cassette or videotape. To read and write information on a hard disk, a movable arm<br />
containing a small electromagnetic coil called a read-write head is located just above<br />
each surface. The entire drive is permanently sealed to control the environment inside<br />
the drive, which, in turn, allows the disk heads to be much closer to the drive surface.<br />
Each disk surface is divided into concentric circles, called tracks. There are<br />
typically tens of thousands of tracks per surface. Each track is in turn divided into<br />
track One of thousands<br />
of concentric circles that<br />
makes up the surface of a<br />
magnetic disk.
382 Chapter 5 Large and Fast: Exploiting Memory Hierarchy<br />
sector One of the<br />
segments that make up a<br />
track on a magnetic disk;<br />
a sector is the smallest<br />
amount of information<br />
that is read or written on<br />
a disk.<br />
sectors that contain the information; each track may have thousands of sectors.<br />
Sectors are typically 512 to 4096 bytes in size. The sequence recorded on the<br />
magnetic media is a sector number, a gap, the information for that sector including<br />
error correction code (see Section 5.5), a gap, the sector number of the next sector,<br />
and so on.<br />
The disk heads for each surface are connected together and move in conjunction,<br />
so that every head is over the same track of every surface. The term cylinder is used<br />
to refer to all the tracks under the heads at a given point on all surfaces.<br />
FIGURE 5.6 A disk showing 10 disk platters and the read/write heads. The diameter of<br />
today’s disks is 2.5 or 3.5 inches, and there are typically one or two platters per drive.<br />
seek The process of<br />
positioning a read/write<br />
head over the proper<br />
track on a disk.<br />
To access data, the operating system must direct the disk through a three-stage<br />
process. The first step is to position the head over the proper track. This operation is<br />
called a seek, <strong>and</strong> the time to move the head to the desired track is called the seek time.<br />
Disk manufacturers report minimum seek time, maximum seek time, and average<br />
seek time in their manuals. The first two are easy to measure, but the average is open to<br />
wide interpretation because it depends on the seek distance. The industry calculates<br />
average seek time as the sum of the time for all possible seeks divided by the number<br />
of possible seeks. Average seek times are usually advertised as 3 ms to 13 ms, but,<br />
depending on the application and scheduling of disk requests, the actual average seek<br />
time may be only 25% to 33% of the advertised number because of locality of disk
references. This locality arises both because of successive accesses to the same file and<br />
because the operating system tries to schedule such accesses together.<br />
Once the head has reached the correct track, we must wait for the desired sector<br />
to rotate under the read/write head. This time is called the rotational latency or<br />
rotational delay. The average latency to the desired information is halfway around<br />
the disk. Disks rotate at 5400 RPM to 15,000 RPM. The average rotational latency<br />
at 5400 RPM is<br />
Average rotational latency = 0.5 rotation / 5400 RPM<br />
= 0.5 rotation / (5400 RPM / (60 seconds/minute))<br />
= 0.0056 seconds = 5.6 ms<br />
rotational latency Also<br />
called rotational delay.<br />
The time required for<br />
the desired sector of a<br />
disk to rotate under the<br />
read/write head; usually<br />
assumed to be half the<br />
rotation time.<br />
The last component of a disk access, transfer time, is the time to transfer a block<br />
of bits. The transfer time is a function of the sector size, the rotation speed, and the<br />
recording density of a track. Transfer rates in 2012 were between 100 and 200 MB/sec.<br />
One complication is that most disk controllers have a built-in cache that stores<br />
sectors as they are passed over; transfer rates from the cache are typically higher,<br />
and were up to 750 MB/sec (6 Gbit/sec) in 2012.<br />
Alas, where block numbers are located is no longer intuitive. The assumptions of<br />
the sector-track-cylinder model above are that nearby blocks are on the same track,<br />
blocks in the same cylinder take less time to access since there is no seek time,<br />
and some tracks are closer than others. The reason for the change was the raising<br />
of the level of the disk interfaces. To speed up sequential transfers, these higher-level<br />
interfaces organize disks more like tapes than like random access devices.<br />
The logical blocks are ordered in serpentine fashion across a single surface, trying<br />
to capture all the sectors that are recorded at the same bit density to get the best<br />
performance. Hence, sequential blocks may be on different tracks.<br />
In summary, the two primary differences between magnetic disks and<br />
semiconductor memory technologies are that disks have a slower access time because<br />
they are mechanical devices—flash is 1000 times as fast and DRAM is 100,000 times<br />
as fast—yet they are cheaper per bit because they have very high storage capacity at a<br />
modest cost—disk is 10 to 100 times cheaper. Magnetic disks are nonvolatile like flash,<br />
but unlike flash there is no write wear-out problem. However, flash is much more<br />
rugged and hence a better match to the jostling inherent in personal mobile devices.<br />
5.3 The Basics of Caches<br />
In our library example, the desk acted as a cache—a safe place to store things<br />
(books) that we needed to examine. Cache was the name chosen to represent the<br />
level of the memory hierarchy between the processor <strong>and</strong> main memory in the first<br />
commercial computer to have this extra level. The memories in the datapath in<br />
Chapter 4 are simply replaced by caches. Today, although this remains the dominant<br />
Cache: a safe place<br />
for hiding or storing<br />
things.<br />
Webster’s New World<br />
Dictionary of the<br />
American Language,<br />
Third College Edition,<br />
1988
direct-mapped cache<br />
A cache structure in<br />
which each memory<br />
location is mapped to<br />
exactly one location in the<br />
cache.<br />
use of the word cache, the term is also used to refer to any storage managed to take<br />
advantage of locality of access. Caches first appeared in research computers in the<br />
early 1960s and in production computers later in that same decade; every general-purpose<br />
computer built today, from servers to low-power embedded processors,<br />
includes caches.<br />
In this section, we begin by looking at a very simple cache in which the processor<br />
requests are each one word and the blocks also consist of a single word. (Readers<br />
already familiar with cache basics may want to skip to Section 5.4.) Figure 5.7 shows<br />
such a simple cache, before and after requesting a data item that is not initially in<br />
the cache. Before the request, the cache contains a collection of recent references<br />
X1, X2, …, Xn-1, and the processor requests a word Xn that is not in the cache. This<br />
request results in a miss, and the word Xn is brought from memory into the cache.<br />
In looking at the scenario in Figure 5.7, there are two questions to answer: How<br />
do we know if a data item is in the cache? Moreover, if it is, how do we find it? The<br />
answers are related. If each word can go in exactly one place in the cache, then it<br />
is straightforward to find the word if it is in the cache. The simplest way to assign<br />
a location in the cache for each word in memory is to assign the cache location<br />
based on the address of the word in memory. This cache structure is called direct<br />
mapped, since each memory location is mapped directly to exactly one location in<br />
the cache. The typical mapping between addresses and cache locations for a direct-mapped<br />
cache is usually simple. For example, almost all direct-mapped caches use<br />
this mapping to find a block:<br />
(Block address) modulo (Number of blocks in the cache)<br />
tag A field in a table used<br />
for a memory hierarchy<br />
that contains the address<br />
information required<br />
to identify whether the<br />
associated block in the<br />
hierarchy corresponds to<br />
a requested word.<br />
If the number of entries in the cache is a power of 2, then modulo can be<br />
computed simply by using the low-order log2(cache size in blocks) bits of the<br />
address. Thus, an 8-block cache uses the three lowest bits (8 = 2^3) of the block<br />
address. For example, Figure 5.8 shows how the memory addresses between 1ten<br />
(00001two) and 29ten (11101two) map to locations 1ten (001two) and 5ten (101two) in a<br />
direct-mapped cache of eight words.<br />
Because each cache location can contain the contents of a number of different<br />
memory locations, how do we know whether the data in the cache corresponds<br />
to a requested word? That is, how do we know whether a requested word is in the<br />
cache or not? We answer this question by adding a set of tags to the cache. The<br />
tags contain the address information required to identify whether a word in the<br />
cache corresponds to the requested word. The tag needs only to contain the upper<br />
portion of the address, corresponding to the bits that are not used as an index into<br />
the cache. For example, in Figure 5.8 we need only have the upper 2 of the 5 address<br />
bits in the tag, since the lower 3-bit index field of the address selects the block.<br />
Architects omit the index bits because they are redundant, since by definition the<br />
index field of any address of a cache block must be that block number.<br />
We also need a way to recognize that a cache block does not have valid<br />
information. For instance, when a processor starts up, the cache does not have good<br />
data, and the tag fields will be meaningless. Even after executing many instructions,<br />
we have conflicting demands for a block. The word at address 18 (10010two) should<br />
be brought into cache block 2 (010two). Hence, it must replace the word at address<br />
26 (11010two), which is already in cache block 2 (010two). This behavior allows a<br />
cache to take advantage of temporal locality: recently referenced words replace less<br />
recently referenced words.<br />
This situation is directly analogous to needing a book from the shelves and<br />
having no more space on your desk—some book already on your desk must be<br />
returned to the shelves. In a direct-mapped cache, there is only one place to put the<br />
newly requested item and hence only one choice of what to replace.<br />
We know where to look in the cache for each possible address: the low-order bits<br />
of an address can be used to find the unique cache entry to which the address could<br />
map. Figure 5.10 shows how a referenced address is divided into<br />
■ A tag field, which is used to compare with the value of the tag field of the<br />
cache<br />
■ A cache index, which is used to select the block<br />
The index of a cache block, together with the tag contents of that block, uniquely<br />
specifies the memory address of the word contained in the cache block. Because<br />
the index field is used as an address to reference the cache, and because an n-bit<br />
field has 2^n values, the total number of entries in a direct-mapped cache must be a<br />
power of 2. In the MIPS architecture, since words are aligned to multiples of four<br />
bytes, the least significant two bits of every address specify a byte within a word.<br />
Hence, the least significant two bits are ignored when selecting a word in the block.<br />
The total number of bits needed for a cache is a function of the cache size and<br />
the address size, because the cache includes both the storage for the data and the<br />
tags. The size of the block above was one word, but normally it is several. For the<br />
following situation:<br />
■ 32-bit addresses<br />
■ A direct-mapped cache<br />
■ The cache size is 2^n blocks, so n bits are used for the index<br />
■ The block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within<br />
the block, and two bits are used for the byte part of the address<br />
the size of the tag field is<br />
32 − (n + m + 2).<br />
The total number of bits in a direct-mapped cache is<br />
2^n × (block size + tag size + valid field size).<br />
Bits in a Cache<br />
EXAMPLE<br />
How many total bits are required for a direct-mapped cache with 16 KiB of<br />
data and 4-word blocks, assuming a 32-bit address?<br />
ANSWER<br />
We know that 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2),<br />
there are 1024 (2^10) blocks. Each block has 4 × 32 or 128 bits of data plus a<br />
tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is<br />
2^10 × (4 × 32 + (32 − 10 − 2 − 2) + 1) = 2^10 × 147 = 147 Kibibits<br />
or 18.4 KiB for a 16 KiB cache. For this cache, the total number of bits in the<br />
cache is about 1.15 times as many as needed just for the storage of the data.<br />
Mapping an Address to a Multiword Cache Block<br />
EXAMPLE<br />
Consider a cache with 64 blocks and a block size of 16 bytes. To what block<br />
number does byte address 1200 map?<br />
ANSWER<br />
We saw the formula on page 384. The block is given by<br />
(Block address) modulo (Number of blocks in the cache)<br />
where the address of the block is<br />
⌊Byte address / Bytes per block⌋<br />
Notice that this block address is the block containing all addresses between<br />
⌊Byte address / Bytes per block⌋ × Bytes per block<br />
the block from the next lower level of the hierarchy and load it into the cache. The<br />
time to fetch the block has two parts: the latency to the first word and the transfer<br />
time for the rest of the block. Clearly, unless we change the memory system, the<br />
transfer time—and hence the miss penalty—will likely increase as the block size<br />
increases. Furthermore, the improvement in the miss rate starts to decrease as the<br />
blocks become larger. The result is that the increase in the miss penalty overwhelms<br />
the decrease in the miss rate for blocks that are too large, and cache performance<br />
thus decreases. Of course, if we design the memory to transfer larger blocks more<br />
efficiently, we can increase the block size and obtain further improvements in cache<br />
performance. We discuss this topic in the next section.<br />
Elaboration: Although it is hard to do anything about the longer latency component of<br />
the miss penalty for large blocks, we may be able to hide some of the transfer time so<br />
that the miss penalty is effectively smaller. The simplest method for doing this, called<br />
early restart, is simply to resume execution as soon as the requested word of the block<br />
is returned, rather than wait for the entire block. Many processors use this technique<br />
for instruction access, where it works best. Instruction accesses are largely sequential,<br />
so if the memory system can deliver a word every clock cycle, the processor may be<br />
able to restart operation when the requested word is returned, with the memory system<br />
delivering new instruction words just in time. This technique is usually less effective for<br />
data caches because it is likely that the words will be requested from the block in a<br />
less predictable way, and the probability that the processor will need another word from<br />
a different cache block before the transfer completes is high. If the processor cannot<br />
access the data cache because a transfer is ongoing, then it must stall.<br />
An even more sophisticated scheme is to organize the memory so that the requested<br />
word is transferred from the memory to the cache first. The remainder of the block<br />
is then transferred, starting with the address after the requested word and wrapping<br />
around to the beginning of the block. This technique, called requested word first or<br />
critical word first, can be slightly faster than early restart, but it is limited by the same<br />
properties that limit early restart.<br />
cache miss A request for<br />
data from the cache that<br />
cannot be filled because<br />
the data is not present in<br />
the cache.<br />
Handling Cache Misses<br />
Before we look at the cache of a real system, let’s see how the control unit deals with<br />
cache misses. (We describe a cache controller in detail in Section 5.9). The control<br />
unit must detect a miss and process the miss by fetching the requested data from<br />
memory (or, as we shall see, a lower-level cache). If the cache reports a hit, the<br />
computer continues using the data as if nothing happened.<br />
Modifying the control of a processor to handle a hit is trivial; misses, however,<br />
require some extra work. The cache miss handling is done in collaboration with<br />
the processor control unit and with a separate controller that initiates the memory<br />
access and refills the cache. The processing of a cache miss creates a pipeline stall<br />
(Chapter 4) as opposed to an interrupt, which would require saving the state of all<br />
registers. For a cache miss, we can stall the entire processor, essentially freezing<br />
the contents of the temporary and programmer-visible registers, while we wait<br />
for memory. More sophisticated out-of-order processors can allow execution of<br />
instructions while waiting for a cache miss, but we’ll assume in-order processors<br />
that stall on cache misses in this section.<br />
Let’s look a little more closely at how instruction misses are handled; the same<br />
approach can be easily extended to handle data misses. If an instruction access<br />
results in a miss, then the content of the Instruction register is invalid. To get the<br />
proper instruction into the cache, we must be able to instruct the lower level in the<br />
memory hierarchy to perform a read. Since the program counter is incremented in<br />
the first clock cycle of execution, the address of the instruction that generates an<br />
instruction cache miss is equal to the value of the program counter minus 4. Once<br />
we have the address, we need to instruct the main memory to perform a read. We<br />
wait for the memory to respond (since the access will take multiple clock cycles),<br />
and then write the words containing the desired instruction into the cache.<br />
We can now define the steps to be taken on an instruction cache miss:<br />
1. Send the original PC value (current PC – 4) to the memory.<br />
2. Instruct main memory to perform a read and wait for the memory to<br />
complete its access.<br />
3. Write the cache entry, putting the data from memory in the data portion of<br />
the entry, writing the upper bits of the address (from the ALU) into the tag<br />
field, and turning the valid bit on.<br />
4. Restart the instruction execution at the first step, which will refetch the<br />
instruction, this time finding it in the cache.<br />
The control of the cache on a data access is essentially identical: on a miss, we<br />
simply stall the processor until the memory responds with the data.<br />
Handling Writes<br />
Writes work somewhat differently. Suppose on a store instruction, we wrote the<br />
data into only the data cache (without changing main memory); then, after the<br />
write into the cache, memory would have a different value from that in the cache.<br />
In such a case, the cache and memory are said to be inconsistent. The simplest way<br />
to keep the main memory and the cache consistent is always to write the data into<br />
both the memory and the cache. This scheme is called write-through.<br />
The other key aspect of writes is what occurs on a write miss. We first fetch the<br />
words of the block from memory. After the block is fetched and placed into the<br />
cache, we can overwrite the word that caused the miss into the cache block. We also<br />
write the word to main memory using the full address.<br />
Although this design handles writes very simply, it would not provide very<br />
good performance. With a write-through scheme, every write causes the data<br />
to be written to main memory. These writes will take a long time, likely at least<br />
100 processor clock cycles, <strong>and</strong> could slow down the processor considerably. For<br />
example, suppose 10% of the instructions are stores. If the CPI without cache<br />
write-through<br />
A scheme in which writes<br />
always update both the<br />
cache and the next lower<br />
level of the memory<br />
hierarchy, ensuring that<br />
data is always consistent<br />
between the two.
write buffer A queue<br />
that holds data while<br />
the data is waiting to be<br />
written to memory.<br />
write-back A scheme<br />
that handles writes by<br />
updating values only to<br />
the block in the cache,<br />
then writing the modified<br />
block to the lower level<br />
of the hierarchy when the<br />
block is replaced.<br />
misses was 1.0, spending 100 extra cycles on every write would lead to a CPI of<br />
1.0 + 100 × 10% = 11, reducing performance by more than a factor of 10.<br />
One solution to this problem is to use a write buffer. A write buffer stores the<br />
data while it is waiting to be written to memory. After writing the data into the<br />
cache and into the write buffer, the processor can continue execution. When a write<br />
to main memory completes, the entry in the write buffer is freed. If the write buffer<br />
is full when the processor reaches a write, the processor must stall until there is an<br />
empty position in the write buffer. Of course, if the rate at which the memory can<br />
complete writes is less than the rate at which the processor is generating writes, no<br />
amount of buffering can help, because writes are being generated faster than the<br />
memory system can accept them.<br />
The rate at which writes are generated may also be less than the rate at which the<br />
memory can accept them, <strong>and</strong> yet stalls may still occur. This can happen when the<br />
writes occur in bursts. To reduce the occurrence of such stalls, processors usually<br />
increase the depth of the write buffer beyond a single entry.<br />
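A toy timing model makes this burst effect concrete. The sketch below is a simplified, hypothetical model (names and parameters are ours, not any real processor's buffer): stores arrive at given cycles, memory retires one buffered write at a time, and the processor stalls only when the buffer is full.

```python
def write_buffer_stalls(arrivals, depth, mem_write_cycles):
    """Stall cycles when a stream of stores meets a finite write buffer.

    arrivals: cycle at which each store would issue, absent any stalls
              (non-decreasing). depth: number of write-buffer entries.
    mem_write_cycles: cycles for memory to complete one write.
    Simplified model: memory retires buffered writes one at a time, in order.
    """
    retire_times = []   # cycle at which each occupied entry will be freed
    mem_free = 0        # cycle at which memory can begin its next write
    stalls = 0
    for t in arrivals:
        t += stalls                       # earlier stalls push this store later
        retire_times = [r for r in retire_times if r > t]  # drain finished writes
        if len(retire_times) == depth:    # buffer full: stall for the oldest entry
            wait = min(retire_times) - t
            stalls += wait
            t += wait
            retire_times.remove(min(retire_times))
        start = max(t, mem_free)          # memory handles one write at a time
        mem_free = start + mem_write_cycles
        retire_times.append(mem_free)
    return stalls
```

With a 10-cycle memory write, the burst of stores at cycles 0, 1, 2 stalls 18 cycles with a one-entry buffer but not at all with four entries, while the same three stores spread out at cycles 0, 20, 40 never stall even with one entry: the average write rate is fine, but the burst is not.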
The alternative to a write-through scheme is a scheme called write-back. In a<br />
write-back scheme, when a write occurs, the new value is written only to the block<br />
in the cache. The modified block is written to the lower level of the hierarchy when<br />
it is replaced. Write-back schemes can improve performance, especially when<br />
processors can generate writes as fast or faster than the writes can be h<strong>and</strong>led by<br />
main memory; a write-back scheme is, however, more complex to implement than<br />
write-through.<br />
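To make the two policies concrete, here is a minimal sketch (one-word blocks, write-allocate, a direct-mapped cache; an illustrative model of ours, not any real implementation) that counts how many words each policy sends to main memory for a trace of stores:

```python
def memory_writes(trace, policy, num_blocks=4):
    """Count words written to main memory by a tiny direct-mapped cache
    under 'write-through' vs 'write-back' (one-word blocks, write-allocate)."""
    cache = {}              # index -> (tag, dirty)
    writes_to_memory = 0
    for addr in trace:      # trace of store addresses (word addresses)
        index, tag = addr % num_blocks, addr // num_blocks
        old = cache.get(index)
        if old is not None and old[0] != tag and old[1]:
            writes_to_memory += 1        # write back the evicted dirty block
        if policy == "write-through":
            writes_to_memory += 1        # every store also updates memory
            cache[index] = (tag, False)  # block is never dirty
        else:                            # write-back: just mark the block dirty
            cache[index] = (tag, True)
    return writes_to_memory
```

Four stores to the same word cost write-through four memory writes but write-back none, since the dirty block is never evicted; alternating stores to two conflicting words erode the advantage, since each store evicts a dirty block.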
In the rest of this section, we describe caches from real processors, <strong>and</strong> we<br />
examine how they h<strong>and</strong>le both reads <strong>and</strong> writes. In Section 5.8, we will describe<br />
the h<strong>and</strong>ling of writes in more detail.<br />
Elaboration: Writes introduce several complications into caches that are not present<br />
for reads. Here we discuss two of them: the policy on write misses and efficient
implementation of writes in write-back caches.<br />
Consider a miss in a write-through cache. The most common strategy is to allocate a<br />
block in the cache, called write allocate. The block is fetched from memory <strong>and</strong> then the<br />
appropriate portion of the block is overwritten. An alternative strategy is to update the portion<br />
of the block in memory but not put it in the cache, called no write allocate. The motivation is<br />
that sometimes programs write entire blocks of data, such as when the operating system<br />
zeros a page of memory. In such cases, the fetch associated with the initial write miss may<br />
be unnecessary. Some computers allow the write allocation policy to be changed on a per<br />
page basis.<br />
Actually implementing stores efficiently in a cache that uses a write-back strategy is
more complex than in a write-through cache. A write-through cache can write the data
into the cache and read the tag; if the tag mismatches, then a miss occurs. Because the
cache is write-through, the overwriting of the block in the cache is not catastrophic, since
memory has the correct value. In a write-back cache, we must first write the block back
to memory if the data in the cache is modified and we have a cache miss. If we simply
overwrote the block on a store instruction before we knew whether the store had hit in<br />
the cache (as we could for a write-through cache), we would destroy the contents of the<br />
block, which is not backed up in the next lower level of the memory hierarchy.
In a write-back cache, because we cannot overwrite the block, stores either require<br />
two cycles (a cycle to check for a hit followed by a cycle to actually perform the write) or<br />
require a write buffer to hold that data—effectively allowing the store to take only one<br />
cycle by pipelining it. When a store buffer is used, the processor does the cache lookup<br />
<strong>and</strong> places the data in the store buffer during the normal cache access cycle. Assuming<br />
a cache hit, the new data is written from the store buffer into the cache on the next<br />
unused cache access cycle.<br />
By comparison, in a write-through cache, writes can always be done in one cycle.<br />
We read the tag <strong>and</strong> write the data portion of the selected block. If the tag matches<br />
the address of the block being written, the processor can continue normally, since the<br />
correct block has been updated. If the tag does not match, the processor generates a<br />
write miss to fetch the rest of the block corresponding to that address.<br />
Many write-back caches also include write buffers that are used to reduce the miss
penalty when a miss replaces a modified block. In such a case, the modified block is
moved to a write-back buffer associated with the cache while the requested block is read<br />
from memory. The write-back buffer is later written back to memory. Assuming another<br />
miss does not occur immediately, this technique halves the miss penalty when a dirty<br />
block must be replaced.<br />
An Example Cache: The Intrinsity FastMATH Processor<br />
The Intrinsity FastMATH is an embedded microprocessor that uses the MIPS<br />
architecture <strong>and</strong> a simple cache implementation. Near the end of the chapter, we<br />
will examine the more complex cache designs of ARM <strong>and</strong> Intel microprocessors,<br />
but we start with this simple, yet real, example for pedagogical reasons. Figure 5.12<br />
shows the organization of the Intrinsity FastMATH data cache.<br />
This processor has a 12-stage pipeline. When operating at peak speed, the<br />
processor can request both an instruction word <strong>and</strong> a data word on every clock.<br />
To satisfy the dem<strong>and</strong>s of the pipeline without stalling, separate instruction<br />
<strong>and</strong> data caches are used. Each cache is 16 KiB, or 4096 words, with 16-word<br />
blocks.<br />
Read requests for the cache are straightforward. Because there are separate<br />
data <strong>and</strong> instruction caches, we need separate control signals to read <strong>and</strong> write<br />
each cache. (Remember that we need to update the instruction cache when a miss<br />
occurs.) Thus, the steps for a read request to either cache are as follows:<br />
1. Send the address to the appropriate cache. The address comes either from<br />
the PC (for an instruction) or from the ALU (for data).<br />
2. If the cache signals hit, the requested word is available on the data lines.<br />
Since there are 16 words in the desired block, we need to select the right one.<br />
A block index field is used to control the multiplexor (shown at the bottom<br />
of the figure), which selects the requested word from the 16 words in the<br />
indexed block.
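The field widths follow from the sizes just given: 16 KiB of data in 16-word (64-byte) blocks means 256 blocks, so a 32-bit byte address splits into an 18-bit tag, an 8-bit index, a 4-bit block offset (word within block), and a 2-bit byte offset. A quick sketch of the decomposition:

```python
def fastmath_fields(addr):
    """Decompose a 32-bit byte address for a 16 KiB cache with 16-word
    (64-byte) blocks: 256 blocks -> 8 index bits, 4 + 2 offset bits."""
    byte_offset  = addr & 0x3           # 2 bits: byte within a word
    block_offset = (addr >> 2) & 0xF    # 4 bits: word within the 16-word block
    index        = (addr >> 6) & 0xFF   # 8 bits: selects one of 256 blocks
    tag          = addr >> 14           # remaining 18 bits
    return tag, index, block_offset, byte_offset
```

The four fields reassemble into the original address, which is a handy sanity check on the widths.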
To take advantage of spatial locality, a cache must have a block size larger than<br />
one word. The use of a larger block decreases the miss rate <strong>and</strong> improves the<br />
efficiency of the cache by reducing the amount of tag storage relative to the amount<br />
of data storage in the cache. Although a larger block size decreases the miss rate, it<br />
can also increase the miss penalty. If the miss penalty increased linearly with the<br />
block size, larger blocks could easily lead to lower performance.<br />
To avoid performance loss, the b<strong>and</strong>width of main memory is increased to<br />
transfer cache blocks more efficiently. Common methods for increasing b<strong>and</strong>width<br />
external to the DRAM are making the memory wider <strong>and</strong> interleaving. DRAM<br />
designers have steadily improved the interface between the processor <strong>and</strong> memory<br />
to increase the b<strong>and</strong>width of burst mode transfers to reduce the cost of larger cache<br />
block sizes.<br />
Check Yourself
The speed of the memory system affects the designer’s decision on the size of<br />
the cache block. Which of the following cache designer guidelines are generally<br />
valid?<br />
1. The shorter the memory latency, the smaller the cache block<br />
2. The shorter the memory latency, the larger the cache block<br />
3. The higher the memory b<strong>and</strong>width, the smaller the cache block<br />
4. The higher the memory b<strong>and</strong>width, the larger the cache block<br />
5.4 Measuring and Improving Cache Performance
In this section, we begin by examining ways to measure <strong>and</strong> analyze cache<br />
performance. We then explore two different techniques for improving cache<br />
performance. One focuses on reducing the miss rate by reducing the probability<br />
that two different memory blocks will contend for the same cache location. The<br />
second technique reduces the miss penalty by adding an additional level to the<br />
hierarchy. This technique, called multilevel caching, first appeared in high-end<br />
computers selling for more than $100,000 in 1990; since then it has become<br />
common on personal mobile devices selling for a few hundred dollars!
CPU time can be divided into the clock cycles that the CPU spends executing<br />
the program <strong>and</strong> the clock cycles that the CPU spends waiting for the memory<br />
system. Normally, we assume that the costs of cache accesses that are hits are part<br />
of the normal CPU execution cycles. Thus,<br />
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
The memory-stall clock cycles come primarily from cache misses, <strong>and</strong> we make<br />
that assumption here. We also restrict the discussion to a simplified model of the<br />
memory system. In real processors, the stalls generated by reads <strong>and</strong> writes can be<br />
quite complex, <strong>and</strong> accurate performance prediction usually requires very detailed<br />
simulations of the processor <strong>and</strong> memory system.<br />
Memory-stall clock cycles can be defined as the sum of the stall cycles coming<br />
from reads plus those coming from writes:<br />
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
The read-stall cycles can be defined in terms of the number of read accesses per<br />
program, the miss penalty in clock cycles for a read, <strong>and</strong> the read miss rate:<br />
Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty
Writes are more complicated. For a write-through scheme, we have two sources of<br />
stalls: write misses, which usually require that we fetch the block before continuing<br />
the write (see the Elaboration on page 394 for more details on dealing with writes),<br />
<strong>and</strong> write buffer stalls, which occur when the write buffer is full when a write<br />
occurs. Thus, the cycles stalled for writes equals the sum of these two:<br />
Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty
                     + Write buffer stalls
Because the write buffer stalls depend on the proximity of writes, <strong>and</strong> not just<br />
the frequency, it is not possible to give a simple equation to compute such stalls.<br />
Fortunately, in systems with a reasonable write buffer depth (e.g., four or more<br />
words) <strong>and</strong> a memory capable of accepting writes at a rate that significantly exceeds<br />
the average write frequency in programs (e.g., by a factor of 2), the write buffer<br />
stalls will be small, <strong>and</strong> we can safely ignore them. If a system did not meet these<br />
criteria, it would not be well designed; instead, the designer should have used either<br />
a deeper write buffer or a write-back organization.
Write-back schemes also have potential additional stalls arising from the need<br />
to write a cache block back to memory when the block is replaced. We will discuss<br />
this more in Section 5.8.<br />
In most write-through cache organizations, the read <strong>and</strong> write miss penalties are<br />
the same (the time to fetch the block from memory). If we assume that the write<br />
buffer stalls are negligible, we can combine the reads <strong>and</strong> writes by using a single<br />
miss rate <strong>and</strong> the miss penalty:<br />
Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
We can also factor this as<br />
Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty
Let’s consider a simple example to help us underst<strong>and</strong> the impact of cache<br />
performance on processor performance.<br />
Calculating Cache Performance<br />
EXAMPLE<br />
Assume the miss rate of an instruction cache is 2% <strong>and</strong> the miss rate of the data<br />
cache is 4%. If a processor has a CPI of 2 without any memory stalls <strong>and</strong> the<br />
miss penalty is 100 cycles for all misses, determine how much faster a processor<br />
would run with a perfect cache that never missed. Assume the frequency of all<br />
loads <strong>and</strong> stores is 36%.<br />
ANSWER<br />
The number of memory miss cycles for instructions in terms of the Instruction<br />
count (I) is<br />
Instruction miss cycles = I × 2% × 100 = 2.00 × I
As the frequency of all loads <strong>and</strong> stores is 36%, we can find the number of<br />
memory miss cycles for data references:<br />
Data miss cycles = I × 36% × 4% × 100 = 1.44 × I
The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. This is
more than three cycles of memory stall per instruction. Accordingly, the total
CPI including memory stalls is 2 + 3.44 = 5.44. Since there is no change in
instruction count or clock rate, the ratio of the CPU execution times is

CPU time with stalls / CPU time with perfect cache
    = (I × CPI_stall × Clock cycle) / (I × CPI_perfect × Clock cycle)
    = CPI_stall / CPI_perfect
    = 5.44 / 2

The performance with the perfect cache is better by 5.44 / 2 = 2.72.
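The arithmetic in this example is easy to check mechanically. A small helper (the function and parameter names are ours, following the per-instruction equations above):

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_frac, penalty):
    """CPI including memory stalls, per the equations in this section.
    mem_frac: loads and stores as a fraction of all instructions."""
    inst_stalls = i_miss_rate * penalty             # 2% x 100 = 2.00 per instr.
    data_stalls = mem_frac * d_miss_rate * penalty  # 36% x 4% x 100 = 1.44
    return base_cpi + inst_stalls + data_stalls

stall_cpi = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100)   # 5.44
speedup = stall_cpi / 2.0                                 # 2.72 vs. a perfect cache
```

The same helper with a base CPI of 1.0 gives 4.44, the figure used in the discussion of a faster processor that follows.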
What happens if the processor is made faster, but the memory system is not? The<br />
amount of time spent on memory stalls will take up an increasing fraction of the<br />
execution time; Amdahl’s Law, which we examined in Chapter 1, reminds us of<br />
this fact. A few simple examples show how serious this problem can be. Suppose<br />
we speed-up the computer in the previous example by reducing its CPI from 2 to 1<br />
without changing the clock rate, which might be done with an improved pipeline.<br />
The system with cache misses would then have a CPI of 1 + 3.44 = 4.44, and the
system with the perfect cache would be 4.44 / 1 = 4.44 times as fast.
The amount of execution time spent on memory stalls would have risen from
3.44 / 5.44 = 63% to 3.44 / 4.44 = 77%.
Similarly, increasing the clock rate without changing the memory system also<br />
increases the performance lost due to cache misses.<br />
The previous examples <strong>and</strong> equations assume that the hit time is not a factor in<br />
determining cache performance. Clearly, if the hit time increases, the total time to<br />
access a word from the memory system will increase, possibly causing an increase in<br />
the processor cycle time. Although we will see additional examples of what can increase
hit time shortly, one example is increasing the cache size. A larger cache could clearly<br />
have a longer access time, just as, if your desk in the library was very large (say, 3 square<br />
meters), it would take longer to locate a book on the desk. An increase in hit time<br />
likely adds another stage to the pipeline, since it may take multiple cycles for a cache<br />
hit. Although it is more complex to calculate the performance impact of a deeper<br />
pipeline, at some point the increase in hit time for a larger cache could dominate the<br />
improvement in hit rate, leading to a decrease in processor performance.<br />
To capture the fact that the time to access data for both hits <strong>and</strong> misses affects<br />
performance, designers sometime use average memory access time (AMAT) as<br />
a way to examine alternative cache designs. Average memory access time is the<br />
average time to access memory considering both hits <strong>and</strong> misses <strong>and</strong> the frequency<br />
of different accesses; it is equal to the following:<br />
AMAT = Time for a hit + Miss rate × Miss penalty
Calculating Average Memory Access Time<br />
EXAMPLE<br />
Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of<br />
20 clock cycles, a miss rate of 0.05 misses per instruction, <strong>and</strong> a cache access<br />
time (including hit detection) of 1 clock cycle. Assume that the read <strong>and</strong> write<br />
miss penalties are the same <strong>and</strong> ignore other write stalls.<br />
ANSWER<br />
The average memory access time per instruction is

AMAT = Time for a hit + Miss rate × Miss penalty
     = 1 + 0.05 × 20
     = 2 clock cycles
or 2 ns.<br />
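The AMAT formula is one line of code; this sketch simply restates the definition and reproduces the example's numbers:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in whatever units hit_time uses."""
    return hit_time + miss_rate * miss_penalty

# The example above: 1-cycle hit, 0.05 misses per access, 20-cycle penalty.
example = amat(1, 0.05, 20)   # 2 clock cycles, i.e., 2 ns at a 1 ns clock
```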
The next subsection discusses alternative cache organizations that decrease<br />
miss rate but may sometimes increase hit time; additional examples appear in<br />
Section 5.15, Fallacies <strong>and</strong> Pitfalls.<br />
Reducing Cache Misses by More Flexible Placement<br />
of Blocks<br />
So far, when we place a block in the cache, we have used a simple placement scheme:<br />
A block can go in exactly one place in the cache. As mentioned earlier, it is called<br />
direct mapped because there is a direct mapping from any block address in memory<br />
to a single location in the upper level of the hierarchy. However, there is actually a<br />
whole range of schemes for placing blocks. Direct mapped, where a block can be<br />
placed in exactly one location, is at one extreme.
At the other extreme is a scheme where a block can be placed in any location<br />
in the cache. Such a scheme is called fully associative, because a block in memory<br />
may be associated with any entry in the cache. To find a given block in a fully<br />
associative cache, all the entries in the cache must be searched because a block<br />
can be placed in any one. To make the search practical, it is done in parallel with<br />
a comparator associated with each cache entry. These comparators significantly<br />
increase the hardware cost, effectively making fully associative placement practical<br />
only for caches with small numbers of blocks.<br />
The middle range of designs between direct mapped <strong>and</strong> fully associative<br />
is called set associative. In a set-associative cache, there are a fixed number of<br />
locations where each block can be placed. A set-associative cache with n locations<br />
for a block is called an n-way set-associative cache. An n-way set-associative cache<br />
consists of a number of sets, each of which consists of n blocks. Each block in the<br />
memory maps to a unique set in the cache given by the index field, <strong>and</strong> a block can<br />
be placed in any element of that set. Thus, a set-associative placement combines<br />
direct-mapped placement <strong>and</strong> fully associative placement: a block is directly<br />
mapped into a set, <strong>and</strong> then all the blocks in the set are searched for a match. For<br />
example, Figure 5.14 shows where block 12 may be placed in a cache with eight<br />
blocks total, according to the three block placement policies.<br />
Remember that in a direct-mapped cache, the position of a memory block is
given by

(Block number) modulo (Number of blocks in the cache)

fully associative cache: A cache structure in which a block can be placed in
any location in the cache.

set-associative cache: A cache that has a fixed number of locations (at least
two) where each block can be placed.
[Figure 5.14 illustration: placement of memory block 12 in an eight-block cache under direct-mapped, set-associative, and fully associative schemes]
FIGURE 5.14 The location of a memory block whose address is 12 in a cache with eight
blocks varies for direct-mapped, set-associative, and fully associative placement. In
direct-mapped placement, there is only one cache block where memory block 12 can be found, and that
block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and
memory block 12 must be in set (12 modulo 4) = 0; the memory block could be in either element of the
set. In a fully associative placement, the memory block for block address 12 can appear in any of the
eight cache blocks.
In a set-associative cache, the set containing a memory block is given by<br />
(Block number) modulo (Number of sets in the cache)<br />
Since the block may be placed in any element of the set, all the tags of all the elements<br />
of the set must be searched. In a fully associative cache, the block can go anywhere,<br />
<strong>and</strong> all tags of all the blocks in the cache must be searched.<br />
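The two modulo formulas can be folded into one helper: with ways = 1 it reproduces direct-mapped placement, and with ways equal to the number of blocks it reproduces fully associative placement. (The helper assumes cache blocks are numbered consecutively within each set, which is a layout choice of this sketch.)

```python
def candidate_blocks(block_number, num_blocks, ways):
    """Cache blocks where a memory block may be placed in an n-way
    set-associative cache of num_blocks total blocks."""
    num_sets = num_blocks // ways
    s = block_number % num_sets          # (Block number) modulo (Number of sets)
    return [s * ways + w for w in range(ways)]
```

For memory block 12 in an eight-block cache, this reproduces Figure 5.14: exactly block 4 when direct mapped, either block of set 0 when two-way set associative, and any of the eight blocks when fully associative.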
We can also think of all block placement strategies as a variation on set<br />
associativity. Figure 5.15 shows the possible associativity structures for an eight-block
cache. A direct-mapped cache is simply a one-way set-associative cache:
each cache entry holds one block <strong>and</strong> each set has one element. A fully associative<br />
cache with m entries is simply an m-way set-associative cache; it has one set with m<br />
blocks, <strong>and</strong> an entry can reside in any block within that set.<br />
The advantage of increasing the degree of associativity is that it usually decreases<br />
the miss rate, as the next example shows. The main disadvantage, which we discuss<br />
in more detail shortly, is a potential increase in the hit time.<br />
[Figure 5.15 illustration: an eight-block cache organized as one-way (direct mapped), two-way, four-way, and eight-way (fully associative) set associative]
FIGURE 5.15 An eight-block cache configured as direct mapped, two-way set associative,
four-way set associative, and fully associative. The total size of the cache in blocks is equal
to the number of sets times the associativity. Thus, for a fixed cache size, increasing the associativity
decreases the number of sets while increasing the number of elements per set. With eight blocks, an
eight-way set-associative cache is the same as a fully associative cache.
is replaced. (We will discuss other replacement rules in more detail shortly.)<br />
Using this replacement rule, the contents of the set-associative cache after each<br />
reference looks like this:<br />
Address of memory   Hit        Contents of cache blocks after reference
block accessed      or miss    Set 0        Set 0        Set 1    Set 1

0                   miss       Memory[0]
8                   miss       Memory[0]    Memory[8]
0                   hit        Memory[0]    Memory[8]
6                   miss       Memory[0]    Memory[6]
8                   miss       Memory[8]    Memory[6]
Notice that when block 6 is referenced, it replaces block 8, since block 8 has<br />
been less recently referenced than block 0. The two-way set-associative cache<br />
has four misses, one less than the direct-mapped cache.<br />
The fully associative cache has four cache blocks (in a single set); any<br />
memory block can be stored in any cache block. The fully associative cache has<br />
the best performance, with only three misses:<br />
Address of memory   Hit        Contents of cache blocks after reference
block accessed      or miss    Block 0      Block 1      Block 2      Block 3

0                   miss       Memory[0]
8                   miss       Memory[0]    Memory[8]
0                   hit        Memory[0]    Memory[8]
6                   miss       Memory[0]    Memory[8]    Memory[6]
8                   hit        Memory[0]    Memory[8]    Memory[6]
For this series of references, three misses is the best we can do, because three<br />
unique block addresses are accessed. Notice that if we had eight blocks in the<br />
cache, there would be no replacements in the two-way set-associative cache<br />
(check this for yourself), <strong>and</strong> it would have the same number of misses as the<br />
fully associative cache. Similarly, if we had 16 blocks, all 3 caches would have<br />
the same number of misses. Even this trivial example shows that cache size <strong>and</strong><br />
associativity are not independent in determining cache performance.<br />
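A dozen lines of simulation reproduce the miss counts above (five for direct mapped, four for two-way, three for fully associative) for the reference stream 0, 8, 0, 6, 8. This is an illustrative model with LRU replacement, not production code:

```python
def count_misses(trace, num_blocks, ways):
    """Misses for a trace of block addresses in an n-way set-associative
    cache of num_blocks blocks with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]    # each list ordered LRU ... MRU
    misses = 0
    for block in trace:
        s = sets[block % num_sets]          # select the set by modulo mapping
        if block in s:
            s.remove(block)                 # hit: re-append below as MRU
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                    # evict the least recently used
        s.append(block)
    return misses
```

Running the same trace at three associativities (one-way, two-way, and four-way on a four-block cache) gives 5, 4, and 3 misses, matching the worked example.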
How much of a reduction in the miss rate is achieved by associativity?<br />
Figure 5.16 shows the improvement for a 64 KiB data cache with a 16-word block,<br />
<strong>and</strong> associativity ranging from direct mapped to eight-way. Going from one-way<br />
to two-way associativity decreases the miss rate by about 15%, but there is little<br />
further improvement in going to higher associativity.
Associativity    Data miss rate
1                10.3%
2                 8.6%
4                 8.3%
8                 8.1%
FIGURE 5.16 The data cache miss rates for an organization like the Intrinsity FastMATH<br />
processor for SPEC CPU2000 benchmarks with associativity varying from one-way to<br />
eight-way. These results for 10 SPEC CPU2000 programs are from Hennessy <strong>and</strong> Patterson (2003).<br />
[Figure 5.17 illustration: an address divided into Tag | Index | Block offset fields]
FIGURE 5.17 The three portions of an address in a set-associative or direct-mapped<br />
cache. The index is used to select the set, then the tag is used to choose the block by comparison with the<br />
blocks in the selected set. The block offset is the address of the desired data within the block.<br />
Locating a Block in the Cache<br />
Now, let’s consider the task of finding a block in a cache that is set associative.<br />
Just as in a direct-mapped cache, each block in a set-associative cache includes<br />
an address tag that gives the block address. The tag of every cache block within<br />
the appropriate set is checked to see if it matches the block address from the<br />
processor. Figure 5.17 decomposes the address. The index value is used to select<br />
the set containing the address of interest, <strong>and</strong> the tags of all the blocks in the set<br />
must be searched. Because speed is of the essence, all the tags in the selected set are<br />
searched in parallel. As in a fully associative cache, a sequential search would make<br />
the hit time of a set-associative cache too slow.<br />
If the total cache size is kept the same, increasing the associativity increases the<br />
number of blocks per set, which is the number of simultaneous compares needed<br />
to perform the search in parallel: each increase by a factor of 2 in associativity<br />
doubles the number of blocks per set <strong>and</strong> halves the number of sets. Accordingly,<br />
each factor-of-2 increase in associativity decreases the size of the index by 1 bit <strong>and</strong><br />
increases the size of the tag by 1 bit. In a fully associative cache, there is effectively<br />
only one set, <strong>and</strong> all the blocks must be checked in parallel. Thus, there is no index,<br />
<strong>and</strong> the entire address, excluding the block offset, is compared against the tag of<br />
every block. In other words, we search the entire cache without any indexing.<br />
In a direct-mapped cache, only a single comparator is needed, because the entry can<br />
be in only one block, <strong>and</strong> we access the cache simply by indexing. Figure 5.18 shows<br />
that in a four-way set-associative cache, four comparators are needed, together with<br />
a 4-to-1 multiplexor to choose among the four potential members of the selected set.<br />
The cache access consists of indexing the appropriate set <strong>and</strong> then searching the tags<br />
of the set. The costs of an associative cache are the extra comparators <strong>and</strong> any delay<br />
imposed by having to do the compare <strong>and</strong> select from among the elements of the set.
Choosing Which Block to Replace<br />
When a miss occurs in a direct-mapped cache, the requested block can go in<br />
exactly one position, <strong>and</strong> the block occupying that position must be replaced. In<br />
an associative cache, we have a choice of where to place the requested block, <strong>and</strong><br />
hence a choice of which block to replace. In a fully associative cache, all blocks are<br />
c<strong>and</strong>idates for replacement. In a set-associative cache, we must choose among the<br />
blocks in the selected set.<br />
The most commonly used scheme is least recently used (LRU), which we used<br />
in the previous example. In an LRU scheme, the block replaced is the one that has<br />
been unused for the longest time. The set associative example on page 405 uses<br />
LRU, which is why we replaced Memory(0) instead of Memory(6).<br />
LRU replacement is implemented by keeping track of when each element in a<br />
set was used relative to the other elements in the set. For a two-way set-associative<br />
cache, tracking when the two elements were used can be implemented by keeping<br />
a single bit in each set <strong>and</strong> setting the bit to indicate an element whenever that<br />
element is referenced. As associativity increases, implementing LRU gets harder; in<br />
Section 5.8, we will see an alternative scheme for replacement.<br />
least recently used<br />
(LRU) A replacement<br />
scheme in which the<br />
block replaced is the one<br />
that has been unused for<br />
the longest time.<br />
Size of Tags versus Set Associativity

EXAMPLE

Increasing associativity requires more comparators and more tag bits per
cache block. Assuming a cache of 4096 blocks, a 4-word block size, and a
32-bit address, find the total number of sets and the total number of tag bits
for caches that are direct mapped, two-way and four-way set associative, and
fully associative.

ANSWER

Since there are 16 (= 2^4) bytes per block, a 32-bit address yields 32 - 4 = 28
bits to be used for index and tag. The direct-mapped cache has the same number
of sets as blocks, and hence 12 bits of index, since log2(4096) = 12; hence, the
total number is (28 - 12) × 4096 = 16 × 4096 = 66 K tag bits.

Each degree of associativity decreases the number of sets by a factor of 2 and
thus decreases the number of bits used to index the cache by 1 and increases
the number of bits in the tag by 1. Thus, for a two-way set-associative cache,
there are 2048 sets, and the total number of tag bits is (28 - 11) × 2 × 2048 =
34 × 2048 = 70 Kbits. For a four-way set-associative cache, the total number of
sets is 1024, and the total number of tag bits is (28 - 10) × 4 × 1024 =
72 × 1024 = 74 K tag bits.

For a fully associative cache, there is only one set with 4096 blocks, and the
tag is 28 bits, leading to 28 × 4096 × 1 = 115 K tag bits.
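The example's arithmetic generalizes to a short computation. The function below (our own helper) returns exact bit counts; the text rounds 65,536 bits to "66 K" and so on:

```python
from math import log2

def total_tag_bits(num_blocks, words_per_block, addr_bits, ways):
    """Total tag storage, in bits, for an n-way set-associative cache."""
    offset_bits = int(log2(words_per_block * 4))   # byte offset within a block
    index_bits = int(log2(num_blocks // ways))     # bits to select a set
    return (addr_bits - offset_bits - index_bits) * num_blocks
```

For the 4096-block, 4-word-block, 32-bit-address cache of the example, this gives 65,536 bits direct mapped, 69,632 two-way, 73,728 four-way, and 114,688 fully associative.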
410 Chapter 5 Large <strong>and</strong> Fast: Exploiting Memory Hierarchy<br />
Reducing the Miss Penalty Using Multilevel Caches<br />
All modern computers make use of caches. To close the gap further between the<br />
fast clock rates of modern processors <strong>and</strong> the increasingly long time required to<br />
access DRAMs, most microprocessors support an additional level of caching. This<br />
second-level cache is normally on the same chip <strong>and</strong> is accessed whenever a miss<br />
occurs in the primary cache. If the second-level cache contains the desired data,<br />
the miss penalty for the first-level cache will be essentially the access time of the<br />
second-level cache, which will be much less than the access time of main memory.<br />
If neither the primary nor the secondary cache contains the data, a main memory<br />
access is required, <strong>and</strong> a larger miss penalty is incurred.<br />
How significant is the performance improvement from the use of a secondary<br />
cache? The next example shows us.<br />
Performance of Multilevel Caches<br />
EXAMPLE<br />
Suppose we have a processor with a base CPI of 1.0, assuming all references<br />
hit in the primary cache, <strong>and</strong> a clock rate of 4 GHz. Assume a main memory<br />
access time of 100 ns, including all the miss handling. Suppose the miss rate<br />
per instruction at the primary cache is 2%. How much faster will the processor<br />
be if we add a secondary cache that has a 5 ns access time for either a hit or<br />
a miss <strong>and</strong> is large enough to reduce the miss rate to main memory to 0.5%?<br />
ANSWER<br />
The miss penalty to main memory is<br />
100 ns / (0.25 ns per clock cycle) = 400 clock cycles<br />
The effective CPI with one level of caching is given by<br />
Total CPI = Base CPI + Memory-stall cycles per instruction<br />
For the processor with one level of caching,<br />
Total CPI = 1.0 + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9.0<br />
With two levels of caching, a miss in the primary (or first-level) cache can be<br />
satisfied either by the secondary cache or by main memory. The miss penalty<br />
for an access to the second-level cache is<br />
5 ns / (0.25 ns per clock cycle) = 20 clock cycles
5.4 Measuring <strong>and</strong> Improving Cache Performance 411<br />
If the miss is satisfied in the secondary cache, then this is the entire miss<br />
penalty. If the miss needs to go to main memory, then the total miss penalty is<br />
the sum of the secondary cache access time <strong>and</strong> the main memory access time.<br />
Thus, for a two-level cache, total CPI is the sum of the stall cycles from both<br />
levels of cache <strong>and</strong> the base CPI:<br />
Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction<br />
= 1 + 2% × 20 + 0.5% × 400 = 1 + 0.4 + 2.0 = 3.4<br />
Thus, the processor with the secondary cache is faster by<br />
9.0 / 3.4 = 2.6<br />
Alternatively, we could have computed the stall cycles by summing the stall<br />
cycles of those references that hit in the secondary cache ((2% − 0.5%) ×<br />
20 = 0.3). The references that go to main memory, which must include the<br />
cost to access the secondary cache as well as the main memory access time,<br />
contribute (0.5% × (20 + 400) = 2.1). The sum, 1.0 + 0.3 + 2.1, is again 3.4.<br />
The design considerations for a primary <strong>and</strong> secondary cache are significantly<br />
different, because the presence of the other cache changes the best choice versus<br />
a single-level cache. In particular, a two-level cache structure allows the primary<br />
cache to focus on minimizing hit time to yield a shorter clock cycle or fewer<br />
pipeline stages, while allowing the secondary cache to focus on miss rate to reduce<br />
the penalty of long memory access times.<br />
The effect of these changes on the two caches can be seen by comparing each<br />
cache to the optimal design for a single level of cache. In comparison to a single-level<br />
cache, the primary cache of a multilevel cache is often smaller. Furthermore,<br />
the primary cache may use a smaller block size, to go with the smaller cache size <strong>and</strong><br />
also to reduce the miss penalty. In comparison, the secondary cache will be much<br />
larger than in a single-level cache, since the access time of the secondary cache is<br />
less critical. With a larger total size, the secondary cache may use a larger block size<br />
than appropriate with a single-level cache. It often uses higher associativity than<br />
the primary cache given the focus of reducing miss rates.<br />
multilevel cache<br />
A memory hierarchy with<br />
multiple levels of caches,<br />
rather than just a cache<br />
<strong>and</strong> main memory.<br />
Sorting has been exhaustively analyzed to find better algorithms: Bubble Sort,<br />
Quicksort, Radix Sort, <strong>and</strong> so on. Figure 5.19(a) shows instructions executed by<br />
item searched for Radix Sort versus Quicksort. As expected, for large arrays, Radix<br />
Sort has an algorithmic advantage over Quicksort in terms of number of operations.<br />
Figure 5.19(b) shows time per key instead of instructions executed. We see that the<br />
lines start on the same trajectory as in Figure 5.19(a), but then the Radix Sort line<br />
Underst<strong>and</strong>ing<br />
Program<br />
Performance
[Figure 5.19: three plots of (a) Instructions/item, (b) Clock cycles/item, and (c) Cache misses/item for Radix Sort versus Quicksort, as the array size grows from 4 K to 4096 K items.]<br />
FIGURE 5.19 Comparing Quicksort <strong>and</strong> Radix Sort by (a) instructions executed per item<br />
sorted, (b) time per item sorted, <strong>and</strong> (c) cache misses per item sorted. This data is from a<br />
paper by LaMarca <strong>and</strong> Ladner [1996]. Due to such results, new versions of Radix Sort have been invented<br />
that take memory hierarchy into account, to regain its algorithmic advantages (see Section 5.15). The basic<br />
idea of cache optimizations is to use all the data in a block repeatedly before it is replaced on a miss.
diverges as the data to sort increases. What is going on? Figure 5.19(c) answers by<br />
looking at the cache misses per item sorted: Quicksort consistently has many fewer<br />
misses per item to be sorted.<br />
Alas, st<strong>and</strong>ard algorithmic analysis often ignores the impact of the memory<br />
hierarchy. As faster clock rates <strong>and</strong> Moore’s Law allow architects to squeeze all of<br />
the performance out of a stream of instructions, using the memory hierarchy well<br />
is critical to high performance. As we said in the introduction, underst<strong>and</strong>ing the<br />
behavior of the memory hierarchy is critical to underst<strong>and</strong>ing the performance of<br />
programs on today’s computers.<br />
Software Optimization via Blocking<br />
Given the importance of the memory hierarchy to program performance, not<br />
surprisingly many software optimizations were invented that can dramatically<br />
improve performance by reusing data within the cache <strong>and</strong> hence lower miss rates<br />
due to improved temporal locality.<br />
When dealing with arrays, we can get good performance from the memory<br />
system if we store the array in memory so that accesses to the array are sequential<br />
in memory. Suppose that we are dealing with multiple arrays, however, with some<br />
arrays accessed by rows <strong>and</strong> some by columns. Storing the arrays row-by-row<br />
(called row major order) or column-by-column (column major order) does not<br />
solve the problem because both rows <strong>and</strong> columns are used in every loop iteration.<br />
Instead of operating on entire rows or columns of an array, blocked algorithms<br />
operate on submatrices or blocks. The goal is to maximize accesses to the data<br />
loaded into the cache before the data are replaced; that is, improve temporal locality<br />
to reduce cache misses.<br />
For example, the inner loops of DGEMM (lines 4 through 9 of Figure 3.21 in<br />
Chapter 3) are<br />
for (int j = 0; j < n; ++j)<br />
{<br />
  double cij = C[i+j*n]; /* cij = C[i][j] */<br />
  for (int k = 0; k < n; k++)<br />
    cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */<br />
  C[i+j*n] = cij; /* C[i][j] = cij */<br />
}<br />
It reads all N-by-N elements of B, reads the same N elements in what corresponds to<br />
one row of A repeatedly, <strong>and</strong> writes what corresponds to one row of N elements of<br />
C. (The comments make the rows <strong>and</strong> columns of the matrices easier to identify.)<br />
Figure 5.20 gives a snapshot of the accesses to the three arrays. A dark shade<br />
indicates a recent access, a light shade indicates an older access, <strong>and</strong> white means<br />
not yet accessed.
[Figure 5.20: three 6-by-6 grids showing the access pattern to C (indexed by i <strong>and</strong> j), A (indexed by i <strong>and</strong> k), <strong>and</strong> B (indexed by k <strong>and</strong> j), with indices running from 0 to 5.]<br />
FIGURE 5.20 A snapshot of the three arrays C, A, <strong>and</strong> B when N = 6 <strong>and</strong> i = 1. The age of<br />
accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses,<br />
<strong>and</strong> dark means newer accesses. Compared to Figure 5.21, elements of A <strong>and</strong> B are read repeatedly to calculate<br />
new elements of x. The variables i, j, <strong>and</strong> k are shown along the rows or columns used to access the arrays.<br />
The number of capacity misses clearly depends on N <strong>and</strong> the size of the cache. If<br />
it can hold all three N-by-N matrices, then all is well, provided there are no cache<br />
conflicts. We purposely picked the matrix size to be 32 by 32 in DGEMM for<br />
Chapters 3 <strong>and</strong> 4 so that this would be the case. Each matrix is 32 × 32 = 1024<br />
elements <strong>and</strong> each element is 8 bytes, so the three matrices occupy 24 KiB, which<br />
comfortably fit in the 32 KiB data cache of the Intel Core i7 (S<strong>and</strong>y Bridge).<br />
If the cache can hold one N-by-N matrix <strong>and</strong> one row of N, then at least the ith<br />
row of A <strong>and</strong> the array B may stay in the cache. Less than that <strong>and</strong> misses may<br />
occur for both B <strong>and</strong> C. In the worst case, there would be 2N³ + N² memory words<br />
accessed for N³ operations.<br />
To ensure that the elements being accessed can fit in the cache, the original code<br />
is changed to compute on a submatrix. Hence, we essentially invoke the version of<br />
DGEMM from Figure 4.80 in Chapter 4 repeatedly on matrices of size BLOCKSIZE<br />
by BLOCKSIZE. BLOCKSIZE is called the blocking factor.<br />
Figure 5.21 shows the blocked version of DGEMM. The function do_block is<br />
DGEMM from Figure 3.21 with three new parameters si, sj, <strong>and</strong> sk to specify<br />
the starting position of each submatrix of A, B, <strong>and</strong> C. The two inner loops of the<br />
do_block now compute in steps of size BLOCKSIZE rather than the full length<br />
of B <strong>and</strong> C. The gcc optimizer removes any function call overhead by “inlining” the<br />
function; that is, it inserts the code directly to avoid the conventional parameter<br />
passing <strong>and</strong> return address bookkeeping instructions.<br />
Figure 5.22 illustrates the accesses to the three arrays using blocking. Looking<br />
only at capacity misses, the total number of memory words accessed is<br />
2N³/BLOCKSIZE + N². This total is an improvement by about a factor of BLOCKSIZE.<br />
Hence, blocking exploits a combination of spatial <strong>and</strong> temporal locality, since A<br />
benefits from spatial locality <strong>and</strong> B benefits from temporal locality.
#define BLOCKSIZE 32<br />
void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C)<br />
{<br />
  for (int i = si; i < si+BLOCKSIZE; ++i)<br />
    for (int j = sj; j < sj+BLOCKSIZE; ++j)<br />
    {<br />
      double cij = C[i+j*n]; /* cij = C[i][j] */<br />
      for (int k = sk; k < sk+BLOCKSIZE; k++)<br />
        cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */<br />
      C[i+j*n] = cij; /* C[i][j] = cij */<br />
    }<br />
}<br />
void dgemm (int n, double* A, double* B, double* C)<br />
{<br />
  for (int sj = 0; sj < n; sj += BLOCKSIZE)<br />
    for (int si = 0; si < n; si += BLOCKSIZE)<br />
      for (int sk = 0; sk < n; sk += BLOCKSIZE)<br />
        do_block(n, si, sj, sk, A, B, C);<br />
}<br />
FIGURE 5.21 Cache blocked version of DGEMM in Figure 3.21. Assume C is initialized to zero. The do_block<br />
function is basically DGEMM from Chapter 3 with new parameters to specify the starting positions of the<br />
BLOCKSIZE-by-BLOCKSIZE submatrices. The gcc optimizer can remove the function overhead instructions by inlining the do_block function.<br />
[Figure 5.22: the same three 6-by-6 grids as Figure 5.20, now showing only a BLOCKSIZE-by-BLOCKSIZE region of each array being accessed.]<br />
FIGURE 5.22 The age of accesses to the arrays C, A, <strong>and</strong> B when BLOCKSIZE = 3. Note that,<br />
in contrast to Figure 5.20, fewer elements are accessed.<br />
Although we have aimed at reducing cache misses, blocking can also be used to<br />
help register allocation. By taking a small blocking size such that the block can be<br />
held in registers, we can minimize the number of loads <strong>and</strong> stores in the program,<br />
which also improves performance.
[Figure 5.23: bar chart of GFLOPS for unoptimized versus blocked DGEMM at matrix sizes 32x32, 160x160, 480x480, and 960x960. Per-bar labels: 1.7, 1.5, 1.3, 0.8 GFLOPS (unoptimized) and 1.7, 1.6, 1.6, 1.5 GFLOPS (blocked).]<br />
FIGURE 5.23 Performance of unoptimized DGEMM (Figure 3.21) versus cache blocked<br />
DGEMM (Figure 5.21) as the matrix dimension varies from 32x32 (where all three matrices<br />
fit in the cache) to 960x960.<br />
Figure 5.23 shows the impact of cache blocking on the performance of the<br />
unoptimized DGEMM as we increase the matrix size beyond where all three<br />
matrices fit in the cache. The unoptimized performance is halved for the largest<br />
matrix. The cache-blocked version is less than 10% slower even at matrices that are<br />
960x960, or 900 times larger than the 32 × 32 matrices in Chapters 3 <strong>and</strong> 4.<br />
global miss rate The<br />
fraction of references<br />
that miss in all levels of a<br />
multilevel cache.<br />
local miss rate The<br />
fraction of references to<br />
one level of a cache that<br />
miss; used in multilevel<br />
hierarchies.<br />
Elaboration: Multilevel caches create several complications. First, there are now<br />
several different types of misses <strong>and</strong> corresponding miss rates. In the example on<br />
pages 410–411, we saw the primary cache miss rate <strong>and</strong> the global miss rate—the<br />
fraction of references that missed in all cache levels. There is also a miss rate for the<br />
secondary cache, which is the ratio of all misses in the secondary cache divided by the<br />
number of accesses to it. This miss rate is called the local miss rate of the secondary<br />
cache. Because the primary cache fi lters accesses, especially those with good spatial<br />
<strong>and</strong> temporal locality, the local miss rate of the secondary cache is much higher than the<br />
global miss rate. For the example on pages 410–411, we can compute the local miss<br />
rate of the secondary cache as 0.5%/2% 25%! Luckily, the global miss rate dictates<br />
how often we must access the main memory.<br />
Elaboration: With out-of-order processors (see Chapter 4), performance is more<br />
complex, since they execute instructions during the miss penalty. Instead of instruction<br />
miss rates <strong>and</strong> data miss rates, we use misses per instruction, <strong>and</strong> this formula:<br />
Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency − Overlapped miss latency)<br />
There is no general way to calculate overlapped miss latency, so evaluations of<br />
memory hierarchies for out-of-order processors inevitably require simulation of the<br />
processor <strong>and</strong> the memory hierarchy. Only by seeing the execution of the processor<br />
during each miss can we see if the processor stalls waiting for data or simply finds other<br />
work to do. A guideline is that the processor often hides the miss penalty for an L1<br />
cache miss that hits in the L2 cache, but it rarely hides a miss to the L2 cache.<br />
Elaboration: The performance challenge for algorithms is that the memory hierarchy<br />
varies between different implementations of the same architecture in cache size,<br />
associativity, block size, <strong>and</strong> number of caches. To cope with such variability, some<br />
recent numerical libraries parameterize their algorithms <strong>and</strong> then search the parameter<br />
space at runtime to find the best combination for a particular computer. This approach<br />
is called autotuning.<br />
Which of the following is generally true about a design with multiple levels of<br />
caches?<br />
1. First-level caches are more concerned about hit time, <strong>and</strong> second-level<br />
caches are more concerned about miss rate.<br />
2. First-level caches are more concerned about miss rate, <strong>and</strong> second-level<br />
caches are more concerned about hit time.<br />
Check<br />
Yourself<br />
Summary<br />
In this section, we focused on four topics: cache performance, using associativity to<br />
reduce miss rates, the use of multilevel cache hierarchies to reduce miss penalties,<br />
<strong>and</strong> software optimizations to improve effectiveness of caches.<br />
The memory system has a significant effect on program execution time. The<br />
number of memory-stall cycles depends on both the miss rate <strong>and</strong> the miss penalty.<br />
The challenge, as we will see in Section 5.8, is to reduce one of these factors without<br />
significantly affecting other critical factors in the memory hierarchy.<br />
To reduce the miss rate, we examined the use of associative placement schemes.<br />
Such schemes can reduce the miss rate of a cache by allowing more flexible<br />
placement of blocks within the cache. Fully associative schemes allow blocks to be<br />
placed anywhere, but also require that every block in the cache be searched to satisfy<br />
a request. The higher costs make large fully associative caches impractical. Set-associative<br />
caches are a practical alternative, since we need only search among the<br />
elements of a unique set that is chosen by indexing. Set-associative caches have higher<br />
miss rates but are faster to access. The amount of associativity that yields the best<br />
performance depends on both the technology <strong>and</strong> the details of the implementation.<br />
We looked at multilevel caches as a technique to reduce the miss penalty by<br />
allowing a larger secondary cache to handle misses to the primary cache. Second-level<br />
caches have become commonplace as designers find that limited silicon <strong>and</strong><br />
the goals of high clock rates prevent primary caches from becoming large. The<br />
secondary cache, which is often ten or more times larger than the primary cache,<br />
h<strong>and</strong>les many accesses that miss in the primary cache. In such cases, the miss<br />
penalty is that of the access time to the secondary cache (typically < 10 processor
error detection<br />
code A code that<br />
enables the detection of<br />
an error in data, but not<br />
the precise location <strong>and</strong>,<br />
hence, correction of the<br />
error.<br />
The Hamming Single Error Correcting, Double Error<br />
Detecting Code (SEC/DED)<br />
Richard Hamming invented a popular redundancy scheme for memory, for which<br />
he received the Turing Award in 1968. To invent redundant codes, it is helpful<br />
to talk about how “close” correct bit patterns can be. What we call the Hamming<br />
distance is just the minimum number of bits that are different between any two<br />
correct bit patterns. For example, the distance between 011011 <strong>and</strong> 001111 is two.<br />
What happens if the minimum distance between members of a code is two, <strong>and</strong><br />
we get a one-bit error? It will turn a valid pattern in a code to an invalid one. Thus,<br />
if we can detect whether members of a code are valid or not, we can detect single<br />
bit errors, <strong>and</strong> can say we have a single bit error detection code.<br />
Hamming used a parity code for error detection. In a parity code, the number<br />
of 1s in a word is counted; the word has odd parity if the number of 1s is odd <strong>and</strong><br />
even otherwise. When a word is written into memory, the parity bit is also written<br />
(1 for odd, 0 for even). That is, the parity of the N+1 bit word should always be even.<br />
Then, when the word is read out, the parity bit is read <strong>and</strong> checked. If the parity of the<br />
memory word <strong>and</strong> the stored parity bit do not match, an error has occurred.<br />
EXAMPLE<br />
Calculate the parity of a byte with the value 31 <strong>and</strong> show the pattern stored to<br />
memory. Assume the parity bit is on the right. Suppose the most significant bit<br />
was inverted in memory, <strong>and</strong> then you read it back. Did you detect the error?<br />
What happens if the two most significant bits are inverted?<br />
ANSWER<br />
31 is 00011111 in binary, which has five 1s. To make parity even, we need to write a 1<br />
in the parity bit, giving 000111111. If the most significant bit is inverted when we<br />
read it back, we would see 100111111, which has seven 1s. Since we expect<br />
even parity <strong>and</strong> calculated odd parity, we would signal an error. If the two most<br />
significant bits are inverted, we would see 110111111, which has eight 1s, or<br />
even parity, <strong>and</strong> we would not signal an error.<br />
If there are 2 bits of error, then a 1-bit parity scheme will not detect any errors,<br />
since the parity will match the data with two errors. (Actually, a 1-bit parity scheme<br />
can detect any odd number of errors; however, the probability of having 3 errors is<br />
much lower than the probability of having two, so, in practice, a 1-bit parity code is<br />
limited to detecting a single bit of error.)<br />
Of course, a parity code cannot correct errors, which Hamming wanted to do<br />
as well as detect them. If we used a code that had a minimum distance of 3, then<br />
any single bit error would be closer to the correct pattern than to any other valid<br />
pattern. He came up with an easy to underst<strong>and</strong> mapping of data into a distance 3<br />
code that we call Hamming Error Correction Code (ECC) in his honor. We use extra
EXAMPLE<br />
Assume one byte data value is 10011010. First show the Hamming ECC code<br />
for that byte, <strong>and</strong> then invert bit 10 <strong>and</strong> show that the ECC code finds <strong>and</strong><br />
corrects the single bit error.<br />
ANSWER<br />
Leaving spaces for the parity bits, the 12-bit pattern is _ _ 1 _ 0 0 1 _ 1 0 1 0.<br />
Position 1 checks bits 1, 3, 5, 7, 9, <strong>and</strong> 11. These data bits are 1 0 1 1 1, so the<br />
group already has even parity <strong>and</strong> we set bit 1 to 0.<br />
Position 2 checks bits 2, 3, 6, 7, 10, <strong>and</strong> 11. These data bits are 1 0 1 0 1, or odd<br />
parity, so we set position 2 to a 1.<br />
Position 4 checks bits 4, 5, 6, 7, <strong>and</strong> 12. These data bits are 0 0 1 0, or odd parity,<br />
so we set it to a 1.<br />
Position 8 checks bits 8, 9, 10, 11, <strong>and</strong> 12. These data bits are 1 0 1 0, or even<br />
parity, so we set it to a 0.<br />
The final code word is 011100101010. Inverting bit 10 changes it to<br />
011100101110.<br />
Parity bit 1 covers four 1s, so it has even parity; this group is OK.<br />
Parity bit 2 covers five 1s, so it has odd parity; there is an error somewhere.<br />
Parity bit 4 covers two 1s, so it has even parity; this group is OK.<br />
Parity bit 8 covers three 1s, so it has odd parity; there is an error somewhere.<br />
Parity bits 2 <strong>and</strong> 8 are incorrect. As 2 + 8 = 10, bit 10 must be wrong. Hence,<br />
we can correct the error by inverting bit 10: 011100101010. Voila!<br />
Hamming did not stop at single bit error correction. At the cost of one more<br />
bit, we can make the minimum Hamming distance in a code be 4. This means<br />
we can correct single bit errors <strong>and</strong> detect double bit errors. The idea is to add a<br />
parity bit that is calculated over the whole word. Let's use a four-bit data word as<br />
an example, which would need only 7 bits for single bit error correction. The<br />
Hamming parity bits H (p1 p2 p3) are computed (even parity as usual), plus the<br />
even parity over the entire word, p4:<br />
Bit position: 1 2 3 4 5 6 7 8<br />
Contents: p1 p2 d1 p3 d2 d3 d4 p4<br />
Then the algorithm to correct one error <strong>and</strong> detect two is just to calculate parity<br />
over the ECC groups (H) as before plus one more over the whole group (p4). There<br />
are four cases:<br />
1. H is even <strong>and</strong> p4 is even, so no error occurred.<br />
2. H is odd <strong>and</strong> p4 is odd, so a correctable single error occurred. (p4 should<br />
calculate odd parity if one error occurred.)<br />
3. H is even <strong>and</strong> p4 is odd, so a single error occurred in the p4 bit itself, not in<br />
the rest of the word; correct the p4 bit.<br />
4. H is odd <strong>and</strong> p4 is even, so a double error occurred.<br />
5.6 Virtual Machines 425<br />
allow these separate software stacks to run independently yet share hardware,<br />
thereby consolidating the number of servers. Another example is that some<br />
VMMs support migration of a running VM to a different computer, either<br />
to balance load or to evacuate from failing hardware.<br />
Amazon Web Services (AWS) uses the virtual machines in its cloud computing<br />
offering EC2 for five reasons:<br />
1. It allows AWS to protect users from each other while sharing the same server.<br />
2. It simplifies software distribution within a warehouse scale computer. A<br />
customer installs a virtual machine image configured with the appropriate<br />
software, <strong>and</strong> AWS distributes it to all the instances a customer wants to use.<br />
3. Customers (<strong>and</strong> AWS) can reliably “kill” a VM to control resource usage<br />
when customers complete their work.<br />
4. Virtual machines hide the identity of the hardware on which the customer is<br />
running, which means AWS can keep using old servers <strong>and</strong> introduce new,<br />
more efficient servers. The customer expects performance for instances to<br />
match their ratings in “EC2 Compute Units,” which AWS defines: to “provide<br />
the equivalent CPU capacity of a 1.0–1.2 GHz 2007 AMD Opteron or 2007<br />
Intel Xeon processor.” Thanks to Moore’s Law, newer servers clearly offer<br />
more EC2 Compute Units than older ones, but AWS can keep renting old<br />
servers as long as they are economical.<br />
5. Virtual Machine Monitors can control the rate that a VM uses the processor,<br />
the network, <strong>and</strong> disk space, which allows AWS to offer many price points<br />
of instances of different types running on the same underlying servers.<br />
For example, in 2012 AWS offered 14 instance types, from small st<strong>and</strong>ard<br />
instances at $0.08 per hour to high I/O quadruple extra large instances at<br />
$3.10 per hour.<br />
Hardware/<br />
Software<br />
Interface<br />
In general, the cost of processor virtualization depends on the workload. User-level<br />
processor-bound programs have zero virtualization overhead, because the<br />
OS is rarely invoked, so everything runs at native speeds. I/O-intensive workloads<br />
are generally also OS-intensive, executing many system calls <strong>and</strong> privileged<br />
instructions that can result in high virtualization overhead. On the other h<strong>and</strong>, if<br />
the I/O-intensive workload is also I/O-bound, the cost of processor virtualization<br />
can be completely hidden, since the processor is often idle waiting for I/O.<br />
The overhead is determined by both the number of instructions that must be<br />
emulated by the VMM <strong>and</strong> by how much time each takes to emulate them. Hence,<br />
when the guest VMs run the same ISA as the host, as we assume here, the goal
of the architecture <strong>and</strong> the VMM is to run almost all instructions directly on the<br />
native hardware.<br />
Requirements of a Virtual Machine Monitor<br />
What must a VM monitor do? It presents a software interface to guest software, it<br />
must isolate the state of guests from each other, <strong>and</strong> it must protect itself from guest<br />
software (including guest OSes). The qualitative requirements are:<br />
■ Guest software should behave on a VM exactly as if it were running on the<br />
native hardware, except for performance-related behavior or limitations of<br />
fixed resources shared by multiple VMs.<br />
■ Guest software should not be able to change allocation of real system resources<br />
directly.<br />
To “virtualize” the processor, the VMM must control just about everything—access<br />
to privileged state, I/O, exceptions, <strong>and</strong> interrupts—even though the guest VM <strong>and</strong><br />
OS currently running are temporarily using them.<br />
For example, in the case of a timer interrupt, the VMM would suspend the<br />
currently running guest VM, save its state, h<strong>and</strong>le the interrupt, determine which<br />
guest VM to run next, <strong>and</strong> then load its state. Guest VMs that rely on a timer<br />
interrupt are provided with a virtual timer <strong>and</strong> an emulated timer interrupt by the<br />
VMM.<br />
To be in charge, the VMM must be at a higher privilege level than the guest<br />
VM, which generally runs in user mode; this also ensures that the execution of<br />
any privileged instruction will be handled by the VMM. The basic requirements of a system virtual machine are:
■ At least two processor modes, system <strong>and</strong> user.<br />
■ A privileged subset of instructions that is available only in system mode,<br />
resulting in a trap if executed in user mode; all system resources must be<br />
controllable only via these instructions.<br />
(Lack of) Instruction Set Architecture Support for Virtual<br />
Machines<br />
If VMs are planned for during the design of the ISA, it's relatively easy both to reduce the number of instructions that must be executed by a VMM and to improve their emulation speed. An architecture that allows the VM to execute directly on
the hardware earns the title virtualizable, <strong>and</strong> the IBM 370 architecture proudly<br />
bears that label.<br />
Alas, since VMs have been considered for PC <strong>and</strong> server applications only fairly<br />
recently, most instruction sets were created without virtualization in mind. These<br />
culprits include x86 <strong>and</strong> most RISC architectures, including ARMv7 <strong>and</strong> MIPS.
Because the VMM must ensure that the guest system only interacts with virtual<br />
resources, a conventional guest OS runs as a user mode program on top of the<br />
VMM. Then, if a guest OS attempts to access or modify information related to<br />
hardware resources via a privileged instruction—for example, reading or writing<br />
a status bit that enables interrupts—it will trap to the VMM. The VMM can then<br />
effect the appropriate changes to corresponding real resources.<br />
Hence, if any instruction that tries to read or write such sensitive information<br />
traps when executed in user mode, the VMM can intercept it <strong>and</strong> support a virtual<br />
version of the sensitive information, as the guest OS expects.<br />
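This trap-and-emulate pattern can be illustrated with a minimal Python sketch. The `GuestVM` fields and the instruction names are hypothetical stand-ins for illustration, not taken from any real VMM: the key point is that the trap handler updates the guest's *virtual* copy of sensitive state rather than the real machine state.

```python
from collections import namedtuple

# Hypothetical trapped-instruction record: an opcode and a destination register.
Instr = namedtuple("Instr", "op dest")

class GuestVM:
    def __init__(self):
        self.virtual_ie = False    # guest's view of the Interrupt Enable flag
        self.virtual_status = 0    # guest's view of the status register
        self.regs = {}             # guest general-purpose registers
        self.pc = 0                # guest program counter

def handle_trap(vm, instr):
    """VMM trap handler: emulate a privileged instruction against the
    guest's virtual state instead of the real machine state."""
    if instr.op == "enable_interrupts":
        vm.virtual_ie = True                     # real IE flag is untouched
    elif instr.op == "read_status":
        vm.regs[instr.dest] = vm.virtual_status  # guest sees virtual status
    else:
        raise NotImplementedError(instr.op)
    vm.pc += 4   # resume the guest just past the trapped instruction
```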
In the absence of such support, other measures must be taken. A VMM must<br />
take special precautions to locate all problematic instructions <strong>and</strong> ensure that they<br />
behave correctly when executed by a guest OS, thereby increasing the complexity<br />
of the VMM <strong>and</strong> reducing the performance of running the VM.<br />
Protection <strong>and</strong> Instruction Set Architecture<br />
Protection is a joint effort of architecture <strong>and</strong> operating systems, but architects<br />
had to modify some awkward details of existing instruction set architectures when<br />
virtual memory became popular.<br />
For example, the x86 instruction POPF loads the flag registers from the top of<br />
the stack in memory. One of the flags is the Interrupt Enable (IE) flag. If you run<br />
the POPF instruction in user mode, rather than trap it, it simply changes all the<br />
flags except IE. In system mode, it does change the IE. Since a guest OS runs in user<br />
mode inside a VM, this is a problem, as it expects to see a changed IE.<br />
Historically, IBM mainframe hardware <strong>and</strong> VMM took three steps to improve<br />
performance of virtual machines:<br />
1. Reduce the cost of processor virtualization.<br />
2. Reduce interrupt overhead cost due to the virtualization.<br />
3. Reduce interrupt cost by steering interrupts to the proper VM without<br />
invoking VMM.<br />
AMD <strong>and</strong> Intel tried to address the first point in 2006 by reducing the cost of<br />
processor virtualization. It will be interesting to see how many generations of<br />
architecture <strong>and</strong> VMM modifications it will take to address all three points, <strong>and</strong><br />
how long before virtual machines of the 21st century will be as efficient as the IBM<br />
mainframes <strong>and</strong> VMMs of the 1970s.<br />
5.7 Virtual Memory<br />
"… a system has been devised to make the core drum combination appear to the programmer as a single level store, the requisite transfers taking place automatically."
—Kilburn et al., One-level storage system, 1962

virtual memory A technique that uses main memory as a "cache" for secondary storage.

physical address An address in main memory.

protection A set of mechanisms for ensuring that multiple processes sharing the processor, memory, or I/O devices cannot interfere, intentionally or unintentionally, with one another by reading or writing each other's data. These mechanisms also isolate the operating system from a user process.

page fault An event that occurs when an accessed page is not present in main memory.

virtual address An address that corresponds to a location in virtual space and is translated by address mapping to a physical address when memory is accessed.

In earlier sections, we saw how caches provided fast access to recently used portions of a program's code and data. Similarly, the main memory can act as a "cache" for the secondary storage, usually implemented with magnetic disks. This technique is
called virtual memory. Historically, there were two major motivations for virtual<br />
memory: to allow efficient <strong>and</strong> safe sharing of memory among multiple programs,<br />
such as for the memory needed by multiple virtual machines for cloud computing,<br />
<strong>and</strong> to remove the programming burdens of a small, limited amount of main<br />
memory. Five decades after its invention, it’s the former reason that reigns today.<br />
Of course, to allow multiple virtual machines to share the same memory, we<br />
must be able to protect the virtual machines from each other, ensuring that a<br />
program can only read <strong>and</strong> write the portions of main memory that have been<br />
assigned to it. Main memory need contain only the active portions of the many<br />
virtual machines, just as a cache contains only the active portion of one program.<br />
Thus, the principle of locality enables virtual memory as well as caches, <strong>and</strong> virtual<br />
memory allows us to efficiently share the processor as well as the main memory.<br />
We cannot know which virtual machines will share the memory with other<br />
virtual machines when we compile them. In fact, the virtual machines sharing<br />
the memory change dynamically while the virtual machines are running. Because<br />
of this dynamic interaction, we would like to compile each program into its<br />
own address space—a separate range of memory locations accessible only to this<br />
program. Virtual memory implements the translation of a program’s address space<br />
to physical addresses. This translation process enforces protection of a program’s<br />
address space from other virtual machines.<br />
The second motivation for virtual memory is to allow a single user program<br />
to exceed the size of primary memory. Formerly, if a program became too large<br />
for memory, it was up to the programmer to make it fit. Programmers divided<br />
programs into pieces <strong>and</strong> then identified the pieces that were mutually exclusive.<br />
These overlays were loaded or unloaded under user program control during<br />
execution, with the programmer ensuring that the program never tried to access<br />
an overlay that was not loaded <strong>and</strong> that the overlays loaded never exceeded the<br />
total size of the memory. Overlays were traditionally organized as modules, each<br />
containing both code <strong>and</strong> data. Calls between procedures in different modules<br />
would lead to overlaying of one module with another.<br />
As you can well imagine, this responsibility was a substantial burden on<br />
programmers. Virtual memory, which was invented to relieve programmers of<br />
this difficulty, automatically manages the two levels of the memory hierarchy<br />
represented by main memory (sometimes called physical memory to distinguish it<br />
from virtual memory) <strong>and</strong> secondary storage.<br />
Although the concepts at work in virtual memory <strong>and</strong> in caches are the same,<br />
their differing historical roots have led to the use of different terminology. A virtual<br />
memory block is called a page, <strong>and</strong> a virtual memory miss is called a page fault.<br />
With virtual memory, the processor produces a virtual address, which is translated<br />
by a combination of hardware <strong>and</strong> software to a physical address, which in turn can<br />
be used to access main memory. Figure 5.25 shows the virtually addressed memory<br />
with pages mapped to main memory. This process is called address mapping or
[Figure 5.26: a 32-bit virtual address, split into a virtual page number (bits 31–12) and a page offset (bits 11–0), is translated to a physical address whose physical page number occupies bits 29–12; the page offset passes through unchanged.]

FIGURE 5.26 Mapping from a virtual to a physical address. The page size is 2^12 = 4 KiB. The number of physical pages allowed in memory is 2^18, since the physical page number has 18 bits in it. Thus, main memory can have at most 1 GiB, while the virtual address space is 4 GiB.
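The mapping in Figure 5.26 amounts to a few shifts and masks. Here is a minimal sketch; the dictionary standing in for the page table is an assumption for illustration:

```python
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS   # 4 KiB pages, as in Figure 5.26

def translate(vaddr, page_table):
    """Split a 32-bit virtual address and map it to a physical address."""
    vpn = vaddr >> PAGE_BITS            # bits 31..12: virtual page number
    offset = vaddr & (PAGE_SIZE - 1)    # bits 11..0: page offset, unchanged
    ppn = page_table[vpn]               # the translation step
    return (ppn << PAGE_BITS) | offset
```

For example, translate(0x00003ABC, {0x3: 0x5}) maps virtual page 3 to physical page 5, yielding 0x00005ABC.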
Many design choices in virtual memory systems are motivated by the high cost<br />
of a page fault. A page fault to disk will take millions of clock cycles to process.<br />
(The table on page 378 shows that main memory latency is about 100,000 times shorter than disk latency.) This enormous miss penalty, dominated by the time to get the
first word for typical page sizes, leads to several key decisions in designing virtual<br />
memory systems:<br />
■ Pages should be large enough to try to amortize the high access time. Sizes<br />
from 4 KiB to 16 KiB are typical today. New desktop <strong>and</strong> server systems are<br />
being developed to support 32 KiB <strong>and</strong> 64 KiB pages, but new embedded<br />
systems are going in the other direction, to 1 KiB pages.<br />
■ Organizations that reduce the page fault rate are attractive. The primary<br />
technique used here is to allow fully associative placement of pages in<br />
memory.<br />
■ Page faults can be handled in software because the overhead will be small
compared to the disk access time. In addition, software can afford to use clever<br />
algorithms for choosing how to place pages because even small reductions in<br />
the miss rate will pay for the cost of such algorithms.<br />
■ Write-through will not work for virtual memory, since writes take too long.<br />
Instead, virtual memory systems use write-back.
The next few subsections address these factors in virtual memory design.<br />
Elaboration: We present the motivation for virtual memory as many virtual machines<br />
sharing the same memory, but virtual memory was originally invented so that many<br />
programs could share a computer as part of a timesharing system. Since many readers<br />
today have no experience with time-sharing systems, we use virtual machines to motivate<br />
this section.<br />
Elaboration: For servers <strong>and</strong> even PCs, 32-bit address processors are problematic.<br />
Although we normally think of virtual addresses as much larger than physical addresses,<br />
the opposite can occur when the processor address size is small relative to the state<br />
of the memory technology. No single program or virtual machine can benefit, but a collection of programs or virtual machines running at the same time can benefit from
not having to be swapped to memory or by running on parallel processors.<br />
Elaboration: The discussion of virtual memory in this book focuses on paging, which uses fixed-size blocks. There is also a variable-size block scheme called
segmentation. In segmentation, an address consists of two parts: a segment number<br />
<strong>and</strong> a segment offset. The segment number is mapped to a physical address, <strong>and</strong><br />
the offset is added to find the actual physical address. Because the segment can
vary in size, a bounds check is also needed to make sure that the offset is within<br />
the segment. The major use of segmentation is to support more powerful methods<br />
of protection <strong>and</strong> sharing in an address space. Most operating system textbooks<br />
contain extensive discussions of segmentation compared to paging <strong>and</strong> of the use<br />
of segmentation to logically share the address space. The major disadvantage of<br />
segmentation is that it splits the address space into logically separate pieces that<br />
must be manipulated as a two-part address: the segment number <strong>and</strong> the offset.<br />
Paging, in contrast, makes the boundary between page number <strong>and</strong> offset invisible<br />
to programmers <strong>and</strong> compilers.<br />
Segments have also been used as a method to extend the address space without<br />
changing the word size of the computer. Such attempts have been unsuccessful because<br />
of the awkwardness <strong>and</strong> performance penalties inherent in a two-part address, of which<br />
programmers <strong>and</strong> compilers must be aware.<br />
Many architectures divide the address space into large fixed-size blocks that simplify protection between the operating system and user programs and increase the efficiency of implementing paging. Although these divisions are often called "segments," this
mechanism is much simpler than variable block size segmentation <strong>and</strong> is not visible to<br />
user programs; we discuss it in more detail shortly.<br />
segmentation<br />
A variable-size address<br />
mapping scheme in which<br />
an address consists of two<br />
parts: a segment number,<br />
which is mapped to a<br />
physical address, <strong>and</strong> a<br />
segment offset.<br />
Placing a Page <strong>and</strong> Finding It Again<br />
Because of the incredibly high penalty for a page fault, designers reduce page fault<br />
frequency by optimizing page placement. If we allow a virtual page to be mapped<br />
to any physical page, the operating system can then choose to replace any page<br />
it wants when a page fault occurs. For example, the operating system can use a
sophisticated algorithm and complex data structures that track page usage to try to choose a page that will not be needed for a long time. The ability to use a clever and flexible replacement scheme reduces the page fault rate and simplifies the use of fully associative placement of pages.

page table The table containing the virtual to physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.
As mentioned in Section 5.4, the difficulty in using fully associative placement<br />
is in locating an entry, since it can be anywhere in the upper level of the hierarchy.<br />
A full search is impractical. In virtual memory systems, we locate pages by using a<br />
table that indexes the memory; this structure is called a page table, <strong>and</strong> it resides<br />
in memory. A page table is indexed with the page number from the virtual address<br />
to discover the corresponding physical page number. Each program has its own<br />
page table, which maps the virtual address space of that program to main memory.<br />
In our library analogy, the page table corresponds to a mapping between book<br />
titles <strong>and</strong> library locations. Just as the card catalog may contain entries for books<br />
in another library on campus rather than the local branch library, we will see that<br />
the page table may contain entries for pages not present in memory. To indicate the<br />
location of the page table in memory, the hardware includes a register that points to<br />
the start of the page table; we call this the page table register. Assume for now that<br />
the page table is in a fixed <strong>and</strong> contiguous area of memory.<br />
Hardware/Software Interface
The page table, together with the program counter <strong>and</strong> the registers, specifies<br />
the state of a virtual machine. If we want to allow another virtual machine to use<br />
the processor, we must save this state. Later, after restoring this state, the virtual<br />
machine can continue execution. We often refer to this state as a process. The<br />
process is considered active when it is in possession of the processor; otherwise, it<br />
is considered inactive. The operating system can make a process active by loading<br />
the process’s state, including the program counter, which will initiate execution at<br />
the value of the saved program counter.<br />
The process’s address space, <strong>and</strong> hence all the data it can access in memory, is<br />
defined by its page table, which resides in memory. Rather than save the entire page<br />
table, the operating system simply loads the page table register to point to the page<br />
table of the process it wants to make active. Each process has its own page table,<br />
since different processes use the same virtual addresses. The operating system is<br />
responsible for allocating the physical memory <strong>and</strong> updating the page tables, so<br />
that the virtual address spaces of different processes do not collide. As we will see<br />
shortly, the use of separate page tables also provides protection of one process from<br />
another.
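The switch described above can be sketched as follows. The `Process` and `CPU` classes are illustrative stand-ins, not real OS data structures; the point is that making a process active installs a pointer to its page table rather than copying the table itself.

```python
class Process:
    def __init__(self, page_table, pc):
        self.page_table = page_table   # this process's own translations
        self.saved_pc = pc             # saved program counter

class CPU:
    def __init__(self):
        self.page_table_register = None  # points at the active page table
        self.pc = 0

def activate(cpu, process):
    """Make a process active: load its state, including the page table
    register, so the same virtual addresses now map through its own table."""
    cpu.page_table_register = process.page_table
    cpu.pc = process.saved_pc
```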
swap space The space on the disk reserved for the full virtual memory space of a process.
Page Faults<br />
If the valid bit for a virtual page is off, a page fault occurs. The operating system<br />
must be given control. This transfer is done with the exception mechanism, which<br />
we saw in Chapter 4 <strong>and</strong> will discuss again later in this section. Once the operating<br />
system gets control, it must find the page in the next level of the hierarchy (usually<br />
flash memory or magnetic disk) <strong>and</strong> decide where to place the requested page in<br />
main memory.<br />
The virtual address alone does not immediately tell us where the page is on disk.<br />
Returning to our library analogy, we cannot find the location of a library book on<br />
the shelves just by knowing its title. Instead, we go to the catalog <strong>and</strong> look up the<br />
book, obtaining an address for the location on the shelves, such as the Library of<br />
Congress call number. Likewise, in a virtual memory system, we must keep track<br />
of the location on disk of each page in virtual address space.<br />
Because we do not know ahead of time when a page in memory will be replaced,<br />
the operating system usually creates the space on flash memory or disk for all the<br />
pages of a process when it creates the process. This space is called the swap space.<br />
At that time, it also creates a data structure to record where each virtual page is<br />
stored on disk. This data structure may be part of the page table or may be an<br />
auxiliary data structure indexed in the same way as the page table. Figure 5.28<br />
shows the organization when a single table holds either the physical page number<br />
or the disk address.<br />
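The single-table organization of Figure 5.28 can be sketched as follows; the field names are assumptions for illustration. Each entry holds either a physical page number, when the valid bit is on, or the page's swap-space address:

```python
class PTE:
    """One page table entry: a valid bit plus a page number or disk address."""
    def __init__(self, valid, ppn=None, disk_addr=None):
        self.valid = valid
        self.ppn = ppn
        self.disk_addr = disk_addr

def locate(page_table, vpn):
    """Return where a virtual page lives: main memory, or swap space on disk."""
    entry = page_table[vpn]
    if entry.valid:
        return ("memory", entry.ppn)
    return ("disk", entry.disk_addr)   # page fault: the OS fetches from here
```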
The operating system also creates a data structure that tracks which processes<br />
<strong>and</strong> which virtual addresses use each physical page. When a page fault occurs,<br />
if all the pages in main memory are in use, the operating system must choose a<br />
page to replace. Because we want to minimize the number of page faults, most<br />
operating systems try to choose a page that they hypothesize will not be needed<br />
in the near future. Using the past to predict the future, operating systems follow<br />
the least recently used (LRU) replacement scheme, which we mentioned in Section<br />
5.4. The operating system searches for the least recently used page, assuming that<br />
a page that has not been used in a long time is less likely to be needed than a more<br />
recently accessed page. The replaced pages are written to swap space on the disk.<br />
In case you are wondering, the operating system is just another process, <strong>and</strong> these<br />
tables controlling memory are in memory; the details of this seeming contradiction<br />
will be explained shortly.
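The LRU policy just described can be sketched with an ordered dictionary; the frame count and page numbers here are arbitrary:

```python
from collections import OrderedDict

class LRUPages:
    """Track resident pages; evict the least recently used one on overflow."""
    def __init__(self, nframes):
        self.nframes = nframes
        self.pages = OrderedDict()  # keys ordered oldest -> newest

    def access(self, vpn):
        """Touch a page; return the evicted page number, or None."""
        if vpn in self.pages:
            self.pages.move_to_end(vpn)   # now the most recently used
            return None
        victim = None
        if len(self.pages) == self.nframes:
            victim, _ = self.pages.popitem(last=False)  # least recently used
        self.pages[vpn] = True
        return victim
```

With two frames, accessing pages 1, 2, 1, 3 in order evicts page 2, since page 1 was touched more recently.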
436 Chapter 5 Large <strong>and</strong> Fast: Exploiting Memory Hierarchy<br />
Elaboration: With a 32-bit virtual address, 4 KiB pages, <strong>and</strong> 4 bytes per page table<br />
entry, we can compute the total page table size:<br />
Number of page table entries = 2^32 / 2^12 = 2^20

Size of page table = 2^20 page table entries × 2^2 bytes/page table entry = 4 MiB
That is, we would need to use 4 MiB of memory for each program in execution at any<br />
time. This amount is not so bad for a single process. What if there are hundreds of<br />
processes running, each with their own page table? And how should we handle 64-bit addresses, which by this calculation would need 2^52 words?
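The arithmetic above generalizes to a one-line function; the parameter names are ours:

```python
def page_table_bytes(addr_bits, page_bits, pte_bytes):
    """Size of a flat page table: one entry per virtual page."""
    entries = 1 << (addr_bits - page_bits)   # e.g., 2^32 / 2^12 = 2^20 entries
    return entries * pte_bytes

# 32-bit addresses, 4 KiB pages, 4-byte entries -> 4 MiB per process
assert page_table_bytes(32, 12, 4) == 4 * 2**20
```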
A range of techniques is used to reduce the amount of storage required for the page<br />
table. The five techniques below aim at reducing the total maximum storage required as
well as minimizing the main memory dedicated to page tables:<br />
1. The simplest technique is to keep a limit register that restricts the size of the<br />
page table for a given process. If the virtual page number becomes larger than<br />
the contents of the limit register, entries must be added to the page table. This<br />
technique allows the page table to grow as a process consumes more space.<br />
Thus, the page table will only be large if the process is using many pages of<br />
virtual address space. This technique requires that the address space expand in
only one direction.<br />
2. Allowing growth in only one direction is not sufficient, since most languages require<br />
two areas whose size is expandable: one area holds the stack and the other area
holds the heap. Because of this duality, it is convenient to divide the page table<br />
<strong>and</strong> let it grow from the highest address down, as well as from the lowest address<br />
up. This means that there will be two separate page tables <strong>and</strong> two separate<br />
limits. The use of two page tables breaks the address space into two segments.<br />
The high-order bit of an address usually determines which segment <strong>and</strong> thus which<br />
page table to use for that address. Since the high-order address bit specifies the<br />
segment, each segment can be as large as one-half of the address space. A<br />
limit register for each segment specifies the current size of the segment, which<br />
grows in units of pages. This type of segmentation is used by many architectures,<br />
including MIPS. Unlike the type of segmentation discussed in the third elaboration<br />
on page 431, this form of segmentation is invisible to the application program,<br />
although not to the operating system. The major disadvantage of this scheme is<br />
that it does not work well when the address space is used in a sparse fashion<br />
rather than as a contiguous set of virtual addresses.<br />
3. Another approach to reducing the page table size is to apply a hashing function<br />
to the virtual address so that the page table need be only the size of the number<br />
of physical pages in main memory. Such a structure is called an inverted page<br />
table. Of course, the lookup process is slightly more complex with an inverted<br />
page table, because we can no longer just index the page table.<br />
4. Multiple levels of page tables can also be used to reduce the total amount of page table storage. The first level maps large fixed-size blocks of virtual address space, perhaps 64 to 256 pages in total. These large blocks are sometimes called segments, and this first-level mapping table is sometimes called a
segment table, though the segments are again invisible to the user. Each entry<br />
in the segment table indicates whether any pages in that segment are allocated<br />
<strong>and</strong>, if so, points to a page table for that segment. Address translation happens<br />
by first looking in the segment table, using the highest-order bits of the address.
If the segment address is valid, the next set of high-order bits is used to index<br />
the page table indicated by the segment table entry. This scheme allows the<br />
address space to be used in a sparse fashion (multiple noncontiguous segments<br />
can be active) without having to allocate the entire page table. Such schemes<br />
are particularly useful with very large address spaces <strong>and</strong> in software systems<br />
that require noncontiguous allocation. The primary disadvantage of this two-level<br />
mapping is the more complex process for address translation.<br />
5. To reduce the actual main memory tied up in page tables, most modern systems<br />
also allow the page tables to be paged. Although this sounds tricky, it works<br />
by using the same basic ideas of virtual memory <strong>and</strong> simply allowing the page<br />
tables to reside in the virtual address space. In addition, there are some small<br />
but critical problems, such as a never-ending series of page faults, which must<br />
be avoided. How these problems are overcome is both very detailed <strong>and</strong> typically<br />
highly processor specific. In brief, these problems are avoided by placing all the
page tables in the address space of the operating system <strong>and</strong> placing at least<br />
some of the page tables for the operating system in a portion of main memory<br />
that is physically addressed <strong>and</strong> is always present <strong>and</strong> thus never on disk.<br />
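Technique 4, the two-level mapping through a segment table, can be sketched as follows; the bit widths chosen here are illustrative assumptions, and only segments actually in use need page tables, which is what lets a sparse address space avoid a full flat table:

```python
def two_level_translate(segment_table, vaddr,
                        addr_bits=32, seg_bits=8, page_bits=12):
    """Translate via a segment table whose valid entries point to page tables."""
    idx_bits = addr_bits - seg_bits - page_bits        # bits for the page index
    seg = vaddr >> (addr_bits - seg_bits)              # highest-order bits
    page_table = segment_table.get(seg)                # None: nothing allocated
    if page_table is None:
        raise KeyError("segment fault: no pages allocated in this segment")
    index = (vaddr >> page_bits) & ((1 << idx_bits) - 1)
    offset = vaddr & ((1 << page_bits) - 1)
    return (page_table[index] << page_bits) | offset
```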
What about Writes?<br />
The difference between the access time to the cache <strong>and</strong> main memory is tens to<br />
hundreds of cycles, <strong>and</strong> write-through schemes can be used, although we need a<br />
write buffer to hide the latency of the write from the processor. In a virtual memory<br />
system, writes to the next level of the hierarchy (disk) can take millions of processor<br />
clock cycles; therefore, building a write buffer to allow the system to write-through<br />
to disk would be completely impractical. Instead, virtual memory systems must use<br />
write-back, performing the individual writes into the page in memory, <strong>and</strong> copying<br />
the page back to disk when it is replaced in the memory.<br />
A write-back scheme has another major advantage in a virtual memory system.<br />
Because the disk transfer time is small compared with its access time, copying back<br />
an entire page is much more efficient than writing individual words back to the disk.<br />
A write-back operation, although more efficient than transferring individual words, is<br />
still costly. Thus, we would like to know whether a page needs to be copied back when<br />
we choose to replace it. To track whether a page has been written since it was read into<br />
the memory, a dirty bit is added to the page table. The dirty bit is set when any word<br />
in a page is written. If the operating system chooses to replace the page, the dirty bit<br />
indicates whether the page needs to be written out before its location in memory can be<br />
given to another page. Hence, a modified page is often called a dirty page.<br />
Hardware/Software Interface
Because we access the TLB instead of the page table on every reference, the TLB<br />
will need to include other status bits, such as the dirty <strong>and</strong> the reference bits.<br />
On every reference, we look up the virtual page number in the TLB. If we get a<br />
hit, the physical page number is used to form the address, <strong>and</strong> the corresponding<br />
reference bit is turned on. If the processor is performing a write, the dirty bit is also<br />
turned on. If a miss in the TLB occurs, we must determine whether it is a page fault<br />
or merely a TLB miss. If the page exists in memory, then the TLB miss indicates<br />
only that the translation is missing. In such cases, the processor can handle the TLB
miss by loading the translation from the page table into the TLB <strong>and</strong> then trying the<br />
reference again. If the page is not present in memory, then the TLB miss indicates<br />
a true page fault. In this case, the processor invokes the operating system using an<br />
exception. Because the TLB has many fewer entries than the number of pages in<br />
main memory, TLB misses will be much more frequent than true page faults.<br />
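The reference path just described can be sketched as follows. The entry fields are assumptions for illustration, and a real TLB does this in hardware in a fraction of a cycle; for simplicity the sketch shares entry objects between the TLB and the page table, so the reference and dirty bits need no explicit copy-back on replacement:

```python
def tlb_access(tlb, page_table, vpn, is_write=False):
    """Look up a translation, refilling the TLB from the page table on a miss."""
    entry = tlb.get(vpn)
    if entry is None:                          # TLB miss
        entry = page_table.get(vpn)
        if entry is None or not entry["valid"]:
            return "page fault"                # OS must fetch the page
        tlb[vpn] = entry                       # load translation into the TLB
    entry["ref"] = True                        # turn on the reference bit
    if is_write:
        entry["dirty"] = True                  # turn on the dirty bit
    return entry["ppn"]
```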
TLB misses can be handled either in hardware or in software. In practice, with
care there can be little performance difference between the two approaches, because<br />
the basic operations are the same in either case.<br />
After a TLB miss occurs <strong>and</strong> the missing translation has been retrieved from the<br />
page table, we will need to select a TLB entry to replace. Because the reference <strong>and</strong><br />
dirty bits are contained in the TLB entry, we need to copy these bits back to the page<br />
table entry when we replace an entry. These bits are the only portion of the TLB<br />
entry that can be changed. Using write-back—that is, copying these entries back at<br />
miss time rather than when they are written—is very efficient, since we expect the<br />
TLB miss rate to be small. Some systems use other techniques to approximate the<br />
reference <strong>and</strong> dirty bits, eliminating the need to write into the TLB except to load<br />
a new table entry on a miss.<br />
Some typical values for a TLB might be<br />
■ TLB size: 16–512 entries<br />
■ Block size: 1–2 page table entries (typically 4–8 bytes each)<br />
■ Hit time: 0.5–1 clock cycle<br />
■ Miss penalty: 10–100 clock cycles<br />
■ Miss rate: 0.01%–1%<br />
Designers have used a wide variety of associativities in TLBs. Some systems use
small, fully associative TLBs because a fully associative mapping has a lower miss<br />
rate; furthermore, since the TLB is small, the cost of a fully associative mapping is<br />
not too high. Other systems use large TLBs, often with small associativity. With<br />
a fully associative mapping, choosing the entry to replace becomes tricky since<br />
implementing a hardware LRU scheme is too expensive. Furthermore, since TLB<br />
misses are much more frequent than page faults <strong>and</strong> thus must be h<strong>and</strong>led more<br />
cheaply, we cannot afford an expensive software algorithm, as we can for page faults.<br />
As a result, many systems provide some support for randomly choosing an entry<br />
to replace. We’ll examine replacement schemes in a little more detail in Section 5.8.
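The behavior described above, a small fully associative TLB with random replacement that writes the reference and dirty bits back to the page table only at miss time, can be sketched as follows. This is a toy model for illustration, not a real MIPS TLB; the dictionary-based structures are assumptions of the sketch:

```python
import random

class TinyTLB:
    """Fully associative TLB sketch with random replacement.
    On eviction, the reference and dirty bits are copied back to the
    page table entry (write-back at miss time), as described above."""

    def __init__(self, size, page_table):
        self.size = size
        self.page_table = page_table   # vpn -> {"ppn": ..., "ref": 0, "dirty": 0}
        self.entries = {}              # cached subset of page table entries

    def translate(self, vpn, is_write=False):
        if vpn not in self.entries:                # TLB miss
            if len(self.entries) >= self.size:     # pick a random victim
                victim = random.choice(list(self.entries))
                evicted = self.entries.pop(victim)
                # Write-back: ref/dirty are the only TLB bits that change.
                self.page_table[victim]["ref"] = evicted["ref"]
                self.page_table[victim]["dirty"] = evicted["dirty"]
            self.entries[vpn] = dict(self.page_table[vpn])   # refill
        entry = self.entries[vpn]
        entry["ref"] = 1               # mark referenced
        if is_write:
            entry["dirty"] = 1         # mark dirty on a write
        return entry["ppn"]
```

Note that the page table copy of the bits goes stale between misses; this is acceptable precisely because the bits are only consulted when the operating system makes replacement decisions, and an approximation suffices there.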
440 Chapter 5 Large <strong>and</strong> Fast: Exploiting Memory Hierarchy<br />
The Intrinsity FastMATH TLB<br />
To see these ideas in a real processor, let’s take a closer look at the TLB of the<br />
Intrinsity FastMATH. The memory system uses 4 KiB pages <strong>and</strong> a 32-bit address<br />
space; thus, the virtual page number is 20 bits long, as in the top of Figure 5.30.<br />
The physical address is the same size as the virtual address. The TLB contains 16<br />
entries, it is fully associative, <strong>and</strong> it is shared between the instruction <strong>and</strong> data<br />
references. Each entry is 64 bits wide <strong>and</strong> contains a 20-bit tag (which is the virtual<br />
page number for that TLB entry), the corresponding physical page number (also 20<br />
bits), a valid bit, a dirty bit, <strong>and</strong> other bookkeeping bits. Like most MIPS systems,<br />
it uses software to handle TLB misses.<br />
Figure 5.30 shows the TLB <strong>and</strong> one of the caches, while Figure 5.31 shows the<br />
steps in processing a read or write request. When a TLB miss occurs, the MIPS<br />
hardware saves the page number of the reference in a special register <strong>and</strong> generates<br />
an exception. The exception invokes the operating system, which handles the miss<br />
in software. To find the physical address for the missing page, the TLB miss routine<br />
indexes the page table using the page number of the virtual address <strong>and</strong> the page<br />
table register, which indicates the starting address of the active process page table.<br />
Using a special set of system instructions that can update the TLB, the operating<br />
system places the physical address from the page table into the TLB. A TLB miss<br />
takes about 13 clock cycles, assuming the code <strong>and</strong> the page table entry are in the<br />
instruction cache <strong>and</strong> data cache, respectively. (We will see the MIPS TLB code<br />
on page 449.) A true page fault occurs if the page table entry does not have a valid<br />
physical address. The hardware maintains an index that indicates the recommended<br />
entry to replace; the recommended entry is chosen randomly.<br />
There is an extra complication for write requests: namely, the write access bit in<br />
the TLB must be checked. This bit prevents the program from writing into pages<br />
for which it has only read access. If the program attempts a write <strong>and</strong> the write<br />
access bit is off, an exception is generated. The write access bit forms part of the<br />
protection mechanism, which we will discuss shortly.<br />
Integrating Virtual Memory, TLBs, and Caches<br />
Our virtual memory <strong>and</strong> cache systems work together as a hierarchy, so that data<br />
cannot be in the cache unless it is present in main memory. The operating system<br />
helps maintain this hierarchy by flushing the contents of any page from the cache<br />
when it decides to migrate that page to disk. At the same time, the OS modifies the<br />
page tables <strong>and</strong> TLB, so that an attempt to access any data on the migrated page<br />
will generate a page fault.<br />
Under the best of circumstances, a virtual address is translated by the TLB <strong>and</strong><br />
sent to the cache where the appropriate data is found, retrieved, <strong>and</strong> sent back to<br />
the processor. In the worst case, a reference can miss in all three components of the<br />
memory hierarchy: the TLB, the page table, <strong>and</strong> the cache. The following example<br />
illustrates these interactions in more detail.
[Figure 5.31 flowchart: a virtual address first accesses the TLB. A TLB miss raises a TLB miss exception; a TLB hit yields the physical address. For a read, the processor tries to read the data from the cache, stalling on a cache miss while the block is read, then delivers the data to the CPU. For a write, the write access bit is checked first; if it is off, a write protection exception is raised. Otherwise the processor tries to write the data to the cache, stalling on a miss while the block is read, then writes the data into the cache, updates the dirty bit, and puts the data and the address into the write buffer.]<br />
FIGURE 5.31 Processing a read or a write-through in the Intrinsity FastMATH TLB <strong>and</strong> cache. If the TLB generates a hit,<br />
the cache can be accessed with the resulting physical address. For a read, the cache generates a hit or miss <strong>and</strong> supplies the data or causes a stall<br />
while the data is brought from memory. If the operation is a write, a portion of the cache entry is overwritten for a hit <strong>and</strong> the data is sent to<br />
the write buffer if we assume write-through. A write miss is just like a read miss except that the block is modified after it is read from memory.<br />
Write-back requires writes to set a dirty bit for the cache block, <strong>and</strong> a write buffer is loaded with the whole block only on a read miss or write<br />
miss if the block to be replaced is dirty. Notice that a TLB hit <strong>and</strong> a cache hit are independent events, but a cache hit can only occur after a TLB<br />
hit occurs, which means that the data must be present in memory. The relationship between TLB misses <strong>and</strong> cache misses is examined further<br />
in the following example <strong>and</strong> the exercises at the end of this chapter.
5.7 Virtual Memory 443<br />
Overall Operation of a Memory Hierarchy<br />
EXAMPLE: In a memory hierarchy like that of Figure 5.30, which includes a TLB and a<br />
cache organized as shown, a memory reference can encounter three different<br />
types of misses: a TLB miss, a page fault, and a cache miss. Consider all<br />
the combinations of these three events with one or more occurring (seven<br />
possibilities). For each possibility, state whether this event can actually occur<br />
and under what circumstances.<br />
ANSWER: Figure 5.32 shows all combinations and whether each is possible in practice.<br />
Elaboration: Figure 5.32 assumes that all memory addresses are translated to<br />
physical addresses before the cache is accessed. In this organization, the cache is<br />
physically indexed <strong>and</strong> physically tagged (both the cache index <strong>and</strong> tag are physical,<br />
rather than virtual, addresses). In such a system, the amount of time to access memory,<br />
assuming a cache hit, must accommodate both a TLB access <strong>and</strong> a cache access; of<br />
course, these accesses can be pipelined.<br />
Alternatively, the processor can index the cache with an address that is completely<br />
or partially virtual. This is called a virtually addressed cache, <strong>and</strong> it uses tags that<br />
are virtual addresses; hence, such a cache is virtually indexed <strong>and</strong> virtually tagged. In<br />
such caches, the address translation hardware (TLB) is unused during the normal cache<br />
access, since the cache is accessed with a virtual address that has not been translated<br />
to a physical address. This takes the TLB out of the critical path, reducing cache latency.<br />
When a cache miss occurs, however, the processor needs to translate the address to a<br />
physical address so that it can fetch the cache block from main memory.<br />
virtually addressed cache: A cache that is accessed with a virtual address rather than a physical address.<br />
TLB  | Page table | Cache | Possible? If so, under what circumstance?<br />
Hit  | Hit        | Miss  | Possible, although the page table is never really checked if TLB hits.<br />
Miss | Hit        | Hit   | TLB misses, but entry found in page table; after retry, data is found in cache.<br />
Miss | Hit        | Miss  | TLB misses, but entry found in page table; after retry, data misses in cache.<br />
Miss | Miss       | Miss  | TLB misses and is followed by a page fault; after retry, data must miss in cache.<br />
Hit  | Miss       | Miss  | Impossible: cannot have a translation in TLB if page is not present in memory.<br />
Hit  | Miss       | Hit   | Impossible: cannot have a translation in TLB if page is not present in memory.<br />
Miss | Miss       | Hit   | Impossible: data cannot be allowed in cache if the page is not in memory.<br />
FIGURE 5.32 The possible combinations of events in the TLB, virtual memory system,<br />
<strong>and</strong> cache. Three of these combinations are impossible, <strong>and</strong> one is possible (TLB hit, virtual memory hit,<br />
cache miss) but never detected.
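The reasoning in Figure 5.32 can be derived mechanically from two hierarchy invariants: a TLB hit implies the page is in memory, and a cache hit implies the page is in memory. A short sketch (Python for illustration) that enumerates the seven combinations and filters them by these invariants:

```python
from itertools import product

def possible(tlb, pt, cache):
    """A combination is feasible only if it respects the invariants:
    a TLB hit or a cache hit each imply a page table hit (page in memory)."""
    if tlb == "hit" and pt == "miss":
        return False
    if cache == "hit" and pt == "miss":
        return False
    return True

# Tuples are (TLB, page table, cache); require at least one miss.
combos = [c for c in product(["hit", "miss"], repeat=3) if "miss" in c]
feasible = [c for c in combos if possible(*c)]
print(len(combos), len(feasible))   # 7 combinations, 4 of them possible
```

The four survivors are exactly the four "possible" rows of Figure 5.32.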
aliasing: A situation in which two addresses access the same object; it can occur in virtual memory when there are two virtual addresses for the same physical page.<br />
physically addressed cache: A cache that is addressed by a physical address.<br />
When the cache is accessed with a virtual address <strong>and</strong> pages are shared between<br />
processes (which may access them with different virtual addresses), there is the<br />
possibility of aliasing. Aliasing occurs when the same object has two names—in this<br />
case, two virtual addresses for the same page. This ambiguity creates a problem, because<br />
a word on such a page may be cached in two different locations, each corresponding<br />
to different virtual addresses. This ambiguity would allow one program to write the data<br />
without the other program being aware that the data had changed. Completely virtually<br />
addressed caches either introduce design limitations on the cache <strong>and</strong> TLB to reduce<br />
aliases or require the operating system, <strong>and</strong> possibly the user, to take steps to ensure<br />
that aliases do not occur.<br />
A common compromise between these two design points is caches that are virtually<br />
indexed—sometimes using just the page-offset portion of the address, which is really<br />
a physical address since it is not translated—but use physical tags. These designs,<br />
which are virtually indexed but physically tagged, attempt to achieve the performance<br />
advantages of virtually indexed caches with the architecturally simpler advantages of a<br />
physically addressed cache. For example, there is no alias problem in this case. Figure<br />
5.30 assumed a 4 KiB page size, but it’s really 16 KiB, so the Intrinsity FastMATH can<br />
use this trick. To pull it off, there must be careful coordination between the minimum<br />
page size, the cache size, <strong>and</strong> associativity.<br />
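The required coordination can be stated precisely: for a virtually indexed, physically tagged cache to use only untranslated page-offset bits as its index, the bytes indexed per way (cache size divided by associativity) must not exceed the page size. A small check, where the parameter combinations below are illustrative, not descriptions of any particular machine:

```python
def vipt_ok(cache_size, associativity, page_size):
    """True if the cache index falls entirely within the page offset,
    so the cache can be indexed in parallel with TLB translation
    and still be physically tagged with no alias problem."""
    return cache_size // associativity <= page_size

print(vipt_ok(16 * 1024, 1, 16 * 1024))  # True: all index bits are offset bits
print(vipt_ok(32 * 1024, 1, 4 * 1024))   # False: index would need translated bits
print(vipt_ok(32 * 1024, 8, 4 * 1024))   # True: associativity shrinks the index
```

This is why designers sometimes raise associativity rather than shrink the cache when they want this trick with small pages.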
Implementing Protection with Virtual Memory<br />
Perhaps the most important function of virtual memory today is to allow sharing of<br />
a single main memory by multiple processes, while providing memory protection<br />
among these processes <strong>and</strong> the operating system. The protection mechanism must<br />
ensure that although multiple processes are sharing the same main memory, one<br />
renegade process cannot write into the address space of another user process or into<br />
the operating system either intentionally or unintentionally. The write access bit in<br />
the TLB can protect a page from being written. Without this level of protection,<br />
computer viruses would be even more widespread.<br />
Hardware/Software Interface<br />
supervisor mode: Also called kernel mode. A mode indicating that a running process is an operating system process.<br />
To enable the operating system to implement protection in the virtual memory<br />
system, the hardware must provide at least the three basic capabilities summarized<br />
below. Note that the first two are the same requirements as needed for virtual<br />
machines (Section 5.6).<br />
1. Support at least two modes that indicate whether the running process is a<br />
user process or an operating system process, variously called a supervisor<br />
process, a kernel process, or an executive process.<br />
2. Provide a portion of the processor state that a user process can read but not<br />
write. This includes the user/supervisor mode bit, which dictates whether<br />
the processor is in user or supervisor mode, the page table pointer, <strong>and</strong> the
TLB. To write these elements, the operating system uses special instructions<br />
that are only available in supervisor mode.<br />
3. Provide mechanisms whereby the processor can go from user mode to<br />
supervisor mode <strong>and</strong> vice versa. The first direction is typically accomplished<br />
by a system call exception, implemented as a special instruction (syscall in<br />
the MIPS instruction set) that transfers control to a dedicated location in<br />
supervisor code space. As with any other exception, the program counter<br />
from the point of the system call is saved in the exception PC (EPC), <strong>and</strong><br />
the processor is placed in supervisor mode. To return to user mode from the<br />
exception, use the return from exception (ERET) instruction, which resets to<br />
user mode <strong>and</strong> jumps to the address in EPC.<br />
By using these mechanisms <strong>and</strong> storing the page tables in the operating system’s<br />
address space, the operating system can change the page tables while preventing a<br />
user process from changing them, ensuring that a user process can access only the<br />
storage provided to it by the operating system.<br />
system call: A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.<br />
We also want to prevent a process from reading the data of another process. For<br />
example, we wouldn’t want a student program to read the grades while they were<br />
in the processor’s memory. Once we begin sharing main memory, we must provide<br />
the ability for a process to protect its data from both reading <strong>and</strong> writing by another<br />
process; otherwise, sharing the main memory will be a mixed blessing!<br />
Remember that each process has its own virtual address space. Thus, if the<br />
operating system keeps the page tables organized so that the independent virtual<br />
pages map to disjoint physical pages, one process will not be able to access another’s<br />
data. Of course, this also requires that a user process be unable to change the page<br />
table mapping. The operating system can assure safety if it prevents the user process<br />
from modifying its own page tables. However, the operating system must be able<br />
to modify the page tables. Placing the page tables in the protected address space of<br />
the operating system satisfies both requirements.<br />
When processes want to share information in a limited way, the operating system<br />
must assist them, since accessing the information of another process requires<br />
changing the page table of the accessing process. The write access bit can be used<br />
to restrict the sharing to just read sharing, <strong>and</strong>, like the rest of the page table, this<br />
bit can be changed only by the operating system. To allow another process, say, P1,<br />
to read a page owned by process P2, P2 would ask the operating system to create<br />
a page table entry for a virtual page in P1’s address space that points to the same<br />
physical page that P2 wants to share. The operating system could use the write<br />
protection bit to prevent P1 from writing the data, if that was P2’s wish. Any bits<br />
that determine the access rights for a page must be included in both the page table<br />
<strong>and</strong> the TLB, because the page table is accessed only on a TLB miss.
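The sharing protocol just described can be sketched as operations on the two processes' page tables. This is a toy model with hypothetical structures (real page table entries carry more fields, and only the operating system may perform this operation):

```python
def share_page(page_tables, owner, owner_vpn, reader, reader_vpn):
    """OS-only operation: map reader_vpn in the reader's address space to
    the same physical page the owner has at owner_vpn, with the write
    access bit off, restricting the sharing to read sharing."""
    ppn = page_tables[owner][owner_vpn]["ppn"]
    page_tables[reader][reader_vpn] = {"ppn": ppn, "write": False}

page_tables = {
    "P2": {7: {"ppn": 42, "write": True}},   # P2 owns physical page 42
    "P1": {},
}
share_page(page_tables, "P2", 7, "P1", 3)
# P1 can now read physical page 42 through its virtual page 3, but not write it.
```

Because the write bit lives in the page table (and its TLB copy), P1 cannot grant itself write access without trapping into the operating system.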
context switch: A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.<br />
Elaboration: When the operating system decides to change from running process<br />
P1 to running process P2 (called a context switch or process switch), it must ensure<br />
that P2 cannot get access to the page tables of P1 because that would compromise<br />
protection. If there is no TLB, it suffices to change the page table register to point to P2’s<br />
page table (rather than to P1’s); with a TLB, we must clear the TLB entries that belong to<br />
P1—both to protect the data of P1 <strong>and</strong> to force the TLB to load the entries for P2. If the<br />
process switch rate were high, this could be quite inefficient. For example, P2 might load<br />
only a few TLB entries before the operating system switched back to P1. Unfortunately,<br />
P1 would then find that all its TLB entries were gone and would have to pay TLB misses<br />
to reload them. This problem arises because the virtual addresses used by P1 <strong>and</strong> P2<br />
are the same, <strong>and</strong> we must clear out the TLB to avoid confusing these addresses.<br />
A common alternative is to extend the virtual address space by adding a process<br />
identifier or task identifier. The Intrinsity FastMATH has an 8-bit address space ID (ASID)<br />
field for this purpose. This small field identifies the currently running process; it is kept<br />
in a register loaded by the operating system when it switches processes. The process<br />
identifier is concatenated to the tag portion of the TLB, so that a TLB hit occurs only if<br />
both the page number and the process identifier match. This combination eliminates the<br />
need to clear the TLB, except on rare occasions.<br />
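The effect of the ASID can be sketched as a TLB whose lookup key includes the process identifier, so two processes using the same virtual page number never collide and nothing need be flushed on a context switch. An illustrative model only, not the FastMATH hardware:

```python
class AsidTLB:
    """TLB keyed by (ASID, virtual page number): a hit requires both
    the page number and the process identifier to match."""

    def __init__(self):
        self.entries = {}      # (asid, vpn) -> ppn
        self.asid = None       # loaded by the OS on a process switch

    def switch_to(self, asid):
        self.asid = asid       # note: no TLB flush needed

    def insert(self, vpn, ppn):
        self.entries[(self.asid, vpn)] = ppn

    def lookup(self, vpn):
        return self.entries.get((self.asid, vpn))   # None models a miss

tlb = AsidTLB()
tlb.switch_to(1)
tlb.insert(5, 100)          # process 1: VPN 5 -> PPN 100
tlb.switch_to(2)
print(tlb.lookup(5))        # None: same VPN, different ASID, so a miss
tlb.switch_to(1)
print(tlb.lookup(5))        # 100: process 1's entry survived the switch
```

When process 1 runs again, its entries are still in the TLB, avoiding the reload misses described above.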
Similar problems can occur for a cache, since on a process switch the cache will<br />
contain data from the running process. These problems arise in different ways for<br />
physically addressed <strong>and</strong> virtually addressed caches, <strong>and</strong> a variety of different solutions,<br />
such as process identifiers, are used to ensure that a process gets its own data.<br />
Handling TLB Misses and Page Faults<br />
Although the translation of virtual to physical addresses with a TLB is<br />
straightforward when we get a TLB hit, as we saw earlier, handling TLB misses and<br />
page faults is more complex. A TLB miss occurs when no entry in the TLB matches<br />
a virtual address. Recall that a TLB miss can indicate one of two possibilities:<br />
1. The page is present in memory, <strong>and</strong> we need only create the missing TLB<br />
entry.<br />
2. The page is not present in memory, <strong>and</strong> we need to transfer control to the<br />
operating system to deal with a page fault.<br />
MIPS traditionally handles a TLB miss in software. It brings in the page table<br />
entry from memory <strong>and</strong> then re-executes the instruction that caused the TLB miss.<br />
Upon re-executing, it will get a TLB hit. If the page table entry indicates the page is<br />
not in memory, this time it will get a page fault exception.<br />
Handling a TLB miss or a page fault requires using the exception mechanism<br />
to interrupt the active process, transferring control to the operating system, <strong>and</strong><br />
later resuming execution of the interrupted process. A page fault will be recognized<br />
sometime during the clock cycle used to access memory. To restart the instruction<br />
after the page fault is h<strong>and</strong>led, the program counter of the instruction that caused<br />
the page fault must be saved. Just as in Chapter 4, the exception program counter<br />
(EPC) is used to hold this value.
In addition, a TLB miss or page fault exception must be asserted by the end<br />
of the same clock cycle that the memory access occurs, so that the next clock<br />
cycle will begin exception processing rather than continue normal instruction<br />
execution. If the page fault was not recognized in this clock cycle, a load instruction<br />
could overwrite a register, <strong>and</strong> this could be disastrous when we try to restart the<br />
instruction. For example, consider the instruction lw $1,0($1): the computer<br />
must be able to prevent the write pipeline stage from occurring; otherwise, it could<br />
not properly restart the instruction, since the contents of $1 would have been<br />
destroyed. A similar complication arises on stores. We must prevent the write into<br />
memory from actually completing when there is a page fault; this is usually done<br />
by deasserting the write control line to the memory.<br />
Between the time we begin executing the exception h<strong>and</strong>ler in the operating<br />
system <strong>and</strong> the time that the operating system has saved all the state of the process,<br />
the operating system is particularly vulnerable. For example, if another exception<br />
occurred when we were processing the first exception in the operating system, the<br />
control unit would overwrite the exception program counter, making it impossible<br />
to return to the instruction that caused the page fault! We can avoid this disaster<br />
by providing the ability to disable <strong>and</strong> enable exceptions. When an exception first<br />
occurs, the processor sets a bit that disables all other exceptions; this could happen<br />
at the same time the processor sets the supervisor mode bit. The operating system<br />
will then save just enough state to allow it to recover if another exception occurs—<br />
namely, the exception program counter (EPC) <strong>and</strong> Cause registers. EPC <strong>and</strong> Cause<br />
are two of the special control registers that help with exceptions, TLB misses, <strong>and</strong><br />
page faults; Figure 5.33 shows the rest. The operating system can then re-enable<br />
exceptions. These steps make sure that exceptions will not cause the processor<br />
to lose any state <strong>and</strong> thereby be unable to restart execution of the interrupting<br />
instruction.<br />
Hardware/Software Interface<br />
exception enable: Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.<br />
Once the operating system knows the virtual address that caused the page fault, it<br />
must complete three steps:<br />
1. Look up the page table entry using the virtual address <strong>and</strong> find the location<br />
of the referenced page on disk.<br />
2. Choose a physical page to replace; if the chosen page is dirty, it must be<br />
written out to disk before we can bring a new virtual page into this physical<br />
page.<br />
3. Start a read to bring the referenced page from disk into the chosen physical<br />
page.
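The three steps above can be sketched in code. This is a toy model of the operating system's page fault path; the data structures, the victim-selection helper, and the dict-based "disk" and "memory" are all hypothetical simplifications:

```python
def pick_victim(page_table):
    """Toy victim choice: first valid entry (a real OS would use an
    approximation of LRU or a clock algorithm)."""
    for vpn, entry in page_table.items():
        if entry.get("valid"):
            return vpn, entry["ppn"]

def handle_page_fault(vpn, page_table, free_ppns, disk, memory):
    """The three steps: locate the page on disk, choose (and if dirty,
    write back) a physical page, then read the referenced page in."""
    disk_block = page_table[vpn]["disk"]            # 1. locate page on disk
    if free_ppns:
        ppn = free_ppns.pop()
    else:                                           # 2. choose a victim page
        victim_vpn, ppn = pick_victim(page_table)
        if page_table[victim_vpn]["dirty"]:
            disk[page_table[victim_vpn]["disk"]] = memory[ppn]  # write back
        page_table[victim_vpn]["valid"] = False
    memory[ppn] = disk[disk_block]                  # 3. read page into memory
    page_table[vpn].update(valid=True, ppn=ppn, dirty=False)
    return ppn
```

Note how the dirty bit pays off in step 2: a clean victim needs no write to disk, halving the transfer cost of the fault.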
The exception invokes the operating system, which handles the miss in software.<br />
Control is transferred to address 8000 0000hex, the location of the TLB miss handler.<br />
To find the physical address for the missing page, the TLB miss routine indexes the<br />
page table using the page number of the virtual address <strong>and</strong> the page table register,<br />
which indicates the starting address of the active process page table. To make this<br />
indexing fast, MIPS hardware places everything you need in the special Context<br />
register: the upper 12 bits have the address of the base of the page table, <strong>and</strong> the<br />
next 18 bits have the virtual address of the missing page. Each page table entry is<br />
one word, so the last 2 bits are 0. Thus, the first two instructions copy the Context<br />
register into the kernel temporary register $k1 <strong>and</strong> then load the page table entry<br />
from that address into $k1. Recall that $k0 <strong>and</strong> $k1 are reserved for the operating<br />
system to use without saving; a major reason for this convention is to make the TLB<br />
miss handler fast. Below is the MIPS code for a typical TLB miss handler:<br />
handler: Name of a software routine invoked to “handle” an exception or interrupt.<br />
TLBmiss:<br />
  mfc0 $k1,Context    # copy address of PTE into temp $k1<br />
  lw   $k1,0($k1)     # put PTE into temp $k1<br />
  mtc0 $k1,EntryLo    # put PTE into special register EntryLo<br />
  tlbwr               # put EntryLo into TLB entry at Random<br />
  eret                # return from TLB miss exception<br />
As shown above, MIPS has a special set of system instructions to update the<br />
TLB. The instruction tlbwr copies from control register EntryLo into the TLB<br />
entry selected by the control register Random. Random implements random<br />
replacement, so it is basically a free-running counter. A TLB miss takes about a<br />
dozen clock cycles.<br />
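The Context register arithmetic described above (upper 12 bits hold the page table base, the next 18 bits hold the missing virtual page number, and the low 2 bits are 0 because each page table entry is one word) can be checked with a little bit manipulation. A sketch of the addressing math only, assuming nothing beyond what the text states:

```python
def context_to_pte_address(page_table_base_12, vpn_18):
    """Build the 32-bit Context value: base in bits 31..20, VPN in
    bits 19..2, zeros in bits 1..0. The result is directly the byte
    address of the page table entry for the missing page."""
    assert page_table_base_12 < (1 << 12) and vpn_18 < (1 << 18)
    return (page_table_base_12 << 20) | (vpn_18 << 2)

addr = context_to_pte_address(0x800, 5)
print(hex(addr))   # 0x80000014: base 0x800 << 20, plus PTE index 5 * 4 bytes
```

Because the hardware pre-assembles this address, the handler needs only the mfc0/lw pair shown above, which is what makes the miss path so short.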
Note that the TLB miss handler does not check to see if the page table entry is<br />
valid. Because the exception for TLB entry missing is much more frequent than<br />
a page fault, the operating system loads the TLB from the page table without<br />
examining the entry <strong>and</strong> restarts the instruction. If the entry is invalid, another<br />
<strong>and</strong> different exception occurs, <strong>and</strong> the operating system recognizes the page fault.<br />
This method makes the frequent case of a TLB miss fast, at a slight performance<br />
penalty for the infrequent case of a page fault.<br />
Once the process that generated the page fault has been interrupted, control<br />
transfers to 8000 0180hex, a different address than the TLB miss handler. This is<br />
the general address for exceptions; the TLB miss has a special entry point to lower the<br />
penalty for a TLB miss. The operating system uses the exception Cause register<br />
to diagnose the cause of the exception. Because the exception is a page fault, the<br />
operating system knows that extensive processing will be required. Thus, unlike a<br />
TLB miss, it saves the entire state of the active process. This state includes all the<br />
general-purpose <strong>and</strong> floating-point registers, the page table address register, the<br />
EPC, <strong>and</strong> the exception Cause register. Since exception h<strong>and</strong>lers do not usually use<br />
the floating-point registers, the general entry point does not save them, leaving that<br />
to the few h<strong>and</strong>lers that need them.
Figure 5.34 sketches the MIPS code of an exception handler. Note that we<br />
save and restore the state in MIPS code, taking care when we enable and disable<br />
exceptions, but we invoke C code to handle the particular exception.<br />
The virtual address that caused the fault depends on whether the fault was an<br />
instruction or data fault. The address of the instruction that generated the fault is<br />
in the EPC. If it was an instruction page fault, the EPC contains the virtual address<br />
of the faulting page; otherwise, the faulting virtual address can be computed by<br />
examining the instruction (whose address is in the EPC) to find the base register<br />
<strong>and</strong> offset field.<br />
unmapped: A portion of the address space that cannot have page faults.<br />
Elaboration: This simplified version assumes that the stack pointer (sp) is valid. To<br />
avoid the problem of a page fault during this low-level exception code, MIPS sets aside<br />
a portion of its address space that cannot have page faults, called unmapped. MIPS<br />
hardware translates virtual addresses 8000 0000hex to BFFF FFFFhex to physical<br />
addresses simply by ignoring the upper bits of the virtual address, thereby placing these<br />
addresses in the low part of physical memory. Thus, the operating system places the<br />
exception entry points and exception stacks in unmapped memory.<br />
Elaboration: The code in Figure 5.34 shows the MIPS-32 exception return sequence.<br />
The older MIPS-I architecture uses rfe <strong>and</strong> jr instead of eret.<br />
Elaboration: For processors with more complex instructions that can touch many<br />
memory locations <strong>and</strong> write many data items, making instructions restartable is much<br />
harder. Processing one instruction may generate a number of page faults in the middle<br />
of the instruction. For example, x86 processors have block move instructions that touch<br />
thous<strong>and</strong>s of data words. In such processors, instructions often cannot be restarted<br />
from the beginning, as we do for MIPS instructions. Instead, the instruction must be<br />
interrupted <strong>and</strong> later continued midstream in its execution. Resuming an instruction in<br />
the middle of its execution usually requires saving some special state, processing the<br />
exception, <strong>and</strong> restoring that special state. Making this work properly requires careful<br />
and detailed coordination between the exception-handling code in the operating system<br />
<strong>and</strong> the hardware.<br />
Elaboration: Rather than pay an extra level of indirection on every memory access, the<br />
VMM maintains a shadow page table that maps directly from the guest virtual address<br />
space to the physical address space of the hardware. By detecting all modifications to<br />
the guest’s page table, the VMM can ensure the shadow page table entries being used<br />
by the hardware for translations correspond to those of the guest OS environment, with<br />
the exception of the correct physical pages substituted for the real pages in the guest<br />
tables. Hence, the VMM must trap any attempt by the guest OS to change its page table<br />
or to access the page table pointer. This is commonly done by write protecting the guest<br />
page tables <strong>and</strong> trapping any access to the page table pointer by a guest OS. As noted<br />
above, the latter happens naturally if accessing the page table pointer is a privileged<br />
operation.
452 Chapter 5 Large and Fast: Exploiting Memory Hierarchy<br />
Elaboration: The final portion of the architecture to virtualize is I/O. This is by far<br />
the most difficult part of system virtualization because of the increasing number of<br />
I/O devices attached to the computer and the increasing diversity of I/O device types.<br />
Another difficulty is the sharing of a real device among multiple VMs, and yet another<br />
comes from supporting the myriad of device drivers that are required, especially if<br />
different guest OSes are supported on the same VM system. The VM illusion can be<br />
maintained by giving each VM generic versions of each type of I/O device driver, and then<br />
leaving it to the VMM to handle real I/O.<br />
Elaboration: In addition to virtualizing the instruction set for a virtual machine,<br />
another challenge is virtualization of virtual memory, as each guest OS in every virtual<br />
machine manages its own set of page tables. To make this work, the VMM separates<br />
the notions of real and physical memory (which are often treated synonymously), and<br />
makes real memory a separate, intermediate level between virtual memory and physical<br />
memory. (Some use the terms virtual memory, physical memory, and machine memory<br />
to name the same three levels.) The guest OS maps virtual memory to real memory<br />
via its page tables, and the VMM page tables map the guest’s real memory to physical<br />
memory. The virtual memory architecture is specified either via page tables, as in IBM<br />
VM/370 and the x86, or via the TLB structure, as in MIPS.<br />
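The two-level mapping described above can be sketched in a few lines of Python. This is an illustrative model, not any real VMM's data structure: the guest's page table maps virtual pages to "real" pages, the VMM maps real pages to physical pages, and the shadow page table of the earlier elaboration is simply the composition of the two.

```python
# Sketch of VMM memory virtualization (all names and numbers illustrative):
# guest page table: virtual page -> real page (what the guest OS believes)
# VMM page table:   real page    -> physical page (where it actually lives)
# The shadow table the hardware uses is their composition.

def build_shadow_table(guest_page_table, vmm_page_table):
    """Compose guest (virtual -> real) with VMM (real -> physical)."""
    shadow = {}
    for virtual_page, real_page in guest_page_table.items():
        shadow[virtual_page] = vmm_page_table[real_page]
    return shadow

# The guest thinks it owns real pages 0..2; the VMM has scattered them.
guest = {0: 2, 1: 0, 7: 1}      # virtual page -> real page
vmm = {0: 55, 1: 42, 2: 10}     # real page -> physical page

shadow = build_shadow_table(guest, vmm)
print(shadow)                   # {0: 10, 1: 55, 7: 42}
```

Whenever the guest modifies its page table (which the VMM detects by write-protecting it), the VMM recomputes the affected shadow entries so the hardware always translates straight from virtual to physical.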
Summary<br />
Virtual memory is the name for the level of memory hierarchy that manages<br />
caching between the main memory and secondary memory. Virtual memory<br />
allows a single program to expand its address space beyond the limits of main<br />
memory. More importantly, virtual memory supports sharing of the main memory<br />
among multiple, simultaneously active processes, in a protected manner.<br />
Managing the memory hierarchy between main memory and disk is challenging<br />
because of the high cost of page faults. Several techniques are used to reduce the<br />
miss rate:<br />
1. Pages are made large to take advantage of spatial locality and to reduce the<br />
miss rate.<br />
2. The mapping between virtual addresses and physical addresses, which is<br />
implemented with a page table, is made fully associative so that a virtual<br />
page can be placed anywhere in main memory.<br />
3. The operating system uses techniques, such as LRU and a reference bit, to<br />
choose which pages to replace.
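Technique 3 above can be sketched concretely. The following is a minimal model of the classic "clock" (second-chance) scheme, one common way an OS approximates LRU with a single reference bit per page; the details here are illustrative, not any particular OS's implementation.

```python
# Clock (second-chance) replacement sketch: hardware sets a page's reference
# bit on each access; the OS sweeps a "hand" around the pages, clearing set
# bits, and evicts the first page whose bit is already clear.

def clock_victim(pages, ref_bits, hand):
    """Return (victim_index, new_hand); clears reference bits as it sweeps."""
    while True:
        if ref_bits[hand]:
            ref_bits[hand] = 0              # second chance: clear bit, move on
            hand = (hand + 1) % len(pages)
        else:
            return hand, (hand + 1) % len(pages)

pages = ["A", "B", "C", "D"]
ref_bits = [1, 0, 1, 1]                     # set by hardware on each access
victim, hand = clock_victim(pages, ref_bits, hand=0)
print(pages[victim])                        # B: first page with a clear bit
```

Page A's bit was set, so it gets a second chance (its bit is cleared); page B, whose bit was already clear, is the victim.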
Writes to secondary memory are expensive, so virtual memory uses a write-back<br />
scheme and also tracks whether a page is unchanged (using a dirty bit) to avoid<br />
writing unchanged pages.<br />
The virtual memory mechanism provides address translation from a virtual<br />
address used by the program to the physical address space used for accessing<br />
memory. This address translation allows protected sharing of the main memory<br />
and provides several additional benefits, such as simplifying memory allocation.<br />
Ensuring that processes are protected from each other requires that only the<br />
operating system can change the address translations, which is implemented by<br />
preventing user programs from changing the page tables. Controlled sharing of<br />
pages among processes can be implemented with the help of the operating system<br />
and access bits in the page table that indicate whether the user program has read or<br />
write access to a page.<br />
If a processor had to access a page table resident in memory to translate every<br />
access, virtual memory would be too expensive, as caches would be pointless!<br />
Instead, a TLB acts as a cache for translations from the page table. Addresses are<br />
then translated from virtual to physical using the translations in the TLB.<br />
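The TLB's role as a cache for translations can be sketched with a toy fully associative TLB (sizes and page-table contents are illustrative): translation first consults the TLB, and only on a TLB miss is the slow in-memory page table walked, with the entry cached for next time.

```python
# Toy address translation through a TLB backed by a page table.
PAGE_SIZE = 4096

def translate(vaddr, tlb, page_table, stats):
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number, page offset
    if vpn in tlb:
        stats["tlb_hits"] += 1
    else:
        stats["tlb_misses"] += 1
        tlb[vpn] = page_table[vpn]          # slow path: walk the page table
    return tlb[vpn] * PAGE_SIZE + offset    # physical page number + offset

page_table = {0: 7, 1: 3}                   # virtual page -> physical page
tlb, stats = {}, {"tlb_hits": 0, "tlb_misses": 0}
translate(100, tlb, page_table, stats)      # miss: walks the page table
translate(200, tlb, page_table, stats)      # hit: same virtual page 0
print(stats)                                # {'tlb_hits': 1, 'tlb_misses': 1}
print(translate(100, tlb, page_table, stats))  # 7*4096 + 100 = 28772
```

Because both accesses fall in virtual page 0, only the first pays the cost of the page-table walk; spatial and temporal locality make such hits the common case.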
Caches, virtual memory, and TLBs all rely on a common set of principles and<br />
policies. The next section discusses this common framework.<br />
Understanding Program Performance<br />
Although virtual memory was invented to enable a small memory to act as a large<br />
one, the performance difference between secondary memory and main memory<br />
means that if a program routinely accesses more virtual memory than it has<br />
physical memory, it will run very slowly. Such a program would be continuously<br />
swapping pages between memory and disk, called thrashing. Thrashing is a disaster<br />
if it occurs, but it is rare. If your program thrashes, the easiest solution is to run it on<br />
a computer with more memory or buy more memory for your computer. A more<br />
complex choice is to re-examine your algorithm and data structures to see if you<br />
can change the locality and thereby reduce the number of pages that your program<br />
uses simultaneously. This set of popular pages is informally called the working set.<br />
A more common performance problem is TLB misses. Since a TLB might<br />
handle only 32–64 page entries at a time, a program could easily see a high TLB<br />
miss rate, as the processor may access less than a quarter mebibyte directly:<br />
64 × 4 KiB = 0.25 MiB. For example, TLB misses are often a challenge for Radix<br />
Sort. To try to alleviate this problem, most computer architectures now support<br />
variable page sizes. For example, in addition to the standard 4 KiB page, MIPS<br />
hardware supports 16 KiB, 64 KiB, 256 KiB, 1 MiB, 4 MiB, 16 MiB, 64 MiB, and<br />
256 MiB pages. Hence, if a program uses large page sizes, it can access more<br />
memory directly without TLB misses.<br />
The practical challenge is getting the operating system to allow programs to<br />
select these larger page sizes. Once again, the more complex solution to reducing<br />
TLB misses is to re-examine the algorithm and data structures to reduce the<br />
working set of pages; given the importance of memory accesses to performance<br />
and the frequency of TLB misses, some programs with large working sets have<br />
been redesigned with that goal.<br />
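The arithmetic above generalizes: the memory a TLB can map without missing (sometimes called its "reach") is simply entries × page size. A two-line check:

```python
# TLB reach = number of entries x page size.
KiB, MiB = 1024, 1024 * 1024

def tlb_reach(entries, page_size):
    return entries * page_size

print(tlb_reach(64, 4 * KiB) / MiB)    # 0.25 MiB with 4 KiB pages
print(tlb_reach(64, 16 * MiB) / MiB)   # 1024.0 MiB with 16 MiB MIPS pages
```

The same 64-entry TLB that covers only a quarter mebibyte of 4 KiB pages covers a full gibibyte of 16 MiB pages, which is why large pages help TLB-bound programs.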
Check Yourself<br />
Match the definitions in the right column to the terms in the left column.<br />
1. L1 cache a. A cache for a cache<br />
2. L2 cache b. A cache for disks<br />
3. Main memory c. A cache for a main memory<br />
4. TLB d. A cache for page table entries<br />
5.8 A Common Framework for Memory Hierarchy<br />
By now, you’ve recognized that the different types of memory hierarchies have a<br />
great deal in common. Although many of the aspects of memory hierarchies differ<br />
quantitatively, many of the policies and features that determine how a hierarchy<br />
functions are similar qualitatively. Figure 5.35 shows how some of the quantitative<br />
characteristics of memory hierarchies can differ. In the rest of this section, we will<br />
discuss the common operational alternatives for memory hierarchies, and how<br />
these determine their behavior. We will examine these policies as a series of four<br />
questions that apply between any two levels of a memory hierarchy, although for<br />
simplicity we will primarily use terminology for caches.<br />
Feature | Typical values for L1 caches | Typical values for L2 caches | Typical values for paged memory | Typical values for a TLB<br />
Total size in blocks | 250–2000 | 2,500–25,000 | 16,000–250,000 | 40–1024<br />
Total size in kilobytes | 16–64 | 125–2000 | 1,000,000–1,000,000,000 | 0.25–16<br />
Block size in bytes | 16–64 | 64–128 | 4000–64,000 | 4–32<br />
Miss penalty in clocks | 10–25 | 100–1000 | 10,000,000–100,000,000 | 10–1000<br />
Miss rates (global for L2) | 2%–5% | 0.1%–2% | 0.00001%–0.0001% | 0.01%–2%<br />
FIGURE 5.35 The key quantitative design parameters that characterize the major elements of memory hierarchy in a<br />
computer. These are typical values for these levels as of 2012. Although the range of values is wide, this is partially because many of the values<br />
that have shifted over time are related; for example, as caches become larger to overcome larger miss penalties, block sizes also grow. While not<br />
shown, server microprocessors today also have L3 caches, which can be 2 to 8 MiB and contain many more blocks than L2 caches. L3 caches<br />
lower the L2 miss penalty to 30 to 40 clock cycles.
implementation, such as whether the cache is on-chip, the technology used for<br />
implementing the cache, and the critical role of cache access time in determining<br />
the processor cycle time.<br />
Question 3: Which Block Should Be Replaced on a Cache Miss?<br />
When a miss occurs in an associative cache, we must decide which block to replace.<br />
In a fully associative cache, all blocks are candidates for replacement. If the cache is<br />
set associative, we must choose among the blocks in the set. Of course, replacement<br />
is easy in a direct-mapped cache because there is only one candidate.<br />
There are two primary strategies for replacement in set-associative or fully<br />
associative caches:<br />
■ Random: Candidate blocks are randomly selected, possibly using some hardware<br />
assistance. For example, MIPS supports random replacement for TLB misses.<br />
■ Least recently used (LRU): The block replaced is the one that has been unused<br />
for the longest time.<br />
In practice, LRU is too costly to implement for hierarchies with more than a small<br />
degree of associativity (two to four, typically), since tracking the usage information<br />
is costly. Even for four-way set associativity, LRU is often approximated—for<br />
example, by keeping track of which pair of blocks is LRU (which requires 1 bit),<br />
and then tracking which block in each pair is LRU (which requires 1 bit per pair).<br />
For larger associativity, either LRU is approximated or random replacement is<br />
used. In caches, the replacement algorithm is in hardware, which means that the<br />
scheme should be easy to implement. Random replacement is simple to build in<br />
hardware, and for a two-way set-associative cache, random replacement has a miss<br />
rate about 1.1 times higher than LRU replacement. As the caches become larger, the<br />
miss rate for both replacement strategies falls, and the absolute difference becomes<br />
small. In fact, random replacement can sometimes be better than the simple LRU<br />
approximations that are easily implemented in hardware.<br />
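The three-bit approximation just described for a four-way set can be sketched directly (class and field names are illustrative): one bit records which pair of ways was least recently used, and one bit per pair records the LRU way within that pair.

```python
# Pseudo-LRU for one four-way set, using the 1 + 2 bits described in the text.
class PseudoLRU4:
    def __init__(self):
        self.pair_lru = 0        # which pair (ways 0-1 vs. ways 2-3) is LRU
        self.in_pair_lru = [0, 0]  # LRU way within each pair

    def victim(self):
        """Way to replace: the LRU way of the LRU pair."""
        pair = self.pair_lru
        return 2 * pair + self.in_pair_lru[pair]

    def touch(self, way):
        """Record an access: point both bits away from the used way."""
        pair, which = divmod(way, 2)
        self.pair_lru = 1 - pair            # the other pair is now LRU
        self.in_pair_lru[pair] = 1 - which  # the other way in this pair is LRU

plru = PseudoLRU4()
for way in [0, 2, 3]:        # ways 0, 2, 3 are accessed; way 1 never is
    plru.touch(way)
print(plru.victim())         # 1: the untouched way, as true LRU would pick
```

For this access sequence the approximation agrees with true LRU, but with only three bits per set it cannot always do so, which is the trade-off the text describes.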
In virtual memory, some form of LRU is always approximated, since even a tiny<br />
reduction in the miss rate can be important when the cost of a miss is enormous.<br />
Reference bits or equivalent functionality are often provided to make it easier for<br />
the operating system to track a set of less recently used pages. Because misses are<br />
so expensive <strong>and</strong> relatively infrequent, approximating this information primarily<br />
in software is acceptable.<br />
Question 4: What Happens on a Write?<br />
A key characteristic of any memory hierarchy is how it deals with writes. We have<br />
already seen the two basic options:<br />
■ Write-through: The information is written to both the block in the cache and<br />
the block in the lower level of the memory hierarchy (main memory for a<br />
cache). The caches in Section 5.3 used this scheme.
458 Chapter 5 Large <strong>and</strong> Fast: Exploiting Memory Hierarchy<br />
■ Write-back: The information is written only to the block in the cache. The<br />
modified block is written to the lower level of the hierarchy only when it<br />
is replaced. Virtual memory systems always use write-back, for the reasons<br />
discussed in Section 5.7.<br />
Both write-back and write-through have their advantages. The key advantages of<br />
write-back are the following:<br />
■ Individual words can be written by the processor at the rate that the cache,<br />
rather than the memory, can accept them.<br />
■ Multiple writes within a block require only one write to the lower level in the<br />
hierarchy.<br />
■ When blocks are written back, the system can make effective use of a high-bandwidth<br />
transfer, since the entire block is written.<br />
Write-through has these advantages:<br />
■ Misses are simpler and cheaper because they never require a block to be<br />
written back to the lower level.<br />
■ Write-through is easier to implement than write-back, although to be<br />
practical, a write-through cache will still need to use a write buffer.<br />
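The traffic trade-off between the two policies can be made concrete with a deliberately tiny model (a counting sketch, not a full cache simulator): write-through sends every store to the lower level, while write-back writes a dirty block once, when it is evicted.

```python
# Count lower-level writes for n stores to the same cached block.
def memory_writes(stores_to_block, policy):
    if policy == "write-through":
        return stores_to_block               # one memory write per store
    elif policy == "write-back":
        return 1 if stores_to_block else 0   # one write-back at eviction
    raise ValueError(policy)

print(memory_writes(10, "write-through"))    # 10
print(memory_writes(10, "write-back"))       # 1
```

Ten stores to one block cost ten memory writes under write-through but a single block write-back under write-back, which is why a write-through cache needs a write buffer to hide that traffic.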
The BIG Picture<br />
Caches, TLBs, and virtual memory may initially look very different, but<br />
they rely on the same two principles of locality, and they can be understood<br />
by their answers to four questions:<br />
Question 1: Where can a block be placed?<br />
Answer: One place (direct mapped), a few places (set associative), or any place<br />
(fully associative).<br />
Question 2: How is a block found?<br />
Answer: There are four methods: indexing (as in a direct-mapped cache), limited<br />
search (as in a set-associative cache), full search (as in a fully associative cache),<br />
and a separate lookup table (as in a page table).<br />
Question 3: What block is replaced on a miss?<br />
Answer: Typically, either the least recently used or a random block.<br />
Question 4: How are writes handled?<br />
Answer: Each level in the hierarchy can use either write-through or write-back.
In virtual memory systems, only a write-back policy is practical because of the long<br />
latency of a write to the lower level of the hierarchy. The rate at which writes are<br />
generated by a processor generally exceeds the rate at which the memory system can<br />
process them, even allowing for physically and logically wider memories and burst<br />
modes for DRAM. Consequently, today's lowest-level caches typically use write-back.<br />
The Three Cs: An Intuitive Model for Understanding the<br />
Behavior of Memory Hierarchies<br />
In this subsection, we look at a model that provides insight into the sources of<br />
misses in a memory hierarchy and how the misses will be affected by changes<br />
in the hierarchy. We will explain the ideas in terms of caches, although the ideas<br />
carry over directly to any other level in the hierarchy. In this model, all misses are<br />
classified into one of three categories (the three Cs):<br />
■ Compulsory misses: These are cache misses caused by the first access to<br />
a block that has never been in the cache. These are also called cold-start<br />
misses.<br />
■ Capacity misses: These are cache misses caused when the cache cannot<br />
contain all the blocks needed during execution of a program. Capacity misses<br />
occur when blocks are replaced and then later retrieved.<br />
■ Conflict misses: These are cache misses that occur in set-associative or<br />
direct-mapped caches when multiple blocks compete for the same set.<br />
Conflict misses are those misses in a direct-mapped or set-associative cache<br />
that are eliminated in a fully associative cache of the same size. These cache<br />
misses are also called collision misses.<br />
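The three definitions above suggest a direct way to measure the categories for an address trace, sketched below under simplifying assumptions (single-word blocks, an LRU fully associative reference cache): a miss is compulsory if the block was never referenced before, a capacity miss if even a fully associative cache of the same size would miss, and a conflict miss otherwise.

```python
from collections import OrderedDict

def classify_misses(trace, cache_blocks, real_cache_miss):
    """real_cache_miss(i) reports whether the actual cache missed access i."""
    seen, fa = set(), OrderedDict()   # fa models a fully associative LRU cache
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for i, block in enumerate(trace):
        fa_hit = block in fa
        if fa_hit:
            fa.move_to_end(block)     # mark most recently used
        else:
            fa[block] = None
            if len(fa) > cache_blocks:
                fa.popitem(last=False)  # evict the LRU block
        if real_cache_miss(i):
            if block not in seen:
                counts["compulsory"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
        seen.add(block)
    return counts

def direct_mapped_misses(trace, num_blocks):
    """Miss/hit outcome per access for a toy direct-mapped cache."""
    sets, misses = [None] * num_blocks, []
    for block in trace:
        idx = block % num_blocks
        misses.append(sets[idx] != block)
        sets[idx] = block
    return misses

trace = [0, 2, 0, 2, 1, 0]            # blocks 0 and 2 conflict in a 2-block cache
misses = direct_mapped_misses(trace, 2)
print(classify_misses(trace, 2, lambda i: misses[i]))
# {'compulsory': 3, 'capacity': 1, 'conflict': 2}
```

Blocks 0 and 2 map to the same set, so their alternating accesses miss in the direct-mapped cache even though a fully associative cache of the same size would hit: those are the conflict misses.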
Figure 5.37 shows how the miss rate divides into the three sources. These sources of<br />
misses can be directly attacked by changing some aspect of the cache design. Since<br />
conflict misses arise directly from contention for the same cache block, increasing<br />
associativity reduces conflict misses. Associativity, however, may slow access time,<br />
leading to lower overall performance.<br />
Capacity misses can easily be reduced by enlarging the cache; indeed, second-level<br />
caches have been growing steadily larger for many years. Of course, when we<br />
make the cache larger, we must also be careful about increasing the access time,<br />
which could lead to lower overall performance. Thus, first-level caches have been<br />
growing slowly, if at all.<br />
Because compulsory misses are generated by the first reference to a block, the<br />
primary way for the cache system to reduce the number of compulsory misses is<br />
to increase the block size. This will reduce the number of references required to<br />
touch each block of the program once, because the program will consist of fewer<br />
three Cs model A cache model in which all cache misses are classified into one of<br />
three categories: compulsory misses, capacity misses, and conflict misses.<br />
compulsory miss Also called cold-start miss. A cache miss caused by the first<br />
access to a block that has never been in the cache.<br />
capacity miss A cache miss that occurs because the cache, even with full<br />
associativity, cannot contain all the blocks needed to satisfy the request.<br />
conflict miss Also called collision miss. A cache miss that occurs in a<br />
set-associative or direct-mapped cache when multiple blocks compete for the<br />
same set and that are eliminated in a fully associative cache of the same size.
■ Write-back using write allocate<br />
■ Block size is 4 words (16 bytes or 128 bits)<br />
■ Cache size is 16 KiB, so it holds 1024 blocks<br />
■ 32-bit addresses<br />
■ The cache includes a valid bit and dirty bit per block<br />
From Section 5.3, we can now calculate the fields of an address for the cache:<br />
■ Cache index is 10 bits<br />
■ Block offset is 4 bits<br />
■ Tag size is 32 − (10 + 4), or 18 bits<br />
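The field arithmetic above can be checked, and an address decomposed, in a few lines (a small sketch for these particular parameters):

```python
# Address fields for a 16 KiB cache with 16-byte blocks and 32-bit addresses.
ADDRESS_BITS, BLOCK_BYTES, NUM_BLOCKS = 32, 16, 1024
offset_bits = BLOCK_BYTES.bit_length() - 1      # log2(16) = 4
index_bits = NUM_BLOCKS.bit_length() - 1        # log2(1024) = 10
tag_bits = ADDRESS_BITS - index_bits - offset_bits
print(offset_bits, index_bits, tag_bits)        # 4 10 18

def split_address(addr):
    """Return (tag, index, offset) for a 32-bit byte address."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> offset_bits) & (NUM_BLOCKS - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

Reassembling the three fields (tag << 14 | index << 4 | offset) recovers the original address, confirming the fields partition all 32 bits.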
The signals between the processor and the cache are<br />
■ 1-bit Read or Write signal<br />
■ 1-bit Valid signal, saying whether there is a cache operation or not<br />
■ 32-bit address<br />
■ 32-bit data from processor to cache<br />
■ 32-bit data from cache to processor<br />
■ 1-bit Ready signal, saying the cache operation is complete<br />
The interface between the memory and the cache has the same fields as between<br />
the processor and the cache, except that the data fields are now 128 bits wide. The<br />
extra memory width is generally found in microprocessors today, which deal with<br />
either 32-bit or 64-bit words in the processor while the DRAM controller is often<br />
128 bits. Making the cache block match the width of the DRAM simplifies the<br />
design. Here are the signals:<br />
■ 1-bit Read or Write signal<br />
■ 1-bit Valid signal, saying whether there is a memory operation or not<br />
■ 32-bit address<br />
■ 128-bit data from cache to memory<br />
■ 128-bit data from memory to cache<br />
■ 1-bit Ready signal, saying the memory operation is complete<br />
Note that the interface to memory is not a fixed number of cycles. We assume a<br />
memory controller that will notify the cache via the Ready signal when the memory<br />
read or write is finished.<br />
Before describing the cache controller, we need to review finite-state machines,<br />
which allow us to control an operation that can take multiple clock cycles.
[Figure 5.39 is a block diagram: a block of combinational control logic feeds a state register holding the current state; the inputs are the current state and inputs from the cache datapath, and the outputs are the datapath control outputs and the next state, which is fed back into the state register.]<br />
FIGURE 5.39 Finite-state machine controllers are typically implemented using a block of<br />
combinational logic and a register to hold the current state. The outputs of the combinational<br />
logic are the next-state number and the control signals to be asserted for the current state. The inputs to the<br />
combinational logic are the current state and any inputs used to determine the next state. Notice that in the<br />
finite-state machine used in this chapter, the outputs depend only on the current state, not on the inputs. The<br />
Elaboration explains this in more detail.<br />
needed early in the clock cycle, do not depend on the inputs, but only on the current<br />
state. In Appendix B, when the implementation of this finite-state machine is taken down<br />
to logic gates, the size advantage can be clearly seen. The potential disadvantage of a<br />
Moore machine is that it may require additional states. For example, in situations where<br />
there is a one-state difference between two sequences of states, the Mealy machine<br />
may unify the states by making the outputs depend on the inputs.<br />
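The Moore property is easy to see in a skeleton controller (the states and transitions below are simplified for illustration and cover only the two states named in the text): outputs are a function of the current state alone, while the next state depends on both the state and the inputs.

```python
# Minimal Moore-machine skeleton for a cache-controller-like FSM.
MOORE_OUTPUTS = {                     # outputs depend only on current state
    "Idle": [],
    "CompareTag": ["compare_tag", "set_valid"],
}

def next_state(state, inputs):
    """Next state depends on current state and inputs."""
    if state == "Idle":
        return "CompareTag" if inputs["valid_request"] else "Idle"
    if state == "CompareTag":
        # Simplified: the full controller would move to separate write-back
        # and allocate states on a miss rather than looping here.
        return "Idle" if inputs["hit"] else "CompareTag"
    raise ValueError(state)

state = "Idle"
state = next_state(state, {"valid_request": True, "hit": False})
print(state, MOORE_OUTPUTS[state])   # CompareTag ['compare_tag', 'set_valid']
```

A Mealy machine would instead compute the output list from (state, inputs) pairs, which is what lets it merge nearly identical states at the cost of outputs arriving later in the clock cycle.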
FSM for a Simple Cache Controller<br />
Figure 5.40 shows the four states of our simple cache controller:<br />
■ Idle: This state waits for a valid read or write request from the processor,<br />
which moves the FSM to the Compare Tag state.<br />
■ Compare Tag: As the name suggests, this state tests to see if the requested read<br />
or write is a hit or a miss. The index portion of the address selects the tag to<br />
be compared. If the data in the cache block referred to by the index portion<br />
of the address is valid, and the tag portion of the address matches the tag,<br />
then it is a hit. Either the data is read from the selected word if it is a load or<br />
written to the selected word if it is a store. The Cache Ready signal is then
■ Replication: When shared data are being simultaneously read, the caches<br />
make a copy of the data item in the local cache. Replication reduces both<br />
latency of access and contention for a read-shared data item.<br />
Supporting migration and replication is critical to performance in accessing<br />
shared data, so many multiprocessors introduce a hardware protocol to maintain<br />
coherent caches. The protocols to maintain coherence for multiple processors are<br />
called cache coherence protocols. Key to implementing a cache coherence protocol<br />
is tracking the state of any sharing of a data block.<br />
The most popular cache coherence protocol is snooping. Every cache that has a<br />
copy of the data from a block of physical memory also has a copy of the sharing<br />
status of the block, but no centralized state is kept. The caches are all accessible via<br />
some broadcast medium (a bus or network), and all cache controllers monitor or<br />
snoop on the medium to determine whether or not they have a copy of a block that<br />
is requested on a bus or switch access.<br />
In the following section we explain snooping-based cache coherence as<br />
implemented with a shared bus, but any communication medium that broadcasts<br />
cache misses to all processors can be used to implement a snooping-based<br />
coherence scheme. This broadcasting to all caches makes snooping protocols<br />
simple to implement but also limits their scalability.<br />
Snooping Protocols<br />
One method of enforcing coherence is to ensure that a processor has exclusive<br />
access to a data item before it writes that item. This style of protocol is called a write<br />
invalidate protocol because it invalidates copies in other caches on a write. Exclusive<br />
access ensures that no other readable or writable copies of an item exist when the<br />
write occurs: all other cached copies of the item are invalidated.<br />
Figure 5.42 shows an example of an invalidation protocol for a snooping bus<br />
with write-back caches in action. To see how this protocol ensures coherence,<br />
consider a write followed by a read by another processor: since the write requires<br />
exclusive access, any copy held by the reading processor must be invalidated (hence<br />
the protocol name). Thus, when the read occurs, it misses in the cache, and the<br />
cache is forced to fetch a new copy of the data. For a write, we require that the<br />
writing processor have exclusive access, preventing any other processor from being<br />
able to write simultaneously. If two processors do attempt to write the same data<br />
simultaneously, one of them wins the race, causing the other processor’s copy to be<br />
invalidated. For the other processor to complete its write, it must obtain a new copy<br />
of the data, which must now contain the updated value. Therefore, this protocol<br />
also enforces write serialization.
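The write-followed-by-read scenario just described can be reproduced with a toy model of a single block under write invalidation (deliberately simplified: write-back caches, memory updated when a dirty block becomes shared, class names illustrative).

```python
# Toy write-invalidate protocol for one memory block shared over a bus.
class Bus:
    def __init__(self):
        self.caches, self.memory = [], 0

    def read_miss(self, requester):
        for c in self.caches:       # a dirty cache supplies the latest value
            if c is not requester and c.value is not None and c.dirty:
                self.memory = c.value   # memory updated when block is shared
                c.dirty = False
        return self.memory

    def invalidate(self, writer):
        for c in self.caches:       # remove all other cached copies
            if c is not writer:
                c.value, c.dirty = None, False

class Cache:
    def __init__(self, bus):
        self.bus, self.value, self.dirty = bus, None, False
        bus.caches.append(self)

    def read(self):
        if self.value is None:                     # cache miss on the bus
            self.value = self.bus.read_miss(self)
        return self.value

    def write(self, v):
        self.bus.invalidate(self)                  # invalidation on the bus
        self.value, self.dirty = v, True

bus = Bus()
a, b = Cache(bus), Cache(bus)
print(a.read(), b.read())     # 0 0: both miss and read memory's initial value
a.write(1)                    # A writes 1; B's copy is invalidated
print(b.read(), bus.memory)   # 1 1: A supplies the value, memory is updated
```

The sequence of calls matches Figure 5.42: B's second read misses (its copy was invalidated), A responds with the new value, and memory is brought up to date as the block becomes shared.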
Processor activity | Bus activity | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory location X<br />
(initial state) | | | | 0<br />
CPU A reads X | Cache miss for X | 0 | | 0<br />
CPU B reads X | Cache miss for X | 0 | 0 | 0<br />
CPU A writes a 1 to X | Invalidation for X | 1 | | 0<br />
CPU B reads X | Cache miss for X | 1 | 1 | 1<br />
FIGURE 5.42 An example of an invalidation protocol working on a snooping bus for a<br />
single cache block (X) with write-back caches. We assume that neither cache initially holds X<br />
and that the value of X in memory is 0. The CPU and memory contents show the value after the processor<br />
and bus activity have both completed. A blank indicates no activity or no copy cached. When the second<br />
miss by B occurs, CPU A responds with the value, canceling the response from memory. In addition, both<br />
the contents of B’s cache and the memory contents of X are updated. This update of memory, which occurs<br />
when a block becomes shared, simplifies the protocol, but it is possible to track the ownership and force the<br />
write-back only if the block is replaced. This requires the introduction of an additional state called “owner,”<br />
which indicates that a block may be shared, but the owning processor is responsible for updating any other<br />
processors and memory when it changes the block or replaces it.<br />
Hardware/Software Interface<br />
One insight is that block size plays an important role in cache coherency. For<br />
example, take the case of snooping on a cache with a block size of eight words,<br />
with a single word alternately written and read by two processors. Most protocols<br />
exchange full blocks between processors, thereby increasing coherency bandwidth<br />
demands.<br />
Large blocks can also cause what is called false sharing: when two unrelated<br />
shared variables are located in the same cache block, the full block is exchanged<br />
between processors even though the processors are accessing different variables.<br />
Programmers and compilers should lay out data carefully to avoid false sharing.<br />
false sharing When two unrelated shared variables are located in the same<br />
cache block and the full block is exchanged between processors even though the<br />
processors are accessing different variables.<br />
Elaboration: Although the three properties on pages 466 and 467 are sufficient to<br />
ensure coherence, the question of when a written value will be seen is also important. To<br />
see why, observe that we cannot require that a read of X in Figure 5.41 instantaneously<br />
see the value written for X by some other processor. If, for example, a write of X on one<br />
processor precedes a read of X on another processor by a very short time, it may be<br />
impossible to ensure that the read returns the value of the data written, since the written<br />
data may not even have left the processor at that point. The issue of exactly when a<br />
written value must be seen by a reader is defined by a memory consistency model.
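The layout advice on false sharing can be turned into a simple check, sketched below with invented names and offsets: given each variable's byte offset and the processor that uses it, flag pairs of variables used by different processors that land in the same cache block.

```python
# Detect potential false sharing in a data layout (illustrative sketch).
BLOCK = 64  # assumed cache block size in bytes

def false_sharing_pairs(layout, owner):
    """layout: name -> byte offset; owner: name -> processor id."""
    pairs = []
    names = sorted(layout)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            same_block = layout[x] // BLOCK == layout[y] // BLOCK
            if same_block and owner[x] != owner[y]:
                pairs.append((x, y))
    return pairs

layout = {"a_counter": 0, "b_counter": 8, "b_other": 128}
owner = {"a_counter": 0, "b_counter": 1, "b_other": 1}
print(false_sharing_pairs(layout, owner))   # [('a_counter', 'b_counter')]
```

Here `a_counter` and `b_counter` are eight bytes apart, so they share a 64-byte block despite belonging to different processors; padding or regrouping per-processor data into separate blocks removes the pair.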
5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies<br />
In this section, we will look at the memory hierarchy of the same two microprocessors<br />
described in Chapter 4: the ARM Cortex-A8 and Intel Core i7. This section is based<br />
on Section 2.6 of Computer Architecture: A Quantitative Approach, 5th edition.<br />
Figure 5.43 summarizes the address sizes and TLBs of the two processors. Note<br />
that the A8 has two TLBs with a 32-bit virtual address space and a 32-bit physical<br />
address space. The Core i7 has three TLBs with a 48-bit virtual address and a 44-bit<br />
physical address. Although the 64-bit registers of the Core i7 could hold a larger<br />
virtual address, there was no software need for such a large space, and 48-bit virtual<br />
addresses shrink both the page table memory footprint and the TLB hardware.<br />
Figure 5.44 shows their caches. Keep in mind that the A8 has just one processor<br />
or core while the Core i7 has four. Both have identically organized 32 KiB, 4-way<br />
set associative, L1 instruction caches (per core) with 64-byte blocks. The A8 uses the<br />
same design for its data cache, while the Core i7 keeps everything the same except the<br />
associativity, which it increases to 8-way. Both use an 8-way set associative unified<br />
L2 cache (per core) with 64-byte blocks, although the A8 varies in size from 128 KiB<br />
to 1 MiB while the Core i7 is fixed at 256 KiB. As the Core i7 is used for servers, it<br />
Characteristic      ARM Cortex-A8                        Intel Core i7
Virtual address     32 bits                              48 bits
Physical address    32 bits                              44 bits
Page size           Variable: 4, 16, 64 KiB, 1, 16 MiB   Variable: 4 KiB, 2/4 MiB
TLB organization    1 TLB for instructions and           1 TLB for instructions and 1 TLB for
                    1 TLB for data                       data per core
                    Both TLBs are fully associative,     Both L1 TLBs are four-way set
                    with 32 entries, round-robin         associative, LRU replacement
                    replacement                          L1 I-TLB has 128 entries for small
                    TLB misses handled in hardware       pages, 7 per thread for large pages
                                                         L1 D-TLB has 64 entries for small
                                                         pages, 32 for large pages
                                                         The L2 TLB is four-way set
                                                         associative, LRU replacement
                                                         The L2 TLB has 512 entries
                                                         TLB misses handled in hardware
FIGURE 5.43 Address translation and TLB hardware for the ARM Cortex-A8 and Intel
Core i7 920. Both processors provide support for large pages, which are used for things like the operating
system or mapping a frame buffer. The large-page scheme avoids using a large number of entries to map a
single object that is always present.
5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies 473
advantage of this capability, but large servers and multiprocessors often have
memory systems capable of handling more than one outstanding miss in parallel.
The Core i7 has a prefetch mechanism for data accesses. It looks at a pattern
of data misses and uses this information to try to predict the next address, so it
can start fetching the data before the miss occurs. Such techniques generally work
best when accessing arrays in loops.
The sophisticated memory hierarchies of these chips and the large fraction of
the dies dedicated to caches and TLBs show the significant design effort expended
to try to close the gap between processor cycle times and memory latency.
Performance of the A8 and Core i7 Memory Hierarchies
The memory hierarchy of the Cortex-A8 was simulated with a 1 MiB eight-way
set associative L2 cache using the integer Minnespec benchmarks. As mentioned
in Chapter 4, Minnespec is a set of benchmarks consisting of the SPEC2000
benchmarks but with different inputs that reduce the running times by several
orders of magnitude. Although the use of smaller inputs does not change the
instruction mix, it does affect the cache behavior. For example, on mcf, the most
memory-intensive SPEC2000 integer benchmark, Minnespec has a miss rate for a
32 KiB cache that is only 65% of the miss rate for the full SPEC2000 version. For
a 1 MiB cache the difference is a factor of six! For this reason, one cannot compare
the Minnespec benchmarks against the SPEC2000 benchmarks, much less the even
larger SPEC2006 benchmarks used for the Core i7 in Figure 5.47. Instead, the data
are useful for looking at the relative impact of L1 and L2 misses on overall CPI,
which we used in Chapter 4.
The A8 instruction cache miss rates for these benchmarks (and also for the
full SPEC2000 versions on which Minnespec is based) are very small even for
just the L1: close to zero for most and under 1% for all of them. This low rate
probably results from the computationally intensive nature of the SPEC programs
and the four-way set associative cache that eliminates most conflict misses. Figure
5.45 shows the data cache results for the A8, which have significant L1 and L2
miss rates. The L1 miss penalty for a 1 GHz Cortex-A8 is 11 clock cycles, while
the L2 miss penalty is assumed to be 60 clock cycles. Using these miss penalties,
Figure 5.46 shows the average miss penalty per data access.
Figure 5.47 shows the miss rates for the caches of the Core i7 using the SPEC2006
benchmarks. The L1 instruction cache miss rate varies from 0.1% to 1.8%,
averaging just over 0.4%. This rate is in keeping with other studies of instruction
cache behavior for the SPEC CPU2006 benchmarks, which show low instruction
cache miss rates. With L1 data cache miss rates running 5% to 10%, and sometimes
higher, the importance of the L2 and L3 caches should be obvious. Since the cost
of a miss to memory is over 100 cycles and the average data miss rate in L2 is 4%,
L3 is obviously critical. Assuming about half the instructions are loads or stores,
without L3 the L2 cache misses could add two cycles per instruction to the CPI! In
comparison, the average L3 data miss rate of 1% is still significant, but it is four times
lower than the L2 miss rate and six times less than the L1 miss rate.
476 Chapter 5 Large and Fast: Exploiting Memory Hierarchy
1  #include <x86intrin.h>
2  #define UNROLL (4)
3  #define BLOCKSIZE 32
4  void do_block (int n, int si, int sj, int sk,
5                 double *A, double *B, double *C)
6  {
7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
9        __m256d c[4];
10       for ( int x = 0; x < UNROLL; x++ )
11         c[x] = _mm256_load_pd(C+i+x*4+j*n);
12         /* c[x] = C[i][j] */
13       for( int k = sk; k < sk+BLOCKSIZE; k++ )
14       {
15         __m256d b = _mm256_broadcast_sd(B+k+j*n);
16         /* b = B[k][j] */
17         for (int x = 0; x < UNROLL; x++)
18           c[x] = _mm256_add_pd(c[x], /* c[x]+=A[i][k]*b */
19                  _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20       }
21
22       for ( int x = 0; x < UNROLL; x++ )
23         _mm256_store_pd(C+i+x*4+j*n, c[x]);
24         /* C[i][j] = c[x] */
25     }
26 }
27
28 void dgemm (int n, double* A, double* B, double* C)
29 {
30   for ( int sj = 0; sj < n; sj += BLOCKSIZE )
31     for ( int si = 0; si < n; si += BLOCKSIZE )
32       for ( int sk = 0; sk < n; sk += BLOCKSIZE )
33         do_block(n, si, sj, sk, A, B, C);
34 }
FIGURE 5.48 Optimized C version of DGEMM from Figure 4.80 using cache blocking. These changes
are the same ones found in Figure 5.21. The assembly language produced by the compiler for the do_block function
is nearly identical to Figure 4.81. Once again, there is no overhead in calling do_block because the compiler inlines
the function call.
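For readers without AVX hardware, the blocking idea can be separated from the subword parallelism. The sketch below is a plain-C version of the same tiling; `do_block_scalar` and `dgemm_scalar` are our names, not the book's, and it omits the unrolling and intrinsics of Figure 5.48 while keeping the identical loop structure over BLOCKSIZE × BLOCKSIZE tiles.

```c
#define BLOCKSIZE 32

/* Scalar version of the blocking in Figure 5.48, without the AVX
   intrinsics or unrolling: each call works on one BLOCKSIZE x BLOCKSIZE
   tile of the column-major matrices A, B, and C. */
static void do_block_scalar(int n, int si, int sj, int sk,
                            double *A, double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE; i++)
        for (int j = sj; j < sj + BLOCKSIZE; j++) {
            double cij = C[i + j * n];               /* cij = C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; k++)
                cij += A[i + k * n] * B[k + j * n];  /* cij += A[i][k]*B[k][j] */
            C[i + j * n] = cij;
        }
}

void dgemm_scalar(int n, double *A, double *B, double *C) {
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block_scalar(n, si, sj, sk, A, B, C);
}
```

Because the tiling alone already keeps each block of A, B, and C resident in the cache while it is reused, this scalar version captures the cache-miss reduction of Figure 5.48 without its subword parallelism (n is assumed to be a multiple of BLOCKSIZE, as in the figure).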
5.14 Going Faster: Cache Blocking and Matrix Multiply 477
of A, B, and C. Indeed, lines 28–34 and lines 7–8 in Figure 5.48 are identical to
lines 14–20 and lines 5–6 in Figure 5.21, with the exception of incrementing the
for loop in line 7 by the amount unrolled.
Unlike the earlier chapters, we do not show the resulting x86 code, because the
inner loop code is nearly identical to Figure 4.81: the blocking does not affect the
computation, just the order in which it accesses data in memory. What does change
is the bookkeeping integer instructions that implement the for loops. They expand
from 14 instructions before the inner loop and 8 after the loop for Figure 4.80 to 40
and 28 instructions, respectively, for the bookkeeping code generated for Figure 5.48.
Nevertheless, the extra instructions executed pale in comparison to the performance
improvement from reducing cache misses. Figure 5.49 compares unoptimized code
to code with optimizations for subword parallelism, instruction-level parallelism, and
caches. Blocking improves performance over unrolled AVX code by factors of 2 to 2.5
for the larger matrices. When we compare unoptimized code to the code with all three
optimizations, the performance improvement is a factor of 8 to 15, with the largest
increase for the largest matrix.
[Bar chart: GFLOPS achieved by four versions of DGEMM (Unoptimized, AVX,
AVX + unroll, AVX + unroll + blocked) at matrix dimensions 32x32, 160x160,
480x480, and 960x960. Legible bar labels include 14.6, 13.6, 12.7, 12.0, 11.7,
6.6, 6.4, and 3.5 GFLOPS.]
FIGURE 5.49 Performance of four versions of DGEMM for matrix dimensions 32x32 to
960x960. The fully optimized code for the largest matrix is almost 15 times as fast as the unoptimized
version in Figure 3.21 in Chapter 3.
Elaboration: As mentioned in the Elaboration in Section 3.8, these results are
with Turbo mode turned off. As in Chapters 3 and 4, when we turn it on we improve all
the results by the temporary increase in the clock rate of 3.3/2.6 ≈ 1.27. Turbo mode
works particularly well in this case because it is using only a single core of an eight-core
chip. However, if we want to run fast we should use all cores, which we'll see in
Chapter 6.
5.15 Fallacies and Pitfalls 479
This mistake catches many people, including the authors (in earlier drafts) and
instructors who forget whether they intended the addresses to be in words, bytes,
or block numbers. Remember this pitfall when you tackle the exercises.
Pitfall: Having less set associativity for a shared cache than the number of cores or
threads sharing that cache.
Without extra care, a parallel program running on 2^n processors or threads can
easily allocate data structures to addresses that would map to the same set of a
shared L2 cache. If the cache is at least 2^n-way associative, then these accidental
conflicts are hidden by the hardware from the program. If not, programmers could
face apparently mysterious performance bugs—actually due to L2 conflict misses—
when migrating from, say, a 16-core design to a 32-core design if both use 16-way
associative L2 caches.
Pitfall: Using average memory access time to evaluate the memory hierarchy of an
out-of-order processor.
If a processor stalls during a cache miss, then you can separately calculate the
memory-stall time and the processor execution time, and hence evaluate the memory
hierarchy independently using average memory access time (see page 399).
If the processor continues to execute instructions, and may even sustain more
cache misses during a cache miss, then the only accurate assessment of the memory
hierarchy is to simulate the out-of-order processor along with the memory hierarchy.
Pitfall: Extending an address space by adding segments on top of an unsegmented
address space.
During the 1970s, many programs grew so large that not all the code and data could
be addressed with just a 16-bit address. Computers were then revised to offer 32-bit
addresses, either through an unsegmented 32-bit address space (also called a flat
address space) or by adding 16 bits of segment to the existing 16-bit address. From
a marketing point of view, adding segments that were programmer-visible and that
forced the programmer and compiler to decompose programs into segments could
solve the addressing problem. Unfortunately, there is trouble any time a programming
language wants an address that is larger than one segment, such as indices for large
arrays, unrestricted pointers, or reference parameters. Moreover, adding segments
can turn every address into two words—one for the segment number and one for the
segment offset—causing problems in the use of addresses in registers.
Fallacy: Disk failure rates in the field match their specifications.
Two recent studies evaluated large collections of disks to check the relationship
between results in the field and the specifications. One study of almost 100,000
disks that had a quoted MTTF of 1,000,000 to 1,500,000 hours, or an AFR of
0.6% to 0.8%, found AFRs of 2% to 4% to be common, often three to five times
higher than the specified rates [Schroeder and Gibson, 2007]. A second study of
more than 100,000 disks at Google, which had a quoted AFR of about 1.5%, saw
failure rates of 1.7% for drives in their first year rise to 8.6% for drives in their third
year, or about five to six times the specified rate [Pinheiro, Weber, and Barroso,
2007].
Problem category: Access sensitive registers without trapping when running in user mode
x86 instructions:
    Store global descriptor table register (SGDT)
    Store local descriptor table register (SLDT)
    Store interrupt descriptor table register (SIDT)
    Store machine status word (SMSW)
    Push flags (PUSHF, PUSHFD)
    Pop flags (POPF, POPFD)

Problem category: When accessing virtual memory mechanisms in user mode,
instructions fail the x86 protection checks
x86 instructions:
    Load access rights from segment descriptor (LAR)
    Load segment limit from segment descriptor (LSL)
    Verify if segment descriptor is readable (VERR)
    Verify if segment descriptor is writable (VERW)
    Pop to segment register (POP CS, POP SS, . . .)
    Push segment register (PUSH CS, PUSH SS, . . .)
    Far call to different privilege level (CALL)
    Far return to different privilege level (RET)
    Far jump to different privilege level (JMP)
    Software interrupt (INT)
    Store segment selector register (STR)
    Move to/from segment registers (MOVE)
FIGURE 5.51 Summary of 18 x86 instructions that cause problems for virtualization
[Robin and Irvine, 2000]. The first five instructions in the top group allow a program in user mode to
read a control register, such as the descriptor table registers, without causing a trap. The pop flags instruction
modifies a control register with sensitive information but fails silently when in user mode. The protection
checking of the segmented architecture of the x86 is the downfall of the bottom group, as each of these
instructions checks the privilege level implicitly as part of instruction execution when reading a control
register. The checking assumes that the OS must be at the highest privilege level, which is not the case for
guest VMs. Only the move to segment register tries to modify control state, and protection checking foils it
as well.
Pitfall: Implementing a virtual machine monitor on an instruction set architecture
that wasn't designed to be virtualizable.
Many architects in the 1970s and 1980s weren't careful to make sure that all
instructions reading or writing information related to hardware resources
were privileged. This laissez-faire attitude causes problems for VMMs on all of
these architectures, including the x86, which we use here as an example.
Figure 5.51 describes the 18 instructions that cause problems for virtualization
[Robin and Irvine, 2000]. The two broad classes are instructions that
■ Read control registers in user mode in a way that reveals that the guest operating
system is running in a virtual machine (such as POPF, mentioned earlier)
■ Check protection as required by the segmented architecture but assume that
the operating system is running at the highest privilege level
To simplify implementations of VMMs on the x86, both AMD and Intel have
proposed extensions to the architecture via a new mode. Intel's VT-x provides
a new execution mode for running VMs, an architected definition of the VM
5.1.4 [10] How many 16-byte cache blocks are needed to store all 32-bit
matrix elements being referenced?
5.1.5 [5] References to which variables exhibit temporal locality?
5.1.6 [5] References to which variables exhibit spatial locality?
5.2 Caches are important to providing a high-performance memory hierarchy
to processors. Below is a list of 32-bit memory address references, given as word
addresses.
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
5.2.1 [10] For each of these references, identify the binary address, the tag,
and the index given a direct-mapped cache with 16 one-word blocks. Also list whether
each reference is a hit or a miss, assuming the cache is initially empty.
5.2.2 [10] For each of these references, identify the binary address, the tag,
and the index given a direct-mapped cache with two-word blocks and a total size of 8
blocks. Also list whether each reference is a hit or a miss, assuming the cache is initially empty.
5.2.3 [20] You are asked to optimize a cache design for the given
references. There are three direct-mapped cache designs possible, all with a total of 8
words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks.
In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles,
and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is
the best cache design?
There are many different design parameters that are important to a cache's overall
performance. Below are listed parameters for different direct-mapped cache designs.
Cache Data Size: 32 KiB
Cache Block Size: 2 words
Cache Access Time: 1 cycle
5.2.4 [15] Calculate the total number of bits required for the cache listed
above, assuming a 32-bit address. Given that total size, find the total size of the closest
direct-mapped cache with 16-word blocks of equal size or greater. Explain why the
second cache, despite its larger data size, might provide slower performance than the
first cache.
5.2.5 [20] Generate a series of read requests that have a lower miss rate
on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible
solution that would make the cache listed above have an equal or lower miss rate than the
2 KiB cache. Discuss the advantages and disadvantages of such a solution.
5.2.6 [15] The formula shown in Section 5.3 shows the typical method to
index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in
the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different
5.18 Exercises 493
Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0.
5.13.1 [5] Assuming an LRU replacement policy, how many hits does
this address sequence exhibit?
5.13.2 [5] Assuming an MRU (most recently used) replacement policy,
how many hits does this address sequence exhibit?
5.13.3 [5] Simulate a random replacement policy by flipping a coin. For
example, "heads" means to evict the first block in a set and "tails" means to evict the
second block in a set. How many hits does this address sequence exhibit?
5.13.4 [10] Which address should be evicted at each replacement to
maximize the number of hits? How many hits does this address sequence exhibit if you
follow this "optimal" policy?
5.13.5 [10] Describe why it is difficult to implement a cache replacement
policy that is optimal for all address sequences.
5.13.6 [10] Assume you could make a decision upon each memory
reference whether or not you want the requested address to be cached. What impact
could this have on miss rate?
5.14 To support multiple virtual machines, two levels of memory virtualization are
needed. Each virtual machine still controls the mapping of virtual address (VA) to
physical address (PA), while the hypervisor maps the physical address (PA) of each
virtual machine to the actual machine address (MA). To accelerate such mappings,
a software approach called "shadow paging" duplicates each virtual machine's page
tables in the hypervisor, and intercepts VA-to-PA mapping changes to keep both copies
consistent. To remove the complexity of shadow page tables, a hardware approach
called the nested page table (NPT) explicitly supports two classes of page tables (VA ⇒ PA
and PA ⇒ MA) and can walk such tables purely in hardware.
Consider the following sequence of operations: (1) create process; (2) TLB miss;
(3) page fault; (4) context switch.
5.14.1 [10] What would happen for the given operation sequence with a
shadow page table and a nested page table, respectively?
5.14.2 [10] Assuming an x86-based 4-level page table in both guest and
nested page table, how many memory references are needed to service a TLB miss for
native vs. nested page tables?
5.14.3 [15] Among TLB miss rate, TLB miss latency, page fault rate, and
page fault handler latency, which metrics are more important for a shadow page table?
Which are important for a nested page table?
5.16 In this exercise, we will explore the control unit for a cache controller for a
processor with a write buffer. Use the finite state machine found in Figure 5.40 as a
starting point for designing your own finite state machines. Assume that the cache
controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in
Section 5.9), but you will add a write buffer with a capacity of one block.
Recall that the purpose of a write buffer is to serve as temporary storage so that the
processor doesn't have to wait for two memory accesses on a dirty miss. Rather than
writing back the dirty block before reading the new block, it buffers the dirty block and
immediately begins reading the new block. The dirty block can then be written to main
memory while the processor is working.
5.16.1 [10] What should happen if the processor issues a request that
hits in the cache while a block is being written back to main memory from the write
buffer?
5.16.2 [10] What should happen if the processor issues a request that
misses in the cache while a block is being written back to main memory from the write
buffer?
5.16.3 [30] Design a finite state machine to enable the use of a write
buffer.
5.17 Cache coherence concerns the views of multiple processors on a given cache
block. The following data shows two processors and their read/write operations on two
different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is
32 bits.
P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;
5.17.1 [15] List the possible values of the given cache block for a correct
cache coherence protocol implementation. List at least one more possible value of the
block if the protocol doesn't ensure cache coherency.
5.17.2 [15] For a snooping protocol, list a valid operation sequence on each
processor/cache to finish the above read/write operations.
5.17.3 [10] What are the best-case and worst-case numbers of cache misses
needed to execute the listed read/write instructions?
Memory consistency concerns the views of multiple data items. The following data
shows two processors and their read/write operations on different cache blocks (A and
B initially 0).
P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;
5.19 In this exercise we show the definition of a web server log and examine code
optimizations to improve log processing speed. The data structure for the log is defined
as follows:

struct entry {
    int srcIP;          // remote IP address
    char URL[128];      // request URL (e.g., "GET index.html")
    long long refTime;  // reference time
    int status;         // connection status
    char browser[64];   // client browser name
} log[NUM_ENTRIES];

Assume the following processing function for the log:

topK_sourceIP(int hour);
5.19.1 [5] Which fields in a log entry will be accessed for the given log
processing function? Assuming 64-byte cache blocks and no prefetching, how many
cache misses per entry does the given function incur on average?
5.19.2 [10] How can you reorganize the data structure to improve cache
utilization and access locality? Show your structure definition code.
5.19.3 [10] Give an example of another log processing function that would
prefer a different data structure layout. If both functions are important, how would you
rewrite the program to improve the overall performance? Supplement the discussion
with code snippets and data.
For the problems below, use data from "Cache Performance for SPEC CPU2000
Benchmarks" (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the
pairs of benchmarks shown in the following table.
a. Mesa / gcc
b. mcf / swim
5.19.4 [10] For 64 KiB data caches with varying set associativities, what are
the miss rates broken down by miss types (cold, capacity, and conflict misses) for each
benchmark?
5.19.5 [10] Select the set associativity to be used by a 64 KiB L1 data cache
shared by both benchmarks. If the L1 cache has to be direct-mapped, select the set
associativity for the 1 MiB L2 cache.
5.19.6 [20] Give an example in the miss rate table where higher set
associativity actually increases the miss rate. Construct a cache configuration and
reference stream to demonstrate this.
Answers to Check Yourself
§5.1, page 377: 1 and 4. (3 is false because the cost of the memory hierarchy varies
per computer, but in 2013 the highest cost is usually the DRAM.)
§5.3, page 398: 1 and 4: A lower miss penalty can enable smaller blocks, since you
don't have that much latency to amortize, yet higher memory bandwidth usually
leads to larger blocks, since the miss penalty is only slightly larger.
§5.4, page 417: 1.
§5.7, page 454: 1-a, 2-c, 3-b, 4-d.
§5.8, page 461: 2. (Both large block sizes and prefetching may reduce compulsory
misses, so 1 is false.)
6 Parallel Processors from Client to Cloud
"I swing big, with everything I've got. I hit big or I miss big. I like to live as big as I can."
Babe Ruth, American baseball player
6.1 Introduction 502
6.2 The Difficulty of Creating Parallel Processing Programs 504
6.3 SISD, MIMD, SIMD, SPMD, and Vector 509
6.4 Hardware Multithreading 516
6.5 Multicore and Other Shared Memory Multiprocessors 519
6.6 Introduction to Graphics Processing Units 524
Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.
6.1 Introduction 503
multicore microprocessors instead of multiprocessor microprocessors,
presumably to avoid redundancy in naming. Hence, processors are often called
cores in a multicore chip. The number of cores is expected to increase with
Moore's Law. These multicores are almost always Shared Memory Processors
(SMPs), as they usually share a single physical address space. We'll see SMPs
more in Section 6.5.
The state of technology today means that programmers who care about
performance must become parallel programmers, for sequential code now means
slow code.
The tall challenge facing the industry is to create hardware and software that
will make it easy to write correct parallel processing programs that will execute
efficiently in performance and energy as the number of cores per chip scales.
This abrupt shift in microprocessor design caught many off guard, so there is a
great deal of confusion about the terminology and what it means. Figure 6.1 tries to
clarify the terms serial, parallel, sequential, and concurrent. The columns of this figure
represent the software, which is either inherently sequential or concurrent. The rows
of the figure represent the hardware, which is either serial or parallel. For example, the
programmers of compilers think of them as sequential programs: the steps include
parsing, code generation, optimization, and so on. In contrast, the programmers
of operating systems normally think of them as concurrent programs: cooperating
processes handling I/O events due to independent jobs running on a computer.
The point of these two axes of Figure 6.1 is that concurrent software can run on
serial hardware, such as operating systems for the Intel Pentium 4 uniprocessor,
or on parallel hardware, such as an OS on the more recent Intel Core i7. The same
is true for sequential software. For example, the MATLAB programmer writes
a matrix multiply thinking about it sequentially, but it could run serially on the
Pentium 4 or in parallel on the Intel Core i7.
You might guess that the only challenge of the parallel revolution is figuring out how
to make naturally sequential software have high performance on parallel hardware, but
it is also to make concurrent programs have high performance on multiprocessors as the
number of processors increases. With this distinction made, in the rest of this chapter
we will use parallel processing program or parallel software to mean either sequential
or concurrent software running on parallel hardware. The next section of this chapter
describes why it is hard to create efficient parallel processing programs.
multicore microprocessor: A microprocessor containing multiple processors
("cores") in a single integrated circuit. Virtually all microprocessors today in
desktops and servers are multicore.
shared memory multiprocessor (SMP): A parallel processor with a single
physical address space.
                          Software
Hardware    Sequential                             Concurrent
Serial      Matrix multiply written in MATLAB      Windows Vista operating system
            running on an Intel Pentium 4          running on an Intel Pentium 4
Parallel    Matrix multiply written in MATLAB      Windows Vista operating system
            running on an Intel Core i7            running on an Intel Core i7
FIGURE 6.1 Hardware/software categorization and examples of application perspective
on concurrency versus hardware perspective on parallelism.
504 Chapter 6 Parallel Processors from Client to Cloud
Before proceeding further down the path to parallelism, don't forget our initial
incursions from the earlier chapters:
■ Chapter 2, Section 2.11: Parallelism and Instructions: Synchronization
■ Chapter 3, Section 3.6: Parallelism and Computer Arithmetic: Subword
Parallelism
■ Chapter 4, Section 4.10: Parallelism via Instructions
■ Chapter 5, Section 5.10: Parallelism and Memory Hierarchy: Cache Coherence
Check Yourself: True or false: To benefit from a multiprocessor, an application must
be concurrent.
6.2 The Difficulty of Creating Parallel Processing Programs
The difficulty with parallelism is not the hardware; it is that too few important<br />
application programs have been rewritten to complete tasks sooner on multiprocessors.<br />
It is difficult to write software that uses multiple processors to complete one task<br />
faster, <strong>and</strong> the problem gets worse as the number of processors increases.<br />
Why has this been so? Why have parallel processing programs been so much<br />
harder to develop than sequential programs?<br />
The first reason is that you must get better performance or better energy<br />
efficiency from a parallel processing program on a multiprocessor; otherwise, you<br />
would just use a sequential program on a uniprocessor, as sequential programming<br />
is simpler. In fact, uniprocessor design techniques such as superscalar and out-of-order execution take advantage of instruction-level parallelism (see Chapter 4), normally without the involvement of the programmer. Such innovations reduced the demand for rewriting programs for multiprocessors, since programmers could do nothing and yet their sequential programs would run faster on new computers.
Why is it difficult to write parallel processing programs that are fast, especially<br />
as the number of processors increases? In Chapter 1, we used the analogy of<br />
eight reporters trying to write a single story in hopes of doing the work eight<br />
times faster. To succeed, the task must be broken into eight equal-sized pieces,<br />
because otherwise some reporters would be idle while waiting for the ones with<br />
larger pieces to finish. Another speed-up obstacle could be that the reporters<br />
would spend too much time communicating with each other instead of writing<br />
their pieces of the story. For both this analogy and parallel programming, the challenges include scheduling, partitioning the work into parallel pieces, balancing the load evenly between the workers, time to synchronize, and overhead for communication between the parties. The challenge grows stiffer with more reporters for a newspaper story and with more processors for parallel programming.
Our discussion in Chapter 1 reveals another obstacle, namely Amdahl's Law. It reminds us that even small parts of a program must be parallelized if the program is to make good use of many cores.
EXAMPLE
Speed-up Challenge
Suppose you want to achieve a speed-up of 90 times faster with 100 processors. What percentage of the original computation can be sequential?

ANSWER
Amdahl's Law (Chapter 1) says

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

We can reformulate Amdahl's Law in terms of speed-up versus the original execution time:

Speed-up = Execution time before / ((Execution time before − Execution time affected) + (Execution time affected / Amount of improvement))

This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by improvement is considered the fraction of the original execution time:

Speed-up = 1 / ((1 − Fraction time affected) + (Fraction time affected / Amount of improvement))

Substituting 90 for speed-up and 100 for amount of improvement into the formula above:

90 = 1 / ((1 − Fraction time affected) + (Fraction time affected / 100))
Then simplifying the formula and solving for fraction time affected:

90 × (1 − 0.99 × Fraction time affected) = 1
90 − (90 × 0.99 × Fraction time affected) = 1
90 − 1 = 90 × 0.99 × Fraction time affected
Fraction time affected = 89/89.1 = 0.999
Thus, to achieve a speed-up of 90 from 100 processors, the sequential<br />
percentage can only be 0.1%.<br />
Yet, there are applications with plenty of parallelism, as we shall see next.<br />
EXAMPLE<br />
Speed-up Challenge: Bigger Problem<br />
Suppose you want to perform two sums: one is a sum of 10 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10. For now let’s assume only the matrix sum is parallelizable; we’ll see soon how to parallelize scalar sums. What speed-up do you get with 10 versus 40 processors? Next, calculate the speed-ups assuming the matrices grow to 20 by 20.
ANSWER<br />
If we assume performance is a function of the time for an addition, t, then there are 10 additions that do not benefit from parallel processors and 100 additions that do. If the time for a single processor is 110t, the execution time for 10 processors is

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Execution time after improvement = 100t/10 + 10t = 20t

so the speed-up with 10 processors is 110t/20t = 5.5. The execution time for 40 processors is

Execution time after improvement = 100t/40 + 10t = 12.5t

so the speed-up with 40 processors is 110t/12.5t = 8.8. Thus, for this problem size, we get about 55% of the potential speed-up with 10 processors, but only 22% with 40.
Look what happens when we increase the matrix. The sequential program now takes 10t + 400t = 410t. The execution time for 10 processors is

Execution time after improvement = 400t/10 + 10t = 50t

so the speed-up with 10 processors is 410t/50t = 8.2. The execution time for 40 processors is

Execution time after improvement = 400t/40 + 10t = 20t

so the speed-up with 40 processors is 410t/20t = 20.5. Thus, for this larger problem size, we get 82% of the potential speed-up with 10 processors and 51% with 40.
These examples show that getting good speed-up on a multiprocessor while<br />
keeping the problem size fixed is harder than getting good speed-up by increasing<br />
the size of the problem. This insight allows us to introduce two terms that describe<br />
ways to scale up.<br />
Strong scaling means measuring speed-up while keeping the problem size fixed.<br />
Weak scaling means that the problem size grows proportionally to the increase in<br />
the number of processors. Let’s assume that the size of the problem, M, is the working<br />
set in main memory, and we have P processors. Then the memory per processor for strong scaling is approximately M/P, and for weak scaling, it is approximately M.
Note that the memory hierarchy can interfere with the conventional wisdom about weak scaling being easier than strong scaling. For example, if the weakly scaled dataset no longer fits in the last level cache of a multicore microprocessor, the resulting performance could be much worse than by using strong scaling.
Depending on the application, you can argue for either scaling approach. For example, the TPC-C debit-credit database benchmark requires that you scale up the number of customer accounts in proportion to the higher transactions per minute. The argument is that it's nonsensical to think that a given customer base is suddenly going to start using ATMs 100 times a day just because the bank gets a faster computer. Instead, if you're going to demonstrate a system that can perform 100 times the number of transactions per minute, you should run the experiment with 100 times as many customers. Bigger problems often need more data, which is an argument for weak scaling.
This final example shows the importance of load balancing.<br />
strong scaling: Speed-up achieved on a multiprocessor without increasing the size of the problem.

weak scaling: Speed-up achieved on a multiprocessor while increasing the size of the problem proportionally to the increase in the number of processors.
EXAMPLE
Speed-up Challenge: Balancing Load
To achieve the speed-up of 20.5 on the previous larger problem with 40 processors, we assumed the load was perfectly balanced. That is, each of the 40
6.3 SISD, MIMD, SIMD, SPMD, and Vector
data elements from memory, put them in order into a large set of registers, operate on them sequentially in registers using pipelined execution units, and then write the results back to memory. A key feature of vector architectures is then a set of vector registers. Thus, a vector architecture might have 32 vector registers, each with 64 64-bit elements.
EXAMPLE
Comparing Vector to Conventional Code
Suppose we extend the MIPS instruction set architecture with vector instructions and vector registers. Vector operations use the same names as MIPS operations, but with the letter "V" appended. For example, addv.d adds two double-precision vectors. The vector instructions take as their input either a pair of vector registers (addv.d) or a vector register and a scalar register (addvs.d). In the latter case, the value in the scalar register is used as the input for all operations: the operation addvs.d will add the contents of a scalar register to each element in a vector register. The names lv and sv denote vector load and vector store, and they load or store an entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other operand, which is a MIPS general-purpose register, is the starting address of the vector in memory. Given this short description, show the conventional MIPS code versus the vector MIPS code for

Y = a × X + Y

where X and Y are vectors of 64 double-precision floating-point numbers, initially resident in memory, and a is a scalar double-precision variable. (This example is the so-called DAXPY loop that forms the inner loop of the Linpack benchmark; DAXPY stands for double precision a × X plus Y.) Assume that the starting addresses of X and Y are in $s0 and $s1, respectively.
ANSWER
Here is the conventional MIPS code for DAXPY:

      l.d    $f0,a($sp)      # load scalar a
      addiu  $t0,$s0,#512    # upper bound of what to load
loop: l.d    $f2,0($s0)      # load x(i)
      mul.d  $f2,$f2,$f0     # a × x(i)
      l.d    $f4,0($s1)      # load y(i)
      add.d  $f4,$f4,$f2     # a × x(i) + y(i)
      s.d    $f4,0($s1)      # store into y(i)
      addiu  $s0,$s0,#8      # increment index to x
      addiu  $s1,$s1,#8      # increment index to y
      subu   $t1,$t0,$s0     # compute bound
      bne    $t1,$zero,loop  # check if done

Here is the vector MIPS code for DAXPY:
      l.d     $f0,a($sp)   # load scalar a
      lv      $v1,0($s0)   # load vector x
      mulvs.d $v2,$v1,$f0  # vector-scalar multiply
      lv      $v3,0($s1)   # load vector y
      addv.d  $v4,$v2,$v3  # add y to product
      sv      $v4,0($s1)   # store the result
There are some interesting comparisons between the two code segments in this example. The most dramatic is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only 6 instructions versus almost 600 for the traditional MIPS architecture. This reduction occurs both because the vector operations work on 64 elements at a time and because the overhead instructions that constitute nearly half the loop on MIPS are not present in the vector code. As you might expect, this reduction in instructions fetched and executed saves energy.
Another important difference is the frequency of pipeline hazards (Chapter 4). In the straightforward MIPS code, every add.d must wait for a mul.d, every s.d must wait for the add.d, and every add.d and mul.d must wait on l.d. On the vector processor, each vector instruction will only stall for the first element in each vector, and then subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the pipeline stall frequency on MIPS will be about 64 times higher than it is on the vector version of MIPS. The pipeline stalls can be reduced on MIPS by using loop unrolling (see Chapter 4). However, the large difference in instruction bandwidth cannot be reduced.
Since the vector elements are independent, they can be operated on in parallel, much like subword parallelism for AVX instructions. All modern vector computers have vector functional units with multiple parallel pipelines (called vector lanes; see Figures 6.2 and 6.3) that can produce two or more results per clock cycle.
Elaboration: The loop in the example above exactly matched the vector length. When loops are shorter, vector architectures use a register that reduces the length of vector operations. When loops are larger, we add bookkeeping code to iterate full-length vector operations and to handle the leftovers. This latter process is called strip mining.
Vector versus Scalar
Vector instructions have several important properties compared to conventional instruction set architectures, which are called scalar architectures in this context:
■ A single vector instruction specifies a great deal of work: it is equivalent to executing an entire loop. The instruction fetch and decode bandwidth needed is dramatically reduced.
■ By using a vector instruction, the compiler or programmer indicates that the computation of each result in the vector is independent of the computation of other results in the same vector, so hardware does not have to check for data hazards within a vector instruction.
■ Vector architectures and compilers have a reputation of making it much easier than MIMD multiprocessors to write efficient applications when they contain data-level parallelism.
■ Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors. Reduced checking can save energy as well as time.
■ Vector instructions that access memory have a known access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector.
■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
■ The savings in instruction bandwidth and hazard checking plus the efficient use of memory bandwidth give vector architectures advantages in power and energy versus scalar architectures.
For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the application domain can often use them.
Vector versus Multimedia Extensions
Like multimedia extensions found in the x86 AVX instructions, a vector instruction specifies multiple operations. However, multimedia extensions typically specify a few operations, while vector specifies dozens of operations. Unlike multimedia extensions, the number of elements in a vector operation is not in the opcode but in a separate register. This distinction means different versions of the vector architecture can be implemented with a different number of elements just by changing the contents of that register and hence retain binary compatibility. In contrast, a new large set of opcodes is added each time the vector length changes in the multimedia extension architecture of the x86: MMX, SSE, SSE2, AVX, AVX2, … .
Also unlike multimedia extensions, the data transfers need not be contiguous. Vectors support both strided accesses, where the hardware loads every nth data element in memory, and indexed accesses, where hardware finds the addresses of the items to be loaded in a vector register. Indexed accesses are also called gather-scatter, in that indexed loads gather elements from main memory into contiguous vector elements and indexed stores scatter vector elements across main memory.
Like multimedia extensions, vector architectures easily capture the flexibility in data widths, so it is easy to make a vector operation work on 32 64-bit data elements or 64 32-bit data elements or 128 16-bit data elements or 256 8-bit data elements. The parallel semantics of a vector instruction allow an implementation to execute these operations using a deeply pipelined functional unit, an array of parallel functional units, or a combination of parallel and pipelined functional units. Figure 6.3 illustrates how to improve vector performance by using parallel pipelines to execute a vector add instruction.
Vector arithmetic instructions usually only allow element N of one vector<br />
register to take part in operations with element N from other vector registers. This
vector lane: One or more vector functional units and a portion of the vector register file. Inspired by lanes on highways that increase traffic speed, multiple lanes execute vector operations simultaneously.

FIGURE 6.3 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector processor (b) on the right has four add pipelines, or lanes, and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four lanes.
dramatically simplifies the construction of a highly parallel vector unit, which can<br />
be structured as multiple parallel vector lanes. As with a traffic highway, we can<br />
increase the peak throughput of a vector unit by adding more lanes. Figure 6.4<br />
shows the structure of a four-lane vector unit. Thus, going to four lanes from one<br />
lane reduces the number of clocks per vector instruction by roughly a factor of four.<br />
For multiple lanes to be advantageous, both the applications and the architecture must support long vectors. Otherwise, they will execute so quickly that you’ll run out of instructions, requiring instruction-level parallel techniques like those in Chapter 4 to supply enough vector instructions.
Generally, vector architectures are a very efficient way to execute data parallel<br />
processing programs; they are better matches to compiler technology than multimedia extensions; and they are easier to evolve over time than the multimedia
extensions to the x86 architecture.<br />
Given these classic categories, we next see how to exploit parallel streams of<br />
instructions to improve the performance of a single processor, which we will reuse<br />
with multiple processors.<br />
Check Yourself: True or false: As exemplified in the x86, multimedia extensions can be thought of as a vector architecture with short vectors that supports only contiguous vector data transfers.
6.4 Hardware Multithreading
Simultaneous multithreading (SMT) is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism (see Chapter 4). The key insight that motivates SMT is that multiple-issue processors often have more functional unit parallelism available than most single threads can effectively use. Furthermore, with register renaming and dynamic scheduling (see Chapter 4), multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
Since SMT relies on the existing dynamic mechanisms, it does not switch resources every cycle. Instead, SMT is always executing instructions from multiple threads, leaving it up to the hardware to associate instruction slots and renamed registers with their proper threads.
Figure 6.5 conceptually illustrates the differences in a processor's ability to exploit superscalar resources for the following processor configurations. The top portion shows

simultaneous multithreading (SMT): A version of multithreading that lowers the cost of multithreading by utilizing the resources needed for a multiple-issue, dynamically scheduled microarchitecture.

FIGURE 6.5 How four threads use the issue slots of a superscalar processor in different approaches. The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together in three multithreading options (coarse MT, fine MT, and SMT). The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to four different threads in the multithreading processors. The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss in throughput for coarse multithreading.
uniform memory access (UMA): A multiprocessor in which latency to any word in main memory is about the same no matter which processor requests the access.

nonuniform memory access (NUMA): A type of single address space multiprocessor in which some memory accesses are much faster than others depending on which processor asks for which word.

synchronization: The process of coordinating the behavior of two or more processes, which may be running on different processors.

lock: A synchronization device that allows access to data to only one processor at a time.
nearly always the case for multicore chips, although a more accurate term would have been shared-address multiprocessor. Processors communicate through shared variables in memory, with all processors capable of accessing any memory location via loads and stores. Figure 6.7 shows the classic organization of an SMP. Note that such systems can still run independent jobs in their own virtual address spaces, even if they all share a physical address space.
Single address space multiprocessors come in two styles. In the first style, the latency to a word in memory does not depend on which processor asks for it. Such machines are called uniform memory access (UMA) multiprocessors. In the second style, some memory accesses are much faster than others, depending on which processor asks for which word, typically because main memory is divided and attached to different microprocessors or to different memory controllers on the same chip. Such machines are called nonuniform memory access (NUMA) multiprocessors. As you might expect, the programming challenges are harder for a NUMA multiprocessor than for a UMA multiprocessor, but NUMA machines can scale to larger sizes, and NUMAs can have lower latency to nearby memory.
As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data; otherwise, one processor could start working on data before another is finished with it. This coordination is called synchronization, which we saw in Chapter 2. When sharing is supported with a single address space, there must be a separate mechanism for synchronization. One approach uses a lock for a shared variable. Only one processor at a time can acquire the lock, and other processors interested in shared data must wait until the original processor unlocks the variable. Section 2.11 of Chapter 2 describes the instructions for locking in the MIPS instruction set.
FIGURE 6.7 Classic organization of a shared memory multiprocessor: multiple processors, each with its own cache, connected by an interconnection network to a shared memory and I/O.
6.5 Multicore and Other Shared Memory Multiprocessors
EXAMPLE
A Simple Parallel Processing Program for a Shared Address Space
Suppose we want to sum 64,000 numbers on a shared memory multiprocessor computer with uniform memory access time. Let's assume we have 64 processors.

ANSWER
The first step is to ensure a balanced load per processor, so we split the set of numbers into subsets of the same size. We do not allocate the subsets to a different memory space, since there is a single memory space for this machine; we just give different starting addresses to each processor. Pn is the number that identifies the processor, between 0 and 63. All processors start the program by running a loop that sums their subset of numbers:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i]; /*sum the assigned areas*/

(Note the C code i += 1 is just a shorter way to say i = i + 1.)
The next step is to add these 64 partial sums. This step is called a reduction,<br />
where we divide to conquer. Half of the processors add pairs of partial sums,<br />
and then a quarter add pairs of the new partial sums, and so on until we
have the single, final sum. Figure 6.8 illustrates the hierarchical nature of this<br />
reduction.<br />
In this example, the two processors must synchronize before the consumer<br />
processor tries to read the result from the memory location written by the<br />
producer processor; otherwise, the consumer may read the old value of<br />
reduction: A function that processes a data structure and returns a single value.

FIGURE 6.8 The last four levels of a reduction that sums results from each processor, from bottom to top. For all processors whose number i is less than half, add the sum produced by processor number (i + half) to its sum.
the data. We want each processor to have its own version of the loop counter variable i, so we must indicate that it is a private variable. Here is the code (half is private also):

half = 64; /*64 processors in multiprocessor*/
do {
    synch(); /*wait for partial sum completion*/
    if (half%2 != 0 && Pn == 0)
        sum[0] += sum[half-1]; /*Conditional sum needed when half is odd; Processor0 gets missing element*/
    half = half/2; /*dividing line on who sums*/
    if (Pn < half) sum[Pn] += sum[Pn+half];
} while (half > 1); /*exit with final sum in sum[0]*/
Hardware/Software Interface

OpenMP: An API for shared memory multiprocessing in C, C++, or Fortran that runs on UNIX and Microsoft platforms. It includes compiler directives, a library, and runtime directives.
Given the long-term interest in parallel programming, there have been hundreds of attempts to build parallel programming systems. A limited but popular example is OpenMP. It is just an Application Programmer Interface (API) along with a set of compiler directives, environment variables, and runtime library routines that can extend standard programming languages. It offers a portable, scalable, and simple programming model for shared memory multiprocessors. Its primary goal is to parallelize loops and to perform reductions.
Most C compilers already have support for OpenMP. The command to use the OpenMP API with the UNIX C compiler is just:

cc -fopenmp foo.c
OpenMP extends C using pragmas, which are just commands to the C macro<br />
preprocessor like #define and #include. To set the number of processors we<br />
want to use to be 64, as we wanted in the example above, we just use the command<br />
#define P 64 /* define a constant that we’ll use a few times */<br />
#pragma omp parallel num_threads(P)<br />
That is, the runtime libraries should use 64 parallel threads.<br />
To turn the sequential for loop into a parallel for loop that divides the work<br />
equally between all the threads that we told it to use, we just write (assuming sum<br />
is initialized to 0)<br />
#pragma omp parallel for<br />
for (Pn = 0; Pn < P; Pn += 1)<br />
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)<br />
sum[Pn] += A[i]; /*sum the assigned areas*/
6.5 Multicore <strong>and</strong> Other Shared Memory Multiprocessors 523<br />
To perform the reduction, we can use another command that tells OpenMP<br />
what the reduction operator is and what variable you need to use to place the result<br />
of the reduction.<br />
#pragma omp parallel for reduction(+ : FinalSum)<br />
for (i = 0; i < P; i += 1)<br />
FinalSum += sum[i]; /* Reduce to a single number */<br />
Note that it is now up to the OpenMP library to find code that sums 64<br />
numbers efficiently using 64 processors.<br />
While OpenMP makes it easy to write simple parallel code, it is not very helpful<br />
with debugging, so many parallel programmers use more sophisticated parallel<br />
programming systems than OpenMP, just as many programmers today use more<br />
productive languages than C.<br />
Given this tour of classic MIMD hardware and software, our next path is a more<br />
exotic tour of a type of MIMD architecture with a different heritage and thus a very<br />
different perspective on the parallel programming challenge.<br />
True or false: Shared memory multiprocessors cannot take advantage of task-level<br />
parallelism.<br />
Check<br />
Yourself<br />
Elaboration: Some writers repurposed the acronym SMP to mean symmetric<br />
multiprocessor, to indicate that the latency from processor to memory was about the<br />
same for all processors. This shift was done to contrast them with large-scale NUMA<br />
multiprocessors, as both classes used a single address space. As clusters proved much<br />
more popular than large-scale NUMA multiprocessors, in this book we restore SMP to<br />
its original meaning, and use it in contrast to architectures that use multiple address<br />
spaces, such as clusters.<br />
Elaboration: An alternative to sharing the physical address space would be to have<br />
separate physical address spaces but share a common virtual address space, leaving<br />
it up to the operating system to handle communication. This approach has been tried,<br />
but it has too high an overhead to offer a practical shared memory abstraction to the<br />
performance-oriented programmer.
526 Chapter 6 Parallel Processors from Client to Cloud<br />
registers than do vector processors. Unlike most vector architectures, GPUs also<br />
rely on hardware multithreading within a single multi-threaded SIMD processor<br />
to hide memory latency (see Section 6.4).<br />
A multithreaded SIMD processor is similar to a vector processor, but the former<br />
has many parallel functional units instead of the few, deeply pipelined units of<br />
the latter.<br />
As mentioned above, a GPU contains a collection of multithreaded SIMD<br />
processors; that is, a GPU is a MIMD composed of multithreaded SIMD processors.<br />
For example, NVIDIA has four implementations of the Fermi architecture at<br />
different price points with 7, 11, 14, or 15 multithreaded SIMD processors. To<br />
provide transparent scalability across models of GPUs with differing numbers of<br />
multithreaded SIMD processors, the Thread Block Scheduler hardware assigns<br />
blocks of threads to multithreaded SIMD processors. Figure 6.9 shows a simplified<br />
block diagram of a multithreaded SIMD processor.<br />
Dropping down one more level of detail, the machine object that the hardware<br />
creates, manages, schedules, and executes is a thread of SIMD instructions, which<br />
we will also call a SIMD thread. It is a traditional thread, but it contains exclusively<br />
SIMD instructions. These SIMD threads have their own program counters and<br />
they run on a multithreaded SIMD processor. The SIMD Thread Scheduler includes<br />
a controller that lets it know which threads of SIMD instructions are ready to<br />
run, <strong>and</strong> then it sends them off to a dispatch unit to be run on the multithreaded<br />
[Figure 6.9 diagram: an instruction register feeds 16 SIMD Lanes (Thread Processors), each with a 1K × 32 register file and its own load/store unit; an address coalescing unit and an interconnection network connect the lanes to a 64 KiB Local Memory and to Global Memory.]<br />
FIGURE 6.9 Simplified block diagram of the datapath of a multithreaded SIMD Processor.<br />
It has 16 SIMD lanes. The SIMD Thread Scheduler has many independent SIMD threads that it chooses from<br />
to run on this processor.
6.6 Introduction to Graphics Processing Units 527<br />
SIMD processor. It is identical to a hardware thread scheduler in a traditional<br />
multithreaded processor (see Section 6.4), except that it is scheduling threads of<br />
SIMD instructions. Thus, GPU hardware has two levels of hardware schedulers:<br />
1. The Thread Block Scheduler that assigns blocks of threads to multithreaded<br />
SIMD processors, and<br />
2. the SIMD Thread Scheduler within a SIMD processor, which schedules<br />
when SIMD threads should run.<br />
The SIMD instructions of these threads are 32 wide, so each thread of SIMD<br />
instructions would compute 32 of the elements of the computation. Since the<br />
thread consists of SIMD instructions, the SIMD processor must have parallel<br />
functional units to perform the operation. We call them SIMD Lanes, and they are<br />
quite similar to the Vector Lanes in Section 6.3.<br />
Elaboration: The number of lanes per SIMD processor varies across GPU generations.<br />
With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 SIMD Lanes,<br />
so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to<br />
complete. Each thread of SIMD instructions is executed in lock step. Staying with the<br />
analogy of a SIMD processor as a vector processor, you could say that it has 16 lanes,<br />
and the vector length would be 32. This wide but shallow nature is why we use the term<br />
SIMD processor instead of vector processor, as it is more intuitive.<br />
Since by definition the threads of SIMD instructions are independent, the SIMD<br />
Thread Scheduler can pick whatever thread of SIMD instructions is ready, and need not<br />
stick with the next SIMD instruction in the sequence within a single thread. Thus, using<br />
the terminology of Section 6.4, it uses fine-grained multithreading.<br />
To hold these memory elements, a Fermi SIMD processor has an impressive 32,768<br />
32-bit registers. Just like a vector processor, these registers are divided logically across<br />
the vector lanes or, in this case, SIMD Lanes. Each SIMD Thread is limited to no more than<br />
64 registers, so you might think of a SIMD Thread as having up to 64 vector registers,<br />
with each vector register having 32 elements <strong>and</strong> each element being 32 bits wide.<br />
Since Fermi has 16 SIMD Lanes, each contains 2048 registers. Each CUDA Thread<br />
gets one element of each of the vector registers. Note that a CUDA thread is just a<br />
vertical cut of a thread of SIMD instructions, corresponding to one element executed by<br />
one SIMD Lane. Beware that CUDA Threads are very different from POSIX threads; you<br />
can’t make arbitrary system calls or synchronize arbitrarily in a CUDA Thread.<br />
NVIDIA GPU Memory Structures<br />
Figure 6.10 shows the memory structures of an NVIDIA GPU. We call the on-chip<br />
memory that is local to each multithreaded SIMD processor Local Memory.<br />
It is shared by the SIMD Lanes within a multithreaded SIMD processor, but this<br />
memory is not shared between multithreaded SIMD processors. We call the off-chip<br />
DRAM shared by the whole GPU and all thread blocks GPU Memory.<br />
Rather than rely on large caches to contain the whole working sets of an<br />
application, GPUs traditionally use smaller streaming caches <strong>and</strong> rely on extensive<br />
multithreading of threads of SIMD instructions to hide the long latency to DRAM,
528 Chapter 6 Parallel Processors from Client to Cloud<br />
[Figure 6.10 diagram: each CUDA Thread has Per-CUDA Thread Private Memory; each Thread Block has Per-Block Local Memory; Grid 0 and Grid 1 execute in sequence with Inter-Grid Synchronization between them, and both share GPU Memory.]<br />
FIGURE 6.10 GPU Memory structures. GPU Memory is shared by the vectorized loops. All threads<br />
of SIMD instructions within a thread block share Local Memory.<br />
since their working sets can be hundreds of megabytes. Thus, they will not fit<br />
in the last level cache of a multicore microprocessor. Given the use of hardware<br />
multithreading to hide DRAM latency, the chip area used for caches in system<br />
processors is spent instead on computing resources <strong>and</strong> on the large number of<br />
registers to hold the state of the many threads of SIMD instructions.<br />
Elaboration: While hiding memory latency is the underlying philosophy, note that the<br />
latest GPUs <strong>and</strong> vector processors have added caches. For example, the recent Fermi<br />
architecture has added caches, but they are thought of as either bandwidth filters to<br />
reduce demands on GPU Memory or as accelerators for the few variables whose latency<br />
cannot be hidden by multithreading. Local memory for stack frames, function calls,<br />
and register spilling is a good match to caches, since latency matters when calling a<br />
function. Caches can also save energy, since on-chip cache accesses take much less<br />
energy than accesses to multiple, external DRAM chips.<br />
530 Chapter 6 Parallel Processors from Client to Cloud<br />
Each entry below gives the more descriptive name, then in parentheses the closest old term outside of GPUs and the official CUDA/NVIDIA GPU term, followed by the book definition. The entries are grouped by type.<br />
Program abstractions:<br />
■ Vectorizable Loop (old term: Vectorizable Loop; CUDA term: Grid). A vectorizable loop, executed on the GPU, made up of one or more Thread Blocks (bodies of vectorized loop) that can execute in parallel.<br />
■ Body of Vectorized Loop (old term: Body of a (Strip-Mined) Vectorized Loop; CUDA term: Thread Block). A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. They can communicate via Local Memory.<br />
■ Sequence of SIMD Lane Operations (old term: One iteration of a Scalar Loop; CUDA term: CUDA Thread). A vertical cut of a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask and predicate register.<br />
Machine object:<br />
■ A Thread of SIMD Instructions (old term: Thread of Vector Instructions; CUDA term: Warp). A traditional thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results stored depending on a per-element mask.<br />
■ SIMD Instruction (old term: Vector Instruction; CUDA term: PTX Instruction). A single SIMD instruction executed across SIMD Lanes.<br />
Processing hardware:<br />
■ Multithreaded SIMD Processor (old term: (Multithreaded) Vector Processor; CUDA term: Streaming Multiprocessor). A multithreaded SIMD Processor executes threads of SIMD instructions, independent of other SIMD Processors.<br />
■ Thread Block Scheduler (old term: Scalar Processor; CUDA term: Giga Thread Engine). Assigns multiple Thread Blocks (bodies of vectorized loop) to multithreaded SIMD Processors.<br />
■ SIMD Thread Scheduler (old term: Thread scheduler in a Multithreaded CPU; CUDA term: Warp Scheduler). Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute; includes a scoreboard to track SIMD Thread execution.<br />
■ SIMD Lane (old term: Vector lane; CUDA term: Thread Processor). A SIMD Lane executes the operations in a thread of SIMD instructions on a single element. Results stored depending on mask.<br />
Memory hardware:<br />
■ GPU Memory (old term: Main Memory; CUDA term: Global Memory). DRAM memory accessible by all multithreaded SIMD Processors in a GPU.<br />
■ Local Memory (old term: Local Memory; CUDA term: Shared Memory). Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors.<br />
■ SIMD Lane Registers (old term: Vector Lane Registers; CUDA term: Thread Processor Registers). Registers in a single SIMD Lane allocated across a full thread block (body of vectorized loop).<br />
FIGURE 6.12 Quick guide to GPU terms. We use the first column for hardware terms. Four groups<br />
cluster these 12 terms. From top to bottom: Program Abstractions, Machine Objects, Processing Hardware,<br />
<strong>and</strong> Memory Hardware.<br />
make more sense when architects ask, given the hardware invested to do graphics<br />
well, how can we supplement it to improve the performance of a wider range of<br />
applications?<br />
Having covered two different styles of MIMD that have a shared address<br />
space, we next introduce parallel processors where each processor has its<br />
own private address space, which makes it much easier to build much larger<br />
systems. The Internet services that you use every day depend on these large scale<br />
systems.
6.7 Clusters, Warehouse Scale Computers, and Other Message-Passing Multiprocessors 533<br />
Given that clusters are constructed from whole computers and independent,<br />
scalable networks, this isolation also makes it easier to expand the system without<br />
bringing down the application that runs on top of the cluster.<br />
Their lower cost, higher availability, and rapid, incremental expandability make<br />
clusters attractive to Internet service providers, despite their poorer communication<br />
performance when compared to large-scale shared memory multiprocessors. The<br />
search engines that hundreds of millions of us use every day depend upon this<br />
technology. Amazon, Facebook, Google, Microsoft, and others all have multiple<br />
datacenters, each with clusters of tens of thousands of servers. Clearly, the use of<br />
multiple processors in Internet service companies has been hugely successful.<br />
Warehouse-Scale Computers<br />
Internet services, such as those described above, necessitated the construction<br />
of new buildings to house, power, and cool 100,000 servers. Although they may<br />
be classified as just large clusters, their architecture <strong>and</strong> operation are more<br />
sophisticated. They act as one giant computer and cost on the order of $150M<br />
for the building, the electrical and cooling infrastructure, the servers, and the<br />
networking equipment that connects and houses 50,000 to 100,000 servers. We<br />
consider them a new class of computer, called Warehouse-Scale Computers (WSC).<br />
Anyone can build a fast<br />
CPU. The trick is to build a<br />
fast system.<br />
Seymour Cray, considered<br />
the father of the<br />
supercomputer.<br />
The most popular framework for batch processing in a WSC is MapReduce [Dean,<br />
2008] <strong>and</strong> its open-source twin Hadoop. Inspired by the Lisp functions of the same<br />
name, Map first applies a programmer-supplied function to each logical input<br />
record. Map runs on thousands of servers to produce an intermediate result of<br />
key-value pairs. Reduce collects the output of those distributed tasks and collapses them<br />
using another programmer-defined function. With appropriate software support,<br />
both are highly parallel yet easy to understand and to use. Within 30 minutes, a<br />
novice programmer can run a MapReduce task on thousands of servers.<br />
For example, one MapReduce program calculates the number of occurrences of<br />
every English word in a large collection of documents. Below is a simplified version<br />
of that program, which shows just the inner loop <strong>and</strong> assumes just one occurrence<br />
of all English words found in a document:<br />
Hardware/<br />
Software<br />
Interface<br />
map(String key, String value):<br />
// key: document name<br />
// value: document contents<br />
for each word w in value:<br />
EmitIntermediate(w, “1”); // Produce list of all words<br />
reduce(String key, Iterator values):<br />
// key: a word<br />
// values: a list of counts<br />
int result = 0;<br />
for each v in values:<br />
result += ParseInt(v); // get integer from key-value pair<br />
Emit(AsString(result));
534 Chapter 6 Parallel Processors from Client to Cloud<br />
The function EmitIntermediate used in the Map function emits each<br />
word in the document <strong>and</strong> the value one. Then the Reduce function sums all the<br />
values per word for each document using ParseInt() to get the number of<br />
occurrences per word in all documents. The MapReduce runtime environment<br />
schedules map tasks <strong>and</strong> reduce tasks to the servers of a WSC.<br />
software as a service<br />
(SaaS) Rather than<br />
selling software that<br />
is installed and run<br />
on customers’ own<br />
computers, software is run<br />
at a remote site and made<br />
available over the Internet<br />
typically via a Web<br />
interface to customers.<br />
SaaS customers are<br />
charged based on use<br />
versus on ownership.<br />
At this extreme scale, which requires innovation in power distribution, cooling,<br />
monitoring, and operations, the WSC is a modern descendant of the 1970s<br />
supercomputers—making Seymour Cray the godfather of today’s WSC architects.<br />
His extreme computers handled computations that could be done nowhere else, but<br />
were so expensive that only a few companies could afford them. This time the target<br />
is providing information technology for the world instead of high performance<br />
computing for scientists <strong>and</strong> engineers. Hence, WSCs surely play a more important<br />
societal role today than Cray’s supercomputers did in the past.<br />
While they share some common goals with servers, WSCs have three major<br />
distinctions:<br />
1. Ample, easy parallelism: A concern for a server architect is whether the<br />
applications in the targeted marketplace have enough parallelism to justify<br />
the amount of parallel hardware <strong>and</strong> whether the cost is too high for sufficient<br />
communication hardware to exploit this parallelism. A WSC architect has<br />
no such concern. First, batch applications like MapReduce benefit from the<br />
large number of independent data sets that need independent processing,<br />
such as billions of Web pages from a Web crawl. Second, interactive Internet<br />
service applications, also known as Software as a Service (SaaS), can benefit<br />
from millions of independent users of interactive Internet services. Reads<br />
and writes are rarely dependent in SaaS, so SaaS rarely needs to synchronize.<br />
For example, search uses a read-only index and email is normally reading<br />
and writing independent information. We call this type of easy parallelism<br />
Request-Level Parallelism, as many independent efforts can proceed in<br />
parallel naturally with little need for communication or synchronization.<br />
2. Operational Costs Count: Traditionally, server architects design their systems<br />
for peak performance within a cost budget <strong>and</strong> worry about energy only to<br />
make sure they don’t exceed the cooling capacity of their enclosure. They<br />
usually ignore the operational costs of a server, assuming that they pale in<br />
comparison to purchase costs. WSCs have longer lifetimes—the building and<br />
electrical and cooling infrastructure are often amortized over 10 or more<br />
years—so the operational costs add up: energy, power distribution, and<br />
cooling represent more than 30% of the costs of a WSC over 10 years.<br />
3. Scale <strong>and</strong> the Opportunities/Problems Associated with Scale: To construct a<br />
single WSC, you must purchase 100,000 servers along with the supporting<br />
infrastructure, which means volume discounts. Hence, WSCs are so massive
6.7 Clusters, Warehouse Scale Computers, and Other Message-Passing Multiprocessors 535<br />
internally that you get economy of scale even if there are not many WSCs.<br />
These economies of scale led to cloud computing, as the lower per unit costs<br />
of a WSC meant that cloud companies could rent servers at a profitable rate<br />
<strong>and</strong> still be below what it costs outsiders to do it themselves. The flip side<br />
of the economic opportunity of scale is the need to cope with the failure<br />
frequency of scale. Even if a server had a Mean Time To Failure of an amazing<br />
25 years (200,000 hours), the WSC architect would need to design for 5<br />
server failures every day. Section 5.15 mentioned that the annualized disk failure rate<br />
(AFR) measured at Google was 2% to 4%. If there were 4 disks per server<br />
<strong>and</strong> their annual failure rate was 2%, the WSC architect should expect to see<br />
one disk fail every hour. Thus, fault tolerance is even more important for the<br />
WSC architect than the server architect.<br />
The economies of scale uncovered by WSCs have realized the long-dreamed-of<br />
goal of computing as a utility. Cloud computing means anyone anywhere with good<br />
ideas, a business model, and a credit card can tap thousands of servers to deliver<br />
their vision almost instantly around the world. Of course, there are important<br />
obstacles that could limit the growth of cloud computing—such as security,<br />
privacy, standards, and the rate of growth of Internet bandwidth—but we foresee<br />
them being addressed so that WSCs and cloud computing can flourish.<br />
To put the growth rate of cloud computing into perspective, in 2012 Amazon<br />
Web Services announced that it adds enough new server capacity every day to<br />
support all of Amazon’s global infrastructure as of 2003, when Amazon was a<br />
$5.2Bn annual revenue enterprise with 6000 employees.<br />
Now that we underst<strong>and</strong> the importance of message-passing multiprocessors,<br />
especially for cloud computing, we next cover ways to connect the nodes of a WSC<br />
together. Thanks to Moore’s Law and the increasing number of cores per chip, we<br />
now need networks inside a chip as well, so these topologies are important in the<br />
small as well as in the large.<br />
Elaboration: The MapReduce framework shuffles and sorts the key-value pairs at the<br />
end of the Map phase to produce groups that all share the same key. These groups are<br />
then passed to the Reduce phase.<br />
Elaboration: Another form of large-scale computing is grid computing, where the<br />
computers are spread across large areas, and then the programs that run across them<br />
must communicate via long-haul networks. The most popular and unique form of grid<br />
computing was pioneered by the SETI@home project. As millions of PCs are idle at<br />
any one time doing nothing useful, they could be harvested and put to good uses if<br />
someone developed software that could run on those computers and then gave each PC<br />
an independent piece of the problem to work on. The first example was the Search for<br />
ExtraTerrestrial Intelligence (SETI), which was launched at UC Berkeley in 1999. Over 5<br />
million computer users in more than 200 countries have signed up for SETI@home, with<br />
more than 50% outside the US. By the end of 2011, the average performance of the<br />
SETI@home grid was 3.5 PetaFLOPS.
6.8 Introduction to Multiprocessor Network Topologies 537<br />
Because there are numerous topologies to choose from, performance metrics<br />
are needed to distinguish these designs. Two are popular. The first is total network<br />
bandwidth, which is the bandwidth of each link multiplied by the number of links.<br />
This represents the peak bandwidth. For the ring network above, with P processors,<br />
the total network bandwidth would be P times the bandwidth of one link; the total<br />
network bandwidth of a bus is just the bandwidth of that bus.<br />
To balance this best bandwidth case, we include another metric that is closer to<br />
the worst case: the bisection bandwidth. This metric is calculated by dividing the<br />
machine into two halves. Then you sum the bandwidth of the links that cross that<br />
imaginary dividing line. The bisection bandwidth of a ring is two times the link<br />
bandwidth. It is one times the link bandwidth for the bus. If a single link is as fast<br />
as the bus, the ring is only twice as fast as a bus in the worst case, but it is P times<br />
faster in the best case.<br />
Since some network topologies are not symmetric, the question arises<br />
of where to draw the imaginary line when bisecting the machine. Bisection<br />
bandwidth is a worst-case metric, so the answer is to choose the division that<br />
yields the most pessimistic network performance. Stated alternatively, calculate<br />
all possible bisection bandwidths and pick the smallest. We take this pessimistic<br />
view because parallel programs are often limited by the weakest link in the<br />
communication chain.<br />
At the other extreme from a ring is a fully connected network, where every<br />
processor has a bidirectional link to every other processor. For fully connected<br />
networks, the total network bandwidth is P × (P – 1)/2, and the bisection bandwidth<br />
is (P/2)².<br />
The tremendous improvement in performance of fully connected networks is<br />
offset by the tremendous increase in cost. This consequence inspires engineers<br />
to invent new topologies that are between the cost of rings <strong>and</strong> the performance<br />
of fully connected networks. The evaluation of success depends in large part on<br />
the nature of the communication in the workload of parallel programs run on the<br />
computer.<br />
The number of different topologies that have been discussed in publications<br />
would be difficult to count, but only a few have been used in commercial parallel<br />
processors. Figure 6.14 illustrates two of the popular topologies.<br />
An alternative to placing a processor at every node in a network is to leave only<br />
the switch at some of these nodes. The switches are smaller than processor-memory-switch<br />
nodes, and thus may be packed more densely, thereby lessening distance and<br />
increasing performance. Such networks are frequently called multistage networks<br />
to reflect the multiple steps that a message may travel. Types of multistage networks<br />
are as numerous as single-stage networks; Figure 6.15 illustrates two of the popular<br />
multistage organizations. A fully connected or crossbar network allows any<br />
node to communicate with any other node in one pass through the network. An<br />
Omega network uses less hardware than the crossbar network (2n log₂n versus n²<br />
switches), but contention can occur between messages, depending on the pattern<br />
network<br />
bandwidth Informally,<br />
the peak transfer rate of a<br />
network; can refer to the<br />
speed of a single link or<br />
the collective transfer rate<br />
of all links in the network.<br />
bisection<br />
bandwidth The<br />
bandwidth between<br />
two equal parts of<br />
a multiprocessor.<br />
This measure is for a<br />
worst case split of the<br />
multiprocessor.<br />
fully connected<br />
network A network<br />
that connects processor-memory<br />
nodes by<br />
supplying a dedicated<br />
communication link<br />
between every node.<br />
multistage network<br />
A network that supplies a<br />
small switch at each node.<br />
crossbar network<br />
A network that allows<br />
any node to communicate<br />
with any other node in<br />
one pass through the<br />
network.<br />
540 Chapter 6 Parallel Processors from Client to Cloud<br />
After covering the performance of networks at a low level of detail in this online<br />
section, the next section shows how to benchmark multiprocessors of all kinds<br />
with much higher-level programs.<br />
6.10<br />
Multiprocessor Benchmarks and<br />
Performance Models<br />
As we saw in Chapter 1, benchmarking systems is always a sensitive topic, because<br />
it is a highly visible way to try to determine which system is better. The results affect<br />
not only the sales of commercial systems, but also the reputation of the designers<br />
of those systems. Hence, all participants want to win the competition, but they also<br />
want to be sure that if someone else wins, they deserve to win because they have<br />
a genuinely better system. This desire leads to rules to ensure that the benchmark<br />
results are not simply engineering tricks for that benchmark, but are instead<br />
advances that improve performance of real applications.<br />
To avoid possible tricks, a typical rule is that you can’t change the benchmark.<br />
The source code <strong>and</strong> data sets are fixed, <strong>and</strong> there is a single proper answer. Any<br />
deviation from those rules makes the results invalid.<br />
Many multiprocessor benchmarks follow these traditions. A common exception<br />
is to be able to increase the size of the problem so that you can run the benchmark<br />
on systems with a widely different number of processors. That is, many benchmarks<br />
allow weak scaling rather than require strong scaling, even though you must take<br />
care when comparing results for programs running different problem sizes.<br />
Figure 6.16 gives a summary of several parallel benchmarks, also described below:<br />
■ Linpack is a collection of linear algebra routines, and the routines for<br />
performing Gaussian elimination constitute what is known as the Linpack<br />
benchmark. The DGEMM routine in the example on page 215 represents a<br />
small fraction of the source code of the Linpack benchmark, but it accounts<br />
for most of the execution time for the benchmark. It allows weak scaling,<br />
letting the user pick any size problem. Moreover, it allows the user to rewrite<br />
Linpack in almost any form and in any language, as long as it computes the<br />
proper result and performs the same number of floating-point operations<br />
for a given problem size. Twice a year, the 500 computers with the fastest<br />
Linpack performance are published at www.top500.org. The first on this list<br />
is considered by the press to be the world's fastest computer.<br />
■ SPECrate is a throughput metric based on the SPEC CPU benchmarks,<br />
such as SPEC CPU 2006 (see Chapter 1). Rather than report performance<br />
of the individual programs, SPECrate runs many copies of the program<br />
simultaneously. Thus, it measures task-level parallelism, as there is no<br />
communication among the tasks.<br />
<strong>Pthreads</strong> A UNIX API for creating and manipulating threads. It is structured as a library.<br />
■ The NAS (NASA Advanced Supercomputing) parallel benchmarks were<br />
another attempt from the 1990s to benchmark multiprocessors. Taken from<br />
computational fluid dynamics, they consist of five kernels. They allow weak<br />
scaling by defining a few data sets. Like Linpack, these benchmarks can be<br />
rewritten, but the rules require that the programming language be only C<br />
or Fortran.<br />
■ The recent PARSEC (Princeton Application Repository for Shared Memory<br />
Computers) benchmark suite consists of multithreaded programs that use<br />
Pthreads (POSIX threads) and OpenMP (Open MultiProcessing; see<br />
Section 6.5). They focus on emerging computational domains and consist of<br />
nine applications and three kernels. Eight rely on data parallelism, three rely<br />
on pipelined parallelism, and one on unstructured parallelism.<br />
■ On the cloud front, the goal of the Yahoo! Cloud Serving Benchmark (YCSB)<br />
is to compare performance of cloud data services. It offers a framework that<br />
makes it easy for a client to benchmark new data services, using Cassandra<br />
and HBase as representative examples [Cooper, 2010].<br />
The downside of such traditional restrictions to benchmarks is that innovation is<br />
chiefly limited to the architecture and compiler. Better data structures, algorithms,<br />
programming languages, and so on often cannot be used, since that would give a<br />
misleading result. The system could win because of, say, the algorithm, and not<br />
because of the hardware or the compiler.<br />
While these guidelines are understandable when the foundations of computing<br />
are relatively stable (as they were in the 1990s and the first half of this decade),<br />
they are undesirable during a programming revolution. For this revolution to<br />
succeed, we need to encourage innovation at all levels.<br />
Researchers at the University of California at Berkeley have advocated one<br />
approach. They identified 13 design patterns that they claim will be part of<br />
applications of the future. Frameworks or kernels implement these design<br />
patterns. Examples are sparse matrices, structured grids, finite-state machines,<br />
map reduce, and graph traversal. By keeping the definitions at a high level, they<br />
hope to encourage innovations at any level of the system. Thus, the system with the<br />
fastest sparse matrix solver is welcome to use any data structure, algorithm, and<br />
programming language, in addition to novel architectures and compilers.<br />
Performance Models<br />
A topic related to benchmarks is performance models. As we have seen with the<br />
increasing architectural diversity in this chapter—multithreading, SIMD, GPUs—<br />
it would be especially helpful if we had a simple model that offered insights into the<br />
performance of different architectures. It need not be perfect, just insightful.<br />
The 3Cs for cache performance from Chapter 5 is an example performance<br />
model. It is not a perfect performance model, since it ignores potentially important<br />
factors such as block size, block allocation policy, and block replacement policy.<br />
The Roofline Model<br />
This simple model ties floating-point performance, arithmetic intensity, and memory<br />
performance together in a two-dimensional graph [Williams, Waterman, and<br />
Patterson 2009]. Peak floating-point performance can be found using the hardware<br />
specifications mentioned above. The working sets of the kernels we consider here<br />
do not fit in on-chip caches, so peak memory performance may be defined by the<br />
memory system behind the caches. One way to find the peak memory performance<br />
is the Stream benchmark. (See the Elaboration on page 381 in Chapter 5.)<br />
Figure 6.18 shows the model, which is done once for a computer, not for each<br />
kernel. The vertical Y-axis is achievable floating-point performance from 0.5 to<br />
64.0 GFLOPs/second. The horizontal X-axis is arithmetic intensity, varying from<br />
1/8 FLOPs/DRAM byte accessed to 16 FLOPs/DRAM byte accessed. Note that the<br />
graph is a log-log scale.<br />
For a given kernel, we can find a point on the X-axis based on its arithmetic<br />
intensity. If we draw a vertical line through that point, the performance of the kernel<br />
on that computer must lie somewhere along that line. We can plot a horizontal line<br />
showing peak floating-point performance of the computer. Obviously, the actual<br />
floating-point performance can be no higher than the horizontal line, since that is<br />
a hardware limit.<br />
[Figure 6.18 graph: a log-log plot with Attainable GFLOPs/second (0.5 to 64.0) on the Y-axis and Arithmetic Intensity: FLOPs/Byte Ratio (1/8 to 16) on the X-axis. A diagonal “peak memory BW (Stream)” roof and a horizontal “peak floating-point performance” roof bound the plot; Kernel 1 falls under the diagonal (memory bandwidth limited) and Kernel 2 under the flat roof (computation limited).]<br />
FIGURE 6.18 Roofline Model [Williams, Waterman, <strong>and</strong> Patterson 2009]. This example has a<br />
peak floating-point performance of 16 GFLOPS/sec <strong>and</strong> a peak memory b<strong>and</strong>width of 16 GB/sec from the<br />
Stream benchmark. (Since Stream is actually four measurements, this line is the average of the four.) The<br />
dotted vertical line in color on the left represents Kernel 1, which has an arithmetic intensity of 0.5 FLOPs/<br />
byte. It is limited by memory b<strong>and</strong>width to no more than 8 GFLOPS/sec on this Opteron X2. The dotted<br />
vertical line to the right represents Kernel 2, which has an arithmetic intensity of 4 FLOPs/byte. It is limited<br />
only computationally to 16 GFLOPS/s. (This data is based on the AMD Opteron X2 (Revision F) using dual<br />
cores running at 2 GHz in a dual socket system.)
How could we plot the peak memory performance, which is measured in bytes/second?<br />
Since the X-axis is FLOPs/byte and the Y-axis FLOPs/second, bytes/second<br />
is just a diagonal line at a 45-degree angle in this figure. Hence, we can plot a third<br />
line that gives the maximum floating-point performance that the memory system<br />
of that computer can support for a given arithmetic intensity. We can express the<br />
limits as a formula to plot the line in the graph in Figure 6.18:<br />
Attainable GFLOPs/sec = Min (Peak Memory BW × Arithmetic Intensity, Peak<br />
Floating-Point Performance)<br />
The horizontal and diagonal lines give this simple model its name and indicate its<br />
value. The roofline sets an upper bound on performance of a kernel depending on<br />
its arithmetic intensity. Given a roofline of a computer, you can apply it repeatedly,<br />
since it doesn't vary by kernel.<br />
If we think of arithmetic intensity as a pole that hits the roof, either it hits<br />
the slanted part of the roof, which means performance is ultimately limited by<br />
memory b<strong>and</strong>width, or it hits the flat part of the roof, which means performance is<br />
computationally limited. In Figure 6.18, kernel 1 is an example of the former, <strong>and</strong><br />
kernel 2 is an example of the latter.<br />
Note that the ridge point, where the diagonal and horizontal roofs meet, offers<br />
an interesting insight into the computer. If it is far to the right, then only kernels<br />
with very high arithmetic intensity can achieve the maximum performance of<br />
that computer. If it is far to the left, then almost any kernel can potentially hit the<br />
maximum performance.<br />
Comparing Two Generations of Opterons<br />
The AMD Opteron X4 (Barcelona) with four cores is the successor to the Opteron<br />
X2 with two cores. To simplify board design, they use the same socket. Hence, they<br />
have the same DRAM channels and thus the same peak memory bandwidth. In<br />
addition to doubling the number of cores, the Opteron X4 also has twice the peak<br />
floating-point performance per core: Opteron X4 cores can issue two floating-point<br />
SSE2 instructions per clock cycle, while Opteron X2 cores issue at most one. As the<br />
two systems we're comparing have similar clock rates (2.2 GHz for the Opteron X2<br />
versus 2.3 GHz for the Opteron X4), the Opteron X4 has about four times the peak<br />
floating-point performance of the Opteron X2 with the same DRAM bandwidth.<br />
The Opteron X4 also has a 2MiB L3 cache, which is not found in the Opteron X2.<br />
In Figure 6.19 the roofline models for both systems are compared. As we would<br />
expect, the ridge point moves to the right, from 1 in the Opteron X2 to 5 in the<br />
Opteron X4. Hence, to see a performance gain in the next generation, kernels need<br />
an arithmetic intensity higher than 1, or their working sets must fit in the caches<br />
of the Opteron X4.<br />
The roofline model gives an upper bound to performance. Suppose your<br />
program is far below that bound. What optimizations should you perform, <strong>and</strong> in<br />
what order?
Elaboration: The ceilings are ordered so that lower ceilings are easier to optimize.<br />
Clearly, a programmer can optimize in any order, but following this sequence reduces the<br />
chances of wasting effort on an optimization that has no benefit due to other constraints.<br />
Like the 3Cs model, as long as the roofline model delivers on insights, a model can<br />
have assumptions that may prove optimistic. For example, the roofline assumes the load is<br />
balanced between all processors.<br />
Elaboration: An alternative to the Stream benchmark is to use the raw DRAM<br />
bandwidth as the roofline. While the raw bandwidth definitely is a hard upper bound,<br />
actual memory performance is often so far from that boundary that it's not that useful.<br />
That is, no program can come close to that bound. The downside to using Stream is that<br />
very careful programming may exceed the Stream results, so the memory roofline may<br />
not be as hard a limit as the computational roofline. We stick with Stream because few<br />
programmers will be able to deliver more memory bandwidth than Stream discovers.<br />
Elaboration: Although the roofline model shown is for multicore processors, it clearly<br />
would work for a uniprocessor as well.<br />
Check Yourself<br />
True or false: The main drawback with conventional approaches to benchmarks<br />
for parallel computers is that the rules that ensure fairness also slow software<br />
innovation.<br />
6.11 Real Stuff: Benchmarking and Rooflines of the Intel Core i7 960 and the NVIDIA Tesla GPU<br />
A group of Intel researchers published a paper [Lee et al., 2010] comparing a<br />
quad-core Intel Core i7 960 with multimedia SIMD extensions to the previous<br />
generation GPU, the NVIDIA Tesla GTX 280. Figure 6.22 lists the characteristics<br />
of the two systems. Both products were purchased in Fall 2009. The Core i7 is<br />
in Intel's 45-nanometer semiconductor technology while the GPU is in TSMC's<br />
65-nanometer technology. Although it might have been fairer to have a comparison<br />
by a neutral party or by both interested parties, the purpose of this section is not to<br />
determine how much faster one product is than another, but to try to understand<br />
the relative value of features of these two contrasting architecture styles.<br />
The rooflines of the Core i7 960 <strong>and</strong> GTX 280 in Figure 6.23 illustrate the<br />
differences in the computers. Not only does the GTX 280 have much higher<br />
memory bandwidth and double-precision floating-point performance, but also its<br />
double-precision ridge point is considerably to the left. The double-precision ridge<br />
point is 0.6 for the GTX 280 versus 3.1 for the Core i7. As mentioned above, it is<br />
much easier to hit peak computational performance the further the ridge point of
Characteristic | Core i7-960 | GTX 280 | GTX 480 | Ratio 280/i7 | Ratio 480/i7<br />
Number of processing elements (cores or SMs) | 4 | 30 | 15 | 7.5 | 3.8<br />
Clock frequency (GHz) | 3.2 | 1.3 | 1.4 | 0.41 | 0.44<br />
Die size | 263 | 576 | 520 | 2.2 | 2.0<br />
Technology | Intel 45 nm | TSMC 65 nm | TSMC 40 nm | 1.6 | 1.0<br />
Power (chip, not module) | 130 | 130 | 167 | 1.0 | 1.3<br />
Transistors | 700 M | 1400 M | 3030 M | 2.0 | 4.4<br />
Memory bandwidth (GBytes/sec) | 32 | 141 | 177 | 4.4 | 5.5<br />
Single-precision SIMD width | 4 | 8 | 32 | 2.0 | 8.0<br />
Double-precision SIMD width | 2 | 1 | 16 | 0.5 | 8.0<br />
Peak single-precision scalar FLOPS (GFLOP/sec) | 26 | 117 | 63 | 4.6 | 2.5<br />
Peak single-precision SIMD FLOPS (GFLOP/sec) | 102 | 311 to 933 | 515 or 1344 | 3.0–9.1 | 6.6–13.1<br />
(SP 1 add or multiply) | N.A. | (311) | (515) | (3.0) | (6.6)<br />
(SP 1 instruction fused multiply-adds) | N.A. | (622) | (1344) | (6.1) | (13.1)<br />
(Rare SP dual issue fused multiply-add and multiply) | N.A. | (933) | N.A. | (9.1) | –<br />
Peak double-precision SIMD FLOPS (GFLOP/sec) | 51 | 78 | 515 | 1.5 | 10.1<br />
FIGURE 6.22 Intel Core i7-960, NVIDIA GTX 280, and GTX 480 specifications. The rightmost columns show the ratios of the<br />
Tesla GTX 280 and the Fermi GTX 480 to Core i7. Although the case study is between the Tesla 280 and i7, we include the Fermi 480 to show<br />
its relationship to the Tesla 280 since it is described in this chapter. Note that these memory bandwidths are higher than in Figure 6.23 because<br />
these are DRAM pin bandwidths and those in Figure 6.23 are at the processors as measured by a benchmark program. (From Table 2 in Lee<br />
et al. [2010].)<br />
the roofline is to the left. For single-precision performance, the ridge point moves<br />
far to the right for both computers, so it's much harder to hit the roof of single-precision performance.<br />
Note that the arithmetic intensity of the kernel is based on<br />
the bytes that go to main memory, not the bytes that go to cache memory. Thus,<br />
as mentioned above, caching can change the arithmetic intensity of a kernel on a<br />
particular computer, if most references really go to the cache. Note also that this<br />
bandwidth is for unit-stride accesses in both architectures. Real gather-scatter<br />
addresses can be slower on the GTX 280 and on the Core i7, as we shall see.<br />
The researchers selected the benchmark programs by analyzing the computational<br />
and memory characteristics of four recently proposed benchmark suites and then<br />
formulated the set of throughput computing kernels that capture these characteristics.<br />
Figure 6.24 shows the performance results, with larger numbers meaning faster. The<br />
Rooflines help explain the relative performance in this case study.<br />
Given that the raw performance specifications of the GTX 280 vary from 2.5 ×<br />
slower (clock rate) to 7.5 × faster (cores per chip) while the performance varies
[Figure 6.23 graphs: four roofline plots on log-log axes (X-axis: Arithmetic intensity, 1/8 to 32; Y-axis: GFlop/s). Top row, double precision: Core i7 960 (Nehalem) with Stream = 16.4 GB/s and a 51.2 GF/s ceiling; NVIDIA GTX280 with Stream = 127 GB/s and a peak of 78 GF/s. Bottom row, single precision: Core i7 960 with ceilings of 102.4 GF/s (SP) and 51.2 GF/s (DP); NVIDIA GTX280 with ceilings of 624 GF/s (SP) and 78 GF/s (DP).]<br />
FIGURE 6.23 Roofline model [Williams, Waterman, and Patterson 2009]. These rooflines show double-precision floating-point<br />
performance in the top row and single-precision performance in the bottom row. (The DP FP performance ceiling is also in the bottom row<br />
to give perspective.) The Core i7 960 on the left has a peak DP FP performance of 51.2 GFLOP/sec, a SP FP peak of 102.4 GFLOP/sec, and a<br />
peak memory bandwidth of 16.4 GBytes/sec. The NVIDIA GTX 280 has a DP FP peak of 78 GFLOP/sec, SP FP peak of 624 GFLOP/sec, and<br />
127 GBytes/sec of memory bandwidth. The dashed vertical line on the left represents an arithmetic intensity of 0.5 FLOP/byte. It is limited by<br />
memory bandwidth to no more than 8 DP GFLOP/sec or 8 SP GFLOP/sec on the Core i7. The dashed vertical line to the right has an arithmetic<br />
intensity of 4 FLOP/byte. It is limited only computationally to 51.2 DP GFLOP/sec and 102.4 SP GFLOP/sec on the Core i7 and 78 DP GFLOP/<br />
sec and 512 SP GFLOP/sec on the GTX 280. To hit the highest computation rate on the Core i7 you need to use all 4 cores and SSE instructions<br />
with an equal number of multiplies and adds. For the GTX 280, you need to use fused multiply-add instructions on all multithreaded SIMD<br />
processors.<br />
Kernel | Units | Core i7-960 | GTX 280 | GTX 280/i7-960<br />
SGEMM | GFLOP/sec | 94 | 364 | 3.9<br />
MC | Billion paths/sec | 0.8 | 1.4 | 1.8<br />
Conv | Million pixels/sec | 1250 | 3500 | 2.8<br />
FFT | GFLOP/sec | 71.4 | 213 | 3.0<br />
SAXPY | GBytes/sec | 16.8 | 88.8 | 5.3<br />
LBM | Million lookups/sec | 85 | 426 | 5.0<br />
Solv | Frames/sec | 103 | 52 | 0.5<br />
SpMV | GFLOP/sec | 4.9 | 9.1 | 1.9<br />
GJK | Frames/sec | 67 | 1020 | 15.2<br />
Sort | Million elements/sec | 250 | 198 | 0.8<br />
RC | Frames/sec | 5 | 8.1 | 1.6<br />
Search | Million queries/sec | 50 | 90 | 1.8<br />
Hist | Million pixels/sec | 1517 | 2583 | 1.7<br />
Bilat | Million pixels/sec | 83 | 475 | 5.7<br />
FIGURE 6.24 Raw and relative performance measured for the two platforms. In this study,<br />
SAXPY is just used as a measure of memory bandwidth, so the right unit is GBytes/sec and not GFLOP/sec.<br />
(Based on Table 3 in [Lee et al., 2010].)<br />
from 2.0 × slower (Solv) to 15.2 × faster (GJK), the Intel researchers decided to<br />
find the reasons for the differences:<br />
■ Memory bandwidth. The GPU has 4.4 × the memory bandwidth, which helps<br />
explain why LBM and SAXPY run 5.0 and 5.3 × faster; their working sets are<br />
hundreds of megabytes and hence don't fit into the Core i7 cache. (So as to<br />
access memory intensively, they purposely did not use cache blocking as in<br />
Chapter 5.) Hence, the slope of the rooflines explains their performance. SpMV<br />
also has a large working set, but it only runs 1.9 × faster because the double-precision<br />
floating point of the GTX 280 is only 1.5 × as fast as that of the Core i7.<br />
■ Compute bandwidth. Five of the remaining kernels are compute bound:<br />
SGEMM, Conv, FFT, MC, and Bilat. The GTX is faster by 3.9, 2.8, 3.0, 1.8, and<br />
5.7 ×, respectively. The first three of these use single-precision floating-point<br />
arithmetic, and GTX 280 single precision is 3 to 6 × faster. MC uses double<br />
precision, which explains why it's only 1.8 × faster since DP performance<br />
is only 1.5 × faster. Bilat uses transcendental functions, which the GTX<br />
280 supports directly. The Core i7 spends two-thirds of its time calculating<br />
transcendental functions for Bilat, so the GTX 280 is 5.7 × faster. This<br />
observation helps point out the value of hardware support for operations that<br />
occur in your workload: double-precision floating point and perhaps even<br />
transcendentals.<br />
■ Cache benefits. Ray casting (RC) is only 1.6 × faster on the GTX because<br />
cache blocking with the Core i7 caches prevents it from becoming memory<br />
bandwidth bound (see Sections 5.4 and 5.14), as it is on GPUs. Cache<br />
blocking can help Search, too. If the index trees are small so that they fit in<br />
the cache, the Core i7 is twice as fast. Larger index trees make them memory<br />
bandwidth bound. Overall, the GTX 280 runs Search 1.8 × faster. Cache<br />
blocking also helps Sort. While most programmers wouldn't run Sort on<br />
a SIMD processor, it can be written with a 1-bit Sort primitive called split.<br />
However, the split algorithm executes many more instructions than a scalar<br />
sort does. As a result, the Core i7 runs 1.25 × as fast as the GTX 280. Note<br />
that caches also help other kernels on the Core i7, since cache blocking allows<br />
SGEMM, FFT, and SpMV to become compute bound. This observation re-emphasizes<br />
the importance of cache blocking optimizations in Chapter 5.<br />
■ Gather-Scatter. The multimedia SIMD extensions are of little help if the data are<br />
scattered throughout main memory; optimal performance comes only when<br />
accesses are to data aligned on 16-byte boundaries. Thus, GJK gets little benefit<br />
from SIMD on the Core i7. As mentioned above, GPUs offer gather-scatter<br />
addressing that is found in a vector architecture but omitted from most SIMD<br />
extensions. The memory controller even batches accesses to the same DRAM<br />
page together (see Section 5.2). This combination means the GTX 280 runs GJK<br />
a startling 15.2 × as fast as the Core i7, which is larger than any single physical<br />
parameter in Figure 6.22. This observation reinforces the importance of gather-scatter<br />
to vector and GPU architectures that is missing from SIMD extensions.<br />
■ Synchronization. The performance of synchronization is limited by atomic<br />
updates, which are responsible for 28% of the total runtime on the Core i7<br />
despite its having a hardware fetch-and-increment instruction. Thus, Hist is only<br />
1.7 × faster on the GTX 280. Solv solves a batch of independent constraints in<br />
a small amount of computation followed by barrier synchronization. The Core<br />
i7 benefits from the atomic instructions and a memory consistency model that<br />
ensures the right results even if not all previous accesses to the memory hierarchy<br />
have completed. Without the memory consistency model, the GTX 280<br />
version launches some batches from the system processor, which leads to the<br />
GTX 280 running 0.5 × as fast as the Core i7. This observation points out how<br />
synchronization performance can be important for some data parallel problems.<br />
It is striking how often weaknesses in the Tesla GTX 280 that were uncovered by<br />
kernels selected by Intel researchers were already being addressed in the successor<br />
architecture to Tesla: Fermi has faster double-precision floating-point performance,<br />
faster atomic operations, and caches. It was also interesting that the gather-scatter<br />
support of vector architectures that predate the SIMD instructions by decades was<br />
so important to the effective usefulness of these SIMD extensions, which some had<br />
predicted before the comparison. The Intel researchers noted that 6 of the 14 kernels<br />
would exploit SIMD better with more efficient gather-scatter support on the Core<br />
i7. This study certainly establishes the importance of cache blocking as well.
Now that we've seen a wide range of results of benchmarking different<br />
multiprocessors, let’s return to our DGEMM example to see in detail how much we<br />
have to change the C code to exploit multiple processors.<br />
6.12 Going Faster: Multiple Processors and Matrix Multiply<br />
This section is the final <strong>and</strong> largest step in our incremental performance journey of<br />
adapting DGEMM to the underlying hardware of the Intel Core i7 (Sandy Bridge).<br />
Each Core i7 has 8 cores, and the computer we have been using has 2 Core i7s.<br />
Thus, we have 16 cores on which to run DGEMM.<br />
Figure 6.25 shows the OpenMP version of DGEMM that utilizes those cores.<br />
Note that line 30 is the single line added to Figure 5.48 to make this code run on<br />
multiple processors: an OpenMP pragma that tells the compiler to use multiple<br />
threads in the outermost for loop. It tells the computer to spread the work of the<br />
outermost loop across all the threads.<br />
Figure 6.26 plots a classic multiprocessor speedup graph, showing the<br />
performance improvement versus a single thread as the number of threads increases.<br />
This graph makes it easy to see the challenges of strong scaling versus weak scaling.<br />
When everything fits in the first-level data cache, as is the case for 32 × 32 matrices,<br />
adding threads actually hurts performance. The 16-threaded version of DGEMM<br />
is almost half as fast as the single-threaded version in this case. In contrast, the two<br />
largest matrices get a 14 × speedup from 16 threads, and hence the classic two “up<br />
and to the right” lines in Figure 6.26.<br />
Figure 6.27 shows the absolute performance increase as we increase the number<br />
of threads from 1 to 16. DGEMM now operates at 174 GFLOPS for 960 × 960<br />
matrices. As our unoptimized C version of DGEMM in Figure 3.21 ran this code at<br />
just 0.8 GFLOPS, the optimizations in Chapters 3 to 6 that tailor the code to the<br />
underlying hardware result in a speedup of over 200 times!<br />
Next up are our warnings about the fallacies and pitfalls of multiprocessing. The<br />
computer architecture graveyard is filled with parallel processing projects that have<br />
ignored them.<br />
Elaboration: These results are with Turbo mode turned off. We are using a dual-chip<br />
system here, so not surprisingly, we can get the full Turbo speedup (3.3/2.6<br />
= 1.27) with either 1 thread (only 1 core on one of the chips) or 2 threads (1 core per<br />
chip). As we increase the number of threads and hence the number of active cores, the<br />
benefit of Turbo mode decreases, as there is less of the power budget to spend on the<br />
active cores. For 4 threads the average Turbo speedup is 1.23, for 8 it is 1.13, and for<br />
16 it is 1.11.<br />
556 Chapter 6 Parallel Processors from Client to Cloud<br />
1  #include <x86intrin.h><br />
2  #define UNROLL (4)<br />
3  #define BLOCKSIZE 32<br />
4  void do_block (int n, int si, int sj, int sk,<br />
5                 double *A, double *B, double *C)<br />
6  {<br />
7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )<br />
8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {<br />
9        __m256d c[4];<br />
10       for ( int x = 0; x < UNROLL; x++ )<br />
11         c[x] = _mm256_load_pd(C+i+x*4+j*n);<br />
12         /* c[x] = C[i][j] */<br />
13       for( int k = sk; k < sk+BLOCKSIZE; k++ )<br />
14       {<br />
15         __m256d b = _mm256_broadcast_sd(B+k+j*n);<br />
16         /* b = B[k][j] */<br />
17         for (int x = 0; x < UNROLL; x++)<br />
18           c[x] = _mm256_add_pd(c[x], /* c[x]+=A[i][k]*b */<br />
19                  _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));<br />
20       }<br />
21<br />
22       for ( int x = 0; x < UNROLL; x++ )<br />
23         _mm256_store_pd(C+i+x*4+j*n, c[x]);<br />
24         /* C[i][j] = c[x] */<br />
25     }<br />
26 }<br />
27<br />
28 void dgemm (int n, double* A, double* B, double* C)<br />
29 {<br />
30 #pragma omp parallel for<br />
31   for ( int sj = 0; sj < n; sj += BLOCKSIZE )<br />
32     for ( int si = 0; si < n; si += BLOCKSIZE )<br />
33       for ( int sk = 0; sk < n; sk += BLOCKSIZE )<br />
34         do_block(n, si, sj, sk, A, B, C);<br />
35 }<br />
FIGURE 6.25 OpenMP version of DGEMM from Figure 5.48. The #pragma omp parallel for in dgemm is the only OpenMP code, making<br />
the outermost for loop operate in parallel. It is the only difference from Figure 5.48.<br />
Elaboration: Although the Sandy Bridge supports two hardware threads per core, we<br />
do not get more performance from 32 threads. The reason is that a single AVX hardware<br />
unit is shared between the two threads multiplexed onto one core, so assigning two threads<br />
per core actually hurts performance due to the multiplexing overhead.
6.13 Fallacies and Pitfalls 559<br />
One frequently encountered problem occurs when software designed for a<br />
uniprocessor is adapted to a multiprocessor environment. For example, the Silicon<br />
Graphics operating system originally protected the page table with a single lock,<br />
assuming that page allocation is infrequent. In a uniprocessor, this does not<br />
represent a performance problem. In a multiprocessor, it can become a major<br />
performance bottleneck for some programs. Consider a program that uses a large<br />
number of pages that are initialized at start-up, which UNIX does for statically<br />
allocated pages. Suppose the program is parallelized so that multiple processes<br />
allocate the pages. Because page allocation requires the use of the page table, which<br />
is locked whenever it is in use, even an OS kernel that allows multiple threads in the<br />
OS will be serialized if the processes all try to allocate their pages at once (which is<br />
exactly what we might expect at initialization time!).<br />
This page table serialization eliminates parallelism in initialization and has a<br />
significant impact on overall parallel performance. This performance bottleneck<br />
persists even for task-level parallelism. For example, suppose we split the parallel<br />
processing program apart into separate jobs and run them, one job per processor,<br />
so that there is no sharing between the jobs. (This is exactly what one user did,<br />
since he reasonably believed that the performance problem was due to unintended<br />
sharing or interference in his application.) Unfortunately, the lock still serializes all<br />
the jobs, so even the independent job performance is poor.<br />
This pitfall indicates the kind of subtle but significant performance bugs<br />
that can arise when software runs on multiprocessors. Like many other key<br />
software components, the OS algorithms and data structures must be rethought<br />
in a multiprocessor context. Placing locks on smaller portions of the page table<br />
effectively eliminated the problem.<br />
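One way to picture the finer-grained locking just described is to hash each page number to one of an array of locks, so that independent allocations rarely contend. The sketch below is only illustrative, with invented names (NLOCKS, page_lock_for); it is not the actual Silicon Graphics fix:

```c
#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64   /* number of lock "stripes"; a power of two for the mask */

static pthread_mutex_t page_locks[NLOCKS];

/* Initialize the stripe locks once at startup. */
void page_locks_init(void)
{
    for (int i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&page_locks[i], NULL);
}

/* Map a page number to its stripe lock. Two threads touching different
 * stripes proceed in parallel instead of serializing on one big lock. */
pthread_mutex_t *page_lock_for(uintptr_t page_number)
{
    return &page_locks[page_number & (NLOCKS - 1)];
}

void update_page_entry(uintptr_t page_number)
{
    pthread_mutex_t *m = page_lock_for(page_number);
    pthread_mutex_lock(m);
    /* ... modify only this page's entry here ... */
    pthread_mutex_unlock(m);
}
```

The design trade-off is the classic one: more stripes mean less contention but more memory for locks, and any operation that must see the whole table consistently now has to acquire every stripe.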
Fallacy: You can get good vector performance without providing memory<br />
bandwidth.<br />
As we saw with the Roofline model, memory bandwidth is quite important to<br />
all architectures. DAXPY requires 1.5 memory references per floating-point<br />
operation, and this ratio is typical of many scientific codes. Even if the floating-point<br />
operations took no time, a Cray-1 could not increase the DAXPY performance of<br />
the vector sequence used, since it was memory limited. The Cray-1 performance on<br />
Linpack jumped when the compiler used blocking to change the computation so<br />
that values could be kept in the vector registers. This approach lowered the number<br />
of memory references per FLOP and improved the performance by nearly a factor<br />
of two! Thus, the memory bandwidth on the Cray-1 became sufficient for a loop<br />
that formerly required more bandwidth, which is just what the Roofline model<br />
would predict.
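DAXPY is small enough to state in full, and counting its memory traffic shows where the 1.5 references per FLOP comes from: each iteration performs 2 floating-point operations (one multiply, one add) but 3 memory references (load x[i], load y[i], store y[i]), so 3/2 = 1.5.

```c
/* DAXPY: y = a*x + y on n-element double-precision vectors.
 * Per iteration: 2 FLOPs, 3 memory references -> arithmetic intensity
 * of 2/3 FLOP per reference, which leaves the loop memory-bound on
 * machines without ample bandwidth. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Blocking, as the Cray-1 compiler did, raises the FLOP count per memory reference by keeping intermediate values in registers, which is exactly what moves a kernel rightward under the Roofline.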
■ In the past, microprocessors and multiprocessors were subject to<br />
different definitions of success. When scaling uniprocessor performance,<br />
microprocessor architects were happy if single thread performance went up<br />
by the square root of the increased silicon area. Thus, they were happy with<br />
sublinear performance in terms of resources. Multiprocessor success used<br />
to be defined as linear speed-up as a function of the number of processors,<br />
assuming that the cost of purchase or cost of administration of n processors<br />
was n times as much as one processor. Now that parallelism is happening on-chip<br />
via multicore, we can use the traditional microprocessor metric of being<br />
successful with sublinear performance improvement.<br />
■ The success of just-in-time runtime compilation and autotuning makes it<br />
feasible to think of software adapting itself to take advantage of the increasing<br />
number of cores per chip, which provides flexibility that is not available when<br />
limited to static compilers.<br />
■ Unlike in the past, the open source movement has become a critical portion<br />
of the software industry. This movement is a meritocracy, where better<br />
engineering solutions can win the mind share of the developers over legacy<br />
concerns. It also embraces innovation, inviting change to old software and<br />
welcoming new languages and software products. Such an open culture could<br />
be extremely helpful in this time of rapid change.<br />
To motivate readers to embrace this revolution, we demonstrated the potential<br />
of parallelism concretely for matrix multiply on the Intel Core i7 (Sandy Bridge) in<br />
the Going Faster sections of Chapters 3 to 6:<br />
■ Data-level parallelism in Chapter 3 improved performance by a factor of 3.85<br />
by executing four 64-bit floating-point operations in parallel using the 256-<br />
bit operands of the AVX instructions, demonstrating the value of SIMD.<br />
■ Instruction-level parallelism in Chapter 4 pushed performance up by another<br />
factor of 2.3 by unrolling loops 4 times to give the out-of-order execution<br />
hardware more instructions to schedule.<br />
■ Cache optimizations in Chapter 5 improved performance of matrices that<br />
didn’t fit into the L1 data cache by another factor of 2.0 to 2.5 by using cache<br />
blocking to reduce cache misses.<br />
■ Thread-level parallelism in this chapter improved performance of matrices<br />
that don’t fit into a single L1 data cache by another factor of 4 to 14 by utilizing<br />
all 16 cores of our multicore chips, demonstrating the value of MIMD. We<br />
did this by adding a single line using an OpenMP pragma.<br />
Using the ideas in this book and tailoring the software to this computer added<br />
24 lines of code to DGEMM. For the matrix sizes of 32x32, 160x160, 480x480, and<br />
960x960, the overall performance speedup from these ideas realized in those two<br />
dozen lines of code is factors of 8, 39, 129, and 212!
backpack and then carry them “in parallel”). For each of your activities, discuss<br />
whether it is already being done in parallel; if not, explain why not.<br />
6.1.2 [5] Next, consider which of the activities could be carried out<br />
concurrently (e.g., eating breakfast and listening to the news). For each of your<br />
activities, describe which other activity could be paired with this activity.<br />
6.1.3 [5] For 6.1.2, what could we change about current systems (e.g.,<br />
showers, clothes, TVs, cars) so that we could perform more tasks in parallel?<br />
6.1.4 [5] Estimate how much shorter time it would take to carry out these<br />
activities if you tried to carry out as many tasks in parallel as possible.<br />
6.2 You are trying to bake 3 blueberry pound cakes. Cake ingredients are as<br />
follows:<br />
1 cup butter, softened<br />
1 cup sugar<br />
4 large eggs<br />
1 teaspoon vanilla extract<br />
1/2 teaspoon salt<br />
1/4 teaspoon nutmeg<br />
1 1/2 cups flour<br />
1 cup blueberries<br />
The recipe for a single cake is as follows:<br />
Step 1: Preheat oven to 325°F (160°C). Grease and flour your cake pan.<br />
Step 2: In a large bowl, beat together with a mixer butter and sugar at medium<br />
speed until light and fluffy. Add eggs, vanilla, salt, and nutmeg. Beat until<br />
thoroughly blended. Reduce mixer speed to low and add flour, 1/2 cup at a time,<br />
beating just until blended.<br />
Step 3: Gently fold in blueberries. Spread evenly in prepared baking pan. Bake<br />
for 60 minutes.<br />
6.2.1 [5] Your job is to cook 3 cakes as efficiently as possible. Assuming<br />
that you only have one oven large enough to hold one cake, one large bowl, one<br />
cake pan, <strong>and</strong> one mixer, come up with a schedule to make three cakes as quickly<br />
as possible. Identify the bottlenecks in completing this task.<br />
6.2.2 [5] Assume now that you have three bowls, three cake pans, and three mixers.<br />
How much faster is the process now that you have additional resources?
6.16 Exercises 565<br />
6.2.3 [5] Assume now that you have two friends that will help you cook,<br />
and that you have a large oven that can accommodate all three cakes. How will this<br />
change the schedule you arrived at in Exercise 6.2.1 above?<br />
6.2.4 [5] Compare the cake-making task to computing 3 iterations<br />
of a loop on a parallel computer. Identify data-level parallelism and task-level<br />
parallelism in the cake-making loop.<br />
6.3 Many computer applications involve searching through a set of data and<br />
sorting the data. A number of efficient searching and sorting algorithms have been<br />
devised in order to reduce the runtime of these tedious tasks. In this problem we<br />
will consider how best to parallelize these tasks.<br />
6.3.1 [10] Consider the following binary search algorithm (a classic divide<br />
and conquer algorithm) that searches for a value X in a sorted N-element array A<br />
and returns the index of the matched entry:<br />
BinarySearch(A[0..N−1], X) {
   low = 0
   high = N − 1
   while (low <= high) {
      mid = (low + high) / 2
      if (A[mid] > X)
         high = mid − 1
      else if (A[mid] < X)
         low = mid + 1
      else
         return mid   // found
   }
   return −1   // not found
}
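For readers who want to experiment, the pseudocode above translates directly into C. This is our rendering (with the usual overflow-safe midpoint), not part of the exercise:

```c
/* Binary search over a sorted array A of length n.
 * Returns the index of X, or -1 if X is absent. */
int binary_search(const int *A, int n, int X)
{
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;  /* avoids overflow of low+high */
        if (A[mid] > X)
            high = mid - 1;
        else if (A[mid] < X)
            low = mid + 1;
        else
            return mid;                    /* found */
    }
    return -1;                             /* not found */
}
```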
continue until we have sublists of length 1. Then, starting with sublists of length<br />
1, “merge” the two sublists into a single sorted list.<br />
Mergesort(m)
   var list left, right, result
   if length(m) ≤ 1
      return m
   else
      var middle = length(m) / 2
      for each x in m up to middle
         add x to left
      for each x in m after middle
         add x to right
      left = Mergesort(left)
      right = Mergesort(right)
      result = Merge(left, right)
      return result
The merge step is carried out by the following code:<br />
Merge(left, right)
   var list result
   while length(left) > 0 and length(right) > 0
      if first(left) ≤ first(right)
         append first(left) to result
         left = rest(left)
      else
         append first(right) to result
         right = rest(right)
   if length(left) > 0
      append left to result
   if length(right) > 0
      append right to result
   return result
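As a concrete reference, here is the merge step in C, recast for array slices rather than the lists of the pseudocode (an assumption on our part, made so the routine stays short):

```c
/* Merge two sorted runs, left[0..nl) and right[0..nr), into out,
 * which must have room for nl + nr elements. */
void merge(const int *left, int nl, const int *right, int nr, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < nl && j < nr)                 /* take the smaller head */
        out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    while (i < nl) out[k++] = left[i++];     /* drain whichever run  */
    while (j < nr) out[k++] = right[j++];    /* is still nonempty    */
}
```

Using `<=` when the heads tie keeps the sort stable, matching the pseudocode's `first(left) ≤ first(right)` test.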
6.5.1 [10] Assume that you have Y cores on a multicore processor to run<br />
MergeSort. Assuming that Y is much smaller than length(m), express the speedup<br />
factor you might expect to obtain for values of Y <strong>and</strong> length(m). Plot these on a<br />
graph.<br />
6.5.2 [10] Next, assume that Y is equal to length(m). How would this<br />
affect the conclusions in your previous answer? If you were tasked with obtaining<br />
the best speedup factor possible (i.e., strong scaling), explain how you might<br />
change this code to obtain it.
6.6 Matrix multiplication plays an important role in a number of applications.<br />
Two matrices can only be multiplied if the number of columns of the first matrix is<br />
equal to the number of rows in the second.<br />
Let’s assume we have an m × n matrix A and we want to multiply it by an n × p<br />
matrix B. We can express their product as an m × p matrix denoted by AB (or A ⋅ B).<br />
If we assign C = AB, and c_{i,j} denotes the entry in C at position (i, j), then<br />
c_{i,j} = Σ_{k=1}^{n} a_{i,k} b_{k,j} for each<br />
element i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can parallelize<br />
the computation of C. Assume that matrices are laid out in memory sequentially as<br />
follows: a_{1,1}, a_{2,1}, a_{3,1}, a_{4,1}, …, etc.<br />
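For concreteness, the definition above can be written as a naive triple loop. This sketch is ours, not part of the exercise; it uses the column-major layout the exercise specifies (element a_{i,k} of the m × n matrix A lives at A[i + k*m]):

```c
/* C = A * B, where A is m x n, B is n x p, C is m x p,
 * all stored column-major as in the exercise's memory layout. */
void matmul(int m, int n, int p,
            const double *A, const double *B, double *C)
{
    for (int j = 0; j < p; j++)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i + k*m] * B[k + j*n];  /* c_ij += a_ik * b_kj */
            C[i + j*m] = sum;
        }
}
```

Because every c_{i,j} is independent of the others, the i and j loops are natural candidates for the parallelization the following parts ask about.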
6.6.1 [10] Assume that we are going to compute C on both a single core<br />
shared memory machine <strong>and</strong> a 4-core shared-memory machine. Compute the<br />
speedup we would expect to obtain on the 4-core machine, ignoring any memory<br />
issues.<br />
6.6.2 [10] Repeat Exercise 6.6.1, assuming that updates to C incur a cache<br />
miss due to false sharing when consecutive elements in a row (i.e., index i) are<br />
updated.<br />
6.6.3 [10] How would you fix the false sharing issue that can occur?<br />
6.7 Consider the following portions of two different programs running at the<br />
same time on four processors in a symmetric multicore processor (SMP). Assume<br />
that before this code is run, both x <strong>and</strong> y are 0.<br />
Core 1: x = 2;<br />
Core 2: y = 2;<br />
Core 3: w = x + y + 1;<br />
Core 4: z = x + y;<br />
6.7.1 [10] What are all the possible resulting values of w, x, y, and z? For<br />
each possible outcome, explain how we might arrive at those values. You will need<br />
to examine all possible interleavings of instructions.<br />
6.7.2 [5] How could you make the execution more deterministic so that<br />
only one set of values is possible?<br />
6.8 The dining philosophers problem is a classic problem of synchronization and<br />
concurrency. The general problem is stated as philosophers sitting at a round table<br />
doing one of two things: eating or thinking. When they are eating, they are not<br />
thinking, and when they are thinking, they are not eating. There is a bowl of pasta<br />
in the center. A fork is placed between each pair of philosophers. The result is that each<br />
philosopher has one fork to her left and one fork to her right. Given the nature of<br />
eating pasta, the philosopher needs two forks to eat, and can only use the forks on<br />
her immediate left and right. The philosophers do not speak to one another.<br />
Assume all instructions take a single cycle to execute unless noted otherwise or<br />
they encounter a hazard.<br />
6.9.1 [10] Assume that you have 1 SS CPU. How many cycles will it take to<br />
execute these two threads? How many issue slots are wasted due to hazards?<br />
6.9.2 [10] Now assume you have 2 SS CPUs. How many cycles will it take<br />
to execute these two threads? How many issue slots are wasted due to hazards?<br />
6.9.3 [10] Assume that you have 1 MT CPU. How many cycles will it take<br />
to execute these two threads? How many issue slots are wasted due to hazards?<br />
6.10 Virtualization software is being aggressively deployed to reduce the costs of<br />
managing today’s high-performance servers. Companies like VMware, Microsoft,<br />
and IBM have all developed a range of virtualization products. The general concept,<br />
described in Chapter 5, is that a hypervisor layer can be introduced between the<br />
hardware and the operating system to allow multiple operating systems to share<br />
the same physical hardware. The hypervisor layer is then responsible for allocating<br />
CPU and memory resources, as well as handling services typically handled by the<br />
operating system (e.g., I/O).<br />
Virtualization provides an abstract view of the underlying hardware to the hosted<br />
operating system and application software. This will require us to rethink how<br />
multi-core and multiprocessor systems will be designed in the future to support<br />
the sharing of CPUs and memories by a number of operating systems concurrently.<br />
6.10.1 [30] Select two hypervisors on the market today, and compare<br />
and contrast how they virtualize and manage the underlying hardware (CPUs and<br />
memory).<br />
6.10.2 [15] Discuss what changes may be necessary in future multi-core<br />
CPU platforms in order to better match the resource demands placed on these<br />
systems. For instance, can multithreading play an effective role in alleviating the<br />
competition for computing resources?<br />
6.11 We would like to execute the loop below as efficiently as possible. We have<br />
two different machines, a MIMD machine and a SIMD machine.<br />
for (i=0; i < 2000; i++)<br />
for (j=0; j
6.12 A systolic array is an example of an MISD machine. A systolic array is a<br />
pipelined network or “wavefront” of data processing elements. None of these elements<br />
needs a program counter, since execution is triggered by the arrival of data.<br />
Clocked systolic arrays compute in “lock-step”, with each processor undertaking<br />
alternate compute and communication phases.<br />
6.12.1 [10] Consider proposed implementations of a systolic array (you<br />
can find these on the Internet or in technical publications). Then attempt to<br />
program the loop provided in Exercise 6.11 using this MISD model. Discuss any<br />
difficulties you encounter.<br />
6.12.2 [10] Discuss the similarities and differences between an MISD and a<br />
SIMD machine. Answer this question in terms of data-level parallelism.<br />
6.13 Assume we want to execute the DAXPY loop shown on page 511 in MIPS<br />
assembly on the NVIDIA 8800 GTX GPU described in this chapter. In this problem,<br />
we will assume that all math operations are performed on single-precision floating-<br />
point numbers (we will rename the loop SAXPY). Assume that instructions take<br />
the following number of cycles to execute.<br />
Loads: 5   Stores: 2   Add.S: 3   Mult.S: 4<br />
6.13.1 [20] Describe how you will construct warps for the SAXPY loop<br />
to exploit the 8 cores provided in a single multiprocessor.<br />
6.14 Download the CUDA Toolkit and SDK from http://www.nvidia.com/object/<br />
cuda_get.html. Make sure to use the “emurelease” (Emulation Mode) version of the<br />
code (you will not need actual NVIDIA hardware for this assignment). Build the<br />
example programs provided in the SDK, <strong>and</strong> confirm that they run on the emulator.<br />
6.14.1 [90] Using the “template” SDK sample as a starting point, write a<br />
CUDA program to perform the following vector operations:<br />
1) a − b (vector-vector subtraction)<br />
2) a ⋅ b (vector dot product)<br />
The dot product of two vectors a = [a_1, a_2, …, a_n] and b = [b_1, b_2, …, b_n] is defined as:<br />
a ⋅ b = Σ_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + … + a_n b_n<br />
Submit code for each program that demonstrates each operation and verifies the<br />
correctness of the results.<br />
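A plain C reference implementation of the dot product (our sketch, deliberately not the requested CUDA version) is handy for verifying the GPU results:

```c
/* Sequential reference: a . b = sum over i of a[i]*b[i].
 * Compare its output against the CUDA kernel's result. */
double dot(int n, const double *a, const double *b)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note that a parallel reduction may sum the terms in a different order, so for large vectors the GPU and CPU results should be compared within a small floating-point tolerance rather than for exact equality.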
6.14.2 [90] If you have GPU hardware available, complete a performance<br />
analysis of your program, examining the computation time for the GPU and a CPU<br />
version of your program for a range of vector sizes. Explain any results you see.<br />
6.15 AMD has recently announced that they will be integrating a graphics<br />
processing unit with their x86 cores in a single package, though with different<br />
clocks for each of the cores. This is an example of a heterogeneous multiprocessor<br />
system which we expect to see produced commercially in the near future. One<br />
of the key design points will be to allow for fast data communication between<br />
the CPU and the GPU. Presently communications must be performed between<br />
discrete CPU and GPU chips. But this is changing in AMD’s Fusion architecture.<br />
Presently the plan is to use multiple (at least 16) PCI Express channels to facilitate<br />
intercommunication. Intel is also jumping into this arena with their Larrabee chip.<br />
Intel is considering using their QuickPath interconnect technology.<br />
6.15.1 [25] Compare the bandwidth and latency associated with these<br />
two interconnect technologies.<br />
6.16 Refer to Figure 6.14b, which shows an n-cube interconnect topology of order<br />
3 that interconnects 8 nodes. One attractive feature of an n-cube interconnection<br />
network topology is its ability to sustain broken links and still provide connectivity.<br />
6.16.1 [10] Develop an equation that computes how many links in the<br />
n-cube (where n is the order of the cube) can fail while we can still guarantee that an<br />
unbroken path will exist to connect any node in the n-cube.<br />
6.16.2 [10] Compare the resiliency to failure of the n-cube to a fully<br />
connected interconnection network. Plot a comparison of reliability as a function<br />
of the added number of links for the two topologies.<br />
6.17 Benchmarking is a field of study that involves identifying representative<br />
workloads to run on specific computing platforms in order to be able to objectively<br />
compare the performance of one system to another. In this exercise we will compare<br />
two classes of benchmarks: the Whetstone CPU benchmark and the PARSEC<br />
Benchmark suite. Select one program from PARSEC. All programs should be freely<br />
available on the Internet. Consider running multiple copies of Whetstone versus<br />
running the PARSEC Benchmark on any of the systems described in Section 6.11.<br />
6.17.1 [60] What is inherently different between these two classes of<br />
workload when run on these multi-core systems?<br />
6.17.2 [60] In terms of the Roofline Model, how dependent will the<br />
results you obtain when running these benchmarks be on the amount of sharing<br />
and synchronization present in the workload used?<br />
6.18 When performing computations on sparse matrices, latency in the memory<br />
hierarchy becomes much more of a factor. Sparse matrices lack the spatial locality<br />
in the data stream typically found in matrix operations. As a result, new matrix<br />
representations have been proposed.<br />
One of the earliest sparse matrix representations is the Yale Sparse Matrix Format. It<br />
stores an initial sparse m × n matrix M in row form using three one-dimensional
arrays. Let R be the number of nonzero entries in M. We construct an array A<br />
of length R that contains all nonzero entries of M (in left-to-right top-to-bottom<br />
order). We also construct a second array IA of length m + 1 (i.e., one entry per row,<br />
plus one). IA(i) contains the index in A of the first nonzero element of row i. Row<br />
i of the original matrix extends from A(IA(i)) to A(IA(i+1)−1). The third array, JA,<br />
contains the column index of each element of A, so it also is of length R.<br />
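To make the three arrays concrete, here is a small illustrative example in C with 0-based indices (our own tiny matrix, deliberately not the exercise's matrix X, so the exercise is left to the reader):

```c
/* Yale/CSR form of the 2 x 3 matrix
 *      M = [ 10  0  0 ]
 *          [  0 20 30 ]
 * R = 3 nonzeros, m = 2 rows; indices here are 0-based, unlike the
 * 1-based description in the text. */
static const double A_vals[3] = { 10.0, 20.0, 30.0 }; /* nonzeros, row by row  */
static const int    IA[3]     = { 0, 1, 3 };  /* row i spans A[IA[i]..IA[i+1]-1] */
static const int    JA[3]     = { 0, 1, 2 };  /* column index of each nonzero    */

/* Fetch M[i][j] from the compressed form: scan row i's slice of JA. */
double csr_get(int i, int j)
{
    for (int k = IA[i]; k < IA[i + 1]; k++)
        if (JA[k] == j)
            return A_vals[k];
    return 0.0;   /* position (i, j) holds an (unstored) zero */
}
```

The payoff is that row i's nonzeros are contiguous in A and JA, so a sparse row-times-vector product touches only the R stored entries instead of all m × n positions.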
6.18.1 [15] Consider the sparse matrix X below and write C code that<br />
would store this matrix in Yale Sparse Matrix Format.<br />
Row 1 [1, 2, 0, 0, 0, 0]<br />
Row 2 [0, 0, 1, 1, 0, 0]<br />
Row 3 [0, 0, 0, 0, 9, 0]<br />
Row 4 [2, 0, 0, 0, 0, 2]<br />
Row 5 [0, 0, 3, 3, 0, 7]<br />
Row 6 [1, 3, 0, 0, 0, 1]<br />
6.18.2 [10] In terms of storage space, assuming that each element in<br />
matrix X is a single-precision floating-point value, compute the amount of storage used to<br />
store the matrix above in Yale Sparse Matrix Format.<br />
6.18.3 [15] Perform matrix multiplication of Matrix X by Matrix Y<br />
shown below.<br />
[2, 4, 1, 99, 7, 2]<br />
Put this computation in a loop, <strong>and</strong> time its execution. Make sure to increase<br />
the number of times this loop is executed to get good resolution in your timing<br />
measurement. Compare the runtime of a naïve (dense) representation of the matrix<br />
with that of the Yale Sparse Matrix Format.<br />
6.18.4 [15] Can you find a more efficient sparse matrix representation<br />
(in terms of space <strong>and</strong> computational overhead)?<br />
6.19 In future systems, we expect to see heterogeneous computing platforms<br />
constructed out of heterogeneous CPUs. We have begun to see some appear in the<br />
embedded processing market in systems that contain both floating-point DSPs and<br />
microcontroller CPUs in a multichip module package.<br />
Assume that you have three classes of CPU:<br />
CPU A—A moderate speed multi-core CPU (with a floating point unit) that can<br />
execute multiple instructions per cycle.<br />
CPU B—A fast single-core integer CPU (i.e., no floating point unit) that can<br />
execute a single instruction per cycle.<br />
CPU C—A slow vector CPU (with floating point capability) that can execute<br />
multiple copies of the same instruction per cycle.
§6.1, page 504: False. Task-level parallelism can help sequential applications and<br />
sequential applications can be made to run on parallel hardware, although it is<br />
more challenging.<br />
§6.2, page 509: False. Weak scaling can compensate for a serial portion of the<br />
program that would otherwise limit scalability, but not so for strong scaling.<br />
§6.3, page 514: True, but they are missing useful vector features like gather-scatter<br />
and vector length registers that improve the efficiency of vector architectures.<br />
(As an elaboration in this section mentions, the AVX2 SIMD extensions offer<br />
indexed loads via a gather operation but not scatter for indexed stores. The Haswell<br />
generation x86 microprocessor is the first to support AVX2.)<br />
§6.4, page 519: 1. True. 2. True.<br />
§6.5, page 523: False. Since the shared address is a physical address, multiple<br />
tasks each in their own virtual address spaces can run well on a shared memory<br />
multiprocessor.<br />
§6.6, page 531: False. Graphics DRAM chips are prized for their higher bandwidth.<br />
§6.7, page 536: 1. False. Sending and receiving a message is an implicit<br />
synchronization, as well as a way to share data. 2. True.<br />
§6.8, page 538: True.<br />
§6.10, page 550: True. We likely need innovation at all levels of the hardware and<br />
software stack for parallel computing to succeed.<br />
Answers to<br />
Check Yourself
A P P E N D I X  A<br />
Assemblers, Linkers, and the SPIM Simulator<br />
James R. Larus<br />
Microsoft Research, Microsoft<br />
“Fear of serious injury cannot alone justify suppression of free speech and assembly.”<br />
—Louis Brandeis, Whitney v. California, 1927
A-4 Appendix A Assemblers, Linkers, and the SPIM Simulator<br />
[Figure A.1.1 diagram: three source files are each translated by an assembler into object files; the linker combines the object files with a program library to produce an executable file.]<br />
FIGURE A.1.1 The process that produces an executable file. An assembler translates a file of<br />
assembly language into an object file, which is linked with other files <strong>and</strong> libraries into an executable file.<br />
assembler A program that translates a symbolic version of instructions into the binary version.<br />
macro A pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions.<br />
unresolved reference A reference that requires more information from an outside source to be complete.<br />
linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.<br />
permits programmers to use labels to identify <strong>and</strong> name particular memory words<br />
that hold instructions or data.<br />
A tool called an assembler translates assembly language into binary instructions.<br />
Assemblers provide a friendlier representation than a computer’s 0s and 1s, which<br />
simplifies writing and reading programs. Symbolic names for operations and locations<br />
are one facet of this representation. Another facet is programming facilities<br />
that increase a program’s clarity. For example, macros, discussed in Section A.2,<br />
enable a programmer to extend the assembly language by defining new operations.<br />
An assembler reads a single assembly language source file and produces an<br />
object file containing machine instructions and bookkeeping information that<br />
helps combine several object files into a program. Figure A.1.1 illustrates how a<br />
program is built. Most programs consist of several files—also called modules—<br />
that are written, compiled, and assembled independently. A program may also use<br />
prewritten routines supplied in a program library. A module typically contains references<br />
to subroutines and data defined in other modules and in libraries. The code<br />
in a module cannot be executed when it contains unresolved references to labels<br />
in other object files or libraries. Another tool, called a linker, combines a collection<br />
of object and library files into an executable file, which a computer can run.<br />
To see the advantage of assembly language, consider the following sequence of<br />
figures, all of which contain a short subroutine that computes and prints the sum of<br />
the squares of integers from 0 to 100. Figure A.1.2 shows the machine language that<br />
a MIPS computer executes. With considerable effort, you could use the opcode and<br />
instruction format tables in Chapter 2 to translate the instructions into a symbolic<br />
program similar to that shown in Figure A.1.3. This form of the routine is much<br />
easier to read, because operations and operands are written with symbols rather<br />
A.1 Introduction A-5<br />
00100111101111011111111111100000<br />
10101111101111110000000000010100<br />
10101111101001000000000000100000<br />
10101111101001010000000000100100<br />
10101111101000000000000000011000<br />
10101111101000000000000000011100<br />
10001111101011100000000000011100<br />
10001111101110000000000000011000<br />
00000001110011100000000000011001<br />
00100101110010000000000000000001<br />
00101001000000010000000001100101<br />
10101111101010000000000000011100<br />
00000000000000000111100000010010<br />
00000011000011111100100000100001<br />
00010100001000001111111111110111<br />
10101111101110010000000000011000<br />
00111100000001000001000000000000<br />
10001111101001010000000000011000<br />
00001100000100000000000011101100<br />
00100100100001000000010000110000<br />
10001111101111110000000000010100<br />
00100111101111010000000000100000<br />
00000011111000000000000000001000<br />
00000000000000000001000000100001<br />
FIGURE A.1.2 MIPS machine language code for a routine to compute <strong>and</strong> print the sum<br />
of the squares of integers between 0 <strong>and</strong> 100.<br />
than with bit patterns. However, this assembly language is still difficult to follow,<br />
because memory locations are named by their address rather than by a symbolic<br />
label.<br />
Figure A.1.4 shows assembly language that labels memory addresses with mnemonic<br />
names. Most programmers prefer to read <strong>and</strong> write this form. Names that<br />
begin with a period, for example .data <strong>and</strong> .globl, are assembler directives<br />
that tell the assembler how to translate a program but do not produce machine<br />
instructions. Names followed by a colon, such as str: or main:, are labels that<br />
name the next memory location. This program is as readable as most assembly<br />
language programs (except for a glaring lack of comments), but it is still difficult<br />
to follow, because many simple operations are required to accomplish simple tasks<br />
<strong>and</strong> because assembly language’s lack of control flow constructs provides few hints<br />
about the program’s operation.<br />
By contrast, the C routine in Figure A.1.5 is both shorter <strong>and</strong> clearer, since variables<br />
have mnemonic names <strong>and</strong> the loop is explicit rather than constructed with<br />
branches. In fact, the C routine is the only one that we wrote. The other forms of<br />
the program were produced by a C compiler <strong>and</strong> assembler.<br />
In general, assembly language plays two roles (see Figure A.1.6). The first role<br />
is the output language of compilers. A compiler translates a program written in a<br />
high-level language (such as C or Pascal) into an equivalent program in machine or<br />
assembler directive: An operation that tells the assembler how to translate a program but does not produce machine instructions; always begins with a period.
A-6 Appendix A Assemblers, Linkers, <strong>and</strong> the SPIM Simulator<br />
addiu $29, $29, -32<br />
sw $31, 20($29)<br />
sw $4, 32($29)<br />
sw $5, 36($29)<br />
sw $0, 24($29)<br />
sw $0, 28($29)<br />
lw $14, 28($29)<br />
lw $24, 24($29)<br />
multu $14, $14<br />
addiu $8, $14, 1<br />
slti $1, $8, 101<br />
sw $8, 28($29)<br />
mflo $15<br />
addu $25, $24, $15<br />
bne $1, $0, -9<br />
sw $25, 24($29)<br />
lui $4, 4096<br />
lw $5, 24($29)<br />
jal 1048812<br />
addiu $4, $4, 1072<br />
lw $31, 20($29)<br />
addiu $29, $29, 32<br />
jr $31<br />
move $2, $0<br />
FIGURE A.1.3 The same routine as in Figure A.1.2 written in assembly language. However,<br />
the code for the routine does not label registers or memory locations or include comments.<br />
assembly language. The high-level language is called the source language, and the compiler’s output is its target language.

source language: The high-level language in which a program is originally written.
Assembly language’s other role is as a language in which to write programs. This<br />
role used to be the dominant one. Today, however, because of larger main memories<br />
<strong>and</strong> better compilers, most programmers write in a high-level language <strong>and</strong><br />
rarely, if ever, see the instructions that a computer executes. Nevertheless, assembly<br />
language is still important to write programs in which speed or size is critical or to<br />
exploit hardware features that have no analogues in high-level languages.<br />
Although this appendix focuses on MIPS assembly language, assembly programming<br />
on most other machines is very similar. The additional instructions <strong>and</strong><br />
address modes in CISC machines, such as the VAX, can make assembly programs
shorter but do not change the process of assembling a program or provide assembly<br />
language with the advantages of high-level languages, such as type-checking <strong>and</strong><br />
structured control flow.
FIGURE A.1.4 The same routine as in Figure A.1.2 written in assembly language with labels, but no comments. The commands that start with periods are assembler directives (see pages A-47–49). .text indicates that succeeding lines contain instructions. .data indicates that they contain data. .align n indicates that the items on the succeeding lines should be aligned on a 2^n byte boundary. Hence, .align 2 means the next item should be on a word boundary. .globl main declares that main is a global symbol that should be visible to code stored in other files. Finally, .asciiz stores a null-terminated string in memory.
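The address arithmetic behind .align can be sketched in C. This is a hypothetical helper, not SPIM code: it rounds the current address up to the next multiple of 2^n, which is the computation the directive asks the assembler to perform.

```c
/* Round addr up to the next 2^n byte boundary, as .align n does.
   .align 2 (word alignment) rounds up to a multiple of 4. */
static unsigned align_up(unsigned addr, unsigned n) {
    unsigned size = 1u << n;              /* 2^n */
    return (addr + size - 1) & ~(size - 1);
}
```

For example, with the data pointer at address 5, .align 2 would advance it to 8.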
When to Use Assembly Language<br />
The primary reason to program in assembly language, as opposed to an available<br />
high-level language, is that the speed or size of a program is critically important.<br />
For example, consider a computer that controls a piece of machinery, such as a<br />
car’s brakes. A computer that is incorporated in another device, such as a car, is<br />
called an embedded computer. This type of computer needs to respond rapidly<br />
and predictably to events in the outside world. Because a compiler introduces uncertainty about the exact time cost of operations, programmers may find it difficult to guarantee that a high-level language program responds within a definite interval; an assembly language programmer controls exactly which instructions execute.
#include <stdio.h>

int
main (int argc, char *argv[])
{
  int i;
  int sum = 0;

  for (i = 0; i <= 100; i = i + 1) sum = sum + i * i;
  printf ("The sum from 0 .. 100 is %d\n", sum);
}

FIGURE A.1.5 The routine written in the C programming language.
This improvement is not necessarily an indication that the high-level language’s<br />
compiler has failed. Compilers typically are better than programmers at producing<br />
uniformly high-quality machine code across an entire program. Programmers,
however, underst<strong>and</strong> a program’s algorithms <strong>and</strong> behavior at a deeper level than<br />
a compiler <strong>and</strong> can expend considerable effort <strong>and</strong> ingenuity improving small<br />
sections of the program. In particular, programmers often consider several procedures<br />
simultaneously while writing their code. Compilers typically compile each<br />
procedure in isolation <strong>and</strong> must follow strict conventions governing the use of<br />
registers at procedure boundaries. By retaining commonly used values in registers,<br />
even across procedure boundaries, programmers can make a program run<br />
faster.<br />
Another major advantage of assembly language is the ability to exploit specialized<br />
instructions—for example, string copy or pattern-matching instructions.<br />
Compilers, in most cases, cannot determine that a program loop can be replaced<br />
by a single instruction. However, the programmer who wrote the loop can replace<br />
it easily with a single instruction.<br />
However, a programmer’s advantage over a compiler has become difficult to maintain as compilation techniques improve and machines’ pipelines increase in complexity (Chapter 4).
The final reason to use assembly language is that no high-level language is<br />
available on a particular computer. Many older or specialized computers do not<br />
have a compiler, so a programmer’s only alternative is assembly language.<br />
Drawbacks of Assembly Language<br />
Assembly language has many disadvantages that strongly argue against its widespread<br />
use. Perhaps its major disadvantage is that programs written in assembly<br />
language are inherently machine-specific <strong>and</strong> must be totally rewritten to run on<br />
another computer architecture. The rapid evolution of computers discussed in<br />
Chapter 1 means that architectures become obsolete. An assembly language program<br />
remains tightly bound to its original archi tecture, even after the computer is<br />
eclipsed by new, faster, <strong>and</strong> more cost-effective machines.<br />
Another disadvantage is that assembly language programs are longer than the<br />
equivalent programs written in a high-level language. For example, the C program<br />
in Figure A.1.5 is 11 lines long, while the assembly program in Figure A.1.4 is<br />
31 lines long. In more complex programs, the ratio of assembly to high-level language<br />
(its expansion factor) can be much larger than the factor of three in this<br />
example. Unfortunately, empirical studies have shown that programmers write
roughly the same number of lines of code per day in assembly as in high-level<br />
languages. This means that programmers are roughly x times more productive in a<br />
high-level language, where x is the assembly language expansion factor.
To compound the problem, longer programs are more difficult to read <strong>and</strong><br />
underst<strong>and</strong>, <strong>and</strong> they contain more bugs. Assembly language exacerbates the problem<br />
because of its complete lack of structure. Common programming idioms,<br />
such as if-then statements <strong>and</strong> loops, must be built from branches <strong>and</strong> jumps. The<br />
resulting programs are hard to read, because the reader must reconstruct every<br />
higher-level construct from its pieces <strong>and</strong> each instance of a statement may be<br />
slightly different. For example, look at Figure A.1.4 <strong>and</strong> answer these questions:<br />
What type of loop is used? What are its lower <strong>and</strong> upper bounds?<br />
Elaboration: Compilers can produce machine language directly instead of relying on<br />
an assembler. These compilers typically execute much faster than those that invoke<br />
an assembler as part of compilation. However, a compiler that generates machine language<br />
must perform many tasks that an assembler normally h<strong>and</strong>les, such as resolving<br />
addresses <strong>and</strong> encoding instructions as binary numbers. The tradeoff is between<br />
compilation speed <strong>and</strong> compiler simplicity.<br />
Elaboration: Despite these considerations, some embedded applications are written in a high-level language. Many of these applications are large and complex programs that must be extremely reliable. Assembly language programs are longer and more difficult to write and read than high-level language programs. This greatly increases the cost of writing an assembly language program and makes it extremely difficult to verify the correctness of this type of program. In fact, these considerations led the US Department of Defense, which pays for many complex embedded systems, to develop Ada, a new high-level language for writing embedded systems.
A.2 Assemblers<br />
external label: Also called global label. A label referring to an object that can be referenced from files other than the one in which it is defined.
An assembler translates a file of assembly language statements into a file of binary machine instructions and binary data. The translation process has two major parts. The first step is to find memory locations with labels so that the relationship between symbolic names and addresses is known when instructions are translated. The second step is to translate each assembly statement by combining the numeric equivalents of opcodes, register specifiers, and labels into a legal instruction. As shown in Figure A.1.1, the assembler produces an output file, called an object file, which contains the machine instructions, data, and bookkeeping information.
An object file typically cannot be executed, because it references procedures or<br />
data in other files. A label is external (also called global) if the labeled object can
be referenced from files other than the one in which it is defined. A label is local<br />
if the object can be used only within the file in which it is defined. In most assemblers,<br />
labels are local by default and must be explicitly declared global. Subroutines and global variables require external labels since they are referenced from many
files in a program. Local labels hide names that should not be visible to other<br />
modules—for example, static functions in C, which can only be called by other<br />
functions in the same file. In addition, compiler-generated names—for example, a<br />
name for the instruction at the beginning of a loop—are local so that the compiler<br />
need not produce unique names in every file.<br />
local label: A label referring to an object that can be used only within the file in which it is defined.
Local and Global Labels

EXAMPLE

Consider the program in Figure A.1.4. The subroutine has an external (global) label main. It also contains two local labels—loop and str—that are visible only within this assembly language file. Finally, the routine also contains an unresolved reference to an external label printf, which is the library routine that prints values. Which labels in Figure A.1.4 could be referenced from another file?

ANSWER

Only global labels are visible outside a file, so the only label that could be referenced from another file is main.
Since the assembler processes each file in a program individually and in isolation, it only knows the addresses of local labels. The assembler depends on another tool, the linker, to combine a collection of object files and libraries into an executable file by resolving external labels. The assembler assists the linker by providing lists of labels and unresolved references.
However, even local labels present an interesting challenge to an assembler.<br />
Unlike names in most high-level languages, assembly labels may be used before<br />
they are defined. In the example in Figure A.1.4, the label str is used by the la<br />
instruction before it is defined. The possibility of a forward reference, like this one,<br />
forces an assembler to translate a program in two steps: first find all labels <strong>and</strong> then<br />
produce instructions. In the example, when the assembler sees the la instruction,<br />
it does not know where the word labeled str is located or even whether str labels<br />
an instruction or datum.<br />
forward reference: A label that is used before it is defined.
An assembler’s first pass reads each line of an assembly file <strong>and</strong> breaks it into its<br />
component pieces. These pieces, which are called lexemes, are individual words,<br />
numbers, and punctuation characters. For example, the line

ble $t0, 100, loop

contains six lexemes: the opcode ble, the register specifier $t0, a comma, the number 100, a comma, and the symbol loop.

symbol table: A table that matches names of labels to the addresses of the memory words that instructions occupy.
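As a rough illustration (a hypothetical helper, not SPIM’s actual scanner), a C function can break such a line into lexemes by treating each comma as a lexeme of its own and whitespace as a separator:

```c
#include <ctype.h>
#include <string.h>

/* Split an assembly line into lexemes: words, numbers, and register
   names, with each comma counted as its own lexeme. A sketch only:
   no string literals, no comments, fixed-size output buffers. */
static int lex_line(const char *line, char out[][32], int max) {
    int n = 0, i = 0;
    while (line[i] != '\0' && n < max) {
        if (isspace((unsigned char)line[i])) { i++; continue; }
        if (line[i] == ',') { strcpy(out[n++], ","); i++; continue; }
        int j = 0;
        while (line[i] && line[i] != ',' && !isspace((unsigned char)line[i]))
            out[n][j++] = line[i++];
        out[n++][j] = '\0';
    }
    return n;
}
```

Applied to the line above, it yields the six lexemes ble, $t0, a comma, 100, a comma, and loop.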
If a line begins with a label, the assembler records in its symbol table the name<br />
of the label <strong>and</strong> the address of the memory word that the instruction occupies.<br />
The assembler then calculates how many words of memory the instruction on the<br />
current line will occupy. By keeping track of the instructions’ sizes, the assembler<br />
can determine where the next instruction goes. To compute the size of a variable-length instruction, like those on the VAX, an assembler has to examine it in detail. However, fixed-length instructions, like those on MIPS, require only a cursory
However, fixed-length instructions, like those on MIPS, require only a cursory<br />
examination. The assembler performs a similar calculation to compute the space<br />
required for data statements. When the assembler reaches the end of an assembly<br />
file, the symbol table records the location of each label defined in the file.<br />
The assembler uses the information in the symbol table during a second pass<br />
over the file, which actually produces machine code. The assembler again examines<br />
each line in the file. If the line contains an instruction, the assembler combines<br />
the binary representations of its opcode <strong>and</strong> oper<strong>and</strong>s (register specifiers or<br />
memory address) into a legal instruction. The process is similar to the one used in<br />
Section 2.5 in Chapter 2. Instructions <strong>and</strong> data words that reference an external<br />
symbol defined in another file cannot be completely assembled (they are unresolved),<br />
since the symbol’s address is not in the symbol table. An assembler does<br />
not complain about unresolved references, since the corresponding label is likely<br />
to be defined in another file.<br />
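The first pass described above can be sketched in C under simplifying assumptions: fixed 4-byte instructions, input given as an array of mnemonic strings, and a string ending in a colon defining a label at the current address. This is an illustrative sketch, not SPIM’s assembler.

```c
#include <string.h>

/* Minimal symbol table built by a first pass over fixed-length
   (4-byte) instructions. A label line ("main:") records the address
   of the next instruction; any other line advances the address. */
struct sym { char name[32]; unsigned addr; };

static int first_pass(const char *lines[], int n, struct sym tab[]) {
    unsigned addr = 0;
    int nsyms = 0;
    for (int i = 0; i < n; i++) {
        size_t len = strlen(lines[i]);
        if (len > 0 && lines[i][len - 1] == ':') {  /* label definition */
            memcpy(tab[nsyms].name, lines[i], len - 1);
            tab[nsyms].name[len - 1] = '\0';
            tab[nsyms++].addr = addr;   /* address of next instruction */
        } else {
            addr += 4;                  /* fixed-length MIPS instruction */
        }
    }
    return nsyms;
}
```

The second pass would then consult this table to encode each instruction, leaving references to symbols absent from the table unresolved for the linker.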
The BIG Picture
Assembly language is a programming language. Its principal difference<br />
from high-level languages such as BASIC, Java, <strong>and</strong> C is that assembly language<br />
provides only a few, simple types of data <strong>and</strong> control flow. Assembly<br />
language programs do not specify the type of value held in a variable.<br />
Instead, a programmer must apply the appropriate operations (e.g., integer<br />
or floating-point addition) to a value. In addition, in assem bly language,<br />
programs must implement all control flow with gotos. Both factors make
assembly language programming for any machine—MIPS or x86—more<br />
difficult <strong>and</strong> error-prone than writing in a high-level language.
Elaboration: If an assembler’s speed is important, this two-step process can be done in one pass over the assembly file with a technique known as backpatching. In its pass over the file, the assembler builds a (possibly incomplete) binary representation of every instruction. If the instruction references a label that has not yet been defined, the assembler records the label and instruction in a table. When a label is defined, the assembler consults this table to find all instructions that contain a forward reference to the label. The assembler goes back and corrects their binary representation to incorporate the address of the label. Backpatching speeds assembly because the assembler only reads its input once. However, it requires an assembler to hold the entire binary representation of a program in memory so instructions can be backpatched. This requirement can limit the size of programs that can be assembled. The process is complicated by machines with several types of branches that span different ranges of instructions. When the assembler first sees an unresolved label in a branch instruction, it must either use the largest possible branch or risk having to go back and readjust many instructions to make room for a larger branch.
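The record-and-patch mechanism can be sketched in C. The encoding below is a simplified placeholder (the label address is OR-ed into the low 16 bits of the waiting word), not real MIPS branch encoding:

```c
/* Backpatching sketch: one pass emits words, recording instructions
   whose target label is not yet defined; when the label is defined,
   the recorded words are patched with the now-known address. */
#define MAX_FIXUPS 32

struct fixup { int word_index; };        /* instruction awaiting a label */

struct patcher {
    unsigned words[64];                  /* binary output, built in one pass */
    struct fixup fixups[MAX_FIXUPS];
    int nfixups;
};

static void use_label(struct patcher *p, int word_index) {
    p->fixups[p->nfixups++].word_index = word_index;   /* forward reference */
}

static void define_label(struct patcher *p, unsigned addr) {
    for (int i = 0; i < p->nfixups; i++)               /* go back and fix */
        p->words[p->fixups[i].word_index] |= addr & 0xFFFF;
    p->nfixups = 0;
}
```

A real assembler would keep one fixup list per label; this sketch assumes a single pending label to keep the idea visible.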
backpatching: A method for translating from assembly language to machine instructions in which the assembler builds a (possibly incomplete) binary representation of every instruction in one pass over a program and then returns to fill in previously undefined labels.
Object File Format<br />
Assemblers produce object files. An object file on UNIX contains six distinct<br />
sections (see Figure A.2.1):<br />
■ The object file header describes the size <strong>and</strong> position of the other pieces of<br />
the file.<br />
■ The text segment contains the machine language code for routines in the<br />
source file. These routines may be unexecutable because of unresolved<br />
references.<br />
■ The data segment contains a binary representation of the data in the source<br />
file. The data also may be incomplete because of unresolved references to<br />
labels in other files.<br />
■ The relocation information identifies instructions <strong>and</strong> data words that<br />
depend on absolute addresses. These references must change if portions of<br />
the program are moved in memory.<br />
■ The symbol table associates addresses with external labels in the source file<br />
<strong>and</strong> lists unresolved references.<br />
■ The debugging information contains a concise description of the way the<br />
program was compiled, so a debugger can find which instruction addresses<br />
correspond to lines in a source file <strong>and</strong> print the data structures in readable<br />
form.<br />
The assembler produces an object file that contains a binary representation of<br />
the program <strong>and</strong> data <strong>and</strong> additional information to help link pieces of a program.<br />
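In the spirit of Figure A.2.1, the header’s job can be sketched as a C struct: it records the size of each remaining section, so any section can be located by summing the sizes before it. The field names are illustrative, not a real UNIX a.out or ELF layout.

```c
#include <stddef.h>

/* Sketch of an object file header: sizes of the five sections that
   follow it, in file order. */
struct obj_header {
    size_t text_size;     /* machine code for the file's routines        */
    size_t data_size;     /* binary representation of the file's data    */
    size_t reloc_size;    /* entries that depend on absolute addresses   */
    size_t symtab_size;   /* external labels and unresolved references   */
    size_t debug_size;    /* mapping from instructions to source lines   */
};

/* Each section starts where the previous one ends. */
static size_t text_offset(const struct obj_header *h) { return sizeof *h; }
static size_t data_offset(const struct obj_header *h) {
    return text_offset(h) + h->text_size;
}
```

A linker reading the file computes these offsets first, then pulls out the symbol table and relocation entries it needs.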
text segment: The segment of a UNIX object file that contains the machine language code for routines in the source file.

data segment: The segment of a UNIX object or executable file that contains a binary representation of the initialized data used by the program.

relocation information: The segment of a UNIX object file that identifies instructions and data words that depend on absolute addresses.

absolute address: A variable’s or routine’s actual address in memory.
Object file header | Text segment | Data segment | Relocation information | Symbol table | Debugging information
FIGURE A.2.1 Object file. A UNIX assembler produces an object file with six distinct sections.<br />
This relocation information is necessary because the assembler does not know<br />
which memory locations a procedure or piece of data will occupy after it is linked<br />
with the rest of the program. Procedures <strong>and</strong> data from a file are stored in a contiguous<br />
piece of memory, but the assembler does not know where this memory will
be located. The assembler also passes some symbol table entries to the linker. In<br />
particular, the assembler must record which external symbols are defined in a file<br />
<strong>and</strong> what unresolved references occur in a file.<br />
Elaboration: For convenience, assemblers assume each file starts at the same<br />
address (for example, location 0) with the expectation that the linker will relocate the code<br />
<strong>and</strong> data when they are assigned locations in memory. The assembler produces relocation<br />
information, which contains an entry describing each instruction or data word in the file<br />
that references an absolute address. On MIPS, only the subroutine call, load, <strong>and</strong> store<br />
instructions reference absolute addresses. Instructions that use PC-relative addressing,
such as branches, need not be relocated.<br />
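The linker’s use of this information can be sketched in C, under a simplification: each flagged word holds an entire absolute address, so relocation just adds the segment’s real base address. Real formats record which field of the instruction to patch.

```c
/* Relocation sketch: the assembler assumed the file starts at address 0;
   the linker adds the real base address to every word that the
   relocation information flags as holding an absolute address. */
struct reloc_entry { unsigned word_index; };

static void relocate(unsigned words[], const struct reloc_entry rel[],
                     int nrel, unsigned base) {
    for (int i = 0; i < nrel; i++)
        words[rel[i].word_index] += base;   /* patch absolute address */
}
```

Words not listed in the relocation entries, such as PC-relative branches, are left untouched.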
Additional Facilities<br />
Assemblers provide a variety of convenience features that help make assembler<br />
programs shorter <strong>and</strong> easier to write, but do not fundamentally change assembly<br />
language. For example, data layout directives allow a programmer to describe data<br />
in a more concise <strong>and</strong> natural manner than its binary representation.<br />
In Figure A.1.4, the directive<br />
.asciiz "The sum from 0 .. 100 is %d\n"
stores characters from the string in memory. Contrast this line with the alternative<br />
of writing each character as its ASCII value (Figure 2.15 in Chapter 2 describes the<br />
ASCII encoding for characters):<br />
.byte 84, 104, 101, 32, 115, 117, 109, 32<br />
.byte 102, 114, 111, 109, 32, 48, 32, 46<br />
.byte 46, 32, 49, 48, 48, 32, 105, 115<br />
.byte 32, 37, 100, 10, 0<br />
The .asciiz directive is easier to read because it represents characters as letters,<br />
not binary numbers. An assembler can translate characters to their binary representation<br />
much faster <strong>and</strong> more accurately than a human can. Data layout directives
specify data in a human-readable form that the assembler translates to binary. Other<br />
layout directives are described in Section A.10.<br />
String Directive

EXAMPLE

Define the sequence of bytes produced by this directive:

.asciiz "The quick brown fox jumps over the lazy dog"

ANSWER

.byte 84, 104, 101, 32, 113, 117, 105, 99
.byte 107, 32, 98, 114, 111, 119, 110, 32
.byte 102, 111, 120, 32, 106, 117, 109, 112
.byte 115, 32, 111, 118, 101, 114, 32, 116
.byte 104, 101, 32, 108, 97, 122, 121, 32
.byte 100, 111, 103, 0
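Such a byte listing can be checked mechanically. A small C helper (hypothetical, not part of the appendix) collects each character’s decimal value, including the trailing 0 that .asciiz stores:

```c
/* Collect the byte values that .asciiz would store for s,
   including the terminating 0. Returns the number of bytes. */
static int ascii_bytes(const char *s, int out[]) {
    int n = 0;
    for (;;) {
        out[n] = (unsigned char)s[n];
        if (s[n] == '\0') { n++; break; }   /* include the trailing 0 */
        n++;
    }
    return n;
}
```

For the string above it produces 44 bytes, matching the six .byte lines in the answer.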
Macros

A macro is a pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions. Instead of repeatedly typing the same instructions every time they are used, a programmer invokes the macro, and the assembler replaces the macro call with the corresponding sequence of instructions. Macros, like subroutines, permit a programmer to create and name a new abstraction for a common operation. Unlike subroutines, however, macros do not cause a subroutine call and return when the program runs, since a macro call is replaced by the macro’s body when the program is assembled. After this replacement, the resulting assembly is indistinguishable from the equivalent program written without macros.

EXAMPLE

As an example, suppose that a programmer needs to print many numbers. The library routine printf accepts a format string and one or more values to print as its arguments. A programmer could print the integer in register $7 with the following instructions:
.data<br />
int_str: .asciiz "%d"
.text<br />
la $a0, int_str # Load string address<br />
# into first arg
mov $a1, $7 # Load value into<br />
# second arg<br />
jal printf # Call the printf routine<br />
The .data directive tells the assembler to store the string in the program’s data<br />
segment, and the .text directive tells the assembler to store the instructions in its text segment.
formal parameter: A variable that is the argument to a procedure or macro; it is replaced by that argument once the macro is expanded.

However, printing many numbers in this fashion is tedious and produces a verbose program that is difficult to understand. An alternative is to introduce a macro, print_int, to print an integer:

.data
int_str: .asciiz "%d"
.text<br />
.macro print_int($arg)<br />
la $a0, int_str # Load string address into<br />
# first arg<br />
mov $a1, $arg # Load macro’s parameter<br />
# ($arg) into second arg<br />
jal printf # Call the printf routine<br />
.end_macro<br />
print_int($7)<br />
The macro has a formal parameter, $arg, that names the argument to the<br />
macro. When the macro is exp<strong>and</strong>ed, the argument from a call is substituted<br />
for the formal parameter throughout the macro’s body. Then the assembler<br />
replaces the call with the macro’s newly exp<strong>and</strong>ed body. In the first call on<br />
print_int, the argument is $7, so the macro exp<strong>and</strong>s to the code<br />
la $a0, int_str<br />
mov $a1, $7<br />
jal printf<br />
In a second call on print_int, say, print_int($t0), the argument is $t0,<br />
so the macro exp<strong>and</strong>s to<br />
la $a0, int_str<br />
mov $a1, $t0<br />
jal printf<br />
What does the call print_int($a0) exp<strong>and</strong> to?
ANSWER

la $a0, int_str
mov $a1, $a0
jal printf
This example illustrates a drawback of macros. A programmer who uses<br />
this macro must be aware that print_int uses register $a0 <strong>and</strong> so cannot<br />
correctly print the value in that register.<br />
Hardware/Software Interface

Some assemblers also implement pseudoinstructions, which are instructions provided by an assembler but not implemented in hardware. Chapter 2 contains many examples of how the MIPS assembler synthesizes pseudoinstructions and addressing modes from the spartan MIPS hardware instruction set. For example, Section 2.7 in Chapter 2 describes how the assembler synthesizes the blt instruction from two other instructions: slt and bne. By extending the instruction set, the MIPS assembler makes assembly language programming easier without complicating the hardware. Many pseudoinstructions could also be simulated with macros, but the MIPS assembler can generate better code for these instructions because it can use a dedicated register ($at) and is able to optimize the generated code.
Elaboration: Assemblers conditionally assemble pieces of code, which permits a programmer to include or exclude groups of instructions when a program is assembled. This feature is particularly useful when several versions of a program differ by a small amount. Rather than keep these programs in separate files—which greatly complicates fixing bugs in the common code—programmers typically merge the versions into a single file. Code particular to one version is conditionally assembled, so it can be excluded when other versions of the program are assembled.

If macros and conditional assembly are useful, why do assemblers for UNIX systems rarely, if ever, provide them? One reason is that most programmers on these systems write programs in higher-level languages like C. Most of the assembly code is produced by compilers, which find it more convenient to repeat code rather than define macros. Another reason is that other tools on UNIX—such as cpp, the C preprocessor, or m4, a general macro processor—can provide macros and conditional assembly for assembly language programs.
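The same textual-expansion behavior as print_int can be seen with the C preprocessor mentioned above. This is a C sketch of the idea, not the assembler’s macro facility:

```c
#include <stdio.h>

/* A preprocessor macro, like an assembler macro, is expanded before
   translation: print_int(x) below is replaced by the printf call
   wherever it is written, so no call to "print_int" exists at run
   time. (printf returns the number of characters written.) */
#define print_int(arg) printf("%d\n", (arg))
```

A call such as print_int(7) therefore compiles exactly as if the programmer had written printf("%d\n", (7)) by hand, mirroring how the assembler substitutes $7 for $arg in the macro’s body.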
system kernel brings a program into memory <strong>and</strong> starts it running. To start a program,<br />
the operating system performs the following steps:<br />
1. It reads the executable file’s header to determine the size of the text <strong>and</strong> data<br />
segments.<br />
2. It creates a new address space for the program. This address space is large<br />
enough to hold the text <strong>and</strong> data segments, along with a stack segment (see<br />
Section A.5).<br />
3. It copies instructions <strong>and</strong> data from the executable file into the new address<br />
space.<br />
4. It copies arguments passed to the program onto the stack.<br />
5. It initializes the machine registers. In general, most registers are cleared, but<br />
the stack pointer must be assigned the address of the first free stack location<br />
(see Section A.5).<br />
6. It jumps to a start-up routine that copies the program’s arguments from the<br />
stack to registers <strong>and</strong> calls the program’s main routine. If the main routine<br />
returns, the start-up routine terminates the program with the exit system call.<br />
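Step 6 might look roughly like this on SPIM, assuming the kernel left argc and argv at the top of the stack (a sketch, not SPIM's actual start-up code):

```asm
__start:
        lw    $a0, 0($sp)         # argc: first program argument, from the stack
        addiu $a1, $sp, 4         # argv: address of the argument vector
        jal   main                # call the program's main routine
        li    $v0, 10             # SPIM system call code for exit
        syscall                   # terminate the program if main returns
```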
A.5 Memory Usage<br />
static data: The portion of memory that contains data whose size is known to the compiler and whose lifetime is the program's entire execution.
The next few sections elaborate the description of the MIPS architecture presented<br />
earlier in the book. Earlier chapters focused primarily on hardware <strong>and</strong> its relationship<br />
with low-level software. These sections focus primarily on how assembly language<br />
programmers use MIPS hardware. These sections describe a set of conventions<br />
followed on many MIPS systems. For the most part, the hardware does not impose<br />
these conventions. Instead, they represent an agreement among programmers to<br />
follow the same set of rules so that software written by different people can work<br />
together <strong>and</strong> make effective use of MIPS hardware.<br />
Systems based on MIPS processors typically divide memory into three parts<br />
(see Figure A.5.1). The first part, near the bottom of the address space (starting at address 400000hex), is the text segment, which holds the program's instructions. The second part, above the text segment, is the data segment, which is further divided into two parts. Static data (starting at address 10000000hex) contains objects whose size is known to the compiler and whose lifetime—the interval during which a program can access them—is the program's entire execution. For
example, in C, global variables are statically allocated, since they can be referenced anytime during a program's execution. The linker both assigns static objects to locations in the data segment and resolves references to these objects.

FIGURE A.5.1 Layout of memory. [Figure: the stack segment sits at the top of the address space, below address 7fffffffhex, and grows downward; below it is the data segment, with dynamic data above static data (which starts at 10000000hex); the text segment starts at 400000hex, and the memory below it is reserved.]
Immediately above static data is dynamic data. This data, as its name implies, is allocated by the program as it executes. In C programs, the malloc library routine finds and returns a new block of memory. Since a compiler cannot predict how much memory a program will allocate, the operating system expands the dynamic data area to meet demand. As the upward arrow in the figure indicates, malloc expands the dynamic area with the sbrk system call, which causes the operating system to add more pages to the program's virtual address space (see Section 5.7 in Chapter 5) immediately above the dynamic data segment.

stack segment: The portion of memory used by a program to hold procedure call frames.

The third part, the program stack segment, resides at the top of the virtual address space (starting at address 7fffffffhex). Like dynamic data, the maximum size of a program's stack is not known in advance. As the program pushes values on to the stack, the operating system expands the stack segment down toward the data segment.

Hardware/Software Interface

Because the data segment begins far above the program at address 10000000hex, load and store instructions cannot directly reference data objects with their 16-bit offset fields (see Section 2.5 in Chapter 2). For example, to load the word in the data segment at address 10010020hex into register $v0 requires two instructions:

lui $s0, 0x1001 # 0x1001 means 1001 base 16
lw $v0, 0x0020($s0) # 0x10010000 + 0x0020 = 0x10010020

(The 0x before a number means that it is a hexadecimal value. For example, 0x8000 is 8000hex or 32,768ten.)

To avoid repeating the lui instruction at every load and store, MIPS systems typically dedicate a register ($gp) as a global pointer to the static data segment. This register contains address 10008000hex, so load and store instructions can use their signed 16-bit offset fields to access the first 64 KB of the static data segment. With this global pointer, we can rewrite the example as a single instruction:

lw $v0, 0x8020($gp)

Of course, a global pointer register makes addressing locations 10000000hex–10010000hex faster than other heap locations. The MIPS compiler usually stores global variables in this area, because these variables have fixed locations and fit better than other global data, such as arrays.
This three-part division of memory is not the only possible one. However, it has two important characteristics: the two dynamically expandable segments are as far apart as possible, and they can grow to use a program's entire address space.
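The same three-part division is visible from C (a minimal sketch; the function and variable names are illustrative): globals live in static data, malloc carves blocks out of dynamic data, and locals occupy the stack segment.

```c
#include <assert.h>
#include <stdlib.h>

/* Static data: size known to the compiler, alive for the whole run. */
int global_table[4] = {1, 2, 3, 4};

/* Dynamic data: obtained at run time; malloc draws on the expandable
   heap (via sbrk or an equivalent system call).                      */
int *copy_table(void)
{
    int *copy = malloc(sizeof global_table);
    if (copy == NULL)
        return NULL;
    for (int i = 0; i < 4; i++)
        copy[i] = global_table[i];
    return copy;
}

/* Stack data: locals live in the stack segment and vanish when the
   procedure returns.                                                 */
int sum_table(void)
{
    int total = 0;            /* allocated in this call's stack frame */
    for (int i = 0; i < 4; i++)
        total += global_table[i];
    return total;
}
```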
A.6 Procedure Call Convention<br />
register use convention: Also called procedure call convention. A software protocol governing the use of registers by procedures.
Conventions governing the use of registers are necessary when procedures in a<br />
program are compiled separately. To compile a particular procedure, a compiler<br />
must know which registers it may use <strong>and</strong> which registers are reserved for other<br />
procedures. Rules for using registers are called register use or procedure call<br />
conventions. As the name implies, these rules are, for the most part, conventions<br />
followed by software rather than rules enforced by hardware. However, most compilers
<strong>and</strong> programmers try very hard to follow these conventions because violating<br />
them causes insidious bugs.<br />
The calling convention described in this section is the one used by the gcc compiler.<br />
The native MIPS compiler uses a more complex convention that is slightly<br />
faster.<br />
The MIPS CPU contains 32 general-purpose registers that are numbered 0–31.

■ Register $0 always contains the hardwired value 0.

■ Registers $at (1), $k0 (26), and $k1 (27) are reserved for the assembler and operating system and should not be used by user programs or compilers.

■ Registers $a0–$a3 (4–7) are used to pass the first four arguments to routines (remaining arguments are passed on the stack). Registers $v0 and $v1 (2, 3) are used to return values from functions.
■ Registers $t0–$t9 (8–15, 24, 25) are caller-saved registers that are used<br />
to hold temporary quantities that need not be preserved across calls (see<br />
Section 2.8 in Chapter 2).<br />
■ Registers $s0–$s7 (16–23) are callee-saved registers that hold long-lived<br />
values that should be preserved across calls.<br />
■ Register $gp (28) is a global pointer that points to the middle of a 64K block<br />
of memory in the static data segment.<br />
■ Register $sp (29) is the stack pointer, which points to the last location on<br />
the stack. Register $fp (30) is the frame pointer. The jal instruction writes<br />
register $ra (31), the return address from a procedure call. These two registers<br />
are explained in the next section.<br />
The two-letter abbreviations <strong>and</strong> names for these registers—for example $sp<br />
for the stack pointer—reflect the registers’ intended uses in the procedure call<br />
convention. In describing this convention, we will use the names instead of register numbers. Figure A.6.1 lists the registers and describes their intended uses.
Procedure Calls<br />
This section describes the steps that occur when one procedure (the caller) invokes<br />
another procedure (the callee). Programmers who write in a high-level language<br />
(like C or Pascal) never see the details of how one procedure calls another, because<br />
the compiler takes care of this low-level bookkeeping. However, assembly language<br />
programmers must explicitly implement every procedure call <strong>and</strong> return.<br />
Most of the bookkeeping associated with a call is centered around a block<br />
of memory called a procedure call frame. This memory is used for a variety of<br />
purposes:<br />
■ To hold values passed to a procedure as arguments<br />
■ To save registers that a procedure may modify, but which the procedure’s<br />
caller does not want changed<br />
■ To provide space for variables local to a procedure<br />
In most programming languages, procedure calls <strong>and</strong> returns follow a strict<br />
last-in, first-out (LIFO) order, so this memory can be allocated <strong>and</strong> deallocated on<br />
a stack, which is why these blocks of memory are sometimes called stack frames.<br />
Figure A.6.2 shows a typical stack frame. The frame consists of the memory<br />
between the frame pointer ($fp), which points to the first word of the frame,<br />
<strong>and</strong> the stack pointer ($sp), which points to the last word of the frame. The stack<br />
grows down from higher memory addresses, so the frame pointer points above the<br />
caller-saved register: A register saved by the routine making a procedure call.

callee-saved register: A register saved by the routine being called.
procedure call frame: A block of memory that is used to hold values passed to a procedure as arguments, to save registers that a procedure may modify but that the procedure's caller does not want changed, and to provide space for variables local to a procedure.
[Figure: from higher memory addresses down to lower: argument 6, then argument 5 (the first word of the frame, at $fp), saved registers, and local variables (ending at $sp); the stack grows toward lower addresses.]
FIGURE A.6.2 Layout of a stack frame. The frame pointer ($fp) points to the first word in the<br />
currently executing procedure’s stack frame. The stack pointer ($sp) points to the last word of the frame. The<br />
first four arguments are passed in registers, so the fifth argument is the first one stored on the stack.<br />
A stack frame may be built in many different ways; however, the caller and callee must agree on the sequence of steps. The steps below describe the calling convention used on most MIPS machines. This convention comes into play at three points during a procedure call: immediately before the caller invokes the callee, just as the callee starts executing, and immediately before the callee returns to the caller. In the first part, the caller puts the procedure call arguments in standard places and invokes the callee. To do so, the caller performs the following steps:
1. Pass arguments. By convention, the first four arguments are passed in registers<br />
$a0–$a3. Any remaining arguments are pushed on the stack <strong>and</strong> appear<br />
at the beginning of the called procedure’s stack frame.<br />
2. Save caller-saved registers. The called procedure can use these registers<br />
($a0–$a3 <strong>and</strong> $t0–$t9) without first saving their value. If the caller expects<br />
to use one of these registers after a call, it must save its value before the call.<br />
3. Execute a jal instruction (see Section 2.8 of Chapter 2), which jumps to the<br />
callee’s first instruction <strong>and</strong> saves the return address in register $ra.
Before a called routine starts running, it must take the following steps to set up<br />
its stack frame:<br />
1. Allocate memory for the frame by subtracting the frame’s size from the stack<br />
pointer.<br />
2. Save callee-saved registers in the frame. A callee must save the values in<br />
these registers ($s0–$s7, $fp, <strong>and</strong> $ra) before altering them, since the<br />
caller expects to find these registers unchanged after the call. Register $fp is<br />
saved by every procedure that allocates a new stack frame. However, register<br />
$ra only needs to be saved if the callee itself makes a call. The other callee-saved registers that are used must also be saved.
3. Establish the frame pointer by adding the stack frame’s size minus 4 to $sp<br />
<strong>and</strong> storing the sum in register $fp.<br />
Hardware/Software Interface
The MIPS register use convention provides callee- and caller-saved registers, because both types of registers are advantageous in different circumstances. Callee-saved registers are better used to hold long-lived values, such as variables from a user's program. These registers are only saved during a procedure call if the callee expects to use the register. On the other hand, caller-saved registers are better used to hold short-lived quantities that do not persist across a call, such as immediate values in an address calculation. During a call, the callee can also use these registers for short-lived temporaries.
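The tradeoff can be sketched in a few instructions (helper is a hypothetical routine, and a real caller would itself have to save and restore $s0 in its own frame before using it):

```asm
        move $s0, $a0             # long-lived: argument survives the call in callee-saved $s0
        sll  $t0, $s0, 2          # short-lived temporary in caller-saved $t0
        sw   $t0, 0($a1)          # $t0 is dead after this store, so nothing to save
        jal  helper               # helper may clobber $t0-$t9 but must preserve $s0-$s7
        addu $v0, $v0, $s0        # $s0 still holds the original argument here
```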
Finally, the callee returns to the caller by executing the following steps:<br />
1. If the callee is a function that returns a value, place the returned value in<br />
register $v0.<br />
2. Restore all callee-saved registers that were saved upon procedure entry.<br />
3. Pop the stack frame by adding the frame size to $sp.<br />
4. Return by jumping to the address in register $ra.<br />
recursive procedures: Procedures that call themselves either directly or indirectly through a chain of calls.
Elaboration: A programming language that does not permit recursive procedures—<br />
procedures that call themselves either directly or indirectly through a chain of calls—need<br />
not allocate frames on a stack. In a nonrecursive language, each procedure’s frame<br />
may be statically allocated, since only one invocation of a procedure can be active at a<br />
time. Older versions of Fortran prohibited recursion, because statically allocated frames<br />
produced faster code on some older machines. However, on load-store architectures like MIPS, stack frames may be just as fast, because a frame pointer register points directly
to the active stack frame, which permits a single load or store instruction to access values in the frame. In addition, recursion is a valuable programming technique.
Procedure Call Example<br />
As an example, consider the C routine<br />
main ()<br />
{<br />
printf ("The factorial of 10 is %d\n", fact (10));
}<br />
int fact (int n)<br />
{<br />
if (n < 1)<br />
return (1);<br />
else<br />
return (n * fact (n - 1));<br />
}<br />
which computes <strong>and</strong> prints 10! (the factorial of 10, 10! = 10 × 9 × . . . × 1). fact is<br />
a recursive routine that computes n! by multiplying n times (n - 1)!. The assembly<br />
code for this routine illustrates how programs manipulate stack frames.<br />
Upon entry, the routine main creates its stack frame and saves the two callee-saved registers it will modify: $fp and $ra. The frame is larger than required for these two registers because the calling convention requires the minimum size of a stack frame to be 24 bytes. This minimum frame can hold four argument registers ($a0–$a3) and the return address $ra, padded to a double-word boundary (24 bytes). Since main also needs to save $fp, its stack frame must be two words larger (remember: the stack pointer is kept double-word aligned).
.text<br />
.globl main<br />
main:<br />
subu $sp,$sp,32 # Stack frame is 32 bytes long<br />
sw $ra,20($sp) # Save return address<br />
sw $fp,16($sp) # Save old frame pointer<br />
addiu $fp,$sp,28 # Set up frame pointer<br />
The routine main then calls the factorial routine <strong>and</strong> passes it the single argument<br />
10. After fact returns, main calls the library routine printf <strong>and</strong> passes it both<br />
a format string <strong>and</strong> the result returned from fact:
li $a0,10 # Put argument (10) in $a0<br />
jal fact # Call factorial function<br />
la $a0,$LC # Put format string in $a0<br />
move $a1,$v0 # Move fact result to $a1<br />
jal printf # Call the print function<br />
Finally, after printing the factorial, main returns. But first, it must restore the<br />
registers it saved <strong>and</strong> pop its stack frame:<br />
lw $ra,20($sp) # Restore return address<br />
lw $fp,16($sp) # Restore frame pointer<br />
addiu $sp,$sp,32 # Pop stack frame<br />
jr $ra # Return to caller<br />
.rdata<br />
$LC:<br />
.ascii "The factorial of 10 is %d\n\000"
The factorial routine is similar in structure to main. First, it creates a stack frame<br />
<strong>and</strong> saves the callee-saved registers it will use. In addition to saving $ra <strong>and</strong> $fp,<br />
fact also saves its argument ($a0), which it will use for the recursive call:<br />
.text<br />
fact:<br />
subu $sp,$sp,32 # Stack frame is 32 bytes long<br />
sw $ra,20($sp) # Save return address<br />
sw $fp,16($sp) # Save frame pointer<br />
addiu $fp,$sp,28 # Set up frame pointer<br />
sw $a0,0($fp) # Save argument (n)<br />
The heart of the fact routine performs the computation from the C program.<br />
It tests whether the argument is greater than 0. If not, the routine returns the<br />
value 1. If the argument is greater than 0, the routine recursively calls itself to<br />
compute fact(n–1) <strong>and</strong> multiplies that value times n:<br />
lw $v0,0($fp) # Load n<br />
bgtz $v0,$L2 # Branch if n > 0<br />
li $v0,1 # Return 1<br />
j $L1 # Jump to code to return
$L2:<br />
lw $v1,0($fp) # Load n<br />
subu $v0,$v1,1 # Compute n - 1<br />
move $a0,$v0 # Move value to $a0
jal fact # Call factorial function<br />
lw $v1,0($fp) # Load n<br />
mul $v0,$v0,$v1 # Compute fact(n-1) * n<br />
Finally, the factorial routine restores the callee-saved registers <strong>and</strong> returns the<br />
value in register $v0:<br />
$L1: # Result is in $v0<br />
lw $ra, 20($sp) # Restore $ra<br />
lw $fp, 16($sp) # Restore $fp<br />
addiu $sp, $sp, 32 # Pop stack<br />
jr $ra # Return to caller<br />
Stack in Recursive Procedure<br />
EXAMPLE

Figure A.6.3 shows the stack at the call fact(7). main runs first, so its frame is deepest on the stack. main calls fact(10), whose stack frame is next on the stack. Each invocation recursively invokes fact to compute the next-lowest factorial. The stack frames parallel the LIFO order of these calls. What does the stack look like when the call to fact(10) returns?
FIGURE A.6.3 Stack frames during the call of fact(7). [Figure: main's frame, holding an old $ra and old $fp, is deepest on the stack; above it are the frames for fact(10), fact(9), fact(8), and fact(7), each holding an old $a0, old $ra, and old $fp; the stack grows toward the most recent frame, fact(7).]
ANSWER

[Figure: only main's frame, holding the old $ra and old $fp, remains on the stack; all the fact frames have been popped.]
Elaboration: The difference between the MIPS compiler <strong>and</strong> the gcc compiler is that<br />
the MIPS compiler usually does not use a frame pointer, so this register is available as<br />
another callee-saved register, $s8. This change saves a couple of instructions in the<br />
procedure call <strong>and</strong> return sequence. However, it complicates code generation, because<br />
a procedure must access its stack frame with $sp, whose value can change during a<br />
procedure’s execution if values are pushed on the stack.<br />
Another Procedure Call Example<br />
As another example, consider the following routine that computes the tak function,<br />
which is a widely used benchmark created by Ikuo Takeuchi. This function<br />
does not compute anything useful, but is a heavily recursive program that illustrates<br />
the MIPS calling convention.<br />
int tak (int x, int y, int z)<br />
{<br />
if (y < x)<br />
return 1 + tak (tak (x - 1, y, z),
tak (y - 1, z, x),<br />
tak (z - 1, x, y));<br />
else<br />
return z;<br />
}<br />
int main ()<br />
{<br />
tak(18, 12, 6);<br />
}<br />
The assembly code for this program is shown below. The tak function first saves its return address in its stack frame and its arguments in callee-saved registers, since the routine may make calls that need to use registers $a0–$a2 and $ra. The function uses callee-saved registers, since they hold values that persist over the
lifetime of the function, which includes several calls that could potentially modify<br />
registers.<br />
.text<br />
.globl tak
tak:<br />
subu $sp, $sp, 40<br />
sw $ra, 32($sp)<br />
sw $s0, 16($sp) # x<br />
move $s0, $a0<br />
sw $s1, 20($sp) # y<br />
move $s1, $a1<br />
sw $s2, 24($sp) # z<br />
move $s2, $a2<br />
sw $s3, 28($sp) # temporary<br />
The routine then begins execution by testing if y < x. If not, it branches to label<br />
L1, which is shown below.<br />
bge $s1, $s0, L1 # branch to L1 if !(y < x)
If y < x, then it executes the body of the routine, which contains four recursive<br />
calls. The first call uses almost the same arguments as its parent:<br />
addiu $a0, $s0, -1<br />
move $a1, $s1<br />
move $a2, $s2<br />
jal tak # tak (x - 1, y, z)<br />
move $s3, $v0<br />
Note that the result from the first recursive call is saved in register $s3, so that it<br />
can be used later.<br />
The function now prepares arguments for the second recursive call.<br />
addiu $a0, $s1, -1<br />
move $a1, $s2<br />
move $a2, $s0<br />
jal tak # tak (y - 1, z, x)<br />
In the instructions below, the result from this recursive call is saved in register<br />
$s0. But first we need to read, for the last time, the saved value of the first argument<br />
from this register.
addiu $a0, $s2, -1<br />
move $a1, $s0<br />
move $a2, $s1<br />
move $s0, $v0<br />
jal tak # tak (z - 1, x, y)<br />
After the three inner recursive calls, we are ready for the final recursive call. After<br />
the call, the function’s result is in $v0 <strong>and</strong> control jumps to the function’s epilogue.<br />
move $a0, $s3<br />
move $a1, $s0<br />
move $a2, $v0<br />
jal tak # tak (tak(...), tak(...), tak(...))<br />
addiu $v0, $v0, 1<br />
j L2<br />
The code at label L1 is the else part of the if-then-else statement. It just moves the value of argument z into the return register and falls into the function epilogue.
L1:<br />
move $v0, $s2<br />
The code below is the function epilogue, which restores the saved registers <strong>and</strong><br />
returns the function’s result to its caller.<br />
L2:<br />
lw $ra, 32($sp)<br />
lw $s0, 16($sp)<br />
lw $s1, 20($sp)<br />
lw $s2, 24($sp)<br />
lw $s3, 28($sp)<br />
addiu $sp, $sp, 40<br />
jr $ra<br />
The main routine calls the tak function with its initial arguments, then takes the<br />
computed result (7) <strong>and</strong> prints it using SPIM’s system call for printing integers.<br />
.globl main<br />
main:<br />
subu $sp, $sp, 24<br />
sw $ra, 16($sp)<br />
li $a0, 18<br />
li $a1, 12
These seven registers are part of coprocessor 0’s register set. They are accessed<br />
by the mfc0 <strong>and</strong> mtc0 instructions. After an exception, register EPC contains the<br />
address of the instruction that was executing when the exception occurred. If the<br />
exception was caused by an external interrupt, then the instruction will not have<br />
started executing. All other exceptions are caused by the execution of the instruction<br />
at EPC, except when the offending instruction is in the delay slot of a branch<br />
or jump. In that case, EPC points to the branch or jump instruction <strong>and</strong> the BD bit<br />
is set in the Cause register. When that bit is set, the exception h<strong>and</strong>ler must look<br />
at EPC + 4 for the offending instruction. However, in either case, an exception handler properly resumes the program by returning to the instruction at EPC.
If the instruction that caused the exception made a memory access, register<br />
BadVAddr contains the referenced memory location’s address.<br />
The Count register is a timer that increments at a fixed rate (by default, every<br />
10 milliseconds) while SPIM is running. When the value in the Count register<br />
equals the value in the Compare register, a hardware interrupt at priority level 5<br />
occurs.<br />
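A program could arm this timer along the following lines (a sketch; Count and Compare are coprocessor 0 registers 9 and 11 in the standard MIPS numbering, and level-5 interrupts must also be unmasked in the Status register for the interrupt to be taken):

```asm
        mfc0  $t0, $9             # read the current Count value
        addiu $t0, $t0, 100       # pick a point 100 ticks in the future
        mtc0  $t0, $11            # set Compare; when Count reaches it,
                                  # a level-5 hardware interrupt occurs
```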
Figure A.7.1 shows the subset of the Status register fields implemented by the<br />
MIPS simulator SPIM. The interrupt mask field contains a bit for each of the<br />
six hardware <strong>and</strong> two software interrupt levels. A mask bit that is 1 allows interrupts<br />
at that level to interrupt the processor. A mask bit that is 0 disables interrupts<br />
at that level. When an interrupt arrives, it sets its interrupt pending bit in the<br />
Cause register, even if the mask bit is disabled. When an interrupt is pending, it will<br />
interrupt the processor when its mask bit is subsequently enabled.<br />
The user mode bit is 0 if the processor is running in kernel mode <strong>and</strong> 1 if it is<br />
running in user mode. On SPIM, this bit is fixed at 1, since the SPIM processor<br />
does not implement kernel mode. The exception level bit is normally 0, but is set to<br />
1 after an exception occurs. When this bit is 1, interrupts are disabled <strong>and</strong> the EPC<br />
is not updated if another exception occurs. This bit prevents an exception h<strong>and</strong>ler<br />
from being disturbed by an interrupt or exception, but it should be reset when the<br />
h<strong>and</strong>ler finishes. If the interrupt enable bit is 1, interrupts are allowed. If it is<br />
0, they are disabled.<br />
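For example, a program might turn interrupts on by setting the mask and interrupt enable bits in the Status register (a sketch, assuming the standard MIPS32 bit positions: interrupt mask in bits 15–8, interrupt enable in bit 0):

```asm
        mfc0 $t0, $12             # Status is coprocessor 0 register 12
        ori  $t0, $t0, 0xff01     # set all eight interrupt mask bits
                                  # and the interrupt enable bit
        mtc0 $t0, $12             # interrupts at all levels may now occur
```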
Figure A.7.2 shows the subset of Cause register fields that SPIM implements.<br />
The branch delay bit is 1 if the last exception occurred in an instruction executed in<br />
the delay slot of a branch. The interrupt pending bits become 1 when an interrupt is raised.
faults are requests from a process to the operating system to perform a service,<br />
such as bringing in a page from disk. The operating system processes these requests<br />
and resumes the process. The final type of exception is an interrupt from an external device. Interrupts generally cause the operating system to move data to or from an I/O device and resume the interrupted process.
The code in the example below is a simple exception h<strong>and</strong>ler, which invokes<br />
a routine to print a message at each exception (but not interrupts). This code is<br />
similar to the exception h<strong>and</strong>ler (exceptions.s) used by the SPIM simulator.<br />
Exception H<strong>and</strong>ler<br />
EXAMPLE<br />
The exception h<strong>and</strong>ler first saves register $at, which is used in pseudoinstructions<br />
in the h<strong>and</strong>ler code, then saves $a0 <strong>and</strong> $a1, which it later uses to<br />
pass arguments. The exception h<strong>and</strong>ler cannot store the old values from these<br />
registers on the stack, as would an ordinary routine, because the cause of the<br />
exception might have been a memory reference that used a bad value (such<br />
as 0) in the stack pointer. Instead, the exception h<strong>and</strong>ler stores these registers<br />
in an exception h<strong>and</strong>ler register ($k1, since it can’t access memory without<br />
using $at) <strong>and</strong> two memory locations (save0 <strong>and</strong> save1). If the exception<br />
routine itself could be interrupted, two locations would not be enough since<br />
the second exception would overwrite values saved during the first exception.<br />
However, this simple exception h<strong>and</strong>ler finishes running before it enables<br />
interrupts, so the problem does not arise.<br />
.ktext 0x80000180<br />
move $k1, $at # Save $at register
sw $a0, save0 # H<strong>and</strong>ler is not re-entrant <strong>and</strong> can’t use<br />
sw $a1, save1 # stack to save $a0, $a1<br />
# Don’t need to save $k0/$k1<br />
The exception h<strong>and</strong>ler then moves the Cause <strong>and</strong> EPC registers into CPU<br />
registers. The Cause <strong>and</strong> EPC registers are not part of the CPU register set.<br />
Instead, they are registers in coprocessor 0, which is the part of the CPU that handles exceptions. The instruction mfc0 $k0, $13 moves coprocessor 0's
register 13 (the Cause register) into CPU register $k0. Note that the exception<br />
h<strong>and</strong>ler need not save registers $k0 <strong>and</strong> $k1, because user programs are not<br />
supposed to use these registers. The exception h<strong>and</strong>ler uses the value from the<br />
Cause reg ister to test whether the exception was caused by an interrupt (see<br />
the preceding ta ble). If so, the exception is ignored. If the exception was not an<br />
interrupt, the h<strong>and</strong>ler calls print_excp to print a message.
A.7 Exceptions and Interrupts A-37
mfc0 $k0, $13 # Move Cause into $k0<br />
srl $a0, $k0, 2 # Extract ExcCode field<br />
andi $a0, $a0, 0xf
beq $a0, $zero, done # Branch to done if ExcCode is Int (0)
move $a0, $k0 # Move Cause into $a0
mfc0 $a1, $14 # Move EPC into $a1
jal print_excp # Print exception error message<br />
Before returning, the exception handler clears the Cause register; resets<br />
the Status register to enable interrupts and clear the EXL bit, which allows<br />
subsequent exceptions to change the EPC register; and restores registers $a0,<br />
$a1, and $at. It then executes the eret (exception return) instruction, which<br />
returns to the instruction pointed to by EPC. This exception handler returns<br />
to the instruction following the one that caused the exception, so as to not<br />
re-execute the faulting instruction and cause the same exception again.<br />
done: mfc0 $k0, $14 # Bump EPC<br />
addiu $k0, $k0, 4 # Do not re-execute<br />
# faulting instruction<br />
mtc0 $k0, $14 # EPC<br />
mtc0 $0, $13 # Clear Cause register<br />
mfc0 $k0, $12 # Fix Status register<br />
andi $k0, 0xfffd # Clear EXL bit<br />
ori $k0, 0x1 # Enable interrupts<br />
mtc0 $k0, $12<br />
lw $a0, save0 # Restore registers<br />
lw $a1, save1<br />
mov $at, $k1<br />
eret # Return to EPC<br />
.kdata<br />
save0: .word 0<br />
save1: .word 0
A-38 Appendix A Assemblers, Linkers, and the SPIM Simulator<br />
Elaboration: On real MIPS processors, the return from an exception handler is more<br />
complex. The exception handler cannot always jump to the instruction following EPC. For<br />
example, if the instruction that caused the exception was in a branch instruction’s delay<br />
slot (see Chapter 4), the next instruction to execute may not be the following instruction<br />
in memory.<br />
A.8 Input and Output<br />
SPIM simulates one I/O device: a memory-mapped console on which a program<br />
can read and write characters. When a program is running, SPIM connects its<br />
own terminal (or a separate console window in the X-window version xspim or<br />
the Windows version PCSpim) to the processor. A MIPS program running on<br />
SPIM can read the characters that you type. In addition, if the MIPS program<br />
writes characters to the terminal, they appear on SPIM’s terminal or console window.<br />
One exception to this rule is control-C: this character is not passed to the<br />
program, but instead causes SPIM to stop and return to command mode. When<br />
the program stops running (for example, because you typed control-C or because<br />
the program hit a breakpoint), the terminal is reconnected to SPIM so you can type<br />
SPIM commands.<br />
To use memory-mapped I/O (see below), spim or xspim must be started<br />
with the -mapped_io flag. PCSpim can enable memory-mapped I/O through a<br />
command line flag or the “Settings” dialog.<br />
The terminal device consists of two independent units: a receiver and a transmitter.<br />
The receiver reads characters from the keyboard. The transmitter displays<br />
characters on the console. The two units are completely independent. This means,<br />
for example, that characters typed at the keyboard are not automatically echoed on<br />
the display. Instead, a program echoes a character by reading it from the receiver<br />
and writing it to the transmitter.<br />
A program controls the terminal with four memory-mapped device registers,<br />
as shown in Figure A.8.1. “Memory-mapped’’ means that each register appears as<br />
a special memory location. The Receiver Control register is at location 0xffff0000.<br />
Only two of its bits are actually used. Bit 0 is called “ready’’: if it is 1, it means<br />
that a character has arrived from the keyboard but has not yet been read from the<br />
Receiver Data register. The ready bit is read-only: writes to it are ignored. The ready<br />
bit changes from 0 to 1 when a character is typed at the keyboard, and it changes<br />
from 1 to 0 when the character is read from the Receiver Data register.
and is read-only. If this bit is 1, the transmitter is ready to accept a new character<br />
for output. If it is 0, the transmitter is still busy writing the previous character.<br />
Bit 1 is “interrupt enable’’ and is readable and writable. If this bit is set to 1, then<br />
the terminal requests an interrupt at hardware level 0 whenever the transmitter is<br />
ready for a new character, and the ready bit becomes 1.<br />
The final device register is the Transmitter Data register (at address 0xffff000c).<br />
When a value is written into this location, its low-order eight bits (i.e., an ASCII<br />
character as in Figure 2.15 in Chapter 2) are sent to the console. When the Transmitter<br />
Data register is written, the ready bit in the Transmitter Control register is<br />
reset to 0. This bit stays 0 until enough time has elapsed to transmit the character<br />
to the terminal; then the ready bit becomes 1 again. The Transmitter Data register<br />
should only be written when the ready bit of the Transmitter Control register is 1.<br />
If the transmitter is not ready, writes to the Transmitter Data register are ignored<br />
(the write appears to succeed but the character is not output).<br />
Real computers require time to send characters to a console or terminal. These<br />
time lags are simulated by SPIM. For example, after the transmitter starts to write a<br />
character, the transmitter’s ready bit becomes 0 for a while. SPIM measures time in<br />
instructions executed, not in real clock time. This means that the transmitter does<br />
not become ready again until the processor executes a fixed number of instructions.<br />
If you stop the machine and look at the ready bit, it will not change. However, if you<br />
let the machine run, the bit eventually changes back to 1.<br />
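The description above amounts to a polling output routine. The following is a minimal sketch, assuming the standard SPIM addresses for the Transmitter Control (0xffff0008) and Transmitter Data (0xffff000c) registers and that SPIM was started with memory-mapped I/O enabled; the label names are illustrative.

```asm
# Write the character in $a0 to the console by polling.
# Assumes SPIM's memory-mapped I/O (-mapped_io) is enabled.
putc:   lui  $t0, 0xffff        # $t0 = 0xffff0000, base of the device registers
wait:   lw   $t1, 8($t0)        # Read Transmitter Control (0xffff0008)
        andi $t1, $t1, 1        # Isolate the ready bit (bit 0)
        beqz $t1, wait          # Spin until the transmitter is ready
        sw   $a0, 12($t0)       # Write character to Transmitter Data (0xffff000c)
        jr   $ra                # Return to caller
```

Polling the ready bit before the store matters because, as noted above, writes to the Transmitter Data register are silently dropped while the transmitter is busy.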
A.9 SPIM<br />
SPIM is a software simulator that runs assembly language programs written for<br />
processors that implement the MIPS-32 architecture, specifically Release 1 of this<br />
architecture with a fixed memory mapping, no caches, and only coprocessors 0<br />
and 1.² SPIM’s name is just MIPS spelled backwards. SPIM can read and immediately<br />
execute assembly language files. SPIM is a self-contained system for running<br />
2. Earlier versions of SPIM (before 7.0) implemented the MIPS-1 architecture used in the original<br />
MIPS R2000 processors. This architecture is almost a proper subset of the MIPS-32 architecture,<br />
with the difference being the manner in which exceptions are handled. MIPS-32 also introduced<br />
approximately 60 new instructions, which are supported by SPIM. Programs that ran on the<br />
earlier versions of SPIM and did not use exceptions should run unmodified on newer versions of<br />
SPIM. Programs that used exceptions will require minor changes.
MIPS programs. It contains a debugger and provides a few operating system–like<br />
services. SPIM is much slower than a real computer (100 or more times). However,<br />
its low cost and wide availability cannot be matched by real hardware!<br />
An obvious question is, “Why use a simulator when most people have PCs that<br />
contain processors that run significantly faster than SPIM?” One reason is that<br />
the processors in PCs are Intel 80x86s, whose architecture is far less regular and<br />
far more complex to understand and program than MIPS processors. The MIPS<br />
architecture may be the epitome of a simple, clean RISC machine.<br />
In addition, simulators can provide a better environment for assembly programming<br />
than an actual machine because they can detect more errors and provide<br />
a better interface than can an actual computer.<br />
Finally, simulators are useful tools in studying computers and the programs that<br />
run on them. Because they are implemented in software, not silicon, simulators can<br />
be examined and easily modified to add new instructions, build new systems such<br />
as multiprocessors, or simply collect data.<br />
Simulation of a Virtual Machine<br />
The basic MIPS architecture is difficult to program directly because of delayed<br />
branches, delayed loads, and restricted address modes. This difficulty is tolerable<br />
since these computers were designed to be programmed in high-level languages<br />
and present an interface designed for compilers rather than assembly language<br />
programmers. A good part of the programming complexity results from delayed<br />
instructions. A delayed branch requires two cycles to execute (see the Elaborations<br />
on pages 284 and 322 of Chapter 4). In the second cycle, the instruction immediately<br />
following the branch executes. This instruction can perform useful work<br />
that normally would have been done before the branch. It can also be a nop (no<br />
operation) that does nothing. Similarly, delayed loads require two cycles to bring<br />
a value from memory, so the instruction immediately following a load cannot use<br />
the value (see Section 4.2 of Chapter 4).<br />
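As an illustrative sketch of this constraint, consider loading a word and immediately using it. On the bare machine, the instruction after the load must not depend on the loaded value, so the delay slot is filled with a nop (or other useful work); the virtual machine accepted by the assembler hides the delay:

```asm
# Bare machine: the instruction after the lw cannot use $t0,
# so a nop (or other useful work) must fill the load delay slot.
        lw   $t0, 0($a0)
        nop                     # load delay slot
        addu $t1, $t0, $t0

# Virtual machine: the assembler hides the delay,
# so the loaded value may be used immediately.
        lw   $t0, 0($a0)
        addu $t1, $t0, $t0
```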
MIPS wisely chose to hide this complexity by having its assembler implement<br />
a virtual machine. This virtual computer appears to have nondelayed branches<br />
and loads and a richer instruction set than the actual hardware. The assembler<br />
reorganizes (rearranges) instructions to fill the delay slots. The virtual computer<br />
also provides pseudoinstructions, which appear as real instructions in assembly<br />
language programs. The hardware, however, knows nothing about pseudoinstructions,<br />
so the assembler must translate them into equivalent sequences of actual<br />
machine instructions. For example, the MIPS hardware only provides instructions<br />
to branch when a register is equal to or not equal to 0. Other conditional branches,<br />
such as one that branches when one register is greater than another, are synthesized<br />
by comparing the two registers and branching when the result of the comparison<br />
is true (nonzero).<br />
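For example, a branch-on-greater-than pseudoinstruction can be synthesized with slt and bne. This is a sketch of one plausible expansion; the assembler's actual output may differ:

```asm
# Pseudoinstruction:   bgt $t0, $t1, target
# One possible expansion, using the reserved register $at:
        slt  $at, $t1, $t0      # $at = 1 if $t1 < $t0 (i.e., $t0 > $t1)
        bne  $at, $zero, target # branch when the comparison is true
```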
virtual machine A virtual computer that appears to have nondelayed branches and loads and a richer instruction set than the actual hardware.
Another surprise (which occurs on the real machine as well) is that a pseudoinstruction<br />
expands to several machine instructions. When you single-step or<br />
examine memory, the instructions that you see are different from the source<br />
program. The correspondence between the two sets of instructions is fairly simple,<br />
since SPIM does not reorganize instructions to fill slots.<br />
Byte Order<br />
Processors can number bytes within a word so the byte with the lowest number is<br />
either the leftmost or rightmost one. The convention used by a machine is called<br />
its byte order. MIPS processors can operate with either big-endian or little-endian<br />
byte order. For example, in a big-endian machine, the directive .byte 0, 1, 2, 3<br />
would result in a memory word containing<br />
Byte #<br />
0 1 2 3<br />
while in a little-endian machine, the word would contain<br />
Byte #<br />
3 2 1 0<br />
SPIM operates with both byte orders. SPIM’s byte order is the same as the byte<br />
order of the underlying machine that runs the simulator. For example, on an Intel<br />
80x86, SPIM is little-endian, while on a Macintosh or Sun SPARC, SPIM is big-endian.<br />
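A short program can observe the byte order directly. The sketch below loads the word assembled by .byte 0, 1, 2, 3 and masks its low-order byte, which is 3 on a big-endian host and 0 on a little-endian host:

```asm
        .data
word:   .byte 0, 1, 2, 3        # one word of memory, bytes 0..3
        .text
main:   lw   $t0, word          # load the whole word
        andi $t0, $t0, 0xff     # extract the low-order byte:
                                # 3 if big-endian, 0 if little-endian
```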
System Calls<br />
SPIM provides a small set of operating system–like services through the system<br />
call (syscall) instruction. To request a service, a program loads the system call<br />
code (see Figure A.9.1) into register $v0 and arguments into registers $a0–$a3<br />
(or $f12 for floating-point values). System calls that return values put their results<br />
in register $v0 (or $f0 for floating-point results). For example, the following code<br />
prints "the answer = 5":<br />
.data<br />
str: .asciiz "the answer = "<br />
.text
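The rest of this example is lost to the page break above. Its continuation presumably follows the standard pattern below, using the print_string (code 4) and print_int (code 1) system calls; this is a reconstruction, not the book's verbatim text:

```asm
        .text
main:   li   $v0, 4             # system call code for print_string
        la   $a0, str           # address of string to print
        syscall                 # print "the answer = "
        li   $v0, 1             # system call code for print_int
        li   $a0, 5             # integer to print
        syscall                 # print 5
```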
A.10 MIPS R2000 Assembly Language A-47<br />
lui $at, 4096<br />
addu $at, $at, $a1<br />
lw $a0, 8($at)<br />
The first instruction loads the upper bits of the label’s address into register $at, which<br />
is the register that the assembler reserves for its own use. The second instruction adds<br />
the contents of register $a1 to the label’s partial address. Finally, the load instruction<br />
uses the hardware address mode to add the sum of the lower bits of the label’s address<br />
and the offset from the original instruction to the value in register $at.<br />
Assembler Syntax<br />
Comments in assembler files begin with a sharp sign (#). Everything from the<br />
sharp sign to the end of the line is ignored.<br />
Identifiers are a sequence of alphanumeric characters, underbars (_), and dots<br />
(.) that do not begin with a number. Instruction opcodes are reserved words that<br />
cannot be used as identifiers. Labels are declared by putting them at the beginning<br />
of a line followed by a colon, for example:<br />
.data<br />
item: .word 1<br />
.text<br />
.globl main # Must be global<br />
main: lw $t0, item<br />
Numbers are base 10 by default. If they are preceded by 0x, they are interpreted<br />
as hexadecimal. Hence, 256 and 0x100 denote the same value.<br />
Strings are enclosed in double quotes ("). Special characters in strings follow the<br />
C convention:<br />
■ newline \n<br />
■ tab \t<br />
■ quote \"<br />
SPIM supports a subset of the MIPS assembler directives:<br />
.align n: Align the next datum on a 2^n byte boundary. For example, .align 2 aligns the next value on a word boundary. .align 0 turns off automatic alignment of .half, .word, .float, and .double directives until the next .data or .kdata directive.<br />
.ascii str: Store the string str in memory, but do not null-terminate it.
.asciiz str: Store the string str in memory and null-terminate it.<br />
.byte b1,..., bn: Store the n values in successive bytes of memory.<br />
.data <addr>: Subsequent items are stored in the data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.<br />
.double d1,..., dn: Store the n double-precision floating-point numbers in successive memory locations.<br />
.extern sym size: Declare that the datum stored at sym is size bytes large and is a global label. This directive enables the assembler to store the datum in a portion of the data segment that is efficiently accessed via register $gp.<br />
.float f1,..., fn: Store the n single-precision floating-point numbers in successive memory locations.<br />
.globl sym: Declare that label sym is global and can be referenced from other files.<br />
.half h1,..., hn: Store the n 16-bit quantities in successive memory halfwords.<br />
.kdata <addr>: Subsequent data items are stored in the kernel data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.<br />
.ktext <addr>: Subsequent items are put in the kernel text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.<br />
.set noat and .set at: The first directive prevents SPIM from complaining about subsequent instructions that use register $at. The second directive re-enables the warning. Since pseudoinstructions expand into code that uses register $at, programmers must be very careful about leaving values in this register.<br />
.space n: Allocates n bytes of space in the current segment (which must be the data segment in SPIM).
.text <addr>: Subsequent items are put in the user text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.<br />
.word w1,..., wn: Store the n 32-bit quantities in successive memory words.<br />
SPIM does not distinguish various parts of the data segment (.data, .rdata, and<br />
.sdata).<br />
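A small data segment pulling several of these directives together might look like the following sketch (the label names are illustrative):

```asm
        .data
str:    .asciiz "hello"         # null-terminated string
        .align 2                # realign to a word boundary after the string
vals:   .word 1, 2, 3           # three 32-bit words
h:      .half 7                 # one 16-bit halfword
buf:    .space 64               # reserve 64 uninitialized bytes
```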
Encoding MIPS Instructions<br />
Figure A.10.2 explains how a MIPS instruction is encoded in a binary number.<br />
Each column contains instruction encodings for a field (a contiguous group of<br />
bits) from an instruction. The numbers at the left margin are values for a field.<br />
For example, the j opcode has a value of 2 in the opcode field. The text at the top<br />
of a column names a field and specifies which bits it occupies in an instruction.<br />
For example, the op field is contained in bits 26–31 of an instruction. This field<br />
encodes most instructions. However, some groups of instructions use additional<br />
fields to distinguish related instructions. For example, the different floating-point<br />
instructions are specified by bits 0–5. The arrows from the first column show which<br />
opcodes use these additional fields.<br />
Instruction Format<br />
The rest of this appendix describes both the instructions implemented by actual<br />
MIPS hardware and the pseudoinstructions provided by the MIPS assembler. The<br />
two types of instructions are easily distinguished. Actual instructions depict the<br />
fields in their binary representation. For example, in<br />
Addition (with overflow)<br />
add rd, rs, rt<br />
0 rs rt rd 0 0x20<br />
6 5 5 5 5 6<br />
the add instruction consists of six fields. Each field’s size in bits is the small number<br />
below the field. This instruction begins with six bits of 0s. Register specifiers begin<br />
with an r, so the next field is a 5-bit register specifier called rs. This is the same<br />
register that is the second argument in the symbolic assembly at the left of this<br />
line. Another common field is imm16, which is a 16-bit immediate number.
Pseudoinstructions follow roughly the same conventions, but omit instruction<br />
encoding information. For example:<br />
Multiply (without overflow)<br />
mul rdest, rsrc1, src2<br />
pseudoinstruction<br />
In pseudoinstructions, rdest and rsrc1 are registers and src2 is either a register<br />
or an immediate value. In general, the assembler and SPIM translate a more<br />
general form of an instruction (e.g., add $v1, $a0, 0x55) to a specialized form<br />
(e.g., addi $v1, $a0, 0x55).<br />
Arithmetic and Logical Instructions<br />
Absolute value<br />
abs rdest, rsrc<br />
pseudoinstruction<br />
Put the absolute value of register rsrc in register rdest.<br />
Addition (with overflow)<br />
add rd, rs, rt<br />
0 rs rt rd 0 0x20<br />
6 5 5 5 5 6<br />
Addition (without overflow)<br />
addu rd, rs, rt<br />
0 rs rt rd 0 0x21<br />
6 5 5 5 5 6<br />
Put the sum of registers rs and rt into register rd.<br />
Addition immediate (with overflow)<br />
addi rt, rs, imm<br />
8 rs rt imm<br />
6 5 5 16<br />
Addition immediate (without overflow)<br />
addiu rt, rs, imm<br />
9 rs rt imm<br />
6 5 5 16<br />
Put the sum of register rs and the sign-extended immediate into register rt.
AND<br />
and rd, rs, rt<br />
0 rs rt rd 0 0x24<br />
6 5 5 5 5 6<br />
Put the logical AND of registers rs and rt into register rd.<br />
AND immediate<br />
andi rt, rs, imm<br />
0xc rs rt imm<br />
6 5 5 16<br />
Put the logical AND of register rs and the zero-extended immediate into register<br />
rt.<br />
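Zero-extension makes andi convenient for masking out a low-order bit field, which is exactly how the exception handler earlier in this appendix extracts the ExcCode field from the Cause register:

```asm
# Extract the 4-bit field at bits 2-5 of $t0 into $t1
# (the pattern the exception handler uses on Cause).
        srl  $t1, $t0, 2        # shift the field down to bit 0
        andi $t1, $t1, 0xf      # mask to the low-order four bits
```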
Count leading ones<br />
clo rd, rs<br />
0x1c rs 0 rd 0 0x21<br />
6 5 5 5 5 6<br />
Count leading zeros<br />
clz rd, rs<br />
0x1c rs 0 rd 0 0x20<br />
6 5 5 5 5 6<br />
Count the number of leading ones (zeros) in the word in register rs and put<br />
the result into register rd. If a word is all ones (zeros), the result is 32.<br />
Divide (with overflow)<br />
div rs, rt<br />
0 rs rt 0 0x1a<br />
6 5 5 10 6<br />
Divide (without overflow)<br />
divu rs, rt<br />
0 rs rt 0 0x1b<br />
6 5 5 10 6<br />
Divide register rs by register rt. Leave the quotient in register lo and the remainder<br />
in register hi. Note that if an operand is negative, the remainder is unspecified<br />
by the MIPS architecture and depends on the convention of the machine on which<br />
SPIM is run.
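Since div leaves its results in lo and hi, a program retrieves them with the mflo and mfhi instructions, as in this short sketch:

```asm
        div  $t0, $t1           # lo = $t0 / $t1, hi = $t0 % $t1
        mflo $t2                # $t2 = quotient
        mfhi $t3                # $t3 = remainder
```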
Divide (with overflow)<br />
div rdest, rsrc1, src2<br />
pseudoinstruction<br />
Divide (without overflow)<br />
divu rdest, rsrc1, src2<br />
pseudoinstruction<br />
Put the quotient of register rsrc1 and src2 into register rdest.<br />
Multiply<br />
mult rs, rt<br />
0 rs rt 0 0x18<br />
6 5 5 10 6<br />
Unsigned multiply<br />
multu rs, rt<br />
0 rs rt 0 0x19<br />
6 5 5 10 6<br />
Multiply registers rs and rt. Leave the low-order word of the product in register<br />
lo and the high-order word in register hi.<br />
Multiply (without overflow)<br />
mul rd, rs, rt<br />
0x1c rs rt rd 0 2<br />
6 5 5 5 5 6<br />
Put the low-order 32 bits of the product of rs and rt into register rd.<br />
Multiply (with overflow)<br />
mulo rdest, rsrc1, src2<br />
pseudoinstruction<br />
Unsigned multiply (with overflow)<br />
mulou rdest, rsrc1, src2<br />
pseudoinstruction<br />
Put the low-order 32 bits of the product of register rsrc1 and src2 into register<br />
rdest.
Multiply add<br />
madd rs, rt<br />
0x1c rs rt 0 0<br />
6 5 5 10 6<br />
Unsigned multiply add<br />
maddu rs, rt<br />
0x1c rs rt 0 1<br />
6 5 5 10 6<br />
Multiply registers rs and rt and add the resulting 64-bit product to the 64-bit<br />
value in the concatenated registers lo and hi.<br />
Multiply subtract<br />
msub rs, rt<br />
0x1c rs rt 0 4<br />
6 5 5 10 6<br />
Unsigned multiply subtract<br />
msubu rs, rt<br />
0x1c rs rt 0 5<br />
6 5 5 10 6<br />
Multiply registers rs and rt and subtract the resulting 64-bit product from the<br />
64-bit value in the concatenated registers lo and hi.<br />
Negate value (with overflow)<br />
neg rdest, rsrc<br />
pseudoinstruction<br />
Negate value (without overflow)<br />
negu rdest, rsrc<br />
pseudoinstruction<br />
Put the negative of register rsrc into register rdest.<br />
NOR<br />
nor rd, rs, rt<br />
0 rs rt rd 0 0x27<br />
6 5 5 5 5 6<br />
Put the logical NOR of registers rs and rt into register rd.
NOT<br />
not rdest, rsrc<br />
pseudoinstruction<br />
Put the bitwise logical negation of register rsrc into register rdest.<br />
OR<br />
or rd, rs, rt<br />
0 rs rt rd 0 0x25<br />
6 5 5 5 5 6<br />
Put the logical OR of registers rs and rt into register rd.<br />
OR immediate<br />
ori rt, rs, imm<br />
0xd rs rt imm<br />
6 5 5 16<br />
Put the logical OR of register rs and the zero-extended immediate into register rt.<br />
Remainder<br />
rem rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Unsigned remainder<br />
remu rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Put the remainder of register rsrc1 divided by register rsrc2 into register rdest.<br />
Note that if an operand is negative, the remainder is unspecified by the MIPS<br />
architecture and depends on the convention of the machine on which SPIM is run.<br />
Shift left logical<br />
sll rd, rt, shamt<br />
0 rs rt rd shamt 0<br />
6 5 5 5 5 6<br />
Shift left logical variable<br />
sllv rd, rt, rs<br />
0 rs rt rd 0 4<br />
6 5 5 5 5 6
Shift right arithmetic<br />
sra rd, rt, shamt<br />
0 rs rt rd shamt 3<br />
6 5 5 5 5 6<br />
Shift right arithmetic variable<br />
srav rd, rt, rs<br />
0 rs rt rd 0 7<br />
6 5 5 5 5 6<br />
Shift right logical<br />
srl rd, rt, shamt<br />
0 rs rt rd shamt 2<br />
6 5 5 5 5 6<br />
Shift right logical variable<br />
srlv rd, rt, rs<br />
0 rs rt rd 0 6<br />
6 5 5 5 5 6<br />
Shift register rt left (right) by the distance indicated by immediate shamt or the<br />
register rs and put the result in register rd. Note that argument rs is ignored for<br />
sll, sra, and srl.<br />
Rotate left<br />
rol rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Rotate right<br />
ror rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Rotate register rsrc1 left (right) by the distance indicated by rsrc2 and put the<br />
result in register rdest.<br />
Subtract (with overflow)<br />
sub rd, rs, rt<br />
0 rs rt rd 0 0x22<br />
6 5 5 5 5 6
Subtract (without overflow)<br />
subu rd, rs, rt<br />
0 rs rt rd 0 0x23<br />
6 5 5 5 5 6<br />
Put the difference of registers rs and rt into register rd.<br />
Exclusive OR<br />
xor rd, rs, rt<br />
0 rs rt rd 0 0x26<br />
6 5 5 5 5 6<br />
Put the logical XOR of registers rs and rt into register rd.<br />
XOR immediate<br />
xori rt, rs, imm<br />
0xe rs rt imm<br />
6 5 5 16<br />
Put the logical XOR of register rs and the zero-extended immediate into register<br />
rt.<br />
Constant-Manipulating Instructions<br />
Load upper immediate<br />
lui rt, imm<br />
0xf 0 rt imm<br />
6 5 5 16<br />
Load the lower halfword of the immediate imm into the upper halfword of register<br />
rt. The lower bits of the register are set to 0.<br />
Load immediate<br />
li rdest, imm<br />
pseudoinstruction<br />
Move the immediate imm into register rdest.<br />
Comparison Instructions<br />
Set less than<br />
slt rd, rs, rt<br />
0 rs rt rd 0 0x2a<br />
6 5 5 5 5 6
Set less than unsigned<br />
sltu rd, rs, rt<br />
0 rs rt rd 0 0x2b<br />
6 5 5 5 5 6<br />
Set register rd to 1 if register rs is less than rt, and to 0 otherwise.<br />
Set less than immediate<br />
slti rt, rs, imm<br />
0xa rs rt imm<br />
6 5 5 16<br />
Set less than unsigned immediate<br />
sltiu rt, rs, imm<br />
0xb rs rt imm<br />
6 5 5 16<br />
Set register rt to 1 if register rs is less than the sign-extended immediate, and to<br />
0 otherwise.<br />
Set equal<br />
seq rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set register rdest to 1 if register rsrc1 equals rsrc2, and to 0 otherwise.<br />
Set greater than equal<br />
sge rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set greater than equal unsigned<br />
sgeu rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set register rdest to 1 if register rsrc1 is greater than or equal to rsrc2, and to<br />
0 otherwise.<br />
Set greater than<br />
sgt rdest, rsrc1, rsrc2<br />
pseudoinstruction
Set greater than unsigned<br />
sgtu rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set register rdest to 1 if register rsrc1 is greater than rsrc2, and to 0 otherwise.<br />
Set less than equal<br />
sle rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set less than equal unsigned<br />
sleu rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set register rdest to 1 if register rsrc1 is less than or equal to rsrc2, and to 0<br />
otherwise.<br />
Set not equal<br />
sne rdest, rsrc1, rsrc2<br />
pseudoinstruction<br />
Set register rdest to 1 if register rsrc1 is not equal to rsrc2, and to 0 otherwise.<br />
Branch Instructions<br />
Branch instructions use a signed 16-bit instruction offset field; hence, they can<br />
jump 2^15 − 1 instructions (not bytes) forward or 2^15 instructions backward. The<br />
jump instruction contains a 26-bit address field. In actual MIPS processors, branch<br />
instructions are delayed branches, which do not transfer control until the instruction<br />
following the branch (its “delay slot”) has executed (see Chapter 4). Delayed branches<br />
affect the offset calculation, since it must be computed relative to the address of the<br />
delay slot instruction (PC + 4), which is when the branch occurs. SPIM does not<br />
simulate this delay slot, unless the -bare or -delayed_branch flags are specified.<br />
In assembly code, offsets are not usually specified as numbers. Instead,<br />
instructions branch to a label, and the assembler computes the distance between<br />
the branch and the target instructions.<br />
In MIPS-32, all actual (not pseudo) conditional branch instructions have a<br />
“likely” variant (for example, beq’s likely variant is beql), which does not execute<br />
the instruction in the branch’s delay slot if the branch is not taken. Do not use
these instructions; they may be removed in subsequent versions of the architecture.<br />
SPIM implements these instructions, but they are not described further.<br />
Branch instruction<br />
b label<br />
pseudoinstruction<br />
Unconditionally branch to the instruction at the label.<br />
Branch coprocessor false<br />
bc1f cc label<br />
0x11 8 cc 0 Offset<br />
6 5 3 2 16<br />
Branch coprocessor true<br />
bc1t cc label<br />
0x11 8 cc 1 Offset<br />
6 5 3 2 16<br />
Conditionally branch the number of instructions specified by the offset if the<br />
floating-point coprocessor’s condition flag numbered cc is false (true). If cc is<br />
omitted from the instruction, condition code flag 0 is assumed.<br />
Branch on equal<br />
beq rs, rt, label<br />
4 rs rt Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs equals rt.<br />
Branch on greater than equal zero<br />
bgez rs, label<br />
1 rs 1 Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is greater than or equal to 0.
Branch on greater than equal zero <strong>and</strong> link<br />
bgezal rs, label<br />
1 rs 0x11 Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is greater than or equal to 0. Save the address of the next instruction in register<br />
31.<br />
Branch on greater than zero<br />
bgtz rs, label<br />
7 rs 0 Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is greater than 0.<br />
Branch on less than equal zero<br />
blez rs, label<br />
6 rs 0 Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is less than or equal to 0.<br />
Branch on less than zero and link<br />
bltzal rs, label<br />
1 rs 0x10 Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is less than 0. Save the address of the next instruction in register 31.<br />
Branch on less than zero<br />
bltz rs, label<br />
1 rs 0 Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is less than 0.
Branch on not equal<br />
bne rs, rt, label<br />
5 rs rt Offset<br />
6 5 5 16<br />
Conditionally branch the number of instructions specified by the offset if register<br />
rs is not equal to rt.<br />
Branch on equal zero<br />
beqz rsrc, label<br />
pseudoinstruction<br />
Conditionally branch to the instruction at the label if rsrc equals 0.<br />
Branch on greater than equal<br />
bge rsrc1, rsrc2, label<br />
pseudoinstruction<br />
Branch on greater than equal unsigned<br />
bgeu rsrc1, rsrc2, label<br />
pseudoinstruction<br />
Conditionally branch to the instruction at the label if register rsrc1 is greater than<br />
or equal to rsrc2.<br />
Branch on greater than<br />
bgt rsrc1, src2, label<br />
pseudoinstruction<br />
Branch on greater than unsigned<br />
bgtu rsrc1, src2, label<br />
pseudoinstruction<br />
Conditionally branch to the instruction at the label if register rsrc1 is greater than<br />
src2.<br />
Branch on less than equal<br />
ble rsrc1, src2, label<br />
pseudoinstruction
Branch on less than equal unsigned<br />
bleu rsrc1, src2, label<br />
pseudoinstruction<br />
Conditionally branch to the instruction at the label if register rsrc1 is less than or<br />
equal to src2.<br />
Branch on less than<br />
blt rsrc1, rsrc2, label<br />
pseudoinstruction<br />
Branch on less than unsigned<br />
bltu rsrc1, rsrc2, label<br />
pseudoinstruction<br />
Conditionally branch to the instruction at the label if register rsrc1 is less than<br />
rsrc2.<br />
Branch on not equal zero<br />
bnez rsrc, label<br />
pseudoinstruction<br />
Conditionally branch to the instruction at the label if register rsrc is not equal to 0.<br />
Jump Instructions<br />
Jump<br />
j target<br />
2 target<br />
6 26<br />
Unconditionally jump to the instruction at target.<br />
Jump <strong>and</strong> link<br />
jal target<br />
3 target<br />
6 26<br />
Unconditionally jump to the instruction at target. Save the address of the next<br />
instruction in register $ra.
Jump <strong>and</strong> link register<br />
jalr rs, rd<br />
0 rs 0 rd 0 9<br />
6 5 5 5 5 6<br />
Unconditionally jump to the instruction whose address is in register rs. Save the<br />
address of the next instruction in register rd (which defaults to 31).<br />
Jump register<br />
jr rs<br />
0 rs 0 8<br />
6 5 15 6<br />
Unconditionally jump to the instruction whose address is in register rs.<br />
Trap Instructions<br />
Trap if equal<br />
teq rs, rt<br />
0 rs rt 0 0x34<br />
6 5 5 10 6<br />
If register rs is equal to register rt, raise a Trap exception.<br />
Trap if equal immediate<br />
teqi rs, imm<br />
1 rs 0xc imm<br />
6 5 5 16<br />
If register rs is equal to the sign-extended value imm, raise a Trap exception.<br />
Trap if not equal<br />
tne rs, rt<br />
0 rs rt 0 0x36<br />
6 5 5 10 6<br />
If register rs is not equal to register rt, raise a Trap exception.<br />
Trap if not equal immediate<br />
tnei rs, imm<br />
1 rs 0xe imm<br />
6 5 5 16<br />
If register rs is not equal to the sign-extended value imm, raise a Trap exception.
Trap if greater equal<br />
tge rs, rt<br />
0 rs rt 0 0x30<br />
6 5 5 10 6<br />
Unsigned trap if greater equal<br />
tgeu rs, rt<br />
0 rs rt 0 0x31<br />
6 5 5 10 6<br />
If register rs is greater than or equal to register rt, raise a Trap exception.<br />
Trap if greater equal immediate<br />
tgei rs, imm<br />
1 rs 8 imm<br />
6 5 5 16<br />
Unsigned trap if greater equal immediate<br />
tgeiu rs, imm<br />
1 rs 9 imm<br />
6 5 5 16<br />
If register rs is greater than or equal to the sign-extended value imm, raise a Trap<br />
exception.<br />
Trap if less than<br />
tlt rs, rt<br />
0 rs rt 0 0x32<br />
6 5 5 10 6<br />
Unsigned trap if less than<br />
tltu rs, rt<br />
0 rs rt 0 0x33<br />
6 5 5 10 6<br />
If register rs is less than register rt, raise a Trap exception.<br />
Trap if less than immediate<br />
tlti rs, imm<br />
1 rs 0xa imm<br />
6 5 5 16
Unsigned trap if less than immediate<br />
tltiu rs, imm<br />
1 rs 0xb imm<br />
6 5 5 16<br />
If register rs is less than the sign-extended value imm, raise a Trap exception.<br />
Load Instructions<br />
Load address<br />
la rdest, address<br />
pseudoinstruction<br />
Load computed address—not the contents of the location—into register rdest.<br />
Load byte<br />
lb rt, address<br />
0x20 rs rt Offset<br />
6 5 5 16<br />
Load unsigned byte<br />
lbu rt, address<br />
0x24 rs rt Offset<br />
6 5 5 16<br />
Load the byte at address into register rt. The byte is sign-extended by lb, but not<br />
by lbu.<br />
Load halfword<br />
lh rt, address<br />
0x21 rs rt Offset<br />
6 5 5 16<br />
Load unsigned halfword<br />
lhu rt, address<br />
0x25 rs rt Offset<br />
6 5 5 16<br />
Load the 16-bit quantity (halfword) at address into register rt. The halfword is<br />
sign-extended by lh, but not by lhu.
Load word<br />
lw rt, address<br />
0x23 rs rt Offset<br />
6 5 5 16<br />
Load the 32-bit quantity (word) at address into register rt.<br />
Load word coprocessor 1<br />
lwc1 ft, address<br />
0x31 rs rt Offset<br />
6 5 5 16<br />
Load the word at address into register ft in the floating-point unit.<br />
Load word left<br />
lwl rt, address<br />
0x22 rs rt Offset<br />
6 5 5 16<br />
Load word right<br />
lwr rt, address<br />
0x26 rs rt Offset<br />
6 5 5 16<br />
Load the left (right) bytes from the word at the possibly unaligned address into<br />
register rt.<br />
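The effect of an lwl/lwr pair (and of the ulw pseudoinstruction below) can be sketched in C by assembling a word from individual byte loads. This illustrates only the semantics, not the encoding; the function name is ours, and little-endian byte order is assumed.<br />

```c
#include <stdint.h>

/* Assemble a 32-bit word from four byte loads at a possibly
   unaligned address, in little-endian order. This mirrors what a
   lwl/lwr pair accomplishes, not how the instructions are encoded. */
uint32_t unaligned_load_word(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```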
Load doubleword<br />
ld rdest, address<br />
pseudoinstruction<br />
Load the 64-bit quantity at address into registers rdest <strong>and</strong> rdest + 1.<br />
Unaligned load halfword<br />
ulh rdest, address<br />
pseudoinstruction
Unaligned load halfword unsigned<br />
ulhu rdest, address<br />
pseudoinstruction<br />
Load the 16-bit quantity (halfword) at the possibly unaligned address into register<br />
rdest. The halfword is sign-extended by ulh, but not ulhu.<br />
Unaligned load word<br />
ulw rdest, address<br />
pseudoinstruction<br />
Load the 32-bit quantity (word) at the possibly unaligned address into register<br />
rdest.<br />
Load linked<br />
ll rt, address<br />
0x30 rs rt Offset<br />
6 5 5 16<br />
Load the 32-bit quantity (word) at address into register rt <strong>and</strong> start an atomic<br />
read-modify-write operation. This operation is completed by a store conditional<br />
(sc) instruction, which will fail if another processor writes into the block containing<br />
the loaded word. Since SPIM does not simulate multiple processors, the store<br />
conditional operation always succeeds.<br />
Store Instructions<br />
Store byte<br />
sb rt, address<br />
0x28 rs rt Offset<br />
6 5 5 16<br />
Store the low byte from register rt at address.<br />
Store halfword<br />
sh rt, address<br />
0x29 rs rt Offset<br />
6 5 5 16<br />
Store the low halfword from register rt at address.
Store word<br />
sw rt, address<br />
0x2b rs rt Offset<br />
6 5 5 16<br />
Store the word from register rt at address.<br />
Store word coprocessor 1<br />
swc1 ft, address<br />
0x31 rs ft Offset<br />
6 5 5 16<br />
Store the floating-point value in register ft of floating-point coprocessor at address.<br />
Store double coprocessor 1<br />
sdc1 ft, address<br />
0x3d rs ft Offset<br />
6 5 5 16<br />
Store the doubleword floating-point value in registers ft and ft + 1 of floating-point<br />
coprocessor at address. Register ft must be even numbered.<br />
Store word left<br />
swl rt, address<br />
0x2a rs rt Offset<br />
6 5 5 16<br />
Store word right<br />
swr rt, address<br />
0x2e rs rt Offset<br />
6 5 5 16<br />
Store the left (right) bytes from register rt at the possibly unaligned address.<br />
Store doubleword<br />
sd rsrc, address<br />
pseudoinstruction<br />
Store the 64-bit quantity in registers rsrc <strong>and</strong> rsrc + 1 at address.
Unaligned store halfword<br />
ush rsrc, address<br />
pseudoinstruction<br />
Store the low halfword from register rsrc at the possibly unaligned address.<br />
Unaligned store word<br />
usw rsrc, address<br />
pseudoinstruction<br />
Store the word from register rsrc at the possibly unaligned address.<br />
Store conditional<br />
sc rt, address<br />
0x38 rs rt Offset<br />
6 5 5 16<br />
Store the 32-bit quantity (word) in register rt into memory at address and complete<br />
an atomic read-modify-write operation. If this atomic operation is successful, the<br />
memory word is modified and register rt is set to 1. If the atomic operation fails<br />
because another processor wrote to a location in the block containing the addressed<br />
word, this instruction does not modify memory and writes 0 into register rt. Since<br />
SPIM does not simulate multiple processors, the instruction always succeeds.<br />
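The ll/sc pair is the MIPS building block for lock-free read-modify-write sequences. The same retry pattern can be sketched in C11 with a compare-and-swap loop; this is an analogy for the semantics, not what SPIM executes, and the function name is ours.<br />

```c
#include <stdatomic.h>

/* Atomic increment written as a retry loop, analogous to a MIPS
   ll/sc sequence: read the old value, compute the new one, and
   retry if another thread intervened. (On SPIM, a uniprocessor
   simulator, the sc always succeeds, so the loop runs once.) */
int atomic_increment(atomic_int *counter) {
    int old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* old was reloaded by the failed exchange; try again */
    }
    return old + 1;
}
```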
Data Movement Instructions<br />
Move<br />
move rdest, rsrc<br />
pseudoinstruction<br />
Move register rsrc to rdest.<br />
Move from hi<br />
mfhi rd<br />
0 0 rd 0 0x10<br />
6 10 5 5 6
Move from lo<br />
mflo rd<br />
0 0 rd 0 0x12<br />
6 10 5 5 6<br />
The multiply <strong>and</strong> divide unit produces its result in two additional registers, hi<br />
<strong>and</strong> lo. These instructions move values to <strong>and</strong> from these registers. The multiply,<br />
divide, <strong>and</strong> remainder pseudoinstructions that make this unit appear to operate on<br />
the general registers move the result after the computation finishes.<br />
Move the hi (lo) register to register rd.<br />
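What lands in hi and lo can be reproduced in C: the 64-bit product of two 32-bit values splits into an upper word (hi) and a lower word (lo). A sketch of the semantics only; the helper name is ours.<br />

```c
#include <stdint.h>

/* Split the 64-bit product of two 32-bit signed values into the
   upper word (hi) and lower word (lo), as the MIPS mult
   instruction does. */
void mult_hi_lo(int32_t a, int32_t b, uint32_t *hi, uint32_t *lo) {
    int64_t product = (int64_t)a * (int64_t)b;
    *hi = (uint32_t)((uint64_t)product >> 32);
    *lo = (uint32_t)((uint64_t)product & 0xFFFFFFFFu);
}
```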
Move to hi<br />
mthi rs<br />
0 rs 0 0x11<br />
6 5 15 6<br />
Move to lo<br />
mtlo rs<br />
0 rs 0 0x13<br />
6 5 15 6<br />
Move register rs to the hi (lo) register.<br />
Move from coprocessor 0<br />
mfc0 rt, rd<br />
0x10 0 rt rd 0<br />
6 5 5 5 11<br />
Move from coprocessor 1<br />
mfc1 rt, fs<br />
0x11 0 rt fs 0<br />
6 5 5 5 11<br />
Coprocessors have their own register sets. These instructions move values between<br />
these registers <strong>and</strong> the CPU’s registers.<br />
Move register rd in a coprocessor (register fs in the FPU) to CPU register rt. The<br />
floating-point unit is coprocessor 1.
Move double from coprocessor 1<br />
mfc1.d rdest, frsrc1<br />
pseudoinstruction<br />
Move floating-point registers frsrc1 <strong>and</strong> frsrc1 + 1 to CPU registers rdest<br />
<strong>and</strong> rdest + 1.<br />
Move to coprocessor 0<br />
mtc0 rd, rt<br />
0x10 4 rt rd 0<br />
6 5 5 5 11<br />
Move to coprocessor 1<br />
mtc1 rd, fs<br />
0x11 4 rt fs 0<br />
6 5 5 5 11<br />
Move CPU register rt to register rd in a coprocessor (register fs in the FPU).<br />
Move conditional not zero<br />
movn rd, rs, rt<br />
0 rs rt rd 0xb<br />
6 5 5 5 11<br />
Move register rs to register rd if register rt is not 0.<br />
Move conditional zero<br />
movz rd, rs, rt<br />
0 rs rt rd 0xa<br />
6 5 5 5 11<br />
Move register rs to register rd if register rt is 0.<br />
Move conditional on FP false<br />
movf rd, rs, cc<br />
0 rs cc 0 rd 0 1<br />
6 5 3 2 5 5 6<br />
Move CPU register rs to register rd if FPU condition code flag number cc is 0. If<br />
cc is omitted from the instruction, condition code flag 0 is assumed.
Move conditional on FP true<br />
movt rd, rs, cc<br />
0 rs cc 1 rd 0 1<br />
6 5 3 2 5 5 6<br />
Move CPU register rs to register rd if FPU condition code flag number cc is 1. If<br />
cc is omitted from the instruction, condition code bit 0 is assumed.<br />
Floating-Point Instructions<br />
The MIPS has a floating-point coprocessor (numbered 1) that operates on single<br />
precision (32-bit) <strong>and</strong> double precision (64-bit) floating-point numbers. This<br />
coprocessor has its own registers, which are numbered $f0–$f31. Because these<br />
registers are only 32 bits wide, two of them are required to hold doubles, so only<br />
floating-point registers with even numbers can hold double precision values. The<br />
floating-point coprocessor also has eight condition code (cc) flags, numbered 0–7,<br />
which are set by compare instructions and tested by branch (bc1f or bc1t) and<br />
conditional move instructions.<br />
Values are moved in or out of these registers one word (32 bits) at a time by<br />
lwc1, swc1, mtc1, and mfc1 instructions or one double (64 bits) at a time by ldc1<br />
and sdc1, described above, or by the l.s, l.d, s.s, and s.d pseudoinstructions<br />
described below.<br />
In the actual instructions below, the second (format) field is 0x10 for single<br />
precision, 0x11 for double precision, and 0x14 for word (integer) operands. In the<br />
pseudoinstructions below, fdest is a floating-point<br />
register (e.g., $f2).<br />
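The even/odd register pairing follows from the fact that a 64-bit double occupies two 32-bit words. A C sketch of the split (our own helper; IEEE 754 doubles assumed):<br />

```c
#include <stdint.h>
#include <string.h>

/* Split a double into the two 32-bit words that an even/odd
   floating-point register pair would hold. memcpy reinterprets
   the bits; no numeric conversion takes place. */
void split_double(double d, uint32_t *lo_word, uint32_t *hi_word) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    *lo_word = (uint32_t)(bits & 0xFFFFFFFFu);
    *hi_word = (uint32_t)(bits >> 32);
}
```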
Floating-point absolute value double<br />
abs.d fd, fs<br />
0x11 0x11 0 fs fd 5<br />
6 5 5 5 5 6<br />
Floating-point absolute value single<br />
abs.s fd, fs<br />
0x11 0x10 0 fs fd 5<br />
Compute the absolute value of the floating-point double (single) in register fs <strong>and</strong><br />
put it in register fd.<br />
Floating-point addition double<br />
add.d fd, fs, ft<br />
0x11 0x11 ft fs fd 0<br />
6 5 5 5 5 6
Floating-point addition single<br />
add.s fd, fs, ft<br />
0x11 0x10 ft fs fd 0<br />
6 5 5 5 5 6<br />
Compute the sum of the floating-point doubles (singles) in registers fs <strong>and</strong> ft <strong>and</strong><br />
put it in register fd.<br />
Floating-point ceiling to word<br />
ceil.w.d fd, fs<br />
ceil.w.s fd, fs<br />
0x11 0x11 0 fs fd 0xe<br />
6 5 5 5 5 6<br />
0x11 0x10 0 fs fd 0xe<br />
Compute the ceiling of the floating-point double (single) in register fs, convert to<br />
a 32-bit fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />
Compare equal double<br />
c.eq.d cc fs, ft<br />
0x11 0x11 ft fs cc 0 FC 2<br />
6 5 5 5 3 2 2 4<br />
Compare equal single<br />
c.eq.s cc fs, ft<br />
0x11 0x10 ft fs cc 0 FC 2<br />
6 5 5 5 3 2 2 4<br />
Compare the floating-point double (single) in register fs against the one in ft<br />
<strong>and</strong> set the floating-point condition flag cc to 1 if they are equal. If cc is omitted,<br />
condition code flag 0 is assumed.<br />
Compare less than equal double<br />
c.le.d cc fs, ft<br />
0x11 0x11 ft fs cc 0 FC 0xe<br />
6 5 5 5 3 2 2 4<br />
Compare less than equal single<br />
c.le.s cc fs, ft<br />
0x11 0x10 ft fs cc 0 FC 0xe<br />
6 5 5 5 3 2 2 4
Compare the floating-point double (single) in register fs against the one in ft <strong>and</strong><br />
set the floating-point condition flag cc to 1 if the first is less than or equal to the<br />
second. If cc is omitted, condition code flag 0 is assumed.<br />
Compare less than double<br />
c.lt.d cc fs, ft<br />
0x11 0x11 ft fs cc 0 FC 0xc<br />
6 5 5 5 3 2 2 4<br />
Compare less than single<br />
c.lt.s cc fs, ft<br />
0x11 0x10 ft fs cc 0 FC 0xc<br />
6 5 5 5 3 2 2 4<br />
Compare the floating-point double (single) in register fs against the one in ft<br />
<strong>and</strong> set the condition flag cc to 1 if the first is less than the second. If cc is omitted,<br />
condition code flag 0 is assumed.<br />
Convert single to double<br />
cvt.d.s fd, fs<br />
0x11 0x10 0 fs fd 0x21<br />
6 5 5 5 5 6<br />
Convert integer to double<br />
cvt.d.w fd, fs<br />
0x11 0x14 0 fs fd 0x21<br />
6 5 5 5 5 6<br />
Convert the single precision floating-point number or integer in register fs to a<br />
double precision number and put it in register fd.<br />
Convert double to single<br />
cvt.s.d fd, fs<br />
0x11 0x11 0 fs fd 0x20<br />
6 5 5 5 5 6<br />
Convert integer to single<br />
cvt.s.w fd, fs<br />
0x11 0x14 0 fs fd 0x20<br />
6 5 5 5 5 6<br />
Convert the double precision floating-point number or integer in register fs to a<br />
single precision number <strong>and</strong> put it in register fd.
Convert double to integer<br />
cvt.w.d fd, fs<br />
0x11 0x11 0 fs fd 0x24<br />
6 5 5 5 5 6<br />
Convert single to integer<br />
cvt.w.s fd, fs<br />
0x11 0x10 0 fs fd 0x24<br />
6 5 5 5 5 6<br />
Convert the double or single precision floating-point number in register fs to an<br />
integer <strong>and</strong> put it in register fd.<br />
Floating-point divide double<br />
div.d fd, fs, ft<br />
0x11 0x11 ft fs fd 3<br />
6 5 5 5 5 6<br />
Floating-point divide single<br />
div.s fd, fs, ft<br />
0x11 0x10 ft fs fd 3<br />
6 5 5 5 5 6<br />
Compute the quotient of the floating-point doubles (singles) in registers fs <strong>and</strong> ft<br />
<strong>and</strong> put it in register fd.<br />
Floating-point floor to word<br />
floor.w.d fd, fs<br />
floor.w.s fd, fs<br />
0x11 0x11 0 fs fd 0xf<br />
6 5 5 5 5 6<br />
0x11 0x10 0 fs fd 0xf<br />
Compute the floor of the floating-point double (single) in register fs <strong>and</strong> put the<br />
resulting word in register fd.<br />
Load floating-point double<br />
l.d fdest, address<br />
pseudoinstruction
Load floating-point single<br />
l.s fdest, address<br />
pseudoinstruction<br />
Load the floating-point double (single) at address into register fdest.<br />
Move floating-point double<br />
mov.d fd, fs<br />
0x11 0x11 0 fs fd 6<br />
6 5 5 5 5 6<br />
Move floating-point single<br />
mov.s fd, fs<br />
0x11 0x10 0 fs fd 6<br />
6 5 5 5 5 6<br />
Move the floating-point double (single) from register fs to register fd.<br />
Move conditional floating-point double false<br />
movf.d fd, fs, cc<br />
0x11 0x11 cc 0 fs fd 0x11<br />
6 5 3 2 5 5 6<br />
Move conditional floating-point single false<br />
movf.s fd, fs, cc<br />
0x11 0x10 cc 0 fs fd 0x11<br />
6 5 3 2 5 5 6<br />
Move the floating-point double (single) from register fs to register fd if condition<br />
code flag cc is 0. If cc is omitted, condition code flag 0 is assumed.<br />
Move conditional floating-point double true<br />
movt.d fd, fs, cc<br />
0x11 0x11 cc 1 fs fd 0x11<br />
6 5 3 2 5 5 6<br />
Move conditional floating-point single true<br />
movt.s fd, fs, cc<br />
0x11 0x10 cc 1 fs fd 0x11<br />
6 5 3 2 5 5 6
Move the floating-point double (single) from register fs to register fd if condition<br />
code flag cc is 1. If cc is omitted, condition code flag 0 is assumed.<br />
Move conditional floating-point double not zero<br />
movn.d fd, fs, rt<br />
0x11 0x11 rt fs fd 0x13<br />
6 5 5 5 5 6<br />
Move conditional floating-point single not zero<br />
movn.s fd, fs, rt<br />
0x11 0x10 rt fs fd 0x13<br />
6 5 5 5 5 6<br />
Move the floating-point double (single) from register fs to register fd if processor<br />
register rt is not 0.<br />
Move conditional floating-point double zero<br />
movz.d fd, fs, rt<br />
0x11 0x11 rt fs fd 0x12<br />
6 5 5 5 5 6<br />
Move conditional floating-point single zero<br />
movz.s fd, fs, rt<br />
0x11 0x10 rt fs fd 0x12<br />
6 5 5 5 5 6<br />
Move the floating-point double (single) from register fs to register fd if processor<br />
register rt is 0.<br />
Floating-point multiply double<br />
mul.d fd, fs, ft<br />
0x11 0x11 ft fs fd 2<br />
6 5 5 5 5 6<br />
Floating-point multiply single<br />
mul.s fd, fs, ft<br />
0x11 0x10 ft fs fd 2<br />
6 5 5 5 5 6<br />
Compute the product of the floating-point doubles (singles) in registers fs <strong>and</strong> ft<br />
<strong>and</strong> put it in register fd.<br />
Negate double<br />
neg.d fd, fs<br />
0x11 0x11 0 fs fd 7<br />
6 5 5 5 5 6
Negate single<br />
neg.s fd, fs<br />
0x11 0x10 0 fs fd 7<br />
6 5 5 5 5 6<br />
Negate the floating-point double (single) in register fs <strong>and</strong> put it in register fd.<br />
Floating-point round to word<br />
round.w.d fd, fs<br />
0x11 0x11 0 fs fd 0xc<br />
6 5 5 5 5 6<br />
round.w.s fd, fs<br />
0x11 0x10 0 fs fd 0xc<br />
Round the floating-point double (single) value in register fs, convert to a 32-bit<br />
fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />
Square root double<br />
sqrt.d fd, fs<br />
0x11 0x11 0 fs fd 4<br />
6 5 5 5 5 6<br />
Square root single<br />
sqrt.s fd, fs<br />
0x11 0x10 0 fs fd 4<br />
6 5 5 5 5 6<br />
Compute the square root of the floating-point double (single) in register fs <strong>and</strong><br />
put it in register fd.<br />
Store floating-point double<br />
s.d fdest, address<br />
pseudoinstruction<br />
Store floating-point single<br />
s.s fdest, address<br />
pseudoinstruction<br />
Store the floating-point double (single) in register fdest at address.<br />
Floating-point subtract double<br />
sub.d fd, fs, ft<br />
0x11 0x11 ft fs fd 1<br />
6 5 5 5 5 6
Floating-point subtract single<br />
sub.s fd, fs, ft<br />
0x11 0x10 ft fs fd 1<br />
6 5 5 5 5 6<br />
Compute the difference of the floating-point doubles (singles) in registers fs <strong>and</strong><br />
ft <strong>and</strong> put it in register fd.<br />
Floating-point truncate to word<br />
trunc.w.d fd, fs<br />
0x11 0x11 0 fs fd 0xd<br />
6 5 5 5 5 6<br />
trunc.w.s fd, fs<br />
0x11 0x10 0 fs fd 0xd<br />
Truncate the floating-point double (single) value in register fs, convert to a 32-bit<br />
fixed-point value, <strong>and</strong> put the resulting word in register fd.<br />
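The ceil.w, floor.w, round.w, and trunc.w families differ only in how they round before converting. The four behaviors can be sketched in plain C (our own helper names; values assumed well inside the 32-bit range):<br />

```c
/* The four MIPS float-to-word conversion flavors:
   trunc.w rounds toward zero, floor.w toward minus infinity,
   ceil.w toward plus infinity, and round.w to nearest, with
   ties going to the even integer. */
int trunc_w(double x) { return (int)x; }                   /* toward zero */
int floor_w(double x) { int t = (int)x; return (x < t) ? t - 1 : t; }
int ceil_w(double x)  { int t = (int)x; return (x > t) ? t + 1 : t; }
int round_w(double x) {
    int f = floor_w(x);
    double frac = x - f;
    if (frac > 0.5) return f + 1;
    if (frac < 0.5) return f;
    return (f % 2 == 0) ? f : f + 1;                       /* tie: pick the even one */
}
```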
Exception <strong>and</strong> Interrupt Instructions<br />
Exception return<br />
eret<br />
0x10 1 0 0x18<br />
6 1 19 6<br />
Set the EXL bit in coprocessor 0’s Status register to 0 <strong>and</strong> return to the instruction<br />
pointed to by coprocessor 0’s EPC register.<br />
System call<br />
syscall<br />
0 0 0xc<br />
6 20 6<br />
Register $v0 contains the number of the system call (see Figure A.9.1) provided<br />
by SPIM.<br />
Break<br />
break code<br />
0 code 0xd<br />
6 20 6<br />
Cause exception code. Exception 1 is reserved for the debugger.<br />
No operation<br />
nop<br />
0 0 0 0 0 0<br />
6 5 5 5 5 6<br />
Do nothing.
A.11<br />
Concluding Remarks<br />
Programming in assembly language requires a programmer to trade helpful features<br />
of high-level languages—such as data structures, type checking, <strong>and</strong> control<br />
constructs—for complete control over the instructions that a computer executes.<br />
External constraints on some applications, such as response time or program size,<br />
require a programmer to pay close attention to every instruction. However, the<br />
cost of this level of attention is assembly language programs that are longer, more<br />
time-consuming to write, <strong>and</strong> more difficult to maintain than high-level language<br />
programs.<br />
Moreover, three trends are reducing the need to write programs in assembly<br />
language. The first trend is toward the improvement of compilers. Modern compilers<br />
produce code that is typically comparable to the best handwritten code—<br />
<strong>and</strong> is sometimes better. The second trend is the introduction of new processors<br />
that are not only faster, but in the case of processors that execute multiple instructions<br />
simultaneously, also more difficult to program by h<strong>and</strong>. In addition, the rapid<br />
evolution of the modern computer favors high-level language programs that are<br />
not tied to a single architecture. Finally, we witness a trend toward increasingly<br />
complex applications, characterized by complex graphic interfaces <strong>and</strong> many more<br />
features than their predecessors had. Large applications are written by teams of<br />
programmers and require the modularity and semantic checking features provided<br />
by high-level languages.<br />
Further Reading<br />
Aho, A., R. Sethi, <strong>and</strong> J. Ullman [1985]. Compilers: Principles, Techniques, <strong>and</strong> Tools, Reading, MA: Addison-<br />
Wesley.<br />
Slightly dated and lacking in coverage of modern architectures, but still the standard reference on compilers.<br />
Sweetman, D. [1999]. See MIPS Run, San Francisco, CA: Morgan Kaufmann Publishers.<br />
A complete, detailed, <strong>and</strong> engaging introduction to the MIPS instruction set <strong>and</strong> assembly language programming<br />
on these machines.<br />
Detailed documentation on the MIPS-32 architecture is available on the Web:<br />
MIPS32 Architecture for Programmers Volume I: Introduction to the MIPS32 Architecture<br />
(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />
ArchitectureProgrammingPublicationsforMIPS32/MD00082-2B-MIPS32INT-AFP-02.00.pdf/<br />
getDownload)<br />
MIPS32 Architecture for Programmers Volume II: The MIPS32 Instruction Set<br />
(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />
ArchitectureProgrammingPublicationsforMIPS32/MD00086-2B-MIPS32BIS-AFP-02.00.pdf/getDownload)<br />
MIPS32 Architecture for Programmers Volume III: The MIPS32 Privileged Resource Architecture<br />
(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/<br />
ArchitectureProgrammingPublicationsforMIPS32/MD00090-2B-MIPS32PRA-AFP-02.00.pdf/getDownload)
A.12 Exercises A-83<br />
A.10 [10] Using SPIM, write and test a recursive program for solving<br />
the classic mathematical recreation, the Towers of Hanoi puzzle. (This will require<br />
the use of stack frames to support recursion.) The puzzle consists of three pegs<br />
(1, 2, <strong>and</strong> 3) <strong>and</strong> n disks (the number n can vary; typical values might be in the<br />
range from 1 to 8). Disk 1 is smaller than disk 2, which is in turn smaller than disk<br />
3, <strong>and</strong> so forth, with disk n being the largest. Initially, all the disks are on peg 1,<br />
starting with disk n on the bottom, disk n − 1 on top of that, <strong>and</strong> so forth, up to<br />
disk 1 on the top. The goal is to move all the disks to peg 2. You may only move one<br />
disk at a time, that is, the top disk from any of the three pegs onto the top of either<br />
of the other two pegs. Moreover, there is a constraint: You must not place a larger<br />
disk on top of a smaller disk.<br />
The C program below can be used to help write your assembly language program.<br />
/* move n smallest disks from start to finish using<br />
extra */<br />
void hanoi(int n, int start, int finish, int extra){<br />
if(n != 0){<br />
hanoi(n-1, start, extra, finish);<br />
print_string("Move disk ");<br />
print_int(n);<br />
print_string(" from peg ");<br />
print_int(start);<br />
print_string(" to peg ");<br />
print_int(finish);<br />
print_string(".\n");<br />
hanoi(n-1, extra, finish, start);<br />
}<br />
}<br />
main(){<br />
int n;<br />
print_string("Enter number of disks> ");<br />
n = read_int();<br />
hanoi(n, 1, 2, 3);<br />
return 0;<br />
}
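A handy sanity check when testing the assembly version: moving n disks takes exactly 2^n − 1 moves, since each call makes two recursive calls plus one move. A counting variant of the same recursion (our own helper, not part of the exercise):<br />

```c
/* Count the moves hanoi would make without printing them.
   The recurrence moves(n) = 2*moves(n-1) + 1 solves to 2^n - 1. */
long hanoi_moves(int n) {
    if (n == 0) return 0;
    return 2 * hanoi_moves(n - 1) + 1;
}
```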
B<br />
A P P E N D I X<br />
I always loved that<br />
word, Boolean.<br />
Claude Shannon<br />
IEEE Spectrum, April 1992<br />
(Shannon’s master’s thesis showed<br />
that the algebra invented by George<br />
Boole in the 1800s could represent the<br />
workings of electrical switches.)<br />
The Basics of Logic<br />
<strong>Design</strong><br />
B.1 Introduction B-3<br />
B.2 Gates, Truth Tables, <strong>and</strong> Logic<br />
Equations B-4<br />
B.3 Combinational Logic B-9<br />
B.4 Using a Hardware Description<br />
Language B-20<br />
B.5 Constructing a Basic Arithmetic Logic<br />
Unit B-26<br />
B.6 Faster Addition: Carry Lookahead B-38<br />
B.7 Clocks B-48<br />
<strong>Computer</strong> Organization <strong>and</strong> <strong>Design</strong>. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1<br />
© 2013 Elsevier Inc. All rights reserved.
Boolean Algebra<br />
Another approach is to express the logic function with logic equations. This<br />
is done with the use of Boolean algebra (named after Boole, a 19th-century<br />
mathematician). In Boolean algebra, all the variables have the values 0 or 1 <strong>and</strong>, in<br />
typical formulations, there are three operators:<br />
■ The OR operator is written as +, as in A + B. The result of an OR operator is<br />
1 if either of the variables is 1. The OR operation is also called a logical sum,<br />
since its result is 1 if either operand is 1.<br />
■ The AND operator is written as ·, as in A · B. The result of an AND operator<br />
is 1 only if both inputs are 1. The AND operator is also called logical product,<br />
since its result is 1 only if both operands are 1.<br />
■ The unary operator NOT is written as A̅. The result of a NOT operator is 1 only if<br />
the input is 0. Applying the operator NOT to a logical value results in an inversion<br />
or negation of the value (i.e., if the input is 0 the output is 1, and vice versa).<br />
There are several laws of Boolean algebra that are helpful in manipulating logic<br />
equations.<br />
■ Identity law: A + 0 = A and A · 1 = A<br />
■ Zero and One laws: A + 1 = 1 and A · 0 = 0<br />
■ Inverse laws: A + A̅ = 1 and A · A̅ = 0<br />
■ Commutative laws: A + B = B + A and A · B = B · A<br />
■ Associative laws: A + (B + C) = (A + B) + C and A · (B · C) = (A · B) · C<br />
■ Distributive laws: A · (B + C) = (A · B) + (A · C) and<br />
A + (B · C) = (A + B) · (A + C)<br />
In addition, there are two other useful theorems, called DeMorgan’s laws, that are<br />
discussed in more depth in the exercises.<br />
Any set of logic functions can be written as a series of equations with an output<br />
on the left-h<strong>and</strong> side of each equation <strong>and</strong> a formula consisting of variables <strong>and</strong> the<br />
three operators above on the right-h<strong>and</strong> side.
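Because each variable takes only the values 0 and 1, any of these laws can be verified exhaustively. A small C check of the distributive laws and DeMorgan's laws (our own verification sketch, using &, |, and ! for AND, OR, and NOT):<br />

```c
/* Exhaustively verify the distributive laws and DeMorgan's laws
   for one-bit variables by trying all eight input combinations.
   Returns 1 if every law holds, 0 otherwise. */
int laws_hold(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                if ((a & (b | c)) != ((a & b) | (a & c))) return 0;
                if ((a | (b & c)) != ((a | b) & (a | c))) return 0;
                if (!(a & b) != (!a | !b)) return 0;   /* DeMorgan */
                if (!(a | b) != (!a & !b)) return 0;   /* DeMorgan */
            }
    return 1;
}
```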
Logic Equations<br />
Show the logic equations for the logic functions, D, E, <strong>and</strong> F, described in the<br />
previous example.<br />
EXAMPLE<br />
ANSWER<br />
Here’s the equation for D:<br />
D = A + B + C<br />
F is equally simple:<br />
F = A · B · C<br />
E is a little tricky. Think of it in two parts: what must be true for E to be true<br />
(two of the three inputs must be true), and what cannot be true (all three<br />
cannot be true). Thus we can write E as<br />
E = ((A · B) + (A · C) + (B · C)) · (A · B · C)‾<br />
We can also derive E by realizing that E is true only if exactly two of the inputs<br />
are true. Then we can write E as an OR of the three possible terms that have<br />
two true inputs and one false input:<br />
E = (A · B · C̅) + (A · C · B̅) + (B · C · A̅)<br />
Proving that these two expressions are equivalent is explored in the exercises.<br />
In Verilog, we describe combinational logic whenever possible using the assign<br />
statement, which is described beginning on page B-23. We can write a definition<br />
for E using the Verilog exclusive-OR operator as assign E = ~(A ^ B ^ C) &<br />
(A | B | C) & ~(A & B & C), which is yet another way to describe this function.<br />
D and F have even simpler representations, which are just like the corresponding C<br />
code: D = A | B | C and F = A & B & C.
B-8 Appendix B The Basics of Logic Design<br />
gate A device that<br />
implements basic logic<br />
functions, such as AND<br />
or OR.<br />
NOR gate An inverted<br />
OR gate.<br />
NAND gate An inverted<br />
AND gate.<br />
Check<br />
Yourself<br />
Gates<br />
Logic blocks are built from gates that implement basic logic functions. For example,<br />
an AND gate implements the AND function, and an OR gate implements the OR<br />
function. Since both AND and OR are commutative and associative, an AND or an<br />
OR gate can have multiple inputs, with the output equal to the AND or OR of all<br />
the inputs. The logical function NOT is implemented with an inverter that always<br />
has a single input. The standard representation of these three logic building blocks<br />
is shown in Figure B.2.1.<br />
Rather than draw inverters explicitly, a common practice is to add “bubbles”<br />
to the inputs or outputs of a gate to cause the logic value on that input line or<br />
output line to be inverted. For example, Figure B.2.2 shows the logic diagram for<br />
the function (A̅ + B)‾, using explicit inverters on the left and bubbled inputs and<br />
outputs on the right.<br />
Any logical function can be constructed using AND gates, OR gates, and<br />
inversion; several of the exercises give you the opportunity to try implementing<br />
some common logic functions with gates. In the next section, we’ll see how an<br />
implementation of any logic function can be constructed using this knowledge.<br />
In fact, all logic functions can be constructed with only a single gate type, if that<br />
gate is inverting. The two common inverting gates are called NOR and NAND and<br />
correspond to inverted OR and AND gates, respectively. NOR and NAND gates are<br />
called universal, since any logic function can be built using this one gate type. The<br />
exercises explore this concept further.<br />
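As an illustration of that universality (a sketch of ours, not code from the text), each of the three basic functions can be expressed with NAND alone:<br />

```verilog
// Hypothetical demonstration that NAND alone suffices:
// NOT, AND, and OR built only from two-input NANDs.
module nand_only (input A, B,
                  output notA, andAB, orAB);
  assign notA  = ~(A & A);                 // NAND(A, A) = NOT A
  assign andAB = ~(~(A & B) & ~(A & B));   // NAND(A, B) NANDed with itself
  assign orAB  = ~(~(A & A) & ~(B & B));   // NAND of NOT A and NOT B
endmodule
```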
Are the following two logical expressions equivalent? If not, find a setting of the<br />
variables to show they are not:<br />
■ (A · B · C̅) + (A · C · B̅) + (B · C · A̅)<br />
■ B · (A · C̅ + C · A̅)<br />
FIGURE B.2.1 Standard drawing for an AND gate, OR gate, and an inverter, shown from<br />
left to right. The signals to the left of each symbol are the inputs, while the output appears on the right. The<br />
AND and OR gates both have two inputs. Inverters have a single input.<br />
FIGURE B.2.2 Logic gate implementation of (A̅ + B)‾ using explicit inverters on the left and<br />
bubbled inputs and outputs on the right. This logic function can be simplified to A · B̅,<br />
or in Verilog,<br />
A & ~B.
B.3 Combinational Logic B-9<br />
B.3 Combinational Logic<br />
In this section, we look at a couple of larger logic building blocks that we use<br />
heavily, and we discuss the design of structured logic that can be automatically<br />
implemented from a logic equation or truth table by a translation program. Last,<br />
we discuss the notion of an array of logic blocks.<br />
Decoders<br />
One logic block that we will use in building larger components is a decoder. The<br />
most common type of decoder has an n-bit input and 2ⁿ outputs, where only one<br />
output is asserted for each input combination. This decoder translates the n-bit<br />
input into a signal that corresponds to the binary value of the n-bit input. The<br />
outputs are thus usually numbered, say, Out0, Out1, …, Out(2ⁿ − 1). If the value of<br />
the input is i, then Outi will be true and all other outputs will be false. Figure B.3.1<br />
shows a 3-bit decoder and the truth table. This decoder is called a 3-to-8 decoder<br />
since there are 3 inputs and 8 (2³) outputs. There is also a logic element called<br />
an encoder that performs the inverse function of a decoder, taking 2ⁿ inputs and<br />
producing an n-bit output.<br />
decoder A logic block<br />
that has an n-bit input<br />
and 2ⁿ outputs, where<br />
only one output is<br />
asserted for each input<br />
combination.<br />
[Figure B.3.1a artwork: a 3-bit-wide input feeding a decoder block with outputs Out0 through Out7.]<br />
Inputs — Outputs<br />
I2 I1 I0   Out7 Out6 Out5 Out4 Out3 Out2 Out1 Out0<br />
0  0  0      0    0    0    0    0    0    0    1<br />
0  0  1      0    0    0    0    0    0    1    0<br />
0  1  0      0    0    0    0    0    1    0    0<br />
0  1  1      0    0    0    0    1    0    0    0<br />
1  0  0      0    0    0    1    0    0    0    0<br />
1  0  1      0    0    1    0    0    0    0    0<br />
1  1  0      0    1    0    0    0    0    0    0<br />
1  1  1      1    0    0    0    0    0    0    0<br />
a. A 3-bit decoder<br />
b. The truth table for a 3-bit decoder<br />
FIGURE B.3.1 A 3-bit decoder has 3 inputs, called I2, I1, and I0, and 2³ = 8 outputs, called Out0 to Out7. Only the<br />
output corresponding to the binary value of the input is true, as shown in the truth table. The label 3 on the input to the decoder says that the<br />
input signal is 3 bits wide.
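A decoder like the one in Figure B.3.1 can be written compactly in Verilog; the following is our own sketch, not code from the text:<br />

```verilog
// Hypothetical 3-to-8 decoder: exactly one output bit is asserted,
// the one whose index equals the binary value of the input.
module decoder3to8 (input  [2:0] in,
                    output [7:0] out);
  assign out = 8'b0000_0001 << in;  // shift the single 1 into position `in`
endmodule
```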
B-10 Appendix B The Basics of Logic Design<br />
[Figure B.3.2 artwork: on the left, a multiplexor block with data inputs A (labeled 0) and B (labeled 1), select input S, and output C; on the right, its gate-level implementation.]<br />
FIGURE B.3.2 A two-input multiplexor on the left and its implementation with gates on<br />
the right. The multiplexor has two data inputs (A and B), which are labeled 0 and 1, and one selector input<br />
(S), as well as an output C. Implementing multiplexors in Verilog requires a little more work, especially when<br />
they are wider than two inputs. We show how to do this beginning on page B-23.<br />
selector value Also<br />
called control value. The<br />
control signal that is used<br />
to select one of the input<br />
values of a multiplexor<br />
as the output of the<br />
multiplexor.<br />
Multiplexors<br />
One basic logic function that we use quite often in Chapter 4 is the multiplexor.<br />
A multiplexor might more properly be called a selector, since its output is one of<br />
the inputs that is selected by a control. Consider the two-input multiplexor. The<br />
left side of Figure B.3.2 shows this multiplexor has three inputs: two data values<br />
and a selector (or control) value. The selector value determines which of the<br />
inputs becomes the output. We can represent the logic function computed by a<br />
two-input multiplexor, shown in gate form on the right side of Figure B.3.2, as<br />
C = (A · S̅) + (B · S).<br />
Multiplexors can be created with an arbitrary number of data inputs. When<br />
there are only two inputs, the selector is a single signal that selects one of the inputs<br />
if it is true (1) and the other if it is false (0). If there are n data inputs, there will<br />
need to be ⌈log₂ n⌉ selector inputs. In this case, the multiplexor basically consists<br />
of three parts:<br />
1. A decoder that generates n signals, each indicating a different input value<br />
2. An array of n AND gates, each combining one of the inputs with a signal<br />
from the decoder<br />
3. A single large OR gate that incorporates the outputs of the AND gates<br />
To associate the inputs with selector values, we often label the data inputs numerically<br />
(i.e., 0, 1, 2, 3, …, n − 1) and interpret the data selector inputs as a binary number.<br />
Sometimes, we make use of a multiplexor with undecoded selector signals.<br />
Multiplexors are easily represented combinationally in Verilog by using if<br />
expressions. For larger multiplexors, case statements are more convenient, but care<br />
must be taken to synthesize combinational logic.
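As a sketch of the case-statement style mentioned above (our own code, with hypothetical names):<br />

```verilog
// Hypothetical 4-to-1 multiplexor written with an always block.
// The @(*) sensitivity list and the default case keep the logic combinational.
module mux4to1 (input  [1:0] sel,
                input  in0, in1, in2, in3,
                output reg out);
  always @(*) begin
    case (sel)
      2'b00:   out = in0;
      2'b01:   out = in1;
      2'b10:   out = in2;
      default: out = in3;  // covers 2'b11, so no latch is inferred
    endcase
  end
endmodule
```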
B.3 Combinational Logic B-11<br />
Two-Level Logic <strong>and</strong> PLAs<br />
As pointed out in the previous section, any logic function can be implemented with<br />
only AND, OR, and NOT functions. In fact, a much stronger result is true. Any logic<br />
function can be written in a canonical form, where every input is either a true or<br />
complemented variable and there are only two levels of gates—one being AND and<br />
the other OR—with a possible inversion on the final output. Such a representation<br />
is called a two-level representation, and there are two forms, called sum of products<br />
and product of sums. A sum-of-products representation is a logical sum (OR) of<br />
products (terms using the AND operator); a product of sums is just the opposite.<br />
In our earlier example, we had two equations for the output E:<br />
E = ((A · B) + (A · C) + (B · C)) · (A · B · C)‾<br />
and<br />
sum of products A form<br />
of logical representation<br />
that employs a logical sum<br />
(OR) of products (terms<br />
joined using the AND<br />
operator).<br />
E = (A · B · C̅) + (A · C · B̅) + (B · C · A̅)<br />
This second equation is in a sum-of-products form: it has two levels of logic and the<br />
only inversions are on individual variables. The first equation has three levels of logic.<br />
Elaboration: We can also write E as a product of sums:<br />
E̅ = (A̅ + B̅ + C) · (A̅ + C̅ + B) · (B̅ + C̅ + A)<br />
To derive this form, you need to use DeMorgan’s theorems, which are discussed in the<br />
exercises.<br />
In this text, we use the sum-of-products form. It is easy to see that any logic<br />
function can be represented as a sum of products by constructing such a<br />
representation from the truth table for the function. Each truth table entry for<br />
which the function is true corresponds to a product term. The product term<br />
consists of a logical product of all the inputs or the complements of the inputs,<br />
depending on whether the entry in the truth table has a 0 or 1 corresponding to<br />
this variable. The logic function is the logical sum of the product terms where the<br />
function is true. This is more easily seen with an example.
B-12 Appendix B The Basics of Logic Design<br />
Sum of Products<br />
EXAMPLE<br />
Show the sum-of-products representation for the following truth table for D.<br />
Inputs<br />
Outputs<br />
A B C D<br />
0 0 0 0<br />
0 0 1 1<br />
0 1 0 1<br />
0 1 1 0<br />
1 0 0 1<br />
1 0 1 0<br />
1 1 0 0<br />
1 1 1 1<br />
ANSWER<br />
There are four product terms, since the function is true (1) for four different<br />
input combinations. These are:<br />
A̅ · B̅ · C<br />
A̅ · B · C̅<br />
programmable logic<br />
array (PLA)<br />
A structured-logic<br />
element composed<br />
of a set of inputs and<br />
corresponding input<br />
complements and two<br />
stages of logic: the first<br />
generates product terms<br />
of the inputs and input<br />
complements, and the<br />
second generates sum<br />
terms of the product<br />
terms. Hence, PLAs<br />
implement logic functions<br />
as a sum of products.<br />
minterms Also called<br />
product terms. A set<br />
of logic inputs joined<br />
by conjunction (AND<br />
operations); the product<br />
terms form the first logic<br />
stage of the programmable<br />
logic array (PLA).<br />
A · B̅ · C̅<br />
A · B · C<br />
Thus, we can write the function for D as the sum of these terms:<br />
D = (A̅ · B̅ · C) + (A̅ · B · C̅) + (A · B̅ · C̅) + (A · B · C)<br />
Note that only those truth table entries for which the function is true generate<br />
terms in the equation.<br />
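The sum of products just derived maps directly onto a single continuous assignment; this sketch is ours, not the book's:<br />

```verilog
// Hypothetical module: D built term by term from the truth table's
// four true rows (001, 010, 100, 111).
module sop_d (input A, B, C,
              output D);
  assign D = (~A & ~B &  C) |
             (~A &  B & ~C) |
             ( A & ~B & ~C) |
             ( A &  B &  C);
endmodule
```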
We can use this relationship between a truth table <strong>and</strong> a two-level representation<br />
to generate a gate-level implementation of any set of logic functions. A set of logic<br />
functions corresponds to a truth table with multiple output columns, as we saw in<br />
the example on page B-5. Each output column represents a different logic function,<br />
which may be directly constructed from the truth table.<br />
The sum-of-products representation corresponds to a common structured-logic<br />
implementation called a programmable logic array (PLA). A PLA has a set of<br />
inputs <strong>and</strong> corresponding input complements (which can be implemented with a<br />
set of inverters), and two stages of logic. The first stage is an array of AND gates that<br />
form a set of product terms (sometimes called minterms); each product term can<br />
consist of any of the inputs or their complements. The second stage is an array of<br />
OR gates, each of which forms a logical sum of any number of the product terms.<br />
Figure B.3.3 shows the basic form of a PLA.
B.3 Combinational Logic B-13<br />
[Figure B.3.3 artwork: the inputs feed an array of AND gates whose product-term outputs feed an array of OR gates that drive the outputs.]<br />
FIGURE B.3.3 The basic form of a PLA consists of an array of AND gates followed by an<br />
array of OR gates. Each entry in the AND gate array is a product term consisting of any number of inputs or<br />
inverted inputs. Each entry in the OR gate array is a sum term consisting of any number of these product terms.<br />
A PLA can directly implement the truth table of a set of logic functions with<br />
multiple inputs <strong>and</strong> outputs. Since each entry where the output is true requires<br />
a product term, there will be a corresponding row in the PLA. Each output<br />
corresponds to a potential row of OR gates in the second stage. The number of OR<br />
gates corresponds to the number of truth table entries for which the output is true.<br />
The total size of a PLA, such as that shown in Figure B.3.3, is equal to the sum of the<br />
size of the AND gate array (called the AND plane) and the size of the OR gate array<br />
(called the OR plane). Looking at Figure B.3.3, we can see that the size of the AND<br />
gate array is equal to the number of inputs times the number of different product<br />
terms, and the size of the OR gate array is the number of outputs times the number<br />
of product terms.<br />
A PLA has two characteristics that help make it an efficient way to implement a<br />
set of logic functions. First, only the truth table entries that produce a true value for<br />
at least one output have any logic gates associated with them. Second, each different<br />
product term will have only one entry in the PLA, even if the product term is used<br />
in multiple outputs. Let’s look at an example.<br />
PLAs<br />
Consider the set of logic functions defined in the example on page B-5. Show<br />
a PLA implementation of this example for D, E, and F.<br />
EXAMPLE
B-14 Appendix B The Basics of Logic Design<br />
ANSWER<br />
Here is the truth table we constructed earlier:<br />
Inputs<br />
Outputs<br />
A B C D E F<br />
0 0 0 0 0 0<br />
0 0 1 1 0 0<br />
0 1 0 1 0 0<br />
0 1 1 1 1 0<br />
1 0 0 1 0 0<br />
1 0 1 1 1 0<br />
1 1 0 1 1 0<br />
1 1 1 1 0 1<br />
Since there are seven unique product terms with at least one true value in the<br />
output section, there will be seven columns in the AND plane. The number of<br />
rows in the AND plane is three (since there are three inputs), and there are also<br />
three rows in the OR plane (since there are three outputs). Figure B.3.4 shows<br />
the resulting PLA, with the product terms corresponding to the truth table<br />
entries from top to bottom.<br />
read-only memory<br />
(ROM) A memory<br />
whose contents are<br />
designated at creation<br />
time, after which the<br />
contents can only be read.<br />
ROM is used as structured<br />
logic to implement a<br />
set of logic functions by<br />
using the terms in the<br />
logic functions as address<br />
inputs <strong>and</strong> the outputs as<br />
bits in each word of the<br />
memory.<br />
programmable ROM<br />
(PROM) A form of<br />
read-only memory that<br />
can be programmed<br />
when a designer knows its<br />
contents.<br />
Rather than drawing all the gates, as we do in Figure B.3.4, designers often show<br />
just the position of AND gates and OR gates. Dots are used on the intersection of a<br />
product term signal line <strong>and</strong> an input line or an output line when a corresponding<br />
AND gate or OR gate is required. Figure B.3.5 shows how the PLA of Figure B.3.4<br />
would look when drawn in this way. The contents of a PLA are fixed when the PLA<br />
is created, although there are also forms of PLA-like structures, called PALs, that<br />
can be programmed electronically when a designer is ready to use them.<br />
ROMs<br />
Another form of structured logic that can be used to implement a set of logic<br />
functions is a read-only memory (ROM). A ROM is called a memory because it<br />
has a set of locations that can be read; however, the contents of these locations are<br />
fixed, usually at the time the ROM is manufactured. There are also programmable<br />
ROMs (PROMs) that can be programmed electronically, when a designer knows<br />
their contents. There are also erasable PROMs; these devices require a slow erasure<br />
process using ultraviolet light, and thus are used as read-only memories, except<br />
during the design <strong>and</strong> debugging process.<br />
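Treating the three functions D, E, and F from the earlier example as a ROM means using the inputs as an address into a small table; the following sketch (ours, with made-up names) shows the idea:<br />

```verilog
// Hypothetical ROM implementation of D, E, F: the 3 inputs form an
// address, and each of the 8 words stores the 3 output bits {D, E, F}.
module def_rom (input  [2:0] addr,       // {A, B, C}
                output [2:0] data);      // {D, E, F}
  reg [2:0] rom [0:7];
  initial begin                          // contents fixed at "creation time"
    rom[0] = 3'b000; rom[1] = 3'b100;    // truth table rows 000 and 001
    rom[2] = 3'b100; rom[3] = 3'b110;
    rom[4] = 3'b100; rom[5] = 3'b110;
    rom[6] = 3'b110; rom[7] = 3'b101;
  end
  assign data = rom[addr];
endmodule
```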
A ROM has a set of input address lines and a set of outputs. The number of<br />
addressable entries in the ROM determines the number of address lines: if the
B.4 Using a Hardware Description Language B-19<br />
elements, which we can represent simply by showing that a given operation will<br />
happen to an entire collection of inputs. Inside a machine, much of the time we<br />
want to select between a pair of buses. A bus is a collection of data lines that is<br />
treated together as a single logical signal. (The term bus is also used to indicate a<br />
shared collection of lines with multiple sources and uses.)<br />
For example, in the MIPS instruction set, the result of an instruction that is written<br />
into a register can come from one of two sources. A multiplexor is used to choose<br />
which of the two buses (each 32 bits wide) will be written into the Result register.<br />
The 1-bit multiplexor, which we showed earlier, will need to be replicated 32 times.<br />
We indicate that a signal is a bus rather than a single 1-bit line by showing it with<br />
a thicker line in a figure. Most buses are 32 bits wide; those that are not are explicitly<br />
labeled with their width. When we show a logic unit whose inputs and outputs are<br />
buses, this means that the unit must be replicated a sufficient number of times to<br />
accommodate the width of the input. Figure B.3.6 shows how we draw a multiplexor<br />
that selects between a pair of 32-bit buses and how this expands in terms of 1-bit-wide<br />
multiplexors. Sometimes we need to construct an array of logic elements<br />
where the inputs for some elements in the array are outputs from earlier elements.<br />
For example, this is how a multibit-wide ALU is constructed. In such cases, we must<br />
explicitly show how to create wider arrays, since the individual elements of the array<br />
are no longer independent, as they are in the case of a 32-bit-wide multiplexor.<br />
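In Verilog, a bus-wide multiplexor needs no explicit replication; declaring the operands as vectors is enough (a sketch of ours):<br />

```verilog
// Hypothetical 32-bit 2-to-1 multiplexor: one conditional expression
// selects between two buses; synthesis replicates the 1-bit mux 32 times.
module mux32 (input  [31:0] A, B,
              input         select,
              output [31:0] C);
  assign C = select ? B : A;   // one select signal serves all 32 bits
endmodule
```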
bus In logic design, a<br />
collection of data lines<br />
that is treated together<br />
as a single logical signal;<br />
also, a shared collection<br />
of lines with multiple<br />
sources and uses.<br />
[Figure B.3.6 artwork: a. a single multiplexor with 32-bit buses A and B, a select input, and 32-bit output C; b. the same function drawn as 32 1-bit multiplexors (A31/B31 down to A0/B0 producing C31 down to C0) sharing one select signal.]<br />
a. A 32-bit wide 2-to-1 multiplexor b. The 32-bit wide multiplexor is actually<br />
an array of 32 1-bit multiplexors<br />
FIGURE B.3.6 A multiplexor is arrayed 32 times to perform a selection between two 32-<br />
bit inputs. Note that there is still only one data selection signal used for all 32 1-bit multiplexors.
B.4 Using a Hardware Description Language B-21<br />
Readers already familiar with VHDL should find the concepts simple, provided<br />
they have been exposed to the syntax of C.<br />
Verilog can specify both a behavioral and a structural definition of a digital<br />
system. A behavioral specification describes how a digital system functionally<br />
operates. A structural specification describes the detailed organization of a digital<br />
system, usually using a hierarchical description. A structural specification can be<br />
used to describe a hardware system in terms of a hierarchy of basic elements such<br />
as gates and switches. Thus, we could use Verilog to describe the exact contents of<br />
the truth tables and datapath of the last section.<br />
With the arrival of hardware synthesis tools, most designers now use Verilog<br />
or VHDL to structurally describe only the datapath, relying on logic synthesis to<br />
generate the control from a behavioral description. In addition, most CAD systems<br />
provide extensive libraries of standardized parts, such as ALUs, multiplexors,<br />
register files, memories, <strong>and</strong> programmable logic blocks, as well as basic gates.<br />
Obtaining an acceptable result using libraries <strong>and</strong> logic synthesis requires that<br />
the specification be written with an eye toward the eventual synthesis and the<br />
desired outcome. For our simple designs, this primarily means making clear what<br />
we expect to be implemented in combinational logic and what we expect to require<br />
sequential logic. In most of the examples we use in this section and the remainder<br />
of this appendix, we have written the Verilog with the eventual synthesis in mind.<br />
Datatypes and Operators in Verilog<br />
There are two primary datatypes in Verilog:<br />
1. A wire specifies a combinational signal.<br />
2. A reg (register) holds a value, which can vary with time. A reg need not<br />
necessarily correspond to an actual register in an implementation, although<br />
it often will.<br />
A register or wire, named X, that is 32 bits wide is declared as an array: reg<br />
[31:0] X or wire [31:0] X, which also sets the index of 0 to designate the<br />
least significant bit of the register. Because we often want to access a subfield of a<br />
register or wire, we can refer to a contiguous set of bits of a register or wire with the<br />
notation [starting bit: ending bit], where both indices must be constant<br />
values.<br />
An array of registers is used for a structure like a register file or memory. Thus,<br />
the declaration<br />
behavioral<br />
specification Describes<br />
how a digital system<br />
operates functionally.<br />
structural<br />
specification Describes<br />
how a digital system is<br />
organized in terms of a<br />
hierarchical connection of<br />
elements.<br />
hardware synthesis<br />
tools Computer-aided<br />
design software that<br />
can generate a gate-level<br />
design based on<br />
behavioral descriptions of<br />
a digital system.<br />
wire In Verilog, specifies<br />
a combinational signal.<br />
reg In Verilog, a register.<br />
reg [31:0] registerfile[0:31]<br />
specifies a variable registerfile that is equivalent to a MIPS registerfile, where<br />
register 0 is the first. When accessing an array, we can refer to a single element, as<br />
in C, using the notation registerfile[regnum].
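These declarations can be combined in a fragment like the following (our own illustrative sketch):<br />

```verilog
// Hypothetical declarations showing wires, regs, bit-selects, and
// an array of registers modeling a small register file.
module decl_demo;
  wire [31:0] bus;                      // 32-bit combinational signal
  reg  [31:0] registerfile [0:31];      // 32 registers, each 32 bits
  wire [15:0] upper = bus[31:16];       // contiguous subfield: constant indices
  wire        lsb   = bus[0];           // bit 0 is the least significant bit
  assign bus = registerfile[5];         // read one element, as in C
endmodule
```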
B-22 Appendix B The Basics of Logic Design<br />
The possible values for a register or wire in Verilog are<br />
■ 0 or 1, representing logical false or true<br />
■ X, representing unknown, the initial value given to all registers and to any<br />
wire not connected to something<br />
■ Z, representing the high-impedance state for tristate gates, which we will not<br />
discuss in this appendix<br />
Constant values can be specified as decimal numbers as well as binary, octal, or<br />
hexadecimal. We often want to say exactly how large a constant field is in bits. This<br />
is done by prefixing the value with a decimal number specifying its size in bits. For<br />
example:<br />
■ 4’b0100 specifies a 4-bit binary constant with the value 4, as does 4’d4.<br />
■ −8’h4 specifies an 8-bit constant with the value −4 (in two’s complement<br />
representation)<br />
Values can also be concatenated by placing them within { } separated by commas.<br />
The notation {x{bit field}} replicates bit field x times. For example:<br />
■ {16{2’b01}} creates a 32-bit value with the pattern 0101 … 01.<br />
■ {A[31:16],B[15:0]} creates a value whose upper 16 bits come from A<br />
and whose lower 16 bits come from B.<br />
Verilog provides the full set of unary and binary operators from C, including the<br />
arithmetic operators (+, −, *, /), the logical operators (&, |, ~), the comparison<br />
operators (==, !=, >, <, >=, <=), the shift operators (<<, >>), and C’s<br />
conditional operator (?:, which is used in the form condition ? expr1 : expr2<br />
and returns expr1 if the condition is true and expr2 if it is false). Verilog adds<br />
a set of unary logic reduction operators (&, |, ^) that yield a single bit by applying<br />
the logical operator to all the bits of an operand. For example, &A returns the value<br />
obtained by ANDing all the bits of A together, and ^A returns the reduction obtained<br />
by using exclusive OR on all the bits of A.<br />
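A few of these operators in action, on hypothetical values of our own choosing:<br />

```verilog
// Hypothetical fragment exercising constants, concatenation, replication,
// and the unary reduction operators described above.
module op_demo;
  wire [7:0] a    = 8'hF0;                  // 1111_0000
  wire [7:0] b    = {{4{1'b0}}, 4'b1111};   // replication + concatenation: 0000_1111
  wire       all1 = &a;                     // reduction AND: 0 (not all bits set)
  wire       any1 = |a;                     // reduction OR: 1 (some bit is set)
  wire       par  = ^a;                     // reduction XOR: parity of a's four 1s = 0
  wire [7:0] hi4  = a >> 4;                 // shift: 0000_1111
  wire       same = (b == hi4);             // comparison: 1
endmodule
```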
Check<br />
Yourself<br />
Which of the following define exactly the same value?<br />
1. 8’b11110000<br />
2. 8’hF0<br />
3. 8’d240<br />
4. {{4{1’b1}},{4{1’b0}}}<br />
5. {4’b1,4’b0}
B.4 Using a Hardware Description Language B-23<br />
Structure of a Verilog Program<br />
A Verilog program is structured as a set of modules, which may represent anything<br />
from a collection of logic gates to a complete system. Modules are similar to classes<br />
in C, although not nearly as powerful. A module specifies its input and output<br />
ports, which describe the incoming and outgoing connections of a module. A<br />
module may also declare additional variables. The body of a module consists of:<br />
■ initial constructs, which can initialize reg variables<br />
■ Continuous assignments, which define only combinational logic<br />
■ always constructs, which can define either sequential or combinational<br />
logic<br />
■ Instances of other modules, which are used to implement the module being<br />
defined<br />
Representing Complex Combinational Logic in Verilog<br />
A continuous assignment, which is indicated with the keyword assign, acts like<br />
a combinational logic function: the output is continuously assigned the value, and<br />
a change in the input values is reflected immediately in the output value. Wires<br />
may only be assigned values with continuous assignments. Using continuous<br />
assignments, we can define a module that implements a half-adder, as Figure B.4.1<br />
shows.<br />
Assign statements are one sure way to write Verilog that generates combinational<br />
logic. For more complex structures, however, assign statements may be awkward or<br />
tedious to use. It is also possible to use the always block of a module to describe<br />
a combinational logic element, although care must be taken. Using an always<br />
block allows the inclusion of Verilog control constructs, such as if-then-else, case<br />
statements, for statements, and repeat statements, to be used. These statements are<br />
similar to those in C with small changes.<br />
An always block specifies an optional list of signals on which the block is<br />
sensitive (in a list starting with @). The always block is re-evaluated if any of the<br />
FIGURE B.4.1<br />
A Verilog module that defines a half-adder using continuous assignments.
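The figure itself is not reproduced here; a half-adder along the lines it describes might read (our sketch):<br />

```verilog
// Hypothetical half-adder using only continuous assignments:
// sum is the XOR of the inputs, carry their AND.
module half_adder (input  a, b,
                   output sum, carry);
  assign sum   = a ^ b;   // 1 when exactly one input is 1
  assign carry = a & b;   // 1 when both inputs are 1
endmodule
```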
B-24 Appendix B The Basics of Logic Design<br />
sensitivity list The list of<br />
signals that specifies when<br />
an always block should<br />
be re-evaluated.<br />
listed signals changes value; if the list is omitted, the always block is constantly re-evaluated.<br />
When an always block is specifying combinational logic, the sensitivity<br />
list should include all the input signals. If there are multiple Verilog statements to<br />
be executed in an always block, they are surrounded by the keywords begin and<br />
end, which take the place of the { and } in C. An always block thus looks like this:<br />
always @(list of signals that cause reevaluation) begin<br />
Verilog statements including assignments and other<br />
control statements<br />
end<br />
blocking assignment<br />
In Verilog, an assignment<br />
that completes before<br />
the execution of the next<br />
statement.<br />
nonblocking<br />
assignment An<br />
assignment that continues<br />
after evaluating the right-hand<br />
side, assigning the<br />
left-hand side the value<br />
only after all right-hand<br />
sides are evaluated.<br />
Reg variables may only be assigned inside an always block, using a procedural<br />
assignment statement (as distinguished from continuous assignment we saw<br />
earlier). There are, however, two different types of procedural assignments. The<br />
assignment operator = executes as it does in C; the right-hand side is evaluated,<br />
and the left-hand side is assigned the value. Furthermore, it executes like the<br />
normal C assignment statement: that is, it is completed before the next statement is<br />
executed. Hence, the assignment operator = has the name blocking assignment.<br />
This blocking can be useful in the generation of sequential logic, and we will return<br />
to it shortly. The other form of assignment (nonblocking) is indicated by <=.
B.5 Constructing a Basic Arithmetic Logic Unit B-25<br />
FIGURE B.4.2 A Verilog definition of a 4-to-1 multiplexor with 32-bit inputs, using a case statement. The case statement acts like a C switch statement, except that in Verilog only the code associated with the selected case is executed (as if each case statement had a break at the end) and there is no fall-through to the next statement.<br />
FIGURE B.4.3 A Verilog behavioral definition of a MIPS ALU. This could be synthesized using a module library containing basic arithmetic and logical operations.
B-26 Appendix B The Basics of Logic Design<br />
Check Yourself<br />
Assuming all values are initially zero, what are the values of A and B after executing this Verilog code inside an always block?<br />
C = 1;<br />
A <= C;<br />
B = C;
FIGURE B.5.7 A 32-bit ALU constructed from 32 1-bit ALUs. CarryOut of the less significant bit is connected to the CarryIn of the more significant bit. This organization is called ripple carry.<br />
this is only one step in negating a two's complement number. Notice that the least significant bit still has a CarryIn signal, even though it's unnecessary for addition. What happens if we set this CarryIn to 1 instead of 0? The adder will then calculate a + b + 1. By selecting the inverted version of b, we get exactly what we want:<br />
a + ¬b + 1 = a + (¬b + 1) = a + (−b) = a − b<br />
The simplicity of the hardware design of a two's complement adder helps explain why two's complement representation has become the universal standard for integer computer arithmetic.
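This identity is easy to spot-check in software. The following is our own quick Python check, not part of the book's hardware description: it mimics a 32-bit adder by masking to 32 bits, and confirms that inverting b and setting the low-order CarryIn to 1 produces a − b modulo 2³².

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits, like the hardware

def subtract_via_invert(a, b):
    # a + NOT(b) + 1 is what the ALU computes with Binvert = 1 and CarryIn = 1
    return (a + (~b & MASK) + 1) & MASK

for a, b in [(7, 3), (3, 7), (0, 1), (123456789, 987654321)]:
    assert subtract_via_invert(a, b) == (a - b) & MASK
print("a + ~b + 1 matches a - b (mod 2^32) for all samples")
```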
FIGURE B.5.11 A 32-bit ALU constructed from 31 copies of the 1-bit ALU in the top of Figure B.5.10 and one 1-bit ALU in the bottom of that figure. The Less inputs are connected to 0 except for the least significant bit, which is connected to the Set output of the most significant bit. If the ALU performs a − b and we select input 3 in the multiplexor in Figure B.5.10, then Result = 0 … 001 if a < b, and Result = 0 … 000 otherwise.<br />
Thus, we need a new 1-bit ALU for the most significant bit that has an extra output bit: the adder output. The bottom drawing of Figure B.5.10 shows the design, with this new adder output line called Set, used only for slt. Since we need a special ALU for the most significant bit anyway, we added the overflow detection logic there as well, since it is also associated with that bit.
Alas, the test of less than is a little more complicated than just described because of overflow, as we explore in the exercises. Figure B.5.11 shows the 32-bit ALU.<br />
Notice that every time we want the ALU to subtract, we set both CarryIn and Binvert to 1. For adds or logical operations, we want both control lines to be 0. We can therefore simplify control of the ALU by combining the CarryIn and Binvert to a single control line called Bnegate.<br />
To further tailor the ALU to the MIPS instruction set, we must support conditional branch instructions. These instructions branch either if two registers are equal or if they are unequal. The easiest way to test equality with the ALU is to subtract b from a and then test to see if the result is 0, since<br />
(a − b = 0) ⇒ a = b<br />
Thus, if we add hardware to test if the result is 0, we can test for equality. The simplest way is to OR all the outputs together and then send that signal through an inverter:<br />
Zero = ¬(Result31 + Result30 + … + Result2 + Result1 + Result0)<br />
Figure B.5.12 shows the revised 32-bit ALU. We can think of the combination of the 1-bit Ainvert line, the 1-bit Binvert line, and the 2-bit Operation lines as 4-bit control lines for the ALU, telling it to perform add, subtract, AND, OR, or set on less than. Figure B.5.13 shows the ALU control lines and the corresponding ALU operation.<br />
Finally, now that we have seen what is inside a 32-bit ALU, we will use the universal symbol for a complete ALU, as shown in Figure B.5.14.
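The behavior just described can be modeled directly in software. The sketch below is our own Python illustration (not the book's Verilog): it implements the 4-bit control encodings of Figure B.5.13 for a 32-bit ALU, computes Zero by testing whether the result is all zeros, and implements slt via the sign bit of a − b (ignoring the overflow complication the text mentions).

```python
MASK = 0xFFFFFFFF

def alu(control, a, b):
    """Model of the 32-bit MIPS ALU.

    control is the 4-bit code Ainvert|Bnegate|Operation from Figure B.5.13.
    Returns (Result, Zero)."""
    if control == 0b0000:            # AND
        result = a & b
    elif control == 0b0001:          # OR
        result = a | b
    elif control == 0b0010:          # add
        result = (a + b) & MASK
    elif control == 0b0110:          # subtract: a + ~b + 1
        result = (a + (~b & MASK) + 1) & MASK
    elif control == 0b0111:          # set on less than
        diff = (a + (~b & MASK) + 1) & MASK
        result = 1 if diff >> 31 else 0  # sign bit of a - b (overflow ignored)
    elif control == 0b1100:          # NOR: (~a) AND (~b)
        result = (~a & ~b) & MASK
    else:
        raise ValueError("unsupported control code")
    zero = int(result == 0)          # OR all result bits together, then invert
    return result, zero

r, z = alu(0b0110, 29, 29)
print(r, z)  # 0 1 -- equal operands subtract to zero, so Zero is asserted
```

As the subtract example shows, the Zero output is what a branch-on-equal instruction would test.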
Defining the MIPS ALU in Verilog<br />
Figure B.5.15 shows how a combinational MIPS ALU might be specified in Verilog; such a specification would probably be compiled using a standard parts library that provided an adder, which could be instantiated. For completeness, we show the ALU control for MIPS in Figure B.5.16, which is used in Chapter 4, where we build a Verilog version of the MIPS datapath.<br />
The next question is, "How quickly can this ALU add two 32-bit operands?" We can determine the a and b inputs, but the CarryIn input depends on the operation in the adjacent 1-bit adder. If we trace all the way through the chain of dependencies, we connect the most significant bit to the least significant bit, so the most significant bit of the sum must wait for the sequential evaluation of all 32 1-bit adders. This sequential chain reaction is too slow to be used in time-critical hardware. The next section explores how to speed up addition. This topic is not crucial to understanding the rest of the appendix and may be skipped.
FIGURE B.5.12 The final 32-bit ALU. This adds a Zero detector to Figure B.5.11.<br />
ALU control lines Function<br />
0000 AND<br />
0001 OR<br />
0010 add<br />
0110 subtract<br />
0111 set on less than<br />
1100 NOR<br />
FIGURE B.5.13 The values of the three ALU control lines (Ainvert, Bnegate, and Operation) and the corresponding ALU operations.
FIGURE B.5.14 The symbol commonly used to represent an ALU, as shown in Figure B.5.12. This symbol is also used to represent an adder, so it is normally labeled either with ALU or Adder.<br />
FIGURE B.5.15 A Verilog behavioral definition of a MIPS ALU.
B.6 Faster Addition: Carry Lookahead B-39<br />
significant bit of the adder, in theory we could calculate the CarryIn values to all the remaining bits of the adder in just two levels of logic.<br />
For example, the CarryIn for bit 2 of the adder is exactly the CarryOut of bit 1, so the formula is<br />
CarryIn2 = (b1 · CarryIn1) + (a1 · CarryIn1) + (a1 · b1)<br />
Similarly, CarryIn1 is defined as<br />
CarryIn1 = (b0 · CarryIn0) + (a0 · CarryIn0) + (a0 · b0)<br />
Using the shorter and more traditional abbreviation of ci for CarryIni, we can rewrite the formulas as<br />
c2 = (b1 · c1) + (a1 · c1) + (a1 · b1)<br />
c1 = (b0 · c0) + (a0 · c0) + (a0 · b0)<br />
Substituting the definition of c1 into the first equation results in this formula:<br />
c2 = (a1 · a0 · b0) + (a1 · a0 · c0) + (a1 · b0 · c0) + (b1 · a0 · b0) + (b1 · a0 · c0) + (b1 · b0 · c0) + (a1 · b1)<br />
You can imagine how the equation expands as we get to higher bits in the adder; it grows rapidly with the number of bits. This complexity is reflected in the cost of the hardware for fast carry, making this simple scheme prohibitively expensive for wide adders.<br />
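A brute-force check confirms that the expanded formula for c2 agrees with the two nested equations. This is our own verification in Python, trying all 32 combinations of the five inputs:

```python
from itertools import product

for a0, b0, a1, b1, c0 in product((0, 1), repeat=5):
    c1 = (b0 & c0) | (a0 & c0) | (a0 & b0)
    nested = (b1 & c1) | (a1 & c1) | (a1 & b1)
    expanded = ((a1 & a0 & b0) | (a1 & a0 & c0) | (a1 & b0 & c0) |
                (b1 & a0 & b0) | (b1 & a0 & c0) | (b1 & b0 & c0) |
                (a1 & b1))
    assert nested == expanded
print("expanded c2 matches the nested definition for all 32 input combinations")
```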
Fast Carry Using the First Level of Abstraction: Propagate and Generate<br />
Most fast-carry schemes limit the complexity of the equations to simplify the hardware, while still making substantial speed improvements over ripple carry. One such scheme is a carry-lookahead adder. In Chapter 1, we said computer systems cope with complexity by using levels of abstraction. A carry-lookahead adder relies on levels of abstraction in its implementation.<br />
Let's factor our original equation as a first step:<br />
ci+1 = (bi · ci) + (ai · ci) + (ai · bi)<br />
= (ai · bi) + (ai + bi) · ci<br />
If we were to rewrite the equation for c2 using this formula, we would see some repeated patterns:<br />
c2 = (a1 · b1) + (a1 + b1) · ((a0 · b0) + (a0 + b0) · c0)<br />
Note the repeated appearance of (ai · bi) and (ai + bi) in the formula above. These two important factors are traditionally called generate (gi) and propagate (pi):
gi = ai · bi<br />
pi = ai + bi<br />
Using them to define ci+1, we get<br />
ci+1 = gi + (pi · ci)<br />
To see where the signals get their names, suppose gi is 1. Then<br />
ci+1 = gi + (pi · ci) = 1 + (pi · ci) = 1<br />
That is, the adder generates a CarryOut (ci+1) independent of the value of CarryIn (ci). Now suppose that gi is 0 and pi is 1. Then<br />
ci+1 = gi + (pi · ci) = 0 + (1 · ci) = ci<br />
That is, the adder propagates CarryIn to CarryOut. Putting the two together, CarryIni+1 is a 1 if either gi is 1 or both pi is 1 and CarryIni is 1.<br />
As an analogy, imagine a row of dominoes set on edge. The end domino can be tipped over by pushing one far away, provided there are no gaps between the two. Similarly, a carry out can be made true by a generate far away, provided all the propagates between them are true.<br />
Relying on the definitions of propagate and generate as our first level of abstraction, we can express the CarryIn signals more economically. Let's show it for 4 bits:<br />
c1 = g0 + (p0 · c0)<br />
c2 = g1 + (p1 · g0) + (p1 · p0 · c0)<br />
c3 = g2 + (p2 · g1) + (p2 · p1 · g0) + (p2 · p1 · p0 · c0)<br />
c4 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0) + (p3 · p2 · p1 · p0 · c0)<br />
These equations just represent common sense: CarryIni is a 1 if some earlier adder generates a carry and all intermediary adders propagate a carry. Figure B.6.1 uses plumbing to try to explain carry lookahead.<br />
Even this simplified form leads to large equations and, hence, considerable logic even for a 16-bit adder. Let's try moving to two levels of abstraction.
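These four equations can be checked against a plain ripple computation. The sketch below is our own Python illustration: it derives gi and pi from two 4-bit inputs (least significant bit first) and confirms that the lookahead equations produce the same carries as rippling through the full-adder carry formula.

```python
from itertools import product

def carries_ripple(a_bits, b_bits, c0):
    """Carry out of each bit via ci+1 = (bi.ci) + (ai.ci) + (ai.bi)."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (b & c) | (a & c) | (a & b)
        carries.append(c)
    return carries  # [c1, c2, c3, c4]

def carries_lookahead(a_bits, b_bits, c0):
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate
    p = [a | b for a, b in zip(a_bits, b_bits)]  # propagate
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) |
          (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) |
          (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

for bits in product((0, 1), repeat=9):  # 4 bits of a, 4 bits of b, and c0
    a_bits, b_bits, c0 = list(bits[0:4]), list(bits[4:8]), bits[8]
    assert carries_ripple(a_bits, b_bits, c0) == carries_lookahead(a_bits, b_bits, c0)
print("lookahead carries match ripple carries for all 512 cases")
```

The key point is that each lookahead carry is computed directly from the gi, pi, and c0 inputs, rather than waiting for the carry of the previous bit.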
Fast Carry Using the Second Level of Abstraction<br />
First, we consider this 4-bit adder with its carry-lookahead logic as a single building<br />
block. If we connect them in ripple carry fashion to form a 16-bit adder, the add<br />
will be faster than the original with a little more hardware.
To go faster, we'll need carry lookahead at a higher level. To perform carry lookahead for 4-bit adders, we need propagate and generate signals at this higher level. Here they are for the four 4-bit adder blocks:<br />
P0 = p3 · p2 · p1 · p0<br />
P1 = p7 · p6 · p5 · p4<br />
P2 = p11 · p10 · p9 · p8<br />
P3 = p15 · p14 · p13 · p12<br />
That is, the "super" propagate signal for the 4-bit abstraction (Pi) is true only if each of the bits in the group will propagate a carry.<br />
For the "super" generate signal (Gi), we care only if there is a carry out of the most significant bit of the 4-bit group. This obviously occurs if generate is true for that most significant bit; it also occurs if an earlier generate is true and all the intermediate propagates, including that of the most significant bit, are also true:<br />
G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)<br />
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)<br />
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)<br />
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)<br />
Figure B.6.2 updates our plumbing analogy to show P0 and G0.<br />
Then the equations at this higher level of abstraction for the carry in for each 4-bit group of the 16-bit adder (C1, C2, C3, C4 in Figure B.6.3) are very similar to the carry out equations for each bit of the 4-bit adder (c1, c2, c3, c4) on page B-40:<br />
C1 = G0 + (P0 · c0)<br />
C2 = G1 + (P1 · G0) + (P1 · P0 · c0)<br />
C3 = G2 + (P2 · G1) + (P2 · P1 · G0) + (P2 · P1 · P0 · c0)<br />
C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · c0)<br />
Figure B.6.3 shows 4-bit adders connected with such a carry-lookahead unit. The exercises explore the speed differences between these carry schemes, different notations for multibit propagate and generate signals, and the design of a 64-bit adder.
Both Levels of the Propagate and Generate<br />
EXAMPLE<br />
Determine the gi, pi, Pi, and Gi values of these two 16-bit numbers:<br />
a: 0001 1010 0011 0011two<br />
b: 1110 0101 1110 1011two<br />
Also, what is CarryOut15 (C4)?<br />
ANSWER<br />
Aligning the bits makes it easy to see the values of generate gi (gi = ai · bi) and propagate pi (pi = ai + bi):<br />
a: 0001 1010 0011 0011<br />
b: 1110 0101 1110 1011<br />
gi: 0000 0000 0010 0011<br />
pi: 1111 1111 1111 1011<br />
where the bits are numbered 15 to 0 from left to right. Next, the "super" propagates (P3, P2, P1, P0) are simply the AND of the lower-level propagates:<br />
P3 = 1 · 1 · 1 · 1 = 1<br />
P2 = 1 · 1 · 1 · 1 = 1<br />
P1 = 1 · 1 · 1 · 1 = 1<br />
P0 = 1 · 0 · 1 · 1 = 0<br />
The "super" generates are more complex, so use the following equations:<br />
G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)<br />
= 0 + (1 · 0) + (1 · 0 · 1) + (1 · 0 · 1 · 1)<br />
= 0 + 0 + 0 + 0 = 0<br />
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)<br />
= 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0)<br />
= 0 + 0 + 1 + 0 = 1<br />
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)<br />
= 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0)<br />
= 0 + 0 + 0 + 0 = 0<br />
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)<br />
= 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0)<br />
= 0 + 0 + 0 + 0 = 0<br />
Finally, CarryOut15 is<br />
C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · c0)<br />
= 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) + (1 · 1 · 1 · 0 · 0)<br />
= 0 + 0 + 1 + 0 + 0 = 1<br />
Hence, there is a carry out when adding these two 16-bit numbers.
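The example's arithmetic can be replayed in a few lines of Python. This is our own check (bit 0 is the rightmost bit, as in the text); it also confirms the result against an ordinary 16-bit addition.

```python
a = 0b0001101000110011  # a: 0001 1010 0011 0011
b = 0b1110010111101011  # b: 1110 0101 1110 1011

g = [(a >> i & 1) & (b >> i & 1) for i in range(16)]  # gi = ai . bi
p = [(a >> i & 1) | (b >> i & 1) for i in range(16)]  # pi = ai + bi

# "Super" propagate: AND of each 4-bit group's propagates
P = [p[4 * j] & p[4 * j + 1] & p[4 * j + 2] & p[4 * j + 3] for j in range(4)]

# "Super" generate, e.g. G0 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0
def super_g(j):
    g0, g1, g2, g3 = g[4 * j: 4 * j + 4]
    p0, p1, p2, p3 = p[4 * j: 4 * j + 4]
    return g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)

G = [super_g(j) for j in range(4)]

c0 = 0
C4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) |
      (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
print(P, G, C4)           # [0, 1, 1, 1] [0, 1, 0, 0] 1
print((a + b) >> 16 & 1)  # 1 -- the real 16-bit add carries out too
```

The printed lists read [P0, P1, P2, P3] and [G0, G1, G2, G3], matching the values worked out by hand above.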
FIGURE B.6.3 Four 4-bit ALUs using carry lookahead to form a 16-bit adder. Note that the carries come from the carry-lookahead unit, not from the 4-bit ALUs.
The reason carry lookahead can make carries faster is that all logic begins evaluating the moment the clock cycle begins, and the result will not change once the output of each gate stops changing. By taking the shortcut of going through fewer gates to send the carry in signal, the output of the gates will stop changing sooner, and hence the time for the adder can be less.<br />
To appreciate the importance of carry lookahead, we need to calculate the relative performance between it and ripple carry adders.<br />
Speed of Ripple Carry versus Carry Lookahead<br />
EXAMPLE<br />
One simple way to model time for logic is to assume each AND or OR gate takes the same time for a signal to pass through it. Time is estimated by simply counting the number of gates along the path through a piece of logic. Compare the number of gate delays for paths of two 16-bit adders, one using ripple carry and one using two-level carry lookahead.<br />
ANSWER<br />
Figure B.5.5 on page B-28 shows that the carry out signal takes two gate delays per bit. Then the number of gate delays between a carry in to the least significant bit and the carry out of the most significant is 16 × 2 = 32.<br />
For carry lookahead, the carry out of the most significant bit is just C4, defined in the example. It takes two levels of logic to specify C4 in terms of Pi and Gi (the OR of several AND terms). Pi is specified in one level of logic (AND) using pi, and Gi is specified in two levels using pi and gi, so the worst case for this next level of abstraction is two levels of logic. pi and gi are each one level of logic, defined in terms of ai and bi. If we assume one gate delay for each level of logic in these equations, the worst case is 2 + 2 + 1 = 5 gate delays.<br />
Hence, for the path from carry in to carry out, the 16-bit addition by a carry-lookahead adder is six times faster, using this very simple estimate of hardware speed.<br />
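Under this one-delay-per-gate-level model, the comparison reduces to a short calculation. This is our own restatement of the arithmetic in the answer, not an additional result:

```python
bits = 16
ripple_delays = 2 * bits  # two gate delays of carry logic per bit

# Two-level lookahead: pi/gi take 1 level, Gi takes 2 more, C4 takes 2 more
lookahead_delays = 1 + 2 + 2

print(ripple_delays, lookahead_delays)  # 32 5
print(f"speedup ~ {ripple_delays / lookahead_delays:.1f}x")  # ~ 6.4x
```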
Summary<br />
Carry lookahead offers a faster path than waiting for the carries to ripple through all 32 1-bit adders. This faster path is paved by two signals, generate and propagate.
The former creates a carry regardless of the carry input, and the latter passes a carry along. Carry lookahead also gives another example of how abstraction is important in computer design to cope with complexity.<br />
Check Yourself<br />
Using the simple estimate of hardware speed above with gate delays, what is the relative performance of a ripple carry 8-bit add versus a 64-bit add using carry-lookahead logic?<br />
1. A 64-bit carry-lookahead adder is three times faster: 8-bit adds are 16 gate delays and 64-bit adds are 7 gate delays.<br />
2. They are about the same speed, since 64-bit adds need more levels of logic in the 16-bit adder.<br />
3. 8-bit adds are faster than 64 bits, even with carry lookahead.<br />
Elaboration: We have now accounted for all but one of the arithmetic and logical operations for the core MIPS instruction set: the ALU in Figure B.5.14 omits support of shift instructions. It would be possible to widen the ALU multiplexor to include a left shift by 1 bit or a right shift by 1 bit. But hardware designers have created a circuit called a barrel shifter, which can shift from 1 to 31 bits in no more time than it takes to add two 32-bit numbers, so shifting is normally done outside the ALU.<br />
Elaboration: The logic equation for the Sum output of the full adder on page B-28 can be expressed more simply by using a more powerful gate than AND and OR. An exclusive OR gate is true if the two operands disagree; that is,<br />
x ≠ y ⇒ 1 and x = y ⇒ 0<br />
In some technologies, exclusive OR is more efficient than two levels of AND and OR gates. Using the symbol ⊕ to represent exclusive OR, here is the new equation:<br />
Sum = a ⊕ b ⊕ CarryIn<br />
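The XOR form can be checked against the full adder's truth table, where Sum is 1 exactly when an odd number of the three inputs are 1. A quick check of our own in Python:

```python
from itertools import product

for a, b, cin in product((0, 1), repeat=3):
    truth_table_sum = (a + b + cin) % 2  # odd number of 1s among the inputs
    xor_sum = a ^ b ^ cin
    assert xor_sum == truth_table_sum
print("a XOR b XOR CarryIn matches the full adder's Sum output")
```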
Also, we have drawn the ALU the traditional way, using gates. Computers are designed today in CMOS transistors, which are basically switches. CMOS ALUs and barrel shifters take advantage of these switches and have many fewer multiplexors than shown in our designs, but the design principles are similar.<br />
Elaboration: Using lowercase and uppercase to distinguish the hierarchy of generate and propagate symbols breaks down when you have more than two levels. An alternate notation that scales is gi..j and pi..j for the generate and propagate signals for bits i to j. Thus, g1..1 is the generate signal for bit 1, g4..1 is for bits 4 to 1, and g16..1 is for bits 16 to 1.
B.7 Clocks B-49<br />
clock edge occurs. A signal is valid if it is stable (i.e., not changing), and the value will not change again until the inputs change. Since combinational circuits cannot have feedback, if the inputs to a combinational logic unit are not changed, the outputs will eventually become valid.<br />
Figure B.7.2 shows the relationship among the state elements and the combinational logic blocks in a synchronous, sequential logic design. The state elements, whose outputs change only after the clock edge, provide valid inputs to the combinational logic block. To ensure that the values written into the state elements on the active clock edge are valid, the clock must have a long enough period so that all the signals in the combinational logic block stabilize, and then the clock edge samples those values for storage in the state elements. This constraint sets a lower bound on the length of the clock period, which must be long enough for all state element inputs to be valid.<br />
In the rest of this appendix, as well as in Chapter 4, we usually omit the clock signal, since we are assuming that all state elements are updated on the same clock edge. Some state elements will be written on every clock edge, while others will be written only under certain conditions (such as a register being updated). In such cases, we will have an explicit write signal for that state element. The write signal must still be gated with the clock so that the update occurs only on the clock edge if the write signal is active. We will see how this is done and used in the next section.<br />
One other advantage of an edge-triggered methodology is that it is possible to have a state element that is used as both an input and an output to the same combinational logic block, as shown in Figure B.7.3. In practice, care must be taken to prevent races in such situations and to ensure that the clock period is long enough; this topic is discussed further in Section B.11.<br />
Now that we have discussed how clocking is used to update state elements, we can discuss how to construct the state elements.<br />
FIGURE B.7.2 The inputs to a combinational logic block come from a state element, and the outputs are written into a state element. The clock edge determines when the contents of the state elements are updated.
B.8 Memory Elements: Flip-Flops, Latches, and Registers B-51<br />
The simplest type of memory elements are unclocked; that is, they do not have any clock input. Although we only use clocked memory elements in this text, an unclocked latch is the simplest memory element, so let's look at this circuit first. Figure B.8.1 shows an S-R latch (set-reset latch), built from a pair of NOR gates (OR gates with inverted outputs). The outputs Q and Q̄ represent the value of the stored state and its complement. When neither S nor R is asserted, the cross-coupled NOR gates act as inverters and store the previous values of Q and Q̄.<br />
For example, if the output, Q, is true, then the bottom inverter produces a false output (which is Q̄), which becomes the input to the top inverter, which produces a true output, which is Q, and so on. If S is asserted, then the output Q will be asserted and Q̄ will be deasserted, while if R is asserted, then the output Q̄ will be asserted and Q will be deasserted. When S and R are both deasserted, the last values of Q and Q̄ will continue to be stored in the cross-coupled structure. Asserting S and R simultaneously can lead to incorrect operation: depending on how S and R are deasserted, the latch may oscillate or become metastable (this is described in more detail in Section B.11).<br />
This cross-coupled structure is the basis for more complex memory elements that allow us to store data signals. These elements contain additional gates used to store signal values and to cause the state to be updated only in conjunction with a clock. The next section shows how these elements are built.<br />
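The cross-coupled behavior can be simulated by iterating the two NOR gates until the outputs settle. This Python sketch is our own illustration; it shows set, hold, and reset, and deliberately does not model the problematic S = R = 1 case.

```python
def nor(x, y):
    return 1 - (x | y)

def sr_latch(s, r, q, qbar):
    """Iterate the cross-coupled NOR gates until the outputs stabilize."""
    for _ in range(4):  # a few passes suffice when S and R are not both 1
        q_next = nor(r, qbar)   # top NOR drives Q from R and Qbar
        qbar_next = nor(s, q)   # bottom NOR drives Qbar from S and Q
        if (q_next, qbar_next) == (q, qbar):
            break
        q, qbar = q_next, qbar_next
    return q, qbar

q, qbar = sr_latch(s=1, r=0, q=0, qbar=1)      # set
print(q, qbar)  # 1 0
q, qbar = sr_latch(s=0, r=0, q=q, qbar=qbar)   # hold: state is retained
print(q, qbar)  # 1 0
q, qbar = sr_latch(s=0, r=1, q=q, qbar=qbar)   # reset
print(q, qbar)  # 0 1
```

The hold case is the interesting one: with S = R = 0 the feedback loop alone preserves the stored bit, which is exactly the storage mechanism the text describes.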
Flip-Flops and Latches<br />
Flip-flops and latches are the simplest memory elements. In both flip-flops and latches, the output is equal to the value of the stored state inside the element. Furthermore, unlike the S-R latch described above, all the latches and flip-flops we will use from this point on are clocked, which means that they have a clock input and the change of state is triggered by that clock. The difference between a flip-flop and a latch is the point at which the clock causes the state to actually change. In a clocked latch, the state is changed whenever the appropriate inputs change and the clock is asserted, whereas in a flip-flop, the state is changed only on a clock edge. Since throughout this text we use an edge-triggered timing methodology where state is only updated on clock edges, we need only use flip-flops. Flip-flops are often built from latches, so we start by describing the operation of a simple clocked latch and then discuss the operation of a flip-flop constructed from that latch.<br />
For computer applications, the function of both flip-flops and latches is to store a signal. A D latch or D flip-flop stores the value of its data input signal in the internal memory. Although there are many other types of latch and flip-flop, the D type is the only basic building block that we will need. A D latch has two inputs and two outputs. The inputs are the data value to be stored (called D) and a clock signal (called C) that indicates when the latch should read the value on the D input and store it. The outputs are simply the value of the internal state (Q)<br />
flip-flop A memory element for which the output is equal to the value of the stored state inside the element and for which the internal state is changed only on a clock edge.<br />
latch A memory element in which the output is equal to the value of the stored state inside the element and the state is changed whenever the appropriate inputs change and the clock is asserted.<br />
D flip-flop A flip-flop with one data input that stores the value of that input signal in the internal memory when the clock edge occurs.
and its complement (Q̄). When the clock input C is asserted, the latch is said to be open, and the value of the output (Q) becomes the value of the input D. When the clock input C is deasserted, the latch is said to be closed, and the value of the output (Q) is whatever value was stored the last time the latch was open.<br />
Figure B.8.2 shows how a D latch can be implemented with two additional gates added to the cross-coupled NOR gates. Since when the latch is open the value of Q changes as D changes, this structure is sometimes called a transparent latch. Figure B.8.3 shows how this D latch works, assuming that the output Q is initially false and that D changes first.<br />
As mentioned earlier, we use flip-flops as the basic building block, rather than latches. Flip-flops are not transparent: their outputs change only on the clock edge. A flip-flop can be built so that it triggers on either the rising (positive) or falling (negative) clock edge; for our designs we can use either type. Figure B.8.4 shows how a falling-edge D flip-flop is constructed from a pair of D latches. In a D flip-flop, the output is stored when the clock edge occurs. Figure B.8.5 shows how this flip-flop operates.<br />
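A falling-edge flip-flop built from two transparent latches can be mimicked in a few lines. This Python sketch is our own illustration (the class name is ours): the master latch follows D while C is high; when C falls, the slave latch copies the master's last value to Q.

```python
class FallingEdgeDFF:
    """Master-slave D flip-flop: Q changes only when C falls from 1 to 0."""
    def __init__(self):
        self.master = 0  # output of the first (master) latch
        self.q = 0       # output of the second (slave) latch

    def step(self, c, d):
        if c == 1:
            self.master = d       # master latch open, slave closed
        else:
            self.q = self.master  # slave latch open: Q takes master's value
        return self.q

ff = FallingEdgeDFF()
trace = []
# D wiggles while C is high, but only the value of D just before the
# falling edge is stored.
for c, d in [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]:
    trace.append(ff.step(c, d))
print(trace)  # [0, 0, 0, 0, 1]
```

Note that Q stays at its old value through the entire high phase of C, which is exactly the non-transparent behavior that distinguishes the flip-flop from the latch of Figure B.8.3.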
FIGURE B.8.2 A D latch implemented with NOR gates. A NOR gate acts as an inverter if the other input is 0. Thus, the cross-coupled pair of NOR gates acts to store the state value unless the clock input, C, is asserted, in which case the value of input D replaces the value of Q and is stored. The value of input D must be stable when the clock signal C changes from asserted to deasserted.<br />
FIGURE B.8.3 Operation of a D latch, assuming the output is initially deasserted. When the clock, C, is asserted, the latch is open and the Q output immediately assumes the value of the D input.
B.8 Memory Elements: Flip-Flops, Latches, <strong>and</strong> Registers B-53<br />
D<br />
D<br />
C<br />
D<br />
latch<br />
Q<br />
D<br />
C<br />
D<br />
latch<br />
Q<br />
Q<br />
Q<br />
Q<br />
C<br />
FIGURE B.8.4 A D flip-flop with a falling-edge trigger. The first latch, called the master, is open<br />
<strong>and</strong> follows the input D when the clock input, C, is asserted. When the clock input, C, falls, the first latch is<br />
closed, but the second latch, called the slave, is open <strong>and</strong> gets its input from the output of the master latch.<br />
D<br />
C<br />
Q<br />
FIGURE B.8.5 Operation of a D flip-flop with a falling-edge trigger, assuming the output is<br />
initially deasserted. When the clock input (C) changes from asserted to deasserted, the Q output stores<br />
the value of the D input. Compare this behavior to that of the clocked D latch shown in Figure B.8.3. In a<br />
clocked latch, the stored value <strong>and</strong> the output, Q, both change whenever C is high, as opposed to only when<br />
C transitions.<br />
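The master–slave construction of Figure B.8.4 can be modeled in the same style. In this sketch (ours, not the book's), the master latch follows D while the clock is asserted, and the slave copies the master only on the falling edge, so Q changes only when C transitions:

```python
class DFlipFlop:
    """Behavioral model of a falling-edge-triggered D flip-flop,
    built conceptually from a master latch and a slave latch."""
    def __init__(self):
        self.prev_c = 0
        self.master = 0
        self.q = 0  # assume the output is initially deasserted

    def step(self, c, d):
        if c:                         # clock high: master follows D, slave closed
            self.master = d
        elif self.prev_c and not c:   # falling edge: slave captures the master
            self.q = self.master
        self.prev_c = c
        return self.q

ff = DFlipFlop()
ff.step(1, 1)        # clock high: D = 1 is sampled by the master, Q unchanged
assert ff.q == 0
ff.step(0, 0)        # falling edge: Q takes the value D had before the edge
assert ff.q == 1
```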
Here is a Verilog description of a module for a rising-edge D flip-flop, assuming<br />
that C is the clock input <strong>and</strong> D is the data input:<br />
module DFF(clock, D, Q, Qbar);<br />
  input clock, D;<br />
  output reg Q;     // Q is a reg since it is assigned in an always block<br />
  output Qbar;<br />
  assign Qbar = ~Q; // Qbar is always just the inverse of Q<br />
  always @(posedge clock) // perform actions whenever the clock rises<br />
    Q = D;<br />
endmodule<br />
Because the D input is sampled on the clock edge, it must be valid for a period<br />
of time immediately before <strong>and</strong> immediately after the clock edge. The minimum<br />
time that the input must be valid before the clock edge is called the setup time; the<br />
setup time The minimum time that the input to a memory device must be valid before the clock edge.
B-54 Appendix B The Basics of Logic <strong>Design</strong><br />
FIGURE B.8.6 Setup <strong>and</strong> hold time requirements for a D flip-flop with a falling-edge trigger.<br />
The input must be stable for a period of time before the clock edge, as well as after the clock edge. The<br />
minimum time the signal must be stable before the clock edge is called the setup time, while the minimum<br />
time the signal must be stable after the clock edge is called the hold time. Failure to meet these minimum<br />
requirements can result in a situation where the output of the flip-flop may not be predictable, as described<br />
in Section B.11. Hold times are usually either 0 or very small <strong>and</strong> thus not a cause of worry.<br />
hold time The minimum time during which the input must be valid after the clock edge.<br />
minimum time during which it must be valid after the clock edge is called the hold<br />
time. Thus the inputs to any flip-flop (or anything built using flip-flops) must be valid<br />
during a window that begins at time t<sub>setup</sub> before the clock edge and ends at t<sub>hold</sub> after<br />
the clock edge, as shown in Figure B.8.6. Section B.11 talks about clocking and timing<br />
constraints, including the propagation delay through a flip-flop, in more detail.<br />
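The valid-window requirement can be expressed as a simple check. The function below is an illustration; the numeric setup and hold times are invented, not taken from any datasheet:

```python
def violates_window(transition_time, clock_edge, t_setup, t_hold):
    """True if a data transition lands inside the flip-flop's forbidden
    window [clock_edge - t_setup, clock_edge + t_hold]."""
    return clock_edge - t_setup <= transition_time <= clock_edge + t_hold

# All times in ns; setup/hold values are illustrative.
assert violates_window(9.8, 10.0, t_setup=0.5, t_hold=0.1)      # too close to the edge
assert not violates_window(9.0, 10.0, t_setup=0.5, t_hold=0.1)  # settles early enough
```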
We can use an array of D flip-flops to build a register that can hold a multibit<br />
datum, such as a byte or word. We used registers throughout our datapaths in<br />
Chapter 4.<br />
Register Files<br />
One structure that is central to our datapath is a register file. A register file consists<br />
of a set of registers that can be read <strong>and</strong> written by supplying a register number<br />
to be accessed. A register file can be implemented with a decoder for each read<br />
or write port <strong>and</strong> an array of registers built from D flip-flops. Because reading a<br />
register does not change any state, we need only supply a register number as an<br />
input, <strong>and</strong> the only output will be the data contained in that register. For writing a<br />
register we will need three inputs: a register number, the data to write, <strong>and</strong> a clock<br />
that controls the writing into the register. In Chapter 4, we used a register file that<br />
has two read ports <strong>and</strong> one write port. This register file is drawn as shown in Figure<br />
B.8.7. The read ports can be implemented with a pair of multiplexors, each of which<br />
is as wide as the number of bits in each register of the register file. Figure B.8.8<br />
shows the implementation of two register read ports for a 32-bit-wide register file.<br />
Implementing the write port is slightly more complex, since we can only change<br />
the contents of the designated register. We can do this by using a decoder to generate<br />
a signal that can be used to determine which register to write. Figure B.8.9 shows<br />
how to implement the write port for a register file. It is important to remember that<br />
the flip-flop changes state only on the clock edge. In Chapter 4, we hooked up write<br />
signals for the register file explicitly <strong>and</strong> assumed the clock shown in Figure B.8.9<br />
is attached implicitly.<br />
What happens if the same register is read <strong>and</strong> written during a clock cycle?<br />
Because the write of the register file occurs on the clock edge, the register will be
FIGURE B.8.7 A register file with two read ports <strong>and</strong> one write port has five inputs <strong>and</strong><br />
two outputs. The control input Write is shown in color.<br />
FIGURE B.8.8 The implementation of two read ports for a register file with n registers<br />
can be done with a pair of n-to-1 multiplexors, each 32 bits wide. The register read number<br />
signal is used as the multiplexor selector signal. Figure B.8.9 shows how the write port is implemented.
FIGURE B.8.9 The write port for a register file is implemented with a decoder that is<br />
used with the write signal to generate the C input to the registers. All three inputs (the register<br />
number, the data, <strong>and</strong> the write signal) will have setup <strong>and</strong> hold-time constraints that ensure that the correct<br />
data is written into the register file.<br />
valid during the time it is read, as we saw earlier in Figure B.7.2. The value returned<br />
will be the value written in an earlier clock cycle. If we want a read to return the<br />
value currently being written, additional logic in the register file or outside of it is<br />
needed. Chapter 4 makes extensive use of such logic.<br />
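The register file's port semantics can be summarized in a short behavioral model (a sketch with invented names): reads are combinational, the write takes effect only at the clock edge, and a read in the same cycle as a write therefore returns the old value:

```python
class RegisterFile:
    """Behavioral model of a register file with two combinational read
    ports and one clocked write port."""
    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, rn1, rn2):
        # Reading changes no state: register numbers in, data out.
        return self.regs[rn1], self.regs[rn2]

    def clock_edge(self, write, wn, wdata):
        # State changes only at the clock edge, and only if Write is asserted.
        if write:
            self.regs[wn] = wdata

rf = RegisterFile()
rf.clock_edge(write=True, wn=5, wdata=42)
assert rf.read(5, 0) == (42, 0)

# A read in the same cycle as a write sees the value written earlier:
old, _ = rf.read(5, 0)
rf.clock_edge(write=True, wn=5, wdata=99)
assert old == 42
assert rf.read(5, 0)[0] == 99   # the new value is visible next cycle
```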
Specifying Sequential Logic in Verilog<br />
To specify sequential logic in Verilog, we must underst<strong>and</strong> how to generate a<br />
clock, how to describe when a value is written into a register, <strong>and</strong> how to specify<br />
sequential control. Let us start by specifying a clock. A clock is not a predefined<br />
object in Verilog; instead, we generate a clock by using the Verilog notation #n<br />
before a statement; this causes a delay of n simulation time steps before the execution<br />
of the statement. In most Verilog simulators, it is also possible to generate<br />
a clock as an external input, allowing the user to specify at simulation time the<br />
number of clock cycles during which to run a simulation.<br />
The code in Figure B.8.10 implements a simple clock that is high or low for one<br />
simulation unit <strong>and</strong> then switches state. We use the delay capability <strong>and</strong> blocking<br />
assignment to implement the clock.
FIGURE B.8.10 A specification of a clock.<br />
Next, we must be able to specify the operation of an edge-triggered register. In<br />
Verilog, this is done by using the sensitivity list on an always block <strong>and</strong> specifying<br />
as a trigger either the positive or negative edge of a binary variable with the<br />
notation posedge or negedge, respectively. Hence, the following Verilog code<br />
causes register A to be written with the value b at the positive edge of the clock:<br />
FIGURE B.8.11 A MIPS register file written in behavioral Verilog. This register file writes on<br />
the rising clock edge.<br />
Throughout this chapter <strong>and</strong> the Verilog sections of Chapter 4, we will assume<br />
a positive edge-triggered design. Figure B.8.11 shows a Verilog specification of a<br />
MIPS register file that assumes two reads <strong>and</strong> one write, with only the write being<br />
clocked.
Check Yourself<br />
In the Verilog for the register file in Figure B.8.11, the output ports corresponding to<br />
the registers being read are assigned using a continuous assignment, but the register<br />
being written is assigned in an always block. Which of the following is the reason?<br />
a. There is no special reason. It was simply convenient.<br />
b. Because Data1 <strong>and</strong> Data2 are output ports <strong>and</strong> WriteData is an input port.<br />
c. Because reading is a combinational event, while writing is a sequential event.<br />
B.9 Memory Elements: SRAMs <strong>and</strong> DRAMs<br />
static random access memory (SRAM) A memory where data is stored statically (as in flip-flops) rather than dynamically (as in DRAM). SRAMs are faster than DRAMs, but less dense and more expensive per bit.<br />
Registers <strong>and</strong> register files provide the basic building blocks for small memories,<br />
but larger amounts of memory are built using either SRAMs (static r<strong>and</strong>om<br />
access memories) or DRAMs (dynamic r<strong>and</strong>om access memories). We first discuss<br />
SRAMs, which are somewhat simpler, <strong>and</strong> then turn to DRAMs.<br />
SRAMs<br />
SRAMs are simply integrated circuits that are memory arrays with (usually) a single<br />
access port that can provide either a read or a write. SRAMs have a fixed access<br />
time to any datum, though the read <strong>and</strong> write access characteristics often differ.<br />
An SRAM chip has a specific configuration in terms of the number of addressable<br />
locations, as well as the width of each addressable location. For example, a 4M × 8<br />
SRAM provides 4M entries, each of which is 8 bits wide. Thus it will have 22 address<br />
lines (since 4M = 2<sup>22</sup>), an 8-bit data output line, and an 8-bit single data input line.<br />
As with ROMs, the number of addressable locations is often called the height, with<br />
the number of bits per unit called the width. For a variety of technical reasons, the<br />
newest and fastest SRAMs are typically available in narrow configurations: ×1 and<br />
×4. Figure B.9.1 shows the input and output signals for a 2M × 16 SRAM.<br />
FIGURE B.9.1 A 2M × 16 SRAM showing the 21 address lines (2M = 2<sup>21</sup>) and 16 data<br />
inputs, the 3 control lines, and the 16 data outputs.
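The relationship between an SRAM's height and its address-line count is simple arithmetic, sketched below (the helper name is ours):

```python
def address_lines(height):
    """Number of address lines needed for `height` addressable locations."""
    n = (height - 1).bit_length()
    assert 1 << n == height, "height must be a power of two"
    return n

M = 1 << 20
assert address_lines(4 * M) == 22   # a 4M x 8 SRAM needs 22 address lines
assert address_lines(2 * M) == 21   # a 2M x 16 SRAM needs 21 address lines
```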
To initiate a read or write access, the Chip select signal must be made active.<br />
For reads, we must also activate the Output enable signal that controls whether or<br />
not the datum selected by the address is actually driven on the pins. The Output<br />
enable is useful for connecting multiple memories to a single-output bus <strong>and</strong> using<br />
Output enable to determine which memory drives the bus. The SRAM read access<br />
time is usually specified as the delay from the time that Output enable is true <strong>and</strong><br />
the address lines are valid until the time that the data is on the output lines. Typical<br />
read access times for SRAMs in 2004 varied from about 2–4 ns for the fastest CMOS<br />
parts, which tend to be somewhat smaller <strong>and</strong> narrower, to 8–20 ns for the typical<br />
largest parts, which in 2004 had more than 32 million bits of data. The dem<strong>and</strong> for<br />
low-power SRAMs for consumer products <strong>and</strong> digital appliances has grown greatly<br />
in the past five years; these SRAMs have much lower st<strong>and</strong>-by <strong>and</strong> access power,<br />
but usually are 5–10 times slower. Most recently, synchronous SRAMs—similar to<br />
the synchronous DRAMs, which we discuss in the next section—have also been<br />
developed.<br />
For writes, we must supply the data to be written <strong>and</strong> the address, as well as<br />
signals to cause the write to occur. When both the Write enable <strong>and</strong> Chip select are<br />
true, the data on the data input lines is written into the cell specified by the address.<br />
There are setup-time <strong>and</strong> hold-time requirements for the address <strong>and</strong> data lines,<br />
just as there were for D flip-flops <strong>and</strong> latches. In addition, the Write enable signal<br />
is not a clock edge but a pulse with a minimum width requirement. The time to<br />
complete a write is specified by the combination of the setup times, the hold times,<br />
<strong>and</strong> the Write enable pulse width.<br />
Large SRAMs cannot be built in the same way we build a register file because,<br />
unlike a register file where a 32-to-1 multiplexor might be practical, the 64K-to-1<br />
multiplexor that would be needed for a 64K × 1 SRAM is totally impractical.<br />
Rather than use a giant multiplexor, large memories are implemented with a shared<br />
output line, called a bit line, which multiple memory cells in the memory array can<br />
assert. To allow multiple sources to drive a single line, a three-state buffer (or tristate<br />
buffer) is used. A three-state buffer has two inputs—a data signal <strong>and</strong> an Output<br />
enable—<strong>and</strong> a single output, which is in one of three states: asserted, deasserted,<br />
or high impedance. The output of a tristate buffer is equal to the data input signal,<br />
either asserted or deasserted, if the Output enable is asserted, <strong>and</strong> is otherwise in a<br />
high-impedance state that allows another three-state buffer whose Output enable is<br />
asserted to determine the value of a shared output.<br />
Figure B.9.2 shows a set of three-state buffers wired to form a multiplexor with a<br />
decoded input. It is critical that the Output enable of at most one of the three-state<br />
buffers be asserted; otherwise, the three-state buffers may try to set the output line<br />
differently. By using three-state buffers in the individual cells of the SRAM, each<br />
cell that corresponds to a particular output can share the same output line. The use<br />
of a set of distributed three-state buffers is a more efficient implementation than a<br />
large centralized multiplexor. The three-state buffers are incorporated into the flip-flops<br />
that form the basic cells of the SRAM. Figure B.9.3 shows how a small 4 × 2<br />
SRAM might be built, using D latches with an input called Enable that controls the<br />
three-state output.
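The shared-bus behavior of three-state buffers can be modeled in software. In this sketch (ours), 'z' stands for the high-impedance state, and the one-hot select plays the role of the decoded Output enable signals in Figure B.9.2:

```python
Z = 'z'  # high-impedance state

def tristate(enable, data):
    """A three-state buffer: drives its data out only when enabled."""
    return data if enable else Z

def shared_line(drivers):
    """Resolve a bus driven by several three-state buffers.
    At most one driver may be enabled; the rest must float."""
    driven = [v for v in drivers if v != Z]
    assert len(driven) <= 1, "bus contention: two buffers driving the line"
    return driven[0] if driven else Z

select = [0, 0, 1, 0]   # one-hot, e.g. from a decoder
data = [1, 0, 1, 0]
out = shared_line([tristate(s, d) for s, d in zip(select, data)])
assert out == 1         # only buffer 2 drives the line
```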
FIGURE B.9.2 Four three-state buffers are used to form a multiplexor. Only one of the four<br />
Select inputs can be asserted. A three-state buffer with a deasserted Output enable has a high-impedance<br />
output that allows a three-state buffer whose Output enable is asserted to drive the shared output line.<br />
The design in Figure B.9.3 eliminates the need for an enormous multiplexor;<br />
however, it still requires a very large decoder <strong>and</strong> a correspondingly large number<br />
of word lines. For example, in a 4M × 8 SRAM, we would need a 22-to-4M decoder<br />
and 4M word lines (which are the lines used to enable the individual flip-flops)!<br />
To circumvent this problem, large memories are organized as rectangular arrays<br />
and use a two-step decoding process. Figure B.9.4 shows how a 4M × 8 SRAM<br />
might be organized internally using a two-step decode. As we will see, the two-level<br />
decoding process is quite important in underst<strong>and</strong>ing how DRAMs operate.<br />
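The two-step decode amounts to splitting the address into a row field and a column field. For the 2048 × 2048 array of a 4M × 1 part, an illustrative sketch:

```python
def split_address(addr, col_bits=11):
    """Split a 22-bit address into row and column selects for a
    2048 x 2048 array (11 row bits, 11 column bits)."""
    row = addr >> col_bits              # upper bits pick the word line
    col = addr & ((1 << col_bits) - 1)  # lower bits pick the column
    return row, col

row, col = split_address(0b1010101010111111111110)
assert row == 0b10101010101
assert col == 0b11111111110
# The row decoder drives 1 of 2048 word lines instead of a 4M-to-1 mux:
assert (row << 11) | col == 0b1010101010111111111110
```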
Recently we have seen the development of both synchronous SRAMs (SSRAMs)<br />
<strong>and</strong> synchronous DRAMs (SDRAMs). The key capability provided by synchronous<br />
RAMs is the ability to transfer a burst of data from a series of sequential addresses<br />
within an array or row. The burst is defined by a starting address, supplied in the<br />
usual fashion, <strong>and</strong> a burst length. The speed advantage of synchronous RAMs<br />
comes from the ability to transfer the bits in the burst without having to specify<br />
additional address bits. Instead, a clock is used to transfer the successive bits in the<br />
burst. The elimination of the need to specify the address for the transfers within<br />
the burst significantly improves the rate for transferring the block of data. Because<br />
of this capability, synchronous SRAMs <strong>and</strong> DRAMs are rapidly becoming the<br />
RAMs of choice for building memory systems in computers. We discuss the use of<br />
synchronous DRAMs in a memory system in more detail in the next section <strong>and</strong><br />
in Chapter 5.
DRAMs<br />
In a static RAM (SRAM), the value stored in a cell is kept on a pair of inverting gates,<br />
<strong>and</strong> as long as power is applied, the value can be kept indefinitely. In a dynamic<br />
RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor. A single<br />
transistor is then used to access this stored charge, either to read the value or to<br />
overwrite the charge stored there. Because DRAMs use only a single transistor per<br />
bit of storage, they are much denser <strong>and</strong> cheaper per bit. By comparison, SRAMs<br />
require four to six transistors per bit. Because DRAMs store the charge on a<br />
capacitor, it cannot be kept indefinitely <strong>and</strong> must periodically be refreshed. That is<br />
why this memory structure is called dynamic, as opposed to the static storage in a<br />
SRAM cell.<br />
To refresh the cell, we merely read its contents <strong>and</strong> write it back. The charge can<br />
be kept for several milliseconds, which might correspond to close to a million clock<br />
cycles. Today, single-chip memory controllers often h<strong>and</strong>le the refresh function<br />
independently of the processor. If every bit had to be read out of the DRAM <strong>and</strong><br />
then written back individually, with large DRAMs containing multiple megabytes,<br />
we would constantly be refreshing the DRAM, leaving no time for accessing it.<br />
Fortunately, DRAMs also use a two-level decoding structure, <strong>and</strong> this allows us<br />
to refresh an entire row (which shares a word line) with a read cycle followed<br />
immediately by a write cycle. Typically, refresh operations consume 1% to 2% of<br />
the active cycles of the DRAM, leaving the remaining 98% to 99% of the cycles<br />
available for reading <strong>and</strong> writing data.<br />
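The refresh-overhead claim can be sanity-checked with rough numbers; the retention time and cycle time below are illustrative, not from the text:

```python
def refresh_overhead(rows, retention_ns, cycle_ns):
    """Fraction of DRAM cycles spent refreshing: each of `rows` rows must
    be refreshed once per retention interval, one cycle per row."""
    cycles_per_interval = retention_ns / cycle_ns
    return rows / cycles_per_interval

# Illustrative: 2048 rows, 64 ms retention, 100 ns per refresh cycle
overhead = refresh_overhead(2048, 64e6, 100)
assert overhead < 0.02  # a fraction of a percent here; real parts land near 1-2%
```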
Elaboration: How does a DRAM read <strong>and</strong> write the signal stored in a cell? The<br />
transistor inside the cell is a switch, called a pass transistor, that allows the value stored<br />
on the capacitor to be accessed for either reading or writing. Figure B.9.5 shows how<br />
the single-transistor cell looks. The pass transistor acts like a switch: when the signal<br />
on the word line is asserted, the switch is closed, connecting the capacitor to the bit<br />
line. If the operation is a write, then the value to be written is placed on the bit line. If<br />
the value is a 1, the capacitor will be charged. If the value is a 0, then the capacitor will<br />
be discharged. Reading is slightly more complex, since the DRAM must detect a very<br />
small charge stored in the capacitor. Before activating the word line for a read, the bit<br />
line is charged to the voltage that is halfway between the low <strong>and</strong> high voltage. Then, by<br />
activating the word line, the charge on the capacitor is read out onto the bit line. This<br />
causes the bit line to move slightly toward the high or low direction, <strong>and</strong> this change is<br />
detected with a sense amplifi er, which can detect small changes in voltage.
FIGURE B.9.5 A single-transistor DRAM cell contains a capacitor that stores the cell<br />
contents <strong>and</strong> a transistor used to access the cell.<br />
FIGURE B.9.6 A 4M × 1 DRAM is built with a 2048 × 2048 array. The row access uses 11 bits to<br />
select a row, which is then latched in 2048 1-bit latches. A multiplexor chooses the output bit from these 2048<br />
latches. The RAS <strong>and</strong> CAS signals control whether the address lines are sent to the row decoder or column<br />
multiplexor.
DRAMs use a two-level decoder consisting of a row access followed by a column<br />
access, as shown in Figure B.9.6. The row access chooses one of a number of rows<br />
<strong>and</strong> activates the corresponding word line. The contents of all the columns in the<br />
active row are then stored in a set of latches. The column access then selects the<br />
data from the column latches. To save pins <strong>and</strong> reduce the package cost, the same<br />
address lines are used for both the row <strong>and</strong> column address; a pair of signals called<br />
RAS (Row Access Strobe) <strong>and</strong> CAS (Column Access Strobe) are used to signal the<br />
DRAM that either a row or column address is being supplied. Refresh is performed<br />
by simply reading the columns into the column latches <strong>and</strong> then writing the same<br />
values back. Thus, an entire row is refreshed in one cycle. The two-level addressing<br />
scheme, combined with the internal circuitry, makes DRAM access times much<br />
longer (by a factor of 5–10) than SRAM access times. In 2004, typical DRAM access<br />
times ranged from 45 to 65 ns; 256 Mbit DRAMs are in full production, and the<br />
first customer samples of 1 Gbit DRAMs became available in the first quarter of<br />
2004. The much lower cost per bit makes DRAM the choice for main memory,<br />
while the faster access time makes SRAM the choice for caches.<br />
You might observe that a 64M × 4 DRAM actually accesses 8K bits on every<br />
row access <strong>and</strong> then throws away all but 4 of those during a column access. DRAM<br />
designers have used the internal structure of the DRAM as a way to provide<br />
higher b<strong>and</strong>width out of a DRAM. This is done by allowing the column address to<br />
change without changing the row address, resulting in an access to other bits in the<br />
column latches. To make this process faster <strong>and</strong> more precise, the address inputs<br />
were clocked, leading to the dominant form of DRAM in use today: synchronous<br />
DRAM or SDRAM.<br />
Since about 1999, SDRAMs have been the memory chip of choice for most<br />
cache-based main memory systems. SDRAMs provide fast access to a series of bits<br />
within a row by sequentially transferring all the bits in a burst under the control<br />
of a clock signal. In 2004, DDRRAMs (Double Data Rate RAMs), which are called<br />
double data rate because they transfer data on both the rising <strong>and</strong> falling edge of<br />
an externally supplied clock, were the most heavily used form of SDRAMs. As we<br />
discuss in Chapter 5, these high-speed transfers can be used to boost the b<strong>and</strong>width<br />
available out of main memory to match the needs of the processor <strong>and</strong> caches.<br />
Error Correction<br />
Because of the potential for data corruption in large memories, most computer<br />
systems use some sort of error-checking code to detect possible corruption of data.<br />
One simple code that is heavily used is a parity code. In a parity code the number<br />
of 1s in a word is counted; the word has odd parity if the number of 1s is odd and<br />
even parity otherwise.<br />
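Parity is easy to compute in software. The sketch below uses the even-parity convention (one common choice) and shows why any single-bit flip is detected:

```python
def parity_bit(word):
    """Even-parity bit: chosen so that data plus parity together
    contain an even number of 1s."""
    return bin(word).count("1") % 2

data = 0b1011_0010            # four 1s: parity bit is 0
assert parity_bit(data) == 0

# A single flipped bit changes the parity and is therefore detected:
corrupted = data ^ 0b0000_1000
assert parity_bit(corrupted) != parity_bit(data)
```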
Outputs    NSlite   EWlite<br />
NSgreen    1        0<br />
EWgreen    0        1<br />
function, with labels on the arcs specifying the input condition as logic functions.<br />
Figure B.10.2 shows the graphical representation for this finite-state machine.<br />
FIGURE B.10.2 The graphical representation of the two-state traffic light controller. We<br />
simplified the logic functions on the state transitions. For example, the transition from NSgreen to EWgreen<br />
in the next-state table is (NScar · EWcar) + (¬NScar · EWcar), which is equivalent to EWcar.<br />
A finite-state machine can be implemented with a register to hold the current<br />
state <strong>and</strong> a block of combinational logic that computes the next-state function <strong>and</strong><br />
the output function. Figure B.10.3 shows how a finite-state machine with 4 bits of<br />
state, <strong>and</strong> thus up to 16 states, might look. To implement the finite-state machine<br />
in this way, we must first assign state numbers to the states. This process is called<br />
state assignment. For example, we could assign NSgreen to state 0 <strong>and</strong> EWgreen to<br />
state 1. The state register would contain a single bit. The next-state function would<br />
be given as<br />
NextState = (¬CurrentState · EWcar) + (CurrentState · ¬NScar)
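The controller's next-state and output functions can be simulated directly. This Python sketch follows the state assignment in the text (NSgreen = 0, EWgreen = 1); the function and signal names are ours:

```python
NSgreen, EWgreen = 0, 1

def next_state(state, ns_car, ew_car):
    """Next-state function of the two-state traffic-light controller:
    switch to EWgreen on EWcar, switch back to NSgreen on NScar."""
    if state == NSgreen:
        return EWgreen if ew_car else NSgreen
    return NSgreen if ns_car else EWgreen

def outputs(state):
    # Outputs depend only on the current state (a Moore machine):
    return {"NSlite": state == NSgreen, "EWlite": state == EWgreen}

s = NSgreen
s = next_state(s, ns_car=0, ew_car=1)   # an EW car arrives
assert s == EWgreen and outputs(s)["EWlite"]
s = next_state(s, ns_car=1, ew_car=1)   # an NS car arrives
assert s == NSgreen
```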
FIGURE B.10.4<br />
A Verilog version of the traffic light controller.<br />
Check Yourself<br />
What is the smallest number of states in a Moore machine for which a Mealy<br />
machine could have fewer states?<br />
a. Two, since there could be a one-state Mealy machine that might do the same<br />
thing.<br />
b. Three, since there could be a simple Moore machine that went to one of two<br />
different states <strong>and</strong> always returned to the original state after that. For such a<br />
simple machine, a two-state Mealy machine is possible.<br />
c. You need at least four states to exploit the advantages of a Mealy machine<br />
over a Moore machine.<br />
B.11 Timing Methodologies<br />
Throughout this appendix <strong>and</strong> in the rest of the text, we use an edge-triggered<br />
timing methodology. This timing methodology has an advantage in that it is<br />
simpler to explain <strong>and</strong> underst<strong>and</strong> than a level-triggered methodology. In this<br />
section, we explain this timing methodology in a little more detail <strong>and</strong> also<br />
introduce level-sensitive clocking. We conclude this section by briefly discussing
the issue of asynchronous signals <strong>and</strong> synchronizers, an important problem for<br />
digital designers.<br />
The purpose of this section is to introduce the major concepts in clocking<br />
methodology. The section makes some important simplifying assumptions; if you<br />
are interested in underst<strong>and</strong>ing timing methodology in more detail, consult one of<br />
the references listed at the end of this appendix.<br />
We use an edge-triggered timing methodology because it is simpler to explain<br />
<strong>and</strong> has fewer rules required for correctness. In particular, if we assume that all<br />
clocks arrive at the same time, we are guaranteed that a system with edge-triggered<br />
registers between blocks of combinational logic can operate correctly without races<br />
if we simply make the clock long enough. A race occurs when the contents of a<br />
state element depend on the relative speed of different logic elements. In an edge-triggered<br />
design, the clock cycle must be long enough to accommodate the path<br />
from one flip-flop through the combinational logic to another flip-flop where it<br />
must satisfy the setup-time requirement. Figure B.11.1 shows this requirement for<br />
a system using rising edge-triggered flip-flops. In such a system the clock period<br />
(or cycle time) must be at least as large as<br />
t<sub>prop</sub> + t<sub>combinational</sub> + t<sub>setup</sub><br />
for the worst-case values of these three delays, which are defined as follows:<br />
■ t<sub>prop</sub> is the time for a signal to propagate through a flip-flop; it is also sometimes called clock-to-Q.<br />
■ t<sub>combinational</sub> is the longest delay for any combinational logic (which by definition is surrounded by two flip-flops).<br />
■ t<sub>setup</sub> is the time before the rising clock edge that the input to a flip-flop must be valid.<br />
FIGURE B.11.1 In an edge-triggered design, the clock must be long enough to allow<br />
signals to be valid for the required setup time before the next clock edge. The time for a<br />
flip-flop input to propagate to the flip-flop outputs is t<sub>prop</sub>; the signal then takes t<sub>combinational</sub> to travel through the<br />
combinational logic and must be valid t<sub>setup</sub> before the next clock edge.
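The cycle-time constraint is just a sum of worst-case delays. The delay values below are illustrative, not measurements:

```python
def min_clock_period(t_prop, t_combinational, t_setup):
    """Minimum cycle time for an edge-triggered design, using
    worst-case values of the three delays."""
    return t_prop + t_combinational + t_setup

# Illustrative delays in ns
period = min_clock_period(t_prop=0.15, t_combinational=1.5, t_setup=0.1)
assert abs(period - 1.75) < 1e-9
# The corresponding maximum clock rate is 1 / period:
assert abs(1 / (period * 1e-9) - 571.4e6) < 1e6   # roughly 571 MHz
```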
clock skew The difference in absolute time between the times when two state elements see a clock edge.<br />
We make one simplifying assumption: the hold-time requirements are satisfied,<br />
which is almost never an issue with modern logic.<br />
One additional complication that must be considered in edge-triggered designs<br />
is clock skew. Clock skew is the difference in absolute time between when two state<br />
elements see a clock edge. Clock skew arises because the clock signal will often<br />
use two different paths, with slightly different delays, to reach two different state<br />
elements. If the clock skew is large enough, it may be possible for a state element to<br />
change <strong>and</strong> cause the input to another flip-flop to change before the clock edge is<br />
seen by the second flip-flop.<br />
Figure B.11.2 illustrates this problem, ignoring setup time <strong>and</strong> flip-flop<br />
propagation delay. To avoid incorrect operation, the clock period is increased to<br />
allow for the maximum clock skew. Thus, the clock period must be longer than<br />
t<sub>prop</sub> + t<sub>combinational</sub> + t<sub>setup</sub> + t<sub>skew</sub><br />
With this constraint on the clock period, the two clocks can also arrive in the<br />
opposite order, with the second clock arriving t<sub>skew</sub> earlier, and the circuit will work<br />
[Figure B.11.2 shows two D flip-flops separated by a combinational logic block with delay time Δ; the clock arrives at the first flip-flop at time t but reaches the second only after t + Δ.]
FIGURE B.11.2 Illustration of how clock skew can cause a race, leading to incorrect operation. Because of the difference in when the two flip-flops see the clock, the signal that is stored into the first flip-flop can race forward and change the input to the second flip-flop before the clock arrives at the second flip-flop.
level-sensitive<br />
clocking A timing<br />
methodology in which<br />
state changes occur<br />
at either high or low<br />
clock levels but are not<br />
instantaneous as such<br />
changes are in edge-triggered designs.
correctly. <strong>Design</strong>ers reduce clock-skew problems by carefully routing the clock<br />
signal to minimize the difference in arrival times. In addition, smart designers also<br />
provide some margin by making the clock a little longer than the minimum; this<br />
allows for variation in components as well as in the power supply. Since clock skew<br />
can also affect the hold-time requirements, minimizing the size of the clock skew<br />
is important.<br />
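The clock-period constraint above is easy to check mechanically. The following sketch computes the minimum clock period from the four delay terms; the delay values are made up for illustration and are not figures from the text:

```python
def min_clock_period(t_prop, t_combinational, t_setup, t_skew):
    """Minimum clock period for an edge-triggered design:
    a flip-flop output must propagate (t_prop), travel through the
    combinational logic (t_combinational), and be valid t_setup
    before the next edge, with t_skew of margin for differences
    in clock arrival times."""
    return t_prop + t_combinational + t_setup + t_skew

# Hypothetical delays in nanoseconds.
period = min_clock_period(t_prop=0.10, t_combinational=0.80,
                          t_setup=0.05, t_skew=0.05)
print(period)        # minimum period in ns
print(1.0 / period)  # maximum clock rate in GHz
```

Note how the skew term eats directly into the time budget available to the combinational logic, which is why designers work to minimize it.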
Edge-triggered designs have two drawbacks: they require extra logic <strong>and</strong> they<br />
may sometimes be slower. Just looking at the D flip-flop versus the level-sensitive<br />
latch that we used to construct the flip-flop shows that edge-triggered design<br />
requires more logic. An alternative is to use level-sensitive clocking. Because state<br />
changes in a level-sensitive methodology are not instantaneous, a level-sensitive<br />
scheme is slightly more complex <strong>and</strong> requires additional care to make it operate<br />
correctly.
B.11 Timing Methodologies B-75<br />
Level-Sensitive Timing<br />
In level-sensitive timing, the state changes occur at either high or low levels, but<br />
they are not instantaneous as they are in an edge-triggered methodology. Because of<br />
the noninstantaneous change in state, races can easily occur. To ensure that a level-sensitive design will also work correctly if the clock is slow enough, designers use two-phase clocking. Two-phase clocking is a scheme that makes use of two nonoverlapping clock signals. Since the two clocks, typically called φ1 and φ2, are nonoverlapping, at most one of the clock signals is high at any given time, as Figure B.11.3 shows. We
can use these two clocks to build a system that contains level-sensitive latches but is<br />
free from any race conditions, just as the edge-triggered designs were.<br />
[Figure B.11.3 shows the waveforms of Φ1 and Φ2, with nonoverlapping periods between the assertion of one clock and the other.]
FIGURE B.11.3 A two-phase clocking scheme showing the cycle of each clock and the nonoverlapping periods.
[Figure B.11.4 shows three latches alternately clocked by Φ1 and Φ2, separated by combinational logic blocks.]
FIGURE B.11.4 A two-phase timing scheme with alternating latches showing how the system operates on both clock phases. The output of a latch is stable on the opposite phase from its C input. Thus, the first block of combinational logic has a stable input during φ2, and its output is latched by φ2. The second (rightmost) combinational block operates in just the opposite fashion, with stable inputs during φ1. Thus, the delays through the combinational blocks determine the minimum time that the respective clocks must be asserted. The size of the nonoverlapping period is determined by the maximum clock skew and the minimum delay of any logic block.
One simple way to design such a system is to alternate the use of latches that are open on φ1 with latches that are open on φ2. Because both clocks are not asserted at the same time, a race cannot occur. If the input to a combinational block is a φ1 clock, then its output is latched by a φ2 clock, which is open only during φ2 when the input latch is closed and hence has a valid output. Figure B.11.4 shows how a system with two-phase timing and alternating latches operates. As in an edge-triggered design, we must pay attention to clock skew, particularly between the two
clock phases. By increasing the amount of nonoverlap between the two phases, we<br />
can reduce the potential margin of error. Thus, the system is guaranteed to operate<br />
correctly if each phase is long enough <strong>and</strong> if there is large enough nonoverlap<br />
between the phases.<br />
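The phase-by-phase behavior can be mimicked in software. Below is a minimal behavioral model (our own simplification, not a circuit from the text): level-sensitive latches alternately clocked by φ1 and φ2, stepped through the four intervals of a two-phase cycle. Because a latch output only changes while its clock is high, and the two clocks are never high together, a value can never race through two latches in one phase.

```python
class Latch:
    """Level-sensitive D latch: transparent while C is high, holding otherwise."""
    def __init__(self):
        self.q = 0
    def tick(self, d, c):
        if c:
            self.q = d
        return self.q

def two_phase_cycle(l1, l2, l3, din, f1, f2):
    """One full cycle as four intervals: phi1 high, gap, phi2 high, gap.
    l1 and l3 are open on phi1, l2 on phi2; f1 and f2 model the
    combinational blocks between the latches."""
    for phi1, phi2 in [(1, 0), (0, 0), (0, 1), (0, 0)]:
        l1.tick(din, phi1)
        l2.tick(f1(l1.q), phi2)   # l2's input is stable: l1 is closed on phi2
        l3.tick(f2(l2.q), phi1)   # l3's input is stable: l2 is closed on phi1
    return l3.q

inc = lambda x: x + 1
dbl = lambda x: 2 * x
l1, l2, l3 = Latch(), Latch(), Latch()
two_phase_cycle(l1, l2, l3, 5, inc, dbl)         # value still in flight
out = two_phase_cycle(l1, l2, l3, 5, inc, dbl)
print(out)   # 12 == dbl(inc(5)): one latch per phase, no races
```

The data advances exactly one latch per phase, so an input presented in one cycle emerges from the third latch during the φ1 phase of the next cycle.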
Asynchronous Inputs <strong>and</strong> Synchronizers<br />
By using a single clock or a two-phase clock, we can eliminate race conditions<br />
if clock-skew problems are avoided. Unfortunately, it is impractical to make an<br />
entire system function with a single clock <strong>and</strong> still keep the clock skew small.<br />
While the CPU may use a single clock, I/O devices will probably have their own<br />
clock. An asynchronous device may communicate with the CPU through a series<br />
of h<strong>and</strong>shaking steps. To translate the asynchronous input to a synchronous signal<br />
that can be used to change the state of a system, we need to use a synchronizer,<br />
whose inputs are the asynchronous signal <strong>and</strong> a clock <strong>and</strong> whose output is a signal<br />
synchronous with the input clock.<br />
Our first attempt to build a synchronizer uses an edge-triggered D flip-flop,<br />
whose D input is the asynchronous signal, as Figure B.11.5 shows. Because we<br />
communicate with a h<strong>and</strong>shaking protocol, it does not matter whether we detect<br />
the asserted state of the asynchronous signal on one clock or the next, since the<br />
signal will be held asserted until it is acknowledged. Thus, you might think that this<br />
simple structure is enough to sample the signal accurately, which would be the case<br />
except for one small problem.<br />
Asynchronous input<br />
Clock<br />
D Q<br />
Flip-flop<br />
C<br />
Synchronous output<br />
metastability<br />
A situation that occurs if<br />
a signal is sampled when<br />
it is not stable for the<br />
required setup <strong>and</strong> hold<br />
times, possibly causing<br />
the sampled value to<br />
fall in the indeterminate<br />
region between a high <strong>and</strong><br />
low value.<br />
FIGURE B.11.5 A synchronizer built from a D flip-flop is used to sample an asynchronous<br />
signal to produce an output that is synchronous with the clock. This “synchronizer” will not<br />
work properly!<br />
The problem is a situation called metastability. Suppose the asynchronous<br />
signal is transitioning between high <strong>and</strong> low when the clock edge arrives. Clearly,<br />
it is not possible to know whether the signal will be latched as high or low. That<br />
problem we could live with. Unfortunately, the situation is worse: when the signal<br />
that is sampled is not stable for the required setup <strong>and</strong> hold times, the flip-flop may<br />
go into a metastable state. In such a state, the output will not have a legitimate high<br />
or low value, but will be in the indeterminate region between them. Furthermore,
B.13 Concluding Remarks B-77<br />
the flip-flop is not guaranteed to exit this state in any bounded amount of time.<br />
Some logic blocks that look at the output of the flip-flop may see its output as 0,<br />
while others may see it as 1. This situation is called a synchronizer failure.<br />
In a purely synchronous system, synchronizer failure can be avoided by ensuring<br />
that the setup <strong>and</strong> hold times for a flip-flop or latch are always met, but this is<br />
impossible when the input is asynchronous. Instead, the only solution possible is to<br />
wait long enough before looking at the output of the flip-flop to ensure that its output<br />
is stable, <strong>and</strong> that it has exited the metastable state, if it ever entered it. How long is<br />
long enough? Well, the probability that the flip-flop will stay in the metastable state<br />
decreases exponentially, so after a very short time the probability that the flip-flop<br />
is in the metastable state is very low; however, the probability never reaches 0! So<br />
designers wait long enough such that the probability of a synchronizer failure is very<br />
low, <strong>and</strong> the time between such failures will be years or even thous<strong>and</strong>s of years.<br />
For most flip-flop designs, waiting for a period that is several times longer than<br />
the setup time makes the probability of synchronization failure very low. If the<br />
clock period is longer than the potential metastability period (which is likely), then a<br />
safe synchronizer can be built with two D flip-flops, as Figure B.11.6 shows. If you<br />
are interested in reading more about these problems, look into the references.<br />
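The "how long is long enough" question is usually answered with a standard reliability model: the mean time between synchronizer failures grows exponentially with the resolution time allowed before the output is sampled. The sketch below uses that common textbook formula; the shape of the formula is standard, but the device constants (tau, t_w) and rates are illustrative values, not from any datasheet:

```python
import math

def synchronizer_mtbf(t_resolve, tau, t_w, f_clk, f_data):
    """Mean time between synchronizer failures, in seconds.
    t_resolve: time allowed for metastability to resolve
    tau:       flip-flop resolution time constant
    t_w:       metastability window of the flip-flop
    f_clk:     clock frequency; f_data: asynchronous event rate."""
    return math.exp(t_resolve / tau) / (t_w * f_clk * f_data)

# Illustrative constants only.
tau, t_w = 50e-12, 100e-12
f_clk, f_data = 500e6, 1e6

# Waiting 1.5 ns longer (roughly what the two flip-flop synchronizer
# buys at this clock rate) multiplies the MTBF by e**30, about 1e13:
# the difference between failing every second and failing
# once in thousands of years.
short = synchronizer_mtbf(0.5e-9, tau, t_w, f_clk, f_data)
long_ = synchronizer_mtbf(2.0e-9, tau, t_w, f_clk, f_data)
print(long_ / short)
```

This exponential payoff is exactly why the second flip-flop in Figure B.11.6 is so effective: one extra clock period of waiting makes the failure probability astronomically small without ever driving it to zero.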
synchronizer failure<br />
A situation in which<br />
a flip-flop enters a<br />
metastable state <strong>and</strong><br />
where some logic blocks<br />
reading the output of the<br />
flip-flop see a 0 while<br />
others see a 1.<br />
Asynchronous input<br />
Clock<br />
D Q<br />
Flip-flop<br />
C<br />
D Q<br />
Flip-flop<br />
C<br />
Synchronous output<br />
FIGURE B.11.6 This synchronizer will work correctly if the period of metastability that<br />
we wish to guard against is less than the clock period. Although the output of the first flip-flop<br />
may be metastable, it will not be seen by any other logic element until the second clock, when the second D<br />
flip-flop samples the signal, which by that time should no longer be in a metastable state.<br />
Suppose we have a design with very large clock skew—longer than the register<br />
propagation time. Is it always possible for such a design to slow the clock down<br />
enough to guarantee that the logic operates properly?<br />
a. Yes, if the clock is slow enough the signals can always propagate <strong>and</strong> the<br />
design will work, even if the skew is very large.<br />
b. No, since it is possible that two registers see the same clock edge far enough<br />
apart that a register is triggered, <strong>and</strong> its outputs propagated <strong>and</strong> seen by a<br />
second register with the same clock edge.<br />
Check<br />
Yourself<br />
propagation time The<br />
time required for an input<br />
to a flip-flop to propagate to the outputs of the flip-flop.
B.14 Exercises B-81<br />
B.10 [15] §§B.2, B.3 Prove that a two-input multiplexor is also universal by<br />
showing how to build the NAND (or NOR) gate using a multiplexor.<br />
B.11 [5] §§4.2, B.2, B.3 Assume that X consists of 3 bits, x2 x1 x0. Write four<br />
logic functions that are true if <strong>and</strong> only if<br />
■ X contains only one 0<br />
■ X contains an even number of 0s<br />
■ X when interpreted as an unsigned binary number is less than 4<br />
■ X when interpreted as a signed (two’s complement) number is negative<br />
B.12 [5] §§4.2, B.2, B.3 Implement the four functions described in Exercise<br />
B.11 using a PLA.<br />
B.13 [5] §§4.2, B.2, B.3 Assume that X consists of 3 bits, x2 x1 x0, <strong>and</strong> Y<br />
consists of 3 bits, y2 y1 y0. Write logic functions that are true if <strong>and</strong> only if<br />
■ X < Y, where X and Y are thought of as unsigned binary numbers
■ X < Y, where X and Y are thought of as signed (two’s complement) numbers
■ X = Y
Use a hierarchical approach that can be extended to larger numbers of bits. Show how you can extend it to a 6-bit comparison.
B.14 [5] §§B.2, B.3 Implement a switching network that has two data inputs<br />
(A <strong>and</strong> B), two data outputs (C <strong>and</strong> D), <strong>and</strong> a control input (S). If S equals 1, the<br />
network is in pass-through mode, <strong>and</strong> C should equal A, <strong>and</strong> D should equal B. If<br />
S equals 0, the network is in crossing mode, <strong>and</strong> C should equal B, <strong>and</strong> D should<br />
equal A.<br />
B.15 [15] §§B.2, B.3 Derive the product-of-sums representation for E shown<br />
on page B-11 starting with the sum-of-products representation. You will need to<br />
use DeMorgan’s theorems.<br />
B.16 [30] §§B.2, B.3 Give an algorithm for constructing the sum-of- products<br />
representation for an arbitrary logic equation consisting of AND, OR, <strong>and</strong> NOT.<br />
The algorithm should be recursive <strong>and</strong> should not construct the truth table in the<br />
process.<br />
B.17 [5] §§B.2, B.3 Show a truth table for a multiplexor (inputs A, B, <strong>and</strong> S;<br />
output C ), using don’t cares to simplify the table where possible.
B.18 [5] §B.3 What is the function implemented by the following Verilog<br />
modules:<br />
module FUNC1 (I0, I1, S, out);
  input I0, I1;
  input S;
  output out;
  assign out = S ? I1 : I0;
endmodule
module FUNC2 (out, ctl, clk, reset);
  output [7:0] out;
  input ctl, clk, reset;
  reg [7:0] out;
  always @(posedge clk)
    if (reset) begin
      out <= 8'b0;
    end
    else if (ctl) begin
      out <= out + 1;
    end
    else begin
      out <= out - 1;
    end
endmodule
[Exercise figure: an Adder combines the 16-bit input In with the output of a 16-bit Register; the Register has Load, Clk, and Rst inputs and produces the 16-bit output Out.]
B.22 [20] §§B.3, B.4, B.5 Section 3.3 presents basic operation and possible implementations of multipliers. A basic unit of such implementations is a shift-and-add unit. Show a Verilog implementation for this unit. Show how you can use this unit to build a 32-bit multiplier.
B.23 [20] §§B.3, B.4, B.5 Repeat Exercise B.22, but for an unsigned divider<br />
rather than a multiplier.<br />
B.24 [15] §B.5 The ALU supported set on less than (slt) using just the sign<br />
bit of the adder. Let’s try a set on less than operation using the values −7_ten and 6_ten. To make it simpler to follow the example, let’s limit the binary representations to 4 bits: 1001_two and 0110_two.
1001_two − 0110_two = 1001_two + 1010_two = 0011_two
This result would suggest that −7 > 6, which is clearly wrong. Hence, we must
factor in overflow in the decision. Modify the 1-bit ALU in Figure B.5.10 on page<br />
B-33 to h<strong>and</strong>le slt correctly. Make your changes on a photocopy of this figure to<br />
save time.<br />
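The fix the exercise asks for can be expressed arithmetically: in an n-bit two's complement subtraction, "less than" is the sign bit of the result XORed with the overflow bit. The sketch below is our own illustration of that principle, not the gate-level modification the exercise requires:

```python
def slt_nbit(a, b, n=4):
    """Set-on-less-than for n-bit two's complement values a and b:
    the sign of (a - b), corrected by XOR with the overflow bit."""
    mask = (1 << n) - 1
    diff = (a - b) & mask
    sign = (diff >> (n - 1)) & 1
    # Overflow in a - b occurs when the operands have different signs
    # and the result's sign differs from a's sign.
    sa, sb = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    overflow = (sa != sb) and (sign != sa)
    return sign ^ int(overflow)

# -7 is 1001_two in 4 bits; 6 is 0110_two.
print(slt_nbit(0b1001, 0b0110))   # 1: -7 < 6, despite the 0011 sum
print(slt_nbit(0b0110, 0b1001))   # 0: 6 is not less than -7
```

The uncorrected sign bit of 0011_two is 0, which is the wrong answer; XORing in the overflow bit flips it to the correct result.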
B.25 [20] §B.6 A simple check for overflow during addition is to see if the<br />
CarryIn to the most significant bit is not the same as the CarryOut of the most<br />
significant bit. Prove that this check is the same as in Figure 3.2.<br />
B.26 [5] §B.6 Rewrite the equations on page B-44 for a carry-lookahead logic<br />
for a 16-bit adder using a new notation. First, use the names for the CarryIn signals<br />
of the individual bits of the adder. That is, use c4, c8, c12, … instead of C1, C2,<br />
C3, …. In addition, let P_i,j mean a propagate signal for bits i to j, and G_i,j mean a generate signal for bits i to j. For example, the equation
C2 = G1 + (P1 · G0) + (P1 · P0 · c0)
can be rewritten as
c8 = G_7,4 + (P_7,4 · G_3,0) + (P_7,4 · P_3,0 · c0)
This more general notation is useful in creating wider adders.<br />
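As a quick numeric check of this notation (the helper functions below are our own, not from the text), the group signals G_i,j and P_i,j can be built up from the bit-level g and p signals and the rewritten equation for c8 compared against the carry an actual addition produces:

```python
def bit_gp(a, b, i):
    """Bit-level generate and propagate for bit i of operands a, b."""
    ai, bi = (a >> i) & 1, (b >> i) & 1
    return ai & bi, ai | bi

def group_gp(a, b, lo, hi):
    """Group generate G[hi,lo] and propagate P[hi,lo] from bit-level g, p."""
    G, P = bit_gp(a, b, lo)
    for i in range(lo + 1, hi + 1):
        g, p = bit_gp(a, b, i)
        G, P = g | (p & G), p & P
    return G, P

def carry_into(a, b, c0, k):
    """CarryIn of bit k, computed by actually adding bits 0..k-1."""
    mask = (1 << k) - 1
    return (((a & mask) + (b & mask) + c0) >> k) & 1

a, b, c0 = 0b1011_0110_1101_0011, 0b0110_1011_0111_1010, 0
G74, P74 = group_gp(a, b, 4, 7)
G30, P30 = group_gp(a, b, 0, 3)
c8 = G74 | (P74 & G30) | (P74 & P30 & c0)
print(c8 == carry_into(a, b, c0, 8))   # the rewritten equation holds
```

The same pattern nests: c16 can be written over G_15,8 and G_7,0, which is exactly how Exercise B.27 builds a 64-bit adder from 16-bit blocks.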
B.27 [15] §B.6 Write the equations for the carry-lookahead logic for a 64-<br />
bit adder using the new notation from Exercise B.26 <strong>and</strong> using 16-bit adders as<br />
building blocks. Include a drawing similar to Figure B.6.3 in your solution.<br />
B.28 [10] §B.6 Now calculate the relative performance of adders. Assume that<br />
hardware corresponding to any equation containing only OR or AND terms, such<br />
as the equations for pi <strong>and</strong> gi on page B-40, takes one time unit T. Equations that<br />
consist of the OR of several AND terms, such as the equations for c1, c2, c3, <strong>and</strong><br />
c4 on page B-40, would thus take two time units, 2T. The reason is it would take T<br />
to produce the AND terms <strong>and</strong> then an additional T to produce the result of the<br />
OR. Calculate the numbers <strong>and</strong> performance ratio for 4-bit adders for both ripple<br />
carry <strong>and</strong> carry lookahead. If the terms in equations are further defined by other<br />
equations, then add the appropriate delays for those intermediate equations, <strong>and</strong><br />
continue recursively until the actual input bits of the adder are used in an equation.<br />
Include a drawing of each adder labeled with the calculated delays <strong>and</strong> the path of<br />
the worst-case delay highlighted.<br />
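As a sanity check on this delay model, the rules can be encoded directly: p_i and g_i are ready at 1T, and each OR-of-AND-terms equation adds 2T after its slowest input. The code below is one consistent reading of those rules (our own, not the book's solution; a count that treats a full adder stage differently would give different numbers):

```python
def ripple_delay(nbits):
    """c_{i+1} = g_i + p_i * c_i is an OR of AND terms, so each carry
    is ready 2T after its slowest input (g, p ready at 1T; c0 at 0T)."""
    t_gp, c = 1, 0
    for _ in range(nbits):
        c = max(t_gp, c) + 2
    return c   # time (in units of T) until the final carry-out is ready

def lookahead_delay():
    """The fully expanded carry, e.g. c4 = g3 + p3 g2 + p3 p2 g1 + ...,
    is a single OR-of-AND-terms equation over the p and g signals."""
    t_gp = 1
    return t_gp + 2

print(ripple_delay(4), lookahead_delay())   # 9 3 under this model
```

Under these assumptions the 4-bit carry-lookahead carry is ready three times sooner than the rippled one, and the gap widens as the width grows.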
B.29 [15] §B.6 This exercise is similar to Exercise B.28, but this time calculate<br />
the relative speeds of a 16-bit adder using ripple carry only, ripple carry of 4-bit<br />
groups that use carry lookahead, <strong>and</strong> the carry-lookahead scheme on page B-39.<br />
B.30 [15] §B.6 This exercise is similar to Exercises B.28 <strong>and</strong> B.29, but this<br />
time calculate the relative speeds of a 64-bit adder using ripple carry only, ripple<br />
carry of 4-bit groups that use carry lookahead, ripple carry of 16-bit groups that use<br />
carry lookahead, <strong>and</strong> the carry-lookahead scheme from Exercise B.27.<br />
B.31 [10] §B.6 Instead of thinking of an adder as a device that adds two<br />
numbers <strong>and</strong> then links the carries together, we can think of the adder as a hardware<br />
device that can add three inputs together (a_i, b_i, c_i) and produce two outputs (s, c_i+1). When adding two numbers together, there is little we can do with this
observation. When we are adding more than two oper<strong>and</strong>s, it is possible to reduce<br />
the cost of the carry. The idea is to form two independent sums, called S (sum bits)<br />
<strong>and</strong> C (carry bits). At the end of the process, we need to add C <strong>and</strong> S together<br />
using a normal adder. This technique of delaying carry propagation until the end<br />
of a sum of numbers is called carry save addition. The block drawing on the lower<br />
right of Figure B.14.1 (see below) shows the organization, with two levels of carry<br />
save adders connected by a single normal adder.<br />
Calculate the delays to add four 16-bit numbers using full carry-lookahead adders<br />
versus carry save with a carry-lookahead adder forming the final sum. (The time<br />
unit T in Exercise B.28 is the same.)
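To see that carry save addition really does defer carry propagation correctly, here is a small arithmetic model (ours, not the book's): each CSA level turns three operands into two, with sum bits S = a XOR b XOR c and carry bits C = majority, shifted left one position. Only the final addition propagates carries:

```python
def csa(a, b, c):
    """One level of carry save addition: three operands in, two out."""
    s = a ^ b ^ c                                 # per-bit sums, no carry chain
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carries, deferred
    return s, carry

def add4_carry_save(w, x, y, z):
    """Two CSA levels reduce four operands to two; a single normal
    (carry-propagate) adder then produces the final sum."""
    s1, c1 = csa(w, x, y)
    s2, c2 = csa(s1, c1, z)
    return s2 + c2

print(add4_carry_save(3, 5, 7, 11))   # 26, the same as 3 + 5 + 7 + 11
```

Each CSA level costs only one full-adder delay regardless of width, which is why the final carry-propagate add dominates the total delay the exercise asks you to calculate.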
First, show the block organization of the 16-bit carry save adders to add these 16<br />
terms, as shown on the right in Figure B.14.1. Then calculate the delays to add these<br />
16 numbers. Compare this time to the iterative multiplication scheme in Chapter<br />
3 but only assume 16 iterations using a 16-bit adder that has full carry lookahead<br />
whose speed was calculated in Exercise B.29.<br />
B.33 [10] §B.6 There are times when we want to add a collection of numbers<br />
together. Suppose you wanted to add four 4-bit numbers (A, B, E, F) using 1-bit<br />
full adders. Let’s ignore carry lookahead for now. You would likely connect the<br />
1-bit adders in the organization at the top of Figure B.14.1. Below the traditional<br />
organization is a novel organization of full adders. Try adding four numbers using<br />
both organizations to convince yourself that you get the same answer.<br />
B.34 [5] §B.6 First, show the block organization of the 16-bit carry save<br />
adders to add these 16 terms, as shown in Figure B.14.1. Assume that the time delay<br />
through each 1-bit adder is 2T. Calculate the time of adding four 4-bit numbers to<br />
the organization at the top versus the organization at the bottom of Figure B.14.1.<br />
B.35 [5] §B.8 Quite often, you would expect that given a timing diagram<br />
containing a description of changes that take place on a data input D <strong>and</strong> a clock<br />
input C (as in Figures B.8.3 <strong>and</strong> B.8.6 on pages B-52 <strong>and</strong> B-54, respectively), there<br />
would be differences between the output waveforms (Q) for a D latch and a D flip-flop.<br />
In a sentence or two, describe the circumstances (e.g., the nature of the inputs)<br />
for which there would not be any difference between the two output waveforms.<br />
B.36 [5] §B.8 Figure B.8.8 on page B-55 illustrates the implementation of the<br />
register file for the MIPS datapath. Pretend that a new register file is to be built,<br />
but that there are only two registers <strong>and</strong> only one read port, <strong>and</strong> that each register<br />
has only 2 bits of data. Redraw Figure B.8.8 so that every wire in your diagram<br />
corresponds to only 1 bit of data (unlike the diagram in Figure B.8.8, in which<br />
some wires are 5 bits and some wires are 32 bits). Redraw the registers using D flip-flops.<br />
You do not need to show how to implement a D flip-flop or a multiplexor.<br />
B.37 [10] §B.10 A friend would like you to build an “electronic eye” for use<br />
as a fake security device. The device consists of three lights lined up in a row,<br />
controlled by the outputs Left, Middle, <strong>and</strong> Right, which, if asserted, indicate that<br />
a light should be on. Only one light is on at a time, <strong>and</strong> the light “moves” from<br />
left to right <strong>and</strong> then from right to left, thus scaring away thieves who believe that<br />
the device is monitoring their activity. Draw the graphical representation for the<br />
finite-state machine used to specify the electronic eye. Note that the rate of the eye’s<br />
movement will be controlled by the clock speed (which should not be too great)<br />
<strong>and</strong> that there are essentially no inputs.<br />
B.38 [10] §B.10 Assign state numbers to the states of the finite-state machine<br />
you constructed for Exercise B.37 <strong>and</strong> write a set of logic equations for each of the<br />
outputs, including the next-state bits.
B.39 [15] §§B.2, B.8, B.10 Construct a 3-bit counter using three D flip-flops<br />
<strong>and</strong> a selection of gates. The inputs should consist of a signal that resets the<br />
counter to 0, called reset, <strong>and</strong> a signal to increment the counter, called inc. The<br />
outputs should be the value of the counter. When the counter has value 7 <strong>and</strong> is<br />
incremented, it should wrap around <strong>and</strong> become 0.<br />
B.40 [20] §B.10 A Gray code is a sequence of binary numbers with the property<br />
that no more than 1 bit changes in going from one element of the sequence to<br />
another. For example, here is a 3-bit binary Gray code: 000, 001, 011, 010, 110,<br />
111, 101, <strong>and</strong> 100. Using three D flip-flops <strong>and</strong> a PLA, construct a 3-bit Gray code<br />
counter that has two inputs: reset, which sets the counter to 000, <strong>and</strong> inc, which<br />
makes the counter go to the next value in the sequence. Note that the code is cyclic,<br />
so that the value after 100 in the sequence is 000.<br />
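The reflected binary construction generates exactly the sequence listed: the k-th code word is k XOR (k >> 1). The sketch below (our helper, not part of the exercise) also verifies the single-bit-change property, including the cyclic wrap from 100 back to 000:

```python
def gray_code(nbits):
    """Reflected binary Gray code: element k is k ^ (k >> 1)."""
    return [k ^ (k >> 1) for k in range(1 << nbits)]

codes = gray_code(3)
print([format(c, '03b') for c in codes])
# ['000', '001', '011', '010', '110', '111', '101', '100']

# Exactly one bit changes between consecutive elements, cyclically.
for i, c in enumerate(codes):
    nxt = codes[(i + 1) % len(codes)]
    assert bin(c ^ nxt).count('1') == 1
```

In the counter the PLA implements this next-value function, so only one flip-flop input ever differs from its current output on each inc.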
B.41 [25] §B.10 We wish to add a yellow light to our traffic light example on<br />
page B-68. We will do this by changing the clock to run at 0.25 Hz (a 4-second clock<br />
cycle time), which is the duration of a yellow light. To prevent the green <strong>and</strong> red lights<br />
from cycling too fast, we add a 30-second timer. The timer has a single input, called<br />
TimerReset, which restarts the timer, <strong>and</strong> a single output, called TimerSignal, which<br />
indicates that the 30-second period has expired. Also, we must redefine the traffic<br />
signals to include yellow. We do this by defining two output signals for each light:<br />
green <strong>and</strong> yellow. If the output NSgreen is asserted, the green light is on; if the output<br />
NSyellow is asserted, the yellow light is on. If both signals are off, the red light is on. Do<br />
not assert both the green <strong>and</strong> yellow signals at the same time, since American drivers<br />
will certainly be confused, even if European drivers underst<strong>and</strong> what this means! Draw<br />
the graphical representation for the finite-state machine for this improved controller.<br />
Choose names for the states that are different from the names of the outputs.<br />
B.42 [15] §B.10 Write down the next-state <strong>and</strong> output-function tables for the<br />
traffic light controller described in Exercise B.41.<br />
B.43 [15] §§B.2, B.10 Assign state numbers to the states in the traffic light<br />
example of Exercise B.41 <strong>and</strong> use the tables of Exercise B.42 to write a set of logic<br />
equations for each of the outputs, including the next-state outputs.<br />
B.44 [15] §§B.3, B.10 Implement the logic equations of Exercise B.43 as a<br />
PLA.<br />
§B.2, page B-8: No. If A = 1, C = 1, B = 0, the first is true, but the second is false.<br />
§B.3, page B-20: C.<br />
§B.4, page B-22: They are all exactly the same.<br />
§B.4, page B-26: A = 0, B = 1.<br />
§B.5, page B-38: 2.<br />
§B.6, page B-47: 1.<br />
§B.8, page B-58: c.<br />
§B.10, page B-72: b.<br />
§B.11, page B-77: b.<br />
Answers to<br />
Check Yourself
Index<br />
Note: Online information is listed by chapter <strong>and</strong> section number followed by page numbers (OL3.11-7). Page references preceded<br />
by a single letter with hyphen refer to appendices.<br />
1-bit ALU, B-26–29. See also Arithmetic<br />
logic unit (ALU)<br />
adder, B-27<br />
CarryOut, B-28<br />
for most significant bit, B-33<br />
illustrated, B-29<br />
logical unit for AND/OR, B-27<br />
performing AND, OR, <strong>and</strong> addition,<br />
B-31, B-33<br />
32-bit ALU, B-29–38. See also Arithmetic<br />
logic unit (ALU)<br />
defining in Verilog, B-35–38<br />
from 31 copies of 1-bit ALU, B-34<br />
illustrated, B-36<br />
ripple carry adder, B-29<br />
tailoring to MIPS, B-31–35<br />
with 32 1-bit ALUs, B-30<br />
32-bit immediate oper<strong>and</strong>s, 112–113<br />
7090/7094 hardware, OL3.11-7<br />
A<br />
Absolute references, 126<br />
Abstractions<br />
hardware/software interface, 22<br />
principle, 22<br />
to simplify design, 11<br />
Accumulator architectures, OL2.21-2<br />
Acronyms, 9<br />
Active matrix, 18<br />
add (Add), 64<br />
add.d (FP Add Double), A-73<br />
add.s (FP Add Single), A-74<br />
Add unsigned instruction, 180<br />
addi (Add Immediate), 64<br />
Addition, 178–182. See also Arithmetic<br />
binary, 178–179<br />
floating-point, 203–206, 211, A-73–74<br />
instructions, A-51<br />
oper<strong>and</strong>s, 179<br />
signific<strong>and</strong>s, 203<br />
speed, 182<br />
addiu (Add Imm.Unsigned), 119<br />
Address interleaving, 381<br />
Address select logic, D-24, D-25<br />
Address space, 428, 431<br />
extending, 479<br />
flat, 479<br />
ID (ASID), 446<br />
inadequate, OL5.17-6<br />
shared, 519–520<br />
single physical, 517<br />
unmapped, 450<br />
virtual, 446<br />
Address translation<br />
for ARM cortex-A8, 471<br />
defined, 429<br />
fast, 438–439<br />
for Intel core i7, 471<br />
TLB for, 438–439<br />
Address-control lines, D-26<br />
Addresses<br />
32-bit immediates, 113–116<br />
base, 69<br />
byte, 69<br />
defined, 68<br />
memory, 77<br />
virtual, 428–431, 450<br />
Addressing<br />
32-bit immediates, 113–116<br />
base, 116<br />
displacement, 116<br />
immediate, 116<br />
in jumps <strong>and</strong> branches, 113–116<br />
MIPS modes, 116–118<br />
PC-relative, 114, 116<br />
pseudodirect, 116<br />
register, 116<br />
x86 modes, 152<br />
Addressing modes, A-45–47<br />
desktop architectures, E-6<br />
addu (Add Unsigned), 64<br />
Advanced Vector Extensions (AVX), 225,<br />
227<br />
AGP, C-9<br />
Algol-60, OL2.21-7<br />
Aliasing, 444<br />
Alignment restriction, 69–70<br />
All-pairs N-body algorithm, C-65<br />
Alpha architecture<br />
bit count instructions, E-29<br />
floating-point instructions, E-28<br />
instructions, E-27–29<br />
no divide, E-28<br />
PAL code, E-28<br />
unaligned load-store, E-28<br />
VAX floating-point formats, E-29<br />
ALU control, 259–261. See also<br />
Arithmetic logic unit (ALU)<br />
bits, 260<br />
logic, D-6<br />
mapping to gates, D-4–7<br />
truth tables, D-5<br />
ALU control block, 263<br />
defined, D-4<br />
generating ALU control bits, D-6<br />
ALUOp, 260, D-6<br />
bits, 260, 261<br />
control signal, 263<br />
Amazon Web Services (AWS), 425<br />
AMD Opteron X4 (Barcelona), 543, 544<br />
AMD64, 151, 224, OL2.21-6<br />
Amdahl’s law, 401, 503<br />
corollary, 49<br />
defined, 49<br />
fallacy, 556<br />
<strong>and</strong> (AND), 64<br />
AND gates, B-12, D-7<br />
AND operation, 88<br />
AND operation, A-52, B-6<br />
<strong>and</strong>i (And Immediate), 64<br />
Annual failure rate (AFR), 418<br />
versus MTTF of disks, 419–420<br />
Antidependence, 338<br />
Antifuse, B-78<br />
Apple computer, OL1.12-7<br />
Apple iPad 2 A1395, 20<br />
logic board of, 20<br />
processor integrated circuit of, 21<br />
Application binary interface (ABI), 22<br />
Application programming interfaces<br />
(APIs)<br />
defined, C-4<br />
graphics, C-14<br />
Architectural registers, 347<br />
Arithmetic, 176–236<br />
addition, 178–182<br />
addition <strong>and</strong> subtraction, 178–182<br />
division, 189–195<br />
fallacies <strong>and</strong> pitfalls, 229–232<br />
floating-point, 196–222<br />
historical perspective, 236<br />
multiplication, 183–188<br />
parallelism <strong>and</strong>, 222–223<br />
Streaming SIMD Extensions <strong>and</strong><br />
advanced vector extensions in x86,<br />
224–225<br />
subtraction, 178–182<br />
subword parallelism, 222–223<br />
subword parallelism <strong>and</strong> matrix<br />
multiply, 225–228<br />
Arithmetic instructions. See also<br />
Instructions<br />
desktop RISC, E-11<br />
embedded RISC, E-14<br />
logical, 251<br />
MIPS, A-51–57<br />
operands, 66–73<br />
Arithmetic intensity, 541<br />
Arithmetic logic unit (ALU). See also<br />
ALU control; Control units<br />
1-bit, B-26–29<br />
32-bit, B-29–38<br />
before forwarding, 309<br />
branch datapath, 254<br />
hardware, 180<br />
memory-reference instruction use, 245<br />
for register values, 252<br />
R-format operations, 253<br />
signed-immediate input, 312<br />
ARM Cortex-A8, 244, 345–346<br />
address translation for, 471<br />
caches in, 472<br />
data cache miss rates for, 474<br />
memory hierarchies of, 471–475<br />
performance of, 473–475<br />
specification, 345<br />
TLB hardware for, 471<br />
ARM instructions, 145–147<br />
12-bit immediate field, 148<br />
addressing modes, 145<br />
block loads <strong>and</strong> stores, 149<br />
brief history, OL2.21-5<br />
calculations, 145–146<br />
compare <strong>and</strong> conditional branch,<br />
147–148<br />
condition field, 324<br />
data transfer, 146<br />
features, 148–149<br />
formats, 148<br />
logical, 149<br />
MIPS similarities, 146<br />
register-register, 146<br />
unique, E-36–37<br />
ARMv7, 62<br />
ARMv8, 158–159<br />
ARPANET, OL1.12-10<br />
Arrays, 415<br />
logic elements, B-18–19<br />
multiple dimension, 218<br />
pointers versus, 141–145<br />
procedures for setting to zero, 142<br />
ASCII<br />
binary numbers versus, 107<br />
character representation, 106<br />
defined, 106<br />
symbols, 109<br />
Assembler directives, A-5<br />
Assemblers, 124–126, A-10–17<br />
conditional code assembly, A-17<br />
defined, 14, A-4<br />
function, 125, A-10<br />
macros, A-4, A-15–17<br />
microcode, D-30<br />
number acceptance, 125<br />
object file, 125<br />
pseudoinstructions, A-17<br />
relocation information, A-13, A-14<br />
speed, A-13<br />
symbol table, A-12<br />
Assembly language, 15<br />
defined, 14, 123<br />
drawbacks, A-9–10<br />
floating-point, 212<br />
high-level languages versus, A-12<br />
illustrated, 15<br />
MIPS, 64, 84, A-45–80<br />
production of, A-8–9<br />
programs, 123<br />
translating into machine language, 84<br />
when to use, A-7–9<br />
Asserted signals, 250, B-4<br />
Associativity<br />
in caches, 405<br />
degree, increasing, 404, 455<br />
increasing, 409<br />
set, tag size versus, 409<br />
Atomic compare and swap, 123<br />
Atomic exchange, 121<br />
Atomic fetch-<strong>and</strong>-increment, 123<br />
Atomic memory operation, C-21<br />
Attribute interpolation, C-43–44<br />
Automobiles, computer application in, 4<br />
Average memory access time (AMAT),<br />
402<br />
calculating, 403<br />
B<br />
Backpatching, A-13<br />
Bandwidth, 30–31<br />
bisection, 532<br />
external to DRAM, 398<br />
memory, 380–381, 398<br />
network, 535<br />
Barrier synchronization, C-18<br />
defined, C-20<br />
for thread communication, C-34<br />
Base addressing, 69, 116<br />
Base registers, 69<br />
Basic block, 93<br />
Benchmarks, 538–540<br />
defined, 46<br />
Linpack, 538, OL3.11-4<br />
multicores, 522–529<br />
multiprocessor, 538–540<br />
NAS parallel, 540<br />
parallel, 539<br />
PARSEC suite, 540<br />
SPEC CPU, 46–48<br />
SPEC power, 48–49<br />
SPECrate, 538–539<br />
Stream, 548<br />
beq (Branch On Equal), 64<br />
bge (Branch Greater Than or Equal), 125<br />
bgt (Branch Greater Than), 125<br />
Biased notation, 79, 200<br />
Big-endian byte order, 70, A-43<br />
Binary numbers, 81–82
ASCII versus, 107<br />
conversion to decimal numbers, 76<br />
defined, 73<br />
Bisection bandwidth, 532<br />
Bit maps<br />
defined, 18, 73<br />
goal, 18<br />
storing, 18<br />
Bit-Interleaved Parity (RAID 3), OL5.11-<br />
5<br />
Bits<br />
ALUOp, 260, 261<br />
defined, 14<br />
dirty, 437<br />
guard, 220<br />
patterns, 220–221<br />
reference, 435<br />
rounding, 220<br />
sign, 75<br />
state, D-8<br />
sticky, 220<br />
valid, 383<br />
ble (Branch Less Than or Equal), 125<br />
Blocking assignment, B-24<br />
Blocking factor, 414<br />
Block-Interleaved Parity (RAID 4),<br />
OL5.11-5–5.11-6<br />
Blocks<br />
combinational, B-4<br />
defined, 376<br />
finding, 456<br />
flexible placement, 402–404<br />
least recently used (LRU), 409<br />
loads/stores, 149<br />
locating in cache, 407–408<br />
miss rate <strong>and</strong>, 391<br />
multiword, mapping addresses to, 390<br />
placement locations, 455–456<br />
placement strategies, 404<br />
replacement selection, 409<br />
replacement strategies, 457<br />
spatial locality exploitation, 391<br />
state, B-4<br />
valid data, 386<br />
blt (Branch Less Than), 125<br />
bne (Branch On Not Equal), 64<br />
Bonding, 28<br />
Boolean algebra, B-6<br />
Bounds check shortcut, 95<br />
Branch datapath<br />
ALU, 254<br />
operations, 254<br />
Branch delay slots<br />
defined, 322<br />
scheduling, 323<br />
Branch equal, 318<br />
Branch instructions, A-59–63<br />
jump instruction versus, 270<br />
list of, A-60–63<br />
pipeline impact, 317<br />
Branch not taken<br />
assumption, 318<br />
defined, 254<br />
Branch prediction<br />
as control hazard solution, 284<br />
buffers, 321, 322<br />
defined, 283<br />
dynamic, 284, 321–323<br />
static, 335<br />
Branch predictors<br />
accuracy, 322<br />
correlation, 324<br />
information from, 324<br />
tournament, 324<br />
Branch taken<br />
cost reduction, 318<br />
defined, 254<br />
Branch target<br />
addresses, 254<br />
buffers, 324<br />
Branches. See also Conditional<br />
branches<br />
addressing in, 113–116<br />
compiler creation, 91<br />
condition, 255<br />
decision, moving up, 318<br />
delayed, 96, 255, 284, 318–319, 322,<br />
324<br />
ending, 93<br />
execution in ID stage, 319<br />
pipelined, 318<br />
target address, 318<br />
unconditional, 91<br />
Branch-on-equal instruction, 268<br />
Bubble Sort, 140<br />
Bubbles, 314<br />
Bus-based coherent multiprocessors,<br />
OL6.15-7<br />
Buses, B-19<br />
Bytes<br />
addressing, 70<br />
order, 70, A-43<br />
C<br />
C.mmp, OL6.15-4<br />
C language<br />
assignment, compiling into MIPS,<br />
65–66<br />
compiling, 145, OL2.15-2–2.15-3<br />
compiling assignment with registers,<br />
67–68<br />
compiling while loops in, 92<br />
sort algorithms, 141<br />
translation hierarchy, 124<br />
translation to MIPS assembly language,<br />
65<br />
variables, 102<br />
C++ language, OL2.15-27, OL2.21-8<br />
Cache blocking and matrix multiply,<br />
475–476<br />
Cache coherence, 466–470<br />
coherence, 466<br />
consistency, 466<br />
enforcement schemes, 467–468<br />
implementation techniques,<br />
OL5.12-11–5.12-12<br />
migration, 467<br />
problem, 466, 467, 470<br />
protocol example, OL5.12-12–5.12-16<br />
protocols, 468<br />
replication, 468<br />
snooping protocol, 468–469<br />
snoopy, OL5.12-17<br />
state diagram, OL5.12-16<br />
Cache coherency protocol, OL5.12-<br />
12–5.12-16<br />
finite-state transition diagram, OL5.12-<br />
15<br />
functioning, OL5.12-14<br />
mechanism, OL5.12-14<br />
state diagram, OL5.12-16<br />
states, OL5.12-13<br />
write-back cache, OL5.12-15<br />
Cache controllers, 470<br />
coherent cache implementation<br />
techniques, OL5.12-11–5.12-12<br />
implementing, OL5.12-2<br />
snoopy cache coherence, OL5.12-17<br />
SystemVerilog, OL5.12-2<br />
Cache hits, 443<br />
Cache misses<br />
block replacement on, 457<br />
capacity, 459
compulsory, 459<br />
conflict, 459<br />
defined, 392<br />
direct-mapped cache, 404<br />
fully associative cache, 406<br />
handling, 392–393<br />
memory-stall clock cycles, 399<br />
reducing with flexible block placement,<br />
402–404<br />
set-associative cache, 405<br />
steps, 393<br />
in write-through cache, 393<br />
Cache performance, 398–417<br />
calculating, 400<br />
hit time and, 401–402<br />
impact on processor performance, 400<br />
Cache-aware instructions, 482<br />
Caches, 383–398. See also Blocks<br />
accessing, 386–389<br />
in ARM cortex-A8, 472<br />
associativity in, 405–406<br />
bits in, 390<br />
bits needed for, 390<br />
contents illustration, 387<br />
defined, 21, 383–384<br />
direct-mapped, 384, 385, 390, 402<br />
empty, 386–387<br />
FSM for controlling, 461–462<br />
fully associative, 403<br />
GPU, C-38<br />
inconsistent, 393<br />
index, 388<br />
in Intel Core i7, 472<br />
Intrinsity FastMATH example,<br />
395–398<br />
locating blocks in, 407–408<br />
locations, 385<br />
multilevel, 398, 410<br />
nonblocking, 472<br />
physically addressed, 443<br />
physically indexed, 443<br />
physically tagged, 443<br />
primary, 410, 417<br />
secondary, 410, 417<br />
set-associative, 403<br />
simulating, 478<br />
size, 389<br />
split, 397<br />
summary, 397–398<br />
tag field, 388<br />
tags, OL5.12-3, OL5.12-11<br />
virtual memory and TLB integration,<br />
440–441<br />
virtually addressed, 443<br />
virtually indexed, 443<br />
virtually tagged, 443<br />
write-back, 394, 395, 458<br />
write-through, 393, 395, 457<br />
writes, 393–395<br />
Callee, 98, 99<br />
Callee-saved register, A-23<br />
Caller, 98<br />
Caller-saved register, A-23<br />
Capabilities, OL5.17-8<br />
Capacity misses, 459<br />
Carry lookahead, B-38–47<br />
4-bit ALUs using, B-45<br />
adder, B-39<br />
fast, with first level of abstraction,<br />
B-39–40<br />
fast, with “infinite” hardware, B-38–39<br />
fast, with second level of abstraction,<br />
B-40–46<br />
plumbing analogy, B-42, B-43<br />
ripple carry speed versus, B-46<br />
summary, B-46–47<br />
Carry save adders, 188<br />
Cause register<br />
defined, 327<br />
fields, A-34, A-35<br />
CDC 6600, OL1.12-7, OL4.16-3<br />
Cell phones, 7<br />
Central processor unit (CPU). See also<br />
Processors<br />
classic performance equation, 36–40<br />
coprocessor 0, A-33–34<br />
defined, 19<br />
execution time, 32, 33–34<br />
performance, 33–35<br />
system, time, 32<br />
time, 399<br />
time measurements, 33–34<br />
user, time, 32<br />
Cg pixel shader program, C-15–17<br />
Characters<br />
ASCII representation, 106<br />
in Java, 109–111<br />
Chips, 19, 25, 26<br />
manufacturing process, 26<br />
Classes<br />
defined, OL2.15-15<br />
packages, OL2.15-21<br />
Clock cycles<br />
defined, 33<br />
memory-stall, 399<br />
number of registers and, 67<br />
worst-case delay and, 272<br />
Clock cycles per instruction (CPI), 35,<br />
282<br />
one level of caching, 410<br />
two levels of caching, 410<br />
Clock rate<br />
defined, 33<br />
frequency switched as function of, 41<br />
power and, 40<br />
Clocking methodology, 249–251, B-48<br />
edge-triggered, 249, B-48, B-73<br />
level-sensitive, B-74, B-75–76<br />
for predictability, 249<br />
Clocks, B-48–50<br />
edge, B-48, B-50<br />
in edge-triggered design, B-73<br />
skew, B-74<br />
specification, B-57<br />
synchronous system, B-48–49<br />
Cloud computing, 533<br />
defined, 7<br />
Cluster networking, 537–538, OL6.9-12<br />
Clusters, OL6.15-8–6.15-9<br />
defined, 30, 500, OL6.15-8<br />
isolation, 530<br />
organization, 499<br />
scientific computing on, OL6.15-8<br />
Cm*, OL6.15-4<br />
CMOS (complementary metal oxide<br />
semiconductor), 41<br />
Coarse-grained multithreading, 514<br />
Cobol, OL2.21-7<br />
Code generation, OL2.15-13<br />
Code motion, OL2.15-7<br />
Cold-start miss, 459<br />
Collision misses, 459<br />
Column major order, 413<br />
Combinational blocks, B-4<br />
Combinational control units, D-4–8<br />
Combinational elements, 248<br />
Combinational logic, 249, B-3, B-9–20<br />
arrays, B-18–19<br />
decoders, B-9<br />
defined, B-5<br />
don’t cares, B-17–18<br />
multiplexors, B-10<br />
ROMs, B-14–16<br />
two-level, B-11–14<br />
Verilog, B-23–26
Commercial computer development,<br />
OL1.12-4–1.12-10<br />
Commit units<br />
buffer, 339–340<br />
defined, 339–340<br />
in update control, 343<br />
Common case fast, 11<br />
Common subexpression elimination,<br />
OL2.15-6<br />
Communication, 23–24<br />
overhead, reducing, 44–45<br />
thread, C-34<br />
Compact code, OL2.21-4<br />
Comparison instructions, A-57–59<br />
floating-point, A-74–75<br />
list of, A-57–59<br />
Comparisons, 93<br />
constant operands in, 93<br />
signed versus unsigned, 94–95<br />
Compilers, 123–124<br />
branch creation, 92<br />
brief history, OL2.21-9<br />
conservative, OL2.15-6<br />
defined, 14<br />
front end, OL2.15-3<br />
function, 14, 123–124, A-5–6<br />
high-level optimizations, OL2.15-4<br />
ILP exploitation, OL4.16-5<br />
Just In Time (JIT), 132<br />
machine language production, A-8–9,<br />
A-10<br />
optimization, 141, OL2.21-9<br />
speculation, 333–334<br />
structure, OL2.15-2<br />
Compiling<br />
C assignment statements, 65–66<br />
C language, 92–93, 145, OL2.15-<br />
2–2.15-3<br />
floating-point programs, 214–217<br />
if-then-else, 91<br />
in Java, OL2.15-19<br />
procedures, 98, 101–102<br />
recursive procedures, 101–102<br />
while loops, 92–93<br />
Compressed sparse row (CSR) matrix,<br />
C-55, C-56<br />
Compulsory misses, 459<br />
Computer architects, 11–12<br />
abstraction to simplify design, 11<br />
common case fast, 11<br />
dependability via redundancy, 12<br />
hierarchy of memories, 12<br />
Moore’s law, 11<br />
parallelism, 12<br />
pipelining, 12<br />
prediction, 12<br />
Computers<br />
application classes, 5–6<br />
applications, 4<br />
arithmetic for, 176–236<br />
characteristics, OL1.12-12<br />
commercial development, OL1.12-<br />
4–1.12-10<br />
component organization, 17<br />
components, 17, 177<br />
design measure, 53<br />
desktop, 5<br />
embedded, 5, A-7<br />
first, OL1.12-2–1.12-4<br />
in information revolution, 4<br />
instruction representation, 80–87<br />
performance measurement, OL1.12-10<br />
PostPC Era, 6–7<br />
principles, 86<br />
servers, 5<br />
Condition field, 324<br />
Conditional branches<br />
ARM, 147–148<br />
changing program counter with, 324<br />
compiling if-then-else into, 91<br />
defined, 90<br />
desktop RISC, E-16<br />
embedded RISC, E-16<br />
implementation, 96<br />
in loops, 115<br />
PA-RISC, E-34, E-35<br />
PC-relative addressing, 114<br />
RISC, E-10–16<br />
SPARC, E-10–12<br />
Conditional move instructions, 324<br />
Conflict misses, 459<br />
Constant memory, C-40<br />
Constant operands, 72–73<br />
in comparisons, 93<br />
frequent occurrence, 72<br />
Constant-manipulating instructions,<br />
A-57<br />
Content Addressable Memory (CAM),<br />
408<br />
Context switch, 446<br />
Control<br />
ALU, 259–261<br />
challenge, 325–326<br />
finishing, 269–270<br />
forwarding, 307<br />
FSM, D-8–21<br />
implementation, optimizing, D-27–28<br />
for jump instruction, 270<br />
mapping to hardware, D-2–32<br />
memory, D-26<br />
organizing, to reduce logic, D-31–32<br />
pipelined, 300–303<br />
Control flow graphs, OL2.15-9–2.15-10<br />
illustrated examples, OL2.15-9,<br />
OL2.15-10<br />
Control functions<br />
ALU, mapping to gates, D-4–7<br />
defining, 264<br />
PLA, implementation, D-7,<br />
D-20–21<br />
ROM, encoding, D-18–19<br />
for single-cycle implementation, 269<br />
Control hazards, 281–282, 316–325<br />
branch delay reduction, 318–319<br />
branch not taken assumption, 318<br />
branch prediction as solution, 284<br />
delayed decision approach, 284<br />
dynamic branch prediction,<br />
321–323<br />
logic implementation in Verilog,<br />
OL4.13-8<br />
pipeline stalls as solution, 282<br />
pipeline summary, 324<br />
simplicity, 317<br />
solutions, 282<br />
static multiple-issue processors and,<br />
335–336<br />
Control lines<br />
asserted, 264<br />
in datapath, 263<br />
execution/address calculation, 300<br />
final three stages, 303<br />
instruction decode/register file read,<br />
300<br />
instruction fetch, 300<br />
memory access, 302<br />
setting of, 264<br />
values, 300<br />
write-back, 302<br />
Control signals<br />
ALUOp, 263<br />
defined, 250<br />
effect of, 264<br />
multi-bit, 264<br />
pipelined datapaths with, 300–303<br />
truth tables, D-14
Control units, 247. See also Arithmetic<br />
logic unit (ALU)<br />
address select logic, D-24, D-25<br />
combinational, implementing, D-4–8<br />
with explicit counter, D-23<br />
illustrated, 265<br />
logic equations, D-11<br />
main, designing, 261–264<br />
as microcode, D-28<br />
MIPS, D-10<br />
next-state outputs, D-10, D-12–13<br />
output, 259–261, D-10<br />
Conversion instructions, A-75–76<br />
Cooperative thread arrays (CTAs), C-30<br />
Coprocessors, A-33–34<br />
defined, 218<br />
move instructions, A-71–72<br />
Core MIPS instruction set, 236. See also<br />
MIPS<br />
abstract view, 246<br />
desktop RISC, E-9–11<br />
implementation, 244–248<br />
implementation illustration, 247<br />
overview, 245<br />
subset, 244<br />
Cores<br />
defined, 43<br />
number per chip, 43<br />
Correlation predictor, 324<br />
Cosmic Cube, OL6.15-7<br />
Count register, A-34<br />
CPU, 9<br />
Cray computers, OL3.11-5–3.11-6<br />
Critical word first, 392<br />
Crossbar networks, 535<br />
CTSS (Compatible Time-Sharing<br />
System), OL5.18-9<br />
CUDA programming environment, 523,<br />
C-5<br />
barrier synchronization, C-18, C-34<br />
development, C-17, C-18<br />
hierarchy of thread groups, C-18<br />
kernels, C-19, C-24<br />
key abstractions, C-18<br />
paradigm, C-19–23<br />
parallel plus-scan template, C-61<br />
per-block shared memory, C-58<br />
plus-reduction implementation, C-63<br />
programs, C-6, C-24<br />
scalable parallel programming with,<br />
C-17–23<br />
shared memories, C-18<br />
threads, C-36<br />
Cyclic redundancy check, 423<br />
Cylinder, 381<br />
D<br />
D flip-flops, B-51, B-53<br />
D latches, B-51, B-52<br />
Data bits, 421<br />
Data flow analysis, OL2.15-11<br />
Data hazards, 278, 303–316. See also<br />
Hazards<br />
forwarding, 278, 303–316<br />
load-use, 280, 318<br />
stalls and, 313–316<br />
Data layout directives, A-14<br />
Data movement instructions, A-70–73<br />
Data parallel problem decomposition,<br />
C-17, C-18<br />
Data race, 121<br />
Data segment, A-13<br />
Data selectors, 246<br />
Data transfer instructions. See also<br />
Instructions<br />
defined, 68<br />
load, 68<br />
offset, 69<br />
store, 71<br />
Datacenters, 7<br />
Data-level parallelism, 508<br />
Datapath elements<br />
defined, 251<br />
sharing, 256<br />
Datapaths<br />
branch, 254<br />
building, 251–259<br />
control signal truth tables, D-14<br />
control unit, 265<br />
defined, 19<br />
design, 251<br />
exception h<strong>and</strong>ling, 329<br />
for fetching instructions, 253<br />
for hazard resolution via forwarding,<br />
311<br />
for jump instruction, 270<br />
for memory instructions, 256<br />
for MIPS architecture, 257<br />
in operation for branch-on-equal<br />
instruction, 268<br />
in operation for load instruction, 267<br />
in operation for R-type instruction,<br />
266<br />
operation of, 264–269<br />
pipelined, 286–303<br />
for R-type instructions, 256, 264–265<br />
single, creating, 256<br />
single-cycle, 283<br />
static two-issue, 336<br />
Deasserted signals, 250, B-4<br />
Debugging information, A-13<br />
DEC PDP-8, OL2.21-3<br />
Decimal numbers<br />
binary number conversion to, 76<br />
defined, 73<br />
Decision-making instructions, 90–96<br />
Decoders, B-9<br />
two-level, B-65<br />
Decoding machine language, 118–120<br />
Defect, 26<br />
Delayed branches, 96. See also Branches<br />
as control hazard solution, 284<br />
defined, 255<br />
embedded RISCs and, E-23<br />
for five-stage pipelines, 26, 323–324<br />
reducing, 318–319<br />
scheduling limitations, 323<br />
Delayed decision, 284<br />
DeMorgan’s theorems, B-11<br />
Denormalized numbers, 222<br />
Dependability via redundancy, 12<br />
Dependable memory hierarchy, 418–423<br />
failure, defining, 418<br />
Dependences<br />
between pipeline registers, 308<br />
between pipeline registers and ALU<br />
inputs, 308<br />
bubble insertion <strong>and</strong>, 314<br />
detection, 306–308<br />
name, 338<br />
sequence, 304<br />
Design<br />
compromises and, 161<br />
datapath, 251<br />
digital, 354<br />
logic, 248–251, B-1–79<br />
main control unit, 261–264<br />
memory hierarchy, challenges, 460<br />
pipelining instruction sets, 277<br />
Desktop and server RISCs. See also<br />
Reduced instruction set computer<br />
(RISC) architectures
addressing modes, E-6<br />
architecture summary, E-4<br />
arithmetic/logical instructions, E-11<br />
conditional branches, E-16<br />
constant extension summary, E-9<br />
control instructions, E-11<br />
conventions equivalent to MIPS core,<br />
E-12<br />
data transfer instructions, E-10<br />
features added to, E-45<br />
floating-point instructions, E-12<br />
instruction formats, E-7<br />
multimedia extensions, E-16–18<br />
multimedia support, E-18<br />
types of, E-3<br />
Desktop computers, defined, 5<br />
Device driver, OL6.9-5<br />
DGEMM (Double precision General<br />
Matrix Multiply), 225, 352, 413, 553<br />
cache blocked version of, 415<br />
optimized C version of, 226, 227, 476<br />
performance, 354, 416<br />
Dicing, 27<br />
Dies, 26, 26–27<br />
Digital design pipeline, 354<br />
Digital signal-processing (DSP)<br />
extensions, E-19<br />
DIMMs (dual inline memory modules),<br />
OL5.17-5<br />
Direct Data IO (DDIO), OL6.9-6<br />
Direct memory access (DMA), OL6.9-4<br />
Direct3D, C-13<br />
Direct-mapped caches. See also Caches<br />
address portions, 407<br />
choice of, 456<br />
defined, 384, 402<br />
illustrated, 385<br />
memory block location, 403<br />
misses, 405<br />
single comparator, 407<br />
total number of bits, 390<br />
Dirty bit, 437<br />
Dirty pages, 437<br />
Disk memory, 381–383<br />
Displacement addressing, 116<br />
Distributed Block-Interleaved Parity<br />
(RAID 5), OL5.11-6<br />
div (Divide), A-52<br />
div.d (FP Divide Double), A-76<br />
div.s (FP Divide Single), A-76<br />
Divide algorithm, 190<br />
Dividend, 189<br />
Division, 189–195<br />
algorithm, 191<br />
dividend, 189<br />
divisor, 189<br />
Divisor, 189<br />
divu (Divide Unsigned), A-52. See also<br />
Arithmetic<br />
faster, 194<br />
floating-point, 211, A-76<br />
hardware, 189–192<br />
hardware, improved version, 192<br />
instructions, A-52–53<br />
in MIPS, 194<br />
operands, 189<br />
quotient, 189<br />
remainder, 189<br />
signed, 192–194<br />
SRT, 194<br />
Don’t cares, B-17–18<br />
example, B-17–18<br />
term, 261<br />
Double data rate (DDR), 379<br />
Double Data Rate RAMs (DDRRAMs),<br />
379–380, B-65<br />
Double precision. See also Single precision<br />
defined, 198<br />
FMA, C-45–46<br />
GPU, C-45–46, C-74<br />
representation, 201<br />
Double words, 152<br />
Dual inline memory modules (DIMMs),<br />
381<br />
Dynamic branch prediction, 321–323. See<br />
also Control hazards<br />
branch prediction buffer, 321<br />
loops and, 321–323<br />
Dynamic hardware predictors, 284<br />
Dynamic multiple-issue processors, 333,<br />
339–341. See also Multiple issue<br />
pipeline scheduling, 339–341<br />
superscalar, 339<br />
Dynamic pipeline scheduling, 339–341<br />
commit unit, 339–340<br />
concept, 339–340<br />
hardware-based speculation, 341<br />
primary units, 340<br />
reorder buffer, 343<br />
reservation station, 339–340<br />
Dynamic r<strong>and</strong>om access memory<br />
(DRAM), 378, 379–381, B-63–65<br />
b<strong>and</strong>width external to, 398<br />
cost, 23<br />
defined, 19, B-63<br />
DIMM, OL5.17-5<br />
Double Data Rate (DDR), 379–380<br />
early board, OL5.17-4<br />
GPU, C-37–38<br />
growth of capacity, 25<br />
history, OL5.17-2<br />
internal organization of, 380<br />
pass transistor, B-63<br />
SIMM, OL5.17-5, OL5.17-6<br />
single-transistor, B-64<br />
size, 398<br />
speed, 23<br />
synchronous (SDRAM), 379–380,<br />
B-60, B-65<br />
two-level decoder, B-65<br />
Dynamically linked libraries (DLLs),<br />
129–131<br />
defined, 129<br />
lazy procedure linkage version, 130<br />
E<br />
Early restart, 392<br />
Edge-triggered clocking methodology,<br />
249, 250, B-48, B-73<br />
advantage, B-49<br />
clocks, B-73<br />
drawbacks, B-74<br />
illustrated, B-50<br />
rising edge/falling edge, B-48<br />
EDSAC (Electronic Delay Storage<br />
Automatic Calculator), OL1.12-3,<br />
OL5.17-2<br />
Eispack, OL3.11-4<br />
Electrically erasable programmable read-only<br />
memory (EEPROM), 381<br />
Elements<br />
combinational, 248<br />
datapath, 251, 256<br />
memory, B-50–58<br />
state, 248, 250, 252, B-48, B-50<br />
Embedded computers, 5<br />
application requirements, 6<br />
defined, A-7<br />
design, 5<br />
growth, OL1.12-12–1.12-13<br />
Embedded Microprocessor Benchmark<br />
Consortium (EEMBC), OL1.12-12
Embedded RISCs. See also Reduced<br />
instruction set computer (RISC)<br />
architectures<br />
addressing modes, E-6<br />
architecture summary, E-4<br />
arithmetic/logical instructions, E-14<br />
conditional branches, E-16<br />
constant extension summary, E-9<br />
control instructions, E-15<br />
data transfer instructions, E-13<br />
delayed branch <strong>and</strong>, E-23<br />
DSP extensions, E-19<br />
general purpose registers, E-5<br />
instruction conventions, E-15<br />
instruction formats, E-8<br />
multiply-accumulate approaches, E-19<br />
types of, E-4<br />
Encoding<br />
defined, D-31<br />
floating-point instruction, 213<br />
MIPS instruction, 83, 119, A-49<br />
ROM control function, D-18–19<br />
ROM logic function, B-15<br />
x86 instruction, 155–156<br />
ENIAC (Electronic Numerical Integrator<br />
and Calculator), OL1.12-2, OL1.12-<br />
3, OL5.17-2<br />
EPIC, OL4.16-5<br />
Error correction, B-65–67<br />
Error Detecting and Correcting Code<br />
(RAID 2), OL5.11-5<br />
Error detection, B-66<br />
Error detection code, 420<br />
Ethernet, 23<br />
EX stage<br />
load instructions, 292<br />
overflow exception detection, 328<br />
store instructions, 294<br />
Exabyte, 6<br />
Exception enable, 447<br />
Exception h<strong>and</strong>lers, A-36–38<br />
defined, A-35<br />
return from, A-38<br />
Exception program counters (EPCs), 326<br />
address capture, 331<br />
copying, 181<br />
defined, 181, 327<br />
in restart determination, 326–327<br />
transferring, 182<br />
Exceptions, 325–332, A-33–38<br />
association, 331–332<br />
datapath with controls for h<strong>and</strong>ling,<br />
329<br />
defined, 180, 326<br />
detecting, 326<br />
event types and, 326<br />
imprecise, 331–332<br />
instructions, A-80<br />
interrupts versus, 325–326<br />
in MIPS architecture, 326–327<br />
overflow, 329<br />
PC, 445, 446–447<br />
pipelined computer example, 328<br />
in pipelined implementation, 327–332<br />
precise, 332<br />
reasons for, 326–327<br />
result due to overflow in add<br />
instruction, 330<br />
saving/restoring stage on, 450<br />
Exclusive OR (XOR) instructions, A-57<br />
Executable files, A-4<br />
defined, 126<br />
linker production, A-19<br />
Execute or address calculation stage, 292<br />
Execute/address calculation<br />
control line, 300<br />
load instruction, 292<br />
store instruction, 292<br />
Execution time<br />
as valid performance measure, 51<br />
CPU, 32, 33–34<br />
pipelining and, 286<br />
Explicit counters, D-23, D-26<br />
Exponents, 197–198<br />
External labels, A-10<br />
F<br />
Facilities, A-14–17<br />
Failures, synchronizer, B-77<br />
Fallacies. See also Pitfalls<br />
add immediate unsigned, 227<br />
Amdahl’s law, 556<br />
arithmetic, 229–232<br />
assembly language for performance,<br />
159–160<br />
commercial binary compatibility<br />
importance, 160<br />
defined, 49<br />
GPUs, C-72–74, C-75<br />
low utilization uses little power, 50<br />
peak performance, 556<br />
pipelining, 355–356<br />
powerful instructions mean higher<br />
performance, 159<br />
right shift, 229<br />
False sharing, 469<br />
Fast carry<br />
with “infinite” hardware, B-38–39<br />
with first level of abstraction, B-39–40<br />
with second level of abstraction,<br />
B-40–46<br />
Fast Fourier Transforms (FFT), C-53<br />
Fault avoidance, 419<br />
Fault forecasting, 419<br />
Fault tolerance, 419<br />
Fermi architecture, 523, 552<br />
Field programmable devices (FPDs), B-78<br />
Field programmable gate arrays (FPGAs),<br />
B-78<br />
Fields<br />
Cause register, A-34, A-35<br />
defined, 82<br />
format, D-31<br />
MIPS, 82–83<br />
names, 82<br />
Status register, A-34, A-35<br />
Files, register, 252, 257, B-50, B-54–56<br />
Fine-grained multithreading, 514<br />
Finite-state machines (FSMs), 451–466,<br />
B-67–72<br />
control, D-8–22<br />
controllers, 464<br />
for multicycle control, D-9<br />
for simple cache controller, 464–466<br />
implementation, 463, B-70<br />
Mealy, 463<br />
Moore, 463<br />
next-state function, 463, B-67<br />
output function, B-67, B-69<br />
state assignment, B-70<br />
state register implementation, B-71<br />
style of, 463<br />
synchronous, B-67<br />
SystemVerilog, OL5.12-7<br />
traffic light example, B-68–70<br />
Flash memory, 381<br />
characteristics, 23<br />
defined, 23<br />
Flat address space, 479<br />
Flip-flops<br />
D flip-flops, B-51, B-53<br />
defined, B-51
Floating point, 196–222, 224<br />
assembly language, 212<br />
backward step, OL3.11-4–3.11-5<br />
binary to decimal conversion, 202<br />
branch, 211<br />
challenges, 232–233<br />
diversity versus portability, OL3.11-<br />
3–3.11-4<br />
division, 211<br />
first dispute, OL3.11-2–3.11-3<br />
form, 197<br />
fused multiply add, 220<br />
guard digits, 218–219<br />
history, OL3.11-3<br />
IEEE 754 st<strong>and</strong>ard, 198, 199<br />
instruction encoding, 213<br />
intermediate calculations, 218<br />
machine language, 212<br />
MIPS instruction frequency for, 236<br />
MIPS instructions, 211–213<br />
operands, 212<br />
overflow, 198<br />
packed format, 224<br />
precision, 230<br />
procedure with two-dimensional<br />
matrices, 215–217<br />
programs, compiling, 214–217<br />
registers, 217<br />
representation, 197–202<br />
rounding, 218–219<br />
sign and magnitude, 197<br />
SSE2 architecture, 224–225<br />
subtraction, 211<br />
underflow, 198<br />
units, 219<br />
in x86, 224<br />
Floating vectors, OL3.11-3<br />
Floating-point addition, 203–206<br />
arithmetic unit block diagram, 207<br />
binary, 204<br />
illustrated, 205<br />
instructions, 211, A-73–74<br />
steps, 203–204<br />
Floating-point arithmetic (GPUs),<br />
C-41–46<br />
basic, C-42<br />
double precision, C-45–46, C-74<br />
performance, C-44<br />
specialized, C-42–44<br />
supported formats, C-42<br />
texture operations, C-44<br />
Floating-point instructions, A-73–80<br />
absolute value, A-73<br />
addition, A-73–74<br />
comparison, A-74–75<br />
conversion, A-75–76<br />
desktop RISC, E-12<br />
division, A-76<br />
load, A-76–77<br />
move, A-77–78<br />
multiplication, A-78<br />
negation, A-78–79<br />
SPARC, E-31<br />
square root, A-79<br />
store, A-79<br />
subtraction, A-79–80<br />
truncation, A-80<br />
Floating-point multiplication, 206–210<br />
binary, 210–211<br />
illustrated, 209<br />
instructions, 211<br />
signific<strong>and</strong>s, 206<br />
steps, 206–210<br />
Flow-sensitive information, OL2.15-15<br />
Flushing instructions, 318, 319<br />
defined, 319<br />
exceptions and, 331<br />
For loops, 141, OL2.15-26<br />
inner, OL2.15-24<br />
SIMD and, OL6.15-2<br />
Formal parameters, A-16<br />
Format fields, D-31<br />
Fortran, OL2.21-7<br />
Forward references, A-11<br />
Forwarding, 303–316<br />
ALU before, 309<br />
control, 307<br />
datapath for hazard resolution, 311<br />
defined, 278<br />
functioning, 306<br />
graphical representation, 279<br />
illustrations, OL4.13-26–4.13-26<br />
multiple results and, 281<br />
multiplexors, 310<br />
pipeline registers before, 309<br />
with two instructions, 278<br />
Verilog implementation, OL4.13-<br />
2–4.13-4<br />
Fractions, 197, 198<br />
Frame buffer, 18<br />
Frame pointers, 103<br />
Front end, OL2.15-3<br />
Fully associative caches. See also Caches<br />
block replacement strategies, 457<br />
choice of, 456<br />
defined, 403<br />
memory block location, 403<br />
misses, 406<br />
Fully connected networks, 535<br />
Function code, 82<br />
Fused-multiply-add (FMA) operation,<br />
220, C-45–46<br />
G<br />
Game consoles, C-9<br />
Gates, B-3, B-8<br />
AND, B-12, D-7<br />
delays, B-46<br />
mapping ALU control function to,<br />
D-4–7<br />
NAND, B-8<br />
NOR, B-8, B-50<br />
Gather-scatter, 511, 552<br />
General Purpose GPUs (GPGPUs),<br />
C-5<br />
General-purpose registers, 150<br />
architectures, OL2.21-3<br />
embedded RISCs, E-5<br />
Generate<br />
defined, B-40<br />
example, B-44<br />
super, B-41<br />
Gigabyte, 6<br />
Global common subexpression<br />
elimination, OL2.15-6<br />
Global memory, C-21, C-39<br />
Global miss rates, 416<br />
Global optimization, OL2.15-5<br />
code, OL2.15-7<br />
implementing, OL2.15-8–2.15-11<br />
Global pointers, 102<br />
GPU computing. See also Graphics<br />
processing units (GPUs)<br />
defined, C-5<br />
visual applications, C-6–7<br />
GPU system architectures, C-7–12<br />
graphics logical pipeline, C-10<br />
heterogeneous, C-7–9<br />
implications for, C-24<br />
interfaces and drivers, C-9<br />
unified, C-10–12<br />
Graph coloring, OL2.15-12<br />
I-10 Index<br />
Graphics displays<br />
computer hardware support, 18<br />
LCD, 18<br />
Graphics logical pipeline, C-10<br />
Graphics processing units (GPUs), 522–529. See also GPU computing<br />
as accelerators, 522<br />
attribute interpolation, C-43–44<br />
defined, 46, 506, C-3<br />
evolution, C-5<br />
fallacies and pitfalls, C-72–75<br />
floating-point arithmetic, C-17, C-41–46, C-74<br />
GeForce 8-series generation, C-5<br />
general computation, C-73–74<br />
General Purpose (GPGPUs), C-5<br />
graphics mode, C-6<br />
graphics trends, C-4<br />
history, C-3–4<br />
logical graphics pipeline, C-13–14<br />
mapping applications to, C-55–72<br />
memory, 523<br />
multilevel caches and, 522<br />
N-body applications, C-65–72<br />
NVIDIA architecture, 523–526<br />
parallel memory system, C-36–41<br />
parallelism, 523, C-76<br />
performance doubling, C-4<br />
perspective, 527–529<br />
programming, C-12–24<br />
programming interfaces to, C-17<br />
real-time graphics, C-13<br />
summary, C-76<br />
Graphics shader programs, C-14–15<br />
Gresham’s Law, 236, OL3.11-2<br />
Grid computing, 533<br />
Grids, C-19<br />
GTX 280, 548–553<br />
Guard digits<br />
defined, 218<br />
rounding with, 219<br />
H<br />
Half precision, C-42<br />
Halfwords, 110<br />
Hamming, Richard, 420<br />
Hamming distance, 420<br />
Hamming Error Correction Code (ECC),<br />
420–421<br />
calculating, 420–421<br />
Handlers<br />
defined, 449<br />
TLB miss, 448<br />
Hard disks<br />
access times, 23<br />
defined, 23<br />
Hardware<br />
as hierarchical layer, 13<br />
language of, 14–16<br />
operations, 63–66<br />
supporting procedures in, 96–106<br />
synthesis, B-21<br />
translating microprograms to, D-28–32<br />
virtualizable, 426<br />
Hardware description languages. See also<br />
Verilog<br />
defined, B-20<br />
using, B-20–26<br />
VHDL, B-20–21<br />
Hardware multithreading, 514–517<br />
coarse-grained, 514<br />
options, 516<br />
simultaneous, 515–517<br />
Hardware-based speculation, 341<br />
Harvard architecture, OL1.12-4<br />
Hazard detection units, 313–314<br />
functions, 314<br />
pipeline connections for, 314<br />
Hazards, 277–278. See also Pipelining<br />
control, 281–282, 316–325<br />
data, 278, 303–316<br />
forwarding and, 312<br />
structural, 277, 294<br />
Heap<br />
allocating space on, 104–106<br />
defined, 104<br />
Heterogeneous systems, C-4–5<br />
architecture, C-7–9<br />
defined, C-3<br />
Hexadecimal numbers, 81–82<br />
binary number conversion to, 81–82<br />
Hierarchy of memories, 12<br />
High-level languages, 14–16, A-6<br />
benefits, 16<br />
computer architectures, OL2.21-5<br />
importance, 16<br />
High-level optimizations, OL2.15-4–2.15-5<br />
Hit rate, 376<br />
Hit time<br />
cache performance <strong>and</strong>, 401–402<br />
defined, 376<br />
Hit under miss, 472<br />
Hold time, B-54<br />
Horizontal microcode, D-32<br />
Hot-swapping, OL5.11-7<br />
Human genome project, 4<br />
I<br />
I/O, A-38–40, OL6.9-2, OL6.9-3<br />
memory-mapped, A-38<br />
on system performance, OL5.11-2<br />
I/O benchmarks. See Benchmarks<br />
IBM 360/85, OL5.17-7<br />
IBM 701, OL1.12-5<br />
IBM 7030, OL4.16-2<br />
IBM ALOG, OL3.11-7<br />
IBM Blue Gene, OL6.15-9–6.15-10<br />
IBM Personal Computer, OL1.12-7,<br />
OL2.21-6<br />
IBM System/360 computers, OL1.12-6,<br />
OL3.11-6, OL4.16-2<br />
IBM z/VM, OL5.17-8<br />
ID stage<br />
branch execution in, 319<br />
load instructions, 292<br />
store instruction in, 291<br />
IEEE 754 floating-point st<strong>and</strong>ard, 198,<br />
199, OL3.11-8–3.11-10. See also<br />
Floating point<br />
first chips, OL3.11-8–3.11-9<br />
in GPU arithmetic, C-42–43<br />
implementation, OL3.11-10<br />
rounding modes, 219<br />
today, OL3.11-10<br />
If statements, 114<br />
I-format, 83<br />
If-then-else, 91<br />
Immediate addressing, 116<br />
Immediate instructions, 72<br />
Imprecise interrupts, 331, OL4.16-4<br />
Index-out-of-bounds check, 94–95<br />
Induction variable elimination, OL2.15-7<br />
Inheritance, OL2.15-15<br />
In-order commit, 341<br />
Input devices, 16<br />
Inputs, 261<br />
Instances, OL2.15-15<br />
Instruction count, 36, 38<br />
Instruction decode/register file read stage<br />
control line, 300<br />
load instruction, 289<br />
store instruction, 294<br />
Instruction execution illustrations,<br />
OL4.13-16–4.13-17<br />
clock cycle 9, OL4.13-24<br />
clock cycles 1 and 2, OL4.13-21<br />
clock cycles 3 and 4, OL4.13-22<br />
clock cycles 5 and 6, OL4.13-23<br />
clock cycles 7 and 8, OL4.13-24<br />
examples, OL4.13-20–4.13-25<br />
forwarding, OL4.13-26–4.13-31<br />
no hazard, OL4.13-17<br />
pipelines with stalls and forwarding,<br />
OL4.13-26, OL4.13-20<br />
Instruction fetch stage<br />
control line, 300<br />
load instruction, 289<br />
store instruction, 294<br />
Instruction formats, 157<br />
ARM, 148<br />
defined, 81<br />
desktop/server RISC architectures, E-7<br />
embedded RISC architectures, E-8<br />
I-type, 83<br />
J-type, 113<br />
jump instruction, 270<br />
MIPS, 148<br />
R-type, 83, 261<br />
x86, 157<br />
Instruction latency, 356<br />
Instruction mix, 39, OL1.12-10<br />
Instruction set architecture<br />
ARM, 145–147<br />
branch address calculation, 254<br />
defined, 22, 52<br />
history, 163<br />
maintaining, 52<br />
protection <strong>and</strong>, 427<br />
thread, C-31–34<br />
virtual machine support, 426–427<br />
Instruction sets, 235, C-49<br />
ARM, 324<br />
design for pipelining, 277<br />
MIPS, 62, 161, 234<br />
MIPS-32, 235<br />
Pseudo MIPS, 233<br />
x86 growth, 161<br />
Instruction-level parallelism (ILP), 354.<br />
See also Parallelism<br />
compiler exploitation, OL4.16-5–4.16-6<br />
defined, 43, 333<br />
exploitation, increasing, 343<br />
and matrix multiply, 351–354<br />
Instructions, 60–164, E-25–27, E-40–42.<br />
See also Arithmetic instructions;<br />
MIPS; Operands<br />
add immediate, 72<br />
addition, 180, A-51<br />
Alpha, E-27–29<br />
arithmetic-logical, 251, A-51–57<br />
ARM, 145–147, E-36–37<br />
assembly, 66<br />
basic block, 93<br />
branch, A-59–63<br />
cache-aware, 482<br />
comparison, A-57–59<br />
conditional branch, 90<br />
conditional move, 324<br />
constant-manipulating, A-57<br />
conversion, A-75–76<br />
core, 233<br />
data movement, A-70–73<br />
data transfer, 68<br />
decision-making, 90–96<br />
defined, 14, 62<br />
desktop RISC conventions, E-12<br />
division, A-52–53<br />
as electronic signals, 80<br />
embedded RISC conventions, E-15<br />
encoding, 83<br />
exception and interrupt, A-80<br />
exclusive OR, A-57<br />
fetching, 253<br />
fields, 80<br />
floating-point (x86), 224<br />
floating-point, 211–213, A-73–80<br />
flushing, 318, 319, 331<br />
immediate, 72<br />
introduction to, 62–63<br />
jump, 95, 97, A-63–64<br />
left-to-right flow, 287–288<br />
load, 68, A-66–68<br />
load linked, 122<br />
logical operations, 87–89<br />
M32R, E-40<br />
memory access, C-33–34<br />
memory-reference, 245<br />
multiplication, 188, A-53–54<br />
negation, A-54<br />
nop, 314<br />
PA-RISC, E-34–36<br />
performance, 35–36<br />
pipeline sequence, 313<br />
PowerPC, E-12–13, E-32–34<br />
PTX, C-31, C-32<br />
remainder, A-55<br />
representation in computer, 80–87<br />
restartable, 450<br />
resuming, 450<br />
R-type, 252<br />
shift, A-55–56<br />
SPARC, E-29–32<br />
store, 71, A-68–70<br />
store conditional, 122<br />
subtraction, 180, A-56–57<br />
SuperH, E-39–40<br />
thread, C-30–31<br />
Thumb, E-38<br />
trap, A-64–66<br />
vector, 510<br />
as words, 62<br />
x86, 149–155<br />
Instructions per clock cycle (IPC), 333<br />
Integrated circuits (ICs), 19. See also<br />
specific chips<br />
cost, 27<br />
defined, 25<br />
manufacturing process, 26<br />
very large-scale (VLSIs), 25<br />
Intel Core i7, 46–49, 244, 501, 548–553<br />
address translation for, 471<br />
architectural registers, 347<br />
caches in, 472<br />
memory hierarchies of, 471–475<br />
microarchitecture, 338<br />
performance of, 473<br />
SPEC CPU benchmark, 46–48<br />
SPEC power benchmark, 48–49<br />
TLB hardware for, 471<br />
Intel Core i7 920, 346–349<br />
microarchitecture, 347<br />
Intel Core i7 960<br />
benchmarking <strong>and</strong> rooflines of,<br />
548–553<br />
Intel Core i7 Pipelines, 344, 346–349<br />
memory components, 348<br />
performance, 349–351<br />
program performance, 351<br />
specification, 345<br />
Intel IA-64 architecture, OL2.21-3<br />
Intel Paragon, OL6.15-8<br />
Intel Threading Building Blocks, C-60<br />
Intel x86 microprocessors<br />
clock rate <strong>and</strong> power for, 40<br />
Interference graphs, OL2.15-12<br />
Interleaving, 398<br />
Interprocedural analysis, OL2.15-14<br />
Interrupt enable, 447<br />
Interrupt handlers, A-33<br />
Interrupt-driven I/O, OL6.9-4<br />
Interrupts<br />
defined, 180, 326<br />
event types and, 326<br />
exceptions versus, 325–326<br />
imprecise, 331, OL4.16-4<br />
instructions, A-80<br />
precise, 332<br />
vectored, 327<br />
Intrinsity FastMATH processor, 395–398<br />
caches, 396<br />
data miss rates, 397, 407<br />
read processing, 442<br />
TLB, 440<br />
write-through processing, 442<br />
Inverted page tables, 436<br />
Issue packets, 334<br />
J<br />
j (Jump), 64<br />
jal (Jump And Link), 64<br />
Java<br />
bytecode, 131<br />
bytecode architecture, OL2.15-17<br />
characters in, 109–111<br />
compiling in, OL2.15-19–2.15-20<br />
goals, 131<br />
interpreting, 131, 145, OL2.15-15–2.15-16<br />
keywords, OL2.15-21<br />
method invocation in, OL2.15-21<br />
pointers, OL2.15-26<br />
primitive types, OL2.15-26<br />
programs, starting, 131–132<br />
reference types, OL2.15-26<br />
sort algorithms, 141<br />
strings in, 109–111<br />
translation hierarchy, 131<br />
while loop compilation in, OL2.15-18–2.15-19<br />
Java Virtual Machine (JVM), 145,<br />
OL2.15-16<br />
jr (Jump Register), 64<br />
J-type instruction format, 113<br />
Jump instructions, 254, E-26<br />
branch instruction versus, 270<br />
control <strong>and</strong> datapath for, 271<br />
implementing, 270<br />
instruction format, 270<br />
list of, A-63–64<br />
Just In Time (JIT) compilers,<br />
132, 560<br />
K<br />
Karnaugh maps, B-18<br />
Kernel mode, 444<br />
Kernels<br />
CUDA, C-19, C-24<br />
defined, C-19<br />
Kilobyte, 6<br />
L<br />
Labels<br />
global, A-10, A-11<br />
local, A-11<br />
LAPACK, 230<br />
Large-scale multiprocessors, OL6.15-7,<br />
OL6.15-9–6.15-10<br />
Latches<br />
D latch, B-51, B-52<br />
defined, B-51<br />
Latency<br />
instruction, 356<br />
memory, C-74–75<br />
pipeline, 286<br />
use, 336–337<br />
lbu (Load Byte Unsigned), 64<br />
Leaf procedures. See also Procedures<br />
defined, 100<br />
example, 109<br />
Least recently used (LRU)<br />
as block replacement strategy, 457<br />
defined, 409<br />
pages, 434<br />
Least significant bits, B-32<br />
defined, 74<br />
SPARC, E-31<br />
Left-to-right instruction flow, 287–288<br />
Level-sensitive clocking, B-74, B-75–76<br />
defined, B-74<br />
two-phase, B-75<br />
lhu (Load Halfword Unsigned), 64<br />
li (Load Immediate), 162<br />
Link, OL6.9-2<br />
Linkers, 126–129, A-18–19<br />
defined, 126, A-4<br />
executable files, 126, A-19<br />
function illustration, A-19<br />
steps, 126<br />
using, 126–129<br />
Linking object files, 126–129<br />
Linpack, 538, OL3.11-4<br />
Liquid crystal displays (LCDs), 18<br />
LISP, SPARC support, E-30<br />
Little-endian byte order, A-43<br />
Live range, OL2.15-11<br />
Livermore Loops, OL1.12-11<br />
ll (Load Linked), 64<br />
Load balancing, 505–506<br />
Load instructions. See also Store<br />
instructions<br />
access, C-41<br />
base register, 262<br />
block, 149<br />
compiling with, 71<br />
datapath in operation for, 267<br />
defined, 68<br />
details, A-66–68<br />
EX stage, 292<br />
floating-point, A-76–77<br />
halfword unsigned, 110<br />
ID stage, 291<br />
IF stage, 291<br />
linked, 122, 123<br />
list of, A-66–68<br />
load byte unsigned, 76<br />
load half, 110<br />
load upper immediate, 112, 113<br />
MEM stage, 293<br />
pipelined datapath in, 296<br />
signed, 76<br />
unit for implementing, 255<br />
unsigned, 76<br />
WB stage, 293<br />
Load word, 68, 71<br />
Loaders, 129<br />
Loading, A-19–20<br />
Load-store architectures, OL2.21-3<br />
Load-use data hazard, 280, 318<br />
Load-use stalls, 318<br />
Local area networks (LANs), 24. See also<br />
Networks<br />
Local labels, A-11<br />
Local memory, C-21, C-40<br />
Local miss rates, 416<br />
Local optimization, OL2.15-5.<br />
See also Optimization<br />
implementing, OL2.15-8<br />
Locality<br />
principle, 374<br />
spatial, 374, 377<br />
temporal, 374, 377<br />
Lock synchronization, 121<br />
Locks, 518<br />
Logic<br />
address select, D-24, D-25<br />
ALU control, D-6<br />
combinational, 250, B-5, B-9–20<br />
components, 249<br />
control unit equations, D-11<br />
design, 248–251, B-1–79<br />
equations, B-7<br />
minimization, B-18<br />
programmable array (PAL),<br />
B-78<br />
sequential, B-5, B-56–58<br />
two-level, B-11–14<br />
Logical operations, 87–89<br />
AND, 88, A-52<br />
ARM, 149<br />
desktop RISC, E-11<br />
embedded RISC, E-14<br />
MIPS, A-51–57<br />
NOR, 89, A-54<br />
NOT, 89, A-55<br />
OR, 89, A-55<br />
shifts, 87<br />
Long instruction word (LIW),<br />
OL4.16-5<br />
Lookup tables (LUTs), B-79<br />
Loop unrolling<br />
defined, 338, OL2.15-4<br />
for multiple-issue pipelines, 338<br />
register renaming and, 338<br />
Loops, 92–93<br />
conditional branches in, 114<br />
for, 141<br />
prediction and, 321–323<br />
test, 142, 143<br />
while, compiling, 92–93<br />
lui (Load Upper Imm.), 64<br />
lw (Load Word), 64<br />
lwc1 (Load FP Single), A-73<br />
M<br />
M32R, E-15, E-40<br />
Machine code, 81<br />
Machine instructions, 81<br />
Machine language, 15<br />
branch offset in, 115<br />
decoding, 118–120<br />
defined, 14, 81, A-3<br />
floating-point, 212<br />
illustrated, 15<br />
MIPS, 85<br />
SRAM, 21<br />
translating MIPS assembly language<br />
into, 84<br />
Macros<br />
defined, A-4<br />
example, A-15–17<br />
use of, A-15<br />
Main memory, 428. See also Memory<br />
defined, 23<br />
page tables, 437<br />
physical addresses, 428<br />
Mapping applications, C-55–72<br />
Mark computers, OL1.12-14<br />
Matrix multiply, 225–228, 553–555<br />
Mealy machine, 463–464, B-68, B-71,<br />
B-72<br />
Mean time to failure (MTTF), 418<br />
improving, 419<br />
versus AFR of disks, 419–420<br />
Media Access Control (MAC) address,<br />
OL6.9-7<br />
Megabyte, 6<br />
Memory<br />
addresses, 77<br />
affinity, 545<br />
atomic, C-21<br />
bandwidth, 380–381, 397<br />
cache, 21, 383–398, 398–417<br />
CAM, 408<br />
constant, C-40<br />
control, D-26<br />
defined, 19<br />
DRAM, 19, 379–380, B-63–65<br />
flash, 23<br />
global, C-21, C-39<br />
GPU, 523<br />
instructions, datapath for, 256<br />
layout, A-21<br />
local, C-21, C-40<br />
main, 23<br />
nonvolatile, 22<br />
operands, 68–69<br />
parallel system, C-36–41<br />
read-only (ROM), B-14–16<br />
SDRAM, 379–380<br />
secondary, 23<br />
shared, C-21, C-39–40<br />
spaces, C-39<br />
SRAM, B-58–62<br />
stalls, 400<br />
technologies for building, 24–28<br />
texture, C-40<br />
usage, A-20–22<br />
virtual, 427–454<br />
volatile, 22<br />
Memory access instructions, C-33–34<br />
Memory access stage<br />
control line, 302<br />
load instruction, 292<br />
store instruction, 292<br />
Memory b<strong>and</strong>width, 551, 557<br />
Memory consistency model, 469<br />
Memory elements, B-50–58<br />
clocked, B-51<br />
D flip-flop, B-51, B-53<br />
D latch, B-52<br />
DRAMs, B-63–67<br />
flip-flop, B-51<br />
hold time, B-54<br />
latch, B-51<br />
setup time, B-53, B-54<br />
SRAMs, B-58–62<br />
unclocked, B-51<br />
Memory hierarchies, 545<br />
of ARM cortex-A8, 471–475<br />
block (or line), 376<br />
cache performance, 398–417<br />
caches, 383–417<br />
common framework, 454–461<br />
defined, 375<br />
design challenges, 461<br />
development, OL5.17-6–5.17-8<br />
exploiting, 372–498<br />
of Intel core i7, 471–475<br />
level pairs, 376<br />
multiple levels, 375<br />
overall operation of, 443–444<br />
parallelism and, 466–470, OL5.11-2<br />
pitfalls, 478–482<br />
program execution time and, 417<br />
quantitative design parameters, 454<br />
redundant arrays and inexpensive<br />
disks, 470<br />
reliance on, 376<br />
structure, 375<br />
structure diagram, 378<br />
variance, 417<br />
virtual memory, 427–454<br />
Memory rank, 381<br />
Memory technologies, 378–383<br />
disk memory, 381–383<br />
DRAM technology, 378, 379–381<br />
flash memory, 381<br />
SRAM technology, 378, 379<br />
Memory-mapped I/O, OL6.9-3<br />
use of, A-38<br />
Memory-stall clock cycles, 399<br />
Message passing<br />
defined, 529<br />
multiprocessors, 529–534<br />
Metastability, B-76<br />
Methods<br />
defined, OL2.15-5<br />
invoking in Java, OL2.15-20–2.15-21<br />
static, A-20<br />
mfc0 (Move From Control), A-71<br />
mfhi (Move From Hi), A-71<br />
mflo (Move From Lo), A-71<br />
Microarchitectures, 347<br />
Intel Core i7 920, 347<br />
Microcode<br />
assembler, D-30<br />
control unit as, D-28<br />
defined, D-27<br />
dispatch ROMs, D-30–31<br />
horizontal, D-32<br />
vertical, D-32<br />
Microinstructions, D-31<br />
Microprocessors<br />
design shift, 501<br />
multicore, 8, 43, 500–501<br />
Microprograms<br />
as abstract control representation,<br />
D-30<br />
field translation, D-29<br />
translating to hardware, D-28–32<br />
Migration, 467<br />
Million instructions per second (MIPS),<br />
51<br />
Minterms<br />
defined, B-12, D-20<br />
in PLA implementation, D-20<br />
MIP-map, C-44<br />
MIPS, 64, 84, A-45–80<br />
addressing for 32-bit immediates,<br />
116–118<br />
addressing modes, A-45–47<br />
arithmetic core, 233<br />
arithmetic instructions, 63, A-51–57<br />
ARM similarities, 146<br />
assembler directive support, A-47–49<br />
assembler syntax, A-47–49<br />
assembly instruction, mapping, 80–81<br />
branch instructions, A-59–63<br />
comparison instructions, A-57–59<br />
compiling C assignment statements<br />
into, 65<br />
compiling complex C assignment into,<br />
65–66<br />
constant-manipulating instructions,<br />
A-57<br />
control registers, 448<br />
control unit, D-10<br />
CPU, A-46<br />
divide in, 194<br />
exceptions in, 326–327<br />
fields, 82–83<br />
floating-point instructions, 211–213<br />
FPU, A-46<br />
instruction classes, 163<br />
instruction encoding, 83, 119, A-49<br />
instruction formats, 120, 148, A-49–51<br />
instruction set, 62, 162, 234<br />
jump instructions, A-63–66<br />
logical instructions, A-51–57<br />
machine language, 85<br />
memory addresses, 70<br />
memory allocation for program and<br />
data, 104<br />
multiply in, 188<br />
opcode map, A-50<br />
operands, 64<br />
Pseudo, 233, 235<br />
register conventions, 105<br />
static multiple issue with, 335–338<br />
MIPS core<br />
architecture, 195<br />
arithmetic/logical instructions not in,<br />
E-21, E-23<br />
common extensions to, E-20–25<br />
control instructions not in, E-21<br />
data transfer instructions not in, E-20,<br />
E-22<br />
floating-point instructions not in, E-22<br />
instruction set, 233, 244–248, E-9–10<br />
MIPS-16<br />
16-bit instruction set, E-41–42<br />
immediate fields, E-41<br />
instructions, E-40–42<br />
MIPS core instruction changes, E-42<br />
PC-relative addressing, E-41<br />
MIPS-32 instruction set, 235<br />
MIPS-64 instructions, E-25–27<br />
conditional procedure call instructions,<br />
E-27<br />
constant shift amount, E-25<br />
jump/call not PC-relative, E-26<br />
move to/from control registers, E-26<br />
nonaligned data transfers, E-25<br />
NOR, E-25<br />
parallel single precision floating-point<br />
operations, E-27<br />
reciprocal <strong>and</strong> reciprocal square root,<br />
E-27<br />
SYSCALL, E-25<br />
TLB instructions, E-26–27<br />
Mirroring, OL5.11-5<br />
Miss penalty<br />
defined, 376<br />
determination, 391–392<br />
multilevel caches, reducing, 410<br />
Miss rates<br />
block size versus, 392<br />
data cache, 455<br />
defined, 376<br />
global, 416<br />
improvement, 391–392<br />
Intrinsity FastMATH processor, 397<br />
local, 416<br />
miss sources, 460<br />
split cache, 397<br />
Miss under miss, 472<br />
MMX (MultiMedia eXtension), 224<br />
Modules, A-4<br />
Moore machines, 463–464, B-68, B-71,<br />
B-72<br />
Moore’s law, 11, 379, 522, OL6.9-2,<br />
C-72–73<br />
Most significant bit<br />
1-bit ALU for, B-33<br />
defined, 74<br />
move (Move), 139<br />
Move instructions, A-70–73<br />
coprocessor, A-71–72<br />
details, A-70–73<br />
floating-point, A-77–78<br />
MS-DOS, OL5.17-11<br />
mul.d (FP Multiply Double), A-78<br />
mul.s (FP Multiply Single), A-78<br />
mult (Multiply), A-53<br />
Multicore, 517–521<br />
Multicore multiprocessors, 8, 43<br />
defined, 8, 500–501<br />
MULTICS (Multiplexed Information and Computing Service), OL5.17-9–5.17-10<br />
Multilevel caches. See also Caches<br />
complications, 416<br />
defined, 398, 416<br />
miss penalty, reducing, 410<br />
performance of, 410<br />
summary, 417–418<br />
Multimedia extensions<br />
desktop/server RISCs, E-16–18<br />
as SIMD extensions to instruction sets,<br />
OL6.15-4<br />
vector versus, 511–512<br />
Multiple dimension arrays, 218<br />
Multiple instruction multiple data<br />
(MIMD), 558<br />
defined, 507, 508<br />
first multiprocessor, OL6.15-14<br />
Multiple instruction single data (MISD), 507<br />
Multiple issue, 332–339<br />
code scheduling, 337–338<br />
dynamic, 333, 339–341<br />
issue packets, 334<br />
loop unrolling and, 338<br />
processors, 332, 333<br />
static, 333, 334–339<br />
throughput <strong>and</strong>, 342<br />
Multiple processors, 553–555<br />
Multiple-clock-cycle pipeline diagrams,<br />
296–297<br />
five instructions, 298<br />
illustrated, 298<br />
Multiplexors, B-10<br />
controls, 463<br />
in datapath, 263<br />
defined, 246<br />
forwarding, control values, 310<br />
selector control, 256–257<br />
two-input, B-10<br />
Multiplicand, 183<br />
Multiplication, 183–188. See also<br />
Arithmetic<br />
fast, hardware, 188<br />
faster, 187–188<br />
first algorithm, 185<br />
floating-point, 206–208, A-78<br />
hardware, 184–186<br />
instructions, 188, A-53–54<br />
in MIPS, 188<br />
multiplicand, 183<br />
multiplier, 183<br />
operands, 183<br />
product, 183<br />
sequential version, 184–186<br />
signed, 187<br />
Multiplier, 183<br />
Multiply algorithm, 186<br />
Multiply-add (MAD), C-42<br />
Multiprocessors<br />
benchmarks, 538–540<br />
bus-based coherent, OL6.15-7<br />
defined, 500<br />
historical perspective, 561<br />
large-scale, OL6.15-7–6.15-8, OL6.15-9–6.15-10<br />
message-passing, 529–534<br />
multithreaded architecture, C-26–27,<br />
C-35–36<br />
organization, 499, 529<br />
for performance, 559<br />
shared memory, 501, 517–521<br />
software, 500<br />
TFLOPS, OL6.15-6<br />
UMA, 518<br />
Multistage networks, 535<br />
Multithreaded multiprocessor<br />
architecture, C-25–36<br />
conclusion, C-36<br />
ISA, C-31–34<br />
massive multithreading, C-25–26<br />
multiprocessor, C-26–27<br />
multiprocessor comparison, C-35–36<br />
SIMT, C-27–30<br />
special function units (SFUs), C-35<br />
streaming processor (SP), C-34<br />
thread instructions, C-30–31<br />
threads/thread blocks management,<br />
C-30<br />
Multithreading, C-25–26<br />
coarse-grained, 514<br />
defined, 506<br />
fine-grained, 514<br />
hardware, 514–517<br />
simultaneous (SMT), 515–517<br />
multu (Multiply Unsigned), A-54<br />
Must-information, OL2.15-5<br />
Mutual exclusion, 121<br />
N<br />
Name dependence, 338<br />
NAND gates, B-8<br />
NAS (NASA Advanced Supercomputing),<br />
540<br />
N-body<br />
all-pairs algorithm, C-65<br />
GPU simulation, C-71<br />
mathematics, C-65–67<br />
multiple threads per body, C-68–69<br />
optimization, C-67<br />
performance comparison, C-69–70<br />
results, C-70–72<br />
shared memory use, C-67–68<br />
Negation instructions, A-54, A-78–79<br />
Negation shortcut, 76<br />
Nested procedures, 100–102<br />
compiling recursive procedure<br />
showing, 101–102<br />
NetFPGA 10-Gigabit Ethernet card,<br />
OL6.9-2, OL6.9-3<br />
Network of Workstations, OL6.15-8–6.15-9<br />
Network topologies, 534–537<br />
implementing, 536<br />
multistage, 537<br />
Networking, OL6.9-4<br />
operating system in, OL6.9-4–6.9-5<br />
performance improvement, OL6.9-7–6.9-10<br />
Networks, 23–24<br />
advantages, 23<br />
bandwidth, 535<br />
crossbar, 535<br />
fully connected, 535<br />
local area (LANs), 24<br />
multistage, 535<br />
wide area (WANs), 24<br />
Newton’s iteration, 218<br />
Next state<br />
nonsequential, D-24<br />
sequential, D-23<br />
Next-state function, 463, B-67<br />
defined, 463<br />
implementing, with sequencer,<br />
D-22–28<br />
Next-state outputs, D-10, D-12–13<br />
example, D-12–13<br />
implementation, D-12<br />
logic equations, D-12–13<br />
truth tables, D-15<br />
No Redundancy (RAID 0), OL5.11-4<br />
No write allocation, 394<br />
Nonblocking assignment, B-24<br />
Nonblocking caches, 344, 472<br />
Nonuniform memory access (NUMA),<br />
518<br />
Nonvolatile memory, 22<br />
Nops, 314<br />
nor (NOR), 64<br />
NOR gates, B-8<br />
cross-coupled, B-50<br />
D latch implemented with, B-52<br />
NOR operation, 89, A-54, E-25<br />
NOT operation, 89, A-55, B-6<br />
Numbers<br />
binary, 73<br />
computer versus real-world, 221<br />
decimal, 73, 76<br />
denormalized, 222<br />
hexadecimal, 81–82<br />
signed, 73–78<br />
unsigned, 73–78<br />
NVIDIA GeForce 8800, C-46–55<br />
all-pairs N-body algorithm, C-71<br />
dense linear algebra computations,<br />
C-51–53<br />
FFT performance, C-53<br />
instruction set, C-49<br />
performance, C-51<br />
rasterization, C-50<br />
ROP, C-50–51<br />
scalability, C-51<br />
sorting performance, C-54–55<br />
special function approximation<br />
statistics, C-43<br />
special function unit (SFU), C-50<br />
streaming multiprocessor (SM),<br />
C-48–49<br />
streaming processor, C-49–50<br />
streaming processor array (SPA), C-46<br />
texture/processor cluster (TPC),<br />
C-47–48<br />
NVIDIA GPU architecture, 523–526<br />
NVIDIA GTX 280, 548–553<br />
NVIDIA Tesla GPU, 548–553<br />
O<br />
Object files, 125, A-4<br />
debugging information, 124<br />
defined, A-10<br />
format, A-13–14<br />
header, 125, A-13<br />
linking, 126–129<br />
relocation information, 125<br />
static data segment, 125<br />
symbol table, 125, 126<br />
text segment, 125<br />
Object-oriented languages. See also Java<br />
brief history, OL2.21-8<br />
defined, 145, OL2.15-5<br />
One’s complement, 79, B-29<br />
Opcodes<br />
control line setting and, 264<br />
defined, 82, 262<br />
OpenGL, C-13<br />
OpenMP (Open MultiProcessing), 520,<br />
540<br />
Operands, 66–73. See also Instructions<br />
32-bit immediate, 112–113<br />
adding, 179<br />
arithmetic instructions, 66<br />
compiling assignment when in<br />
memory, 69<br />
constant, 72–73<br />
division, 189<br />
floating-point, 212<br />
memory, 68–69<br />
MIPS, 64<br />
multiplication, 183<br />
shifting, 148<br />
Operating systems<br />
brief history, OL5.17-9–5.17-12<br />
defined, 13<br />
encapsulation, 22<br />
in networking, OL6.9-4–6.9-5<br />
Operations<br />
atomic, implementing, 121<br />
hardware, 63–66<br />
logical, 87–89<br />
x86 integer, 152, 154–155<br />
Optimization<br />
class explanation, OL2.15-14<br />
compiler, 141<br />
control implementation, D-27–28<br />
global, OL2.15-5<br />
high-level, OL2.15-4–2.15-5<br />
local, OL2.15-5, OL2.15-8<br />
manual, 144<br />
or (OR), 64<br />
OR operation, 89, A-55, B-6<br />
ori (Or Immediate), 64<br />
Out-of-order execution<br />
defined, 341<br />
performance complexity, 416<br />
processors, 344<br />
Output devices, 16<br />
Overflow<br />
defined, 74, 198<br />
detection, 180<br />
exceptions, 329<br />
floating-point, 198<br />
occurrence, 75<br />
saturation and, 181<br />
subtraction, 179<br />
P<br />
P+Q redundancy (RAID 6), OL5.11-7<br />
Packed floating-point format, 224<br />
Page faults, 434. See also Virtual memory<br />
for data access, 450<br />
defined, 428<br />
handling, 429, 446–453<br />
virtual address causing, 449, 450<br />
Page tables, 456<br />
defined, 432<br />
illustrated, 435<br />
indexing, 432<br />
inverted, 436<br />
levels, 436–437<br />
main memory, 437<br />
register, 432<br />
storage reduction techniques, 436–437<br />
updating, 432<br />
VMM, 452<br />
Pages. See also Virtual memory<br />
defined, 428<br />
dirty, 437<br />
finding, 432–434<br />
LRU, 434<br />
offset, 429<br />
physical number, 429<br />
placing, 432–434<br />
size, 430<br />
virtual number, 429<br />
Parallel bus, OL6.9-3<br />
Parallel execution, 121<br />
Parallel memory system, C-36–41. See<br />
also Graphics processing units<br />
(GPUs)<br />
caches, C-38<br />
constant memory, C-40<br />
DRAM considerations, C-37–38<br />
global memory, C-39<br />
load/store access, C-41<br />
local memory, C-40<br />
memory spaces, C-39<br />
MMU, C-38–39<br />
ROP, C-41<br />
shared memory, C-39–40<br />
surfaces, C-41<br />
texture memory, C-40<br />
Parallel processing programs, 502–507<br />
creation difficulty, 502–507<br />
defined, 501<br />
for message passing, 519–520<br />
great debates in, OL6.15-5<br />
for shared address space, 519–520<br />
use of, 559<br />
Parallel reduction, C-62<br />
Parallel scan, C-60–63<br />
CUDA template, C-61<br />
inclusive, C-60<br />
tree-based, C-62<br />
Parallel software, 501<br />
Parallelism, 12, 43, 332–344<br />
and computer arithmetic, 222–223<br />
data-level, 233, 508<br />
debates, OL6.15-5–6.15-7<br />
GPUs and, 523, C-76<br />
instruction-level, 43, 332, 343<br />
memory hierarchies and, 466–470,<br />
OL5.11-2<br />
multicore <strong>and</strong>, 517<br />
multiple issue, 332–339<br />
multithreading <strong>and</strong>, 517<br />
performance benefits, 44–45<br />
process-level, 500<br />
redundant arrays of inexpensive<br />
disks, 470<br />
subword, E-17<br />
task, C-24<br />
task-level, 500<br />
thread, C-22<br />
Paravirtualization, 482<br />
PA-RISC, E-14, E-17<br />
branch vectored, E-35<br />
conditional branches, E-34, E-35<br />
debug instructions, E-36<br />
decimal operations, E-35<br />
extract <strong>and</strong> deposit, E-35<br />
instructions, E-34–36<br />
load <strong>and</strong> clear instructions, E-36<br />
multiply/add <strong>and</strong> multiply/subtract,<br />
E-36<br />
nullification, E-34<br />
nullifying branch option, E-25<br />
store bytes short, E-36<br />
synthesized multiply <strong>and</strong> divide,<br />
E-34–35<br />
Parity, OL5.11-5<br />
bits, 421<br />
code, 420, B-65<br />
PARSEC (Princeton Application<br />
Repository for Shared Memory<br />
<strong>Computer</strong>s), 540<br />
Pass transistor, B-63<br />
PCI-Express (PCIe), 537, C-8, OL6.9-2<br />
PC-relative addressing, 114, 116<br />
Peak floating-point performance, 542<br />
Pentium bug morality play, 231–232<br />
Performance, 28–36<br />
assessing, 28<br />
classic CPU equation, 36–40<br />
components, 38<br />
CPU, 33–35<br />
defining, 29–32<br />
equation, using, 36<br />
improving, 34–35<br />
instruction, 35–36<br />
measuring, 33–35, OL1.12-10<br />
program, 39–40<br />
ratio, 31<br />
relative, 31–32<br />
response time, 30–31<br />
sorting, C-54–55<br />
throughput, 30–31<br />
time measurement, 32<br />
Personal computers (PCs), 7<br />
defined, 5<br />
Personal mobile device (PMD)<br />
defined, 7<br />
Petabyte, 6<br />
Physical addresses, 428<br />
mapping to, 428–429<br />
space, 517, 521<br />
Physically addressed caches, 443<br />
Pipeline registers<br />
before forwarding, 309<br />
dependences, 308<br />
forwarding unit selection, 312<br />
Pipeline stalls, 280<br />
avoiding with code reordering, 280<br />
data hazards <strong>and</strong>, 313–316<br />
insertion, 315<br />
load-use, 318<br />
as solution to control hazards, 282<br />
Pipelined branches, 319<br />
Pipelined control, 300–303. See also<br />
Control<br />
control lines, 300, 303<br />
overview illustration, 316<br />
specifying, 300<br />
Pipelined datapaths, 286–303<br />
with connected control signals, 304<br />
with control signals, 300–303<br />
corrected, 296<br />
illustrated, 289<br />
in load instruction stages, 296<br />
Pipelined dependencies, 305<br />
Pipelines<br />
branch instruction impact, 317<br />
effectiveness, improving, OL4.16-4–4.16-5<br />
execute <strong>and</strong> address calculation stage,<br />
290, 292<br />
five-stage, 274, 290, 299<br />
graphic representation, 279, 296–300<br />
instruction decode <strong>and</strong> register file<br />
read stage, 289, 292<br />
instruction fetch stage, 290, 292<br />
instructions sequence, 313<br />
latency, 286<br />
memory access stage, 290, 292<br />
multiple-clock-cycle diagrams,<br />
296–297<br />
performance bottlenecks, 343<br />
single-clock-cycle diagrams, 296–297<br />
stages, 274<br />
static two-issue, 335<br />
write-back stage, 290, 294<br />
Pipelining, 12, 272–286<br />
advanced, 343–344<br />
benefits, 272<br />
control hazards, 281–282<br />
data hazards, 278<br />
Pipelining (Continued)<br />
exceptions <strong>and</strong>, 327–332<br />
execution time <strong>and</strong>, 286<br />
fallacies, 355–356<br />
hazards, 277–278<br />
instruction set design for, 277<br />
laundry analogy, 273<br />
overview, 272–286<br />
paradox, 273<br />
performance improvement, 277<br />
pitfall, 355–356<br />
simultaneous executing instructions,<br />
286<br />
speed-up formula, 273<br />
structural hazards, 277, 294<br />
summary, 285<br />
throughput <strong>and</strong>, 286<br />
Pitfalls. See also Fallacies<br />
address space extension, 479<br />
arithmetic, 229–232<br />
associativity, 479<br />
defined, 49<br />
GPUs, C-74–75<br />
ignoring memory system behavior, 478<br />
memory hierarchies, 478–482<br />
out-of-order processor evaluation, 479<br />
performance equation subset, 50–51<br />
pipelining, 355–356<br />
pointer to automatic variables, 160<br />
sequential word addresses, 160<br />
simulating cache, 478<br />
software development with<br />
multiprocessors, 556<br />
VMM implementation, 481–482<br />
Pixel shader example, C-15–17<br />
Pixels, 18<br />
Pointers<br />
arrays versus, 141–145<br />
frame, 103<br />
global, 102<br />
incrementing, 143<br />
Java, OL2.15-26<br />
stack, 98, 102<br />
Polling, OL6.9-8<br />
Pop, 98<br />
Power<br />
clock rate <strong>and</strong>, 40<br />
critical nature of, 53<br />
efficiency, 343–344<br />
relative, 41<br />
PowerPC<br />
algebraic right shift, E-33<br />
branch registers, E-32–33<br />
condition codes, E-12<br />
instructions, E-12–13<br />
instructions unique to, E-31–33<br />
load multiple/store multiple, E-33<br />
logical shifted immediate, E-33<br />
rotate with mask, E-33<br />
Precise interrupts, 332<br />
Prediction, 12<br />
2-bit scheme, 322<br />
accuracy, 321, 324<br />
dynamic branch, 321–323<br />
loops <strong>and</strong>, 321–323<br />
steady-state, 321<br />
Prefetching, 482, 544<br />
Primitive types, OL2.15-26<br />
Procedure calls<br />
convention, A-22–33<br />
examples, A-27–33<br />
frame, A-23<br />
preservation across, 102<br />
Procedures, 96–106<br />
compiling, 98<br />
compiling, showing nested procedure<br />
linking, 101–102<br />
execution steps, 96<br />
frames, 103<br />
leaf, 100<br />
nested, 100–102<br />
recursive, 105, A-26–27<br />
for setting arrays to zero, 142<br />
sort, 135–139<br />
strcpy, 108–109<br />
string copy, 108–109<br />
swap, 133<br />
Process identifiers, 446<br />
Process-level parallelism, 500<br />
Processors, 242–356<br />
as cores, 43<br />
control, 19<br />
datapath, 19<br />
defined, 17, 19<br />
dynamic multiple-issue, 333<br />
multiple-issue, 333<br />
out-of-order execution, 344, 416<br />
performance growth, 44<br />
ROP, C-12, C-41<br />
speculation, 333–334<br />
static multiple-issue, 333, 334–339<br />
streaming, C-34<br />
superscalar, 339, 515–516, OL4.16-5<br />
technologies for building, 24–28<br />
two-issue, 336–337<br />
vector, 508–510<br />
VLIW, 335<br />
Product, 183<br />
Product of sums, B-11<br />
Program counters (PCs), 251<br />
changing with conditional branch, 324<br />
defined, 98, 251<br />
exception, 445, 447<br />
incrementing, 251, 253<br />
instruction updates, 289<br />
Program libraries, A-4<br />
Program performance<br />
elements affecting, 39<br />
underst<strong>and</strong>ing, 9<br />
Programmable array logic (PAL), B-78<br />
Programmable logic arrays (PLAs)<br />
component dots illustration, B-16<br />
control function implementation, D-7,<br />
D-20–21<br />
defined, B-12<br />
example, B-13–14<br />
illustrated, B-13<br />
ROMs <strong>and</strong>, B-15–16<br />
size, D-20<br />
truth table implementation, B-13<br />
Programmable logic devices (PLDs), B-78<br />
Programmable ROMs (PROMs), B-14<br />
Programming languages. See also specific<br />
languages<br />
brief history of, OL2.21-7–2.21-8<br />
object-oriented, 145<br />
variables, 67<br />
Programs<br />
assembly language, 123<br />
Java, starting, 131–132<br />
parallel processing, 502–507<br />
starting, 123–132<br />
translating, 123–132<br />
Propagate<br />
defined, B-40<br />
example, B-44<br />
super, B-41<br />
Protected keywords, OL2.15-21<br />
Protection<br />
defined, 428<br />
implementing, 444–446<br />
mechanisms, OL5.17-9<br />
VMs for, 424<br />
Protection group, OL5.11-5<br />
Pseudo MIPS<br />
defined, 233<br />
instruction set, 235<br />
Pseudodirect addressing, 116<br />
Pseudoinstructions<br />
defined, 124<br />
summary, 125<br />
Pthreads (POSIX threads), 540<br />
PTX instructions, C-31, C-32<br />
Public keywords, OL2.15-21<br />
Push<br />
defined, 98<br />
using, 100<br />
Q<br />
Quad words, 154<br />
Quicksort, 411, 412<br />
Quotient, 189<br />
R<br />
Race, B-73<br />
Radix sort, 411, 412, C-63–65<br />
CUDA code, C-64<br />
implementation, C-63–65<br />
RAID. See Redundant arrays of<br />
inexpensive disks (RAID)<br />
RAM, 9<br />
Raster operation (ROP) processors, C-12,<br />
C-41, C-50–51<br />
fixed function, C-41<br />
Raster refresh buffer, 18<br />
Rasterization, C-50<br />
Ray casting (RC), 552<br />
Read-only memories (ROMs), B-14–16<br />
control entries, D-16–17<br />
control function encoding, D-18–19<br />
dispatch, D-25<br />
implementation, D-15–19<br />
logic function encoding, B-15<br />
overhead, D-18<br />
PLAs <strong>and</strong>, B-15–16<br />
programmable (PROM), B-14<br />
total size, D-16<br />
Read-stall cycles, 399<br />
Read-write head, 381<br />
Receive message routine, 529<br />
Receiver Control register, A-39<br />
Receiver Data register, A-38, A-39<br />
Recursive procedures, 105, A-26–27. See<br />
also Procedures<br />
clone invocation, 100<br />
stack in, A-29–30<br />
Reduced instruction set computer (RISC)<br />
architectures, E-2–45, OL2.21-5,<br />
OL4.16-4. See also Desktop <strong>and</strong><br />
server RISCs; Embedded RISCs<br />
group types, E-3–4<br />
instruction set lineage, E-44<br />
Reduction, 519<br />
Redundant arrays of inexpensive disks<br />
(RAID), OL5.11-2–5.11-8<br />
history, OL5.11-8<br />
RAID 0, OL5.11-4<br />
RAID 1, OL5.11-5<br />
RAID 2, OL5.11-5<br />
RAID 3, OL5.11-5<br />
RAID 4, OL5.11-5–5.11-6<br />
RAID 5, OL5.11-6–5.11-7<br />
RAID 6, OL5.11-7<br />
spread of, OL5.11-6<br />
summary, OL5.11-7–5.11-8<br />
use statistics, OL5.11-7<br />
Reference bit, 435<br />
References<br />
absolute, 126<br />
forward, A-11<br />
types, OL2.15-26<br />
unresolved, A-4, A-18<br />
Register addressing, 116<br />
Register allocation, OL2.15-11–2.15-13<br />
Register files, B-50, B-54–56<br />
defined, 252, B-50, B-54<br />
in behavioral Verilog, B-57<br />
single, 257<br />
two read ports implementation, B-55<br />
with two read ports/one write port,<br />
B-55<br />
write port implementation, B-56<br />
Register-memory architecture, OL2.21-3<br />
Registers, 152, 153–154<br />
architectural, 325–332<br />
base, 69<br />
callee-saved, A-23<br />
caller-saved, A-23<br />
Cause, A-35<br />
clock cycle time <strong>and</strong>, 67<br />
compiling C assignment with, 67–68<br />
Count, A-34<br />
defined, 66<br />
destination, 83, 262<br />
floating-point, 217<br />
left half, 290<br />
mapping, 80<br />
MIPS conventions, 105<br />
number specification, 252<br />
page table, 432<br />
pipeline, 308, 309, 312<br />
primitives, 66<br />
Receiver Control, A-39<br />
Receiver Data, A-38, A-39<br />
renaming, 338<br />
right half, 290<br />
spilling, 71<br />
Status, 327, A-35<br />
temporary, 67, 99<br />
Transmitter Control, A-39–40<br />
Transmitter Data, A-40<br />
usage convention, A-24<br />
use convention, A-22<br />
variables, 67<br />
Relative performance, 31–32<br />
Relative power, 41<br />
Reliability, 418<br />
Relocation information, A-13, A-14<br />
Remainder<br />
defined, 189<br />
instructions, A-55<br />
Reorder buffers, 343<br />
Replication, 468<br />
Requested word first, 392<br />
Request-level parallelism, 532<br />
Reservation stations<br />
buffering oper<strong>and</strong>s in, 340–341<br />
defined, 339–340<br />
Response time, 30–31<br />
Restartable instructions, 448<br />
Return address, 97<br />
Return from exception (ERET), 445<br />
R-format, 262<br />
ALU operations, 253<br />
defined, 83<br />
Ripple carry<br />
adder, B-29<br />
carry lookahead speed versus, B-46<br />
Roofline model, 542–543, 544, 545<br />
with ceilings, 546, 547<br />
computational roofline, 545<br />
illustrated, 542<br />
Opteron generations, 543, 544<br />
with overlapping areas shaded, 547<br />
peak floating-point performance,<br />
542<br />
peak memory performance, 543<br />
with two kernels, 547<br />
Rotational delay. See Rotational latency<br />
Rotational latency, 383<br />
Rounding, 218<br />
accurate, 218<br />
bits, 220<br />
with guard digits, 219<br />
IEEE 754 modes, 219<br />
Row-major order, 217, 413<br />
R-type instructions, 252<br />
datapath for, 264–265<br />
datapath in operation for, 266<br />
S<br />
Saturation, 181<br />
sb (Store Byte), 64<br />
sc (Store Conditional), 64<br />
ScaLAPACK, 230<br />
Scaling<br />
strong, 505, 507<br />
weak, 505<br />
Scientific notation<br />
adding numbers in, 203<br />
defined, 196<br />
for reals, 197<br />
Search engines, 4<br />
Secondary memory, 23<br />
Sectors, 381<br />
Seek, 382<br />
Segmentation, 431<br />
Selector values, B-10<br />
Semiconductors, 25<br />
Send message routine, 529<br />
Sensitivity list, B-24<br />
Sequencers<br />
explicit, D-32<br />
implementing next-state function with,<br />
D-22–28<br />
Sequential logic, B-5<br />
Servers, OL5. See also Desktop <strong>and</strong> server<br />
RISCs<br />
cost <strong>and</strong> capability, 5<br />
Service accomplishment, 418<br />
Service interruption, 418<br />
Set instructions, 93<br />
Set-associative caches, 403. See also<br />
Caches<br />
address portions, 407<br />
block replacement strategies, 457<br />
choice of, 456<br />
four-way, 404, 407<br />
memory-block location, 403<br />
misses, 405–406<br />
n-way, 403<br />
two-way, 404<br />
Setup time, B-53, B-54<br />
sh (Store Halfword), 64<br />
Shaders<br />
defined, C-14<br />
floating-point arithmetic, C-14<br />
graphics, C-14–15<br />
pixel example, C-15–17<br />
Shading languages, C-14<br />
Shadowing, OL5.11-5<br />
Shared memory. See also Memory<br />
as low-latency memory, C-21<br />
caching in, C-58–60<br />
CUDA, C-58<br />
N-body <strong>and</strong>, C-67–68<br />
per-CTA, C-39<br />
SRAM banks, C-40<br />
Shared memory multiprocessors (SMP),<br />
517–521<br />
defined, 501, 517<br />
single physical address space, 517<br />
synchronization, 518<br />
Shift amount, 82<br />
Shift instructions, 87, A-55–56<br />
Sign <strong>and</strong> magnitude, 197<br />
Sign bit, 76<br />
Sign extension, 254<br />
defined, 76<br />
shortcut, 78<br />
Signals<br />
asserted, 250, B-4<br />
control, 250, 263–264<br />
deasserted, 250, B-4<br />
Signed division, 192–194<br />
Signed multiplication, 187<br />
Signed numbers, 73–78<br />
sign <strong>and</strong> magnitude, 75<br />
treating as unsigned, 94–95<br />
Signific<strong>and</strong>s, 198<br />
addition, 203<br />
multiplication, 206<br />
Silicon, 25<br />
as key hardware technology, 53<br />
crystal ingot, 26<br />
defined, 26<br />
wafers, 26<br />
Silicon crystal ingot, 26<br />
SIMD (Single Instruction Multiple Data),<br />
507–508, 558<br />
computers, OL6.15-2–6.15-4<br />
data vector, C-35<br />
extensions, OL6.15-4<br />
for loops <strong>and</strong>, OL6.15-3<br />
massively parallel multiprocessors,<br />
OL6.15-2<br />
small-scale, OL6.15-4<br />
vector architecture, 508–510<br />
in x86, 508<br />
SIMMs (single inline memory modules),<br />
OL5.17-5, OL5.17-6<br />
Simple programmable logic devices<br />
(SPLDs), B-78<br />
Simplicity, 161<br />
Simultaneous multithreading (SMT),<br />
515–517<br />
support, 515<br />
thread-level parallelism, 517<br />
unused issue slots, 515<br />
Single error correcting/Double error<br />
correcting (SEC/DEC), 420–422<br />
Single instruction single data (SISD), 507<br />
Single precision. See also Double<br />
precision<br />
binary representation, 201<br />
defined, 198<br />
Single-clock-cycle pipeline diagrams,<br />
296–297<br />
illustrated, 299<br />
Single-cycle datapaths. See also Datapaths<br />
illustrated, 287<br />
instruction execution, 288<br />
Single-cycle implementation<br />
control function for, 269<br />
defined, 270<br />
nonpipelined execution versus<br />
pipelined execution, 276<br />
non-use of, 271–272<br />
penalty, 271–272<br />
pipelined performance versus, 274<br />
Single-instruction multiple-thread<br />
(SIMT), C-27–30<br />
multithreaded warp scheduling, C-28<br />
overhead, C-35<br />
processor architecture, C-28<br />
warp execution <strong>and</strong> divergence,<br />
C-29–30<br />
Single-program multiple data (SPMD),<br />
C-22<br />
sll (Shift Left Logical), 64<br />
slt (Set Less Than), 64<br />
slti (Set Less Than Imm.), 64<br />
sltiu (Set Less Than Imm. Unsigned), 64<br />
sltu (Set Less Than Unsig.), 64<br />
Smalltalk-80, OL2.21-8<br />
Smart phones, 7<br />
Snooping protocol, 468–470<br />
Snoopy cache coherence, OL5.12-7<br />
Software<br />
layers, 13<br />
multiprocessor, 500<br />
parallel, 501<br />
as service, 7, 532, 558<br />
systems, 13<br />
Software optimization<br />
via blocking, 413–418<br />
Sort algorithms, 141<br />
Sort procedure, 135–139. See also<br />
Procedures<br />
code for body, 135–137<br />
full procedure, 138–139<br />
passing parameters in, 138<br />
preserving registers in, 138<br />
procedure call, 137<br />
register allocation for, 135<br />
Sorting performance, C-54–55<br />
Source files, A-4<br />
Source language, A-6<br />
Space allocation<br />
on heap, 104–106<br />
on stack, 103<br />
SPARC<br />
annulling branch, E-23<br />
CASA, E-31<br />
conditional branches, E-10–12<br />
fast traps, E-30<br />
floating-point operations, E-31<br />
instructions, E-29–32<br />
least significant bits, E-31<br />
multiple precision floating-point<br />
results, E-32<br />
nonfaulting loads, E-32<br />
overlapping integer operations, E-31<br />
quadruple precision floating-point<br />
arithmetic, E-32<br />
register windows, E-29–30<br />
support for LISP <strong>and</strong> Smalltalk, E-30<br />
Sparse matrices, C-55–58<br />
Sparse Matrix-Vector multiply (SpMV),<br />
C-55, C-57, C-58<br />
CUDA version, C-57<br />
serial code, C-57<br />
shared memory version, C-59<br />
Spatial locality, 374<br />
large block exploitation of, 391<br />
tendency, 378<br />
SPEC, OL1.12-11–1.12-12<br />
CPU benchmark, 46–48<br />
power benchmark, 48–49<br />
SPEC2000, OL1.12-12<br />
SPEC2006, 233, OL1.12-12<br />
SPEC89, OL1.12-11<br />
SPEC92, OL1.12-12<br />
SPEC95, OL1.12-12<br />
SPECrate, 538–539<br />
SPECratio, 47<br />
Special function units (SFUs), C-35, C-50<br />
defined, C-43<br />
Speculation, 333–334<br />
hardware-based, 341<br />
implementation, 334<br />
performance <strong>and</strong>, 334<br />
problems, 334<br />
recovery mechanism, 334<br />
Speed-up challenge, 503–505<br />
balancing load, 505–506<br />
bigger problem, 504–505<br />
Spilling registers, 71, 98<br />
SPIM, A-40–45<br />
byte order, A-43<br />
features, A-42–43<br />
getting started with, A-42<br />
MIPS assembler directives support,<br />
A-47–49<br />
speed, A-41<br />
system calls, A-43–45<br />
versions, A-42<br />
virtual machine simulation, A-41–42<br />
Split algorithm, 552<br />
Split caches, 397<br />
Square root instructions, A-79<br />
sra (Shift Right Arith.), A-56<br />
srl (Shift Right Logical), 64<br />
Stack architectures, OL2.21-4<br />
Stack pointers<br />
adjustment, 100<br />
defined, 98<br />
values, 100<br />
Stack segment, A-22<br />
Stacks<br />
allocating space on, 103<br />
for arguments, 140<br />
defined, 98<br />
pop, 98<br />
push, 98, 100<br />
recursive procedures, A-29–30<br />
Stalls, 280<br />
as solution to control hazard, 282<br />
avoiding with code reordering, 280<br />
behavioral Verilog with detection,<br />
OL4.13-6–4.13-8<br />
data hazards <strong>and</strong>, 313–316<br />
illustrations, OL4.13-23, OL4.13-30<br />
insertion into pipeline, 315<br />
load-use, 318<br />
memory, 400<br />
write-back scheme, 399<br />
write buffer, 399<br />
St<strong>and</strong>by spares, OL5.11-8<br />
State<br />
in 2-bit prediction scheme, 322<br />
assignment, B-70, D-27<br />
bits, D-8<br />
exception, saving/restoring, 450<br />
logic components, 249<br />
specification of, 432<br />
State elements<br />
clock <strong>and</strong>, 250<br />
combinational logic <strong>and</strong>, 250<br />
defined, 248, B-48<br />
inputs, 249<br />
in storing/accessing instructions,<br />
252<br />
register file, B-50<br />
Static branch prediction, 335<br />
Static data<br />
as dynamic data, A-21<br />
defined, A-20<br />
segment, 104<br />
Static multiple-issue processors, 333,<br />
334–339. See also Multiple issue<br />
control hazards <strong>and</strong>, 335–336<br />
instruction sets, 335<br />
with MIPS ISA, 335–338<br />
Static r<strong>and</strong>om access memories (SRAMs),<br />
378, 379, B-58–62<br />
array organization, B-62<br />
basic structure, B-61<br />
defined, 21, B-58<br />
fixed access time, B-58<br />
large, B-59<br />
read/write initiation, B-59<br />
synchronous (SSRAMs), B-60<br />
three-state buffers, B-59, B-60<br />
Static variables, 102<br />
Status register<br />
fields, A-34, A-35<br />
Steady-state prediction, 321<br />
Sticky bits, 220<br />
Store buffers, 343<br />
Store instructions. See also Load<br />
instructions<br />
access, C-41<br />
base register, 262<br />
block, 149<br />
compiling with, 71<br />
conditional, 122<br />
defined, 71<br />
details, A-68–70<br />
EX stage, 294<br />
floating-point, A-79<br />
ID stage, 291<br />
IF stage, 291<br />
instruction dependency, 312<br />
list of, A-68–70<br />
MEM stage, 295<br />
unit for implementing, 255<br />
WB stage, 295<br />
Store word, 71<br />
Stored program concept, 63<br />
as computer principle, 86<br />
illustrated, 86<br />
principles, 161<br />
Strcpy procedure, 108–109. See also<br />
Procedures<br />
as leaf procedure, 109<br />
pointers, 109<br />
Stream benchmark, 548<br />
Streaming multiprocessor (SM), C-48–49<br />
Streaming processors, C-34, C-49–50<br />
array (SPA), C-41, C-46<br />
Streaming SIMD Extension 2 (SSE2)<br />
floating-point architecture, 224<br />
Streaming SIMD Extensions (SSE) <strong>and</strong><br />
advanced vector extensions in x86,<br />
224–225<br />
Stretch computer, OL4.16-2<br />
Strings<br />
defined, 107<br />
in Java, 109–111<br />
representation, 107<br />
Strip mining, 510<br />
Striping, OL5.11-4<br />
Strong scaling, 505, 517<br />
Structural hazards, 277, 294<br />
sub (Subtract), 64<br />
sub.d (FP Subtract Double), A-79<br />
sub.s (FP Subtract Single), A-80<br />
Subnormals, 222<br />
Subtraction, 178–182. See also Arithmetic<br />
binary, 178–179<br />
floating-point, 211, A-79–80<br />
instructions, A-56–57<br />
negative number, 179<br />
overflow, 179<br />
subu (Subtract Unsigned), 119<br />
Subword parallelism, 222–223, 352, E-17<br />
<strong>and</strong> matrix multiply, 225–228<br />
Sum of products, B-11, B-12<br />
Supercomputers, OL4.16-3<br />
defined, 5<br />
SuperH, E-15, E-39–40<br />
Superscalars<br />
defined, 339, OL4.16-5<br />
dynamic pipeline scheduling, 339<br />
multithreading options, 516<br />
Surfaces, C-41<br />
sw (Store Word), 64<br />
Swap procedure, 133. See also Procedures<br />
body code, 135<br />
full, 135, 138–139<br />
register allocation, 133<br />
Swap space, 434<br />
swc1 (Store FP Single), A-73<br />
Symbol tables, 125, A-12, A-13<br />
Synchronization, 121–123, 552<br />
barrier, C-18, C-20, C-34<br />
defined, 518<br />
lock, 121<br />
overhead, reducing, 44–45<br />
unlock, 121<br />
Synchronizers<br />
defined, B-76<br />
failure, B-77<br />
from D flip-flop, B-76<br />
Synchronous DRAM (SDRAM), 379–380,<br />
B-60, B-65<br />
Synchronous SRAM (SSRAM), B-60<br />
Synchronous system, B-48<br />
Syntax tree, OL2.15-3<br />
System calls, A-43–45<br />
code, A-43–44<br />
defined, 445<br />
loading, A-43<br />
Systems software, 13<br />
SystemVerilog<br />
cache controller, OL5.12-2<br />
cache data <strong>and</strong> tag modules, OL5.12-6<br />
FSM, OL5.12-7<br />
simple cache block diagram, OL5.12-4<br />
type declarations, OL5.12-2<br />
T<br />
Tablets, 7<br />
Tags<br />
defined, 384<br />
in locating block, 407<br />
page tables <strong>and</strong>, 434<br />
size of, 409<br />
Tail call, 105–106<br />
Task identifiers, 446<br />
Task parallelism, C-24<br />
Task-level parallelism, 500<br />
Tebibyte (TiB), 5<br />
Tesla PTX ISA, C-31–34<br />
arithmetic instructions, C-33<br />
barrier synchronization, C-34<br />
GPU thread instructions, C-32<br />
memory access instructions, C-33–34<br />
Temporal locality, 374<br />
tendency, 378<br />
Temporary registers, 67, 99<br />
Terabyte (TB), 6<br />
defined, 5<br />
Text segment, A-13<br />
Texture memory, C-40<br />
Texture/processor cluster (TPC),<br />
C-47–48<br />
TFLOPS multiprocessor, OL6.15-6<br />
Thrashing, 453<br />
Thread blocks, 528<br />
creation, C-23<br />
defined, C-19<br />
managing, C-30<br />
memory sharing, C-20<br />
synchronization, C-20<br />
Thread parallelism, C-22<br />
Threads<br />
creation, C-23<br />
CUDA, C-36<br />
ISA, C-31–34<br />
managing, C-30<br />
memory latencies <strong>and</strong>, C-74–75<br />
multiple, per body, C-68–69<br />
warps, C-27<br />
Three Cs model, 459–461<br />
Three-state buffers, B-59, B-60<br />
Throughput<br />
defined, 30–31<br />
multiple issue <strong>and</strong>, 342<br />
pipelining <strong>and</strong>, 286, 342<br />
Thumb, E-15, E-38<br />
Timing<br />
asynchronous inputs, B-76–77<br />
level-sensitive, B-75–76<br />
methodologies, B-72–77<br />
two-phase, B-75<br />
TLB misses, 439. See also Translation-lookaside<br />
buffer (TLB)<br />
entry point, 449<br />
h<strong>and</strong>ler, 449<br />
h<strong>and</strong>ling, 446–453<br />
occurrence, 446<br />
problem, 453<br />
Tomasulo’s algorithm, OL4.16-3<br />
Touchscreen, 19<br />
Tournament branch predictors, 324<br />
Tracks, 381–382<br />
Transfer time, 383<br />
Transistors, 25<br />
Translation-lookaside buffer (TLB),<br />
438–439, E-26–27, OL5.17-6. See<br />
also TLB misses<br />
associativities, 439<br />
illustrated, 438<br />
integration, 440–441<br />
Intrinsity FastMATH, 440<br />
typical values, 439<br />
Transmit driver <strong>and</strong> NIC hardware time<br />
versus receive driver <strong>and</strong> NIC hardware<br />
time, OL6.9-8<br />
Transmitter Control register, A-39–40<br />
Transmitter Data register, A-40<br />
Trap instructions, A-64–66<br />
Tree-based parallel scan, C-62<br />
Truth tables, B-5<br />
ALU control lines, D-5<br />
for control bits, 260–261<br />
datapath control outputs, D-17<br />
datapath control signals, D-14<br />
defined, 260<br />
example, B-5<br />
next-state output bits, D-15<br />
PLA implementation, B-13<br />
Two’s complement representation, 75–76<br />
advantage, 75–76<br />
negation shortcut, 76<br />
rule, 79<br />
sign extension shortcut, 78<br />
Two-level logic, B-11–14<br />
Two-phase clocking, B-75<br />
TX-2 computer, OL6.15-4<br />
U<br />
Unconditional branches, 91<br />
Underflow, 198<br />
Unicode<br />
alphabets, 109<br />
defined, 110<br />
example alphabets, 110<br />
Unified GPU architecture, C-10–12<br />
illustrated, C-11<br />
processor array, C-11–12<br />
Uniform memory access (UMA), 518,<br />
C-9<br />
multiprocessors, 519<br />
Units<br />
commit, 339–340, 343<br />
control, 247–248, 259–261, D-4–8,<br />
D-10, D-12–13<br />
defined, 219<br />
floating point, 219<br />
hazard detection, 313, 314–315<br />
for load/store implementation, 255<br />
special function (SFUs), C-35, C-43,<br />
C-50<br />
UNIVAC I, OL1.12-5<br />
UNIX, OL2.21-8, OL5.17-9–5.17-12<br />
AT&T, OL5.17-10<br />
Berkeley version (BSD), OL5.17-10<br />
genius, OL5.17-12<br />
history, OL5.17-9–5.17-12<br />
Unlock synchronization, 121<br />
Unresolved references<br />
defined, A-4<br />
linkers <strong>and</strong>, A-18<br />
Unsigned numbers, 73–78<br />
Use latency<br />
defined, 336–337<br />
one-instruction, 336–337<br />
V<br />
Vacuum tubes, 25<br />
Valid bit, 386<br />
Variables<br />
C language, 102<br />
programming language, 67<br />
register, 67<br />
static, 102<br />
storage class, 102<br />
type, 102<br />
VAX architecture, OL2.21-4, OL5.17-7<br />
Vector lanes, 512<br />
Vector processors, 508–510. See also<br />
Processors<br />
conventional code comparison,<br />
509–510<br />
instructions, 510<br />
multimedia extensions <strong>and</strong>, 511–512<br />
scalar versus, 510–511<br />
Vectored interrupts, 327<br />
Verilog<br />
behavioral definition of MIPS ALU,<br />
B-25<br />
behavioral definition with bypassing,<br />
OL4.13-4–4.13-6<br />
behavioral definition with stalls for<br />
loads, OL4.13-6–4.13-8<br />
behavioral specification, B-21, OL4.13-2–4.13-4<br />
behavioral specification of multicycle<br />
MIPS design, OL4.13-12–4.13-13<br />
behavioral specification with<br />
simulation, OL4.13-2<br />
behavioral specification with stall<br />
detection, OL4.13-6–4.13-8<br />
behavioral specification with synthesis,<br />
OL4.13-11–4.13-16<br />
blocking assignment, B-24<br />
branch hazard logic implementation,<br />
OL4.13-8–4.13-10<br />
combinational logic, B-23–26<br />
datatypes, B-21–22<br />
defined, B-20<br />
forwarding implementation,<br />
OL4.13-4<br />
MIPS ALU definition in, B-35–38<br />
modules, B-23<br />
multicycle MIPS datapath, OL4.13-14<br />
nonblocking assignment, B-24<br />
operators, B-22<br />
program structure, B-23<br />
reg, B-21–22<br />
sensitivity list, B-24<br />
sequential logic specification, B-56–58<br />
structural specification, B-21<br />
wire, B-21–22<br />
Vertical microcode, D-32<br />
Very large-scale integrated (VLSI)<br />
circuits, 25<br />
Very Long Instruction Word (VLIW)<br />
defined, 334–335<br />
first generation computers, OL4.16-5<br />
processors, 335<br />
VHDL, B-20–21<br />
Video graphics array (VGA) controllers,<br />
C-3–4<br />
Virtual addresses<br />
causing page faults, 449<br />
defined, 428<br />
mapping from, 428–429<br />
size, 430<br />
Virtual machine monitors (VMMs)<br />
defined, 424<br />
implementing, 481–482<br />
laissez-faire attitude, 481<br />
page tables, 452<br />
in performance improvement, 427<br />
requirements, 426<br />
Virtual machines (VMs), 424–427<br />
benefits, 424<br />
defined, A-41<br />
illusion, 452<br />
instruction set architecture support,<br />
426–427<br />
performance improvement, 427<br />
for protection improvement, 424<br />
simulation of, A-41–42<br />
Virtual memory, 427–454. See also Pages<br />
address translation, 429, 438–439<br />
integration, 440–441<br />
mechanism, 452–453<br />
motivations, 427–428<br />
page faults, 428, 434<br />
protection implementation,<br />
444–446<br />
segmentation, 431<br />
summary, 452–453<br />
virtualization of, 452<br />
writes, 437<br />
Virtualizable hardware, 426<br />
Virtually addressed caches, 443<br />
Visual computing, C-3<br />
Volatile memory, 22<br />
W<br />
Wafers, 26<br />
defects, 26<br />
dies, 26–27<br />
yield, 27<br />
Warehouse Scale <strong>Computer</strong>s (WSCs), 7,<br />
531–533, 558<br />
Warps, 528, C-27<br />
Weak scaling, 505<br />
Wear levelling, 381<br />
While loops, 92–93<br />
Whirlwind, OL5.17-2<br />
Wide area networks (WANs), 24. See also<br />
Networks<br />
Words<br />
accessing, 68<br />
defined, 66<br />
double, 152<br />
load, 68, 71<br />
quad, 154<br />
store, 71<br />
Working set, 453<br />
World Wide Web, 4<br />
Worst-case delay, 272<br />
Write buffers<br />
defined, 394<br />
stalls, 399<br />
write-back cache, 395<br />
Write invalidate protocols, 468, 469<br />
Write serialization, 467<br />
Write-back caches. See also Caches<br />
advantages, 458<br />
cache coherency protocol, OL5.12-5<br />
complexity, 395<br />
defined, 394, 458<br />
stalls, 399<br />
write buffers, 395<br />
Write-back stage<br />
control line, 302<br />
load instruction, 292<br />
store instruction, 294<br />
Writes<br />
complications, 394<br />
expense, 453<br />
h<strong>and</strong>ling, 393–395<br />
memory hierarchy h<strong>and</strong>ling of,<br />
457–458<br />
schemes, 394<br />
virtual memory, 437<br />
write-back cache, 394, 395<br />
write-through cache, 394, 395<br />
Write-stall cycles, 400<br />
Write-through caches. See also Caches<br />
advantages, 458<br />
defined, 393, 457<br />
tag mismatch, 394<br />
X<br />
x86, 149–158<br />
Advanced Vector Extensions in, 225<br />
brief history, OL2.21-6<br />
conclusion, 156–158<br />
data addressing modes, 152, 153–154<br />
evolution, 149–152<br />
first address specifier encoding, 158<br />
historical timeline, 149–152<br />
instruction encoding, 155–156<br />
instruction formats, 157<br />
instruction set growth, 161<br />
instruction types, 153<br />
integer operations, 152–155<br />
registers, 152, 153–154<br />
SIMD in, 507–508<br />
Streaming SIMD Extensions in,<br />
224–225<br />
typical instructions/functions, 155<br />
typical operations, 157<br />
Xerox Alto computer, OL1.12-8<br />
XMM, 224<br />
Y<br />
Yahoo! Cloud Serving Benchmark<br />
(YCSB), 540<br />
Yield, 27<br />
YMM, 225<br />
Z<br />
Zettabyte, 6