
Computer Architecture

Contents

1 Computer architecture 1
1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Subcategories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Instruction set architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.3 Computer organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Design goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.2 Power consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Computer hardware 6
2.1 Von Neumann architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Different systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Personal computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Mainframe computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Departmental computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 Supercomputer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Moore’s law 10
3.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 As a target for industry and a self-fulfilling prophecy . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Moore’s second law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Major enabling factors and future trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


3.3.1 Ultimate limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


3.4 Consequences and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Other formulations and similar observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.6 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.9 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.10 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Amdahl’s law 22
4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Relation to law of diminishing returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 Speedup in a sequential program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.10 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.11 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Von Neumann architecture 26


5.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Development of the stored-program concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 Early von Neumann-architecture computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.4 Early stored-program computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.5 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.6 Von Neumann bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.7 Non–von Neumann processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.8 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.9.1 Inline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.9.2 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.10 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6 Harvard architecture 33
6.1 Memory details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.1 Contrast with von Neumann architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.2 Contrast with modified Harvard architecture . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.2 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2.1 Internal vs. external design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

6.3 Modern uses of the Harvard architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


6.4 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7 Microarchitecture 35
7.1 Relation to instruction set architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.2 Aspects of microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 Microarchitectural concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3.1 Instruction cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3.2 Increasing execution speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.3.3 Instruction set choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.3.4 Instruction pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.3.5 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.3.6 Branch prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3.7 Superscalar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3.8 Out-of-order execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3.9 Register renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3.10 Multiprocessing and multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

8 Central processing unit 41


8.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.1.1 Transistor and integrated circuit CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.1.2 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.2.1 Fetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.2.2 Decode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.2.3 Execute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3 Design and implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3.1 Control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3.2 Arithmetic logic unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3.3 Integer range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.3.4 Clock rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.3.5 Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
8.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

9 Microprocessor 52

9.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.1.1 Special-purpose designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.2 Embedded applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.3 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.3.1 Firsts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.3.2 8-bit designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.3.3 12-bit designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.3.4 16-bit designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.3.5 32-bit designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
9.3.6 64-bit designs in personal computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
9.3.7 RISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
9.4 Market statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
9.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
9.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
9.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.8 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

10 Processor design 63
10.1 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.1.1 Micro-architectural concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
10.1.2 Research topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
10.1.3 Performance analysis and benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
10.2 Markets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
10.2.1 General purpose computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.2.2 Scientific computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.2.3 Embedded design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.3 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
10.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

11 History of general-purpose CPUs 67


11.1 1950s: early designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
11.2 1960s: the computer revolution and CISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
11.3 1970s: Large Scale Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
11.4 Early 1980s: the lessons of RISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
11.5 Mid-to-late 1980s: exploiting instruction level parallelism . . . . . . . . . . . . . . . . . . . . . . 70
11.6 1990 to today: looking forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
11.6.1 VLIW and EPIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
11.6.2 Multi-threading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
11.6.3 Multi-core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
11.6.4 Reconfigurable logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
11.6.5 Open source processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
11.6.6 Asynchronous CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

11.6.7 Optical communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72


11.6.8 Optical processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
11.6.9 Belt Machine Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
11.7 Timeline of events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
11.8 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
11.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
11.10 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

12 Comparison of CPU microarchitectures 75


12.1 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
12.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

13 Reduced instruction set computing 76


13.1 History and development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
13.2 Characteristics and design philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
13.2.1 Instruction set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
13.2.2 Hardware utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
13.3 Comparison to other architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
13.4 RISC: from cell phones to supercomputers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
13.4.1 Low end and mobile systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
13.4.2 High end RISC and supercomputing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
13.5 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
13.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
13.7 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

14 Complex instruction set computing 83


14.1 Historical design context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
14.1.1 Incitements and benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
14.1.2 Design issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
14.2 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
14.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
14.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
14.5 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
14.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

15 Minimal instruction set computer 86


15.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
15.2 Design weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
15.3 Notable CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
15.4 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
15.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
15.6 External links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

16 Comparison of instruction set architectures 89


16.1 Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
16.1.1 Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
16.1.2 Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
16.1.3 Endianness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
16.2 Instruction sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
16.3 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
16.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

17 Computer data storage 91


17.1 Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
17.2 Data organization and representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
17.3 Hierarchy of storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
17.3.1 Primary storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
17.3.2 Secondary storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
17.3.3 Tertiary storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
17.3.4 Off-line storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
17.4 Characteristics of storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
17.4.1 Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
17.4.2 Mutability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
17.4.3 Accessibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
17.4.4 Addressability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
17.4.5 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
17.4.6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
17.4.7 Energy use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
17.5 Storage Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
17.5.1 Semiconductor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
17.5.2 Magnetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
17.5.3 Optical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
17.5.4 Paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
17.5.5 Other Storage Media or Substrates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
17.6 Related technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
17.6.1 Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
17.6.2 Network connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
17.6.3 Robotic storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
17.7 See also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
17.7.1 Primary storage topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
17.7.2 Secondary, tertiary and off-line storage topics . . . . . . . . . . . . . . . . . . . . . . . . 99
17.7.3 Data storage conferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
17.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
17.9 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
17.10 Text and image sources, contributors, and licenses . . . . . . . . . . . . . . . . . . . . . . . . 101

17.10.1 Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


17.10.2 Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
17.10.3 Content license . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 1

Computer architecture

In computer engineering,[1] computer architecture is a set of disciplines that describes the functionality, the organization and the implementation of computer systems; that is, it defines the capabilities of a computer and its programming model in an abstract way, and how the internal organization of the system is designed and implemented to meet the specified capabilities.[2][3] Computer architecture involves many aspects, including instruction set architecture design, microarchitecture design, logic design, and implementation.[4] Some fashionable (2011) computer architectures include cluster computing and non-uniform memory access.

Computer architects use computers to design new computers. Emulation software can run programs written in a proposed instruction set. While the design is very easy to change at this stage, compiler designers often collaborate with the architects, suggesting improvements in the instruction set. Modern emulators may measure time in clock cycles, estimate energy consumption in joules, and give realistic estimates of code size in bytes. These affect the convenience of the user, the power consumption, and the size and expense of the computer's largest physical part: its memory. That is, they help to estimate the value of a computer design.

[Figure: Pipelined implementation of the MIPS architecture, with instruction fetch (IF), instruction decode/register fetch (ID), execute/address calculation (EX), memory access (MEM) and write back (WB) stages. Pipelining is a key concept in computer architecture.]

1.1 History

The first documented computer architecture was in the correspondence between Charles Babbage and Ada Lovelace, describing the analytical engine. Two other early and important examples were:

• John von Neumann's 1945 paper, First Draft of a Report on the EDVAC, which described an organization of logical elements; and

• Alan Turing's more detailed Proposed Electronic Calculator for the Automatic Computing Engine, also 1945 and which cited von Neumann's paper.[5]

The term "architecture" in computer literature can be traced to the work of Lyle R. Johnson, Mohammad Usman Khan and Frederick P. Brooks, Jr., members in 1959 of the Machine Organization department in IBM's main research center. Johnson had the opportunity to write a proprietary research communication about the Stretch, an IBM-developed supercomputer for Los Alamos Scientific Laboratory. To describe the level of detail for discussing the luxuriously embellished computer, he noted that his description of formats, instruction types, hardware parameters, and speed enhancements was at the level of "system architecture" – a term that seemed more useful than "machine organization."

Subsequently, Brooks, a Stretch designer, started Chapter 2 of a book (Planning a Computer System: Project Stretch, ed. W. Buchholz, 1962) by writing,

    Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints.

Brooks went on to help develop the IBM System/360 (now called the IBM zSeries) line of computers, in which "architecture" became a noun defining "what the user needs to know". Later, computer users came to use the term in many less-explicit ways.


The earliest computer architectures were designed on paper and then directly built into the final hardware form.[6] Later, computer architecture prototypes were physically built in the form of a Transistor–Transistor Logic (TTL) computer—such as the prototypes of the 6800 and the PA-RISC—tested, and tweaked, before committing to the final hardware form. As of the 1990s, new computer architectures are typically "built", tested, and tweaked—inside some other computer architecture in a computer architecture simulator, or inside an FPGA as a soft microprocessor, or both—before committing to the final hardware form.

1.2 Subcategories

The discipline of computer architecture has three main subcategories:[7]

• Instruction Set Architecture, or ISA. The ISA defines the codes that a central processor reads and acts upon. It is the machine language (or assembly language), including the instruction set, word size, memory address modes, processor registers, and address and data formats.

• Microarchitecture, also known as computer organization, describes the data paths, data processing elements and data storage elements, and describes how they should implement the ISA.[8] The size of a computer's CPU cache, for instance, is an organizational issue that generally has nothing to do with the ISA.

• System Design includes all of the other hardware components within a computing system. These include:

1. Data paths, such as computer buses and switches
2. Memory controllers and hierarchies
3. Data processing other than the CPU, such as direct memory access (DMA)
4. Miscellaneous issues such as virtualization, multiprocessing and software features.

Some architects at companies such as Intel and AMD use finer distinctions:

• Macroarchitecture: architectural layers more abstract than microarchitecture, e.g. ISA

• Instruction Set Architecture (ISA): as above but without:

    • Assembly ISA: a smart assembler may convert an abstract assembly language common to a group of machines into slightly different machine language for different implementations

• Programmer Visible Macroarchitecture: higher level language tools such as compilers may define a consistent interface or contract to programmers using them, abstracting differences between underlying ISA, UISA, and microarchitectures. E.g. the C, C++, or Java standards define different Programmer Visible Macroarchitectures.

• UISA (Microcode Instruction Set Architecture)—a family of machines with different hardware level microarchitectures may share a common microcode architecture, and hence a UISA.

• Pin Architecture: the hardware functions that a microprocessor should provide to a hardware platform, e.g., the x86 pins A20M, FERR/IGNNE or FLUSH. Also, messages that the processor should emit so that external caches can be invalidated (emptied). Pin architecture functions are more flexible than ISA functions because external hardware can adapt to new encodings, or change from a pin to a message. The term "architecture" fits, because the functions must be provided for compatible systems, even if the detailed method changes.

1.3 The Roles

1.3.1 Definition

The purpose is to design a computer that maximizes performance while keeping power consumption in check, costs low relative to the amount of expected performance, and is also very reliable. For this, many aspects are to be considered, including Instruction Set Design, Functional Organization, Logic Design, and Implementation. The implementation involves Integrated Circuit Design, Packaging, Power, and Cooling. Optimization of the design requires familiarity with topics ranging from Compilers and Operating Systems to Logic Design and Packaging.

1.3.2 Instruction set architecture

Main article: Instruction set architecture

An instruction set architecture (ISA) is the interface between the computer's software and hardware and can also be viewed as the programmer's view of the machine. Computers do not understand high level languages, which have few, if any, language elements that translate directly into a machine's native opcodes. A processor only understands instructions encoded in some numerical fashion, usually as binary numbers. Software tools, such as compilers, translate high level languages, such as C, into instructions.

Besides instructions, the ISA defines items in the computer that are available to a program—e.g. data types, registers, addressing modes, and memory. Instructions locate operands with register indexes (or names) and memory addressing modes.

The ISA of a computer is usually described in a small book or pamphlet, which describes how the instructions are encoded. Also, it may define short (vaguely) mnemonic names for the instructions. The names can be recognized by a software development tool called an assembler. An assembler is a computer program that translates a human-readable form of the ISA into a computer-readable form. Disassemblers are also widely available, usually in debuggers, software programs to isolate and correct malfunctions in binary computer programs.

ISAs vary in quality and completeness. A good ISA compromises between programmer convenience (more operations can be better), cost of the computer to interpret the instructions (cheaper is better), speed of the computer (faster is better), and size of the code (smaller is better). For example, a single-instruction ISA is possible, inexpensive, and fast (e.g., subtract and jump if zero, which was actually used in the SSEM), but it was not convenient or helpful for making programs small. Memory organization defines how instructions interact with the memory, and also how different parts of memory interact with each other.
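The single-instruction idea is easiest to see in code. What follows is a minimal, hypothetical Python sketch of an interpreter for a machine in the "subtract and branch" family, often called subleq; it only illustrates how little an ISA can contain, and its memory layout, operand encoding and halt convention are assumptions made for this example, not the SSEM's actual instruction format.

    def run_subleq(mem, pc=0, max_steps=10_000):
        """Interpret a one-instruction program: each instruction is a triple
        (a, b, target) meaning mem[b] -= mem[a]; branch to target if the
        result is <= 0, otherwise fall through to the next triple."""
        for _ in range(max_steps):
            a, b, target = mem[pc], mem[pc + 1], mem[pc + 2]
            mem[b] -= mem[a]
            if mem[b] <= 0:
                if target < 0:          # negative target is the assumed halt
                    return mem
                pc = target
            else:
                pc += 3
        raise RuntimeError("step limit exceeded")

    # Tiny demo: count mem[7] (initially 3) down to zero by repeatedly
    # subtracting mem[6] (which holds 1); mem[8] is a scratch cell whose
    # self-subtraction acts as an unconditional jump back to the start.
    memory = [6, 7, -1,   # mem[7] -= mem[6]; halt once it reaches zero
              8, 8, 0,    # mem[8] -= mem[8] is always <= 0, so jump to 0
              1, 3, 0]    # data: constant 1, counter 3, scratch 0
    print(run_subleq(memory))

Even such a machine needs several triples and spare memory cells to express a simple loop, which is the point made above about code size.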
1.3.3 Computer organization

Main article: Microarchitecture

Computer organization helps optimize performance-based products. For example, software engineers need to know the processing ability of processors. They may need to optimize software in order to gain the most performance at the least expense. This can require quite detailed analysis of the computer organization. For example, in a multimedia decoder, the designers might need to arrange for most data to be processed in the fastest data path; the various components are assumed to be in place, and the task is to investigate the organizational structure to verify that the computer's parts operate correctly.

Computer organization also helps plan the selection of a processor for a particular project. Multimedia projects may need very rapid data access, while supervisory software may need fast interrupts. Sometimes certain tasks need additional components as well. For example, a computer capable of virtualization needs virtual memory hardware so that the memory of different simulated computers can be kept separated. Computer organization and features also affect power consumption and processor cost.

1.3.4 Implementation

Once an instruction set and micro-architecture are described, a practical machine must be designed. This design process is called the implementation. Implementation is usually not considered architectural definition, but rather hardware design engineering. Implementation can be further broken down into several (not fully distinct) steps:

• Logic Implementation designs the blocks defined in the micro-architecture at (primarily) the register-transfer level and logic gate level.

• Circuit Implementation does transistor-level designs of basic elements (gates, multiplexers, latches etc.) as well as of some larger blocks (ALUs, caches etc.) that may be implemented at this level, or even (partly) at the physical level, for performance reasons.

• Physical Implementation draws physical circuits. The different circuit components are placed in a chip floorplan or on a board, and the wires connecting them are routed.

• Design Validation tests the computer as a whole to see if it works in all situations and all timings. Once implementation starts, the first design validations are simulations using logic emulators. However, this is usually too slow to run realistic programs. So, after making corrections, prototypes are constructed using Field-Programmable Gate Arrays (FPGAs). Many hobby projects stop at this stage. The final step is to test prototype integrated circuits. Integrated circuits may require several redesigns to fix problems.

For CPUs, the entire implementation process is often called CPU design.

1.4 Design goals

The exact form of a computer system depends on the constraints and goals. Computer architectures usually trade off standards, power versus performance, cost, memory capacity, latency (latency is the amount of time that it takes for information from one node to travel to the source) and throughput. Sometimes other considerations, such as features, size, weight, reliability, and expandability are also factors.

The most common scheme does an in-depth power analysis and figures out how to keep power consumption low, while maintaining adequate performance.

1.4.1 Performance

Modern computer performance is often described in IPC (instructions per cycle). This measures the efficiency of the architecture at any clock speed. Since a faster clock can make a faster computer, this is a useful, widely applicable measurement. Historic computers had IPC counts as low as 0.1 (see instructions per second). Simple modern processors easily reach near 1. Superscalar processors may reach three to five by executing several instructions per clock cycle. Multicore and vector processing CPUs can multiply this further by acting on a lot of data per instruction, or by having several CPUs executing in parallel.

Counting machine language instructions would be misleading because they can do varying amounts of work in different ISAs. The "instruction" in the standard measurements is not a count of the ISA's actual machine language instructions, but a historical unit of measurement, usually based on the speed of the VAX computer architecture.

Historically, many people measured a computer's speed by the clock rate (usually in MHz or GHz). This refers to the cycles per second of the main clock of the CPU. However, this metric is somewhat misleading, as a machine with a higher clock rate may not necessarily have higher performance. As a result, manufacturers have moved away from clock speed as a measure of performance.

Other factors influence speed, such as the mix of functional units, bus speeds, available memory, and the type and order of instructions in the programs being run.

In a typical home computer, the simplest, most reliable way to speed performance is usually to add random access memory (RAM). More RAM increases the likelihood that needed data or a program is in RAM—so the system is less likely to need to move memory data from the disk. The disk is often ten thousand times slower than RAM because it has mechanical parts that must move to access its data.

There are two main types of speed, latency and throughput. Latency is the time between the start of a process and its completion. Throughput is the amount of work done per unit time. Interrupt latency is the guaranteed maximum response time of the system to an electronic event (e.g. when the disk drive finishes moving some data).

Performance is affected by a very wide range of design choices — for example, pipelining a processor usually makes latency worse (slower) but makes throughput better. Computers that control machinery usually need low interrupt latencies. These computers operate in a real-time environment and fail if an operation is not completed in a specified amount of time. For example, computer-controlled anti-lock brakes must begin braking within a predictable, short time after the brake pedal is sensed.

The performance of a computer can be measured using other metrics, depending upon its application domain. A system may be CPU bound (as in numerical calculation), I/O bound (as in a webserving application) or memory bound (as in video editing). Power consumption has become important in servers and portable devices like laptops.

Benchmarking tries to take all these factors into account by measuring the time a computer takes to run through a series of test programs. Although benchmarking shows strengths, it may not help one to choose a computer. Often the measured machines split on different measures. For example, one system might handle scientific applications quickly, while another might play popular video games more smoothly. Furthermore, designers may add special features to their products, in hardware or software, that permit a specific benchmark to execute quickly but don't offer similar advantages to general tasks.
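As a rough illustration of how IPC and clock rate combine, the hedged sketch below uses the first-order model time = instructions / (IPC × clock rate); the instruction count, IPC values and clock frequencies are invented for the example and are not measurements of any real processor.

    def execution_time_seconds(instruction_count, ipc, clock_hz):
        """First-order performance model: seconds = instructions / (IPC * Hz)."""
        return instruction_count / (ipc * clock_hz)

    # Hypothetical comparison of two designs running the same 6 billion
    # instructions: the higher-clocked design is not the faster one.
    workload = 6e9
    print(execution_time_seconds(workload, ipc=1.5, clock_hz=3e9))  # ~1.33 s
    print(execution_time_seconds(workload, ipc=3.0, clock_hz=2e9))  # 1.0 s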
1.4.2 Power consumption

Main article: low-power electronics

Power consumption is another measurement that is important in modern computers. Power efficiency can often be traded for speed or lower cost. The typical measurement in this case is MIPS/W (millions of instructions per second per watt).

Modern circuits have less power per transistor as the number of transistors per chip grows. Therefore, power efficiency has increased in importance. Recent processor designs, such as Intel's Haswell microarchitecture, put more emphasis on increasing power efficiency. Also, in the world of embedded computing, power efficiency has long been and remains an important goal next to throughput and latency.
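The MIPS/W figure of merit mentioned above is straightforward to compute; the short sketch below does so for two hypothetical design points to show how efficiency can diverge from raw speed (all numbers are invented for illustration).

    def mips_per_watt(instructions_per_second, watts):
        """Millions of instructions per second delivered per watt of power."""
        return (instructions_per_second / 1e6) / watts

    # Hypothetical: a fast 95 W desktop part versus a slower 10 W embedded part.
    print(mips_per_watt(50_000e6, 95))  # about 526 MIPS/W
    print(mips_per_watt(8_000e6, 10))   # 800 MIPS/W: slower, but more efficient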
1.5 See also

• Comparison of CPU architectures
• Computer hardware
• CPU design
• Floating point
• Harvard architecture
• Influence of the IBM PC on the personal computer market
• Orthogonal instruction set
• Software architecture
• von Neumann architecture

1.6 Notes

• John L. Hennessy and David Patterson (2006). Computer Architecture: A Quantitative Approach (Fourth Edition ed.). Morgan Kaufmann. ISBN 978-0-12-370490-0.

• Barton, Robert S., "Functional Design of Computers", Communications of the ACM 4(9): 405 (1961).

• Barton, Robert S., "A New Approach to the Functional Design of a Digital Computer", Proceedings of the Western Joint Computer Conference, May 1961, pp. 393–396. About the design of the Burroughs B5000 computer.

• Bell, C. Gordon; and Newell, Allen (1971). "Computer Structures: Readings and Examples", McGraw-Hill.

• Blaauw, G.A., and Brooks, F.P., Jr., "The Structure of System/360, Part I-Outline of the Logical Structure", IBM Systems Journal, vol. 3, no. 2, pp. 119–135, 1964.

• Tanenbaum, Andrew S. (1979). Structured Computer Organization. Englewood Cliffs, New Jersey: Prentice-Hall. ISBN 0-13-148521-0.

1.7 References

[1] Curriculum Guidelines for Undergraduate Degree Programs in Computer Engineering (PDF). Association for Computing Machinery. 2004. p. 60. Computer architecture is a key component of computer engineering and the practicing computer engineer should have a practical understanding of this topic...

[2] Hennessy, John; Patterson, David. Computer Architecture: A Quantitative Approach (Fifth Edition ed.). p. 11.

[3] Clements, Alan. Principles of Computer Hardware (Fourth Edition ed.). p. 1. Architecture describes the internal organization of a computer in an abstract way; that is, it defines the capabilities of the computer and its programming model. You can have two computers that have been constructed in different ways with different technologies but with the same architecture.

[4] Hennessy, John; Patterson, David. Computer Architecture: A Quantitative Approach (Fifth Edition ed.). p. 11. This task has many aspects, including instruction set design, functional organization, logic design, and implementation.

[5] Reproduced in B. J. Copeland (Ed.), "Alan Turing's Automatic Computing Engine", OUP, 2005, pp. 369–454.

[6] ACE underwent seven paper designs in one year, before a prototype was initiated in 1948. [B. J. Copeland (Ed.), "Alan Turing's Automatic Computing Engine", OUP, 2005, p. 57]

[7] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach (Third Edition ed.). Morgan Kaufmann Publishers.

[8] Laplante, Phillip A. (2001). Dictionary of Computer Science, Engineering, and Technology. CRC Press. pp. 94–95. ISBN 0-8493-2691-5.

1.8 External links

• ISCA: Proceedings of the International Symposium on Computer Architecture
• Micro: IEEE/ACM International Symposium on Microarchitecture
• HPCA: International Symposium on High Performance Computer Architecture
• ASPLOS: International Conference on Architectural Support for Programming Languages and Operating Systems
• ACM Transactions on Computer Systems
• ACM Transactions on Architecture and Code Optimization
• IEEE Transactions on Computers
• The von Neumann Architecture of Computer Systems
Chapter 2

Computer hardware

For other uses, see Hardware.

[Figure: PDP-11 CPU board.]

Computer hardware (usually simply called hardware when a computing context is implicit) is the collection of physical elements that constitutes a computer system. Computer hardware is the physical parts or components of a computer, such as the monitor, mouse, keyboard, computer data storage, hard disk drive (HDD), system unit (graphic cards, sound cards, memory, motherboard and chips), and so on, all of which are physical objects that can be touched (that is, they are tangible).[1] In contrast, software is instructions that can be stored and run by hardware.

Software is any set of machine-readable instructions that directs a computer's processor to perform specific operations. A combination of hardware and software forms a usable computing system.[2]

2.1 Von Neumann architecture

Main article: Von Neumann architecture

[Figure: Von Neumann architecture scheme.]

The template for all modern computers is the Von Neumann architecture, detailed in a 1945 paper by Hungarian mathematician John von Neumann. This describes a design architecture for an electronic digital computer with subdivisions of a processing unit consisting of an arithmetic logic unit and processor registers, a control unit containing an instruction register and program counter, a memory to store both data and instructions, external mass storage, and input and output mechanisms.[3] The meaning of the term has evolved to mean a stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the Von Neumann bottleneck,[4] and it often limits the performance of the system.

2.2 Sales

For the third consecutive year, U.S. business-to-business channel sales (sales through distributors and commercial resellers) increased, ending 2013 up nearly 6 percent at $61.7 billion. The impressive growth was the fastest sales increase since the end of the recession. Sales growth accelerated in the second half of the year, peaking in the fourth quarter with a 6.9 percent increase over the fourth quarter of 2012.[5]

6
2.3. DIFFERENT SYSTEMS 7

2.3 Different systems

There are a number of different types of computer system in use today.

2.3.1 Personal computer

[Figure: Hardware of a modern personal computer. 1. Monitor 2. Motherboard 3. CPU 4. RAM 5. Expansion cards 6. Power supply 7. Optical disc drive 8. Hard disk drive 9. Keyboard 10. Mouse]

[Figure: Inside a custom-built computer: the power supply at the bottom has its own cooling fan.]

The personal computer, also known as the PC, is one of the most common types of computer due to its versatility and relatively low price. Laptops are generally very similar, although they may use lower-power or reduced-size components.

Case

Main article: Computer case

The computer case is a plastic or metal enclosure that houses most of the components. Those found on desktop computers are usually small enough to fit under a desk; however, in recent years more compact designs have become more commonplace, such as the all-in-one style designs from Apple, namely the iMac. Laptops are computers that usually come in a clamshell form factor, although in more recent years deviations from this form factor have started to emerge, such as laptops that have a detachable screen that becomes a tablet computer in its own right.

Power supply

Main article: Power supply unit (computer)

A power supply unit (PSU) converts alternating current (AC) electric power to low-voltage DC power for the internal components of the computer. Laptops are capable of running from a built-in battery, normally for a period of hours.[6]

Motherboard

Main article: Motherboard

The motherboard is the main component of a computer. It is a large rectangular board with integrated circuitry that connects the other parts of the computer, including the CPU, the RAM, the disk drives (CD, DVD, hard disk, or any others), as well as any peripherals connected via the ports or the expansion slots.

Components directly attached to or part of the motherboard include:

• The CPU (Central Processing Unit) performs most of the calculations which enable a computer to function, and is sometimes referred to as the "brain" of the computer. It is usually cooled by a heat sink and fan. Most newer CPUs include an on-die Graphics Processing Unit (GPU).

• The Chipset, which includes the north bridge, mediates communication between the CPU and the other components of the system, including main memory.

• The Random-Access Memory (RAM) stores the code and data that are being actively accessed by the CPU.

• The Read-Only Memory (ROM) stores the BIOS that runs when the computer is powered on or otherwise begins execution, a process known as Bootstrapping, or "booting" or "booting up". The BIOS (Basic Input Output System) includes boot firmware and power management firmware. Newer motherboards use Unified Extensible Firmware Interface (UEFI) instead of BIOS.

• Buses connect the CPU to various internal components and to expansion cards for graphics and sound.

• The CMOS battery is also attached to the motherboard. This battery is the same as a watch battery or a battery for a remote to a car's central locking system. Most batteries are CR2032, which powers the memory for date and time in the BIOS chip.

Expansion cards

Main article: Expansion card

An expansion card in computing is a printed circuit board that can be inserted into an expansion slot of a computer motherboard or backplane to add functionality to a computer system via the expansion bus.

Storage devices

Main article: Computer data storage

Computer data storage, often called storage or memory, refers to computer components and recording media that retain digital data. Data storage is a core function and fundamental component of computers.

Fixed media

Data is stored by a computer using a variety of media. Hard disk drives are found in virtually all older computers, due to their high capacity and low cost, but solid-state drives are faster and more power efficient, although currently more expensive than hard drives, so are often found in more expensive computers. Some systems may use a disk array controller for greater performance or reliability.

Removable media

To transfer data between computers, a USB flash drive or optical disc may be used. Their usefulness depends on being readable by other systems; the majority of machines have an optical disk drive, and virtually all have a USB port.

Input and output peripherals

Main article: Peripheral

Input and output devices are typically housed externally to the main computer chassis. The following are either standard or very common to many computer systems.

Input

Input devices allow the user to enter information into the system, or control its operation. Most personal computers have a mouse and keyboard, but laptop systems typically use a touchpad instead of a mouse. Other input devices include webcams, microphones, joysticks, and image scanners.

Output device

Output devices display information in a human readable form. Such devices could include printers, speakers, monitors or a Braille embosser.

2.3.2 Mainframe computer

[Figure: An IBM System z9 mainframe.]

A mainframe computer is a much larger computer that typically fills a room and may cost many hundreds or thousands of times as much as a personal computer. They are designed to perform large numbers of calculations for governments and large enterprises.

2.3.3 Departmental computing

Main article: Minicomputer

In the 1960s and 1970s, more and more departments started to use cheaper and dedicated systems for specific purposes like process control and laboratory automation.

2.3.4 Supercomputer

A supercomputer is superficially similar to a mainframe, but is instead intended for extremely demanding computational tasks. As of November 2013, the fastest supercomputer in the world is the Tianhe-2, in Guangzhou, China.[7]

The term supercomputer does not refer to a specific technology. Rather, it indicates the fastest computers available at any given time. In mid-2011, the fastest supercomputers boasted speeds exceeding one petaflop, or 1000 trillion floating point operations per second. Supercomputers are fast but extremely costly, so they are generally used by large organizations to execute computationally demanding tasks involving large data sets. Supercomputers typically run military and scientific applications. Although they cost millions of dollars, they are also being used for commercial applications where huge amounts of data must be analyzed. For example, large banks employ supercomputers to calculate the risks and returns of various investment strategies, and healthcare organizations use them to analyze giant databases of patient data to determine optimal treatments for various diseases.

2.4 See also

• Open-source computing hardware

2.5 References

[1] "Parts of computer". Microsoft. Retrieved 5 December 2013.

[2] Smither, Roger. "Use of computers in audiovisual archives". UNESCO. Retrieved 5 December 2013.

[3] von Neumann, John (1945). "First Draft of a Report on the EDVAC" (PDF).

[4] Markgraf, Joey D. (2007). "The Von Neumann bottleneck". Retrieved 24 August 2011.

[5] US B2B Channel sales reach nearly $62 Billion in 2013, by The NPD Group: https://www.npd.com/wps/portal/npd/us/news/press-releases/us-b2bchannel-sales-reach-nearly-62-billion-in-2013-according-to-the-npd-group/

[6] "How long should a laptop battery last?". Computer Hope. Retrieved 9 December 2013.

[7] Alba, Davey. "China's Tianhe-2 Caps Top 10 Supercomputers". IEEE. Retrieved 9 December 2013.

2.6 External links

• Media related to Computer hardware at Wikimedia Commons
• Learning materials related to Computer hardware at Wikiversity
Chapter 3

Moore’s law

[Figure: Plot of microprocessor transistor counts 1971–2011 against dates of introduction, from the Intel 4004 (2,300 transistors, 1971) to chips such as the 10-Core Xeon Westmere-EX and 16-Core SPARC T3; note the logarithmic vertical scale; the line corresponds to exponential growth with transistor count doubling every two years.]

"Moore's law" is the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit has doubled approximately every two years. The observation is named after Gordon E. Moore, co-founder of the Intel Corporation and Fairchild Semiconductor, whose 1965 paper described a doubling every year in the number of components per integrated circuit.[1][2][3] In 1975, he revised the forecast doubling time to two years.[4][5][6] His prediction had proven to be accurate, in part because the law now is used in the semiconductor industry to guide long-term planning and to set targets for research and development.[7] The capabilities of many digital electronic devices are strongly linked to Moore's law: quality-adjusted microprocessor prices,[8] memory capacity, sensors and even the number and size of pixels in digital cameras.[9] All of these are improving at roughly exponential rates as well.

This exponential improvement has dramatically enhanced the effect of digital electronics in nearly every segment of the world economy.[10] Moore's law describes a driving force of technological and social change, productivity, and economic growth in the late twentieth and early twenty-first centuries.[11][12][13][14]

The period is often quoted as 18 months because of Intel executive David House, who predicted that chip performance would double every 18 months (being a combination of the effect of more transistors and their being faster).[15]

Although this trend has continued for more than half a century, "Moore's law" should be considered an observation or projection and not a physical or natural law. Doubts about the ability of the projection to remain valid into the indefinite future have been expressed. For example, the 2010 update to the International Technology Roadmap for Semiconductors predicted that growth would slow around 2013,[16] and Gordon Moore in 2015 foresaw that the rate of progress would reach saturation: "I see Moore's law dying here in the next decade or so."[17] However, The Economist news-magazine has opined that predictions that Moore's law will soon fail are almost as old, going back years and years, as the law itself, with the time of eventual end of the technological trend being uncertain.[18]

10
3.2. AS A TARGET FOR INDUSTRY AND A SELF-FULFILLING PROPHECY 11

and development (R&D) at Fairchild Semiconductor at 3.2 As a target for industry and a
the time, was asked to predict what was going to hap-
pen in the semiconductor components industry over the
self-fulfilling prophecy
next ten years. His response was a brief article entitled,
“Cramming more components onto integrated circuits”.[19] Although Moore’s law initially was made in the form of
Within his editorial, he speculated that by 1975 it would an observation and forecast, the more widely it became
be possible to contain as many as 65,000 components on accepted, the more it served as a goal for an entire indus-
a single quarter-inch semiconductor. try.
The complexity for minimum component costs has in- This drove both marketing and engineering departments
creased at a rate of roughly a factor of two per year. Cer- of semiconductor manufacturers to focus enormous en-
tainly over the short term this rate can be expected to con- ergy aiming for the specified increase in processing power
tinue, if not to increase. Over the longer term, the rate of that it was presumed one or more of their competitors
increase is a bit more uncertain, although there is no rea- would soon attain. In this regard, it may be viewed as a
son to believe it will not remain nearly constant for at least self-fulfilling prophecy.[7][28]
10 years.
[emphasis added]
3.2.1 Moore’s second law
G. Moore, 1965
Further information: Rock’s law
His reasoning was a log-linear relationship between de-
vice complexity (higher circuit density at reduced cost) As the cost of computer power to the consumer falls, the
and time:[20][21] cost for producers to fulfill Moore’s law follows an oppo-
At the 1975 IEEE International Electron Devices Meet- site trend: R&D, manufacturing, and test costs have in-
ing Moore revised the forecast rate:[4] [22] semiconductor creased steadily with each new generation of chips. Ris-
complexity would continue to double annually until about ing manufacturing costs are an important consideration
[29]
1980 after which it would decrease to a rate of doubling for the sustaining of Moore’s law. This had led to the
approximately every two years. [22] He outlined several formulation of Moore’s second law, also called Rock’s
contributing factors for this exponential behavior:[20][21] law, which is that the capital cost of a semiconductor fab
also increases exponentially over time.[30][31]
• Die sizes were increasing at an exponential rate and
as defective densities decreased, chip manufacturers
could work with larger areas without losing reduc- 3.3 Major enabling factors and fu-
tion yields ture trends
• Simultaneous evolution to finer minimum dimen-
Numerous innovations by a large number of scientists and
sions
engineers have helped significantly to sustain Moore’s
• and what Moore called “circuit and device clever- law since the beginning of the integrated circuit (IC) era.
ness” Whereas assembling a detailed list of such significant
contributions would be as desirable as it would be dif-
ficult, just a few innovations are listed below as examples
Shortly after 1975, Caltech professor Carver Mead pop- of breakthroughs that have played a critical role in the ad-
ularized the term “Moore’s law”.[2][23] vancement of integrated circuit technology by more than
Despite a popular misconception, Moore is adamant that seven orders of magnitude in less than five decades:
he did not predict a doubling “every 18 months.” Rather,
David House, an Intel colleague, had factored in the in- • The foremost contribution, which is the raison d’être
creasing performance of transistors to conclude that in- for Moore’s law, is the invention of the integrated
tegrated circuits would double in performance every 18 circuit, credited contemporaneously to Jack Kilby at
months. Texas Instruments[32] and Robert Noyce at Fairchild
Predictions of similar increases in computer power had Semiconductor.[33]
existed years prior. For example, Douglas Engelbart dis- • The invention of the complementary metal–oxide–
cussed the projected downscaling of integrated circuit semiconductor (CMOS) process by Frank Wanlass
size in 1959 [24] or 1960.[25] in 1963 [34] and a number of advances in CMOS
In April 2005, Intel offered US$10,000 to purchase a technology by many workers in the semiconductor
copy of the original Electronics issue in which Moore’s ar- field since the work of Wanlass have enabled the ex-
ticle appeared.[26] An engineer living in the United King- tremely dense and high-performance ICs that the in-
dom was the first to find a copy and offer it to Intel.[27] dustry makes today.
12 CHAPTER 3. MOORE’S LAW

• The invention of the dynamic random access mem- • Researchers from IBM and Georgia Tech created
ory (DRAM) technology by Robert Dennard at a new speed record when they ran a supercooled
I.B.M. in 1967 [35] made it possible to fabricate silicon-germanium transistor above 500 GHz at a
single-transistor memory cells, and the invention of temperature of 4.5 K (−269 °C; −452 °F).[60][61]
flash memory by Fujio Masuoka at Toshiba in the
1980s,[36][37][38] leading to low-cost, high-capacity • In April 2008, researchers at HP Labs announced
memory in diverse electronic products. the creation of a working memristor, a fourth ba-
sic passive circuit element whose existence only had
• The invention of chemically-amplified photoresist been theorized previously. The memristor’s unique
by C. Grant Willson, Hiroshi Ito and J.M.J. Fréchet properties permit the creation of smaller and better-
at IBM c.1980,[39][40][41] that was 10–100 times performing electronic devices.[62]
more sensitive to ultraviolet light.[42] IBM intro-
duced chemically amplified photoresist for DRAM • In February 2010, Researchers at the Tyndall Na-
production in the mid-1980s.[43][44] tional Institute in Cork, Ireland announced a break-
through in transistors with the design and fabri-
• The invention of deep UV excimer laser cation of the world’s first junctionless transistor.
photolithography by Kanti Jain [45] at IBM The research led by Professor Jean-Pierre Colinge
c.1980,[46][47][48] has enabled the smallest features was published in Nature Nanotechnology and de-
in ICs to shrink from 800 nanometers in 1990 to scribes a control gate around a silicon nanowire
as low as 22 nanometers in 2012.[49] This built that can tighten around the wire to the point of
on the invention of the excimer laser in 1970 [50] closing down the passage of electrons without the
by Nikolai Basov, V. A. Danilychev and Yu. M. use of junctions or doping. The researchers claim
Popov, at the Lebedev Physical Institute. From that the new junctionless transistors may be pro-
a broader scientific perspective, the invention of duced at 10-nanometer scale using existing fabrica-
excimer laser lithography has been highlighted as tion techniques.[63]
one of the major milestones in the 50-year history
of the laser.[51][52] • In April 2011, a research team at the Univer-
sity of Pittsburgh announced the development of a
• The interconnect innovations of the late 1990s single-electron transistor, 1.5 nanometers in diam-
include that IBM developed CMP or chemical eter, made out of oxide based materials. Accord-
mechanical planarization c.1980, based on the ing to the researchers, three “wires” converge on a
centuries-old polishing process for making telescope central “island” that can house one or two electrons.
lenses.[53] CMP smooths the chip surface. Intel Electrons tunnel from one wire to another through
used chemical-mechanical polishing to enable ad- the island. Conditions on the third wire result in
ditional layers of metal wires in 1990; higher tran- distinct conductive properties including the ability
sistor density (tighter spacing) via trench isolation, of the transistor to act as a solid state memory.[64]
local polysilicon (wires connecting nearby transis-
tors), and improved wafer yield (all in 1995).[54][55] • In February 2012, a research team at the University
Higher yield, the fraction of working chips on a of New South Wales announced the development
wafer, reduces manufacturing cost. IBM with assis- of the first working transistor consisting of a
tance from Motorola used CMP for lower electrical single atom placed precisely in a silicon crystal
resistance copper interconnect instead of aluminum (not just picked from a large sample of random
in 1997.[56] transistors).[65] Moore’s law predicted this milestone
to be reached in the lab by 2020.
Computer industry technology road maps predict (as of
2001) that Moore’s law will continue for several gener- • In April 2014, bioengineers at Stanford University
ations of semiconductor chips. Depending on the dou- developed a new circuit board modeled on the hu-
bling time used in the calculations, this could mean up man brain. 16 custom-designed “Neurocore” chips
to a hundredfold increase in transistor count per chip simulate 1 million neurons and billions of synaptic
within a decade. The semiconductor industry tech- connections. This Neurogrid is claimed to be 9,000
nology roadmap uses a three-year doubling time for times faster as well as more energy efficient than a
microprocessors, leading to a tenfold increase in the next typical PC. The cost of the prototype was $40,000.
decade.[57] Intel was reported in 2005 as stating that the With current technology, however, a similar Neuro-
downsizing of silicon chips with good economics can con- grid could be made for $400.[66]
tinue during the next decade,[note 1] and in 2008 as predict-
ing the trend through 2029.[59] • The advancement of nanotechnology could spur
the creation of microscopic computers and restore
Some of the new directions in research that may allow Moore’s Law to its original rate of growth.[67][68][69]
Moore’s law to continue are:
3.4. CONSEQUENCES AND LIMITATIONS 13

15 billion transistors, and by 2020 will be in molecular


scale production, where each molecule can be individu-
ally positioned.[71]
In 2003, Intel predicted the end would come between
2013 and 2018 with 16 nanometer manufacturing pro-
cesses and 5 nanometer gates, due to quantum tunnelling,
although others suggested chips could just get larger, or
become layered.[72] In 2008 it was noted that for the last
30 years, it has been predicted that Moore’s law would
last at least another decade.[59]
The trend of scaling for NAND flash memory allows doubling Some see the limits of the law as being in the distant
of components manufactured in the same wafer area in less than future. Lawrence Krauss and Glenn D. Starkman an-
18 months nounced an ultimate limit of approximately 600 years
in their paper,[73] based on rigorous estimation of to-
3.3.1 Ultimate limits tal information-processing capacity of any system in the
Universe, which is limited by the Bekenstein bound. On
the other hand, based on first principles, there are pre-
dictions that Moore’s law will collapse in the next few
decades [20–40 years]".[74][75]
One also could limit the theoretical performance of a
rather practical “ultimate laptop” with a mass of one kilo-
gram and a volume of one litre. This is done by consider-
ing the speed of light, the quantum scale, the gravitational
constant, and the Boltzmann constant, giving a perfor-
mance of 5.4258 × 1050 logical operations per second
Atomistic simulation result for formation of inversion channel on approximately 1031 bits.[76]
(electron density) and attainment of threshold voltage (IV) in a
Then again, the law often has met obstacles that first ap-
nanowire MOSFET. Note that the threshold voltage for this device
peared insurmountable, but were indeed surmounted be-
lies around 0.45 V. Nanowire MOSFETs lie toward the end of the
fore long. In that sense, Moore says he now sees his law
ITRS road map for scaling devices below 10 nm gate lengths.[57]
as more beautiful than he had realized: “Moore’s law is
a violation of Murphy’s law. Everything gets better and
On 13 April 2005, Gordon Moore stated in an interview
better.”[77]
that the projection cannot be sustained indefinitely: “It
can't continue forever. The nature of exponentials is that
you push them out and eventually disaster happens”. He
also noted that transistors eventually would reach the lim- 3.4 Consequences and limitations
its of miniaturization at atomic levels:
Technological change is a combination of more and of
In terms of size [of transistors] you can better technology. A 2011 study in the journal Science
see that we're approaching the size of atoms showed that the peak of the rate of change of the world’s
which is a fundamental barrier, but it'll be capacity to compute information was in the year 1998,
two or three generations before we get that when the world’s technological capacity to compute in-
far—but that’s as far out as we've ever been formation on general-purpose computers grew at 88%
able to see. We have another 10 to 20 years per year.[78] Since then, technological change clearly has
before we reach a fundamental limit. By then slowed. In recent times, every new year allowed humans
they'll be able to make bigger chips and have to carry out roughly 60% of the computations that pos-
transistor budgets in the billions. sibly could have been executed by all existing general-
— [70] purpose computers before that year.[78] This still is ex-
ponential, but shows the varying nature of technological
change.[79]
In January 1995, the Digital Alpha 21164 microproces- The primary driving force of economic growth is the
sor had 9.3 million transistors. This 64-bit processor was growth of productivity,[13] and Moore’s law factors into
a technological spearhead at the time, even if the circuit’s productivity. Moore (1995) expected that “the rate of
market share remained average. Six years later, a state of technological progress is going to be controlled from fi-
the art microprocessor contained more than 40 million nancial realities.”[80] The reverse could and did occur
transistors. It is theorised that, with further miniaturisa- around the late-1990s, however, with economists report-
tion, by 2015 these processors should contain more than ing that "Productivity growth is the key economic indi-
14 CHAPTER 3. MOORE’S LAW

cator of innovation.”[14] An acceleration in the rate of of processors. This is particularly true while accessing
semiconductor progress contributed to a surge in U.S. shared or dependent resources, due to lock contention.
productivity growth,[81][82][83] which reached 3.4% per This effect becomes more noticeable as the number of
year in 1997–2004, outpacing the 1.6% per year dur- processors increases. There are cases where a roughly
ing both 1972–1996 and 2005–2013.[84] As economist 45% increase in processor transistors has translated to
Richard G. Anderson notes, “Numerous studies have roughly 10–20% increase in processing power.[90]
traced the cause of the productivity acceleration to tech- On the other hand, processor manufacturers are taking
nological innovations in the production of semiconduc- advantage of the 'extra space' that the transistor shrinkage
tors that sharply reduced the prices of such components
provides to add specialized processing units to deal with
and of the products that contain them (as well as expand- features such as graphics, video, and cryptography. For
ing the capabilities of such products).”[85]
one example, Intel’s Parallel JavaScript extension not only
adds support for multiple cores, but also for the other non-
general processing features of their chips, as part of the
migration in client side scripting toward HTML5.[91]
A negative implication of Moore’s law is obsolescence,
that is, as technologies continue to rapidly “improve”,
these improvements may be significant enough to ren-
der predecessor technologies obsolete rapidly. In situa-
tions in which security and survivability of hardware or
data are paramount, or in which resources are limited,
rapid obsolescence may pose obstacles to smooth or con-
tinued operations.[92] Because of the toxic materials used
in the production of modern computers, obsolescence if
Intel transistor gate length trend – transistor scaling has slowed not properly managed, may lead to harmful environmen-
down significantly at advanced (smaller) nodes
tal impacts.[93]
Moore’s law has affected the performance of other tech-
While physical limits to transistor scaling such as source-
nologies significantly: Michael S. Malone wrote of a
to-drain leakage, limited gate metals, and limited options
Moore’s War following the apparent success of shock and
for channel material have been reached, new avenues
awe in the early days of the Iraq War. Progress in the
for continued scaling are open. The most promising of
development of guided weapons depends on electronic
these approaches rely on using the spin state of electron
technology.[94] Improvements in circuit density and low-
spintronics, tunnel junctions, and advanced confinement
power operation associated with Moore’s law, also have
of channel materials via nano-wire geometry. A compre-
contributed to the development of Star Trek-like tech-
hensive list of available device choices shows that a wide
nologies including mobile telephones[95] and replicator-
range of device options is open for continuing Moore’s
like 3-D printing.[96]
law into the next few decades.[86] Spin-based logic and
memory options are being developed actively in indus-
trial labs,[87] as well as academic labs.[88]
Another source of improved performance is in 3.5 Other formulations and similar
microarchitecture techniques exploiting the growth observations
of available transistor count. Out-of-order execution and
on-chip caching and prefetching reduce the memory la-
Several measures of digital technology are improving at
tency bottleneck at the expense of using more transistors
and increasing the processor complexity. These increases exponential rates related to Moore’s law, including the
are described empirically by Pollack’s Rule, which states size, cost, density, and speed of components. Moore
that performance increases due to microarchitecture wrote only about the density of components, “a compo-
techniques are square root of the number of transistors nent being a transistor, resistor, diode or capacitor,”[80] at
or the area of a processor. minimum cost.

For years, processor makers delivered increases in clock Transistors per integrated circuit – The most popular
rates and instruction-level parallelism, so that single- formulation is of the doubling of the number of transistors
threaded code executed faster on newer processors with on integrated circuits every two years. At the end of the
no modification.[89] Now, to manage CPU power dis- 1970s, Moore’s law became known as the limit for the
sipation, processor makers favor multi-core chip de- number of transistors on the most complex chips. The
signs, and software has to be written in a multi-threaded graph at the top shows this trend holds true today.
manner to take full advantage of the hardware. Many Density at minimum cost per transistor – This is the
multi-threaded development paradigms introduce over- formulation given in Moore’s 1965 paper.[1] It is not just
head, and will not see a linear increase in speed vs number about the density of transistors that can be achieved, but
3.5. OTHER FORMULATIONS AND SIMILAR OBSERVATIONS 15

about the density of transistors at which the cost per during the late 1990s, reaching 60% per year (halving
transistor is the lowest.[97] As more transistors are put every nine months) versus the typical 30% improvement
on a chip, the cost to make each transistor decreases, rate (halving every two years) during the years earlier and
but the chance that the chip will not work due to a de- later.[108][109] Laptop microprocessors in particular im-
fect increases. In 1965, Moore examined the density proved 25–35% per year in 2004–2010, and slowed to
of transistors at which cost is minimized, and observed 15–25% per year in 2010–2013.[110]
that, as transistors were made smaller through advances The number of transistors per chip cannot explain
in photolithography, this number would increase at “a rate quality-adjusted microprocessor prices fully.[108][111][112]
of roughly a factor of two per year”.[1]
Moore’s 1995 paper does not limit Moore’s law to
Dennard scaling – This suggests that power require- strict linearity or to transistor count, “The definition of
ments are proportional to area (both voltage and current 'Moore’s Law' has come to refer to almost anything re-
being proportional to length) for transistors. Combined lated to the semiconductor industry that when plotted
with Moore’s law, performance per watt would grow at on semi-log paper approximates a straight line. I hes-
roughly the same rate as transistor density, doubling ev- itate to review its origins and by doing so restrict its
ery 1–2 years. According to Dennard scaling transistor definition.”[80]
dimensions are scaled by 30% (0.7x) every technology Moore (2003) credits chemical mechanical planarization
generation, thus reducing their area by 50%. This reduces (chip smoothing) with increasing the connectivity of mi-
the delay by 30% (0.7x) and therefore increases operating croprocessors from two or three metal layers in the early
frequency by about 40% (1.4x). Finally, to keep electric 1990s to seven in 2003.[54] This progressed to nine metal
field constant, voltage is reduced by 30%, reducing en- layers in 2007 and thirteen in 2014.[113][114][115] Connec-
ergy by 65% and power (at 1.4x frequency) by 50%.[note 2] tivity improves performance, and relieves network con-
Therefore, in every technology generation transistor den- gestion. Just as additional floors may not enlarge a build-
sity doubles, circuit becomes 40% faster, while power ing’s footprint, nor is connectivity tallied in transistor
consumption (with twice the number of transistors) stays count. Microprocessors rely more on communications
the same.[98] (interconnect) than do DRAM chips, which have three
The exponential processor transistor growth predicted or four metal layers.[116][117][118] Microprocessor prices
by Moore does not always translate into exponentially in the late 1990s improved faster than DRAM prices.[108]
greater practical CPU performance. Since around 2005–
Hard disk drive areal density – A similar observation
2007, Dennard scaling appears to have broken down, (sometimes called Kryder’s law) was made as of 2005
so even though Moore’s law continued for several years for hard disk drive areal density.[119] Several decades of
after that, it has not yielded dividends in improved rapid progress resulted from the use of error correcting
performance.[99][100][101] The primary reason cited for the codes, the magnetoresistive effect, and the giant mag-
breakdown is that at small sizes, current leakage poses netoresistive effect. The Kryder rate of areal density
greater challenges, and also causes the chip to heat up, advancement slowed significantly around 2010, because
which creates a threat of thermal runaway and therefore, of noise related to smaller grain size of the disk media,
further increases energy costs.[99][100][101] The breakdown thermal stability, and writability using available magnetic
of Dennard scaling prompted a switch among some chip fields.[120][121]
manufacturers to a greater focus on multicore proces-
sors, but the gains offered by switching to more cores Network capacity – According to Gerry/Gerald
[122][123]
are lower than the gains that would be achieved had Butters, the former head of Lucent’s Optical
Dennard scaling continued. [102][103]
In another departure Networking Group at Bell Labs, there is another version,
[124]
from Dennard scaling, Intel microprocessors adopted a called Butters’ Law of Photonics, a formulation
non-planar tri-gate FinFET at 22 nm in 2012 that is that deliberately parallels Moore’s law. Butter’s law
faster and consumes less power than a conventional planar says that the amount of data coming out of an op-
transistor.[104] tical fiber is doubling every nine months.[125] Thus,
the cost of transmitting a bit over an optical network
Quality adjusted price of IT equipment – The price decreases by half every nine months. The availability
of information technology (IT), computers and periph-
of wavelength-division multiplexing (sometimes called
eral equipment, adjusted for quality and inflation, de- WDM) increased the capacity that could be placed on
clined 16% per year on average over the five decades
a single fiber by as much as a factor of 100. Optical
from 1959 to 2009. [105][106] The pace accelerated, how- networking and dense wavelength-division multiplexing
ever, to 23% per year in 1995–1999 triggered by faster IT (DWDM) is rapidly bringing down the cost of network-
innovation,[14] and later, slowed to 2% per year in 2010– ing, and further progress seems assured. As a result, the
2013.[105][107] wholesale price of data traffic collapsed in the dot-com
The rate of quality-adjusted microprocessor price im- bubble. Nielsen’s Law says that the bandwidth available
provement likewise varies, and is not linear on a log to users increases by 50% annually.[126]
scale. Microprocessor price improvement accelerated Pixels per dollar – Similarly, Barry Hendy of Kodak
16 CHAPTER 3. MOORE’S LAW

Australia has plotted pixels per dollar as a basic measure • Haitz’s law – analog to Moore’s law for LEDs
of value for a digital camera, demonstrating the historical
linearity (on a log scale) of this market and the opportu- • Intel Tick-Tock
nity to predict the future trend of digital camera price,
LCD and LED screens, and resolution.[127][128][129] • Koomey’s law

The great Moore’s law compensator (TGMLC), also • List of eponymous laws
known as Wirth’s law – generally is referred to as
bloat and is the principle that successive generations • Microprocessor chronology
of computer software increase in size and complexity,
thereby offsetting the performance gains predicted by • Quantum computing
Moore’s law. In a 2008 article in InfoWorld, Randall C.
Kennedy,[130] formerly of Intel, introduces this term us- • Zimmerman’s Law
ing successive versions of Microsoft Office between the
year 2000 and 2007 as his premise. Despite the gains in
computational performance during this time period ac- 3.7 Notes
cording to Moore’s law, Office 2007 performed the same
task at half the speed on a prototypical year 2007 com- [1] The trend begins with the invention of the integrated cir-
puter as compared to Office 2000 on a year 2000 com- cuit in 1958. See the graph on the bottom of page 3 of
puter. Moore’s original presentation of the idea.[58]
Library expansion – was calculated in 1945 by Fremont
[2] Active power = CV2 f
Rider to double in capacity every 16 years, if suffi-
cient space were made available.[131] He advocated re-
placing bulky, decaying printed works with miniaturized
microform analog photographs, which could be dupli- 3.8 References
cated on-demand for library patrons or other institutions.
He did not foresee the digital technology that would fol- [1] Moore, Gordon E. (1965). “Cramming more components
low decades later to replace analog microform with dig- onto integrated circuits” (PDF). Electronics Magazine. p.
ital imaging, storage, and transmission mediums. Au- 4. Retrieved 2006-11-11.
tomated, potentially lossless digital technologies allowed
vast increases in the rapidity of information growth in an [2] Brock, David C., ed. (2006). Understanding Moore’s law
: four decades of innovation. Philadelphia, Pa: Chemical
era that now sometimes is called an Information Age.
Heritage Press. ISBN 0941901416.
The Carlson Curve – is a term coined by The Economist
[132] [3] “1965 – “Moore’s Law” Predicts the Future of Integrated
to describe the biotechnological equivalent of
Moore’s law, and is named after author Rob Carlson.[133] Circuits”. Computer History Museum. 2007. Retrieved
Carlson accurately predicted that the doubling time of 2009-03-19.
DNA sequencing technologies (measured by cost and
[4] Takahashi, Dean (18 April 2005). “Forty years of
performance) would be at least as fast as Moore’s law.[134] Moore’s law”. Seattle Times (San Jose, CA). Retrieved 7
Carlson Curves illustrate the rapid (in some cases hyper- April 2015. A decade later, he revised what had become
exponential) decreases in cost, and increases in perfor- known as Moore’s Law: The number of transistors on a
mance, of a variety of technologies, including DNA se- chip would double every two years.
quencing, DNA synthesis, and a range of physical and
computational tools used in protein expression and in de- [5] Moore, Gordon (2006). “Chapter 7: Moore’s law at
termining protein structures. 40”. In Brock, David. Understanding Moore’s Law: Four
Decades of Innovation (PDF). Chemical Heritage Foun-
dation. pp. 67–84. ISBN 0-941901-41-6. Retrieved
March 15, 2015.
3.6 See also
[6] “Over 6 Decades of Continued Transistor Shrinkage, In-
• Accelerating change novation” (Press release). Santa Clara, California: Intel
Corporation. Intel Corporation. 2011-05-01. Retrieved
• Amdahl’s law 2015-03-15. 1965: Moore’s Law is born when Gordon
Moore predicts that the number of transistors on a chip
• Bell’s law will double roughly every year (a decade later, revised to
every 2 years)
• Metcalfe’s law
[7] Disco, Cornelius; van der Meulen, Barend (1998). Getting
• Empirical relationship new technologies together. New York: Walter de Gruyter.
pp. 206–207. ISBN 3-11-015630-X. OCLC 39391108.
• Grosch’s law Retrieved 23 August 2008.
3.8. REFERENCES 17

[8] Byrne, David M.; Oliner, Stephen D.; Sichel, Daniel [20] Schaller, Bob (26 Sep 1996). “The Origin, Nature, and
E. (2013-03). Is the Information Technology Revolution Implications of “MOORE'S LAW"". Microsoft. Re-
Over? (PDF). Finance and Economics Discussion Series trieved 10 September 2014.
Divisions of Research & Statistics and Monetary Affairs
Federal Reserve Board. Washington, D.C.: Federal Re- [21] Tuomi, I. (2002). “The Lives and Death of Moore’s Law”.
serve Board Finance and Economics Discussion Series First Monday 7 (11). doi:10.5210/fm.v7i11.1000.
(FEDS). Archived (PDF) from the original on 2014-06-
09. technical progress in the semiconductor industry has [22] Moore, Gordon (1975). “IEEE Technical Digest 1975”
continued to proceed at a rapid pace ... Advances in (PDF). Intel Corp. Retrieved 7 April 2015. ... the rate of
semiconductor technology have driven down the constant- increase of complexity can be expected to change slope in
quality prices of MPUs and other chips at a rapid rate over the next few years as shown in Figure 5. The new slope
the past several decades. Check date values in: |date= might approximate a doubling every two years, rather than
(help) every year, by the end of the decade.

[9] Nathan Myhrvold (7 June 2006). “Moore’s Law Corol- [23] in reference to Gordon E. Moore's statements at the IEEE.
lary: Pixel Power”. New York Times. Retrieved 2011-11- “Moore’s Law – The Genius Lives On”. IEEE solid-state
27. circuits society newsletter. September 2006. Archived
from the original on 2007-07-13.
[10] Rauch, Jonathan (January 2001). “The New Old Econ-
omy: Oil, Computers, and the Reinvention of the Earth”. [24] Markoff, John (31 August 2009). “After the Transistor,
The Atlantic Monthly. Retrieved 28 November 2008. a Leap Into the Microcosm”. The New York Times. Re-
trieved 2009-08-31.
[11] Keyes, Robert W. (September 2006). “The Impact of
Moore’s Law”. Solid State Circuits Newsletter. Retrieved [25] Markoff, John (18 April 2005). “It’s Moore’s Law But
28 November 2008. Another Had The Idea First”. The New York Times.
Archived from the original on 4 October 2011. Retrieved
[12] Liddle, David E. (September 2006). “The Wider Impact 4 October 2011.
of Moore’s Law”. Solid State Circuits Newsletter. Re-
trieved 28 November 2008. [26] Michael Kanellos (2005-04-11). “Intel offers $10,000 for
Moore’s Law magazine”. ZDNET News.com. Retrieved
[13] Kendrick, John W. (1961). Productivity Trends in the 2013-06-21.
United States. Princeton University Press for NBER. p.
3. [27] “Moore’s Law original issue found”. BBC News Online.
2005-04-22. Retrieved 2012-08-26.
[14] Dale W. Jorgenson, Mun S. Ho and Jon D. Samuels
(2014). “Long-term Estimates of U.S. Productivity and [28] “Gordon Moore Says Aloha to Moore’s Law”. the In-
Growth” (PDF). World KLEMS Conference. Retrieved quirer. 13 April 2005. Retrieved 2 September 2009.
2014-05-27.
[29] Sumner Lemon, Sumner; Tom Krazit (2005-04-19).
[15] “Moore’s Law to roll on for another decade”. Retrieved “With chips, Moore’s Law is not the problem”. Infoworld.
2011-11-27. Moore also affirmed he never said transistor Retrieved 2011-08-22.
count would double every 18 months, as is commonly said.
Initially, he said transistors on a chip would double every [30] Jeff Dorsch. “Does Moore’s Law Still Hold Up?" (PDF).
year. He then recalibrated it to every two years in 1975. EDA Vision. Retrieved 2011-08-22.
David House, an Intel executive at the time, noted that
the changes would cause computer performance to double [31] Bob Schaller (1996-09-26). “The Origin, Nature, and Im-
every 18 months. plications of “Moore’s Law"". Research.microsoft.com.
Retrieved 2011-08-22.
[16] “Overall Technology Roadmap Characteristics”.
International Technology Roadmap for Semiconductors. [32] Kilby, J., “Miniaturized electronic circuits”, US 3138743,
2010. Retrieved 2013-08-08. issued 23 June 1964 (filed 6 February 1959).

[17] Moore, Gordon (March 30, 2015). Gordon Moore: The [33] Noyce, R., “Semiconductor device-and-lead structure”,
Man Whose Name Means Progress, The visionary engineer US 2981877, issued 25 April 1961 (filed 30 July 1959).
reflects on 50 years of Moore’s Law. IEEE Spectrum. In-
terview with Rachel Courtland. Special Report: 50 Years [34] Wanlass, F., “Low stand-by power complementary field
of Moore’s Law. We won’t have the rate of progress that effect circuitry”, US 3356858, issued 5 December 1967
we've had over the last few decades. I think that’s in- (filed 18 June 1963).
evitable with any technology; it eventually saturates out.
I guess I see Moore’s law dying here in the next decade or [35] Dennard, R., “Field-effect transistor memory”, US
so, but that’s not surprising. 3387286, issued 4 June 1968 (filed 14 July 1967)

[18] http://www.economist.com/node/21649047 [36] Fulford, Benjamin (24 June 2002). “Unsung hero”.
Forbes. Retrieved 18 March 2008.
[19] Evans, Dean. “Moore’s Law: how long will it last?". http:
//www.techradar.com''. Retrieved 25 November 2014. [37] US 4531203 Fujio Masuoka
18 CHAPTER 3. MOORE’S LAW

[38] Masuoka, F.; Momodomi, M.; Iwata, Y.; Shirota, R. [54] Moore, Gordon (2003-02-10). “transcription of Gordon
(1987). “New ultra high density EPROM and flash EEP- Moore’s Plenary Address at ISSCC 50th Anniversary”
ROM with NAND structure cell”. Electron Devices Meet- (PDF). transcription “Moore on Moore: no Exponential
ing, 1987 International. IEEE. Retrieved 4 January 2013. is forever”. 2003 IEEE International Solid-State Circuits
Conference. San Francisco, California: ISSCC.
[39] U.S. Patent 4,491,628 “Positive and Negative Working
Resist Compositions with Acid-Generating Photoinitia- [55] Steigerwald, J. M. (2008). “Chemical mechani-
tor and Polymer with Acid-Labile Groups Pendant From cal polish: The enabling technology”. 2008 IEEE
Polymer Backbone” J.M.J. Fréchet, H. Ito and C.G. Will- International Electron Devices Meeting. p. 1.
son 1985. doi:10.1109/IEDM.2008.4796607. ISBN 978-1-
4244-2377-4. “Table1: 1990 enabling multilevel
[40] Ito, H., & Willson, C. G. (1983). “Chemical amplification metallization; 1995 enabling STI compact isolation,
in the design of dry developing resist material”. Polymer polysilicon patterning and yield / defect reduction”
Engineering & Science. 23(18): 204.
[56] “IBM100 – Copper Interconnects: The Evolution of Mi-
[41] Ito, Hiroshi, C. Grant Willson, and Jean HJ Frechet croprocessors”. Retrieved 17 October 2012.
(1982). “New UV resists with negative or positive tone”.
[57] “International Technology Roadmap for Semiconduc-
VLSI Technology, 1982. Digest of Technical Papers.
tors”. Retrieved 2011-08-22.
Symposium on.
[58] Gordon E. Moore (1965-04-19). “Cramming more com-
[42] “Patterning the World: The Rise of Chemically Amplified ponents onto integrated circuits” (PDF). Electronics. Re-
Photoresists”. Chemical Heritage Magazine. 2007-10-01. trieved 2011-08-22.
Retrieved 2014-05-29.
[59] “Moore’s Law: “We See No End in Sight,” Says Intel’s Pat
[43] The Japan Prize Foundation (2013). “Laureates of the Gelsinger”. SYS-CON. 2008-05-01. Retrieved 2008-05-
Japan Prize”. The Japan Prize Foundation. Retrieved 01.
2014-05-20.
[60] “Chilly chip shatters speed record”. BBC Online. 2006-
[44] Hiroshi Ito (2000). “Chemical amplification resists: His- 06-20. Retrieved 2006-06-24.
tory and development within IBM” (PDF). IBM Journal
of Research and Development. Retrieved 2014-05-20. [61] “Georgia Tech/IBM Announce New Chip Speed Record”.
Georgia Institute of Technology. 2006-06-20. Retrieved
[45] 4458994 A US patent US 4458994 A, Kantilal Jain, 2014-03-28.
Carlton G. Willson, “High resolution optical lithography
[62] Strukov, Dmitri B; Snider, Gregory S; Stew-
method and apparatus having excimer laser light source
art, Duncan R; Williams, Stanley R (2008).
and stimulated Raman shifting”, issued 1984-07-10
“The missing memristor found”. Nature 453
(7191): 80–83. Bibcode:2008Natur.453...80S.
[46] Jain, K. et al, “Ultrafast deep-UV lithography with ex-
doi:10.1038/nature06932. PMID 18451858.
cimer lasers”, IEEE Electron Device Lett., Vol. EDL-3,
53 (1982); http://ieeexplore.ieee.org/xpl/freeabs_all.jsp? [63] Dexter Johnson (2010-02-22). “Junctionless Transistor
arnumber=1482581 Fabricated from Nanowires”. IEEE Spectrum. Retrieved
2010-04-20.
[47] Jain, K. “Excimer Laser Lithography”, SPIE Press,
Bellingham, WA, 1990. [64] “Super-small transistor created: Artificial atom
powered by single electron”. Science Daily.
[48] La Fontaine, B., “Lasers and Moore’s Law”, SPIE Profes- 2011-04-19. Bibcode:2011NatNa...6..343C.
sional, Oct. 2010, p. 20; http://spie.org/x42152.xml doi:10.1038/nnano.2011.56. Retrieved 2011-08-22.
[49] Dirk Basting; Gerd Marowsky (5 December 2005). [65] “A single-atom transistor”. Nature. 2011-
Excimer Laser Technology. Springer. ISBN 978-3-540- 12-16. Bibcode:2012NatNa...7..242F.
26667-9. doi:10.1038/nnano.2012.21. Retrieved 2012-01-19.

[50] Basov, N. G. et al., Zh. Eksp. Fiz. i Tekh. Pis’ma. Red. [66] http://news.stanford.edu/pr/2014/
12, 473(1970). pr-neurogrid-boahen-engineering-042814.html

[51] Lasers in Our Lives / 50 Years of Impact (PDF), Engineer- [67] Michio Kaku (2010). Physics of the Future. Doubleday.
ing and Physical Sciences Research Council, retrieved p. 173. ISBN 978-0-385-53080-4.
2011-08-22 [68] Bob Yirka (2013-05-02). “New nanowire transis-
tors may help keep Moore’s Law alive”. Phys.org.
[52] “50 Years Advancing the Laser” (PDF). SPIE. Retrieved
doi:10.1039/C3NR33738C. Retrieved 2013-08-08.
2011-08-22.
[69] “Rejuvenating Moore’s Law With Nanotechnology”.
[53] Lai, Jiun-Yu (2000-09-30). “Mechanics, Mechanisms, Forbes. 2007-06-05. Retrieved 2013-08-08.
and Modeling of the Chemical Mechanical Polishing Pro-
cess” (PDF). Ph.D. Dissertation, Massachusetts Institute of [70] Manek Dubash (2005-04-13). “Moore’s Law is dead, says
Technology: 20–28. Retrieved 2014-06-03. Gordon Moore”. Techworld. Retrieved 2006-06-24.
3.8. REFERENCES 19

[71] Waldner, Jean-Baptiste (2008). Nanocomputers and [87] Sasikanth Manipatruni; Dmitri E. Nikonov; Ian A. Young
swarm intelligence. London: ISTE. pp. 44–45. ISBN (2012-12-13). “Material Targets for Scaling All Spin
978-1-84821-009-7. Logic”. Cornell University Library. Retrieved 2013-08-
08.
[72] Michael Kanellos (2003-12-01). “Intel scientists find wall
for Moore’s Law”. CNET. Retrieved 2009-03-19. [88] “Proposal for an all-spin logic device with built-in mem-
ory”. Nature Nanotechnology. 2010-02-28. Retrieved
[73] Lawrence M. Krauss; Glenn D. Starkman (2004-05- 2013-08-08.
10). “Universal Limits of Computation”. arXiv:astro-
ph/0404510. [89] See Herb Sutter,The Free Lunch Is Over: A Fundamental
Turn Toward Concurrency in Software, Dr. Dobb’s Jour-
[74] Kaku, Michio. “Parallel universes, the Matrix, and super- nal, 30(3), March 2005. Retrieved 21 November 2011.
intelligence”. Kurzweil. Retrieved 2011-08-22.
[90] Anand Lal Shimpi (2004-07-21). “AnandTech: Intel’s
[75] Kumar, Suhas (2012). “Fundamental Limits to Moore’s 90nm Pentium M 755: Dothan Investigated”. Anadtech.
Law”. Stanford University. Retrieved 2007-12-12.

[76] Seth Lloyd (2000). “Ultimate physical limits to computa- [91] “Parallel JavaScript”. Intel. 2011-09-15. Retrieved 2013-
tion”. Nature. Retrieved 2011-11-27. 08-08.

[77] “Moore’s Law at 40 – Happy birthday”. The Economist. [92] Peter Standborn (April 2008). “Trapped on Technology’s
2005-03-23. Retrieved 2006-06-24. Trailing Edge”. IEEE Spectrum. Retrieved 2011-11-27.

[93] “WEEE – Combating the obsolescence of computers and


[78] Hilbert, Martin; López, Priscila (2011). “The
other devices”. SAP Community Network. 2012-12-14.
World’s Technological Capacity to Store, Com-
Retrieved 2013-08-08.
municate, and Compute Information”. Science
332 (6025): 60–65. Bibcode:2011Sci...332...60H. [94] Malone, Michael S. (27 March 2003). “Silicon Insider:
doi:10.1126/science.1200970. PMID 21310967. Welcome to Moore’s War”. ABC News. Retrieved 2011-
Free access to the study through www.martinhilbert. 08-22.
net/WorldInfoCapacity.html and video animation
ideas.economist.com/video/giant-sifting-sound-0 [95] Zygmont, Jeffrey (2003). Microchip. Cambridge, MA,
USA: Perseus Publishing. pp. 154–169. ISBN 0-7382-
[79] “Technological guideposts and innovation avenuesn”, Sa- 0561-3.
hal, Devendra (1985), Research Policy, 14, 61.
[96] Lipson, Hod (2013). Fabricated: The New World of 3D
[80] Gordon E. Moore (1995). “Lithography and the future of Printing. Indianapolis, IN, USA: John Wiley & Sons.
Moore’s law” (PDF). SPIE. Retrieved 2014-05-27. ISBN 978-1-118-35063-8.

[81] Dale W. Jorgenson (2000). “Information Technology and [97] Stokes, Jon (2008-09-27). “Understanding Moore’s
the U.S. Economy: Presidential Address to the Ameri- Law”. Ars Technica. Retrieved 2011-08-22.
can Economic Association”. American Economic Asso-
ciation. Retrieved 2014-05-15. [98] Shekhar Borkar, Andrew A. Chien (May 2011). “The Fu-
ture of Microprocessors”. Communications of ACM 54
[82] Dale W. Jorgenson, Mun S. Ho, and Kevin J. Stiroh (5). Retrieved 2011-11-27.
(2008). “A Retrospective Look at the U.S. Productivity
Growth Resurgence”. Journal of Economic Perspectives. [99] McMenamin, Adrian (April 15, 2013). “The end of Den-
Retrieved 2014-05-15. nard scaling”. Retrieved January 23, 2014.

[100] Bohr, Mark (January 2007). “A 30 Year Retrospective on


[83] Bruce T. Grimm, Brent R. Moulton, and David B.
Dennard’s MOSFET Scaling Paper” (PDF). Solid-State
Wasshausen (2002). “Information Processing Equipment
Circuits Society. Retrieved January 23, 2014.
and Software in the National Accounts” (PDF). U.S. De-
partment of Commerce Bureau of Economic Analysis. [101] Nickel, Sebastian (July 27, 2013). “Sebastian Nickel’s
Retrieved 2014-05-15. comments on “Model Combination and Adjustment"".
Retrieved January 23, 2014.
[84] “Nonfarm Business Sector: Real Output Per Hour of All
Persons”. Federal Reserve Bank of St. Louis Economic [102] Esmaeilzedah, Emily; Blem; St. Amant, Renee; Sankar-
Data. 2014. Retrieved 2014-05-27. alingam, Kartikeyan; Burger, Doug. “Dark Silicon and
the end of multicore scaling” (PDF).
[85] Richard G. Anderson (2007). “How Well Do Wages Fol-
low Productivity Growth?" (PDF). Federal Reserve Bank [103] Hruska, Joel (February 1, 2012). “The death of CPU scal-
of St. Louis Economic Synopses. Retrieved 2014-05-27. ing: From one core to many — and why we’re still stuck”.
ExtremeTech. Retrieved January 23, 2014.
[86] Dmitri E. Nikonov; Ian A. Young (2013-02-01).
“Overview of Beyond-CMOS Devices and A Uniform [104] Kaizad Mistry (2011). “Tri-Gate Transistors: Enabling
Methodology for Their Benchmarking”. Cornell Univer- Moore’s Law at 22nm and Beyond” (PDF). Intel Corpo-
sity Library. Retrieved 2013-08-08. ration at semiconwest.org. Retrieved 2014-05-27.
20 CHAPTER 3. MOORE’S LAW

[105] “Private fixed investment, chained price index: Nonres- [119] Walter, Chip (2005-07-25). “Kryder’s Law”. Scien-
idential: Equipment: Information processing equipment: tific American ((Verlagsgruppe Georg von Holtzbrinck
Computers and peripheral equipment”. Federal Reserve GmbH)). Retrieved 2006-10-29.
Bank of St. Louis. 2014. Retrieved 2014-05-12.
[120] Plumer et. al, Martin L. (March 2011). “New Paradigms
[106] Raghunath Nambiar, Meikel Poess (2011). “Transaction in Magnetic Recording” (PDF). Physics in Canada 67 (1):
Performance vs. Moore’s Law: A Trend Analysis”. 25–29. Retrieved 17 July 2014.
Springer.

[107] Michael Feroli (2013). “US: is I.T. over?" (PDF). JP- [121] Mellor, Chris (2014-11-10). “Kryder’s law craps out:
Morgan Chase Bank NA Economic Research. Retrieved Race to UBER-CHEAP STORAGE is OVER”. thereg-
2014-05-15. ister.co.uk (UK: The Register). Retrieved 2014-11-12.
Currently 2.5-inch drives are at 500GB/platter with some
[108] Ana Aizcorbe, Stephen D. Oliner, and Daniel E. Sichel at 600GB or even 667GB/platter – a long way from
(2006). “Shifting Trends in Semiconductor Prices and 20TB/platter. To reach 20TB by 2020, the 500GB/platter
the Pace of Technological Progress”. The Federal Re- drives will have to increase areal density 44 times in six
serve Board Finance and Economics Discussion Series. years. It isn't going to happen. ... Rosenthal writes: “The
Retrieved 2014-05-15. technical difficulties of migrating from PMR to HAMR,
meant that already in 2010 the Kryder rate had slowed sig-
[109] Ana Aizcorbe (2005). “Why Are Semiconductor Price nificantly and was not expected to return to its trend in the
Indexes Falling So Fast? Industry Estimates and Implica- near future. The floods reinforced this.”
tions for Productivity Measurement” (PDF). U.S. Depart-
ment of Commerce Bureau of Economic Analysis. Re- [122] “Gerald Butters is a communications industry veteran”.
trieved 2014-05-15. Forbes.com. Archived from the original on 2007-10-12.
[110] Sun, Liyang (2014-04-25). “What We Are Paying for:
[123] “Board of Directors”. LAMBDA OpticalSystems. Re-
A Quality Adjusted Price Index for Laptop Microproces-
trieved 2011-08-22.
sors”. Wellesley College. Retrieved 2014-11-07. ... com-
pared with −25% to −35% per year over 2004–2010, the
[124] Rich Tehrani. “As We May Communicate”. Tmcnet.com.
annual decline plateaus around −15% to −25% over 2010–
Retrieved 2011-08-22.
2013.

[111] Ana Aizcorbe, Samuel Kortum (2004). “Moore’s Law [125] Gail Robinson (2000-09-26). “Speeding net traffic with
and the Semiconductor Industry: A Vintage Model” tiny mirrors”. EE Times. Retrieved 2011-08-22.
(PDF). U.S. Department of Commerce Bureau of Eco-
nomic Analysis. Retrieved 2014-05-27. [126] Jakob Nielsen (1998-04-05). “Nielsen’s Law of Internet
Bandwidth”. Alertbox. Retrieved 2011-08-22.
[112] John Markoff (2004). “Intel’s Big Shift After Hitting
Technical Wall”. New York Times. Retrieved 2014-05- [127] Ziggy Switkowski (2009-04-09). “Trust the power of
27. technology”. The Australian. Retrieved 2013-12-02.
[113] James, Dick. “Intel’s 14-nm Parts are Finally Here!",
[128] EMIN GÜNSIRER, RIK FARROW. “Some Lesser-
chipworks.com, 27 October 2014. Retrieved on 5 Novem-
Known Laws of Computer Science” (PDF). Retrieved
ber 2014.
2013-12-02.
[114] Bohr, Mark (2009). “The New Era of Scaling in an SoC
World” (PDF). UCSD. Intel. Retrieved 2014-06-04. [129] “Using Moore’s Law to Predict Future Memory Trends”.
2011-11-21. Retrieved 2013-12-02.
[115] Bohr, Mark (2012). “Silicon Technology Leadership for
the Mobility Era” (PDF). Intel Corporation. Retrieved [130] Kennedy, Randall C. (2008-04-14). “Fat, fatter, fattest:
2014-06-04. Microsoft’s kings of bloat”. InfoWorld. Retrieved 2011-
08-22.
[116] Saraswat, Krishna (2002). “Scaling of Interconnections
(course notes)" (PDF). Stanford University. Retrieved [131] Rider (1944). The Scholar and the Future of the Research
2014-06-04. Memories ... don’t need too many inter- Library. New York City: Hadham Press.
connects. Logic chips are more irregular and are dom-
inated by communication requirements...generally have [132] Life 2.0. (2006, August 31). The Economist
larger number of interconnects and thus need more lev-
els of them.
[133] Carlson, Robert H. Biology Is Technology: The Promise,
[117] Bruce Jacob, Spencer Ng, David Wang. “Memory sys- Peril, and New Business of Engineering Life. Cambridge,
tems: cache, DRAM, disk”. 2007. Section 8.10.2. MA: Harvard UP, 2010. Print
“Comparison of DRAM-optimized process versus a logic-
optimized process”. Page 376. [134] “The Pace and Proliferation of Biological Technologies”
Robert Carlson. Biosecurity and Bioterrorism: Biode-
[118] Young Choi. “Battle commences in 50nm DRAM arena”. fense Strategy, Practice, and Science. September 2003,
2009. 1(3): 203–214. doi:10.1089/153871303769201851
3.10. EXTERNAL LINKS 21

3.9 Further reading


• Moore’s Law: The Life of Gordon Moore, Sili-
con Valley’s Quiet Revolutionary. Arnold Thackray,
David C. Brock, and Rachel Jones. New York: Ba-
sic Books, (May) 2015.
• Understanding Moore’s Law: Four Decades of In-
novation. Edited by David C. Brock. Philadelphia:
Chemical Heritage Press, 2006. ISBN 0-941901-
41-6. OCLC 66463488.

3.10 External links


• Intel press kit released for Moore’s Law’s 40th an-
niversary, with a 1965 sketch by Moore

• The Lives and Death of Moore’s Law – By Ilkka


Tuomi; a detailed study on Moore’s Law and its his-
torical evolution and its criticism by Kurzweil.
• No Technology has been more disruptive... Slide
show of microchip growth
• Intel (IA-32) CPU Speeds 1994–2005. Speed in-
creases in recent years have seemed to slow down
with regard to percentage increase per year (avail-
able in PDF or PNG format).
• Current Processors Chart

• International Technology Roadmap for Semicon-


ductors (ITRS)
• Gordon Moore His Law and Integrated Circuit,
Dream 2047 October 2006
• A C|net FAQ about Moore’s Law
Chapter 4

Amdahl’s law

Amdahl’s Law • n ∈ N , the number of threads of execution,


20.00

18.00 • B ∈ [0, 1] , the fraction of the algorithm that is


Parallel Portion
16.00 50% strictly serial,
75%
14.00 90%
95%
12.00 The time T (n) an algorithm takes to finish when being
Speedup

10.00 executed on n thread(s) of execution corresponds to:


8.00 ( )
T (n) = T (1) B + n1 (1 − B)
6.00

4.00 Therefore, the theoretical speedup S(n) that can be had


2.00 by executing a given algorithm on a system capable of
0.00 executing n threads of execution is:
1

16

32

64

128

256

512

1024

2048

4096

8192

16384

32768

65536

T (1) T (1) 1
Number of Processors S(n) = T (n) = T (1)(B+ n1
(1−B))
= 1
B+ n (1−B)

The speedup of a program using multiple processors in parallel


computing is limited by the sequential fraction of the program.
For example, if 95% of the program can be parallelized, the the-
4.2 Description
oretical maximum speedup using parallel computing would be
20× as shown in the diagram, no matter how many processors Amdahl’s law is a model for the expected speedup and the
are used. relationship between parallelized implementations of an
algorithm and its sequential implementations, under the
Amdahl’s law, also known as Amdahl’s argument,[1] is assumption that the problem size remains the same when
used to find the maximum expected improvement to an parallelized. For example, if for a given problem size a
overall system when only part of the system is improved. parallelized implementation of an algorithm can run 12%
It is often used in parallel computing to predict the theo- of the algorithm’s operations arbitrarily quickly (while the
retical maximum speedup using multiple processors. The remaining 88% of the operations are not parallelizable),
law is named after computer architect Gene Amdahl, and Amdahl’s law states that the maximum speedup of the
was presented at the AFIPS Spring Joint Computer Con- parallelized version is 1/(1 – 0.12) = 1.136 times as fast
ference in 1967. as the non-parallelized implementation.
The speedup of a program using multiple processors in More technically, the law is concerned with the speedup
parallel computing is limited by the time needed for the achievable from an improvement to a computation that
sequential fraction of the program. For example, if a pro- affects a proportion P of that computation where the im-
gram needs 20 hours using a single processor core, and a provement has a speedup of S. (For example, if 30% of
particular portion of the program which takes one hour the computation may be the subject of a speed up, P will
to execute cannot be parallelized, while the remaining 19 be 0.3; if the improvement makes the portion affected
hours (95%) of execution time can be parallelized, then twice as fast, S will be 2.) Amdahl’s law states that the
regardless of how many processors are devoted to a par- overall speedup of applying the improvement will be:
allelized execution of this program, the minimum execu-
tion time cannot be less than that critical one hour. Hence
the speedup is limited to at most 20×. 1 1
= = 1.1765
(1 − P ) + P
S
(1 − 0.3) + 0.3
2

To see how this formula was derived, assume that the run-
4.1 Definition ning time of the old computation was 1, for some unit of
time. The running time of the new computation will be
Given: the length of time the unimproved fraction takes (which

22
4.4. RELATION TO LAW OF DIMINISHING RETURNS 23

is 1 − P), plus the length of time the improved fraction


takes. The length of time for the improved part of the
SU − 1
1
computation is the length of the improved part’s former P =
estimated
NP − 1
1
running time divided by the speedup, making the length
of time of the improved part P/S. The final speedup is
computed by dividing the old running time by the new P estimated in this way can then be used in Amdahl’s law
running time, which is what the above formula does. to predict speedup for a different number of processors.
Here’s another example. We are given a sequential task
which is split into four consecutive parts: P1, P2, P3
and P4 with the percentages of runtime being 11%, 18%, 4.4 Relation to law of diminishing
23% and 48% respectively. Then we are told that P1 is
not sped up, so S1 = 1, while P2 is sped up 5×, P3 is returns
sped up 20×, and P4 is sped up 1.6×. By using the for-
mula P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the new Amdahl’s law is often conflated with the law of dimin-
sequential running time is: ishing returns, whereas only a special case of applying
Amdahl’s law demonstrates 'law of diminishing returns’.
If one picks optimally (in terms of the achieved speed-
0.11 0.18 0.23 0.48 up) what to improve, then one will see monotonically de-
+ + + = 0.4575.
1 5 20 1.6 creasing improvements as one improves. If, however, one
picks non-optimally, after improving a sub-optimal com-
or a little less than 1 ⁄2 the original running time. Using the ponent and moving on to improve a more optimal com-
formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)−1 , the overall ponent, one can see an increase in return. Note that it is
speed boost is 1 / 0.4575 = 2.186, or a little more than often rational to improve a system in an order that is “non-
double the original speed. Notice how the 20× and 5× optimal” in this sense, given that some improvements are
speedup don't have much effect on the overall speed when more difficult or consuming of development time than
P1 (11%) is not sped up, and P4 (48%) is sped up only others.
1.6 times.
4.3 Parallelization

In the case of parallelization, Amdahl’s law states that if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is

    S(N) = 1 / ((1 − P) + P/N)

In the limit, as N tends to infinity, the maximum speedup tends to 1/(1 − P). In practice, performance to price ratio falls rapidly as N is increased once there is even a small component of (1 − P).

As an example, if P is 90%, then (1 − P) is 10%, and the problem can be sped up by a maximum of a factor of 10, no matter how large the value of N used. For this reason, parallel computing is only useful for either small numbers of processors, or problems with very high values of P: so-called embarrassingly parallel problems. A great part of the craft of parallel programming consists of attempting to reduce the component (1 − P) to the smallest possible value.

P can be estimated by using the measured speedup (SU) on a specific number of processors (NP) using

    P_estimated = (1/SU − 1) / (1/NP − 1)

P estimated in this way can then be used in Amdahl’s law to predict speedup for a different number of processors.
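Both formulas are straightforward to evaluate. The sketch below is only an illustration: the helper names and the "6.0× speedup measured on 16 processors" figure are invented, not taken from the article.

```python
def parallel_speedup(p, n):
    """Maximum speedup with n processors when a fraction p of the
    program can be parallelized (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

def estimate_parallel_fraction(su, nproc):
    """Estimate P from a measured speedup su on nproc processors."""
    return (1.0 / su - 1.0) / (1.0 / nproc - 1.0)

for n in (2, 8, 64, 1024):
    print(n, round(parallel_speedup(0.9, n), 2))   # approaches 10x

# Hypothetical measurement: 6.0x speedup observed on 16 processors.
print(round(estimate_parallel_fraction(6.0, 16), 2))   # ~0.89
```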
4.4 Relation to law of diminishing returns

Amdahl’s law is often conflated with the law of diminishing returns, whereas only a special case of applying Amdahl’s law demonstrates a law of diminishing returns. If one picks optimally (in terms of the achieved speedup) what to improve, then one will see monotonically decreasing improvements as one improves. If, however, one picks non-optimally, then after improving a sub-optimal component and moving on to improve a more optimal component, one can see an increase in return. Note that it is often rational to improve a system in an order that is “non-optimal” in this sense, given that some improvements are more difficult or more consuming of development time than others.

Amdahl’s law does represent the law of diminishing returns if you are considering what sort of return you get by adding more processors to a machine, assuming you are running a fixed-size computation that will use all available processors to their capacity. Each new processor you add to the system will add less usable power than the previous one. Each time you double the number of processors the speedup ratio will diminish, as the total throughput heads toward the limit of 1/(1 − P).

This analysis neglects other potential bottlenecks such as memory bandwidth and I/O bandwidth, if they do not scale with the number of processors; however, taking into account such bottlenecks would tend to further demonstrate the diminishing returns of only adding processors.

4.5 Speedup in a sequential program

The maximum speedup in an improved sequential program, where some part was sped up p times, is limited by the inequality

    maximum speedup ≤ p / (1 + f·(p − 1))

where f (0 < f < 1) is the fraction of time (before the improvement) spent in the part that was not improved. For example:

Assume that a task has two independent parts, A and B. B takes roughly 25% of the time of the whole computation. By working very hard, one may be able to make this part 5 times faster, but this only reduces the time for the whole computation by a little. In contrast, one may need to perform less work to make part A be twice as fast. This will make the computation much faster than by optimizing part B, even though B’s speed-up is greater by ratio (5× versus 2×).

• If part B is made five times faster (p = 5), tA = 3, tB = 1, and f = tA/(tA + tB) = 0.75, then

    maximum speedup ≤ 5 / (1 + 0.75·(5 − 1)) = 1.25

• If part A is made to run twice as fast (p = 2), tB = 1, tA = 3, and f = tB/(tA + tB) = 0.25, then

    maximum speedup ≤ 2 / (1 + 0.25·(2 − 1)) = 1.60

Therefore, making A twice as fast is better than making B five times faster. The percentage improvement in speed can be calculated as

    improvement percentage = (1 − 1/speedup factor) · 100

• Improving part A by a factor of two will increase overall program speed by a factor of 1.6, which makes it 37.5% faster than the original computation.

• However, improving part B by a factor of five, which presumably requires more effort, will only achieve an overall speedup factor of 1.25, which makes it 20% faster.
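A few lines of Python (our own illustration of the inequality above, under the same assumed values of p and f) confirm the comparison between the two choices:

```python
def sequential_speedup_limit(p, f):
    """Upper bound on whole-program speedup when one part is made
    p times faster and a fraction f of the original running time was
    spent in the part that was NOT improved."""
    return p / (1.0 + f * (p - 1.0))

def improvement_percentage(speedup):
    return (1.0 - 1.0 / speedup) * 100.0

b = sequential_speedup_limit(5, 0.75)   # improve part B five-fold
a = sequential_speedup_limit(2, 0.25)   # improve part A two-fold
print(round(b, 2), round(improvement_percentage(b), 1))   # 1.25 20.0
print(round(a, 2), round(improvement_percentage(a), 1))   # 1.6 37.5
```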
4.6 Limitations

Amdahl’s law only applies to cases where the problem size is fixed. In practice, as more computing resources become available, they tend to get used on larger problems (larger datasets), and the time spent in the parallelizable part often grows much faster than the inherently sequential work. In this case, Gustafson’s law gives a more realistic assessment of parallel performance.[2]

4.7 See also

• Critical path method
• Karp–Flatt metric
• Moore’s law

4.8 Notes

[1] (Rodgers 1985, p. 226)

[2] Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. p. 61.

4.9 References

• Rodgers, David P. (June 1985). “Improvements in multiprocessor system design”. ACM SIGARCH Computer Architecture News (New York, NY, USA: ACM) 13 (3): 225–231. doi:10.1145/327070.327215. ISBN 0-8186-0634-7. ISSN 0163-5964.

4.10 Further reading

• Amdahl, Gene M. (1967). “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities” (PDF). AFIPS Conference Proceedings (30): 483–485. doi:10.1145/1465482.1465560.

4.11 External links

• Cases where Amdahl’s law is inapplicable
• Oral history interview with Gene M. Amdahl, Charles Babbage Institute, University of Minnesota. Amdahl discusses his graduate work at the University of Wisconsin and his design of WISC. Discusses his role in the design of several computers for IBM including the STRETCH, IBM 701, and IBM 704. He discusses his work with Nathaniel Rochester and IBM’s management of the design process. Mentions work with Ramo-Wooldridge, Aeronutronic, and Computer Sciences Corporation
• A simple interactive Amdahl’s Law calculator
• “Amdahl’s Law” by Joel F. Klein, Wolfram Demonstrations Project, 2007
• Amdahl’s Law in the Multicore Era
• Blog Post: “What the $#@! is Parallelism, Anyhow?"
• Amdahl’s Law applied to OS system calls on multicore CPU
• Evaluation of the Intel Core i7 Turbo Boost feature, by James Charles, Preet Jassi, Ananth Narayan S, Abbas Sadat and Alexandra Fedorova
• Calculation of the acceleration of parallel programs as a function of the number of threads, by George Popov, Valeri Mladenov and Nikos Mastorakis
Chapter 5

Von Neumann architecture

See also: Stored-program computer and Universal Turing machine § Stored-program computer

Von Neumann architecture scheme (diagram: Memory; Control Unit; Arithmetic Logic Unit with Accumulator; Input; Output)

The Von Neumann architecture, also known as the Von Neumann model and Princeton architecture, is a computer architecture based on that described in 1945 by the mathematician and physicist John von Neumann and others in the First Draft of a Report on the EDVAC.[1] This describes a design architecture for an electronic digital computer with parts consisting of a processing unit containing an arithmetic logic unit and processor registers, a control unit containing an instruction register and program counter, a memory to store both data and instructions, external mass storage, and input and output mechanisms.[1][2] The meaning has evolved to be any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the Von Neumann bottleneck and often limits the performance of the system.[3]

The design of a Von Neumann architecture is simpler than the more modern Harvard architecture, which is also a stored-program system but has one dedicated set of address and data buses for reading data from and writing data to memory, and another set of address and data buses for fetching instructions.

A stored-program digital computer is one that keeps its program instructions, as well as its data, in read-write, random-access memory (RAM). Stored-program computers were an advancement over the program-controlled computers of the 1940s, such as the Colossus and the ENIAC, which were programmed by setting switches and inserting patch leads to route data and control signals between various functional units. In the vast majority of modern computers, the same memory is used for both data and program instructions, and the Von Neumann vs. Harvard distinction applies to the cache architecture, not the main memory.
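The shared-memory, shared-bus behaviour described above can be sketched with a deliberately toy stored-program machine. The opcodes, memory layout and bus-access accounting below are invented for illustration and correspond to no real instruction set:

```python
# A toy stored-program machine: instructions and data live in one
# memory and share one access path, so every instruction fetch and
# every data access is a separate memory transaction.
memory = [
    ("LOAD", 9),    # acc <- mem[9]
    ("ADD", 10),    # acc <- acc + mem[10]
    ("STORE", 11),  # mem[11] <- acc
    ("HALT", 0),
    0, 0, 0, 0, 0,  # unused
    2, 3, 0,        # data at addresses 9, 10, 11
]

pc, acc, bus_accesses = 0, 0, 0
while True:
    opcode, operand = memory[pc]   # instruction fetch (one bus access)
    bus_accesses += 1
    pc += 1
    if opcode == "LOAD":
        acc = memory[operand]; bus_accesses += 1
    elif opcode == "ADD":
        acc += memory[operand]; bus_accesses += 1
    elif opcode == "STORE":
        memory[operand] = acc; bus_accesses += 1
    elif opcode == "HALT":
        break

print(memory[11], bus_accesses)   # result 5, after 7 shared-bus accesses
```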
5.1 History

The earliest computing machines had fixed programs. Some very simple computers still use this design, either for simplicity or training purposes. For example, a desk calculator (in principle) is a fixed program computer. It can do basic mathematics, but it cannot be used as a word processor or a gaming console. Changing the program of a fixed-program machine requires rewiring, restructuring, or redesigning the machine. The earliest computers were not so much “programmed” as they were “designed”. “Reprogramming”, when it was possible at all, was a laborious process, starting with flowcharts and paper notes, followed by detailed engineering designs, and then the often-arduous process of physically rewiring and rebuilding the machine. It could take three weeks to set up a program on ENIAC and get it working.[4]
With the proposal of the stored-program computer this changed. A stored-program computer includes by design an instruction set and can store in memory a set of instructions (a program) that details the computation.

A stored-program design also allows for self-modifying code. One early motivation for such a facility was the need for a program to increment or otherwise modify the address portion of instructions, which had to be done manually in early designs. This became less important when index registers and indirect addressing became usual features of machine architecture. Another use was to embed frequently used data in the instruction stream using immediate addressing. Self-modifying code has largely fallen out of favor, since it is usually hard to understand and debug, as well as being inefficient under modern processor pipelining and caching schemes.

On a large scale, the ability to treat instructions as data is what makes assemblers, compilers, linkers, loaders, and other automated programming tools possible. One can “write programs which write programs”.[5] On a smaller scale, repetitive I/O-intensive operations such as the BITBLT image manipulation primitive or pixel and vertex shaders in modern 3D graphics were considered inefficient to run without custom hardware. These operations could be accelerated on general purpose processors with “on the fly compilation” ("just-in-time compilation") technology, e.g., code-generating programs—one form of self-modifying code that has remained popular.

There are drawbacks to the Von Neumann design. Aside from the Von Neumann bottleneck described below, program modifications can be quite harmful, either by accident or design. In some simple stored-program computer designs, a malfunctioning program can damage itself, other programs, or the operating system, possibly leading to a computer crash. Memory protection and other forms of access control can usually protect against both accidental and malicious program modification.

5.2 Development of the stored-program concept

The mathematician Alan Turing, who had been alerted to a problem of mathematical logic by the lectures of Max Newman at the University of Cambridge, wrote a paper in 1936 entitled On Computable Numbers, with an Application to the Entscheidungsproblem, which was published in the Proceedings of the London Mathematical Society.[6] In it he described a hypothetical machine which he called a “universal computing machine”, and which is now known as the "Universal Turing machine". The hypothetical machine had an infinite store (memory in today’s terminology) that contained both instructions and data. John von Neumann became acquainted with Turing while he was a visiting professor at Cambridge in 1935, and also during Turing’s PhD year at the Institute for Advanced Study in Princeton, New Jersey during 1936 – 37. Whether he knew of Turing’s paper of 1936 at that time is not clear.

In 1936, Konrad Zuse also anticipated in two patent applications that machine instructions could be stored in the same storage used for data.[7]

Independently, J. Presper Eckert and John Mauchly, who were developing the ENIAC at the Moore School of Electrical Engineering, at the University of Pennsylvania, wrote about the stored-program concept in December 1943.[8][9] In planning a new machine, EDVAC, Eckert wrote in January 1944 that they would store data and programs in a new addressable memory device, a mercury metal delay line memory. This was the first time the construction of a practical stored-program machine was proposed. At that time, he and Mauchly were not aware of Turing’s work.

Von Neumann was involved in the Manhattan Project at the Los Alamos National Laboratory, which required huge amounts of calculation. This drew him to the ENIAC project, during the summer of 1944. There he joined into the ongoing discussions on the design of this stored-program computer, the EDVAC. As part of that group, he wrote up a description titled First Draft of a Report on the EDVAC[1] based on the work of Eckert and Mauchly. It was unfinished when his colleague Herman Goldstine circulated it with only von Neumann’s name on it, to the consternation of Eckert and Mauchly.[10] The paper was read by dozens of von Neumann’s colleagues in America and Europe, and influenced the next round of computer designs.

Jack Copeland considers that it is “historically inappropriate, to refer to electronic stored-program digital computers as 'von Neumann machines’".[11] His Los Alamos colleague Stan Frankel said of von Neumann’s regard for Turing’s ideas:

    I know that in or about 1943 or '44 von Neumann was well aware of the fundamental importance of Turing’s paper of 1936 ... Von Neumann introduced me to that paper and at his urging I studied it with care. Many people have acclaimed von Neumann as the “father of the computer” (in a modern sense of the term) but I am sure that he would never have made that mistake himself. He might well be called the midwife, perhaps, but he firmly emphasized to me, and to others I am sure, that the fundamental conception is owing to Turing—in so far as not anticipated by Babbage ... Both Turing and von Neumann, of course, also made substantial contributions to the “reduction to practice” of these concepts but I would not regard these as comparable in importance with the introduction and explication of the concept of a computer able to store in its memory its program of activities and of modifying that program in the course of these activities.[12]
At the time that the “First Draft” report was circulated, Turing was producing a report entitled Proposed Electronic Calculator which described in engineering and programming detail his idea of a machine that was called the Automatic Computing Engine (ACE).[13] He presented this to the Executive Committee of the British National Physical Laboratory on February 19, 1946. Although Turing knew from his wartime experience at Bletchley Park that what he proposed was feasible, the secrecy surrounding Colossus, that was subsequently maintained for several decades, prevented him from saying so. Various successful implementations of the ACE design were produced.

Both von Neumann’s and Turing’s papers described stored-program computers, but von Neumann’s earlier paper achieved greater circulation and the computer architecture it outlined became known as the “von Neumann architecture”. In the 1953 publication Faster than Thought: A Symposium on Digital Computing Machines (edited by B.V. Bowden), a section in the chapter on Computers in America reads as follows:[14]

    The Machine of the Institute For Advanced Studies, Princeton

    In 1945, Professor J. von Neumann, who was then working at the Moore School of Engineering in Philadelphia, where the E.N.I.A.C. had been built, issued on behalf of a group of his co-workers a report on the logical design of digital computers. The report contained a fairly detailed proposal for the design of the machine which has since become known as the E.D.V.A.C. (electronic discrete variable automatic computer). This machine has only recently been completed in America, but the von Neumann report inspired the construction of the E.D.S.A.C. (electronic delay-storage automatic calculator) in Cambridge (see page 130).

    In 1947, Burks, Goldstine and von Neumann published another report which outlined the design of another type of machine (a parallel machine this time) which should be exceedingly fast, capable perhaps of 20,000 operations per second. They pointed out that the outstanding problem in constructing such a machine was in the development of a suitable memory, all the contents of which were instantaneously accessible, and at first they suggested the use of a special vacuum tube — called the "Selectron" — which had been invented by the Princeton Laboratories of the R.C.A. These tubes were expensive and difficult to make, so von Neumann subsequently decided to build a machine based on the Williams memory. This machine, which was completed in June, 1952 in Princeton, has become popularly known as the Maniac. The design of this machine has inspired that of half a dozen or more machines which are now being built in America, all of which are known affectionately as “Johniacs."

In the same book, the first two paragraphs of a chapter on ACE read as follows:[15]

    Automatic Computation at the National Physical Laboratory

    One of the most modern digital computers which embodies developments and improvements in the technique of automatic electronic computing was recently demonstrated at the National Physical Laboratory, Teddington, where it has been designed and built by a small team of mathematicians and electronics research engineers on the staff of the Laboratory, assisted by a number of production engineers from the English Electric Company, Limited. The equipment so far erected at the Laboratory is only the pilot model of a much larger installation which will be known as the Automatic Computing Engine, but although comparatively small in bulk and containing only about 800 thermionic valves, as can be judged from Plates XII, XIII and XIV, it is an extremely rapid and versatile calculating machine.

    The basic concepts and abstract principles of computation by a machine were formulated by Dr. A. M. Turing, F.R.S., in a paper read before the London Mathematical Society in 1936, but work on such machines in Britain was delayed by the war. In 1945, however, an examination of the problems was made at the National Physical Laboratory by Mr. J. R. Womersley, then superintendent of the Mathematics Division of the Laboratory. He was joined by Dr. Turing and a small staff of specialists, and, by 1947, the preliminary planning was sufficiently advanced to warrant the establishment of the special group already mentioned. In April, 1948, the latter became the Electronics Section of the Laboratory, under the charge of Mr. F. M. Colebrook.

5.3 Early von Neumann-architecture computers

The First Draft described a design that was used by many universities and corporations to construct their computers.[16] Among these various computers, only ILLIAC and ORDVAC had compatible instruction sets.
• Manchester Small-Scale Experimental Machine (SSEM), nicknamed “Baby” (University of Manchester, England) made its first successful run of a stored program on June 21, 1948.
• EDSAC (University of Cambridge, England) was the first practical stored-program electronic computer (May 1949)
• Manchester Mark 1 (University of Manchester, England) Developed from the SSEM (June 1949)
• CSIRAC (Council for Scientific and Industrial Research) Australia (November 1949)
• EDVAC (Ballistic Research Laboratory, Computing Laboratory at Aberdeen Proving Ground 1951)
• ORDVAC (U-Illinois) at Aberdeen Proving Ground, Maryland (completed November 1951)[17]
• IAS machine at Princeton University (January 1952)
• MANIAC I at Los Alamos Scientific Laboratory (March 1952)
• ILLIAC at the University of Illinois (September 1952)
• BESM-1 in Moscow (1952)
• AVIDAC at Argonne National Laboratory (1953)
• ORACLE at Oak Ridge National Laboratory (June 1953)
• BESK in Stockholm (1953)
• JOHNNIAC at RAND Corporation (January 1954)
• DASK in Denmark (1955)
• WEIZAC in Rehovoth (1955)
• PERM in Munich (1956?)
• SILLIAC in Sydney (1956)

5.4 Early stored-program computers

The date information in the following chronology is difficult to put into proper order. Some dates are for first running a test program, some dates are the first time the computer was demonstrated or completed, and some dates are for the first delivery or installation.

• The IBM SSEC had the ability to treat instructions as data, and was publicly demonstrated on January 27, 1948. This ability was claimed in a US patent.[18] However it was partially electromechanical, not fully electronic. In practice, instructions were read from paper tape due to its limited memory.[19]
• The Manchester SSEM (the Baby) was the first fully electronic computer to run a stored program. It ran a factoring program for 52 minutes on June 21, 1948, after running a simple division program and a program to show that two numbers were relatively prime.
• The ENIAC was modified to run as a primitive read-only stored-program computer (using the Function Tables for program ROM) and was demonstrated as such on September 16, 1948, running a program by Adele Goldstine for von Neumann.
• The BINAC ran some test programs in February, March, and April 1949, although it was not completed until September 1949.
• The Manchester Mark 1 developed from the SSEM project. An intermediate version of the Mark 1 was available to run programs in April 1949, but it was not completed until October 1949.
• The EDSAC ran its first program on May 6, 1949.
• The EDVAC was delivered in August 1949, but it had problems that kept it from being put into regular operation until 1951.
• The CSIR Mk I ran its first program in November 1949.
• The SEAC was demonstrated in April 1950.
• The Pilot ACE ran its first program on May 10, 1950 and was demonstrated in December 1950.
• The SWAC was completed in July 1950.
• The Whirlwind was completed in December 1950 and was in actual use in April 1951.
• The first ERA Atlas (later the commercial ERA 1101/UNIVAC 1101) was installed in December 1950.

5.5 Evolution

Through the decades of the 1960s and 1970s computers generally became both smaller and faster, which led to some evolutions in their architecture. For example, memory-mapped I/O allows input and output devices to be treated the same as memory.[20] A single system bus could be used to provide a modular system with lower cost.
Single system bus evolution of the architecture (diagram: CPU, Memory, and Input and Output connected by a system bus comprising control, address and data buses)

This is sometimes called a “streamlining” of the architecture.[21] In subsequent decades, simple microcontrollers would sometimes omit features of the model to lower cost and size. Larger computers added features for higher performance.

5.6 Von Neumann bottleneck

The shared bus between the program memory and data memory leads to the Von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory. Because program memory and data memory cannot be accessed at the same time, throughput is much smaller than the rate at which the CPU can work. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data. The CPU is continually forced to wait for needed data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between them, the bottleneck has become more of a problem, a problem whose severity increases with every newer generation of CPU.

The von Neumann bottleneck was described by John Backus in his 1977 ACM Turing Award lecture. According to Backus:

    Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself, but where to find it.[22][23]

The performance problem can be alleviated (to some extent) by several mechanisms. Providing a cache between the CPU and the main memory, providing separate caches or separate access paths for data and instructions (the so-called Modified Harvard architecture), using branch predictor algorithms and logic, and providing a limited CPU stack or other on-chip scratchpad memory to reduce memory access are four of the ways performance is increased. The problem can also be sidestepped somewhat by using parallel computing, using for example the Non-Uniform Memory Access (NUMA) architecture—this approach is commonly employed by supercomputers. It is less clear whether the intellectual bottleneck that Backus criticized has changed much since 1977. Backus’s proposed solution has not had a major influence. Modern functional programming and object-oriented programming are much less geared towards “pushing vast numbers of words back and forth” than earlier languages like Fortran were, but internally, that is still what computers spend much of their time doing, even highly parallel supercomputers.

As of 1996, a database benchmark study found that three out of four CPU cycles were spent waiting for memory. Researchers expect that increasing the number of simultaneous instruction streams with multithreading or single-chip multiprocessing will make this bottleneck even worse.[24]
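The benefit of the first of those mechanisms, a cache, is often summarized with the standard average-memory-access-time estimate. The sketch below uses that textbook model with made-up latency and miss-rate figures; it is an illustration of the idea, not data from any particular machine.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time seen by the CPU when a cache with
    the given hit time intercepts (1 - miss_rate) of all accesses."""
    return hit_time + miss_rate * miss_penalty

# Illustrative figures only: a 1-cycle cache in front of 100-cycle memory.
no_cache = 100
for miss_rate in (0.20, 0.05, 0.01):
    amat = average_access_time(1, miss_rate, 100)
    print(miss_rate, amat, round(no_cache / amat, 1))  # miss rate, AMAT, gain
```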
5.7 Non–von Neumann processors

The National Semiconductor COP8 was introduced in 1986; it has a Modified Harvard architecture.[25][26]

Reduceron is an attempt to create a processor for direct Functional Program execution.

Perhaps the most common kind of non–von Neumann structure used in modern computers is content-addressable memory (CAM).

5.8 See also

• CARDboard Illustrative Aid to Computation
• Harvard architecture
• Interconnect bottleneck
• Little man computer
• Modified Harvard architecture
• Random-access machine
• Turing machine
5.9 References

5.9.1 Inline

[1] von Neumann, John (1945), First Draft of a Report on the EDVAC (PDF), archived from the original (PDF) on March 14, 2013, retrieved August 24, 2011

[2] Ganesan 2009

[3] Markgraf, Joey D. (2007), The Von Neumann bottleneck, retrieved August 24, 2011

[4] Copeland 2006, p. 104

[5] MFTL (My Favorite Toy Language) entry Jargon File 4.4.7, retrieved 2008-07-11

[6] Turing, A.M. (1936), “On Computable Numbers, with an Application to the Entscheidungsproblem”, Proceedings of the London Mathematical Society, 2 (1937) 42: 230–65, doi:10.1112/plms/s2-42.1.230 (and Turing, A.M. (1938), “On Computable Numbers, with an Application to the Entscheidungsproblem. A correction”, Proceedings of the London Mathematical Society, 2 (1937) 43 (6): 544–6, doi:10.1112/plms/s2-43.6.544)

[7] “Electronic Digital Computers”, Nature 162, September 25, 1948: 487, doi:10.1038/162487a0, retrieved 2009-04-10

[8] Lukoff, Herman (1979), From Dits to Bits...: A Personal History of the Electronic Computer, Robotics Press, ISBN 978-0-89661-002-6

[9] ENIAC project administrator Grist Brainerd’s December 1943 progress report for the first period of the ENIAC’s development implicitly proposed the stored program concept (while simultaneously rejecting its implementation in the ENIAC) by stating that “in order to have the simplest project and not to complicate matters” the ENIAC would be constructed without any “automatic regulation”.

[10] Copeland 2006, p. 113

[11] Copeland, Jack (2000), A Brief History of Computing: ENIAC and EDVAC, retrieved January 27, 2010

[12] Copeland, Jack (2000), A Brief History of Computing: ENIAC and EDVAC, retrieved 27 January 2010, which cites Randell, B. (1972), Meltzer, B.; Michie, D., eds., “On Alan Turing and the Origins of Digital Computers”, Machine Intelligence 7 (Edinburgh: Edinburgh University Press): 10, ISBN 0-902383-26-4

[13] Copeland 2006, pp. 108–111

[14] Bowden 1953, pp. 176, 177

[15] Bowden 1953, p. 135

[16] “Electronic Computer Project”. Institute for Advanced Study. Retrieved May 26, 2011.

[17] James E. Robertson (1955), Illiac Design Techniques, report number UIUCDCS-R-1955-146, Digital Computer Laboratory, University of Illinois at Urbana-Champaign

[18] F.E. Hamilton, R.R. Seeber, R.A. Rowley, and E.S. Hughes (January 19, 1949). “Selective Sequence Electronic Calculator”. US Patent 2,636,672. Retrieved April 28, 2011. Issued April 28, 1953.

[19] Herbert R.J. Grosch (1991), Computer: Bit Slices From a Life, Third Millennium Books, ISBN 0-88733-085-1

[20] C. Gordon Bell; R. Cady; H. McFarland; J. O'Laughlin; R. Noonan; W. Wulf (1970), “A New Architecture for Mini-Computers—The DEC PDP-11” (PDF), Spring Joint Computer Conference: 657–675.

[21] Linda Null; Julia Lobur (2010), The essentials of computer organization and architecture (3rd ed.), Jones & Bartlett Learning, pp. 36, 199–203, ISBN 978-1-4496-0006-8

[22] Backus, John W. “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”. doi:10.1145/359576.359579.

[23] Dijkstra, Edsger W. “E. W. Dijkstra Archive: A review of the 1977 Turing Award Lecture”. Retrieved 2008-07-11.

[24] Richard L. Sites, Yale Patt. “Architects Look to Processors of Future”. Microprocessor Report. 1996.

[25] “COP8 Basic Family User’s Manual” (PDF). National Semiconductor. Retrieved 2012-01-20.

[26] “COP888 Feature Family User’s Manual” (PDF). National Semiconductor. Retrieved 2012-01-20.

5.9.2 General

• Bowden, B.V., ed. (1953), Faster Than Thought: A Symposium on Digital Computing Machines, London: Sir Isaac Pitman and Sons Ltd.
• Rojas, Raúl; Hashagen, Ulf, eds. (2000), The First Computers: History and Architectures, MIT Press, ISBN 0-262-18197-5
• Davis, Martin (2000), The universal computer: the road from Leibniz to Turing, New York: W W Norton & Company Inc., ISBN 0-393-04785-7; republished as: Davis, Martin (2001), Engines of Logic: Mathematicians and the Origin of the Computer, New York: W. W. Norton & Company, ISBN 978-0-393-32229-3
• Can Programming be Liberated from the von Neumann Style?, John Backus, 1977 ACM Turing Award Lecture. Communications of the ACM, August 1978, Volume 21, Number 8. Online PDF; see details at http://www.cs.tufts.edu/~nr/backus-lecture.html
• C. Gordon Bell and Allen Newell (1971), Computer Structures: Readings and Examples, McGraw-Hill Book Company, New York. Massive (668 pages)
• Copeland, Jack (2006), “Colossus and the Rise of the Modern Computer”, in Copeland, B. Jack, Colossus: The Secrets of Bletchley Park’s Codebreaking Computers, Oxford: Oxford University Press, ISBN 978-0-19-284055-4
• Ganesan, Deepak (2009), The Von Neumann Model (PDF), retrieved October 22, 2011
• McCartney, Scott (1999). ENIAC: The Triumphs and Tragedies of the World’s First Computer. Walker & Co. ISBN 0-8027-1348-3.
• Goldstine, Herman H. (1972). The Computer from Pascal to von Neumann. Princeton University Press. ISBN 0-691-08104-2.
• Shurkin, Joel (1984). Engines of the Mind - a history of the computer. New York, London: W.W. Norton & Company. ISBN 0-393-01804-0.

5.10 External links

• Harvard vs von Neumann
• A tool that emulates the behavior of a von Neumann machine
• JOHNNY – A simple Open Source simulator of a von Neumann machine for educational purposes
Chapter 6

Harvard architecture

For the architecture program at Harvard University, see Harvard Graduate School of Design.

Harvard architecture (diagram: ALU; Control unit; Instruction memory; Data memory; I/O)

The Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. The term originated from the Harvard Mark I relay-based computer, which stored instructions on punched tape (24 bits wide) and data in electro-mechanical counters. These early machines had data storage entirely contained within the central processing unit, and provided no access to the instruction storage as data. Programs needed to be loaded by an operator; the processor could not initialize itself.

Today, most processors implement such separate signal pathways for performance reasons, but actually implement a modified Harvard architecture, so they can support tasks like loading a program from disk storage as data and then executing it.

6.1 Memory details

In a Harvard architecture, there is no need to make the two memories share characteristics. In particular, the word width, timing, implementation technology, and memory address structure can differ. In some systems, instructions can be stored in read-only memory while data memory generally requires read-write memory. In some systems, there is much more instruction memory than data memory so instruction addresses are wider than data addresses.

6.1.1 Contrast with von Neumann architectures

Under pure von Neumann architecture the CPU can be either reading an instruction or reading/writing data from/to the memory. Both cannot occur at the same time since the instructions and data use the same bus system. In a computer using the Harvard architecture, the CPU can both read an instruction and perform a data memory access at the same time, even without a cache. A Harvard architecture computer can thus be faster for a given circuit complexity because instruction fetches and data access do not contend for a single memory pathway.

Also, a Harvard architecture machine has distinct code and data address spaces: instruction address zero is not the same as data address zero. Instruction address zero might identify a twenty-four bit value, while data address zero might indicate an eight-bit byte that isn't part of that twenty-four bit value.
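A rough cycle count shows why the simultaneous access matters. The instruction mix below is invented purely for illustration and assumes one access per memory port per cycle:

```python
instructions = 1000
fetches = instructions            # every instruction must be fetched
data_accesses = 400               # suppose 40% of them also touch data

# Von Neumann: fetches and data accesses share one memory pathway,
# so each access occupies its own cycle on the common bus.
von_neumann_cycles = fetches + data_accesses

# Harvard: an instruction fetch can proceed in parallel with a data
# access, so a cycle can service one of each.
harvard_cycles = max(fetches, data_accesses)

print(von_neumann_cycles, harvard_cycles)   # 1400 vs 1000
```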
6.1.2 Contrast with modified Harvard architecture

Main article: Modified Harvard architecture

A modified Harvard architecture machine is very much like a Harvard architecture machine, but it relaxes the strict separation between instruction and data while still letting the CPU concurrently access two (or more) memory buses. The most common modification includes separate instruction and data caches backed by a common address space. While the CPU executes from cache, it acts as a pure Harvard machine. When accessing backing memory, it acts like a von Neumann machine (where code can be moved around like data, which is a powerful technique). This modification is widespread in modern processors, such as the ARM architecture and x86 processors. It is sometimes loosely called a Harvard architecture, overlooking the fact that it is actually “modified”.

Another modification provides a pathway between the instruction memory (such as ROM or flash memory) and the CPU to allow words from the instruction memory to be treated as read-only data. This technique is used in some microcontrollers, including the Atmel AVR.
This allows constant data, such as text strings or function tables, to be accessed without first having to be copied into data memory, preserving scarce (and power-hungry) data memory for read/write variables. Special machine language instructions are provided to read data from the instruction memory. (This is distinct from instructions which themselves embed constant data, although for individual constants the two mechanisms can substitute for each other.)

6.2 Speed

In recent years, the speed of the CPU has grown many times in comparison to the access speed of the main memory. Care needs to be taken to reduce the number of times main memory is accessed in order to maintain performance. If, for instance, every instruction run in the CPU requires an access to memory, the computer gains nothing for increased CPU speed—a problem referred to as being memory bound.

It is possible to make extremely fast memory, but this is only practical for small amounts of memory for cost, power and signal routing reasons. The solution is to provide a small amount of very fast memory known as a CPU cache which holds recently accessed data. As long as the data that the CPU needs are in the cache, the performance is much higher than it is when the cache has to get the data from the main memory.

6.2.1 Internal vs. external design

Modern high performance CPU chip designs incorporate aspects of both Harvard and von Neumann architecture. In particular, the “split cache” version of the modified Harvard architecture is very common. CPU cache memory is divided into an instruction cache and a data cache. Harvard architecture is used as the CPU accesses the cache. In the case of a cache miss, however, the data is retrieved from the main memory, which is not formally divided into separate instruction and data sections, although it may well have separate memory controllers used for concurrent access to RAM, ROM and (NOR) flash memory.

Thus, while a von Neumann architecture is visible in some contexts, such as when data and code come through the same memory controller, the hardware implementation gains the efficiencies of the Harvard architecture for cache accesses and at least some main memory accesses.

In addition, CPUs often have write buffers which let CPUs proceed after writes to non-cached regions. The von Neumann nature of memory is then visible when instructions are written as data by the CPU and software must ensure that the caches (data and instruction) and write buffer are synchronized before trying to execute those just-written instructions.

6.3 Modern uses of the Harvard architecture

The principal advantage of the pure Harvard architecture—simultaneous access to more than one memory system—has been reduced by modified Harvard processors using modern CPU cache systems. Relatively pure Harvard architecture machines are used mostly in applications where trade-offs, like the cost and power savings from omitting caches, outweigh the programming penalties from featuring distinct code and data address spaces.

• Digital signal processors (DSPs) generally execute small, highly optimized audio or video processing algorithms. They avoid caches because their behavior must be extremely reproducible. The difficulties of coping with multiple address spaces are of secondary concern to speed of execution. Consequently, some DSPs feature multiple data memories in distinct address spaces to facilitate SIMD and VLIW processing. Texas Instruments TMS320 C55x processors, for one example, feature multiple parallel data buses (two write, three read) and one instruction bus.

• Microcontrollers are characterized by having small amounts of program (flash memory) and data (SRAM) memory, with no cache, and take advantage of the Harvard architecture to speed processing by concurrent instruction and data access. The separate storage means the program and data memories may feature different bit widths, for example using 16-bit wide instructions and 8-bit wide data. They also mean that instruction prefetch can be performed in parallel with other activities. Examples include the AVR by Atmel Corp and the PIC by Microchip Technology, Inc.

Even in these cases, it is common to employ special instructions in order to access program memory as though it were data for read-only tables, or for reprogramming; those processors are modified Harvard architecture processors.

6.4 External links

• Harvard vs von Neumann
• Harvard vs Von Nuemann [sic] Architecture
• ARM Information center
Chapter 7

Microarchitecture

“Computer organization” redirects here. For organizations that make computers, see List of computer system manufacturers. For one classification of computer architectures, see Flynn’s taxonomy. For another classification of instruction set architectures, see Instruction set § Number of operands.

Intel Core 2 Architecture (Intel Core microarchitecture)

In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor.[1] A given ISA may be implemented with different microarchitectures;[2][3] implementations may vary due to different goals of a given design or due to shifts in technology.[4]

Computer architecture is the combination of microarchitecture and instruction set designs.

7.1 Relation to instruction set architecture

The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, address and data formats among other things. The microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA.

Single bus organization microarchitecture

The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALUs) and even larger elements. These diagrams generally separate the datapath (where data is placed) and the control path (which can be said to steer the data).[5] The person designing a system usually draws the specific microarchitecture as a kind of data flow diagram.
Like a block diagram, the microarchitecture diagram shows microarchitectural elements such as the arithmetic and logic unit and the register file as a single schematic symbol. Typically the diagram connects those elements with arrows and thick lines and thin lines to distinguish between three-state buses (which require a three-state buffer for each device that drives the bus), unidirectional buses (always driven by a single source, such as the way the address bus on simpler computers is always driven by the memory address register), and individual control lines. Very simple computers have a single data bus organization: they have a single three-state bus. The diagram of more complex computers usually shows multiple three-state buses, which help the machine do more operations simultaneously.

Each microarchitectural element is in turn represented by a schematic describing the interconnections of logic gates used to implement it. Each logic gate is in turn represented by a circuit diagram describing the connections of the transistors used to implement it in some particular logic family. Machines with different microarchitectures may have the same instruction set architecture, and thus be capable of executing the same programs. New microarchitectures and/or circuitry solutions, along with advances in semiconductor manufacturing, are what allow newer generations of processors to achieve higher performance while using the same ISA.

In principle, a single microarchitecture could execute several different ISAs with only minor changes to the microcode.

7.2 Aspects of microarchitecture

Intel 80286 microarchitecture

The pipelined datapath is the most commonly used datapath design in microarchitecture today. This technique is used in most modern microprocessors, microcontrollers, and DSPs. The pipelined architecture allows multiple instructions to overlap in execution, much like an assembly line. The pipeline includes several different stages which are fundamental in microarchitecture designs.[5] Some of these stages include instruction fetch, instruction decode, execute, and write back. Some architectures include other stages such as memory access. The design of pipelines is one of the central microarchitectural tasks.

Execution units are also essential to microarchitecture. Execution units include arithmetic logic units (ALU), floating point units (FPU), load/store units, branch prediction, and SIMD. These units perform the operations or calculations of the processor. The choice of the number of execution units, their latency and throughput is a central microarchitectural design task. The size, latency, throughput and connectivity of memories within the system are also microarchitectural decisions.

System-level design decisions such as whether or not to include peripherals, such as memory controllers, can be considered part of the microarchitectural design process. This includes decisions on the performance level and connectivity of these peripherals.

Unlike architectural design, where achieving a specific performance level is the main goal, microarchitectural design pays closer attention to other constraints. Since microarchitecture design decisions directly affect what goes into a system, attention must be paid to such issues as:

• Chip area/cost
• Power consumption
• Logic complexity
• Ease of connectivity
• Manufacturability
• Ease of debugging
• Testability

7.3 Microarchitectural concepts

7.3.1 Instruction cycle

Main article: instruction cycle

In general, all CPUs, single-chip microprocessors or multi-chip implementations run programs by performing the following steps:

1. Read an instruction and decode it
2. Find any associated data that is needed to process the instruction
3. Process the instruction
4. Write the results out

The instruction cycle is repeated continuously until the power is turned off.
7.3.2 Increasing execution speed

Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory and non-volatile storage like hard disks (where the program instructions and data reside), has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. Initially, these techniques could only be implemented on expensive mainframes or supercomputers due to the amount of circuitry needed for these techniques. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip. See Moore’s law.

7.3.3 Instruction set choice

Instruction sets have shifted over the years, from originally very simple to sometimes very complex (in various respects). In recent years, load-store architectures, VLIW and EPIC types have been in fashion. Architectures that are dealing with data parallelism include SIMD and Vectors. Some labels used to denote classes of CPU architectures are not particularly descriptive, especially so the CISC label; many early designs retroactively denoted "CISC" are in fact significantly simpler than modern RISC processors (in several respects).

However, the choice of instruction set architecture may greatly affect the complexity of implementing high performance devices. The prominent strategy, used to develop the first RISC processors, was to simplify instructions to a minimum of individual semantic complexity combined with high encoding regularity and simplicity. Such uniform instructions were easily fetched, decoded and executed in a pipelined fashion, and this was a simple strategy to reduce the number of logic levels in order to reach high operating frequencies; instruction cache-memories compensated for the higher operating frequency and inherently low code density, while large register sets were used to factor out as much of the (slow) memory accesses as possible.

7.3.4 Instruction pipelining

Main article: instruction pipeline

One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs would carry out all of the steps above for one instruction before moving onto the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution and so on.

Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was waiting for results. This would allow up to four instructions to be “in flight” at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps) the CPU as a whole “retires” instructions much faster.

RISC makes pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making them take the same amount of time — one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the Classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone on a CISC design. This was the real reason that RISC was faster. Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.

Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX implementation (VAX 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.
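The "four times as fast" intuition follows from a simple cycle-count model. The sketch below is an idealized illustration (it ignores stalls, hazards and flushes, which the following sections address):

```python
def unpipelined_cycles(instructions, stages):
    """Each instruction completes all stages before the next starts."""
    return instructions * stages

def pipelined_cycles(instructions, stages):
    """Ideal pipeline: one instruction completes per cycle once the
    pipeline is full."""
    return stages + (instructions - 1)

n, stages = 1000, 4   # the four steps listed earlier
print(unpipelined_cycles(n, stages))   # 4000
print(pipelined_cycles(n, stages))     # 1003
print(round(unpipelined_cycles(n, stages) / pipelined_cycles(n, stages), 2))  # ~3.99
```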
7.3.5 Cache

Main article: CPU cache

It was not long before improvements in chip manufacturing allowed for even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory, memory that can be accessed in a few cycles as opposed to many needed to “talk” to main memory. The CPU includes a cache controller which automates reading and writing from the cache; if the data is already in the cache it simply “appears”, whereas if it is not the processor is “stalled” while the cache controller reads it in.

RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have at least 512 KB, while more powerful CPUs come with 1 or 2 or even 4, 6, 8 or 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more performance, due to reduced stalling.

Caches and pipelines were a perfect match for each other. Previously, it didn't make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead, meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.

7.3.6 Branch prediction

Main article: Branch predictor

One barrier to achieving higher performance through instruction-level parallelism stems from pipeline stalls and flushes due to branches. Normally, whether a conditional branch will be taken isn't known until late in the pipeline as conditional branches depend on results coming from a register. From the time that the processor’s instruction decoder has figured out that it has encountered a conditional branch instruction to the time that the deciding register value can be read out, the pipeline needs to be stalled for several cycles, or if it’s not and the branch is taken, the pipeline needs to be flushed. As clock speeds increase the depth of the pipeline increases with it, and some modern processors may have 20 stages or more. On average, every fifth instruction executed is a branch, so without any intervention, that’s a high amount of stalling.

Techniques such as branch prediction and speculative execution are used to lessen these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. In reality one side or the other of the branch will be called much more often than the other. Modern designs have rather complex statistical prediction systems, which watch the results of past branches to predict the future with greater accuracy. The guess allows the hardware to prefetch instructions without waiting for the register read. Speculative execution is a further enhancement in which the code along the predicted path is not just prefetched but also executed before it is known whether the branch should be taken or not. This can yield better performance when the guess is good, with the risk of a huge penalty when the guess is bad because instructions need to be undone.

One set of instructions is executed first to leave the register to the other set, but if the other set is assigned to a different but similar register, both sets of instructions can be executed in parallel or in series.
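A register rename table of the kind implied here can be sketched briefly in C. This is an illustrative model, not any real design: every write to an architectural register is assigned a fresh physical register, so a second group of instructions that reuses the same architectural register no longer has to wait for the first group to finish with it.

    /* Illustrative rename table: architectural registers r0-r3 are
     * remapped to a new physical register on every write. */
    #include <stdio.h>

    #define ARCH_REGS 4

    static int rename_table[ARCH_REGS];   /* architectural -> physical */
    static int next_physical = ARCH_REGS;

    static int rename_dest(int arch_reg) { /* instruction writes arch_reg */
        rename_table[arch_reg] = next_physical++;
        return rename_table[arch_reg];
    }

    int main(void) {
        for (int r = 0; r < ARCH_REGS; r++) rename_table[r] = r;

        /* Group 1: r1 = r2 + r3  -- writes r1 */
        int g1 = rename_dest(1);
        /* Group 2: r1 = r0 + r0  -- reuses r1, but receives a different
         * physical register, so it need not wait for group 1. */
        int g2 = rename_dest(1);

        printf("group 1 writes p%d, group 2 writes p%d\n", g1, g2);
        return 0;
    }

Because the two groups end up writing different physical registers, the apparent name conflict disappears and the hardware is free to schedule them independently.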
7.3.10 Multiprocessing and multithreading

Main articles: Multiprocessing and Multithreading (computer architecture)

Computer architects have become stymied by the growing mismatch in CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism (ILP) within one program could make up for the long stalls that occurred when data had to be fetched from main memory. Additionally, the large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread.

This trend is sometimes known as throughput computing. This idea originated in the mainframe market where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.

One technique of how this parallelism is achieved is through multiprocessing systems, computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small-scale (2-8) multiprocessor servers have become commonplace for the small business market. For large corporations, large-scale (16-256) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.

With further transistor size reductions made available with semiconductor technology advances, multicore CPUs have appeared where multiple CPUs are implemented on the same silicon chip. They were initially used in chips targeting embedded markets, where simpler and smaller CPUs would allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed dual high-end desktop CPU CMP chips to be manufactured in volume. Some designs, such as Sun Microsystems' UltraSPARC T1, have reverted to simpler (scalar, in-order) designs in order to fit more processors on one piece of silicon.

Another technique that has become more popular recently is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling for the data to arrive, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle.

Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.

A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.

7.4 See also

• List of AMD CPU microarchitectures
• List of Intel CPU microarchitectures
• Microprocessor
• Microcontroller
• Digital signal processor (DSP)
• CPU design
• Hardware description language (HDL)
• Hardware architecture
• Harvard architecture
• von Neumann architecture
• Multi-core (computing)
• Datapath
• Dataflow architecture
• Very-large-scale integration (VLSI)
• VHDL
• Verilog
• Stream processing
• Instruction level parallelism (ILP)

7.5 References

[1] Curriculum Guidelines for Undergraduate Degree Programs in Computer Engineering (PDF). Association for Computing Machinery. 2004. p. 60. Comments on Computer Architecture and Organization: Computer architecture is a key component of computer engineering and the practicing computer engineer should have a practical understanding of this topic...

[2] Miles Murdocca and Vincent Heuring (2007). Computer Architecture and Organization, An Integrated Approach. Wiley. p. 151.

[3] Clements, Alan. Principles of Computer Hardware (4th ed.). pp. 1–2.

[4] Michael J. Flynn (2007). Computer Architecture: Pipelined and Parallel Processor Design. Jones and Bartlett. pp. 1–3.

[5] John L. Hennessy and David A. Patterson (2006). Computer Architecture: A Quantitative Approach (4th ed.). Morgan Kaufmann Publishers, Inc. ISBN 0-12-370490-1.

7.6 Further reading

• D. Patterson and J. Hennessy (2004-08-02). Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, Inc. ISBN 1-55860-604-1.
• V. C. Hamacher, Z. G. Vranesic, and S. G. Zaky (2001-08-02). Computer Organization. McGraw-Hill. ISBN 0-07-232086-9.
• William Stallings (2002-07-15). Computer Organization and Architecture. Prentice Hall. ISBN 0-13-035119-9.
• J. P. Hayes (2002-09-03). Computer Architecture and Organization. McGraw-Hill. ISBN 0-07-286198-3.
• Gary Michael Schneider (1985). The Principles of Computer Organization. Wiley. pp. 6–7. ISBN 0-471-88552-5.
• M. Morris Mano (1992-10-19). Computer System Architecture. Prentice Hall. p. 3. ISBN 0-13-175563-3.
• Mostafa Abd-El-Barr and Hesham El-Rewini (2004-12-03). Fundamentals of Computer Organization and Architecture. Wiley-Interscience. p. 1. ISBN 0-471-46741-3.
• PC Processor Microarchitecture
• Computer Architecture: A Minimalist Perspective (book webpage)
Chapter 8

Central processing unit

“CPU” redirects here. For other uses, see CPU (disambiguation).
“Computer processor” redirects here. For other uses, see Processor (computing).

An Intel 80486DX2 CPU, as seen from above

Bottom side of an Intel 80486DX2

A central processing unit (CPU) is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logical, control and input/output (I/O) operations specified by the instructions. The term has been used in the computer industry at least since the early 1960s.[1] Traditionally, the term “CPU” refers to a processor and its control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.[2]

The form, design and implementation of CPUs have changed over the course of their history, but their fundamental operation remains almost unchanged. Principal components of a CPU include the arithmetic logic unit (ALU) that performs arithmetic and logic operations, hardware registers that supply operands to the ALU and store the results of ALU operations, and a control unit that fetches instructions from memory and “executes” them by directing the coordinated operations of the ALU, registers and other components.

Most modern CPUs are microprocessors, meaning they are contained on a single integrated circuit (IC) chip. An IC that contains a CPU may also contain memory, peripheral interfaces, and other components of a computer; such integrated devices are variously called microcontrollers or systems on a chip (SoC). Some computers employ a multi-core processor, which is a single chip containing two or more CPUs called “cores”; in that context, single chips are sometimes referred to as “sockets”.[3] Array processors or vector processors have multiple processors that operate in parallel, with no unit considered central.

8.1 History

Main article: History of general-purpose CPUs

Computers such as the ENIAC had to be physically rewired to perform different tasks, which caused these machines to be called “fixed-program computers”.[4] Since the term “CPU” is generally defined as a device for software (computer program) execution, the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer.

The idea of a stored-program computer was already present in the design of J. Presper Eckert and John William Mauchly's ENIAC, but was initially omitted so that it could be finished sooner. On June 30, 1945, before ENIAC was made, mathematician John von Neumann distributed the paper entitled First Draft of a Report on the EDVAC. It was the outline of a stored-program computer that would eventually be completed in August 1949.[5] EDVAC was designed to perform a certain number of instructions (or operations) of various types. Significantly, the programs written for EDVAC were to be stored in high-speed computer memory rather than specified by the physical wiring of the computer. This overcame a severe limitation of ENIAC, which was the considerable time and effort required to reconfigure the computer to perform a new task. With von Neumann's design, the program, or software, that EDVAC ran could be changed simply by changing the contents of the memory. EDVAC, however, was not the first stored-program computer; the Manchester Small-Scale Experimental Machine, a small prototype stored-program computer, ran its first program on 21 June 1948[6] and the Manchester Mark 1 ran its first program during the night of 16–17 June 1949.


EDVAC, one of the first stored-program computers

Early CPUs were custom-designed as a part of a larger, sometimes one-of-a-kind, computer. However, this method of designing custom CPUs for a particular application has largely given way to the development of mass-produced processors that are made for many purposes. This standardization began in the era of discrete transistor mainframes and minicomputers and has rapidly accelerated with the popularization of the integrated circuit (IC). The IC has allowed increasingly complex CPUs to be designed and manufactured to tolerances on the order of nanometers. Both the miniaturization and standardization of CPUs have increased the presence of digital devices in modern life far beyond the limited application of dedicated computing machines. Modern microprocessors appear in everything from automobiles to cell phones and children's toys.

While von Neumann is most often credited with the design of the stored-program computer because of his design of EDVAC, others before him, such as Konrad Zuse, had suggested and implemented similar ideas. The so-called Harvard architecture of the Harvard Mark I, which was completed before EDVAC, also utilized a stored-program design using punched paper tape rather than electronic memory. The key difference between the von Neumann and Harvard architectures is that the latter separates the storage and treatment of CPU instructions and data, while the former uses the same memory space for both. Most modern CPUs are primarily von Neumann in design, but CPUs with the Harvard architecture are seen as well, especially in embedded applications; for instance, the Atmel AVR microcontrollers are Harvard architecture processors.

Relays and vacuum tubes (thermionic valves) were commonly used as switching elements; a useful computer requires thousands or tens of thousands of switching devices. The overall speed of a system is dependent on the speed of the switches. Tube computers like EDVAC tended to average eight hours between failures, whereas relay computers like the (slower, but earlier) Harvard Mark I failed very rarely.[1] In the end, tube-based CPUs became dominant because the significant speed advantages afforded generally outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock rates compared to modern microelectronic designs (see below for a discussion of clock rate). Clock signal frequencies ranging from 100 kHz to 4 MHz were very common at this time, limited largely by the speed of the switching devices they were built with.

8.1.1 Transistor and integrated circuit CPUs

CPU, core memory, and external bus interface of a DEC PDP-8/I. Made of medium-scale integrated circuits.

The design complexity of CPUs increased as various technologies facilitated building smaller and more reliable electronic devices. The first such improvement came with the advent of the transistor. Transistorized CPUs during the 1950s and 1960s no longer had to be built out of bulky, unreliable, and fragile switching elements like vacuum tubes and electrical relays. With this improvement more complex and reliable CPUs were built onto one or several printed circuit boards containing discrete (individual) components.

During this period, a method of manufacturing many interconnected transistors in a compact space was developed. The integrated circuit (IC) allowed a large number of transistors to be manufactured on a single semiconductor-based die, or “chip”. At first only very basic non-specialized digital circuits such as NOR gates were miniaturized into ICs. CPUs based upon these “building block” ICs are generally referred to as “small-scale integration” (SSI) devices. SSI ICs, such as the ones used in the Apollo guidance computer, usually contained up to a few score transistors. To build an entire CPU out of SSI ICs required thousands of individual chips, but still consumed much less space and power than earlier discrete transistor designs. As microelectronic technology advanced, an increasing number of transistors were placed on ICs, thus decreasing the quantity of individual ICs needed for a complete CPU. MSI and LSI (medium- and large-scale integration) ICs increased transistor counts to hundreds, and then thousands.

In 1964, IBM introduced its System/360 computer architecture that was used in a series of computers capable of running the same programs with different speed and performance. This was significant at a time when most electronic computers were incompatible with one another, even those made by the same manufacturer. To facilitate this improvement, IBM utilized the concept of a microprogram (often called “microcode”), which still sees widespread usage in modern CPUs.[7] The System/360 architecture was so popular that it dominated the mainframe computer market for decades and left a legacy that is still continued by similar modern computers like the IBM zSeries. In the same year (1964), Digital Equipment Corporation (DEC) introduced another influential computer aimed at the scientific and research markets, the PDP-8. DEC would later introduce the extremely popular PDP-11 line that originally was built with SSI ICs but was eventually implemented with LSI components once these became practical. In stark contrast with its SSI and MSI predecessors, the first LSI implementation of the PDP-11 contained a CPU composed of only four LSI integrated circuits.[8]

Transistor-based computers had several distinct advantages over their predecessors. Aside from facilitating increased reliability and lower power consumption, transistors also allowed CPUs to operate at much higher speeds because of the short switching time of a transistor in comparison to a tube or relay. Thanks to both the increased reliability as well as the dramatically increased speed of the switching elements (which were almost exclusively transistors by this time), CPU clock rates in the tens of megahertz were obtained during this period. Additionally, while discrete transistor and IC CPUs were in heavy usage, new high-performance designs like SIMD (Single Instruction Multiple Data) vector processors began to appear. These early experimental designs later gave rise to the era of specialized supercomputers like those made by Cray Inc.

8.1.2 Microprocessors

Main article: Microprocessor

Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm) in its packaging

Intel Core i5 CPU on a Vaio E series laptop motherboard (on the right, beneath the heat pipe)

In the 1970s the fundamental inventions by Federico Faggin (Silicon Gate MOS ICs with self-aligned gates along with his new random logic design methodology) changed the design and implementation of CPUs forever. Since the introduction of the first commercially available microprocessor (the Intel 4004) in 1970, and the first widely used microprocessor (the Intel 8080) in 1974, this class of CPUs has almost completely overtaken all other central processing unit implementation methods. Mainframe and minicomputer manufacturers of the time launched proprietary IC development programs to upgrade their older computer architectures, and eventually produced instruction set compatible microprocessors that were backward-compatible with their older hardware and software. Combined with the advent and eventual success of the ubiquitous personal computer, the term CPU is now applied almost exclusively[lower-alpha 1] to microprocessors. Several CPUs (denoted 'cores') can be combined in a single processing chip.

Previous generations of CPUs were implemented as discrete components and numerous small integrated circuits (ICs) on one or more circuit boards. Microprocessors, on the other hand, are CPUs manufactured on a very small number of ICs; usually just one.

The overall smaller CPU size, as a result of being implemented on a single die, means faster switching time because of physical factors like decreased gate parasitic capacitance. This has allowed synchronous microprocessors to have clock rates ranging from tens of megahertz to several gigahertz. Additionally, as the ability to construct exceedingly small transistors on an IC has increased, the complexity and number of transistors in a single CPU has increased many fold. This widely observed trend is described by Moore's law, which has proven to be a fairly accurate predictor of the growth of CPU (and other IC) complexity.[9]

While the complexity, size, construction, and general form of CPUs have changed enormously since 1950, it is notable that the basic design and function has not changed much at all. Almost all common CPUs today can be very accurately described as von Neumann stored-program machines.[lower-alpha 2] As the aforementioned Moore's law continues to hold true,[9] concerns have arisen about the limits of integrated circuit transistor technology. Extreme miniaturization of electronic gates is causing the effects of phenomena like electromigration and subthreshold leakage to become much more significant. These newer concerns are among the many factors causing researchers to investigate new methods of computing such as the quantum computer, as well as to expand the usage of parallelism and other methods that extend the usefulness of the classical von Neumann model.

8.2 Operation

The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions called a program. The instructions are kept in some kind of computer memory. There are three steps that nearly all CPUs use in their operation: fetch, decode, and execute.

After the execution of an instruction, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter. If a jump instruction was executed, the program counter will be modified to contain the address of the instruction that was jumped to and program execution continues normally. In more complex CPUs, multiple instructions can be fetched, decoded, and executed simultaneously. This section describes what is generally referred to as the "classic RISC pipeline", which is quite common among the simple CPUs used in many electronic devices (often called microcontrollers). It largely ignores the important role of CPU cache, and therefore the access stage of the pipeline.

Some instructions manipulate the program counter rather than producing result data directly; such instructions are generally called “jumps” and facilitate program behavior like loops, conditional program execution (through the use of a conditional jump), and existence of functions.[lower-alpha 3] In some processors, some other instructions change the state of bits in a “flags” register. These flags can be used to influence how a program behaves, since they often indicate the outcome of various operations. For example, in such processors a “compare” instruction evaluates two values and sets or clears bits in the flags register to indicate which one is greater or whether they are equal; one of these flags could then be used by a later jump instruction to determine program flow.

8.2.1 Fetch

The first step, fetch, involves retrieving an instruction (which is represented by a number or sequence of numbers) from program memory. The instruction's location (address) in program memory is determined by a program counter (PC), which stores a number that identifies the address of the next instruction to be fetched. After an instruction is fetched, the PC is incremented by the length of the instruction so that it will contain the address of the next instruction in the sequence.[lower-alpha 4] Often, the instruction to be fetched must be retrieved from relatively slow memory, causing the CPU to stall while waiting for the instruction to be returned. This issue is largely addressed in modern processors by caches and pipeline architectures (see below).

8.2.2 Decode

The instruction that the CPU fetches from memory determines what the CPU has to do. In the decode step, the instruction is broken up into parts that have significance to other portions of the CPU. The way in which the numerical instruction value is interpreted is defined by the CPU's instruction set architecture (ISA).[lower-alpha 5] Often, one group of numbers in the instruction, called the opcode, indicates which operation to perform. The remaining parts of the number usually provide information required for that instruction, such as operands for an addition operation. Such operands may be given as a constant value (called an immediate value), or as a place to locate a value: a register or a memory address, as determined by some addressing mode.

In some CPU designs the instruction decoder is implemented as a hardwired, unchangeable circuit. In others, a microprogram is used to translate instructions into sets of CPU configuration signals that are applied sequentially over multiple clock pulses. In some cases the memory that stores the microprogram is rewritable, making it possible to change the way in which the CPU decodes instructions.
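As a concrete illustration of the decode step, the C fragment below unpacks a made-up 16-bit instruction word whose top four bits are the opcode and whose remaining fields name registers. Real instruction sets define their own field layouts, so the widths and positions here are assumptions for the example only.

    /* Illustrative decode of a hypothetical 16-bit instruction word:
     * [15..12] opcode, [11..8] dest, [7..4] src1, [3..0] src2. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t instr = 0x1234;               /* hypothetical encoding */
        unsigned opcode = (instr >> 12) & 0xF;
        unsigned dest   = (instr >> 8)  & 0xF;
        unsigned src1   = (instr >> 4)  & 0xF;
        unsigned src2   =  instr        & 0xF;
        printf("opcode=%u dest=r%u src1=r%u src2=r%u\n",
               opcode, dest, src1, src2);   /* opcode=1 dest=r2 src1=r3 src2=r4 */
        return 0;
    }

A hardwired decoder does essentially this with combinational logic; a microprogrammed decoder instead uses the opcode to select a sequence of control signals.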

8.2.3 Execute

After the fetch and decode steps, the execute step is performed. Depending on the CPU architecture, this may consist of a single action or a sequence of actions. During each action, various parts of the CPU are electrically connected so they can perform all or part of the desired operation and then the action is completed, typically in response to a clock pulse. Very often the results are written to an internal CPU register for quick access by subsequent instructions. In other cases results may be written to slower, but less expensive and higher capacity main memory.

For example, if an addition instruction is to be executed, the arithmetic logic unit (ALU) inputs are connected to a pair of operand sources (numbers to be summed), the ALU is configured to perform an addition operation so that the sum of its operand inputs will appear at its output, and the ALU output is connected to storage (e.g., a register or memory) that will receive the sum. When the clock pulse occurs, the sum will be transferred to storage and, if the resulting sum is too large (i.e., it is larger than the ALU's output word size), an arithmetic overflow flag will be set.
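The whole fetch, decode and execute cycle can be modelled compactly in C. The machine below is hypothetical (a tiny accumulator design with a two-byte instruction format invented for this example); it is meant only to show how the program counter, the decoder, and a simple ALU-style switch interact.

    /* Illustrative fetch-decode-execute loop for a made-up accumulator
     * machine.  Each instruction word: high byte = opcode, low byte =
     * immediate operand.  The program computes 2 + 3 and halts. */
    #include <stdint.h>
    #include <stdio.h>

    enum { HALT = 0, LOADI = 1, ADDI = 2 };

    int main(void) {
        uint16_t memory[] = { (LOADI << 8) | 2,   /* acc = 2       */
                              (ADDI  << 8) | 3,   /* acc = acc + 3 */
                              (HALT  << 8) };
        uint16_t pc = 0;   /* program counter */
        int acc = 0;       /* accumulator register */

        for (;;) {
            uint16_t instr   = memory[pc++];   /* fetch, then advance PC */
            uint8_t  opcode  = instr >> 8;     /* decode */
            uint8_t  operand = instr & 0xFF;
            switch (opcode) {                  /* execute */
            case LOADI: acc = operand;  break;
            case ADDI:  acc += operand; break;
            case HALT:  printf("acc = %d\n", acc); return 0;
            }
        }
    }

The program prints acc = 5. A real CPU performs the same three steps in hardware, with the switch statement's role played by the control unit and ALU.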

The control unit of the CPU contains circuitry that uses


8.3 Design and implementation electrical signals to direct the entire computer system to
carry out stored program instructions. The control unit
Main article: CPU design does not execute program instructions; rather, it directs
Hardwired into a CPU’s circuitry is a set of basic oper- other parts of the system to do so. The control unit com-
municates with both the ALU and memory.

8.3.2 Arithmetic logic unit

Main article: Arithmetic logic unit


The arithmetic logic unit (ALU) is a digital circuit within

Block diagram of a basic uniprocessor-CPU computer. Black


lines indicate data flow, whereas red lines indicate control flow;
arrows indicate flow directions.
Symbolic representation of an ALU and its input and output sig-
ations it can perform, called an instruction set. Such op- nals
erations may involve, for example, adding or subtracting
two numbers, comparing two numbers, or jumping to a the processor that performs integer arithmetic and bitwise
different part of a program. Each basic operation is rep- logic operations. The inputs to the ALU are the data
resented by a particular combination of bits, known as words to be operated on (called operands), status infor-
the machine language opcode; while executing instruc- mation from previous operations, and a code from the
tions in a machine language program, the CPU decides control unit indicating which operation to perform. De-
which operation to perform by “decoding” the opcode. pending on the instruction being executed, the operands

Depending on the instruction being executed, the operands may come from internal CPU registers or external memory, or they may be constants generated by the ALU itself.

When all input signals have settled and propagated through the ALU circuitry, the result of the performed operation appears at the ALU's outputs. The result consists of both a data word, which may be stored in a register or memory, and status information that is typically stored in a special, internal CPU register reserved for this purpose.

8.3.3 Integer range

Every CPU represents numerical values in a specific way. For example, some early digital computers represented numbers as familiar decimal (base 10) numeral system values, and others have employed more unusual representations such as ternary (base three). Nearly all modern CPUs represent numbers in binary form, with each digit being represented by some two-valued physical quantity such as a “high” or “low” voltage.[lower-alpha 6]

A six-bit word containing the binary encoded representation of decimal value 40. Most modern CPUs employ word sizes that are a power of two, for example eight, 16, 32 or 64 bits.

Related to numeric representation is the size and precision of integer numbers that a CPU can represent. In the case of a binary CPU, this is measured by the number of bits (significant digits of a binary encoded integer) that the CPU can process in one operation, which is commonly called “word size”, “bit width”, “data path width”, “integer precision”, or “integer size”. A CPU's integer size determines the range of integer values it can directly operate on.[lower-alpha 7] For example, an 8-bit CPU can directly manipulate integers represented by eight bits, which have a range of 256 (2^8) discrete integer values.

Integer range can also affect the number of memory locations the CPU can directly address (an address is an integer value representing a specific memory location). For example, if a binary CPU uses 32 bits to represent a memory address then it can directly address 2^32 memory locations. To circumvent this limitation and for various other reasons, some CPUs use mechanisms (such as bank switching) that allow additional memory to be addressed.

CPUs with larger word sizes require more circuitry and consequently are physically larger, cost more, and consume more power (and therefore generate more heat). As a result, smaller 4- or 8-bit microcontrollers are commonly used in modern applications even though CPUs with much larger word sizes (such as 16, 32, 64, even 128-bit) are available. When higher performance is required, however, the benefits of a larger word size (larger data ranges and address spaces) may outweigh the disadvantages.

To gain some of the advantages afforded by both lower and higher bit lengths, many CPUs are designed with different bit widths for different portions of the device. For example, the IBM System/370 used a CPU that was primarily 32 bit, but it used 128-bit precision inside its floating point units to facilitate greater accuracy and range in floating point numbers.[7] Many later CPU designs use similar mixed bit width, especially when the processor is meant for general-purpose usage where a reasonable balance of integer and floating point capability is required.

8.3.4 Clock rate

Main article: Clock rate

Most CPUs are synchronous circuits, which means they employ a clock signal to pace their sequential operations. The clock signal is produced by an external oscillator circuit that generates a consistent number of pulses each second in the form of a periodic square wave. The frequency of the clock pulses determines the rate at which a CPU executes instructions and, consequently, the faster the clock, the more instructions the CPU will execute each second.

To ensure proper operation of the CPU, the clock period is longer than the maximum time needed for all signals to propagate (move) through the CPU. In setting the clock period to a value well above the worst-case propagation delay, it is possible to design the entire CPU and the way it moves data around the “edges” of the rising and falling clock signal. This has the advantage of simplifying the CPU significantly, both from a design perspective and a component-count perspective. However, it also carries the disadvantage that the entire CPU must wait on its slowest elements, even though some portions of it are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism (see below).

However, architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs. For example, a clock signal is subject to the delays of any other electrical signal. Higher clock rates in increasingly complex CPUs make it more difficult to keep the clock signal in phase (synchronized) throughout the entire unit. This has led many modern CPUs to require multiple identical clock signals to be provided to avoid delaying a single signal significantly enough to cause the CPU to malfunction. Another major issue, as clock rates increase dramatically, is the amount of heat that is dissipated by the CPU. The constantly changing clock causes many components to switch regardless of whether they are being used at that time. In general, a component that is switching uses more energy than an element in a static state.

Therefore, as clock rate increases, so does energy consumption, causing the CPU to require more heat dissipation in the form of CPU cooling solutions.

One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the clock signal to unneeded components (effectively disabling them). However, this is often regarded as difficult to implement and therefore does not see common usage outside of very low-power designs. One notable recent CPU design that uses extensive clock gating is the IBM PowerPC-based Xenon used in the Xbox 360; that way, power requirements of the Xbox 360 are greatly reduced.[11] Another method of addressing some of the problems with a global clock signal is the removal of the clock signal altogether. While removing the global clock signal makes the design process considerably more complex in many ways, asynchronous (or clockless) designs carry marked advantages in power consumption and heat dissipation in comparison with similar synchronous designs. While somewhat uncommon, entire asynchronous CPUs have been built without utilizing a global clock signal. Two notable examples of this are the ARM compliant AMULET and the MIPS R3000 compatible MiniMIPS.

Rather than totally removing the clock signal, some CPU designs allow certain portions of the device to be asynchronous, such as using asynchronous ALUs in conjunction with superscalar pipelining to achieve some arithmetic performance gains. While it is not altogether clear whether totally asynchronous designs can perform at a comparable or better level than their synchronous counterparts, it is evident that they do at least excel in simpler math operations. This, combined with their excellent power consumption and heat dissipation properties, makes them very suitable for embedded computers.[12]

8.3.5 Parallelism

Main article: Parallel computing

Model of a subscalar CPU, in which it takes fifteen cycles to complete three instructions.

The description of the basic operation of a CPU offered in the previous section describes the simplest form that a CPU can take. This type of CPU, usually referred to as subscalar, operates on and executes one instruction on one or two pieces of data at a time.

This process gives rise to an inherent inefficiency in subscalar CPUs. Since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result, the subscalar CPU gets “hung up” on instructions which take more than one clock cycle to complete execution. Even adding a second execution unit (see below) does not improve performance much; rather than one pathway being hung up, now two pathways are hung up and the number of unused transistors is increased. This design, wherein the CPU's execution resources can operate on only one instruction at a time, can only possibly reach scalar performance (one instruction per clock). However, the performance is nearly always subscalar (less than one instruction per cycle).

Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can execute simultaneously. Each methodology differs both in the ways in which they are implemented, as well as the relative effectiveness they afford in increasing the CPU's performance for an application.[lower-alpha 8]

Instruction-level parallelism

Main articles: Instruction pipelining and Superscalar

Basic five-stage pipeline. In the best case scenario, this pipeline can sustain a completion rate of one instruction per cycle.

One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing. This is the simplest form of a technique known as instruction pipelining, and is utilized in almost all modern general-purpose CPUs. Pipelining allows more than one instruction to be executed at any given time by breaking down the execution pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired.

Pipelining does, however, introduce the possibility for a situation where the result of the previous operation is needed to complete the next operation; a condition often termed data dependency conflict. To cope with this, additional care must be taken to check for these sorts of conditions and delay a portion of the instruction pipeline if this occurs.

Naturally, accomplishing this requires additional circuitry, so pipelined processors are more complex than subscalar ones (though not very significantly so). A pipelined processor can become very nearly scalar, inhibited only by pipeline stalls (an instruction spending more than one clock cycle in a stage).
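Under the idealised assumptions of the basic five-stage pipeline described above (one instruction enters per cycle and nothing stalls), the benefit of pipelining can be counted directly; the short C program below compares the cycle counts for a hypothetical run of 100 instructions, figures chosen only for illustration.

    /* Illustrative ideal cycle counts for a five-stage pipeline. */
    #include <stdio.h>

    int main(void) {
        const int stages = 5;
        const int n = 100;                     /* instructions in the run */
        int unpipelined = stages * n;          /* 500 cycles */
        int pipelined   = n + (stages - 1);    /* 104 cycles once filled */
        printf("unpipelined: %d cycles, pipelined: %d cycles\n",
               unpipelined, pipelined);
        return 0;
    }

Real programs fall short of this ideal because of the stalls and dependency conflicts just discussed, which is why the completion rate is described as "very nearly" scalar.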
A simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.

Further improvement upon the idea of instruction pipelining led to the development of a method that decreases the idle time of CPU components even further. Designs that are said to be superscalar include a long instruction pipeline and multiple identical execution units.[13] In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously). If so they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given cycle.

Most of the difficulty in the design of a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible. This requires that the instruction pipeline is filled as often as possible and gives rise to the need in superscalar architectures for significant amounts of CPU cache. It also makes hazard-avoiding techniques like branch prediction, speculative execution, and out-of-order execution crucial to maintaining high levels of performance. By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times that the entire pipeline must wait until a conditional instruction is completed. Speculative execution often provides modest performance increases by executing portions of code that may not be needed after a conditional operation completes. Out-of-order execution somewhat rearranges the order in which instructions are executed to reduce delays due to data dependencies. Also, in the case of Single Instruction Multiple Data (a case when a lot of data of the same type has to be processed), modern processors can disable parts of the pipeline so that when a single instruction is executed many times, the CPU skips the fetch and decode phases and thus greatly increases performance on certain occasions, especially in highly monotonous program engines such as video creation software and photo processing.

In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The Intel P5 Pentium had two superscalar ALUs which could accept one instruction per clock each, but its FPU could not accept one instruction per clock. Thus the P5 was integer superscalar but not floating point superscalar. Intel's successor to the P5 architecture, P6, added superscalar capabilities to its floating point features, and therefore afforded a significant increase in floating point instruction performance.

Both simple pipelining and superscalar design increase a CPU's ILP by allowing a single processor to complete execution of instructions at rates surpassing one instruction per cycle (IPC).[lower-alpha 9] Most modern CPU designs are at least somewhat superscalar, and nearly all general purpose CPUs designed in the last decade are superscalar. In later years some of the emphasis in designing high-ILP computers has been moved out of the CPU's hardware and into its software interface, or ISA. The strategy of the very long instruction word (VLIW) causes some ILP to become implied directly by the software, reducing the amount of work the CPU must perform to boost ILP and thereby reducing the design's complexity.

Thread-level parallelism

Another strategy of achieving performance is to execute multiple programs or threads in parallel. This area of research is known as parallel computing. In Flynn's taxonomy, this strategy is known as Multiple Instructions-Multiple Data or MIMD.

One technology used for this purpose was multiprocessing (MP). The initial flavor of this technology is known as symmetric multiprocessing (SMP), where a small number of CPUs share a coherent view of their memory system. In this scheme, each CPU has additional hardware to maintain a constantly up-to-date view of memory. By avoiding stale views of memory, the CPUs can cooperate on the same program and programs can migrate from one CPU to another. To increase the number of cooperating CPUs beyond a handful, schemes such as non-uniform memory access (NUMA) and directory-based coherence protocols were introduced in the 1990s. SMP systems are limited to a small number of CPUs while NUMA systems have been built with thousands of processors. Initially, multiprocessing was built using multiple discrete CPUs and boards to implement the interconnect between the processors.

When the processors and their interconnect are all implemented on a single silicon chip, the technology is known as a multi-core processor.

It was later recognized that finer-grain parallelism existed within a single program. A single program might have several threads (or functions) that could be executed separately or in parallel. Some of the earliest examples of this technology implemented input/output processing such as direct memory access as a separate thread from the computation thread. A more general approach to this technology was introduced in the 1970s when systems were designed to run multiple computation threads in parallel. This technology is known as multi-threading (MT). This approach is considered more cost-effective than multiprocessing, as only a small number of components within a CPU is replicated to support MT as opposed to the entire CPU in the case of MP. In MT, the execution units and the memory system including the caches are shared among multiple threads. The downside of MT is that the hardware support for multithreading is more visible to software than that of MP and thus supervisor software like operating systems have to undergo larger changes to support MT. One type of MT that was implemented is known as block multithreading, where one thread is executed until it is stalled waiting for data to return from external memory. In this scheme, the CPU would then quickly switch to another thread which is ready to run, the switch often done in one CPU clock cycle, such as the UltraSPARC Technology. Another type of MT is known as simultaneous multithreading, where instructions of multiple threads are executed in parallel within one CPU clock cycle.
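Thread-level parallelism is what ordinary threading libraries expose to software. The sketch below uses POSIX threads to split a simple summation across two threads; the array contents and the two-way split are arbitrary choices for the example, and the program should be compiled with -pthread.

    /* Illustrative thread-level parallelism with POSIX threads:
     * two threads sum separate halves of an array. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000
    static int data[N];

    struct slice { int start, end; long sum; };

    static void *sum_slice(void *arg) {
        struct slice *s = arg;
        s->sum = 0;
        for (int i = s->start; i < s->end; i++) s->sum += data[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = 1;

        struct slice a = { 0, N / 2, 0 }, b = { N / 2, N, 0 };
        pthread_t ta, tb;
        pthread_create(&ta, NULL, sum_slice, &a);
        pthread_create(&tb, NULL, sum_slice, &b);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        printf("total = %ld\n", a.sum + b.sum);  /* prints 1000 */
        return 0;
    }

On a multithreaded or multi-core CPU the two worker threads can make progress at the same time; on a single hardware thread they are simply interleaved, which is the software-visible difference the surrounding text describes.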
For several decades from the 1970s to early 2000s, the focus in designing high performance general purpose CPUs was largely on achieving high ILP through technologies such as pipelining, caches, superscalar execution, out-of-order execution, etc. This trend culminated in large, power-hungry CPUs such as the Intel Pentium 4. By the early 2000s, CPU designers were thwarted from achieving higher performance from ILP techniques due to the growing disparity between CPU operating frequencies and main memory operating frequencies as well as escalating CPU power dissipation owing to more esoteric ILP techniques.

CPU designers then borrowed ideas from commercial computing markets such as transaction processing, where the aggregate performance of multiple programs, also known as throughput computing, was more important than the performance of a single thread or program.

This reversal of emphasis is evidenced by the proliferation of dual and multiple core CMP (chip-level multiprocessing) designs and notably, Intel's newer designs resembling its less superscalar P6 architecture. Late designs in several processor families exhibit CMP, including the x86-64 Opteron and Athlon 64 X2, the SPARC UltraSPARC T1, IBM POWER4 and POWER5, as well as several video game console CPUs like the Xbox 360's triple-core PowerPC design, and the PS3's 7-core Cell microprocessor.

Data parallelism

Main articles: Vector processor and SIMD

A less common but increasingly important paradigm of CPUs (and indeed, computing in general) deals with data parallelism. The processors discussed earlier are all referred to as some type of scalar device.[lower-alpha 10] As the name implies, vector processors deal with multiple pieces of data in the context of one instruction. This contrasts with scalar processors, which deal with one piece of data for every instruction. Using Flynn's taxonomy, these two schemes of dealing with data are generally referred to as SIMD (single instruction, multiple data) and SISD (single instruction, single data), respectively. The great utility in creating CPUs that deal with vectors of data lies in optimizing tasks that tend to require the same operation (for example, a sum or a dot product) to be performed on a large set of data. Some classic examples of these types of tasks are multimedia applications (images, video, and sound), as well as many types of scientific and engineering tasks. Whereas a scalar CPU must complete the entire process of fetching, decoding, and executing each instruction and value in a set of data, a vector CPU can perform a single operation on a comparatively large set of data with one instruction. Of course, this is only possible when the application tends to require many steps which apply one operation to a large set of data.

Most early vector CPUs, such as the Cray-1, were associated almost exclusively with scientific research and cryptography applications. However, as multimedia has largely shifted to digital media, the need for some form of SIMD in general-purpose CPUs has become significant. Shortly after inclusion of floating point execution units started to become commonplace in general-purpose processors, specifications for and implementations of SIMD execution units also began to appear for general-purpose CPUs. Some of these early SIMD specifications, like HP's Multimedia Acceleration eXtensions (MAX) and Intel's MMX, were integer-only. This proved to be a significant impediment for some software developers, since many of the applications that benefit from SIMD primarily deal with floating point numbers. Progressively, these early designs were refined and remade into some of the common, modern SIMD specifications, which are usually associated with one ISA. Some notable modern examples are Intel's SSE and the PowerPC-related AltiVec (also known as VMX).[lower-alpha 11]

8.4 Performance

Further information: Computer performance and Benchmark (computing)

The performance or speed of a processor depends on, among many other factors, the clock rate (generally given in multiples of hertz) and the instructions per clock (IPC), which together are the factors for the instructions per second (IPS) that the CPU can perform.[14] Many reported IPS values have represented “peak” execution rates on artificial instruction sequences with few branches, whereas realistic workloads consist of a mix of instructions and applications, some of which take longer to execute than others. The performance of the memory hierarchy also greatly affects processor performance, an issue barely considered in MIPS calculations. Because of these problems, various standardized tests, often called “benchmarks” for this purpose, such as SPECint, have been developed to attempt to measure the real effective performance in commonly used applications.
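The relationship described above can be made concrete with made-up figures: a hypothetical CPU sustaining 2 instructions per clock at 3 GHz would execute about 6 billion instructions per second, as the short C program below computes. The numbers are purely illustrative, not measurements of any real processor.

    /* Illustrative calculation: instructions per second = IPC x clock rate. */
    #include <stdio.h>

    int main(void) {
        double ipc      = 2.0;         /* sustained instructions per clock */
        double clock_hz = 3.0e9;       /* 3 GHz */
        double ips      = ipc * clock_hz;
        printf("%.1e instructions per second\n", ips);  /* 6.0e+09 */
        return 0;
    }

Because sustained IPC on real workloads is usually far below the peak, such headline figures overstate delivered performance, which is exactly why benchmarks like SPECint are preferred.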
Processing performance of computers is increased by using multi-core processors, which essentially is plugging two or more individual processors (called cores in this sense) into one integrated circuit.[15] Ideally, a dual core processor would be nearly twice as powerful as a single core processor. In practice, the performance gain is far smaller, only about 50%, due to imperfect software algorithms and implementation.[16] Increasing the number of cores in a processor (i.e. dual-core, quad-core, etc.) increases the workload that can be handled. This means that the processor can now handle numerous asynchronous events, interrupts, etc. which can take a toll on the CPU when overwhelmed. These cores can be thought of as different floors in a processing plant, with each floor handling a different task. Sometimes, these cores will handle the same tasks as cores adjacent to them if a single core is not enough to handle the information.

Due to specific capabilities of modern CPUs, such as hyper-threading and uncore, which involve sharing of actual CPU resources while aiming at increased utilization, monitoring performance levels and hardware utilization gradually became a more complex task. As a response, some CPUs implement additional hardware logic that monitors actual utilization of various parts of a CPU and provides various counters accessible to software; an example is Intel's Performance Counter Monitor technology.[3]

8.5 See also

• Accelerated processing unit
• Addressing mode
• CISC
• Computer bus
• Computer engineering
• CPU core voltage
• CPU socket
• Digital signal processor
• Hyper-threading
• List of CPU architectures
• Microprocessor
• Multi-core processor
• Protection ring
• RISC
• Stream processing
• True Performance Index
• Wait state

8.6 Notes

[1] Integrated circuits are now used to implement all CPUs, except for a few machines designed to withstand large electromagnetic pulses, say from a nuclear weapon.

[2] The so-called “von Neumann” memo expounded the idea of stored programs, which for example may be stored on punched cards, paper tape, or magnetic tape.

[3] Some early computers like the Harvard Mark I did not support any kind of “jump” instruction, effectively limiting the complexity of the programs they could run. It is largely for this reason that these computers are often not considered to contain a proper CPU, despite their close similarity to stored-program computers.

[4] Since the program counter counts memory addresses and not instructions, it is incremented by the number of memory units that the instruction word contains. In the case of simple fixed-length instruction word ISAs, this is always the same number. For example, a fixed-length 32-bit instruction word ISA that uses 8-bit memory words would always increment the PC by four (except in the case of jumps). ISAs that use variable-length instruction words increment the PC by the number of memory words corresponding to the last instruction's length.

[5] Because the instruction set architecture of a CPU is fundamental to its interface and usage, it is often used as a classification of the “type” of CPU. For example, a “PowerPC CPU” uses some variant of the PowerPC ISA. A system can execute a different ISA by running an emulator.

[6] The physical concept of voltage is an analog one by nature, practically having an infinite range of possible values. For the purpose of physical representation of binary numbers, two specific ranges of voltages are defined, one for logic '0' and another for logic '1'. These ranges are dictated by design considerations such as noise margins and characteristics of the devices used to create the CPU.

[7] While a CPU's integer size sets a limit on integer ranges, this can (and often is) overcome using a combination of software and hardware techniques. By using additional memory, software can represent integers many magnitudes larger than the CPU can. Sometimes the CPU's ISA will even facilitate operations on integers larger than it can natively represent by providing instructions to make large integer arithmetic relatively quick. This method of dealing with large integers is slower than utilizing a CPU with higher integer size, but is a reasonable trade-off in cases where natively supporting the full integer range needed would be cost-prohibitive. See Arbitrary-precision arithmetic for more details on purely software-supported arbitrary-sized integers.

[8] Neither ILP nor TLP is inherently superior over the other; they are simply different means by which to increase CPU parallelism. As such, they both have advantages and disadvantages, which are often determined by the type of software that the processor is intended to run. High-TLP CPUs are often used in applications that lend themselves well to being split up into numerous smaller applications, so-called “embarrassingly parallel problems”. Frequently, a computational problem that can be solved quickly with high TLP design strategies like SMP takes significantly more time on high ILP devices like superscalar CPUs, and vice versa.

[9] Best-case scenario (or peak) IPC rates in very superscalar architectures are difficult to maintain since it is impossible to keep the instruction pipeline filled all the time. Therefore, in highly superscalar CPUs, average sustained IPC is often discussed rather than peak IPC.

[10] Earlier the term scalar was used to compare the IPC (instructions per cycle) count afforded by various ILP methods. Here the term is used in the strictly mathematical sense to contrast with vectors. See scalar (mathematics) and Vector (geometric).

[11] Although SSE/SSE2/SSE3 have superseded MMX in Intel's general purpose CPUs, later IA-32 designs still support MMX. This is usually accomplished by providing most of the MMX functionality with the same hardware that supports the much more expansive SSE instruction sets.

8.7 References

[1] Weik, Martin H. (1961). “A Third Survey of Domestic Electronic Digital Computing Systems”. Ballistic Research Laboratory.

[2] Kuck, David (1978). Computers and Computations, Vol 1. John Wiley & Sons, Inc. p. 12. ISBN 0471027162.

[3] Thomas Willhalm; Roman Dementiev; Patrick Fay (December 18, 2014). “Intel Performance Counter Monitor – A better way to measure CPU utilization”. software.intel.com. Retrieved February 17, 2015.

[4] Regan, Gerard. A Brief History of Computing. p. 66. ISBN 1848000839. Retrieved 26 November 2014.

[5] “First Draft of a Report on the EDVAC” (PDF). Moore School of Electrical Engineering, University of Pennsylvania. 1945.

[6] Enticknap, Nicholas (Summer 1998), “Computing's Golden Jubilee”, Resurrection (The Computer Conservation Society) (20), ISSN 0958-7403, retrieved 19 April 2008.

[7] Amdahl, G. M., Blaauw, G. A., & Brooks, F. P. Jr. (1964). “Architecture of the IBM System/360” (PDF). IBM Research.

[8] “LSI-11 Module Descriptions”. LSI-11, PDP-11/03 user's manual (PDF) (2nd ed.). Maynard, Massachusetts: Digital Equipment Corporation. November 1975. pp. 4–3.

[9] “Excerpts from A Conversation with Gordon Moore: Moore's Law” (PDF). Intel. 2005. Retrieved 2012-07-25.

[10] Ian Wienand (September 3, 2013). “Computer Science from the Bottom Up, Chapter 3. Computer Architecture” (PDF). bottomupcs.com. Retrieved January 7, 2015.

[11] Brown, Jeffery (2005). “Application-customized CPU design”. IBM developerWorks. Retrieved 2005-12-17.

[12] Garside, J. D., Furber, S. B., & Chung, S-H (1999). “AMULET3 Revealed”. University of Manchester Computer Science Department. Archived from the original on December 10, 2005.

[13] Huynh, Jack (2003). “The AMD Athlon XP Processor with 512KB L2 Cache” (PDF). University of Illinois at Urbana-Champaign. pp. 6–11. Retrieved 2007-10-06.

[14] “CPU Frequency”. CPU World Glossary. CPU World. 25 March 2008. Retrieved 1 January 2010.

[15] “What is (a) multi-core processor?”. Data Center Definitions. SearchDataCenter.com. 27 March 2007. Retrieved 1 January 2010.

[16] “Quad Core Vs. Dual Core”. http://www.buzzle.com/. Retrieved 26 November 2014.

8.8 External links

• How Microprocessors Work at HowStuffWorks.
• 25 Microchips that shook the world – an article by the Institute of Electrical and Electronics Engineers.
Chapter 9

Microprocessor

Intel 4004, the first commercial microprocessor

See also: Processor, System on a chip, Microcontroller and Digital signal processor

A microprocessor is a computer processor that incorporates the functions of a computer's central processing unit (CPU) on a single integrated circuit (IC),[1] or at most a few integrated circuits.[2] The microprocessor is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and provides results as output. It is an example of sequential digital logic, as it has internal memory. Microprocessors operate on numbers and symbols represented in the binary numeral system.

The integration of a whole CPU onto a single chip or onto a few chips greatly reduced the cost of processing power. The integrated circuit processor was produced in large numbers by highly automated processes, so unit cost was low. Single-chip processors also increase reliability, as there are many fewer electrical connections to fail. As microprocessor designs get faster, the cost of manufacturing a chip (with smaller components built on a semiconductor chip the same size) generally stays the same.

Before microprocessors, small computers had been implemented using racks of circuit boards with many medium- and small-scale integrated circuits. Microprocessors integrated this into one or a few large-scale ICs. Continued increases in microprocessor capacity have since rendered other forms of computers almost completely obsolete (see history of computing hardware), with one or more microprocessors used in everything from the smallest embedded systems and handheld devices to the largest mainframes and supercomputers.

9.1 Structure

A block diagram of the internal architecture of the Z80 microprocessor, showing the arithmetic and logic section, register file, control logic section, and buffers to external address and data lines

The internal arrangement of a microprocessor varies depending on the age of the design and the intended purposes of the microprocessor. The complexity of an integrated circuit is bounded by physical limitations: the number of transistors that can be put onto one chip, the number of package terminations that can connect the processor to other parts of the system, the number of interconnections it is possible to make on the chip, and the heat that the chip can dissipate. Advancing technology makes more complex and powerful chips feasible to manufacture.

A minimal hypothetical microprocessor might only include an arithmetic logic unit (ALU) and a control logic section. The ALU performs operations such as addition and subtraction, and logical operations such as AND or OR. Each operation of the ALU sets one or more flags in a status register, which indicate the result of the last operation (zero value, negative number, overflow, or others). The control logic section retrieves instruction operation codes from memory and initiates whatever sequence of ALU operations is required to carry out the instruction. A single operation code might affect many individual data paths, registers, and other elements of the processor.
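The division of labour between the ALU, the status register, and the control logic can be illustrated with a small software model of such a minimal processor. In the C sketch below, the opcode values, flag layout, and register set are invented for the example and do not correspond to any real instruction set: the "control logic" loop fetches an opcode and an operand, while the "ALU" computes the result and updates zero, negative, and carry flags.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical status-flag layout (invented for this sketch). */
enum { FLAG_Z = 1 << 0, FLAG_N = 1 << 1, FLAG_C = 1 << 2 };

/* Hypothetical opcodes for a toy 8-bit machine. */
enum { OP_ADD = 0, OP_SUB = 1, OP_AND = 2, OP_OR = 3, OP_HALT = 0xFF };

static uint8_t acc;    /* accumulator register */
static uint8_t flags;  /* status register      */

/* The "ALU": computes a result and updates the status flags. */
static uint8_t alu(uint8_t op, uint8_t a, uint8_t b)
{
    uint16_t wide = 0;
    switch (op) {
    case OP_ADD: wide = (uint16_t)a + b; break;
    case OP_SUB: wide = (uint16_t)a - b; break;
    case OP_AND: wide = a & b;           break;
    case OP_OR:  wide = a | b;           break;
    }
    flags = 0;
    if ((uint8_t)wide == 0) flags |= FLAG_Z;  /* zero result          */
    if (wide & 0x80)        flags |= FLAG_N;  /* sign bit of result   */
    if (wide & 0x100)       flags |= FLAG_C;  /* carry or borrow out  */
    return (uint8_t)wide;
}

int main(void)
{
    /* The "control logic": fetch an opcode and operand, run the ALU. */
    static const uint8_t program[] = { OP_ADD, 200, OP_ADD, 100, OP_HALT };
    for (unsigned pc = 0; program[pc] != OP_HALT; pc += 2)
        acc = alu(program[pc], acc, program[pc + 1]);
    /* 200 + 100 overflows 8 bits, so the carry flag ends up set. */
    printf("acc=%u flags=%02x\n", acc, flags);
    return 0;
}

A carry flag set in this way is also what allows software on a narrow processor to chain several word-sized operations into wider arithmetic, as discussed in the notes on integer size above.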

As integrated circuit technology advanced, it was feasible to manufacture more and more complex processors on a single chip. The size of data objects became larger; allowing more transistors on a chip allowed word sizes to increase from 4- and 8-bit words up to today's 64-bit words. Additional features were added to the processor architecture; more on-chip registers sped up programs, and complex instructions could be used to make more compact programs. Floating-point arithmetic, for example, was often not available on 8-bit microprocessors, but had to be carried out in software. Integration of the floating-point unit, first as a separate integrated circuit and then as part of the same microprocessor chip, sped up floating-point calculations.

Occasionally, physical limitations of integrated circuits made such practices as a bit-slice approach necessary. Instead of processing all of a long word on one integrated circuit, multiple circuits in parallel processed subsets of each data word. While this required extra logic to handle, for example, carry and overflow within each slice, the result was a system that could handle, say, 32-bit words using integrated circuits with a capacity for only four bits each.

With the ability to put large numbers of transistors on one chip, it becomes feasible to integrate memory on the same die as the processor. This CPU cache has the advantage of faster access than off-chip memory and increases the processing speed of the system for many applications. Processor clock frequency has increased more rapidly than external memory speed, except in the recent past, so cache memory is necessary if the processor is not to be delayed by slower external memory.

9.1.1 Special-purpose designs

A microprocessor is a general-purpose system. Several specialized processing devices have followed from the technology. Microcontrollers integrate a microprocessor with peripheral devices in embedded systems. A digital signal processor (DSP) is specialized for signal processing. Graphics processing units may have no, limited, or general programming facilities. For example, GPUs through the 1990s were mostly non-programmable and have only recently gained limited facilities like programmable vertex shaders.

32-bit processors have more digital logic than narrower processors, so 32-bit (and wider) processors produce more digital noise and have higher static consumption than narrower processors.[3] So 8-bit or 16-bit processors can be a better choice than 32-bit processors for systems on a chip and microcontrollers that require extremely low-power electronics, or that sit alongside noise-sensitive analog circuitry on a mixed-signal chip. Nevertheless, trade-offs apply: running 32-bit arithmetic on an 8-bit chip can end up using more power, because the software must execute multiple instructions for each wide operation. Modern microprocessors go into low-power states when possible, and an 8-bit chip running 32-bit software will be active most of the time, so there is a delicate balance between software, hardware, utilization patterns, and cost.

When manufactured on a similar process, 8-bit micros use less power when operating and less power when sleeping than 32-bit micros.[4]

However, some argue that a 32-bit micro may use less average power than an 8-bit micro when the application requires operations, such as floating-point math, that take many more clock cycles on an 8-bit micro than on a 32-bit micro, so the 8-bit micro spends more time in a high-power operating mode.[4][5][6][7]

9.2 Embedded applications

Microprocessors are now found in thousands of items that were traditionally not computer-related. These include large and small household appliances, cars (and their accessory equipment units), car keys, tools and test instruments, toys, light switches/dimmers and electrical circuit breakers, smoke alarms, battery packs, and hi-fi audio/visual components (from DVD players to phonograph turntables). Such products as cellular telephones, DVD video systems and HDTV broadcast systems fundamentally require consumer devices with powerful, low-cost microprocessors. Increasingly stringent pollution control standards effectively require automobile manufacturers to use microprocessor engine management systems to allow optimal control of emissions over the widely varying operating conditions of an automobile. Non-programmable controls would require complex, bulky, or costly implementations to achieve the results possible with a microprocessor.

A microprocessor control program (embedded software) can be easily tailored to different needs of a product line, allowing upgrades in performance with minimal redesign of the product. Different features can be implemented in different models of a product line at negligible production cost.

Microprocessor control of a system can provide control strategies that would be impractical to implement using electromechanical controls or purpose-built electronic controls. For example, an engine control system in an automobile can adjust ignition timing based on engine speed, load on the engine, ambient temperature, and any observed tendency for knocking—allowing an automobile to operate on a range of fuel grades.
9.3 History

The advent of low-cost computers on integrated circuits has transformed modern society. General-purpose microprocessors in personal computers are used for computation, text editing, multimedia display, and communication over the Internet. Many more microprocessors are part of embedded systems, providing digital control over myriad objects from appliances to automobiles to cellular phones and industrial process control.

The first use of the term "microprocessor" is attributed to Viatron Computer Systems, describing the custom integrated circuit used in their System 21 small computer system announced in 1968.

Intel introduced its first 4-bit microprocessor, the 4004, in 1971 and its 8-bit microprocessor, the 8008, in 1972. During the 1960s, computer processors were constructed out of small and medium-scale ICs—each containing from tens of transistors to a few hundred. These were placed and soldered onto printed circuit boards, and often multiple boards were interconnected in a chassis. The large number of discrete logic gates used more electrical power—and therefore produced more heat—than a more integrated design with fewer ICs. The distance that signals had to travel between ICs on the boards also limited a computer's operating speed.

In the NASA Apollo space missions to the moon in the 1960s and 1970s, all onboard computations for primary guidance, navigation and control were provided by a small custom processor called "The Apollo Guidance Computer". It used wire-wrap circuit boards whose only logic elements were three-input NOR gates.[8]

The first microprocessors emerged in the early 1970s and were used for electronic calculators, using binary-coded decimal (BCD) arithmetic on 4-bit words. Other embedded uses of 4-bit and 8-bit microprocessors, such as terminals, printers, and various kinds of automation, followed soon after. Affordable 8-bit microprocessors with 16-bit addressing also led to the first general-purpose microcomputers from the mid-1970s on.

Since the early 1970s, the increase in capacity of microprocessors has followed Moore's law; this originally suggested that the number of components that can be fitted onto a chip doubles every year. With present technology, it is actually every two years,[9] and as such Moore later changed the period to two years.[10]

9.3.1 Firsts

Three projects delivered a microprocessor at about the same time: Garrett AiResearch's Central Air Data Computer (CADC), Texas Instruments' (TI) TMS 1000 (September 1971), and Intel's 4004 (November 1971).

CADC

For more details on this topic, see Central Air Data Computer.

In 1968, Garrett AiResearch (which employed designers Ray Holt and Steve Geller) was invited to produce a digital computer to compete with electromechanical systems then under development for the main flight control computer in the US Navy's new F-14 Tomcat fighter. The design was complete by 1970, and used a MOS-based chipset as the core CPU. The design was significantly (approximately 20 times) smaller and much more reliable than the mechanical systems it competed against, and was used in all of the early Tomcat models. This system contained "a 20-bit, pipelined, parallel multi-microprocessor". The Navy refused to allow publication of the design until 1997. For this reason the CADC, and the MP944 chipset it used, are fairly unknown.[11] Ray Holt graduated from California Polytechnic University in 1968, and began his computer design career with the CADC. From its inception, it was shrouded in secrecy until 1998 when, at Holt's request, the US Navy allowed the documents into the public domain. Since then people have debated whether this was the first microprocessor. Holt has stated that no one has compared this microprocessor with those that came later.[12] According to Parab et al. (2007), "The scientific papers and literature published around 1971 reveal that the MP944 digital processor used for the F-14 Tomcat aircraft of the US Navy qualifies as the first microprocessor. Although interesting, it was not a single-chip processor, as was not the Intel 4004 – they both were more like a set of parallel building blocks you could use to make a general-purpose form. It contains a CPU, RAM, ROM, and two other support chips like the Intel 4004. It was made from the same P-channel technology, operated at military specifications and had larger chips -- an excellent computer engineering design by any standards. Its design indicates a major advance over Intel, and two year earlier. It actually worked and was flying in the F-14 when the Intel 4004 was announced. It indicates that today's industry theme of converging DSP-microcontroller architectures was started in 1971."[13] This convergence of DSP and microcontroller architectures is known as a digital signal controller.[14]

Gilbert Hyatt

Gilbert Hyatt was awarded a patent claiming an invention pre-dating both TI and Intel, describing a "microcontroller".[15] The patent was later invalidated, but not before substantial royalties were paid out.[16][17]

TMS 1000

The Smithsonian Institution says TI engineers Gary Boone and Michael Cochran succeeded in creating the
first microcontroller (also called a microcomputer) and the first single-chip CPU in 1971. The result of their work was the TMS 1000, which went commercial in 1974.[18] TI stressed the 4-bit TMS 1000 for use in pre-programmed embedded applications, introducing a version called the TMS1802NC on September 17, 1971, that implemented a calculator on a chip.

TI filed for a patent on the microprocessor. Gary Boone was awarded U.S. Patent 3,757,306 for the single-chip microprocessor architecture on September 4, 1973. In 1971 and again in 1976, Intel and TI entered into broad patent cross-licensing agreements, with Intel paying royalties to TI for the microprocessor patent. A history of these events is contained in court documentation from a legal dispute between Cyrix and Intel, with TI as inventor and owner of the microprocessor patent.

A computer-on-a-chip combines the microprocessor core (CPU), memory, and I/O (input/output) lines onto one chip. The computer-on-a-chip patent, called the "microcomputer patent" at the time, U.S. Patent 4,074,351, was awarded to Gary Boone and Michael J. Cochran of TI. Aside from this patent, the standard meaning of microcomputer is a computer using one or more microprocessors as its CPU(s), while the concept defined in the patent is more akin to a microcontroller.

Intel 4004

The 4004 with cover removed (left) and as actually used (right)

Main article: Intel 4004

The Intel 4004 is generally regarded as the first commercially available microprocessor,[19][20] and cost $60.[21] The first known advertisement for the 4004 is dated November 15, 1971 and appeared in Electronic News.[22]

The project that produced the 4004 originated in 1969, when Busicom, a Japanese calculator manufacturer, asked Intel to build a chipset for high-performance desktop calculators. Busicom's original design called for a programmable chip set consisting of seven different chips. Three of the chips were to make a special-purpose CPU with its program stored in ROM and its data stored in shift-register read-write memory. Ted Hoff, the Intel engineer assigned to evaluate the project, believed the Busicom design could be simplified by using dynamic RAM storage for data, rather than shift-register memory, and a more traditional general-purpose CPU architecture. Hoff came up with a four-chip architectural proposal: a ROM chip for storing the programs, a dynamic RAM chip for storing data, a simple I/O device, and a 4-bit central processing unit (CPU). Although not a chip designer, he felt the CPU could be integrated into a single chip, but as he lacked the technical know-how the idea remained just a wish for the time being.

While the architecture and specifications of the MCS-4 came from the interaction of Hoff with Stanley Mazor, a software engineer reporting to him, and with Busicom engineer Masatoshi Shima during 1969, Mazor and Hoff moved on to other projects. In April 1970, Intel hired Italian-born engineer Federico Faggin as project leader, a move that ultimately made the single-chip CPU final design a reality (Shima meanwhile designed the Busicom calculator firmware and assisted Faggin during the first six months of the implementation). Faggin, who originally developed the silicon gate technology (SGT) in 1968 at Fairchild Semiconductor[23] and designed the world's first commercial integrated circuit using SGT, the Fairchild 3708, had the correct background to lead the project into what would become the first commercial general-purpose microprocessor. Since SGT was his very own invention, Faggin also used it to create his new methodology for random logic design that made it possible to implement a single-chip CPU with the proper speed, power dissipation and cost. The manager of Intel's MOS Design Department at the time of the MCS-4 development was Leslie L. Vadász, but Vadász's attention was completely focused on the mainstream business of semiconductor memories, and he left the leadership and the management of the MCS-4 project to Faggin, who was ultimately responsible for leading the 4004 project to its realization. Production units of the 4004 were first delivered to Busicom in March 1971 and shipped to other customers in late 1971.

Pico/General Instrument

In 1971 Pico Electronics[24] and General Instrument (GI) introduced their first collaboration in ICs, a complete single-chip calculator IC for the Monroe/Litton Royal Digital III calculator. This chip could also arguably lay claim to be one of the first microprocessors or microcontrollers, having ROM, RAM and a RISC instruction set on-chip. The layout for the four layers of the PMOS process was hand drawn at x500 scale on mylar film, a significant task at the time given the complexity of the chip.

Pico was a spinout by five GI design engineers whose vision was to create single-chip calculator ICs. They had significant previous design experience on multiple calculator chipsets with both GI and Marconi-Elliott.[25] The key team members had originally been tasked by Elliott Automation to create an 8-bit computer in MOS and had helped establish a MOS Research Laboratory in Glenrothes, Scotland in 1967.

Calculators were becoming the largest single market for
semiconductors, and Pico and GI went on to have significant success in this burgeoning market. GI continued to innovate in microprocessors and microcontrollers with products including the CP1600, IOB1680 and PIC1650.[26] In 1987 the GI Microelectronics business was spun out into the Microchip PIC microcontroller business.

The PICO1/GI250 chip introduced in 1971. This was designed by Pico Electronics (Glenrothes, Scotland) and manufactured by General Instrument of Hicksville NY.

Four-Phase Systems AL1

The Four-Phase Systems AL1 was an 8-bit bit-slice chip containing eight registers and an ALU.[27] It was designed by Lee Boysel in 1969.[28][29][30] At the time, it formed part of a nine-chip, 24-bit CPU with three AL1s, but it was later called a microprocessor when, in response to 1990s litigation by Texas Instruments, a demonstration system was constructed where a single AL1 formed part of a courtroom demonstration computer system, together with RAM, ROM, and an input-output device.[31]

9.3.2 8-bit designs

The Intel 4004 was followed in 1972 by the Intel 8008, the world's first 8-bit microprocessor. The 8008 was not, however, an extension of the 4004 design, but instead the culmination of a separate design project at Intel, arising from a contract with Computer Terminals Corporation (CTC), of San Antonio TX, for a chip for a terminal they were designing,[32] the Datapoint 2200 — fundamental aspects of the design came not from Intel but from CTC. In 1968, CTC's Vic Poor and Harry Pyle developed the original design for the instruction set and operation of the processor. In 1969, CTC contracted two companies, Intel and Texas Instruments, to make a single-chip implementation, known as the CTC 1201.[33] In late 1970 or early 1971, TI dropped out, being unable to make a reliable part. In 1970, with Intel yet to deliver the part, CTC opted to use their own implementation in the Datapoint 2200, using traditional TTL logic instead (thus the first machine to run "8008 code" was not in fact a microprocessor at all and was delivered a year earlier). Intel's version of the 1201 microprocessor arrived in late 1971, but was too late, slow, and required a number of additional support chips. CTC had no interest in using it. CTC had originally contracted Intel for the chip, and would have owed them $50,000 for their design work.[33] To avoid paying for a chip they did not want (and could not use), CTC released Intel from their contract and allowed them free use of the design.[33] Intel marketed it as the 8008 in April 1972, as the world's first 8-bit microprocessor. It was the basis for the famous "Mark-8" computer kit advertised in the magazine Radio-Electronics in 1974. This processor had an 8-bit data bus and a 14-bit address bus.[34]

The 8008 was the precursor to the very successful Intel 8080 (1974), which offered much improved performance over the 8008 and required fewer support chips and was conceived and architected by Federico Faggin using high-voltage N-channel MOS; to the Zilog Z80 (1976), also architected by Faggin using low-voltage N-channel with depletion load; and to derivative Intel 8-bit processors; all of them were designed with the design methodology created by Faggin for the 4004. The competing Motorola 6800 was released in August 1974 and the similar MOS Technology 6502 in 1975 (both designed largely by the same people). The 6502 family rivaled the Z80 in popularity during the 1980s.

A low overall cost, small packaging, simple computer bus requirements, and sometimes the integration of extra circuitry (e.g. the Z80's built-in memory refresh circuitry) allowed the home computer "revolution" to accelerate sharply in the early 1980s. This delivered such inexpensive machines as the Sinclair ZX-81, which sold for US$99. A variation of the 6502, the MOS Technology 6510, was used in the Commodore 64 and yet another variant, the 8502, powered the Commodore 128.

The Western Design Center, Inc (WDC) introduced the CMOS 65C02 in 1982 and licensed the design to several firms. It was used as the CPU in the Apple IIe and IIc personal computers as well as in medical implantable-grade pacemakers and defibrillators, and in automotive, industrial and consumer devices. WDC pioneered the licensing of microprocessor designs, later followed by ARM (32-bit) and other microprocessor intellectual property (IP) providers in the 1990s.

Motorola introduced the MC6809 in 1978, an ambitious and well thought-through 8-bit design which was source compatible with the 6800 and was implemented using purely hard-wired logic. (Subsequent 16-bit microprocessors typically used microcode to some extent, as CISC
design requirements were getting too complex for purely hard-wired logic.)

Another early 8-bit microprocessor was the Signetics 2650, which enjoyed a brief surge of interest due to its innovative and powerful instruction set architecture.

A seminal microprocessor in the world of spaceflight was RCA's RCA 1802 (aka CDP1802, RCA COSMAC), introduced in 1976, which was used on board the Galileo probe to Jupiter (launched 1989, arrived 1995). RCA COSMAC was the first to implement CMOS technology. The CDP1802 was used because it could be run at very low power, and because a variant was available fabricated using a special production process, silicon on sapphire (SOS), which provided much better protection against cosmic radiation and electrostatic discharge than that of any other processor of the era. Thus, the SOS version of the 1802 was said to be the first radiation-hardened microprocessor.

The RCA 1802 had what is called a static design, meaning that the clock frequency could be made arbitrarily low, even to 0 Hz, a total stop condition. This let the Galileo spacecraft use minimum electric power for long uneventful stretches of a voyage. Timers or sensors would awaken the processor in time for important tasks, such as navigation updates, attitude control, data acquisition, and radio communication. Current versions of the Western Design Center 65C02 and 65C816 have static cores, and thus retain data even when the clock is completely halted.

9.3.3 12-bit designs

The Intersil 6100 family consisted of a 12-bit microprocessor (the 6100) and a range of peripheral support and memory ICs. The microprocessor recognised the DEC PDP-8 minicomputer instruction set. As such it was sometimes referred to as the CMOS-PDP8. Since it was also produced by Harris Corporation, it was also known as the Harris HM-6100. By virtue of its CMOS technology and associated benefits, the 6100 was being incorporated into some military designs until the early 1980s.

9.3.4 16-bit designs

The first multi-chip 16-bit microprocessor was the National Semiconductor IMP-16, introduced in early 1973. An 8-bit version of the chipset was introduced in 1974 as the IMP-8.

Other early multi-chip 16-bit microprocessors include one that Digital Equipment Corporation (DEC) used in the LSI-11 OEM board set and the packaged PDP 11/03 minicomputer—and the Fairchild Semiconductor MicroFlame 9440, both introduced in 1975–1976. In 1975, National introduced the first 16-bit single-chip microprocessor, the National Semiconductor PACE, which was later followed by an NMOS version, the INS8900.

Another early single-chip 16-bit microprocessor was TI's TMS 9900, which was also compatible with their TI-990 line of minicomputers. The 9900 was used in the TI 990/4 minicomputer, the TI-99/4A home computer, and the TM990 line of OEM microcomputer boards. The chip was packaged in a large ceramic 64-pin DIP package, while most 8-bit microprocessors such as the Intel 8080 used the more common, smaller, and less expensive plastic 40-pin DIP. A follow-on chip, the TMS 9980, was designed to compete with the Intel 8080, had the full TI 990 16-bit instruction set, used a plastic 40-pin package, moved data 8 bits at a time, but could only address 16 KB. A third chip, the TMS 9995, was a new design. The family later expanded to include the 99105 and 99110.

The Western Design Center (WDC) introduced the CMOS 65816 16-bit upgrade of the WDC CMOS 65C02 in 1984. The 65816 16-bit microprocessor was the core of the Apple IIgs and later the Super Nintendo Entertainment System, making it one of the most popular 16-bit designs of all time.

Intel "upsized" their 8080 design into the 16-bit Intel 8086, the first member of the x86 family, which powers most modern PC type computers. Intel introduced the 8086 as a cost-effective way of porting software from the 8080 lines, and succeeded in winning much business on that premise. The 8088, a version of the 8086 that used an 8-bit external data bus, was the microprocessor in the first IBM PC. Intel then released the 80186 and 80188, the 80286 and, in 1985, the 32-bit 80386, cementing their PC market dominance with the processor family's backwards compatibility. The 80186 and 80188 were essentially versions of the 8086 and 8088, enhanced with some onboard peripherals and a few new instructions. Although Intel's 80186 and 80188 were not used in IBM PC type designs, second-source versions from NEC, the V20 and V30, frequently were. The 8086 and successors had an innovative but limited method of memory segmentation, while the 80286 introduced a full-featured segmented memory management unit (MMU). The 80386 introduced a flat 32-bit memory model with paged memory management.

The 16-bit Intel x86 processors up to and including the 80386 do not include floating-point units (FPUs). Intel introduced the 8087, 80187, 80287 and 80387 math coprocessors to add hardware floating-point and transcendental function capabilities to the 8086 through 80386 CPUs. The 8087 works with the 8086/8088 and 80186/80188,[35] the 80187 works with the 80186 but not the 80188,[36] the 80287 works with the 80286 and the 80387 works with the 80386. The combination of an x86 CPU and an x87 coprocessor forms a single multi-chip microprocessor; the two chips are programmed as a unit using a single integrated instruction set.[37] The 8087 and 80187 coprocessors are connected in parallel with the data and address buses of their parent processor and directly execute instructions intended for them. The 80287 and 80387 coprocessors are interfaced to the
CPU through I/O ports in the CPU's address space; this is transparent to the program, which does not need to know about or access these I/O ports directly; the program accesses the coprocessor and its registers through normal instruction opcodes.

9.3.5 32-bit designs

Upper interconnect layers on an Intel 80486DX2 die

16-bit designs had only been on the market briefly when 32-bit implementations started to appear.

The most significant of the 32-bit designs is the Motorola MC68000, introduced in 1979. The 68k, as it was widely known, had 32-bit registers in its programming model but used 16-bit internal data paths, three 16-bit arithmetic logic units, and a 16-bit external data bus (to reduce pin count), and externally supported only 24-bit addresses (internally it worked with full 32-bit addresses). In PC-based IBM-compatible mainframes the MC68000 internal microcode was modified to emulate the 32-bit System/370 IBM mainframe.[38] Motorola generally described it as a 16-bit processor, though it clearly has a 32-bit capable architecture. The combination of high performance, large (16 megabytes, or 2^24 bytes) memory space and fairly low cost made it the most popular CPU design of its class. The Apple Lisa and Macintosh designs made use of the 68000, as did a host of other designs in the mid-1980s, including the Atari ST and Commodore Amiga.

The world's first single-chip fully 32-bit microprocessor, with 32-bit data paths, 32-bit buses, and 32-bit addresses, was the AT&T Bell Labs BELLMAC-32A, with first samples in 1980 and general production in 1982.[39][40] After the divestiture of AT&T in 1984, it was renamed the WE 32000 (WE for Western Electric), and had two follow-on generations, the WE 32100 and WE 32200. These microprocessors were used in the AT&T 3B5 and 3B15 minicomputers; in the 3B2, the world's first desktop super microcomputer; in the "Companion", the world's first 32-bit laptop computer; and in "Alexander", the world's first book-sized super microcomputer, featuring ROM-pack memory cartridges similar to today's gaming consoles. All these systems ran the UNIX System V operating system.

The first commercial, single-chip, fully 32-bit microprocessor available on the market was the HP FOCUS.

Intel's first 32-bit microprocessor was the iAPX 432, which was introduced in 1981 but was not a commercial success. It had an advanced capability-based object-oriented architecture, but poor performance compared to contemporary architectures such as Intel's own 80286 (introduced 1982), which was almost four times as fast on typical benchmark tests. However, the results for the iAPX 432 were partly due to a rushed and therefore suboptimal Ada compiler.

Motorola's success with the 68000 led to the MC68010, which added virtual memory support. The MC68020, introduced in 1984, added full 32-bit data and address buses. The 68020 became hugely popular in the Unix supermicrocomputer market, and many small companies (e.g., Altos, Charles River Data Systems, Cromemco) produced desktop-size systems. The MC68030 was introduced next, improving upon the previous design by integrating the MMU into the chip. The continued success led to the MC68040, which included an FPU for better math performance. A 68050 failed to achieve its performance goals and was not released, and the follow-up MC68060 was released into a market saturated by much faster RISC designs. The 68k family faded from the desktop in the early 1990s.

Other large companies designed the 68020 and follow-ons into embedded equipment. At one point, there were more 68020s in embedded equipment than there were Intel Pentiums in PCs.[41] The ColdFire processor cores are derivatives of the venerable 68020.

During this time (early to mid-1980s), National Semiconductor introduced a very similar 16-bit pinout, 32-bit internal microprocessor called the NS 16032 (later renamed 32016), with the full 32-bit version named the NS 32032. Later, National Semiconductor produced the NS 32132, which allowed two CPUs to reside on the same memory bus with built-in arbitration. The NS32016/32 outperformed the MC68000/10, but the NS32332—which arrived at approximately the same time as the MC68020—did not have enough performance. The third-generation chip, the NS32532, was different. It had about double the performance of the MC68030, which was released around the same time. The appearance of RISC processors like the AM29000 and MC88000 (now both dead) influenced the architecture of the final core, the NS32764. Technically advanced—with a superscalar RISC core, 64-bit bus, and internally overclocked—it could still execute Series 32000 instructions through real-
time translation.

When National Semiconductor decided to leave the Unix market, the chip was redesigned into the Swordfish embedded processor with a set of on-chip peripherals. The chip turned out to be too expensive for the laser printer market and was killed. The design team went to Intel and there designed the Pentium processor, which is very similar to the NS32764 core internally. The big success of the Series 32000 was in the laser printer market, where the NS32CG16 with microcoded BitBlt instructions had very good price/performance and was adopted by large companies like Canon. By the mid-1980s, Sequent introduced the first SMP server-class computer using the NS 32032. This was one of the design's few wins, and it disappeared in the late 1980s. The MIPS R2000 (1984) and R3000 (1989) were highly successful 32-bit RISC microprocessors. They were used in high-end workstations and servers by SGI, among others. Other designs included the Zilog Z80000, which arrived too late to market to stand a chance and disappeared quickly.

The ARM first appeared in 1985.[42] This is a RISC processor design, which has since come to dominate the 32-bit embedded systems processor space due in large part to its power efficiency, its licensing model, and its wide selection of system development tools. Semiconductor manufacturers generally license cores and integrate them into their own system-on-a-chip products; only a few such vendors are licensed to modify the ARM cores. Most cell phones include an ARM processor, as do a wide variety of other products. There are microcontroller-oriented ARM cores without virtual memory support, as well as symmetric multiprocessor (SMP) applications processors with virtual memory.

In the late 1980s, "microprocessor wars" started killing off some of the microprocessors. With only one bigger design win, Sequent, the NS 32032 simply faded out of existence, and Sequent switched to Intel microprocessors.

From 1993 to 2003, the 32-bit x86 architectures became increasingly dominant in desktop, laptop, and server markets, and these microprocessors became faster and more capable. Intel had licensed early versions of the architecture to other companies, but declined to license the Pentium, so AMD and Cyrix built later versions of the architecture based on their own designs. During this span, these processors increased in complexity (transistor count) and capability (instructions/second) by at least three orders of magnitude. Intel's Pentium line is probably the most famous and recognizable 32-bit processor model, at least with the public at large.

9.3.6 64-bit designs in personal computers

While 64-bit microprocessor designs have been in use in several markets since the early 1990s (including the Nintendo 64 gaming console in 1996), the early 2000s saw the introduction of 64-bit microprocessors targeted at the PC market.

With AMD's introduction of a 64-bit architecture backwards-compatible with x86, x86-64 (also called AMD64), in September 2003, followed by Intel's near fully compatible 64-bit extensions (first called IA-32e or EM64T, later renamed Intel 64), the 64-bit desktop era began. Both versions can run 32-bit legacy applications without any performance penalty as well as new 64-bit software. With operating systems Windows XP x64, Windows Vista x64, Windows 7 x64, Linux, BSD, and Mac OS X that run 64-bit native, the software is also geared to fully utilize the capabilities of such processors. The move to 64 bits is more than just an increase in register size from the IA-32, as it also doubles the number of general-purpose registers.

The move to 64 bits by PowerPC processors had been intended since the processors' design in the early 90s and was not a major cause of incompatibility. Existing integer registers are extended, as are all related data pathways, but, as was the case with IA-32, both floating-point and vector units had been operating at or above 64 bits for several years. Unlike what happened when IA-32 was extended to x86-64, no new general-purpose registers were added in 64-bit PowerPC, so any performance gained when using the 64-bit mode for applications making no use of the larger address space is minimal.

In 2011, ARM introduced a new 64-bit ARM architecture.

Multi-core designs

Main article: Multi-core (computing)

A different approach to improving a computer's performance is to add extra processors, as in symmetric multiprocessing designs, which have been popular in servers and workstations since the early 1990s. Keeping up with Moore's Law is becoming increasingly challenging as chip-making technologies approach their physical limits. In response, microprocessor manufacturers look for other ways to improve performance so they can maintain the momentum of constant upgrades.

A multi-core processor is a single chip that contains more than one microprocessor core. Each core can simultaneously execute processor instructions in parallel. This effectively multiplies the processor's potential performance by the number of cores, if the software is designed to take advantage of more than one processor core. Some components, such as the bus interface and cache, may be shared between cores. Because the cores are physically close to each other, they can communicate with each other much faster than separate (off-chip) processors in a multiprocessor system, which improves overall system performance.
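That multiplication of performance only materialises when software is written to exploit the extra cores. As a minimal illustration, not tied to any particular processor, the C sketch below uses POSIX threads to split a summation between two threads that the operating system can schedule on separate cores; the array size and thread count are arbitrary choices for the example.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int data[N];                 /* the work to be divided */

struct slice { int start, end; long long sum; };

/* Each thread sums one half of the array; on a multi-core processor the
 * operating system can run the two threads on different cores at once. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->sum = 0;
    for (int i = s->start; i < s->end; i++)
        s->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = i % 10;

    struct slice halves[2] = { { 0, N / 2, 0 }, { N / 2, N, 0 } };
    pthread_t tid[2];

    for (int t = 0; t < 2; t++)
        pthread_create(&tid[t], NULL, sum_slice, &halves[t]);
    for (int t = 0; t < 2; t++)
        pthread_join(tid[t], NULL);

    printf("total = %lld\n", halves[0].sum + halves[1].sum);
    return 0;
}

On a POSIX system this would typically be compiled with the -pthread flag; a program that never creates additional threads gains nothing from the extra cores.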
In 2005, AMD released the first native dual-core processor, the Athlon X2. Intel's Pentium D had beaten the X2 to market by a few weeks, but it used two separate CPU dies and was less efficient than AMD's native design. As of 2012, dual-core and quad-core processors are widely used in home PCs and laptops, while quad-, six-, eight-, ten-, twelve-, and sixteen-core processors are common in the professional and enterprise markets with workstations and servers.

Sun Microsystems has released the Niagara and Niagara 2 chips, both of which feature an eight-core design. The Niagara 2 supports more threads and operates at 1.6 GHz.

High-end Intel Xeon processors that are on the LGA 771, LGA 1366, and LGA 2011 sockets and high-end AMD Opteron processors that are on the C32 and G34 sockets are DP (dual processor) capable, as is the older Intel Core 2 Extreme QX9775, also used in an older Mac Pro by Apple and the Intel Skulltrail motherboard. AMD's G34 motherboards can support up to four CPUs and Intel's LGA 1567 motherboards can support up to eight CPUs.

Modern desktop computers support systems with multiple CPUs, but few applications outside of the professional market can make good use of more than four cores. Both Intel and AMD currently offer fast quad- and six-core desktop CPUs, making multi-CPU systems obsolete for many purposes. AMD also offers the first and currently the only eight-core desktop CPUs with the FX-8xxx line.

The desktop market has been in a transition towards quad-core CPUs since Intel's Core 2 Quads were released, and these are now common, although dual-core CPUs are still more prevalent. Older or mobile computers are less likely to have more than two cores than newer desktops. Not all software is optimised for multi-core CPUs, making fewer, more powerful cores preferable. AMD offers CPUs with more cores for a given amount of money than similarly priced Intel CPUs—but the AMD cores are somewhat slower, so the two trade blows in different applications depending on how well-threaded the programs running are.

For example, Intel's cheapest Sandy Bridge quad-core CPUs often cost almost twice as much as AMD's cheapest Athlon II, Phenom II, and FX quad-core CPUs, but Intel has dual-core CPUs in the same price ranges as AMD's cheaper quad-core CPUs. In an application that uses only one or two threads, the Intel dual-cores outperform AMD's similarly priced quad-core CPUs—and if a program supports three or four threads the cheap AMD quad-core CPUs outperform the similarly priced Intel dual-core CPUs.

Historically, AMD and Intel have switched places as the company with the fastest CPU several times. Intel currently leads on the desktop side of the computer CPU market, with their Sandy Bridge and Ivy Bridge series. In servers, AMD's new Opterons seem to have superior performance for their price point. This means that AMD is currently more competitive in low- to mid-end servers and workstations that more effectively use fewer cores and threads.

9.3.7 RISC

Main article: Reduced instruction set computing

In the mid-1980s to early 1990s, a crop of new high-performance reduced instruction set computer (RISC) microprocessors appeared, influenced by discrete RISC-like CPU designs such as the IBM 801 and others. RISC microprocessors were initially used in special-purpose machines and Unix workstations, but then gained wide acceptance in other roles.

In 1986, HP released its first system with a PA-RISC CPU. The first commercial RISC microprocessor design was released in 1984 by MIPS Computer Systems, the 32-bit R2000 (the R1000 was not released). In 1987, the 32-bit, then cache-less, ARM2-based Acorn Archimedes (a non-Unix Acorn computer) became the first commercial success using the ARM architecture, then known as Acorn RISC Machine (ARM); the first silicon, the ARM1, appeared in 1985. The R3000 made the design truly practical, and the R4000 introduced the world's first commercially available 64-bit RISC microprocessor. Competing projects would result in the IBM POWER and Sun SPARC architectures. Soon every major vendor was releasing a RISC design, including the AT&T CRISP, AMD 29000, Intel i860 and Intel i960, Motorola 88000, and DEC Alpha.

In the late 1990s, only two 64-bit RISC architectures were still produced in volume for non-embedded applications: SPARC and Power ISA. But as ARM has become increasingly powerful, in the early 2010s it became the third RISC architecture in the general computing segment.

9.4 Market statistics

In 2003, about US$44 billion worth of microprocessors were manufactured and sold.[43] Although about half of that money was spent on CPUs used in desktop or laptop personal computers, those count for only about 2% of all CPUs sold.[44] The quality-adjusted price of laptop microprocessors improved −25% to −35% per year in 2004–2010, and the rate of improvement slowed to −15% to −25% per year in 2010–2013.[45]

About 55% of all CPUs sold in the world are 8-bit microcontrollers, over two billion of which were sold in 1997.[46]

In 2002, less than 10% of all the CPUs sold in the world were 32-bit or more. Of all the 32-bit CPUs sold, about 2% are used in desktop or laptop personal computers. Most microprocessors are used in embedded control applications such as household appliances, automobiles, and
computer peripherals. Taken as a whole, the average price for a microprocessor, microcontroller, or DSP is just over $6.[44]

About ten billion CPUs were manufactured in 2008. About 98% of new CPUs produced each year are embedded.[47]

9.5 See also

• Arithmetic logic unit

• Central processing unit

• Comparison of CPU architectures

• Computer architecture

• Computer engineering

• CPU design

• Floating point unit

• Instruction set

• List of instruction sets

• List of microprocessors

• Microarchitecture

• Microcode

• Microprocessor chronology

9.6 Notes

[1] Osborne, Adam (1980). An Introduction to Microcomputers. Volume 1: Basic Concepts (2nd ed.). Berkeley, California: Osborne-McGraw Hill. ISBN 0-931988-34-9.

[2] Krishna Kant. Microprocessors And Microcontrollers: Architecture, Programming And System Design. PHI Learning Pvt. Ltd., 2007. ISBN 81-203-3191-5. p. 61, describing the iAPX 432.

[3] Kristian Saether; Ingar Fredriksen. "Introducing a New Breed of Microcontrollers for 8/16-bit Applications". p. 5.

[4] CMicrotek. "8-bit vs 32-bit Micros". 2013.

[5] Richard York. "8-bit versus 32-bit MCUs - The impassioned debate goes on".

[6] "32-bit Microcontroller Technology: Reduced processing time".

[7] "Cortex-M3 Processor: Energy efficiency advantage".

[8] Back to the Moon: The Verification of a Small Microprocessor's Logic Design - NASA Office of Logic Design.

[9] Moore, Gordon (19 April 1965). "Cramming more components onto integrated circuits" (PDF). Electronics 38 (8). Retrieved 2009-12-23.

[10] "Excerpts from A Conversation with Gordon Moore: Moore's Law" (PDF). Intel. 2005. Retrieved 2009-12-23.

[11] Holt, Ray M. "World's First Microprocessor Chip Set". Ray M. Holt website. Archived from the original on 2010-07-25. Retrieved 2010-07-25.

[12] Holt, Ray (27 September 2001). Lecture: Microprocessor Design and Development for the US Navy F14 Fighter Jet (Speech). Room 8220, Wean Hall, Carnegie Mellon University, Pittsburgh, PA, US. Retrieved 2010-07-25.

[13] Parab, Jivan S.; Shelake, Vinod G.; Kamat, Rajanish K.; Naik, Gourish M. (2007). Exploring C for Microcontrollers: A Hands on Approach (PDF). Springer. p. 4. ISBN 978-1-4020-6067-0. Retrieved 2010-07-25.

[14] Dyer, S. A.; Harms, B. K. (1993). "Digital Signal Processing". In Yovits, M. C. Advances in Computers 37. Academic Press. pp. 104–107. doi:10.1016/S0065-2458(08)60403-9. ISBN 9780120121373.

[15] Hyatt, Gilbert P., "Single chip integrated circuit computer architecture", Patent 4942516, issued July 17, 1990.

[16] "The Gilbert Hyatt Patent". intel4004.com. Federico Faggin. Retrieved 2009-12-23.

[17] Crouch, Dennis (1 July 2007). "Written Description: CAFC Finds Prima Facie Rejection (Hyatt v. Dudas (Fed. Cir. 2007))". Patently-O blog. Retrieved 2009-12-23.

[18] Augarten, Stan (1983). The Most Widely Used Computer on a Chip: The TMS 1000. State of the Art: A Photographic History of the Integrated Circuit (New Haven and New York: Ticknor & Fields). ISBN 0-89919-195-9. Retrieved 2009-12-23.

[19] Mack, Pamela E. (30 November 2005). "The Microcomputer Revolution". Retrieved 2009-12-23.

[20] "History in the Computing Curriculum" (PDF). Retrieved 2009-12-23.

[21] Bright, Peter (November 15, 2011). "The 40th birthday of—maybe—the first microprocessor, the Intel 4004". arstechnica.com.

[22] Faggin, Federico; Hoff, Marcian E., Jr.; Mazor, Stanley; Shima, Masatoshi (December 1996). "The History of the 4004". IEEE Micro 16 (6): 10–20. doi:10.1109/40.546561.

[23] Faggin, F.; Klein, T.; L. (23 October 1968). Insulated Gate Field Effect Transistor Integrated Circuits with Silicon Gates (JPEG image). International Electronic Devices Meeting. IEEE Electron Devices Group. Retrieved 2009-12-23.

[24] McGonigal, James (20 September 2006). "Microprocessor History: Foundations in Glenrothes, Scotland". McGonigal personal website. Retrieved 2009-12-23.
[25] Tout, Nigel. "ANITA at its Zenith". Bell Punch Company and the ANITA calculators. Retrieved 2010-07-25.

[26] 16 Bit Microprocessor Handbook by Gerry Kane, Adam Osborne. ISBN 0-07-931043-5.

[27] Basset, Ross (2003). "When is a Microprocessor not a Microprocessor? The Industrial Construction of Semiconductor Innovation". In Finn, Bernard. Exposing Electronics. Michigan State University Press. p. 121. ISBN 0-87013-658-5.

[28] "1971 - Microprocessor Integrates CPU Function onto a Single Chip". The Silicon Engine. Computer History Museum. Retrieved 2010-07-25.

[29] Shaller, Robert R. (15 April 2004). "Dissertation: Technological Innovation in the Semiconductor Industry: A Case Study of the International Technology Roadmap for Semiconductors" (PDF). George Mason University. Archived from the original (PDF) on 2006-12-19. Retrieved 2010-07-25.

[30] RW (3 March 1995). "Interview with Gordon E. Moore". LAIR History of Science and Technology Collections. Los Altos Hills, California: Stanford University.

[31] Bassett 2003. pp. 115, 122.

[32] Ceruzzi, Paul E. (May 2003). A History of Modern Computing (2nd ed.). MIT Press. pp. 220–221. ISBN 0-262-53203-4.

[33] Wood, Lamont (August 2008). "Forgotten history: the true origins of the PC". Computerworld. Archived from the original on 2011-01-07. Retrieved 2011-01-07.

[34] Intel 8008 data sheet.

[35] Intel 8087 datasheet, p. 1.

[36] The 80187 only has a 16-bit data bus because it used the 80387SX core.

[37] "Essentially, the 80C187 can be treated as an additional resource or an extension to the CPU. The 80C186 CPU together with an 80C187 can be used as a single unified system." Intel 80C187 datasheet, p. 3, November 1992 (Order Number: 270640-004).

[38] "Priorartdatabase.com". Priorartdatabase.com. 1986-01-01. Retrieved 2014-06-09.

[39] "Shoji, M. Bibliography". Bell Laboratories. 7 October 1998. Retrieved 2009-12-23.

[40] "Timeline: 1982–1984". Physical Sciences & Communications at Bell Labs. Bell Labs, Alcatel-Lucent. 17 January 2001. Retrieved 2009-12-23.

[41] Turley, Jim (July 1998). "MCore: Does Motorola Need Another Processor Family?". Embedded Systems Design. TechInsights (United Business Media). Archived from the original on 1998-07-02. Retrieved 2009-12-23.

[42] Garnsey, Elizabeth; Lorenzoni, Gianni; Ferriani, Simone (March 2008). "Speciation through entrepreneurial spin-off: The Acorn-ARM story" (PDF). Research Policy 37 (2). doi:10.1016/j.respol.2007.11.006. Retrieved 2011-06-02. [...] the first silicon was run on April 26th 1985.

[43] WSTS Board Of Directors. "WSTS Semiconductor Market Forecast World Release Date: 1 June 2004 - 6:00 UTC". Miyazaki, Japan, Spring Forecast Meeting 18–21 May 2004 (Press release). World Semiconductor Trade Statistics. Archived from the original on 2004-12-07.

[44] Turley, Jim (18 December 2002). "The Two Percent Solution". Embedded Systems Design. TechInsights (United Business Media). Retrieved 2009-12-23.

[45] Sun, Liyang (2014-04-25). "What We Are Paying for: A Quality Adjusted Price Index for Laptop Microprocessors". Wellesley College. Retrieved 2014-11-07. … compared with −25% to −35% per year over 2004-2010, the annual decline plateaus around −15% to −25% over 2010-2013.

[46] Cantrell, Tom (1998). "Microchip on the March". Archived from the original on 2007-02-20.

[47] Barr, Michael (1 August 2009). "Real men program in C". Embedded Systems Design. TechInsights (United Business Media). p. 2. Retrieved 2009-12-23.

9.7 References

• Ray, A. K.; Bhurchand, K. M. Advanced Microprocessors and Peripherals. India: Tata McGraw-Hill.

9.8 External links

• Patent problems

• Dirk Oppelt. "The CPU Collection". Retrieved 2009-12-23.

• Gennadiy Shvets. "CPU-World". Retrieved 2009-12-23.

• Jérôme Cremet. "The Gecko's CPU Library". Retrieved 2009-12-23.

• "How Microprocessors Work". Retrieved 2009-12-23.

• William Blair. "IC Die Photography". Retrieved 2009-12-23.

• John Bayko (December 2003). "Great Microprocessors of the Past and Present". Retrieved 2009-12-23.

• Wade Warner (22 December 2004). "Great moments in microprocessor history". IBM. Retrieved 2013-03-07.

• Ray M. Holt. "theDocuments". World's First Microprocessor. Retrieved 2009-12-23.
Chapter 10

Processor design

Processor design is the design engineering task of creating a microprocessor, a component of computer hardware. It is a subfield of electronics engineering and computer engineering. The design process involves choosing an instruction set and a certain execution paradigm (e.g. VLIW or RISC) and results in a microarchitecture described in e.g. VHDL or Verilog. This description is then manufactured employing one of the various semiconductor device fabrication processes. This results in a die which is bonded onto a chip carrier. The chip carrier is then soldered onto a printed circuit board (PCB).

The mode of operation of any microprocessor is the execution of lists of instructions. Instructions typically include those to compute or manipulate data values using registers, change or retrieve values in read/write memory, perform relational tests between data values, and control program flow.

10.1 Details

CPU design focuses on six main areas:

1. datapaths (such as ALUs and pipelines)

2. control unit: logic which controls the datapaths

3. memory components such as register files and caches

4. clock circuitry such as clock drivers, PLLs, and clock distribution networks

5. pad transceiver circuitry

6. the logic gate cell library which is used to implement the logic

CPUs designed for high-performance markets might require custom designs for each of these items to achieve frequency, power-dissipation, and chip-area goals, whereas CPUs designed for lower-performance markets might lessen the implementation burden by acquiring some of these items as purchased intellectual property. Control logic implementation techniques (logic synthesis using CAD tools) can be used to implement datapaths, register files, and clocks. Common logic styles used in CPU design include unstructured random logic, finite-state machines, microprogramming (common from 1965 to 1985), and programmable logic arrays (common in the 1980s, no longer common).

Device types used to implement the logic include:

• Transistor-transistor logic small-scale integration logic chips - no longer used for CPUs

• Programmable Array Logic and programmable logic devices - no longer used for CPUs

• Emitter-coupled logic (ECL) gate arrays - no longer common

• CMOS gate arrays - no longer used for CPUs

• CMOS mass-produced ICs - the vast majority of CPUs by volume

• CMOS ASICs - only for a minority of special applications due to expense

• Field-programmable gate arrays (FPGA) - common for soft microprocessors, and more or less required for reconfigurable computing

A CPU design project generally has these major tasks:

• Programmer-visible instruction set architecture, which can be implemented by a variety of microarchitectures

• Architectural study and performance modeling in ANSI C/C++ or SystemC (a minimal example of this style of model is sketched after this list)

• High-level synthesis (HLS) or register transfer level (RTL, e.g. logic) implementation

• RTL verification

• Circuit design of speed-critical components (caches, registers, ALUs)

• Logic synthesis or logic-gate-level design

• Timing analysis to confirm that all logic and circuits will run at the specified operating frequency

• Physical design, including floorplanning and place and route of logic gates

• Checking that RTL, gate-level, transistor-level and physical-level representations are equivalent

• Checks for signal integrity and chip manufacturability
64 CHAPTER 10. PROCESSOR DESIGN

• Timing analysis to confirm that all logic and circuits will run at the specified operating frequency
• Physical design including floorplanning, place and route of logic gates
• Checking that RTL, gate-level, transistor-level and physical-level representations are equivalent
• Checks for signal integrity and chip manufacturability

Re-designing a CPU core to a smaller die-area helps to shrink everything (a "photomask shrink"), resulting in the same number of transistors on a smaller die. It improves performance (smaller transistors switch faster), reduces power (smaller wires have less parasitic capacitance) and reduces cost (more CPUs fit on the same wafer of silicon). Releasing a CPU on the same size die, but with a smaller CPU core, keeps the cost about the same but allows higher levels of integration within one very-large-scale integration chip (additional cache, multiple CPUs, or other components), improving performance and reducing overall system cost.

As with most complex electronic designs, the logic verification effort (proving that the design does not have bugs) now dominates the project schedule of a CPU.

Key CPU architectural innovations include the index register, cache, virtual memory, instruction pipelining, superscalar execution, CISC, RISC, the virtual machine, emulators, microprogramming, and the stack.

10.1.1 Micro-architectural concepts

Main article: Microarchitecture

10.1.2 Research topics

Main article: History of general-purpose CPUs § 1990 to today: looking forward

A variety of new CPU design ideas have been proposed, including reconfigurable logic, clockless CPUs, computational RAM, and optical computing.

10.1.3 Performance analysis and benchmarking

Main article: Computer performance

Benchmarking is a way of testing CPU speed. Examples include SPECint and SPECfp, developed by the Standard Performance Evaluation Corporation, and ConsumerMark, developed by the Embedded Microprocessor Benchmark Consortium (EEMBC). Measurements include:

• Instructions per second - Most consumers pick a computer architecture (normally the Intel IA-32 architecture) to be able to run a large base of pre-existing, pre-compiled software. Being relatively uninformed about computer benchmarks, some of them pick a particular CPU based on operating frequency (see Megahertz Myth).
• FLOPS - The number of floating point operations per second is often important in selecting computers for scientific computations.
• Performance per watt - System designers building parallel computers, such as Google, pick CPUs based on their speed per watt of power, because the cost of powering the CPU outweighs the cost of the CPU itself.[1]
• Some system designers building parallel computers pick CPUs based on the speed per dollar.
• System designers building real-time computing systems want to guarantee worst-case response. That is easier to do when the CPU has low interrupt latency and when it has deterministic response (see DSP).
• Computer programmers who program directly in assembly language want a CPU to support a full-featured instruction set.
• Low power - for systems with limited power sources (e.g. solar, batteries, human power).
• Small size or low weight - for portable embedded systems, or systems for spacecraft.
• Environmental impact - minimizing the environmental impact of computers during manufacturing, use and recycling, reducing waste and reducing hazardous materials (see Green computing).

Some of these measures conflict. In particular, many design techniques that make a CPU run faster make the "performance per watt", "performance per dollar", and "deterministic response" much worse, and vice versa.
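As a rough illustration of how these figures of merit relate to each other, the short C sketch below derives instruction throughput, run time, and performance per watt from a handful of assumed numbers (clock rate, average cycles per instruction, power draw). The values are placeholders chosen for the example, not measurements of any real CPU.

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures for a hypothetical CPU -- not measured data. */
    double clock_hz   = 2.0e9;   /* 2 GHz operating frequency       */
    double cpi        = 1.5;     /* average cycles per instruction  */
    double insn_count = 6.0e9;   /* dynamic instructions in the job */
    double power_w    = 65.0;    /* package power while running     */

    double insn_per_sec = clock_hz / cpi;            /* throughput     */
    double runtime_s    = insn_count / insn_per_sec; /* time to finish */
    double perf_per_w   = insn_per_sec / power_w;    /* speed per watt */

    printf("throughput     : %.3g instructions/s\n", insn_per_sec);
    printf("runtime        : %.3g s\n", runtime_s);
    printf("perf. per watt : %.3g instructions/s/W\n", perf_per_w);
    return 0;
}
```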
10.2 Markets

There are several different markets in which CPUs are used. Since each of these markets differs in its requirements for CPUs, the devices designed for one market are in most cases inappropriate for the other markets.

10.2.1 General purpose computing

The vast majority of revenues generated from CPU sales is for general purpose computing, that is, desktop, laptop, and server computers commonly used in businesses and homes. In this market, the Intel IA-32 architecture dominates, with its rivals PowerPC and SPARC maintaining much smaller customer bases. Yearly, hundreds of millions of IA-32 architecture CPUs are used by this market. A growing percentage of these processors are for mobile implementations such as netbooks and laptops.[2]

Since these devices are used to run countless different types of programs, these CPU designs are not specifically targeted at one type of application or one function. The demands of running a wide range of programs efficiently have made these CPU designs among the more technically advanced, at the cost of being relatively expensive and having high power consumption.

High-end processor economics

In 1984, most high-performance CPUs required four to five years to develop.[3]

10.2.2 Scientific computing

Main article: Supercomputer

Scientific computing is a much smaller niche market (in revenue and units shipped). It is used in government research labs and universities. Before 1990, CPU design was often done for this market, but mass market CPUs organized into large clusters have proven to be more affordable. The main remaining area of active hardware design and research for scientific computing is for high-speed data transmission systems to connect mass market CPUs.

10.2.3 Embedded design

As measured by units shipped, most CPUs are embedded in other machinery, such as telephones, clocks, appliances, vehicles, and infrastructure. Embedded processors sell in volumes of many billions of units per year, however, mostly at much lower price points than those of the general purpose processors.

These single-function devices differ from the more familiar general-purpose CPUs in several ways:

• Low cost is of high importance.
• It is important to maintain a low power dissipation, as embedded devices often have a limited battery life and it is often impractical to include cooling fans.
• To give lower system cost, peripherals are integrated with the processor on the same silicon chip.
• Keeping peripherals on-chip also reduces power consumption, as external GPIO ports typically require buffering so that they can source or sink the relatively high current loads that are required to maintain a strong signal outside of the chip.
• Many embedded applications have a limited amount of physical space for circuitry; keeping peripherals on-chip reduces the space required for the circuit board.
• The program and data memories are often integrated on the same chip. When the only allowed program memory is ROM, the device is known as a microcontroller.
• For many embedded applications, interrupt latency will be more critical than in some general-purpose processors.

Embedded processor economics

The embedded CPU family with the largest number of total units shipped is the 8051, averaging nearly a billion units per year.[4] The 8051 is widely used because it is very inexpensive. The design time is now roughly zero, because it is widely available as commercial intellectual property. It is now often embedded as a small part of a larger system on a chip. The silicon cost of an 8051 is now as low as US$0.001, because some implementations use as few as 2,200 logic gates and take 0.0127 square millimeters of silicon.[5][6]

As of 2009, more CPUs are produced using the ARM architecture instruction set than any other 32-bit instruction set.[7][8] The ARM architecture and the first ARM chip were designed in about one and a half years and 5 human years of work time.[9]

The 32-bit Parallax Propeller microcontroller architecture and the first chip were designed by two people in about 10 human years of work time.[10]

The 8-bit AVR architecture and the first AVR microcontroller were conceived and designed by two students at the Norwegian Institute of Technology.

The 8-bit 6502 architecture and the first MOS Technology 6502 chip were designed in 13 months by a group of about 9 people.[11]

Research and educational CPU design

The 32-bit Berkeley RISC I and RISC II architecture and the first chips were mostly designed by a series of students as part of a four quarter sequence of graduate courses.[12] This design became the basis of the commercial SPARC processor design.

For about a decade, every student taking the 6.004 class at MIT was part of a team—each team had one semester to design and build a simple 8-bit CPU out of 7400 series integrated circuits. One team of 4 students designed and built a simple 32-bit CPU during that semester.[13]

Some undergraduate courses require a team of 2 to 5 students to design, implement, and test a simple CPU in an FPGA in a single 15-week semester.[14]

Soft microprocessor cores

Main article: Soft microprocessor

For embedded systems, the highest performance levels are often not needed or desired due to the power consumption requirements. This allows for the use of processors which can be totally implemented by logic synthesis techniques. These synthesized processors can be implemented in a much shorter amount of time, giving quicker time-to-market.

10.3 See also

• Central processing unit
• History of general-purpose CPUs
• Microprocessor
• Microarchitecture
• Moore's law
• Amdahl's law
• System-on-a-chip
• Reduced instruction set computer
• Complex instruction set computer
• Minimal instruction set computer
• Electronic design automation
• High-level synthesis

10.4 References

[1]
[2] Kerr, Justin. "AMD Loses Market Share as Mobile CPU Sales Outsell Desktop for the First Time." Maximum PC. Published 2010-10-26.
[3] "New system manages hundreds of transactions per second", article by Robert Horst and Sandra Metz, of Tandem Computers Inc., "Electronics" magazine, 1984 April 19: "While most high-performance CPUs require four to five years to develop, the NonStop TXP processor took just 2+1/2 years -- six months to develop a complete written specification, one year to construct a working prototype, and another year to reach volume production."
[4] http://people.wallawalla.edu/~curt.nelson/engr355/lecture/8051_overview.pdf
[5] Square millimeters per 8051: 0.013 in 45 nm line-widths; see
[6] To figure dollars per square millimeter, see, and note that an SOC component has no pin or packaging costs.
[7] "ARM Cores Climb Into 3G Territory" by Mark Hachman, 2002.
[8] "The Two Percent Solution" by Jim Turley, 2002.
[9] "ARM's way", 1998.
[10] "Why the Propeller Works" by Chip Gracey.
[11] "Interview with William Mensch".
[12] "Design and Implementation of RISC I" - original journal article by C.E. Sequin and D.A. Patterson.
[13] "the VHS".
[14] "Teaching Computer Design with FPGAs" by Jan Gray.

• Hwang, Enoch (2006). Digital Logic and Microprocessor Design with VHDL. Thomson. ISBN 0-534-46593-5.
• Processor Design: An Introduction
Chapter 11

History of general-purpose CPUs

The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.

[Image: A Vacuum tube module from early 700 series IBM computers]

11.1 1950s: early designs

Each of the computer designs of the early 1950s was a unique design; there were no upward-compatible machines or computer architectures with multiple, differing implementations. Programs written for one machine would not run on another kind, even other kinds from the same company. This was not a major drawback at the time because there was not a large body of software developed to run on computers, so starting programming from scratch was not seen as a large barrier.

The design freedom of the time was very important, for designers were very constrained by the cost of electronics, yet just beginning to explore how a computer could best be organized. Some of the basic features introduced during this period included index registers (on the Ferranti Mark 1), a return-address saving instruction (UNIVAC I), immediate operands (IBM 704), and the detection of invalid operations (IBM 650).

By the end of the 1950s commercial builders had developed factory-constructed, truck-deliverable computers. The most widely installed computer was the IBM 650, which used drum memory onto which programs were loaded using either paper tape or punched cards. Some very high-end machines also included core memory which provided higher speeds. Hard disks were also starting to become popular.

A computer is an automatic abacus. The type of number system affects the way it works. In the early 1950s most computers were built for specific numerical processing tasks, and many machines used decimal numbers as their basic number system – that is, the mathematical functions of the machines worked in base-10 instead of base-2 as is common today. These were not merely binary coded decimal. Most machines actually had ten vacuum tubes per digit in each register. Some early Soviet computer designers implemented systems based on ternary logic; that is, a bit could have three states: +1, 0, or −1, corresponding to positive, zero, or negative voltage.

An early project for the U.S. Air Force, BINAC attempted to make a lightweight, simple computer by using binary arithmetic. It deeply impressed the industry.

As late as 1970, major computer languages were unable to standardize their numeric behavior because decimal computers had groups of users too large to alienate.

Even when designers used a binary system, they still had many odd ideas. Some used sign-magnitude arithmetic (−1 = 10001), or ones' complement (−1 = 11110), rather than modern two's complement arithmetic (−1 = 11111).
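The C fragment below prints −1 in the three 5-bit encodings mentioned above; the 5-bit width matches the examples in the text, and the small encoding helpers are written out by hand purely for illustration.

```c
#include <stdio.h>

/* Encode a small signed value in three historical 5-bit formats. */
static unsigned sign_magnitude(int v)  { return v < 0 ? 0x10u | (unsigned)(-v) : (unsigned)v; }
static unsigned ones_complement(int v) { return v < 0 ? (~(unsigned)(-v)) & 0x1Fu : (unsigned)v; }
static unsigned twos_complement(int v) { return (unsigned)v & 0x1Fu; }

static void print_bits5(const char *name, unsigned x) {
    printf("%-16s", name);
    for (int b = 4; b >= 0; b--) putchar((x >> b) & 1u ? '1' : '0');
    putchar('\n');
}

int main(void) {
    print_bits5("sign-magnitude", sign_magnitude(-1));   /* 10001 */
    print_bits5("ones' compl.",   ones_complement(-1));  /* 11110 */
    print_bits5("two's compl.",   twos_complement(-1));  /* 11111 */
    return 0;
}
```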
Most computers used six-bit character sets, because they adequately encoded Hollerith cards. It was a major revelation to designers of this period to realize that the data word should be a multiple of the character size. They began to design computers with 12, 24 and 36 bit data words (e.g. see the TX-2).

In this era, Grosch's law dominated computer design: computer cost increased as the square of its speed.

11.2 1960s: the computer revolution and CISC

One major problem with early computers was that a program for one would not work on others. Computer companies found that their customers had little reason to remain loyal to a particular brand, as the next computer they purchased would be incompatible anyway. At that point, price and performance were usually the only concerns.

In 1962, IBM tried a new approach to designing computers. The plan was to make an entire family of computers that could all run the same software, but with different performances, and at different prices. As users' requirements grew they could move up to larger computers, and still keep all of their investment in programs, data and storage media.

In order to do this they designed a single reference computer called the System/360 (or S/360). The System/360 was a virtual computer, a reference instruction set and capabilities that all machines in the family would support. In order to provide different classes of machines, each computer in the family would use more or less hardware emulation, and more or less microprogram emulation, to create a machine capable of running the entire System/360 instruction set.

For instance a low-end machine could include a very simple processor for low cost. However this would require the use of a larger microcode emulator to provide the rest of the instruction set, which would slow it down. A high-end machine would use a much more complex processor that could directly process more of the System/360 design, thus running a much simpler and faster emulator.

IBM chose to make the reference instruction set quite complex, and very capable. This was a conscious choice. Even though the computer was complex, its "control store" containing the microprogram would stay relatively small, and could be made with very fast memory. Another important effect was that a single instruction could describe quite a complex sequence of operations. Thus the computers would generally have to fetch fewer instructions from the main memory, which could be made slower, smaller and less expensive for a given combination of speed and price.

As the S/360 was to be a successor to both scientific machines like the 7090 and data processing machines like the 1401, it needed a design that could reasonably support all forms of processing. Hence the instruction set was designed to manipulate not just simple binary numbers, but text, scientific floating-point (similar to the numbers used in a calculator), and the binary coded decimal arithmetic needed by accounting systems.

Almost all following computers included these innovations in some form. This basic set of features is now called a "Complex Instruction Set Computer," or CISC (pronounced "sisk"), a term not invented until many years later, when RISC (Reduced Instruction Set Computer) began to get market share.

In many CISCs, an instruction could access either registers or memory, usually in several different ways. This made the CISCs easier to program, because a programmer could remember just thirty to a hundred instructions, and a set of three to ten addressing modes, rather than thousands of distinct instructions. This was called an "orthogonal instruction set." The PDP-11 and Motorola 68000 architecture are examples of nearly orthogonal instruction sets.

There was also the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation, and Honeywell) that competed against IBM at this time; however, IBM dominated the era with S/360.

The Burroughs Corporation (which later merged with Sperry/Univac to become Unisys) offered an alternative to S/360 with their B5000 series machines. In 1961, the B5000 had virtual memory, symmetric multiprocessing, a multi-programming operating system (Master Control Program, or MCP) written in ALGOL 60, and the industry's first recursive-descent compilers as early as 1963.

11.3 1970s: Large Scale Integration

In the 1960s, the Apollo guidance computer and Minuteman missile made the integrated circuit economical and practical.

[Image: An Intel 8008 Microprocessor]

Around 1971, the first calculator and clock chips began to show that very small computers might be possible. The first microprocessor was the Intel 4004, designed in 1971 for a calculator company (Busicom), and produced by Intel. In 1972, Intel introduced a microprocessor having a different architecture: the 8008. The 8008 is the direct ancestor of the current Core i7, even now maintaining code compatibility (every instruction of the 8008's instruction set has a direct equivalent in the Intel Core i7's much larger instruction set, although the opcode values are different).

By the mid-1970s, the use of integrated circuits in computers was commonplace. The whole decade consisted of upheavals caused by the shrinking price of transistors.

It became possible to put an entire CPU on a single printed circuit board. The result was that minicomputers, usually with 16-bit words and 4K to 64K of memory, came to be commonplace.

CISCs were believed to be the most powerful types of computers, because their microcode was small and could be stored in very high-speed memory. The CISC architecture also addressed the "semantic gap" as it was perceived at the time. This was a defined distance between the machine language and the higher level language people used to program a machine. It was felt that compilers could do a better job with a richer instruction set.

Custom CISCs were commonly constructed using "bit slice" computer logic such as the AMD 2900 chips, with custom microcode. A bit slice component is a piece of an ALU, register file or microsequencer. Most bit-slice integrated circuits were 4 bits wide.

By the early 1970s, the PDP-11 was developed, arguably the most advanced small computer of its day. Almost immediately, wider-word CISCs were introduced, the 32-bit VAX and 36-bit PDP-10.

Also, to control a cruise missile, Intel developed a more capable version of its 8008 microprocessor, the 8080.

IBM continued to make large, fast computers. However, the definition of large and fast now meant more than a megabyte of RAM, clock speeds near one megahertz, and tens of megabytes of disk drives.

IBM's System 370 was a version of the 360 tweaked to run virtual computing environments. The virtual computer was developed in order to reduce the possibility of an unrecoverable software failure.

The Burroughs B5000/B6000/B7000 series reached its largest market share. It was a stack computer whose OS was programmed in a dialect of Algol.

All these different developments competed for market share.

11.4 Early 1980s: the lessons of RISC

In the early 1980s, researchers at UC Berkeley and IBM both discovered that most computer language compilers and interpreters used only a small subset of the instructions of a CISC. Much of the power of the CPU was simply being ignored in real-world use. They realized that by making the computer simpler and less orthogonal, they could make it faster and less expensive at the same time.

At the same time, CPU calculation became faster in relation to the time for necessary memory accesses. Designers also experimented with using large sets of internal registers. The idea was to cache intermediate results in the registers under the control of the compiler. This also reduced the number of addressing modes and orthogonality.

The computer designs based on this theory were called Reduced Instruction Set Computers, or RISC. RISCs generally had larger numbers of registers, accessed by simpler instructions, with a few instructions specifically to load and store data to memory. The result was a very simple core CPU running at very high speed, supporting the exact sorts of operations the compilers were using anyway.

A common variation on the RISC design employs the Harvard architecture, as opposed to the Von Neumann or stored program architecture common to most other designs. In a Harvard architecture machine, the program and data occupy separate memory devices and can be accessed simultaneously. In Von Neumann machines the data and programs are mixed in a single memory device, requiring sequential accessing which produces the so-called "Von Neumann bottleneck."

One downside to the RISC design has been that the programs that run on them tend to be larger. This is because compilers have to generate longer sequences of the simpler instructions to accomplish the same results. Since these instructions need to be loaded from memory anyway, the larger code size offsets some of the RISC design's fast memory handling.

Recently, engineers have found ways to compress the reduced instruction sets so they fit in even smaller memory systems than CISCs. Examples of such compression schemes include the ARM's "Thumb" instruction set. In applications that do not need to run older binary software, compressed RISCs are coming to dominate sales.

Another approach to RISCs was the MISC, "niladic", or "zero-operand" instruction set. This approach realized that the majority of space in an instruction was used to identify the operands of the instruction. These machines placed the operands on a push-down (last-in, first-out) stack. The instruction set was supplemented with a few instructions to fetch and store memory. Most used simple caching to provide extremely fast RISC machines, with very compact code. Another benefit was that the interrupt latencies were extremely small, smaller than most CISC machines (a rare trait in RISC machines). The Burroughs large systems architecture uses this approach. The B5000 was designed in 1961, long before the term "RISC" was invented. The architecture puts six 8-bit instructions in a 48-bit word, and was a precursor to VLIW design (see below: 1990 to today).

The Burroughs architecture was one of the inspirations for Charles H. Moore's Forth programming language, which in turn inspired his later MISC chip designs. For example, his f20 cores had 31 5-bit instructions, which fit four to a 20-bit word.
RISC chips now dominate the market for 32-bit embedded systems. Smaller RISC chips are even becoming common in the cost-sensitive 8-bit embedded-system market. The main market for RISC CPUs has been systems that require low power or small size.

Even some CISC processors (based on architectures that were created before RISC became dominant), such as newer x86 processors, translate instructions internally into a RISC-like instruction set.

These numbers may surprise many, because the "market" is perceived to be desktop computers. x86 designs dominate desktop and notebook computer sales, but desktop and notebook computers are only a tiny fraction of the computers now sold. Most people in industrialised countries own more computers in embedded systems in their car and house than on their desks.

11.5 Mid-to-late 1980s: exploiting instruction level parallelism

In the mid-to-late 1980s, designers began using a technique known as "instruction pipelining", in which the processor works on multiple instructions in different stages of completion. For example, the processor may be retrieving the operands for the next instruction while calculating the result of the current one. Modern CPUs may use over a dozen such stages. MISC processors achieve single-cycle execution of instructions without the need for pipelining.
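The benefit of pipelining can be estimated with a back-of-the-envelope model: if every stage takes one clock and nothing stalls, N instructions on a k-stage pipeline finish in roughly k + (N − 1) cycles instead of k × N. The C sketch below simply evaluates that idealized formula with assumed values for N and k; real pipelines lose some of this gain to hazards and branches, as the following paragraphs explain.

```c
#include <stdio.h>

/* Idealized cycle counts for a k-stage pipeline with no stalls. */
static long long unpipelined(long long n, int k) { return n * k; }
static long long pipelined(long long n, int k)   { return k + (n - 1); }

int main(void) {
    long long n = 1000000;   /* number of instructions (assumed) */
    int k = 5;               /* pipeline stages (assumed)        */

    long long a = unpipelined(n, k), b = pipelined(n, k);
    printf("unpipelined: %lld cycles\n", a);
    printf("pipelined  : %lld cycles (speedup ~ %.2fx)\n", b, (double)a / b);
    return 0;
}
```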
A similar idea, introduced only a few years later, was to execute multiple instructions in parallel on separate arithmetic logic units (ALUs). Instead of operating on only one instruction at a time, the CPU will look for several similar instructions that are not dependent on each other, and execute them in parallel. This approach is called superscalar processor design.

Such techniques are limited by the degree of instruction level parallelism (ILP), the number of non-dependent instructions in the program code. Some programs are able to run very well on superscalar processors due to their inherent high ILP, notably graphics. However, more general problems do not have such high ILP, which makes the achievable speedups from these techniques lower.

Branching is one major culprit. For example, the program might add two numbers and branch to a different code segment if the number is bigger than a third number. In this case even if the branch operation is sent to the second ALU for processing, it still must wait for the results from the addition. It thus runs no faster than if there were only one ALU. The most common solution for this type of problem is to use a type of branch prediction.

To further the efficiency of multiple functional units which are available in superscalar designs, operand register dependencies were found to be another limiting factor. To minimize these dependencies, out-of-order execution of instructions was introduced. In such a scheme, the instruction results which complete out of order must be re-ordered in program order by the processor for the program to be restartable after an exception. Out-of-order execution was the main advancement of the computer industry during the 1990s. A similar concept is speculative execution, where instructions from one direction of a branch (the predicted direction) are executed before the branch direction is known. When the branch direction is known, the predicted direction and the actual direction are compared. If the predicted direction was correct, the speculatively-executed instructions and their results are kept; if it was incorrect, these instructions and their results are thrown out. Speculative execution coupled with an accurate branch predictor gives a large performance gain.

These advances, which were originally developed from research for RISC-style designs, allow modern CISC processors to execute twelve or more instructions per clock cycle, when traditional CISC designs could take twelve or more cycles to execute just one instruction.

The resulting instruction scheduling logic of these processors is large, complex and difficult to verify. Furthermore, the higher complexity requires more transistors, increasing power consumption and heat. In this respect RISC is superior because the instructions are simpler, have less interdependence and make superscalar implementations easier. However, as Intel has demonstrated, the concepts can be applied to a CISC design, given enough time and money.

Historical note: Some of these techniques (e.g. pipelining) were originally developed in the late 1950s by IBM on their Stretch mainframe computer.

11.6 1990 to today: looking forward

11.6.1 VLIW and EPIC

The instruction scheduling logic that makes a superscalar processor is just boolean logic. In the early 1990s, a significant innovation was to realize that the coordination of a multiple-ALU computer could be moved into the compiler, the software that translates a programmer's instructions into machine-level instructions.

This type of computer is called a very long instruction word (VLIW) computer.

Statically scheduling the instructions in the compiler (as opposed to letting the processor do the scheduling dynamically) can reduce CPU complexity. This can improve performance, reduce heat, and reduce cost.
[Image: Microelectronics circuits as complex as primate brains are foreseeable]

Unfortunately, the compiler lacks accurate knowledge of runtime scheduling issues. Merely changing the CPU core frequency multiplier will have an effect on scheduling. Actual operation of the program, as determined by input data, will have major effects on scheduling. To overcome these severe problems a VLIW system may be enhanced by adding the normal dynamic scheduling, losing some of the VLIW advantages.

Static scheduling in the compiler also assumes that dynamically generated code will be uncommon. Prior to the creation of Java, this was in fact true. It was reasonable to assume that slow compiles would only affect software developers. Now, with JIT virtual machines being used for many languages, slow code generation affects users as well.

There were several unsuccessful attempts to commercialize VLIW. The basic problem is that a VLIW computer does not scale to different price and performance points, as a dynamically scheduled computer can. Another issue is that compiler design for VLIW computers is extremely difficult, and the current crop of compilers (as of 2005) don't always produce optimal code for these platforms. Also, VLIW computers optimise for throughput, not low latency, so they were not attractive to the engineers designing controllers and other computers embedded in machinery. The embedded systems markets had often pioneered other computer improvements by providing a large market that did not care about compatibility with older software.

In January 2000, Transmeta Corporation took the interesting step of placing a compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case, x86 instructions) to an internal VLIW instruction set. This approach combines the hardware simplicity, low power and speed of VLIW RISC with the compact main memory system and software reverse-compatibility provided by popular CISC.

Intel's Itanium chip is based on what they call an Explicitly Parallel Instruction Computing (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity, by explicitly providing in each "bundle" of instructions information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions are also backward-compatible with current x86 software by means of an on-chip emulation mode. Integer performance was disappointing and despite improvements, sales in volume markets continue to be low.

11.6.2 Multi-threading

Current designs work best when the computer is running only a single program. However, nearly all modern operating systems allow the user to run multiple programs at the same time. For the CPU to change over and do work on another program requires expensive context switching. In contrast, multi-threaded CPUs can handle instructions from multiple programs at once.

To do this, such CPUs include several sets of registers. When a context switch occurs, the contents of the "working registers" are simply copied into one of a set of registers for this purpose.

Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat expensive in the chip space needed to implement them. This chip space might otherwise be used for some other purpose.

11.6.3 Multi-core

Multi-core CPUs are typically multiple CPU cores on the same die, connected to each other via a shared L2 or L3 cache, an on-die bus, or an on-die crossbar switch. All the CPU cores on the die share interconnect components with which to interface to other processors and the rest of the system. These components may include a front side bus interface, a memory controller to interface with DRAM, a cache coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The terms multi-core and MPU (which stands for Micro-Processor Unit) have come into general usage for a single die that contains multiple CPU cores.

Intelligent RAM

One way to work around the Von Neumann bottleneck is to mix a processor and DRAM all on one chip.

• The Berkeley IRAM Project
• eDRAM
• computational RAM
• Memristor

11.6.4 Reconfigurable logic

Main article: reconfigurable computing

Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower, or less-critical, parts of the program can be run by sharing their time on the CPU. This process has the capability to create devices such as software radios, by using digital signal processing to perform functions usually performed by analog electronics.

11.6.5 Open source processors

As the lines between hardware and software increasingly blur due to progress in design methodology and availability of chips such as FPGAs and cheaper production processes, even open source hardware has begun to appear. Loosely knit communities like OpenCores have recently announced completely open CPU architectures such as the OpenRISC, which can be readily implemented on FPGAs or in custom produced chips, by anyone, without paying license fees, and even established processor manufacturers like Sun Microsystems have released processor designs (e.g. OpenSPARC) under open-source licenses.

11.6.6 Asynchronous CPUs

Main article: Asynchronous CPU

Yet another possibility is the "clockless CPU" (asynchronous CPU). Unlike conventional processors, clockless processors have no central clock to coordinate the progress of data through the pipeline. Instead, stages of the CPU are coordinated using logic devices called "pipeline controls" or "FIFO sequencers." Basically, the pipeline controller clocks the next stage of logic when the existing stage is complete. In this way, a central clock is unnecessary.

It might be easier to implement high performance devices in asynchronous logic as opposed to clocked logic:

• Components can run at different speeds in the clockless CPU. In a clocked CPU, no component can run faster than the clock rate.
• In a clocked CPU, the clock can go no faster than the worst-case performance of the slowest stage. In a clockless CPU, when a stage finishes faster than normal, the next stage can immediately take the results rather than waiting for the next clock tick. A stage might finish faster than normal because of the particular data inputs (multiplication can be very fast if it is multiplying by 0 or 1), or because it is running at a higher voltage or lower temperature than normal.

Asynchronous logic proponents believe these capabilities would have these benefits:

• lower power dissipation for a given performance level
• highest possible execution speeds

The biggest disadvantage of the clockless CPU is that most CPU design tools assume a clocked CPU (a synchronous circuit), so making a clockless CPU (designing an asynchronous circuit) involves modifying the design tools to handle clockless logic and doing extra testing to ensure the design avoids metastable problems.

Even so, several asynchronous CPUs have been built, including:

• the ORDVAC and the identical ILLIAC I (1951)
• the ILLIAC II (1962), the fastest computer in the world at the time
• the Caltech Asynchronous Microprocessor, the world's first asynchronous microprocessor (1988)
• the ARM-implementing AMULET (1993 and 2000)
• the asynchronous implementation of MIPS R3000, dubbed MiniMIPS (1998)
• the SEAforth multi-core processor from Charles H. Moore[1]

11.6.7 Optical communication

One interesting possibility would be to eliminate the front side bus. Modern vertical laser diodes enable this change. In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure point, a busless system might be more reliable, as well.

In addition, current (2010) processors use 64- or 128-bit logic. Wavelength superposition could allow for data lanes and logic many orders of magnitude higher, without additional space or copper wires.
11.6.8 Optical processors

Another long-term possibility is to use light instead of electricity for the digital logic itself. In theory, this could run about 30% faster and use less power, as well as permit a direct interface with quantum computational devices. The chief problem with this approach is that for the foreseeable future, electronic devices are faster, smaller (i.e. cheaper) and more reliable. An important theoretical problem is that electronic computational elements are already smaller than some wavelengths of light, and therefore even wave-guide based optical logic may be uneconomic compared to electronic logic. The majority of development effort, as of 2006, is focused on electronic circuitry. See also optical computing.

11.6.9 Belt machine architecture

As opposed to conventional register machine or stack machine architecture, yet similar to Intel's Itanium architecture,[2] a temporal register addressing scheme has been proposed by Ivan Godard and company that is intended to greatly reduce the complexity of CPU hardware (specifically the number of internal registers and the resulting huge multiplexer trees).[3] While somewhat harder to read and debug than general-purpose register names, the scheme can be understood as a moving "conveyor belt" where the oldest values "drop off" the belt into oblivion. It is implemented by the Mill CPU architecture.

11.7 Timeline of events

• 1964. IBM releases the 32-bit IBM System/360 with memory protection.
• 1971. Intel releases the 4-bit Intel 4004, the world's first commercially available microprocessor.
• 1975. MOS Technology releases the 8-bit MOS Technology 6502, the first integrated processor to have an affordable price of $25 when the competing 6800 demanded $175.
• 1977. The first 32-bit VAX is sold, a VAX-11/780.
• 1978. Intel introduces the Intel 8086 and Intel 8088, the first x86 chips.
• 1981. Stanford MIPS is introduced, one of the first RISC designs.
• 1982. Intel introduces the Intel 80286, the first Intel processor that could run all the software written for its predecessors, the 8086 and 8088.
• 1984. Motorola introduces the Motorola 68020+68851, which enabled a 32-bit instruction set and virtualization.
• 1985. Intel introduces the Intel 80386, which adds a 32-bit instruction set to the x86 microarchitecture.
• 1989. Intel introduces the Intel 80486.
• 1993. Intel launches the original Pentium microprocessor, the first processor with an x86 superscalar microarchitecture.
• 1995. Intel introduces the Pentium Pro, which becomes the foundation for the Pentium II, Pentium III, Pentium M, and Intel Core architectures.
• 2000. AMD announces the x86-64 extension to the x86 microarchitecture.
• 2000. AMD hits 1 GHz with its Athlon microprocessor.
• 2000. Analog Devices introduces the Blackfin architecture.
• 2002. Intel releases a Pentium 4 with Hyper-Threading, the first modern desktop processor to implement simultaneous multithreading (SMT).
• 2003. AMD releases the Athlon 64, the first 64-bit consumer CPU.
• 2003. Intel introduces the Pentium M, a low power mobile derivative of the Pentium Pro architecture.
• 2005. AMD announces the Athlon 64 X2, the first x86 dual-core processor.
• 2006. Intel introduces the Core line of CPUs based on a modified Pentium M design.
• 2008. About ten billion CPUs are manufactured in 2008.
• 2010. Intel introduces the Core i3, i5, and i7 processors.
• 2011. AMD announces the world's first 8-core CPU for desktop PCs.

11.8 See also

• Microprocessor chronology

11.9 References

[1] SEAforth Overview: "... asynchronous circuit design throughout the chip. There is no central clock with billions of dumb nodes dissipating useless power. ... the processor cores are internally asynchronous themselves."
[2] http://williams.comp.ncat.edu/comp375/RISCprocessors.pdf
[3] "The Belt".
11.10 External links

• Great moments in microprocessor history by W. Warner, 2004
• Great Microprocessors of the Past and Present (V 13.4.0) by John Bayko, 2003
Chapter 12

Comparison of CPU microarchitectures

The following is a comparison of CPU microarchitectures.

12.1 See also


• CPU design

• Comparison of instruction set architectures

Chapter 13

Reduced instruction set computing

"RISC" redirects here. For other uses, see RISC (disambiguation).

[Image: A Sun UltraSPARC, a RISC microprocessor]

Reduced instruction set computing, or RISC (pronounced 'risk'), is a CPU design strategy based on the insight that a simplified instruction set (as opposed to a complex set) provides higher performance when combined with a microprocessor architecture capable of executing those instructions using fewer microprocessor cycles per instruction.[1] A computer based on this strategy is a reduced instruction set computer, also called RISC. The opposing architecture is called complex instruction set computing, i.e. CISC.

Various suggestions have been made regarding a precise definition of RISC, but the general concept is that of a system that uses a small, highly optimized set of instructions, rather than a more versatile set of instructions often found in other types of architectures. Another common trait is that RISC systems use the load/store architecture,[2] where memory is normally accessed only through specific instructions, rather than accessed as part of other instructions like add.

Although a number of systems from the 1960s and 70s have been identified as being forerunners of RISC, the modern version of the design dates to the 1980s. In particular, two projects at Stanford University and the University of California, Berkeley are most associated with the popularization of this concept. Stanford's design would go on to be commercialized as the successful MIPS architecture, while Berkeley's RISC gave its name to the entire concept, commercialized as the SPARC. Another success from this era was IBM's effort that eventually led to the Power Architecture. As these projects matured, a wide variety of similar designs flourished in the late 1980s and especially the early 1990s, representing a major force in the Unix workstation market as well as embedded processors in laser printers, routers and similar products.

Well-known RISC families include DEC Alpha, AMD 29k, ARC, ARM, Atmel AVR, Blackfin, Intel i860 and i960, MIPS, Motorola 88000, PA-RISC, Power (including PowerPC), RISC-V, SuperH, and SPARC. In the 21st century, the use of ARM architecture processors in smart phones and tablet computers such as the iPad, Android, and Windows RT tablets provided a wide user base for RISC-based systems. RISC processors are also used in supercomputers such as the K computer, the fastest on the TOP500 list in 2011, second on the 2012 list, and fourth on the 2013 list,[3][4] and Sequoia, the fastest in 2012 and third on the 2013 list.

13.1 History and development

A number of systems, going back to the 1970s (and even 1960s) have been credited as the first RISC architecture, partly based on their use of the load/store approach.[5] The term RISC was coined by David Patterson of the Berkeley RISC project, although somewhat similar concepts had appeared before.[6]

The CDC 6600 designed by Seymour Cray in 1964 used a load/store architecture with only two addressing modes (register+register, and register+immediate constant) and 74 opcodes, with the basic clock cycle/instruction issue rate being 10 times faster than the memory access time.[7] Partly due to the optimized load/store architecture of the CDC 6600, Jack Dongarra states that it can be considered a forerunner of modern RISC systems, although a number of other technical barriers needed to be overcome for the development of a modern RISC system.[8]

[Image: An IBM PowerPC 601 RISC microprocessor]

Michael J. Flynn views the first RISC system as the IBM 801 design, which began in 1975 by John Cocke and was completed in 1980.[2] The 801 was eventually produced in a single-chip form as the ROMP in 1981, which stood for 'Research OPD [Office Products Division] Micro Processor'.[9] As the name implies, this CPU was designed for "mini" tasks, and was also used in the IBM RT-PC in 1986, which turned out to be a commercial failure.[10] But the 801 inspired several research projects, including new ones at IBM that would eventually lead to the IBM POWER instruction set architecture.[11][12]

The most public RISC designs, however, were the results of university research programs run with funding from the DARPA VLSI Program. The VLSI Program, practically unknown today, led to a huge number of advances in chip design, fabrication, and even computer graphics. The Berkeley RISC project started in 1980 under the direction of David Patterson and Carlo H. Sequin.[6][13][14]

Berkeley RISC was based on gaining performance through the use of pipelining and an aggressive use of a technique known as register windowing.[13][14] In a traditional CPU, one has a small number of registers, and a program can use any register at any time. In a CPU with register windows, there are a huge number of registers, e.g. 128, but programs can only use a small number of them, e.g. eight, at any one time. A program that limits itself to eight registers per procedure can make very fast procedure calls: The call simply moves the window "down" by eight, to the set of eight registers used by that procedure, and the return moves the window back.[15]
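A register-window call can be modeled as nothing more than moving a base pointer into a large register array. The C sketch below uses the numbers from the text (128 physical registers, windows of eight) but is only a schematic model: it ignores window overflow to memory and the overlapping windows that real SPARC-style designs use for argument passing.

```c
#include <stdio.h>

#define PHYS_REGS   128
#define WINDOW_SIZE 8

static int regs[PHYS_REGS];    /* large physical register file */
static int base = 0;           /* current window position      */

/* The eight registers a procedure may touch are regs[base .. base+7]. */
static int *reg(int i)  { return &regs[base + i]; }
static void call(void)  { base += WINDOW_SIZE; }   /* slide the window "down" */
static void ret(void)   { base -= WINDOW_SIZE; }   /* slide it back           */

static void callee(void) {
    call();
    *reg(0) = 42;              /* uses a fresh window; the caller's r0 is untouched */
    ret();
}

int main(void) {
    *reg(0) = 7;
    callee();
    printf("caller's r0 is still %d\n", *reg(0));   /* prints 7 */
    return 0;
}
```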
The Berkeley RISC project delivered the RISC-I processor in 1982. Consisting of only 44,420 transistors (compared with averages of about 100,000 in newer CISC designs of the era), RISC-I had only 32 instructions, and yet completely outperformed any other single-chip design. They followed this up with the 40,760 transistor, 39 instruction RISC-II in 1983, which ran over three times as fast as RISC-I.[14]

The MIPS architecture grew out of a graduate course by John L. Hennessy at Stanford University in 1981, resulted in a functioning system in 1983, and could run simple programs by 1984.[16] The MIPS approach emphasized an aggressive clock cycle and the use of the pipeline, making sure it could be run as "full" as possible.[16] The MIPS system was followed by the MIPS-X, and in 1984 Hennessy and his colleagues formed MIPS Computer Systems.[16][17] The commercial venture resulted in the R2000 microprocessor in 1985, and was followed by the R3000 in 1988.[17]

[Image: Co-designer Yunsup Lee holding RISC-V prototype chip in 2013]

In the early 1980s, significant uncertainties surrounded the RISC concept, and it was uncertain if it could have a commercial future, but by the mid-1980s the concepts had matured enough to be seen as commercially viable.[10][16] In 1986 Hewlett Packard started using an early implementation of their PA-RISC in some of their computers.[10] In the meantime, the Berkeley RISC effort had become so well known that it eventually became the name for the entire concept, and in 1987 Sun Microsystems began shipping systems with the SPARC processor, directly based on the Berkeley RISC-II system.[10][18]

The US government Committee on Innovations in Computing and Communications credits the acceptance of the viability of the RISC concept to the success of the SPARC system.[10] The success of SPARC renewed interest within IBM, which released new RISC systems by 1990, and by 1995 RISC processors were the foundation of a $15 billion server industry.[10]

Since 2010 a new open source ISA, RISC-V, is under development at the University of California, Berkeley, for research purposes and as a free alternative to proprietary ISAs. As of 2014, version 2 of the userspace ISA is fixed.[19] The ISA is designed to be extensible from a barebones core sufficient for a small embedded processor to supercomputer and cloud computing use with standard and chip designer defined extensions and coprocessors. It has been tested in silicon design with the ROCKET SoC, which is also available as an open source processor generator in the CHISEL language.

13.2 Characteristics and design philosophy

For more details on this topic, see CPU design.

13.2.1 Instruction set

A common misunderstanding of the phrase "reduced instruction set computer" is the mistaken idea that instructions are simply eliminated, resulting in a smaller set of instructions.[20] In fact, over the years, RISC instruction sets have grown in size, and today many of them have a larger set of instructions than many CISC CPUs.[21][22] Some RISC processors such as the PowerPC have instruction sets as large as the CISC IBM System/370, for example; conversely, the DEC PDP-8—clearly a CISC CPU because many of its instructions involve multiple memory accesses—has only 8 basic instructions and a few extended instructions.

The term "reduced" in that phrase was intended to describe the fact that the amount of work any single instruction accomplishes is reduced—at most a single data memory cycle—compared to the "complex instructions" of CISC CPUs that may require dozens of data memory cycles in order to execute a single instruction.[23] In particular, RISC processors typically have separate instructions for I/O and data processing.

13.2.2 Hardware utilization

For any given level of general performance, a RISC chip will typically have far fewer transistors dedicated to the core logic, which originally allowed designers to increase the size of the register set and increase internal parallelism.

Other features that are typically found in RISC architectures are:

• Uniform instruction format, using a single word with the opcode in the same bit positions in every instruction, demanding less decoding (see the decoding sketch after this list);
• Identical general purpose registers, allowing any register to be used in any context, simplifying compiler design (although normally there are separate floating point registers);
• Simple addressing modes, with complex addressing performed via sequences of arithmetic, load-store operations, or both;
• Few data types in hardware; some CISCs have byte string instructions, or support complex numbers, but this is so far unlikely to be found on a RISC;
• Processor throughput of one instruction per cycle on average.

Exceptions abound, of course, within both CISC and RISC.
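The practical payoff of a uniform instruction format is that decoding reduces to a few fixed shifts and masks. The C sketch below pulls the fields out of an invented 32-bit, three-register format (a 6-bit opcode and three 5-bit register numbers); the field layout is hypothetical, chosen only to illustrate the point, and is not taken from any real RISC encoding.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed 32-bit format: [31:26]=opcode [25:21]=rd [20:16]=rs1 [15:11]=rs2 */
typedef struct { unsigned op, rd, rs1, rs2; } decoded_t;

static decoded_t decode(uint32_t word) {
    decoded_t d;
    d.op  = (word >> 26) & 0x3Fu;   /* same bit positions in every instruction */
    d.rd  = (word >> 21) & 0x1Fu;
    d.rs1 = (word >> 16) & 0x1Fu;
    d.rs2 = (word >> 11) & 0x1Fu;
    return d;
}

int main(void) {
    /* Assemble a sample word: op=3, rd=9, rs1=1, rs2=2. */
    uint32_t word = (3u << 26) | (9u << 21) | (1u << 16) | (2u << 11);
    decoded_t d = decode(word);
    printf("op=%u rd=%u rs1=%u rs2=%u\n", d.op, d.rd, d.rs1, d.rs2);
    return 0;
}
```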
RISC designs are also more likely to feature a Harvard memory model, where the instruction stream and the data stream are conceptually separated; this means that modifying the memory where code is held might not have any effect on the instructions executed by the processor (because the CPU has a separate instruction and data cache), at least until a special synchronization instruction is issued. On the upside, this allows both caches to be accessed simultaneously, which can often improve performance.

Many early RISC designs also shared the characteristic of having a branch delay slot. A branch delay slot is an instruction space immediately following a jump or branch. The instruction in this space is executed, whether or not the branch is taken (in other words the effect of the branch is delayed). This instruction keeps the ALU of the CPU busy for the extra time normally needed to perform a branch. Nowadays the branch delay slot is considered an unfortunate side effect of a particular strategy for implementing some RISC designs, and modern RISC designs generally do away with it (such as PowerPC and more recent versions of SPARC and MIPS).

Some aspects attributed to the first RISC-labeled designs around 1975 include the observations that the memory-restricted compilers of the time were often unable to take advantage of features intended to facilitate manual assembly coding, and that complex addressing modes take many cycles to perform due to the required additional memory accesses. It was argued that such functions would be better performed by sequences of simpler instructions if this could yield implementations small enough to leave room for many registers, reducing the number of slow memory accesses. In these simple designs, most instructions are of uniform length and similar structure, arithmetic operations are restricted to CPU registers and only separate load and store instructions access memory. These properties enable a better balancing of pipeline stages than before, making RISC pipelines significantly more efficient and allowing higher clock frequencies.

was to provide every possible addressing mode for ev- more than limited ability to take advantage of the features
ery instruction, known as orthogonality, to ease compiler provided by conventional CPUs.
implementation. Arithmetic operations could therefore It was also discovered that, on microcoded implementa-
often have results as well as operands directly in memory tions of certain architectures, complex operations tended
(in addition to register or immediate). to be slower than a sequence of simpler operations do-
The attitude at the time was that hardware design was ing the same thing. This was in part an effect of the fact
more mature than compiler design so this was in itself also that many designs were rushed, with little time to opti-
a reason to implement parts of the functionality in hard- mize or tune every instruction, but only those used most
ware or microcode rather than in a memory constrained often. One infamous example was the VAX's INDEX
compiler (or its generated code) alone. After the advent instruction.[13]
of RISC, this philosophy became retroactively known as As mentioned elsewhere, core memory had long since
complex instruction set computing, or CISC. been slower than many CPU designs. The advent of semi-
CPUs also had relatively few registers, for several reasons: conductor memory reduced this difference, but it was still
apparent that more registers (and later caches) would al-
• More registers also implies more time-consuming low higher CPU operating frequencies. Additional regis-
saving and restoring of register contents on the ma- ters would require sizeable chip or board areas which, at
chine stack. the time (1975), could be made available if the complex-
ity of the CPU logic was reduced.
• A large number of registers requires a large number Yet another impetus of both RISC and other designs
of instruction bits as register specifiers, meaning less came from practical measurements on real-world pro-
dense code (see below). grams. Andrew Tanenbaum summed up many of these,
demonstrating that processors often had oversized imme-
• CPU registers are more expensive than external diates. For instance, he showed that 98% of all the con-
memory locations; large register sets were cumber- stants in a program would fit in 13 bits, yet many CPU
some with limited circuit boards or chip integration. designs dedicated 16 or 32 bits to store them. This sug-
gests that, to reduce the number of memory accesses, a
An important force encouraging complexity was very fixed length machine could store constants in unused bits
limited main memories (on the order of kilobytes). It was of the instruction word itself, so that they would be im-
therefore advantageous for the code density—the density mediately ready when the CPU needs them (much like
of information held in computer programs—to be high, immediate addressing in a conventional design). This re-
leading to features such as highly encoded, variable length quired small opcodes in order to leave room for a reason-
instructions, doing data loading as well as calculation (as ably sized constant in a 32-bit instruction word.
Since many real-world programs spend most of their time executing simple operations, some researchers decided to focus on making those operations as fast as possible. The clock rate of a CPU is limited by the time it takes to execute the slowest sub-operation of any instruction; decreasing that cycle time often accelerates the execution of other instructions.[24] The focus on “reduced instructions” led to the resulting machine being called a “reduced instruction set computer” (RISC). The goal was to make instructions so simple that they could easily be pipelined, in order to achieve a single-clock throughput at high frequencies.

Later, it was noted that one of the most significant characteristics of RISC processors was that external memory was only accessible by a load or store instruction. All other instructions were limited to internal registers. This simplified many aspects of processor design: allowing instructions to be fixed-length, simplifying pipelines, and isolating the logic for dealing with the delay in completing a memory access (cache miss, etc.) to only two instructions. This led to RISC designs being referred to as load/store architectures.[25]
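As a small sketch of the load/store split, the C function below mirrors what such a machine must do for a read-modify-write of a memory location: only the explicit load and store touch memory, while the arithmetic works purely on a register-like temporary. The pseudo-assembly in the comment is illustrative and does not correspond to any particular ISA.

/* CISC-style (illustrative):  ADD [total], x      ; one memory-operand instruction
 * Load/store-style:           LD  r1, [total]
 *                             ADD r1, r1, r2      ; registers only
 *                             ST  r1, [total]
 */
#include <stdio.h>

static long total_in_memory = 100;   /* stands in for a memory operand */

static void add_to_total(long x)
{
    long r1 = total_in_memory;   /* load  */
    r1 = r1 + x;                 /* register-register add */
    total_in_memory = r1;        /* store */
}

int main(void)
{
    add_to_total(23);
    printf("%ld\n", total_in_memory);   /* prints 123 */
    return 0;
}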
One more issue is that some complex instructions are difficult to restart, e.g. following a page fault. In some cases,
restarting from the beginning will work (although wasteful), but in many cases this would give incorrect results. Therefore the machine needs to have some hidden state to remember which parts went through and what remains to be done. With a load/store machine, the program counter is sufficient to describe the state of the machine.

The main distinguishing feature of RISC is that the instruction set is optimized for a highly regular instruction pipeline flow.[20] All the other features associated with RISC—branch delay slots, separate instruction and data caches, load/store architecture, large register set, etc.—may seem to be a random assortment of unrelated features, but each of them is helpful in maintaining a regular pipeline flow that completes an instruction every clock cycle.

13.3 Comparison to other architectures

Some CPUs have been specifically designed to have a very small set of instructions – but these designs are very different from classic RISC designs, so they have been given other names such as minimal instruction set computer (MISC), or transport triggered architecture (TTA), etc.

Despite many successes, RISC has made few inroads into the desktop PC and commodity server markets, where Intel's x86 platform remains the dominant processor architecture. There are three main reasons for this:

1. A very large base of proprietary PC applications are written for x86 or compiled into x86 machine code, whereas no RISC platform has a similar installed base; hence PC users were locked into the x86.

2. Although RISC was indeed able to scale up in performance quite quickly and cheaply, Intel took advantage of its large market by spending vast amounts of money on processor development. Intel could spend many times as much as any RISC manufacturer on improving low-level design and manufacturing. The same could not be said about smaller firms like Cyrix and NexGen, but they realized that they could apply (tightly) pipelined design practices also to the x86 architecture, just as in the 486 and Pentium. The 6x86 and MII series did exactly this, but was more advanced; it implemented superscalar speculative execution via register renaming, directly at the x86-semantic level. Others, like the Nx586 and AMD K5, did the same, but indirectly, via dynamic microcode buffering and semi-independent superscalar scheduling and instruction dispatch at the micro-operation level (older or simpler ‘CISC’ designs typically execute rigid micro-operation sequences directly). The first available chip deploying such dynamic buffering and scheduling techniques was the NexGen Nx586, released in 1994; the AMD K5 was severely delayed and released in 1995.

3. Later, more powerful processors, such as Intel P6, AMD K6, AMD K7, and Pentium 4, employed similar dynamic buffering and scheduling principles and implemented loosely coupled superscalar (and speculative) execution of micro-operation sequences generated from several parallel x86 decoding stages. Today, these ideas have been further refined (some x86-pairs are instead merged into a more complex micro-operation, for example) and are still used by modern x86 processors such as Intel Core 2 and AMD K8.

Outside of the desktop arena, however, the ARM architecture (RISC and born at about the same time as SPARC) has to a degree broken the Intel stranglehold with its widespread use in smartphones, tablets and many forms of embedded device. It is also the case that since the Pentium Pro (P6) Intel has been using an internal RISC processor core for its processors.[26]

While early RISC designs differed significantly from contemporary CISC designs, by 2000 the highest-performing CPUs in the RISC line were almost indistinguishable from the highest-performing CPUs in the CISC line.[27][28][29]

13.4 RISC: from cell phones to supercomputers

RISC architectures are now used across a wide range of platforms, from cellular telephones and tablet computers to some of the world’s fastest supercomputers such as the K computer, the fastest on the TOP500 list in 2011.[3][4]

13.4.1 Low end and mobile systems

By the beginning of the 21st century, the majority of low-end and mobile systems relied on RISC architectures.[30] Examples include:

• The ARM architecture dominates the market for low-power and low-cost embedded systems (typically 200–1800 MHz in 2014). It is used in a number of systems such as most Android-based systems, the Apple iPhone and iPad, RIM devices, Nintendo Game Boy Advance and Nintendo DS, etc.

• The MIPS line (at one point used in many SGI computers), now in the PlayStation, PlayStation 2, Nintendo 64 and PlayStation Portable game consoles, and residential gateways like the Linksys WRT54G series.
• Hitachi's SuperH, originally in wide use in the Sega Super 32X, Saturn and Dreamcast, now developed and sold by Renesas as the SH4.

• Atmel AVR, used in a variety of products ranging from Xbox handheld controllers to BMW cars.

• RISC-V, the open source fifth Berkeley RISC ISA, with a 32-bit address space, a small core integer instruction set, an experimental “Compressed” ISA for code density, and designed for standard and special-purpose extensions.

13.4.2 High end RISC and supercomputing

• MIPS, by Silicon Graphics (ceased making MIPS-based systems in 2006).

• SPARC, by Oracle (previously Sun Microsystems) and Fujitsu.

• IBM's Power Architecture, used in many of IBM’s supercomputers, midrange servers and workstations.

• Hewlett-Packard's PA-RISC, also known as HP-PA (discontinued at the end of 2008).

• Alpha, used in single-board computers, workstations, servers and supercomputers from Digital Equipment Corporation, Compaq and HP (discontinued as of 2007).

• RISC-V, the open source fifth Berkeley RISC ISA, with 64 or 128-bit address spaces, and the integer core extended with floating point, atomics and vector processing, and designed to be extended with instructions for networking, IO, data processing etc. A 64-bit superscalar design, “Rocket”, is available for download.

13.5 See also

• Addressing mode
• Classic RISC pipeline
• Complex instruction set computer
• Computer architecture
• Instruction set
• Microprocessor
• Minimal instruction set computer

13.6 References

[1] Northern Illinois University, Department of Computer Science, “RISC - Reduced instruction set computer”.

[2] Flynn, Michael J. (1995). Computer architecture: pipelined and parallel processor design. pp. 54–56. ISBN 0867202041.

[3] “Japanese ‘K’ Computer Is Ranked Most Powerful”. The New York Times. 20 June 2011. Retrieved 20 June 2011.

[4] “Supercomputer “K computer” Takes First Place in World”. Fujitsu. Retrieved 20 June 2011.

[5] Fisher, Joseph A.; Faraboschi, Paolo; Young, Cliff (2005). Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. p. 55. ISBN 1558607668.

[6] Milestones in computer science and information technology by Edwin D. Reilly, 2003, ISBN 1-57356-521-0, page 50.

[7] Grishman, Ralph. Assembly Language Programming for the Control Data 6000 Series. Algorithmics Press. 1974. pg 12.

[8] Numerical Linear Algebra on High-Performance Computers by Jack J. Dongarra, et al., 1987, ISBN 0-89871-428-1, page 6.

[9] Processor architecture: from dataflow to superscalar and beyond by Jurij Šilc, Borut Robič, Theo Ungerer, 1999, ISBN 3-540-64798-8, page 33.

[10] Funding a Revolution: Government Support for Computing Research by Committee on Innovations in Computing and Communications, 1999, ISBN 0-309-06278-0, page 239.

[11] Processor design: system-on-chip computing for ASICs and FPGAs by Jari Nurmi, 2007, ISBN 1-4020-5529-3, pages 40-43.

[12] Readings in computer architecture by Mark Donald Hill, Norman Paul Jouppi, Gurindar Sohi, 1999, ISBN 1-55860-539-8, pages 252-254.

[13] Patterson, D. A.; Ditzel, D. R. (1980). “The case for the reduced instruction set computer”. ACM SIGARCH Computer Architecture News 8 (6): 25–33. doi:10.1145/641914.641917. CiteSeerX: 10.1.1.68.9623.

[14] RISC I: A Reduced Instruction Set VLSI Computer by David A. Patterson and Carlo H. Sequin, in the Proceedings of the 8th annual symposium on Computer Architecture, 1981.

[15] Design and Implementation of RISC I by Carlo Sequin and David Patterson, in the Proceedings of the Advanced Course on VLSI Architecture, University of Bristol, July 1982.

[16] The MIPS-X RISC microprocessor by Paul Chow, 1989, ISBN 0-7923-9045-8, pages xix-xx.

[17] Processor design: system-on-chip computing for ASICs and FPGAs by Jari Nurmi, 2007, ISBN 1-4020-5529-3, pages 52-53.
[18] Computer science handbook by Allen B. Tucker 2004


ISBN 1-58488-360-X page 100-6

[19] Waterman, Andrew; Lee, Yunsup; Patterson, David A.;


Asanović, Krste. “The RISC-V Instruction Set Manual,
Volume I: Base User-Level ISA version 2 (Technical Re-
port EECS-2014-54)". University of California, Berke-
ley. Retrieved 26 Dec 2014.

[20] Margarita Esponda and Raúl Rojas. “The RISC Concept -


A Survey of Implementations”. Section 2: “The confusion
around the RISC concept”. 1991.

[21] “RISC vs. CISC: the Post-RISC Era” by Jon “Hannibal”


Stokes (Arstechnica)

[22] “RISC versus CISC” by Lloyd Borrett Australian Personal


Computer, June 1991

[23] “Guide to RISC Processors for Programmers and Engi-


neers": Chapter 3: “RISC Principles” by Sivarama P.
Dandamudi, 2005, ISBN 978-0-387-21017-9. “the main
goal was not to reduce the number of instructions, but the
complexity”

[24] “Microprocessors From the Programmer’s Perspective”


by Andrew Schulman 1990

[25] Kevin Dowd. High Performance Computing. O'Reilly &


Associates, Inc. 1993.

[26] “Intel x86 Processors – CISC or RISC? Or both??" by


Sundar Srinivasan

[27] “Schaum’s Outline of Computer Architecture” by


Nicholas P. Carter 2002 p. 96 ISBN 0-07-136207-X

[28] “CISC, RISC, and DSP Microprocessors” by Douglas L.


Jones 2000

[29] “A History of Apple’s Operating Systems” by Amit Singh.


“the line between RISC and CISC has been growing
fuzzier over the years.”

[30] Guide to RISC processors: for programmers and engineers


by Sivarama P. Dandamudi - 2005 ISBN 0-387-21017-2
pages 121-123

13.7 External links


• RISC vs. CISC

• What is RISC
• The RISC-V Instruction Set Architecture

• Not Quite RISC


Chapter 14

Complex instruction set computing

Complex instruction set computing (CISC /ˈsɪsk/) is a CPU design where single instructions can execute several low-level operations (such as a load from memory, an arithmetic operation, and a memory store) or are capable of multi-step operations or addressing modes within single instructions. The term was retroactively coined in contrast to reduced instruction set computer (RISC)[1][2] and has therefore become something of an umbrella term for everything that is not RISC, i.e. everything from large and complex mainframes to simplistic microcontrollers where memory load and store operations are not separated from arithmetic instructions.

A modern RISC processor can therefore be much more complex than, say, a modern microcontroller using a CISC-labeled instruction set, especially in terms of implementation (electronic circuit complexity), but also in terms of the number of instructions or the complexity of their encoding patterns. The only differentiating characteristic (nearly) “guaranteed” is the fact that most RISC designs use a uniform instruction length for (almost) all instructions and employ strictly separate load/store instructions.

Examples of instruction set architectures that have been retroactively labeled CISC are System/360 through z/Architecture, the PDP-11 and VAX architectures, Data General Nova and many others. Well-known microprocessors and microcontrollers that have also been labeled CISC in many academic publications include the Motorola 6800, 6809 and 68000 families, the Intel 8080, iAPX432 and x86 family, the Zilog Z80, Z8 and Z8000 families, the National Semiconductor 32016 and NS320xx line, the MOS Technology 6502 family, the Intel 8051 family, and others.

Some designs have been regarded as borderline cases by some writers. For instance, the Microchip Technology PIC has been labeled RISC in some circles and CISC in others, and the 6502 and 6809 have both been described as “RISC-like”, although they have complex addressing modes as well as arithmetic instructions that access memory, contrary to the RISC principles.

14.1 Historical design context

14.1.1 Incitements and benefits

Before the RISC philosophy became prominent, many computer architects tried to bridge the so-called semantic gap, i.e. to design instruction sets that directly supported high-level programming constructs such as procedure calls, loop control, and complex addressing modes, allowing data structure and array accesses to be combined into single instructions. Instructions are also typically highly encoded in order to further enhance the code density. The compact nature of such instruction sets results in smaller program sizes and fewer (slow) main memory accesses, which at the time (early 1960s and onwards) resulted in a tremendous savings on the cost of computer memory and disc storage, as well as faster execution. It also meant good programming productivity even in assembly language, as high-level languages such as Fortran or Algol were not always available or appropriate (microprocessors in this category are sometimes still programmed in assembly language for certain types of critical applications).

New instructions

In the 1970s, analysis of high-level languages indicated that compilers produced some correspondingly complex machine language, and it was determined that new instructions could improve performance. Some instructions were added that were never intended to be used in assembly language but fit well with compiled high-level languages. Compilers were updated to take advantage of these instructions. The benefits of semantically rich instructions with compact encodings can be seen in modern processors as well, particularly in the high-performance segment where caches are a central component (as opposed to most embedded systems). This is because these fast, but complex and expensive, memories are inherently limited in size, making compact code beneficial. Of course, the fundamental reason they are needed is that main memories (i.e. dynamic RAM today) remain slow compared to a (high-performance) CPU core.
14.1.2 Design issues

While many designs achieved the aim of higher throughput at lower cost and also allowed high-level language constructs to be expressed by fewer instructions, it was observed that this was not always the case. For instance, low-end versions of complex architectures (i.e. using less hardware) could lead to situations where it was possible to improve performance by not using a complex instruction (such as a procedure call or enter instruction), but instead using a sequence of simpler instructions.

One reason for this was that architects (microcode writers) sometimes “over-designed” assembler language instructions, i.e. including features which were not possible to implement efficiently on the basic hardware available. This could, for instance, be “side effects” (above conventional flags), such as the setting of a register or memory location that was perhaps seldom used; if this was done via ordinary (non-duplicated) internal buses, or even the external bus, it would demand extra cycles every time, and thus be quite inefficient.

Even in balanced high-performance designs, highly encoded and (relatively) high-level instructions could be complicated to decode and execute efficiently within a limited transistor budget. Such architectures therefore required a great deal of work on the part of the processor designer in cases where a simpler, but (typically) slower, solution based on decode tables and/or microcode sequencing is not appropriate. At a time when transistors and other components were a limited resource, this also left fewer components and less opportunity for other types of performance optimizations.

The RISC idea

The circuitry that performs the actions defined by the microcode in many (but not all) CISC processors is, in itself, a processor which in many ways is reminiscent in structure of very early CPU designs. In the early 1970s, this gave rise to ideas to return to simpler processor designs in order to make it more feasible to cope without (then relatively large and expensive) ROM tables and/or PLA structures for sequencing and/or decoding. The first (retroactively) RISC-labeled processor (IBM 801 – IBM's Watson Research Center, mid-1970s) was a tightly pipelined simple machine originally intended to be used as an internal microcode kernel, or engine, in CISC designs, but also became the processor that introduced the RISC idea to a somewhat larger public. Simplicity and regularity also in the visible instruction set would make it easier to implement overlapping processor stages (pipelining) at the machine code level (i.e. the level seen by compilers). However, pipelining at that level was already used in some high-performance CISC “supercomputers” in order to reduce the instruction cycle time (despite the complications of implementing within the limited component count and wiring complexity feasible at the time). Internal microcode execution in CISC processors, on the other hand, could be more or less pipelined depending on the particular design, and therefore more or less akin to the basic structure of RISC processors.

Superscalar

In a more modern context, the complex variable-length encoding used by some of the typical CISC architectures makes it complicated, but still feasible, to build a superscalar implementation of a CISC programming model directly; the in-order superscalar original Pentium and the out-of-order superscalar Cyrix 6x86 are well-known examples of this. The frequent memory accesses for operands of a typical CISC machine may limit the instruction level parallelism that can be extracted from the code, although this is strongly mediated by the fast cache structures used in modern designs, as well as by other measures. Due to inherently compact and semantically rich instructions, the average amount of work performed per machine code unit (i.e. per byte or bit) is higher for a CISC than a RISC processor, which may give it a significant advantage in a modern cache-based implementation.

Transistors for logic, PLAs, and microcode are no longer scarce resources; only large high-speed cache memories are limited by the maximum number of transistors today. Although complex, the transistor count of CISC decoders does not grow exponentially like the total number of transistors per processor (the majority typically used for caches). Together with better tools and enhanced technologies, this has led to new implementations of highly encoded and variable-length designs without load-store limitations (i.e. non-RISC). This governs re-implementations of older architectures such as the ubiquitous x86 (see below) as well as new designs for microcontrollers for embedded systems, and similar uses. The superscalar complexity in the case of modern x86 was solved by converting instructions into one or more micro-operations and dynamically issuing those micro-operations, i.e. indirect and dynamic superscalar execution; the Pentium Pro and AMD K5 are early examples of this. It allows a fairly simple superscalar design to be located after the (fairly complex) decoders (and buffers), giving, so to speak, the best of both worlds in many respects.
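As a toy illustration of that decode step, the sketch below splits a made-up memory-operand instruction, "ADD r<dst>, [addr]", into two register-only micro-operations: a load into a scratch register followed by a register add. The structures, opcode names and scratch-register choice are assumptions for the example; real decoders are of course far more elaborate.

#include <stdio.h>

enum uop_kind { UOP_LOAD, UOP_ADD_REG };

struct uop { enum uop_kind kind; int dst, src, addr; };

/* Decode "ADD r<dst>, [addr]" into: LOAD tmp <- mem[addr]; ADD dst <- dst + tmp */
static int decode_add_mem(int dst, int addr, struct uop out[2])
{
    const int TMP = 7;                               /* scratch register */
    out[0] = (struct uop){ UOP_LOAD,    TMP, 0,   addr };
    out[1] = (struct uop){ UOP_ADD_REG, dst, TMP, 0    };
    return 2;                                        /* number of micro-ops */
}

int main(void)
{
    int regs[8] = { 0, 5 };      /* r1 = 5 */
    int mem[16] = { [3] = 37 };  /* mem[3] = 37 */
    struct uop u[2];
    int n = decode_add_mem(1, 3, u);

    for (int i = 0; i < n; i++) {                    /* "execute" the micro-ops */
        if (u[i].kind == UOP_LOAD)
            regs[u[i].dst] = mem[u[i].addr];
        else
            regs[u[i].dst] += regs[u[i].src];
    }
    printf("r1 = %d\n", regs[1]);                    /* prints 42 */
    return 0;
}

The point of the buffering described above is that, once instructions exist in this uniform micro-operation form, a relatively simple superscalar scheduler can reorder and issue them regardless of how irregular the original x86 encoding was.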
CISC and RISC terms

The terms CISC and RISC have become less meaningful with the continued evolution of both CISC and RISC designs and implementations. The first highly (or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD, Cyrix, and IBM, supported every instruction that their predecessors did, but achieved maximum efficiency only on a fairly simple x86 subset that was only a little more than a typical RISC instruction set (i.e.
without typical RISC load-store limitations). The Intel P5 Pentium generation was a superscalar version of these principles. However, modern x86 processors also (typically) decode and split instructions into dynamic sequences of internally buffered micro-operations, which not only helps execute a larger subset of instructions in a pipelined (overlapping) fashion, but also facilitates more advanced extraction of parallelism out of the code stream, for even higher performance.

Contrary to popular simplifications (present also in some academic texts), not all CISCs are microcoded or have “complex” instructions. As CISC became a catch-all term meaning anything that’s not a load-store (RISC) architecture, it’s not the number of instructions, nor the complexity of the implementation or of the instructions themselves, that define CISC, but the fact that arithmetic instructions also perform memory accesses. Compared to a small 8-bit CISC processor, a RISC floating-point instruction is complex. CISC does not even need to have complex addressing modes; 32 or 64-bit RISC processors may well have more complex addressing modes than small 8-bit CISC processors.

A PDP-10, a PDP-8, an Intel 386, an Intel 4004, a Motorola 68000, a System z mainframe, a Burroughs B5000, a VAX, a Zilog Z80000, and a MOS Technology 6502 all vary wildly in the number, sizes, and formats of instructions, the number, types, and sizes of registers, and the available data types. Some have hardware support for operations like scanning for a substring, arbitrary-precision BCD arithmetic, or transcendental functions, while others have only 8-bit addition and subtraction. But they are all in the CISC category because they have “load-operate” instructions that load and/or store memory contents within the same instructions that perform the actual calculations. For instance, the PDP-8, having only 8 fixed-length instructions and no microcode at all, is a CISC because of how the instructions work; PowerPC, which has over 230 instructions (more than some VAXes) and complex internals like register renaming and a reorder buffer, is a RISC; while Minimal CISC has 8 instructions, but is clearly a CISC because it combines memory access and computation in the same instructions.

Some of the problems and contradictions in this terminology will perhaps disappear as more systematic terms, such as (non) load/store, become more popular and eventually replace the imprecise and slightly counter-intuitive RISC/CISC terms.

14.2 See also

• CPU design
• Computer architecture
• Computer
• CPU
• MISC
• RISC
• ZISC
• VLIW
• Microprocessor

14.3 Notes

• Tanenbaum, Andrew S. (2006) Structured Computer Organization, Fifth Edition, Pearson Education, Inc. Upper Saddle River, NJ.

14.4 References

[1] Patterson, D. A.; Ditzel, D. R. (October 1980). “The case for the reduced instruction set computer”. SIGARCH Computer Architecture News (ACM) 8 (6): 25–33. doi:10.1145/641914.641917.

[2] Lakhe, Pravin R. (June 2013). “A Technology in Most Recent Processor is Complex Reduced Instruction Set Computers (CRISC): A Survey” (PDF). International Journal of Innovation Research and Studies 2 (6). pp. 711–715.

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the “relicensing” terms of the GFDL, version 1.3 or later.

14.5 Further reading

14.6 External links

• COSC 243_Computer Architecture 2
Chapter 15

Minimal instruction set computer

Generic 4-stage pipeline (Fetch, Decode, Execute, Write-back); the colored boxes represent instructions independent of each other.

(Not to be confused with multiple instruction set computer, also abbreviated MISC, such as the HLH Orion or the OROCHI VLIW processor.)

Minimal Instruction Set Computer (MISC) is a processor architecture with a very small number of basic operations and corresponding opcodes. Such instruction sets are commonly stack-based rather than register-based to reduce the size of operand specifiers.

Such a stack machine architecture is inherently simpler, since all instructions operate on the top-most stack entries. The result of the stack architecture is an overall smaller instruction set and a smaller and faster instruction decode unit, with overall faster operation of individual instructions.
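The sketch below shows how small such a stack-based instruction set can be: a handful of zero-operand opcodes (plus PUSH, which carries an immediate) acting on the top of a data stack. The opcode set and encoding are invented for illustration and do not correspond to any shipping MISC chip.

#include <stdio.h>

enum op { PUSH, ADD, MUL, DUP, PRINT, HALT };

static void run(const int *code)
{
    int stack[64], sp = 0;                 /* sp = number of items on the stack */
    for (int pc = 0; ; pc++) {
        switch (code[pc]) {
        case PUSH:  stack[sp++] = code[++pc];            break;
        case ADD:   sp--; stack[sp - 1] += stack[sp];    break;
        case MUL:   sp--; stack[sp - 1] *= stack[sp];    break;
        case DUP:   stack[sp] = stack[sp - 1]; sp++;     break;
        case PRINT: printf("%d\n", stack[sp - 1]);       break;
        case HALT:  return;
        }
    }
}

int main(void)
{
    /* Computes (2 + 3) * (2 + 3) = 25 without naming a single register. */
    const int program[] = { PUSH, 2, PUSH, 3, ADD, DUP, MUL, PRINT, HALT };
    run(program);
    return 0;
}

Because no instruction names source or destination registers, each opcode needs only a few bits, which is where the small operand specifiers mentioned above come from.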
Separate from the stack-based definition, a MISC architecture is also defined with respect to the number of instructions supported:

• Typically a Minimal Instruction Set Computer is viewed as having 32 or fewer instructions,[1][2][3] where NOP, RESET and CPUID type instructions are generally not counted by consensus due to their fundamental nature.

• 32 instructions is viewed as the highest allowable number of instructions for a MISC; 16 or 8 instructions are closer to what is meant by “minimal instructions”.

• A MISC CPU cannot have zero instructions, as that is a zero instruction set computer.

• A MISC CPU cannot have one instruction, as that is a one instruction set computer.[4]

• The implemented CPU instructions should by default not support a wide set of inputs, so this typically means an 8-bit or 16-bit CPU.

• If a CPU has an NX bit, it is more likely to be viewed as being CISC or RISC.

• MISC chips typically don't have hardware memory protection of any kind, unless there is an application-specific reason to have the feature.

• If a CPU has a microcode subsystem, that excludes it from being a MISC system.

• The only addressing mode considered acceptable for a MISC CPU to have is LOAD-STORE, the same as for RISC CPUs.

• MISC CPUs can typically have between 64 KB and 4 GB of accessible addressable memory—but most MISC designs are under 1 megabyte.

Also, the instruction pipelines of MISC as a rule tend to be very simple. Instruction pipelines, branch prediction, out-of-order execution, register renaming and speculative execution broadly exclude a CPU from being classified as a MISC architecture system.
15.1 History

Some of the first digital computers implemented with instruction sets were, by the modern definition, Minimal Instruction Set computers.

Among these various computers, only ILLIAC and ORDVAC had compatible instruction sets.

• Manchester Small-Scale Experimental Machine (SSEM), nicknamed “Baby” (University of Manchester, England), made its first successful run of a stored program on June 21, 1948.

• EDSAC (University of Cambridge, England) was the first practical stored-program electronic computer (May 1949).

• Manchester Mark 1 (University of Manchester, England), developed from the SSEM (June 1949).

• CSIRAC (Council for Scientific and Industrial Research), Australia (November 1949).

• EDVAC (Ballistic Research Laboratory, Computing Laboratory at Aberdeen Proving Ground, 1951).

• ORDVAC (U-Illinois) at Aberdeen Proving Ground, Maryland (completed November 1951).[5]

• IAS machine at Princeton University (January 1952).

• MANIAC I at Los Alamos Scientific Laboratory (March 1952).

• ILLIAC at the University of Illinois (September 1952).

Early stored-program computers

• The IBM SSEC had the ability to treat instructions as data, and was publicly demonstrated on January 27, 1948. This ability was claimed in a US patent.[6] However it was partially electromechanical, not fully electronic. In practice, instructions were read from paper tape due to its limited memory.[7]

• The Manchester SSEM (the Baby) was the first fully electronic computer to run a stored program. It ran a factoring program for 52 minutes on June 21, 1948, after running a simple division program and a program to show that two numbers were relatively prime.

• The ENIAC was modified to run as a primitive read-only stored-program computer (using the Function Tables for program ROM) and was demonstrated as such on September 16, 1948, running a program by Adele Goldstine for von Neumann.

• The BINAC ran some test programs in February, March, and April 1949, although it was not completed until September 1949.

• The Manchester Mark 1 developed from the SSEM project. An intermediate version of the Mark 1 was available to run programs in April 1949, but was not completed until October 1949.

• The EDSAC ran its first program on May 6, 1949.

• The EDVAC was delivered in August 1949, but it had problems that kept it from being put into regular operation until 1951.

• The CSIR Mk I ran its first program in November 1949.

• The SEAC was demonstrated in April 1950.

• The Pilot ACE ran its first program on May 10, 1950 and was demonstrated in December 1950.

• The SWAC was completed in July 1950.

• The Whirlwind was completed in December 1950 and was in actual use in April 1951.

• The first ERA Atlas (later the commercial ERA 1101/UNIVAC 1101) was installed in December 1950.

15.2 Design weaknesses

The disadvantage of a MISC is that instructions tend to have more sequential dependencies, reducing overall instruction-level parallelism.

MISC architectures have much in common with the Forth programming language and the Java Virtual Machine, which are weak in providing full instruction-level parallelism.

15.3 Notable CPUs

Probably the most commercially successful MISC was the original INMOS transputer architecture, which had no floating-point unit. However, many eight-bit microcontrollers (for embedded computer applications) fit into this category.

Each STEREO spacecraft includes two P24 MISC CPUs and two CPU24 MISC CPUs.[8][9]

15.4 See also

• Complex instruction set computing
• Reduced instruction set computing
15.5 References
[1] Chen-hanson Ting and Charles H. Moore. “MuP21--A
High Performance MISC Processor”. 1995.

[2] Michael A. Baxter. “Minimal instruction set computer ar-


chitecture and multiple instruction issue method”. 1993.

[3] Richard Halverson, Jr. and Art Lew. “An FPGA-Based


Minimal Instruction Set Computer”. 1995. p. 23.

[4] Kong, J.H.; Ang, L.-M.; Seng, K.P. “Minimal Instruction


Set AES Processor using Harvard Architecture”. 2010.
doi:10.1109/ICCSIT.2010.5564522

[5] James E. Robertson (1955), Illiac Design Techniques, re-


port number UIUCDCS-R-1955-146, Digital Computer
Laboratory, University of Illinois at Urbana-Champaign

[6] F.E. Hamilton, R.R. Seeber, R.A. Rowley, and E.S.


Hughes (January 19, 1949). “Selective Sequence Elec-
tronic Calculator”. US Patent 2,636,672. Retrieved April
28, 2011. Issued April 28, 1953.

[7] Herbert R.J. Grosch (1991), Computer: Bit Slices From a


Life, Third Millennium Books, ISBN 0-88733-085-1

[8] R. A. Mewaldt, C. M. S. Cohen, W. R. Cook, A. C. Cum-


mings, et. al. “The Low-Energy Telescope (LET) and
SEP Central Electronics for the STEREO Mission”.

[9] C.T. Russell. “The STEREO Mission”. 2008.

15.6 External links


• Forth MISC chip designs

• seaForth-24 - the next to latest multi-core MISC de-


sign from Chuck Moore

• Green Arrays - the latest multi-core MISC design


from Chuck Moore
Chapter 16

Comparison of instruction set architectures

16.1 Factors

16.1.1 Bits

Computer architectures are often described as n-bit architectures. Today n is often 8, 16, 32, or 64, but other sizes have been used. This is actually a strong simplification. A computer architecture often has a few more or less “natural” data sizes in the instruction set, but the hardware implementation of these may be very different. Many architectures have instructions operating on half and/or twice the size of the processor's major internal datapaths. Examples of this are the 8080, Z80, MC68000 as well as many others. On this type of implementation, a twice-as-wide operation typically also takes around twice as many clock cycles (which is not the case on high-performance implementations). On the 68000, for instance, this means 8 instead of 4 clock ticks, and this particular chip may be described as a 32-bit architecture with a 16-bit implementation. The external databus width is often not useful to determine the width of the architecture; the NS32008, NS32016 and NS32032 were basically the same 32-bit chip with different external data buses. The NS32764 had a 64-bit bus, but used 32-bit registers.

The width of addresses may or may not be different from the width of data. Early 32-bit microprocessors often had a 24-bit address, as did the System/360 processors.

16.1.2 Operands

Main article: instruction set § Number of operands

The number of operands is one of the factors that may give an indication about the performance of the instruction set. A three-operand architecture will allow

A := B + C

to be computed in one instruction. A two-operand architecture will allow

A := A + B

to be computed in one instruction, so two instructions will need to be executed to simulate a single three-operand instruction:

A := B
A := A + C

16.1.3 Endianness

An architecture may use “big” or “little” endianness, or both, or be configurable to use either. Little-endian processors order bytes in memory with the least significant byte of a multi-byte value in the lowest-numbered memory location. Big-endian architectures instead order them with the most significant byte at the lowest-numbered address. The x86 and the ARM architectures as well as several 8-bit architectures are little endian. Most RISC architectures (SPARC, Power, PowerPC, MIPS) were originally big endian, but many (including ARM) are now configurable.
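A quick way to see which ordering a given machine uses is to inspect the in-memory bytes of a known constant, as in the minimal C sketch below (the value 0x11223344 is arbitrary).

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t value = 0x11223344;
    unsigned char bytes[4];
    memcpy(bytes, &value, sizeof bytes);   /* copy the in-memory representation */

    printf("bytes at increasing addresses: %02X %02X %02X %02X\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);
    printf("this machine is %s-endian\n",
           bytes[0] == 0x44 ? "little" : "big");
    return 0;
}

On a little-endian machine the least significant byte (0x44) is stored first; on a big-endian machine the most significant byte (0x11) is.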
16.2 Instruction sets

Usually the number of registers is a power of two, e.g. 8, 16, 32. In some cases a hardwired-to-zero pseudo-register is included, as “part” of the register files of architectures, mostly to simplify indexing modes. This table only counts the integer “registers” usable by general instructions at any moment. Architectures always include special-purpose registers such as the program pointer (PC). Those are not counted unless mentioned. Note that some architectures, such as SPARC, have register windows; for those architectures, the count below indicates how many registers are available within a register window. Also, non-architected registers for register renaming are not counted.

The table below compares basic information about instruction sets to be implemented in the CPU architectures:
16.3 See also


• Central processing unit (CPU)

• CPU design
• Comparison of CPU microarchitectures

• Instruction set

• List of instruction sets


• Microprocessor

• Benchmark (computing)

Chapter 17

Computer data storage

1 GB of SDRAM mounted in a personal computer. An example of primary storage.

160 GB SDLT tape cartridge, an example of off-line storage. When used within a robotic tape library, it is classified as tertiary storage instead.

40 GB PATA hard disk drive (HDD); when connected to a computer it serves as secondary storage.

Computer data storage, often called storage or memory, is a technology consisting of computer components and recording media used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a computer is what manipulates data by performing computations. In practice, almost all computers use a storage hierarchy, which puts fast but expensive and small storage options close to the CPU and slower but larger and cheaper options farther away. Often the fast, volatile technologies (which lose data when powered off) are referred to as “memory”, while slower permanent technologies are referred to as “storage”, but these terms are often used interchangeably. In the Von Neumann architecture, the CPU consists of two main parts: the control unit and the arithmetic logic unit (ALU). The former controls the flow of data between the CPU and memory; the latter performs arithmetic and logical operations on data.

17.1 Functionality

Without a significant amount of memory, a computer would merely be able to perform fixed operations and immediately output the result. It would have to be reconfigured to change its behavior. This is acceptable for devices such as desk calculators, digital signal processors, and other specialised devices. Von Neumann machines differ in having a memory in which they store their operating instructions and data. Such computers are more versatile in that they do not need to have their hardware reconfigured for each new program, but can simply be reprogrammed with new in-memory instructions; they also tend to be simpler to design, in that a relatively simple processor may keep state between successive computations to build up complex procedural results. Most
modern computers are von Neumann machines.

17.2 Data organization and representation

A modern digital computer represents data using the binary numeral system. Text, numbers, pictures, audio, and nearly any other form of information can be converted into a string of bits, or binary digits, each of which has a value of 1 or 0. The most common unit of storage is the byte, equal to 8 bits. A piece of information can be handled by any computer or device whose storage space is large enough to accommodate the binary representation of the piece of information, or simply data. For example, the complete works of Shakespeare, about 1250 pages in print, can be stored in about five megabytes (40 million bits) with one byte per character.

Data is encoded by assigning a bit pattern to each character, digit, or multimedia object. Many standards exist for encoding (e.g., character encodings like ASCII, image encodings like JPEG, video encodings like MPEG-4).

By adding bits to each encoded unit, the redundancy allows the computer both to detect errors in coded data and to correct them based on mathematical algorithms. Errors occur regularly in low probabilities due to random bit value flipping, or “physical bit fatigue”, loss of the physical bit in storage of its ability to maintain a distinguishable value (0 or 1), or due to errors in inter- or intra-computer communication. A random bit flip (e.g., due to random radiation) is typically corrected upon detection. A bit, or a group of malfunctioning physical bits (not always the specific defective bit is known; group definition depends on the specific storage device), is typically automatically fenced out, taken out of use by the device, and replaced with another functioning equivalent group in the device, where the corrected bit values are restored (if possible). The cyclic redundancy check (CRC) method is typically used in communications and storage for error detection. A detected error is then retried.

Data compression methods allow, in many cases, a string of bits to be represented by a shorter bit string (“compress”) and the original string to be reconstructed (“decompress”) when needed. This utilizes substantially less storage (tens of percent) for many types of data at the cost of more computation (compress and decompress when needed). Analysis of the trade-off between storage cost savings and the costs of related computations and possible delays in data availability is done before deciding whether to keep certain data in a database compressed or not.

For security reasons certain types of data (e.g., credit-card information) may be kept encrypted in storage to prevent the possibility of unauthorized information reconstruction from chunks of storage snapshots.
in a database compressed or not. when powered down).[1] Historically, memory has been
For security reasons certain types of data (e.g., credit- called core, main memory, real storage or internal mem-
card information) may be kept encrypted in storage to ory while storage devices have been referred to as sec-
prevent the possibility of unauthorized information re- ondary storage, external memory or auxiliary/peripheral
construction from chunks of storage snapshots. storage.
17.3.1 Primary storage

Main article: Computer memory

Primary storage (also known as main memory or internal memory), often referred to simply as memory, is the only one directly accessible to the CPU. The CPU continuously reads instructions stored there and executes them as required. Any data actively operated on is also stored there in uniform manner.

Historically, early computers used delay lines, Williams tubes, or rotating magnetic drums as primary storage. By 1954, those unreliable methods were mostly replaced by magnetic core memory. Core memory remained dominant until the 1970s, when advances in integrated circuit technology allowed semiconductor memory to become economically competitive.

This led to modern random-access memory (RAM). It is small-sized, light, but quite expensive at the same time. (The particular types of RAM used for primary storage are also volatile, i.e. they lose the information when not powered).

As shown in the diagram, traditionally there are two more sub-layers of the primary storage, besides main large-capacity RAM:

• Processor registers are located inside the processor. Each register typically holds a word of data (often 32 or 64 bits). CPU instructions instruct the arithmetic and logic unit to perform various calculations or other operations on this data (or with the help of it). Registers are the fastest of all forms of computer data storage.

• Processor cache is an intermediate stage between ultra-fast registers and much slower main memory. It was introduced solely to improve the performance of computers. Most actively used information in the main memory is just duplicated in the cache memory, which is faster, but of much lesser capacity. On the other hand, main memory is much slower, but has a much greater storage capacity than processor registers. Multi-level hierarchical cache setup is also commonly used—primary cache being smallest, fastest and located inside the processor; secondary cache being somewhat larger and slower.

Main memory is directly or indirectly connected to the central processing unit via a memory bus. It is actually two buses (not on the diagram): an address bus and a data bus. The CPU firstly sends a number through an address bus, a number called the memory address, that indicates the desired location of data. Then it reads or writes the data in the memory cells using the data bus. Additionally, a memory management unit (MMU) is a small device between CPU and RAM recalculating the actual memory address, for example to provide an abstraction of virtual memory or other tasks.

As the RAM types used for primary storage are volatile (cleared at start up), a computer containing only such storage would not have a source to read instructions from, in order to start the computer. Hence, non-volatile primary storage containing a small startup program (BIOS) is used to bootstrap the computer, that is, to read a larger program from non-volatile secondary storage to RAM and start to execute it. A non-volatile technology used for this purpose is called ROM, for read-only memory (the terminology may be somewhat confusing as most ROM types are also capable of random access).

Many types of “ROM” are not literally read only, as updates are possible; however it is slow and memory must be erased in large portions before it can be re-written. Some embedded systems run programs directly from ROM (or similar), because such programs are rarely changed. Standard computers do not store non-rudimentary programs in ROM, and rather, use large capacities of secondary storage, which is non-volatile as well, and not as costly.

Recently, primary storage and secondary storage in some uses refer to what was historically called, respectively, secondary storage and tertiary storage.[2]

17.3.2 Secondary storage

A hard disk drive with protective cover removed.

Main article: Auxiliary memory

Secondary storage (also known as external memory or auxiliary storage) differs from primary storage in that it is not directly accessible by the CPU. The computer usually uses its input/output channels to access secondary storage and transfers the desired data using an intermediate area in primary storage. Secondary storage does not lose the data when the device is powered down—it is non-volatile. Per unit, it is typically also two orders of magnitude less expensive than primary storage. Modern computer systems typically have two orders of magnitude more secondary storage than primary storage and data are
kept for a longer time there.

In modern computers, hard disk drives are usually used as secondary storage. The time taken to access a given byte of information stored on a hard disk is typically a few thousandths of a second, or milliseconds. By contrast, the time taken to access a given byte of information stored in random-access memory is measured in billionths of a second, or nanoseconds. This illustrates the significant access-time difference which distinguishes solid-state memory from rotating magnetic storage devices: hard disks are typically about a million times slower than memory. Rotating optical storage devices, such as CD and DVD drives, have even longer access times. With disk drives, once the disk read/write head reaches the proper placement and the data of interest rotates under it, subsequent data on the track are very fast to access. To reduce the seek time and rotational latency, data are transferred to and from disks in large contiguous blocks. When data reside on disk, block access to hide latency offers a ray of hope in designing efficient external memory algorithms. Sequential or block access on disks is orders of magnitude faster than random access, and many sophisticated paradigms have been developed to design efficient algorithms based upon sequential and block access. Another way to reduce the I/O bottleneck is to use multiple disks in parallel in order to increase the bandwidth between primary and secondary memory.[3]

Some other examples of secondary storage technologies are: flash memory (e.g. USB flash drives or keys), floppy disks, magnetic tape, paper tape, punched cards, standalone RAM disks, and Iomega Zip drives.

The secondary storage is often formatted according to a file system format, which provides the abstraction necessary to organize data into files and directories, providing also additional information (called metadata) describing the owner of a certain file, the access time, the access permissions, and other information.

Most computer operating systems use the concept of virtual memory, allowing utilization of more primary storage capacity than is physically available in the system. As the primary memory fills up, the system moves the least-used chunks (pages) to secondary storage devices (to a swap file or page file), retrieving them later when they are needed. As more of these retrievals from slower secondary storage are necessary, the more the overall system performance is degraded.

17.3.3 Tertiary storage

Large tape library. Tape cartridges placed on shelves in the front, robotic arm moving in the back. Visible height of the library is about 180 cm.

Tertiary storage or tertiary memory provides a third level of storage.[4] Typically it involves a robotic mechanism which will mount (insert) and dismount removable mass storage media into a storage device according to the system’s demands; this data is often copied to secondary storage before use. It is primarily used for archiving rarely accessed information, since it is much slower than secondary storage (e.g. 5–60 seconds vs. 1–10 milliseconds). This is primarily useful for extraordinarily large data stores, accessed without human operators. Typical examples include tape libraries and optical jukeboxes.

When a computer needs to read information from the tertiary storage, it will first consult a catalog database to determine which tape or disc contains the information. Next, the computer will instruct a robotic arm to fetch the medium and place it in a drive. When the computer has finished reading the information, the robotic arm will return the medium to its place in the library.

17.3.4 Off-line storage

See also: Near-line storage

Off-line storage is a computer data storage on a medium or a device that is not under the control of a processing unit.[5] The medium is recorded, usually in a secondary or tertiary storage device, and then physically removed or disconnected. It must be inserted or connected by a human operator before a computer can access it again. Unlike tertiary storage, it cannot be accessed without human interaction.

Off-line storage is used to transfer information, since the detached medium can be easily physically transported.
17.4. CHARACTERISTICS OF STORAGE 95

17.3.4 Off-line storage

See also: Near-line storage

Off-line storage is computer data storage on a medium or a device that is not under the control of a processing unit.[5] The medium is recorded, usually in a secondary or tertiary storage device, and then physically removed or disconnected. It must be inserted or connected by a human operator before a computer can access it again. Unlike tertiary storage, it cannot be accessed without human interaction.

Off-line storage is used to transfer information, since the detached medium can be easily physically transported. Additionally, in case a disaster, for example a fire, destroys the original data, a medium in a remote location will probably be unaffected, enabling disaster recovery. Off-line storage increases general information security, since it is physically inaccessible from a computer, so data confidentiality or integrity cannot be affected by computer-based attack techniques. Also, if the information stored for archival purposes is rarely accessed, off-line storage is less expensive than tertiary storage.

In modern personal computers, most secondary and tertiary storage media are also used for off-line storage. Optical discs and flash memory devices are most popular, and to a much lesser extent removable hard disk drives. In enterprise use, magnetic tape is predominant. Older examples are floppy disks, Zip disks, and punched cards.
17.4 Characteristics of storage

A 1GB DDR RAM module (detail)

Storage technologies at all levels of the storage hierarchy can be differentiated by evaluating certain core characteristics as well as measuring characteristics specific to a particular implementation. These core characteristics are volatility, mutability, accessibility, and addressability. For any particular implementation of any storage technology, the characteristics worth measuring are capacity and performance.

17.4.1 Volatility

Non-volatile memory Will retain the stored information even if it is not constantly supplied with electric power. It is suitable for long-term storage of information.

Volatile memory Requires constant power to maintain the stored information. The fastest memory technologies of today are volatile ones, although that is not a universal rule. Since primary storage is required to be very fast, it predominantly uses volatile memory.

Dynamic random-access memory A form of volatile memory which also requires the stored information to be periodically re-read and re-written, or refreshed, otherwise it would vanish.

Static random-access memory A form of volatile memory similar to DRAM with the exception that it never needs to be refreshed as long as power is applied. (It loses its content if power is removed.)

An uninterruptible power supply can be used to give a computer a brief window of time to move information from primary volatile storage into non-volatile storage before the batteries are exhausted. Some systems (e.g. the EMC Symmetrix) have integrated batteries that maintain volatile storage for several minutes.

17.4.2 Mutability

Read/write storage or mutable storage Allows information to be overwritten at any time. A computer without some amount of read/write storage for primary storage purposes would be useless for many tasks. Modern computers typically use read/write storage also for secondary storage.

Read only storage Retains the information stored at the time of manufacture; write once storage (Write Once Read Many) allows the information to be written only once at some point after manufacture. These are called immutable storage. Immutable storage is used for tertiary and off-line storage. Examples include CD-ROM and CD-R.

Slow write, fast read storage Read/write storage which allows information to be overwritten multiple times, but with the write operation being much slower than the read operation. Examples include CD-RW and flash memory.

17.4.3 Accessibility

Random access Any location in storage can be accessed at any moment in approximately the same amount of time. This characteristic is well suited for primary and secondary storage. Most semiconductor memories and disk drives provide random access.
Sequential access The accessing of pieces of information is in a serial order, one after the other; therefore the time to access a particular piece of information depends upon which piece of information was last accessed. This characteristic is typical of off-line storage.

17.4.4 Addressability

Location-addressable Each individually accessible unit of information in storage is selected with its numerical memory address. In modern computers, location-addressable storage is usually limited to primary storage, accessed internally by computer programs, since location-addressability is very efficient but burdensome for humans.

File addressable Information is divided into files of variable length, and a particular file is selected with human-readable directory and file names. The underlying device is still location-addressable, but the operating system of a computer provides the file system abstraction to make the operation more understandable. In modern computers, secondary, tertiary and off-line storage use file systems.

Content-addressable Each individually accessible unit of information is selected based on (part of) the contents stored there. Content-addressable storage can be implemented using software (a computer program) or hardware (a computer device), with hardware being the faster but more expensive option. Hardware content-addressable memory is often used in a computer’s CPU cache.
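A small sketch can contrast the three addressing styles; the byte array, dictionaries and digest below are toy stand-ins, not real memory, file-system or cache interfaces.

import hashlib

# Location-addressable: a unit is selected by its numerical address.
memory = bytearray(1024)
memory[64:68] = b"DATA"
unit_at_64 = bytes(memory[64:68])

# File-addressable: a unit is selected by a human-readable name; the file
# system maps the name to locations on the underlying device.
files = {"/home/user/notes.txt": b"DATA"}
unit_by_name = files["/home/user/notes.txt"]

# Content-addressable: a unit is selected by (a digest of) its own contents,
# as in content-addressable stores and, conceptually, CPU cache tag lookups.
cas = {}
digest = hashlib.sha256(b"DATA").hexdigest()
cas[digest] = b"DATA"
unit_by_content = cas[digest]

assert unit_at_64 == unit_by_name == unit_by_content == b"DATA"

In each case the same unit of data is reached, but the key used to select it differs: a numerical address, a human-readable name, or a digest of the content itself.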
17.4.5 Capacity

Raw capacity The total amount of stored information that a storage device or medium can hold. It is expressed as a quantity of bits or bytes (e.g. 10.4 megabytes).

Memory storage density The compactness of stored information. It is the storage capacity of a medium divided by a unit of length, area or volume (e.g. 1.2 megabytes per square inch).
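As a quick worked example of the density formula (with made-up figures):

# Memory storage density = capacity divided by the recording area (made-up figures).
capacity_megabytes = 1200.0      # assumed capacity of one recording surface
recording_area_sq_inch = 1000.0  # assumed recording area in square inches

density = capacity_megabytes / recording_area_sq_inch
print(density, "megabytes per square inch")   # -> 1.2 megabytes per square inch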
17.4.6 Performance

Latency The time it takes to access a particular location in storage. The relevant unit of measurement is typically the nanosecond for primary storage, the millisecond for secondary storage, and the second for tertiary storage. It may make sense to separate read latency and write latency, and in the case of sequential access storage, minimum, maximum and average latency.

Throughput The rate at which information can be read from or written to the storage. In computer data storage, throughput is usually expressed in terms of megabytes per second (MB/s), though bit rate may also be used. As with latency, read rate and write rate may need to be differentiated. Accessing media sequentially, as opposed to randomly, typically yields maximum throughput. (A small worked example combining latency and throughput follows at the end of this subsection.)

Granularity The size of the largest “chunk” of data that can be efficiently accessed as a single unit, e.g. without introducing more latency.

Reliability The probability of spontaneous bit value change under various conditions, or the overall failure rate.
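Latency and throughput combine into a simple transfer-time estimate: one positioning penalty plus the transfer itself. The figures below are illustrative assumptions for a secondary-storage device, not measurements from this chapter.

# Rough transfer-time model: one latency penalty plus size / throughput.
LATENCY_S = 0.008            # assumed ~8 ms access latency
THROUGHPUT_MB_PER_S = 150.0  # assumed ~150 MB/s sustained throughput

def request_time_seconds(size_mb):
    return LATENCY_S + size_mb / THROUGHPUT_MB_PER_S

for size_mb in (0.004, 4.0, 4000.0):   # roughly 4 KB, 4 MB, 4 GB
    print(size_mb, "MB ->", round(request_time_seconds(size_mb), 3), "s")

With these assumed figures a 4 KB request is dominated almost entirely by latency, while a 4 GB request is dominated by throughput, which is why sequential access gives the highest effective transfer rates.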
17.4.7 Energy use

• Storage devices that reduce fan usage, automatically shut down during inactivity, and low-power hard drives can reduce energy consumption by 90 percent.[6]

• 2.5-inch hard disk drives often consume less power than larger ones.[7][8] Low-capacity solid-state drives have no moving parts and consume less power than hard disks.[9][10][11] Also, memory may use more power than hard disks.[11]

17.5 Storage Media

As of 2011, the most commonly used data storage technologies are semiconductor, magnetic, and optical, while paper still sees some limited usage. Media is a common name for what actually holds the data in the storage device. Some other fundamental storage technologies have also been used in the past or are proposed for development.

17.5.1 Semiconductor

Semiconductor memory uses semiconductor-based integrated circuits to store information. A semiconductor memory chip may contain millions of tiny transistors or capacitors. Both volatile and non-volatile forms of semiconductor memory exist. In modern computers, primary storage almost exclusively consists of dynamic volatile semiconductor memory, or dynamic random-access memory. Since the turn of the century, a type of non-volatile semiconductor memory known as flash memory has steadily gained share as off-line storage for home computers. Non-volatile semiconductor memory is also used for secondary storage in various advanced electronic devices and specialized computers.

As early as 2006, notebook and desktop computer manufacturers started using flash-based solid-state drives (SSDs) as default configuration options for secondary storage, either in addition to or instead of the more traditional HDD.[12][13][14][15][16]

17.5.2 Magnetic

Magnetic storage uses different patterns of magnetization on a magnetically coated surface to store information. Magnetic storage is non-volatile. The information is accessed using one or more read/write heads which may contain one or more recording transducers. A read/write head only covers a part of the surface, so the head, the medium, or both must be moved relative to one another in order to access data. In modern computers, magnetic storage takes these forms:

• Magnetic disk
• Floppy disk, used for off-line storage
• Hard disk drive, used for secondary storage
• Magnetic tape, used for tertiary and off-line storage

In early computers, magnetic storage was also used as:

• Primary storage in the form of magnetic memory, such as core memory, core rope memory, thin-film memory and/or twistor memory.
• Tertiary (e.g. NCR CRAM) or off-line storage in the form of magnetic cards.
• Magnetic tape was then often used for secondary storage.

17.5.3 Optical

Optical storage, the typical optical disc, stores information in deformities on the surface of a circular disc and reads this information by illuminating the surface with a laser diode and observing the reflection. Optical disc storage is non-volatile. The deformities may be permanent (read only media), formed once (write once media) or reversible (recordable or read/write media). The following forms are currently in common use:[17]

• CD, CD-ROM, DVD, BD-ROM: Read only storage, used for mass distribution of digital information (music, video, computer programs)
• CD-R, DVD-R, DVD+R, BD-R: Write once storage, used for tertiary and off-line storage
• CD-RW, DVD-RW, DVD+RW, DVD-RAM, BD-RE: Slow write, fast read storage, used for tertiary and off-line storage
• Ultra Density Optical (UDO) is similar in capacity to BD-R or BD-RE and is slow write, fast read storage used for tertiary and off-line storage.

Magneto-optical disc storage is optical disc storage where the magnetic state on a ferromagnetic surface stores information. The information is read optically and written by combining magnetic and optical methods. Magneto-optical disc storage is non-volatile, sequential access, slow write, fast read storage used for tertiary and off-line storage.

3D optical data storage has also been proposed.

17.5.4 Paper

Paper data storage, typically in the form of paper tape or punched cards, has long been used to store information for automatic processing, particularly before general-purpose computers existed. Information was recorded by punching holes into the paper or cardboard medium and was read mechanically (or later optically) to determine whether a particular location on the medium was solid or contained a hole. A few technologies allow people to make marks on paper that are easily read by machine; these are widely used for tabulating votes and grading standardized tests. Barcodes made it possible for any object that was to be sold or transported to have some computer-readable information securely attached to it.

17.5.5 Other Storage Media or Substrates

Vacuum tube memory A Williams tube used a cathode ray tube, and a Selectron tube used a large vacuum tube, to store information. These primary storage devices were short-lived in the market, since the Williams tube was unreliable and the Selectron tube was expensive.

Electro-acoustic memory Delay line memory used sound waves in a substance such as mercury to store information. Delay line memory was dynamic volatile, cycle sequential read/write storage, and was used for primary storage.

Optical tape is a medium for optical storage, generally consisting of a long and narrow strip of plastic onto which patterns can be written and from which the patterns can be read back. It shares some technologies with cinema film stock and optical discs, but is compatible with neither. The motivation behind developing this technology was the possibility of far greater storage capacities than either magnetic tape or optical discs.
Phase-change memory uses different physical phases of a phase-change material to store information in an X-Y addressable matrix, and reads the information by observing the varying electrical resistance of the material. Phase-change memory would be non-volatile, random-access read/write storage, and might be used for primary, secondary and off-line storage. Most rewritable and many write once optical disks already use phase-change material to store information.

Holographic data storage stores information optically inside crystals or photopolymers. Holographic storage can utilize the whole volume of the storage medium, unlike optical disc storage, which is limited to a small number of surface layers. Holographic storage would be non-volatile, sequential access, and either write once or read/write storage. It might be used for secondary and off-line storage. See Holographic Versatile Disc (HVD).

Molecular memory stores information in a polymer that can store electric charge. Molecular memory might be especially suited for primary storage. The theoretical storage capacity of molecular memory is 10 terabits per square inch.[18]

17.6 Related technologies

17.6.1 Redundancy

Main articles: Disk mirroring and RAID

See also: Disk storage replication

While the malfunction of a group of bits may be resolved by error detection and correction mechanisms (see above), a storage device malfunction requires different solutions. The following solutions are commonly used and valid for most storage devices:

• Device mirroring (replication) – A common solution to the problem is constantly maintaining an identical copy of device content on another device (typically of the same type). The downside is that this doubles the storage, and both devices (copies) need to be updated simultaneously, with some overhead and possibly some delays. The upside is possible concurrent reading of the same data group by two independent processes, which increases performance. When one of the replicated devices is detected to be defective, the other copy is still operational and is used to generate a new copy on another device (usually available and operational in a pool of stand-by devices for this purpose).

• Redundant array of independent disks (RAID) – This method generalizes the device mirroring above by allowing one device in a group of N devices to fail and be replaced, with its content restored (device mirroring is RAID with N=2). RAID groups of N=5 or N=6 are common. N>2 saves storage compared with N=2, at the cost of more processing during both regular operation (with often reduced performance) and defective device replacement.
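Parity-based RAID levels (for example RAID 5) are one common way such restoration is done: the failed member’s content is recomputed by XOR-ing the surviving members with a parity block. The sketch below is a toy model with three tiny byte-string "devices", not a real RAID implementation.

from functools import reduce

# Toy XOR-parity sketch, in the spirit of parity-based RAID levels such as RAID 5.
devices = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
parity = bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*devices))

# Suppose the second device fails: XOR-ing the survivors with the parity block
# reproduces its content exactly.
survivors = [devices[0], devices[2], parity]
restored = bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*survivors))
assert restored == devices[1]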
Device mirroring and typical RAID are designed to handle a single device failure in the RAID group of devices. However, if a second failure occurs before the RAID group is completely repaired from the first failure, then data can be lost. The probability of a single failure is typically small. Thus the probability of two failures in the same RAID group in close time proximity is much smaller (approximately the probability squared, i.e. multiplied by itself). If a database cannot tolerate even such a smaller probability of data loss, then the RAID group itself is replicated (mirrored). In many cases such mirroring is done geographically remotely, in a different storage array, so that it also handles recovery from disasters (see disaster recovery above).
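The "probability squared" argument can be made concrete with a toy calculation; the per-copy failure probability below is an arbitrary illustrative number, not a real drive statistic.

# Toy illustration of why a second, mirrored copy makes data loss far less likely.
p_single = 0.001        # assumed probability that one copy fails before repair completes
p_both = p_single ** 2  # approximate probability that both copies fail in that window

print("one copy lost:  ", p_single)   # 0.001
print("both copies lost:", p_both)    # 1e-06, a thousand times less likely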

17.6.2 Network connectivity

A secondary or tertiary storage may connect to a computer utilizing computer networks. This concept does not pertain to primary storage, which is shared between multiple processors to a lesser degree.

• Direct-attached storage (DAS) is traditional mass storage that does not use any network. This is still the most popular approach. This retronym was coined recently, together with NAS and SAN.

• Network-attached storage (NAS) is mass storage attached to a computer which another computer can access at file level over a local area network, a private wide area network, or, in the case of online file storage, over the Internet. NAS is commonly associated with the NFS and CIFS/SMB protocols.

• Storage area network (SAN) is a specialized network that provides other computers with storage capacity. The crucial difference between NAS and SAN is that the former presents and manages file systems to client computers, whilst the latter provides access at block-addressing (raw) level, leaving it to the attaching systems to manage data or file systems within the provided capacity. SAN is commonly associated with Fibre Channel networks.
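The file-level versus block-level distinction can be illustrated locally; the sketch below uses an ordinary file and a byte offset as stand-ins for NAS-style and SAN-style access respectively, and does not speak any actual NAS or SAN protocol such as NFS, SMB or Fibre Channel.

# Stand-in illustration of file-level vs. block-level access using a local file.
with open("example.bin", "wb") as f:
    f.write(bytes(range(16)) * 256)          # 4 KiB of sample data

# File-level access (NAS-style): name a file and let the file system find the bytes.
with open("example.bin", "rb") as f:
    first_kilobyte = f.read(1024)

# Block-level access (SAN-style): address raw blocks by number; any file-system
# structure on top is the client's responsibility.
BLOCK_SIZE = 512
with open("example.bin", "rb") as f:
    f.seek(3 * BLOCK_SIZE)                   # jump straight to block number 3
    block_3 = f.read(BLOCK_SIZE)

print(len(first_kilobyte), len(block_3))     # 1024 512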
17.6.3 Robotic storage

Large quantities of individual magnetic tapes and optical or magneto-optical discs may be stored in robotic tertiary storage devices. In the tape storage field they are known as tape libraries, and in the optical storage field, by analogy, as optical jukeboxes or optical disk libraries. The smallest forms of either technology, containing just one drive device, are referred to as autoloaders or autochangers.
