Computer Systems, A Programmers Perspective

Overview

A book recommended in the TYCS series.

Document UUID: gfjehshjf-hqek-hluo-ekoq-uihwfrrlehwi.

The short-term goal is to read chapters 1-6, as recommended by TYCS. Projects will be skipped, unless they can be somehow reworked to be musical.

Go to messages.

Chapter 1: A tour of computer systems

What is a computer system? A computer system colnsists of hardawre and systems software that work together to run application programs.

Hello World in C. Bits, bytes. The compiler pipeline for Hello World. How the hardware is organized: Buses, I/O devices, Main Memory, Processor. Caches and memory hierarchy.

OS management of hardware. Virtual memory. Processes. Threads. Files. Networks. Parallelism. SIMD.

Chapter 2: Representing and manipulating information

Hex, Binary, how integers are represented. Also floating point.

There's a lot of nitty-gritty details here that are worth going back to at some point.

Chapter 3: Machine-Level Represenation of Programs

3.1: A historical Perspective

This is just a big list of the processors from Intel. x86 carries a lot of historical baggage, and you can see where things come from.

3.2: Program Encodings

Outlines some of the differences between C and machine representations of programs. C abstracts away things like program counter, register, condition code, and vector registers. C offers more constructs of data, that get translated to continugous bytes in machine code.

Some words on tooling here. How to view machine code from C code using disassemblers. OBJDUMP is mentioned, GDB is mentioned.

ATT vs Intel assembly code formats discussed as an aside. This book opts for ATT.

3.3: Data Formats

Defining a Word as 16 bits. A byte is 8 bits. Long is 32 bits (or double words). Quads are 64 bits. etc. etc.

Assembly uses single-character suffixes denoting the size of the operand (table 3.1). 'l' can mean either 'long' or float, depending on context.

3.4: Accessing Information

x86-64 contains 16 general purpose 64-bit registers.

History dictates the naming of these registers. Originally %ax through %bp, then expanded to 32-bit with %eaxthrough %ebp, then expanded to 64 bits with %raxthrough %rbp. 8 registers were added as well: %r8through %r15.

Instructions can operate on registers using different sizes. 1 or 2 byte: leave remaing bytes unchagned. 4 bytes: set upper 4 bytes to 0.

The %rsp register is the end position in the runtime stack, and it is the most specialized register.

Instructions have operands, which specify the source and destination values. x86 has many operand forms.

Immediate values are constants, written $ followed by a value.

Memory references access a memory location, using the notation M_b[Addr], though the subscript b is usually dropped.

There are many addressing modes to get at a memory location. There's 9 of 'em in this table on pg 474 (figure 3.3).

lmm is an absolute reference.

(r_a) is an indirect reference M[R[r_a]].

lmm(r_b) is base + displacement M[lmm + R[r_b]].

(rb, ri) is indexed M[R[rb] + R[ri]].

lmm(rb, ri) is indexed M[lmm + R[rb] + R[ri]].

(,ri,s) is scaled indexed M[R[ri] * s].

lmm(,ri, s) is scaled indexed M[lmm + R[ri]*s].

(rb,ri,s) is scaled indexed M[R[rb] + R[ri]*s].

lmm(rb,ri,s) is scaled index M[lmm + R[rb] + R[ri]*s]. this is the most general form.

mov(S,D) copies from data from one place (Source) to another (Destination). movb, movw, movl, moivq, movabsq. Source is a value (immediate, register, memory) the destination (register, memory address). Source/destination can't both be memory addresses.

movz(S,D) does a move and pads the remaining bytes with zeros. movz, movzbw, movzbl, movzwl, movzbq, movzwq. movzlq does not exist because movl will automatically zero out the upper 4 bytes.

movs(S, D) does a move with sign extension. movsbw, movsbl, movswl, movsbq, movswq, movslq, cltq.

The program stack can be a way to move data around. It's a LIFO.

pushq(S) has the effect:

R[%rsp] = R[%rsp] - 8;
M[R[%rsp]] = S

popq(S) has the effect:

D = M[R[%rsp]];
R[%rsp] = R[%rsp] + 8;

Stack grows downward: top element has the lowest stack element.

3.5: Arithmetic and Logical Operations

Operations are divided into 4 groups: Load Effective Address, unary (one operand), binary (two operands), and shifts.

The Load Effective Address command leaq takes the address of a source and loads it into the destination. This gets abused for simple arithmetic operations.

unary operations: inc, dec, neg, not.

binary operations: add, sub, mul, xor, or, and. Note that the ordering goes Source, then Destination.

Shift operations: sal, shl, sar, shr. Follow the arguments (k, D). Where k is a byte that determines the amount to shift. Low order bits are used.

x86-64 has limited support for 128-bit arithmetic operations: imulq, mulq, cqto, idivq, divq.

3.6: Control

Condition codes: single-bit registers that the CPU maintains, which describe attributes of the most recent arithmetic or logical operation. These include carry flag (CF), zero flag (ZF), sign flag (SF), and overflow flag (OF).

There are two instruction classes that can set condition codes without touching the registers. cmp works like sub but only changing the condition codes. testbehaves like and without updating the registers.

Condition codes get indirectly read 3 wasy: setting a byte to 1 or 0, conditionally jumping based on state, or conditionally transferring data.

Suffixes for these instructions are different than other instructions. setl and setb mean "set less" and "set below", not "set long word" or "set byte". (Yeesh).

The set commands are used to set a register or memory address to the value of the condition flag. sete or setz, setne or setnz, sets, setns, setg or setnle, setge or setnl, setl or setnge, setle or setng, seta or setnbe, setae or setnb, setb or setnae, setbe or setna.

The "or"s are synonyms.

(There's some tricky stuff with set at the end of this chapter (pdf pg 125) that I'm still trying to grok. I just don't think the author is being all that clear.)

Jumps allows executation to switch to a completely different position in the program. They are usually indicated with the use of labels. jmp (Label), jmp (Operand), je, jne, js, jns, jg, jge, jl, jle, ja, jae, jb, jbe.

PC Relative jumps are offsets relative to the positions they are jumping off from.

The value of the program counter when performing PC-relative addressing is the address of the instruction following the jump instruction, not the address of the jump instruction itself.

Relative jumps allow for compact code that can be shifted to different positions without alterations.

Conditional and unconditional jumping. Gotos can be used in C to closely match the equivalent assembly, but in "goto code" is considered bad practice.

Conditional moves change a value based on a condition being true. Known as conditional data transfer. cmove or cmovz, cmovene or cmovenz, cmovs, cmovens, cmovg or cmovnle, cmovge or cmovnl, cmovl or cmovnge, cmovle or cmovng, cmova or cmovnbe, cmovae or cmovnb, cmovb or cmovnae, cmovbe or cmovna.

The pipelining used in modern processors is why this is more efficient. Branch prediction allows a processor to make guesses where it's going to go, and will fill the pipeline full of instructions. A bad prediction (branch misprediction) causes it to discard work and start filling things over again. This incurs a performance penalty.

Not all conditional expressions can be expressed as conditional moves. Since both sides are evaluated, they could introduce side effects.

Conditional moves aren't always the most efficient option, especially if the then-expr or else-expr requires significant computation. A compiler like GCC will cleverly know when to use them.

Loops in C like do-while, while, and for, boil down to combintations of conditional tests and jumps in assembly.

do-while: repeat body statment, evaluate test expression, and continue if non-zero. The body statement is executed at least once.

while: like do-while, except the test statement is executed before the body. This means the loop can be terminated before the first iteration (stopped before it even starts).

for loop: has init-expr, test-expr, update-expr, can mostly be written as an equivalent while loop.

switch statement in C: provides multiway branching capabilities based on the value of an index. Can be efficiently implemented using a jump table. The compiler automatically implements these.

Switch cases that fall through don't have a break statement.

3.7: Procedures

Machine-level support for procedures need to do the following: pass control, pass data, and allocate/deallocate memory for local variables.

Run-time stack: a LIFO stack used to manage memory. When a procedure runs out of space in registers, it allocates memory on the stack. Known as a stack frame.

Return address: an address that gets pushed onto the stack that indicates where to return to when the procedure is done.

Procedure Control transfer involves updating program counter. call is the instruction that that calls a procedure, and the return address is the address that follows right after it. ret is used to go back to the return address (popping it off the stack and setting it to the PC).

Most data transfer between procedures is done via registers. In x86-64, up to six integral (integer, pointer) arguments can be passed via registers.

Registers are used in a specified order (these names change based on the sized value being used). 64-bit register order is: %rdi, %rsi, %rdx=, %rc, %r8, %r9.

More than 6 integral arguments, and the extras get passed on the stack.

When a procedure needs to allocate space on the stack, this is done by decrementing the stack pointer, resulting in a "local variables" portion of the stack.

Registers are a single resource shared by all procedures. How do procedures not overwrite registers?

callee-saved registers: When a procedure is called, the information in these registers must be the same when it returns to the previous procedure. These registers are %rbx, %rbp, and %r12-%r15. This is accomplished by not touching them at all, or storing their values on to the stack and popping them back before returning. A dedicated portion of the stack frame is reserved for this, called "saved registers".

The convention of saving registers allows for procedures to be called recursively.

3.8: Array Allocation and Access

Arrays are a construct in C, which are reasonably straight forward to translate to machine code. Pointer arithmetic in C gets translated into address computations in machine code. Optimizing compilers can simplify address computations quite a fair bit, and this can sometimes obfuscate things.

Basic access to an array: A + L*i, where A is the initial address of the first element, L is the length of the type (in bytes), and i is the index.

Pointer arithmetic: computed value is scaled according to the size of the data type referenced by the pointer.

Nested arrays: A[5][3] can be thought of as a 2d array with 5 rows 3 columns. Known as row-major order.

Access array element in D at i,j: A+L(C*i+j), where Ais the address, L is the length of the type, and Cis the column size.

Fixed-size multi-dimensional arrays in C can be heavily optimized by C compilers.

Variable-sized arrays were introduced in C99.

3.9: Heterogeneous Data Structures

C has two mechanisms for combining objects of different types. structures (struct) aggregate multiple objects into a single unit. unions (union), allow an object to be referenced using several types.

Unions are useful in situations when the use of two different fields in a structure will be mutually exclusive. Using a union will reduce space.

Alignment Restrictions: allowable memory addresses must be some multiple of either 2, 4, or 8. This simplifies hardware between the processor and memory system.

Alignment can be enforced in x86 using something like ".align 8".

Structs can add space in between fields to meet the alignment requirements.

3.10: Combining Control and Data in Machine-Level Programs

Pointers. And a bit how they work in C.

Some introductions to GDB.

Buffer Overflow example, attempts to reach for values out of bounds.

Exploit code: malicious code inside input to a function (such a string). The code then overrides the return to point to the new program subroutine.

Stack randomization: randomizes stack position from one program to another. Used to thwart stack exploits.

stack randomization is part of address-space layout randomization (ASLR), which generally concerns itself with loding parts of a computer program (program code, library code, stack, global variables, heap data), into different addresses spaces every time it is run.

stack protection is a way of checking for stack corruption checking to see if a canary value has changed. this is enabled by default in recent versions of GCC.

Limiting executable regions of code is another means to make things more secure. Memory is divided into chunks called pages, where the kind of access can be controlled (memory protection) at the hardware level.

alloca: standard function that allocates arbitrary bytes onto the stack.

Variable size stack frames are managed using a special pointer to store the base pointer (called %rbp), and then addresses things relative to that.

3.11: Floating-point Code

Floating point architecture has evolved "media" instructions inteded for for computer graphics.

Focused on parallel computation (single instruction, multiple data, or SIMD).

SSE: Streaming SIMD extensions.

AVX: Advanced Vector Extensions.

Data in these special instructions uses special registers: MM for MMX, XMM for SSE, and YMM for AVX.

AVX2: instructions introduced to the core i7 in 2013. Second version of AVX. Primary focus.

Floating point movement instructions: vmovss (single) vmovsd (double) vmoaps (aligned, packed single) vmovapd (aligned, packed double).

Floating point conversions (float to int, int to float), on scalar values: vcvttss2si, vcvttsd2si, vcvttss2siq, vcvttsd2siq.

Floating point to integer conversion performs truncation, rounding values towards zero.

These instructions convert from integer to float: vcvtsi2ss, vcvtsi2sd, vcvtsi2ssq, vcvtsi2sdq. They take 2 sources, and a destintation. (The second source can be ignored for now, and are commonly identical).

vunpcklps followed by vcvtps2pd is what GCC uses to convert between single to double precision. vunpcklpsperforms some interleaving operation, vcvtps2pd uses that that intermediate value to be a double. From the book: "it is unclear why GCC generates this code".

vmovddup and vcvtpd2psx are used to convert from double to single precision. vmovddup will replicate first vector element, and the vcvtpd2psx will use that to produce the single precision element in the destination.

The XMM registers are used for passing floating point arguments to and from functions. Up to 8, %xmm0 is return, and caller saved (calle can overwrite without first saving it).

Floating point arithmetic operations: vaddss, vaddsd, vsubss, vsubsd, vmulss, vmulsd, vdivss, vdivsd, vmaxss, vmaxsd, vminss, vminsd, sqrtss, sqrtsd.

Constant floating point values must be stored and read from memory. Can't be passed in directly as an immediate value.

Bitwise operations on floats: performed on entire XMM register (128 bits): vxorps, xorpd, vandps, andps.

Floating point comparison operators are: ucomiss and comisd. They set the zero flag (ZF), carry flag (CF), and this thing called the parity flag (PF), which is set when the least significant byte has an even number of bits turned on (even parity).

Chapter 4: Processor Architecture

4.1 The Y86-64 Instruction Set Architecutre

We're defining our own processor called Y86-64!

Program Visible State: the stuff that programs can read and modify.

State layout does not need to directly line up with ISA.

Components: registers, program counter, memory, status code.

8-bit integer operations only. 8-bit word.

Byte-level encoding of instructions: ranges from 1-10 bytes. Initial byte identifies the instruction.

Byte is split into two nibbles: code (high-order), and function (low-order).

Constant word: 8 bytes. used for immediate data in irmovq, displacement in rmmovq and mrmovq.

Status codes: AOK (all good), HLT (halted), ADR (invalid address), INS (invalid instruction).

pushq/popq quirks: pushq %rsp and popq %rsp are ambiguous. When in doubt, read the manual.

4.2 Logic Design and the Hardware Control Language

Logic gates: the basic computing elements for digital circuits. Takes in bit values as inputs, peforms boolean operation, and outputs a bit. Examples: AND, OR, and NOT.

Combinatorial Circuit: logic gates that form a network.

HCL: syntax for outlining combinatorial circuits in a C-like expression.

Combinatorial logic doesn't have any partial evaluation rules like C. Things are always getting evaluated in logical statemes like AND.

Combinational circuits tend to work on groups of bits known as words.

Multiplexors can work on a word level by selecting a word from a source using some control condition.

Multiplexing functions are expressed in HCL using case expressions.

Logic synthesis: a technique to create an optimal circuit from some higher-level logic written in something like HCL.

Hardware registers (directly connected to circuit with wires) are slightly different from program registers (addressable words).

Hardware register kind of acts like a sample-and-hold. It takes in a signal and a clock signal, and holds the value when the clock goes up.

Clocked registers: hardware register that whose values can be loaded using a clock signal.

A register file is an abstraction that holds words and addresses, and is used at the machine-language level.

Register file: two read ports, and one write port. This means it is multiported, allowing for multiple read/write operations to happen simultaneously.

Memory: address input, data input, data output. Address as input, write control to 0, add delay, and the value will appear on data output.

4.3 Sequational Y86-64 Implementations

Processor will be called SEQ.

Instruction processing stages: fetch, decode, execute, memory, write back, PC update.

Fetch: grabs bytes from memory.

Decode: reads up to two operands from register file.

Execute: Executes the instruction.

Memory: may read/write memory.

Write back: writes up to two results to register file.

PC update: sets PC to address of next instruction.

addq, subq, andq, xorq: same icode portion (same steps), different ifun portion (different ALU computation).

memory write/read: like integer operations, but ALU is used to compute valC and valB to get effective address. Memory stage will write register value valA to memory, or read valM.

pushq: use %rsp as second identifier for operand. ALU decrements stack pointer by 8.

popq: like pushq, only read 2 copies of stack pointer in decode stage. redundant, but more uniform with other instructions. ALU increments by 8. write-back: update stack pointer and register rA.

Jumps: fetch like before, but does not require register specifier byte. Execute: check condition codes and jump condition, determine whether or not to take branch, yield 1-bit signal Cnd. PC update: test Cnd flag, update PC to valC (jump target) if Cnd is 1, valP (following address) if flag is 0.

Hardware structure of SEQ: operates with a single clock transition triggering a flow in combinational logic to execute an instruction.

Combinational Logic does not require sequencing or control, values propogate through the network. The forms of memory used (clock registers and random access memories), have similar behavior to combinational logic.

Four hardware units that require explicit control over their sequencing: program counter, condition code register, data memory, regster file. Conrolled via single clock.

Clocking of registers/memories is all that is required to control sequencing in the processor, and the sequential events in the different instructions will happen.

Why does the execution work? This principle: No Reading Back. The processor never needs to read back the state updated by an instruciton in order to complete the processing of this instruction.

An overview of the HCL requried to build y86-64 is included.

Problem: SEQ is too slow. The clock needs to run slowly enough to run through all the stages within a single cycle.

Inefficient: each hardware unit is on for a fraction of a clock cycle.

Solution to fix SEQ performance: pipelineing.

4.4 General Principles of Pipelining

Pipelining is something a processor does to address the slow performance of instruction sequencing with the clock.

Analogy: cafeteria. instead of having one person at a time go through one at a time to get food (salad, main dish, dessert, etc), line them up.

Pipelining works by breaking computation into components, and allowing instructions to pass through the system one component at at time.

Pipelining increases the throughput of a system, but it may slightly increase latency (someone going for dessert has to wait in line).

Pipeline registers store the state at each pipeline.

Circuit delays are measured in picoseconds.

Pipelining charts are ways to analyze the latency of a circuit.

Nonuniform partitioning: when delays in each stage are different. pipelinig is only as fast as it's slowest component. designing with uniform delay is a difficult challenge.

Diminishing returns of pipelining: adding pipelining stages eventually doesn't add gains to throughput, due to the small delay introduced from the pipeline registers.

Data Dependency: sequentional instructions share data.

Control Dependency: outcome of a instruction determines what the next instruction will be.

Feedback in the pipeline handles control and data dependencies.

When introducing feedback, the effects must be handled so that the resulting behavior matches the model defined by the ISA.

4.5 Pipelined Y86-64 Implementations

SEQ must be changed so that PC update happens at beginning of the cycle instead of the end. This will be called SEQ+.

PC now computes address of current instruction.

Next, insert pipeline registers between each state: Fholds a predicted value of PC, D (btwn fetch and decode), holds information about most recent fetched instruction, E (btwn decode and execute) holds information about recently decoded instruction, M (btwn execute and memory) holds results of last executed instruciton, W (btwn memory and feedback paths) supply computed results to register file for writing and return address to PC selection logic.

Sequential SEQ/SEQ+ only process one instruction at a time. Pipelined has to have multiple versions of the value kept through the system.

With measures added to handle control dependencies, now make it so that a new instruction can be issued on every clock cycle, and an instruction can be executed as well. This would yield a throughput of 1 instruction/cycle.

To achieve this, the next instruction must be fetched after the current one.

With branches, one doesn't know which instruction will be chosen until the instruction gets executed down the line. A return instruction will also be ambiguous until it reaches the memory stage.

branch prediction: guessing a branch direction and fetching instructions based on that guess.

Return addresses will not be predicted, as it's virtually unbounded. When a return instruction happens, hold off other instructions until it has passed.

Branch prediction strategies. Always taken (what will be used here). ~60% sucess rate. Never taken (NT): opposite of Always Taken. ~40% success rate. Backward Taken, Forward Not Taken: slightly more involved strategy that predicts branches to lower addresses will be taken, high addresses will not be taken. ~65% success rate. Works because loops are are closed by backward branches and happen multiple times, forward branches are conditional operations and are less likely to happen.

Pipeline hazards: pipelining with feedback can cause problems with data and control dependencies, known as data hazards and control hazards.

Data hazards in the current pipeline occur when one of its operands is updated by any of the three preceding instructions. Occurs because it reads operands for an instruction from the register file in the decode stage, does not write instruction to the register file until three cycles later, after instruction passes through the write-back stage.

Stalling: processor holds back one or more instructions in the pipeline until the hazard condition no longer exists.

Stalling is a way to avoid data hazards.

Holding an instruction back in the pipeline involves injecting a bubble, a dynamically created NOP instruction.

Data Forwarding: The technique of passing a result value directly from one pipeline stage to another.

Data forwarding is a way to avoid data hazards.

Data forwarding can be used when there is a pending write to a register in the memory stage and can be done to avoid stalling.

Data forwarding can also pass newly computed values from execute stage to decode stage.

Five different forwarding sources, and two different forwarding destinations.

Load/use hazard: one instruction reads a value from memory while the next instruction needs this value as a source operand.

Load/use hazards are solved with a combination of stalling and forwarding. Known as a load interlock.

Instruction Cancelling is used to mitigate control hazards (branch mispredict) by inserting bubbles into the pipeline. Since the fetched instructions didn't reach the execution, it doesn't effect program state.

Exceptional control flow: activities in a processor that lead to a break in the normal chain of program execution.

Exceptions can happen internally by the program, or externally by some outside signal.

Internal exceptions: halt, invalid instruction, invalid address.

Exception Instruction: "virtual" instruction that causes exception.

Exception handling in pipelined systems has subtleties. 1: It is possible to have exceptions triggered by multiple instructions simultaneously. 2: A fetched instruction that begins execution, causes an exception, and then is canceled due to branch misprediction. 3: Because the pipelined processor updates different parts in stages, it is possible for an instruction following one causing an exception to alter some part of the state before the instruction completes.

Transition look-aside buffer (TLB): a type of cache memory that provides a fast translation from virtual to physical addresses.

Cache Miss: some reference is made to a location not in the cache.

Page Fault Exception: memory location being referenced is actually stored in the disk or non-volatile memory.

Chapter 5: Optimizing Program Performance

Introduction

The primary objective in writing a program must be to make it work correctly under all possible conditions. Making the program run fast is also an important consideration sometimes.

Writing an efficient program requires selecting an appropriate set of data structures, and then writing source code that the compiler can effectively optimize into executable code (which varies on the language/compiler). Efficient code can also be realized by splitting up the task into portions that can be run in parallel.

Common Trade-off: how easy it is to maintain vs. how fast the program is.

Modern compilers do a good job optimizing code, but sometimes programmers need to help by writing code that can be trivially optimized.

Optimizig a program first requires eliminating all unncesary work: unnecessary function calls, conditional tests, memory references, etc.

Programmer and compiler must be aware of the model of the target machine to maximise performance.

Understanding processor operation, one can provide instruction-level parallelism.

Problems in optimizing large programs, code profilers, etc.

Optimization is not a linear problem and requires a lot of trial and error.

Studying the assembly code is the most effective means for gaining an understanding of the compiler and how the generated code will run.

Practical approach: rewrite the program to the point where the compiler can generate optimal code that has measurable performance increases.

5.1 Capabilities and Limitations of Optimizing Computers

Modern compilers employ sophisticated program analysis techniques that find opportunities to simplify, as well as re-use and reduce computations.

GCC has different levels of optmization: -O1 through -O3. Some clever C code that generates better code at -O1 than less clever code compiled with -O3.

Compilers apply only safe optimizations to code, so that the program behaves the same way at each optimization level. (Exception is undefined C behavior).

Memory Aliasing: the case where two pointers may designate the same memory location. A safe compiler must assume that different pointers may be aliased.

Optmization Blocker: aspects of a program that can severely limit the opportunities for a compiler to generate optimized code.

Function calls can be an optimization blocker because functions can modify global state, causing a side effect.

5.2 Expressing Program Performance

Cycles Per Element (CPE): metric used to express program performance.

Clock provides regular timing, usually measured in gigahertz. Expressing measurements in terms of "cycles" (rather than in nanoseconds or picoseconds) is more intuitive for programmers.

Least squares fit can be used to find the run times for programs, expressed in units of cycles as a constant plus a factor proportional to the number of elements processed.

5.3 Program Example

This outlines the sample program used in the next sections below.

5.4 Eliminating Loop Inefficiences

Code motion: a general class of optmizations. Identify computations performed multiple times where the result doesn't change, and move it to an earlier part of code where it doesn't execute as often.

Optimizing compilers will try to apply code motion, but they are careful about decisions and sometimes won't do it.

5.5 Reducing Procedure Calls

Replaced a function call in the inner loop with direct memory access to the vector. Little to noperformance benefit.

5.6 Eliminating Unneeded Memory References

Assembly is analyzed for the combine3 procedure, and there is a wasteful read/write operation. This gets fixed by adding an accumulator variable. This reduces memory operations from two reads and one write per iteration to just one write and one read.

The compiler is unable to optimize this code because of memory aliasing. When using memory addresses, it does the conservative approach and reads/writes from memory each time.

5.7 Understanding Modern Processors

Exploiting the microarchitecutre of the processor to optimize a program.

Instruction-level parallelism: instructions being executed in parallel at the processor level, which may appear to be sequential at the code level.

Two lower bounds characterize maximum performance of a program: latency bound and throughput bound.

Latency bound: encounterted when a series of operations must be performed in strict sequence, because the result of one operation is required before the next.

Throughput Bound: raw computing capacity of the processor's functional units. Becomes ultimate limit on program performance.

This processor is a superscalar processor, meaning it processes multiple instructions on every clock cycle and not necessarily in sequential order (out of order).

Two main parts: Instruction Control Unit (ICU) and Execution Unit (EU).

ICU reads from instruction cache, usually ahead of time before the EU. Uses Branch Prediction to load instructions, and speculative execution, which executes instructions even before it is the correct branch, and resets state if it isn't.

Instruction decoding: takes program instructions and converts them into primitive operations known as micro-operations. Different instructions can have a variable number of operations.

Misprediction in branching incurs a significant performance penalty.

Processor has different functional units designed to carry out various operations (integer/float arithmetic, store, load, address computation, etc). Units are typically designed to be multifunctional.

Retirement Unit: keeps track of ongoing processing, makes sure it obeys sequential semantics of the machine-level program.

Instruction that are correctly predicted can be retired, with any udpates to the program registers being made. Instructions that are mispredicted are flushed, being discarded without altering program state.

Register renaming allows entire sequences of operations to be performed speculatively. The registers are actually updated when the processor is certain of the branch outcome.

Timings for machine instructions are determined by latency, issue time, and capacity.

Latency: total time to perform operation. Issue time: minimum number of clock cycles between two independent operations of the same type. Capacity: inidicates number of functional units performing that operation.

Fully pipelined: Functional units with issues times of 1 that can start a new operation every clock cycle.

Throughput is a more common way of expressing issue time, which is the reciprocol. Capacity / Issue Time.

Latency Bound: minimum value for the CPE for any function that must perform the combining operation in a strict sequence.

Throughput Bound: a mimimum bound for the CPE based on the maximum rate at which the functional units can produce results.

Data-flow representation: graphical notation showing how the data dependencies between different operations constrain the order which they are executed. Constraints cause critical paths in the graph, thus putting a lower bound on the number of clock cycles required to execute a set of machine instructions.

Code segments forming a loop can have registers classified into 4 types: read-only, write-only, local (updated, but no dependency from one iteration to the next), loop (used as both source values and destinations for loop).

Analyzed performance of combine4, and found latency bounds. Next task: restructure the operations to enhance instruction-level parallelism. Goal: transform the program in such a way that only limitation becomes throughput bound.

5.8 Loop Unrolling

Loop unrolling: program transformation that increases the number of elements computed for each iteration, thus reducing the number of iterations for a loop.

LU reduces the overhead operations in a loop such as indexing, and exposes ways to further reduce operations.

K by 1 loop unrolling: make a loop go through kelements at a time.

Optimizig Compilers will often attempt to try loop unrolling, especially at higher optimization settings.

5.9 Enhancing Parallelism

2 by 2 Loop Unrolling: iterate over two values, accumulate results in two places.

This breaks through the barrier imposed by the latency bound.

Reassociation Transformation: reduces number of operations along the critical path for computation, resulting in better performance by better utilizing the multiple function units.

Optimizing compilers will usually not try to do reassocation transformations because it is not often guaranteed that operations are associative.

5.10 Summary of Results for Optimizing Combining Code

Optimizations done with scalar code, not taking advantage of AVX vector instructions.

Combined optimizations achieve close to the throughput bounds of 0.5 and 1.0. A 10-20x improvement over the original code.

Modern processors have considerable processing power, but may need to be coaxed into using this power through the use of writing programs in stylized ways.

5.11 Some Limitng Factors

Register Spilling: when a program using some degree of parallelism exceeds the number of available registers. The compiler will have to resort to "spilling" temporary values in memory.

Mispredection penalties for branches.

DO NOT BE OVERLY CONCERNED ABOUT PREDICABLE BRANCHES (their big text, not mine, though the emphasis using CAPS is mine).

5.12 Understanding Memory Performance

This chapter investigates performance of load/store.

Write/read depdency: outcome of memory read depends on a memory write.

5.13 Performance Improvement Techniques

Already outlined previously.

5.14 Identifying and Eliminating Performance Bottlenecks

Code Profiler: Analysis tool that collect performance data about a program as it executes.

Profilers can be used to guide optimization.

Chapter 6: The Memory Hierarchy

If you understand how data moves up/down the memory hierarchy, then you can write programs with data that is quickly accesible by placing them higher in the hierarchy.

6.1 Storage Technologies

RAM comes in two types: static and dynamic.

Static RAM (SRAM): used cache memories, both on/off the CPU chip. Significantly faster and more expensive than dynamic RAM.

Dynamic RAM (DRAM): Used for the main memory plus the framebuffer of the graphics system.

SRAM stores each bit in a bistable memory cell inside of a six transistor circuit. It can stably be in one of two states.

DRAM stores each bit as a charge on a capacitor. They can be packed very densely, but are very sensitive to disturbances. For this reason, many systems periodically refresh DRAM.

Supercells: how DRAM partitions cells (bits). These are organized in a rectangular array.

Pins are used to set bits in the DRAM. 8 pins can transfer 1 byte.

Memory Controller: circuit that DRAM is connected to. Used to transfer bits to the DRAM chip.

Supercells are addressed as a coordinate. Row: Row Address Strobe (RAS). Column: Column Access Strobe (CAS).

Memory Modules: things that DRAM chips are packaged into. Plug into expansion slots.

Enhanced DRAM: Fast-page mode DRAM (FPM DRAM), Extended data out (EDO) RAM, Synchronous DRAM (SDRAM), Double Data-Rate Synchronous DRAM (DDR SDRAM), Video RAM (VRAM).

Nonvolatile Memory: retain information even when they are powered off.

PROM: Programmable ROM. Can be programmed exactly once. Consists of fuses that can be zapped into a permanent position.

EPROM: Erasable programmable ROM. Zeroed out using ultra-violet light. Can be rewritten ~1000 times.

EEPROM: Electronically eraseable PROM. Like EPROM, but can be programmed inplace on a circuit borad. Can be written ~10^5 times.

Flash Memory: non-volatile memory based on EEPROM that is very popular in consumer electronics.

Data flow between DRAM and processor is done with the use of electrical conduits called buses.

Bus transaction: a single transfer of data beween CPU and memory.

Read Transaction: main memory to CPU.

Write Transaction: CPU to main memory.

Whan value at address A is loaded into register: bus interface initiates read transaction, CPU puts A on system bus, I/O bridge passes it to memory bus, main memory gets address, fetches value from DRAM, writes to memory bus, passes it back to signal system bus and back to CPU.

Writing to a register is basically the reverse.

Disks: storage medium that holds lots of data.

Disks are constructed from platters, double sided, coated with magnetic material. They spin at a fixed rate, typically between 5400 and 15000 RPM.

Tracks: concentric rings on the surface of the platter.

Sector: equal partioning of the track. Usually 512 bytes. Space between sectors is used to store information about sector.

Cylinder: collection of tracks on all the surfaces that are equidistant from spindle.

Maximum Capacity: maximum number of bits that can be recorded to disk.

Capacity factors: Recording Density, Track Density, Area Density.

Multiple zone recording: modern technique used for (dense) disk layout. Cylinders are partitioned into disjointed subsets known as recording zones.

Disk read/write is done with an actuator arm, which physically moves around. Due to very small tolerances, arm is contained in an air-tight environment.

Access time components: seek time, rotational latency, and transfer time.

Logical Blocks: simplified view of disk geometry. Splits things up into sector-sized blocks. This is what the operating system sees.

Disk Controller: component that maintains mapping between logical block numbers and disk sectors.

6.2 Locality

Locality: programs that tend to reference items that are near other recently referenced items.

Locality forms: temporal locality and spatial locality.

Temporal locality: a memory location referenced once is likely to be referenced many times in the future.

Spatial locality: a memory location referenced once is likely to have a program reference another memory location nearyby it.

Programs with good locality tend to run faster than those with poor locality.

stride-1 reference pattern or sequential reference pattern refers to a function that references elements of a vector sequentially.

Smaller the stride, the better the spatial locality.

6.3 The Memory Hierarchy

CPU registers, L1 Cache (SRAM), L2 Cache (SRAM), L3 Cache (SRAM), Main Memory (DRAM), Local secondary storage (local disks), Remote secondary storage (distrubuted file systems, web servers).

Cache: small storage device that acts as a staging area for data objects stored in a larger, slower device.

Caches are divided into contiguous chunks of data called blocks.

Cache hit: when a program searches for a value in a cache at a particular level, and finds it.

Cache miss: when a program searches for a value at a particular cache level, but does not find it.

Replacing or evicting a cache block. Governed by the cache's replacement policy.

Cold cache is a cache with an empty row. a cold miss is an attempt to read from a cold cache. AKA compulsory miss.

Placement policy: where to place the block it has retrieved from level k+1?

Conflict miss: cache is large enough to hold referenced data objects, but because they map to the same cache block, the cache keeps missing.

Memory hierarchies based on caching work because programs tend to exhibit locality.

6.4 Cache Memories

Cache is organized to be able to find requested words by inspecting bits of the address, similar to a hash function.

direct-mapped cache: A cache with exactly one line per set.

Cache determining a hit or miss and then extracting: set selection, line matching, word extraction.

Conflict Miss: can be caused when two variables keep replacing the same set in the cache due to the collision rules.

This kind of conflict miss thrashing can be done with the use of padding.

Set-associative cache: cache where each sets hold more than one line.

Fully-associative cache: A single set that consists of all the cache lines.

It is difficult to build a fully-associative cache that is large and fast. Usually only appropriate for small caches.

Write Hit: writing a word to a cache that is already written.

Write Through: write w's cache block to the next lower level.

Write Back: defers the update as long as possible by writing the updated block to the next level only when it is evicted.

Dealing with write-misses: write-allocate and no-write-allocate.

Optmizing caches for writes is a subtle and difficult issue.

i-cache: a cache that holds instructions.

d-cache: a cache that (only) holds program data.

unified cache: a cache that holds both instructions and program data.

Cache performance metrics: miss rate, hit rate, hit time, miss penalty.

Larger cache will tend to increase hit rate, but also increase hit time.

Increasing the block size of a cache can help increase hit rate, by exploiting a program's spatial locality. Larger blocks imply smaller cache lines, which can hurt hit rate with programs with temporal locality.

Higher associativity: decreases the chances of thrashing. But very expensive to implmement and hard to make fast.

6.5 Writing Cache Friendly Code

Review: cache lines, sets, and blocks.

block: fixed sized packet of information that moves back and forth between a cache and main memory.

line: container in a cache that stores a block, as well as other info like valid bit and tag bits.

set: collection of one or more lines.

Basic approaches to cache-friendly code: make common case code go fast, and minimize the number of cache misses in each inner loop.

6.6 Putting It Together: The Impact of Caches On Program Performance

Read Throughput or Read Bandwidth: the rate a program reads data from the memory system.

Memory Mountain: two dimensional function of read throuput versus temporal and spatial locality.

The performance of the memory system is not characterized by a single number. Instead is a mountain of temporal and spatial locality whose elevations can vary by over and order of magnitude.

Messages

Any messages tagged with the group CSAPP.

[ca194d75] 2022-04-13-09-34: done with chapter 6. that finishes my requirements for (TYCS).

[8083fd6d] 2022-04-13-06-22: this aside on cache lines, sets, and blocks was very well placed. I was a little confused up to this point.

[a9065925] 2022-04-12-06-28: skeptical about how universal some of these symbols used are. looking at the cache chapter. maybe it's because the typography is bad.

[cdd2605c] 2022-04-11-06-20: "While rotating disks are here to stay, it is clear that SSDs are an important alternative" almost sounds dated.

[7b798ebe] 2022-04-09-10-47: done reading chapter 5. onto chapter 6 now.

[4461d57b] 2022-04-09-06-30: does shakespeare really have exactly 23,706 words? Is that like a known number?

[2362a6a9] 2022-04-09-06-08: there's a lot of casual observations in this chapter on optimizations. "we have found that GCC...". This in itself is interesting.

[f0872386] 2022-04-08-06-32: the text formatting in this book is very quirky sometimes. looks like a programmer was trying to use microsoft word. In this section, there's this phrase in big text: "Do Not Be Overly Concerned about Predictable Branches". Is this big text for emphasis ever used again? Nope. This book is very dry and humorless, so this quirk is very out of place.

[323f8425] 2022-04-08-05-47: onto loop unrolling

[4a81a784] 2022-04-07-05-56: I have now hit the longer sections of chapter 5 it seems.

[5bc48383] 2022-04-06-06-29: chapter 5 is a much smaller chapter, with much more digestible subsections.

[fb36423b] 2022-04-06-05-50: I'm beginning to form an opinion: code is overrated. Using text as a notation is convenient, but may not be the most ideal way of expressing something. Writing something in code as a way to make it seem familiar has diminishing returns, because some things just aren't clearly explained this way.

[f97f7bb3] 2022-04-05-15-38: add more CSAPP keywords to glossary page

[a969e7b1] 2022-04-05-09-52: chapter 4 I didn't add many words to the glossary, I may get back to that.

[97739041] 2022-04-05-09-27: a strategy I seem to be taking, for better or worse, is to extract the big ideas, superficially look at examples, and skip the problem sets entirely. It certainly will minimize retention and intuition, but I think that's about all I can do at the brisk pace I am going at.

[513aba37] 2022-04-05-09-23: onto chapter 5: optimizing program performance

[c4805400] 2022-04-05-06-24: Shriver and Smith give a thorough presentation of an Intel compatible x86-64 processor manufactured by AMD.

[54a42dd3] 2022-04-05-06-23: Hennessy and Patterson have a computer architecture textbook, and provides extensive coverage of architecture design.

[a96d1c5d] 2022-04-05-06-23: Katz and Borriello have a standard textbook for logic design, emphasizing the use of hardware languages.

[27ff571a] 2022-04-05-06-22: the end of chapter mentions a few books for further reading on this topic.

[073bac3c] 2022-04-05-06-17: ugh. they are telling me now that y86-64 is "simplified". Still seems plenty complex to me.

[749901bd] 2022-04-05-06-08: I do appreciate that this is going over x86, and just how massively complex it is. I wonder how the RISC-V chip compares?

[223e9b12] 2022-04-05-06-07: I'm totally glossing over the bits where the actual y86-64 processor is getting stitched together using HCL. It just seems too overwhelming, and the presentation in this PDF leaves a lot to be desired. I get the feeling like there could be whole books on what this chapter is attempting to cover.

[be781514] 2022-04-05-05-45: TIL "canceled" and "cancelled" both are correct.

[00f9964e] 2022-04-04-13-37: it's pretty amazing. anytime this book tries to sequentially explain things in a paragraph, my brain just shuts down and refuses to process it. I think I'll have to rewrite those paragraphs in a more terse form in order to get my brain to grok it.

[2f58f984] 2022-04-04-13-25: there's a lot of examples I should study more clearly in 4.5 with how pipelining works. Again, if I had more time and energy I would. I think I'm getting the broad strokes of it.

[a5d0d462] 2022-04-03-07-41: circuits feel really well suited for structuring generative music events

[63c5a4eb] 2022-04-03-07-21: pipelining diagrams look like DSP charts. I bet you could measure DSP performance as a pipelining diagram.

[2896b802] 2022-04-03-07-05: I think learning about simpler 8-bit CPU chips would be much more educational than attempting to figure out x86-64 stuff.

[b3fa2d25] 2022-04-03-07-04: Skimming 4.3.4 SEQ Stage Implementations, which devises HCL descriptions for the control logic blocks to implement SEQ. I don't know if it's worth the time to study fully, and it seems massively complex (they don't even show the full HCL description).

[03706078] 2022-04-03-06-55: I think this book was a thankless task because they decided to stick to x86-64, which seems really hard to teach in a classroom setting.

[145fc095] 2022-04-03-06-53: the machine instruction tables in 4.18-4.21 make more sense after reading onwards. I think they were trying to go top-down with machine instructions to combinational logic. I would have done bottoms-up with the combinational logic, working up to instructions. The nand2tetris concept sounds a bit more appealing now.

[75a7d4a7] 2022-04-01-16-03: 4.3.2: SEQ hardware structure packs a lot, and may be worth revisiting. The charts do not work well in grayscale on my RM, so that may not work so well.

[86e99092] 2022-04-01-15-57: there is definitely a difference in writing between chapters and even sections. really inconsistent.

[68980f65] 2022-04-01-15-55: 4.3.2 should absolutely have come first before 4.3.1. Defining the sections as hardware, then the instructions makes things a lot clearer. Not to mention the variables in the tables (and the writing) make more sense.

[f512527b] 2022-04-01-15-09: note to self: maybe study the tables 4.18-4.21 in more detail. these describe how Y86-64 instructions proceed through the processing stages.

[be3d02c7] 2022-04-01-13-13: the register file is more at the machine-language level, and can store words and addresses.

[e24c9a1a] 2022-04-01-13-11: clocked registers: hardware registers that store individual bits or words. The clock signal controls the loading of the register with the value at its input.

[5eacef38] 2022-04-01-13-10: ah okay, clocked registers are explained in a weird box (but why though?).

[8b8105fe] 2022-04-01-13-09: this is the sentence that is making me confused about clocked registers vs register file: "the writing of words to the register file is controlled by the clock signal in a manner similar to the loading of values into a clocked register".

[1db5cdcd] 2022-04-01-06-29: what's the difference between a clocked register and a register file? They seem to be used interchangeably. Maybe I'm not reading close enough.

[e0d7f58b] 2022-03-31-10-34: Hardware Description Language (HDL): a textual notation that looks similar to a programming language but that is used to describe hardware structures rather than program behaviors.

[4b3af7be] 2022-03-31-10-29: I don't understand the point of going through making Y86-64. There still seems to be a lot of complexity and weirdness, such as with their pushq instruction. Is it that they are trying to mimic the quirks of x86?

[566ff57c] 2022-03-31-10-22: Since this chapter is building a fantasy computer, I'm tempted to read this paired with Knuths MIX system. At least there the writing/code will be clearer, and the examples will be interesting data structures.

[aa34acd2] 2022-03-31-10-17: this textbook is so sloppy. jeez. I'm seeing tons of typos and formatting errors. Spotted what looks like an erroneos space in a Y86-64 code example.

[7cd6589d] 2022-03-31-10-13: exception handler: a procedure designed to handle the specific type of exception encountered.

[22ceb0ab] 2022-03-31-06-04: programmer-visible state: refers to the part of a processor state that a program can read and modify.

[b49da2a0] 2022-03-31-05-55: Done with chapter 3. now outling chapter 4 in (CSAPP).

[ac0f5250] 2022-03-30-16-56: From the book: "it is unclear why GCC generates this code."

[0e720f0d] 2022-03-30-11-19: copyright of this book is 2016. 6 years old. something to keep in mind.

[683cf460] 2022-03-30-11-18: going to need to look at the date of this book. It may be stale. It says AVX2 was introduced in 2013, which was nearly 10 years ago. Is it still relevant?

[db723ebe] 2022-03-30-11-13: AVX: Advanced Vector Extensions.

[3dd9372c] 2022-03-30-11-12: SSE: Streaming SIMD Instructions.

[b8b911d3] 2022-03-30-11-10: the floating point operations chapter feels most relevant to my DSP stuff. SIMD, etc.

[67013d1a] 2022-03-30-06-26: canary value: special value used in a stack protector.

[189f6375] 2022-03-30-06-25: stack protector: a mechanism uesd to detect stack corruption, implemented in recent versions of GCC.

[b4377754] 2022-03-30-06-23: nop sled: a technique used by attackers that involve using a long sequence of nops. used to try and guess the address of the exploit code. used to try and bypass ASLR.

[0294b59e] 2022-03-30-06-18: address-space layout randomization (ASLR): techniques that involve loading parts of a computer program into different regions of memory every time it is run.

[7dc9f38b] 2022-03-30-06-15: stack randomization: the use of randomizing the stack position of a program from one run of a program to another.

[8ae13c84] 2022-03-30-06-14: security monoculture: a phenomenon where many systems are vulnerable to the exact same strain of virus or exploit.

[1805855f] 2022-03-30-06-12: virus: a piece of code that adds itself to other programs, including operating systems.

[144198b7] 2022-03-30-06-11: worm: a program that can run by itself and can propogate a fully working version of itself to other machines.

[4e865412] 2022-03-30-06-08: exploit code: a malicious program contained inside input to some program.

[857149a9] 2022-03-30-05-56: "we start by taking a deep look into pointers, one of the most important concepts in the C programming language, but one for which many programmers have a shallow understanding". Go suck a lemon, CSAPP. I'm so done with these little snide remarks.

[55847918] 2022-03-30-05-49: alignment restrictions: restrictions placed on allowable addresses for primitive data types in a computer system. typically must be a multiple of some value (2, 4, or 8).

[53bf8444] 2022-03-26-16-46: it's really funny to see the phrase "safety provided by the C type system" after reading about (rust) in the (rustbook).

[2c45d13d] 2022-03-26-16-40: apparently fixed-sized multi-dimensional arrays have some "clever optimizations". I'll take their word for it. Skimming. This aspect of looking at what the compiler does feels too black-boxy to me.

[c1792475] 2022-03-26-16-26: it's very cool to see recursion at this level. the stack smashing risk is quite clear here.

[ee0e4c3a] 2022-03-26-16-14: this example is very subtle. I gave it a once-over, but I think I'd like to glance at it again at some point. the code readability is poor on the RM, so I might type it up./

[f0fe7eea] 2022-03-26-16-07: what I'm curious about is, is this "x86-64 stack discipline" transferrable knowledge to other ISAs?

[c1aa89ad] 2022-03-26-16-07: the call_proc example apparently showcases "many aspects of the stack discipline", and they say "it is worth studying carefully".

[a2b56a73] 2022-03-25-13-12: finally done with 3.6. wow that felt like forever.

[c38e3679] 2022-03-25-13-07: jump table: data structure used to efficiently implement switches. It is an array of addresses to jump to, using the index as a lookup.

[34ad4a46] 2022-03-25-13-00: there's a "jump to middle" strategy. now there's a "guarded-do" strategy.

[c101f2d8] 2022-03-25-12-55: jump to middle isn't cleanly defined, but it seems like a term I should become familiar with.

[dd40a2e2] 2022-03-24-13-48: branch prediction logic: a mechanism in modern processors use to try and guess whether or not each jump instruction will be followed.

[5c8ea530] 2022-03-24-13-44: conditional move instructions are better suited for modern processors than conditional transfer of control (program follows different execution paths based on condition).

[86aba329] 2022-03-24-13-40: or, to segue into the current section from the previous section, you say something like "in the previous section..., but now in this section...". again, CSAPP fails to do this, and it makes the writing unclear.

[8b9b9fbb] 2022-03-24-13-38: the first sentence of every section should describe what the section is going to be about. this book does not do that, instead choosing to ramble about something else before getting to the point.

[c6022ffe] 2022-03-24-13-23: If I were to to do this better, I'd do what Knuth did in TAOCP and include comments on each line of assembly, or have some kind of arrows.

[62c25b95] 2022-03-24-13-22: okay, there is more than one '5' in this sample code, which made it hard to understand what they were saying. The '5' here refers to the memory address of the next line, which the jump address adds onto to get where it needs to go. The confusing bit was that a sentence ago, it was talking about the other jump instruction, which had a 5 somewhere.

[834cc0e2] 2022-03-24-10-52: so, there's this aside about what repz/retq does. it's this really bonkers convention in the AMD guidelines to writing compilers. It's a thing done to prevent the ret instruction from being the destination of a conditional jump instruction.

[1e43664e] 2022-03-23-14-03: PC relative: jumps that are encoded as the difference between the address of the target instruction and the address of the instruction immediately following the jump.

[0d256f7a] 2022-03-23-13-39: trying to grok what is meant by this "although all arithmetic and logical operations set the condition codes, the descriptions of the different SET instructions apply to the case where a comparison instruction has been executed, setting the condition codes according to the computation t = a - b."

[a7129a42] 2022-03-23-13-37: "wtb" on page 525 is still not making sense to me. It's like having "x" on both sides of the equation. A helpful math textbook would re-arrange things so that the variable is on one side.

[0a5e7a07] 2022-03-23-13-21: setl and setb do not in fact mean "set long word" or "set byte", as one would expect. yeeeesh.

[15b43c72] 2022-03-22-12-49: this oracle doc page on operands in x86 feels more helpful than this book https://docs.oracle.com/cd/E19120-01/open.solaris/817-5477/ennby/index.html.

[8f3d6ee6] 2022-03-22-12-46: in 3.4, it's unclear to me if the notation system used lines up with actual assembly or if it's just a convention in the textbook. Also, there seems to be 9 friggin ways to address memory?

[3d02e1f9] 2022-03-21-10-37: My reading approach isn't working. I think I need to be a bit more granular. I've been grouping things by chapter, when it should be subsections.

[37352992] 2022-03-15-14-58: try to better grok what t = a - wtb means (pdf page 525)

[7cdaa79e] 2022-03-15-14-53: for reference, I'm staring at "t = a - wtb" waiting for that to click. There was another definition a while back about arithmetic shifts with various word sizes that also took a moment for me to grok. Did I mention the math typesetting here sucks?

[730caa82] 2022-03-15-14-50: this book really seems to like define things in mathematical terms. It's precise, but not intuitive or clear for me.

[949b2997] 2022-03-15-14-48: as an example of synonym, setnle (set not less or equal) and setg (set greater).

[03e81f44] 2022-03-15-14-47: apparently there are "synonyms" in assembly, where two instructions refer to the same machine code. totally not confusing.

[a3df192b] 2022-03-15-14-45: the SET instructions sets a register or memory location to either 0 or 1.

[e1a528d8] 2022-03-15-14-43: common ways to access condition codes: setting a byte to 0 or 1, conditionally jumping, conditionally transferring data.

[514670c5] 2022-03-15-14-40: beginning to see the patterns in all these assembly instructions. the names seem to follow a predictable naming system. the hard part for me is remembering all the assembly notation on the right-hand side. interestingly enough, this is what has been tripping me up with learning MIXAL in TAOCP.

[86e1e386] 2022-03-15-14-37: the TEST instructions behave like AND. but similar to CMP, they only update the condition codes.

[9fe34fac] 2022-03-15-14-36: the CMP instructions set the the condition codes according to the differences between of their two operands. they behave like SUB, except they only update the condition codes, not the destinations.

[2f4d9e53] 2022-03-15-14-29: straight-line code: code where instructions follow one another in sequence.

[849d5915] 2022-03-15-14-29: onto 3.6: Control

[7479f4d8] 2022-03-15-14-28: oct word: what Intel refers to as a 16-byte quantity.

[f6f6958b] 2022-03-15-14-28: Skimming through the special arithmetic operations. The gist seems to be that these handle very large numbers like quads and octs (16 byte words).

[eb6314ce] 2022-03-14-07-23: moving through the overview of instructins too quickly. call this a first pass.

[d181ff55] 2022-03-14-07-15: look at leaq more closely and work on problems

[5358f167] 2022-03-14-07-14: leaq looks like black magic to me.

[d83a702d] 2022-03-14-07-13: load effective address: instruction that is a variant of the movq instruction, referred to as leaq. Commonly used to perform simple arithmetic.

[3d527d32] 2022-03-13-18-55: these abbreviations are getting a little silly. looking at you, movz.

[c81d894c] 2022-03-13-18-44: this assembly chapter is starting to go into territory I don't know much about. A part of me thinks it would be wise to do the problem sets, but another part of me wants to keep reading.

[38d1fd5a] 2022-03-13-18-43: effective address: in assembly, this is the computed address used to access a particular part of memory.

[59cd535d] 2022-03-13-18-42: Operands in x86-64 have three types: immediate (for constant values), register (contents of a register), and memory (access a memory location).

[89631152] 2022-03-13-18-38: Figure 3.2 (Integer Registers) looks like a very helpful chart (pdf pg 473). The page before it has a good terse summary of all the register names and abbreviations.

[6b9abe6e] 2022-03-13-18-35: in GCC assembly code, the data movement instruction has 4 variants: movb (move byte), movw (move word), movl (move double word), and movg (move quad word).

[303ed42f] 2022-03-09-16-30: parity flag: in x86-64, a 1-bit flag that is set after an arithmetic or logical operation. PF is 1 when the lower 8 bits have an even number of 1s, 0 otherwise. This information is not directly accessible in C and requires machien-dependent assembly code.

[4b50a996] 2022-03-09-16-25: Intel uses the word "word" to describe a 16-bit data type. This is for historical reasons. Origins as a 16-bit architecture.

[7de2489a] 2022-03-09-16-23: it seems NOPs can be used for zero-padding for memory alignement. go figure.

[8d9ff2bc] 2022-03-09-16-14: the "-S" flag will generate the assembly file only and go no futher. Probnably should be used with "-Og" for learning purposes.

[7ec9d86c] 2022-03-09-16-12: register file: contains 16 named locations storing 64-bit values, able to hold addresses or integer data.

[a803ab44] 2022-03-09-16-09: ISA: instruction set architecture.

[b9ec9347] 2022-03-09-16-08: the "-Og" flag in GCC is used to produce machine code with the overall structure of the C code.

[9af0424a] 2022-03-09-16-07: linking handles filling in the address of global values in the object files

[63477beb] 2022-03-09-16-05: reading about instruction sets makes me want to take a peak at RISC-V. I wonder what it's like to learn compared to x86. It doesn't have the historical baggage, so maybe more elegant?

[31abc336] 2022-03-09-16-01: hyperthreading: a method to run two programs simultaneously on a single processor. First introduced in the x86 line with the Pentium 4E processor in 2004.

[0c292a1e] 2022-03-09-15-18: AMD: Advanced Micro Devices.

[f19cf5a8] 2022-03-09-15-18: These intros are very lecture-like.

[03376f12] 2022-03-09-15-17: "Those who say 'I understand the general principles, I don't want to bother learning the details' are deluding themselves". Jesus. Little sore much?

[494ceb1c] 2022-03-09-15-11: starting chapter 3 now: machine level representations of programs

[be268d80] 2022-02-03-14-51: this chapter has concepts that are pretty familiar to me after my years of writing C. I'm skimming this more quickly until I find things that look unfamiliar. Maybe IEEE floating point gets introduced here? I could always stand to learn more about that.

[614aa6a5] 2022-02-03-14-50: the math equations in chapter 2 seem to be typeset incorrectly half of the time. And when they are typeset correctly, it looks terrible and confusing.

[f91aa55d] 2022-01-24-17-25: I like this, as it is relevant to my bitmap glyph live coding project: In isolation, a single bit is not very useful. When we group bits together and apply some interpretation that gives meaning to the different possible bit patterns, however, we can represent the elements of any finite set.

[cda51c27] 2022-01-24-17-23: okay. onto chapter 2 now.

[1c52374f] 2022-01-24-17-18: Finished chapter 1. Reviewing before moving on to chapter 2: representing and manipulating information.

[e24ecde1] 2022-01-24-17-18: getting my act together with the CSAPP page (which, I regrettably chose a very long name for. oh well).

[22c8adad] 2022-01-13-13-18: Amdahl's Law: states that when increasing the speed of one part of the system, the effect on the overall system performance depends on both how significant this part was and how much it sped up.

[126d5e28] 2022-01-13-13-14: file: a sequence of bytes. nothing more. nothing less.

[9c8481a4] 2022-01-13-13-12: virtual address space: a uniform view of memory that each process has.

[aca1e54c] 2022-01-13-13-12: virtual memory: an abstraction that provides each process with the illusion that it has exclusive use of the main memory.

[245122be] 2022-01-13-13-08: context: the state information that the process needs in order to run.

[eee78ebb] 2022-01-13-13-08: context switching: the mechanism that the OS uses to interleave instructions from multiple processes.

[a9b02aa8] 2022-01-13-13-06: concurrently: instructions of one process are interleaved with the instructions of another process.

[f0aa548b] 2022-01-13-13-06: process: the operating system's abstraction for a running program.

[f72270b8] 2022-01-12-17-45: operating system: the layer of software interposed between the application program and the hardware.

[2a73cc20] 2022-01-12-17-44: memory hierarchy: the way a computer system organizes memory by performance. moving from top to bottom, devices become slower, larger, and less costly per byte.

[832ac8d8] 2022-01-12-17-42: locality: the tendency for programs to access data and code in localized regions.

[81d52e86] 2022-01-12-17-40: L2 cache: cache that is larger than L1 cache, with hundreds of thousands to millions of bytes and is connected to the processor by a special bus. It can be 5x slower than the L1 cache.

[c9f8fa63] 2022-01-12-17-39: L1 cache: Cache that tens of thousands of bytes long, and can be accesed nearly as fast at the register file.

[a8f04ba6] 2022-01-12-17-38: cache memory is used to deal with the processor-memory gap

[0a1352f0] 2022-01-12-17-37: it is easier and cheaper to make processors run faster than it is to make memory run faster.

[28deebd5] 2022-01-12-17-37: processor-memory gap: the growing gap between processor and memory performance. as semiconductor technology progresses over the years, this gap continues to increase.

[1d34671d] 2022-01-12-17-34: cache memory: smaller and faster storage devices that serve as temporary staging areas for information that the processor is likely to need in the near future.

[fd5d3c13] 2022-01-11-08-58: DMA: direct memory access.

[e92ac973] 2022-01-11-08-57: ALU: arithmetic logic unit.

[4f88cb3a] 2022-01-11-08-56: program counter: a register that points at some machine language instruction in main memory.

[a85ba752] 2022-01-11-08-55: register: a word-sized storage device.

[35d7a30a] 2022-01-11-08-55: CPU: central processing unit. the engine that interprets (or executes) instructions stored in main memory.

[f99d4ba5] 2022-01-11-08-54: DRAM: dynamic random access memory.

[4ae19852] 2022-01-11-08-53: main memory: a temporary storage device that holds both a program and the data it manipulates while the processor is executing the program.

[60bc46aa] 2022-01-11-08-53: motherboard: systems main printed circuit board.

[36843765] 2022-01-11-08-52: controller/adapter: the means by which an IO device is connected to the IO bus. The distinction between a controller or an adapter is mainly one of packaging.

[e01b3a40] 2022-01-11-08-51: buses: a collection of electrical conduits that carry bytes of information back and forth between components.

[fdfda208] 2022-01-11-08-50: word: a fixed size chunk of bytes. Most machines today have word sizes of 4 bytes or 8 bytes.

[fd49f8bb] 2022-01-10-17-48: read chapters 1-6 of CSAPP

[ed4c1eaa] 2022-01-09-13-44: just skimmed TOC. there's a lot of ground covered here. (TYCS) recommends the first part at leaast (1-6). But the other parts seem useful. We shall see what the content is like though.

[f979bfa8] 2022-01-09-13-41: the label @CSAPP will be used for anything related to (computer_systems_programmers_perspective).