i cache is all you need
- BOLT, FDO, PGO, Propeller, tell you specifically it’s frontend stalls that matter
- We really want to be careful of frontend stalls, since i-cache stalls prevent the CPU from doing useful work
- Frontend is responsible for fetching and decoding instructions
- Backend is responsible for executing instructions
- Code layout matters for the icache
- Two types of optimizations really matter
- Inlining (frees up the icache)
- Indirect call promotion
- CPU needs to fetch instructions from memory, and if the required instruction is not in the i-cache, the frontend must wait for it to be retrieved (hundreds of cycles)
- Instruction TLB
- An i-TLB miss forces a page walk? But how does this work
- frontend work can’t be parallelized: if you’re waiting on d-cache data, you can do something else, but a frontend stall means the i-cache is stuck and can’t feed uOps
- Backend can do out of order execution, by looking ahead and finding stuff that doesn’t require the missing data
- How?
- But how does L1 d-cache vs L1 i-cache work?
- Swift is icache heavy because of indirection?
- Harvard architecture
- I-tlb vs D-tlb
- How would you tell this in perf?
- L2 TLB is unified
- BOLT’ing and Propeller and Lightning BOLT
- Zero cost inlining is not actually zero cost (see memcmp vs memcpy)
Outline
- i-cache as bottleneck
- why frontend stalls are uniquely bad
- FDO was great but annoying workflow
- LBR with perf
- You can slice every part of the LBR buffalo: use the in and out basic blocks for CFG reconstruction, cycles for how long you’re spending there
- Decompilation fuzziness
- Inlining destroys the source CFG
- We need to reconstruct a bunch of stuff
- Bolt internally uses MC
- AutoFDO goes address -> DWARF -> source -> recomp
- BOLT goes address -> MC disassembly -> binary CFG -> rewrite
- Question about over optimization? Can we merge optimizations?
- Apply profile where it was collected with BOLT
- Linker?
- Profile data as universal currency?
Draft
We’ve been doing a series of papers on PGO, specifically modern PGO as it relates to binary layout optimization. I’ve been calling it the “Xinliang series”, mostly because Xinliang David Li has been the throughline for all of these papers. The papers are¹:
- Li, David Xinliang, Raksit Ashok, and Robert Hundt. “Lightweight Feedback-Directed Cross-Module Optimization.” Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, April 24, 2010, 53–61. https://doi.org/10.1145/1772954.1772964.
- Panchenko, Maksim, Rafael Auler, Bill Nell, and Guilherme Ottoni. “BOLT: A Practical Binary Optimizer for Data Centers and Beyond.” arXiv:1807.06735. Preprint, arXiv, October 12, 2018. https://doi.org/10.48550/arXiv.1807.06735.
- Chen, Dehao, David Xinliang Li, and Tipp Moseley. “AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications.” Proceedings of the 2016 International Symposium on Code Generation and Optimization, February 29, 2016, 12–23. https://doi.org/10.1145/2854038.2854044.
- Ayers, Grant, Nayana Prasad Nagendra, David I. August, et al. “AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers.” Proceedings of the 46th International Symposium on Computer Architecture, June 22, 2019, 462–73. https://doi.org/10.1145/3307650.3322234.
- Jamilan, Saba, Tanvir Ahmed Khan, Grant Ayers, Baris Kasikci, and Heiner Litz. “APT-GET: Profile-Guided Timely Software Prefetching.” Proceedings of the Seventeenth European Conference on Computer Systems, March 28, 2022, 747–64. https://doi.org/10.1145/3492321.3519583.
- Shen, Han, Krzysztof Pszeniczny, Rahman Lavaee, Snehasish Kumar, Sriraman Tallam, and Xinliang David Li. “Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications.” Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 27, 2023, 617–31. https://doi.org/10.1145/3575693.3575727.
The most applicable paper in this block for the working programmer is the BOLT paper, since you can literally run llvm-bolt on your binary and see if you get a performance increase. The Propeller paper is also quite good, but requires you to live in the hermetic build world, which might be too big of a lift.
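For the curious, a minimal BOLT run looks something like the following. This is a sketch based on the llvm-bolt README, not a definitive recipe: flag names drift between BOLT versions, `./app` is a stand-in for your binary, and the binary ideally should be linked with `--emit-relocs` so BOLT can rewrite it.

```shell
# Collect an LBR profile from the unmodified production binary
# (-j any,u records taken-branch stacks in user space)
perf record -e cycles:u -j any,u -o perf.data -- ./app

# Convert the perf profile into BOLT's fdata format
perf2bolt -p perf.data -o perf.fdata ./app

# Rewrite the binary with a profile-guided layout
llvm-bolt ./app -o app.bolt -data=perf.fdata \
  -reorder-blocks=ext-tsp -reorder-functions=hfsort \
  -split-functions -split-all-cold -dyno-stats
```

The `-reorder-blocks`/`-reorder-functions` passes are where the i-cache locality wins come from: hot basic blocks and hot functions get packed next to each other.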
This block of papers was pretty critical to helping me build an understanding of how much CPU stalls matter, and how poorly optimized our binaries (and data!) are for CPUs. It also really helped me build a good understanding of the differences between the i-cache and d-cache in practice.
Critically, we (developers) tend to think about algorithmic complexity or d-cache locality, and we even sometimes insert software prefetching for data in order to mitigate cache misses. However, if we think about the CPU, data fetching happens on the backend of the CPU, which has out-of-order execution². You can wait for one block of data and fetch another one most of the time, assuming you’re not doing a ton of pointer chasing and have sufficient independent work in the instruction window. However, instructions must be decoded on the frontend, which means that if you stall on fetching instructions, your CPU is definitely not doing anything.
In other words, the frontend is significantly more exposed to i-cache misses. And if the frontend stalls, the backend doesn’t get any instructions to process at all. We have fun stuff like __builtin_prefetch and PREFETCHT0 to give the hardware a head start on a d-cache miss, but none of this exists for the i-cache.
The datacenter tax paper³ critically tells us:
This makes sense, given how uniquely bad frontend stalls are.
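You can check whether your own binary is frontend-bound before reaching for any of these tools. A sketch using perf’s generic counters (exact event availability depends on your CPU and kernel, and `./app` is a stand-in):

```shell
# A high stalled-cycles-frontend count relative to cycles suggests
# the fetch/decode side of the CPU is the bottleneck
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./app

# i-cache and i-TLB specific counters, to attribute the frontend stalls
perf stat -e L1-icache-load-misses,iTLB-loads,iTLB-load-misses ./app
```

If the frontend numbers dominate, layout optimization is likely to help; if the backend numbers dominate, you’re in d-cache territory instead.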
The fix seems relatively simple but getting to it seems quite difficult. One way that we can improve i-cache locality is that we simply have all the hot instructions in memory sit next to the other hot instructions in memory, so that the cache line coverage of hot instructions is higher. But since the cache loads a chunk of memory, we need to make sure that chunk of memory is arranged properly. And for instructions, that means that we need to arrange our binary layout.
At compile time, however, we only have some basic heuristics to guide us on which instructions in the binary are going to be hot. Which then means that we need to go look into getting information at runtime about which instructions are hot.
The classical way of solving this has been FDO, where we instrument the binary with special functions to collect data on which sections of the binary are hot. However, this presents the same issue that ASAN has: we now have to produce a separate binary, feed it through the same deployment pipeline, and run everything there. The act of producing this special binary also means the profile data may not actually be reusable for the “real” binary later on. Reusability and quality of profiles remain an issue in this space.
Then AutoFDO came around and began utilizing the Last Branch Record (LBR) that started shipping with newer CPUs⁴, giving us a new way of checking which instructions were hot. The LBR captures information about which basic blocks are getting executed, specifically:
- the number of cycles spent in the basic block
- which basic block led into the current block
- which basic block the current block exits to
This lets us sample the actual production binary with negligible overhead, and doesn’t require a whole separate build step. perf record and we’re good to go!
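Concretely, it’s one flag. A sketch (`-b` turns on LBR branch-stack sampling; `./app` is hypothetical):

```shell
# Sample with the LBR enabled; each sample carries a stack of
# recently taken branches (from address, to address, cycle counts)
perf record -b -e cycles:u -- ./app

# Dump the raw branch stacks to inspect the from/to/cycles tuples
perf script -F ip,brstack | head
```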
The AsmDB and BOLT papers both talk about using this information to rebuild a control flow graph, since if you know which chunks of instructions are going in and out of the CPU, you can start counting which paths are particularly hot.
The APT-GET paper attempts to use every part of the buffalo: it leans on the cycle counter to show that, in theory, we could use it to inform an instruction prefetcher, similar to __builtin_prefetch.
1. My SWE tea notes on this block are at https://wiki.malloc.dog/posts/swe_tea/#pgo-block-1-2026
2. Matt Godbolt’s Skylake Deep Dive is a great intro to CPU frontends and backends.
3. Kanev, Svilen, Juan Pablo Darago, Kim Hazelwood, et al. “Profiling a Warehouse-Scale Computer.” Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 13, 2015, 158–69. https://doi.org/10.1145/2749469.2750392.
4. Or maybe it was the other way around: the desire for better FDO influenced CPU manufacturers to allow the LBR to be polled.