The Pentium Processor’s Innovative (and Complicated) Method of Multiplying by Three, Fast

[Ken Shirriff] has been sharing a really low-level look at Intel’s Pentium (1993) processor. The Pentium’s architecture was highly innovative in many ways, and one of [Ken]’s most recent discoveries is that it contains a complex circuit — containing around 9,000 transistors — whose sole purpose is to multiply specifically by three. Why does such an apparently simple operation require such a complex circuit? And why this particular operation, and not something else?
Let’s back up a little to put this all into context. One of the feathers in the Pentium’s cap was its Floating Point Unit (FPU) which was capable of much faster floating point operations than any of its predecessors. [Ken] dove into reverse-engineering the FPU earlier this year and a close-up look at the Pentium’s silicon die shows that the FPU occupies a significant chunk of it. Of the FPU, nearly half is dedicated to performing multiplications and a comparatively small but quite significant section of that is specifically for multiplying a number by three. [Ken] calls it the x3 circuit.

Why does the multiplier section of the FPU in the Pentium processor have such specialized (and complex) functionality for such an apparently simple operation? It comes down to how the Pentium multiplies numbers.
Multiplying two 64-bit numbers is done in base-8 (octal), which ultimately requires fewer operations than doing so in base-2 (binary). Instead of handling each bit separately (as in binary multiplication), three bits of the multiplier get handled at a time, requiring fewer shifts and additions overall. But the downside is that multiplying by three must be handled as a special case.
[Ken] gives an excellent explanation of exactly how all that works (which is also an explanation of the radix-8 Booth’s algorithm) but it boils down to this: there are numerous shortcuts for multiplying numbers (multiplying by two is the same as shifting left by 1 bit, for example) but multiplying by three is the only one that doesn’t have a tidy shortcut. In addition, because the result of multiplying by three is involved in numerous other shortcuts (x5 is really x8 minus x3 for example) it must also be done very quickly to avoid dragging down those other operations. Straightforward binary multiplication is too slow. Hence the reason for giving it so much dedicated attention.
[Ken] goes into considerable detail on how exactly this is done, and it involves carry lookaheads as a key element to saving time. He also points out that this specific piece of functionality used more transistors than an entire Z80 microprocessor. And if that is not a wild enough idea for you, then how about the fact that the Z80 has a new OS available?